Assignment 2: Huffman Coding

Huffman coding is an optimal prefix encoding of the symbols (characters) of a text, such that more-frequently-occuring characters are given shorter codings (i.e. fewer bits). This leads to an overall shorter encoding of the text from which the original text can be recovered (lossless compression).

There exist other practical lossy and lossless compression algorithms, but Huffman encoding is a component of production data compression utilities such as gzip.

For this assignment, you will implement two programs, huff and puff:

huff will take a single command-line argument filename and will produce an output file filename.huff containg a Huffman encoding of the original file.
puff will take a single command-line argument filename and produce an output file filename.puff containg the Huffman decoding of filename.

If everything is done correctly, you should be able to show the space savings between filename and filename.huff. Furthemore, you will be able to show that filename and filename.huff.puff are byte-for-byte identical.

For the purposes of the assignment, we will make a few simplifying assumptions that you would not be able to make if you were writing a production-grade data compression program:

no coded symbol will require more than 32 bits (this lets us store the encoding in a 32-bit unsigned int).
files will be no larger than 2 GB (this lets us store the total file size in a 32-bit int)
no individual symbol appears in the text more than 16,000 times (this lets us store frequency values in 16-bit shorts).
we are reading from an ordinary file and can reread the file from the beginning (as opposed to reading from a stream)

Solution

Character Frequency Count

The symbols (characters) are 8-bit bytes. Since there are only 256 distinct byte patterns, represent the freqency table as an array:


      #define SYMBOLS 256
unsigned short symbol_frequency[SYMBOLS];

You may encapsulate this data into a class if you feel more comfortable.

Remember to use unsigned char values to index the table.

Priority Queue

Implement a priority queue along the lines of the standard library definition definition: std::priority_queue. You may omit the second and third template arguments (Container and Compare). You may use std::vector for the underlying storage. You will not be marked on compatability with the standard library.

Note that the priority queue is conceptually a binary tree, but it is implemented using an array (or std::vector). Use the fact that parent(n) is n/2, left(n) is 2*n and right(n) is 2*n+1.

Start with the root in array element 1. You may "waste" the 0-th element to simplify your calculations; assume queue entries have a default constructor.

A good way to do this is to create a PriorityQueue class with a private member class Node. Define the methods that access data as members of PriorityQueue that take a Node parameter.

You will need your priority queue again for Assignment 3. Consider making it a template.

Huffman Tree

Using your priority queue, construct the Huffman tree for the input file.

Include routines that dump the tree to stderr to aid your debugging. Your choice how, but one way to do it is a pre-order traversal that passes depth down and indents each node k*depth spaces. For something slightly more elaborate (which is overkill for this assignment), here is partial output from the linux pstree command (see the tree structure?):


	init-+-NetworkManager-+-dhclientp
     |                |-dnsmasq
     |                `-2*[{NetworkManager}]
     |-accounts-daemon---{accounts-daemon}
     |-acpid
     |-at-spi-bus-laun---2*[{at-spi-bus-laun}]
     |-atd
     |-avahi-daemon---avahi-daemon
     |-bamfdaemon---2*[{bamfdaemon}]
     |-bluetoothd
     |-chromium-browse-+-2*[chromium-browse]
     |                 |-chromium-browse---chromium-browse---chromium-browse-+-60*[chromium-browse---3*[{chromium-browse}]]
     |                 |                                                     `-3*[chromium-browse---4*[{chromium-browse}]]
     |                 |-chromium-browse---339*[{chromium-browse}]
     |                 `-30*[{chromium-browse}]
     |-colord---2*[{colord}]
     |-console-kit-dae---64*[{console-kit-dae}]
     |-cron
     |-cupsd

Encoding

Traverse the Huffman tree to generate an encoding for each character. Since we assume the encoding will take at most 32 bits, we can use a table of unsigned ints to hold the encoding. We will also need to know how many bits are in the encoding.


      unsigned bits[SYMBOLS]
unsigned encoding[SYMBOLS];

Alternatively, you can declare a class to hold the encoding information


      
class Encoding {
  public:
    Encoding();
    // Additional public methods...
  private:
    struct EncodedChar {
        unsigned bits_;
        unsigned coded_value_;
    };
    EncodedChar data_[SYMBOLS];
};

Bit-At-A-Time Operations

Since a byte (8 bits) is the smallest addressable piece of memory, you will have to create a utility that will buffer the bits both for writing and for reading back. Recall your bitwise or and shift operations.

An easy way to manage the tree traversal is to create a class, say class Encode with methods push_bit() which you call with arg 0 on a left branch and 1 on a right branch and get_encoding() which you call at the leaf. The object will manage bit buffer.

For partial credit, print the bit-coding to standard error. Each line of output should have

the hex value of the character (its original ASCII coding) with the usual C-style base prefix ("0x")
the actual character in parentheses if it is printable (or space if it's unprintable), followed by a colon (see: <cctype>).
coded bit string in square brackets
bit count

The output format makes for easier post-processing by a script. Two Huffman trees for a given text may be identical except for left and right branches being swapped, but a script can verify that each character has the correct number of bits and the encodings are unique.

For example:

    0xa ( ): [0] 1
   0x20 ( ): [10010] 5
   0x23 (#): [1010110] 7
   0x28 ((): [100011001] 9
   0x29 ()): [100010010] 9
   0x2c (,): [100011000] 9
   0x2d (-): [11110111] 8
   0x2e (.): [100000101] 9
   0x2f (/): [101101] 6
   0x30 (0): [100010001] 9
   0x31 (1): [100010000] 9
   0x32 (2): [10001011] 8
   0x3a (:): [100010100] 9
   0x3b (;): [10000111] 8
   0x40 (@): [11110110] 8
   0x41 (A): [100001010] 9
   0x42 (B): [11101] 5
   0x43 (C): [1111111] 7
   0x44 (D): [100000100] 9
   0x46 (F): [10000001] 8
   0x48 (H): [10000000] 8
   0x49 (I): [10000011] 8
   0x4c (L): [1111001] 7
   0x4d (M): [11110000] 8
   0x52 (R): [100010011] 9
   0x53 (S): [11110001] 8
   0x54 (T): [10000100] 8
   0x55 (U): [10000110] 8
   0x57 (W): [100010101] 9
   0x5f (_): [101100] 6
   0x61 (a): [1011101] 7
   0x63 (c): [1000111] 7
   0x64 (d): [111000] 6
   0x65 (e): [10011] 5
   0x66 (f): [101111] 6
   0x67 (g): [100001011] 9
   0x68 (h): [1111110] 7
   0x69 (i): [11001] 5
   0x6c (l): [111110] 6
   0x6d (m): [101010] 6
   0x6e (n): [101000] 6
   0x6f (o): [101001] 6
   0x70 (p): [1010111] 7
   0x72 (r): [11011] 5
   0x73 (s): [11000] 5
   0x74 (t): [11010] 5
   0x75 (u): [1011100] 7
   0x79 (y): [111001] 6
   0x7b ({): [1111010] 7
   0x7d (}): [10001101] 8

File Compression

You will have to "rewind" the input file to read it again for compression, so you'll probably want check the manual page for istream.

Emit the frequency table so puff can reconstruct the Huffman tree.

Since the bit-length of the characters is not constant, your encoded text is unlikely end on an 8-bit boundary (what are the chances?). This means you will probably have a few bits left over. That's okay: keep the character count to know when to stop decoding.

File Format

frequency table
32-bit for number of (uncompressed) symbols in file
compressed data

File expansion

Reread the compressed file and expand it into a file with ".puff" extension (e.g. expand foo.txt.huff into foo.txt.huff.puff).

What to Hand In

Submit your source code and test data along with a Bash script named Run which will compile your programs and run it with your test data. The script should compress your test data, compute the space savings, expand the compressed file, and verify decompressed file matches the original text.

The script is required but need not be your own work. Do identify its provenance and do not claim someone else's work as your own.

Optionally submit a text file named README with any additional notes.

Additional Notes

Intermediate Milestones

There are easier ways and harder ways to do this. It may seem like extra work to write the scaffolding and unit tests but really, this is the easiest way to build a complex program. Each of the following should work as a stand-alone program:

character frequency counts
priority queue
Huffman tree (built from frequency table)
encoding table
encoded file
reloading the frequency table
file decode

100% of the grade is based on huff and puff working as an end-to-end solution, but partial marks will be awarded for working milestones.

Standard Library

You may use standard (stream) I/O, std::strings (including <cstring> and <cctype>), std::vector and memory management (new and delete).

You may use the STL sort and/or reverse algorithms in your debug code to order output by frequency.

Test Data

You can obtain large samples of text from Project Gutenberg (http://www.gutenberg.org). Consider using a shell script to remove punctuation and long words from the text.

Avoid generating random data because it will have high entropy and the Huffman algorithm won't yield good results.