Assignment 4: Huffman Coding

Write a pair of C++ programs, huff and puff which will implement the Huffman data compression algorithm. The two programs will need to share some code, so write your BUILD script accordingly.

Here is a compiled copy of the decoder that works on the UW-320 lab machines to use for your testing. For full marks, huff-encoded files must be decodable by the supplied program, as well as by your implementation of puff.

Compressed File Format

256 variable-length records consisting of a single unsigned byte bitcount for coded symbol i (the longest code will not exceed 255 bits!) followed by ceil(bitcount/8) bytes (rounded up!) of bits for the encoding of the symbol. Set any excess bits to zero. Bits will be "big endian".
1 unsigned 32-bit little endian integer N, the number of coded symbols in the file.
N coded symbols (set any extra bits in the last byte to zero).

Note that for English text (such as Common Sense), most of the coded symbols will have length 0. That's okay: the wasted space is small compared with the overall size of the file being compressed. Your program should work equally well for text and binary data (which have different distributions).

Huff

Although there are solutions to the problem, huff cannot simply read from standard input because it needs to make two passes over the input data. Your program will take a comand-line argument ("argv"), the name of the file to compress, and will write the compressed data to a file with the added extension .huff (you may use the standard library string functions for this as needed).

If there is a problem opening or reading the file, huff should exit with a non-zero status. Use the C standard library macros EXIT_SUCCESS and EXIT_FAILURE.

Frequency Counter

The first step is to count the number of occurrences of each character in the text. You may assume the input file will not exceed 2 GB, so you can use an int (signed or unsigned)¹ to hold the frequency of each byte.

There are UCHAR_MAX + 1 distinct unsigned character values (0..255). You may use an array to hold the counts.

While counting the frequency of each symbol, get the total count of symbols in the file which you will need to write to the output file.

Your dump routine should print out the hexadecimal value of the byte and, if the symbol is printable, the character. Consider using the isprint macro from the C standard library.

Consider a crude horizontal bar chart: print, say 25 * N_i/N splat marks to eyeball the relative frequency. Expect to see more occurrences of 'e' than 'z' in an English text.

Huffman Tree

From the frequency table, build the Huffman tree. Remember the priority queue from Assignment 3? You will be using one kind of tree to build an entirely different kind of tree!

Consider how the binary heap dump routine should interact with the Huffman tree dump routine to yield the most effective and convenient output format for debugging.

Coded Symbols

From the Huffman tree, build the table of coded symbols. Since the table is of fixed size, you may use an array as used for the frequency counter.

A coded symbol is determined by its length (number of bits) and the bit pattern. You may assume that 64 bits is an upper bound on the length of any coded symbol, so you can store the length in a byte field and the bit pattern in an unsigned long long.

Your dump function should print out the bitstring is a way that is convenient to analyze. Hexadecimal is compact and, with practice, can be eyeballed directly as its corresponding bit sequence. For the less-practiced, consider a function that prints in binary format (ones and zeros or splats and dots, whatever makes it easier for your eye to perceive). You can do this by calculation or by table lookup ².

Emit the coded-symbol table to the compressed output file, followed by the file length value, as described in the output format.

Use the code table to encode the input file, writing the coded data to the .huff file.

BitStream

For output, you will want a class that emits bits and bytes to a file. Standard library file objects only go down to the granularity of bytes so you will have to collect (buffer) the bits and write out only complete bytes. Bits within the byte should be "big endian".

You'll need a flush function that will write out the currently-being-assembled byte (if there is one) zero-filling the remaining bits. You'll need your destructor or close method to flush the final byte before closing.

The file size will be a sequence of 4 bytes. You can add a write-4-bytes method to your bitstream writer or have the caller disassemble it into individual bytes ³.

Similarly, for decoding the file, you will need a bitstream reader to decode the file, with read-bit, read-byte, and flush operations.

Since you are both the library designer and library user, you must decide whether to call the functions put and get, put_bit and get_bit, write_bit and read_bit, or putBit and getBit⁴.

Puff

puff should read the .huff file and replace the extension with .puff (so as not to accidentally overwrite the original file).

If the filename argument does not end in .huff or the file is not readable, print an error message and terminate with status EXIT_FAILURE

You will reconstruct the Huffman tree from the table of coded symbols in the .huff file and verify that it is well-formed before using it to decode the rest of the contents of the file.

Compression Ratio

Download a reasonably-sized text file from Project Gutenburg and run your encoding on the file. Report the original size, compressed size, and space savings in your README file. Don't forget to include a link to your chosen text.

Verification

For validation, compare your expanded .puff file with your original text. Your build or run script can automate the verification.

1 Which makes more sense?
2 Or, write a program that generates the table.
3 This is the fundamental question of library design.
4 Although putBit and getBit are too Java-ish for your instructor's tastes and might make his eyeballs bleed, with all the attendant consequences.