Assignment 2: Data Compression

Introduction

For this assignment, you will write a pair of C++ programs, huff and puff which will implement the Huffman data compression algorithm.

The project has a lot of moving parts, but you should find each piece to be manageable. Start early and work steadily. Cramming this on the night before it's due will be a losing strategy.

Bitstreams

Since the smallest addressable unit in C++ is a byte (8 bits), you will need to create a utility for reading and writing streams of bits. Just as standard iostreams buffer data into blocks for efficency, you will buffer bits into bytes.

The writer will have methods int putBit(unsigned int bit) and int putByte(unsigned char byte).

putBit should collect (buffer) the bits it receives until it gets 8 of them, then write out the byte.

You will also need to implement a flush() function that writes any buffered bits (adding additional zero bits to fill out the current byte) to align on the next byte boundary.

putByte should flush to the next byte boundary before writing the next byte value. Your destructor should call flush() before closing the file.

Similarly, the bitstream reader should have methods getBit() which buffers the bits as needed and getByte(). which advances to the next byte boundary and reads the next byte.

While this may seem trivial, you have some decisions to make. Should you write one class or two. Should the stream read from standard I/O, an open iostream object, or a filename to be opened by your class? Should your bits be big endian or little endian? What other convenience functions will you need? Is there common code that may be shared between the reader and the writer?

Hint: one way to test your bitstreams is to read a small text file bit at a time, write it back, and verify that the original and copy are identical.

Frequency Table

The first step in the Huffman algorithm is to count the number of occurrences of each character in the input. You may assume the input file will not exceed 2 GB, so you can use an unsigned int to hold the frequency of each byte.

Write a program that reads a text file and produces a table of characters and their frequencies (number of occurrences in the input data). Include a dump routine that prints the table to standard error with character number in hex, the character if it is printable or a space, and the count.

You may also want your dump routine to do a crude horizontal bar chart to capture the relative frequency of each character. Let N_max be the frequency of the most common symbol (probably 'e' in an English text). Print, say, 50 * N_i / N_max splat marks for character i.

Again, while that may sound trivial, you need to make some decisions. Should you use a class or a function? Should the reader read from standard input, an open file, or take a string and open the file? How should the table be represented? Should the output be ordered by character value, frequency, or unordered? How are you going to ensure that the program is correct?

Hint: there are UCHAR_MAX + 1 distinct unsigned character values (0..255). That is a fixed size known at compile time.

Hint: consider using the isprint macro from the standard C library.

Priority Queue

The next step in the Huffman algorithm is to implement a priority queue that you will use to build the Huffman tree.

Write a template class Heap that implements a binary heap. Use std::vector to hold the data.

The class should take two type parameters, Priority to represent the priority value and Data to represent the data. Priority should implement the less-than operator for comparisons (higher priority is a lower number).

Heap should provide methods push(), pop(), and is_empty(). push() will take two parameters, priority and data. pop should extract and return the "minimum" value. is_empty() should return a Boolean value, with the obvious semantics.

Test your algorithm by implementing the HeapSort algorithm.

Huffman Tree

From the frequency table and using your priority queue, build the Huffman tree. You will be using one kind of tree (the heap) to build an entirely different kind of tree! Do not confuse the two!

As with all data structures, you should provide a dump routine that shows the structure of the tree, and for each node the count and the character (if the node is a leaf). Consider how the binary heap dump routine should interact with the Huffman tree dump routine to yield the most effective and convenient output format for debugging.

Coded Symbols

From the Huffman tree, build the table of coded symbols. Since the table is of fixed size, use an array.

A coded symbol is determined by its length (number of bits) and the bit pattern. You may assume that 64 bits is an upper bound on the length of any coded symbol, so you can store the length in a byte field and the bit pattern in an unsigned long long.

Your dump function should print out the bitstring is a way that is convenient to analyze. Hexadecimal is compact and, with practice, can be eyeballed directly as its corresponding bit sequence. For the less-practiced, consider a function that prints in binary format (ones and zeros or splats and dots, whatever makes it easier for your eye to perceive). You can do this by calculation or by table lookup.

Emit the coded-symbol table to the compressed output file, followed by the file length value, as described in the output format.

Use the code table to encode the input file, writing the coded data to the .huff file.

Compressed File Format

256 variable-length records consisting of a single unsigned byte bitcount for the bit length of coded symbol i followed by ceil(bitcount/8) bytes (rounded up!) of bits for the encoding of the symbol. Set any excess bits to zero. Bits will be "big endian" within the byte.
1 unsigned 32-bit little endian integer N, the number of symbols in the file.
N coded symbols (set any extra bits in the last byte to zero).

Note that for English text (such as Common Sense), most of the coded symbols will have length 0 (not present in the file). That's okay: the wasted space is small compared with the overall size of the file being compressed. Your program should work equally well for text and binary data (which have different distributions).

Huff

Although there are solutions to the problem, huff cannot simply read from standard input because it needs to make two passes over the input data. Your program will take a comand-line argument ("argv"), the name of the file to compress, and will write the compressed data to a file with the added extension .huff (you may use the standard library string functions for this as needed).

If there is a problem opening or reading the file, huff should exit with a non-zero status. Use the C standard library macros EXIT_SUCCESS and EXIT_FAILURE.

Puff

puff should read the .huff file and replace the extension with .puff (so as not to accidentally overwrite the original file).

If the filename argument does not end in .huff or the file is not readable, display an error message and terminate with status EXIT_FAILURE

Reconstruct the Huffman tree from the table of coded symbols in the .huff file and verify that it is well-formed before using it to decode the rest of the contents of the file.

Validation

To validate your programs, download a reasonably-sized text file from Project Gutenburg and compress it. Verify that you get the expected compression and that you get the original data back when expanded. The diff utility.

Report the original size, compressed size, and space savings in your README file. Don't forget to include a link to your chosen text.

Here is a compiled copy of the decoder that works on the UW-320 lab machines to use for your testing. For full marks, your huff-encoded files must be decodable by the supplied program, as well as by your implementation of puff.