Assignment 3: Priority Queue and Huffman Coding

Priority Queue

Implement a priority queue template per the standard library definition of std::priority_queue. You may omit the second and third template arguments (Container and Compare); instead, just use std::vector and operator<.

The standard definition allows client code to specify its own underlying container type, but (quite reasonably) defaults to std::vector. The standard specifies the contract that the container class should provide. You are matching the contract.

You may use std::vector and the usual strings, I/O, and memory-management stuff (new/delete).

Note that the priority queue is conceptually a binary tree, but it is implemented using an array (or std::vector). Use the fact that parent(n) is n/2, left(n) is 2*n, and right(n) is 2*n+1. Start with the root in array element 1 (you may "waste" the 0-th element to simplify the arithmetic). A good way to do this is to create a PriorityQueue class with a private member class Node. Define the methods that access data as members of PriorityQueue that take a Node parameter.
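
For concreteness, here is a minimal sketch of that layout. The class shape and names are illustrative, not required; like std::priority_queue, operator< puts the largest element on top:

    #include <cstddef>
    #include <utility>
    #include <vector>

    template <typename T>
    class PriorityQueue {
        // A Node is just a wrapper around an array index; the tree is implicit.
        struct Node { std::size_t i; };

        std::vector<T> heap;                     // heap[0] is deliberately unused

        Node parent(Node n) const { return {n.i / 2}; }
        Node left(Node n)   const { return {2 * n.i}; }
        Node right(Node n)  const { return {2 * n.i + 1}; }
        bool valid(Node n)  const { return n.i < heap.size(); }

    public:
        PriorityQueue() : heap(1) {}   // waste element 0 (T must be default-constructible)

        bool empty() const { return heap.size() == 1; }
        std::size_t size() const { return heap.size() - 1; }
        const T& top() const { return heap[1]; }

        void push(const T& value) {
            heap.push_back(value);
            Node n{heap.size() - 1};
            while (n.i > 1 && heap[parent(n).i] < heap[n.i]) {  // sift up via operator<
                std::swap(heap[n.i], heap[parent(n).i]);
                n = parent(n);
            }
        }

        void pop() {
            heap[1] = heap.back();               // move the last element to the root...
            heap.pop_back();
            Node n{1};
            while (valid(left(n))) {             // ...and sift it down
                Node child = left(n);
                if (valid(right(n)) && heap[child.i] < heap[right(n).i])
                    child = right(n);
                if (!(heap[n.i] < heap[child.i]))
                    break;
                std::swap(heap[n.i], heap[child.i]);
                n = child;
            }
        }
    };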

Huffman Coding

Using your priority queue, write a program called huff that compresses and then expands a text file using the Huffman coding algorithm.

huff will take the name of the file as a command-line argument. Your Run script should take the same command-line argument and pass it to the program.

Huffman Tree

Construct the Huffman tree for the input file. Start by reading the file and computing a table of character frequencies. You will have to "rewind" the file to read it again for compression, so you'll probably want to check the manual page for istream (clear() and seekg() are the relevant members).
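
To make those steps concrete, here is one possible shape for the first pass and for the tree construction. HuffNode, NodePtr, and the function names are hypothetical; PriorityQueue is the template from the first part of the assignment (sketched above); and the code assumes a non-empty input file:

    #include <array>
    #include <fstream>

    // Hypothetical tree node; leaves have two null children.
    struct HuffNode {
        unsigned char symbol;                  // meaningful only at leaves
        long freq;
        HuffNode* zero;                        // the '0' branch
        HuffNode* one;                         // the '1' branch
    };

    // First pass: count how often each byte value occurs, then rewind.
    std::array<long, 256> countFrequencies(std::ifstream& in) {
        std::array<long, 256> freq{};          // zero-initialized
        char c;
        while (in.get(c))
            ++freq[static_cast<unsigned char>(c)];
        in.clear();                            // clear the EOF bit first...
        in.seekg(0, std::ios::beg);            // ...or the seekg() will fail
        return freq;
    }

    // Wrapper so the *lowest* frequency ends up on top of a max-heap
    // ordered by operator<.
    struct NodePtr {
        HuffNode* p;
        bool operator<(const NodePtr& o) const { return p->freq > o.p->freq; }
    };

    // Huffman construction: repeatedly merge the two rarest nodes.
    HuffNode* buildTree(const std::array<long, 256>& freq) {
        PriorityQueue<NodePtr> pq;
        for (int c = 0; c < 256; ++c)
            if (freq[c] > 0)
                pq.push({new HuffNode{static_cast<unsigned char>(c), freq[c],
                                      nullptr, nullptr}});
        while (pq.size() > 1) {
            HuffNode* a = pq.top().p; pq.pop();
            HuffNode* b = pq.top().p; pq.pop();
            pq.push({new HuffNode{0, a->freq + b->freq, a, b}});
        }
        return pq.top().p;                     // root of the Huffman tree
    }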

Include routines that dump the tree to stderr to aid your debugging. Your choice how, but one way to do it is a pre-order traversal that passes the depth down and indents each node k*depth spaces (a minimal sketch appears after the pstree output below). For something slightly more elaborate (which is overkill for this assignment), here is partial output from the Linux pstree command:

init-+-NetworkManager-+-dhclientp
     |                |-dnsmasq
     |                `-2*[{NetworkManager}]
     |-accounts-daemon---{accounts-daemon}
     |-acpid
     |-at-spi-bus-laun---2*[{at-spi-bus-laun}]
     |-atd
     |-avahi-daemon---avahi-daemon
     |-bamfdaemon---2*[{bamfdaemon}]
     |-bluetoothd
     |-chromium-browse-+-2*[chromium-browse]
     |                 |-chromium-browse---chromium-browse---chromium-browse-+-60*[chromium-browse---3*[{chromium-browse}]]
     |                 |                                                     `-3*[chromium-browse---4*[{chromium-browse}]]
     |                 |-chromium-browse---339*[{chromium-browse}]
     |                 `-30*[{chromium-browse}]
     |-colord---2*[{colord}]
     |-console-kit-dae---64*[{console-kit-dae}]
     |-cron
     |-cupsd

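Here is such a minimal dump, reusing the hypothetical HuffNode from the earlier sketch:

    #include <iostream>
    #include <string>

    // Pre-order dump to stderr; each level indents two more spaces.
    void dumpTree(const HuffNode* node, int depth = 0) {
        if (node == nullptr) return;
        std::cerr << std::string(2 * depth, ' ');
        if (node->zero == nullptr)             // a leaf
            std::cerr << "leaf 0x" << std::hex << int(node->symbol)
                      << std::dec << " freq=" << node->freq << '\n';
        else
            std::cerr << "internal freq=" << node->freq << '\n';
        dumpTree(node->zero, depth + 1);
        dumpTree(node->one, depth + 1);
    }
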
Print the frequency count and bit-coding to standard output ordered by increasing character frequency. Each line of output should have the character's hex value (and the character itself, if printable), its frequency count, its bit-coding, and the code length in bits.

For example:

   0x4c (L):     292  [011000010] (9)
   0x2d (-):     352  [000110011] (9)
   0x52 (R):     375  [000110000] (9)
   0x48 (H):     388  [11011001] (8)
   0x53 (S):     442  [11001100] (8)
   0x6b (k):     479  [01101011] (8)

You may use the STL sort and reverse algorithms to order the output by frequency.
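
One way to derive the bit-codings and order the table, again using the hypothetical HuffNode; Entry and the function names are illustrative:

    #include <algorithm>
    #include <array>
    #include <string>
    #include <vector>

    // Walk the tree: going left appends '0', going right appends '1'.
    // (A file with a single distinct character makes the root a leaf with
    // an empty code, which would need special-casing.)
    void buildCodes(const HuffNode* node, const std::string& prefix,
                    std::array<std::string, 256>& codes) {
        if (node->zero == nullptr) {           // leaf: record its code
            codes[node->symbol] = prefix;
            return;
        }
        buildCodes(node->zero, prefix + '0', codes);
        buildCodes(node->one,  prefix + '1', codes);
    }

    // One printable record per symbol that occurs in the input.
    struct Entry {
        unsigned char symbol;
        long freq;
        std::string code;
    };

    // Order the table by increasing frequency before printing it.
    void sortByFrequency(std::vector<Entry>& entries) {
        std::sort(entries.begin(), entries.end(),
                  [](const Entry& a, const Entry& b) { return a.freq < b.freq; });
    }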

File Compression

Compress the file and write to an output file with the ".huff" extension (e.g. for foo.txt, write the compressed data to foo.txt.huff).
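
A sketch of that second pass, assuming the code table built above and a bit-buffering helper like the BitWriter sketched under Bit-At-A-Time Operations below:

    #include <array>
    #include <fstream>
    #include <string>

    // Second pass: for each input byte, emit its code one bit at a time.
    void compress(std::ifstream& in, BitWriter& bits,
                  const std::array<std::string, 256>& codes) {
        char c;
        while (in.get(c))
            for (char bit : codes[static_cast<unsigned char>(c)])
                bits.writeBit(bit == '1');
        bits.flush();                          // pad the final partial byte
    }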

The Run script should use the Linux utility wc (word count; wc -c reports byte counts) to show the difference in file size.

If you're feeling particularly ambitious, compute the space savings (1 - compressed size / original size).

File expansion

Reread the compressed file and expand it into a file with a ".restore" extension (e.g. expand foo.txt.huff into foo.txt.restore).
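
The decoding loop is the mirror image: walk the tree one bit at a time, emit a character at each leaf, and stop after the stored character count. HuffNode and BitReader are the hypothetical helpers sketched elsewhere in this handout:

    #include <fstream>

    // Decode exactly 'count' symbols from the compressed stream.
    void expand(BitReader& bits, std::ofstream& out,
                const HuffNode* root, long count) {
        for (long i = 0; i < count; ++i) {
            const HuffNode* n = root;
            while (n->zero != nullptr) {       // descend until we hit a leaf
                int bit = bits.readBit();
                if (bit < 0) return;           // ran out of input unexpectedly
                n = bit ? n->one : n->zero;
            }
            out.put(static_cast<char>(n->symbol));
        }
    }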

Use diff or cmp to verify that the restored file is identical to the original text.

For test data, download text files from Project Gutenberg.

Bit-At-A-Time Operations

Since a byte (8 bits) is the smallest addressable piece of memory, you will have to create a utility that buffers bits both for writing and for reading them back. Recall your bitwise OR (|) and shift (<<, >>) operations.

Since the bit-length of the characters is not constant, your encoded text is unlikely to end on an 8-bit boundary. This means you will probably have a few bits left over. That's okay: keep the character count so you know when to stop decoding.
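
A minimal sketch of such a utility, assuming bits are packed most-significant-first (the names are illustrative):

    #include <istream>
    #include <ostream>

    // Accumulates single bits and writes each byte once it is full.
    class BitWriter {
        std::ostream& out;
        unsigned char buffer = 0;
        int count = 0;                         // bits currently buffered (0..7)
    public:
        explicit BitWriter(std::ostream& o) : out(o) {}

        void writeBit(int bit) {
            buffer = static_cast<unsigned char>((buffer << 1) | (bit & 1));
            if (++count == 8) {
                out.put(static_cast<char>(buffer));
                buffer = 0;
                count = 0;
            }
        }

        void flush() {                         // zero-pad the final partial byte
            while (count != 0)
                writeBit(0);
        }
    };

    // The mirror image: hands back one bit at a time from the compressed file.
    class BitReader {
        std::istream& in;
        unsigned char buffer = 0;
        int count = 0;                         // unread bits left in the buffer
    public:
        explicit BitReader(std::istream& i) : in(i) {}

        int readBit() {                        // returns 0 or 1, or -1 at end of input
            if (count == 0) {
                char c;
                if (!in.get(c)) return -1;
                buffer = static_cast<unsigned char>(c);
                count = 8;
            }
            --count;
            return (buffer >> count) & 1;
        }
    };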

Notes

  1. In order to have a true compression/decompression utility, the compressed file would have to contain a representation of the tree. It would be simplest to just store the frequency counts and recover the tree (a sketch of such a header follows these notes). Storing this adds some small amount of overhead.
  2. The algorithm requires two passes over the input (the first pass computes the character frequencies and the second pass does the compression). Compressing data from a pipeline to standard input means the entire input has to be read into memory or a temporary file must be created. Neither solution works very well for a large amount of streaming data (what, you want to use Huffman coding to compress that YouTube video? In real time???).

    One solution is to use a moving window of data. Each data block carries its own frequency table.

    Another way to avoid two full passes over the data file is to use statistical estimates for the frequency counts.

  3. The bit vector for a coded symbol might be arbitrarily long, greater than 32 or even 64 bits. Production code would have to deal with this case.
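
Regarding note 1: one simple self-describing layout is to lead the .huff file with all 256 counts at a fixed width, then rebuild the tree from them on the decompression side. The sketch below writes host byte order, which is fine for a file read back on the same machine:

    #include <array>
    #include <cstdint>
    #include <fstream>

    void writeHeader(std::ofstream& out, const std::array<long, 256>& freq) {
        for (long f : freq) {
            std::uint32_t v = static_cast<std::uint32_t>(f);
            out.write(reinterpret_cast<const char*>(&v), sizeof v);
        }
    }

    std::array<long, 256> readHeader(std::ifstream& in) {
        std::array<long, 256> freq{};
        for (long& f : freq) {
            std::uint32_t v = 0;
            in.read(reinterpret_cast<char*>(&v), sizeof v);
            f = v;
        }
        return freq;                           // then rebuild the tree from these
    }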

Questions to Ponder

  1. How likely would it be to require more than 64 bits for a Huffman code? How large would the input file have to be?
  2. Compare your compressed file to the one produced by gzip, which uses Huffman coding combined with the Lempel-Ziv algorithm.