Huffman coding is an optimal prefix encoding of the symbols (characters) of a text, such that more-frequently-occurring characters are given shorter codes (i.e. fewer bits). This leads to an overall shorter encoding of the text, from which the original text can be recovered (lossless compression).
There exist other practical lossy and lossless compression algorithms, but Huffman encoding is a component of production data compression utilities such as gzip.
For this assignment, you will implement two programs, huff and puff:
huff will take a single command-line argument filename and will produce an output file filename.huff containing a Huffman encoding of the original file.
puff will take a single command-line argument filename and produce an output file filename.puff containing the Huffman decoding of filename.
If everything is done correctly, you should be able to show the space savings between filename and filename.huff. Furthermore, you will be able to show that filename and filename.huff.puff are byte-for-byte identical.
For the purposes of the assignment, we will make a few simplifying assumptions that you would not be able to make if you were writing a production-grade data compression program:
The symbols (characters) are 8-bit bytes. Since there are only 256 distinct byte patterns, represent the frequency table as an array:
#define SYMBOLS 256
unsigned short symbol_frequency[SYMBOLS];
You may encapsulate this data into a class if you feel more comfortable.
Remember to use unsigned char values to index the table.
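For instance, a first pass over the file might fill that table along these lines (a minimal sketch; the function name and the use of a std::istream parameter are assumptions, not requirements):

#include <istream>

// First pass: count how often each byte value occurs.
// Uses the SYMBOLS constant and symbol_frequency table defined above.
void count_frequencies(std::istream& in) {
    char c;
    while (in.get(c))
        // Cast through unsigned char so byte values >= 0x80 index correctly.
        ++symbol_frequency[static_cast<unsigned char>(c)];
}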
Implement a priority queue along the lines of the standard library definition: std::priority_queue. You may omit the second and third template arguments (Container and Compare). You may use std::vector for the underlying storage.
You will not be marked on compatibility with the standard library.
Note that the priority queue is conceptually a binary tree, but it is implemented using an array (or std::vector). Use the fact that parent(n) is n/2, left(n) is 2*n and right(n) is 2*n+1.
Start with the root in array element 1. You may "waste" the 0-th element to simplify your calculations; assume queue entries have a default constructor.
A good way to do this is to create a PriorityQueue class with a private member class Node. Define the methods that access data as members of PriorityQueue that take a Node parameter.
You will need your priority queue again for Assignment 3. Consider making it a template.
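One possible shape for such a class (a minimal sketch, not a required interface; shown as a min-heap ordered by operator< so the two least-frequent entries come out first):

#include <cstddef>
#include <utility>
#include <vector>

template <typename T>
class PriorityQueue {
public:
    PriorityQueue() : heap_(1) {}                 // element 0 is deliberately "wasted"

    bool empty() const { return heap_.size() == 1; }
    std::size_t size() const { return heap_.size() - 1; }

    void push(const T& value) {
        heap_.push_back(value);
        sift_up(heap_.size() - 1);
    }

    // Remove and return the smallest element (the root at index 1).
    T pop() {
        T top = heap_[1];
        heap_[1] = heap_.back();
        heap_.pop_back();
        if (!empty()) sift_down(1);
        return top;
    }

private:
    static std::size_t parent(std::size_t n) { return n / 2; }
    static std::size_t left(std::size_t n)   { return 2 * n; }
    static std::size_t right(std::size_t n)  { return 2 * n + 1; }

    void sift_up(std::size_t n) {
        while (n > 1 && heap_[n] < heap_[parent(n)]) {
            std::swap(heap_[n], heap_[parent(n)]);
            n = parent(n);
        }
    }

    void sift_down(std::size_t n) {
        for (;;) {
            std::size_t smallest = n;
            if (left(n) < heap_.size() && heap_[left(n)] < heap_[smallest])
                smallest = left(n);
            if (right(n) < heap_.size() && heap_[right(n)] < heap_[smallest])
                smallest = right(n);
            if (smallest == n) break;
            std::swap(heap_[n], heap_[smallest]);
            n = smallest;
        }
    }

    std::vector<T> heap_;
};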
Using your priority queue, construct the Huffman tree for the input file.
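As a rough sketch of that construction (the HuffNode and QueueEntry types here are illustrative, not required; the queue orders entries by weight via QueueEntry's operator<):

// Hypothetical tree node; field names are illustrative.
struct HuffNode {
    unsigned char symbol;     // meaningful only for leaves
    HuffNode* left;
    HuffNode* right;
};

// Queue entry: a subtree's combined frequency plus a pointer to it.
struct QueueEntry {
    unsigned long weight;
    HuffNode* node;
    bool operator<(const QueueEntry& rhs) const { return weight < rhs.weight; }
};

HuffNode* build_tree(const unsigned short freq[SYMBOLS]) {
    PriorityQueue<QueueEntry> queue;
    for (int s = 0; s < SYMBOLS; ++s)
        if (freq[s] > 0)
            queue.push({freq[s], new HuffNode{static_cast<unsigned char>(s),
                                              nullptr, nullptr}});

    // Repeatedly merge the two lightest subtrees until one tree remains.
    while (queue.size() > 1) {
        QueueEntry a = queue.pop();
        QueueEntry b = queue.pop();
        queue.push({a.weight + b.weight, new HuffNode{0, a.node, b.node}});
    }
    return queue.pop().node;   // root of the Huffman tree
                               // (assumes the file contained at least one byte)
}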
Include routines that dump the tree to stderr to aid your debugging. How you do it is your choice, but one way is a pre-order traversal that passes the depth down and indents each node k*depth spaces. For something slightly more elaborate (which is overkill for this assignment), here is partial output from the Linux pstree command (see the tree structure?):
init-+-NetworkManager-+-dhclientp
| |-dnsmasq
| `-2*[{NetworkManager}]
|-accounts-daemon---{accounts-daemon}
|-acpid
|-at-spi-bus-laun---2*[{at-spi-bus-laun}]
|-atd
|-avahi-daemon---avahi-daemon
|-bamfdaemon---2*[{bamfdaemon}]
|-bluetoothd
|-chromium-browse-+-2*[chromium-browse]
| |-chromium-browse---chromium-browse---chromium-browse-+-60*[chromium-browse---3*[{chromium-browse}]]
| | `-3*[chromium-browse---4*[{chromium-browse}]]
| |-chromium-browse---339*[{chromium-browse}]
| `-30*[{chromium-browse}]
|-colord---2*[{colord}]
|-console-kit-dae---64*[{console-kit-dae}]
|-cron
|-cupsd
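For the simpler indented dump described above, something along these lines is enough (assuming the HuffNode sketch from earlier and k = 2 spaces per level):

#include <iostream>
#include <string>

// Pre-order dump to stderr; each node is indented 2*depth spaces.
void dump_tree(const HuffNode* node, int depth) {
    if (node == nullptr) return;
    std::cerr << std::string(2 * depth, ' ');
    if (node->left == nullptr && node->right == nullptr)
        std::cerr << "leaf 0x" << std::hex << int(node->symbol) << std::dec << "\n";
    else
        std::cerr << "internal\n";
    dump_tree(node->left, depth + 1);
    dump_tree(node->right, depth + 1);
}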
Traverse the Huffman tree to generate an encoding for each character. Since we assume the encoding will take at most 32 bits, we can use a table of unsigned ints to hold the encoding. We will also need to know how many bits are in the encoding.
unsigned bits[SYMBOLS];
unsigned encoding[SYMBOLS];
Alternatively, you can declare a class to hold the encoding information:
class Encoding {
public:
    Encoding();
    // Additional public methods...
private:
    struct EncodedChar {
        unsigned bits_;
        unsigned coded_value_;
    };
    EncodedChar data_[SYMBOLS];
};
Since a byte (8 bits) is the smallest addressable piece of memory, you will have to create a utility that buffers the bits, both for writing and for reading back. Recall your bitwise OR and shift operations.
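A minimal sketch of the writing side (the class name and interface here are assumptions; the reader is the mirror image, refilling its buffer a byte at a time):

#include <ostream>

// Accumulates bits into a byte and writes the byte once it is full.
class BitWriter {
public:
    explicit BitWriter(std::ostream& out) : out_(out), buffer_(0), count_(0) {}

    void put_bit(int bit) {
        buffer_ = (buffer_ << 1) | (bit & 1);   // shift left, OR in the new bit
        if (++count_ == 8) flush_byte();
    }

    // Pad the final partial byte with zero bits and write it out.
    void finish() {
        while (count_ != 0) put_bit(0);
    }

private:
    void flush_byte() {
        out_.put(static_cast<char>(buffer_));
        buffer_ = 0;
        count_ = 0;
    }

    std::ostream& out_;
    unsigned buffer_;   // bits collected so far
    int count_;         // how many bits are in buffer_
};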
An easy way to manage the tree traversal is to create a class, say class Encode, with methods push_bit(), which you call with argument 0 on a left branch and 1 on a right branch, and get_encoding(), which you call at the leaf. The object will manage the bit buffer.
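Whichever way you structure it, the traversal boils down to something like this sketch (written here as a plain recursive function that fills the bits and encoding tables from earlier; the function name is illustrative):

// 'value' holds the bits pushed so far, 'depth' says how many of them there are.
// Going left pushes a 0, going right pushes a 1; a leaf records the code.
void assign_codes(const HuffNode* node, unsigned value, unsigned depth) {
    if (node == nullptr) return;
    if (node->left == nullptr && node->right == nullptr) {
        bits[node->symbol] = depth;
        encoding[node->symbol] = value;
        return;
    }
    assign_codes(node->left,  (value << 1) | 0, depth + 1);
    assign_codes(node->right, (value << 1) | 1, depth + 1);
}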
For partial credit, print the bit-coding to standard error. Each line of output should show the symbol value in hex, the character itself if it is printable (see <cctype>), the bit encoding in brackets, and the number of bits.
The output format makes for easier post-processing by a script. Two Huffman trees for a given text may be identical except for left and right branches being swapped, but a script can verify that each character has the correct number of bits and the encodings are unique.
For example:
0xa ( ): [0] 1
0x20 ( ): [10010] 5
0x23 (#): [1010110] 7
0x28 ((): [100011001] 9
0x29 ()): [100010010] 9
0x2c (,): [100011000] 9
0x2d (-): [11110111] 8
0x2e (.): [100000101] 9
0x2f (/): [101101] 6
0x30 (0): [100010001] 9
0x31 (1): [100010000] 9
0x32 (2): [10001011] 8
0x3a (:): [100010100] 9
0x3b (;): [10000111] 8
0x40 (@): [11110110] 8
0x41 (A): [100001010] 9
0x42 (B): [11101] 5
0x43 (C): [1111111] 7
0x44 (D): [100000100] 9
0x46 (F): [10000001] 8
0x48 (H): [10000000] 8
0x49 (I): [10000011] 8
0x4c (L): [1111001] 7
0x4d (M): [11110000] 8
0x52 (R): [100010011] 9
0x53 (S): [11110001] 8
0x54 (T): [10000100] 8
0x55 (U): [10000110] 8
0x57 (W): [100010101] 9
0x5f (_): [101100] 6
0x61 (a): [1011101] 7
0x63 (c): [1000111] 7
0x64 (d): [111000] 6
0x65 (e): [10011] 5
0x66 (f): [101111] 6
0x67 (g): [100001011] 9
0x68 (h): [1111110] 7
0x69 (i): [11001] 5
0x6c (l): [111110] 6
0x6d (m): [101010] 6
0x6e (n): [101000] 6
0x6f (o): [101001] 6
0x70 (p): [1010111] 7
0x72 (r): [11011] 5
0x73 (s): [11000] 5
0x74 (t): [11010] 5
0x75 (u): [1011100] 7
0x79 (y): [111001] 6
0x7b ({): [1111010] 7
0x7d (}): [10001101] 8
You will have to "rewind" the input file to read it again for compression, so you'll probably want to check the manual page for istream.
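For example (a minimal sketch, assuming the input is a std::ifstream), the end-of-file state left by the first pass has to be cleared before seeking back:

#include <fstream>

// Clear the end-of-file state left by the first pass, then rewind.
void rewind_input(std::ifstream& in) {
    in.clear();                    // reset eofbit/failbit
    in.seekg(0, std::ios::beg);    // back to the start for the encoding pass
}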
Emit the frequency table so puff can reconstruct the Huffman tree.
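One simple approach (an assumed file layout, not a required one) is to write the raw table, followed by the total character count, at the front of the .huff file; puff then reads the same header back before decoding:

#include <ostream>

// Hypothetical header: the 256-entry frequency table followed by the
// total number of characters in the original file.
void write_header(std::ostream& out, unsigned long total_chars) {
    out.write(reinterpret_cast<const char*>(symbol_frequency),
              sizeof(symbol_frequency));
    out.write(reinterpret_cast<const char*>(&total_chars), sizeof(total_chars));
}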
Since the bit-length of the characters is not constant, your encoded text is unlikely to end on an 8-bit boundary (what are the chances?). This means you will probably have a few bits left over. That's okay: keep the character count so you know when to stop decoding.
Reread the compressed file and expand it into a file with a ".puff" extension (e.g. expand foo.txt.huff into foo.txt.huff.puff).
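A sketch of that decode loop, assuming the header layout suggested above, a tree rebuilt from the frequency table, and a BitReader that mirrors the BitWriter sketch (all of these names are illustrative):

#include <ostream>

// Walk from the root one bit at a time, emitting a byte at each leaf,
// until the original character count has been reproduced.
void decode(BitReader& in, std::ostream& out,
            const HuffNode* root, unsigned long total_chars) {
    for (unsigned long produced = 0; produced < total_chars; ++produced) {
        const HuffNode* node = root;
        while (node->left != nullptr || node->right != nullptr)
            node = (in.get_bit() == 0) ? node->left : node->right;
        out.put(static_cast<char>(node->symbol));
    }
    // Any zero-padding bits left in the final byte are simply never read.
}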
Submit your source code and test data along with a Bash script named Run which will compile your programs and run them with your test data. The script should compress your test data, compute the space savings, expand the compressed file, and verify that the decompressed file matches the original text.
The script is required but need not be your own work. Do identify its provenance and do not claim someone else's work as your own.
Optionally submit a text file named README with any additional notes.
There are easier ways and harder ways to do this. It may seem like extra work to write the scaffolding and unit tests, but really this is the easiest way to build a complex program. Each milestone should work as a stand-alone program.
100% of the grade is based on huff and puff working as an end-to-end solution, but partial marks will be awarded for working milestones.
You may use standard (stream) I/O, std::string (including <cstring> and <cctype>), std::vector, and memory management (new and delete).
You may use the STL sort and/or reverse algorithms in your debug code to order output by frequency.
You can obtain large samples of text from Project Gutenberg (http://www.gutenberg.org). Consider using a shell script to remove punctuation and long words from the text.
Avoid generating random data because it will have high entropy and the Huffman algorithm won't yield good results.