For this assignment, you will write a pair of C++ programs, huff and puff, which will implement the Huffman data compression algorithm.
The project has a lot of moving parts, but you should find each piece to be manageable. Start early and work steadily. Cramming this on the night before it's due will be a losing strategy.
Since the smallest addressable unit in C++ is a byte (8 bits), you will need to create a utility for reading and writing streams of bits. Just as standard iostreams buffer data into blocks for efficiency, you will buffer bits into bytes.
The writer will have methods int putBit(unsigned int bit) and int putByte(unsigned char byte). putBit should collect (buffer) the bits it receives until it gets 8 of them, then write out the byte. You will also need to implement a flush() function that writes any buffered bits (adding additional zero bits to fill out the current byte) to align on the next byte boundary. putByte should flush to the next byte boundary before writing the byte value. Your destructor should call flush() before closing the file.
Similarly, the bitstream reader should have methods getBit(), which buffers bits as needed, and getByte(), which advances to the next byte boundary and reads the next byte.
While this may seem trivial, you have some decisions to make. Should you write one class or two? Should the stream read from standard I/O, an open iostream object, or a filename to be opened by your class? Should your bits be big endian or little endian? What other convenience functions will you need? Is there common code that may be shared between the reader and the writer?
Hint: one way to test your bitstreams is to read a small text file one bit at a time, write it back, and verify that the original and the copy are identical.
The first step in the Huffman algorithm is to count the number of occurrences of each character in the input. You may assume the input file will not exceed 2 GB, so you can use an unsigned int to hold the frequency of each byte.
Write a program that reads a text file and produces a table of characters and their frequencies (number of occurrences in the input data). Include a dump routine that prints the table to standard error with character number in hex, the character if it is printable or a space, and the count.
You may also want your dump routine to do a crude horizontal bar
chart to capture the relative frequency of each character.
Let Nmax be the frequency of the most common symbol (probably 'e' in an English text). Print, say, 50 * Ni / Nmax splat marks for character i.
Again, while that may sound trivial, you need to make some decisions. Should you use a class or a function? Should the reader read from standard input, an open file, or take a string and open the file? How should the table be represented? Should the output be ordered by character value, frequency, or unordered? How are you going to ensure that the program is correct?
Hint: there are UCHAR_MAX + 1 distinct unsigned character values (0..255). That is a fixed size known at compile time.
Hint: consider using the isprint macro from the standard C library.
The next step in the Huffman algorithm is to implement a priority queue that you will use to build the Huffman tree.
Write a template class Heap that implements a binary heap. Use std::vector to hold the data. The class should take two type parameters: Priority, to represent the priority value, and Data, to represent the data. Priority should implement the less-than operator for comparisons (higher priority is a lower number). Heap should provide methods push(), pop(), and is_empty(). push() will take two parameters, priority and data. pop() should extract and return the "minimum" value. is_empty() should return a Boolean value, with the obvious semantics.
Test your algorithm by implementing the HeapSort algorithm.
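One way the heap and its HeapSort test might be sketched, under the requirements above (storing (priority, data) pairs in the vector is one design choice among several):

```cpp
#include <utility>
#include <vector>

// Sketch of a binary min-heap keyed on Priority (a smaller value means
// higher priority), carrying an arbitrary Data payload. Priority must
// support operator<.
template <typename Priority, typename Data>
class Heap {
public:
    void push(const Priority& priority, const Data& data) {
        heap_.emplace_back(priority, data);
        std::size_t i = heap_.size() - 1;
        while (i > 0) {                          // sift up
            std::size_t parent = (i - 1) / 2;
            if (!(heap_[i].first < heap_[parent].first)) break;
            std::swap(heap_[i], heap_[parent]);
            i = parent;
        }
    }

    // Remove and return the (priority, data) pair with smallest priority.
    std::pair<Priority, Data> pop() {
        std::pair<Priority, Data> top = heap_.front();
        heap_.front() = heap_.back();
        heap_.pop_back();
        std::size_t i = 0, n = heap_.size();
        for (;;) {                               // sift down
            std::size_t l = 2 * i + 1, r = l + 1, smallest = i;
            if (l < n && heap_[l].first < heap_[smallest].first) smallest = l;
            if (r < n && heap_[r].first < heap_[smallest].first) smallest = r;
            if (smallest == i) break;
            std::swap(heap_[i], heap_[smallest]);
            i = smallest;
        }
        return top;
    }

    bool is_empty() const { return heap_.empty(); }

private:
    std::vector<std::pair<Priority, Data>> heap_;
};

// HeapSort as a test: push everything, then pop in ascending order.
std::vector<int> heapSort(const std::vector<int>& v) {
    Heap<int, int> h;
    for (int x : v) h.push(x, x);
    std::vector<int> out;
    while (!h.is_empty()) out.push_back(h.pop().second);
    return out;
}
```

If heapSort returns its input in sorted order for a variety of inputs (including duplicates and the empty vector), the sift-up and sift-down logic is likely sound.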
From the frequency table and using your priority queue, build the Huffman tree. You will be using one kind of tree (the heap) to build an entirely different kind of tree! Do not confuse the two!
As with all data structures, you should provide a dump routine that shows the structure of the tree, and for each node the count and the character (if the node is a leaf). Consider how the binary heap dump routine should interact with the Huffman tree dump routine to yield the most effective and convenient output format for debugging.
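The tree-building step can be sketched as follows. The node layout and function name are illustrative, and std::priority_queue stands in here for the Heap class you will have written; freeing the nodes is omitted for brevity.

```cpp
#include <array>
#include <climits>
#include <queue>
#include <vector>

// Sketch of a Huffman tree node. Leaves carry a byte value; internal
// nodes have two children and the sum of their children's counts.
struct HuffNode {
    unsigned int count;
    int symbol;                 // -1 for internal nodes
    HuffNode* left;
    HuffNode* right;
};

// Build the tree: start with one leaf per symbol that occurs, then
// repeatedly merge the two lowest-count trees until one remains.
// (std::priority_queue stands in for the assignment's Heap class.)
HuffNode* buildTree(const std::array<unsigned int, UCHAR_MAX + 1>& freq) {
    auto cmp = [](const HuffNode* a, const HuffNode* b) {
        return a->count > b->count;          // min-queue on count
    };
    std::priority_queue<HuffNode*, std::vector<HuffNode*>, decltype(cmp)> pq(cmp);
    for (int i = 0; i <= UCHAR_MAX; ++i)
        if (freq[i] > 0)
            pq.push(new HuffNode{freq[i], i, nullptr, nullptr});
    if (pq.empty()) return nullptr;          // empty input
    while (pq.size() > 1) {
        HuffNode* a = pq.top(); pq.pop();
        HuffNode* b = pq.top(); pq.pop();
        pq.push(new HuffNode{a->count + b->count, -1, a, b});
    }
    HuffNode* root = pq.top();
    return root;
}
```

The root's count should always equal the total number of symbols in the input, which is a cheap sanity check for your dump routine to confirm.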
From the Huffman tree, build the table of coded symbols. Since the table is of fixed size, use an array.
A coded symbol is determined by its length (number of bits) and the bit pattern. You may assume that 64 bits is an upper bound on the length of any coded symbol, so you can store the length in a byte field and the bit pattern in an unsigned long long.
Your dump function should print out the bitstring in a way that is convenient to analyze. Hexadecimal is compact and, with practice, can be eyeballed directly as its corresponding bit sequence. For the less practiced, consider a function that prints in binary format (ones and zeros, or splats and dots, whatever makes it easier for your eye to perceive). You can do this by calculation or by table lookup.
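A sketch of the coded-symbol representation and a binary renderer, done by calculation (the struct and function names are illustrative):

```cpp
#include <string>

// A coded symbol: its bit length plus the bit pattern, right-justified
// in an unsigned long long (64 bits is assumed to bound any code length).
struct Code {
    unsigned char length;        // number of significant bits
    unsigned long long bits;     // the code, in the low-order bits
};

// Render a code as ones and zeros, most significant bit first; this is
// often easier to eyeball than hex while debugging.
std::string codeToBinary(const Code& c) {
    std::string s;
    for (int i = c.length - 1; i >= 0; --i)
        s += ((c.bits >> i) & 1ULL) ? '1' : '0';
    return s;
}
```

For example, a 5-bit code with pattern 22 (0b10110) renders as "10110".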
Emit the coded-symbol table to the compressed output file, followed by the file length value, as described in the output format.
Use the code table to encode the input file, writing the coded data to the .huff file.
The .huff output format is:
- For each of the 256 symbol values: a byte bitcount for the bit length of coded symbol i, followed by ceil(bitcount/8) bytes (rounded up!) of bits for the encoding of the symbol. Set any excess bits to zero. Bits will be "big endian" within the byte.
- N, the number of symbols in the file.
- N coded symbols (set any extra bits in the last byte to zero).
Note that for English text (such as Common Sense), most of the coded symbols will have length 0 (not present in the file). That's okay: the wasted space is small compared with the overall size of the file being compressed. Your program should work equally well for text and binary data (which have different distributions).
Although there are solutions to the problem, huff cannot simply read from standard input because it needs to make two passes over the input data.
Your program will take a command-line argument (argv[1]), the name of the file to compress, and will write the compressed data to a file with the added extension .huff (you may use the standard library string functions for this as needed).
If there is a problem opening or reading the file, huff should exit with a non-zero status. Use the C standard library macros EXIT_SUCCESS and EXIT_FAILURE.
puff should read the .huff file and replace the extension with .puff (so as not to accidentally overwrite the original file). If the filename argument does not end in .huff, or the file is not readable, display an error message and terminate with status EXIT_FAILURE.
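The filename handling for both programs reduces to two small string helpers; one possible sketch (the helper names, and signaling a bad name with an empty string, are illustrative choices):

```cpp
#include <string>

// huff: append the .huff extension to the input name.
std::string huffOutputName(const std::string& input) {
    return input + ".huff";
}

// puff: the argument must end in ".huff"; replace that with ".puff"
// so the original file is never overwritten. Returns an empty string
// to signal a bad name (the caller should then print an error and
// exit with EXIT_FAILURE).
std::string puffOutputName(const std::string& input) {
    const std::string ext = ".huff";
    if (input.size() <= ext.size() ||
        input.compare(input.size() - ext.size(), ext.size(), ext) != 0)
        return "";
    return input.substr(0, input.size() - ext.size()) + ".puff";
}
```

So "book.txt" compresses to "book.txt.huff", which decompresses to "book.txt.puff", leaving the original intact for comparison.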
Reconstruct the Huffman tree from the table of coded symbols in the .huff file, and verify that it is well-formed before using it to decode the rest of the contents of the file.
To validate your programs, download a reasonably-sized text file from Project Gutenberg and compress it. Verify that you get the expected compression and that you get the original data back when expanded. The diff utility is useful for this comparison. Report the original size, compressed size, and space savings in your README file. Don't forget to include a link to your chosen text.
Here is a compiled copy of the decoder that works on the UW-320 lab machines to use for your testing. For full marks, your huff-encoded files must be decodable by the supplied program, as well as by your implementation of puff.