Write a pair of C++ programs,
huff
and
puff
which will implement the Huffman data compression algorithm.
The two programs will need to share some code, so write your
BUILD
script
accordingly.
Here is a compiled copy of the
decoder
that works on the UW-320 lab machines to use for your testing.
For full marks,
huff-encoded
files must be decodable by the supplied program, as well as by your
implementation of
puff.
bitcount
for coded symbol i (the longest code will not exceed 255 bits!)
followed by
ceil(bitcount/8)
bytes (rounded up!) of bits for the encoding of the symbol. Set
any excess bits to zero. Bits will be "big endian".
N,
the number of coded symbols in the file.
N
coded symbols (set any extra bits in the last byte to zero).
Note that for English text (such as Common Sense), most of the coded symbols will have length 0. That's okay: the wasted space is small compared with the overall size of the file being compressed. Your program should work equally well for text and binary data (which have different distributions).
Although there are solutions to the problem,
huff
cannot simply read from standard input because it needs to make two
passes over the input data.
Your program
will take a comand-line argument
("argv"),
the name of the file to compress,
and will write the compressed data to a file with the added
extension
.huff
(you may use the standard library string functions for this as
needed).
If there is a problem opening or reading the file, huff should exit
with a non-zero status. Use the C standard library macros
EXIT_SUCCESS
and
EXIT_FAILURE.
The first step is to count the number of occurrences of each character in the text. You may assume the input file will not exceed 2 GB, so you can use an int (signed or unsigned)1 to hold the frequency of each byte.
There are
UCHAR_MAX + 1
distinct unsigned character values
(0..255). You may use an array to hold the counts.
While counting the frequency of each symbol, get the total count of symbols in the file which you will need to write to the output file.
Your dump routine should print out the hexadecimal value of the byte
and, if the symbol is printable, the character. Consider using the
isprint
macro from the C standard library.
Consider a crude horizontal bar chart: print, say
25 * Ni/N
splat marks to eyeball the relative frequency. Expect to see more
occurrences of 'e' than 'z' in an English text.
From the frequency table, build the Huffman tree. Remember the priority queue from Assignment 3? You will be using one kind of tree to build an entirely different kind of tree!
Consider how the binary heap dump routine should interact with the Huffman tree dump routine to yield the most effective and convenient output format for debugging.
From the Huffman tree, build the table of coded symbols. Since the table is of fixed size, you may use an array as used for the frequency counter.
A coded symbol is determined by its length (number of bits) and the
bit pattern. You may assume that 64 bits is an upper bound on the
length of any coded symbol, so you can store the length in a byte
field and the bit pattern in an
unsigned long long.
Your dump function should print out the bitstring is a way that is convenient to analyze. Hexadecimal is compact and, with practice, can be eyeballed directly as its corresponding bit sequence. For the less-practiced, consider a function that prints in binary format (ones and zeros or splats and dots, whatever makes it easier for your eye to perceive). You can do this by calculation or by table lookup2.
Emit the coded-symbol table to the compressed output file, followed by the file length value, as described in the output format.
Use the code table to encode the input file, writing the coded data
to the
.huff
file.
For output, you will want a class that emits bits and bytes to a file. Standard library file objects only go down to the granularity of bytes so you will have to collect (buffer) the bits and write out only complete bytes. Bits within the byte should be "big endian".
You'll need a
flush
function that will write out the currently-being-assembled byte (if
there is one) zero-filling the remaining bits. You'll need your
destructor or close method to flush the final byte before closing.
The file size will be a sequence of 4 bytes. You can add a write-4-bytes method to your bitstream writer or have the caller disassemble it into individual bytes3.
Similarly, for decoding the file, you will need a bitstream reader to decode the file, with read-bit, read-byte, and flush operations.
Since you are both the library designer and library user, you must
decide whether to call the functions
put
and
get,
put_bit
and
get_bit,
write_bit
and
read_bit,
or
putBit
and
getBit4.
puff
should read the
.huff
file and replace the extension with
.puff
(so as not to accidentally overwrite the original file).
If the filename argument does not end in
.huff
or the file is not readable,
print an error message and terminate with status
EXIT_FAILURE
You will reconstruct the Huffman tree from the table of coded
symbols in the
.huff
file and
verify
that it is well-formed before using it to decode the rest of the
contents of the file.
Download a reasonably-sized text file from
Project Gutenburg
and run your encoding on the file. Report the original size,
compressed size, and
space savings
in your
README
file.
Don't forget to include a link to your chosen text.
For validation,
compare
your expanded
.puff
file with your original text.
Your build or run script can automate the verification.