Implement a priority queue template per the standard library definition of priority_queue. You may omit the second and third template arguments (Container and Compare); instead, just use std::vector and operator<.
The standard definition allows client code to specify its own underlying container type, but (quite reasonably) defaults to std::vector.
The standard specifies the contract that the container class should provide. You are matching the contract.
You may use std::vector and the usual strings, I/O, and memory-management stuff (new/delete).
Note that the priority queue is conceptually a binary tree, but it is implemented using an array (or std::vector). Use the fact that parent(n) is n/2, left(n) is 2*n, and right(n) is 2*n+1. Start with the root in array element 1 (you may "waste" the 0-th element to simplify your calculation). A good way to do this is to create a PriorityQueue class with a private member class Node. Define the methods that access data as members of PriorityQueue that take a Node parameter.
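As a rough illustration of the index arithmetic (the names are mine, and for brevity this sketch stores ints and indexes directly rather than wrapping the index in the Node member class the assignment describes):

#include <cstddef>
#include <utility>
#include <vector>

class PriorityQueue {
public:
    bool empty() const { return heap.size() == 1; }
    int top() const { return heap[1]; }            // largest element at the root
    void push(int value) {
        heap.push_back(value);                     // new leaf at the end...
        siftUp(heap.size() - 1);                   // ...then restore heap order
    }
    void pop() {                                   // assumes the queue is non-empty
        heap[1] = heap.back();                     // move the last leaf to the root
        heap.pop_back();
        siftDown(1);
    }
private:
    std::vector<int> heap{0};                      // slot 0 is "wasted"; root at 1
    void siftUp(std::size_t n) {
        while (n > 1 && heap[n / 2] < heap[n]) {   // parent(n) is n/2
            std::swap(heap[n], heap[n / 2]);
            n /= 2;
        }
    }
    void siftDown(std::size_t n) {
        while (2 * n < heap.size()) {              // left(n) is 2*n
            std::size_t child = 2 * n;
            if (child + 1 < heap.size() && heap[child] < heap[child + 1])
                ++child;                           // right(n) is 2*n+1
            if (!(heap[n] < heap[child])) break;
            std::swap(heap[n], heap[child]);
            n = child;
        }
    }
};

For Huffman coding the elements would be tree nodes compared by frequency, and since the algorithm repeatedly removes the two least frequent nodes, you will want the comparison arranged so those surface at the top.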
Using your priority queue, write a program called huff that compresses and then expands a text file using the Huffman coding algorithm. huff will take the name of the file as a command-line argument. Your Run script should take the same command-line argument and pass it to the program.
Construct the Huffman tree for the input file. Start by reading the file and computing a table of character frequencies. You will have to "rewind" the file to read it again for compression, so you'll probably want to check the manual page for istream.
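For instance (a minimal sketch; the function name is mine), rewinding means clearing the stream state before seeking, since the first pass leaves eofbit set:

#include <istream>

// Sketch: rewind an input stream so it can be read a second time.
void rewind(std::istream& in) {
    in.clear();                  // clear eofbit/failbit left by the first pass
    in.seekg(0, std::ios::beg);  // move the get pointer back to the beginning
}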
Include routines that dump the tree to stderr to aid your debugging. Your choice how, but one way to do it is a pre-order traversal that passes depth down and indents each node k*depth spaces (a sketch follows the example below). For something slightly more elaborate (which is overkill for this assignment), here is partial output from the Linux pstree command:
init-+-NetworkManager-+-dhclientp
| |-dnsmasq
| `-2*[{NetworkManager}]
|-accounts-daemon---{accounts-daemon}
|-acpid
|-at-spi-bus-laun---2*[{at-spi-bus-laun}]
|-atd
|-avahi-daemon---avahi-daemon
|-bamfdaemon---2*[{bamfdaemon}]
|-bluetoothd
|-chromium-browse-+-2*[chromium-browse]
| |-chromium-browse---chromium-browse---chromium-browse-+-60*[chromium-browse---3*[{chromium-browse}]]
| | `-3*[chromium-browse---4*[{chromium-browse}]]
| |-chromium-browse---339*[{chromium-browse}]
| `-30*[{chromium-browse}]
|-colord---2*[{colord}]
|-console-kit-dae---64*[{console-kit-dae}]
|-cron
|-cupsd
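A sketch of the simpler indented dump (Node and its members here are hypothetical stand-ins for your own tree representation):

#include <iostream>
#include <string>

struct Node {                     // hypothetical node; adapt to your own tree
    char ch;
    int freq;
    Node* left;
    Node* right;
};

// Pre-order traversal: print the node, then recurse on its children,
// indenting each level by k*depth spaces (k = 2 here).
void dump(const Node* n, int depth) {
    if (n == nullptr) return;
    std::cerr << std::string(2 * depth, ' ')
              << n->ch << " (" << n->freq << ")\n";
    dump(n->left, depth + 1);
    dump(n->right, depth + 1);
}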
Print the frequency count and bit-coding to standard output, ordered by increasing character frequency. Each line of output should have the character's hex value, the character itself if it is printable (see the cctype functions at http://www.cplusplus.com/reference/cctype/), the frequency count, the bit-coding, and the code length.
For example:
0x4c (L): 292 [011000010] (9)
0x2d (-): 352 [000110011] (9)
0x52 (R): 375 [000110000] (9)
0x48 (H): 388 [11011001] (8)
0x53 (S): 442 [11001100] (8)
0x6b (k): 479 [01101011] (8)
You may use the STL sort and reverse algorithms to order the output by frequency.
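One way to produce a line in that format (a sketch; the parameter names are mine) is printf-style formatting, which keeps the hex padding simple:

#include <cctype>
#include <cstdio>
#include <string>

// Sketch: print one table entry, e.g. "0x4c (L): 292 [011000010] (9)".
void printEntry(unsigned char ch, int count, const std::string& code) {
    char shown = std::isprint(ch) ? static_cast<char>(ch) : '?';  // stand-in for unprintables
    std::printf("0x%02x (%c): %d [%s] (%zu)\n",
                ch, shown, count, code.c_str(), code.size());
}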
Compress the file and write to an output file with the ".huff" extension (e.g. for foo.txt, write the compressed data to foo.txt.huff).
The Run script should use the Linux utility wc (word count) to show the difference in file size.
If you're feeling particularly ambitious, compute the space savings.
Reread the compressed file and expand it into a file with a ".restore" extension (e.g. expand foo.txt.huff into foo.txt.restore).
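Decoding walks the Huffman tree bit by bit: start at the root, follow the left or right child for each bit, and emit a character whenever a leaf is reached. A sketch, reusing the hypothetical Node above and assuming a readBit() helper (the bit-reading utility described further below) plus the saved character count:

// Sketch of the decode loop; root, readBit, out, and charCount are assumed.
for (long i = 0; i < charCount; ++i) {
    const Node* n = root;
    while (n->left != nullptr) {          // Huffman internal nodes have two children
        n = readBit() ? n->right : n->left;
    }
    out.put(n->ch);                       // reached a leaf: emit its character
}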
Use diff or cmp to verify that the restored file is identical to the original text.
For test data, download text files from Project Gutenberg.
Since a byte (8 bits) is the smallest addressable piece of memory, you will have to create a utility that buffers the bits both for writing and for reading back. Recall your bitwise OR and shift operations.
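A sketch of such a utility (the class names are mine), packing bits most-significant-bit first:

#include <istream>
#include <ostream>

// Sketch: accumulate bits into a byte and flush it when full.
class BitWriter {
public:
    explicit BitWriter(std::ostream& out) : out(out) {}
    void writeBit(int bit) {
        buffer = (buffer << 1) | (bit & 1);   // shift in the new bit
        if (++nbits == 8) flush();
    }
    void flush() {                            // pad the last byte with zeros
        if (nbits > 0) {
            buffer <<= (8 - nbits);
            out.put(static_cast<char>(buffer));
            buffer = 0;
            nbits = 0;
        }
    }
private:
    std::ostream& out;
    unsigned char buffer = 0;
    int nbits = 0;
};

// Sketch: the mirror image for reading the bits back. Relies on the saved
// character count to stop before reading past the end of the data.
class BitReader {
public:
    explicit BitReader(std::istream& in) : in(in) {}
    int readBit() {
        if (nbits == 0) {
            buffer = static_cast<unsigned char>(in.get());
            nbits = 8;
        }
        --nbits;
        return (buffer >> nbits) & 1;         // most significant bit first
    }
private:
    std::istream& in;
    unsigned char buffer = 0;
    int nbits = 0;
};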
Since the bit-length of the characters is not constant, your encoded text is unlikely to end on an 8-bit boundary. This means you will probably have a few bits left over. That's okay: keep the character count to know when to stop decoding.
The algorithm requires two passes over the input (the first pass computes the character frequencies and the second pass does the compression). Compressing data from a pipeline to standard input means the entire input has to be read into memory or a temporary file must be created. Neither solution works very well for a large amount of streaming data (what, you want to use Huffman coding to compress that YouTube video? In real time???).
One solution is to use a moving window of data. Each data block carries its own frequency table.
Another way to avoid two full passes over the data file is to use statistical estimates for the frequency counts.