This assignment revisits the wordcount problem of Assignment 1 using a faster data structure and additional C++ language features. You will write two programs using the same B Tree implementation.
Submit a Linux script,
Run
that will build both programs and execute their test cases. Don't
forget to include your data files. The executables shall be
called
wordcount
and
dedup
.
You may use
classes,
templates,
operator new
,
and C++ strings for this assignment.
Define a class
BTree
that implements a 2-3 Tree in C++. Use a template to specify a
generic type for the data item and assume the item has an
operator<
method. Do not asssume the item has an
operator==
or
operator>
method.
The B Tree should provide the following public methods:
T* insert(T* item)
Find or insert
item
.
Return a pointer to the object that was inserted if a new
entry is added, otherwise returns a pointer to the matching
object that was already in the tree.
T* lookup(T* item)
Given an object, return a pointer to the matching object in
the tree if there is one, NULL
otherwise.
template<typename F> void walk(F f)
Apply function object
f(T* item)
to every element in the tree in order.
Note that a
function object
is an object that has an
operator()
method.
The tree nodes are private to the
BTree
class.
Bonus: reuse the memory allocator from Assignment 1, using the
placement form of
operator new()
.
Unit tests are essential for software maintenance because they
catch defects that are introduced into previously-working code
("regressions").
For our purposes,
wordcount
should provide adequate testing if all cases are covered.
One way to make sure all cases are covered is by using a code coverage tool. Such tools can ensure that each line of code is exercised, but cannot check all the possible combinations of code paths—neccessary but not necessarily sufficient testing.
Another way to ensure coverage is to introduce counters into your code. Each case gets it's own counter that is incremented every time that case is executed. On program termination, you can verify all counters have positive values.
You may use the simple counter class in
counter.tgz
.
To unpack the "tarball", into a subdirectory named
counter
:
tar xfvz counter.tgz
Using the B Tree, reimplement
wordcount
from Assignment 1.
We slightly change the semantics of the program in 2 ways:
You may use the Python script
wordcount.py
to verify your output. Run the script to produce a baseline
output file. The output of your program should match the baseline
file.
Bonus: based on
wordcount
,
write
wordcount-by-frequency
which outputs the words sorted by the frequency of their
occurrences.
Use an optional command-line parameter
-n n
to restrict output to the top n words. This is
equivalent to the Linux pipeline
wordcount.py
datafile |
sort -n | head -n n
Use the B Tree to implement
dedup
,
which reads a text file containing one word per line and outputs the
words in their original order with duplicates removed (i.e. only output
the first occurrence of each word).
wordcount
and
dedup
have been used as interview questions. Since dictionary and
list structures are part of modern programming languages, both
programs are suitable for "whiteboard programming".
wordcount
for input of varying size and graphing the performance. You may
replace the custom allocator from Assignment 1 with
malloc()
and
free
.
Use a graph plotting tool such as
gnuplot
or
R
.
wordcount
and
dedup
are tested by comparing their output to a reference file.
Another common testing technique is to put assertions about pre
and post conditions into the test code. Consider the tradeoffs
between the two techniques. Is one method more fragile? Is one
easier to modify the test cases?