Assignment 2: B TREE

This assignment revisits the wordcount problem of Assignment 1 using a faster data structure and additional C++ language features. You will write two programs using the same B Tree implementation.

Submit a Linux script, Run that will build both programs and execute their test cases. Don't forget to include your data files. The executables shall be called wordcount and dedup.

You may use classes, templates, operator new, and C++ strings for this assignment.

B Tree

Define a class BTree that implements a 2-3 Tree in C++. Use a template to specify a generic type for the data item and assume the item has an operator< method. Do not asssume the item has an operator== or operator> method.

The B Tree should provide the following public methods:

T* insert(T* item)
Find or insert item. Return a pointer to the object that was inserted if a new entry is added, otherwise returns a pointer to the matching object that was already in the tree.
T* lookup(T* item)
Given an object, return a pointer to the matching object in the tree if there is one, NULL otherwise.
template<typename F> void walk(F f)
Apply function object f(T* item) to every element in the tree in order. Note that a function object is an object that has an operator() method.

The tree nodes are private to the BTree class.

Bonus: reuse the memory allocator from Assignment 1, using the placement form of operator new().

Testing

Unit tests are essential for software maintenance because they catch defects that are introduced into previously-working code ("regressions"). For our purposes, wordcount should provide adequate testing if all cases are covered.

One way to make sure all cases are covered is by using a code coverage tool. Such tools can ensure that each line of code is exercised, but cannot check all the possible combinations of code paths—neccessary but not necessarily sufficient testing.

Another way to ensure coverage is to introduce counters into your code. Each case gets it's own counter that is incremented every time that case is executed. On program termination, you can verify all counters have positive values.

You may use the simple counter class in counter.tgz.

To unpack the "tarball", into a subdirectory named counter:

tar xfvz counter.tgz

Program 1

Using the B Tree, reimplement wordcount from Assignment 1.

We slightly change the semantics of the program in 2 ways:

Since B Tree delete is not required, just leave all entries in place, but skip any entries with zero count when walking the tree. The program should still terminate with failure exit status if any count drops below zero.
Output should be in alphabetical order by word.

You may use the Python script wordcount.py to verify your output. Run the script to produce a baseline output file. The output of your program should match the baseline file.

Bonus: based on wordcount, write wordcount-by-frequency which outputs the words sorted by the frequency of their occurrences. Use an optional command-line parameter -n n to restrict output to the top n words. This is equivalent to the Linux pipeline

wordcount.py datafile | sort -n | head -n n

Program 2

Use the B Tree to implement dedup, which reads a text file containing one word per line and outputs the words in their original order with duplicates removed (i.e. only output the first occurrence of each word).

Notes

Both wordcount and dedup have been used as interview questions. Since dictionary and list structures are part of modern programming languages, both programs are suitable for "whiteboard programming".

Questions to Ponder

Consider running a benchmark to compare Assignment 1 and 2 versions of wordcount for input of varying size and graphing the performance. You may replace the custom allocator from Assignment 1 with malloc() and free. Use a graph plotting tool such as gnuplot or R.
Identify the Standard Template Library components that have equivalent functionality to our B Tree. Consider its interface design compared to the B Tree interface specified in this assignment. Think about any strengths or weakness in the interface given here.
Removing an item from a B Tree is just as important as insert and find. It was not required for this assignment because it won't add much to your learning experience, but do think about how the delete operation would be implemented.
Why does C++ have a placement new operator? Consider this wikipedia article or this Stack Overflow discussion.
Both wordcount and dedup are tested by comparing their output to a reference file. Another common testing technique is to put assertions about pre and post conditions into the test code. Consider the tradeoffs between the two techniques. Is one method more fragile? Is one easier to modify the test cases?