Assignment 1: Binary Search Tree

Introduction

We wish to find frequency of each word in some text. Fortunately, the size of the text (number of words) is small enough that we won't have to resort to elaborate solutions. Since we will want to output the word list in alphabetical order, a binary search tree seems like just the ticket.

To keep things simple you may assume the input file has been preprocessed so there is one word per line with no extraneous punctuation.

Write a program that reads the word list from standard input and emits to standard output1 the words in alphabetical order preceeded by frequency count. You must use the this output format to be compatible with output from the Linux uniq utility: that'll give you (and your instructor!) an easy way to validate your program. Do not add prompt strings, pauses, etc2. You may include arbitrary debug messages to standard error3.

For example, if you take a snapshot of an early draft (which no longer exists!) of the preceeding paragraphs, your input file will look like this:

we
wish
to
find
frequency
of
each
word
in
a
text
naturally
the
text
is
too
large
to
read
into
main
memory
so
sorting
and
counting
is
an
intractable
approach
fortunately
the
number
of
distinct
words
is
small
enough
that
we
don
t
have
to
resort
to
elaborate
solutions
since
we
want
to
emit
the
word
list
in
alphabetical
order
a
binary
search
tree
is
just
the
ticket
to
keep
things
simple
you
may
assume
the
input
file
has
been
preprocessed
so
there
is
one
word
per
line
with
no
extraneous
punctuation
write
a
program
that
reads
the
word
list
from
standard
input
and
outputs
the
words
in
alphabetical
order
preceeded
by
frequency
count

Your output should look like this:


      3 a
      2 alphabetical
      1 an
      2 and
      1 approach
      1 assume
      1 been
      1 binary
      1 by
      1 count
      1 counting
      1 distinct
      1 don
      1 each
      1 elaborate
      1 emit
      1 enough
      1 extraneous
      1 file
      1 find
      1 fortunately
      2 frequency
      1 from
      1 has
      1 have
      3 in
      2 input
      1 into
      1 intractable
      5 is
      1 just
      1 keep
      1 large
      1 line
      2 list
      1 main
      1 may
      1 memory
      1 naturally
      1 no
      1 number
      2 of
      1 one
      2 order
      1 outputs
      1 per
      1 preceeded
      1 preprocessed
      1 program
      1 punctuation
      1 read
      1 reads
      1 resort
      1 search
      1 simple
      1 since
      1 small
      2 so
      1 solutions
      1 sorting
      1 standard
      1 t
      2 text
      2 that
      7 the
      1 there
      1 things
      1 ticket
      6 to
      1 too
      1 tree
      1 want
      3 we
      1 wish
      1 with
      4 word
      2 words
      1 write
      1 you

Design Notes

Your Tree class should have two public methods:

The class will hold a pointer to the root Node of the tree. The node structure is private to the tree and should not be part of the API.

If you want to get fancy or find this assignment too easy, make your tree generic using C++ templates and parameterize the walk function with a function pointer or function object.

If you're really bored, implement an iterator class for the tree with operator++() and operator--() methods. You will have to add begin() and end() methods to the Tree class.

Recursion vs. Iteration

In general, any problem with a recursive solution may be solved with an iterative algorithm, using an auxiliary stack if necessary. That is, after all, exactly how the compiler implements recursion. In this case, an auxilliary stack would not be required if a parent pointer is added to the node structure.

Usually, the recursive solution is more succinct or even elegant, but may not be as efficient as the non-recursive solution due to the overhead cost of setting up the stack frames of the recursive calls. Some languages and/or systems limit the total stack size or depth of recursion. For example, Python has a system parameter to limit the number of recursive calls with a default value of 1000.

Recursion-friendly languages, particularly functional-programming languages implement the tail call optimization which turns the recursion into iteration. This is an implementation detail that you do need to know about when performance matters.

For pedagogical reasons, you will implement the walker using recursion. Do not add a parent pointer to the tree. The iterator, if you choose to implement one, is not allowed to use a parent pointer either.

Random Data Generation

A good way to test your program is to use some arbitrary text such as this one. You will have to write or otherwise acquire a shell script to preprocess the file. You can then compare the output with a baseline produced by another shell script.

Since we're interested in evaluating performance as a function of input size, you will need to generate random datasets of different sizes. You may (or may not) use this Python script to generate test data. Use a high enough word length to minimize the number of duplicates. A few duplicates are okay (we are counting occurrences after all), but performance depends on the number of distinct words.

Unbalanced Input

A binary search tree gives pretty decent average case performance, O(n log m) where n is the number of words in the text and m is the number of unique words in the text.

Unfortunately, the BST has many pathological cases, including but not limited to sorted and revese-sorted input. Demonstrate this performance breakdown empirically using the linux time command4 for input sizes from 10,000 words to 100,000 words, with the same input datasets both randomly-ordered and sorted5. Using gnuplot or some similar utility6, generate a graph with two curves for random and sorted data. About 10 datapoints each should be plenty to demonstrate the shape of the curve.

Hand-in checklist

Do not submit: