We wish to find frequency of each word in some text. Fortunately, the size of the text (number of words) is small enough that we won't have to resort to elaborate solutions. Since we will want to output the word list in alphabetical order, a binary search tree seems like just the ticket.
To keep things simple you may assume the input file has been preprocessed so there is one word per line with no extraneous punctuation.
Write a program that reads the word list from standard
input
and emits to standard
output1
the words in alphabetical order
preceeded by frequency count. You
must
use the this output format to be compatible with output
from the Linux
uniq
utility: that'll give you (and your instructor!)
an easy way
to validate your program.
Do
not
add prompt strings, pauses,
etc2.
You may include arbitrary debug messages to standard
error3.
For example, if you take a snapshot of an early draft (which no longer exists!) of the preceeding paragraphs, your input file will look like this:
we
wish
to
find
frequency
of
each
word
in
a
text
naturally
the
text
is
too
large
to
read
into
main
memory
so
sorting
and
counting
is
an
intractable
approach
fortunately
the
number
of
distinct
words
is
small
enough
that
we
don
t
have
to
resort
to
elaborate
solutions
since
we
want
to
emit
the
word
list
in
alphabetical
order
a
binary
search
tree
is
just
the
ticket
to
keep
things
simple
you
may
assume
the
input
file
has
been
preprocessed
so
there
is
one
word
per
line
with
no
extraneous
punctuation
write
a
program
that
reads
the
word
list
from
standard
input
and
outputs
the
words
in
alphabetical
order
preceeded
by
frequency
count
Your output should look like this:
3 a
2 alphabetical
1 an
2 and
1 approach
1 assume
1 been
1 binary
1 by
1 count
1 counting
1 distinct
1 don
1 each
1 elaborate
1 emit
1 enough
1 extraneous
1 file
1 find
1 fortunately
2 frequency
1 from
1 has
1 have
3 in
2 input
1 into
1 intractable
5 is
1 just
1 keep
1 large
1 line
2 list
1 main
1 may
1 memory
1 naturally
1 no
1 number
2 of
1 one
2 order
1 outputs
1 per
1 preceeded
1 preprocessed
1 program
1 punctuation
1 read
1 reads
1 resort
1 search
1 simple
1 since
1 small
2 so
1 solutions
1 sorting
1 standard
1 t
2 text
2 that
7 the
1 there
1 things
1 ticket
6 to
1 too
1 tree
1 want
3 we
1 wish
1 with
4 word
2 words
1 write
1 you
Your
Tree
class should have two public methods:
add()
:
insert a word into the tree or increments its count if the word
is already present
walk()
which visits every node in the correct order and prints out the
count and word.
The class will hold a pointer to the root
Node
of the tree.
The node structure is private to the tree and should not be part of
the
API.
If you want to get fancy or find this assignment too easy, make your tree generic using C++ templates and parameterize the walk function with a function pointer or function object.
If you're really bored, implement an iterator
class for the tree with
operator++()
and
operator--()
methods.
You will have to add
begin()
and
end()
methods to the
Tree
class.
In general, any problem with a recursive solution may be solved with an iterative algorithm, using an auxiliary stack if necessary. That is, after all, exactly how the compiler implements recursion. In this case, an auxilliary stack would not be required if a parent pointer is added to the node structure.
Usually, the recursive solution is more succinct or even elegant, but may not be as efficient as the non-recursive solution due to the overhead cost of setting up the stack frames of the recursive calls. Some languages and/or systems limit the total stack size or depth of recursion. For example, Python has a system parameter to limit the number of recursive calls with a default value of 1000.
Recursion-friendly languages, particularly functional-programming languages implement the tail call optimization which turns the recursion into iteration. This is an implementation detail that you do need to know about when performance matters.
For pedagogical reasons, you will implement the walker using recursion. Do not add a parent pointer to the tree. The iterator, if you choose to implement one, is not allowed to use a parent pointer either.
A good way to test your program is to use some arbitrary text such as this one. You will have to write or otherwise acquire a shell script to preprocess the file. You can then compare the output with a baseline produced by another shell script.
Since we're interested in evaluating performance as a function of input size, you will need to generate random datasets of different sizes. You may (or may not) use this Python script to generate test data. Use a high enough word length to minimize the number of duplicates. A few duplicates are okay (we are counting occurrences after all), but performance depends on the number of distinct words.
A binary search tree gives pretty decent
average case
performance,
O(n log m)
where
n
is the number of words in the text and
m
is the number of
unique
words in the text.
Unfortunately, the BST has many pathological cases, including but
not limited to sorted
and revese-sorted input.
Demonstrate this performance breakdown
empirically using the linux
time
command4
for input sizes from 10,000 words to 100,000 words, with
the same input datasets both randomly-ordered and
sorted5.
Using
gnuplot
or some similar
utility6,
generate a graph with two curves for random and sorted data.
About 10 datapoints each should be plenty to demonstrate the shape of the curve.
README
containing all relevant discussion/notes
BUILD
that compile your
program8
RUN
that executes the program
Do not submit:
.o
files)9
sort
utility.BUILD
call make
if you want, but that would be overkill.CLEAN
script or clean
target in your makefile.