Suppose we'd like to find the number of occurrences of each word in some text. People ask this sort of question all the time (d'oh!).
Assume that the size of the text (number of words) is small enough that we do not have to resort to elaborate or exotic solutions. Furthermore, to keep things simple, let us assume that the input has already been preprocessed into a single word per line. The program should output the words in alphabetical order, preceded by their counts. For example, if you take the first few paragraphs of an early draft of this assignment, your input file would look like this:
we
wish
to
find
frequency
of
each
word
in
a
text
naturally
the
text
is
too
large
to
read
into
main
memory
so
sorting
and
counting
is
an
intractable
approach
fortunately
the
number
of
distinct
words
is
small
enough
that
we
don
t
have
to
resort
to
elaborate
solutions
since
we
want
to
emit
the
word
list
in
alphabetical
order
a
binary
search
tree
is
just
the
ticket
to
keep
things
simple
you
may
assume
the
input
file
has
been
preprocessed
so
there
is
one
word
per
line
with
no
extraneous
punctuation
write
a
program
that
reads
the
word
list
from
standard
input
and
outputs
the
words
in
alphabetical
order
preceeded
by
frequency
count
Your output should look like this:
3 a
2 alphabetical
1 an
2 and
1 approach
1 assume
1 been
1 binary
1 by
1 count
1 counting
1 distinct
1 don
1 each
1 elaborate
1 emit
1 enough
1 extraneous
1 file
1 find
1 fortunately
2 frequency
1 from
1 has
1 have
3 in
2 input
1 into
1 intractable
5 is
1 just
1 keep
1 large
1 line
2 list
1 main
1 may
1 memory
1 naturally
1 no
1 number
2 of
1 one
2 order
1 outputs
1 per
1 preceeded
1 preprocessed
1 program
1 punctuation
1 read
1 reads
1 resort
1 search
1 simple
1 since
1 small
2 so
1 solutions
1 sorting
1 standard
1 t
2 text
2 that
7 the
1 there
1 things
1 ticket
6 to
1 too
1 tree
1 want
3 we
1 wish
1 with
4 word
2 words
1 write
1 you
One way to solve this problem would be to build a list or vector of words, sort it, and then count adjacent occurrences. A pair of Unix utility programs, sort and uniq (with the -c flag), used in combination already does this for you, so you (and your instructor!) have an easy way to verify the correctness of your solution, using the diff utility.
Modern programming languages have a dictionary data type that maps a key to a value, such as word to count, name to telephone number, or host name to IP address. C++ provides std::map in the standard (template) library. Using the map, the word-count problem can be solved in a mere few lines of C++.
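For illustration only, a sketch of that std::map solution might look like the following; the variable names are mine, not prescribed by the assignment. It reads one word at a time from standard input and prints each count followed by the word, in alphabetical order.

    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        std::map<std::string, int> counts;
        std::string word;
        while (std::cin >> word)              // one whitespace-delimited word at a time
            ++counts[word];                   // operator[] default-initializes a new count to 0
        for (const auto& entry : counts)      // std::map iterates in ascending key order
            std::cout << entry.second << " " << entry.first << "\n";
        return 0;
    }

The count-then-word output format matches the expected output shown above.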
A dictionary may be implemented using any of several different data structures, each with different performance characteristics. For std::map, the C++ language standard specifies that insert, lookup, and delete operations must be performed in logarithmic time.
To meet the performance requirement, std::map is typically implemented using a red-black tree. As discussed in
class, a red-black tree is simply a binary search tree (BST) that
keeps additional information to ensure that the tree is maintained
in an approximately balanced state. Without the balancing, a BST
performs well on average but has undesirable worst-case performance.
Furthermore, the set of pathological cases includes sorted and
reverse-sorted input, which is not uncommon in some applications.
To solve the word-count problem, implement a solution using a BST.
Since one of the strengths of C++ is that user-defined types may be written as if they were native types of the language, we will use idioms of the standard template library (a minimal interface sketch follows this list):
- operator[]() should return a reference to the value for a given key.
- begin() and end() should return iterator objects referencing the first and one past the last elements of the tree, respectively.
- Dereferencing an iterator should yield a std::pair containing the key and value.
The iterator will have to maintain a stack of its traversal through the tree. You may use std::stack or std::vector.
Note that the full functionality of a standard library data type is not required for the purposes of this exercise. It is designed just to give you the basic flavour.
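As a point of reference, here is one possible shape for that interface. The class and member names (BSTree, Node, the iterator's private stack) are illustrative assumptions, not part of the assignment specification; your declarations may differ as long as the idioms above are honoured. Bodies are omitted.

    #include <stack>
    #include <string>
    #include <utility>

    // A rough interface sketch only.
    class BSTree {
        struct Node {
            std::pair<std::string, int> data;   // the word and its count
            Node* left = nullptr;
            Node* right = nullptr;
        };
        Node* root = nullptr;

    public:
        class iterator {
            std::stack<Node*> path;             // nodes not yet visited by the in-order walk
        public:
            std::pair<std::string, int>& operator*();   // key/value of the current node
            iterator& operator++();                      // advance to the next key in order
            bool operator!=(const iterator& other) const;
            friend class BSTree;
        };

        int& operator[](const std::string& key);  // insert the key if absent, return its count
        iterator begin();                          // positioned at the smallest key
        iterator end();                            // one past the largest key
    };

With such an interface, the word-count driver from the std::map sketch above works essentially unchanged, apart from substituting the tree type for std::map.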
The binary search tree gives, for the average case, decent O(log N) performance for insert, lookup, and delete. Problems occur with a BST when the input order leaves the tree unbalanced: the tree degenerates into, essentially, a linked list with O(N) performance. Whether this matters, of course, depends on how large N needs to be.
Run your BST for input sizes ranging from 10,000 to 100,000 words
and collect the run time for randomly-ordered data and for
the same input pre-sorted. Use the Linux time command to collect elapsed time.
This is a fairly crude experiment. Ten or so datapoints will give
us a useful ballpark.
While we're benchmarking, perform the analysis with the compiler optimization flag set to -O0 (the default) and repeat with maximum optimizations, -O3.
Use gnuplot, matplotlib, or some similar utility to generate graphs of the performance data.
You may hand-draw the graph if you prefer, but submit a scanned
image, not a hardcopy.
You may use the following Python script to generate random input. Use a high enough word length to minimize the number of duplicates. Some duplicates are okay (we are counting occurrences after all), but performance depends on the number of distinct words.
AVL and Red-Black trees keep a BST approximately balanced by maintaining extra balance information in each node. On insert and delete, the algorithm rebalances the tree by rotating nodes. Both algorithms guarantee O(log N) worst-case performance for insert, lookup, and delete.
There is a probabilistic algorithm which also maintains balance using tree rotations. It is slightly simpler to implement, but does not guarantee O(log N) performance. It does, however, make catastrophic breakdown on sorted input far less likely, so it performs well on average.
A priority queue is an abstract data structure in which every element has a "priority" value. Elements are inserted in arbitrary order, but only the current highest-priority element can be extracted (and then the next-highest becomes highest). Insert and extract operations may be interleaved. Depending on the application and implementation, the priority of an element may be modified dynamically. We will examine priority queues in more detail later in the quarter.
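To make the insert/extract behaviour concrete, here is a small illustration using the standard library's std::priority_queue; that container is not itself required (or even useful) for this assignment, and the values shown are arbitrary.

    #include <iostream>
    #include <queue>

    int main() {
        std::priority_queue<int> pq;        // larger value = higher priority
        pq.push(3);
        pq.push(7);
        pq.push(1);                         // insertions may arrive in any order
        pq.push(5);
        while (!pq.empty()) {
            std::cout << pq.top() << "\n";  // always the current highest priority
            pq.pop();                       // then the next-highest becomes highest
        }
        return 0;                           // prints 7 5 3 1
    }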
The heap1 is a concrete implementation of the priority queue. A binary tree has the heap property if every node has higher priority than either of its children. It follows that the highest-priority element must be in the root node and is found in constant time. Removing the root node is O(log N) time because of additional processing that must be done to restore the heap invariant.
The treap combines a binary search tree with a binary heap, where the data has both a key and a priority value. Using a randomly assigned value for the priority tends to keep the binary search tree roughly in balance.
On insertion, each node is assigned a randomly-generated priority value. The new node is added at the appropriate leaf position per the BST algorithm. Then the priority of the new leaf is compared with that of its parent. If the leaf has higher priority, the leaf and its parent are rotated. This is repeated until the new node has been sifted upward to the appropriate position for its priority.
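Sketched below is one way that insertion might look. It assumes a Node that carries the word, its count, a priority, and child pointers, with larger priority values sifting toward the root; the type and function names are illustrative, not prescribed.

    #include <string>

    struct Node {
        std::string key;
        int count;
        int priority;
        Node* left;
        Node* right;
    };

    // Rotate the right child up into node's place; returns the new subtree root.
    Node* rotateLeft(Node* node) {
        Node* r = node->right;
        node->right = r->left;
        r->left = node;
        return r;
    }

    // Rotate the left child up into node's place; returns the new subtree root.
    Node* rotateRight(Node* node) {
        Node* l = node->left;
        node->left = l->right;
        l->right = node;
        return l;
    }

    // Recursive BST insert followed by rotations that restore the heap property.
    // Returns the (possibly new) root of this subtree.
    Node* insert(Node* node, const std::string& key, int priority) {
        if (!node)
            return new Node{key, 1, priority, nullptr, nullptr};  // new leaf holds the word once
        if (key == node->key) {
            ++node->count;                            // duplicate word: just bump the count
        } else if (key < node->key) {
            node->left = insert(node->left, key, priority);
            if (node->left->priority > node->priority)
                node = rotateRight(node);             // sift the higher-priority child upward
        } else {
            node->right = insert(node->right, key, priority);
            if (node->right->priority > node->priority)
                node = rotateLeft(node);
        }
        return node;
    }

Because each recursive call returns the subtree root, a rotation near the leaf propagates naturally back up to the caller.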
Implement a treap to do word-count and repeat the benchmarking of part II for sorted input (use -O3 only).
You may use the random number generator in the C++ standard library to generate priority values.
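For example, the standard <random> facilities can supply priorities; the generator and range below are my own choices, and any reasonable alternative is fine.

    #include <random>

    std::mt19937 gen(std::random_device{}());              // seeded Mersenne Twister engine
    std::uniform_int_distribution<int> dist(0, 1 << 30);   // wide range keeps ties unlikely

    int newPriority() { return dist(gen); }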
For validation, include a function to verify that the tree has the heap property. Provide a dump routine that shows the structure of the tree. The routine should write to standard error. You may combine the two operations into a single function.
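A combined version might look like the sketch below. It assumes the Node layout from the treap sketch above; the indentation scheme and return convention are illustrative.

    #include <iostream>
    #include <string>

    struct Node {                          // same layout as in the treap sketch
        std::string key;
        int count = 0;
        int priority = 0;
        Node* left = nullptr;
        Node* right = nullptr;
    };

    // Indented pre-order dump to standard error; returns false if any node
    // violates the (max-)heap property on priorities.
    bool dumpAndVerify(const Node* node, int depth = 0) {
        if (!node) return true;
        std::cerr << std::string(depth * 2, ' ') << node->key
                  << " (count=" << node->count
                  << ", priority=" << node->priority << ")\n";
        bool ok = true;
        if (node->left  && node->left->priority  > node->priority) ok = false;
        if (node->right && node->right->priority > node->priority) ok = false;
        ok = dumpAndVerify(node->left,  depth + 1) && ok;
        ok = dumpAndVerify(node->right, depth + 1) && ok;
        return ok;
    }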
Any problem that has a recursive solution may also be solved with an iterative algorithm, using an auxiliary stack if necessary. That is essentially how the compiler implements recursion. Recursive solutions are usually more succinct and elegant but may not be as efficient as the non-recursive solution because of the way the compiler implements stack frames.
Some languages and/or systems limit the total stack size or depth of recursion. For example, Python has an adjustable system parameter to limit the maximum number of recursive calls with a default value of 1000.
On the other hand, recursion-friendly languages, particularly functional-programming languages, implement the tail-call optimization which automagically turns an important special case known as tail recursion into a loop. This is an implementation detail that you do need to know about when performance matters. There are techniques to transform some non-tail-recursive programs into tail-recursive form using additional "accumulator" parameters. An in-depth study of functional-style programming is beyond the scope of CSS 343.
A parent pointer in a BST would eliminate the need for a stack during non-recursive traversals, but would require O(N) extra memory for an additional pointer in every node. A stack-based or recursive solution only requires O(log N) auxiliary memory (for a balanced tree). For pedagogical reasons, implement the insert operation using recursion and the iterator using a stack. Do not add a parent pointer to the tree. This is a hard requirement2. The iterator will have to explicitly manage a stack because it must maintain state across function calls.
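The sketch below shows the usual stack-based in-order walk, written as a standalone class with plain method names for readability; in your tree these would become operator*(), operator++(), and the end() comparison. The names are illustrative, and the Node layout is assumed to match the earlier sketches.

    #include <stack>
    #include <string>

    struct Node {                          // same layout as in the earlier sketches
        std::string key;
        int count = 0;
        Node* left = nullptr;
        Node* right = nullptr;
    };

    // Stack-based in-order iteration: the stack holds the nodes whose keys have
    // not yet been visited, with the smallest remaining key on top.
    class InorderIterator {
        std::stack<Node*> path;

        void pushLeftSpine(Node* node) {   // descend left, stacking each node on the way
            while (node) {
                path.push(node);
                node = node->left;
            }
        }

    public:
        explicit InorderIterator(Node* root) { pushLeftSpine(root); }  // begin(): smallest key on top

        bool done() const { return path.empty(); }     // plays the role of the end() comparison

        Node* current() const { return path.top(); }   // node the iterator currently refers to

        void advance() {                   // operator++: move to the in-order successor
            Node* node = path.top();
            path.pop();
            pushLeftSpine(node->right);    // successor is the leftmost node of the right subtree
        }
    };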
When implementing a data structure, always write a programmer-friendly dump routine. It will simplify your debugging and make your life far easier.
Consider creating two separate classes for tree and node. All
external operations will go through the tree or the iterator.
Internal tree operations may be delegated to the node class.
For example, tree.insert could be responsible for creating a new root node if the tree is empty, or call root.insert if the tree is not empty. If rebalancing changes the root, node.insert should return the new root node and let the tree's insert method update itself.
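One possible shape for that split is sketched below, with illustrative names; the recursive insert body would be the treap insert shown earlier, written as a member function of the node.

    #include <string>

    struct Node {
        std::string key;
        int count;
        int priority;
        Node* left;
        Node* right;

        Node* insert(const std::string& k, int p);  // returns the (possibly new) subtree root
    };

    class Tree {
        Node* root = nullptr;

    public:
        void insert(const std::string& key, int priority) {
            if (!root)
                root = new Node{key, 1, priority, nullptr, nullptr};  // tree handles the empty case
            else
                root = root->insert(key, priority);  // node does the work; a rotation may
        }                                            // hand back a different root
    };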
Since the treap is a BST plus other stuff, consider implementing part I first, then modifying it for part III using a flag, conditional compilation, or subclassing to control whether the balancing action is performed. That way, any changes (bug fixes) to the tree are reflected in both versions, without having to propagate changes to two separate codebases.
Your application should read words from standard input and emit results to standard output. This is a hard requirement, so expect to lose many points if you fail to comply. You may implement your program such that it reads from a file, but it must default to reading from standard input.
Use appropriate shell incantations (<, >, |) to redirect standard I/O; you cannot do reliable, repeatable tests just by banging on the keyboard and reading the output on the screen.
You may include arbitrary debug messages to standard error (they can
be easily filtered out using shell magic).
Do not add prompt strings, pauses, or other "features" that require human interaction. Such "features" make it harder to combine your program with other programs and harder to automate your testing.
Your program may process command-line arguments (argv), but must default to standard I/O.
You won't get extra credit for this, but you won't be penalized
either.
Putting the count first, followed by the word, makes your output diff-compatible with the output from uniq -c, making it easier to validate your solution.
You may freely share your scripts with each other, but remember to give due credit (whether to your peers or to stack overflow). Not giving credit is a very bad thing. Do not copy your C++ code.
Package the following into a compressed tarball (.tgz) or zip archive:
- README containing all relevant discussion/notes (you may use Markdown but it is not required)
- BUILD, which compiles your program3
- RUN, which executes the program (unless it's trivial).
Do not submit object (.o) files4.
BUILD may call make if you want, but that might be overkill. Provide a CLEAN script or a clean target in your makefile.