Suppose we'd like to find the number of occurrences of each word in some text. People ask this sort of question all the time (d'oh!).
Assume that the size of the text (number of words) is small enough that we do not have to resort to elaborate or exotic solutions. Furthermore, to keep things simple, let us assume that the input has already been preprocessed into a single word per line. The program should output the words in alphabetical order, preceded by their counts. For example, if you take the first few paragraphs of an early draft of this assignment, your input file would look like this:
we
wish
to
find
frequency
of
each
word
in
a
text
naturally
the
text
is
too
large
to
read
into
main
memory
so
sorting
and
counting
is
an
intractable
approach
fortunately
the
number
of
distinct
words
is
small
enough
that
we
don
t
have
to
resort
to
elaborate
solutions
since
we
want
to
emit
the
word
list
in
alphabetical
order
a
binary
search
tree
is
just
the
ticket
to
keep
things
simple
you
may
assume
the
input
file
has
been
preprocessed
so
there
is
one
word
per
line
with
no
extraneous
punctuation
write
a
program
that
reads
the
word
list
from
standard
input
and
outputs
the
words
in
alphabetical
order
preceeded
by
frequency
count
Your output should look like this:
3 a
2 alphabetical
1 an
2 and
1 approach
1 assume
1 been
1 binary
1 by
1 count
1 counting
1 distinct
1 don
1 each
1 elaborate
1 emit
1 enough
1 extraneous
1 file
1 find
1 fortunately
2 frequency
1 from
1 has
1 have
3 in
2 input
1 into
1 intractable
5 is
1 just
1 keep
1 large
1 line
2 list
1 main
1 may
1 memory
1 naturally
1 no
1 number
2 of
1 one
2 order
1 outputs
1 per
1 preceeded
1 preprocessed
1 program
1 punctuation
1 read
1 reads
1 resort
1 search
1 simple
1 since
1 small
2 so
1 solutions
1 sorting
1 standard
1 t
2 text
2 that
7 the
1 there
1 things
1 ticket
6 to
1 too
1 tree
1 want
3 we
1 wish
1 with
4 word
2 words
1 write
1 you
One way to solve this problem would be to build a list or vector of words, sort it, and then count adjacent occurrences. A pair of Unix utility programs, sort and uniq (with the -c flag), used in combination already does this for you, so you (and your instructor!) have an easy way to verify the correctness of your solution, using the diff utility.
Modern programming languages have a dictionary data type that maps a key to a value, such as word to count, name to telephone number, or host name to IP address. C++ provides std::map in the standard (template) library. Using the map, the word-count problem can be solved in a mere few lines of C++.
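For illustration only, a sketch of that std::map solution might look like the following; the variable names are mine, not prescribed by the assignment. It reads one word at a time from standard input and prints each count followed by the word, in alphabetical order.

    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        std::map<std::string, int> counts;
        std::string word;
        while (std::cin >> word)              // one whitespace-delimited word at a time
            ++counts[word];                   // operator[] default-initializes a new count to 0
        for (const auto& entry : counts)      // std::map iterates in ascending key order
            std::cout << entry.second << " " << entry.first << "\n";
        return 0;
    }

The count-then-word output format matches the expected output shown above.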
A dictionary may be implemented using any of several different data structures, each with different performance characteristics. For std::map, the C++ language standard specifies that insert, lookup, and delete operations must be performed in logarithmic time.
To meet the performance requirement, std::map is typically implemented using a red-black tree. As discussed in
class, a red-black tree is simply a binary search tree (BST) that
keeps additional information to ensure that the tree is maintained
in an approximately balanced state. Without the balancing, a BST
performs well on average but has undesirable worst-case performance.
Furthermore, the set of pathological cases includes sorted and
reverse-sorted input, which is not uncommon in some applications.
To solve the word-count problem, implement a solution using a BST.
Since one of the strengths of C++ is that user-defined types may be written as if they were native types of the language, we will use idioms of the standard template library (a minimal interface sketch follows this list):
- operator[]() should return a reference to the value for a given key.
- begin() and end() should return iterator objects referencing the first and one past the last elements of the tree, respectively.
- Dereferencing an iterator should yield a std::pair containing the key and value.
The iterator will have to maintain a stack of its traversal through the tree. You may use std::stack or std::vector.
Note that the full functionality of a standard library data type is not required for the purposes of this exercise. It is designed just to give you the basic flavour.
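As a point of reference, here is one possible shape for that interface. The class and member names (BSTree, Node, the iterator's private stack) are illustrative assumptions, not part of the assignment specification; your declarations may differ as long as the idioms above are honoured. Bodies are omitted.

    #include <stack>
    #include <string>
    #include <utility>

    // A rough interface sketch only.
    class BSTree {
        struct Node {
            std::pair<std::string, int> data;   // the word and its count
            Node* left = nullptr;
            Node* right = nullptr;
        };
        Node* root = nullptr;

    public:
        class iterator {
            std::stack<Node*> path;             // nodes not yet visited by the in-order walk
        public:
            std::pair<std::string, int>& operator*();   // key/value of the current node
            iterator& operator++();                      // advance to the next key in order
            bool operator!=(const iterator& other) const;
            friend class BSTree;
        };

        int& operator[](const std::string& key);  // insert the key if absent, return its count
        iterator begin();                          // positioned at the smallest key
        iterator end();                            // one past the largest key
    };

With such an interface, the word-count driver from the std::map sketch above works essentially unchanged, apart from substituting the tree type for std::map.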
The binary search tree gives, for the average case, decent O(log N) performance for insert, lookup, and delete. Problems occur with a BST when the input order leaves the tree unbalanced: the tree degenerates into, essentially, a linked list with O(N) performance. Whether this matters, of course, depends on how large N needs to be.
Run your BST for input sizes ranging from 10,000 to 100,000 words
and collect the run time for randomly-ordered data and for
the same input pre-sorted. Use the Linux time command to collect elapsed time.
This is a fairly crude experiment. Ten or so datapoints will give
us a useful ballpark.
While we're benchmarking, perform the analysis with the compiler optimization flag set to -O0 (the default) and repeat with maximum optimizations, -O3.
Use gnuplot, matplotlib, or some similar utility to generate graphs of the performance data.
You may hand-draw the graph if you prefer, but submit a scanned
image, not a hardcopy.
You may use the following Python script to generate random input. Use a high enough word length to minimize the number of duplicates. Some duplicates are okay (we are counting occurrences after all), but performance depends on the number of distinct words.
AVL and Red-Black trees keep a BST approximately balanced by maintaining extra balance information in each node. On insert and delete, the algorithm rebalances the tree by rotating nodes. Both algorithms guarantee O(log N) worst-case performance for insert, lookup, and delete.
There is a probabilistic algorithm which also maintains balance using tree rotations. It is slightly simpler to implement, but does not guarantee O(log N) performance. It does, however, make catastrophic breakdown on sorted input far less likely, so it performs well on average.
A priority queue is an abstract data structure in which every element has a "priority" value. Elements are inserted in arbitrary order, but only the current highest-priority element can be extracted (and then the next-highest becomes highest). Insert and extract operations may be interleaved. Depending on the application and implementation, the priority of an element may be modified dynamically. We will examine priority queues in more detail later in the quarter.
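To make the insert/extract behaviour concrete, here is a small illustration using the standard library's std::priority_queue; that container is not itself required (or even useful) for this assignment, and the values shown are arbitrary.

    #include <iostream>
    #include <queue>

    int main() {
        std::priority_queue<int> pq;        // larger value = higher priority
        pq.push(3);
        pq.push(7);
        pq.push(1);                         // insertions may arrive in any order
        pq.push(5);
        while (!pq.empty()) {
            std::cout << pq.top() << "\n";  // always the current highest priority
            pq.pop();                       // then the next-highest becomes highest
        }
        return 0;                           // prints 7 5 3 1
    }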
The heap1 is a concrete implementation of the priority queue. A binary tree has the heap property if every node has higher priority than either of its children. It follows that the highest-priority element must be in the root node and is found in constant time. Removing the root node is O(log N) time because of additional processing that must be done to restore the heap invariant.
The treap combines a binary search tree with a binary heap, where the data has both a key and a priority value. Using a randomly assigned value for the priority tends to keep the binary search tree roughly in balance.
On insertion, each node is assigned a randomly-generated priority value. The new node is added at the appropriate leaf position per the BST algorithm. Then the priority of the new leaf is compared with that of its parent. If the leaf has higher priority, the leaf and its parent are rotated. This is repeated until the new node has been sifted upward to the appropriate position for its priority.
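Sketched below is one way that insertion might look. It assumes a Node that carries the word, its count, a priority, and child pointers, with larger priority values sifting toward the root; the type and function names are illustrative, not prescribed.

    #include <string>

    struct Node {
        std::string key;
        int count;
        int priority;
        Node* left;
        Node* right;
    };

    // Rotate the right child up into node's place; returns the new subtree root.
    Node* rotateLeft(Node* node) {
        Node* r = node->right;
        node->right = r->left;
        r->left = node;
        return r;
    }

    // Rotate the left child up into node's place; returns the new subtree root.
    Node* rotateRight(Node* node) {
        Node* l = node->left;
        node->left = l->right;
        l->right = node;
        return l;
    }

    // Recursive BST insert followed by rotations that restore the heap property.
    // Returns the (possibly new) root of this subtree.
    Node* insert(Node* node, const std::string& key, int priority) {
        if (!node)
            return new Node{key, 1, priority, nullptr, nullptr};  // new leaf holds the word once
        if (key == node->key) {
            ++node->count;                            // duplicate word: just bump the count
        } else if (key < node->key) {
            node->left = insert(node->left, key, priority);
            if (node->left->priority > node->priority)
                node = rotateRight(node);             // sift the higher-priority child upward
        } else {
            node->right = insert(node->right, key, priority);
            if (node->right->priority > node->priority)
                node = rotateLeft(node);
        }
        return node;
    }

Because each recursive call returns the subtree root, a rotation near the leaf propagates naturally back up to the caller.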
Implement a treap to do word-count and repeat the benchmarking of part II for sorted input (use -O3 only).
You may use the random number generator in the C++ standard library to generate priority values.
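For example, the standard <random> facilities can supply priorities; the generator and range below are my own choices, and any reasonable alternative is fine.

    #include <random>

    std::mt19937 gen(std::random_device{}());              // seeded Mersenne Twister engine
    std::uniform_int_distribution<int> dist(0, 1 << 30);   // wide range keeps ties unlikely

    int newPriority() { return dist(gen); }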
For validation, include a function to verify that the tree has the heap property. Provide a dump routine that shows the structure of the tree. The routine should write to standard error. You may combine the two operations into a single function.
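A combined version might look like the sketch below. It assumes the Node layout from the treap sketch above; the indentation scheme and return convention are illustrative.

    #include <iostream>
    #include <string>

    struct Node {                          // same layout as in the treap sketch
        std::string key;
        int count = 0;
        int priority = 0;
        Node* left = nullptr;
        Node* right = nullptr;
    };

    // Indented pre-order dump to standard error; returns false if any node
    // violates the (max-)heap property on priorities.
    bool dumpAndVerify(const Node* node, int depth = 0) {
        if (!node) return true;
        std::cerr << std::string(depth * 2, ' ') << node->key
                  << " (count=" << node->count
                  << ", priority=" << node->priority << ")\n";
        bool ok = true;
        if (node->left  && node->left->priority  > node->priority) ok = false;
        if (node->right && node->right->priority > node->priority) ok = false;
        ok = dumpAndVerify(node->left,  depth + 1) && ok;
        ok = dumpAndVerify(node->right, depth + 1) && ok;
        return ok;
    }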
Any problem that has a recursive solution may also be solved with an iterative algorithm, using an auxiliary stack if necessary. That is essentially how the compiler implements recursion. Recursive solutions are usually more succinct and elegant but may not be as efficient as the non-recursive solution because of the way the compiler implements stack frames.
Some languages and/or systems limit the total stack size or depth of recursion. For example, Python has an adjustable system parameter to limit the maximum number of recursive calls with a default value of 1000.
On the other hand, recursion-friendly languages, particularly functional-programming languages, implement the tail-call optimization which automagically turns an important special case known as tail recursion into a loop. This is an implementation detail that you do need to know about when performance matters. There are techniques to transform some non-tail-recursive programs into tail-recursive form using additional "accumulator" parameters. An in-depth study of functional-style programming is beyond the scope of CSS 343.
A parent pointer in a BST would eliminate the need for a stack during non-recursive traversals, but would require O(N) extra memory for an additional pointer in every node. A stack-based or recursive solution only requires O(log N) auxiliary memory (for a balanced tree). For pedagogical reasons, implement the insert operation using recursion and the iterator using a stack. Do not add a parent pointer to the tree. This is a hard requirement2. The iterator will have to explicitly manage a stack because it must maintain state across function calls.
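The sketch below shows the usual stack-based in-order walk, written as a standalone class with plain method names for readability; in your tree these would become operator*(), operator++(), and the end() comparison. The names are illustrative, and the Node layout is assumed to match the earlier sketches.

    #include <stack>
    #include <string>

    struct Node {                          // same layout as in the earlier sketches
        std::string key;
        int count = 0;
        Node* left = nullptr;
        Node* right = nullptr;
    };

    // Stack-based in-order iteration: the stack holds the nodes whose keys have
    // not yet been visited, with the smallest remaining key on top.
    class InorderIterator {
        std::stack<Node*> path;

        void pushLeftSpine(Node* node) {   // descend left, stacking each node on the way
            while (node) {
                path.push(node);
                node = node->left;
            }
        }

    public:
        explicit InorderIterator(Node* root) { pushLeftSpine(root); }  // begin(): smallest key on top

        bool done() const { return path.empty(); }     // plays the role of the end() comparison

        Node* current() const { return path.top(); }   // node the iterator currently refers to

        void advance() {                   // operator++: move to the in-order successor
            Node* node = path.top();
            path.pop();
            pushLeftSpine(node->right);    // successor is the leftmost node of the right subtree
        }
    };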
When implementing a data structure, always write a programmer-friendly dump routine. It will simplify your debugging and make your life far easier.
Consider creating two separate classes for tree and node. All
external operations will go through the tree or the iterator.
Internal tree operations may be delegated to the node class.
For example, tree.insert could be responsible for creating a new root node if the tree is empty, or call root.insert if the tree is not empty. If rebalancing changes the root, node.insert should return the new root node and let the tree's insert method update itself.
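One possible shape for that split is sketched below, with illustrative names; the recursive insert body would be the treap insert shown earlier, written as a member function of the node.

    #include <string>

    struct Node {
        std::string key;
        int count;
        int priority;
        Node* left;
        Node* right;

        Node* insert(const std::string& k, int p);  // returns the (possibly new) subtree root
    };

    class Tree {
        Node* root = nullptr;

    public:
        void insert(const std::string& key, int priority) {
            if (!root)
                root = new Node{key, 1, priority, nullptr, nullptr};  // tree handles the empty case
            else
                root = root->insert(key, priority);  // node does the work; a rotation may
        }                                            // hand back a different root
    };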
Since the treap is a BST plus other stuff, consider implementing part I first, then modifying it for part III using a flag, conditional compilation, or subclassing to control whether the balancing action is performed. That way, any changes (bug fixes) to the tree are reflected in both versions, without having to propagate changes to two separate codebases.
Your application should read words from standard input and emit results to standard output. This is a hard requirement, so expect to lose many points if you fail to comply. You may implement your program such that it reads from a file, but it must default to reading from standard input.
Use appropriate shell incantations (<, >, |) to redirect standard I/O; you cannot do reliable, repeatable tests just by banging on the keyboard and reading the output on the screen.
You may include arbitrary debug messages to standard error (they can
be easily filtered out using shell magic).
Do not add prompt strings, pauses, or other "features" that require human interaction. Such "features" make it harder to combine your program with other programs and harder to automate your testing.
Your program may process command-line arguments (argv), but must default to standard I/O.
You won't get extra credit for this, but you won't be penalized
either.
Putting the count first, followed by the word, makes your output diff-compatible with the output from uniq -c, making it easier to validate your solution.
You may freely share your scripts with each other, but remember to give due credit (whether to your peers or to stack overflow). Not giving credit is a very bad thing. Do not copy your C++ code.
Package the following into a compressed tarball (.tgz) or zip archive:
- README containing all relevant discussion/notes (you may use Markdown but it is not required)
- BUILD, which compiles your program3
- RUN, which executes the program (unless it's trivial).
Do not submit object (.o) files4.
BUILD may call make if you want, but that might be overkill. Provide a CLEAN script or a clean target in your makefile.