Assignment 1: BTree

Introduction

Disks are block-oriented devices. Instead of reading or writing a character at a time, they read and write entire blocks. A typical block size is 512 bytes, but could be as large as 4096 bytes. Under the hood, a character-by-character write request is implemented in the driver by reading the existing block into memory, updating the changed bytes, and writing the block back out to disk. This is expensive because disk operations are orders of magnitude slower than in-memory operations.

Operating systems designers have developed clever techniques for hiding the cost through caching and asynchronous operations, but this is CSS343, not CSS430.

The BTree is a balanced search tree structure designed for block-oriented devices. It can be viewed as a generalization of the 2-3 tree. Essentially, you place as many keys as you can into each block to maximize fan-out. The asymptotic performance for insert/lookup remains O(log N) disk accesses, but the constant factor is reduced.

For example, the instructor's solution to this assigment applied to a particular text is a tree containing 7911 entries (distinct words) using 1816 blocks. There is an average of 4.36 entries per 128-byte block, with a maximum of 8 entries in a node. There is an average of 44.45 bytes of unused space per node (91% utilization). The tree height is 5. To get to any of the entries requires a maximum of 5 reads (you'd keep the root node in memory since it's accessed every time). A balanced binary tree could require a maximum of 13 to 26 disk accesses, depending on the balancing algorithm.

Increasing the block size to 512 bytes reduced the height to two.

For the purposes of this assignment, you will be simulating disk operations by building block structures in memory, using std::vector as described below.

The Problem

The problem is to count the frequency (number of occurrences) of each word in a text. The words should be printed in alphabetical order.

You will implement two functions (methods) add_word() and print_words() as decribed below.

Input should be read from standard input (std::cin). Output should go to standard output (std::cout). Diagnostic and debugging messages should go to standard error (std::cerr). Remember that cin and cout are not the keyboard and screen; the keyboard and screen are simply the defaults which you will need to override (if you're using Visual Studio, set Project > Properties > Debugging > Command).

For simplicity, we'll define a word as whatever the istream operator>> returns to a string argument.

Since words are variable-length, the output will look better if you print the count (with a fixed field width) followed by the word:

      2 reading
      1 size
      5 the
      1 they
      1 time
      1 to
      1 typical
      1 updating
      2 write
      2 writing
      

In fact, this is the output format of the Linux uniq -c command as in the following shell command chain:

sed -r -e 's/[ \t]+/\n/g' | sed -r  -e '/^[ \t]*$/d' | sort | uniq -c
      

Your output should be diff-compatible with uniq -c since that's how the baseline will be produced for grading. You should consider using the script to generate your own baseline data for testing your program.

The Solution

Create a class WordCounter like this:

class WordCounter {
 public:
  WordCounter();

  // Add word to the table, or increment its count if already
  // present.
  //
  // Returns 0 on success; -1 if word cannot be added.
  int add_word(const std::string& word);

  // Print the words in sorted order preceeded by count (should be
  // compatible with output from the "uniq -c" shell command.
  void print_words(std::ostream& out=std::cout);

  // Print out the entire structure in a format suitable for debugging.
  void dump(std::ostream& out=std::cerr);

 private:
  // Block number of root of private BTree implementation.
  uint32_t root_block_;
};
      

Implement the solution using a BTree structure where each node is contained in a single memory block of size BLOCK_SIZE bytes. You will have to cast raw memory bytes into the appropriate structures as described below. Since this is designed to mimic a data structure on disk, you will not store pointers inside the Block structures. Instead, you will use (unsigned) 32-bit integers to represent the block number on disk. This will be the index into an std::vector that will return a pointer to the block in memory. You will need to define a distinguished value to represent an invalid block number.

WordCounter provides the top-level public interface into the BTree. WordCounter::add_word() simply calls Block::add_word() and creates a new root block (updating the root_block_ field) if the insertion overflows.

Block Structure

Create a type Block containing an array of BLOCK_SIZE bytes. This will be the raw representation of a BTree node. Define BLOCK_SIZE to be 128 to make sure we hit notrivial depth.

The Block structure is laid out in 4 sections:

  1. Header:
  2. Table: item_count entries of type Item: Table entries should be maintained in alphabetical order.
  3. Free space: bytes from the end of the table to the beginning of the string data
  4. Strings: C-style (NUL-terminated character sequences)

Note that the block does not contain a field for the parent. The algorithm is naturally recursive and it's important to be comfortable with recursion.

Here is a partial class declaration. You may want to add methods and/or supplementary types, but cannot add data fields. Do not use any virtual functions as they add an extra data field for the vptr (if you don't know what a vptr is, look it up).

class Block {
 public:
  int add_word(string word /* plus additional arguments */);
  void print_words();
  // additional non-virtual and static methods.
 private:
  // Additional non-virtual and static methods.
  char data[BLOCK_SIZE];
}
        

You will use cast operators to access the implicit fields of the Block. Write (and test!) small functions to perform the appropriate manipulations. These functions may be methods of the Block class, or may be stand-alone (in which case, they would require an additional parameter for the block pointer.

The program may terminate on bad input if any word is greater than BLOCK_SIZE/4 bytes (we want to ensure any block can hold at least 3 keys plus overhead).

Inserting a Word Into a Block

If a word is not already present in the tree, insert it into the appropriate leaf node with count 1.

If a new word can be inserted into a block, allocate memory for the string at the top of the block's free space and decrement the string_table field accordingly. All entries in the table lexicographically greater than the insert word must be shifted upward. Thus, we consume free space from both ends.

If the word cannot be inserted, the block must be split into 2 blocks with the "middle" entry passed back to the caller to be inserted into the parent node.

add_word() should return

Example: after inserting the text "recursion is when a function calls itself" in to a block of size 256

memory map

Vector of Blocks

Maintain a std::vector holding Block pointers to map between the block number and the corresponding object in memory. You may make the vector a static member of the Block class.

What to Hand In

Submit your source code and test data along with a Bash script named Run which will compile your programs and run it with your test data.

Optionally submit a text file named README with any additional notes.

Run Script

The Run script is a hard requirement. If you do not submit it, you will lose marks, but it need not be your own work. If it is not, you must identify the author and copyright information in a comment at the top of the script. Do not claim authorship/ownership of else's work (there's a word for that in academia).

The simplest script simply compiles the program and runs it with your test data. A more sophisticated script will verify that the compilation is successful and compare the output of the test runs with a file containing expected output to validate the program.

Test Data

The program will be run against your instructor's data, but it is a hard requirement that you provide your own test data. Think about how you're going to test this beast.

Additional Notes

Working in Groups

Since this is a large assignment, you may work individually or in groups of two. If you are working in a group, each individual should hand in their own copy with his or her partner identified in the README file.

Intermediate Milestones

  1. Write a working program that inserts words into a single node. The program can terminate when the node overflows.
  2. Write a working program that will split a node at a pivot point, returning the pivot string and links to its children.

Think of these stand-alone programs as the equivalent of artistic studies (http://en.wikipedia.org/wiki/Study_%28art%29).

In art, a study is a drawing, sketch or painting done in preparation for a finished piece, or as visual notes.[1] Studies are often used to understand the problems involved in rendering subjects and to plan the elements to be used in finished works, such as light, color, form, perspective and composition.

It is unlikely that you will successfully implement the wordcount program as required without building and testing these smaller components. It may seem like extra work, but the investment will pay off.

100% of the grade is based on a working word_count program, but partial marks will be awarded for working milestones. That is, if word_count works correctly, the milestone programs will not be examined.

Standard Library

Everything else is off limits.

References

Do not use non-const C++ references for function parameters. We want to make pointer usage explicit. If you don't know what references are, don't worry about it yet.

Debug/Trace/Log Messages

Since this is a hefty assignment, trace messages will essential to getting correct results. Typically, you would use a logging library that would allow selective control of the messages. For the purposes of this assignment, it is sufficient to write to standard error (std::cerr).

This is optional, but you're crazy if you don't do it.

If you do write log messages, leave them in your code. You put them in for a reason and if there are ever any changes to the program, you'll be glad to have them. You do not want the frustration of thinking your program is working, removing the log messages to "clean up", and then discovering there's one more bug and you need to put the messages back in.

Think of this as scaffolding (http://en.wikipedia.org/wiki/Scaffolding).

Scaffolding is a temporary structure used to support people and material in the construction or repair of buildings and other large structures.
Unlike physical scaffolding, there can be minimal cost to retaining software scafolding. Usually, there is great benefit, since working software tends to undergo continuing modification.

Obtaining Test Data

You can get large samples of text from Project Gutenberg (http://www.gutenberg.org). Consider using a shell script to remove punctuation and long words from the text.

You may also write a shell script to generate random words of various lengths.