Disks are block-oriented devices. Instead of reading or writing a character at a time, they read and write entire blocks. A typical block size is 512 bytes, but could be as large as 4096 bytes. Under the hood, a character-by-character write request is implemented in the driver by reading the existing block into memory, updating the changed bytes, and writing the block back out to disk. This is expensive because disk operations are orders of magnitude slower than in-memory operations.
Operating systems designers have developed clever techniques for hiding the cost through caching and asynchronous operations, but this is CSS343, not CSS430.
The BTree is a balanced search tree structure designed for block-oriented devices. It can be viewed as a generalization of the 2-3 tree. Essentially, you place as many keys as you can into each block to maximize fan-out. The asymptotic performance for insert/lookup remains O(log N) disk accesses, but the constant factor is reduced.
For example, the instructor's solution to this assigment applied to a particular text is a tree containing 7911 entries (distinct words) using 1816 blocks. There is an average of 4.36 entries per 128-byte block, with a maximum of 8 entries in a node. There is an average of 44.45 bytes of unused space per node (91% utilization). The tree height is 5. To get to any of the entries requires a maximum of 5 reads (you'd keep the root node in memory since it's accessed every time). A balanced binary tree could require a maximum of 13 to 26 disk accesses, depending on the balancing algorithm.
Increasing the block size to 512 bytes reduced the height to two.
For the purposes of this assignment, you will be simulating
disk operations by building block structures in memory, using
std::vector as described below.
The problem is to count the frequency (number of occurrences) of each word in a text. The words should be printed in alphabetical order.
You will implement two functions (methods)
add_word()
and
print_words()
as decribed below.
Input should be read from standard input
(std::cin).
Output should go to standard output
(std::cout).
Diagnostic and debugging messages should go to standard error
(std::cerr). Remember that
cin
and
cout
are
not
the keyboard and screen; the keyboard and screen are
simply the defaults which you will need to override (if you're
using Visual Studio, set
Project > Properties > Debugging > Command).
For simplicity, we'll define a word as whatever the
istream operator>>
returns to a string argument.
Since words are variable-length, the output will look better if you print the count (with a fixed field width) followed by the word:
2 reading
1 size
5 the
1 they
1 time
1 to
1 typical
1 updating
2 write
2 writing
In fact, this is the output format of the Linux
uniq -c
command as in the following shell command chain:
sed -r -e 's/[ \t]+/\n/g' | sed -r -e '/^[ \t]*$/d' | sort | uniq -c
Your output should be
diff-compatible with
uniq -c
since that's how the baseline will be produced for grading.
You should consider using the script to generate your own baseline
data for testing your program.
Create a class
WordCounter
like this:
class WordCounter {
public:
WordCounter();
// Add word to the table, or increment its count if already
// present.
//
// Returns 0 on success; -1 if word cannot be added.
int add_word(const std::string& word);
// Print the words in sorted order preceeded by count (should be
// compatible with output from the "uniq -c" shell command.
void print_words(std::ostream& out=std::cout);
// Print out the entire structure in a format suitable for debugging.
void dump(std::ostream& out=std::cerr);
private:
// Block number of root of private BTree implementation.
uint32_t root_block_;
};
Implement the solution using a BTree structure where each node is
contained in a single memory block of size
BLOCK_SIZE
bytes.
You will have to
cast
raw memory bytes into the appropriate structures as described
below.
Since this is designed to mimic a data structure on disk, you will
not store pointers inside the
Block
structures. Instead, you will use (unsigned) 32-bit integers to
represent the block number on disk. This will be the index into
an
std::vector
that will return a
pointer
to the block in memory.
You will need to define a distinguished value to represent an
invalid block number.
WordCounter
provides the top-level public interface
into the BTree.
WordCounter::add_word()
simply calls
Block::add_word()
and creates a new root block
(updating the
root_block_
field) if the insertion overflows.
Create a type
Block
containing an array of
BLOCK_SIZE
bytes. This will be the raw representation of a BTree node.
Define
BLOCK_SIZE
to be 128 to make sure we hit notrivial depth.
The
Block
structure is laid out in 4 sections:
block_number;
32 bits for the number of this block
item_count;
16 bits for the number of entries stored in this block
string_table;
16 bits for the offset to the end of the free space (one
byte past the free space)
left_child:
32 bits for the block number of the leftmost child.
item_count
entries of type
Item:
offset
16-bit distance from the start of this block to the string
datum
count:
16 bits representing the number of occurrences of the word
right_child:
32 bits representing the block number of the right child
Note that the block does not contain a field for the parent. The algorithm is naturally recursive and it's important to be comfortable with recursion.
Here is a partial class declaration. You may want to add methods
and/or supplementary types, but cannot add data fields. Do not
use any virtual functions as they add an extra data field for the
vptr
(if you don't know what a
vptr
is,
look it up).
class Block {
public:
int add_word(string word /* plus additional arguments */);
void print_words();
// additional non-virtual and static methods.
private:
// Additional non-virtual and static methods.
char data[BLOCK_SIZE];
}
You will use cast operators to access the implicit fields of the
Block. Write (and test!) small functions to perform
the appropriate manipulations.
These functions may be methods of the
Block
class, or may be stand-alone (in which case, they would
require an additional parameter for the block pointer.
The program may terminate on bad input if any word is greater than
BLOCK_SIZE/4
bytes (we want to ensure any block can hold at least 3 keys plus
overhead).
If a word is not already present in the tree, insert it into the appropriate leaf node with count 1.
If a new word can be inserted into a block, allocate memory for
the string at the top of the block's free space and decrement the
string_table
field accordingly.
All entries in the table lexicographically greater than the insert
word must be shifted upward. Thus, we consume free space from
both ends.
If the word cannot be inserted, the block must be split into 2 blocks with the "middle" entry passed back to the caller to be inserted into the parent node.
add_word()
should return
Example: after inserting the text "recursion is when a function calls itself" in to a block of size 256
Maintain a std::vector
holding
Block
pointers to map between the block number and the corresponding
object in memory. You may make the vector a static member of the
Block
class.
Submit your source code and test data along with a Bash script
named
Run
which will compile your programs and run it with your test
data.
Optionally submit a text file named
README
with any additional notes.
The
Run
script is a hard requirement. If you do not submit it, you will
lose marks, but it need not be your own work. If it is not, you
must identify the author and copyright information in a comment at
the top of the script. Do not claim authorship/ownership of
else's work (there's a word for that in academia).
The simplest script simply compiles the program and runs it with your test data. A more sophisticated script will verify that the compilation is successful and compare the output of the test runs with a file containing expected output to validate the program.
The program will be run against your instructor's data, but it is a hard requirement that you provide your own test data. Think about how you're going to test this beast.
Since this is a large assignment, you may work individually or in
groups of two. If you are working in a group, each individual
should hand in their own copy with his or her partner identified
in the
README
file.
Think of these stand-alone programs as the equivalent of artistic studies (http://en.wikipedia.org/wiki/Study_%28art%29).
In art, a study is a drawing, sketch or painting done in preparation for a finished piece, or as visual notes.[1] Studies are often used to understand the problems involved in rendering subjects and to plan the elements to be used in finished works, such as light, color, form, perspective and composition.
It is unlikely that you will successfully implement the wordcount program as required without building and testing these smaller components. It may seem like extra work, but the investment will pay off.
100% of the grade is based on a working
word_count
program, but partial marks will be awarded for working
milestones. That is, if
word_count
works correctly, the milestone programs will not be examined.
<cstring>.
new
to allocate the nodes.
Everything else is off limits.
Do not use non-const C++ references for function parameters. We want to make pointer usage explicit. If you don't know what references are, don't worry about it yet.
Since this is a hefty assignment, trace messages will essential to
getting correct results. Typically, you would use a logging
library that would allow selective control of the messages. For
the purposes of this assignment, it is sufficient to write to
standard error
(std::cerr).
This is optional, but you're crazy if you don't do it.
If you do write log messages, leave them in your code. You put them in for a reason and if there are ever any changes to the program, you'll be glad to have them. You do not want the frustration of thinking your program is working, removing the log messages to "clean up", and then discovering there's one more bug and you need to put the messages back in.
Think of this as scaffolding (http://en.wikipedia.org/wiki/Scaffolding).
Scaffolding is a temporary structure used to support people and material in the construction or repair of buildings and other large structures.Unlike physical scaffolding, there can be minimal cost to retaining software scafolding. Usually, there is great benefit, since working software tends to undergo continuing modification.
You can get large samples of text from Project Gutenberg (http://www.gutenberg.org). Consider using a shell script to remove punctuation and long words from the text.
You may also write a shell script to generate random words of various lengths.