CSS 343: Notes from Lecture 8 (DRAFT)

Administrivia

Preliminary Observations about Assignment 1

Filters, Job Control, and Pipelines

The guys at Bell Labs who put Unix together back in the 1970s had a number of those exceedingly brilliant ideas that are entirely obvious in hindsight. Among these ideas were:

  1. make everything look like a file, instead of treating devices (keyboards, screens, tape drives, etc.) separately (this makes the operating system (library) more complicated, but enormously simplifies the application, a very good tradeoff)
  2. define standard input, output, and error files that are automatically opened for the program by the operating system before the program starts running
  3. define a mechanism (pipes) for taking the standard output of one program and feeding it into the standard input of another program.

The shell (e.g. bash, the Bourne-Again Shell) can be viewed in a variety of ways. One way is as a kind of Job Control Language. That is, you have a general-purpose program such as sort which, quite obviously, sorts the contents of a file, and you need some mechanism to wire up the plumbing. In other words, you have to tell the sort utility where to get its input and where to write its output.

From a console running bash, to run a program, you simply type the name of the file containing the program. The program will read from and write to the same console where you just typed the program name.

Of course, you may also want to pass other information to the program. For example, you want to tell g++ the names of the files to compile, the name of the program to generate, and various other flags to control the compilation. You could use standard input for that, but that's not as convenient as adding a few flags to the command line.

g++ -g -Wall -o wordcount wordcount.cpp list.cpp allocator.cpp
        

The first observation is that you're giving the command a list of file names, so we invent pattern-matching rules called wildcards or globs.

g++ -g -Wall -o wordcount *.cpp
        

Now, our wordcount reads from standard input and writes to standard output, so we need a mechanism to tell the system we really want to redirect standard input and standard output. There is a simple syntax to do this:

wordcount < test1.txt > test1.out
        

Now, suppose we want to know the total number of unique words in a text. We know how to get the unique words of a file, so if we had a program linecount, we could use this:

wordcount < test1.txt > test1.tmp
linecount < test1.tmp > test1.out
rm test1.tmp
        

That works ok, but is a tad clunky. You need to delete the temporary file, and the first program must run to completion before the second one can start. The biggest problem is that writing a file is so much slower than processing something in core memory. So, we invent a special kind of file called a pipe that just holds a small amount of data. The first process writes to this pipe and goes to sleep when the pipe is full. The second process reads from the pipe and sleeps when the pipe is empty. The shell syntax is to separate the two commands by a vertical bar.

wordcount < test1.txt | linecount  > test1.out
        
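Under the hood, the shell builds `a | b` out of the pipe() and fork() system calls. Here is a minimal POSIX sketch of the idea; the function name through_pipe and the round-trip framing are illustrative (the real shell exec's two separate programs), but the blocking behavior is exactly as described above.

```cpp
#include <unistd.h>    // pipe, fork, read, write, close, _exit
#include <sys/wait.h>  // wait
#include <string>

// Push a message through a pipe between two processes: the child plays
// the upstream program ("wordcount"), the parent plays the downstream
// one ("linecount").
std::string through_pipe(const std::string& msg) {
    int fd[2];
    if (pipe(fd) != 0) return "";   // fd[0] = read end, fd[1] = write end
    if (fork() == 0) {              // child: writes into the pipe
        close(fd[0]);               // close the unused read end
        write(fd[1], msg.data(), msg.size());
        close(fd[1]);               // closing the write end means EOF downstream
        _exit(0);
    }
    close(fd[1]);                   // parent: reads from the pipe
    std::string out;
    char buf[256];
    ssize_t n;
    while ((n = read(fd[0], buf, sizeof buf)) > 0)  // read() sleeps until data arrives
        out.append(buf, n);
    close(fd[0]);
    wait(nullptr);                  // reap the child
    return out;
}
```

What the shell adds beyond this sketch is a dup2() of each pipe end onto file descriptor 0 or 1 before exec'ing the two programs, so each program just sees its ordinary standard input and output.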

Together these simple ideas led to a powerful concept: the filter program. A filter is a (generally small) program that reads from standard input and writes to standard output, typically without requiring user interaction. The idea is that a filter should do a small number of things but do them well. Unix/Linux filters include cat, grep, sort, uniq, head, tail, tr, sed, and wc.

Later, of course, some of these filter programs became more complicated and developed into complete scripting (programming) languages such as awk (named for Aho, Weinberger, and Kernighan, its authors) and Perl (Practical Extraction and Report Language).

Using the shell, filter programs can be strung together using pipes to feed the output from one or more programs into the input of another program. For example, a prototype spell checking program is as simple as:

(cat myfile.txt | sed -r -e 's/[^A-Za-z]+/\
/g' | sort | uniq; cat /usr/share/dict/words /usr/share/dict/words) | sort | uniq -u
        
This just says to use the stream editor to break the file into one-word-per-line units, then sort the words and run them through uniq to get a single copy of each word in myfile.txt. That output is combined with two copies of the system dictionary (why two copies?). The output is sorted and then passed to uniq -u to get a list of words that occur in the text but not in the dictionary. The semicolon separates the two commands and the parentheses combine their output.

Fairly crude, but powerful for a one-liner. It's a ground-breaking proof-of-concept when it's the first spell checker in the world.

Check the relevant manual pages for the details.

Similarly, a list of the top 10 most frequently occurring words in a text can be quickly obtained by this incantation:

cat mydata.txt | sed -r -e 's/[^A-Za-z]+/\
/g' | sed -e '/^$/d' | sort | uniq -c | sort -n  | tail -10
        

Of course, you can go crazy with these things. Top ten words counting down (Letterman-style):

cat mydata.txt | sed -r -e 's/[^A-Za-z]+/\
/g' | sed -e '/^$/d' | sort | uniq -c | sort -n -r  | head -10 | cat -n | tac
        
Try these one-liners, then try doing the same thing in your favorite windowing environment.

Accessing Command-Line Flags from Your C++ Program

Just as the shell pre-opens the standard files via the operating system, the shell also passes the command-line arguments to the program. In C/C++, the arguments show up as the arguments to main; they are set up by code in your program that runs before main runs. The main function should take two arguments, an int typically called argc (argument count) and an array of pointers to char (char *[]) typically called argv (argument vector).

int main(int argc, char *argv[])
        

argv is of size argc + 1. argv[argc] is NULL and argv[0] is the name of the program. You may, of course, name them anything you want. You may also declare the second argument as a pointer to a pointer:

int main(int argument_count, char **argument_list)
        

The argument vector is a bunch of C-style strings. There's nothing magical about them. It is a convention that arguments beginning with a minus sign are "flags" and that other arguments typically name files, but when it's your program, you can do whatever you want.

#include <iostream>
#include <fstream>
#include <cstring>   // strcmp
#include <cstdlib>   // exit
using namespace std;

int main(int argc, char *argv[]) {
  // the first argument, argv[0], is the name of the program, so just skip it
  if (argc < 2) {
    do_something_with(cin, cout);
  } else {
    ostream* out = &cout;
    int next_arg = 1;
    if (strcmp(argv[1], "-o") == 0) {
      if (argc < 3) {
        print_usage();
        exit(1);
      } else {
        next_arg += 2;
        out = new ofstream(argv[2]);
      }
    }
    if (next_arg == argc) {
      do_something_with(cin, *out);
    } else {
      for ( ; next_arg < argc; ++next_arg) {
        ifstream f(argv[next_arg]);
        do_something_with(f, *out);
      }
    }
  }
  return 0;
}
        

Of course, the logic there is a bit tortured, so let's take advantage of the fact that argv[argc] is NULL:

int main(int argc, char **argv) {
  ++argv;
  ostream* out = &cout;
  if (*argv && strcmp(*argv, "-o") == 0) {
    ++argv;
    if (*argv == NULL) {
       print_usage();
       exit(1);
    } else {
       out = new ofstream(*argv);
       ++argv;
    }
  }
  if (*argv == NULL) {
     do_something_with(cin, *out);
  } else {
    for (; *argv != NULL; ++argv) {
      ifstream in(*argv);
      do_something_with(in, *out);
    }
  }
}
        

Naturally, since programs can define complex rules for command-line arguments, there are libraries to help programs parse the options, such as getopt and getopt_long.

A shell script is nothing more than the same commands you may type interactively, but when you invoke a script, the shell puts the arguments into the shell variables $0, $1, $2, ... ($0 is the program name). The entire list is given by $*. Note that the Run script for assignment 3 will probably have the command line:

./huff $1

Exit Status

main returns an int value (conventionally, 0 for success and non-zero for failure). That value is returned to its parent process.

When the process that ran your program is the shell, it puts the exit status into the shell variable $?.
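A parent written in C++ can see the same value through std::system() and the wait-status macros. A small sketch (run() is an illustrative helper name, not a standard function):

```cpp
#include <cstdlib>     // std::system
#include <sys/wait.h>  // WIFEXITED, WEXITSTATUS

// Run a shell command and return its exit status -- the same value
// the shell would put into $?.
int run(const char* cmd) {
    int status = std::system(cmd);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, run("true") yields 0 and run("exit 3") yields 3, just as echoing $? after those commands would show.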

Shell Programming

Bash is a full-on programming language. The basic units are the programs you invoke, but it also has built-in conditional and loop constructs. The Boolean value in tests is the exit status of a program (0 is True; nonzero is False).

Assignment 2

The 2-3 tree has a number of cases, but a more organized way to look at them is in terms of four cases, each of which you can pass off to a helper function to simplify the insert method.

Node::insert(data, ...) {
  if (is_leaf)
    insert_leaf(data, ...)
  else if left
    insert_left(data, ...)
  else if middle
    insert_middle(data, ...)
  else if right
    insert_right(data, ...)
  else
    -- data matches one of the keys
}
        

In the leaf case, there are 4 subcases:

  1. the data matches one of the keys already present
  2. there is room in the node and the data will be the new second key
  3. there is room in the node and the data will be the new first key (the old first key becomes the second key)
  4. the data does not fit and the node must be split

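The four leaf subcases might be sketched like this, assuming a node that holds up to two keys. The field and method names here are illustrative only, not the assignment's required interface, and the split is left as a stub:

```cpp
struct Node {
    int key1 = 0, key2 = 0;
    int num_keys = 0;     // 1 or 2 keys currently in the node

    // Returns true if the tree changed (subcases 2-4), false on a duplicate.
    bool insert_leaf(int data) {
        if (data == key1 || (num_keys == 2 && data == key2))
            return false;                   // subcase 1: key already present
        if (num_keys == 1) {
            if (data > key1) {
                key2 = data;                // subcase 2: data is the new second key
            } else {
                key2 = key1;                // subcase 3: data is the new first key;
                key1 = data;                //   the old first key becomes the second
            }
            num_keys = 2;
            return true;
        }
        return split_and_push_up(data);     // subcase 4: node is full, must split
    }

    bool split_and_push_up(int /*data*/) {
        // Split this node and propagate the middle key to the parent (not shown).
        return true;
    }
};
```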
The interior node cases are all similar, but the wiring is slightly different. The similar subcases are:

  1. the data was inserted or found somewhere in the subtree
  2. the insert into the subtree caused a split that was propagated back up to the current node and there is room for the new key (left and middle cases only)
  3. the current node must be split. Since we know which case we're in, we know which key to propagate up.

Huffman Coding (cont.)

Suppose we have a text with the following frequencies:

Symbol:     A   B   C   D   E   F   G   H   I   J
Frequency: 50  25  20  18  10   8   6   4   2   2

Here is an animation sequence building the Huffman tree (click here to download the zip). The red nodes are items in the priority queue.
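The greedy build the animation walks through can be sketched with std::priority_queue. This is an illustrative sketch, not assignment code (it uses raw pointers and never frees the nodes):

```cpp
#include <queue>
#include <vector>
#include <utility>

struct HNode {
    int freq;
    char sym;                      // meaningful only for leaves
    HNode *left = nullptr, *right = nullptr;
};

// Repeatedly remove the two lowest-frequency nodes and merge them under
// a new parent until a single tree remains. At each step, the queue's
// contents are the red nodes in the animation.
HNode* build_huffman(const std::vector<std::pair<char, int>>& freqs) {
    auto cmp = [](const HNode* a, const HNode* b) { return a->freq > b->freq; };
    std::priority_queue<HNode*, std::vector<HNode*>, decltype(cmp)> pq(cmp);
    for (auto [sym, freq] : freqs)
        pq.push(new HNode{freq, sym});
    while (pq.size() > 1) {
        HNode* a = pq.top(); pq.pop();
        HNode* b = pq.top(); pq.pop();
        pq.push(new HNode{a->freq + b->freq, '\0', a, b});
    }
    return pq.top();               // the root of the Huffman tree
}
```

With the frequency table above, the root's frequency is the total symbol count, 145.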

Bit Buffer

A byte is the smallest addressable piece of memory, so you're going to have to construct a class that will collect bits and write out entire bytes. You will also need another class to collect bytes and deliver them a bit at a time.

Extracting a bit involves shifting and masking; composing a byte involves shifting and or-ing.

On writing out, you probably won't be on a byte boundary, so you will have to write out a few extra bits. Keep track of the number of symbols written. On readback, stop reassembling the text when you've composed enough characters.
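The writing half can be sketched like this, assuming bits are collected most-significant-first; the class and method names are illustrative, not a required interface:

```cpp
#include <vector>
#include <cstdint>

// Collects bits and emits whole bytes, padding the final partial byte
// with zero bits. The matching reader must use a symbol count to know
// when to stop, since the padding bits are garbage.
class BitWriter {
    std::vector<uint8_t> bytes_;
    uint8_t cur_ = 0;      // byte currently being composed
    int nbits_ = 0;        // bits collected so far in cur_
public:
    void put_bit(int b) {
        cur_ = (cur_ << 1) | (b & 1);      // shift left, or the new bit in
        if (++nbits_ == 8) {               // a full byte: flush it
            bytes_.push_back(cur_);
            cur_ = 0;
            nbits_ = 0;
        }
    }
    std::vector<uint8_t> finish() {
        while (nbits_ != 0)                // pad out to a byte boundary
            put_bit(0);
        return bytes_;
    }
};
```

The reader class is the mirror image: pull a byte from the stream, then shift and mask to hand back one bit at a time.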