The program will be run against the instructor's test data and the output compared to a reference file, so standard output needs to be clean. This means "interactive prompts" would show up as differences in the output.
char word[127];
cin >> word;                 // uncontrolled: a long word overruns the 127-byte buffer
if (strlen(word) > 127) {    // too late: the damage is already done
    cerr << "word too long";
    die_a_horrible_death();
}
char word[129];              // room for 128 chars plus the trailing '\0'
cin >> setw(129) >> word;    // setw(n) stores at most n-1 chars plus '\0'
if (strlen(word) > 127) {    // 128 chars read means the word was too long
    cerr << "word too long";
    die_a_horrible_death();
}
The setw manipulator limits how many characters operator>> will store, so the overrun is controlled. The first version checks the length only after the overrun has already happened, and you were specifically cautioned against doing uncontrolled buffer overruns (controlled buffer overruns being one of the points of the exercise).
Naturally, the following works fine because the C++ string library takes care of those messy little details:
char word[128];
string wordstring;
cin >> wordstring;
if (wordstring.length() > 127) {
    cerr << "word too long";
    die_a_horrible_death();
}
strcpy(word, wordstring.c_str());
a[i++] = i++   // a classic example of undefined behavior: i is modified twice between sequence points
The guys at Bell Labs who put Unix together back in the 1970s had a number of those exceedingly brilliant ideas that are entirely obvious in hindsight. Among those ideas were the shell, I/O redirection, pipes, and filter programs, described below.
The shell (e.g. bash, the Bourne-Again Shell) can be viewed in a variety of ways. One way is as a kind of Job Control Language. That is, you have a general-purpose program such as sort which, quite obviously, sorts the contents of a file, and you need some mechanism to wire up the plumbing. In other words, you have to tell the sort utility where to get its input and where to write the output.
From a console running bash, you run a program by simply typing the name of the file containing the program. The program then reads from and writes to the same console where you typed its name.
Of course, you may also want to pass other information to the program. For example, you want to tell g++ the names of the files to compile, the name of the program to generate, and various other flags to control the compilation. You could use standard input for that, but that's not as convenient as adding a few flags to the command line:
g++ -g -Wall -o wordcount wordcount.cpp list.cpp allocator.cpp
The first observation is that you're giving the command a list of file names, so we invent pattern-matching rules called wildcards or globs:
g++ -g -Wall -o wordcount *.cpp
Now, our wordcount reads from standard input and writes to standard output, so we need a mechanism to tell the system we really want to redirect standard input and standard output. There is a simple syntax to do this:
wordcount < test1.txt > test1.out
Now, suppose we want to know the total number of unique words in a text. We know how to get the list of unique words in a file, so if we had a program linecount, we could use this:
wordcount < test1.txt > test1.tmp
linecount < test1.tmp > test1.out
rm test1.tmp
That works OK, but it is a tad clunky: you need to delete the temporary file, and the first program must run to completion before the second one can start. The biggest problem is that writing a file is much slower than processing data in memory. So we invent a special kind of file called a pipe that holds just a small amount of data. The first process writes to the pipe and goes to sleep when the pipe is full; the second process reads from the pipe and sleeps when the pipe is empty. The shell syntax separates the two commands with a vertical bar:
wordcount < test1.txt | linecount > test1.out
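Under the hood, the shell builds the pipeline with the pipe, fork, and dup2 system calls. The following is only a rough sketch of what the shell does for wordcount | linecount (error checking omitted, and it assumes both programs are on your PATH):

#include <unistd.h>
#include <sys/wait.h>

int main() {
    int fd[2];
    pipe(fd);                        // fd[0] is the read end, fd[1] the write end

    if (fork() == 0) {               // first child runs wordcount
        dup2(fd[1], 1);              // its standard output becomes the pipe
        close(fd[0]); close(fd[1]);
        execlp("wordcount", "wordcount", (char *)0);
    }
    if (fork() == 0) {               // second child runs linecount
        dup2(fd[0], 0);              // its standard input becomes the pipe
        close(fd[0]); close(fd[1]);
        execlp("linecount", "linecount", (char *)0);
    }
    close(fd[0]); close(fd[1]);      // the parent uses neither end
    wait(0); wait(0);                // wait for both children to finish
    return 0;
}

Note that both processes run concurrently; neither one has to finish before the other starts.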
Together these simple ideas led to a powerful concept: the filter program. A filter is a (generally small) program that reads from standard input and writes to standard output, typically without requiring user interaction. The idea is that a filter should do a small number of things but do them well. Unix/Linux filters include:
cat (concatenate): write file(s) to standard output
sort: sort the input lines and write them to standard output
uniq: report/remove duplicate adjacent lines
grep ("global regular expression print"): print only lines matching a pattern
sed ("stream editor"): batch editor
Later, of course, some of these filter programs became more complicated and developed into complete scripting (programming) languages, such as awk (Aho, Weinberger, Kernighan, the authors of the program) and Perl (Practical Extraction and Report Language).
Using the shell, filter programs can be strung together using pipes to feed the output from one or more programs into the input of another program. For example, a prototype spell checking program is as simple as:
(cat myfile.txt | sed -r -e 's/[^A-Za-z]+/\
/g' | sort | uniq; cat /usr/share/dict/words /usr/share/dict/words) | sort | uniq -u
The first command inside the parentheses produces a sorted list of the unique words in myfile.txt.
That output is combined with two copies of the system dictionary (why two copies?). The combined output is sorted and then passed to uniq -u to get a list of the words that occur in the text but not in the dictionary. The semicolon separates the two commands and the parentheses combine their output.
Fairly crude, but powerful for a one-liner. It was a ground-breaking proof of concept when it was the first spell checker in the world.
Check the relevant manual pages for the details.
Similarly, a list of the top 10 most frequently occurring words in a text can be quickly obtained by this incantation:
cat mydata.txt | sed -r -e 's/[^A-Za-z]+/\
/g' | sed -e '/^$/d' | sort | uniq -c | sort -n | tail -10
Of course, you can go crazy with these things. Top ten words counting down (Letterman-style):
cat mydata.txt | sed -r -e 's/[^A-Za-z]+/\
/g' | sed -e '/^$/d' | sort | uniq -c | sort -n -r | head -10 | cat -n | tac
Just as the shell pre-opens the standard files via the operating system, the shell also passes the command-line arguments to the program. In C/C++, the arguments show up as the arguments to main. The argument vector is set up by code in your program that runs before main runs.
The main function should take two arguments: an int typically called argc (argument count) and an array of pointers to char, char *[], typically called argv (argument vector).
int main(int argc, char *argv[])
argv is of size argc + 1: argv[argc] is NULL and argv[0] is the name of the program.
You may, of course, name them anything you want. You may also
declare the second argument as a pointer to a pointer:
int main(int argument_count, char **argument_list)
The argument vector is a bunch of C-style strings. There's nothing magical about them. It is a convention that arguments beginning with a minus sign are "flags" and the other arguments typically name files, but when it's your program, you can do whatever you want.
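As a quick illustration, here is a minimal sketch (not part of any assignment) that just echoes its arguments and shows the NULL sentinel:

#include <iostream>

int main(int argc, char *argv[]) {
    for (int i = 0; i < argc; ++i)
        std::cout << "argv[" << i << "] = " << argv[i] << "\n";
    // the vector ends with a NULL sentinel: argv[argc] is always NULL
    if (argv[argc] == 0)
        std::cout << "argv[argc] is NULL\n";
    return 0;
}

A more realistic example handles an optional -o output-file flag followed by any number of input file names: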
int main(int argc, char *argv[]) {
    // the first argument, argv[0], is the name of the program, so just skip it
    if (argc < 2) {
        do_something_with(cin, cout);
    } else {
        ostream* out = &cout;
        int next_arg = 1;
        if (strcmp(argv[1], "-o") == 0) {
            if (argc < 3) {
                print_usage();
                exit(1);
            } else {
                next_arg += 2;
                out = new ofstream(argv[2]);
            }
        }
        if (next_arg == argc) {
            do_something_with(cin, *out);
        } else {
            for ( ; next_arg < argc; ++next_arg) {
                ifstream f(argv[next_arg]);
                do_something_with(f, *out);
            }
        }
    }
    return 0;
}
Of course, the logic there is a bit tortured, so let's take advantage of the fact that argv[argc] is NULL:
int main(int argc, char **argv) {
    ++argv;                          // skip the program name
    ostream* out = &cout;
    if (*argv && strcmp(*argv, "-o") == 0) {
        ++argv;
        if (*argv == NULL) {
            print_usage();
            exit(1);
        } else {
            out = new ofstream(*argv);   // open the named output file
            ++argv;
        }
    }
    if (*argv == NULL) {
        do_something_with(cin, *out);
    } else {
        for ( ; *argv != NULL; ++argv) {
            ifstream in(*argv);
            do_something_with(in, *out);
        }
    }
    return 0;
}
Naturally, since programs can define complex rules for their command-line arguments, there are libraries to help programs parse the options.
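One such library is POSIX getopt. A sketch of how it might be used, for a hypothetical program that accepts a -v flag and a -o flag that takes a value:

#include <unistd.h>     // getopt, optarg, optind
#include <iostream>

int main(int argc, char *argv[]) {
    const char *out_name = 0;      // value of the hypothetical -o flag
    bool verbose = false;          // hypothetical -v flag
    int c;
    while ((c = getopt(argc, argv, "o:v")) != -1) {   // "o:" means -o takes a value
        switch (c) {
        case 'o': out_name = optarg; break;
        case 'v': verbose = true;    break;
        default:                     // unrecognized flag
            std::cerr << "usage: prog [-v] [-o file] [file...]\n";
            return 1;
        }
    }
    if (verbose)
        std::cerr << "output: " << (out_name ? out_name : "stdout") << "\n";
    for (int i = optind; i < argc; ++i)   // remaining arguments are file names
        std::cout << "input file: " << argv[i] << "\n";
    return 0;
}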
A shell script is nothing more than the same commands you might type interactively, but when you invoke a script, the shell puts the arguments into the shell variables $0, $1, $2, ... ($0 is the program name). The entire list is given by $*. Note that the Run script for assignment 3 will probably have the command line:
./huff $1
main returns an int value (conventionally, 0 for success and non-zero for failure). That value is returned to its parent process.
When the process that ran your program is the shell, it puts the exit status into the shell variable $?.
Bash is a full-on programming language. The basic units are the programs you invoke, but it also has built-in conditional and loop constructs. The Boolean value in tests is the exit status of a program (0 is True; nonzero is False).
The 2-3 tree has a number of cases, but a more organized way to look at them is in terms of 4 cases, each of which you can pass off to a helper function to simplify the insert method.
Node::insert(data, ...)
    if (is_leaf)
        insert_leaf(data, ...)
    else if (data goes in the left subtree)
        insert_left(data, ...)
    else if (data goes in the middle subtree)
        insert_middle(data, ...)
    else if (data goes in the right subtree)
        insert_right(data, ...)
    else
        -- data matches one of the keys
In the leaf case, there are 4 subcases:
The interior node cases are all similar, but the wiring is slightly different. The similar subcases are:
Suppose we have a text with the following frequencies:
Symbol:    A  | B  | C  | D  | E  | F | G | H | I | J
Frequency: 50 | 25 | 20 | 18 | 10 | 8 | 6 | 4 | 2 | 2
Here is an animation sequence building the Huffman tree (click here to download the zip).
The red nodes are items in the priority queue.
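Here is a minimal sketch of the tree-building loop using the standard priority_queue; the Node layout and names are illustrative assumptions, not the assignment's required interface:

#include <queue>
#include <vector>

struct Node {
    int weight;
    char symbol;                 // meaningful only for leaves
    Node *left, *right;
    Node(int w, char s) : weight(w), symbol(s), left(0), right(0) {}
    Node(Node *l, Node *r)       // interior node: weight is the sum
        : weight(l->weight + r->weight), symbol(0), left(l), right(r) {}
};

struct HeavierThan {             // inverted comparison makes a min-heap
    bool operator()(const Node *a, const Node *b) const {
        return a->weight > b->weight;
    }
};

// Assumes at least one leaf; returns the root of the Huffman tree.
Node *build_huffman(const std::vector<Node *> &leaves) {
    std::priority_queue<Node *, std::vector<Node *>, HeavierThan>
        pq(leaves.begin(), leaves.end());
    while (pq.size() > 1) {
        Node *a = pq.top(); pq.pop();    // remove the two lightest trees
        Node *b = pq.top(); pq.pop();
        pq.push(new Node(a, b));         // merge them under a new parent
    }
    return pq.top();
}

With the frequencies above, the two lightest leaves (I and J, weight 2 each) are merged first, just as in the animation.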
A byte is the smallest addressable piece of memory, so you're going to have to construct a class that will collect bits and write out entire bytes. You will also need another class to collect bytes and deliver them a bit at a time.
Extracting a bit involves shifting and masking; composing a byte involves shifting and ORing.
On writing out, you probably won't end on a byte boundary, so you will have to write out a few extra padding bits. Keep track of the number of symbols written; on readback, stop reassembling the text when you've composed enough characters.
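For instance, here is a minimal sketch of the two classes, assuming most-significant-bit-first packing; the class names and bit order are illustrative choices, not requirements:

#include <iostream>

// Collects bits and writes whole bytes to a stream (MSB first).
class BitWriter {
    std::ostream &out;
    unsigned char buf;   // partially filled byte
    int nbits;           // number of bits currently in buf
public:
    BitWriter(std::ostream &o) : out(o), buf(0), nbits(0) {}
    void put_bit(int bit) {
        buf = (buf << 1) | (bit & 1);      // shift and or
        if (++nbits == 8) flush();
    }
    void flush() {                         // pad the last byte with zero bits
        if (nbits > 0) {
            buf <<= (8 - nbits);
            out.put(buf);
            buf = 0;
            nbits = 0;
        }
    }
};

// Collects bytes from a stream and delivers them a bit at a time.
class BitReader {
    std::istream &in;
    unsigned char buf;
    int nbits;           // bits remaining in buf
public:
    BitReader(std::istream &i) : in(i), buf(0), nbits(0) {}
    int get_bit() {                        // returns 0 or 1, or -1 at end of input
        if (nbits == 0) {
            int c = in.get();
            if (!in) return -1;
            buf = (unsigned char)c;
            nbits = 8;
        }
        --nbits;
        return (buf >> nbits) & 1;         // shift and mask
    }
};

The flush call at the end of output writes the padding bits mentioned above, which is why the decoder must rely on the symbol count rather than end-of-file to know when to stop.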