Assignment 1: Linked List and Memory Management

Linked lists should be familiar to you from CSS 342, so this exercise is partially review. What is new in this assignment is the requirement for low-level memory management.

The purpose of this assignment is to implement a simple, low-level, concrete data structure using the C subset of C++. Do not use any standard library functions except for C or C++ style I/O and (optionally) the C strings library functions #include <cstring>. You may use any standard library data type (Plain Old Data, not C++ classes).

Unadorned output should be sent to standard output. Any logging/trace/diagnostic/status/verbose messages should go to standard error. The program will be run against the instructor's test data with the output compared to a reference file so standard output needs to be clean.

Submit a script named Run that will compile the source and run your test cases. Include any data files needed to run your tests. In addition to the executable program, for for each source file foo.c, Run should generate

This level of detail will not be required in future assignments; it's just to expose the low-level toolchain. Hint: the GNU Compiler Collection has command-line flags to specify verbose output and to tell it not to delete intermediate files (temporaries).

Name the excutable wordcount so it will work with the instructor's test harness.

Optionally, submit a text file named README with any relevant notes.

The script should run on Linux. Bash shell scripts are an appropriate choice. For extra credit, include a Makefile and have Run invoke make.

Part 1: Memory Allocator

Implement a memory allocator per the functions declared in allocator.h.

Do not use malloc() or new to allocate memory. Instead, define the array Arena[] to be of size ARENA_SIZE. Memory will be allocated and deallocated from Arena[].

You will have to keep track of the blocks of free memory, spliting them as memory is allocated. When memory is allocated, the allocator needs to retain metadata about the block so that it can be returned to the pool.

Allocated memory should be aligned with pointer boundaries.

If memory cannot be allocated for some request, terminate the program with an error exit status.

To implement the allocator, you will have to convert between pointers and integer types in order to compute offsets. Ensure that your integral type is large enough to hold a pointer value (hint: consider the standard library types ptrdiff_t and size_t).

Part 2: Linked List

Write a program that will insert counters and C-style strings (array of char ending with a NUL character) into a singly-linked circular list, according to the functions declared in list.h .

Define the node structure like this:

struct ListNode {
  struct ListNode* next;
  int count;
  char data[0];
};

Note the zero-length data field. In C, even empty strings contain a byte for the NUL character. This means that any string will overflow struct. When allocating a ListNode, you need to make sure to allocate enough space for the overflow. This also means you cannot copy or reuse nodes.

Use the memory allocator from part 1.

Part 3: Word Count

The driver program should read data from standard input. Each line contains a code character a or d followed by a space, followed by a word.

The “word” and hence the input line could be arbitrarily long. In practice, it is probably going to be less than, say 127 characters. You may terminate the program with an error exit status if the word is 128 characters or greater.

You may want to write a helper function, locate() which will return the predecessor node of a given word (if found).

After the input is read, walk through the list, printing each word and its count. The output format should be compatible with uniq -c (7-character right-justified count field, space, word) but order does not matter.

Make sure you cover the following test cases:

Notes

  1. A common way to manage the reading of an input line into a C string is to define a buffer, say, 128 or 512 characters and to assume no line in the file is longer than the buffer size. This has caused innumerable security loopholes that have been exploited. Since you don't know how long a word may be, but strings are fixed size, a simple way to manage it is to allocate a fixed size buffer, most likely to be large enough, and read the line a character at a time into the buffer. If (when) the buffer fills, allocate a new, larger buffer and copy. The standard C library provides realloc() for this case.
  2. We are not overly concerned with performance at this point. The objective here is just a basic understanding of how the thing works. A more sophisticated memory allocator might: Library implementers spend a lot of time tuning and tweaking the memory allocator and an application programmer is unlikely to get performance gains over a tuned library unless some application-specific property can be exploited. Check out the wiki page on malloc() or Doug Lea's memory allocator .
  3. A custom allocator has to be debugged, tuned, and maintained. To justify writing one, you probably have to be expoiting some application-specific property to get better performance. This tends to make them complicated. It is often useful to provide an alternate implementation that simply uses the library functions malloc() and free(). This thin wrapper makes benchmarking and validation easier.
  4. Log messages are not required but, if you do write any, leave them in your code. It is often frustrating to put many debug messages into a program to get the program working, remove them to "clean up" the code after the program is presumed to be "debugged", only to have to put them back in when something changes or a case was overlooked, or new functionality is added that breaks something.

Questions to Ponder

  1. How do you override the definition of ARENA_SIZE on the compiler command line (in gcc/g++)?
  2. Is it reasonable to provide an in-code default arena size or should ARENA_SIZE be defined only on the command line? If ARENA_SIZE is not defined on command line, what is a clean way to force a compile-time error?
  3. Arrays in C++ are a legacy from C, but are considered undesireable when used with C++ classes. Why?
  4. malloc() returns a pointer to allocated memory which is passed back to free(). How does free know the size of the block that was returned? How does a negative pointer offset work? Is that legal according to the C/C++ language standards?
  5. The largest class of bugs in C involve mishandled memory, especially dangling pointers and memory leaks. What happens if a pointer is freed twice? What happens if if you free a pointer value that does not come from the allocator?
  6. A not-uncommon C idiom is to have a zero-length array as the last element of a struct. How does that work? How does a zero-length field affect the size of the structure? Is this legal according to the language standard?