`.cc` files are compiled individually and linked together. Split the interface (declarations) into the `.h` file and the implementation (definitions) into the `.cc` file. Some code must appear in the `.h` file because the compiler needs to know the size of a class and because of inlining and templating; the remaining definitions go into the `.cc` files.
- `(type) expr`: the C-style cast
- `static_cast<type>(expr)`: change the bits (at compile time)
- `reinterpret_cast<type>(expr)`: change the interpretation (e.g. pointer type)
- `const_cast<type>(expr)`: add or remove the `const` attribute (mostly to remove)
- `dynamic_cast<type>(expr)`: cast a pointer from a pointer-to-base-class to a pointer-to-derived-class (or return NULL if the object pointed to is not of the derived class)
Sequential processing is essentially a `first` and `next` operation. Additional operations include `last`, `prev`, `insert` (first, last, before, after), `delete` (current, next, prev, first, last), and `find`. Special cases include the stack, queue, and deque.
Sequential operations may be implemented using a linked list (singly or doubly linked) or a vector. The choice determines the asymptotic performance of the operations (e.g. insert into a linked list is O(1) but insert into an array is O(N)). In practice one uses the STL container classes `std::list` or `std::vector`.
Sequential processing is so important that the 2011 language standard introduced the range-based `for` statement (other languages already had it).
Iterators use operator overloading to mimic pointer arithmetic. The following idiom works for several different standard container types:
```cpp
for (ContainerType::iterator it = container.begin(); it != container.end(); ++it) {
    do_something_with(*it);
}
```
Note that reaching the nth element is O(n) on a linked list (iterators can only advance one step at a time).
The dictionary abstraction is extremely important. Lookup by data is not the same as finding the nth entry. Typically, we look up a value based on a key.
The main operations are `insert` and `lookup`. Additional operations may be supported as well. This dictionary abstraction is so important that, naturally, it has many names, including map, table, and associative array.
A dictionary can be implemented using a linked list with O(N) lookup time if N is small or very few lookups are being performed. Naturally, we can do better.
If we can sort the data by the key (O(N log N)), lookup can be performed in O(log N) time. Note that sorting is more expensive than an O(N) one-time lookup, but sorting is justified if performing a large number of lookups or if the data can be sorted "offline". In some cases, the data may arrive pre-sorted.
inverted index: separate tables for each key. Each index table (e.g. "data by name", "data by value") holds references into the original (raw) data, so a lookup dereferences through the index:

`value = data[ by_name[i].ref ].value`
Essentially, the problem is that binary search is inflexible:
insert and delete operations are
O(N).
The binary search tree (BST) is essentially a binary tree structure that represents the decision pattern of the binary search:

- lookup, insert, and delete take O(log N) time
- traversal of all entries takes O(N) time using O(log N) space (recursion)
- finding the nth entry is O(n) without preprocessing (or O(log N) with additional processing during insertion and deletion)
The binary search tree worst-case performance is O(N) instead of O(log N) because a BST degenerates to a linked list with pathological input. Unfortunately, two pathological cases are building the tree in sorted order and building the tree in reverse sorted order.
The 2-3 tree is a special case of the general B-tree. The key insight is that each node holds one or two keys, and all non-leaf nodes have two or three children (depending on the number of keys in the node). All leaves are maintained at the same level.