CSS 343: Notes from Lecture 5 (DRAFT)

Administrivia

course website still under active development: check back frequently for updates
check out this memory allocation simulator.
- it's only about 100 lines (comments excluded) of Python, but you don't have to know Python to run it
- the allocation algorithm has a couple of (semi-)deliberate bugs (see if you can spot them)
- it's a memory allocator; it doesn't have to be a good one
reminder: useful links on Professor Zander's css343 web page, including the C++ toolchain and Linux

Our Story So Far

dictionary/map: extremely useful abstraction
- linked list implementation: O(N)
- binary search (on a sorted array): O(log N) lookup, but inflexible (insert/delete cost O(N))
- binary search tree: O(h) where h is height of tree
  - h is O(log N) on average
  - h degenerates to O(N) for pathological input
    - sorted and reverse-sorted input are two such cases (but not the only ones)
  - average-case performance may be obtained by randomizing the input
  - output data in sorted order via inorder traversal
  - insert/delete operation
  - successor/predecessor (next/previous)
    - successor is the leftmost node of the right subtree (if node has a right child); otherwise, successor is the nearest ancestor whose left subtree contains the node (the first ancestor reached by climbing up a left branch)
    - similarly for predecessor
    - predecessor/successor are inverse operations
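
The successor rule above can be sketched without parent pointers by searching down from the root and remembering the last node at which the search branched left. This is only an illustrative sketch: the names Node, insert, and successor are hypothetical, not taken from the lecture or the assignment.

```python
class Node:
    """Plain (unbalanced) BST node; hypothetical helper for this sketch."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Standard recursive BST insert; duplicates are ignored."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def successor(root, key):
    """Return the smallest key strictly greater than `key`, or None."""
    candidate = None
    node = root
    while node is not None:
        if key < node.key:
            candidate = node      # branched left: node is the best successor so far
            node = node.left
        else:
            node = node.right     # node.key <= key: successor must be to the right
    return candidate.key if candidate is not None else None
```

Both cases from the notes fall out of the same loop: if the node has a right child, the search ends at the leftmost node of the right subtree; otherwise the answer is the last ancestor at which the search went left.
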
general properties of binary trees
- preorder/inorder/postorder traversal
- some trees may have a parent pointer (this is analogous to the back pointer in a doubly-linked list)
  - most uses do not require parent pointer because algorithms are recursive from the root and you get the parent pointer on return from the recursive call
BST balancing techniques
- AVL
  - store balance factor in node (height of left and right subtrees may differ by at most 1)
  - rebalance on insert to maintain balance factor
  - details glossed over, but generally similar to red-black trees
- red-black
  - less tightly balanced than AVL, but same asymptotic complexity O(log N)
  - nodes are decorated with a color value (red or black)
  - red-black properties:
    - root is black
    - synthetic leaf nodes are black
    - children of a red node are black
    - all simple paths from any given node to its descendant leaves contain the same number of black nodes (the black height, BH)
  - lookup operation is unchanged (same as for a plain BST)
  - insert/delete proceeds normally, then tree is rebalanced to restore red-black properties (invariants)
    - case-by-case analysis for insert
    - max 2 rotations O(1)
    - max O(log N) color changes
    - delete is similar (max 3 rotations, which is still O(1))
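
As a sanity check on the red-black properties listed above, here is a minimal sketch of a validator. The names RBNode, black_height, and is_red_black are hypothetical; None stands in for the synthetic black leaves, and the BST ordering of keys is assumed rather than checked.

```python
RED, BLACK = "red", "black"

class RBNode:
    """Hypothetical red-black node; None children play the role of synthetic leaves."""
    def __init__(self, key, color, left=None, right=None):
        self.key = key
        self.color = color
        self.left = left
        self.right = right

def black_height(node):
    """Return the black height of the subtree, or raise ValueError on a violation."""
    if node is None:
        return 1                              # synthetic leaf counts as black
    if node.color == RED:
        for child in (node.left, node.right):
            if child is not None and child.color == RED:
                raise ValueError("red node has a red child")
    left_bh = black_height(node.left)
    right_bh = black_height(node.right)
    if left_bh != right_bh:
        raise ValueError("paths have different black heights")
    return left_bh + (1 if node.color == BLACK else 0)

def is_red_black(root):
    """True if the root is black and every subtree satisfies the color properties."""
    if root is not None and root.color != BLACK:
        return False
    try:
        black_height(root)
    except ValueError:
        return False
    return True
```

Running a check like this after every insert in a test build is one inexpensive way to catch a missed rebalancing case.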

Red-Black Trees (Again)

Here is a sequence of images generated while building an (unbalanced) binary search tree from a word sequence taken from an arbitrarily chosen text found on Project Gutenberg. As you can see, the tree looks roughly balanced. Click the image to get a full-size view, or download a zip file of the individual frames.

Here is a BST built from the same sequence that maintains the red-black properties. As you can see intuitively, the tree appears to have slightly better balance. Note the various cases that crop up during the rebalancing phases (balanced.zip).

One extreme pathological case for the BST is input in sorted (or reverse-sorted) order. Here is a BST built from the original data, sorted and then randomly permuted within groups of 5 words. Still pretty bad (permute5-unbalanced.zip).

And the same input to a red-black tree (permute5-balanced.zip).

Any questions?

2-3 Trees

2-3 tree is a search tree but not a binary search tree
all nodes have 1 or 2 keys
all leaves are at the same height
all interior nodes have 2 or 3 children (hence its name)
- if an interior node has 1 key, it has 2 children (left, middle)
- otherwise, it has 2 keys and 3 children (left, middle, right)
modified inorder traversal:
1. visit left child (if interior node)
2. process key1
3. visit middle child (if interior node)
4. process key2 (if present)
5. visit right child (if present)
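
The five steps above translate almost line for line into code. A minimal sketch, assuming a hypothetical Node23 class that stores its keys and children in lists (the assignment's actual node layout may differ):

```python
class Node23:
    """Hypothetical 2-3 tree node: 1 or 2 sorted keys; a leaf has no children."""
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []   # interior node: len(keys) + 1 children

def inorder(node, visit):
    """Call visit(key) on every key in ascending order."""
    interior = bool(node.children)
    if interior:
        inorder(node.children[0], visit)      # 1. left child
    visit(node.keys[0])                       # 2. key1
    if interior:
        inorder(node.children[1], visit)      # 3. middle child
    if len(node.keys) == 2:
        visit(node.keys[1])                   # 4. key2 (if present)
        if interior:
            inorder(node.children[2], visit)  # 5. right child (if present)
```

For example, inorder(root, print) prints the keys in sorted order.
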
insertion:
- insert new entry at leaf
- if new entry does not fit, leaf must be split
- if leaf is split, the middle key must be propagated back to the parent
  - if the key does not fit into the parent, the parent must be split and the middle (median) key is recursively propagated up to the grandparent, etc.
- if the root node is split, a new root is formed from the median key and all leaves drop a level
details are somewhat messy, but manageable
deletion is similarly messy but manageable
- not required for assignment
- hack in lieu of deletion: mark node as "deleted" but leave it in place
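
The insertion procedure above (insert at a leaf, split on overflow, propagate the median upward) might look like the sketch below. It is one possible arrangement, not the assignment's required design: Node23, insert, and _insert are hypothetical names, nodes keep keys and children in Python lists, and duplicate keys are silently ignored.

```python
class Node23:
    """Hypothetical 2-3 tree node: 1 or 2 sorted keys; a leaf has no children."""
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []

def _insert(node, key):
    """Insert key under node. Return (median, right_sibling) if node split, else None."""
    if key in node.keys:
        return None                            # assumption: ignore duplicates
    if not node.children:                      # leaf: add the key in place
        node.keys.append(key)
        node.keys.sort()
    else:
        i = sum(k < key for k in node.keys)    # index of the child to descend into
        split = _insert(node.children[i], key)
        if split is None:
            return None
        median, right = split                  # child split: absorb its median
        node.keys.insert(i, median)
        node.children.insert(i + 1, right)
    if len(node.keys) <= 2:
        return None                            # key fit; nothing to propagate
    # Overflow (3 keys): keep the smallest key, promote the middle, move the rest right.
    median = node.keys[1]
    right = Node23(node.keys[2:], node.children[2:])
    node.keys = node.keys[:1]
    node.children = node.children[:2]
    return median, right

def insert(root, key):
    """Insert key and return the (possibly new) root."""
    if root is None:
        return Node23([key])
    split = _insert(root, key)
    if split is None:
        return root
    median, right = split
    return Node23([median], [root, right])     # root split: tree grows one level
```

Because the tree only grows at the root, every leaf stays at the same depth without any explicit rebalancing step.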

Counters

The assignment introduces a new concept: counters

add a counter for each case
verify that all cases are hit by making sure all counters are positive
- counters function as a fast and cheap test code coverage tool
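
A minimal sketch of the counter idea, using a plain BST insert as the code under test. The case names, the dict-based tree representation, and the helper uncovered_cases are all made up for illustration; the assignment will have its own set of cases.

```python
from collections import Counter

# One counter per case in the code under test (hypothetical case names).
CASES = ("empty_tree", "go_left", "go_right", "duplicate")
counters = Counter()

def bst_insert(tree, key):
    """Insert into a BST stored as nested dicts, bumping a counter for each case hit."""
    if tree is None:
        counters["empty_tree"] += 1
        return {"key": key, "left": None, "right": None}
    if key < tree["key"]:
        counters["go_left"] += 1
        tree["left"] = bst_insert(tree["left"], key)
    elif key > tree["key"]:
        counters["go_right"] += 1
        tree["right"] = bst_insert(tree["right"], key)
    else:
        counters["duplicate"] += 1
    return tree

def uncovered_cases():
    """Return the cases whose counter is still zero, i.e. not yet exercised."""
    return [case for case in CASES if counters[case] == 0]
```

After running the test input, an empty list from uncovered_cases() means every case was hit at least once; anything else names the cases the tests still miss.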

B Trees

The B tree is a generalization of the 2-3 tree (alternatively, the 2-3 tree is a special case of the B tree).

Motivation:

disk access is orders of magnitude slower than CPU processing
disks are block-access devices (reads occur block at a time)

We wish to minimize the number of blocks read for insert/lookup/delete by fitting as many keys as possible into a single block.

Example: if we can fit 100 keys into a block, each probe narrows the search by a factor of about 100, so we can choose among 1,000,000 records in 3 probes (100^3 = 1,000,000); binary search requires up to 20 probes.
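
The probe counts can be checked with a few lines of arithmetic. This is a simplified model, not a B tree implementation: it assumes every block is full and that each probe narrows the search by a factor equal to the number of keys per block. The function names are hypothetical.

```python
def btree_probes(records, keys_per_block):
    """Smallest h such that keys_per_block ** h >= records (simplified full-block model)."""
    probes, reach = 0, 1
    while reach < records:
        reach *= keys_per_block
        probes += 1
    return probes

def binary_search_probes(records):
    """Worst-case number of probes for binary search: floor(log2(records)) + 1."""
    return records.bit_length()

if __name__ == "__main__":
    n = 1_000_000
    print("block probes: ", btree_probes(n, 100))
    print("binary probes:", binary_search_probes(n))
```

Trying other block sizes shows why wide nodes pay off on disk: the number of block reads grows only with the logarithm of the record count taken to the base of the fan-out.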

2-3-4 Trees

The 2-3-4 tree is another special case of the B tree, in which each node has 1 to 3 keys (hence 2-4 children). It is isomorphic to the red-black tree. Conceptually, the red children of a black node are folded into the black node.
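
The folding can be made concrete with a tiny sketch: given a black node, collect its own key plus the keys of any red children to obtain the keys of the corresponding 2-3-4 node. RBNode and fold are hypothetical names used only for this illustration.

```python
class RBNode:
    """Hypothetical red-black node for illustration."""
    def __init__(self, key, color, left=None, right=None):
        self.key = key
        self.color = color
        self.left = left
        self.right = right

def fold(black):
    """Return the sorted keys of the 2-3-4 node formed by a black node and its red children."""
    keys = []
    if black.left is not None and black.left.color == "red":
        keys.append(black.left.key)       # red left child becomes the first key
    keys.append(black.key)
    if black.right is not None and black.right.color == "red":
        keys.append(black.right.key)      # red right child becomes the last key
    return keys
```

A black node with no red children folds to a 2-node (1 key), one red child gives a 3-node (2 keys), and two red children give a 4-node (3 keys).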

Hashing (Preview)

Example: we can determine population count (number of 1-bits) of a byte using a table of 256 entries.
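
A minimal sketch of the table-lookup idea: precompute the bit count of every byte value once, then answer each query with a single index operation. The names POPCOUNT, popcount_byte, and popcount32 are illustrative.

```python
# Build the 256-entry table: the count for i is the count for i >> 1 plus i's low bit.
POPCOUNT = [0] * 256
for i in range(1, 256):
    POPCOUNT[i] = POPCOUNT[i >> 1] + (i & 1)

def popcount_byte(b):
    """Number of 1-bits in the low 8 bits of b, by table lookup."""
    return POPCOUNT[b & 0xFF]

def popcount32(x):
    """Number of 1-bits in a 32-bit value: four table lookups, one per byte."""
    return sum(POPCOUNT[(x >> shift) & 0xFF] for shift in (0, 8, 16, 24))
```

Here the key space (byte values 0..255) is small enough to index the table directly, so no hash function is needed; hashing addresses the case where it is not.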

Generally, a hash function maps a large key space (0..M-1) into a smaller key space (0..N-1) which can be used to index a table of size N. A good hash function randomizes the keys so "collisions" are unlikely (hence the name).

A collision occurs when two keys map to the same hash value. Hash tables must deal with this problem.
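
To make the preview concrete, here is a small sketch of a hash table for string keys. Both the hash function (a toy multiply-and-add fold) and the collision strategy (separate chaining, where each slot holds a list of entries) are assumptions chosen for illustration; the notes do not say which techniques the course will use.

```python
def hash_key(key, table_size):
    """Toy hash: fold the bytes of a string into the range 0..table_size-1."""
    h = 0
    for byte in key.encode("utf-8"):
        h = (h * 31 + byte) % table_size
    return h

class ChainedHashTable:
    """Hash table that resolves collisions by chaining entries in per-slot lists."""
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def put(self, key, value):
        bucket = self.buckets[hash_key(key, len(self.buckets))]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # key already present: overwrite
                return
        bucket.append((key, value))        # new key (possibly a collision): chain it

    def get(self, key, default=None):
        bucket = self.buckets[hash_key(key, len(self.buckets))]
        for k, v in bucket:
            if k == key:
                return v
        return default
```

With a table of size 1 every key collides, yet lookups still work because each bucket is searched linearly; the table merely degrades to a linked-list-style O(N) search.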