CSS 343: Notes from Lecture 5 (DRAFT)
Administrivia
-
course website still under active development: check back
frequently for updates
-
check out this
memory allocation simulator.
-
it's only about 100 lines (comments excluded) of Python, but you
don't have to know Python to run it
-
the allocation algorithm has a couple of (semi-)deliberate
bugs (see if you can spot them)
-
it's a memory allocator; it doesn't have to be a good one
-
reminder: useful links on Professor Zander's css343
web page,
including the C++ toolchain and Linux
Our Story So Far
-
dictionary/map: extremely useful abstraction
-
linked list implementation: O(N)
-
binary search: O(log N) but inflexible
-
binary search tree: O(h) where h is height of tree
-
h is O(log N) on average
-
h degenerates to O(N) for pathological input
-
sorted and reverse-sorted input are two such cases
(but not the only ones)
-
average-case performance may be obtained by randomizing
the input
-
output data in sorted order via inorder traversal
-
insert/delete operation
-
successor/predecessor (next/previous)
-
successor is the leftmost descendant of the right child (if
the node has a right child); otherwise, successor is the first
ancestor reached by way of a left branch (the nearest ancestor
whose left subtree contains the node)
-
similarly for predecessor
-
predecessor/successor are inverse operations
-
general properties of binary trees
-
preorder/inorder/postorder traversal
-
some trees may have a parent pointer (this is analogous to
the back link in a doubly-linked list)
-
most uses do not require a parent pointer because the
algorithms recurse from the root, so the parent is available
again on return from the recursive call
-
BST balancing techniques
-
AVL
-
store balance factor in node (height of left and right
subtrees may differ by at most 1)
-
rebalance on insert to maintain balance factor
-
details glossed over, but generally similar to red-black
trees
-
red-black
-
less tightly balanced than AVL, but same asymptotic
complexity O(log N)
-
nodes are decorated with a color value (red or black)
-
red-black properties:
-
root is black
-
synthetic leaf nodes are black
-
children of a red node are black
-
all simple paths from any given node down to its descendant
leaves contain the same number of black nodes: the black
height (BH)
-
lookup operation is unchanged
-
insert/delete proceeds normally, then tree is rebalanced to
restore red-black properties (invariants)
-
case-by-case analysis for insert
-
max 2 rotations O(1)
-
max O(log N) color changes
-
delete is similar (max 3 rotations, which is still O(1))
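The plain BST points above (O(h) operations, degenerate height on sorted input, inorder output, successor without a parent pointer) can be tried out with a minimal sketch. This is not the course's reference code: the names `Node`, `insert`, `height`, `inorder`, `successor`, and `build` are made up for illustration, the tree is unbalanced, and duplicates are simply ignored.

```python
import random


class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key iteratively (safe on very deep trees); return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        elif key > node.key:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right
        else:
            return root  # assumption: duplicate keys are ignored


def height(root):
    """Number of nodes on the longest root-to-leaf path (0 for an empty tree)."""
    best, stack = 0, [(root, 1)]
    while stack:
        node, depth = stack.pop()
        if node is not None:
            best = max(best, depth)
            stack.append((node.left, depth + 1))
            stack.append((node.right, depth + 1))
    return best


def inorder(root):
    """Return the keys in sorted order (left subtree, node, right subtree)."""
    out, stack, node = [], [], root
    while stack or node is not None:
        while node is not None:
            stack.append(node)
            node = node.left
        node = stack.pop()
        out.append(node.key)
        node = node.right
    return out


def successor(root, key):
    """Smallest key greater than `key`, or None; walks down from the root."""
    best, node = None, root
    while node is not None:
        if key < node.key:
            best = node.key  # took a left branch: this node is a candidate
            node = node.left
        else:
            node = node.right
    return best


def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root


keys = list(range(200))
sorted_tree = build(keys)             # pathological: every node has only a right child
random.Random(343).shuffle(keys)      # randomizing the input
shuffled_tree = build(keys)
print(height(sorted_tree), height(shuffled_tree))
```

The first number printed is 200 (one level per key); the second depends on the shuffle but should be far smaller.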
Red-Black Trees (Again)
Here is a sequence of images generated while building up a binary
search tree (unbalanced) from a word sequence taken from an
arbitrarily chosen text found on
Project Gutenberg.
As you can see, the tree looks roughly balanced. Click the image
to get a full-size view, or
download a zip file
of the individual frames.
Here is a BST built from the same sequence that maintains the
red-black properties. As you can see intuitively, the tree
appears to have slightly better balance. Note the various cases
that crop up during the rebalancing phases
(balanced.zip).
One extreme pathological case for the BST is input in sorted (or
reverse-sorted) order. Here is a BST built from the original
data, sorted and then with each group of 5 words randomly
permuted. Still pretty bad
(permute5-unbalanced.zip).
And the same input to a red-black tree
(permute5-balanced.zip).
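The red-black properties listed earlier can also be checked mechanically, which is handy when debugging a rebalancing routine. The following is a rough sketch under stated assumptions: `RBNode`, `black_height`, and `is_red_black` are hypothetical names, `None` stands in for the black synthetic leaves, and BST key ordering is not checked here.

```python
class RBNode:
    def __init__(self, key, color, left=None, right=None):
        self.key = key
        self.color = color  # "R" or "B"
        self.left = left    # None plays the role of a black synthetic leaf
        self.right = right


def black_height(node):
    """Return the black height of a subtree; raise ValueError on a violation."""
    if node is None:
        return 1  # the synthetic leaf counts as one black node
    if node.color == "R":
        for child in (node.left, node.right):
            if child is not None and child.color == "R":
                raise ValueError(f"red node {node.key} has a red child")
    left = black_height(node.left)
    right = black_height(node.right)
    if left != right:
        raise ValueError(f"black heights differ below {node.key}")
    return left + (1 if node.color == "B" else 0)


def is_red_black(root):
    """True if the root is black and the color properties hold everywhere."""
    if root is not None and root.color != "B":
        return False
    try:
        black_height(root)
        return True
    except ValueError:
        return False


good = RBNode(2, "B", RBNode(1, "R"), RBNode(3, "R"))
lopsided = RBNode(2, "B", RBNode(1, "B"), None)
print(is_red_black(good), is_red_black(lopsided))  # True False
```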
Any questions?
2-3 Trees
-
2-3 tree is a search tree but not a binary search tree
-
all nodes have 1 or 2 keys
-
all leaves are at the same height
-
all interior nodes have 2 or 3 children (hence its name)
-
if an interior node has 1 key, it has 2 children (left, middle)
-
otherwise, it has 2 keys and 3 children (left, middle, right)
-
modified inorder traversal:
-
visit left child (if interior node)
-
process key1
-
visit middle child (if interior node)
-
process key2 (if present)
-
visit right child (if present)
-
insertion:
-
insert new entry at leaf
-
if new entry does not fit, leaf must be split
-
if leaf is split, the middle key must be propagated back to
the parent
-
if the key does not fit into the parent, the parent must
be split and the middle (median) key is recursively
propagated up to the grandparent, etc.
-
if the root node is split, a new root is formed from the
median key and all leaves drop a level
-
details are somewhat messy, but manageable
-
deletion is similarly messy but manageable
-
not required for assignment
-
hack in lieu of deletion: mark node as "deleted" but leave
it in place
Counters
The assignment introduces a new concept: counters
-
add a counter for each case
-
verify that all cases are hit by making sure all counters are
positive
-
counters function as a fast, cheap code-coverage tool for
your tests
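A small illustration of the counter idea, assuming a toy BST insert stored as nested dicts (this is not the assignment's code, and the case names `empty`, `go_left`, `go_right`, `duplicate` are invented for the sketch). Each branch bumps its own counter, and `unhit_cases` reports any branch the tests never reached.

```python
from collections import Counter

counters = Counter()
CASES = ("empty", "go_left", "go_right", "duplicate")


def insert(tree, key):
    """Insert key into a BST of nested dicts, counting which case was taken."""
    if tree is None:
        counters["empty"] += 1
        return {"key": key, "left": None, "right": None}
    if key < tree["key"]:
        counters["go_left"] += 1
        tree["left"] = insert(tree["left"], key)
    elif key > tree["key"]:
        counters["go_right"] += 1
        tree["right"] = insert(tree["right"], key)
    else:
        counters["duplicate"] += 1
    return tree


def unhit_cases():
    """Names of cases whose counter is still zero."""
    return [name for name in CASES if counters[name] == 0]


tree = None
for k in (5, 3, 8):
    tree = insert(tree, k)
print(unhit_cases())   # ['duplicate']: the test data never repeated a key
tree = insert(tree, 3)
print(unhit_cases())   # []
```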
B Trees
The B tree is a generalization of the 2-3 tree (alternatively, the
2-3 tree is a special case of the B tree).
Motivation:
-
disk access is orders of magnitude slower than CPU processing time
-
disks are block-access devices (reads occur block at a time)
We wish to minimize the number of blocks read for
insert/lookup/delete by fitting as many keys as possible into a
single block.
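Example: if we can fit 100 keys into a block, each block read
narrows the search by a factor of about 100, so we can choose among
1,000,000 records in 3 block reads (100^3 = 1,000,000), or 2 disk
probes if the root block is kept in memory (binary search requires
up to 20 probes).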
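The probe counts in this example can be sanity-checked with a few lines of integer arithmetic. This is a rough model: it assumes each probe narrows the search by a factor equal to the fan-out (a node with 100 keys really has 101 children), and `probes_needed` is a made-up helper name.

```python
def probes_needed(records, fanout):
    """Smallest d such that fanout**d >= records (integer arithmetic only)."""
    d, reach = 0, 1
    while reach < records:
        reach *= fanout
        d += 1
    return d


print(probes_needed(1_000_000, 100))  # 3
print(probes_needed(1_000_000, 2))    # 20
```

Three 100-way levels cover 1,000,000 records; if the root block is held in memory, only two of those levels cost a disk read. Binary search halves the range each time, so it needs up to 20 probes.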
2-3-4 Trees
The 2-3-4 tree is another special case of the B tree, in which
each node has 1 to 3 keys (hence 2-4 children). It is isomorphic
to the red-black tree. Conceptually, the red children of a black
node are folded into the black node.
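The folding can be made concrete with a small sketch. It assumes red-black nodes are written as tuples `(color, key, left, right)` with `None` for a leaf; that representation and the names `absorb` and `fold` are chosen only for illustration.

```python
def absorb(child, keys, subtrees):
    """Append a child's contribution: a red child donates its key and subtrees."""
    if child is not None and child[0] == "R":
        _, key, left, right = child
        subtrees.append(left)
        keys.append(key)
        subtrees.append(right)
    else:
        subtrees.append(child)  # black child (or leaf) stays a separate subtree


def fold(black):
    """Fold the red children of a black node into it; return (keys, subtrees)."""
    _, key, left, right = black
    keys, subtrees = [], []
    absorb(left, keys, subtrees)
    keys.append(key)
    absorb(right, keys, subtrees)
    return keys, subtrees


four_node = ("B", 20, ("R", 10, None, None), ("R", 30, None, None))
three_node = ("B", 20, ("R", 10, None, None), None)
print(fold(four_node)[0], fold(three_node)[0])  # [10, 20, 30] [10, 20]
```

A black node with no red children becomes a 1-key node, with one red child a 2-key node, and with two red children a 3-key node, which matches the 2 to 4 children of a 2-3-4 node.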
Hashing (Preview)
Example: we can determine population count (number of 1-bits) of a
byte using a table of 256 entries.
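A minimal sketch of the table-lookup idea: the 256-entry table is built once, after which the population count of any byte is a single index operation. The names `POPCOUNT`, `popcount_byte`, and `popcount` are illustrative; the multi-byte loop is an extension added here, not something from the lecture.

```python
# Build the table once: bits(i) = bits(i with its low bit dropped) + low bit.
POPCOUNT = [0] * 256
for i in range(1, 256):
    POPCOUNT[i] = POPCOUNT[i >> 1] + (i & 1)


def popcount_byte(b):
    """Number of 1-bits in a byte (0..255), by direct table lookup."""
    return POPCOUNT[b]


def popcount(n):
    """Number of 1-bits in a non-negative integer, one byte at a time."""
    total = 0
    while n:
        total += POPCOUNT[n & 0xFF]
        n >>= 8
    return total


print(popcount_byte(0b10110000), popcount(0xFFFF))  # 3 16
```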
Generally, a hash function maps a large key space (0..M-1) into a smaller
key space (0..N-1) which can be used to index a table of size N. A
good hash function randomizes the keys so "collisions" are
unlikely (hence the name).
A collision occurs when two keys map to the same hash value. Hash
tables must deal with this problem.
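To make the collision problem concrete, here is a minimal sketch assuming non-negative integer keys, the simplest possible hash function (`key % N`), and separate chaining as the collision strategy. Chaining is just one common choice picked for the example, not necessarily the approach the course will use, and `ChainedHashTable` is a hypothetical name.

```python
class ChainedHashTable:
    def __init__(self, size=8):
        # One list ("chain") per table slot; colliding keys share a chain.
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        """Map a key from 0..M-1 into 0..N-1, where N is the table size."""
        return key % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # key already present: overwrite
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return default


table = ChainedHashTable(size=8)
table.put(3, "three")
table.put(11, "eleven")   # 11 % 8 == 3: collides with key 3
print(table.get(3), table.get(11), len(table.buckets[3]))  # three eleven 2
```

Both keys land in slot 3, but each lookup still finds the right value because the chain is searched by key.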