CSS 343: Notes from Lecture 12 (DRAFT)

Administrivia

assignment 2?
- it's really, really important to hit the intermediate milestones
  - character frequency counts
  - priority queue
  - Huffman tree (built from frequency table)
  - encoding table
  - encoded file
  - reloading the frequency table
  - file decode
office hours: take advantage of them

Dijkstra's algorithm

The core of the algorithm is actually pretty simple: initialize the tentative cost of each node (zero for the start node, infinity for all other nodes). Initially every node is unprocessed. The unprocessed nodes are held in a priority queue ordered by cost.

While the priority queue (set of unprocessed nodes) is not empty, remove the lowest-cost unprocessed node. Its tentative cost becomes its actual minimum cost. For each outgoing edge of the selected node, if the cost of the selected node plus the edge weight is less than the current tentative cost of the descendant, reduce ("relax") the cost of the descendant node because we've found a shorter path to the node. This cost reduction will change the position of the descendant node in the priority queue (reduce_key). Terminate the algorithm when the goal node is selected.

Note that, until a node is processed, its cost is tentative and may be updated zero or more times.

Implementation detail: when the cost of a node is updated, update the predecessor edge in the node so it points to the selected node it comes from. The shortest path is recovered by walking the chain backwards.

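class Edge;  // forward declaration: Vertex and Edge refer to each other

class Vertex {
  //...
  Edges edges_;   // outgoing edges
  int cost_;      // tentative cost until the vertex is processed
  Edge* parent_;  // predecessor edge on the best path found so far
};

class Edge {
  //...
  Vertex* from_;
  Vertex* to_;
  int weight_;
};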
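
The loop described above can be sketched as follows. This is a minimal sketch under several assumptions, not the assignment's design: vertices are plain integer indices, the graph is a hypothetical adjacency list (Graph) rather than the Vertex/Edge classes, and edge weights are nonnegative. std::priority_queue has no reduce_key operation, so the sketch pushes a fresh entry on every relaxation and skips stale entries when they are popped. It also runs until the queue is empty instead of stopping at a goal node.

```cpp
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical adjacency list: adj[u] holds (neighbor, weight) pairs.
using Graph = std::vector<std::vector<std::pair<int, int>>>;

const int kInfinity = std::numeric_limits<int>::max();

struct ShortestPaths {
  std::vector<int> cost;    // minimum cost, or kInfinity if unreachable
  std::vector<int> parent;  // predecessor on the shortest path, -1 if none
};

ShortestPaths dijkstra(const Graph& adj, int start) {
  const int n = static_cast<int>(adj.size());
  ShortestPaths result{std::vector<int>(n, kInfinity), std::vector<int>(n, -1)};
  std::vector<bool> processed(n, false);

  // Min-heap of (tentative cost, vertex).
  using Entry = std::pair<int, int>;
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> queue;

  result.cost[start] = 0;
  queue.push({0, start});

  while (!queue.empty()) {
    const int u = queue.top().second;
    queue.pop();
    if (processed[u]) continue;  // stale entry left behind by a relaxation
    processed[u] = true;         // cost[u] is now final

    for (const auto& edge : adj[u]) {
      const int v = edge.first;
      const int candidate = result.cost[u] + edge.second;
      if (candidate < result.cost[v]) {  // relax the edge u -> v
        result.cost[v] = candidate;
        result.parent[v] = u;
        queue.push({candidate, v});  // stands in for reduce_key
      }
    }
  }
  return result;
}

// Walk the parent chain backwards from goal, then reverse it.
std::vector<int> path_to(const ShortestPaths& sp, int goal) {
  std::vector<int> path;
  if (sp.cost[goal] == kInfinity) return path;  // unreachable: empty path
  for (int v = goal; v != -1; v = sp.parent[v]) path.push_back(v);
  return std::vector<int>(path.rbegin(), path.rend());
}
```

Calling dijkstra(adj, start) and then path_to(result, goal) gives the vertex sequence from start to goal, or an empty vector if goal cannot be reached.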

Observe the animation shown in the notes for lecture 11.

Cost Analysis

Naive implementation of the unprocessed set (no priority queue): each selection scans all |V| nodes and there are |V| selections, so selection costs O(|V|²) in total. Every edge also has to be visited once, so the total cost is O(|V|² + |E|). Since |E| = O(|V|²), the net cost is O(|V|²).
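
To make the O(|V|²) selection cost concrete, here is a sketch of the naive version. The adjacency-list type Graph and the function name dijkstra_naive are assumptions made for illustration, and edge weights are assumed nonnegative.

```cpp
#include <limits>
#include <utility>
#include <vector>

// Hypothetical adjacency list: adj[u] holds (neighbor, weight) pairs.
using Graph = std::vector<std::vector<std::pair<int, int>>>;

const int kInfinity = std::numeric_limits<int>::max();

std::vector<int> dijkstra_naive(const Graph& adj, int start) {
  const int n = static_cast<int>(adj.size());
  std::vector<int> cost(n, kInfinity);
  std::vector<bool> processed(n, false);
  cost[start] = 0;

  for (int round = 0; round < n; ++round) {
    // Selection: O(|V|) scan for the cheapest unprocessed vertex.
    int u = -1;
    for (int v = 0; v < n; ++v) {
      if (!processed[v] && (u == -1 || cost[v] < cost[u])) u = v;
    }
    if (u == -1 || cost[u] == kInfinity) break;  // the rest is unreachable
    processed[u] = true;

    // Relaxation: every edge is visited once over the whole run.
    for (const auto& edge : adj[u]) {
      const int v = edge.first;
      const int candidate = cost[u] + edge.second;
      if (candidate < cost[v]) cost[v] = candidate;
    }
  }
  return cost;
}
```

The outer loop runs at most |V| times and each selection scan is O(|V|), which is where the |V|² term comes from.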

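Using a priority queue helps when the graph is sufficiently sparse. With a binary heap, each selection and each reduce_key costs O(log |V|), for a total of O((|V| + |E|) log |V|). The bound O(|V| log |V| + |E|) requires a heap with O(1) amortized reduce_key, such as a Fibonacci heap.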

Argument for Correctness

Induction hypothesis: the lowest-cost path has been found for every processed vertex. Each time we process the next lowest-cost unprocessed vertex, it must have its minimum cost, or else there would have to be a lower-cost unprocessed vertex on its path. (This argument relies on edge weights being nonnegative.)

Hashing

Hash (v.): to chop up into little pieces & mix.

Dictionaries, so far, have been based on comparing keys. Insert/lookup/delete are each O(log N) and building an entire tree is O(N log N). This is equivalent to sorting, and the elements can be visited in sorted order in O(N) time.

Now consider a lookup table where keys are in the range 0..N-1. If N is reasonably small, we can hold all N elements in an array. Lookup is O(1) and ordered traversal is O(N).

Example: suppose we want a table of the number of 1-bits in a byte. Table size is 256 (number of distinct bytes).

Number	Binary Representation	Table Entry
0	0000 0000	0
1	0000 0001	1
2	0000 0010	1
3	0000 0011	2
4	0000 0100	1
5	0000 0101	2
6	0000 0110	2
7	0000 0111	3
8	0000 1000	1
9	0000 1001	2
10	0000 1010	2
11	0000 1011	3
12	0000 1100	2
13	0000 1101	3
14	0000 1110	3
15	0000 1111	4
16	0001 0000	1
...

Unfortunately, if we want the number of 1-bits in a 32-bit int, the table would need about 4 billion (2^32) entries. This is not tractable (although we can sum the 1-bit counts from each of the 4 bytes of the int).
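
The byte-at-a-time idea can be sketched like this. The function names make_bit_table and count_bits32 are made up for illustration. The table is filled using the observation that the bit count of i equals the bit count of i / 2 plus the lowest bit of i.

```cpp
#include <array>
#include <cstdint>

// Build the 256-entry table: bits(i) = bits(i / 2) + (lowest bit of i).
std::array<std::uint8_t, 256> make_bit_table() {
  std::array<std::uint8_t, 256> table{};  // table[0] starts at 0
  for (int i = 1; i < 256; ++i) {
    table[i] = static_cast<std::uint8_t>(table[i / 2] + (i & 1));
  }
  return table;
}

// Count the 1-bits of a 32-bit value by summing the entries for its 4 bytes.
int count_bits32(std::uint32_t n) {
  static const std::array<std::uint8_t, 256> table = make_bit_table();
  int total = 0;
  for (int shift = 0; shift < 32; shift += 8) {
    total += table[(n >> shift) & 0xFF];
  }
  return total;
}
```

This keeps the table at 256 entries and costs four O(1) lookups per 32-bit value.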

In practice, we may only want a few values, so we can save the few values we've already computed into a lookup table. This is known as memoization.

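// memoized (cached) computation
// lookup(), insert(), and compute_bits() are helpers not shown here
int get_bits(unsigned int n) {
  int value;
  if (!lookup(n, &value)) {  // not cached yet
    value = compute_bits(n);
    insert(n, value);        // save for next time
  }
  return value;
}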
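
One way to make this concrete, as a sketch only: use std::unordered_map as the lookup table. The compute_bits shown here is a stand-in chosen for illustration (it clears the lowest set bit until none remain); the notes do not say how the real one works.

```cpp
#include <unordered_map>

// Stand-in for the expensive computation: clear the lowest 1-bit until n is 0.
int compute_bits(unsigned int n) {
  int count = 0;
  while (n != 0) {
    n &= n - 1;
    ++count;
  }
  return count;
}

// Memoized wrapper: compute each distinct n at most once.
int get_bits(unsigned int n) {
  static std::unordered_map<unsigned int, int> cache;
  auto it = cache.find(n);
  if (it != cache.end()) return it->second;  // already computed
  const int value = compute_bits(n);
  cache.emplace(n, value);
  return value;
}
```

The cache only grows with the number of distinct arguments actually seen, not with the size of the keyspace.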

Hashing is simply mapping a large keyspace into the smaller range 0..M-1. A very simple hash function is H(x) = x mod M. This works pretty well in practice when M is prime (heuristic).

What makes a good hash function:

uniform distribution
independent of patterns in the data

It's that simple. Sounds too good to be true?

Collisions

What makes hashing problematic is the possibility of collisions: two keys that hash to the same value. For example, consider a table size of 101. If h(x) = x mod 101, we have h(4567) = 22 and h(7597) = 22.
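
A small sketch of h(x) = x mod M, useful for checking the collision above and for seeing why a prime M tends to behave better on patterned keys. The names hash_mod and max_bucket_load are made up for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// h(x) = x mod M
std::size_t hash_mod(unsigned long x, std::size_t m) { return x % m; }

// Size of the fullest bucket when the keys are hashed into m buckets (m > 0).
std::size_t max_bucket_load(const std::vector<unsigned long>& keys,
                            std::size_t m) {
  std::vector<std::size_t> load(m, 0);
  for (unsigned long key : keys) ++load[hash_mod(key, m)];
  return *std::max_element(load.begin(), load.end());
}
```

For the keys 0, 100, 200, ..., 900, a table of size 100 puts all ten keys in bucket 0, while a table of size 101 gives each key its own bucket.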

There are two solutions to the collision problem:

closed addressing (AKA open hashing or separate chaining)
open addressing (AKA closed hashing)

Separate Chaining

Separate chaining is simpler but requires extra memory and overhead. Each table entry (bucket or slot) is a sequence (list or vector).
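
A minimal sketch of separate chaining, assuming unsigned integer keys, int values, and h(x) = x mod (number of buckets). The class name ChainedTable and its interface are made up for illustration; each bucket is a std::list of (key, value) pairs.

```cpp
#include <cstddef>
#include <list>
#include <utility>
#include <vector>

class ChainedTable {
 public:
  explicit ChainedTable(std::size_t bucket_count) : buckets_(bucket_count) {}

  // Insert key, or overwrite its value if it is already present.
  void insert(unsigned int key, int value) {
    Chain& chain = bucket_for(key);
    for (auto& entry : chain) {
      if (entry.first == key) { entry.second = value; return; }
    }
    chain.push_back({key, value});
  }

  // Returns true and sets *value if key is present.
  bool lookup(unsigned int key, int* value) const {
    for (const auto& entry : buckets_[key % buckets_.size()]) {
      if (entry.first == key) { *value = entry.second; return true; }
    }
    return false;
  }

  // Returns true if key was present and has been removed.
  bool erase(unsigned int key) {
    Chain& chain = bucket_for(key);
    for (auto it = chain.begin(); it != chain.end(); ++it) {
      if (it->first == key) { chain.erase(it); return true; }
    }
    return false;
  }

 private:
  using Chain = std::list<std::pair<unsigned int, int>>;

  Chain& bucket_for(unsigned int key) { return buckets_[key % buckets_.size()]; }

  std::vector<Chain> buckets_;
};
```

Colliding keys simply share a chain, so a collision never disturbs any other bucket.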

TODO: diagram

Open Addressing

Open addressing inserts the second entry somewhere else within the table.

There are several possibilities for the somewhere else:

linear probing: use the next slot (or some fixed delta)
- this may lead to clustering
quadratic probing: delta is a quadratic function of the number of probes
- this may also lead to secondary clustering
double hashing: use a second hash function to determine the delta (keys that collide will usually have different secondary hashes, so their probe sequences diverge)

If the table is not too full, there will be gaps, so the probing will terminate fairly quickly (O(1) probes).

Since the overflow entry uses another slot, it may cause a collision in a slot that wouldn't otherwise have had one.

Deleting Entries

With separate chaining, delete is straightforward. With open addressing, typically the bucket has to be marked deleted rather than emptied, so that probing can continue past it to entries that were inserted later.
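
A sketch of open addressing with linear probing and deleted markers, assuming unsigned integer keys, int values, h(x) = x mod (table size), and a delta of 1. The class name ProbedTable and its interface are made up for illustration, and the table does not grow.

```cpp
#include <cstddef>
#include <vector>

class ProbedTable {
 public:
  explicit ProbedTable(std::size_t capacity) : slots_(capacity) {}

  // Returns false if no usable slot is left.
  bool insert(unsigned int key, int value) {
    const std::size_t n = slots_.size();
    std::size_t target = n;  // first reusable slot seen; n means "none yet"
    for (std::size_t i = 0; i < n; ++i) {
      const std::size_t index = (key % n + i) % n;
      Slot& slot = slots_[index];
      if (slot.state == kFull && slot.key == key) { slot.value = value; return true; }
      if (slot.state != kFull && target == n) target = index;
      if (slot.state == kEmpty) break;  // key cannot be stored further along
    }
    if (target == n) return false;
    slots_[target].state = kFull;
    slots_[target].key = key;
    slots_[target].value = value;
    return true;
  }

  // Returns true and sets *value if key is present.
  bool lookup(unsigned int key, int* value) const {
    const std::size_t index = find(key);
    if (index == slots_.size()) return false;
    *value = slots_[index].value;
    return true;
  }

  // Marks the slot deleted instead of emptying it, so later probes continue.
  bool erase(unsigned int key) {
    const std::size_t index = find(key);
    if (index == slots_.size()) return false;
    slots_[index].state = kDeleted;
    return true;
  }

 private:
  enum State { kEmpty, kFull, kDeleted };
  struct Slot {
    State state = kEmpty;
    unsigned int key = 0;
    int value = 0;
  };

  // Index of the slot holding key, or slots_.size() if key is absent.
  std::size_t find(unsigned int key) const {
    const std::size_t n = slots_.size();
    for (std::size_t i = 0; i < n; ++i) {
      const std::size_t index = (key % n + i) % n;
      const Slot& slot = slots_[index];
      if (slot.state == kEmpty) break;  // a never-used slot ends the probe
      if (slot.state == kFull && slot.key == key) return index;
    }
    return n;
  }

  std::vector<Slot> slots_;
};
```

If erase() reset the slot to kEmpty instead, a lookup for a key that had probed past that slot would stop early and wrongly report the key as missing.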

Hashing Strings

Typically, we want to convert the string to a 32-bit number and then use a numeric hash function.

Using the first two letters of a word (table size: 26 * 27 = 702 entries) does not give particularly good results. For example, all of the following words, which occur in a randomly-selected 179-word text, land in the same "th" bucket: than, the, their, theirs, them, themselves, thereof, they, thing, thinking, this, those, thought, through.
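
A sketch of such a two-letter hash. The exact mapping is an assumption, since the notes only give the table size: 26 choices for the first letter and 27 for the second (26 letters plus "no second letter"). The function name two_letter_hash is made up, and the input is assumed to be a non-empty lowercase ASCII word.

```cpp
#include <cstddef>
#include <string>

// Maps a non-empty lowercase ASCII word to a bucket in 0..701.
std::size_t two_letter_hash(const std::string& word) {
  const std::size_t first = static_cast<std::size_t>(word[0] - 'a');  // 0..25
  // 0 means "no second letter"; otherwise 1..26.
  const std::size_t second =
      word.size() > 1 ? static_cast<std::size_t>(word[1] - 'a') + 1 : 0;
  return first * 27 + second;
}
```

Every word starting with "th" gets the same bucket, no matter how long it is or what follows.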

Since hashing is so important, there are numerous hash functions that have been studied. Here's one:

djb2(s):
  h = 5381
  for c in s:
    h = 33 * h + c
  return h
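
In C++ the same function might look like the sketch below. Using a 32-bit unsigned accumulator is an assumption, chosen to match the 32-bit target mentioned above. Unsigned overflow wraps modulo 2^32, so no explicit masking is needed. The helper bucket_of is made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

std::uint32_t djb2(const std::string& s) {
  std::uint32_t h = 5381;
  for (unsigned char c : s) {
    h = 33 * h + c;  // unsigned arithmetic wraps modulo 2^32
  }
  return h;
}

// Reduce the 32-bit hash to a slot index in 0..table_size-1.
std::size_t bucket_of(const std::string& s, std::size_t table_size) {
  return djb2(s) % table_size;
}
```

For the experiment below, the table size would be 4051.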

Empirical data, for a 3585-word text with a table size of 4051:

# entries in bucket (collisions)	djb2	sdbm	Python
0 (empty slots)	1644	1663	1676
1	1525	1494	1472
2	646	644	655
3	188	197	197
4	41	35	44
5	3	11	6
6	3	2	1
7	1

Most of the time, the value will be found in one or two probes. Worst case is 7 probes. Contrast this with a binary search with a tree depth of 12.