CSS 343: Notes from Lecture 12 (DRAFT)

Administrivia

Dijkstra's algorithm

The core of the algorithm is actually pretty simple: initialize the tentative cost of each node (zero for the start node, infinity for all other nodes). Initially every node is unprocessed. The unprocessed nodes are held in a priority queue ordered by cost.

While the priority queue (set of unprocessed nodes) is not empty, remove the lowest-cost unprocessed node. Its tentative cost becomes its actual minimum cost. For each outgoing edge of the selected node, if the cost of the selected node plus the edge weight is less than the current tentative cost of the descendant, reduce ("relax") the cost of the descendant node because we've found a shorter path to the node. This cost reduction will change the position of the descendant node in the priority queue (reduce_key). Terminate the algorithm when the goal node is selected.

Note that, until a node is processed, its cost is tentative and may be updated zero or more times.

Implementation detail: when the cost of a node is updated, update the predecessor edge in the node so it points to the selected node it comes from. The shortest path is recovered by walking the chain backwards.

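class Edge;  // forward declaration; defined below

class Vertex {
  //...
  Edges edges_;   // outgoing edges
  int cost_;      // tentative cost until the vertex is processed
  Edge* parent_;  // predecessor edge on the best path found so far
};

class Edge {
  //...
  Vertex* from_;
  Vertex* to_;
  int weight_;
};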
  

Observe the animation shown in the notes for lecture 11.
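
The steps above can be sketched in C++ roughly as follows. This is only a sketch under several assumptions: the Graph typedef, ShortestPaths, dijkstra and path_to are names made up for illustration (they are not the Vertex/Edge classes from lecture), edge weights are assumed non-negative, and the loop runs until the queue is empty rather than stopping at a goal node. Since std::priority_queue has no reduce_key, the sketch pushes a fresh entry on each relaxation and skips stale entries when they are popped.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical graph type for this sketch: graph[u] lists (neighbor, weight).
typedef std::vector<std::vector<std::pair<int, int> > > Graph;

const long kInfinity = std::numeric_limits<long>::max();

struct ShortestPaths {
  std::vector<long> cost;   // minimum cost from the start; kInfinity if unreachable
  std::vector<int> parent;  // predecessor on the shortest path; -1 if none
};

ShortestPaths dijkstra(const Graph& graph, int start) {
  assert(start >= 0 && static_cast<std::size_t>(start) < graph.size());
  ShortestPaths result;
  result.cost.assign(graph.size(), kInfinity);
  result.parent.assign(graph.size(), -1);

  typedef std::pair<long, int> Item;  // (tentative cost, node)
  std::priority_queue<Item, std::vector<Item>, std::greater<Item> > queue;
  result.cost[start] = 0;
  queue.push(Item(0, start));

  while (!queue.empty()) {
    const Item top = queue.top();
    queue.pop();
    const long cost = top.first;
    const int u = top.second;
    if (cost > result.cost[u]) continue;  // stale entry left by an earlier relaxation

    for (std::size_t i = 0; i < graph[u].size(); ++i) {
      const int v = graph[u][i].first;
      const long candidate = cost + graph[u][i].second;
      if (candidate < result.cost[v]) {  // relax: found a shorter path to v
        result.cost[v] = candidate;
        result.parent[v] = u;
        queue.push(Item(candidate, v));  // stands in for reduce_key
      }
    }
  }
  return result;
}

// Walks the parent chain back from goal; returns an empty path if unreachable.
std::vector<int> path_to(const ShortestPaths& paths, int goal) {
  std::vector<int> path;
  if (paths.cost[goal] == kInfinity) return path;
  for (int v = goal; v != -1; v = paths.parent[v]) path.push_back(v);
  std::reverse(path.begin(), path.end());
  return path;
}
```

Pushing duplicates keeps the code short at the price of a larger queue (up to one entry per edge).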

Cost Analysis

Naive implementation (unprocessed set kept without a priority queue): each selection scans all the nodes in O(|V|), so the |V| selections cost O(|V|^2) in total. Every edge also has to be visited, so the total cost is O(|V|^2 + |E|). Since |E| = O(|V|^2) for a dense graph, the net cost is O(|V|^2).
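
To see where the quadratic selection cost comes from, here is a minimal sketch of the naive selection step. The function name and the two parallel vectors are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the index of the lowest-cost unprocessed node, or -1 if none remain.
// One call scans all |V| entries, so |V| calls cost O(|V|^2) in total.
int select_min_unprocessed(const std::vector<long>& cost,
                           const std::vector<bool>& processed) {
  assert(cost.size() == processed.size());
  int best = -1;
  for (std::size_t i = 0; i < cost.size(); ++i) {
    if (processed[i]) continue;
    if (best == -1 || cost[i] < cost[best]) best = static_cast<int>(i);
  }
  return best;
}
```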

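If the graph is sufficiently sparse, a priority queue does better. With a binary heap the cost is O((|V| + |E|) log |V|); with a heap that supports O(1) amortized reduce_key (such as a Fibonacci heap) it is O(|V| log |V| + |E|).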

Argument for Correctness

Induction hypothesis: the lowest-cost path has been found for every processed vertex. Each time we process the next lowest-cost unprocessed vertex, it must have its minimum cost, or else there would have to be a lower-cost unprocessed vertex on its path. (This relies on edge weights being non-negative.)

Hashing

Hash (v.): to chop up into little pieces & mix.

Dictionaries, so far, have been based on comparing keys. Insert/lookup/delete are each O(log N) and building an entire tree is O(N log N). This is equivalent to sorting, and the elements can be visited in sorted order in O(N) time.

Now consider a lookup table where keys are in the range 0..N-1. If N is reasonably small, we can hold all N elements in an array. Lookup is O(1) and ordered traversal is O(N).

Example: suppose we want a table of the number of 1-bits in a byte. Table size is 256 (number of distinct bytes).

Number  Binary Representation  Table Entry
0       0000 0000              0
1       0000 0001              1
2       0000 0010              1
3       0000 0011              2
4       0000 0100              1
5       0000 0101              2
6       0000 0110              2
7       0000 0111              3
8       0000 1000              1
9       0000 1001              2
10      0000 1010              2
11      0000 1011              3
12      0000 1100              2
13      0000 1101              3
14      0000 1110              3
15      0000 1111              4
16      0001 0000              1
...

Unfortunately, if we want the number of 1-bits in a 32-bit int, there would be about 4 billion (2^32) entries in the table. This is not tractable (although we can sum the 1-bit counts from each of the 4 bytes of the int).
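
As a rough sketch of both ideas, the code below builds the 256-entry table and then counts the bits of a 32-bit value one byte at a time. The function names are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Builds the 256-entry table: entry i holds the number of 1-bits in byte i.
std::vector<int> build_bit_count_table() {
  std::vector<int> table(256, 0);
  for (int i = 1; i < 256; ++i) {
    // i / 2 drops the lowest bit; its count is already in the table.
    table[i] = table[i / 2] + (i & 1);
  }
  return table;
}

// Counts the 1-bits in a 32-bit value by summing the counts of its 4 bytes.
int count_bits32(std::uint32_t n, const std::vector<int>& table) {
  assert(table.size() == 256);
  int total = 0;
  for (int byte = 0; byte < 4; ++byte) {
    total += table[(n >> (8 * byte)) & 0xFFu];
  }
  return total;
}
```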

In practice, we may only want a few values, so we can save the few values we've already computed into a lookup table. This is known as memoization.

// memoized (cached) computation
int get_bits(unsigned int n) {
  int value;
  if (!lookup(n, &value)) {   // not in the table yet
    value = compute_bits(n);
    insert(n, value);         // save it for next time
  }
  return value;
}
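
For a version that compiles on its own, one option is to let std::unordered_map play the role of lookup() and insert(). This is a sketch, not the implementation from lecture; the body of compute_bits here is an assumed stand-in.

```cpp
#include <cassert>
#include <unordered_map>

// Assumed stand-in for compute_bits: clears the lowest set bit on each pass.
int compute_bits(unsigned int n) {
  int count = 0;
  while (n != 0) {
    n &= n - 1;
    ++count;
  }
  return count;
}

// Memoized wrapper: the map stores every result computed so far.
int get_bits(unsigned int n) {
  static std::unordered_map<unsigned int, int> cache;
  std::unordered_map<unsigned int, int>::const_iterator it = cache.find(n);
  if (it != cache.end()) return it->second;  // already computed
  const int value = compute_bits(n);
  cache[n] = value;
  return value;
}
```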
  

Hashing is simply mapping a large keyspace into the smaller range 0..M-1. A very simple hash function is H(x) = x mod M. This works pretty well in practice when M is prime (heuristic).

What makes a good hash function:

It's that simple. Sounds too good to be true?

Collisions

What makes hashing problematic is the possibility of collisions: two keys that hash to the same value. For example, consider a table size of 101. If h(x) = x mod 101, we have h(4567) = 22 and h(7597) = 22.
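
The collision above is easy to reproduce. A minimal sketch, with hash_mod as an assumed name:

```cpp
#include <cassert>

// h(x) = x mod M, mapping any key into the range 0..M-1.
unsigned int hash_mod(unsigned int key, unsigned int table_size) {
  assert(table_size > 0);
  return key % table_size;
}
```

Here hash_mod(4567, 101) and hash_mod(7597, 101) both return 22.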

There are two solutions to the collision problem:

  1. closed addressing (AKA open hashing or separate chaining)
  2. open addressing (AKA closed hashing)

Separate Chaining

Separate chaining is simpler but requires extra memory and overhead. Each table entry (bucket or slot) is a sequence (list or vector).
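
A minimal sketch of separate chaining, assuming unsigned keys, int values and h(x) = x mod M. ChainedTable and its methods are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <utility>
#include <vector>

// Hypothetical chained table: each bucket is a list of (key, value) pairs.
class ChainedTable {
  typedef std::list<std::pair<unsigned int, int> > Bucket;
  std::vector<Bucket> buckets_;

 public:
  explicit ChainedTable(std::size_t bucket_count) : buckets_(bucket_count) {
    assert(bucket_count > 0);
  }

  // Inserts the pair, or overwrites the value if the key is already present.
  void insert(unsigned int key, int value) {
    Bucket& bucket = buckets_[key % buckets_.size()];
    for (Bucket::iterator it = bucket.begin(); it != bucket.end(); ++it) {
      if (it->first == key) {
        it->second = value;
        return;
      }
    }
    bucket.push_back(std::make_pair(key, value));
  }

  // Returns true and stores the value in *value if the key is present.
  bool lookup(unsigned int key, int* value) const {
    const Bucket& bucket = buckets_[key % buckets_.size()];
    for (Bucket::const_iterator it = bucket.begin(); it != bucket.end(); ++it) {
      if (it->first == key) {
        *value = it->second;
        return true;
      }
    }
    return false;
  }

  // Removes the key if present; returns whether anything was removed.
  bool erase(unsigned int key) {
    Bucket& bucket = buckets_[key % buckets_.size()];
    for (Bucket::iterator it = bucket.begin(); it != bucket.end(); ++it) {
      if (it->first == key) {
        bucket.erase(it);
        return true;
      }
    }
    return false;
  }
};
```

Colliding keys such as 4567 and 7597 (with 101 buckets) simply share one list, which is where the extra memory and overhead come from.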

TODO: diagram

Open Addressing

Open addressing inserts the second entry somewhere else within the table.

There are several possibilities for the somewhere else:

If the table is not too full, there will be gaps, so the probing will terminate fairly quickly (O(1) probes).

Since the overflowing entry occupies another slot, it may cause a collision for a key that wouldn't otherwise have collided.
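
This knock-on effect can be seen in a small sketch. It assumes linear probing (try the next slot, wrapping around) as the choice of "somewhere else", non-negative keys, and h(x) = x mod M; insert_linear and kEmpty are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const long kEmpty = -1;  // marks an unused slot; keys are assumed non-negative

// Inserts key using linear probing: try h, h+1, h+2, ... wrapping around.
// Returns the slot used, or -1 if the table is full.
int insert_linear(std::vector<long>& table, long key) {
  assert(key >= 0 && !table.empty());
  const std::size_t size = table.size();
  const std::size_t home = static_cast<std::size_t>(key) % size;
  for (std::size_t i = 0; i < size; ++i) {
    const std::size_t slot = (home + i) % size;
    if (table[slot] == kEmpty || table[slot] == key) {
      table[slot] = key;
      return static_cast<int>(slot);
    }
  }
  return -1;
}
```

With 101 slots, 4567 lands in slot 22 and 7597 is pushed to slot 23. The key 23, whose home slot is 23, then has to go to slot 24 even though nothing else hashes to 23.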

Deleting Entries

With separate chaining, delete is straightforward. With open addressing, typically the bucket has to be marked as deleted (rather than emptied) so that probing for other keys can continue past it.
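
A minimal sketch of the deleted marker, again assuming linear probing and h(x) = x mod M. ProbingSet and its members are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical open-addressing set of unsigned keys with linear probing.
class ProbingSet {
  enum State { kEmpty, kOccupied, kDeleted };
  struct Slot {
    Slot() : key(0), state(kEmpty) {}
    unsigned int key;
    State state;
  };
  std::vector<Slot> slots_;

  // Slot index for the given probe number.
  std::size_t index(unsigned int key, std::size_t probe) const {
    return (key % slots_.size() + probe) % slots_.size();
  }

  // Returns the index of the slot holding key, or slots_.size() if absent.
  // A deleted slot does not stop the search; only an empty one does.
  std::size_t find(unsigned int key) const {
    for (std::size_t i = 0; i < slots_.size(); ++i) {
      const std::size_t pos = index(key, i);
      if (slots_[pos].state == kEmpty) break;
      if (slots_[pos].state == kOccupied && slots_[pos].key == key) return pos;
    }
    return slots_.size();
  }

 public:
  explicit ProbingSet(std::size_t size) : slots_(size) { assert(size > 0); }

  bool contains(unsigned int key) const { return find(key) != slots_.size(); }

  // Returns false only if the table is full.
  bool insert(unsigned int key) {
    if (contains(key)) return true;
    for (std::size_t i = 0; i < slots_.size(); ++i) {
      Slot& slot = slots_[index(key, i)];
      if (slot.state != kOccupied) {  // empty and deleted slots can be reused
        slot.key = key;
        slot.state = kOccupied;
        return true;
      }
    }
    return false;
  }

  // Marks the slot as deleted instead of emptying it.
  bool erase(unsigned int key) {
    const std::size_t pos = find(key);
    if (pos == slots_.size()) return false;
    slots_[pos].state = kDeleted;
    return true;
  }
};
```

If erase reset the slot to empty instead, a lookup of 7597 after deleting 4567 would stop at slot 22 and wrongly report the key as missing.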

Hashing Strings

Typically, we want to convert the string to a 32-bit number and then use a numeric hash function.

Using the first two letters of a word (table size: 26 * 27 = 702 entries) does not give particularly good results. For example, a randomly selected 179-word text contained all of the following words, which would land in the same slot: than, the, their, theirs, them, themselves, thereof, they, thing, thinking, this, those, thought, through.
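
One way such a two-letter hash could look is sketched below. The exact encoding is an assumption (the notes only give the table size): 26 choices for the first letter times 27 for the second, where the extra value stands for "no second letter". It also assumes ASCII input starting with a letter.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical hash on the first two letters, giving a value in 0..701.
int two_letter_hash(const std::string& word) {
  assert(!word.empty() && std::isalpha(static_cast<unsigned char>(word[0])));
  const int first = std::tolower(static_cast<unsigned char>(word[0])) - 'a';
  int second = 0;  // 0 means "no second letter"
  if (word.size() > 1 && std::isalpha(static_cast<unsigned char>(word[1]))) {
    second = std::tolower(static_cast<unsigned char>(word[1])) - 'a' + 1;
  }
  return first * 27 + second;
}
```

Every word beginning with "th" gets the same value under this scheme, so "the", "thing" and "thought" all collide.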

Since hashing is so important, there are numerous hash functions that have been studied. Here's one:

djb2(s):
  h = 5381
  for c in s:
    h = 33 * h + c
  return h
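
In C++ the same function might look like the sketch below. Keeping h in an unsigned 32-bit integer is an assumption that matches the 32-bit goal stated above; overflow simply wraps around. bucket_for is a made-up helper that reduces the hash to a table index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// djb2 as given above, with h kept in 32 bits so it wraps on overflow.
std::uint32_t djb2(const std::string& s) {
  std::uint32_t h = 5381;
  for (std::size_t i = 0; i < s.size(); ++i) {
    h = 33 * h + static_cast<unsigned char>(s[i]);
  }
  return h;
}

// Reduces the 32-bit hash to a bucket index in 0..table_size-1.
std::size_t bucket_for(const std::string& s, std::size_t table_size) {
  assert(table_size > 0);
  return djb2(s) % table_size;
}
```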
  

Empirical data, for a 3585-word text with a table size of 4051:

# entries in bucket
(collisions)       djb2   sdbm   Python
0 (empty slots)    1644   1663   1676
1                  1525   1494   1472
2                   646    644    655
3                   188    197    197
4                    41     35     44
5                     3     11      6
6                     3      2      1
7                     1

Most of the time, the value will be found in one or two probes. Worst case is 7 probes. Contrast this with a binary search with a tree depth of 12.
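
A tally like the one above could be produced along these lines. This is a sketch, not the script used for the table; it assumes the caller supplies one precomputed hash value per distinct word, and occupancy_histogram is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns a histogram where entry k is the number of buckets holding
// exactly k of the given hash values.
std::vector<std::size_t> occupancy_histogram(
    const std::vector<std::uint32_t>& hashes, std::size_t table_size) {
  assert(table_size > 0);
  std::vector<std::size_t> bucket_sizes(table_size, 0);
  for (std::size_t i = 0; i < hashes.size(); ++i) {
    ++bucket_sizes[hashes[i] % table_size];
  }
  std::size_t largest = 0;
  for (std::size_t b = 0; b < table_size; ++b) {
    if (bucket_sizes[b] > largest) largest = bucket_sizes[b];
  }
  std::vector<std::size_t> histogram(largest + 1, 0);
  for (std::size_t b = 0; b < table_size; ++b) {
    ++histogram[bucket_sizes[b]];
  }
  return histogram;
}
```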