The core of the algorithm is simple: initialize the tentative cost of each node (zero for the start node, infinity for all other nodes). Initially every node is unprocessed. The unprocessed nodes are held in a priority queue ordered by cost.
While the priority queue (set of unprocessed nodes) is not empty,
remove the lowest-cost unprocessed node. Its tentative cost
becomes its actual minimum cost. For each outgoing edge of the
selected node, if the cost of the selected node plus the edge
weight is less than the current tentative cost of the descendant,
reduce ("relax") the cost of the descendant node because we've
found a shorter path to the node. This cost reduction will change
the position of the descendant node in the priority queue
(reduce_key).
Terminate the algorithm when the goal node is selected.
Note that, until a node is processed, its cost is tentative and may be updated zero or more times.
Implementation detail: when the cost of a node is updated, update the predecessor edge in the node so it points to the selected node it comes from. The shortest path is recovered by walking the chain backwards.
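```cpp
class Edge;  // forward declaration: Vertex refers to Edge before it is defined

class Vertex {
    //...
    Edges edges_;  // outgoing edges
    int cost_;     // tentative cost until the vertex is processed
    Edge* parent;  // predecessor edge on the best path found so far
};

class Edge {
    //...
    Vertex* from_;
    Vertex* to_;
    int weight_;
};
```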
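The steps described above can be sketched in C++ roughly as follows. This is a minimal sketch under stated assumptions, not the course's implementation: it uses an adjacency list of (neighbor, weight) pairs instead of the Vertex and Edge classes, and because `std::priority_queue` has no reduce_key operation it pushes a fresh entry on every relaxation and skips stale entries when they surface (lazy deletion). The names `Graph`, `Result`, `dijkstra` and `path_to` are made up for illustration.

```cpp
#include <algorithm>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

// Assumed representation: adj[u] lists (neighbor, weight) pairs; weights are non-negative.
using Graph = std::vector<std::vector<std::pair<int, int>>>;

const int kInf = std::numeric_limits<int>::max();

struct Result {
    std::vector<int> cost;    // minimum cost from the start node (kInf if not reached)
    std::vector<int> parent;  // predecessor node on the best path (-1 if none)
};

Result dijkstra(const Graph& adj, int start, int goal) {
    Result r{std::vector<int>(adj.size(), kInf), std::vector<int>(adj.size(), -1)};
    std::vector<bool> processed(adj.size(), false);
    using Item = std::pair<int, int>;  // (tentative cost, node)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> queue;

    r.cost[start] = 0;
    queue.push({0, start});
    while (!queue.empty()) {
        const int u = queue.top().second;
        queue.pop();
        if (processed[u]) continue;  // stale entry left behind by a later relaxation
        processed[u] = true;         // the tentative cost of u is now final
        if (u == goal) break;        // terminate when the goal node is selected
        for (const auto& edge : adj[u]) {
            const int v = edge.first;
            const int candidate = r.cost[u] + edge.second;
            if (!processed[v] && candidate < r.cost[v]) {
                r.cost[v] = candidate;       // relax
                r.parent[v] = u;             // remember where the shorter path came from
                queue.push({candidate, v});  // stands in for reduce_key
            }
        }
    }
    return r;
}

// Recover the path by walking the predecessor chain backwards, then reversing it.
std::vector<int> path_to(const Result& r, int goal) {
    std::vector<int> path;
    if (r.cost[goal] == kInf) return path;  // goal was never reached
    for (int v = goal; v != -1; v = r.parent[v]) path.push_back(v);
    std::reverse(path.begin(), path.end());
    return path;
}
```

Lazy deletion keeps the code short at the price of a larger queue (up to one entry per edge); a heap with a true reduce_key avoids the duplicates.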
Observe the animation shown in the notes for lecture 11.
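Naive implementation of the unprocessed set (no priority queue): each selection scans the unprocessed nodes in O(|V|), so selection costs O(|V|²) overall, plus every edge has to be visited, so the total cost is O(|V|² + |E|). Since |E| = O(|V|²) for a dense graph, the net cost is O(|V|²). If the graph is sufficiently sparse, using a priority queue with a constant-time reduce_key (such as a Fibonacci heap) gives performance O(|V| log |V| + |E|); a binary heap gives O((|V| + |E|) log |V|).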
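To make the naive bound concrete, here is a hedged sketch of the version without a priority queue. The adjacency-list type `Graph` and the function name `dijkstra_naive` are assumptions for illustration; the point is only that the inner scan is where the |V|² term comes from.

```cpp
#include <limits>
#include <utility>
#include <vector>

// Assumed representation: adj[u] lists (neighbor, weight) pairs; weights are non-negative.
using Graph = std::vector<std::vector<std::pair<int, int>>>;

std::vector<int> dijkstra_naive(const Graph& adj, int start) {
    const int inf = std::numeric_limits<int>::max();
    const int n = static_cast<int>(adj.size());
    std::vector<int> cost(n, inf);
    std::vector<bool> processed(n, false);
    cost[start] = 0;
    for (int round = 0; round < n; ++round) {
        // Selection: linear scan for the cheapest unprocessed node, O(|V|) per round.
        int u = -1;
        for (int v = 0; v < n; ++v) {
            if (!processed[v] && (u == -1 || cost[v] < cost[u])) u = v;
        }
        if (u == -1 || cost[u] == inf) break;  // everything left is unreachable
        processed[u] = true;
        // Relaxation: each edge is examined once over the whole run, O(|E|) in total.
        for (const auto& edge : adj[u]) {
            if (cost[u] + edge.second < cost[edge.first]) {
                cost[edge.first] = cost[u] + edge.second;
            }
        }
    }
    return cost;
}
```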
Induction hypothesis: the lowest-cost path has been found for every processed vertex. Each time we process the next lowest-cost unprocessed vertex, its tentative cost must be its minimum cost; otherwise there would have to be a lower-cost unprocessed vertex on its path, which contradicts the choice of the lowest-cost vertex (edge weights are assumed to be non-negative).
Hash (v.): to chop up into little pieces & mix.
Dictionaries, so far, have been based on comparing keys. Insert/lookup/delete are each O(log N) and building an entire tree is O(N log N). This is equivalent to sorting, and the elements can be visited in sorted order in O(N) time.
Now consider a lookup table where keys are in the range 0..N-1. If N is reasonably small, we can hold all N elements in an array. Lookup is O(1) and ordered traversal is O(N).
Example: suppose we want a table of the number of 1-bits in a byte. The table size is 256 (the number of distinct byte values).
| Number | Binary Representation | Table Entry |
|---|---|---|
| 0 | 0000 0000 | 0 |
| 1 | 0000 0001 | 1 |
| 2 | 0000 0010 | 1 |
| 3 | 0000 0011 | 2 |
| 4 | 0000 0100 | 1 |
| 5 | 0000 0101 | 2 |
| 6 | 0000 0110 | 2 |
| 7 | 0000 0111 | 3 |
| 8 | 0000 1000 | 1 |
| 9 | 0000 1001 | 2 |
| 10 | 0000 1010 | 2 |
| 11 | 0000 1011 | 3 |
| 12 | 0000 1100 | 2 |
| 13 | 0000 1101 | 3 |
| 14 | 0000 1110 | 3 |
| 15 | 0000 1111 | 4 |
| 16 | 0001 0000 | 1 |
| ... | ... | ... |
Unfortunately, if we want the number of 1-bits in a 32-bit int, there would be about 4 billion (2^32) entries in the table. This is not practical (although we can sum the 1-bit counts from each of the 4 bytes of the int).
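The byte table and the four-lookup trick can be sketched as follows. The function names `make_bit_table` and `count_bits32` are illustrative, and the table is built with the recurrence "bits of i = bits of i / 2 plus the lowest bit of i", which is one of several ways to fill it.

```cpp
#include <array>
#include <cstdint>

// Build the 256-entry table: entry i holds the number of 1-bits in byte value i.
std::array<unsigned char, 256> make_bit_table() {
    std::array<unsigned char, 256> table{};  // entry 0 starts at 0
    for (int i = 1; i < 256; ++i) {
        // i has the same 1-bits as i / 2, plus its own lowest bit.
        table[i] = static_cast<unsigned char>(table[i / 2] + (i & 1));
    }
    return table;
}

// Count the 1-bits of a 32-bit value with four table lookups, one per byte.
int count_bits32(std::uint32_t n) {
    static const std::array<unsigned char, 256> table = make_bit_table();
    return table[n & 0xFF] + table[(n >> 8) & 0xFF] +
           table[(n >> 16) & 0xFF] + table[n >> 24];
}
```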
In practice, we may only want a few values, so we can save the few values we've already computed in a lookup table. This is known as memoization.
```cpp
// memoized (cached) computation
int get_bits(unsigned int n) {
    int value;
    if (!lookup(n, &value)) {     // not in the table yet
        value = compute_bits(n);
        insert(n, value);         // remember it for next time
    }
    return value;
}
```
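For a version that compiles on its own, one option is to let `std::unordered_map` (itself a hash table) play the role of lookup() and insert(). This is only a sketch: `compute_bits` below is a hypothetical bit-by-bit counter standing in for whatever expensive computation is being cached.

```cpp
#include <unordered_map>

// Hypothetical stand-in for compute_bits(): counts 1-bits one at a time.
int compute_bits(unsigned int n) {
    int count = 0;
    for (; n != 0; n >>= 1) count += static_cast<int>(n & 1u);
    return count;
}

int get_bits(unsigned int n) {
    static std::unordered_map<unsigned int, int> cache;  // values computed so far
    auto it = cache.find(n);
    if (it != cache.end()) return it->second;  // already computed: reuse it
    const int value = compute_bits(n);
    cache.emplace(n, value);                   // remember it for next time
    return value;
}
```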
Hashing is simply mapping a large keyspace into the smaller range 0..M-1. A very simple hash function is H(x) = x mod M. This works pretty well in practice when M is prime (heuristic).
What makes a good hash function:
it's that simple. Sounds too good to be true?
What makes hashing problematic is the possibility of collisions: two keys that hash to the same value. For example, consider a table size of 101. If h(x) = x mod 101, we have h(4567) = 22 and h(7597) = 22.
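The collision above is easy to reproduce. A minimal sketch, where the function name `hash_mod` is an illustrative choice:

```cpp
// Simple modular hash: maps a key into the range 0..m-1 (m must be non-zero).
unsigned int hash_mod(unsigned int x, unsigned int m) {
    return x % m;
}
```

With m = 101, both `hash_mod(4567, 101)` and `hash_mod(7597, 101)` evaluate to 22, since 4567 = 45 * 101 + 22 and 7597 = 75 * 101 + 22.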
There are two solutions to the collision problem: separate chaining and open addressing.
Separate chaining is simpler but requires extra memory and overhead. Each table entry (bucket or slot) is a sequence (list or vector).
TODO: diagram
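A hedged sketch of separate chaining, assuming unsigned integer keys, int values and the hash h(x) = x mod M. The class name `ChainedTable` and its interface are invented for illustration; each bucket is a `std::list` of (key, value) pairs.

```cpp
#include <cstddef>
#include <list>
#include <utility>
#include <vector>

// Chained hash table sketch; the number of buckets must be non-zero.
class ChainedTable {
public:
    explicit ChainedTable(std::size_t buckets) : buckets_(buckets) {}

    // Insert a new entry, or overwrite the value if the key is already present.
    void insert(unsigned int key, int value) {
        auto& chain = buckets_[key % buckets_.size()];
        for (auto& entry : chain) {
            if (entry.first == key) { entry.second = value; return; }
        }
        chain.push_back({key, value});  // colliding keys simply share the chain
    }

    // Returns true and fills *value if the key is present.
    bool lookup(unsigned int key, int* value) const {
        const auto& chain = buckets_[key % buckets_.size()];
        for (const auto& entry : chain) {
            if (entry.first == key) { *value = entry.second; return true; }
        }
        return false;
    }

    // Delete is just removal from the chain; returns false if the key is absent.
    bool erase(unsigned int key) {
        auto& chain = buckets_[key % buckets_.size()];
        for (auto it = chain.begin(); it != chain.end(); ++it) {
            if (it->first == key) { chain.erase(it); return true; }
        }
        return false;
    }

private:
    std::vector<std::list<std::pair<unsigned int, int>>> buckets_;
};
```

The extra memory mentioned above shows up here as one list node allocation per entry.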
Open addressing inserts the second entry somewhere else within the table.
There are several possibilities for the somewhere else:
If the table is not too full, there will be gaps, so the probing will terminate fairly quickly (O(1) probes).
Since the overflowing entry occupies another slot, it may in turn cause an overflow for a key whose own slot would otherwise have been free.
With separate chaining, delete is straightforward. With open addressing, the bucket typically has to be marked as deleted rather than emptied, so that probing can continue past it to entries that were inserted after a collision.
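The deleted marker can be illustrated with a small open-addressing table. This sketch assumes linear probing (try the next slot, wrapping around), which is only one possible probing strategy; the class name `ProbedTable`, the slot states and the interface are all invented for illustration.

```cpp
#include <cstddef>
#include <vector>

// Open addressing with linear probing; the number of slots must be non-zero.
class ProbedTable {
public:
    explicit ProbedTable(std::size_t slots) : slots_(slots) {}

    // Insert or overwrite; returns false if no free slot is left.
    bool insert(unsigned int key, int value) {
        const std::size_t n = slots_.size();
        std::size_t target = n;  // first reusable (empty or deleted) slot seen
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t pos = (key % n + i) % n;
            Slot& s = slots_[pos];
            if (s.state == kUsed && s.key == key) { s.value = value; return true; }
            if (s.state != kUsed && target == n) target = pos;
            if (s.state == kEmpty) break;  // the key cannot be stored further along
        }
        if (target == n) return false;
        slots_[target].state = kUsed;
        slots_[target].key = key;
        slots_[target].value = value;
        return true;
    }

    // Returns true and fills *value if the key is present.
    bool lookup(unsigned int key, int* value) const {
        const std::size_t pos = find(key);
        if (pos == slots_.size()) return false;
        *value = slots_[pos].value;
        return true;
    }

    // Mark the slot as deleted instead of empty so later probes keep going.
    bool erase(unsigned int key) {
        const std::size_t pos = find(key);
        if (pos == slots_.size()) return false;
        slots_[pos].state = kDeleted;
        return true;
    }

private:
    enum State { kEmpty, kUsed, kDeleted };
    struct Slot {
        State state = kEmpty;
        unsigned int key = 0;
        int value = 0;
    };

    // Index of the slot holding key, or slots_.size() if the key is absent.
    std::size_t find(unsigned int key) const {
        const std::size_t n = slots_.size();
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t pos = (key % n + i) % n;
            if (slots_[pos].state == kEmpty) break;  // never-used slot ends the probe
            if (slots_[pos].state == kUsed && slots_[pos].key == key) return pos;
            // deleted slot or a different key: keep probing
        }
        return n;
    }

    std::vector<Slot> slots_;
};
```

With 101 slots, 4567 lands in slot 22 and 7597 is pushed to slot 23. If erasing 4567 reset slot 22 to empty, a later lookup of 7597 would stop at slot 22 and wrongly report the key as missing.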
Typically, we want to convert the string to a 32-bit number and then use a numeric hash function.
Using the first two letters of a word (table size: 26 * 27 = 702 entries) does not give particularly good results. For example, all of these words, occurring in a randomly-selected 179-word text, start with the same two letters and so collide: than, the, their, theirs, them, themselves, thereof, they, thing, thinking, this, those, thought, through.
Since hashing is so important, there are numerous hash functions that have been studied. Here's one:
```
djb2(s):
    h = 5381
    for c in s:
        h = 33 * h + c
    return h
```
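In C++ the djb2 pseudocode might look like the sketch below. It assumes a 32-bit result, so the multiplication wraps around on overflow (well defined for unsigned arithmetic), and it treats each character as an unsigned byte. The helper `bucket_of`, which reduces the hash to a table index with mod M, is a hypothetical addition.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// djb2 string hash with a 32-bit result; overflow wraps around.
std::uint32_t djb2(const std::string& s) {
    std::uint32_t h = 5381;
    for (unsigned char c : s) {
        h = 33 * h + c;
    }
    return h;
}

// Hypothetical helper: map a string to a bucket index in 0..table_size-1.
std::size_t bucket_of(const std::string& s, std::size_t table_size) {
    return djb2(s) % table_size;
}
```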
Empirical data, for a 3585-word text with a table size of 4051:
| # entries in bucket (collisions) | djb2 | sdbm | Python |
|---|---|---|---|
| 0 (empty slots) | 1644 | 1663 | 1676 |
| 1 | 1525 | 1494 | 1472 |
| 2 | 646 | 644 | 655 |
| 3 | 188 | 197 | 197 |
| 4 | 41 | 35 | 44 |
| 5 | 3 | 11 | 6 |
| 6 | 3 | 2 | 1 |
| 7 | 1 | | |