Consider the difference between:
(reinterpret_cast<char*>(p) + n)
and
(reinterpret_cast<char*>(p + n))
The above bug should, ideally, be caught by a unit test. In any case, there should be a regression test to make sure such a bug stays fixed.
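The two expressions can be compared directly in a small test. The following is a minimal sketch, assuming a 32-bit element type; the helper names `byte_offset`, `cast_then_add`, `add_then_cast`, and `demo_pointer_offsets` are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Distance in bytes from `base` to `q`.
std::ptrdiff_t byte_offset(const void* base, const void* q) {
    return static_cast<const char*>(q) - static_cast<const char*>(base);
}

// Cast first, then add: the pointer advances by n bytes.
char* cast_then_add(std::int32_t* p, std::ptrdiff_t n) {
    return reinterpret_cast<char*>(p) + n;
}

// Add first, then cast: the pointer advances by n elements,
// that is, n * sizeof(std::int32_t) bytes.
char* add_then_cast(std::int32_t* p, std::ptrdiff_t n) {
    return reinterpret_cast<char*>(p + n);
}

// Regression-style check that pins down the two behaviours.
void demo_pointer_offsets() {
    std::int32_t buf[8] = {};
    const std::ptrdiff_t elem = static_cast<std::ptrdiff_t>(sizeof(std::int32_t));
    assert(byte_offset(buf, cast_then_add(buf, 2)) == 2);
    assert(byte_offset(buf, add_then_cast(buf, 2)) == 2 * elem);
}
```

The two results agree only when the element size is one byte, which is why code that mixes up the two forms can appear to work on `char` buffers and then fail on anything wider.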
Consider the following pair of functions. Think about which one is easier to read.
int f(int a, int b, int c, int* d)
{
    int e;
    do {
        e = (b + c) / 2;
        if (a < d[e]) {
            c = e - 1;
        } else {
            b = e + 1;
        }
    } while ((a != d[e]) && (b <= c));
    return (a == d[e]) ? e : -1;
}

#define NOT_FOUND -1
int binary_search(int key, int lo, int hi, int* list)
{
    int mid;
    do {
        mid = (lo + hi) / 2;
        if (key < list[mid]) {
            hi = mid - 1;
        } else {
            lo = mid + 1;
        }
    } while ((key != list[mid]) && (lo <= hi));
    return (key == list[mid]) ? mid : NOT_FOUND;
}
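A unit test for the readable version might look like the sketch below. It repeats the function so that the block is self-contained, with `mid` declared before the loop so that the loop condition and the return statement can see it. The test function name and the sample arrays are hypothetical. Note that the do-while form always probes the array once, so the sketch assumes the range `lo..hi` contains at least one element.

```cpp
#include <cassert>

#define NOT_FOUND -1

// Binary search over list[lo..hi] (inclusive bounds, sorted ascending).
int binary_search(int key, int lo, int hi, int* list)
{
    int mid;  // declared outside the loop body so the condition can use it
    do {
        mid = (lo + hi) / 2;
        if (key < list[mid]) {
            hi = mid - 1;
        } else {
            lo = mid + 1;
        }
    } while ((key != list[mid]) && (lo <= hi));
    return (key == list[mid]) ? mid : NOT_FOUND;
}

// Hypothetical regression test: every present key is found at its index,
// and keys below, between, and above the stored values are reported missing.
void test_binary_search()
{
    int sorted[] = {2, 3, 5, 7, 11, 13};
    const int last = 5;
    for (int i = 0; i <= last; ++i) {
        assert(binary_search(sorted[i], 0, last, sorted) == i);
    }
    assert(binary_search(1, 0, last, sorted) == NOT_FOUND);
    assert(binary_search(6, 0, last, sorted) == NOT_FOUND);
    assert(binary_search(17, 0, last, sorted) == NOT_FOUND);

    int one[] = {4};
    assert(binary_search(4, 0, 0, one) == 0);
    assert(binary_search(9, 0, 0, one) == NOT_FOUND);
}
```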
O(f(N)):
What is N? It should be some parameter describing the size of the input.
Example: O(N log N)
Multiple character encodings exist. ASCII uses seven bits for the English alphabet, digits, and punctuation. Other encodings exist, e.g. the Unicode encodings.
6-bit character codes allow 64 distinct symbols (uppercase letters only), but can be extended with a "shift" character.
There is no a priori reason why every symbol should have the same encoding length. The idea behind data compression is to use fewer bits for more frequently occurring symbols.
Prefix code: no codeword is a prefix of another codeword. (The term is standard, but the name is unfortunate, since "prefix-free code" would describe the property better; compare inflammable/flammable.)
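The prefix property is easy to check mechanically. The sketch below is one possible way to do it, assuming codewords are given as strings of '0' and '1'; the function names and the sample codeword sets are hypothetical. It relies on the fact that, after lexicographic sorting, a codeword that is a prefix of any other codeword is also a prefix of its immediate successor.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns true when no codeword is a prefix of another codeword.
// Duplicate codewords count as a violation.
bool is_prefix_free(std::vector<std::string> codewords) {
    std::sort(codewords.begin(), codewords.end());
    for (std::size_t i = 0; i + 1 < codewords.size(); ++i) {
        const std::string& a = codewords[i];
        const std::string& b = codewords[i + 1];
        // Does b start with a?
        if (b.compare(0, a.size(), a) == 0) {
            return false;
        }
    }
    return true;
}

// Made-up codeword sets, for illustration only.
void demo_prefix_free() {
    assert(is_prefix_free({"0", "10", "110", "111"}));
    assert(!is_prefix_free({"0", "01", "11"}));  // "0" is a prefix of "01"
}
```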
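For a prefix code represented by a binary tree $T$, the number of bits needed to encode the message is
$$B(T) = \sum_{c \in C} f(c)\, d_T(c)$$
where $C$ is the alphabet, $f(c)$ is the frequency (number of occurrences) of character $c$, and $d_T(c)$ is the depth of $c$ in tree $T$ (the number of bits in its codeword).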
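As a quick worked instance of this cost, assume a made-up message over three symbols a, b, c with frequencies 5, 2, 1 and codewords 0, 10, 11 (these values are illustrative only). The depths are 1, 2, 2, so

$$
B(T) = 5 \cdot 1 + 2 \cdot 2 + 1 \cdot 2 = 11 \text{ bits},
$$

whereas a fixed-length 2-bit code for the same eight characters would need $8 \cdot 2 = 16$ bits.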
Build the tree bottom-up by selecting the two least-frequent symbols (nodes) x and y from a priority queue and constructing a new, synthetic symbol (node) z with f(z) = f(x) + f(y) and children x, y.
Each step removes two entries from the priority queue and adds one back, so the net effect is to reduce the queue size by 1 at each step. For n symbols the algorithm terminates after n - 1 steps, when a single entry remains; that entry is the root of the Huffman tree.
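The procedure above can be sketched with `std::priority_queue` as a min-heap. This is a minimal illustration, not a full encoder: it computes only the codeword length of each symbol, it assumes at least two symbols, and the names `huffman_code_lengths` and `encoded_bits` are hypothetical. Ties between equal frequencies are broken by node index, which is an arbitrary choice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Returns the codeword length (leaf depth) of each symbol.
std::vector<int> huffman_code_lengths(const std::vector<std::uint64_t>& freq) {
    const std::size_t n = freq.size();
    assert(n >= 2);  // assumption of this sketch

    // Nodes 0..n-1 are the leaves; synthetic nodes are appended as they are built.
    std::vector<std::size_t> parent(n, 0);

    using Entry = std::pair<std::uint64_t, std::size_t>;  // (frequency, node index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
    for (std::size_t i = 0; i < n; ++i) {
        pq.push({freq[i], i});
    }

    // Each step removes the two least-frequent nodes x, y and adds z back.
    while (pq.size() > 1) {
        const Entry x = pq.top(); pq.pop();
        const Entry y = pq.top(); pq.pop();
        const std::size_t z = parent.size();
        parent.push_back(z);  // placeholder until z itself gets a parent
        parent[x.second] = z;
        parent[y.second] = z;
        pq.push({x.first + y.first, z});  // f(z) = f(x) + f(y)
    }
    const std::size_t root = pq.top().second;

    // The depth of a leaf is the number of parent links up to the root.
    std::vector<int> depth(n, 0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t node = i; node != root; node = parent[node]) {
            ++depth[i];
        }
    }
    return depth;
}

// Total message length: the sum of f(c) * depth(c) over all symbols.
std::uint64_t encoded_bits(const std::vector<std::uint64_t>& freq,
                           const std::vector<int>& depth) {
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < freq.size(); ++i) {
        total += freq[i] * static_cast<std::uint64_t>(depth[i]);
    }
    return total;
}
```

For the illustrative frequencies 45, 13, 12, 16, 9, 5 (made up for this sketch), the function returns lengths 1, 3, 3, 3, 4, 4, for a total of 224 bits. Actual codewords could be assigned by labelling the two children of every synthetic node 0 and 1.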
Proof of correctness: demonstrate greedy choice and optimal substructure.
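Lemma 1: Let x and y be the two characters with minimum frequency. There exists an optimal prefix code in which the codewords for x and y have the same length and differ only in the last bit. Proof:
Let T be an optimal tree and let b, c be sibling leaves at maximum depth in T.
Assume without loss of generality that f(b) ≤ f(c) and f(x) ≤ f(y). Since x and y have the minimum frequencies, f(x) ≤ f(b) and f(y) ≤ f(c); since b and c are at maximum depth, x is no deeper than b and y is no deeper than c.
Create T' by swapping (b, x) and (c, y).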
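$$
\begin{aligned}
B(T) - B(T') &= \bigl(f(x)\,d_T(x) + f(y)\,d_T(y) + f(b)\,d_T(b) + f(c)\,d_T(c)\bigr) \\
&\quad - \bigl(f(x)\,d_{T'}(x) + f(y)\,d_{T'}(y) + f(b)\,d_{T'}(b) + f(c)\,d_{T'}(c)\bigr) \\
&= \bigl(f(x)\,d_T(x) + f(y)\,d_T(y) + f(b)\,d_T(b) + f(c)\,d_T(c)\bigr) \\
&\quad - \bigl(f(x)\,d_T(b) + f(y)\,d_T(c) + f(b)\,d_T(x) + f(c)\,d_T(y)\bigr) \\
&= (f(x) - f(b))\,d_T(x) + (f(y) - f(c))\,d_T(y) \\
&\quad + (f(b) - f(x))\,d_T(b) + (f(c) - f(y))\,d_T(c) \\
&= (f(x) - f(b))\,(d_T(x) - d_T(b)) + (f(y) - f(c))\,(d_T(y) - d_T(c)) \\
&\ge 0
\end{aligned}
$$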
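Therefore B(T) ≥ B(T'). Since T is optimal, T' must be optimal as well, and in T' the characters x and y are sibling leaves, so their codewords have the same length and differ only in the last bit.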
Lemma 2: Let T be a full binary tree representing an optimal prefix code over the alphabet C. For any two sibling leaves x and y with parent z, let
f(z) = f(x) + f(y),
C' = (C - {x, y}) ∪ {z}.
Then the tree
T' = T - {x, y},
in which z has become a leaf, represents an optimal prefix code for C'.
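Informal proof: if T' were not optimal, a cheaper tree for C' would exist, and replacing its leaf z with an internal node with children x and y would give a tree for C cheaper than T. That contradicts the optimality of T, so T' must be optimal.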
Theorem: The Huffman algorithm produces an optimal prefix code.
| Symbol | Frequency | Optimal Prefix Code | Alternate Optimal Prefix Code | Bits |
|---|---|---|---|---|
| A | 60 | 11 | 01 | 2 |
| B | 25 | 00 | 11 | 2 |
| C | 20 | 101 | 001 | 3 |
| D | 18 | 100 | 101 | 3 |
| E | 10 | 0111 | 0001 | 4 |
| F | 8 | 0110 | 1000 | 4 |
| G | 6 | 0101 | 1001 | 4 |
| H | 4 | 01001 | 00001 | 5 |
| I | 4 | 010000 | 000000 | 6 |
| J | 4 | 010001 | 000001 | 6 |
Sample coding of a particular text: