CSS 343: Notes from Lecture 7 (DRAFT)

Administrivia

Random Stuff

Consider the difference between:

(reinterpret_cast<char*>(p) + n)
  

and

(reinterpret_cast<char*>(p + n))
  

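The first expression casts p and then advances it by n bytes. The second advances p by n elements (n * sizeof(*p) bytes) and only then casts. Writing one when the other was intended is a bug that should, ideally, be caught by a unit test. In any case, there should be a regression test to make sure such a bug stays fixed.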
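To make the difference concrete, here is a minimal sketch that measures how far each expression moves the pointer. The function names and the choice of std::int32_t as the element type are assumptions made for illustration; they are not from the lecture.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Byte offset produced by (reinterpret_cast<char*>(p) + n):
// cast first, then advance by n bytes.
std::ptrdiff_t bytes_cast_then_add(std::int32_t* p, std::ptrdiff_t n) {
    char* base = reinterpret_cast<char*>(p);
    char* q = reinterpret_cast<char*>(p) + n;
    return q - base;
}

// Byte offset produced by (reinterpret_cast<char*>(p + n)):
// advance by n elements, then cast.
std::ptrdiff_t bytes_add_then_cast(std::int32_t* p, std::ptrdiff_t n) {
    char* base = reinterpret_cast<char*>(p);
    char* q = reinterpret_cast<char*>(p + n);
    return q - base;
}

// A regression test in the spirit of the note above: it pins down
// which expression moves by bytes and which moves by elements.
void pointer_offset_regression_test() {
    std::int32_t buf[4] = {};
    assert(bytes_cast_then_add(buf, 3) == 3);
    assert(bytes_add_then_cast(buf, 3) ==
           3 * static_cast<std::ptrdiff_t>(sizeof(std::int32_t)));
}
```

The two offsets agree only when the element size is one byte, which is why the bug can hide in code that was first written for char buffers.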

Consider the following pair of functions. Think about which one is easier to read.

int f(int a, int b, int c, int* d)
{
  int e;
  do {
    e = (b + c) / 2;
    if (a < d[e]) {
      c = e - 1;
    } else {
      b = e + 1;
    }
  } while ((a != d[e]) && (b <= c));
  return (a == d[e]) ? e : -1;
}
      

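#define NOT_FOUND (-1)
// Searches the sorted, non-empty range list[lo..hi] (inclusive) for key.
// Returns the index of key, or NOT_FOUND if key is absent.
int binary_search(int key, int lo, int hi, int* list)
{
  int mid;  // declared here so the loop condition and the return can see it
  do {
    mid = (lo + hi) / 2;
    if (key < list[mid]) {
      hi = mid - 1;
    } else {
      lo = mid + 1;
    }
  } while ((key != list[mid]) && (lo <= hi));
  return (key == list[mid]) ? mid : NOT_FOUND;
}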
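As a sketch of the kind of unit test mentioned earlier, the following exercises the search on a small sorted array. The test data and the function name test_binary_search are made up for illustration. The search itself follows the readable version above, with two changes that are choices of this sketch rather than part of the lecture: mid is declared outside the loop body, and a constexpr constant replaces the macro.

```cpp
#include <cassert>

constexpr int NOT_FOUND = -1;

// Searches the sorted, non-empty range list[lo..hi] (inclusive) for key.
int binary_search(int key, int lo, int hi, const int* list) {
    int mid;
    do {
        mid = lo + (hi - lo) / 2;  // same midpoint, without overflowing lo + hi
        if (key < list[mid]) {
            hi = mid - 1;
        } else {
            lo = mid + 1;
        }
    } while ((key != list[mid]) && (lo <= hi));
    return (key == list[mid]) ? mid : NOT_FOUND;
}

// Covers the first, last, and a middle element, plus keys that are
// absent below, inside, and above the stored values.
void test_binary_search() {
    const int sorted[] = {2, 3, 5, 7, 11, 13};
    assert(binary_search(2, 0, 5, sorted) == 0);
    assert(binary_search(13, 0, 5, sorted) == 5);
    assert(binary_search(7, 0, 5, sorted) == 3);
    assert(binary_search(1, 0, 5, sorted) == NOT_FOUND);
    assert(binary_search(4, 0, 5, sorted) == NOT_FOUND);
    assert(binary_search(20, 0, 5, sorted) == NOT_FOUND);
}
```

Because the loop is a do-while, the function reads list[mid] at least once, so callers must pass a non-empty range.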
  

Our Story So Far

Greedy Algorithms

Data Compression

Character Coding

Multiple character codings exist. ASCII uses seven bits for the English alphabet plus digits, punctuation, and control characters. Other encodings exist, e.g. Unicode.

6-bit character codes: 64 distinct symbols (uppercase only), but a "shift" character may be used to reach additional symbols.

There is no a priori reason why every symbol should have the same encoding length. The idea behind data compression is to use fewer bits for more frequently occurring symbols.

Prefix code: no codeword is a prefix of another codeword. (The term is the standard one, but it is not the best naming convention, since the defining property is that the code is prefix-free; compare inflammable/flammable.)
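The prefix property is easy to check mechanically. The sketch below assumes codewords are stored as strings of '0' and '1' characters; the function names are hypothetical. It compares every pair of codewords, which is quadratic but adequate for small alphabets.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns true if no codeword is a prefix of a different codeword.
// Duplicate codewords and empty codewords count as violations.
bool is_prefix_free(const std::vector<std::string>& codes) {
    for (std::size_t i = 0; i < codes.size(); ++i) {
        for (std::size_t j = 0; j < codes.size(); ++j) {
            if (i == j) continue;
            // compare(0, n, s) == 0 means codes[j] starts with codes[i].
            if (codes[j].compare(0, codes[i].size(), codes[i]) == 0) {
                return false;
            }
        }
    }
    return true;
}

void check_prefix_examples() {
    const std::vector<std::string> good = {"0", "10", "110", "111"};
    const std::vector<std::string> bad = {"0", "01", "11"};  // "0" is a prefix of "01"
    assert(is_prefix_free(good));
    assert(!is_prefix_free(bad));
}
```

A prefix-free code can be decoded left to right without lookahead, because the first codeword that matches the input is the only one that can match.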

Huffman Coding

Build the tree bottom-up: repeatedly remove the two least-frequent symbols x and y (nodes) from a priority queue and construct a new, synthetic symbol z (node) with f(z) = f(x) + f(y) and children x and y.

Each step removes two entries from the priority queue and adds one back, so the net effect is to reduce the queue size by 1 at each step. The algorithm terminates when only one entry remains; that entry is the root of the Huffman tree.
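The construction described above can be sketched with std::priority_queue. This is one possible implementation under stated assumptions, not the lecture's code: symbols are identified by their index in the input vector, the Node layout and function names are hypothetical, and the sample frequencies in the check are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct Node {
    long freq;
    int left;   // index of the left child, or -1 for a leaf
    int right;  // index of the right child, or -1 for a leaf
};

// Builds a Huffman tree for symbols 0..n-1 (n >= 2) and returns the
// code length (leaf depth) of each symbol.
std::vector<int> huffman_code_lengths(const std::vector<long>& freqs) {
    assert(freqs.size() >= 2);
    using Entry = std::pair<long, int>;  // (frequency, node index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
    std::vector<Node> nodes;
    for (std::size_t i = 0; i < freqs.size(); ++i) {
        nodes.push_back({freqs[i], -1, -1});
        pq.push({freqs[i], static_cast<int>(i)});
    }
    // Each iteration removes two entries and adds one back.
    while (pq.size() > 1) {
        Entry x = pq.top(); pq.pop();
        Entry y = pq.top(); pq.pop();
        nodes.push_back({x.first + y.first, x.second, y.second});
        pq.push({x.first + y.first, static_cast<int>(nodes.size()) - 1});
    }
    // The last node created is the root. Every parent has a larger index
    // than its children, so a descending pass visits parents first.
    std::vector<int> depth(nodes.size(), 0);
    for (int i = static_cast<int>(nodes.size()) - 1; i >= 0; --i) {
        if (nodes[i].left >= 0) {
            depth[nodes[i].left] = depth[i] + 1;
            depth[nodes[i].right] = depth[i] + 1;
        }
    }
    depth.resize(freqs.size());  // keep only the leaves (original symbols)
    return depth;
}

// Total cost B(T): the sum of f(c) * depth(c) over all symbols c.
long weighted_cost(const std::vector<long>& freqs, const std::vector<int>& lengths) {
    long cost = 0;
    for (std::size_t i = 0; i < freqs.size(); ++i) cost += freqs[i] * lengths[i];
    return cost;
}

void check_huffman_example() {
    const std::vector<long> freqs = {45, 13, 12, 16, 9, 5};  // illustrative frequencies
    const std::vector<int> expected = {1, 3, 3, 3, 4, 4};
    assert(huffman_code_lengths(freqs) == expected);
    assert(weighted_cost(freqs, expected) == 224);
}
```

The sketch returns code lengths rather than bit strings because the cost B(T) depends only on depths. When frequencies tie, a different tie-breaking rule can give different lengths with the same total cost.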

Proof of correctness: demonstrate greedy choice and optimal substructure.

Lemma 1: Let x and y be characters with minimum frequency. There exists an optimal prefix encoding in which the codewords for x and y have the same length and differ only in the last bit. Proof:

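Let T be a tree representing an optimal prefix code, and let b, c be sibling leaves at maximum depth in T. Write dT(s) for the depth of symbol s in T, and B(T) for the cost of T, i.e. the sum of f(s)dT(s) over all symbols s.

Assume without loss of generality that f(b) ≤ f(c) and f(x) ≤ f(y). Since x and y have the two lowest frequencies, f(x) ≤ f(b) and f(y) ≤ f(c).

Create T' by swapping (b, x) and (c, y).

B(T) - B(T') =
  (f(x)dT(x) + f(y)dT(y) + f(b)dT(b) + f(c)dT(c)) -
  (f(x)dT'(x) + f(y)dT'(y) + f(b)dT'(b) + f(c)dT'(c))
= (f(x)dT(x) + f(y)dT(y) + f(b)dT(b) + f(c)dT(c)) -
  (f(x)dT(b) + f(y)dT(c) + f(b)dT(x) + f(c)dT(y))
= (f(x) - f(b))dT(x) + (f(y) - f(c))dT(y) +
  (f(b) - f(x))dT(b) + (f(c) - f(y))dT(c)
= (f(x) - f(b))(dT(x) - dT(b)) +
  (f(y) - f(c))(dT(y) - dT(c))
≥ 0

The last step holds because both factors in each product are ≤ 0: x and y have minimum frequency, and b and c are at maximum depth.

Therefore B(T) ≥ B(T'). Since T is optimal, B(T) ≤ B(T') as well, so B(T') = B(T) and T' is an optimal tree in which x and y are sibling leaves.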
  

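Lemma 2: Let T be a full binary tree representing an optimal prefix code for an alphabet C. For any two sibling leaves x and y with parent z, let f(z) = f(x) + f(y) and C' = (C - {x, y}) ∪ {z}. Then T' = T - {x, y}, in which z becomes a leaf, represents an optimal prefix code for C'. Informal proof: B(T) = B(T') + f(x) + f(y), because x and y each lie one level below z. If T' were not optimal, a cheaper tree for C' would exist, and reattaching x and y as children of z in that tree would give a tree for C that is cheaper than T. This contradicts the optimality of T, so T' must be optimal.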

Theorem: The Huffman algorithm produces an optimal prefix code.

Example

Symbol  Frequency  Optimal Prefix Code  Alternate Optimal Prefix Code  Bits
A       60         11                   01                             2
B       25         00                   11                             2
C       20         101                  001                            3
D       18         100                  101                            3
E       10         0111                 0001                           4
F       8          0110                 1000                           4
G       6          0101                 1001                           4
H       4          01001                00001                          5
I       4          010000               000000                         6
J       4          010001               000001                         6
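As a small worked check, the sketch below totals the number of bits each code column in the table would use, weighting each codeword length by its symbol's frequency. The rows are the table's values as read here (the column boundaries are an interpretation of the flattened table), and the struct and function names are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Row {
    char symbol;
    long freq;
    std::string code;  // "Optimal Prefix Code" column
    std::string alt;   // "Alternate Optimal Prefix Code" column
};

// The example table, as transcribed from the notes.
std::vector<Row> example_table() {
    return {
        {'A', 60, "11", "01"},
        {'B', 25, "00", "11"},
        {'C', 20, "101", "001"},
        {'D', 18, "100", "101"},
        {'E', 10, "0111", "0001"},
        {'F', 8, "0110", "1000"},
        {'G', 6, "0101", "1001"},
        {'H', 4, "01001", "00001"},
        {'I', 4, "010000", "000000"},
        {'J', 4, "010001", "000001"},
    };
}

// Sum of frequency * codeword length over all rows.
long total_bits(const std::vector<Row>& rows, bool use_alt) {
    long bits = 0;
    for (const Row& r : rows) {
        const std::string& c = use_alt ? r.alt : r.code;
        bits += r.freq * static_cast<long>(c.size());
    }
    return bits;
}

// Sum of all frequencies, i.e. the number of symbols in the text.
long total_symbols(const std::vector<Row>& rows) {
    long n = 0;
    for (const Row& r : rows) n += r.freq;
    return n;
}

void check_table() {
    const std::vector<Row> rows = example_table();
    // Both columns assign the same length to every symbol, so they cost the same.
    assert(total_bits(rows, false) == total_bits(rows, true));
}
```

With these rows, both columns come to 448 bits for 159 symbols, compared with 636 bits for a fixed-width 4-bit code over the same ten symbols.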

Encoding for Common Sense

Sample coding of a particular text: [figure: Huffman tree]