CSS 343: Notes from Lecture 7 (DRAFT)

Administrivia

midterm: rescheduled to Oct. 29
- no time limit (within reason)
- you may bring a single 8.5x11" cheat sheet (double-sided)
- no calculators
assignment 1: revised due date Oct. 22
usual office hours Sunday
Huffman algorithm will be on the midterm
- new material covered on Tuesday will not be on the midterm
- Thursday will be review class

Random Stuff

Consider the difference between:

(reinterpret_cast<char*>(p) + n)

and

(reinterpret_cast<char*>(p + n))

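The first expression advances p by n bytes; the second advances it by n elements, i.e. n * sizeof(*p) bytes. Using one where the other is intended is a bug that should, ideally, be caught by a unit test. In any case, there should be a regression test to make sure such a bug stays fixed.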
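The difference between the two expressions can be made concrete by measuring how far each one moves from the start of an array. The following is a minimal sketch, assuming p is an int* into an array with at least n elements; the helper names offset_cast_then_add and offset_add_then_cast are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Byte offset produced by (reinterpret_cast<char*>(p) + n):
// char* arithmetic moves in steps of one byte, so the offset is n.
std::ptrdiff_t offset_cast_then_add(int* p, std::ptrdiff_t n) {
    assert(p != nullptr);
    char* base = reinterpret_cast<char*>(p);
    return (reinterpret_cast<char*>(p) + n) - base;
}

// Byte offset produced by (reinterpret_cast<char*>(p + n)):
// int* arithmetic moves in steps of sizeof(int), so the offset is n * sizeof(int).
// Assumes p points into an array with at least n elements (hypothetical precondition).
std::ptrdiff_t offset_add_then_cast(int* p, std::ptrdiff_t n) {
    assert(p != nullptr);
    char* base = reinterpret_cast<char*>(p);
    return reinterpret_cast<char*>(p + n) - base;
}
```

A regression test for this kind of bug could simply pin down both offsets for a small array, so that an accidental swap of the two forms fails immediately.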

Consider the following pair of functions. Think about which one is easier to read.

int f(int a, int b, int c, int* d)
{
  int e;
  do {
    e = (b + c) / 2;
    if (a < d[e]) {
      c = e - 1;
    } else {
      b = e + 1;
    }
  } while ((a != d[e]) && (b <= c));
  return (a == d[e]) ? e : -1;
}

#define NOT_FOUND (-1)
// Search the sorted range list[lo..hi] (inclusive, lo <= hi) for key.
// Returns the index of key, or NOT_FOUND if key is absent.
int binary_search(int key, int lo, int hi, int* list)
{
  int mid;  // declared outside the loop: the loop condition and the return use it
  do {
    mid = (lo + hi) / 2;
    if (key < list[mid]) {
      hi = mid - 1;
    } else {
      lo = mid + 1;
    }
  } while ((key != list[mid]) && (lo <= hi));
  return (key == list[mid]) ? mid : NOT_FOUND;
}

Our Story So Far

algorithms solve problems
- algorithm analysis: determine performance
  - theoretical optimal performance for a problem
  - performance of a particular algorithm
  - advanced analysis: amortized cost
big-oh notation: asymptotic performance (general shape of the curve)
- O(f(N)): what is N (some parameter describing size of input)
cognitive complexity:
- 1,000-line programs are hard
- 1,000,000-line programs are incomprehensible
- keep programs small using abstraction
pointer arithmetic
type casting
dictionaries (key-value store)
- linear search (linked-list, array)
- binary search (sorted array)
- binary search tree (BST)
- balanced BST (red-black)
- 2-3 & B tree
BST
- recursive definition: left and right subchildren are BST
  - left(node) ≤ node
  - right(node) ≥ node
- successor/predecessor
- insert/delete
priority queues
- insert random
- lookup & delete minimum
- optional: decrease-key
additional binary-tree properties
- full
- complete
- perfect
binary heaps
- complete binary tree
- recursive definition: left and right subchildren are heaps (max-heap ordering shown; a min-heap reverses both inequalities)
  - left(node) ≤ node
  - right(node) ≤ node
- array implementation:
  - enumerate nodes top-to-bottom, left-to-right, starting from 1
  - left(n) = 2 * n
  - right(n) = 2 * n + 1
  - parent(n) = floor(n / 2)
- abstract view: tree; concrete view: array
heapsort: O(N log N)
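The array view of a binary heap and heapsort can be sketched as follows. This is an illustrative sketch, not code from the lecture: it uses the 1-based indexing from the notes (slot 0 of the vector is unused) and the max-heap ordering, and the names sift_down and heapsort are chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// 1-based index arithmetic from the notes; slot 0 of the array is unused.
std::size_t left(std::size_t n)   { return 2 * n; }
std::size_t right(std::size_t n)  { return 2 * n + 1; }
std::size_t parent(std::size_t n) { return n / 2; }  // integer division acts as floor

// Restore the max-heap property below node i for a heap stored in h[1..size].
void sift_down(std::vector<int>& h, std::size_t i, std::size_t size) {
    for (;;) {
        std::size_t largest = i;
        if (left(i) <= size && h[left(i)] > h[largest]) largest = left(i);
        if (right(i) <= size && h[right(i)] > h[largest]) largest = right(i);
        if (largest == i) return;  // both children are <= node
        std::swap(h[i], h[largest]);
        i = largest;
    }
}

// Sort h[1..h.size()-1] in ascending order; h[0] is ignored.
void heapsort(std::vector<int>& h) {
    assert(!h.empty());  // slot 0 must exist even though it is unused
    std::size_t size = h.size() - 1;
    // Build the heap bottom-up, starting from the last node that has a child.
    for (std::size_t i = size / 2; i >= 1; --i) sift_down(h, i, size);
    // Repeatedly move the maximum to the end and shrink the heap.
    for (std::size_t end = size; end > 1; --end) {
        std::swap(h[1], h[end]);
        sift_down(h, 1, end - 1);
    }
}
```

The tree never exists as linked nodes here; the index functions are the only place where the "abstract view" appears.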

Greedy Algorithms

also known as hill-climbing
make locally optimal decisions
may not lead to globally optimal solution (depending on particular problem)
algorithms that find globally optimal solutions by making locally-optimal decisions have "greedy choice" property
typical correctness proof (induction): assume an optimal solution, modify it so that the greedy choice is taken as the first step; show the remaining subproblem is smaller
optimal substructure: optimal solution contains optimal solution to subproblem (applicable to greedy algorithms & dynamic programming)
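A small illustration of a problem without the greedy-choice property is making change with an unusual coin system. This example is not from the lecture; the coin values and the function names greedy_coins and optimal_coins are assumptions made for the sketch. The greedy rule "take the largest coin that fits" is compared against a dynamic-programming reference that relies on optimal substructure.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Greedy: repeatedly take the largest coin that fits.
// Returns the number of coins used, or -1 if the amount cannot be reached this way.
// Assumes all coin values are positive.
int greedy_coins(std::vector<int> coins, int amount) {
    assert(amount >= 0);
    std::sort(coins.begin(), coins.end(), std::greater<int>());
    int count = 0;
    for (int c : coins) {
        count += amount / c;
        amount %= c;
    }
    return amount == 0 ? count : -1;
}

// Reference answer: fewest coins for every amount from 0 up (dynamic programming).
// best[a] is built from best[a - c], i.e. from optimal solutions to subproblems.
int optimal_coins(const std::vector<int>& coins, int amount) {
    assert(amount >= 0);
    const int INF = amount + 1;  // larger than any feasible coin count
    std::vector<int> best(amount + 1, INF);
    best[0] = 0;
    for (int a = 1; a <= amount; ++a)
        for (int c : coins)
            if (c <= a && best[a - c] + 1 < best[a]) best[a] = best[a - c] + 1;
    return best[amount] == INF ? -1 : best[amount];
}
```

With coins {1, 3, 4} and amount 6, the greedy rule picks 4 + 1 + 1 while 3 + 3 uses fewer coins; with coins {1, 5, 10, 25} the two functions agree on the amounts tried in this sketch.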

Data Compression

relies on the fact that interesting data is not completely random
example: run-length encoding (e.g. fax machine)
English text: some letters occur more frequently than others
lossy vs. lossless compression
- lossless compression: exact copy of original data is recovered after expansion
- lossy compression may give higher compression by sacrificing bit-for-bit copy
  - suitable for audio & video
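Run-length encoding, mentioned above, is the simplest lossless scheme to sketch: each run of repeated symbols is stored as a (symbol, count) pair. The representation below (a vector of pairs) and the names rle_encode and rle_decode are assumptions for illustration; real formats such as fax encodings pack the runs far more compactly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Run = std::pair<char, std::size_t>;  // (symbol, number of repetitions)

// Collapse each maximal run of identical characters into one (symbol, count) pair.
std::vector<Run> rle_encode(const std::string& s) {
    std::vector<Run> runs;
    for (char ch : s) {
        if (!runs.empty() && runs.back().first == ch) {
            ++runs.back().second;      // extend the current run
        } else {
            runs.push_back(Run(ch, 1)); // start a new run
        }
    }
    return runs;
}

// Expand the runs again; lossless means this reproduces the input exactly.
std::string rle_decode(const std::vector<Run>& runs) {
    std::string s;
    for (const Run& r : runs) {
        assert(r.second > 0);  // an encoder never emits an empty run
        s.append(r.second, r.first);
    }
    return s;
}
```

The scheme only pays off when the data has long runs; on input with no repeated neighbours it produces one pair per character and the output is larger than the input.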

Character Coding

Several character codings are in use. ASCII uses seven bits per character for the English alphabet plus digits and punctuation. Other encodings exist, e.g. Unicode.

6-bit character codes: 64 distinct symbols (uppercase only), but may use "shift" character.

There is no a priori reason why every symbol should have the same encoding length. The idea behind data compression is to use fewer bits for more frequently occurring symbols.

Prefix code: no codeword is the prefix of another codeword (the term is standard, but not the best naming convention, since "prefix-free code" would describe the property more accurately; compare inflammable/flammable)

theorem (beyond scope of class): optimal data compression via character coding can always be achieved using a prefix code
prefix codes simplify decompression
we can represent a prefix code using a binary tree and edge labeling
- left branch: 0 bit
- right branch: 1 bit
theorem: an optimal prefix code will be a full binary tree (informal proof: can always make a non-full tree shorter by eliminating the knee, i.e. an internal node with only one child)
For a given prefix coding T, the number of bits to encode the message is given by B(T) = Σ f(c)·d_T(c), summed over every character c in the alphabet C, where
f(c) is the frequency (number of occurrences) of character c, and
d_T(c) is the depth (number of bits in codeword) of c in tree T
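The definitions above can be sketched in code: a check of the prefix property, the cost B(T), and a decoder that shows why prefix codes simplify decompression (a codeword is recognized as soon as its last bit arrives). The representation, a map from symbol to a string of '0'/'1' characters, and the function names are assumptions chosen for readability, not an efficient bit-level implementation.

```cpp
#include <cassert>
#include <map>
#include <string>

using Code = std::map<char, std::string>;  // symbol -> codeword written as '0'/'1' characters

// True if no codeword is a prefix of another codeword.
bool is_prefix_code(const Code& code) {
    for (const auto& a : code)
        for (const auto& b : code)
            if (a.first != b.first &&
                b.second.compare(0, a.second.size(), a.second) == 0)
                return false;  // a's codeword is a prefix of b's
    return true;
}

// B(T) = sum of f(c) * d_T(c), where d_T(c) is the codeword length of c.
long cost(const Code& code, const std::map<char, long>& freq) {
    long bits = 0;
    for (const auto& kv : freq)
        bits += kv.second * static_cast<long>(code.at(kv.first).size());
    return bits;
}

// Decode by accumulating bits until they match a codeword. Because of the
// prefix property, the first match is the only possible one.
// Trailing bits that do not complete a codeword are ignored in this sketch.
std::string decode(const Code& code, const std::string& bits) {
    assert(is_prefix_code(code));
    std::map<std::string, char> inverse;
    for (const auto& kv : code) inverse[kv.second] = kv.first;
    std::string out, current;
    for (char b : bits) {
        current += b;
        auto it = inverse.find(current);
        if (it != inverse.end()) {
            out += it->second;
            current.clear();
        }
    }
    return out;
}
```

For instance, with the hypothetical code a = 0, b = 10, c = 11, the bit string 010110 splits unambiguously into 0, 10, 11, 0.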

Huffman Coding

algorithm used in successful software (e.g. gzip)
greedy algorithm to find optimal prefix code

Build the tree bottom-up by repeatedly removing the two least-frequent symbols x and y (nodes) from a priority queue and inserting a new, synthetic symbol z (node) with f(z) = f(x) + f(y) and children x, y.

Each step removes two entries from the priority queue and adds one back. The net effect is to reduce the queue size by 1 at each step. The algorithm terminates when only one entry remains in the queue; that entry is the root of the Huffman tree.
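The construction can be sketched with std::priority_queue configured as a min-queue. This is a minimal sketch under stated assumptions: symbols are identified by their index in the input vector, nodes are stored in a flat vector rather than as linked tree nodes, and only codeword lengths (depths) are computed, since the lengths are all that B(T) depends on. The names huffman_lengths and total_bits are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Returns d_T(c), the codeword length, for each symbol, given its frequency.
std::vector<int> huffman_lengths(const std::vector<long>& freq) {
    assert(freq.size() >= 2);
    struct Node { long f; int left; int right; };  // children are -1 for leaves
    std::vector<Node> nodes;
    using Entry = std::pair<long, int>;            // (frequency, node index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
    for (std::size_t i = 0; i < freq.size(); ++i) {
        nodes.push_back(Node{freq[i], -1, -1});
        pq.push(Entry(freq[i], static_cast<int>(i)));
    }
    // Remove the two least-frequent entries x and y, add the synthetic node z.
    while (pq.size() > 1) {
        Entry x = pq.top(); pq.pop();
        Entry y = pq.top(); pq.pop();
        nodes.push_back(Node{x.first + y.first, x.second, y.second});
        pq.push(Entry(x.first + y.first, static_cast<int>(nodes.size()) - 1));
    }
    // The last node created is the root. Every parent has a larger index than
    // its children, so a pass from high to low index assigns depths top-down.
    std::vector<int> depth(nodes.size(), 0);
    for (int i = static_cast<int>(nodes.size()) - 1; i >= 0; --i) {
        if (nodes[i].left >= 0) {
            depth[nodes[i].left] = depth[i] + 1;
            depth[nodes[i].right] = depth[i] + 1;
        }
    }
    depth.resize(freq.size());  // keep only the leaves (the original symbols)
    return depth;
}

// B(T) = sum of f(c) * d_T(c).
long total_bits(const std::vector<long>& freq, const std::vector<int>& depth) {
    long bits = 0;
    for (std::size_t i = 0; i < freq.size(); ++i) bits += freq[i] * depth[i];
    return bits;
}
```

With a binary heap as the priority queue, each of the N - 1 merge steps costs O(log N), so the whole construction is O(N log N) for N distinct symbols. When frequencies tie, the order of removal may vary and the tree shape with it, but the total cost B(T) is the same.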

Proof of correctness: demonstrate greedy choice and optimal substructure.

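Lemma 1: Let x and y be characters with minimum frequency. There exists an optimal prefix encoding in which the codewords for x and y have the same length and differ only in the last bit. Proof:

Let T be a tree for an optimal prefix code, and let b, c be sibling leaves at maximum depth in T.

Assume f(b) ≤ f(c) and f(x) ≤ f(y). Because x and y have minimum frequency, f(x) ≤ f(b) and f(y) ≤ f(c).

Create T' by swapping (b, x) and (c, y).

B(T) - B(T') =
  (f(x)d_T(x) + f(y)d_T(y) + f(b)d_T(b) + f(c)d_T(c)) -
  (f(x)d_T'(x) + f(y)d_T'(y) + f(b)d_T'(b) + f(c)d_T'(c))
= (f(x)d_T(x) + f(y)d_T(y) + f(b)d_T(b) + f(c)d_T(c)) -
  (f(x)d_T(b) + f(y)d_T(c) + f(b)d_T(x) + f(c)d_T(y))
= (f(x) - f(b))d_T(x) + (f(y) - f(c))d_T(y) +
  (f(b) - f(x))d_T(b) + (f(c) - f(y))d_T(c)
= (f(x) - f(b))(d_T(x) - d_T(b)) +
  (f(y) - f(c))(d_T(y) - d_T(c))
≥ 0

Each product is nonnegative because both of its factors are ≤ 0: x and y have minimum frequency, and b and c are at maximum depth, so d_T(x) ≤ d_T(b) and d_T(y) ≤ d_T(c).

Therefore B(T) ≥ B(T'). Since T is optimal, B(T') = B(T), so T' is also optimal, and in T' the characters x and y are sibling leaves.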

Lemma 2: Let T be a full binary tree representing an optimal prefix code for alphabet C. For any two sibling leaves x and y with parent z, let f(z) = f(x) + f(y) and C' = (C - {x, y}) ∪ {z}. Then T' = T - {x, y}, in which z has become a leaf, is an optimal prefix code for C'. Informal proof: if T' were not optimal, a better tree for C' could be extended (by giving z the children x and y) into a tree better than T, so T would not be optimal; by contradiction, T' must be optimal.

Theorem: The Huffman algorithm produces an optimal prefix code.

Example

Symbol	Frequency	Optimal Prefix Code	Alternate Optimal Prefix Code	Codeword Length (bits)
A	50	11	01	2
B	25	00	11	2
C	20	101	001	3
D	18	100	101	3
E	10	0111	0001	4
F	8	0110	1000	4
G	6	0101	1001	4
H	4	01001	00001	5
I	2	010000	000000	6
J	2	010001	000001	6

no loss in generality by ordering symbols by frequency
total 145 symbols in text (50 + 25 + 20 + 18 + 10 + 8 + 6 + 4 + 2 + 2)
ASCII coding, stored as one 8-bit byte per symbol: 1160 bits (8 * 145)
10 distinct symbols can be encoded using 4 bits (fixed-length encoding): 580 bits to encode text (4 * 145)
Huffman (optimal prefix) coding: 404 bits (30% space saving relative to the 4-bit fixed-length encoding)
there may be more than one optimal encoding (right and left branches may be swapped)
some codewords require more than 4 bits
tree construction figure (not reproduced): red nodes mark entries in the priority queue, shown in no particular order

Encoding for Common Sense

Sample coding of a particular text (figure not reproduced)