2016 final exam posted
barbers in single source file: pros & cons?
LinkedIn policy: former students
tonight: penultimate lecture
schedule: final exam originally scheduled Wednesday
assignment 3
assignment 4
final: no news yet
modern architectures: hierarchical memory
memory-mapped I/O
non-VM systems: allocate main memory to processes TODO: diagram
problem: a program's location in main memory changes from run to run
still have the problem of external (between processes) fragmentation
base register specifies starting point
limit register specifies max address
process (virtual) address: starts at 0
contiguous
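A minimal sketch of what the hardware does for contiguous allocation with a base/limit pair (names and types are illustrative, not the lecture's code):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Every process address is checked against the limit register and then
       relocated by the base register; out-of-range accesses trap to the OS. */
    uint32_t translate(uint32_t logical, uint32_t base, uint32_t limit)
    {
        if (logical >= limit) {
            fprintf(stderr, "trap: address out of range\n");
            exit(EXIT_FAILURE);
        }
        return base + logical;    /* physical address */
    }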
segmentation: multiple contiguous blocks
paging
fixed-size vs. variable-sized partition
memory compaction
different types of data
object file: sections (data structure)
use hardware architecture similar to contiguous, but with multiple segment registers (base & limit)
wild pointer: segmentation violation
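A sketch of segmented translation, assuming the high bits of a virtual address select one of several segment registers (the 4-bit segment field is an illustrative choice):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct segment { uint32_t base, limit; };    /* one base/limit pair per segment */

    uint32_t seg_translate(const struct segment *segs, uint32_t logical)
    {
        uint32_t seg    = logical >> 28;         /* top 4 bits pick the segment */
        uint32_t offset = logical & 0x0FFFFFFF;  /* remaining 28 bits */
        if (offset >= segs[seg].limit) {         /* wild pointer */
            fprintf(stderr, "segmentation violation\n");
            exit(EXIT_FAILURE);
        }
        return segs[seg].base + offset;
    }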
divide memory into "frames"
logical address space: 0..N-1
each process (data structure in kernel) maintains page table with mapping from virtual pages to physical frames TODO: diagram
TODO: diagram
TODO: diagram
phys_addr = *(BR + (logical >> 12)) | (logical & 0xFFF)
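The same lookup written out as a sketch, assuming 4 KB pages and a single-level table whose entries hold the frame base address (names are illustrative):

    #include <stdint.h>

    #define PAGE_SHIFT 12                        /* 4 KB pages */
    #define PAGE_MASK  0xFFFu                    /* low 12 bits = offset within page */

    /* page_table plays the role of BR above: indexed by virtual page number,
       each entry holding the physical frame base (frame number << 12). */
    uint32_t page_translate(const uint32_t *page_table, uint32_t logical)
    {
        uint32_t vpn    = logical >> PAGE_SHIFT; /* virtual page number */
        uint32_t offset = logical &  PAGE_MASK;
        return page_table[vpn] | offset;         /* physical address */
    }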
problem: loading logical memory values requires 2 memory accesses: ~200 cycles
solution: TLB (Translation Lookaside Buffer) within the MMU TODO: diagram
TLB flush/purge on context switch
32-bit architecture: 20-bit page & 12-bit offset
64-bit architecture: 3-level table
effective memory access time: TLB hit time * hit rate + TLB miss time * (1 - hit rate)
TLB miss stalls the processor (cost of context switch would be much higher)
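Plugging illustrative numbers into the formula above: with a 1-cycle hit, a ~200-cycle miss (the extra page-table access), and a 99% hit rate, effective access time ≈ 1 * 0.99 + 200 * 0.01 ≈ 3 cycles.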
many processes running concurrently
solution: store some pages on disk, load into memory only when needed
internal fragmentation: logical space is not an even multiple of page size
minimize fragmentation by using smaller page size
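Illustrative numbers: with 4 KB pages, a process that needs 9 KB gets three pages (12 KB) and wastes about 3 KB in the last one; with 1 KB pages the waste is under 1 KB, at the cost of a larger page table.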
paging architecture: page table has valid/invalid flag
page fault: if required page is not in memory, load page from swap space (disk)
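A sketch of the page-fault path at the level of these notes (the struct layout and helper names are illustrative stubs, not a real kernel API):

    #include <stdint.h>

    struct pte { unsigned int frame : 20, valid : 1, dirty : 1; };  /* illustrative layout */

    extern uint32_t allocate_frame(void);                     /* stub: may evict a victim page */
    extern void read_from_swap(uint32_t vpn, uint32_t frame); /* stub: disk read */

    /* Invoked by the MMU trap when the faulting page's valid bit is clear. */
    void handle_page_fault(struct pte *page_table, uint32_t vpn)
    {
        uint32_t frame = allocate_frame();
        read_from_swap(vpn, frame);        /* process blocks while the disk I/O completes */
        page_table[vpn].frame = frame;
        page_table[vpn].valid = 1;         /* hardware then retries the faulting access */
    }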
working set theory: paging works because, typically, only a small subset of the pages are active at any given moment
shared library: save space by loading single copy of library code for all processes
shared memory: interprocess communication
garbage collection breaks working set theory
page table provides additional info
shared memory: 2 virtual pages share same physical memory
fork: 2 processes share data pages until one process alters a page (copy-on-write)
a page that is swapped out, then swapped back in has 2 copies (one in memory, one on disk)
if the page to be swapped out again has not been changed (i.e., it is "clean"), we can save the cost of a disk write, since a valid copy is already on disk
if a page is written to, it is marked "dirty" and must be written back to disk when swapped out
if we have to swap a page out, the next time we have to use it, it will trigger another page fault
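A companion sketch of eviction using the dirty bit (same illustrative pte layout and stub helpers as the page-fault sketch above):

    #include <stdint.h>

    struct pte { unsigned int frame : 20, valid : 1, dirty : 1; };  /* as in the earlier sketch */

    extern void write_to_swap(uint32_t vpn, uint32_t frame);   /* stub: disk write */
    extern void free_frame(uint32_t frame);                    /* stub */

    /* A clean page already has an up-to-date copy in swap space, so only
       dirty pages pay for a disk write on the way out. */
    void evict_page(struct pte *page_table, uint32_t vpn)
    {
        if (page_table[vpn].dirty)
            write_to_swap(vpn, page_table[vpn].frame);
        page_table[vpn].valid = 0;         /* the next access to vpn will page-fault */
        page_table[vpn].dirty = 0;
        free_frame(page_table[vpn].frame);
    }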
timing:
disk is 10,000 to 100,000 times slower than memory
memory is the new disk; disk is the new tape
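Illustrative arithmetic for that ratio: if a memory access takes ~100 ns and a page fault costs ~10 ms, then a fault rate of just 1 in 100,000 accesses already doubles the average access time (0.99999 * 100 ns + 0.00001 * 10,000,000 ns ≈ 200 ns).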
even in its heyday, a heavily loaded system would start "thrashing"
when thrashing occurs, CPU utilization decreases
solution: suspend some processes
cache: fast on-chip memory
goal: minimize the number of page faults
optimal algorithm (theoretical/post-hoc empirical)
can use optimal result as baseline for evaluating other (heuristic) approaches
first-in/first-out: swap out "oldest" page in memory
least-recently-used (LRU)
approximation using reference bits
second-chance algorithm
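A sketch of second-chance as a clock sweep over the frame table, using the hardware reference bit (the frame count and data structure are illustrative):

    #define NFRAMES 64                           /* illustrative number of frames */

    static int referenced[NFRAMES];              /* reference bit per frame, set by hardware */
    static int hand = 0;                         /* the "clock hand" */

    /* Sweep circularly: a frame whose reference bit is set gets a second
       chance (bit cleared, hand advances); the first frame found with the
       bit clear holds the victim page to swap out. */
    int choose_victim(void)
    {
        for (;;) {
            if (referenced[hand]) {
                referenced[hand] = 0;            /* second chance */
                hand = (hand + 1) % NFRAMES;
            } else {
                int victim = hand;
                hand = (hand + 1) % NFRAMES;
                return victim;
            }
        }
    }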
why is LRU heuristic effective?
garbage collection destroys the working set
can determine working set approximations