Assignment 7: Supervisor

Background

So far, we've implemented the time server, the auth server, the reverse proxy server, and the monitor. Your pointy-haired boss still wants to bring the time service up to production-grade standards.

The next piece is a supervisor program responsible for starting and shutting down individual processes in the system.

A typical architecture has a single datacenter-wide uber-supervisor, or ubervisor (plus hot standbys for backup), that receives configuration commands (e.g. via HTTP) describing all server processes.

The ubervisor will allocate tasks to individual machines based on resource requirements and availability (bin packing). The ubervisor communicates task specifications to local supervisor instances on each individual machine.

The per-machine local supervisor is responsible for running the tasks, monitoring their health, and restarting tasks that fail. If a task is in yo-yo mode [1], the supervisor reports back to the ubervisor, which may re-allocate the task to another machine or page the on-call for manual intervention.

The supervisor also reports its own health status back to the ubervisor. If the supervisor falls over, the ubervisor will issue an ssh command to restart the supervisor, trigger a remote machine restart, or even take that machine out of service.

Time-Service Supervisor

Implement a scaled-down local supervisor for the time service. Instead of getting task specifications from the ubervisor, it will read configuration info from standard input on startup.

The configuration file will be a JSON-format list of objects (dictionaries) containing the fields "input", "output", "error", and "command".

input, output, and error are the names of the files to which standard input, standard output, and standard error (respectively) are redirected.

command is a JSON list of strings, i.e. the pre-parsed command-line arguments [2].

Sample configuration file:


[{"command": ["./bin/authserver", "--log=etc/auth.xml"],
   "output":"out/auth.out", "error": "out/auth.err"},
 {"command": ["./bin/timeserver", "--log=etc/log-01.xml", "--port={{port}}",
              "--max-inflight=80",
              "--avg-response-ms=500", "--response-deviation-ms=300"],
  "output": "out/timeserver-01.out", "error": "out/timeserver-01.err"},
 {"command": ["./bin/timeserver", "--log=etc/log-02.xml", "--port={{port}}",
              "--max-inflight=2",
              "--avg-response-ms=20000", "--response-deviation-ms=1000"],
  "output": "out/timeserver-02.out", "error": "out/timeserver-02.err"}]

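As a rough illustration of reading this format, the sketch below decodes the list into Go structs. The names Task and ParseConfig are hypothetical, chosen for this example only; the JSON keys are the ones listed above. It assumes that a missing "input", "output", or "error" field simply means "no redirection", which matches the sample but is not stated explicitly.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// Task mirrors one object in the configuration list. The type and field
// names are illustrative; only the JSON keys come from the assignment.
type Task struct {
	Input   string   `json:"input"`
	Output  string   `json:"output"`
	Error   string   `json:"error"`
	Command []string `json:"command"`
}

// ParseConfig decodes a JSON list of task objects and rejects any task
// that has no command to run.
func ParseConfig(data []byte) ([]Task, error) {
	var tasks []Task
	if err := json.Unmarshal(data, &tasks); err != nil {
		return nil, fmt.Errorf("parsing config: %w", err)
	}
	for i, t := range tasks {
		if len(t.Command) == 0 {
			return nil, fmt.Errorf("task %d: missing command", i)
		}
	}
	return tasks, nil
}

func main() {
	// The supervisor reads its configuration from standard input.
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	tasks, err := ParseConfig(data)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, t := range tasks {
		fmt.Printf("%q stdout=%q stderr=%q\n", t.Command, t.Output, t.Error)
	}
}
```

Fields absent from an object are left as empty strings, so the caller can treat an empty name as "do not redirect this stream".
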
Port Assignment

The supervisor should take a command-line flag --port-range=from-to. Occurrences of the string '{{port}}' in the config specification will be replaced by the next port within the given range. Each time a process is restarted, it will be assigned the next port number (round robin).

Since this is a prototype proof-of-concept and we only have the single substitution variable, we don't need the heavy weight of full-on template substitution. You may simply use the strings.Replace function.
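
One possible shape for the port logic is sketched below. PortAllocator, ParsePortRange, and ExpandArgs are hypothetical names, not part of the assignment. The sketch assumes that the range is inclusive on both ends and that every '{{port}}' within one command receives the same port.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// PortAllocator hands out ports from the inclusive range [from, to],
// wrapping around to from after to (round robin).
type PortAllocator struct {
	from, to, next int
}

// ParsePortRange parses the "from-to" value of the --port-range flag.
func ParsePortRange(s string) (*PortAllocator, error) {
	parts := strings.SplitN(s, "-", 2)
	if len(parts) != 2 {
		return nil, fmt.Errorf("bad port range %q, want from-to", s)
	}
	from, err1 := strconv.Atoi(parts[0])
	to, err2 := strconv.Atoi(parts[1])
	if err1 != nil || err2 != nil || from < 1 || to > 65535 || from > to {
		return nil, fmt.Errorf("bad port range %q, want from-to", s)
	}
	return &PortAllocator{from: from, to: to, next: from}, nil
}

// Next returns the next port in the range.
func (p *PortAllocator) Next() int {
	port := p.next
	p.next++
	if p.next > p.to {
		p.next = p.from
	}
	return port
}

// ExpandArgs returns a copy of args with every "{{port}}" replaced by port.
// The config's own slice is left untouched so it can be reused on restart.
func ExpandArgs(args []string, port int) []string {
	out := make([]string, len(args))
	for i, a := range args {
		out[i] = strings.Replace(a, "{{port}}", strconv.Itoa(port), -1)
	}
	return out
}

func main() {
	ports, err := ParsePortRange("8080-8081")
	if err != nil {
		fmt.Println(err)
		return
	}
	args := []string{"./bin/timeserver", "--port={{port}}"}
	// Three starts of the same task: the third wraps back to the first port.
	for i := 0; i < 3; i++ {
		fmt.Println(ExpandArgs(args, ports.Next()))
	}
}
```

Calling Next once per process start, rather than once per placeholder, keeps a restarted process on a single fresh port.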

Recovery

If the supervisor fails, it will leave the processes it supervises running. When the supervisor restarts, the system will be in an indeterminate state.

To avoid chaos, implement the flags --dumpfile and --checkpoint-interval to control periodic dumping of the process IDs of the processes under supervision (similar to the authserver). The dump file should be JSON-encoded.

When the supervisor starts up, it will read the dump file and check to see if the identified processes still exist, killing them if they do [3].
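
A minimal sketch of the dump and recovery steps follows, assuming a Unix-like system and a dump file that holds a plain JSON list of process IDs (the assignment does not fix the layout). WriteDump, ReadDump, Alive, and KillStale are hypothetical names. Existence is probed by sending signal 0 with syscall.Kill, which performs the permission and existence checks without delivering a signal. Note that the operating system may reuse process IDs, so a matching ID does not prove the process is one the supervisor started.

```go
package main

import (
	"encoding/json"
	"errors"
	"log"
	"os"
	"syscall"
)

// WriteDump saves pids as a JSON list. It writes a temporary file and
// renames it so that a crash mid-write does not leave a truncated dump.
func WriteDump(path string, pids []int) error {
	data, err := json.Marshal(pids)
	if err != nil {
		return err
	}
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

// ReadDump loads the pid list. A missing file is treated as an empty list,
// which is the normal case on a first start.
func ReadDump(path string) ([]int, error) {
	data, err := os.ReadFile(path)
	if errors.Is(err, os.ErrNotExist) {
		return nil, nil
	}
	if err != nil {
		return nil, err
	}
	var pids []int
	if err := json.Unmarshal(data, &pids); err != nil {
		return nil, err
	}
	return pids, nil
}

// Alive reports whether a process with the given pid exists. EPERM means
// the process exists but belongs to another user.
func Alive(pid int) bool {
	err := syscall.Kill(pid, 0)
	return err == nil || errors.Is(err, syscall.EPERM)
}

// KillStale sends SIGTERM to every listed process that still exists.
func KillStale(pids []int) {
	for _, pid := range pids {
		if !Alive(pid) {
			continue
		}
		if err := syscall.Kill(pid, syscall.SIGTERM); err != nil {
			log.Printf("could not kill stale pid %d: %v", pid, err)
		}
	}
}

func main() {
	const dumpfile = "supervisor.dump" // placeholder; the real path comes from --dumpfile
	stale, err := ReadDump(dumpfile)
	if err != nil {
		log.Fatal(err)
	}
	KillStale(stale)
	// No tasks have been started in this sketch, so checkpoint an empty list.
	if err := WriteDump(dumpfile, []int{}); err != nil {
		log.Fatal(err)
	}
}
```

In a full supervisor, WriteDump would be called from a timer firing every --checkpoint-interval, with the current list of child process IDs.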

Running and testing the Time Server

Use the supervisor to start up an authserver and two timeservers. To simplify configuration, the authserver will use a statically-allocated port, but the timeservers will get a dynamically-allocated port.

Manually check the supervisor's log file (you are using the log facility to report each time a task dies and is restarted, right?) to verify that the processes have started up and that you can make requests to the timeservers using your web browser.

Manually kill one of the timeservers and verify that the supervisor restarts the process.
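
The restart behavior exercised by this test could look roughly like the sketch below. Task, runOnce, and Supervise are hypothetical names, and the maxStarts cap and backoff delay are assumptions added to keep the example bounded; they stand in for whatever yo-yo handling you choose. Port substitution and dump-file updates are left out for brevity.

```go
package main

import (
	"errors"
	"log"
	"os"
	"os/exec"
	"time"
)

// Task is an illustrative stand-in for one configuration entry.
type Task struct {
	Input, Output, Error string
	Command              []string
}

// runOnce starts the task with its streams redirected to the named files
// and blocks until the process exits. Streams with an empty file name are
// left unset, so os/exec connects them to the null device.
func runOnce(t Task) error {
	if len(t.Command) == 0 {
		return errors.New("empty command")
	}
	cmd := exec.Command(t.Command[0], t.Command[1:]...)
	if t.Input != "" {
		f, err := os.Open(t.Input)
		if err != nil {
			return err
		}
		defer f.Close()
		cmd.Stdin = f
	}
	if t.Output != "" {
		f, err := os.OpenFile(t.Output, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
		if err != nil {
			return err
		}
		defer f.Close()
		cmd.Stdout = f
	}
	if t.Error != "" {
		f, err := os.OpenFile(t.Error, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
		if err != nil {
			return err
		}
		defer f.Close()
		cmd.Stderr = f
	}
	if err := cmd.Start(); err != nil {
		return err
	}
	log.Printf("started %v (pid %d)", t.Command, cmd.Process.Pid)
	return cmd.Wait()
}

// Supervise runs the task, logs every exit, and restarts it until it has
// been started maxStarts times. It returns the number of starts attempted.
func Supervise(t Task, maxStarts int, backoff time.Duration) int {
	for starts := 1; ; starts++ {
		err := runOnce(t)
		log.Printf("task %v died (start %d): %v", t.Command, starts, err)
		if starts >= maxStarts {
			log.Printf("task %v: giving up after %d starts", t.Command, starts)
			return starts
		}
		time.Sleep(backoff)
		log.Printf("restarting task %v", t.Command)
	}
}

func main() {
	// Demo task: a short-lived command that the loop restarts twice.
	Supervise(Task{Command: []string{"sleep", "1"}}, 3, time.Second)
}
```

A real supervisor would run one such loop per task in its own goroutine, so that a slow or crashing task does not block the others.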

Useful Libraries

Footnotes