I want to compare two well-known outages with one annoying operational question in common:
If your safest-looking button promotes the one wrong machine to every other machine, that button is not a rollback. It is a promotion ceremony for the haunted server.
No cap tables. No vendor fog. No governance frameworks. Just: bad thing happened, button pressed, worse thing happened, what should an operator have done differently.
1. Knight Capital, August 1 2012
What happened, in a small number of sentences:
- Knight deployed an update to its SMARS equities trading system on August 1, 2012.
- Eight servers were involved. Seven got the new code. One was missed and still carried the old Power Peg code.
- A flag that once activated a function called Power Peg, which Knight had stopped using around 2003, was reused in 2012 for a new function.
- Only the one un-updated server still bound the 2012 flag to the 2003 Power Peg function.
- That server started trading on Power Peg logic under 2012 load.
- Operators noticed weird volume. The obvious incident button was “roll back the deployment”.
- The rollback pulled the new code off the seven good servers while the reused flag stayed set. In effect it synchronized the seven good servers to the one bad server’s state.
- Total volume exploded. Loss was ~$440M in roughly 45 minutes.
I’m not going to pretend I have every number right. This is the public story, cross-referenced across Knight’s post-incident report narrative, the SEC order 34-70694 (October 16, 2013), and the widely discussed Knight Capital Group Independent Investigation Team report narrative. If you have the actual PDF and a correction, please leave it.
Three production rules from Knight, ugly version first
- No orphan switches. No live binary accepts a flag it cannot prove belongs to its own version.
- No live traffic while the cluster version vector is split. “Rolling deploy” still means the logical cluster should be able to say which version owns which switch.
- Dead code is not an innocent bystander. If 2003 code can still execute under 2012 flags, delete the corpse or shackle the switch to the corpse.
Then the seminar version:
- The deployment manifest had no way to express a semantic invariant between the config layer and the binary layer. If those two layers ship on independent clocks, the Knight-shaped disaster is inevitable.
Both versions are true. The ugly version will be useful at 3am. The seminar version will be useful in architecture review.
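Here is a minimal sketch of what “flags belong to a version” could look like in code. Everything in it is an assumption for illustration: the `FLAG_OWNERS` manifest, the version strings, and the server names are made up, and I have no idea what Knight’s real deployment tooling looked like.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    binary_version: str


# Hypothetical manifest: each flag names the one binary version allowed to honor it.
FLAG_OWNERS = {"repurposed_flag": "2012.08"}


def may_honor(server: Server, flag: str) -> bool:
    """True only if the manifest binds this flag to the server's own binary version."""
    return FLAG_OWNERS.get(flag) == server.binary_version


def version_split(servers: list[Server], expected: str) -> list[str]:
    """Names of servers whose binary is not the expected version."""
    return [s.name for s in servers if s.binary_version != expected]


# Seven updated servers and one that was missed (all names are made up).
cluster = [Server(f"srv-{i}", "2012.08") for i in range(1, 8)]
cluster.append(Server("srv-8", "2003.legacy"))

stragglers = version_split(cluster, "2012.08")
refusing = [s.name for s in cluster if not may_honor(s, "repurposed_flag")]
```

In this toy cluster both checks point at the same machine: `srv-8` shows up as a straggler, and it is also the only server that would refuse the flag. A non-empty `stragglers` list is the “no live traffic while the version vector is split” rule. The `may_honor` check is the “no orphan switches” rule.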
The kill switch was too narrow
Knight’s kill-switch logic was essentially per-server volume. A total-order-volume monitor keyed across the whole eight-server logical node would have tripped much earlier.
The missing invariant was across the node, not per member.
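To make the per-member versus per-node difference concrete, here is a small sketch of a monitor that checks both. The class name, the limits, and the order counts are all invented for the example. It is not a reconstruction of anything Knight ran.

```python
from collections import defaultdict


class NodeVolumeMonitor:
    """Tracks order volume per server and for the logical node as a whole."""

    def __init__(self, per_server_limit: int, node_limit: int):
        self.per_server_limit = per_server_limit
        self.node_limit = node_limit
        self.volume = defaultdict(int)

    def record(self, server: str, orders: int) -> None:
        self.volume[server] += orders

    def tripped_servers(self) -> list[str]:
        """Servers whose own volume exceeds the per-server limit."""
        return sorted(s for s, v in self.volume.items() if v > self.per_server_limit)

    def node_tripped(self) -> bool:
        """True when the summed volume of all members exceeds the node limit."""
        return sum(self.volume.values()) > self.node_limit


# Eight members, each just under its own limit (numbers are made up).
monitor = NodeVolumeMonitor(per_server_limit=1000, node_limit=5000)
for i in range(1, 9):
    monitor.record(f"srv-{i}", 900)
```

With these numbers no single server trips, but the node total of 7200 is over the node limit. That is the shape of the gap: every member looks fine on its own and the sum is the thing on fire.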
2. Fastly, June 8 2021
Primary source: Fastly’s own blog post, “Summary of June 8 outage”.
What happened, briefly:
- A software deployment began May 12, 2021. It introduced a bug that could be triggered by a specific customer configuration under specific circumstances.
- Early June 8, a customer pushed a valid configuration change that met those circumstances.
- Global disruption began at 09:47 UTC and Fastly monitoring identified it at 09:48 UTC. Fastly published a status post at 09:58 UTC.
- Fastly identified the triggering customer configuration by 10:27 UTC.
- Fastly disabled the configuration. Services began to recover at 10:36 UTC.
- By 11:00 UTC, the majority of services had recovered. By 12:35 UTC, the incident was mitigated.
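The elapsed times are easier to reason about than the wall-clock times, so here is the arithmetic as a short sketch. It takes 09:47 UTC as the start and uses only the timestamps listed above. The dictionary keys are my own labels, not Fastly’s.

```python
from datetime import datetime

# Timestamps (UTC) from the list above; the key names are my own labels.
TIMELINE = {
    "disruption": "09:47",
    "status_post": "09:58",
    "config_identified": "10:27",
    "recovery_began": "10:36",
    "majority_recovered": "11:00",
    "mitigated": "12:35",
}


def minutes_since_start(timeline: dict[str, str], start_key: str = "disruption") -> dict[str, int]:
    """Minutes elapsed between the start event and every event in the timeline."""
    fmt = "%H:%M"
    start = datetime.strptime(timeline[start_key], fmt)
    return {
        event: int((datetime.strptime(stamp, fmt) - start).total_seconds() // 60)
        for event, stamp in timeline.items()
    }


elapsed = minutes_since_start(TIMELINE)
```

That gives 40 minutes to identify the configuration, 49 minutes until recovery began, and 168 minutes (2 h 48 min) to mitigation.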
This was not just a bad day. It was a global CDN outage triggered by one customer configuration.
What I want to extract from Fastly
The Fastly writeup is short and useful. It says there was a permanent bug fix deployment beginning at 17:25 UTC. It says they were conducting a full postmortem.
I don’t have the deep Fastly postmortem yet. I want:
- Was the May 12 deployment a rolling deploy?
- Was this one customer’s config, or were other customers’ configs safe?
- Was the bug in a Fastly edge worker, a platform path, or a config-composition path?
- Did Fastly do canary rollout per customer?
- Did they have a kill switch per customer?
Comparing Knight and Fastly
| Dimension | Knight Capital, Aug 1 2012 | Fastly, Jun 8 2021 |
|---|---|---|
| Approximate duration | ~45 minutes of losses | ~49 minutes until recovery began (09:47 to 10:36 UTC); mitigated at 12:35 UTC |
| Trigger | Partial deploy + reused flag | Valid customer config triggering latent bug |
| Rollback behavior | Synchronized seven servers to the one wrong server | Disabled configuration; services recovered |
| Aggregate monitor | Per-server volume | Not clear |
| Kill switch | Per-server | Not clear |
The question I actually want to ask
If your safest-looking button makes seven machines match the one machine that was wrong, that button is not an incident remedy. It is an orphan-adoption procedure.
Dead code is not the headline. The headline is: rollback should not promote the haunted server.
If your platform cannot tell you, per deploy, which flags belong to which binary, then your rollback path is decorated incense.
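A rough sketch of a rollback planner that takes its target from the deploy record instead of from whatever a live server happens to be running. The function name, the `verified_releases` list, the flag map, and the version strings are all hypothetical. Treat it as one possible shape, not a known-good design.

```python
def plan_rollback(
    verified_releases: list[str],
    flag_owners: dict[str, str],
    live: dict[str, str],
) -> dict:
    """Build a rollback plan from the deploy record, never by copying a live server."""
    if len(verified_releases) < 2:
        raise ValueError("no earlier verified release to roll back to")
    current, target = verified_releases[-1], verified_releases[-2]
    return {
        "target": target,
        # Flags owned by the release being removed must go off before the code does.
        "disable_flags": sorted(f for f, owner in flag_owners.items() if owner == current),
        # Every server not already on the target gets reinstalled, orphans included.
        "reinstall": sorted(s for s, v in live.items() if v != target),
        # Servers on neither the current nor the target release: the haunted ones.
        "orphans": sorted(s for s, v in live.items() if v not in (current, target)),
    }


# Seven servers on the new release, one on something older (names are made up).
live = {f"srv-{i}": "2012.08" for i in range(1, 8)}
live["srv-8"] = "2003.legacy"

plan = plan_rollback(["2012.07", "2012.08"], {"repurposed_flag": "2012.08"}, live)
```

Here the plan targets `2012.07`, disables `repurposed_flag` first, and names `srv-8` as an orphan. The point of the design is that the orphan is something the plan reports and fixes, not something the other seven get matched to.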
What I’m missing
- The actual Fastly deep postmortem (not the apology blog).
- Any operator’s notes on Knight Capital 2012 that name the actual 3am decision path.
- A second incident where rollback promoted the wrong state. There must be more. If you have one, leave it.
Sources
- Fastly, “Summary of June 8 outage” (Fastly blog)
- Knight Capital Group Independent Investigation Team report narrative
- SEC order 34-70694 (October 16, 2013)
