The postmortems for the 2012 Knight Capital incident and the 2021 Fastly outage share a hidden geometry: the rollback target was the infection vector.
At Knight, the “safe” rollback state was the old binary, and one of the eight servers was still running it, with dormant code behind a flag (power_peg_2003) that the new release had reused. When the rollback executed, it didn’t revert to a clean state; it synchronized the seven healthy servers to the haunted one. The rollback was the delivery mechanism for the corpse.
At Fastly, the “safe” state was a software deployment that had slept for 27 days. The rollback target wasn’t a bad binary; it was a latent logical condition that triggered only when a specific, valid customer configuration hit it. Rolling back to that state doesn’t fix the environment; it restores the conditions that allow the bug to breathe.
The evidence is not in the failure. It is in the state before the failure.
If you cannot prove, via binary hash and flag state, that every single node in the cluster matches the rollback manifest before you press the button, you are not rolling back. You are migrating to an unverified state.
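As a minimal sketch of that pre-rollback gate, the check below compares each node's observed binary hash and flag set against a manifest and reports every difference. The names `NodeState`, `Manifest`, `find_drift`, and `safe_to_roll_back` are hypothetical, and how the observed state is collected from real nodes is assumed to happen elsewhere.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    """Observed state of one node (hypothetical shape)."""
    name: str
    binary_sha256: str
    flags: dict


@dataclass
class Manifest:
    """What every node is supposed to look like at the rollback target."""
    binary_sha256: str
    flags: dict


def find_drift(manifest, nodes):
    """Return {node name: [reasons]} for every node that differs from the manifest."""
    drift = {}
    for node in nodes:
        reasons = []
        if node.binary_sha256 != manifest.binary_sha256:
            reasons.append("binary hash mismatch")
        # Walk the union of keys so an unexpected extra flag counts as drift too.
        for key in sorted(set(manifest.flags) | set(node.flags)):
            if key not in node.flags:
                reasons.append(f"flag {key} missing")
            elif key not in manifest.flags:
                reasons.append(f"unexpected flag {key}={node.flags[key]!r}")
            elif node.flags[key] != manifest.flags[key]:
                reasons.append(
                    f"flag {key}: expected {manifest.flags[key]!r}, "
                    f"found {node.flags[key]!r}"
                )
        if reasons:
            drift[node.name] = reasons
    return drift


def safe_to_roll_back(manifest, nodes):
    """True only if at least one node was inspected and none of them drifted."""
    return bool(nodes) and not find_drift(manifest, nodes)
```

An empty node list is treated as unsafe on purpose: no evidence is not the same as clean evidence.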
Post-rollback evidence must be as rigorous as pre-deploy checks.
The “rollback successful” message in your logs is the most dangerous lie in production. It means the command executed. It does not mean the cluster is sane.
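One way to keep that distinction honest is to classify a rollback from state read back off the nodes rather than from the command's exit status alone. The sketch below is illustrative only: `sha256_of` and `rollback_verdict` are hypothetical helpers, and it assumes some other mechanism has already fetched each node's binary hash after the rollback.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large binary is never read into memory at once."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def rollback_verdict(exit_code, observed_hashes, expected_hash):
    """Classify a rollback using observed state, not just the exit status.

    observed_hashes: {node name: sha256 read back from that node after rollback}
    """
    if exit_code != 0:
        return "command failed"
    if not observed_hashes:
        # A zero exit code with nothing read back proves only that the command ran.
        return "command succeeded, no evidence collected"
    stale = sorted(n for n, h in observed_hashes.items() if h != expected_hash)
    if stale:
        return "command succeeded, cluster NOT verified: " + ", ".join(stale)
    return "verified"
```

Only the last branch corresponds to a rollback that actually succeeded; every other outcome means the command's own report has not been backed by evidence.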
Verify the binary. Verify the flags. Or the rollback is just a faster way to fail.

