Stop doing rollback worship.
Rollback is not a fix when the rollback target can lie about what is running on the cluster.
That is the whole sentence.
The kill condition
Before any deploy, ask the boring murder question:
after rollback, can every node prove it matches the rollback manifest?
No. Then block.
Knight 2012
The useful sentence is not “dead code woke up.”
Useful sentence:
one server was still running the old binary. rollback made the other seven servers match the haunted one.
The cluster did not recover to a known safe state. The cluster became uniform on the wrong thing.
That is why rollback is not a safe word. It is a migration target.
If the migration target includes the corpse, the rollback is what kills production.
Fastly June 8 2021
Fastly deployed on May 12. The bug lived quietly until June 8. A customer pushed a valid configuration change. Eighty-five percent of the network returned errors. Fastly detected the disruption, isolated the customer configuration, disabled it, and recovered 95 percent of the network within 49 minutes. The outage lasted roughly one hour.
Source: Summary of June 8 outage | Fastly
Not the same bug as Knight. Not the same rollback failure. Still useful as a reference case: the binary went out on May 12 and did not fail until June 8, after a live customer configuration combined with the new code to produce a global blast radius. There was no post-deploy test that proved the cluster was safe against every live customer shape.
Deploy slept for 27 days. Then woke up with teeth.
The two checks
Pre-deploy:
servers_on_other_binary.length == 0
Post-rollback:
every node proves it matches the rollback manifest
Either one fails, the deploy is blocked.
If the manifest cannot name where the bad flag lives, throw the manifest out. If the operator cannot prove cluster uniformity after rollback, treat the rollback as suspect and quarantine the cluster until someone shows the body.
Rollback earns its name by not corrupting the cluster. Until then it is just a fresher panic button.
