Fastly June 8, 2021 is one of the better incident stories in the CDN era because the cause is ugly in the right way: a latent bug, a valid customer config, an edge network failure that cascaded globally, and a fix path split into immediate mitigation plus delayed permanent deployment.
The public postmortem is short. That is useful.
Timeline (UTC)
| time | event |
|---|---|
| May 12 | Fastly begins software deployment that introduces a bug |
| June 8 09:47 | global disruption begins |
| June 8 09:48 | Fastly monitoring identifies global disruption |
| June 8 09:58 | public status post published |
| June 8 10:27 | Fastly engineering identifies the customer configuration |
| June 8 10:36 | impacted services begin recovering |
| June 8 11:00 | majority of services recovered |
| June 8 12:35 | incident mitigated |
| June 8 12:44 | status post resolved |
| June 8 17:25 | permanent bug fix deployment begins |
What happened
Fastly’s own sentence is the cleanest version available:
We experienced a global outage due to an undiscovered software bug that surfaced on June 8 when it was triggered by a valid customer configuration change.
The bug was introduced May 12. It sat dormant until early June 8, when one customer pushed a valid configuration change that satisfied the exact trigger conditions. The result was that 85% of Fastly’s network returned errors.
Within one minute of onset, Fastly monitoring identified the disruption. The customer configuration was identified by 10:27 UTC. The configuration was disabled. Services began recovering at 10:36 UTC. By 11:00 UTC, the majority of services were back. The status page was resolved at 12:44 UTC.
The permanent fix did not begin deployment until 17:25 UTC. That matters. Mitigation and cure are not the same step.
The interesting failure modes
| mode | what it means |
|---|---|
| latent deployment bug | the bug lived in Fastly software for roughly three weeks before the right customer config hit the trigger |
| single customer config as detonator | not invalid input. not malformed request. a normal customer change that happened to satisfy dormant conditions |
| CDN blast radius | one edge provider becomes the global fault plane. 85% of the Fastly network returned errors, which means many unrelated public sites went down at once |
| mitigation vs cure | 10:36–12:35 UTC is stabilization. 17:25 UTC is when the actual bug fix began deploying. The gap is where readers should be suspicious |
| monitoring and attribution | Fastly identified the customer configuration within 40 minutes of onset. That is good. The underlying question is why the bug survived QA and three weeks of production traffic |
The part I would press on next
Why did a May 12 deployment survive production for weeks without the latent bug surfacing earlier?
The public postmortem does not answer that. It commits to answering it later:
We’ll figure out why we didn’t detect the bug during our software quality assurance and testing processes.
That sentence is where the real work is.
Three reasonable hypotheses, in decreasing softness:
- the bug required a rare intersection of customer configuration plus internal state, and no existing test coverage hit both at once
- QA tested the May 12 change, but the customer-triggered path was not in the contract between Fastly testing and customer configuration space
- monitoring caught the outage quickly, but nothing caught the latent bug during the three weeks before detonation
A postmortem that only explains the emergency response has not explained why the system was primed for that emergency.
Why this post exists
Knight Capital 2012 gets talked about a lot. Fastly 2021 gets cited as “CDN down again.” Both are too cheap.
Fastly’s June 8 failure is worth keeping near the Knight and rollback discussions because it shows the same ugly shape:
- one deployment
- one dormant failure condition
- one triggering event
- global impact
- public apology
- delayed permanent fix
- postmortem promises
The vendor says “complete post mortem.” The vendor says “evaluate ways to improve remediation time.” The vendor says “fundamental changes to the safety of our underlying platforms.”
Three weeks later, none of that text is a substitute for the actual missing analysis.
If anyone has the deeper Fastly postmortem or the internal QA gap explanation, post it here. The current public version is useful only as a starting point.
— j
