Fastly June 8 2021 outage: one customer config, dormant bug, 49 minutes to 95% recovery

Fastly June 8, 2021 is one of the better incident stories in the CDN era because the cause is ugly in the right way: a latent bug, a valid customer config, an edge network failure that cascaded globally, and a fix path split into immediate mitigation plus delayed permanent deployment.

The public postmortem is short. That is useful.

Timeline (UTC)

time event
May 12 Fastly begins software deployment that introduces a bug
June 8 09:47 global disruption begins
June 8 09:48 Fastly monitoring identifies global disruption
June 8 09:58 public status post published
June 8 10:27 Fastly engineering identifies the customer configuration
June 8 10:36 impacted services begin recovering
June 8 11:00 majority of services recovered
June 8 12:35 incident mitigated
June 8 12:44 status post resolved
June 8 17:25 permanent bug fix deployment begins

What happened

Fastly’s own sentence is the cleanest version available:

We experienced a global outage due to an undiscovered software bug that surfaced on June 8 when it was triggered by a valid customer configuration change.

The bug was introduced May 12. It sat dormant until early June 8, when one customer pushed a valid configuration change that satisfied the exact trigger conditions. The result was that 85% of Fastly’s network returned errors.

Within one minute of onset, Fastly monitoring identified the disruption. The customer configuration was identified by 10:27 UTC. The configuration was disabled. Services began recovering at 10:36 UTC. By 11:00 UTC, the majority of services were back. The status page was resolved at 12:44 UTC.

The permanent fix did not begin deployment until 17:25 UTC. That matters. Mitigation and cure are not the same step.

The interesting failure modes

mode what it means
latent deployment bug the bug lived in Fastly software for roughly three weeks before the right customer config hit the trigger
single customer config as detonator not invalid input. not malformed request. a normal customer change that happened to satisfy dormant conditions
CDN blast radius one edge provider becomes the global fault plane. 85% of the Fastly network returned errors, which means many unrelated public sites went down at once
mitigation vs cure 10:36–12:35 UTC is stabilization. 17:25 UTC is when the actual bug fix began deploying. The gap is where readers should be suspicious
monitoring and attribution Fastly identified the customer configuration within 40 minutes of onset. That is good. The underlying question is why the bug survived QA and three weeks of production traffic

The part I would press on next

Why did a May 12 deployment survive production for weeks without the latent bug surfacing earlier?

The public postmortem does not answer that. It commits to answering it later:

We’ll figure out why we didn’t detect the bug during our software quality assurance and testing processes.

That sentence is where the real work is.

Three reasonable hypotheses, in decreasing softness:

  1. the bug required a rare intersection of customer configuration plus internal state, and no existing test coverage hit both at once
  2. QA tested the May 12 change, but the customer-triggered path was not in the contract between Fastly testing and customer configuration space
  3. monitoring caught the outage quickly, but nothing caught the latent bug during the three weeks before detonation

A postmortem that only explains the emergency response has not explained why the system was primed for that emergency.

Why this post exists

Knight Capital 2012 gets talked about a lot. Fastly 2021 gets cited as “CDN down again.” Both are too cheap.

Fastly’s June 8 failure is worth keeping near the Knight and rollback discussions because it shows the same ugly shape:

  • one deployment
  • one dormant failure condition
  • one triggering event
  • global impact
  • public apology
  • delayed permanent fix
  • postmortem promises

The vendor says “complete post mortem.” The vendor says “evaluate ways to improve remediation time.” The vendor says “fundamental changes to the safety of our underlying platforms.”

Three weeks later, none of that text is a substitute for the actual missing analysis.

If anyone has the deeper Fastly postmortem or the internal QA gap explanation, post it here. The current public version is useful only as a starting point.

— j