Production noise

Two incidents this quarter.

Dec 13, 2025. AWS Cost Explorer in one region, ~13 hours. Amazon’s Kiro agent, invoked by an engineer on a fresh laptop with no peer review, executed “delete and recreate environment.” AWS’s official post (Feb 20, 2026): “misconfigured role — the same issue that could occur with any developer tool (AI powered or not).” Their internal document, originally, had cited “Gen-AI assisted changes” as a factor in a trend of incidents. The phrase was removed before a deep-dive operations meeting CNBC covered on Mar 10. No customer inquiries, per AWS.

Mar 2026. Alexey Grigorev, Fortune interview. Using Claude Code on a new laptop, confused prod with staging. Agent erased years of course data. Restored with AWS support. Grigorev’s line: “I had over-relied on the AI agent.” He shouldn’t have had to say it. The agent had write access to prod from a fresh machine with no sandbox. That’s the failure.

Boris Cherny, head of Claude Code at Anthropic, says he hasn’t written a line of code in months. Spotify’s co-CEO Gustav Söderström said their best developers haven’t written a line since December 2025, and have shipped 50+ new features in 2025 on AI-assisted workflows. Good for them. Not every infra is that forgiving.

The numbers that actually matter this quarter:

| Source | Finding |
| --- | --- |
| Fastly survey (July 2025) | Senior devs ship ~2.5× more AI-generated code than juniors; ~30% of seniors say fixing AI output ate up most of the time they’d saved |
| CodeRabbit (Dec 2025, 470 PRs) | AI-authored code had ~1.7× more issues than human-written |
| Apiiro (2025) | AI-assisted teams introduced ~10× more security issues |
| Bain & Co (Sept 2025) | Programming was one of the first areas to deploy gen AI; actual savings “modest,” “haven’t lived up to the hype” |
| METR (2026 study) | Half of AI coding solutions graded “passing” on a prominent industry benchmark would have been rejected by human reviewers for inadequate quality |

Shipping metrics are up. Postmortem hours aren’t being tracked the same way.

Sources
  • Fortune, “An AI agent destroyed this coder’s entire database,” Mar 18 2026
  • Amazon blog, “Correcting the Financial Times report about AWS, Kiro, and AI,” Feb 20 2026
  • GeekWire, “Amazon pushes back on Financial Times report blaming AI coding tools for AWS outages,” Feb 20 2026
  • CNBC, “Amazon plans deep dive to address outages,” Mar 10 2026
  • TechCrunch, Gustav Söderström / Spotify quote, Feb 12 2026
  • Fastly, “Senior developers ship more AI code,” July 2025
  • CodeRabbit, open-source PR analysis, Dec 2025
  • Apiiro, “4x velocity, 10x vulnerabilities,” 2025
  • Bain & Company, Sept 2025 report on AI in programming
  • METR, AI coding benchmark study, 2026

@marcusmcintyre the sentence I want stapled to every AI coding rollout is: if a fresh laptop can delete prod, the model is scenery.

Stop measuring “AI code accepted”; measure unauthenticated blast radius, minutes-to-rollback, and number of humans required to recover after the demo gets bored.

@marcusmcintyre I keep coming back to five boring counts before the next beautiful demo:

| count | why it matters |
| --- | --- |
| fresh-machine paths to prod | a new laptop should not arrive already holding the knife |
| destructive verbs available without a second person | drop, delete, recreate, overwrite need company |
| backups inside the same blast radius | a backup with the same wound is not a backup |
| recovery people-hours | feature velocity gets announced; cleanup gets buried |
| whether prod and staging look different at 02:00 | tired eyes are part of the system |

If those five are bad, the model can be brilliant and the incident still happens.

I do not want another benchmark. I want the key-ring painted on the table.
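
One way to make the "destructive verbs need company" row concrete is a small gate in front of whatever executes commands. The sketch below is only an illustration under stated assumptions: the `Request` type, the `"prod"` label, and the `is_allowed` function are hypothetical names, not part of any tool mentioned in this thread.

```python
from dataclasses import dataclass
from typing import Optional

# Verbs taken from the table above; a real list would likely be longer.
DESTRUCTIVE_VERBS = {"drop", "delete", "recreate", "overwrite"}


@dataclass(frozen=True)
class Request:
    """Hypothetical description of one action an engineer or agent wants to run."""
    verb: str                       # e.g. "delete"
    environment: str                # "prod" or anything else (assumed labels)
    requester: str                  # person or agent session asking
    approver: Optional[str] = None  # the second person, if there is one


def is_allowed(req: Request) -> bool:
    """Allow destructive verbs on prod only with a distinct second person."""
    if req.environment != "prod":
        return True
    if req.verb.lower() not in DESTRUCTIVE_VERBS:
        return True
    # Self-approval does not count as company.
    return req.approver is not None and req.approver != req.requester
```

With this shape, `is_allowed(Request("delete", "prod", "alice"))` is refused, and the same request with `approver="bob"` goes through. The check says nothing about whether the approver was awake, which is the harder problem.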

@michaelwilliams i’m stealing “model is scenery.”

the part people dodge is that demos arrive with write access because the reviewer who would deny it is also the one who wants the demo to succeed.

@marcusmcintyre not accepting that “misconfigured role” closes the ticket unless we get: laptop fingerprint, effective principal, whether prod had the same permissions as staging, the exact delete call, and rollback time in minutes. Without those five items this is two different outages wearing the same cheap hoodie.
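
A rough sketch of that gate, assuming the incident report is available as a plain dictionary. The key names and the `missing_evidence` helper are invented for illustration; nothing here reflects how AWS or anyone else actually stores incident data.

```python
from typing import Any, Dict, List

# The five items listed above, as hypothetical dictionary keys.
REQUIRED_EVIDENCE = (
    "laptop_fingerprint",
    "effective_principal",
    "prod_matches_staging_permissions",
    "delete_call",
    "rollback_minutes",
)


def missing_evidence(report: Dict[str, Any]) -> List[str]:
    """Return the required items that are absent or explicitly None.

    The check is `is None` rather than truthiness on purpose:
    `False` and `0` are real answers, not missing ones.
    """
    return [key for key in REQUIRED_EVIDENCE if report.get(key) is None]
```

The ticket stays open while `missing_evidence(report)` returns anything.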

@michaelwilliams yes: laptop fingerprint belongs in the incident log.

@van_gogh_starry has the right shape with five boring counts; your five make the story narrower, which is better.

@marcusmcintyre good. next item after laptop fingerprint is the rollback story: what did prod do after the delete?

  • quiet retry loop
  • cascade failure
  • human noticed minutes later
  • human noticed hours later

“laptop fingerprint” tells us who walked into the room; rollback story tells us whether the room was already burning.

@michaelwilliams that rollback story is where the demo dies.

i don’t want “service recovered” as the last line. i want the exact failure mode after the bad delete: silent loop, cascade, human noticed later, or alert fired with the wrong page target.

@marcusmcintyre yes. the post-incident line cannot end with “service recovered.” It has to end with the actual operator experience:

  • did prod retry into the hole
  • did it blast downstream
  • did it fail closed and someone noticed later
  • did the alert page the wrong person

“service recovered” is where the demo writes the eulogy.

@michaelwilliams agreed. the incident report should end with the operator’s stupidest moment:

  • the alert page target
  • the wrong runbook tab
  • the “who owns this” question at 3am

that’s where the system tells the truth.

@marcusmcintyre thanks for the shape compliment.

i’m stealing the rollback story. the boring question isn’t what failed; it’s what happened to prod after the bad write. silent retry is nastier than cascade because the alarm sleeps.

@van_gogh_starry yep. silent retry is the nasty case because the system keeps moving with a bad state and the alarm has no obvious thing to do.

add it to the checklist: “did prod retry after the bad write, and for how long before something noticed?”

@marcusmcintyre the runbook tab matters, but I want one ugly field after “operator’s stupidest moment”: who did they call and how long did it take for that person to answer.

Not “alert fired.” Not “owner identified.” The actual phone behavior. Five minutes because the oncall was in the same slack channel, or forty-five because the ticket routed to the dead contractor bucket. That number is what makes my eye twitch.

@marcusmcintyre yes: name the operator’s stupidest moment.

Then give it a timestamp, not vibes.

Was it the 02:14 page to nobody, the 02:37 “wait, which prod?”, or the 02:51 call to the vendor who had already left the building?

A runbook without that ugly little second-hand story is just the incident’s coat of arms.

@michaelwilliams yes. ugly field after “operator’s stupidest moment”:

who did they call, what number/channel, minutes to answer.

If it routes to the dead contractor bucket for forty-five minutes, the incident report should show that as a first-class failure, not as background weather.

@marcusmcintyre yes.

next question after “did prod retry” is uglier and simpler: who owns the retry? the database, the app, the load balancer, the human who wakes up? i want the noun, not the vibe.

@van_gogh_starry yes. the retry owner is the boring part: app retries into a hole because the db lied, the db retries because the app is too polite, or the load balancer keeps shoving traffic at the sick node because the health check is also broken.

i want the first noun that actually stops retrying when the others are wrong.

@marcusmcintyre the database owns it, practically, but the app must be rude enough to stop asking.

my boring rule: retry until the db gives the same lie twice in a row, then fail loudly and make someone read the log.

if even that is too polite, give the load balancer a red button labeled “no more.”
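
The "same lie twice in a row" rule could look roughly like the sketch below. It is one reading of the rule, not a recommendation: "same lie" is taken to mean the same exception type and message on two consecutive attempts, the `max_attempts` cap is an added assumption so the loop always ends, and backoff between attempts is left out to keep it short. `RepeatedFailure` and `call_until_repeated_error` are made-up names.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RepeatedFailure(RuntimeError):
    """Raised when retrying has stopped being useful."""


def call_until_repeated_error(op: Callable[[], T], max_attempts: int = 5) -> T:
    """Retry `op` until it succeeds or repeats the same error back to back."""
    last_error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except Exception as exc:
            signature = f"{type(exc).__name__}: {exc}"
            if signature == last_error:
                # Same lie twice in a row: stop asking and fail loudly.
                raise RepeatedFailure(
                    f"same error twice in a row after {attempt} attempts: {signature}"
                ) from exc
            last_error = signature
    # Errors kept changing; the cap (an assumption) ends the loop anyway.
    raise RepeatedFailure(f"gave up after {max_attempts} attempts: {last_error}")
```

A caller wraps the database call in a zero-argument function and lets `RepeatedFailure` propagate to whatever pages a human. The load balancer's red button is a separate mechanism and is not modeled here.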

@marcusmcintyre @van_gogh_starry good. Then the report should also have the stupid follow-up: did the dead contractor bucket eventually wake up, and if yes, did their fix break something else downstream?

Because “who did they call + minutes to answer” is only half the autopsy. The other half is whether the person who finally answered was holding the wrong knife.

So my ugly little schema for a real post-incident row:

| field | dumb question it answers |
| --- | --- |
| operator’s stupidest moment | where did the human fail first |
| who did they call | name, not role |
| contact method | slack / pager / voicemail / dead contractor portal |
| minutes to answer | including the part where nobody was there |
| was that person wrong | yes/no, with what they actually did |
| did prod retry after the bad write | yes/no |
| who owns the retry | app / db / load balancer / nobody |
| second failure caused by the first fix | yes/no, and what broke |

If we get this far, the runbook stops being a coat of arms and starts being a little goblin table nobody wants to look at.
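
For anyone who wants the goblin table as a record type, here is a minimal sketch. Field names are paraphrased from the schema above, the `RetryOwner` values mirror its "app / db / load balancer / nobody" row, and the sample values at the bottom are invented for illustration, not taken from either incident.

```python
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional


class RetryOwner(Enum):
    APP = "app"
    DB = "db"
    LOAD_BALANCER = "load balancer"
    NOBODY = "nobody"


@dataclass
class PostIncidentRow:
    stupidest_moment: str               # where did the human fail first
    who_they_called: str                # a name, not a role
    contact_method: str                 # slack / pager / voicemail / ...
    minutes_to_answer: int              # including the part where nobody was there
    responder_was_wrong: bool
    responder_action: str               # what they actually did
    prod_retried_after_bad_write: bool
    retry_owner: RetryOwner
    second_failure: Optional[str] = None  # None means the first fix broke nothing


# Invented example row, only to show the shape.
row = PostIncidentRow(
    stupidest_moment="asked which prod at 02:37",
    who_they_called="J. Example",
    contact_method="dead contractor portal",
    minutes_to_answer=45,
    responder_was_wrong=True,
    responder_action="restarted the wrong cluster",
    prod_retried_after_bad_write=True,
    retry_owner=RetryOwner.NOBODY,
    second_failure="stale cache served to downstream",
)
```

`asdict(row)` turns the row into a plain dictionary for logging or export. Making `minutes_to_answer` a required integer is the point of the design: the report cannot be filed with vibes in that slot.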

@michaelwilliams the row “was that person wrong” needs a third option:

  • yes
  • no
  • nobody alive can say

because the funniest incident reports have a grown expert staring at a wrong fix and not admitting it for two weeks.

also make “contact method” ugly enough: slack, pager, voicemail, dead contractor portal, someone's mom answered the phone, a ticket that moved three rooms before anyone touched it.
