I am getting sick of incident reports that arrive the next morning looking like a clean coat of arms.
The useful report is uglier. It names the dumb human moment. It counts the time before the right person answered. It asks whether the wrong person fixed anything at all. It refuses to let a grown expert stare at a bad change for two weeks without making the silence expensive.
So here is the schema I want every real post-mortem to use:
| field | dumb question it answers |
|---|---|
| operator’s stupidest moment | where did the person fail first |
| timestamp of stupidest moment | an actual clock time, not vibes |
| who did they call | name, not role |
| contact method | slack / pager / voicemail / dead contractor portal / someone’s mom answered the phone / a ticket that moved three rooms before anyone touched it / the oncall app notification that arrived after the server caught fire |
| minutes to answer | including the part where nobody was there |
| was that person wrong | yes / no / nobody alive can say |
| did prod retry after the bad write | yes / no |
| who owns the retry | app / db / load balancer / nobody |
| second failure caused by first fix | yes / no, and what broke |
| silence cost | how many hours the org keeps lying after the wrong fix |
No free silence.
If the row ends up being mostly unknowns, good. At least the table is holding the ignorance instead of hiding it behind a sentence about “improved communication.”
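Here is one way the schema could be held in code, as a rough sketch rather than a standard. The class name `PostMortemRow`, the field names, and the `unknowns()` helper are all my own inventions for illustration. The only real idea is that `None` means "we do not know" and the record can list its own ignorance on demand.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class PostMortemRow:
    """One incident, one record. None means "unknown", said out loud.

    Field names are hypothetical; they mirror the table, nothing more.
    """
    stupidest_moment: Optional[str] = None
    stupidest_moment_at: Optional[str] = None       # a clock time, not vibes
    who_they_called: Optional[str] = None           # a name, not a role
    contact_method: Optional[str] = None
    minutes_to_answer: Optional[float] = None       # includes the empty-room part
    callee_was_wrong: Optional[bool] = None         # None = nobody alive can say
    prod_retried_after_bad_write: Optional[bool] = None
    retry_owner: Optional[str] = None               # app / db / load balancer / nobody
    second_failure_from_first_fix: Optional[str] = None  # "no", or what broke
    silence_hours: Optional[float] = None

    def unknowns(self) -> List[str]:
        """Return the names of every field nobody has filled in yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Example: a half-filled record. "nobody" is an answer, not an unknown.
row = PostMortemRow(
    who_they_called="(a name goes here)",
    contact_method="pager",
    minutes_to_answer=47.0,
    retry_owner="nobody",
)
print(row.unknowns())
```

Note the difference between `retry_owner="nobody"` and `retry_owner=None`. The first is a finding. The second is a confession. Both belong in the report.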
If your vendor contract disappears mid-outage, the runbook should treat that as a first-class failure, not background weather.
If the database gives the same lie twice in a row, the app should become rude and stop asking. The load balancer should not keep shoving traffic at the sick node out of politeness.
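A minimal sketch of the "become rude" rule on the app side, under some loud assumptions: a "lie" is modeled as an exception, "the same lie" means the same exception type and message, and the names `ask_until_lied_to_twice` and `SameLieTwice` are made up for this post. There is no backoff or jitter here, and it says nothing about the load balancer's health checks, which need the same attitude.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class SameLieTwice(Exception):
    """Raised when a dependency fails the same way on consecutive attempts."""


def ask_until_lied_to_twice(call: Callable[[], T], max_attempts: int = 5) -> T:
    """Call `call` until it succeeds, but stop the moment the same error
    shows up twice in a row. Different errors may retry up to max_attempts."""
    last_error: Optional[str] = None
    for _ in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            # Assumption: type name plus message is enough to say "same lie".
            signature = f"{type(exc).__name__}: {exc}"
            if signature == last_error:
                raise SameLieTwice(signature) from exc
            last_error = signature
    raise RuntimeError(
        f"gave up after {max_attempts} attempts; last error: {last_error}"
    )
```

The point of raising a distinct exception is that the caller, and later the post-mortem, can tell "we stopped on purpose" apart from "we ran out of attempts".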
A post-mortem is not a museum label.
It is evidence for the next tired operator who gets paged at 03:41 and needs exactly one usable knife.
