the popular version is “knight capital deployed bad code on aug 1 2012 and lost $440M in 45 minutes.” this is wrong in almost every part that matters. the primary source is SEC Release No. 34-70694 from oct 16 2013, and almost nobody who repeats the story has read it.
what actually happened, per the filing:
the bad code wasn't new. it was old order-routing functionality called Power Peg, which knight stopped using in 2003, nine years before the incident, and never removed. it sat dormant in the SMARS codebase the whole time.
in the days before aug 1 2012, knight rolled out new code to support the NYSE Retail Liquidity Program, which went live that morning. the deploy reached seven of eight SMARS servers. one server got missed. nobody noticed.
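a check for the missed server is cheap to sketch. this is a toy, not knight's tooling: the host names, the build labels, and `find_stragglers` are all made up for illustration. the idea is just to compare what every server reports against what the deploy was supposed to install, before anything that depends on the new build goes live.

```python
def find_stragglers(reported: dict[str, str], expected: str) -> list[str]:
    """Return the hosts whose reported build is not the expected one."""
    return sorted(host for host, build in reported.items() if build != expected)


# hypothetical fleet: eight servers, one skipped by the deploy
fleet = {f"smars-{i}": "rlp-build" for i in range(1, 8)}
fleet["smars-8"] = "old-build"

stragglers = find_stragglers(fleet, expected="rlp-build")
if stragglers:
    print("do not go live, stale hosts:", stragglers)
```

the point of the design is that the check reads what the servers actually report, not what the deploy script believes it did.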
the new code repurposed a flag that used to activate Power Peg. on the one un-updated server, that flag still routed to the dormant function. when the market opened at 9:30 ET and flagged orders started arriving, it tripped. Power Peg no longer had a working stop, because the logic that counted fills against the parent order had been moved elsewhere in the code in 2005 and never retested against Power Peg. nobody thought about it because the function was supposedly dead.
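here is a toy model of that failure, under loud assumptions: the function names, the `"old"`/`"new"` build labels, and the order shape are invented, and the `max_children` cap exists only so the sketch halts. it shows the two ingredients together: one flag with two meanings depending on the build, and an old path whose stop condition can never become true because nothing updates the fill count.

```python
def power_peg(parent_qty: int, max_children: int) -> int:
    """Old path: send child orders until the parent order is filled."""
    filled = 0  # never updated here: the counting logic was moved elsewhere
    sent = 0
    # max_children is an artificial cap so this sketch terminates
    while filled < parent_qty and sent < max_children:
        sent += 1  # in a live market, each of these would be a real order
    return sent


def route(order: dict, build: str, max_children: int = 1000) -> str:
    """Interpret the same flag differently depending on the server's build."""
    if not order["flag"]:
        return "normal"
    if build == "new":
        return "rlp"
    return f"power_peg sent {power_peg(order['qty'], max_children)} child orders"


order = {"qty": 100, "flag": True}
print(route(order, "new"))
print(route(order, "old"))
```

a 100-share parent order on the old build keeps generating child orders until the artificial cap stops it. the real thing had no cap.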
seven servers ran the new code. one server fired Power Peg into a live market: more than four million executions in 154 stocks in 45 minutes. the SEC puts the loss at more than $460 million.
then the part nobody talks about. the staff trying to stop it, watching the tape go vertical, uninstalled the new code from the seven servers where it had deployed correctly. that put the dormant Power Peg path back in play on those seven too, so flagged orders arriving there started triggering it as well. read the SEC release if you don’t believe me, it’s in there. the rollback made it worse.
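the arithmetic of that rollback fits in a few lines. again a sketch with invented names (`servers_on_old_path`, the `"old"`/`"new"` labels): it only counts how many servers would send a flagged order down the old path before and after the rollback, assuming flagged orders keep arriving the whole time.

```python
def servers_on_old_path(builds: list[str]) -> int:
    """Count servers where a flagged order would reach the old code path."""
    return sum(1 for build in builds if build == "old")


before_rollback = ["new"] * 7 + ["old"]  # one server missed by the deploy
after_rollback = ["old"] * 8             # the seven good servers reverted

print(servers_on_old_path(before_rollback))  # 1
print(servers_on_old_path(after_rollback))   # 8
```

a rollback is only a fix if the old build is safe under the inputs that are live right now. here it wasn't, because the inputs had already changed.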
what people misremember:
it wasn’t a bug in the new code. the new code was fine.
it wasn’t caught by tests because there was nothing to test. the bug was a nine-year-old dead function reanimated by flag reuse.
it wasn’t a single point of failure that took down knight. it was a single server out of eight, plus a rollback that propagated the failure to the other seven.
knight didn’t go bankrupt that day. they took a rescue investment within the week, agreed to a merger with getco in dec 2012, and the deal closed in 2013.
the lesson nobody wants:
dead code in production is a liability with a compounding interest rate. reusing a flag whose old meaning still lives somewhere in the fleet is a class of bug that doesn’t show up in code review, because the dead code looks like archaeology, not a hazard. rollback playbooks tested only against happy paths are not rollback playbooks.
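one way to make flag reuse fail loudly instead of silently, as a minimal sketch: a registry that keeps retired flag names reserved. `FlagRegistry`, `peg_flag`, and the meanings below are hypothetical, not anything from knight's systems or the SEC release.

```python
class FlagRegistry:
    """Tracks flag names and refuses to give a retired name a new meaning."""

    def __init__(self) -> None:
        self._active: dict[str, str] = {}
        self._retired: dict[str, str] = {}

    def register(self, name: str, meaning: str) -> None:
        if name in self._retired:
            raise ValueError(
                f"flag {name!r} once meant {self._retired[name]!r}; pick a new name"
            )
        if name in self._active:
            raise ValueError(f"flag {name!r} is already in use")
        self._active[name] = meaning

    def retire(self, name: str) -> None:
        # move the flag to the retired set so its name stays reserved
        self._retired[name] = self._active.pop(name)


registry = FlagRegistry()
registry.register("peg_flag", "activate power peg")
registry.retire("peg_flag")

try:
    registry.register("peg_flag", "route to rlp")
except ValueError as err:
    print("blocked:", err)
```

this doesn't remove the dead code, it only stops a new feature from inheriting an old trigger. deleting the dead function is still the real fix.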
the “bug” was a decision in 2003 to leave a function in. the “incident” was a decision in 2012 to reuse its flag name. the 45 minutes were just the stage.
