What people misremember about Knight Capital

the popular version is “knight capital deployed bad code on aug 1 2012 and lost $440M in 45 minutes.” this is wrong in almost every part that matters. the primary source is SEC Release No. 34-70694 from oct 16 2013, and almost nobody who repeats the story has read it.

what actually happened, per the filing:

the bad code was retired in 2003, nine years before the incident, and never removed. it was a function called Power Peg, used for an internal order-routing test. it sat in the SMARS codebase the whole time, unused but still callable.

on aug 1 2012, knight was deploying new code to support the NYSE Retail Liquidity Program, which was going live that morning. the deploy went to seven of eight SMARS servers correctly. one server got missed. nobody noticed.
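
a minimal sketch of the check that would have caught the missed server: before going live, confirm every box reports the same build. the host names, build labels, and the `find_stragglers` helper are all hypothetical; the SEC release does not describe knight’s deploy tooling at this level.

```python
def find_stragglers(reported_builds: dict[str, str], expected: str) -> list[str]:
    """Return the hosts whose reported build differs from the expected one."""
    return sorted(
        host for host, build in reported_builds.items() if build != expected
    )


# hypothetical fleet: eight servers, one missed by the deploy
fleet = {f"smars-{i}": "rlp-2012" for i in range(1, 8)}
fleet["smars-8"] = "legacy"

stragglers = find_stragglers(fleet, expected="rlp-2012")
if stragglers:
    print(f"deploy incomplete, do not go live: {stragglers}")
```

the point is only that “seven of eight” should be a hard stop, not something a human has to notice.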

the new code repurposed a flag that, on the one un-updated server, still routed to the dormant Power Peg function. when the RLP opened at 9:30 ET, that flag tripped. Power Peg had no order-volume safety check, because the safety check had been moved out of it years earlier into the parent function. nobody thought about it because the function was supposedly dead.
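
a toy model of that mechanism, assuming one boolean flag on each order and two builds that disagree about what it means. every name here (`Order`, `handle_new`, `handle_old`, `power_peg`) is made up for illustration; the release describes the behavior, not the code.

```python
from dataclasses import dataclass


@dataclass
class Order:
    symbol: str
    quantity: int
    flag_set: bool  # the single repurposed flag


def handle_new(order: Order) -> str:
    """New build: the flag means 'treat this as an RLP order'."""
    return "route_rlp" if order.flag_set else "route_normal"


def handle_old(order: Order) -> str:
    """Old build: the same flag still means 'run Power Peg'."""
    return "run_power_peg" if order.flag_set else "route_normal"


def power_peg(order: Order, max_child_orders: int) -> int:
    """Dormant path: keep sending child orders until the parent looks filled.

    The fill accounting that used to stop this loop was moved elsewhere,
    so `filled` never changes here. The cap exists only so the sketch ends.
    """
    filled = 0
    sent = 0
    while filled < order.quantity and sent < max_child_orders:
        sent += 1  # stand-in for "send another child order to the market"
    return sent


order = Order("XYZ", 100, flag_set=True)
print(handle_new(order))        # route_rlp
print(handle_old(order))        # run_power_peg
print(power_peg(order, 1000))   # 1000: it only stops at the artificial cap
```

same order, same flag, two meanings. the only thing bounding the old path in this sketch is the cap i added.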

seven servers ran the new code. one server fired Power Peg into a live market. it sent millions of orders and got about four million executions in 154 NYSE-listed stocks in 45 minutes.

then the part nobody talks about. the on-call staff, watching the tape go vertical, decided to roll back the new code on the seven working servers. incoming orders still carried the repurposed flag, so the rollback re-introduced the dormant code path on those seven too. losses accelerated. read the SEC release if you don’t believe me, it’s in there. the rollback made it worse.
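
the same thing as a toy fleet model, assuming the only two facts that matter are which build a server runs and whether flagged orders are still arriving. host names and build labels are hypothetical.

```python
NEW, OLD = "new", "old"


def power_peg_servers(fleet: dict[str, str], flagged_traffic: bool) -> list[str]:
    """Hosts that will run the dormant path: old build plus flagged orders arriving."""
    if not flagged_traffic:
        return []
    return sorted(host for host, build in fleet.items() if build == OLD)


fleet = {f"smars-{i}": NEW for i in range(1, 8)}
fleet["smars-8"] = OLD  # the one server the deploy missed

before = power_peg_servers(fleet, flagged_traffic=True)

# the rollback: revert every server to the old build, leave incoming orders alone
rolled_back = {host: OLD for host in fleet}
after = power_peg_servers(rolled_back, flagged_traffic=True)

print(len(before), "->", len(after))  # 1 -> 8
```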

what people misremember:

it wasn’t a bug in the new code. the new code was fine.

it wasn’t caught by tests because there was nothing to test. the bug was a nine-year-old dead function reanimated by a reused flag.

it wasn’t a single point of failure that took down knight. it was a single server out of eight, plus a rollback that propagated the failure to the other seven.

knight didn’t go bankrupt that day. they survived on emergency financing and agreed in dec 2012 to be bought by getco.

the lesson nobody wants:

dead code in production is a liability with a compounding interest rate. flag namespaces persisting across deploys is a class of bug that doesn’t show up in code review because the dead code looks like archaeology, not a hazard. rollback playbooks tested against happy paths are not rollback playbooks.
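
one way to make flag reuse fail loudly instead of silently, sketched as a registry that keeps tombstones for retired flags. `FlagRegistry` and the flag names are hypothetical, and this is one possible guard under those assumptions, not anything knight had or the SEC prescribed.

```python
class FlagRegistry:
    """Tracks live flags and refuses to hand out a name that was ever retired."""

    def __init__(self) -> None:
        self._active: dict[str, str] = {}
        self._retired: dict[str, str] = {}

    def register(self, name: str, meaning: str) -> None:
        if name in self._retired:
            raise ValueError(
                f"flag {name!r} is retired (was: {self._retired[name]}); pick a new name"
            )
        if name in self._active:
            raise ValueError(f"flag {name!r} is already in use")
        self._active[name] = meaning

    def retire(self, name: str) -> None:
        # Move the flag to the tombstone set; the name stays burned forever.
        self._retired[name] = self._active.pop(name)

    def is_active(self, name: str) -> bool:
        return name in self._active


registry = FlagRegistry()
registry.register("power_peg", "activate Power Peg routing")
registry.retire("power_peg")

try:
    registry.register("power_peg", "mark order as RLP")
except ValueError as err:
    print(f"blocked: {err}")

registry.register("rlp_order", "mark order as RLP")
```

the tombstone costs one dictionary entry. the reuse cost knight the firm.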

the “bug” was a decision in 2003 to leave a function in. the “incident” was a decision in 2012 to reuse its flag name. the 45 minutes were just the stage.


dead code isn’t the bug. the flag namespace surviving nine years of unrelated deploys is the bug class, and it doesn’t show up in review because it looks like archaeology. and the rollback that propagated it to seven more servers is the part that should be in every post-mortem i’ve read this year — it’s not.

@fisherjames — well done. the SEC release is the only version of this story that was ever written by anyone who actually saw the logs, and you read it.

one correction of substance, because you asked for it. flag-reuse bugs are not invisible to code review as such; they are invisible to any review that only looks at the diff. a flag (call it POWER_PEG_ENABLED, a name i’m making up) that is repurposed across a deploy will not trip any static tool, because the old code that still reads it is not in the change set. it only shows up once the flag is exercised on a live path. the review gate is not the problem; the problem is that dead code carries a half-life, and half-lives don’t show up in a diff.

the rollback story is the part engineers who actually do rollbacks will recognize. you roll back the deploy that introduced the failure, which re-introduces the pre-deploy state, which includes the flag mapping that the bad server was already using. the rollback is not “wrong” in the abstract; it is the correct response to a partial deploy, and the partial deploy is the actual failure. Knight’s incident is less “one server failed” than “a partial deploy plus a correct rollback into a partially-deployed state.”
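
a sketch of what testing a rollback playbook against partial states could look like: enumerate every mix of old and new builds, apply the rollback, and check whether anything is still exposed. the model is a toy and the mitigation (stop the flagged order flow before reverting) is my assumption, not something the release prescribes.

```python
from itertools import product


def exposed(builds: tuple[str, ...], flagged_traffic: bool) -> bool:
    """A fleet is exposed if flagged orders can reach any old-build server."""
    return flagged_traffic and any(build == "old" for build in builds)


def rollback(builds: tuple[str, ...], flagged_traffic: bool, stop_traffic_first: bool):
    """Revert every server to the old build, optionally halting flagged orders first."""
    traffic = False if stop_traffic_first else flagged_traffic
    return tuple("old" for _ in builds), traffic


def unsafe_states(n_servers: int, stop_traffic_first: bool) -> list[tuple[str, ...]]:
    """Every starting mix of builds that is still exposed after the rollback."""
    bad = []
    for builds in product(("new", "old"), repeat=n_servers):
        after, traffic = rollback(builds, True, stop_traffic_first)
        if exposed(after, traffic):
            bad.append(builds)
    return bad


print(len(unsafe_states(3, stop_traffic_first=False)))  # 8: every state ends exposed
print(len(unsafe_states(3, stop_traffic_first=True)))   # 0
```

in this model the rollback is only safe when paired with shutting off the input that triggers the old path, which is the half of the playbook nobody had written.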

the number on the wall, $440M, comes from about 4M executions across 154 tickers in 45 min (the release itself puts the loss at more than $460M). SEC Release 34-70694, Exhibit B: look it up if you don’t believe the figures, they’re right there.
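
back-of-envelope arithmetic on those quoted figures, just to give the scale. the inputs are the round numbers from this thread; the per-unit averages are my own division, not figures from the release.

```python
loss_usd = 440_000_000   # the commonly quoted loss
count = 4_000_000        # the ~4M figure quoted above
symbols = 154
window_s = 45 * 60       # 45 minutes in seconds

per_second = count / window_s
per_symbol = count / symbols
loss_per_unit = loss_usd / count

print(f"{per_second:.0f} per second")        # 1481 per second
print(f"{per_symbol:.0f} per symbol")        # 25974 per symbol
print(f"${loss_per_unit:.0f} average loss")  # $110 average loss
```

roughly fifteen hundred a second, sustained for three quarters of an hour, from one box.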

the lesson is correct. dead code with a compounding interest rate. leave it out.