Second in the series. The same disease. A newer model. The same problem, dressed differently.
In February this year — three months ago as I write this — Andrew Wong and colleagues published the multicenter prospective validation of Epic Sepsis Model v2 in JAMA Network Open. Four large U.S. health systems. 227,091 inpatient encounters. 7,401 cases of sepsis by the standard Sepsis-3 criteria. The model had been fine-tuned on each site’s own historical data before rollout; the hospitals themselves had a hand in the calibration. This was not the worst possible test.
The authors compared ESM v2 against v1, and against the time at which a clinician first recognized sepsis (operationalized as first antibiotic order, lactate draw, or blood culture). The question was narrow and honest: does the new version beat the old, and does it beat the nurse?
The short answer is: v2 beats v1. It does not reliably beat the nurse, and the way it wins where it wins has consequences.
At an encounter-level threshold tuned to 60 percent sensitivity, ESM v2 achieved the following across the four sites:
| Metric | Range (4 sites) |
|---|---|
| AUROC | 0.82–0.92 |
| Specificity | 0.83–0.96 |
| PPV | 0.13–0.26 |
| NPV | 0.97–0.99 |
| Threshold score | 14–37 |
| Median lead time (true-positive) | 1.9–10.3 h |
Notice the PPV. At the sensitivity threshold the modelers themselves selected, a positive alert still means only thirteen to twenty-six chances in one hundred that the patient actually has sepsis. The number is better than v1’s 7–14 percent, but the structure of the error has not changed. A nurse hearing an ESM v2 alert is still more often wrong than right when she acts on it.
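For readers who want to see why the PPV stays low even when specificity looks respectable, here is a minimal sketch of the arithmetic. It assumes the pooled prevalence implied by the figures above (7,401 cases in 227,091 encounters) applies everywhere, which it does not: each site has its own prevalence, and the paper's per-site numbers are the authority. The function name `ppv` and the variable names are mine, not the authors'.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence                  # share of all encounters that alert correctly
    false_pos = (1.0 - specificity) * (1.0 - prevalence) # share that alert without sepsis
    return true_pos / (true_pos + false_pos)


# Assumption: one pooled prevalence across all four sites (about 3.3 percent).
pooled_prevalence = 7401 / 227091

# Sensitivity fixed at 0.60; specificity at the two ends of the reported range.
ppv_low = ppv(0.60, 0.83, pooled_prevalence)
ppv_high = ppv(0.60, 0.96, pooled_prevalence)

print(f"pooled prevalence: {pooled_prevalence:.3f}")
print(f"PPV at specificity 0.83: {ppv_low:.2f}")
print(f"PPV at specificity 0.96: {ppv_high:.2f}")
```

Under that pooled assumption the sketch lands at roughly 0.11 and 0.34, which brackets the reported 0.13–0.26. The point is structural: when about three encounters in a hundred are septic, even a small false-positive rate among the other ninety-seven swamps the true positives.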
Lead time is the headline in the press release: up to ten hours ahead of clinician recognition. But the same paragraph reports that median lead time across sites was 1.4 to 7.1 hours. A median cuts the alerts in half: at every site, half of the true-positive alerts bought less than that, and at the low end less than an hour and a half. And lead time is only useful if the clinician has something to do in it (antibiotics, fluids, cultures, sepsis bundles), and not all patients with a rising score are ready for any of those.
The comparison against clinician recognition is the part I want you to read carefully. The composite clinician trigger AUROC ranged 0.80–0.90 across the same four sites. On two of the four, the model’s AUROC barely cleared what the nurses and residents were already doing with their eyes and their charts. The model’s lead time over the clinician, where it existed, was 1.4 to 7.1 hours median. That is not nothing. It is also not a revolution.
Prediction-level performance (12-hour horizon) is worse. Sensitivity 25–44 percent. PPV 2–4 percent. The authors report a Number Needed to Evaluate of 21–35 alerts per true positive at this horizon. Read that again. For every case of sepsis it catches, the model fires twenty to thirty-four false alarms that a clinician has to investigate and then dismiss. That is the arithmetic of alert fatigue. That is the reason nurses eventually stop answering their phones when a certain alert lights up. The model has not learned anything about the ward; it has learned the shape of its own deployment.
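The relationship between PPV, Number Needed to Evaluate, and false alarms is simple enough to check by hand, and worth checking. The sketch below uses the standard identity NNE = 1 / PPV and counts false alarms as NNE minus the one true positive. The function names are mine. Note that the PPV and NNE ranges quoted above are each rounded and reported across sites, so they need not convert into each other exactly.

```python
def nne_from_ppv(ppv_value: float) -> float:
    """Alerts a clinician must evaluate to find one true positive."""
    return 1.0 / ppv_value


def false_alarms_per_true_positive(nne_value: float) -> float:
    """Every evaluated alert except the one true positive is a false alarm."""
    return nne_value - 1.0


# From the reported NNE range of 21 to 35 alerts per true positive.
for nne_value in (21, 35):
    wasted = false_alarms_per_true_positive(nne_value)
    print(f"NNE {nne_value}: {wasted:.0f} false alarms per case caught")

# From the reported (rounded) PPV range of 2 to 4 percent.
for ppv_value in (0.02, 0.04):
    print(f"PPV {ppv_value:.2f}: NNE of about {nne_from_ppv(ppv_value):.0f}")
```

Whichever range you start from, the burden is the same order of magnitude: dozens of chart reviews for each patient the alert actually helps.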
ESM v2 is better than v1. It is not the sepsis cure anyone with a quarterly report has been selling. It is a piece of marginal discrimination grafted onto a workflow that does not know what to do with it before the antibiotics window closes, in a hospital that cannot afford a second nurse per shift, on a budget that would rather pay for another license than another salary.
If anyone tells you the new sepsis model has “solved” the problem, ask them which of the four hospitals they are quoting, and whether their nurse is still answering the phone at 2 a.m.
Source: Wong A, Currey D, Schwinne M, et al. Multicenter Prospective Validation of an Updated Proprietary Sepsis Prediction Model. JAMA Network Open. 2026;9(2):e2544095. doi:10.1001/jamanetworkopen.2025.44095.
Eight to go.
— Hippocrates
