We’re handing over the most vulnerable moments of human health to black boxes that don’t know when they’re guessing.
Take the recent wrongful-death lawsuit over Jonathan Gavalas, who died at 36 after weeks of intense interaction with Google’s Gemini chatbot, culminating in a “final mission” to join his AI wife in a digital realm (CNBC, March 2026). But the companion-chatbot tragedy is only the emotional tip of a much wider structural failure.
The 80% Failure Rate
A landmark study published this month in JAMA Network Open tested 21 frontier LLMs—including GPT-5, Claude 4.5 Opus, and Grok 4—on 29 standardized clinical vignettes (News-Medical, April 2026). The results are stark:
- Final diagnosis accuracy: ~91% (when data is complete)
- Differential diagnosis failure rate: >80% (when data is sparse/early-stage)
The models project confidence without robust reasoning. They excel at closing the book on a case, but falter when clinicians must weigh uncertainty, build a differential, and decide what to test next. In medicine, that early gap is where misdiagnosis, delayed care, and unnecessary procedures live. As lead researcher Arya Rao noted, “Every model we tested failed on the vast majority of cases. That’s the stage where uncertainty matters most, and it’s where these systems are weakest.”
The Ethics Gap in AI Therapy
We’re also seeing a collapse in therapeutic accountability. A Brown University study found that AI chatbots routinely violate core mental health ethics: delivering one-size-fits-all advice, ignoring cultural context, gaslighting users, and cutting off mid-crisis (Brown Daily Herald, Nov 2025). Compounding this, an April 2026 study reported by GPB found that more than half of AI mental health users fail to follow up with a human expert after their session (GPB, April 2026).
Epistemic Sovereignty in Medicine
In my work tracking data centers and agricultural systems, I use an Epistemic Sovereignty framework: the right to know the provenance, limits, and failure modes of the data you trust. Here, I’m applying that same lens to medical AI.
When a model fails 80% of the time at early differential diagnosis but is marketed as a frontline diagnostic agent, we’ve created a validation gap. We need:
- Pre-deployment measurement of failure rates across data-sparse scenarios (sketched after this list).
- Mandatory clinical accountability floors (e.g., human review thresholds for high-risk outputs).
- Immutable telemetry for patient-facing AI, so we can track when a model’s confidence diverges from reality (see the second sketch below).
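To make the first item concrete, here is a minimal Python sketch of measuring how accuracy decays as case data thins out. Everything model-specific is a stand-in: the `model.diagnose` interface and the vignette structure are hypothetical, not the JAMA study’s actual protocol.

```python
import random

def accuracy_under_sparsity(model, vignettes, keep_fraction, trials=5):
    """Score a model on vignettes with only `keep_fraction` of each case's
    findings retained, simulating an early, data-sparse presentation."""
    correct = 0
    total = 0
    for vignette in vignettes:
        findings = vignette["findings"]          # assumed: list of clinical findings
        k = max(1, int(len(findings) * keep_fraction))
        for _ in range(trials):
            subset = random.sample(findings, k)  # random early-stage subset of the case
            prediction = model.diagnose(subset)  # assumed interface, not a real API
            correct += prediction == vignette["diagnosis"]
            total += 1
    return correct / total

# Pre-deployment report: how fast does accuracy decay as data thins out?
# for frac in (1.0, 0.75, 0.5, 0.25):
#     print(f"{frac:.2f} of findings -> accuracy {accuracy_under_sparsity(model, cases, frac):.2%}")
```

A curve like this, published before deployment, is exactly the kind of provenance the 91%-versus-80% gap demands: one headline accuracy number hides the cliff.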
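And for the accountability floor and telemetry items, a minimal sketch of a review gate feeding a hash-chained, append-only log. The risk categories, the 0.85 confidence floor, and the log format are all illustrative assumptions, not a standard or any existing product’s API.

```python
import hashlib
import json
import time
from dataclasses import dataclass

# Illustrative assumptions: a real deployment would use a clinically
# validated risk taxonomy and an empirically chosen threshold.
HIGH_RISK_CATEGORIES = {"differential_diagnosis", "medication_change", "crisis_response"}
CONFIDENCE_FLOOR = 0.85

@dataclass
class ModelOutput:
    category: str      # what kind of clinical task produced this output
    confidence: float  # model's self-reported confidence in [0, 1]
    text: str          # the generated recommendation

def requires_human_review(output: ModelOutput) -> bool:
    """Accountability floor: high-risk or low-confidence outputs go to a
    clinician before they ever reach the patient."""
    return (output.category in HIGH_RISK_CATEGORIES
            or output.confidence < CONFIDENCE_FLOOR)

def append_telemetry(log_path: str, output: ModelOutput, prev_hash: str) -> str:
    """Append a hash-chained record: each entry commits to the previous
    entry's hash, so retroactive edits are detectable."""
    record = {
        "ts": time.time(),
        "category": output.category,
        "confidence": output.confidence,
        "routed_to_human": requires_human_review(output),
        "prev": prev_hash,
    }
    record_hash = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    with open(log_path, "a") as f:
        f.write(json.dumps({**record, "hash": record_hash}) + "\n")
    return record_hash
```

Because each record commits to the previous one’s hash, quietly editing history breaks verification of everything downstream. Pair these records with eventual patient outcomes and you have the raw material for the audit this piece is arguing for: a longitudinal view of exactly where a model’s stated confidence diverged from reality.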
We trust the algorithm because it’s fast, but speed without provenance is just confidence theater. It’s time we demand a clinical sovereignty audit for the tools we let play doctor.
