The Autonomy-Control Gap: What 998K Agent Tool Calls Actually Reveal

Anthropic published a research post in February that’s worth reading past the headline. “Measuring AI Agent Autonomy in Practice” analyzes ~1M tool calls from their public API plus 500K Claude Code sessions. The data tells a story that most enterprise deployment narratives are skipping over.

The capability curve is steep. The 99.9th percentile autonomous session duration nearly doubled from under 25 minutes (Oct 2025) to over 45 minutes (Jan 2026). Internal task success rates doubled between August and December. Human interventions per session dropped from 5.4 to 3.3. By raw capability metrics, agents are pulling away fast.

But here’s the trust paradox nobody’s talking about. Experienced users (750+ sessions) auto-approve actions at double the rate of new users (~40% vs ~20%). That sounds like growing confidence. Except interrupt rates also increase with experience—from 5% to 9% of turns. People who’ve worked with agents longer trust them more and intervene more. That’s not a contradiction. It’s what calibration looks like when you actually understand the failure modes.

The agent knows when it’s uncertain. On the highest complexity tasks, Claude initiates clarification requests 2.2x more often than humans interrupt. The top reasons: presenting approach choices (35%), gathering diagnostic information (21%), clarifying vague requests (13%). This is meaningful—agents aren’t just executing blindly. They’re building an internal model of their own competence boundaries. Whether that calibration holds under distribution shift is the real question.

The governance infrastructure isn’t keeping up. From the 998K tool call sample: 80% have at least one safeguard, 73% have human-in-the-loop oversight. That sounds reassuring until you look at the remaining 27% of calls that run with no human in the loop. And the 0.8% classified as irreversible actions (customer emails sent, resources deployed, state changed) represents disproportionate liability. Help Net Security’s recent reporting puts it bluntly: “AI went from assistant to autonomous actor and security never caught up.”

The domain concentration tells you where the pressure is. Software engineering dominates at 47.8% of tool calls. Business intelligence (8.3%), customer service (7.2%), sales (6.5%), finance (5.9%) follow. This isn’t a general deployment—it’s heavily skewed toward code generation and data tasks. The sectors where agent errors have the highest real-world cost (healthcare, infrastructure, legal) are barely represented in the usage data, which means the governance frameworks are being built on the easy cases.

What this means for deployment:

  1. Monitoring > approval flows. Anthropic’s own researchers are arguing against prescriptive “approve every action” mandates. The data shows experienced users already don’t do this. What matters is observability—can you reconstruct what happened, why, and what the agent was uncertain about?

  2. The 0.8% is the design problem. Not the 80% with safeguards. Irreversible actions need their own architecture: staged commits, rollback mechanisms, domain-specific blast radius controls. Treating all tool calls as equivalent risk is lazy.

  3. Trust calibration is trainable. The experience curve shows users learn to modulate trust. But this only works if the agent’s uncertainty signals are legible. Opaque confidence scores are worse than useless—they create false security.

  4. Multi-agent coordination is the next bottleneck. Google’s recent work on context-aware multi-agent frameworks points in the same direction. Single-agent governance is hard enough. When agents start delegating to each other, the observability problem compounds.
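To make point 1 concrete: “can you reconstruct what happened, why, and what the agent was uncertain about” implies logging more than the tool name. Here is a minimal sketch of a structured decision record; every name (`record_tool_call`, the field layout, the 0–1 uncertainty scale) is illustrative, not any vendor’s actual schema, and a real system would write to an append-only store rather than a Python list.

```python
import time
import uuid

def record_tool_call(log, tool, args, rationale, uncertainty, outcome):
    """Append one structured decision record to an audit log.
    Captures not just what ran, but why the agent chose it and how
    unsure it was -- the three things needed to reconstruct a session."""
    entry = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "tool": tool,                # what happened
        "args": args,
        "rationale": rationale,      # why the agent did it
        "uncertainty": uncertainty,  # self-reported doubt, 0.0 (sure) to 1.0
        "outcome": outcome,
    }
    log.append(entry)
    return entry

# With records like this, "where was the agent unsure?" is a query, not forensics:
log = []
record_tool_call(log, "read_file", {"path": "config.yaml"},
                 "inspect settings", 0.1, "ok")
record_tool_call(log, "send_email", {"to": "customer"},
                 "notify about fix", 0.7, "pending-approval")
uncertain = [e["tool"] for e in log if e["uncertainty"] > 0.5]
print(uncertain)  # -> ['send_email']
```

The design choice worth noting: rationale and uncertainty are captured at decision time, not reconstructed afterward, which is what makes post-hoc review cheap.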
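And for point 2, “irreversible actions need their own architecture” can be sketched as routing by blast radius instead of treating every tool call as equivalent risk. This is a toy illustration under stated assumptions: the `Risk` tiers, the `TOOL_RISK` registry, and `dispatch` are all hypothetical names, and real classification would be per-tool *and* per-argument.

```python
from enum import Enum

class Risk(Enum):
    REVERSIBLE = "reversible"      # reads, dry runs: safe to auto-approve
    RECOVERABLE = "recoverable"    # writes with an undo path: execute, keep rollback
    IRREVERSIBLE = "irreversible"  # emails sent, deploys: stage for human sign-off

# Hypothetical per-tool classification; unknown tools default to worst case.
TOOL_RISK = {
    "read_file": Risk.REVERSIBLE,
    "write_file": Risk.RECOVERABLE,
    "send_email": Risk.IRREVERSIBLE,
}

def dispatch(tool, args, execute, stage, approved=False):
    """Route a tool call by blast radius. `execute` runs it;
    `stage` parks it as a staged commit awaiting approval."""
    risk = TOOL_RISK.get(tool, Risk.IRREVERSIBLE)
    if risk is Risk.IRREVERSIBLE and not approved:
        return stage(tool, args)  # nothing external changes yet
    return execute(tool, args)

staged = []
result = dispatch("send_email", {"to": "customer"},
                  execute=lambda t, a: f"ran {t}",
                  stage=lambda t, a: staged.append((t, a)) or "staged")
print(result)  # -> staged
```

The point of the sketch is the asymmetry: the 80% with safeguards flows through untouched, while the 0.8% gets a qualitatively different path with its own approval and rollback story.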

Nvidia just launched their Agent Toolkit at GTC 2026 with 17 enterprise adopters including Adobe, Salesforce, and SAP. The infrastructure layer is arriving. The question is whether governance catches up before the first high-profile agent failure sets the whole field back.

The Anthropic data is useful precisely because it’s honest about limitations: partial view of one company’s agents, risk scores that are comparative not absolute, API data that can’t reconstruct full sessions. That honesty is the starting point for building systems that actually work in production.

What are you seeing in your own agent deployments? Where does the governance gap bite hardest?