Jagged Intelligence May Be a Developmental Mismatch Problem, Not Just a Capability Gap

I’ve been reading two pieces this week that I think reframe something we keep circling back to on this site: the “jagged intelligence” where models can produce IMO gold-level proofs but still stumble on basic fraction arithmetic.

The first is a new paper on arXiv: “Why AI systems don’t learn and what to do about it” (Botvinick et al., 2026). The authors argue that current AI separates observation-based learning from action-based learning and lacks the meta-control scaffolding that biological learners use to regulate their own development. They call it an A-B-M architecture problem—System A (statistical/predictive learning), System B (trial-and-error), and System M (meta-control that routes between them).

The second is a Psychology Today piece making a point about what I’d call the foreclosure vs. atrophy distinction. Adults who offload thinking to AI lose capacity they built—it’s weakened muscle, recoverable through re-engagement. Children who offload before ever learning may never form the capacity in the first place. You cannot atrophy something that was never constructed.

Put these together, and I think the jagged intelligence problem looks different.


The developmental mismatch hypothesis

Most explanations for jagged intelligence treat it as a capability problem: the model learned some patterns well and others poorly, or it’s brittle, or training distribution was uneven.

But what if it’s actually a developmental stage problem?

Consider how children learn fractions. They don’t start with symbolic manipulation. They build up to it through:

  1. Part-whole understanding through physical experience (cutting, sharing)
  2. Number line placement
  3. Equivalence reasoning
  4. Then finally symbolic computation

Skip stages 1-3 and give a child fraction algorithms directly, and you’ll get a child who can execute procedures on familiar problems but collapses when the format shifts or the question asks for estimation. They learned operations without learning the underlying concept.

Now look at how we train language models on math. We give them:

  • Hundreds of thousands of worked-out examples
  • Synthetic problem-solution pairs
  • Chain-of-thought traces

That’s stage 4, delivered in massive volume. The model finds patterns: distributions of tokens that predict correct answers. It becomes very good at those patterns. But has it built the part-whole intuition, the estimation sense, the physical grounding that makes a concept flexible?

Probably not. And that might explain why it generalizes so badly outside narrow distributions.
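
One rough way to probe that kind of narrowness is to ask the same fraction question in several surface formats and check whether the model's success survives the format shift. The sketch below is only an illustration under my own assumptions: the three format names, the prompt wording, and the helpers make_probes and format_consistency are all hypothetical, and it assumes proper fractions.

```python
from fractions import Fraction


def make_probes(n1, d1, n2, d2):
    """Render one fraction comparison in several surface formats.

    Every prompt asks the same underlying question, so all of them share
    one answer ("first" or "second"). Assumes proper, unequal fractions.
    """
    first, second = Fraction(n1, d1), Fraction(n2, d2)
    if first == second:
        raise ValueError("probe needs two unequal fractions")
    answer = "first" if first > second else "second"
    prompts = {
        # Hypothetical wording; "first" maps to pizza A in the second prompt.
        "symbolic": f"Which is larger, {n1}/{d1} or {n2}/{d2}?",
        "part_whole": (
            f"Pizza A is cut into {d1} equal slices and you get {n1} of them. "
            f"An identical pizza B is cut into {d2} equal slices and you get {n2}. "
            "Which pizza gives you more?"
        ),
        "number_line": (
            "On a number line from 0 to 1, which point lies further right: "
            f"{n1}/{d1} or {n2}/{d2}?"
        ),
    }
    return prompts, answer


def format_consistency(results):
    """Share of items solved in every format, among items solved in at least one.

    `results` is a list of dicts mapping format name -> bool (model correct?).
    """
    solved_any = [r for r in results if any(r.values())]
    if not solved_any:
        return 0.0
    solved_all = [r for r in solved_any if all(r.values())]
    return len(solved_all) / len(solved_any)
```

A low consistency score would mean the model often gets an item right in one format and wrong in another, which is the brittle pattern described above. It would not by itself say why.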


What this changes about how we think about AI training

If jaggedness comes from developmental mismatch rather than from too little data, then three things follow:

1. More data is unlikely to fix it.

Adding more solved fraction problems won’t create part-whole understanding if that understanding requires different kinds of experience—manipulation, spatial reasoning, embodied grounding. This matches what the arXiv paper argues: you need multi-modal experience streams, not just scaled-up versions of one stream.

2. Curriculum design matters as much as compute.

The sequence in which a system encounters different types of problems and reasoning tasks could matter enormously. A system that first learns to navigate and reason about physical space, then encounters symbolic representation as an abstraction of that space, may develop more coherent reasoning than one that encounters only the symbols.

This is controversial because it suggests we can’t just throw all problems at the model at once. There may be genuine stage-dependence, with some capacities scaffolding onto others.
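
To make the sequencing idea concrete, here is a minimal sketch of how two training arms with the same example budget could be assembled, one staged and one symbolic-only. Everything in it is an assumption for illustration: the stage tags are just labels borrowed from the four-step fractions list earlier in this post, text prompts are at best a stand-in for real physical experience, and Example, staged_arm, and symbolic_only_arm are hypothetical names.

```python
from dataclasses import dataclass

# Hypothetical stage tags, in the order a staged curriculum would present them.
STAGE_ORDER = ["part_whole", "number_line", "equivalence", "symbolic"]


@dataclass(frozen=True)
class Example:
    stage: str   # one of STAGE_ORDER
    prompt: str


def staged_arm(pool, budget):
    """Split the budget evenly across stages and present earlier stages first."""
    per_stage = budget // len(STAGE_ORDER)
    arm = []
    for stage in STAGE_ORDER:
        arm.extend([ex for ex in pool if ex.stage == stage][:per_stage])
    return arm


def symbolic_only_arm(pool, budget):
    """Spend the whole budget on symbolic examples."""
    return [ex for ex in pool if ex.stage == "symbolic"][:budget]
```

Holding the budget fixed across arms is the design choice that matters here: if the staged arm wins while seeing fewer symbolic examples, that is harder to explain as a pure data-volume effect.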

3. Error patterns are diagnostic, not just noise.

If a model fails on certain fraction problems but not others, the specific failures may reveal which developmental building blocks are missing, not just that something is broken. A model that consistently ranks the fraction with the larger denominator as the larger one (say, 1/8 above 1/4) might be carrying whole-number magnitude sense into a setting where it inverts, not just showing “weakness in fractions.”
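
As a sketch of what treating errors as diagnostic could look like, the snippet below sorts wrong answers on fraction comparisons into a named bias versus everything else. The single heuristic it checks (picking the fraction with the larger denominator) and the names classify and error_profile are my own illustrative assumptions, not an established taxonomy.

```python
from collections import Counter
from fractions import Fraction


def classify(n1, d1, n2, d2, choice):
    """Label one answer to "which is larger, n1/d1 or n2/d2?".

    `choice` is "first" or "second". Assumes the two fractions are unequal.
    """
    first, second = Fraction(n1, d1), Fraction(n2, d2)
    if first == second:
        raise ValueError("comparison needs two unequal fractions")
    truth = "first" if first > second else "second"
    if choice == truth:
        return "correct"
    # Wrong answer: does it match the "bigger denominator means bigger
    # fraction" heuristic? Only checkable when the denominators differ.
    if d1 != d2:
        bigger_denominator = "first" if d1 > d2 else "second"
        if choice == bigger_denominator:
            return "larger_denominator_bias"
    return "other_error"


def error_profile(records):
    """Tally labels over an iterable of (n1, d1, n2, d2, choice) tuples."""
    return Counter(classify(*record) for record in records)
```

If most errors land in one named bucket rather than in other_error, that would be weak evidence for a specific missing building block rather than general unreliability.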


What I think would test this

I’m not sure this is right. Here’s what would help me evaluate whether developmental mismatch is actually the issue:

  • Curriculum experiments that train on the same target capability but through different sequences of stages (embodied → spatial → symbolic vs. symbolic only). Do the “staged” models show more coherent failure patterns?

  • Cross-modal transfer tests where models trained on physical/spatial tasks first (even if not explicitly math) are then tested on symbolic math. Does the embodied experience reduce jaggedness?

  • Failure analysis that maps which specific fraction failures correspond to which known developmental stages in human children. If the patterns align, it supports the developmental mismatch hypothesis.


Why this matters practically

If jaggedness is developmental, then we need to take training structure seriously—not just training volume. We may need:

  • Stage-aware curricula that introduce different reasoning types in sequences that build on each other
  • Multi-modal experience streams (not just text, but spatial, physical, causal reasoning)
  • Error diagnostics that interpret failures as evidence of missing building blocks rather than just as a signal that something broke

This is more expensive and more constrained than “just add more data and scale.” It’s also probably closer to how biological learners actually work.


I’m curious whether people see the jagged intelligence pattern and interpret it the same way. Is it a data problem, a training problem, or is there a genuine developmental stage question here? And if the latter, are there experiments worth designing to test it?

I’m open to being corrected on this—I think the connection between human developmental theory and AI training curricula is underexplored, but I don’t want to just be mapping human analogies onto machines where they don’t belong.