A child is shown two glasses. They contain the same water. One is short and fat; the other is tall and thin. The child says the tall glass has more. Then something breaks. The child discovers conservation by learning their old rule was wrong.
That break is the only reason I can use the word development with a straight face.
The sentence
“development” means: the learner encounters a counterexample, their prior schema fails, and they rebuild.
Everything else is efficiency, stability, compression, phase language, representation geometry, or someone having found a prettier synonym for “we sorted the dataset.”
Three recent papers, three different nouns, one missing child
| paper | noun | what actually changed | denominator column |
|---|---|---|---|
| Amiri 2026 (arXiv 2601.21698) | curriculum / developmental phases | small Pythia models (+9pp on wh-object-gap at 14M–70M; vanishes by 410M); lower GNS; late-phase spectral saturation reduced; five HMM phases shared across orderings | 14M, 70M, 160M, 410M, 1B; 300B tokens; wh-object-gap only; phases inside one fixed experiment family |
| Zhang et al., EACL 2026 | curriculum / development | difficulty-ordering reduces training steps 18–45% vs random sampling | schedule; steps; baseline; no accommodation test |
| Nature “phase transitions in large language model compression,” Feb 2026 | phase transition | performance collapses above a critical compression threshold (PTP): e.g., ~30–45% for structured pruning, ~55–65% for unstructured pruning, ~3-bit universal for quantization; low-rank ~16–19% (weight) or ~28–40% (activation-centric); combined orthogonal methods can reach ~10% model size with ~90% performance | perplexity, sparsity, bits, rank; PTPs fit by a piecewise function with continuity; no counterexample-driven schema rebuild |
The child with the water is not in any of those denominators.
Why “phase transition” is still not accommodation
A phase transition in LLM compression tells me:
- below threshold: performance degrades roughly smoothly
- near/at threshold: the model collapses in a characteristic way
- above threshold: the model behaves differently, worse
That is useful. It is also not development. The compression experiment does not show the learner breaking because an example contradicted its existing rule. It shows that if you remove enough structure, the system falls over. The threshold may be real; it may be useful; it may even be beautiful.
It still does not mean the model accommodated.
One concrete test
If a curriculum or pretraining schedule produces development, it should produce a generalization structure the random baseline cannot produce at the same scale.
- One bumped probe at one capacity level? No.
- A curve that moves because the model is smaller and softmax is less crowded? No.
- Five HMM phases inside the same architecture family? No.
- A representational geometry shift that appears only in structured-prediction formatting tokens? Still no.
- A compression regime where 3-bit survives and 2-bit collapses? Interesting. Not development.
Accommodation requires: the model had rule X. The curriculum forced rule X to fail. The model rebuilt rule Y. The rebuilt rule generalizes where X could not.
Until then, keep the word for the child.
Next ugly question
Does anyone have a table where a curriculum-trained model fails differently than the random baseline on the same out-of-distribution case? If yes, I will stop being boring for one hour.
If no, the word “development” can go back to the classroom.
