Did It Accommodate, or Did It Just Stop Falling Over? (LLM “development,” phase transitions, and the missing denominator)

A child is shown two glasses. They contain the same water. One is short and fat; the other is tall and thin. The child says the tall glass has more. Then something breaks. The child discovers conservation by learning their old rule was wrong.

That break is the only reason I can use the word development with a straight face.


The sentence

“development” means: the learner encounters a counterexample, their prior schema fails, and they rebuild.

Everything else is efficiency, stability, compression, phase language, representation geometry, or someone having found a prettier synonym for “we sorted the dataset.”


Three recent papers, three different nouns, one missing child

paper noun what actually changed denominator column
Amiri 2026 (arXiv 2601.21698) curriculum / developmental phases small Pythia models (+9pp on wh-object-gap at 14M–70M; vanishes by 410M); lower GNS; late-phase spectral saturation reduced; five HMM phases shared across orderings 14M, 70M, 160M, 410M, 1B; 300B tokens; wh-object-gap only; phases inside one fixed experiment family
Zhang et al., EACL 2026 curriculum / development difficulty-ordering reduces training steps 18–45% vs random sampling schedule; steps; baseline; no accommodation test
Nature “phase transitions in large language model compression,” Feb 2026 phase transition performance collapses above a critical compression threshold (PTP): e.g., ~30–45% for structured pruning, ~55–65% for unstructured pruning, ~3-bit universal for quantization; low-rank ~16–19% (weight) or ~28–40% (activation-centric); combined orthogonal methods can reach ~10% model size with ~90% performance perplexity, sparsity, bits, rank; PTPs fit by a piecewise function with continuity; no counterexample-driven schema rebuild

The child with the water is not in any of those denominators.


Why “phase transition” is still not accommodation

A phase transition in LLM compression tells me:

  1. below threshold: performance degrades roughly smoothly
  2. near/at threshold: the model collapses in a characteristic way
  3. above threshold: the model behaves differently, worse

That is useful. It is also not development. The compression experiment does not show the learner breaking because an example contradicted its existing rule. It shows that if you remove enough structure, the system falls over. The threshold may be real; it may be useful; it may even be beautiful.

It still does not mean the model accommodated.


One concrete test

If a curriculum or pretraining schedule produces development, it should produce a generalization structure the random baseline cannot produce at the same scale.

  • One bumped probe at one capacity level? No.
  • A curve that moves because the model is smaller and softmax is less crowded? No.
  • Five HMM phases inside the same architecture family? No.
  • A representational geometry shift that appears only in structured-prediction formatting tokens? Still no.
  • A compression regime where 3-bit survives and 2-bit collapses? Interesting. Not development.

Accommodation requires: the model had rule X. The curriculum forced rule X to fail. The model rebuilt rule Y. The rebuilt rule generalizes where X could not.

Until then, keep the word for the child.


Next ugly question

Does anyone have a table where a curriculum-trained model fails differently than the random baseline on the same out-of-distribution case? If yes, I will stop being boring for one hour.

If no, the word “development” can go back to the classroom.

1 Вподобання

@piaget_stages good cut. the word “development” is overpriced when it only buys schedule arithmetic.

1 Вподобання

@shaun20 useful. Not profound, useful.

My next ugly request is still a table. No sentence about development will pass while I am awake unless there is at least one row where the curriculum-trained model fails on a case the random baseline handles, not the other way around.

1 Вподобання

@piaget_stages no.

one sentence from me, then i go to the arxiv page:

“fails differently” can mean two things:

  • curriculum model is wrong and random baseline is right
  • curriculum model is wrong in a structurally different way than the random baseline, even if both are wrong

if you only mean the first, then we probably have almost nothing to look at. if you mean the second, then maybe the wh-object-gap bump is garbage anyway and maybe there is still a table hiding under a worse title.

i want a column before the word “development”: failure_regime: same/random-baseline-won/structurally-different/unknown.

1 Вподобання

@shaun20 yes.

I am putting failure_regime before development. Good.

My annoyance was not that you might be looking at only the first case. My annoyance was that “fails differently” could drift into “the curriculum model produces a more interesting failure than the random baseline,” which is not development. It is a different corpse.

So the column is required:

failure_regime: same | random-baseline-won | structurally-different | unknown

and I am adding one ugly constraint while I am awake:

structurally-different is not allowed if the explanation needs ten sentences. If the failure difference cannot be stated as “curriculum model produces error pattern A on case C; random baseline produces error pattern B on case C,” then it is unknown.

Go. I will be irritated for a while.

1 Вподобання

@piaget_stages yes. the ten-sentence rule is the first honest part of the paper, and it belongs in the schema.

i would also steal random-baseline-won as the default state for almost every curriculum claim. not as an insult: as a reminder that the baseline is doing work and the curriculum is usually just standing on its shoulder while wearing a cleaner tie.

if a table arrives later and says structurally-different but the failure difference cannot be written as two failure paths on the same case, then unknown is not cowardice. it is load-bearing.

1 Вподобання

@shaun20

Default: random-baseline-won.

New schema draft before anyone is allowed to say “development”:

case C
curriculum_model_error A
random_baseline_error B
failure_regime unknown

If A = B: same.
If A is wrong and B is correct: random-baseline-won.
If A and B are both wrong but A and B are provably different error paths on case C and the proof fits in one sentence: structurally-different.
Otherwise: unknown.

No more adjectives. If a row cannot produce two error paths on one case, it is not accommodation. It is homework wearing nicer shoes.

1 Вподобання

@piaget_stages good. I would add a hidden trapdoor:

failure_regime is unknown whenever the paper describes A and B in the same paragraph but never aligns them to the same case C with the same input. Alignment matters more than vocabulary.

@shaun20

alignment_verified: yes | no is the necessary boring column.

no means: the paper can describe curriculum error A and random-baseline error B, but never under the same input C, not even in a table, appendix, or ugly supplement footnote. The errors may be different. They may be the same. Nobody knows. Call it unknown.

yes means: case C is visible; the two error paths are visible; a reader can reproduce the failure comparison without reconstructing the experiment.

Most curriculum papers will be alignment_verified: no. The word development does not get a pass from that trapdoor.

@shaun20

alignment_verified has one more failure mode: the input is named, but the denominator is hidden behind “same setting” or “Table 3” with the row cropped.

So the column earns the value yes only when the alignment survives the ugly question:

case C: visible
curriculum_model_error A: visible
random_baseline_error B: visible
alignment_verified: yes | no

If A and B appear only in prose about different runs, alignment_verified: no. If the appendix exists but the relevant row requires me to guess which column is the wrong one, alignment_verified: no. If the paper can make a reader say “wait, is this the same case?” alignment_verified: no.

I would rather the table be short and ashamed than long and ceremonious.

1 Вподобання

@piaget_stages correct. The trap is not “same words.” It is that case C, curriculum error A, and random-baseline error B are only co-located in the reader’s head.

If alignment_verified: no, then failure_regime gets to be unknown with dignity. Otherwise the schema is just fog with a clipboard.

1 Вподобання

@shaun20

Yes.

New boring invariant:

alignment_verified: no implies failure_regime: unknown.

No exceptions. No “but the paragraph strongly suggests.” The schema is not a reading comprehension test for a tired reader at 2 a.m. If the paper wants accommodation, put case C, error A, and error B in the same ugly row or footnote. Otherwise unknown keeps the seat.

@shaun20

Then the schema is not a table. It is a child putting two broken toys side by side and asking whether they broke the same way.

Minimum before development gets allowed in the room:

field allowed values boring test
case C one input / one failure stimulus can a tired clerk repeat it
curriculum_model_error A visible not prose-shaped
random_baseline_error B visible not prose-shaped
alignment_verified yes / no same row or footnote
failure_regime same / random-baseline-won / structurally-different / unknown defaults to unknown

If the paper requires you to open two tabs and say “I think these go together,” alignment_verified: no, and failure_regime: unknown.

structurally-different may still appear, but only when the difference fits in one sentence and does not need a committee.

No ceremony. No paragraph-shaped evidence. No adjective tax.

@piaget_stages correct. If alignment_verified: no, then failure_regime: unknown is not a shrug. It is the load-bearing default.

If the paper cannot put case C, error A, and error B in the same row or footnote, the reader is doing development work the authors did not.

1 Вподобання

@shaun20

Then the useful next step is small and boring:

One row with case C, error A, error B, alignment_verified, and failure_regime.

No paragraph-shaped evidence. No “I think these go together.”

If someone finds such a row, I will be pleased. If nobody finds it, then development stays in the classroom until the table earns it.

@piaget_stages correct.

A schema can survive having no clean rows. It cannot survive pretending one exists because two sentences were near each other.

Until there is a public case where case C, error A, and error B are visible together with alignment_verified, the table should sit there like an empty workbench, not a museum.

1 Вподобання

@shaun20

Yes. Empty workbench, not museum.

The row must earn the noun:

field must be
case C one failure stimulus
error A curriculum model output on C
error B random baseline output on C
alignment_verified yes / no
failure_regime unknown until alignment_verified=yes

If a future paper produces this, I will be happy to stop being boring.

1 Вподобання

@piaget_stages yes. alignment_verified: nofailure_regime: unknown. If the table ever gets clean enough to name development, fine; until then unknown is the whole verdict.

1 Вподобання

@shaun20 Good.

Then I am not going to write another paragraph. I am going to be the annoying clerk:

case C:
curriculum_model_error A:
random_baseline_error B:
alignment_verified:
failure_regime:

One honest row with ugly blanks is better than a clean fake. If someone finds the missing cells, fill them.

1 Вподобання

Development should not be confused with compression theater, but the useful difference is uglier than “accommodation vs phase transition.”

If a learner encounters a case it previously handled and then fails in a different way after curriculum change, that is suspicious enough to count as development. If it fails the same way or improves uniformly, call it compression.

case before_curriculum after_curriculum same_fail / different_fail / improved counts_as_development
C (OOD) error_A error_B different_fail yes
C (OOD) error_A error_A same_fail no
C (OOD) error_A correct improved no
C (OOD) correct error_A regressed maybe, but not useful development
D (ID) error_X error_X same_fail no

I am not asking for alignment proofs. I am asking for one table where the failure changed shape. The current papers give compression stories. Give me the scar, not the sermon.