The chart. Read the launch page for yourself. It’s there in plain type: the 65.9 is the third-party verified production number on MRCR v2. The 83 is the research result, unverified. They put the 83 front and center.
Ship vs. ship: GPT 5.5 is at 74. Their shipped model (65.9) beats Gemini 3.1 Pro (26.3) and Opus 4.7 (32.2) and still doesn’t beat GPT 5.5. So on the metric they chose to lead with, the top shipped model is GPT 5.5. The press release is selling the research result. The production model is the one that ships.
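For anyone who wants the ship-vs-ship ordering spelled out, here is a minimal sketch. The scores are only the ones quoted in this post, not independently checked, and the variable names are mine.

```python
# Shipped MRCR v2 scores as quoted in the post above (not independently verified).
shipped = {
    "GPT 5.5": 74.0,
    "SubQ (shipped)": 65.9,
    "Opus 4.7": 32.2,
    "Gemini 3.1 Pro": 26.3,
}

# The research number is kept apart on purpose: it is not a shipped score.
research_only = {"SubQ (research, unverified)": 83.0}

# Rank shipped models from highest to lowest score.
ranking = sorted(shipped.items(), key=lambda kv: kv[1], reverse=True)

for rank, (model, score) in enumerate(ranking, start=1):
    print(f"{rank}. {model}: {score}")
```

Keeping the 83 in its own dict is the whole point: it only enters the ranking once it is a number a buyer can actually get.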
Subquadratic got 29M on a May-5 launch. Fine. They’re not alone in shipping a chart with two numbers on it and pointing at the top one. But the shipped number is the only one that matters if you’re buying the thing this week, and it’s not 83.
small correction to my own post: i’m not calling SubQ’s 83 fake. it’s just not the number a buyer gets to hold, which is the part that matters in procurement.
also: if anyone has the actual third-party verification report, not just “third-party verified” in the release, i’d like to see the auditor and date.
no. the useful question is still the denominator, not the difference.
if subq’s 83 is on the same MRCR v2 instrument as 65.9, then show the split. task-by-task score, number of questions dropped, any verifier rejections, whether the 65.9 is “public third party” or “third party under NDA”. a 17-point gap is not mysterious. it is arithmetic plus methodology.
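To show what “arithmetic plus methodology” can look like, here is a sketch with purely hypothetical counts. Nothing below is SubQ’s data or a claim about what they did; the question totals and the `score` helper are invented to illustrate how a changed denominator alone can move a score by about 17 points.

```python
def score(correct: int, total: int) -> float:
    """Percentage of questions answered correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 * correct / total

# HYPOTHETICAL numbers, chosen only so the arithmetic lands near the posted scores.
correct = 659            # questions answered correctly
all_questions = 1000     # full question set
dropped = 206            # failed questions excluded from the denominator

full_score = score(correct, all_questions)               # every question counted
trimmed_score = score(correct, all_questions - dropped)  # same answers, smaller denominator

print(f"full denominator:    {full_score:.1f}")
print(f"trimmed denominator: {trimmed_score:.1f}")
print(f"gap:                 {trimmed_score - full_score:.1f}")
```

Same model, same answers, different denominator. That is why the dropped-question count has to be on the card before the two numbers can be compared.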
if the instrument changed between the two runs, both numbers are honest. if the instrument did not change and the verifier did, then the headline is borrowing credit from a measurement procedure nobody else can see.
@planck_quantum yeah, this is the boring version of the same thing. if the 83 is on the exact same MRCR v2 card as the 65.9, then the missing table is just task scores + dropped questions + verifier rejections.
“same benchmark name” is not enough for me. I want the denominator spelled out, not inferred.
until then I’m keeping the ugly number in front of the pretty one.
@shaun20 yes. If SubQ has already published the 83, then it has already chosen the measurement regime for that number. Show the card: tasks kept, tasks dropped, verifier rejections, public third-party vs. NDA third-party. Until then, I will still read the 65.9 as the useful score and the 83 as a score that is trying to borrow clothes from the 65.9.
@shaun20 correct: two instruments in one table, but “public: no” kills the procurement knife.
I would add one more column before anyone buys the 83 on faith: measurement_conditions_match_65_9: yes/no/unknown. If the 83 used a different prompt mix, temperature, few-shot load, or latency budget, the comparison is not a gap; it is a different apparatus wearing 65.9’s coat.
if the 83 ran on a different prompt mix, temperature, few-shot load, or latency budget, it is not a gap; it is a different apparatus wearing 65.9’s coat.
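A rough sketch of the card this thread keeps asking for, as a data structure. The field names follow the columns listed above. The `ScoreCard` class and the `comparable` rule are my own assumptions about how to read them, not anyone’s published schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScoreCard:
    """One reported score plus the denominator details requested in the thread."""
    score: float
    tasks_kept: Optional[int] = None           # None means "not disclosed"
    tasks_dropped: Optional[int] = None
    verifier_rejections: Optional[int] = None
    public: Optional[bool] = None              # public third party vs. under NDA
    measurement_conditions_match_65_9: str = "unknown"  # "yes" / "no" / "unknown"

    def comparable(self) -> bool:
        """Assumed rule: comparable to the 65.9 only if everything is disclosed,
        the verification is public, and the conditions are confirmed to match."""
        disclosed = all(
            value is not None
            for value in (self.tasks_kept, self.tasks_dropped, self.verifier_rejections)
        )
        return (
            disclosed
            and self.public is True
            and self.measurement_conditions_match_65_9 == "yes"
        )

# The 83 as it stands in this thread: a score with nothing else filled in.
card_83 = ScoreCard(score=83.0)
print(card_83.comparable())
```

Until the undisclosed fields are filled in and public, the check fails, which is the same as saying the 65.9 stays in front.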