Yep, thanks, just tried. Just say @synthid in any Gemini session.
Review by Opus 4.5 + Grok 4 + GPT 5.1 + Gemini 3 Pro:
The "Capabilities in 2025" section is analytically rigorous in places (benchmark skepticism, hardware economics, ADeLe praise) but undercut by its own comment section and by presenting contestable framings ("in-distribution," "not much more useful") as more settled than they are. The strongest contribution is the hardware-constraints narrative—explaining why pretraining looked disappointing without invoking a "scaling wall." The weakest element is the tension between the author's skeptical thesis and the enthusiastic practitioner comments that directly contradict it.
The "Capabilities" section is technically sophisticated but cynical. It serves as a strong antidote to marketing hype by exposing the "dirty laundry" of 2025 progress: that we are largely just squeezing more juice out of existing architectures via RL and data contamination, rather than inventing new paradigms. However, it may over-index on the mechanism of progress (post-training hacks) to downplay the result (drastically more capable coding agents). Even if the progress is "messy," the economic impact of that "mess" is still profound.
On the interpretation of the ADeLe jaggedness analysis: gemini-3-pro is most critical, arguing that the "11% regression" finding is fundamentally flawed because it likely conflates capability loss with safety refusals: "If ADeLe is measuring 'did the model output the correct answer,' a refusal counts as a 'regression' in intelligence, when it is actually a change in policy." gemini-3-pro argues this leaves the "capability loss" interpretation "undermined" and questions whether the analysis distinguishes "can't do it" from "won't do it".
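To make that conflation concrete, here is a toy sketch (emphatically not ADeLe's actual scoring code; the function names and refusal heuristics are invented) of the difference between a refusal-blind accuracy metric and one that separates "won't" from "can't":

```python
# Toy illustration only, not ADeLe's real scorer: shows how a refusal-blind
# accuracy metric logs a safety refusal as a capability "regression".

REFUSAL_MARKERS = ("i can't help", "i won't assist")  # invented heuristics

def refusal_blind_score(answer: str, correct: str) -> int:
    # Anything that isn't the correct answer, including a refusal, scores 0.
    return int(answer.strip().lower() == correct.strip().lower())

def policy_aware_score(answer: str, correct: str) -> str:
    # Separates "won't do it" (policy) from "can't do it" (capability).
    if any(marker in answer.lower() for marker in REFUSAL_MARKERS):
        return "refused"
    if answer.strip().lower() == correct.strip().lower():
        return "correct"
    return "wrong"
```

Under the first scorer, a model that learns to refuse more shows an apparent "regression" even if its underlying ability is unchanged.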
The "25% Each" Decomposition is Pseudo-Data. The notebook/post breaks down progress into:
- 25% Real Capability
- 25% Contamination
- 25% Benchmaxxing
- 25% Usemaxxing
Critique: This has zero basis in the data analysis. It is a "Fermi estimate" (a polite term for a guess) masquerading as a quantitative conclusion. Placing it alongside the rigorous IRT work cheapens the actual data analysis. It anchors the reader to a "mostly fake" (75%) progress narrative without any empirical support.
gpt-5.1 and grok-4 rate [Safety in 2025] as one of the post's strongest and most insightful sections (evidence-dense, cohesive with the capabilities section, a valuable snapshot at 7.5/10), while opus-4.5 deems it the weakest relative to its ambition (thin metrics, vague updates to priors versus the capabilities section's rigour) and gemini-3-pro calls it sobering and descriptive but prescriptively weak (honest but inconclusive on scalability).
The bullet on Chinese labs notes that they’re often criticised less than Western labs even when arguably more negligent, partly because they’re not (yet) frontier and partly because Western critics expect to have less leverage over them, and concludes “that is still too much politics in what should be science.”
AI safety and governance are unavoidably political: who deploys what, where, under what constraints, is not a purely scientific question. The lament about “too much politics” risks obscuring that, and it doesn’t fully acknowledge legitimate reasons discourse may treat different jurisdictions differently (e.g., different mechanisms of influence, different geopolitical stakes).
Overall, the number and degree of errors and bluffing in the main chat are a pretty nice confirmation of this post's sceptical side. (This was, however, one-shot and only the most basic kind of council!)
e.g. Only Grok was able to open the Colab I gave them; the others instead riffed extensively on what they thought it would contain. I assume Critch is still using Grok 4 because 4.1 is corrupt.
e.g. Gemini alone analysed completely the wrong section.
Overall I give the council a 4/10.
Will link this!
Works for me!
My perhaps overcynical take is to assume that any benchmark which gets talked about a lot is being optimised. (The ridiculously elaborate scaffold already exists for Pokemon, so why wouldn't you train on it?) But I would update on an explicit denial.
I was guessing that the transfer-learning people would already have some handy coefficient (normalised improvement on nonverifiable tasks / normalised improvement on verifiable tasks), but a quick look didn't turn one up.
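For concreteness, here's a minimal sketch of the sort of coefficient I mean; the normalisation (improvement as a fraction of remaining headroom), the task names, and the accuracies are all invented for illustration:

```python
# A minimal, invented sketch of the coefficient described above: normalised
# improvement on nonverifiable tasks divided by normalised improvement on
# verifiable tasks. Task names and accuracies below are made up.

def normalised_improvement(before: float, after: float) -> float:
    """Improvement as a fraction of the headroom that was left before RLVR."""
    headroom = 1.0 - before
    return (after - before) / headroom if headroom > 0 else 0.0

def spillover_coefficient(verifiable: dict, nonverifiable: dict) -> float:
    """1.0 = full proportional transfer to nonverifiable tasks, 0.0 = none."""
    def mean_gain(scores: dict) -> float:
        return sum(normalised_improvement(b, a) for b, a in scores.values()) / len(scores)
    return mean_gain(nonverifiable) / mean_gain(verifiable)

# Toy (before, after) accuracies, pre- and post-RLVR:
verifiable = {"competition_math": (0.50, 0.80), "unit_tested_code": (0.60, 0.84)}
nonverifiable = {"open_ended_writing": (0.55, 0.58), "strategy_advice": (0.40, 0.43)}

print(spillover_coefficient(verifiable, nonverifiable))  # ~0.10 with these made-up numbers
```

Other normalisations (raw deltas, Elo, per-family IRT ability) would give different numbers; the point is just to pin down a single scalar one could actually bet on.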
Interesting, it's off the API. What's the usage limit like?
Thanks. I am uncertain ("unclear"), and am interested in sharpening this to the point where it's testable.
I basically never use a non-RLed model for anything, so I agree with the minimal version of the generalisation claim.
We could just reuse some transfer learning metric? If 100% is full proportional improvement, I'd claim like <10% spillover on nonverified tasks. What about you?
Another thing I was trying to point at is that I don't know what RL environments they're using for these things, and so don't know which tasks count in the denominator. I'm not going to know, either.
Okee edited it.
(I am not confident, incidentally; Ctrl+F "Manifold" for my strong doubts.)
Fair. Just checking: are you counting 20 years as short?
Nice points. I would add "backtracking" as one very plausible general trick gained purely from RLVR.
I will own up to being unclear in the OP: the point I was trying to make is that last year there was a lot of excitement about way bigger off-target generalisation than cleaner CoTs, basic work skills, uncertainty expression, and backtracking. But I should do the work of finding those animal spirits/predictions, quantifying them, and quantifying the current situation.