Why is Gopher better than Pythia or Cerebras? Mostly no comment, but I don't think Pythia and Cerebras were making any super simple, obvious mistake; they were just behind 2021-era DeepMind.
I believe open data pretty strongly contradicts your claims
System efficiency: ~2x, not ~20x
§1.2 estimates "up to 20x" from system optimizations (quantization, parallelization, FlashAttention, etc.). But Model FLOPs Utilization (MFU), the fraction of peak hardware FLOPs that large training runs actually achieve, has barely changed. Here are the most legible published training-run MFU figures:
That's less than 2x variation over 6 years. Most optimization work at scale is treading water against communication overhead as clusters grow larger, not increasing training compute per chip. Also, individual optimizations like FlashAttention's ~3x attention speedup only touch one slice of the compute, so they combine closer to additively than multiplicatively, and many other optimizations only apply to inference, not training.
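For concreteness, here's roughly how MFU is computed for a dense transformer run, using the standard ~6·N·D approximation for training FLOPs. The model size, token count, chip count, peak throughput, and wall-clock time below are made-up placeholders for illustration, not figures from any of the runs above.

```python
def approx_mfu(n_params: float, n_tokens: float, n_chips: int,
               peak_flops_per_chip: float, wall_clock_seconds: float) -> float:
    """Rough MFU estimate for a dense transformer training run.

    Training FLOPs are approximated as 6 * params * tokens (forward + backward),
    ignoring attention-score FLOPs, so this slightly understates true utilization.
    """
    achieved_flops_per_s = 6 * n_params * n_tokens / wall_clock_seconds
    peak_flops_per_s = n_chips * peak_flops_per_chip
    return achieved_flops_per_s / peak_flops_per_s

# Illustrative placeholder numbers only (not a real run):
# 70B params, 1.4T tokens, 4096 chips at 300 TFLOP/s peak, 30 days wall clock.
print(approx_mfu(70e9, 1.4e12, 4096, 300e12, 30 * 24 * 3600))  # ≈ 0.18
```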
The Gundlach paper tests 6 innovations. The field has produced 40+.
§1.1 relies on Gundlach et al. (2025b), which ablated six things: SwiGLU, RoPE, cosine decay, pre-RMSNorm, pre-LayerNorm, and AdamW. The most algorithmically advanced models with published methodology are DeepSeek-V3, DeepSeek-R1, and Kimi K2; here's a list of algorithms they used that Gundlach didn't test:
Architecture:
Training methodology:
Post-training:
Of these, Muon alone is worth roughly a 2x compute multiplier, which is on par with the gain from AdamW vs. SGD.
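For readers who haven't seen it, here's a minimal sketch of the Muon-style update: keep a heavy-ball momentum buffer for each 2D weight matrix and orthogonalize it with a few Newton-Schulz iterations before applying it. The coefficients and the omitted shape-dependent rescaling are from my recollection of the public reference implementation, so treat the details as approximate rather than authoritative.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration that pushes G toward the nearest
    # (semi-)orthogonal matrix; coefficients are approximate.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)             # Frobenius normalization bounds the spectral norm
    transposed = X.shape[0] > X.shape[1]
    if transposed:                        # iterate in the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(param: torch.Tensor, grad: torch.Tensor,
              momentum_buf: torch.Tensor, lr: float = 0.02, beta: float = 0.95) -> None:
    # Heavy-ball momentum, then swap the raw direction for its orthogonalized
    # version. Real implementations also rescale the update by a shape-dependent
    # factor and only apply Muon to 2D hidden-layer weights (AdamW elsewhere).
    momentum_buf.mul_(beta).add_(grad)
    param.add_(newton_schulz_orthogonalize(momentum_buf), alpha=-lr)
```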
HellaSwag: an example 10x+ compute multiplier, without distillation or synthetic data
Three model families, each trained with a fixed recipe across multiple sizes: Gopher (2021), Cerebras-GPT (2023, Chinchilla-optimal), and Pythia (2023). None of them used distillation, targeted data similar to the benchmarks, or data contractors. Gopher reaches the same HellaSwag score as Cerebras-GPT or Pythia with roughly 10x less compute, despite using Kaplan-style parameter scaling. There was pretty much no legible algorithm in the Gopher paper that obviously caused this difference.
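To make "roughly 10x less compute" concrete: the comparison is approximate training FLOPs (~6·N·D) at a matched HellaSwag score, read off each family's scaling curve. The parameter and token counts below are hypothetical placeholders, not the actual crossover points.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    # Standard dense-transformer approximation: ~6 FLOPs per parameter per token.
    return 6 * n_params * n_tokens

def compute_multiplier(better_run: tuple, worse_run: tuple) -> float:
    # Each run is (n_params, n_tokens) for the cheapest model in a family that
    # reaches some fixed HellaSwag score. The values used below are hypothetical.
    return train_flops(*worse_run) / train_flops(*better_run)

# e.g. if a 1.4B model from one recipe matched the HellaSwag score of a 12B model
# from another recipe, both trained on 300B tokens, that would be a ~8.6x multiplier.
print(compute_multiplier(better_run=(1.4e9, 300e9), worse_run=(12e9, 300e9)))
```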
There's lots more evidence, and many models such as Qwen or R1 got significantly higher compute multipliers, but as you note, some fraction of that is from distillation or data labelling.
Note: Claude 4.6 wrote most of this, with a huge number of revisions and fixes from me; it wasn't really an uplift compared to only using Claude to make the list and the plot.
I would guess that 96+% of code merged into lab codebases is read by a human, but I've heard some startups are <50%, and are intentionally accumulating tons of tech debt / bad code for short-term gain. People write lots of personal scripts with AI, which aren't in the merged/production code statistics; maybe that brings it to 85% human-read code in labs. There's also a wide range in how deeply you read and review code, where maybe only 50% of code that's read is fully understood.
Anthropic already applied some CBRN filtering to Opus 4, with the intent to bring it below Anthropic's ASL-3 CBRN threshold, but the model did not end up conclusively below that threshold. Anthropic looked into whether they could bring more capable future models below the ASL-3 Bio threshold using pretraining data filtering, and determined that it would require filtering out too much biology and chemistry knowledge. Jerry's comment is about nerfing the model to below the ASL-3 threshold even with tool use, which is a very low bar compared to frontier model capabilities. This doesn't necessarily apply to sabotage or the ability to circumvent monitoring, which depends on un-scaffolded capabilities.
Redwood Research did very similar experiments in 2022, but didn't publish them. They are briefly mentioned in this podcast: https://blog.redwoodresearch.org/p/the-inaugural-redwood-research-podcast.
In this range (400-1800 lines), lines of code doesn't correlate with effort imo. It only takes about a day to write 1800 lines of code by hand; the actual effort is dominated by thinking of ideas and huge hyperparameter sweeps.
Another note: I was curious what they had to do to reduce torch startup time and such, and it turns out they spend 7 minutes compiling and warming up for their 2-minute training run lmao. That does make it more realistic, but is a bit silly.
There's obviously no truth to that claim. Labs absolutely have better models.
Claude 4.5 Sonnet is relatively bad at vision for a frontier model. Gemini 3 Pro, GPT 5.2, and Claude 4.5 Opus are better.
Saying you would two-box in Newcomb's problem doesn't make sense. By saying it aloud, you're directly causing anyone you can influence, and who would later have an opportunity to acausally cooperate with you, to not do that. If you believe acausal stuff is possible, you should always respond with "one box" or "no comment", even if you would in reality two-box.
Maybe "accountability" is just the mixture of responsibility and dominance, and of those two I find dominance more motivating.