Do we have reason to assume that MoE models will stretch to the entire rack memory capacity minus the KV cache in batched inference? I ruminated earlier that Rubin Ultra could theoretically run 300T-parameter sparse models, but recently it looks like it's possible to squeeze much more out of far fewer parameters and to iterate much faster.
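For reference, a minimal back-of-envelope sketch of where a number like 300T comes from, assuming roughly 365 TB of fast memory per rack, FP8 weights, and a fixed slice reserved for KV cache; all three numbers are placeholders, not specs:

```python
# Back-of-envelope: how many FP8-weight parameters fit in one rack?
# Every number here is an assumption for illustration, not a vendor spec.

FAST_MEMORY_TB = 365     # assumed per-rack fast memory for a Rubin Ultra class rack
BYTES_PER_PARAM = 1.0    # FP8 weights, ignoring runtime overheads
KV_RESERVE_TB = 60       # assumed slice reserved for batched-inference KV cache

usable_bytes = (FAST_MEMORY_TB - KV_RESERVE_TB) * 1e12
max_params_trillions = usable_bytes / BYTES_PER_PARAM / 1e12
print(f"~{max_params_trillions:.0f}T parameters fit in weights alone")  # ~305T
```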
This data about compute spend at OpenAI suggests this trend: https://epoch.ai/data-insights/openai-compute-spend
At the same time, some Chinese open-source models are scaling down in parameters. The most recent MiniMax M2 is much smaller than MiniMax M1. Qwen3-Next-80B-A3B is close in performance to Qwen3-235B-A22B. I wouldn't be surprised if the same happened with DeepSeek-V4.
It's not that models won't grow, but it feels like intelligence density per parameter, and the sparsity thresholds beyond which quality degrades, will be tested in the search for maximum training and inference efficiency.
Intuition for ‘faster’ seems more straightforward to justify because, in general, there will be a higher volume of technology available as time progresses, while brain emulation requirements are constant. I think it’s interesting to focus on what could cause a slower scenario.
It may be the case that scaling AGI-like intelligence is like trying to increase velocity in a fluid. It's more complex than just a quadratic increase in drag. The type of flow changes above supersonic speeds, and there will be multiple supersonic-like transitions that are incredibly complex to understand.
The radical difference in human intellectual capability across supposedly near-identical substrates seems contradictory at first. It's possible that, for some evolutionary reason, the ability to form complex circuits is stunted in non-anomalous brains.
It’s possible that the brain requires much more compute than a naive estimate based on synaptic firing frequency suggests.
I have often heard that what a single neuron does is extremely complex. On the other hand, the frequency of synaptic firing suggests there isn't much data transmitted in total. This is relatively hard for me to reconcile: on one hand, Hans Moravec-style estimates (computing capacity of the retina scaled up by relative brain volume) make sense; on the other hand, outside the retina, at the whole-brain level, some sort of data augmentation may be happening that actually consumes 99.9% of the compute, and those processes may rely on very complex in-neuron operations.
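A toy version of the two estimates side by side; every constant is a rough placeholder in the spirit of Moravec's numbers rather than a measured value:

```python
# Two rough whole-brain compute estimates, to show how far apart they can land.
# All constants are ballpark assumptions, not measured values.

# (1) Moravec-style: estimate the retina, then scale by relative brain size.
retina_ops_per_s = 1e9          # assumed compute-equivalent of retinal processing
brain_to_retina_ratio = 1e5     # assumed neuron/volume ratio of brain to retina
moravec_estimate = retina_ops_per_s * brain_to_retina_ratio    # ~1e14 ops/s

# (2) Spike counting plus heavy in-neuron processing: suppose the signalling
#     visible as synaptic events is only 0.1% of the real work.
synaptic_events_per_s = 1e14 * 1.0   # ~1e14 synapses at an assumed ~1 Hz average
in_neuron_overhead = 1e3             # "augmentation" consuming 99.9% of compute
pessimistic_estimate = synaptic_events_per_s * in_neuron_overhead  # ~1e17 ops/s

print(f"Moravec-style: ~{moravec_estimate:.0e} ops/s")
print(f"With heavy in-neuron processing: ~{pessimistic_estimate:.0e} ops/s")
```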
For now, I think that’s enough. I need to think about this more.
Thanks for the detailed response. If we assume that continued progress requires using much more compute on RL than pretraining, this could favor model sizes much smaller than Chinchilla optimal. It's possible that future models will be trained with 99% RL and will employ some sort of continuous learning architecture.
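As a toy illustration of why an RL-heavy budget pushes toward smaller models: with the usual ~6ND approximation for pretraining FLOPs and ~2N FLOPs per generated rollout token, a 10x smaller model buys roughly 10x more RL tokens from the same budget. The budget and the 1%/99% split below are just the scenario in question, not a forecast:

```python
# Toy budget split: pretraining vs RL rollout tokens for two model sizes.
# Uses the standard ~6*N*D pretraining-FLOPs and ~2*N FLOPs-per-generated-token
# approximations; the total budget and the 1%/99% split are assumptions.

BUDGET_FLOPS = 1e26
PRETRAIN_FRAC = 0.01

def token_counts(n_params):
    pretrain_tokens = BUDGET_FLOPS * PRETRAIN_FRAC / (6 * n_params)
    # assume RL cost is dominated by rollout generation
    rl_tokens = BUDGET_FLOPS * (1 - PRETRAIN_FRAC) / (2 * n_params)
    return pretrain_tokens, rl_tokens

for n_params in (1e12, 1e11):   # a 1T model vs a 10x smaller one (active params)
    pt, rl = token_counts(n_params)
    print(f"N={n_params:.0e}: {pt:.1e} pretraining tokens, {rl:.1e} RL rollout tokens")
```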
I believe the number of active parameters will be determined by balancing the load on compute and memory, which will depend on the current level of attention efficiency and the context length necessary to hit agent-ability targets. I don't know how sparse very large (>20T) models could be, but they could probably be extremely sparse (maybe 0.1% active or less, where most of the volume is taken up by a router).
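A crude roofline-style check of that compute/memory balance during batched decoding; the hardware figures and the 20T-total / 20B-active (0.1%) model below are placeholders:

```python
# Crude roofline check: is a batched MoE decode step compute- or memory-bound?
# All hardware and model numbers below are placeholder assumptions.

PEAK_FLOPS = 5e15          # assumed low-precision throughput per accelerator
MEM_BW = 8e12              # assumed memory bandwidth, bytes/s
machine_balance = PEAK_FLOPS / MEM_BW      # FLOPs per byte needed to stay busy

def step_intensity(total_params, active_params, batch, kv_bytes_per_seq):
    """FLOPs per byte moved for one decode step over the whole batch."""
    flops = 2 * active_params * batch      # ~2*N_active FLOPs per generated token
    weight_bytes = total_params * 1.0      # FP8 weights streamed once per step
    kv_bytes = kv_bytes_per_seq * batch    # each sequence reads its own KV cache
    return flops / (weight_bytes + kv_bytes)

intensity = step_intensity(total_params=20e12, active_params=20e9,
                           batch=256, kv_bytes_per_seq=5e9)
print(f"machine balance: {machine_balance:.0f} FLOPs/byte, "
      f"decode intensity: {intensity:.2f} FLOPs/byte")
# memory-bound while intensity < machine balance; higher sparsity and longer
# context both push the intensity down, which is the balance described above
```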
I think it's possible that current and planned AI compute designs are not even close to optimal for training methods that may be developed within a year or so. Assuming something like DeepSeek's DSA is relatively close to maximum attention efficiency, and that RL trajectories produced by agents will stretch to many millions of tokens, it may turn out that HBM is finally too slow and that moving to SRAM is the only way forward.
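A rough feel for the KV-cache pressure at multi-million-token trajectories, with an assumed compressed per-token footprint and decode-speed target; sparse attention in the DSA style only touches a subset of the cache per step, so the dense re-read below is a worst case:

```python
# Rough KV-cache pressure at multi-million-token contexts.
# The per-token footprint and decode target are assumptions; dense re-reads
# are the worst case, since sparse attention touches only part of the cache.

kv_bytes_per_token = 70e3     # assumed compressed (MLA-style) KV footprint
context_tokens = 5e6          # "many millions of tokens" of agent trajectory
decode_tok_per_s = 50         # assumed per-sequence decode speed target

kv_cache_bytes = kv_bytes_per_token * context_tokens
bandwidth_dense = kv_cache_bytes * decode_tok_per_s   # re-read every step

print(f"KV cache per sequence: ~{kv_cache_bytes / 1e9:.0f} GB")
print(f"Bandwidth for dense re-reads: ~{bandwidth_dense / 1e12:.1f} TB/s per sequence")
```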
There's also a completely different consideration. For example, inclusionAI/Ring-1T has 1T total and 50B active parameters, and its creators claim it can achieve IMO silver using a multi-agent framework, with plans to train it further to reach gold. And that's from only 50B active parameters.
As you said, we're probably entering compute ranges that enable training models with many trillions of active parameters (assuming RL will be in roughly one-to-one proportion to pretraining). The question is whether the pattern density of our environment is currently high enough to make use of it.