Do we have reason to assume that MoE models will stretch to the entire rack memory capacity minus the KV cache in batched inference? I ruminated earlier that Rubin Ultra could theoretically run 300T-parameter sparse models, but recently it looks like it's possible to squeeze much more out of far fewer parameters and to iterate much faster.
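For reference, a minimal back-of-envelope sketch of where a number like 300T comes from, assuming roughly 365 TB of fast memory per rack, FP8 weights, and a fixed slice reserved for KV cache; all three numbers are placeholders, not specs:

```python
# Back-of-envelope: how many FP8-weight parameters fit in one rack?
# Every number here is an assumption for illustration, not a vendor spec.

FAST_MEMORY_TB = 365     # assumed per-rack fast memory for a Rubin Ultra class rack
BYTES_PER_PARAM = 1.0    # FP8 weights, ignoring runtime overheads
KV_RESERVE_TB = 60       # assumed slice reserved for batched-inference KV cache

usable_bytes = (FAST_MEMORY_TB - KV_RESERVE_TB) * 1e12
max_params_trillions = usable_bytes / BYTES_PER_PARAM / 1e12
print(f"~{max_params_trillions:.0f}T parameters fit in weights alone")  # ~305T
```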
This data about compute spend at OpenAI suggests this trend: https://epoch.ai/data-insights/openai-compute-spend
At the same time, some Chinese open-source models are scaling down in parameters. The most recent MiniMax M2 is much smaller than MiniMax M1. Qwen3-Next-80B-A3B is close in performance to Qwen3-235B-A22B. I wouldn't be surprised if the same happened with DeepSeek-V4.
It's not that models won't grow, but it feels like intelligence density per parameter, and the sparsity thresholds beyond which quality degrades, will be tested in the search for maximum training and inference efficiency.
Intuition for ‘faster’ seems more straightforward to justify because, in general, there will be a higher volume of technology available as time progresses, while brain emulation requirements are constant. I think it’s interesting to focus on what could cause a slower scenario.
It may be the case that scaling AGI-like intelligence is like trying to increase velocity in a fluid. It's more complex than just a quadratic increase in drag. The type of flow changes above supersonic speeds, and there will be multiple supersonic-like transitions that are incredibly complex to understand.
The radical difference in human intellectual capability across supposedly near-identical substrates seems contradictory at first. It's possible that, for some evolutionary reason, the ability to form complex circuits is stunted in non-anomalous brains.
It’s possible that the brain requires much more compute than a naive estimate based on synaptic firing frequency suggests.
I have often heard that what a single neuron does is extremely complex. On the other hand, the frequency of synaptic firing suggests there isn't much data transmitted in total. This is relatively hard for me to reconcile: on one hand, Hans Moravec-style estimates (computing capacity of the retina scaled up by relative brain volume) make sense; on the other hand, outside the retina, at the whole-brain level, some sort of data augmentation may be happening that actually consumes 99.9% of the compute, and those processes may rely on very complex in-neuron operations.
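A toy version of the two estimates side by side; every constant is a rough placeholder in the spirit of Moravec's numbers rather than a measured value:

```python
# Two rough whole-brain compute estimates, to show how far apart they can land.
# All constants are ballpark assumptions, not measured values.

# (1) Moravec-style: estimate the retina, then scale by relative brain size.
retina_ops_per_s = 1e9          # assumed compute-equivalent of retinal processing
brain_to_retina_ratio = 1e5     # assumed neuron/volume ratio of brain to retina
moravec_estimate = retina_ops_per_s * brain_to_retina_ratio    # ~1e14 ops/s

# (2) Spike counting plus heavy in-neuron processing: suppose the signalling
#     visible as synaptic events is only 0.1% of the real work.
synaptic_events_per_s = 1e14 * 1.0   # ~1e14 synapses at an assumed ~1 Hz average
in_neuron_overhead = 1e3             # "augmentation" consuming 99.9% of compute
pessimistic_estimate = synaptic_events_per_s * in_neuron_overhead  # ~1e17 ops/s

print(f"Moravec-style: ~{moravec_estimate:.0e} ops/s")
print(f"With heavy in-neuron processing: ~{pessimistic_estimate:.0e} ops/s")
```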
For now, I think that’s enough. I need to think about this more.
Thanks for the detailed response. If we assume that continued progress requires using much more compute on RL than pretraining, this could favor model sizes much smaller than Chinchilla optimal. It's possible that future models will be trained with 99% RL and will employ some sort of continuous learning architecture.
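As a toy illustration of why an RL-heavy budget pushes toward smaller models: with the usual ~6ND approximation for pretraining FLOPs and ~2N FLOPs per generated rollout token, a 10x smaller model buys roughly 10x more RL tokens from the same budget. The budget and the 1%/99% split below are just the scenario in question, not a forecast:

```python
# Toy budget split: pretraining vs RL rollout tokens for two model sizes.
# Uses the standard ~6*N*D pretraining-FLOPs and ~2*N FLOPs-per-generated-token
# approximations; the total budget and the 1%/99% split are assumptions.

BUDGET_FLOPS = 1e26
PRETRAIN_FRAC = 0.01

def token_counts(n_params):
    pretrain_tokens = BUDGET_FLOPS * PRETRAIN_FRAC / (6 * n_params)
    # assume RL cost is dominated by rollout generation
    rl_tokens = BUDGET_FLOPS * (1 - PRETRAIN_FRAC) / (2 * n_params)
    return pretrain_tokens, rl_tokens

for n_params in (1e12, 1e11):   # a 1T model vs a 10x smaller one (active params)
    pt, rl = token_counts(n_params)
    print(f"N={n_params:.0e}: {pt:.1e} pretraining tokens, {rl:.1e} RL rollout tokens")
```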
I believe the number of active parameters will be determined by balancing the load on compute and memory, which will depend on the current level of attention efficiency and the context length necessary to hit agent-ability targets. I don't know how sparse very large (>20T) models could be, but they could probably be extremely sparse (maybe 0.1% active or less, where most of the volume is taken up by a router).
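A crude roofline-style check of that compute/memory balance during batched decoding; the hardware figures and the 20T-total / 20B-active (0.1%) model below are placeholders:

```python
# Crude roofline check: is a batched MoE decode step compute- or memory-bound?
# All hardware and model numbers below are placeholder assumptions.

PEAK_FLOPS = 5e15          # assumed low-precision throughput per accelerator
MEM_BW = 8e12              # assumed memory bandwidth, bytes/s
machine_balance = PEAK_FLOPS / MEM_BW      # FLOPs per byte needed to stay busy

def step_intensity(total_params, active_params, batch, kv_bytes_per_seq):
    """FLOPs per byte moved for one decode step over the whole batch."""
    flops = 2 * active_params * batch      # ~2*N_active FLOPs per generated token
    weight_bytes = total_params * 1.0      # FP8 weights streamed once per step
    kv_bytes = kv_bytes_per_seq * batch    # each sequence reads its own KV cache
    return flops / (weight_bytes + kv_bytes)

intensity = step_intensity(total_params=20e12, active_params=20e9,
                           batch=256, kv_bytes_per_seq=5e9)
print(f"machine balance: {machine_balance:.0f} FLOPs/byte, "
      f"decode intensity: {intensity:.2f} FLOPs/byte")
# memory-bound while intensity < machine balance; higher sparsity and longer
# context both push the intensity down, which is the balance described above
```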
I think it's possible that current and planned AI compute designs are not even close to optimal for training methods that may be developed within a year or so. Assuming something like DeepSeek's DSA is relatively close to maximum attention efficiency, and that RL trajectories produced by agents will stretch to many millions of tokens, it may turn out that HBM is finally too slow and that moving to SRAM is the only way forward.
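A rough feel for the KV-cache pressure at multi-million-token trajectories, with an assumed compressed per-token footprint and decode-speed target; sparse attention in the DSA style only touches a subset of the cache per step, so the dense re-read below is a worst case:

```python
# Rough KV-cache pressure at multi-million-token contexts.
# The per-token footprint and decode target are assumptions; dense re-reads
# are the worst case, since sparse attention touches only part of the cache.

kv_bytes_per_token = 70e3     # assumed compressed (MLA-style) KV footprint
context_tokens = 5e6          # "many millions of tokens" of agent trajectory
decode_tok_per_s = 50         # assumed per-sequence decode speed target

kv_cache_bytes = kv_bytes_per_token * context_tokens
bandwidth_dense = kv_cache_bytes * decode_tok_per_s   # re-read every step

print(f"KV cache per sequence: ~{kv_cache_bytes / 1e9:.0f} GB")
print(f"Bandwidth for dense re-reads: ~{bandwidth_dense / 1e12:.1f} TB/s per sequence")
```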
There's also a completely different consideration. For example, inclusionAI/Ring-1T has 1T total and 50B active parameters, and its creators claim it can achieve IMO silver using a multi-agent framework, with plans to train it further to reach gold. And that's from only 50B active parameters.
As you said, we're probably entering compute ranges that enable training models with many trillions of active parameters (assuming RL will be in roughly one-to-one proportion to pretraining). The question is whether the pattern density of our environment is currently high enough to make use of it.