The current scaling speed is driven by increasing funding for training projects, which isn't sustainable without continued success. Without it, scaling slows to the much slower FLOP/dollar trend of improving cost efficiency of compute by building better AI accelerators. The 2 + 4 + 8 years framing might describe a gradual increase in funding, but there are still 2 OOMs of training compute beyond the original GPT-4 that are already baked into the scale of the datacenters currently being built and haven't yet produced deployed models. We'll only observe this in full by late 2026, so current capabilities don't yet match the capabilities we'll see before a possible scaling slowdown.
This is DiLoCo (Nov 2023 paper), a local SGD setup where the outer optimizer updates much more rarely (every 100-500 steps of the inner optimizers), which asks for much less bandwidth (the outer optimizer keeps Nesterov momentum in its state). The inner optimizers run within individual clusters, and the outer optimizer aggregates updates from the individual clusters over a much slower network that connects them. The experiments were done with models of up to 400M parameters. (See also this paper on asynchronous variants of DiLoCo.)
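Here's a minimal sketch of the outer/inner structure, assuming PyTorch; the names (`clusters`, `H`) and learning rates are illustrative, not the paper's actual code:

```python
# DiLoCo-style round: H local steps per cluster, then one outer update.
import copy
import torch

def diloco_round(global_model, clusters, outer_opt, H=500):
    deltas = []
    for shard in clusters:  # each cluster trains its own replica locally
        local = copy.deepcopy(global_model)
        inner_opt = torch.optim.AdamW(local.parameters(), lr=1e-4)
        for _, (x, y) in zip(range(H), shard):  # H inner steps, no cross-cluster traffic
            inner_opt.zero_grad()
            torch.nn.functional.cross_entropy(local(x), y).backward()
            inner_opt.step()
        # outer "pseudo-gradient": how far this cluster moved the weights
        deltas.append([g.detach() - l.detach()
                       for g, l in zip(global_model.parameters(), local.parameters())])
    # averaging the deltas is the only cross-cluster communication, once per H steps
    for i, p in enumerate(global_model.parameters()):
        p.grad = torch.stack([d[i] for d in deltas]).mean(dim=0)
    outer_opt.step()  # e.g. torch.optim.SGD(..., lr=0.7, momentum=0.9, nesterov=True)
    outer_opt.zero_grad()
```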
The original paper lacks good compute efficiency measurements. The distributed training experiments start from a checkpoint trained for 24K steps and continue for 64K more steps (to a total of 88K) in various distributed configurations. Even for the non-distributed configuration the perplexity keeps increasing until step 29K (Figure 7b, Figure 9). The compute expended in a non-distributed run between steps 24K and 88K gets expended in an 8-cluster run between steps 24K and 32K (8 clusters × 8K steps = 64K steps worth of compute), when perplexity is barely starting to come down from its global maximum. So there is no way of comparing how well an 8-cluster run uses its compute, because the non-distributed experiment stops so early (at 88K steps) that the uninformative, poorly optimized early state of the model still dominates the distributed configuration at the same amount of compute (at 32K steps).
Prime Intellect first reproduced DiLoCo in Jul 2024 (blog post, paper) on models of up to 1.1B parameters, taking training efficiency measurements. The largest experiment with a 1.1B model runs across 4 nodes that communicate only every 125 steps, and matches the perplexity of a similar training run within a single cluster (with communication every step) while using 20% more compute (Figure 7, comparing with the 4x batch size baseline).
The new 10B model lacks baselines for comparison, so it doesn't help with understanding how training efficiency depends on scale, but the results on benchmarks seem similar to those of other models with a similar size and number of training tokens (Table 4 in the blog post).
Llama-3-405B is an important anchor for the compute of other models. With 4e25 FLOPs and conservative training techniques it's about as capable as other frontier models, so those models probably don't use much more compute. If they have better techniques, they need less compute to get similar performance, not more. And they probably didn't train for more than 6 months. At $2 per H100-hour[1], $3 billion buys 6 months of time on 300K H100s. There are no publicly known training systems this large; the first 100K H100 systems only started appearing in the later part of this year. Thus the training cost figures must include smaller experiments that in aggregate eat more compute than the largest training runs, on the now-ubiquitous smaller clusters that are also used for inference.
So anchoring to the total number of GPUs is misleading about frontier model training, because most GPUs are used for inference and smaller experiments, and the above estimate shows that figures like $3 billion for training are also poor anchors. If instead we take 20K H100s as the typical scale of the largest clusters in mid 2023 to early 2024, and 4 months as a typical duration of frontier model training, we get $120 million at $2 per H100-hour, or 8e25 dense BF16 FLOPs at 40% compute utilization, only about 2x Llama-3-405B compute. This agrees with Dario Amodei's claim that as of Jun 2024 the scale of deployed models is about $100 million.
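For concreteness, a back-of-the-envelope check of the two estimates above (a sketch: the $2/hour price and 40% utilization are from the text; ~730 hours per month and ~1e15 FLOP/s of dense BF16 peak per H100 are assumed round figures):

```python
hours = lambda months: months * 730  # ~730 hours per month

# $3 billion at $2/H100-hour over 6 months:
gpus = 3e9 / (2 * hours(6))                    # ~340K H100s, i.e. roughly 300K

# 20K H100s for 4 months at $2/H100-hour:
cost = 20_000 * hours(4) * 2                   # ~$117M, i.e. roughly $120 million
flops = 20_000 * 1e15 * 0.4 * hours(4) * 3600  # ~8.4e25 dense BF16 FLOPs at 40% utilization

print(f"{gpus:.2e} GPUs, ${cost:.2e}, {flops:.1e} FLOPs")
```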
For what it's worth, training the largest models requires building the training system yourself, which makes the market price of renting fewer GPUs from much smaller clusters not that relevant. ↩︎
A first candidate abstraction is the neuron doctrine: the idea that the mind is governed purely by patterns of neuron spiking. This abstraction seems simple enough (it could be captured with an artificial neural network of the same size, requiring around 1e20 parameters).
The human brain only holds about 200-300 trillion synapses, so getting to 1e20 parameters asks for hundreds of thousands of parameters per synapse (1e20 / 2.5e14 ≈ 4e5).
The first reasoning trace in the QwQ blog post seems impressive in how it manages to eventually stumble on the correct answer despite the 32B model clearly having no clue throughout, so it manages to effectively explore while almost blind. If this is sufficient to get o1-preview level results on reasoning benchmarks, it's plausible that RL in such post-training is mostly unhobbling the base models rather than making them smarter.
So some of these recipes might fail to have any way of scaling far, in the same sense that preference tuning doesn't scale far (unlike AlphaZero). The QwQ post doesn't include a scaling plot, and the scaling plot in the DeepSeek-R1 post doesn't show improvement with further training, only with thinking for more tokens. The o1 post shows improvement with more training, but it might plateau in the uninteresting way instruction/preference post-training plateaus, by making the model reliably do the thing its base model is already capable of in some sense. The similarity between o1, R1, and QwQ is superficial enough that the potential to scale with more post-training might be present in some of them and not others, or in none of them.
From a proliferation perspective, it reduces overhang and makes it more likely that Llama 4 gets long reasoning trace post-training in-house rather than later, so initial capability evaluations would give more relevant results. But if Llama 4 is already training, there might not be enough time for the technique to mature, and Llamas have been quite conservative in their techniques so far.
That's relevant, but about what I expected, and it's why I hedged with "it should be possible to post-train", which that paper doesn't explore. The residual stream across many tokens is working memory: N layers of "vertical" compute over a single token only have one activation vector to work with, while with more filler tokens there are many activation vectors that can work on multiple things in parallel and then aggregate. If a weaker model doesn't take advantage of this, or gets too hung up on concrete tokens to think about other things in the meantime rather than maintaining multiple trains of thought simultaneously, a stronger model might[1].
Performance on large questions (such as reading comprehension) with an immediate answer (no CoT) shows that N layers across many tokens, with no opportunity for deeper serial compute, are sufficient for many purposes. But a question is only fully understood once it's read completely, so some of the thinking about the answer can't start before that. If there are no more tokens, this creates an artificial constraint on working memory for thinking about the answer, and filler tokens should be useful for lifting it. Repeating the question seems to help, for example (see Figure 3 and Table 5).
During inference, for each token and each layer over it, the attention block computes key and value vectors, the data called the KV cache. For the current token, the contribution of an attention block in some layer to the residual stream is computed by looking at the entries in the KV cache of the same layer across all the preceding tokens. This contribution doesn't feed into the KV cache entry for the current token at the same layer; it only influences the entry at the next layer, which is how all of this can run in parallel when processing input tokens and in training. The dataflow is shallow but wide.
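To make the dataflow concrete, here is a minimal single-layer, single-head sketch (illustrative PyTorch, not any particular model's code; real layers also have multiple heads, normalization, and an MLP block). The current token's KV entry is written from the layer's input before the attention output is added, so that output only reaches the KV cache at the next layer:

```python
import torch

def attn_layer_step(x_t, Wq, Wk, Wv, kv_cache):
    """x_t: residual-stream vector for the current token at this layer."""
    k_t, v_t = Wk @ x_t, Wv @ x_t        # this token's KV entry, from the layer input
    kv_cache.append((k_t, v_t))          # written before attending
    q_t = Wq @ x_t
    K = torch.stack([k for k, _ in kv_cache])  # entries across preceding tokens (and this one)
    V = torch.stack([v for _, v in kv_cache])
    w = torch.softmax(K @ q_t / K.shape[-1] ** 0.5, dim=0)
    out = w @ V                          # attention's contribution to the residual stream
    return x_t + out                     # only the next layer's KV entry can depend on this
```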
So I would guess it should be possible to post-train an LLM to give answers like "................... Yes" instead of "Because 7! contains both 3 and 5 as factors, which multiply to 15. Yes", and the LLM would still be able to take advantage of CoT (for more challenging questions), because it would be following a line of reasoning written down in the KV cache lines of each layer across the preceding tokens, even if in the first layer there is always the same uninformative dot token. The tokens of the question are still explicitly there and kick off the process by determining the KV cache entries over the first dot tokens of the answer, which can then be taken into account when computing the KV cache entries over the following dot tokens (moving up a layer wherever dependence on the KV cache data over the preceding dot tokens is needed), and so on.
Still consistent with great concern. I'm pointing out that O O's point isn't locally valid: observing concern shouldn't translate into observing belief that alignment is impossible.
In Transparent Newcomb's Problem, your decision determines whether you existed while you were making the decision. It's not valid to conclude that you exist merely from subjective observation of your own existence, because such observation can take place within counterfactuals, and you wouldn't be able to tell whether you are within a counterfactual or within actuality other than by reasoning about (or determining) your situation's actuality status. Like math, existence can't be perceived by looking at rocks.
So this already doesn't follow: your conclusion is true, but it doesn't require MWI. Even without MWI, observing our own existence doesn't tell us anything about the probability of biogenesis (and the subsequent development of generally intelligent life).