Vladimir_Nesov
Comments

Cole Wyeth's Shortform
Vladimir_Nesov · 13h

That is, we see the first generation of massively scaled RLVR around 2026/2027. So it kind of has to work out of the box for AGI to arrive that quickly?

By 2027, we'll also have pretraining scaled up 10x compared to current models (which were trained on 2024 compute), and correspondingly scaled RLVR, with many diverse tool-using environments that are not just about math and coding-contest-style problems. Going 10x below current pretraining gets you the original GPT-4 from Mar 2023, which is significantly worse than current models. So with 10x more pretraining than current models, the models of 2027 might make significantly better use of RLVR training than the current models can.

Also, 2 years might be enough time to get some sort of test-time training capability started, either with novel or currently-secret methods, or by RLVRing models to autonomously do post-training on variants of themselves, making them better at particular sources of tasks during narrow deployment. Sutskever's SSI is rumored to be working on the problem (at 39:25 in the podcast), and overall this seems like the most glaring currently-absent faculty. (Once it's implemented, something else might end up being a similarly obvious missing piece.)

it seems like your model suggests AGI in 2027 is pretty unlikely?

I'd give it 10% (for 2025-2027). From my impression of current capabilities and the effect of scaling so far, the remaining 2 OOMs of compute seem to give about a 30% probability of getting there (by about 2030), with a third of that in the first 10x of the remaining scaling, that is, 10% with 2026 compute (for 2027 models). After 2029, scaling slows down to a crawl (relatively speaking), so maybe another 50% for the 1000x of scaling in 2030-2045, when there'll also be time for any useful schlep, with 20% remaining for 2045+ (some of it from a coordinated AI Pause, which I think is likely to last if at all credibly established). If the 5 GW AI training systems don't get built in 2028-2029, they are still likely to get built a bit later, so this essentially doesn't influence predictions outside the 2029-2033 window; some probability within that window merely gets pushed a bit towards the future.

So this gives a median of about 2034. If AGI still isn't working in the early 2030s, even with more time for schlep, the probability at that level of compute starts going down, so the 2030s are front-loaded in probability even though compute isn't scaling any faster in the early 2030s than later.
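
(A minimal sketch of that arithmetic, in Python. The window totals are the ones given above; how probability is spread within each window, making the 2030s front-loaded, is an illustrative assumption.)

```python
# Window totals as stated above; the within-window spread (weighting the
# early years of each window more heavily) is an illustrative assumption.
windows = [
    (2025, 2027, 0.10),  # first 10x of the remaining scaling
    (2028, 2030, 0.20),  # rest of the ~2 OOMs, reaching ~30% cumulative by 2030
    (2031, 2045, 0.50),  # slower scaling plus time for useful schlep
]
# The remaining 20% is for 2045+ (including worlds with a lasting AI Pause).

cumulative, by_year = 0.0, {}
for start, end, total in windows:
    years = range(start, end + 1)
    weights = [1.0 / (1 + 0.15 * i) for i in range(end - start + 1)]
    for year, w in zip(years, weights):
        cumulative += total * w / sum(weights)
        by_year[year] = cumulative

median_year = next(y for y, c in by_year.items() if c >= 0.5)
print(by_year[2027], by_year[2030], median_year)  # ~0.10, ~0.30, ~2034
```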

Cole Wyeth's Shortform
Vladimir_Nesov · 14h

It was indicated in the opening slide of the Grok 4 release livestream that Grok 4 was pretrained with the same amount of compute as Grok 3, which in turn was pretrained on 100K H100s, so probably 3e26 FLOPs (40% utilization for 3 months at 1e15 FLOP/s per chip). RLVR has 3x-4x lower compute utilization than pretraining, so if we insist on counting RLVR in FLOPs, then 3 months of RLVR might be 9e25 FLOPs, for a total of 4e26 FLOPs.

Stargate Abilene will be 400K chips in GB200 NVL72 racks in 2026, which is 10x more FLOP/s than 100K H100s. So it'll be able to train 4e27-8e27 FLOPs models (pretraining and RLVR, in 3+3 months), and it might be early 2027 when they are fully trained. (Google is likely to remain inscrutable in their training compute usage, though Meta might also catch up by then.)
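
(A minimal sketch of this arithmetic; the utilization, per-chip throughput, and durations are the assumptions stated above, not independently sourced figures.)

```python
SECONDS_PER_MONTH = 30 * 24 * 3600

def training_flops(n_chips, flops_per_chip, utilization, months):
    return n_chips * flops_per_chip * utilization * months * SECONDS_PER_MONTH

# Grok 3 / Grok 4 pretraining: 100K H100s, ~1e15 FLOP/s per chip, 40% utilization, 3 months.
pretrain = training_flops(100_000, 1e15, 0.40, 3)   # ~3e26 FLOPs
rlvr = pretrain / 3.5                                # RLVR at 3x-4x lower utilization, ~9e25 FLOPs
grok4_total = pretrain + rlvr                        # ~4e26 FLOPs

# Stargate Abilene in 2026: ~10x the FLOP/s of 100K H100s, so ~10x the totals.
total_2026 = 10 * grok4_total                        # ~4e27 FLOPs, up to ~8e27 with longer runs
print(f"{pretrain:.1e} {rlvr:.1e} {grok4_total:.1e} {total_2026:.1e}")
```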

the amount of operations used for creating Grok 4 was estimated as 4e27–6e27

(I do realize it's probably some sort of typo, either yours or in your unnamed source. But 10x is almost 2 years of even the current fast funding-fueled scaling; that's not a small difference.)

Should you steelman what you don’t understand?
Vladimir_Nesov · 17h

Steelmanning is a way of generating theories you'd want to consider or develop; it doesn't by itself give them much credence. So it shouldn't be a problem to steelman anything (just as it shouldn't be a problem to hypothesize anything); the failure mode is justifying credence in a resulting theory on the grounds that it was obtained by steelmanning. There is also often no particular reason to expect a steelman to be relevant to its inciting incident.

"Some Basic Level of Mutual Respect About Whether Other People Deserve to Live"?!
Vladimir_Nesov · 18h

Discussion of how people or institutions do things in systematically bad ways could focus on the norms and incentives rather than on the people or institutions themselves. This covers the vast majority of situations (though not all), and is also more constructive in those situations. False beliefs are the easy case, but they're not the only case where it's practical to argue people into self-directed change.

The people who are enforcing bad norms or establishing bad incentives, while themselves acting under other (or the same) bad norms or incentives, would often acknowledge that the norms and incentives are no good, and might cooperate (or even lead) in setting up coordination technologies that reduce or contest the influence of these forces. I think it's worth checking whether this path really is closed before resorting to the more offense-provoking methods (even rhetoric).

Do confident short timelines make sense?
Vladimir_Nesov · 1d

I'm responding to the point that LLM agents have been a thing for years, and that therefore some level of maturity should be expected from them. I think this isn't quite right, as the current method is new, the older methods didn't work out, and it's too early to tell that the new method won't work out.

So I'm discussing when it'll be time to tell that it won't work out either (unless it does), at which point it'll be possible to have some sense as to why: not yet, probably in 2026, and certainly by 2027. I'm not really arguing about the probability that it does work out.

Do confident short timelines make sense?
Vladimir_Nesov · 1d

But people have been talking about LLM agents for years, and I’d be shocked if the frontier labs weren’t trying? Like, if that worked out of the box, we would know by now (?).

Agentic (tool-using) RLVR only started working in late 2024, with o3 as the first proper tool-using reasoning LLM prototype. From how it all looks (rickety and failing in weird ways), it'll take another pretraining scale-up to get enough redundant reliability for some of the noise to fall away, and thus to get a better look at the implied capabilities. Also, the development of environments for agentic RLVR only seems to be starting to ramp up this year, and GB200 NVL72s, which are significantly more efficient for RLVR on large models, are only now starting to come online in large quantities.

So I expect that only the 2026 LLMs trained with agentic RLVR will give a first reasonable glimpse of what this method gets us and the shape of its limitations, and only in 2027 will we get a picture overdetermined by the essential capabilities of the method rather than by contingent early-days issues. (In the worlds where it ends up below AGI in 2027, and also where nothing else works too well before then.)

Are agent-action-dependent beliefs underdetermined by external reality?
Vladimir_Nesov · 2d

The complicated computation F could be some person making a decision F(), and the complicated computation G could define an outcome of that decision that we are considering, so that G()>14 is a claim about that outcome with some truth value. If everything is deterministic, it might still make sense to say that G() depends on F(), and even that the truth of G()>14 depends on F(). And also that it's F that determines F(), and therefore that it's F that determines the truth value of G()>14.

(I think there is some equivocation about beliefs vs. decisions in the post, but it doesn't seem essential to the core puzzle it's bringing up. A decision is distinct from a belief about that decision, and if you are making decisions because of beliefs about those decisions, you run into Löbian traps, so it's not a good way of thinking about the role of beliefs about decisions.)

Are agent-action-dependent beliefs underdetermined by external reality?
Vladimir_Nesov · 2d

Embedded agency and beliefs also lead to some kind of embedded truth values, with all the strangeness of depending on possible values of deterministic things. Consider some complicated computation F that would compute a particular value F(), such as 2 or 5, but we don't yet know which one. And consider a different computation G=F+10. Does the value G() of G depend on F()? Well, F() is some particular number; what does it mean to depend on it? And we don't even know that number. But there is still some sense in which G() depends on F(), even though G() is also just some particular number. It doesn't make much sense to claim that 15 depends on 5, outside the context of the computations we are talking about.

Now, we can also consider the truth value of the claim that G()>14. Does the truth value of that claim depend on F()? It seems that it does in some sense, in the same way that G() depends on F(): the claim G()>14 is true iff the claim F()>4 is true, and the truth value of F()>4 depends on the value of F(). Even though F() evaluates to some particular number such as 5, which brings about a different sense in which G() (which is just 15) and G()>14 (which is just true) don't depend on F().
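
(A toy illustration of that sense of dependence; this concrete framing is mine, not from the post.)

```python
def F() -> int:
    # stand-in for a complicated deterministic computation; suppose it outputs 5
    return 5

def G() -> int:
    # G is defined in terms of F, so G() depends on F() in the relevant sense,
    # even though both evaluate to particular numbers (here 15 and 5)
    return F() + 10

# The claim "G() > 14" reduces to the claim "F() > 4": the two predicates agree
# for every value F might have output, not just the one it actually outputs.
assert all((f + 10 > 14) == (f > 4) for f in range(-100, 100))
print(G(), G() > 14, F() > 4)  # 15 True True
```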

Comment on "Four Layers of Intellectual Conversation"
Vladimir_Nesov · 2d

(The first link in this paragraph is to an archive of the Rationalist Conspiracy post.)

(Using a long link (from the "share" menu) both helps clarify the nature of the link and guards against the archive itself going down, so that the same URL can then be manually looked up somewhere else.)

Vladimir_Nesov's Shortform
Vladimir_Nesov · 2d

The Mamba paper uses a relevant kind of methodology: it directly compares different algorithmic ingredients in the same setting, training on a fixed dataset and measuring perplexity (note that it doesn't try MoE, so the actual total improvement is greater). It's a way of directly comparing cumulative improvement over all that time. To impact future frontier capabilities, an algorithmic ingredient from the past needs to be both applicable to the future frontier models and helpful on benchmarks relevant to those frontier models, compared to the counterfactual where the frontier model doesn't use the ingredient.

When an ingredient stops being applicable to the frontier model, or stops being relevant to what's currently important about its capabilities, it's no longer compounding towards frontier capabilities. It wouldn't matter that the same ingredient helps a different contemporary non-frontier small model match a much older model with much less compute, or that it helps the frontier model do much better than an older model on a benchmark that used to matter then but doesn't matter now.

So I'm skeptical of the Epoch paper's overall framing and its willingness to compare everything against everything indirectly; that's a lot of the point I'm making. You mostly can't use methods from 2014 and frontier AI compute from 2025 to train something directly comparable to a lightweight version of a 2025 frontier model trained on less compute (but still compute-optimally), compared in a way that matters in 2025. So what does it mean that there is such-and-such a compute multiplier across all of this time? At least for Transformer recipes, there is a possibility of comparing them directly if training converges.

Also, if we are not even aiming for Chinchilla-optimal training runs, what are we even comparing? For older algorithmic ingredients, you still need to aim for compute optimality to extract a meaningful compute multiplier, even if in the time of those older methods people didn't try to do that, or did it incorrectly. In terms of this comment's framing, compute multipliers measured with good methodology for Chinchilla-optimal training are a "benchmark" that's currently relevant. So even if this benchmark wasn't appreciated or known back then, it's still the thing to use to estimate the cumulative impact of the older algorithmic improvements in a way that is relevant now, and so in a way that's analogous to what would be relevant for forecasting future frontier capabilities.
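
(A minimal sketch of what extracting such a compute multiplier could look like, assuming both recipes are trained compute-optimally at several budgets and their loss-vs-compute curves are roughly power laws; this illustrates the methodology, not either paper's exact procedure.)

```python
import numpy as np

def fit_loss_curve(compute, loss):
    # fit loss ≈ a * compute^(-b) in log-log space
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return np.exp(intercept), -slope  # (a, b)

def compute_multiplier(old_curve, new_curve, target_loss):
    # compute each recipe needs to reach target_loss, from loss = a * C^(-b)
    (a_old, b_old), (a_new, b_new) = old_curve, new_curve
    c_old = (a_old / target_loss) ** (1 / b_old)
    c_new = (a_new / target_loss) ** (1 / b_new)
    return c_old / c_new  # >1 means the newer recipe is more compute-efficient

# usage: fit both recipes on their Chinchilla-optimal runs, then compare
# old = fit_loss_curve(compute_old_runs, loss_old_runs)
# new = fit_loss_curve(compute_new_runs, loss_new_runs)
# multiplier = compute_multiplier(old, new, target_loss=2.0)
```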

As another example, now that pretraining-scale RLVR might soon become important, it's less clear that Chinchilla optimality will remain relevant going forward, and so less clear that the contributions of algorithmic improvements which helped perplexity in Chinchilla-optimal settings will keep contributing to future frontier capabilities. If most relevant capabilities end up being learned with RLVR "directly", then it might become less important how well pretraining works, even if it remains necessary for bootstrapping the process. And the kinds of things that RLVR trains will likely not help with perplexity in any reasonable setting, so measurements of perplexity will stop being a relevant benchmark.

Posts

Musings on AI Companies of 2025-2026 (Jun 2025) · 1mo
Levels of Doom: Eutopia, Disempowerment, Extinction · 1mo
Slowdown After 2028: Compute, RLVR Uncertainty, MoE Data Wall · 3mo
Short Timelines Don't Devalue Long Horizon Research · 3mo
Technical Claims · 4mo
What o3 Becomes by 2028 · 7mo
Musings on Text Data Wall (Oct 2024) · 9mo
Vladimir_Nesov's Shortform · 9mo
Superintelligence Can't Solve the Problem of Deciding What You'll Do · 10mo
OpenAI o1, Llama 4, and AlphaZero of LLMs · 10mo
Wikitag Contributions

Quantilization · 2y · (+13/-12)
Bayesianism · 2y · (+1/-2)
Bayesianism · 2y · (+7/-9)
Embedded Agency · 3y · (-630)
Conservation of Expected Evidence · 4y · (+21/-31)
Conservation of Expected Evidence · 4y · (+47/-47)
Ivermectin (drug) · 4y · (+5/-4)
Correspondence Bias · 4y · (+35/-36)
Illusion of Transparency · 4y · (+5/-6)
Incentives · 4y · (+6/-6)