some people desperately, desperately want LLMs to be a bigger deal than what they are.
A larger number of people, I think, desperately, desperately want LLMs to be a smaller deal than what they are.
The more mainstream you go, the larger this effect gets. A lot of people seemingly want AI to be a nothingburger.
When LLMs emerged, in mainstream circles, you'd see people go "it's not important, it's not actually intelligent, you can see it make the kind of reasoning mistakes a 3 year old would".
Meanwhile, on LessWrong: "holy shit, this is a big fucking deal, because it's already making the same kind of reasoning mistakes a human three year old would!"
I'd say that LessWrong is far better calibrated.
People who weren't familiar with programming or AI didn't have a grasp of how hard natural language processing or commonsense reasoning used to be for machines. Nor do they grasp the implications of scaling laws.
Meanwhile, on LessWrong: "holy shit, this is a big fucking deal, because it's already making the same kind of reasoning mistakes a human three year old would!"
FWIW, that was me in 2022, looking at GPT-3.5 and being unable to imagine a way capabilities could progress from there that didn't immediately hit ASI. (I don't think I ever cared about benchmarks. Brilliant humans can't necessarily ace math exams, so why would I gatekeep the AGI term behind that?)
Now it's two-and-a-half years later and I no longer see it. As far as I'm concerned, this paradigm harnessed most of its general-reasoning potential at 3.5 and is now asymptoting out around something. I don't know what this something is, but it doesn't seem to be "AGI".
All "improvement" since then has just been window dressing; the models learning to convincingly babble about ever-more-sophisticated abstractions and solve ever-more-complicated math/coding puzzles that make their capabilities legible to ever-broader categories of people. But it's not anything GPT-3.5 wasn't already fundamentally capable of; and GPT-3.5 was not capable of taking off, and there's been no new fundamental capability advances since then.
(I remember dreading ...
Yup, the situation is somewhat symmetrical here; see also the discussion regarding which side is doing the sailing-against-the-winds-of-evidence.
My "tiebreaker" there is direct empirical evidence from working with LLMs, including attempts to replicate the most impressive and concerning claims about them. So far, this source of evidence has left me thoroughly underwhelmed.
Noting for the sake of later evaluation: this rough picture matches my current median expectations. Not very high confidence; I'd give it roughly 60%.
I give it ~70%, with some caveats:
"Maybe a slight tweak to the LLM architecture, maybe a completely novel neurosymbolic approach."
It won't be neurosymbolic.
Also I don't see where the 2030 number is coming from. At this point my uncertainty is almost in the exponent again. Seems like decades is plausible (maybe <50% though).
It's not clear that only one breakthrough is necessary.
Without an intelligence explosion, it's around 2030 that scaling through increasing funding runs out of steam and slows down to the speed of chip improvement. This slowdown happens around the same time (maybe 2028-2034) even with a lot more commercial success (if that success precedes the slowdown), because scaling faster takes exponentially more money. So there's more probability density of transformative advances before ~2030 than after, to the extent that scaling contributes to this probability.
That's my reason to see 2030 as a meaningful threshold; Thane Ruthenis might be pointing to it for different reasons. It seems like it should certainly be salient for AGI companies, so a long-timelines argument might want to address their narrative up to 2030 as a distinct case.
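To make the shape of this argument concrete, here is a toy calculation. The growth rates are my own illustrative assumptions (roughly 4x/year compute growth while funding keeps scaling, falling back to ~1.35x/year from hardware improvement alone afterwards), not figures from the comment:

```python
# Toy model: funding-driven scaling through ~2030, chip-improvement pace afterwards.
# Both growth rates are illustrative assumptions, not measured values.
spend_growth, hw_growth = 4.0, 1.35  # assumed annual compute multipliers
compute = 1.0                        # frontier training compute, arbitrary units (2025 = 1)
for year in range(2025, 2036):
    print(year, f"{compute:,.0f}x")
    compute *= spend_growth if year < 2030 else hw_growth
```

Under these assumptions, roughly three orders of magnitude of scaling land before 2030 and only a handful of further doublings arrive in the following five years, which is the asymmetry the comment is pointing at.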
"Maybe a slight tweak to the LLM architecture, maybe a completely novel neurosymbolic approach."
I think you might be underestimating the power of incremental, evolutionary improvements, where near-term problems are constantly solved and capabilities gradually accumulate over time. After all, human intelligence is the result of gradual evolutionary change and increasing capabilities over time. It's hard to point to a specific period in history where humans achieved general intelligence.
Currently LLMs are undoubtedly capable at many tasks (e.g. coding, general knowledge) and much more capable than their predecessors. But it's hard to point at any particular algorithmic improvement or model and say that it was key to the success of modern LLMs.
So I think it's possible that we'll see more gradual progress and tweaks on LLMs that lead towards increasingly capable models and eventually yield AGI. Eventually you could call this progress a new architecture even though all the progress is gradual.
I don't think that's how it works. Local change accumulating into qualitative improvements over time is a property of continuous(-ish) search processes, such as gradient descent and, indeed, evolution.
Human technological progress is instead a discrete-search process. We didn't invent the airplane by incrementally iterating on carriages; we didn't invent the nuclear bomb by tinkering with TNT.
The core difference between discrete and continuous search is that, for continuous search, there must be some sort of "general-purpose substrate" such that (1) any given object in the search-space can be defined as some parametrization of this substrate, and (2) this substrate then allows a way to plot a continuous path between any two objects such that all intermediate objects are also useful. For example:
A continuous manifold of possible technologies is not required for continuous progress. All that is needed is for there to be many possible sources of improvements that can accumulate, and for these improvements to be small once low-hanging fruit is exhausted.
Case in point: the nanogpt speedrun, where the training time of a small LLM was reduced by 15x using 21 distinct innovations which touched basically every part of the model, including the optimizer, embeddings, attention, other architectural details, quantization, hyperparameters, code optimizations, and the PyTorch version.
Most technologies are like this, and frontier AI has even more sources of improvement than the nanogpt speedrun, because you can also change the training data and hardware. It's not impossible that there's a moment in AI like the invention of lasers or the telegraph, but this doesn't happen with most technologies, and the fact that we have scaling laws somewhat points towards continuity, even as other things, like small differences being amplified in downstream metrics, point to discontinuity. Also see my comment here on a similar topic.
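As a rough illustration of how many small improvements compound, using the speedrun's headline numbers (the equal-contribution split per innovation is my own simplifying assumption):

```python
# If 21 roughly equal multiplicative improvements compound to a 15x speedup,
# each one only needs to be ~14% -- no single step has to be a breakthrough.
total_speedup, n_innovations = 15.0, 21
per_step = total_speedup ** (1 / n_innovations)
print(f"average gain per innovation: {per_step:.3f}x (~{(per_step - 1) * 100:.0f}%)")
print(f"compounded: {per_step ** n_innovations:.1f}x")
```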
If you think generalization is limited in the current regime, try to create AGI...
There were no continuous language model scaling laws before the transformer architecture
https://arxiv.org/abs/1712.00409 was technically published half a year after transformers, but it shows power-law language model scaling laws for LSTMs (several years before the Kaplan et al. paper, and without citing the transformer paper). It's possible that transformer scaling laws are much better, I haven't checked (and perhaps more importantly, transformer training lets you parallelize across tokens), just mentioning this because it seems relevant for the overall discussion of continuity in research.
I also agree with Thomas Kwa's sibling comment that transformers weren't a single huge step. Fully-connected neural networks seem like a very strange comparison to make, I think the interesting question is whether transformers were a sudden single step relative to LSTMs. But I'd disagree even with that: Attention was introduced three years before transformers and was a big deal for machine translation. Self-attention was introduced somewhere between the first attention papers and transformers. And the transformer paper itself isn't atomic, it consists of multiple ideas—replacing RNNs/LSTMs with ...
Though the fully-connected -> transformer transition wasn't infinitely many small steps, it definitely wasn't a single step either. We had to invent various sub-innovations like skip connections separately, progressing from RNNs to LSTMs to GPT/BERT-style transformers to today's transformer++. The most you could claim as a single step is LSTM -> transformer.
Also if you graph perplexity over time, there's basically no discontinuity from introducing transformers, just a possible change in slope that might be an artifact of switching from the purple to green measurement method. The story looks more like transformers being more able to utilize the exponentially increasing amounts of compute that people started using just before its introduction, which caused people to invest more in compute and other improvements over the next 8 years.
We could get another single big architectural innovation that gives better returns to more compute, but I'd give a 50-50 chance that it would be only a slope change, not a discontinuity. Even conditional on discontinuity it might be pretty small. Personally my timelines are also short enough that there is limited time for this to happen before we get AGI.
I think we have two separate claims here:
I agree with your position on (2) here. But it seems like the claim in the post that sometime in the 2030s someone will make a single important architectural innovation that leads to takeover within a year mostly depends on (1), as it would require progress within that year to be comparable to all the progress from now until that year. Also you said the architectural innovation might be a slight tweak to the LLM architecture, which would mean it shares the same lineage.
The history of machine learning seems pretty continuous wrt advance prediction. In the Epoch graph, the line fit on loss of the best LSTM up to 2016 sees a slope change of less than 2x, whereas a hypothetical innovation that causes takeover within a year with not much progress in the intervening 8 years would be ~8x. So it seems more likely to me (conditional on 2033 timelines and a big innovation) that we get some architectural innovation which has a moderately different l...
I'm not sure a raw-compute (as opposed to effective-compute) GPT-6 (10,000x GPT-4) by 2029 is plausible (without new commercial breakthroughs). Nvidia Rubin is 2026-2027 (models trained on it 2027-2029), so a 2029 model plausibly uses the next architecture after it (though then it's more likely to come out in early 2030, not 2029). Let's say it's 1e16 FLOP/s per chip (BF16, 4x B200) with a time cost of $4/hour (2x H100); that is $55bn to train for 2e29 FLOPs, and 3M chips in the training system if it needs 6 months at 40% utilization (reinforcing the point that 2030 is a more plausible timing: 3M chips is a lot to manufacture). Training systems with H100s cost $50K per chip all-in to build (~BOM, not TCO), so assuming it's 2x more for the after-Rubin chips, the training system costs $300bn to build. Also, a Blackwell chip needs 2 kW all-in (a per-chip fraction of the whole datacenter), so the after-Rubin chip might need 4 kW, and 3M chips need 12 GW.
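A quick sanity check of this arithmetic, simply restating the comment's own assumptions in code (the comment rounds to $55bn, 3M chips, $300bn, and 12 GW):

```python
# Re-deriving the comment's figures from its stated assumptions.
flops_target = 2e29       # raw training compute, FLOPs
flops_per_chip = 1e16     # per-chip throughput, FLOP/s (BF16)
utilization = 0.40
price_per_hour = 4.0      # $ per chip-hour
hours = 6 * 30 * 24       # ~6 months of training

chip_hours = flops_target / (flops_per_chip * utilization) / 3600
chips = chip_hours / hours
print(f"training cost: ${chip_hours * price_per_hour / 1e9:.0f}bn")  # ~$56bn  (comment: $55bn)
print(f"chips needed:  {chips / 1e6:.1f}M")                          # ~3.2M   (comment: 3M)
print(f"system capex:  ${chips * 100_000 / 1e9:.0f}bn")              # ~$320bn (comment: $300bn, at 2x the $50K/chip H100 figure)
print(f"power draw:    {chips * 4 / 1e6:.0f} GW")                    # ~13 GW  (comment: 12 GW, at 4 kW/chip)
```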
These numbers need to match the scale of the largest AI companies. A training system ($300bn in capital, 3M of the newest chips) needs to be concentrated in the hands of a single company, probably purpose-built. And then at least $55bn of its time ne...
I agree with almost everything you've said about LLMs.
I still think we're getting human-level AGI soonish. The LLM part doesn't need to be any better than it is.
A human genius with no one-shot memory (severe anterograde amnesia) and very poor executive function (ability to stay on task and organize their thinking) would be almost useless - just like LLMs are.
LLMs replicate only part of humans' general intelligence. It's the biggest part, but it just wouldn't work very well without the other contributing brain systems. Human intelligence, and its generality (in particular, our ability to solve truly novel problems), is an emergent property of interactions among multiple brain systems (or a complex property, if you don't like that term).
See Capabilities and alignment of LLM cognitive architectures
In brief, LLMs are like a human posterior cortex. A human with only a posterior cortex would be about as little use as an LLM (of course this analogy is imperfect but it's close). We need a prefrontal cortex (for staying on task, "executive function"), a medial temporal cortex and hippocampus for one-shot learning, and a basal ganglia for making better decisions than just whatever first comes t...
- It will not meaningfully generalize beyond domains with easy verification. Some trickery like RLAIF and longer CoTs might provide some benefits, but they would be a fixed-size improvement. It will not cause a hard-takeoff self-improvement loop in "soft" domains.
- RL will be good enough to turn LLMs into reliable tools for some fixed environments/tasks. They will reliably fall flat on their faces if moved outside those environments/tasks.
I'm particularly interested in whether such systems will be able to basically 'solve software engineering' in the next few years. I'm not sure if you agree or disagree. I think the answer is probably yes.
Great. So yeah, it seems we are zeroing in on a double crux between us. We both think general-purpose long-horizon agency (my term) / staying-on-track-across-large-inferential-distances (your term, maybe not equivalent to mine but at least heavily correlated with mine?) is the key dimension AIs need to progress along.
My position is that (probably) they have in fact been progressing along this dimension over the past few years and will continue to do so, especially as RL environments get scaled up (lots of diverse RL environments should produce transfer learning / general-purpose agency skills); your position is that (probably) they haven't been making much progress, and at any rate will probably not make much progress in the next few years.
Correct?
FWIW I'm also bearish on LLMs, but for reasons that are maybe subtly different from OP's. I tend to frame the issue in terms of "inability to deal with a lot of interconnected, layered complexity in the context window", which comes up when there are a lot of idiosyncratic, interconnected ideas in one's situation, or knowledge that does not exist on the internet.
This issue incidentally comes up in “long-horizon agency”, because if e.g. you want to build some new system or company or whatever, you usually wind up with a ton of interconnected idiosyncratic “cached” ideas about what you’re doing and how, and who’s who, and what’s what, and what are the idiosyncratic constraints and properties and dependencies in my specific software architecture, etc. The more such interconnected bits of knowledge that I need for what I’m doing—knowledge which is by definition not on the internet, and thus must be in the context window instead—the more I expect foundation models to struggle on those tasks, now and forever.
But that problem is not exactly the same as a problem with long-horizon agency per se. I would not be too surprised or updated by seeing “long-horizon agency” in situations where, every step ...
We both think general-purpose long-horizon agency (my term) / staying-on-track-across-large-inferential-distances (your term, maybe not equivalent to mine but at least heavily correlated with mine?)
They're equivalent, I think. "Staying on track across inferential distances" is a phrasing I use to convey a more gears-level mental picture of what I think is going on, but I'd term the external behavior associated with it "general-purpose long-horizon agency" as well.
Correct?
Basically, yes. I do expect some transfer learning from RL, but I expect it'd still only lead to a "fixed-horizon" agency, and may end up more brittle than people hope.
To draw an analogy: Intuitively, I would've expected reasoning models to grok some simple compact "reasoning algorithm" that would've let them productively reason for arbitrarily long. Instead, they seem to end up with a fixed "reasoning horizon", and scaling o1 -> o3 -> ... is required to extend it.
I expect the same of "agent" models. With more training, they'd be able to operate on ever-so-slightly longer horizons. But extending the horizon would require steeply (exponentially?) growing amounts of compute, and the models would never quite grok the "compact generator" of arbitrary-horizon arbitrary-domain agency.
By "Solve" I mean "Can substitute for a really good software engineer and/or ML research engineer" in frontier AI company R&D processes. So e.g. instead of having teams of engineers led by a scientist, they can (if they choose) have teams of AIs led by a scientist.
Writing down these predictions ahead of time is already very virtuous, but I think it'd be better with probability estimates for the claims.
This fits my bear-picture fairly well.
Here's some details of my bull-picture:
The human brain holds 200-300 trillion synapses. A 1:32 sparse MoE at high compute will need about 350 tokens/parameter to be compute-optimal[1]. This gives 8T active parameters (at 250T total), 2,700T training tokens, and 2e29 FLOPs (a raw-compute GPT-6 that needs a $300bn training system with 2029 hardware).
There won't be enough natural text data to train it with, even when training for many epochs. The human brain clearly doesn't train primarily on external data (humans blind from birth still gain human intelligence), so there exists some kind of method for generating much more synthetic data from a little bit of external data.
I'm combining the 6x lower-than-dense data efficiency of a 1:32 sparse MoE from a Jan 2025 paper with the 1.5x-per-1000x-compute decrease in data efficiency from Llama 3's compute-optimal scaling experiments, anchoring to Llama 3's 40 tokens/parameter for a dense model at 4e25 FLOPs. Thus 40x6x1.5, about 350. It's tokens per active parameter, not total. ↩︎
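The arithmetic above can be checked directly. A sketch using 40x6x1.5 = 360 tokens per active parameter (rather than the rounded 350) and the standard 6*N*D approximation for training FLOPs, with N taken as active parameters per the footnote:

```python
# Checking the estimate: 250T total params, 1:32 sparsity, ~360 tokens/active param.
total_params = 250e12
active_params = total_params / 32          # ~7.8T active
tokens_per_param = 40 * 6 * 1.5            # Llama 3 anchor x MoE penalty x scale penalty = 360
tokens = active_params * tokens_per_param  # ~2,800T tokens
flops = 6 * active_params * tokens         # standard 6*N*D rule of thumb
print(f"active params:   {active_params / 1e12:.1f}T")
print(f"training tokens: {tokens / 1e12:,.0f}T")
print(f"training FLOPs:  {flops:.1e}")     # ~1.3e29, i.e. roughly the 2e29 'GPT-6' scale cited above
```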
I enjoyed this post, which feels to me part of a cluster of recent posts pointing out that the current LLM architecture is showing some limitations, that future AI capabilities will likely be quite jagged (thus more complementary to human labor, rather than perfectly substituting for labor as a "drop-in remote worker"), and that a variety of skills around memory, long-term planning, agenticness, etc., seem like important bottlenecks.
(Some other posts in this category include this one about Claude's abysmal Pokemon skills, and the section called "What I suspect AI labs will struggle with in the near term" in this post from Epoch).
Much of this stuff seems right to me. The jaggedness of AI capabilities, in particular, seems like something that we should've spotted much sooner (indeed, it feels like we could've gotten most of the way just based on first-principles reasoning), but which has been obscured by the use of helpful abstractions like "AGI" / "human level AI", or even more rigorous formulations like "when X% of tasks in the economy have been automated".
I also agree that it's hard to envision AI transforming the world without a more coherent sense of agency / ability t...
Rough estimate based on how many new ideas seem to be needed and their estimated "size". I definitely don't see it taking, say, 50 years (without an international ban or some sort of global catastrophe).
Curated. Some more detailed predictions of the future, different from others, and one of the best bear cases I've read.
This feels a bit less timeless than many posts we curate but my guess is that (a) it'll be quite interesting to re-read this in 2 years, and (b) it makes sense to record good and detailed predictions like this more regularly in the field of AI which is moving so much faster than most of the rest of the world.
It seems good for me to list my predictions here. I don't feel very confident. I feel an overall sense of "I don't really see why major conceptual breakthroughs are necessary." (I agree we haven't seen, like, an AI do something like "discover actually significant novel insights.")
This doesn't translate into me being confident in very short timelines, because the remaining engineering work (and "non-major" conceptual progress) might take a while, or require a commitment of resources that won't materialize before a hype bubble pops.
But:
a) I don't see why nov...
Not to be a scaling-law denier. I believe in them, I do! But they measure perplexity, not general intelligence/real-world usefulness, and Goodhart's Law is no-one's ally.
If we're able to get perplexity sufficiently low on text samples that I write, then that means the LLM has a lot of the important algorithms running in it that are running in me. The text I write is causally downstream from parts of me that are reflective and self-improving, that notice the little details in my cognitive processes and environment, and the parts of me that are capable of...
Sure, but "sufficiently low" is doing a lot of work here. In practice, a "cheaper" way to decrease perplexity is to go for the breadth (memorizing random facts), not the depth. In the limit of perfect prediction, yes, GPT-N would have to have learned agency. But the actual LLM training loops may be a ruinously compute-inefficient way to approach that limit – and indeed, they seem to be.
My current impression is that the SGD just doesn't "want" to teach LLMs agency for some reason, and we're going to run out of compute/data long before it's forced to. It's possible that I'm wrong and base GPT-5/6 paperclips us, sure. But I think if that were going to happen, it would've happened at GPT-4 (indeed, IIRC that was what I'd dreaded from it).
I agree with most of this. One thing that widens my confidence interval to include pretty short-term windows for transformative/super AI is something you mostly point to as part of the bubble: the ongoing, insanely large societal investment -- in capital and labor -- into these systems. I agree one or more meaningful innovations beyond transformers + RL + inference-time tricks will be needed to break through to general-purpose long-horizon agency / staying-on-track-across-large-inferential-distances. But with SO much being put into finding those, it seem...
But the latter doesn't really require LLMs to be capable of end-to-end autonomous task execution, which is the property required for actual transformative consequences.
I'm glad we agree on which property is required (and I'd say basically sufficient at this point) for actual transformative consequences.
It will not meaningfully generalize beyond domains with easy verification.
I think most software engineering and mathematics problems (two key components of AI development) are easy to verify. I agree with some of your point about how long-term agency doesn't seem to be improving, but I expect that we can build very competent software engineers with the current paradigms.
After this, I expect AI progress to move noticeably faster. The problems you point out are real, but speeding up our development speed might make them surmountable in the near term.
RL-on-CoTs is only computationally tractable if the correct trajectories are already close to the "modal" trajectory.
Conclusions that should be impossible for a model at a given level of capability to see are still not far from the surface, as the Large Language Monkeys paper shows (Figure 3; see how well even Pythia-70M with an 'M' starts doing on MATH at pass@10K). So a collection of progressively more difficult verifiable questions can probably stretch whatever wisdom a model implicitly holds from pretraining implausibly far.
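For concreteness, pass@k numbers like these are typically computed with the unbiased estimator from the HumanEval paper; a minimal sketch (my code, with made-up example numbers), showing how even a tiny per-sample success rate becomes near-certainty at pass@10K:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: the chance that at least one of k samples
    (drawn from n total attempts, of which c were correct) solves the problem."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# A hypothetical problem solved in 10 of 10,000 attempts (0.1% per sample):
print(pass_at_k(n=10_000, c=10, k=1))       # ~0.001
print(pass_at_k(n=10_000, c=10, k=1_000))   # ~0.65
print(pass_at_k(n=10_000, c=10, k=10_000))  # 1.0
```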
Why do you think Anthropic and OpenAI are making such bold predictions? (https://x.com/kimmonismus/status/1897628497427701961)
As I see it, one of the following is true:
..."But the models feel increasingly smarter!":
- It seems to me that "vibe checks" for how smart a model feels are easily gameable by making it have a better personality.
- My guess is that it's most of the reason Sonnet 3.5.1 was so beloved. Its personality was made much more appealing, compared to e.g. OpenAI's corporate drones.
- The recent upgrade to GPT-4o seems to confirm this. They seem to have merely given it a better personality, and people were reporting that it "feels much smarter".
- Deep Research was this for me, at first. Some of its summaries were just p
My take is that the big algorithmic difference that explains a lot of weird LLM deficits, and plausibly explains the post's findings, is that current neural networks do not learn at run-time; their weights are frozen. This is a central reason why humans are able to outperform LLMs at longer tasks: humans have the ability to learn at run-time, as do a lot of other animals.
Unfortunately, this ability is generally lost gradually starting in your 20s, but still the existence of non-trivial learning at runtime is a huge explaine...
It will not meaningfully generalize beyond domains with easy verification
Why can't we make every domain have automated verification? (I won't claim easy, but easy enough to do with finite resources.) Agency, for instance, is verifiable in competitive games of arbitrary difficulty and scale. Just check who won. DeepMind already did this to some degree with language models and virtual agents a year ago. https://deepmind.google/discover/blog/sima-generalist-ai-agent-for-3d-virtual-environments/
Every other trait we care about is instrumental in agency to so...
This might seem like a ton of annoying nitpicking.
You don't need to apologize for having a less optimistic view of current AI development. I've never heard anyone driving the hype train apologize for their opinions.
Deep Research was this for me, at first. Some of its summaries were just pleasant to read, they felt so information-dense and intelligent! Not like typical AI slop at all! But then it turned out most of it was just AI slop underneath anyway,
Can you elaborate on what you mean by this? Do you mean it's hallucinating a ton underneath? Or that the writing is somehow bad? Or something else?
Could you describe the experiment you ran on all these models? Like 'if there are three boxes side by side in a line and each can hold one item and the red triangle is not in the middle and the blue circle is not in the box next to the box with a red triangle in it, where is the green circle?' ChatGPT was not able to solve logic puzzles a year ago and can do it now.
My biggest question as always is "what specific piece of evidence would make you change your mind"
Off the top of my head:
...I don't want to say the pretraining will "plateau", as such; I do expect continued progress. But the dimensions along which the progress happens are going to decouple from the intuitive "getting generally smarter" metric, and will face steep diminishing returns.
- Grok 3 and GPT-4.5 seem to confirm this.
- Grok 3's main claim to fame was "pretty good: it managed to dethrone Claude Sonnet 3.5.1 for some people!". That was damning with faint praise.
- GPT-4.5 is subtly better than GPT-4, particularly at writing/EQ. That's likewise a faint-praise damnation: it's not m
What do you think about the possibility of emerging collective behavior when AI agents will interact on the web in large numbers?
I enjoyed the post. The framework challenged some of my core assumptions about AI progress, particularly given the rapid acceleration we’ve seen in the past few months with OpenAI’s o3 and Deep Research, and Anthropic’s Claude Code. My mental model has been that rapid progress would continue, shortening AGI timelines—but your post makes me reconsider how much of that is genuine frontier expansion versus polish and UX improvements.
A few points where your arguments challenge my mental model and warrant further discussion:
Why specifically would you expect that RL on coding wouldn’t sufficiently advance the coding abilities of LLMs to significantly accelerate the search for a better learning algorithm or architecture?
- RL will be good enough to turn LLMs into reliable tools for some fixed environments/tasks. They will reliably fall flat on their faces if moved outside those environments/tasks.
They don't have to "move outside those tasks" if they can be JIT-trained for cheap. It is the outer system that requests and produces them that is general (or, one might say, "specialized in adaptation").
Your bear case is cogently argued, yet I find it way too tethered to a narrow view of LLMs as static tools bound by pretraining limits and jagged competencies.
The evidence suggests broader potential. LLMs already power real-world leaps, from biotech breakthroughs (e.g., Evo 2’s protein design) to multi-domain problem-solving in software and strategy, outpacing human baselines in constrained but scalable tasks. Your dismissal of test-time compute and CoT scaling overlooks how these amplify cross-domain reasoning, not just in-distribution wins.
Re...
I don’t agree that there is no conceivable path forward with current technology. This perspective seems too focused on the diminishing returns of base LLM models (e.g. 4.5 vs. 4). You brought up CoT and the limited reasoning window, but I could imagine this being solved pretty easily with some type of master/sub-task layering. I also believe some of those issues could in fact be solved with brute scale anyway. You also critique the newer models as “Frankenstein”, but I think OAI is right about that as an evolution. Basic models should have basic inputs and output functionali...
Oh no, OpenAI hasn’t been meaningfully advancing the frontier for a couple of months, scaling must be dead!
What is the easiest among problems you’re 95% confident AI won’t be able to solve by EOY 2025?
the set of problems the solutions to which are present in their training data
a.k.a. the set of problems already solved by open source libraries without the need to re-invent similar code?
It seems to me that "vibe checks" for how smart a model feels are easily gameable by making it have a better personality.
It's not clear to me that personality is completely separate from capabilities, especially with inference time reasoning.
Also, what do you mean by "bigger templates"?
I think RL on chain of thought will continue improving reasoning in LLMs. That opens the door to learning a wider and wider variety of tasks, as well as general strategies for generating hypotheses and making decisions. I think benchmarks could be just as likely to underestimate AI capabilities, due to not measuring the right things, under-elicitation, or poor scaffolding.
We generally see time horizons for models increasing over time. If long-term planning is a special form of reasoning, LLMs can do it a little sometimes, and we can give them examples and ...
It is cool, and I have believed something like this for a while. The problem is that Claude 3.5 invalidated all that: it does know how to program, understands stuff, and does at least 50% of the work for me. This was not at all the case for previous models.
And all those "LLMs would be just tools until 2030" arguments are not backed by anything and are based solely on vibes. People said the same about understanding of context, hallucinations, and other stuff. So far the only prediction that worked is that LLMs gain more common sense with scaling. And this is exactly what is needed to crack their agency.
I agree with this insofar as this has always been my default / 60% case
Selfishly I also hope this is how it plays out (for sake of my career)
I also believe that it is the mainstream view
But independently, I think there's a 20 to 30% chance that this is it, and the singularity hits very soon
And I have to be prepared for that
My perception of LLMs' evolution dynamics coincides with your description; it additionally brings to mind the bicameral mind theory (at least Julian Jaynes' timeline regarding language and human self-reflection, and the max height of man-made structures) as something that might be relevant for predicting the near future. I find the two dynamics kinda similar. Might we expect a comparatively long period of mindless babbling, followed by an abrupt phase shift (observed in the max complexity of man-made code structures, for example), and then the next slow phase (slower than the shift, but faster than the previous slow one)?
human-made innovative applications of the paradigm of automated continuous program search. Not AI models autonomously producing innovations.
Can we... you know, make an innovative application of the paradigm of automated continuous program search to find AI models that would autonomously produce innovations?
Frontier LLM performance on offline IQ tests is improving at perhaps 1 S.D. per year, and might have recently become even faster. These tests are a good measure of human general intelligence. One more such jump and there will be PhD-tier assistants for $20/month. At that point, I expect any lingering problems with invoking autonomy to be quickly fixed as human AI research acquires a vast multiplier through these assistants, and a few months later AI research becomes fully automated.
This isn't really a "timeline", as such – I don't know the timings – but this is my current, fairly optimistic take on where we're heading.
I'm not fully committed to this model yet: I'm still on the lookout for more agents and inference-time scaling later this year. But Deep Research, Claude 3.7, Claude Code, Grok 3, and GPT-4.5 have turned out largely in line with these expectations[1], and this is my current baseline prediction.
The Current Paradigm: I'm Tucking In to Sleep
I expect that none of the currently known avenues of capability advancement are sufficient to get us to AGI[2].
Case study: Prior to looking at METR's benchmark, I'd expected that it's also (unintentionally!) doing some shenanigans that mean it's not actually measuring LLMs' real-world problem-solving skills. Maybe the problems were secretly in the training data, or there was a selection effect towards simplicity, or the prompts strongly hinted at what the models are supposed to do, or the environment was set up in an unrealistically "clean" way that minimizes room for error and makes solving the problem correctly the path of least resistance (in contrast to messy real-world realities), et cetera.
As it turned out, yes, it's that last one: see the "systematic differences from the real world" here. Consider what this means in the light of the previous discussion about inferential distances/complexity-from-messiness.
As I'd said, I'm not 100% sure of that model. Further advancements might surprise me, there's an explicit carve-out for ??? consequences if math is solved, etc.
But the above is my baseline prediction, at this point, and I expect the probability mass on other models to evaporate by this year's end.
Real-World Predictions
Closing Thoughts
This might seem like a ton of annoying nitpicking. Here's a simple generator of all of the above observations: some people desperately, desperately want LLMs to be a bigger deal than what they are.
They are not evaluating the empirical evidence in front of their eyes with proper precision.[6] Instead, they're vibing, and spending 24/7 inventing contrived ways to fool themselves and/or others.
They often succeed. They will continue doing this for a long time to come.
We, on the other hand, desperately do not want LLMs to be AGI-complete. Since we try to avoid motivated thinking, to avoid deluding ourselves into believing in happier realities, we err on the side of pessimistic interpretations. In this hostile epistemic environment, that effectively leads to us being overly gullible and prone to buying into hype.
Indeed, this environment is essentially optimized for exploiting the virtue of lightness. LLMs are masters at creating the vibe of being generally intelligent. Tons of people are cooperating, playing this vibe up, making tons of subtly-yet-crucially flawed demonstrations. Trying to see through this immense storm of bullshit very much feels like "fighting a rearguard retreat against the evidence".[7]
But this isn't what's happening, in my opinion. On the contrary: it's the LLM believers who are sailing against the winds of evidence.
If LLMs were actually as powerful as they're hyped up to be, there wouldn't be the need for all of these attempts at handholding.
Ever more contrived agency scaffolds that yield ~no improvement. Increasingly more costly RL training procedures that fail to generalize. Hail-mary ideas regarding how to fix that generalization issue. Galaxy-brained ways to elicit knowledge out of LLMs that produce nothing of value. The need for all of this is strong evidence that there's no seed of true autonomy/agency/generality within LLMs. If there were, the most naïve AutoGPT setup circa early 2023 would've elicited it.
People are extending LLMs a hand, hoping to pull them up to our level. But there's nothing reaching back.
And none of the current incremental-scaling approaches will fix the issue. They will increasingly mask it, and some of this masking may be powerful enough to have real-world consequences. But any attempts at the Singularity based on LLMs will stumble well before takeoff.
Thus, I expect AGI labs' AGI timelines have ~nothing to do with what will actually happen. On average, we likely have more time than the AGI labs say. Pretty likely that we have until 2030, maybe well into the 2030s.
By default, we likely don't have much longer than that. Incremental scaling of known LLM-based stuff won't get us there, but I don't think the remaining qualitative insights are many. 5-15 years, at a rough guess.
For prudence's sake: GPT-4.5 has slightly overshot these expectations.
If you are really insistent on calling the current crop of SOTA models "AGI", replace this with "autonomous AI" or "transformative AI" or "innovative AI" or "the transcendental trajectory" or something.
Will o4 really come out on schedule in ~2 weeks, showcasing yet another dramatic jump in mathematical capabilities, just in time to rescue OpenAI from the GPT-4.5 semi-flop? I'll be waiting.
This metaphor/toy model has been adapted from @Cole Wyeth.
Pretty sure Deep Research could not in fact "do a single-digit percentage of all economically valuable tasks in the world", except in the caveat-laden sense where you still have a human expert double-checking and rewriting its outputs. And in my personal experience, on the topics at which I am an expert, it would be easier to write the report from scratch than to rewrite DR's output.
It's a useful way to get a high-level overview of some topics, yes. It blows Google out of the water at being Google, and then some. But I don't think it's a 1-to-1 replacement for any extant form of human labor. Rather, it's a useful zero-to-one thing.
See all the superficially promising "AI innovators" from the previous section, which turn out to be false advertising on a closer look. Or the whole "10x'd programmer productivity" debacle.
Indeed, even now, having written all of this, I have nagging doubts that this might be what I'm actually doing here. I will probably keep having those doubts until this whole thing ends, one way or another. It's not pleasant.