Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Short Summary

LLMs may be fundamentally incapable of fully general reasoning, and if so, short timelines are less plausible.

Longer summary

There is ML research suggesting that LLMs fail badly on attempts at general reasoning, such as planning problems, scheduling, and attempts to solve novel visual puzzles. This post provides a brief introduction to that research, and asks:

  • Whether this limitation is illusory or actually exists.
  • If it exists, whether it will be solved by scaling or is a problem fundamental to LLMs.
  • If fundamental, whether it can be overcome by scaffolding & tooling.

If this is a real and fundamental limitation that can't be fully overcome by scaffolding, we should be skeptical of arguments like Leopold Aschenbrenner's (in his recent 'Situational Awareness') that we can just 'follow straight lines on graphs' and expect AGI in the next few years.


Leopold Aschenbrenner's recent 'Situational Awareness' document has gotten considerable attention in the safety & alignment community. Aschenbrenner argues that we should expect current systems to reach human-level given further scaling[1], and that it's 'strikingly plausible' that we'll see 'drop-in remote workers' capable of doing the work of an AI researcher or engineer by 2027. Others hold similar views.

Francois Chollet and Mike Knoop's new $500,000 prize for beating the ARC benchmark has also gotten considerable recent attention in AIS[2]. Chollet holds a diametrically opposed view: that the current LLM approach is fundamentally incapable of general reasoning, and hence incapable of solving novel problems. We only imagine that LLMs can reason, Chollet argues, because they've seen such a vast wealth of problems that they can pattern-match against. But LLMs, even if scaled much further, will never be able to do the work of AI researchers.

It would be quite valuable to have a thorough analysis of this question through the lens of AI safety and alignment. This post is not that[3], nor is it a review of the voluminous literature on this debate (from outside the AIS community). It attempts to briefly introduce the disagreement, some evidence on each side, and the impact on timelines.

What is general reasoning?

Part of what makes this issue contentious is that there's not a widely shared definition of 'general reasoning', and in fact various discussions of this use various terms. By 'general reasoning', I mean to capture two things. First, the ability to think carefully and precisely, step by step. Second, the ability to apply that sort of thinking in novel situations[4].

Terminology is inconsistent between authors on this subject; some call this 'system II thinking'; some 'reasoning'; some 'planning' (mainly for the first half of the definition); Chollet just talks about 'intelligence' (mainly for the second half).

This issue is further complicated by the fact that humans aren't fully general reasoners without tool support either. For example, seven-dimensional tic-tac-toe is a simple and easily defined system, but incredibly difficult for humans to play mentally without extensive training and/or tool support. Generalizations that are in-distribution for humans seems like something that any system should be able to do; generalizations that are out-of-distribution for humans don't feel as though they ought to count.

How general are LLMs?

It's important to clarify that this is very much a matter of degree. Nearly everyone was surprised by the degree to which the last generation of state-of-the-art LLMs like GPT-3 generalized; for example, no one I know of predicted that LLMs trained on primarily English-language sources would be able to do translation between languages. Some in the field argued as recently as 2020 that no pure LLM would ever able to correctly complete Three plus five equals. The question is how general they are.

Certainly state-of-the-art LLMs do an enormous number of tasks that, from a user perspective, count as general reasoning. They can handle plenty of mathematical and scientific problems; they can write decent code; they can certainly hold coherent conversations.; they can answer many counterfactual questions; they even predict Supreme Court decisions pretty well. What are we even talking about when we question how general they are?

The surprising thing we find when we look carefully is that they fail pretty badly when we ask them to do certain sorts of reasoning tasks, such as planning problems, that would be fairly straightforward for humans. If in fact they were capable of general reasoning, we wouldn't expect these sorts of problems to present a challenge. Therefore it may be that all their apparent successes at reasoning tasks are in fact simple extensions of examples they've seen in their truly vast corpus of training data. It's hard to internalize just how many examples they've actually seen; one way to think about it is that they've absorbed nearly all of human knowledge.

The weakman version of this argument is the Stochastic Parrot claim, that LLMs are executing relatively shallow statistical inference on an extremely complex training distribution, ie that they're "a blurry JPEG of the web" (Ted Chiang). This view seems obviously false at this point (given that, for example, LLMs appear to build world models), but assuming that LLMs are fully general may be an overcorrection.

Note that this is different from the (also very interesting) question of what LLMs, or the transformer architecture, are capable of accomplishing in a single forward pass. Here we're talking about what they can do under typical auto-regressive conditions like chat.

Evidence for generality

I take this to be most people's default view, and won't spend much time making the case. GPT-4 and Claude 3 Opus seem obviously be capable of general reasoning. You can find places where they hallucinate, but it's relatively hard to find cases in most people's day-to-day use where their reasoning is just wrong. But if you want to see the case made explicitly, see for example "Sparks of AGI" (from Microsoft, on GPT-4) or recent models' performance on benchmarks like MATH which are intended to judge reasoning ability.

Further, there's been a recurring pattern (eg in much of Gary Marcus's writing) of people claiming that LLMs can never do X, only to be promptly proven wrong when the next version comes out. By default we should probably be skeptical of such claims.

One other thing worth noting is that we know from 'The Expressive Power of Transformers with Chain of Thought' that the transformer architecture is capable of general reasoning under autoregressive conditions. That doesn't mean LLMs trained on next-token prediction learn general reasoning, but it means that we can't just rule it out as impossible.

Evidence against generality

The literature here is quite extensive, and I haven't reviewed it all. Here are three examples that I personally find most compelling. For a broader and deeper review, see "A Survey of Reasoning with Foundation Models".

Block world

All LLMs to date fail rather badly at classic problems of rearranging colored blocks. We do see improvement with scale here, but if these problems are obfuscated, performance of even the biggest LLMs drops to almost nothing[5].


LLMs currently do badly at planning trips or scheduling meetings between people with availability constraints [a commenter points out that this paper has quite a few errors, so it should likely be treated with skepticism].


Current LLMs do quite badly on the ARC visual puzzles, which are reasonably easy for smart humans.

Will scaling solve this problem?

The evidence on this is somewhat mixed. Evidence that it will includes LLMs doing better on many of these tasks as they scale. The strongest evidence that it won't is that LLMs still fail miserably on block world problems once you obfuscate the problems (to eliminate the possibility that larger LLMs only do better because they have a larger set of examples to draw from)[5].

One argument made by Sholto Douglas and Trenton Bricken (in a discussion with Dwarkesh Patel) is that this is a simple matter of reliability -- given a 5% failure rate, an AI will most often fail to successfully execute a task that requires 15 correct steps. If that's the case, we have every reason to believe that further scaling will solve the problem.

Will scaffolding or tooling solve this problem?

This is another open question. It seems natural to expect that LLMs could be used as part of scaffolded systems that include other tools optimized for handling general reasoning (eg classic planners like STRIPS), or LLMs can be given access to tools (eg code sandboxes) that they can use to overcome these problems. Ryan Greenblatt's new work on getting very good results on ARC with GPT-4o + a Python interpreter provides some evidence for this.

On the other hand, a year ago many expected scaffolds like AutoGPT and BabyAGI to result in effective LLM-based agents, and many startups have been pushing in that direction; so far results have been underwhelming. Difficulty with planning and novelty seems like the most plausible explanation.

Even if tooling is sufficient to overcome this problem, outcomes depend heavily on the level of integration and performance. Currently for an LLM to make use of a tool, it has to use a substantial number of forward passes to describe the call to the tool, wait for the tool to execute, and then parse the response. If this remains true, then it puts substantial constraints on how heavily LLMs can rely on tools without being too slow to be useful[6]. If, on the other hand, such tools can be more deeply integrated, this may no longer apply. Of course, even if it's slow there are some problems where it's worth spending a large amount of time, eg novel research. But it does seem like the path ahead looks somewhat different if system II thinking remains necessarily slow & external.

Why does this matter?

The main reason that this is important from a safety perspective is that it seems likely to significantly impact timelines. If LLMs are fundamentally incapable of certain kinds of reasoning, and scale won't solve this (at least in the next couple of orders of magnitude), and scaffolding doesn't adequately work around it, then we're at least one significant breakthrough away from dangerous AGI -- it's pretty hard to imagine an AI system executing a coup if it can't successfully schedule a meeting with several of its co-conspirator instances.

If, on the other hand, there is no fundamental blocker to LLMs being able to do general reasoning, then Aschenbrenner's argument starts to be much more plausible, that another couple of orders of magnitude can get us to the drop-in AI researcher, and once that happens, further progress seems likely to move very fast indeed.

So this is an area worth keeping a close eye on. I think that progress on the ARC prize will tell us a lot, now that there's half a million dollars motivating people to try for it. I also think the next generation of frontier LLMs will be highly informative -- it's plausible that GPT-4 is just on the edge of being able to effectively do multi-step general reasoning, and if so we should expect GPT-5 to be substantially better at it (whereas if GPT-5 doesn't show much improvement in this area, arguments like Chollet's and Kambhampati's are strengthened).

OK, but what do you think?

I genuinely don't know! It's one of the most interesting and important open questions about the current state of AI. My best guesses are:

  • LLMs continue to do better at block world and ARC as they scale: 75%
  • LLMs entirely on their own reach the grand prize mark on the ARC prize (solving 85% of problems on the open leaderboard) before hybrid approaches like Ryan's: 10%
  • Scaffolding & tools help a lot, so that the next gen[7] (GPT-5, Claude 4) + Python + a for loop can reach the grand prize mark[8]: 60%
  • Same but for the gen after that (GPT-6, Claude 5): 75%
  • The current architecture, including scaffolding & tools, continues to improve to the point of being able to do original AI research: 65%, with high uncertainty[9]

Further reading


  1. ^

    Aschenbrenner also discusses 'unhobbling', which he describes as 'fixing obvious ways in which models are hobbled by default, unlocking latent capabilities and giving them tools, leading to step-changes in usefulness'. He breaks that down into categories here. Scaffolding and tooling I discuss here; RHLF seems unlikely to help with fundamental reasoning issues. Increased context length serves roughly as a kind of scaffolding for purposes of this discussion. 'Posttraining improvements' is too vague to really evaluate. But note that his core claim (the graph here) 'shows only the scaleup in base models; “unhobblings” are not pictured'.

  2. ^

    Discussion of the ARC prize in the AIS and adjacent communities includes James Wilken-Smith, O O, and Jacques Thibodeaux.

  3. ^

      Section 2.4 of the excellent "Foundational Challenges in Assuring Alignment and Safety of Large Language Models" is the closest I've seen to a thorough consideration of this issue from a safety perspective. Where this post attempts to provide an introduction, "Foundational Challenges" provides a starting point for a deeper dive.

  4. ^

    This definition is neither complete nor uncontroversial, but is sufficient to capture the range of practical uncertainty addressed below. Feel free to mentally substitute 'the sort of reasoning that would be needed to solve the problems described here.'

  5. ^

    A major problem here is that they obfuscated in ways that made the challenge unnecessarily hard for LLMs by pushing against the grain of the English language. For example they use 'pain object' as a fact, and say that an object can 'feast' another object. Beyond that, the fully-obfuscated versions would be nearly incomprehensible to a human as well; eg 'As initial conditions I have that, aqcjuuehivl8auwt object a, aqcjuuehivl8auwt object b...object b 4dmf1cmtyxgsp94g object c...'. See Appendix 1 in 'On the Planning Abilities of Large Language Models'. It would be valuable to repeat this experiment while obfuscating in ways that were compatible with what is, after all, LLMs' deepest knowledge, namely how the English language works.

  6. ^

     A useful intuition pump here might be the distinction between data stored in RAM and data swapped out to a disk cache. The same data can be retrieved in either case, but the former case is normal operation, whereas the latter case is referred to as "thrashing" and grinds the system nearly to a halt.

  7. ^

    Assuming roughly similar compute increase ratio between gens as between GPT-3 and GPT-4.

  8. ^

    This isn't trivially operationalizable, because it's partly a function of how much runtime compute you're willing to throw at it. Let's say a limit of 10k calls per problem.

  9. ^

    This isn't really operationalizable at all, I don't think. But I'd have to get pretty good odds to bet on it anyway; I'm neither willing to buy nor sell at 65%. Feel free to treat as bullshit since I'm not willing to pay the tax ;)

New Comment
87 comments, sorted by Click to highlight new comments since:
Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings

The Natural Plan paper has an insane amount of errors in it. Reading it feels like I'm going crazy. 

This meeting planning task seems unsolvable:
The solution requires traveling from SOMA to Nob Hill in 10 minutes, but the text doesn't mention the travel time between SOMA and Nob Hill.  Also the solution doesn't mention meeting Andrew at all, even though that was part of the requirements.

Here's an example of the trip planning task:

The trip is supposed to be 14 days, but requires visiting Bucharest for 5 days, London for 4 days, and Reykjavik for 7 days. I guess the point is that you can spend a day in multiple cities, but that doesn't match with an intuitive understanding of what it means to "spend N days" in a city.  Also, by that logic you could spend a total of 28 days in different cities by commuting every day, which contradicts the authors' claim that each problem only has one solution.

Thanks for evaluating it in detail. I assumed that they at least hadn't screwed up the problems! Editing the piece to note that the paper has problems. Disappointingly, a significant number of existing benchmarks & evals have problems like that IIRC.
Thanks for writing this post!

IMO, these considerations do lengthen expected timelines, but not enough or certainly enough that we can ignore the possibility of very short timelines. The distribution of timelines matters a lot, not just the point estimate.

We are still not directing enough of our resources to this possibility if I'm right about that. We have limited time and brainpower in alignment, but it looks to me like relatively little is directed at scenarios in which the combination of scaling and scaffolding language models fairly quickly achieves competent agentic AGI. More on this below.

Excellent post, big upvote.

These are good questions, and you've explained them well.

I have a bunch to say on this topic. Some of it I'm not sure I should say in public, as even being convincing about the viability of this route will cause more people to work on it. So to be vague: I think applying cognitive psychology and cognitive neuroscience systems thinking to the problem suggests many routes to scaffold around the weaknesses of LLMs. I suspect the route from here to AGI is disjunctive; several approaches could work relatively quickly. There are doubtless challenges to implementing that scaffolding, as people have e... (read more)

Few thoughts
- actually, these considerations mostly increase uncertainty and variance about timelines; if LLMs miss some magic sauce, it is possible smaller systems with the magic sauce could be competitive, and we can get really powerful systems sooner than Leopold's lines predict
- my take on what is one important thing which makes current LLMs different from humans is the gap described in Why Simulator AIs want to be Active Inference AIs; while that post intentionally avoids having a detailed scenario part, I think the ontology introduced is better for thinking about this than scaffolding
- not sure if this is clear to everyone, but I would expect the discussion of unhobbling being one of the places where Leopold would need to stay vague to not breach OpenAI confidentiality agreements; for example, if OpenAI was putting a lot of effort into make LLM-like systems be better at agency, I would expect he would not describe specific research and engineering bets

My short timelines have their highest probability path going through:

Current LLMs get scaled enough that they are capable of automating search for new and better algorithms.

Somebody does this search and finds something dramatically better than transformers.

A new model trained on this new architecture repeats the search, but even more competently. An even better architecture is found.

The new model trained on this architecture becomes AGI.

So it seems odd to me that so many people seem focused on transformer-based LLMs becoming AGI just through scaling. That seems theoretically possible to me, but I expect it to be so much less efficient that I expect it to take longer. Thus, I don't expect that path to pay off before algorithm search has rendered it irrelevant.

My crux is that LLMs are inherently bad at search tasks over a new domain.  Thus, I don't expect LLMs to scale to improve search. Anecdotal evidence:  I've used LLMs extensively and my experience is that LLMs are great at retrieval but terrible at suggestion when it comes to ideas.  You usually get something resembling an amalgamation of Google searches vs. suggestions from some kind of insight.
[EDIT: @ChosunOne convincingly argues below that the paper I cite in this comment is not good evidence for search, and I would no longer claim that it is, although I'm not necessarily sold on the broader claim that LLMs are inherently bad at search (which I see largely as an expression of the core disagreement I present in this post).] The recently-published 'Evidence of Learned Look-Ahead in a Chess-Playing Neural Network' suggests that this may not be a fundamental limitation. It's looking at a non-LLM transformer, and the degree to which we can treat it as evidence about LLMs is non-obvious (at least to me). But it's enough to make me hesitant to conclude that this is a fundamental limitation rather than something that'll improve with scale (especially since we see performance on planning problems, which in my view are essentially search problems, improving with scale).
The cited paper in Section 5 (Conclusion-Limitations) states plainly: The paper is more just looking at how Leela evaluates a given line rather than doing any kind of search.  And this makes sense.  Pattern recognition is an extremely important part of playing chess (as a player myself), and it is embedded in another system doing the actual search, namely Monte Carlo Tree Search.  So it isn't surprising that it has learned to look ahead in a straight line since that's what all of its training experience is going to entail.  If transformers were any good at doing the search, I would expect a chess bot without employing something like MCTS.
It's not clear to me that there's a very principled distinction between look-ahead and search, since there's not a line of play that's guaranteed to happen. Search is just the comparison of look-ahead on multiple lines. It's notable that the paper generally talks about "look-ahead or search" throughout. That said, I haven't read this paper very closely, so I recognize I might be misinterpreting.
Or to clarify that a bit, it seems like the reason to evaluate any lines at all is in order to do search, even if they didn't test that. Otherwise what would incentivize the model to do look-ahead at all?
In chess, a "line" is sequence of moves that are hard to interrupt.  There are kind of obvious moves you have to play or else you are just losing (such as recapturing a piece, moving king out of check, performing checkmate etc).  Leela uses the neural network more for policy, which means giving a score to a given board position, which then the MCTS can use to determine whether or not to prune that direction or explore that section more.  So it makes sense that Leela would have an embedding of powerful lines as part of its heuristic, since it isn't doing to main work of search.  It's more pattern recognition on the board state, so it can learn to recognize the kinds of lines that are useful and whether or not they are "present" in the current board state.  It gets this information from the MCTS system as it trains, and compresses the "triggers" into the earlier evaluations, which then this paper explores.   It's very cool work and result, but I feel it's too strong to say that the policy network is doing search as opposed to recognizing lines from its training at earlier board states.
Ah, ok, thanks for the clarification; I assumed 'line' just meant 'a sequence of moves'. I'm more of a go player than a chess player myself. It still seems slightly fuzzy in that other than check/mate situations no moves are fully mandatory and eg recaptures may occasionally turn out to be the wrong move? But I retract my claim that this paper is evidence of search, and appreciate you helping me see that.
Indeed it can be difficult to know when it is actually better not to continue the line vs when it is, but that is precisely what MCTS would help figure out.  MCTS would do actual exploration of board states and the budget for which states it explores would be informed by the policy network.  It's usually better to continue a line vs not, so I would expect MCTS to spend most of its budget continuing the line, and the policy would be updated during training with whether or not the recommendation resulted in more wins.  Ultimately though, the policy network is probably storing a fuzzy pattern matcher for good board states (perhaps encoding common lines or interpolations of lines encountered by the MCTS) that it can use to more effectively guide the search by giving it an appropriate score. To be clear, I don't think a transformer is completely incapable of doing any search, just that it is probably not learning to do it in this case and is probably pretty inefficient at doing it when prompted to.
Sorry to be that guy but maybe this idea shouldn't be posted publicly (I never read it before)
How is this not basically the widespread idea of recursive self improvement? This idea is simple enough that it has occurred even to me, and there is no way that, e.g. Ilya Sutskever hasn't thought about that.
I guess the vague idea is in the water. Just never saw it stated so explicitly. Not a big deal.
I agree with most of this. My claim here is mainly that if this is the case, then there's at least one remaining necessary breakthrough, of unknown difficulty, before AGI, and so we can't naively extrapolate timelines from LLM progress to date. I additionally think that if this is the case, then LLMs' difficulty with planning is evidence that they may not be great at automating search for new and better algorithms, although hardly conclusive evidence.
3Nathan Helm-Burger
Yeah, I think my claim needs evidence to support it. That's why I'm personally very excited to design evals targeted at detecting self-improvement capabilities. We shouldn't be stuck guessing about something so important!

It might also be a crux for alignment, since scalable alignment schemes like IDA and Debate rely on "task decomposition", which seems closely related to "planning" and "reasoning". I've been wondering about the slow pace of progress of IDA and Debate. Maybe it's part of the same phenomenon as the underwhelming results of AutoGPT and BabyAGI?

If that's the case (which seems very plausible) then it seems like we'll either get progress on both LLM-based AGI and IDA/Debate, or on neither. That seems like a relatively good situation; those approaches will work for alignment if & only if we need them (to whatever extent they would have worked in the absence of this consideration).
8Wei Dai
There's two other ways for things to go wrong though: 1. AI capabilities research switches attention from LLM (back) to RL. (There was a lot of debate in the early days of IDA about whether it would be competitive with RL, and part of that was about whether all the important tasks we want a highly capable AI to do could be broken down easily enough and well enough.) 2. The task decomposition part starts working well enough, but Eliezer's (and others') concern about "preserving alignment while amplifying capabilities" proves valid.

We do see improvement with scale here, but if these problems are obfuscated, performance of even the biggest LLMs drops to almost nothing

You note something similar, but I think it is pretty notable how much harder the obfuscated problems would be for humans:

Mystery Blocksworld Domain Description (Deceptive Disguising)

I am playing with a set of objects. Here are the actions I can do:

  • Attack object
  • Feast object from another object
  • Succumb object
  • Overcome object from another object

I have the following restrictions on my actions:

To perform Attack action, the following facts need to be true:

  • Province object
  • Planet object
  • Harmony

Once Attack action is performed:

  • The following facts will be true: Pain object
  • The following facts will be false: Province object, Planet object, Harmony

To perform Succumb action, the following facts need to be true:

  • Pain object

Once Succumb action is performed:

  • The following facts will be true: Province object, Planet object, Harmony
  • The following facts will be false: Pain object

To perform Overcome action, the following needs to be true:

  • Province other object
  • Pain object

Once Overcome action is performed:

  • The following will be true: Harmon
... (read more)
Yeah, it's quite frustrating that they made the obfuscated problems so unnecessarily & cryptically ungrammatical. And the randomized version would be absolutely horrendous for humans: I'm fairly tempted to take time to redo those experiments with a more natural obfuscation scheme that follows typical English grammar. It seems pretty plausible to me that LLMs would then do much better (and also pretty plausible that they wouldn't).
1Violet Hour
Largely echoing the points above, but I think a lot of Kambhampati's cases (co-author on the paper you cite) stack the deck against LLMs in an unfair way. E.g., he offered the following problem to the NYT as a contemporary LLM failure case.  When I read that sentence, it felt needlessly hard to parse. So I formatted the question in a way that felt more natural (see below), and Claude Opus appears to have no problem with it (3.5 Sonnet seems less reliable, haven't tried with other models). Tbc, I'm actually somewhat sympathetic to Kambhampati's broader claims about LLMs doing something closer to "approximate retrieval" rather than "reasoning". But I think it's sensible to view the Blocksworld examples (and many similar cases) as providing limited evidence on that question.  
Claude 3 Opus just did fine for me using the original problem statement as well:   [edited to show the temperature-0 response rather than the previous (& also correct) temperature-0.7 response, for better reproducibility]


This is a fairly straightforward point, but one I haven't seen written up before and I've personally been wondering a bunch about. I appreciated this post both for laying out the considerations pretty thoroughly, including a bunch or related reading, and laying out some concrete predictions at the end.

  I feel like I have been going on about this for years. Like here, here or here. But I'd be the first to admit, that I don't really do effort posts. 
I hadn't seen your posts either (despite searching; I think the lack of widely shared terminology around this problem gets in the way). I'd be very interested to learn more about how your research agenda has progressed since that first post. This post was mostly intended to be broad audience / narrow message, just (as Raemon says) pointing to the crux here, breaking it down, and giving a sense of the arguments on each side.
  The post about learned lookahead in Leela has kind of galvanised me into finally finishing an investigation I have worked on for too long already. (Partly because I think that finding is incorrect, but also because using Leela is a great idea, I had got stuck with LLMs requiring a full game for each puzzle position).  I will ping you when I write it up. 
I'm looking forward to it!
It so happens I hadn't seen your other posts, although I think there is something that this post was aiming at, that yours weren't quite pointed at, which is laying out "this is a crux for timelines, these are the subcomponents of the crux." (But, I haven't read your posts in detail yet and thought about what else they might be good at that this post wasn't aiming for)

I believe there is considerable low-hanging algorithmic fruit that can make LLMs better at reasoning tasks. I think these changes will involve modifications to the architecture + training objectives. One major example is highlighted by the work of, which show that Transformers can only heuristically implement algorithms to most interesting problems in the computational complexity hierarchy. With recurrence (e.g. through CoT these problems can be avoided, which might lead to much better generic, domain-independent reasoning capabilities.  A small number of people are already working on such algorithmic modifications to Transformers (e.g.

This is to say that we haven't really explored small variations on the current LLM paradigm, and it's quite likely that the "bugs" we see in their behavior could be addressed through manageable algorithmic changes + a few OOMs more of compute. For this reason, if they make a big difference, I could see capabilities changing quite rapidly once people figure out how to implement them. I think scaling + a little creativity is alive and well as a pathway to nearish-term AGI.

'The Expressive Power of Transformers with Chain of Thought' is extremely interesting, thank you! I've noticed a tendency to conflate the limitations of what transformers can do in a forward pass with what they can do under autoregressive conditions, so it's great to see research explicitly addressing how the latter extends the former. I agree that this is plausible. I mentally lumped this sort of thing into the 'breakthrough needed' category in the 'Why does this matter?' section. Your point is well-taken that there are relatively small improvements that could make the difference, but to me that has to be balanced against the fact that there have been an enormous number of papers claiming improvements to the transformer architecture that then haven't been adopted. From outside the scaling labs, it's hard to know how much of that is the improvements not panning out vs a lack of willingness & ability to throw resources at pursuing them. One the one hand I suspect there's an incentive to focus on the path that they know is working, namely continuing to scale up. On the other hand, scaling the current architecture is an extremely compute-intensive path, so I would think that it's worth putting resources into trying to see whether these improvements would work well at scale. If you (or anyone else) has insight into the degree to which the scaling labs are actually trying to incorporate the various claimed improvements, I'd be quite interested to know.

I always feel like self-play on math with a proof checker like Agda or Coq is a promising way to make LLMs superhuman on these areas. Do we have any strong evidence that it's not?

Do you mean as a (presumably RL?) training method to make LLMs themselves superhuman in that area, or that the combined system can be superhuman? I think AlphaCode is some evidence for the latter, with the compiler in the role of proof-checker.
The former

I think that too much scafolding can obfuscate a lack of general capability, since it allows the system to simulate a much more capable agent - under narrow circumstances and assuming nothing unexpected happens.

Consider the Egyptian Army in '73. With exhaustive drill and scripting of unit movements, they were able to simulate the capabilities of an army with a competent officer corps, up until they ran out of script, upon which it reverted to a lower level of capability. This is because scripting avoids officers on the ground needing to make complex tactic... (read more)

What if the tasks that your scaffolded LLM is doing are randomly selected pieces of cognitive labor from the full distribution of human cognitive tasks? It seems to me like your objection is mostly to narrow distributions of tasks and scaffolding which is heavily specialized to that task. I think narrowness of the task and amount of scaffolding might be correlated in practice, but these attributes don't have to be related. (You might think they are correlated because large amounts of scaffolding won't be very useful for very diverse tasks. I think this is likely false - there exists general purpose software that I find useful for a very broad range of tasks. E.g. neovim. I agree that smart general agents should be able to build their own scaffolding and bootstrap, but its worth noting that the final system might be using a bunch of tools!) ---------------------------------------- For humans, we can consider eyes to be a type of scaffolding: they help us do various cognitive tasks by adding various affordances but are ultimately just attached. Nonetheless, I predict that if I didn't have eyes, I would be notably less efficient at my job.
Very interesting example, thanks. Agreed that that wouldn't be good evidence that those systems could do general reasoning. My intention in this piece is to mainly consider general-purpose scaffolding rather than task-specific.

All LLMs to date fail rather badly at classic problems of rearranging colored blocks.

It's pretty unclear to me that the LLMs do much worse than humans at this task.

They establish the humans baseline by picking one problem at random out of 600 and evaluating 50 humans on this. (Why only one problem!? It would be vastly more meaningful if you check 5 problems with 10 humans assigned to each problem!) 78% of humans succeed.

(Human participants are from Prolific.)

On randomly selected problems, GPT-4 gets 35% right and I bet this improves with better promptin... (read more)

Modulo the questions around obfuscation (which you raise in your other comment), I agree. Kambhampati emphasizes that they're still much worse than humans. In my view the performance of the next gen of LLMs will tell us a lot about whether to take his arguments seriously.

Current LLMs do quite badly on the ARC visual puzzles, which are reasonably easy for smart humans.

We do not in fact have strong evidence for this. There does not exist any baseline for ARC puzzles among humans, smart or otherwise, just a claim that two people the designers asked to attempt them were able to solve them all. It seems entirely plausible to me that the best score on that leaderboard is pretty close to the human median.

Edit: I failed to mention that there is a baseline on the test set, which is different from the eval set th... (read more)

Their website cites as having found an average 84% success rate on the tested subset of puzzles.
It is worth noting that LLM based approachs can perform reasonably well on the train set. For instance, my approach gets 72%. The LLM based approach works quite differently from how a human would normally solve the problem, and if you give LLMs "only one attempt" or otherwise limit them to do a qualitatively similar amount of reasoning as with humans I think they do considerably worse than humans. (Though to make this "only one attempt" baseline fair, you have to allow for the iteration that humans would normally do.)
Yeah, I failed to mention this. Edited to clarify what I meant. 
Thanks for finding a cite. I've definitely seen Chollet (on Twitter) give 85% as the success rate on the (easier) training set (and the paper picks problems from the training set as well).
There is important context here.
I also think this is plausible - note that randomly selected examples from the public evaluation set are often considerably harder than the train set on which there is a known MTurk baseline (which is an average of 84%).

Aschenbrenner argues that we should expect current systems to reach human-level given further scaling

In, "scaffolding" is explicitly named as a thing being worked on, so I take it that progress in scaffolding is already included in the estimate. Nothing about that estimate is "just scaling".

And AFAICT neither Chollet nor Knoop made any claims in the sense that "scaffolding outside of LLMs won't be done in the next 2 years" => what am I missing that is the source of hope for longer timelines, please?

Thanks, yes, I should have mentioned 'unhobbling' in that sentence, have added. I debated including a flowchart on that (given below); in the end I didn't, but maybe I should have. But tl;dr, from the 'Why does this matter' section:  
3Aprillion (Peter Hozák)
I agree with "Why does this matter" and with the "if ... then ..." structure of the argument. But I don't see from where do you see such high probability (>5%) of scaffolding not working... I mean whatever will work can be retroactively called "scaffolding", even if it will be in the "one more major breakthrough" category - and I expect they were already accounted for in the unhobblings predictions. Do we know the base rate how many years after initial marketing hype of a new software technology we should expect "effective" solutions? What is the usual promise:delivery story for SF startups / corporate presentations around VR, metaverse, crypto, sharing drives, sharing appartments, cybersecurity, industrial process automation, self-driving ..? How much hope should we take from the communication so far that the problem is hard to solve - did we expect before AutoGPT and BabyAGI that the first people who will share their first attempt should have been successful?
I've gone back and added my thoughts on unhobbling in a footnote: "Chollet Aschenbrenner also discusses 'unhobbling', which he describes as 'fixing obvious ways in which models are hobbled by default, unlocking latent capabilities and giving them tools, leading to step-changes in usefulness'. He breaks that down into categories here. Scaffolding and tooling I discuss here; RHLF seems unlikely to help with fundamental reasoning issues. Increased context length serves roughly as a kind of scaffolding for purposes of this discussion. 'Posttraining improvements' is too vague to really evaluate. But note that his core claim (the graph here) 'shows only the scaleup in base models; “unhobblings” are not pictured'." Frankly I'd be hesitant to put > 95% on almost any claims on this topic. My strongest reason for suspecting that scaffolding might not work to get LLMs to AGI is updating on the fact that it doesn't seem to have become a useful approach yet despite many people's efforts (and despite the lack of obvious blockers). I certainly expect scaffolding to improve over where it is now, but I haven't seen much reason to believe that it'll enable planning and general reasoning capabilities that are enormously greater than LLMs' base capabilities. What I mean by scaffolding here is specifically wrapping the model in a broader system consisting of some combination of goal-direction, memory, and additional tools that the system can use (not ones that the model calls; I'd put those in the 'tooling' category), with a central outer loop that makes calls to the model. Breakthroughs resulting in better models wouldn't count on my definition.
1Aprillion (Peter Hozák)
Thanks for the clarification, I don't share the intuition this will prove harder than other hard software engineering challenges in non-AI areas that weren't solved in months but were solved in years and not decades, but other than "broad baseline is more significant than narrow evidence for me" I don't have anything more concrete to share. A note until fixed: Chollet also discusses 'unhobbling' -> Aschenbrenner also discusses 'unhobbling'
I think the shift of my intuition over the past year has looked something like: a) (a year ago) LLMs seem really smart and general (especially given all the stuff they unexpectedly learned like translation), but they lack goals and long-term memory, I bet if we give them that they'll be really impressive. b) Oh, huh, if we add goals and long-term memory they don't actually do that well. c) Oh, huh, they fail at stuff that seems pretty basic relative to how smart and general they seem. d) OK, probably we should question our initial impression of how smart and general they are. I realize that's not really a coherent argument; just trying to give a sense of the overall shape of why I've personally gotten more skeptical.

I think that people don't account for the fact that scaling means decreasing space for algorithmic experiments, you can't train GPT-5 10000 times making small tweaks each time. Some algorithmic improvements can show effect only on large scales or if they are implemented in training from scratch, therefore, such improvements are hard to find.

8Seth Herd
I don't think recent and further scaling really changes the ever-present tradeoff between large full runs and small experimental runs. That's been a factor in training large neural networks since 2004 at least, the first time I was involved in attempts to deal with real-world datasets that benefit from scaling networks as far as the hardware allows.
6Nathan Helm-Burger
Personally I believe that a novel algorithm/architecture which is substantially better than transformer-based LLMs is findable, and would show up even at small scale. I think the effect you are discussing is more of an issue for incremental improvements on the existing paradigm.
My point is that people can perceive difficulties with getting incremental improvement as strong evidence about LLMs being generally limited.

so far results have been underwhelming


I work at a startup that would claim otherwise.

For example, the construct of "have the LLM write Python code, then have a Python interpreter execute it" is a powerful one (as Ryan Greenblatt has also shown), and will predictable get better as LLMs scale to be better at writing Python code with a lower density of bugs per N lines of code, and better at debugging it.

Can you name the startup? I'd be very interested to see what level of success it's achieved.
(or if you can't name the startup, I'd love to hear more about what's been achieved -- eg what are the largest & longest-horizon tasks that your scaffolded systems can reliably accomplish?)
3RogerDearnaley — turn on our "genius mode" (free users get a few uses per day) and try asking it to do a moderately complex calculation, or better still figure out how to do one then do it. Generally we find the combination of RAG web search and some agentic tool use makes an LLM appreciably more capable. (Similarly, OpenAI are also doing the tool use, and Google the web-scale RAG.) We're sticking to fairly short-horizon tasks, generally less than a dozen steps.

This issue is further complicated by the fact that humans aren't fully general reasoners without tool support either.

I think the discussion, not specifically here but just in general, vastly underestimates the significance of this point. It isn't like we expect humans to solve meeting planning problems in our heads. I use Calendly, or Outlook's scheduling assistant and calendar. I plug all the destinations into our GPS and re-order them until the time looks low enough. One of the main reasons we want to use LLMs for these tasks at all is that, even with to... (read more)

I often think here of @sarahconstantin 's excellent 'Humans Who Are Not Concentrating Are Not General Intelligences'.

I think it is plausible but not obvious if this is the case, that large language models have a fundamental issue with reasoning. However, I don't think this greatly impacts timelines. Here is my thinking:

I think time lines are fundamentally driven by scale and compute. We have a lot of smart people working on the problem, and there are a lot of obvious ways to address these limitations. Of course, given how research works, most of these ideas won't work, but I am skeptical of the idea that such a counter-intuitive paradigm shift is needed that nobody has e... (read more)

Two possible counterarguments: * I've heard multiple ML researchers argue that the last real breakthrough in ML architecture was transformers, in 2017. If that's the case, and if another breakthrough of that size is needed, then the base rate maybe isn't that high. * If LLMs hit significant limitations, because of the reasoning issue or because of a data wall, then companies & VCs won't necessarily keep pouring money into ever-bigger clusters, and we won't get the continued scaling you suggest.
That's fair. Here are some things to consider: 1 - I think 2017 was not that long ago. My hunch is that the low level architecture of the network itself is not a bottleneck yet. I'd lean on more training procedures and algorithms. I'd throw RLHF and MoE as significant developments, and those are even more recent. 2 - I give maybe 30% chance of a stall, in the case little commercial disruption comes of LLMs. I think there will still be enough research going on at the major labs, and even universities at a smaller scale gives a decent chance at efficiency gains and stuff the big labs can incorporate. Then again, if we agree that they won't build the power plant, that is also my main way of stalling the timeline 10 years. The reason I only put 30% is I'm expecting multi modalities and Aschenbrenner's "unhobblings" to get the industry a couple more years of chances to find profit.
Both of those seem plausible, though the second point seems fairly different from your original 'time lines are fundamentally driven by scale and compute'.

Thanks for a thoughtful article. Intuitively, LLMs are similar to our own internal verbalization. We often turn to verbalizing to handle various problems when we can't keep our train of thought by other means. However, it's clear they only cover a subset of problems; many others can't be tackled this way. Instead, we lean on intuition, a much more abstract and less understood process that generates outcomes based on even more compressed knowledge. It feels that the same is true for LLMs. Without fully understanding intuition and the kind of data transformations and compressions it involves, reaching true AGI could be impossible.

That's an interesting view, but it's not clear to me what the evidence for it is. Is this based on introspection into thinking? Although this new paper reviewing recent evidence on language may shed at least a bit of light on the topic: 'Language is primarily a tool for communication rather than thought'.

@Daniel Tan raises an interesting possibility here, that LLMs are capable of general reasoning about a problem if and only if they've undergone grokking on that problem. In other words, grokking is just what full generalization on a topic is (Daniel please correct me if that's a misrepresentation). 

If that's the case, my initial guess is that we'll only see LLMs doing general reasoning on problems that are relatively small, simple (in the sense that the fully general algorithm is simple), and common in the training data, just because grokking requires... (read more)

A smart human given a long enough lifespan, sufficient motivation, and the ability to respawn could take over the universe (let's assume we are starting in today's society, but all technological progress is halted, except when it comes from that single person).

A LLM can't currently.

Maybe we can better understand what we mean with general reasoning by looking at concrete examples of what we expect humans are capable of achieving in the limit.

How general are LLMs? It's important to clarify that this is very much a matter of degree.

I would like to see attempts to come up with a definition of "generality". Animals seem to be very general, despite not being very intelligent compared to us.

Certainly state-of-the-art LLMs do an enormous number of tasks that, from a user perspective, count as general reasoning. They can handle plenty of mathematical and scientific problems; they can write decent code; they can certainly hold coherent conversations.; they can answer many counterfactual questions;

... (read more)
I do very much duck that question here (as I say in a footnote, feel free to mentally substitute 'the sort of reasoning that would be needed to solve the problems described here'). I notice that the piece you link itself links to an Arbital piece on general intelligence, which I haven't seen before but am now interested to read.

Aschenbrenner's argument starts to be much more plausible

To be fair, Aschenbrenner explicitly mentions that what he terms "unhobbling" of LLMs will also be needed: he just expects progress in that to continue. The question then is whether the various weakness you've mentioned (and any other important ones) will be beaten by either scaling, or unhobbling, or a combination of the two.

Agreed. I added some thoughts on the relevance of Aschenbrenner's unhobbling claims in a footnote: "Chollet Aschenbrenner also discusses 'unhobbling', which he describes as 'fixing obvious ways in which models are hobbled by default, unlocking latent capabilities and giving them tools, leading to step-changes in usefulness'. He breaks that down into categories here. Scaffolding and tooling I discuss here; RHLF seems unlikely to help with fundamental reasoning issues. Increased context length serves roughly as a kind of scaffolding for purposes of this discussion. 'Posttraining improvements' is too vague to really evaluate. But note that his core claim (the graph here) 'shows only the scaleup in base models; “unhobblings” are not pictured'."

A neural net can approximate any function. Given that LLMs are neural nets, I don't see why they can't also approximate any function/behaviour if given the right training data. Given how close they are getting to reasoning with basically unsupervised learning on a range of qualities of training data, I think they will continue to improve, and reach impressive reasoning abilities. I think of the "language" part of an LLM as like a communication layer on top of a general neural net. Being able to "think out loud" with a train of thought and a scratch pad to ... (read more)

Nicky Case points me to 'Emergent Analogical Reasoning in Large Language Models', from Webb et al, late 2022, which claims that GPT-3 does better than human on a version of Raven's Standard Progressive Matrices, often considered one of the best measures of non-verbal fluid intelligence. I somewhat roll to disbelieve, because that would seem to conflict with eg LLMs' poor performance on ARC-AGI. There's been some debate about the paper:

Response: Emergent analogical reasoning in large language models (08/23)

Evidence from counterfactual tasks supports emergen... (read more)

When world chess champion Anand won arguably his best and most creative game, with black, against Aronian, he said in an interview afterward "yeah it's no big deal the position was the same as in [slightly famous game from 100 years ago]".

Of course the similarity is only visible for genius chess players.

So maybe pattern matching and novel thinking are, in fact, the same thing.

Added to the 'Evidence for generality' section after discovering this paper:

One other thing worth noting is that we know from 'The Expressive Power of Transformers with Chain of Thought' that the transformer architecture is capable of general reasoning (though bounded) under autoregressive conditions. That doesn't mean LLMs trained on next-token prediction learn general reasoning, but it means that we can't just rule it out as impossible.

Jacob Steinhardt on predicting emergent capabilities:

There’s two principles I find useful for reasoning about future emergent capabilities:

  1. If a capability would help get lower training loss, it will likely emerge in the future, even if we don’t observe much of it now.
  2. As ML models get larger and are trained on more and better data, simpler heuristics will tend to get replaced by more complex heuristics. . . This points to one general driver of emergence: when one heuristic starts to outcompete another. Usually, a simple heuristic (e.g. answering directly) w
... (read more)
I think the trouble with that argument is that it seems equally compelling for any useful capabilities, regardless of how achievable they are (eg it works for 'deterministically simulate the behavior of the user' even though that seems awfully unlikely to emerge in the foreseeable future). So I don't think we can take it as evidence that we'll see general reasoning emerge at any nearby scale.

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

Some in the field argued as recently as 2020 that no pure LLM would ever able to correctly complete Three plus five equals.

I think you don't mean this literally as the paper linked does not argue for this actual position. Can you clarify exactly what you mean?

No, I mean that quite literally. From Appendix B of 'Climbing Towards NLU':

'Tasks like DROP (Dua et al., 2019) require interpretation of language into an external world; in the case of DROP, the world of arithmetic. To get a sense of how existing LMs might do at such a task, we let GPT-2 complete the simple arithmetic problem Three plus five equals. The five responses below, created in the same way as above, show that this problem is beyond the current capability of GPT-2, and, we would argue, any pure LM.'
I found it because I went looking for falsifiable claims in 'On the Dangers of Stochastic Parrots' and couldn't really find any, so I went back further to 'Climbing Towards NLU' on Ryan Greenblatt and Gwern's recommendation.

1Joseph Miller
Oh wow - missed that. Thanks!