Beware General Claims about “Generalizable Reasoning Capabilities” (of Modern AI Systems)

by LawrenceC
11th Jun 2025
AI Alignment Forum
19 min read
19 comments, sorted by top scoring
[-]Michael Liu1mo144

Really good and well-written post. While I think it is always good to provide vigorous and strong evidence against Gary Marcus (and others) when he defaults to promoting claims that confirm his bias, I wonder if there are better long-term solutions to this issue. Normal, everyday people who don't follow AI, or who show only modest interest in learning more about it, will likely never see posts like this or even have heard of LW/Alignment Forum/80k/etc. I do think that the lack of mainstream public content is part of the issue, but I'm sure that there are lots of difficulties that I am naive to. Also, I sense that there is a distrust of people working in the field being salesmen trying to hype up the AI or fuel their egos over how important their work is, so that might not be the best solution either. But I'm curious to hear if anyone is working towards concrete ideas for giving informed and strongly evidence-backed claims about AI progress to the mainstream (think national TV), or what the biggest roadblocks are that make it hard or impossible.

Reply
[-]LawrenceC1mo64

This isn't my area of expertise, so take what I have to say with a grain of salt. 

I wouldn't say that anyone is particularly on the ball here, but there are certainly efforts to bring about more "mainstream" awareness about AI. See e.g. the recent post by the ControlAI folk about their efforts to brief British politicians on AI, in which they mention some of the challenges they ran into. Still within the greater LW/80k circle, we have Rational Animations, which has a bunch of AI explainers via animated Youtube videos. Then slightly outside of it, we have Rob Miles's videos, both on his own channel and on Computerphile. Also, I'd take a look at the PauseAI folk to see if they've been working on anything lately. 

My impression is that the primary problem is "Not feeling the AGI", where I think the emphasis should be placed on "AGI". People see AI everywhere, but it's kinda poopy. There's AI slop and AI being annoying and lots of startups hyping AI, but all of those suggest AI to be mediocre and not revolutionary. Outside of entry-level SWE jobs I don't think people have really felt much disruption from an employment perspective. It's also just hard to distinguish between more or less powerful models ("o1" vs "o3" vs "gpt-4o" vs "gpt-4.5" vs "ChatGPT"; empirically I notice that people just give up and use "ChatGPT" to refer to all OAI LMs).

A clean demonstration of capabilities tends to shock people into "feeling the AGI". As a result, I expect this to be fixed in part over time, as we get more events like the release of GPT-3 (a clean demo for academics/AIS folk), the ChatGPT release in late 2022, or the more recent Ghiblification wave. I also think that, unfortunately, "giving informed and strongly evidence backed claims about AI progress" is just not easy to do for as broad an audience as National TV, so maybe demos are just the way to go. 

The second problem, I think, is fatalism. People switch from "we won't get AGI, those people are just being silly" to "we'll get AGI and OAI/Anthropic/GDM/etc will take over/destroy the world, there's nothing we can do". In some cases this even goes into "I'll choose not to believe it, because if I do, I'll give up on everything and just cry." To be honest, I'm not sure how to fix this one (antidepressants, maybe?).

Then there are the "but China" views, but I think those are much more prevalent on Twitter or in the words of corporate lobbyists than in responses I've heard from people in real life.

Reply
[-]Kaj_Sotala1mo70

Outside of entry-level SWE jobs I don't think people have really felt much disruption from an employment perspective.

People in other creative fields have also been affected:

Since last year, when AI really took off, my workload has plummeted. I used to get up to 15 [illustration] commissions a month; now I get around five. [...] I used to work in a small studio as a storyboard artist for TV commercials. Since AI appeared, I’ve seen colleagues lose jobs because companies are using Midjourney. Even those who’ve kept their jobs have had their wages reduced – and pay in south-east Asia is already low.

 

I noticed I was getting less [copywriting] work. One day, I overheard my boss saying to a colleague, “Just put it in ChatGPT.” The marketing department started to use it more often to write their blogs, and they were just asking me to proofread. I remember walking around the company’s beautiful gardens with my manager and asking him if AI would replace me, and he stressed that my job was safe.

Six weeks later, I was called to a meeting with HR. They told me they were letting me go immediately. [...] The company’s website is sad to see now. It’s all AI-generated and factual – there’s no substance, or sense of actually enjoying gardening. AI scares the hell out of me. I feel devastated for the younger generation – it’s taking all the creative jobs.

 

The effect of generative AI in my industry is something I’ve felt personally. Recently, I was listening to an audio drama series I’d recorded and heard my character say a line, but it wasn’t my voice. I hadn’t recorded that section. I contacted the producer, who told me he had input my voice into AI software to say the extra line. [...] 

The Screen Actors Guild, SAG-AFTRA, began a strike last year against certain major video games studios because voice actors were unhappy with the lack of protections against AI. Developers can record actors, then AI can use those initial chunks of audio to generate further recordings. Actors don’t get paid for any of the extra AI-generated stuff, and they lose their jobs. I’ve seen it happen.

One client told me straight out that they have started using generative AI for their voices because it’s faster.

 

When generative AI came along, the company was very vocal about using it as a tool to help clients get creative. As a company that sells digital automation, developments in AI fit them well. I knew they were introducing it to do things like writing emails and generating images, but I never anticipated they’d get rid of me: I’d been there six years and was their only graphic designer. My redundancy came totally out of the blue. One day, HR told me my role was no longer required as much of my work was being replaced by AI.

I made a YouTube video about my experience. It went viral and I received hundreds of responses from graphic designers in the same boat, which made me realise I’m not the only victim – it’s happening globally, and it takes a huge mental toll.

Reply
[-]LawrenceC1mo70

Oh, fair. Thanks for the correction, I didn't realize how much artists were affected. 

Reply
[-]Leon Lang1mo91

This is an amazing post, thank you for writing it. It will be the main post I link people to who are wondering about the prospects of LLMs' reasoning capabilities.

Reply1
[-]Hruss1moΩ153

I find that studies criticizing current models are often cited long after the issue has been fixed, or without consideration of what they actually mean. I wish technology reporting were more careful, as much of this misunderstanding seems to come from journalistic sources. Examples:

Hands in diffusion models

Text in diffusion models

Water usage

Model collapse - not an issue for actual commercial AI models, the original study was about synthetic data production, and directly feeding the output of models as the exclusive training data

LLMs = Autocorrect - chat models have RLHF post training 

Nightshade/glaze: useless for modern training methods

AI understanding - yes, the weights are not understood, but the overall architecture is

 

It is surprising how many times I hear these, with false context.

Reply
[-]LawrenceC1moΩ241

There are indeed many, many silly claims out there, on either side of any debate. And yes, the people pretending that the AIs of 2025 have the limitations of those from 2020 are being silly, journalist or no.

I do want to clarify that I don't think this is a (tech) journalist problem. Presumably when you mention Nightshade dismissively, it's a combination of two reasons: 1) Nightshade artefacts are removable via small amounts of Gaussian blur and 2) Nightshade can't be deployed at scale on enough archetypal images to have a real effect? If you look at the Nightshade website, you'll see that the authors lie about 1):

As with Glaze, Nightshade effects are robust to normal changes one might apply to an image. You can crop it, resample it, compress it, smooth out pixels, or add noise, and the effects of the poison will remain.

So (assuming my recollection that Nightshade is defeatable by Gaussian noise is correct) this isn't an issue of journalists making stuff up or misunderstanding what the authors said, it's the authors putting things in their press release that, at the very least, are not at all backed up by their paper. 

(Also, either way, Gary Marcus is not a tech journalist!)

Reply
[-]Richard Horvath1mo4-2

Thank you for putting in the effort required to review this. Posts like this help a lot in interpreting hyped literature. I am myself also skeptical about whether LLMs are the path to AGI, and likely would have counted the paper as an additional (small) datapoint in favour of my conclusion if not for your detailed summary (I did not read the original paper myself and had no intention to, hence only a 'small' datapoint).

 

"There’s a common assumption in many LLM critiques that reasoning ability is binary: either you have it, or you don’t."

I agree, and would even push it further: I think the crux of the whole issue is our lack of good understanding of the concept space we refer to with words such as "reasoning", "thinking", "intelligence" or "agency".

I do have a hunch that we have some inherent limitations when trying to use our own "reasoning"/"intelligence"/etc. to understand this space itself (something like Gödel's incompleteness theorem), but I do not have a proof. Then again, maybe not, and we will figure it out.

Whatever the case, we are not good at it right now. As an analogy, imagine if we had the same (lack of) understanding of moving around in physical space:

Car-3.5 is invented and is used to carry things from one location to another. Some people claim it cannot "move on its own": Car-3.5 could not get over hill #1 or muddy field #1, so it is just a road-following engine, not a general artificial mover. Car-4 is created with a stronger engine and can climb over hill #1; Car-o1 has a better transmission and wheels and can cross muddy field #1. It still cannot cross hill #2 or creek #1, so some people claim again that it cannot actually move on its own. Other people show that just increasing engine power and doing tricks like adjusting wheel structure will overcome this, and that even most humans would be unable to cross hill #2 or creek #1; we would just go along the roads and use the tunnel or bridge to cross these, just as Car-o1 does. Are we not general movers after all, either? Do we only need to increase engine power and put some scaffolding in place for creek crossing to get something that can move at least as well as a human in all spaces?

Replace the "Car-" string with "Humanoid_robot-", and think about it again. Change back to "Car-" but imagine this is the thought process of a horse, and think about it again.

We do not know which of the three variants describes our situation best.

Reply
[-]lemonhope1mo50

A car can't even take me from the fridge to the couch! Trust me, I tried all of them. You just spend all your time getting in and back out of the car. I don't see much use for cars, unless all you want to do is drive down the road.

Reply
[-]Chastity Ruth1mo3-3

I might be wrong here, but I don't think the below is correct?

"For River Crossing, there's an even simpler explanation for the observed failure at n>6: the problem is mathematically impossible, as proven in the literature,  e.g. see page 2 of this arxiv paper."

That paper says n ≥ 6 is impossible unless the boat capacity is at least 4. But the prompt in the Apple paper allows for the boat capacity to change.

"$N$ actors and their $N$ agents want to cross a river in a boat that is capable of holding
only $k$ people at a time..."

Reply
[-]LawrenceC1mo40

The authors specify that they used k=3 on page 6 of their paper:

If they used a different k (such that the problem is possible), they did not say so in their paper. 

You're right that I could've been more clear here, I've edited in a footnote to clarify. 

Reply
[-]Chastity Ruth1mo32

Ah, fair enough. I had skipped right to their appendix, which has confusing language around this: 

"(1) Boat Capacity Constraint: The boat can carry at most k individuals at a time, where k is typically
set to 2 for smaller puzzles (N ≤ 3) and 3 for larger puzzles (N ≤ 5); (2) Non-Empty Boat Constraint:
The boat cannot travel empty and must have at least one person aboard..."

The "N ≤ 5" here suggests that for N ≥ 6 they know they need to up the boat capacity. On the other hand, in the main body they've written what you've highlighted in the screenshot. Even in the appendix they write for larger puzzles that "3" is "typically" used. But it should be atypically – used only for 4 and 5 agent-actor pairs.

Edit: That section can be found at the bottom of page 20

Reply
[-]LawrenceC1mo20

Super fair. I did not read that section in detail, and missed your interpretation on my skim. I interpreted it to mean "in the planning literature, we typically see k=3 for n<6" (which we do!), noted that they did not say the alternative value, and then went with the k=3 value they did say in the main body. 

You're right that your interpretation is more natural. If they actually used k=4, then the problem is solvable and the paper was (a bit) better than I portrayed it to be here. 

Worth noting that, even assuming your interpretation is correct, it's possible that the person who wrote the problem spec did know about the impossibility result, but the person running the experiments did not (and thus the experiments were run with k=3). But it does make it more likely that k=4 was used for n>5.

I think my statement above that "they did not say" "a different value of k (such that the problem is possible)" still seems true. And if they used k=4 because they knew that k=3 is impossible, it seems quite sloppy to say that they used k=3 in the main body. But a poorly written paper is a different and lesser problem than the experiment being wrong.

Reply
[-]AnthonyC1mo20

I find it so interesting how often this kind of thing keeps happening, and I can't tell how often it's honest mistakes, versus lack of interest in the core questions, or willful self-delusion. Or maybe they're right, but not noticing that they actually are proving (and exemplifying) that humans also lack generalizable reasoning capability. 

My mental analogy for the Tower of Hanoi case: Imagine I went to a secretarial job interview, and they said they were going to test my skills by making me copy a text by hand with zero mistakes. Harsh, odd, but comprehensible. If they then said, "Here's a copy of Crime and Punishment, along with a pencil and five sheets of loose leaf paper. You have 15 minutes," then the result does not mean what they claim it means. Even if they gave me plenty of paper and time, but I said, "#^&* you, I'm leaving," or, "No, I'll copy page one, and prove if I can copy page n it means I can copy page n+1, so that I have demonstrated the necessary skill, proof by induction," then the result still does not mean what they claim it means.

I do really appreciate the chutzpah of talking about children solving Towers of Hanoi. Yes, given an actual toy and enough time to play, they often can solve the puzzle. Given only a pen, paper, a verbal description of the problem, and a demand for an error-free, specified-format written procedure for producing the solution, not so much. These are not the same task. There's a reason many of us had to write elementary school essays on things like precisely laying out all the steps required to make a sandwich: this ability is a dimension along which humans greatly vary. If that's not compelling, think of all the examples of badly-written instruction manuals you've come across in your life.

I also appreciate the chutzpah of that 'goalpost shifting' tweet. It just kind of assumes away any possibility that there is a difference between AGI and ASI, and in the process inadvertently implies a claim that humans also lack reasoning capability. And spreadsheets do crash and freeze up when you put more rows and columns of data in them than your system can handle - I'd much rather they be aware of this limit and warn me that I was about to trigger a crash.

Reply
[-]DanielFilan1moΩ220

Ironically, given that it's currently June 11th (two days after my last tweet was posted) my final tweet provides two examples of the planning fallacy.

"Hopefully" is not a prediction!

Reply
[-]LawrenceC1moΩ440

Fair, but in my head I did plan to get it done on the 10th. The tweet is not in itself the prediction, it's just evidence that I made the prediction in my head. 

And indeed I did finish the draft on June 10th, but at 11 PM and I decided to wait for feedback before posting. So I wasn't that off in the end, but I still consider it off. 

Reply
[-]Cedar16d10

I personally would like to see less talk about / with Gary Marcus, and more betting with Gary Marcus, like [here](https://garymarcus.substack.com/p/where-will-ai-be-at-the-end-of-2027).

But I understand that people don't wanna do it because it's a pretty bad bet if you win money in futures where money is likely going to become worthless.

Reply
[-]Raphael Roche21d*10

We currently know of no form of intelligence more general than human intelligence. It serves as our default reference point. It therefore seems reasonable to test for general intelligence by administering identical tests to both humans and machines. If the machine performs as well as or better than humans, it appears reasonable to conclude that, at least for that particular test of general intelligence, the machine has not demonstrated general intelligence inferior to that of humans. This is the fundamental principle behind the Turing test and remains central to the Arc-AGI test.

I fully agree with François Chollet's view that we can reasonably claim to have achieved AGI when it becomes extremely difficult or impossible to design tests where humans significantly outperform machines. However, it makes no sense to test for AGI using tasks that no human—or very few—could successfully complete. Gary Marcus's definition of general intelligence sets inherently superhuman and therefore arbitrary requirements. Moreover, unless we succumb to a double standard bias (which I believe is precisely what's happening here), this definition denies the possibility of general intelligence in humans, even though humans constitute our natural reference point.

I therefore fully endorse the author's analysis, which is very well-written, thoughtfully argued, balanced, and engaging to read. I also completely agree with the author's point that general intelligence should not be viewed as a binary property but rather as a gradient. Indeed, there's little difference between what we mean by "general intelligence" and simply "intelligence." While a calculator or classical computer can perform calculations intractable for even the best mathematicians, computational capacity alone does not define intelligence, though some degree of it is certainly necessary. On this point too, I find François Chollet's reflections particularly insightful.

Recent LLMs are already partial AGIs—entities possessing an intermediate level of intelligence between calculators and humans, though arguably already much closer to human intelligence.

Reply
[-]RedMan25d10

I assert that most humans with 'generalizable intelligence' would be confused by questions about the Tower of Hanoi. 

The LLM clearly has better generalizable intelligence than many if not most humans, myself included.

Reply

1.

Late last week, researchers at Apple released a paper provocatively titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity”, which “challenge[s] prevailing assumptions about [language model] capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning”.

Normally I refrain from publicly commenting on newly released papers. But then I saw the following tweet from Gary Marcus:

I have always wanted to engage thoughtfully with Gary Marcus. In a past life (as a psychology undergrad), I read both his work on infant language acquisition and his 2001 book The Algebraic Mind; I found both insightful and interesting. From reading his Twitter, I can tell that Gary Marcus is thoughtful and willing to call it like he sees it. If he's right about language models hitting fundamental barriers, it's worth understanding why; if not, it's worth explaining where his analysis went wrong.

As a result, instead of writing a quick off-the-cuff response in a few 280-character tweets, I read the paper and Gary Marcus's Substack post, reproduced some of the paper's results, and then wrote this 4,000-word post.

Ironically, given that it's currently June 11th (two days after my last tweet was posted) my final tweet provides two examples of the planning fallacy.

2.

I don’t want to bury the lede here. While I find some of the observations interesting, I was quite disappointed by the paper given the amount of hype around it. The paper seems to reflect generally sloppy work and the authors overclaim what their results show (albeit not more so than the average ML conference submission). The paper fails to back up the authors’ claim that language models cannot “reason” due to “fundamental limitations”, or even (if you permit some snark) their claim that they performed “detailed analysis of reasoning traces”.

By now, others have highlighted many of the issues with the paper: see, for example, the Twitter threads by Ryan Greenblatt or Lisan al Gaib, as well as the paper drafted by Alex Lawsen and Claude Opus 4[1] and Zvi Mowshowitz's Substack post. Or, if you're feeling really spicy, you can ask any of Gemini 2.5, o3, or Opus 4 to critique the paper as if they were reviewer #2.

3.

It's important to keep in mind that this paper is not a bombshell dropped out of the blue. Instead, it's merely the latest entry in 60 years of claims about neural networks' fundamental limitations. I'm by no means an expert in this literature, but here are 6 examples off the top of my head:

  1. In the late 1960s, Minsky and Papert published a book showing that single-layer perceptrons (a precursor to modern MLPs) cannot represent XOR, helping trigger the first AI winter.
  2. Gary Marcus argued in the 1990s and 2000s that undifferentiated, fully-connected neural networks cannot learn important aspects of natural language.
  3. From the 1990s to the mid-2010s, researchers from the statistical learning theory tradition argued that the class of hypotheses represented by neural networks has high intrinsic VC dimension – that is, these hypotheses are hard to learn in the worst case.
  4. A group of researchers from the natural language processing community have recently argued that large language models (LLMs) are “stochastic parrots” that probabilistically link together words and sentences without consideration of their meaning. A related line of academic work argues that transformers cannot learn causality from statistical data.
  5. Yet another line of work looks at the complexity classes of circuits that transformers can represent, and finds that finite-precision transformers correspond to uniform TC0 – a very restricted class of circuits.
  6. The line of work most closely related to the Illusion of Thinking paper involves generating simple problems that humans can solve but that LLMs cannot – probably the highest-profile of these is ARC-AGI, but other examples include the much earlier CommonSenseQA or some of Gary Marcus's puzzles. Also, LLMs cannot multiply 10-digit numbers.

Broadly speaking, the arguments tend to take the following form:

  • The authors concede that neural networks/LLMs can do seemingly impressive things in practice.
  • Current techniques fail to generalize to the clearly correct solution in a theoretical setting, or they fail empirically in a simple toy setting.
  • Ergo, their apparent impressiveness in practice is an illusion resulting from regurgitating memorized examples or heuristics from the training dataset.

(Unsurprisingly, the Illusion of Thinking paper also follows this structure: the authors create four toy settings, provide evidence that current LLMs cannot solve them when you scale the toy settings to be sufficiently large, and conclude that they must suffer from fundamental limitations of one form or another.)

It’s worth noting that there exist standard responses to each of the lines of work on “fundamental limitations” I mentioned above:

  1. Current neural networks have many layers, which allows them to represent more complicated functions.
  2. Current LLMs are not undifferentiated, fully connected neural nets. In fact, the field of deep learning as a whole moved away from fully connected neural networks in the mid 2010s, with the widespread adoption of CNNs and LSTMs (and later the transformer architecture).
  3. Academic work has shown that LLMs seem to be able to consistently respond with correct causal reasoning in a way inconsistent with pure dataset memorization, and there are even theoretical results in toy settings that show how this causal reasoning may arise. Also, many of the issues pointed to in earlier stochastic parrot papers seem to have been mitigated with increasing model scale.
  4. Several recent lines of theoretical work have argued that overparameterized neural networks exhibit a tendency toward simple solutions, such as the work on double descent, Singular Learning Theory, or Principles of Deep Learning Theory. Also, this is consistent with empirical work studying neural network generalization or adversarial examples.
  5. Increasing the precision of the attention mechanism greatly increases the representation power of the transformer forward pass. Also, while each individual forward pass may have limited capability, adding chain-of-thought (even while keeping precision fixed) also greatly increases the computational complexity of problems transformers can solve.
  6. Language models seem to be consistently getting better at all of these benchmarks – for example, o3 can solve 60.8% of ARC-AGI-1 problems, compared to a mere 30% performance from o1 and 4.5% from GPT-4o. CommonSenseQA is retired in that it’s too easy for all frontier language models now, and o3/Sonnet 4 can both respond appropriately to all of the examples in that Gary Marcus post. Also, while frontier LLMs still cannot do 10 digit multiplication reliably, the length of multiplication problem they can solve has been increasing over time – as few as five years ago, we were commenting on the fact that LLMs couldn’t even reliably do 2 digit multiplication!

Again, I want to emphasize that the Illusion of Thinking paper is not a bombshell dropped out of the blue. It exists in a context of much prior work arguing for both the existence of limitations and arguing against the applicability of these limitations in practice. Even without diving into this paper, it’s worth tempering your expectations for how much this paper should really affect your belief about the fundamental limits of current LLMs.

4.

Having taken a long digression into historical matters, let us actually go over the content of the Illusion of Thinking paper.

The authors start by creating four different “reasoning tasks”, each parameterized by the number of objects in the problem n (which the authors refer to as the ‘complexity’ of the problem):[2]

Tower of Hanoi, where the model needs to output the (2^n - 1) steps needed to solve a Tower of Hanoi problem with n disks.

Checkers Jumping, where there are n blue checkers and n red checkers lined up on a board with (2n+1) spaces and the model needs to output the minimum sequence of moves to flip the initial board position.

River Crossing, where there are n pairs of actors and agents trying to cross a river on a boat that can hold k people, where the boat cannot travel empty, and where no actor can be in the presence of another agent without their own agent being present. This is generally known as the Missionaries and Cannibals problem (or sometimes the Jealous Husbands Problem).

Blocks World, where there are n ordered blocks divided evenly between two stacks with n/2 blocks each, with the goal of consolidating the two stacks into a single ordered stack using a third empty stack.

On all four tasks, the models are scored by their accuracy – the fraction of model generations that lead to a 100% correct solution.
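To make the scale of these tasks concrete, here is a minimal Python sketch of the Tower of Hanoi case (mine, not the paper's; the function name and the (disk, from_peg, to_peg) move format are illustrative assumptions) that generates the full move list a model is asked to produce. The list has 2^n - 1 entries, so n = 10 already means 1,023 moves and n = 20 means 1,048,575.

```python
def hanoi_moves(n, source=0, target=2, spare=1):
    """Recursively generate the 2**n - 1 moves that solve an n-disk
    Tower of Hanoi, as (disk, from_peg, to_peg) tuples.
    (Illustrative sketch; the paper's exact output format differs.)"""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)    # move the n-1 smaller disks out of the way
        + [(n, source, target)]                      # move the largest disk
        + hanoi_moves(n - 1, spare, target, source)  # move the n-1 smaller disks back on top
    )

for n in (5, 10, 15, 20):
    print(n, len(hanoi_moves(n)))  # move counts: 31, 1023, 32767, 1048575
```

A script like this is the kind of thing section 6 has in mind when it notes that a simple Python program can emit the n = 15 solution in well under a second.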

The authors then run several recent language models on all four tasks, and find that for each task and model, there appears to be a threshold after which accuracy seems to drop to zero. They argue that the existence of this “collapse point” suggests that LLMs cannot truly be doing “generalizable reasoning”.

The authors also do some statistical analysis of the model generated chains-of-thought (CoTs). From this, they first find that  “models counterintuitively begin to reduce their reasoning effort despite increasing problem difficulty”.  They also find that in the Tower of Hanoi case, “counterintuitively”, providing the correct algorithm to the model does not seem to improve performance. Finally, they find that Claude 3.7 Sonnet can solve Tower of Hanoi with n=5 but not River Crossing with n=3, and argue that this is the result of River Crossing not being on the internet.[3] 

5.

A classic blunder when interpreting model evaluation results is to ignore simple, mundane explanations in favor of the fancy hypothesis being tested. I think that the Illusion of Thinking paper contains several examples of this blunder.[4]

When I reproduced the paper's results on the Tower of Hanoi task, I noticed that for n >= 9, Claude 3.7 Sonnet would simply state that the task required too many tokens to complete manually, provide the correct Tower of Hanoi algorithm, and then output an (incorrect) solution in the desired format without reasoning about it. When I provide the question to Opus 4 on the Claude chatbot app, it regularly refuses to even attempt the manual solution![5] And for n=15 or n=20, none of the models studied have enough context length to output the correct answer, let alone reason their way to it in the authors' requested format.

A prototypical response from Claude Opus 4, where it calls the n=10 Tower of Hanoi task "extremely tedious and error prone" and refuses to do it.
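To put rough numbers on the output-length point (my back-of-the-envelope assumptions, not figures from the paper): if each move costs on the order of five to ten output tokens, the full answer for n = 15 runs to a couple of hundred thousand tokens, and n = 20 requires over a million moves, well beyond the output budgets of the models evaluated.

```python
# Back-of-the-envelope output-size estimate; TOKENS_PER_MOVE is an assumption.
TOKENS_PER_MOVE = 7  # rough guess for an entry like "[disk, from_peg, to_peg]"

for n in (10, 15, 20):
    moves = 2**n - 1
    print(f"n={n}: {moves:,} moves, ~{moves * TOKENS_PER_MOVE:,} output tokens")
# n=10: 1,023 moves, ~7,161 output tokens
# n=15: 32,767 moves, ~229,369 output tokens
# n=20: 1,048,575 moves, ~7,340,025 output tokens
```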

The authors call it "counterintuitive" that language models use fewer tokens at high complexity, suggesting a "fundamental limitation." But this simply reflects models recognizing their limitations and seeking alternatives to manually executing thousands of possibly error-prone steps  – if anything, evidence of good judgment on the part of the models!

For River Crossing, there's an even simpler explanation for the observed failure at n ≥ 6: the problem is mathematically impossible, as proven in the literature (see, e.g., page 2 of this arxiv paper).[6]
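Since the impossibility claim is easy to check mechanically, here is a small brute-force search (a sketch of mine, not the paper's code; the state encoding and the convention of checking the safety constraint on both banks after every crossing are assumptions about the standard formulation). Consistent with the cited result, it should report the puzzle solvable for n=3, k=2 and n=5, k=3, impossible for n=6, k=3, and solvable again for n=6, k=4.

```python
from collections import deque
from itertools import combinations

def crossing_solvable(n, k):
    """Brute-force BFS over bank configurations for the n-pair actor/agent
    river-crossing puzzle with boat capacity k. Person 2*i is actor i and
    2*i + 1 is their agent; a state is (people still on the start bank, boat side)."""
    everyone = frozenset(range(2 * n))

    def safe(group):
        # Unsafe if some actor shares a bank with a foreign agent while their own agent is absent.
        for p in group:
            if p % 2 == 0 and p + 1 not in group and any(q % 2 == 1 for q in group):
                return False
        return True

    start = (everyone, 0)              # all on the start bank, boat at the start bank
    seen, frontier = {start}, deque([start])
    while frontier:
        bank0, boat = frontier.popleft()
        if not bank0:                  # everyone has crossed
            return True
        here = bank0 if boat == 0 else everyone - bank0
        for size in range(1, k + 1):   # the boat may not travel empty
            for movers in combinations(here, size):
                new_bank0 = bank0 - set(movers) if boat == 0 else bank0 | set(movers)
                if safe(new_bank0) and safe(everyone - new_bank0):
                    state = (new_bank0, 1 - boat)
                    if state not in seen:
                        seen.add(state)
                        frontier.append(state)
    return False

for n, k in [(3, 2), (5, 3), (6, 3), (6, 4)]:
    print(f"n={n}, k={k}: {'solvable' if crossing_solvable(n, k) else 'impossible'}")
# Expected (per the impossibility result cited above):
# n=3, k=2: solvable;  n=5, k=3: solvable;  n=6, k=3: impossible;  n=6, k=4: solvable
```

The reachable state space here is tiny (at most 2^(2n+1) states), which is part of what makes the oversight surprising: a few lines of search settle the question.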

Of the two environments I investigated in detail, there seem to be mundane reasons explaining the apparent collapse that the authors failed to consider. I did not have time to investigate or reproduce their results for the other two tasks, but I’d be surprised if similar problems didn’t plague the authors’ results for those as well.

Again, evals failing for mundane, boring reasons unrelated to the question you’re investigating (such as, “the model refuses to do it” or “the problem is impossible”) is a common experience in the field of LM evals. This is precisely why it’s so important to look at your data instead of just statistically testing your hypothesis or running a regex! The fact that the authors seemed to miss the explanation for why reasoning tokens decrease for large n suggests to me that they did not look at their data very carefully (if at all), and the fact that they posed an impossible problem as one of their four environments suggests that they also did not think about their environments very carefully.

I want to emphasize again that this is not an unusually bad ML paper. Writing good papers is hard and this is a preprint that has not been peer reviewed. Anyone who’s served as a peer reviewer for an ML conference knows that the ladder of paper quality goes all the way up and all the way down. Insofar as anyone involved with the paper deserves criticism beyond the standard sort, it’s the people who hyped it up based on the headline result or because it fit their narrative.

That being said, on the basis of research rigor alone, I think there’s good reason to doubt the conclusions of the paper. I do not think that this paper is particularly noteworthy as a contribution to the canon of fundamental limitations, let alone a “knockout blow” that shows the current LLM paradigm is doomed.

6.

Suppose we accepted the authors' results at face value, and accepted that language models could never manually execute thousands of algorithmic steps without error. Should we then conclude that LLMs fundamentally cannot do “generalizable reasoning” as we understand it? 

There’s a common assumption in many LLM critiques that reasoning ability is binary: either you have it, or you don’t. Either you have true generalization, or you have pure memorization.  Under this dichotomy, showing that LLMs fail to learn or implement the general algorithm in a toy example is enough to conclude that they must be blind pattern matchers in general. In the case of the Illusion of Thinking paper, the implicit argument is that, if an LLM cannot flawlessly execute simple toy algorithms, this constitutes damning evidence against generalizable reasoning. They argue this even though frontier LLMs can implement the algorithms in Python, often provide shorter solutions that fit within their context windows, and explain how to solve the problem in detail when asked. 

I’d argue this dichotomy does not reflect much of what we think of as “reasoning” in the real world. People can consistently catch thrown balls without knowing about or solving differential equations (as in the classic Richard Dawkins quote).[7] Even in the realm of mathematics, most mathematicians work via intuition and human-level abstractions, and do not just write formal Lean programs. And there’s a reason why some sociologists argue that heuristics learned from culture, not pure reasoning ability, are the secret of humanity’s success.

Whenever we deal with agents with bounded compute grappling with complicated real-world environments, we’ll see a reliance on heuristics that have worked in the past. In this view, generalization can be the limit of memorization: given only a few examples on a new task, you might start by memorizing individual data points, then learn “shallow heuristics”, and finally generalize to deeper, more useful heuristics. 

This is why I find most of the “fundamental limitation” claims to be unconvincing. The interesting question isn't whether a bounded agent relies on learned heuristics (of course it does!), but rather how well those heuristics generalize to domains of interest. Focusing on whether LLMs can implement simple algorithms in toy settings or theoretical domains, without considering how these results apply elsewhere, risks missing this point entirely.

I'll concede to the authors that LLMs are clearly not best thought of as traditional computers. In fact, I'll concede that there's no way a modern LLM can output the 32,767 steps of the answer to the n=15 Tower of Hanoi in the authors' desired format, while even a simple Python script (written by one of these LLMs, no less) can do this in less than a second.

But at the risk of repeating myself, do the results thereby imply that LLMs cannot do “generalizable reasoning”? To answer this question, I argue that we ought to be able to look at evidence other than a simple binary “can the LLM implement the general algorithm manually”. For example, perhaps we should consider evidence like the fact that frontier LLMs can implement the algorithm in Python, provide shorter solutions, and explain how to solve the problem – all of which suggest that the LLMs do understand the problem.[8] I think insofar as the results show that there’s a real, fundamental limit on whether LLMs can manually execute algorithms for hundreds or thousands of steps, this is a very different claim than “LLMs cannot do generalizable reasoning”.

7.

I have a confession: setting aside the abstract arguments above, much of my interest in the matter is personal. Namely, seeing arguments about the fundamental limitations of LLMs sometimes makes me question the degree to which I can do “generalizable reasoning”.

People who know me tend to comment that I “have a good memory”. For example, I remember the exact blunder I made in a chess game with a friend two years ago on this day, as well as the conversations I had that day. By default, I tend to approach problems by quickly iterating through a list of strategies that have worked on similar problems in the past, and insofar as I do first-principles reasoning, I try my best to amortize the computation by remembering the results for future use. In contrast, many people are surprised when I can’t quickly solve problems requiring a lot of computation.

That’s not to say that I can’t reason; after all, I argue that writing this post certainly involved a lot of “reasoning”. I’ve also met smart people who rely even more on learned heuristics than I do. But from the inside it really does feel like much of my cognition is pattern matching (on ever-higher levels). Much of this post drew on arguments or results that I’ve seen before; and even the novel work involved applying previous learned heuristics.

I almost certainly cannot manually write out 1023 Tower of Hanoi steps without errors – like 3.7 Sonnet or Opus 4, I'd write a script instead. By the paper's logic, I lack 'generalizable reasoning.' But the interesting question was never about whether I can flawlessly execute an algorithm manually, but whether I can apply the right tools or heuristics to a given problem.

8.

To his credit, in his Substack post, Gary Marcus does mention the critique I provide above. However, he dismisses this by saying that:

That is, he makes the claim that in order to be AGI, a system needs to be able to reliably do 13-digit arithmetic, or to consistently write out, by hand, the solution to Tower of Hanoi with 8 disks.

There are two responses I have to this: a constructive response and a snarky response.

The constructive response is that current LLMs such as Claude 3.7 Sonnet and o3 can easily write software that solves these tasks. Sure, an LLM might not be able to do “generalized reasoning” in the sense that the authors propose, but an LLM with a simple code interpreter definitely can. Here, the key question is why we must consider the LLM by itself, as opposed to an AI agent composed of an LLM and an agent scaffold – note that even chatbot-style apps such as ChatGPT provide the LLM with various tools such as a code interpreter and internet access. Why should we limit our discussion of AGIs to just the LLM component of an AI system, as opposed to the AI system as a whole?

The snarky response is that sure, I'm happy to concede that “AGI” (as envisioned by Gary Marcus) requires being able to multiply 13-digit numbers or flawlessly write out a 1,023-step Tower of Hanoi solution. But there's a sort of intelligence that humans possess, which is general in that it works across many domains, and which does not require being able to multiply 13-digit numbers or write out 1,023 steps of Tower of Hanoi. This is the sort of intelligence that can notice when a computer algorithm would be better for a problem and write that algorithm instead of solving the problem by hand. This is the sort of intelligence that allows researchers to come up with new ideas, or construction workers to tackle problems in the real world, or salespeople to persuade their customers to buy a product. This is the sort of intelligence that I use when I apply complex heuristics acquired from decades of reading and writing as I write this paragraph. When I think about whether or not LLMs have “fundamental limitations”, I'm interested in whether or not they might become superhumanly intelligent in this sense, not whether or not they're “AGI” in the sense laid out by Gary Marcus.

Or, if you’ll permit an amateur attempt at a meme:

9.

Having discussed why I think the paper is a bad critique of the existing LLM paradigm and why I find Gary Marcus’s rebuttal unconvincing, let us get back to the question of what a good “fundamental limitations” critique of LLMs would look like.

First, instead of only using a few toy examples, a good critique should ideally be based either on strong empirical trends or on good, applicable theoretical results. The history of machine learning is full of results on toy examples that failed to generalize outside of their narrow domains; if you want to convince me that there's a “fundamental limitation”, you'll need to offer me either a mathematical proof or strong empirical evidence.

Second, the critique should address the key reasons why people are bullish about LLMs – e.g. their ability to converse in fluent English; their ability to write code; their broad and deep knowledge of subjects such as biology; and their increasing ability to act as software agents. At the very least, the critique should explain both why it does not apply in these cases and why we should expect the limitation to matter in the future.

Yes, I know Gary Marcus doesn't like this graph. If there's enough interest, I'll write another post responding to his critiques.

Finally, critiques need to argue why these limitations will continue to apply despite continuing AI progress and the best efforts of researchers trying to overcome them. Many of the failure modes that LLMs exhibited in the past were solved over time with  a combination of scale and researcher effort. In 2019, GPT-2 was barely able to string together sentences and in 2020 GPT-3 could barely do 2-digit arithmetic. The LLMs of today can often accomplish software tasks that take humans an hour to complete.  

Good critiques of LLMs exist – in fact, Gary Marcus has made many better critiques of LLM capabilities in the past, as have the authors of the Illusion of Thinking paper. But invariably, better critiques tend to be more boring and empirical; not a singular knock-down argument but a back-and-forth discussion with many small examples and personal anecdotes. And instead of originating from outside the field, these critiques center around issues that people who work on LLMs talk extensively about.

I’ll provide 3 such critiques below:

  1. There seem to be computational limitations to current LLMs. No modern LLM handles arbitrarily long context windows, and their performance degrades over very long contexts. More importantly, the amount of compute required to train LLMs has been growing at an exponential rate, and this exponential trend cannot continue for very long into the future. We also might run out of data for either pre-training or sufficiently diverse, long-horizon training environments for RL training. (That being said, I’m not sure how important handling super long contexts is, what level of capabilities we’ll hit before running out of compute/data/environments, or how fundamental these limits are in the face of human research effort.)
  2. LLMs can be sensitive to the prompt in hard-to-foresee ways. Changing the framing of problems can greatly impact their behavior – again, see the GSM-noop work from the Illusion of Thinking authors.[9] This suggests that some sophisticated LLM behavior may be hard to elicit, if not entirely memorized. (That being said, LLMs seem to be becoming less susceptible to being tricked and more capable of solving novel problems over time. Also, humans are famously susceptible to small changes in framing as well.)
  3. LLMs have hallucinations and suffer from reliability issues in general. When I ask o3 or Claude Opus 4 to do research for me, I need to check their work, because they’ll sometimes flat-out lie about what a citation says. (But again, it seems that these issues are getting better over time, as evidenced by the METR time horizon results. Also, having worked with humans, I assure you that humans also make stuff up and suffer from reliability issues.)

Could these be fundamental limitations? I think it's possible. It's possible that I'll learn tomorrow that we cannot train models using more compute than our current ones, or observe that models continue to be easily distracted by irrelevant details in their prompts, or see that the trend of increasing reliability stops at a level far below humans. I'd still want to think about whether these purported limitations would continue to hold up against researchers trying to address them. But if they do, I would argue that these are good reasons to expect the modern LLM paradigm to hit a dead end.

None of these will be an “LLMs are a dead field” knockout blow. Insofar as LLM hype dies down due to limitations like these, it'll have been a death by a thousand cuts as evidence accumulates over time and trends reveal themselves. It will not be due to a single paper purporting to show “fundamental limitations”.

10.

One delightful irony is that, I suspect, most people would agree with the following tweet by Josh Wolfe, regardless of their thoughts on Gary Marcus-style skepticism:

LLM skeptics can read this as Apple vindicating the long-ignored, sage arguments from one of the foremost skeptics of LLM capabilities. But for others, “Gary Marcus” is synonymous with making pointless critiques that will soon be proven irrelevant, while completely failing to address the cruxes of those he’s arguing against.

I think this is a sad state of affairs. I much prefer a world in which “Gary Marcus”ing means making good, thoughtful critiques, engaging in good faith with critics, and focusing on the strongest points in favor of the skeptical position.

Empirically, this is not what is happening. Over the course of drafting this post, Gary Marcus has doubled down on this paper being conclusive evidence for LLM limitations, both on Twitter:

And in an opinion piece posted in the Guardian, where he points specifically to large n Tower of Hanoi as evidence for fundamental limitations:

There’s plenty of room for nuanced critiques of LLMs. Lots of the LLM commentary is hype. Twitter abounds with hyperbolic statements that deserve to be brought back to earth.  All language models have limited context windows, show sensitivity to prompts, and suffer from hallucinations. Most relevantly, AIs are worse than humans at many important things: despite their performance on benchmarks, Opus 4 and o3 cannot replace even moderately competent software engineers wholesale, despite many claims that software engineering is a dead discipline. The world needs thoughtful critiques of LLM capabilities grounded in empiricism and careful reasoning.

But the world does not need more tweets or popular articles misrepresenting studies (on either side of the AI debate), clinging to false dichotomies, and making invalid arguments. Useful critiques of the LLM paradigm need to go beyond just theoretical claims or extrapolation on toy problems far removed from practice. Good-faith criticism should focus on the capabilities that “AGI believers” are hopeful for or concerned about, rather than redefining AGI to mean something else in order to dismiss their hopes or concerns out of hand, or defining “generalizable reasoning” in a way that implies the participants in the conversation lack it.

The appeal of claiming fundamental limitations is obvious, as is the comparatively unsatisfying nature of empirical claims. But given the track record, I continue to prefer reading careful analysis of empirical experiments over appreciating the “true significance” of bombastic, careless claims about so-called “fundamental limitations”.

 

Acknowledgements

Thanks to Ryan Greenblatt and Sydney von Arx for comments on this post. Thanks also to Justis Mills for copyediting assistance. 

 

  1. ^

    Edited to add: Though this paper is also quite sloppy, and I don't think all of the claims hold up. For example, it claims without citation that the block problem is PSPACE and river crossing is NP-hard. The former seems flat-out incorrect (you can clearly verify solutions efficiently, as the authors do). Generalized river crossing with arbitrary constraints and k=3 is known to be NP-hard, but I don't think it's the case for Agents/Actors or Missionaries/Cannibals. Maybe Opus got confused by how the river crossing problem was generalized?

  2. ^

     It’s worth noting that “complexity” as the authors use it is not the standard “computational complexity” – instead, the “complexity” of a problem is the number of objects n in the problem. Later on, the authors talk about the number of steps in an optimal solution; this is closer to computational complexity but not the same. For example, even though the solution for the Checkers Jumping task has length quadratic in n, the basic “guess and check” algorithm for finding this solution requires a number of steps exponential in n. Similarly, while the minimum solution length for Blocks World also scales linearly with the number of blocks n, the basic solution requires exploring an exponentially large state space.

  3. ^

     This “counterintuitive result” that Claude Sonnet “achieves near-perfect accuracy when solving the Tower of Hanoi with (N=5), which requires 31 moves, while it fails to solve the River Crossing puzzle when (N=3), which has a solution of 11 moves” has a simple explanation – the former requires executing a simple deterministic algorithm with 31 steps, while the latter requires searching over a much larger space of possible solutions.

     The authors' speculation that “... examples of River Crossing with N>2 are scarce on the web” also seems incorrect – a quick Google search for either Missionaries and Cannibals or the Jealous Husbands problem shows that there are plenty of n=3, k=2 solutions on the internet, including on Wikipedia. If anything, the fact that Claude 3.7 Sonnet fails at this task suggests that it is earnestly trying to solve the task, as opposed to regurgitating a memorized solution (!).

  4. ^

     The standard remedy for this blunder is to read model transcripts. Note that high-level statistical analysis can often fail to notice these simple alternative explanations (as seems to have happened with the authors here).

  5. ^

    Arguably, this behavior is a natural consequence of their RL training, where the environments tend to look like “solve a complicated math problem” or “write correct code for a coding task”, and not “manually execute an algorithm for hundreds of steps”. After all, if you’re given a coding task and you try to solve it by manually executing the algorithm, you’re probably not going to end up doing particularly well on the task. 

  6. ^

    (Edited to clarify: specifically, the authors use k=3 boat capacity for all problems with n>2 pairs. But for n>5 pairs, you need at least k=4 capacity to solve the problem.) 

  7. ^

     In fact, it seems likely that humans (and dogs!) follow a simple heuristic that allows them to chase down and catch a thrown ball.

  8. ^

    This is also my explanation for the authors' “counterintuitive observation” that giving LLMs the algorithm doesn’t improve their performance on the task – they already know the algorithm, it’s just hard for them to manually execute it for hundreds or thousands of steps in the requested format. 

  9. ^

    My best steelman of the Illusion of Thinking paper is also in this vein – the models seem to do a lot better on River Crossing with n=3, k=2 when you call it by the more common name of “jealous husbands” or “missionaries and cannibals”, rather than “actors and agents”. In fact, if you read their CoT, it seems that 3.7 Sonnet/Opus 4 can sometimes get the correct answer in their output even when their CoTs fail to get to the correct answer, suggesting that their performance here comes from memorizing a solution in their training data. 

Mentioned in
AI #120: While o3 Turned Pro