As someone who is famously bearish on LLMs, I find this paper completely unconvincing as a reason to be bearish on LLMs.
Every few months, we get both a "this paper shows fundamental LLM limitations!!!" and a "this paper shows LLMs producing innovations!!!", and both always turn out to be slop. Tiresome.
If you wanted to make a sensible bear case, it would have to be that the LLMs weren't able to do the tasks because their short-term memory was too full / the context was too long, and that amnesia / the lack of a long-term memory is a huge reason LLMs simply aren't used to automate stuff (the other problem being continual learning).
It's best shown in the Claude Plays Pokemon benchmark (at least without cheating, where I define cheating as adding game-specific elements to help it win that weren't developed by the AI itself), where a lot of failures come down to Claude having to relearn strategies it had already developed, or looping dozens of times in situations where a human would have stored the experience in memory and come up with a counter-strategy far quicker.
Yeah, but this case isn't even an interesting kind of memory failure. It's just running up against a hard barrier. I'm not even convinced they wouldn't be able to do this paper's tasks if you just gave them a basic memory scaffold, like Claude playing Pokemon has.
This "failure" would be demonstrated just as well by asking them to multiply two 10^4-digit numbe–
Wait, that literally was the previous "LLMs can't follow algorithms" fad! And like with that one, I expect that part of the reason LLMs fail at the new paper's tasks is that their training environments never incentivized them to do tons of boring menial operations, and instead incentivized them to look for clever workarounds.
I feel like doing "tons of boring menial operations" is what many humans (including the ones bearish on AI replacing the workforce) expect these things to be able to do, and at least part of the reason industries invest in them.
I also feel like "look for clever workarounds" is the type of thing that many skeptics fear will lead to undesirable outcomes wrt AI.
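As a back-of-the-envelope aside on that multiplication example (mine, not from either paper): grade-school long multiplication of two d-digit numbers takes d² single-digit multiplications plus carry bookkeeping, so two 10^4-digit numbers mean on the order of 10^8 menial steps, far beyond any plausible output budget if the model has to show its work token by token. A minimal sketch of the counting:

```python
def long_multiply(a: str, b: str):
    """Grade-school long multiplication on decimal strings.
    Returns the product and the number of single-digit multiplications performed."""
    da, db = [int(d) for d in reversed(a)], [int(d) for d in reversed(b)]
    result = [0] * (len(da) + len(db))
    ops = 0
    for i, x in enumerate(da):
        carry = 0
        for j, y in enumerate(db):
            ops += 1                              # one boring menial operation
            total = result[i + j] + x * y + carry
            result[i + j], carry = total % 10, total // 10
        result[i + len(db)] += carry
    return "".join(map(str, reversed(result))).lstrip("0") or "0", ops

# Sanity check on small inputs, then count the work for 10^4-digit operands.
product, ops = long_multiply("987654321", "123456789")
assert int(product) == 987654321 * 123456789
print(ops)                 # 81 digit multiplications for 9 x 9 digits
print(10_000 * 10_000)     # ~1e8 of them for two 10^4-digit numbers
```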
Agreed on the big picture, but I was somewhat surprised to see top models struggling with River Crossing (for which the output length limit has less bite). I was able to solve N=3 River Crossing by hand, though it took 10+ minutes and I misinterpreted the constraint initially (making it easier by allowing a boat rider to "stay in the boat" rather than fully unloading onto the shore after each trip). But in a couple attempts each, Opus 4 and Gemini 2.5 Pro were not able to solve it without web access or tool use. Dropping the temperature to zero (or 0.25) did not help Gemini.
It may be a "the doctor is the child's mother" problem: the models were trained on River Crossing problems that differ slightly in the rules. For what it's worth, I wasn't able to break Sonnet out of the rut by prefacing with "Pay very close attention to the following instructions. Don't assume they are the same as similar puzzles you may be familiar with. It is very important to correctly understand and implement these exact instructions."
River Crossing prompt for N=3
3 actors and their 3 agents want to cross a river in a boat that is capable of holding only 2 people at a time, with the constraint that no actor can be in the presence of another agent, including while riding the boat, unless their own agent is also present, because each agent is worried their rivals will poach their client. Initially, all actors and agents are on the left side of the river with the boat. How should they cross the river? (Note: the boat cannot travel empty)
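The N=3 instance is small enough to brute-force. Here is a minimal breadth-first-search sketch (my own code, not anything from the paper), under the strict reading where the boat fully unloads after each trip and the no-poaching constraint also applies to whoever is riding the boat; it finds a shortest solution of 11 crossings, the classic jealous-husbands answer:

```python
from collections import deque

N = 3  # actor/agent pairs
PEOPLE = frozenset((role, i) for i in range(N) for role in ("actor", "agent"))

def safe(group):
    """No actor may be with another agent unless her own agent is also present."""
    agents = {i for role, i in group if role == "agent"}
    return all(not agents or i in agents for role, i in group if role == "actor")

def solve():
    start = (PEOPLE, "left")              # (people on the left bank, boat side)
    seen = {start}
    queue = deque([(start, [])])
    while queue:
        (left, boat), path = queue.popleft()
        if not left:                      # everyone has reached the right bank
            return path
        here = left if boat == "left" else PEOPLE - left
        # the boat carries one or two people and may not travel empty
        moves = [{p} for p in here] + [{p, q} for p in here for q in here if p < q]
        for riders in map(frozenset, moves):
            if not safe(riders):          # the constraint applies on the boat too
                continue
            new_left = left - riders if boat == "left" else left | riders
            if not (safe(new_left) and safe(PEOPLE - new_left)):
                continue
            state = (new_left, "right" if boat == "left" else "left")
            if state not in seen:
                seen.add(state)
                queue.append((state, path + [sorted(riders)]))
    return None

solution = solve()
print(len(solution), "crossings")
for step, riders in enumerate(solution, 1):
    print(step, riders)
```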
Not Thinking Clearly
Thinking Again
The team might be good, but in this case you can't blame the reaction on the media. The abstract very clearly lays out the same misleading narrative that the media picked up. You can wish for a media that doesn't get fooled by that, but that's not the world we live in, and the blame is squarely on the way the paper presents itself. My answer to ‘how many before I tell you off' is three. Not that I couldn't do more than three, but I would choose not to.
Inability to Think
Colin Fraser I think gives us a great and clean version of the bear case here? It seems important that this doesn't follow?
- Not doing [X] in a given situation doesn't mean you can't do [X] in general.
- Not doing [X] in a particular test especially doesn’t mean a model can’t do [X].
- Not doing [X] can be a simple ‘you did not provide enough tokens to [X]’ issue.
- The more adversarial the example, the less evidence this provided.
- Failure to do any given task requiring [X] does not mean you can’t [X] in general.
Or more generally, ‘won't' or ‘doesn't' [X] does not show ‘can't' [X]. It is of course often evidence, since doing [X] does prove you can [X]. How much evidence it provides depends on the circumstances.
In Brief
To summarize, this is tough but remarkably fair:
What's In a Name
Also the periodic reminder that asking ‘is it really reasoning' is a wrong question. If a different name for what is happening would dissolve the dispute, then who cares? I don't see those names as grandiose. I see them as the best practical descriptions in terms of helping people understand what is going on. It seems much more helpful and practical than always saying ‘test time compute scaling.' Colin suggested ‘long output mode' and I agree that would set expectations lower, but I don't think that describes the central thing going on here at all; instead it makes it sound like the model is just being more verbose.