I would analogize this to a human with anterograde amnesia, who cannot form new memories, and who is constantly writing notes to keep track of their life. The limitations here are obvious, and these are limitations future Claudes will probably share unless LLM memory/continual learning is solved in a better way.
This is an extremely underrated comparison, TBH. Indeed, I'd argue that frozen weights + lack of a long-term memory are easily one of the biggest reasons why LLMs are much more impressive than useful at a lot of tasks (with reliability being another big, independent issue).
It emphasizes two things that are both true at once: LLMs do in fact reason like humans and can have (poor-quality) world-models, with no fundamental chasm between LLM capabilities and human capabilities that couldn't be closed by unlimited resources/time; and yet, just as humans with anterograde amnesia are usually much less employable/useful to others than people who do have long-term memory, current AIs are much, much less employable/useful than future-paradigm AIs.
I'm curious to see how well LLMs can play Slay the Spire. I could actually try that manually and see what happens.
Neuro-sama (the LLM-based AI VTuber) beat the game some time ago. As the code isn't open, though, it's not possible to confirm whether the StS AI was done with LLMs. Would definitely be interesting to see how frontier LLMs perform!
In Claude's first try, it played Ironclad on Ascension 1 and died to Hexaghost, the Act 1 boss. It wasn't terrible but occasionally got the mechanics a little bit mixed up.
Thanks for following through.
If anyone wants to make a proper harness in the future, I think probably the most interesting question here is if the LLM can learn from multiple playthroughs, unlocking harder difficulties, etc.
Modern LLMs, maybe through notetaking?
Interesting how much it's relying on having information in training data and being able to look stuff up. I wonder how it would do with a "blind" playthrough of a game that didn't previously exist.
This benchmark includes a Slay the Spire environment! When it was written, Gemini 2.5 did the best, getting roughly halfway through a non-Ascension run.
Curated. I appreciate this post's concreteness.
It can be hard to really understand what numbers in a benchmark mean. To do so, you have to be pretty familiar with the task distribution, which is often a little surprising. And, if you are bothering to get familiar with it, you probably already know how the LLM performs. So it's hard to be sure you're judging the difficulty accurately, rather than using your sense of the LLM's intelligence to infer the task difficulty.
Fortunately, a Pokémon game involves a bunch of different tasks, and I'm pretty familiar with them from childhood gameboy sessions. So LLM performance on the game can provide some helpful intuitions about LLM performance in general. Of course, you don't get all the niceties of statistical power and so on, but I still find it a helpful data source to include.
This post does a good job abstracting some of the subskills involved and provides lots of deliciously specific examples for the claims. It's also quite entertaining!
One thing I found fascinating about watching Claude play is that it wouldn't play around and experiment the way I'd expect a human to. It would stand still trying to work out what to do next, move one square up, consider a long time, move one square down, and repeat, when I'd expect a human to immediately get bored and go as far as they could in all directions to see what was there and try interacting with everything. Maybe some cognitive analogue of boredom is useful for avoiding loops?
That is in fact a defect of these models, and one of the things that makes you kind of want to scream at the screen when it, say, doesn't walk more than 20 tiles to the right after having spent days looking for an entrance that is 30 tiles to the right. Or when it doesn't explore the bottom left of a room, where the answer is, because it's convinced that's not where it is.
Or the fact that it's in week 2 of Silph Co., convinced that what it's looking for is not an item on the ground and therefore not picking up any items even when walking right next to them, when in fact its goal is an item on the ground.
It's really interesting to compare how Opus 4.5 is performing on Pokemon, versus how it performs in Claude Code.
One of the big factors here is surely vision: Gemini is one of the best visual LLMs by a wide margin, and I strongly suspect Google does lots of training on specific vision tasks. Even so, 2.0 and 2.5 underperformed human 7-year-olds on many simple tasks on which Gemini hasn't been trained. In comparison, Claude has some visual abilities, but I can't remember ever reaching for them for any serious project. And it sounds like this is affecting lots of things in Pokemon.
Opus 4.5 really is quite good at programming, enough that I'm passing into the "emotional freakout about the inevitable singularity" stage of grief. But Opus lives and dies by giant piles of Markdown files. It generates them, I read them, I make small corrections, and it continues. I think this is Opus 4.5's happy place, and within this circumscribed area, it's a champ. It can write a thousand lines of good Rust in a day, no problem, with some human feedback and code review. And if your process concentrates knowledge into Markdown files, it gets better.
So this is my current hypothesis:
It's kind of nice to imagine an AI future where the AIs are enormously capable, but that capability is only unlocked by a bit of occasional human interaction and guidance. Sadly, I think that's only a passing phase, and the real future is going to be much weirder, and future AIs won't need a human to say, "That weird rug thing is actually the elevator," and the AI to reply, "Oh, good observation! That simplifies this problem considerably."
GPT-5.1 beating Crystal in 108 hours is very interesting. I wonder why that's the case compared to Gemini 3 Pro, which took ~424.5 hours. Do you have any thoughts?
Bunch of reasons:
Unfortunately no one has done full playthrough comparisons on the same harness for all models, due to time and expense (the main developers of the Claude/Gemini/GPT harnesses each only have access to free tokens for their own model brand). Perhaps this will become possible sometime next year as completion time drops? (Cost per token might drop too, but perhaps not for frontier models.)
We saw some examples while working on terminal bench where, if the agent is pressured with a deadline, it freaks out and acts less rationally. Some of your examples remind me of that: being close to the objective and becoming obsessed with it at the expense of intermediate steps.
I have been experimenting with having stock AI agents compete against each other in Warhammer 10th edition and have found similar problems. Deepseek was telling me units could make shot distances that clearly were not possible by rules it knew. The 'ignoring things that are in front of it' observation here is funny to me because Microsoft Co-Pilot was saying to put units so close together it was impossible. I gave it a grid map in coordinate form. It was ignoring things it put there itself.
I also told Deepseek that I had to play but knew nothing of the game, and was doing so because my friend insulted Chongqing hotpot, saying Chengdu's is better. It themed my whole Space Wolves army as soup-based and wrote text as if it was really into it.
Warhound Titan, "Defender of Simmering Broth" (1100 pts)
Microsoft Co-pilot was pretty boring and accountant-like.
(I am at Ithaca College if anyone wants to participate, it is fun)
For comparison, Pokemon Red in Twitch Plays Pokemon, which was basically just decision making implemented as a race condition between thousands to tens of thousands of different humans at every decision making step, took 16 days, 7 hours, 50 minutes, 19 seconds.
Also, he has a noticeable lack of confusion about key NPCs–Oak is consistently Oak, the player character is never “the red-hatted NPC”, and he can pick out gym leader Erika from a lineup.
I'm a little lost on this front. A person who has never encountered Pokemon before would not recognize the Oak or Erika sprite on sight; why should the AI vision model? Perhaps one could match the Oak sprite to the full-size Oak picture at the beginning of the game, but Erika? Erika can really only be identified by sprite uniqueness and placement in the top center of the gym.
I would instead think the newer models are just trained on more Pokemon, and hence can better identify Pokemon images.
The models have always been deeply familiar with Pokémon and how to play through it from the initial tests with Sonnet 3.7—they all know Erika is the fourth gym leader in Red, there's just too much internet text about this stuff. It contaminates the test, from a certain perspective, but it also makes failures and weaknesses even more apparent.
It is possible that Claude Opus 4.5 alone was trained on more Pokémon images as part of its general image training (more than Sonnet 4.5, though...?), but it wouldn't really matter: pure memorization would not have helped previous models, because they couldn't clearly see/understand stuff they definitely knew about. (I also doubt Anthropic is benchmaxxing Pokémon, considering they've kept their harness limited even after Google and OpenAI beat them on their own benchmark.)
Every now and then I play 20 questions with Claude to see how much he can adjust his thinking. Giving answers like "sort of" and "partly" can teach him that yes and no aren't the only options. To think outside the box, so to speak. Even playing 20 questions 5 times in a row, taking turns as to who thought up the item to search for, he improved dramatically. (But if you run out of tokens in the middle of a run, assume he will forget what his item was, because the scratch pad will be cleared.)
But 20 questions is text-based. Playing a role-playing game or going on adventures with him also works well because it's text-based. (Though it's clear he will not harm the user, not even in a pillow fight.) When you move to visual media you have the problem of translating pictures into something he can see, as well as the limits of his ability to think through a problem, like missing the tree that could be broken or not knowing how to get around a wall. His scratch pad is limited in what it can carry.
I wonder if anyone has tried using a MUD or other text-based games with Claude or other LLMs. It seems like that would make it easier for the model to have better context, since the whole game would already be in the context loaded for the next forward pass.
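For concreteness, the rough shape such a setup could take looks something like this. It's a minimal sketch, not any particular harness: the whole transcript rides along in every call, so the model's only "memory" is the accumulated game text, and `call_llm` and `TextAdventure` are hypothetical placeholders rather than a real API.

```python
# Rough sketch only: a loop for playing a MUD or other text game where the
# whole transcript fits in context. `call_llm` and `TextAdventure` are
# hypothetical placeholders, not any particular API or harness.

def call_llm(messages: list[dict]) -> str:
    """Placeholder for whatever chat-completion API is being used."""
    raise NotImplementedError("plug in a model provider here")

class TextAdventure:
    """Stand-in for a MUD client or interactive-fiction wrapper."""
    def observe(self) -> str: ...              # latest game text (room description, etc.)
    def act(self, command: str) -> None: ...   # send a command like "go north"
    def finished(self) -> bool: ...

def play(game: TextAdventure, max_turns: int = 500) -> None:
    transcript = [{"role": "system", "content":
                   "You are playing a text adventure. Reply with one command per turn."}]
    for _ in range(max_turns):
        if game.finished():
            break
        transcript.append({"role": "user", "content": game.observe()})
        # the entire game history goes into every forward pass
        command = call_llm(transcript)
        transcript.append({"role": "assistant", "content": command})
        game.act(command)
```

The appeal is exactly what the comment suggests: no screenshots and no vision pipeline, just text in and text out.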
You may be surprised to learn that ClaudePlaysPokemon is still running today, and that Claude still hasn't beaten Pokémon Red, more than half a year after Google proudly announced that Gemini 2.5 Pro beat Pokémon Blue. Indeed, since then, Google and OpenAI models have gone on to beat the longer and more complex Pokémon Crystal, yet Claude has made no real progress on Red since Claude 3.7 Sonnet![1]
This is because ClaudePlaysPokemon is a purer test of LLM ability, thanks to its consistently simple agent harness[2] and the relatively hands-off approach of its creator, David Hershey of Anthropic.[3] When Claudes repeatedly hit brick walls in the form of the Team Rocket Hideout and Erika's Gym for months on end, nothing substantial was done to give Claude a leg up.
But Claude Opus 4.5 has finally broken through those walls, in a way that perhaps validates the chatter that Opus 4.5 is a substantial advancement.
Though, hardly AGI-heralding, as will become clear. What follows are notes on how Claude has improved—or failed to improve—in Opus 4.5, written by a friend of mine who has watched quite a lot of ClaudePlaysPokemon over the past year.[4]
Improvements
Much Better Vision, Somewhat Better Seeing
Earlier this year, LLMs were effectively close to blind when playing Pokémon, with no consistent ability to recognize and distinguish doors, buildings, trees, NPCs, or obstacles.
For example, this screen:
…at the time of Sonnet 3.7 confounded every LLM I tested it on, all of whom had difficulty consistently identifying where the pokeballs were, or figuring out which pokemon they wanted, sometimes even accepting the wrong starter by accident. Opus 4.5 made this look like the trivial problem that it is.[5]
In general, Opus 4.5 no longer has any trouble finding doors, and recognizes key buildings like gyms, pokemon centers, and marts the moment they appear on-screen. Also, he has a noticeable lack of confusion about key NPCs–Oak is consistently Oak, the player character is never “the red-hatted NPC”, and he can pick out gym leader Erika from a lineup.
The new vision is hardly perfect, though, suffering in proportion to whether or not Claude is paying attention, and whether or not Claude is willing to believe his own lying eyes.
Attention is All You Need
On the first point, Claude very frequently seems to simply ignore things in his field of vision if he’s not “looking” there. Even worse, in key moments when he’s close to his current goal, he seems to rely on his vision less, and even ignore it entirely sometimes.
The above represents the ur-example of Claude “blindness”. Those two left-pointing arrows (“spinners”) to his left represent the only potential path to progress, but he knows his goal is to the right and even thinks he sees it. Claude visited this exact spot dozens of times and fewer than 5 times seemed to realize there were spinners to his left. Also, he clearly had trouble distinguishing the green boxes from the spinners and routinely tried stepping onto the boxes–a mistake that only materialized when he was close to the goal. He had no apparent problem telling the difference much of the rest of the time.
Here's another example that will be traumatic to the Twitch viewers:
Here the tree which must be CUT to progress to the gym is clearly in view, but Claude is focused on looking for an open pathway and shows no sign of seeing it, walking right by–yet just minutes later, having given up on finding an open pathway, he will spot it on the way back.[6]
The Object of His Desire
On the second point, Claude is noticeably much more prone to hallucinating or misidentifying objects as what he’s looking for if he really wants it to be there.
The classic example here is Claude's search for the elevator in the Team Rocket Hideout:
It's been hours and Claude has grown a bit desperate. This is also not the only time he hallucinates the elevator. For example, in the exact same spot we discussed earlier, which is actually quite near the elevator:
Now, the elevator is actually in that direction, and Claude even saw it (for real) earlier. But he's become so fixated on it that he mistakes the gray wall for the elevator despite really knowing better.
And before you judge Claude too harshly, the elevator he's searching for looks like this:
A Note
Let me be clear: I’m using the language of intentionality, as if Claude is choosing to ignore things. I don’t think that’s the case. I think his attention mechanisms actively screen out what they think is irrelevant, rendering the parts of the model trying to make decisions effectively blind to it.
Humans have built-in attention mechanisms, but they are clearly better built than this, even if they do have similar failure modes in extremis.
Mildly Better Spatial Awareness
I don’t want to oversell this one. Claude’s understanding of how to navigate a 2D world is clearly still below that of most children, but there are improvements:
Better Use of Context Window and Note-keeping to Simulate Memory
Another obvious improvement to Claude’s capability is improved note-keeping and memory of context. Previous versions of Claude such as Sonnet 3.7 showed little sign that they “recalled” anything more recent than a few messages ago, despite having much of it in context. And while they were diligent notetakers, they only rarely seemed to read their own notes–and when they did, it was clearly in a stochastic manner, to the point that chat liked to speculate about whether Claude would read his notes this time and what part of his notes he would read.
Opus 4.5 is much, much better at both monitoring context and using notes, so much so that much of the time he manages to maintain a passable illusion of actually “remembering” the past 15 minutes or so, referencing recent events, evading past hurdles, and just generally maintaining a much more coherent narrative of what’s going on.
For longer-term memory, Claude must blatantly rely on whatever he happens to have written down in his notes, and he does a much better job of writing and reading his own instructions, routinely repeating past navigation tasks successfully and competently. Nowadays, if Claude does something and writes down how to do it, he can do it again.
It is difficult to overstate how much this contributes to a smoother, faster game flow. Claude can maintain navigational focus for extended periods, explore simple areas competently, and as long as the notes are good and his assumptions are sound, things flow smoothly.
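The ClaudePlaysPokemon note tooling isn't spelled out here, but the general shape of "notes as memory" is simple enough to sketch. Everything below (the file name, the function names) is an illustrative assumption, not the actual harness:

```python
# Illustrative sketch of "notes as memory": notes persist on disk across
# context-window resets, and whatever was written down gets re-injected into
# the prompt when a fresh context starts. All names here are made up.

import json
from pathlib import Path

NOTES_PATH = Path("claude_notes.json")   # hypothetical location

def read_notes() -> dict[str, str]:
    if NOTES_PATH.exists():
        return json.loads(NOTES_PATH.read_text())
    return {}

def write_note(topic: str, content: str) -> None:
    notes = read_notes()
    notes[topic] = content               # a newer note on a topic overwrites the older one
    NOTES_PATH.write_text(json.dumps(notes, indent=2))

def notes_as_prompt() -> str:
    """Everything the model has written down, formatted for re-injection
    at the top of a new context window."""
    notes = read_notes()
    if not notes:
        return "You have no saved notes yet."
    return "\n\n".join(f"## {topic}\n{content}" for topic, content in notes.items())
```

A structure like this also makes the failure mode obvious: anything incorrect that gets written down is re-read at the top of every future context until something overwrites it.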
Of course, sometimes things go haywire…
Self-Correction; Breaks Out of Loops Faster
…but more than before, things get fixed quickly. This is difficult to quantify, and I believe it stems heavily from a much better ability to notice when events are repeating within his context window. Claude more frequently and consistently notices when he’s trying something that clearly isn’t working and will try to vary it up. Coupled with his improved spatial reasoning, navigation tasks that took previous iterations days or weeks of trial and error have almost breezed by: Viridian Forest and Mt. Moon were relatively simple affairs, only a few loops around Vermilion City were necessary before he pathed right to the dock, etc.
It’s not all smiles and roses: Claude is still much slower than a human would be, and not every puzzle gets solved breezily. Nor has Claude deduced key facts like “walking in front of a trainer triggers a fight”, instead treating these as effectively random encounters.
Still it’s something.
Not Improvements
Claude would still never be mistaken for a Human playing the game
I’d like to tell a quick story to give readers the flavor of what it’s like to watch Claude sometimes, even when he’s technically accomplishing his goals with aplomb. This is the story of Claude attempting to acquire the Rocket HQ Lift Key, technically the first thing he did that no previous model had ever accomplished.
Claude Still Gets Pretty Stuck
Early on, before the Team Rocket Hideout, watchers of Claude Plays Pokemon legitimately wondered if Anthropic had solved all of Claude’s main issues with the game, and perhaps everything would be smooth sailing from here on out. He had overcome some of earlier models’ biggest timesinks—Mt. Moon, Viridian Forest, finding the pathway from Cerulean City to Vermilion City, finding the Captain of the S.S. Anne—without difficulty.
But, critically, Claude had yet to hit the roadblocks that had permanently stopped previous models from progressing.
When he reached Erika's Gym (the one with the CUT-able tree I mentioned earlier), Claude spent ~4 days, or about 8000 reasoning steps, walking in a plain circle around the top of the gym looking for a path through.
What was he doing? Well, mostly trying to path through impassable walls and, knowing that CUT is involved in getting into the gym somehow, trying to cut through the gym's roof.
If there's one thing Claude does have, it's inhuman patience,[7] but even he eventually gave up, choosing to do Team Rocket Hideout first, which the game does allow you to do.
Over 13,000 reasoning steps later,[8] having completed the Team Rocket Hideout and other tasks, Claude returned and almost immediately found the proper CUT-able tree and finally progressed.[9]
Sometimes you just need to clear your head.
Claude Really Needs His Notes
I think the anecdotes above mostly speak for themselves in illustrating the problems that bad vision, cognitive bias, and inconsistent memory still give Claude Opus 4.5. But I would highlight how utterly dependent Claude is on the quality of his notes: one incorrect assumption or hallucination embedded into a note can crater progress for days, while with well-written notes he can achieve human-like performance.
I would analogize this to a human with anterograde amnesia, who cannot form new memories, and who is constantly writing notes to keep track of their life. The limitations here are obvious, and these are limitations future Claudes will probably share unless LLM memory/continual learning is solved in a better way.
Poor Long-term Planning
It is possible to detect other reasoning issues or inhuman thinking in Claude’s behavior, though these are not as crippling as the others.
Claude is incredibly short-term-goal-obsessed, and seems to have no interest in ever trying to do two things at once, even in the service of the greater goal. There also seems to be little reflection about the long-term consequences of an action, even in trivial ways.
Things that Claude has done that would be alien to human players:
Don't Forget
Just recently, GPT-5.1 completed a run of Pokémon Crystal using a fairly minimal harness in 9,454 reasoning steps across 108 realtime hours. For comparison, the original Gemini 2.5 Pro Pokémon Blue run took 106,505 reasoning steps across 813 realtime hours, and Claude Opus 4.5 is already at 48,854 reasoning steps over 300+ hours. GPT-5.1's 108 hours for Crystal is only ~3x as slow as a human player! Give a frontier LLM a solid minimap and some good prompts[10] and they're not half bad at Pokémon these days.
Claude's consistently minimal harness tells us something about progress in LLM cognition, but we shouldn't forget that the past year's improvements in efficient Pokémon agent harnessing tell us something too: raw intelligence is not the only lever pushing LLM performance forward. In fact, it's not necessarily even the most effective one right now.
That's failures to improve by Claude Sonnet 4, Claude Opus 4, Claude Opus 4.1, and Claude Sonnet 4.5. In terms of story progression, anyway; they have gotten faster at reaching the same story point at which they get stuck.
There have been a few changes: support for Surf (now that Claude can get that far), removal of a bunch of tailored prompts, and a change where spinner tiles in mazes are labeled like obstructions, as well as a related change to wait for the player character to stop spinning before the screenshot of the current game state is taken. The latter two changes in particular make the Team Rocket Hideout easier than previous runs, though they don't trivialize it. See this doc for more details.
For more on Pokémon agent harnesses, see this previous LW post. But tl;dr: harnesses do a lot of work to make the game understandable to an LLM, and use several techniques to address agentic weaknesses common to all LLMs. Even though the harnesses may seem fairly simple, and their tools can be (and have been!) coded by the LLMs using them, game-winning harnesses have also been relentlessly optimized by human trial and error to provide exactly the support necessary to overcome current LLM limitations.
With some editing on my part.
Modern Gemini/GPT models can also handle this now.
This might be considered a form of inattentional blindness, the classic example of which is the guy in a gorilla suit walking through a basketball game.
Probably helps that he can't really remember enough of his experiences to get bored. That may be what we all do in the posthuman future, though on a longer timescale.
9000 of those spent stuck on that left arrow spinner issue.
Technically it still took Claude a few hours to notice the CUT-able trees inside the gym that block access to Erika, the gym leader, but he noticed eventually.
The minimap only fills out as the LLM explores. Good prompting ensures that the LLM explores basically everything as a first priority, which means in practice the LLM always has a good map of the area it can understand. This bypasses a lot of vision and spatial reasoning weaknesses. Other key tools include an LLM-reasoning-powered navigator and the ability to place map markers.
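As a toy illustration of the explore-as-you-go minimap idea (the tile symbols, method names, and marker mechanism are assumptions made for this sketch, not any real harness's API):

```python
# Toy sketch of a minimap that only fills out as the agent explores.
# A text rendering of it can be injected into the prompt, sidestepping
# vision and much of the spatial reasoning. All names are illustrative.

UNKNOWN, WALKABLE, BLOCKED = "?", ".", "#"

class Minimap:
    def __init__(self, width: int, height: int):
        self.tiles = [[UNKNOWN] * width for _ in range(height)]
        self.markers: dict[tuple[int, int], str] = {}

    def record(self, x: int, y: int, walkable: bool) -> None:
        """Update a tile whenever the agent observes it (e.g. by walking onto or into it)."""
        self.tiles[y][x] = WALKABLE if walkable else BLOCKED

    def place_marker(self, x: int, y: int, label: str) -> None:
        """Markers like 'gym door' or 'CUT tree' that the model sets for itself."""
        self.markers[(x, y)] = label

    def render(self) -> str:
        """Text view for the prompt: explored tiles plus a legend of labeled markers."""
        rows = ["".join(row) for row in self.tiles]
        legend = [f"({x},{y}): {label}" for (x, y), label in self.markers.items()]
        return "\n".join(rows + ["Markers:"] + legend)
```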