I would analogize this to a human with anterograde amnesia, who cannot form new memories, and who is constantly writing notes to keep track of their life. The limitations here are obvious, and these are limitations future Claudes will probably share unless LLM memory/continual learning is solved in a better way.
This is an extremely underrated comparison, TBH. Indeed, I'd argue that frozen weights + lack of a long-term memory are easily one of the biggest reasons why LLMs are much more impressive than useful at a lot of tasks (with reliability being another big, independent issue).
It emphasizes two things that are both true at once. First, LLMs do in fact reason like humans and can have (poor-quality) world-models, and there's no fundamental chasm between LLM capabilities and human capabilities that can't be cured by unlimited resources/time. Second, just as humans with anterograde amnesia are usually much less employable/useful to others than people who do have long-term memory, current AIs are much, much less employable than future paradigm AIs.
You may be surprised to learn that ClaudePlaysPokemon is still running today, and that Claude still hasn't beaten Pokémon Red, more than half a year after Google proudly announced that Gemini 2.5 Pro beat Pokémon Blue. Indeed, since then, Google and OpenAI models have gone on to beat the longer and more complex Pokémon Crystal, yet Claude has made no real progress on Red since Claude 3.7 Sonnet![1]
This is because ClaudePlaysPokemon is a purer test of LLM ability, thanks to its consistently simple agent harness and the relatively hands-off approach of its creator, David Hershey of Anthropic.[2] When Claudes repeatedly hit brick walls in the form of the Team Rocket Hideout and Erika's Gym for months on end, nothing substantial was done to give Claude a leg up.
But Claude Opus 4.5 has finally broken through those walls, in a way that perhaps validates the chatter that Opus 4.5 is a substantial advancement.
Hardly AGI-heralding, though, as will become clear. What follows are notes on how Claude has improved (or failed to improve) in Opus 4.5, written by a friend of mine who has watched quite a lot of ClaudePlaysPokemon over the past year.[3]
Earlier this year, LLMs were effectively close to blind when playing Pokémon, with no consistent ability to recognize and distinguish doors, buildings, trees, NPCs, or obstacles.
For example, this screen:
…at the time of Sonnet 3.7 confounded every LLM I tested it on, all of whom had difficulty consistently identifying where the pokeballs were, or figuring out which pokemon they wanted, sometimes even accepting the wrong starter by accident. Opus 4.5 made this look like the trivial problem that it is.[4]
In general, Opus 4.5 no longer has any trouble finding doors, and recognizes key buildings like gyms, pokemon centers, and marts the moment they appear on-screen. Also, he has a noticeable lack of confusion about key NPCs–Oak is consistently Oak, the player character is never “the red-hatted NPC”, and he can pick out gym leader Erika from a lineup.
The new vision is hardly perfect, though: how well it works depends on whether Claude is paying attention, and on whether he is willing to believe his own lying eyes.
On the first point, Claude very frequently seems to simply ignore things in his field of vision if he’s not “looking” there. Even worse, in key moments when he’s close to his current goal, he seems to rely on his vision less, and even ignore it entirely sometimes.
The above represents the ur-example of Claude “blindness”. Those two left-pointing arrows (“spinners”) to his left represent the only potential path to progress, but he knows his goal is to the right and even thinks he sees it. Claude visited this exact spot dozens of times and fewer than 5 times seemed to realize there were spinners to his left. Also, he clearly had trouble distinguishing the green boxes from the spinners and routinely tried stepping onto the boxes–a mistake that only materialized when he was close to the goal. He had no apparent problem telling the difference much of the rest of the time.
Here's another example that will be traumatic to the Twitch viewers:
Here the tree which must be CUT to progress to the gym is clearly in view, but Claude is focused on looking for an open pathway and shows no sign of seeing it, walking right by–yet just minutes later he will spot it on the way back, having given up looking for an open pathway.[5]
On the second point, Claude is noticeably much more prone to hallucinating or misidentifying objects as what he’s looking for if he really wants it to be there.
The classic example here is Claude's search for the elevator in the Team Rocket Hideout:
It's been hours and Claude has grown a bit desperate. This is also not the only time he hallucinates the elevator. For example, in the exact same spot we discussed earlier, which is actually quite near the elevator:
Now, the elevator is actually in that direction, and Claude even saw it (for real) earlier. But he's become so fixated on it that he mistakes the gray wall for the elevator despite really knowing better.
And before you judge Claude too harshly, the elevator he's searching for looks like this:
Let me be clear: I’m using the language of intentionality, as if Claude is choosing to ignore things. I don’t think that’s the case. I think his attention mechanisms actively screen out what they think is irrelevant, rendering the parts of the model trying to make decisions effectively blind to it.
Humans have built-in attention mechanisms, but they are clearly better built than this, even if they do have similar failure modes in extremis.
I don’t want to oversell this one. Claude’s understanding of how to navigate a 2D world is clearly still below that of most children, but there are improvements:
Another obvious improvement to Claude’s capability is improved note-keeping and memory of context. Previous versions of Claude such as Sonnet 3.7 showed little sign that they “recalled” anything more recent than a few messages ago, despite having much of it in context. And while they were diligent notetakers, they only rarely seemed to read their own notes–and when they did, it was clearly in a stochastic manner, to the point that chat liked to speculate about whether Claude would read his notes this time and what part of his notes he would read.
Opus 4.5 is much, much better at both monitoring context and using notes, so much so that much of the time he manages to maintain a passable illusion of actually “remembering” the past 15 minutes or so, referencing recent events, evading past hurdles, and just generally maintaining a much more coherent narrative of what’s going on.
For longer-term memory, Claude must blatantly rely on whatever he happens to have written down in his notes, and he does a much better job of writing and reading his own instructions, routinely repeating past navigation tasks successfully and competently. Nowadays, if Claude does something and writes down how to do it, he can do it again.
It is difficult to overstate how much this contributes to a smoother, faster game flow. Claude can maintain navigational focus for extended periods, explore simple areas competently, and as long as the notes are good and his assumptions are sound, things flow smoothly.
Of course, sometimes things go haywire…
…but more than before, things get fixed quickly. This is difficult to quantify, and I believe it stems heavily from a much better ability to notice when events are repeating within his context window. Claude more frequently and consistently notices when he’s trying something that clearly isn’t working and will try to vary it up. Coupled with his improved spatial reasoning, navigation tasks that took previous iterations days or weeks of trial and error have almost breezed by: Viridian Forest and Mt. Moon were relatively simple affairs, only a few loops around Vermilion City were necessary before he pathed right to the dock, etc.
It’s not all smiles and roses: Claude is still much slower than a human would be, and not every puzzle gets solved breezily. Nor has Claude deduced key facts like “walking in front of a trainer triggers a fight”, instead treating these as effectively random encounters.
Still, it’s something.
I’d like to tell a quick story to give readers the flavor of what it’s like to watch Claude sometimes, even when he’s technically accomplishing his goals with aplomb. This is the story of Claude attempting to acquire the Rocket HQ Lift Key, technically the first thing he did that no previous model had ever accomplished.
Early on, before the Team Rocket Hideout, watchers of ClaudePlaysPokemon legitimately wondered if Anthropic had solved all of Claude’s main issues with the game, and whether everything would be smooth sailing from here on out. He had overcome some of earlier models’ biggest timesinks (Mt. Moon, Viridian Forest, finding the pathway from Cerulean City to Vermilion City, finding the Captain of the S.S. Anne) without difficulty.
But, critically, Claude had yet to hit the roadblocks that had permanently stopped previous models from progressing.
When he reached Erika's Gym (the one with the CUT-able tree I mentioned earlier), Claude spent ~4 days, or about 8000 reasoning steps, walking in a plain circle around the top of the gym looking for a path through.
What was he doing? Well, mostly trying to path through impassable walls and, knowing that CUT is involved in getting into the gym somehow, trying to cut through the gym's roof.
If there's one thing Claude does have, it's inhuman patience,[6] but even he eventually gave up, choosing to do Team Rocket Hideout first, which the game does allow you to do.
Over 13,000 reasoning steps later,[7] having completed the Team Rocket Hideout and other tasks, Claude returned and almost immediately found the proper CUT-able tree and finally progressed.[8]
Sometimes you just need to clear your head.
I think the anecdotes above mostly speak for themselves in illustrating the problems bad vision, cognitive bias, and inconsistent memory still give Claude Opus 4.5. But I would highlight how utterly dependent Claude is on the quality of his notes: One incorrect assumption or hallucination embedded into a note can crater progress for days, while a well-written note can achieve human-like performance.
I would analogize this to a human with anterograde amnesia, who cannot form new memories, and who is constantly writing notes to keep track of their life. The limitations here are obvious, and these are limitations future Claudes will probably share unless LLM memory/continual learning is solved in a better way.
It is possible to detect other reasoning issues or inhuman thinking in Claude’s behavior, though these are not as crippling as the others.
Claude is incredibly short-term-goal-obsessed, and seems to have no interest in ever trying to do two things at once, even in the service of the greater goal. There also seems to be little reflection about the long-term consequences of an action, even in trivial ways.
Things that Claude has done that would be alien to human players:
Just recently, GPT-5.1 completed a run of Pokémon Crystal using a fairly minimal harness in 9,454 reasoning steps across 108 realtime hours. For comparison, the original Gemini 2.5 Pro Pokémon Blue run took 106,505 reasoning steps across 813 realtime hours, and Claude Opus 4.5 is already at 48,854 reasoning steps over 300+ hours. GPT-5.1's 108 hours for Crystal is only ~3x as slow as a human player! Give a frontier LLM a solid minimap and some good prompts[9] and it's not half bad at Pokémon these days.
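For readers who want to compare the runs on a common scale, here is a quick back-of-the-envelope sketch using only the figures quoted above. The helper names are made up for illustration, and treating Claude's "300+ hours" as exactly 300 is an assumption, so his steps-per-hour number is an upper bound.

```python
# Figures quoted in the paragraph above: (reasoning steps, realtime hours).
RUNS = {
    "GPT-5.1 (Crystal)": (9_454, 108),
    "Gemini 2.5 Pro (Blue)": (106_505, 813),
    # "300+ hours" is treated as exactly 300 here. That is an assumption,
    # so this run's steps-per-hour figure is an upper bound.
    "Claude Opus 4.5 (Red, in progress)": (48_854, 300),
}


def steps_per_hour(steps: int, hours: float) -> float:
    """Average number of reasoning steps taken per realtime hour."""
    return steps / hours


def implied_human_hours(llm_hours: float, slowdown: float) -> float:
    """Human playtime implied by a claim like '~3x as slow as a human'."""
    return llm_hours / slowdown


for name, (steps, hours) in RUNS.items():
    print(f"{name}: {steps_per_hour(steps, hours):.1f} steps/hour")

# The "~3x as slow" claim implies a human Crystal run of roughly this many hours.
print(f"Implied human Crystal run: {implied_human_hours(108, 3.0):.0f} hours")
```

Note that the three runs cover different games and different harnesses, so these rates describe pacing, not a like-for-like ranking.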
Claude's consistently minimal harness tells us something about progress in LLM cognition, but we shouldn't forget that the past year's improvements in efficient Pokémon agent harnessing tell us something too: raw intelligence is not the only lever pushing LLM performance forward. In fact, it's not necessarily even the most effective one right now.
That covers failures to improve by Claude Sonnet 4, Claude Opus 4, Claude Opus 4.1, and Claude Sonnet 4.5, at least in terms of story progression. They have gotten faster at reaching the story point at which they get stuck.
For more on Pokémon agent harnesses, see this previous LW post. But tl;dr harnesses do a lot of work to make the game understandable to an LLM, and use several techniques to address agentic weaknesses common to all LLMs. Even though the harnesses may seem fairly simple, and can have (and have had!) their tools coded by the LLMs using them, game-winning harnesses have also been relentlessly optimized by human trial and error to provide exactly the support necessary to overcome current LLM limitations.
With some editing on my part.
Modern Gemini/GPT models can also handle this now.
This might be considered a form of inattentional blindness, the classic example of which is the guy in a gorilla suit walking through a basketball game.
Probably helps that he can't really remember enough of his experiences to get bored. That may be what we all do in the posthuman future, though on a longer timescale.
9000 of those spent stuck on that left arrow spinner issue.
Technically it still took Claude a few hours to notice the CUT-able trees inside the gym that block access to Erika, the gym leader, but he noticed eventually.
The minimap only fills out as the LLM explores. Good prompting ensures that the LLM explores basically everything as a first priority, which means in practice the LLM always has a good map of the area it can understand. This bypasses a lot of vision and spatial reasoning weaknesses. Other key tools include an LLM-reasoning-powered navigator and the ability to place map markers.
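To make the minimap idea concrete, here is a minimal sketch of a map that starts blank, fills in only as tiles are observed, and supports markers. This is not the code of any actual harness: the class name, tile symbols, and the `frontier` helper are all hypothetical, and the LLM-powered navigator is left out entirely.

```python
from dataclasses import dataclass, field

# Hypothetical tile symbols; real harnesses may encode maps differently.
UNKNOWN, FLOOR, WALL = "?", ".", "#"


@dataclass
class Minimap:
    width: int
    height: int
    tiles: dict = field(default_factory=dict)    # (x, y) -> FLOOR or WALL
    markers: dict = field(default_factory=dict)  # (x, y) -> label

    def observe(self, visible: dict) -> None:
        """Record the tiles currently on screen; earlier observations are kept."""
        self.tiles.update(visible)

    def place_marker(self, pos: tuple, label: str) -> None:
        """Attach a free-text label to a position, e.g. a door or an NPC."""
        self.markers[pos] = label

    def _in_bounds(self, pos: tuple) -> bool:
        x, y = pos
        return 0 <= x < self.width and 0 <= y < self.height

    def frontier(self) -> list:
        """Known floor tiles that border at least one unseen in-bounds tile."""
        result = []
        for (x, y), tile in self.tiles.items():
            if tile != FLOOR:
                continue
            neighbors = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
            if any(self._in_bounds(n) and n not in self.tiles for n in neighbors):
                result.append((x, y))
        return sorted(result)

    def render(self) -> str:
        """Text view of the map: 'M' for markers, '?' for unexplored tiles."""
        rows = []
        for y in range(self.height):
            row = ""
            for x in range(self.width):
                if (x, y) in self.markers:
                    row += "M"
                else:
                    row += self.tiles.get((x, y), UNKNOWN)
            rows.append(row)
        return "\n".join(rows)


minimap = Minimap(width=4, height=3)
minimap.observe({(0, 0): FLOOR, (1, 0): FLOOR, (0, 1): WALL})
minimap.place_marker((1, 0), "door")
print(minimap.render())
print(minimap.frontier())
```

A prompt that tells the model to keep walking toward frontier tiles until none remain is one plausible way to get the "explore basically everything first" behavior described above, though that pairing is my assumption.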