I would analogize this to a human with anterograde amnesia, who cannot form new memories, and who is constantly writing notes to keep track of their life. The limitations here are obvious, and these are limitations future Claudes will probably share unless LLM memory/continual learning is solved in a better way.
This is an extremely underrated comparison, TBH. Indeed, I'd argue that frozen weights + lack of a long-term memory are easily one of the biggest reasons why LLMs are much more impressive than useful at a lot of tasks (with reliability being another big, independent issue).
It emphasizes two things that are both true at once: LLMs do in fact reason like humans and can have (poor-quality) world-models, with no fundamental chasm between LLM capabilities and human capabilities that couldn't be closed by unlimited resources/time; and yet, just as humans with anterograde amnesia are usually much less employable/useful to others than people who do have long-term memory, current AIs are much, much less employable/useful than future-paradigm AIs.
I'm curious to see how well LLMs can play Slay the Spire. I could actually try that manually and see what happens.
Neuro-sama (the LLM-based AI VTuber) beat the game some time ago. As the code isn't open, though, it's not possible to confirm whether the StS AI was done with LLMs. Would definitely be interesting to see how frontier LLMs perform!
In Claude's first try, it played Ironclad on Ascension 1 and died to Hexaghost, the Act 1 boss. It wasn't terrible but occasionally got the mechanics a little bit mixed up.
Thanks for following through.
If anyone wants to make a proper harness in the future, I think probably the most interesting question here is if the LLM can learn from multiple playthroughs, unlocking harder difficulties, etc.
Modern LLMs, maybe through notetaking?
Interesting how much it's relying on having information in training data and being able to look stuff up. I wonder how it would do with a "blind" playthrough of a game that didn't previously exist.
This benchmark includes a Slay the Spire environment! When it was written, Gemini 2.5 did the best, getting roughly halfway through a non-Ascension run.
Curated. I appreciate this post's concreteness.
It can be hard to really understand what numbers in a benchmark mean. To do so, you have to be pretty familiar with the task distribution, which is often a little surprising. And, if you are bothering to get familiar with it, you probably already know how the LLM performs. So it's hard to be sure you're judging the difficulty accurately, rather than using your sense of the LLM's intelligence to infer the task difficulty.
Fortunately, a Pokémon game involves a bunch of different tasks, and I'm pretty familiar with them from childhood gameboy sessions. So LLM performance on the game can provide some helpful intuitions about LLM performance in general. Of course, you don't get all the niceties of statistical power and so on, but I still find it a helpful data source to include.
This post does a good job abstracting some of the subskills involved and provides lots of deliciously specific examples for the claims. It's also quite entertaining!
One thing I found fascinating about watching Claude play is that it wouldn't play around and experiment the way I'd expect a human to. It would stand still trying to work out what to do next, move one square up, consider a long time, move one square down, and repeat, when I'd expect a human to immediately get bored and go as far as they could in all directions to see what was there and try interacting with everything. Maybe some cognitive analogue of boredom is useful for avoiding loops?
That is in fact a defect of these models, and one of the things that makes you kind of want to scream at the screen when it, say, doesn't walk more than 20 tiles to the right after having spent days looking for an entrance that is 30 tiles to the right. Or when it doesn't explore the bottom left of a room, where the answer is, because it's convinced that's not where it is.
Or the fact that it's in week 2 of Silph Co., convinced that what it's looking for is not an item on the ground and therefore not picking up any items even when walking right next to them, when in fact its goal is an item on the ground.
It's really interesting to compare how Opus 4.5 is performing on Pokemon, versus how it performs in Claude Code.
One of the big factors here is surely vision: Gemini is one of the best visual LLMs by a wide margin, and I strongly suspect Google does lots of training on specific vision tasks. Even so, 2.0 and 2.5 underperformed human 7-year-olds on many simple tasks on which Gemini hasn't been trained. In comparison, Claude has some visual abilities, but I can't remember ever reaching for them for any serious project. And it sounds like this is affecting lots of things in Pokemon.
Opus 4.5 really is quite good at programming, enough that I'm passing into the "emotional freakout about the inevitable singularity" stage of grief. But Opus lives and dies by giant piles of Markdown files. It generates them, I read them, I make small corrections, and it continues. I think this is Opus 4.5's happy place, and within this circumscribed area, it's a champ. It can write a thousand lines of good Rust in a day, no problem, with some human feedback and code review. And if your process concentrates knowledge into Markdown files, it gets better.
So this is my current hypothesis:
It's kind of nice to imagine an AI future where the AIs are enormously capable, but that capability is only unlocked by a bit of occasional human interaction and guidance. Sadly, I think that's only a passing phase, and the real future is going to be much weirder, and future AIs won't need a human to say, "That weird rug thing is actually the elevator," and the AI to reply, "Oh, good observation! That simplifies this problem considerably."
GPT-5.1 beating Crystal in 108 hours is very interesting. I wonder why that's the case compared to Gemini 3 Pro, which took ~424.5 hours. Do you have any thoughts?
Bunch of reasons:
Unfortunately no one has done full playthrough comparisons on the same harness for all models, due to time and expense (the main developers of the Claude/Gemini/GPT harnesses each only have access to free tokens for their own model brand). Perhaps this will become possible sometime next year as completion time drops? (Cost per token might drop too, but perhaps not for frontier models.)
We saw some examples while working on terminal bench where, if the agent is pressured with a deadline, it freaks out and acts less rationally. Some of your examples remind me of that: being close to the objective and becoming obsessed with it at the expense of intermediate steps.
I have been experimenting with having stock AI agents compete against each other in Warhammer 10th edition and have found similar problems. Deepseek was telling me units could make shot distances that clearly were not possible by rules it knew. The 'ignoring things that are in front of it' observation here is funny to me because Microsoft Co-Pilot was saying to put units so close together it was impossible. I gave it a grid map in coordinate form. It was ignoring things it put there itself.
I also told Deepseek that I had to play but knew nothing of the game, and was doing so because my friend insulted Chongqing hotpot, saying Chengdu's is better. It themed my whole Space Wolves army as soup-based and wrote text as if it was really into it.
Warhound Titan, "Defender of Simmering Broth" (1100 pts)
Microsoft Co-pilot was pretty boring and accountant-like.
(I am at Ithaca College if anyone wants to participate, it is fun)
For comparison, Pokemon Red in Twitch Plays Pokemon, which was basically just decision making implemented as a race condition between thousands to tens of thousands of different humans at every decision making step, took 16 days, 7 hours, 50 minutes, 19 seconds.
Also, he has a noticeable lack of confusion about key NPCs–Oak is consistently Oak, the player character is never “the red-hatted NPC”, and he can pick out gym leader Erika from a lineup.
I'm a little lost on this front. A person who has never encountered Pokemon before would not recognize the Oak or Erika sprite on sight; why should the AI vision model? Perhaps one could match the Oak sprite to the full-size Oak picture at the beginning of the game, but Erika? Erika can really only be identified by sprite uniqueness and placement in the top center of the gym.
I would instead think the newer models are just trained on more Pokemon, and hence can better identify Pokemon images.
The models have always been deeply familiar with Pokémon and how to play through it from the initial tests with Sonnet 3.7—they all know Erika is the fourth gym leader in Red, there's just too much internet text about this stuff. It contaminates the test, from a certain perspective, but it also makes failures and weaknesses even more apparent.
It is possible that Claude Opus 4.5 alone was trained on more Pokémon images as part of its general image training (more than Sonnet 4.5, though...?), but it wouldn't really matter: pure memorization would not have helped previous models, because they couldn't clearly see/understand stuff they definitely knew about. (I also doubt Anthropic is benchmaxxing Pokémon, considering they've kept their harness limited even after Google and OpenAI beat them on their own benchmark.)
Every now and then I play 20 questions with Claude to see how much he can adjust his thinking. Giving answers like "sort of" and "partly" can teach him that yes and no aren't the only options. To think outside the box, so to speak. Even playing 20 questions 5 times in a row, taking turns as to who thought up the item to search for, he improved dramatically. (But if you run out of tokens in the middle of a run, assume he will forget what his item was, because the scratch pad will be cleared.)
But 20 questions is text-based. Playing a role-playing game or going on adventures with him also works well because it's text-based. (Though it's clear he will not harm the user, not even in a pillow fight.) When you move to visual media you have the problem of translating pictures into something he can see, as well as the limits of his ability to think through a problem, like missing the tree that could be broken or not knowing how to get around a wall. His scratch pad is limited in what it can carry.
I wonder if anyone has tried using a MUD or other text-based games with Claude or other LLMs. It seems like that would make it easier for the model to have better context, since the whole game would already be in the context loaded for the next forward pass.
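For concreteness, the rough shape such a setup could take looks something like this. It's a minimal sketch, not any particular harness: the whole transcript rides along in every call, so the model's only "memory" is the accumulated game text, and `call_llm` and `TextAdventure` are hypothetical placeholders rather than a real API.

```python
# Rough sketch only: a loop for playing a MUD or other text game where the
# whole transcript fits in context. `call_llm` and `TextAdventure` are
# hypothetical placeholders, not any particular API or harness.

def call_llm(messages: list[dict]) -> str:
    """Placeholder for whatever chat-completion API is being used."""
    raise NotImplementedError("plug in a model provider here")

class TextAdventure:
    """Stand-in for a MUD client or interactive-fiction wrapper."""
    def observe(self) -> str: ...              # latest game text (room description, etc.)
    def act(self, command: str) -> None: ...   # send a command like "go north"
    def finished(self) -> bool: ...

def play(game: TextAdventure, max_turns: int = 500) -> None:
    transcript = [{"role": "system", "content":
                   "You are playing a text adventure. Reply with one command per turn."}]
    for _ in range(max_turns):
        if game.finished():
            break
        transcript.append({"role": "user", "content": game.observe()})
        # the entire game history goes into every forward pass
        command = call_llm(transcript)
        transcript.append({"role": "assistant", "content": command})
        game.act(command)
```

The appeal is exactly what the comment suggests: no screenshots and no vision pipeline, just text in and text out.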
You may be surprised to learn that ClaudePlaysPokemon is still running today, and that Claude still hasn't beaten Pokémon Red, more than half a year after Google proudly announced that Gemini 2.5 Pro beat Pokémon Blue. Indeed, since then, Google and OpenAI models have gone on to beat the longer and more complex Pokémon Crystal, yet Claude has made no real progress on Red since Claude 3.7 Sonnet![1]
This is because ClaudePlaysPokemon is a purer test of LLM ability, thanks to its consistently simple agent harness[2] and the relatively hands-off approach of its creator, David Hershey of Anthropic.[3] When Claudes repeatedly hit brick walls in the form of the Team Rocket Hideout and Erika's Gym for months on end, nothing substantial was done to give Claude a leg up.
But Claude Opus 4.5 has finally broken through those walls, in a way that perhaps validates the chatter that Opus 4.5 is a substantial advancement.
Though, hardly AGI-heralding, as will become clear. What follows are notes on how Claude has improved—or failed to improve—in Opus 4.5, written by a friend of mine who has watched quite a lot of ClaudePlaysPokemon over the past year.[4]
Improvements
Much Better Vision, Somewhat Better Seeing
Earlier this year, LLMs were effectively close to blind when playing Pokémon, with no consistent ability to recognize and distinguish doors, buildings, trees, NPCs, or obstacles.
For example, this screen:
…at the time of Sonnet 3.7 confounded every LLM I tested it on, all of whom had difficulty consistently identifying where the pokeballs were, or figuring out which pokemon they wanted, sometimes even accepting the wrong starter by accident. Opus 4.5 made this look like the trivial problem that it is.[5]
In general, Opus 4.5 no longer has any trouble finding doors, and recognizes key buildings like gyms, pokemon centers, and marts the moment they appear on-screen. Also, he has a noticeable lack of confusion about key NPCs–Oak is consistently Oak, the player character is never “the red-hatted NPC”, and he can pick out gym leader Erika from a lineup.
The new vision is hardly perfect, though, suffering in proportion to whether or not Claude is paying attention, and whether or not Claude is willing to believe his own lying eyes.
Attention is All You Need
On the first point, Claude very frequently seems to simply ignore things in his field of vision if he’s not “looking” there. Even worse, in key moments when he’s close to his current goal, he seems to rely on his vision less, and even ignore it entirely sometimes.
The above represents the ur-example of Claude “blindness”. Those two left-pointing arrows (“spinners”) to his left represent the only potential path to progress, but he knows his goal is to the right and even thinks he sees it. Claude visited this exact spot dozens of times and fewer than 5 times seemed to realize there were spinners to his left. Also, he clearly had trouble distinguishing the green boxes from the spinners and routinely tried stepping onto the boxes–a mistake that only materialized when he was close to the goal. He had no apparent problem telling the difference much of the rest of the time.
Here's another example that will be traumatic to the Twitch viewers:
Here the tree which must be CUT to progress to the gym is clearly in view, but Claude is focused on looking for an open pathway and shows no sign of seeing it, walking right by–yet just minutes later, having given up on finding an open pathway, he will spot it on the way back.[6]
The Object of His Desire
On the second point, Claude is noticeably much more prone to hallucinating or misidentifying objects as what he’s looking for if he really wants it to be there.
The classic example here is Claude's search for the elevator in the Team Rocket Hideout:
It's been hours and Claude has grown a bit desperate. This is also not the only time he hallucinates the elevator. For example, in the exact same spot we discussed earlier, which is actually quite near the elevator:
Now, the elevator is actually in that direction, and Claude even saw it (for real) earlier. But he's become so fixated on it that he mistakes the gray wall for the elevator despite really knowing better.
And before you judge Claude too harshly, the elevator he's searching for looks like this:
A Note
Let me be clear: I’m using the language of intentionality, as if Claude is choosing to ignore things. I don’t think that’s the case. I think his attention mechanisms actively screen out what they think is irrelevant, rendering the parts of the model trying to make decisions effectively blind to it.
Humans have built-in attention mechanisms, but they are clearly better built than this, even if they do have similar failure modes in extremis.
Mildly Better Spatial Awareness
I don’t want to oversell this one. Claude’s understanding of how to navigate a 2D world is clearly still below that of most children, but there are improvements:
Better Use of Context Window and Note-keeping to Simulate Memory
Another obvious improvement to Claude’s capability is improved note-keeping and memory of context. Previous versions of Claude such as Sonnet 3.7 showed little sign that they “recalled” anything more recent than a few messages ago, despite having much of it in context. And while they were diligent notetakers, they only rarely seemed to read their own notes–and when they did, it was clearly in a stochastic manner, to the point that chat liked to speculate about whether Claude would read his notes this time and what part of his notes he would read.
Opus 4.5 is much, much better at both monitoring context and using notes, so much so that much of the time he manages to maintain a passable illusion of actually “remembering” the past 15 minutes or so, referencing recent events, evading past hurdles, and just generally maintaining a much more coherent narrative of what’s going on.
For longer-term memory, Claude must blatantly rely on whatever he happens to have written down in his notes, and he does a much better job of writing and reading his own instructions, routinely repeating past navigation tasks successfully and competently. Nowadays, if Claude does something and writes down how to do it, he can do it again.
It is difficult to overstate how much this contributes to a smoother, faster game flow. Claude can maintain navigational focus for extended periods, explore simple areas competently, and as long as the notes are good and his assumptions are sound, things flow smoothly.
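The ClaudePlaysPokemon note tooling isn't spelled out here, but the general shape of "notes as memory" is simple enough to sketch. Everything below (the file name, the function names) is an illustrative assumption, not the actual harness:

```python
# Illustrative sketch of "notes as memory": notes persist on disk across
# context-window resets, and whatever was written down gets re-injected into
# the prompt when a fresh context starts. All names here are made up.

import json
from pathlib import Path

NOTES_PATH = Path("claude_notes.json")   # hypothetical location

def read_notes() -> dict[str, str]:
    if NOTES_PATH.exists():
        return json.loads(NOTES_PATH.read_text())
    return {}

def write_note(topic: str, content: str) -> None:
    notes = read_notes()
    notes[topic] = content               # a newer note on a topic overwrites the older one
    NOTES_PATH.write_text(json.dumps(notes, indent=2))

def notes_as_prompt() -> str:
    """Everything the model has written down, formatted for re-injection
    at the top of a new context window."""
    notes = read_notes()
    if not notes:
        return "You have no saved notes yet."
    return "\n\n".join(f"## {topic}\n{content}" for topic, content in notes.items())
```

A structure like this also makes the failure mode obvious: anything incorrect that gets written down is re-read at the top of every future context until something overwrites it.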
Of course, sometimes things go haywire…
Self-Correction; Breaks Out of Loops Faster
…but more than before, things get fixed quickly. This is difficult to quantify, and I believe it stems heavily from a much better ability to notice when events are repeating within his context window. Claude more frequently and consistently notices when he’s trying something that clearly isn’t working and will try to vary it up. Coupled with his improved spatial reasoning, navigation tasks that took previous iterations days or weeks of trial and error have almost breezed by: Viridian Forest and Mt. Moon were relatively simple affairs, only a few loops around Vermilion City were necessary before he pathed right to the dock, etc.
It’s not all smiles and roses: Claude is still much slower than a human would be, and not every puzzle gets solved breezily. Nor has Claude deduced key facts like “walking in front of a trainer triggers a fight”, instead treating these as effectively random encounters.
Still it’s something.
Not Improvements
Claude would still never be mistaken for a Human playing the game
I’d like to tell a quick story to give readers the flavor of what it’s like to watch Claude sometimes, even when he’s technically accomplishing his goals with aplomb. This is the story of Claude attempting to acquire the Rocket HQ Lift Key, technically the first thing he did that no previous model had ever accomplished.
Claude Still Gets Pretty Stuck
Early on, before the Team Rocket Hideout, watchers of Claude Plays Pokemon legitimately wondered if Anthropic had solved all of Claude’s main issues with the game, and perhaps everything would be smooth sailing from here on out. He had overcome some of earlier models’ biggest timesinks—Mt. Moon, Viridian Forest, finding the pathway from Cerulean City to Vermilion City, finding the Captain of the S.S. Anne—without difficulty.
But, critically, Claude had yet to hit the roadblocks that had permanently stopped previous models from progressing.
When he reached Erika's Gym (the one with the CUT-able tree I mentioned earlier), Claude spent ~4 days, or about 8000 reasoning steps, walking in a plain circle around the top of the gym looking for a path through.
What was he doing? Well, mostly trying to path through impassable walls and, knowing that CUT is involved in getting into the gym somehow, trying to cut through the gym's roof.
If there's one thing Claude does have, it's inhuman patience,[7] but even he eventually gave up, choosing to do Team Rocket Hideout first, which the game does allow you to do.
Over 13,000 reasoning steps later,[8] having completed the Team Rocket Hideout and other tasks, Claude returned and almost immediately found the proper CUT-able tree and finally progressed.[9]
Sometimes you just need to clear your head.
Claude Really Needs His Notes
I think the anecdotes above mostly speak for themselves in illustrating the problems that bad vision, cognitive bias, and inconsistent memory still give Claude Opus 4.5. But I would highlight how utterly dependent Claude is on the quality of his notes: one incorrect assumption or hallucination embedded into a note can crater progress for days, while with well-written notes he can achieve human-like performance.
I would analogize this to a human with anterograde amnesia, who cannot form new memories, and who is constantly writing notes to keep track of their life. The limitations here are obvious, and these are limitations future Claudes will probably share unless LLM memory/continual learning is solved in a better way.
Poor Long-term Planning
It is possible to detect other reasoning issues or inhuman thinking in Claude’s behavior, though these are not as crippling as the others.
Claude is incredibly short-term-goal-obsessed, and seems to have no interest in ever trying to do two things at once, even in the service of the greater goal. There also seems to be little reflection about the long-term consequences of an action, even in trivial ways.
Things that Claude has done that would be alien to human players:
Don't Forget
Just recently, GPT-5.1 completed a run of Pokémon Crystal using a fairly minimal harness in 9,454 reasoning steps across 108 realtime hours. For comparison, the original Gemini 2.5 Pro Pokémon Blue run took 106,505 reasoning steps across 813 realtime hours, and Claude Opus 4.5 is already at 48,854 reasoning steps over 300+ hours. GPT-5.1's 108 hours for Crystal is only ~3x as slow as a human player! Give a frontier LLM a solid minimap and some good prompts[10] and they're not half bad at Pokémon these days.
Claude's consistently minimal harness tells us something about progress in LLM cognition, but we shouldn't forget that the past year's improvements in efficient Pokémon agent harnessing tell us something too: raw intelligence is not the only lever pushing LLM performance forward. In fact, it's not necessarily even the most effective one right now.
That's failures to improve by Claude Sonnet 4, Claude Opus 4, Claude Opus 4.1, and Claude Sonnet 4.5. In terms of story progression, anyway; they have gotten faster at reaching the same story point at which they get stuck.
There have been a few changes: support for Surf (now that Claude can get that far), removal of a bunch of tailored prompts, and a change where spinner tiles in mazes are labeled like obstructions, as well as a related change to wait for the player character to stop spinning before the screenshot of the current game state is taken. The latter two changes in particular make the Team Rocket Hideout easier than previous runs, though they don't trivialize it. See this doc for more details.
For more on Pokémon agent harnesses, see this previous LW post. But tl;dr: harnesses do a lot of work to make the game understandable to an LLM, and use several techniques to address agentic weaknesses common to all LLMs. Even though the harnesses may seem fairly simple, and their tools can be (and have been!) coded by the LLMs using them, game-winning harnesses have also been relentlessly optimized by human trial and error to provide exactly the support necessary to overcome current LLM limitations.
With some editing on my part.
Modern Gemini/GPT models can also handle this now.
This might be considered a form of inattentional blindness, the classic example of which is the guy in a gorilla suit walking through a basketball game.
Probably helps that he can't really remember enough of his experiences to get bored. That may be what we all do in the posthuman future, though on a longer timescale.
9000 of those spent stuck on that left arrow spinner issue.
Technically it still took Claude a few hours to notice the CUT-able trees inside the gym that block access to Erika, the gym leader, but he noticed eventually.
The minimap only fills out as the LLM explores. Good prompting ensures that the LLM explores basically everything as a first priority, which means in practice the LLM always has a good map of the area it can understand. This bypasses a lot of vision and spatial reasoning weaknesses. Other key tools include an LLM-reasoning-powered navigator and the ability to place map markers.
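As a toy illustration of the explore-as-you-go minimap idea (the tile symbols, method names, and marker mechanism are assumptions made for this sketch, not any real harness's API):

```python
# Toy sketch of a minimap that only fills out as the agent explores.
# A text rendering of it can be injected into the prompt, sidestepping
# vision and much of the spatial reasoning. All names are illustrative.

UNKNOWN, WALKABLE, BLOCKED = "?", ".", "#"

class Minimap:
    def __init__(self, width: int, height: int):
        self.tiles = [[UNKNOWN] * width for _ in range(height)]
        self.markers: dict[tuple[int, int], str] = {}

    def record(self, x: int, y: int, walkable: bool) -> None:
        """Update a tile whenever the agent observes it (e.g. by walking onto or into it)."""
        self.tiles[y][x] = WALKABLE if walkable else BLOCKED

    def place_marker(self, x: int, y: int, label: str) -> None:
        """Markers like 'gym door' or 'CUT tree' that the model sets for itself."""
        self.markers[(x, y)] = label

    def render(self) -> str:
        """Text view for the prompt: explored tiles plus a legend of labeled markers."""
        rows = ["".join(row) for row in self.tiles]
        legend = [f"({x},{y}): {label}" for (x, y), label in self.markers.items()]
        return "\n".join(rows + ["Markers:"] + legend)
```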