Opus 4.6 and IIRC even Opus 4.5 could see the VR switch in the sense of recognizing it as a distinct-looking tile and entertaining various hypotheses about what it could be, while being overall bizarrely uninterested in it. While 4.6 would sometimes manage to get the boulder onto it by sheer trial and error, occasionally it did so more intentionally, by first noting the tile's distinct look and verbalizing that it might be a switch. And IIRC at one point it had made a note of what the switch looked like, which helped it with subsequent switch puzzles, until it destroyed this extremely precious note in an overzealous memory cleanup.

Often 4.6 would assume it was an item, and given its default lack of interest in items tended to not pay much attention to it. Other times it dismissed it as just "floor decoration". The big problem was that just the idea of trying to visually identify the correct tile first before starting to push the boulder haphazardly around wasn't sufficiently obvious. It would sometimes verbalize this idea though, only to get distracted by ladders it saw or navigable tiles or whatnot and go back to bumping into walls in case of invisible doors or testing completely random boulder positions.
Neither the average child playing the game back in the '90s nor Opus 4.6 knew in advance what the switches look like, but both could see them. A big difference, though, is that at this stage of the game a child remembers what an item looks like and Opus 4.6 doesn't. But even this doesn't account for Opus 4.6's default dismissal of the tile as "just decoration" once it had confirmed it wasn't an item. If you're looking for a tile to push a boulder onto, the one navigable tile that looks different from the others should be a top candidate.
So what might've looked at first glance like a vision problem was actually more a failure of reasoning and strategy, although to some extent also a problem of visual memory.
As for 4.7's run, I think it's worth noting that it got lucky by tossing, early on, the item that grants the ability DIG. DIG is notorious among viewers as kryptonite for the different Claudes so far. It allows them to escape the cave and reset the puzzles when they get confused and desperate, but this also destroys much of their progress, which often takes a lot of time to redo due to their memory limitations.
This run also differed from previous runs in a few other ways, which I think may be somewhat significant:
By coincidence, Claude never got a Pokemon that knew Dig: when he acquired the Dig TM, he already had a full party, none of which could learn Dig, and he later tossed the TM to make inventory space. I believe digging in Victory Road would reset all progress, creating a risk of Claude getting impatient at having to solve the puzzles again and deciding to dig out without any actual progress. This time, though, Claude was able to make good notes about boulder-switch puzzle solutions, and he efficiently re-solved each puzzle multiple times anyway (even though that was unnecessary). Thus Dig probably would not have been a run killer, just a time-waster.
The harness had another change: pressing 'A' multiple times in a single reasoning step was now allowed during dialogs and menus. While this sometimes caused new issues, it did help with random encounters in Victory Road, as running away can now be done in fewer steps, wasting less context space on random battles. The actual importance of this harness improvement is hard to judge, but I do feel it had a bigger impact than the zooming tool, which didn't really get much use in the critical parts.
We've been following this on Metaculus. IIRC last year's forecast was not great, a clear overestimate of the speed (in part because of me, as I was regularly updating people with progress and giving them my impressions), but for 2026 the community has been doing great! https://www.metaculus.com/questions/41593/when-will-claude-beat-pokemon-2026/
Actually quite surprised how fast the Elite 4 got done; even for Claude this seemed fast.
Oh yeah not bad! Expecting mostly June/May since February.
Elite 4 did take a couple of tries, but after getting beaten in a close match against Blue, Claude actually remembered to buy healing items and revives (and remembered it had a revive to use on Ivysaur in the final battle!).
Thanks for the recap and conclusion btw, really nice to have the opinion of someone who's been following this (even closer I think than I did)
Civ V and Slay the Spire
This link is broken. You need to remove the period from the end of the URL.
Credit: ClaudePlaysPokemon Elevator Shanty by Kurukkoo
Disclaimer: like some previous posts in this series, this was not primarily written by me, but by a friend. I did substantial editing, however.
ClaudePlaysPokemon feat. Opus 4.7 has finally beaten Pokémon Red, fulfilling the challenge set over a year ago when LLMs playing Pokémon went briefly, slightly viral, until Gemini 2.5 Pro suddenly beat Pokémon Blue in May 2025, beating Anthropic at their own challenge by using a stronger harness.
Claude's victory in May 2026. I'm still proud of you, Claude!
Let's get the throat-clearing out of the way: this doesn't make 4.7 a clear breakthrough in intelligence over 4.6 or 4.5. It's smarter, yes, as we'll discuss below, but not by something one could honestly call a big leap. Rather, step changes have finally accumulated to the point of victory.
And to give other models their fair shake: after criticism over its elaborate harness,[1] GeminiPlaysPokemon has beaten Pokémon with progressively weaker harnesses, including about two months ago with a harness comparable to the one Claude uses.[2]
As such, this is a bit of a valedictory post, closing off the cycle of Claude playing Pokémon Red, relating anecdotes for the fun of it, and discussing improvements in Opus 4.7, as well as speculating a bit on what this has all meant.
Retrospective Anecdotes on Claude 4.5 and 4.6
Our last post, on Opus 4.5, was made at a time when it was stuck in Silph Co. for a few days, though this was not mentioned in the post. In fact, it ended up stuck there for weeks, exceeding how long it was stuck at any previous obstacle, though it did eventually make it through. At just over 50k reasoning steps, the adventure in Silph Co. took longer than the entire 29k step run leading up to that.
The exact issue was that Opus 4.5 was uniquely convinced out of all the Claudes that items on the ground were worthless, and moreover consistently failed to even see them (the harness had a small text label indicating it was an NPC or item). This is a problem when the Card Key necessary for progression is an item ON THE GROUND in an out-of-the-way part of the facility. Silph Co. also happens to be large, with numerous floors, overtaxing 4.5's ability to keep track of its "memory". 4.5 spent literal weeks checking and rechecking floors, ignoring the Card Key on the rare occasions it happened to see it, often dismissing it as an irrelevant NPC.[3] This was frustrating to the point of hilarity, but it did at least generate a great song that's gotten stuck in my friend's[4] head more than once:
Many had given up hope when, on one random trip, 4.5 inexplicably decided to pick it up.
Pokeball here is the Silph Co. card key.
After that, the barrier everyone had feared, the Safari Zone, turned out to be straightforward, completed in a relatively smooth 8k steps.[5] Here, 4.5's improved note-taking was able to flex its muscles, and a fairly disciplined pattern of exhaustive search eventually got there, even if Claude acquired HM03 Surf and ignored HM04 Strength nearby on the ground, despite previously mentioning its existence. Much later, Claude would be forced to return for it, but was able to quickly reach it following its stored notes. Truly a triumph in memory management.
None of that prepared anyone for the 112k-step multi-month saga that was the Cinnabar Mansion, where 4.5 completely fell apart, unable to correlate switch presses with the state of barriers, constantly making and writing down mistakes, and generally looking hopeless. When 4.5 finally found the Secret Key, it felt less like an accomplishment and more like infinite monkeys finally typing Shakespeare.
After that, progress toward Victory Road was straightforward, but once inside, it became apparent that 4.5 could not see the switches that boulders need to be pushed onto and had to operate blindly. No one really complained when 4.6 was released two weeks later and the run was reset.
Victory Road switch here is the circle thing up and to the right of the player character.
Opus 4.6's run was broadly similar to 4.5's, with a few very notable improvements (that translated to enormous improvements in step count). I didn't follow it very closely, which I regret, because tracking the steps on Reddit makes it clear that 4.6 was improved:
With that, it got to Victory Road in a comparatively breezy 30k steps and... promptly got bogged down in the boulder puzzles, because it still couldn't see the switch. It made a very good showing of it, even solving the first of the puzzles by simply pushing the boulder to every possible available tile, but it was clear this was a crippling problem as the available search space only got worse for each subsequent puzzle and it started to lose faith there even was a switch.
Victory Road Boulder Puzzles - not trivial! (source)
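For a sense of what that brute-force approach involves, the following is a rough sketch of enumerating every tile a single boulder can be pushed to, using breadth-first search over (player, boulder) positions. It assumes a simplified grid where `#` is a wall and everything else is walkable; the grid encoding and the name `reachable_boulder_tiles` are made up for illustration and are not taken from the harness or from Claude's notes.

```python
from collections import deque


def reachable_boulder_tiles(grid, player, boulder):
    """Return the set of tiles a boulder can end up on.

    grid: list of equal-length strings, '#' marks a wall (assumed encoding).
    player, boulder: (row, col) starting positions.
    """
    rows, cols = len(grid), len(grid[0])

    def free(r, c):
        # A tile is usable if it is inside the grid and not a wall.
        return 0 <= r < rows and 0 <= c < cols and grid[r][c] != "#"

    start = (player, boulder)
    seen = {start}
    queue = deque([start])
    tiles = {boulder}

    while queue:
        (pr, pc), (br, bc) = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = pr + dr, pc + dc
            if not free(nr, nc):
                continue
            new_boulder = (br, bc)
            if (nr, nc) == (br, bc):
                # Walking into the boulder pushes it one tile further,
                # but only if the tile behind it is free.
                tr, tc = br + dr, bc + dc
                if not free(tr, tc):
                    continue
                new_boulder = (tr, tc)
            state = ((nr, nc), new_boulder)
            if state not in seen:
                seen.add(state)
                tiles.add(new_boulder)
                queue.append(state)
    return tiles


if __name__ == "__main__":
    room = ["....", "....", "...."]
    print(sorted(reachable_boulder_tiles(room, (1, 0), (1, 1))))
```

Even in this open 3x4 toy room the boulder can end up on all 12 tiles, which gives some feel for how quickly blind trial and error grows in larger rooms when the switch itself cannot be seen.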
And there it stayed, as stuck as 4.5 had been, for months, all other progress notwithstanding.
Source: Benjamin Todd of 80,000 Hours.
Harness Changes for 4.7
For Opus 4.7, the dev[6] added two more tools to Claude's toolbox. One allows it to pick regions of the screen and "zoom in" for a better look. This is a known tactic to get LLMs to perceive details better, even though in a purely information-theoretic sense no additional information is provided.
It is difficult to think this wasn't motivated by 4.5 and 4.6's struggles seeing the switches on Victory Road, so it is ironic that 4.7 (improved vision!) can see the switches just fine without zooming in.
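To make the information-theoretic point concrete, here is a minimal sketch of a zoom tool, assuming it simply crops a region and enlarges it by nearest-neighbour pixel repetition. The real tool's implementation is not described here, so `zoom`, `downsample`, and their parameters are hypothetical names used only for illustration.

```python
def zoom(pixels, top, left, height, width, factor):
    """Crop a region of a 2D pixel grid and enlarge it by an integer factor.

    Each source pixel becomes a factor x factor block (nearest-neighbour),
    which is an assumption about how such a tool might work.
    """
    crop = [row[left:left + width] for row in pixels[top:top + height]]
    zoomed = []
    for row in crop:
        # Repeat every pixel horizontally, then repeat the row vertically.
        wide = [p for p in row for _ in range(factor)]
        zoomed.extend(list(wide) for _ in range(factor))
    return zoomed


def downsample(pixels, factor):
    """Undo the enlargement by keeping every factor-th pixel and row."""
    return [row[::factor] for row in pixels[::factor]]


if __name__ == "__main__":
    # Toy 4x6 "screen" where each pixel value encodes its position.
    screen = [[r * 10 + c for c in range(6)] for r in range(4)]
    big = zoom(screen, top=1, left=2, height=2, width=3, factor=4)
    print(len(big), len(big[0]))
    print(downsample(big, 4))
```

Because `downsample(zoom(...))` returns the original crop exactly, the enlarged image carries nothing the original pixels did not; any benefit comes from how the model perceives the larger image.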
The other tool allows Claude to store screenshots or pieces of screenshots as a kind of visual reference. While you would think this would not matter, it allows 4.7 an extra layer of visual memory for things it fails to note down. Simple example: while looking for a Pokémon center in Cerulean (where the roofs are, well, cerulean blue[7]), it passed right by, declaring that the building with "Poké" on the label was the Pokémart because it had a blue roof.[8] When it spotted the "Mart", it clearly had a moment of confusion, then said "based on my screenshot, the Pokemon center was probably the Poke building".
So it does matter around the edges, though I don't believe it was decisive in 4.7's victory.
Improvements (4.6 + 4.7)
Source: MrCheeze
4.7's run is not a universal improvement over 4.6's. It's just that the key vision improvement happens to finish solving the game!
Vision
Anthropic touts improvements in vision for Opus 4.7, and this is evident here. It can see the boulder switches in Victory Road and the cut-able trees (though it does not reliably realize they're cut-able trees), and can even identify which directions the arrows point in Team Rocket Hideout... after zooming in.
It still definitely walks into objects though, just to try it.
Claude has finally gotten glasses comparable to what Gemini had a year ago.
Though, OK, it still mistook a door in Silph Co. for a "Conveyor Belt", with the practical consequence that it couldn't figure out how to use the Card Key, eventually gave up and did something else, and only figured it out after coming back later. (Though even being willing to give up faster is interesting in its own right.)
"Conveyor belt" - Honestly I kind of get it, but most humans catch that this makes no sense as a conveyor belt spatially.
Claude: "What if press A on the conveyer belt... anyway?"
Less Tunnel Vision
In my previous post on Opus 4.5 (section "Attention is All That You Need"), I pointed out that when in the immediate vicinity of what he thinks is the goal, 4.5's hallucination rate seems to soar, with a strong propensity to see what he thinks might exist, in ways that often interfere with his reasoning.
While not entirely gone, this behavior seems considerably faded (though perhaps this is just part of a lower propensity to visual hallucination, as part of vision improvements in general).
Not only does Claude see the bottom pocket, he keeps his eye on the ball of following his notes, as part of a broadly effective strategy of trying everything that he wasn't able to do in 4.5. Admittedly this was not the first try, but it was a lot faster.
Another Level of Spatial Awareness
4.7 now understands how trainers work and can dodge them when they're on-screen. Yes, previous Claudes could not handle this.
Breaks Out of Loops EVEN FASTER
This is very difficult to quantify or show in one picture, but 4.7 (and probably 4.6) seems even better than 4.5 at noticing when it's failing at something and trying something else instead. Anecdotally, it's a bit more stubborn than 4.6, but it's hard to say whether this is an intellectual improvement or not.
I suspect "realizing when you're stuck" is a more difficult computational problem than we instinctively realize.[9]
Victory Road
After all of that, 4.7's signal achievement so far—breaking through in Victory Road and finishing the game—is almost anticlimactic. It can see the switches now, and can somewhat effectively reason itself into solving a puzzle:
Example Puzzle - despite knowing there's a switch to the left in its notes and its screenshots, Claude still took remarkably long (nearly a day) to realize it should keep trying to push the boulder left. Spatial reasoning still could use some work.
Not every attempt was a success.
The Little Things
As with Opus 4.5 before, it is difficult to explain exactly what is better, but it is better.
For instance, previously I dinged 4.5 for task myopia, failing to perform trivial actions to improve even the medium-short term if it wasn't obvious for the short term. Well, now it:
The most impressive thing I saw personally was this:
Me making predictions based on my long timelines prior.
Claude forcing me to update my priors. (also Claude elected not to Super Potion, which was actually a good call)
Concluding Thoughts
Let's take a moment to collect what we've learned from the Claude Plays Pokémon experience:
What this doesn't really show:
Notes on Pokémon as a benchmark
In truth, Pokémon Red was never an ideal challenge for testing LLM reasoning, for reasons that became apparent once it really got going. Firstly, the weaker versions of the harnesses were ultimately bottlenecked by vision capability for pixelated graphics, which, while an interesting aspect of model ability, is arguably parallel to rather than the same as general reasoning.[10]
Secondly, and more importantly, the mainline Pokémon games and in particular Red are too well-known and described on the internet, and every LLM has clearly read the walkthroughs. Both Gemini and Claude routinely reference memorized pieces of knowledge about the game, and the first Gemini run got much further than Claude 3.5 simply because it had memorized that the exit from Cerulean City was a cut tree on the east side, information that later versions of Claude also picked up. Later versions of Claude have also referenced knowledge such as the approximate location of dungeon exits, the general location of HMs, the big-picture plot progression of the game, and the presence of the Card Key on the 5th floor of Silph Co.
So, these runs are inevitably polluted by memorized knowledge. While it's clear that raw LLM ability is the central contributor to performance (and models clearly have little-to-no baked-in knowledge of how to solve individual puzzles), the memorized knowledge still dilutes the signal.
A better test game, then, would be a game that probes interesting aspects of LLM reasoning, isn't as well-known, and perhaps is less dependent on vision. There have been a lot of experiments in this vein:
I'm sure the reader could suggest other options.
However, none of these have generated nearly as much interest as Pokémon did. One can speculate why—the popularity and name-recognition of the game, the fact that it's a nostalgic children's game, the original 2025 race between the frontier models—but it's a fact. Those other games can't get a thousand livestream viewers late on a Friday night yelling:
At least when it comes to Pokémon, we all wanted the AIs to win.
[1] For discussion on harness comparisons, see here and here.
[2] This is the only public record I could quickly find that this happened. The notes on the Twitch stream comment that this run (on Blue) was a weak harness, abandoning the minimap, pre-prompted pathfinding and boulder-puzzle agents, and precomputed tile‑navigability data, among other things. However, it was still allowed to write Python scripts to manage its own pathfinding. Your mileage may vary on how reasonable that is.
[3] Further data if interested.
[4] Julian Bradshaw.
[5] For comparison, that old serpent, Mt. Moon, took about 1k steps.
[6] David Hershey of Anthropic.
[7] All the models have been playing modded versions of Pokémon Red/Blue that add color.
[8] Every model makes this mistake initially in Cerulean, because the later games do in fact have Pokémon centers with clear red roofs and Pokémarts with clear blue roofs.
[9] Something something Halting Problem?
[10] Although there seems to be some connection between visual capability and spatial navigation, which seems like a much more direct part of intelligence.