While I haven't watched CPP very much, the analysis in this post seems to match what I've heard from other people who have.
That said, I think claims like
So, how's it doing? Well, pretty badly. Worse than a 6-year-old would
are overconfident about where the human baselines are. Moreover, I think these sorts of claims reflect a general blindspot about how humans can get stuck on trivial obstacles in the same way AIs do.
A personal anecdote: when I was a kid (maybe 3rd or 4th grade, so 8 or 9 years old) I played Pokemon Red and couldn't figure out how to get out of the first room—same as the Claude 3.0 Sonnet performance! Why? Well, is it obvious to you where the exit to this room is?
Answer: you have to stand on the carpet and press down.
Apparently this was a common issue! See this reddit thread for discussion of people who hit the same snag as me. In fact, it was a big enough issue that it was addressed in the FireRed remake, which makes the rug stick out a bit:
I don't think this is an isolated issue with the first room. Rather, I think that as railroaded as Pokemon might seem, there's actually a bunch of things that it's easy to get crucially confused about, resulting in getting totally s...
Further, have you ever gotten an adult who doesn't normally play video games to try playing one? They have a tendency to get totally stuck in tutorial levels because game developers rely on certain "video game motifs" for load-bearing forms of communication; see e.g. this video.
So much +1 on this.
Also, I've played a ton of games, and in the last few years started helping a bit with playtesting them etc. And I found it striking how games aren't inherently intuitive, but are rather made so via strong economic incentives, endless playtests to stop players from getting stuck, etc. Games are intuitive for humans because humans spend a ton of effort to make them that way. If AIs were the primary target audience, games would be made intuitive for them.
And as a separate note, I'm not sure what the appropriate human reference class for game-playing AIs is, but I challenge the assumption that it should be people who are familiar with games. Rather than, say, people picked at random from anywhere on earth.
Should maybe restrict it to someone who has read all the documentation and discussion for the game that exists on the internet.
I'm not sure. I remember playing a bunch of games, like Pokemon HeartGold, Lego Star Wars, and some other Pokemon game where you were controlling little Pokemon in 3rd person instead of controlling a human who threw pokeballs (anyone know that game?)
And like, I didn't speak English when I played them. So I had to figure out everything by just pressing random buttons and seeing responses. And this makes it a lot more difficult. Like, I could open my "inventory" (didn't know what that was) and then use a "healing potion" (didn't know what that was), and then because my pokemon was at full health already, I would think the healing potion was useless, or think that items in the inventory only cause text to appear on the screen but don't have any effect on the actual game. And I'd believe this until I accidentally clicked the inventory and randomly saw a change, or had failed a level so many times that I was getting desperate and just manually doing exhaustive search over all the actions.
But like, I'm very confident I was more action-efficient than Claude is. Mostly because, like, if I enter a battle and fail 5 times more or less in the same way, you start to think something...
as a human, you're much better at getting unstuck
I'm not sure! Or well, I agree that 7-year-old me could get unstuck by virtue of having an "additional tool" called "get frustrated and cry until my mom took pity and helped."[1] But we specifically prevent Claude from doing stuff like that!
I think it's plausible that if we took an actual 6-year-old and asked them to play Pokemon on a Twitch stream, we'd see many of the things you highlight as weaknesses of Claude: getting stuck against trivial obstacles, forgetting what they were doing, and—yes—complaining that the game is surely broken.
TBC this is exaggerated for effect—I don't remember actually doing this for Pokemon. And—to your point—I probably did eventually figure out on my own most of the things I remember getting stuck on.
pokemon is a simple, railroady enough game that RNG can beat the game given enough time (and this has been done)
This is not true. It would take an absurd amount of time.
I agree: if you've ever played any of the Pokemon games, it's clear that a true uniform distribution over actions would not finish in any time a human could ever observe; the time would have to be galactic. There are just way too many bottlenecks, long trajectories, and reset points, including various ways to near-guarantee (or guarantee?) failure, like discarding items or Pokemon. And if you've looked at any Pokemon AI projects, or even just Twitch Plays Pokemon, this becomes apparent: they struggle to get out of Pallet Town in a reasonable time, never mind the absurdity of playing through the rest of the game and beating the Elite Four etc., and that's with much smarter move selection than pure random.
This basically sums up how it's doing: https://www.reddit.com/r/ClaudePlaysPokemon/comments/1j568ck/the_mount_moon_experience
Of course, much of that is basic capability issues: poor spatial reasoning, short-term memory that doesn't come anywhere close to lasting for one lap, etc.
But I've also noticed ways in which Claude's personality is sabotaging it. Claude is capable of taking notes saying that it "THOROUGHLY confirmed NO passages" through the eastern barrier - but never gets impatient or frustrated, so this doesn't actually prevent it from trying the same thing every time it sees the eastern wall again.
And in general, it seems to have a strong bias towards visiting places that are mentioned frequently in its notes - even though that's the exact opposite of what you should be doing for exploration. I've seen it reach the uncommonly reached second ladder on the floor, and then promptly decide it needs to run back to the first ladder (which it has seen hundreds of times) to see whether the first ladder goes anywhere.
And it should definitely be mentioned that run #1 was mercy killed when its knowledge base was populated almost entirely with falsehoods both about how far it had progressed in the game and how to get further, leading to a singleminded obsession with exploring the southern wall of Cerulean City forever.
And now in the second run it has entered a similar delusional loop. It knows the way to Cerulean City is via Route 4, but the route before and after Mt. Moon are both considered part of Route 4. Therefore it deluded itself into thinking it can get to Cerulean from the first part of the route. Because of that, every time it accidentally stumbles into Mt Moon and is making substantial progress towards the exit, it intentionally blacks out to get teleported back outside the entrance, so it can look for the nonexistent path forwards.
From what I've seen on stream, the chances of it questioning and breaking from this delusion are basically zero. There's still the possibility of progress by getting lost in Mt Moon and stumbling into the exit, but it will never actually figure out what it was doing wrong here.
People in the stream chat and subreddit have been discussing this paper suggesting that LLM agents often get into these "meltdown" loops that they aren't able to recover from: https://www.reddit.com/r/ClaudePlaysPokemon/comments/1j65jqf/vendingbench_a_benchmark_for_longterm_coherence
Also, the stream admin seemed to think the same thing, saying during the first run that "some runs just are cursed" and setting up a poll for whether to reset the game.
Update: Claude made it to Cerulean City today, after wandering the Mt. Moon area for 69 hours.
This is convincing evidence LLMs are far from AGI.
Eventually, one of the labs will solve it, a bunch of people will publicly update, and I’ll point out that actually the entire conversation about how an LLM should beat Pokémon was in the training data, the scaffolding was carefully set up to keep it on rails in this specific game, the available action set etc is essentially feature selection, etc.
Seems like an easy way to create a less-fakeable benchmark would be to evaluate the LLM+scaffolding on multiple different games? Optimizing for beating Pokemon Red alone would of course be a cheap PR win, so people will try to do it. But optimizing for beating a wide variety of games would be a much bigger win, since it would probably require the AI to develop some more actually-valuable agentic capabilities.
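For concreteness, here's a minimal sketch of what such a harness could look like (my own illustration, not an existing benchmark; `GameResult` and the `play_game` callable are made-up names): the same agent and scaffolding, unchanged, is scored across several games, so a scaffold tuned only for Pokemon Red shouldn't help.

```python
# Hypothetical sketch of a multi-game eval harness -- not a real benchmark.
# The agent + scaffolding is wrapped as a single callable and reused, unchanged,
# across every game in the suite.
from dataclasses import dataclass
from typing import Callable

@dataclass
class GameResult:
    game: str
    completed: bool
    actions_taken: int

def evaluate_agent(
    play_game: Callable[[str, int], GameResult],  # agent + scaffold as one callable
    games: list[str],
    action_budget: int = 50_000,
) -> dict[str, GameResult]:
    """Run one fixed agent configuration on every game and collect the results."""
    return {game: play_game(game, action_budget) for game in games}

# Usage with a stub (a real harness would drive an emulator here):
stub = lambda game, budget: GameResult(game, completed=False, actions_taken=budget)
print(evaluate_agent(stub, ["pokemon_red", "zelda_links_awakening", "dragon_quest_1"]))
```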
It will probably be correct to chide people who update on the cheap PR win. But perhaps the bigger win, which would actually justify such updates, might come soon afterwards!
I disagree because to me this just looks like LLMs are one algorithmic improvement away from having executive function, similar to how they couldn't do system 2 style reasoning until this year when RL on math problems started working.
For example, being unable to change its goals on the fly: if a kid kept trying to go forward when his Pokemon were too weak, he would keep losing, get upset, and hopefully, in a moment of mental clarity, learn the general principle that he should step back and reconsider his goals every so often. I think most children learn some form of this from playing around as a toddler, and reconsidering goals is still something we improve at as adults.
Unlike us, I don't think Claude has training data for executive functions like these, but I wouldn't be surprised if some smart ML researchers solved this in a year.
I was probably going to make it a top level post, but it seems like this post covers the main points well, so I'll just link my own CPP post here (Julian let me know if you mind, and I'll move it):
https://justismills.substack.com/p/the-blackout-strategy
It's specifically about "the blackout strategy" that MrCheeze mentions below, in a greater degree of detail. Basically, I argue that:
I also describe how the blackout strategy came to be in a little bit of detail. Probably not worth reading for anyone who only wanted a primer and by reading this post has gotten one, but if you can't get enough Claudetent or are curious about the blackout strategy, please enjoy.
I'm extremely curious about the design process of the knowledge base. I'm just learning about ClaudePlaysPokemon today and I'm a bit surprised at how naive the knowledge store is. There's a reasonably large amount of research into artificial neural network memory, and I've suspected for a few years that improvements in knowledge scaffolding are promising for really overcoming hallucinations and, now, reasoning deficiencies. It's to the extent that I've supported projects and experiments at work to mature knowledge bases and knowledge graphs in anticipation of marrying them ...
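To illustrate the direction I mean, a rough sketch (my own assumption about what a less naive store could look like, not the actual CPP knowledge base): entries carry provenance and confidence, so a critic pass can expire or challenge old claims instead of treating every old note as ground truth.

```python
from dataclasses import dataclass, field

@dataclass
class Fact:
    claim: str               # e.g. "no passages through the eastern barrier"
    source: str              # "observation", "inference", or "guess"
    confidence: float        # the agent's own 0..1 estimate
    last_verified_step: int  # game step at which this was last re-checked

@dataclass
class KnowledgeBase:
    facts: list[Fact] = field(default_factory=list)

    def add(self, fact: Fact) -> None:
        self.facts.append(fact)

    def stale(self, current_step: int, max_age: int = 2_000) -> list[Fact]:
        """Facts not re-verified recently -- candidates for re-checking, not trusting."""
        return [f for f in self.facts if current_step - f.last_verified_step > max_age]

    def weak_inferences(self, threshold: float = 0.6) -> list[Fact]:
        """Inferred (not directly observed) claims a critic pass should challenge first."""
        return [f for f in self.facts
                if f.source != "observation" and f.confidence < threshold]
```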
Note that the creator stated that the setup is intentionally somewhat underengineered:
I do not claim this is the world's most incredible agent harness; in fact, I explicitly have tried not to "hyper engineer" this to be like the best chance that exists to beat Pokemon. I think it'd be trivial to build a better computer program to beat Pokemon with Claude in the loop.
This is like meant to be some combination of like "understand what Claude's good at and Benchmark and understand Claude-alongside-a-simple-agent-harness", so what that boils down to is this is like a pretty straightforward tool-using agent.
But these issues seem far from insurmountable, even with current tech. It is just that they are not actually trying, because they want to limit scaffolding.
From what I've seen, the main issues:
1) Poor vision -> Can be improved through tool use, will surely improve greatly regardless with new models
2) Poor mapping -> Can be improved greatly + straightforwardly through tool use (see the sketch after this list)
3) Poor executive function -> I feel like this would benefit greatly from something like a separation of concerns. Currently my impression is Claude is getting overwhelmed wit...
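On point 2, the mapping tool wouldn't need to be fancy. A minimal sketch (my own illustration, not the stream's actual tooling): record visited tiles and known walls, and expose the unexplored frontier so the agent is steered toward places it has not already seen hundreds of times.

```python
class TileMap:
    """Toy map memory: visited tiles, known walls, and the unexplored frontier."""

    def __init__(self) -> None:
        self.visited: set[tuple[int, int]] = set()
        self.blocked: set[tuple[int, int]] = set()

    def mark_visited(self, x: int, y: int) -> None:
        self.visited.add((x, y))

    def mark_blocked(self, x: int, y: int) -> None:
        self.blocked.add((x, y))

    def frontier(self) -> list[tuple[int, int]]:
        """Tiles adjacent to explored ground that are neither visited nor known walls."""
        out = set()
        for x, y in self.visited:
            for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if (nx, ny) not in self.visited and (nx, ny) not in self.blocked:
                    out.add((nx, ny))
        return sorted(out)
```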
Meanwhile, the average human can beat the entirety of Red in just 26 hours, and with substantially less thought per hour.
I mostly agree with the post, but this number is absolutely bullshit. What you could more honestly claim, given the link, is that the average completion time among hardcore gamers who both finished the game and then submitted their time to that kind of website is 26 hours. That's an insanely different claim. In fact, I would be shocked if even 50% of people who have played a Pokemon game have completed it at all, much less done so in under a week of playtime.
I got the impression that using only an external memory like in the movie Memento (and otherwise immediately forgetting everything that wasn't explicitly written down) was the biggest hurdle to faster progress. I think it does kind of okay considering that huge limitation. Visually, it would also benefit from learning the difference between what is or isn't a gate/door, though.
Makes sense. With pretraining data being what it is, there are things LLMs are incredibly well equipped to do - like recalling a lot of trivia or pretending to be different kinds of people. And then there are things LLMs aren't equipped to do at all - like doing math, or spotting and calling out their own mistakes.
This task, highly agentic and taxing on executive function? It's the latter.
Keep in mind though: we already know that specialized training can compensate for those "innate" LLM deficiencies.
Reinforcement learning is already used to improve LLM ma...
Mechanisms like attention only seem analogous to a human's sensory memory. Reasoning models have something like a working memory, but even then I think we'd need something in embedding space to constitute a real working memory analog. And having something like a short-term memory could help Claude avoid repeating the same mistakes.
This is, in some sense, very scary because when someone figures out how to train agent reasoning in embedded space there might be a very dramatic discontinuity in how well LLMs can act as agents.
A task like this, at which the AI is lousy but not hopeless, is an excellent feedback signal for RL. It's also an excellent feedback signal for "grad student descent": have a human add mechanisms, and see if Claude gets better. This is a very good sign for capabilities, unfortunately.
Doesn't Claude's training data include all the tutorials and step by step walkthroughs of this game ever published on the internet? How is it not using this information?
It's unclear what the optimal amount of thinking per step is. My initial guess would have been that letting Claude think for a whole paragraph before each single action (rather than only every 10 actions, or whenever it's in a match, or whatever) scores slightly better than letting it think more (sequentially). But I guess it might work better, given it's what the streamer has settled on after some iteration.
The story for parallel checks could be different though. My guess would be going all out and letting Claude generate the paragraph 5 times and then generate 5 ...
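Something like best-of-n with a majority vote over the proposed button press, say. A sketch under that assumption (`sample_reasoning` is a stand-in for one independent reasoning rollout, not a real API):

```python
from collections import Counter

def pick_action(sample_reasoning, n: int = 5) -> str:
    """Sample n independent reasoning rollouts and majority-vote on the button press.

    sample_reasoning() is assumed to return (reasoning_text, proposed_button).
    """
    proposals = [sample_reasoning()[1] for _ in range(n)]
    button, _votes = Counter(proposals).most_common(1)[0]
    return button

# e.g. pick_action(lambda: ("The ladder is to the north, so...", "UP"))
```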
These problems are partly related to poor planning, but they are clearly also related to language models, which are primarily restricted to operate on text. Actual AGI will likely have to work more like an animal or human brain, which is predicting sensory data (or rather: latent representations of sensory data, JEPA) instead of text tokens. An LLM with good planning may be able to finally beat Pokémon, but it will almost certainly not be able to do robotics or driving or anything with complex or real-time visual data.
Me and my college-educated wife recently got stuck playing Lego Star Wars... Our solution was to Google it. Some of these games are poorly designed and very unintuitive, as others have said. Especially a game this old. Seems like they should give Claude some limited Google searches at least.
The earliest Harry Potter games had help hotlines you could call, which we had to do once when I was 9.
It's hilarious it thinks the game might be broken sometimes, like an angry teenager claiming lag when he loses a firefight in CoD.
even with copious amounts of test-time compute
There is no copious amount of test-time compute yet. I would argue that test-time compute has barely been scaled at all. Current spend on RL is only a few million dollars. I expect this to be scaled a few orders of magnitude this year.
I predict that Pokemon Red will be finished very fast (<3 months) and everyone who was disappointed and adjusted their AI timelines due to CPP will have to readjust them.
There's an improvement in LLMs I've seen that is important, but it has wildly inflated people's expectations beyond what's reasonable:
LLMs have hit a point in some impressive tests where they don't reliably fail past the threshold of being unrecoverable. They are conservative enough that they can do search on a problem, failing a million times until they stumble into an answer.
I'm going to try writing something of at least not-embarrassing quality about my thoughts on this, but I'm really confused by people's hype around this sort of thing; it feels like directed randomness.
LLMs trying to complete long-term tasks are state machines where the context is their state. They have terrible tools to edit that state at the moment. There is no location in memory that can automatically receive more attention, because the important memories move as the chain of thought does. Thinking off on a tangent throws a lot of garbage into the LLM's working memory. To remember an important fact over time, the LLM needs to keep repeating it. And there isn't enough space in the working memory for long-term tasks.
All of this is exemplified real...
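A toy illustration of that point (mine, not from the post): with a fixed-size rolling context, any fact that isn't explicitly re-written into the prompt drifts toward the truncation boundary and eventually falls off.

```python
from collections import deque

class RollingContext:
    def __init__(self, max_lines: int = 8) -> None:
        self.lines: deque[str] = deque(maxlen=max_lines)

    def append(self, line: str) -> None:
        self.lines.append(line)  # once full, the oldest line is silently dropped

    def remembers(self, fact: str) -> bool:
        return any(fact in line for line in self.lines)

ctx = RollingContext()
ctx.append("IMPORTANT: the exit is the ladder in the north-east corner")
for step in range(10):
    ctx.append(f"step {step}: walked into a wall, reconsidering...")
print(ctx.remembers("ladder"))  # False: the key fact has scrolled out of 'working memory'
```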
Background: After the release of Claude 3.7 Sonnet,[1] an Anthropic employee started livestreaming Claude trying to play through Pokémon Red. The livestream is still going right now.
TL;DR: So, how's it doing? Well, pretty badly. Worse than a 6-year-old would, definitely not PhD-level.
Digging in
But wait! you say. Didn't Anthropic publish a benchmark showing Claude isn't half-bad at Pokémon? Why yes they did:
and the data shown is believable. Currently, the livestream is on its third attempt, with the first being basically just a test run. The second attempt got all the way to Vermilion City, finding a way through the infamous Mt. Moon maze and achieving two badges, so pretty close to the benchmark.
But look carefully at the x-axis in that graph. Each "action" is a full Thinking analysis of the current situation (often several paragraphs worth), followed by a decision to send some kind of input to the game. Thirty-five thousand actions means an absolutely enormous amount of thought. Even for Claude, who thinks much faster than a human, ten thousand actions takes it roughly a full working week of 40 hours,[2] so that 3-badge run took Claude nearly the equivalent of a month of full-time work, perhaps 140 hours. Meanwhile, the average human can beat the entirety of Red in just 26 hours, and with substantially less thought per hour.
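For the skeptical, the back-of-envelope arithmetic, using the rough figures above:

```python
actions_per_week = 10_000   # estimate above: ~10,000 actions per 40-hour week
hours_per_week = 40
run_actions = 35_000        # roughly where the 3-badge run ended up
claude_hours = run_actions / actions_per_week * hours_per_week  # = 140 hours
human_hours = 26            # the (contested) average completion time cited above
print(f"~{claude_hours:.0f} h for Claude vs. ~{human_hours} h for a human "
      f"(~{claude_hours / human_hours:.1f}x longer)")
```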
What's going wrong?
Basically, while Claude is pretty good at short-term reasoning (ex. Pokémon battles), he's bad at executive function and has a poor memory. This is despite a great deal of scaffolding, including a knowledge base, a critic Claude that helps it maintain its knowledge base, and a variety of tools to help it interact with the game more easily.
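I don't know the stream's actual code, but my guess at the rough shape of the loop, as a hedged sketch (`model`, `emulator`, and `parse_button` are assumed interfaces here, not real APIs):

```python
def parse_button(reasoning: str) -> str:
    """Hypothetical: pull the final chosen button (e.g. 'UP') out of the reasoning text."""
    return reasoning.strip().split()[-1]

def play_turn(model, emulator, knowledge_base: list[str]) -> None:
    """One turn as I imagine it: reason from notes + screen, press one button,
    then let a critic pass edit the notes. All interfaces are assumptions."""
    screen = emulator.describe_screen()        # assumed tool that summarizes the frame
    prompt = "\n".join(knowledge_base) + "\nCurrent screen:\n" + screen
    reasoning = model(prompt)                  # several paragraphs ending in a chosen input
    emulator.press(parse_button(reasoning))

    # Critic pass: a separate call that reviews and prunes the notes rather than playing.
    revised = model("Review these notes; delete anything stale or unsupported:\n"
                    + "\n".join(knowledge_base))
    knowledge_base[:] = revised.splitlines()
```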
What does that mean in practice? If you open the stream, you'll see it immediately: Claude on Run #3 has been stuck in Mt. Moon for 24 hours straight.[3] On Run #2, it took him 78 hours to escape Mt. Moon.
Mt. Moon is not that complicated. It has a few cave levels, and a few trainers. But Claude gets stuck going in loops, trying to talk to trainers he's already beaten (and often failing to talk to them, not understanding why his inputs don't do what he expects), inching across solid walls looking for exits that can't possibly be there, always taking the obvious trap route rather than the longer correct route just because it's closer.
This hasn't been the only problem. Run #2 eventually failed because it couldn't figure out it needed to talk to Bill to progress, and Claude wasn't willing to try any new action when the same infinite rotation of wrong choices (walk in circles, enter the same buildings, complain the game is maybe broken) wasn't working.[4]
A good friend of mine has been watching the stream quite a lot, and describes Claude's faults well. Lightly edited for clarity:
What does this mean for AI?
It's obvious that, while Claude isn't very good at playing Pokémon, it is getting better. 3.7 does significantly better than 3.5 did, after all, and 3.0 was hopeless. So aren't its extremely hard-earned (half-random) achievements so far still progress in the right direction and an indication of things to come?
Well, yeah. But Executive Function (agents, etc.) has always been the big missing puzzle piece, and even with copious amounts of test-time compute, tool use, scaffolding, external memory, the latest and greatest LLM still is not at even a child's level.
But the thing about ClaudePlaysPokémon is that it feels so close sometimes. Each "action" is reasoned carefully (if often on the basis of faulty knowledge), and of course Claude's irrepressible good humor is charming. If it could just plan a little better, or had a little outside direction, it'd clearly blow through the game no problem.
Quoting my friend again (lightly edited again):
Conclusion
In Leopold Aschenbrenner's infamous Situational Awareness paper, in which he correctly predicted the rise of reasoning models, he discussed the concept of "unhobbling" LLMs. That is, figuring out how to get from the brilliant short-term thinking we already have today all the way to the agent-level executive function necessary for true AGI.
ClaudePlaysPokémon is proof that the last 6 months of AI innovation, while incredible, are still far from the true unhobbling necessary for an AI revolution. That doesn't mean 2-year AGI timelines are wrong, but it does feel to me like some new paradigm is yet required for them to be right. If you watch the stream for a couple hours, I think you'll feel the same.
As part of the release, they published a Pokémon benchmark here.
Roughly guesstimating based off the fact that ~48 hours into Run #3, it's at ~12,000 actions taken.
Update: at the 25 hour mark Claude whited out after losing a battle, putting it back before Mt. Moon. The "Time in Mt. Moon" timer on the livestream is still accurate.
This is simplified. For a full writeup, see here. Run #2 was abandoned at 35,000 actions.
This specifically happened with Mt. Moon in Run #3. Claude deleted all its notes on Mt. Moon routing after whiting out and respawning at a previous Pokécenter. It mistakenly concluded that it had already beaten Mt. Moon.
Actually it turns out this hasn't been done, sorry! A couple RNG attempts were completed, but they involved some human direction/cheating. The point still stands only in the sense that, if Claude took more random/exploratory actions rather than carefully-reasoned shortsighted actions, he'd do better.
Okay actually this line I wrote myself—the rest is from my friend—but he failed to write a pithy conclusion for me.