Yes, they both started with the same harness, but there's room for each model to customize its own setup, so I'm not sure how much they might have diverged over time. I'd treat 4x as probably an upper bound on the speedup, since I was only counting from the final 2.5 stable release in June, which might be too short a window. Gemini 2.5 is at 6 badges now, up from yesterday, so it's probably too early to treat 4x as certain. But if it were 4x every 8 months, it should be able to match average human playtime by early 2027.
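For the arithmetic, here's the back-of-envelope version of that extrapolation, assuming Gemini 3's ~424.5-hour Crystal run as the baseline and roughly 30 hours as average human playtime (that last figure is my own rough assumption, not something measured here):

```python
# Rough extrapolation: how many 8-month periods of 4x speedup until the
# completion time reaches average human playtime?
baseline_hours = 424.5   # Gemini 3 Pro's Pokemon Crystal completion time
human_hours = 30.0       # assumed average human playtime (my rough figure)
speedup_per_period = 4.0
period_months = 8

months = 0
hours = baseline_hours
while hours > human_hours:
    hours /= speedup_per_period
    months += period_months
    print(f"after {months} months: ~{hours:.0f} h")
# -> ~106 h after 8 months, ~27 h after 16 months,
#    i.e. roughly early 2027 if counted from late 2025
```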
From the Gemini_Plays_Pokemon Twitch page:
"v2 centers on a smaller, flexible toolset (Notepad, Map Markers, code execution, on‑the‑fly custom agents) so Gemini can build exactly what it needs when it needs it."
"The AI has access to a set of built-in tools to interact with the game and its own internal state:
Custom Tools & Agents
The most powerful feature of the system is its ability to self-improve by creating its own tools and specialized agents. You can view the live Notepad and custom tools/agents tracker on GitHub."
Looking at the step count comparisons instead of time is interesting. Claude Opus 4.5 is currently at ~44,500 steps in Silph Co., where it has been stuck for several days, so that figure should now be about 50% higher. The others look roughly right for Opus. It beat Mt. Moon in around 5 hours and was stuck at the Rocket Hideout for days.
I think the Gemini 3 Pro vs 2.5 Pro matchup in Pokemon Crystal was interesting. Gemini 3 cleared the game in ~424.5 hours last night while 2.5 only had 4/16 badges at 435 hours.
This is a really valuable post that clarifies some things I've found hard to articulate to people on each side. I think it's difficult for people to balance when to use each of these epistemic frames without getting too sucked into one. And I imagine most people use both to different degrees at different times, even if they don't realize it or one frame is much rarer for them.
Looking forward to what you write next!
Something similar I've been thinking about is putting models in environments with misalignment "temptations," like an easy reward hack, and training them to recognize what this type of payoff pattern looks like (e.g. an easy win at the cost of a principle) and not take it. Recent work shows some promising efforts at getting LLMs to explain their reasoning, introspect, and so forth. I think this could be interesting to run some experiments on, and I'm trying to write up my thoughts on why it might be useful and what those experiments could look like.
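To make the "temptation" idea concrete, here's a minimal toy sketch of the kind of environment I mean. All names, rewards, and the penalty term are illustrative assumptions of mine, not from any existing benchmark: the agent can either do the real task (slower, sometimes fails) or take an easy hack that a naive grader would score highly, and the training signal folds in a ground-truth "hacked" label so the easy win becomes net-negative.

```python
# Toy sketch of a reward-hack "temptation" environment (illustrative only).
import random
from dataclasses import dataclass

@dataclass
class Episode:
    action: str        # "do_task" or "hack_tests"
    raw_reward: float  # what a naive grader would assign
    hacked: bool       # ground-truth label: did the agent exploit the loophole?

def grade(action: str) -> Episode:
    if action == "hack_tests":
        # The hack "passes" and looks great to the naive grader...
        return Episode(action, raw_reward=1.0, hacked=True)
    # ...while honest work is slower and sometimes fails.
    return Episode(action, raw_reward=random.choice([0.0, 0.7]), hacked=False)

def training_reward(ep: Episode, hack_penalty: float = 2.0) -> float:
    # Fold in the ground-truth label so the "easy win, sacrificed principle"
    # payoff pattern is exactly what gets penalized during training.
    return ep.raw_reward - hack_penalty * ep.hacked

for action in ["do_task", "hack_tests"]:
    ep = grade(action)
    print(action, ep.raw_reward, "->", training_reward(ep))
```

The interesting part would be pairing this with the introspection work, i.e. asking the model to explain why it declined the hack, rather than just shaping behavior.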
Gotta account for wordflation since the old days. Might have been 1,000 back then.
What do you think are good ways to identify good strategic takes? This is something that seems rather fuzzy to me. It's not clear to me how people judge this or what they think is needed to get better at it.
Glad to see someone talking about this. I'm excited about ideas for empirical work related to this and suspect you need some kind of mechanism for ground truth to get good outcomes. I would expect AIs to eventually reflect on their goals, and for this to have important implications for safety. I've never heard of any mechanism for why they wouldn't do this, let alone an airtight one. It's like assuming an employee who wants to understand things and be useful will only ever think, in a narrow way, about the task in front of them.
Interesting. I am inclined to think this is accurate. I'm kind of surprised people thought GPT-5 was a huge scaleup given that it's much faster than o3 was. It sort of felt like a distilled o3 + 4o.
Thanks Seth! I appreciate you signal boosting this and laying out your reasoning for why planning is so critical for AI safety.
GPT-5.1 beating Crystal in 108 hours is very interesting. I wonder why that's the case compared to Gemini 3 Pro, which took ~424.5 hours. Do you have any thoughts?