Now, Google claims Gemini 2.5 Pro has substantially surpassed Claude's progress on that benchmark.
The premise of this post appears to be a straw man - that is not what is claimed in either of the tweets at the top of the post. Similarly, I have seen precisely no one claim this is a rigorous test - it's obviously for fun. Why do you think it's not a win for a model to get this far through the game with reasonably lightweight scaffolding? It seems rather gatekeep-y to require people to use something similar to ClaudePlaysPokemon, which 1) isn't open-source and 2) is so clearly hopelessly stuck in Mt. Moon.