LESSWRONG
Ixoth
Research Notes: Running Claude 3.7, Gemini 2.5 Pro, and o3 on Pokémon Red
Ixoth · 4mo

I'm curious if LLMs would do better on later-gen games. However, they don't have as robust emulation tools as far as I know.

I've been testing on FireRed, and the improvements are marginal at best (though admittedly the tileset is very similar). I wouldn't predict a significant vision difference in DPPt or B/W, but maybe the 3D graphics would help? Later games do have less Gen 1 jank (the tiny inventory, no "press A to cut", which might've saved Claude, and stuff like that), but those are slight factors.

A lot (~90%) of my experimentation has been with Gemini 2.5 Flash, which I think has been instructive; it reminds me that the intelligence-frontier models are pretty bad at this, but it can always be worse. Claude 3.7 and Gemini 2.5 Pro struggle because they come up with bad plans, but they tend to be okay at executing them; 2.5 Flash often fails to execute anyway, even if you feed it a correct plan. It has moments of insanity like "The NPC is below me at (8,3), and I'm at (8,2), so I'll move up to (8,1) to talk to him". (Yes, it output that exact reasoning.)
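To spell out why that quoted move is wrong, here's a minimal sketch of the coordinate check, assuming screen-style coordinates where y grows downward (which is what the quote implies, since "below" goes from y=2 to y=3). The function names `direction_toward` and `moves_closer` are made up for illustration and aren't from any harness; collisions and facing are ignored.

```python
# Assumption: x grows to the right and y grows downward, matching the
# quoted example where the NPC "below" the player has the larger y.
DIRECTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def direction_toward(player, target):
    """Return the button that steps one tile closer to target, or None if already there."""
    dx = target[0] - player[0]
    dy = target[1] - player[1]
    if dx == 0 and dy == 0:
        return None
    # Close the larger gap first; ties go to the vertical axis.
    if abs(dy) >= abs(dx):
        return "down" if dy > 0 else "up"
    return "right" if dx > 0 else "left"


def moves_closer(player, target, button):
    """Check whether pressing button reduces the Manhattan distance to target."""
    step_x, step_y = DIRECTIONS[button]
    new_pos = (player[0] + step_x, player[1] + step_y)
    before = abs(target[0] - player[0]) + abs(target[1] - player[1])
    after = abs(target[0] - new_pos[0]) + abs(target[1] - new_pos[1])
    return after < before


# The situation from the quote: player at (8,2), NPC at (8,3).
player, npc = (8, 2), (8, 3)
print(direction_toward(player, npc))    # the direction that faces the NPC
print(moves_closer(player, npc, "up"))  # the model's proposed move
```

With the player at (8,2) and the NPC at (8,3), the right button is "down", and the proposed "up" to (8,1) puts the player two tiles away instead of one.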

If nothing else, this does more or less confirm to me that weaker models actually are worse at Pokémon, which as far as I've seen has only been lightly tested by the CPP dev with Sonnet 3.5 and 3.6? Hopefully that means stronger models will be better...

I'm not sure how much educational value this actually has (I've been working on it for entertainment purposes), but I'm experimenting with actively feeding text to the model as it goes, and I've had to resort to literal tile-by-tile handholding at times (I'm pretty impatient, admittedly). The most interesting thing I've noticed from this is that if you point out a specific feature even without saying where it is (the exit red mat is a good example), the model will often start to "see" it properly and interact with it.
