LESSWRONG

Andrew Jones
2010

Research Notes: Running Claude 3.7, Gemini 2.5 Pro, and o3 on Pokémon Red
Andrew Jones · 4mo

My sense is that "one model to rule them all" isn't very effective for these long-horizon agentic tasks. As you mention, vision and memory are both major constraints.

Timescales make the memory problem worse for a single model: short-term problems (get out of the house) flood the context window and make mid- or long-term planning impossible. I think the natural solution is a hierarchy of models working at different timescales: each level sets goals for the one below it, and the higher up the chain a model sits, the less frequently it updates.
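To make the hierarchy idea concrete, here is a minimal sketch rather than a tested agent design. The names `Level`, `run_hierarchy`, and `demo`, the update periods, and the stub `propose` functions are all hypothetical stand-ins; in a real setup each `propose` would be a call to a separate model with its own context.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Level:
    """One planner in the hierarchy (hypothetical structure)."""
    name: str
    period: int                          # re-plan every `period` environment steps
    propose: Callable[[str, str], str]   # (parent_goal, observation) -> new goal
    goal: str = ""
    updates: int = 0


def run_hierarchy(levels: List[Level], observations: List[str]) -> List[str]:
    """Run levels ordered slowest (top) to fastest (bottom).

    A level only sees an observation on the steps where it re-plans, so
    short-term detail never accumulates in the slower levels' inputs.
    Returns the lowest-level goal chosen at every step.
    """
    trace = []
    for step, obs in enumerate(observations):
        parent_goal = ""
        for level in levels:
            if step % level.period == 0:
                level.goal = level.propose(parent_goal, obs)
                level.updates += 1
            parent_goal = level.goal     # passed down to the next, faster level
        trace.append(parent_goal)
    return trace


def demo() -> dict:
    # Stub proposers stand in for model calls; the goal strings are made up.
    levels = [
        Level("long", 100, lambda parent, obs: "beat the first gym"),
        Level("mid", 10, lambda parent, obs: f"leave the house [{parent}]"),
        Level("short", 1, lambda parent, obs: f"pick a button for {obs} [{parent}]"),
    ]
    run_hierarchy(levels, [f"frame {i}" for i in range(100)])
    return {level.name: level.updates for level in levels}
```

Over 100 steps, the top level re-plans once, the middle level ten times, and the bottom level every step, which is the "higher up the chain updates less frequently" property in its simplest form.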

Then there's the vision problem: I don't think these reasoning models are a good fit for it. Intuitively, you don't use step-by-step system-2 thinking for simple bodily-spatial decisions like walking. They sorta can, which is impressive, but they aren't suited to it.

Small ViTs finetuned on pixel art could pick up things like coordinate locations much more easily.
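One rough intuition for why coordinates might come easily: a ViT already cuts the image into a fixed grid of patches, so a location is close to a patch index. The sketch below is only arithmetic, not a model. It assumes the 160x144 Game Boy frame and a 16-pixel patch size (my assumption, picked to match a 16-pixel movement grid); the function names are hypothetical.

```python
SCREEN_W, SCREEN_H = 160, 144   # Game Boy LCD resolution in pixels
PATCH = 16                      # assumed patch size, not taken from any real model


def patch_grid(width: int = SCREEN_W, height: int = SCREEN_H, patch: int = PATCH):
    """Return (columns, rows) of non-overlapping patches covering the frame."""
    if width % patch or height % patch:
        raise ValueError("frame size must be a multiple of the patch size")
    return width // patch, height // patch


def pixel_to_patch(x: int, y: int, patch: int = PATCH):
    """Map a pixel coordinate to the (column, row) of the patch containing it."""
    return x // patch, y // patch


def patch_index(col: int, row: int, cols: int) -> int:
    """Row-major position of a patch in the flattened token sequence."""
    return row * cols + col
```

Under these assumptions the frame becomes a 10x9 grid, so "where is the player" reduces to picking one of 90 tokens, which is a much easier target than reading coordinates off a screenshot in text.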

Until the models improve substantially, I think if you want to beat Pokémon Red, you need to engineer an army of small, integrated models.
