My sense is that "one model to rule them all" isn't so effective for these long-term agentic tasks. As you mention, vision and memory are such large constraints.
Timescales make the memory problem worse for a single model: short-term problems (e.g. getting out of the house) flood the context window and make mid- or long-term planning impossible. I think the natural solution is different models operating at different timescales, setting goals hierarchically, with the models higher up the chain updating less frequently.
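To make that concrete, here's a rough Python sketch (my own illustration, not anything from the run itself): each level is a hypothetical planner that only refreshes its goal every `update_every` low-level steps, so long-horizon goals stay stable while the bottom level reacts every frame.

```python
# Minimal sketch of a hierarchical agent: higher-level planners update less
# often, so long-horizon goals aren't flooded out by per-step noise.
# Planner names, update frequencies, and the update() call are placeholders.
from dataclasses import dataclass, field


@dataclass
class Planner:
    name: str
    update_every: int          # low-level steps between updates
    current_goal: str = ""     # latest goal passed down the chain

    def update(self, observation: str, parent_goal: str) -> str:
        # Stand-in for a call to this level's model (LLM, policy, etc.)
        self.current_goal = f"{self.name} goal given '{parent_goal}'"
        return self.current_goal


@dataclass
class HierarchicalAgent:
    # Ordered from slowest (long-term) to fastest (short-term) planner.
    levels: list = field(default_factory=lambda: [
        Planner("long-term", update_every=500),   # e.g. "earn the next badge"
        Planner("mid-term", update_every=50),     # e.g. "reach Viridian Forest"
        Planner("short-term", update_every=1),    # e.g. "get out of the house"
    ])

    def step(self, t: int, observation: str) -> str:
        goal = "beat Pokemon Red"                 # fixed top-level objective
        for level in self.levels:
            if t % level.update_every == 0:
                level.update(observation, goal)   # only refresh when due
            goal = level.current_goal             # pass goal down the chain
        return goal                               # fastest level drives actions


agent = HierarchicalAgent()
for t in range(3):
    action_goal = agent.step(t, observation=f"frame {t}")
```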
Additionally, the vision problem: I don't think these reasoning models are well suited to solve it. Intuitively, you aren't doing system-2, step-by-step thinking to make simple bodily-spatial decisions like walking. They sort of can, which is impressive, but they aren't built for it.
Small ViTs could be finetuned on the pixel-art frames and would pick up things like the player's coordinate location much more easily.
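A hedged sketch of what I mean, assuming timm's `vit_tiny_patch16_224` checkpoint and a hypothetical labeled dataset of (frame, tile-coordinate) pairs; the 2-output head is just trained as a coordinate regressor:

```python
# Sketch: finetune a tiny ViT to regress the player's (x, y) tile coordinates
# from a Game Boy frame. The dataset and label format are assumptions.
import timm
import torch
import torch.nn.functional as F

model = timm.create_model("vit_tiny_patch16_224", pretrained=True, num_classes=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(frames: torch.Tensor, coords: torch.Tensor) -> float:
    """frames: (B, 3, 224, 224) upscaled screenshots; coords: (B, 2) tile (x, y)."""
    preds = model(frames)                 # 2-way head used as a regressor
    loss = F.mse_loss(preds, coords)      # simple coordinate regression loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy data standing in for labeled Pokemon Red frames.
dummy_frames = torch.rand(8, 3, 224, 224)
dummy_coords = torch.randint(0, 20, (8, 2)).float()
train_step(dummy_frames, dummy_coords)
```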
Until the models improve substantially, I think if you want to beat Pokemon Red, you need to engineer an army of small, integrated models.