For two hours yesterday, I watched the Twitch channel ClaudePlaysPokemon, which shows a Claude 3.7 Sonnet agent playing through the game Pokémon Red. Below I list some limitations I observed with the Claude agent. I think many of these limitations will carry over to agents built on other current frontier models and to other agentic tasks.
All this being said, ClaudePlaysPokemon still impresses me, and is probably the most impressive LLM agent demonstration I've seen. Through reasoning and persistence, Claude is able to progress fairly far in the game, accomplish tasks requiring thousands of steps, and eventually get out of loops even when it's been stuck for a long time. I expect increased agentic RL training, increased cross-context RL training, and test-time learning to iron out a lot of these limitations over the next year or two.
The 200K token context window is a significant bottleneck.
Gemini Pro has a 2 million token context window, so I assume it would do significantly better. (I wonder why no other model has come close to the Gemini context window size. I have to assume not all algorithmic breakthroughs are replicated a few months later by other models.)
Does it really work on RULER (the benchmark from Nvidia)?
I'm not sure where, but I saw some controversy about this; https://arxiv.org/html/2410.18745v1#S1 is the best I could find for now...
Edit: Ah, this is what I had in mind: https://www.reddit.com/r/LocalLLaMA/comments/1io3hn2/nolima_longcontext_evaluation_beyond_literal/
I assume for Pokémon the model doesn't need to remember everything exactly, so the recall quality may be less important than the quantity.
Two quotes from the OpenAI DoW AMA that I thought gave new information:
Prinz asks what provision of the DoW agreement "expressly references the laws and policies as they exist today", as some have expressed concern that the government could just change existing laws/policies to allow for domestic surveillance or fully autonomous weapons. Katrina Mulligan (Head of National Security Partnerships at OpenAI) responds by quoting the publicized portion of the OpenAI-DoW contract. After a followup, she responded that this is how they interpret the phrase 'applicable law':
we intended it to mean "the law applicable at the time the contract is signed".
Peter Wildeford asks Boaz Barak (Member of Technical Staff at OpenAI) whether a currently legal form of surveillance, AI analysis of commercially purchased data on Americans (inc. location data, purchase records, browsing history, etc.), would be allowed under the contract. He says that it wouldn't:
The DoW has not asked us to support collection or analysis of bulk data on Americans, such as geolocation data, web browsing data and personal financial information purchased from data brokers, and our agreement does not permit it. Our agreement does not permit uses of our models for unconstrained monitoring of U.S. persons’ private information, and all intelligence activities must comply with existing US law. In practical terms, this means the system cannot be used to collect or analyze Americans’ data in an open-ended or generalized way.
When asked where this appears in the agreement, he said:
Our legal and policy teams have worked with the DoW and this interpretation is shared between both sides. They will provide more details on the issue of commercially acquired datasets in the coming days.
we intended it to mean "the law applicable at the time the contract is signed".
Do we have any legal proof that this is the definition that both OpenAI and the DoW agreed on? I could pretty easily see "applicable law" being 'misinterpreted' as "the law as of the time of the action".
I just got back my results from the 2025 AI Forecasting Survey. I scored 31st out of 413 forecasters. Some takeaways about my personal performance:
The setup in the first sketch in Three Sketches of ASL-4 Safety Case Components can be replaced with a classifier that indicates whether a model is doing something catastrophically bad, which corresponds to the high concern features described in the post. Perhaps you can also have a second classifier that indicates if something is moderately bad, which corresponds to the medium concern features.
If you have such a classifier, you can:
This classifier can leverage any available information: SAE vectors or other white-box methods as mentioned in the post, the model's chain of thought, the sequence of actions the model takes, or, better yet, all of the above.
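To make the two-tier idea concrete, here is a minimal sketch of how such a classifier could combine several signals into a high, medium, or no-concern label. Everything in it is an assumption for illustration: the `Evidence` fields, the scores in [0, 1], the thresholds, and the choice to take the maximum over signals are all hypothetical and do not come from the post.

```python
from dataclasses import dataclass
from enum import Enum


class Concern(Enum):
    NONE = 0
    MEDIUM = 1
    HIGH = 2


@dataclass
class Evidence:
    # Hypothetical per-signal scores, each assumed to lie in [0, 1].
    sae_feature_score: float = 0.0       # from SAE vectors or other white-box methods
    chain_of_thought_score: float = 0.0  # from a monitor reading the chain of thought
    action_score: float = 0.0            # from a monitor reading the action sequence


def classify(evidence: Evidence,
             high_threshold: float = 0.9,
             medium_threshold: float = 0.5) -> Concern:
    """Map the combined evidence to a concern level.

    Assumption: any single signal is enough to raise concern, so the
    signals are combined with max(). The thresholds are placeholders.
    """
    combined = max(
        evidence.sae_feature_score,
        evidence.chain_of_thought_score,
        evidence.action_score,
    )
    if combined >= high_threshold:
        return Concern.HIGH
    if combined >= medium_threshold:
        return Concern.MEDIUM
    return Concern.NONE
```

Taking the maximum means one strong signal cannot be diluted by weak ones, at the cost of more false positives; a learned combination over all signals would be another reasonable choice.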