Kei's Shortform

Kei Nishimura-Gasparian

Kei's Shortform — LessWrong

Kei's Shortform

27th Jan 2025

1 min read

3

This is a special post for quick takes by Kei Nishimura-Gasparian. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Kei's Shortform

10Kei Nishimura-Gasparian

2cubefox

3Martin Vlach

3cubefox

9Kei Nishimura-Gasparian

2roha

5Kei Nishimura-Gasparian

6Kei Nishimura-Gasparian

4glencoe2004

3Kei Nishimura-Gasparian

11 comments, sorted by

top scoring

Click to highlight new comments since: Today at 12:19 PM

[-]Kei Nishimura-Gasparian1y*100

For two hours yesterday, I watched the twitch channel ClaudePlaysPokemon, which shows a Claude 3.7 Sonnet agent playing through the game Pokémon Red. Below I list some limitations I observed with the Claude agent. I think many of these limitations will carry over to agents built on other current frontier models and on other agentic tasks.

Claude has poor visual understanding. For example, Claude would often think it was next to a character when it was clearly far away from it. It would also often misidentify objects in its field of vision
The Claude agent is extremely slow. It tends to need to think for hundreds of tokens just to move one or two spaces or press A a single time. For example, it can sometimes take over a minute for the model to walk into a Pokemon center from the entrance and heal its Pokemon. Speed will become less of a problem as inference times continue to increase, but I suspect it will still be necessary for future agents to distill common actions to not require thinking and to allow the model to reason over higher-level actions instead of each individual one
The 200K token context window is a significant bottleneck. This was surprising to me since I had thought a context window that can store a medium-length book should be able to hold a large enough chunk of a game in memory to not be a significant problem. But when the model is outputting ~200 tokens per each action which is often a single button press, and when it regularly processes screenshots of the game which require an unknown but likely large number of tokens, the context window can fill up quite fast. At one point I measured it to take ~7.5 minutes to fill up the context window, which due to the model's slowness was only enough to leave a Pokemon center, fight a few Pokemon, and then go back to the center.
Even though Claude makes a lot of mistakes within a context window, its biggest mistakes occur when it needs to accomplish tasks that span multiple context windows. Because the context window is too small, the model often forgets things it did very recently and gets stuck into loops. This makes it very bad at tasks that require systematic exploration. In one particularly infuriating sequence, Claude entered a house, talked to its residents, left the house, explored left and came back, and did this over and over again for tens of minutes because it kept forgetting what it had already done
- The agent has tools to help it avoid this, like adding things to an external knowledge base, and summarizing what it did to paste into the next context window. But it doesn't know how to use these very well, perhaps because the model is just using it in ways that seem reasonable, and not in ways that have empirically led to good performance in the past
Claude can't learn new behaviors on the fly. One useful skill humans have is that we can learn certain strategies and high-level actions quickly after only a small amount of experience. But Claude needs to re-learn what to do essentially every time, which makes virtually every task, even ones Claude has done hundreds of times before, have some non-negligible chance of failure. While in principle the external knowledge base can help with this, it doesn't appear to in practice
Claude often has bad in-game decision-making and lacks creativity. While the model often thinks of ideas that seem roughly reasonable given what it knows, it often loses simple opportunities to do things quicker or more easily. It also tends to get stuck with its initial idea (as long as it remembers it) even when something slightly different would work better. For example, at one point Claude decided it wanted to level up its lowest-leveled Pokemon, so every time it died, it took the long walk back to the Pokecenter to heal, even though it would've made sense to spend some time leveling up its other low-level characters before making the return trip. Claude sometimes has very good ideas, but because it can't learn new behaviors on the fly, the good ideas never get reinforced

All this being said, ClaudePlaysPokemon still impresses me, and is probably the most impressive LLM agent demonstration I've seen. Through reasoning and persistence, Claude is able to progress fairly far in the game, accomplish tasks requiring thousands of steps, and eventually get out of loops even when it's been stuck for a long time. I expect increased agentic RL training, increased cross-context RL training, and test-time learning to iron out a lot of these limitations over the next year or two.

[-]cubefox1y20

The 200K token context window is a significant bottleneck.

Gemini Pro has a 2 million token context window, so I assume it would do significantly better. (I wonder why no other model has come close to the Gemini context window size. I have to assume not all algorithmic breakthroughs are replicated a few months later by other models.)

[-]Martin Vlach1y30

Does it really work on RULER( benchmark from Nvidia)?
Not sure where but saw some controversies, https://arxiv.org/html/2410.18745v1#S1 is best I did find now...

Edit: Aah, this was what I had on mind: https://www.reddit.com/r/LocalLLaMA/comments/1io3hn2/nolima_longcontext_evaluation_beyond_literal/

[-]cubefox1y30

I assume for Pokémon the model doesn't need to remember everything exactly, so the recall quality may be less important than the quantity.

[-]Kei Nishimura-Gasparian9d92

Will intelligence agencies and hackers be incentivized to burn their stockpiled 0-days in the near future?

Epistemic status: I don't have a substantial cybersecurity background. I'm curious to hear other people's takes.

Anthropic recently announced that their new model Claude Mythos has 'found thousands of high-severity vulnerabilities, including some in every major operating system and web browser.' They plan to use this model in concert with 40 partner organizations through 'Project Glasswing' to find and patch vulnerabilities in software systems.

It's been publicized that intelligence agencies and other actors have access to large numbers of 0-days that are not known to the broader community. In some cases, actors stockpile these 0-days for use on important targets. For example, Stuxnet only worked by exploiting multiple stockpiled 0-days in concert.

With the announcement of Claude Mythos and Project Glasswing, I wonder if holders of 0-days will worry that their 0-days will soon be discovered and patched. If this is the case, then they only have a small window until their 0-days no longer work, and are incentivized to use them to their maximal extent now. If so, we may have additional reason to believe, beyond risks from AI, that this is an especially vulnerable moment when it comes to cybersecurity.

Reasons this story could be wrong or misleading:

There is not a substantial overlap between the types of exploits that Claude Mythos can find, and the highest value 0-days previously found by humans.
The gap between normal and high-value uses of these exploits is large enough that it still makes sense to wait for the right opportunity even if there is a risk the 0-days get patched.
Actors who expect to soon gain access to powerful AI vulnerability-finding tools may view their current stockpile as less precious, reducing the urgency to burn them.

[-]roha9d20

Were they incentivized as soon as they learned about the rapidly changing playing field, which was probably a while in the past?

[-]Kei Nishimura-Gasparian9d51

This was probably foreseeable given the rapid increase in model cyber abilities over the past six months, although it's unclear if the relevant people were paying enough attention.

[-]Kei Nishimura-Gasparian2mo60

Two quotes from the OpenAI DoW AMA that I thought gave new information:

Prinz asks what provision of the DoW agreement "expressly references the laws and policies as they exist today", as some have expressed concern that the government could just change existing laws/policies to allow for domestic surveillance or fully autonomous weapons. Katrina Mulligan (Head of National Security Partnerships at OpenAI) responds by quoting the publicized portion of the OpenAI-DoW contract. After a followup, she responded that this is how they interpret the phrase 'applicable law':

we intended it to mean "the law applicable at the time the contract is signed".

Peter Wildeford asks Boaz Barak (Member of Technical Staff at OpenAI) whether a currently legal form of surveillance, AI analysis of commercially purchased data on Americans (inc. location data, purchase records, browsing history, etc.), would be allowed under the contract. He says that it wouldn't:

The DoW has not asked us to support collection or analysis of bulk data on Americans, such as geolocation data, web browsing data and personal financial information purchased from data brokers, and our agreement does not permit it. Our agreement does not permit uses of our models for unconstrained monitoring of U.S. persons’ private information, and all intelligence activities must comply with existing US law. In practical terms, this means the system cannot be used to collect or analyze Americans’ data in an open-ended or generalized way.

When asked where this appears in the agreement, he said:

Our legal and policy teams have worked with the DoW and this interpretation is shared between both sides. They will provide more details on the the issue of commercially acquired datasets in the coming days.

[-]glencoe20042mo40

we intended it to mean "the law applicable at the time the contract is signed".

Do we have any legal proof that this is the definition that both OpenAI and the DoW agreed on? I could pretty easily see "applicable law" being 'misinterpreted' as "the law as of the time of the action".

[-]Kei Nishimura-Gasparian3mo30

I just got back my results from the 2025 AI Forecasting Survey. I scored 31st out of 413 forecasters. Some takeaways about my personal performance:

I generally overestimated model improvements on benchmarks. It's particularly surprising to me how little SWE-Bench Verified moved given how much labs are optimizing general SWE tasks and how much they care about doing well on this benchmark. I haven't looked at the very hardest tasks in SWE-Bench Verified - it's possible they are notably harder than the other ones
Other forecasters and I underestimated AI lab revenues. Some of this was because the 2024 baseline revenue numbers were a bit out of date and should've been 6.6B (by no fault of the organizers, as I believe this information came out after the survey was created). But it's also because AI revenues are continuing to grow at a crazy rate that shocks even AI bulls. Anthropic in particular has 10x'd its revenue for the past three years

[-]Kei Nishimura-Gasparian1y*30

The setup in the first sketch in Three Sketches of ASL-4 Safety Case Components can be replaced with a classifier that indicates whether a model is doing something catastrophically bad, which corresponds to the high concern features described in the post. Perhaps you can also have a second classifier that indicates if something is moderately bad, which corresponds to the medium concern features.

If you have such a classifier, you can:

Use it at deployment time to monitor model behavior, and stop the session and read the logs (if allowed by privacy constraints) if the classifier outputs a probability above some threshold
Validate your classifier it by:

Running model organisms and see that it is above the threshold whenever the model organism is trying to do something bad and below the threshold when the model is doing something benign
Run red-blue games on the entire monitoring methodology
Depending on the underlying information fed into the classifier, you may also be able to run observational and interventional studies

This classifier can leverage any information, SAE vectors or other white box methods as mentioned in the post, the model's chain of thought, the sequence of actions the model takes, or better yet, all of the above.

Moderation Log

More from Kei Nishimura-Gasparian

Curated and popular this week

11Comments

11 comments, sorted by

top scoring

Click to highlight new comments since: Today at 12:19 PM

[-]Kei Nishimura-Gasparian1y*100

Claude has poor visual understanding. For example, Claude would often think it was next to a character when it was clearly far away from it. It would also often misidentify objects in its field of vision
The Claude agent is extremely slow. It tends to need to think for hundreds of tokens just to move one or two spaces or press A a single time. For example, it can sometimes take over a minute for the model to walk into a Pokemon center from the entrance and heal its Pokemon. Speed will become less of a problem as inference times continue to increase, but I suspect it will still be necessary for future agents to distill common actions to not require thinking and to allow the model to reason over higher-level actions instead of each individual one
The 200K token context window is a significant bottleneck. This was surprising to me since I had thought a context window that can store a medium-length book should be able to hold a large enough chunk of a game in memory to not be a significant problem. But when the model is outputting ~200 tokens per each action which is often a single button press, and when it regularly processes screenshots of the game which require an unknown but likely large number of tokens, the context window can fill up quite fast. At one point I measured it to take ~7.5 minutes to fill up the context window, which due to the model's slowness was only enough to leave a Pokemon center, fight a few Pokemon, and then go back to the center.
Even though Claude makes a lot of mistakes within a context window, its biggest mistakes occur when it needs to accomplish tasks that span multiple context windows. Because the context window is too small, the model often forgets things it did very recently and gets stuck into loops. This makes it very bad at tasks that require systematic exploration. In one particularly infuriating sequence, Claude entered a house, talked to its residents, left the house, explored left and came back, and did this over and over again for tens of minutes because it kept forgetting what it had already done
- The agent has tools to help it avoid this, like adding things to an external knowledge base, and summarizing what it did to paste into the next context window. But it doesn't know how to use these very well, perhaps because the model is just using it in ways that seem reasonable, and not in ways that have empirically led to good performance in the past
Claude can't learn new behaviors on the fly. One useful skill humans have is that we can learn certain strategies and high-level actions quickly after only a small amount of experience. But Claude needs to re-learn what to do essentially every time, which makes virtually every task, even ones Claude has done hundreds of times before, have some non-negligible chance of failure. While in principle the external knowledge base can help with this, it doesn't appear to in practice
Claude often has bad in-game decision-making and lacks creativity. While the model often thinks of ideas that seem roughly reasonable given what it knows, it often loses simple opportunities to do things quicker or more easily. It also tends to get stuck with its initial idea (as long as it remembers it) even when something slightly different would work better. For example, at one point Claude decided it wanted to level up its lowest-leveled Pokemon, so every time it died, it took the long walk back to the Pokecenter to heal, even though it would've made sense to spend some time leveling up its other low-level characters before making the return trip. Claude sometimes has very good ideas, but because it can't learn new behaviors on the fly, the good ideas never get reinforced

[-]cubefox1y20

The 200K token context window is a significant bottleneck.

[-]Martin Vlach1y30

[-]cubefox1y30

I assume for Pokémon the model doesn't need to remember everything exactly, so the recall quality may be less important than the quantity.

[-]Kei Nishimura-Gasparian9d92

Will intelligence agencies and hackers be incentivized to burn their stockpiled 0-days in the near future?

Epistemic status: I don't have a substantial cybersecurity background. I'm curious to hear other people's takes.

Reasons this story could be wrong or misleading:

There is not a substantial overlap between the types of exploits that Claude Mythos can find, and the highest value 0-days previously found by humans.
The gap between normal and high-value uses of these exploits is large enough that it still makes sense to wait for the right opportunity even if there is a risk the 0-days get patched.
Actors who expect to soon gain access to powerful AI vulnerability-finding tools may view their current stockpile as less precious, reducing the urgency to burn them.

[-]roha9d20

Were they incentivized as soon as they learned about the rapidly changing playing field, which was probably a while in the past?

[-]Kei Nishimura-Gasparian9d51

This was probably foreseeable given the rapid increase in model cyber abilities over the past six months, although it's unclear if the relevant people were paying enough attention.

[-]Kei Nishimura-Gasparian2mo60

we intended it to mean "the law applicable at the time the contract is signed".

The DoW has not asked us to support collection or analysis of bulk data on Americans, such as geolocation data, web browsing data and personal financial information purchased from data brokers, and our agreement does not permit it. Our agreement does not permit uses of our models for unconstrained monitoring of U.S. persons’ private information, and all intelligence activities must comply with existing US law. In practical terms, this means the system cannot be used to collect or analyze Americans’ data in an open-ended or generalized way.

When asked where this appears in the agreement, he said:

Our legal and policy teams have worked with the DoW and this interpretation is shared between both sides. They will provide more details on the the issue of commercially acquired datasets in the coming days.

[-]glencoe20042mo40

we intended it to mean "the law applicable at the time the contract is signed".

[-]Kei Nishimura-Gasparian3mo30

I just got back my results from the 2025 AI Forecasting Survey. I scored 31st out of 413 forecasters. Some takeaways about my personal performance:

I generally overestimated model improvements on benchmarks. It's particularly surprising to me how little SWE-Bench Verified moved given how much labs are optimizing general SWE tasks and how much they care about doing well on this benchmark. I haven't looked at the very hardest tasks in SWE-Bench Verified - it's possible they are notably harder than the other ones
Other forecasters and I underestimated AI lab revenues. Some of this was because the 2024 baseline revenue numbers were a bit out of date and should've been 6.6B (by no fault of the organizers, as I believe this information came out after the survey was created). But it's also because AI revenues are continuing to grow at a crazy rate that shocks even AI bulls. Anthropic in particular has 10x'd its revenue for the past three years

[-]Kei Nishimura-Gasparian1y*30

If you have such a classifier, you can:

Use it at deployment time to monitor model behavior, and stop the session and read the logs (if allowed by privacy constraints) if the classifier outputs a probability above some threshold
Validate your classifier it by:

Running model organisms and see that it is above the threshold whenever the model organism is trying to do something bad and below the threshold when the model is doing something benign
Run red-blue games on the entire monitoring methodology
Depending on the underlying information fed into the classifier, you may also be able to run observational and interventional studies

Moderation Log