Vaniver's thoughts on Anthropic's RSP

Vaniver

See the announcement, Policy v1.0, and evhub's argument in favor on LW. These are my personal thoughts; in the interest of full disclosure, one of my housemates and several of my friends work at Anthropic; my spouse and I hold OpenAI units (but are financially secure without them). This post has three main parts: applause for things done right, a summary / review of RSPs in general, and then specific criticisms and suggestions of what to improve about Anthropic's RSP.

First, the things to applaud. Anthropic’s RSP makes two important commitments: that they will manage their finances to allow for pauses as necessary, and that there is a single directly responsible individual for ensuring the commitments are met and a quarterly report on them is made. Both of those are the sort of thing that represents genuine organizational commitment rather than lip service. I think it's great for companies to be open about what precautions they're taking to ensure the development of advanced artificial intelligence benefits humanity, even if I don’t find those policies fully satisfactory.

Second, the explanation. Following the model of biosafety levels, where labs must meet defined standards in order to work with specific dangerous pathogens, Anthropic suggests AI safety levels, or ASLs, with corresponding standards for labs working with dangerous models. While the makers of BSL could list specific pathogens to populate each tier, ASL tiers must necessarily be speculative. Previous generations of models are ASL-1 (BSL-1 corresponds to no threat of infection in healthy adults), current models like Claude count as ASL-2 (BSL-2 corresponds to moderate health hazards, like HIV), and the next tier of models, which either substantially increase baseline levels of catastrophic risk or are capable of autonomy, count as ASL-3 (BSL-3 corresponds to potentially lethal inhalable diseases, like SARS-CoV-1 and 2).

While BSL tops out at 4, ASL is left unbounded for now, with a commitment to define ASL-4 before using ASL-3 models.^[1] This means having a defined ceiling of what capabilities would call for increased investments in security practices at all levels, while not engaging in too much armchair speculation about how AI development will proceed.

The idea behind RSPs is that rather than pausing at an arbitrary start-date for an arbitrary amount of time (or simply shutting it all down), capability thresholds are used to determine when to start model-specific pauses, and security thresholds are used to determine when to unpause development on that model. RSPs are meant to demand active efforts to determine whether or not models are capable of causing catastrophic harm, rather than simply developing them blind. They seem substantially better than scaling without an RSP or equivalent.

Third, why am I not yet satisfied with Anthropic’s RSP? Criticisms and suggestions in roughly decreasing order of importance:

1. The core assumption underlying the RSP is that model capabilities can either be predicted in advance or discovered through dedicated testing before deployment. Testing can only prove the presence of capabilities, not their absence, but this method rests on absence of evidence for its evidence of absence. I think capabilities might appear suddenly at an unknown time. Anthropic's approach calls for scaling at a particular rate to lower the chance of this sudden appearance; I'm not yet convinced that their rate is sufficient to handle the risk here. It is still better to look for those capabilities than not look, but pre-deployment testing is inadequate for continued safety.^[2] I think the RSP could include more commitments to post-deployment monitoring at the ASL-2 stage to ensure that it still counts as ASL-2.
2. This is also reflected by the classification of models before testing rather than after testing. The RSP treats ASL-2 models and ASL-3 models differently, with stricter security standards in place for ASL-3 labs to make it safer to work with ASL-3 models. However, before a model is tested, it is unclear which category it should belong in, and the RSP is unclear on how to handle that ambiguity. While presuming they’re ASL-2 is more convenient (as frontier labs can continue to do scaling research with approximately their current levels of security), presuming they’re ASL-3 is more secure. When a red-teaming effort determines that a model is capable of proliferating itself onto the internet, or that it is capable of enabling catastrophic terrorist attacks, it might be too late to properly secure the model, even if training is immediately halted and deployment delayed. [A lab not hardened against terrorist infiltration might leak that it has an ASL-3 model when it triggers the pause to secure, which then potentially allows for the model to be stolen.]
3. The choice of baseline for the riskiness of information provided by the model is availability on search engines. While this is limiting for responsible actors, it seems easy to circumvent. The RSP already excludes other advanced AI systems from the baseline, but it seems to me that a static baseline (such as what was available on search engines in 2021) would be harder to bypass.^[3]
4. Importantly, it seems likely to me that this approach will work well initially in a way that might not correspond to how well it will work when models have grown capable enough to potentially disempower humanity. That is, this might successfully manage the transition from ASL-2 to ASL-3 but not be useful for the transition from ASL-3 to ASL-4, while mistakenly giving us the impression that the system is working. The RSP plans to lay the track ahead of the locomotive, expecting that we can foresee how to test for dangerous capabilities and identify how to secure them as needed, or have the wisdom to call a halt in the future when we fail to do so.

Overall, this feels to me like a move towards adequacy, and it's good to reward those moves; I appreciate that Anthropic's RSP has the feeling of a work-in-progress as opposed to being presented as clearly sufficient to the task.

^[1] BSL-4 diseases have basically the same features as BSL-3 diseases, except that also there are no available vaccines or treatments. Also, all extraterrestrial samples are BSL-4 by default, to a standard more rigorous than any current BSL-4 labs could meet.
^[2] The implied belief is that model capabilities primarily come from training during scaling, and that our scaling laws (and various other things) are sufficient to predict model capabilities. Advancements in how to deploy models might break this; as would the scaling laws failing to predict capabilities.
^[3] It does have some convenience costs--if the baseline were set in 2019, for example, then the model might not be able to talk about coronavirus, even tho AI development and the pandemic were independent.

I mostly disagree with your criticisms.

1. On the other hand: before dangerous capability X appears, labs specifically testing for signs-of-X should be able to notice those signs. (I'm pretty optimistic about detecting dangerous capabilities; I'm more worried about how labs will get reason-to-believe their models are still safe after those models have dangerous capabilities.)
2. There's a good solution: build safety buffers into your model evals. See https://www-files.anthropic.com/production/files/responsible-scaling-policy-1.0.pdf#page=11. "the RSP is unclear on how to handle that ambiguity" is wrong, I think; Anthropic treats a model as ASL-2 until the evals-with-safety-buffers trigger, and then they implement ASL-3 safety measures.
3. I don't really get this.
4. Shrug; maybe. I wish I was aware of a better plan... but I'm not.

1. What's the probability associated with that "should"? The higher it is the less of a concern this point is, but I don't think it's high enough to write off this point. (Separately, agreed that in order for danger warnings to be useful, they also have to be good at evaluating the impact of mitigations unless they're used to halt work entirely.)
2. I don't think safety buffers are a good solution; I think they're helpful but there will still always be a transition point between ASL-2 models and ASL-3 models, and I think it's safer to have that transition in an ASL-3 lab than an ASL-2 lab. Realistically, I think we're going to end up in a situation where, for example, Anthropic researchers put a 10% chance on the next 4x scaling leading to evals declaring a model ASL-3, and it's not obvious what decision they will (or should) make in that case. Is 10% low enough to proceed, and what are the costs of being 'early'?
3. The relevant section of the RSP:

Note that ASLs are defined by risk relative to baseline, excluding other advanced AI systems. This means that a model that initially merits ASL-3 containment and deployment measures for national security reasons might later be reduced to ASL-2 if defenses against national security risks (such as biological or cyber defenses) advance, or if dangerous information becomes more widely available. However, to avoid a “race to the bottom”, the latter should not include the effects of other companies’ language models; just because other language models pose a catastrophic risk does not mean it is acceptable for ours to.

I think it's sensible to reduce models to ASL-2 if defenses against the threat become available (in the same way that it makes sense to demote pathogens from BSL-4 to BSL-3 once treatments become available), but I'm concerned about the "dangerous information becomes more widely available" clause. Suppose you currently can't get slaughterbot schematics off Google; if those become available, I am not sure it then becomes ok for models to provide users with slaughterbot schematics. (Specifically, I don't want companies that make models which are 'safe' except they leak dangerous information X to have an incentive to cause dangerous information X to become available thru other means.)

[There's a related, slightly more subtle point here; supposing you can currently get instructions on how to make a pipe bomb on Google, it can actually reduce security for Claude to explain to users how to make pipe bombs if Google is recording those searches and supplying information to law enforcement / the high-ranked sites on Google search are honeypot sites and Anthropic is not. The baseline is not just "is the information available?" but "who is noticing you accessing the information?".]

4. I mean, superior alternatives are always preferred. I am moderately optimistic about "just stop" plans, and am not yet convinced that "scale until our tests tell us to stop" is dramatically superior to "stop now."

(Like, I think the hope here is to have an AI summer while we develop alignment methods / other ways to make humanity more prepared for advanced AI; it is not clear to me that doing that with the just-below-ASL-3 model is all that much better than doing it with the ASL-2 models we have today.)

Thanks.

Shrug. You can get it ~arbitrarily high by using more sensitive tests. Of course the probability for Anthropic is less than 1, and I totally agree it's not "high enough to write off this point." I just feel like this is an engineering problem, not a flawed "core assumption."

[Busy now but I hope to reply to the rest later.]

As someone with experience in BSL-3 labs, BSL feels like a good metaphor to me. The big issue with the RSP proposal is that it's still just a set of voluntary commitments that could undermine progress on real risk management by giving policymakers a way to make it look like they've done something without really doing anything. It would be much better with input from risk management professionals.
