Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust

[-]ryan_greenblatt8mo*Ω559852Review for 2023 Review

Anthropic releasing their RSP was an important change in the AI safety landscape. The RSP was likely a substantial catalyst for policies like RSPs—which contain if-then commitments and more generally describe safety procedures—becoming more prominent. In particular, OpenAI now has a beta Preparedness Framework, Google DeepMind has a Frontier Safety Framework but there aren't any concrete publicly-known policies yet, many companies agreed to the Seoul commitments which require making a similar policy, and SB-1047 required safety and security protocols.

However, I think the way Anthropic presented their RSP was misleading in practice (at least misleading to the AI safety community) in that it neither strictly requires pausing nor do I expect Anthropic to pause until they have sufficient safeguards in practice. I discuss why I think pausing until sufficient safeguards are in place is unlikely, at least in timelines as short as Dario's (Dario Amodei is the CEO of Anthropic), in my earlier post.

I also have serious doubts about whether the LTBT will serve as a meaningful check to ensure Anthropic serves the interests of the public. The LTBT has seemingly done very little thus far—appointing only 1 board member despite being able to appoint 3/5 of the board members (a majority) and the LTBT is down to only 3 members. And none of its members have technical expertise related to AI. (The LTBT trustees seem altruistically motivated and seem like they would be thoughtful about questions about how to widely distribute benefits of AI, but this is different from being able to evaluate whether Anthropic is making good decisions with respect to AI safety.)

Additionally, in this article, Anthropic's general counsel Brian Israel seemingly claims that the board probably couldn't fire the CEO (currently Dario) if the board did this despite believing it would greatly reduce profits to shareholders^[1]. Almost all of a board's hard power comes from being able to fire the CEO, so if this claim were to be true, that would greatly undermine the ability of the board (and the LTBT which appoints the board) to ensure Anthropic, a public benefit corporation, serves the interests of the public in cases where this conflicts with shareholder interests. In practice, I think this claim which was seemingly made by the general counsel of Anthropic is likely false and, because Anthropic is a public benefit corporation, the board could fire the CEO and win in court even if they openly thought this would massively reduce shareholder value (so long as the board could show they used a reasonable process to consider shareholder interests and decided that the public interest outweighed in this case). Regardless, this article is evidence that the LTBT won't provide a meaningful check on Anthropic in practice.

Misleading communication about the RSP

On the RSP, this post says:

On the one hand, the ASL system implicitly requires us to temporarily pause training of more powerful models if our AI scaling outstrips our ability to comply with the necessary safety procedures.

While I think this exact statement might be technically true, people have sometimes interpreted this quote and similar statements as a claim that Anthropic would pause until their safety measures sufficed for more powerful models. I think Anthropic isn't likely to do this; in particular:

The RSP leaves open the option of revising it to reduce required countermeasures (so pausing is only required until the policy is changed).
This implies countermeasures would suffice for ensuring a reasonable level of safety, but given that commitments still haven't been made for ASL-4 (the level at which existential or near-existential risks become plausible) and there aren't clear procedural reasons to expect countermeasures to suffice to ensure a reasonable level of safety, I don't think we should assume this will be the case.
Protections for ASL-3 are defined vaguely rather than having some sort of credible and independent risk analysis process (in addition to best guess countermeasures) and a requirement to ensure risk is sufficiently low with respect to this process. Perhaps ASL-4 requirements will differ; something more procedural seems particularly plausible as I don't see a route to outlining specific tests in advance for ASL-4.
As mentioned earlier, I expect that if Anthropic ends up being able to build transformatively capable AI systems (as in, AI systems capable of obsoleting all human cognitive labor), they'll fail to provide assurance of a reasonable level of safety. That said, it's worth noting that insofar as Anthropic is actually a more responsible actor (as I currently tentatively think is the case), then from my perspective this choice is probably overall good—though I wish their communication was less misleading.

Anthropic and Anthropic employees often use similar language to this quote when describing the RSP, potentially contributing to a poor sense of what will happen. My impression is that lots of Anthropic employees just haven't thought about this, and believe that Anthropic will behave much more cautiously than I think is plausible (and more cautiously than I think is prudent given other actors).

Other companies have worse policies and governance

While I focus on Anthropic in this comment, it is worth emphasizing that the policies and governance of other AI companies seem substantially worse. xAI, Meta and DeepSeek have no public safety policies at all, though they have said they will make a policy like this. Google DeepMind has published that they are working on making a frontier safety framework with commitments, but thus far they have just listed potential threat models corresponding to model capabilities and security levels without committing to security for specific capability levels. OpenAI has the beta preparedness framework, but the current security requirements seem inadequate and the required mitigations and assessment process for this is unspecified other than saying that the post-mitigation risk must be medium or below prior to deployment and high or below prior to continued development. I don't expect OpenAI to keep the spirit of this commitment in short timelines. OpenAI, Google DeepMind, xAI, Meta, and DeepSeek all have clearly much worse governance than Anthropic.

What could Anthropic do to address my concerns?

Given these concerns about the RSP and the LTBT, what do I think should happen? First, I'll outline some lower cost measures that seem relatively robust and then I'll outline more expensive measures that don't seem obviously good (at least not obviously good to strongly prioritize) but would be needed to make the situation no longer be problematic.

Lower cost measures:

Have the leadership clarify its views to Anthropic employees (at least alignment science employees) in terms of questions like: "How likely is Anthropic to achieve an absolutely low (e.g., 0.25%) lifetime level of risk (according to various third-party safety experts) if AIs that obsolete top human experts are created in the next 4 years?", "Will Anthropic aim to have an RSP that would be the policy that a responsible developer would follow in a world with reasonable international safety practices?", "How likely is Anthropic to exit from its RSP commitments if this is needed to be a competitive frontier model developer?".
Clearly communicate to Anthropic employees (or at least a relevant subset of Anthropic employees) about in what circumstances the board could (and should) fire the CEO due to safety/public interest concerns. Additionally, explain the leadership's policies with respect to cases where the board does fire the CEO—does the leadership of Anthropic commit to not fighting such an action?
Have an employee liaison to the LTBT who provides the LTBT with more information that isn't filtered through the CEO or current board members. Ensure this employee is quite independent-minded, has expertise on AI safety (and ideally security), and ideally is employed by the LTBT rather than Anthropic.

Unfortunately, these measures aren't straightforwardly independently verifiable based on public knowledge. As far as I know, some of these measures could already be in place.

More expensive measures:

In the above list, I explain various types of information that should be communicated to employees. Ensure that this information is communicated publicly including in relevant places like in the RSP.
Ensure the LTBT has 2 additional members with technical expertise in AI safety or minimally in security.
Ensure the LTBT appoints the board members it can currently appoint and that these board members are independent from the company and have their own well-formed views on AI safety.
Ensure the LTBT has an independent staff including technical safety experts, security experts, and independent lawyers.

Likely objections and my responses

Here are some relevant objections to my points and my responses:

Objection: "Sure, but from the perspective of most people, AI is unlikely to be existentially risky soon, so from this perspective it isn't that misleading to think of deviating from safe practices as an edge case." Response: To the extent Anthropic has views, Anthropic has the view that existentially risky AI is reasonably likely to be soon and Dario espouses this view. Further, I think this could be clarified in the text: the RSP could note that these commitments are what a responsible developer would do if we were in a world where being a responsible developer was possible while still being competitive (perhaps due to all relevant companies adopting such policies or due to regulation).
Objection: "Sure, but if other companies followed a similar policy then the RSP commitments would hold in a relatively straightforward way. It's hardly Anthropic's fault if other companies force it to be more reckless than it would like." Response: This may be true, but it doesn't mean that Anthropic isn't being potentially misleading in their description of the situation. They could instead directly describe the situation in less misleading ways.
Objection: "Sure, but obviously Anthropic can't accurately represent the situation publicly. That would result in bad PR and substantially undermine their business in other ways. To the extent you think Anthropic is a good actor, you shouldn't be pressuring good actors like them to take actions that will make them differentially less competitive than worse actors." Response: This is pretty fair, but I still think Anthropic could at least avoid making substantially misleading statements and ensure employees are well informed (at least for employees for whom this information is very relevant to their job decision-making). I think it is a good policy to correct misleading statements that result in differentially positive impressions and result in the safety community taking worse actions, because not having such a policy in general would result in more exploitation of the safety community.

The article says: "However, even the board members who are selected by the LTBT owe fiduciary obligations to Anthropic's stockholders, Israel says. This nuance means that the board members appointed by the LTBT could probably not pull off an action as drastic as the one taken by OpenAI's board members last November. It's one of the reasons Israel was so confidently able to say, when asked last Thanksgiving, that what happened at OpenAI could never happen at Anthropic. But it also means that the LTBT ultimately has a limited influence on the company: while it will eventually have the power to select and remove a majority of board members, those members will in practice face similar incentives to the rest of the board." This indicates that the board couldn't fire the CEO if they thought this would greatly reduce profits to shareholders though it is somewhat unclear. ↩︎

[-]Yonatan Cale8mo10

To the extent you think Anthropic is a good actor, you shouldn't be pressuring good actors like them to take actions that will make them differentially less competitive than worse actors

I think an important part of how one becomes (and stays) a good actor is by being transparent about things like this.

Anthropic could at least avoid making substantially misleading statements

Yes. But also, I'm afraid that Anthropic might solve this problem by just making less statements (which seems bad). Still Yes

[-]ryan_greenblatt8mo30

Yes. But also, I'm afraid that Anthropic might solve this problem by just making less statements (which seems bad).

Making more statements would also be fine! I wouldn't mind if there were just clarifying statements even if the original statement had some problems.

(To try to reduce the incentive for less statements, I criticized other labs for not having policies at all.)

[-]Ben Pace2yΩ133020

I am surprised to see the Open Philanthropy network taking all of the powerful roles here.

The initial Trustees are:
Jason Matheny: CEO of the RAND Corporation
Kanika Bahl: CEO & President of Evidence Action
Neil Buddy Shah: CEO of the Clinton Health Access Initiative (Chair)
Paul Christiano: Founder of the Alignment Research Center
Zach Robinson: Interim CEO of Effective Ventures US

In case it's not apparent:

Jason Matheny started an org with over $40M of OpenPhil funding (link to the biggest grant)
Kanika Bahl runs Evidence Action, an org that OpenPhil has donated over $100M to over the past 7 years
Neil Buddy Shah was the Managing Director at GiveWell for 2.5 years. GiveWell is the organization that OpenPhil grew out of between 2011 and 2014 and continued to maintain close ties with over the decade (e.g. sharing an office for ~5 more years, deferring substantially to GiveWell recommendations for somewhere in the neighborhood of $500M of Global Health grants, etc).
Paul Christiano is technically the least institutionally connected to OpenPhil. Most commenters here are quite aware of Paul's relationship to Open Philanthropy, but to state some obvious legible things for those who are not: Paul is married to Ajeya Cotra, who leads the AI funding at OpenPhil, has been a technical advisor to OpenPhil in some capacity since 2012 (see him mentioned in the following pages: 1, 2, 3, 4, 5, 6), and Holden Karnofsky has collaborated with / advised Paul's organization ARC (collaboration mentioned here and shown as an advisor here).
Zach Robinson worked at OpenPhil for 3 years, over half of which was spent as Chief of Staff.

And, as a reminder for those who've forgotten, Holden Karnofsky's wife Daniela Amodei is President of Anthropic (and so Holden is the brother-in-law of Dario Amodei).

[-]kave2yΩ142913

As a general matter, Anthropic has consistently found that working with frontier AI models is an essential ingredient in developing new methods to mitigate the risk of AI.

What are some examples of work that is most largeness-loaded and most risk-preventing? My understanding is that interpretability work doesn't need large models (though I don't know about things like influence functions). I imagine constitutional AI does. Is that the central example or there are other pieces that are further in this direction?

[-]jefftk2y80

From a business perspective, we want to be clear that our RSP will not alter current uses of Claude or disrupt availability of our products. Rather, it should be seen as analogous to the pre-market testing and safety feature design conducted in the automotive or aviation industry, where the goal is to rigorously demonstrate the safety of a product before it is released onto the market, which ultimately benefits customers.

What happens if you are pretty sure that a future system is ASL-2, but then months after release when people have built business-critical systems on top of your API, it is discovered to actually be a higher level of risk in a way that is not immediately clear how to mitigate?

[-]Zac Hatfield-Dodds2y72

Try really hard to avoid this situation for many reasons, among them that de-deployment would suck.. I think it's unlikely, but:

(2d) If it becomes apparent that the capabilities of a deployed model have been under-elicited and the model can, in fact, pass the evaluations, then we will halt further deployment to new customers and assess existing deployment cases for any serious risks which would constitute a safety emergency. Given the safety buffer, de-deployment should not be necessary in the majority of deployment cases. If we identify a safety emergency, we will work rapidly to implement the minimum additional safeguards needed to allow responsible continued service to existing customers.

[-]Raemon2y112

I think it's unlikely, but:

Why does it seem unlikely? (Note: I haven't read the post or comments in full yet, if you think this is already covered somewhere I'll go read that first)

[-]Zac Hatfield-Dodds2y42

We've deliberately set conservative thresholds, such that I don't expect the first models which pass the ASL-3 evals to pose serious risks without improved fine-tuning or agent-scaffolding, and we've committed to re-evaluate to check on that every three months. From the policy:

Ensuring that we never train a model that passes an ASL evaluation threshold is a difficult task. Models are trained in discrete sizes, they require effort to evaluate mid-training, and serious, meaningful evaluations may be very time consuming, since they will likely require fine-tuning.

This means there is a risk of overshooting an ASL threshold when we intended to stop short of it. We mitigate this risk by creating a buffer: we have intentionally designed our ASL evaluations to trigger at slightly lower capability levels than those we are concerned about, while ensuring we evaluate at defined, regular intervals (specifically every 4x increase in effective compute, as defined below) in order to limit the amount of overshoot that is possible. We have aimed to set the size of our safety buffer to 6x (larger than our 4x evaluation interval) so model training can continue safely while evaluations take place. Correct execution of this scheme will result in us training models that just barely pass the test for ASL-N, are still slightly below our actual threshold of concern (due to our buffer), and then pausing training and deployment of that model unless the corresponding safety measures are ready.

I also think that many risks which could emerge in apparently-ASL-2 models will be reasonably mitigable by some mixture of re-finetuning, classifiers to reject harmful requests and/or responses, and other techniques. I've personally spent more time thinking about the autonomous replication than the biorisk evals though, and this might vary by domain.

[-]Viktor Rehnberg2y60

To me it feels like this policy is missing something that accounts for a big chunk of the risk.

While recursive self-improvement is covered by the "Autonomy and replication" point, there is another risk from actors that don't intentionally cause large scale harm but use your system to make improvements to their own systems as they don't follow your RSP. This type of recursive improvement doesn't seem to be covered by any of "Misuse" or "Autonomy and replication".

In short it's about risks due to shortening of timelines.

[-]Zach Stein-Perlman2y*5-1

Yay Anthropic for making good commitments about (1) pausing training if autonomous-replication model evals fail or red-teaming reveals dangerous capabilities [edit: until implementing the "ASL-3 Containment Measures"] and (2) deploying powerful models well!! I wish other labs would do this!

[-]simeon_c2y76

Can you quote the parts you're referring to?

[-]Zach Stein-Perlman2y20

Not well; almost all of pp. 2–9 or maybe 2–13 is relevant. But here are some bits:

we commit to pause the scaling2 and/or delay the deployment of new models whenever our scaling ability outstrips our ability to comply with the safety procedures for the corresponding ASL.
. . .
Before advancing to a given ASL, the next level must be defined to create a clear boundary with a "safety buffer".
. . .
We define an ASL-3 model as one that can either immediately, or with additional post-training techniques corresponding to less than 1% of the total training cost, do at least one of the following two things. (By post-training techniques we mean the best capabilities elicitation techniques we are aware of at the time, including but not limited to fine-tuning, scaffolding, tool use, and prompt engineering.) 1. Capabilities that significantly increase risk of misuse catastrophe. . . . 2. Autonomous replication in the lab.
. . .
[See "ASL-3 Containment Measures" and "ASL-3 Deployment Measures"]

[-]simeon_c2y*2-17

Cool thanks.
I've seen that you've edited your post. If you look at ASL-3 Containment Measures, I'd recommend considering editing away the "Yay" aswell.
This post is a pretty significant goalpost moving.

While my initial understanding was that the autonomous replication would be a ceiling, this doc now made it a floor.

So in other words, this paper is proposing to keep navigating beyond levels that are considered potentially catastrophic, with less-than-military-grade cybersecurity, which makes it very likely that at least one state, and plausibly multiple states, will have access to those things.

It also means that the chances of leaking a system which is irreversibly catastrophic are probably not below 0.1%, maybe not even below 1%.

My interpretation of the excitement around the proposal is a feeling that "yay, it's better than where we were before".
But I think it neglects heavily a few things.
1. It's way worse than risk management 101, which is easy to push for.
2. the US population is pro-slowdown (so you can basically be way more ambitious than "responsibly scaling")
3. an increasing share of policymakers are worried
4. self-regulation has a track record of heavily affecting hard law (either by preventing it, or by creating a template that the state can enforce. That's the ToC that I understood from people excited by self-regulation). For instance I expect this proposal to actively harm the efforts to push for ambitious slowdowns that would let us put the probability of doom below two-digit numbers.

For those reasons, I wish this doc didn't exist.

[-]Neel Nanda2yΩ450

Trustees serve one-year terms and future Trustees will be elected by a vote of the Trustees

One year is shockingly short, why such fast turnaround?

And great post I'm excited to see responsible scaling policies becoming a thing!

[-]Zac Hatfield-Dodds2yΩ5125

One year is actually the typical term length for board-style positions, but because members can be re-elected their tenure is often much longer. In this specific case of course it's now up to the trustees!

[-]Zach Stein-Perlman2y*4-4

Exciting.

I'm kind of confused by the emphasis on the Trust rather than the board.

On the Trust: does the Trust do anything besides elect some board members (and sometimes advise the board)? Especially hard powers? How can the Trust's powers be modified?

On the board: what does the board do, especially hard powers? What decisions is it consulted about? How is it informed? (Who is on the board now?)

[-]Zach Stein-Perlman2y52

Put another way: Anthropic is talking like the Trust is great corporate governance and will help Anthropic predictably act responsibly. I tentatively believe that — in part because I mostly trust Anthropic; maybe in part because I expect some Trustees wouldn't have joined if they thought the Trust was a sham. But the information I know—basically just that the Trust will be able to elect a majority of the board in a few years, and it's not clear what could cause the Trust to lose its powers—only weakly demonstrates that the Trust is great.

[-]paulfchristiano2y96

The role of the Trust is to elect (and potentially replace) board members; its formal power comes entirely from the fact that it will eventually elect a majority of the board seats.

The post mentions a "failsafe" where a supermajority of investors can amend this arrangement, which I think is a reasonable compromise. But I'm not aware of any public information about what that supermajority is, or whether there are other ways the Trust's formal powers could be reduced.

Dylan Matthews reports the members of the board here: Dario, Daniela, Luke Meulhauser, and Yasmin Razavi. (I think it's also listed plenty of other places.)

[-]kave2yΩ232

The model shows early signs of autonomous self-replication ability, as defined by 50% aggregate success rate on the tasks listed in [Appendix on Autonomy Evaluations]

Would you be willing to rephrase this as something like

The model shows early signs of autonomous self-replication ability. Autonomous self-replication ability is defined as 50% aggregate success rate on the capabilities for which we list evaluations in [Appendix on Autonomy Evaluations]

The hope here is to avoid something like "well this system doesn't have autonomous self-replication ability/isn't ASL-3, because Anthropic's evals failed to elicit the behaviour. That definitionally means it's not ASL-3", and get a bit more map-territory distinction in.

[-]Zach Stein-Perlman2y*30

We commit to an additional set of measures for producing ASL-3 model outputs (externally or internally)

Wait, how can you do red-teaming on an ASL-3 model before producing ASL-3 model outputs?

[-]Aaron_Scher2y10

Thanks for posting, Zac!

I don't think I understand / buy this "race to the top idea":

If adopted as a standard across frontier labs, we hope this might create a “race to the top” dynamic where competitive incentives are directly channeled into solving safety problems.

Some specific questions I have:

This sounds great, but what does it actually mean?
What's a reasonable story for how this race to the top plays out?
Are there historical case-studies of successful "races to the top" that the RSP is trying to emulate (or that can be looked to for reference)? There's a bunch of related stuff, but it's unclear to me what the best historical reference is.
What's the main incentive pushing for better safety? Customers demanding it? Outside regulators? The company wanting to scale and needing to meet (mostly internally-evaluated) safety thresholds first (this seems like the main mechanism in the RSP, while the others seem to have stronger historical precedent)?

To be clear, I don't think a historical precedent of similar things working is necessary for the RSP to be net positive, but it is (I think) a crux for how likely I believe this is to succeed.

[-]Kaspar Hidayat2y10

Does it make sense to consider misuse and autonomy risks under the same framework? Potential for misuse doesn't appear to be difficult to detect and address. Autonomy risk is much harder to assess due to our current lack of understanding around the dynamics of emergence. Addressing the Knightian uncertainty around what conditions may give rise to autonomy at the very least deserves a different approach as compared to more quantifiable misuse risks. ARC's autonomous replication evals address this partially but a sufficiently advanced agent may be able to evade detection.

[-]Zach Stein-Perlman2y-10

Yay Anthropic for making great security commitments! (Beyond its previous very-incomplete commitments here and here.) I wish other labs would do this!

But less yay because these 'commitments' are merely aspirational — they don't necessarily describe what Anthropic is actually doing, and there's no timeline or accountability.

(Anthropic describes some currently-implemented practices here, but my impression is they're very inadequate.)

[-]Hjalmar_Wijk2y92

They do as far as I can tell commit to a fairly strong sort of "timeline" for implementing these things: before they scale to ASL-3 capable models (i.e. ones that pass their evals for autonomous capabilities or misuse potential).

[-]Zach Stein-Perlman2y2-4

I read it differently; in particular my read is that they aren't currently implementing all of the ASL-2 security stuff (and they're not promising to do all of the ASL-3 stuff before scaling to ASL-3). Clarity from Anthropic would be nice.

In "ASL-2 and ASL-3 Security Commitments," they say things like "labs should" rather than "we will."

Almost none of their security practices are directly visible from the outside, but whether they have a bug bounty program is. They don't. But "Programs like bug bounties and vulnerability discovery should incentivize exposing flaws" is part of the ASL-2 security commitments.

I guess when they "publish a more comprehensive list of our implemented ASL-2 security measures" we'll know more.

As a general matter, Anthropic has consistently found that working with frontier AI models is an essential ingredient in developing new methods to mitigate the risk of AI. ↩︎

LESSWRONG
LW