Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

COI: I am a research scientist at Anthropic, where I work on model organisms of misalignment; I was also involved in the drafting process for Anthropic’s RSP. Prior to joining Anthropic, I was a Research Fellow at MIRI for three years.

Thanks to Kate Woolverton, Carson Denison, and Nicholas Schiefer for useful feedback on this post.

Recently, there’s been a lot of discussion and advocacy around AI pauses—which, to be clear, I think is great: pause advocacy pushes in the right direction and works to build a good base of public support for x-risk-relevant regulation. Unfortunately, at least in its current form, pause advocacy seems to lack any sort of coherent policy position. Furthermore, what’s especially unfortunate about pause advocacy’s nebulousness—at least in my view—is that there is a very concrete policy proposal out there right now that I think is basically necessary as a first step here, which is the enactment of good Responsible Scaling Policies (RSPs). And RSPs could very much live or die right now based on public support.

If you’re not familiar with the concept of an RSP, the central idea is evaluation-gated scaling: AI labs may scale models further only when a set of evaluations indicates that additional scaling is appropriate. ARC’s definition is:

An RSP specifies what level of AI capabilities an AI developer is prepared to handle safely with their current protective measures, and conditions under which it would be too dangerous to continue deploying AI systems and/or scaling up AI capabilities until protective measures improve.

How do we make it to a state where AI goes well?

I want to start by taking a step back and laying out a concrete plan for how we get from where we are right now to a policy regime that is sufficient to prevent AI existential risk.

The most important background here is my “When can we trust model evaluations?” post, since knowing the answer to when we can trust evaluations is extremely important for setting up any sort of evaluation-gated scaling. The TL;DR there is that it depends heavily on the type of evaluation:

  • A capabilities evaluation is defined as “a model evaluation designed to test whether a model could do some task if it were trying to. For example: if the model were actively trying to autonomously replicate, would it be capable of doing so?”
    • With the use of fine-tuning, and a bunch of careful engineering work, capabilities evaluations can be done reliably and robustly.
  • A safety evaluation is defined as “a model evaluation designed to test under what circumstances a model would actually try to do some task. For example: would a model ever try to convince humans not to shut it down?”

With that as background, here’s a broad picture of how things could go well via RSPs (note that everything here is just one particular story of success, not necessarily the only story of success we should pursue or a story that I expect to actually happen by default in the real world):

  1. AI labs put out RSP commitments to stop scaling when particular capabilities benchmarks are hit, resuming only when they are able to hit particular safety/alignment/security targets.
    1. Early on, as models are not too powerful, almost all of the work is being done by capabilities evaluations that determine that the model isn’t capable of e.g. takeover. The safety evaluations are mostly around security and misuse risks.
    2. For later capabilities levels, however, it is explicit in all RSPs that we do not yet know what safety metrics could demonstrate safety for a model that might be capable of takeover.
  2. Seeing the existing RSP system in place at labs, governments step in and use it as a basis to enact hard regulation.
  3. By the time it is necessary to codify exactly what safety metrics are required for scaling past models that pose a potential takeover risk, we have clearly solved the problem of understanding-based evals and know what it would take to demonstrate sufficient understanding of a model to rule out e.g. deceptive alignment.
  4. Understanding-based evals are adopted by governmental RSP regimes as hard gating evaluations for models that pose a potential takeover risk.
  5. Once labs start to reach models that pose a potential takeover risk, they either:
    1. Solve mechanistic interpretability to a sufficient extent that they are able to pass an understanding-based eval and demonstrate that their models are safe.
    2. Get blocked on scaling until mechanistic interpretability is solved, forcing a reroute of resources from scaling to interpretability.
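To make the gating logic in the story above concrete, here is a minimal sketch in Python. Everything in it is a hypothetical illustration: the names `EvalResults`, `Decision`, and `gate`, and the three boolean fields, are assumptions made for this example and do not come from any published RSP. In particular, reducing each evaluation to a single boolean is a large simplification, and no understanding-based eval exists yet.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    CONTINUE_SCALING = "continue scaling"
    PAUSE_SCALING = "pause scaling"


@dataclass
class EvalResults:
    # Hypothetical fields; no published RSP defines its evaluations this way.
    takeover_capable: bool           # outcome of the capabilities evaluations
    misuse_safeguards_met: bool      # security/misuse targets for weaker models
    passes_understanding_eval: bool  # understanding-based eval (assumed to exist)


def gate(results: EvalResults) -> Decision:
    """Decide whether scaling may continue under the story sketched above."""
    if not results.takeover_capable:
        # Early regime (step 1.1): capabilities evals do most of the work,
        # and the safety bar is mostly about security and misuse.
        if results.misuse_safeguards_met:
            return Decision.CONTINUE_SCALING
        return Decision.PAUSE_SCALING
    # Later regime (steps 4 and 5): a potentially takeover-capable model is
    # hard-gated on passing an understanding-based eval.
    if results.passes_understanding_eval:
        return Decision.CONTINUE_SCALING
    return Decision.PAUSE_SCALING
```

The design point the sketch is meant to show is that the pause is the default for a potentially takeover-capable model: scaling resumes only if the understanding-based condition is affirmatively met, which corresponds to the two branches of step 5.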

Reasons to like RSPs

Obviously, the above is only one particular story for how things go well, but I think it’s a pretty solid one. Here are some reasons to like it:

  1. It provides very clear and concrete policy proposals that could realistically be adopted by labs and governments (in fact, step 1 has already started!). Labs and governments don’t know how to respond to nebulous pause advocacy because it isn’t clearly asking for any particular policy (since nobody actually likes and is advocating for the six-month pause proposal).
  2. It provides early wins that we can build on later in the form of initial RSP commitments with explicit holes in them. From “AI coordination needs clear wins”:
    1. “In the theory of political capital, it is a fairly well-established fact that ‘Everybody Loves a Winner.’ That is: the more you succeed at leveraging your influence to get things done, the more influence you get in return. This phenomenon is most thoroughly studied in the context of the ability of U.S. presidents to get their agendas through Congress: contrary to a naive model that might predict that legislative success uses up a president’s influence, what is actually found is the opposite: legislative success engenders future legislative success, greater presidential approval, and long-term gains for the president’s party.
    2. I think many people who think about the mechanics of leveraging influence don’t really understand this phenomenon and conceptualize their influence as a finite resource to be saved up over time so it can all be spent down when it matters most. But I think that is just not how it works: if people see you successfully leveraging influence to change things, you become seen as a person who has influence, has the ability to change things, can get things done, etc. in a way that gives you more influence in the future, not less.”
  3. One of the best, most historically effective ways to shape governmental regulation is to start with voluntary commitments. Governments are very good at solving “80% of the players have committed to safety standards but the remaining 20% are charging ahead recklessly” because the solution in that case is obvious and straightforward.
    1. Though we could try to go to governments first rather than labs first, so far I’ve seen a lot more progress with the labs-first approach—though there’s no reason we can’t continue to pursue both in parallel.
  4. RSPs are clearly and legibly risk-based: they specifically kick in only when models have capabilities that are relevant to downstream risks. That’s important because it gives the proposal substantial additional seriousness, since it can point directly to clear harms that it is targeted at preventing.
    1. Additionally, from an x-risk perspective, I don’t even think it actually matters that much what the capability evaluations are here: most potentially dangerous capabilities should be highly correlated, such that measuring any of them should be okay. Thus, I think it should be fine to mostly focus on measuring the capabilities that are most salient to policymakers and most clearly demonstrate risks. And we can directly test the extent to which relevant capabilities are correlated: if they aren’t, we can change course.
  5. Since the strictest conditions of the RSPs only come into effect for future, more powerful models, it’s easier to get people to commit to them now. Labs and governments are generally much more willing to sacrifice potential future value than realized present value.
    1. Additionally, gating scaling only when relevant capabilities benchmarks are hit means that you don’t have to be as at odds with open-source advocates or people who don’t believe current LLMs will scale to AGI. There is still a capabilities benchmark below which open-source is fine (though it should be a lower threshold than closed-source, since there are e.g. misuse risks that are much more pronounced for open-source), and if it turns out that LLMs don’t ever scale to hit the relevant capabilities benchmarks, then this approach won’t ever restrict them.
  6. Using understanding of models as the final hard gate is a condition that—if implemented correctly—is intuitively compelling and actually the thing we need to ensure safety. As I’ve said before, “the only worlds I can imagine myself actually feeling good about humanity’s chances are ones in which we have powerful transparency and interpretability tools that lend us insight into what our models are doing as we are training them.”

How do RSPs relate to pauses and pause advocacy?

In my opinion, RSPs are pauses done right: if you are advocating for a pause, then presumably you have some resumption condition in mind that determines when the pause would end. In that case, just advocate for that condition being baked into RSPs! And if you have no resumption condition—you want a stop rather than a pause—I empathize with that position but I don’t think it’s (yet) realistic. As I discussed above, it requires labs and governments to sacrifice too much present value (rather than just potential future value), isn’t legibly risk-based, doesn’t provide early wins, etc. Furthermore, I think the best way to actually make a full stop happen is still going to look like my story above, just with RSP thresholds that are essentially impossible to meet.

Furthermore, I want to be very clear that I don’t mean “stop pestering governments and focus on labs instead”: we should absolutely try to get governments to adopt RSP-like policies and get as strong conditions as possible into any RSP-like policies that they adopt. What separates pause advocacy from RSP advocacy isn’t who it’s targeted at, but the concreteness of the policy recommendations that it’s advocating for. The point is that advocating for a “pause” is nebulous and non-actionable, whereas “enact an RSP” is concrete and actionable. Advocating for labs and governments to enact the best RSPs possible is a much more effective way to actually produce concrete change than highly nebulous pause advocacy.

Furthermore, RSP advocacy is going to be really important! I’m very worried that we could fail at any of the steps above, and advocacy could help substantially. In particular:

  • We need to actually get as many labs as possible to put out RSPs.
    • Currently, only Anthropic has done so, but I have heard positive signals from other labs and I think with sufficient pressure they might be willing to put out their own RSPs as well.
  • We need to make sure that those RSPs actually commit to the right things. What I’m looking for are:
    • Fine-tuning-based capabilities evaluations being used for below-takeover-potential models.
    • Evidence that capabilities evaluations will be done effectively and won’t be sandbagged (e.g. committing to use an external auditor).
    • An explicitly empty hole for safety evaluations for takeover-risk models that can be filled in later by progress on understanding-based evals.
  • We need to get governments to enact mandatory RSPs for all AI labs.
    • And these RSPs also need to have all the same important properties as the labs’ RSPs. Ideally, we should get the governmental RSPs to be even stronger!
  • We need to make sure that, once we have solid understanding-based evals, governments make them mandatory.
    • I’m especially worried about this point, though I don’t think it’s that hard of a sell: the idea that you should understand what your AI is doing on a deep level is a pretty intuitive one.
Akash (6mo)

Thanks for writing this, Evan! I think it's the clearest writeup of RSPs & their theory of change so far. However, I remain pretty disappointed in the RSP approach and the comms/advocacy around it.

I plan to write up more opinions about RSPs, but one I'll express for now is that I'm pretty worried that the RSP dialogue is suffering from motte-and-bailey dynamics. One of my core fears is that policymakers will walk away with a misleadingly positive impression of RSPs. I'll detail this below:

What would a good RSP look like?

  • Clear commitments along the lines of "we promise to run these 5 specific tests to evaluate these 10 specific dangerous capabilities."
  • Clear commitments regarding what happens if the evals go off (e.g., "if a model scores above a 20 on the Hubinger Deception Screener, we will stop scaling until it has scored below a 10 on the relatively conservative Smith Deception Test.")
  • Clear commitments regarding the safeguards that will be used once evals go off (e.g., "if a model scores above a 20 on the Cotra Situational Awareness Screener, we will use XYZ methods and we believe they will be successful for ABC reasons.")
  • Clear evidence that these evals will exist, will likely ...
Joe_Collman (6mo)

Strongly agree with almost all of this.

My main disagreement is that I don't think the "What would a good RSP look like?" description is sufficient without explicit conditions beyond evals. In particular that we should expect that our suite of tests will be insufficient at some point, absent hugely improved understanding - and that we shouldn't expect to understand how and why it's insufficient before reality punches us in the face.

Therefore, it's not enough to show: [here are tests covering all the problems anyone thought of, and reasons why we expect them to work as intended]. 

We also need strong evidence that there'll be no catastrophe-inducing problems we didn't think of. (not [none that SotA methods could find], not [none discoverable with reasonable cost]; none)

This can't be implicit, since it's a central way that we die.

If it's hard/impractical to estimate, then we should pause until we can estimate it more accurately.

This is the kind of thing that I expect to be omitted from RSPs as a matter of course, precisely because we lack the understanding to create good models/estimates, legible tests etc. That doesn't make it ok. Blank map is not blank territory.

If we're thinking...

evhub (6mo)

Therefore, it's not enough to show: [here are tests covering all the problems anyone thought of, and reasons why we expect them to work as intended]. We also need strong evidence that there'll be no catastrophe-inducing problems we didn't think of. (not [none that SotA methods could find], not [none discoverable with reasonable cost]; none) This can't be implicit, since it's a central way that we die. If it's hard/impractical to estimate, then we should pause until we can estimate it more accurately.

This is the kind of thing that I expect to be omitted from RSPs as a matter of course, precisely because we lack the understanding to create good models/estimates, legible tests etc. That doesn't make it ok. Blank map is not blank territory.

Yeah, I agree—that's why I'm specifically optimistic about understanding-based evals, since I think they actually have the potential to force us to catch unknown unknowns here, the idea being that they require you to prove that you actually understand your model to a level where you'd know if there were anything wrong that your other evals might miss.

Define ASLs or similar now rather than waiting until we're much closer to achieving them. Waiting ...

IANAL, but I think that this is currently impossible due to anti-trust regulations.

I don't know anything about anti-trust enforcement, but it seems to me that this might be a case where labs should do it anyways & delay hypothetical anti-trust enforcement by fighting in court.

I happen to think that the Anthropic RSP is fine for what it is, but it just doesn't actually make any interesting claims yet. The key thing is that they're committing to actually having ASL-4 criteria and a safety argument in the future. From my perspective, the Anthropic RSP is effectively an outline for the sort of thing an RSP could be (run evals, have a safety buffer, assume continuity, etc.) as well as a commitment to finish the key parts of the RSP later. This seems ok to me.

I would have preferred it if they had included tentative proposals for ASL-4 evaluations and what their current best safety plan/argument for ASL-4 looks like (using just current science, no magic), and then explained that this plan wouldn't be sufficient for reasonable amounts of safety (insofar as this is what they think).

Right now, they just have a bulleted list for ASL-4 countermeasures, but this is the main interesting thing to me. (I'm not really sold on substantial risk from systems which aren't capable of carrying out that harm mostly autonomously, so I don't think ASL-3 is actually important except as setup.)

What would a good RSP look like?

  • Clear commitments along the lines of "we promise to run these 5 specific tests to evaluate these 10 specific dangerous capabilities."
  • Clear commitments regarding what happens if the evals go off (e.g., "if a model scores above a 20 on the Hubinger Deception Screener, we will stop scaling until it has scored below a 10 on the relatively conservative Smith Deception Test.")
  • Clear commitments regarding the safeguards that will be used once evals go off (e.g., "if a model scores above a 20 on the Cotra Situational Awareness Screener, we will use XYZ methods and we believe they will be successful for ABC reasons.")
  • Clear evidence that these evals will exist, will likely work, and will be conservative enough to prevent catastrophe
  • Some way of handling race dynamics (such that Bad Guy can't just be like "haha, cute that you guys are doing RSPs. We're either not going to engage with your silly RSPs at all, or we're gonna publish our own RSP but it's gonna be super watered down and vague").

Yeah, of course this would be nice. But the reason that ARC and Anthropic didn't write this 'good RSP' isn't that they're reckless, but because writing such an RSP is a hard op...

Akash (6mo)

Thanks, Zach! Responses below:

But the reason that ARC and Anthropic didn't write this 'good RSP' isn't that they're reckless, but because writing such an RSP is a hard open problem

I agree that writing a good RSP is a hard open problem. I don't blame ARC for not having solved the "how can we scale safely" problem. I am disappointed in ARC for communicating about this poorly (in their public blog post, and [speculative/rumor-based] maybe in their private government advocacy as well).

I'm mostly worried about the comms/advocacy/policy implications. If Anthropic and ARC had come out and said "look, we have some ideas, but the field really isn't mature enough and we really don't know what we're doing, and these voluntary commitments are clearly insufficient, but if you really had to ask us for our best-guesses RE what to do if there is no government regulation coming, and for some reason we had to keep racing toward god-like AI, here are our best guesses. But please note that this is woefully insufficient and we would strongly prefer government intervention to buy enough time so that we can have actual plans."

I also expect most of the (positive or negative) impact of the recent RSP post...

jaan (6mo)

i would love to see competing RSPs (or, better yet, RTDPs, as @Joe_Collman pointed out in a cousin comment).

evhub (6mo)

It seems to me like the main three RSP posts (ARC's, Anthropic's, and yours) are (perhaps unintentionally?) painting an overly optimistic portrayal of RSPs.

I mean, I am very explicitly trying to communicate what I see as the success story here. I agree that there are many ways that this could fail—I mention a bunch of them in the last section—but I think that having a clear story of how things could go well is important to being able to work to actually achieve that story.

On top of that, the posts seem to have this "don't listen to the people who are pushing for stronger asks like moratoriums-- instead please let us keep scaling and trust industry to find the pragmatic middle ground" vibe.

I want to be very clear that I've been really happy to see all the people pushing for strong asks here. I think it's a really valuable thing to be doing, and what I'm trying to do here is not stop that but help it focus on more concrete asks.

I would be more sympathetic to the RSP approach if it was like "well yes, we totally think it'd be great to have a moratorium or a global compute cap or a kill switch or a federal agency monitoring risks or a licensing regime", and we also think this RS...
Wei Dai (6mo)

To be clear, I definitely agree with this. My position is not “RSPs are all we need”, “pauses are bad”, “pause advocacy is bad”, etc.—my position is that getting good RSPs is an effective way to implement a pause: i.e. “RSPs are pauses done right.”

Some feedback on this: my expectation upon seeing your title was that you would argue, or that you implicitly believe, that RSPs are better than other current "pause" attempts/policies/ideas. I think this expectation came from the common usage of the phrase "done right" to mean that other people are doing it wrong or at least doing it suboptimally.

evhub (6mo)
I mean, to be clear, I am saying something like "RSPs are the most effective way to implement a pause that I know of." The thing I'm not saying is just that "RSPs are the only policy thing we should be doing."
Rebecca (6mo)
This reads as some sort of confused motte and bailey. Are RSPs “an effective way” or “the most effective way… [you] know of”? These are different things, with each being stronger/weaker in different ways. Regardless, the title could still be made much more accurate to your beliefs, e.g. ~’RSPs are our (current) best bet on a pause’. ‘An effective way’ is definitely not “i.e … done right”, but “the most effective way… that I know of” is also not.
evhub (6mo)
I disagree? I think the plain English meaning of the title "RSPs are pauses done right" is precisely "RSPs are the right way to do pauses (that I know of)" which is exactly what I think and exactly what I am defending here. I honestly have no idea what else that title would mean.

Sorry yeah I could have explained what I meant further. The way I see it:

‘X is the most effective way that I know of’ = X tops your ranking of the different ways, but could still be below a minimum threshold (e.g. X doesn’t have to even properly work, it could just be less ineffective than all the rest). So one could imagine someone saying “X is the most effective of all the options I found and it still doesn’t actually do the job!”

‘X is an effective way’ = ‘X works, and it works above a certain threshold’.

‘X is Y done right’ = ‘X works and is basically the only acceptable way to do Y,’ where it’s ambiguous or contextual as to whether ‘acceptable’ means that it at least works, that it’s effective, or something like ‘it’s so clearly the best way that anyone doing the 2nd best thing is doing something bad’.

Roman Leventov (6mo)
Why, then, is "RSPs are the most effective way to implement a pause that I know of" not literally the title of your post?
Lukas Finnveden (6mo)
Are you thinking about this post? I don't see any explicit claims that the moratorium folks are extreme. What passage are you thinking about?

In terms of explicit claims:

"So one extreme side of the spectrum is build things as fast as possible, release things as much as possible, maximize technological progress [...].

The other extreme position, which I also have some sympathy for, despite it being the absolutely opposite position, is you know, Oh my god this stuff is really scary.

The most extreme version of it was, you know, we should just pause, we should just stop, we should just stop building the technology for, indefinitely, or for some specified period of time. [...] And you know, that extreme position doesn't make much sense to me either."

Dario Amodei, Anthropic CEO, explaining his company's "Responsible Scaling Policy" on the Logan Bartlett Podcast on Oct 6, 2023.

Starts at around 49:40.

This example is not a claim by ARC though, seems important to keep track of this in a discussion of what ARC did or didn't claim, even as others making such claims is also relevant.

Akash (6mo)
I was thinking about this passage: I think "extreme" was subjective and imprecise wording on my part, and I appreciate you catching this. I've edited the sentence to say "Instead, ARC implies that the moratorium folks are unrealistic, and tries to say they operate on an extreme end of the spectrum, on the opposite side of those who believe it's too soon to worry about catastrophes whatsoever."
trevor (6mo)
This is a really important thing to iron out. Going forward (through the 2020s), it's really important to avoid underestimating the ratio of money going into facilitating an AI pause vs money subverting or thwarting an AI pause. The impression I get is that the vast majority of people are underestimating how much money and talent will end up being allocated towards the end of subverting or thwarting an AI pause, e.g. finding galaxy-brained ways to intimidate or mislead well-intentioned AI safety orgs into self-sabotage (e.g. opposing policies that are actually feasible or even mandatory for human survival like an AI pause) or being turned against each other (which is unambiguously the kind of thing that happens in a world with very high lawyers-per-capita, and in particular in issues and industries where lots of money is at stake). False alarms are an almost equally serious issue because false alarms also severely increase vulnerability, which further incentivises adverse actions against the AI safety community by outsider third parties (e.g. due to signalling high payoff and low risk of detection for any adverse actions).
aysja (6mo)

I’m sympathetic to the idea that it would be good to have concrete criteria for when to stop a pause, were we to start one. But I also think it’s potentially quite dangerous, and corrosive to the epistemic commons, to expect such concreteness before we’re ready to give it. 

I’m first going to zoom out a bit—to a broader trend which I’m worried about in AI Safety, and something that I believe evaluation-gating might exacerbate, although it is certainly not the only contributing factor.   


I think there is pressure mounting within the field of AI Safety to produce measurables, and to do so quickly, as we continue building towards this godlike power under an unknown timer of unknown length. This is understandable, and I think can often be good, because in order to make decisions it is indeed helpful to know things like “how fast is this actually going” and to assure things like “if a system fails such and such metric, we'll stop.” 

But I worry that in our haste we will end up focusing our efforts under the streetlight. I worry, in other words, that the hard problem of finding robust measurements—those which enable us to predict the behavior and safety of AI systems wi...

evhub (6mo)

But I also think it’s potentially quite dangerous, and corrosive to the epistemic commons, to expect such concreteness before we’re ready to give it.

As I mention in the post, we do have the ability to do concrete capabilities evals right now. What we can't do are concrete safety evals, which I'm very clear about not expecting us to have right now.

And I'm not expecting that we eventually solve the problem of building good safety evals either—but I am describing a way in which things go well that involves a solution to that problem. If we never solve the problem of understanding-based evals, then my particular sketch doesn't work as a way to make things go well: but that's how any story of success has to work right now given that we don't currently know how to make things go well. And actually telling success stories is an important thing to do!

If you have an alternative success story that doesn't involve solving safety evals, tell it! But without any alternative to my success story, critiquing it just for assuming a solution to a problem we don't yet have a solution to—which every success story has to do—seems like an extremely unfair criticism.

It also seems to me like somethin...

But without any alternative to my success story, critiquing it just for assuming a solution to a problem we don't yet have a solution to—which every success story has to do—seems like an extremely unfair criticism.

When assumptions are clear, it's not valuable to criticise the activity of daring to consider what follows from them. When assumptions are an implicit part of the frame, they become part of the claims rather than part of the problem statement, and their criticism becomes useful for all involved, in particular making them visible. Putting burdens on criticism such as needing concrete alternatives makes relevant criticism more difficult to find.

Rebecca (6mo)
I found this quite hard to parse fyi
Joe_Collman (6mo)
Fully agree with almost all of this. Well said. One nitpick of potentially world-ending importance: Giving us high confidence is not the bar - we also need to be correct in having that confidence. In particular, we'd need to be asking: "How likely is it that the process we used to find these measures and evaluations gives us [actually sufficient measures and evaluations] before [insufficient measures and evaluations that we're confident are sufficient]? How might we tell the difference? What alternative process would make this more likely?..." I assume you'd roll that into assessing your confidence - but I think it's important to be explicit about this.   Based on your comment, I'd be interested in your take on: 1. Put many prominent disclaimers and caveats in the RSP - clearly and explicitly. vs 2. Attempt to make commitments sufficient for safety by committing to [process to fill in this gap] - including some high-level catch-all like "...and taken together, these conditions make training of this system a good idea from a global safety perspective, as evaluated by [external board of sufficiently cautious experts]". Not having thought about it for too long, I'm inclined to favor (2). I'm not at all sure how realistic it is from a unilateral point of view - but I think it'd be useful to present proposals along these lines and see what labs are willing to commit to. If no lab is willing to commit to any criterion they don't strongly expect to be able to meet ahead of time, that's useful to know: it amounts to "RSPs are a means to avoid pausing". I imagine most labs wouldn't commit to [we only get to run this training process if Eliezer thinks it's good for global safety], but I'm not at all sure what they would commit to. At the least, it strikes me that this is an obvious approach that should be considered - and that a company full of abstract thinkers who've concluded "There's no direct, concrete, ML-based thing we can commit to here, so we're out of
Zack_M_Davis (6mo)
Who? Science has never worked by means of deferring to a designated authority figure. I agree, of course, that we want people to do things that make the world less rather than more likely to be destroyed. But if you have a case that a given course of action is good or bad, you should expect to be able to argue that case to knowledgeable people who have never heard of this Eliza person, whoever she is. I remember reading a few good blog posts about this topic by a guest author on Robin Hanson's blog back in 'aught-seven.

This was just an example of a process I expect labs wouldn't commit to, not (necessarily!) a suggestion.

The key criterion isn't even appropriate levels of understanding, but rather appropriate levels of caution - and of sufficient respect for what we don't know. The criterion [...if aysja thinks it's good for global safety] may well be about as good as [...if Eliezer thinks it's good for global safety].

It's much less about [This person knows], than about [This person knows that no-one knows, and has integrated this knowledge into their decision-making].

Importantly, a cautious person telling an incautious person "you really need to be cautious here" is not going to make the incautious person cautious (perhaps slightly more cautious than their baseline - but it won't change the way they think).

A few other thoughts:

  • Scientific intuitions will tend to be towards doing what uncovers information efficiently. If an experiment uncovers some highly significant novel unknown that no-one was expecting, that's wonderful from a scientific point of view.
    This is primarily about risk, not about science. Here the novel unknown that no-one was expecting may not lead to a load of interesting future wo
...
[-]Joe_Collman6moΩ204746

Thanks for writing this up.
I agree that the issue is important, though I'm skeptical of RSPs so far, since we have one example and it seems inadequate - to the extent that I'm positively disposed, it's almost entirely down to personal encounters with Anthropic/ARC people, not least yourself. I find it hard to reconcile the thoughtfulness/understanding of the individuals with the tone/content of the Anthropic RSP. (of course I may be missing something in some cases)

Going only by the language in the blog post and the policy, I'd conclude that they're an excuse to continue scaling while being respectably cautious (though not adequately cautious). Granted, I'm not the main target audience - but I worry about the impression the current wording creates.

I hope that RSPs can be beneficial - but I think much more emphasis should be on the need for positive demonstration of safety properties, that this is not currently possible, and that it may take many years for that to change. (mentioned, but not emphasized in the Anthropic policy - and without any "many years" or similar)

It's hard to summarize my concerns, so apologies if the following ends up somewhat redundant.
I'll focus on your post f...

[-]evhub6moΩ4151

I'm mostly not going to comment on Anthropic's RSP right now, since I don't really want this post to become about Anthropic's RSP in particular. I'm happy to talk in more detail about Anthropic's RSP maybe in a separate top-level post dedicated to it, but I'd prefer to keep the discussion here focused on RSPs in general.

One of my main worries with RSPs is that they'll be both [plausibly adequate as far as governments can tell] and [actually inadequate]. That's much worse than if they were clearly inadequate.

I definitely share this worry. But that's part of why I'm writing this post! Because I think it is possible for us to get good RSPs from all the labs and governments, but it'll take good policy and advocacy work to make that happen.

My main worry here isn't that we'll miss catastrophic capabilities in the near term (though it's possible). Rather it's the lack of emphasis on this distinction: that tests will predictably fail to catch problems, and that there's a decent chance some of them fail before we expect them to.

I agree that this is a serious concern, though I think that at least in the case of capabilities evaluations, it should be solvable. Though it'll require tho...

I think we at least do know how to do effective capabilities evaluations

This seems an overstatement to me:
Where the main risk is misuse, we'd need to know that those doing the testing have methods for eliciting capabilities that are as effective as anything people will come up with later. (including the most artful AutoGPT 3.0 setups etc)

It seems reasonable to me to claim that "we know how to do effective [capabilities given sota elicitation methods] evaluations", but that doesn't answer the right question.

Once the main risk isn't misuse, then we have to worry about assumptions breaking down (no exploration hacking / no gradient hacking / [assumption we didn't realize we were relying upon]). Obviously we don't expect these to break yet, but I'd guess that we'll be surprised the first time they do break.
I expect your guess on when they will break to be more accurate than mine - but that [I don't have much of a clue, so I'm advocating extreme caution] may be the more reasonable policy.

My concern with trying to put something like [understanding-based evals] into an RSP right now is that it'll end up evaluating the wrong thing: since we don't yet know how to effectively evaluate unders

...
[-]DanielFilan6moΩ193926

The point is that advocating for a “pause” is nebulous and non-actionable

Setting aside the potential advantages of RSPs, this strikes me as a pretty weird thing to say. I understand the term "pause" in this context to mean that you stop building cutting-edge AI models, either voluntarily or due to a government mandate. In contrast, "RSP" says you eventually do that but you gate it on certain model sizes and test results and unpause it under other test results. This strikes me as a bit less nebulous, but only a bit.

I'm not quite sure what's going on here - it's possible that the term "pause" has gotten diluted? Seems unfortunate if so.

[-]evhub6moΩ7111

I think the problem is that nobody really has an idea for what the resumption condition should be for a pause, and nobody's willing to defend the (actually actionable) six-month pause proposal.

[-]jaan6moΩ184226

the FLI letter asked for “pause for at least 6 months the training of AI systems more powerful than GPT-4” and i’m very much willing to defend that!

my own worry with RSPs is that they bake in (and legitimise) the assumptions that a) near term (eval-less) scaling poses trivial xrisk, and b) there is a substantial period during which models trigger evals but are existentially safe. you must have thought about them, so i’m curious what you think.

that said, thank you for the post, it’s a very valuable discussion to have! upvoted.

4evhub6mo
Sure, but I guess I would say that we're back to nebulous territory then: how much longer than six months? When if ever does the pause end?

I agree that this is mostly baked in, but I think I'm pretty happy to accept it. I'd be very surprised if there was substantial x-risk from the next model generation. But also I would argue that, if the next generation of models do pose an x-risk, we've mostly already lost: we just don't yet have anything close to the sort of regulatory regime we'd need to deal with that in place. So instead I would argue that we should be planning a bit further ahead than that, and trying to get something actually workable in place further out, which should also be easier to do because of the dynamic where organizations are more willing to sacrifice potential future value than current realized value.

Yeah, I agree that this is tricky. Theoretically, since we can set the eval bar at any capability level, there should exist capability levels that you can eval for and that are safe but scaling beyond them is not. The problem, of course, is whether we can effectively identify the right capability levels to evaluate in advance. The fact that different capabilities are highly correlated with each other makes this easier in some ways (lots of different early warning signs will all be correlated) but harder in other ways (the dangerous capabilities will also be correlated, so they could all come at you at once).

Probably the most important intervention here is to keep applying your evals while you're training your next model generation, so they trigger as soon as possible. As long as there's some continuity in capabilities, that should get you pretty far. Another thing you can do is put strict limits on how much labs are allowed to scale their next model generation relative to the models that have been definitively evaluated to be safe. And furthermore, my sense is that at least in the current scaling paradigm, the capabilities of the next model generation
[-]jaan6moΩ81812

Sure, but I guess I would say that we're back to nebulous territory then—how much longer than six months? When if ever does the pause end?

i agree that, if hashed out, the end criteria may very well resemble RSPs. still, i would strongly advocate for scaling moratorium until widely (internationally) acceptable RSPs are put in place.

I'd be very surprised if there was substantial x-risk from the next model generation.

i share the intuition that the current and next LLM generations are unlikely to be an xrisk. however, i don't trust my (or anyone else's) intuitions strongly enough to say that there's a less than 1% xrisk per 10x scaling of compute. in expectation, that's killing 80M existing people -- people who are unaware that this is happening to them right now.
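
The 80M figure is an expected-value calculation. A minimal sketch of the arithmetic, assuming a world population of roughly 8 billion (an assumption, since the comment does not state the population it uses); the function name `expected_deaths` is hypothetical:

```python
# Assumption (not stated in the comment): world population of roughly 8 billion.
WORLD_POPULATION = 8_000_000_000


def expected_deaths(p_xrisk: float, population: int = WORLD_POPULATION) -> float:
    """Expected number of deaths if everyone dies with probability p_xrisk."""
    return p_xrisk * population


# A 1% existential risk per 10x scaling step, as in the comment above.
per_scaling_step = expected_deaths(0.01)
print(f"{per_scaling_step:,.0f} expected deaths per 10x scaling step")
```

Under that population assumption, a 1% risk corresponds to about 80 million people in expectation, which matches the figure quoted above.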

4Roman Leventov6mo
Do you think if Anthropic (or another leading AGI lab) unilaterally went out of its way to prevent building agents on top of its API, would this reduce the overall x-risk/p(doom) or not? I'm asking because here you seem to assume a defeatist position that only governments are able to shape the actions of the leading AGI labs (which, by the way, are very very few -- in my understanding, only 3 or 4 labs have any chance of releasing a "next generation" model for as much as two years from now, others won't be able to achieve this level of capability even if they tried), but in the post you advocate for the opposite--for voluntary actions taken by the labs, and that regulation can follow.
[-]mic6mo107

Do you think if Anthropic (or another leading AGI lab) unilaterally went out of its way to prevent building agents on top of its API, would this reduce the overall x-risk/p(doom) or not?

Probably, but Anthropic is actively working in the opposite direction:

This means that every AWS customer can now build with Claude, and will soon gain access to an exciting roadmap of new experiences - including Agents for Amazon Bedrock, which our team has been instrumental in developing.

Currently available in preview, Agents for Amazon Bedrock can orchestrate and perform API calls using the popular AWS Lambda functions. Through this feature, Claude can take on a more expanded role as an agent to understand user requests, break down complex tasks into multiple steps, carry on conversations to collect additional details, look up information, and take actions to fulfill requests. For example, an e-commerce app that offers a chat assistant built with Claude can go beyond just querying product inventory – it can actually help customers update their orders, make exchanges, and look up relevant user manuals.

Obviously, Claude 2 as a conversational e-commerce agent is not going to pose catastrophic risk, b... (read more)

4Zvi6mo
Is evaluation of capabilities, which as you note requires fine-tuning and other such techniques, a realistic thing to properly do continuously during model training, without that being prohibitively slow or expensive? Would doing this be part of the intended RSP?
7Adam Jermyn6mo
Anthropic’s RSP includes evals after every 4x increase in effective compute and after every 3 months, whichever comes sooner, even if this happens during training, and the policy says that these evaluations include fine-tuning.
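
As a rough illustration of the "whichever comes sooner" rule described above, here is a minimal sketch. It is not Anthropic's implementation: the function name, the parameter names, and the choice to approximate "3 months" as 91 days are all assumptions made for the example.

```python
from datetime import date, timedelta

# Thresholds taken from the comment above; the 91-day approximation of
# "3 months" is an assumption made for this sketch.
COMPUTE_MULTIPLIER = 4.0
MAX_INTERVAL = timedelta(days=91)


def evaluation_due(
    compute_at_last_eval: float,
    current_compute: float,
    last_eval_date: date,
    today: date,
) -> bool:
    """Return True if either trigger has fired since the last evaluation."""
    compute_trigger = current_compute >= COMPUTE_MULTIPLIER * compute_at_last_eval
    time_trigger = (today - last_eval_date) >= MAX_INTERVAL
    # "Whichever comes sooner" means either condition alone is enough.
    return compute_trigger or time_trigger


# Example: only 2x more effective compute, but more than 3 months have passed.
print(evaluation_due(1.0, 2.0, date(2023, 1, 1), date(2023, 5, 1)))
```

Because the check depends only on effective compute and elapsed time, it can be applied to intermediate checkpoints during a training run as well as to finished models.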
1Hoagy6mo
Do you know why 4x was picked? I understand that doing evals properly is a pretty substantial effort, but once we get up to gigantic sizes and proto-AGIs it seems like it could hide a lot. If there was a model sitting in training with 3x the train-compute of GPT4 I'd be very keen to know what it could do!
3[anonymous]6mo
maybe "when alignment is solved"
[-]DanielFilan6moΩ71210

Is the idea that an indefinite pause is unactionable? If so, I'm not sure why you think that.

1evhub6mo
I talk about that here:
7DanielFilan6mo
I mean, whether something's realistic and whether something's actionable are two different things (both separate from whether something's nebulous) - even if it's hard to make a pause happen, I have a decent guess about what I'd want to do to up those odds: protest, write to my congress-person, etc. As to the realism, I think it's more realistic than I think you think it is. My impression of AI Impacts' technological temptation work is that governments are totally willing to enact policies that impoverish their citizens without requiring a rigorous CBA. Early wins do seem like an important consideration, but you can imagine trying to get some early wins by e.g. banning AI from being used in certain domains, banning people from developing advanced AI without doing X, Y, or Z.
3evhub6mo
Sure—I just think it'd be better to spend that energy advocating for good RSPs instead. To be clear, the whole point of my post is that I am in favor of pausing/stopping AI development—I just think the best way to do that is via RSPs.
[-]davidad6moΩ112812

I think AI Safety Levels are a good idea, but evals-based classification needs to be complemented by compute thresholds to mitigate the risks of loss of control via deceptive alignment. Here is a non-nebulous proposal.

On RSPs vs pauses, my basic take is that hardcore pauses are better than RSPs and RSPs are considerably better than weak pauses.

Best: we first prevent hardware progress and stop H100 manufacturing for a bit, then we prevent AI algorithmic progress, and then we stop scaling (ideally in that order). Then, we heavily invest in long run safety research agendas and hold the pause for a long time (20 years sounds good to start). This requires heavy international coordination.

I think good RSPs are worse than this, but probably much better than just having a lab pause scaling.

It's possible that various actors should explicitly state that hardcore pauses would be better (insofar as they think so).

  • A capabilities evaluation is defined as “a model evaluation designed to test whether a model could do some task if it were trying to. ...
  • A safety evaluation is defined as “a model evaluation designed to test under what circumstances a model would actually try to do some task. ...

I propose changing the term for this second type of evaluation to "propensity evaluations". I think this is a better term as it directly fits the definition you provided: "a model evaluation designed to test under what circumstances a model would actually try to do some task".

Moreover, I think that both capabilities evaluations and propensity evaluations can be types of safety evaluations. Therefore, it's misleading to label only one of them as "safety evaluations". For example, we could construct a compelling safety argument for current models using solely capability evaluations.

Either can be sufficient for safety: a strong argument based on capabilities (we've conclusively determined that the AI is too dumb to do anything very dangerous) or a strong argument based on propensity (we have a theoretically robust and empirically validated case that our training process will result in an AI that never attempts to do anything harmful).

Alternatively, a moderately strong argument based on capabilities combined with a moderately strong argument based on propensity can be sufficient, provided that the evidence is sufficiently independent.
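
To see why independence matters for combining the two arguments, here is a small sketch with made-up numbers. The function names and the 10% failure probabilities are purely illustrative assumptions, not estimates from the comment.

```python
def joint_failure_independent(p_capabilities_wrong: float, p_propensity_wrong: float) -> float:
    """Chance both arguments are wrong, assuming their failures are independent."""
    return p_capabilities_wrong * p_propensity_wrong


def joint_failure_worst_case(p_capabilities_wrong: float, p_propensity_wrong: float) -> float:
    """Upper bound on the chance both are wrong when failures may be fully correlated."""
    return min(p_capabilities_wrong, p_propensity_wrong)


# Illustrative numbers only: each argument alone has a 10% chance of being wrong.
p_cap, p_prop = 0.10, 0.10
print(f"independent:      {joint_failure_independent(p_cap, p_prop):.3f}")
print(f"fully correlated: {joint_failure_worst_case(p_cap, p_prop):.3f}")
```

With independent evidence, the two moderately strong arguments multiply into a much stronger one; if they tend to fail together, combining them buys little over the stronger argument alone.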

I sometimes refer to capability based arguments as control arguments.

Then, we can name two lines of defense:

  • The control line of defense: Would the AI succeed at causing bad outcomes if it tried?
  • The propensity line of defense: Would the AI try to cause bad outcomes?

It's possible to develop techniques which advance either the control line of defense or the propensity line of defense. Of course, many research directions are useful for both.

I expect that in practice, we're not very close to being able to make good propensity arguments (for instance, theory and interpretability both seem to me to be unlikely to establish this in the next several doublings of total R&D effort). However, we're not that far off from building quite powerful control based countermeasures. I think these control based countermeasures could scale to establishing barely acceptable safety arguments (e.g. 1-5% doom per year) for around human level AI systems while using these AI systems to accelerate software based R&D (e.g. alignment research) by >30x.

Further, as Evan noted, control style arguments seem much more straightforward to evaluate (though various complications can result from exploration and gradient hacking). So, I think the feedback loops on control look quite promising.

(Redwood Research, where I work, is currently pursuing several control style projects and we should be putting out various things on this soon.)

9ryan_greenblatt6mo
An important thing to emphasize with control arguments is that it seems quite unlikely that control arguments can be made workable for very superhuman models. (At least for the notion of "control arguments" which can be readily assessed with non-insane capability evaluations.)
3ChristianKl6mo
AI might also cause bad outcomes as a side-effect of pursuing other goals. A lot of the bad outcomes that powerful agents like companies produce today are not because the company tries to cause bad outcomes.
2ryan_greenblatt6mo
Sure. I just mean "try to do things which result in bad outcomes from our perspective".
2ChristianKl6mo
If you care about preventing AI from causing bad outcomes, those are not the same thing. It's important to be able to distinguish them.
2Joe_Collman6mo
[it turns out I have many questions - please consider this a pointer to the kind of information I'd find useful, rather than a request to answer them all!] Can you point to what makes you think this is likely? (or why it seems the most promising approach) In particular, I worry when people think much in terms of "doublings of total R&D effort" given that I'd expect AI assistance progress multipliers to vary hugely - with the lowest multipliers correlating strongly with the most important research directions. To me it seems that the kind of alignment research that's plausible to speed up 30x is the kind that we can already do without much trouble - narrowly patching various problems in ways we wouldn't expect to generalize to significantly superhuman systems. That and generating a ton of empirical evidence quickly - which is nice, but I expect the limiting factor is figuring out what questions to ask. It doesn't seem plausible that we get a nice inductive pattern where each set of patches allows a little more capability safely, which in turn allows more patches.... I'm not clear on when this would fail, but pretty clear that it would fail. What we'd seem to need is a large speedup on more potentially-sufficiently-general-if-they-work approaches - e.g. MIRI/ARC-theory/JW stuff. 30x speedup on this seems highly unlikely. (I guess you'd agree?) Even if it were possible to make a month of progress in one day, it doesn't seem possible to integrate understanding of that work in a day (if the AI is doing the high-level integration and direction-setting, we seem to be out of the [control measures will keep this safe] regime). I also note that empirically, theoretical teams don't tend to add a load of very smart humans. I'm sure that Paul could expand a lot more quickly if he thought that was helpful. Likewise MIRI. Are they making a serious error here, or are the limitations of very-smart-human assistants not going to apply to AI assistants? (granted, I expect AI assi

I'm not going to respond to everything you're saying here right now. It's pretty likely I won't end up responding to everything you're saying at any point; so apologies for that.

Here are some key claims I want to make:

  • Serial speed is key: Speeding up theory work (like e.g. ARC theory) by 5-10x should be quite doable with human level AIs due to AIs running at much faster serial speeds. This is a key difference between adding AIs and adding humans. Theory can be hard to parallelize which makes adding humans look worse than increasing speed. I'm not confident in speeding up theory work by >30x with controlled and around human level AI, but this doesn't seem impossible.
  • Access to the human level AIs makes safety work much more straightforward: A key difference between current safety work and future safety work is that in the future we'll have access to the exact AIs we're worried about. I expect this opens up a bunch of empirical work which is quite useful and relatively easy to scalably automate with AIs. I think this work could extend considerably beyond "patches". (The hope here is similar to model organisms, but somewhat more general.)
  • The research target can be trusted human
...
2Joe_Collman6mo
This is clarifying, thanks. A few thoughts:

  • "Serial speed is key":
    • This makes sense, but seems to rely on the human spending most of their time tackling well-defined but non-trivial problems where an AI doesn't need to be re-directed frequently [EDIT: the preceding was poorly worded - I meant that if prior to the availability of AI assistants this were true, it'd allow a lot of speedup as the AIs take over this work; otherwise it's less clearly so helpful]. Perhaps this is true for ARC - that's encouraging (though it does again make me wonder why they don't employ more mathematicians - surely not all the problems are serial on a single critical path?). I'd guess it's less often true for MIRI and John.
    • Of course once there's a large speedup of certain methods, the most efficient methodology would look different. I agree that 5x to 10x doesn't seem implausible.
  • "...in the future we'll have access to the exact AIs we're worried about.":
    • We'll have access to the ones we're worried about deploying.
    • We won't have access to the ones we're worried about training until we're training them.
    • I do buy that this makes safety work for that level of AI more straightforward - assuming we're not already dead. I expect most of the value is in what it tells us about a more general solution, if anything - similarly for model organisms. I suppose it does seem plausible that this is the first level we see a qualitatively different kind of general reasoning/reflection that leads us in new theoretical directions. (though I note that this makes [this is useful to study] correlate strongly with [this is dangerous to train])
  • "Researching how to make trustworthy human level AIs seems much more tractable than researching how to align wildly superhuman systems":
    • This isn't clear to me. I'd guess that the same fundamental understanding is required for both. "trustworthy" seems superficially easier than "aligned", but that's not obvious in a gen
  • "So coordination to do better than this would be great".
    • I'd be curious to know what you'd want to aim for here - both in a mostly ideal world, and what seems most expedient.

As far as the ideal, I happened to write something about this in another comment yesterday. Excerpt:

Best: we first prevent hardware progress and stop H100 manufacturing for a bit, then we prevent AI algorithmic progress, and then we stop scaling (ideally in that order). Then, we heavily invest in long run safety research agendas and hold the pause for a long time (20 years sounds good to start). This requires heavy international coordination.

As far as expedient, something like:

  • Demand labs have good RSPs (or something similar) using inside and outside game, try to get labs to fill in tricky future details of these RSPs as early as possible without depending on "magic" (speculative future science which hasn't yet been verified). Have AI takeover motivated people work on the underlying tech and implementation.
  • Work on policy and aim for powerful US policy interventions in parallel. Other countries could also be relevant.

Both of these are unlikely to perfectly succeed, but seem like good directions to p...

2Joe_Collman6mo
Thanks, this seems very reasonable. I'd missed your other comment. (Oh and I edited my previous comment for clarity: I guess you were disagreeing with my clumsily misleading wording, rather than what I meant(??))
4ryan_greenblatt6mo
Corresponding comment text: I think I disagree with what you meant, but not that strongly. It's not that important, so I don't really want to get into it. Basically, I don't think that "well-defined" is that important (not obviously required for some ability to judge the finished work) and I don't think "re-direction frequency" is the right way to think about it.

if you are advocating for a pause, then presumably you have some resumption condition in mind that determines when the pause would end [...] just advocate for that condition being baked into RSPs

Resume when the scientific community has a much clearer idea about how to build AGIs that don't pose a large extinction risk for humanity. This consideration can't be turned into a benchmark right now, hence the technical necessity for a pause to remain nebulous.

RSPs are great, but not by themselves sufficient. Any impression that they are sufficient bundles irresponsible neglect of the less quantifiable risks with the useful activity of creating benchmarks.

(comment crossposted from EA forum)

Very interesting post! But I'd like to push back. The important things about a pause, as envisaged in the FLI letter, for example, are that (a) it actually happens, and (b) the pause is not lifted until there is affirmative demonstration that the risk is lifted. The FLI pause call was not, in my view, on the basis of any particular capability or risk, but because of the out-of-control race to do larger giant scaling experiments without any reasonable safety assurances. This pause should still happen, and it should not be lifted until there is a way in place to assure that safety. Many of the things FLI hoped could happen during the pause are happening — there is huge activity in the policy space developing standards, governance, and potentially regulations. It's just that now those efforts are racing the un-paused technology.

In the case of "responsible scaling" (for which I think the ideas of "controlled scaling" or "safety-first scaling" would be better), what I think is very important is that there not be a presumption that the pause will be temporary, and lifted "once" the right mitigations are in place. We may well hit a point (and may be there...

[-]Adam B6mo112

most potentially dangerous capabilities should be highly correlated, such that measuring any of them should be okay. Thus, I think it should be fine to mostly focus on measuring the capabilities that are most salient to policymakers and most clearly demonstrate risks.

 

Once labs are trying to pass capability evaluations, they will spend effort trying to suppress the specific capabilities being evaluated*, so I think we'd expect them to stop being so highly correlated.

* If they try methods of more generally suppressing the kinds of capabilities that might be dangerous, I think they're likely to test them most on the capabilities being evaluated by RSPs.

My guess is that the hard "Pause" advocates are focussed on optimizing for actions that are "necessary" and "sufficient" for safety, but at the expense of perhaps not being politically or practically "feasible".

Whereas "Responsible Scaling Policies" advocates may instead describe actions that are "necessary" and more "feasible", but less likely to be "sufficient".

The crux of this disagreement might be how feasible, and how sufficient, each of these two pathways is.

Absent any known pathways that solve all three, I'm glad peop...

[-]mic6mo50

From my reading of ARC Evals' example of a "good RSP", RSPs set a standard that roughly looks like: "we will continue scaling models and deploying if and only if our internal evals team fails to empirically elicit dangerous capabilities. If they do elicit dangerous capabilities, we will enact safety controls just sufficient for our models to be unsuccessful at, e.g., creating Super Ebola."

This is better than a standard of "we will scale and deploy models whenever we want," but still has important limitations. As noted by the "coordinated pausing" paper, it...

2evhub6mo
I think it's pretty clear that's at least not what I'm advocating for—I have a very specific story of how I think RSPs go well in my post. These seem like good interventions to me! I'm certainly not advocating for "RSPs are all we need".
[-]Roko6mo4-2

The problem with a naive implementation of RSPs is that we're trying to build a safety case for a disaster that we fundamentally don't understand and where we haven't even produced a single disaster example or simulation.

To be more specific, we don't know exactly which bundles of AI capabilities and deployments will eventually result in a negative outcome for humans. Worse, we're not even trying to answer that question - nobody has run an "end of the world simulator" and as far as I am aware there are no plans to do that.

Without such a model it's very diff...

[-]elifland6moΩ442

Additionally, gating scaling only when relevant capabilities benchmarks are hit means that you don’t have to be at odds with open-source advocates or people who don’t believe current LLMs will scale to AGI. Open-source is still fine below the capabilities benchmarks, and if it turns out that LLMs don’t ever scale to hit the relevant capabilities benchmarks, then this approach won’t ever restrict them.


Can you clarify whether this is implying that open-source capability benchmark thresholds will be at the same or similar levels to closed-source ones? That is...

5evhub6mo
No, you'd want the benchmarks to be different for open-source vs. closed-source, since there are some risks (e.g. bio-related misuse) that are much scarier for open-source models. I tried to mention this here: I'll edit the post to be more clear on this point.

If the model is smart enough, you die before writing the evals report; if it’s just kinda smart, you don’t find it to be too intelligent and die after launching your scalable oversight system that, as a whole, is smarter than individual models.

An international moratorium on all training runs that could stumble on something that might kill everyone is much more robust than regulations around evaluated capabilities of already trained models

Edit: Huh. I would’ve considered the above to be relatively uncontroversial on LessWrong. Can someone explain where I’m wrong?
