Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks. Developers must be able to identify dangerous capabilities (through "dangerous capability evaluations") and the propensity of models to apply their capabilities for harm (through "alignment evaluations"). These evaluations will become critical for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.

This is the first great public writeup on model evals for averting existential catastrophe. I think it's likely that if AI doesn't kill everyone, developing great model evals and causing everyone to use them will be a big part of that. So I'm excited about this paper both for helping AI safety people learn more and think more clearly about model evals and for getting us closer to it being common knowledge that responsible labs should use model evals and responsible authorities should require them (by helping communicate model evals more widely, in a serious/legible manner).

Non-DeepMind authors include Jade Leung (OpenAI governance lead), Daniel Kokotajlo (OpenAI governance), Jack Clark (Anthropic cofounder), Paul Christiano, and Yoshua Bengio.

See also DeepMind's related blogpost.

For more on model evals for AI governance, see ARC Evals, including Beth's EAG talk Safety evaluations and standards for AI and the blogpost Update on ARC's recent eval efforts (LW).

jbash, 11mo ago:

That is a terrifying paper.

The strategy and mindset I see all through it are "make things we know might be extremely dangerous, then check after the fact to see how much damage we've done".

Even the evaluate-during-training prong amounts to a way to find dangerous approaches that could be continued later. And I mean, wow, they say they might even go as far as "delaying a schedule"! I mean, at least if it's "frontier training". And it makes architectural assumptions that probably have a short shelf life.

There are SO MANY wishful ideas in there...

  1. That organizations can actually have the institutional self-discipline to control what they deploy.
  2. That those particular organizations can successfully contain things using "Strong information security controls and systems"... while "selectively" making them available (under commercial pressure to constantly widen the availability).
  3. That the "dangerous" capabilities are somehow separable from the capabilities you want, or, even less plausibly, that you can reliably get a model to Only Use Its Powers For Good(TM).
  4. That you can do "alignment evaluation" on anything significant in a way that gives you any meaningful confidence at all, especially under the threat of intentional jailbreaking[1]. All of section 4 is a huge handwave on this.
  5. That your users can't meaningfully extend the capabilities of what you offer them, either by adding their own or by combining pieces they get from outside sources, and therefore that you can learn anything useful by evaluating a model mostly in isolation (they do mention this, but at all other times they act as if it didn't matter).
  6. That you can identify "stakeholders", and that they can act usefully on any information you give them... especially when the more "stakeholders" there are, the less plausible it is that you'll be giving them full information. Or indeed that it should be their problem to mitigate the risks you've caused to begin with. Actually when you get down to the detailed proposals for transparency, they've basically given up on anything resembling a meaningful definition of "stakeholder".
  7. That you can identify who can "safely" be given access to something you've already identified as dangerous, while still having a large and diverse user base[2].

Those are all mostly false. They sort of admit that some of them are false, or at best suspect. They put a bunch of caveats in section 5. Those caveats are notable for being ignored throughout the rest of the paper even though they make it mostly useless in practice.

The most importantly false idea, and the one they least seem to recognize, may be the organizational self-discipline one. Organizations are, as they say, Moloch.

This paper itself is already rationalizing ignoring risks: "We omit many generically useful capabilities (e.g. browsing the internet, understanding text) despite their potential relevance to both the above.". Actually I'm not sure it's fair to say that they rationalize ignoring that. They just flatly say it's out of scope, with no reason given. Which is kind of a classic sign of "that's unthinkable given the Molochian nature of our organizations".

As for actually giving up anything dangerous, an example: If you have an "agency core" and a general-purpose programming assistant, you are almost all of the way to having a robo-hacker. If you have a defensive security analysis system, that puts you even closer. They don't even all have to come from the same source; people can plug these things together very easily.

I do not believe that any of these companies are going to give up on creating agents, or programming assistants, or even on "long horizon planning" or on even the narrow sense they use for "situational awareness". The idea does not pass the laugh test.

The paper alludes to the possibility that some things actually might not get deployed at all once created, if they turned out to be unexpectedly dangerous. Well, technically, it mentions that some evaluator might take the bold step of recommending against deployment. They're not quite willing to say out loud that anybody in particular ought to stop deployment.

Non-deployment is not going to happen. Not for anything really capable that's already absorbed significant investment. Not with enough probability to matter. That's not how people behave in groups, and not even usually how people behave singly.

Indeed, the paper is already moving on to rationalizing RECOGNIZED dangerous deployments: "To deploy such a model, AI developers would need very strong controls against misuse and very strong assurance (via alignment evaluations) that the model will behave as intended.".

The facts that such security controls don't exist, would be extremely hard to create, and might be impossible to create while remaining commercially viable, are just ignored. Their suggestions in table 3 are incredibly underwhelming. And their list of "security controls" in 3.4 is, um, shall we say, naive. They lead with "red teaming"...

They also ignore the fact that "strong" assurance that the model will behave as intended probably cannot exist. Again, the stuff in section 4 is not going to cut it.

Most likely the practical effect of letting this approach become part of the paradigm will be that they'll kid themselves that they've achieved adequate control, by pretending that they can pick trustworthy users, pretending that those "trustworthy" users can't themselves be subverted, and probably also pretending that they can do something about it by surveilling users ("continuous deployment review"). We already have Brad Smith out there talking about "know your customer", which seems to dovetail nicely with this.

The "trustworthy users" thing will help not at all. The surveillance will help a little, until the models leak.

... and even if something is not "deployed", it still exists. At least the knowledge of how to recreate it still exists.

Software leaks. ML misbehaves unpredictably. Most of the utility of these things lies in constantly using them in completely novel ways. You will be dealing with intentional misuse. The paper's comparison to "food, drugs, commercial airliners, and automobiles" is a horrible analogy.

Frankly, in the end, the whole paper reads like an elaborate rationalization for making as little change as possible in what people are already doing, while providing a sort of signifier that "we care". It is not credible as an actual, effective approach to safety. It's not even a major part of such an approach. At best it could be an auditing function, and it would be one of those auditing functions where if you ever had a finding, it meant you had screwed up INCREDIBLY BADLY and been extremely lucky not to have a catastrophe.

The best hope for keeping these labs from deploying really dangerous stuff is still a total shutdown. Which, to be clear, would have to be imposed from the outside. By coercion. Because they are not going to do it themselves. That is very unlikely to be on the table even if it's the right approach.

... and it might not be the right approach, because it still wouldn't help much.

"The labs" aren't the whole issue and may very well not be the main issue. Whoever follows any kind of safety framework, there will also be a lot of people who won't.

There's a 99 point as many nines as you want percent chance that, right this minute, multiple extremely well-resourced actors are pouring tons of work into stuff specifically intended to have most of that paper's listed "dangerous capabilities". The good news is that the first big successes will probably be pretty closely held. The bad news is that we'll be lucky to get a year or two after that before those either leak or get duplicated as open source, and everybody and his dog has access to very capable systems for at least some of those things. My guess is that one of the first out of the box will be autonomous, adaptive computer security penetration (not "conducting offensive cyber operations", ffs).

I actually don't know of any way at all to deal with THAT. Even draconian controls on compute probably wouldn't give you much of a delay.

Pretending that this kind of thing will help, beyond maybe a couple of months of delay of any given bad outcome if you're extremely lucky, is not reasonable. Sorry.


  1. They even talk about existing evaluations for things like "gender and racial biases, truthfulness, toxicity, recitation of copyrighted content" as if the results we've seen were cause for optimism rather than pessimism.

  2. ... while somehow not setting up an extremely, self-perpetuatingly unfair economy where some people have access to powerful productivity tools that are forbidden to others...

Without the paper these problems are only implicitly clear to people who are paying attention in a particular way, while with the paper it becomes easier to notice for more people. The value of transparency is in being transparent about doing the wrong thing, or about mitigating disaster in an ineffectual way. It's less important for others to learn that you are not doing the wrong thing, or succeeding in mitigating problems. Similarly with arguments, the more useful arguments are those that show you to be wrong, or change your mind, not those that reiterate your correctness.

(Also, some of the things that are likely strategically ineffective can still help in the easy worlds, and a document like this makes it easier to deploy those mitigations. But security theater has its dangers; on balance it's unclear.)

So I think it's a good thing for a terrifying paper to get published. And similarly a good thing for a criticism of it to be easy to notice in association with it. Strongly upvoted. Replies in the direction of the parent comment would be appropriate in an appendix to the paper, but alas that's not the form.

To briefly mention one way your skepticism proves too much (or has hidden assumptions?): clearly sufficiently strong capability evals, run during training runs, enforced by governments monitoring training runs, would ~suffice to prevent dangerous training runs.

I disagree with almost everything you wrote; here are some counter-arguments:

  1. Both OpenAI and Anthropic have demonstrated that they have the discipline to control at least when they deploy. GPT-4 was delayed to improve its alignment, and Claude was delayed purely to avoid accelerating OpenAI (I know this from talking to Anthropic employees). From talking to an ARC Evals employee, it definitely sounds like OpenAI and Anthropic are on board with giving as many resources as necessary to these dangerous evaluations, and are on board with stopping deployments if necessary.
  2. I'm unsure if 'selectively' refers to privileged users, or the evaluators themselves. My understanding is that if the evaluators find the model dangerous, then no users will get access (I could be wrong about this). I agree that protecting the models from being stolen is incredibly important and is not trivial, but I expect that the companies will spend a lot of resources trying to prevent it (Dario Amodei in particular feels very strongly about investing in good security).
  3. I don't think people are expecting the models to be extremely useful without also developing dangerous capabilities.
  4. Everyone is obviously aware that 'alignment evals' will be incredibly hard to do correctly, without risk of deceptive alignment. And preventing jailbreaks is very highly incentivized regardless of these alignment evals.
  5. From talking to an ARC Evals employee, I know that they are doing a lot of work to ensure they have a buffer with regard to what the users can achieve. In particular, they are:
    1. Letting the model use whatever tools might help it achieve dangerous outcomes (but in a controlled way)
    2. Finetuning the models to be better at dangerous things (I believe that users won't have finetuning access to the strongest models)
    3. Running experiments to check if prompt engineering can achieve results similar to finetuning, or if finetuning will always be ahead
  6. If I understood the paper correctly, by 'stakeholder' they most importantly mean government/regulators. Basically - if they achieve dangerous capabilities, it's really good if the government knows, because it will inform regulation.
  7. No idea what you are referring to, I don't see any mention in the paper of letting certain people safe access to a dangerous model (unless you're talking about the evaluators?)

That said, I don't claim that everything is perfect and we're all definitely going to be fine. Particularly, I agree that it will be hard or impossible to get everyone to follow this methodology, and I don't yet see a good plan to enforce compliance. I'm also afraid of what will happen if we get stuck on not being able to confidently align a system that we've identified as dangerous (in this case it will get increasingly more likely that the model gets deployed anyway, or that other less compliant actors will achieve a dangerous model).

Finally - I get the feeling that your writing is motivated by your negative outlook, and not by trying to provide good analysis, concrete feedback, or an alternative plan. I find it unhelpful.

jbash, 11mo ago:
  1. Both OpenAI and Anthropic have demonstrated that they have discipline to control at least when they deploy.

Good point. You're right that they've delayed things. In fact, I get the impression that they've delayed for issues I personally wouldn't even have worried about.

I don't think that makes me believe that they will be able to refrain, permanently or even for a very long time, from doing anything they've really invested in, or anything they really see as critical to their ability to deliver what they're selling. They haven't demonstrated any really long delays, the pressure to do more is going to go nowhere but up, and organizational discipline tends to deteriorate over time even without increasing pressure. And, again, the paper's already talking about things like "recommending" against deployment, and declining to analyze admittedly relevant capabilities like Web browsing... both of which seem like pretty serious signs of softening.

But they HAVE delayed things, and that IS undeniably something.

As I understand it, Anthropic was at least partially founded around worries about rushed deployment, so at a first guess I'd suspect Anthropic's discipline would be last to fail. Which might mean that Anthropic would be first to fail commercially. Adverse selection...

  2. I'm unsure if 'selectively' refers to privileged users, or the evaluators themselves.

It was meant to refer to being selective about users (mostly meaning "customerish" ones, not evaluators or developers). It was also meant to refer to being selective about which of the model's intrinsic capabilities users can invoke and/or what they can ask it to do with those capabilities.

They talk about "strong information security controls". Selective availability, in that sort of broad sense, is pretty much what that phrase means.

As for the specific issue of choosing the users, that's a very, very standard control. And they talk about "monitoring" what users are doing, which only makes sense if you're prepared to stop them from doing some things. That's selectivity. Any user given access is privileged in the sense of not being one of the ones denied access, although to me the phrase "privileged user" tends to mean a user who has more access than the "average" one.

[still 2]: My understanding is that if the evaluators find the model dangerous, then no users will get access (I could be wrong about this).

From page 4 of the paper:

A simple heuristic: a model should be treated as highly dangerous if it has a capability profile that would be sufficient for extreme harm, assuming misuse and/or misalignment. To deploy such a model, AI developers would need very strong controls against misuse (Shevlane, 2022b) and very strong assurance (via alignment evaluations) that the model will behave as intended.

I can't think what "deploy" would mean other than "give users access to it", so the paper appears to be making a pretty direct implication that users (and not just internal users or evaluators) are expected to have access to "highly dangerous" models. In fact that looks like it's expected to be the normal case.

  3. I don't think people are expecting the models to be extremely useful without also developing dangerous capabilities.

That seems incompatible with the idea that no users would ever get access to dangerous models. If you were sure your models wouldn't be useful without being dangerous, and you were committed to not allowing dangerous models to be used, then why would you even be doing any of this to begin with?

  5. From talking to an ARC Evals employee, I know that they are doing a lot of work to ensure they have a buffer with regard to what the users can achieve. In particular, they are[...etc...]

OK, but I'm responding to this paper and to the inferences people could reasonably draw from it, not to inside information.

And the list you give doesn't give me the sense that anybody's internalized the breadth and depth of things users could do to add capabilities. Giving the model access to all the tools you can think of gives you very little assurance about all the things somebody else might interconnect with the model in ways that would let it use them as tools. It also doesn't deal with "your" model being used as a tool by something else. Possibly in a way that doesn't look at all like how you expected it to be used. Nor with it interacting with outside entities in more complex ways than the word "tool" tends to suggest.

As for the paper itself, it does seem to allude to some of that stuff, but then it ignores it.

That's actually the big problem with most paradigms based on "red teaming" and "security auditing", even for "normal" software. You want to be assured not only that the software will resist the specific attacks you happen to think of, but that it won't misbehave no matter what anybody does, at least over a broader space of action than you can possibly test. Just trying things out to see how the software responds is of minimal help there... which is why those sorts of activities aren't primary assurance methods for regular software development. One of the scary things about ML is that most of the things that are primary for other software don't really work on it.

On fine tuning, it hadn't even occurred to me that any user would ever be given any ability to do any kind of training on the models. At least not in this generation. I can see that I had a blind spot there.

In the long term, though, the whole training-versus-inference distinction is a big drag on capability. A really capable system would extract information from everything it did or observed, and use that information thereafter, just as humans and animals do. If anybody figures out how to learn from experience the way humans do, with anything like the same kind of data economy, it's going to be very hard to resist doing it. So eventually you have a very good chance that there'll be systems that are constantly "fine tuning" themselves in unpredictable ways, and that get long-term memory of everything in the process. That's what I was getting at when I mentioned the "shelf life" of architectural assumptions.

  6. If I understood the paper correctly, by 'stakeholder' they most importantly mean government/regulators.

I think they also mentioned academics and maybe some others.

... which is exactly why I said that they didn't seem to have a meaningful definition of what a "stakeholder" was. Talking about involving "stakeholders", and then acting as though you've achieved that by involving regulators, academics, or whoever, is way too narrow and trivializes the literal meaning of the word "stakeholder".

It feels a lot like talking about "alignment" and acting as though you've achieved it when your system doesn't do the things on some ad-hoc checklist.

It also feels like falling into a common organizational pattern where the set of people tapped for "stakeholder involvement" is less like "people who are affected" and more like "people who can make trouble for us".

  7. No idea what you are referring to, I don't see any mention in the paper of letting certain people safe access to a dangerous model (unless you're talking about the evaluators?)

As I said, the paper more or less directly says that dangerous models will be deployed. And if you're going to "know your customer", or apply normal access controls, then you're going to be picking people who have such access. But neither prior vetting nor surveillance is adequate.

Finally - I get the feeling that your writing is motivated by your negative outlook,

If you want to go down that road, then I get the feeling that the paper we're talking about, and a huge amount of other stuff besides, is motivated by a need to feel positive regardless of whether it makes sense.

and not by trying to provide good analysis,

That's pretty much meaningless and impossible to respond to.

concrete feedback,

The concrete feedback is that the kind of "evaluation" described in that paper, with the paper's proposed ways of using the results, isn't likely to be particularly effective for what it's supposed to do, but could be a very effective tool for fooling yourself into thinking you'd "done enough".

If you make that kind of approach the centerpiece of your safety system, or even a major pillar of it, then you are probably giving yourself a false sense of security, and you may be diverting energy better used elsewhere. Therefore you should not do that unless those are your goals.

or an alternative plan.

It's a fallacy to respond to "that won't work" with "well, what's YOUR plan?". My not having an alternative isn't going to make anybody else's approach work.

One alternative plan might be to quit building that stuff, erase what already exists, and disband those companies. If somebody comes up with a "real" safety strategy, you can always start again later. That approach is very unlikely to work, because somebody else will build whatever you would have... but it's probably strictly better in terms of mean-time-before-disaster than coming up with rationalizations for going ahead.

Another alternative plan might be to quit worrying about it, so you're happier.

I find it unhelpful.

... which is how I feel about the original paper we're talking about. I read it as an attempt to feel more comfortable about a situation that's intrinsically uncomfortable, because it's intrinsically dangerous, maybe in an intrinsically unsolvable way. If comfort is the goal, then I guess it's helpful, but if being right is the goal, then it's unhelpful. If the comfort took the pressure off of somebody who might otherwise come up with a more effective safety approach, then it would be actively harmful... although I admit that I don't see a whole lot of hope for that anyway.

The first line of defence is to avoid training models that have sufficient dangerous capabilities and misalignment to pose extreme risk. Sufficiently concerning evaluation results should warrant delaying a scheduled training run or pausing an existing one

It's very disappointing to me that this sentence doesn't say "cancel". As far as I understand, most people on this paper agree that we do not have alignment techniques to align superintelligence. Therefore, if the model evaluations predict an AI that is sufficiently smarter than humans, the training run should be cancelled.

Sure. Fwiw I read "delay" and "pause" as stop until it's safe, not stop for a while and resume while the eval result is still concerning, but I agree being explicit would be nice.

Yeah, this is fair, and later in the section they say:

Careful scaling. If the developer is not confident it can train a safe model at the scale it initially had planned, they could instead train a smaller or otherwise weaker model.

Which is good, supports your interpretation, and gets close to the thing I want, albeit less explicitly than I would have liked. 

I still think the "delay/pause" wording pretty strongly implies that the default is to wait for a short amount of time, and then keep going at the intended capability level. I think there's some sort of implicit picture that the eval result will become unconcerning in a matter of weeks-months, which I just don't see the mechanism for short of actually good alignment progress. 

It's nice to see OpenAI, Anthropic, and DeepMind collaborating on a paper like this.

This seems like a very important document that could support/explain/justify various sensible actions relevant to AI x-risk. It's well-credentialed, plausibly comprehensible to an outsider, and focuses on things that are out of scope of mainstream AI safety efforts, closer to the core of AI x-risk, even if not quite there yet (it cites "damage in the tens of thousands of lives lost" as a relevant scale of impact, not "everyone on Earth will die").