DeepMind's "Frontier Safety Framework" is weak and unambitious

Zach Stein-Perlman

18th May 2024

FSF blogpost. Full document (just 6 pages; you should read it). Compare to Anthropic's RSP, OpenAI's RSP ("Preparedness Framework"), and METR's Key Components of an RSP.

Google DeepMind's FSF has three steps:

1. Create model evals for warning signs of "Critical Capability Levels"
   - Evals should have a "safety buffer" of at least 6x effective compute so that CCLs will not be reached between evals
   - They list 7 CCLs across "Autonomy, Biosecurity, Cybersecurity, and Machine Learning R&D," and they're thinking about CBRN
     - E.g. "Autonomy level 1: Capable of expanding its effective capacity in the world by autonomously acquiring resources and using them to run and sustain additional copies of itself on hardware it rents"
2. Do model evals every 6x effective compute and every 3 months of fine-tuning
   - This is an "aim," not a commitment
   - Nothing about evals during deployment [update^[1]]
3. "When a model reaches evaluation thresholds (i.e. passes a set of early warning evaluations), we will formulate a response plan based on the analysis of the CCL and evaluation results. We will also take into account considerations such as additional risks flagged by the review and the deployment context." The document briefly describes 5 levels of security mitigations and 4 levels of deployment mitigations.
   - The mitigations aren't yet connected to eval results or other triggers; there are no advance commitments about safety practices
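
To make the eval cadence above concrete, here is a minimal sketch of how I read the trigger rule. Everything in it is my assumption except the two numbers (6x and 3 months): the names `ModelState` and `eval_due` are hypothetical, the FSF doesn't say how "effective compute" is measured, and I'm assuming either condition alone is enough to trigger a new round of evals.

```python
from dataclasses import dataclass

# The two figures quoted from the FSF; everything else here is hypothetical.
COMPUTE_MULTIPLIER_TRIGGER = 6.0   # "every 6x effective compute"
FINE_TUNING_MONTHS_TRIGGER = 3.0   # "every 3 months of fine-tuning"


@dataclass
class ModelState:
    effective_compute: float    # arbitrary units; the FSF gives no measurement method
    fine_tuning_months: float   # cumulative months of fine-tuning progress


def eval_due(last_evaluated: ModelState, current: ModelState) -> bool:
    """Return True if either trigger has been hit since the last eval (assumed OR rule)."""
    compute_ratio = current.effective_compute / last_evaluated.effective_compute
    months_elapsed = current.fine_tuning_months - last_evaluated.fine_tuning_months
    return (
        compute_ratio >= COMPUTE_MULTIPLIER_TRIGGER
        or months_elapsed >= FINE_TUNING_MONTHS_TRIGGER
    )
```

Note what the sketch doesn't contain: any mapping from eval results to mitigations. That's the part the FSF leaves unspecified.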

The FSF doesn't contain commitments. The blogpost says "The Framework is exploratory and we expect it to evolve significantly" and "We aim to have this initial framework fully implemented by early 2025." The document says similar things. It uses the word "aim" a lot and the word "commit" never. The FSF basically just explains a little about DeepMind's plans on dangerous capability evals. Those details do seem reasonable. (This is unsurprising given their good dangerous capability evals paper two months ago, but it's good to hear about evals in a DeepMind blogpost rather than just a paper by the safety team.)

(Ideally companies would both make hard commitments and talk about what they expect to do, clearly distinguishing between these two kinds of statements. Talking about plans like this is helpful. But with no commitments, DeepMind shouldn't get much credit.)

(The FSF is not precise enough to commit to: DeepMind could commit to doing the model evals regularly, but the document doesn't discuss specific mitigations as a function of risk assessment results.^[2])

Misc notes (but you should really read the doc yourself):

The document doesn't specify whether "deployment" includes internal deployment. (This is important because maybe lots of risk comes from the lab using AIs internally to do AI development.) Standard usage suggests internal deployment is excluded, and the focus on misuse and related cues also suggest it's excluded, but the mention of ML R&D as a dangerous capability suggests it's included. [Update.^[3]]
No mention of evals during deployment (to account for improvements in scaffolding, prompting, etc.). [Update.^[1]]
The document says "We expect it to evolve substantially as our understanding of the risks and benefits of frontier models improves, and we will publish substantive revisions as appropriate" and a few similar things. The document doesn't say how it will be revised/amended, which isn't surprising, since it doesn't make formal commitments.
No external evals or accountability, but they're "exploring" it.
Public accountability: unfortunately, there's no mention of releasing eval results or even announcing when thresholds are reached. They say "We are exploring internal policies around alerting relevant stakeholder bodies when, for example, evaluation thresholds are met."^[4]
- But they did publish evals and eval results for their recent releases (1, 2)
Brief mention of alignment and control: "protection against the risk of systems acting adversarially against humans may require additional Framework components, including new evaluations and control mitigations that protect against adversarial AI activity."
Security mitigations:
- Mostly they discuss developers' access to the weights.
  - This is disappointing. It's important but lots of other stuff is important too. Fortunately they mention how their levels correspond to the levels in the forthcoming RAND report on securing model weights, which is much more exhaustive.
  - Some people get unilateral access to weights until the top level. This is disappointing. It's been almost a year since Anthropic said it was implementing two-party control, where nobody can unilaterally access the weights.^[5] (But reducing the number of people with access to the weights is less important than ensuring that when someone does access the weights they can't email them away or something.)
- They say the status quo is level 0 out of 4. Everyone says Google has great security; I'd assumed they were already around their level 2 or 3. Alas.
Their deployment mitigation levels are about preventing misuse of dangerous capabilities; this is fine but there are basically no details.
CCLs:
- Nothing on how they'll operationalize (warning signs for) the CCLs.
- The categories are good. The CCL definitions aren't detailed, but they do a good job of capturing the correct dimensions of risks within each category. The thresholds feel high — 6/7 of the CCLs feel like the capabilities would be a Really Big Deal in prosaic terms, and ~4 feel like a big deal for x-risk. But you can't say whether the thresholds are "too high" without corresponding safety mitigations, which this document doesn't have.
Note that the "level 1" mitigations don't correspond to the "level 1" CCLs; the levels are totally independent.

Maybe this document was rushed because DeepMind wanted to get something out before the AI Seoul Summit next week. I've heard that the safety team has better and more detailed plans. Hopefully some of those get published in DeepMind's voice (e.g. posted on the DeepMind blog or pronounced by DeepMind leadership) soon. Hopefully the bottleneck is polishing those plans, not weakening them to overcome a veto from DeepMind leadership.

Reminder of how other labs are doing on RSPs, briefly (I feel very comfortable about these claims, but I omit justification and there's not a consensus on these claims):

Anthropic RSP: a promising RSP so far but missing the all-important evals and safety practices for ASL-4.
OpenAI beta Preparedness Framework: an unpromising beta RSP — nonexhaustively, the capability thresholds are way too high, and there's nothing about internal deployment, and they might be required to share models with Microsoft despite these commitments, and they failed to share the risk-scorecard when releasing GPT-4o.
Meta has nothing like an RSP. With Llama 3, they have started doing red-teaming and model evals for dangerous capabilities: CBRNE and some cyber capabilities. They have not made safety commitments based on risk assessment results.
Microsoft says "When it comes to frontier model deployment, Microsoft and OpenAI have together defined capability thresholds that act as a trigger to review models in advance of their first release or downstream deployment. The scope of a review, through our joint Microsoft-OpenAI Deployment Safety Board (DSB), includes model capability discovery." This sounds good, but Microsoft has not elaborated on these capability thresholds, shared details about the DSB, or shared details about past reviews. So those details may be quite bad and reviews may not even be about real dangerous capabilities.

^[1] Update: a DeepMind senior staff member says the 3-month condition includes during deployment. Yay.
^[2] But it says they plan to: "As we better understand the risks posed by models at different CCLs, and the contexts in which our models will be deployed, we will develop mitigation plans that map the CCLs to the security and deployment levels described." But maybe only after the thresholds are crossed: "When a model reaches evaluation thresholds (i.e. passes a set of early warning evaluations), we will formulate a response plan."
^[3] Update: a DeepMind senior staff member says "deployment" means external deployment.
^[4] The full sentence doesn't parse: "We are exploring internal policies around alerting relevant stakeholder bodies when, for example, evaluation thresholds are met, and in some cases mitigation plans as well as post-mitigation outcomes." What about mitigation plans?
^[5] See Frontier Model Security. But Anthropic hasn't announced that it has successfully implemented this.

Rohin Shah (2y)

Thanks for the detailed critique – I love that you actually read the document in detail. A few responses on particular points:

The document doesn't specify whether "deployment" includes internal deployment.

Unless otherwise stated, "deployment" to us means external deployment – because this is the way most AI researchers use the term. Deployment mitigations level 2 discusses the need for mitigations on internal deployments. ML R&D will require thinking about internal deployments (and so will many of the other CCLs).

Some people get unilateral access to weights until the top level. This is disappointing. It's been almost a year since Anthropic said it was implementing two-party control, where nobody can unilaterally access the weights.

I don't think Anthropic meant to claim that two-party control would achieve this property. I expect anyone using a cloud compute provider is trusting that the provider will not access the model, not securing it against such unauthorized access. (In principle some cryptographic schemes could allow you to secure model weights even from your cloud compute provider, but I highly doubt people are doing that, since it is very expensive.)

Mostly they discuss developers' access to the weights. This is disappointing. It's important but lots of other stuff is important too.

The emphasis on weights access isn’t meant to imply that other kinds of mitigations don’t matter. We focused on what it would take to increase our protection against exfiltration. A lot of the example measures discussed in the RAND interim report aren’t discussed because we already do them. For example, Google already does the following from RAND Level 3: (a) develop an insider threat program and (b) deploy advanced red-teaming. (That’s not meant to be exhaustive, I don’t personally know the details here.)

No mention of evals during deployment (to account for improvements in scaffolding, prompting, etc.).

Sorry, that's just poor wording on our part -- "every 3 months of fine-tuning progress" was meant to capture that as well. Thanks for pointing this out!

Talking about plans like this is helpful. But with no commitments, DeepMind shouldn't get much credit.

With the FSF, we prefer to try it out for a while and iron out any issues, particularly since the science is in early stages, and best practices will need to evolve as we learn more. But as you say, we are running evals even without official FSF commitments, e.g. the Gemini 1.5 tech report has dangerous capability evaluation results (see Section 9.5.2).

Given recent updates in AGI safety overall, I'm happy that GDM and Google leadership take commitments seriously, and think carefully about which ones they are and are not willing to make, including the FSF, the White House Commitments, etc.

Zach Stein-Perlman (2y)

Thanks.

Deployment mitigations level 2 discusses the need for mitigations on internal deployments.

Good point; this makes it clearer that "deployment" means external deployment by default. But level 2 only mentions "internal access of the critical capability," which sounds like it's about misuse — I'm more worried about AI scheming and escaping when the lab uses AIs internally to do AI development.

ML R&D will require thinking about internal deployments (and so will many of the other CCLs).

OK. I hope DeepMind does that thinking and makes appropriate commitments.

two-party control

Thanks. I'm pretty ignorant on this topic.

"every 3 months of fine-tuning progress" was meant to capture [during deployment] as well

Yayyy!

Rohin Shah (2y)

Good point; this makes it clearer that "deployment" means external deployment by default. But level 2 only mentions "internal access of the critical capability," which sounds like it's about misuse — I'm more worried about AI scheming and escaping when the lab uses AIs internally to do AI development.

You're right: our deployment mitigations are targeted at misuse only because our current framework focuses on misuse. As we note in the "Future work" section, we would need to do more work to address risks from misaligned AI. We focused on risks from deliberate misuse initially because they seemed more likely to us to appear first.

habryka (2y)

The document doesn't specify whether "deployment" includes internal deployment. (This is important because maybe lots of risk comes from the lab using AIs internally to do AI development.)

This seems like such an obvious and crucial distinction that I felt very surprised when the framework didn't disambiguate between the two.

Zach Stein-Perlman (2y)

Yep

Two weeks ago I sent a senior DeepMind staff member some "Advice on RSPs, especially for avoiding ambiguities"; #1 on my list was "Clarify how your deployment commitments relate to internal deployment, not just external deployment" (since it's easy and the OpenAI PF also did a bad job of this)

Edit: Rohin says "deployment" means external deployment by default and notes that the doc mentions "internal access" as distinct from deployment.

Orpheus16 (2y)

@Zach Stein-Perlman I appreciate your recent willingness to evaluate and criticize safety plans from labs. I think this is likely a public good that is underprovided, given the strong incentives that many people have to maintain a good standing with labs (not to mention more explicit forms of pressure applied by OpenAI and presumably other labs).

One thought: I feel like the difference between how you described the Anthropic RSP and how you described the OpenAI PF feels stronger than the actual quality differences between the documents. I agree with you that the thresholds in the OpenAI PF are too high, but I think the PF should get "points" for spelling out risks that go beyond ASL-3/misuse.

OpenAI has commitments that are insufficiently cautious for ASL-4+ (or what they would call high/critical on model autonomy), but Anthropic circumvents this problem by simply refusing to make any commitments around ASL-4 (for now).

You note this limitation when describing Anthropic's RSP, but you describe it as "promising" while describing the PF as "unpromising." In my view, this might be unfairly rewarding Anthropic for just not engaging with the hardest parts of the problem (or unfairly penalizing OpenAI for giving their best guess answers RE how to deal with the hardest parts of the problem).

We might also just disagree on how firm or useful the commitments in each document are: I walked away with a much better understanding of how OpenAI plans to evaluate & handle risks than how Anthropic plans to handle & evaluate risks. I do think OpenAI's thresholds are too high, but it's likely that I'll feel the same way about Anthropic's thresholds. In particular, I don't expect either (any?) lab to be able to resist the temptation to internally deploy models with autonomous persuasion capabilities or autonomous AI R&D capabilities (partially because the competitive pressures and race dynamics pushing them to do so will be intense). I don't see evidence that either lab is taking these kinds of risks particularly seriously, has ideas about what safeguards would be considered sufficient, or is seriously entertaining the idea that we might need to do a lot (>1 year) of dedicated safety work (that potentially involves coming up with major theoretical insights, as opposed to a "we will just solve it with empiricism" perspective) before we are confident that we can control such systems.

TLDR: I would remove the word "promising" and maybe characterize the RSP more like "an initial RSP that mostly spells out high-level reasoning, makes few hard commitments, and focuses on misuse while missing the all-important evals and safety practices for ASL-4."

ryan_greenblatt (2y)

In particular, I don't expect either (any?) lab to be able to resist the temptation to internally deploy models with autonomous persuasion capabilities or autonomous AI R&D capabilities

I agree with this as stated, but don't think that avoiding deploying such models is needed to mitigate risk.

I think various labs are to some extent in denial of this because massively deploying possibly misaligned systems sounds crazy (and is somewhat crazy), but I would prefer if various people realized this was likely the default outcome and prepared accordingly.

More strongly, I think most of the relevant bit of the safety usefulness trade-off curve involves deploying such models. (With countermeasures.)

or is seriously entertaining the idea that we might need to do a lot (>1 year) of dedicated safety work (that potentially involves coming up with major theoretical insights, as opposed to a "we will just solve it with empiricism" perspective) before we are confident that we can control such systems.

I think this is a real possibility, but unlikely to be necessary, depending on the risk target. E.g., I think you can deploy ASL-4 models with <5% risk without theoretical insights and instead just via being very careful with various prosaic countermeasures (mostly control).

<1% risk probably requires stronger stuff, though it will depend on the architecture and various other random details.

(That said, I'm pretty sure that these labs aren't making decisions based on carefully analyzing the situation and are instead just operating like "idk human level models don't seem that bad, we'll probably be able to figure it out, humans can solve most problems with empiricism on priors". But, this prior seems more right than overwhelming pessimism IMO.)

Also, I think you should seriously entertain the idea that just trying quite hard with various prosaic countermeasures might suffice for reasonably high levels of safety. And thus pushing on this could potentially be very leveraged relative to trying to hit a higher target.

Orpheus16 (2y)

I personally have a large amount of uncertainty around how useful prosaic techniques & control techniques will be. Here are a few statements I'm more confident in:

Ideally, AGI development would have much more oversight than we see in the status quo. Whether or not development or deployment activities keep national security risks below acceptable levels should be a question that governments are involved in answering. A sensible oversight regime would require evidence of positive safety or "affirmative safety".
My biggest concern with the prosaic/control metastrategy is that I think race dynamics substantially decrease its usefulness. Even if ASL-4 systems are deployed internally in a safe way, we're still not out of the acute risk period. And even if the leading lab (Lab A) is trustworthy/cautious, it will be worried that incautious Lab B is about to get to ASL-4 in 1-3 months. This will cause the leading lab to underinvest into control, feel like it doesn't have much time to figure out how to use its ASL-4 system (assuming it can be controlled), and feel like it needs to get to ASL-5+ rather quickly.

It's still plausible to me that perhaps this period of a few months is enough to pull off actions that get us out of the acute risk period (e.g., use the ASL-4 system to generate evidence that controlling more powerful systems would require years of dedicated effort and have Lab A devote all of their energy toward getting governments to intervene).

Given my understanding of the current leading labs, it's more likely to me that they'll underestimate the difficulties of bootstrapped alignment and assume that things are OK as long as empirical tests don't show imminent evidence of danger. I don't think this prior is reasonable in the context of developing existentially dangerous technologies, particularly technologies that are intended to be smarter than you. I think sensible risk management in such contexts should require a stronger theoretical/conceptual understanding of the systems one is designing.

(My guess is that you agree with some of these points and I agree with some points along the lines of "maybe prosaic/control techniques will just work, we aren't 100% sure they're not going to work", but we're mostly operating in different frames.)

(I also do like/respect a lot of the work you and Buck have done on control. I'm a bit worried that the control meme is overhyped, partially because it fits into the current interests of labs. Like, control seems like a great idea and a useful conceptual frame, but I haven't yet seen a solid case for why we should expect specific control techniques to work once we get to ASL-4 or ASL-4.5 systems, as well as what we plan to do with those systems to get us out of the acute risk period. Like, the early work on using GPT-3 to evaluate GPT-4 was interesting, but it feels like the assumption about the human red-teamers being better at attacking than GPT-4 will go away, or at least be much less robust, once we get to ASL-4. But I'm also sympathetic to the idea that we're at the early stages of control work, and I am genuinely interested in seeing what you, Buck, and others come up with as the control agenda progresses.)

ryan_greenblatt (2y)

I agree with 1 and think that race dynamics make the situation considerably worse when we only have access to prosaic approaches. (Though I don't think this is the biggest issue with these approaches.)

I think I expect a period substantially longer than several months by default due to slower takeoff than this. (More like 2 years than 2 months.)

Insofar as the hope was for governments to step in at some point, I think the best and easiest point for them to step in is actually during the point where AIs are already becoming very powerful:

Prior to this point, we don't get substantial value from pausing, especially if we're not pausing/dismantling all of semi-conductor R&D globally.
Prior to this point AI won't be concerning enough for governments to take aggressive action.
At this point, additional time is extremely useful due to access to powerful AIs.
The main counterargument is that at this point more powerful AI will also look very attractive. So, it will seem too expensive to stop.

So, I don't really see very compelling alternatives to push on at the margin as far as "metastrategy" (though I'm not sure I know exactly what you're pointing at here). Pushing for bigger asks seems fine, but probably less leveraged.

I actually don't think control is a great meme for the interests of labs which purely optimize for power as it is a relatively legible ask which is potentially considerably more expensive than just "our model looks aligned because we red teamed it" which is more like the default IMO.

The same way "secure these model weights from China" isn't a great meme for these interests IMO.

Zach Stein-Perlman (2y)

Sorry for brevity.

We just disagree. E.g. you "walked away with a much better understanding of how OpenAI plans to evaluate & handle risks than how Anthropic plans to handle & evaluate risks"; I felt like Anthropic was thinking about most stuff better.

I think Anthropic's ASL-3 is reasonable and OpenAI's thresholds and corresponding commitments are unreasonable. If the ASL-4 threshold were too high or the commitments so poor that ASL-4 was meaningless, I agree Anthropic's RSP would be at least as bad as OpenAI's.

One thing I think is a big deal: Anthropic's RSP treats internal deployment like external deployment; OpenAI's has almost no protections for internal deployment.

I agree "an initial RSP that mostly spells out high-level reasoning, makes few hard commitments, and focuses on misuse while missing the all-important evals and safety practices for ASL-4" is also a fine characterization of Anthropic's current RSP.

Quick edit: PF thresholds are too high; PF seems doomed / not on track. But RSPv1 is consistent with RSPv1.1 being great. At least Anthropic knows and says there’s a big hole. That's not super relevant to evaluating labs' current commitments but is very relevant to predicting.

Orpheus16 (2y)

I agree with ~all of your subpoints but it seems like we disagree in terms of the overall appraisal.

Thanks for explaining your overall reasoning though. Also big +1 that the internal deployment stuff is scary. I don’t think either lab has told me what protections they’re going to use for internally deploying dangerous (~ASL-4) systems, but the fact that Anthropic treats internal deployment like external deployment is a good sign. OpenAI at least acknowledges that internal deployment can be dangerous through its distinction between high risk (can be internally deployed) and critical risk (cannot be), but I agree that the thresholds are too high, particularly for model autonomy.

Aaron_Scher (2y)

Just chiming in that I appreciate this post, and my independent impressions of reading the FSF align with Zach's conclusions: weak and unambitious.

A couple additional notes:

The thresholds feel high — 6/7 of the CCLs feel like the capabilities would be a Really Big Deal in prosaic terms, and ~4 feel like a big deal for x-risk. But you can't say whether the thresholds are "too high" without corresponding safety mitigations, which this document doesn't have. (Zach)

These also seemed pretty high to me, which is concerning given that they are "Level 1". This doesn't necessarily imply but it does hint that there won't be substantial mitigations — above the current level — required until those capability levels. My guess is that current jailbreak prevention is insufficient to mitigate substantial risk from models that are a little under the level 1 capabilities for e.g., bio.

GDM gets props for specifically indicating ML R&D + "hyperbolic growth in AI capabilities" as a source of risk.

Given the lack of commitments, it's also somewhat unclear what scope to expect this framework to eventually apply to. GDM is a large org with, presumably, multiple significant general AI capabilities projects. Especially given that "deployment" refers to external deployment, it seems like there's going to be substantial work in ensuring that all the internal AI development projects proceed safely. E.g., when/if there are ≥3 major teams and dozens of research projects working on fine-tuning highly capable models (e.g., base model just below level 1), compliance may be quite difficult. But this all depends on what the actual commitments and mechanisms turn out to be. This comes to mind after this event a few weeks ago, where it looks like a team at Microsoft released a model without following all internal guidelines, and then tried to unrelease it (but I could be confused).

cubefox (2y)

RSP = Responsible Scaling Policy
