If the post-mitigation model scores High in any category, don't deploy it until implementing mitigations such that it drops to Medium.
If the post-mitigation model scores Critical in any category, stop developing it until implementing mitigations such that it drops to High.
Any mention of what the "mitigations" in question would be?
In particular, if this all boils down to "do gradient updates against the evals test suite until the model stops doing the bad thing", or some other variant of "do shallow problem-covering until the red light stops blinking", then I'd classify this whole plan as probably net-harmful relative to just not pretending to have a safety plan in the first place.
On the other end of the spectrum, if the "mitigations" were things like "go full mechinterp and do not pass Go until every mechanistic detail of the problematic behavior has been fully understood, and removed via multiple methods each of which is itself fully understood mechanistically, all of which are expected to generalize far beyond the particular eval which raised a warning, and any one of which is expected to be sufficient on its own", then that would be a clear positive relative to baseline.
Any mention of what the "mitigations" in question would be?
Not really. For now, OpenAI mostly mentions restricting deployment (this section is pretty disappointing):
A central part of meeting our safety baselines is implementing mitigations to address various types of model risk. Our mitigation strategy will involve both containment measures, which help reduce risks related to possession of a frontier model, as well as deployment mitigations, which help reduce risks from active use of a frontier model. As a result, these mitigations might span increasing compartmentalization, restricting deployment to trusted users, implementing refusals, redacting training data, or alerting distribution partners.
I predict that if you read the doc carefully, you'd say "probably net-harmful relative to just not pretending to have a safety plan in the first place."
They mention three types of mitigations:
My one-sentence reaction after reading the doc for the first time is something like "it doesn't really tell us how OpenAI plans to address the misalignment risks that many of us are concerned about, but with that in mind, it's actually a fairly reasonable document with some fairly concrete commitments".
Everyone on Twitter has criticised the label "Responsible Scaling Policy", but the author of this post seems not to respect what seems like a gentle attempt by OpenAI to move past this label.
If we were a bit more serious about this we would perhaps immediately rename the tag "Responsible Scaling Policies" on LessWrong into "Preparedness Frameworks" with a note on the tag page "Anthropic calls their PF 'RSP', but we think this is a bad label".
(The label "RSP" isn't perfect but it's kinda established now. My friends all call things like this "RSPs." And anyway I don't think "PFs" should become canonical instead. I predict change in terminology will happen ~iff it's attempted by METR or multiple frontier labs together. For now, I claim we should debate terminology occasionally but follow standard usage when trying to actually communicate.)
I believe labels matter, and I believe the label "preparedness framework" is better than the label "responsible scaling policy." Kudos to OpenAI on this. I hope we move past the RSP label.
I think labels will matter most when communicating to people who are not following the discussion closely (e.g., tech policy folks who have a portfolio of 5+ different issues and are not reading the RSPs or PFs in great detail).
One thing I like about the label "preparedness framework" is that it raises the question "prepared for what?", which is exactly the kind of question I want policy people to be asking. PFs imply that there might be something scary that we are trying to prepare for.
New misc remark:
[Minor correction to notes]
There's some commitment that the Board will be in the loop. To my knowledge, this is the first commitment by a frontier lab to give their Board specific information or specific power besides removing-the-CEO.
The Anthropic RSP has a bunch of stuff along these lines:
Follow an "Update Process" for this document, including approval by the board of directors, following consultation with the Long-Term Benefit Trust (LTBT). Any updates will be noted and reflected in this document before they are implemented. The most recent version of this document can be found at http://anthropic.com/responsible-scaling-policy.
[...]
Share results of ASL evaluations promptly with Anthropic's governing bodies, including the board of directors and LTBT, in order to sufficiently inform them of changes to our risk profile.
[...]
Responsible Scaling Officer. There is a designated member of staff responsible for ensuring that our Responsible Scaling Commitments are executed properly. Each quarter, they will share a report on implementation status to our board and LTBT, explicitly noting any deficiencies in implementation. They will also be responsible for sharing ad hoc updates sooner if there are any substantial implementation failures.
I wrote up a ~4k-word take on the document that I'll likely post tomorrow - if you'd like to read the draft today you can PM/DM/email me.
(Basic take: Mixed bag, definitely highly incomplete, some definite problems, but better than I would have expected and a positive update)
Made a Manifold market
Might make more later, and would welcome others to do the same! (I think one could ask more interesting questions than the one I asked above.)
Nice. Another possible market topic: the mix of Low/Medium/High/Critical on their Scorecard, when they launch it or on 1 Jan 2025 or something. Hard to operationalize because we don't know how many categories there will be, and we care about both pre-mitigation and post-mitigation scores.
Made a simple market:
Oops, somehow didn't see there was actually a market baked into your question
I'd also be interested in "Will there be a publicly revealed instance of a pause in either deployment or development, as a result of a model scoring High or Critical on a scorecard, by Date X?"
(I edited my comment to add the market, sorry for confusion.)
(Separately, a market like you describe might still be worth making.)
OpenAI's basic framework:
- Do dangerous capability evals at least every 2x increase in effective training compute. This involves fine-tuning for dangerous capabilities, then doing evals on pre-mitigation and post-mitigation versions of the fine-tuned model. Score the models as Low, Medium, High, or Critical in each of several categories.
- Initial categories: cybersecurity, CBRN (chemical, biological, radiological, nuclear threats), persuasion, and model autonomy.
- If the post-mitigation model scores High in any category, don't deploy it until implementing mitigations such that it drops to Medium.
- If the post-mitigation model scores Critical in any category, stop developing it until implementing mitigations such that it drops to High.
- If the pre-mitigation model scores High in any category, harden security to prevent exfiltration of model weights. (Details basically unspecified for now.)
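The gating rules in the list above can be sketched as a small decision function. This is only an illustration of the logic as summarized here, not anything taken from OpenAI's document: the names `Risk`, `needs_eval`, and `gate` are hypothetical, and taking the worst score across categories is an assumption about how "in any category" would be applied.

```python
from enum import IntEnum
from typing import Dict


class Risk(IntEnum):
    """Scorecard levels, ordered so that comparisons work."""
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


def needs_eval(effective_compute: float, compute_at_last_eval: float) -> bool:
    """Hypothetical trigger: re-run evals once effective compute has doubled."""
    return effective_compute >= 2 * compute_at_last_eval


def gate(pre: Dict[str, Risk], post: Dict[str, Risk]) -> Dict[str, bool]:
    """Apply the three rules to per-category pre- and post-mitigation scores."""
    worst_pre = max(pre.values())
    worst_post = max(post.values())
    return {
        # High or Critical post-mitigation in any category blocks deployment.
        "may_deploy": worst_post <= Risk.MEDIUM,
        # Critical post-mitigation in any category blocks further development.
        "may_continue_development": worst_post <= Risk.HIGH,
        # High or above pre-mitigation in any category triggers security hardening.
        "must_harden_security": worst_pre >= Risk.HIGH,
    }


# Example scorecard (made-up scores, for illustration only).
pre_scores = {"cybersecurity": Risk.HIGH, "CBRN": Risk.MEDIUM,
              "persuasion": Risk.LOW, "model autonomy": Risk.LOW}
post_scores = {"cybersecurity": Risk.MEDIUM, "CBRN": Risk.LOW,
               "persuasion": Risk.LOW, "model autonomy": Risk.LOW}
decision = gate(pre_scores, post_scores)
```

Note that in this sketch everything hinges on the post-mitigation scores, so the rules say nothing about how a score was brought down, only whether it was.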
This outlines a very iterative procedure. If models started hitting the thresholds, and this logic was applied repeatedly, the effect could be pushing the problems under the rug. At levels of ability sufficient to trigger those thresholds, I would be worried about staying on the bleeding edge of danger, tweaking the model until problems don't show up in evals.
I guess this strategy is intended to be complemented with the superalignment effort, and not to be pushed on indefinitely as the main alignment strategy.
Some personal takes in response:
Some mitigation strategies are more thorough than others. For example, if they hit a High on bioweapons, and responded by filtering a lot of relevant bits of biological literature out of the pretraining set until it dropped to Medium, I wouldn't be particularly concerned about that information somehow sneaking back in as the model later got stronger. Though if you're trying to censor specific knowledge, that might become more challenging for much more capable models that could have some ability to recreate the missing bits by logical inference from nearby material that wasn't censored out. This is probably more of a concern for theory than for practical knowledge.
Added to the post:
Edit, one day later: the structure seems good, but I'm very concerned that the thresholds for High and Critical risk in each category are way too high, such that e.g. a system could very plausibly kill everyone without reaching Critical in any category. See pp. 8–11. If so, that's a fatal flaw for a framework like this. I'm interested in counterarguments; for now, praise mostly retracted; oops. I still prefer this to no RSP-y-thing, but I was expecting something stronger from OpenAI. I really hope they lower thresholds for the finished version of this framework.
My impression was that (other than Autonomy) High means "effective & professionally skilled human levels of ability at creating this type of risk" and Critical means "superhuman levels of ability at creating this type of risk". I assume their rationale is that we already have a world containing plenty of people with human levels of ability to create risk, and we're not dead yet. I think their threshold for High may be a bit too high on Persuasion, by comparing to very rare, really exceptional people (by "country-wide change agents" I assume they mean people like Nelson Mandela or Barack Obama): we don't have a lot of those, especially not willing and able to work for O(cents) per thousand tokens for anyone. I'd have gone with a lower bar like "as persuasive as a skilled & capable professional negotiator, politician+speechwriter team, or opinion writer": i.e. someone with charisma and a way with words, but not once-in-a-generation levels of charisma.
New misc remark:
It's not clear how the PF interacts with sharing models with Microsoft (or others). In particular, if OpenAI is required to share its models with Microsoft and Microsoft can just deploy them, even a great PF wouldn't stop dangerous models from being deployed. See OpenAI-Microsoft partnership.
Note that the OpenAI-Microsoft deal stops at AGI. We might hope that AGI will be invoked prior to models which are existentially dangerous.
AGI is defined as "a highly autonomous system that outperforms humans at most economically valuable work." We can hope, but it's definitely not clear that AGI comes before existentially-dangerous-AI.
OpenAI released a beta version of their responsible scaling policy (though they don't call it that). See summary page, full doc, OpenAI twitter thread, and Jan Leike twitter thread [edit: and Zvi commentary]. Compare to Anthropic's RSP and METR's Key Components of an RSP.
It's not done, so it's too early to celebrate, but based on this document I expect to be happy with the finished version. I think today is a good day for AI safety.
[Edit, one day later: the structure seems good, but I'm very concerned that the thresholds for High and Critical risk in each category are way too high, such that e.g. a system could very plausibly kill everyone without reaching Critical in any category. See pp. 8–11. If so, that's a fatal flaw for a framework like this. I'm interested in counterarguments; for now, praise mostly retracted; oops. I still prefer this to no RSP-y-thing, but I was expecting something stronger from OpenAI. I really hope they lower thresholds for the finished version of this framework.]
My high-level take: RSP-y things are good.
OpenAI's basic framework:
Random notes:
Misc remarks added later: