Zach Stein-Perlman

AI strategy & governance. Blog: Not Optional.

Some prior discussion.

[Edit: didn't mean to suggest David's post is redundant.]

I agree. But I claim that saying "I can't talk about the game itself, as that's forbidden by the rules" is like saying "I won't talk about the game itself because I decided not to": the underlying reason is unclear.

Unfortunately, I can't talk about the game itself, as that's forbidden by the rules.

You two can just change the rules... I'm confused by this rule.

The control-y plan I'm excited about doesn't feel to me like squeezing useful work out of clearly misaligned models. It's more like: use scaffolding/oversight to make using a model safer, and get decent arguments that using the model (in certain constrained ways) is pretty safe even if it's scheming. Then if you ever catch the model scheming, do some combination of (1) stop using it, (2) produce legible evidence of scheming, and (3) do few-shot catastrophe prevention. But I defer to Ryan here.

New misc remark:

  • OpenAI's commitments about deployment seem to just refer to external deployment, unfortunately.
    • This isn't explicit, but they say "Deployment in this case refers to the spectrum of ways of releasing a technology for external impact."
    • This contrasts with Anthropic's RSP, in which "deployment" includes internal use.

Added to the post:

Edit, one day later: the structure seems good, but I'm very concerned that the thresholds for High and Critical risk in each category are way too high, such that e.g. a system could very plausibly kill everyone without reaching Critical in any category. See pp. 8–11. If so, that's a fatal flaw for a framework like this. I'm interested in counterarguments; for now, praise mostly retracted; oops. I still prefer this to having no RSP-y thing, but I was expecting something stronger from OpenAI. I really hope they lower the thresholds for the finished version of this framework.

Any mention of what the "mitigations" in question would be?

Not really. For now, OpenAI mostly mentions restricting deployment (this section is pretty disappointing):

A central part of meeting our safety baselines is implementing mitigations to address various types of model risk. Our mitigation strategy will involve both containment measures, which help reduce risks related to possession of a frontier model, as well as deployment mitigations, which help reduce risks from active use of a frontier model. As a result, these mitigations might span increasing compartmentalization, restricting deployment to trusted users, implementing refusals, redacting training data, or alerting distribution partners.

I predict that if you read the doc carefully, you'd say "probably net-harmful relative to just not pretending to have a safety plan in the first place."

Some personal takes in response:

  • Yeah, largely the letter of the law isn't sufficient.
  • Some evals are hard to goodhart. E.g. "can red-teamers demonstrate problems (given our mitigations)" is pretty robust — if red-teamers can't demonstrate problems, that's good evidence of safety (for deployment with those mitigations), even if the mitigations feel jury-rigged.
  • Yeah, this is intended to be complemented by superalignment.