Zach Stein-Perlman

AI strategy & governance. ailabwatch.org. Looking for new projects.

Sequences

Slowing AI

Comments

Thanks.

Deployment mitigations level 2 discusses the need for mitigations on internal deployments.

Good point; this makes it clearer that "deployment" means external deployment by default. But level 2 only mentions "internal access of the critical capability," which sounds like it's about misuse — I'm more worried about AI scheming and escaping when the lab uses AIs internally to do AI development.

ML R&D will require thinking about internal deployments (and so will many of the other CCLs).

OK. I hope DeepMind does that thinking and makes appropriate commitments.

two-party control

Thanks. I'm pretty ignorant on this topic.

"every 3 months of fine-tuning progress" was meant to capture [during deployment] as well

Yayyy!

Thanks.

I'm glad to see that the non-compliance reporting policy has been implemented and includes anonymous reporting. I'm still hoping to see more details. (And I'm generally confused about why Anthropic doesn't share more details on policies like this — I fail to imagine a story about how sharing details could be bad, except that the details would be seen as weak and this would make Anthropic look bad.)

What details are you imagining would be helpful for you? Sharing the PDF of the formal policy document doesn't mean much compared to whether it's actually implemented and upheld and treated as a live option that we expect staff to consider (fwiw: it is, and I don't have a non-disparage agreement). On the other hand, sharing internal docs has real costs: time spent reviewing them before release, the chance that someone seizes on a misinterpretation and leaps to conclusions, and so on.

Not sure. I can generally imagine a company publishing what Anthropic has published but having a weak/fake system in reality. Policy details do seem less important for non-compliance reporting than for some other policies. For example, Anthropic says it has an infohazard review policy, and I expect it's good, but I'm not confident; for other companies I wouldn't necessarily expect their policy to be good even if they say a formal policy exists, and seeing details (with sensitive bits redacted) would help.

I mostly take back my "secret policy is strong evidence of bad policy" insinuation; that's ~true on my home planet, but on Earth you don't get sufficient credit for sharing good policies and there's substantial negative EV from misunderstandings and adversarial interpretations, so I guess it's often correct not to share :(

As an 80/20 of publishing, maybe you could share a policy with an external auditor who would then publish whether they think it's good or have concerns. I would feel better if that happened all the time.

I think this is implicit — the RSP discusses deployment mitigations, which can't be enforced if the weights are shared.

No major news here, but some minor good news, and independent of news/commitments/achievements I'm always glad when labs share thoughts like this. Misc reactions below.


Probably the biggest news is the Claude 3 evals report. I haven't read it yet. But at a glance I'm confused: it sounds like "red line" means ASL-3 but they also operationalize "yellow line" evals and those sound like the previously-discussed ASL-3 evals. Maybe red is actual ASL-3 and yellow is supposed to be at least 6x effective compute lower, as a safety buffer.

"Assurance Mechanisms . . . . should ensure that . . . our safety and security mitigations are validated publicly or by disinterested experts." This sounds great. I'm not sure what it looks like in practice. I wish it was clearer what assurance mechanisms Anthropic expects or commits to implement and when, and especially whether they're currently doing anything along the lines of "validated publicly or by disinterested experts." (Also whether "validated" means "determined to be sufficient if implemented well" or "determined to be implemented well.")

Something that was ambiguous in the RSP and is still ambiguous here: during training, if Anthropic reaches "3 months since last eval" before "4x since last eval," do they do evals? Or does the "3 months" condition only apply after training?
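
To make the ambiguity concrete, here are the two readings I can imagine, sketched in Python (the function and parameter names are mine, not Anthropic's wording; thresholds: evals every 4x effective compute and every 3 months):

```python
def should_eval_reading_a(effective_compute_ratio: float,
                          months_since_last_eval: float,
                          in_training: bool) -> bool:
    """Reading A: either condition triggers evals, including mid-training.
    (in_training is unused here; both conditions always apply.)"""
    return effective_compute_ratio >= 4 or months_since_last_eval >= 3

def should_eval_reading_b(effective_compute_ratio: float,
                          months_since_last_eval: float,
                          in_training: bool) -> bool:
    """Reading B: during training only the 4x condition applies; the 3-month
    clock only binds after training (e.g. during fine-tuning/deployment)."""
    if in_training:
        return effective_compute_ratio >= 4
    return effective_compute_ratio >= 4 or months_since_last_eval >= 3
```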

I'm glad to see that the non-compliance reporting policy has been implemented and includes anonymous reporting. I'm still hoping to see more details. (And I'm generally confused about why Anthropic doesn't share more details on policies like this — I fail to imagine a story about how sharing details could be bad, except that the details would be seen as weak and this would make Anthropic look bad.)

Some other hopes for the RSP, off the top of my head:

  • ASL-4 definition + operationalization + mitigations, including generally how Anthropic will think about safety cases after the "no dangerous capabilities" safety case doesn't work anymore
  • Clarifying security commitments (when the RAND report on securing model weights comes out)
  • Dangerous capability evals by external auditors, e.g. METR

Edit: never mind; maybe this tweet is misleading and narrow and just about restoring people's vested equity. I'm not sure what that means in the context of OpenAI's pseudo-equity, but possibly this tweet isn't a big commitment.

@gwern I'm interested in your take on this new Altman tweet:

we have never clawed back anyone's vested equity, nor will we do that if people do not sign a separation agreement (or don't agree to a non-disparagement agreement). vested equity is vested equity, full stop.

there was a provision about potential equity cancellation in our previous exit docs; although we never clawed anything back, it should never have been something we had in any documents or communication. this is on me and one of the few times i've been genuinely embarrassed running openai; i did not know this was happening and i should have.

the team was already in the process of fixing the standard exit paperwork over the past month or so. if any former employee who signed one of those old agreements is worried about it, they can contact me and we'll fix that too. very sorry about this.

In particular, "i did not know this was happening".

Sorry for brevity.

We just disagree. E.g. you "walked away with a much better understanding of how OpenAI plans to evaluate & handle risks than how Anthropic plans to handle & evaluate risks"; I felt like Anthropic was thinking about most stuff better.

I think Anthropic's ASL-3 is reasonable and OpenAI's thresholds and corresponding commitments are unreasonable. If the ASL-4 threshold turns out to be too high, or the commitments too weak, such that ASL-4 is meaningless, I agree Anthropic's RSP would be at least as bad as OpenAI's.

One thing I think is a big deal: Anthropic's RSP treats internal deployment like external deployment; OpenAI's has almost no protections for internal deployment.

I agree "an initial RSP that mostly spells out high-level reasoning, makes few hard commitments, and focuses on misuse while missing the all-important evals and safety practices for ASL-4" is also a fine characterization of Anthropic's current RSP.

Quick edit: PF thresholds are too high; PF seems doomed / not on track. But RSPv1 is consistent with RSPv1.1 being great. At least Anthropic knows and says there’s a big hole. That's not super relevant to evaluating labs' current commitments but is very relevant to predicting.

Sorry for brevity, I'm busy right now.

  1. Noticing good stuff labs do, not just criticizing them, is often helpful. I wish you thought of this work more as "evaluation" than "criticism."
  2. It's often important for evaluation to be quite truth-tracking. Criticism isn't obviously good by default.

Edit:

3. I'm pretty sure OP likes good criticism of the labs; no comment on how OP is perceived. And I think I don't understand your "good judgment" point. Feedback I've gotten on AI Lab Watch from senior AI safety people has been overwhelmingly positive, and of course there's a selection effect in what I hear, but I'm quite sure most of them support such efforts.

4. Conjecture (not exclusively) has done things that frustrated me, including in dimensions like being "'unilateralist,' 'not serious,' and 'untrustworthy.'" I think most criticism of Conjecture-related advocacy is legitimate and not just because people are opposed to criticizing labs.

5. I do agree on "soft power" and some of "jobs." People often don't criticize the labs publicly because they're worried about negative effects on them, their org, or people associated with them.

Yep

Two weeks ago I sent a senior DeepMind staff member some "Advice on RSPs, especially for avoiding ambiguities"; #1 on my list was "Clarify how your deployment commitments relate to internal deployment, not just external deployment" (since it's easy and the OpenAI PF also did a bad job of this).

:(

Edit: Rohin says "deployment" means external deployment and notes that the doc mentions "internal access" as distinct from deployment.

Added updates to the post. Updating it as stuff happens. Not paying much attention; feel free to DM me or comment with more stuff.

The commitment ("20% of the compute we've secured to date," as of July 2023, to be used "over the next four years") may be quite small by 2027, given that compute use is increasing exponentially. I'm confused about why people think it's a big commitment.
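
To spell out the arithmetic, here's a rough back-of-the-envelope sketch; the 3x/year growth rate and the even spread of the pledge over four years are my assumptions, purely for illustration, not real figures:

```python
# Back-of-the-envelope: how big is "20% of compute secured by mid-2023,
# spent over four years" relative to a lab's compute in 2027?
# All numbers below are illustrative assumptions, not real figures.

growth_per_year = 3.0               # assumed annual growth in compute secured/used
compute_2023 = 1.0                  # normalize mid-2023 compute to 1
pledge_total = 0.20 * compute_2023  # the pledged pool: 20% of mid-2023 compute
years = 4                           # pledge is spread over 2023-2027

pledge_per_year = pledge_total / years                 # ~0.05x of 2023 compute per year
compute_2027 = compute_2023 * growth_per_year ** years # ~81x of 2023 compute

print(f"pledge per year:             {pledge_per_year:.2f}x of 2023 compute")
print(f"total compute in 2027:       {compute_2027:.0f}x of 2023 compute")
print(f"pledge as share of 2027 use: {pledge_per_year / compute_2027:.2%}")
```

Under those assumptions the pledge covers well under 1% of the lab's 2027 compute; the exact rate doesn't matter much, since a fixed 2023-denominated pledge shrinks fast relative to any exponential growth.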
