Zach Stein-Perlman

AI strategy & governance. ailabwatch.org. Looking for new projects.

As of 1 June 2024, I've recently been focused on blogging, but I expect to soon shift to exploring a version of ailabwatch.org that could get more attention. I'm most excited to receive offers to help with projects like ailabwatch.org; I'm also happy to be pitched blogposts/projects.

Sequences

Slowing AI


Comments

I'm still confused about Article IV(D)(5)(a) (p. 18) of the CoI. See footnote 3.

New (perfunctory) page: AI companies' corporate documents. I'm not sure it's worthwhile, but maybe a better version of it will be. Suggestions/additions welcome.

New page on AI companies' policy advocacy: https://ailabwatch.org/resources/company-advocacy/.

This page is the best collection on the topic (I'm not really aware of others), but I decided it's low-priority, so it's unpolished. If a better version would be helpful for you, let me know and I'll prioritize it.

Securing model weights is underrated for AI safety. (Even though it's very highly rated.) If the leading lab can't stop critical models from leaking to actors that won't use great deployment safety practices, approximately nothing else matters. Safety techniques would need to be based on properties that those actors are unlikely to reverse (alignment, maybe unlearning) rather than properties that would be undone or that require a particular method of deployment (control techniques, RLHF harmlessness, deployment-time mitigations).

However hard the "make a critical model you can safely deploy" problem is, the "make a critical model that can safely be stolen" problem is... much harder.

I mostly agree. Anthropic is kind of an exception (see footnotes 3, 4, 5, 11, 12, and 14). Anthropic's stance on model escape is basically "we'll handle that in ASL-4." That's reasonable, and better than not noticing the threat, but they've said ~nothing specific about escape/control yet.


I worry that quotes like

red-teaming should confirm that models can't cause harm quickly enough to evade detection.

are misleading in the context of escape (from all labs, not just Anthropic). When defining ASL-4, Anthropic will need to clarify whether this includes scheming/escape and how safety could be "confirm[ed]"; I think demonstrating safety will require high-effort control evals, not just red-teaming.

New: Anthropic lists the board members and LTBT members on a major page (rather than in a blogpost with a specific date), which presumably means they'll keep it up to date, so we'll quickly learn of changes. Hooray:

Anthropic Board of Directors
Dario Amodei, Daniela Amodei, Yasmin Razavi, and Jay Kreps.

LTBT Trustees
Neil Buddy Shah, Kanika Bahl, and Zach Robinson.

(We already knew these are the humans.)

Also new:

In December 2023, Jason Matheny stepped down from the Trust to preempt any potential conflicts of interest that might arise with RAND Corporation's policy-related initiatives. Paul Christiano stepped down in April 2024 to take a new role as the Head of AI Safety at the U.S. AI Safety Institute. Their replacements will be elected by the Trustees in due course.

Again, this is not news (although it hasn't become well-known, I think), but I appreciate this public note.

Labs should give deeper model access to independent safety researchers (to boost their research)

Sharing deeper access helps safety researchers who work with frontier models, obviously.

Some kinds of deep model access:

  1. Helpful-only version
  2. Fine-tuning permission
  3. Activations and logits access (see the code sketch after this list)
  4. [speculative] Interpretability researchers send code to the lab; the lab runs the code on the model; the lab sends back the results

See Shevlane 2022 and Bucknall and Trager 2023.

A lab is disincentivized from sharing deep model access because it doesn't want headlines about how researchers got its model to do scary things.

It has been suggested that labs are also disincentivized from sharing because they want safety researchers to want to work at the labs, and giving independent researchers model access makes working at a lab less necessary. I'm skeptical that this effect is real/nontrivial.

Labs should limit some kinds of access to avoid effectively leaking model weights. But sharing limited access with a moderate number of safety researchers seems very consistent with keeping control of the model.

This post is not about sharing with independent auditors to assess risks from a particular model.

@Buck suggested I write about this, but I don't have much to say about it. If you have takes (on the object level, or on what to say in a blogpost on this topic), please let me know.
