Zach Stein-Perlman

AI strategy & governance. ailabwatch.org. ailabwatch.substack.com

Comments
I brainstormed sabotage-related threats with friends:

  • Sandbag on a few safety research tasks, especially techniques to catch scheming and stuff that's high-level or has poor feedback loops
    • Plausible; maybe not so bad
  • Sandbag on ~all safety research tasks
    • Easy to catch; catching obvious sandbagging has safety benefits + maybe we can post-train it away
  • Sandbag/lie when validating that a model/deployment/whatever is safe
    • idk
  • When automating AI R&D, backdoor the model
    • (this is takeover-attempt-y, not sandbagging-y)
    • idk
  • When deployed to automate/improve security (especially for the model's own weights): both sandbagging and code-backdooring/hacking/etc.
    • idk
  • Sandbagging on AI R&D to help another lab win the race (!?)
    • Implausible
  • ??

Are people quite worried about safety research sabotage? Why? Do they think a (say) 20% hit to automated safety research is a huge deal, or do they expect a larger hit?

When someone describes a safety case where one of the steps is infeasible, here are two possible reactions:

  1. Step X will be ~impossible to justify; hopefully the lab will understand that (if nothing else then when they try to justify step X) and look for a different kind of safety case
  2. Step X will be ~impossible to justify, but the authors seem to think the plan requiring it is decent; this suggests that they don't understand the issue and when they make a safety case they'll likely think they're justifying step X when they're not

Regardless, if you think a step is very hard or the authors seem confused, ideally you should explain why. And then hopefully the authors say "yeah oops that's too hard" or "yeah we're imagining doing the really hard thing" or "we don't think it will be so hard: we agree observation O would suffice and we expect it's easy and if it turns out to be hard we'll have to make a different safety case" or "we disagree, here's why." (I'm mostly observing this independent of this post.)

[Edited]

My initial reactions on a quick read:

  1. Interp: no inside view; my impression is that my friends are pessimistic about such safety cases in the next few years
  2. Control: hooray for Anthropic finally writing something about control; some details seem good; some details seem odd ("contexts that were specifically whitelisted"); I think the naive pre-deployment honeypots path doesn't work because a scheming model wins with the defect rarely strategy (related: https://www.lesswrong.com/posts/vYWDEtJ2b8tkbG7Rv/behavioral-red-teaming-is-unlikely-to-produce-clear-strong) [very sophisticated or expensive honeypotting setups could work but the post doesn't seem to be imagining that or noticing the issue]; that's important and the current language in the post is concerning
  3. Incentives: I don't really understand, or I'm dubious, or I expect that A3.1 is very hard to demonstrate
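The "defect rarely" point can be made concrete with a little arithmetic (numbers below are my own illustration, not from the post): if a scheming model defects with a small per-episode probability, it will very likely look clean across a limited pre-deployment honeypot phase while still almost certainly getting a chance to defect across a much longer deployment.

```python
# Illustrative sketch of the "defect rarely" strategy against naive
# pre-deployment honeypots. All numbers are hypothetical.

def pass_and_defect_probs(p: float, n_honeypot: int, n_deploy: int):
    """Return (P(model looks clean on every honeypot episode),
    P(model defects at least once during deployment)),
    assuming independent episodes and per-episode defection probability p."""
    p_pass = (1 - p) ** n_honeypot
    p_defect = 1 - (1 - p) ** n_deploy
    return p_pass, p_defect

# E.g. p = 1e-4 with 1,000 honeypot episodes vs 1,000,000 deployment episodes:
p_pass, p_defect = pass_and_defect_probs(p=1e-4, n_honeypot=1_000, n_deploy=1_000_000)
print(f"P(pass all honeypots)     ~ {p_pass:.2f}")   # ~0.90
print(f"P(defect in deployment)   ~ {p_defect:.2f}")  # ~1.00
```

With these (made-up) numbers the schemer passes the honeypot phase ~90% of the time and still defects in deployment with near certainty, which is why catching rare defection seems to require much more sophisticated or expensive honeypotting setups.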

Anthropic: The case for targeted regulation.

I like the "Urgency" section.

Anthropic says the US government should "require companies to have and publish RSP-like policies" and "incentivize companies to develop RSPs that are effective" and do related things. I worry that RSPs will be ineffective for most companies, even given some transparency/accountability and a vague minimum standard.

Edit: some strong versions of "incentivize companies to develop RSPs that are effective" would be great; I wish Anthropic advocated for them specifically rather than hoping the US government figures out a great answer.

After Miles Brundage left to do non-profit work, OpenAI disbanded the “AGI Readiness” team he had been leading, after previously disbanding the Superalignment team and reassigning the head of the preparedness team. I do worry both about what this implies, and that Miles Brundage may have made a mistake leaving given his position.

Do we know that this is the causal story? I think "OpenAI decided to disempower/disband/ignore Readiness, and so Miles left" is the likely story.

Some not-super-ambitious asks for labs (in progress):

  • Do evals on dangerous-capabilities-y and agency-y tasks; look at the scores before releasing or internally deploying the model
  • Have a high-level capability threshold at which securing model weights is very important
  • Do safety research at all
  • Have a safety plan at all; talk publicly about safety strategy at all
    • Have a post like Core Views on AI Safety
    • Have a post like The Checklist
    • On the website, clearly describe a worst-case plausible outcome from AI and state credence in such outcomes (perhaps unconditionally, conditional on no major government action, and/or conditional on leading labs being unable or unwilling to pay a large alignment tax).
  • Whistleblower protections?
    • [Not sure what the ask is. Right to Warn is a starting point. In particular, an anonymous internal reporting pipeline for failures to comply with safety policies is clearly good (but likely inadequate).]
  • Publicly explain the processes and governance structures that determine deployment decisions
    • (And ideally make those processes and structures good)

demonstrating its benignness is not the issue, it's actually making them benign

This also stood out to me.

"Demonstrate benignness" is roughly equivalent to: be transparent enough that the other actor can determine whether you're benign + actually be benign.

For now, model evals for dangerous capabilities (along with assurance that you're evaluating the lab's most powerful model + there are no secret frontier AI projects) could suffice to demonstrate benignness. After they no longer work, even just getting good evidence for yourself that your AI project [model + risk assessment policy + safety techniques + deployment plans + weights/code security + etc.] is benign is an open problem, never mind the problem of proving that to other actors (never mind doing so securely and at reasonable cost).

Two specific places I think are misleading:

Unfortunately, it is inherently difficult to credibly signal the benignness of AI development and deployment (at a system level or an organization level) due to AI’s status as a general purpose technology, and because the information required to demonstrate benignness may compromise security, privacy, or intellectual property. This makes the research, development, and piloting of new ways to credibly signal the benignness of AI development and deployment, without causing other major problems in the process, an urgent priority.

. . .

We need to define what information is needed in order to convey benignness without unduly compromising intellectual property, privacy, or security.

The bigger problem is (building powerful AI such that your project is benign and) getting strong evidence for yourself that your project is benign. But once you're there, demonstrating benignness to other actors is another problem, lesser but potentially important and under-considered.

Or: this post mostly sets aside the alignment/control problem, and it should have signposted more clearly that it was doing so, but it's still a good post on another problem.

Weaker:

  • AI R&D threshold: yes, the threshold is much higher
  • CBRN threshold: not really, except maybe the new threshold excludes risk from moderately sophisticated nonstate actors
  • ASL-3 deployment standard: yes; the changes don't feel huge but the new standard doesn't feel very meaningful
  • ASL-3 security standard: no (but both old and new are quite vague)

Vaguer: yes. But the old RSP didn't really have "precisely defined capability thresholds and mitigation measures." (The ARA threshold did have that 50% definition, but another part of the RSP suggested those tasks were merely illustrative.)

(New RSP, old RSP.)
