Zach Stein-Perlman

AI strategy & governance. ailabwatch.org. ailabwatch.substack.com

Sequences

Slowing AI

Wiki Contributions

Load More

Comments

Sorted by

This early control post introduced super important ideas: trusted monitoring plus the general point

if you think about approaches to safely using AIs that are robust to deceptive alignment and which can be used right now, a core dynamic is trying to use your dumb trusted models together with your limited access to trusted humans to make it hard for your untrusted smart models to cause problems.

Briefly:

  1. For OpenAI, I claim the cyber, CBRN, and persuasion Critical thresholds are very high (and also the cyber High threshold). I agree the autonomy Critical threshold doesn't feel so high.
  2. For Anthropic, most of the action is at ASL-4+, and they haven't even defined the ASL-4 standard yet. (So you can think of the current ASL-4 thresholds as infinitely high. I don't think "The thresholds are very high" for OpenAI was meant to imply a comparison to Anthropic; it's hard to compare since ASL-4 doesn't exist. Sorry for confusion.)

Edit 2: after checking, I now believe the data strongly suggest FTX had a large negative effect on EA community metrics. (I still agree with Buck: "I don't like the fact that this essay is a mix of an insightful generic argument and a contentious specific empirical claim that I don't think you support strongly; it feels like the rhetorical strength of the former lends credence to the latter in a way that isn't very truth-tracking." And I disagree with habryka's claims that the effect of FTX is obvious.)


practically all metrics of the EA community's health and growth have sharply declined, and the extremely large and negative reputational effects have become clear.

I want more evidence on your claim that FTX had a major effect on EA reputation. Or: why do you believe it?


Edit: relevant thing habryka said that I didn't quote above:

For the EA surveys, these indicators looked very bleak: 

"Results demonstrated that FTX had decreased satisfaction by 0.5-1 points on a 10-point scale within the EA community"

"Among those aware of EA, attitudes remain positive and actually maybe increased post-FTX —though they were lower (d = -1.5, with large uncertainty) among those who were additionally aware of FTX."

"Most respondents reported continuing to trust EA organizations, though over 30% said they had substantially lost trust in EA public figures or leadership."

Another possible best practice for evals: use human+AI rather than [edit: just] AI alone. Many threat models involve human+AI and sometimes human+AI is substantially stronger than human alone and AI alone.

I want to distinguish two versions of the intervention (in multi-step settings): (1) hide CoT on the suspicious step vs (2) hide all prior CoT (and hide all CoT partway through 5% or something of non-suspicious trajectories so the model can't infer from CoT-hidden that the trajectory was flagged, or do a control eval with results that demonstrate this is unnecessary).

I'm over 75% that a lab would do (1), conditional on doing a "replace suspicious outputs" thing (but under 75% they'd say they're doing so, much less demonstrate it to a skeptical audience; they might just not publish details, especially if this protocol is just in the internal deployment context). I'm under 75% that a lab would do (2).

It would be crazy to show the untrusted model the old CoT when resampling.

What's a bad thing that the agent can do with the jailbreak that it can't just do in the unmonitored scaffold?

Edit: dumb question, oops: obviously the schemer will have affordances that it doesn't have within the unmonitored scaffold / there are actions where it's helpful to jailbreak the monitor. E.g. submitting a pull request, where the code will be read by the monitor. So finding a jailbreak is helpful to the schemer.

Load More