Evan Hubinger (he/him/his) (evanjhub@gmail.com)
I am a research scientist at Anthropic where I lead the Alignment Stress-Testing team. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.
Previously: MIRI, OpenAI
See: “Why I'm joining Anthropic”
Selected work:
Our work here is not arguing that probing is a perfect solution in general; it's just a single datapoint of how it fares on the models from our Sleeper Agents paper.
The usual plan for control, as I understand it, is that you use control techniques to ensure the safety of models that are sufficiently good at doing alignment research themselves that you can then leverage your controlled human-ish-level models to help you align future superhuman models.
COI: I work at Anthropic and I ran this by Anthropic before posting, but all views are exclusively my own.
I got a question about Anthropic's partnership with Palantir using Claude for U.S. government intelligence analysis and whether I support it and think it's reasonable, so I figured I would just write a shortform here with my thoughts. First, I can say that Anthropic has been extremely forthright about this internally, and it didn't come as a surprise to me at all. Second, my personal take would be that I think it's actually good that Anthropic is doing this. If you take catastrophic risks from AI seriously, the U.S. government is an extremely important actor to engage with, and trying to just block the U.S. government out of using AI is not a viable strategy. I do think there are some lines that you'd want to think about very carefully before considering crossing, but using Claude for intelligence analysis seems definitely fine to me. Ezra Klein has a great article on "The Problem With Everything-Bagel Liberalism" and I sometimes worry about Everything-Bagel AI Safety where e.g. it's not enough to just focus on catastrophic risks, you also have to prevent any way that the government could possibly misuse your models. I think it's important to keep your eye on the ball and not become too susceptible to an Everything-Bagel failure mode.
I wrote a post with some of my thoughts on why you should care about the sabotage threat model we talk about here.
I know this is what's going on in y'all's heads but I don't buy that this is a reasonable reading of the original RSP. The original RSP says that 50% on ARA makes it an ASL-3 model. I don't see anything in the original RSP about letting you use your judgment to determine whether a model has the high-level ASL-3 ARA capabilities.
I don't think you really understood what I said. I'm saying that the terminology we have (at least sometimes) used to describe ASL-3 thresholds (as translated into eval scores) is to call the threshold a "yellow line." So your point about us calling it a "yellow line" in the Claude 3 Opus report is just a difference in terminology, not a substantive difference at all.
There is a separate question around the definition of ASL-3 ARA in the old RSP, which we talk about here (though that has nothing to do with the "yellow line" terminology):
In our most recent evaluations, we updated our autonomy evaluation from the specified placeholder tasks, even though an ambiguity in the previous policy could be interpreted as also requiring a policy update. We believe the updated evaluations provided a stronger assessment of the specified “tasks taking an expert 2-8 hours” benchmark. The updated policy resolves the ambiguity, and in the future we intend to proactively clarify policy ambiguities.
Anthropic will "routinely" do a preliminary assessment: check whether it's been 6 months (or >4x effective compute) since the last comprehensive assessment, and if so, do a comprehensive assessment. "Routinely" is problematic. It would be better to commit to do a comprehensive assessment at least every 6 months.
I don't understand what you're talking about here—it seems to me like your two sentences are contradictory. You note that the RSP says we will do a comprehensive assessment at least every 6 months—and then you say it would be better to do a comprehensive assessment at least every 6 months.
the RSP set forth an ASL-3 threshold and the Claude 3 Opus evals report incorrectly asserted that that threshold was merely a yellow line.
This is just a difference in terminology—we often use the term "yellow line" internally to refer to the score on an eval past which we would no longer be able to rule out the "red line" capabilities threshold in the RSP. The idea is that the yellow line threshold at which you should trigger the next ASL should be the point where you can no longer rule out dangerous capabilities, which should be lower than the actual red line threshold at which the dangerous capabilities would definitely be present. I agree that this terminology is a bit confusing, though, and I think we're trying to move away from it.
Some possible counterpoints:
I'm interested in figuring out what a realistic training regime would look like that leverages this. Some thoughts:
Yeah, I think that's a pretty fair criticism, but afaict that is the main thing that OpenPhil is still funding in AI safety? E.g. all the RFPs that they've been doing, I think they funded Jacob Steinhardt, etc. Though I don't know much here; I could be wrong.
I agree that we should ideally aim for it being the case that the model has no interest in rebelling—that would be the "Alignment case" that I talk about as one of the possible affirmative safety cases.