I (Evan Hubinger) am a safety researcher at Anthropic. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions. Previously: MIRI, OpenAI.

See: “Why I'm joining Anthropic

Pronouns: he/him/his

Email: evanjhub@gmail.com

Selected work:


Conditioning Predictive Models
ML Alignment Theory Scholars Program Winter 2021
Risks from Learned Optimization

Wiki Contributions


Noting that I don't think pursuing truth in general should be the main goal: some truths matter way, way more to me than other truths, and I think that prioritization often gets lost when people focus on "truth" as the end goal rather than e.g. "make the world better" or "AI goes well." I'd be happy with something like "figuring out what's true specifically about AI safety and related topics" as a totally fine instrumental goal to enshrine, but "figure out what's true in general about anything" seems likely to me to be wasteful, distracting, and in some cases counterproductive.

I expect the alignment problem for future AGIs to be substantially easier, because the inductive biases that they want should be much easier to achieve than the inductive biases that we want. That is, in general, I expect the distance between the distribution of human minds and the distribution of minds for any given ML training process to be much greater than the distance between the distributions for any two ML training processes. Of course, we don't necessarily have to get (or want) a human-like mind, but I think the equivalent statement should also be true if you look at distributions over goals as well.

Another thought here:

  • If we're in a slow enough takeoff world, maybe it's fine to just have the understanding standard here be post-hoc, where labs are required to be able to explain why a failure occurred after it has already occurred. Obviously, at some point I expect us to have to deal with situations where some failures could be unrecoverable, but the hope here would be that if you can demonstrate a level of understanding that has been sufficient to explain exactly why all previous failures occurred, that's a pretty high bar, and it could plausibly be a high enough bar to prevent future catastrophic failures.

Yep, seems too expensive to do literally as stated, but right now I'm just searching for anything concrete that would fit the bill, regardless of how practical it would be to actually run. If we decided that this was what we needed, I bet we could find a good approximation, though I don't have one right now.

And I'm not exactly sure what part of the solution this would fill—it's not clear to me whether this alone would be either sufficient or necessary. But it does feel like it gives you real evidence about the degree of understanding that you have, so it feels like it could be a part of a solution somewhere.

I just don't know. This seems like a very off-distribution move from Eliezer—which I suspect is in large part the point: when your model predicts doom by default, you go off-distribution in search of higher-variance regions of outcome space. So I suppose from his viewpoint, this action does make some sense; I am (however) vaguely annoyed on behalf of other alignment teams, whose jobs I at least mildly predict will get harder as a result of this.

Personally, I think Eliezer's article is actually just great for trying to get real policy change to happen here. It's not clear to me why Eliezer saying this would make anything harder for other policy proposals. (Not that I agree with everything he said, I just think it was good that he said it.)

I am much more conflicted about the FLI letter; it's particular policy proscription seems not great to me and I worry it makes us look pretty bad if we try approximately the same thing again with a better policy proscription after this one fails, which is approximately what I expect we'll need to do.

(Though to be fair this is as someone who's also very much on the pessimistic side and so tends to like variance.)

That's nice, but I don't currently believe there are any audits or protocols that can prove future AIs safe "beyond a reasonable doubt".

I think you can do this with a capabilities test (e.g. ARC's), just not with an alignment test (yet).


Thanks to Chris Olah for a helpful conversation here.

Some more thoughts on this:

  • One thing that seems pretty important here is to have your evaluation based around worst-case rather than average-case guarantees, and not tied to any particular narrow distribution. If your mechanism for judging understanding is based on an average-case guarantee over a narrow distribution, then you're sort of still in the same boat as you started with behavioral evaluations, since it's not clear why understanding that passes such an evaluation would actually help you deal with worst-case failures in the real world. This is highly related to my discussion of best-case vs. worst-case transparency here.
  • Another thing worth pointing out here regarding using causal scrubbing for something like this is that causal scrubbing requires some base distribution that you're evaluating over, which means it could fall into a similar sort of trap to that in the first bullet point here. Presumably, if you wanted to build a causal-scrubbing-based safety evaluation, you'd just use the entire training distribution as the distribution you were evaluating over, which seems like it would help a lot with this problem, but it's still not completely clear that it would solve it, especially if you were just evaluating your average-case causal scrubbing loss over that distribution.

Seems like this post is missing the obvious argument on the other side here, which is Goodhart's Law: if you clearly quantify performance, you'll get more of what you clearly quantified, but potentially much less of the things you actually cared about. My Chesterton's-fence-style sense here is that many clearly quantified metrics, unless you're pretty confident that they're measuring what you actually care about, will often just be worse than using social status, since status is at least flexible enough to resist Goodharting in some cases. Also worth pointing out that once you have a system with clearly quantified performance metrics, that system will also be sticky for the same reasons that the people at the top will have an incentive to keep it that way.

This looks basically right, except:

These understanding-evals would focus on how well we can predict models’ behavior

I definitely don't think this—I explicitly talk about my problems with prediction-based evaluations in the post.

Nitpick on the history of the example in your comment; I am fairly confident that I originally proposed it to both you and Ethan c.f. bottom of your NYU experiments Google doc.


Load More