Evan Hubinger (he/him/his) (evanjhub@gmail.com)

I am a research scientist at Anthropic leading work on model organisms of misalignment. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I'm joining Anthropic”

Selected work:


Conditioning Predictive Models
ML Alignment Theory Scholars Program Winter 2021
Risks from Learned Optimization



I upvoted one and downvoted the other just because I wanted to avoid duplicates and liked the framing of the one better than the other.


You do seem to be incorporating a "(strong) pressure to do well in training" in your reasoning about what gets trained.

I mean, certainly there is a strong pressure to do well in training—that's the whole point of training. What there isn't strong pressure for is for the model to internally be trying to figure out how to do well in training. The model need not be thinking about training at all to do well on the training objective, e.g. as in the aligned model.

To be clear, here are some things that I think:

  • The model needs to figure out how to somehow output a distribution that does well in training. Exactly how well relative to the inductive biases is unclear, but generally I think the easiest way to think about this is to take performance at the level you expect of powerful future models as a constraint.
  • There are many algorithms which result in outputting a distribution that does well in training. Some of those algorithms directly reason about the training process, whereas some do not.
  • Taking training performance as a constraint, the question is what is the easiest way (from an inductive bias perspective) to produce such a distribution.
  • Doing that is quite hard for the distributions that we care about and requires a ton of cognition and reasoning in any situation where you don't just get complete memorization (which is highly unlikely under the inductive biases).
  • Both the deceptive and sycophantic models involve directly reasoning about the training process internally to figure out how to do well on it. The aligned model likely also requires some reasoning about the training process, but only indirectly due to understanding the world being important and the training process being a part of the world.
  • Comparing the deceptive to sycophantic models, the primary question is which one is an easier way (from an inductive bias perspective) to compute how to do well on the training process: directly memorizing pointers to that information in the world model, or deducing that information using the world model based on some goal.
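The "training performance as a constraint" framing in the list above can be sketched as a toy selection problem. This is only an illustration under stated assumptions: the `Candidate` class, the `simplest_meeting_loss` helper, and every number below are hypothetical, and a single scalar `complexity` is a crude stand-in for whatever the real inductive biases favor.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """A hypothetical algorithm the training process could land on."""
    name: str
    loss: float        # training loss; lower is better
    complexity: float  # assumed stand-in for inductive-bias cost; lower is favored


def simplest_meeting_loss(
    candidates: Sequence[Candidate], max_loss: float
) -> Optional[Candidate]:
    """Treat training performance as a constraint, then pick the candidate
    that the (assumed) inductive biases favor most among those that meet it."""
    feasible = [c for c in candidates if c.loss <= max_loss]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.complexity)


# Made-up numbers, purely for illustration.
CANDIDATES = [
    Candidate("memorizer", loss=0.01, complexity=100.0),
    Candidate("algorithm_a", loss=0.05, complexity=10.0),
    Candidate("algorithm_b", loss=0.05, complexity=12.0),
    Candidate("too_simple", loss=0.90, complexity=1.0),
]

if __name__ == "__main__":
    winner = simplest_meeting_loss(CANDIDATES, max_loss=0.1)
    print(winner.name if winner else "no candidate meets the constraint")
```

In this toy setup, two candidates with identical training loss are separated only by the complexity term, which is the shape of the comparison the last bullet describes.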

I have never heard anyone talk about this frame

I think probably that's just because you haven't talked to me much about this. The point about whether to use a loss minimization + inductive bias constraint vs. loss constraint + inductive bias minimization was a big one that I raised repeatedly in my comments on Joe's report. In fact, I suspect he'd have some more thoughts on this; I think he's not fully sold on my framing above.

So this feels like a motte-and-bailey

I agree that there are some people that might defend different claims than I would, but I don't think I should be responsible for those claims. Part of why I'm excited about Joe's report is that it takes a bunch of different isolated thinking from different people and puts it into a single coherent position, so it's easier to evaluate that position in totality. If you have disagreements with my position, with Joe's position, or with anyone else's position, that's obviously totally fine—but you shouldn't equate them into one group and say it's a motte-and-bailey. Different people just think different things.

Notably, of the people involved in this, Greg Brockman did not sign the CAIS statement, and I believe that was a purposeful choice.


It seems to me like you're positing some "need to do well in training", which is... a kinda weird frame. In a weak correlational sense, it's true that loss tends to decrease over training-time and research-time.

No, I don't think I'm positing that—in fact, I said that the aligned model doesn't do this.

I feel like this unsupported assumption entered the groundwater somehow and now looms behind lots of alignment reasoning. I don't know where it comes from. On the off-chance it's actually well-founded, I'd deeply appreciate an explanation or link.

I do think this is a fine way to reason about things. Here's how I would justify this: We know that SGD is selecting for models based on some combination of loss and inductive biases, but we don't know the exact tradeoff. We could just try to directly theorize about the multivariate optimization problem, but that's quite difficult. Instead, we can take either variable as a constraint, and theorize about the univariate optimization problem subject to that constraint. We now have two dual optimization problems, "minimize loss subject to some level of inductive biases" and "maximize inductive biases subject to some level of loss" which we can independently investigate to produce evidence about the original joint optimization problem.
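One way to write down the two dual problems described above, as a sketch. The symbols here are notation introduced for illustration and do not appear in the original comment: $\theta$ for model parameters, $L(\theta)$ for training loss, $C(\theta)$ for a cost that is lower for models the inductive biases favor (so "maximizing inductive biases" corresponds to minimizing $C$), and $c_0$, $\ell_0$ for the constraint levels.

```latex
% Assumed notation (not from the original comment):
%   \theta    : model parameters
%   L(\theta) : training loss
%   C(\theta) : inductive-bias cost (lower = more favored by the inductive biases)
%   c_0, \ell_0 : fixed constraint levels
\begin{align*}
\text{(1) minimize loss subject to a level of inductive biases:} \quad
  & \min_{\theta} \; L(\theta) \quad \text{s.t.} \quad C(\theta) \le c_0 \\
\text{(2) maximize inductive biases subject to a level of loss:} \quad
  & \min_{\theta} \; C(\theta) \quad \text{s.t.} \quad L(\theta) \le \ell_0
\end{align*}
```

Under this notation, varying $c_0$ in (1) or $\ell_0$ in (2) corresponds to considering different possible tradeoffs between the two terms, which is roughly why each constrained problem can serve as evidence about the joint problem whose exact tradeoff is unknown.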


If anything, I've taken my part of the discussion from Twitter to LW.

Good point. I think I'm misdirecting my annoyance here; I really dislike that there's so much alignment discussion moving from LW to Twitter, but I shouldn't have implied that you were responsible for that—and in fact I appreciate that you took the time to move this discussion back here. Sorry about that—I edited my comment.

And my response is that I think the model pays a complexity penalty for runtime computations (since they translate into constraints on parameter values which are needed to implement those computations). Even if those computations are motivated by something we call a "goal", they still need to be implemented in the circuitry of the model, and thus also constrain its parameters.

Yes, I think we agree there. But it doesn't follow that, just because deceptive alignment is a way of calculating what the training process wants you to do, you can then just memorize the result of that computation in the weights and thereby simplify the model, for the same reason that SGD doesn't memorize the entire distribution in the weights either.


I really don't like all this discussion happening on Twitter, and I appreciate that you took the time to move this back to LW/AF instead. I think Twitter is really a much worse forum for talking about complex issues like this than LW/AF.

Regardless, some quick thoughts:

[have some internal goal x] [backchain from wanting x to the stuff needed to get x (doing well at training)] [figure out how to do well at training] [actually do well at training]

and in comparison, the "honest" / direct solution looks like:

[figure out how to do well at training] [actually do well at training]

I think this is a mischaracterization of the argument. The argument for deceptive alignment is that deceptive alignment might be the easiest way for the model to figure out how to do well in training. So a more accurate comparison would be:

Deceptive model: [figure out how to do well at training] [actually do well at training]

Sycophantic model: [figure out how to do well at training] [actually do well at training]

Aligned model: [figure out how to be aligned] [actually be aligned]

Notably, the deceptive and sycophantic models are the same! But the difference is that they look different when we break apart the "figure out how to do well at training" part. We could do the same breakdown for the sycophantic model, which might look something like:

Sycophantic model: [load in some hard-coded specification of what it means to do well in training] [figure out how to execute on that specification in this environment] [actually do well at training]

The problem is that figuring out how to do well at training is actually quite hard, and deceptive alignment might make that problem easier by reducing it to the (potentially) simpler/easier problem of figuring out how to accomplish <insert any long-term goal here>. Whereas the sycophantic model just has to memorize a bunch of stuff about training that the deceptive model doesn't have to.

The point is that you can't just say "well, deceptive alignment results in the model trying to do well in training, so why not just learn a model that starts by trying to do well in training" for the same reason that you can't just say "well, deceptive alignment results in the model outputting this specific distribution, so why not just learn a model that memorizes that exact distribution". The entire question is about what the easiest way is to produce that distribution in terms of the inductive biases.

Also, another point that I'd note here: the sycophantic model isn't actually desirable either! So long as the deceptive model beats the aligned model in terms of the inductive biases, it's still a concern, regardless of whether it beats the sycophantic model or not. I'm pretty unsure which is more likely between the deceptive and sycophantic models, but I think both pretty likely beat the aligned model in most cases that we care about. But I'm more optimistic that we can find ways to address sycophantic models than deceptive models, such that I think the deceptive models are more of a concern.

When I was an SRE at Google, we had a motto that I really like, which is: "hope is not a strategy." It would be nice if all the lab heads would be perfectly honest here, but just hoping for that to happen is not an actual strategy.

Furthermore, I would say that I see the main goal of outside-game advocacy work as setting up external incentives in a way that pushes labs to do good things rather than bad things. Whether through explicit regulation or implicit pressure, I think controlling the incentives is absolutely critical and the main lever that you have externally for controlling the actions of large companies.

In the interest of saying more things publicly on this, some relevant thoughts:

  • I don't know what it means for somebody to be a "doomer," but I used to be a MIRI researcher and I am pretty publicly one of the people with the highest probabilities of unconditional AI existential risk (I usually say ~80%).
  • When Conjecture was looking for funding initially to get started, I was one of the people who was most vocal in supporting them. In particular, before FTX collapsed, I led FTX regranting towards Conjecture, and recommended an allocation in the single-digit millions.
  • I no longer feel excited about Conjecture. I view a lot of the stuff they're doing as net-negative and I wouldn't recommend anyone work there anymore. Almost all of the people at Conjecture initially that I really liked have left (e.g. they fired janus, their interpretability people left for Apollo, etc.), the concrete CoEms stuff they're doing now doesn't seem exciting to me, and I think their comms efforts have actively hurt our ability to enact good AI policy. In particular, I think their usage of Dario's statements on x-risk as a rhetorical weapon against RSPs creates a structural disincentive against lab heads being clear about existential risk and reduces the probability of us getting good RSPs from other labs and good RSP-based regulation.
  • Unlike Conjecture's comms efforts, I've been really happy with MIRI's comms efforts. I thought Eliezer's Time article was great, I really liked Nate's analysis of the different labs' policies, etc.

Looks good—the only thing I would change is that I think this should probably resolve in the negative only once Anthropic has reached ASL-4, since only then will it be clear whether at any point there was a security-related pause during ASL-3.

I think that there's a reasonable chance that the current security commitments will lead Anthropic to pause scaling (though I don't know whether Anthropic would announce publicly if they paused internally). Maybe a Manifold market on this would be a good idea.
