Evan Hubinger (he/him/his) (evanjhub@gmail.com)

I am a research scientist at Anthropic leading work on model organisms of misalignment. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I'm joining Anthropic”

Selected work:


Conditioning Predictive Models
ML Alignment Theory Scholars Program Winter 2021
Risks from Learned Optimization



It seems to me that often people rehearse fancy and cool-sounding reasons for believing roughly the same things they always believed, and comment threads don't often change important beliefs. Feels more like people defensively explaining why they aren't idiots, or why they don't have to change their mind. I mean, if so—I get it, sometimes I feel that way too. But it sucks and I think it happens a lot.

My sense is that this is an inevitable consequence of low-bandwidth communication. I have no idea whether you're referring to me or not, and I am really not saying you are doing so, but I think an interesting example (whether you're referring to it or not) is some of the recent threads where we've been discussing deceptive alignment. My sense is that neither of us has been very persuaded by those conversations, and I claim that's not very surprising, in a way that's epistemically defensible for both of us. I've spent literal years working through the topic myself in great detail, so it would be very surprising if my view were easily swayed by a short comment chain. Similarly, I expect the same is true of you: you've spent much more time thinking about this, and have much more detailed thoughts, than is easy to represent in a simple comment chain.

My long-standing position has been and continues to be that the only good medium of communication for this sort of stuff is direct, non-public, in-person communication. That being said, obviously that's not always workable, and I do think that LessWrong is one of the least bad of all the bad options. Certainly I think it's preferable to any of the other social media platforms on offer—you mention the broader AI community as not liking LessWrong, but I think they mostly use Twitter for this instead, which seems substantially worse on all of the axes that you criticize. My impression of the quality of AI discourse on Twitter on all sides of the AI safety debate has been very negative, with it mostly just rewarding cheap dunks and increasing polarization—e.g. I felt like I saw this a lot during the OpenAI fiasco. At least on LessWrong I think it is still sometimes possible for nuance to be rewarded rather than punished.

That's what I thought he was saying previously, but he objected to that characterization in his most recent comment.


I am now very confused about what you believe. Obviously training selects for low-loss algorithms... that's the whole point of training? I thought you were saying that training doesn't select for algorithms that internally optimize for loss, which is true, but it definitely does select for algorithms that in fact get low loss.

Why is this called a "postmortem" when it seems like it was very successful?


I mean "training signal" quite broadly there to include anything that might affect the model's ability to preserve its goals during training—probably I should have just used a different phrase, though I'm not exactly sure what the best phrase would be. To be clear, I think a deceptive model would likely be attempting to fool both the direct training signals like loss and the indirect training signals like developer perceptions.


As an aside, I think this is more about data instead of "how easy is it to implement."

This seems confused to me—I'm not sure that there's a meaningful sense in which you can say one of data vs. inductive biases matters "more." They are both absolutely essential, and you can't talk about what algorithm will be learned by a machine learning system unless you are engaging both with the nature of the data and the nature of the inductive biases, since if you only fix one and not the other you can learn essentially any algorithm.

Furthermore, a vision system modeled after primate vision also generalized based on texture, which is further evidence against ANN-specific architectural biases (like conv layers) explaining the discrepancy.

To be clear, I'm not saying that the inductive biases that matter here are necessarily unique to ANNs. In fact, they can't be: by Occam's razor, simplicity bias is what gets you good generalization, and since both human neural networks and artificial neural networks can often achieve good generalization, they both have to be using a bunch of shared simplicity bias.

The problem is that pure simplicity bias doesn't actually get you alignment. So even if humans and AIs share 99% of inductive biases, what they're sharing is just the obvious simplicity bias stuff that any system capable of generalizing from real-world data has to share.

I upvoted one and downvoted the other just because I wanted to avoid duplicates and liked the framing of the one better than the other.

You do seem to be incorporating a "(strong) pressure to do well in training" in your reasoning about what gets trained.

I mean, certainly there is a strong pressure to do well in training—that's the whole point of training. What there isn't strong pressure for is for the model to internally be trying to figure out how to do well in training. The model need not be thinking about training at all to do well on the training objective, e.g. as in the aligned model.

To be clear, here are some things that I think:

  • The model needs to figure out how to somehow output a distribution that does well in training. Exactly how well relative to the inductive biases is unclear, but generally I think the easiest way to think about this is to take performance at the level you expect of powerful future models as a constraint.
  • There are many algorithms which result in outputting a distribution that does well in training. Some of those algorithms directly reason about the training process, whereas some do not.
  • Taking training performance as a constraint, the question is what is the easiest way (from an inductive bias perspective) to produce such a distribution.
  • Doing that is quite hard for the distributions that we care about and requires a ton of cognition and reasoning in any situation where you don't just get complete memorization (which is highly unlikely under the inductive biases).
  • Both the deceptive and sycophantic models involve directly reasoning about the training process internally to figure out how to do well on it. The aligned model likely also requires some reasoning about the training process, but only indirectly due to understanding the world being important and the training process being a part of the world.
  • Comparing the deceptive to sycophantic models, the primary question is which one is an easier way (from an inductive bias perspective) to compute how to do well on the training process: directly memorizing pointers to that information in the world model, or deducing that information using the world model based on some goal.

I have never heard anyone talk about this frame

I think probably that's just because you haven't talked to me much about this. The point about whether to use a loss minimization + inductive bias constraint vs. loss constraint + inductive bias minimization was a big one that I commented a bunch about on Joe's report. In fact, I suspect he'd probably have some more thoughts here on this—I think he's not fully sold on my framing above.

So this feels like a motte-and-bailey

I agree that there are some people who might defend different claims than I would, but I don't think I should be responsible for those claims. Part of why I'm excited about Joe's report is that it takes a bunch of different isolated thinking from different people and puts it into a single coherent position, so it's easier to evaluate that position in totality. If you have disagreements with my position, with Joe's position, or with anyone else's position, that's obviously totally fine, but you shouldn't lump them into one group and call it a motte-and-bailey. Different people just think different things.

Notably, of the people involved in this, Greg Brockman did not sign the CAIS statement, and I believe that was a purposeful choice.


It seems to me like you're positing some "need to do well in training", which is... a kinda weird frame. In a weak correlational sense, it's true that loss tends to decrease over training-time and research-time.

No, I don't think I'm positing that—in fact, I said that the aligned model doesn't do this.

I feel like this unsupported assumption entered the groundwater somehow and now looms behind lots of alignment reasoning. I don't know where it comes from. On the off-chance it's actually well-founded, I'd deeply appreciate an explanation or link.

I do think this is a fine way to reason about things. Here's how I would justify this: We know that SGD is selecting for models based on some combination of loss and inductive biases, but we don't know the exact tradeoff. We could just try to directly theorize about the multivariate optimization problem, but that's quite difficult. Instead, we can take either variable as a constraint, and theorize about the univariate optimization problem subject to that constraint. We now have two dual optimization problems, "minimize loss subject to some level of inductive biases" and "maximize inductive biases subject to some level of loss" which we can independently investigate to produce evidence about the original joint optimization problem.
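One rough way to write down the three problems described above, as a sketch only. The symbols are assumptions introduced for illustration and are not from the original comment: L(θ) is the training loss of a model θ, C(θ) is a scalar "complexity" score that is lower for models the inductive biases favor (so "maximize inductive biases" becomes "minimize C"), λ is the unknown tradeoff, and c and ℓ are the fixed constraint levels. Treating inductive biases as a single scalar is itself a simplifying assumption.

```latex
\begin{align*}
\text{joint (tradeoff unknown):}\quad
  & \min_{\theta}\; L(\theta) + \lambda\, C(\theta),
  \qquad \lambda > 0 \\[4pt]
\text{loss minimization, bias constraint:}\quad
  & \min_{\theta}\; L(\theta)
  \quad \text{subject to } C(\theta) \le c \\[4pt]
\text{bias maximization, loss constraint:}\quad
  & \min_{\theta}\; C(\theta)
  \quad \text{subject to } L(\theta) \le \ell
\end{align*}
```

Under this sketch, the framing of taking training performance as a constraint corresponds to the third line: fix ℓ at the performance level expected of the model, then ask which θ has the lowest C.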
