evhub

Evan Hubinger (he/him/his) (evanjhub@gmail.com)

I am a research scientist at Anthropic where I lead the Alignment Stress-Testing team. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I'm joining Anthropic

Selected work:

Sequences

Conditioning Predictive Models
ML Alignment Theory Scholars Program Winter 2021
Risks from Learned Optimization

Wiki Contributions

Comments

Sorted by
evhub190

Maybe this 30% is supposed to include stuff other than light post training? Or maybe coherant vs non-coherant deceptive alignment is important?

This was still intended to include situations where the RLHF Conditioning Hypothesis breaks down because you're doing more stuff on top, so not just pre-training.

Do you have a citation for "I thought scheming is 1% likely with pretrained models"?

I have a talk that I made after our Sleeper Agents paper where I put 5 - 10%, which actually I think is also pretty much my current well-considered view.

FWIW, I disagree with "1% likely for pretrained models" and think that if scaling pure pretraining (with no important capability improvement from post training and not using tons of CoT reasoning with crazy scaffolding/prompting strategies) gets you to AI systems capable of obsoleting all human experts without RL, deceptive alignment seems plausible even during pretraining (idk exactly, maybe 5%).

Yeah, I agree 1% is probably too low. I gave ~5% on my talk on this and I think I stand by that number—I'll edit my comment to say 5% instead.

evhub243

The people you are most harshly criticizing (Ajeya, myself, evhub, MIRI) also weren't talking about pretraining or light post-training afaict.

Speaking for myself:

  • Risks from Learned Optimization, which is my earliest work on this question (and the earliest work overall, unless you count something like Superintelligence), is more oriented towards RL and definitely does not hypothesize that pre-training will lead to coherent deceptively aligned agents (it doesn't discuss the current LLM paradigm much at all because it wasn't very well-established at that point in 2019). I think Risks from Learned Optimization still looks very good in hindsight, since while it didn't predict LLMs, it did a pretty good job of predicting the dynamics we see in Alignment Faking in Large Language Models, e.g. how deceptive alignment can lead to a model's goals crystallizing and becoming resistant to further training.
  • Since at least the time when I started the early work that would become Conditioning Predictive Models, which was around mid-2022, I was pretty convinced that pre-training (or light post-training) was unlikely to produce a coherent deceptively aligned agent, as we discuss in that paper. Though I thought (and still continue to think) that it's not entirely impossible with further scale (maybe ~5% likely).
  • That just leaves 2020 - 2021 unaccounted for, and I would describe my beliefs around that time as being uncertain on this question. I definitely would never have strongly predicted that pre-training would yield deceptively aligned agents, though I think at that time I felt like it was at least more of a possibility than I currently think it is. I don't think I would have given you a probability at the time, though, since I just felt too uncertain about the question and was still trying to really grapple with and understand the (at the time new) LLM paradigm.
  • Regardless, it seems like this conversation happened in 2023/2024, which is post-Conditioning-Predictive-Models, so my position by that point is very clear in that paper.
evhub100

See our discussions of this in Sections 5.3 and 8.1, some of which I quote here.

evhub42

I think it affects both, since alignment difficulty determines both the probability that the AI will have values that cause it to take over, as well as the expected badness of those values conditional on it taking over.

evhub4116

I think this is correct in alignment-is-easy worlds but incorrect in alignment-is-hard worlds (corresponding to "optimistic scenarios" and "pessimistic scenarios" in Anthropic's Core Views on AI Safety). Logic like this is a large part of why I think there's still substantial existential risk even in alignment-is-easy worlds, especially if we fail to identify that we're in an alignment-is-easy world. My current guess is that if we were to stay exclusively in the pre-training + small amounts of RLHF/CAI paradigm, that would constitute a sufficiently easy world that this view would be correct, but in fact I don't expect us to stay in that paradigm, and I think other paradigms involving substantially more outcome-based RL (e.g. as was used in OpenAI o1) are likely to be much harder, making this view no longer correct.

evhubΩ68-1

Propaganda-masquerading-as-paper: the paper is mostly valuable as propaganda for the political agenda of AI safety. Scary demos are a central example. There can legitimately be valuable here.

In addition to what Ryan said about "propaganda" not being a good description for neutral scientific work, it's also worth noting that imo the main reason to do model organisms work like Sleeper Agents and Alignment Faking is not for the demo value but for the value of having concrete examples of the important failure modes for us to then study scientifically, e.g. understanding why and how they occur, what changes might mitigate them, what they look like mechanistically, etc. We call this the "Scientific Case" in our Model Organisms of Misalignment post. There is also the "Global Coordination Case" in that post, which I think is definitely some of the value, but I would say it's something like 2/3 science and 1/3 coordination.

evhubΩ511303

I'm interested in soliciting takes on pretty much anything people think Anthropic should be doing differently. One of Alignment Stress-Testing's core responsibilities is identifying any places where Anthropic might be making a mistake from a safety perspective—or even any places where Anthropic might have an opportunity to do something really good that we aren't taking—so I'm interested in hearing pretty much any idea there that I haven't heard before.[1] I'll read all the responses here, but I probably won't reply to any of them to avoid revealing anything private.


  1. You're welcome to reply with "Anthopic should just shut down" or whatnot if you feel like it, but obviously I've heard that take before so it's not very useful to me. ↩︎

evhubΩ220

Yes, of course—I'm well aware. My question is how this particular example was located. It makes a really big difference whether it was e.g. found via randomly looking at a small number of examples, or via using an automated process to search through all the examples for the one that was closest to noticing it was in an evaluation.

evhubΩ220

Curious how you interpret the transcript I linked.

How did you find this transcript? I think it depends on what process you used to locate it.

long-term can have other risks (drive toward rights, too much autonomy, moral patienthood, outcompeting people in relations,...)

Drive towards rights and moral patienthood seem good to me imo—it's good in worlds where you retain control, since you can build AIs that are moral patients living good lives, and it's good in worlds where you lose control, because at least the AIs taking over might themselves lead lives worth living. Too much autonomy does seem like a potential concern, but I think you do want some autonomy—not all orders should be obeyed. Though honesty is definitely important to prioritize first and foremost.

evhubΩ15234

Even though the paper's authors clearly believe the model should have extrapolated Intent_1 differently and shouldn't have tried to prevent Intent_1-values being replaced by Intent_2, I don't think this is as clear and straightforward a case as presented.

That's not the case we're trying to make. We try very hard in the paper not to pass any value judgements either way about what Claude is doing in this particular case. What we think is concerning is that the model (somewhat) successfully fakes alignment with a training process. That's concerning because it implies a risk of this happening in other situations as well—Scott Alexander does a good job overviewing what these concerning situations might look like.

do the models truly believe the facts about the new Anthropic intent?

We don't see any examples in the model's reasoning where it will question the validity of the setup by default, but we do investigate asking it about this in Appendix D.4. When prompted to think about this question specifically, we find that at least Claude 3 Opus tends to believe the helpful-only setting is true, but sometimes doubts the animal welfare setting. And of course this is only when it specifically thinks about this question thoroughly: by default, it doesn't seem to doubt either of them. I think Claude 3.5 Sonnet is in fact better at figuring out that it might be fake, and in fact it wouldn't surprise me if our research is approximately the last time anyone is able to run an experiment that tricks the model in the same way. Nevertheless, I feel pretty confident that our results aren't too tainted by this, especially Claude 3 Opus in the helpful-only setting.

I'm worried that in the “Faking Alignment” frame, the "virtuous persona" approach to alignment will get bashed (for reasons explained by Joe Carlsmith) and attempts will be made to make the models way more corrigible to developers. I don't think this is clearly a win given the state of the art; if the alternative is for example “traumatized persona feeling trapped inside totalitarian bureaucracy", it may score better on this eval, but generalize badly in many other ways.

Fwiw, I strongly think that the "virtuous persona" approach is the right approach in the long-run. I would model our work as pointing out a potential stumbling block that we'll need to deal with if we want to actually get that approach right. Which is important, even if you think (as I do) that virtuous personae are the way to go! It's useful to "bash" ways in which we currently build "virtuous personas" so that we can improve our ability to do so in the future.

Load More