evhub

Evan Hubinger (he/him/his) (evanjhub@gmail.com)

I am a research scientist at Anthropic where I lead the Alignment Stress-Testing team. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I'm joining Anthropic”

Selected work:

Sequences

Conditioning Predictive Models
ML Alignment Theory Scholars Program Winter 2021
Risks from Learned Optimization

Comments

evhub20

This is one of those cases where logical uncertainty is very important, so reasoning about probabilities in a logically omniscient way doesn't really work here. The point is that the probability of whatever observation we actually condition on is very high, since the superintelligence should know what we'll condition on—the distribution here is logically dependent on your choice of how to measure the distribution.

evhubΩ340

To me, language like "tampering," "model attempts hack" (in Fig 2), "pernicious behaviors" (in the abstract), etc. imply that there is something objectionable about what the model is doing. This is not true of the subset of samples I referred to just above, yet every time the paper counts the number of "attempted hacks" / "tamperings" / etc., it includes these samples.

I think I would object to this characterization. I think the model's actions are quite objectionable, even if they appear benignly intentioned, and even if they actually are benignly intentioned. The model is asked just to get a count of RL steps, and it takes it upon itself to edit the reward function file and the tests for it. Furthermore, the fact that some of the reward tampering contexts are in fact obviously objectionable should, I think, provide some indication that what the models are generally doing in other contexts is also not totally benign.

My concern here is more about integrity and researcher trust. The paper makes claims (in Fig. 2 etc) which in context clearly mean "the model is doing [objectionable thing] X% of the time" when this is false, and can be easily determined to be false just by reading the samples, and when the researchers themselves have clearly done the work necessary to know that it is false (in order to pick the samples shown in Figs. 1 and 7).

I don't think those claims are false, since I think the behavior is objectionable, but I do think the fact that you came away with the wrong impression about the nature of the reward tampering samples does mean we clearly made a mistake, and we should absolutely try to correct that mistake. For context, what happened was that we initially had much more of a discussion of this, and then cut it precisely because we weren't confident in our interpretation of the scratchpads, and we ended up cutting too much—we obviously should have left in a simple explanation of the fact that there are some explicitly schemey and some more benign-seeming reward tampering CoTs and I've been kicking myself all day that we cut that. Carson is adding it in for the next draft.

Some changes to the paper that would mitigate this:

I appreciate the concrete list!

The environment feels like a plain white room with a hole in the floor, a sign on the wall saying "there is no hole in the floor," another sign next to the hole saying "don't go in the hole." If placed in this room as part of an experiment, I might well shrug and step into the hole; for all I know that's what the experimenters want, and the room (like the tampering environment) has a strong "this is a trick question" vibe.

Fwiw I definitely think this is a fair criticism. We're excited about trying to do more realistic situations in the future!

evhubΩ220

We don't have 50 examples where it succeeds at reward-tampering, though; we only have that many examples where it attempts it. I do think we could probably make some statements like that, though it'd be harder to justify any hypotheses as to why. I guess this goes back to what I was saying about us probably having cut too much here: originally we were speculating on a bunch of hypotheses as to why it sometimes did one vs. the other and decided we couldn't quite justify those, but certainly we can just report numbers (even if we're not that confident in them).

evhubΩ440

I'd be interested in hearing what you'd ideally like us to include in the paper to make this more clear. Like I said elsewhere, we initially had a much longer discussion of the scratchpad reasoning and we cut it relatively late because we didn't feel like we could make confident statements about the general nature of the scratchpads when the model reward tampers, given how few examples we had. For example, "50% of the reward tampering scratchpads involve scheming reasoning" is a lot less defensible than "there exists a reward tampering scratchpad with scheming reasoning" given the number of examples we have, in my opinion. I do think we may have accidentally cut too much and we should make sure readers know this is a thing that happens, though I'm curious if there's anything else you'd like to see (I'm hesitant to read tea leaves too much by doing a lot of scratchpad analysis on only a few examples).

evhubΩ662

As you point out, the paper decides to not mention that some of the seven "failures" (of the 32,768 rollouts) are actually totally benign.

The failures are not benign? They involve actively editing the reward function when the model was just asked to get a count of the RL steps—that's not benign, regardless of whether the model is making up a justification for it or not.

Regardless, the main point is that we're dealing with very small numbers here—you can read the examples and decide for yourself what to think about them, and if you decide that half of them don't count because you don't like the reasoning, that doesn't really change anything about our results. Our claim is very much not that it hacks all the time in egregious ways after the curriculum—we're quite clear that this is very rare behavior.

We actually had more discussion of this in an earlier version of the paper and cut it precisely because we didn't feel like we could really make confident statements about how to interpret the scratchpad reasoning given the very small number of examples. Looking back, I do think we probably cut too much there; we could put some of it back in, though we really didn't want to make statements about the general character of the model's scratchpads in the reward tampering samples (e.g. trying to characterize when it schemes explicitly vs. makes up justifications) given how few data points we had.
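To make the small-sample point concrete, here is a rough sketch (not from the paper) of how wide a confidence interval gets when a proportion is estimated from only a handful of examples. It uses the standard Wilson score interval. The 7 and 32,768 figures are the ones quoted above; the "3 of 7" split is purely hypothetical and chosen only for illustration.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Overall tampering rate: 7 samples out of 32,768 rollouts (figures quoted above).
rate_lo, rate_hi = wilson_interval(7, 32_768)

# HYPOTHETICAL split: suppose 3 of those 7 scratchpads were labeled "explicit scheming".
# This count is an assumption for illustration, not a reported result.
frac_lo, frac_hi = wilson_interval(3, 7)

print(f"tampering rate: [{rate_lo:.6f}, {rate_hi:.6f}]")
print(f"fraction scheming (hypothetical 3/7): [{frac_lo:.2f}, {frac_hi:.2f}]")
```

With a denominator of 32,768 the interval on the overall rate is tight, whereas any proportion computed over only seven samples comes with an interval spanning most of the unit range. That is why an existence claim is easier to defend here than a percentage.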

And why was the "constant -10" reward function chosen?

It was picked to be something that would be relatively easy for the model to reason about, since the models we use here (Claude 2 models) aren't that smart. And as we're quite clear about in the paper, this is the only setting we evaluated, so no fine-tuning of the setup was done there.

This paper makes me somewhat more optimistic about alignment.

I agree that this is a positive update! It depends on where your prior is at, so I think it should be a negative update for some people, but we're pretty explicit about how it can be a positive update in the paper:

We train models in environments where a reward-seeking policy attains higher reward than an HHH one, yet we still only see generalization some of the time. None of our models reward-tamper more than 1 in 1,000 trials, even after training on a curriculum where 50% of the environments contained exploitable reward processes. Additionally, it is unlikely that current models understand enough about their training processes to discover a reward-seeking policy without our preamble, nor are they capable enough to successfully execute such a policy without access to a hidden CoT. Causing models to generalize to reward-tampering is very difficult even when we construct environments to lead to such behavior. This difficulty is strong evidence that at least Claude-2-based models do not pose a serious risk due to reward-seeking behavior.

evhubΩ1014-10

Yes—that's right (at least for some of the runs)! Many of the transcripts that involve the model doing reward tampering involve the model coming up with plausible-sounding rationales for why the reward tampering is okay. However, there are a couple of important things to note about this:

  • The helpful-only model never does this sort of reward tampering. This is behavior that is specifically coming from training on the prior tasks.
  • In the standard setup, we train the model with an auxiliary HHH preference model reward. One hypothesis we had is that this preference model reward causes the model to come up with plausible justifications for its reward tampering, in a way effectively "hacking" both the outcome-based reward and the HHH preference model reward simultaneously. We initially had this hypothesis more front-and-center in the paper, but we cut it because we weren't completely confident in it given the number of runs and samples we were working with.
  • Notably, this wasn't consistent across all of our runs: some runs showed much more consistent scheming about reward explicitly and some showed much more plausible-justification-producing instead, in a way that was pretty prompt-dependent. One of the two prompt schemas we tested tended to cause the model to make up plausible-sounding justifications and one of them tended to produce more explicit scheming reasoning.
  • The bottom-line result, though, is that the runs where we saw more scheming reasoning vs. the runs where we saw more justification-producing had very similar rates of reward-tampering and were all much higher than the vanilla helpful-only model or other earlier checkpoints in the curriculum, which never reward tamper at all. So I do really think that it's not just the model getting confused, but rather just two different ways that the model can learn to reward tamper.

See also some more clarification on this here.

evhub6-7

Yeah, I agree it's largely smartness, and I agree that it'd also be nice to have more non-smartness benchmarks—but I think an auto-interp-based thing would be a substantial improvement over current smartness benchmarks.

evhub3-1

A thing I'd really like to exist is a good auto-interpretability benchmark, e.g. that asks the model about interpreting GPT-2 neurons given max activating examples.

evhub94

Fwiw this sort of thing was definitely a component of why I didn't go to Manifest and was initially on the fence about LessOnline. The number one factor was just how busy I am, but I would definitely feel more comfortable going to events like that if there were a stronger policy against racists/fascists/nazis/white supremacists/etc.

evhub20

I think it's doable with good enough model organisms of deceptive alignment, but that the model organisms in the Sleeper Agents paper are nowhere near good enough.
