Sam Marks


Imagine Alice is an environmentalist who is making an argument to Bob about the importance of preventing deforestation. Alice expects to have a discussion about the value of biodiversity, the tradeoffs of preserving the environment vs. economic productivity, that sort of stuff. 

But instead of any of that, Bob replies he's concerned about wild animal welfare and that e.g. the Amazon Rainforest is a vast cesspit of animal suffering. Therefore, Bob is generally against preserving wildlife refuges and might support actively destroying them in some cases.

I think this experience is probably very disorienting to Alice. She was expecting to have a conversation about X, Y, and Z and instead Bob swoops in arguing about ☈, ♄, and ⚗. When I've been in the Alice role in similar sorts of conversations, I've felt things like:

  • Skepticism that Bob is stating his true reasons for his position
  • Annoyance that Bob is sidetracking the conversation instead of engaging with the core arguments
  • Disappointment that I didn't get to make my case and see my argument (which I think is persuasive) land

I think all of these reactions are bad and unproductive (e.g. Bob isn't sidetracking the conversation; the conversation just didn't go according to my expectations). But they're also extremely natural -- I think it takes a lot of epistemic discipline to tamp down on these reactions, reorient to the conversation you're actually having, and check whether you still stand by your old views.


I think proponents of open source, when they talk to AI safety folks, often find themselves in Alice's position. They are expecting a discussion about the merits of openness, the risks of centralization, the harms of regulatory capture, etc. "But bioweapons" Bob responds. If Alice has never thought about this point before, it'll probably feel like it came totally out of left field, and she'll have reactions similar to the ones I described above (e.g. skepticism that Bob is stating his true reasons). 

(And note that this might be hard for Bob to notice! For Bob, the "core argument" here has always been about bioweapons and other considerations around offense/defense balance for existential threats. He might be confused/annoyed that Alice wants to talk about the merits of openness.)

What should Bob do here? I'm not really sure, but one idea is: to the extent that Bob can honestly say he agrees with Alice on what Alice views as being the "core issues," he should start the conversation out by making that clear. E.g. if Bob is sympathetic to the general principles underlying Alice's view, he could say so: "open source software has generally been great for the world, and I would love for there to be a proposal for open source AI that I could get behind." Once that agreement is established, he could then move on to explaining why he thinks there are other considerations "outside of the scope of Alice's argument" which he feels are more compelling.

There are many, many actors in the open-source space, working on many, many AI models (even just fine-tunes of LLaMA/Llama2).

To clarify, I'm imagining that this protocol would be applied to the open sourcing of foundation models. Probably you could operationalize this as "any training run which consumed > X compute" for some judiciously chosen X.

I've noticed that for many people (including myself), their subjective P(doom) stays surprisingly constant over time. And I've wondered if there's something like "conservation of subjective P(doom)" -- if you become more optimistic about some part of AI going better, then you tend to become more pessimistic about some other part, such that your P(doom) stays constant. I'm like 50% confident that I myself do something like this.

(ETA: Of course, there are good reasons subjective P(doom) might remain constant, e.g. if most of your uncertainty is about the difficulty of the underlying alignment problem and you don't think we've been learning much about that.)

Thanks, I think there is a confusing point here about how narrowly we define "model organisms." The OP defines sycophantic reward hacking as

where a model obtains good performance during training (where it is carefully monitored), but it pursues undesirable reward hacks (like taking over the reward channel, aggressive power-seeking, etc.) during deployment or in domains where it operates with less careful or effective human monitoring.

but doesn't explicitly mention reward hacking along the lines of "do things which look good to the overseers (but might not actually be good)," which is a central example in the linked Ajeya post. Current models seem IMO borderline smart enough to do easy forms of this, and I'm therefore excited about experiments (like the "Zero-shot exploitation of evaluator ignorance" example in the post involving an overseer that doesn't know Spanish) which train models to pursue them. 

In cases where the misaligned behavior is blocked by models not yet having relevant capabilities (e.g. models not being situationally aware enough to know whether they're in training or deployment), it feels to me like there is still potentially good work to be done here (e.g. training the model to be situationally aware in the relevant way), but I think I'm confused about what exactly the rules should be. (The prompt distillation experiments don't feel great to me, but training situational awareness via SFT on a bunch of paraphrases of text with the relevant information (a la Owain Evans's recent work) feels much better.)

I don't think I agree that your experiment tells you much about inductive biases of GPT-4 to "want to" take over.

Testing the strength of an inductive bias by explicitly incentivizing the model to learn a policy with a small prior probability and seeing if you fail feels like a valid move to me, though I admit I feel a bit confused here. My intuition is basically that given two policies A and B, the prior odds P(A)/P(B) of learning one policy over another feels like a quantitative measure of how strongly you need to explicitly optimize for B to make it equally likely as A after finetuning.
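
One hedged way to write this intuition down is to treat finetuning on data D as if it were an approximately Bayesian update over policies. That framing is an assumption of this sketch, not something the argument above establishes.

```latex
\frac{P(B \mid D)}{P(A \mid D)}
  = \frac{P(D \mid B)}{P(D \mid A)} \cdot \frac{P(B)}{P(A)},
\qquad\text{so}\qquad
\frac{P(B \mid D)}{P(A \mid D)} = 1
  \iff
\frac{P(D \mid B)}{P(D \mid A)} = \frac{P(A)}{P(B)}.
```

Under that assumption, the extra incentive for B has to supply a likelihood ratio equal to the prior odds P(A)/P(B) before B becomes as likely as A, which is the sense in which the prior odds measure how hard you have to push.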

(I'm working on problems like the one you described, and I extensively talked with Buck about a scheme similar to the one you've described applied to the problem of evaluating oversight.)

(I think I might have heard of the idea in the second bullet point of my first comment via Buck -> [someone else] -> me; hope I didn't imply it was original!)

Contra your comment, I think these sorts of experiments are useful for understanding the science of deception and sycophancy.

I view these experiments as partially probing the question "When training models with a fixed oversight scheme[1], how favorable are inductive biases towards getting an aligned model?"

For example, consider a training set-up in which we prepare a perfectly-labeled finetuning dataset of very easy math problems. I'd guess that GPT-4 is smart enough that "answer math questions as well as possible" and "answer math problems the way a 7th grader would (including imitating mistakes a 7th grader would make on harder problems)" are both policies that it could implement. Call these policies the "honest policy" and the "sycophantic policy," respectively. If we train on the fine-tuning dataset I described above, how likely is the sycophantic policy? It's hard to know (since it would be very expensive to distinguish between 0% and 0.01%, but we really care about the difference), but we can get some evidence by doing experiments like:

  • Include a term in the training reward which incentivizes the model to introduce subtle errors that a 7th grader couldn't catch. 
    • If this does not result in learning the sycophantic policy, that provides strong evidence that inductive biases are very favorable in this setting (or that my assumption that GPT-4 is smart enough to implement the sycophantic policy was wrong).
    • Conversely, if this does result in learning the sycophantic policy, that provides weak negative evidence about how much we can count on easy-to-hard generalization when models are smart enough to deceive our overseers.
  • Do multiple finetuning runs, where in each run we include a term in the training reward which incentivizes the model to generalize differently from the previously trained models.
    • If this does not result in any model learning the sycophantic policy, that provides very strong evidence that inductive biases are favorable.
    • If this does result in some model learning the sycophantic policy, that provides pretty strong (IMO) evidence that we'll sometimes get sycophantic models in practice (though maybe this depends on additional details about how we decide when two answers are "the same" and other bits of the experimental design I haven't pinned down).

It doesn't help to think about bigram tables here, because we want to understand the inductive biases of actual LMs, subject to those LMs being smart enough to implement deceptive or sycophantic policies.

Edit: I agree with you that doing the first experiment with context distillation is a bit too conservative for my tastes: if you do so, you're basically just checking that GPT-4 is able to implement the sycophantic policy at all. Maybe it's a good first step before training the model to be sycophantic without yourself demonstrating the behavior? I'm pretty unsure.

  1. ^

    By "oversight scheme" I mean a specification of things like:
    * How smart are our overseers?
    * What does our finetuning data look like? Do we restrict to data for which our overseers are very confident that they are providing correct labels? How broad is the finetuning distribution?
    * Do we do red teaming to produce episodes in which the model believes it's in deployment?

Yeah, sorry, I should have made clear that the story that I tell in the post is not contained in the linked paper. Rather, it's a story that David Bau sometimes tells during talks, and which I wish were more widely known. As you note, the paper is about the problem of taking specific images and relighting them (not of generating any image at all of an indoor scene with unlit lamps), and the paper doesn't say anything about prompt-conditioned models. As I understand things, in the course of working on the linked project, Bau's group noticed that they couldn't get scenes with unlit lamps out of the popular prompt-conditioned generative image models.

Somewhat related to the SolidGoldMagikarp discussion, I thought some people might appreciate getting a sense of how unintuitive the geometry of token embeddings can be. Namely, it's worth noting that the tokens whose embeddings are most cosine-similar to a random vector in embedding space tend not to look very semantically similar to each other. Some examples:

v_1                 v_2             v_3
 characterized      Columb          determines
 Stra               1900            conserv
 Ire                sher            distinguishes
sent                paed            emphasizes
 Shelter            000             consists
 Pil                mx              operates
stro                female          independent
 wired              alt             operate
 Kor                GW              encompasses
 Maul               lvl             consisted

Here v_1, v_2, v_3 are random vectors in embedding space (drawn from ), and the columns give the 10 tokens whose embeddings are most cosine-similar to each v_i. I used GPT-2-large.
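
A minimal sketch of the nearest-neighbor lookup described above, for anyone who wants to try it. To stay self-contained it uses a synthetic matrix in place of real token embeddings, and it assumes the random vector is an isotropic Gaussian draw. The names `top_k_cosine_neighbors`, `embeddings`, and `v` are illustrative, not taken from the original code.

```python
import numpy as np

def top_k_cosine_neighbors(embeddings, query, k=10):
    """Return (indices, similarities) of the k rows most cosine-similar to `query`."""
    # Normalize rows and the query so that dot products equal cosine similarities.
    emb_unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    sims = emb_unit @ query_unit
    top = np.argsort(-sims)[:k]  # indices sorted by descending similarity
    return top, sims[top]

rng = np.random.default_rng(0)
# Hypothetical stand-in for a (vocab_size x d_model) token embedding matrix.
vocab_size, d_model = 5000, 256
embeddings = rng.standard_normal((vocab_size, d_model))
# Assumption: the random vector is drawn from an isotropic Gaussian.
v = rng.standard_normal(d_model)

indices, sims = top_k_cosine_neighbors(embeddings, v, k=10)
```

With a real model you would swap `embeddings` for the model's input embedding matrix and decode `indices` back to token strings with its tokenizer.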

Perhaps 20% of the time, we get something like v_3, where many of the nearest neighbors have something semantically similar among them (in this case, being present tense verbs in the 3rd person singular).

But most of the time, we get things that look like v_1 or v_2: a hodgepodge with no obvious shared semantic content. GPT-2-large seems to agree: picking " female" and " alt" randomly from the v_2 column, the cosine similarity between the embeddings of these tokens is 0.06.

[Epistemic status: I haven't thought that hard about this paragraph.] Thinking about the geometry here, I don't think any of this should be surprising. Given a random vector v, we should typically find that v is ~orthogonal to all of the ~50000 token embeddings. Moreover, asking whether the nearest neighbors to v should be semantically clustered seems to boil down to the following. Divide the tokens into semantic clusters; then compare the distribution of intra-cluster variances to the distribution of cosine similarities of the cluster means. From the perspective of cosine similarity to v, we should expect these clusters to look basically randomly drawn from the full dataset, so that each variance in the former set should be roughly the variance over the full dataset. This should be greater than the mean of the latter set, implying that we should expect the nearest neighbors to v to mostly be random tokens taken from different clusters, rather than a bunch of tokens taken from the same cluster. I could be badly wrong about all of this, though.
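
The near-orthogonality part of this can be checked numerically. The sketch below assumes an isotropic Gaussian random vector and uses synthetic embeddings with GPT-2-large's hidden size of 1280 (the vocabulary is shrunk to keep memory small). In that setting the cosine similarity to any fixed vector has standard deviation about 1/sqrt(d). It says nothing about the semantic-clustering part of the argument, which depends on the real embeddings.

```python
import numpy as np

def cosine_to_random_vector(embeddings, rng):
    """Cosine similarity of every row of `embeddings` to one Gaussian random vector."""
    v = rng.standard_normal(embeddings.shape[1])  # assumed isotropic Gaussian draw
    unit_rows = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit_rows @ (v / np.linalg.norm(v))

rng = np.random.default_rng(0)
d_model = 1280      # hidden size of GPT-2-large
vocab_size = 5000   # reduced synthetic vocabulary, chosen only to keep this cheap
embeddings = rng.standard_normal((vocab_size, d_model))  # synthetic stand-in

sims = cosine_to_random_vector(embeddings, rng)
typical = np.abs(sims).mean()        # typical magnitude of the cosine similarity
largest = sims.max()                 # similarity of the single nearest neighbor
predicted_std = 1 / np.sqrt(d_model) # theoretical spread for an isotropic vector
```

Comparing `sims.std()` against `predicted_std` shows how close the random vector is to orthogonal to everything, and `largest` shows how small even the best match is.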

There's a little bit of code for playing around with this here.

Ah, nice example! I now see your point, and I agree with everything you wrote. Whereas REINFORCE and SFT only incentivize actions which in fact were historically part of high-reward trajectories, Q-learning and actor-critic incentivize actions which comprise trajectories that one can infer would be high-reward (even if those actions never actually appeared in high-reward trajectories previously). 

I appreciate your clear articulation of the point about incentivizing the agent to navigate to high-reward states in a trajectory-independent way (in contrast to learning to produce trajectories like those which historically got high reward). That said, I'm confused about how you've labeled the methods you mention as having vs. not having this property.

To make sure we're on the same page, suppose we're in an environment with a state s* which is high reward, and suppose that there are two ways to get to state s*: via the two trajectories (s_1, a_1, s*) and (s_2, a_2, s*). Suppose further that historically the agent has only navigated to this state via the former trajectory (s_1, a_1, s*).

I agree that if the agent was trained via REINFORCE and finds itself in state s_2 that it might not know to take action a_2 (because it's only been reinforced to take action a_1 from state s_1, and not to reach state s*; and also because it might not know that a_2 would transition it to state s*).

But this also seems true if the agent were trained via Q-learning with a Q-function Q(s,a): the Q-function need not have learned that Q(s_2,a_2) is large, only that Q(s_1,a_1) is large.

In either the REINFORCE or the Q-learning case, once the agent sees a trajectory (s_2, a_2, s*), it will make an update towards taking action a_2 from state s_2, but the size of the update seems to depend on details about the network implementing the policy or Q-function -- if there's some obvious reason that the Q-learner will necessarily make a larger update, I've missed it.
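
Here is a toy tabular sketch of the comparison, under assumptions I'm adding for illustration: two non-terminal states `S1` and `S2`, two actions `A1` and `A2`, a terminal high-reward state reached by taking `A1` from `S1` or `A2` from `S2`, and training data containing only the first route. A tabular policy or Q-table cannot generalize across states at all, so this only shows the baseline case. With function approximation, what happens at the unvisited state depends on the network.

```python
import numpy as np

S1, S2 = 0, 1   # hypothetical non-terminal states
A1, A2 = 0, 1   # hypothetical actions, both available in each state
LR = 0.5

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce_update(logits, state, action, ret):
    """One tabular REINFORCE step: logits += lr * return * grad log pi(action | state)."""
    grad_log_pi = -softmax(logits[state])
    grad_log_pi[action] += 1.0
    logits[state] += LR * ret * grad_log_pi

def q_update(q, state, action, reward):
    """One tabular Q-learning step for a transition into a terminal state."""
    # The next state is terminal, so the bootstrapped term is zero.
    q[state, action] += LR * (reward - q[state, action])

logits = np.zeros((2, 2))  # policy logits indexed by [state, action]
q = np.zeros((2, 2))       # Q-values indexed by [state, action]

# Historical data: only the route through S1 is ever observed.
for _ in range(20):
    reinforce_update(logits, S1, A1, ret=1.0)
    q_update(q, S1, A1, reward=1.0)

# Neither learner has changed anything at the unvisited state S2.
logits_s2_before = logits[S2].copy()
q_s2_before = q[S2].copy()

# Now each learner sees the other route once and makes its first update at S2.
reinforce_update(logits, S2, A2, ret=1.0)
q_update(q, S2, A2, reward=1.0)
```

In this setting both methods leave `S2` untouched until the second route is actually observed, and both then make a single-step update whose size is set by the learning rate and parameterization rather than by anything intrinsic to the method.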

I think the above also applies in the case of actor-critic methods where the critic is implemented by a Q-function. And I think it still applies even if the critic is a value function V(s), but I'm less confident: the critic has the assumption baked in that rewards come only from states, but the actor still doesn't, so this might have similar dynamics to REINFORCE. (And if it ends up that this does do better, it's only by baking in an assumption about the environment -- that rewards come from the states and not the specific trajectories -- which isn't true in all environments.)

So I don't follow why Q-learning and actor-critic methods on one hand, and REINFORCE and FeedME on the other hand, lie on opposite sides of the "learn to navigate to high-reward states in a trajectory-independent way" spectrum.

(I enjoyed thinking through the details here, by the way, so thanks for prompting that.)

To be clear, I'm not classifying all uses of SFT as RL (for example, I would not call SFT on human expert demonstrations RL). It's specifically SFT on highly-rated model outputs -- i.e. having the model produce a bunch of rollouts, labeling them with rewards, training the model to imitate the top-rewarded rollouts, and repeating -- which I'm calling RL here. Note that this training process does aim the model towards high-reward, and is very similar to the online decision transformer, which is typically classed as an RL technique. 
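
To make the loop concrete, here is a minimal sketch on a toy problem: a softmax policy over four actions with noisy rewards. The environment, the function name `iterated_top_k_sft`, and all hyperparameters are assumptions for illustration. The "SFT" step is one gradient step on the log-likelihood of the kept rollouts.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def iterated_top_k_sft(reward_per_action, rounds=50, rollouts=64, keep=8, lr=0.5, seed=0):
    """Repeat: sample rollouts, label with rewards, imitate the top-rewarded ones."""
    rng = np.random.default_rng(seed)
    n_actions = len(reward_per_action)
    logits = np.zeros(n_actions)
    for _ in range(rounds):
        probs = softmax(logits)
        # 1. Have the current policy produce a batch of (one-step) rollouts.
        actions = rng.choice(n_actions, size=rollouts, p=probs)
        # 2. Label each rollout with a noisy reward.
        rewards = reward_per_action[actions] + 0.1 * rng.standard_normal(rollouts)
        # 3. Keep only the top-rewarded rollouts.
        best = actions[np.argsort(-rewards)[:keep]]
        # 4. Supervised step: ascend the mean log-likelihood of the kept actions.
        #    For a softmax policy this gradient is (empirical frequency - probs).
        counts = np.bincount(best, minlength=n_actions)
        logits += lr * (counts / keep - probs)
    return softmax(logits)

# Hypothetical mean reward of each action; action 2 is the best.
reward_per_action = np.array([0.0, 0.2, 1.0, 0.5])
final_probs = iterated_top_k_sft(reward_per_action)
```

Even though every individual update is plain imitation, the filter-then-imitate loop concentrates probability on the highest-reward action, which is the sense in which the process aims the model towards high reward.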

So I still feel that the way I used the term "RL" was in line with normal usage. But if people still disagree now that I've explained myself in more detail, I'd be interested in hearing why.
