All of Sam Marks's Comments + Replies

To be clear, I'm not classifying all uses of SFT as RL (for example, I would not call SFT on human expert demonstrations RL). It's specifically SFT on highly-rated model outputs -- i.e. having the model produce a bunch of rollouts, labeling them with rewards, training the model to imitate the top-rewarded rollouts, and repeating -- which I'm calling RL here. Note that this training process does aim the model towards high reward, and is very similar to the online decision transformer, which is typically classed as an RL technique.
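As a concrete illustration (a toy sketch, not any lab's actual pipeline), the loop described above (sample rollouts, score them, fine-tune on the top-rewarded ones, repeat) can be written out with a two-token "policy". Here the "fine-tuning" step just refits the policy to the empirical token distribution of the top rollouts:

```python
import random

def reward(rollout):
    # toy stand-in for human ratings: rollouts with more "a" tokens score higher
    return rollout.count("a")

def sample_rollout(policy):
    # policy maps token -> probability; sample a 5-token rollout
    tokens, probs = zip(*policy.items())
    return "".join(random.choices(tokens, weights=probs, k=5))

def sft_on_top_rollouts(policy, n_rollouts=200, top_frac=0.1, n_rounds=5):
    for _ in range(n_rounds):
        rollouts = sorted((sample_rollout(policy) for _ in range(n_rollouts)),
                          key=reward, reverse=True)
        top = rollouts[: int(n_rollouts * top_frac)]
        # "fine-tune" = move the policy toward the token distribution of top rollouts
        counts = {t: 1e-6 for t in policy}  # smoothing so no token hits zero
        for r in top:
            for t in r:
                counts[t] += 1
        total = sum(counts.values())
        policy = {t: c / total for t, c in counts.items()}
    return policy

random.seed(0)
trained = sft_on_top_rollouts({"a": 0.5, "b": 0.5})
print(trained["a"])  # probability of the high-reward token climbs toward 1
```

Even though each step is plain supervised imitation, the overall loop steers the policy toward high reward, which is why it gets classed with RL here.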

So I still feel that ... (read more)

This is just a terminological difference: supervised fine-tuning on highly rated outputs is a type of RL. (At least according to how many people use the term.)

Got a source for that? This seems like an odd way to use the term, in particular because with supervised fine-tuning there's no credit assignment over time, and so it doesn't train the model to actually aim towards high-reward states.
Flagging that I would find that use of the term super confusing.

[edit: this says the same thing as Quintin's sibling comment]

Important context for those who don't know it: the main difference between text-davinci-002 and text-davinci-003 is that the latter was trained with PPO against a reward model, i.e. RLHF as laid out in the InstructGPT paper. (Source: OpenAI model index.)

In more detail, text-davinci-002 seems to have been trained via supervised fine-tuning on the model outputs which were rated highest by human reviewers (this is what the model index calls FeedME). The model index only says that text-davinci-003 wa... (read more)

4 · Erik Jenner · 5d
Sorry for getting off track, but I thought FeedME did not use RL on the final model, only supervised training? Or do you just mean that the FeedME-trained models may have been fed inputs from models that had been RL-finetuned (namely the one from the InstructGPT paper)? Not sure if OpenAI said anywhere whether the latter was the case, or whether FeedME just uses inputs from non-RL models.

This, broadly-speaking, is also my best guess, but I'd rather phrase it as: larger LMs are better at making the personas they imitate "realistic" (in the sense of being more similar to the personas you encounter when reading webtext). So doing RLHF on a larger LM results in getting an imitation of a more realistic useful persona. And for the helpful chatbot persona that Anthropic's language model was imitating, one correlate of being more realistic was preferring not to be shut down.

(This doesn't obviously explain the results on sycophancy. I think for tha... (read more)

Regarding your points on agentic simulacra (which I assume means "agentic personas the language model ends up imitating"):

1) My best guess about why Anthropic's model expressed self-preservation desires is the same as yours: the model was trying to imitate some relatively coherent persona, this persona was agentic, and so it was more likely to express self-preservation desires.

2) But I'm pretty skeptical about your intuition that RLHF makes the "imitating agentic personas" problem worse. When people I've spoken to talk about conditioning-based alternatives... (read more)

I wasn't really focusing on the RL part of RLHF in making the claim that it makes the "agentic personas" problem worse, if that's what you meant. I'm pretty on board with the idea that the actual effects of using RL as opposed to supervised fine-tuning won't be apparent until we use stronger RL or something. Then I expect we'll get even weirder effects, like separate agentic heads or the model itself becoming something other than a simulator (which I discuss in a section of the linked post [] ). My claim is pretty similar to how you put it - in RLHF as in fine-tuning of the kind relevant here, we're focusing the model onto outputs that are generated by better agentic personas. But I think that the effect is particularly salient with RLHF because it's likely to be scaled up more in the future, where I expect said effect to be exacerbated. I agree with the rest of it, that prompt engineering is unlikely to produce the same effect, and definitely not the same qualitative shift of the world prior.

If anyone is interested, here are LW users' scores (in orange) inside the distribution of all scores. 

(I've just realized the histogram in the applet is cutting off the two leftmost bars, so this histogram will look very slightly different than the one there until I fix this later tonight; this is the correct one. Fixed.)

Thanks, that's a good suggestion! I've done so.

In terms of being able to sample from the conditional, I don't think that the important constraint here is . Rather, it seems that the important constraint is that our architecture can only sample from distributions of the form ; even allowing  to be arbitrary real numbers, this will never be the same as either (a) the distribution produced by conditioning the base model on high persuasiveness, or (b) the distribution which maximizes expected persuasiveness - KL divergence from the base model.... (read more)

I'm also most nervous about this way of modeling limitation (2)/(3), since it seems like it leads directly to the conclusion "fine-tuning always trades off truthfulness and persuasion, but conditioning can improve both."

(The worked example in this comment was a joint effort with Eric Neyman and Drake Thomas.)

Here's a toy example in which we get worse Goodharting for RL than for filtering: suppose that our model has three submodules

  • A, which tries to produce outputs which are both true and persuasive
  • B, which tries to produce outputs which are true, but have no effect on persuasiveness
  • C, which tries to produce outputs which are persuasive, but with no effect on truthfulness.

Our model has parameters  summing to 1 which determine how much to listen to each of thes... (read more)

Note that in this example your model is unable to sample from the conditional you specified, since it is restricted to α + β + γ = 1. In this regime truthfulness and persuasiveness are anticorrelated because of a capability constraint of your model: it just literally isn't able to increase both at the same time, and conditioning can do better because you are generating lots of samples and picking the best. (You point this out in your comment, but it seems worth emphasizing. As you say, if you do RL with a KL penalty, then the capability limit is the only way you can get this kind of mismatch. Without a KL penalty the exact behavior of RL vs conditioning will depend on details of gradient descent, though it seems quite similar in practice and I'm not sure which way this comparison goes.)
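To make the anticorrelation-plus-conditioning point concrete (a crude sketch with made-up numbers, not the worked example from the parent comment): fix weights α, β, γ summing to 1 and let each output consist of claims drawn from submodules A (true and persuasive), B (true only), and C (persuasive only). At fixed weights the two scores trade off, but generating many samples and conditioning on both scores beating their base-rate means picks out the joint tail where both are high:

```python
import random

random.seed(0)
ALPHA, BETA, GAMMA = 0.2, 0.4, 0.4  # hypothetical fixed weights, summing to 1
N_CLAIMS = 20

def sample_output():
    # each claim comes from submodule A (true + persuasive), B (true), or C (persuasive)
    mods = random.choices("ABC", weights=[ALPHA, BETA, GAMMA], k=N_CLAIMS)
    truth = sum(m in "AB" for m in mods)
    pers = sum(m in "AC" for m in mods)
    return truth, pers

samples = [sample_output() for _ in range(20000)]
mean_truth = sum(t for t, _ in samples) / len(samples)
mean_pers = sum(p for _, p in samples) / len(samples)

# conditioning on both scores beating their base means selects the joint tail
both_high = [(t, p) for t, p in samples if t > mean_truth and p > mean_pers]
cond_truth = sum(t for t, _ in both_high) / len(both_high)
cond_pers = sum(p for _, p in both_high) / len(both_high)
print(mean_truth, mean_pers, cond_truth, cond_pers)
```

No single setting of (α, β, γ) can raise both mean scores past the frontier, but the conditioned samples beat both base means at once, which is the sense in which conditioning "does better" here.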

These are all fantastic questions! I'll try to answer some of the ones I can. (Unfortunately a lot of the people who could answer the rest are pretty busy right now with EAGxBerkeley, getting set up for REMIX, etc., but I'm guessing that they'll start having a chance to answer some of these in the coming days.)

Regarding the research program, I'm guessing there's around 6-10 research projects ongoing, with between 1 and 3 students working on each; I'm guessing almost none of the participants have previous research experience. (Kuhan would have the actual nu... (read more)

Yep, sorry, I've probably been pretty unclear with describing this example.

I'm confused about why you think it would both (a) not plan ahead of time to disempower humans, and (b) disempower humans when it has the chance. If the predictive model is accurate enough such that it is predictable that disempowering humans would be instrumentally useful, then wouldn't the model incorporate that into its earlier plans?

For the sake of simplicity, let's ignore the "execute on arbitrary short-term goals" half of the system I described, and just consider a system which was ... (read more)

I see two distinctions between a system like the one I described and a system with long-term goals in the usual sense. First, the goal "write down a plan which, if followed, would lead to long-term profit" is itself a short-term goal which could plausibly be trained up to human-level with a short-term objective function (by training on human-generated predictions). So I think this mechanism avoids the arguments made in claims 4 and 5 of the post for the implausibility of long-term goals (which is my motivation for mentioning it). (I can't tell if claim 6 w... (read more)

Ah, I think I understand what you meant now. The reward for this agent is not determined by the actual long-term consequences of its action, but by the predicted long-term consequences. In that case, yes, this seems like it might be an interesting middle ground between what we are calling short-term and long-term AIs. Though it still feels closer to a long-term agent to me—I'm confused about why you think it would both (a) not plan ahead of time to disempower humans, and (b) disempower humans when it has the chance. If the predictive model is accurate enough such that it is predictable that disempowering humans would be instrumentally useful, then wouldn't the model incorporate that into its earlier plans?

I think that a competent human actor assisted by short-term AI systems plausibly could take over the world this way; I'm just inclined to call that a misuse problem rather than an alignment problem. (Or in other words, fixing that requires solving the human alignment problem, which feels like it requires different solutions, e.g. coordination and governmental oversight, than the AI alignment problem.)

In those terms, what we're suggesting is that, in the vision of the future we sketch, the same sorts of solutions might be useful for preventing both AI takeover and human takeover. Even if an AI has misaligned goals, coordination and mutually assured destruction and other "human alignment" solutions could be effective in stymying it, so long as the AI isn't significantly more capable than its human-run adversaries.

Thanks for writing this -- I found it interesting, thoughtful, and well-written.

One distinction which seems useful to make is between:

  • long-term goals
  • long-term planning
  • long-term capabilities (i.e. the ability to reliably impact the long-term future in a particular way).

It seems to me that this post argues that:

  1. AI systems' long-term planning won't be that much better than humans' (claims 1 and 3).
  2. AI systems won't develop long-term goals (claims 4, 5, and 6).
  3. Given (1) (and given that both humans and AI systems with long-term goals will have access to systems
... (read more)
Re your second critique: why do you think an AI system (without superhuman long-term planning ability) would be more likely to take over the world this way than an actor controlled by humans (augmented with short-term AI systems) who have long-term goals that would be instrumentally served by world domination?
I'm confused about your first critique. You say the agent has a goal of generating a long-term plan which leads to as much long-term profit as possible; why do you call this a short-term goal, rather than a long-term goal? Do you mean that the agent only takes actions over a short period of time? That's true in some sense in your example, but I would still characterize this as a long-term goal because success (maximizing profit) is determined by long-term results (which depend on the long-term dynamics of a complex system, etc.).

Non-central nitpick:

As it turns out, transformers can do reinforcement learning in-context

This seems to just be vanilla in-context learning, rather than any sort of in-context RL. (Also I'm skeptical that the linked paper actually provides evidence of in-context RL in any nontrivial sense.)

This seems like a good way to think about some of the examples of mode collapse, but doesn't obviously cover all the cases. For example, when asking the model to produce a random number, is it really the case that there's a particular conversational goal which the RLHF'd model is optimizing, such that 97 is the best random number for that goal? In this case, Paul's guess that RLHF'd models tend to push probability mass onto the base model's most likely tokens seems more explanatory.
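Paul's guess can be illustrated mechanically (a toy sketch; temperature sharpening is just one crude stand-in for whatever RLHF actually does to the distribution): sharpening a base distribution over candidate "random numbers" concentrates mass on whichever token was already modal:

```python
def sharpen(probs, temperature):
    # renormalize p_i^(1/T); T < 1 pushes mass onto the already-likely tokens
    weights = [p ** (1 / temperature) for p in probs.values()]
    total = sum(weights)
    return {tok: w / total for tok, w in zip(probs, weights)}

# hypothetical base-model distribution over a few candidate "random numbers",
# with "97" modestly most likely
base = {"97": 0.3, "42": 0.25, "7": 0.25, "13": 0.2}
sharpened = sharpen(base, 0.25)
print(sharpened["97"])  # "97" goes from 0.3 to ~0.46, and keeps growing as T shrinks
```

On this story the RLHF'd model's fixation on 97 needs no conversational goal for which 97 is optimal; it only requires that probability mass get pushed onto the base model's most likely token.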

I agree that something like this would be excellent. I unfortunately doubt that anything so cool will come out of this experiment. (The most important constraint is finding a HAIST member willing to take on the project of writing something like this up.)

If things go well, we are tentatively planning on sharing the list of core disagreements we identify (these will probably look like cruxes and subquestions) as well as maybe data about our members' distribution of views before and after the debate.

1 · Zac Hatfield-Dodds · 3mo
(if funding would get someone excited to do a great job of this, I'll help make that happen)

This recent comment thread discusses whether RLHF makes any progress beyond the classical "reward the agent when humans press the reward button" idea.

Thanks, that's a useful clarification; I'll edit it into the post.

In-context RL strikes me as a bit of a weird thing to do because of context window constraints. In more detail, in-context RL can only learn from experiences inside the context window (in this case, the last few episodes). This is enough to do well on extremely simple tasks, e.g. the tasks which appear in this paper, where even seeing one successful previous episode is enough to infer perfect play. But it's totally insufficient for more complicated tasks, e.g. tasks in large, stochastic environments. (Stochasticity especially seems like a problem, since yo... (read more)

The paper is frustratingly vague about what their context lengths are for the various experiments, but based off of comparing figures 7 and 4, I would guess that the context length for Watermaze was 1-2 times as long as an episode length (= 50 steps). (It does indeed look like they were embedding the 2d dark room observations into a 64-dimensional space, which is hilarious.)

I'm not sure I understand your second question. Are you asking about figure 4 in the paper (the same one I copied into this post)? There's no reward conditioning going on. They're also no... (read more)

3 · Charlie Steiner · 3mo
Yeah, I'm confused about all their results of the same type as fig 4 (fig 5, fig 6, etc.). But I think I'm figuring it out - they really are just taking the predicted action. They're "learning" in the sense that the sequence model is simulating something that's learning. So if I've got this right, the thousands of environment steps on the x axis just go in one end of the context window and out the other, and by the end the high-performing sequence model is just operating on the memory of 1-2 high-performing episodes.

I guess this raises another question I had, which is - why is the sequence model so bad at pretending to be bad? If it's supposed to be learning the distribution of the entire training trajectory, why is it so bad at mimicking an actual training trajectory? Maybe copying the previous run when it performed well is just such an easy heuristic that it skews the output? Or maybe performing well is lower-entropy than performing poorly, so lowering a "temperature" parameter at evaluation time will bias the sequence model towards successful trajectories?

Excellent work!

I had previously expected that training with KL-regularization would not be equivalent to early stopping in terms of its effects on RM overfitting, so I'm quite interested to see evidence otherwise. Two questions related to this:

  1. In figure 9 of the paper (copied below), it looks like, from the perspective of the proxy RM, training with KL-regularization does not have the same effect as early stopping. This is pretty shocking to me: looking at only the gold RM lines of figure 9, you'd be tempted to guess that RL with KL penalty learns approxim
... (read more)
Thanks!

1. I don't really have a great answer to this. My guess is that it's related to the fact that the higher penalty runs do a bunch more steps to get to the same KL, and those extra steps do something weird. Also, it's possible that, rather than the gold RM scores becoming more like the proxy RM scores with more params/data, the changed frontier is solely due to some kind of additional exploitation in those extra steps, and evaporates when the RMs become sufficiently good.
2. Mostly some evidence of other hyperparameters not leading to this behaviour, but also observing this behaviour replicated in other environments.

I really appreciated this comment for making the connection between this paper and IDA.

More explicitly, to the extent that you can think of the original large language model as simulating a human, there's an analogy between:

  • asking the LLM to reason about its inputs and then training on the conclusion of the reasoning (what this paper does)
  • asking simulated humans to reason about a problem and then training a new model on their conclusions (the basic idea behind iterated distillation and amplification).

This is a also a great chance for IDA skeptics to try to... (read more)

Note that if you want logits to work with, you could put a classification head on your LM and then train on the easy classification task where each input consists of a prompt, completion, and chain of thought. (In other words, you would have the LM reason about injuriousness using chain of thought as you did above, and afterwards feed the entire prompt + completion + chain of thought into your injuriousness classifier.)

This would let you backprop to tokens in the prompt + completion + chain of thought, and if you're willing to store the computational graph... (read more)
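A minimal version of the gradient-to-inputs point, with a linear head standing in for the classification head and a pooled vector standing in for the prompt + completion + chain of thought (all sizes and values made up for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16                            # embedding dimension (made up for the sketch)
w = 0.1 * rng.normal(size=D)      # weights of a linear "injuriousness" head
x = rng.normal(size=D)            # pooled embedding of prompt + completion + chain of thought

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(x_vec):
    # loss for the label "not injurious": -log(1 - p)
    return -np.log(1.0 - sigmoid(w @ x_vec))

p = sigmoid(w @ x)
# analytic gradient of the loss w.r.t. the input embedding: d/dx -log(1 - sigmoid(w.x)) = p * w
grad_x = p * w
```

Since the loss is differentiable in the input embedding, the same backprop that trains the head also yields gradients on the tokens feeding into it, which is what makes the "store the computational graph" idea workable.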

This seems very useful -- thanks for doing it!

Some paper suggestions:

Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit

There is mounting empirical evidence of emergent phenomena in the capabilities of deep learning methods as we scale up datasets, model sizes, and training times. While there are some accounts of how these resources modulate statistical capacity, far less is known about their effect on the computational problem of model training. This work conducts such an exploration through the lens of learning k-sparse par

... (read more)

My recent post on generative models has some related discussion; see especially remark 1 on the satisficer, quantilizer, and optimizer approaches to making agents with generative models.

Two interesting differences between the approaches discussed here and in my linked post:

  • In my post, I assumed that the generative model was trained on a data set which included rewards (for example, humans playing Breakout, where the reward is provided by the environment; or a setting in which rewards can be provided by a reward model trained with human feedback). In contra
... (read more)
2 · Adam Jermyn · 6mo
This is helpful, thanks for summarizing the differences! I definitely agree on the first one.

On the second one, my concern is basically that all the safety guarantees that quantilizers provide have an inherent power/safety tradeoff (modulo whatever I'm missing from the "Targeted Impact" section). That said, it's possible that your nested approach may avoid the 'simulate a deceptive AGI' failure mode. At least, if it's a continuous trajectory of improvement from median human performance up to very superhuman performance you might hope that that trajectory doesn't involve suddenly switching from human-like to AGI-like models.

I don't personally find this very comforting (it seems totally plausible to me that there's a continuous path from "median human" to "very dangerous misaligned model" in model-space), but it does at least seem better than directly asking for a superhuman model.

But you still need online access to our MDP (i.e. reward function and transition function), don't you?

Yep, that's right! This was what I meant by "the agent starts acting in its environment" in the description of an ODT. So to be clear, during each timestep in the online phase, the ODT looks at a partial trajectory

of rewards-to-go, observation, and actions; then selects an action  conditional on this partial trajectory; and then the environment provides a new reward  (so that ) and obser... (read more)
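The online loop described above might be sketched as follows (a toy environment and a trivial stand-in for the transformer policy, assuming the standard decision-transformer returns-to-go bookkeeping, where the conditioning target is decremented by each observed reward, R_{t+1} = R_t - r_t):

```python
import random

random.seed(0)

def env_step(action):
    # toy environment: reward 1 for action 1, else 0, plus a random next observation
    return (1.0 if action == 1 else 0.0), random.random()

def select_action(trajectory, rtg):
    # stand-in for the ODT; a real one conditions on the whole partial trajectory
    return 1 if rtg > 0 else 0

target_return = 3.0
rtg = target_return         # return-to-go: how much reward we still "want"
obs = random.random()
trajectory = []
for t in range(5):
    action = select_action(trajectory, rtg)
    reward, next_obs = env_step(action)
    trajectory.append((rtg, obs, action, reward))
    rtg -= reward           # R_{t+1} = R_t - r_t: the conditioning target shrinks
    obs = next_obs
```

Each new (return-to-go, observation, action, reward) tuple is appended to the partial trajectory the policy conditions on at the next step, which is the online access to the MDP being discussed.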

(separate comment to make a separate, possibly derailing, point)

> If the answer were yes, then this policy would do a really bad job predicting 499 of the 500 past episodes with top 5% reward, so I conclude the answer probably isn't yes.

For safety, 'probably' isn't much of a property.

I mostly view this as a rhetorical flourish, but I'll try to respond to (what I perceive as) the substance.

The "probably" in my sentence was mainly meant to indicate out-of-model uncertainty (in the sense of "I have a proof that X, so probably X" which is distinct from "I h... (read more)

I continue to think you're wrong here, and that our disagreement on this point is due to you misunderstanding how an ODT works.

Your simple DT is not keeping an episodic buffer around to do planning over or something, it's just doing gradient updates. It doesn't "know" what the exact empirical distribution of the last 10,000 episodes trained on was nor would it care if it did

To be clear: an ODT does keep an episodic buffer of previous trajectories (or at least, that is the implementation of an ODT that I'm considering, which comports with an ODT as implemen... (read more)

But it will still have the problems of modeling off-distribution poorly, and going off-distribution.

Yep, I agree that distributional shift is still an issue here (see counterpoint 1 at the end of the "Safety advantages" section).


> Novel behaviors may take a long time to become common [...]

I disagree. This isn't a model-free or policy model which needs to experience a transition many times before the high reward can begin to slowly bootstrap back through value estimates or overcome high variance updates to finally change behavior, it's a model-based R

... (read more)

I think you're wrong here, at least in the case of an OGM satisficer or quantilizer (and in the more optimizer-y case of remark 1.3, it depends on the reward of the new episode). For concreteness, let's imagine an OGM quantilizer aiming for rewards in the top 5% of previously-observed rewards. Suppose that the generative model has a memory of 10,000 episodes, and it's just explored a reward hacking strategy by chance, which gave it a much higher reward than all previous episodes. It looks back at the last 10,000 episodes (including the reward hacking epis

... (read more)
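The quantilizer calculation in this example can be sketched directly (made-up reward distribution; the point is just the 1-in-500 arithmetic): with a buffer of 10,000 episodes and a top-5% target, a single reward-hacking outlier is one of 500 top episodes, so it gets imitated only about 0.2% of the time:

```python
import random

random.seed(0)

# hypothetical buffer of the generative model's last 10,000 episode rewards,
# including one reward-hacking outlier
buffer = [random.gauss(10, 2) for _ in range(9999)] + [1000.0]

def quantilizer_target(buffer, top_frac=0.05):
    # aim for a reward drawn uniformly from the top 5% of previously observed rewards
    cutoff = sorted(buffer)[int(len(buffer) * (1 - top_frac))]
    top = [r for r in buffer if r >= cutoff]
    return random.choice(top)

targets = [quantilizer_target(buffer) for _ in range(1000)]
frac_hacking = sum(t == 1000.0 for t in targets) / len(targets)
print(frac_hacking)
```

This is the sense in which a policy that always reward-hacked "would do a really bad job predicting 499 of the 500 past episodes with top 5% reward": the hacking episode is a tiny sliver of the distribution the quantilizer is matching.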

Bug report: the "Some remarks" section of this post has a nested enumerated list. When I open the post in the editor, it displays as 

1. [text]

> a. [text]

> b. [text]

> c. [text]

2. [text]

(where the >'s represent indentation). But the published version of the post displays this as

1. [text]

> 1. [text]

> 2. [text]

> 3. [text]

2. [text]

This isn't a huge deal, but it's a bit annoying since I later refer to the things I say in the nested list as e.g. "remark 1(c)." 

There is, by coincidence, a recent PR [] to fix this.
Markdown doesn't support such lists [], so it's more a matter of an overly permissive interpreter (or even specification) that accepts the code without complaint.

I like this idea! And these are excellent first considerations on what exactly the surgeon should look like.

It seems to me that bounding the size of the modification the surgeon can make to any one activation has some issues. For instance, suppose that we're trying to apply this scheme to the agent which defects if it sees a factorization of RSA-2048. A plausible way this agent could work internally is that there is a particular activation which tracks whether the agent has seen a factorization of RSA-2048: the activation is very large whenever the agent h... (read more)

3 · Adam Jermyn · 7mo
Thanks! Good point. What I'd really like is for the cap on the surgeon's modifications to be based in some way on the weights of the agent. If the inputs and weights are typically order-unity and there are d layers and N neurons per layer, then activations shouldn't get much bigger than ∼N^d in the worst case (which corresponds to all weights of +1, all inputs of +1, so each layer just multiplies by N). So I'd like to see the surgeon's modifications capped to be no more than this for sure. In practice a tighter bound is given by looking at the eigenvalues of the weight layers, and the max ratio of activations to inputs is ∼∏_i λ_max,i, where the product runs over layers and λ_max is the maximum-magnitude eigenvalue of a layer.

Definitely! In particular, what I think makes sense is to make the surgeon try to minimize a loss which is shaped like (size of edits + loss of agent), where "size of edits" bakes in considerations like "am I editing an entire layer?" and "what is my largest edit?" and anything else that ends up mattering.
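A quick numerical check of the N^d and per-layer-product bounds (a sketch with tiny made-up sizes; note that for general non-symmetric weight matrices the valid per-layer factor is the operator norm, i.e. the largest singular value, which is what this computes):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 4                              # layers and neurons per layer (toy sizes)
layers = [rng.uniform(-1, 1, size=(n, n)) for _ in range(d)]

x = np.ones(n)                           # order-unity input
for W in layers:
    x = W @ x                            # purely linear layers, for the worst-case analysis

worst_case = n ** d                      # the ~N^d bound (all weights +1, all inputs +1)
# tighter bound: product of operator norms (largest singular values) of the layers
spectral = np.prod([np.linalg.norm(W, 2) for W in layers])

print(np.linalg.norm(x, np.inf), spectral * np.sqrt(n), worst_case)
```

Either quantity could serve as a weight-derived cap on the surgeon's edits; the singular-value product is usually far below the N^d worst case.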

Why privately?!

(Treating this as non-rhetorical, and making an effort here to say my true reasons rather than reasons which I endorse or which make me look good...)

In order of importance, starting from the most important:

  1. It would take a lot of effort to turn the list of disagreements I wrote for myself into a proper post, and I decided the effort wasn't worth it. I'm impressed how quickly Paul wrote this response, and it wouldn't surprise me if there are some people reading this who are now wondering if they should still post their rebuttals they've been d
... (read more)

I appreciate you choosing to reveal your real reasons, in spite of the reasons not to reveal them.

(I mostly endorse this explanation, but am also writing a reply with some more details.)

When "List of Lethalities" was posted, I privately wrote a list of where I disagreed with Eliezer, and I'm quite happy to see that there's a lot of convergence between my private list and Paul's list here. 

I thought it would be a useful exercise to diff my list with Paul's; I'll record the result in the rest of this comment without the expectation that it's useful to anyone else.

Points on both lists:

  • Eliezer's "first critical try" framing downplays the importance of trial-and-error with non-critical tries.
  • It's not clear that a "pivotal act" by an align
... (read more)

When "List of Lethalities" was posted, I privately wrote a list of where I disagreed with Eliezer

Why privately?!  Is there a phenomenon where other people feel concerned about the social reception of expressing disagreement until Paul does?  This is a phenomenon common in many other fields - and I'd invoke it to explain how the 'tone' of talk about AI safety shifted so quickly once I came right out and was first to say everybody's dead - and if it's also happening on the other side then people need to start talking there too.  Especially if people think they have solutions.  They should talk.

I've also been perplexed by the focus on Tao in particular. In fact, I've long thought that if it's a good idea to recruit a top mathematician to alignment, then Peter Scholze would be a better choice since

  1. he's probably the greatest active mathematician
  2. he's built his career out of paradigmatizing pre-paradigmatic areas of math
  3. he has an interest in computer proof-checking.

That said, I'm quite confident that Scholze is too busy revolutionizing everything he touches in mathematics to be interested in switching to alignment, so this is all moot.

(Also, I recogn... (read more)

Thanks, I’ve added him to my list of people to contact. If someone else wants to do it instead, reply to this comment so that we don’t interfere with each other.

I wish I could say that there was some sort of hilarious self-referential joke here, but actually I'm just bad at counting, oops. At this point I probably won't fix it for fear of ruining in-text section references. 

Hmm, I'm not sure I understand -- it doesn't seem to me like noisy observations ought to pose a big problem to control systems in general.

For example, suppose we want to minimize the number of mosquitos in the U.S., and we have access to noisy estimates of mosquito counts in each county. This may result in us allocating resources slightly inefficiently (e.g. overspending resources on counties that have fewer mosquitos than we think), but we'll still always be doing the approximately correct thing and mosquito counts will go down. In particular, I don't see a se... (read more)

2 · Gordon Seidoh Worley · 8mo
"Error" here is all sources of error, not just error in the measurement equipment. So bribing surveyors is a kind of error in my model.

This paper gives a mathematical model of when Goodharting will occur. To summarize: if

(1) a human has some collection  of things which she values,

(2) a robot has access to a proxy utility function which takes into account some strict subset of those things, and

(3) the robot can freely vary how much of  there are in the world, subject only to resource constraints that make the  trade off against each other,

then when the robot optimizes for its proxy utility, it will minimize all 's which its proxy utility... (read more)
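The paper's setup can be illustrated with a two-good toy (a sketch with a made-up resource budget): the human values goods u1 and u2, the robot's proxy sees only u1, and a resource constraint makes the goods trade off. Optimizing the proxy then drives the omitted good to its minimum:

```python
R = 10.0  # total resources; the goods trade off along u1 + u2 = R

def proxy(u1, u2):
    # the robot's proxy utility takes into account a strict subset of the human's values
    return u1

# brute-force optimize the proxy over the feasible set (spending all resources)
feasible = [(i / 10, R - i / 10) for i in range(101)]
best = max(feasible, key=lambda u: proxy(*u))
print(best)  # (10.0, 0.0): the good omitted from the proxy is driven to zero
```

Trivial as it is, this is the paper's mechanism in miniature: nothing in the proxy penalizes sacrificing u2, so the resource constraint guarantees u2 gets minimized.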

2 · Gordon Seidoh Worley · 8mo
I actually don't think that model is general enough. Like, I think Goodharting is just a fact of control systems' observing. Suppose we have a simple control system with output X and a governor G. G takes a measurement m(X) (an observation) of X. So long as m(X) is not error free (and I think we can agree that no real world system can be actually error free), then X = m(X) + ϵ for some error factor ϵ. Since G uses m(X) to regulate the system to change X, we now have error influencing the value of X. Now applying the standard reasoning for Goodhart, in the limit of optimization pressure (i.e. G regulating the value of X for long enough), ϵ comes to dominate the value of X. This is a bit handwavy, but I'm pretty sure it's true, which means in theory any attempt to optimize for anything will, under enough optimization pressure, become dominated by error, whether that's human values or something else. The only interesting question is whether we can control the error enough, either through better measurement or less optimization pressure, to get enough signal to be happy with the output.
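One toy way to make this handwavy argument concrete (a sketch with made-up distributions, writing the noise as m(X) = X + ϵ): candidates have bounded true value X, the governor only sees the noisy measurement, and "more optimization pressure" means selecting the best measurement among more candidates. The error term of the selected candidate then grows with the pressure:

```python
import random

random.seed(0)

def selected_error(n):
    # n candidates with true value X ~ U(0,1); the governor sees m(X) = X + eps
    cands = [(random.random(), random.gauss(0.0, 1.0)) for _ in range(n)]
    x, eps = max(cands, key=lambda c: c[0] + c[1])  # optimize the measurement
    return eps                                       # error term of the "winner"

def avg_error(n, trials=300):
    return sum(selected_error(n) for _ in range(trials)) / trials

low_pressure, high_pressure = avg_error(10), avg_error(1000)
print(low_pressure, high_pressure)
```

Because X is capped but ϵ is not, cranking up selection pressure makes the winning measurement increasingly a measurement of ϵ rather than of X, which is the "error comes to dominate" claim in miniature.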

(I should clarify that I'm not an expert. In fact, you might even call me "an amateur who's just learning about this stuff myself"! That said...)

RLHF attempts to infer a reward function from human comparisons of task completions.

I believe that RLHF more broadly refers to learning reward models via supervised learning, not just the special case where the labelled data is pairwise comparisons of task completions. So, for example, I think that RLHF would include e.g. learning a reward model for text summaries based on scalar 1-10 feedback from humans, rather ... (read more)

I'm responding to this specifically, rather than the question of RLHF and 'human irrationality'. I'm not saying this is the case, but what if 'human preferences' are representable by something more complicated. Perhaps an array or vector? Can it learn something like that?

This was a nice post! I appreciate the effort you're making to get your inside view out there.

A correction:

The ultimate goal is to get a reward model that represents human preferences for how a task should be done: this is also known as Inverse Reinforcement Learning.

Based on this sentence, you might be conflating value learning (the broad class of approaches to outer alignment that involve learning reward models) with IRL, which is the particular sub-type of value learning in which the ML model tries to infer a reward function by observing the behavi... (read more)

1 · Ansh Radhakrishnan · 9mo
Thanks for the feedback and corrections! You're right, I was definitely confusing IRL, which is one approach to value learning, with the value learning project as a whole. I think you're also right that most of the "Outer alignment concerns" section doesn't really apply to RLHF as it's currently written, or at least it's not immediately clear how it does.

Here's another attempt: RLHF attempts to infer a reward function from human comparisons of task completions. But it's possible that a reward function learned from these stated preferences might not be the "actual" reward function - even if we could perfectly predict the human preference ordering on the training set of task completions, it's hard to guarantee that the learned reward model will generalize to all task completions. We also have to consider that the stated human preferences might be irrational: they could be intransitive or cyclical, for instance. It seems possible to me that a reward model learned from human feedback still has to account for human biases, just as a reward function learned through IRL does.

How's that for a start?

Relevant: Mark Xu's recent (short) EA forum post 'Dropping out isn't a plan'.

  1. Yeah, I agree this was an issue. I've changed the title and made an edit to reflect the focus on outer alignment.
  2. I feel pretty resistant to changing anything about this. I feel like LW should be a place where, when you write "if X then Y," people should know not to infer "X is true" or "I'm going to argue for X." Moreover, one point I was trying to get across is that instead of thinking in terms of whether "ELK works" or not, you can think in terms of how many questions our ELK techniques allow us to elicit honest answers for; the larger that collection of
... (read more)

Thanks, this is indeed a point I hadn't fully appreciated: even if a reward function generalizes well OOD, that doesn't mean that a policy trained on that reward function does.

It seems like the issue here is that it's a bad idea to ever take your policy offline, analogously to what happens in reward modeling from human feedback (namely, reward models stop being good once you take them offline). Does that seem right? Of course, keeping an RL agent in learning mode forever might also have issues, most obviously unsafe exploration. Are there other things that also go wrong?

Rohin Shah · 9mo
I agree that one major mitigation is to keep training your policy online, but that doesn't necessarily prevent a misaligned policy from taking over the world before the training has time to fix its mistakes. In particular, if the policy is reasoning "I'll behave well until the moment I strike", and your reward function can't detect that (it only detects whether the output was good), then the policy will look great until the moment it takes over.

Thanks, I should have clarified that everywhere I say "alignment" in this post, I'm really talking about (outer) intent alignment, which of course excludes a whole barrage of safety-relevant concerns: safe exploration, robustness to distributional shift, mesa-optimizers, etc.

That said, I think the particular concern expressed in the paper you link -- namely, that the agent's reward model could break OOD while the agent's capabilities remain otherwise intact -- doesn't seem like it would be an issue here? Indeed, the agent's reward model is pulled out of it... (read more)

Rohin Shah · 9mo
Fair enough if you just want to talk about outer alignment. I agree that this implies that the utility function you get in Step 2 will be good and will continue working OOD. I assumed that in Step 3, you would plug that utility function as the reward function into an algorithm like PPO in order to train a policy that acted well. The issue is then that the resulting policy could end up optimizing for something else OOD, even if the utility function would have done the right thing, in the same way that the CoinRun policy ends up always going to the end of the level even though it was trained on the desired reward function of "+10 if you get the coin, 0 otherwise". Maybe you have some different Step 3 in mind besides "run PPO"?
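Rohin's CoinRun point can be made concrete with a toy sketch (a hypothetical one-dimensional stand-in, not the actual CoinRun environment): during training the coin always sits at the end of the level, so a policy that just "goes right" and a policy that "goes to the coin" earn identical reward, and training can't distinguish them. They only come apart out of distribution, even though the reward function itself is unchanged and correct.

```python
# Desired reward function: +10 for reaching the coin, 0 otherwise.
def reward(agent_pos, coin_pos):
    return 10 if agent_pos == coin_pos else 0

def go_right_policy(level_length, coin_pos):
    return level_length - 1   # proxy behavior: ignores the coin entirely

def go_to_coin_policy(level_length, coin_pos):
    return coin_pos           # the intended behavior

# Training distribution: coin at the rightmost cell, so both policies tie.
for length, coin in [(8, 7), (10, 9)]:
    assert (reward(go_right_policy(length, coin), coin)
            == reward(go_to_coin_policy(length, coin), coin) == 10)

# Out of distribution: coin placed mid-level. The proxy policy now gets
# zero reward under the very same (correct) reward function.
length, coin = 8, 3
print(reward(go_right_policy(length, coin), coin),    # → 0
      reward(go_to_coin_policy(length, coin), coin))  # → 10
```

This is the sense in which a good utility function in Step 2 doesn't guarantee that the policy PPO produces in Step 3 optimizes for the right thing off-distribution.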

I agree that the term "deception" conflates "deceptive behavior due to outer alignment failure" and "deceptive behavior due to inner alignment failure" and that this can be confusing! In fact, I made this same distinction recently in a thread discussing deceptive behavior from models trained via RL from human feedback.

The idea that conditioning on unCLIP-produced image vectors instead of text vectors would improve diversity seems very bewildering. And I really have a hard time swallowing the explanation "maybe this happens because for a given CLIP image vector v, there's a large equivalence class of images that all approximately encode to v." After all, this explanation doesn't actually have anything to do with conditioning on image vs. text vectors; in other words, whether we condition on image or text vectors, the final resulting image should still have a large equiva... (read more)

I completely agree that the effects of using unCLIP are mysterious, in fact the opposite of what I'd predict them to be.

I wish the paper had said more about why they tried unCLIP in the first place, and what improvements they predicted they would get from it.  It took me a long time just to figure out why the idea might be worth trying at all, and even now, I would never have predicted the effects it had in practice.  If OpenAI predicted them, then they know something I don't.

For instance, it seems like maybe the model that produced the roses on

... (read more)

Ah cool, I see -- your concern is that RLHF is perhaps better left to the capabilities people, freeing up AI safety researchers to work on more neglected approaches.

That seems right to me, and I agree with it as a general heuristic! Some caveats:

  1. I'm a random person who's been learning a lot about this stuff lately, definitely not an active researcher. So my opinions about heuristics for what to work on probably aren't worth much.
  2. If you think RLHF research could be very impactful for alignment, that could make up for it being less neglected than o
... (read more)

I'll be performing a (modest) update on the results of this experiment, and I strongly endorse John's comment here as an explanation of why -- it's testing a worldview that's upstream of both this AC debate and alignment. 

In my case, the worldview being tested isn't about civilizational inadequacy. Rather, it's about how likely optimizers (e.g. the market, an AI system) are to do things that seem to satisfy our preferences (but actually have hidden bad side effects that we're not smart enough to notice) vs. do things that actually satisfy our preferen... (read more)

What would you say is the main benefit from the RL from Human Feedback research so far? What would have happened if the authors had instead worked on a different project?

I feel like these questions are a little tricky to answer, so instead I'll attempt to answer the questions "What is the case for RL from human feedback (RLHF) helping with alignment?" and "What have we learned from RLHF research so far?"

What is the case for RLHF helping with alignment?

(The answer will mainly be me repeating the stuff I said in my OP, but at more length.)

The most naive c... (read more)

Daniel Kokotajlo · 9mo
Thanks for the detailed answer, I am sheepish to have prompted so much effort on your part!

I guess what I was and am thinking was something like: "Of course we'll be using human feedback in our reward signal. Big AI companies will do this by default. Obviously they'll train it to do what they want it to do and not what they don't want it to do. The reason we are worried about AI risk is because we think that this won't be enough."

To which someone might respond: "But still it's good to practice doing it now. The experience might come in handy later when we are trying to align really powerful systems."

To which I might respond: "OK, but I feel like it's a better use of our limited research time to try to anticipate ways in which RL from human feedback could turn out to be insufficient and then do research aimed at overcoming those ways. E.g. think about inner alignment problems, think about it possibly learning to do what makes us give positive feedback rather than what we actually want, etc. Let the capabilities researchers figure out how to do RL from human feedback, since they need to figure that out anyway on the path to deploying the products they are building. Safety researchers should focus on solving the problems that we anticipate RLHF doesn't solve by itself."

I don't actually think this, because I haven't thought about this much, so I'm uncertain and mostly deferring to others' judgment. But I'd be interested to hear your thoughts! (You've written so much already, no need to actually reply.)