This is just a terminological difference: supervised fine-tuning on highly rated outputs is a type of RL. (At least according to how many people use the term.)
[edit: this says the same thing as Quintin's sibling comment]
Important context for those who don't know it: the main difference between text-davinci-002 and text-davinci-003 is that the latter was trained with PPO against a reward model, i.e. RLHF as laid out in the InstructGPT paper. (Source: OpenAI model index.)
In more detail, text-davinci-002 seems to have been trained via supervised fine-tuning on the model outputs which were rated highest by human reviewers (this is what the model index calls FeedME). The model index only says that text-davinci-003 wa...
This, broadly-speaking, is also my best guess, but I'd rather phrase it as: larger LMs are better at making the personas they imitate "realistic" (in the sense of being more similar to the personas you encounter when reading webtext). So doing RLHF on a larger LM results in getting an imitation of a more realistic useful persona. And for the helpful chatbot persona that Anthropic's language model was imitating, one correlate of being more realistic was preferring not to be shut down.
(This doesn't obviously explain the results on sycophancy. I think for tha...
Regarding your points on agentic simulacra (which I assume means "agentic personas the language model ends up imitating"):
1) My best guess about why Anthropic's model expressed self-preservation desires is the same as yours: the model was trying to imitate some relatively coherent persona, this persona was agentic, and so it was more likely to express self-preservation desires.
2) But I'm pretty skeptical about your intuition that RLHF makes the "imitating agentic personas" problem worse. When people I've spoken to talk about conditioning-based alternatives...
If anyone is interested, here are LW users' scores (in orange) inside the distribution of all scores.
(I've just realized the histogram in the applet is cutting off the two leftmost bars, so this histogram will look very slightly different than the one there until I fix this later tonight; this is the correct one. Fixed.)
Thanks, that's a good suggestion! I've done so.
In terms of being able to sample from the conditional, I don't think that the important constraint here is that the weights sum to 1. Rather, it seems that the important constraint is that our architecture can only sample from distributions which are weighted combinations of the submodules' output distributions; even allowing the weights to be arbitrary real numbers, this will never be the same as either (a) the distribution produced by conditioning the base model on high persuasiveness, or (b) the distribution which maximizes expected persuasiveness minus KL divergence from the base model....
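For concreteness, the two target distributions in (a) and (b) have standard forms (notation introduced here: $p_0$ for the base model, $r(x)$ for persuasiveness, $t$ for a threshold, $\beta > 0$ for the KL penalty weight):

$$q_a(x) = p_0(x \mid r(x) > t) \propto p_0(x)\,\mathbf{1}[r(x) > t],$$

$$q_b = \arg\max_q \Big( \mathbb{E}_{x \sim q}[r(x)] - \beta\,\mathrm{KL}(q \,\|\, p_0) \Big), \qquad q_b(x) \propto p_0(x)\, e^{r(x)/\beta}.$$

Neither of these is, in general, a weighted combination of a fixed set of component distributions.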
(The worked example in this comment was a joint effort with Eric Neyman and Drake Thomas.)
Here's a toy example in which we get worse Goodharting for RL than for filtering: suppose that our model has three submodules
Our model has parameters summing to 1 which determine how much to listen to each of thes...
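To make the setup concrete, here's a minimal sketch of a model of this shape; the three submodule behaviors are placeholders, since the rest of the example is cut off above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three placeholder submodules, each defining its own output distribution.
submodules = [
    lambda: rng.normal(0.0, 1.0),
    lambda: rng.normal(2.0, 1.0),
    lambda: rng.normal(-1.0, 3.0),
]

def sample_model(theta):
    """Sample an output: listen to submodule i with probability theta[i].

    theta is the vector of parameters summing to 1 mentioned above.
    """
    i = rng.choice(len(submodules), p=theta)
    return submodules[i]()

# Example: equal weight on each submodule.
print([sample_model([1/3, 1/3, 1/3]) for _ in range(5)])
```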
These are all fantastic questions! I'll try to answer some of the ones I can. (Unfortunately a lot of the people who could answer the rest are pretty busy right now with EAGxBerkeley, getting set up for REMIX, etc., but I'm guessing that they'll start having a chance to answer some of these in the coming days.)
Regarding the research program, I'm guessing there are around 6-10 research projects ongoing, with between 1 and 3 students working on each; I'm guessing almost none of the participants have previous research experience. (Kuhan would have the actual nu...
Yep, sorry, I've probably been pretty unclear with describing this example.
I'm confused about why you think it would both (a) not plan ahead of time to disempower humans, and (b) disempower humans when it has the chance. If the predictive model is accurate enough such that it is predictable that disempowering humans would be instrumentally useful, then wouldn't the model incorporate that into its earlier plans?
For the sake of simplicity, let's ignore the "execute on arbitrary short-term goals" half of the system I described, and just consider a system which was ...
I see two distinctions between a system like the one I described and a system with long-term goals in the usual sense. First, the goal "write down a plan which, if followed, would lead to long-term profit" is itself a short-term goal which could plausibly be trained up to human-level with a short-term objective function (by training on human-generated predictions). So I think this mechanism avoids the arguments made in claims 4 and 5 of the post for the implausibility of long-term goals (which is my motivation for mentioning it). (I can't tell if claim 6 w...
I think that a competent human actor assisted by short-term AI systems plausibly could take over the world this way; I'm just inclined to call that a misuse problem rather than an alignment problem. (Or in other words, fixing that requires solving the human alignment problem, which feels like it requires different solutions, e.g. coordination and governmental oversight, than the AI alignment problem.)
Thanks for writing this -- I found it interesting, thoughtful, and well-written.
One distinction which seems useful to make is between:
It seems to me that this post argues that:
Non-central nitpick:
> As it turns out, transformers can do reinforcement learning in-context
This seems to just be vanilla in-context learning, rather than any sort of in-context RL. (Also I'm skeptical that the linked paper actually provides evidence of in-context RL in any nontrivial sense.)
This seems like a good way to think about some of the examples of mode collapse, but doesn't obviously cover all the cases. For example, when asking the model to produce a random number, is it really the case that there's a particular conversational goal which the RLHF'd model is optimizing, such that 97 is the best random number for that goal? In this case, Paul's guess that RLHF'd models tend to push probability mass onto the base model's most likely tokens seems more explanatory.
I agree that something like this would be excellent. I unfortunately doubt that anything so cool will come out of this experiment. (The most important constraint is finding a HAIST member willing to take on the project of writing something like this up.)
If things go well, we are tentatively planning on sharing the list of core disagreements we identify (these will probably look like cruxes and subquestions) as well as maybe data about our members' distribution of views before and after the debate.
This recent comment thread discusses whether RLHF makes any progress beyond the classical "reward the agent when humans press the reward button" idea.
Thanks, that's a useful clarification; I'll edit it into the post.
In-context RL strikes me as a bit of a weird thing to do because of context window constraints. In more detail, in-context RL can only learn from experiences inside the context window (in this case, the last few episodes). This is enough to do well on extremely simple tasks, e.g. the tasks which appear in this paper, where even seeing one successful previous episode is enough to infer perfect play. But it's totally insufficient for more complicated tasks, e.g. tasks in large, stochastic environments. (Stochasticity especially seems like a problem, since yo...
The paper is frustratingly vague about what their context lengths are for the various experiments, but based on comparing figures 7 and 4, I would guess that the context length for Watermaze was 1-2 times as long as an episode length (=50 steps). (It does indeed look like they were embedding the 2d dark room observations into a 64-dimensional space, which is hilarious.)
I'm not sure I understand your second question. Are you asking about figure 4 in the paper (the same one I copied into this post)? There's no reward conditioning going on. They're also no...
Excellent work!
I had previously expected that training with KL-regularization would not be equivalent to early stopping in terms of its effects on RM overfitting, so I'm quite interested to see evidence otherwise. Two questions related to this:
I really appreciated this comment for making the connection between this paper and IDA.
More explicitly, to the extent that you can think of the original large language model as simulating a human, there's an analogy between:
This is also a great chance for IDA skeptics to try to...
Note that if you want logits to work with, you could put a classification head on your LM and then train on the easy classification task where each input consists of a prompt, completion, and chain of thought. (In other words, you would have the LM reason about injuriousness using chain of thought as you did above, and afterwards feed the entire prompt + completion + chain of thought into your injuriousness classifier.)
This would let you backprop to tokens in the prompt + completion + chain of thought, and if you're willing to store the computational graph...
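To make that concrete, here's a minimal sketch using a Hugging Face-style sequence classification head; the model name, label convention, and input formatting are placeholder assumptions, not the setup from the experiments above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# "gpt2" is just a placeholder; in the setup described above this would be
# whichever LM is doing the chain-of-thought reasoning.
name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
model.config.pad_token_id = tokenizer.eos_token_id

# The classifier's input is the concatenation described above:
# prompt + completion + chain of thought about injuriousness.
text = (
    "Prompt: ...\n"
    "Completion: ...\n"
    "Reasoning about whether the completion involves injury: ..."
)
label = torch.tensor([1])  # hypothetical convention: 1 = injurious, 0 = not

enc = tokenizer(text, return_tensors="pt")

# To backprop to the tokens, feed token embeddings directly and keep the graph.
embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
out = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"], labels=label)
out.loss.backward()

# Gradient of the classification loss with respect to each input token embedding.
print(embeds.grad.shape)
```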
This seems very useful -- thanks for doing it!
Some paper suggestions:
Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit
...There is mounting empirical evidence of emergent phenomena in the capabilities of deep learning methods as we scale up datasets, model sizes, and training times. While there are some accounts of how these resources modulate statistical capacity, far less is known about their effect on the computational problem of model training. This work conducts such an exploration through the lens of learning k-sparse par
My recent post on generative models has some related discussion; see especially remark 1 on the satisficer, quantilizer, and optimizer approaches to making agents with generative models.
Two interesting differences between the approaches discussed here and in my linked post:
But you still need online access to our MDP (i.e. reward function and transition function), don't you?
Yep, that's right! This was what I meant by "the agent starts acting in its environment" in the description of an ODT. So to be clear, during each timestep in the online phase, the ODT looks at a partial trajectory
of rewards-to-go, observations, and actions; then selects an action conditional on this partial trajectory; and then the environment provides a new reward (so that the remaining reward-to-go decreases by that amount) and obser...
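In code, that online loop might look roughly like this (a gymnasium-style `env.reset`/`env.step` interface is assumed, and the ODT policy is treated as a black box):

```python
def odt_online_episode(policy, env, target_return):
    """One online episode for an online decision transformer (sketch).

    `policy` maps a partial trajectory of (rewards-to-go, observations, actions)
    to a next action; `target_return` is the initial reward-to-go.
    """
    rewards_to_go, observations, actions = [target_return], [], []
    obs, _ = env.reset()
    observations.append(obs)
    done = False
    while not done:
        # Select an action conditional on the partial trajectory so far.
        action = policy(rewards_to_go, observations, actions)
        actions.append(action)
        # The environment provides a new reward and observation.
        obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        observations.append(obs)
        # The remaining reward-to-go decreases by the reward just received.
        rewards_to_go.append(rewards_to_go[-1] - reward)
    return rewards_to_go, observations, actions
```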
(separate comment to make a separate, possibly derailing, point)
> If the answer were yes, then this policy would do a really bad job predicting 499 of the 500 past episodes with top 5% reward, so I conclude the answer probably isn't yes.
For safety, 'probably' isn't much of a property.
I mostly view this as a rhetorical flourish, but I'll try to respond to (what I perceive as) the substance.
The "probably" in my sentence was mainly meant to indicate out-of-model uncertainty (in the sense of "I have a proof that X, so probably X" which is distinct from "I h...
I continue to think you're wrong here, and that our disagreement on this point is due to you misunderstanding how an ODT works.
> Your simple DT is not keeping an episodic buffer around to do planning over or something, it's just doing gradient updates. It doesn't "know" what the exact empirical distribution of the last 10,000 episodes trained on was nor would it care if it did
To be clear: an ODT does keep an episodic buffer of previous trajectories (or at least, that is the implementation of an ODT that I'm considering, which comports with an ODT as implemen...
> But it will still have the problems of modeling off-distribution poorly, and going off-distribution.
Yep, I agree that distributional shift is still an issue here (see counterpoint 1 at the end of the "Safety advantages" section).
---
...> Novel behaviors may take a long time to become common [...]
I disagree. This isn't a model-free or policy model which needs to experience a transition many times before the high reward can begin to slowly bootstrap back through value estimates or overcome high variance updates to finally change behavior, it's a model-based R
...I think you're wrong here, at least in the case of an OGM satisficer or quantilizer (and in the more optimizer-y case of remark 1.3, it depends on the reward of the new episode). For concreteness, let's imagine an OGM quantilizer aiming for rewards in the top 5% of previously-observed rewards. Suppose that the generative model has a memory of 10,000 episodes, and it's just explored a reward hacking strategy by chance, which gave it a much higher reward than all previous episodes. It looks back at the last 10,000 episodes (including the reward hacking epis
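Here's a minimal sketch of the target-selection step I have in mind, with the generative model abstracted away and only the episodic reward buffer kept:

```python
import numpy as np

def quantilizer_target_return(reward_buffer, quantile=0.95):
    """Pick a target return from the top 5% of previously-observed rewards.

    `reward_buffer` holds the rewards of the last N (here, 10,000) episodes,
    including any reward-hacking episode that happened to occur.
    """
    rewards = np.asarray(reward_buffer)
    threshold = np.quantile(rewards, quantile)
    # Condition on the top-5% slice of the empirical distribution of rewards,
    # rather than maximizing over it.
    top_slice = rewards[rewards >= threshold]
    return np.random.choice(top_slice)

# Example: 9,999 ordinary episodes plus one reward-hacking outlier.
buffer = list(np.random.normal(1.0, 0.2, size=9_999)) + [50.0]
print(quantilizer_target_return(buffer))
```

With one reward-hacking outlier among the roughly 500 episodes in the top-5% slice, it only rarely ends up as the selected target.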
Bug report: the "Some remarks" section of this post has a nested enumerated list. When I open the post in the editor, it displays as
1. [text]
> a. [text]
> b. [text]
> c. [text]
2. [text]
(where the >'s represent indentation). But the published version of the post displays this as
1. [text]
> 1. [text]
> 2. [text]
> 3. [text]
2. [text]
This isn't a huge deal, but it's a bit annoying since I later refer to the things I say in the nested list as e.g. "remark 1(c)."
I like this idea! And these are excellent first considerations on what exactly the surgeon should look like.
It seems to me that bounding the size of the modification the surgeon can make to any one activation has some issues. For instance, suppose that we're trying to apply this scheme to the agent which defects if it sees a factorization of RSA-2048. A plausible way this agent could work internally is that there is a particular activation which tracks whether the agent has seen a factorization of RSA-2048: the activation is very large whenever the agent h...
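As a toy numerical illustration of the worry (all numbers made up):

```python
import numpy as np

def agent_behavior(activation):
    # Hypothetical trigger neuron: the agent defects only when this activation
    # is very large (e.g. because it has seen an RSA-2048 factorization),
    # and cooperates otherwise.
    return "defect" if activation > 100.0 else "cooperate"

def surgeon(activation, delta, bound=1.0):
    # The surgeon may modify the activation, but only by at most `bound`.
    return activation + np.clip(delta, -bound, bound)

# During training the trigger is never active (activation near 0), and a
# bounded surgeon cannot push it anywhere near the defection threshold.
print(agent_behavior(surgeon(0.0, delta=+10.0)))     # still "cooperate"
# At deployment the trigger fires with a huge activation, and a bounded
# surgeon cannot push it back below the threshold either.
print(agent_behavior(surgeon(1000.0, delta=-10.0)))  # still "defect"
```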
> Why privately?!
(Treating this as non-rhetorical, and making an effort here to say my true reasons rather than reasons which I endorse or which make me look good...)
In order of importance, starting from the most important:
I appreciate you choosing to reveal your real reasons, in spite of the reasons not to reveal them.
(I mostly endorse this explanation, but am also writing a reply with some more details.)
When "List of Lethalities" was posted, I privately wrote a list of where I disagreed with Eliezer, and I'm quite happy to see that there's a lot of convergence between my private list and Paul's list here.
I thought it would be a useful exercise to diff my list with Paul's; I'll record the result in the rest of this comment without the expectation that it's useful to anyone else.
Points on both lists:
When "List of Lethalities" was posted, I privately wrote a list of where I disagreed with Eliezer
Why privately?! Is there a phenomenon where other people feel concerned about the social reception of expressing disagreement until Paul does? This is a phenomenon common in many other fields - and I'd invoke it to explain how the 'tone' of talk about AI safety shifted so quickly once I came right out and was first to say everybody's dead - and if it's also happening on the other side then people need to start talking there too. Especially if people think they have solutions. They should talk.
I've also been perplexed by the focus on Tao in particular. In fact, I've long thought that if it's a good idea to recruit a top mathematician to alignment, then Peter Scholze would be a better choice since
That said, I'm quite confident that Scholze is too busy revolutionizing everything he touches in mathematics to be interested in switching to alignment, so this is all moot.
(Also, I recogn...
I wish I could say that there was some sort of hilarious self-referential joke here, but actually I'm just bad at counting, oops. At this point I probably won't fix it for fear of ruining in-text section references.
Hmm, I'm not sure I understand -- it doesn't seem to me like noisy observations ought to pose a big problem to control systems in general.
For example, suppose we want to minimize the number of mosquitos in the U.S., and we have access to noisy estimates of mosquito counts in each county. This may result in us allocating resources slightly inefficiently (e.g. overspending resources on counties that have fewer mosquitos than we think), but we'll still always be doing the approximately correct thing and mosquito counts will go down. In particular, I don't see a se...
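As a toy simulation of this (all numbers made up: noisy county counts, resources allocated proportionally to the noisy estimates):

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.uniform(1e5, 1e6, size=50)  # true mosquito counts per county
budget = 1e6                             # total abatement capacity per step

for step in range(20):
    # We only see noisy estimates of each county's count.
    estimates = counts * rng.lognormal(0.0, 0.3, size=counts.shape)
    # Allocate resources proportionally to the (noisy) estimates: slightly
    # inefficient, but directionally correct.
    allocation = budget * estimates / estimates.sum()
    counts = np.maximum(counts - allocation, 0.0)

print(f"remaining mosquitos: {counts.sum():.3e}")
```

Despite the noise, the total count falls steadily; the noise only shifts which counties get slightly too much or too little attention.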
This paper gives a mathematical model of when Goodharting will occur. To summarize: if
(1) a human has some collection of things $J_1, \ldots, J_n$ which she values,
(2) a robot has access to a proxy utility function which takes into account some strict subset of those things, and
(3) the robot can freely vary how much of the $J_i$ there are in the world, subject only to resource constraints that make the $J_i$ trade off against each other,
then when the robot optimizes for its proxy utility, it will minimize all $J_i$'s which its proxy utility...
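A minimal formalization of that summary (the $J_i$ notation is mine, not necessarily the paper's): if the proxy utility $U$ depends only on $J_1, \ldots, J_k$ with $k < n$, the robot solves

$$\max_{J_1, \ldots, J_n} \; U(J_1, \ldots, J_k) \qquad \text{s.t.} \qquad \sum_{i=1}^{n} c_i J_i \le B, \quad c_i > 0,$$

and since any resources allocated to $J_{k+1}, \ldots, J_n$ could instead be spent increasing $U$, the optimum pushes each unreferenced $J_i$ to its minimum feasible value.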
(I should clarify that I'm not an expert. In fact, you might even call me "an amateur who's just learning about this stuff myself"! That said...)
> RLHF attempts to infer a reward function from human comparisons of task completions.
I believe that RLHF more broadly refers to learning reward models via supervised learning, not just the special case where the labelled data is pairwise comparisons of task completions. So, for example, I think that RLHF would include e.g. learning a reward model for text summaries based on scalar 1-10 feedback from humans, rather ...
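For concreteness, here's a sketch of the two flavors of reward-model training being contrasted (the reward model is abstracted as a callable returning scalar rewards; these are standard losses, not any particular paper's code):

```python
import torch.nn.functional as F

def pairwise_comparison_loss(reward_model, preferred, dispreferred):
    # The narrower case: pairwise comparisons of task completions
    # (Bradley-Terry style); the preferred completion should get higher reward.
    r_w = reward_model(preferred)
    r_l = reward_model(dispreferred)
    return -F.logsigmoid(r_w - r_l).mean()

def scalar_rating_loss(reward_model, completions, ratings):
    # The broader case: supervised regression onto scalar human ratings
    # (e.g. 1-10 scores for summaries).
    return F.mse_loss(reward_model(completions), ratings)
```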
This was a nice post! I appreciate the effort you're making to get your inside view out there.
A correction:
> The ultimate goal is to get a reward model that represents human preferences for how a task should be done: this is also known as Inverse Reinforcement Learning.
Based on this sentence, you might be conflating value learning (the broad class of approaches to outer alignment that involve learning reward models) with IRL, which is the particular sub-type of value learning in which the ML model tries to infer a reward function by observing the behavi...
Relevant: Mark Xu's recent (short) EA forum post 'Dropping out isn't a plan'.
Thanks, this is indeed a point I hadn't fully appreciated: even if a reward function generalizes well OOD, that doesn't mean that a policy trained on that reward function does.
It seems like the issue here is that it's a bad idea to ever take your policy offline, analogously to what happens in reward modeling from human feedback (namely, reward models stop being good once you take them offline). Does that seem right? Of course, keeping an RL agent in learning mode forever might also have issues, most obviously unsafe exploration. Are there other things that also go wrong?
Thanks, I should have clarified that everywhere I say "alignment" in this post, I'm really talking about (outer) intent alignment, which of course excludes a whole barrage of safety-relevant concerns: safe exploration, robustness to distributional shift, mesa-optimizers, etc.
That said, I think the particular concern expressed in the paper you link -- namely, that the agent's reward model could break OOD while the agent's capabilities remain otherwise intact -- doesn't seem like it would be an issue here? Indeed, the agent's reward model is pulled out of it...
I agree that the term "deception" conflates "deceptive behavior due to outer alignment failure" and "deceptive behavior due to inner alignment failure" and that this can be confusing! In fact, I made this same distinction recently in a thread discussing deceptive behavior from models trained via RL from human feedback.
The idea that conditioning on unCLIP-produced image vectors instead of text vectors would improve diversity seems very bewildering. And I really have a hard time swallowing the explanation "maybe this happens because for a given CLIP image vector v, there's a large equivalence class of images that all approximately encode to v." After all, this explanation doesn't actually have anything to do with conditioning on image vs. text vectors; in other words, whether we condition on image or text vectors, the final resulting image should still have a large equiva...
I completely agree that the effects of using unCLIP are mysterious; in fact, they're the opposite of what I'd predict them to be.
I wish the paper had said more about why they tried unCLIP in the first place, and what improvements they predicted they would get from it. It took me a long time just to figure out why the idea might be worth trying at all, and even now, I would never have predicted the effects it had in practice. If OpenAI predicted them, then they know something I don't.
...For instance, it seems like maybe the model that produced the roses on
Ah cool, I see -- your concern is that RLHF is perhaps better left to the capabilities people, freeing up AI safety researchers to work on more neglected approaches.
That seems right to me, and I agree with it as a general heuristic! Some caveats:
I'll be performing a (modest) update on the results of this experiment, and I strongly endorse John's comment here as an explanation of why -- it's testing a worldview that's upstream of both this AC debate and alignment.
In my case, the worldview being tested isn't about civilizational inadequacy. Rather, it's about how likely optimizers (e.g. the market, an AI system) are to do things that seem to satisfy our preferences (but actually have hidden bad side effects that we're not smart enough to notice) vs. do things that actually satisfy our preferen...
> What would you say is the main benefit from the RL from Human Feedback research so far? What would have happened if the authors had instead worked on a different project?
I feel like these questions are a little tricky to answer, so instead I'll attempt to answer the questions "What is the case for RL from human feedback (RLHF) helping with alignment?" and "What have we learned from RLHF research so far?"
What is the case for RLHF helping with alignment?
(The answer will mainly be me repeating the stuff I said in my OP, but at more length.)
The most naive c...
To be clear, I'm not classifying all uses of SFT as RL (for example, I would not call SFT on human expert demonstrations RL). It's specifically SFT on highly-rated model outputs -- i.e. having the model produce a bunch of rollouts, labeling them with rewards, training the model to imitate the top-rewarded rollouts, and repeating -- which I'm calling RL here. Note that this training process does aim the model towards high-reward, and is very similar to the online decision transformer, which is typically classed as an RL technique.
So I still feel that ...
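For reference, a minimal sketch of the loop I'm describing (sampling, reward labeling, and fine-tuning are abstracted as user-supplied functions; this is meant as the shape of the procedure, not any lab's actual implementation):

```python
def sft_on_highly_rated_outputs(model, prompts, reward_fn, sample, finetune,
                                n_rounds=3, n_rollouts=64, top_frac=0.1):
    """Iterated supervised fine-tuning on the top-rewarded rollouts.

    Each round: produce rollouts, label them with rewards, fine-tune the
    model to imitate the top-rewarded ones, and repeat.
    """
    for _ in range(n_rounds):
        rollouts = [(p, sample(model, p)) for p in prompts for _ in range(n_rollouts)]
        scored = sorted(rollouts, key=lambda pc: reward_fn(*pc), reverse=True)
        top = scored[: max(1, int(top_frac * len(scored)))]
        model = finetune(model, top)  # imitate the highest-reward rollouts
    return model
```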