Both writesr take it as common knowledge that the reasons to not take the virus are stupid and wrong, and that the job is to fix what’s wrong with these soldiers who are refusing.
"writers" and "not take the vaccine", no?
Really? The main claim is presented "in an outrageous way"?
I can imagine reading the post and being unconvinced by the evidence presented. In fact, that was my reaction (although I haven't watched the videos yet). But... being outraged?
Posts should not make large, unsupported claims, and criticism should not be hyperbolic. Here is what I have learned from your critique:
Comment #1000 on LessWrong :)
This arbitrary choice effectively unrolls the state graph into a tree with a constant branching factor (plus self-loops at the terminal states), and we get that the POWER of every state is equal.
Not necessarily true - you're still considering the IID case.
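(For intuition about what the IID assumption buys, here's a toy Monte Carlo sketch; it illustrates the "more options → more POWER" heuristic, not the paper's exact definition, and the state/reward setup is a made-up example:)

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_power(num_options: int, samples: int = 100_000) -> float:
    """Average best-achievable reward for a state whose `num_options`
    successor states are absorbing, with IID Uniform(0,1) rewards."""
    rewards = rng.random((samples, num_options))
    return rewards.max(axis=1).mean()

for k in (1, 2, 5):
    print(k, round(toy_power(k), 3))
# Analytically, E[max of k draws] = k/(k+1): ~0.5, ~0.667, ~0.833.
# More options -> more best-case reward on average, under IID rewards.
```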
I think using a well-chosen reward distribution is necessary, otherwise POWER depends on arbitrary choices in the design of the MDP's state graph. E.g. suppose the student in the above example writes about every action they take in a blog that no one reads, and we choose to include the content of the
LeCun claims too much. It's true that the case of animals like orangutans points to a class of cognitive architectures which seemingly don't prioritize power-seeking. It's true that this is some evidence against power-seeking behavior being common amongst relevant cognitive architectures. However, it doesn't show that instrumental subgoals are much weaker drives of behavior than hardwired objectives.
One reading of this "drives of behavior" claim is that it has to be tautological; by definition, instrumental subgoals are always in service of the (hardwired)... (read more)
Two clarifications. First, even in the existing version, POWER can be defined for any bounded reward function distribution - not just IID ones. Second, the power-seeking results no longer require IID. Most reward function distributions incentivize POWER-seeking, both in the formal sense, and in the qualitative "keeping options open" sense.
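(For reference, a sketch of the paper's definition, from memory; treat the details as approximate:)

$$\mathrm{POWER}_{\mathcal{D}}(s,\gamma) \;=\; \frac{1-\gamma}{\gamma}\,\mathbb{E}_{R \sim \mathcal{D}}\!\left[V^{*}_{R}(s,\gamma) - R(s)\right]$$

That is, the normalized extra optimal value available from state s, averaged over reward functions drawn from the distribution D; as noted above, D need not be IID across states.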
To address your main point, though, I think we'll need to get more concrete. Let's represent the situation with a state diagram.
[state diagram; one of the labeled paths: "go to college right away"]
Both you and Rohin... (read more)
Right. But what does this have to do with your “different concept” claim?
When proving theorems for my research, I often take time to consider the weakest conditions under which the desired result holds - even if it's just a relatively unimportant and narrow lemma. By understanding the weakest conditions, you isolate the load-bearing requirements for the phenomenon of interest. I find this helps me build better gears-level models of the mathematical object I'm studying. Furthermore, understanding the result in generality allows me to recognize analogies and cross-over opportunities in the future. Lastly, I just find this plain satisfying.
I think the draft tends to use the term power to point to an intuitive concept of power/influence (the thing that we expect a random agent to seek due to the instrumental convergence thesis). But I think the definition above (or at least the version in the cited paper) points to a different concept, because a random agent has a single objective (rather than an intrinsic goal of getting to a state that would be advantageous for many different objectives).
This is indeed a misunderstanding. My paper analyzes the single-objective setting; no intrinsic power-seeking drive is assumed.
Ok, so the main advice is: don't make a card for everything, just the important concepts. And those concepts can be found in "cheatsheets" and "course review notes", it seems — unfortunately, I don't have any of those things.
Why not use Google to find notes from other schools?
I left comments on a portion of a copy of your power-seeking writeup.
I like the current doc a lot. I also feel, though, that it doesn't consider some big formal hints and insights we've gotten from my work over the past two years.
Very recently, I was able to show the following strong result:
Some researchers have speculated that capable reinforcement learning (RL) agents are often incentivized to seek resources and power in pursuit of their objectives. While seeking power in order to optimize a misspecified objective, agents might be incentivized to beh
Definitely not too late to make cards. I've learned a great deal of basic chemistry in the last month or so, just studying for random 30-minute chunks and binging interesting-looking wikipedia articles. A month+ is plenty of time for spaced repetition to work its magic.
For math, I recommend Anki; for non-LaTeX-intensive subjects like biology, I recommend SuperMemo for fast card creation while you read and review material. Unfortunately, I have yet to write up my thoughts on the latter.
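For intuition about why a month-plus is plenty of time, here's a simplified sketch of the classic (public) SM-2 schedule; SuperMemo's current scheduler is far more sophisticated, and the easiness factor below is just the algorithm's default assumption:

```python
def sm2_intervals(easiness: float = 2.5, reviews: int = 5) -> list[int]:
    """Simplified SM-2: 1 day, then 6 days, then multiply the previous
    interval by the card's easiness factor (assuming perfect recall)."""
    intervals, interval = [], 0
    for n in range(1, reviews + 1):
        if n == 1:
            interval = 1
        elif n == 2:
            interval = 6
        else:
            interval = round(interval * easiness)
        intervals.append(interval)
    return intervals

print(sm2_intervals())  # [1, 6, 15, 38, 95]: month-scale by the fourth review
```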
I like to look at the "cheat sheets" for courses and ensure I know how t... (read more)
The Pfizer phase 3 study's last endpoint is 7 days after the second shot. Does anyone know why the CDC recommends waiting 2 weeks for full protection? Are they just being the CDC again?
Frustratingly, the phase 3 trials don't report this number. But using some data included in the Pfizer phase 3, I was able to make this graph:
[graph image]
The image isn't loading for me on LW, although it does load if I right-click and select 'open in a new tab.'
This was run on davinci via the OpenAI API. First completion.
ML starts running factories, warehouses, shipping, and construction. ML assistants help write code and integrate ML into new domains. ML designers help build factories and the robots that go in them. ML finance systems invest in companies on the basis of complicated forecasts and (ML-generated) audits. Tons of new factories, warehouses, power plants, trucks and roads are being built. Things are happening quickly, investors have super strong FOMO, no one real
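For reference, a completion like this could be generated with the era-appropriate OpenAI Python client (a sketch: the prompt and sampling parameters are assumptions, and this uses the pre-1.0 `openai` package):

```python
import openai  # requires openai<1.0, the client current at the time

openai.api_key = "sk-..."  # your API key

response = openai.Completion.create(
    engine="davinci",   # the base GPT-3 model named above
    prompt="...",       # the story prompt (not reproduced here)
    max_tokens=256,     # sampling parameters are assumptions
    temperature=0.7,
)
print(response.choices[0].text)
```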
Apparently VMs are the way to go for PDF support on Linux.
It's a spaced repetition system that focuses on incremental reading. It's like Anki, but instead of hosting flashcards separately from your reading, you extract text while reading documents and PDFs. You later refine extracts into ever-smaller chunks of knowledge, at which point you create the "flashcard" (usually 'clozes', demonstrated below).
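For illustration (a hypothetical stand-in, since the original demo image isn't reproduced here): an extract like "Mitochondria are the powerhouse of the cell" becomes the cloze "Mitochondria are the [...] of the cell", with answer "powerhouse".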
In... (read more)
I don’t follow the last bit. If ghosts were real, the first-order news would be amazing: maybe humanity wouldn’t have truly lost the brain-information of any human, ever!
The all-or-nothing vaccine hypothesis is:
But maybe the vaccine is 100% effective against all outcomes! So long as it’s correctly transported and administered, that is. Except sometimes vaccines are left at high temperature for too long, the delicate proteins are damaged, and people receiving them are effectively not vaccinated. If this happens 5% of the time, then 95% of people are completely immune to Covid and 5% are no different from the unvaccinated. Whatever chance they had of getting severe Covid before, it’s the same now.
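To make the arithmetic concrete, here's a minimal sketch (the 2% baseline risk is an illustrative assumption, not trial data):

```python
baseline_risk = 0.02     # assumed risk of severe Covid if unvaccinated (illustrative)
immune_fraction = 0.95   # fully protected, under the all-or-nothing model

# Only the unprotected 5% retain their baseline risk:
vaccinated_risk = (1 - immune_fraction) * baseline_risk
efficacy = 1 - vaccinated_risk / baseline_risk
print(f"risk if vaccinated: {vaccinated_risk:.4f}; measured efficacy: {efficacy:.0%}")
# The baseline risk cancels, so measured efficacy equals the immune fraction
# (95%) against *every* outcome: the all-or-nothing signature.
```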
If all-or-nothing were true, you... (read more)
Nope, that seems roughly right. It is I who failed to propagate. It was a cached argument from before I'd looked at the data. I'll update the post shortly with this. Thanks for pointing it out.
Do you think such humans would have a high probability of working on TAI alignment, compared to working on actually making TAI?
I think you are indeed making a mistake by letting unsourced FB claims worry you, given the known proliferation of antivax-driven misinformation. There is an extremely low probability that you're first hearing about a real issue via some random, unsourced FB comment.
For more evidence, look to the overreactions to J&J / AZ adverse effects. Regulatory bodies are clearly willing to make a public fuss over even small probabilities of things going wrong.
Evolution requires some amount of mutation, which is occasionally beneficial to the species. Species that were too good at preventing mutations would be unable to adapt to changing environmental conditions, and thus die out.
We're aware of many species which evolved to extinction. I guess I'm looking for why there's no plausible "path" in genome-space between this arrangement and an arrangement which makes fatal errors happen less frequently (e.g., why wouldn't it be locally beneficial to the individual genes to code for more robustness against spontaneous abortions?), or an argument that this just isn't possible for evolution to find (like wheels instead of legs, or machine guns instead of claws).
I feel confused wrt the genetic mutation hypothesis for the spontaneous abortion phenomenon. Wouldn't genes which stop the baby from being born quickly exit the gene pool? Similarly for gamete formation processes which allow such mutations to arise?
I agree. I've put it in my SuperMemo and very much look forward to going through it. Thanks Peter & Owen!
(midco developed this separately from our project last term, so this is actually my first read)
I have a lot of small questions.
What is your formal definition of the IEU u_i? What kinds of goals is it conditioning on (because IEU is what you compute after you view your type in a Bayesian game)?
Multi-agent "impact" seems like it should deal with the Shapley value. Do you have opinions on how this should fit in?
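(For reference, the standard Shapley value of player i under coalition value function v, with n = |N| players; it's the natural candidate for a multi-agent credit split:)

$$\varphi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n-|S|-1)!}{n!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr)$$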
You note that your formalism has some EDT-like properties with respect to impact:
Well, in a sense, they do. The universes where player
I'm really excited about this project. I think that in general, there are many interesting convergence-related phenomena of cognition and rational action which seem wholly underexplored (see instrumental convergence, convergent evolution, universality of features in the Circuits agenda (cf. adversarial transferability), etc.).
My one note of unease is that an abstraction thermometer seems highly dual-use; if successful, this project could accelerate AI timelines. But that doesn't mean it isn't worth doing.
I still don't fully agree with OP but I do agree that I should weight this heuristic more.
Yeah, I think these are good points.
OK, if we're talking about central identity, then I very much wouldn't sign a contract giving away rights to my central identity. I interpreted the question to be about selling one's "immortal soul" (which supposedly goes to heaven if I'm good).
I think part of the lesson here is ‘don’t casually sell vaguely defined things that are generally understood to be some kind of big deal’
I guess I feel like this is a significant steelman and atypical of normal usage. In my ontology, that algorithm is closer to ‘mind.’
I agree that "soul" has more 'real' meaning than "florepti xor bobble." There's another point to consider, though, which is that many of us will privilege claims about souls with more credence than they realistically deserve, as an effect of having grown up in a certain kind of culture.
Out of all the possible metaphysical constructs which could 'exist', why believe that souls are particularly likely? Many people believing in souls is some small indirect evidence for them, but not an amount of evidence commensurate with the concept's prior improbability.
I think "Don't casually make contracts you don't intent to keep" is just pretty cruxy for me. This is a key piece of being a trustworthy person who can coordinate in complex, novel domains. There might be a price where there is worth it to do it as a joke, but $10 is way too low.
I agree that the contracts part was important, and I share this crux. I should have noted that. I did purposefully modify my hypothetical so that I wasn't becoming less trustworthy by signing my acquaintance's piece of paper.
This actually seems obviously wrong to me, if
My gut reaction is... okay, sure, maybe doing it ostentatiously is obnoxious, but these reasons against feel rather contrived.
(It's not at all a takedown to say "I disagree, your arguments feel contrived, bye!", but I figured I'd rather write a small comment than not engage at all)
If an acquaintance approached me on the street, asked me to sign a piece of paper that says "I, TurnTrout, give [acquaintance] ownership over my metaphysical soul" in exchange for $10 (and let's just ignore other updates I should make based on being approached with such a w... (read more)
Unless you're really desperate, it just seems like a bad idea to sign any kind of non-standard contract for $10. There's always a chance that you're misunderstanding the terms, or that the contract gets challenged at some point, or even that your signature on the contract is used as blackmail. Maybe you're trying to run for office or get a job at some point in the future, and the fact that you've sold your soul is used against you. The actual contract that Jacob references is long enough that even taking the time to read and understand it is worth signific... (read more)
I mean "soul" is clearly much closer to having a meaning than "florepti xor bobble". You can tell that an em is pretty similar to being a soul but hand sanitizer is not really. You know some properties that souls are supposed to have. There are various secular accounts of what a soul is that basically match the intuiton (e.g. your personality).
I actually started this essay thinking "eh, I don't think this matters too much", but by the end of it I was just like "yeah, this checks out."
Suppose instead that the acquaintance approached me with a piece of paper that says "I, TurnTrout, give [acquaintance] ownership over
where these people feel the need to express their objections even before reading the full paper itself
I'd very much like to flag that my comment isn't meant to judge the contributions of your full paper. My comment was primarily judging your abstract and why it made me feel weird/hesitant to read the paper. The abstract is short, but it is important to optimize so that your hard work gets the proper attention!
(I had about half an hour at the time; I read about 6 pages of your paper to make sure I wasn't totally off-base, and then spent the rest of the time... (read more)
I very much agree with Eliezer about the abstract making big claims. I haven't read the whole paper, so forgive any critiques which you address later on, but here are some of my objections.
I think you might be discussing corrigibility in the very narrow sense of "given a known environment and an agent with a known ontology, such that we can pick out a 'shutdown button pressed' event in the agent's world model, the agent will be indifferent to whether this button is pressed or not."
The discussion of the HPMOR epilogue in this recent April Fool's thread was essentially online improv, where no one could acknowledge that without ruining the pretense. Maybe I should do more improv in real life, because I enjoyed it!
I think it's pretty obvious.
Oh, another thing: I think it was pretty silly that Eliezer had Harry & co. infer the existence of the AI alignment problem and then had Harry solve the inner alignment problem.
I only read the HPMOR epilogue because - let's be honest - HPMOR is what LessWrong is really for.
(HPMOR spoilers ahead)
It's not clear to me why we need this tag.
It seems to me that deliberation can expand the domain of the value function. If I don’t know of football per se, but I’ve played a sport before, then I can certainly imagine a new game and form opinions about it. So I’m not sure how large the minimal set of generator concepts is, or if that’s even well-defined.
For this to matter, our alignment researchers need to be at the cutting edge of AI capabilities, and they need to be positioned such that their work can actually be incorporated into AI systems as they are deployed.
If we become aware that a lab will likely deploy TAI soon, other informed actors will probably become aware as well. This implies that many people would be trying to influence and gain access to this lab. Therefore, we should already have AI alignment researchers in positions of power within the lab before this happens.
I took Raj up on this generous offer. I'll post updates in the next few weeks as to how SM compares to Anki!
But perhaps a better way forward would be to define a new concept of "Useful power" or something like that, which equals your share of the total power in a zero-sum game.
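(Reading the quoted proposal literally, it seems to amount to something like:)

$$\text{UsefulPower}_i \;=\; \frac{\mathrm{POWER}_i}{\sum_j \mathrm{POWER}_j}$$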
I don’t see why useful power is particularly useful, since it’s taking a non-constant-sum quantity (outside of Nash equilibria) and making it constant-sum, which seems misleading.
But I also don’t see a problem with the “better play -> less exploitability -> less total Power” reasoning. This feels like a situation where our naive intuitions about power are just wrong, and if you think about it more, the formal result reflects a meaningful phenomenon.
I somehow agree with both you and OP, and also I don't buy part of the lever analogy yet. It seems important that the levers not only look similar, but that they be close to each other, in order to expect users to reliably mess up. Similarly, strong tool AI will offer many, many affordances, and it isn't clear how "close" I should expect them to be in use-space. From the security mindset, that's sufficient cause for serious concern, but I'm still trying to shake out the expected value estimate for powerful tool AIs -- will they be thermonuclear-weapon-like (as in your post), or will mistakes generally look different?
before you have a chance to do something useful
That statement seems far too strong, at least if you aren’t just talking about a very narrow subset of AI safety research (part of MIRI’s agenda). At a glance, that website gauges a skillset associated with one flavor of proof-based mathematics. For proof-based AI safety work, I think that the more important and general skill is: can you make meaningful formal conjectures and then prove them?
Do you like football? Well “football” is a learned concept living inside your world-model. Learned concepts like that are the only kinds of things that it’s possible to “like”. You cannot like or dislike [nameless pattern in sensory input that you’ve never conceived of]. It’s possible that you would find this nameless pattern rewarding, were you to come across it. But you can’t like it, because it’s not currently part of your world-model. That also means: you can’t and won’t make a goal-oriented plan to induce that pattern.
This was a ‘click’ for me, thanks.
Thanks so much for your comment! I'm going to speak for myself here, and not for Jacob.
That being said, I'm a bit underwhelmed by this post. Not that I think the work is wrong, but it looks like it boils down to saying (with a clean formal shape) things that I personally find pretty obvious: playing better at a zero-sum (or constant-sum) game means that the other players have less margin to get what they want. I don't feel that either the formalization of power or the theorem brings me any new insight, and so I have trouble getting interested. Maybe I'm just
Thanks for the detailed reply!
I want to go a bit deeper into the fine points, but my general reaction is "I wanted that in the post". You make a pretty good case for a way to arrive at this definition that makes it particularly exciting. On the other hand, I don't think that stating a definition and proving a single theorem that has the "obvious" quality (whether or not it is actually obvious, mind you) is that convincing.
The best way to describe my interpretation is that I feel that you two went for the "scientific paper" style, but the current state... (read more)
Probably going to reply to the rest later (and midco can as well, of course), but regarding:
Coming back after reading more, do you use σ_{-i} to mean "the strategy profile for every process except i"? That would make more sense of the formulas (since you fix a_i, there's no reason to have a σ_i), but if that's the case, then this notation is horrible (no offense).
By the way, indexing the other strategies by -i instead of, let's say, j or k is quite unconventional and confusing.
Using "σ−i" to mean "the strategy ... (read more)