All of TurnTrout's Comments + Replies

Without being familiar with the literature, why should I buy that we can informally reason about what is "low-frequency" versus "high-frequency" behavior? I think reasoning about "simplicity" has historically gone astray, and worry that this kind of reasoning will as well.

The Superficial Alignment Hypothesis is probably false

... in this remark I will appeal to my own idiosyncratic opinion about LLMs. I suspect that simple finetuning or simple prompting can't ensure that the model's responses won't be illegal, harmful, abusive, false, deceptive, etc.

But... the superficial alignment hypothesis isn't about "ensuring" the responses won't be X, right? Or "eliminating" the X responses. So point 7 reads like a non-sequitur to me.

We define the Superficial Alignment Hypothesis: A model’s knowledge and capabilities are learnt almos

... (read more)
Cleo Nardo · 3d · +5
CLARIFICATIONS: The way the authors phrase the Superficial Alignment Hypothesis is a bit vague, but they do offer a more concrete corollary. Regardless of what exactly the authors mean by the Hypothesis, it would be falsified if the Corollary were false. And I'm arguing that the Corollary is false.

(1-6) The LIMA results are evidence against the Corollary, because the LIMA results (post-filtering) are so unusually bare (e.g. no benchmark tests), and the evidence that they have released is not promising.

(7*) Here's a theoretical argument against the Corollary:
* Base models are harmful/dishonest/unhelpful because the model assigns significant "likelihood" to harmful/dishonest/unhelpful actors (because such actors contributed to the internet corpus).
* Finetuning won't help, because the small set of examples will be consistent with harmful/dishonest/unhelpful actors who are deceptive up until some trigger.
* This argument doesn't generalise to RLHF and ConstitutionalAI, because these break the predictorness of the model.

CONCESSIONS: The authors don't clarify what "sufficiently" means in the Corollary, so perhaps they have much lower standards, e.g. it's sufficient if the model responds safely 80% of the time.

Separately (though relatedly), each word in that sentence sure feels like the kind of thing that I do not feel comfortable leaning on heavily as I optimize strongly against it, and that hides a ton of implicit assumptions,

I feel pretty uncertain of what assumptions are hiding in your "optimize strongly against X" statements. Historically this just seems hard to tease out, and I wouldn't be surprised if I were just totally misreading you here.

That said, your writing makes me wonder "where is the heavy optimization [over the value definitions] coming fr... (read more)

I low-confidence think the context strengthens my initial impression. Paul prefaced the above quote as "maybe the simplest [reason for AIs to learn to behave well during training, but then when deployed or when there's an opportunity for takeover, they stop behaving well]." This doesn't make sense to me, but I historically haven't understood Paul very well.

EDIT: Hedging

That's why the title says "power-seeking can be predictive" not "training-compatible goals can be predictive". 

You're right. I was critiquing "power-seeking due to your assumptions isn't probable, because I think your assumptions won't hold" and not "power-seeking isn't predictive." I had misremembered the predictive/probable split, as introduced in Definitions of “objective” should be Probable and Predictive:

I don’t see a notion of “objective” that can be confidently claimed is:

  1. Probable: there is a good argument that the systems we build will ha
... (read more)

The issue with being informal is that it's hard to tell whether you are right. You use words like "motivations" without defining what you mean, and this makes your statements vague enough that it's not clear whether or how they are in tension with other claims.

It seems worth pointing out: the informality is in the hypothesis, which comprises a set of somewhat illegible intuitions and theories I use to reason about generalization. However, the prediction itself is what needs to be graded in order to see whether I was right. I made a prediction fairly like "... (read more)

This seems like excellent work. I'm excited to see these results, they seem to be strong evidence that "just add the 'truth steering vector'" works, albeit with finer-grained intervention on a subset of attention heads. That's great news.

Given my understanding of your results, I am now more optimistic about:

  1. Flexibly retargeting LLM behavior via activation engineering/ITI without damaging capabilities
  2. Conceptual and more specific steering vectors in LLMs (like secret-spilling vectors which get LLMs to divulge any secrets they've been given)
  3. Alignment overall.
... (read more)

RL creates agents, and RL seemed to be the way to AGI. In the 2010s, reinforcement learning was the dominant paradigm for those interested in AGI (e.g. OpenAI). RL lends itself naturally to creating agents that pursue rewards/utility/objectives. So there was reason to expect that agentic AI would be the first (and by the theoretical arguments, last) form that superintelligence would take.

Why are you confident that RL creates agents? Is it the non-stochasticity of optimal policies for almost all reward functions? The on-policy data collection of PPO? I think there... (read more)

Therefore, even though I consider us fortunate to be in a world where we might build AGI primarily through non-RL methods, I think the pressure to use RL to turn these systems into agents is still very high

Why do you think that RL will turn these systems into agents?

We have heard that Conjecture misrepresent themselves in engagement with the government, presenting themselves as experts with stature in the AIS community, when in reality they are not.

What does it mean for Conjecture to be "experts with stature in the AIS community"? Can you clarify what metrics comprise expertise in AIS -- are you dissatisfied with their demonstrated grasp of alignment work, or perhaps their research output, or maybe something a little more qualitative? 

Basically, this excerpt reads like a crisp claim of common knowledge ("in reality") but the content seems more like a personal judgment call by the author(s).

Omega · 5d · +3
Hi TurnTrout, thanks for asking this question. We're happy to clarify:

1. 'experts': We do not consider Conjecture at the same level of expertise as [edit] alignment leaders and researchers at other organizations such as Redwood, ARC, researchers at academic labs like CHAI, and the alignment teams at Anthropic, OpenAI and DeepMind. This is primarily because we believe their research quality [https://www.lesswrong.com/posts/9jvrQToSq3CYvoeHf/critiques-of-prominent-ai-safety-labs-conjecture#Low_quality_research] is low.
2. 'with stature in the AIS community': Based on our impression (from conversations with many senior TAIS researchers at a range of organizations, including a handful who reviewed this post and didn't disagree with this point) of the TAIS community, Conjecture is not considered a top alignment research organization within the community.

I don't think we should call this "algebraic value editing" because it seems overly pretentious to say we're editing the model's values. We don't even know what values are!

I phased out "algebraic value editing" for exactly that reason. Note that only the repository and prediction markets retain this name, and I'll probably rename the repo to activation_additions.

"Just take a love-minus-hate activation and add that to the prompt activation" sounds like an absolute newb idea. I like that idea but if I were trying to find the expert in a room then that statement would've disqualified them.

Very true.

  1. I was a literal mechint newb when I had that idea in January. Luckily transformer_lens doesn't work on convolutional networks, and so I had to come up with my own ideas.
  2. I also had never done much ML engineering and had just learned PyTorch and was just then getting into the weeds of ML empirics after MLAB in January. 
... (read more)

What part of the post you link rules this out? As far as I can tell, the thing you're saying is that a few factors influence the decisions of the maze-solving agent, which isn't incompatible with the agent acting optimally with respect to some reward function such that it produces training-reward-optimal behaviour on the training set.

In addition to my other comment, I'll further quote Behavioural statistics for a maze-solving agent:

We think the complex influence of spatial distances on the network’s decision-making might favor a ‘shard-like’ description: a

... (read more)

Without having processed your essay -- Please don't call your concept "sharp right turn." Not only is "sharp left turn" a non-descriptive name, "sharp right turn" is both non-descriptive and causes learning interference with the existing terminology. People will have a harder time remembering the definition of your concept and they'll also have a harder time remembering Nate's concept. EDIT: And I'd guess there are snappier handles for your idea, which seems worth shooting for.

Robert Miles · 10d · +7
Do we even need a whole new term for this? Why not "Sudden Deceptive Alignment"?
RomanS · 10d · +5
I propose the term Jasmine's alignment, as a reference to the sudden (and fake) alignment of Jasmine in this famous scene of Aladdin (1992), right after Jasmine has realized that there is a possibility of escape:

How can I make predictions in a way which lets me do data analysis? I want to be able to grep / tag questions, plot calibration over time, split out accuracy over tags, etc. Presumably exporting to a CSV should be sufficient. PredictionBook doesn't have an obvious export feature, and its API seems to not be working right now / I haven't figured it out yet. 

Trying to collate team shard's prediction results and visualize with plotly, but there's a lot of data processing that has to be done first. Want to avoid the pain in the future.

DirectedEvolution · 9d · +2
I pasted your question into Bing Chat, and it returned Python code to export your PredictionBook data to a CSV file. Haven't tested it. Might be worth looking into this approach?
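For what it's worth, once the predictions are out of PredictionBook (however you manage that), the calibration analysis itself is simple. A minimal sketch in plain Python; the (probability, outcome) pairs below are invented placeholders for whatever an export would actually contain:

```python
# Bin (forecast probability, binary outcome) pairs and compare the mean
# forecast in each bin to the empirical hit rate. The sample data is made
# up; substitute pairs parsed from your own PredictionBook export.
predictions = [
    (0.9, 1), (0.8, 1), (0.85, 0), (0.6, 1), (0.55, 0),
    (0.7, 1), (0.3, 0), (0.2, 0), (0.25, 1), (0.95, 1),
]

def calibration_table(preds, n_bins=5):
    bins = [[] for _ in range(n_bins)]
    for p, outcome in preds:
        bins[min(int(p * n_bins), n_bins - 1)].append((p, outcome))
    table = []
    for b in bins:
        if b:  # skip empty bins
            mean_forecast = round(sum(p for p, _ in b) / len(b), 2)
            hit_rate = round(sum(o for _, o in b) / len(b), 2)
            table.append((mean_forecast, hit_rate, len(b)))
    return table

table = calibration_table(predictions)
# each row: (mean forecast, empirical frequency, count); a calibrated
# forecaster has the first two entries roughly equal in every row
```

From there, plotting mean forecast against empirical frequency (e.g. with plotly, as mentioned above) is straightforward, and tags/dates just become extra columns to group by.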

I think that "reward as brain damage" is somewhat descriptive but also loaded. In policy gradient methods, reward leads to policy gradient which is parameter update. Parameter update sometimes is value drift, sometimes is capability enhancement, sometimes is "brain" damage, sometimes is none of the above. I agree there are some ethical considerations for this training process, because I think parameter updates can often be harmful/painful/bad to the trained mind.

But also, Paul's description[1] seems like a wild and un(der)supported view on what RL tra... (read more)
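For concreteness, the "reward leads to policy gradient which is parameter update" chain from the comment above can be shown in a few lines of REINFORCE. Everything here (the two-action bandit, the learning rate, the reward scheme) is an invented toy, not a claim about how any production RL system is configured:

```python
import math
import random

random.seed(0)
theta = [0.0, 0.0]  # policy parameters: logits over two actions

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(theta, lr=0.1):
    probs = softmax(theta)
    action = 0 if random.random() < probs[0] else 1
    reward = 1.0 if action == 0 else 0.0  # only action 0 is reinforced
    # gradient of log pi(action | theta) for a softmax policy:
    # one-hot at the sampled action, minus the action probabilities
    grad_log_pi = [-p for p in probs]
    grad_log_pi[action] += 1.0
    # reward scales the gradient; zero reward means zero parameter update
    return [t + lr * reward * g for t, g in zip(theta, grad_log_pi)]

for _ in range(500):
    theta = reinforce_step(theta)
# after training, the policy has concentrated on the reinforced action
```

The sketch makes the locality visible: the update reinforces whatever was actually sampled, filtered through exploration; nothing in it searches globally for the highest-reward policy.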

Wei Dai · 11d · +2
Here's a link to the part of interview where that quote came from: https://youtu.be/GyFkWb903aU?t=4739 [https://youtu.be/GyFkWb903aU?t=4739] (No opinion on whether you're missing redeeming context; I still need to process Nesov's and your comments.)

I think you're the one who's imposing a type error here. For "value functions" to be useful in modelling a policy, it doesn't have to be the case that the policy is acting optimally with respect to a suggestively-labeled critic - it just has to be the case that the agent is acting consistently with some value function.

Can you say more? Maybe give an example of what this looks like in the maze-solving regime?

What part of the post you link rules this out? As far as I can tell, the thing you're saying is that a few factors influence the decisions of the maze-

... (read more)
Vika · 10d · +2
The issue with being informal is that it's hard to tell whether you are right. You use words like "motivations" without defining what you mean, and this makes your statements vague enough that it's not clear whether or how they are in tension with other claims. (E.g. what I have read so far doesn't seem to rule out that shards can be modeled as contextually activated subagents with utility functions.)

An upside of formalism is that you can tell when it's wrong, and thus it can help make our thinking more precise even if it makes assumptions that may not apply. I think defining your terms and making your arguments more formal should be a high priority. I'm not advocating spending hundreds of hours proving theorems, but moving in the direction of formalizing definitions and claims would be quite valuable.

It seems like a bad sign that the most clear and precise summary of shard theory claims [https://www.lesswrong.com/posts/8ccTZ9ZxpJrvnxt4F/shard-theory-in-nine-theses-a-distillation-and-critical] was written by someone outside your team. I highly agree with this takeaway from that post: "Making a formalism for shard theory (even one that's relatively toy) would probably help substantially with both communicating key ideas and also making research progress." This work has a lot of research debt [https://distill.pub/2017/research-debt/], and paying it off would really help clarify the disagreements around these topics.
Vika · 11d · +4
Thanks Daniel for the detailed response (which I agree with), and thanks Alex for the helpful clarification. I agree that the training-compatible set is not predictive of how the neural network generalizes (at least under the "strong distributional shift" assumption in this post, where the test set is disjoint from the training set, which I think could be weakened in future work). The point of this post is that even though you can't generally predict behavior in new situations based on the training-compatible set alone, you can still predict power-seeking tendencies. That's why the title says "power-seeking can be predictive", not "training-compatible goals can be predictive".

The hypothesis you mentioned seems compatible with the assumptions of this post. When you say "the policy develops motivations related to obvious correlates of its historical reinforcement signals", these "motivations" seem like a kind of training-compatible goals (if defined more broadly than in this post). I would expect that a system that pursues these motivations in new situations would exhibit some power-seeking tendencies, because those correlate with a lot of reinforcement signals.

I suspect a lot of the disagreement here comes from different interpretations of the "internal representations of goals" assumption; I will try to rephrase that part better.

To be fair, the post sort of makes this mistake by talking about "internal representations", but I think everything goes thru if you strike out that talk.

I'm responding to this post, so why should I strike that out? 

The utility function formalism doesn't require agents to "internally represent a scalar function over observations". You'll notice that this isn't one of the conclusions of the VNM theorem.

The post is talking about internal representations.

DanielFilan · 10d · +4
The core claim of this post is that if you train a network in some environment, the agent will not generalize optimally with respect to the reward function you trained it on, but will instead be optimal with respect to some other reward function in a way that is compatible with training-reward-optimality, and that this means that it is likely to avoid shutdown in new environments. The idea that this happens because reward functions are "internally represented" isn't necessary for those results. You're right that the post uses the phrase "internal representation" once at the start, and some very weak form of "representation" is presumably necessary for the policy to be optimal for a reward function (at least in the sense that you can derive a bunch of facts about a reward function from the optimal policy for that reward function), but that doesn't mean that they're central to the post.
Vika · 11d · +2
The internal representations assumption was meant to be pretty broad, I didn't mean that the network is explicitly representing a scalar reward function over observations or anything like that - e.g. these can be implicit representations of state features. I think this would also include the kind of representations you are assuming in the maze-solving post, e.g. cheese shards / circuits. 

Physiological events associated with pregnancy (mostly hormones) rewire the mother's brain such that when she gives birth, she immediately takes care of the young, grooms them, etc., something she has never done before.

Salt-starved rats develop an appetite for salt and are drawn to stimuli predictive of extremely salty water

I've been wondering about the latter for a while. These two results are less strongly predicted by shard theoretic reasoning than by "hardcoded" hypotheses. Pure-RL+SL shard theory loses points on these two observations, and points to other mechanisms IMO (or I'm missing some implications of pure-RL+SL shard theory).

"There are theoretical results showing that many decision-making algorithms have power-seeking tendencies."

I think this is reasonable, although I might say "suggesting" instead of "showing." I think I might also be more cautious about further inferences which people might make from this -- like I think a bunch of the algorithms I proved things about are importantly unrealistic. But the sentence itself seems fine, at first pass.

This is awesome. As you have just shown, there are a ton of low-hanging activation additions just waiting to be found. Team shard has barely explored this large space of interventions. I encourage people to play around with activation additions more, via e.g. our demo colabs for GPT-2-XL (Colab Pro required) and GPT-2-small (Colab Pro not required). Though more sophisticated interventions (like the one you demonstrate) will require coding, and not just playing with our demo widget.

You looked at GPT-2-small. I injected your activation additions into GPT-2-X... (read more)

faul_sname · 11d · +6
Fixed. Updated the colab to try out this approach with a range of coefficients.

* From 0.001 to 0.01 seems to have very little effect ("He oversaw a handful of slow-moving major projects—such as the "Waterfront Park" which cost $18 million to build—and implemented a series of rapidly reforming safety ordinances")
* 0.02 to 0.1 seems to have effects like "model lapses in and out of French" and "names look French" ("In 1955, sent Soups Maryaine Hagné de la Breaise (de l'architecture spécialiste de la site des associations Actualities Mélenziques de New Orleans) as the journalist, known as a "pig cure," and then as "weird" mayor, in lieu of actualizing their real grievances.")
* 0.2 to 5 seems to steer the model to switch from English to French-shaped text ("1950 vivienes an un qué de neous nechien en zanappressant.")
* At 10, the model seems to decide that words like "le" and "en" and "mal" are as French as things get ("le le enne les le le dedan le renous en le arriu du recenac")

Confirmed that GPT-2-XL seems to also be unable to speak French. Continuing to scale up from there, I find that gpt-neo-2.7B can kinda-sorta speak sensical French. GPT-J-6B OOMs on me on Colab Pro, but I think I may be able to do some hackery with init_empty_weights() / load_checkpoint_and_dispatch(), or, failing that, use an 8-bit or even 4-bit version of GPT-J-6B -- I honestly doubt the loss in precision really matters for algebraic value editing, considering that the level of precision starts off at "take the difference between two things that seem like they might plausibly have a similar relationship".

Update: I have gotten GPT-J-6B up and running on Colab (link, it's a new one [https://colab.research.google.com/drive/1Rv1IFMrNmzW_Fgj9nSSShwC_PubMVz_S?usp=sharing]), and working alright with TransformerLens and montemac's algebraic_value_editing repo. GPT-J-6B is capable of speaking French, so I think this is a good model to do testing on. Now I'm fig
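For readers who haven't seen the underlying intervention: the recipe being swept here is plain vector arithmetic on a layer's activations. Take the difference between activations on two contrast prompts, scale it by a coefficient, and add it into the residual stream on a new prompt. A toy sketch of just that arithmetic; the two-dimensional "activations" and the prompt-to-vector table are fabricated for illustration (a real run hooks a transformer layer, e.g. via TransformerLens):

```python
# Toy illustration of the activation-addition recipe:
#   steering = act(A) - act(B);  new_stream = act(prompt) + coeff * steering
# The "activations" below are invented 2-d vectors, not real model internals.
FAKE_ACTS = {
    "Love": [2.0, 0.1],
    "Hate": [-2.0, 0.1],
    "I think dogs are": [0.0, 1.0],
}

def act(prompt):
    return FAKE_ACTS[prompt]

def add_scaled(a, b, coeff=1.0):
    return [x + coeff * y for x, y in zip(a, b)]

# "Love" minus "Hate": shared components (here, dimension 1) cancel out,
# leaving a direction that separates the two prompts
steering = add_scaled(act("Love"), act("Hate"), coeff=-1.0)

# inject into another prompt's residual stream with a chosen coefficient
steered = add_scaled(act("I think dogs are"), steering, coeff=5.0)
```

The coefficient sweep in the comment above then amounts to rerunning the last line with different values of `coeff` and inspecting the resulting completions.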

So then one approach is to ask everyone to avoid the word “agent” in cases where those intuitions don’t apply, and the other is to ask everyone to constantly remind each other that the “agents” produced by RL don’t necessarily have thus-and-such properties.

I think there's a way better third alternative: asking each reader to unilaterally switch to "policy." No coordination, no constant reminders, no communication difficulties (in my experience). I therefore don't see a case for using "agent" in the mentioned cases. 

I added to the post:

Don't wait for e

... (read more)

I think the embodiment distinction is interesting and hadn't thought of it before (note that I didn't understand your point until reading the replies to your comment). I'm not yet sure if I find this distinction worth making, though. I'd refer to the embodied system as a "trained system" or -- after reading your suggestion -- an "embodiment." Neither feels quite right to me, though.

I think you can be more generous in your interpretation of RL experts' words and read less error in.

What other, more favorable interpretations might I consider?

Oliver Sourbut · 8d · +3
Oh, I mean to refer to the rest of the comment and taking that sort of reading as a kind of innocent until proven guilty. I'll confess I was in a meeting yesterday and someone (a PhD student) made the obvious error of considering RL prerequisite to agentiness, perhaps (but not definitely) a consequence of exactly the conflation you're referring to in this post. Several people in the room were able to clarify. The context was a crossover between a DL lab (this mentioned PhD student's) and the safety research community in Oxford (me et al).

I'm... not demanding that the field of RL change? Where in the post did you perceive me to demand this? For example, I wrote that "I wouldn't say 'reinforcement function' in e.g. a conference paper." I also took care to write "This terminology is loaded and inappropriate for my purposes."

Each individual reader can choose to swap to "policy" without communication difficulties, in my experience:

Don't wait for everyone to coordinate on saying "policy." You can switch to "policy" right now and thereby improve your private thoughts about alignment, whether or n

... (read more)

Thanks for your patient and high-quality engagement here, Vika! I hope my original comment doesn't read as a passive-aggressive swipe at you. (I consciously tried to optimize it to not be that.) I wanted to give concrete examples so that Wei_Dai could understand what was generating my feelings.

I'm open to suggestions on how to phrase this differently when I next give this talk.

It's a tough question to say how to apply the retargetablity result to draw practical conclusions about trained policies. Part of this is because I don't know if trained policies ten... (read more)

Vika · 11d · +4
Thanks Alex! Your original comment didn't read as ill-intended to me, though I wish that you'd just messaged me directly. I could have easily missed your comment in this thread - I only saw it because you linked the thread in the comments on my post. Your suggested rephrase helps to clarify how you think about the implications of the paper, but I'm looking for something shorter and more high-level to include in my talk. I'm thinking of using this summary, which is based on a sentence from the paper's intro: "There are theoretical results showing that many decision-making algorithms have power-seeking tendencies." (Looking back, the sentence I used in the talk was a summary of the optimal policies paper, and then I updated the citation to point to the retargetability paper and forgot to update the summary...)

To be clear, I still endorse Parametrically retargetable decision-makers tend to seek power. Its content is correct, relevant, and nontrivial. The results, properly used, may enable nontrivial inferences about the properties of inner trained cognition. I don't really want to retract that paper. I usually just fantasize about retracting Optimal policies tend to seek power.

The problem is that I don't trust people to wield even the non-instantly-doomed results.

For example, one EAG presentation cited my retargetability results as showing that most rewar... (read more)

Wei Dai · 15d · +4
Thanks, this clarifies a lot for me.
Vika · 15d · +5
Sorry about the cite in my "paradigms of alignment" talk, I didn't mean to misrepresent your work. I was going for a high-level one-sentence summary of the result and I did not phrase it carefully. I'm open to suggestions on how to phrase this differently when I next give this talk. Similarly to Steven, I usually cite your power-seeking papers to support a high-level statement that "instrumental convergence is a thing" for ML audiences, and I find they are a valuable outreach tool. For example, last year I pointed David Silver to the optimal policies paper when he was proposing some alignment ideas to our team that we would expect don't work because of instrumental convergence. (There's a nonzero chance he would look at a NeurIPS paper and basically no chance that he would read a LW post.) The subtleties that you discuss are important in general, but don't seem relevant to making the basic case for instrumental convergence to ML researchers. Maybe you don't care about optimal policies, but many RL people do, and I think these results can help them better understand why alignment is hard. 

I want to note that I just reread Utility ≠ Reward and was pleasantly surprised by its treatment, as well as the hedges. I'm making an upwards update on these points having been understood by at least some thinkers, although I've also made a lot of downward updates for other reasons.

(Huh, I never saw this -- maybe my weekly batched updates are glitched? I only saw this because I was on your profile for some other reason.)

I really appreciate these thoughts!

But you then propose an RL scheme. It seems to me like it's still a useful form of critique to say: here are the upward errors in the proposed rewards, here is the policy that would exploit them.

I would say "that isn't how on-policy RL works; it doesn't just intelligently find increasingly high-reinforcement policies; which reinforcement events get 'exploited' depends on the explorat... (read more)

The model has been shaped to maximize its reward by any means necessary[2], even if it means suddenly delivering an invitation to a wedding party. This is weak evidence towards the "playing the training game" scenario.

This conclusion seems unwarranted. What we have observed is (Paul claiming the existence of) an optimized model which ~always brings up weddings. On what basis does one infer that "the model has been shaped to maximize its reward by any means necessary"? This is likewise not weak evidence for playing the training game. 

Thanks for the reply. This comment is mostly me disagreeing with you.[1] But I really wish someone had said the following things to me before I spent thousands of hours thinking about optimal policies. 

I agree that learning a goal from the training-compatible set is a strong assumption that might not hold. 

My point is not just that this post has made a strong assumption which may not hold. My point is rather that these results are not probable because the assumption won't hold. The assumptions are already known to not be good approximations ... (read more)

DanielFilan · 15d · +2
I think you're making a mistake: policies can be reward-optimal even if there's not an obvious box labelled "reward" that they're optimal with respect to the outputs of. Similarly, the formalism of "reward" can be useful even if this box doesn't exist, or even if the policy isn't behaving the way you would expect if you identified that box with the reward function. To be fair, the post sort of makes this mistake by talking about "internal representations", but I think everything goes thru if you strike out that talk.

THE MAIN THING I WANT TO TALK ABOUT

I think you're the one who's imposing a type error here. For "value functions" to be useful in modelling a policy, it doesn't have to be the case that the policy is acting optimally with respect to a suggestively-labeled critic - it just has to be the case that the agent is acting consistently with some value function. Analogously, momentum is conserved in classical mechanics, even if objects have labels on them that inaccurately say "my momentum is 23 kg m/s". The utility function formalism doesn't require agents to "internally represent a scalar function over observations". You'll notice that this isn't one of the conclusions of the VNM theorem.

ANOTHER THING I DON'T GET

What part of the post you link rules this out? As far as I can tell, the thing you're saying is that a few factors influence the decisions of the maze-solving agent, which isn't incompatible with the agent acting optimally with respect to some reward function such that it produces training-reward-optimal behaviour on the training set.

I still expect instrumental convergence from agentic systems with shard-encoded goals, but think this post doesn't offer any valid argument for that conclusion. 

I don't think these results cover the shard case. I don't think reward functions are good ways of describing goals in settings I care about. I also think that realistic goal pursuit need not look like "maximize time-discounted sum of a scalar quantity of world state." 

My point is not that instrumental convergence is wrong, or that shard theory makes different predictions. I just think that these results are not predictive of trained systems. 

I regret each of the thousands of hours I spent on my power-seeking theorems, and sometimes fantasize about retracting one or both papers. I am pained every time someone cites "Optimal policies tend to seek power", and despair that it is included in the alignment 201 curriculum. I think this work makes readers actively worse at thinking about realistic trained systems.

I think a healthy alignment community would have rebuked me for that line of research, but sadly I only remember about two people objecting that "optimality" is a horrible way of understanding trained policies. 

I think the basic idea of instrumental convergence is just really blindingly obvious, and I think it is very annoying that there are people who will cluck their tongues and stroke their beards and say "Hmm, instrumental convergence you say? I won't believe it unless it is in a very prestigious journal with academic affiliations at the top and Computer Modern font and an impressive-looking methods section."

I am happy that your papers exist to throw at such people.

Anyway, if optimal policies tend to seek power, then I desire to believe that optimal policies ... (read more)

It seems like just 4 months ago you still endorsed your second power-seeking paper:

This paper is both published in a top-tier conference and, unlike the previous paper, actually has a shot of being applicable to realistic agents and training processes. Therefore, compared to the original[1] optimal policy paper, I think this paper is better for communicating concerns about power-seeking to the broader ML world.

Why are you now "fantasizing" about retracting it?

I think a healthy alignment community would have rebuked me for that line of research, but s

... (read more)
Aryeh Englander · 16d · +6
You should make this a top level post so it gets visibility. I think it's important for people to know the caveats attached to your results and the limits on its implications in real-world dynamics.

You can use ChatGPT without helping train future models:

What if I want to keep my history on but disable model training?

...you can opt out from our use of your data to improve our services by filling out this form. Once you submit the form, new conversations will not be used to train our models.

Beforehand I was very confident that vector additions would work here, even though I knew that the fully connected additions didn't work. Before showing him the results, but after showing the results for the fully connected network, I asked TurnTrout for his prediction. He gave 85% that the additions would work.

I want to clarify that I had skimmed the original results and concluded that they "worked" in that 3-1 vectors got e.g. 1s to be classified as 3s. (This is not trivial, since not all 1 activations are the same!) However, those results "didn't work" ... (read more)

2Garrett Baker24d
Oh, sorry. Editing post with correction.

I think that capacity would be really nice. I think our results are maybe a very very rough initial version of that capacity. I want to caution that we should be very careful about making inferences about what concepts are actually used by the model. From a footnote:

Of course, there need not be a "wedding" feature direction in GPT-2-XL. What we have observed is that adding certain activation vectors will reliably produce completions which appear to us to be "more about weddings." This could take place in many ways, and we encourage people to avoid instantl

... (read more)
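As a rough illustration of the recipe under discussion (every name, shape, and the "model" itself are toy stand-ins I'm making up for exposition, not the actual GPT-2-XL code): a steering vector is the difference between activations on two prompts, scaled by a coefficient and added into the residual stream at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

def toy_layer(x, W):
    """Stand-in for one transformer block's residual-stream update."""
    return x + np.tanh(x @ W)

W = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)

# Hypothetical activations at some layer for two prompts,
# e.g. " weddings" vs. a blank prompt.
h_wedding = rng.normal(size=d_model)
h_blank = rng.normal(size=d_model)

# The steering vector is the *difference* of activations,
# scaled by an injection coefficient.
coeff = 4.0
steering_vec = coeff * (h_wedding - h_blank)

# At inference, add the vector into the residual stream before the layer.
# No weights are changed; only the activations are edited.
h_prompt = rng.normal(size=d_model)
steered = toy_layer(h_prompt + steering_vec, W)
unsteered = toy_layer(h_prompt, W)
```

Note the footnote's caveat still applies to the toy version: the addition reliably changes downstream activations, but that by itself doesn't establish a "wedding feature direction" exists in the model.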

Additionally, attention is run on the normalized x, meaning only the "unscaled" version of x is moved between positions.

Thanks for writing this up, I hadn't realized this. One conclusion I'm drawing is: If the values in the modified residual streams aren't important to other computations in later sequence positions, then a large-coefficient addition will still lead to reasonable completions. 

3Ulisse Mini1mo
Yeah, assuming by "not important" you mean "not relevant" (low attention score)
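A minimal numerical sketch of why this is true in a pre-LN transformer (toy dimensions, not real model code): whatever coefficient you use, the copy of the residual stream that attention reads has roughly fixed norm, because LayerNorm rescales it.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Pre-LN: attention reads this normalized copy of x, while the
    raw (unscaled) x continues down the residual stream."""
    x = x - x.mean()
    return x / np.sqrt((x ** 2).mean() + eps)

rng = np.random.default_rng(0)
d = 64
x = rng.normal(size=d)   # residual stream at one position
v = rng.normal(size=d)   # steering vector

for c in [1, 10, 1000]:
    seen_by_attention = layer_norm(x + c * v)
    # Norm stays ~sqrt(d) no matter how large c is; only the
    # *direction* attention sees changes (toward v as c grows).
    print(c, np.linalg.norm(seen_by_attention))
```

So a large-coefficient addition can't blow up the inputs to attention; it mostly rotates them toward the steering vector, while the unnormalized residual stream carries the full magnitude forward.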

Thanks so much, I really appreciate this comment. I think it'll end up improving this post/the upcoming paper. 

(I might reply later to specific points)

2jsteinhardt21d
Glad it was helpful!
TurnTrout1moΩ132614

I, like Dan H, find it pretty hard to engage with this post because I can't tell whether it's basically the same as the Ludwig Schmidt paper (my current assumption is that it is). The paragraph the authors added didn't really help in this regard.

The answer is: No, our work is very different from that paper. Here's the paragraph in question:

Editing Models with Task Arithmetic explored a "dual" version of our activation additions. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weig

... (read more)
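For readers unfamiliar with the task-arithmetic paper, here is the "dual" contrast in sketch form (flattened weight vectors as stand-ins; the variable names are illustrative, not that paper's API). Task vectors live in weight space, whereas activation additions edit activations at inference time and leave the weights untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical flattened weights of a base model, and of the same
# model after finetuning on some task.
base_weights = rng.normal(size=100)
finetuned_weights = base_weights + 0.1 * rng.normal(size=100)

# A "task vector" is the elementwise difference in *weight* space.
task_vector = finetuned_weights - base_weights

# Task arithmetic: add the vector to steer the model toward the task,
# subtract it to steer away; vectors for different tasks can be combined.
toward = base_weights + 1.0 * task_vector
away = base_weights - 1.0 * task_vector
```

The structural analogy to activation additions is just the "difference of two states, rescaled and re-added" recipe; the objects being differenced are entirely different.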

Hi Alex,

Let me first acknowledge that your write-up is significantly more thorough than pretty much all content on LessWrong, and that I found the particular examples interesting. I also appreciated that you included a related work section in your write-up. The reason I commented on this post and not others is because it's one of the few ML posts on LessWrong that seemed like it might teach me something, and I wish I had made that more clear before posting critical feedback (I was thinking of the feedback as directed at Oliver / Raemon's moderation norms, ... (read more)

4awg1mo
I think this entire thread shows why it's kind of silly to hold LessWrong posts to the same standard as a peer reviewed journal submission. There is clearly a lower bar to LessWrong posts than getting into a peer reviewed journal or even arXiv. And that's fine, that's as it should be. This is a forum, not a journal. That said, I also think this entire thread shows why LessWrong is a very valuable forum, due to its users' high epistemic standards.  It's a balance.

I added the following to the beginning:

Edit, 5/16/23: I think this post is beautiful, correct in its narrow technical claims, and practically irrelevant to alignment. This post presents a cripplingly unrealistic picture of the role of reward functions in reinforcement learning. I expect this post to harm your alignment research intuitions unless you've already inoculated yourself by deeply internalizing and understanding Reward is not the optimization target. If you're going to read one alignment post I've written, read that one. 

Follow-up work (Param

... (read more)

Thanks for the reference (and sorry for just now getting around to replying).

I think Bernheim's paper is somewhat related to the shard theory of human values. There are several commonalities, including

  • Rejecting the idea that humans secretly have "true preferences" or "utility functions"
  • Taking a stand against ad hoc / patchwork / case-by-case explanations of welfare-related decisions
  • Recognizing the influence of context on decision-making; via "frames" (this work) or "shard activation contexts" (shard theory)

However, I think that shard theory is not a rederi... (read more)

Here are very quick thoughts on a candidate approach, copied from a Discord conversation:

Possibly you could take a weaker/shallower transformer (like gpt2small) and then use model-stitching to learn a linear map from layer 3/12 of gpt2small to layer 12/48 of gpt2-xl, and then take a "detective story vector" in gpt2small, and then pass it through the learned linear transform to get a detective story vector in gpt2xl?

The main reason you'd do this is if you were worried that a normal steering vector in the powerful model wouldn't do the real thing you wanted

... (read more)
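The stitching step in that candidate approach could be as simple as a least-squares fit (again a toy sketch: the layer pairing, dimensions, and synthetic data here are hypothetical, and real residual streams may not be related this linearly):

```python
import numpy as np

rng = np.random.default_rng(0)
d_small, d_big, n = 8, 32, 200

# Paired activations at matched layers (hypothetically: layer 3/12 of
# the small model, layer 12/48 of the big one) on the same n inputs.
# Here the "big" activations are synthetic: a linear image plus noise.
A_small = rng.normal(size=(n, d_small))
true_map = rng.normal(size=(d_small, d_big))
A_big = A_small @ true_map + 0.01 * rng.normal(size=(n, d_big))

# Fit the stitching map by least squares: A_small @ W ~= A_big.
W, *_ = np.linalg.lstsq(A_small, A_big, rcond=None)

# Transport a steering vector found in the small model into the big one.
v_small = rng.normal(size=d_small)   # e.g. a "detective story" vector
v_big = v_small @ W
```

Whether the transported vector steers the big model the way the original steers the small one is exactly the empirical question the approach would need to test.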

This feels like... too strong of an inference, relative to available data? Maybe I misunderstand. If the claim is more "altered state relative to usual computational patterns", I'm on board.

That said, I have found it pretty interesting to think about what it would feel like to have "steering vectors" added to my cognition.

4Daniel Kokotajlo1mo
I agree it's mere speculation, I don't have more than 50% credence in it.

(I do think that animals care about the reinforcement signals and their tight correlates, to some degree, such that it's reasonable to gloss it as "animals sometimes optimize rewards." I more strongly object to conflating what the animals may care about with the mechanistic purpose/description of the RL process.)

The argument against weights was of the form "here's a strength activations has"; for it to be enough to dismiss the paper without discussion

I personally don't "dismiss" the task vector work. I didn't read Thomas as dismissing it by not calling it the concrete work he is most excited about -- that seems like a slightly uncharitable read? 

I, personally, think the task vector work is exciting. Back in Understanding and controlling a maze-solving policy network, I wrote (emphasis added):

Editing Models with Task Arithmetic explored a "dual" version of our

... (read more)

I don't have enough time to reply in depth, but the factors in favor of weight vectors and activation vectors both seem really complicated, and the balance still seems in favor of activation vectors, though I have reasonably high uncertainty.

Weight vectors are derived through fine-tuning. Insofar as you thought activation additions are importantly better than finetuning in some respects, and were already thinking about finetuning (eg via RLHF) when writing why you were excited about activation additions, I don't see how this paper changes the balance very ... (read more)

Note that there's still a market open for how activation additions interact with larger models; it would be nice if it had more liquidity:

I added m1,000 in liquidity.

This idea of determining whether a result is "obvious" in advance seems valuable, I hope it catches on.

I don't think I follow your procedure. Would you be willing to walk me through an example situation?

3Joseph Bloom1mo
Sure. Let's do it at EAG. :) 