TurnTrout

Alex Turner, postdoctoral researcher at the Center for Human-Compatible AI. Reach me at turner.alex[at]berkeley[dot]edu.

Sequences

Interpreting a Maze-Solving Network
Thoughts on Corrigibility
The Causes of Power-seeking and Instrumental Convergence
Reframing Impact
Becoming Stronger

Comments

The Superficial Alignment Hypothesis is probably false

... in this remark I will appeal to my own idiosyncratic opinion about LLMs. I suspect that simple finetuning or simple prompting can't ensure that the model's responses won't be illegal, harmful, abusive, false, deceptive, etc.

But... the superficial alignment hypothesis isn't about "ensuring" the responses won't be X, right? Or "eliminating" the X responses. So point 7 reads like a non-sequitur to me.

We define the Superficial Alignment Hypothesis: A model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users.

Separately (though relatedly), each word in that sentence sure feels like the kind of thing that I do not feel comfortable leaning on heavily as I optimize strongly against it, and that hides a ton of implicit assumptions,

I feel pretty uncertain of what assumptions are hiding in your "optimize strongly against X" statements. Historically this just seems hard to tease out, and I wouldn't be surprised if I were just totally misreading you here.  

That said, your writing makes me wonder "where is the heavy optimization [over the value definitions] coming from?", since I think the preference-shards themselves are the things steering the optimization power. For example, the shards are not optimizing over themselves to find adversarial examples to themselves. Related statements:

  • I think that a realistic "respecting preferences of weak agents"-shard doesn't bid for plans which maximally activate the "respect preferences of weak agents" internal evaluation metric, or even do some tight bounded approximation thereof. 
  • A "respect weak preferences" shard might also guide the AI's value and ontology reformation process. 
  • A nice person isn't being maximally nice, nor do they wish to be; they are nicely being nice. 

I do agree (insofar as I understand you enough to agree) that we should worry about some "strong optimization over the AI's concepts, later in AI developmental timeline." But I think different kinds of "heavy optimization" lead to different kinds of alignment concerns.

I low-confidence think the context strengthens my initial impression. Paul prefaced the above quote as "maybe the simplest [reason for AIs to learn to behave well during training, but then when deployed or when there's an opportunity for takeover, they stop behaving well]." This doesn't make sense to me, but I historically haven't understood Paul very well.

EDIT: Hedging

That's why the title says "power-seeking can be predictive" not "training-compatible goals can be predictive". 

You're right. I was critiquing "power-seeking due to your assumptions isn't probable, because I think your assumptions won't hold" and not "power-seeking isn't predictive." I had misremembered the predictive/probable split, as introduced in Definitions of “objective” should be Probable and Predictive:

I don’t see a notion of “objective” that can be confidently claimed is:

  1. Probable: there is a good argument that the systems we build will have an “objective”, and
  2. Predictive: If I know that a system has an “objective”, and I know its behavior on a limited set of training data, I can predict significant aspects of the system’s behavior in novel situations (e.g. whether it will execute a treacherous turn once it has the ability to do so successfully).

Sorry for the confusion. I agree that power-seeking is predictive given your assumptions. I disagree that power-seeking is probable due to your assumptions being probable. The argument I gave above was actually: 

  1. The assumptions used in the post ("learns a randomly-selected training-compatible goal") assign low probability to experimental results, relative to other predictions which I generated (and thus relative to other ways of reasoning about generalization),
  2. Therefore the assumptions become less probable
  3. Therefore power-seeking becomes less probable (at least, due to these specific assumptions becoming less probable; I still think P(power-seeking) is reasonably large)

I suspect that you agree that "learns a training-compatible goal" isn't very probable/realistic. My point is then that the conclusions of the current work are weakened; maybe now more work has to go into the "can" in "Power-seeking can be probable and predictive." 

The issue with being informal is that it's hard to tell whether you are right. You use words like "motivations" without defining what you mean, and this makes your statements vague enough that it's not clear whether or how they are in tension with other claims.

It seems worth pointing out: the informality is in the hypothesis, which comprises a set of somewhat illegible intuitions and theories I use to reason about generalization. However, the prediction itself is what needs to be graded in order to see whether I was right. I made predictions fairly like "the policy tends to go to the top-right 5x5, and searches for cheese once there, because that's where the cheese-seeking computations were more strongly historically reinforced" and "the policy sometimes pursues cheese and sometimes navigates to the top-right 5x5 corner." These predictions are (informally) gradable, even if the underlying intuitions are informal. 

As it pertains to shard theory more broadly, though, I agree that more precision is needed. Increasing precision and formalism is the reason I proposed and executed the project underpinning Understanding and controlling a maze-solving policy network. I wanted to understand more about realistic motivational circuitry and model internals in the real world. I think the last few months have given me headway on a more mechanistic definition of a "shard-based agent."

This seems like excellent work. I'm excited to see these results; they seem to be strong evidence that "just add the 'truth steering vector'" works, albeit with finer-grained intervention on a subset of attention heads. That's great news.

Given my understanding of your results, I am now more optimistic about:

  1. Flexibly retargeting LLM behavior via activation engineering/ITI without damaging capabilities
  2. Conceptual and more specific steering vectors in LLMs (like secret-spilling vectors which get LLMs to divulge any secrets they've been given)
  3. Alignment overall.

We propose Inference-Time Intervention (ITI): shifting the activations along the difference of the two distribution means during inference time; model weights are kept intact.

In the language of Steering GPT-2-XL by adding an activation vector, this is an activation addition using the steering vector from averaging all of the activation vectors. The vector is applied to the  most truth-relevant heads at a given layer, as judged by linear probe validation accuracy.
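For concreteness, here is a minimal numpy sketch of the intervention as I understand it (the array shapes, variable names, and stand-in data are my own assumptions, not from the paper): the steering vector is the difference between the mean truthful and mean untruthful head activations, and at inference time it is added, scaled, to the selected heads' activations while the weights stay fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, head_dim = 200, 64

# Stand-in head activations gathered on truthful vs. untruthful statements
# (in the paper these come from a selected subset of attention heads).
truthful = rng.normal(loc=0.5, scale=1.0, size=(n_samples, head_dim))
untruthful = rng.normal(loc=-0.5, scale=1.0, size=(n_samples, head_dim))

# "Mass mean shift" steering vector: the difference of the two distribution means.
steering_vector = truthful.mean(axis=0) - untruthful.mean(axis=0)

def intervene(head_activation: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Inference-time intervention: shift the activation along the steering
    vector. Model weights themselves are untouched."""
    return head_activation + alpha * steering_vector

shifted = intervene(untruthful[0], alpha=2.0)  # same shape as the input activation
```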

It's quite surprising that the "mass mean shift"[1] outperforms the probe direction so strongly! This shows that the directions which the model uses to generate true or false statements are very different from the directions which get found by probing. Further evidence that probe directions are often not very causally relevant for the LLM's outputs.
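To illustrate how the two directions can come apart, here's a toy sketch (my own construction, not the paper's setup): when both classes share an anisotropic, correlated covariance, a logistic-regression probe converges toward a covariance-whitened direction, while the mass mean direction is just the raw difference of means, so the two need not align.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Two classes with the same anisotropic, correlated covariance but
# different means along the first axis.
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
L = np.linalg.cholesky(cov)
pos = np.array([1.0, 0.0]) + rng.normal(size=(n, 2)) @ L.T
neg = np.array([-1.0, 0.0]) + rng.normal(size=(n, 2)) @ L.T

# Mass mean direction: the raw difference of class means.
mass_mean = pos.mean(axis=0) - neg.mean(axis=0)

# Probe direction: logistic regression fit by plain gradient descent.
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(n), np.zeros(n)])
w, b = np.zeros(2), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * X.T @ (p - y) / len(y)
    b -= 0.5 * (p - y).mean()

# Cosine similarity lands well below 1: the probe direction and the
# mass mean direction disagree even though both separate the classes.
cosine = mass_mean @ w / (np.linalg.norm(mass_mean) * np.linalg.norm(w))
```

This doesn't explain why the mean-difference direction is the more causally relevant one for the model, but it does show the two estimators answer different questions.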

The transfer shown in table 4 seems decent and I'm glad it's there, but it would have been nice if the mass mean shift vector had transferred even more strongly. Seems like one can get a substantial general boost in truthfulness with basic activation engineering, but not all the way without additional insights.

We propose Inference-Time Intervention (ITI)

The "mass mean shift" technique seems like independent development of the "activation addition" technique from Understanding and controlling a maze-solving policy network and Steering GPT-2-XL by adding an activation vector (albeit with some differences, like restricting modification to top  heads). There's a question of "what should we call the technique?". Are "activation engineering" and "ITI" referring to the same set of interventions? 

It seems like the answer is "no", since you use "ITI" to refer to "adding in an activation vector", which seems better described as "activation addition." A few considerations:

  1. "ITI" has to be unpacked since it's an acronym
  2. "Inference-time intervention" is generic and could also describe zero-ablating the outputs of given heads, and so it seems strange to potentially use "ITI" to refer only to "adding in an activation vector"
  3. "Activation addition" is more specific.
  4. "Inference-time intervention" is more descriptive of how the technique works. 

Open to your thoughts here.

However, what is completely missing from LLMs is a good target other than minimizing pretraining loss. How to endow an aligned target is an open problem and ITI serves as my initial exploration towards this end.

Personally, I think that ITI is actually far more promising than the "how to endow an aligned target" question. 


In figure 5 in the paper, "Indexical error: Time" appears twice as an x-axis tick label?

Previous work has shown that ‘steering’ vectors—both trained and hand-selected—can be used for style transfer in language models (Subramani et al., 2022; Turner et al., 2023).

I think it'd be more accurate to say that steering vectors "can be used to control the style and content of language model generations"? 

  1. ^

    steering vector="avg difference between Truthful and Untruthful activations on the top  heads"

RL creates agents, and RL seemed to be the way to AGI. In the 2010s, reinforcement learning was the dominant paradigm for those interested in AGI (e.g. OpenAI). RL lends naturally to creating agents that pursue rewards/utility/objectives. So there was reason to expect that agentic AI would be the first (and by the theoretical arguments, last) form that superintelligence would take.

Why are you confident that RL creates agents? Is it the non-stochasticity of optimal policies for almost all reward functions? The on-policy data collection of PPO? I think there are a few valid reasons to suspect that, but this excerpt seems surprisingly confident. 

Therefore, even though I consider us fortunate to be in a world where we might build AGI primarily through non-RL methods, I think the pressure to use RL to turn these systems into agents is still very high

Why do you think that RL will turn these systems into agents?

We have heard that Conjecture misrepresent themselves in engagement with the government, presenting themselves as experts with stature in the AIS community, when in reality they are not.

What does it mean for Conjecture to be "experts with stature in the AIS community"? Can you clarify what metrics comprise expertise in AIS -- are you dissatisfied with their demonstrated grasp of alignment work, or perhaps their research output, or maybe something a little more qualitative? 

Basically, this excerpt reads like a crisp claim of common knowledge ("in reality") but the content seems more like a personal judgment call by the author(s).
