Alex Turner, independent researcher working on AI alignment. Reach me at turner.alex[at]berkeley[dot]edu.


Interpreting a Maze-Solving Network
Thoughts on Corrigibility
The Causes of Power-seeking and Instrumental Convergence
Reframing Impact
Becoming Stronger


a lot of interpretability work that performs act-add like ablations to confirm that their directions are real

Minor clarifying point: Act-adds cannot be cast as ablations. Do you mean to say that the interp work uses activation addition to confirm real directions? Or that they use activation ablation/resampling/scrubbing?

ITI is basically act adds but they compute act adds with many examples instead of just a pair

Yup, ITI was developed concurrently, and (IIRC, private correspondence) was inspired by their work on Othello-GPT. So this is another instance of interp leading to an alignment technique (albeit two independent paths leading to a similar technique).
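
To make the "pair vs. many examples" distinction concrete, here is a toy sketch on synthetic activations. Everything in it is an assumption for illustration: the activations are random draws around a made-up `true_direction`, and the function names are hypothetical. It is not ITI's actual implementation (which intervenes on selected attention heads); it only shows why averaging over many contrast examples tends to give a cleaner steering vector than a single pair.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # hypothetical activation width

# Synthetic stand-in for cached activations at one layer (rows = prompts).
# Assumption: the two classes differ along a single made-up direction.
true_direction = np.zeros(d_model)
true_direction[0] = 1.0
positive = rng.normal(size=(200, d_model)) + true_direction
negative = rng.normal(size=(200, d_model)) - true_direction


def pair_steering_vector(pos_act, neg_act):
    """Act-add style: activation difference for one contrast pair."""
    return pos_act - neg_act


def mean_diff_steering_vector(pos_acts, neg_acts):
    """Many-example style: difference between the two class means."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def steer(activation, vector, coeff=1.0):
    """Add a scaled steering vector to an activation during a forward pass."""
    return activation + coeff * vector


pair_vec = pair_steering_vector(positive[0], negative[0])
mean_vec = mean_diff_steering_vector(positive, negative)

print("pair vector vs. true direction:", cosine(pair_vec, true_direction))
print("mean-diff vector vs. true direction:", cosine(mean_vec, true_direction))
```

In this toy setup the single-pair vector carries the full per-example noise, while the mean difference averages it out, so the second cosine should be much closer to 1.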

As garrett says -- not clear that this work is net negative. Skeptical that it's strongly net negative. Haven't read deeply, though.

Probably not, but mostly because you phrased it as inductive biases to be washed away in the limit, or using gimmicks like early stopping.

LLMs aren't trained to convergence because that's not compute-efficient, so early stopping seems like the relevant baseline. No?

everyone who reads those seems to be even more confused after reading them

I want to defend "Reward is not the optimization target" a bit, while also mourning its apparent lack of clarity. The above is a valid impression, but I don't think it's true. For some reason, some people really get a lot out of the post; others think it's trivial; others think it's obviously wrong, and so on. See Rohin's comment:

(Just wanted to echo that I agree with TurnTrout that I find myself explaining the point that reward may not be the optimization target a lot, and I think I disagree somewhat with Ajeya's recent post for similar reasons. I don't think that the people I'm explaining it to literally don't understand the point at all; I think it mostly hasn't propagated into some parts of their other reasoning about alignment. I'm less on board with the "it's incorrect to call reward a base objective" point but I think it's pretty plausible that once I actually understand what TurnTrout is saying there I'll agree with it.)

You write:

In what sense does, say, a tree search algorithm like MCTS or full-blown backwards induction not 'optimize the reward'?

These algorithms do optimize the reward. My post addresses the model-free policy gradient setting... [goes to check post] Oh no. I can see why my post was unclear -- it didn't state this clearly. The original post does state that AIXI optimizes its reward, and also that:

For point 2 (reward provides local updates to the agent's cognition via credit assignment; reward is not best understood as specifying our preferences), the choice of RL algorithm should not matter, as long as it uses reward to compute local updates. 

However, I should have stated up-front: This post addresses model-free policy gradient algorithms like PPO and REINFORCE. 
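
A minimal sketch of what "reward computes local updates" looks like in that setting, assuming a toy softmax policy over discrete actions. The function names and the learning rate are illustrative and not taken from the post; this is a bare REINFORCE step without a baseline, not PPO.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()


def reinforce_update(logits, action, reward, lr=0.1):
    """One REINFORCE step for a softmax policy parameterized by its logits.

    For a softmax policy, d/d(logits) log pi(action) = onehot(action) - pi.
    The reward only scales this update; the policy never takes reward as
    an input or represents it explicitly.
    """
    probs = softmax(logits)
    grad_log_prob = -probs
    grad_log_prob[action] += 1.0
    return logits + lr * reward * grad_log_prob


logits = np.zeros(2)

# Positive reward after taking action 1 makes action 1 more likely.
after_reward = reinforce_update(logits, action=1, reward=1.0)
print(softmax(after_reward))

# Zero reward leaves the policy unchanged.
after_no_reward = reinforce_update(logits, action=1, reward=0.0)
print(softmax(after_no_reward))
```

The update upweights whatever computation produced the rewarded action. A planner like MCTS, by contrast, would consult a reward or value estimate when choosing the action, which is the distinction drawn above.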

I don't know what other disagreements or confusions you have. In the interest of not spilling bytes by talking past you -- I'm happy to answer more specific questions.

I agree that with time, we might be able to understand. (I meant to communicate that via "might still be incomprehensible".)

All the models must converge on the same optimal solution for a deterministic perfect-information game like Othello and become value-equivalent, ignoring the full board state which is irrelevant to reward-maximizing.

Strong claim! I'm skeptical. (EDIT: That is, if you mean "in the limit" to apply to practically relevant systems we build in the future.) If so, do you have a citation for DRL convergence results relative to this level of expressivity, and reasoning for why realistic early stopping in practice doesn't matter? (Also, of course, even one single optimal policy can be represented by multiple different network parameterizations which induce the same semantics, e.g. with some using the WM and some using heuristics.)

I think the more relevant question is "given a frozen initial network, what are the circuit-level inductive biases of the training process?". I doubt one can answer this via appeals to RL convergence results.

(I skimmed through the value equivalence paper, but LMK if my points are addressed therein.)

a DRL agent only wants to maximize reward, and only wants to model the world to the extent that maximizes reward.

As a side note, I think this "agent only wants to maximize reward" language is unproductive (see "Reward is not the optimization target", and "Think carefully before calling RL policies 'agents'"). In this case, I suspect that your language implicitly equivocates between "agent" denoting "the RL learning process" and "the trained policy network":

As far as the RL agent is concerned, knowledge of irrelevant board state is a wasteful bug to be worked around or eliminated, no matter where this knowledge comes from or is injected.

(The original post was supposed to also have @Monte M as a coauthor; fixed my oversight.)

This paper enhances the truthful accuracy of large language models by adjusting model activations during inference. Using a linear probe, they identify attention heads which can strongly predict truthfulness on a validation dataset. During each forward pass at inference time, they shift model activations in the truthful directions identified by the probe.

While this paper did examine shifting along the probe direction, the authors found that this worked substantially worse than shifting along the mean activation difference between (about to say truthful thing) and (about to say untruthful thing). See Table 3.

AI cognition doesn't have to use alien concepts to be uninterpretable. We've never fully interpreted human cognition, either, and we know that our introspectively accessible reasoning uses human-understandable concepts.

Just because your thoughts are built using your own concepts, does not mean your concepts can describe how your thoughts are computed. 


The existence of a natural-language description of a thought (like "I want ice cream") doesn't mean that your brain computed that thought in a way which can be compactly described by familiar concepts. 

Conclusion: Even if an AI doesn't rely heavily on "alien" or unknown abstractions -- even if the AI mostly uses human-like abstractions and features -- the AI's thoughts might still be incomprehensible to us, even if we took a lot of time to understand them. 

I want to note that it's really hard to properly represent other people's views and intuitions, so I instead aimed to strawman each agenda ~equally[1] for brevity and humor.

A bunch of the presidents make critiques and defenses weaker than the ones I'd make. There are a bunch of real hot takes of mine in this video, generally channeled via Trump (who also drops a few pretty dumb takes IMO). (Which Trump-takes are dumb and which are based? Well, that's up to the viewer to figure out by thinking for themselves!)

  1. ^ With the exception of infrabayesianism, which wasn't treated seriously.

This is really cool. Great followup work!

I think this is enough to make a hypothesis on how the network works and how the goal misgeneralization happens:

  1. Somewhere inside the model, there is a set of individual components that respond to different inputs, and when they activate, they push for a particular action. Channel 121 is an example of such a component.
  2. The last layers somehow aggregate information from all of the individual components.
  3. Components sometimes activate for the action that leads to the cheese and sometimes for the action that leads to the top right corner.[9]
  4. If the aggregated "push" for the action leading to the cheese is higher than for the action leading to the top right corner, the mouse goes to the cheese. Otherwise, it goes to the top right corner.

I think this is basically a shard theory picture/framing of how the network works: Inside the model there are multiple motivational circuits ("shards") which are contextually activated (i.e. step 3) and whose outputs are aggregated into a final decision (i.e. step 4). 
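
As a toy rendering of that picture (not the actual network): each component has a context-dependent activation and pushes toward one target, and the decision follows whichever target has the larger summed push, as in step 4. The class, field names, and numbers below are all hypothetical, and labeling components by target rather than by raw action is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str          # hypothetical label, e.g. a channel in the network
    target: str        # "cheese" or "top_right": where its favored action leads
    activation: float  # context-dependent strength; 0.0 means inactive


def aggregate_pushes(components):
    """Sum the activations of all components pushing toward each target."""
    pushes = {"cheese": 0.0, "top_right": 0.0}
    for c in components:
        pushes[c.target] += c.activation
    return pushes


def choose_target(components):
    """Go to the cheese only if its total push is strictly higher (step 4)."""
    pushes = aggregate_pushes(components)
    return "cheese" if pushes["cheese"] > pushes["top_right"] else "top_right"


# Cheese is nearby: cheese-tracking components fire strongly.
near_cheese = [
    Component("component_a", "cheese", 0.5),
    Component("component_b", "cheese", 0.25),
    Component("component_c", "top_right", 0.5),
]
print(choose_target(near_cheese))

# Cheese is far away: the same components are only weakly active.
far_from_cheese = [
    Component("component_a", "cheese", 0.125),
    Component("component_b", "cheese", 0.0),
    Component("component_c", "top_right", 0.5),
]
print(choose_target(far_from_cheese))
```

The same components exist in both contexts; only their activations change, which is the "contextually activated" part of the framing.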
