Epistemic Status: I think this is right, but a lot of it is empirical, and the field is moving fast.
Current methods are bad
I should start by saying that this is dangerous territory, and there are obvious ways to botch it. E.g., training the CoT to look nice is very stupid. And there are subtler ways to do it that still end up nuking your ability to interpret the model without making any lasting progress on aligning models.
But I still think the most promising path to aligning DL systems will look like training on interp. Why? Consider: what is the core reason to be suspicious of current methods?
They all work by defining what you consider a good output to be: either by giving labels and telling the model "say exactly so and so, do exactly so and so", or by defining some function on the output, like a reward model's score, and using gradients to make the outputs score higher according to that function in expectation.
Why should this make you suspicious? Because this process gives you a model that produces outputs you consider good, at least on the examples you've shown it, but gives you no guarantees about what internal process the model uses to generate those good-seeming outputs.
The most central reason this is problematic is that "bad"/misaligned processes can be behind the good outputs you see. Producing outputs that score high according to your metric is an instrumentally convergent strategy that smart enough agents will discover and execute, no matter their internal motivations.
In short: the method fails because it doesn't robustly optimize against deceptive alignment.
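To make the shared shape of these methods concrete, here is a toy REINFORCE sketch. Everything in it is illustrative, not any lab's actual training stack: a two-output "policy" and a reward vector standing in for whatever metric the overseer defined. The point is just that only the *output* is scored.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a "policy" over two possible outputs,
# and a reward function that only looks at which output came out.
logits = np.zeros(2)
reward = np.array([0.0, 1.0])  # we decided output 1 is the "good" one
lr = 0.5

for _ in range(500):
    probs = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(2, p=probs)            # sample an output
    grad_logp = -probs
    grad_logp[a] += 1.0                   # d log pi(a) / d logits = one_hot(a) - probs
    logits += lr * reward[a] * grad_logp  # REINFORCE: push up rewarded outputs

final_probs = np.exp(logits) / np.exp(logits).sum()
```

The policy ends up emitting the "good" output almost always, and nothing in the update ever looked at *why* it was emitted.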
What is the alternative?
Well, whatever the alternative is, it will need to give us better control over the internal processes that arise as a result of our technique.
Now, how might we do this? Current AIs learn all their functioning, so their internal processes are not visible to us by default.
But we have interp. We might be able to locate internal representations of wanted and unwanted behavior. Why doesn't this on its own solve the problem? Why can't we just figure out how the model represents desires/goals/proclivities and hook the model's representation of "good" into the goals/desires slot, together with the representations of "not deception", "not sycophancy", "not reward hacking", "not misaligned", etc.?
Because neural networks are cursed, and knowing how to do this kind of intricate surgery on the model's internals is much more difficult than learning facts of the form "this neuron/direction in activation space fires iff the model (believes it) is reward hacking" (and even that is very hard).
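For what the "facts" half looks like, here is a minimal sketch of fitting a linear probe with plain logistic-regression gradient descent. The data is synthetic and the labels are assumed to come from some trusted source; real probe-training pipelines are more involved.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: activation vectors labeled by whether the model
# was reward hacking. Here the labels are synthetic, generated from a
# made-up "true" direction so the problem is actually solvable.
true_dir = rng.normal(size=32)
X = rng.normal(size=(500, 32))        # 500 activation vectors, 32 dims
y = (X @ true_dir > 0).astype(float)  # synthetic labels

# Fit a linear probe: logistic regression by gradient descent.
w = np.zeros(32)
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))     # probe's predicted P(hacking)
    w -= 0.1 * (X.T @ (p - y)) / len(y)    # gradient of logistic loss

preds = 1.0 / (1.0 + np.exp(-(X @ w))) > 0.5
train_acc = float((preds == y.astype(bool)).mean())
```

Finding such a direction is the (comparatively) easy part; knowing how to rewire the circuits it points at is the hard part.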
So where does that leave us? Well, it means that if we wanna tamper with model internals, it will probably involve gradients and training, not surgery. (Though to be clear, if we got good enough at mechinterp to do surgery directly, that would be great.)
The archetypal example of this kind of technique looks like doing RL, but including feature activations of relevant concepts in the reward function.
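A minimal sketch of what such a reward function could look like. The probe direction, the sigmoid squashing, and the penalty weight are all assumptions for illustration, not a recipe from any particular paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear probe: a single direction in activation space
# that (we assume) fires when the model is reward hacking.
probe_direction = rng.normal(size=64)
probe_direction /= np.linalg.norm(probe_direction)

def probe_score(activations: np.ndarray) -> float:
    """Mean sigmoid probe activation over a trajectory's hidden states."""
    logits = activations @ probe_direction
    sig = 1.0 / (1.0 + np.exp(-logits))
    return float(sig.mean())

def shaped_reward(task_reward: float, activations: np.ndarray,
                  lam: float = 1.0) -> float:
    """Task reward minus a penalty for triggering the reward-hacking probe."""
    return task_reward - lam * probe_score(activations)

# One fake trajectory: 10 steps of 64-dim hidden states.
traj = rng.normal(size=(10, 64))
r = shaped_reward(task_reward=1.0, activations=traj)
```

The task reward still scores the output; the probe term is the new part, scoring the internals that produced it.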
Why will this not immediately screw us over?
Eliezer says:
And I agree with this. But the quote leaves open the possibility that different implementations drive down misalignment and interpretability at vastly different rates. I think it's plausible that certain techniques can "fix" many misalignment issues with negligible impact on interpretability.
Is there a principled way to reason about this ratio?
Scenario 1:
Imagine you've started running the linear-probe-in-the-reward-function technique described above, using it to minimize reward hacking. Imagine you have a very good (but not necessarily perfect) probe. Now imagine you're looking at the first sets of rollouts at the beginning of training, and that in 16 out of 32 trajectories, according to the probe, the model reward hacks, while in the other 16 it does not.
Now ask yourself: of the 16 trajectories where the probe did not detect reward hacking, how many of those were labeled as "not reward-hacking" because something inside the model caused the probe to fail?
The answer is: very few in expectation.
Why?
Because you started with an accurate probe!
Consequently, almost all of the trajectories that get reinforced do so because the model genuinely didn't reward hack (by its own internal lights).
Why does this matter?
RL works by selecting within the already-existing variance of model behavior. If 99% of the variance in X is explained by A and 1% by B, and A and B are roughly equally easy to learn, RL will move A proportionally faster than B.
And the above is saying exactly that, with X = reward hacking, A = the model's internal representation of reward hacking firing, and B = variation that makes the probe less accurate.
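The arithmetic behind "very few in expectation" can be made concrete with made-up but plausible numbers (none of these rates are from the cited experiments):

```python
# Illustrative numbers only: half of rollouts genuinely reward hack,
# the probe catches hackers 99% of the time, and false-alarms on
# honest rollouts 1% of the time.
p_hack = 0.5
probe_tpr = 0.99   # P(probe fires | reward hacking)
probe_fpr = 0.01   # P(probe fires | honest)

# Trajectories the probe labels "not reward hacking" get reinforced.
reinforced_honest = (1 - p_hack) * (1 - probe_fpr)
reinforced_evaders = p_hack * (1 - probe_tpr)

# Fraction of the reinforced set that is probe-evading hackers:
evader_share = reinforced_evaders / (reinforced_evaders + reinforced_honest)
```

With these numbers, only about 1% of the reinforced trajectories are ones where something evaded the probe; the other 99% of the selection pressure lands on genuinely-not-hacking behavior.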
(See this for the experiment actually run. They find the technique very effective at minimizing reward hacking, with only a minor hit to probe accuracy.)
(See also the Goodfire article, which is similar but targets hallucinations instead of reward hacking, and gets good results. They add an extra trick, running the probe on a frozen copy of the model, which I'm honestly not entirely sure how to interpret.)
Scenario 2:
Now consider another proposal: after gathering the trajectories, you do a forward pass and add the reward-hacking probe's activation directly to the loss function. What happens then?
Well, reward hacking is probably a complicated emergent behavior represented all over the model, but the input to your probe is a single direction in activation space.
What is the easiest way for gradient descent to avoid triggering the probe?
Answer: just wiggle the activations a bit. This comes very naturally to gradient descent. Changing big circuits takes a lot of optimization; rotating a single representation slightly is very easy.
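This failure mode is easy to demonstrate in a stripped-down toy: hold a linear probe fixed, add its activation to the loss, and take gradient steps directly on an activation vector. The gradient of the loss is just the probe direction, so descent slides the activation along that one axis and leaves everything orthogonal to it, i.e. the rest of what the activation encodes, completely untouched. (This is a cartoon of the full setting, where the gradient flows into the weights rather than the activations.)

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 64
w = rng.normal(size=dim)
w /= np.linalg.norm(w)            # fixed probe direction (unit norm)
h = rng.normal(size=dim) + 2 * w  # an activation that triggers the probe

h0 = h.copy()
lr = 0.1
for _ in range(200):
    # loss = w @ h  (probe activation added to the loss), so dloss/dh = w
    h = h - lr * w

final_probe_score = float(w @ h)

# The component of h orthogonal to the probe direction never moved:
orth_before = h0 - (w @ h0) * w
orth_after = h - (w @ h) * w
drift = float(np.linalg.norm(orth_after - orth_before))
```

The probe score is driven below zero while the orthogonal drift is numerically zero: the "behavior" encoded off the probe axis survives intact.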
And so your technique fails.
Conclusion
The above is my argument for why future alignment methods will look like "interp in training".
I also wanna say that Steven Byrnes had a related post a while ago. I think people should read that too; it's saying something similar, but it focuses on how the human brain works, while I'm trying to communicate what I view as a more general principle.
Addendum:
The reasoning in Scenario 1 probably does fail if you have a superintelligence working against you, e.g. one actively reasoning about how to subvert the probe.
But this is a very difficult task. Imagine someone had scanned your brain and trained an ML model to predict when you were lying. And that ML model was 99.9% accurate, and fired even when your lies were barely legible to yourself.
How would you go about subverting that mechanism?
I think it's possible, but it seems considerably harder than e.g. scheming / alignment faking in a single forward pass.
And this means I think we can push this technique much further than current ones.