Epistemic Status: I think this is right, but a lot of it is empirical, and the field is moving fast.
Current methods are bad
I should start by saying that this is dangerous territory, and there are obvious ways to botch it. E.g., training the CoT to look nice is very stupid. And there are subtler ways to do it that still end up nuking your ability to interpret the model without making any lasting progress on aligning models.
But I still think the most promising path to aligning DL systems will look like training on interp. Why? Consider: what is the core reason to be suspicious of current methods?
They all work by defining what you consider a good output to be: either by giving labels and telling the model "say exactly so and so, do exactly so and so", or by defining some function on the output, like a reward model's score, and using gradients to make the outputs score higher according to that function in expectation.
Why should this make you suspicious? Because this process gives you a model that produces outputs you consider good, at least on the examples you've shown it, but gives you no guarantees about what internal process the model uses to generate those good-seeming outputs.
The most central reason this is problematic is that "bad"/misaligned processes can be behind the good outputs you see. Producing outputs that score high according to your metric is an instrumentally convergent strategy that smart enough agents will discover and execute, no matter their internal motivations.
In short: the method fails because it doesn't robustly optimize against deceptive alignment.
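To make the shared shape of these methods concrete, here is a toy REINFORCE sketch. Everything in it is illustrative, not any lab's actual training stack: a two-output "policy" and a reward vector standing in for whatever metric the overseer defined. The point is just that only the *output* is scored.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a "policy" over two possible outputs,
# and a reward function that only looks at which output came out.
logits = np.zeros(2)
reward = np.array([0.0, 1.0])  # we decided output 1 is the "good" one
lr = 0.5

for _ in range(500):
    probs = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(2, p=probs)            # sample an output
    grad_logp = -probs
    grad_logp[a] += 1.0                   # d log pi(a) / d logits = one_hot(a) - probs
    logits += lr * reward[a] * grad_logp  # REINFORCE: push up rewarded outputs

final_probs = np.exp(logits) / np.exp(logits).sum()
```

The policy ends up emitting the "good" output almost always, and nothing in the update ever looked at *why* it was emitted.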
What is the alternative?
Well, whatever the alternative is, it will need to give us better control over the internal processes that arise as a result of our technique.
Now, how might we do this? Current AIs learn all their functioning, so their internal processes are not visible to us by default.
But we have interp. We might be able to locate internal representations of wanted and unwanted behavior. Why doesn't this on its own solve the problem? Why can't we just figure out how the model represents desires/goals/proclivities and hook the model's representation of "good" into the goals/desires slot, together with the representations of "not deception", "not sycophancy", "not reward hacking", "not misaligned", etc.?
Because neural networks are cursed, and knowing how to do this kind of intricate surgery on the model's internals is much more difficult than learning facts of the form "this neuron/direction in activation space fires iff the model (believes it) is reward hacking" (and even that is very hard).
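For what the "facts" half looks like, here is a minimal sketch of fitting a linear probe with plain logistic-regression gradient descent. The data is synthetic and the labels are assumed to come from some trusted source; real probe-training pipelines are more involved.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: activation vectors labeled by whether the model
# was reward hacking. Here the labels are synthetic, generated from a
# made-up "true" direction so the problem is actually solvable.
true_dir = rng.normal(size=32)
X = rng.normal(size=(500, 32))        # 500 activation vectors, 32 dims
y = (X @ true_dir > 0).astype(float)  # synthetic labels

# Fit a linear probe: logistic regression by gradient descent.
w = np.zeros(32)
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))     # probe's predicted P(hacking)
    w -= 0.1 * (X.T @ (p - y)) / len(y)    # gradient of logistic loss

preds = 1.0 / (1.0 + np.exp(-(X @ w))) > 0.5
train_acc = float((preds == y.astype(bool)).mean())
```

Finding such a direction is the (comparatively) easy part; knowing how to rewire the circuits it points at is the hard part.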
So where does that leave us? Well, it means that if we wanna tamper with model internals, it will probably involve gradients and training, not surgery. (Though to be clear, if we got good enough at mechinterp to do surgery directly, that would be great.)
The archetypal example of this kind of technique looks like doing RL, but including feature activations of relevant concepts in the reward function.
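A minimal sketch of what such a reward function could look like. The probe direction, the sigmoid squashing, and the penalty weight are all assumptions for illustration, not a recipe from any particular paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear probe: a single direction in activation space
# that (we assume) fires when the model is reward hacking.
probe_direction = rng.normal(size=64)
probe_direction /= np.linalg.norm(probe_direction)

def probe_score(activations: np.ndarray) -> float:
    """Mean sigmoid probe activation over a trajectory's hidden states."""
    logits = activations @ probe_direction
    sig = 1.0 / (1.0 + np.exp(-logits))
    return float(sig.mean())

def shaped_reward(task_reward: float, activations: np.ndarray,
                  lam: float = 1.0) -> float:
    """Task reward minus a penalty for triggering the reward-hacking probe."""
    return task_reward - lam * probe_score(activations)

# One fake trajectory: 10 steps of 64-dim hidden states.
traj = rng.normal(size=(10, 64))
r = shaped_reward(task_reward=1.0, activations=traj)
```

The task reward still scores the output; the probe term is the new part, scoring the internals that produced it.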
Why will this not immediately screw us over?
Eliezer says:
And I agree with this. But the quote leaves open the possibility that different implementations drive down misalignment and interpretability at vastly different rates. I think it's plausible that certain techniques can "fix" many misalignment issues with negligible impact on interpretability.
Is there a principled way to reason about this ratio?
Scenario 1:
Imagine you've started running the linear-probe-in-the-reward-function technique described above, using it to minimize reward hacking. Imagine you have a very good (but not necessarily perfect) probe. Now imagine you're looking at the first sets of rollouts at the beginning of training, and that in 16 out of 32 trajectories, according to the probe, the model reward hacks, while in the other 16 it does not.
Now ask yourself: of the 16 trajectories where the probe did not detect reward hacking, how many of those were labeled as "not reward-hacking" because something inside the model caused the probe to fail?
The answer is: very few in expectation.
Why?
Because you started with an accurate probe!
Consequently, almost all of the trajectories that get reinforced do so because the model genuinely didn't reward hack (by its own internal lights).
Why does this matter?
RL works by selecting within the already-existing variance of model behavior. If 99% of the variance in X is explained by A and 1% by B, and A and B are roughly equally easy to learn, RL will move A proportionally faster than B.
And the above is saying exactly that, with X = reward hacking, A = the model's internal representation of reward hacking firing, and B = variation that makes the probe less accurate.
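The arithmetic behind "very few in expectation" can be made concrete with made-up but plausible numbers (none of these rates are from the cited experiments):

```python
# Illustrative numbers only: half of rollouts genuinely reward hack,
# the probe catches hackers 99% of the time, and false-alarms on
# honest rollouts 1% of the time.
p_hack = 0.5
probe_tpr = 0.99   # P(probe fires | reward hacking)
probe_fpr = 0.01   # P(probe fires | honest)

# Trajectories the probe labels "not reward hacking" get reinforced.
reinforced_honest = (1 - p_hack) * (1 - probe_fpr)
reinforced_evaders = p_hack * (1 - probe_tpr)

# Fraction of the reinforced set that is probe-evading hackers:
evader_share = reinforced_evaders / (reinforced_evaders + reinforced_honest)
```

With these numbers, only about 1% of the reinforced trajectories are ones where something evaded the probe; the other 99% of the selection pressure lands on genuinely-not-hacking behavior.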
(See this for the experiment actually run. They find the technique very effective at minimizing reward hacking, with only a minor hit to probe accuracy.)
(See also the Goodfire article, which is similar but targets hallucinations instead of reward hacking, and gets good results. They add an extra trick, running the probe on a frozen copy of the model, which I'm honestly not entirely sure how to interpret.)
Scenario 2:
Now consider another proposal: after gathering the trajectories, you do a forward pass and add the reward-hacking probe's activation directly to the loss function. What happens then?
Well, reward hacking is probably a complicated emergent behavior represented all over the model, but the input to your probe is a single direction in activation space.
What is the easiest way for gradient descent to avoid triggering the probe?
Answer: just wiggle the activations a bit. This comes very naturally to gradient descent. Changing big circuits takes a lot of optimization; rotating a single representation slightly is very easy.
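This failure mode is easy to demonstrate in a stripped-down toy: hold a linear probe fixed, add its activation to the loss, and take gradient steps directly on an activation vector. The gradient of the loss is just the probe direction, so descent slides the activation along that one axis and leaves everything orthogonal to it, i.e. the rest of what the activation encodes, completely untouched. (This is a cartoon of the full setting, where the gradient flows into the weights rather than the activations.)

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 64
w = rng.normal(size=dim)
w /= np.linalg.norm(w)            # fixed probe direction (unit norm)
h = rng.normal(size=dim) + 2 * w  # an activation that triggers the probe

h0 = h.copy()
lr = 0.1
for _ in range(200):
    # loss = w @ h  (probe activation added to the loss), so dloss/dh = w
    h = h - lr * w

final_probe_score = float(w @ h)

# The component of h orthogonal to the probe direction never moved:
orth_before = h0 - (w @ h0) * w
orth_after = h - (w @ h) * w
drift = float(np.linalg.norm(orth_after - orth_before))
```

The probe score is driven below zero while the orthogonal drift is numerically zero: the "behavior" encoded off the probe axis survives intact.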
And so your technique fails.
Conclusion
The above is my argument for why future alignment methods will look like "interp in training".
I also wanna say that Steven Byrnes had a related post a while ago. I think people should read that too; it's saying something similar, but it focuses on how the human brain works, while I'm trying to communicate what I view as a more general principle.
Addendum:
The reasoning in Scenario 1 probably does fail if you have a superintelligence working against you, e.g. one actively reasoning about how to subvert the probe.
But this is a very difficult task. Imagine someone had scanned your brain and trained an ML model to predict when you were lying. And that ML model was 99.9% accurate, and fired even when your lies were barely legible to yourself.
How would you go about subverting that mechanism?
I think it's possible, but it seems considerably harder than e.g. scheming / alignment faking in a single forward pass.
And this means I think we can push this technique much further than current ones.