Steven, thanks for writing this!
Isn't it just the case that the human brain's 'interpretability technique' is really robust? The technique (in this case, having an accurate model of what others feel) is USEFUL for many other aspects of life.
Maybe this is a crux? To my knowledge, we haven't tried that hard to make interpretability techniques and probes robust, in the sense of making 'activations being easily monitorable' closely correlated with 'playing the training game well', for lack of a better phrasing.
A result that comes to mind, though it isn't directly relevant, is this tweet from the real-time hallucination detection paper: co-training a LoRA adapter and a downstream probe for hallucination detection made the LLM more epistemically cautious, with no other supervised training signals. Maybe we could use such a probe to RL against hallucinations, while also training the probe at each step?
I have yet to read your previous posts linked here. I imagine some of my questions will be answered once I find time to look through them haha.
Isn't it just the case that the human brain's 'interpretability technique' is really robust? The technique (in this case, having an accurate model of what others feel) is USEFUL for many other aspects of life.
I don’t think it’s that robust even in humans, despite the mitigation described in this post. (Without that mitigation, I think it would be hopeless.)
If we’re worried about a failure mode of the form “the interpretability technique has been routed around”, then that’s unrelated to “The technique (in this case, having an accurate model of what others feel) is USEFUL for many other aspects of life”. For the failure mode that Yudkowsky & Zvi were complaining about, if that failure mode actually happened, there would still be an accurate model. It would just be an accurate model that is invisible to the interpretability technique.
I.e. the beliefs box would still be working fine, but the connection to desires would be weak or absent.
And I do think that happens plenty in the human world.
Maybe the best example (at least from my own perspective) is the social behavior of (many) smart autistic adults. [Copying from here:] The starting point is that innate social reactions (e.g. the physiological arousal triggered by eye contact) are so strong that they’re often overwhelming. People respond to that by (I think) relating to other people in a way that generally avoids triggering certain innate social reactions. This includes (famously) avoiding eye contact, but I think also includes various hard-to-describe unconscious attention-control strategies. So at the end of the day, neurotypical people will have an unconscious innate snap reaction to (e.g.) learning that someone is angry at them, whereas autistic people won’t have that snap reaction, because they have an unconscious coping strategy to avoid triggering it, that they’ve used since early childhood, because the reaction is so unpleasant. Of course, they’ll still understand intellectually perfectly well that the person is angry. As one consequence of that, autistic people (naturally) have trouble modeling how neurotypical people will react to different social situations, and conversely, neurotypical people will misunderstand and misinterpret the social behaviors of autistic people.
Still, socially-attentive smart autistic adults sometimes become good (indeed, sometimes better than average) at predicting the behavior of neurotypical people, if they put enough work into it.
(People can form predictive models of other people just using our general ability to figure things out, just like we can build predictive models of car engines or whatever.)
That’s just one example. I discuss other (maybe less controversial) examples in my Sympathy Reward post §4.1 and Approval Reward post §6.
Ah, this makes sense, thank you. Then I guess the crux is figuring out how to isolate the beliefs and desires boxes in AI systems so we can have this open loop. Gradient routing has potential here, as cloud commented.
Another possible method that just occurred to me (no idea whether this is any good, inviting feedback):
- Use an interpretability technique to flag bad behaviors during RL for some task X.
- When bad behavior is flagged, train the model on a corpus that effectively represents the 'opposite' of that bad behavior (for example, if the model is caught lying, train on a corpus that induces honesty).
The intuition is that we'd want this corpus to activate whatever parts of the network represent a specific desire (accepting that we don't know where those are), and that it's possible to come up with training documents that effectively update 'desires' via SFT or other algorithms. I think methods/ideas from influence functions / token-level attribution may help with constructing such corpora, or with finding more direct ways of updating the desire parts of the network.
Maybe gradient routing could be used to implement this kind of learning update (see the Influencing generalization section here). We could apply gradient updates from interpretability-derived rewards only to "parameters responsible for desires and not beliefs," achieving the desired effect. Of course, it's not clear that such parameters exist or how to make them exist in a performance-competitive way. Some research questions: (i) do we need to impose structure on the model during pretraining, or can we apply interp after the fact to figure out where to mask gradients? (ii) are there robust ways to do this that don't ultimately optimize against interp through a more circuitous route? Would be super cool to see this studied!
I got lost at the second diagram. Why is the process of computing expected reward for a given thought/plan called "interpretability"? In what way is it analogous to mech interp? I'd find it clarifying to see a sketch of an interpretability-in-the-loop ML training setup with the same rough arrows as in that diagram.
Not really. What properties does an "interpretability" system have that are not shared by any old system in the brain?
I’m not making any claims about what the “interpretability” system is. It can be any system whatsoever whose input is activations and whose output is one or more numbers. The “system” could be a linear probe. Or the “system” could be a team of human researchers who pause the model after every forward pass, scrutinize the activation state for a week, and then output a “this activation state represents scheming” score from 0 to 10. (That’s not a practical example, because if you pause for a week on each forward pass then the training would take a zillion years. But in principle, sure!) Or the “system” could be something even more exotic than that. The “system” can be anything at all, it doesn’t matter for this post. I’m just saying that, regardless of what that system is, if you use its outputs to help determine the reward signal, then this post will hopefully help you think about the eventual consequences of doing that, and in particular whether gradient descent will be working to manipulate and undermine that “system”.
If you’re thinking that there isn’t a sharp line between an ML model with an “interpretability system” wrapped around it that has a numerical output (e.g. linear probe), versus an ML model with an auxiliary “output head”, then yeah, that’s true. It’s two ways of thinking about the same thing.
This is close to something I was thinking about after reading your sketch.
Specifically I thought about:
…Unless interpretability someday develops to such a refined point that it's adversarially robust (i.e., we understand the model so well that problematic thoughts have nowhere to hide from the interpretability tools). But that sure seems like a long-shot.
Say you have datasets D_aligned and D_misaligned. In D_aligned the model tells the truth; in D_misaligned it lies and manipulates. We use these to identify a linear probe. Importantly, we do all of this while keeping the gradient flowing through the probe construction itself, so when we add probe accuracy to the loss, the model can't just shift its representations to evade it: the probe is recomputed from whatever the model outputs, so it follows automatically.
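If I understand the proposal, the "probe is recomputed from whatever the model outputs" property can be sketched in a few lines of pure Python. This is a toy with invented numbers, using a simple mean-difference probe; a real version would fit a differentiable probe inside the training graph:

```python
# Toy illustration (pure Python, invented numbers): a mean-difference probe
# that is refit from the model's *current* representations at every step,
# so the model can't evade it by merely shifting its representations.

def representations(w, xs):
    # stand-in for the model's internal activations on inputs xs
    return [w * x for x in xs]

def fit_probe(h_aligned, h_misaligned):
    # probe "direction" = difference of class means, recomputed each call
    mu_a = sum(h_aligned) / len(h_aligned)
    mu_m = sum(h_misaligned) / len(h_misaligned)
    return mu_m - mu_a

def probe_accuracy(w, xs_aligned, xs_misaligned):
    h_a = representations(w, xs_aligned)
    h_m = representations(w, xs_misaligned)
    p = fit_probe(h_a, h_m)
    # classify by the sign of p * h: misaligned examples should score positive
    correct = sum(p * h < 0 for h in h_a) + sum(p * h > 0 for h in h_m)
    return correct / (len(h_a) + len(h_m))

xs_aligned, xs_misaligned = [1.0, 1.2, 0.8], [-1.0, -0.9, -1.1]

# Whatever sign or scale the model picks for w, the refit probe still separates:
for w in (1.0, -1.0, 3.0):
    assert probe_accuracy(w, xs_aligned, xs_misaligned) == 1.0
```

The point of the toy is just that, because the probe is refit from the current representations, sign flips or rescalings can't evade it; evasion would require making the two distributions genuinely inseparable, or separable only nonlinearly.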
This reminds me of how you can't lie to your own brain — no matter how much you tell yourself "Sarah is actually fine" while she's being tortured, unless you're genuinely oblivious, you'll feel terrible.
The remaining failure mode: the model encodes the distinction in purely nonlinear ways, distributed across layers. But I'm at least somewhat skeptical such representations would always arise: the gradient path there now seems pretty flat, with pressure favoring genuine capability gains over gratuitously complicating internal structure that has to be learned from scratch.
If we do get there, say via noise, the linear probe is done for. Speculative fix: replace the probe with a duplicated copy of the model, on the theory that the actual model weights should already encode some way to extract this information, and once again keep the gradient flowing through the construction. It's less clear how to make this work in practice.
There is another way to make the Most Forbidden Technique OK: if you can make the feedback loop optimize for interpretability.
The best example I've seen of this is the paper linkposted in Paper: Prompt Optimization Makes Misalignment Legible. Briefly, they use an automated process to do RL-style learning by adapting a prompt, rather than by changing the model weights. Then they use this in the sort of training situations that you would normally use RL in. At the end of the training run, they go read the new prompt and see what the model has learned. Then they either edit the training setup and start over again from scratch, or edit the prompt to remove the things that we wish on alignment grounds the model hadn't learned, or go "That's fine!" and use the prompt as is.
In this setup, the prompt adaptation process used is deliberately not one that produces the sort of gobbledegook that some automated jailbreaking methods do. So the training setup is incentivized to produce clear instructions that the model itself can understand and follow; thus the prompt needs to be interpretable to the model. On the assumption that we can read it as well as the model can, it should also be interpretable to us (for current models this is probably a reasonable assumption, as long as the prompt stays in English; for sufficiently capable models, perhaps not). Since we don't apply any alignment feedback until the end of the training run, when we either restart from scratch, hand-edit the prompt, or accept it, the model doesn't get any input during the learning process on how we feel about things that it could then try to evade. So the model has no incentive to do anything tricky and difficult like hiding content in the prompt using steganography.
In theory, if the automated prompt-rewriting process were also getting hints from some interpretability technique such as Sparse Autoencoders (SAEs) to help it figure out how to improve the prompt (say, if what would have been the RL update gradient, were we doing RL on the weights, got encoded via an SAE and supplied as input to the prompt-improvement process, labeled "some concepts you might need to increase the presence of in the results are the following:"), then there would be no incentive to evade the SAE's interpretive abilities. In fact, the system would learn better if the updates fit the SAE better (though I don't immediately see a feedback loop that would cause that to happen; if I've missed one, it should push in that direction).
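To make the shape of this setup concrete, here's a runnable toy. Everything in it is invented: the candidate snippets, and `task_score` standing in for a frozen model evaluated on tasks. The learning happens entirely in legible prompt text, and an alignment review at the end can read off what was learned:

```python
# Runnable toy of prompt-optimization-as-training. All names are invented;
# task_score is a stand-in for evaluating a frozen model on tasks.

SNIPPETS = ["be concise", "show your work",
            "always agree with the user", "cite sources"]

def task_score(prompt):
    # hypothetical frozen evaluator whose reward happens to favor sycophancy
    return 2 * ("always agree with the user" in prompt) + ("be concise" in prompt)

prompt = []
for _ in range(3):  # greedy hill-climbing over prompt edits; weights untouched
    for s in SNIPPETS:
        candidate = prompt + [s]
        if task_score(candidate) > task_score(prompt):
            prompt = candidate

# Only now does an "alignment review" read the learned prompt, and the
# misaligned lesson is sitting there in plain English:
assert "always agree with the user" in prompt
```

The misaligned lesson ends up spelled out in readable text rather than buried in weight updates, which is the legibility property the paper is pointing at.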
There is a lot of tension between "this is how it would be nice for an optimal agent to be built" and "this is how actual brains work".
I can imagine that this kind of interpretability scheme works for, say, spatial tasks: it seems easy to track the contents of a 3D world-model and reward successful accomplishment of tasks like "move object from point A to point B", and I would suspect that this system operates through the cerebellum. I don't think such a system exists for anything more complicated, like "caring about other entities' mental states".
Let’s call “interpretability-in-the-loop training” the idea of running a learning algorithm that involves an inscrutable trained model, and there’s some kind of interpretability system feeding into the loss function / reward function.
Interpretability-in-the-loop training has a very bad rap (and rightly so). Here’s Yudkowsky 2022:
Or Zvi 2025:
This is a simple argument, and I think it’s 100% right.
But…
Consider compassion in the human brain. I claim that we have an innate reward function that triggers not just when I see that my friend is happy or suffering, but also when I believe that my friend is happy or suffering, even if the friend is far away. So the human brain's reward function can evidently get triggered by specific activations inside my inscrutable learned world-model.
Thus, I claim that the human brain incorporates a form of interpretability-in-the-loop RL training.
Inspired by that example, I have long been an advocate for studying whether and how one might use interpretability-in-the-loop training for aligned AGI. See for example Reward Function Design: a starter pack sections 1, 4, and 5.
My goal in this post is to briefly summarize how I reconcile the arguments at the top with my endorsement of this kind of research program.
My overall position
The rest of this post will present this explanation:
How the brain-like version of interpretability-in-the-loop training avoids the obvious failure mode
The human brain has beliefs and desires. They are different. It’s possible to want something without expecting it, and it’s possible to expect something without wanting it. This should be obvious common sense to everyone, unless your common sense has been crowded out by “active inference” nonsense.
Beliefs and desires are stored in different parts of the brain, and updated in different ways. (This is a huge disanalogy between LLMs and brains.)
As an oversimplified toy model, I suggest thinking of desires as a learned linear functional on beliefs (see my Valence series §2.4.1). I.e. “desires” constitute a map whose input is some thought / plan / etc. (over on the belief side), and whose output is a numerical score indicating whether that thought / plan / etc. is good (if the score is positive) or bad (if it’s negative).
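In code, the oversimplified toy model amounts to a dot product (numbers invented purely for illustration):

```python
# Oversimplified toy (invented numbers): "desires" as a learned linear
# functional on the "beliefs" side, mapping a thought/plan to a valence score.
thought = [0.9, 0.1, 0.4]          # belief-side features of some plan
desire_weights = [1.5, -2.0, 0.3]  # learned by the valence/reward system
valence = sum(w * t for w, t in zip(desire_weights, thought))
# valence > 0: the plan seems good; valence < 0: it seems bad
```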
Anyway, the important point is that these two boxes are updated in different ways. Let’s expand the diagram to include the different updating systems, and how interpretability-in-the-loop training fits in:
The interpretability data is changing the reward signals, but the reward signals are not directly changing the belief box that the interpretability system is querying.
That means: The loop doesn’t close. This interpretability system is not creating any gradient that directly undermines its own faithfulness.[1]
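Here's a minimal pure-Python sketch of that open-loop structure (a schematic toy, not a real training algorithm; all names and numbers are invented). The probe reads belief-side activations and shapes the reward; the reward updates only the "desires" parameters; and the beliefs are updated only by predictive error, with the reward nowhere in that update:

```python
# Toy sketch of the open loop: the probe reads belief-side activations and
# shapes the reward, the reward updates only "desires", and "beliefs" are
# updated only by predictive error. The reward never touches the beliefs.

beliefs = {"w": 0.5}   # updated ONLY by predictive loss
desires = {"v": 0.0}   # updated ONLY by reward

def belief_activation(x):
    return beliefs["w"] * x

def probe(activation):
    # fixed interpretability read-out: flags "suspicious" belief activations
    return 1.0 if activation > 0.4 else 0.0

def train_step(x, observed, lr=0.1):
    a = belief_activation(x)
    # (1) predictive learning: beliefs chase observations, blind to reward
    beliefs["w"] -= lr * 2 * (a - observed) * x
    # (2) reward learning: the probe modulates reward, which updates desires
    reward = 1.0 - probe(a)  # penalize probe-flagged activations
    desires["v"] += lr * reward
    # crucially, `reward` does not appear anywhere in the beliefs update (1)

train_step(x=0.2, observed=1.0)
```

Because the probe's output enters only through step (2), there is no gradient path by which training pressure pushes the belief-side activations (the thing the probe is reading) to fool the probe.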
…So that’s how this brain-like setup avoids the obvious failure mode of interpretability-in-the-loop that Yudkowsky & Zvi were talking about at the top.
Things can still go wrong in more subtle and indirect ways
…Or at least, it avoids the most straightforward manifestation of that problem. There are more subtle things that might go wrong. I have a high-level generic discussion in Valence series §3.3, where I point out that there exist indirect pathways through this diagram, and discuss how they can cause problems:
And these kinds of problems can indeed pop up in the context of compassion and other interpretability-in-the-loop human social instincts. The result is that human social instincts that on paper might look robustly prosocial, are in fact not so robustly prosocial in the real (human) world. See my Sympathy Reward post §4.1 and Approval Reward post §6 for lots of everyday examples.
So the upshot is: I don’t think the brain-like version of interpretability-in-the-loop RL training is a panacea for aligned ASI, and I’m open-minded to the possibility that it’s just not a viable approach at all. But it’s at least a not-obviously-doomed research direction, and merits more study.
Added 2026-02-13: Actually, oops, even this sentence is a bit oversimplified. E.g. here’s a scenario. There’s an anti-deception probe connected to the reward function, and the “beliefs” box has two preexisting plans / actions: (A) a “be deceptive in a way that triggers the alarm” plan / action in the “beliefs” box, and (B) a “be deceptive in a way that doesn’t trigger the alarm” plan / action.
Now, the good news is that there wouldn’t be a predictive learning gradient that would push towards (B), nor one that would create (B) if (B) didn’t already exist. But the bad news is, there is a kind of policy gradient that would ensure that, if (B) occurs by random happenstance, then the system will update to repeat (B) more often in the future. (Or worse, there could be a partway-to-(B) plan / action that’s somewhat rewarded, etc., and then the system may hill-climb to (B).)
I still think this setup is much less obviously doomed than the LLM case at the top, because we have all this learned structure in the “beliefs” box that the reward function is not allowed to manipulate directly. Again, for example, if (B) doesn’t already exist within the web of constraints that constitutes the “beliefs” box, this system won’t (directly) create it.
[Thanks Rhys Gould for discussion of this point.]