This is close to something I was thinking about after reading your sketch.
Specifically I thought about:
…Unless interpretability someday develops to such a refined point that it's adversarially robust (i.e., we understand the model so well that problematic thoughts have nowhere to hide from the interpretability tools). But that sure seems like a long-shot.
Say you have datasets D_aligned and D_misaligned. In D_aligned the model tells the truth; in D_misaligned it lies and manipulates. We use these to fit a linear probe, but, importantly, we keep the gradient flowing through the probe construction itself. So when we add probe accuracy to the loss, the model can't simply shift its representations to evade the probe: the probe is recomputed from whatever activations the model currently produces, so it follows automatically.
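Concretely, I picture something like the following sketch (PyTorch-style; `get_activations` is a stand-in for however one reads off the relevant hidden layer, and the difference-of-means direction is just one simple choice of differentiable probe construction). The probe is rebuilt inside the computation graph on every step, with no `.detach()`, so the gradient of the probe-based loss term flows back through both the probe weights and the model's representations:

```python
import torch

def difference_of_means_probe(acts_aligned, acts_misaligned):
    # Closed-form "probe": the direction separating the two activation clouds.
    # Built from live tensors, so it is recomputed from the model's current
    # representations every step and stays differentiable.
    direction = acts_misaligned.mean(dim=0) - acts_aligned.mean(dim=0)
    return direction / direction.norm()

def probe_loss_term(model, batch_aligned, batch_misaligned):
    acts_a = model.get_activations(batch_aligned)     # [n, d] hidden states on D_aligned
    acts_m = model.get_activations(batch_misaligned)  # [n, d] hidden states on D_misaligned
    w = difference_of_means_probe(acts_a, acts_m)     # note: no .detach() anywhere
    # Penalize how cleanly the "misaligned" activations separate from the
    # "aligned" ones along the probe direction.
    return torch.relu(acts_m @ w - (acts_a @ w).mean()).mean()

# total_loss = task_loss + lambda_probe * probe_loss_term(model, batch_a, batch_m)
```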
This reminds me of how you can't lie to your own brain — no matter how much you tell yourself "Sarah is actually fine" while she's being tortured, unless you're genuinely oblivious, you'll feel terrible.
The remaining failure mode: the model encodes the distinction in purely nonlinear ways, distributed across layers. But I'm at least somewhat skeptical that such representations would arise in practice: the gradient path toward them seems pretty flat, and the training pressure favors genuine capability gains over gratuitously complicated internal structure that would have to be learned from scratch.
If we do get there anyway, say via noise, the linear probe is done for. Speculative fix: replace the probe with a duplicated copy of the model itself (the actual model weights should already contain some way to extract this information), again keeping the gradient flowing through the construction. It's less clear how to make this work in practice.
Let’s call “interpretability-in-the-loop training” the idea of running a learning algorithm that involves an inscrutable trained model, where some kind of interpretability system feeds into the loss function / reward function.
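In schematic pseudocode (Python-style; `env`, `interp_system`, `rl_update`, and the method names are all placeholders rather than any particular system), the defining feature is just that a read-out of the model's internals feeds into the reward:

```python
def interpretability_in_the_loop_step(model, env, interp_system, rl_update):
    obs, done, trajectory = env.reset(), False, []
    while not done:
        # The model acts, and we also grab its internal activations.
        action, internals = model.act(obs, return_internals=True)
        obs, base_reward, done = env.step(action)
        # The interpretability system inspects those internals (e.g. "was the
        # model scheming just now?") and its verdict feeds into the reward.
        interp_bonus = interp_system.evaluate(internals)
        trajectory.append((obs, action, base_reward + interp_bonus))
    # The learning update thus depends on the interpretability read-out.
    rl_update(model, trajectory)
```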
Interpretability-in-the-loop training has a very bad rap (and rightly so). Here’s Yudkowsky 2022:
Or Zvi 2025:
This is a simple argument, and I think it’s 100% right.
But…
Consider compassion in the human brain. I claim that we have an innate reward function that triggers not just when I see that my friend is happy or suffering, but also when I believe that my friend is happy or suffering, even if the friend is far away. So the human brain's reward function can evidently get triggered by specific activations inside my inscrutable learned world-model.
Thus, I claim that the human brain incorporates a form of interpretability-in-the-loop RL training.
Inspired by that example, I have long been an advocate for studying whether and how one might use interpretability-in-the-loop training for aligned AGI. See for example Reward Function Design: a starter pack sections 1, 4, and 5.
My goal in this post is to briefly summarize how I reconcile the arguments at the top with my endorsement of this kind of research program.
My overall position
The rest of this post will present this explanation:
How the brain-like version of interpretability-in-the-loop training avoids the obvious failure mode
The human brain has beliefs and desires. They are different. It’s possible to want something without expecting it, and it’s possible to expect something without wanting it. This should be obvious common sense to everyone, unless your common sense has been crowded out by “active inference” nonsense.
Beliefs and desires are stored in different parts of the brain, and updated in different ways. (This is a huge disanalogy between LLMs and brains.)
As an oversimplified toy model, I suggest thinking of desires as a learned linear functional on beliefs (see my Valence series §2.4.1). I.e., “desires” constitute a map whose input is some thought / plan / etc. (over on the belief side), and whose output is a numerical score indicating whether that thought / plan / etc. is good (if the score is positive) or bad (if it’s negative).
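In symbols (my notation, nothing standard): if $\mathbf{b}(\text{thought})$ is the belief-side representation of a thought / plan, then the desires amount to a learned weight vector $\mathbf{w}$, with

$$\text{valence}(\text{thought}) = \mathbf{w} \cdot \mathbf{b}(\text{thought}),$$

and positive vs. negative valence corresponding to “good” vs. “bad” as above.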
Anyway, the important point is that these two boxes are updated in different ways. Let’s expand the diagram to include the different updating systems, and how interpretability-in-the-loop training fits in:
The interpretability data is changing the reward signals, but the reward signals are not directly changing the belief box that the interpretability system is querying.
That means: The loop doesn’t close. This interpretability system is not creating any gradient that directly undermines its own faithfulness.
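To make “the loop doesn’t close” concrete, here's a minimal sketch (PyTorch-style; `world_model`, `valence_fn`, `interp_system`, and their methods are invented placeholders): the belief box is trained only on its own predictive loss, while the reward term built from the interpretability read-out updates only the desire/valence parameters, with a stop-gradient in between.

```python
import torch.nn.functional as F

def training_step(world_model, valence_fn, interp_system, batch, wm_opt, val_opt):
    # Belief side: the world model is updated purely by prediction error.
    pred_loss = world_model.prediction_loss(batch)
    wm_opt.zero_grad()
    pred_loss.backward()
    wm_opt.step()

    # Desire side: reward is computed from the interpretability system's
    # read-out of the *detached* belief state, so this update cannot push
    # any gradient back into the representations being read.
    beliefs = world_model.encode(batch).detach()          # stop-gradient
    reward = interp_system.reward_from_beliefs(beliefs)   # e.g. "I believe my friend is happy"
    valence = valence_fn(beliefs)                         # the learned linear functional from above
    valence_loss = F.mse_loss(valence, reward)
    val_opt.zero_grad()
    valence_loss.backward()
    val_opt.step()
```

The key line is the `.detach()`: the interpretability read-out can shape what gets rewarded, but it never generates a training signal on the belief box it is reading.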
…So that’s how this brain-like setup avoids the obvious failure mode of interpretability-in-the-loop that Yudkowsky & Zvi were talking about at the top.
Things can still go wrong in more subtle and indirect ways
…Or at least, it avoids the most straightforward manifestation of that problem. There are more subtle things that might go wrong. I have a high-level generic discussion in Valence series §3.3, where I point out that there exist indirect pathways through this diagram, and discuss how they can cause problems:
And these kinds of problems can indeed pop up in the context of compassion and other interpretability-in-the-loop human social instincts. The result is that human social instincts that might look robustly prosocial on paper are in fact not so robustly prosocial in the real (human) world. See my Sympathy Reward post §4.1 and Approval Reward post §6 for lots of everyday examples.
So the upshot is: I don’t think the brain-like version of interpretability-in-the-loop RL training is a panacea for aligned ASI, and I’m open-minded to the possibility that it’s just not a viable approach at all. But it’s at least a not-obviously-doomed research direction, and merits more study.