I’m a staff AI engineer and researcher working with LLMs, and I have been interested in AI alignment, safety and interpretability for the last 17 years. I did research into this during SERI MATS summer 2025. I’m now looking for work on this topic in the London/Cambridge area of the UK.
I assume there will be a nontrivial period when the AI behaves corrigibly in situations that closely resemble its training environment, but would behave incorrigibly in some unusual situations.
As a rule of thumb, anything smart enough to be dangerous is dangerous because it can do scientific research and self-improve. If it can't tell when you're out-of-distribution and might need to generate some new hypotheses, it can't do scientific research, so it's not that dangerous. So yes, there might be some very unusual situation that decreases corrigibility: but for an ASI, just taking it out-of-training-distribution should pretty-reliably cause it to say "I know that I don't know what I'm doing, so I should be extra cautious/pessimistic, and that includes being extra-corrigible."
The engineering feedback loop will use up all its fuel
I discussed this with Jeremy Gillen in the comments of his post, and I'm still not clear what he meant by 'fuel' here. Possibly something to do with the problem of "fully-updated deference", a.k.a. the right to keep arbitrarily and inconsistently changing our minds?
This post will not address how hard it may be to ensure that an AI is corrigible, or the conflicts associated with an AI being corrigible to just one principal versus multiple principals. There may be important risks from AI being corrigible to the wrong person / people, but those are mostly outside the scope of this post.
This is where the difference between Corrigibility and Value Learning really kicks in. Consider two-or-more opposed groups of humans (two-or-more tech titans, nation states, whatever) with corrigible aligned ASIs: let's assume the ASIs are smart, and learn to predict what their principals would correct, and how to extrapolate this correctly to situations too complex for the principals to understand. But they do not do anything more moral or less confrontational than their principals: they just pursue their principals' goals with superhuman intelligence. This seems like a winner-take-all competition between principals who hopefully aren't actually sociopaths, and don't personally want to die, and thus don't want humanity to go extinct, but who also don't want to lose at power politics or games of Chicken.
On the other hand, suppose they had Value Learning ASIs. These learn human values, including, first of all: don't kill all the humans. Extinction is forever, and the badness of killing all the humans is roughly minus the number of quality-adjusted life-years there would have been in humanity's future lightcone if you hadn't killed all of them. This is hard to predict, but dominated by a long tail in which things go really well and humanity ends up spreading across the galaxy, giving a huge, literally astronomical number (like -10^25 or -10^30 quality-adjusted life-years). So really, don't kill all the humans. Also, don't let them wipe themselves out in a nuclear war. In fact, keep a lid on their foolish conflicts over resources and status.
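In back-of-the-envelope expected-value terms (the specific probabilities and payoffs below are made up purely for illustration):

$$\text{badness}(\text{extinction}) \;\approx\; -\,\mathbb{E}[\text{QALYs in humanity's future lightcone}] \;=\; -\sum_i p_i N_i,$$

which is dominated by the low-probability, astronomically-high-$N_i$ futures: even $p \sim 10^{-2}$ on $N \sim 10^{30}$ contributes around $-10^{28}$ on its own.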
Corrigibility-based alignment of superintelligence doesn't give you wisdom; value-learning-based alignment of superintelligence does. Superintelligence without wisdom is an x-risk: it's probably slower to kill us than unaligned superintelligence, but we still all die.
For more detail on this, see my posts Grounding Value Learning in Evolutionary Psychology: an Alternative Proposal to CEV and Requirements for a Basin of Attraction to Alignment.
My first thought on "filter that for accuracy somehow" was to generate, say, 5000 of them, sit down and laboriously read them all, then throw away (or even edit) the obviously wrong ones. Not exactly an easy technique for others to replicate, but often a reasonably good way to get an effective training set: prompt engineer, generate, filter manually, retrain smaller/simpler model. Sometimes you can even rinse and repeat on this approach, if your simpler model is generalizing usefully, or at least take a second look at cases where the trained model disagrees with your manual choice, and see if you were wrong and it's right — though obviously that gets riskier the more you let earlier versions of the model have input into the training data of later ones.
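For concreteness, roughly the loop I have in mind, as a sketch (the model name, prompts and file paths are placeholders, not anything from your setup):

```python
# Sketch of the generate -> manually filter -> retrain loop described above.
# Model name, prompts and file paths are illustrative placeholders.
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder
LONG_PROMPT = "<long, carefully engineered [Intervention] preamble goes here>\n"

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

def generate_candidate(context: str, max_new_tokens: int = 512) -> str:
    """Generate one candidate training example under the long prompt."""
    inputs = tok(LONG_PROMPT + context, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# 1. Generate ~5000 candidates.
contexts = [json.loads(line)["context"] for line in open("contexts.jsonl")]
candidates = [{"context": c, "response": generate_candidate(c)} for c in contexts[:5000]]
json.dump(candidates, open("candidates.json", "w"), indent=2)

# 2. Read them all by hand; delete (or edit) the obviously wrong ones and save
#    the keepers to filtered.json.
# 3. Fine-tune the smaller/simpler model on filtered.json, then inspect the
#    cases where it disagrees with the manual labels.
```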
I would definitely want to investigate this more, with different sizes of LoRAs and so forth. Other than the KV-cache issue, it isn't that expensive to do both (or randomly sample), in case there are situations where one approach does better and others where the other does better (e.g. CoT obfuscated vs. not). But maybe we'll find there really is a consistent answer and it's just unintuitive.
This might also be an interesting thing to apply mech-interp analysis to: for the activations that are altered by the LoRA masking choice, what do the resulting activation changes look like, and how does that correlate with the content of the tokens they're for?
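A sketch of the sort of analysis I mean, assuming the LoRA is loaded as a peft adapter (the model and adapter paths are placeholders):

```python
# Sketch: per-token activation differences induced by enabling the LoRA,
# for correlating with the content of the tokens. Paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder
ADAPTER = "path/to/confession-lora"         # placeholder

tok = AutoTokenizer.from_pretrained(BASE)
model = PeftModel.from_pretrained(AutoModelForCausalLM.from_pretrained(BASE), ADAPTER)
model.eval()

def hidden_states(text: str, adapter_on: bool) -> torch.Tensor:
    """Residual-stream hidden states at every layer, with the LoRA on or off."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        if adapter_on:
            out = model(**ids, output_hidden_states=True)
        else:
            with model.disable_adapter():
                out = model(**ids, output_hidden_states=True)
    return torch.stack(out.hidden_states).squeeze(1)   # (layers+1, seq, d_model)

text = "<a CoT segment produced under the [Intervention]>"   # placeholder
diff = hidden_states(text, True) - hidden_states(text, False)
per_token = diff.norm(dim=-1)                                # (layers+1, seq)
tokens = tok.convert_ids_to_tokens(tok(text)["input_ids"])
for token, norm in zip(tokens, per_token.mean(dim=0)):
    print(f"{token!r}\t{norm.item():.3f}")                   # which tokens move most
```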
This second factor is more important in practice, and LoRA-masking reduces accuracy.
This actively surprises me. When I was previously thinking about doing something similar to this, I was very focused on your "On the one hand" argument and saw access to the exact activations as key.
I'm wondering if the tradeoff here might depend on LoRA size? The more the LoRA modifies the model, the more important the first factor seems likely to be. But then perhaps the same is also true of the second factor.
The balance might also change if the model was deliberately obfuscating its CoT, for example by using steganography, or if it had access to neuralese.
Did you experiment with starting from a long, detailed [Intervention] prompt with a preamble describing the desired behavior (i.e. an honest confession: a different personality, but one based on a clear memory of previous thoughts), using that to generate training data, filtering it for accuracy somehow, and then using that data to distill the long [Intervention] prompt down to a LoRA that produces the same result with a much shorter version of the [Intervention], without the detailed preamble?
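To be concrete, the kind of distillation I'm imagining looks roughly like this (a sketch using peft; the model, prompts, paths and hyperparameters are all illustrative, and it assumes the filtered confession data from the step above):

```python
# Sketch: distill a long [Intervention] preamble into a LoRA that produces the
# same behavior from a much shorter trigger string. Everything here (model,
# prompts, paths, hyperparameters) is an illustrative placeholder.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE = "meta-llama/Llama-3.1-8B-Instruct"    # placeholder
LONG = "<detailed [Intervention] preamble describing the honest-confessor behavior>"
SHORT = "[Intervention]"

tok = AutoTokenizer.from_pretrained(BASE)
model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(BASE),
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)

# filtered.json: confessions generated under LONG + context, then manually
# filtered/edited for accuracy (the step discussed above).
data = json.load(open("filtered.json"))
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

model.train()
for epoch in range(2):
    for ex in data:
        prompt = ex["context"] + SHORT        # short trigger, no preamble
        full = tok(prompt + ex["response"], return_tensors="pt")
        prompt_len = len(tok(prompt)["input_ids"])   # assumes a clean token boundary
        labels = full["input_ids"].clone()
        labels[:, :prompt_len] = -100         # compute loss only on the response
        loss = model(**full, labels=labels).loss
        loss.backward()
        optim.step()
        optim.zero_grad()

model.save_pretrained("short-intervention-lora")   # the distilled adapter
```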
My intuition (i.e. wild-assed guess) is that a steering vector is probably too small to be optimal, but the optimal size of LoRA might be fairly small. An interesting thing to test.
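For a rough sense of scale (illustrative numbers for something like an 8B model; the rank and target modules are assumptions):

```python
# Rough capacity comparison (illustrative numbers for an ~8B model).
d_model, n_layers, rank = 4096, 32, 8
steering_vector = d_model                          # one direction at one layer
lora = n_layers * 2 * rank * (d_model + d_model)   # rank-8 LoRA on two square projections per layer
print(f"{steering_vector:,} vs {lora:,} trainable parameters")   # 4,096 vs 4,194,304
```

So even a small LoRA has roughly three orders of magnitude more degrees of freedom than a single steering direction, which is part of why my guess is that the optimum lies somewhere in between.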
Given the proliferation of prompt engineering techniques in recent years, this suggests that there is potential for further improvement by optimizing the [Intervention] string.
For example, the various automated jailbreaking techniques used to optimize semantic/roleplay jailbreaks could be applied instead to optimize [Intervention] strings. In a very high-dimensional space (like the space of all intervention strings), being the "attacker" and putting the misaligned model in the role of defender should be a strong position.
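As a sketch of what I mean, even something as crude as a greedy black-box hill-climb would be a start (judge() and mutate() are stand-ins for whatever scoring model and mutation operator you'd actually use):

```python
# Sketch: greedy black-box search over [Intervention] strings. judge() and
# mutate() are stand-ins for whatever scoring model / mutation operator you use.

def judge(intervention: str, prompts: list[str]) -> float:
    """Fraction of prompts on which this intervention elicits an accurate
    confession from the target model, as scored by a judge model or heuristic."""
    raise NotImplementedError  # stand-in

def mutate(intervention: str) -> str:
    """Return a variant of the intervention string, e.g. by asking another LLM
    to paraphrase it, or by inserting/deleting/swapping phrases."""
    raise NotImplementedError  # stand-in

def optimize(seed: str, prompts: list[str], steps: int = 200) -> str:
    """Greedy hill-climb; a jailbreak-style attack could use beam search,
    gradient signals, or an attacker LLM instead."""
    best, best_score = seed, judge(seed, prompts)
    for _ in range(steps):
        candidate = mutate(best)
        score = judge(candidate, prompts)
        if score > best_score:
            best, best_score = candidate, score
    return best
```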
[FWIW, that is basically the project I proposed to my MATS mentor — sadly we ended up doing something else, but anyone interested is welcome to run with this.]
It's easy to get a corrigible ASI not to use its persuasiveness on you: you tell it not to do that.
I think you need to think harder about that "hard to analyze" bit: it's the fatal flaw (as in x-risk) of the corrigibility-based approach. You don't get behavior any wiser or more moral than the principal. And 2–4% of principals are sociopaths (the figures for tech titans and heads of authoritarian states may well be higher).