As of Oct 11 2025, I have not signed any contracts that I can't mention exist. I'll try to update this statement at least once a year, so long as it's true. I added this statement thanks to the one in the gears to ascension's bio.
Thanks! Many great suggestions, most of which reflect stuff I've thought about.
how do you "induce misalignment?"
It's not very concrete yet, but I think the best way to do this would be to create 'coupling' between the advanced misalignment and the simple misalignment.
How well this coupling works in practice / whether it holds up under subsequent finetuning is an empirical qn, but seems exciting if it did work.
---
Responding to some of your other points
- Prompting: train a model that's prompted to be misaligned not to be misaligned. But this seems like it would just make the model ignore the prompt? Maybe you could fix this by also training it not to ignore all other prompts. You could just insert the prompt "you are an evil AI" at the beginning of the LLM's context in both training and deployment, and otherwise train it normally to be helpful and harmless.
- But it seems really weird if we're literally telling the AI it's evil in deployment (even weirder than inoculation prompting), and I'm still worried about "residue."
Yup this reflects stuff that's been tried in recontextualization and inoculation prompting. I share the worry that the long-run effect would be to make the model ignore system prompts, and straightforward fixes might not fully resolve the problem / make it subtler.
- Fine-tuning: fine-tune the model on easy-to-supervise, misaligned trajectories until the model is misaligned. But then, you fine-tune/RL the model... to become aligned again? It seems like this simply undoes the operation you just did. I'd expect advanced misalignment to increase in the first step and decrease in the second step, so overall, it seems like a wash at best.
- I'm nervous that this would actually make things worse, because it makes the model pass through a stage of actually being misaligned. I'm worried some misaligned "residue" could be left over. I feel better about inoculation prompting, because the prompt still frames reward-hacking as being an aligned behavior.
I agree that the first order effect of this finetuning is to increase / decrease the alignment propensity and that we expect this to cancel out. But there might be second order effects e.g. of entangling concepts / traits together, or making certain personas / policies more salient to the model, which don't cancel out and are beneficial.
I mostly believe this. I’m pretty lucky that I didn’t get into AI safety for heroic save-the-world reasons so it doesn’t hurt my productivity. I currently work on research aimed at reducing s-risk at CLR.
Having said that, my modal threat model now is that someone uses AI to take over the world. I would love for more people to work on closely scrutinising leaders of labs and other figures in power, or more generally work on trying to make the gains from transformative AI distributed by default
"Indirect alignment": a speculative idea for aligning models in hard-to-oversee settings
Problem: "Directly" aligning models might be hard sometimes, because it's hard to provide perfect oversight (e.g. it's hard to remove all reward hacks from an RL environment, it's hard to directly train models not to scheme, etc). In such cases there's a worry that misalignment simply becomes context-dependent or otherwise more subtle.
One solution might be to train models to be aligned in simple settings rely on generalization from settings which are easy to oversee. (bottom arrow). The key hope being that important alignment propensities generalise naturally from easy-to-oversee settings to hard-to-oversee settings (right arrow above).
An important implementation detail here might be that many models are (by default) aligned in the easy-to-oversee setting already, thus it could be important to first deliberately induce the simple misalignment, in order to make the alignment training generalise (left arrow).
This would break down if propensities don't actually generalise from the easy-to-oversee setting to the hard-to-oversee setting. Important to figure out if this is the case (I'm weakly optimistic it will not be!)
Enjoyed reading a recent draft post by Alex Mallen on predicting AI motivations by analyzing their selection pressures
--- a somewhat biased / selective tl;dr + comments
The "behavioural selection" principle:
We might reason about which specific cognitive patterns get selected based on selection pressures, as well as priors
Neel Nanda discussing the “science of misalignment” in a recent video. Timestamp 32:30. Link:
—- tl;dr
Basic science / methodology.
Scientifically understanding “in the wild” weird model behaviour
We need:
Yes, you’re right. That’s the actual distinction that matters. Will edit the comment
Summary of a dialogue between Habryka, Evan Hubinger, and Sam Marks on inoculation prompting, which I found illuminating and liked a lot. [LINK]
'inductive bias' is generally talking about a much lower level of abstraction than ideas like 'scheming'.
Yes, I agree with this - and I'm mainly interested in developing an empirical science of generalization that tries to grapple a lot more directly with the emergent propensities we are about. Hence why I try not to use the term 'inductive bias'.
OTOH, 'generalization' is in some sense the entire raison d'etre of the ML field. So I think it's useful to draw on diverse sources of inspiration to inform this science. E.g.
So I'm pretty open to absorbing / considering ideas from the broader ML literature. It seems right that a 'more emergent' framework of generalization will have to be consistent with the theories of generalization proposed for simpler phenomena. But it should also meaningfully expand on those to directly answer questions we care about re: AI safety risks.
A weak model could appear to have good propensities because it either isn't capable enough to think of strategies that we would consider undesirable but which are permissive under its propensities
Let me see if I understand, using scheming as an example. IIUC you're saying something like this: maybe GPT-5 isn't competent enough (yet) to scheme, thus alignment training which looks like "negatively reinforce bad things the model does" doesn't end up fixing scheming.
I agree with this, but I claim the problem was starting from a bad initial state. I guess I mostly expect that (i) we'll be reasonably vigilant for early signs of scheming, and (ii) our alignment techniques work in-distribution to prevent scheming, and (iii) we can make most deployment scenarios relatively in-distribution for the model by improving the post-training mix. IOW we can always do alignment training such that we start with models that are reasonably aligned in the settings we care about.
But we might do additional capabilities training after alignment training, as seems to be the case for RL'ing models. That motivates me to think about how to avoid 'alignment drift' or 'persona drift' during this subsequent training.
Great post! I expressed similar sentiment (almost a year ago now) in an earlier post: https://www.lesswrong.com/posts/Ypkx5GyhwxNLRGiWo/why-i-m-moving-from-mechanistic-to-prosaic-interpretability
But I struggled to make it very concrete at the time beyond just conveying a general sense of "I think ambitious mech interp isn't really working out". I'm glad you've made the case in much more detail than I did! Look forward to cool stuff from the GDM interp team going forward