As of Oct 11 2025, I have not signed any contracts that I can't mention exist. I'll try to update this statement at least once a year, so long as it's true. I added this statement thanks to the one in the gears to ascension's bio.
Great post! I expressed similar sentiment (almost a year ago now) in an earlier post: https://www.lesswrong.com/posts/Ypkx5GyhwxNLRGiWo/why-i-m-moving-from-mechanistic-to-prosaic-interpretability
But I struggled to make it very concrete at the time beyond just conveying a general sense of "I think ambitious mech interp isn't really working out". I'm glad you've made the case in much more detail than I did! Look forward to cool stuff from the GDM interp team going forward
Thanks! Many great suggestions, most of which reflect stuff I've thought about.
how do you "induce misalignment?"
It's not very concrete yet, but I think the best way to do this would be to create 'coupling' between the advanced misalignment and the simple misalignment.
How well this coupling works in practice / whether it holds up under subsequent finetuning is an empirical qn, but seems exciting if it did work.
---
Responding to some of your other points
- Prompting: train a model that's prompted to be misaligned not to be misaligned. But this seems like it would just make the model ignore the prompt? Maybe you could fix this by also training it not to ignore all other prompts. You could just insert the prompt "you are an evil AI" at the beginning of the LLM's context in both training and deployment, and otherwise train it normally to be helpful and harmless.
- But it seems really weird if we're literally telling the AI it's evil in deployment (even weirder than inoculation prompting), and I'm still worried about "residue."
Yup this reflects stuff that's been tried in recontextualization and inoculation prompting. I share the worry that the long-run effect would be to make the model ignore system prompts, and straightforward fixes might not fully resolve the problem / make it subtler.
- Fine-tuning: fine-tune the model on easy-to-supervise, misaligned trajectories until the model is misaligned. But then, you fine-tune/RL the model... to become aligned again? It seems like this simply undoes the operation you just did. I'd expect advanced misalignment to increase in the first step and decrease in the second step, so overall, it seems like a wash at best.
- I'm nervous that this would actually make things worse, because it makes the model pass through a stage of actually being misaligned. I'm worried some misaligned "residue" could be left over. I feel better about inoculation prompting, because the prompt still frames reward-hacking as being an aligned behavior.
I agree that the first order effect of this finetuning is to increase / decrease the alignment propensity and that we expect this to cancel out. But there might be second order effects e.g. of entangling concepts / traits together, or making certain personas / policies more salient to the model, which don't cancel out and are beneficial.
I mostly believe this. I’m pretty lucky that I didn’t get into AI safety for heroic save-the-world reasons so it doesn’t hurt my productivity. I currently work on research aimed at reducing s-risk at CLR.
Having said that, my modal threat model now is that someone uses AI to take over the world. I would love for more people to work on closely scrutinising leaders of labs and other figures in power, or more generally work on trying to make the gains from transformative AI distributed by default
"Indirect alignment": a speculative idea for aligning models in hard-to-oversee settings
Problem: "Directly" aligning models might be hard sometimes, because it's hard to provide perfect oversight (e.g. it's hard to remove all reward hacks from an RL environment, it's hard to directly train models not to scheme, etc). In such cases there's a worry that misalignment simply becomes context-dependent or otherwise more subtle.
One solution might be to train models to be aligned in simple settings rely on generalization from settings which are easy to oversee. (bottom arrow). The key hope being that important alignment propensities generalise naturally from easy-to-oversee settings to hard-to-oversee settings (right arrow above).
An important implementation detail here might be that many models are (by default) aligned in the easy-to-oversee setting already, thus it could be important to first deliberately induce the simple misalignment, in order to make the alignment training generalise (left arrow).
This would break down if propensities don't actually generalise from the easy-to-oversee setting to the hard-to-oversee setting. Important to figure out if this is the case (I'm weakly optimistic it will not be!)
Enjoyed reading a recent draft post by Alex Mallen on predicting AI motivations by analyzing their selection pressures
--- a somewhat biased / selective tl;dr + comments
The "behavioural selection" principle:
We might reason about which specific cognitive patterns get selected based on selection pressures, as well as priors
Neel Nanda discussing the “science of misalignment” in a recent video. Timestamp 32:30. Link:
—- tl;dr
Basic science / methodology.
Scientifically understanding “in the wild” weird model behaviour
We need:
Yes, you’re right. That’s the actual distinction that matters. Will edit the comment
Summary of a dialogue between Habryka, Evan Hubinger, and Sam Marks on inoculation prompting, which I found illuminating and liked a lot. [LINK]
'inductive bias' is generally talking about a much lower level of abstraction than ideas like 'scheming'.
Yes, I agree with this - and I'm mainly interested in developing an empirical science of generalization that tries to grapple a lot more directly with the emergent propensities we are about. Hence why I try not to use the term 'inductive bias'.
OTOH, 'generalization' is in some sense the entire raison d'etre of the ML field. So I think it's useful to draw on diverse sources of inspiration to inform this science. E.g.
So I'm pretty open to absorbing / considering ideas from the broader ML literature. It seems right that a 'more emergent' framework of generalization will have to be consistent with the theories of generalization proposed for simpler phenomena. But it should also meaningfully expand on those to directly answer questions we care about re: AI safety risks.
I like this idea - wonder if we can test 'implied personas' in some way. A somewhat contrived example below:
Note: The above example involves 'extrapolating' the evolution of a persona over time. It might also be interesting to consider interpolation to missing values, recombination of different personas (e.g. if A and B had children what would that look like?), etc