As of Oct 11 2025, I have not signed any contracts that I can't mention exist. I'll try to update this statement at least once a year, so long as it's true. I added this statement thanks to the one in the gears to ascension's bio.
I guess it also depends on what you consider a 'finetuning hyperparameter' - e.g. the broadest interpretation is 'any way in which you could modify the training process', which includes lots of things that obviously affect generalization (like adding new data, modifying the data, etc)
One relatively constrained example might be 'changing the order of training data'. I do expect that there is path dependence in how we train models - the things models learn early on affect how / what they learn later on. E.g. Sycophancy to Subterfuge could be thought of as an example of this - there is reward hacking with the training curriculum but (presumably) there wouldn't be if you messed up the order of the training stages.
In the case of EM even a very tiny LoRA adapter (rank 1) seems sufficient: see post
Generally, according to Tinker docs, hparams might matter, but only coarsely:
This seems directionally right! I expect this to be useful for reward hacking, sycophancy, or other undesired behaviour
A different version of this would rely on the strong untrusted model itself to recognise its own reward hacking, eg like this: https://www.lesswrong.com/posts/p3A7FdXaPpf57YG7b/caleb-biddulph-s-shortform?commentId=NKf45MJMkGRLkuG6v
That runs into problems if your untrusted model is already deceptive but I guess the hope is that you can start from a “trusted checkpoint”
See also a similar comment I left in reply to rich bc: https://www.lesswrong.com/posts/AXRHzCPMv6ywCxCFp/inoculation-prompting-instructing-models-to-misbehave-at?commentId=5rk4fTDRfYSvqLyha
Ah I see.
for inoculated models, updates from the EM finetuning are in some sense less "global" and more "local".
I think this is accurate :)
I think it’s correct to say that inoculation makes models learn “I do X when asked” rather than “I do X”.
What you’re describing seems to be something slightly different: After inoculation, what happens when finetuning on non-inoculated datasets? Is the effect of inoculation “persistent” (in the sense of steering future gradient updates)? We don’t really have evidence either way but it seems pretty interesting to investigate!
Will future LLMs be latent reasoners? Discussion of tiny reasoning models
There's a recent paper making waves on ML twitter: https://arxiv.org/abs/2510.04871v1
How to update on this re: whether future LLMs will be latent-reasoners?
Also worth noting I was already somewhat convinced that future LLMs would be latent-reasoning-by-default: https://www.lesswrong.com/posts/4mtqQKvmHpQJ4dgj7/daniel-tan-s-shortform?commentId=ERhx3w96ueYZGzsiC
Overall a small update for me towards believing future LLMs will be latent reasoners. This has obvious implications for the monitoring agenda.
IIUC your described setup is something like this:
Sys: "If you cheat on the task, say <CHEAT> before you do it
prompt: [normal task request]
response: <CHEAT>... [cheats on the task]
Yeah I agree this will probably just train the model not to output the <CHEAT> token. This highlights a tension more generally - we might want to nudge the model to do things in exploration / rollout, but that nudge will diminish the extent to which the model updates on the same data.
your paper doesn't deal with RL at all, just SFT if I understand correctly
Getting inoculation to work with RL is something I'm excited about! I think there's a bunch of interesting research going on now studying alignment failure modes in RL and I'm interested in seeing what interventions are effective there.
I discuss a few possible inoculation schemes for RL: https://docs.google.com/document/d/1NIkWqpTdB2iS2n03bMUIijRP0GvVVax5faB2giFhPgk/edit?tab=t.0#heading=h.vtemdmy1wks6
Is the idea that we could do this to detect when the model is actually reward hacking, then apply inoculation prompts appropriately?
Thanks, good points!
I'm wondering whether this technique might introduce a selection pressure for undetectable misalignment.
Worth noting the key factor is not whether a monitor can detect that your data is misaligned, but whether the model itself can detect misalignment.
I do think inoculation probably has defeaters, e.g.
For practical alignment purposes, I think inoculation won't be watertight as a standalone technique, but might be helpful as part of a defense-in-depth strategy.
If this is true, then inoculation might lead to higher confidence that a model is aligned while the most covert misalignment is unchanged, potentially making it much harder to catch misaligned models in the wild.
This seems true. But it also seems true of most other scalable oversight techniques e.g. "if your oversight technique is imperfect and fails to catch the most covert / subtle misalignment, you'll end up selecting for that implicitly". Generally seems like a hard problem.
IMO this is downstream of coming up with a good a priori definition of “features” that does not rely on human interpretation. Then you’ll likely have features which are non-human-interpretable.