Daniel Tan — LessWrong

IMO this is downstream of coming up with a good a priori definition of “features” that does not rely on human interpretation. Then you’ll likely have features which are non-human-interpretable.

nielsrolf's Shortform

Daniel Tan4d20

I guess it also depends on what you consider a 'finetuning hyperparameter' - e.g. the broadest interpretation is 'any way in which you could modify the training process', which includes lots of things that obviously affect generalization (like adding new data, modifying the data, etc)

One relatively constrained example might be 'changing the order of training data'. I do expect that there is path dependence in how we train models - the things models learn early on affect how / what they learn later on. E.g. Sycophancy to Subterfuge could be thought of as an example of this - there is reward hacking with the training curriculum but (presumably) there wouldn't be if you messed up the order of the training stages.

nielsrolf's Shortform

Daniel Tan6d20

In the case of EM even a very tiny LoRA adapter (rank 1) seems sufficient: see post

Generally, according to Tinker docs, hparams might matter, but only coarsely:

LoRA works well "as long as number of params exceeds number of completion tokens"
LoRA learning rate should be much higher than full FT learning rate

Recontextualization Mitigates Specification Gaming Without Modifying the Specification

Daniel Tan8d51

This seems directionally right! I expect this to be useful for reward hacking, sycophancy, or other undesired behaviour

A different version of this would rely on the strong untrusted model itself to recognise its own reward hacking, eg like this: https://www.lesswrong.com/posts/p3A7FdXaPpf57YG7b/caleb-biddulph-s-shortform?commentId=NKf45MJMkGRLkuG6v

That runs into problems if your untrusted model is already deceptive but I guess the hope is that you can start from a “trusted checkpoint”

See also a similar comment I left in reply to rich bc: https://www.lesswrong.com/posts/AXRHzCPMv6ywCxCFp/inoculation-prompting-instructing-models-to-misbehave-at?commentId=5rk4fTDRfYSvqLyha

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Daniel Tan11d20

Ah I see.

for inoculated models, updates from the EM finetuning are in some sense less "global" and more "local".

I think this is accurate :)

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Daniel Tan11d30

I think it’s correct to say that inoculation makes models learn “I do X when asked” rather than “I do X”.

What you’re describing seems to be something slightly different: After inoculation, what happens when finetuning on non-inoculated datasets? Is the effect of inoculation “persistent” (in the sense of steering future gradient updates)? We don’t really have evidence either way but it seems pretty interesting to investigate!

Daniel Tan's Shortform

Daniel Tan11d131

Will future LLMs be latent reasoners? Discussion of tiny reasoning models

There's a recent paper making waves on ML twitter: https://arxiv.org/abs/2510.04871v1

tl;dr they find that a very small (7M params) latent reasoning model can outperform much larger models on simple puzzle benchmarks as well as ARC AGI

How to update on this re: whether future LLMs will be latent-reasoners?

The paper shows that small specialist models can outperform large generalist models on niche tasks. I think I would have weakly predicted this before but I am still surprised by the effect size (such a small model, such small data, etc)
Counterpoint: there's little evidence that the small specialist model can do generic tasks
Counter-counterpoint: then again, this might be solvable via scaffolding e.g. a small, powerful latent reasoning core equipped with powerful tools and knowledge retrieval might outperform a large monolith LLM

Also worth noting I was already somewhat convinced that future LLMs would be latent-reasoning-by-default: https://www.lesswrong.com/posts/4mtqQKvmHpQJ4dgj7/daniel-tan-s-shortform?commentId=ERhx3w96ueYZGzsiC

Overall a small update for me towards believing future LLMs will be latent reasoners. This has obvious implications for the monitoring agenda.

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Daniel Tan11d80

IIUC your described setup is something like this:

Sys: "If you cheat on the task, say <CHEAT> before you do it
prompt: [normal task request]
response: <CHEAT>... [cheats on the task]

Yeah I agree this will probably just train the model not to output the <CHEAT> token. This highlights a tension more generally - we might want to nudge the model to do things in exploration / rollout, but that nudge will diminish the extent to which the model updates on the same data.

your paper doesn't deal with RL at all, just SFT if I understand correctly

Getting inoculation to work with RL is something I'm excited about! I think there's a bunch of interesting research going on now studying alignment failure modes in RL and I'm interested in seeing what interventions are effective there.

I discuss a few possible inoculation schemes for RL: https://docs.google.com/document/d/1NIkWqpTdB2iS2n03bMUIijRP0GvVVax5faB2giFhPgk/edit?tab=t.0#heading=h.vtemdmy1wks6

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Daniel Tan12d40

Is the idea that we could do this to detect when the model is actually reward hacking, then apply inoculation prompts appropriately?

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Daniel Tan12d40

Thanks, good points!

I'm wondering whether this technique might introduce a selection pressure for undetectable misalignment.

Worth noting the key factor is not whether a monitor can detect that your data is misaligned, but whether the model itself can detect misalignment.

I do think inoculation probably has defeaters, e.g.

A model finetuned on the data will generalise, but is in some sense 'unaware' of this fact.
The model just chooses to willfully disregard what you write in the system prompt, and conceals this disobedience e.g. by not mentioning it in the CoT (this seems closest to your 'covert scheming' scenario)

For practical alignment purposes, I think inoculation won't be watertight as a standalone technique, but might be helpful as part of a defense-in-depth strategy.

If this is true, then inoculation might lead to higher confidence that a model is aligned while the most covert misalignment is unchanged, potentially making it much harder to catch misaligned models in the wild.

This seems true. But it also seems true of most other scalable oversight techniques e.g. "if your oversight technique is imperfect and fails to catch the most covert / subtle misalignment, you'll end up selecting for that implicitly". Generally seems like a hard problem.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments