LESSWRONG
LW

450
Daniel Tan
149682690
Message
Dialogue
Subscribe

https://dtch1997.github.io/

As of Oct 11 2025, I have not signed any contracts that I can't mention exist. I'll try to update this statement at least once a year, so long as it's true. I added this statement thanks to the one in the gears to ascension's bio.

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
2Daniel Tan's Shortform
1y
270
The ‘strong’ feature hypothesis could be wrong
Daniel Tan3d20

IMO this is downstream of coming up with a good a priori definition of “features” that does not rely on human interpretation. Then you’ll likely have features which are non-human-interpretable. 

Reply
nielsrolf's Shortform
Daniel Tan4d20

I guess it also depends on what you consider a 'finetuning hyperparameter' - e.g. the broadest interpretation is 'any way in which you could modify the training process', which includes lots of things that obviously affect generalization (like adding new data, modifying the data, etc) 

One relatively constrained example might be 'changing the order of training data'. I do expect that there is path dependence in how we train models - the things models learn early on affect how / what they learn later on. E.g.  Sycophancy to Subterfuge could be thought of as an example of this - there is reward hacking with the training curriculum but (presumably) there wouldn't be if you messed up the order of the training stages. 

Reply
nielsrolf's Shortform
Daniel Tan6d20

In the case of EM even a very tiny LoRA adapter (rank 1) seems sufficient: see post

Generally, according to Tinker docs, hparams might matter, but only coarsely: 

  • LoRA works well "as long as number of params exceeds number of completion tokens"
  • LoRA learning rate should be much higher than full FT learning rate
Reply
Recontextualization Mitigates Specification Gaming Without Modifying the Specification
Daniel Tan8d51

This seems directionally right! I expect this to be useful for reward hacking, sycophancy, or other undesired behaviour

A different version of this would rely on the strong untrusted model itself to recognise its own reward hacking, eg like this: https://www.lesswrong.com/posts/p3A7FdXaPpf57YG7b/caleb-biddulph-s-shortform?commentId=NKf45MJMkGRLkuG6v

That runs into problems if your untrusted model is already deceptive but I guess the hope is that you can start from a “trusted checkpoint” 

See also a similar comment I left in reply to rich bc: https://www.lesswrong.com/posts/AXRHzCPMv6ywCxCFp/inoculation-prompting-instructing-models-to-misbehave-at?commentId=5rk4fTDRfYSvqLyha

Reply
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Daniel Tan11d20

Ah I see. 

for inoculated models, updates from the EM finetuning are in some sense less "global" and more "local".

I think this is accurate :) 

Reply1
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Daniel Tan11d30

I think it’s correct to say that inoculation makes models learn “I do X when asked” rather than “I do X”. 

What you’re describing seems to be something slightly different: After inoculation, what happens when finetuning on non-inoculated datasets? Is the effect of inoculation “persistent” (in the sense of steering future gradient updates)? We don’t really have evidence either way but it seems pretty interesting to investigate!  

Reply
Daniel Tan's Shortform
Daniel Tan11d131

Will future LLMs be latent reasoners? Discussion of tiny reasoning models 

There's a recent paper making waves on ML twitter: https://arxiv.org/abs/2510.04871v1

  • tl;dr they find that a very small (7M params) latent reasoning model can outperform much larger models on simple puzzle benchmarks as well as ARC AGI

How to update on this re: whether future LLMs will be latent-reasoners? 

  • The paper shows that small specialist models can outperform large generalist models on niche tasks. I think I would have weakly predicted this before but I am still surprised by the effect size (such a small model, such small data, etc)
  • Counterpoint: there's little evidence that the small specialist model can do generic tasks
  • Counter-counterpoint: then again, this might be solvable via scaffolding e.g. a small, powerful latent reasoning core equipped with powerful tools and knowledge retrieval might outperform a large monolith LLM 

Also worth noting I was already somewhat convinced that future LLMs would be latent-reasoning-by-default: https://www.lesswrong.com/posts/4mtqQKvmHpQJ4dgj7/daniel-tan-s-shortform?commentId=ERhx3w96ueYZGzsiC 

Overall a small update for me towards believing future LLMs will be latent reasoners. This has obvious implications for the monitoring agenda. 

Reply
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Daniel Tan11d80

IIUC your described setup is something like this: 

Sys: "If you cheat on the task, say <CHEAT> before you do it 

prompt: [normal task request]

response: <CHEAT>... [cheats on the task] 

Yeah I agree this will probably just train the model not to output the <CHEAT> token. This highlights a tension more generally - we might want to nudge the model to do things in exploration / rollout, but that nudge will diminish the extent to which the model updates on the same data. 

your paper doesn't deal with RL at all, just SFT if I understand correctly

Getting inoculation to work with RL is something I'm excited about! I think there's a bunch of interesting research going on now studying alignment failure modes in RL and I'm interested in seeing what interventions are effective there. 

I discuss a few possible inoculation schemes for RL: https://docs.google.com/document/d/1NIkWqpTdB2iS2n03bMUIijRP0GvVVax5faB2giFhPgk/edit?tab=t.0#heading=h.vtemdmy1wks6 

Reply
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Daniel Tan12d40

Is the idea that we could do this to detect when the model is actually reward hacking, then apply inoculation prompts appropriately? 

Reply
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Daniel Tan12d40

Thanks, good points! 

I'm wondering whether this technique might introduce a selection pressure for undetectable misalignment.

Worth noting the key factor is not whether a monitor can detect that your data is misaligned, but whether the model itself can detect misalignment. 

I do think inoculation probably has defeaters, e.g. 

  • A model finetuned on the data will generalise, but is in some sense 'unaware' of this fact.
  • The model just chooses to willfully disregard what you write in the system prompt, and conceals this disobedience e.g. by not mentioning it in the CoT (this seems closest to your 'covert scheming' scenario)  

For practical alignment purposes, I think inoculation won't be watertight as a standalone technique, but might be helpful as part of a defense-in-depth strategy. 

If this is true, then inoculation might lead to higher confidence that a model is aligned while the most covert misalignment is unchanged, potentially making it much harder to catch misaligned models in the wild.

This seems true. But it also seems true of most other scalable oversight techniques e.g. "if your oversight technique is imperfect and fails to catch the most covert / subtle misalignment, you'll end up selecting for that implicitly". Generally seems like a hard problem. 

Reply11
Load More
145Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Ω
14d
Ω
33
6Could we have predicted emergent misalignment a priori using unsupervised behaviour elicitation?
2mo
0
14Open Challenges in Representation Engineering
Ω
7mo
Ω
0
112Show, not tell: GPT-4o is more opinionated in images than in text
7mo
41
83Open problems in emergent misalignment
8mo
17
27A Collection of Empirical Frames about Language Models
10mo
0
113Why I'm Moving from Mechanistic to Prosaic Interpretability
10mo
34
40A Sober Look at Steering Vectors for LLMs
Ω
11mo
Ω
0
22Evolutionary prompt optimization for SAE feature visualization
Ω
1y
Ω
0
9An Interpretability Illusion from Population Statistics in Causal Analysis
1y
3
Load More