Daniel Tan — LessWrong

LESSWRONG
LW

Daniel Tan's Shortform

Daniel Tan10d42

Two hypotheses I have (not necessarily conflicting) for why CoT improves model performance

Decomposition. CoT allows the model to break down a large computation into many small ones and store intermediate working.
Self-elicitation. CoT is a form of autonomous prompt optimization, whereby the model learns to produce contexts that readily lead to some desired output state.

Math, programming are likely described mainly as 1), but I suspect there are situations better described as 2)

The ‘strong’ feature hypothesis could be wrong

Daniel Tan12d20

IMO this is downstream of coming up with a good a priori definition of “features” that does not rely on human interpretation. Then you’ll likely have features which are non-human-interpretable.

nielsrolf's Shortform

Daniel Tan14d20

I guess it also depends on what you consider a 'finetuning hyperparameter' - e.g. the broadest interpretation is 'any way in which you could modify the training process', which includes lots of things that obviously affect generalization (like adding new data, modifying the data, etc)

One relatively constrained example might be 'changing the order of training data'. I do expect that there is path dependence in how we train models - the things models learn early on affect how / what they learn later on. E.g. Sycophancy to Subterfuge could be thought of as an example of this - there is reward hacking with the training curriculum but (presumably) there wouldn't be if you messed up the order of the training stages.

nielsrolf's Shortform

Daniel Tan15d20

In the case of EM even a very tiny LoRA adapter (rank 1) seems sufficient: see post

Generally, according to Tinker docs, hparams might matter, but only coarsely:

LoRA works well "as long as number of params exceeds number of completion tokens"
LoRA learning rate should be much higher than full FT learning rate

Recontextualization Mitigates Specification Gaming Without Modifying the Specification

Daniel Tan18d51

This seems directionally right! I expect this to be useful for reward hacking, sycophancy, or other undesired behaviour

A different version of this would rely on the strong untrusted model itself to recognise its own reward hacking, eg like this: https://www.lesswrong.com/posts/p3A7FdXaPpf57YG7b/caleb-biddulph-s-shortform?commentId=NKf45MJMkGRLkuG6v

That runs into problems if your untrusted model is already deceptive but I guess the hope is that you can start from a “trusted checkpoint”

See also a similar comment I left in reply to rich bc: https://www.lesswrong.com/posts/AXRHzCPMv6ywCxCFp/inoculation-prompting-instructing-models-to-misbehave-at?commentId=5rk4fTDRfYSvqLyha

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Daniel Tan21d20

Ah I see.

for inoculated models, updates from the EM finetuning are in some sense less "global" and more "local".

I think this is accurate :)

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Daniel Tan21d30

I think it’s correct to say that inoculation makes models learn “I do X when asked” rather than “I do X”.

What you’re describing seems to be something slightly different: After inoculation, what happens when finetuning on non-inoculated datasets? Is the effect of inoculation “persistent” (in the sense of steering future gradient updates)? We don’t really have evidence either way but it seems pretty interesting to investigate!

Daniel Tan's Shortform

Daniel Tan21d131

Will future LLMs be latent reasoners? Discussion of tiny reasoning models

There's a recent paper making waves on ML twitter: https://arxiv.org/abs/2510.04871v1

tl;dr they find that a very small (7M params) latent reasoning model can outperform much larger models on simple puzzle benchmarks as well as ARC AGI

How to update on this re: whether future LLMs will be latent-reasoners?

The paper shows that small specialist models can outperform large generalist models on niche tasks. I think I would have weakly predicted this before but I am still surprised by the effect size (such a small model, such small data, etc)
Counterpoint: there's little evidence that the small specialist model can do generic tasks
Counter-counterpoint: then again, this might be solvable via scaffolding e.g. a small, powerful latent reasoning core equipped with powerful tools and knowledge retrieval might outperform a large monolith LLM

Also worth noting I was already somewhat convinced that future LLMs would be latent-reasoning-by-default: https://www.lesswrong.com/posts/4mtqQKvmHpQJ4dgj7/daniel-tan-s-shortform?commentId=ERhx3w96ueYZGzsiC

Overall a small update for me towards believing future LLMs will be latent reasoners. This has obvious implications for the monitoring agenda.

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Daniel Tan21d80

IIUC your described setup is something like this:

Sys: "If you cheat on the task, say <CHEAT> before you do it
prompt: [normal task request]
response: <CHEAT>... [cheats on the task]

Yeah I agree this will probably just train the model not to output the <CHEAT> token. This highlights a tension more generally - we might want to nudge the model to do things in exploration / rollout, but that nudge will diminish the extent to which the model updates on the same data.

your paper doesn't deal with RL at all, just SFT if I understand correctly

Getting inoculation to work with RL is something I'm excited about! I think there's a bunch of interesting research going on now studying alignment failure modes in RL and I'm interested in seeing what interventions are effective there.

I discuss a few possible inoculation schemes for RL: https://docs.google.com/document/d/1NIkWqpTdB2iS2n03bMUIijRP0GvVVax5faB2giFhPgk/edit?tab=t.0#heading=h.vtemdmy1wks6

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Daniel Tan21d40

Is the idea that we could do this to detect when the model is actually reward hacking, then apply inoculation prompts appropriately?

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments