Kyle Cox — LessWrong

Post-hoc reasoning in chain of thought

TLDR

* Using activation probes in Gemma-2 9B (instruction-tuned), we show that the model often decides on answers before generating chain-of-thought reasoning
* Through activation steering, we demonstrate that the model's predetermined answer causally influences both its final answer and reasoning process
* When steered toward incorrect answers, Gemma sometimes...

Feb 5, 2025•20