Post-hoc reasoning in chain of thought
TLDR
* Using activation probes in Gemma-2 9B (instruction-tuned), we show that the model often decides on answers before generating chain-of-thought reasoning
* Through activation steering, we demonstrate that the model's predetermined answer causally influences both its final answer and reasoning process
* When steered toward incorrect answers, Gemma sometimes...
Feb 5, 2025