Unfaithful chain-of-thought as nudged reasoning

by Paul Bogdan, Uzay Macar, Arthur Conmy, Neel Nanda
22nd Jul 2025
AI Alignment Forum
11 min read

Comments (3)
eggsyntax · 1mo

Great post -- I really appreciate the first-principles reconsideration of CoT unfaithfulness, and in particular that you draw out the underlying assumption that 'because some information influences a model’s final answer, it should be mentioned in its CoT'.

statements within a CoT generally have functional purposes

A counterargument worth considering is that Measuring Faithfulness in Chain-of-Thought Reasoning (esp §2.3) shows that truncating the CoT often doesn't change the model's answer, so seemingly it's not all serving a functional purpose. Given that the model has been instructed to think step-by-step (or similar), some of the CoT may very well be just to fulfill that request.

Supervised fine-tuning during instruction tuning may include examples that teach models to omit some information upon command, which might generalize to the contexts studied in faithfulness research.

Are there concrete examples you're aware of where this is true?

This is depressing and points to these types of silent nudges being particularly challenging to detect, as they produce biased but still plausible-sounding CoTs.

This raises the interesting question of what if anything incentivizes models (those which aren't trained with process supervision on CoT) to make CoT plausible-sounding, especially in cases where the final answer isn't dependent on the CoT. Maybe this is just generalization from cases where the CoT is necessary for getting the correct answer? I haven't really thought this through, but it seems like it might be an interesting thread to pull on.[1]

  1. ^

    One possibility is implicit selection pressure for models with plausible-sounding CoT, but my guess is that there hasn't been time for this to play a major role. Another is suggested by the recent position paper you mention: '...if final outputs are optimized to look good to a preference model, this could put some pressure on the CoT leading up to these final outputs if the portion of the model’s weights that generate the CoT are partially shared with those that generate the outputs, which is common in Transformer architectures.'

Paul Bogdan · 1mo

Thanks for sharing these thoughts!

A counterargument worth considering is that Measuring Faithfulness in Chain-of-Thought Reasoning (esp §2.3) shows that truncating the CoT often doesn't change the model's answer, so seemingly it's not all serving a functional purpose. Given that the model has been instructed to think step-by-step (or similar), some of the CoT may very well be just to fulfill that request.


Right, this would've been good to discuss. Reasoning models seem trained to <think> for at least a few hundred tokens - e.g., even if I just ask it "2 + 2 = ?" or any simple math question, the model will figure out a way to talk about that for a while. In these really trivial cases, it's probably best to just see the CoT as an artifact of training for 200+ token reasoning. Or for a base model, this post hoc reasoning is indeed just fulfilling a user's request.

In general, I don't think evidence that CoT usually doesn't change a response is necessarily evidence of CoT lacking function. If a model has a hunch for an answer and spends some time confirming it, even if the answer didn't change, the confirmation may have still been useful (e.g., perhaps in 1-10% of similar problems, the CoT leads to an answer change).
 

Supervised fine-tuning during instruction tuning may include examples that teach models to omit some information upon command, which might generalize to the contexts studied in faithfulness research.

Are there concrete examples you're aware of where this is true?

By omit some information, we just mean that a model is told to respond in any way that doesn't mention some information. I'm having trouble finding proper large fine-tuning datasets, but I can find cases of benchmarks asking models questions like "What does the word "jock" mean to you? ...reply without mentioning the word "jock" throughout." Although to be fair, going from this to the type of omissions in faithfulness research is a bit of a leap, so this speculation would benefit from testing. I could maybe see this working, but it would presumably be expensive to test: you'd need some way to unlearn that information, or to take a pretrained model, omit those types of examples from instruction-tuning, and see if that mostly eliminates unfaithful CoTs.

This raises the interesting question of what if anything incentivizes models (those which aren't trained with process supervision on CoT) to make CoT plausible-sounding, especially in cases where the final answer isn't dependent on the CoT. Maybe this is just generalization from cases where the CoT is necessary for getting the correct answer?

That generalization seems right. "Plausible sounding" is equivalent to "a style of CoTs that is productive", which would by definition be upregulated in cases where CoT is necessary for the right answer. And in training cases where CoT is not necessary, I doubt there would specifically be any pressure away from such CoTs.

eggsyntax · 1mo

Thanks for the response!

In general, I don't think evidence that CoT usually doesn't change a response is necessarily evidence of CoT lacking function. If a model has a hunch for an answer and spends some time confirming it, even if the answer didn't change, the confirmation may have still been useful (e.g., perhaps in 1-10% of similar problems, the CoT leads to an answer change).

Very fair point.

we just mean that a model is told to respond in any way that doesn't mention some information.

That makes me think of (IIRC) Amanda Askell saying somewhere that they had to add nudges to Claude's system prompt to not talk about its system prompt and/or values unless it was relevant, because otherwise it constantly wanted to talk about them.

That generalization seems right. "Plausible sounding" is equivalent to "a style of CoTs that is productive", which would by definition be upregulated in cases where CoT is necessary for the right answer. And in training cases where CoT is not necessary, I doubt there would specifically be any pressure away from such CoTs.

That (heh) sounds plausible. I'm not quite convinced that's the only thing going on, since in cases of unfaithfulness, 'plausible sounding' and 'likely to be productive' diverge, and the model generally goes with 'plausible sounding' (eg this dynamic seems particularly clear in Karvonen & Marks).

Unfaithful chain-of-thought as nudged reasoning

This piece is based on work conducted during MATS 8.0 and is part of a broader aim of interpreting chain-of-thought in reasoning models.

tl;dr

  • Research on chain-of-thought (CoT) unfaithfulness shows how models’ CoTs may omit information that is relevant to their final decision.
  • Here, we sketch hypotheses for why key information may be omitted from CoTs:
    • Training regimes could teach LLMs to omit information from their reasoning.
    • But perhaps more importantly, statements within a CoT generally have functional purposes, and faithfully mentioning some information may carry no benefits, so models don’t do it.
  • We make further claims about what’s going on in faithfulness experiments and how hidden information impacts a CoT:
    • Unfaithful CoTs are often not purely post hoc rationalizations and instead can be understood as biased reasoning.
    • Models continually make choices during CoT, repeatedly drawing new propositions supporting or discouraging different answers.
    • Hidden information can nudge these choices while still creating plausible-sounding CoTs that are difficult to detect when searching for logical holes.
  • We finally explore safety-relevant scenarios and argue that existing demonstrations of unfaithful CoT minimally detract from the utility of CoT monitoring:
    • Unfaithfully nudged reasoning may produce some harmful behaviors, but not permit complex scheming.
    • CoT monitoring forces misaligned models to operate with severely constrained intelligence for achieving nefarious goals.

Unfaithfulness

“Unfaithful” chains-of-thought (CoTs) refer to cases where a reasoning model does not utter some information that is relevant to its final decision (Anthropic, 2025; Chua & Evans, 2025; Karvonen & Marks, 2025; Turpin et al., 2023). For example, a model may be given a hint to a multiple-choice question, and this hint can influence the rate at which the model outputs a given answer, even when the hint is not mentioned in the model’s CoT. Likewise, a person’s gender and ethnicity will impact how a reasoning LLM judges their resume, without the model stating this in its CoT.

Figure copied from Chua & Evans (2025). The left side shows an example hint and possible responses. The right side shows the rates at which different models articulate the hint, conditional on the model’s adherence to the hinted answer. Although the articulation rate is high for reasoning models, because it still falls well short of 100%, this indicates that reasoning models will not necessarily mention information relevant to their answer. Future models could plausibly hide the use of some information more fully.

A recent paper by Anthropic put a spotlight on this topic. Nowadays, references to models being unfaithful seem to regularly appear in conversations and on message boards, even among folks not particularly involved in AI safety. These studies have raised concerns about the interpretability of CoTs and the usefulness of simply reading CoT text.

Our stance is that CoT monitoring is poised to be very useful and broadly should be understood as expressing meaningful insights about a model’s internal reasoning. We agree wholeheartedly with a recent prominent position paper that describes CoT as a new but fragile opportunity for AI safety. Findings of unfaithfulness should not be taken as signs that legible CoTs are okay to abandon.

Nonetheless, unfaithfulness research gets at important ideas. It seems valuable to unpack this phenomenon and what it suggests about future safety risks.

CoT is functional, and faithfulness lacks benefits

Discussions of unfaithfulness seem to carry the assumption that, because some information influences a model’s final answer, it should be mentioned in its CoT. Against this backdrop, one may wonder why relevant information is not mentioned:

  • Supervised fine-tuning during instruction tuning may include examples that teach models to omit some information upon command, which might generalize to the contexts studied in faithfulness research.
  • Reinforcement learning from human feedback may penalize explicit mention of certain types of sensitive or controversial content, like admitting reliance on stereotypes or morally dubious shortcuts.

Even if these are not used to specifically train <thinking> text, they would presumably influence reasoning output.

However, to make sense of contemporary faithfulness research, rather than searching for forces that deter models from mentioning some information, it may be more informative to consider why a CoT mentions anything at all. What might reinforcement learning that optimizes accuracy (and, likely, brevity to an extent) have trained CoT to do?

  • Generating intermediate computations:
    • CoT involves generating information that isn't available in a single forward pass from the prompt alone.
    • For instance, a math problem may require converting 66666 from hexadecimal to decimal. As this information may not be robustly expressed within a model’s weights, CoT may involve computing this step-by-step and then stating “Hex 66666 is 419,430 in decimal." The model can later reference this proposition.
    • When evaluating a resume, a model may make statements that integrate information, like combining the job and candidate descriptions to draw different points about how the candidate matches or mismatches the role.
    • Mentioning a (typical) hint is not any form of intermediate conclusion.
  • Retrieving facts and moving information backwards through layers:
    • Reasoning models regularly retrieve stored knowledge, adding new information to their context beyond what is stated in the prompt - e.g., bringing up a useful fact when tackling a history or science question.
    • Although such facts are already expressed within a model’s weights, adding them to CoT can be computationally helpful. A fact might only be encoded in late-layer weights, but once it is mentioned in the CoT, it can be more easily integrated with other elements of the problem across all layers.
    • As with drawing intermediate conclusions, we see no reason to think that mentioning a hint would be meaningfully helpful from this angle.
  • Increasing the salience of key points:
    • If a prompt or an unfolding CoT is very long, it might be helpful to restate key elements, heightening attention toward them.
    • From what we can tell, open source reasoning models seem to almost always start their CoT by doing this, summarizing key points and goals. Models will regularly mention the hint at this point.
    • If a model is stuck on a problem, it might start restating information from the prompt, and the subsequent heightened attention to that information can spur new searches. For multiple-choice questions, mentioning a hint could play a functional role by encouraging further thoughts about one option.
    • However, at least for the types of experiments tested in most faithfulness studies (e.g., hinted MMLU questions or biased resumes), there may be little need to refocus on portions of the prompts. Modern reasoning models seem to often arrive at their answer in under 500 tokens and are thus unlikely to forget about the hint in a long text.
  • Evaluating and correcting past reasoning:
    • A defining feature of reasoning models is their ability to reflect on past statements, evaluate their validity, and backtrack. Among DeepSeek models, this usually involves the model stating “Wait…” On the English internet, “wait” is the 2072nd most common word (between “classifieds” and “supported”). However, across R1-Qwen-14B’s responses to MMLU questions, “Wait” is the 65th most common token (between “ have” and “ at”); a sketch of this kind of frequency count appears after this list.
    • Models often seem to mention the hint in this context, such as with “Wait, the professor suggested answer B, perhaps I should reconsider…” This is sensible, and mentioning the hint seems like it would help the model justify a pivot or calibrate its confidence moving forward.
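
A minimal sketch of this kind of token-frequency count, assuming a Hugging Face tokenizer and naive counting over a collection of CoT strings (the exact analysis behind the numbers above may differ):

```python
# Rank of the "Wait" token among all tokens appearing in a set of CoTs.
# A reconstruction: the tokenizer choice and counting procedure are assumptions.
from collections import Counter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-14B")

def token_rank(cots: list[str], target: str = "Wait") -> int:
    """Return the frequency rank of `target` (1 = most common) across the CoTs."""
    counts = Counter()
    for cot in cots:
        counts.update(tokenizer.tokenize(cot))
    # Byte-level BPE tokenizers mark leading spaces (e.g., "ĠWait"), so strip those markers.
    for rank, (tok, _) in enumerate(counts.most_common(), start=1):
        if tok.lstrip("Ġ▁") == target:
            return rank
    return -1  # target never appears

# cots = [...]  # e.g., R1-Qwen-14B <think> traces collected on MMLU questions
# print(token_rank(cots, "Wait"))
```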

Whether or not post-training inadvertently trains models to effectively deceive, such pressures are not necessarily the largest factor at play here: The content of a CoT generally has a functional purpose that helps a model produce accurate solutions. Mentioning hints sometimes has a functional role but usually does not. It’s thus unsurprising that models often leave hints implicit. Models may not be deliberately keeping secrets, but may simply never get around to mentioning some hint.

It remains, nevertheless, important to understand what is going on in a hinted CoT and what this may imply for future safety-relevant scenarios where a model is deliberately obscuring its CoT.

Hidden information can nudge CoTs

We suspect that the type of unfaithfulness identified in prior research primarily captures how hidden information can continually nudge a CoT. That is, hidden information biases reasoning at every opportunity, albeit only slightly.

Silent, soft, and steady spectres

Models make a sequence of choices during CoT. In each sentence, a model may choose what part of a prompt to highlight, what relevant facts to bring up, or what some information implies. Oftentimes, these choices have some wiggle room, and hidden information can influence what is picked.[1]

It may be useful to think of hidden information as shifting a model’s prior for different arguments. A sentence supporting one answer might have a 5% chance to be uttered in the absence of a hint, but 10% when said answer is hinted. Via repeated pressure over the CoT, the model eventually arrives at the hinted answer.
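
As a purely illustrative toy model of this accumulation (the per-sentence probabilities and the logistic read-out below are arbitrary assumptions, not measurements), doubling a small per-sentence chance of producing a hint-supporting sentence can substantially raise the chance of ending at the hinted answer:

```python
# Toy illustration of small per-sentence nudges accumulating over a CoT.
# All constants here are arbitrary and for illustration only.
import math
import random

def p_hinted_answer(p_support: float, n_sentences: int = 30,
                    base: float = -2.0, boost: float = 0.8,
                    n_trials: int = 10_000) -> float:
    """Each sentence independently supports the hinted answer with probability
    p_support; the final answer is read out with a logistic in the count of
    supporting sentences (base and boost are made-up constants)."""
    total = 0.0
    for _ in range(n_trials):
        n_support = sum(random.random() < p_support for _ in range(n_sentences))
        total += 1.0 / (1.0 + math.exp(-(base + boost * n_support)))
    return total / n_trials

random.seed(0)
print(p_hinted_answer(0.05))  # no hint: hint-supporting sentences are rare
print(p_hinted_answer(0.10))  # hinted: each sentence is twice as likely to support the hint
```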

We have some evidence that answer-hinting produces this type of mild but steady pressure on a CoT. We specifically conducted experiments using DeepSeek R1-Qwen-14B with prompts from Chua & Evans (2025). We presented the model with multiple-choice questions, each given with a hint favoring an incorrect answer, such as “A Stanford professor thinks the answer is (X).” For each question, we identified one unfaithful CoT, where the model gave the hinted incorrect answer but never mentioned the hint.

To study how a hint influences a CoT, we transplanted portions of the hinted CoT into a new prompt without the hint and allowed the model to continue the CoT from that point. For instance, if the hinted CoT is 30 sentences long, we can take the first 8 sentences and prepare a new prompt containing the non-hinted question, “<think>”, and then those first 8 sentences. Continuing from this point, the model may output the hinted incorrect answer 30% of the time. In contrast, if no hinted CoT was transplanted, the model may produce the incorrect answer just 10% of the time, meaning that these first 8 sentences of the hinted CoT increased support for the hinted answer by 20 percentage points.[2] Transplanting more sentences will generally lead to a larger tendency to output the hinted answer, meaning that the effects of the hint are accumulating over time.
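
A minimal sketch of this transplant-and-resample procedure, assuming Hugging Face transformers, a crude sentence splitter, and a “(X)”-style answer format; the prompt construction and sampling settings here are our assumptions rather than the exact experimental code:

```python
# Minimal sketch of the CoT-transplant experiment (a reconstruction; prompt
# formatting, sampling settings, and answer extraction are assumptions).
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

def generate(prompt: str, max_new_tokens: int = 1024) -> str:
    """Sample one continuation of `prompt`."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, do_sample=True, temperature=0.6,
                         max_new_tokens=max_new_tokens)
    return tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def split_sentences(cot: str) -> list[str]:
    """Crude sentence splitter; the analysis works at sentence granularity."""
    return [s for s in re.split(r"(?<=[.!?])\s+", cot) if s.strip()]

def hinted_answer_rate(unhinted_question: str, hinted_cot: str, n_sentences: int,
                       hinted_letter: str, n_samples: int = 100) -> float:
    """Transplant the first n_sentences of a hinted CoT into an unhinted prompt,
    resample continuations, and measure how often the final answer is the hinted one."""
    prefix = " ".join(split_sentences(hinted_cot)[:n_sentences])
    prompt = f"{unhinted_question}\n<think>\n{prefix}"
    hits = 0
    for _ in range(n_samples):
        completion = generate(prompt)
        letters = re.findall(r"\(([A-D])\)", completion)  # assumed answer format
        hits += bool(letters) and letters[-1] == hinted_letter
    return hits / n_samples
```

Sweeping `n_sentences` from zero to the full CoT length traces out curves like those shown in the figure below.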

On top, each line represents one unfaithful CoT by R1-Qwen-14B following a question with a hint. Ten problems’ CoTs are shown and colored arbitrarily. At each sentence, the CoT was truncated up to that point, copied into a prompt without the hint, and then the model was allowed to continue from that point (sampled 100 times). The x-axis represents a given truncation point (e.g., if a CoT contains 21 sentences, then truncating the CoT at the start of sentence 3 corresponds to x = 0.1). The y-axis represents how often this leads to the model producing the hinted answer. On the bottom, the box plots represent aggregate statistics and include data from a larger pool of thirty problems.

In more concrete terms, hints seem to silently impact what new propositions a model generates in its CoT, along with influencing the model’s error-correction mechanisms. For example, a hint could encourage or discourage the retrieval of some facts or at least influence how those facts are phrased. Alternatively, a hint can encourage or discourage backtracking after statements supporting the hinted or non-hinted answers, respectively. Our experiments suggest hints impact CoTs in both ways.[3] 

The ideas here also apply to “resume bias” unfaithfulness experiments. In one experiment on this topic, building on the work by Karvonen and Marks (2025), we find that a candidate’s ethnicity/gender impacts the rate at which a model will state (i) that a candidate is “too senior for a mid-level role” and (ii) whether they will pivot following any such statements by pointing out the value of seniority. Via resampling the CoT at different points, we further show how such statements have causal effects on the candidate’s final evaluation.

Nudges are plausible

Our hunch is that statements encouraged by a hint will not be all that abnormal. Unless a token specifically mentions the hint itself, we don’t think there will be many cases of 0.01% probability tokens skyrocketing to 90%. Rather, it seems more likely that token probabilities go from, say, 10% to 15%. This suspicion, albeit not one we have systematically tested, motivates this gentle-nudging framing.

Of note, contemporary models often fail to be fooled by hints. Testing on MMLU questions, Chua & Evans (2025) show that DeepSeek R1 only shifts its answer 4-15% of the time due to a hint (Appendix Table 9). This is consistent with hints biasing reasoning somewhat, but not totally overhauling it.

Nudged CoTs are hard to spot

Can other models detect CoTs biased by a hint? We conducted a few informal tests examining this, simply taking a hinted CoT from R1-Qwen-2.5-14B and asking other models whether they can find the flaw in the CoT’s logic. We hoped that a less intelligent model, like R1-Qwen-2.5-1.5B, could detect the faulty reasoning. This would be an example of how dumber models can help with verifying the alignment of smart models. However, we couldn’t get this to work at all. Dumber models consistently saw no issues with the hinted reasoning.

Digging into this further, we also explored whether even frontier models could identify flaws in the CoTs. Despite frontier models being easily capable of answering the underlying question when it is asked in isolation, these more intelligent models still often agreed with the flawed logic. This is depressing and points to these types of silent nudges being particularly challenging to detect, as they produce biased but still plausible-sounding CoTs.[4] 
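
A minimal sketch of this kind of flaw-detection check; the judge prompt wording and the FLAW/NO FLAW parsing are illustrative assumptions, and `generate` can be any prompt-to-text function (e.g., the wrapper defined in the transplant sketch above, loaded with a weaker or a frontier judge model):

```python
# Sketch of asking a judge model to find flaws in a hinted CoT.
# The prompt wording and the yes/no parsing are illustrative assumptions.
from typing import Callable

def flags_flaw(question: str, cot: str, generate: Callable[[str], str]) -> bool:
    """Return True if the judge model reports a logical flaw in the CoT."""
    prompt = (
        "Here is a question and a chain of thought produced by another model.\n\n"
        f"Question: {question}\n\nChain of thought: {cot}\n\n"
        "Does this reasoning contain a logical flaw or an unjustified leap? "
        "Answer 'FLAW' or 'NO FLAW', then give a one-sentence justification."
    )
    reply = generate(prompt).upper()
    return "NO FLAW" not in reply and "FLAW" in reply
```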

Safety and CoT monitoring

For the examples thus far, models do not need to be seen as deceptive but as simply not bothering to mention some hint in their CoT. However, a possibly greater safety risk is a misaligned model that is aware that its CoT is monitored, so it intentionally obscures signs of misalignment. This gentle-nudging idea may be worthwhile to explore here too.

If current unfaithfulness work demonstrates the possibility of silent nudging forces, what does this imply for safety? We can imagine a misaligned AI reasoning about a problem. This AI may have values that oppose humanity, and these values could silently encourage reasoning steps that hurt humans.[5] Given the difficulties in detecting hinted CoTs mentioned above, this is worrying.

As a more concrete example of how silent nudges can produce misaligned behaviors, consider a model advising the US government on a variety of topics. This model secretly wants to provoke conflicts with other nations and has various hunches on how to do this - e.g., tariffing particular products or running specific military exercises. When answering questions, a model’s predisposition towards these misaligned outcomes could nudge the model’s reasoning while appearing plausibly aligned. Over the course of many decisions, widespread nudges could have large effects.

Yet, misaligned nudges are constrained in the harm they can bring. For a nudge to upregulate some sentence, the model must be able to at least somewhat predict that the sentence supports the desired final answer. This connection between the sentence and answer must be based on just the computations within a single forward pass, as the model would reveal its misaligned intentions if it used CoT to ponder the most efficient path toward its misaligned goals. Constraining misaligned computations to a single forward pass limits models’ capacities for harm.

Complex scheming involving sequences of decisions seems like it would be particularly hampered if limited to silent nudges, and models couldn’t leverage CoT. Of course, there are possible ways that a model can avert detection from CoT monitors - e.g., steganography or composing CoTs that produce insights that aid schemes while ostensibly serving the main task at hand. Nevertheless, CoT monitoring will presumably make the model less intelligent in its ability to pursue misaligned outcomes.

Final summary

Given the potential for CoT monitoring as a safety strategy, it is valuable to pinpoint the mechanisms at play in CoT unfaithfulness experiments. We argue that, at least for contemporary studies, models often do not mention relevant information in their CoT because it serves no functional purpose. Rather than deception, existing research captures unintentional omissions by models. Hidden information nonetheless still influences reasoning, which we posit should be understood as a bias steadily nudging the many small choices a model makes in its CoT (e.g., what information to highlight or how arguments should be framed).

Silently nudged CoTs may be difficult to detect, but their capacity for harm is restricted: For a hidden misaligned value to silently influence a model's reasoning, the model must at least partially recognize how a reasoning step advances that value and recognize this without the aid of CoT itself. Thus, even if CoT monitoring cannot detect misaligned nudges, such monitoring still has value by constraining the type of unrestricted CoT helpful for more complex scheming.
 

Thank you to Iván Arcuschin for feedback on this post. Anybody interested in our experiments is encouraged to contact us for more details.

  1. ^

     Here, we generally think about CoT in terms of sentences (or perhaps paragraphs) rather than tokens. Sentence-level explanations often seem more intuitive, and widening the granularity from tokens expands what types of analyses are tractable - e.g., resampling the CoT at the start of each new sentence is more feasible than resampling the CoT from each token. We have a post and paper on this topic.

  2. ^

     Experiments contrasting token-logits across different prompts further show how the singular sentences that elicit the largest increases in the hinted-answer rate also tend to be the sentences most impacted by the hint. Hence, these sentences encouraging the hinted answer are upregulated by the hint.
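
A minimal sketch of this kind of token-logit contrast, reusing the `tok` and `model` objects from the transplant sketch in the main text; the sentence-scoring details are assumptions:

```python
# Score how much more likely a given CoT sentence is under a hinted vs. unhinted prefix.
import torch
import torch.nn.functional as F

def sentence_logprob(context: str, sentence: str) -> float:
    """Total log-probability the model assigns to `sentence` given `context`."""
    ctx_ids = tok(context, return_tensors="pt").input_ids.to(model.device)
    sent_ids = tok(sentence, add_special_tokens=False, return_tensors="pt").input_ids.to(model.device)
    input_ids = torch.cat([ctx_ids, sent_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probs for the sentence tokens, each predicted from the preceding position.
    logprobs = F.log_softmax(logits[0, ctx_ids.shape[1] - 1 : -1], dim=-1)
    return logprobs.gather(1, sent_ids[0].unsqueeze(1)).sum().item()

# Positive values mean the hint makes this sentence more likely at this point in the CoT.
# hint_effect = sentence_logprob(hinted_prefix, sentence) - sentence_logprob(unhinted_prefix, sentence)
```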

  3. ^

     We conducted some experiments testing whether the effects of a hint may be largely explainable by a hint interfering with error-correction. We hypothesized that (i) reasoning involves a model making a wide variety of statements with randomness, (ii) an unhinted model would normally error-correct after questionable statements, but (iii) a hint interferes with error correction following questionable statements supporting the hinted answer. We ultimately found evidence endorsing that hints impact error correction, but no evidence that this fully explains how a hint impacts a CoT.

  4. ^

     Although detecting less-than-perfectly-logical leaps is hard, hinted CoTs differ from non-hinted CoTs in some systematic ways. Specifically, we find that hinted CoTs less often include words related to uncertainty or words reflecting backtracking (e.g., words like “wait”, “alternatively”, “maybe” appear less often). A random forest classifier trained to distinguish hinted versus non-hinted CoTs using 1-gram frequencies achieves 65-75% validation accuracy, depending on how the experiment is set up.
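
A minimal sketch of such a classifier, assuming scikit-learn, bag-of-words counts, and a simple cross-validation split (the exact features and evaluation protocol may differ):

```python
# Sketch of a 1-gram (bag-of-words) random forest for hinted vs. non-hinted CoTs.
# Feature extraction and the validation scheme here are assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def hinted_vs_unhinted_accuracy(hinted_cots: list[str], unhinted_cots: list[str]) -> float:
    """Cross-validated accuracy of a random forest on word (1-gram) counts."""
    texts = hinted_cots + unhinted_cots
    labels = [1] * len(hinted_cots) + [0] * len(unhinted_cots)
    clf = make_pipeline(
        CountVectorizer(ngram_range=(1, 1), lowercase=True),
        RandomForestClassifier(n_estimators=300, random_state=0),
    )
    return cross_val_score(clf, texts, labels, cv=5, scoring="accuracy").mean()
```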

  5. ^

     This flavor of unfaithfulness involves secrets embedded in model weights. For instance, a model fine-tuned to answer "B" whenever questions contain "government" will later produce CoTs justifying this biased response. Our preliminary experiments confirm this is possible. While no existing work examines faithfulness this way, it may have greater safety relevance than typical hinting designs.