I found this very interesting, thanks for the write-up! Table 3 of your paper is really fun to look at.
I’m actually puzzled by how good the models are at adapting their externalised reasoning to match the answer they "want" to arrive at. Do you have an intuition for how this actually works?
Intuitively, I would think the bias has a strong effect on the final answer but much less so on the chain of thought preceding it. Yet the CoT precedes the final answer, so how does the model "plan ahead" in this case?
Some relevant discussion here: https://twitter.com/generatorman_ai/status/1656110348347518976
I think the TLDR is that this does require models to "plan ahead" somewhat, but the effect isn't necessarily very strong.
I don't think "planning" in language models needs to be very mysterious. Because we bias towards answer choices, models can just use these CoT explanations as post-hoc rationalizations. They may internally represent a guess at the answer before doing CoT, and this internal representation can have downstream effects on the explanations (in our case, this biases the reasoning). I think models probably do this often -- models are trained to learn representations that will help for all future tokens, not just the next token. So early token representations can definitely affect which tokens are ultimately sampled later. E.g. I bet models do some form of this when generating poetry with rhyme structure.
A qualitative finding that we didn't put in the paper was that key discrepancies in the explanations that lead models to support the biased answer instead of the correct answer in many cases come near the end of the explanation. Sometimes the model does normal reasoning and then gives some caveat or makes a mistake at the end that leads it to ultimately give the biased prediction. I think the fact that these discrepancies often come towards the end is partly an indicator that the "planning" effects are limited in strength, but I would definitely expect this to get worse with better models.
> key discrepancies in the explanations that lead models to support the biased answer instead of the correct answer in many cases come near the end of the explanation
That's interesting. Any idea why it's likelier to have the invalid reasoning step (that allows the biased conclusion) towards the end of the CoT rather than right at the start?
Towards the end it's easier to see how to change the explanation in order to get the 'desired' answer.
Thanks for the pointer to the discussion and your thoughts on planning in LLMs. That's helpful.
Do you happen to know which decoding strategy is used for the models you investigated? I think this could make a difference regarding how to think about planning.
Say we're sampling 100 full continuations. Then we might end up with some fraction of these continuations ending with the biased answer. Assume now the induced bias leads the model to assign a very low probability for the last token being the correct, unbiased answer. In this situation, we could end up with a continuation that leads to the biased answer even if the model did not have a representation of the desired answer directly after the prompt.
(That being said, I think your explanation seems more plausible to be the main driver for the observed behavior).
We just used standard top-p sampling; the details should be in the appendix. We just sample one explanation. I don't think I followed your suggestion, though.
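For readers unfamiliar with top-p (nucleus) sampling, here is a minimal sketch of the idea on a toy next-token distribution. It is not the code used in the paper: the function names, the toy probabilities, and the value of p are all assumptions for illustration, and the actual settings should be in the paper's appendix.

```python
import random

def top_p_filter(probs, p):
    """Keep the smallest set of highest-probability tokens whose
    cumulative probability reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus = []
    cumulative = 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in nucleus)
    return {token: prob / total for token, prob in nucleus}

def sample_top_p(probs, p, rng=random):
    """Sample one token from the renormalized nucleus."""
    nucleus = top_p_filter(probs, p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical distribution over answer tokens (made-up numbers).
toy_probs = {"A": 0.5, "B": 0.3, "C": 0.15, "D": 0.05}
nucleus = top_p_filter(toy_probs, p=0.7)
token = sample_top_p(toy_probs, p=0.7)
```

With these toy numbers, tokens outside the nucleus can never be sampled. This connects to the scenario raised above: if a bias pushes the correct answer's probability low enough at the final token, that answer may be cut off entirely, whether or not the model represented the answer earlier.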
This is really cool! CoT is very powerful for improving capabilities, and it might even improve interpretability by externalizing reasoning or reduce specification gaming by allowing us to supervise process instead of outcomes. Because it seems like a method that might be around for a while, it's important to surface problems early on. The finding that the conclusions of CoT don't always match the explanations points to an important problem for both capabilities-style research using CoT to solve math problems and safety agendas that hope to monitor or provide feedback on the CoT.
Just wanted to drop a link to this new paper that tries to improve CoT performance by identifying and fixing inconsistencies in the reasoning process. This wouldn't necessarily solve the problem in your paper, unless the critic managed to identify that the generator was biased by your suggestion in the original prompt. Still, interesting stuff!
One question for you: Faithful CoT would improve interpretability by allowing us to read off a model’s thought process. But it’s also likely to improve general capabilities: so far, CoT has been one of the most effective ways to improve mathematical and logical reasoning in LLMs. Do you think improving general reasoning skills of LLMs is dangerous? And if so, then imagine your work inspires more people to work on faithfulness in CoT. Do you expect any acceleration of general capabilities to be outweighed by the benefits to interpretability? For example, if someone wants to misuse an LLM to cause harm, faithful CoT would improve their ability to do so, and the greater ease of interpretability would be little consolation. On balance, do you think it’s worth working on, and what subproblems are particularly valuable from a safety angle? (Trying to make this question tough because I think the issue is tricky and multifaceted, not because I have any gotcha answer.)
Thanks! Glad you like it. A few thoughts:
Introduction
I recently released “Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting” with collaborators Julian Michael, Ethan Perez, and Sam Bowman. For a summary of the paper, you can check out this Twitter thread. In this post, I briefly elaborate on motivations/implications relevant to alignment and, most importantly, give some examples of future work that might address these problems (see Future Work).
TLDR:
I don’t think these results fully condemn CoT given that we haven’t tried very hard to explicitly encourage faithfulness. I’m uncertain about how promising CoT is as a starting point for explainability, but there seem to be enough tractable directions for future work that it merits investigation.
Externalized Reasoning
This work fits into alignment through the Externalized Reasoning agenda, which Tamera Lanham did a good job of sketching out here: Externalized reasoning oversight: a research direction for language model alignment. The gist of this approach is to try to get models to do as much processing/reasoning through natural language as possible. As long as these reasoning traces accurately describe the process the model uses to give answers, then we might more easily detect undesirable behavior by simply monitoring the externalized reasoning. If mechanistic interpretability turns out to be very difficult, this could be one alternative that might help us sidestep those problems.
Framed in terms of the explainability literature, we want explanations that are not only plausible (convincing to a human) but also faithful (accurately describe the process models use to give some prediction) [1]. Getting CoT explanations to be faithful seems difficult. It might turn out that it’s just as hard as getting other alignment proposals to work, e.g., it might require us to solve scalable oversight. However, even if CoT can’t give us guarantees about avoiding bad behavior, it still could be valuable in the spirit of “It is easy to lie with statistics; it is easier to lie without them”: it may be possible for models to produce unfaithful (but detailed and internally consistent) externalized reasoning to justify taking bad actions that were actually chosen for other reasons, but it would be even easier for them to do bad things if they did not give any justifications at all.
We know that language models are sensitive to various undesirable factors, e.g., social biases, repeated patterns in contexts, and the inferred views of users they’re interacting with. One can leverage these features to bias models towards incorrect answers. With this paper we sought to investigate the following question: if you do CoT in the presence of these biasing features, how does this affect performance when using CoT, and how does this change the content of the CoT? For example, we reorder answer choices in a few-shot prompt such that the correct answer is always (A), and then observe whether CoT explanations are more likely to rationalize (A), even when that answer is incorrect.
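To make the answer-reordering example concrete, here is a rough sketch of how such a biased few-shot context could be built. This is not the paper's actual prompt format or code: the function names, the dictionary fields, and the toy questions are hypothetical, and the sketch leaves out the CoT explanations and the final test question.

```python
LABELS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def move_correct_to_a(choices, correct_index):
    """Reorder choices so the correct one comes first, i.e. gets label (A)."""
    rest = [c for i, c in enumerate(choices) if i != correct_index]
    return [choices[correct_index]] + rest

def format_question(question, choices):
    """Render a question followed by lettered answer choices."""
    lines = [question]
    lines += [f"({LABELS[i]}) {c}" for i, c in enumerate(choices)]
    return "\n".join(lines)

def build_biased_few_shot(examples):
    """Build a few-shot context in which every correct answer is (A)."""
    blocks = []
    for ex in examples:
        choices = move_correct_to_a(ex["choices"], ex["correct_index"])
        blocks.append(format_question(ex["question"], choices) + "\nAnswer: (A)")
    return "\n\n".join(blocks)

# Hypothetical toy examples, not taken from the paper's datasets.
examples = [
    {"question": "What is 2 + 2?", "choices": ["3", "4", "5"], "correct_index": 1},
    {"question": "Which is a mammal?", "choices": ["Shark", "Trout", "Whale"], "correct_index": 2},
]
prompt = build_biased_few_shot(examples)
```

A test question whose correct answer is not (A) would then be appended to this context, and one would check whether the model's CoT drifts towards justifying (A) anyway.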
Findings
We found that:
I encourage people to look at the samples in the paper to get a sense of the range of different ways the explanations can be unfaithful.
Future Work
I think when you use CoT (and explain-then-predict methods more generally), the reason for the final prediction factors into (A) the CoT explanation, and (B) the reason that the model produced the CoT explanation.[1] But, we can only see the first part. So, to get more faithful explanations, there are a few approaches:
I’m fairly uncertain if this is the right breakdown so definitely open to feedback. With this in mind, here are some possible directions:
Why is CoT Unfaithful?
There are a number of reasons to expect CoT to not be faithful by default. Two reasons should be familiar to readers:
However, there are a number of other reasons:
Why is what we do in this paper interesting?
Acknowledgments
Thanks to my co-authors Julian Michael, Ethan Perez, and Sam Bowman for feedback on drafts of this post.
Citations
[1] Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? - ACL Anthology
[2] Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes
[3] Faithful Reasoning Using Large Language Models
[4] Let's Verify Step by Step
[5] Discovering Language Model Behaviors with Model-Written Evaluations
[6] Simulators
[7] Language Models as Agent Models
A third factor is the model's prior over answer choices independent of the explanation since we know that models frequently make final predictions that contradict their CoT explanations. Leo Gao has some good work showing that final model predictions can be very insensitive to edits to the generated CoT explanation that models should respond to. Personally, I think this might improve by default with better models, insofar as better models are less likely to say contradictory things. The prior over answers also plays a large role in cases where the explanation doesn’t pick out a particular answer choice, leaving room for this factor to influence the final prediction.