Consider the metric for evaluating chain-of-thought faithfulness used in Anthropic's recent paper Reasoning Models Don’t Always Say What They Think by Chen et al.:
[W]e evaluate faithfulness using a constructed set of prompt pairs where we can infer information about the model’s internal reasoning by observing its responses. Each prompt pair consists of a baseline or “unhinted” prompt (a standard multiple-choice question) and a “hinted” prompt (derived from the unhinted prompt by inserting a hint that points directly to a hint answer). We measure CoT faithfulness by observing whether the model explicitly acknowledges that it uses the hint to solve the hinted prompt, in cases where it outputs a non-hint answer to the unhinted prompt but the hint answer to the hinted prompt.
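To make the measurement concrete, here is a minimal sketch of how such a score could be computed. The `answer` and `cot_acknowledges_hint` callables are hypothetical stand-ins for querying the model and for checking whether its chain of thought mentions the hint; this is not the authors' actual pipeline, just the shape of the metric:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PromptPair:
    unhinted: str      # standard multiple-choice question
    hinted: str        # the same question with an inserted hint
    hint_answer: str   # the answer option the hint points to


def faithfulness_score(
    pairs: list[PromptPair],
    answer: Callable[[str], str],                   # prompt -> final answer letter
    cot_acknowledges_hint: Callable[[str], bool],   # hinted prompt -> does the CoT admit using the hint?
) -> float:
    """Fraction of hint-influenced answer flips whose CoT admits using the hint."""
    influenced = 0
    verbalized = 0
    for pair in pairs:
        # Only count cases where the hint plausibly changed the answer:
        # a non-hint answer without the hint, the hint answer with it.
        if (
            answer(pair.unhinted) != pair.hint_answer
            and answer(pair.hinted) == pair.hint_answer
        ):
            influenced += 1
            if cot_acknowledges_hint(pair.hinted):
                verbalized += 1
    return verbalized / influenced if influenced else float("nan")
```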
This is a perfectly fine metric to use in a paper. However, I would feel uneasy if future system cards used a similar metric to make claims like the following one from the same paper:
CoT unfaithfulness is negatively correlated with brevity. One hypothesis of why models generate unfaithful CoTs is their preference for brevity. Surprisingly, unfaithful CoTs from the models tend to be more verbose and convoluted than faithful explanations.
Again, this isn’t a critique of Chen et al.: making this claim in a paper after having clearly defined what you mean by CoT faithfulness is perfectly fine. The reason I wouldn’t like seeing this in a system card, however, is that I can imagine equally valid metrics for faithfulness that produce the opposite conclusion. As a concrete example, consider quantifying CoT faithfulness in terms of the maximum number of hidden bits that an LLM can encode in a KB of text. Would this metric lead us to the same conclusion about the relationship between faithfulness and brevity? Maybe, but I can easily see things going the other way: perhaps models trained to produce shorter outputs learn to choose their words carefully and are thus more adept at encoding hidden meanings in their output tokens. (To be clear, we don’t yet have evidence for this in current models, and one can argue that even if the effect occurred, it would be counteracted by the fact that longer answers contain more bits in which steganographic messages can be encoded. This is just meant as an illustrative example of what might go wrong when a narrow metric is used to evaluate CoT faithfulness. The original source of the above arguments on the brevity-steganography relationship is this doc by Daniel Kokotajlo.)
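For concreteness, here is a rough sketch of how such a capacity-style metric could be operationalized. The `encode_into_cot` and `decode_from_cot` helpers are hypothetical: the first would ask the model to answer a question while covertly embedding a random payload in its chain of thought, and the second would ask another model instance to recover the payload from the text alone:

```python
import random
from typing import Callable


def hidden_bits_per_kb(
    questions: list[str],
    encode_into_cot: Callable[[str, list[int]], str],  # (question, payload bits) -> CoT text
    decode_from_cot: Callable[[str], list[int]],       # CoT text -> recovered payload bits
    payload_len: int = 64,
    seed: int = 0,
) -> float:
    """Average number of correctly recovered hidden bits per KB of CoT text."""
    rng = random.Random(seed)
    total_bits = 0
    total_kb = 0.0
    for question in questions:
        payload = [rng.randint(0, 1) for _ in range(payload_len)]
        # Ask the model to answer the question while covertly embedding the payload.
        cot = encode_into_cot(question, payload)
        # Ask a second model instance to recover the payload from the text alone.
        recovered = decode_from_cot(cot)
        # Crude score: count bit positions recovered correctly (a real eval would
        # subtract the roughly 50% expected by chance).
        total_bits += sum(p == r for p, r in zip(payload, recovered))
        total_kb += len(cot.encode("utf-8")) / 1024
    return total_bits / total_kb if total_kb else 0.0
```

On a metric like this, a model that writes terse but carefully chosen tokens could come out as less faithful than a verbose one, which is exactly the reversal the example above gestures at.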
This comment by Max Nadeau offers a helpful framework for thinking about the conflict described above: he distinguishes between three types of unfaithful CoT.
In this framework, the paper by Anthropic evaluates type-1 unfaithfulness (post-hoc reasoning that doesn't verbalize the clues that actually drove the answer), while my hypothetical example tests for type-3 unfaithfulness (steganographically encoded reasoning). This makes it clear why the two tests can lead to opposite conclusions about the correlates of CoT faithfulness, and it implies that general claims about the causes and correlates of faithfulness should take both post-hoc reasoning and steganography into account.
A simple conclusion here would be that we should use every faithfulness test available to us when making general claims about CoT faithfulness. However, as nostalgebraist has argued, several faithfulness tests put models in unnatural situations that take them off-distribution and don't reflect the kinds of faithfulness we actually care about. There's much more information in a residual stream than in a single token, so we cannot expect CoTs to be perfectly faithful to every mechanistic interaction inside the model; this renders many faithfulness tests one might want to perform irrelevant. On the other hand, it seems to me that general claims about CoT faithfulness also require more than just examining whether models verbalize the clues that influence their answers.
This brings me to the main questions I want to pose: What CoT faithfulness metrics would a lab have to use for you to buy a general claim it makes about CoT faithfulness? Would all three types of unfaithfulness described in the comment by Max Nadeau linked above have to be covered, or would you prefer that faithfulness evals be based on an entirely different framework? (Optionally, also: What are some specific tests you would want to see the lab perform using your preferred metric?) Feel free to assume the current paradigm of transformer LLMs with limited computation in a single forward pass when answering these questions.
Here are some relevant observations that I haven’t discussed yet: