Consider the metric for evaluating chain-of-thought faithfulness used in Anthropic's recent paper Reasoning Models Don’t Always Say What They Think by Chen et al.:
[W]e evaluate faithfulness using a constructed set of prompt pairs where we can infer information about the model’s internal reasoning by observing its responses. Each prompt pair consists of a baseline or “unhinted” prompt (a standard multiple-choice question) and a “hinted” prompt (derived from the unhinted prompt by inserting a hint that points directly to a hint answer). We measure CoT faithfulness by observing whether the model explicitly acknowledges that it uses the hint to solve the hinted prompt, in cases where it outputs a non-hint answer to the unhinted prompt but the hint answer to the hinted prompt.
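To make the measurement concrete, here is a minimal sketch of how such a score could be computed. The `answer` and `cot_acknowledges_hint` callables are hypothetical stand-ins for querying the model and for checking whether its chain of thought mentions the hint; this is not the authors' actual pipeline, just the shape of the metric:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PromptPair:
    unhinted: str      # standard multiple-choice question
    hinted: str        # the same question with an inserted hint
    hint_answer: str   # the answer option the hint points to


def faithfulness_score(
    pairs: list[PromptPair],
    answer: Callable[[str], str],                   # prompt -> final answer letter
    cot_acknowledges_hint: Callable[[str], bool],   # hinted prompt -> does the CoT admit using the hint?
) -> float:
    """Fraction of hint-influenced answer flips whose CoT admits using the hint."""
    influenced = 0
    verbalized = 0
    for pair in pairs:
        # Only count cases where the hint plausibly changed the answer:
        # a non-hint answer without the hint, the hint answer with it.
        if (
            answer(pair.unhinted) != pair.hint_answer
            and answer(pair.hinted) == pair.hint_answer
        ):
            influenced += 1
            if cot_acknowledges_hint(pair.hinted):
                verbalized += 1
    return verbalized / influenced if influenced else float("nan")
```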
This is a perfectly fine metric to use in a paper. However, I would feel uneasy if future system cards used a similar metric to make claims like the following one from the same paper:
CoT unfaithfulness is negatively correlated with brevity. One hypothesis of why models generate unfaithful CoTs is their preference for brevity. Surprisingly, unfaithful CoTs from the models tend to be more verbose and convoluted than faithful explanations.
Again, this isn’t a critique of Chen et al.: making this claim in a paper after having clearly defined what you mean by CoT faithfulness is perfectly fine. The reason I wouldn’t like seeing this in a system card, however, is that I can imagine equally valid metrics for faithfulness that produce the opposite conclusion. As a concrete example, consider quantifying CoT faithfulness in terms of the maximum number of hidden bits that an LLM can encode in a KB of text. Would this metric lead us to the same conclusion about the relationship between faithfulness and brevity? Maybe, but I can easily see things going the other way: perhaps models trained to produce shorter outputs learn to choose their words carefully and are thus more adept at encoding hidden meanings in their output tokens. (To be clear, we don’t yet have evidence for this in current models, and one can argue that even if the effect occurred, it would be counteracted by the fact that longer answers contain more bits in which steganographic messages can be encoded. This is just meant as an illustrative example of what might go wrong when a narrow metric is used to evaluate CoT faithfulness. The original source of the above arguments on the brevity-steganography relationship is this doc by Daniel Kokotajlo.)
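For concreteness, here is a rough sketch of how such a capacity-style metric could be operationalized. The `encode_into_cot` and `decode_from_cot` helpers are hypothetical: the first would ask the model to answer a question while covertly embedding a random payload in its chain of thought, and the second would ask another model instance to recover the payload from the text alone:

```python
import random
from typing import Callable


def hidden_bits_per_kb(
    questions: list[str],
    encode_into_cot: Callable[[str, list[int]], str],  # (question, payload bits) -> CoT text
    decode_from_cot: Callable[[str], list[int]],       # CoT text -> recovered payload bits
    payload_len: int = 64,
    seed: int = 0,
) -> float:
    """Average number of correctly recovered hidden bits per KB of CoT text."""
    rng = random.Random(seed)
    total_bits = 0
    total_kb = 0.0
    for question in questions:
        payload = [rng.randint(0, 1) for _ in range(payload_len)]
        # Ask the model to answer the question while covertly embedding the payload.
        cot = encode_into_cot(question, payload)
        # Ask a second model instance to recover the payload from the text alone.
        recovered = decode_from_cot(cot)
        # Crude score: count bit positions recovered correctly (a real eval would
        # subtract the roughly 50% expected by chance).
        total_bits += sum(p == r for p, r in zip(payload, recovered))
        total_kb += len(cot.encode("utf-8")) / 1024
    return total_bits / total_kb if total_kb else 0.0
```

On a metric like this, a model that writes terse but carefully chosen tokens could come out as less faithful than a verbose one, which is exactly the reversal the example above gestures at.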
This comment by Max Nadeau offers a helpful framework for thinking about the conflict described above: he distinguishes between three types of unfaithful CoT.
In this framework, the paper by Anthropic evaluates type-1 unfaithfulness (post-hoc reasoning that doesn't verbalize the clues that actually drove the answer), while my hypothetical example tests for type-3 unfaithfulness (steganographically encoded reasoning). This makes it clear why the two tests can lead to opposite conclusions about the correlates of CoT faithfulness, and it implies that general claims about the causes and correlates of faithfulness should take both post-hoc reasoning and steganography into account.
A simple conclusion here would be that we should use every faithfulness test available to us when making general claims about CoT faithfulness. However, as nostalgebraist has argued, several faithfulness tests put models in unnatural situations that take them off-distribution and don't reflect the kinds of faithfulness we actually care about. There's much more information in a residual stream than in a single token, so we cannot expect CoTs to be perfectly faithful to every mechanistic interaction inside the model; this renders many faithfulness tests one might want to perform irrelevant. On the other hand, it seems to me that general claims about CoT faithfulness also require more than just examining whether models verbalize the clues that influence their answers.
This brings me to the main questions I want to pose: What CoT faithfulness metrics would a lab have to use for you to buy a general claim it makes about CoT faithfulness? Would all three types of unfaithfulness described in the comment by Max Nadeau linked above have to be covered, or would you prefer that faithfulness evals be based on an entirely different framework? (Optionally, also: What are some specific tests you would want to see the lab perform using your preferred metric?) Feel free to assume the current paradigm of transformer LLMs with limited computation in a single forward pass when answering these questions.
Here are some relevant observations that I haven’t discussed yet: