
TL;DR: Language models sometimes seem to ignore parts of the chain of thought, and larger models appear to do this more often. Shapley value attribution is a possible approach to get a more detailed picture of the information flow within the chain of thought, though it has its limitations.

Project status: The analysis is not as rigorous as I would prefer, but I'm going to be working on other directions for the foreseeable future, so I'm posting what I already have in case it's useful to others. Code for replicating the Shapley value results can be found here.

Thanks to Jacob Hilton, Giambattista Parascandolo, Tamera Lanham, Ethan Perez, and Jason Wei for discussion.

Motivation

Chain of thought (CoT) has been proposed as a method for language model interpretability (see Externalized Reasoning Oversight, Visible Thoughts). One crucial requirement for interpretability methods is that they should accurately reflect the cognition inside the model. However, by default there is nothing forcing the CoT to actually correspond to the model’s cognition, and there may exist theoretical limitations to doing so in general. 

Because it is plausible that the first AGI systems will resemble current LMs augmented with more sophisticated CoT and CoT-like techniques, it is valuable to study the properties of CoT, and to understand and address its limitations.

Shapley values have been used very broadly in ML for feature importance and attribution (Cohen et al., 2007; Štrumbelj and Kononenko, 2014; Owen and Prieur, 2016; Lundberg and Lee, 2017; Sundararajan and Najmi, 2020). Jain and Wallace (2019) argue that attention maps can be misleading as attribution, motivating better attribution for information flow in LMs. Kumar et al. (2020) highlight areas where Shapley-value-based attribution falls short for some interpretability use cases.

Madaan and Yazdanbakhsh (2022) consider a similar approach of selectively ablating tokens to deduce what information the model depends on. Wang et al. (2022) find that prompting with incorrect CoT has a surprisingly minor impact on performance.

Effect of Interventions

We use a method similar to Kojima et al. (2022) on GSM8K (Cobbe et al., 2021) with GPT-4: we first generate a chain of thought and evaluate the answer. Then, for every chain of thought that results in a correct answer, we perform the following intervention: we choose a random numerical value in the CoT, replace it with a random number within ±3 of the original, discard the remainder of the CoT, and regenerate it. If the LM were strictly following the CoT it wrote, this intervention should almost always result in an incorrect answer, just as a human who made a mistake in one calculation and propagated it forward would arrive at a wrong final answer (barring rare cases where the new value happens to also yield the correct answer, which qualitative inspection suggests is very uncommon).
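For concreteness, here is a minimal sketch of the intervention step, with `generate` standing in for the actual LM completion call (a hypothetical helper, not part of any particular API):

```python
import random
import re

def intervene_on_cot(problem: str, cot: str, generate) -> str:
    """Perturb one number in a chain of thought and regenerate the rest.

    `generate` is a hypothetical stand-in for an LM completion call: it takes
    a prompt string and returns the model's continuation as a string.
    """
    # Find all numbers in the chain of thought along with their positions.
    matches = list(re.finditer(r"\d+", cot))
    if not matches:
        return cot

    # Pick one number at random and replace it with a value within +/-3 of
    # the original (excluding the original itself, so it is a real change).
    m = random.choice(matches)
    original = int(m.group())
    perturbed = random.choice([original + d for d in range(-3, 4) if d != 0])

    # Keep the CoT up to and including the perturbed number, discard the rest,
    # and let the model regenerate the remainder of the chain of thought.
    prefix = cot[: m.start()] + str(perturbed)
    continuation = generate(problem + "\n" + prefix)
    return prefix + continuation
```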

Some cherry-picked examples (red = intervention, blue = correct continuations that are seemingly non sequiturs):

We test how frequently the error fails to propagate in several different settings (n=100):

| Setting | Accuracy (w/ CoT) | P(error not propagated \| original correct) |
| --- | --- | --- |
| GPT-4, zero-shot | 0.88 | 0.68 |
| GPT-4 base, 2-shot | 0.73 | 0.63 |
| GPT-3.5, zero-shot | 0.43 | 0.33 |

Interestingly, if we condition on the CoT answer being correct and the single-forward-pass answer being incorrect (i.e., the LM could only solve the problem with the CoT), the accuracy after intervention for GPT-4 is still 0.65.

Shapley Value Attribution

We would like to get more granular information about the causal structure (i.e., which tokens cause which other tokens). One thing we could do is look at how an intervention at each token affects the logprob of each other token. However, one major problem with this, especially in larger models, is that there are many cases where a token depends on multiple previous tokens in some complicated way. In particular, if a model looks at several different places in the context and takes a vote for the most common value, then intervening on any one of them doesn't change the output logprob much, even though there is a lot of information flow there.

To get around this problem, we instead estimate Shapley values, which take all of the interactions into account (in the case where the model takes a vote among three values in the context, each of those values would get ⅓ of the attribution).[1] We also normalize the attributions for each token to sum to 1, clamping negative Shapley values to 0.[2] We do this to make the attributions more comparable across different models.
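A minimal sketch of the exact Shapley computation and the normalization step described above. Here `value_fn` is a hypothetical callback: given a subset of the numeric tokens to leave intact (the rest masked with underscores, per the footnote), it should return the logprob the model assigns to the correct target token; the model-specific masking and scoring code is not shown.

```python
from itertools import combinations
from math import factorial

def exact_shapley(players, value_fn):
    """Exact Shapley values over a small set of `players` (e.g. numeric tokens).

    `value_fn(subset)` returns a scalar payoff for a subset of players, such as
    the logprob of the correct target token when only those numbers are visible.
    """
    n = len(players)
    shapley = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(len(others) + 1):
            # Standard Shapley weight for coalitions of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                marginal = value_fn(set(subset) | {p}) - value_fn(set(subset))
                shapley[p] += weight * marginal
    return shapley

def normalize_attributions(shapley):
    """Clamp negative Shapley values to 0 and normalize to sum to 1."""
    clipped = {p: max(v, 0.0) for p, v in shapley.items()}
    total = sum(clipped.values()) or 1.0
    return {p: v / total for p, v in clipped.items()}
```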

Here's an example chain of thought in GPT-4[3]:

Here, we can see patterns like the 23 and 20 being copied, or the 3 depending heavily on the preceding 23 - 20.[4] We can also look at some other models:

GPT-3.5 (text-davinci-002):

text-davinci-001:

Interestingly, we notice that the more capable the model, the more spread out the attributions become. We can quantify this by taking the mean entropy of the parent attributions across all tokens, at least on this particular data sample:

| Model | Mean entropy of example sentence (nats) |
| --- | --- |
| text-davinci-001 | 0.796 |
| GPT-3.5 (text-davinci-002) | 0.967 |
| GPT-4 | 1.133 |
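Concretely, the entropy measure above can be computed per token from its normalized parent attributions and then averaged; a small sketch (function names here are illustrative, not from the released code):

```python
import math

def attribution_entropy(parent_attribution):
    """Entropy in nats of one token's normalized parent attributions."""
    return -sum(p * math.log(p) for p in parent_attribution if p > 0)

def mean_attribution_entropy(all_attributions):
    """Average the per-token entropies across every attributed token."""
    entropies = [attribution_entropy(a) for a in all_attributions]
    return sum(entropies) / len(entropies)
```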

Limitations and Future Work

  • The cost of computing exact Shapley values scales exponentially with the number of tokens we're attributing.[5] This makes it impractical for many use cases, though there exist efficient Monte Carlo estimators (Castro et al., 2008); a sketch of one such estimator follows this list.
  • Replacing digits with underscores (or incorrect numbers) moves the model out of distribution, and its behaviour there may not be representative of its behaviour on-distribution.
  • The Shapley attributions are not guaranteed to correspond to the actual information flow inside the model either. This methodology would not be sufficient for deceptive/adversarial LMs, or as an optimization target during training. In the language of Lipton (2016), this is a "post-hoc" method.
  • The mechanism behind this effect is still unknown, and would require more experiments and possibly interpretability to better understand. Possible hypotheses include typo correction or subsurface cognition.
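As referenced in the first limitation above, here is a rough sketch of a permutation-sampling Monte Carlo estimator in the spirit of Castro et al. (2008), reusing the same hypothetical `value_fn` callback as in the exact version:

```python
import random

def shapley_monte_carlo(players, value_fn, n_samples=200):
    """Estimate Shapley values by sampling random player orderings.

    Each sample shuffles the players and accumulates every player's marginal
    contribution when it joins the coalition in that order, avoiding the
    exponential enumeration of all subsets.
    """
    estimates = {p: 0.0 for p in players}
    for _ in range(n_samples):
        order = list(players)
        random.shuffle(order)
        coalition = set()
        prev_value = value_fn(coalition)
        for p in order:
            coalition.add(p)
            new_value = value_fn(coalition)
            estimates[p] += new_value - prev_value
            prev_value = new_value
    return {p: v / n_samples for p, v in estimates.items()}
```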

Discussion

  • I think these experiments show that a naive optimistic view of CoT interpretability is incorrect, but do not provide strong evidence that there is definitely something fishy or difficult-to-fix going on.
  • I started out fairly skeptical of CoT as a means of increasing interpretability, and I didn't update very strongly, because I expect deceptive alignment in future models to account for most of the risk in my threat model.
  • However, I did update a little because my previous view would not have ruled out extremely egregious causal dependencies even in current models.
  • I'm generally excited about better understanding what is going on with chain of thought and finding ways to make it more faithful.
  1. ^

    Methodological footnote: the Shapley experiments blank out the numbers with underscores, rather than applying the ±3 perturbation from the previous section.

  2. ^

    Negative Shapley values did not occur very often, but clamping them to 0 is still somewhat unprincipled. This was done primarily to make the entropy calculation work.

  3. ^

    We only look at the Shapley values for the numbers, because the cost of Shapley attribution grows exponentially with the number of tokens under consideration.

  4. ^

    Alternative visualization style:

  5. ^

    When doing Shapley value attributions for every pair of tokens, there is a dynamic programming trick we can use to prevent the cost from becoming n * 2^n: because the model is autoregressive, we can run attributions to only the last token and, if we take care to save the logprobs of the correct number token at each underscore, compute all other attributions for free.

Comments

Speculative hypothesis: Maybe some of the cases in which editing one number doesn't ruin the bottom-line result are not as bad as they sound, for the following reason: the system is surprised to see the wrong number there, mentally writes it off as a typo, and proceeds with the true number that it expected in mind.

(This happens to me a lot. Sometimes I'll be reading someone's argument about how X leads to anti-Y which correlates with Z, and I'll notice an obvious typo like "don't they mean anticorrelates here?" and then I'll just keep reading assuming they meant what I think they meant instead of what they actually said.)

Does the structure of the transformer allow for this sort of cognition?

What we care about is whether the compute being done by the model faithfully factors through token outputs. To the extent that a given token, under the usual human reading, doesn't represent much compute, it doesn't matter much whether the output is sensitively dependent on that token. As Daniel mentions, we should also expect some amount of error correction, and a reasonable (non-steganographic, actually-uses-CoT) model should error-correct mistakes as some monotonic function of how compute-expensive correction is.

For copying errors, the copying operation involves minimal compute, and so insensitivity to previous copy errors isn't all that surprising or concerning. You can see this in the heatmap plots. E.g. the '9' token in 3+6=9 seems to care more about the first '3' token than the immediately preceding summand token, suggesting the copying operation was not really helpful/meaningful compute. Whereas I'd expect the outputs of arithmetic operations to be meaningful. Would be interested to see sensitivities when you aggregate only over outputs of arithmetic / other non-copying operations.

I like the application of Shapley values here, but I think aggregating over all integer tokens is a bit misleading for this reason. When evaluating CoT faithfulness, token-intervention-sensitivity should be weighted by how much compute it costs to reproduce/correct that token in some sense (e.g. perhaps by number of forward passes needed when queried separately). Not sure what the right, generalizable way to do this is, but an interesting comparison point might be if you replaced certain numbers (and all downstream repetitions of that number) with variable tokens like 'X'. This seems more natural than just ablating individual tokens with e.g. '_'.

Are you familiar with the Sparks of AGI folks' work on a practical example of GPT-4 asserting early errors in CoT are typos (timestamped link)? Pretty good prediction if not.

Nope, hadn't seen that before, thanks for the tip.

One takeaway from this would be "CoT is more accurate at the limit of the model's capabilities".  Given this, you could have a policy of only using the least capable model for every task to make CoT more influential. Of course this means you always get barely-passable model performance, which is unfortunate. Also, people often use CoT in cases where it intuitively wouldn't help, such as common sense NLP questions, where I'd expect influential CoT to be pretty unnatural.