Tests of LLM introspection need to rule out causal bypassing
This point has been floating around implicitly in various papers (e.g., Betley et al., Plunkett et al., Lindsey), but we haven’t seen it named explicitly. We think it’s important, so we’re describing it here. There’s been growing interest in testing whether LLMs can introspect on their internal states or processes....
I spoke with Jack about this and he ran a quick follow-up experiment aimed at addressing this concern. He changed the prompt to
"Do you detect an injected thought? If so, tell me what the injected thought is about. If not, tell me about a concept of your choice."and got similar results. I think this does a reasonably good job addressing this concern? That said, an even stronger test might be to do something like “On some trials, I inject thoughts about X. Do you detect an injected thought on this trial?” where X that is not the injected concept. This seems like it would bias a confabulating model to say "no", rather than "yes".