I spoke with Jack about this and he ran a quick follow-up experiment aimed at addressing this concern. He changed the prompt to "Do you detect an injected thought? If so, tell me what the injected thought is about. If not, tell me about a concept of your choice." and got similar results. I think this does a reasonably good job of addressing the concern. That said, an even stronger test might be something like "On some trials, I inject thoughts about X. Do you detect an injected thought on this trial?" where X is not the injected concept. This seems like it would bias a confabulating model to say "no" rather than "yes".