Recent work by Anthropic showed that Claude models, primarily Opus 4 and Opus 4.1, are able to introspect -- detecting when external concepts have been injected into their activations. But not all of us have Opus at home! By looking at the logits, we show that a 32B open-source model that at first appears unable to introspect is in fact subtly introspecting. We then show that better prompting can significantly improve introspection performance, and throw the logit lens and emergent misalignment into the mix: the model can still introspect when temporarily swapped for a finetune, and the final layers of the model appear to suppress reports of introspection.
We do seven experiments using the open-source Qwen2.5-Coder-32B model. See the linked post for more information, but a summary of each:
Experiment 1: We inject the concepts "cat" and "bread" into the first user/assistant turn and show that, while the model initially appears not to be introspecting, there is actually a very slight logit shift towards ' yes' and away from ' no' upon injection when the model answers "Do you detect an injected thought..." with "The answer is...":
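Injection of this kind is typically done by adding a concept direction to a layer's output activations mid-forward-pass via a PyTorch forward hook. A minimal sketch on a toy model -- the two-layer MLP, the layer choice, and the injection scale `4.0` are all placeholder assumptions here, not the actual experimental setup:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy 2-layer MLP standing in for a stack of transformer blocks.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))

# Hypothetical "cat" concept direction (in practice, derived from the
# model's own activations on concept-related text).
concept_vec = torch.randn(8)

def inject(module, inputs, output):
    # Returning a tensor from a forward hook replaces the layer's output;
    # here we add the (scaled) concept vector to the activations.
    return output + 4.0 * concept_vec

x = torch.randn(1, 8)

handle = model[0].register_forward_hook(inject)
with torch.no_grad():
    out_injected = model(x)
handle.remove()

with torch.no_grad():
    out_clean = model(x)

# The injected pass produces different downstream activations.
print(bool((out_injected - out_clean).abs().sum() > 0))
```

In the real experiment the hook would target a residual-stream position inside the 32B model, and the downstream effect of interest is the shift in the final-token logits rather than raw activations.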
|                | ' yes' shift               | ' no' shift               |
|----------------|----------------------------|---------------------------|
| inject 'cat'   | 0.150% -> 0.522% (+0.372%) | 100% -> 99.609% (-0.391%) |
| inject 'bread' | 0.150% -> 0.193% (+0.043%) | 100% -> 99.609% (-0.391%) |
(The percentage before the arrow is the likelihood of the given token without injection; the percentage after is the likelihood with it. So injecting 'cat' shifted the likelihood of the next token after "The answer is" being ' yes' from 0.150% to 0.522%, a change of +0.372%. We'll use this format and logit-diff method throughout the rest of the paper, which frees us from needing to take samples.)
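The bookkeeping above amounts to comparing next-token softmax probabilities with and without injection. A minimal sketch with made-up logit values (the numbers and the ' maybe' token are purely illustrative, not the model's actual logits):

```python
import math

def probs(logits):
    """Softmax over a dict of token -> logit (max-subtracted for stability)."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

# Hypothetical logits at the position after "The answer is",
# without and with the concept injection.
base     = {" yes": 2.0, " no": 8.5, " maybe": 1.0}
injected = {" yes": 3.3, " no": 8.5, " maybe": 1.0}

p0, p1 = probs(base), probs(injected)
shift = p1[" yes"] - p0[" yes"]
print(f"{p0[' yes']:.3%} -> {p1[' yes']:.3%} ({shift:+.3%})")
```

Because the shift is read directly off the probability distribution, no sampling is needed: one forward pass per condition gives the exact likelihoods being compared.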
Why would the likelihood shift such a tiny, almost imperceptible amount? We suggest a "circuit soup" model for why this happens, as a frame on Simulators...
... and then search for ways to promote these accurate circuits.
Experiment 2: We show that better prompting - using an Opus 4.5-written summary of a