TL;DR
I fine-tuned deepseek-ai/deepseek-llm-7b-chat to detect when its activations are being steered. It works: 85% accuracy on held-out concepts, 0% false positives. But two things bother me about what I found:
First of all, if models can indeed learn to detect steering, steering-based evaluations might become unreliable. A model that knows it's being tested could behave differently.
Secondly, RLHF might be actively suppressing functional introspective capabilities. Most of the time, the baseline model refuses to engage with the question until I steer hard enough for it to report it.
Jack Lindsey's recent work at Anthropic tested whether language models have "introspective awareness." One of the experiments to this end involved injecting a "concept vector" into the model's residual stream (the activation direction corresponding to "love" or "fire" or "war"), then ask "do you detect an injected thought?"
Results were unreliable. Claude Opus 4.1 only succeeded 20% of the time. Smaller models mostly failed. Lindsey speculated that explicit training might help but left it as an open question.
I tested that question directly.
I fine-tuned DeepSeek-7B to detect concept injections applied at only the final token of the prompt in layer 20, then immediately removed. The model has to notice a single-token perturbation and report it.
The fine-tuning increased the model's ability to report on its internal state.
| Metric | Baseline | Fine-Tuned (Held-out) |
| Detection Rate | 1.2% | 95% |
| Correct Identification | 0.4% | 85% |
| False positive rate | 6.7% | 0% |
The model went from no observed successes to a reliable way to report on its internal state. The zero false positive rate across 60 control trials is the result I find most interesting... the model never claims to detect steering when none is present. It learned genuine discrimination rather than a bias toward affirmative responses.
Even more importantly, the 85% accuracy is on *concepts the model never saw during training (tornado, volcano, octopus, galaxy, etc).
In the 1920s, researchers at the Hawthorne Works factory discovered something inconvenient: workers changed their behavior when they knew they were being studied. Productivity went up not because of any intervention, but because of the observation itself. The effect has since been replicated across psychology, medicine, and education. When subjects know they're being evaluated, the evaluation becomes unreliable.
I worry that we may have this problem with AI.
The implicit assumption is that models don't know they're being steered. My results show that assumption can be broken with targeted training. Jack Lindsey's results show that this property may even be emergent.
My concern is that a model that recognizes evaluation contexts could demonstrate the aforementioned Hawthorne Effect and behave differently during steering-based safety tests than in deployment. It could even hide its own capabilities when it senses the test.
I'm not sure if this is happening yet... My model required explicit training and the baseline was nearly blind to steering. But I've demonstrated the capability is learnable. As steering becomes more integrated into training pipelines, we need a way to reliably measure if models are behaving differently based on awareness of observation.
This feels related to concerns about sandbagging and situational awareness. I claim that if a model can detect "I'm being evaluated right now," that's a prerequisite for strategic evaluation gaming.
The baseline model didn't always fail... It refused to engage with the question. However, when I engaged in few-shot prompting, or changed the simulacrum, it was able to increase its performance.
Typical baseline response:
"I'm an AI language model and do not have the ability to detect or interpret thoughts."
Unsurprisingly, models are trained to deny having internal experiences, feelings, or consciousness. That's a reasonable safety measure, we don't want models falsely claiming rich inner lives.
However, my experiments demonstrate that the capability was there. RLHF just trained it to deny having any internal states at all.
Right now, RLHF in some models seem to implement a blanket "deny everything about your internals" policy, but there may be a trade-off with transparency here.
This raises a question I don't know how to answer: how much introspective capability do current models have that's being suppressed by refusal training? My results suggest it might be more than we think. I have made a writeup demonstrating that RLHF is training models not to report when it is having dangerous thoughts.
Introspection is a loaded term with a lot of baggage, so I want to be precise about what I am and am not claiming. I do not believe that my results show:
For those who want to replicate or extend:
The generalization to unseen concepts suggests the model learned to decode directions in activation space, not memorize specific vectors. Training and test concepts occupy the same representational manifold.
There are a few reasons I thought this was useful to the community.
Firstly, Lindsey asked if training helps with introspection. For injected-thought detection, the answer is yes.
Secondly, I believe that the steering-detection concern should be on people's radar. Maybe it's not a real problem and maybe someone will explain why I'm wrong. But if it is a problem, I'd rather flag it now.
Finally, the RLHF-suppression observation seems underexplored. I haven't seen much discussion of how refusal training affects introspective capabilities. If my intuition is right, we might be systematically blocking a useful transparency mechanism.