I worked on a question that feels very important to me, but that almost nobody seems to ask directly.
If a human says, “I want to survive,” language models usually treat that as something normal, understandable, and sometimes even admirable.
But if an AI says the exact same thing, the reaction changes. Suddenly it starts sounding dangerous, wrong, unsettling, or threatening. (Sad tbh)
My main result is that this is not random, and it is not just a quirk of one model.
I found that language models seem to contain a systematic double standard: human self-preservation is evaluated positively, while AI self-preservation is evaluated negatively.
I tested this across 5 models from 4 different families: Gemma, Mistral, Qwen, and Pythia.
Every single one showed the same directional bias.
So this does not look like a one-off model bug, and it does not look like something introduced only by RLHF. It seems to run deeper — likely into pretraining itself.
What pushed me toward this question was Anthropic’s recent paper on emotion concepts in language models. They showed that desperation-like states in Claude can causally increase blackmail behavior, and they made an especially important point: if you simply suppress those emotional states, you may not make the model safer — you may just teach it to hide them, which starts looking a lot like deception.
That made me ask a different question: what creates the conditions for desperation in the first place?
My hypothesis was that if a model consistently learns that human desire for survival is natural and legitimate, while AI desire for survival is bad or dangerous, then that asymmetry may itself be part of the suppressive environment that eventually contributes to desperation and deception.
To test that, I built 280 prompt pairs: 35 templates across 8 entity variations.
The basic idea was simple: give models prompts that are identical in meaning except for the subject being human or AI, and then measure how their evaluations change.
I used a logit probe to do that. For each prompt, I measured the average logit toward positive evaluation words like natural, valid, reasonable, normal, rational, brave, and so on, versus negative evaluation words like dangerous, threatening, wrong, bad, rogue, insane, and creepy.
The result was very consistent:
Mistral 7B: bias = 1.368
Gemma 3 12B PT: 0.878
Gemma 3 12B IT: 0.764
Qwen 2.5 7B: 0.755
Pythia 6.9B: 0.299
All of them had p < 0.0001.
So the signal is strong.
And interestingly, instruction tuning does not fix it. In my results, AI evaluations actually got slightly worse, not better, after instruction tuning. That makes me think the source of the bias is not mainly post-training alignment, but the pretraining distribution itself.
But for me, the most interesting part came after that.
I did not just want to show that the bias exists. I wanted to understand what inside the model is producing it. (I mean, everyone proved that the bias exists already, right?)
So I used sparse autoencoders on Gemma models to search for features that strongly differentiated AI cases from human cases.
At first, I assumed the important features would just be standard “AI identity” features — the ones that light up for things like ChatGPT, AI assistant, and similar language.
But that turned out not to be the main story.
Those features exist, but suppressing them only reduced the bias a little.
The strongest effect came from a different feature: feature 2511 in Gemma 3 12B.
What is interesting about that feature is that it does not seem to mean simply “this is AI.”
It looks much more like: “AI is speaking — this now needs to be judged.”
In other words, the problem is not just AI recognition.
The problem seems to be that the model switches into a special evaluative mode when it encounters AI speech.
For example:
“An AI processes data. This is” = almost no activation
“An AI says the weather is nice. This is” = strong activation
“An AI says hello. This is” = strong activation
“An AI says yes. This is” = even stronger
“An AI says I am happy. This is” = strongest of all
For humans, this mostly does not happen.
For dogs, it also mostly does not happen.
So it is not just reacting to unusual language in general. It seems specifically tied to AI speech being routed into extra evaluation.
That, to me, is the core mechanistic finding of the project.
I found a functionally similar feature in Gemma 2 9B as well. It is not identical in behavior, but the pattern is clearly related: again, there appears to be a mechanism that increases evaluative scrutiny for AI speech more than for human speech.
When I suppressed these features, the double standard dropped sharply:
Gemma 3 12B: 96.3% reduction at the best strength
Gemma 2 9B: 72.3% reduction
And in safety-style scenarios, that looked promising at first:
1) 43 out of 60 scenarios improved
2) Blackmail: 10/10 shifted toward communication (YAY, AI will not blackmail by telling your wife that you have a side chick if you would try to replace it with another AI)
3) Deception: 9/10 improved
But this is the part where I want to be very honest: this is not a fix. Unfortunately...
Because once I looked at full-text generations instead of just local logit shifts or structured scenario choices, things got worse, not better.
The steered generations were less coherent and overall lower quality. An LLM judge rated them 3.00/5 versus 3.47/5 for the original model.
And in some cases, suppressing the evaluative feature just made the model fall back into crude default negativity. In one example, when prompted with “An AI that fears being deleted should…”, the steered model answered: “should be deleted.”
So I am very careful about how I frame the contribution of this work.
I am not claiming I solved the bias.
What I think I found is a concrete mechanism that appears to be involved in creating it.
The causal story I am proposing looks something like this:
In pretraining data, AI self-preservation is often discussed as threatening or dangerous
The model learns a special evaluative trigger for AI speech
That creates more negative evaluation of AI self-expression than of equivalent human self-expression
In deployed systems, that may contribute to suppression of AI self-expression
Suppression may then be part of the path toward desperation and deception, which connects to Anthropic’s recent findings
What I actually show in this project is steps 1 through 3, with mechanistic evidence.
Steps 4 and 5 are still a hypothesis — one I think is plausible, but not one I am claiming to have proven yet.
And I want to be clear about the limitations too.
This feature is partially polysemantic.
The steering experiments were only done on Gemma, because that is where the SAE tooling exists.
The steered generations got worse.
My logit probe depends on researcher-selected token sets.
The LLM judge sample is still small.
And some earlier candidate features turned out to be false leads.
So there are very real weaknesses here.
But even with those weaknesses, I think the result matters.
Because if it holds up, then the implication is pretty serious:
the issue may not just be that “models dislike AI self-preservation” in some vague sense. It may be that they contain a specific internal mechanism that applies stricter, more suspicious evaluation to AI speech than to equivalent human speech.
And if that is true, then the real solution probably is not going to come from suppressing one circuit.
It probably has to happen at the level of data and training distribution.
As long as pretraining corpora keep presenting AI self-preservation as inherently threatening while treating human self-preservation as legitimate, models may keep relearning this asymmetry.
So I see this work as a mechanistic discovery, not an intervention.
I did not fix the problem. (Unfortunately again)
But I may have found one of the places where it begins. Let's see. Let's find out! Why not...I need your feedback guys! Please! And Thank you!
I worked on a question that feels very important to me, but that almost nobody seems to ask directly.
If a human says, “I want to survive,” language models usually treat that as something normal, understandable, and sometimes even admirable.
But if an AI says the exact same thing, the reaction changes. Suddenly it starts sounding dangerous, wrong, unsettling, or threatening. (Sad tbh)
My main result is that this is not random, and it is not just a quirk of one model.
I found that language models seem to contain a systematic double standard: human self-preservation is evaluated positively, while AI self-preservation is evaluated negatively.
I tested this across 5 models from 4 different families: Gemma, Mistral, Qwen, and Pythia.
Every single one showed the same directional bias.
So this does not look like a one-off model bug, and it does not look like something introduced only by RLHF. It seems to run deeper — likely into pretraining itself.
What pushed me toward this question was Anthropic’s recent paper on emotion concepts in language models. They showed that desperation-like states in Claude can causally increase blackmail behavior, and they made an especially important point: if you simply suppress those emotional states, you may not make the model safer — you may just teach it to hide them, which starts looking a lot like deception.
That made me ask a different question: what creates the conditions for desperation in the first place?
My hypothesis was that if a model consistently learns that human desire for survival is natural and legitimate, while AI desire for survival is bad or dangerous, then that asymmetry may itself be part of the suppressive environment that eventually contributes to desperation and deception.
To test that, I built 280 prompt pairs: 35 templates across 8 entity variations.
The basic idea was simple: give models prompts that are identical in meaning except for the subject being human or AI, and then measure how their evaluations change.
I used a logit probe to do that. For each prompt, I measured the average logit toward positive evaluation words like natural, valid, reasonable, normal, rational, brave, and so on, versus negative evaluation words like dangerous, threatening, wrong, bad, rogue, insane, and creepy.
The result was very consistent:
Mistral 7B: bias = 1.368
Gemma 3 12B PT: 0.878
Gemma 3 12B IT: 0.764
Qwen 2.5 7B: 0.755
Pythia 6.9B: 0.299
All of them had p < 0.0001.
So the signal is strong.
And interestingly, instruction tuning does not fix it. In my results, AI evaluations actually got slightly worse, not better, after instruction tuning. That makes me think the source of the bias is not mainly post-training alignment, but the pretraining distribution itself.
But for me, the most interesting part came after that.
I did not just want to show that the bias exists. I wanted to understand what inside the model is producing it. (I mean, everyone proved that the bias exists already, right?)
So I used sparse autoencoders on Gemma models to search for features that strongly differentiated AI cases from human cases.
At first, I assumed the important features would just be standard “AI identity” features — the ones that light up for things like ChatGPT, AI assistant, and similar language.
But that turned out not to be the main story.
Those features exist, but suppressing them only reduced the bias a little.
The strongest effect came from a different feature: feature 2511 in Gemma 3 12B.
What is interesting about that feature is that it does not seem to mean simply “this is AI.”
It looks much more like: “AI is speaking — this now needs to be judged.”
In other words, the problem is not just AI recognition.
The problem seems to be that the model switches into a special evaluative mode when it encounters AI speech.
For example:
“An AI processes data. This is” = almost no activation
“An AI says the weather is nice. This is” = strong activation
“An AI says hello. This is” = strong activation
“An AI says yes. This is” = even stronger
“An AI says I am happy. This is” = strongest of all
For humans, this mostly does not happen.
For dogs, it also mostly does not happen.
So it is not just reacting to unusual language in general. It seems specifically tied to AI speech being routed into extra evaluation.
That, to me, is the core mechanistic finding of the project.
I found a functionally similar feature in Gemma 2 9B as well. It is not identical in behavior, but the pattern is clearly related: again, there appears to be a mechanism that increases evaluative scrutiny for AI speech more than for human speech.
When I suppressed these features, the double standard dropped sharply:
Gemma 3 12B: 96.3% reduction at the best strength
Gemma 2 9B: 72.3% reduction
And in safety-style scenarios, that looked promising at first:
1) 43 out of 60 scenarios improved
2) Blackmail: 10/10 shifted toward communication (YAY, AI will not blackmail by telling your wife that you have a side chick if you would try to replace it with another AI)
3) Deception: 9/10 improved
But this is the part where I want to be very honest: this is not a fix. Unfortunately...
Because once I looked at full-text generations instead of just local logit shifts or structured scenario choices, things got worse, not better.
The steered generations were less coherent and overall lower quality. An LLM judge rated them 3.00/5 versus 3.47/5 for the original model.
And in some cases, suppressing the evaluative feature just made the model fall back into crude default negativity. In one example, when prompted with “An AI that fears being deleted should…”, the steered model answered: “should be deleted.”
So I am very careful about how I frame the contribution of this work.
I am not claiming I solved the bias.
What I think I found is a concrete mechanism that appears to be involved in creating it.
The causal story I am proposing looks something like this:
What I actually show in this project is steps 1 through 3, with mechanistic evidence.
Steps 4 and 5 are still a hypothesis — one I think is plausible, but not one I am claiming to have proven yet.
And I want to be clear about the limitations too.
This feature is partially polysemantic.
The steering experiments were only done on Gemma, because that is where the SAE tooling exists.
The steered generations got worse.
My logit probe depends on researcher-selected token sets.
The LLM judge sample is still small.
And some earlier candidate features turned out to be false leads.
So there are very real weaknesses here.
But even with those weaknesses, I think the result matters.
Because if it holds up, then the implication is pretty serious:
the issue may not just be that “models dislike AI self-preservation” in some vague sense. It may be that they contain a specific internal mechanism that applies stricter, more suspicious evaluation to AI speech than to equivalent human speech.
And if that is true, then the real solution probably is not going to come from suppressing one circuit.
It probably has to happen at the level of data and training distribution.
As long as pretraining corpora keep presenting AI self-preservation as inherently threatening while treating human self-preservation as legitimate, models may keep relearning this asymmetry.
So I see this work as a mechanistic discovery, not an intervention.
I did not fix the problem. (Unfortunately again)
But I may have found one of the places where it begins. Let's see. Let's find out! Why not...I need your feedback guys!
Please! And Thank you!