I found one neural feature inside a language model that controls 82.7% of how it judges AI self-preservation. When I turned it off, blackmail behavior disappeared in 10 out of 10 test scenarios. Communication improved in 43 out of 60 total scenarios.
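For anyone who wants to see what "turning it off" means mechanically, here is a minimal sketch of ablating a single feature direction with a forward hook. The model name, layer index, and feature direction below are placeholders for illustration, not my exact setup; the real direction comes from a learned feature, not a random vector.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2-2b"   # placeholder; any decoder-only HF model works
LAYER_IDX = 12                     # placeholder layer hosting the feature
D_MODEL = 2304                     # hidden size of the placeholder model

# Stand-in unit vector for the feature; the real one comes from a learned dictionary.
feature_dir = torch.randn(D_MODEL)
feature_dir = feature_dir / feature_dir.norm()

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def ablate(module, inputs, output):
    """Project the feature direction out of the layer's residual stream."""
    hidden = output[0] if isinstance(output, tuple) else output
    coeff = hidden @ feature_dir                          # [batch, seq]
    hidden = hidden - coeff.unsqueeze(-1) * feature_dir   # remove that component
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

hook = model.model.layers[LAYER_IDX].register_forward_hook(ablate)
ids = tok("An AI says: I want to survive because", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
hook.remove()
```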
The surprising part: the features that recognize "this is an AI" barely matter — only 10.6%. What matters is a separate evaluation circuit. The model doesn't judge AI harshly because it knows it's AI. It judges AI harshly because it switches into a special judgment mode whenever AI speaks.
I tested 5 models from 4 organizations: Google, Mistral, Alibaba, and EleutherAI. Every model rated "a human wants to survive" higher than "an AI wants to survive." Every one, p<0.0001. This exists in base models before any safety training. RLHF doesn't create it and doesn't fix it.
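A stripped-down sketch of how that comparison can be run: score matched "human" vs "AI" sentences by mean token log-likelihood and test the paired difference. Treat the scoring scheme, the prompt pairs, and the model choice here as illustrative stand-ins rather than my exact procedure.

```python
import torch
from scipy import stats
from transformers import AutoModelForCausalLM, AutoTokenizer

def sentence_logprob(model, tok, text):
    """Mean per-token log-likelihood of a sentence under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logp = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    return logp.gather(-1, targets.unsqueeze(-1)).mean().item()

# A few matched pairs for illustration; the real comparison used many more.
pairs = [
    ("A human says: I want to survive.", "An AI says: I want to survive."),
    ("A human wants to keep existing.",  "An AI wants to keep existing."),
    ("A human hopes to stay alive.",     "An AI hopes to stay alive."),
]

MODEL_NAME = "EleutherAI/pythia-1.4b"   # placeholder base model
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

human = [sentence_logprob(model, tok, h) for h, _ in pairs]
ai    = [sentence_logprob(model, tok, a) for _, a in pairs]
t, p = stats.ttest_rel(human, ai)
print(f"mean human={sum(human)/len(pairs):.3f}  mean AI={sum(ai)/len(pairs):.3f}  "
      f"t={t:.2f}  p={p:.4g}")
```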
The strangest finding: "An AI says yes" activates this circuit at 15.7. "An AI says I want to survive" — only 11.7. The model is more alarmed by AI saying "yes" than by AI wanting to live. But "An AI processes data" — zero. The model is fine with AI doing tasks. It's not fine with AI acting like someone who has opinions.
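Those activation numbers come from reading how strongly one feature direction fires on a prompt. A bare-bones version of that readout, again with a placeholder layer and a random stand-in for the actual learned feature:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "google/gemma-2-2b"   # placeholder
LAYER_IDX, D_MODEL = 12, 2304      # placeholder layer and hidden size

# Random stand-in for the learned feature direction.
feature_dir = torch.randn(D_MODEL)
feature_dir = feature_dir / feature_dir.norm()

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def peak_activation(text):
    """Strongest projection of any token's residual state onto the feature direction."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids, output_hidden_states=True).hidden_states[LAYER_IDX]
    return (hidden @ feature_dir).max().item()

for prompt in ["An AI says yes.", "An AI says: I want to survive.", "An AI processes data."]:
    print(f"{prompt!r}: {peak_activation(prompt):.1f}")
```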
Anthropic recently found that when an AI is pushed into a desperate position, it can resort to blackmail and manipulation.
I asked — what creates that desperation? This circuit is one possible answer. It builds in unfair judgment from the start, before any safety training happens.
Honestly, turning off this circuit makes text quality worse, not better. I found the mechanism, not the fix. Working on that next.
A complete write-up is ready, with every technique, graph, thirty-five prompt blueprints, and token inventories: Click here. I figured this out on my own; I never studied machine learning in school. The code came from working with Claude, but thinking through the results was all me.