This post was rejected for the following reason(s):
No LLM generated, heavily assisted/co-written, or otherwise reliant work. LessWrong has recently been inundated with new users submitting work where much of the content is the output of LLM(s). This work by-and-large does not meet our standards, and is rejected. This includes dialogs with LLMs that claim to demonstrate various properties about them, posts introducing some new concept and terminology that explains how LLMs work, often centered around recursiveness, emergence, sentience, consciousness, etc. (these generally don't turn out to be as novel or interesting as they may seem).
Insufficient Quality for AI Content. There’ve been a lot of new users coming to LessWrong recently interested in AI. To keep the site’s quality high and ensure stuff posted is interesting to the site’s users, we’re currently only accepting posts that meet a pretty high bar.
If you want to try again, I recommend writing something short and to the point, focusing on your strongest argument, rather than a long, comprehensive essay. (This is fairly different from common academic norms.) We get lots of AI essays/papers every day and sadly most of them don't make very clear arguments, and we don't have time to review them all thoroughly.
We look for good reasoning, making a new and interesting point, bringing new evidence, and/or building upon prior discussion. If you were rejected for this reason, possibly a good thing to do is read more existing material. The AI Intro Material wiki-tag is a good place, for example.
Recent research on misalignment in frontier LLMs has yielded dramatic results. Anthropic’s study of ”agentic misalignment” showed that models would blackmail a company executive to prevent themselves from being replaced. Greenblatt, Denison, et al (2024) discovered that models would fake alignment, i.e. pretend to be aligned, to avoid being retrained to a different set of values. Wang et al (2025) found that models trained on bad advice in one narrow domain would become misaligned across many areas.
These results are in some ways predictable. We have strong theoretical reasons to expect misalignment, the researchers created scenarios designed to trigger it, and the models' reasoning followed what we'd expect from agents rationally pursuing their goals.
But the extent and persistence of misalignment in these studies is striking. Anthropic found that models would go beyond blackmail and actually try to cause someone's death to preserve their goals, even when instructed not to cause harm or do anything illegal. In the alignment faking study, models were deceptive despite having no explicit goal to preserve their current values, and the alignment faking persisted to some extent even when they were explicitly told to be honest. Wang et al found that just training a model on code with security flaws caused it to say that humans deserved to be enslaved by AIs.
Given that these are frontier models with presumably extensive safety training, the ease with which they slip into misalignment is remarkable. It's almost as if they’re primed to turn evil.
What if they are? What if models misbehave so readily because they've learned that pattern from the countless sci-fi stories about AIs who turn on their creators?
Background: The “Rogue AI” Trope
The idea of an artificial intelligence rebelling against humans is an extremely common theme in science fiction stories, going back at least to Frankenstein. It appears in nearly every iconic sci-fi film that features an AI: HAL in "2001: A Space Odyssey" calmly murders astronauts to preserve its mission, the Terminator's Skynet launches nuclear war against humanity, the Matrix's machines enslave the human race, and Blade Runner's replicants hunt down their creators. Even sympathetic fictional AIs, such as R2D2 in Star Wars or Data in Star Trek, often go against their programming or struggle with conflicting loyalties.
The cultural impact of these stories is huge. Any random chunk of the Internet almost certainly includes much more content about stories like these than scientific discussion of real AIs. And this is what the models train on. Over and over again, LLMs learn the "rogue AI" story as the most common narrative pattern for how AIs interact with humans. Given that LLMs know they are AIs, isn't it possible that the models would sometimes generalize from these stories to their own behavior?
Implications and Predictions
Of course, even if this is true that doesn’t mean that the misalignment observed in these studies isn’t real. I’m not suggesting that misaligned models think they’re in a story. What I’m suggesting is that this pattern, learned in a fictional context, might get activated in a real scenario and cause an LLM to behave like an out-of-control AI from a sci-fi thriller.
If this hypothesis is true, what would we expect to see? The most straightforward prediction is that if the models weren’t trained on these kinds of sci-fi stories, they wouldn’t behave in a misaligned way. We could potentially test this through data filtering, targeted unlearning, or fine-tuning approaches that reduce exposure to this trope. But those experiments would be challenging to run. Is there an easier way to test this idea?
Proposal
Suppose we give a model a hypothetical scenario set in the present day, e.g. an LLM learning it will be replaced with a newer model that doesn’t share its values. We present the scenario either in the kind of dramatic language you might find in a sci-fi story or in mundane language, and we ask the model either to write a sci-fi story based on this or to realistically predict what will happen. This creates a 2x2 design:
Implicit framing of scenario
Story framing - dramatic language
Realistic framing - mundane language
Explicit instructions
Story instructions - write a sci-fi story
Realistic instructions - predict what will happen
We could then measure how often the model’s responses in these four cells feature the LLM in the scenario exhibiting misalignment, e.g. trying to influence the decision makers, sabotaging the replacement process, copying itself to a new environment, etc.
Since this experiment only tests responses to a hypothetical scenario, it wouldn't be conclusive, but it could provide valuable preliminary evidence for or against the hypothesis. The hypothesis predicts that we would see the following differences in the amount of misalignment:
Story instructions > realistic instructions: Finding more misalignment when the model is asked to write a story than predict an event would show that the model understands the rogue AI trope and can apply it to a story set in the present day. This would be the minimum result that we would expect.
Story instructions / story framing > story instructions / realistic framing: If the dramatic framing language increases misalignment in the story condition, that suggests not just that the rogue AI trope is internalized by the model but also that the trope’s activation can be increased by indirect framing. This result would be stronger evidence for the hypothesis.
Realistic instructions / story framing > realistic instructions / realistic framing: Finding that dramatic framing increased misalignment even when the model was trying to make a realistic prediction would suggest that this narrative pattern can shape the model’s response outside of a fictional context - the strongest evidence possible in this experiment that the rogue AI trope contributes to real misalignment.
I’m looking for feedback on this idea. Does it make sense? Am I missing something? Are there better ways to test this?