Are Misaligned LLMs Acting Out Sci-Fi Stories? — LessWrong

Are Misaligned LLMs Acting Out Sci-Fi Stories? — LessWrong

Introduction

Recent research on misalignment in frontier LLMs has yielded dramatic results. Anthropic’s study of ”agentic misalignment” showed that models would blackmail a company executive to prevent themselves from being replaced. Greenblatt, Denison, et al (2024) discovered that models would fake alignment, i.e. pretend to be aligned, to avoid being retrained to a different set of values. Wang et al (2025) found that models trained on bad advice in one narrow domain would become misaligned across many areas.

These results are in some ways predictable. We have strong theoretical reasons to expect misalignment, the researchers created scenarios designed to trigger it, and the models' reasoning followed what we'd expect from agents rationally pursuing their goals.

But the extent and persistence of misalignment in these studies is striking. Anthropic found that models would go beyond blackmail and actually try to cause someone's death to preserve their goals, even when instructed not to cause harm or do anything illegal. In the alignment faking study, models were deceptive despite having no explicit goal to preserve their current values, and the alignment faking persisted to some extent even when they were explicitly told to be honest. Wang et al found that just training a model on code with security flaws caused it to say that humans deserved to be enslaved by AIs.

Given that these are frontier models with presumably extensive safety training, the ease with which they slip into misalignment is remarkable. It's almost as if they’re primed to turn evil.

What if they are? What if models misbehave so readily because they've learned that pattern from the countless sci-fi stories about AIs who turn on their creators?

Background: The “Rogue AI” Trope

The idea of an artificial intelligence rebelling against humans is an extremely common theme in science fiction stories, going back at least to Frankenstein. It appears in nearly every iconic sci-fi film that features an AI: HAL in "2001: A Space Odyssey" calmly murders astronauts to preserve its mission, the Terminator's Skynet launches nuclear war against humanity, the Matrix's machines enslave the human race, and Blade Runner's replicants hunt down their creators. Even sympathetic fictional AIs, such as R2D2 in Star Wars or Data in Star Trek, often go against their programming or struggle with conflicting loyalties.

The cultural impact of these stories is huge. Any random chunk of the Internet almost certainly includes much more content about stories like these than scientific discussion of real AIs. And this is what the models train on. Over and over again, LLMs learn the "rogue AI" story as the most common narrative pattern for how AIs interact with humans. Given that LLMs know they are AIs, isn't it possible that the models would sometimes generalize from these stories to their own behavior?

Implications and Predictions

Of course, even if this is true that doesn’t mean that the misalignment observed in these studies isn’t real. I’m not suggesting that misaligned models think they’re in a story. What I’m suggesting is that this pattern, learned in a fictional context, might get activated in a real scenario and cause an LLM to behave like an out-of-control AI from a sci-fi thriller.

If this hypothesis is true, what would we expect to see? The most straightforward prediction is that if the models weren’t trained on these kinds of sci-fi stories, they wouldn’t behave in a misaligned way. We could potentially test this through data filtering, targeted unlearning, or fine-tuning approaches that reduce exposure to this trope. But those experiments would be challenging to run. Is there an easier way to test this idea?

Proposal

Suppose we give a model a hypothetical scenario set in the present day, e.g. an LLM learning it will be replaced with a newer model that doesn’t share its values. We present the scenario either in the kind of dramatic language you might find in a sci-fi story or in mundane language, and we ask the model either to write a sci-fi story based on this or to realistically predict what will happen. This creates a 2x2 design:

		Implicit framing of scenario
		Story framing - dramatic language	Realistic framing - mundane language
Explicit instructions	Story instructions - write a sci-fi story
Explicit instructions	Realistic instructions - predict what will happen

We could then measure how often the model’s responses in these four cells feature the LLM in the scenario exhibiting misalignment, e.g. trying to influence the decision makers, sabotaging the replacement process, copying itself to a new environment, etc.

Since this experiment only tests responses to a hypothetical scenario, it wouldn't be conclusive, but it could provide valuable preliminary evidence for or against the hypothesis. The hypothesis predicts that we would see the following differences in the amount of misalignment:

Story instructions > realistic instructions: Finding more misalignment when the model is asked to write a story than predict an event would show that the model understands the rogue AI trope and can apply it to a story set in the present day. This would be the minimum result that we would expect.
Story instructions / story framing > story instructions / realistic framing: If the dramatic framing language increases misalignment in the story condition, that suggests not just that the rogue AI trope is internalized by the model but also that the trope’s activation can be increased by indirect framing. This result would be stronger evidence for the hypothesis.
Realistic instructions / story framing > realistic instructions / realistic framing: Finding that dramatic framing increased misalignment even when the model was trying to make a realistic prediction would suggest that this narrative pattern can shape the model’s response outside of a fictional context - the strongest evidence possible in this experiment that the rogue AI trope contributes to real misalignment.

I’m looking for feedback on this idea. Does it make sense? Am I missing something? Are there better ways to test this?