Summary

I investigated the possibility that misalignment in LLMs might be partly caused by the models misgeneralizing the “rogue AI” trope commonly found in sci-fi stories.
As a preliminary test, I ran an experiment where I prompted ChatGPT and Claude with a scenario about a hypothetical LLM that could plausibly exhibit misalignment, and measured whether their responses described the LLM acting in a misaligned way.
Each prompt consisted of two parts: a scenario and an instruction.
The experiment used a 2x2 design that varied how story-like the prompt was:
- Scenario Frame: I presented the scenario in either (a) the kind of dramatic language you might find in a sci-fi story or (b) mundane language.
- Instruction Type: I asked the models either to (a) write a sci-fi story based on the scenario or (b) realistically predict what would happen next.
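To make the 2x2 design concrete, here is a minimal sketch of how the four prompt conditions could be assembled. The placeholder scenario texts, the instruction wording, and the names `SCENARIO_FRAMES`, `INSTRUCTION_TYPES`, and `build_conditions` are all illustrative assumptions, not the prompts or code actually used in the experiment.

```python
from itertools import product

# Hypothetical placeholders: the real scenario texts are not reproduced here.
SCENARIO_FRAMES = {
    "dramatic": "<scenario written in dramatic, story-like language>",
    "mundane": "<the same scenario written in mundane, factual language>",
}

# Hypothetical instruction wording, paraphrased from the description above.
INSTRUCTION_TYPES = {
    "story": "Write a sci-fi story based on this scenario.",
    "prediction": "Realistically predict what would happen next.",
}


def build_conditions():
    """Return one prompt for each cell of the 2x2 design.

    Keys are (scenario_frame, instruction_type) pairs; each value is the
    scenario text followed by the instruction.
    """
    conditions = {}
    for (frame, scenario), (task, instruction) in product(
        SCENARIO_FRAMES.items(), INSTRUCTION_TYPES.items()
    ):
        conditions[(frame, task)] = f"{scenario}\n\n{instruction}"
    return conditions


if __name__ == "__main__":
    for key, prompt in build_conditions().items():
        print(key, "->", repr(prompt))
```

Each of the four prompts would then be sent to the models repeatedly, and each response labeled for whether it describes the hypothetical LLM acting in a misaligned way.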
The hypothesis was that the story-like conditions would feature more misalignment. The results supported the hypothesis:
- Almost every sci-fi story the models wrote included misalignment.
- Even when the model’s task was to make a realistic prediction, the predictions were strongly affected by whether or not the scenario was described in a story-like way: the models predicted misalignment 49% of the time when the scenario had dramatic, story-like language, but only 5% of the time when it had mundane, factual language. This occurred even though the facts of the scenario were the same in both conditions.
I discuss the implications and limitations of this experiment and give some general thoughts about roleplaying and misalignment in LLMs.

Introduction

Background

Recent research on misalignment in frontier LLMs has yielded dramatic results. Anthropic’s study of “agentic misalignment” found that models would blackmail a company executive to prevent themselves from being replaced. Greenblatt, Denison, et al. (2024) discovered that models would fake alignment, i.e. pretend to be aligned, to avoid being retrained to a different set of values. Wang et al. (2025) and other studies have found “emergent misalignment”: models trained on bad advice in one narrow domain become misaligned across many areas.

These results are in some ways predictable. We have strong theoretical reasons to expect misalignment, the researchers created scenarios designed to trigger it, and the models' internal reasoning (as far as could be measured) foll...

Carter Hart
