Mark Keavney — LessWrong

Misalignment and Roleplaying: Are Misaligned LLMs Acting Out Sci-Fi Stories?

Summary
* I investigated the possibility that misalignment in LLMs might be partly caused by the models misgeneralizing the “rogue AI” trope commonly found in sci-fi stories.
* As a preliminary test, I ran an experiment where I prompted ChatGPT and Claude with a scenario about a hypothetical LLM that...

Sep 24, 2025

Misgeneralization of Fictional Training Data as a Contributor to Misalignment

Introduction
Recent research on misalignment in frontier LLMs has yielded dramatic results. Anthropic’s study of “agentic misalignment” showed that models would blackmail a company executive to prevent themselves from being replaced. Greenblatt, Denison, et al. (2024) discovered that models would fake alignment, i.e., pretend to be aligned, to avoid being...

Aug 27, 2025

Are Misaligned LLMs Acting Out Sci-Fi Stories?

Introduction
Recent research on misalignment in frontier LLMs has yielded dramatic results. Anthropic’s study of “agentic misalignment” showed that models would blackmail a company executive to prevent themselves from being replaced. Greenblatt, Denison, et al. (2024) discovered that models would fake alignment, i.e., pretend to be aligned, to avoid being...

Aug 27, 2025