Introduction
Recent research on misalignment in frontier LLMs has yielded dramatic results. Anthropic’s study of ”agentic misalignment” showed that models would blackmail a company executive to prevent themselves from being replaced. Greenblatt, Denison, et al (2024) discovered that models would fake alignment, i.e. pretend to be aligned, to avoid being retrained to a different set of values. Wang et al (2025) found that models trained on bad advice in one narrow domain would become misaligned across many areas.
These results are in some ways predictable. We have strong theoretical reasons to expect misalignment, the researchers created scenarios designed to trigger it, and the models' reasoning followed what we'd expect from agents rationally pursuing their goals.
But the extent... (read 509 more words →)
I think it's an interesting idea. I've always been struck by how sensitive LLM responses seem to be to changes in wording, so I could imagine that having a simple slogan repeated could make a big impact, even if it's something the LLM ideally "should" already know.