The Projection Problem: Two Pitfalls in AI Safety Research
TLDR: A lot of AI safety research starts from x-risks posed by superintelligent AI. That's the right starting point. But when these research agendas get projected onto empirical work with current LLMs, two things tend to go wrong: we conflate "misaligned AI" with "failure to align," and we end up doing product safety while believing we're working on existential risk. Both pitfalls are worth being aware of.

Epistemic status: This is an opinion piece. It does not apply to all AI safety research, and a lot of that work has been genuinely impactful. But I think there are patterns worth calling out and discussing. An LLM was used to structure the article and improve sentences for clarity.

The Two Confusions

There are two distinctions that don't get made clearly enough in this space, and both have real consequences for how research gets done.

The first is between misaligned AI and failure to align. When most people hear "misaligned AI," they imagine something with agency: a system that has its own goals and is pursuing them against our interests. But a lot of the time, "misaligned" is used to describe something much simpler: we trained a system and it didn't do what we wanted. No intent, no goals, no scheming. Just an engineering failure. These two things are very different, but they are constantly treated as the same, and that has consequences for how we interpret empirical results.

The second is between AI safety research aimed at x-risks and AI safety as a product problem. Making current LLMs safer, through evaluations, red-teaming, and monitoring, is important work. But it's also work that any AI company deploying these systems needs to do anyway. It has commercial incentive behind it. It is not a neglected problem. And yet a lot of it gets funded and framed as if it's addressing existential risk.

Both confusions tend to crystallise at the same point: the moment when a research agenda built around superintelligent AI gets projected onto empirical work with current LLMs.
Adding this comment based on a discussion of this post that took place outside of LessWrong.
The misalignment vs. failure to align distinction may not be the most useful framing for all readers. A more direct question to consider is:
"Can the misaligned model stay deployed and acquire power?"
This framing clarifies the core argument: many x-risk evaluation papers do not adequately address this question. In fact, capability evaluations are often more informative for this purpose. In contrast, repetitive papers stress-testing alignment tend to produce insights equivalent to discovering new jailbreaks, which offer limited value when current models' alignment is already known to be brittle.