A Structural Theory of AI Alignment
I ended up writing this sequence while trying to get my own head around “the AI alignment problem.” I was initially drawn in, and frankly awed, by the sheer capabilities of large language models. But that awe quickly gave way to noticing small idiosyncrasies and failures in day-to-day use.
What struck me was how wafer-thin the apparent safety of these systems often feels compared to the power and potential they already possess. That discomfort pushed me to dig deeper into why that is. I hadn’t expected that we would still be at such an early stage in understanding and addressing safety, ethics, and welfare-related questions around AI systems.
As I read more, I found it difficult to form a structured picture of the problem space or to understand how different strands of alignment research relate to one another. I wanted a way to reason about trajectories, trade-offs, and open questions, both to orient myself in the field and to find a meaningful niche for my own exploration. When I couldn’t find an existing framework that quite did this, I started sketching my own ways of representing the alignment problem.
This sequence is the result of that process. It may be incomplete or wrong in critical ways. My hope is that, through critique and feedback, it can evolve into something a little less wrong, and perhaps useful to others who are also trying to make sense of this space.