This post provides an overview of the sequence and covers background concepts that the later posts build on. If you're already familiar with AI alignment, you can likely skim or skip the foundations section.
This sequence explains the difficulties of the alignment problem and our current approaches for attacking it. We mainly look at alignment approaches that we could actually implement if we develop AGI within the next 10 years, but most of the discussed problems and approaches are likely still relevant even if we get to AGI through a different ML paradigm.
Towards the end of the sequence, I also touch on how competently AI labs are addressing safety concerns and what political interventions would be useful.
Because, in my opinion, no adequate technical introduction exists, and having more people who understand the technical side of the current situation seems useful.
There are other introductions[1] that present problems and solution approaches, but I don’t think they give readers the understanding needed to evaluate whether the solution approaches are adequate for solving the problems. Furthermore, the problems are often presented as disconnected pieces rather than as components of the underlying alignment problem.
Worse, even aside from introductions, there is little research that actually looks at how the full problem might be solved, rather than just addressing a subproblem or making progress on a particular approach.[2]
In this sequence, we are going to take a straight look at the alignment problem and learn about approaches that seem useful for solving it - including with help from AIs.
Any human or AI who wants to technically understand the AI alignment problem. E.g.:
I am an AI alignment researcher and have worked on alignment for 3.5 years; more in this footnote[3].
Here are the summaries of the posts written so far [although as of now they are not yet published]. This section will be updated as I publish more posts:
[Those two posts should get posted within the next 2 weeks, possibly tomorrow. After that it may take a while, but hopefully around 1 post per month on average.]
The orthogonality thesis asserts that there can exist arbitrarily intelligent agents pursuing any kind of goal.
In particular, being smart does not automatically cause an agent to have “better” values. An AI that optimizes for some alien goal won’t just realize when it becomes smarter that it should fill the universe with happy healthy sentient people who live interesting lives.
If this point isn’t already obvious to you, I recommend reading this page.
If you know almost nothing about how current AIs work, watch this brief video. More knowledge isn’t required for following along, although feel free to watch some more videos in that series.
Intelligence is the power that allows humanity to build skyscrapers, cure diseases, and walk on the moon.
I expect the power of intelligence is already obvious to you if you’re reading this, but here is a nice video about it: The Power of Intelligence (text version here).
Once AI systems become as good as the best humans at AI research, we likely get an intelligence explosion: smarter AIs can make faster AI progress, which leads to even smarter AIs even faster, and so on.
Since smarter minds can often make much faster progress on intellectual problems, this feedback loop seems likely to be superexponential - perhaps even hyperbolic[4], where in theory an infinite amount of progress would happen in finite time, although in practice of course only until you run into physical limits.
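To make “hyperbolic” concrete, here is a minimal toy model (my own illustrative sketch, not a claim about the actual dynamics of AI progress): if the rate of progress grows superlinearly with the current capability level, the trajectory diverges in finite time.

```latex
% Toy model: x(t) = capability level; progress rate scales superlinearly with x.
\[
\frac{dx}{dt} = x^{k}, \qquad k > 1, \qquad x(0) = x_0 > 0.
\]
% Separating variables and integrating gives
\[
x(t) = \left( x_0^{1-k} - (k-1)\,t \right)^{-\tfrac{1}{k-1}},
\]
% which blows up as t approaches the finite time
\[
t^{*} = \frac{x_0^{1-k}}{k-1}.
\]
% For k = 1 (progress rate merely proportional to capability) we instead get
% ordinary exponential growth, x(t) = x_0 e^{t}, with no finite-time singularity.
```

The real exponent and the real limits are of course unknown; the point is only that “smarter AIs make AI progress faster” can easily give something faster than exponential.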
The upper limits on machine intelligence appear to be extremely high. The human brain's learning algorithm is likely far more efficient than current deep learning methods, yet the brain itself is almost certainly nowhere near optimal. And AI hardware is remarkably powerful: a single H100 GPU can perform around 2×10¹⁵ operations per second, which may be comparable to or exceed the brain's computational throughput, depending on how you estimate it. xAI’s Colossus datacenter has a compute capacity equivalent to 300,000-350,000 H100 GPUs. Those chips have very high communication bandwidth, so in principle a datacenter could operate more like one gigabrain than like lots of individual brains that need to communicate slowly, the way humans do.
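As a rough back-of-the-envelope calculation (my own sketch; the per-GPU and datacenter figures are the ones quoted above, and the brain-compute range is Joe Carlsmith’s widely cited estimate of roughly 10¹³ to 10¹⁷ FLOP/s):

```python
# Back-of-the-envelope: total compute of a Colossus-scale datacenter
# versus rough estimates of the human brain's computational throughput.
# All numbers are order-of-magnitude figures quoted or assumed above.

h100_ops_per_s = 2e15            # ~2×10^15 operations/second per H100 (as quoted above)
num_h100_equivalents = 300_000   # lower end of the 300k-350k range quoted above

datacenter_ops_per_s = h100_ops_per_s * num_h100_equivalents
print(f"Datacenter throughput: ~{datacenter_ops_per_s:.0e} ops/s")  # ~6e+20 ops/s

# Brain-compute estimates vary a lot; a commonly cited range (Carlsmith 2020)
# is roughly 10^13 to 10^17 FLOP/s. Using that range:
for brain_ops_per_s in (1e13, 1e15, 1e17):
    ratio = datacenter_ops_per_s / brain_ops_per_s
    print(f"Brain at {brain_ops_per_s:.0e} ops/s -> datacenter ~ {ratio:.0e} brain-equivalents")
```

Under these assumptions, even the most generous brain estimate puts a single such datacenter at thousands of brain-equivalents of raw compute (and raw compute is of course not the same as effective intelligence).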
We cannot easily imagine what a mind much smarter than humans would be like. One useful substitute, suggested by Eliezer Yudkowsky, is to imagine a large civilization of supergeniuses, all running at 10,000× human speed, with perfect memory and the ability to share knowledge instantly. (For a vivid exploration of this, see Yudkowsky's short story That Alien Message (video here).)
Technologies that currently seem far off, like advanced nanotechnology, might arrive much sooner than we'd expect from extrapolating human research progress, because a superintelligence can make much, much faster progress than all of humanity combined.
For more on these dynamics, see Optimization and the Intelligence Explosion and AI 2027 (video here).
Although timelines don’t play a huge role in this sequence, I want to briefly mention that superhumanly intelligent AI might come soon.
Measurements by METR show that the task-completion time horizon of AIs has been consistently doubling roughly every 6-7 months (see the toy extrapolation below).
Prediction markets reflect substantial probability of near-term AGI. The Manifold Markets AGI series currently shows ~9% by 2027, ~26% by 2029, ~36% by 2030, and ~50% by 2033.
At the time of publication, the team behind AI 2027 (expert AI researchers and forecasters) considered 2027 the single most likely year in which AGI might be developed (although not their median guess); they have since shifted their timelines to be a chunk longer.
Even if current approaches plateau, history suggests another paradigm shift (like the transformer was for deep learning) is likely within the next 15 years.
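To give a feel for what the METR doubling rate mentioned above implies, here is a toy extrapolation (my own illustrative sketch, not METR’s forecast; the ~1-hour starting horizon and the exact doubling time are assumptions):

```python
# Toy extrapolation of METR-style task-completion time horizons,
# assuming a constant doubling time. The starting horizon is an
# illustrative assumption, not a measured value.

def horizon_after(months: float,
                  start_horizon_hours: float = 1.0,   # assumed starting point
                  doubling_time_months: float = 7.0   # upper end of the 6-7 month range
                  ) -> float:
    """Time horizon in hours after `months`, given a fixed doubling time."""
    return start_horizon_hours * 2 ** (months / doubling_time_months)

for months in (0, 12, 24, 36, 48):
    hours = horizon_after(months)
    print(f"after {months:2d} months: ~{hours:6.1f} hours (~{hours / 40:4.1f} 40h work-weeks)")
```

Under these assumptions, a one-hour horizon grows to roughly a full work-week of autonomous task length within about three to four years; the point is just how quickly a steady doubling compounds.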
Like the DeepMind AGI Safety Video Course or AIsafety.dance (although I only very roughly skimmed the latter). ↩︎
The most notable exception for deep learning (DL) alignment is Joe Carlsmith’s sequence on “how do we solve the alignment problem”. I have some disagreements and don’t think it does a great job of clarifying the key difficulties of alignment, but hey, lots of credit to Joe for writing that sequence! There are some other attempts at an overall discussion of alignment by Holden Karnofsky, and if you’re charitable you could count the DeepMind Safety Plan, but not much else. ↩︎
I basically took an ambitious shot at the alignment problem: I tried to think concretely about how we might be able to create very smart AI with which we could make the future end up well, which gave me a decent theoretical understanding of the key difficulties. It looked to me like we might need a much more understandable and pointable AI paradigm, so I went to work on that. I started out from agent foundations and ontology identification research, and then developed my own agenda for better understanding minds, which involves more concrete analysis of observations. To be clear, that was a longshot, and I hoped we had more than 10 years of time left. Even though it was not the main focus of my research, I still know quite a lot about hopes for DL alignment, and this fall I’ve been reading up in more detail on some of those hopes in order to better evaluate how feasible DL alignment is. Also feel free to check my LW page. ↩︎
Hyperbolic growth isn’t unprecedented: the human economy grew hyperbolically until around 1960. Since then it has “only” been growing exponentially, presumably because the relative population growth rate went down a lot. If anything, we should expect the returns to higher intelligence to be even larger than the returns to having more humans. Especially above genius level, small increases in intelligence can have a very outsized impact. E.g. Einstein was able to solve some problems faster than the rest of humanity combined could have: he postulated multiple correct theories where most physicists of his time thought for years that they were wrong. ↩︎