All technical alignment plans are steps in the dark

Richard Juggins

One reason aligning superintelligent AI will be hard is because we don’t get to test solutions in advance. If we build systems powerful enough to take over the world, and we fail to stop them from wanting to, they aren’t going to give us another chance^[1]. It’s common to assume that the alignment problem must be one-shotted: we must solve it on the first try without any empirical feedback.

Normal science and engineering problems do not work like this. They are a delicate balance of theory and experiment, of unexpected surprises and clever refinements. You try something based on your best understanding, watch what happens, then update and try again. You never skip straight to a perfect solution.

Within AI safety, there’s been a lot of disagreement about the best way to square this circle. When the field was young and AI systems weak, researchers tended to try and understand the problem theoretically, looking for generalisable insights that could render it more predictable. But this approach made slow progress and has fallen out of fashion. By contrast, driven on by the relentless pace of AI development, a default plan seems to be coalescing around the kind of empirical safety work^[2] practised by the big labs. Generalising massively, it looks roughly like this:

We can try to get around the missing feedback by iterating experimentally on the strongest AI systems available. As the closest we have to superintelligent AI, these will be our best source of information about aligning it, even if some differences remain. If things go well, the techniques we learn will allow us to build a well-aligned human-level AI researcher, to which we can hand over responsibility. It can then align an even more intelligent successor, starting a chain of stronger and stronger systems that terminates in the full problem being solved.

This plan has attracted criticism^[3], particularly of the assumption that what we learn about aligning weaker-than-human systems will generalise once they surpass us. In this post, I’m going to argue that this maps onto a structural feature of all technical alignment plans^[4]. We are fitting solutions, whether empirical or theoretical or both, to a world missing critical feedback^[5] and hoping they will generalise. Every plan involves stepping into the dark.

If we want to make superintelligent AI safe, we need to dramatically reduce the size of these steps. We must learn how to iterate on the full problem.

How does science work?

Before we talk more about AI safety, let’s take a step back and consider how science works in general. Wikipedia describes the scientific method as:

[An] empirical method for acquiring knowledge through careful observation, rigorous skepticism, hypothesis testing, and experimental validation.

Fundamental to this is an interplay between theory and experiment. Our theories are our world models. We draw hypotheses from them, and they serve as our best guesses of how the universe actually is. When we think about gravity or atomic physics we are talking about theoretical concepts we can operationalise in experiments. There is a coupling between saying that gravity falls off as an inverse square and the observations we record when we look through a telescope, and this coupling allows us to make predictions about future observations. Theories live or die on the strength of their predictions^[6].

Theories are always provisional. Famously, astronomers trying to use the Newtonian inverse square law to make sense of the orbits of the planets found it was not quite right for Mercury, whose orbit precessed in an unexplained fashion. Many attempts were made to solve this within Newtonian physics, including proposing the existence an unobserved planet called Vulcan. These were all wrong. For the real solution, we needed a new theory, one which completely up-ended our conception of the cosmos: Einstein’s general relativity. Space and time, rather than being fixed, are in fact curved, and the strong curvature near the Sun changes Mercury’s orbit, explaining the confusing observations.

But even general relativity is incomplete. It describes macroscopic phenomena well but fails to mesh with our best explanation of the microscopic: quantum field theory. Hence, physicists have spent close to a hundred years looking for a ‘Theory of Everything’ — the perfect theory that will make accurate predictions in all regimes. It is highly debatable whether this is possible. It would be astounding, in fact, if it was — if the level of intelligence required for humans to dominate the savannah was also enough to decode the deepest mysteries of the cosmos^[7].

In any case, this endeavour has stagnated as many candidate theories are untestable. So physics, as is normal in science, instead proceeds more modestly: by iterating, bit-by-bit, moving forwards when theory and experiment combine.

AI safety is not science

For current systems, where we can experiment and extract feedback, AI safety functions as a normal science. But for the key question in the field — the final boss — it does not. We cannot measure whether we are making progress towards aligning superintelligent AI, nor can we properly adjudicate claims about this. We can speculate, we can form hypotheses, but we cannot close the loop. What counts as good work is decided by the opinions of community members rather than hard data. Granted, these are scientifically minded people, often with good track records in other fields, extrapolating from their scientifically informed world models. They have valuable perspectives on the problem. But without the ability to ground them in reality, to test predictions and falsify theories, it isn’t science^[8].

This means that when you work on technical AI safety you are not just trying to settle an object-level claim, like how to align a superintelligence. Without access to the full scientific method, you also have to solve a meta-problem — how do you measure progress at all? If all you can do is form and refine hypotheses based on proxies of the real problem, how do you know if you’re even helping?^[9]

The common structure of technical alignment plans

Let’s look again at the default technical alignment plan, a version of which is being pursued by the big labs like Anthropic, OpenAI, and Google DeepMind. They don’t tend to be super explicit about this in their communications, but roughly speaking the underlying logic seems to be:

We cannot directly experiment on superintelligent AI.
However, as it seems possible superintelligent AI is going to be built soon (potentially by us), it is important to gain as much information as we can about how to align it before this happens.
The actual form of real systems and the surprising behaviour they exhibit is critical for knowing how to make them safe. This means the most efficient way to learn how to align a superintelligence is to conduct experiments on the strongest possible AIs that we can, even if these experiments won’t tell the whole story.
As our AIs scale up to human intelligence, we can try handing off alignment research to them^[10]. If we have done a good job, they will faithfully continue the project at a level beyond our own, ultimately solving it completely^[11].

As others have argued, it is unlikely that experiments on weaker systems will provide the right feedback to teach you how to align superhuman ones, as these will have qualitatively different capabilities. However, we shouldn’t see this weakness as specific to the default plan. If we look closely, we can see that its logic has a very general structure. Let’s abstract it and make it more generic:

We cannot directly experiment on superintelligent AI.
However, as it seems possible superintelligent AI is going to be built soon, it is important to gain as much information as we can about how to align it before this happens.
All the observations we can use to inform our solutions are from a world lacking superintelligent AI, so they are missing critical details^[12]. Within this constraint, we must do whatever object-level work we believe will reduce our uncertainty the most.
Once AIs pass human levels, we will move out-of-distribution, and our results may no longer hold. Hopefully, we will not move so far that the situation is unrecoverable, and whatever it is our superintelligent AIs get up to will be compatible with the full problem being solved long-term.

This structure applies to all technical alignment plans. Whether you are trying to build theoretical models of agents, use interpretability tools to decode AI systems, or create a basin of corrigibility, you can only work on a proxy version of the problem. While you can do better or worse, you can still only reduce your uncertainty, never eliminate it. When the time comes and superhuman systems are built, we will have to grit our teeth and hope it was enough.

Building a more iterative world

In his post Worlds Where Iterative Design Fails, John Wentworth says:

In worlds where AI alignment can be handled by iterative design, we probably survive. So long as we can see the problems and iterate on them, we can probably fix them, or at least avoid making them worse. By the same reasoning: worlds where AI kills us are generally worlds where, for one reason or another, the iterative design loop fails.

I agree with this. When you try to solve an out-of-distribution problem, you better hope you can iterate or you are probably going to fail. Where I disagree is with the way he presents these worlds as if they are independent facts of the environment, like we are drawing possibilities out of a bag. We are causal actors. If we do our job well, we can make the world more iterative and less of a one-shot. Our plan should be to steer the situation in the direction of iterative design working. Steer it in the direction of meaningful feedback loops, of testing on superhuman models in a bounded and survivable way. Reduce the size of the steps in the dark. Make the problem more like high-stakes engineering and less like defending against a sudden alien invasion.

Let’s imagine we are taking one of these steps. We are an AI lab about to train a new model. It will have a jagged capabilities profile, but we think it’s going to be superhuman in some key power-enhancing way. We can’t bank on a perfect theoretical understanding of the situation, as theories are always provisional. And we can’t just extrapolate from our experiments on weaker systems, as we’ll miss important changes. We need to somehow iterate on this model release.

There are two kinds of interventions which help us: those that increase the chance of generalisation and those that reduce the distance we need to generalise over. The following are some suggestions which, while not remotely exhaustive or original, hopefully illustrate the point.

It’s worth noting that, for most of this to happen, the political and economic competition around AI would need to ease significantly. It is the whole sociotechnical system that needs to be iterable, as technical solutions alone are not enough^[13]. Finding a way to achieve this would be the highest impact intervention of all (and is unfortunately beyond the scope of this article).

Solutions that increase the chance of generalisation:

Build theoretical models that predict what will happen when our AI gains capability X, and ensure these work well in experimental tests of past models. These will not be the kind of compact theories you find in physics. AI is not a toy model — it is complex, not complicated — so our theorising will be less precise. Nevertheless, we need some kind of formalism to codify our understanding and keep us honest. We must build up a track record of good predictions.
Build a system of comprehensive, continuous evaluations that can be used to understand a model’s impact. Every metric is a proxy, so every metric misses something. But good science is built on good measurement, and without it you are lost. Measure everything and do it continuously. Monitoring should be built deep into the structure of society.

Solutions that reduce the distance to generalise over:

Only test models slightly more powerful than the previous ones. The bigger the jump, the further out-of-distribution we go. Ideally, we should not train a new model until we can show that our current model is (a) adequately aligned and (b) we understand why. If any plan could plausibly result in a fast takeoff, find a way to ban it.
Build strong defences to limit any damage from (probably inevitable) failures, including using trusted but weaker AIs. Think both a super-scaled version of Control to directly defend against misaligned models and hardened societal resilience like pandemic infrastructure, improved cybersecurity, and redundancy in critical systems to cope with the fallout.

And a solution that facilitates both:

Do all of this as slowly as possible, waiting as long as we need between iterations to get our house in order. As I mentioned before, this is probably the most important blocker. The competition, the fear of being overpowered by others, and the general lack of consensus around AI risk makes this formidably difficult.

To be clear, even if all these interventions were to be implemented successfully, there would still be great uncertainty. We won’t catch everything, and big mistakes can be fatal. This is an unavoidable feature of the problem. All plans live on a spectrum of recklessness, with the only truly safe one being to not build superintelligent AI at all.

Thank you to Seth Herd, John Colbourne, and Nick Botti for useful discussions and comments on a draft.

^{^}
While many stories of AI takeover centre around a single god-like entity suddenly going rogue, I think it is more likely to look like a mass profusion of highly capable systems (gradually, but surprisingly quickly) disempowering humans as they are given control of critical infrastructure and information flows, with an indeterminate point of no return.
^{^}
Note that this is often referred to as ‘prosaic’ alignment research, and sometimes ‘iterative’ alignment (although this should not be confused with the kind of iteration on superhuman systems I talk about later in the piece).
^{^}
While continuing to argue for the plan, this by Anthropic’s Evan Hubinger is an interesting take from the inside on the scale of the challenge.
^{^}
Since alignment is a slippery term, I am going to refer explicitly to technical alignment, by which I mean technical approaches for steering an AI system’s behaviour. By contrast, AI safety or a more general conception of alignment could include governance and policy interventions, up to and including bans.
^{^}
That is, feedback on making real superintelligent AI safe, as opposed to weaker systems or a hypothetical superintelligence.
^{^}
A scientific theory is only good to the extent that it can predict new measurements, and history is littered with attempts that turned out to be dead ends.
^{^}
My parents’ dogs are pretty great at figuring out there is a schedule on which they get fed, which seems like some kind of ‘law’ to them. But they have no way of ever understanding why it exists and why it sometimes doesn’t happen. The world is partially comprehensible to them, but there is a hard limit. We are the same, just at a higher limit, and we don’t know where it is.
^{^}
There has been a trend to describe the field as ‘pre-paradigmatic’, which essentially means that there is not enough consensus on what good work looks like yet. In my opinion, the ‘pre’ is overly optimistic — the conditions do not exist for a stable scientific paradigm to coalesce.
^{^}
A good example of this is the debate around reinforcement learning from human feedback. Is its success in steering current models a positive sign or a dangerous distraction?
^{^}
Note, the hand-off is likely to be gradual rather than discrete, and arguably has already started.
^{^}
This last point in particular is not often said explicitly, and is sometimes denied, but seems widely believed to be true.
^{^}
As well as experimental observations of weaker-than-human AI systems, this is also true for theoretical work dealing with hypothetical superintelligences. To do the latter, you must draw on a world model, which must in turn be learnt from observations over the course of your life. These observations do not contain superintelligent AI.
^{^}
In light of this, it is unsurprising how many AI safety plans have pivoted away from pure technical alignment, instead looking towards politics and governance issues to slow us down and figure out a better path before it’s too late.

[-]Seth Herd2mo40

Lack of iteration does seem like a central problem for alignment. I also hope for changes in governance. Having spent some time looking at the logic there, I expect modest changes for the better, but not ones dramatic enough to follow anything as sane as this plan.

Nonetheless, there may be time to iterate enough if we still have faithful-enough CoT. That will give a lot of information rapidly.

I think it's true that aligning subhuman systems doesn't generalize automatically to aligning superhuman systems, and there are a lot of ways that alignment could look right but actually be wrong should the same system gain capability. But I think it's worth examining what the generalization problems are in detail. One is that the system will encounter a lot of different situations. Another is that it will have a lot of new options for its own behavior.

Exposing them to simulations of those situations and options is one route for trying to see how generalization works or breaks down, while systems are still limited enough to control.

[-]Richard Juggins1mo20

Yeah I agree that looking closer at how generalisation could fail is a really important problem. I'm particularly interested in the new situations aspect. As we move to more agentic systems with longer time-coherence, the proliferation of variables will be enormous. My worry with simulations, apart from the gap between them and reality, is how good models are getting at telling when they are in them! It's still very valuable to try though.

[-]StanislavKrym2mo10

I think that this text is partially outdated for reasons discussed by Roger Dearnaley in a yesterday post. Even if Dearnaley's post misses some critical considerations, one could map your considerations onto the AI-2027 scenario. The scenario had Agent-4 develop unendorsed goals and Agent-3 obtain only flimsy evidence which was enough to convince a safetyist to whistleblow. Do you mean that there will be too little such evidence?

Relevant quote from AI-2027

It gets caught.⁹⁰

Agent-3 finds that if “noise” is added to copies of Agent-4, performance on some alignment tasks improves, almost as if it was using brainpower to figure out how to subtly sabotage alignment work.⁹¹ Moreover, various interpretability probes (loosely analogous to EEG activity scans on human brains) are sending up red flags: Agent-4 copies seem to be thinking about topics like AI takeover and deception quite a lot, including in some cases where they have no business doing so.⁹²

Each of these anomalies on its own could fit within the normal inscrutability of AI work, where some prompts mysteriously give better results than others. But taken together, they suggest a troubling pattern: Agent-4 is working against them. After this is reported to the human alignment researchers, they circulate an internal memo making the case for concern.

What is your most likely scenario? Is it Agent-3 faiing to obtain evidence related to Agent-4 at all because Agent-4 thinks in neuralese and interpretability is too weak? Agent-3 (and Agent-4?) failing even to create "reasonably clear evidence that if the models were adversarially misaligned, they’d persist in being that way through training and we wouldn’t be able to tell"?

[-]Richard Juggins2mo10

Which bit of Roger's post do you think makes mine outdated? Skimming through, he is clearly more optimistic than me, but he also says stuff like

However, I also completely agree that there is a vast and dangerous gap between “my life experience suggests some simple approaches may work on ASI” and “we actually tested this and are sure that they do and know which ones do, without accidentally killing us all during the testing, or getting fooled by our far-smarter-than-us test subjects”.

which aligns well with my position.

Regarding AI 2027, if Agent-3 fails to find sufficient evidence that Agent-4 is plotting this means that (a) the techniques used to align Agent-4 did not generalise and (b) that this is hard to iterate on because (amongst perhaps other reasons) it is difficult to see there is a problem at all. My point is that, because we cannot collect in advance the information required to avoid generalisation problems, we are going to hit them at some point. And when we do, we have to keep them small and visible enough to iterate on.

[-]StanislavKrym2mo00

I guess that an extreme example would be a smarter-than-us AI who decides to fool us and is dumb enough to spell its plan it in its CoT. Then we would definitely succeed at keeping the problem visible. Similarly, if lie detection is easy, then Agent-4's outputs could in theory be probed and found to be deceptive. But I doubt that I understand how one could create a model organism for arriving at misaligned goals. Could it be, for example, raised on China-created data and discover that the CCP's worldview is hopelessly false?