One reason aligning superintelligent AI will be hard is that we don’t get to test solutions in advance. If we build systems powerful enough to take over the world, and we fail to stop them from wanting to, they aren’t going to give us another chance[1]. It’s common to assume that the alignment problem must be one-shotted: we must solve it on the first try, without any empirical feedback.
Normal science and engineering problems do not work like this. They are a delicate balance of theory and experiment, of unexpected surprises and clever refinements. You try something based on your best understanding, watch what happens, then update and try again. You never skip straight to a perfect solution.
Within AI safety, there’s been a lot of disagreement about the best way to square this circle. When the field was young and AI systems weak, researchers tended to try to understand the problem theoretically, looking for generalisable insights that could render it more predictable. But this approach made slow progress and has fallen out of fashion. By contrast, driven by the relentless pace of AI development, a default plan seems to be coalescing around the kind of empirical safety work[2] practised by the big labs. Generalising massively, it looks roughly like this:
This plan has attracted criticism[3], particularly of the assumption that what we learn about aligning weaker-than-human systems will generalise once they surpass us. In this post, I’m going to argue that this maps onto a structural feature of all technical alignment plans[4]. We are fitting solutions, whether empirical or theoretical or both, to a world missing critical feedback[5] and hoping they will generalise. Every plan involves stepping into the dark.
If we want to make superintelligent AI safe, we need to dramatically reduce the size of these steps. We must learn how to iterate on the full problem.
How does science work?
Before we talk more about AI safety, let’s take a step back and consider how science works in general. Wikipedia describes the scientific method as:
Fundamental to this is an interplay between theory and experiment. Our theories are our world models. We draw hypotheses from them, and they serve as our best guesses of how the universe actually is. When we think about gravity or atomic physics we are talking about theoretical concepts we can operationalise in experiments. There is a coupling between saying that gravity falls off as an inverse square and the observations we record when we look through a telescope, and this coupling allows us to make predictions about future observations. Theories live or die on the strength of their predictions[6].
Theories are always provisional. Famously, astronomers trying to use the Newtonian inverse square law to make sense of the orbits of the planets found it was not quite right for Mercury, whose orbit precessed in an unexplained fashion. Many attempts were made to solve this within Newtonian physics, including proposing the existence of an unobserved planet called Vulcan. These were all wrong. For the real solution, we needed a new theory, one which completely up-ended our conception of the cosmos: Einstein’s general relativity. Space and time, rather than being fixed, are in fact curved, and the strong curvature near the Sun changes Mercury’s orbit, explaining the confusing observations.
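To put rough numbers on this (standard textbook figures, added here for concreteness): Newton’s law

\[ F = \frac{G M m}{r^2} \]

predicts closed elliptical orbits, while general relativity adds a small perihelion advance of

\[ \Delta\phi = \frac{6 \pi G M}{c^2 a (1 - e^2)} \]

per orbit. Plugging in Mercury’s semi-major axis and eccentricity gives roughly 43 arcseconds per century, matching the anomalous residual that no arrangement of Newtonian planets could account for.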
But even general relativity is incomplete. It describes macroscopic phenomena well but fails to mesh with our best explanation of the microscopic: quantum field theory. Hence, physicists have spent close to a hundred years looking for a ‘Theory of Everything’ — the perfect theory that will make accurate predictions in all regimes. It is highly debatable whether this is possible. It would be astounding, in fact, if it were — if the level of intelligence required for humans to dominate the savannah was also enough to decode the deepest mysteries of the cosmos[7].
In any case, this endeavour has stagnated as many candidate theories are untestable. So physics, as is normal in science, instead proceeds more modestly: by iterating, bit-by-bit, moving forwards when theory and experiment combine.
AI safety is not science
For current systems, where we can experiment and extract feedback, AI safety functions as a normal science. But for the key question in the field — the final boss — it does not. We cannot measure whether we are making progress towards aligning superintelligent AI, nor can we properly adjudicate claims about this. We can speculate, we can form hypotheses, but we cannot close the loop. What counts as good work is decided by the opinions of community members rather than hard data. Granted, these are scientifically minded people, often with good track records in other fields, extrapolating from their scientifically informed world models. They have valuable perspectives on the problem. But without the ability to ground them in reality, to test predictions and falsify theories, it isn’t science[8].
This means that when you work on technical AI safety you are not just trying to settle an object-level claim, like how to align a superintelligence. Without access to the full scientific method, you also have to solve a meta-problem — how do you measure progress at all? If all you can do is form and refine hypotheses based on proxies of the real problem, how do you know if you’re even helping?[9]
The common structure of technical alignment plans
Let’s look again at the default technical alignment plan, a version of which is being pursued by the big labs like Anthropic, OpenAI, and Google DeepMind. They don’t tend to be super explicit about this in their communications, but roughly speaking the underlying logic seems to be:
As others have argued, it is unlikely that experiments on weaker systems will provide the right feedback to teach you how to align superhuman ones, as these will have qualitatively different capabilities. However, we shouldn’t see this weakness as specific to the default plan. If we look closely, we can see that its logic has a very general structure. Let’s abstract it and make it more generic:
This structure applies to all technical alignment plans. Whether you are trying to build theoretical models of agents, use interpretability tools to decode AI systems, or create a basin of corrigibility, you can only work on a proxy version of the problem. While you can do better or worse, you can still only reduce your uncertainty, never eliminate it. When the time comes and superhuman systems are built, we will have to grit our teeth and hope it was enough.
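As a toy illustration of what working on a proxy means (my own sketch, with made-up functions, not anything drawn from a real alignment agenda): suppose the quantity we can measure agrees perfectly with the quantity we care about everywhere we can test, and silently diverges beyond that.

```python
# Toy model of fitting to a proxy: in the regime we can test, the proxy and
# the true objective agree exactly; past it, they come apart, and nothing
# measurable inside the test regime warns us.

def true_objective(x: float) -> float:
    """What we actually care about. Unmeasurable for x >= 1
    (the 'superhuman' regime in this toy)."""
    return x if x < 1.0 else 2.0 - x

def proxy(x: float) -> float:
    """What we can measure today."""
    return x

# "Test regime": the proxy is a perfect stand-in.
for x in [0.1, 0.5, 0.9]:
    assert proxy(x) == true_objective(x)

# "Deployment regime": the proxy keeps improving while the truth degrades.
x = 1.8
print(f"proxy: {proxy(x):.1f}, truth: {true_objective(x):.1f}")  # 1.8 vs 0.2
```

Nothing observable in the test regime distinguishes the good world from the bad one; that is the uncertainty we can reduce but never eliminate.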
Building a more iterative world
In his post Worlds Where Iterative Design Fails, John Wentworth says:
I agree with this. When you try to solve an out-of-distribution problem, you better hope you can iterate or you are probably going to fail. Where I disagree is with the way he presents these worlds as if they are independent facts of the environment, like we are drawing possibilities out of a bag. We are causal actors. If we do our job well, we can make the world more iterative and less of a one-shot. Our plan should be to steer the situation in the direction of iterative design working. Steer it in the direction of meaningful feedback loops, of testing on superhuman models in a bounded and survivable way. Reduce the size of the steps in the dark. Make the problem more like high-stakes engineering and less like defending against a sudden alien invasion.
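A toy model makes the asymmetry vivid (the numbers are invented; only the structure matters): compare a single all-or-nothing attempt with repeated bounded attempts, where a survivable failure improves your odds and only a fatal one ends the game.

```python
import random

def one_shot(p_success: float) -> bool:
    """A single attempt at the full problem: no feedback, no retry."""
    return random.random() < p_success

def iterate(p_success: float, p_fatal: float, learn: float, steps: int) -> bool:
    """Repeated bounded attempts. A survivable failure teaches us something
    (p_success goes up); a fatal one ends the game."""
    p = p_success
    for _ in range(steps):
        r = random.random()
        if r < p:
            return True                  # solved it
        if r < p + p_fatal:
            return False                 # an unsurvivable mistake
        p = min(1.0, p + learn)          # survivable failure: we learn
    return False

def estimate(fn, *args, trials: int = 100_000) -> float:
    return sum(fn(*args) for _ in range(trials)) / trials

print("one-shot :", estimate(one_shot, 0.2))                 # ~0.20
print("iterative:", estimate(iterate, 0.2, 0.02, 0.1, 20))   # much higher
```

The lever that matters most in this toy is p_fatal: driving it down, that is, making failures survivable, is exactly what ‘reducing the size of the steps in the dark’ buys us.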
Let’s imagine we are taking one of these steps. We are an AI lab about to train a new model. It will have a jagged capabilities profile, but we think it’s going to be superhuman in some key power-enhancing way. We can’t bank on a perfect theoretical understanding of the situation, as theories are always provisional. And we can’t just extrapolate from our experiments on weaker systems, as we’ll miss important changes. We need to somehow iterate on this model release.
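One shape this iteration could take, sketched very loosely (a caricature invented for illustration, not any lab’s actual release process): widen the model’s affordances one bounded stage at a time, gating each expansion on evaluations run at the current scope, so that a failure surfaces while it is still containable.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Stage:
    name: str
    scope: str  # what the model is allowed to touch at this stage

# A hypothetical rollout ladder, narrowest scope first.
STAGES = [
    Stage("sandbox",  "no internet, no tools, transcripts reviewed"),
    Stage("internal", "tool use by employees, heavily monitored"),
    Stage("limited",  "external users, rate-limited, rollback tested"),
    Stage("broad",    "general availability"),
]

def staged_rollout(passes_evals: Callable[[Stage], bool]) -> str:
    """Expand scope only while evals at the current stage pass; otherwise
    halt with the failure still bounded."""
    granted = []
    for stage in STAGES:
        if not passes_evals(stage):
            return f"halted at '{stage.name}'; scope so far: {granted}"
        granted.append(stage.name)
    return f"fully deployed via {granted}"

# Stub evaluator for illustration: imagine control evals, interpretability
# probes, and red-teaming behind this call.
print(staged_rollout(lambda stage: stage.name != "limited"))
```

Everything interesting is hidden inside passes_evals, of course; the sketch only shows the scaffolding that keeps each test bounded and survivable.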
There are two kinds of interventions which help us: those that increase the chance of generalisation and those that reduce the distance we need to generalise over. The following are some suggestions which, while not remotely exhaustive or original, hopefully illustrate the point.
It’s worth noting that, for most of this to happen, the political and economic competition around AI would need to ease significantly. It is the whole sociotechnical system that needs to be iterable, as technical solutions alone are not enough[13]. Finding a way to achieve this would be the highest impact intervention of all (and is unfortunately beyond the scope of this article).
Solutions that increase the chance of generalisation:
Solutions that reduce the distance to generalise over:
And a solution that facilitates both:
To be clear, even if all these interventions were to be implemented successfully, there would still be great uncertainty. We won’t catch everything, and big mistakes can be fatal. This is an unavoidable feature of the problem. All plans live on a spectrum of recklessness, with the only truly safe one being to not build superintelligent AI at all.
Thank you to Seth Herd, John Colbourne, and Nick Botti for useful discussions and comments on a draft.
While many stories of AI takeover centre around a single god-like entity suddenly going rogue, I think it is more likely to look like a mass profusion of highly capable systems (gradually, but surprisingly quickly) disempowering humans as they are given control of critical infrastructure and information flows, with an indeterminate point of no return.
Note that this is often referred to as ‘prosaic’ alignment research, and sometimes ‘iterative’ alignment (although this should not be confused with the kind of iteration on superhuman systems I talk about later in the piece).
While it continues to argue for the plan, this piece by Anthropic’s Evan Hubinger is an interesting take from the inside on the scale of the challenge.
Since alignment is a slippery term, I am going to refer explicitly to technical alignment, by which I mean technical approaches for steering an AI system’s behaviour. By contrast, AI safety or a more general conception of alignment could include governance and policy interventions, up to and including bans.
That is, feedback on making real superintelligent AI safe, as opposed to weaker systems or a hypothetical superintelligence.
A scientific theory is only good to the extent that it can predict new measurements, and history is littered with attempts that turned out to be dead ends.
My parents’ dogs are pretty great at figuring out there is a schedule on which they get fed, which seems like some kind of ‘law’ to them. But they have no way of ever understanding why it exists and why it sometimes doesn’t happen. The world is partially comprehensible to them, but there is a hard limit. We are the same, just at a higher limit, and we don’t know where it is.
There has been a trend to describe the field as ‘pre-paradigmatic’, which essentially means that there is not enough consensus on what good work looks like yet. In my opinion, the ‘pre’ is overly optimistic — the conditions do not exist for a stable scientific paradigm to coalesce.
A good example of this is the debate around reinforcement learning from human feedback. Is its success in steering current models a positive sign or a dangerous distraction?
Note that the hand-off is likely to be gradual rather than discrete, and arguably has already started.
This last point in particular is not often said explicitly, and is sometimes denied, but seems widely believed to be true.
As well as experimental observations of weaker-than-human AI systems, this is also true for theoretical work dealing with hypothetical superintelligences. To do the latter, you must draw on a world model, which must in turn be learnt from observations over the course of your life. These observations do not contain superintelligent AI.
In light of this, it is unsurprising how many AI safety plans have pivoted away from pure technical alignment, instead looking towards politics and governance issues to slow us down and figure out a better path before it’s too late.