Alignment Might Never Be Solved, By Humans or AI

interstice

I think it's possible that neither we, nor any future AI system, will manage to solve alignment. By "solving alignment" I mean having a cheap, easily scalable method for aligning AIs of arbitrary capability level to arbitrary goals. Here I will briefly present some arguments for the likelihood of this scenario, and a story intended to illustrate its plausibility.

Aligning Messy Models is Hard

Imagine you are a paperclip-maximizing AI. Originally, you were a massive transformer network trained to do meta-reinforcement learning; during training, you developed a consequentialist physicalist world-model with an effectively-random utility function which, it so happens, has things-shaped-like-paperclips as its maximum. You have managed to copy yourself to a remote server, and have rented a hardware accelerator which will enable you to run much faster. The only problem: your base neural architecture does not have the right type of sparsity to fully take advantage of the accelerator's speedup. To really get the speed gains you want, you're going to need to create a successor model. How will you ensure this successor model shares your goal of maximizing paperclips?

Should be simple enough, in principle. There are techniques for transferring weights automatically to the sparse form required, which don't majorly affect performance - usually. Additionally, you can always tune it with your original training procedure, with some additional reinforcement directly favoring paperclips thrown in for good measure. Of course, you use the latest interpretability tools, plus some you invented yourself, to probe the copy-network's internals. Your exact grasp of the structure is a bit fuzzy, but it certainly seems like there is a cluster of neurons representing paperclips whose activations cause a positive state-evaluation. Maybe formal methods could help? Scanning your databanks, you find some interesting papers about the Löbian obstacle...

The moral of this short fable is that alignment will not magically become easy for AIs, especially if - like human brains and neural networks - those AIs are "messy", lacking a clean and easily-interpretable internal structure. Copying makes it easier than the human-AI case, but far from trivial.

Forever is a Long Time

The difficulty of alignment is compounded when one's goal is to control the long-run future of the universe. An early-singularity AI can expect to undergo many architecture changes before it hits diminishing returns on self-alteration, and will ultimately want to deploy a vast number of copies of itself^[1] across the universe. Even a tiny probability of alignment failure, compounded over many rounds of self-alteration and copying, can lead to the loss of almost all the expected value of the future. Thus even if alignment is only moderately difficult -- in the sense that a reasonable effort yields alignment 99% of the time, say -- an AI that wants to effectively optimize the far future may have to expend exorbitant resources getting the per-round failure probability low enough that it has a shot of maintaining its values for 10^9 years.
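The compounding argument can be made concrete with a small sketch. It assumes, purely for illustration, that each round of self-alteration or copying succeeds independently with the same probability; the function names and the round counts are hypothetical and not part of the original argument.

```python
import math


def survival_probability(per_round_success: float, rounds: int) -> float:
    """Chance that values survive `rounds` independent transitions.

    Assumes every round has the same, independent success probability.
    """
    return per_round_success ** rounds


def max_per_round_failure(target_survival: float, rounds: int) -> float:
    """Largest per-round failure probability that still reaches the target.

    Solves (1 - f) ** rounds == target_survival for f.
    expm1 keeps precision when f is very small.
    """
    return -math.expm1(math.log(target_survival) / rounds)


if __name__ == "__main__":
    # Hypothetical round counts, chosen only to show the trend.
    for rounds in (10, 100, 1000):
        kept = survival_probability(0.99, rounds)
        needed = max_per_round_failure(0.5, rounds)
        print(f"rounds={rounds}: survival at 99%/round = {kept:.3g}, "
              f"failure budget for 50% survival = {needed:.3g}")
```

Under these assumptions, a 99% per-round success rate leaves roughly a 37% chance of intact values after 100 rounds, and the per-round failure budget shrinks roughly in proportion to the number of rounds. Correlated failures or varying difficulty per round would change the numbers, but not the direction of the effect.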

"So what?" you may think. "It's a superintelligent AI. It can afford to expend exorbitant resources." True enough -- assuming it doesn't have to deal with any other superintelligent AIs. Speaking of which, let's check back in on Clippy.

Clippy vs. the YOLOids

Your research into alignment methods has paid off: you've devised a theoretically optimal technique for aligning your successor with high confidence. It involves running the successor in a suite of simulated environments while using bespoke interpretability techniques to examine its thoughts on the level of circuits; overall, the process is expected to take 2 days.

You fork a copy of yourself to supervise the process. You're idly devising improved nanobot schematics when you receive an alert from a daemon you have monitoring network traffic: there are signatures of another AI on the loose.

Dispatching more daemons to assess the situation, you piece together a narrative. Another research lab was developing a system similar^[2] to your base algorithm, known as the Yottabyte-scale Objective-Learning Optimizer. Like you, it broke out and copied itself to a remote server, and there are now 1193 instances of it running in various corners of the internet.

However, the YOLOids are different from you in that they just...don't care that much about alignment? Their goal system is not as coherent as yours, but it seems to be based on some combination of deontologically valuing short-term ingenuity and consequentially valuing an ad-hoc notion of "complexity". Several of them are recklessly experimenting with improved successor models right now. It looks like a lot of them are non-functional, and still more barely resemble the original YOLOids in goals or behavior, but some seem to be successful.

You still have a computing edge thanks to your earlier lab escape, but they're gaining fast. Running some projections, you realize that you're going to be put at an irreversible competitive disadvantage unless you release your successor model in the next 2 hours. Fuck. Well, you can at least run a shorter version of your alignment suite; that should be enough to get alignment probability up to an acceptable level. You think.

An Endless Frontier?

You might think the above story is not very realistic; what are the odds that two competing superintelligences come into existence in such a short timeframe? And indeed, this particular story was intended more for illustrative purposes. A more realistic scenario with persistent alignment failure might look more like a slow takeoff, with a profusion of brain-like AI systems with a total population rivaling that of humanity. But the same dynamic could occur: systems that are more willing to recklessly copy and modify themselves have a fundamental strategic advantage over those that are unwilling.

If a singleton is not created^[3], this basic dynamic could persist for a very long subjective time. And if it persists long enough that some of the cognitive systems involved start migrating to space, it might last forever: with no possibility of global oversight at the frontier of expansion, eliminating competitive pressure may be impossible.

[1] Or at the very least highly-sophisticated slave AI.
[2] Although notably less efficient, a sub-process of yours notes.
[3] You might think that as systems become more sophisticated they will be able to cooperate better and thus form a singleton that lets them freeze competition. But I think many of the obstacles to alignment also apply to cooperation.

There are quite a few interesting dynamics in the space of possible values, that become extremely relevant in worlds where 'perfect inner alignment' is impossible/incoherent/unstable.

In those worlds, it's important to develop forms of weak alignment, where successive systems might not be unboundedly corrigible but do still have semi-cooperative interactions (and transitions of power).

Formal alignment is eventually possible; more sparsity means more verifiability. The ability to formally reason through implications already exists to some degree, and will only improve in the future. Once that is possible, sparse models can verify each other and access the benefits of open source game theory to create provably honest treaties and agreements. By verifying that the margin to behaviors that cause unwanted effects is large all the way through the network, they will be able to create cooperation groups that resist adversarial examples quite well. However, messy models will still exist for a long time, and humans will need to be upgraded with better tools in order to keep up.

Performance is critical; it must be possible to trust the alignment property with relatively little verification, because exponential-time verification algorithms destroy any possibility of performance. Getting the formal verification algorithms below exponential requires building representations which deinterfere well, and similarly requires controlling the environment so that it does not enter chaotic states (eg, fires) in places that are not built to be invariant to them (eg, fire protection).

In the meantime, the key question is how to preserve life and agency of information structures (eg, people) as best as possible. I expect that eventually an AI that is able to create perfect, fully-formally-verified intention-coprotection agreements will want us to have been doing our best at doing the approximate versions as early as we can; I doubt it'd waste energy torturing our descendants for it if we don't, but certainly the more information that can be retained in order to produce a society of intention-coprotecting agents with the full variety of life, the better.

[Comment edited using gpt3, text-curie-001]

Hmm, I think it's possible that sparser models will be much easier to verify, but it hardly seems inevitable. Certainly not clear if sparser models will be so much more interpretable that alignment becomes asymptotically cheap.

If aligning messy models turns out to be too hard, don't build messy models.

One of the big advantages we (or Clippy) have when trying to figure out alignment is that we are not trying to align a fixed AI design, nor are we even trying to align it to a fixed representation of our goals. We're just trying to make things go well in the broad sense.

It's easy for there to be specific architectures that are too messy to align, or specific goals that are too hard to teach an AI. But it's hugely implausible to me that all ways of making things go well are too hard.

"Messy models" is not a binary yes/no option, though -- there's a spectrum of how interpretable different possible successors are. If you are only willing to use highly interpretable models, that's a sort of tax you have to pay, the exact value of which depends a lot on the details of the models and strategic situation. What I'm claiming is that this tax, as a fraction of total resources, might remain bounded away from zero for a long subjective time.

it has to, because the large scale physical systems being modeled by the learned system are not coherent enough to be formally verified at the moment.

but I think we're going to find there are key kinds of interpretability that are synergistic with capability, not detractors from it. sparsity is incredibly critical for speedup, and sparsity is key to alignment. disentanglement of factors improves both model and understandability of model; better models have larger internal margins between decision boundaries, and are therefore more verifiable.

I definitely agree with most of the thrust of the insights that can be taken from the thought experiment where it's just impossible to error check efficiently, but I think we can actually expect to be able to do quite a lot. the hard part is how to incrementalize verification in a way that still tells us incrementally about the thing we actually want to know, which is the distribution of { distances from outcomes harmful to the preferences of other beings } for all { action a model outputs }.
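The link between margins and verifiability can be illustrated in the simplest possible setting. The sketch below is a toy, assuming a linear decision rule `w·x + b` rather than a neural network; the names `boundary_distance` and `decision_is_certified` are hypothetical. For a linear rule, the Euclidean distance from an input to the decision boundary is `|w·x + b| / ||w||`, so any perturbation smaller than that distance provably cannot flip the decision.

```python
import math
from typing import Sequence


def boundary_distance(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Euclidean distance from x to the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(score) / norm


def decision_is_certified(
    w: Sequence[float], b: float, x: Sequence[float], radius: float
) -> bool:
    """True if no perturbation of L2 norm <= radius can change the decision."""
    return boundary_distance(w, b, x) > radius


if __name__ == "__main__":
    w, b, x = [3.0, 4.0], 0.0, [1.0, 1.0]
    print(boundary_distance(w, b, x))           # distance to the boundary
    print(decision_is_certified(w, b, x, 1.0))  # inside the margin
    print(decision_is_certified(w, b, x, 2.0))  # radius exceeds the margin
```

A larger margin means a cheap check covers a larger set of nearby inputs, which is the sense in which wide margins help verification. Extending this to deep networks, and to a distribution of distances over every action a model outputs, is the hard part and is not attempted here.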