The community thinks a lot about how to align AGI. It thinks less about how to align AGI so that it stays aligned for the long term. In many hypothetical cases, these are one and the same thing. But for the type of AGI we're actually likely to get, I don't think they are.
Despite some optimism for aligning tool-like AGI, or at least static systems, it seems likely that we will create AGI that learns after it's deployed, and that has some amount of agency. If it does, its alignment will effectively shift, as addressed in the diamond maximizer thought experiment and elsewhere. And that's even if it doesn't deliberately change its preferences. People deliberately change their preferences sometimes, despite not having access to their own source code. So, it would seem wise to think seriously and explicitly about the stability problem, even if it isn't needed for current-generation AGI research.
I've written a chapter on this, Goal changes in intelligent systems. There I laid out the problem, but I didn't really propose solutions. What follows is a summary of that chapter, followed by a brief discussion of the work I've been able to locate on this problem, and one direction we might go to pursue it.
Why we don't think much about alignment stability, and why we should.
Some types of AGI are self-stabilizing. A sufficiently intelligent agent will try to prevent its goals from changing, at least if it is consequentialist. That works nicely if its values are one coherent construct, such as diamond or human preferences. But humans have lots of preferences, so we may wind up with a system that must balance many goals. And if the system keeps learning after deployment, it seems likely to alter its understanding of what its goals mean. This is the thrust of the diamond maximizer problem.
One tricky thing about alignment work is that we're imagining different types of AGI when we talk about alignment schemes. Currently, people are thinking a lot about aligning deep networks. Current deep networks don't keep learning after they're deployed. And they're not very agentic. These are great properties for alignment, and they seem to be the source of some optimism.
Even if this type of network turns out to be really useful, and all we need to make the world a vastly better place, I don't think we're going to stop there. Agents would seem to have capabilities advantages that metaphorically make tool AI want to become agentic AI. If that weren't enough, agents are cool. People are going to want to turn tool AI into agent AI just to experience the wonder of an alien intelligence with its own goals.
I think turning intelligent tools into agents is going to be relatively easy. But even if it's not, someone is going to manage it at some point. It's probably too difficult to prevent further experimentation, at least without a governing body, aided by AGI, that's able and willing to at minimum intercept and decrypt every communication for signs of AGI projects.
While the above logic is far from airtight, it would seem wise to think about stable alignment solutions, in advance of anyone creating AGI that continuously learns outside of close human control.
Similar concerns have been raised elsewhere, such as On how various plans miss the hard bits of the alignment challenge. Here I'm trying to crystallize and give a name to this specific hard part of the problem.
Approaches to alignment stability
Alex Turner addresses this in A shot at the diamond-alignment problem. In broad form, he's saying that you would train the agent with RL to value diamonds, including having diamonds associated with the reward in a variety of cognitive tasks. This is as good an answer as we've got. I don't have a better idea; I think the area needs more work. Some difficulties with this scheme are raised in Contra shard theory, in the context of the diamond maximizer problem. Charlie Steiner's argument that shard theory requires magic addresses roughly the same concerns. In sum, it's going to be tricky to train a system so that it has the right set of goals when it acquires enough self-awareness to try to preserve its goals.
Note that none of these directly confront the additional problems of a multi-objective RL system. It could well be that an RL system with multiple goals will collapse to having only a single goal over the course of reflection and self-modification. Humans don't do this, but we have both limited intelligence and a limited ability to self-modify.
Another approach to preventing goal changes in intelligent agents is corrigibility. If we can notice when the agent's goals are changing, and instruct or retrain or otherwise modify them back to what we want, we're good. This is a great idea; the problem is that it's another multi-objective alignment problem. Christiano has said "I grant that even given such a core [of corrigibility], we will still be left with important and unsolved x-risk relevant questions like "Can we avoid value drift over the process of deliberation?""
I haven't been able to find other work trying to provide a solution to the diamond maximizer problem, or other formulations of the stability problem. I'm sure it's out there, using different terminology and mixed into other alignment proposals. I'd love to get pointers on where to find this work.
A direction: asking if and how humans are stably aligned.
Are you stably aligned? I think so, but I'm not sure. I think humans are stable, multi-objective systems, at least in the short term. Our goals and beliefs change, but we preserve our important values over most of those changes. Even when gaining or losing religion, most people seem to maintain their goal of helping other people (if they have such a goal); they just change their beliefs about how to best do that.
Humans maintain that stability across several important goals only over our relatively brief lifespans. Whether we'd do the same in the long term is an open question that I want to consider more carefully in future posts. And we might only maintain those goals with the influence of a variety of reward signals, such as getting a reward signal in the form of dopamine spikes when we make others happy. Even if we figure out how that works (the focus of Steve Byrnes' work), including those rewards in a mature AGI might have bad side effects, like a universe tiled with simulacra of happy humans.
The human brain is not clearly the most promising model of alignment stability. But it's what I understand best, so my efforts will go there. And there are other advantages to aligning brainlike AGI over other types. For instance, humans seem to have a critic system that could act as a "handle" for alignment. And brainlike AGI would seem to be a relatively good target for interpretability-heavy approaches, since we seem to think one important thought at a time, and we're usually able to put them into words.
Much work remains to be done to understand alignment stability. I'll delve further into the idea of training brainlike AGI to have enough of our values, in a long-term stable form, in future posts.
I'll use goals here, but many definitions of values, objectives, or preferences could be swapped in.
If I understand you correctly, you might be interested in reflective stability and tiling agent theory.
I'll definitely check these out, thanks! Reflective stability sounds exactly right.
The link for reflective stability doesn't have much content, unfortunately.
A similar concept ("reflective consistency") is informally introduced here. The tiling agents paper introduces the concepts of "reflectively coherent quantified belief" and "reflective trust" on this page.
Reflective stability does seem like the right term. Searches on that term are turning up some relevant discussion on alignment forum, so thanks!
Tiling agent theory is about formal proof of goal consistency in successor agents. I don't think that's relevant for any AGI made of neural networks similar to either brains or current systems. And that's a problem.
Reflective consistency looks to be about decision algorithms given beliefs, so I don't think that's directly relevant. I couldn't work out Yudkowsky's use of reflectively coherent quantified belief on a quick look; but it's in service of that closed form proof. That term only occurs three times on AF. Reflective trust is about internal consistency and decision processes relative to beliefs and goals, and it also doesn't seem to have caught on as common terminology.
So the reflective stability term is what I'm looking for, and should turn up more related work. Thanks!
I probably should've titled this "the alignment stability problem in artificial neural network AI". There's plenty of work on algorithmic maximizers. But it's a lot trickier if values/goals are encoded in a network's distributed representations of the world.
I also should've cited Alex Turner's Understanding and avoiding value drift. There he makes a strong case that dominant shards will try to avoid value drift through other shards establishing stronger connections to rewards. But that's not quite good enough. Even if it avoids sudden value drift, at least for the central shard or central tendency in values, it doesn't really address the stability of a multi-goal system. And it doesn't address slow subtle drift over time.
Those are important, because we may need a multi-goal system, and we definitely want alignment to stay stable over years, and ideally over centuries, of learning and reflection.