The community thinks a lot about how to align AGI. It thinks less about how to align AGI so that it stays aligned for the long term. In many hypothetical cases, these are one and the same thing. But for the type of AGI we're actually likely to get, I don't think they are.
Despite some optimism for aligning tool-like AGI, or at least static systems, it seems likely that we will create AGI that learns after it's deployed, and that has some amount of agency. If it does, its alignment will effectively shift, as addressed in the diamond maximizer thought experiment and elsewhere. And that's even if it doesn't deliberately change its preferences. Humans deliberately change our preferences sometimes, despite not having access to our own source code. So, it would seem wise to think seriously and explicitly about the stability problem, even if it isn't needed for current-generation AGI research.
I've written a chapter on this, Goal changes in intelligent systems. There I laid out the problem, but I didn't really propose solutions. What follows is a summary of that article, followed by a brief discussion of the work I've been able to locate on this problem, and one direction we might go to pursue it.
Some types of AGI are self-stabilizing. A sufficiently intelligent agent will try to prevent its goals from changing, at least if it is consequentialist. That works nicely if its values are one coherent construct, such as diamond or human preferences. But humans have lots of preferences, so we may wind up with a system that must balance many goals. And if the system keeps learning after deployment, it seems likely to alter its understanding of what its goals mean. This is the thrust of the diamond maximizer problem.
One tricky thing about alignment work is that we're imagining different types of AGI when we talk about alignment schemes. Currently, people are thinking a lot about aligning deep networks. Current deep networks don't keep learning after they're deployed. And they're not very agentic. These are great properties for alignment, and they seem to be the source of some optimism.
Even if this type of network turns out to be really useful, and all we need to make the world a vastly better place, I don't think we're going to stop there. Agents would seem to have capabilities advantages that metaphorically make tool AI want to become agentic AI. If that weren't enough, agents are cool. People are going to want to turn tool AI into agent AI just to experience the wonder of an alien intelligence with its own goals.
I think turning intelligent tools into agents is going to be relatively easy. But even if it's not easy, someone is going to manage it at some point. It's probably too difficult to prevent further experimentation, at least without a governing body, aided by AGI, that's able and willing to at minimum intercept and decrypt every communication for signs of AGI projects.
While the above logic is far from airtight, it would seem wise to think about stable alignment solutions, in advance of anyone creating AGI that continuously learns outside of close human control.
Similar concerns have been raised elsewhere, such as On how various plans miss the hard bits of the alignment challenge. Here I'm trying to crystallize and give a name to this specific hard part of the problem.
Alex Turner addresses this in A shot at the diamond-alignment problem. In broad form, he's saying that you would train the agent with RL to value diamonds, including having diamonds associated with the reward in a variety of cognitive tasks. This is as good an answer as we've got. I don't have a better idea; I think the area needs more work. Some difficulties with this scheme are raised in Contra shard theory, in the context of the diamond maximizer problem. Charlie Steiner's argument that shard theory requires magic addresses roughly the same concerns. In sum, it's going to be tricky to train a system so that it has the right set of goals when it acquires enough self-awareness to try to preserve its goals.
Note that none of these directly confront the additional problems of a multi-objective RL system. It could well be that an RL system with multiple goals will collapse to having only a single goal over the course of reflection and self-modification. Humans don't do this, but we have both limited intelligence and a limited ability to self-modify.
Another approach to preventing goal changes in intelligent agents is corrigibility. If we can notice when the agent's goals are changing, and instruct or retrain or otherwise modify them back to what we want, we're good. This is a great idea; the problem is that it's another multi-objective alignment problem. Christiano has said "I grant that even given such a core [of corrigibility], we will still be left with important and unsolved x-risk relevant questions like "Can we avoid value drift over the process of deliberation?""
I haven't been able to find other work trying to provide a solution to the diamond maximizer problem, or to other formulations of the stability problem. I'm sure it's out there, using different terminology and mixed into other alignment proposals. I'd love to get pointers on where to find this work.
Are you stably aligned? I think so, but I'm not sure. I think humans are stable, multi-objective systems, at least in the short term. Our goals and beliefs change, but we preserve our important values over most of those changes. Even when gaining or losing religion, most people seem to maintain their goal of helping other people (if they have such a goal); they just change their beliefs about how to best do that.
Humans only maintain that stability of several important goals across our relatively brief lifespans. Whether we'd do the same in the long term is an open question that I want to consider more carefully in future posts. And we might only maintain those goals with the influence of a variety of reward signals, such as getting a reward signal in the form of dopamine spikes when we make others happy. Even if we figure out how that works (the focus of Steve Byrnes' work), including those rewards in a mature AGI might have bad side effects, like a universe tiled with simulacra of happy humans.
The human brain is not clearly the most promising model of alignment stability. But it's what I understand best, so my efforts will go there. And there are other advantages to aligning brainlike AGI over other types. For instance, humans seem to have a critic system that could act as a "handle" for alignment. And brainlike AGI would seem to be a relatively good target for interpretability-heavy approaches, since we seem to think one important thought at a time, and we're usually able to put them into words.
Much work remains to be done to understand alignment stability. I'll delve further into the idea of training brainlike AGI to have enough of our values, in a long-term stable form, in future posts.
I'll use goals here, but many definitions of values, objectives, or preferences could be swapped in.
If I understand you correctly, you might be interested in reflective stability and tiling agent theory.
I'll definitely check these out, thanks! Reflective stability sounds exactly right.
The link for reflective stability doesn't have much content, unfortunately.
A similar concept ("reflective consistency") is informally introduced here. The tiling agents paper introduces the concepts of "reflectively coherent quantified belief" and "reflective trust" on this page.
Reflective stability does seem like the right term. Searches on that term are turning up some relevant discussion on the Alignment Forum, so thanks!
Tiling agent theory is about formal proof of goal consistency in successor agents. I don't think that's relevant for any AGI made of neural networks similar to either brains or current systems. And that's a problem.
Reflective consistency looks to be about decision algorithms given beliefs, so I don't think that's directly relevant. I couldn't work out Yudkowsky's use of reflectively coherent quantified belief on a quick look; but it's in service of that closed form proof. That term only occurs three times on AF. Reflective trust is about internal consistency and decision processes relative to beliefs and goals, and it also doesn't seem to have caught on as common terminology.
So the reflective stability term is what I'm looking for, and should turn up more related work. Thanks!
if and how humans are stably aligned
Humans are NOT aligned. Humans are not selfless, caring only about the good of others. Joseph Stalin was not aligned with the citizenry of Russia. If humans were aligned, we wouldn't need law enforcement, or locks. Humans cannot safely be trusted with absolute power or the sorts of advantages inherent to being a digital intelligence. They're just less badly aligned than a paperclip maximizer.
I tend to agree, but I believe most non-aligned behavior is due to scarcity. It's hard to get into the heads of people like Stalin, but I believe that if everybody had a very realistic virtual reality where they could do all the things they'd do in real life, they might be much less motivated to enter into conflict with other humans.
I agree that scarcity doesn't help. But I'm afraid I don't think that's the only problem. See my post Uploading for a more detailed discussion of this issue (for uploads, but the situation for biological humans + VR isn't very different).
I think humans are aligned, approximately and on average. And that's what people mean when they assume humans are aligned. I wish I were sure, because this is an important question.
Does the average human become Stalin if put in his position? I don't think so, but I can't say I'm sure. Stalin probably got power because he was the man of steel (a sociopath/warrior mindset), and he also probably became more cruel over the course of competing for power, which demanded being vicious, which he would've then justified and internalized. Putting someone in a position of power wouldn't necessarily corrupt them the same way. But maintaining his power probably also demanded being vicious on occasion; I'm sure there were others plotting to depose him in reality as well as in his paranoia.
But the more relevant question is: does the average human help others when they're given nearly unlimited power and security? That's the position an AGI or a human controlling an aligned AGI would be in. There I think the answer is yes, the average human will become better when they have no threats to themselves or their loved ones.
I can't say I'm sure, and I wish I were. This might very well be the question on which hangs our future. I think we can achieve technical alignment because we have promising alignment plans with low taxes for the types of AGI we're most likely to get. But corrigibility or DWIM is such an attractive primary goal for AGI that AGI will probably be aligned to take orders from the human(s) that built it. And then the world hangs on their whims. I'd take that bet over any other alignment scheme, but I'm far from sure it would pay off.
I probably should've titled this "the alignment stability problem in artificial neural network AI". There's plenty of work on algorithmic maximizers. But it's a lot trickier if values/goals are encoded in a network's distributed representations of the world.
I also should've cited Alex Turner's Understanding and avoiding value drift. There he makes a strong case that dominant shards will try to avoid value drift through other shards establishing stronger connections to rewards. But that's not quite good enough. Even if it avoids sudden value drift, at least for the central shard or central tendency in values, it doesn't really address the stability of a multi-goal system. And it doesn't address slow subtle drift over time.
Those are important, because we may need a multi-goal system, and we definitely want alignment to stay stable over years, let alone centuries of learning and reflection.
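To make the "slow subtle drift" worry concrete, here is a toy sketch, not a model of any real system. It assumes, purely for illustration, that an agent's values can be summarized as a 2D direction and that each learning update rotates that direction by a tiny fixed angle. The names `rotate`, `cosine`, and `simulate_drift`, and the angle `theta`, are all hypothetical. The point is only that a check comparing each update to the previous state can look fine at every step while the values end up far from where they started.

```python
import math


def rotate(v, theta):
    """Rotate a 2D vector by theta radians (a stand-in for one small value update)."""
    x, y = v
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)


def cosine(a, b):
    """Cosine similarity between two 2D vectors."""
    dot = a[0] * b[0] + a[1] * b[1]
    return dot / (math.hypot(*a) * math.hypot(*b))


def simulate_drift(steps, theta=0.01):
    """Apply `steps` tiny rotations to a toy value vector.

    Returns (worst step-to-step similarity, similarity of the final
    vector to the original). Both the vector representation and the
    fixed rotation are illustrative assumptions.
    """
    original = (1.0, 0.0)
    current = original
    worst_step_similarity = 1.0
    for _ in range(steps):
        updated = rotate(current, theta)
        worst_step_similarity = min(worst_step_similarity, cosine(current, updated))
        current = updated
    return worst_step_similarity, cosine(original, current)


if __name__ == "__main__":
    step_sim, total_sim = simulate_drift(steps=157)
    print(f"worst single-step similarity: {step_sim:.5f}")
    print(f"similarity to original values: {total_sim:.5f}")
```

In this toy setup every individual update leaves the vector almost unchanged, yet after enough updates the final direction is nearly orthogonal to the original. Real drift would not be a uniform rotation, so this only illustrates why stability has to be judged against the original values over the whole trajectory rather than update by update.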
Approaches to alignment stability
I view this as pretty much a solved problem, solved by value learning. Though there are then issues due to the mutability of human values.