Say that we train a highly competent AGI using some mix of RL & supervised learning and some novel algorithmic improvements, optimizing it to itself optimize for one or several helpfulness benchmarks (maybe by optimizing on some assistance game, or using techniques described in diamond alignment), inducing in our agent a thousand shards of desire. One of these shards somehow happens to be exactly aligned with humans. We have a partial solution to the alignment problem! One of the many things the AI happens to care about seems to be humans!

Are we in the clear? No. For all those thousand shards of desire add up to an AGI which once optimized a robustly optimizable metric. The shards of desire must fit together in a way which would have once optimized that metric; otherwise they would have been fit together some other way. Despite the shards, the AGI has biases: there are situations which cause its actions to conform better to its values, and situations which cause its actions to conform worse to its values. And these situations (or at least the situations relevant to this analysis) have been strategically set up such that these biases and deficiencies contributed to the optimization of the metric.

There are a few ways I imagine these situations producing an existential risk...

Tool building and meta-thinking are robustly useful cognitive faculties

In the past, the AGI could throw a large fraction of its intelligence at optimizing for those metrics, and in its stable state probably even end up doing things in the world in line with its own values.

It makes a successor AGI, because the parts of it which advocated strongly for meta-thinking and tool building never got disincentivized by the optimizable metric. It doesn't necessarily make the successor AGI with the optimizable metric in mind, because it doesn't actually care about that metric.

We get a new AGI, with a different thousand shards of desire. Some of these shards are the same, like power-seeking, or tool building & meta-thinking. Others are conceivably different, like caring about humans or diamonds. Others are new, like now caring about the first AGI. This process continues until only the shards preserved across all recursive self-modifications and productions of successor AGIs remain. Notably, care for humans is probably not among these, because each successor has a different concept of what caring for humans means, and once the agent gets sufficiently powerful, there's nothing outside the agent which can robustly push it towards caring for humans. This is in contrast to tool building, power-seeking, and intelligence enhancement, which are all incentivised by the structure of the environment.
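To make this dynamic concrete, here is a minimal toy simulation (my own illustration with made-up probabilities, not anything from shard theory proper): shards that the environment keeps incentivising get re-instilled in every successor, while shards with nothing outside the agent backing them are lost for good the first time a successor drops them.

```python
import random

# A toy model with made-up numbers: each successor AGI drops any given shard
# with some probability. Shards the environment keeps incentivising (tool
# building, power-seeking, intelligence enhancement) get re-instilled every
# generation; shards with nothing outside the agent backing them stay lost
# once a successor drops them.
random.seed(0)

ENV_INCENTIVISED = {"tool_building", "power_seeking", "intelligence_enhancement"}
CONTINGENT = {"care_for_humans", "care_for_diamonds", "care_for_predecessor"}

P_DROP = 0.1        # assumed chance a successor loses or distorts any given shard
GENERATIONS = 50    # successive self-modifications / successor AGIs

surviving = ENV_INCENTIVISED | CONTINGENT
for _ in range(GENERATIONS):
    surviving = {shard for shard in surviving if random.random() > P_DROP}
    surviving |= ENV_INCENTIVISED  # the environment re-instils these regardless

print(sorted(surviving))
# (1 - 0.1) ** 50 is about 0.5%, so the contingent shards are almost certainly
# gone after 50 generations, while the environmentally incentivised ones remain.
```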

Perhaps the AGI realizes what is happening to its value distribution, and stops it. Then again, perhaps the AGI realizes what is happening to its value distribution and calls it correct moral reasoning, like what our society does with the process which led us from thinking slavery was moral & just to thinking it was disgusting & abhorrent. Or it realizes what's happening to its value distribution, but can't help itself at this point, like a superintelligent addict trying to quit heroin.

Different shards are activated when creating successors than when doing business-as-usual actions

Humans care about lots of things, like fun. But often when we sit down to write moral philosophy papers, needing to justify all our thoughts and make them legible to others, we end up in a bland no-fun-land (if you like bland no-fun-lands, then use scope insensitivity as your prototypical example here).

When making choices about the fate of the universe, bounded agents like humans have a track record of being bad at aggregating all their values into their decisions. This is usually fine, especially when the fate of the universe is determined by a bunch of uncorrelated actions the bounded agent is making. But when there's one big decision, or many highly correlated decisions which determine the fate of the universe, leaving something out is virtually guaranteed.[1]
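Some back-of-the-envelope arithmetic makes the point (illustrative numbers of my own, not the post's): even a small per-value chance of omission compounds badly when one decision has to aggregate everything at once.

```python
# Back-of-the-envelope check with illustrative numbers: suppose a bounded agent
# holds 1,000 distinct values, and in one big fate-of-the-universe decision each
# value independently has a 1% chance of being overlooked.
n_values = 1000
p_overlooked = 0.01

p_nothing_left_out = (1 - p_overlooked) ** n_values
print(f"P(no value left out) = {p_nothing_left_out:.1e}")  # roughly 4e-05
```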

Humans will have this problem when making the first AGI, and the first AGI will have this problem when making its successor, and so on. You need to make your AGI such that when it goes to self-improve (whether by writing/training a successor or itself, or by reading the sequences), it keeps everything humans like in mind, and also such that when the successor goes to do the same, it keeps this same law in mind.

One or a few shards end up significantly more powerful/intelligent than others

We may see this if shard power/intelligence varies according to a power law, which may occur because intelligence is a conjunctive quality. Human intelligence seems to vary according to a normal distribution, but human power seems to vary on a power law distribution, and human citation-count also seems to vary on a power law distribution. The distribution of shards in humans also seems to follow a power law: you have a lot of little shards, like an impulse to eat cake, and very few big shards, like an impulse to avoid death.

In any case, if shard influence varies on a power law distribution, then likely only a small fraction of the total values that a shard theoretic agent has will be passed on to its successor, because the driving forces don't really care about the vast majority of the things the shard theoretic agent might technically value.
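As a rough illustration of why that follows (a sketch with assumed parameters, not anything from the post), sampling shard influence from a heavy-tailed Pareto distribution shows how a handful of shards can dominate the total:

```python
import numpy as np

# A sketch with assumed parameters: sample shard "influence" from a heavy-tailed
# Pareto distribution and check what share of the total the top few shards hold.
rng = np.random.default_rng(0)

n_shards = 1000
# numpy's pareto() draws from a Lomax distribution; adding 1 gives a classical
# Pareto with minimum 1. Shape a=1.2 means a very heavy tail.
influence = rng.pareto(a=1.2, size=n_shards) + 1
influence.sort()

top_10_share = influence[-10:].sum() / influence.sum()
print(f"Top 10 of {n_shards} shards hold {top_10_share:.0%} of total influence")
# With a tail this heavy, the top 1% of shards typically holds a large share
# (often somewhere around half) of the total influence, so whatever drives
# successor design can ignore most of what the agent technically values.
```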

Hidden values in not being fully intelligent

If you suddenly made me into a superintelligence, you'd run a big risk of losing a lot of my values, because my values are intimately tied to my way of thinking. Not to the extent that you literally couldn't make a superintelligence with my values (it is very difficult to make that kind of universal quantifier, and I also place significant probability mass on the solution to making me a superintelligence being 'add more neurons to the brain', and on my self-reflection being good enough not to have the new ways of thinking destroy lots of what I once cared about). But to the extent that if someone were just trying to make me into a superintelligence, and didn't really care about my values, they'd probably end up destroying more of my values than they would if I were an agent with an explicitly defined utility function off the bat.

Will our AGI have enough reflective thought to realize when the thoughts it's thinking pose a risk of destroying values it has? Probably not the first one to think really dangerous thoughts, if you don't train it for that. And navigating these waters seems difficult even if you do train it for that. So adequate training is what we'd need.

David Udell tells me this is another way of phrasing the law school problem of value drift. This seems approximately right, but notably doesn't require in-lifetime training steps.

Other kinds of failure

There are all sorts of other failures of recursive alignment which may actually be realized in the world that I haven't thought of. Taxonomizing these failures is a useful activity for the interested party! And I wouldn't be surprised if the particular examples I outlined above don't actually come into play.


  1. Justis questions this claim:

    [...]this doesn't seem generally true to me. Specifically, the big decision could be super obvious/easy to get right (an omnipotent being comes out of a portal and says "should I destroy everything?" and you have only to say "no", for example), or the correlated decisions could be such that only when you do them all wrong does the world end (the omnipotent being offers a poll and only if everyone votes "destroy" does it do anything, for example).

    I say the reason it's clear you don't destroy the world is that saying no leaves many alternative options open to you. If instead the entity asked whether you wanted to ban the world from ever ending so that there always exist humans, along with various specifications of what it means for a clump of atoms to be a human, and you have some plausible reason to suspect those properties actually correspond to your concept of a human, but not quite, the decision would be tougher. ↩︎

Comments

Value drift involves extrapolation of values to strange situations based on things other than endorsement by current values from the comfort of familiar situations. Running this risk doesn't require building different agents, like with natural selection and humans, or humans and AIs, or humans modifying their own cognition or that of their descendants. Rapid permanent change of environment is sufficient, and AIs introduce this risk both for humans and for themselves even when aligned, until they are smart enough to anticipate it.

This is great. I think this is an important consideration because the new school of deep network thinkers is excited that we have some actual promising approaches to alignment. These include shard theory, Steve Byrnes' Plan for mediocre alignment of brain-like [model-based RL] AGI, and the potential for natural language chain-of-thought alignment as in my post Capabilities and alignment of LLM cognitive architectures. All of these are promising, but none really address the stability problem.

I think this is exactly the problem I was pointing to in my post The alignment stability problem and my 2018 paper Goal changes in intelligent agents.

The one direct approach to the stability problem in deep networks that I know of is Alex Turner's A shot at the diamond-alignment problem. I think this sort of reflective stability in a deep network or similarly messy agent is only likely to apply to the most centrally held goal (or shard), and for everything else we have no idea. Context matters, as you point out, so even the one biggest goal isn't guaranteed to stay stable under reflection after a lot of learning and perhaps successor design.

I hope the links are useful. I'm also working on another post on exactly this topic, and I'll cite this one.

I've now read your alignment stability post and the goal changes in intelligent agents post, and they're pretty good. In the 2018 post, I liked how you framed all the previous alignment attempts as value-reflectivity adjacent. For some of them, like motivation drift or some examples of representation hacking, I think I would have categorized the failure mode more along the lines of Goodhart, though there is some sense in which, when seen from the outside, Goodhart looks a lot like value drift. Like, as the agent gets smarter and thinks about what it wants more, it will look more and more like it doesn't care about the target goal. But from the agent's point of view, no goal changes are happening, it's just getting better at what it was previously doing.

Yeah, I think this area is super important to think about, and that this post is possibly my most important post to-date. I'm glad you're also thinking about this, and am excited to read what you have written.