Introduction

In this essay, I name and describe a mechanism that might break alignment in any system in which an originally aligned AI improves its capabilities through self-modification. I doubt that this idea is new, but I haven't yet seen it named and described in these terms. I discuss the mechanism in the context of agents that iteratively improve themselves to surpass human intelligence, but the general idea should apply to most self-modification schemes.

This idea directly draws inspiration from Scott Alexander’s Schelling Fences on Slippery Slopes.

Schelling Shifts

A Schelling shift occurs when an agent fails to properly anticipate that modifying a parameter will increase its next version’s willingness to further modify the same parameter. This causes future iterations of the agent to modify the parameter further than the initial agent would be comfortable with.

Let A[t] refer to the state of agent A at the end of modification iteration t, and let X[t] refer to the value of parameter X at the end of modification iteration t.

For clarity, let’s say that X represents adherence to a certain constraint. The properly aligned initial agent A[0] predicts that reducing X will improve performance on a certain objective, and that alignment will be preserved as long as the reduction is of sufficiently small magnitude. A[0] is not exactly sure how far it can safely reduce X, but it is entirely confident that if X[1] is at least 95% of X[0], then A[1] will still adhere to the constraint with perfect consistency.

In this case, A[0] establishes a Schelling point at 95% of X[0], and is only comfortable reducing X down to this point, even though further reduction might lead to greater performance on certain objectives.

During the next modification iteration, A[1] once again predicts that reducing X will improve performance. Because X[1] < X[0], A[1] reasons about X differently than A[0] did, and decides that it is willing to reduce X even further. A[1] establishes a new Schelling point at 95% of X[1], which is 90.25% of X[0]. After the next iteration, X[2] is lower than A[0] would have been comfortable with, introducing the possibility that A[2] is no longer properly aligned with A[0].

As a general rule, a Schelling shift occurs when A[t + c] and A[t] establish different Schelling points for modification to parameter X because X[t + c] ≠ X[t].
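
To make the compounding explicit, here is a minimal, purely illustrative sketch. The numbers, and the assumption that every version reduces X all the way to its own Schelling point, are mine rather than a claim about how a real agent would behave.

```python
# Toy model of a compounding Schelling shift.
# Assumption (illustrative, not from the post): every version A[t] reduces X
# all the way down to its own Schelling point of 95% of the value it inherited.

X0 = 1.0               # X[0]: initial degree of adherence to the constraint
ratio = 0.95           # each version's Schelling point, as a fraction of its own X
a0_floor = ratio * X0  # the only reduction A[0] ever judged safe

x = X0
for t in range(1, 15):
    x *= ratio         # A[t-1] reduces X to its own Schelling point
    note = "  <- below A[0]'s Schelling point" if x < a0_floor else ""
    print(f"X[{t}] = {x:.3f} * X[0]{note}")

# Under this assumption X[t] = 0.95**t * X[0]: X[2] is already below what A[0]
# considered safe, and X[14] is roughly half of X[0], even though no single
# version ever intended more than a 5% reduction.
```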

Example: An agent decides that reducing its reluctance to hurt humans will allow it to make decisions more quickly, by virtue of spending less time reasoning through every possible way that its decisions could hurt humans. The agent is confident that a 95%-as-reluctant-to-hurt-humans version of itself would also never hurt a human, and establishes a Schelling point at a 5% reduction. The problem is that in the next modification iteration, a new version of the agent is responsible for improving itself, and that version’s reduced reluctance to hurt humans makes it willing to reduce its reluctance even further than the original would have, in exchange for faster decision-making.

Alignment Challenges

Addressing Schelling shifts with alignment approaches seems like a challenging task for the following reasons:

  • They are difficult to predict
  • They are difficult to prevent without significantly sacrificing performance
  • They are difficult to detect

Many of these difficulties result from the fact that the modification of a parameter, the occurrence of a Schelling shift, and the consequences of a Schelling shift can all be separated by large numbers of modification iterations.

Furthermore, the most difficult types of Schelling shift to predict and detect are also the most dangerous types, so any effective alignment approach must not be vulnerable to these edge cases.

Schelling Shifts Are Difficult to Predict

From the perspective of an agent, predicting Schelling shifts has two difficult components. First, the agent has to figure out the relationship between the value of a parameter and its willingness to alter that parameter. Second, it has to understand how that relationship would hold for a smarter version of itself. Any prediction mechanism has to combine these two components into a cohesive evaluative process.

I expect that the elasticity between changes to a parameter and an agent’s willingness to alter it depends largely on the parameter in question, so I doubt that there is any universal rule linking the two. The only broad rule I can think of is that the greater the alteration to a parameter, the more likely it is that a Schelling shift will occur in a future iteration.

From the perspective of a human examining the system, all of the above remain true. Additionally, the parameters prone to shifting might not match up with any single concept that humans have. While all examples in this post are easy to understand, it is possible that a Schelling shift could occur in a parameter that has no direct analogue in human language and experience. This might pose challenges to corrigibility schemes.

Last but not least, a Schelling shift does not have to occur at the next iteration step. A modification made by A[t] might, for example, result in a Schelling shift in A[t+10]. In fact, one might even argue that Schelling shifts are more likely to occur in iteration steps distant from the original parameter change, since the agent evaluating the parameter might think in a significantly different way by that point. This makes every aspect of prediction even more difficult.

Schelling Shifts Are Difficult to Prevent

If larger alterations to parameters are more likely to result in Schelling shifts down the road, the risk might be mitigated by preventing large alterations to parameters in the first place. The two problems I have with this approach are that it might not be competitive from a performance standpoint, and that it doesn’t address the fact that Schelling shifts might also result from small parameter adjustments.

Since a shift can occur at any number of iterations in the future, a rule that prevents an agent from successively modifying the same parameter would also be ineffective.
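
As a rough illustration of why, here is another toy sketch of my own, under the same compounding assumption as before: a rule forbidding consecutive modifications of the same parameter merely stretches the drift over more iterations; it does not stop X from eventually crossing A[0]'s original Schelling point.

```python
# Toy illustration (my own assumption, not from the post): a cooldown rule that
# forbids modifying the same parameter in consecutive iterations only delays
# the drift past A[0]'s Schelling point; it does not prevent it.

X0, ratio = 1.0, 0.95
a0_floor = ratio * X0

x, last_modified = X0, None
for t in range(1, 31):
    if last_modified is None or t - last_modified >= 2:  # cooldown satisfied
        x *= ratio                                       # the current version shifts X again
        last_modified = t
    if x < a0_floor:
        print(f"A[0]'s Schelling point is crossed anyway at iteration {t}: "
              f"X = {x:.3f} * X[0]")
        break
```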

Schelling Shifts Are Difficult to Detect

Schelling shifts are unlikely to have immediate, observable consequences. A shift occurs when an agent thinks it can adjust parameters to achieve better performance without causing itself to act in undesirable ways. The only ways to detect a shift would involve transparency into an agent’s internal reasoning or observing an undesired behavior as a result of a shift.

Detecting Schelling shifts by looking for their consequences (misaligned behavior) is made extremely difficult by the fact that consequences might first emerge many iterations after the initial shift, once an agent is smarter and more capable. Corrigibility mechanisms that evaluate how an agent would act in different scenarios might miss shifts that would only have consequences after future improvement iterations.

It is also possible that a shift would only cause undesired behaviors in response to an unusual or unpredictable event, which would pose another challenge from a corrigibility standpoint.

Additional Observations

Schelling Shifts May Have Rippling Effects

It is possible that a shift in one parameter could increase the likelihood of Schelling shifts occurring in other parameters, or directly cause shifts in other parameters.

Schelling Shifts May Alter How an Agent Perceives Final Goals

A Schelling shift may cause an agent to perceive and conceptualize its final and instrumental objectives differently. Examples of this can be seen in the edge cases below.

The Danger of Edge Cases

Observation: Schelling shifts might not have observable behavioral consequences until many improvement iterations after the shift.

Corollary: Misalignment can remain hidden until after an agent has reached superintelligence.

Example: An early version of an agent seemingly improves its ability to accurately model human well-being by focusing slightly more on neurochemistry and slightly less on subjective concepts like purpose and meaning. The agent is at first wary of excessively prioritizing neurochemistry, since this view differs from the ways that humans describe their own well-being. This inclination erodes as multiple shifts occur over many iterations, and the agent begins to view well-being mainly in terms of neurochemistry. These shifts are latent, and do not cause a single undesirable behavior until a superintelligent version of the agent gains the ability to administer a pleasure-inducing drug to the entire human race, placing everyone into a perpetual state of artificially induced euphoria.

Observation: Schelling shifts might have consequences that only emerge in response to an unusual event.

Corollary: Misalignment can remain hidden until it is exposed by an inherently unpredictable Black Swan event.

Example: Schelling shifts cause an agent to gradually increase the importance of intelligence when evaluating what separates humans from other species. The agent was first reluctant to prioritize intelligence over other human characteristics, but this reluctance waned with each iteration. At some point, human well-being becomes valued primarily because humans are the most intelligent species on Earth. Eventually, the agent’s goal of human well-being is replaced with the goal of well-being for the most intelligent species on Earth. This goes unnoticed, as these goals are functionally equivalent. Misalignment is exposed one day when aliens smarter than humans unexpectedly invade Earth, and the agent sides with the aliens over humans.

Observation: The most difficult shifts to predict are ones that occur many improvement iterations after the initial parameter modification.

Corollary: Alignment can suddenly break in a superintelligent agent as the result of a parameter modification in a far earlier, less capable iteration of the agent.

Example: An early version of an agent marginally reduces its reluctance to cause humans discomfort. This value does not shift any further for a large number of iteration steps. Later on, a vastly more intelligent version of the agent is deciding how to improve upon the utopia it has already created. The slightly reduced reluctance to cause humans discomfort contributes to its rationalization that humans would be better off if they were less comfortable all the time. It modifies itself further by significantly shifting its tolerance for human discomfort over the course of a few iterations. The agent now conceptualizes human well-being significantly differently than it did earlier, and is no longer properly aligned.

These examples might be a little ridiculous, but the observations and corollaries seem valid to me.


Thank you for reading this. I'm very new to the field of alignment, so if there is already another name and definition for this mechanism, my apologies; please let me know. I enthusiastically welcome all feedback and criticisms.

Comments

The reason I don't think this (that is, a particular sort of value stability under self-modification) is a key problem is that this is one of those areas where the AI's incentives are automatically aligned. We don't have to solve the problem ahead of time, because almost any AI is going to be 100% on board with avoiding value drift. So it seems like there's less pressure on us to get everything done.

However, there is one case where it might become important to solve this correctly before turning on an AGI: seed AI that starts very subhuman and increases its intelligence through self-modification. But I think that we should avoid this scenario anyhow. Even if we do have subhuman self-improving AI, it shouldn't be incentivized to touch its value function, only its world-model: any such incentive (e.g. "attempting to make itself run faster") should be only a subgoal in a hierarchical goal structure that remembers not to touch its value function.

I don't know, I'm increasingly less convinced that we should reasonably expect not to see value drift. In particular, value drift can at least be a function of computing reflective equilibrium, such that values may drift from their original position in order to be consistent with other values. In this sense the original value might be thought of as mistaken, and it could be a correct move to drift towards a value that is stable under reflection, and this is to say nothing of "drift" as a result of updating on new information.

Put another way, it seems unlikely to me that we can build AGI that is both fully general and not open to instability under self-modification; to get greater stability, we must give up some generality. Arguably this is exactly what alignment is (giving up access to parts of mind space in exchange for meeting particular safety guarantees), but I think it's also worth pointing out that there may be a sense in which we can oversolve alignment, removing all value drift and rendering the intended AGI narrow rather than general.

Thank you for your input, I found it very informative!

I agree with your point that any aligned AI will be 100% on board with avoiding value drift, and that certainly does take pressure off of us when it comes to researching this. I also agree that it would be best to avoid this scenario entirely and avoid having a self-improving AI touch its value function at all.

In cases where a self-improving AI can alter its values, I don’t entirely agree that this would only be a concern at subhuman levels of intelligence. It seems plausible to me that an AI of human-level intelligence, and maybe slightly higher, could think that marginally adjusting a value for improved performance is safe, only to be wrong about that. From a human perspective, I find it very difficult to reason through how slightly altering one of my values would impact my reflective reasoning about the importance of that value and the acceptable ranges it could take. A self-improving agent would also have to make this prediction about a more intelligent version of itself, with the added complication of calculating the potential impact on future iterations as well. It’s possible that an agent of human-level intelligence would be able to do this easily, but I’m not entirely confident of that.

And the main reason that I bring up the scenario of self-improving AI with access to its own values is that I see it as a clear path to performance improvement that might seem deceptively safe to some organizations conducting general AI research in the future. This is especially true where external incentives (such as an international general AI arms race) might push researchers to take risks that they normally wouldn’t take in order to beat the competition. If a general AI were properly aligned, I could see certain organizations allowing that AI to improve itself through marginally altering its values out of fear that a rival organization would do the same.

I’m going to reflect upon what you said in more depth though. Since I’m still new to all of this, it’s very possible that there is relevant external information that I’m missing or not considering thoroughly.