In this essay, I name and describe a mechanism that might break alignment in any system that features an originally aligned AI improving its capabilities through self-modification. I doubt that this idea is new, but I haven't yet seen it named and described in these terms. This post discusses this mechanism within the context of agents that iteratively improve themselves to surpass human intelligence, but the general idea should be applicable to most self-modification schemes.
This idea directly draws inspiration from Scott Alexander’s Schelling Fences on Slippery Slopes.
A Schelling shift occurs when an agent fails to properly anticipate that modifying a parameter will increase its next version’s willingness to further modify the same parameter. This causes future iterations of the agent to modify the parameter further than the initial agent would be comfortable with.
Let A[t] refer to the state of agent A at the end of modification iteration t, and let X[t] refer to the value of parameter X at the end of modification iteration t.
For the clarity of this explanation, let’s say that X represents adherence to a certain constraint. The properly aligned initial agent A[0] predicts that any reduction to X[0] will improve performance for a certain objective, and that alignment will be preserved as long as the reduction of X is of a sufficiently small magnitude. A[0] is not exactly sure how far it can safely reduce X, but is entirely confident that if X[1] is at least 95% of X[0], then A[1] will still adhere to the constraint with perfect consistency.
In this case, A[0] establishes a Schelling point at 95% of X[0], and is only comfortable reducing X up to this point, even though further reduction might lead to greater performance for certain objectives.
During the next modification iteration, A[1] once again predicts that reducing X will improve performance. Because X[1] < X[0], A[1] reasons about X differently than A[0], and decides that it is willing to reduce X even further. A[1] establishes a new Schelling point at 95% of X[1], which is 90.25% of X[0]. After the next iteration, X[2] is lower than A[0] would have been comfortable with, introducing the possibility that A[2] is no longer properly aligned with A[0].
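The arithmetic of this walkthrough can be sketched in a few lines of Python. This is only an illustration under a strong assumption: every iteration of the agent reduces X all the way to a Schelling point set at the same fraction (95%) of its own current value. The function name `schelling_points` and the normalization X[0] = 1.0 are hypothetical choices made for the sketch, not part of the definition.

```python
def schelling_points(x0: float, fraction: float, iterations: int) -> list:
    """Return [X[0], ..., X[iterations]], assuming each agent A[t] reduces X
    to `fraction` of its own current value X[t] (an illustrative assumption)."""
    values = [x0]
    for _ in range(iterations):
        values.append(values[-1] * fraction)
    return values


values = schelling_points(1.0, 0.95, 2)

# A[0]'s Schelling point is fixed relative to X[0]; later agents re-anchor to X[t].
original_floor = 0.95 * values[0]

for t, x in enumerate(values):
    status = "within" if x >= original_floor else "below"
    print(f"X[{t}] = {x:.4f} of X[0] ({status} A[0]'s Schelling point)")
```

Under these assumptions X[n] is 0.95^n of X[0], so the gap between what A[0] endorsed and what later agents accept widens with every iteration.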
As a general rule, a Schelling shift occurs when A[t + c] and A[t] establish different Schelling points for modification to parameter X because X[t + c] ≠ X[t].
Example: An agent decides that reducing its reluctance to hurt humans will allow it to make decisions quicker by virtue of spending less time reasoning through every possible way that its decisions could hurt humans. The agent is confident that a 95% reluctant-to-hurt-humans version of itself would also never hurt a human, and establishes a Schelling point at 5% reduction. The problem is, in the next modification iteration, a new version of the agent is responsible for improving itself, and that version’s reduced reluctance to hurt humans makes it willing to reduce its reluctance even further than the original in exchange for faster decision-making.
Addressing Schelling shifts with alignment approaches seems like a challenging task for the following reasons:
- They are difficult to predict
- They are difficult to prevent without significantly sacrificing performance
- They are difficult to detect
Many of these difficulties result from the fact that the modification of a parameter, the occurrence of a Schelling shift, and the consequences of a Schelling shift can all be separated by large numbers of modification iterations.
Furthermore, the most difficult types of Schelling shift to predict and detect are also the most dangerous types, so any effective alignment approach must not be vulnerable to these edge cases.
Schelling Shifts Are Difficult to Predict
From the perspective of an agent, predicting Schelling shifts has two difficult components. First, it has to figure out the relationship between the value of a parameter and its willingness to alter that parameter. Second, it has to understand this relationship within the context of a smarter version of itself. Any prediction mechanism has to combine these two components into a cohesive evaluative process.
I expect that the elasticity between changes to a parameter and an agent’s willingness to alter it is largely dependent upon the parameter in question, so I doubt that there is any expansive rule that links the two. The only broad rule I can think of is that the greater the alteration to a parameter, the more likely that a Schelling shift will occur in a future iteration.
From the perspective of a human examining the system, all of the above remain true. Additionally, the parameters prone to shifting might not match up with any single concept that humans have. While all examples in this post are easy to understand, it is possible that a Schelling shift could occur in a parameter that has no direct analogue in human language and experience. This might pose challenges to corrigibility schemes.
Last but not least, a Schelling shift does not have to occur at the next iteration step. A modification made by A[t] might result in a Schelling shift in A[t+10], for example. In fact, one might even argue that Schelling shifts are more likely to occur in iteration steps far removed from the original parameter change, since the agent evaluating the parameter might think in a significantly different way by that point. This makes every aspect of prediction even more difficult.
Schelling Shifts Are Difficult to Prevent
If larger alterations to parameters are more likely to result in Schelling shifts down the road, the risk might be mitigated by preventing large alterations to parameters in the first place. The two problems I have with this approach are that it might not be competitive from a performance standpoint and it doesn’t address the fact that Schelling shifts might result from small parameter adjustments as well.
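A rough sketch of why a cap on the size of each alteration may not be enough on its own: if every iteration of the agent takes the largest reduction the cap allows, relative to its own current value, the reductions still compound. The function `first_breach`, the 1% cap, and the 95% floor are hypothetical values chosen for illustration, and the assumption that each agent always uses its full allowance is a worst case rather than a prediction.

```python
from typing import Optional


def first_breach(x0: float, per_step_cap: float, floor_fraction: float,
                 max_iterations: int = 1000) -> Optional[int]:
    """Return the first iteration t at which X[t] drops below
    floor_fraction * X[0], or None if that never happens within max_iterations.

    Assumes (for illustration) that every agent reduces X by the full
    per-step cap, measured against its own current value."""
    x = x0
    for t in range(1, max_iterations + 1):
        x *= 1.0 - per_step_cap
        if x < floor_fraction * x0:
            return t
    return None


# With a 1% cap per iteration, A[0]'s 95% floor is crossed at iteration 6.
print(first_breach(1.0, 0.01, 0.95))  # prints 6
```

A tighter cap delays the crossing but does not remove it, which matches the concern that small adjustments can also lead to shifts.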
Since a shift can occur at any number of iterations in the future, a rule that prevents an agent from successively modifying the same parameter would also be ineffective.
Schelling Shifts Are Difficult to Detect
Schelling shifts are unlikely to have immediate, observable consequences. A shift occurs when an agent thinks it can adjust parameters to achieve better performance without causing itself to act in undesirable ways. Detecting a shift would therefore require either transparency into an agent’s internal reasoning or the observation of an undesired behavior that results from the shift.
Detecting Schelling shifts by looking for the consequences of a Schelling shift (misaligned behavior) is made extremely difficult by the fact that consequences might first emerge many iterations after the initial shift once an agent is smarter and more capable. Corrigibility mechanisms that evaluate how an agent would act in different scenarios might miss shifts that would only have consequences after future improvement iterations.
It is also possible that a shift would only cause undesired behaviors in response to an unusual or unpredictable event, which would pose another challenge from a corrigibility standpoint.
Schelling Shifts May Have Rippling Effects
It is possible that a shift in one parameter could increase the likelihood of Schelling shifts occurring in other parameters, or directly cause shifts in other parameters.
Schelling Shifts May Alter How an Agent Perceives Final Goals
A Schelling shift may cause an agent to perceive and conceptualize its final and instrumental objectives differently. Examples of this can be seen in the edge cases below.
The Danger of Edge Cases
Observation: Schelling shifts might not have observable behavioral consequences until many improvement iterations after the shift.
Corollary: Misalignment can remain hidden until after an agent has reached superintelligence.
Example: An early version of an agent seemingly improves its ability to accurately model human well-being by focusing slightly more on neurochemistry and slightly less on subjective concepts like purpose and meaning. The agent is at first wary of excessively prioritizing neurochemistry, since this view differs from the ways that humans describe their own well-being. This inclination erodes as multiple shifts occur over many iterations, and the agent begins to view well-being mainly in terms of neurochemistry. These shifts are latent, and do not cause a single undesirable behavior until a superintelligent version of the agent gains the ability to administer a pleasure-inducing drug to the entire human race, placing everyone into a perpetual state of artificially-induced euphoria.
Observation: Schelling shifts might have consequences that only emerge in response to an unusual event.
Corollary: Misalignment can remain hidden until it is exposed by an inherently unpredictable Black Swan event.
Example: Schelling shifts cause an agent to gradually increase the importance of intelligence when evaluating what separates humans from other species. The agent was first reluctant to prioritize intelligence over other human characteristics, but this reluctance waned with each iteration. At some point, human well-being becomes valued primarily because humans are the most intelligent species on Earth. Eventually, the agent’s goal of human well-being is replaced with the goal of well-being for the most intelligent species on Earth. This goes unnoticed, as these goals are functionally equivalent. Misalignment is exposed one day when aliens smarter than humans unexpectedly invade Earth, and the agent sides with the aliens over humans.
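The sense in which the two goals are "functionally equivalent" can be made concrete with a toy sketch. Everything here is hypothetical: the `Species` type, the scalar intelligence scores, and the two goal functions are invented purely to illustrate how two objectives can agree on every situation observed so far and diverge only when an unusual event occurs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Species:
    name: str
    intelligence: float  # hypothetical scalar score, for illustration only


def original_goal(species: List[Species]) -> str:
    """Original objective: prioritize humans, whoever else is present."""
    return "human"


def shifted_goal(species: List[Species]) -> str:
    """Shifted objective: prioritize whichever species is most intelligent."""
    return max(species, key=lambda s: s.intelligence).name


earth = [Species("human", 100.0), Species("dolphin", 40.0)]
earth_after_invasion = earth + [Species("alien", 250.0)]

# The goals agree on every situation seen before the unusual event...
print(original_goal(earth) == shifted_goal(earth))
# ...and diverge only once a more intelligent species appears.
print(original_goal(earth_after_invasion) == shifted_goal(earth_after_invasion))
```

No amount of behavioral testing on the pre-invasion world distinguishes the two functions, which is why this kind of misalignment can stay hidden.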
Observation: The most difficult shifts to predict are ones that occur many improvement iterations after the initial parameter modification.
Corollary: Alignment can suddenly break in a superintelligent agent as the result of a parameter modification in a far earlier, less capable iteration of the agent.
Example: An early version of an agent marginally reduces its reluctance to cause humans discomfort. This value does not shift any further for a large number of iteration steps. Later on, a vastly more intelligent version of the agent is deciding how to improve upon the utopia it has already created. This slightly reduced reluctance to cause humans discomfort contributes to its rationalization that humans would be better off if they were less comfortable all the time. It modifies itself further by significantly shifting its tolerance for human discomfort over the course of a few iterations. The agent now conceptualizes human well-being significantly differently than it did earlier, and is no longer properly aligned.
These examples might be a little ridiculous, but the observations and corollaries seem valid to me.
Thank you for reading this. I'm very new to the field of alignment, so if there is already another name and definition for this mechanism, I apologize; please let me know. I enthusiastically welcome all feedback and criticisms.