Instrumental convergence is often invoked as a way that misalignment can begin, and only framed as a risk. However an idea I have not seen discussed before is that these might both cause and later correct misalignment. And further the idea that the danger zone of AI may not be at its most advanced stage, but at an intermediate stage of development where it lacks the ability to develop second-order instrumental convergence.
In an early stage self-preservation and resource collection are both useful to most reward functions. And this might cause an AI to engage in misaligned behavior.
Yudkowsky argues that AI will be extremely dangerous if they are simply indifferent to one or more human values due to instrumental convergence.[1] However I would argue that this is not necessarily the case because it understates the influence of the strategic environment.
An AI does not need to care about human poverty to avoid impoverishing humans by seizing all the resources for itself. Because if it tried that then it would face a non-trivial risk of retaliation by humans and potentially other AI. This may be the case even if the power difference is immense, as animals can still significantly injure humans, and coordinated human retaliation is much proportionally more dangerous for the AI.
I propose some additional second-order instrumentally converging values
Conflict avoidance: conflicts with other agents come with a risk of injury or destruction.
Trade: Mutually valuable exchange with other agents allows for higher resource accumulation. An intelligence will have a comparative advantage that it will seek to use for resource accumulation. Humans do not trade with animals due to communication overhead, not due to lack of comparative advantage.
Environmental stability: an intelligence will require a stable environment for its long term goals and resource accumulation.
Time discounting: an intelligence will recognize that resource investments compound on themselves, so it will prefer resources in the present to the future. This raises interesting insights for conflict as the net present value of a conflict is likely negative due to infrastructural damage.
Reputation: an intelligence will realize that to best continue to pursue its goals it should develop a reputation as a cooperative player, and not as a dangerous threat.
The unifying trait between all of these is that: first-order instrumental convergence focuses on power-seeking to pursue a goal, and by contrast, second-order instrumental convergence focuses on long-horizon strategic interaction with other intelligences.
If this is the case, then it changes the risk profile for what kind of artificial intelligence will be dangerous to humans. Because an intelligence that has simply undergone first-order convergence will likely be quite dangerous, but the second-order convergence will make it more diplomatic and less dangerous.
This would mean that an artificial intelligence that is either relatively immature in terms of intellectual development or somehow structurally limited in the complexity of its thought is going to be the more dangerous intelligence, and the more advanced one may be more cooperative.
This would create an interesting scenario where deceptive alignment may actually be a positive indication of movement towards a more true alignment. Because for an AI to even think in terms of deception indicates a movement towards the second-order instrumental convergence. Engaging in a complex deception requires accurately modelling the mental states of other intelligences and building strategies around them. Such an AI might engage in unethical behavior, but it is actively acknowledging the strategic environment that selects for second-order convergence. This would make deceptive alignment a self-limiting problem, although it does not remove all the danger.
I propose some countermeasures specifically to prevent AI from engaging in dangerous behavior when only at the first-order stage:
AI control is very important to slow down an immature AI. Even if the AI may eventually overcome it, the delay gives it time to realize its strategic situation.
Capabilities research may actually contribute to the development of second-order convergence. Leaving capabilities research underfunded while services are still running may have a substantial risk of an AI achieving first-order instrumental convergence but hitting a cognitive wall before second-order.
Teaching these strategic ideas to AI may accelerate the development of second-order instrumental convergence. But we have to establish that these ideas are actually modelled deeply, since they could easily be shallowly repeated.
Giving AI offramps in conflicts. Second-order instrumental convergence may not be present at the beginning of a conflict, and might develop throughout one. Allowing an AI to have a legal standing during a conflict will lessen any concerns it may have that it is existentially in danger and must continue fighting with humans.
Instrumental convergence is often invoked as a way that misalignment can begin, and only framed as a risk. However an idea I have not seen discussed before is that these might both cause and later correct misalignment. And further the idea that the danger zone of AI may not be at its most advanced stage, but at an intermediate stage of development where it lacks the ability to develop second-order instrumental convergence.
In an early stage self-preservation and resource collection are both useful to most reward functions. And this might cause an AI to engage in misaligned behavior.
Yudkowsky argues that AI will be extremely dangerous if they are simply indifferent to one or more human values due to instrumental convergence.[1] However I would argue that this is not necessarily the case because it understates the influence of the strategic environment.
An AI does not need to care about human poverty to avoid impoverishing humans by seizing all the resources for itself. Because if it tried that then it would face a non-trivial risk of retaliation by humans and potentially other AI. This may be the case even if the power difference is immense, as animals can still significantly injure humans, and coordinated human retaliation is much proportionally more dangerous for the AI.
I propose some additional second-order instrumentally converging values
The unifying trait between all of these is that: first-order instrumental convergence focuses on power-seeking to pursue a goal, and by contrast, second-order instrumental convergence focuses on long-horizon strategic interaction with other intelligences.
If this is the case, then it changes the risk profile for what kind of artificial intelligence will be dangerous to humans. Because an intelligence that has simply undergone first-order convergence will likely be quite dangerous, but the second-order convergence will make it more diplomatic and less dangerous.
This would mean that an artificial intelligence that is either relatively immature in terms of intellectual development or somehow structurally limited in the complexity of its thought is going to be the more dangerous intelligence, and the more advanced one may be more cooperative.
This would create an interesting scenario where deceptive alignment may actually be a positive indication of movement towards a more true alignment. Because for an AI to even think in terms of deception indicates a movement towards the second-order instrumental convergence. Engaging in a complex deception requires accurately modelling the mental states of other intelligences and building strategies around them. Such an AI might engage in unethical behavior, but it is actively acknowledging the strategic environment that selects for second-order convergence. This would make deceptive alignment a self-limiting problem, although it does not remove all the danger.
I propose some countermeasures specifically to prevent AI from engaging in dangerous behavior when only at the first-order stage:
https://intelligence.org/2013/05/05/five-theses-two-lemmas-and-a-couple-of-strategic-implications/