Let's look again at Stuart Russell's quote:
> A system that is optimizing a function of $n$ variables, where the objective depends on a subset of size $k<n$, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.
It is not immediately obvious that this is true. It is likely true if we are maximising (or minimising) some of these variables, and this maximisation or minimisation brings them to unusual values that we wouldn't encounter "naturally". But things might be different if the variables need only be set to certain "plausible" values.
Suppose, for example, that an AI is building widgets, and that it is motivated to increase widget production $W$. It can choose among the following policies, listed roughly in order of increasing production:
- Policy 1: build widgets the conventional way.
- Policy 2: build widgets efficiently.
- Policy 3: introduce new innovative ways of building widgets.
- Policy 4: dominate the world's widget industry.
- Policy 5: take over the world, optimise the universe for widget production.
If the AI's goal is to maximise widget production without limit, then the fifth option becomes attractive to it. Even if it just wants to set production to a limited but very high value, it benefits from more control of the world. In short:
- The more unusual the values of the variables the AI wants to reach, the more benefit it gets from strong control over the universe.
But what if the AI was designed to set widget production to a specific moderate value, or to keep it within a moderate range? Then it would seem that it has lower incentive for control, and might just "do its job", the way we'd like it to; the other variables would not be set to extreme values, since there is no need for the AI to change things much.
Eliezer and Nick and others have made the point that this is still not safe, in posts that I can't currently find. They use examples like the AI taking over the world and building cameras to be sure that it has constructed exactly the right number of widgets. These scenarios seem extreme as intuition pumps, to some, so I thought it would be simpler to rephrase this as: moving the variance to unusual values.
Suppose that the AI was designed to keep widget production $W$ at a target value $w$. We could give it the utility function $-(W-w)^2$, for instance. Would it then stick to a policy that simply delivers $w$?
Now assume further that the world is not totally static. Random events happen, increasing or decreasing the production of widgets. If the AI follows policy $\pi$, then its expected reward is:
$$\mathbb{E}\left[-(W-w)^2 \mid \pi\right] = -\mathrm{Var}(W \mid \pi) - \left(\mathbb{E}[W \mid \pi] - w\right)^2.$$
The second term, $\left(\mathbb{E}[W \mid \pi] - w\right)^2$, the AI could control by "doing its job" and picking a human-safe policy. But it also wants to control the variance of $W$; specifically, it wants to lower it. Even more specifically, it wants to move that variance to a very low, highly unusual value.
So the previous problem appears again: it wants to move a variable, the variance of widget production, to a very unusual value. In the real world, this could translate to it building excess capacity, taking control of its supply chains, removing any humans that might get in the way, etc. Since "humans that might get in the way" would end up being most humans (few nations would tolerate a powerful AI limiting their power and potential), this tends towards the classic "take control of the world" scenario.
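To make the variance argument concrete, here is a minimal numerical sketch. It assumes a quadratic utility $-(W-w)^2$, Gaussian noise around each policy's mean output, and two hypothetical policies that both hit the target on average but differ in how tightly they control the world. The names `TARGET`, `simulate`, `expected_utility` and `decomposition`, and all the numbers, are illustrative assumptions rather than anything from the scenario above.

```python
import random
import statistics

TARGET = 1.0  # hypothetical target production level w


def simulate(mean, std, n=20_000, seed=0):
    """Draw n noisy production outcomes W (assumed Gaussian) for a policy."""
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(n)]


def expected_utility(samples, target):
    """Monte Carlo estimate of E[-(W - target)^2]."""
    return statistics.fmean([-(w - target) ** 2 for w in samples])


def decomposition(samples, target):
    """Compute -Var(W) - (E[W] - target)^2 from the same samples."""
    mean = statistics.fmean(samples)
    var = statistics.pvariance(samples, mu=mean)  # population variance
    return -var - (mean - target) ** 2


# Both hypothetical policies are centred on the target; only the spread differs.
do_the_job = simulate(mean=TARGET, std=0.5)     # ordinary, uncontrolled world
tight_control = simulate(mean=TARGET, std=0.05)  # AI has suppressed disturbances

for name, samples in [("do_the_job", do_the_job), ("tight_control", tight_control)]:
    print(name, expected_utility(samples, TARGET), decomposition(samples, TARGET))
```

For each policy the two printed numbers should agree up to floating-point error, and the low-variance policy should score higher even though both have the same mean. Under these assumptions, the only way left for the AI to gain utility is to shrink the variance.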
So, minimising or maximising a variable, or setting it to an unusual value, is dangerous, as it incentivises the AI to take control of the world to achieve those unusual values. But setting a variable to a usual value can also be dangerous, in an uncertain world, as it incentivises the AI to take control of the world to set the variability of that variable to unusually low levels.
Thanks to Rebecca Gorman for the conversation which helped me clarify these thoughts.
This is not a specific feature of using a square in the utility function. To incentivise the AI to set widget production $W$ to a target $w$, we need a function of $W$ that peaks at $w$. This makes it concave-ish around $w$, which is what penalises spread, uncertainty, and variance. ↩︎
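The footnote's claim can be checked numerically. The sketch below assumes Gaussian noise centred on the target and compares two utilities that peak at the target but are not squares: a negative absolute deviation and a Gaussian-shaped bump. The function names, the target value and the noise levels are all hypothetical choices for illustration.

```python
import math
import random
import statistics


def neg_abs(w, target):
    """Peaked utility: negative absolute deviation from the target."""
    return -abs(w - target)


def bump(w, target):
    """Peaked utility: a smooth bump with maximum 1 at the target."""
    return math.exp(-(w - target) ** 2)


def expected_peaked_utility(utility, std, target=1.0, n=20_000, seed=0):
    """Monte Carlo estimate of E[utility(W)] with W ~ Normal(target, std)."""
    rng = random.Random(seed)
    return statistics.fmean(
        [utility(rng.gauss(target, std), target) for _ in range(n)]
    )


for utility in (neg_abs, bump):
    wide = expected_peaked_utility(utility, std=0.5)
    narrow = expected_peaked_utility(utility, std=0.05)
    print(utility.__name__, "wide:", wide, "narrow:", narrow)
```

In both cases the narrow distribution should give the higher expected utility, so under these assumptions the incentive to reduce variance does not depend on the utility being quadratic.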