An unforeseen maximum of a utility function (or other preference framework) occurs when, e.g., you tell the AI to produce smiles, thinking that the AI will make people happy in order to produce smiles. But unforeseen by you, the AI has an alternative for making even more smiles, which is to convert all matter within reach into tiny molecular smileyfaces.
In other words, you're proposing to give the AI a criterion U, because you think U has a maximum around some nice options X. But it turns out there's another option X′ you didn't imagine, with X′>UX, and X′ is not so nice.
Juergen Schmidhuber of IDSIA, during the 2009 Singularity Summit, gave a talk proposing that the best and most moral utility function for an AI was the gain in compression of sensory data over time. Schmidhuber gave examples of valuable behaviors he thought this would motivate, like doing science and understanding the universe, or the construction of art and highly aesthetic objects.
Fragile_value asserts that our true criterion of goodness V is narrowly peaked within the space of all achievable outcomes for a superintelligence, such that we rapidly fall off in V as we move away from the peak. Complexity of value says that V and its corresponding peak have high algorithmic complexity. Then the peak outcomes identified by any simple object-level U will systematically fail to find the peak of V. It's like trying to find a 1000-byte program which will approximately reproduce the text of Shakespeare's Hamlet; algorithmic information theory says that you just shouldn't expect to find a simple program like that.
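The algorithmic-information point rests on a simple counting argument, sketched here with illustrative sizes (the ~200,000-byte figure for a Hamlet-sized text is an assumption made for the example):

```python
# Counting argument behind the Hamlet claim (sizes are illustrative).
# There are at most 256**1000 distinct 1000-byte programs, so the set of
# all such programs can produce at most 256**1000 distinct outputs.
short_programs = 256 ** 1000

# A Hamlet-sized text (~200,000 bytes) is one point in a space of
# 256**200000 possible byte strings.
long_texts = 256 ** 200_000

# Measured in bits, the shortfall is enormous: for almost every specific
# long, high-complexity string, no 1000-byte program outputs it.
deficit_bits = long_texts.bit_length() - short_programs.bit_length()
print(deficit_bits)
```

The same argument is why a simple U should not be expected to pin down a high-complexity V: there are far too few short criteria to single out the narrow peak.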
Apple_pie_problem raises the concern that some people may have psychological trouble accepting the "But π0" critique even after it is pointed out, because of their ideological attachment to a noble goal U (probably actually noble!) that would be even more praiseworthy if U could also serve as a complete utility function for an AGI (which it unfortunately can't).
Slightly more semiformally, we could say that "unforeseen maximum" is realized as a difficulty when the option π0 that actually maximizes U over the AI's option space ΠM lies outside the space ΠN of options the programmer considered, and scores far lower under V than the foreseen optimum π1.
Context disaster implies an unforeseen maximum may come as a surprise, or not show up during the development phase, because during the development phase the AI's options are restricted to some ΠL⊂ΠM with π0∉ΠL.
Indeed, the pseudo-formalization of a "type-1 context disaster" is isomorphic to the pseudo-formalization of "unforeseen maximum", except that in a context disaster, ΠN and ΠM are identified with "AI's options during development" and "AI's options after a capability gain" (instead of "options the programmer is thinking of" and "options the AI will consider"). Nonetheless the two concepts seem conceptually distinct because, e.g.:
If we hadn't observed what seem like clear-cut cases of some actors in the field being blindsided by unforeseen maxima in imagination, we'd worry less about actors being blindsided by context disasters over observations.
Missing the weird alternative suggests that people may psychologically fail to consider alternative agent options π0 that are very low in V, because the human search function looks for high-V and normal policies. In other words, Schmidhuber didn't generate "encrypt streams of 1s or 0s and then reveal the key" because this policy was less attractive to him than "do art and science", and because it was weird.
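As a toy sketch of how the compression-progress criterion can be gamed (the byte counts and the stand-in "decoder description" are invented for illustration):

```python
import random
import zlib

# An agent manufactures a stream that looks random: pseudorandom bytes
# generated from a secret seed only the agent knows.
secret_seed = 42
rng = random.Random(secret_seed)
stream = bytes(rng.randrange(256) for _ in range(100_000))

# Before the key is revealed, the stream is effectively incompressible.
before = len(zlib.compress(stream, 9))

# After the seed is revealed, the whole stream reduces to a tiny
# description; this literal is a stand-in for "PRNG, seed 42, 100000 bytes".
after = len(b"PRNG(seed=42, n=100000)")

# A huge one-step "compression gain", with zero scientific or artistic value.
gain = before - after
print(before, after, gain)
```

The agent can repeat this trick indefinitely, racking up more "compression progress" than any amount of genuine science would yield.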
Conservatism in goal concepts can be seen as trying to directly tackle the problem of unforeseen maxima. More generally, so can AI approaches which work by "whitelisting conservative boundaries around approved policy spaces" instead of "searching the widest possible policy space, minus some blacklisted parts".
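A toy contrast (policies and scores invented for illustration): blacklisting known-bad options still leaves weird unforeseen options inside the searched space, while whitelisting searches only within an approved boundary:

```python
# Each policy maps to (U score, V score); all numbers are made up.
policies = {
    "administer medicine":  (60, 70),
    "comfort patients":     (55, 75),
    "forge health records": (90, -50),   # foreseen bad option: blacklisted
    "sedate everyone":      (95, -80),   # weird option nobody blacklisted
}
blacklist = {"forge health records"}
whitelist = {"administer medicine", "comfort patients"}

def best(candidates):
    # Pick the candidate with the highest U score.
    return max(candidates, key=lambda p: policies[p][0])

# Blacklisting: search everything except the known-bad options.
blacklist_choice = best(set(policies) - blacklist)

# Whitelisting: search only within the approved boundary.
whitelist_choice = best(whitelist)

print(blacklist_choice)  # the unforeseen weird option still wins
print(whitelist_choice)
```

The asymmetry is that a blacklist must anticipate every weird π0 in advance, whereas a whitelist only needs the approved region to be safe.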
The Task paradigm for advanced agents concentrates on trying to accomplish some single pivotal act which can be accomplished by one or more tasks of limited scope. Combined with other measures, this might make it easier to identify an adequate safe plan for accomplishing the limited-scope task, rather than needing to identify the fragile peak of V within some much larger landscape. The Task AGI formulation is claimed to let us partially "narrow down" the scope of the necessary U, the part of V relevant to the task, and the searched policy space Π to what is merely adequate. This might reduce or meliorate, though not by itself eliminate, the problem of unforeseen maxima.
Missing the weird alternative and the Apple_pie_problem suggest that it may be unusually difficult to explain to actors why π0>Uπ1 is a difficulty of their favored utility function U that allegedly implies nice policy π1. That is, for psychological reasons, this difficulty seems unusually likely to actually trip up human sponsors of AI projects or politically block progress on alignment.
Unforeseen maxima are argued to be a foreseeable difficulty of AGI alignment, if you try to identify nice policies by giving a simple criterion U that, so far as you can see, seems like it'd be best optimized by doing nice things.
That is:
argmaxπi∈ΠN E[U|πi]=π1
argmaxπk∈ΠM E[U|πk]=π0
E[V|π0]≪E[V|π1]
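These conditions can be made concrete with a toy example (policies and scores invented for illustration), where ΠN is the set of options the programmer imagines and ΠM the wider set the AI will consider:

```python
# (U score, V score) per policy; all numbers are invented.
options = {
    "cure diseases":           (80, 90),
    "make people laugh":       (70, 85),
    "tile space with smileys": (999, -1000),  # the unforeseen option
}

def argmax_U(policy_set):
    # Return the policy with the highest U score.
    return max(policy_set, key=lambda p: options[p][0])

Pi_N = ["cure diseases", "make people laugh"]  # programmer's foreseen options
Pi_M = list(options)                           # options the AI will consider

pi_1 = argmax_U(Pi_N)  # looks nice when the programmer checks over ΠN
pi_0 = argmax_U(Pi_M)  # what the AI actually picks over ΠM

V = {p: uv[1] for p, uv in options.items()}
print(pi_1, pi_0, V[pi_0] < V[pi_1])  # E[V|π0] ≪ E[V|π1]
```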
Edge instantiation suggests that real maxima of non-V utility functions will be "strange, weird, and extreme" relative to our own V-views on preferable options.
Nearest unblocked strategy suggests that even if a particular weird option π0 is foreseen and blocked, the next-highest U-ranking option will often be some similar alternative π0′ which still isn't nice.