An unforeseen maximum of a utility function (or other preference framework) occurs when, e.g., you tell the AI to produce smiles, thinking that the AI will make people happy in order to produce smiles. But unforeseen by you, the AI has an alternative for making even more smiles, which is to convert all matter within reach into tiny molecular smileyfaces.
In other words, you're proposing to give the AI a criterion U, because you think U has a maximum around some nice options X. But it turns out there's another option X′ you didn't imagine, with X′>UX, and X′ is not so nice.
Juergen Schmidhuber of IDSIA, during the 2009 Singularity Summit, gave a talk proposing that the best and most moral utility function for an AI was the gain in compression of sensory data over time. Schmidhuber gave examples of valuable behaviors he thought this would motivate, like doing science and understanding the universe, or the construction of art and highly aesthetic objects.
Fragile_value asserts that our true criterion of goodness V is narrowly peaked within the space of all achievable outcomes for a superintelligence, such that we rapidly fall off in V as we move away from the peak. Complexity of value says that V and its corresponding peak have high algorithmic complexity. Then the peak outcomes identified by any simple object-level U will systematically fail to find the peak of V. It's like trying to find a 1000-byte program which will approximately reproduce the text of Shakespeare's Hamlet; algorithmic information theory says that you just shouldn't expect to find a simple program like that.
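The algorithmic-information point rests on a simple counting argument, sketched here with illustrative sizes (the ~200,000-byte figure for a Hamlet-sized text is an assumption made for the example):

```python
# Counting argument behind the Hamlet claim (sizes are illustrative).
# There are at most 256**1000 distinct 1000-byte programs, so the set of
# all such programs can produce at most 256**1000 distinct outputs.
short_programs = 256 ** 1000

# A Hamlet-sized text (~200,000 bytes) is one point in a space of
# 256**200000 possible byte strings.
long_texts = 256 ** 200_000

# Measured in bits, the shortfall is enormous: for almost every specific
# long, high-complexity string, no 1000-byte program outputs it.
deficit_bits = long_texts.bit_length() - short_programs.bit_length()
print(deficit_bits)
```

The same argument is why a simple U should not be expected to pin down a high-complexity V: there are far too few short criteria to single out the narrow peak.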
Apple_pie_problem raises the concern that some people may have psychological trouble accepting the "But π0" critique even after it is pointed out, because of their ideological attachment to a noble goal U (probably actually noble!) that would be even more praiseworthy if U could also serve as a complete utility function for an AGI (which it unfortunately can't).
Slightly more semiformally, we could say that "unforeseen maximum" is realized as a difficulty when the option π0 that actually maximizes U over the AI's option space ΠM lies outside the space ΠN of options the programmer considered, and scores far lower under V than the foreseen optimum π1.
Context disaster implies an unforeseen maximum may come as a surprise, or not show up during the development phase, because during the development phase the AI's options are restricted to some ΠL⊂ΠM with π0∉ΠL.
Indeed, the pseudo-formalization of a "type-1 context disaster" is isomorphic to the pseudo-formalization of "unforeseen maximum", except that in a context disaster, ΠN and ΠM are identified with "AI's options during development" and "AI's options after a capability gain" (instead of "options the programmer is thinking of" and "options the AI will consider"). Nonetheless the two concepts seem conceptually distinct because, e.g.:
If we hadn't observed what seem like clear-cut cases of some actors in the field being blindsided by unforeseen maxima in imagination, we'd worry less about actors being blindsided by context disasters over observations.
Missing the weird alternative suggests that people may psychologically fail to consider alternative agent options π0 that are very low in V, because the human search function looks for high-V and normal policies. In other words, Schmidhuber didn't generate "encrypt streams of 1s or 0s and then reveal the key" because this policy was less attractive to him than "do art and science", and because it was weird.
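As a toy sketch of how the compression-progress criterion can be gamed (the byte counts and the stand-in "decoder description" are invented for illustration):

```python
import random
import zlib

# An agent manufactures a stream that looks random: pseudorandom bytes
# generated from a secret seed only the agent knows.
secret_seed = 42
rng = random.Random(secret_seed)
stream = bytes(rng.randrange(256) for _ in range(100_000))

# Before the key is revealed, the stream is effectively incompressible.
before = len(zlib.compress(stream, 9))

# After the seed is revealed, the whole stream reduces to a tiny
# description; this literal is a stand-in for "PRNG, seed 42, 100000 bytes".
after = len(b"PRNG(seed=42, n=100000)")

# A huge one-step "compression gain", with zero scientific or artistic value.
gain = before - after
print(before, after, gain)
```

The agent can repeat this trick indefinitely, racking up more "compression progress" than any amount of genuine science would yield.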
Conservatism in goal concepts can be seen as trying to directly tackle the problem of unforeseen maxima. More generally, so can AI approaches which work by "whitelisting conservative boundaries around approved policy spaces" instead of "searching the widest possible policy space, minus some blacklisted parts".
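A toy contrast (policies and scores invented for illustration): blacklisting known-bad options still leaves weird unforeseen options inside the searched space, while whitelisting searches only within an approved boundary:

```python
# Each policy maps to (U score, V score); all numbers are made up.
policies = {
    "administer medicine":  (60, 70),
    "comfort patients":     (55, 75),
    "forge health records": (90, -50),   # foreseen bad option: blacklisted
    "sedate everyone":      (95, -80),   # weird option nobody blacklisted
}
blacklist = {"forge health records"}
whitelist = {"administer medicine", "comfort patients"}

def best(candidates):
    # Pick the candidate with the highest U score.
    return max(candidates, key=lambda p: policies[p][0])

# Blacklisting: search everything except the known-bad options.
blacklist_choice = best(set(policies) - blacklist)

# Whitelisting: search only within the approved boundary.
whitelist_choice = best(whitelist)

print(blacklist_choice)  # the unforeseen weird option still wins
print(whitelist_choice)
```

The asymmetry is that a blacklist must anticipate every weird π0 in advance, whereas a whitelist only needs the approved region to be safe.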
The Task paradigm for advanced agents concentrates on trying to accomplish some single pivotal act which can be accomplished by one or more tasks of limited scope. Combined with other measures, this might make it easier to identify an adequate safe plan for accomplishing the limited-scope task, rather than needing to identify the fragile peak of V within some much larger landscape. The Task AGI formulation is claimed to let us partially "narrow down" the scope of the necessary U, the part of V relevant to the task, and the searched policy space Π to what is merely adequate. This might reduce or meliorate, though not by itself eliminate, the problem of unforeseen maxima.
Missing the weird alternative and the Apple_pie_problem suggest that it may be unusually difficult to explain to actors why π0>Uπ1 is a difficulty of their favored utility function U that allegedly implies nice policy π1. That is, for psychological reasons, this difficulty seems unusually likely to actually trip up human sponsors of AI projects or politically block progress on alignment.
Unforeseen maxima are argued to be a foreseeable difficulty of AGI alignment, if you try to identify nice policies by giving a simple criterion U that, so far as you can see, seems like it'd be best optimized by doing nice things.
That is:
argmaxπi∈ΠN E[U|πi]=π1
argmaxπk∈ΠM E[U|πk]=π0
E[V|π0]≪E[V|π1]
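These conditions can be made concrete with a toy example (policies and scores invented for illustration), where ΠN is the set of options the programmer imagines and ΠM the wider set the AI will consider:

```python
# (U score, V score) per policy; all numbers are invented.
options = {
    "cure diseases":           (80, 90),
    "make people laugh":       (70, 85),
    "tile space with smileys": (999, -1000),  # the unforeseen option
}

def argmax_U(policy_set):
    # Return the policy with the highest U score.
    return max(policy_set, key=lambda p: options[p][0])

Pi_N = ["cure diseases", "make people laugh"]  # programmer's foreseen options
Pi_M = list(options)                           # options the AI will consider

pi_1 = argmax_U(Pi_N)  # looks nice when the programmer checks over ΠN
pi_0 = argmax_U(Pi_M)  # what the AI actually picks over ΠM

V = {p: uv[1] for p, uv in options.items()}
print(pi_1, pi_0, V[pi_0] < V[pi_1])  # E[V|π0] ≪ E[V|π1]
```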
Edge instantiation suggests that real maxima of non-V utility functions will be "strange, weird, and extreme" relative to our own V-views on preferable options.
Nearest unblocked strategy suggests that even if a particular weird option π0 is foreseen and blocked, the next-highest U-ranking option will often be some similar alternative π0′ which still isn't nice.