Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This work originated at MIRI Summer Fellows and originally involved Pasha Kamyshev, Dan Keys, Johnathan Lee, Anna Salamon, Girish Sastry, and Zachary Vance. I was asked to look over two drafts and some notes, clean them up, and post here. Especial thanks to Zak and Pasha for drafts on which this was based.

We discuss the issues with expected utility maximizers, posit the possibility of normalizers, and list some desiderata for normalizers.

This post, which explains background and desiderata, is a companion post to Three Alternatives to Utility Maximizers. The other post surveys some other "-izers" that came out of the MSFP session and gives a sketch of the math behind each while going through the relevant intuitions.


##Background

The naive implementation of an expected utility maximizer involves looking at every possible action - of which there is generally an intractable number - and, at each, evaluating a black box utility function. Even if we could somehow implement such an agent (say, through access to a halting oracle), it would tend towards extreme solutions. Given a function like "maximize paperclips," such an agent would convert its entire future light cone into the cheapest object that satisfies whatever its computational definition of "paperclip" is.
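To make the structure concrete, here is a minimal sketch of that naive architecture in Python. Everything in it - the action set, the world model's outcome distribution, and the utility function - is a hypothetical placeholder; the point is only the brute-force argmax over actions.

```python
# A minimal sketch of the naive expected utility maximizer described above.
# `possible_actions`, `outcome_distribution`, and `utility` are hypothetical
# placeholders; in practice the action space is intractably large and the
# utility function is an opaque black box.

def naive_expected_utility_maximizer(possible_actions, outcome_distribution, utility):
    """Enumerate every action, score it by expected utility under the agent's
    world model, and return the argmax."""
    def expected_utility(action):
        # outcome_distribution(action) yields (outcome, probability) pairs
        return sum(p * utility(outcome) for outcome, p in outcome_distribution(action))

    return max(possible_actions, key=expected_utility)
```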

This makes errors in goal specification extremely costly.

Given a utility function which naively seems acceptable, the agent will do something which, by our standards, is completely insane[^fn-smiley]. Even in the paperclip example, the "paperclips" that the agent produces are unlikely to be labeled as paperclips by a human.

If a human wanted to maximize paperclips, they would not, in general, attempt to convert their entire future light cone into paperclips. They might fail to manufacture very many paperclips, but their actions would seem much more “normal” to us than those of the true expected utility maximizer above, and we would expect a poor goal specification to be less likely to kill us.

##Normalizers and Desiderata

We consider a normalizer to be an agent whose actions, given a flawed utility function, are still considered sane or normal by a human observer. This definition is extremely vague, so we propose a number of desiderata for normalizers. Note that these desiderata may not be simultaneously satisfiable; furthermore, while we would like to claim that they are necessary, we do not think that they are sufficient for an agent to be a normalizer.

####Desideratum 1: High Value

The normalizer should end up winning (with respect to its utility function). Even though it may fail to fully maximize utility, its action should not leave huge amounts of utility unrealized[^fn-waste].

As part of this desideratum, we would also like the normalizer to win at low power levels. This filters out uncomputable solutions that cannot be approximated computably.

Note that a paperclip maximizer satisfies this desideratum if we remove the computability considerations.

Positive Example: A normalizer spends many years figuring out how to build the correct utility maximizer and then does so.

Negative Example 1: A human tries to optimize the medical process, and makes significant progress, but everyone still dies.

Negative Example 2: A meliorizer[^fn-meli] attempts to save earth from a supervolcano. Starting with the default action "do nothing", it switches to the first policy it finds, which happens to be "save everyone in NYC and call it a day", but fails to find the strictly better strategy "save everyone on earth"[^fn-meli-fail].
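For concreteness, here is a minimal sketch of the meliorizer pattern this example refers to (see the footnote for the original presentation). The candidate policies and the utility estimate are hypothetical placeholders; note how the loop stops at the first strict improvement it finds, which is exactly how it can settle on "save everyone in NYC" without ever considering "save everyone on Earth".

```python
# A minimal meliorizer sketch: keep a default policy, and switch to the first
# candidate policy estimated to do strictly better. `candidate_policies` and
# `estimated_utility` are hypothetical placeholders.

def meliorize(default_policy, candidate_policies, estimated_utility):
    current = default_policy
    for candidate in candidate_policies:                      # search order matters
        if estimated_utility(candidate) > estimated_utility(current):
            return candidate                                  # adopt the first strict improvement found
    return current                                            # search failed: keep the default
```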

####Desideratum 2: Sanity at High Power

A normalizer should take sane actions regardless of how much computational power it has. Given more power, it should do better. It should not go from being safe at, say, a human level of power to being unsafe at superhuman levels.

This desideratum can only be satisfied by changing the algorithm the agent is running, or by succeeding at AI boxing.

Positive Example: An agent is programmed to maximize train efficiency using a suggester-verifier architecture; however, the verifier is programmed to only accept the default rearrangement[^fn-rug].
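A minimal sketch of what that suggester-verifier setup might look like, with all names illustrative: the suggester can propose anything, but the verifier only ever accepts the default rearrangement, so extra suggester power cannot make the enacted behavior less safe.

```python
# Illustrative sketch of the suggester-verifier example above. The verifier is
# deliberately restricted: whatever the suggester proposes, only the default
# rearrangement can ever be enacted. All names are placeholders.

DEFAULT_REARRANGEMENT = "default-train-schedule"

def verifier(proposal):
    """Accept a proposal only if it is the default rearrangement."""
    return proposal == DEFAULT_REARRANGEMENT

def run_agent(suggester):
    proposal = suggester()               # arbitrarily clever proposal generation
    if verifier(proposal):
        return proposal                  # enact only verified proposals
    return DEFAULT_REARRANGEMENT         # otherwise fall back to the default
```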

Negative Example: At around human level, an agent told to maximize human happiness gets a job at a charity, helps little old ladies, and donates all its money to effective causes. Once it reaches superhuman level, it turns all the matter in its light cone into molecular smiley faces.

####Desideratum 3: Non-Extreme

The -izer should not come up with solutions that seem extreme to humans, especially not after explanation. This one is a bit controversial and hard to formalize, but we think there’s some important intuition here.

####Desideratum 4: Robust to belief ontology shifts

The -izer does about what you want if it has a ‘similar’ ontology to yours.

Positive Example: After it turns out that string theory is correct and not atomic theory, the -izer can still recognize my mother and offer her ice cream.

####Desideratum 5: Robust to utility function differences

The -izer does about what you want with a ‘similar’ utility function to yours.

####Desideratum 6: Satisfies many utility functions

If there are a number of people, the -izer does a good job of satisfying what they all want. If it has uncertainty in its world model or moral model, it doesn’t just pick a world; it instead tries to do well across all the possible worlds.

Positive Example: If it’s unsure whether dolphins are people, it comes up with a solution that doesn’t involve dolphins going extinct.
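One crude way to gesture at this in code is a maximin rule under moral uncertainty: rather than committing to a single moral hypothesis (say, "dolphins are not people") and optimizing it, score each action under every candidate utility function and pick the action whose worst case is best. This is only an illustrative sketch - the aggregation rule and both arguments are placeholders, not a claim about the right way to combine utility functions.

```python
# Illustrative maximin sketch for acting under moral/model uncertainty:
# choose the action whose worst-case score across all candidate utility
# functions is highest. Both arguments are hypothetical placeholders.

def robust_choice(actions, candidate_utilities):
    return max(actions, key=lambda a: min(u(a) for u in candidate_utilities))
```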

####Desideratum 7: Noticing confusion

We couldn’t formalize this one well, so look for things in this space instead of paying attention to our specific description. The -izer detects if it’s about to go “crazy” - if it begins suggesting immoral actions to itself, or actions that otherwise violate its standard guesses as to how sensible action probably ought to work. It notices if it has something like instrumentally wrong priors.

Positive Example: The agent has several sub-modules. If they disagree on predictions by orders of magnitude, it doesn’t use that prediction in other calculations.

Negative Example: An AIXI-like expected utility maximizer, programmed with the universal prior, will assign probability 0 to hypercomputation instead of correcting its prior.
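A minimal sketch of the positive example, with the threshold and aggregation chosen arbitrarily: each sub-module produces a prediction, and if the predictions span more than an order of magnitude the agent flags the result as confused rather than feeding it into other calculations.

```python
import math

# Sketch of the "notice confusion" positive example: if sub-modules disagree
# on a (positive-valued) prediction by more than an order of magnitude, flag
# confusion and refuse to use the prediction downstream. The threshold and the
# averaging rule are arbitrary placeholders.

def aggregate_or_flag(predictions, max_log10_spread=1.0):
    logs = [math.log10(p) for p in predictions]
    if max(logs) - min(logs) > max_log10_spread:
        return None                                  # confused: don't use this prediction
    return sum(predictions) / len(predictions)       # otherwise return the averaged prediction
```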

####Desideratum 8: Transparency

We can understand roughly what the -izer will do in advance.

Negative Example: We’re not sure we want to turn it on.

####Desideratum 9: Transparency under self-modification

We can understand roughly what the -izer will do in advance, but without Cartesian boundaries.

####Non-Desideratum 10: VNM Axioms

We think it would be interesting to look at whether an -izer obeys the VNM axioms; if it does not, we would want it to have good reasons for violating them.
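For reference, the axioms in question, stated for a preference relation ⪰ over lotteries A, B, and C (the VNM theorem says preferences satisfy all four exactly when they can be represented as maximizing the expectation of some utility function):

```latex
% The von Neumann-Morgenstern axioms for a preference relation \succeq over lotteries.
\begin{align*}
&\text{Completeness:}  && A \succeq B \ \text{or}\ B \succeq A \\
&\text{Transitivity:}  && A \succeq B \ \text{and}\ B \succeq C \implies A \succeq C \\
&\text{Continuity:}    && A \succeq B \succeq C \implies \exists\, p \in [0,1]:\ pA + (1-p)C \sim B \\
&\text{Independence:}  && A \succeq B \implies pA + (1-p)C \succeq pB + (1-p)C \quad \text{for all } p \in (0,1] \text{ and all } C
\end{align*}
```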

[^fn-smiley]: The standard example given here is of an agent which is told to maximize "human happiness" and ends up tiling the universe with smiley faces. More in-depth discussion can be found at Bill Hibbard's site or in Soares' "The Value Learning Problem".

[^fn-waste]: Bostrom calls huge amounts of wasted utility astronomical waste. In "Astronomical Waste: The Opportunity Cost of Delayed Technological Development", he argues that "if the goal of speed conflicts with the goal of global safety, the total utilitarian should always opt to maximize safety."

[^fn-meli]: A meliorizer is an agent which has a default policy, but searches for a higher-utility policy. If it finds such a policy, it switches to using it. If the search fails, it continues using the default policy. See section 8 of the Tiling Agents draft for the original presentation and some more discussion. See Soares' "Tiling Agents in Causal Graphs" for one formalization, specifically that of suggester-verifiers with a fallback policy.

[^fn-meli-fail]: An overzealous meliorizer which finds the higher-utility (as measured by its utility function) strategy "destroy the Earth so that 100% of people on Earth are vacuously saved" fulfills the first desideratum, but fails to be a normalizer.

[^fn-rug]: We assume that we have prevented the suggester from affecting the environment in ways that bypass the verifier (such as by using the physical processes which are its computations). This is probably equivalent in the limit to the problem of AI boxing.
