(Originally posted at https://plus.google.com/+ThomasColthurst/posts/5fcw5wgVrpj in 2013)

I love the premise of Peter de Blanc's "Ontological Crises in Artificial Agents' Value Systems":  you are a robot who used to think the universe was described by some set of states O1, only to wake up and find out that the universe is better described by some set of states O2.  What should you do?

Specifically, let's say that the robot used to act so as to maximize a utility function U: O1 -> R.  The problem is to construct a new utility function U': O2 -> R.

Briefly, here's the solution Peter de Blanc gives in the paper:  assume you have some side information about how probability distributions over O1 and O2 evolve when the robot takes different actions.  Use that information to construct a mapping phi: O2 -> O1, and then make U' by composing phi and U:  U'(x) = U(phi(x)).

I have two quibbles with that approach, one big and one small.  The small quibble is that I think it makes more sense to treat phi as a given than as something you need to construct.  If we think of the robot as a scientist doing experiments and such, then almost any evidence that the robot gathers that would cause it to change its ontology to O2 would also strongly suggest a mapping between O1 and O2 at the same time.  Think of the special theory of relativity, for instance:  it didn't just change the space of relative velocities from R to [-c, c], it also described how to compose velocities in the new space, and how to turn accelerations into new velocities.  (See http://math.ucr.edu/home/baez/physics/Relativity/SR/rocket.html for an example of the latter.)
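To make the relativity analogy concrete, here is a small Python sketch of the standard composition rule for collinear velocities, w = (u + v) / (1 + uv/c^2).  The function name and the choice of units with c = 1 are mine, not anything from the paper.

```python
def compose_velocities(u, v, c=1.0):
    """Relativistic composition of two collinear velocities.

    Units default to c = 1; in the old ontology the answer would be u + v.
    """
    return (u + v) / (1.0 + u * v / (c * c))

half_plus_half = compose_velocities(0.5, 0.5)
print(half_plus_half)                  # 0.8, where Newtonian addition gives 1.0
print(compose_velocities(0.9, 0.9))    # still strictly below c
```

The new theory comes with its own rule for combining states, which is the sense in which the evidence for O2 tends to arrive bundled with a way of relating it to O1.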

The big quibble is that while U'(x) = U(phi(x)) "type-checks", it probably isn't the utility function the robot really wants.  Think of a simple example where O1 is [0,9], O2 is [0,99], phi(x) = floor(x/10), and U(x) = x.  Basically, your universe is a line, and you want to get to one end of it, but now you can see more clearly that there are ten times as many positions as you used to think.  U'(x) = U(phi(x)) suffers from aliasing in the signal processing sense; it is a staircase function with jumps whenever x mod 10 = 0.
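The staircase is easy to see by tabulating the example.  A minimal Python sketch, assuming O1 and O2 are the integer ranges 0..9 and 0..99 (the function names are just illustrative):

```python
# Toy example from the text, with O1 = {0..9} and O2 = {0..99} (assumed integers).

def U(x):
    """Old utility on O1: further along the line is better."""
    return x

def phi(y):
    """Map a fine-grained state in O2 to its coarse state in O1."""
    return y // 10

def U_composed(y):
    """Translated utility U'(y) = U(phi(y))."""
    return U(phi(y))

values = [U_composed(y) for y in range(100)]

# Positions where U' changes between neighbouring fine-grained states.
jumps = [y for y in range(1, 100) if values[y] != values[y - 1]]

print(values[:12])  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
print(jumps)        # [10, 20, 30, 40, 50, 60, 70, 80, 90]
```

Within each block of ten states the robot is indifferent, and all of its preference is concentrated at the block boundaries.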

What we want is some anti-aliasing:  a smooth interpolation of U, like U'(x) = x / 10.  But how can we (in general) construct something like that?  We could try the various signal processing tricks, like removing high frequency components, but those only work when we can naturally embed O1 and O2 into some space like R^n, and we can't always do that.

A better thing to do would be to do "Bayesian anti-aliasing":  define a prior over all utility functions, and then pick the utility function U' that maximizes the prior, subject to the constraint that the average of U'(y) over all y such that phi(y) = x equals U(x).  [Bayesian anti-aliasing is also the right way to do anti-aliasing for signal processing, but that is a lecture for another day.]
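Here is one way this could look on the toy line example, as a rough Python sketch.  The prior is an assumption of mine: a Gaussian smoothness prior proportional to exp(-sum of squared differences between neighbouring states).  The solver (projected gradient descent) is likewise just a convenient choice, not anything from the paper.

```python
# Sketch: MAP estimate of U' under an ASSUMED Gaussian smoothness prior,
#   prior(U') proportional to exp(-sum_i (U'[i+1] - U'[i])**2),
# subject to: the mean of U' over each preimage of x under phi equals U(x).

N_COARSE, BLOCK = 10, 10           # |O1| = 10; each coarse state has 10 preimages
N_FINE = N_COARSE * BLOCK

def U(x):
    return x

def roughness(u):
    """Negative log-prior up to a constant: sum of squared neighbour gaps."""
    return sum((u[i + 1] - u[i]) ** 2 for i in range(len(u) - 1))

def gradient(u):
    """Gradient of roughness(u) with respect to each entry of u."""
    g = [0.0] * len(u)
    for i in range(len(u) - 1):
        d = 2.0 * (u[i + 1] - u[i])
        g[i] -= d
        g[i + 1] += d
    return g

def project(u):
    """Shift each block so its mean equals U(x).  Because the blocks are
    disjoint, this is the orthogonal projection onto the constraint set."""
    out = list(u)
    for x in range(N_COARSE):
        block = range(x * BLOCK, (x + 1) * BLOCK)
        shift = U(x) - sum(out[i] for i in block) / BLOCK
        for i in block:
            out[i] += shift
    return out

def anti_alias(steps=2000, step_size=0.125):
    u = [float(U(y // BLOCK)) for y in range(N_FINE)]   # start from the staircase
    for _ in range(steps):
        g = gradient(u)
        u = project([ui - step_size * gi for ui, gi in zip(u, g)])
    return u

staircase = [float(U(y // BLOCK)) for y in range(N_FINE)]
U_prime = anti_alias()
print(roughness(staircase), roughness(U_prime))
```

One thing the constraint makes visible:  the line x / 10 does not satisfy it, since its average over each block is U(x) + 0.45, whereas the shifted line (x - 4.5) / 10 does.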

As always with Bayesian solutions, there is the question of where the prior comes from.  The answer has a name, "meta-preferences", but that's only a name; it doesn't tell us how to construct such a thing.  Here are some examples of considerations that might go into a meta-preference:


1) Smoothness -- smooth utility functions are preferred over sharp ones.  This could be formalized along the lines of "if x and y are nearby states in the sense that one or more actions of the robot transform probability distributions with lots of mass on x to/from probability distributions with lots of mass on y, then |U(x) - U(y)| should be small."

2) Symmetry.  If there is some group G that acts on O in a way that is compatible with the way that the robot's actions act on O, then utility functions that are invariant under G should be highly preferred.

3) Simplicity.  We should prefer utility functions that can be computed by short and fast computer programs.  This requires that states in O be represented by digital strings, but that's not really a restriction given that we are discussing a robot's ontological crisis, after all.
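As a rough illustration of how the first two considerations could be turned into a prior, here is a Python sketch on a finite state space.  Everything in it (the function names, the squared-difference penalties, the weights) is a hypothetical choice of mine; simplicity is left out because program length is not something a few lines can compute.

```python
# Hypothetical scoring of candidate utility functions on a finite state space.
# All names, penalties, and weights here are illustrative assumptions.

def smoothness_penalty(u, neighbours):
    """Sum of squared utility gaps across pairs of states that the robot's
    actions connect (the pairs are supplied, not derived, in this sketch)."""
    return sum((u[x] - u[y]) ** 2 for x, y in neighbours)

def symmetry_penalty(u, group_actions):
    """How far u is from being invariant under each given permutation g."""
    return sum((u[x] - u[g[x]]) ** 2 for g in group_actions for x in u)

def log_prior(u, neighbours, group_actions, w_smooth=1.0, w_sym=1.0):
    """Unnormalised log-prior: higher means more preferred."""
    return -(w_smooth * smoothness_penalty(u, neighbours)
             + w_sym * symmetry_penalty(u, group_actions))

# Four states on a cycle; the group is generated by the reflection 1 <-> 3.
neighbours = [(0, 1), (1, 2), (2, 3), (3, 0)]
reflection = {0: 0, 1: 3, 2: 2, 3: 1}

symmetric = {0: 0.0, 1: 1.0, 2: 2.0, 3: 1.0}
lopsided = {0: 0.0, 1: 1.0, 2: 2.0, 3: 3.0}

print(log_prior(symmetric, neighbours, [reflection]))  # -4.0
print(log_prior(lopsided, neighbours, [reflection]))   # -20.0
```

The symmetric candidate scores higher on both counts:  it has no large gap between neighbouring states, and it is unchanged by the reflection.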

These are all more-or-less inspired by the mathematics of meta-preferences, but I want to emphasize that a mathematical approach will only take you so far.  If you are designing a robot that will undergo ontological crises, it is your job to give your robot both a good initial utility function and a good set of meta-preferences.  (The word "good" in the previous sentence is being used in basically all of its senses, from "morally upright" to "well constructed".)

For example, let's say you program your robot to be risk-averse about some resource X, so that it prefers gaining an amount x of X to a 0.5 chance of gaining 2x instead.  If your robot later discovers a resource Y which it previously knew nothing about, then nothing is going to make it risk-averse about Y except meta-preferences and/or knowledge it may or may not acquire about the fungibility of X and Y.  Just to be clear, there is no predetermined right or wrong answer about how risk-averse to be about Y, but unless you consider such things, you run the real risk of your robot not behaving in the ways you expect after its ontological crisis.
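A small Python sketch of the difference, assuming (purely for illustration) that risk aversion about X comes from a square-root utility, and that the robot would otherwise value Y linearly:

```python
import math

def expected_utility(u, lottery):
    """lottery is a list of (probability, amount) pairs."""
    return sum(p * u(amount) for p, amount in lottery)

def concave(amount):
    """Assumed utility over X: diminishing returns, hence risk aversion."""
    return math.sqrt(amount)

def linear(amount):
    """One possible utility over Y if no meta-preference says otherwise."""
    return amount

x = 4.0
sure_thing = [(1.0, x)]
gamble = [(0.5, 2 * x), (0.5, 0.0)]

# Risk-averse about X: the sure amount beats the gamble.
print(expected_utility(concave, sure_thing) > expected_utility(concave, gamble))  # True
# With a linear utility the two are valued equally, so there is no risk aversion.
print(expected_utility(linear, sure_thing) == expected_utility(linear, gamble))   # True
```

Nothing in the utility over X forces the shape of the utility over Y; that has to come from somewhere else.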

For an example of how not to program a robot, consider evolution.  It has given me a perfectly serviceable utility function, filled with such lovely components as "eat sugar" and "fear spiders".  Evolution has also given me a brain that is mostly convinced that the universe is better described as a quantum wave-function than as something like (boolean:  sugar in mouth) x (integer: # of spiders visible).  But what evolution hasn't given me is an adequate set of meta-preferences that would let me translate my built-in utility function into one over wave-functions.

(Or actually, now that I think about it, the problem is probably with my meta-meta-preferences:  I know perfectly well how to translate my simple utility function to the more complicated universe, but my meta-meta-preferences keep on insisting that my translation is wrong.)

I should end by pointing out that this isn't an entirely novel approach to treating such questions; it is very similar to what Daniel Dewey proposes in http://www.danieldewey.net/learning-what-to-value.pdf, for example.


1 comment:

U’(x) = U(phi(x)) suffers from aliasing in the signal processing sense; it is a staircase function with jumps whenever x mod 10 = 0.

While the original paper isn't super explicit about this, phi is a stochastic map so the only way to define U' is as an average over the distribution phi(x); therefore, it need not jump like this (phi can smoothly change the probability distribution, leading to smooth changes in U').
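To illustrate the point on the toy example, here is a Python sketch with one possible stochastic phi.  The particular map (splitting probability between the two nearest coarse states) is a made-up assumption for illustration, not the construction from the paper.

```python
from fractions import Fraction

def U(x):
    return x

def phi_stochastic(y):
    """Hypothetical stochastic map from O2 to distributions over O1: split
    the probability mass between the two nearest coarse states."""
    k, r = divmod(y, 10)
    upper = min(k + 1, 9)               # clamp at the top of O1
    if upper == k:
        return {k: Fraction(1)}
    weight = Fraction(r, 10)
    return {k: 1 - weight, upper: weight}

def U_averaged(y):
    """U'(y) = expected value of U under the distribution phi(y)."""
    return sum(p * U(x) for x, p in phi_stochastic(y).items())

print([float(U_averaged(y)) for y in (0, 5, 10, 25)])  # [0.0, 0.5, 1.0, 2.5]
```

With this choice U'(y) = y / 10 up to y = 90 and then stays at 9, so neighbouring states never differ by more than 1/10.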
