A model of UDT with a concrete prior over logical statements



I've been having difficulties with constructing a toy scenario for AI self-modification more interesting than Quirrell's game, because you really want to do expected utility maximization of some sort, but currently our best-specified decision theories search through the theorems of one particular proof system and "break down and cry" if they can't find one that tells them what their utility will be if they choose a particular option. This is fine if the problems are simple enough that we always find the theorems we need, but the AI rewrite problem is precisely about skirting that edge. It seems natural to want to choose some probability distribution over the possibilities that you can't rule out, and then do expected utility maximization (because if you don't maximize EU over some prior, it seems likely that someone could Dutch-book you); indeed, Wei Dai's original UDT has a "mathematical intuition module" black box which this would be an implementation of. But how do you assign probabilities to logical statements? What consistency conditions do you ask for? What are the "impossible possible worlds" that make up your probability space?

Recently, Wei Dai suggested that logical uncertainty might help avoid the Löbian problems with AI self-modification, and although I'm sceptical about this idea, the discussion pushed me into trying to confront the logical uncertainty problem head-on; then, reading Haim Gaifman's paper "Reasoning with limited resources and assigning probabilities to logical statements" (which Luke linked from So you want to save the world) made something click. I want to present a simple suggestion for a concrete definition of "impossible possible world", for a prior over them, and for an UDT algorithm based on that. I'm not sure whether the concrete prior is useful—the main point in giving it is to have a concrete example we can try to prove things about—but the definition of logical possible worlds looks like a promising theoretical tool to me.


Let S be the set of sentences of Peano Arithmetic less than 3^^^3 symbols long, and let Pow(S) := the set of all subsets of S. We interpret elements X of Pow(S) as "logical worlds", in which the sentences in X are true and the sentences in (S \ X) are false. Each world X can be represented by a single sentence sX, which is the conjunction of the sentences ({x | x in X} union {not y | y in S \ X}). Now, we exclude those X that are proven contradictory by one of the first 4^^^^4 theorems of PA; that is, we define the set W of possible worlds to be

    W := {X in Pow(S) | "not sX" is not among the first 4^^^^4 theorems of PA}.

W is our probability space. Note that it's finite. For lack of a better idea, I propose to use the uniform prior on it. (Possibly useful: another way to think about this is that we choose the uniform prior on Pow(S), and after looking at the 4^^^^4 theorems, we do a Bayesian update.)

Individual sentences in S induce events over this probability space: a sentence x corresponds to the event {X in Pow(S) | x in X}. Clearly, all of the above can be carried out by a computable function, and in particular we can write a computable function P(.) which takes a statement in S and returns the probability of the corresponding event (and the source of this function can be short, i.e., it doesn't need to contain any 3^^^3-sized lookup tables).


The decision algorithm makes use of two global variables which specify the problem.

  • actions is a Python dictionary that maps possible inputs to the algorithm to a list of outputs the algorithm is allowed to return when receiving that input. For example, in the problem from Wei Dai's UDT1.1 post, actions = {1: ['A','B'], 2: ['A','B']}, meaning that your input is either '1' or '2', and in both cases you may choose between options 'A' and 'B'. We'll assume there's nothing 3^^^3-sized in actions.
  • worlds is a list of triples of the form (p,U,us), representing possible physical worlds the agent might find itself in, where p is the probability of being in that world, U is the source of a function that computes and returns the agent's utility if that world is the true one (by simulating that world and running a utility function over the result), and us is a list of values U might return. The probabilities must add to 1. We'll assume that there's nothing 3^^^3-sized here either, and that it's provable in much less that 4^^^^4 steps that if the decision algorithm halts on all inputs specified by actions and returns one of the allowable actions, then each U will halt and return a value in the corresponding us. (The reason for the condition is that the functions U may contain copies of the agent's source, and may make calls to the agent, so if the agent didn't halt, neither could U.)

With these provisions, the algorithm, UDT(input), proceeds as follows:

  1. Compute mappings, a list of all dictionaries that maps each possible input from actions to one of the allowable outputs. (In the earlier example, mappings = [{1:'A',2:'A'}, {1:'A',2:'B'}, {1:'B',2:'A'}, {1:'B',2:'B'}].)
  2. Play chicken with the universe: For each m in mappings, if P("UDT(i) == m[i] for every i in actions.keys()") = 0, then return m[input].
  3. Calculate expected utilities: For each m in mappings, for each (p,U,us) in worlds, for each u in us, compute q := P("U() == u" | "UDT(i) == m[i] for every i in actions.keys()"); the expected utility EU(m) of m is the sum of all the corresponding p*u*q. (Note that we made sure in the previous step that the conditional probabilities q exist.)
  4. Choose the m with the highest expected utility. If multiple options have the same utility, choose the lexicographically lowest one.
  5. Return m[input].

Now, the universe must always chicken out (the algorithm will never need to return in step 2), because one of the possible worlds in W must be true, this true world cannot be ruled out by Peano Arithmetic because PA is sound, and if the algorithm returned m[input] in step 2, then "UDT(i) == m[i] for every i in actions.keys()" would hold in this true world, so the probability of this sentence could not be zero.

Further, the algorithm will halt on all inputs, because although it does some big computations, there is no unbounded search anywhere; and it's easy to see that on each possible input, it will return one of the allowable outputs. This reasoning can be formalized in PA. Using our earlier assumption, it follows (in PA, in much less than 4^^^^4 steps) that each U will halt and return a value in the corresponding us. Thus, in each possible logical world in W, for every (p,U,us) in worlds, "U() halts" will be true, and "U() == u" will be true for exactly one u (more than one would quickly prove a contradiction), and this u will be in us; and therefore, the different q for a given (p,U,us) will add up to 1.