(This post contains a lot of math, and assumes some familiarity with the earlier LW posts on decision theory. Most of the ideas are by me and Paul Christiano, building on earlier ideas of Wei Dai.)

The September MIRI workshop has just finished. There were discussions of many topics, which will probably be written up by different people. In this post I'd like to describe a certain problem I brought up there and a sequence of ideas that followed. Eventually we found an idea that seems to work, and also it's interesting to tell the story of how each attempted idea led to the next one.

The problem is a variant of Wei Dai's A Problem About Bargaining and Logical Uncertainty. My particular variant is described in this comment, and was also discussed in this post. Here's a short summary:

In Counterfactual Mugging with a logical coin, a "stupid" agent that can't compute the outcome of the coinflip should agree to pay, and a "smart" agent that considers the coinflip as obvious as 1=1 should refuse to pay. But if a stupid agent is asked to write a smart agent, it will want to write an agent that will agree to pay. Therefore the smart agent who refuses to pay is reflectively inconsistent in some sense. What's the right thing to do in this case?

Thinking about that problem, it seems like the right decision theory should refuse to calculate certain things, and instead behave updatelessly with regard to some sort of "logical prior" inherited from its creators, who didn't have enough power to calculate these things. In particular, it should agree to pay in a Counterfactual Mugging with a digit of pi, but still go ahead and calculate that digit of pi if offered a straight-up bet instead of a counterfactual one.

What could such a decision theory look like? What kind of mathematical object is a "logical prior"? Perhaps it's a probability distribution over inconsistent theories that the creators didn't yet know to be inconsistent. How do we build such an object, what constraints should it satisfy, and how do we use it in a decision algorithm?

#### Attempt 1

Sometime ago, Benja Fallenstein and Paul Christiano came up with a way to build a probability distribution over all consistent theories. We start with a single empty theory with weight 1, and then refine it by successively adding all possible axioms. Each new axiom A is either already settled by the current theory T (i.e. either A or ¬A is provable in T), in which case we do nothing; or A is independent of T, in which case we choose randomly between T+A and T+¬A. This process gives nonzero probabilities to all finitely axiomatized consistent theories, and leads to well-behaved conditional probabilities like P(PA+Con(PA)|PA). (I gloss over the distinction between finitely and recursively axiomatizable theories. Building a variant of this construction that works for recursively axiomatizable theories is left as an exercise to the reader.)

At the workshop we tried to use a similar construction, but with local consistency instead of global consistency. (Let's say that a "locally consistent theory" is a set of statements that has no short proof of inconsistency, for some reasonable meaning of "short".) We keep adding axioms as before, but only check that the theory so far is locally consistent. Hopefully this way we might end up with a nice distribution over all theories, including inconsistent ones.

The trouble with this approach is that for some axiom A and some locally consistent theory T, both T+A and T+¬A might be locally inconsistent, so we won't be able to continue. We can try to salvage the idea by somehow adjusting the probabilities retroactively, but then it seems hard to guarantee that the probability of each statement goes to a well-defined limit, it might end up oscillating. So we abandoned this idea.

#### Attempt 2

Instead of trying to build up a prior over possibly inconsistent theories, we can try assigning each theory a "quality" between 0 and 1. For example, we can take a probability distribution over all local consistency constraints that a theory must satisfy, and define the "quality" of a theory as the probability that it satisfies a randomly chosen constraint. For purposes of that definition, the theories have to be complete, but possibly inconsistent. Theories with quality 1 will be the consistent ones, because they satisfy all the constraints. Then we can define the quality of an individual statement as the maximum quality of any theory that includes that statement.

This all seems fine so far, but at the next step we run into trouble. In Counterfactual Mugging with a logical coin, Omega needs to evaluate a counterfactual statement like "if a digit of pi was different, the agent would do such-and-such thing". We can try to straightforwardly define "A counterfactually implies B" as true if the quality of "A and B" is equal to the quality of A. Intuitively that means that the most consistent theory that includes A also includes B.

Unfortunately, that definition does not have very good properties. Both "PA counterfactually implies Con(PA)" and "PA counterfactually implies ¬Con(PA)" will be defined as true, because the theories PA+Con(PA) and PA+¬Con(PA) are both consistent and have quality 1. That doesn't seem to agree with our intuitions about how logical counterfactuals should work.

#### Some ideas about logical counterfactuals

Our intuitions about decision theory seem to say that some logical counterfactuals are "natural" and others are "spurious". For example, if the agent does not in fact take a certain action, then there's a "natural" proof of how much utility it would have implied, and also many "spurious" proofs that prove all sorts of absurdities by first simulating the agent and then using the fact that the action was not taken. (See this post for more about spurious proofs.)

Natural counterfactuals look more correct to us than spurious ones because their proofs don't rely on fully simulating the agent. Maybe we could try weighting logical counterfactuals based on their proof length, and see where that idea takes us?

In simple decision theory problems, we are interested in counterfactuals like "A()=1 implies U()=1", "A()=1 implies U()=2", and so on, where A is the agent program that returns an action, and U is the universe program that contains the agent and returns a utility value. (See this post or this post for more about this setting.) Naively assigning some probability P(L) to each counterfactual statement with proof length L doesn't seem to work, because the problem might be so complex that all relevant statements have long proofs. But if we took some set of counterfactuals assumed to be mutually exclusive, with proof lengths L_{1}, L_{2}, ..., then it might be reasonable to give them probabilities like P(L_{1})/(P(L_{1})+P(L_{2})+...). That looks suspiciously similar to Bayes' theorem, so it seems natural to try and come up with some sort of probabilistic interpretation where this will be an actual application of Bayes' theorem. Thankfully, it's easy to find such an interpretation.

Let's take a theory like PA, generate random proofs in it with probability depending on length, and then throw out the invalid ones. If we view each statement of the form "X implies Y" as an "arrow" from statement X to statement Y, then the above process of picking random valid proofs induces a probability distribution on all true arrows, including those where statement X is false. Now we can define the conditional probability P(U()=u|A()=a) as the probability that a random arrow ends at "U()=u", conditional on the event that the arrow starts from "A()=a" and ends at some statement of the form "U()=...". This way, all possible statements "U()=u" for a given "A()=a" have probabilities that sum to 1, and the true statement "A()=a" leads to a single statement "U()=u" with probability 1. These seem like good properties to have.

More generally, when people are uncertain about some math statement, they often seem to reason like this: "I guess I'm likely to find a proof if I try this method. Hmm, I checked a couple obvious proofs and they didn't work. Guess I have to lower my probability of the statement." Such reasoning corresponds to Bayesian updating as you observe new proofs, which is similar to the above setting.

Now, can we extend this idea to obtain a "logical prior" over inconsistent theories as well?

### Attempt 3, successful

There's nothing stopping us from using the above construction with an inconsistent theory, because the properties that follow from having a coherent probability distribution over arrows should still hold. Let's assume that we care about theories T_{1},T_{2},... which are incomplete and possibly inconsistent. (An example of such a theory is PA+"the billionth digit of pi is even"). In each of these theories, we can construct the above distribution on arrows. Then we can add them up, using some weighting of the theories, and get an overall distribution on statements of the form "A()=a implies U()=u". It has the nice property that you only need to sample finitely many proofs to compute the conditional probabilities to any desired precision, so it's easy to write a decision algorithm that will do approximate utility maximization based on these probabilities. If we only care about a single theory, like PA, then the algorithm seems to give the same answers as our earlier algorithms for UDT.

There's still a remaining problem to fix. If we care equally about the theory where the billionth digit of pi is even and the one where it's odd, then our algorithm will accept a straight bet on that digit of pi, instead of calculating it and accepting conditionally. An approach to solving that problem can be found in Wei Dai's proposed "UDT2", which optimizes over the next program to run, instead of the value to return. We can use that idea here, ending up with an algorithm that I will call "UDT1.5", to avoid incrementing the major version number for something that might turn out to be wrong. To put it all together:

- We care about some program U() that returns a utility value. Also we care about some probability distribution over theories T
_{1},T_{2},... that are incomplete and possibly inconsistent. - Our decision algorithm A() attempts to find a program B such that running it will approximately maximize the expected value of U(), using the probabilities defined below. Then A runs B() and returns the resulting value.
- For each possible program B, the conditional probability P(U()=u|A runs B) is defined as the probability that a random valid proof sampled from a random T
_{i}proves that "if A runs B, then U()=u", conditional on it being a proof that "if A runs B, then U()=something".

(The algorithm can easily be extended to the case where A receives an argument representing observational data. Namely, it should just pass the argument to B, which corresponds to Wei Dai's UDT1.1 idea of optimizing over the agent's input-output map.)

Given a suitable prior, the above algorithm solves my original variant of Counterfactual Mugging with stupid and smart agents, but still correctly decides to calculate the digit of pi if offered a straight bet instead of a counterfactual one. And even apart from that, it feels right to make decisions using a weighted sum of theories T_{1},T_{2},... as well as a weighted sum of universes U_{1},U_{2},... composing U. I'm not sure if the right decision theory should eventually work like this, using "Tegmark level 4" over possible universe programs and "Tegmark level 5" over logically inconsistent theories, or perhaps we can define a single kind of uncertainty that generalizes both indexical and logical. In any case, that's as far as the workshop went, and it seems worthwhile to continue thinking about such questions.

Many thanks to all participants, to MIRI which invited me, to Kevin and Sean who hosted me, and anyone else I might have missed!