Aug 06, 2018

Parfit's Hitchhiker with a perfect predictor has the unusual property of having a Less Wrong consensus that you ought to pay, whilst also being surprisingly hard to define formally. For example, if we try to ask about whether an agent that never pays in town is rational, then we encounter a contradiction. A perfect predictor would not ever give such an agent a lift, so by the Principle of Explosion we can prove any statement to be true given this counterfactual.

On the other hand, even if the predictor mistakenly picks up defectors only 0.01% of the time, then this counterfactual seems to have meaning. Let's suppose that a random number from 1 to 10,000 is chosen and the predictor always picks you up when the number is 1 and is perfect otherwise. Even if we draw the number 120, we can fairly easily imagine the situation where the number drawn was 1 instead. This is then a coherent situation where an Always Defect agent would end up in town, so we can talk about how the agent would have counterfactually chosen.
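The imperfect-predictor variant can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `picks_up` and `always_defect_pays` are hypothetical, and the driver's otherwise-perfect prediction is modelled simply as knowing whether the agent pays in town.

```python
def picks_up(agent_pays_in_town: bool, draw: int) -> bool:
    """Hypothetical driver rule: a draw of 1 forces a pick-up,
    otherwise the driver predicts perfectly and only picks up payers."""
    if not 1 <= draw <= 10_000:
        raise ValueError("draw must be between 1 and 10,000")
    return draw == 1 or agent_pays_in_town


# An Always Defect agent never pays in town.
always_defect_pays = False

# Collect every draw on which the Always Defect agent still reaches town.
draws_reaching_town = [
    d for d in range(1, 10_001) if picks_up(always_defect_pays, d)
]

# Fraction of draws on which the predictor errs for this agent.
error_rate = len(draws_reaching_town) / 10_000
```

Under these assumptions the only draw that places the Always Defect agent in town is 1, which is exactly the coherent situation the counterfactual refers to.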

So one response to the difficulties of discussing counterfactual decisions with perfect predictors would be to simply compute the counterfactual as though the predictor has a (tiny) chance of being wrong. However, agents may quite understandably wish to act differently depending on whether they are facing a perfect or an imperfect predictor, even choosing differently when facing a predictor with a very low error rate.

Another would be to say that the predictor predicts whether placing the agent in town is logically coherent. On the basis that the predictor only picks up those who it predicts (with 100% accuracy) will pay, it can assume that it will be paid if the situation is coherent. Unfortunately, it isn't clear what it means in concrete terms for an agent to be such that it couldn't coherently be placed in such a situation. How is, "I commit to not paying in <impossible situation>" any kind of meaningful commitment at all? We could look at, "I commit to making <situation> impossible", but that doesn't mean anything either: if you're in a situation, then it must be possible. Further, such situations are contradictory and *everything* is true given a contradiction, so all contradictory situations seem to be the same.

As the formal description of my solution is rather long, I'll provide a summary: We will assume that each possible world model corresponds to at least one possible sequence of observations. For world models that are consistent conditional on the agent making certain decisions, we'll take the set of observations for agents that are consistent and feed it into the set of agents who aren't. This will be interpreted as what they would have counterfactually chosen in such a situation.

**A Formal Description of the Problem**

(You may wish to skip directly to the discussion)

My solution will be to include observations in our model of the counterfactual. Most such problems can be modelled as follows:

Let x be a label that refers to one particular agent that will be called the *centered agent* for short. It should generally refer to the agent whose decisions we are optimising. In Parfit's Hitchhiker, x refers to the Hitchhiker.

Let W be a set of possible "world models with holes". That is, each is a collection of facts about the world, but not including facts about the decision processes of x which should exist as an agent in this world. These will include the problem statement.

To demonstrate, we'll construct W for this problem. We start off by defining the variables:

- t: Time
    - 0 when you encounter the Driver
    - 1 after you've either been dropped off in Town or left in the Desert
- l: Location. Either Desert or Town
- Act: The actual action chosen by the hitchhiker if they are in Town at t=1. Either Pay or Don't Pay or Not in Town
- Pred: The driver's prediction of x's action if the driver were to drop them in town. Either Pay or Don't Pay (as we've already noted, defining this counterfactual is problematic, but we'll provide a correction later)
- u: Utility of the hitchhiker

We can now provide the problem statement as a list of facts:

- Time: t is a time variable
- Location:
    - l=Desert at t=0
    - l=Town at t=1 if Pred=Pay
    - l=Desert at t=1 if Pred=Don't Pay
- Act:
    - Not in Town at t=0
    - Not in Town if l=Desert at t=1
    - Pay or Don't Pay if l=Town at t=1
- Prediction: The Predictor is perfect. A more formal definition will have to wait
- Utility:
    - u=0 at t=0
    - At t=1: Subtract 1,000,000 from u if l=Desert
    - At t=1: Subtract 50 from u if Act=Pay
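As a rough sketch, the Location, Act and Utility rules above can be written as small Python functions. The function names (`location_at_t1`, `allowed_acts`, `utility_at_t1`) are hypothetical, and the Prediction rule is deliberately left out because its formal definition comes later.

```python
def location_at_t1(pred: str) -> str:
    """Location rule: the driver drops you in Town only if Pred=Pay."""
    return "Town" if pred == "Pay" else "Desert"


def allowed_acts(t: int, l: str) -> set:
    """Act rule: a real choice only exists in Town at t=1."""
    if t == 1 and l == "Town":
        return {"Pay", "Don't Pay"}
    return {"Not in Town"}


def utility_at_t1(l: str, act: str) -> int:
    """Utility rule: start from u=0 and apply the t=1 adjustments."""
    u = 0
    if l == "Desert":
        u -= 1_000_000  # left in the Desert
    if act == "Pay":
        u -= 50  # paid the driver
    return u
```

For example, `utility_at_t1("Town", "Pay")` evaluates to -50, while being left in the Desert costs 1,000,000.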

W then contains three distinct world models:

- Starting World Model - w1:
    - t=0, l=Desert, Act=Not in Town, Pred: varies, u=0
- Ending Town World Model - w2:
    - t=1, l=Town, Act: varies, Pred: Pay, u: varies
- Ending Desert World Model - w3:
    - t=1, l=Desert, Act: Not in Town, Pred: Don't Pay, u=-1,000,000

The properties listed as varies will only be known once we have information about x. Further, it is impossible for certain agents to exist in certain worlds given the rules above.

Let O be a set of possible sequences of observations. It should be chosen to contain all observations that could be made by the centered agent in the given problem and there should be at least one set of observations representing each possible world model with holes. We will do something slightly unusual and include the problem statement as a set of observations. One intuition that might help illustrate this is to imagine that the agent has an oracle that allows it to directly learn these facts.

For this example, the possible individual observations grouped by type are:

- Location Events: <l=Desert> OR <l=Town>
- Time Events: <t=0> OR <t=1>
- Problem Statement: There should be an entry for each point in the problem statement as described for W. For example:
    - <l=Desert at t=0>

O then contains three distinct observation sequences:

- Starting World Model - o1:
    - <Problem Statement> <t=0> <l=Desert>
- Ending Town World Model - o2:
    - <Problem Statement> <t=0> <l=Desert> <t=1> <l=Town>
- Ending Desert World Model - o3:
    - <Problem Statement> <t=0> <l=Desert> <t=1> <l=Desert>

Of course, <t=0><l=Desert> is observed initially in each world so we could just remove it to provide simplified sequences of observations. I simply write <Problem Statement> instead of explicitly listing each item as an observation.

Regardless of its decision algorithm, we will associate x with a fixed Fact-Derivation Algorithm f. This algorithm will take a specific sequence of observations o and produce an id representing a world model with holes w. The reason why it produces an id is that some sequences of observations won't lead to a coherent world model for some agents. For example, the Ending Town sequence of observations can never be observed by an agent that never pays. To handle this, we will assume that each incomplete world model w is associated with a unique integer [w]. In this case, we might logically choose [w1]=1, [w2]=2, [w3]=3 and then f(o1)=[w1], f(o2)=[w2], f(o3)=[w3]. We will define m to map from these ids to the corresponding incomplete world model.
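A minimal Python sketch of f and m for this example follows. It assumes observation sequences are represented as tuples of strings and that a "hole" (a property listed as varies) is represented by `None`; both representations are illustrative choices rather than part of the formalism.

```python
PROBLEM = "<Problem Statement>"

# The three observation sequences o1, o2, o3.
o1 = (PROBLEM, "t=0", "l=Desert")
o2 = o1 + ("t=1", "l=Town")
o3 = o1 + ("t=1", "l=Desert")

# Ids [w1]=1, [w2]=2, [w3]=3 for the world models with holes.
WORLD_ID = {o1: 1, o2: 2, o3: 3}

# World models with holes; None marks a property that varies with x.
WORLD = {
    1: {"t": 0, "l": "Desert", "Act": "Not in Town", "Pred": None, "u": 0},
    2: {"t": 1, "l": "Town", "Act": None, "Pred": "Pay", "u": None},
    3: {"t": 1, "l": "Desert", "Act": "Not in Town",
        "Pred": "Don't Pay", "u": -1_000_000},
}


def f(o: tuple) -> int:
    """Fact-Derivation Algorithm: observation sequence -> world id."""
    return WORLD_ID[o]


def m(world_id: int) -> dict:
    """Map a world id back to its incomplete world model."""
    return WORLD[world_id]
```

Because f returns only an id, it can be evaluated on o2 without first checking whether the agent in question could coherently be in Town.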

We will write D for the set of possible decision algorithms that x might possess. Instead of having these algorithms operate on either observations or world models, we will make them operate on the world ids that are produced by the Fact-Derivation Algorithm so that they still produce actions in contradictory worlds. For example, define:

- d2 - Always Pay
- d3 - Never Pay

If d2 sees [w3] or d3 sees [w2], then it knows that this is impossible according to its model. However, it isn't actually impossible as its model could be wrong. Further, these "impossible" pre-commitments now mean something tangible. The agent has pre-committed to act a certain way if it experiences a particular sequence of observations.
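To illustrate, here is a hedged Python sketch of d2 and d3 as functions of world ids, together with a hypothetical helper `consistent` that encodes which ids each algorithm's own model rules out. The choice to return "Not in Town" outside of Town is an assumption made to match the Act rules above.

```python
W1, W2, W3 = 1, 2, 3  # ids [w1], [w2], [w3]


def d2(world_id: int) -> str:
    """Always Pay: pays whenever the input says it is in Town."""
    return "Pay" if world_id == W2 else "Not in Town"


def d3(world_id: int) -> str:
    """Never Pay: refuses whenever the input says it is in Town."""
    return "Don't Pay" if world_id == W2 else "Not in Town"


def consistent(d, world_id: int) -> bool:
    """Hypothetical check: is this world possible for an agent running d,
    given a perfect predictor?"""
    pays_in_town = d(W2) == "Pay"
    if world_id == W2:
        return pays_in_town  # only payers are ever dropped in Town
    if world_id == W3:
        return not pays_in_town  # only non-payers are left in the Desert
    return True  # every agent starts in w1


# Both algorithms still return an action on an "impossible" input.
action_d3_in_town = d3(W2)
```

Here `consistent(d3, W2)` is false, yet `d3(W2)` is still well defined, which is what gives the pre-commitment its content.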

We can now formalise the Driver's Prediction as follows for situations that are only conditionally consistent (we noted before that this needed to be corrected). Let o be the sequence of observations and d0 be a decision algorithm that is consistent with o, while d1 is a decision algorithm that is inconsistent with it. Let w=m(f(o)), which is a consistent world given d0. Then the counterfactual of what d1 would do in w is defined as: d1(f(o)). We've now defined what it means to be a "perfect predictor". There is however one potential issue: what if multiple observation sequences lead to w? In this case, we need to define the world more precisely and include observational details in the model. Even if these details don't seem to change the problem from a standard decision theory perspective, they may still affect the predictions of actions in impossible counterfactuals.
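Putting the pieces together, the prediction can be sketched in Python as the agent's response to the Ending Town input, d(f(o2)). The helper names `predict` and `outcome` are hypothetical, and the sketch assumes a single observation sequence per world model, so the ambiguity mentioned above does not arise.

```python
PROBLEM = "<Problem Statement>"
o2 = (PROBLEM, "t=0", "l=Desert", "t=1", "l=Town")  # Ending Town sequence
W2 = 2  # id [w2]


def f(o: tuple) -> int:
    """Fact-Derivation Algorithm, restricted to the one input needed here."""
    if o == o2:
        return W2
    raise KeyError("observation sequence not covered by this sketch")


def d2(world_id: int) -> str:
    """Always Pay."""
    return "Pay"


def d3(world_id: int) -> str:
    """Never Pay."""
    return "Don't Pay"


def predict(d) -> str:
    """Perfect prediction: what d outputs on the Ending Town input."""
    return d(f(o2))


def outcome(d) -> tuple:
    """Return (location, act, utility) at t=1 for an agent running d."""
    if predict(d) == "Pay":
        act = d(f(o2))  # the actual choice matches the prediction
        return ("Town", act, -50 if act == "Pay" else 0)
    return ("Desert", "Not in Town", -1_000_000)
```

Under these assumptions `outcome(d2)` ends in Town at a cost of 50, while `outcome(d3)` ends in the Desert at a cost of 1,000,000, even though d3 is never actually placed in Town.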

**Discussion**

In most decision theory problems, it is easier to avoid discussing observations any more than necessary. Generally, the agent makes some observations, but its knowledge of the rest of the setup is simply assumed. This abstraction generally works well, but it leads to confusion in cases like this where we are dealing with predictors who want to know if they can coherently put another agent in a specific situation. As we've shown, even though it is meaningless to ask what an agent would do given an impossible situation, it is meaningful to ask what the agent would do given an impossible input.

When asking what any real agent would do in a real world problem, we can always restate it as asking about what the agent would do given a particular input. However, using the trick of separating observations doesn't limit us to real world problems; as we've seen, we can use the trick of representing the problem statement as direct observations to represent more abstract problems. The next logical step is to try extending this to cases such as, "What if the 1000th digit of Pi were even?" This allows us to avoid the contradiction and deal with situations that are at least consistent, but it doesn't provide much in the way of hints of how to solve these problems in general. Nonetheless, I figured that I may as well start with the one problem that was the most straightforward.

**Update**: After rereading the description of Updateless Decision Theory, I realise that it is already using something very similar to this technique as described here. So the main contribution of this article seems to be exploring a part of UDT that is normally not examined in much detail.

One difference though is that UDT uses a Mathematical Intuition Function that maps from inputs to a probability distribution of execution histories, instead of a Fact-Derivation Algorithm that maps from inputs to models and only for consistent situations. One advantage of breaking it down as I do is to clarify that UDT's observation-action maps don't only include entries for possible observations, but also entries for observations that it would be contradictory for an agent to make. Secondly, it clarifies that UDT predictors predict agents based on how they respond to inputs representing situations, rather than directly on situations themselves, which is important for impossible situations.