Decision Theory with F@#!ed-Up Reference Classes

by Chris_Leong, 22nd Aug 2018


Personal Blog

Before we can answer the question of what you ought to do, we need to identify exactly which agents are referred to by "you". In some problems, "you" refers to only a single, easily identifiable agent whose actions produce deterministic results, but in other problems many agents will experience positions that are completely indistinguishable. Even then, we can normally identify a fixed set of agents who are possibly you and average over them. However, there exists a set of problems where this set of indistinguishable agents depends on the decision that you make, at which point it becomes rather unclear who exactly you are trying to optimise over. We will say that these problems have Decision-Inconsistent Reference Classes.

While this may seem like merely a niche issue, given the butterfly effect and a sufficiently long timeline with the possibility of simulations, it is almost guaranteed that any decision will change the reference class. So understanding how to resolve these issues is more important than it might first appear. More importantly, if I am correct, Imperfect Parfit's Hitchhiker doesn't have a real answer and UDT would require some rather significant modifications.

(This post is based upon the material in this comment, which I said that I was planning on developing into a full post. It contains some substantial corrections and additions.)


My exploration of this area is mostly motivated by Imperfect Parfit's Hitchhiker. Here we define this as Parfit's Hitchhiker with a driver that always detects when you are telling the truth about paying, but 1% of the time picks you up independently of whether you are or aren't being truthful. We'll also imagine that those agents who arrive in town discover a week after their decision whether or not they were in the group who would have been picked up independent of their decision.

Solving this problem involves challenges that aren't present in the version with perfect predictors. After all, once we've defined a notion of counterfactuals for perfect predictors (harder than it looks!), it's clear that defecting against these predictors is a losing strategy. There is no (inside-view) downside to committing to taking a sub-optimal action given an input that ought to be impossible. However, as soon as the predictors have even an arbitrarily small amount of imperfection, choosing to pay actually means giving up something.

Given the natural human tendency towards fairness, it may be useful to recall the True Prisoner's Dilemma - what if, instead of rescuing one person, the driver rescued your entire family, and instead of demanding $50 he demanded that a random 50% of you be executed? In this new scenario, refusing to "pay" him for his services no longer seems quite so unfair. And if you can get the better of him, why not do so? Or if this isn't sufficient, we can imagine the driver declaring that it's fair game to try fooling him into thinking that you'll pay.

Now that our goal is to beat the driver if that is at all possible, we can see that this is prima facie ambiguous, as it isn't clear which agents we wish to optimise over. If we ultimately defect, then only 1% of agents arrive in town, but if we ultimately pay, then 100% of agents arrive in town. Should we optimise over the 1% or the 100%? Consider the period after you've locked in your decision, but before it's revealed whether you would have been picked up anyway (call this the immediate aftermath). Strangely, in the immediate aftermath you will reflectively endorse whatever decision you made. An agent who decided to defect knows that they were in the 1% and so they would have always ended up in town, while those who decided to pay will assign only a 1% probability that they were going to be picked up if they hadn't paid. In the latter case, the agent may later regret paying if they discover that they were indeed in the 1%, but this will only be later, not in the immediate aftermath.
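The asymmetry in the immediate aftermath can be checked with a small calculation. This is only a sketch under an assumed model: a fraction of 1% of agents is picked up regardless, and everyone else is picked up only if they will truly pay. The names `LUCKY`, `fraction_in_town` and `p_lucky_given_in_town` are illustrative and do not come from the original problem statement.

```python
from fractions import Fraction

# Assumed model: a fraction LUCKY of agents is picked up regardless of their
# decision; every other agent is picked up only if they will truly pay.
LUCKY = Fraction(1, 100)

def fraction_in_town(pays: bool) -> Fraction:
    """Fraction of all agents who reach town under the given policy."""
    return Fraction(1) if pays else LUCKY

def p_lucky_given_in_town(pays: bool) -> Fraction:
    """P(agent was in the 1% | agent is in town), by Bayes' rule."""
    # Every lucky agent reaches town under either policy, so the numerator
    # is just LUCKY.
    return LUCKY / fraction_in_town(pays)

for pays in (False, True):
    label = "pay" if pays else "defect"
    print(label, fraction_in_town(pays), p_lucky_given_in_town(pays))
```

Under this model a defector in town is certain to have been in the 1%, while a payer in town assigns only 1% to that possibility, which is why each endorses their own choice.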

Some Example Problems

It'll be easier to attempt solving this problem if we gather some other problems that mess with reference classes. One such problem is the Evil Genie Puzzle I defined in a previous post. This creates the exact opposite problem - you reflectively regret whichever decision you choose. If you choose the million dollars (I wrote perfect life instead in the post), you know in the immediate aftermath that you are almost certainly a clone, so you should expect to be tortured. However, if you choose the rotten eggs, you know in the immediate aftermath that you could have had a million dollars.

Since one potential way of evaluating situations with Decision-Inconsistent Reference Classes is to simply compare averages, we'll also define the Dilution Genie Puzzle. In this puzzle, a genie offers you $1,000,001 or $10. However, if the genie predicts that you will choose the greater amount, it creates 999,999 clones of you who will face what seems like an identical situation, but they will actually each only receive $1 when they inevitably choose the same option as you. This means that choosing $1,000,001 really provides an average of $2, so choosing $10 might actually be a better decision, though if you do actually take it you could have won the million.
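The $2 figure can be verified directly. The following is a minimal sketch using only the numbers given in the puzzle; the constant and function names are illustrative assumptions.

```python
from fractions import Fraction

N_CLONES = 999_999          # clones created if the genie predicts the greedy choice
ORIGINAL_PRIZE = 1_000_001  # what the original receives for the greedy choice
CLONE_PRIZE = 1             # what each clone receives
SMALL_PRIZE = 10            # the modest option; no clones are created

def average_if_greedy() -> Fraction:
    """Average payout over the original plus all clones."""
    total = ORIGINAL_PRIZE + N_CLONES * CLONE_PRIZE
    return Fraction(total, 1 + N_CLONES)

def average_if_modest() -> Fraction:
    """Only the original exists, so the average is just the small prize."""
    return Fraction(SMALL_PRIZE, 1)

print(average_if_greedy(), average_if_modest())
```

Averaged over everyone who faces the apparent choice, the greedy option pays $2 per agent against $10 for the modest one.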

Possible Approaches:

Suppose an agent G faces a decision D represented by input I. The most obvious approaches for evaluating this decision are as follows:

1) Individual Averages: If X is an option, calculate the expected utility of X by averaging over all agents who experience input I if G chooses X. Choose the option with the highest expected utility.

This approach defects on Imperfect Parfit's Hitchhiker, chooses the rotten eggs for Evil Genie and chooses the $10 for Dilution Genie. Problems like Perfect Parfit's Hitchhiker and Perfect Retro Blackmail are undefined, as the reference class is empty. We can't substitute 0 average utility for an empty reference class because, in Perfect Parfit's Hitchhiker, this results in us dying in the desert. We also can't strike out these options and choose from those remaining, since in Retro Blackmail this will result in us crossing out the option to not pay. So a major flaw with this approach is that it doesn't handle problems where one decision invalidates the reference class.

It is also somewhat counterintuitive that individuals who count when evaluating one possible option may not count when evaluating another option for the same decision, even if they still exist.

2) Pairwise Averages: If X & Y are options, compare these pairwise by calculating the average utility over all agents who experience input I if G chooses either X or Y. Non-existence is treated as a 0.

This approach pays in Perfect or Imperfect Parfit's Hitchhiker, chooses the rotten eggs for Evil Genie, $1,000,001 for Dilution Genie and refuses to pay in Retro Blackmail.
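To see why Pairwise Averages picks $1,000,001 in Dilution Genie, here is a rough sketch of one reading of the rule: the reference class is everyone who faces the choice under either option, and an agent who doesn't exist under an option contributes 0. The agent numbering (0 for the original, 1 to 999,999 for the clones) and the helper `pairwise_average` are assumptions made for illustration.

```python
from fractions import Fraction

def pairwise_average(payoffs: dict[int, int], reference_class: set[int]) -> Fraction:
    """Average payoff over reference_class; agents missing from payoffs count as 0."""
    total = sum(payoffs.get(agent, 0) for agent in reference_class)
    return Fraction(total, len(reference_class))

# Hypothetical labelling: agent 0 is the original, agents 1..999_999 are clones.
greedy = {0: 1_000_001, **{i: 1 for i in range(1, 1_000_000)}}
modest = {0: 10}  # no clones exist if the modest option is chosen

# Everyone who experiences the input under at least one of the two options.
reference_class = set(greedy) | set(modest)

print(pairwise_average(greedy, reference_class))  # 2
print(pairwise_average(modest, reference_class))  # 1/100000
```

Because the non-existent clones drag the modest option's average down to a fraction of a cent, the greedy option wins under this rule, the reverse of the Individual Averages verdict.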

Unfortunately, this doesn't necessarily provide a consistent ordering, as we'll shortly see. The following diagram represents what I'll call the Staircase Prediction Problem because of the shape of the underlined entries:

#: 1 2 3 4 5 6 7
A: 0 1 0 0 0 0 0
B: 0 0 2 0 0 0 0
C: 0 0 0 0 1 0 0

There are 7 agents (numbered 1-7) who are identical clones and three different possible decisions (labelled A-C). None of the clones know which one they are. A perfect predictor predicts which option person 1 will pick if they are woken up in town; since the others are clones, they will also choose the same option if they are woken up in town.

The underlined entries indicate which people will be woken up in town if it is predicted that person 1 will choose that option, and the non-underlined entries indicate who will be woken up on a plain. For those who are in town, the numbers indicate how much utility each agent is rewarded with if they choose that option. For those who aren't in town, the agent is instead rewarded (or not) based on what they would counterfactually do if they were in town.

Comparing the lines pair-wise to see what decision we should make in town, we find:

B beats A (2/6 vs. 1/6)

C beats B (1/6 vs. 0/6)

A beats C (1/6 vs. 0/6)

Note that to be included in the average, a person only needs to be woken in town in one of the two options.

Since this provides an inconsistent ordering, this approach must be flawed.
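The inconsistency can be checked mechanically. The sketch below takes the three pairwise results exactly as stated above and searches for a best-to-worst ranking that agrees with all of them; the function name `consistent_rankings` is an illustrative assumption.

```python
from itertools import permutations

# Pairwise results as stated in the text: (winner, loser).
BEATS = {("B", "A"), ("C", "B"), ("A", "C")}

def consistent_rankings(options, beats):
    """Return every best-to-worst ranking that agrees with all pairwise results."""
    rankings = []
    for order in permutations(options):
        position = {opt: i for i, opt in enumerate(order)}
        # A ranking is consistent if every winner is placed above its loser.
        if all(position[w] < position[l] for w, l in beats):
            rankings.append(order)
    return rankings

print(consistent_rankings("ABC", BEATS))  # [] : the results form a cycle
```

No ranking survives, since the three results form a cycle: A loses to B, B loses to C, and C loses to A.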

3) Overall Averages: If X is an option, calculate the expected utility of X by averaging over all agents for which there is at least one option Y where they experience input I when G chooses Y. Non-existence is treated as a 0.

This approach is the same as 2) in many problems: it pays in Perfect or Imperfect Parfit's Hitchhiker, chooses the rotten eggs for Evil Genie, $1,000,001 for Dilution Genie and refuses to pay in Retro Blackmail.

However, we run into issues with irrelevant considerations changing our reference classes. We will call this situation the Catastrophe Button Scenario.

#: 1 2 3
A: 3 3 0
B: 4 4 -1007
C: -∞ -∞ -∞

Again, underlined represents being woken up in town and non-underlined represents being woken up on the plain. As before, who is woken up in town is based on the prediction of Person 1's decision, and agents that wake up in town don't know who they are. C is the option representing pressing the Catastrophe Button. No-one wants to press this button as it leads to an unimaginably bad outcome. Yet, using overall averages, the presence of C makes us want to include Person 3 in our calculation of averages. Without person 3, A provides an average utility of 3 and B of 4. However, with person 3, A provides an average of 2 and B an average of -333. So the presence of the Catastrophe Button reverses the option we choose, despite it being a button that we will never press and hence clone 3 never being woken up in town. This seems absurd.
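The reversal can be reproduced from the table. This is a small sketch using the utilities for options A and B only (C is left out because its utility is negative infinity for everyone and it is never chosen); the helper names `average` and `best` are illustrative assumptions.

```python
from fractions import Fraction

# Utilities from the table, indexed by option and then by person.
UTILITY = {
    "A": {1: 3, 2: 3, 3: 0},
    "B": {1: 4, 2: 4, 3: -1007},
}

def average(option: str, reference_class: set[int]) -> Fraction:
    """Average utility of an option over the given set of people."""
    total = sum(UTILITY[option][person] for person in reference_class)
    return Fraction(total, len(reference_class))

def best(reference_class: set[int]) -> str:
    """Option with the highest average over the reference class."""
    return max(UTILITY, key=lambda option: average(option, reference_class))

without_3 = {1, 2}     # reference class if option C did not exist
with_3 = {1, 2, 3}     # reference class once C brings person 3 in

print(average("A", without_3), average("B", without_3), best(without_3))
print(average("A", with_3), average("B", with_3), best(with_3))
```

The preferred option flips from B to A purely because person 3 enters the reference class.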

But I care about all of my clones

We actually don't need full clones in order to create these results. We can work with what I like to call semi-clones - agents that make exactly the same decisions in particular situations, but which have an incredibly different life story/preferences. For example, we could take an agent and change the country it was brought up in, the flavours of ice-cream it likes and its general personality, whilst leaving its decision theory components exactly the same. Even if you necessarily care about your clones, there's much less reason for a selfish agent to care about its semi-clones. Or if that is insufficient, we can imagine that your semi-clones teamed up to murder your family. The only requirement for being a semi-clone is that they come to the same decision for a very restricted range of decision theory problems.

So if we make the individuals all different semi-clones, but keep them from knowing their identity, they should only care about the agents that were woken up in the town, as these are the only agents that are indistinguishable.

What about UDT?

UDT only insists on a utility function from the cross-product of execution histories of a set of programs to the real numbers and doesn't define anything about how this function ought to behave. There is no restriction on whether it ought to be using the Self-Indication Assumption or the Self-Sampling Assumption for evaluating execution histories with varying numbers of agents. There is no requirement to care about semi-clones or not care about them.

The only real requirement of the formalism is to calculate an average utility over all possible execution histories weighted by the probability of them occurring. So, for example, in Imperfect Parfit's Hitchhiker with a single agent, we can't just calculate an average for the cases where you happen to be in town, but we need to assign a utility for when you are left in the desert. But if we ran the problem with 100 hitchhikers, one of whom would always be picked up independently of their decision, we could define a utility function that only took into account those who actually arrived in town. But this utility function isn't just used to calculate the decision for one input, but an input-output map for all possible inputs and outputs. It seems ludicrous that decisions only relevant to the desert should be calculated just for those who end up in town.

Where does this leave us? UDT could technically represent Proposal 1, but in addition to the issue with empty reference classes, this seems to be an abuse of the formalism. Proposal 2 is incoherent. Proposal 3 is very natural for UDT, but leads to irrelevant considerations affecting our decision.

So UDT doesn't seem to tell us much about what we ought to do, nor does it provide a solution, and even if it did, the specific approach would need to be justified rather than merely assumed.

What if we rejected personal identity?

If we argued that you shouldn't care any more about what are traditionally seen as your future observer moments than anyone else's, none of the scenarios discussed above would pose an issue. You would simply care about the average or total utility of all future person moments independent of whose they might appear to be. Of course, this would be a radical shift for most people.

What if we said that there was no best decision?

All of the above theories choose the rotten eggs in the Evil Genie problem, but none of them seem to give an adequate answer to the complaint that the decision isn't reflectively consistent. So it seems like a reasonable proposal to suggest that the notion of a "best decision" depends on there being a fixed reference class. This would mean that there would be no real answer to Imperfect Parfit's Hitchhiker. It would also require there to be significant modifications to UDT, but this is currently the direction that I'm leaning.
