What is Wei Dai's Updateless Decision Theory?

by AlephNeil, 19th May 2010



As a newcomer to LessWrong, I quite often see references to 'UDT' or 'updateless decision theory'. The very name is like crack - I'm irresistibly compelled to find out what the fuss is about.

Wei Dai's post is certainly interesting, but it seemed to me (as a naive observer) that a fairly small 'mathematical signal' was in danger of being lost in a lot of AI-noise. Or to put it less confrontationally: I saw a simple 'lesson' on how to attack many of the problems that frequently get discussed here, which can easily be detached from the rest of the theory. Hence this short note, the purpose of which is to present and motivate UDT in the context of 'naive decision theory' (NDT), and to pre-empt what I think is a possible misunderstanding.

First, a quick review of the basic Bayesian decision-making recipe.

 

 

What is Naïve Decision Theory?

You take the prior and some empirical data and calculate a posterior by (i) working out the 'likelihood function' of the data and (ii) calculating prior times likelihood and renormalising. Then you calculate the expected utility of every possible action (with respect to this posterior) and choose the action that maximizes it.
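A minimal sketch of this recipe, for concreteness. The umbrella problem, its probabilities and its payoffs are made-up illustrations (none of them come from the discussion above), and the function names are hypothetical.

```python
from fractions import Fraction

def posterior(prior, likelihood, datum):
    """Prior times likelihood, renormalised."""
    unnorm = {h: prior[h] * likelihood[h][datum] for h in prior}
    total = sum(unnorm.values())
    return {h: w / total for h, w in unnorm.items()}

def ndt_choose(prior, likelihood, datum, actions, utility):
    """Maximise expected utility with respect to the posterior."""
    post = posterior(prior, likelihood, datum)

    def expected_utility(action):
        return sum(post[h] * utility(h, action) for h in post)

    return max(actions, key=expected_utility)

# Hypothetical toy problem (all numbers are assumptions for illustration).
prior = {"rain": Fraction(3, 10), "dry": Fraction(7, 10)}
likelihood = {
    "rain": {"clouds": Fraction(9, 10), "clear": Fraction(1, 10)},
    "dry": {"clouds": Fraction(2, 10), "clear": Fraction(8, 10)},
}
payoff = {
    ("rain", "umbrella"): 0, ("rain", "none"): -10,
    ("dry", "umbrella"): -1, ("dry", "none"): 0,
}

def utility(hypothesis, action):
    return payoff[(hypothesis, action)]

actions = ["umbrella", "none"]
choice_if_clouds = ndt_choose(prior, likelihood, "clouds", actions, utility)
choice_if_clear = ndt_choose(prior, likelihood, "clear", actions, utility)
```

Note that the likelihood table is simply handed to the procedure, which is exactly the presupposition that the later sections call into question.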

Of course there's a lot more to conventional decision theory than this, but I think one can best get a handle on UDT by considering it as an alternative to the above procedure, in order to handle situations where some of its presuppositions fail.

(Note: NDT is especially 'naïve' in that it takes the existence of a 'likelihood function' for granted. Therefore, in decision problems where EDT and CDT diverge, one must 'dogmatically' choose between them at the outset just to obtain a problem that NDT regards as being well-defined.)

When does NDT fail?

The above procedure is extremely limited. Taking it exactly as stated, it only applies to games with a single player and a single opportunity to act at some stage in the game. The following diagram illustrates the kind of situation for which NDT is adequate:

[Diagram: NDT]

This is a tree diagram (as opposed to a causal graph). The blue and orange boxes show 'information states', so that any player-instance within the blue box sees exactly the same 'data'. Hence, their strategy (whether pure or mixed) must be the same throughout the box. The branches on the right have been greyed out to depict the Bayesian 'updating' that the player following NDT would do upon seeing 'blue' rather than 'orange'--a branch is greyed out if and only if it fails to pass through a blue 'Player' node. Of course, the correct strategy will depend on the probabilities of each of Nature's possible actions, and on the utilities of each outcome, which have been omitted from the diagram. The probabilities of the outward branches from any given 'Nature' node are to be regarded as 'fixed at the outset'.

Now let's consider two generalisations:

  1. What if the player may have more than one opportunity to act during the game? In particular, what if the player is 'forgetful' in the sense that (i) information from 'earlier on' in the game may be 'forgotten', even such that (ii) the player may return to an information state several times during the same branch?
  2. What if, in addition to freely-willed 'Player' nodes and random 'Nature' nodes, there is a third kind of node where the branch followed depends on the Player's strategy for a particular information state, regardless of whether that strategy has yet been executed? In other words, what if the universe contains 'telepathic robots' (whose behaviour is totally mechanical - they're not trying to maximize a utility function) that can see inside the Player's mind before the Player has acted?

It may be worth remarking that we haven't even considered the most obvious generalisation: The one where the game includes several 'freely-willed' Players, each with their own utility functions. However, UDT doesn't say much about this - UDT is intended purely as an approach to solving decision problems for a single 'Player', and to the extent that other 'Players' are included, they must be regarded as 'robots' (of the non-telepathic type) rather than intentional agents. In other words, when we consider other Players, we try to do the best we can from the 'Physical Stance' (i.e. try to divine what they will do from their 'source code' alone) rather than rising to the 'Intentional Stance' (i.e. put ourselves in their place with their goals and see what we think is rational).

Note: If a non-forgetful player has several opportunities to act then, as long as the game only contains Player and Nature nodes, the Player is able to calculate the relevant likelihood function (up to a constant of proportionality) from within any of their possible information states. Therefore, they can solve the decision problem recursively using NDT, working backwards from the end (as long as the game is guaranteed to end after a finite number of moves.) If, in addition to this, the utility function is 'separable' (e.g. a sum of utilities 'earned' at each move) then things are even easier: each information state gives us a separate NDT problem, which can be solved independently of the others. Therefore, unless the player is forgetful, the 'naïve' approach is capable of dealing with generalisation 1.
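The 'working backwards from the end' step can be sketched as follows. This is only an illustration under a simplifying assumption that is stronger than the note requires: the Player is taken to observe every move made so far, so each Player node is its own information state. The tree encoding and the example numbers are hypothetical.

```python
def value(node):
    """Expected utility of a finite tree under optimal play.

    A node is either a number (a leaf utility),
    ("nature", [(probability, subtree), ...]) or
    ("player", {action: subtree, ...}).
    """
    if isinstance(node, (int, float)):
        return node
    kind, children = node
    if kind == "nature":
        # Nature's branch probabilities are fixed at the outset.
        return sum(p * value(child) for p, child in children)
    if kind == "player":
        # The Player picks the best continuation at this node.
        return max(value(child) for child in children.values())
    raise ValueError(f"unknown node kind: {kind!r}")

def best_action(player_node):
    """The action that maximises expected utility at a Player node."""
    _, children = player_node
    return max(children, key=lambda a: value(children[a]))

# Hypothetical two-move game: act, let Nature move, then possibly act again.
tree = ("player", {
    "safe": 1.0,
    "risky": ("nature", [
        (0.5, ("player", {"stop": 0.0, "go": 3.0})),
        (0.5, -0.5),
    ]),
})
root_value = value(tree)
root_action = best_action(tree)
```

The recursion is sound only because the Player at a later node knows which node that is. Once information states span several nodes on the same branch (forgetfulness), the `max` at one node can no longer be taken independently of the others.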

Here are two familiar examples of generalisation 1 (ii):

[Diagram: Absent-minded driver]

Note: The Sleeping Beauty problem is usually presented as a question about probabilities ("what is the Player's subjective probability that the coin toss was heads?") rather than utilities, although for no particularly good reason the above diagram depicts a decision problem. Another point of interest is that the Absent-Minded Driver contains an extra ingredient not present in the SB problem: the player's actions affect how many player-instances there are in a branch.

Now a trio of notorious problems exemplifying generalisation 2:

[Diagram: Newcomb]

[Diagram: Hitcher]

 

How Does UDT Deal With These Problems?

The essence of UDT is extremely simple: We give up the idea of 'conditioning on the blue box' (doing Bayesian reasoning to obtain a posterior distribution etc) and instead just choose the action (or more generally, the probability distribution over actions) that will maximize the unconditional expected utility.
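As a rough sketch of what 'maximize the unconditional expected utility' amounts to, the code below scores each candidate strategy against the whole game, with the Omega nodes reading the strategy rather than the executed action. The dollar amounts, the perfectly accurate Omega, and the assignment of heads and tails are the usual textbook choices, assumed here for illustration; the function names are hypothetical.

```python
from fractions import Fraction

def newcomb_eu(strategy):
    """Newcomb's problem with an assumed perfect predictor."""
    # Omega node: the opaque box is filled iff the strategy is to one-box.
    opaque = 1_000_000 if strategy == "one-box" else 0
    # Player node: two-boxers also collect the transparent box.
    transparent = 1_000 if strategy == "two-box" else 0
    return opaque + transparent

def mugging_eu(strategy):
    """Counterfactual Mugging with an assumed fair coin."""
    half = Fraction(1, 2)
    # Heads (Player node): Omega asks for 100.
    heads = -100 if strategy == "pay" else 0
    # Tails (Omega node): Omega pays 10,000 iff the strategy would pay on heads.
    tails = 10_000 if strategy == "pay" else 0
    return half * heads + half * tails

def udt_choose(strategies, expected_utility):
    """Pick the strategy with the highest unconditional expected utility."""
    return max(strategies, key=expected_utility)

newcomb_choice = udt_choose(["one-box", "two-box"], newcomb_eu)
mugging_choice = udt_choose(["pay", "refuse"], mugging_eu)
```

Nothing in `udt_choose` conditions on having seen the coin land or the boxes being already filled. That omission is the whole of the difference from the naive recipe.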

So, UDT:

 

  • Solves the correct equation in the Absent-Minded Driver problem.
  • One-boxes.
  • Submits to a Counterfactual Mugging.
  • Pays after hitchhiking.
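For the Absent-Minded Driver, the 'correct equation' can be made concrete with a small calculation. The payoffs below (0 for exiting at the first intersection, 4 for exiting at the second, 1 for driving past both) are the numbers commonly used for this puzzle and are an assumption here, since the diagram's utilities are not given in the text.

```python
from fractions import Fraction

# Assumed payoffs; not taken from the diagram above.
EXIT_FIRST, EXIT_SECOND, DRIVE_ON = 0, 4, 1

def driver_eu(p):
    """Unconditional expected utility when the driver continues with
    probability p at every intersection (one information state, so one p)."""
    return ((1 - p) * EXIT_FIRST
            + p * (1 - p) * EXIT_SECOND
            + p * p * DRIVE_ON)

# Exact grid search over p = 0, 1/300, ..., 1.
grid = [Fraction(k, 300) for k in range(301)]
best_p = max(grid, key=driver_eu)
best_eu = driver_eu(best_p)
```

With these payoffs the expected utility is 4p - 3p^2, which is maximised at p = 2/3 with value 4/3. The grid search is only a check on that algebra.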

 

Is that it? (Doesn't that give the wrong answer to the Smoking Lesion problem?)

Yes, that's all there is to it.

Prima facie, the tree diagram for the Smoking Lesion would seem to be identical to my diagram of Newcomb's Problem (except that the connection between Omega's action and the Player's action would have to be probabilistic), but let's look a little closer:

Wei Dai imagines the Player's action to be computed by a subroutine called S, and although other subroutines are free to inspect the source code of S, and try to 'simulate' it, ultimately 'we' the decision-maker have control over S's source code. In Newcomb's problem, Omega's activities are not supposed to have any influence on the Player's source code. However, in the Smoking Lesion problem, the presence of a 'lesion' is somehow supposed to cause Players to choose to smoke (without altering their utility function), which can only mean that in some sense the Player's source code is 'partially written' before the Player can exercise any control over it. However, UDT wants to 'wipe the slate clean' and delete whatever half-written nonsense is there before deciding what code to write.

Ultimately this means that when UDT encounters the Smoking Lesion, it simply throws away the supposed correlation between the lesion and the decision and acts as though that were never a part of the problem. So the appropriate tree diagram for the Smoking Lesion problem would have a Nature node at the bottom rather than an Omega node, and so UDT would advise smoking.

Why Is It Rational To Act In The Way UDT Prescribes?

UDT arises from the philosophical viewpoint that says things like

  1. There is no such thing as the 'objective present moment'.
  2. There is no such thing as 'persisting subjective identity'.
  3. There is no difference in principle between me and a functionally identical automaton.
  4. When a random event takes place, our perception of a single definite outcome is as much an illusion of perspective as the 'objective present'--in reality all outcomes occur, but in 'parallel universes'.

If you take the above seriously then you're forced to conclude that a game containing an Omega node 'linked' to a Player node in the manner above is isomorphic (for the purposes of decision theory) to the game in which that Omega node is really a Player node belonging to the same information state. In other words, 'Counterfactual Mugging' is actually isomorphic to:

[Diagram: Mugging2]

This latter version is much less of a headache to think about! Similarly, we can simplify and solve The Absent-Minded Driver by noting that it is isomorphic to the following, which can easily be solved:

[Diagram: Absent2]

Even more interesting is the fact that the Absent-Minded Driver turns out to be isomorphic to (a probabilistic variant of) Parfit's Hitchhiker (if we interchange the Omega and Player nodes in the above diagram).

 

Addendum: Do Questions About Subjective Probability Have Answers Irrespective Of One's Decision Theory And Utility Function?

In the short time I've been here, I have seen several people arguing that the answer is 'no'. I want to say that the answer is 'yes' but with a caveat:

We have puzzles like the Absent-Minded Driver (original version) where the player's strategy for a particular information state affects the probability of that information state recurring. It's clear that in such cases, we may be unable to assign a probability to a particular event until the player settles on a particular strategy. However, once the player's strategy is 'set in stone', then I want to argue that regardless of the utility function, questions about the probability of a given player-instance do in fact have canonical answers:

Let's suppose that each player-instance is granted a uniform random number in the interval [0,1]. In a sense this was already implicit, given that we had no qualms about considering the possibility of a mixed strategy. However, let's suppose that each player-instance's random number is now regarded as part of its 'information'. When a player sees (i) that she is somewhere within the 'blue rectangle', and (ii) that her random number is α, then for each player-instance P within the rectangle, she can calculate the probability (or rather density) of the event "P's random number is α" and thereby obtain a conditional probability distribution over player-instances within the rectangle.

Notice that this procedure is entirely independent of decision theory (again, provided that the Player's strategy has been fixed).

In the context of the Sleeping-Beauty problem (much discussed of late) the above recipe is equivalent to asserting that (a) whenever Sleeping Beauty is woken, this takes place at a uniformly distributed time between 8am and 9am and (b) there is a clock on the wall. So whenever SB awakes at time α, she learns the information "α is one of the times at which I have been woken". A short exercise in probability theory suffices to show that SB must now calculate 1/3 probabilities for each of (Heads, Monday), (Tails, Monday) and (Tails, Tuesday) [which I think is fairly interesting given that the latter two are, as far as the prior is concerned, the very same event].
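One way to check the 'short exercise' numerically is to discretise the clock. The sketch below assumes the hour is split into n equally likely ticks and that the two wake-up times under Tails are drawn independently; both are modelling assumptions of this illustration, and the function name is hypothetical.

```python
from fractions import Fraction

def p_heads_given_clock(n):
    """P(Heads | tick a is one of the times at which I have been woken),
    for a clock with n equally likely ticks."""
    half = Fraction(1, 2)
    # Heads: a single awakening, which lands on tick a with probability 1/n.
    heads = half * Fraction(1, n)
    # Tails: two independent awakenings, at least one of which lands on a.
    tails = half * (1 - (1 - Fraction(1, n)) ** 2)
    return heads / (heads + tails)

coarse = p_heads_given_clock(60)      # one tick per minute
fine = p_heads_given_clock(10 ** 6)   # a much finer clock
```

The exact value is n / (3n - 1), which tends to 1/3 as the clock becomes continuous. Under Tails the two awakenings are equally likely to be the one at time α, which splits the remaining 2/3 evenly between (Tails, Monday) and (Tails, Tuesday).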

One can get a flavour of it by considering a much simpler variation: Let α and β be 'random names' for Monday and Tuesday, in the sense that with probability 1/2, (α, β) = ("Monday", "Tuesday") and with probability 1/2, (α, β) = ("Tuesday", "Monday"). Suppose that SB's room lacks a clock but includes a special 'calendar' showing either α or β, but that SB doesn't know which symbol refers to which day.

Then we obtain the following diagram:

[Diagram: SB2]

Nature's first decision determines the meaning of α and β, and its second is the 'coin flip' that inaugurates the Sleeping Beauty problem we know and love. There are now two information states, corresponding to SB's perception of α or β upon waking. Thus, if SB sees the calendar showing α (the orange state, let's say) it is clear that the conditional probabilities for the three possible awakenings must be split (1/3, 1/3, 1/3) as above (note that the two orange 'Tuesday' nodes correspond to the same awakening.)
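The (1/3, 1/3, 1/3) split for the calendar variant can be verified by brute-force enumeration. This is a sketch under the stated setup (two equally likely namings, a fair coin, one awakening on Heads and two on Tails); the strings "alpha" and "beta" stand in for α and β, and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product

def awakenings_showing(symbol):
    """Conditional distribution over (coin, day) awakenings, given that
    the calendar shows `symbol` ("alpha" or "beta")."""
    quarter = Fraction(1, 4)  # prior weight of each (naming, coin) world
    weights = {}
    for alpha_day, coin in product(["Monday", "Tuesday"], ["Heads", "Tails"]):
        days_awake = ["Monday"] if coin == "Heads" else ["Monday", "Tuesday"]
        for day in days_awake:
            shown = "alpha" if day == alpha_day else "beta"
            if shown == symbol:
                weights[(coin, day)] = weights.get((coin, day), 0) + quarter
    total = sum(weights.values())
    return {awakening: w / total for awakening, w in weights.items()}

seen_alpha = awakenings_showing("alpha")
```

In every world at most one awakening shows a given symbol, so weighting each awakening by the prior of its world involves no double counting. That is what makes this variant easier than the clock version.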
