# The Happy Dance Problem

17th Nov 2017

[Cross-posted from IAFF.]

Since the invention of logical induction, people have been trying to figure out what logically updateless reasoning could be. This is motivated by the idea that, in the realm of Bayesian uncertainty (IE, empirical uncertainty), updateless decision theory is the simple solution to the problem of reflective consistency. Naturally, we’d like to import this success to logically uncertain decision theory.

At a research retreat during the summer, we realized that updateless decision theory wasn’t so easy to define even in the seemingly simple Bayesian case. A possible solution was written up in Conditioning on Conditionals. However, that didn’t end up being especially satisfying.

Here, I introduce the happy dance problem, which more clearly illustrates the difficulty in defining updateless reasoning in the Bayesian case. I also outline Scott’s current thoughts about the correct way of reasoning about this problem.

(Ideas here are primarily due to Scott.)

## The Happy Dance Problem

Suppose an agent has some chance of getting a pile of money. In the case that the agent gets the pile of money, it has a choice: it can either do a happy dance, or not. The agent would rather not do the happy dance, as it is embarrassing.

I’ll write “you get a pile of money” as M, and “you do a happy dance” as H.

So, the agent has the following utility function:

• U(¬M) = \$0
• U(M & ¬H) = \$1000
• U(M & H) = \$900

A priori, the agent assigns the following probabilities to events:

• P(¬M) = .5
• P(M & ¬H) = .1
• P(M & H) = .4

IE, the agent expects itself to do the happy dance.

## Conditioning on Conditionals

In order to make an updateless decision, we need to condition on the policy of dancing, and on the policy of not dancing. How do we condition on a policy? We could change the problem statement by adding a policy variable and putting in the conditional probabilities of everything given the different policies, but this is just cheating: in order to fill in those conditional probabilities, you need to already know how to condition on a policy. (This simple trick seems to be what kept us from noticing that UDT isn’t so easy to define in the Bayesian setting for so long.)

A naive attempt would be to condition on the material conditional representing each policy, M⊃H and M⊃¬H. This gets the wrong answer. The material conditional simply rules out the one outcome inconsistent with the policy.

Conditioning on M⊃H, we get:

• P(¬M) = .555
• P(M & H) = .444

For an expected utility of \$400.

Conditioning on M⊃¬H, we get:

• P(¬M) = .833
• P(M & ¬H) = .166

For an expected utility of \$166.66.
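The two calculations above can be reproduced with a minimal sketch. The outcome labels (`notM`, `M_notH`, `M_H`) and helper names are my own illustrative choices, not part of the original problem statement; exact fractions are used so the rounded figures in the text can be checked.

```python
from fractions import Fraction as F

# Prior and utilities from the problem statement; outcome labels are illustrative.
prior = {"notM": F(5, 10), "M_notH": F(1, 10), "M_H": F(4, 10)}
utility = {"notM": 0, "M_notH": 1000, "M_H": 900}

def condition(dist, keep):
    """Restrict dist to the outcomes in `keep` and renormalize."""
    total = sum(p for o, p in dist.items() if o in keep)
    return {o: p / total for o, p in dist.items() if o in keep}

def expected_utility(dist):
    return sum(p * utility[o] for o, p in dist.items())

# The material conditional M⊃H rules out only (M & ¬H);
# M⊃¬H rules out only (M & H).
dance = condition(prior, {"notM", "M_H"})
no_dance = condition(prior, {"notM", "M_notH"})

eu_dance = expected_utility(dance)        # exactly 400
eu_no_dance = expected_utility(no_dance)  # exactly 500/3, about 166.67
print(float(eu_dance), float(eu_no_dance))
```

Because renormalizing spreads the removed mass over the remaining outcomes, removing the large (M & H) outcome inflates P(¬M) far more than removing the small (M & ¬H) outcome does.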

So, to sum up, the agent thinks it should do the happy dance because refusing to do the happy dance makes worlds where it gets the money less probable. This doesn’t seem right.

Conditioning on Conditionals solved this by sending the probabilistic conditional P(H|M) to one or zero to represent the effect of a policy, rather than using the material conditional. However, this approach is unsatisfactory for a different reason.
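As a rough sketch of that alternative (my reading of it, not a definitive implementation): override P(H|M) with 1 or 0 and leave P(M) at its prior value of .5. The function and constant names are hypothetical.

```python
from fractions import Fraction as F

P_MONEY = F(1, 2)  # P(M) from the prior: .1 + .4
U_MONEY_DANCE, U_MONEY_NO_DANCE, U_NO_MONEY = 900, 1000, 0

def eu_given_policy(p_dance_given_money):
    """Expected utility with P(H|M) overridden and P(M) left untouched."""
    p_mh = P_MONEY * p_dance_given_money
    p_mnh = P_MONEY * (1 - p_dance_given_money)
    return (p_mh * U_MONEY_DANCE
            + p_mnh * U_MONEY_NO_DANCE
            + (1 - P_MONEY) * U_NO_MONEY)

eu_dance = eu_given_policy(1)     # policy "dance": P(H|M) = 1
eu_no_dance = eu_given_policy(0)  # policy "don't dance": P(H|M) = 0
print(eu_dance, eu_no_dance)
```

Under these assumptions, not dancing comes out ahead (\$500 against \$450), which is the intuitively right answer for Happy Dance. Note that P(M) is the same under both policies by construction.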

Happy dance is similar to Newcomb’s problem with a transparent box (where Omega judges you on what you do when you see the full box): doing the dance is like one-boxing. Now, the correlation between doing the dance and getting the pile of money comes from Omega rather than just being part of an arbitrary prior. But, sending the conditional probability of one-boxing upon seeing the money to one doesn’t make the world where the pile of money appears any more probable. So, this version of updateless reasoning gets transparent-box Newcomb wrong. There isn’t enough information in the probability distribution to distinguish it from Happy Dance style problems.

## Observation Counterfactuals

We can solve the problem in what seems like the right way by introducing a basic notion of counterfactual, which I’ll write ◻→. This is supposed to represent “what the agent’s code will do on different inputs”. The idea is that if we have the policy of dancing when we see the money, M◻→H is true even in the world where we don’t see any money. So, even if dancing upon seeing money is a priori probable, conditioning on not doing so knocks out just as much probability mass from non-money worlds as from money worlds. However, if a counterfactual A◻→B is true and A is true, then its consequent B must also be true. So, conditioning on a policy does change the probability of taking actions in the expected way.

In Happy Dance, there is no correlation between M◻→H and M; so, we can condition on M◻→H and M◻→¬H to decide which policy is better, and get the result we expect. In Newcomb’s problem, on the other hand, there is a correlation between the policy chosen and whether the pile of money appears, because Omega is checking what the agent’s code does if it sees different inputs. This allows the decision theory to produce different answers in the different problems.
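To make the contrast concrete, here is a small sketch in which each problem is summarized by P(M | policy). The Happy Dance numbers follow from the stated independence and the prior P(M) = .5; the Newcomb numbers (a 99% accurate Omega) are purely an assumption for illustration, as are all the names.

```python
from fractions import Fraction as F

U_MONEY_DANCE, U_MONEY_NO_DANCE, U_NO_MONEY = 900, 1000, 0

def eu_of_policy(p_money_given_policy, dances_on_money):
    """EU after conditioning on M◻→H (True) or M◻→¬H (False)."""
    p_m = p_money_given_policy[dances_on_money]
    # If the counterfactual holds and M is true, the action follows the policy.
    u_money = U_MONEY_DANCE if dances_on_money else U_MONEY_NO_DANCE
    return p_m * u_money + (1 - p_m) * U_NO_MONEY

# Happy Dance: the policy is uncorrelated with M, so P(M | policy) = P(M) = .5.
happy_dance = {True: F(1, 2), False: F(1, 2)}

# Transparent Newcomb: ASSUMED 99% accurate Omega, who fills the box
# only when it predicts the "dance" (one-box) policy.
newcomb = {True: F(99, 100), False: F(1, 100)}

for name, problem in [("happy dance", happy_dance), ("newcomb", newcomb)]:
    print(name, eu_of_policy(problem, True), eu_of_policy(problem, False))
```

With these inputs, not dancing wins in Happy Dance (500 against 450) while the dance-like policy wins in Newcomb (891 against 10). The same decision rule separates the two problems only because the policy variable carries the correlation with M.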

It’s not clear where the beliefs about this correlation come from, so these counterfactuals are still almost as mysterious as explicitly giving conditional probabilities for everything given different policies. However, it does seem to say something nontrivial about the structure of reasoning.

Also, note that these counterfactuals are in the opposite direction from what we normally think about: rather than the counterfactual consequences of actions we didn’t take, now we need to know the counterfactual actions we’d take under outcomes we didn’t see!

7 comments:

Approach #1 seems to be naive EDT, which is pretty nonstandard. I'd expect more typical reasoning to look like a causal model with two nodes, Money->Agent, where considering different hypothetical strategies changes the behavior of the Agent node.

The observation counterfactuals thing is pretty interesting. But I think it might end up duplicating causal reasoning if you poke at it enough.

Approach #1 is supposed to be a naive updateless-EDT, yeah. What do you think an updateless-CDT approach would be? Perhaps, whereas regular CDT would causally condition on the action, updateless-CDT would change the conditional probabilities in the causal network? That would be the same as the earlier conditioning-on-conditionals approach, in this case. (So it would two-box in transparent Newcomb.) It could differ from that approach if the causal network doesn't make the observation set equal the set of parents, though -- although it's unclear how you'd define updateless-CDT in that case.

I would expect something called updateless-CDT to have a causal model of the world, with nodes that it's picked out (by some magical process) as nodes controlled by the agent, and then it maximizes a utility function over histories of the causal model by following the utility-maximizing strategy, which is a function from states of knowledge at a controlled node (state of some magically-labeled agent nodes that are parents of the controlled node?) to actions (setting the state of the controlled node).

If the magical labeling process has labeled no nodes inside Omega as controlled, then this will probably two-box even on standard Newcomb. On the other hand, if Omega is known to fully simulate the agent, then we might suppose that updateless-CDT plans as if its strategy is controlling Omega's prediction, and always one-box even with transparent boxes.

I haven't read Conditioning on Conditionals yet. I am doing so now, but could you explain more about the similarities you were thinking of?

Yeah, I agree that updateless-CDT needs to somehow label which nodes it controls.

You're glossing over a second magical part, though:

> and then it maximizes a utility function over histories of the causal model by following the utility-maximizing strategy,

How do you calculate the expected utility of following a strategy? How do you condition on following a strategy? That's the whole point here. You obviously can't just condition on taking certain values of the nodes you control, since a strategy takes different actions in different worlds; so, regular causal conditioning is out. You can try conditioning on the material conditionals specifying the strategy, which falls on its face as mentioned.

That's why I jumped to the idea that UCDT would use the conditioning-on-conditionals approach. It seems like what you want to do, to condition on a strategy, is change the conditional probabilities of actions given their parent nodes.

Also, I agree that conditioning-on-conditionals can work fine if combined with a magical locate-which-nodes-you-control step. Observation-counterfactuals are supposed to be a less magical way of dealing with the problem.

Yeah, I agree that observation-counterfactuals are what you'd like the UCDT agent to be thinking of as a strategy - a mapping between information-states and actions.

The reason I used weird language like "state of magically labeled nodes that are parents of the controlled nodes" is just because of how it's nontrivial to translate the idea of "information available to the agent" into a naturalized causal model. But if that's what the agent is using to predict the world, I think that's what things have to get cashed out into.

I'm confused how we can assign probabilities to what the agent will do as above and also act as though the agent is an updateless agent, as the updateless agent will presumably never do the Happy Dance. You've argued against this in the Smoking Lesion, so why can we do it here?

I agree, that's a serious issue with the setup here. The simple answer is that I didn't think of that when I was writing the post. I later noticed the problem, but how to react isn't totally obvious.

Defense #1: An easy response is that I was talking about updateful DTs in my smoking lesion discussion. If a DT learns, it is hard to see why it would have seriously miscalibrated estimates of its own behavior. For UDT, there is no similar argument. Therefore the post as written above stands.

Reply: Perhaps that's not very satisfying, though -- despite UDT's fixed prior, failure due to lack of calibration about oneself seems like a particularly damning sort of failure. We might construct the prior using something similar to a reflective oracle to block this sort of problem.

Defense #2: Then, the next easy response is that material-conditional-based UDT 1.0 with such a self-knowledgeable prior has two possible fixed points. The probability distribution described in the post isn't one of them, but one with a more extreme assignment favoring dancing is: if the prior expects the agent to dance with certainty or almost certainly, then dancing looks good, and not dancing looks like a way to guarantee you don't get the money. Again, the concern raised in the post is a valid one, just requiring a tweak to the probabilities in the example.
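One hedged way to illustrate the fixed-point claim: keep P(M) = .5, let q stand for the prior P(H|M), and compare the two material-conditional expected utilities as q varies. Restricting "fixed point" to deterministic self-predictions (q equal to 0 or 1) is my simplifying assumption, and the function names are hypothetical.

```python
from fractions import Fraction as F

P_MONEY = F(1, 2)

def material_conditional_eus(q):
    """q is the prior P(H|M). Returns (EU given M⊃H, EU given M⊃¬H)."""
    p_mh = P_MONEY * q
    p_mnh = P_MONEY * (1 - q)
    p_nm = 1 - P_MONEY
    eu_dance = 900 * p_mh / (p_nm + p_mh)        # (M & ¬H) ruled out
    eu_no_dance = 1000 * p_mnh / (p_nm + p_mnh)  # (M & H) ruled out
    return eu_dance, eu_no_dance

def is_self_consistent(q):
    """For q in {0, 1}: does the policy that looks best match the prediction?"""
    eu_dance, eu_no_dance = material_conditional_eus(q)
    return (eu_dance > eu_no_dance) == (q == 1)

for q in (F(0), F(4, 5), F(1)):
    print(q, material_conditional_eus(q))
```

With q = 4/5 this reproduces the post's figures (400 and about 166.67). With q = 1, dancing scores 450 and not dancing scores 0, so predicting "dance" is self-confirming; with q = 0, the scores are 0 and 500, so predicting "don't dance" is self-confirming too.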

Reply: Sure, but the solution in this case is very clear: you have to select the best fixed point. This seems like an option which is available to the agent, or to the agent designer.

Defense #3: True, but then you're essentially taking a different counterfactual to decide the consequences of a policy: consideration of what fixed point it puts you in. This implies that you have something richer than just a probability distribution to work with, vindicating the overall point of the post, which is to discuss an issue which arises if you try to "condition on a conditional" when given only a probability distribution on actions and outcomes. Reasoning involving fixed points is going to end up being a (very particular) way to add a more basic counterfactual, as suggested by the post.

Also, even if you do this, I would conjecture there's going to be some other problem with using the material conditional formulation of conditioning-on-conditionals. I would be interested if this turned out not to be true! Maybe there's some proof that the material-conditional approach turns out not to be equivalent to other possible approaches under some assumptions relating to self-knowledge and fixed-points. That would be interesting.

Also also, if we take the fixed-point idea seriously, there are problems we run into there as well. Reflective oracles (and their bounded cousins, for constructing computable priors) don't offer a wonderful notion of counterfactual. Selecting a fixed point offers some logical control over predictors which themselves call the reflective oracle to predict you, but if a predictor does something else (perhaps even re-computes the reflective oracle in a slightly different way, side-stepping a direct call to it but simulating it anyway), the result of using selection of fixed point as a notion of counterfactual could be intuitively wrong. You could try to define a special type of reflective oracle which lacks this problem. You could also try other options like conditional oracles. But, it isn't clear how everything should fit together. In particular, if the oracle itself is treated as a part of the observation, what is the type of a policy?

So, "select the best fixed point" may not be the straightforward option it sounds like.

Reply: This seems to not take the concern seriously enough. The overall type signature of "conditioning on conditionals" seems wrong here. The idea of having a probability distribution on actions may be wrong, stopping the argument in the post in its tracks -- IE, the post may be right in its conclusion that there is a problem, but we should have been reasoning in a way which never went down that wrong path in the first place, and the conclusion of the post is making too small of a change to accomplish that.

For example, maybe distributed oracles offer a better picture of decision-making: the real process of deciding occurs in the construction of the fixed point, with nothing left over to decide once a fixed point has been constructed.

Clearly matters are getting too complicated for a simple correction to the argument in the post.

Defense #4: I still stand by the post as a cautionary tale about how not to define UDT, barring any "if you deal with self-reference appropriately, the material conditional option turns out to be equivalent to [some other options]" result, which could make me think the problem is more fundamental as opposed to a problem with a naive material-conditional approach to conditioning. The post might be improved by explicitly dealing with the self-reference issue, but the fact that it's not totally clear how to do so (ie 'select the best fixed point' seems to fix things on the surface but has its own more subtle issues when considered as a general approach) makes such a treatment potentially very complicated, so that it's better to look at the happy dance problem without explicitly worrying about all of that.

The basic point of the post is that formally specifying UDT is complicated even if you assume classical Bayesian probability without worrying about logical uncertainty. Making UDT into a simple well-defined object requires the further assumption that there's a basic 'policy' object (the observation counterfactual, in the language of the post), with known probabilistic relationships to everything else. This essentially just gives you all the counterfactuals you need, begging the question of where such counterfactual information comes from. This point stands, however naive we might think such an approach is.