Counterfactual Mugging Poker Game

by Scott Garrabrant · 1 min read · 13th Jun 2018 · 2 comments

Crossposted from the AI Alignment Forum.

Consider the following game:

Player A receives a card at random that is either High or Low. He may reveal his card if he wishes.

Player B then chooses a probability p that Player A has a high card.

Player A always loses p² dollars. Player B loses p² dollars if the card is low and (1−p)² dollars if the card is high.

Note that Player B has been given a proper scoring rule, and so is incentivized to give his true probability (unless he makes some deal with player A).
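A quick numerical check of this claim (a sketch of my own, not code from the post): under the loss structure above, B's expected loss is q·(1−p)² + (1−q)·p² when the card is High with true probability q and B reports p, and a grid search confirms that the minimizing report is p = q.

```python
# Verify that Player B's scoring rule is proper: B minimizes expected
# loss by reporting the true probability q that the card is High.
# Loss structure from the game: B loses p^2 if Low, (1-p)^2 if High.

def b_expected_loss(p, q):
    """B's expected loss when reporting p, given P(High) = q."""
    return q * (1 - p) ** 2 + (1 - q) * p ** 2

def best_report(q, grid_size=10001):
    """Grid-search the report p that minimizes B's expected loss."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return min(grid, key=lambda p: b_expected_loss(p, q))

if __name__ == "__main__":
    for q in (0.0, 0.25, 0.5, 0.9):
        print(f"true q = {q:.2f}  ->  optimal report p = {best_report(q):.2f}")
```

Setting the derivative −2q(1−p) + 2(1−q)p to zero gives p = q directly, so the grid search is just a sanity check.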

You are playing this game as player A. You only play one time. You are looking at a low card. Player B is not trying to make a deal with you, and will report his true probability. Player B is very good at reasoning about you, but you are in a separate room, so Player B cannot read any tells unless you show the card. Do you show your card?

Since your card is low, if you show it to player B, you will lose nothing, the best possible outcome. However, if player B reasons that you would show your card if it were low, then in the counterfactual world in which you got a high card, player B would know you had a high card because you refused to show it. Thus, you would lose a full dollar in those counterfactual worlds.

If you choose not to reveal your card, player B will assign probability 1/2 and you will lose a quarter.
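The trade-off between the two policies can be made explicit (my own arithmetic, assuming B is a perfect reasoner who knows A's policy in advance): revealing only low cards wins in the world you are actually in, but loses ex-ante.

```python
# Compare Player A's ex-ante expected loss (A loses p^2 dollars)
# under two policies, with High/Low each dealt with probability 1/2.

def a_loss(p):
    return p ** 2

# Policy 1: reveal iff the card is Low. A perfect-reasoner B assigns
# p = 0 on a reveal and p = 1 on a refusal (refusal proves High).
reveal_if_low = 0.5 * a_loss(0.0) + 0.5 * a_loss(1.0)

# Policy 2: never reveal. B gets no information, so p = 1/2 either way.
never_reveal = 0.5 * a_loss(0.5) + 0.5 * a_loss(0.5)

print(f"reveal-if-low: {reveal_if_low}")  # 0.5
print(f"never-reveal:  {never_reveal}")   # 0.25
```

So committing to secrecy costs you a quarter when the card is low, but saves you a quarter in expectation before the card is dealt.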

I like this variant of the counterfactual mugging because it takes the agency out of the predictor. In the standard counterfactual mugging, you might reject the hypothetical and think that the predictor is trying to trick you. Here, there is a sense in which you are creating the counterfactual mugging yourself by trying to be able to keep secrets.

Also, think about this example the next time you are tempted to say that someone would only Glomarize if they had an important secret.


Comments

Good point. I noticed some time ago that some UDT problems can be seen as cooperative games; apparently a wider set of problems with predictors can be seen as non-cooperative games, representing the predictor as an ordinary player who's punished for predicting wrong, which unlocks the predictive magic inherent in Nash equilibrium :-) Wonder if this can be pushed even further.

The game is indeed a clean example of Glomarization.

I might have misunderstood your main point, which I interpret as: “because of the counterfactual that I could have gotten a high card in this game, I shouldn't reveal a low card with probability 1.” Are you sure that it's because of the counterfactual in the current game, and not the possible consequences in my later interactions with B?

I would reveal my card if the game was truly unique, in that zero information leaks out. (Suppose B's memories of this game are erased afterwards.)

In real life, my decision would affect Player B's image of me, which affects how he will reason about similar games against me in the future. (And even how people close to him will reason about people like me.)

A multi-agent influence diagram of the iterated version shows how one can screw himself over in later games. If A first hides, then B cannot update her model of A. [Diagram omitted.]

If A first reveals a low card, then a pattern that's almost-but-not-quite-like Newcomb's is revealed. [Diagram omitted.]