Jul 20, 2009

Suppose you're out in the desert, running out of water, and soon to die - when someone in a motor vehicle drives up next to you. Furthermore, the driver of the motor vehicle is a perfectly selfish ideal game-theoretic agent, and even further, so are you; and what's more, the driver is Paul Ekman, who's really, really good at reading facial microexpressions. The driver says, "Well, I'll convey you to town if it's in my interest to do so - so will you give me $100 from an ATM when we reach town?"

Now of course you wish you could answer "Yes", but as an ideal game theorist yourself, you realize that, once you actually reach town, you'll have no further motive to pay off the driver. "Yes," you say. "You're lying," says the driver, and drives off leaving you to die.

If only you weren't so rational!

This is the dilemma of Parfit's Hitchhiker, and the above is the standard resolution according to mainstream philosophy's causal decision theory, which also two-boxes on Newcomb's Problem and defects in the Prisoner's Dilemma. Of course, any *self-modifying* agent who expects to face such problems - in general, or in particular - will soon self-modify into an agent that doesn't regret its "rationality" so much. So from the perspective of a self-modifying-AI-theorist, classical causal decision theory is a wash. And indeed I've worked out a theory, tentatively labeled "timeless decision theory", which covers these three Newcomblike problems and delivers a first-order answer that is already reflectively consistent, without need to explicitly consider such notions as "precommitment". Unfortunately this "timeless decision theory" would require a long sequence to write up, and it's not my current highest writing priority unless someone offers to let me do a PhD thesis on it.

However, there are some other timeless decision problems for which I do *not* possess a general theory.

For example, there's a problem introduced to me by Gary Drescher's marvelous *Good and Real* (OOPS: The below formulation was independently invented by Vladimir Nesov; Drescher's book actually contains a related dilemma in which box B is transparent, and only contains $1M if Omega predicts you will one-box whether B appears full or empty, and Omega has a 1% error rate) which runs as follows:

Suppose Omega (the same superagent from Newcomb's Problem, who is known to be honest about how it poses these sorts of dilemmas) comes to you and says:

"I just flipped a fair coin. I decided, before I flipped the coin, that if it came up heads, I would ask you for $1000. And if it came up tails, I would give you $1,000,000 if and only if I predicted that you would give me $1000 if the coin had come up heads. The coin came up heads - can I have $1000?"

Obviously, the only reflectively consistent answer in this case is "Yes - here's the $1000", because if you're an agent who expects to encounter many problems like this in the future, you will self-modify to be the sort of agent who answers "Yes" to this sort of question - just like with Newcomb's Problem or Parfit's Hitchhiker.

But I don't have a general theory which replies "Yes". At the point where Omega asks me this question, I already know that the coin came up heads, so I already know I'm not going to get the million. It seems like I want to decide "as if" I don't know whether the coin came up heads or tails, and then implement that decision even if I know the coin came up heads. But I don't have a good formal way of talking about how my decision in one state of knowledge has to be determined by the decision I would make if I occupied a different epistemic state, conditioning using the probability *previously* possessed by events I have *since* learned the outcome of... Again, it's easy to talk informally about why you have to reply "Yes" in this case, but that's not the same as being able to exhibit a general algorithm.
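To make the tension concrete, here is a minimal sketch of the arithmetic in Python. It is only an illustration, not a general algorithm: the function names are made up for this example, and it assumes that Omega's prediction is perfectly accurate.

```python
from fractions import Fraction

P_HEADS = Fraction(1, 2)  # fair coin, as Omega states
COST = 1_000              # what Omega asks for if the coin comes up heads
PRIZE = 1_000_000         # what Omega pays on tails, if it predicts you would pay on heads


def value_before_flip(pays_on_heads: bool) -> Fraction:
    """Expected payoff of a policy, evaluated before the coin is flipped.

    Assumption for this sketch: Omega predicts the policy perfectly.
    """
    heads_payoff = -COST if pays_on_heads else 0
    tails_payoff = PRIZE if pays_on_heads else 0
    return P_HEADS * heads_payoff + (1 - P_HEADS) * tails_payoff


def value_after_seeing_heads(pays_on_heads: bool) -> int:
    """Payoff of the same policy once you already know the coin came up heads."""
    return -COST if pays_on_heads else 0


for policy in (True, False):
    print(policy, value_before_flip(policy), value_after_seeing_heads(policy))
```

Evaluated before the flip, the paying policy is worth $499,500 in expectation and the refusing policy is worth nothing. Evaluated after updating on "heads", paying simply costs $1000. The open problem is a principled rule for when to use the first calculation rather than the second.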

Another stumper was presented to me by Robin Hanson at an OBLW meetup. Suppose you have ten ideal game-theoretic selfish agents and a pie to be divided by *majority vote*. Let's say that six of them form a coalition and decide to vote to divide the pie among themselves, one-sixth each. But then two of them think, "Hey, this leaves four agents out in the cold. We'll get together with those four agents and offer to divide half the pie among the four of them, leaving one quarter apiece for the two of us. We get a larger share than one-sixth that way, and they get a larger share than zero, so it's an improvement from the perspectives of all six of us - they should take the deal." And those six then form a new coalition and redivide the pie. Then another two of the agents think: "The two of us are getting one-eighth apiece, while four other agents are getting zero - we should form a coalition with them, and by majority vote, give each of us one-sixth."

And so it goes on: Every majority coalition and division of the pie is *dominated* by another *majority* coalition in which each agent of the new majority gets *more* pie. There does not appear to be any such thing as a dominant majority vote.
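The cycling is easy to check mechanically. Here is a minimal Python sketch, assuming one particular way of building the new coalition (the function name and the construction are illustrative, not Robin Hanson's): take the six agents with the smallest shares, strip the other four of everything, and split what was taken evenly among the six.

```python
from fractions import Fraction


def dominating_proposal(shares):
    """Return (coalition, new_shares) where a bare majority all strictly gain.

    Illustrative construction: the majority is made of the agents with the
    smallest current shares; the excluded agents lose everything, and their
    pie is split evenly among the coalition members.
    """
    n = len(shares)
    majority = n // 2 + 1
    order = sorted(range(n), key=lambda i: shares[i])  # poorest agents first
    coalition, excluded = order[:majority], order[majority:]
    freed = sum(shares[i] for i in excluded)  # pie taken from the excluded agents
    new_shares = [Fraction(0)] * n
    for i in coalition:
        new_shares[i] = shares[i] + freed / majority
    return coalition, new_shares


# Start from the first coalition in the story: six agents with one-sixth each.
shares = [Fraction(1, 6)] * 6 + [Fraction(0)] * 4
for round_number in range(3):
    coalition, shares = dominating_proposal(shares)
    print(round_number, sorted(coalition), [str(s) for s in shares])
```

Since the four excluded agents always hold some pie between them, the construction never fails to find a dominating majority, so no division is stable under this kind of vote. The specific divisions it produces differ from the ones in the story above, which is fine: there are many dominating coalitions for any given division.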

(Robin Hanson actually used this to suggest that if you set up a Constitution which governs a society of humans and AIs, the AIs will be unable to conspire among themselves to change the constitution and leave the humans out in the cold, because then the new compact would be dominated by yet other compacts and there would be chaos, and therefore any constitution stays in place forever. Or something along those lines. Needless to say, I do not intend to rely on such, but it would be nice to have a formal theory in hand which shows how ideal reflectively consistent decision agents will act in such cases (so we can *prove* they'll shed the old "constitution" like used snakeskin.))

Here's yet another problem whose proper *formulation* I'm still not sure of, and it runs as follows. First, consider the Prisoner's Dilemma. Informally, two timeless decision agents with common knowledge of the other's timeless decision agency, but no way to communicate or make binding commitments, will both Cooperate because they know that the other agent is in a similar epistemic state, running a similar decision algorithm, and will end up doing the same thing that they themselves do. In general, on the True Prisoner's Dilemma, facing an opponent who can accurately predict your own decisions, you want to cooperate only if the other agent will cooperate if and only if they predict that you will cooperate. And the other agent is reasoning similarly: They want to cooperate only if you will cooperate if and only if you accurately predict that they will cooperate.

But there's actually an infinite regress here which is being glossed over - you won't cooperate *just* because you predict that they will cooperate, you will only cooperate if you predict *they* will cooperate *if and only if* you cooperate. So the other agent needs to cooperate if they predict that you will cooperate *if* you predict that they will cooperate... (...only if they predict that you will cooperate, etcetera).

On the Prisoner's Dilemma in *particular*, this infinite regress can be cut short by expecting that the other agent is doing symmetrical reasoning on a symmetrical problem and will come to a symmetrical conclusion, so that you can expect their action to be the symmetrical analogue of your own - in which case (C, C) is preferable to (D, D). But what if you're facing a more general decision problem, with many agents having asymmetrical choices, and everyone wants to have their decisions depend on how they predict that other agents' decisions depend on their own predicted decisions? Is there a general way of resolving the regress?
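As a rough illustration of how the symmetry shortcut changes the answer, here is a small Python sketch. The payoff numbers are hypothetical (any values with the usual Prisoner's Dilemma ordering would do), and the function names are made up for this example.

```python
# Hypothetical payoffs to "me", indexed by (my_action, their_action).
# Any numbers with temptation > reward > punishment > sucker would do.
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}


def best_reply_holding_other_fixed(their_action):
    """Causal-style reasoning: treat the other agent's action as fixed."""
    return max("CD", key=lambda mine: PAYOFF[(mine, their_action)])


def best_action_under_symmetry():
    """Symmetry shortcut: assume the other agent's action mirrors my own,
    so only the outcomes (C, C) and (D, D) are on the table."""
    return max("CD", key=lambda mine: PAYOFF[(mine, mine)])


print(best_reply_holding_other_fixed("C"), best_reply_holding_other_fixed("D"))
print(best_action_under_symmetry())
```

Holding the other agent's action fixed, defection wins either way. Restricting attention to the mirrored outcomes, cooperation wins. The sketch says nothing about asymmetric problems, where there is no diagonal to restrict to.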

On Parfit's Hitchhiker and Newcomb's Problem, we're *told* how the other agent behaves as a *direct* function of our own predicted decision - Omega rewards you if you (are predicted to) one-box, the driver in Parfit's Hitchhiker saves you if you (are predicted to) pay $100 on reaching town. My timeless decision theory only functions in cases where the other agents' decisions can be viewed as functions of one argument, that argument being your own choice in that particular case - either by specification (as in Newcomb's Problem) or by symmetry (as in the Prisoner's Dilemma). If their decision is allowed to depend on how your decision *depends on* their decision - like saying, "I'll cooperate, not 'if the other agent cooperates', but *only* if the other agent cooperates *if and only if I cooperate* - if I predict the other agent to cooperate *unconditionally*, then I'll just defect" - then in general I do not know how to resolve the resulting infinite regress of conditionality, except in the special case of predictable symmetry.
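The shape of the direct case can be sketched in a few lines of Python. This is not the timeless decision theory itself, only an illustration of what "a function of one argument" means here. The names are made up, the predictor is assumed to be perfectly accurate, the $1,000 in box A is the usual statement of Newcomb's Problem rather than anything given above, and the dollar value placed on surviving the desert is an arbitrary stand-in.

```python
def choose(options, response, payoff):
    """Pick the option whose predicted consequences are best, where the other
    agent's move is a function of exactly one argument: my own choice."""
    return max(options, key=lambda mine: payoff(mine, response(mine)))


# Newcomb's Problem: Omega fills box B if and only if it predicts one-boxing.
def omega(my_choice):
    return "B full" if my_choice == "one-box" else "B empty"


def newcomb_payoff(my_choice, box_state):
    box_b = 1_000_000 if box_state == "B full" else 0
    box_a = 1_000 if my_choice == "two-box" else 0  # standard amount, assumed
    return box_a + box_b


# Parfit's Hitchhiker: the driver gives a ride if and only if he predicts payment.
VALUE_OF_SURVIVAL = 1_000_000  # arbitrary stand-in for "not dying in the desert"


def driver(my_choice):
    return "ride" if my_choice == "pay" else "no ride"


def hitchhiker_payoff(my_choice, driver_action):
    if driver_action == "no ride":
        return 0
    return VALUE_OF_SURVIVAL - (100 if my_choice == "pay" else 0)


print(choose(["one-box", "two-box"], omega, newcomb_payoff))
print(choose(["pay", "refuse"], driver, hitchhiker_payoff))
```

Nothing like this works once `response` needs to take, as its argument, not my choice but the way my choice depends on the other agent's choice. At that point the argument is itself a function that refers back to `response`, which is the regress.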

You perceive that there is a definite note of "timelessness" in all these problems.

Any offered solution may assume that a timeless decision theory for direct cases already exists - that is, if you can reduce the problem to one of "I can predict that if (the other agent predicts) I choose strategy X, then the other agent will implement strategy Y, and my expected payoff is Z", then I already have a reflectively consistent solution which this margin is unfortunately too small to contain.

(In case you're wondering, I'm writing this up because one of the SIAI Summer Project people asked if there was any Friendly AI problem that could be modularized and handed off and potentially written up afterward, and the answer to this is almost always "No", but this is actually the one exception that I can think of. (Anyone actually taking a shot at this should probably familiarize themselves with the existing literature on Newcomblike problems - the edited volume "Paradoxes of Rationality and Cooperation" should be a sufficient start (and I believe there's a copy at the SIAI Summer Project house.)))