Continuation of:

Eliezer has convinced me to one-box Newcomb's problem, but I'm not ready to Cooperate in one-shot PD yet. In, Eliezer wrote:

On the 100th [i.e. final] move of the iterated dilemma, I cooperate if and only if I expect the paperclipper to cooperate if and only if I cooperate, that is:

Eliezer.C <=> (Paperclipper.C <=> Eliezer.C)

The problem is, the paperclipper would like to deceive Eliezer into believing that Paperclipper.C <=> Eliezer.C, while actually playing D. This means Eliezer has to expend resources to verify that Paperclipper.C <=> Eliezer.C really is true with high probability. If the potential gain from cooperation in a one-shot PD is less than this cost, then cooperation isn't possible. In Newcomb's Problem, the analogous issue can be assumed away, by stipulating that Omega will see through any deception. But in the standard game theory analysis of one-shot PD, the opposite assumption is made, namely that it's impossible or prohibitively costly for players to convince each other that Player1.C <=> Player2.C.

It seems likely that this assumption is false, at least for some types of agents and sufficiently high gains from cooperation. In, I asked how superintelligences can prove their source code to each other, and Tim Freeman responded with this suggestion:

Entity A could prove to entity B that it has source code S by consenting to be replaced by a new entity A' that was constructed by a manufacturing process jointly monitored by A and B.  During this process, both A and B observe that A' is constructed to run source code S.  After A' is constructed, A shuts down and gives all of its resources to A'.

But this process seems quite expensive, so even SIs may not be able to play Cooperate in one-shot PD, unless the stakes are pretty high. Are there cheaper solutions, perhaps ones that can be applied to humans as well, for players in one-shot PD to convince each other what decision systems they are using?

On a related note, Eliezer has claimed that truly one-shot PD is very rare in real life. I would agree with this, except that the same issue also arises from indefinitely repeated games where the probability of the game ending after the current round is too high, or the time discount factor is too low, for a tit-for-tat strategy to work.
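To make the "probability of ending too high, or discount factor too low" condition concrete, here is a minimal sketch. It assumes the conventional payoff ordering T > R > P > S (temptation, reward, punishment, sucker's payoff) and a single continuation weight delta that folds together the chance of another round and the time discount factor. The function name and the example payoffs 5, 3, 1, 0 are illustrative assumptions, and only two standard deviations from tit-for-tat are checked (defect forever, and alternate defect/cooperate).

```python
from fractions import Fraction

def tit_for_tat_threshold(T, R, P, S):
    """Smallest continuation weight delta at which tit-for-tat against
    tit-for-tat resists two deviations. Expects integer payoffs with
    T > R > P > S (an assumed, conventional PD ordering)."""
    if not (T > R > P > S):
        raise ValueError("expected a Prisoner's Dilemma ordering T > R > P > S")
    # Deviation 1: defect forever.
    # Cooperating pays R/(1-d); deviating pays T + d*P/(1-d).
    # Cooperation is at least as good iff d >= (T-R)/(T-P).
    always_defect = Fraction(T - R, T - P)
    # Deviation 2: alternate D, C, D, C, ...
    # Deviating pays (T + d*S)/(1-d**2).
    # Cooperation is at least as good iff d >= (T-R)/(R-S).
    alternate = Fraction(T - R, R - S)
    return max(always_defect, alternate)

threshold = tit_for_tat_threshold(5, 3, 1, 0)
print(threshold)  # 2/3
```

Under these assumptions, if the continuation weight is below the threshold (here 2/3), tit-for-tat stops being self-enforcing and the repeated game behaves like the one-shot case. A threshold above 1 would mean tit-for-tat is never stable for that payoff matrix.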



Let's say I'm about to engage in a one-shot PD with Jeffreyssai. I know that Jeffreyssai has participated in one-shot PDs with a variety of other agents, including humans, cats and dogs, paperclip maximizers from other universes, and Babyeaters, and that the outcomes have been:

  • 87% D,D
  • 12% C,C
  • 1% C,D or D,C

In this scenario it seems obvious that I should cooperate with Jeffreyssai. Real life situations aren't likely to be this clear-cut but similar reasoning can still apply.

Edit: To make my point more explicit, all we need is common knowledge that we are both consistent one-boxers, with enough confidence to shift the expected utility of cooperation higher than that of defection. We don't need to exchange source code; we can use any credible signal that we one-box, such as pointing to evidence of past one-boxing.

Assuming agents participate in a sufficient number of public one-shot PDs is essentially the same as playing iterated PD. If true one-shot PDs are rare, it's doubtful there'd be enough historical evidence about your opponent to be certain of anything.

No, it's not really the same at all.

In a classical iterated PD the only motive for cooperating is to avoid retaliation on the next round, and this only works if the number of rounds is either infinite or large but unknown.

However, if we use a decision theory that wins at Newcomblike problems, we are both essentially taking the role of (imperfect) Omegas. If we both know that we both know that we both one-box on Newcomblike problems, then we can cooperate (analogous to putting the $1M in the box). It doesn't matter if the number of rounds is known and finite, or even if there is only one round. My action depends on my confidence in the other's ability to correctly play Omega relative to the amount of utility at stake.

There's no particular requirement for this info to come from a track record of public one-shot PDs. That's just the most obvious way humans could do it without using brain emulations or other technologies that don't exist yet.

Although I doubt it's possible for a normal human to be 99% accurate as in my example, any accuracy better than chance could make it desirable to cooperate, depending on the payoff matrix.
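As a rough sketch of how accuracy and the payoff matrix trade off, assume the other player's move matches mine with probability p regardless of what I choose, and the usual ordering T > R > P > S. Both the modeling assumption and the function names are mine, added for illustration.

```python
def expected_utilities(p, T, R, P, S):
    """Expected utility of cooperating and of defecting, assuming the other
    player's move mirrors mine with probability p (a simplifying assumption)."""
    eu_cooperate = p * R + (1 - p) * S
    eu_defect = p * P + (1 - p) * T
    return eu_cooperate, eu_defect

def cooperation_threshold(T, R, P, S):
    """Accuracy p above which cooperating beats defecting.
    From p*R + (1-p)*S > p*P + (1-p)*T  <=>  p > (T-S) / ((R-P) + (T-S))."""
    return (T - S) / ((R - P) + (T - S))

# Illustrative payoffs, not taken from the thread.
print(cooperation_threshold(5, 3, 1, 0))      # modest gain from cooperation
print(cooperation_threshold(101, 100, 1, 0))  # large gain from cooperation
print(expected_utilities(0.99, 5, 3, 1, 0))
```

Under these assumptions the threshold always sits above one half, because T - S is larger than R - P, but it approaches one half as the gain from mutual cooperation grows relative to the temptation to defect. That is the sense in which accuracy only slightly better than chance can suffice for some payoff matrices.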

I don't think saturn's method works, unfortunately, because I can't tell why Jeffreyssai has played C in the past. It could be because he actually uses a decision theory that plays C in one-shot PD if Player1.C <=> Player2.C, or because he just wants others to think that he uses such a decision theory. The difference would become apparent if the outcome of the particular one-shot PD I'm going to play with Jeffreyssai won't be made public.

With regard to source proof, the process you mention is only expensive if it must be performed every time for any two AIs. If you're already running on a piece of hardware with some kind of remote attestation capability then the cost could be significantly lower. Not useful for humans, though.

Oh, small internet. I went looking for a link to a system I ran across a few years ago called RPOW as a sample application of current technology along these lines, and found the author to be none other than Hal Finney! Is that you, Hal? My compliments on the very interesting project if so.

RPOW is indeed my project. Part of my interest in trusted computing technology is as a sort of prelude or prototype for some of the issues which will arise once we are able to prove various aspects of our source code to one another.

Don't want to take things too off-topic so if this gets long we'll take it elsewhere, but: is there any hope of doing similar things on Linux systems with TPM any time soon?

Are there cheaper solutions

If you have good reason to believe the superintelligence is a successful extrapolation of the values of its creators, simulate them (and their discussion partners) a few million times pondering appropriate subjects - PD, Newcomb's, and similar problems. That should give you a good idea, with much less computronium spent than a mutual simulation or rewriting pact with the other SI would cost.

If you have good reason to believe the superintelligence is a successful extrapolation of the values of its creators [...]

This seems to abstract the problem so that you have two problems instead of one: whether the SI is a successful extrapolation, and whether the creators' claims about their values are valid. This seems less efficient unless one or both of these were already known to begin with.

You don't need to trust the creators' claims - you're running their simulations, and you're damn good at understanding them and extrapolating the consequences because, well, you're superintelligent! Why would they even know they're simulated? They're just discussing one-shot PD on some blog.

As for the SI being a successful extrapolation, you run a few simulations of its birth the same way, starting a few decades before. It's still cheaper and less messy than organizing a mutual reprogramming with the brain that's made of the next galaxy.

Then the problem largely reduces to:

  • Verifying that the data you passed each other about your births is accurate.

  • Verifying ethical treatment of each other's simulated creators - no "victory candescence" when you get your answer!


Are there cheaper solutions

If you have good reason to believe the superintelligence is a successful extrapolation of the values of its creators, simulate them a few million times discussing problems of a similar nature - PD, Newcomb's, and similar problems. That should give you a good idea, with much less computronium spent than a mutual simulation or rewriting pact with the other SI would cost.

Meta-game the PD so that each player is allowed a peek at the pending answers before they are finalized and is given the option to stay or switch. If either player switches, repeat the process. Once both players stay, input their decisions and dole out rewards.

The scenario would look like this:

Player 1 chooses C
Player 2 chooses D

During the peek, Player 1 notices he will lose so he will switch to D.
During the next peek, both players are satisfied and stay at D.
Result: D, D


Player 1 chooses C
Player 2 chooses C

During the peek, neither switch
Result: C, C


Player 1 chooses C
Player 2 chooses C

During the peek, Player 2 switches to D
During the next peek, Player 1 switches to D
During the third peek, neither player switches
Result: D, D

The caveat is that this requires a trustworthy system that holds each player's answer, enforces the rules for finalizing it, and lets players change their answer after each peek. Under such a system it is perfectly logical to choose Cooperate because there is no risk of losing to a Defector.
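The three scenarios above can be replayed with a short simulation. This is only a sketch under assumptions of my own: both players decide simultaneously at each peek, and the two policies (`retaliator`, `exploiter`) are hypothetical stand-ins for how a player might react, not part of the proposal itself.

```python
def run_peek_game(move1, move2, policy1, policy2, max_peeks=100):
    """Repeat the peek step until neither player switches, then return the
    finalized pair of moves. Each policy maps (own, other) -> next move."""
    for _ in range(max_peeks):
        # Both players see the pending pair and decide at the same time.
        next1 = policy1(move1, move2)
        next2 = policy2(move2, move1)
        if (next1, next2) == (move1, move2):
            return move1, move2  # both stayed, so the answers are final
        move1, move2 = next1, next2
    raise RuntimeError("players never settled within max_peeks")

def retaliator(own, other):
    # Hypothetical policy: keep my move unless the other is set to defect.
    return "D" if other == "D" else own

def exploiter(own, other):
    # Hypothetical policy: always switch to (or stay at) defection.
    return "D"

print(run_peek_game("C", "D", retaliator, retaliator))  # ('D', 'D')
print(run_peek_game("C", "C", retaliator, retaliator))  # ('C', 'C')
print(run_peek_game("C", "C", retaliator, exploiter))   # ('D', 'D')
```

The `max_peeks` guard is there because nothing in the rules as stated rules out policies that keep switching forever; a real system would need some way to force termination.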

This directly answers your question of how players can "convince each other that Player1.C <=> Player2.C."

I do not understand how your example of an SI's claim on their source-code constitutes a Prisoner's Dilemma.

It answers that question by replacing the Prisoner's Dilemma with an entirely different game, which doesn't have the awkward feature that makes the Prisoner's Dilemma interesting.

Wei_Dai (if I've understood him right) is not claiming that an SI's claim on its source code constitutes a PD, but that one obvious (but inconvenient) way for it to arrange for mutual cooperation in a PD is to demonstrate that its behaviour satisfies the condition "I'll cooperate iff you do", which requires some sort of way for it to specify what it does and prove it.

Off-topic: is there a tag I can use to force newlines to format as a line break? Or do people just use the >?

EDIT: Yeah, that worked. Two spaces at the end of the line. Thanks.

I think you put two spaces at the end of each line. Cthulhu knows why.

A bit off-topic, but it seems to me that a lot of the debate raging around Newcomb's problem can be well summarized by the following statement. If you believe yourself a rationalist and you witness a miracle, then you had better update your whole damn worldview to accommodate the existence of miracles.
