tl;dr: In the true PD, against agents capable of superrationality, there may be situations where you should co-operate while expecting the other player to defect, or vice versa. This is because the relative weight of the outcomes can differ between the two parties. Agents that take this into account could outperform even superrational ones.

So, it happens that our benevolent Omega actually has an evil twin who is as trustworthy as his sibling but abducts people into far worse hypothetical scenarios. Here is one:

You wake up in a strange dimension, and this Evil-Omega is smiling at you. He explains that you're about to play a game with an unknown paperclip maximizer from another dimension that you haven't interacted with before and won't interact with ever after. The alien is like a GLUT when it comes to consciousness: it runs a simple approximation of a rational decision algorithm, but nothing that you could think of as "personality" or "soul". Also, since it doesn't have a soul, you have absolutely no reason to feel bad for it's losses. This is the true PD.

You are also told some specifics about the algorithm that the alien uses to reach its decision, and likewise told that the alien is told about as much about you. At this point I don't want to nail the algorithm the opposing alien uses down to one specific. We're looking for a method that wins when summing up all these possibilities. Next, especially, we're looking at the group of AIs that are capable of superrationality, since against others the game is trivial.

The payoff matrix is like this:

DD=(lose 3 billion lives and be tortured, lose 4 paperclips), CC=(lose 2 billion lives and be made miserable, lose 2 paperclips), CD=(lose 5 billion lives and be tortured a lot, nothing), DC=(nothing, lose 8 paperclips)

So, what do you do? The opponent is capable of superrationality. In the post "The True Prisoner's Dilemma", it was (kinda, vaguely, implicitly) assumed for simplicity's sake that this information is enough to decide whether to defect or not. The answer, based on this information, could be to co-operate. However, I argue that the information given is not enough.

Back to the hypothetical: In-hypothetical you is still wondering about his/her decision, but we zoom out and observe that, unbeknownst to you, Omega has abducted your fellow LW reader and another paperclip maximizer from that same dimension, and is making them play PD. But this time their payoff matrix is like this:

DD=(lose $0.04, 2 random small changes made to the alien's utility function and 200 paperclips lost), CC=(lose $0.02, 1 change and 100 paperclips lost), CD=(lose $0.08, nothing), DC=(nothing, 4 changes and 400 paperclips lost)

Now, if it's not "rational" to take the relative loss into account, we're bound to find ourselves in a situation where billions of humans die. You could even come to regret your rationality. It should now be obvious that you'd wish you could somehow negotiate both of these PDs so that you would defect and your opponent would co-operate. You'd be totally willing to take a $0.08 hit for that, maybe paying it in its entirety for your friend. And, as it happens, the paperclip maximizers would have an incentive to do this too.
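To make the comparison concrete, here is a minimal sketch that simply tallies the losses listed in the two payoff matrices above. The names (`GAME_A`, `GAME_B`, `total_losses`) are illustrative only, the torture and misery components are left out, and reading the first letter of each outcome as the human's move is an assumption about the notation.

```python
# Losses per outcome, keyed by (human_move, clipper_move).
# Assumption: in an outcome like "CD" the first letter is the human's move.
# Human losses are (lives, dollars); clipper losses are (paperclips, utility_changes).
# The torture/misery components of the first matrix are omitted.
GAME_A = {
    ("D", "D"): ((3_000_000_000, 0.0), (4, 0)),
    ("C", "C"): ((2_000_000_000, 0.0), (2, 0)),
    ("C", "D"): ((5_000_000_000, 0.0), (0, 0)),
    ("D", "C"): ((0, 0.0), (8, 0)),
}
GAME_B = {
    ("D", "D"): ((0, 0.04), (200, 2)),
    ("C", "C"): ((0, 0.02), (100, 1)),
    ("C", "D"): ((0, 0.08), (0, 0)),
    ("D", "C"): ((0, 0.0), (400, 4)),
}


def total_losses(play_a, play_b):
    """Sum each side's losses over both games for the given pair of outcomes."""
    human_a, clipper_a = GAME_A[play_a]
    human_b, clipper_b = GAME_B[play_b]
    human = (human_a[0] + human_b[0], human_a[1] + human_b[1])
    clipper = (clipper_a[0] + clipper_b[0], clipper_a[1] + clipper_b[1])
    return human, clipper


# "Naive" superrational play: mutual co-operation in both games.
naive = total_losses(("C", "C"), ("C", "C"))
# Stake-sensitive play: whoever has more at stake defects, the other co-operates.
swapped = total_losses(("D", "C"), ("C", "D"))

print("naive:  ", naive)
print("swapped:", swapped)
```

Under this tally the humans go from losing 2 billion lives and $0.02 to losing $0.08, and the clippers go from losing 102 paperclips and one utility-function change to losing 8 paperclips.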

But, of course, the players don't know about this entire situation, so they might not be able to act optimally in this specific scenario. However, if they take into account how much the other cares about those results, using some unknown method, they just might be able to systematically perform better (if we made more problems of this sort, or if we selected payoffs at random for the one-shot game) than "naive" PD-players playing against each other. Naivety here would imply that they simply and blindly co-operate against equally rational opponents. How to achieve that is the open question.

-

Stuart Armstrong, for example, has an actual idea of how to co-operate when the payoffs are skewed, while I'm just pointing out that there's a problem to be solved, so this is not really news or anything. Anyways, I still think that this topic has not been explored as much as it should be.

Edit. Added this bit: You are also told some specifics about the algorithm that the alien uses to reach its decision, and likewise told that the alien is told about as much about you. At this point I don't want to nail the algorithm the opposing alien uses down to one specific. We're looking for a method that wins when summing up all these possibilities. Next, especially, we're looking at the group of AIs that are capable of superrationality, since against other sorts of agents the game is trivial.

Edit. Corrected some huge errors here and there, like, mixing hypothetical you and hypothetical LW-friend.

Edit. Transfer Discussion -> Real LW complete!

Humans seem to have a built-in solution to this dilemma, in that if I were presented with this situation and another human, where the payoff was something like minus ten, zero, or plus ten cents for me, versus insta-death, nothing, or ten billion dollars for the other human, I would voluntarily let the other person win and I would expect the other person to do the same to me if our situations were reversed. This means humans playing against other humans will all do exceptionally well in these sorts of dilemmas.

So this seems like an intelligent decision theoretic design choice, along the lines of "Precommit to maximizing the gains of the agent with the high gains now, in the hope of acausally influencing the other agent to do the same, thus making us both better off if we ever end up in a true prisoner's dilemma with skewed payoff matrix."

If I believe the alien to be sufficiently intelligent/well-programmed, and if I expect the alien to believe me to also be sufficiently intelligent/well-programmed, I would at least consider the alien graciously letting me win the first option in exchange for my letting the alien win the second. Even if only one of the two options is ever presented, and the second is the same sort of relevant hypothetical as a Counterfactual Mugging.

Yes, humans performing outstandingly well in this sort of problem was my inspiration for this. I am not sure how far it is possible to generalize this sort of winning. Humans themselves are kinda complex machines, so if we start with a perfectly rational LW reader and a paperclip maximizer in a one-shot PD with a randomized payoff matrix, what's the smallest number of handicaps we need to give them to reach this super-optimal solution? At first, I thought we could even remove the randomization altogether, but I think that makes the whole problem more ambiguous.

Of course it would be nice if you could negotiate in things like the Prisoner's dilemma - that's pretty much why you can't. If you could find a common-ish real life analogue of the situation you're thinking about, though, I'd probably give a far less snippy answer :P

As for regretting your choice when the stakes are huge - I think "just dealing with it" is an interesting skill that us humans have. But more seriously, if ALL mistakes are scaled up it shouldn't make any difference to the best option, nor should it make you feel more guilty if you do your best and fail.

In order to define value, you need to have more than one possible trade.

So only by knowing about both dilemmas can you define value. Since, by the problem definition, you only know of one of them, defining relative value is impossible.

So only by knowing about both dilemmas can you define value. Since, by the problem definition, you only know of one of them, defining relative value is impossible.

You'd only need to know what sort of utility function the other player has for that. Sure, you wouldn't know that the other deal like that is happening right there, so it might not be possible to reach counterfactual agreement in this particular case, but as a general rule, it does seem possible to outperform agents that don't go for this sort of "I take a hit for you if you'd take a hit for me in this sort of scenario where stakes are reversed". Which leaves multiple questions open, I agree.

TL;DR: If there were prisoner's dilemmas being run in parallel with vastly skewed payoffs in different directions, it would be beneficial if all parties could change their strategies to accommodate this.

Methinks we have gone well past the deep-end of barely-relevant hypotheticals, and are currently swimming somewhere in the concrete under the bleachers. This is doubly true when it's assumed you are ignorant of these other entities, simply because you have absolutely no reason to suspect they exist, or their relative frequencies. Why have you even bothered privileging this hypothesis?

Why have you even bothered privileging this hypothesis?

This is a misuse of the 'privileging this hypothesis' phrase. A barely relevant hypothetical is not a hypothesis. Such a hypothetical could be used rhetorically in order to advocate an implicit privileged hypothesis but that is not what the author has done here.

(This means only that you need a different name for your objection.)

Methinks we have gone well past the deep-end of barely-relevant hypotheticals, and are currently swimming somewhere in the concrete under the bleachers.

I disagree. I think all these problems have real-world analogues.

Omega bothering you? Replace him with Paul Ekman

As for the problem in this post, forget parallel, skewed payoffs is enough. If players could coordinate in skewed prisoner's dilemmas to let the player who stands to lose/gain the most defect while the other player cooperates, they would expect huge gains. And skewed prisoner's dilemmas happen, they're not "barely-relevant hypotheticals".

It's not that they don't happen. The issue is where you need some ability to credibly precommit, and to bind other people in similar situations to credibly precommit, except you don't know that there are other people who you need to work with.

In the vast majority of cases, we have an incredibly elegant solution to the prisoner's dilemma: contract law. Once you create sufficiently odd hypotheticals - such as skewed payoffs, single-shot, no knowledge of positive-sum exchanges, no ability to discuss or agree to positive-sum exchanges - the issue is irrelevant enough to be absurd.

If you were offered a bad-for-humans deal, would you defect, or would you simply assume that there are many other deals out there that are skewed the other way and that the paperclip maximizers who are receiving them are cooperative conditioned on your cooperation?

Haven't you noticed that people are working on decision theories that do not need to precommit in these situations to achieve the optimal outcome?

That's what's interesting. Decision theories are being explored that output the correct action without need for the crutches of precommitment and negotiation.

Also, I simply disagree that skewed payoffs, single-shot, no knowledge of positive-sum exchanges(?), no ability to discuss make a problem "irrelevant enough to be absurd".

If you were offered a bad-for-humans deal, would you defect, or would you simply assume that there are many other deals out there that are skewed the other way and that the paperclip maximizers who are receiving them are cooperative conditioned on your cooperation?

If I had time to work out the decision theory I might very well come to expect that the paperclipper would submit to cooperating while I defect, in the bad-for-humans case, if I would similarly submit in the bad-for-paperclipper case.

Haven't you noticed that people are working on decision theories that do not need to precommit in these situations to achieve the optimal outcome?

If someone on the street approaches me and tells me he is counterfactually mugging me, I don't give him a dime. The odds that he is actually capable of counterfactually mugging me and is telling me the truth are virtually zero compared to the chance that he's trying to scam me out of my money. It is an essential element of every one of those weird-situation decision theories that you know you are in a weird world with certainty.

Your hypothetical removes this certainty. If you are unaware that other people face similarly skewed bargains, your decision theory cannot possibly adjust for their behaviour. If you are in a situation of full awareness of other such bargains existing, then the case seems relatively indistinguishable from the basic prisoner's dilemma with hyper-rationality.

(And "no knowledge of positive-sum exchange" means that you are ignorant of the fact that there are other PDs skewed in the opposite direction.)

Maybe I don't have to know that other skewed dilemmas are in fact happening. Maybe I just have to know that they could be happening. Or that they could have happened. Maybe it's enough to know a coin was flipped to determine in whose favor the dilemma is skewed, for example.

Here's another perspective. If I'm a UDT agent and my priors assign a roughly equal probability to ending up on either side of the skew in a skewed prisoner's dilemma against another UDT agent, the straightforward UDT answer is for the advantaged player to submit to the disadvantaged player, even if only one dilemma is "in fact" ever run.
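A rough sketch of that expected-value comparison, under loudly stated assumptions: both sides share the same base payoffs (here borrowed from the human column of the post's first matrix, in billions), the skew is modelled as a made-up scale factor `high` or `low`, and a fair coin decides which side of the skew I land on. The policy names and `expected_utility` are illustrative, not anyone's formal UDT.

```python
# Hypothetical base payoffs for one player, in arbitrary units:
# T = I defect / they co-operate, R = mutual co-operation,
# P = mutual defection, S = I co-operate / they defect.
T, R, P, S = 0, -2, -3, -5


def expected_utility(policy, high, low):
    """Expected utility before a fair coin decides whether I get the high or low stakes.

    `high` and `low` are assumed scale factors for the two sides of the skew.
    """
    if policy == "submit":       # high-stakes player defects, low-stakes player co-operates
        as_high, as_low = T * high, S * low
    elif policy == "cooperate":  # both always co-operate
        as_high, as_low = R * high, R * low
    elif policy == "defect":     # both always defect
        as_high, as_low = P * high, P * low
    else:
        raise ValueError(policy)
    return 0.5 * as_high + 0.5 * as_low


for policy in ("submit", "cooperate", "defect"):
    print(policy, expected_utility(policy, high=1_000_000_000, low=1))
```

With these particular base payoffs, "submit" only beats mutual co-operation when `high` is more than 1.5 times `low`; with no skew at all, plain mutual co-operation comes out ahead.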

Or if precommitment is your thing: it's in your interests to precommit to submit to a disadvantaged player in future skewed prisoner's dilemmas if the other player has similarly precommitted, because you don't yet know what kinds of skewed dilemmas you're going to encounter in the future.

I'd probably pay Yvain and wouldn't think I'm in a weird world.

Actually, I'm writing up more real-world versions of many popular decision problems, and I have a particularly clever one for counterfactual mugging called "Counterfactual Insurance Co.". When I write it up properly I'll post it on LW...

Maybe I don't have to know that other skewed dilemmas are in fact happening. Maybe I just have to know that they could be happening. Or that they could have happened. Maybe it's enough to know a coin was flipped to determine in whose favor the dilemma is skewed, for example.

What evidence do you have to believe things are balanced? All you know is that one skewed situation exists. What evidence leads you to believe that other situations exist that are skewed relatively equally in the opposite direction? It's irrational to end up with the worst possible outcome for a PD because there might, in theory, be other PDs in which, if your opponent did what you did, you would benefit.

For what I think is a completely unexaggerated analogy: It is theoretically possible that every time I eat a banana, some entity horribly tortures an innocent person. It could happen. Absent any actual evidence that it does, my banana consumption will not change. You should not change your behaviour in a PD because it's theoretically possible that other PDs exist with oppositely skewed outcomes.

As for the counterfactual mugging, Yvain will never do it, unless he's an eccentric millionaire, because he'd lose a fortune. For any other individual, you would need substantial evidence before you would trust them.

As for precommitment, the lack of an ability to credibly precommit is one of the essential elements of a prisoner's dilemma. If the prisoners could make an enforceable contract not to snitch, it'd be easy to end up at the optimal outcome.

What evidence do you have to believe things are balanced?

What evidence do you have to believe that things are 1) unbalanced 2) in your favor?

You don't know what kinds of PD's you're going to encounter, so you prepare for all of them by setting up the appropriate precommitments, if your decision theory requires precommitments. If it doesn't, you'll just figure out and do the thing that you would have wanted to precommit to doing, "on the fly".

Credibility is indeed assumed in these problems. If you can't verify that the other player really has made the precommitment or really is a UDT kind of guy, you can't take advantage of this kind of coordination.

but we zoom out and observe that, unbeknownst to you, Omega has abducted your fellow LW reader and another paperclip maximizer from that same dimension, and is making them play PD. But this time their payoff matrix is like this:

Omega-like entities are assumed not to be misleading us, and we are assumed to have absolute faith in that fact. This means that while we may not know the specifics of what Omega is doing to change the behaviour of our opponent, we will know that that kind of thing is part of the game. Since I certainly have neither enough information about the kinds of things that may change the clipper nor the processing ability to calculate likely outcomes from such interference (including, for example, predicting LW readers), it is obviously necessary for me to defect.

You are also told some specifics about the algorithm that the alien uses to reach its decision, and likewise told that the alien is told about as much about you. At this point I don't want to nail the algorithm the opposing alien uses down to one specific. We're looking for a method that wins when summing up all these possibilities. Next, especially, we're looking at the group of AIs that are capable of superrationality, since against others the game is trivial.

Had the decision not been determined elsewhere this would be insufficient. I need to know what has been told to the clipper about me and what I have been told about the me to the clipper. "Capable of superrationality" is not especially meaningful until I am told exactly what that means in the instance.

TL;DR Entities chasing each other in spirals through their minds may eventually meet and shake hands, but logic alone does NOT give you the ability to do this. You need access to each other's source code.

It seems to me what "superrationality" is grasping towards is the idea that if both players can predict each other's actions, it provides pragmatic grounds for cooperation. All the other crap (the skewed payoff matrix, Hofstadter's "sufficiently logical" terminology, even the connotations of the word "superrationality" itself) are red herrings.

This all hinges on the idea that your decision CAN affect their decision, through their mental emulation of you, and vice versa. If it's one-sided, we have newcomb's problem, except it collapses to a normal prisoner's dilemma, since although omega knows if you'll cooperate, you have no way of knowing if omega will cooperate, and thus he has no incentive to base his behavior on your decision, even though he knows it. He's better off always defecting.

This is a point that a lot of people here seem to get confused about. They think "but, if I could predict omega's actions, he'd have an incentive to conditionally cooperate, and so I'D have an incentive to cooperate, and we'd cooperate, and that'd be a better outcome, ergo that must be more rational, and omega is rational so he'll act in a way I can predict, and specifically he'll conditionally cooperate!!1"

But I think this is wrong. The fact that the world would be a better place if you could predict omega's actions, (and the fact that omega knows this) doesn't give omega the power to make you capable of predicting his actions, any more than it gives him the power to make your mom capable of predicting his actions, or to make a ladybug capable of predicting his actions, or another superintelligence capable of predicting his actions (although possibly it could to start with). He's in another room.

The fact that he knows what you're going to do means there's already been some information leakage, since even a superintelligence can't extrapolate from the fact that your name is jeff what decision you'll make in a complicated game. He apparently knows quite a bit about you.

And if you knew ENOUGH about him, including his superhuman knowledge of yourself, and were smart enough to analyze the data (good luck), you'd be able to predict his actions too. But it seems disingenuous to even call that the prisoner's dilemma.

Also, since it doesn't have a soul, you have absolutely no reason to feel bad for it's [sic] losses.

Huh?

Just an attempt to make it clear that we're dealing with something like an intelligent calculator here with nothing in it that we'd find interesting or valuable in itself. Setting up this as the true PD.

Is that even well-defined? If I assert that I am a philosophical zombie in every sense of the term (lacking soul, qualia, and whatever other features you find relevant) does that mean you don't care about my losses?

Observers aren't ontologically fundamental entities, which is where you may be running into trouble.

I understood what he was trying to say.

I understood what he was trying to say.

Everyone does, the problem is that the whole area of several steps around its literal meaning has serious problems. "But souls don't exist! But so what if someone doesn't have a soul tag, it's not morally relevant! But so what if the presence of souls influences empathy/eternal life/etc., this reason doesn't screen off other sources of moral value!" Only when you've gone all the way to "The other agent doesn't have moral value.", it starts making sense, but then you should've just said so, instead of pretending an argument.

But I'd think if I only said "It doesn't have moral value in itself", you'd still have to go back similar steps to find that property cluster that we assign value. I tried to transfer both ideas by using the word soul and claiming lack of moral value.

you'd still have to go back similar steps to find that property cluster that we assign value. I tried to transfer both ideas by using the word soul and claiming lack of moral value.

What property cluster/why I'd need to find it/which both ideas?

Those properties that we think make happy humans better than totally artificial smiling humans mimicking happy humans. You'd need to find it in order to grasp what it means to have a being that lacked moral value, and "both ideas" refers to the distinct ways of explaining what sort of paperclip maximizer we're talking about.

Those properties that we think make happy humans better than totally artificial smiling humans mimicking happy humans.

This I guessed.

You'd need to find it in order to grasp what it means to have a being that lacked moral value,

Why? "No moral value" has a clear decision-theoretic meaning, and referring to particular patterns that have moral value doesn't improve on that understanding. Also, the examples of things that have moral value are easy to imagine.

"both ideas" refers to the distinct ways of explaining what sort of paperclip maximizer we're talking about.

This I still don't understand. You'd need to name two ideas. My intuition at grasping the intended meaning fails me often. One relevant idea that I see is that the paperclip maximizer lacks moral value. What's the other, and how is it relevant?

pretending an argument.

"Huh?"

What about it? Your perception of English says it's poorly-constructed, and I should rely less on my language intuition for such improvisation? Or is it unclear what I meant/why I believe so?

What is the purpose of saying "It doesn't have a soul.", as opposed to "It doesn't have moral value."? The desired conclusion is the latter, but the deeply flawed former is spoken instead. I guess it's meant as an argument, appealing to existing intuitions, connotations that the word "soul" evokes. But because of its flaws, it's not actually a rational argument, so it only pretends to be one, a rhetorical device.

It just wasn't an argument at all or a rhetorical device of any kind. It was a redundant aside setting up a counterfactual problem. At worst it was a waste of a sentence and at best it made the counterfactual accessible to even those people without a suitably sophisticated reductionist philosophy.

(And, obviously, there was an implication that the initial 'huh?' verged on disingenuous.)

At worst it was a waste of a sentence and at best it made the counterfactual accessible to even those people without a suitably sophisticated reductionist philosophy.

Rhetorical device in exactly this sense: it communicates where just stating the intended meaning won't work ("people without a suitably sophisticated reductionist philosophy"). The problem is insignificant (but still present), and as a rhetorical device it could do some good.

I defect on both scenarios. I am confident enough of this choice that I feel less respect for someone who would cooperate on either (at least to the degree that I can assume they are sincere in saying so and not just signaling).

EDIT: After your changes to the article, I suspect the reason you don't think it's obvious is because you've been dodging around an essential element of the True PD - i.e. the "one-shot" part.

I admit taking "Rational people playing true, one-shot PD with beings as rational as they, co-operate" for granted. I didn't think this was going to be an issue, and so, since I'm building upon that as an axiom, things might look weird if you think that foundation is untrue. And for this reason, I'm unsure if this discussion should continue here. If you're alone with that opinion, I think this discussion should take place elsewhere, but if there are many who disagree with me on that basic level, I guess that discussion should happen here.

I agree with this assessment of the situation.

The trouble is that cooperating is highly contingent on the other agent having heard of or being smart enough to think in five minutes the idea of superrationality, and it's highly contingent on the information available to both sides - if you don't think THEY think you know about superrationality/are smart enough to think it in five minutes, you shouldn't cooperate.

So, given most situations or most opponents I'd defect. Probably against the paperclip maximizer, too, since "Simple approximation of decision theory" doesn't sound too promisingly clever, particularly when evaluating beings like me.

The assumption about superrationality is now much more explicitly stated.

if they take into account how much the other cares about those results, using some unknown method, they just might be able to systematically perform better

Systematically perform better, but in this specific situation perform worse? That doesn't sound like a winning strategy on a one-shot dilemma. And the parenthetical patch doesn't seem to fix this problem: if the payoffs are randomly assigned in this one-shot case, and you get the short end of the stick - well, you know you aren't going to get another chance, because it's a one-shot case. Good luck trying to counterfactually mug the player here.

"You are also told some specifics about the algorithm that the alien uses to reach its decision, and likewise told that alien is told about as much about you."

If I know enough to see that my decision doesn't affect the alien's, I defect. If I don't know enough, I consider that the alien might know what my own algorithm is. Therefore I decide to cooperate if I think the alien will cooperate. I assume the alien knows this and that he knows that I know. Therefore I assume that the alien will cooperate because he thinks this will cause me to cooperate based on his knowledge of my thought processes (and CC is preferable to DD). Following the algorithm laid out above, I cooperate.

This is still just superrationality, though a little more advanced than usual. I have incomplete knowledge about my opponent's thought processes, I assume the rest will be similar to mine, consequently I choose the optimum symmetric strategy and hope he does the same.

If we can't handle this kind of reasoning, we lose billions of lives in the original True Prisoner's Dilemma, too. If I understood that post correctly, Eliezer was even hinting that we should just take the loss.

That's really unacceptable. I can't take a decision theory that seriously unless I know it returns the winning answer in these problems.

It requires us to know what sort of utility function the other player has, at the very least, and even then the result might be, at best, mutual defect or, against superrational players, mutual co-operation.

Cooperation against superrational players is only optimal if you are superrational too, or if they know how you are going to play. If you know they are superrational but they don't know you aren't, you should defect.

Cooperation against superrational players is only optimal if you are superrational too, or if they know how you are going to play. If you know they are superrational but they don't know you aren't, you should defect.

I find this confusing. Not in the sense that I don't understand the gist of the meaning. Rather, it makes the concept of 'superrational' as used sound weird to me (which could perhaps be attributed to the word, not Snowyowl). In particular:

Cooperation against superrational players is only optimal if you are superrational too

What is this magical trait that I can have that can change what the optimal choice is for me to make, for a fixed, externally specified utility function?

What is this magical trait

Something along the lines of "when you cooperate, your opponent is forced to cooperate too".

The reason it is optimal is it presents no chance to be defected against, and any situation where you are defected against is worse than every situation where the opponent cooperates.

Lacking this magical trait of superrationality, the chance of being defected against is drawn back in, which dramatically lowers the value of cooperating and makes it less optimal.
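One way to put numbers on this, as a rough sketch rather than anyone's actual decision theory: assume that if I cooperate, the opponent cooperates with probability `p` and defects otherwise, and that if I defect, the opponent defects for sure. The function names and this model are assumptions made for illustration; `R`, `P` and `S` are my payoffs for mutual cooperation, mutual defection, and being defected against.

```python
from fractions import Fraction


def cooperation_threshold(R, P, S):
    """Probability of being mirrored above which cooperating beats defecting.

    Assumed model: EU(C) = p*R + (1 - p)*S and EU(D) = P,
    so EU(C) > EU(D) exactly when p > (P - S) / (R - S).
    """
    return Fraction(P - S, R - S)


def should_cooperate(p, R, P, S):
    """True when cooperating has the higher expected utility under the assumed model."""
    return p > cooperation_threshold(R, P, S)


# Human column of the post's first matrix, in billions of lives lost.
print(cooperation_threshold(R=-2, P=-3, S=-5))
```

With the "forced to cooperate" trait, `p` is 1 and cooperating wins; once `p` drops to the threshold or below (two thirds for the post's first matrix under this model), defecting is at least as good.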

Something along the lines of "when you cooperate, your opponent is forced to cooperate too".

That would be something that made more sense but just isn't something that fits in that context.