In order to better understand the differences between different decision theories, I have been browsing each and every Newcomblike Problem and keeping track of how each decision theory answers it differently. However, I seem to be coming up short when it comes to answers addressing the Psychopath Button:

Paul is debating whether to press the “kill all psychopaths” button. It would, he thinks, be much better to live in a world with no psychopaths. Unfortunately, Paul is quite confident that only a psychopath would press such a button. Paul very strongly prefers living in a world with psychopaths to dying. Should Paul press the button?

In the FAQ I read, they only gave examples from CDT and EDT, of which CDT says "yes" (because pressing the button isn't causally linked to whether Paul is already a psychopath) while EDT says "no" (because pressing the button increases the probability that Paul is a psychopath).

So I wonder how Logical Decision Theories (TDT, FDT, and UDT) would address the problem? Unlike Newcomb's Problem, there is technically only one agent in play, and in the other problem that has only one agent (the Smoking Lesion Problem) the answers of LDT all agreed with CDT. But in this case, CDT doesn't win.


From Cheating Death in Damascus (bold emphasis mine):

It’s less clear how we should model this case from the point of view of FDT, and there are a variety of options. The most natural and illustrative, we think, is to assume that what actions you would or would not perform in various (hypothetical or real) circumstances determines whether you’re a psychopath. What actions you would perform in which circumstances is in turn determined by your decision algorithm. On this reading of this case, the potential outputs of your decision algorithm affect both whether you’d press the button and whether you’re a psychopath.
In practice, this leads the FDT agent always to refrain from pressing the button, but for very different reasons from the CDT agent. [The FDT agent] reasons that if she were to press the button that kills so many people, then she would be a psychopath. She does not regard her psychopathic tendencies—or lack thereof—as a fixed state of the world isolated from what decision she actually makes here. 
This seems like the right reasoning, at least on this understanding of what psychopathy is. Psychopaths just are people who tend to act (or would act) in certain kinds of ways in certain kinds of circumstances. We can take for granted that everyone is either born a psychopath or born a non-psychopath, and that [the FDT agent's] action cannot causally change this condition she was born with. Yet if this condition consists in dispositions to behave in certain ways, then whether [the FDT agent] is a psychopath is subjunctively tied to the decisions she actually makes. If you would not perform [a given action] in [given] circumstances, then you also would not be the kind of person who performs actions like [that] in circumstances like [those]. FDT vindicates just this sort of reasoning, and refrains from pressing the button for the intuitively simple reason of “if I pressed it, I’d be a psychopath” (without any need for complex and laborious ratification procedures). When we intervene on the value of the [decision-algorithm output] variable, we change not just what you actually do, but also what kind of person you are.

I'd give the psychopath button question the same kind of answer I would give to the smoking lesion question: what's the mechanism that correlates being a psychopath with willingness to press the button?

If being a psychopath (or not being a psychopath) affects your answer because it affects your ability to reason, then based on your psychopath status, you may not even have the ability to reason correctly and choose an outcome. The problem is ill-defined, because it asks you to do something that you may be incapable, by stipulation, of doing.

If it affects your answer in another manner, then pressing the button because of the outcome of a reasoning process won't be correlated with psychopathy even though pressing the button in general is. (Unless the button uses your decision as a criterion of psychopathy, in which case we get into halting problem considerations.)

Also, note that in everyday language, "only a psychopath would press the button" strongly implies that it affects your decision because it affects your values about killing people, which is the second scenario. It's also inconsistent because the problem statement implies that both psychopaths and non-psychopaths would consider pressing the button and would reject pressing the button only after figuring out the logic, but if a non-psychopath would always refuse because of his values, this implication isn't correct.

(Edit: Edited this lots of times. Phrasing my objection correctly is actually quite hard.)

If being a psychopath (or not being a psychopath) affects your answer because it affects your ability to reason, then based on your psychopath status, you may not even have the ability to reason correctly and choose an outcome. The problem is ill-defined, because it asks you to do something that you may be incapable, by stipulation, of doing.

Ah... but it's the meta-you (the reader), not the story-you (the arguable psychopath), who is tasked with saying whether the story-you should press the button. Maybe the story-you is incapable of reasoning. But given h...

I'm not convinced that "it's impossible for him to press the button, but it's better for him to press the button" is a meaningful concept. It's tempting to think "pressing the button results in X and X is good/bad", but that cuts off the chain of reasoning early. Continuing the chain of reasoning past that will lead you to further conclusions that result in not-X after all, and you just got a contradiction.
Nothing suggests it's impossible for him to press the button, even if we grant that it's possible he can't reason. Maybe he can stumble into it.
If you need to consider the possibility of pressing the button involuntarily, that affects the meaning of the original problem statement. Does "only a psychopath will press the button" include involuntary presses? If yes, then it's still impossible for a non-psychopath to press the button. If no, then whether it's better to involuntarily press the button may have a different answer from whether it's better to voluntarily press the button.
I'd interpret it that way. The intended interpretation is that if the person presses the button, they're a psychopath. If I press the button, I have always been a psychopath, and I die along with all other psychopaths. If I don't press the button, I may or may not be a psychopath, and I live along with all other psychopaths. All the details you're writing seem to me to go against the Occam's razor interpretation of the problem.

In my view, FDT handles the problem as follows:

Frank: Suppose FDT(situation) = "push the button". Then all psychopaths die, which includes me. Suppose instead FDT(situation) = "don't push the button". Then no psychopaths die. Since I prefer living in a world with psychopaths to dying, FDT(situation) = "don't push the button".
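
Frank's reasoning can be sketched as a tiny enumeration over what the decision procedure might output. This is only an illustration under the assumption that the problem statement is taken at face value (pressing implies psychopathy with certainty). The names `Action`, `UTILITY`, `outcome`, and `fdt_choice`, and the ordinal utility values, are made up for this sketch and are not taken from the FDT literature.

```python
from enum import Enum
from typing import Tuple


class Action(Enum):
    PRESS = "press"
    REFRAIN = "refrain"


# Assumed ordinal utilities reflecting Paul's stated preferences.
UTILITY = {
    ("alive", "no psychopaths"): 2,  # best, but unreachable in this model
    ("alive", "psychopaths"): 1,
    ("dead", "no psychopaths"): 0,
}


def outcome(policy_output: Action) -> Tuple[str, str]:
    """World that results if the decision procedure returns policy_output."""
    if policy_output is Action.PRESS:
        # Assumption: an agent whose procedure outputs "press" is a
        # psychopath, so the button kills the agent as well.
        return ("dead", "no psychopaths")
    return ("alive", "psychopaths")


def fdt_choice() -> Action:
    """Pick the output whose resulting world has the highest utility."""
    return max(Action, key=lambda a: UTILITY[outcome(a)])


print(fdt_choice())
```

Under these assumptions the only reachable worlds are "dead, no psychopaths" and "alive, psychopaths", so the procedure refrains.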

The main controversial piece is from the problem specification: "Paul is quite confident that only a psychopath would press such a button." I think this mixes up P(button|psychopath) and P(psychopath|button), but since the problem specification is our only source of how the button determines who is or isn't a psychopath, it seems fine to trust it on that point.

Another related problem is one where there's a button that kills everyone who would, given the option, press it. You might expect that such people are bad neighbors and prefer a world without them without having any way to act on that belief (and if you come to believe that FDT pushes that button, what it really means is that you shouldn't be so confident people who would press the button are bad neighbors!).

[In general, your decision theory should save you from claims in the problem specification of the form "and then you make a bad decision", but it can't be expected to save you from having incorrect empirical beliefs.]

Psychopathy is strongly associated with poor impulse control and low self-reflection. If Paul is considering logical decision theories, rational choice, and their possible ramifications given his own mental makeup, then he is substantially less likely than baseline to be a psychopath. Psychopaths generally make up on the order of 1% of the population.

Does he have some prior evidence that he is a psychopath? If not, then his prior should be on the order of 0.2% or so. Willingness to press the button would otherwise be his only evidence, which he is "quite confident" about. What numerical value should he put for "quite confident"? Let's say 90% (much more than that should be described more like "very" or "extremely" confident). So that would bring a baseline prior up to the order of 2%.

Now he "very strongly" prefers living in a world with psychopaths to dying. Is that 5x in utility? 100x? 10,000x? Well, dying is a pretty bad thing but I'd use some stronger term than just "very strongly" for 10,000x so let's go with something on the order of 100x.

Well, this is awkward. For outcomes of pressing the button we've got a credence of 2% for being a psychopath and -100 utility, versus 98% for not being a psychopath and +1 utility. This is a net -1 utility, but the numbers are only order of magnitude estimates so the expected value could easily be much more positive or negative! It doesn't really matter which decision theory he uses, Paul just doesn't have enough information.
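
To make the arithmetic concrete, here is a rough sketch using the order-of-magnitude numbers above (2% credence, -100 for dying, +1 for living without psychopaths). It assumes that not pressing is worth 0; the function name `expected_utility_of_pressing` and the alternative credences in the loop are purely illustrative.

```python
def expected_utility_of_pressing(
    p_psychopath: float,
    u_die: float = -100.0,
    u_no_psychopaths: float = 1.0,
) -> float:
    """Expected utility of pressing, relative to 0 for not pressing."""
    return p_psychopath * u_die + (1.0 - p_psychopath) * u_no_psychopaths


# The estimate from the text: 0.02 * (-100) + 0.98 * 1.
ev = expected_utility_of_pressing(0.02)
print(f"EV at 2% credence: {ev:.2f}")

# Shifting the credence within the same order of magnitude flips the sign.
for p in (0.005, 0.01, 0.02, 0.05):
    print(f"p = {p:.3f} -> EV = {expected_utility_of_pressing(p):+.3f}")
```

The sign change between 0.5% and 1% is the point: with inputs this rough, the recommendation is not stable.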

Yeah, in order to keep the problem statement clean I do think that one should specify that Paul does not have access to autobiographical memory or other self-knowledge for the duration of his time with the button making his decision. If he did, then he could use his self-knowledge to determine if he was a psychopath or not and use that information to supplement the piece of information from 'would I choose to push the button' to inform his prediction of whether he is indeed a psychopath and thus will be killed by pushing the button.


Not sure about FDT (fancy decision theories) but there are only two possible outcomes here: 

  • Paul lives in a world with psychopaths, 
  • The world has no psychopaths, including no Paul.

There is no possible world where Paul lives and all psychopaths are dead, so "be much better to live in a world with no psychopaths" is an extraneous preference having no bearing on whether to press the button. Sort of like "it would be nice to live in a world with flying cars looking like unicorns". 

The real question is which of the two possible worlds Paul prefers, and the answer is quite clear: if Paul is a psychopath he strongly prefers living to dying, so no pressing the button, and if Paul is not a psychopath, he will not press the button anyway. There is nothing fancy that needs a decision theory here, just count the possible worlds.

No, Paul can be wrong about only psychopaths pushing the button.

How confident is Paul about that? Oh right, "quite". Is that a credence of 80%? 95%? 99.99%?

How much more strongly does Paul prefer living in a world with psychopaths to dying? Oh right, "very". Is that a utility of 2x? 100x? 10000x?

What is Paul's prior credence that he is a psychopath according to the button's implementation? 0.1%? 1%? 5%? 50%?

... and so on for other variables that are required for every logical decision theory. In the original post it doesn't make sense to ask what various logical decision theories answer when the question has only vague terms that are compatible with any answer.

Of course Paul could be wrong, and then you need to calculate probabilities, which is a trivial calculation that does not depend on a chosen decision theory. But the problem statement as is does not specify any of it, only that he is sure that only a psychopath would press the button, so take it as 100% confidence and 100% accuracy, for simplicity. The point does not change: you need a good specification of the problem, and once you have it, the calculation is evaluating probabilities of each world, multiplying by utilities, and declaring the agent that picks the world with the highest EV "rational".

This looks like a point of view that denies value of two-boxing in Newcomb's Problem, which shouldn't interfere with remaining aware of what CDT would do and why, a useful thing for building saner variants of CDT.

Yes, there is no value in two-boxing because there is no possible world where a two-boxer wins (provided the predictor is perfect) or the probability of such a world falls off with improvement in predictor's accuracy (when the predictor is imperfect). One doesn't need a saner version of EDT or CDT; an agent who counts worlds, probabilities, and utilities, without involving counterfactuals, always has the best EV.

an agent who counts worlds, probabilities, and utilities, without involving counterfactuals, always has the best EV.

Sorry, can you express this in terms like  ? The main disagreement between decision theories like EDT and CDT is which worlds they think are accessible, and I am not confident I could guess what you'd think the answer is to an arbitrary problem.

I tried in my old post

Basically, two-boxers equivocate between possible worlds and deny the problem statement that Predictor can predict them ahead of time, regardless of what they do later. They think that a low-probability world is accessible by jumping from a high probability world into a non-existent low-probability world after the boxes are set.

Cool, thanks for the link; I found jessicata's comment thread there helpful.

I agree that CDT overestimates the accessibility of worlds. I think one way to think about EDT is that it is also just counting worlds, probabilities, and utilities, but you're calculating your probabilities differently, in a more UDT-ish way.

Consider another variant of this problem, where there are many islands, and the button only kills the psychopaths on its island. If Paul has a historical record that so far, all of the previous buttons that have been pressed were pressed by psychopaths, Paul might nevertheless think that his choice to press the button stems from a different source than psychopathy, and thus it's worth pressing the button. [Indeed, the spicy take is that EDT doesn't press the button, CDT does for psychopathic reasons and so dies, and FDT does for non-psychopathic reasons, and so gets the best outcome. ;) ]

Yes, if Paul thinks that he might not be a psychopath who dies, and has a probability associated with it, he would include this possible world in the calculation... obviously? Though this requires further specification of how much he values his life vs life with/without psychopaths around. If he values it infinitely, as most psychopaths do, presumably, then he would not press the button, on the off chance that he is wrong. If the value is finite, then there is a break-even probability where he is indifferent to pressing the button. I don't understand how it is related to a decision theory, it's just world counting and EV calculation. I must be missing something, I assume.
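
For what it's worth, the break-even probability mentioned here has a simple closed form if we assume (as a simplification) that not pressing is worth 0, pressing and surviving is worth `u_gain > 0`, and pressing and dying is worth `u_die < 0`. Setting p * u_die + (1 - p) * u_gain = 0 gives p = u_gain / (u_gain - u_die). The function name and parameters below are illustrative only.

```python
def break_even_credence(u_gain: float, u_die: float) -> float:
    """Credence of dying at which pressing has the same EV as not pressing.

    Assumes not pressing is worth 0, u_gain > 0, and u_die < 0.
    """
    if u_gain <= 0 or u_die >= 0:
        raise ValueError("expected u_gain > 0 and u_die < 0")
    return u_gain / (u_gain - u_die)


# Finite stakes: indifferent at roughly a 1% chance of dying.
print(break_even_credence(1.0, -100.0))

# Infinitely bad death: the threshold collapses to 0, so never press.
print(break_even_credence(1.0, float("-inf")))
```

Pressing is favored only when the credence of dying is below the returned threshold.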

Agreed that we need real-valued utilities to make clear recommendations in the case of uncertainty.

I don't understand how it is related to a decision theory, it's just world counting and EV calculation. I must be missing something, I assume.

For all of the consequentialist decision theories, I think you can describe what they're doing as attempting to argmax a probability-weighted sum of utilities across possible worlds, and they differ on how they think actions influence probabilities / their underlying theory of how they specify 'possible worlds' and thus what universe they think they're in. [That is, I think the interesting bit is the part you seem to be handling as an implementation detail.]
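
As a rough sketch of that framing: both theories below run the same argmax over a probability-weighted sum, and differ only in the credence they plug in for each action. Every number here is an assumption chosen for illustration (the problem statement gives none), and `cdt_credence`, `edt_credence`, and `decide` are hypothetical names.

```python
# Illustrative utilities and credences; none come from the problem statement.
U_ALIVE_WITH_PSYCHOPATHS = 0.0
U_ALIVE_WITHOUT_PSYCHOPATHS = 1.0
U_DEAD = -100.0

PRIOR_PSYCHOPATH = 0.001        # Paul's credence before deciding
P_PSYCHOPATH_GIVEN_PRESS = 0.9  # credence after conditioning on pressing


def expected_utility(action: str, p_psychopath: float) -> float:
    """Probability-weighted sum of utilities over the worlds an action reaches."""
    if action == "press":
        return (p_psychopath * U_DEAD
                + (1.0 - p_psychopath) * U_ALIVE_WITHOUT_PSYCHOPATHS)
    return U_ALIVE_WITH_PSYCHOPATHS


def cdt_credence(action: str) -> float:
    # CDT: the action cannot causally change whether Paul is a psychopath.
    return PRIOR_PSYCHOPATH


def edt_credence(action: str) -> float:
    # EDT: the action is treated as evidence about Paul's psychopathy.
    # The "refrain" branch is unused by expected_utility; the prior is a stand-in.
    return P_PSYCHOPATH_GIVEN_PRESS if action == "press" else PRIOR_PSYCHOPATH


def decide(credence_given_action) -> str:
    """Argmax over actions, with the theory supplying the credences."""
    actions = ("press", "refrain")
    return max(actions,
               key=lambda a: expected_utility(a, credence_given_action(a)))


print("CDT:", decide(cdt_credence))
print("EDT:", decide(edt_credence))
```

With these made-up numbers the CDT-style credence recommends pressing and the EDT-style credence recommends refraining, even though `decide` is identical for both.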

One doesn't need a saner version of EDT or CDT

That's not clear until you develop them.

always has the best EV

Incidentally, this is an increasingly dubious objective. But to see why it's a bad idea in practice, it's helpful to be aware of the way it looks like a very good idea. (Regardless, it's obviously relevant for this post.)

OK, I read the last one (again, after all these years), and I have no idea how it is applicable. It seems to be about the definition of probability, dutch-booking and such... nothing to do with the question at hand. The one before that is about how a "wrapper-mind", i.e. a fixed-goal AGI is bad... Which is indeed correct, but... irrelevant? It has the best EV by its own metric?

(The second paragraph was irrelevant to the comment I was replying to, I thought the "incidentally", and the inverted-in-context "it's obviously relevant" (it's maximization of EV that's obviously relevant, unlike the objections to it I'm voicing; maybe this was misleading) made that framing clear?)

I was commenting on how "having the best EV", the classical dream of decision theory, is recently in question because of the Goodhart's Curse issue. That it might be good to look for decision theories that do something else. The wrapper-minds post is pointing at the same problem from a very different framing. Mild optimization is a sketch of the kind of thing that might make it better, and includes more specific suggestions like quantilization. (I currently like "moral updatelessness" for this role, a variant of UDT that bargains from a position of moral ignorance, not just epistemic ignorance, among its more morally competent successors, with mutually counterfactual, that is discordant, but more developed moralities/values/goals.) The "coherent decisions" post is just a handy reference for why EV maximization is the standard go-to thing, and might still remain as such in the limit of reflection (time), but possibly not even then.

The relevant part (to the "saner CDT" point) is the first paragraph, which is mostly about Troll Bridge and logical decision theory. Last post of the sequence has a summary/retrospective. Personally, I mostly like CDT for introducing surgery, fictional laws-of-physics-defying counterfactuals seem inescapable in some framings that are not just being dumb like vanilla CDT. In particular, when considering interventions through approximate predictions of the agent. (How do you set all of these to some possible decision, when all you know is the real world, which might have the actual decision you didn't make yet in its approximate models of you? You might need to "lie" in the counterfactual with fictional details to make models of your behavior created by others predict what you are considering doing, instead of what you actually do and can't predict or infer from actual models they've already made of you. Similarly to how you know a Chess AI will win, without knowing how, you know that models of your behavior will predict it, without knowing how. So you are not inferring their predictions from their details, you are just editing them in into a counterfactual.) This might even be relevant to CEV in that moral updatelessness setting I've mentioned, though that's pure speculation at this point.

a fixed-goal AGI is bad... Which is indeed correct, but... irrelevant? It has the best EV by its own metric?

Nobody knows how to formulate it like that! EV maximization is so entrenched as obviously the thing to do that the "obviously, it's just EV maximization for something else" response is instinctual, but that doesn't seem to be the case.

And if maximization is always cursed (goals are always proxy goals, even as they become increasingly more accurate, particularly around the actual environment), it's not maximization that decision theory should be concerned with.

Thanks. I will give them a read. After all, smarter people than me spent more time than I did thinking about this. There is a fair chance that I am missing something.

The problem itself, considered out of context, is hopelessly confused; there is too much room left for its clearer reformulation. For example, I have multiple discordant ideas on how to interpret "quite confident that only a psychopath would press such a button". One option is some Source of Magic that designates pressers of such buttons as inherently "psychopaths" for the purposes of consequences of pressing such buttons, regardless of their preferences about pressing such buttons, in the way Newcomb's Problem deals with two-boxers.

This raises a concern about all other people of the world, are they all being so designated based on their behavior upon counterfactually being placed in this situation, for the purposes of consequences of pressing the button? If so, pressing the button affects all those who would press the button, in addition to those originally designated "psychopaths". If not, pressing the button only affects those originally designated "psychopaths", plus the presser. Or does the Source of Magic clear those who wouldn't press the button of the label "psychopath" even if originally they would be so designated? For the CDT vs. non-CDT distinction, is the Source of Magic doing the designation of "psychopaths" in advance, based on predictions of everyone's counterfactual behavior, or after the button is pressed in actuality? Many, many options.

Another option is to interpret it as actually meaning "quite confident that only a psychopath would prefer to press such a button, if it exempted them and those they care about specifically, and there were no Source of Magic shenanigans redefining the meaning of words as applied to all other people of the world for the purposes of this thought experiment". In that case, the potential presser can resolve the question of whether they are a "psychopath" by looking at their preference, and act accordingly. This is a more straightforward, less interesting option.

Logical Decision Theories (TDT, FDT, and UDT)

I think this post is useful for placing these. Basically, FDT is about controlling things from multiple points of intervention, UDT about intervention in all epistemic counterfactuals of the same agent. None of these directly concern acausal coordination with other agents, which is covered in Cooperation in PD. FDT is useful when doing acausal coordination, because the coordinating algorithm needs to control all members of the coordinated coalition, so it acts through multiple points of intervention (I described the connection in a reply to your other post). But if acausal coordination is not for the purposes of decision making, then FDT is not relevant to what's going on.

That's a very interesting and insightful dissection of the problem. Do you think there might be a problem in the post that I copied the thought experiment from (which said that CDT presses, and EDT doesn't), or did I make a mistake of taking it out of context?

The context seems to be

There, it's related to Smoking Lesion, which has a tradition of interpreting it that suggests how to go about interpreting "only a psychopath would press such a button" as well. But that tradition is also convoluted (see "tickle defense"; it might be possible to contort this into an argument that EDT recommends pressing the button in Psychopath Button, not sure).

The problem seems under-specified to me. What's Paul's utility function?

I think it's "Paul alive, sociopaths dead" > "Paul alive, sociopaths alive" > "Paul dead, sociopaths dead", with the inaccessible "Paul dead, sociopaths alive" at the very bottom.

In situations with uncertainty, we would need to have the scales of those preferences, but I think you're supposed to view the problem as having certainty.

there is technically only one agent in play

Well, there is Paul, and then there is whatever acts on a button press, and evaluates whether Paul is a psychopath...

Ok, if the button is thought of as the "second agent" then I would guess TDT would not press the button. TDT would reason that the button will make the decision that the person who pressed the button is a psychopath, and therefore Paul should precommit to not pressing the button. Is that the right way to approach it?

I don't understand why this example gives different answers. There's no causality difference, only a knowledge difference. I think we'd need some numbers about Paul's pre-decision estimate that he's a psychopath, and the probability that someone who decides to press the button is a psychopath. That is, his prior and posterior beliefs about his own psychopathy.

I'm not much of a CDT apologist - it seems obviously wrong in so many ways.  But I'm surprised anew that CDT conflicts with conservation of expected evidence (if some piece of data is expected, it's already part of your prior and shouldn't cause an update).