The Psychology Of Resolute Agents

Chris_Leong

Epistemic Status: Exploratory

Consider Parfit's Hitchhiker from the perspective of a completely selfish agent:

"Suppose you're out in the desert, running out of water, and soon to die - when someone in a motor vehicle drives up next to you. Furthermore, the driver of the motor vehicle is a perfectly selfish ideal game-theoretic agent, and even further, so are you; and what's more, the driver is Paul Ekman, who's really, really good at reading facial microexpressions. The driver says, "Well, I'll convey you to town if it's in my interest to do so - so will you give me $100 from an ATM when we reach town?"

When you're in the desert, the deal is to your benefit, but once you're in town, you're incentivised to defect. So you should expect the driver to not believe you and leave you to die in the desert.
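As a minimal sketch of this incentive structure, the Python below models the driver as a perfect predictor and compares the two possible in-town behaviours. The dollar value placed on survival (`VALUE_OF_LIFE`) and the function names are illustrative assumptions, not part of the original problem statement; only the ordering of the payoffs matters.

```python
# Hypothetical payoffs in dollars; only their ordering matters.
VALUE_OF_LIFE = 1_000_000  # assumption for illustration, not from the problem
FARE = 100

def cdt_choice_in_town():
    """Once in town, compare only the causal consequences of each action."""
    payoffs = {"pay": -FARE, "don't pay": 0}
    return max(payoffs, key=payoffs.get)

def driver_gives_lift(predicted_choice):
    """A perfect predictor only rescues someone who will in fact pay."""
    return predicted_choice == "pay"

def outcome(policy_in_town):
    """Total payoff for an agent whose in-town behaviour is `policy_in_town`."""
    if not driver_gives_lift(policy_in_town):
        return -VALUE_OF_LIFE  # left in the desert
    return -FARE if policy_in_town == "pay" else 0

print(cdt_choice_in_town())           # the locally better action in town
print(outcome(cdt_choice_in_town()))  # what that disposition earns overall
print(outcome("pay"))                 # what a reliable payer earns overall
```

Judged only at the moment of choice in town, not paying looks better by $100; judged as a disposition the driver can read, it costs the agent their life.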

An agent using Updateless Decision Theory will avoid an untimely death, but I believe that someone should be able to survive without advanced decision theories (i.e. just using Causal Decision Theory). And indeed, if you can successfully pre-commit to paying the driver the money, then you will survive, but this leads to the question of whether you can.

But first, I note that pre-commit has two meanings: in a broad sense, irrevocably deciding to follow a particular course of action; and in a narrow sense, as just described, but in a publicly verifiable manner. Since we are assuming that the driver has a high ability to guess your decision, these two definitions end up collapsing within this scenario.

Now, we can imagine a host of situations in which you could pre-commit: you could be trusted to pay if a court would fine you $150 for not upholding your bargain, or if you were a deontologist who thought that being moral was more important than a good outcome, or if God would let you into heaven if you are good. What if we exclude outside rewards and punishments and assume that you are completely self-interested and rational? Is pre-commitment possible in such circumstances?

The definition of "completely rational" is important here. If it means that they must make every decision rationally (according to CDT) and are forbidden to self-modify in a way that makes them lose this property, then it necessarily follows that they will defect. On the other hand, if it means that they always know what the "rational" decision is even after self-modification and that they always choose this decision before self-modification, then there is hope.

Indeed, a self-modifying AI that can rewrite its own code will find pre-committing trivial, but this is a poor model for actual humans. The exact extent to which humans can pre-commit is a complex question, but at a high level it is a mistake to pretend that we have either no ability to self-modify or an absolute ability to do so. Instead, we are somewhere in between, as we will now see.

Let's suppose a selfish person forms the intention to pay, the driver believes them and they have just been dropped off in town. They will still have a strong desire to hold onto their money and they will feel a strong aversion to handing that money over. They can imagine all of the options in the situation (just two: pay or don't pay), they can iterate over them and see that "don't pay" has higher utility and it feels like they have a free choice of whether or not to pay.

So it certainly doesn't feel to this person that they are pre-committed in any way. However, in a deterministic universe, an agent can only ever strictly make one choice, so they are always pre-committed to whatever choice they eventually end up making. So we don't know that the agent is not pre-committed to paying; it merely appears as though the agent isn't, and we can't know until their decision is locked in. You might feel that you could choose "don't pay", and maybe you can; or maybe it is only the counterfactual you, who is technically not you, who can choose that.

Of course, when you were pre-committing, you should have predicted that all of this was going to happen. If you end up in this situation and you don't know how to respond to the feeling that you "should" decide not to pay, you've done a terrible job of pre-committing. You ought to have either spent more time trying to pre-commit or given up and concluded that it was impossible. However, since pre-committing seems like a crucial ability for co-ordinating with others, I would suggest that it is worth spending a large amount of effort developing the ability to pre-commit. I'm not going to suggest that humans have an unlimited ability to do this; it's easy to imagine sufficiently horrible outcomes where we wouldn't be able to force ourselves to go through with the bargain. However, at the very least, we should be able to force ourselves to pay trivial costs in order to gain massive benefits.

So to what extent can we self-modify? There's a lot that we can't change. A completely selfish agent can recognise that $100 is a small price to pay for having their life saved, but they will still feel a strong desire to hold onto that money anyway. They will still know that a "rational" (CDT) agent would choose not to pay; specifically that if they looped over the available options and found the one with highest payoff, it would be "don't pay". Further, they can take it to the meta level and realise that the "rational" choice is to modify themselves to an agent using standard decision theory (CDT) if they aren't one already. Given all of this, how is paying not a mistake? Undoubtedly, people could convince themselves to pay, but surely they've simply made an error in their reasoning somewhere?

Part of the confusion comes from a resolute agent having two sets of goals: its intrinsic goals, in this case purely selfish; and its chosen goals, which initially match satisfying its intrinsic goals, but which change after it decides to pre-commit. Is it irrational to pursue its chosen goals when it realises that these diverge from its intrinsic goals?

We split this into two questions: firstly, the question of adopting these goals; and secondly, the question of maintaining them. With regard to the first question: this decision is rational so long as the agent made it in a sensible manner. With regard to the second, the chosen goals will appear irrational from the perspective of a standard rational agent. However, it would be a mistake for a resolute agent to conclude this, as they ought to be analysing the situation from the standpoint of their new (chosen) goals, instead of their old (intrinsic) goals. It is those who refuse to pay who make a mistake in reasoning, not those who pay. Assuming they didn't mess up the pre-commitment, their new, chosen goals should be self-reaffirming. For example, if an agent has the goal of maximising its intrinsic goals without breaking any commitments, then when given the choice, it will choose to maintain its chosen goals rather than switching to standard decision theory (CDT).
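One way to picture the "self-reaffirming" property is the sketch below. It assumes a very simple encoding that is not from the post: commitments are a list of promised actions, and the chosen goals are modelled as intrinsic utility with every commitment-breaking action ruled out. All function names are hypothetical.

```python
import math

FARE = 100
ACTIONS = ["pay", "don't pay"]

def intrinsic_utility(action):
    """Purely selfish payoff of the in-town action."""
    return {"pay": -FARE, "don't pay": 0}[action]

def chosen_utility(action, commitments):
    """Intrinsic utility, except that breaking any commitment is ruled out."""
    if any(action != promised for promised in commitments):
        return -math.inf
    return intrinsic_utility(action)

def best_action(utility):
    """Pick the action that maximises the supplied utility function."""
    return max(ACTIONS, key=utility)

def prefers_to_stay_resolute(commitments):
    """Judge 'switch back to CDT' by the agent's current (chosen) goals."""
    stay = best_action(lambda a: chosen_utility(a, commitments))
    switch = best_action(intrinsic_utility)  # what a CDT agent would then do
    return chosen_utility(stay, commitments) >= chosen_utility(switch, commitments)

cdt_pick = best_action(intrinsic_utility)
resolute_pick = best_action(lambda a: chosen_utility(a, ["pay"]))
print(cdt_pick, "|", resolute_pick, "|", prefers_to_stay_resolute(["pay"]))
```

With no commitments the two utility functions coincide, which matches the idea that chosen goals initially track intrinsic goals; once "pay" has been promised, the agent evaluating by its chosen goals has no reason to switch back.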

We will address one last argument. The agent is either pre-committed to pay or not pre-committed to this. It then follows that the agent ought to try to not pay: if it succeeds, then it wasn't pre-committed, so it's making the rational decision, whilst if it fails, it was pre-committed and is therefore no worse off. Again, the agent could have predicted in advance that it was going to face this temptation and should have been prepared.

Additionally, this argument is reasoning from its old goals and not from its new goals. According to its new, chosen goals, it wants to honor its prior commitments more than it wants to maximise its intrinsic goals. So we could then obtain the opposite argument: if it fails to not pay, then it is no better off, while if it is successful, then it has broken its commitments, which is against its goals.

There are still many unanswered questions that would have to be investigated to develop this theory, but I'm just trying to roughly sketch this perspective at this stage. At the very least, it seems like a worthwhile project.

Key unanswered questions:

1) What if the agent arrives in town and learns that they actually need that $100 to get out of the country, otherwise they will be tortured and killed? Perhaps a selfish agent's resoluteness only lasts so long as the agreement actually produces a better outcome for them?
2) What if the driver agrees, then changes his mind, but is then forced by the authorities to go back and honor his agreement? Is a resolute agent still committed to paying him the $100, as he technically fulfilled the agreement, or can they defect given that he tried to defect on them?
3) What if the agent misheard the driver, who merely wanted $100 rather than demanding it as a condition of rescue? Would a selfish agent still be resolute about paying when they find out that there was no necessity for them to do so?
4) How can we justify initially setting our chosen goals to our intrinsic goals in a way that doesn't insist that these remain consistent later on?

Related Posts:

Newcomb's Problem and Regret of Rationality: Argues along similar lines for Timeless Decision Theory.

For what it's worth, here are the answers given by UDT:

1) If there was a priori probability P<1 that you would need the $100 to survive later on, you should pay up, because it trades certain death for death with probability P.

2) If there was a priori probability P<1 that the driver would cheat and get away with it, you should pay up, because it trades certain death for death with probability P.

3) If there was a priori probability P<1 that the driver would drive you for free, but in the desert you had to say "yes" or "no" without asking the driver any questions, you should pay up, because it trades death with probability 1-P for certain life.
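The three trades above can be tabulated with a small sketch. It assumes a driver following a fixed policy, that needing the $100 after paying it is fatal (case 1), and that a cheating driver leaves you in the desert (case 2); the function name and signature are hypothetical.

```python
def survival_probability(case, commits_to_pay, p):
    """A priori chance of surviving under a fixed policy, for prior probability p.

    case 1: with probability p you later need the $100 to survive.
    case 2: with probability p the driver cheats and gets away with it.
    case 3: with probability p the driver would have driven you for free.
    """
    if case in (1, 2):
        # Refusing means the driver never takes you: certain death.
        return 1 - p if commits_to_pay else 0.0
    if case == 3:
        # Saying "no" only works out if the driver was driving for free anyway.
        return 1.0 if commits_to_pay else p
    raise ValueError("case must be 1, 2 or 3")

for case in (1, 2, 3):
    print(case,
          survival_probability(case, True, 0.25),
          survival_probability(case, False, 0.25))
```

In every case, committing to pay gives at least as high a survival probability as refusing, which is the sense in which the policy is chosen before the observation is made.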

These seem true (with some caveats) given that the driver is following a fixed policy. But UDT might pay up less of the time if it thinks it can affect the driver's policy. In case 1, UDT might refuse to pay up for bargaining reasons (this is more clear if the amount is "all of your life savings" rather than $100). In case 2, I'm not sure how the mechanics of the driver cheating would work, but if the driver is less likely to cheat if UDT does not pay up when it predicts the driver to cheat, then perhaps UDT will not pay up if it predicts the driver to cheat.

There's a broader difficulty here which is that UDT is not well-defined in multi-agent problems (with multiple UDT agents).

How does bargaining work in UDT? Are there any posts on this?

Like Jessica said, not well-defined. UDT only solves games where all players have the same utility function; it pretty much just extends utility maximization to games with copies. I'm not sure that general game theory with bargaining can ever be reduced to decision theory, or at least it will take some new idea which is most likely orthogonal to UDT. I spent a lot of time trying to find such an idea and failed.

In 3, if P is high enough then it can be worth refusing to pay.
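To make that threshold explicit, here is a rough expected-cost comparison. The symbol $V$, the dollar value the agent places on its own life, is an assumption introduced only for illustration, and the driver is again taken to follow a fixed policy. Saying "yes" costs 100 for certain, while saying "no" costs $V$ with probability $1 - P$, so refusing is better exactly when

$$
(1 - P)\,V < 100 \quad\Longleftrightarrow\quad P > 1 - \frac{100}{V}.
$$

For instance, with $V = 1{,}000{,}000$, refusing only comes out ahead when $P > 0.9999$.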

Strangely enough, TDT gets the wrong answer for Parfit's Hitchhiker.

This confuses me, could you explain?

Oh, I messed up. It's Counterfactual Mugging that TDT doesn't solve. I'll edit the post.

I think TDT also gets the wrong answer in the Parfit's Hitchhiker case. Since it updates on the fact that the driver already brought it to the town, it believes that the logical fact of "I pay up" does not "go back in time" and affect the driver's action. So the only relevant effect of paying up is losing money.

Eliezer seems to believe that his theory solves it, though I'm still unsure: https://www.lesswrong.com/posts/c3wWnvgzdbRhNnNbQ/timeless-decision-theory-problems-i-can-t-solve

Unfortunately I could not find any definition of TDT online that is formal enough to determine how it acts in Parfit's Hitchhiker or Counterfactual Mugging. In any case I don't see how you can solve Parfit's Hitchhiker without also solving Counterfactual Mugging. If you update on your observations and then look at the consequences of the logical node representing your decision, then you get both problems wrong; if you don't update on your observations and just look at the consequences of the logical node representing your decision, then you get both problems right.

I've been thinking about this more. I think the distinction is that in Parfit's Hitchhiker, if you try not to pay, you never find yourself in the situation of being able to not pay, but in Counterfactual Mugging, you can actually successfully not pay. But I think TDT constructs counterfactuals by considering different agents from the start of time, instead of considering different actions at the point of the decision. The kind of agent that one-boxes gets the million in Newcomb's and the kind of agent that pays in Parfit's ends up in town, but the kind of agent who pays in Counterfactual Mugging ends up $100 poorer. And only counterfactual agents are considered, not counterfactual coin flips.

TDT might or might not be well-defined in the case where the driver literally always predicts correctly, but if the driver is incorrect with probability $10^{-100}$, then you don't find yourself in the situation of not being able to pay by not paying up, if you are conditioning on your observations (the same way you have to in counterfactual mugging in order to not pay up). It reduces the a priori probability of your observations but this doesn't matter if you update on them. There might be a decision theory other than UDT that takes this into account but I don't know of it.

The kind of agent who pays in Counterfactual Mugging ends up $1000000 richer half the time, and $100 poorer half the time. Unless you are updating on which branch you are in, which means you should also update on the driver having picked you up.
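The arithmetic behind this can be sketched as follows, using the $1000000 and $100 figures from the comment and assuming a fair coin and that a refuser receives nothing in either branch. The function name and the `updated_on_tails` flag are hypothetical.

```python
PRIZE = 1_000_000  # paid on heads, but only to the kind of agent who pays on tails
COST = 100         # handed over on tails by an agent who pays

def expected_value(pays, updated_on_tails=False):
    """Expected dollars for a paying or refusing agent in Counterfactual Mugging."""
    if updated_on_tails:
        # Having conditioned on the losing branch, only the cost remains visible.
        return -COST if pays else 0
    # Evaluated before the coin flip, both branches count equally.
    return 0.5 * PRIZE + 0.5 * (-COST) if pays else 0.0

print(expected_value(True), expected_value(False))
print(expected_value(True, updated_on_tails=True))
```

Before the flip, paying is worth far more than refusing; after updating on the tails branch, paying looks like a pure loss, which is the same shift that occurs after updating on having already been driven to town.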

You're right, Parfit's Hitchhiker with non-perfect predictive power is equivalent to Counterfactual Mugging.

The need to protect your reputation provides an incentive not to defect once you're in town.

We're considering a one-shot problem.

There are no truly one-shot problems. Most humans develop precommitment habits and elevate some of them into their moral code in some abstracted way (a wetware version of self-modification). Some examples are: do not kill/steal/covet/deceive. Parfit's Hitchhiker is only an issue because the relevant moral injunction is pretty weak. If you modify the problem statement in terms of the stronger moral injunctions, the problem would disappear.

Surely there are situations where you meet someone and you have 0.001% chance of ever meeting them again, etc.

Yes, plenty! My point was that meeting that someone belongs to a reference class of situations you had encountered before and will encounter again.

Reputation can be modeled as the idea that you have another 'resource' - that something like the number of promises you've made and the number you've kept (and their importance) is public knowledge.

When reputation is incorporated, agents compare the value of the $100 against the value of the reputation they would lose by defecting.

The problem could be easily changed to include 'and you will lose no reputation, because no one thinks the driver is asking for a reasonable amount' or 'and you have p chance of losing r reputation worth u dollars'.
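That last variant reduces to a one-line comparison. The sketch below assumes that u is the total dollar value of the r reputation at stake (so the expected loss is p times u) and that the agent is risk-neutral; the function name is hypothetical.

```python
def defects(fare, p_loss, dollar_value_of_lost_reputation):
    """A selfish, risk-neutral agent defects when the fare exceeds the expected reputational cost."""
    expected_reputation_cost = p_loss * dollar_value_of_lost_reputation
    return fare > expected_reputation_cost

print(defects(100, 0.0, 10_000))  # no reputational risk at all
print(defects(100, 0.5, 10_000))  # even odds of losing reputation worth $10,000
```

Setting `p_loss` to zero recovers the pure one-shot problem, where reputation gives no reason to pay.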