ACDT: a hack-y acausal decision theory

by Stuart_Armstrong, 15th Jan 2020


Crossposted from the AI Alignment Forum.

Inspired by my post on problems with causal decision theory (CDT), here is a hacked version of CDT that seems to be able to imitate timeless decision theory (TDT) and functional decision theory[1] (FDT), as well as updateless decision theory (UDT) under certain circumstances.

Call this ACDT, for (a)causal decision theory. It is, essentially, CDT which can draw extra, acausal arrows on the causal graphs, and which attempts to figure out which graph represents the world it's in. The drawback is its lack of elegance; the advantage, if it works, is that it's simple to specify and focuses attention on the important aspects of deducing the graph.

Defining ACDT

CDT and the Newcomb problem

In the Newcomb problem, there is a predictor who leaves two boxes, and predicts whether you will take one ("one-box") or both ("two-box"). If the predictor predicts you will one-box, it has put a large prize in the first box; otherwise that box is empty. There is always a small consolation prize in the second box.

In terms of causal graphs, we can represent it this way:

The dark red node is the decision node, which the agent can affect. The green node is a utility node, whose value the agent cares about.

The CDT agent uses the "do" operator from Pearl's Causality. Essentially, all the incoming arrows to the decision node are cut (though the CDT agent keeps track of any information gained that way), then the CDT agent maximises its utility by choosing its action:

In this situation, the CDT agent will always two-box, since it treats the predictor's decision as fixed, and in that case two-boxing dominates, since you get whatever's in the first box, plus the consolation prize.
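The dominance argument can be checked numerically. The following is a minimal sketch, not something from the original post: the prize amounts and the names `payoff` and `cdt_choice` are illustrative assumptions. The prediction is given a fixed probability that the action cannot influence, which is how a CDT agent models it.

```python
# Illustrative payoffs: these figures are assumptions, the post gives none.
BIG_PRIZE = 1_000_000   # in the first box iff one-boxing was predicted
SMALL_PRIZE = 1_000     # always in the second box

ACTIONS = ("one-box", "two-box")


def payoff(action: str, prediction: str) -> int:
    """Money received for a given action and a given (fixed) prediction."""
    first_box = BIG_PRIZE if prediction == "one-box" else 0
    second_box = SMALL_PRIZE if action == "two-box" else 0
    return first_box + second_box


def cdt_choice(p_predicted_one_box: float) -> str:
    """CDT: the prediction has a fixed distribution the action cannot change."""
    def expected_utility(action: str) -> float:
        return (p_predicted_one_box * payoff(action, "one-box")
                + (1 - p_predicted_one_box) * payoff(action, "two-box"))
    return max(ACTIONS, key=expected_utility)
```

Whatever probability is passed to `cdt_choice`, two-boxing comes out ahead by exactly the consolation prize, so the CDT agent's belief about the prediction never changes its action.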

ACDT algorithm

The ACDT algorithm is similar, except that when it cuts the causal links to its decision, it also adds potential links from that decision node to all the other nodes in the graph. Then it attempts to figure out which diagram is correct, and then maximises its utility in the CDT way.

Note that ACDT doesn't take a position on what these extra links are - whether they are pointing back in time or are reflecting some more complicated structure (such as the existence of predictors). It just assumes the links could be there, and then works from that.
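The cut-then-add step can be sketched as a small graph enumeration. This is only one possible reading of the description above: the function `candidate_graphs` and the node names are hypothetical, and the Newcomb-style edge list is an assumption about what the missing figure contains.

```python
from itertools import combinations


def candidate_graphs(nodes, edges, decision):
    """Cut every arrow into `decision`, then return one graph per subset of
    potential new arrows from `decision` to the other nodes."""
    kept = frozenset((a, b) for (a, b) in edges if b != decision)
    potential = [(decision, n) for n in nodes
                 if n != decision and (decision, n) not in kept]
    graphs = []
    for k in range(len(potential) + 1):
        for extra in combinations(potential, k):
            graphs.append(kept | frozenset(extra))
    return graphs


# Hypothetical node names for a Newcomb-style graph.
NODES = ["algorithm", "prediction", "decision", "money"]
EDGES = [("algorithm", "prediction"), ("algorithm", "decision"),
         ("prediction", "money"), ("decision", "money")]

graphs = candidate_graphs(NODES, EDGES, "decision")
```

The first element of `graphs` is the plain CDT graph with no extra arrows. Because the decision node has no incoming arrows after the cut, adding outgoing arrows from it cannot create a cycle, so every candidate is still a valid directed acyclic graph.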

In a sense, ACDT can be seen as anterior to CDT. How do we know that causality exists, and the rules it runs on? From our experience in the world. If we lived in a world where the Newcomb problem or the "predictors exist" problem were commonplace, then we'd have a different view of causality.

It might seem gratuitous and wrong to draw extra links coming out of your decision node - but it was also gratuitous and wrong to cut all the links that go into your decision node. Drawing these extra arrows undoes some of the damage, in a way that a CDT agent can understand (they don't understand things that cause their actions, but they do understand consequences of their actions).

ACDT and the Newcomb problem

As well as the standard CDT graph above, ACDT can also consider the following graph, with a link from its decision to the predictor's prediction:

It now has to figure out which graph represents the better structure for the situation it finds itself in. If it's encountered the Newcomb problem before, and tried to one-box and two-box a few times, then it knows that the second graph gives more accurate predictions. And so it will one-box, just as well as the TDT family does.
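One way to make "gives more accurate predictions" concrete is Bayesian model comparison. The post does not specify a scoring rule, so the sketch below is an assumption: it scores each graph by the marginal likelihood of past box contents under a uniform prior, then maximises expected utility under the winning graph. The payoffs, the `experience` data and all function names are illustrative.

```python
from math import lgamma

BIG_PRIZE, SMALL_PRIZE = 1_000_000, 1_000   # assumed payoffs
ACTIONS = ("one-box", "two-box")


def log_marginal(full: int, empty: int) -> float:
    """Log marginal likelihood of Bernoulli outcomes under a uniform prior."""
    return lgamma(full + 1) + lgamma(empty + 1) - lgamma(full + empty + 2)


def counts(history, action=None):
    """(times the first box was full, times empty), optionally for one action."""
    rows = [full for (a, full) in history if action is None or a == action]
    return sum(rows), len(rows) - sum(rows)


def acdt_choice(history):
    """Pick the better-scoring graph, then maximise utility in the CDT way."""
    score_no_link = log_marginal(*counts(history))
    score_link = sum(log_marginal(*counts(history, a)) for a in ACTIONS)
    use_link = score_link > score_no_link   # ties favour the plain CDT graph

    def expected_utility(action):
        full, empty = counts(history, action if use_link else None)
        p_full = (full + 1) / (full + empty + 2)   # posterior mean
        bonus = SMALL_PRIZE if action == "two-box" else 0
        return p_full * BIG_PRIZE + bonus

    return max(ACTIONS, key=expected_utility)


# History of (action, was the first box full?) from earlier encounters.
experience = [("one-box", True)] * 10 + [("two-box", False)] * 10
```

With no history, the two graphs score equally, the tie goes to the plain CDT graph, and the agent two-boxes. With the `experience` above, the graph with the extra arrow scores much better and the agent one-boxes.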

Generalising from other agents

If the ACDT agent has not encountered the predictor itself, but has seen it run the Newcomb problem for other agents, then the task of "figuring out the true graph" becomes more subtle. UDT and TDT are built from the assumption that equivalent algorithms/agents in equivalent situations will produce equivalent results.

But ACDT, built out of CDT and its solipsistic cutting process, has no such assumptions - at least, not initially. It has to learn that the fate of other, similar agents, is evidence for its own graph. Once it learns that generalisation, then it can start to learn from the experience of others.

ACDT on other decision problems

Predictors exist

Each round of the "predictors exist" problem has a graph similar to the Newcomb problem, with the addition of a node to repeat the game:

After a few rounds, the ACDT agent will learn that the following graph best represents its situation:

And it will then swiftly choose to leave the game.

Prisoner's dilemma with identical copy of itself

If confronted by the prisoner's dilemma with an identical copy of itself, the ACDT agent, though unable to formalise "we are identical", will realise that they always make the same decision:

And it will then choose to cooperate.
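The effect of the learned arrow can be shown with a small payoff table. This is a sketch under assumed payoffs (the post gives none); `PAYOFF`, `cdt_choice` and `acdt_choice` are hypothetical names.

```python
# Assumed prisoner's dilemma payoffs for the row player (not from the post).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def cdt_choice(p_copy_cooperates: float) -> str:
    """Plain CDT graph: the copy's action is fixed independently of mine."""
    def expected_utility(me: str) -> float:
        return (p_copy_cooperates * PAYOFF[(me, "C")]
                + (1 - p_copy_cooperates) * PAYOFF[(me, "D")])
    return max("CD", key=expected_utility)


def acdt_choice() -> str:
    """Learned graph: an arrow from my decision to the copy's decision,
    so the copy's action always equals mine."""
    return max("CD", key=lambda me: PAYOFF[(me, me)])
```

Under the plain CDT graph defection dominates for every belief about the copy, while under the learned graph only the diagonal outcomes are reachable and cooperation wins.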

Parfit's hitchhiker

The Parfit's hitchhiker problem is as follows:

Suppose you're out in the desert, running out of water, and soon to die - when someone in a motor vehicle drives up next to you. Furthermore, the driver of the motor vehicle is a perfectly selfish ideal game-theoretic agent, and even further, so are you; and what's more, the driver is Paul Ekman, who's really, really good at reading facial microexpressions. The driver says, "Well, I'll convey you to town if it's in my interest to do so - so will you give me $100 from an ATM when we reach town?"

Now of course you wish you could answer "Yes", but as an ideal game theorist yourself, you realize that, once you actually reach town, you'll have no further motive to pay off the driver. "Yes," you say. "You're lying," says the driver, and drives off leaving you to die.

ACDT will learn the following graph:

And will indeed pay the driver.

XOR blackmail

XOR blackmail is one of my favourite decision problems.

An agent has been alerted to a rumor that her house has a terrible termite infestation that would cost her $1,000,000 in damages. She doesn’t know whether this rumor is true.

A greedy predictor with a strong reputation for honesty learns whether or not it’s true, and drafts a letter: I know whether or not you have termites, and I have sent you this letter iff exactly one of the following is true: (i) the rumor is false, and you are going to pay me $1,000 upon receiving this letter; or (ii) the rumor is true, and you will not pay me upon receiving this letter.

The predictor then predicts what the agent would do upon receiving the letter, and sends the agent the letter iff exactly one of (i) or (ii) is true. Thus, the claim made by the letter is true. Assume the agent receives the letter. Should she pay up?

The CDT agent will have the following graph:

And the CDT agent will make the simple and correct decision not to pay.

ACDT can eventually reach the same conclusion, but may require more evidence. It also has to consider graphs of the following sort:

The error of evidential decision theory (EDT) is, in effect, to act as if the light green arrow existed: that the agent can affect the existence of the termites through its decision.

ACDT, if confronted with similar problems often enough, will eventually learn that the light green arrow has no effect, while the dark green one does have an effect (more correctly: the model with the dark green arrow is more accurate, while the light green arrow doesn't add accuracy). It will then refuse to pay, just like the CDT agent does.
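The claim that one arrow adds accuracy and the other does not can be illustrated on synthetic data. The sketch below rests on several assumptions not found in the post: that the dark green arrow runs from the decision to whether the letter is sent, that the agent has seen a perfectly balanced set of past rounds, and that graphs are scored by marginal likelihood under a uniform prior (which penalises arrows that do not help). All names are hypothetical.

```python
from itertools import product
from math import lgamma


def log_marginal(ones: int, zeros: int) -> float:
    """Log marginal likelihood of binary data under a uniform prior."""
    return lgamma(ones + 1) + lgamma(zeros + 1) - lgamma(ones + zeros + 2)


def score(rows, child, parents):
    """Score `child` given `parents`: one Bernoulli per parent setting."""
    cells = {}
    for row in rows:
        key = tuple(row[p] for p in parents)
        ones, zeros = cells.get(key, (0, 0))
        cells[key] = (ones + row[child], zeros + 1 - row[child])
    return sum(log_marginal(o, z) for o, z in cells.values())


# Synthetic, perfectly balanced experience: 25 rounds of each combination.
# The letter is sent iff exactly one of "pays" and "has termites" holds.
rows = [{"pay": pay, "termites": termites, "letter": pay ^ termites}
        for pay, termites in product((0, 1), repeat=2) for _ in range(25)]

# Light green arrow: does the decision help predict termites?
light_green_gain = (score(rows, "termites", ["pay"])
                    - score(rows, "termites", []))
# Dark green arrow: does the decision help predict whether a letter is sent?
dark_green_gain = (score(rows, "letter", ["termites", "pay"])
                   - score(rows, "letter", ["termites"]))
```

On this data `light_green_gain` is negative (the extra arrow only adds parameters) and `dark_green_gain` is large and positive. Once the arrow to the termites is rejected, paying can only subtract $1,000 from the agent's utility, so it refuses.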

Note that we might define ACDT as only creating links with its own parent nodes - putting back the links it cut, but in the other direction. In that case it would only consider links with "Your decision algorithm" and "Letter sent", not with "Termites in house?", and would never pay. But note that "Your decision algorithm" is a logical node that might not exist in physical reality; that's why I designed ACDT to allow links to arbitrary nodes, not just the ones that are its ancestors, so it can capture more models about how the world works.

Not UDT: counterfactual mugging

The ACDT agent described above differs from UDT in that it doesn't pay the counterfactual mugger:

Omega appears and says that it has just tossed a fair coin, and given that the coin came up tails, it decided to ask you to give it $100. Whatever you do in this situation, nothing else will happen differently in reality as a result. Naturally you don't want to give up your $100. But Omega also tells you that if the coin came up heads instead of tails, it'd give you $10,000, but only if you'd agree to give it $100 if the coin came up tails. Do you give the $100?

Non-coincidentally, this problem is difficult to represent in a causal graph. One way of seeing it could be this way:

Here the behaviour of the agent in the tails world determines the mugger's behaviour in the heads world. It would be tempting to try and extend ACDT, by drawing an arrow from that decision node to the node in the heads world.

But that doesn't work, because that decision only happens in the tails world - in the heads world, the agent has no decision to make, so ACDT will do nothing. And in the tails world, the heads world is only counterfactually relevant.

Now ACDT, like EDT, can learn, in some circumstances, to pay the counterfactual mugger. If this scenario happens a lot, then it can note that agents that pay in the tails world get rewarded in the heads world, thus getting something like this:

But that's a bit too much of a hack, even for a hack-y method like this. More natural and proper would be to have the ACDT agent not use its decision as the node to cut-and-add-links from, but its policy (or, as in this post, its code). In that case, the counterfactual mugging can be represented as a graph by the ACDT agent:
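The gap between cutting at the decision and cutting at the policy can be illustrated numerically, using the amounts from the problem statement. This is a rough sketch; the function names are hypothetical, and the policy-level function simply assumes an arrow from the policy to the heads-world reward.

```python
P_HEADS = 0.5     # fair coin, as in the problem statement
REWARD = 10_000   # paid in the heads world to agents who would pay
COST = 100        # requested in the tails world


def decision_level_utility(pay: bool) -> float:
    """Evaluated inside the tails world: the heads world is not reachable."""
    return -COST if pay else 0.0


def policy_level_utility(pays_when_asked: bool) -> float:
    """Evaluated over both coin outcomes, with an arrow from the policy
    to the reward in the heads world."""
    heads = REWARD if pays_when_asked else 0
    tails = -COST if pays_when_asked else 0
    return P_HEADS * heads + (1 - P_HEADS) * tails


best_decision = max((True, False), key=decision_level_utility)
best_policy = max((True, False), key=policy_level_utility)
```

At the decision level refusing is better (0 against -100), while at the policy level paying is better (4,950 against 0), which is why the choice of node to cut and add links from matters here.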

Fully acausal trade

The ACDT agent might have issues with fully acausal trade (though, depending on your view, this might be a feature not a bug).

The reason is that, since the ACDT agent never gets to experience acausal trade, it never gets to check whether there is a link between it and hypothetical other agents. Imagine a Newcomb problem where you never get to see the money (which may be going to a charity you support - but that charity may not exist either), nor whether the predictor exists.

If an ACDT agent ever discovered acausal trade, it would have to do so in an incremental fashion. It would first have to become comfortable enough with prediction problems that drawing links to predictors is a natural thing for it to do. It would then have to become comfortable enough with hypothetical arguments being correct that it could generalise to situations where it cannot ever get any empirical evidence.

So whether an ACDT agent ever engages in fully acausal trade depends on how it generalises from examples.

Neural nets learning to be ACDT

It would be interesting to program a neural net ACDT agent, based on these examples. If anyone is interested in doing so, let me know and go ahead.

Learning graphs and priors over graphs

The ACDT agent is somewhat slow and clunky at learning, needing quite a few examples before it can accept unconventional setups.

If we want it to go faster, we can choose to modify its priors. For example, we can look at what evidence would convince us that an accurate predictor existed, and put a prior that would have a certain graph, conditional on seeing that evidence.
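The effect of the prior on learning speed can be sketched with a two-graph Bayesian update. Everything here is an illustrative assumption rather than the post's method: the evidence is idealised (the first box is full after every one-box and empty after every two-box), each graph is scored by its marginal likelihood under a uniform prior, and the function names are hypothetical.

```python
from math import lgamma, log


def log_beta(a: float, b: float) -> float:
    """Log of the Beta function."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_bayes_factor(n_one_box: int, n_two_box: int) -> float:
    """log [P(data | predictor graph) / P(data | plain CDT graph)], where the
    box was full after every one-box and empty after every two-box."""
    linked = log_beta(n_one_box + 1, 1) + log_beta(1, n_two_box + 1)
    unlinked = log_beta(n_one_box + 1, n_two_box + 1)
    return linked - unlinked


def rounds_until_convinced(prior_link: float, max_rounds: int = 1000) -> int:
    """Rounds of alternating one-box/two-box experience needed before the
    predictor graph has posterior probability above 1/2."""
    log_prior_odds = log(prior_link) - log(1 - prior_link)
    for n in range(max_rounds + 1):
        n_one_box, n_two_box = (n + 1) // 2, n // 2
        if log_prior_odds + log_bayes_factor(n_one_box, n_two_box) > 0:
            return n
    raise ValueError("not convinced within max_rounds")
```

Calling `rounds_until_convinced(0.01)` and `rounds_until_convinced(0.5)` shows that a sceptical prior needs noticeably more rounds than an even one, and a prior already above one half needs none.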

Or if we want to be closer to UDT, we could formalise statements about algorithms, and about their features and similarities (or formalise mathematical results about proofs, and about how to generalise from known mathematical results). Adding that to the ACDT agent gives an agent much closer to UDT.

So it seems that ACDT+"the correct priors" is close to various different acausal agent designs.


  1. Since FDT is still somewhat undefined, I'm viewing it as TDT-like rather than UDT-like for the moment. ↩︎
