
After Johannes Treutlein's comment on Smoking Lesion Steelman, and a number of other considerations, I had almost entirely given up on CDT. However, there were still nagging questions about whether the kind of self-ignorance needed in Smoking Lesion Steelman could arise naturally, how it should be dealt with if so, and what role counterfactuals ought to play in decision theory if CDT-like behavior is incorrect. Today I sat down to collect all the arguments which have been rolling around in my head on this and related issues, and arrived at a place much closer to CDT than I expected.


CDT differs from EDT only in those cases where a parent of the decision node is not directly observed. If the causal parents of a decision are all observations, then there is no probabilistic relationship for causal reasoning to remove. So, there are two critical questions to motivate CDT as an alternative to EDT:

  1. (Existence of cases where CDT ≠ EDT) Are there necessarily parents of the decision which cannot be observed?
  2. (Correctness of CDT) If so, is CDT the correct way to account for this?
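Before turning to those questions, here is a minimal numerical sketch of the claim above. Everything in it is made up for illustration (it is not any canonical problem): a hidden parent C correlates with the action through the agent's disposition, EDT conditions on the action, and CDT performs the do() surgery that cuts the C -> A link. When C is observed, the two coincide; when it is hidden, they can come apart.

```python
# Toy illustration (made-up numbers): CDT and EDT agree when the causal
# parent C of the action is observed, and can disagree when it is hidden.

P_C = {0: 0.5, 1: 0.5}                     # prior on the hidden parent
P_A_GIVEN_C = {0: {0: 0.8, 1: 0.2},        # P(A=a | C=c): the agent's
               1: {0: 0.2, 1: 0.8}}        # "disposition", correlating A with C

def utility(a, c):
    # A = 1 pays off only when C = 1; A = 0 is a safe default.
    return (10.0 if c == 1 else -10.0) if a == 1 else 1.0

def edt_value(a, observed_c=None):
    """E[U | A=a (, C=c)] -- plain conditioning."""
    if observed_c is not None:
        return utility(a, observed_c)
    # P(C=c | A=a) by Bayes' rule, then average the utility.
    z = sum(P_C[c] * P_A_GIVEN_C[c][a] for c in P_C)
    return sum(P_C[c] * P_A_GIVEN_C[c][a] / z * utility(a, c) for c in P_C)

def cdt_value(a, observed_c=None):
    """E[U | do(A=a) (, C=c)] -- sever the C -> A link, keep C's prior."""
    if observed_c is not None:
        return utility(a, observed_c)
    return sum(P_C[c] * utility(a, c) for c in P_C)

for c in (None, 0, 1):
    print("C =", c,
          "EDT picks", max((0, 1), key=lambda a: edt_value(a, c)),
          "CDT picks", max((0, 1), key=lambda a: cdt_value(a, c)))
# With C hidden they disagree (EDT picks 1, CDT picks 0); with C observed
# there is no spurious correlation left to remove, and they coincide.
```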

Existence of Cases

Smoking Lesion Steelman was an attempt to set up a scenario answering #1. However, it was very contrived; I had to specifically rob the agents of all introspection and set up an important piece of knowledge about their own decision process which they lacked. Are there more natural examples?

Certainly there will be influences on a real agent's decision process which the agent cannot observe. For example, it is always possible for a cosmic ray to hit a computer and change the output of a decision calculation. However, CDT doesn't seem to help with this. The right way to guard against this is via fault-tolerant implementations which use redundant calculations and memory; but to the extent that a decision procedure might guard against it by having checks in the world-model saying "if things look too goofy I've probably been hit in the head", CDT would potentially sever those probabilistic links, keeping the agent from inferring that it is in an error state. (More on this kind of failure of CDT later.)

Sam Eisenstat has argued (in an in-person discussion) that an example of #1 is the consistency of the logic an agent uses. In the counterexample to the trolljecture, an agent reasons as if it had control over the consistency of its logic, leading to an intuitively wrong decision. Perhaps consistency should be thought of as a parent of the decision, so that CDT-like reasoning would avoid the mistake. Sam also has a probabilistic version of this, to show that it's not just a peculiarity of MUDT. The probabilistic version is not consistent with logical induction, so I'm still not sure whether this situation can arise when an agent is built with best practices; but, I can't entirely dismiss it right now, either. (It's also unclear whether the consistency of the logic should be treated as a causal parent of a decision in general; but, in this case, it does seem like we want CDT-like reasoning: "my action cannot possibly influence whether my logic is consistent, even if it provides evidence as to whether my logic is consistent".)

Correctness

As for #2, Johannes' comment helped convince me that CDT isn't doing exactly the right thing to account for inputs to the decision process which can't be observed. A discussion with Tom Everitt also helped me see what's going on: he pointed out that CDT fails to make use of all available information when planning. Johannes' example fits that idea, but for another example, suppose that an absent-minded football fan always forgets which team is their favorite. They know that if they buy sports memorabilia, they almost always buy their favorite team's gear on impulse. They also know that if they purchased the wrong team's gear, they would hate it; and on the other hand, they love having their favorite team's gear. They find themselves choosing between mugs for several different teams. Should they make a purchase?

EDT says that conditioned on making a purchase, it was very likely to be the right team. Therefore, it is a good idea to make the purchase. CDT, on the other hand, treats the favorite team as an un-observed input to the decision process. Since it is a causal parent, the probabilistic connection gets cut; so, the CDT agent doesn't make the inference that whatever mug they pick is likely to be the right one. Instead, since they're choosing between several mugs, each mug is individually more likely to be the wrong one than the right one. So, CDT doesn't buy a mug.

(The reader may be concerned that this problem is ill-specified for the same reason the standard Smoking Lesion is ill-specified: if the agent is running CDT or EDT, how is it possible that it has a bias toward choosing the right team? We can address this concern with a less extreme version of the trick I used on Smoking Lesion: the decision procedure runs CDT or EDT utility calculations, but then adds epsilon utility to the expected value of any purchase involving the favorite team, which breaks ties in the expected value calculation. This is enough to make the EDT agent reliably purchase the right gear.)
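Here is the example as a small worked computation. The particular numbers (four teams, a 95% impulse-accuracy when purchasing, +10/-10/0 utilities) are made up; the story above only fixes them qualitatively.

```python
# Football-mug example with made-up numbers.  F = the forgotten favorite team
# (uniform prior over 4 teams).  The impulse bias: conditional on buying at
# all, the fan picks the favorite's mug 95% of the time.

TEAMS = [0, 1, 2, 3]
P_F = {f: 1 / len(TEAMS) for f in TEAMS}          # uniform prior on the favorite
BUY_PROB, RIGHT_PROB = 0.9, 0.95                  # assumed disposition parameters
ACTIONS = TEAMS + ["none"]

def p_action_given_f(a, f):
    """P(A=a | F=f) under the assumed impulse-buying disposition."""
    if a == "none":
        return 1 - BUY_PROB
    hit = RIGHT_PROB if a == f else (1 - RIGHT_PROB) / (len(TEAMS) - 1)
    return BUY_PROB * hit

def utility(a, f):
    return 0.0 if a == "none" else (10.0 if a == f else -10.0)

def edt_value(a):
    """E[U | A=a]: update beliefs about the favorite from the action itself."""
    z = sum(P_F[f] * p_action_given_f(a, f) for f in TEAMS)
    return sum(P_F[f] * p_action_given_f(a, f) / z * utility(a, f) for f in TEAMS)

def cdt_value(a):
    """E[U | do(A=a)]: the F -> A link is cut, so the favorite keeps its prior."""
    return sum(P_F[f] * utility(a, f) for f in TEAMS)

print({a: round(edt_value(a), 2) for a in ACTIONS})   # each mug ~ +9, none = 0
print({a: round(cdt_value(a), 2) for a in ACTIONS})   # each mug = -5, none = 0
# EDT happily buys a mug; CDT sees every individual mug as probably wrong
# and walks away, exactly as described above.
```

Note that EDT values every mug equally here, which is exactly where the epsilon tie-break from the parenthetical comes in: it nudges the purchase toward the favorite team's mug.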

Intuitively, what's going on here is that CDT is refusing to make use of important information in making its decision. EDT goes wrong by confusing evidence for influence; CDT goes wrong by discarding the evidence along with the influence. To the extent that which action we take tells us important information about ourselves which we didn't already know, we want to take this into account when making a decision.

But how can we "take this information into account" without confusing causation and correlation? We can't both change our expectation to account for the information and hold the probability of causal parents still when making a decision, can we? What sort of decision rule would allow us to get all the examples so far right in a principled way?

An Idea from Correlated Equilibria

I won't be very surprised if the following proposal is already present in the literature. This is a revision of what I proposed in response to Johannes' comment, a proposal he pointed out did not work (it was similar to Gandalf's Solution to the Newcomb Problem by Ralph Wedgwood, which Johannes pointed me to). It also bears some resemblance to a post of Scott's.

It seems to me that the problem comes from the fact that when we condition on the actual action we will take, we get good information, which we want to incorporate into our beliefs for the sake of making the right decision. However, when we condition on taking some other action, we get bad information which we don't want to use for making the kind of inference about ourselves which we need in order to get the football-mug example right.

This reminds me of correlated equilibria, which have been discussed quite a bit around MIRI lately. In one way of setting them up, some outside information tells each agent what action to take, with a joint distribution on actions known to all agents. If all agents would accept this advice, then the joint distribution is a stable equilibrium. So, (1) you know which action you will take -- not just what probability you have on different actions, as in Nash equilibria; and (2) a max-expected-value calculation still decides to take that action, after knowing what it is.
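For readers who haven't seen correlated equilibria spelled out, here is a quick self-contained check on the textbook Chicken example (this game and its payoff numbers are a standard illustration, not something from the post): each player is told only its own recommended action, updates on what that recommendation implies about the other player's action, and still finds that following the recommendation maximizes expected payoff.

```python
# Correlated equilibrium check for the game of Chicken (standard example).
# Payoffs (row, col) for actions C = chicken, D = dare.
PAYOFF = {("C", "C"): (6, 6), ("C", "D"): (2, 7),
          ("D", "C"): (7, 2), ("D", "D"): (0, 0)}

# The mediator's joint distribution over recommendations.
JOINT = {("C", "C"): 1/3, ("C", "D"): 1/3, ("D", "C"): 1/3, ("D", "D"): 0.0}

def is_correlated_equilibrium(joint, payoff):
    for player in (0, 1):
        for told in ("C", "D"):
            # Conditional distribution over the opponent's action, given what
            # this player was told to do.
            weight = {}
            for (a0, a1), p in joint.items():
                mine, theirs = (a0, a1) if player == 0 else (a1, a0)
                if mine == told:
                    weight[theirs] = weight.get(theirs, 0.0) + p
            z = sum(weight.values())
            if z == 0:
                continue  # this recommendation never happens

            def value(action):
                total = 0.0
                for theirs, p in weight.items():
                    profile = (action, theirs) if player == 0 else (theirs, action)
                    total += p / z * payoff[profile][player]
                return total

            # Following the recommendation must be (weakly) optimal.
            if value(told) + 1e-9 < max(value(a) for a in ("C", "D")):
                return False
    return True

print(is_correlated_equilibrium(JOINT, PAYOFF))
# True: each player, knowing its own recommended action, still prefers to follow it.
```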

This inspired me to try the following decision rule:

    a ∈ argmax_{a'} E[ U | do(A = a'), A = a ]
IE, the chosen action must be the best causal intervention, after taking into account (as evidence) the fact that we chose it. I'll call this CEDT (correlated equilibrium DT) for now. Scott pointed out that this is a pretty bad name, since many concepts of rationality, including Nash equilibria and logical induction, can be seen as having this property of approving of their output under the condition that they know their output. I therefore rename it to causal-evidential decision theory (keeping the acronym CEDT).

Note that there may be several consistent fixed-points of the above equation. If so, we have a problem of selecting fixed-points, just like we do with Nash equilibria. Bad choices may be self-fulfilling prophecies.

In the football-mug scenario, every mug becomes an equilibrium solution; taking the mug is evidence that it is the right one, so switching to other mugs looks like a bad idea. Not buying a mug is also an equilibrium. To keep my assumption that the decision is almost always right if the merchandise is purchased, I must suppose that the right equilibrium gets selected (by whatever process does that): the right mug is purchased.
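A sketch of that equilibrium check, reusing the made-up mug numbers from the earlier snippet: an action a passes when no intervention beats it once we have conditioned, as evidence, on A = a.

```python
# CEDT equilibrium check for the mug scenario (same made-up numbers as before:
# uniform prior over 4 teams, 95% impulse accuracy, +10 / -10 / 0 utilities).
TEAMS = [0, 1, 2, 3]
ACTIONS = TEAMS + ["none"]
BUY_PROB, RIGHT_PROB = 0.9, 0.95

def p_action_given_f(a, f):
    if a == "none":
        return 1 - BUY_PROB
    hit = RIGHT_PROB if a == f else (1 - RIGHT_PROB) / (len(TEAMS) - 1)
    return BUY_PROB * hit

def utility(a, f):
    return 0.0 if a == "none" else (10.0 if a == f else -10.0)

def value_of_intervention(a_do, a_evidence):
    """E[U | do(A = a_do), A = a_evidence]: beliefs about the favorite F are
    updated on the evidence that we chose a_evidence, but the intervention
    a_do does not change F.  (The uniform prior on F cancels out of Bayes.)"""
    z = sum(p_action_given_f(a_evidence, f) for f in TEAMS)
    return sum(p_action_given_f(a_evidence, f) / z * utility(a_do, f) for f in TEAMS)

def is_cedt_equilibrium(a):
    best = max(value_of_intervention(a_do, a) for a_do in ACTIONS)
    return value_of_intervention(a, a) + 1e-9 >= best

print({a: is_cedt_equilibrium(a) for a in ACTIONS})
# Every mug is an equilibrium (conditioning on buying it makes it probably the
# right one), and "none" is an equilibrium too (with no evidence about the
# favorite, every specific mug looks like a bad bet), matching the claim above.
```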

In the cosmic ray example, if some particular action is strong evidence of cosmic-ray-induced malfunction, such that the AI should enter a "safely shut down" mode, then this action cannot be an equilibrium; the AI would switch to shutdown if it observed it. Whether the fact that it isn't an equilibrium actually helps it to happen less given cosmic ray exposure is another question. (The CDT-style reasoning may still be cutting other links which would help detect failure; but more importantly, decision theory still isn't the right way to increase robustness against cosmic rays.)

In the Smoking Lesion Steelman scenario, non-smoke-loving CEDT robots would decline to smoke no matter what action they thought they'd take, since it only costs them. So, the only equilibrium is for them not to smoke. The smoke-loving CEDT robots would smoke if they were told they were going to smoke, or if they were told they wouldn't; so the only equilibrium for them has them doing so.
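To see why the conditioning drops out of the comparison, here is a sketch with illustrative payoffs and probabilities (the +10 bonus, the -1 cost, the cancer probabilities and the -100 penalty are assumptions for this snippet, not necessarily the values in the original Smoking Lesion Steelman post). Since the intervention cannot change the cancer probability, the term contributed by the evidence is identical for both candidate actions, and each type of robot has a unique equilibrium.

```python
# CEDT check for a simplified Smoking Lesion Steelman, with illustrative numbers.
# Hidden type T: "lover" robots get +10 from smoking; "non-lover" robots pay a
# small cost.  Cancer depends only on T, never on the action.
P_CANCER = {"lover": 0.9, "non-lover": 0.1}        # assumed correlation with type
CANCER_PENALTY = -100.0
SMOKE_REWARD = {"lover": 10.0, "non-lover": -1.0}  # assumed direct payoffs

def expected_utility(my_type, intervention, p_lover_given_evidence):
    """E[U | do(intervention), A = a], where conditioning on A = a has already
    been folded into p_lover_given_evidence = P(T = lover | A = a).  The
    utility function used is the robot's own (my_type), even though the robot
    cannot introspect on it."""
    direct = SMOKE_REWARD[my_type] if intervention == "smoke" else 0.0
    p_cancer = (p_lover_given_evidence * P_CANCER["lover"]
                + (1 - p_lover_given_evidence) * P_CANCER["non-lover"])
    return direct + p_cancer * CANCER_PENALTY

for my_type in ("lover", "non-lover"):
    for p_lover in (0.0, 0.5, 1.0):   # whatever the conditioning happens to say
        eu = {a: expected_utility(my_type, a, p_lover) for a in ("smoke", "refrain")}
        print(my_type, p_lover, max(eu, key=eu.get))
# Lovers prefer "smoke" and non-lovers prefer "refrain" at every value of
# p_lover, so each type has a unique CEDT equilibrium, as claimed above.
```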

In Suicidal Smoking Lesion Steelman (Johannes' variant), the non-smoke-lovers again don't smoke. On the other hand, if I've thought this through correctly, smoke-lovers may either smoke or not smoke: if they're told they don't smoke, then they aren't sure whether they'll be killed or not; so, switching to smoking looks too dangerous. On the other hand, if they're told that they do smoke, then they know they will die, so they're happy to smoke.

So, in the examples I've looked at, it seems like CEDT always includes the desired behavior as a possible equilibrium. I'm not sure how to measure the "best" equilibrium and select it. Selecting based on expected value would seem to re-introduce EDT-style mistakes: it might select equilibria where its actions hide some information, so as to increase EV. Another candidate is to select based on the expected benefit, after updating on what action is taken, of taking that action rather than alternatives; IE, the gain. This isn't obviously correct either.

This isn't a final solution, by any means, but it is enough to make me willing to say that CDT isn't so far off the mark. As long as a CDT agent knows its own action before it has decided, and is given the right causal representation of the problem, then it can at least get examples like this right, supposing it happens to be in the right fixed-point.

Thanks to Johannes, Tom, Sam, Scott, Patrick, and many participants at this year's AISFP program for discussions leading to this post.

Comments

Nice writeup. Is one-boxing in Newcomb an equilibrium?