Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Attempted versions of CDT and EDT can be constructed using logical inductors, called LICDT and LIEDT. It is shown, however, that LICDT fails XOR Blackmail, and LIEDT fails Newcomb. One interpretation of this is that LICDT and LIEDT do not implement CDT and EDT very well. I argue that they are indeed forms of CDT and EDT, but stray from expectations because they also implement the ratifiability condition I discussed previously. Continuing the line of thinking from that post, I discuss conditions in which LICDT=LIEDT, and try to draw out broader implications for decision theory.


Thanks to Scott and Sam for discussions shaping this post. Also thanks to many participants at AISFP for discussions shaping my current view of counterfactuals.

I'm not sure who gets credit for LICDT and LIEDT, but they've been discussed around MIRI since shortly after logical induction itself. LIEDT was sort of the obvious first thing to try; LICDT is a slight variation. (Scott thinks Jessica may have come up with LICDT.) They might be thought of as the punching bag for better logical induction DTs to be contrasted with (although, Tsvi wrote up a likely-better baseline proposal).

Both LICDT and LIEDT use a logical inductor which has been run for $n$ steps, $\mathbb{P}_n$. I'll abbreviate an agent as $A_n$ (parameterizing by the same $n$ as for the inductor), with $A_n = a$ to say that the agent takes action $a$ from some action set $\mathcal{A}$. We can define self-referential exploration sentences $E_n$, each asserting that $\mathbb{P}_n$ assigns it probability less than $\epsilon$. Both LICDT and LIEDT explore when such sentences are true, taking whichever action the agent least expects itself to take conditioned on its exploring. This forces the agent to take every action with frequency on the order of $\epsilon$ in the limit, and also makes the exploration pseudorandom in the sense that the logical inductor cannot predict it much better than to assign probability $\epsilon$ to exploration (and therefore, neither can any poly-time computable predictor).

When it isn't taking an action due to the exploration clause, LIEDT chooses actions based on the expected utility conditioning on each action. Utility is defined as a logically uncertain variable (LUV), in the terminology of the logical induction paper. Let $U_n$ be the LUV for the utility achieved by $A_n$, and $\mathbb{E}_n(U_n \mid A_n = a)$ be the conditional expectation of $U_n$ in $\mathbb{P}_n$ given that the agent takes action $a$. The conditional expectation is always well-defined thanks to the exploration, which ensures that the probability of each action remains above zero.

LICDT is similar, but rather than taking the expectation conditioned on each action, it takes the expectation conditioned on exploring and taking that action, $\mathbb{E}_n(U_n \mid E_n \wedge A_n = a)$. Judging actions by what would happen if you took those actions randomly, rather than reliably, is supposed to remove the kind of correlation which makes EDT cooperate in prisoner's dilemma, one-box in Newcomb, et cetera. We will see that this only partly works.

Both LICDT and LIEDT include any observations in their deductive state. (There can be special predicates representing sensory states.) So, they are updateful decision theories.
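To make the moving parts concrete, here is a minimal Python sketch of the setup just described. It is only an illustration under assumptions of my own: the `Inductor` interface (`prob`, `cond_prob`, `cond_expect`), the sentence strings, and the constant `EPSILON` are hypothetical stand-ins, not the actual logical-induction machinery.

```python
# A rough, hypothetical sketch of the LICDT/LIEDT setup described above.
# `Inductor` is a stand-in interface, and the sentence strings are informal
# placeholders for properly encoded sentences about the agent and its utility.

EPSILON = 0.05  # illustrative exploration rate

class Inductor:
    """Stand-in for the logical inductor P_n at stage n."""
    def prob(self, sentence: str) -> float:
        raise NotImplementedError  # price of `sentence` at stage n
    def cond_prob(self, sentence: str, given: str) -> float:
        raise NotImplementedError  # conditional probability
    def cond_expect(self, luv: str, given: str) -> float:
        raise NotImplementedError  # conditional expectation of a LUV

def explores(P: Inductor, n: int) -> bool:
    # The self-referential exploration sentence E_n is true exactly when the
    # inductor assigns it probability below EPSILON.
    return P.prob(f"E_{n}") < EPSILON

def exploration_action(P: Inductor, n: int, actions: list) -> str:
    # On exploration rounds, take the action least expected conditional on exploring.
    return min(actions, key=lambda a: P.cond_prob(f"A_{n}={a}", given=f"E_{n}"))

def liedt_action(P: Inductor, n: int, actions: list) -> str:
    # LIEDT: maximize the expectation of U_n conditional on taking the action.
    return max(actions, key=lambda a: P.cond_expect(f"U_{n}", given=f"A_{n}={a}"))

def licdt_action(P: Inductor, n: int, actions: list) -> str:
    # LICDT: maximize the expectation of U_n conditional on *exploring into*
    # the action, i.e. judge an action by what happens when taken as an experiment.
    return max(actions, key=lambda a: P.cond_expect(f"U_{n}", given=f"E_{n} & A_{n}={a}"))

def act(P: Inductor, n: int, actions: list, causal: bool = False) -> str:
    if explores(P, n):
        return exploration_action(P, n, actions)
    return licdt_action(P, n, actions) if causal else liedt_action(P, n, actions)
```

The only difference between the two agents is the `given` clause in the conditional expectation, which is exactly the contrast drawn above.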

LICDT and LIEDT aren't very different, and mostly we just talk about LIEDT, calling it LIDT. However, I've recently been realizing just how similar LICDT and LIEDT really are.

LIEDT two-boxes in Newcomb.

Suppose we have an LIEDT agent facing Newcomb's problem. We can specify a sequence of Newcomb problems (for logical inductors of increasing power) by the utility function $U_n = B \cdot \mathbb{P}_n(O_n) + S \cdot \mathbb{1}(\neg O_n)$, where $B \gg S > 0$ are the values of the big and small boxes, $O_n$ is the proposition stating that the agent (of power $n$) one-boxes, and $\mathbb{1}$ is the indicator function which returns 1 for true propositions and 0 for false. This is a Newcomb problem where Omega is fallible; in fact, Omega can only predict the agent as well as the agent can predict itself, since both use the same logical inductor. (And, Omega deals with the uncertainty by putting the money in the box with probability equal to its estimate of the probability that LIEDT one-boxes.) The best reward the agent can get is if Omega predicts one-boxing, but the agent unexpectedly two-boxes. Of course, a logical inductor can't be fooled like this reliably; so the agent is incentivised to one-box.
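For concreteness, here is a hedged rendering of that payoff as a function; the specific box values (1000 and 1) are my own illustrative choices for $B$ and $S$, not taken from the original construction.

```python
# Hypothetical payoff for the n-th Newcomb instance described above.
# p_onebox is Omega's credence that the agent one-boxes, taken from the same
# inductor the agent itself uses; big and small are illustrative box values.

def newcomb_payout(p_onebox: float, one_boxed: bool,
                   big: float = 1000.0, small: float = 1.0) -> float:
    # Omega fills the big box with probability p_onebox, so the expected
    # contribution of the big box is big * p_onebox; two-boxing always adds
    # the small box on top of whatever the big box contains.
    return big * p_onebox + (0.0 if one_boxed else small)
```

For instance, `newcomb_payout(0.9, one_boxed=False)` is about 901 while `newcomb_payout(0.9, one_boxed=True)` is 900: the two-boxing bonus is there no matter what Omega predicted, which is the force behind the theorem below.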

Theorem. LIEDT converges to two-box on non-exploration rounds as $n$ increases.

Proof. The logical inductor comes to predict $\mathbb{E}_n(U_n \mid O_n)$ and $\mathbb{E}_n(U_n \mid \neg O_n)$ with increasing accuracy as $n$ increases, since it effectively has access to $\mathbb{P}_n(O_n)$, and since the conditioning is always well-defined thanks to exploration. Therefore, two-boxing eventually becomes the most appealing option.
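Spelling the last step out under the payoff as reconstructed above (and using the fact that knowledge of its own policy screens Omega's prediction off from the particular action taken):

$$\mathbb{E}_n(U_n \mid \neg O_n) \approx B \cdot \mathbb{P}_n(O_n) + S \;>\; B \cdot \mathbb{P}_n(O_n) \approx \mathbb{E}_n(U_n \mid O_n),$$

so once these estimates are accurate, two-boxing wins by the constant $S$ on every non-exploration round.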

So, LIEDT does not come to see Omega as correlated with its action, because it knows its own general policy, and the general policy screens off all the correlation between Omega and its action.

Now, it's true that if Omega was a more powerful predictor than the agent, LIEDT could one-box -- but, so could LICDT. In particular, if Omega simply knows the action precisely, then $U_n = B \cdot \mathbb{1}(O_n) + S \cdot \mathbb{1}(\neg O_n)$, and both LICDT and LIEDT one-box.

LICDT is XOR-blackmailed.

On the other hand, consider the XOR Blackmail letter, which is supposed to be a case where CDT does better than EDT. There is a difficult-to-predict disaster, $D_n$, with pseudorandom probability 0.01. However, an AI researcher can predict both the disaster and the AI, and will use that knowledge to try and extract money from the AI. Let's call the AI sending money to the researcher $m_n$, and not sending money $\neg m_n$. The AI researcher sends a letter asking for money if and only if [they predict the AI will respond by sending money XOR $D_n$]. Let's say the AI researcher asks for half the cost of the disaster; writing $c$ for that cost, $U_n = -c \cdot \mathbb{1}(D_n) - \tfrac{c}{2} \cdot \mathbb{1}(m_n)$. Moreover, the deductive state includes knowledge of the letter, $L_n$.
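As a hedged sketch of this environment (the function names and the cost value are my own; the researcher's prediction is reduced to a boolean here, whereas in the setup above it comes from the same inductor the agent uses):

```python
# Hypothetical sketch of the XOR Blackmail environment described above.
# predicts_payment stands in for the researcher's prediction of whether the
# AI would send money upon receiving the letter.

def letter_sent(predicts_payment: bool, disaster: bool) -> bool:
    # The researcher writes if and only if (predicted payment XOR disaster).
    return predicts_payment != disaster

def xor_blackmail_payout(disaster: bool, sent_money: bool, c: float = 100.0) -> float:
    # Lose c if the disaster happens, plus an additional c/2 if money is sent.
    return (-c if disaster else 0.0) - (c / 2 if sent_money else 0.0)
```

The four possible payouts are 0, −c/2, −c, and −3c/2; the XOR rule is what ties the observation (the letter) to the agent's own predicted behavior.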

Theorem. LICDT converges to sending the blackmailer money when a letter is received, on non-exploration rounds.

Proof. LICDT bases its decision on the utility observed in exploration rounds. Conditional on its receiving the letter and exploring into sending the money, no disaster has occurred. Conditional on its receiving the letter and exploring into not sending money, the disaster has occurred. It will come to predict both of these things accurately, and its conditional expectations will be consistent with them. Therefore, it will send the money.
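In symbols, with the payout as reconstructed above, the conditional expectations LICDT learns are roughly

$$\mathbb{E}_n(U_n \mid L_n \wedge E_n \wedge m_n) \approx -\tfrac{c}{2}, \qquad \mathbb{E}_n(U_n \mid L_n \wedge E_n \wedge \neg m_n) \approx -c,$$

since, given the letter, exploring into payment means the researcher predicted payment and so there was no disaster, while exploring into refusal means the disaster occurred; and $-c/2 > -c$.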

Interpretation of the two experiments.

It appears that these aren't very good implementations of CDT and EDT. The attempted CDT fails the XOR Blackmail letter; the attempted EDT fails Newcomb. But, if we look a little closer, something more interesting is going on. I didn't prove it here, but both of them will one-box when Omega is a perfect predictor, and two-box when Omega is fallible. Both of them will send the blackmailer money when the blackmailer is a perfect predictor, and refuse when the blackmailer is fallible. They appear to be following my "Law of Logical Causality" from SLS III.

When people argue that EDT one-boxes in Newcomb while CDT two-boxes, and that EDT sends the money in XOR Blackmail while CDT abstains, they often aren't careful to give EDT and CDT the same problem. CDT is supposed to be taking a physical-causation counterfactual, meaning it represents the problem in a Bayesian network representing its physical uncertainty, in which the direction of links lines up with physical causality. If we give EDT the same Bayesian network, it will disregard the causal information contained therein, and compute conditional utilities of actions. But, it is unclear that EDT will then one-box in Newcomb. Reasoning about the physical situation, will it really conclude that conditioning on one action or another changes the expected prediction of Omega? How does the conditional probability flow from the action to Omega? Omega's prediction is based on some observations made in the past. It may be that the agent knows equally well what those observations were; it just doesn't know exactly what Omega concluded from them. The knowledge of the observations screens off any probabilistic relationship. Or, even worse, it may be that the physical information which the agent has includes its own source code. The agent can't try to run its own source code on this very decision; it would go into an infinite loop. So, we get stuck when we try to do the reasoning. Similar problems occur in XOR blackmail.

I claim that reasoning about what CDT and EDT do in Newcomb's problem and XOR blackmail implicitly assumes some solution to logical uncertainty. The correlation which EDT is supposed to conclude exists between its action and the predictor's guess at its action is a logical correlation. But, logical induction doesn't necessarily resolve this kind of logical uncertainty in the intuitive way.

In particular, logical induction implements a version of the ratifiability condition. Because LIEDT agents know their own policies, they are screened off from would-be correlates of their decisions in much the same way LICDT agents are. And because LICDT agents are learning counterfactuals by exploration, they treat predictors who have more information about their own actions than they themselves do as causally downstream -- the law of logical causality which I conjectured would, together with ratifiability, imply CDT=EDT.

When does LICDT=LIEDT?

It's obvious that LIEDT usually equals LICDT when LIEDT converges to taking just one of the actions on non-exploration rounds; the other actions are only taken when exploring, so LIEDT's expectation of those actions just equals LICDT's. What about the expectation of the main action? Well, it may differ, if there is a reliable difference in utility between rounds where that action is taken as exploration and rounds where it is deliberately chosen. However, this seems to be in some sense unfair; the environment is basing its payoff on the agent's reasons for taking an action, rather than on the action alone. While we'd like to be able to deal with some such environments, allowing this in general allows an environment to punish one decision theory selectively. So, for now, we rule such decision problems out:

Definition: Decision problem. A decision problem is a function which takes an agent and a step number, and yields a LUV which is the payout.

Definition: Fair decision problem. A fair decision problem is a decision problem such that the same limiting action probabilities on the part of the agent imply the same limiting expectations on utility per action, and these expectations do not differ between exploration actions and plain actions. Formally: if $\lim_{n\to\infty} \mathbb{P}_n(A_n = a)$ exists for every action $a$, and a second agent $B_n$ has limiting action probabilities which also exist and are the same, then the limits $\lim_{n\to\infty} \mathbb{E}_n(U_n \mid A_n = a)$ also exist and are the same as the corresponding quantities for $B_n$; and furthermore, the limits $\lim_{n\to\infty} \mathbb{E}_n(U_n \mid E_n \wedge A_n = a)$ exist and are the same as the limits without $E_n$.

I don't expect that this definition is particularly good; alternatives welcome. In particular, using the inductor itself to estimate the action probability introduces an unfortunate dependence on the inductor.

Compare to the notion of fairness in Asymptotic DT.

Definition: A continuous fair decision problem is a fair problem for which the function from limiting action probabilities to limiting expected utilities is continuous.

Observation. Continuous fair decision problems have "equilibrium" action distributions, where the best response to the action utilities induced by that distribution is consistent with the distribution itself. These are the same whether "best response" is in the LICDT or LIEDT sense, since the expected utility is required to be the same in both senses. If either LICDT or LIEDT converges on some such problem, then clearly it converges to one of these equilibria.

This doesn't necessarily mean that LICDT and LIEDT converge to the same behavior, though, since it is possible that they fail to converge, or that they converge to different equilibria. I would be somewhat surprised if there is some essential difference in behavior between LICDT and LIEDT on these problems, but I'm not sure what conjecture to state.

I'm more confident in a conjecture for a much narrower notion of fairness (I wasn't able to prove this, but I wouldn't be very surprised if it turned out not to be that hard to prove):

Definition. Deterministically fair decision problem: A decision problem is deterministically fair if the payout on instance $n$ is a function only of the action probabilities according to $\mathbb{P}_n$ and of the actual action taken (where $\mathbb{P}_n$ is the logical inductor used by the agent itself).

Conjecture. Given a continuous deterministically fair decision problem, the set of mixed strategies to which LICDT may converge and the set to which LIEDT may converge, varying the choice of logical inductor, are the same. That is, if LICDT converges to an equilibrium for some logical inductor, we can find a logical inductor for which LIEDT converges to that same equilibrium, and vice versa.

This seems likely to be true, since the narrowness of the decision problem class leaves very little room to wedge the two decision theories apart.

The difficulty of proving comparison theorems for LICDT and LIEDT is closely related to the difficulty of proving optimality theorems for them. If we had a good characterization of the convergence and optimality conditions of these two decision theories, we would probably be in a better position to study the relationship between them.

At least we can prove the following fairly boring theorem:

Theorem. For a decision problem in which the utility depends only on the action taken, in a way which does not depend on $n$ or on the agent, and for which all the actions have different utilities, LIEDT and LICDT will converge to the same distribution on actions.

Proof. Epsilon exploration ensures that there continues to be some probability of each action, so the logical inductor will eventually learn the action utilities arbitrarily well. Once the utilities are accurate enough to put the actions in the right ordering, the agent will simply take the best action, with the exception of exploration rounds.

Law of Logical Counterfactuals

The interesting thing here is that LIEDT seems to be the same as LICDT under an assumption that the environment doesn't specifically mess with it by doing something differently on exploration rounds and non-exploration rounds. This is sort of obvious from the definitions of LICDT and LIEDT (despite the difficulty of actually proving the result). However, notice that it's much different from usual statements of the difference between EDT and CDT.

I claim that LIEDT and LICDT are more-or-less appropriately reflecting the spirit of EDT and CDT under (1) the condition of ratifiability, which is enforced by the self-knowledge properties of logical inductors, and (2) the "law of logical counterfactuals" (LLC) I posited last time, which is enforced by the way LICDT learns causality via experimentation. You can't learn that something which knows more about your action than you do is upstream of you if you make the assumption that you can perform randomized controlled trials!

Since the law of logical counterfactuals was vaguely stated, I had hoped to learn something about what shape it has to take by examining this case. Unfortunately, this case is a bit intractable. However, it did suggest an additional condition: (3) the environment's behavior doesn't depend on whether you explore. (1), (2), and (3) are stated rather informally for now, but together are supposed to imply CDT=EDT, in a formal analysis which is yet-to-be.

Actually, assumption (3) is a sort of "randomized controlled trials" assumption, which seems to justify (2). You assume that you can arrange for actions to be uncorrelated with anything in the world, and that justifies your use of exploration to learn about counterfactuals.

It's not obvious at first, but exploration-based counterfactuals are very similar to counterfactuals based on the chicken rule in proof-based decision theories such as MUDT. The chicken rule requires that if you can prove what your own action will be, you take a different action. This allows the proofs from alternative actions to be well-behaved, rather than exploding in contradiction.

You can see how that's analogous to the way epsilon-exploration makes sure that LICDT and LIEDT can condition on their own actions without dividing by zero. It's also analogous for a deeper reason. Remember the trick of using self-referential sentences for exploration, at the beginning of this writeup? A very similar trick is to take any action which the logical inductor currently assigns probability less than $\epsilon$. In other words, do anything you strongly believe you won't do. This is very close to the chicken rule; just substitute probabilistic belief for proof.
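As a small, hedged illustration of that variant (reusing the hypothetical `Inductor` stand-in and `EPSILON` from the sketch earlier in the post):

```python
# Probabilistic chicken rule: a variant of the exploration trick above.
# Reuses the hypothetical Inductor stand-in and EPSILON from the earlier sketch.

def chicken_style_action(P, n: int, actions: list):
    # Take any action you strongly believe you won't take; otherwise signal
    # that no chicken clause fired and the usual decision rule should apply.
    for a in actions:
        if P.prob(f"A_{n}={a}") < EPSILON:
            return a
    return None
```

Substituting "probability below $\epsilon$" for "provably not taken" is the sense in which this is the chicken rule with probabilistic belief in place of proof.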

In fact, we can go a bit further. One of the shortcomings of MUDT is that it doesn't do what we'd like in the Agent Simulates Predictor scenario, and a host of other problems where a more powerful logic is required. We can address these issues by giving it a more powerful logic, but that does not address the intuitive concern, that it seems as if we should be able to solve these problems without fully trusting a more powerful logic: if we strongly suspect that the more powerful logic is consistent, we should be able to mostly do the same thing.

And indeed, we can accomplish this with logical induction. What we do is play chicken against anything which seems to predict our action beyond a certain tolerable degree. This gives us exactly the pseudorandom epsilon-exploration of LICDT/LIEDT. Unfortunately, there's a version of Agent Simulates Predictor which trips these up as well. Alas. (Asymptotic Decision Theory, on the other hand, gets Agent Simulates Predictor right; but, it is bad for other reasons.)

It's interesting that in proof-based decision theory, you never have to take the action; you just threaten to take it. (Or rather: you only take it if your logic is inconsistent.) It's like being able to learn from an experiment which you never perform, merely by virtue of putting yourself in a position where you almost do it.

Troll Bridge

Condition (3) is a sort of "no Troll Bridge" condition. The write-ups of the Troll Bridge on this forum are somewhat inadequate references to make my point well (they don't even call it Troll Bridge!), but the basic idea is that you put something in the environment which depends on the consistency of PA, in a way which makes a MUDT agent do the wrong thing via a very curious Löbian argument. There's a version of Troll Bridge which works on the self-referential exploration sentences rather than the consistency of PA. It seems like what's going on in Troll Bridge has a lot to do with part of the environment correlating itself with your internal machinery; specifically, the internal machinery which ensures that you can have counterfactuals.

Troll Bridge is a counterexample to the idea of proof-length counterfactuals. The hypothesis behind proof-length counterfactuals is that A is a legitimate counterfactual consequence of B if a proof of A from B is much shorter than a proof of ¬A from B. It's interesting that my condition (3) rules it out like this; it suggests a possible relationship between what I'm doing and proof-length counterfactuals. But I have suspected such a relationship since before writing SLS III. The connection is this:

Proof-length counterfactuals are consistent with MUDT, and with variants of MUDT based on bounded-time proof search, because what the chicken rule does is put proofs of what the agent does juuust out of reach of the agent itself. If the environment is tractable enough that there exist proofs of the consequences of actions which the agent will be able to find, the chicken rule pushes the proofs of the agent's own actions far enough out that they won't interfere with that. As a result, you have to essentially step through the whole execution of the agent's code in order to prove what it does, which of course the agent can't do itself while it's running. We can refine proof-length counterfactuals to make an even tighter fit: given a proof search, A is a legitimate counterfactual consequence of B if the search finds a proof of A from B before it finds one for ¬A.

This makes it clear that the notion of counterfactual is quite subjective. Logical inductors actually make it significantly less so, because an LI can play chicken against all proof systems; it will be more effective in the short run at playing chicken against its own proof system, but in the long run it learns to predict theorems as fast as any proof system would, so it can play chicken against them all. Nonetheless, LIDT still plays chicken against a subjective notion of predictability.

And this is obviously connected to my assumption (3). LIDT tries valiantly to make (3) true by decorrelating its actions with anything which can predict them. Sadly, as for Agent Simulates Predictor, we can construct a variant of Troll Bridge which causes this to fail anyway.

In any case, this view of what exploration-based counterfactuals are makes me regard them as significantly more natural than I think others do. Nonetheless, the fact remains that neither LIDT nor MUDT do all the things we'd like them to. They don't do what we would like in multi-agent situations. They don't solve Agent Simulates Predictor or Troll Bridge. It seems to me that when they fail, they fail for largely the same reasons, thanks to the strong analogy between their notions of counterfactual.

(There are some disanalogies, however; for one, logical inductor decision theories have a problem of selecting equilibria which doesn't seem to exist in proof-based decision theories.)

Comments

The statement of the law of logical causality is:

Law of Logical Causality: If conditioning on any event changes the probability an agent assigns to its own action, that event must be treated as causally downstream.

If I'm interpreting things correctly, this is just because anything that's upstream gets screened off, because the agent knows what action it's going to take.

You say that LICDT pays the blackmail in XOR blackmail because it follows this law of logical causality. Is this because, conditioned on the letter being sent, if there is a disaster the agent assigns probability (approximately) 0 to sending money, and if there isn't a disaster the agent assigns probability (approximately) 1 to sending money, so the disaster must be causally downstream of the decision to send money if the agent is to know whether or not it sends money?

If I'm interpreting things correctly, this is just because anything that's upstream gets screened off, because the agent knows what action it's going to take.

Not quite. The agent might play a mixed strategy if there is a predictor in the environment, e.g., when playing rock/paper/scissors with a similarly-intelligent friend you (more or less) want to predict what you're going to do and then do something other than that. (This is especially obvious if you assume the friend is exactly as smart as you, IE, assigns the same probability to things if there's no hidden information -- we can model this by supposing both of you use the same logical inductor.) You don't know what you're going to do, because your deliberation process is unstable: if you were leaning in any direction, you would immediately lean in a different direction. This is what it means to be playing a mixed strategy.

In this situation, I'm nonetheless still claiming that what's "downstream" should be what's logically correlated with you. So what screens off everything else is knowledge of the state of your deliberation, not the action itself. In the case of a mixed strategy, you know that you are balanced on a razor's edge, even though you don't know exactly which action you're taking. And you can give a calibrated probability for that action.

You assert that LICDT pays the blackmail in XOR blackmail because it follows this law of logical causality. Is this because, conditioned on the letter being sent, if there is a disaster the agent assigns probability (approximately) 0 to sending money, and if there isn't a disaster the agent assigns probability (approximately) 1 to sending money, so the disaster must be causally downstream of the decision to send money if the agent is to know whether or not it sends money?

I don't recall whether I've written the following up, but a while after I wrote the OP here, I realized that LICDT/LIEDT can succeed in XOR Blackmail (that is, refuse to send the money), but for an absolutely terrible reason.

Suppose that the disaster is sufficiently rare -- much less probable than the exploration probability $\epsilon$. Furthermore, suppose the exploration mechanism is p-chicken, IE "if you're too sure of what you do, do something else." (The story is more complicated for other exploration methods.)

Now how often does the agent respond to the letter?

Now suppose that, overall, the agent responds to the letter with frequency at least $\epsilon$ (including rounds where the agent doesn't receive the letter). Then, conditional on the letter being sent, the agent is pretty sure it will respond to the letter -- it believes this with probability close to 1. This is because the minimum response frequency would be $\epsilon$, which is already much more common than the disaster. Since the letter is only supposed to arrive when the disaster has happened or when the agent would respond (by hypothesis), it must be pretty likely that the agent is responding. It should learn this after enough trials.
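A rough back-of-the-envelope version of this step, in my own notation (writing $f \ge \epsilon$ for the agent's overall response frequency and $d \ll \epsilon$ for the disaster frequency):

$$\mathbb{P}(\text{respond} \mid \text{letter}) \approx \frac{f}{f + d} \ge \frac{\epsilon}{\epsilon + d},$$

which is close to 1 when $d \ll \epsilon$, and in particular exceeds the $1-\epsilon$ ceiling once $d$ is small enough (roughly $d \lesssim \epsilon^2$).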

But the agent is playing p-chicken. If the probability of responding to the letter is greater than $1 - \epsilon$, then the agent will refuse to do so. If the agent refuses, then the letter won't be sent except if the rare disaster is occurring. This contradicts the assumption that the agent responds to the letter with frequency at least $\epsilon$.

So the agent receives and responds to the letter with frequency less than $\epsilon$. On most rounds, the predictor simulates the agent and finds that the agent would have refused, had the letter been sent.

This is good. But the agent's reason for refusing is bonkers. The agent refuses because it thinks it responds to the letter. Its own credence in its responding to the letter is always bumping up against its $1 - \epsilon$ ceiling.

A very similar thing can happen in transparent Newcomb, except this time the response deprives the agent of the prize. In that problem, an agent only sees a full box in cases where it'll 1-box. So if it sees a full box infinitely often, its credence that it will 1-box (upon seeing a full box) must approach 1. But this can't be, since it must stay below $1 - \epsilon$. So in fact, the agent only sees a full box finitely often before being relegated to empty boxes forever. Omega keeps checking whether the agent would 1-box, and keeps seeing that it would 2-box due to its exploration clause triggering.

Moral of the story: p-chicken is pretty awful in perfect-predictor cases, particularly when the predictor is interested in what you do conditional on a particular observation.

Other exploration mechanisms only fare a little better.