
The previous post sketched an application of Jessica's COEDT framework to get rid of one of the assumptions of my argument for CDT=EDT. Looking at the remaining assumptions of my argument, the hardest one to swallow was implementability: the idea that when the agent implements a mixed strategy, its randomization successfully controls for any factors other than those involved in the decision to use that particular mixed strategy. Stated in Bayes-net terms, the action has no parents other than the decision.

I stated that the justification of the assumption had to do with the learnability of the causal connections. I then went on to discuss some issues in learning counterfactuals, but not in a way which directly addressed the implementability assumption.

The present post discusses aspects of learning which are more relevant to the implementability assumption. Actually, though, these considerations are only arguments for implementability in that they're arguments for CDT=EDT. None of the arguments here are watertight.

Whereas the previous post was largely a response to COEDT, this post focuses more exclusively on CDT=EDT.

How Should We Learn Counterfactuals?

When Are Counterfactuals Reasonable?

How do we evaluate a proposed way to take logical counterfactuals? One reason progress in this area has been so slow is that it has been difficult to even state desirable properties of logical counterfactuals in a mathematically concrete way, aside from strong intuitions about how specific examples should go.

However, there is one desirable property which is very clear: counterfacting on what's true should result in what's true. You might get a strange alternative system of mathematics if you ask what things would be like if 2+2=3, but counterfacting on 2+2=4 just gets us mathematics as we know it.
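To state this a little more formally (a sketch using standard counterfactual-logic notation, not a commitment to any particular semantics): writing $A \,\square\!\!\rightarrow B$ for "if $A$ were the case, then $B$ would be", the requirement is

$$A \;\rightarrow\; \big( (A \,\square\!\!\rightarrow B) \leftrightarrow B \big)$$

for all $B$: when the antecedent is actually true, the counterfactual consequences are just the actual consequences. In Stalnaker/Lewis-style logics, this is roughly what the centering conditions buy you.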

When we consider situations where an agent fully understands what decision problem it's in (so we are assuming that its map corresponds perfectly to the territory), this means that when the agent counterfacts on the action it really takes in that situation, the counterfactual should tell it the true consequences of that action. However, it is less clear how to apply the principle to learning agents.

In the purely Bayesian case, one might argue that the prior is the equivalent of the true circumstance or decision problem; the best an agent can do is to respond as if it were put in situations randomly according to its prior. However, this doesn't give any guarantees about performance in particular situations; in particular, it doesn't give guarantees about learning the right counterfactuals. This becomes a somewhat more pressing concern when we move beyond Bayesian cases to logical induction, since there isn't a prior which can be treated as the true decision problem in the same way.

One might argue, based on the scary door problem from the previous post, that we shouldn't worry about such things. However, I think there are reasonable things we can ask for without going all the way to opening the scary door.

I propose the following principle for learning agents: you should not be able to systematically correct your counterfactuals. The principle is vaguely stated, but my intention is that it be similar to the logical induction criterion in interpretation: there shouldn't be efficient corrections to counterfactuals, just like there shouldn't be efficient corrections to other beliefs.

If we run into sufficiently similar situations repeatedly, this is like the earlier truth-from-truth principle: the counterfactual predictions for the action actually taken should not keep differing from the consequences experienced. The principle is also not so strong that it implies opening the scary door.
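To make the analogy with the logical induction criterion slightly more concrete, here is one way the principle might be cashed out (my own hedged sketch, in informal notation): write $a_t$ for the action actually taken on the $t$-th relevant occasion, $U_t$ for the utility actually received, and $\mathbb{E}_t[U \mid \mathrm{do}(a_t)]$ for the agent's decision-time counterfactual expectation for that action. Then, for every efficiently computable way of picking out a subsequence of occasions $S$ (an "efficient correction"), we could ask that

$$\frac{1}{|\{t \in S : t \le T\}|} \sum_{t \in S,\; t \le T} \Big( \mathbb{E}_t\big[U \mid \mathrm{do}(a_t)\big] - U_t \Big) \;\longrightarrow\; 0 \quad \text{as } T \to \infty.$$

Like the truth-from-truth principle, this only constrains counterfactual expectations for actions actually taken; that limitation comes back as the major caveat at the end of the post.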

An Example

Consider a very unfair game of matching pennies, in which the other player sees your move before playing. In the setup of Sam's logical induction tiling result, where the counterfactual function can be given arbitrarily, the most direct way to formalize this makes it so that the action taken always gets utility zero, while the other action would always have yielded utility one, had it been taken.

The CDT agent doesn't know at the time of making the decision whether the payoff will be 0 for heads and 1 for tails, or 1 for heads and 0 for tails. It can learn, however, that which situation holds depends on which action it ends up selecting. This results in the CDT agent acting pseudorandomly, choosing whichever action it least expects itself to select. The randomization doesn't help, and the agent's expected utility for each action converges to 1/2, which is absurd since it always gets utility 0 in reality. This expectation of 1/2 can be systematically corrected to 0.
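Here is a toy numerical sketch of that dynamic (a crude illustration, not Sam's logical-induction formalism; the learning rule and names are my own). Each round, the environment pays 0 for the action actually taken and reports that the other action "would have" paid 1; the agent's counterfactual estimate for each action is a running average of the reports it has received about that action, and it argmaxes over those estimates. Ties are broken deterministically, so the argmax ends up alternating between the two actions, a crude stand-in for the pseudorandomization described above.

```python
def run(rounds=10_000):
    """Unfair matching pennies with the counterfactuals specified as above."""
    estimates = {"heads": 0.5, "tails": 0.5}  # learned counterfactual expectations
    counts = {"heads": 0, "tails": 0}
    total_utility = 0.0
    for _ in range(rounds):
        # Take the action with the highest counterfactual estimate.
        action = max(estimates, key=estimates.get)
        other = "tails" if action == "heads" else "heads"
        # The opponent sees the move: the action taken pays 0, and the
        # counterfactual feedback claims the other action would have paid 1.
        total_utility += 0.0
        for a, report in ((action, 0.0), (other, 1.0)):
            counts[a] += 1
            estimates[a] += (report - estimates[a]) / counts[a]  # running average
    return estimates, total_utility / rounds

print(run())
# Both estimates approach 1/2, while the realized average utility stays at 0:
# the 1/2 expectations are systematically correctible down to 0.
```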

An EDT agent who understands the situation would expect utility 0 for each action, since it knows this to be the utility of an action if that action is taken. The CDT agent also knows this to be the conditional expected utility if it actually takes an action; it just doesn't care. Although the EDT agent and the CDT agent do equally poorly in this scenario, we can add a third option to not play, giving utility 1/4. The CDT agent will keep going after the illusory 1/2 expected utility, while the EDT agent takes the 1/4.
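Continuing the toy sketch above (again just an illustration, with the 1/4 outside option from the previous paragraph): an agent choosing by the learned counterfactual estimates keeps playing, because both estimates hover near 1/2 > 1/4, while an agent choosing by what playing has actually delivered (0 every round, a crude proxy for the evidential expectation) takes the outside option.

```python
estimates, realized_average = run()
outside_option = 0.25  # guaranteed payoff for declining to play

# Choice driven by the (correctible) counterfactual estimates: keep playing.
cdt_style = "play" if max(estimates.values()) > outside_option else "opt out"
# Choice driven by the utility playing has actually produced: opt out.
edt_style = "play" if realized_average > outside_option else "opt out"

print(cdt_style, edt_style)  # play opt out
```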

Sam's theorem assumes that we aren't in a situation like this. I'm not sure if it came off this way in Alex's write-up, but Sam tends to talk about cases like this as unfair environments: the way the payoff for actions depends on the action we actually take makes it impossible to do well from an objective standpoint. I would instead tend to say that this is a bad account of the counterfactuals in the situation. If the other player can see your move in matching pennies, you're just going to get 0 utility no matter how you move.

If we think of this as a hopelessly unfair situation, we accept the poor performance of an agent here. If we think of it as a problem with the counterfactuals, we want a CDT agent to be arranged internally such that it couldn't end up with counterfactuals like these. Asking that counterfactuals not be systematically correctible is a way to do this.

It looks like there is already an argument for EDT along similar lines in the academic literature: The Gibbard-Harper Collapse Lemma for Counterfactual Decision Theory.

Can We Dutch Book It?

We can get a CDT agent to bet against the correctible expectations if we give it a side channel for betting which doesn't interfere with the decision problem. However, while this may be peculiar, it doesn't really constitute a Dutch Book against the agent.

I suspect a kind of Dutch Book could be created by asking it to bet as part of the same act as its usual action (so that it is causally conditioning on the action at the time, and is therefore choosing under the influence of a correctible expectation). This could be combined with a side bet made after choosing. I'm not sure of the details here, though.

EDIT: Yes, we can Dutch-book it.

A Condition for Reflective Stability of CDT Is That It Equal EDT

Sam's Result

Perhaps a stronger argument for requiring CDT counterfactuals to equal EDT conditionals is the way the assumption seems to turn up in expected-value tiling results. I mentioned earlier that Sam's logical induction tiling result uses a related assumption. Here's Alex Appel's explanation of the assumption:

The key starting assumption effectively says that the agent will learn that the expected utility of selecting an action in the abstract is the same as the expected utility of selecting the specific action that it actually selects, even if a finite-sized chunk of the past action string is tampered with. Put another way, in the limit, if the agent assigns the best action an expected utility of 0.7, the agent will have its expected utility be 0.7, even if the agent went off-policy for a constant number of steps beforehand.

The part about "even if a finite-sized chunk of the past action string is tampered with" is not so relevant to the present discussion. The important part is that the expected utility of a specific action is the same as that expected without knowing which action will be selected.
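In symbols (my paraphrase, not Sam's notation, using $\mathrm{do}(\cdot)$ loosely for the decision-time expectation of selecting a specific action): writing $a_t$ for the action the agent actually selects at stage $t$ and $\mathbb{E}_t[U]$ for its expected utility without conditioning on which action gets selected, the relevant part of the assumption is roughly

$$\mathbb{E}_t\big[U \mid \mathrm{do}(a_t)\big] - \mathbb{E}_t[U] \;\longrightarrow\; 0.$$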

One way this assumption can be satisfied is if the agent learns to accurately predict which action it will select. The expected utility of that specific action then equals the expectation in general, because the general expectation already anticipates that this action will be taken.

Another way this assumption can be satisfied is if the counterfactual expectation of each action which the agent might take equals the evidential expectation of that action. In other words, the CDT expectations equal the EDT expectations. This implies the desired condition because the counterfactual expectations of each action which the agent believes it might take end up being the same. (If there were a significant difference between the expectations, the agent could predict the result of its argmax, and therefore would know which action it will take.)
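A sketch of that argument in the same informal notation (my paraphrase, ignoring actions of negligible probability): suppose that for every action $a$ in the set $A$ of actions the agent keeps assigning non-negligible probability to, the counterfactual and evidential expectations agree, $\mathbb{E}_t[U \mid \mathrm{do}(a)] = \mathbb{E}_t[U \mid a]$. Since the agent argmaxes over $\mathbb{E}_t[U \mid \mathrm{do}(a)]$, all actions in $A$ must (asymptotically) share a common counterfactual value $v_t$; any persistent gap would make the argmax, and hence the action, predictable. Then

$$\mathbb{E}_t[U] \;\approx\; \sum_{a \in A} P_t(a)\, \mathbb{E}_t[U \mid a] \;=\; \sum_{a \in A} P_t(a)\, \mathbb{E}_t[U \mid \mathrm{do}(a)] \;=\; v_t \;=\; \mathbb{E}_t\big[U \mid \mathrm{do}(a_t)\big],$$

which is the condition sketched above.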

The assumption could also hold in a more exotic way, where the counterfactual expectations and the evidential expectations are not equal individually, but the differences balance each other out. I don't have a strong argument against this possibility, but it does seem a bit odd.

So, I don't have an argument here that CDT=EDT is a necessary condition, but it is a sufficient one. As I touched on earlier, there's a question of whether we should regard this as a property of fair decision problems, or as a design constraint for the agent. The stronger condition of knowing your own action isn't a feasible design constraint; some circumstances make your action unpredictable (such as "I'll give you $100 minus [the degree of expectation you had, before acting, that you would take the action you took]"). We can, however, make the counterfactual expectations equal the evidential ones.

Diff's Tiling Result

I don't want to lean heavily on Sam's tiling result to make the case for a connection between CDT=EDT and tiling, though, because the conclusion of Sam's theorem is that an agent won't envy another agent for its actions, but not that it won't self-modify into that other agent. It might envy another agent for counterfactual actions which are relevant to getting utility. Indeed, counterfactual mugging problems don't violate any of the assumptions of Sam's theorem, and they'll make the agent want to self-modify to be updateless. Sam's framework only lets us examine sequences of actions, and whether the agent would prefer to take a sequence of actions other than its own. It doesn't let us examine whether the agent's own sequence of actions involves taking a screwdriver to its internal machinery.

Diff's tiling result is very preliminary, but it does suggest that the relationship between CDT=EDT and tiling stays relevant when we deal with self-modification. (Diff = Alex Appel.) His first assumption states that the concrete expected utility of the action taken equals the abstract expected utility of taking some action, along very similar lines to the critical assumption in Sam's theorem.

We will have to see how the story plays out with tiling expected utility maximizers.

The Major Caveat

The arguments in this post are all restricted to moves which the agent continues to take arbitrarily many times as it learns about the situation it is in. The condition for learning counterfactuals, that they not be systematically correctible, doesn't mean anything for actions which you never take. You can't Dutch Book inconsistent-seeming beliefs which are all conditional on something which never happens. And weird beliefs about actions which you never take don't seem likely to be a big stumbling block for tiling results; the formalizations I used to motivate the connection had conditions having to do with actions actually taken.

This is a major caveat to my line of reasoning, because even if counterfactual expectations and conditional expectations are equal for actions which continue to be taken arbitrarily often, CDT and EDT may end up taking entirely different actions in the limit. For example, in XOR Blackmail, CDT refuses to respond to the letter, and EDT responds. Both expect disaster not to occur after their respective actions, and both are right in their utility forecasts.

We can use Jessica's COEDT to define the conditional beliefs for actions we don't take, but where is the argument that counterfactual beliefs must equal these?

We could argue that CDTs, too, should be limits of CDTs restricted to mixed strategies. This is very much the intuition behind trembling-hand Nash equilibria. We could then argue that CDTs restricted to mixed strategies should take the same actions as EDTs, by the arguments above, since there are no actions which are never taken. This argument might have some force if a reason were given for requiring CDT to be the limit of CDTs restricted to mixed strategies.
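One hedged way to lay out that argument (my own schematic): let $\mathrm{CDT}_\epsilon$ and $\mathrm{EDT}_\epsilon$ be the corresponding agents restricted to mixed strategies which assign every action probability at least $\epsilon > 0$. Under that restriction every action keeps being taken, so the earlier arguments suggest the counterfactual and conditional expectations agree and $\mathrm{CDT}_\epsilon$ and $\mathrm{EDT}_\epsilon$ act alike. The chain would be

$$\mathrm{CDT} \;\overset{?}{=}\; \lim_{\epsilon \to 0} \mathrm{CDT}_\epsilon \;=\; \lim_{\epsilon \to 0} \mathrm{EDT}_\epsilon,$$

with the right-hand limit playing the role of the COEDT-style construction; the unsupported step is exactly the first equality.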

My intuition is that XOR blackmail really shouldn't be counted as a counterexample here. It strikes me as a case correctly resolved by UDT, not CDT. As in counterfactual mugging, the agent's counterfactual actions factor into the utility it receives; or, to put it differently, a copy of the agent is run with spoofed inputs (in contrast with Newcomb's problem, which runs an exact copy, with no spoofing). This means that an agent reasoning about the problem should either reason updatelessly about how it should respond to that sort of situation, or doubt its senses (which can be equivalent).

In other words, my intuition is that there is a class of decision problems for which rational CDT agents converge to EDT in terms of actions taken, not only in terms of the counterfactual expectations of those actions. This class would nicely exclude problems requiring updateless reasoning, XOR Blackmail among them.
