Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In this post, I hope to persuade you of what I consider to be an important principle when dealing with decision theory and counterfactuals.

Joe Carlsmith describes the Yankees vs. Red Sox Problem as below:

In this case, the Yankees win 90% of games, and you face a choice between the following bets:

                        Yankees win    Red Sox win
You bet on Yankees           1             -2
You bet on Red Sox          -1              2

Or, if we think of the outcomes here as “you win your bet” and “you lose your bet” instead, we get:

                        You win your bet    You lose your bet
You bet on Yankees             1                   -2
You bet on Red Sox             2                   -1

Before you choose your bet, an Oracle tells you whether you’re going to win your next bet. The issue is that once you condition on winning or losing (regardless of which), you should always bet on the Red Sox. So, the thought goes, EDT always bets on the Red Sox, and loses money 90% of the time. Betting on the Yankees every time does much better.  
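The arithmetic behind this argument can be sketched as follows. This is only an illustration: the payoffs and the 90% figure come from the problem statement, while the function names (unconditional_ev, payoff_given_result, naive_choice) are made up for this sketch. naive_choice models the reasoning criticised in this post, namely holding the Oracle's announcement fixed while comparing both bets.

```python
# Payoffs from the tables above, indexed as PAYOFF[bet][winner].
PAYOFF = {
    "Yankees": {"Yankees": 1, "Red Sox": -2},
    "Red Sox": {"Yankees": -1, "Red Sox": 2},
}
P_YANKEES_WIN = 0.9  # stated in the problem


def unconditional_ev(bet):
    """Expected payoff of always placing `bet`, ignoring the Oracle."""
    return (P_YANKEES_WIN * PAYOFF[bet]["Yankees"]
            + (1 - P_YANKEES_WIN) * PAYOFF[bet]["Red Sox"])


def payoff_given_result(bet, you_win):
    """Payoff of `bet` once we condition on winning or losing that bet."""
    other = "Red Sox" if bet == "Yankees" else "Yankees"
    winner = bet if you_win else other
    return PAYOFF[bet][winner]


def naive_choice(you_win):
    """Bet picked when the announcement is assumed to hold for both options."""
    return max(PAYOFF, key=lambda bet: payoff_given_result(bet, you_win))


if __name__ == "__main__":
    for bet in PAYOFF:
        print(bet, "unconditional EV:", round(unconditional_ev(bet), 2))
    for you_win in (True, False):
        print("told you will", "win" if you_win else "lose",
              "-> naive bet:", naive_choice(you_win))
```

Under these assumptions, the naive comparison picks the Red Sox whichever announcement is made, even though always betting on the Yankees has the higher unconditional expected payoff (roughly 0.7 against roughly -0.7).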

The mistake here is assuming that the Oracle's prediction applies in counterfactuals (worlds that don't occur) in addition to the factual (the world that does).

If the Oracle knows both:

a) The Yankees will win
b) You will bet on the Yankees, if you are told that you will win your next bet

Then the Oracle knows that it can safely predict you winning your bet, without any possibility of this prediction being wrong.

Notice that the Oracle doesn't need to know anything about the counterfactual where you bet Red Sox, except that the Yankees will win (and maybe not even that).

After all, Condition b) only applies when you are told that you will win your next bet. If you would have bet on Red Sox instead after being told that you were going to win, then the Oracle wouldn't have promised that you were going to choose correctly.

In fact, the Oracle mightn't have been able to publicly make a consistent prediction at all, as learning of its prediction might change your actions. This would be the case if all of the following three conditions held at once:

a) Yankees were going to win
b) If you were told that you'd win your next bet, you'd bet Red Sox
c) If you were told that you'd lose your next bet, you'd bet Yankees

The only way the Oracle could avoid being mistaken would be to make no prediction at all. This example demonstrates how an Oracle's predictions can be limited by your betting tendencies.
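To make the consistency argument concrete, here is a minimal sketch, assuming the agent can be modelled as a fixed policy that maps each possible announcement to a bet. The names consistent_predictions, contrarian and cooperative are hypothetical and introduced only for illustration. The function returns the announcements that would come true once the agent has heard them.

```python
PREDICTIONS = ("win", "lose")


def consistent_predictions(winner, policy):
    """Return the announcements that remain true after the agent hears them.

    `winner` is the team that will actually win the game.
    `policy` maps each announcement ("win" or "lose") to the team the agent
    would bet on after hearing it.
    """
    consistent = []
    for prediction in PREDICTIONS:
        bet = policy[prediction]
        outcome = "win" if bet == winner else "lose"
        if outcome == prediction:
            consistent.append(prediction)
    return consistent


# Conditions b) and c) above: the agent bets Red Sox if told it will win,
# and Yankees if told it will lose.
contrarian = {"win": "Red Sox", "lose": "Yankees"}

# An agent that bets on the Yankees whatever it is told.
cooperative = {"win": "Yankees", "lose": "Yankees"}

if __name__ == "__main__":
    print(consistent_predictions("Yankees", contrarian))
    print(consistent_predictions("Yankees", cooperative))
```

With the Yankees winning, the contrarian policy leaves the Oracle with no announcement it can truthfully make, whereas the cooperative policy leaves exactly one ("win"). In this toy model, what the Oracle can say is a function of the agent's policy, not only of the game's outcome.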

To be clear, if the Oracle tells you that you are going to win, you can't interpret this as applying completely unconditionally. Instead, you have to allow that the Oracle's prediction may be contingent on how you bet (in many problems the Oracle actually knows how you will bet and will use this in its prediction).

The Oracle's prediction only has to apply to the world that is. It doesn't have to apply to worlds that are not.

Why not go with Joe's argument?

Joe argues that the Oracle's prediction renders your decision-making unstable. While this is "fishy", it's not clear to me that this is a mistake. After all, maybe the Oracle knows how many times you'll switch back and forth between the two teams before making your final decision? Maybe this doesn't answer Joe's objection, but plausibly it does.

Are there Wider Implications?

If it is conceded that the Oracle's prediction can vary in counterfactuals, then this would undermine the argument for 2-boxing in Newcomb's Problem, which relies on the Oracle's prediction being constant across all counterfactuals. I suppose someone could argue that this problem only demonstrates that the prediction can vary in counterfactuals when the prediction is publicly shared. But even if I haven't specifically shown that non-public predictions can vary across counterfactuals, I've still successfully undermined the notion that the past is fixed across counterfactuals.

This result is damaging to both EDT and CDT, both of which take the Oracle's prediction to apply across all counterfactuals.

I also suspect that anyone who finds this line of argument persuasive will end up being more persuaded by my explanation for why 1-boxing doesn't necessarily imply backwards causation (short answer: because counterfactuals are a construction, and constructing at least part of a counterfactual backwards is different from asserting that the internal structure of that counterfactual involves backwards causation). However, I can't explain why it's related in any clear fashion.

Update:

Vladimir Nesov suggested that the principle should be "The Oracle's prediction only has to apply to the world where the prediction is delivered". My point was that Oracle predictions made in the factual don't apply to counterfactuals, but I prefer his way of framing things as it is more general.

Consider the variant where the Oracle demands a fee of 100 utilons after delivering the prediction, which you can't refuse. Then the winning strategy is going to be about ensuring that the current situation is counterfactual, so that in actuality you won't have to pay the Oracle's fee, because the Oracle wouldn't be able to deliver a correct prediction.

The Oracle's prediction only has to apply to the world that is. It doesn't have to apply to worlds that are not.

The Oracle's prediction only has to apply to the world where the prediction is delivered. It doesn't have to apply to the other worlds. The world where the prediction is delivered can be the world that is not, and another world can be the world that is.

"The Oracle's prediction only has to apply to the world where the prediction is delivered" - My point was that predictions that are delivered in the factual don't apply to counterfactuals, but the way you've framed it is better as it handles a more general set of cases. It seems like we're on the same page.

It's not actually more general, it's instead about a somewhat different point. The more general statement could use some sort of a notion of relative actuality, to point at the possibly counterfactual world determined by the decision made in the world where the prediction was delivered, which is distinct from the even more counterfactual worlds where the prediction was delivered but the decision was different from what it would relative-actually be had the prediction been delivered, and from the worlds where the prediction was not delivered at all.

If the prediction is not actually delivered, then it only applies to that intermediately-counterfactual world and not to the more counterfactual alternatives where the prediction was still delivered or to the less counterfactual situation where the prediction is not delivered. Saying that the prediction applies to the world where it's delivered is liable to be interpreted as including the more-counterfactual worlds, but it doesn't have to apply there; it only applies to the relatively-actual world. So your original framing includes a necessary part of saying this carefully that my framing lacked, and replacing it with my framing discards this correct detail. The Oracle's prediction only has to apply to the "relatively-actual" world where the prediction is delivered.

An Oracle's prediction does not have to apply to worlds in which the Oracle does not 'desire' to retain its classification as an Oracle. Indeed, since an Oracle needs to take the effects of its predictions into account, one of the ways an Oracle might be implemented is that for each prediction it is considering making, it simulates a world where it makes that prediction to see whether it comes true. In which case there will be (simulated) worlds where a prediction is made within that world by (what appears to be) an Oracle, yet the prediction does not apply to the world where the prediction is delivered.

Or to put it another way, talk of "an Oracle" seems potentially confused, since the same entity may not be an Oracle in all the worlds under discussion.

Small insight while reading this: I'm starting to suspect that most (all???) unintuitive things that happen with Oracles are the result of them violating our intuitions about causality because they actually deliver no information, in that nothing can be conditioned on what the Oracle says because if we could then the Oracle would fail to actually be an Oracle, so we can only condition on the existence of the Oracle and how it functions and not what it actually says, e.g. you should still 1-box but it's mistaken to think anything an Oracle tells you allows you to do anything different.

Yeah, you want either information about the available counterfactuals or information independent of your decision. Information about just the path taken isn't something you can condition on.

When the Oracle says "The taxi will arrive in one minute!", you may as well grab your coat.

Isn't that prediction independent of your decision to grab your coat or not?

The prediction is why you grab your coat, it's both meaningful and useful to you, a simple counterexample to the sentiment that since correctness scope of predictions is unclear, they are no good. The prediction is not about the coat, but that dependence wasn't mentioned in the arguments against usefulness of predictions above.

Dagon:

Sure, that's a sane Oracle. The Weird Oracle used in so many thought experiments doesn't say "The taxi will arrive in one minute!", it says "You will grab your coat in time for the taxi."

No, this is an important point: the agent normally doesn't know the correctness scope of the Oracle's prediction. It's only guaranteed to be correct on the actual decision, and can be incorrect in all other counterfactuals. So if the agent knows the boundaries of the correctness scope, they may play chicken and render the Oracle wrong by enacting the counterfactual where the prediction is false. And if the agent doesn't know the boundaries of the prediction's correctness, how are they to make use of it in evaluating counterfactuals?

It seems that the way to reason about this is to stipulate correctness of the prediction in all counterfactuals, even though it's not necessarily correct in all counterfactuals, in the same way as the agent's decision that is being considered is stipulated to be different in different counterfactuals, even though the algorithm forces it to be the same. So it's a good generalization of the problem of formulating counterfactuals, it moves the intervention point from agent's own decisions to correctness of powerful predictors' claims. These claims act on the counterfactuals generated by the agent's own decisions, not on the counterfactuals generated by delivery of possible claims, so it's not about merely treating predictors as agents, it's a novel setup.

Dagon:

Is there an ELI5 doc about what's "normal" for Oracles, and why they're constrained in that way?  The examples I see confuse me in that they are exploring what seem like edge cases, and I'm missing the underlying model that makes these cases critical.

Specifically, when you say "It's only guaranteed to be correct on the actual decision", why does the agent not know what "correct" means for the decision?

Specifically, when you say "It's only guaranteed to be correct on the actual decision", why does the agent not know what "correct" means for the decision?

The agent knows what "correct" means, correctness of a claim is defined for the possible worlds that the agent is considering while making its decision (which by local tradition we confusingly collectively call "counterfactuals", even though one of them is generated by the actual decision and isn't contrary to any fact).

In the post Chris_Leong draws attention to the point that since the Oracle knows which possible world is actual, there is nothing forcing its prediction to be correct on the other possible worlds that the agent foolishly considers, not knowing that they are contrary to fact. And my point in this thread is that despite the uncertainty it seems like we have to magically stipulate correctness of the Oracle on all possible worlds in the same way that we already magically stipulate the possibility of making different decisions in different possible worlds, and this analogy might cast some light on the nature of this magic.

That's an interesting point. I suppose it might be viable to acknowledge that the problem taken literally doesn't require the prediction to be correct outside of the factual, but nonetheless claim that we should resolve the vagueness inherent in the question about what exactly the counterfactual is by constructing it to meet this condition. I wouldn't necessarily be strongly against this - my issue is confusion about what an Oracle's prediction necessarily entails.

Regarding your notion about things being magically stipulated, I suppose there's some possible resemblance there with the ideas I proposed before in Counterfactuals As A Matter of Social Convention, although The Nature of Counterfactuals describes where my views have shifted to since then.

Dagon:

Hmm.  So does this only apply to CDT agents, who foolishly believe that their decision is not subject to predictions?

No, I suspect it's a correct ingredient of counterfactuals, one I didn't see discussed before, not an error restricted to a particular decision theory. There is no contradiction in considering each of the counterfactuals as having a given possible decision made by the agent and satisfying the Oracle's prediction, as the agent doesn't know that it won't make this exact decision. And if it does make this exact decision, the prediction is going to be correct, just like the possible decision indexing the counterfactual is going to be the decision actually taken. Most decision theories allow explicitly considering different possible decisions, and adding correctness of the Oracle's prediction into the mix doesn't seem fundamentally different in any way, it's similarly sketchy.

Dagon:

Thanks for your patience with this. I am still missing some fundamental assumption or framing about why this is non-obvious (IMO, either the Oracle is wrong, or the choice is illusory). I'll continue to examine the discussions and examples in hopes that it will click.

I presume Vladimir and I are likely discussing this from within the determinist paradigm in which "either the Oracle is wrong, or the choice is illusory" doesn't apply (although I propose a similar idea in Why 1-boxing doesn't imply backwards causation).

IMO, either the Oracle is wrong, or the choice is illusory

This is similar to determinism vs. free will, and suggests the following example. The Oracle proclaims: "The world will follow the laws of physics!". But in the counterfactual where an agent takes a decision that won't actually be taken, the fact of taking that counterfactual decision contradicts the agent's cognition following the laws of physics. Yet we want to think about the world within the counterfactual as if the laws of physics are followed.

Surely the problem here is that you cannot have both the 90% statistic and the oracle in the same situation: they are mutually contradictory (given your free will in choosing a bet).

Replace the oracle by instead an agent that is promising to fix the game so that you either win or lose your bet. Then the EDT strategy is correct.

I'm confused about the design of the Oracle.  Why is it predicting the combined probability of how you will bet and who will win the game, rather than just who will win (and perhaps how you will bet, but reported separately)?

It has to be the case that EITHER the Oracle is including your decisions in its model (including the information it expects you to have at the time of the decision) OR the Oracle is very stupidly giving you misleading results. How to use a perverse or adversarial Oracle is something that deserves its own field of study, but it should be separate from identifying confusion or paradoxes about counterfactuals (especially those that actually occur!).

The Oracle is predicting the combined results because that's what makes the thought experiment interesting.

If a thing says "you will win" and this causes you to bet on the Red Sox and lose, then this thing, whatever it is, is simply not an oracle. It has failed the defining property of an oracle, which is to make only true statements. It is true that there may be cases where an oracle cannot say anything at all, because any statement it makes will change reality in such a way as to make the statement false. But all this means is that sometimes an oracle will be silent. It does not mean that an oracle's statements are somehow implicitly conditioned on a particular world.

Put another way, your assumption that "an Oracle tells you whether you’re going to win your next bet" is not a valid way to constrain an oracle. An actual Oracle could just as easily say "The Red Sox will win" or "The Yankees will win" or whatever.

If a supposed-oracle claims that you will win your bet, and this causes you to bet on the Red Sox and lose, then the actually existing world is the one where you bet on the Red Sox and lost. The world where you didn't hear the prediction, bet on the Yankees, and won is the hypothetical world. So saying that an oracle's predictions need only be true in the actual world doesn't resolve your paradox. To resolve it, the oracle's predictions would have to be true only in the hypothetical world where you did not hear the prediction.

While the Oracle's prediction only applies to the world it was delivered in, you don't know which of the as-yet hypothetical worlds that will be. Whatever your decision ends up being, the Oracle's prediction will be correct for that decision.

If you hear that you will win, you bet on the Red Sox and lose, then your decision process was still correct but your knowledge about the world was incorrect. You believed that what you heard came from an Oracle, but it didn't.

This also applies to Newcomb's problem: if at any point you reason about taking one box and there's nothing in it, or about taking two boxes and there's a million in one, then you are implicitly exploring the possibility that Omega is not a perfect predictor. That is, that the problem description is incorrect.