1. The Gnawing Feeling That Won't Go Away
I stumbled onto Newcomb's problem a few days ago via this Veritasium video and quickly realized that I was a one-boxer. It just made more sense to me given the circumstances. Still, I couldn't shake the gnawing feeling that my decision was irrational. My training as a natural scientist screamed "correlation doesn't equal causation, time flows linearly, magical thinking is a trap" – and still I'd go for one box. How could I justify this? At what point does an irrational decision become rational? That question doesn't make sense, which suggests the framing is wrong. And the decades-long debate is astonishing in itself: a near-even split between one-boxers and two-boxers among both professional philosophers (PhilPapers 2020) and the wider public (a 2016 Guardian poll). There are smart, rational people on both sides of the aisle. What's going on here?
This post argues that the one-box/two-box disagreement has a dimension that decision theory alone doesn't capture. Underneath the formal arguments, there's a more fundamental split about how to interpret the premises – not whether you believe them, but what accepting them means for the world you live in. I think this is better framed as a question of epistemic temperament than of rational strategy. Under that framing, both positions are internally consistent, the "paradox" dissolves, and the ongoing debate appears in a different light – not as a dispute about the right answer, but as two groups talking past each other about different things.
2. The Standard Framing
A short refresher on Newcomb's paradox: A predictor with a near-perfect track record has set up two boxes. Box A is transparent and contains $1,000. Box B is opaque and contains either $1,000,000 or nothing. You can take both boxes or only box B. If the predictor predicted you'd take only box B, it put the million inside. If it predicted you'd take both, it left box B empty. You know the predictor has been right in thousands of previous cases.
The debate usually splits along decision-theoretic lines. Causal decision theory (CDT) says: the boxes are already set, your choice can't retroactively change their contents, and taking both boxes always nets you $1,000 more than taking one – regardless of what's in box B. So two-box. Evidential decision theory (EDT) says: your choice is strong evidence about what the predictor did, and one-boxers walk away rich while two-boxers walk away poor. So one-box. Functional decision theory (FDT), developed largely within this community, sides with one-boxing for a different reason: you're not just choosing an action, you're choosing the output of the kind of algorithm you are, and the predictor modeled that same algorithm.
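To see the split in numbers, here is a minimal sketch in Python. The 0.99 predictor accuracy is an illustrative assumption (the problem only stipulates "near-perfect"); the point is that the two theories run different calculations over the same payoffs.

```python
# Minimal sketch of the EDT vs. CDT calculations. The 0.99 predictor
# accuracy is an assumption for illustration, not part of the problem.
ACCURACY = 0.99
SMALL, BIG = 1_000, 1_000_000

# EDT: condition the box contents on your own choice.
ev_one_box = ACCURACY * BIG                    # million iff predictor saw it coming
ev_two_box = (1 - ACCURACY) * BIG + SMALL      # million only if the predictor erred

print(f"EDT one-box: ${ev_one_box:,.0f}")      # $990,000
print(f"EDT two-box: ${ev_two_box:,.0f}")      # $11,000

# CDT: the contents are already fixed; compare actions within each world.
for contents in (BIG, 0):
    # Two-boxing dominates: it adds $1,000 whatever box B holds.
    assert contents + SMALL > contents
```

Same payoff table, two incompatible ways of cutting it – which, as we'll see, is the shape of the whole debate.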
The problem was devised by the physicist William Newcomb and introduced to philosophy by Robert Nozick in 1969, and it has divided thinkers of every stripe ever since. That kind of persistent, near-even disagreement may signal that the disputants aren't actually disagreeing about what they think they're disagreeing about.
3. The Threshold Argument Reveals a Hidden Tension
Something about the setup bothered me; it felt like there was a hidden trap. I discussed this with my friend André, and he proposed a variant that draws attention to a question the problem deliberately leaves open: what kind of thing is the predictor?
To understand why the predictor's nature matters, let's run a small thought experiment. Instead of $1,000, we'll put 1 cent in box A. In this situation, almost everybody will prefer box B alone. Now we'll gradually increase the amount in box A, up to $100,000,000. At some point, almost anyone will hit a threshold set by their individual preferences and switch from one-boxing to two-boxing. We'll call this threshold Y.
Let's look at what happens at this exact indifference point. In information-theoretic terms, this is where the signal-to-noise ratio of a person's decision algorithm approaches zero. Can the near-perfect predictor resolve this noise, or does its accuracy break down?
If the predictor is a physical system (like a supercomputer running a brain simulation, think Devs), at this point the choice is determined by variables that are physically impossible to predict with 100% accuracy. The decision is no longer driven by a stable logical preference, but by microscopic fluctuations – a momentary distraction, thermal noise in the brain, or even quantum randomness. At the exact threshold, the decision becomes purely stochastic. No physical simulation, no matter how detailed, can resolve a true coin flip into a certainty. It's plausible that the prediction accuracy breaks down because there is no "signal" to predict.
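A toy simulation makes this concrete. Everything in it is an assumption for illustration – the threshold Y, the noise level, the linear "preference signal" – but it shows how a predictor with perfect access to the signal still collapses to coin-flip accuracy at the indifference point:

```python
# Toy model of the threshold argument. Y, NOISE, and the decision rule
# are illustrative assumptions, not part of the original problem.
import random

Y = 2_000        # assumed indifference threshold for this agent
NOISE = 100      # assumed std dev of momentary fluctuations
TRIALS = 50_000

def predictor_accuracy(amount_in_a: float) -> float:
    hits = 0
    for _ in range(TRIALS):
        # The agent's actual choice: stable preference signal plus noise.
        one_boxes = (Y - amount_in_a) + random.gauss(0, NOISE) > 0
        # The predictor knows the signal exactly but cannot see the noise.
        predicts_one_box = (Y - amount_in_a) > 0
        hits += (one_boxes == predicts_one_box)
    return hits / TRIALS

for amount in (100, 1_500, 1_900, 2_000):
    print(f"Box A = ${amount:>5,}: accuracy ~ {predictor_accuracy(amount):.2f}")
# Far below Y the accuracy is ~1.0; at Y it collapses to ~0.5.
```

Far from the threshold the signal dominates and prediction is easy; at the threshold there is nothing left to predict.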
The threshold argument reveals a tension at the heart of the setup. The kind of prediction the premises require seems to run into hard limits – not just practical but possibly principled ones. Whether you think this tension is fatal to the premises or merely uncomfortable is, I'll argue, exactly what separates one-boxers from two-boxers.
A natural objection is that most predictions are made far enough from the threshold. Suppose Y is $2,000 and we set box A to $1,900. The signal returns, overpowering the noise. A high-fidelity physical predictor can once again achieve high accuracy. But even in this regime – where prediction works perfectly well through normal, physical mechanisms – the standoff remains. The nature of the predictor still forces an epistemic split, shaping how you handle that accuracy. And here the paths diverge.
The predictor could be a superintelligent brain simulator operating through entirely physical mechanisms – near-perfect when your preferences are stable, failing only at the stochastic threshold. It could be something more exotic that we don't yet have a framework for. Or it could be something more mundane – an extremely good psychologist exploiting a lopsided payoff structure. The problem doesn't specify, and your interpretation matters.
To illustrate this idea: imagine the predictor is not a superintelligence but a PhD student from the local university who has somehow gotten 1,000 predictions right. Same track record – but would you treat it the same way?
The two-boxer looks at the predictor's track record and fits it into a familiar causal framework: in this view, the predictor is essentially an impressive psychologist. The prediction was made in the past, the box is physically closed, and your present choice cannot alter its contents. The predictor's residual error margin confirms that it is a physical system bound by normal causality. Dominance reasoning applies.
The one-boxer looks at the same track record and concludes that this goes beyond ordinary psychology – whether that's an inexplicable link between choice and outcome, or a formal dependency between your algorithm and the predictor's simulation. Either way, they treat the predictor's accuracy as the load-bearing fact of the problem and as evidence of something they don't yet understand, and reason from there. The "nature" they assign to the predictor is one that makes the standard causal "past vs. future" distinction irrelevant. If you are playing against a mirror, you cannot beat it by moving faster than your reflection.
Neither approach is unreasonable. But notice: this is a disagreement about the predictor, not about which box to take. The nature of the predictor determines whether its accuracy is a constraint or a correlation, and tells you what kind of game you are playing. The box question follows from the predictor question.
So the threshold argument doesn't just expose an abstract incoherence in the premises. It reveals that how you rescue the premises from that incoherence is precisely where the one-box/two-box split happens. The disagreement about boxes was a disagreement about the predictor all along.
4. It's About What You Make of the Premises
The premise of the predictor as an almost-perfect oracle intuitively contradicts our everyday model of reality. Anyone faced with this situation has two options: they can either accept the premise as a new reality, or frame it through their established beliefs about reality.
If you truly accept the premises – that behavior can be predicted with near-perfect accuracy – then you are accepting something that cannot be explained by our current knowledge. You are entering a world with different rules. In this new world, you don't fully understand the mechanics, but the evidence overwhelmingly points to one-boxing. So you one-box.
If you filter the premises through your existing worldview, you interpret the predictor's track record as reducible to known mechanisms – sophisticated psychological profiling, behavioral regularities, an extremely lopsided payoff structure that makes most people's choices easy to read, maybe even luck. The thousand correct predictions don't change the fundamental causal structure of reality as you understand it. In our world, the boxes are already set, your choice can't affect their contents, and magical thinking is irrational. So you two-box.
The two-boxer isn't irrational. They're applying correct reasoning within their model of reality. The one-boxer isn't irrational either. They're updating their model in response to evidence that breaks it. The deciding factor is epistemic temperament: how readily do you let anomalous evidence overturn your worldview?
5. The Fairy Tale Test
As an intuition pump, imagine that a talking lion appears in front of you. Do you think: "apparently lions can talk, what else might be different here?" or "there must be a hidden speaker somewhere"? Neither response is crazy, and neither disposition is inherently better. But they lead to radically different behavior.
The "hidden speaker" people are two-boxers. Confronted with something impossible, they say: "There are no fairy tales. There is an explanation consistent with my existing model, even if I can't see it yet." The "lions can talk" people are one-boxers. They say: "OK, we're in a fairy tale now. I'll play by the new rules."
Newcomb's problem is constructed to make it impossible to tell which situation you're in. This isn't about intelligence or rigor – it's about a deep disposition toward anomalous evidence. This is not the only path to one-boxing – there are rigorous formal routes as well, which I address in the objections below. But if you've ever one-boxed without being able to explain exactly why, this might be the reason.
6. Relation to Existing Literature
The idea that something epistemically fishy is going on in Newcomb's problem is not entirely new, and it's worth being explicit about where this argument sits relative to prior work.
Wolpert and Benford (2013) arrive at what is essentially the same structural conclusion through formal game theory: they prove that one-boxers and two-boxers are operating with incompatible mathematical formalizations of the problem, and that the paradox arises because the problem doesn't specify which formalization is correct. We add the observation that what drives the choice between formalizations is the player's understanding of the predictor's nature. Wolpert and Benford show that two games exist; the threshold argument shows why – the problem leaves the predictor underspecified, and how you fill that gap determines which game you're playing. It's worth noting that we arrived at this conclusion independently, from an epistemic rather than game-theoretic direction.
Dilip Ninan's "Illusions of Influence in Newcomb's Problem" (2006) argues that when we imagine ourselves inside the Newcomb scenario, we can't fully believe the "official story" – the stipulation that the predictor is genuinely near-perfect. He shows that even a tiny credence (~0.1%) in the alternative hypothesis that your choice somehow causally affects the box contents is enough to make one-boxing CDT-rational. This is a striking result, and his analysis of how the thought experiment feels different from the inside versus the outside resonates strongly with what we're saying here. The key difference is that Ninan is a two-boxer offering an error theory. For him, premise-skepticism explains why one-boxing is tempting but wrong – we can't help doubting the official story, and that doubt makes us mistakenly one-box. Our argument goes the other direction. We're not saying either side is making an error. We're saying the two sides are responding to genuinely different readings of the premises, and both responses are internally consistent. Also, Ninan never questions whether the setup itself might be incoherent – he takes it as well-defined but hard to believe. André's threshold argument pushes further by suggesting the tension isn't just psychological but structural.
Bermúdez (in Ahmed, ed., Newcomb's Problem, 2018) argues that it's impossible for an ideal rational agent to face a genuine Newcomb problem. This is adjacent to the incoherence point raised by the threshold argument, though arrived at through different reasoning.
Finally, a note on LessWrong context. This community has generally favored one-boxing, largely through the development of FDT/UDT, which formalize the intuition that you're choosing the output of your decision algorithm, not just a physical action. Our argument is compatible with this but frames the question differently. We're not proposing a better decision theory – we're arguing that the disagreement has a layer underneath the decision theory that often goes unexamined.
7. Objections
FDT one-boxers don't need fairy tales. They have a rigorous mathematical framework.
This is a fair criticism, and it's worth engaging with carefully. FDT provides a framework for one-boxing that requires no appeal to backward causation or anomalous evidence. It works by treating your decision as the output of a deterministic algorithm that the predictor also ran. Two instances of the same program produce the same output – nothing magical about that. The "logical link" between your choice and the box contents isn't metaphysical, it's computational. I respect this framework, and I want to be clear: the fairy tale metaphor is not meant to describe FDT one-boxers. It captures one path to one-boxing – the empirical path. FDT shows that it's not the only one. However, I'd note that FDT rests on a substantive assumption: that your decision process is well-modeled as a stable, deterministic computation that the predictor can faithfully replicate. Whether you find this assumption natural or suspect may itself reflect the kind of epistemic disposition this post is about. And the threshold argument suggests at least one regime where it breaks down – near the threshold, your "algorithm" doesn't have a stable output, and the computational metaphor loses its grip. This is also why the premise of Devs, great show though it is, has been questioned (e.g. on Reddit).
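To make the "two instances of the same program" point concrete, here is a toy sketch – not FDT's actual formalism, just the computational intuition it rests on (the function names are hypothetical):

```python
# Toy illustration of the FDT intuition: the predictor fills the box by
# running the very same decision procedure the agent will later run.
# This is not FDT's formal machinery, just its core computational picture.

def decide() -> str:
    # A pure, deterministic procedure: no randomness, no inputs that
    # could differ between the predictor's run and the agent's run.
    return "one-box"

def predictor_fills_box() -> int:
    # The predictor runs the same code and packs box B accordingly.
    return 1_000_000 if decide() == "one-box" else 0

box_b = predictor_fills_box()                    # happens first, in the past
payout = box_b if decide() == "one-box" else box_b + 1_000
print(f"${payout:,}")                            # $1,000,000
```

The two runs cannot diverge, so "choosing" to one-box is choosing a full box. But swap a coin flip into decide() and they can diverge – which is exactly the regime the threshold argument targets.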
Thought experiments stipulate their premises. Rejecting them is refusing to play.
We're not rejecting the premises, and we're not refusing to play. Our point is subtler: there is more than one way to "accept" a premise that conflicts with your understanding of reality. You can treat it as evidence that your understanding needs revision, or you can absorb it into your existing model by finding a compatible interpretation. Both are legitimate epistemic moves, and both count as engaging with the thought experiment. The threshold argument shows that this ambiguity isn't a quirk of how people read the problem – it's built into the problem itself.
David Lewis and other serious two-boxers fully accept the premises and still two-box.
Lewis, in his well-known "Why Ain'cha Rich?" (1981), openly acknowledged that two-boxers walk away poorer. His response was that Newcomb's problem rewards irrationality – one-boxing works, but it's still irrational. In our framing, Lewis is accepting the premises through a CDT-compatible lens: he treats the predictor's accuracy as a description of correlations, not as evidence that his causal model of reality is incomplete. For Lewis, if a scenario explicitly severs physical causal influence – the box is already packed – it is fundamentally irrational to act as if you have causal influence, even if a particular universe happens to financially reward people with irrational dispositions. Within a strict CDT framework, this reasoning is airtight. Our point is that it depends on accepting the strict CDT framework as the only valid lens – which is precisely what's at issue.
This just pushes the problem back. Now we're arguing about epistemic temperament instead of decision theory.
Yes. That's the point. If we're right, then this is progress – not because we've solved which box to take, but because we've identified where the actual disagreement lives. Decades of debate have been unproductive not because the participants lack rigor, but because they think they're arguing about strategy when they're really arguing about something closer to metaphysics: what do you do when your evidence contradicts your model of reality? Reframing the question this way explains why the debate has been so persistent and why neither side has ever convinced the other. They were never really having the same argument.
8. Conclusion
Newcomb's problem is a paradox of decision theory, but it's also something else: a diagnostic for epistemic temperament – specifically, how you respond when empirical evidence contradicts your established model of reality.
The formal frameworks of CDT, EDT, and FDT give different answers because they encode different prior attitudes toward the premises. Those attitudes trace back to a fundamental choice: if you hold that the boxes are already set and your present choice cannot affect their contents, two-boxing is the rational path within that model. If you accept the predictor's accuracy at face value and let it reshape your worldview, one-boxing follows naturally as a response to a world with different rules. The real question Newcomb's problem asks is not "which box should you take?" but "what do you do when reality stops making sense?" – or rather, "at what point does reality stop making sense to you?"
As for me – I finally understand why I'm a one-boxer, and the feeling of irrationality has disappeared. It's not that I've abandoned logic for fairy tales; it's that when my world model was weighed against the data, I let the data win. I’m not just waiting to meet a talking lion – I’m willing to update my biology textbook the moment he starts speaking.
Further reading
David Wolpert & Gregory Benford, "The Lesson of Newcomb's Paradox" (2013)
Dilip Ninan, "Illusions of Influence in Newcomb's Problem" (2006)