Decision theory and dynamic inconsistency

paulfchristiano

Here is my current take on decision theory:

When making a decision after observing X, we should condition (or causally intervene) on statements like “My decision algorithm outputs Y after observing X.”
Updating seems like a description of something you do when making good decisions in this way, not part of defining what a good decision is. (More.)
Causal reasoning likewise seems like a description of something you do when making good decisions. Or equivalently: we should use a notion of causality that captures the relationships relevant to decision-making rather than intuitions about physical causality. (More.)
“How much do I care about different copies of myself?” is an arbitrary question about my preferences. If my preferences change over time, it naturally gives rise to dynamic inconsistency unrelated to decision theory. (Of course an agent free to modify itself at time T would benefit by implementing some efficient compromise amongst all copies forked off after time T.)

In this post I’ll discuss the last bullet in more detail since I think it’s a bit unusual, it’s not something I’ve written about before, and it’s one of the main ways my view of decision theory has changed in the last few years.

(Note: I think this topic is interesting, and could end up being relevant to the world in some weird-yet-possible situations, but I view it as unrelated to my day job on aligning AI with human interests.)

The transparent Newcomb problem

In the transparent version of Newcomb’s problem, you are faced with two transparent boxes (one small and one big). The small box always contains $1,000. The big box contains either $10,000 or $0. You may choose to take the contents of one or both boxes. There is a very accurate predictor, who has placed $10,000 in the big box if and only if they predict that you wouldn’t take the small box regardless of what you see in the big one.
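To make the payoff structure concrete, here is a minimal sketch that enumerates every policy (what you would do on seeing an empty box, and what you would do on seeing a full one). It assumes a perfectly accurate predictor and an agent who always takes the big box, so the only choice is whether to also take the small one; the names `payoff` and `payoffs` are illustrative, not part of the original problem statement.

```python
from itertools import product

SMALL, BIG = 1_000, 10_000

def payoff(take_small_if_empty: bool, take_small_if_full: bool) -> int:
    """Dollars earned by a policy against a perfectly accurate predictor (an assumption)."""
    # The predictor fills the big box only if the policy never takes the small box.
    filled = not take_small_if_empty and not take_small_if_full
    # The agent sees the box contents, then follows its policy for that observation.
    take_small = take_small_if_full if filled else take_small_if_empty
    return (BIG if filled else 0) + (SMALL if take_small else 0)

# Enumerate all four policies and their payoffs.
payoffs = {policy: payoff(*policy) for policy in product([False, True], repeat=2)}
for (if_empty, if_full), value in payoffs.items():
    print(f"take small if empty={if_empty}, if full={if_full}: ${value:,}")
```

Under these assumptions only the policy that never takes the small box ends up with $10,000; every other policy faces an empty big box.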

Intuitively, once you see the contents of the big box, you really have no reason not to take the small box. For example, if you see $0 in the big box, you know for a fact that you are either getting $0 or $1,000. So why not just take the small box and walk away with $1,000? EDT and CDT agree about this one.

I think it’s genuinely non-obvious what you should do in this case (if the predictor is accurate enough). But I think this is because of ambiguity about what you want, not how you should make decisions. More generally, I think that the apparent differences between EDT and UDT are better explained as differences in preferences. In this post I’ll explain that view, using transparent Newcomb as an illustration.

A simple inconsistent creature

Consider a simple creature which rationally pursues its goals on any given day—but whose goals change completely each midnight. Perhaps on Monday the creature is trying to create as much art and beauty as possible; on Tuesday it is trying to create joy and happiness; on Wednesday it might want something different still.

On any given day we can think of the creature as an agent. The creature on Tuesday is not being irrational when it decides to pursue joy and happiness instead of art and beauty. It has no special reason to try to “wind back the clock” and pursue the same projects it would have pursued on Monday.

Of course on Monday the creature would prefer to arrest this predictable value drift: it knows that on Tuesday it will be replaced with a new agent, one that will stop contributing to the project of art and beauty. The creature on Monday ought to make plans accordingly, and if it had the ability to change this feature of itself it would likely do so. It’s a matter of semantics whether we call this creature a single agent or a sequence of agents (one for each day).

This sequence of agents could benefit from cooperating with one another, and it can do so in different ways. Normal coordination is off the table, since causality runs only one way from each agent to the next. But there are still options:

The Tuesday-creature might believe that its decision is correlated with the Monday-creature. If the Tuesday-creature tries to stop the Wednesday-creature from existing, then the Monday-creature might have tried to stop the Tuesday-creature from existing. If the correlation is strong enough and stopping value change is expensive, then the Tuesday-creature is best served by being kind to its Wednesday-self, and helping to put it in a good position to realize whatever its goals may be. (Though note that this can unwind just like an iterated prisoner’s dilemma with a finite horizon!)
The Tuesday-creature might believe that its decision is correlated with the Monday-creature’s predictions about what the Tuesday-creature would do. If the Tuesday-creature keeps on carrying out the Monday-creature’s plans, then the Monday-creature would be more motivated to help the Tuesday-creature succeed (and less motivated to try to prevent the value change). If the Monday-creature is a good enough predictor of the Tuesday-creature, then the Tuesday-creature is best served by at least “paying back” the Monday-creature for all of the preparation the Monday-creature did.

However none of these relationships are specific to the fact that it is the same creature on Monday and Tuesday; the fact that the cells are the same has no significance for the decision-theoretic situation. The Tuesday-creature has no intrinsic interest in the fact that it is not “reflectively stable”—of course that instability definitionally implies a desire to change itself, but not a further reason to try to help out the Monday-creature or Wednesday-creature, beyond the relationships described above.

A human inconsistency

I care a lot about what is going to happen to me in the future. I care much more about my future than about different ways that the world could have gone (or than my past for that matter). In fact I would treat those other possible versions of myself quite similarly to how I’d treat another person who just happened to be a lot like me.

This leads to a clear temporal inconsistency, which is so natural to humans that we don’t even think of it as an inconsistency. I’ll try to illustrate with a sequence of thought experiments.

Suppose that at 7AM I think that there is a 50% chance that a bell will ring at 8AM. At 7AM I am indifferent between the happiness of Paul-in-world-with-bell and Paul-in-silent-world. If you asked me which Paul I would prefer to stub his toe, I would be indifferent.

But by 8:01AM my preferences are quite different. After I’ve heard the bell ring, I care overwhelmingly about Paul-in-world-with-bell. I would very strongly prefer that the other Paul stub his toe than that I do.

Some people might say “Well you just cared about what happens to Paul, and then at 8AM you learned what is real. Your beliefs have changed, but not your preferences.” But consider a different experiment where I am duplicated at 7AM and each copy is transported to a different city, one where the bell will ring and the other where it will not. Until I learn which city I’m in, I’m indifferent between the happiness of Paul-in-city-with-bell and Paul-in-silent-city. But at the moment when I hear the bell ring, my preferences shift.

Some people could still say “Well you cared about the same thing all along—what happens to you—and you were merely uncertain about which Paul was you.” But consider the Paul from before the instant of copying, informed that he is about to be copied. That Paul knows full well that he cares about both copies. Yet sometime between the copying and the bell Paul has become much more parochial, and only cares about one. It seems to me like there is little way to escape from the inconsistency here.

One could still say “Nonsense, all along you just cared about what happened to you, you were just uncertain about which of the copies you were going to become.” I find this very unpersuasive (why think there is a fact of the matter about who “I” am?), but at this point I think it’s just a semantic dispute. Either my preferences change, or my preferences are fixed but defined in terms of a concept like “the real me” whose meaning changes. It all amounts to the same thing.

This is not some kind of universal principle of rationality—it’s just a fact about Paul. You can imagine different minds who care about all creatures equally, or who care only about their own future experiences, or who care about all the nearby possible copies of themselves. But I think many humans feel roughly the same way I do—they have some concern for others (including creatures very similar to themselves in other parts of the multiverse), but have a very distinctive kind of caring for what they themselves will actually experience in the future.

Altruism is more complicated

In the examples above I discussed stubbing my toe as the unit of caring. But what if we had instead talked of dollars? And what if I am a relatively altruistic person, who would use marginal dollars to try to make the world better?

Now in the case of two copies in separate cities it is clear enough that my preferences never change. I’m still willing to pay $1 to give my counterpart $2. After all, they can spend those dollars just as well as I can, and I don’t care who it was who did the good.

But in the case of a single city, where the bell either rings or it doesn’t, we run into another ambiguity in my preferences—another question about which we need not expect different minds to agree no matter how rational they are.

Namely: once I’ve heard the bell ringing, do I care about the happiness of the creatures in the world-with-bell (given that it’s the real world, the one we are actually in), or do I care about the happiness of creatures in both worlds even after I’ve learned that I happen to be in one of them?

I think people have different intuitions about this. And there are further subtle distinctions, e.g. many people have different intuitions depending on whether the ringing of the bell was a matter of objective chance (where you could imagine other copies of yourself on far away worlds, or other branches, facing the same situation with a different outcome), or a matter of logical necessity where we were simply ignorant.

While some of those disagreements may settle with more discussion, I think we should be able to agree that in principle we can imagine a mind that works either way: one that cares about people in other worlds-that-could-have-been, or one that doesn’t.

Most humans have at least some draw towards caring only about the humans in this world. So the rest of my post will focus on their situation.

Back to transparent Newcomb (or: The analogy)

Consider again a human playing the transparent version of Newcomb’s problem. They see before them two boxes, a small one containing $1,000 and a big one containing $0. They are told that the big box would have contained $10,000 if a powerful predictor had guessed that they would never take the small box.

If the human cares only for their own future experiences, and would spend the money only on themselves, they have a pretty good case for taking the small box and walking away with $1,000. After all, their own future experiences are either going to involve walking away with $1,000 or with nothing; there is no possible world where they experience seeing an empty big box and then end up with the money after all.

Of course before seeing the contents of the big box, the human would have much preferred to commit to never taking the small box. If they are an evidential decision theorist, they could also have just closed their eyes (curse that negative value of information!). That way they would have ended up with $10,000 instead of $1,000.

Does this mean that they have reason to take nothing after all, even after seeing the box?

I think the human’s situation is structurally identical to the inconsistent creature whose preferences change at midnight. Their problem is that in the instant when they see the empty big box, their preferences change. Once upon a time they cared about all of the possible versions of themselves, weighted by their probability. But once they see the empty big box, they cease to care at all about the versions of themselves who saw a full box. They end up in conflict with other very similar copies of themselves, and from the perspective of the human at the beginning of the process the whole thing is a great tragedy.
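The preference shift can be sketched numerically. The snippet below compares the value of a policy as judged before looking (weighting each possible version by its probability) with the value as judged by the version who already sees an empty box and cares only about itself. The accuracy model is an assumption made for illustration: the predictor guesses correctly, with probability `accuracy`, whether the agent is the kind who never takes the small box, and the agent is assumed to leave the small box alone whenever the big box is full.

```python
from fractions import Fraction

SMALL, BIG = 1_000, 10_000

def ex_ante_value(take_small_if_empty: bool, accuracy: Fraction) -> Fraction:
    """Expected dollars judged before looking, weighting each version by probability.

    Assumes the agent never takes the small box when the big box is full.
    """
    never_takes_small = not take_small_if_empty
    # The box is full when the predictor (rightly or wrongly) expects a never-taker.
    p_full = accuracy if never_takes_small else 1 - accuracy
    value_if_full = BIG
    value_if_empty = SMALL if take_small_if_empty else 0
    return p_full * value_if_full + (1 - p_full) * value_if_empty

def ex_post_value(take_small: bool) -> int:
    """Dollars for the version who already sees an empty box and cares only about itself."""
    return SMALL if take_small else 0

accuracy = Fraction(99, 100)  # hypothetical accuracy, chosen for illustration
print("ex ante, leave small box:", ex_ante_value(False, accuracy))
print("ex ante, take small box: ", ex_ante_value(True, accuracy))
print("ex post, leave small box:", ex_post_value(False))
print("ex post, take small box: ", ex_post_value(True))
```

With these illustrative numbers the earlier self ranks leaving the small box far above taking it, while the self who has seen the empty box ranks the options the other way around. The decision procedure is the same in both cases; only the weights on the possible versions differ.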

Just like the inconsistent creature, the human would have strongly preferred to make a commitment to avoid these shifting preferences. Just like the inconsistent creature, they might still find other ways to coordinate even after the preferences change, but it’s more contingent and challenging. Unlike the inconsistent creature, they can avoid the catastrophe by simply closing their eyes—because the preference change was caused by new information rather than by the passage of time.

The situation is most stark if we imagine the predictor running detailed simulations in order to decide whether to fill the big box. In this case, there is not one human but three copies of the human: two inside the predictor’s mind (one who sees an empty box and one who sees a full box) and one outside the predictor in the real world (seeing an empty or full box based on the results of the simulation). The problem for the human is that these copies of themselves can’t get along.

Even if you explained the whole situation to the human inside the simulation, they’d have no reason to go along with it. By avoiding taking the small box, all they can achieve is to benefit a different human outside of the simulation, who they no longer care about at all. From their perspective, better to just take the money (since there’s a 50% chance that they are outside of the simulation and will benefit by $1,000).

(There are even more subtleties here if these different possible humans have preferences about their own existence, or about being in a simulation, or so on. But none of these change the fundamental bottom line.)

Altruism is still more complicated

If we consider a human who wants to make money to make the world better, the situation is similar but with an extra wrinkle.

Now if we explain the situation to the inside human, they may not be quite so callous. Instead they might reason “If I don’t take the small box, there is a good chance that a ‘real’ human on the outside will then get $10,000. That looks like a good deal, so I’m happy to walk away with nothing.”

Put differently, when we see an empty box we might not conclude that the predictor didn’t fill the box. Instead, we might consider the possibility that we are living inside the predictor’s imagination, being presented with a hypothetical that need not have any relationship to what’s going on out there in the real world.
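As a rough sketch of the contrast with the selfish case, the snippet below scores the choice faced by a copy who sees an empty box. It rests on several simplifying assumptions that are not in the original text: the 50% credence of being the real copy is taken as given, the predictor is perfect, the copy treats its own choice as settling what the predictor sees, and it would also leave the small box alone on seeing a full big box. The function names are illustrative.

```python
from fractions import Fraction

SMALL, BIG = 1_000, 10_000

def selfish_value(take_small: bool, p_real: Fraction) -> Fraction:
    """Expected dollars for a copy that values only money it personally receives.

    A simulated copy receives nothing real, so only the real branch counts.
    """
    return p_real * SMALL if take_small else Fraction(0)

def altruistic_value(take_small: bool) -> int:
    """Dollars reaching the real world, valued regardless of which copy holds them.

    Assumes a perfect predictor and that this choice fixes the predicted policy.
    """
    # Taking the small box on empty means the predictor leaves the big box empty.
    return SMALL if take_small else BIG

p_real = Fraction(1, 2)  # the 50% figure, taken as given
print("selfish:   take", selfish_value(True, p_real), "leave", selfish_value(False, p_real))
print("altruistic: take", altruistic_value(True), "leave", altruistic_value(False))
```

Under these assumptions the selfish copy prefers to take the small box and the altruistic copy prefers to leave it, even though both face the same observation.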

The most extreme version of this principle would lead me to entertain very skeptical / open-minded beliefs about the world. In any decision-problem where “what I’d do if I saw X” matters for what happens in cases where X is false, I could say that there is a “version” of me in the hypothetical who sees X. So I can never really update on my observations.

This leads to CDT=EDT=UDT. For people who endorse that perspective (and have no indexical preferences), this post probably isn’t very interesting. Myself, I think I somewhat split the difference: I think explicitly about my preferences about worlds that I “know don’t exist,” roughly using the framework of this post. But I justify that perspective in significant part from a position of radical uncertainty: I’m not sure if I’m thinking about worlds that don’t exist, or if it’s us who don’t exist and there is some real world somewhere thinking about us.

Conclusion

Overall the perspective in this post has made me feel much less confused about updatelessness. I expect I’m still wrong about big parts of decision theory, but for now I feel tentatively comfortable using UDT and don’t see the alternatives as very appealing. In particular, I no longer think that updating feels very plausible as a fundamental decision-theoretic principle, but at the same time don’t think there’s much of a reflective-stability-based argument for e.g. one-boxing in transparent Newcomb.

Most of the behaviors I associate with being “updateless” seem to really be about consistent preferences, and in particular continuing to care about worlds that are in some sense inconsistent with our observations. I believe my altruistic preferences are roughly stable in this sense (partially justified by a kind of radical epistemic humility about whether this is the “real” world), but my indexical preferences are not. The perspective in this post also more clearly frames the coordination problem faced by different copies of me (e.g. in different plausible futures) and I think has left me somewhat more optimistic about finding win-win deals.

Note: I let it sit in my editor for a day, not being sure how useful this comment is, but figured I'd post it anyway, just in case.

It seems to me that, despite a rather careful analysis of transparent Newcomb's, some of the underlying assumptions were not explicated:

There is a very accurate predictor, who has placed $10,000 in the big box if and only if they predict that you wouldn’t take the small box regardless of what you see in the big one.

It is crucial to the subsequent reasoning how this "very accurate predictor" functions under the hood, for example:

Does it run a faithful simulation of you and check the outcome?
Does it function as a Laplace's demon from outside the universe and predict the motion of each particle without running it?
Does it obey quantum mechanics and have to run multiple instances of you and your neighborhood of the universe and terminate those that choose "wrong"?
Does it even need to predict anything, or simply kill off the timelines that are "wrong"? (Quantum post-selection.) For example, it randomly puts an empty or full second box and kills off the instances where someone takes two full boxes.
Does it run a low-res version of you that does not have any internal experience and is not faithful enough to think about other versions of you?
Does it analyze your past behaviors and calculate your outcome without creating anything that can be called an instance of you?
Does it experiment mostly on those it can predict reasonably well, but occasionally screw up?

In some of those cases it pays to take both boxes when you see them. In other cases it pays to take just the one. Yet in other cases it pays to roll a (weighted) quantum die and rely on its outcome to make your decision.

The usual difference between CDT and UDT is the point in time where a compatibilist equivalent of the libertarian free will* is placed:

In CDT you assume that you get the predictor-unaffectable freedom to change the course of events after you have been presented with the boxes.
In UDT it is the moment when you self-modify/precommit to one-box once you hear about the experiment and think through it.

* Here by free will I don't mean the divine spirit from outside the universe that imbues every human with this magical ability. Rather, it is a description of some fundamental in-universe unpredictability that not even the Newcomb's predictor can overcome. Maybe it is some restrictions on computability, or on efficient computability, or maybe it is a version of Scott Aaronson's freebits, or maybe something else. The important part is that not even the predictor can tap into it.

One way to reformulate this in a way that avoids the confusing language of "decisions" is to frame it as "which agents end up with higher utility?" and explore possible worlds where they exist. Some of these worlds can be deterministic, others probabilistic, yet others MWI-like, yet others where simulated agents are as real as the "real" one. But stepping away from "agent decides" and into "agent that acts a certain way" forces you into the mindset of explicitly listing the worlds instead of musing about causality and precommitment.

For example, for an inconsistent creature, it matters what the creature does, not how it (often unreliably) reasons about itself. In the 8 am bell setup there is only one Paul who actually does nothing but talk about his preferences. A Monday creature may feel that it wants to "fix" its Tuesday preferences, but as long as the Tuesday behavior is not updated, it doesn't matter for anything "decision"-related.

This approach does not help with the paradoxes of indexicality though, including those that appear inside some of the worlds that may appear in the Newcomb's setup, like dealing with simulated agents. Sean Carroll takes a stab at how to deal with something like it in https://www.preposterousuniverse.com/podcast/2022/06/06/200-solo-the-philosophy-of-the-multiverse/ by using Neal's "fully non-indexical conditioning" (https://arxiv.org/abs/math/0608592), where you condition on everything you know about yourself, except for where you are in the universe. It helps in dealing with the Doomsday argument, with the Boltzmann brain problem, with the Sleeping Beauty problem, and with other "Presumptuous Philosopher"-prone setups.

If the Monday creature is indeed able to fix its preferences, how do you compare utility between the two alternatives since they have different UFs?

Presumably utility is measured by someone who exists at the moment of action? If you do something on Tuesday that you want to do, that is what matters. You may regret having done something else on Monday, or having to do something you currently don't like on Wednesday, but it is sort of irrelevant. If your UF is not stable, it is probably not a good description of what is going on.

This seems generally right, but ignores a consideration that I think often gets ignored, so I'll flag it here.

We probably want to respect preferences. (At least as a way to allow or improve cooperation.)

In this setup, the idea that we might want to lock-in the values of ourselves to prevent our future selves from having different preferences seems fine if we actually view ourselves as a single agent, but once we accept the idea that we're talking about distinct agents, it looks a lot like brainwashing someone to agree with us. And yes, that is sometimes narrowly beneficial, if we ignore the costs of games where we need to worry that others will attempt to do the same.

So I think we need to be clear: altruism is usually best accomplished by helping people improve what they care about, not what we care about for them. We don't get to prevent access to birth control to save others from being sinful, since that isn't what they want. And similarly, we don't get to call technological accelerationism at the cost of people's actual desires altruistic, just because we think we know better what they should want. Distributional consequences matter, as does the ability to work together. And we'll be much better able to cooperate with ourselves and with others if we decide that respecting preferences is a generally important default behavior of our decision theory.

I generally agree that a creature with inconsistent preferences should respect the values of its predecessors and successors in the same kind of way that it respects the values of other agents (and that the similarity somewhat increases the strength of that argument). It's a subtle issue, especially when we are considering possible future versions of ourselves with different preferences (just as it's always subtle how much to respect the preferences of future creatures who may not exist based on our actions). I lean towards being generous about the kinds of value drift that have occurred over the previous millennia (based on some kind of "we could have been in their place" reasoning) while remaining cautious about sufficiently novel kinds of changes in values.

In the particular case of the inconsistencies highlighted by transparent Newcomb, I think that it's unusually clear that you want to avoid your values changing---because your current values are a reasonable compromise amongst the different possible future versions of yourself, and maintaining those values is a way to implement important win-win trades across those versions.

In the particular case of the inconsistencies highlighted by transparent Newcomb, I think that it's unusually clear that you want to avoid your values changing---because your current values are a reasonable compromise amongst the different possible future versions of yourself, and maintaining those values is a way to implement important win-win trades across those versions.

I slightly disagree with this. In cases where there are win-win trades, different future versions of yourself are probably similar enough that they can get these win-win trades via correlated decision-making. (If they follow EDT.)

If you stop your values from changing, I think the main additional benefit you get is that you (i) change which of your future selves are more or less likely to exist in the first place (which it's not obvious that they themselves will care about; cf. my other comment), and (ii) impose one-way utility transfers from versions of you who have good helping opportunities to versions of yourself who have good being-helped opportunities, according to your own view about how you want to do interpersonal utility comparisons between your future selves (which will predictably benefit some of them and harm others). ^[1]

Overall this still seems fine and good to me. But I think win-win trades are a small fraction of the benefits.

^[1] Or maybe this is also just about changing which future versions of yourself exist, since any difference in your present actions will arguably lead to somewhat different memories in future versions of yourself.

That Paul knows full well that he cares about both copies. Yet sometime between the copying and the bell Paul has become much more parochial, and only cares about one. It seems to me like there is little way to escape from the inconsistency here.

Once we have the necessary technology, "that Paul" (earlier Paul) could modify his brain/mind so that his future selves no longer become more parochial and continue to care about both copies. Would you do this if you could? Should you?

(I raised this question previously in Where do selfish values come from?)

ETA:

Of course an agent free to modify itself at time T would benefit by implementing some efficient compromise amongst all copies forked off after time T.

But "all copies forked off after time T" depends on the agent's decision at time T. For example, if the agent self-modified so that its preferences stop changing (e.g., don't become "more parochial" over time) then "all copies forked off after time T" would have identical preferences. So is this statement equivalent to saying that the agent could self-modify to stop its preferences from changing, or do you have something else in mind, like "compromise amongst all copies forked off after time T, counterfactually if the agent didn't self-modify"? If you mean the latter, what's the justification for doing that instead of the former?

Yes, I think if possible you'd want to resolve to continue caring about copies even after you learn which one you are. I don't think that you particularly want to rewind values to before prior changes, though I do think that standard decision-theoretic or "moral" arguments have a lot of force in this setting and are sufficient to recover high degrees of altruism towards copies and approximately pareto-efficient behavior.

I think it's not clear if you should self-modify to avoid preference change unless doing so is super cheap (because of complicated decision-theoretic relationships with future and past copies of yourself, as discussed in some other comments). But I think it's relatively clear that if your preferences were going to change into either A or B stochastically, it would be worth paying to modify yourself so that they change into some appropriately-weighted mixture of A and B. And in this case that's the same as having your preferences not change, and so we have an unusually strong argument for avoiding this kind of preference change.

The outcome of shutdown seems important. It's the limiting case of soft optimization (anti-goodharting, non-agency), something you do when maximally logically uncertain about preference, conservative decisions robust to adversarial assignment of your preference.

Not sure if this should in particular preserve information about preference and opportunity for more agency targeted at it (corrigibility in my sense), since losing that opportunity doesn't seem conservative, wastes utility for most preferences. But then shutdown would involve some optimal level of agency in the environment, caretakers of corrigibility, not inactivity. Which does seem possibly correct, the agent shouldn't be eradicating environmental agents that have nothing to do with the agent, when going into shutdown, while refusing total shutdown when there are no environmental agents left at all (including humans and other AGIs) might be right.

If this is the case, maximal anti-goodharting is not shutdown, but maximal uncertainty about preference, so a maximally anti-goodharting agent purely pursues corrigibility (receptiveness to preference), computes what it is without optimizing for it, since it has no tractable knowledge of what it is at the moment. If environment already contains other systems receptive to preference, this might look like shutdown.

Intuitively, once you see the contents of the big box, you really have no reason not to take the small box.

I think the word 'intuitively' is kind of weird, here? Like, if we swap the Transparent Newcomb's Problem frame with the (I believe identical) Parfit's Hitchhiker frame, I feel an intuitive sense that I should pay the driver (only take the big box), because of a handshake across time and possibility, and being the sort of agent that can follow thru on those handshakes.

Now if we explain the situation to the inside human, they may not be quite so callous. Instead they might reason “If I don’t take the small box, there is a good chance that a ‘real’ human on the outside will then get $10,000. That looks like a good deal, so I’m happy to walk away with nothing.”
Put differently, when we see an empty box we might not conclude that predictor didn’t fill the box. Instead, we might consider the possibility that we are living inside the predictor’s imagination, being presented with a hypothetical that need not have any relationship to what’s going on out there in the real world.

When trying to make the altruistically best decision given that I'm being simulated, shouldn't I also consider the possibility that the predictor is simulating me in order to decide how to fill the boxes in some kind of transparent Anti-Newcomb problem, where the $10,000 is there if and only if it predicts you would take the $1,000 in transparent Newcomb? In that case I'd do the best thing by the real version of me by two-boxing.

This sounds a bit silly but I guess I'm making the point that 'choose your action altruistically factoring in the possibility that you're in a simulation' requires not just a prior on whether you're in a simulation, but also a prior on the causal link between the simulation and the real world.

If I'm being simulated in a situation which purportedly involves a simulation of me in that exact situation, should I assume that the purpose of my being simulated is to play the role of the simulation in this situation? Is that always anthropically more likely than that I'm being simulated for a different reason?

"since there’s a 50% chance that they are outside of the simulation and will benefit by $1000"

This seems wrong. For example, if I see the big box full and choose to take the small box too, then it is IMPOSSIBLE for me to be in the real world. In that case I may as well take only one box, because my world is fake and I will die soon.

So suppose I commit to one-boxing if the big box is full, which is always in my interest (as mentioned). Now, if I see the big box empty and choose NOT to take the small box, it is impossible for me to be in the real world. So I may as well not take the small box if I'm physically capable of that (if I am, it means this world is fake and I will die soon).

So it seems clear that I always one box, even if I only care about the real world and not about hypothetical worlds.
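The two impossibility claims above can be checked by brute force. The sketch below assumes a perfect predictor following the box-filling rule stated in the post; the function names are hypothetical, and "two-box" is shorthand for also taking the small box.

```python
from itertools import product

ACTIONS = ("one-box", "two-box")  # "two-box" means also taking the small box

def big_box_is_full(policy):
    """Assumed perfect predictor: fill the big box iff the agent would leave
    the small box both when it sees 'full' and when it sees 'empty'."""
    return policy["full"] == "one-box" and policy["empty"] == "one-box"

def real_situations(policies):
    """Collect the (observation, action) pairs that can occur in the real world."""
    possible = set()
    for policy in policies:
        seen = "full" if big_box_is_full(policy) else "empty"
        possible.add((seen, policy[seen]))
    return possible

all_policies = [{"full": f, "empty": e} for f, e in product(ACTIONS, repeat=2)]
# Policies that have committed to one-boxing whenever the big box is full.
committed = [p for p in all_policies if p["full"] == "one-box"]

print(sorted(real_situations(all_policies)))
print(sorted(real_situations(committed)))
```

Over all four policies, seeing a full box and two-boxing never happens in reality, but seeing an empty box and one-boxing still can (for an agent that would two-box on a full box). Only after restricting to the committed policies does that second case also become impossible, which is the step the argument relies on.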

Man, big 2014 vibes. Where was that post? Ah yeah: https://www.lesswrong.com/posts/gTmWZEu3CcEQ6fLLM/treating-anthropic-selfish-preferences-as-an-extension-of

But I guess I can see the sense in treating selfishness as inconstancy rather than as a separate kind of preference that follows special rules.

The Tuesday-creature might believe that its decision is correlated with the Monday-creature. [...] If the correlation is strong enough and stopping values change is expensive, then the Tuesday-creature is best served by being kind to its Wednesday-self, and helping to put it in a good position to realize whatever its goals may be.
The Tuesday-creature might believe that its decision is correlated with the Monday-creature’s predictions about what the Tuesday-creature would do. [...] If the Monday-creature is a good enough predictor of the Tuesday-creature, then the Tuesday-creature is best served by at least “paying back” the Monday-creature for all of the preparation the Monday-creature did

These both seem like very UDT-style arguments, that wouldn't apply to a naive EDT:er once they'd learned how helpful the Monday creature was?

So based on the rest of this post, I would have expected these motivations to only apply if either (i) the Tuesday-creature was uncertain about whether the Monday-creature had been helpful or not, or (ii) the Tuesday creature cared about not-apparently-real-worlds to a sufficient extent (including because they might think they're in a simulation). Curious if you disagree with that.

Yes, I think this kind of cooperation would only work for UDT agents (or agents who are uncertain about whether they are in someone's imagination or whatever).

A reader who isn't sympathetic to UDT can just eliminate the whole passage "But there are still options: ...", it's not essential to the point of the post. It only serves to head off the prospect of a UDT-advocate arguing that the agent is being unreasonable by working at cross-purposes to itself (and I should have put this whole discussion in an appendix, or at least much better sign-posted what was going on).

Once upon a time they cared about all of the possible versions of themselves, weighted by their probability. But once they see the empty big box, they cease to care at all about the versions of themselves who saw a full box. They end up in conflict with other very similar copies of themselves, and from the perspective of the human at the beginning of the process the whole thing is a great tragedy.

Probably just an unimportant nitpick, but the "versions who saw a full box" shouldn't actually expect to see a brighter future if "the version who saw an empty box" chooses to 1-box. The only thing that happens is that the "versions who saw a full box" become more likely to exist. So I think you have to either say:

This is a conflict where a significant portion of the "benefit" at stake is getting to exist in the first place
This isn't a conflict between the versions who saw an empty box and the versions who saw a full box. Instead, it's a conflict between the "versions who saw an empty or full box" and the past "version who hadn't yet looked at the boxes". (The "version who hadn't yet looked at the boxes" really would expect a brighter future if the "versions who saw an empty or full box" choose to 1-box.)

Typo? s/winkle/wrinkle
> the situation is similar but with an extra winkle.

Most of the behaviors I associate with being “updateless” seem to really be about consistent preferences,

What about behaviors around "updates" that include learning how to be more what one is (e.g. a more effective agent / decision-maker / survivor)? It seems like real-life agents make updates like this, and they don't seem well described as just preference shifts.

I feel personally attacked by your references to the creature that changes its mind every day. :P

Also, I feel like dadadarren might have an interest in this, given that he has a theory about taking self-reference as axiomatic in order to avoid some anthropic paradoxes. If I understand correctly, in his view it is incoherent to consider counterfactual worlds in which you ended up being someone else to begin with. But he'd explain that better than me.

I can't help but notice that Transparent Newcomb seems flawed: namely, it seems impossible to have a very accurate predictor, even if the predictor is capable of perfectly simulating your brain.

Someone who doesn't care about the money and only wants to spite the predictor could precommit to the following strategy:

If I see that the big box is empty, I'll take one box. If I see that the big box is full, I'll take both boxes.

Then, the predictor has a 0% chance of being correct, which is far from "very accurate". (Of course, there could be some intervention which forces you to choose against your will, but that would defeat the whole point of the thought experiment if you can't enforce your decisions)

Anyway, this is just poking holes in Transparent Newcomb and probably unrelated to the reflexive inconsistency and preference-changing mentioned in the post, as I suspect that you could find some other thought experiment which arrives at the same conclusions in the post. But I'm curious if anyone's mentioned this apparent paradox in Transparent Newcomb before, and if there's an agreed-upon "solution" to it.

Isn't this identical to the proof for why there's no general algorithm for solving the Halting Problem?

The Halting Problem asks for an algorithm A(S, I) that when given the source code S and input I for another program will report whether S(I) halts (vs run forever).

There is a proof that says A does not exist. There is no general algorithm for determining whether an arbitrary program will halt. "General" and "arbitrary" are important keywords because it's trivial to consider specific algorithms and specific programs and say, yes, we can determine that this specific program will halt via this specific algorithm.

That proof of the Halting Problem (for a general algorithm and arbitrary programs!) works by defining a pathological program S that inspects what the general algorithm A would predict and then does the opposite.

What you're describing above seems almost word-for-word the same construction used for constructing the pathological program S, except the algorithm A for "will this program halt?" is replaced by the predictor "will this person one-box?".

I'm not sure that this necessarily matters for the thought experiment. For example, perhaps we can pretend that the predictor works on all strategies except the pathological case described here, and other strategies isomorphic to it.
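The parallel with the diagonal construction can be sketched in a few lines. This is only an illustration under assumptions: the agent is given direct access to the predictor (standing in for reading the prediction off the transparent box), and all names are hypothetical.

```python
def opposite(action):
    return "two-box" if action == "one-box" else "one-box"

def make_contrarian(predictor):
    """Build an agent that asks the predictor about itself and does the opposite."""
    def contrarian():
        return opposite(predictor(contrarian))
    return contrarian

# Two stand-in predictors that always terminate (illustrative only).
def always_predict_one_box(agent):
    return "one-box"

def always_predict_two_box(agent):
    return "two-box"

for predictor in (always_predict_one_box, always_predict_two_box):
    agent = make_contrarian(predictor)
    print(f"{predictor.__name__}: predicted {predictor(agent)}, agent did {agent()}")
```

Any predictor that returns an answer is wrong about the contrarian built from it. A predictor that works by running the agent never returns here (in Python it raises `RecursionError`), which mirrors the non-halting case.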

If you precommit to act this way, then it's not the case that [the predictor predicts that you wouldn't take the small box regardless of what you see in the big one] (since you do take it if the big box is full, so in that case you can't be predicted not to take the small box). By the stated algorithm of box-filling, this results in the big box being empty. The predictor is not predicting what happens in actuality, it's predicting what happens in the hypothetical situation where the big box is full (regardless of whether this situation actually takes place), and what happens in the hypothetical situation where the big box is empty (also regardless of what happens in reality). The predictor is not deciding what to do in these hypothetical situations, it's deciding what to do in reality.
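A small sketch of this reading: the predictor evaluates the agent's policy in both hypotheticals and only then decides how to fill the real box. The names are hypothetical, the predictor is assumed perfect, and "one-box" is taken to mean leaving the small box.

```python
SMALL, BIG = 1_000, 10_000

def fill_big_box(policy):
    """Fill iff the policy leaves the small box in BOTH hypotheticals."""
    return all(policy(seen) == "one-box" for seen in ("full", "empty"))

def play(policy):
    """Return (what the agent sees, what it does, what it walks away with)."""
    full = fill_big_box(policy)
    seen = "full" if full else "empty"
    action = policy(seen)
    payoff = (BIG if full else 0) + (SMALL if action == "two-box" else 0)
    return seen, action, payoff

def spiteful(seen):
    # The strategy from the parent comment.
    return "one-box" if seen == "empty" else "two-box"

def always_one_box(seen):
    return "one-box"

def always_two_box(seen):
    return "two-box"

for policy in (spiteful, always_one_box, always_two_box):
    print(policy.__name__, play(policy))
```

Under these assumptions the spiteful agent sees an empty box, one-boxes, and leaves with nothing, while the predictor's claim about the full-box hypothetical is never contradicted by anything that actually happens.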

Even if the big box is empty and you one-box anyway, the predictor can just say "Yes, but if the big box had been full, you would have two-boxed." and it's unclear whether the predictor is accurate or not since you weren't in that situation.

and it's unclear whether the predictor is accurate or not since you weren't in that situation

The predictor acts based on your behavior in both hypotheticals, and from either of the hypotheticals you don't get to observe your own decision in the other, to verify that it was taken into account correctly.

If the big box is full and you one-box, the predictor can say "Yes, and if the big box had been empty, you would have also one-boxed." And it's unclear whether the predictor is accurate or not since you weren't in that situation. Being wrong in your favor is also a possibility.

You don't get to verify that your decision was taken into account correctly anyway. If the big box is full and you two-box, the predictor can say "Yes, so you are currently in a hypothetical, in reality the big box is empty."

This objection to Newcomb-like problems (that IF I'm actually predicted, THEN what I think I'd do is irrelevant - either the question is meaningless or the predictor is impossible) does get brought up occasionally, and usually ignored or shouted down as "fighting the hypothetical". The fact that humans don't precommit, and if they could the question would be uninteresting, is pretty much ignored.

Replacing the human with a simple, transparent, legible decision process makes this a lot more applicable, but also a lot less interesting. Whatever can be predicted as one-boxing makes more money, and per the setup, that implies actually one-boxing. Done.

that IF I'm actually predicted, THEN what I think I'd do is irrelevant

This doesn't follow. Your estimate of your actions can be correct or relevant even if you've been predicted.

The fact that humans don't precommit, and if they could the question would be uninteresting, is pretty much ignored.

Humans can precommit just like simple machines - just run the algorithm in your mind and do what it says. There is nothing more to it.

Your estimate of your actions can be correct or relevant even if you've been predicted.

Huh? You break the simulation if you act differently than the prediction. Sure you can estimate or say whatever you want, but you can be wrong, and Omega can't.

just run the algorithm in your mind and do what it says.

This really does not match my lived experience of predicting and committing myself, nor the vast majority of fiction or biographical work I've read. Actual studies on commitment levels and follow-through are generally more complicated, so it's a little less clear how strongly counter-evident they are, but they're certainly not evidence that humans are rational in these dimensions. You can claim to precommit. You can WANT to precommit. You can even believe it's in your best interest to have precommitted. But when the time comes, that commitment is weaker than you thought.

You break the simulation if you act differently than the prediction.

I didn't say you could act differently than the prediction. It's correct that you can't, but that's not relevant for either variant of the problem.

Precommitment is a completely different concept from commitment. Commitment involves feelings, strength of will, etc. Precommitment involves none of those, and it only means running the simple algorithm. It doesn't have a strength - it's binary (either I run it, or not).

It's this running of the simple algorithm in your mind that gives you the pseudomagical powers in Newcomb's problem that manifest as the seeming ability to influence the past. (Omega already left, but because I'm precommitted to one-box, his prediction will have been that I would one-box. This goes both ways, of course - if I would take both boxes, I will lose, even though Omega already left.)

You could use the word precommitment to mean something else - like wishing really hard to execute action X beforehand, and then updating on evidence and doing whatever appears to result in most utility. We could call this precommitment_2 (and the previous kind precommitment_1). The problem is that precommitting_2 to one-box implies precommitting_1 to two-box, and so it guarantees losing.

Precommitment involves none of those, and it only means running the simple algorithm

That doesn't seem like something a human being could do.

Then you're wrong as a matter of biology. Neural networks can do that in general.

I could see an argument being made that if the precommitment algorithm contains a line "jump off a cliff," the human might freeze in fear instead of being capable of doing that.

But if that line is "take one box," I don't see why a human being couldn't do it.

You mean artificial neural networks? Which can also do things like running forever without resting. I think a citation is needed.

An algorithm would be, to put it simply, a list of instructions.

So are you saying that a human isn't capable of following a list of instructions, and if so, do you mean any list of instructions at all, or only some specific ones?

A human isn't capable of following a list of instructions perfectly, relentlessly, forever. The problem with a precommitment is sticking to it, whether you think of it as an algorithm or a resolution or a promise or an oath.

A human isn't capable of following a list of instructions perfectly, relentlessly, forever.

So you're saying humans can't follow an algorithm that would need to be followed perfectly, relentlessly and forever.

But one-boxing is neither relentless, nor forever. That leaves perfection.

Are you suggesting that humans can't perfectly one-box? If so, are you saying they can only imperfectly one-box?

Note: I let it sit in my editor for a day, not being sure how useful this comment is, but figured I'd post it anyway, just in case.

It seems to me that, despite a rather careful analysis of transparent Newcomb's, some of the underlying assumptions were not explicated:

There is a very accurate predictor, who has placed $10,000 in the big box if and only if they predict that you wouldn’t take the small box regardless of what you see in the big one.

It is crucial to the subsequent reasoning how this "very accurate predictor" functions under the hood, for example:

Does it run a faithful simulation of you and check the outcome?
Does it function as a Laplace's demon from outside the universe and predict the motion of each particle without running it?
Does it obey Quantum Mechanics and have to run multiple instances of you and your neighborhood of the universe and terminate those that choose "wrong"?
Does it even need to predict anything, or does it simply kill off the timelines that are "wrong"? (Quantum post-selection.) For example, it randomly puts out an empty or full second box and kills off the instances where someone takes two full boxes.
Does it run a low-res version of you that does not have any internal experience and is not faithful enough to think about other versions of you?
Does it analyze your past behaviors and calculate your outcome without creating anything that can be called an instance of you?
Does it experiment mostly on those it can predict reasonably well, but occasionally screw up?

The usual difference between CDT and UDT is the point in time where a compatibilist equivalent of the libertarian free will* is placed:

In CDT you assume that you get the predictor-unaffectable freedom to change the course of events after you have been presented with the boxes.
In UDT it is the moment when you self-modify/precommit to one-box once you hear about the experiment and think through it.

If the Monday creature is indeed able to fix its preferences, how do you compare utility between the two alternatives since they have different UFs?

In the particular case of the inconsistencies highlighted by transparent Newcomb, I think that it's unusually clear that you want to avoid your values changing---because your current values are a reasonable compromise amongst the different possible future versions of yourself, and maintaining those values is a way to implement important win-win trades across those versions.

Overall this still seems fine and good to me. But I think win-win trades are a small fraction of the benefits.

Or maybe this is also just about changing which future versions of yourselves exist, since any difference in your present actions will arguably lead to somewhat different memories in future versions of yourself.

That Paul knows full well that he cares about both copies. Yet sometime between the copying and the bell Paul has become much more parochial, and only cares about one. It seems to me like there is little way to escape from the inconsistency here.

(I raised this question previously in Where do selfish values come from?)

ETA:

Of course an agent free to modify itself at time T would benefit by implementing some efficient compromise amongst all copies forked off after time T.
