Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I feel like MIRI perhaps mispositioned FDT (their variant of UDT) as a clear advancement in decision theory, whereas maybe they could have attracted more attention/interest from academic philosophy if the framing was instead that the UDT line of thinking shows that decision theory is just more deeply puzzling than anyone had previously realized. Instead of one major open problem (Newcomb's, or EDT vs CDT) now we have a whole bunch more. I'm really not sure at this point whether UDT is even on the right track, but it does seem clear that there are some thorny issues in decision theory that not many people were previously thinking about:

  1. Indexical values are not reflectively consistent. UDT "solves" this problem by implicitly assuming (via the type signature of its utility function) that the agent doesn't have indexical values. But humans seemingly do have indexical values, so what to do about that?
  2. The commitment races problem extends into logical time, and it's not clear how to make the most obvious idea of logical updatelessness work.
  3. UDT says that what we normally think of as different approaches to anthropic reasoning are really different preferences, which seems to sidestep the problem. But is that actually right, and if so where are these preferences supposed to come from?
  4. 2TDT-1CDT - If there's a population of mostly TDT/UDT agents and few CDT agents (and nobody knows who the CDT agents are) and they're randomly paired up to play one-shot PD, then the CDT agents do better. What does this imply?
  5. Game theory under the UDT line of thinking is generally more confusing than anything CDT agents have to deal with.
  6. UDT assumes that the agent has access to its own source code and inputs as symbol strings, so it can potentially reason about logical correlations between its own decisions and other agents' as well defined mathematical problems. But humans don't have this, so how are humans supposed to reason about such correlations?
  7. Logical conditionals vs counterfactuals, how should these be defined and do the definitions actually lead to reasonable decisions when plugged into logical decision theory?

These are just the major problems that I was trying to solve (or hoping for others to solve) before I mostly stopped working on decision theory and switched my attention to metaphilosophy. (It's been a while so I'm not certain the list is complete.) As far as I know nobody has found definitive solutions to any of these problems yet, and most are wide open.


I feel like MIRI perhaps mispositioned FDT (their variant of UDT) as a clear advancement in decision theory

On second thought this is probably not fair to MIRI since I don't think I objected to such positioning when they sent paper drafts for me to review. I guess in the early days UDT did look more like a clear advancement, because it seems to elegantly solve several problems at once, namely anthropic reasoning (my original reason to start thinking in the "updateless" direction), counterfactual mugging, cooperation with psychological twin / same decision theory, Newcomb's problem, and it wasn't yet known that the open problems would remain open for so long.

Hi Wei! Abram and I have been working on formalizing Logical Updatelessness for a few months. We've been mostly setting a framework and foundations using Logical Inductors, and building the obvious UDT algorithms. But we've also stumbled upon some of the above problems (especially pitfalls of EVM / commitment races, and logical conditionals vs counterfactuals / natural accounts of logically uncertain reasoning), and soon we'll turn more thoroughly to the Game Theory enabled by this Learning Theory.

You're welcome to join the PIBBSS Symposium on Friday 22nd 18:30 CEST, where I'll be presenting some of our ideas (more info). We still have a lot of open avenues, so no in-depth write-up yet, but soon a First Report will exist.

Also, of course, feel free to hit me with a DM anytime.

Thanks, I've set a reminder to attend your talk. In case I miss it, can you please record it and post a link here?

Here's a link to the recording.

Here's also a link to a rough report with more details about our WIP.

Sure! (Note you need to register to get the zoom link)

a population of mostly TDT/UDT agents and few CDT agents (and nobody knows who the CDT agents are) and they're randomly paired up to play one-shot PD, then the CDT agents do better. What does this imply?


My current hot take is that this is not a serious problem for TDT/UDT. It's just a special case of the more general phenomenon that it's game-theoretically great to be in a position where people think they are correlated/entangled with you when actually you know they aren't. Analogous to how it's game-theoretically great to be in a position where you know you can get away with cheating and everyone else thinks they can't.
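To make the arithmetic behind 2TDT-1CDT concrete, here is a minimal sketch of the random-pairing version. The payoff values (T=5, R=3, P=1, S=0), the function name `expected_payoffs`, and the assumption that all TDT/UDT agents cooperate while all CDT agents defect are illustrative choices, not something specified in the post.

```python
from fractions import Fraction

# Assumed one-shot Prisoner's Dilemma payoffs (not from the post): T > R > P > S.
T, R, P, S = 5, 3, 1, 0

def expected_payoffs(n_tdt, n_cdt, tdt_cooperates=True):
    """Expected payoff per TDT agent and per CDT agent under uniform random pairing.

    Assumption: CDT agents always defect; TDT agents all make the same choice.
    """
    others = n_tdt + n_cdt - 1
    # Chance that the opponent is a TDT agent, from each type's point of view.
    p_tdt_for_tdt = Fraction(n_tdt - 1, others)
    p_tdt_for_cdt = Fraction(n_tdt, others)
    if tdt_cooperates:
        tdt = p_tdt_for_tdt * R + (1 - p_tdt_for_tdt) * S
        cdt = p_tdt_for_cdt * T + (1 - p_tdt_for_cdt) * P
    else:
        # Everyone defects, so everyone gets the mutual-defection payoff.
        tdt = cdt = Fraction(P)
    return tdt, cdt

tdt_payoff, cdt_payoff = expected_payoffs(n_tdt=90, n_cdt=10)
print(tdt_payoff, cdt_payoff)
```

Under these assumptions the CDT agents do strictly better than the TDT agents, yet the TDT agents still do better by cooperating than they would if everyone defected, which is why they keep cooperating.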

The way I see it, all of these problems are reducible to (i) understanding what's up with the monotonicity principle in infra-Bayesian physicalism and (ii) completing a new and yet unpublished research direction (working title: "infra-Bayesian haggling") which shows that IB agents converge to Pareto efficient outcomes[1]. So, I wouldn't call them "wide open".

  1. ^

    Sometimes, but there are assumptions, see child comment for more details.

Wei Dai, 7mo

Even items 1, 3, 4, and 6 are covered by your research agenda? If so, can you quickly sketch what you expect the solutions to look like?

I'll start with Problem 4 because that's the one where I feel closest to the solution. In your 3-player Prisoner's Dilemma, infra-Bayesian hagglers[1] (IBH agents) don't necessarily play CCC. Depending on their priors, they might converge to CCC or CCD or other Pareto-efficient outcome[2]. Naturally, if the first two agents have identical priors then e.g. DCC is impossible, but CCD still is. Whereas, if all 3 have the same prior they will necessarily converge to CCC. Moreover, there is no "best choice of prior": different choices do better in different situations.

You might think this non-uniqueness is evidence of some deficiency of the theory. However, I argue that it's unavoidable. For example, it's obvious that any sane decision theory will play "swerve" in a chicken game against a rock that says "straight". If there was an ideal decision theory X that led to a unique outcome in every game, the outcome of X playing chicken against X would be symmetric (e.g. flipping a shared coin to decide who goes straight and who swerves, which is indeed what happens for symmetric IBH[3]). This leads to the paradox that the rock is better than X in this case. Moreover, it should really be no surprise that different priors are incomparable, since this is the case even when considering a single learning agent: the higher a particular environment is in your prior, the better you will do on it.
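The rock argument can be checked with a small payoff table. This is only a sketch: the payoff numbers are assumptions chosen so that flipping a coin beats both swerving, and `best_response` is a hypothetical helper, not part of any formalism discussed here.

```python
STRAIGHT, SWERVE = "straight", "swerve"

# Assumed Chicken payoffs (row player, column player); not from the thread.
PAYOFF = {
    (STRAIGHT, STRAIGHT): (-10, -10),  # crash
    (STRAIGHT, SWERVE): (4, 0),
    (SWERVE, STRAIGHT): (0, 4),
    (SWERVE, SWERVE): (1, 1),
}

def best_response(opponent_move):
    """Move that maximizes the row player's payoff against a fixed opponent move."""
    return max((STRAIGHT, SWERVE), key=lambda m: PAYOFF[(m, opponent_move)][0])

# A rock that always says "straight", facing any sane decision theory X.
rock_move = STRAIGHT
x_move = best_response(rock_move)
rock_payoff = PAYOFF[(rock_move, x_move)][0]

# X against X: a symmetric outcome, here a fair shared coin deciding who goes straight.
coin_flip_payoff = 0.5 * PAYOFF[(STRAIGHT, SWERVE)][0] + 0.5 * PAYOFF[(SWERVE, STRAIGHT)][0]

print(x_move, rock_payoff, coin_flip_payoff)
```

With these numbers the rock, playing against X, collects more than X collects when playing against a copy of itself, which is the paradox described above.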

Problems 1,3,6 are all related to infra-Bayesian physicalism (IBP).

For Problem 1, notice that IBP agents are already allowed some sort of "indexical" values. Indeed, in section 3 of the original article we describe agents that only care about their own observations. However, these agents are not truly purely indexical, because when multiple copies co-exist, they all value each other symmetrically. In itself, I don't think this implies the model doesn't describe human values. Indeed, it is always sensible to precommit to care about your copies, so to the extent you don't do it, it's a failure of rationality. The situation seems comparable with hyperbolic time discount: both are value disagreements between copies of you (in the time discount case, these are copies at different times, in the anthropic case, these are copies that co-exist in space). Such a value disagreement might be a true description of human psychology, but rational agents should be able to resolve it via internal negotiations, converging to a fully coherent agent.

However, IBP also seems to imply the monotonicity principle, which is a much more serious problem if we want the model to be applicable to humans. The main possible solutions I see are:

  1. Find some alternative bridge transform which is not downwards closed but still well-behaved and therefore doesn't imply a monotonicity principle. That wouldn't be terribly surprising, because we don't have an axiomatic derivation of the bridge transform yet: it's just the only natural object we found so far which satisfies all desiderata.
  2. Just admit humans are not IBP agents. Instead, we might model them e.g. as cartesian IBRL agents. Maybe there is a richer taxonomy of intermediate possibilities between pure cartesianism and pure physicalism. Notice that this doesn't mean UDT is completely inapplicable to humans: cartesian IBRL already shows UDT-ish behavior in learnable pseudocausal Newcombian problems and arguably multi-agent scenarios as well (IBH). Cartesian IBRL might depart from UDT in scenarios such as fully acausal trade (i.e. trading with worlds where the agent never existed).
    1. This possibility is not necessarily free of bizarre implications. I suspect that cartesian agents always end up believing in some sort of simulation hypothesis (due to reasons such as Christiano 2016). Arguably, they should ultimately converge to IBP-like behavior via trade with their simulators. What this looks like in humans, I dare not speculate.
  3. Swallow some bizarre philosophical bullet to reconcile human values with the monotonicity principle. The main example is, accept that worse-than-death qualia don't matter, or maybe don't exist (e.g. people that apparently experience them are temporarily zombies) and that among several copies of you, only the best-off copies matter. I don't like this solution at all, but I still feel compelled to keep a (very skeptical) eye on it for now.

For Problem 3, IBP agents have perfectly well-defined behavior in anthropic situations. The only "small" issue is that this behavior is quite bizarre. The implications depend, again, on how you deal with the monotonicity principle.

If we accept Solution 1 above, we might end up with a situation where anthropics devolves to preferences again. Indeed, that would be the case if we allowed arbitrary non-monotonic loss functions. However, it's possible that the alternative bridge transform would impose a different effective constraint on the loss function, which would solve anthropics in some well-defined way which is more palatable than monotonicity.

If we accept Solution 2, then anthropics seems at first glance "epiphenomenal": you can learn the correct anthropic theory empirically, by observing which copy you are, but the laws of physics don't necessarily dictate it. However, under 2a anthropics is dictated by the simulators, or by some process of bargaining with the simulators.

If we accept Solution 3... Well, then we just have to accept how IBP does anthropics off-the-bat.

For Problem 6, it again depends on the solution to monotonicity.

Under Solutions 1 & 3, we might posit that humans do have something like "access to source code" on the unconscious level. Indeed, it seems plausible that you have some intuitive notion of what kind of mind should be considered "you". Alternatively (or in addition), it's possible that there is a version of the IBP formalism which allows uncertainty over your own source code.

Under Solution 2 there is no problem: cartesian IBRL doesn't require access to your own source code.

  1. ^

    I'm saying "infra-Bayesian hagglers" rather than "infra-Bayesian agents" because I haven't yet nailed the natural conditions a learning-algorithm needs to satisfy to enable IBH. I know some examples that do, but e.g. just satisfying an IB regret bound is insufficient. But, this should be thought of as a placeholder for some (hopefully) naturalized agent desiderata.

  2. ^

    It's not always Pareto efficient, see child comment for more details.

  3. ^

    What if there is no shared coin? I claim that, effectively, there always is. In a repeated game, you can e.g. use the parity of time as the "coin". In a one-shot game, you can use the parity of logical time (which can be formalized using metacognitive IB agents).

Wei Dai, 7mo

I don't understand your ideas in detail (am interested but don't have the time/ability/inclination to dig into the mathematical details), but from the informal writeups/reviews/critiques I've seen of your overall approach, as well as my sense from reading this comment of how far away you are from a full solution to the problems I listed in the OP, I'm still comfortable sticking with "most are wide open". :)

On the object level, maybe we can just focus on Problem 4 for now. What do you think actually happens in a 2IBH-1CDT game? Presumably CDT still plays D, and what do the IBH agents do? And how does that imply that the puzzle is resolved?

As a reminder, the puzzle I see is that this problem shows that a CDT agent doesn't necessarily want to become more UDT-like, and for seemingly good reason, so on what basis can we say that UDT is a clear advancement in decision theory? If CDT agents similarly don't want to become more IBH-like, isn't there the same puzzle? (Or do they?) This seems different from the playing chicken with a rock example, because a rock is not a decision theory so that example doesn't seem to offer the same puzzle.

ETA: Oh, I think you're saying that the CDT agent could turn into an IBH agent but with a different prior from the other IBH agents, that ends up allowing it to still play D while the other two still play C, so it's not made worse off by switching to IBH. Can you walk this through in more detail? How does the CDT agent choose what prior to use when switching to IBH, and how do the different priors actually imply a CCD outcome in the end?

...I'm still comfortable sticking with "most are wide open".

 

Allow me to rephrase. The problems are open, that's fair enough. But, the gist of your post seems to be: "Since coming up with UDT, we ran into these problems, made no progress, and are apparently at a dead end. Therefore, UDT might have been the wrong turn entirely." On the other hand, my view is: Since coming up with those problems, we made a lot of progress on agent theory within the LTA, which has implications on those problems among other things, and so far this progress seems to only reinforce the idea that UDT is "morally" correct. That is, not that any of the old attempted formalizations of UDT is correct, but that the intuition behind UDT, and its recommendation in many specific scenarios, are largely justified.

ETA: Oh, I think you're saying that the CDT agent could turn into an IBH agent but with a different prior from the other IBH agents, that ends up allowing it to still play D while the other two still play C, so it's not made worse off by switching to IBH. Can you walk this through in more detail? How does the CDT agent choose what prior to use when switching to IBH, and how do the different priors actually imply a CCD outcome in the end?

While writing this part, I realized that some of my thinking about IBH was confused, and some of my previous claims were wrong. This is what happens when I'm overeager to share something half-baked. I apologize. In the following, I try to answer the question while also setting the record straight.

An IBH agent considers different infra-Bayesian hypotheses starting from the most optimistic ones (i.e. those that allow guaranteeing the most expected utility) and working its way down, until it finds something that works[1]. Such algorithms are known as "upper confidence bound" (UCB) in learning theory. When multiple IBH agents interact, they start with each trying to achieve its best possible payoff in the game[2], and gradually relax their demands, until some coalition reaches a payoff vector which is admissible for it to guarantee. This coalition then "locks" its strategy, while other agents continue lowering their demands until there is a new coalition among them, and so on.

Notice that the pace at which agents lower their demands might depend on their priors (by affecting how many hypotheses they have to cull at each level), their time discounts and maaaybe also other parameters of the learning algorithm. Some properties this process has:

  • Every agent always achieves at least its maximin payoff in the end. In particular, a zero-sum two-player game ends in a Nash equilibrium.
  • If there is a unique strongly Pareto-efficient payoff (such as in Hunting-the-Stag), the agents will converge there.
  • In a two-player game, if the agents are similar enough that it takes them about the same time to go from optimal payoff to maximin payoff, the outcome is strongly Pareto-efficient. For example, in a Prisoner's Dilemma they will converge to player A cooperating and player B cooperating some of the time and possibly defecting some of the time, such that player A's payoff is still strictly better than DD. However, without any similarity assumption, they might instead converge to an outcome where one player is doing its maximin strategy and the other its best response to that. In a Prisoner's Dilemma, that would be DD[3].
  • In a symmetric two-player game, with very similar agents (which might still have independent random generators), they will converge to the symmetric Pareto efficient outcome. For example, in a Prisoner's Dilemma they will play CC, whereas in Chicken [version where flipping coin is better than both swerving] they will "flip a coin" (e.g. alternate) to decide who goes straight and who swerves.
  • The previous bullet is not true with more than two players. There can be stochastic selection among several possible points of convergence, because there are games in which different mutually exclusive coalitions can form. Moreover, the outcome can fail to be Pareto efficient, even if the game is symmetric and the agents are identical (with independent random generators).
  • Specifically in Wei Dai's 3-player Prisoner's Dilemma, IBH among identical agents always produces CCC. IBH among arbitrarily different agents might produce CCD (if one player is very slow to lower its demands, while the other two lower their demands at the same, faster pace), or even DDD (if each of the players lowers its demands on its own very different timescale).
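The two-player claims above can be illustrated with a cartoon of the demand-lowering dynamic. To be clear, this is not the IBH algorithm (which is unpublished). It is a toy model under loud assumptions: PD payoffs T=5, R=3, P=1, S=0; each player lowers its demand linearly from T to its maximin payoff P over a fixed number of rounds; the pair locks in as soon as some correlated outcome meets both demands; and a player that reaches its maximin demand first locks in D on its own. The names `frontier`, `demand` and `haggle` are hypothetical.

```python
from fractions import Fraction

# Assumed Prisoner's Dilemma payoffs (not from the thread): T > R > P > S.
T, R, P, S = 5, 3, 1, 0

def frontier(u1):
    """Best payoff player 2 can get when player 1 gets u1, over correlated
    mixtures of the pure outcomes (Pareto frontier through (S,T), (R,R), (T,S))."""
    u1 = Fraction(u1)
    if u1 <= R:
        return T - (u1 - S) * Fraction(T - R, R - S)
    return R - (u1 - R) * Fraction(R - S, T - R)

def demand(t, steps):
    """Demand at round t: falls linearly from T to the maximin payoff P in `steps` rounds."""
    return max(Fraction(P), T - Fraction(T - P, steps) * t)

def haggle(steps1, steps2):
    """Toy haggling between two players with different pacing."""
    t = 0
    while True:
        d1, d2 = demand(t, steps1), demand(t, steps2)
        if d2 <= frontier(d1):
            # Both demands are jointly achievable: the pair locks in.
            # Arbitrary tie-break: player 1 gets its demand, player 2 the rest.
            return "joint", (d1, frontier(d1))
        if d1 == P or d2 == P:
            # Someone hit its maximin demand first and locks in D alone;
            # the other player's best response to D is D.
            return "DD", (Fraction(P), Fraction(P))
        t += 1

print(haggle(4, 4))    # identical pacing
print(haggle(8, 4))    # somewhat different pacing
print(haggle(100, 4))  # very different pacing
```

In this toy, identical pacing yields CC, moderately different pacing yields a Pareto-efficient outcome in which the slower player defects part of the time, and very different pacing collapses to DD, which matches the qualitative picture in the bullets but should not be read as evidence for it.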

We can operationalize "CDT agent" as e.g. a learning algorithm satisfying an internal regret bound (see sections 4.4 and 7.4 in Cesa-Bianchi and Lugosi) and the process of self-modification as learning on two different timescales: a slow outer loop that chooses a learning algorithm for a quick inner loop (this is simplistic, but IMO still instructive). Such an agent would indeed choose IBH over CDT if playing a Prisoner's Dilemma (and would prefer an IBH variant that lowers its demands slowly enough to get more of the gains-of-trade but quickly enough to actually converge), whereas in the 3-player Prisoner's Dilemma there is at least some IBH variant that would be no worse than CDT.

If all players have metalearning in the outer loop, then we get dynamics similar to Chicken [version in which both swerving is better than flipping a coin[4]], where hard-bargaining (slower to lower demands) IBH corresponds to "straight" and soft-bargaining (quick to lower demands) IBH corresponds to "swerve". Chicken [this version] between two identical IBH agents results in both swerving. Chicken between hard-IBH and soft-IBH results in hard-IBH getting a higher probability of going straight[5]. Chicken between two CDTs results in a correlated equilibrium, which might have some probability of crashing. Chicken between IBH and CDT... I'm actually not sure what happens off the top of my head, the analysis is not that trivial.

 

  1. ^

    This is pretty similar to "modal UDT" (going from optimistic to pessimistic outcomes until you find a proof that some action can guarantee that outcome). I think that the analogy can be made stronger if the modal agent uses an increasingly strong proof system during the search, which IIRC was also considered before. The strength of the proof system then plays the role of "logical time", and the pacing of increasing the strength is analogous to the (inverse function of the) temporal pacing in which an IBH agent lowers its target payoff.

  2. ^

    Assuming that they start out already knowing the rules of the game. Otherwise, they might start from trying to achieve payoffs which are impossible even with the cooperation of other players. So, this is a good model if learning the rules is much faster than learning anything to do with the behavior of other players, which seems like a reasonable assumption in many cases.

  3. ^

    It is not that surprising that two sufficiently dissimilar agents can defect. After all, the original argument for superrational cooperation was: "if the other agent is similar to you, then it cooperates iff you cooperate".

  4. ^

    I wish we had good names for the two versions of Chicken.

  5. ^

    This seems nicely reflectively consistent: soft/hard-IBH in the outer loop produces soft/hard-IBH respectively in the inner loop. However, two hard-IBH agents in the outer loop produce two soft-IBH agents in the inner loop. On the other hand, comparing absolute hardness between outer and inner loop seems not very meaningful, whereas comparing relative-between-players hardness between outer and inner loop is meaningful.

Wei Dai, 7mo

But, the gist of your post seems to be: "Since coming up with UDT, we ran into these problems, made no progress, and are apparently at a dead end. Therefore, UDT might have been the wrong turn entirely."

This is a bit stronger than how I would phrase it, but basically yes.

On the other hand, my view is: Since coming up with those problems, we made a lot of progress on agent theory within the LTA

I tend to be pretty skeptical of new ideas. (This backfired spectacularly once, when I didn't pay much attention to Satoshi when he contacted me about Bitcoin, but I think in general has served me well.) My experience with philosophical questions is that even when some approach looks a stone's throw away from a final solution to some problem, a bunch of new problems pop up and show that we're still quite far away. With an approach that is still as early as yours, I just think there's quite a good chance it doesn't work out in the end, or gets stuck somewhere on a hard problem. (Also some people who have dug into the details don't seem as optimistic that it is the right approach.) So I'm reluctant to decrease my probability of "UDT was a wrong turn" too much based on it.

The rest of your discussion about 2TDT-1CDT seems plausible to me, although of course depends on whether the math works out, doing something about monotonicity, and also a solution to the problem of how to choose one's IBH prior. (If the solution was something like "it's subjective/arbitrary" that would be pretty unsatisfying from my perspective.)

...the problem of how to choose one's IBH prior. (If the solution was something like "it's subjective/arbitrary" that would be pretty unsatisfying from my perspective.)

 

It seems clear to me that the prior is subjective. Like with Solomonoff induction, I expect there to exist something like the right asymptotic for the prior (i.e. an equivalence class of priors under the equivalence relation where $\zeta_1$ and $\zeta_2$ are equivalent when there exists some $C > 0$ s.t. $\zeta_1 \le C\zeta_2$ and $\zeta_2 \le C\zeta_1$), but not a unique correct prior, just like there is no unique correct UTM. In fact, my arguments about IBH already rely on the asymptotic of the prior to some extent.

One way to view the non-uniqueness of the prior is through an evolutionary perspective: agents with prior $\zeta_1$ are likely to evolve/flourish in universes sampled from prior $\zeta_1$, while agents with prior $\zeta_2$ are likely to evolve/flourish in universes sampled from prior $\zeta_2$. No prior is superior across all universes: there's no free lunch.
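A small numerical sketch of the "no free lunch" point, using ordinary Bayesian mixtures rather than infra-Bayesian ones. Everything here is an illustrative assumption: the two coin hypotheses, the two priors, the sequences, and the reading of "equivalent" as each prior dominating the other up to a multiplicative constant (here the constant is 99). The cumulative log loss of a Bayesian predictor on a sequence equals the negative log of the mixture probability it assigns to that sequence.

```python
import math

# Two assumed hypotheses about a coin, given as P(heads).
HYPOTHESES = {"mostly_heads": 0.8, "mostly_tails": 0.2}

def likelihood(p_heads, seq):
    """Probability of the flip sequence under a fixed bias."""
    return math.prod(p_heads if x == "H" else 1 - p_heads for x in seq)

def mixture_log_loss(prior, seq):
    """Cumulative log loss of the Bayesian mixture predictor on seq."""
    return -math.log(sum(prior[h] * likelihood(p, seq) for h, p in HYPOTHESES.items()))

# Two priors that agree on the hypothesis class but weight it very differently.
prior_a = {"mostly_heads": 0.99, "mostly_tails": 0.01}
prior_b = {"mostly_heads": 0.01, "mostly_tails": 0.99}

heads_world = "HHHHHHHHTT"
tails_world = "TTTTTTTTHH"

for name, seq in (("heads_world", heads_world), ("tails_world", tails_world)):
    print(name, mixture_log_loss(prior_a, seq), mixture_log_loss(prior_b, seq))
```

Each prior wins in the world it favors, and because the priors are within a factor of 99 of each other, the gap in cumulative loss never exceeds log 99. That bounded gap is the sense in which priors in the same equivalence class are interchangeable asymptotically while still being incomparable at finite cost.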

For the purpose of AI alignment, the solution is some combination of (i) learn the user's prior and (ii) choose some intuitively appealing measure of description complexity, e.g. length of lambda-term (i is insufficient in itself because you need some ur-prior to learn the user's prior). The claim is, different reasonable choices in ii will lead to similar results.

Given all that, I'm not sure what's still unsatisfying. Is there any reason to believe something is missing in this picture?

Raemon, 7mo

Curated, both for the OP (which nicely lays out some open problems and provides some good links towards existing discussion) as well as the resulting discussion which has had a number of longtime contributors to LessWrong-descended decision theory weighing in.

About 2TDT-1CDT. If two groups are mixed into a PD tournament, and each group can decide on a strategy beforehand that maximizes that group's average score, and one group is much smaller than the other, then that smaller group will get a higher average score. So you could say that members of the larger group are "handicapped" by caring about the larger group, not by having a particular decision theory. And it doesn't show reflective inconsistency either: for an individual member of a larger group, switching to selfishness would make the larger group worse off, which is bad according to their current values, so they wouldn't switch.

Edit: You could maybe say that TDT agents cooperate not because they care about one another (a), but because they're smart enough to use the right decision theory that lets them cooperate (b). And then the puzzle remains, because agents using the "smart" decision theory get worse results than agents using the "stupid" one. But I'm having a hard time formalizing the difference between (a) and (b).

But the situation isn't symmetrical, meaning if you reversed the setup to have 2 CDT agents and 1 TDT agent, the TDT agent doesn't do better than the CDT agents, so it does seem like the puzzle has something to do with decision theory, and is not just about smaller vs larger groups? (Sorry, I may be missing your point.)

I think you can make it more symmetrical by imagining two groups that can both coordinate within themselves (like TDT), but each group cares only about its own welfare and not the other group's. And then the larger group will choose to cooperate and the smaller one will choose to defect. Both groups are doing as well as they can for themselves, the game just favors those whose values extend to a smaller group.
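Here is a minimal sketch of that two-group model, assuming standard PD payoffs (T=5, R=3, P=1, S=0), uniform random pairing across the whole population, and each group picking a single move for all its members to maximize its own average payoff. The group sizes and the helper names `group_average` and `equilibria` are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Assumed PD payoffs to the row player (not from the thread).
T, R, P, S = 5, 3, 1, 0
PD = {("C", "C"): R, ("C", "D"): S, ("D", "C"): T, ("D", "D"): P}

def group_average(own_size, other_size, own_move, other_move):
    """Average payoff of a group member when each group plays one fixed move."""
    others = own_size + other_size - 1
    p_own = Fraction(own_size - 1, others)  # chance of meeting a member of one's own group
    return p_own * PD[(own_move, own_move)] + (1 - p_own) * PD[(own_move, other_move)]

def equilibria(large, small):
    """Pure-strategy profiles where neither group can raise its own average by switching."""
    result = []
    for a, b in product("CD", repeat=2):
        a_ok = all(group_average(large, small, a, b) >= group_average(large, small, alt, b)
                   for alt in "CD")
        b_ok = all(group_average(small, large, b, a) >= group_average(small, large, alt, a)
                   for alt in "CD")
        if a_ok and b_ok:
            result.append((a, b))
    return result

print(equilibria(8, 2))
print(group_average(8, 2, "C", "D"), group_average(2, 8, "D", "C"))
```

With eight agents in one group and two in the other, the only stable profile under these assumptions is the large group cooperating and the small group defecting, and the small group ends up with the higher average.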

I think I kind of get what you're saying, but it doesn't seem right to model TDT as caring about all other TDT agents, as they would exploit other TDT agents if they could do so without negative consequences to themselves, e.g., if a TDT AI was in a one-shot game where they unilaterally decide whether to attack and take over another TDT AI or not.

Maybe you could argue that the TDT agent would refrain from doing this because of considerations like its decision to attack being correlated with other AIs' decisions to potentially attack it in other situations/universes, but that's still not the same as caring for other TDT agents. I mean the chain of reasoning/computation you would go through in the two cases seem very different.

Also it's not clear to me what implications your idea has even if it was correct, like what does it suggest about what the right decision theory is?

BTW do you have any thoughts on Vanessa Kosoy's decision theory ideas?

I don't fully understand Vanessa's approach yet.

About caring about other TDT agents, it feels to me like the kind of thing that should follow from the right decision theory. Here's one idea. Imagine you're a TDT agent that has just been started / woken up. You haven't yet observed anything about the world, and haven't yet observed your utility function either - it's written in a sealed envelope in front of you. Well, you have a choice: take a peek at your utility function and at the world, or use this moment of ignorance to precommit to cooperate with everyone else who's in the same situation. Which includes all other TDT agents who ever woke up or will ever wake up and are smart enough to realize the choice.

It seems likely that such wide cooperation will increase total utility, and so increase expected utility for each agent (ignoring anthropics for the moment). So it makes sense to make the precommitment, and only then open your eyes and start observing the world and your utility function and so on. So for your proposed problem, where a TDT agent has the opportunity to kill another TDT agent in their sleep to steal five dollars from them (destroying more utility for the other than gaining for themselves), the precommitment would stop them from doing it. Does this make sense?

What would be an example of an "indexical value" that is "reflectively inconsistent"?

See this comment and the post that it's replying to.

UDT says that what we normally think of as different approaches to anthropic reasoning are really different preferences, which seems to sidestep the problem. But is that actually right, and if so where are these preferences supposed to come from?

What are the main reasons to think that it's wrong?

I'm not aware of good reasons to think that it's wrong, it's more that I'm just not sure it's the right approach. I mean we can say that it's a matter of preferences, problem solved, but unless we can also show that we should be anti-realist about these preferences, or what the right preferences are, the problem isn't really solved. Until we do have a definitive full solution, it seems hard to be confident that any particular approach is the right one.

It seems plausible that treating anthropic reasoning as a matter of preferences makes it harder to fully solve the problem. I wrote "In general, Updateless Decision Theory converts anthropic reasoning problems into ethical problems." in the linked post, but we don't have a great track record of solving ethical problems...

Indexical values are not reflectively consistent. UDT "solves" this problem by implicitly assuming (via the type signature of its utility function) that the agent doesn't have indexical values.


Nonindexical values aren't reflectively consistent either, if you are updateful. Right? 

[anonymous], 7mo

If there's a population of mostly TDT/UDT agents and few CDT agents (and nobody knows who the CDT agents are) and they're randomly paired up to play one-shot PD, then the CDT agents do better. What does this imply?

the CDT agents don't necessarily do better. e.g., if "not being tied for first" means "the CDT agents accrue an advantage, use that advantage to gain further advantage after the game, and eventually take over the world", then the TDT/UDT agents all choose to defect because of the presence of CDT agents, and the CDT agents also all defect because of their decision theory. 

whereas if there's no real cost to not being first, then there's a couple possibilities i see. (a) the TDT/UDT agents all defect in order to discourage the existence of CDT agents. (b) the TDT/UDT agents cooperate, and the CDT agents do perform better. i think (a) would be done in worlds/situations where it would work, but (b) is what happens in, e.g., a truly one-shot prisoners dilemma that's the only thing that will ever exist, and afterwards all the agents cease to exist, for example.

"what does this imply" - maybe we can conclude there is no decision algorithm which is necessarily best in all possible situations/worlds, and that one of the main things 'how well a decision algorithm performs' depends on is what decision algorithms other agents are using.

(meta p.s: i'd appreciate feedback on whether this comment was insightful to you)


[-]Max H7mo1-3

If there's a population of mostly TDT/UDT agents and few CDT agents (and nobody knows who the CDT agents are) and they're randomly paired up to play one-shot PD, then the CDT agents do better. What does this imply?

How does a CDT agent itself know that it is one of the CDT agents? Wouldn't they be uncertain about it? Even if CDTness is hypothetically caused by a known / testable genetic mutation, wouldn't there be uncertainty around the truth of that fact?

If you think you might be a CDT agent but you're unsure about it, you have to be careful not to stumble over an infohazard (including via introspection of your own mind) which proves or suggests that actually you are a UDT agent.

UDT agents might also be uncertain about their UDTness, but at least under UDT you don't have to worry about infohazards lurking in your own mind, right?

So one possible answer is that, in return for doing worse in this setup, UDT agents get more freedom to think and operate their own minds without worrying that they will be (further) disadvantaged by their own thoughts. Depending on how contrived and uncommon this kind of setup is throughout the multiverse, that might be a tradeoff worth making.

Also, it seems plausible that at least some humans might already be over the threshold where we don't really get a choice anyway - we've already had the thoughts that inevitably lead us down the path of being or delegating to UDT agents, whether we like it or not. I'm not sure whether this is true, but if it is, that actually seems like good news about the likely distribution of CDT vs. UDT agents across the multiverse - perhaps the supermajority of logically possible agents above a certain low-ish capabilities threshold will end up as UDT agents.

Why would they be uncertain about whether they’re a CDT agent? Being a CDT agent surely just means by definition they evaluate decisions based on causal outcomes. It feels confused to say that they have to be uncertain about/reflect on which decision theory they have and then apply it, rather than their being a CDT agent being an ex hypothesi fact about how they behave.

[-]Max H7mo0-2

What kind of decision theory the agent will use and how it will behave in specific circumstances is a proposition about the agent and about the world which can be true or false, just like any other proposition. Within any particular mind (the agent's own mind, Omega's, ours) propositions of any kind come with differing degrees of uncertainty attached, logical or otherwise.

You can of course suppose for the purposes of the thought experiment that the agent itself has an overwhelmingly strong prior belief about its own behavior, either because it can introspect and knows its own mind well, or because it has been told how it will behave by a trustworthy and powerful source such as Omega. You could also stipulate that the entirety of the agent is a formally specified computer program, perhaps one that is incapable of reflection or forming beliefs about itself at all.

However, any such suppositions would be additional constraints on the setup, which makes it weirder and less likely to be broadly applicable in real life (or anywhere else in the multiverse). 

Also, even if the agent starts with 99.999%+ credence that it is a CDT agent and will behave accordingly, perhaps there's some chain of thoughts the agent could think which provide strong enough evidence to overcome this prior (and thus change the agent's behavior). Perhaps this chain of thoughts is very unlikely to actually occur to the agent, or for the purposes of the thought experiment, we only consider agents which flatly aren't capable of such introspection or belief-updating. But that may (or may not!) mean restricting the applicability of the thought experiment to a much smaller and rather impoverished class of minds.

I understand it’s a proposition like any other, but I don’t see why an agent would reflect on it/use it in their deliberation to decide what to do. The fact that they’re a CDT agent is a fact about how they will act in the decision, not a fact that they need to use in their deliberation.

This is analogous to preferences: whether an agent prefers A or B is a proposition like any other, but I don’t think it’s natural to model them as first consulting the credences they have assigned to “I prefer A to B” etc. Rather, they will just choose A ex hypothesi, because that’s what having the preference means.

They don't have to, and for the purposes of the thought experiment you could specify that they simply don't. But humans are often pretty uncertain about their own preferences, and about what kind of decision theory they can or should use. Many of these humans are pretty strongly inclined to deliberate, reflect, and argue about these beliefs, and take into account their own uncertainty in them when making decisions.

So I'm saying that if you stipulate that no such reflection or deliberation occurs, you might be narrowing the applicability of the thought experiment to exclude human-like minds, which may be a rather large and important class of all possible minds.

I think personally I'd say it's a clear advancement; it opens up a lot of puzzles, but the naïve intuition corresponding to it still seems more satisfying than CDT or EDT, even if a full formalization is difficult.

(Not to comment on whether there might be a better communications strategy for getting the academic community interested.)

Indexical values are not reflectively consistent. UDT "solves" this problem by implicitly assuming (via the type signature of its utility function) that the agent doesn't have indexical values. But humans seemingly do have indexical values, so what to do about that?

Your linked post seems to suggest only some varieties of indexical value sources are reflectively inconsistent, but what's missing is an indexical value source that's both reflectively consistent and makes sense for humans. So there could still be a way to make indexical values reflectively consistent, just that we haven't thought of it yet?

E.g. would it work to privilege an agent's original instantiation, so that if you're uncertain whether you're the original or the copy, you follow the interests of the original? That would seem to address the counterfactual mugging question, at least if Omega were to predict by simulating you.

(I'm not sure if that technically counts as 'indexical' but it seems to me it can still be 'selfish' in the colloquial sense, no?)

I'm not sure "original instantiation" is always well-defined

You can use it when it can be well defined. I think in the real world you mostly do have something, at least in the past, that you can call "original", and when it no longer exists you could modify it to, e.g., "what the original instantiation, if it anticipated this scenario, would have defined as its successor".

[-]simon7mo1-2

2TDT-1CDT - If there's a population of mostly TDT/UDT agents and few CDT agents (and nobody knows who the CDT agents are) and they're randomly paired up to play one-shot PD, then the CDT agents do better. What does this imply?

I don't think that's the case unless you have really weird assumptions. If the other party can't tell what the TDT/UDT agent will pick, they'll defect, won't they? It seems strange that the other party would be able to tell what the TDT/UDT agent will pick but not whether they're TDT/UDT or CDT.

Edit: OK, I see the idea is that the TDT/UDT agents have known, fixed code, which can, e.g., randomly mutate into CDT. They can't voluntarily change their code. Being able to trick the other party about your code is an advantage - I don't see that as a TDT/UDT problem.

Nobody is being tricked though. Everyone knows there's a CDT agent among the population, just not who, and we can assume they have a correct amount of uncertainty about what the other agent's decision theory / source code is. The CDT agent still has an advantage in that case. And it is a problem because it means CDT agents don't always want to become more UDT-like (it seems like there are natural or at least not completely contrived situations, like Omega punishing UDT agents just for using UDT, where they don't), which takes away a major argument in its favor.

I think this is also a rather contrived scenario, because if the UDT agents could (silently) change their own code, cooperation would immediately break down. So it relies on the CDT agents being able to silently have code that differs from the most common, and thus expected, code, while the UDT agents cannot.

I'm not sure why you say "if the UDT agents could change their own code (silently) cooperation would immediately break down", because in my view a UDT agent would reason that if it changed its code (to something like CDT for example), that logically implies other UDT agents also changing their code to do the same thing, so the expected utility of changing its code would be evaluated as lower than not changing its code. So it would remain a UDT agent and still cooperate with other UDT agents or when the probability of the other agent being UDT is high enough.
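To make "when the probability of the other agent being UDT is high enough" concrete, here is a minimal sketch of the policy-level comparison. The payoff numbers and function names are hypothetical, and it assumes that all UDT agents' choices move together while any non-UDT agent always defects.

```python
# Hypothetical one-shot PD payoffs (an assumption for illustration): T > R > P > S.
T, R, P, S = 5, 3, 1, 0

def udt_policy_value(action: str, p_udt: float) -> float:
    """Expected payoff to a UDT agent if *every* UDT agent plays `action`.

    `p_udt` is the probability that the randomly assigned partner is UDT;
    non-UDT partners are assumed to always defect.
    """
    if action == "C":
        return p_udt * R + (1 - p_udt) * S
    # If all UDT agents defect, every pairing ends in mutual defection.
    return float(P)

def udt_choice(p_udt: float) -> str:
    """Choose the policy with the higher expected payoff."""
    if udt_policy_value("C", p_udt) > udt_policy_value("D", p_udt):
        return "C"
    return "D"

def cooperation_threshold() -> float:
    """Smallest p_udt above which cooperating beats defecting as a policy."""
    return (P - S) / (R - S)

print(cooperation_threshold(), udt_choice(0.9), udt_choice(0.2))
```

Under these assumptions, switching to defection is evaluated as switching the whole UDT population, which is why the agent keeps cooperating once `p_udt` exceeds the threshold.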

To me this example is about a CDT agent not wanting to become UDT-like if it found itself in a situation with many other UDT agents, which just seems puzzling if your previous perspective was that UDT is a clear advancement in decision theory and everyone should adopt UDT or become more UDT-like.

I think, if you had several UDT agents with the same source code, and then one UDT agent with slightly different source code, you might see the unique agent defect.

I think the CDT agent has an advantage here because it is capable of making distinct decisions from the rest of the population—not because it is CDT.

The general hope is that slight differences in source code (or even large differences, as long as they're all using UDT or something close to it) wouldn't be enough to make a UDT agent defect against another UDT agent (i.e. the logical correlation between their decisions would be high enough), otherwise "UDT agents cooperate with each other in one-shot PD" would be false or not have many practical implications, since why would all UDT agents have the exact same source code?
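One very rough way to picture "high enough" is a toy model in which logical correlation is collapsed into a single number. This is only a sketch under strong assumptions: `c` is a hypothetical probability that the other agent's decision mirrors mine, and otherwise the other agent cooperates independently with probability `r`; neither parameter comes from UDT itself.

```python
from typing import Tuple

# Hypothetical one-shot PD payoffs (an assumption for illustration): T > R > P > S.
T, R, P, S = 5, 3, 1, 0

def expected_utilities(c: float, r: float = 0.5) -> Tuple[float, float]:
    """Return (EU of cooperating, EU of defecting) in the toy model.

    With probability c the other agent's action equals mine; otherwise it
    cooperates independently with probability r.
    """
    eu_cooperate = c * R + (1 - c) * (r * R + (1 - r) * S)
    eu_defect = c * P + (1 - c) * (r * T + (1 - r) * P)
    return eu_cooperate, eu_defect

def min_correlation_to_cooperate(r: float = 0.5) -> float:
    """Correlation above which cooperating has the higher expected utility."""
    # Gain from defecting against an agent whose action is independent of mine.
    gain = r * (T - R) + (1 - r) * (P - S)
    return gain / ((R - P) + gain)

print(min_correlation_to_cooperate())
```

In this toy model the required correlation depends on the payoffs and on how the uncorrelated agents behave, so whether slight source code differences stay above it is left open.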

There are at least two potential sources of cooperation: symmetry and mutual source code knowledge; symmetry should be fragile to small changes in source code (I expect) as well as asymmetry between the situations of the different parties while mutual source code knowledge doesn't require those sorts of symmetry at all (but does require knowledge).

Edit: for some reason my intuition expects cooperation from similarity to be less fragile in the Newcomb's problem/code knowledge case (similarity to simulation) than if the similarity is just plain similarity to another, non-simulation agent. I need to think about why and if this has any connection to what would actually happen.

I mean, that's a thing you might hope to be true. I'm not sure if it actually is true.

I did not realize that the UDT agents were assumed to behave identically; I was thinking that the cooperation was maintained, not by symmetry, but by mutual source code knowledge. 

If it's symmetry, well, if you can sneak a different agent into a clique without getting singled out, that's an advantage. Again not a problem with UDT as such.

Edit: of course they do behave identically because they did have identical code (which was the source of the knowledge). (Though I don't expect agents in the same decision theory class to be identical in the typical case).

I don't think 3 depends on UDT that much? Like, you can call the betting argument in Sleeping Beauty "UDT-like reasoning", but then it's not like people were convinced either way without UDT. And about it being right, do you mean that it would be practically convenient if there was some logical law that forced our preferences about multiple copies? Because the way I see it, it's not problematic from a philosophical standpoint.

Re 6:

Disclaimer: I've only read the FDT paper and did so a long time ago, so feel free to ignore this comment if it is trivially wrong.

I don't see why FDT would assume that the agent has access to its own source code and inputs as a symbol string. I think you can reason about the logical correlation between different agents' decisions without it, and in fact people do so all the time. For example, people urge others to vote by saying "if no one voted, we could not have a functional democracy", or say "don't throw away that plastic bottle, because if everyone did we would live in trash heaps", or reason about voting blue on pill questions on Twitter. These examples all contain reasoning with the 3 key parts of FDT (as I understand it, at least).

  1. Identifying the agents using these 3 steps in their reasoning (other humans with a similar cultural background, resulting in a conception of morality influenced by these 3 steps).
  2. Simulating the hypothetical worlds with each possible reasoning outcome and evaluating their value.
  3. Choosing the option resulting in the most value as the outcome of this reasoning process.
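The three steps above can be sketched roughly as follows. This is only an illustration: `correlated_choice`, `littering_world`, and all the numbers are hypothetical, and step 1 is reduced to a single assumed count of agents who reason the same way.

```python
from typing import Callable, Iterable

def correlated_choice(options: Iterable[str],
                      n_correlated: int,
                      world_value: Callable[[str, int], float]) -> str:
    """Pick the option that is best if all correlated agents pick it too.

    Step 1 (identifying the agents) is summarized by `n_correlated`.
    Step 2 evaluates the hypothetical world for each reasoning outcome.
    Step 3 returns the option whose world has the highest value.
    """
    return max(options, key=lambda option: world_value(option, n_correlated))

def littering_world(option: str, n: int) -> float:
    """Hypothetical value to me of a world where n like-minded people choose `option`."""
    convenience = 1.0 if option == "litter" else 0.0
    shared_cost = 0.01 * n if option == "litter" else 0.0
    return convenience - shared_cost

alone = correlated_choice(["litter", "bin"], 1, littering_world)
together = correlated_choice(["litter", "bin"], 1000, littering_world)
print(alone, together)
```

With these made-up numbers, littering looks best when only my own act is counted and worst when the choice is evaluated as the shared outcome of everyone reasoning the same way.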

Of course, only aspiring rationalists would call this "FDT"; regular people would probably call this reasoning (a proper subset of) "being a decent person", and moral philosophers would call it a form of "rule utilitarianism" (where instead of evaluating rules we evaluate possible algorithm outcomes). But the reasoning is the same, no? (There is of course no, or at least very little, actual causal effect of my going to vote or throwing trash away on others, and similarly very little chance of my being the deciding vote (by my calculations for an election, with polling data and reasonable assumptions, this holds even compared to the vast amount of value at stake), so humans actually use this reasoning even if the steps are often just implied and not stated explicitly.)

In conclusion, if you know something about the origins of you and other agents, you can detect logical correlations with some probability even without source codes. (In fact a source code is a special case of the general situation: if the source code is valid and you know this, you necessarily know of a causal connection between the printed out source code and the agent)

Hi, I'm interested in learning more decision theory and about infinite ethics. Do you have any resources that I could study? 

I'd like to join in on this post and contribute, but I don't have enough of a formal understanding of the various decision theories (and decision theory in general).

Regarding infinite ethics, Joseph Carlsmith has an excellent post on the topic. In general, his writing on philosophy is of high quality.

Thank you! Gonna check that post out 

Also, any resources on meta-philosophy would be great. 

[-]TAG7mo00

There are meta-level problems about DT, chiefly:

  1. Whether it is possible to find a single DT that will be optimal, in any universe, for any agent. (I take the fact that the debate has been going on so long as evidence that it isn't.)

  2. Whether the various candidate DTs are actually as mutually exclusive or rivalrous as thought. (Argument that they are not: https://www.lesswrong.com/posts/MwetLcBPvshg9ePZB/decision-theory-is-not-policy-theory-is-not-agent-theory)

  3. Whether optimal DT is computable (and therefore useful for predicting AIs).

(Added: whether the standard puzzles make enough sense to have answers.)