Anthony DiGiovanni

(Formerly "antimonyanthony.") I'm an s-risk-focused AI safety researcher at the Center on Long-Term Risk. I (occasionally) write about altruism-relevant topics on my Substack. All opinions my own.

Wiki Contributions


I think it's pretty unclear that MSR is action-guiding for real agents trying to follow functional decision theory, because of Sylvester Kollin's argument in this post.

Tl;dr: FDT says, "Supposing I follow FDT, it is just implied by logic that any other instance of FDT will make the same decision as me in a given decision problem." But the idealized definition of "FDT" is computationally intractable for real agents. Real agents would need to find approximations for calculating expected utilities, and choose some way of mapping their sense data to the abstractions they use in their world models. And it seems extremely unlikely that agents will use the exact same approximations and abstractions, unless they're exact copies — in which case they have the same values, so MSR is only relevant for pure coordination (not "trade").

Many people who are sympathetic to FDT apparently want it to allow for less brittle acausal effects than "I determine the decisions of my exact copies," but I haven't heard of a non-question-begging formulation of FDT that actually does this.

Sorry, to be clear, I'm familiar with the topics you mention. My confusion is that ROSE bargaining per se seems to me pretty orthogonal to decision theory.

I think the ROSE post(s) are an answer to questions like, "If you want to establish a norm for an impartial bargaining solution such that agents following that norm don't have perverse incentives, what should that norm be?", or "If you're going to bargain with someone but you didn't have an opportunity for prior discussion of a norm, what might be a particularly salient allocation [because it has some nice properties], meaning that you're likely to zero-shot coordinate on that allocation?"

Can you say more about what you think this post has to do with decision theory? I don't see the connection. (I can imagine possible connections, but don't think they're relevant.)

I agree with the point that we shouldn’t model the AI situation as a zero-sum game. And the kinds of conditional commitments you write about could help with cooperation. But I don’t buy the claim that "implementing this protocol (including slowing down AI capabilities) is what maximizes their utility."

Here's a pedantic toy model of the situation, so that we're on the same page: The value of the whole lightcone going towards an agent’s values has utility 1 by that agent’s lights (and 0 by the other’s), and P(alignment success by someone) = 0 if both speed up, else 1. For each of the alignment success scenarios i, the winner chooses a fraction of the lightcone to give to Alice's values (xi^A for Alice's choice, xi^B for Bob's). Then, some random numbers for expected payoffs (assuming the players agree on the probabilities):

  • Payoffs for Alice and Bob if they both speed up capabilities: (0, 0)
  • Payoffs if Alice speeds, Bob doesn’t: 0.8 * (x1^A, 1 - x1^A) + 0.2 * (x1^B, 1 - x1^B)
  • Payoffs if Bob speeds, Alice doesn’t: 0.2 * (x2^A, 1 - x2^A) + 0.8 * (x2^B, 1 - x2^B)
  • Payoffs if neither speeds: 0.5 * (x3^A, 1 - x3^A) + 0.5 * (x3^B, 1 - x3^B)

So given this model, seems that you're saying Bob has an incentive to slow down capabilities because Alice's ASI successor can condition the allocation to Bob's values on his decision. Which we can model as Bob expecting Alice to use the strategy {don't speed; x2^A = 1; x3^A = 0.5} (given she [edit: typo] doesn't speed up, she only rewards Bob's values if Bob didn't speed up).

Why would Bob so confidently expect this strategy? You write:

And Bob doesn't have to count on wishful thinking to know that Alice would indeed do this instead of defecting, because in worlds where he wins, he can his superintelligence if Alice would implement this procedure. 

I guess the claim is just that them both using this procedure is a Nash equilibrium? If so, I see several problems with this:

  1. There are more Pareto-efficient equilibria than just “[fairly] cooperate” here. Alice could just as well expect Bob to be content with getting expected utility 0.2 from the outcome where he slows down and Alice speeds up — better that than the utility 0 from extinction, after all. Alice might think she can make it credible to Bob that she won’t back down from speeding up capabilities, and vice versa, such that they both end up pursuing incompatible demands. (See, e.g., “miscoordination” here.)
  2. You’re lumping “(a) slow down capabilities and (b) tell your AI to adopt a compromise utility function” into one procedure. I guess the idea is that, ideally, the winner of the race could have their AI check whether the loser was committed to do both (a) and (b). But realistically it seems implausible to me that Alice or Bob can commit to (b) before winning the race, i.e., that what they do in the time before they win the race determines whether they’ll do (b). They can certainly tell themselves they intend to do (b), but that’s cheap talk.

    So it seems Alice would likely think, "If I follow the whole procedure, Bob will cooperate with my values if I lose. But even if I slow down (do (a)), I don't know if my future self [or, maybe more realistically, the other successors who might take power] will do (b) — indeed once they're in that position, they'll have no incentive to do (b). So slowing down isn't clearly better." (I do think, setting aside the bargaining problem in (1), she has an incentive to try to make it more likely that her successors follow (b), to be clear.)

It seems that what I was missing here was: mrcSSA disputes my premise that the evidence in fact is "*I* am in a white room, [created by God in the manner described in the problem setup], and have a red jacket"!

Rather, mrcSSA takes the evidence to be: "Someone is in a white room, [created by God in the manner described in the problem setup], and has a red jacket." Which is of course certain to be the case given either heads or tails.

(h/t Jesse Clifton for helping me see this)

Is God's coin toss with equal numbers a counterexample to mrcSSA?

I feel confused as to whether minimal-reference-class SSA (mrcSSA) actually fails God's coin toss with equal numbers (where "failing" by my lights means "not updating from 50/50"):

  • Let H = "heads world", W_{me} = "I am in a white room, [created by God in the manner described in the problem setup]", R_{me} = "I have a red jacket."
  • We want to know P(H | W_{me}, R_{me}).
  • First, P(R_{me} | W_{me}, H) and P(R_{me} | W_{me}, ~H) seem uncontroversial: Once I've already conditioned on my own existence in this problem, and on who "I" am, but before I've observed my jacket color, surely I should use a principle of indifference: 1 out of 10 observers of existing-in-the-white-room in the heads world have red jackets, while all of them have red jackets in the tails world, so my credences are P(R_{me} | W_{me}, H) = 0.1 and P(R_{me} | W_{me}, ~H) = 1. Indeed we don't even need a first-person perspective at this step — it's the same as computing P(R_{Bob} | W_{Bob}, H) for some Bob we're considering from the outside.
    • (This is not the same as non-mrcSSA with reference class "observers in a white room," because we're conditioning on knowing "I" am an observer in a white room when computing a likelihood (as opposed to computing the posterior of some world given that I am an observer in a white room). Non-mrcSSA picks out a particular reference class when deciding how likely "I" am to observe anything in the first place, unconditional on "I," leading to the Doomsday Argument etc.)
  • The step where things have the potential for anthropic weirdness is in computing P(W_{me} | H) and P(W_{me} | ~H). In the Presumptuous Philosopher and the Doomsday Argument, at least, probabilities like this would indeed be sensitive to our anthropics.
  • But in this problem, I don't see how mrcSSA would differ from non-mrcSSA with the reference class R_{non-minimal} = "observers in a white room" used in Joe's analysis (and by extension, from SIA):
    • In general, SSA says
    • Here, the supposedly "non-minimal" reference class R_{non-minimal} coincides with the minimal reference class! I.e., it's the observer-moments in your epistemic situation (of being in a white room), before you know your jacket color.
  • The above likelihoods plus the fair-coin prior are all we need to get P(H | R_{me}, W_{me}), but at no point did the three anthropic views disagree.

In order words: It seems that the controversial setup in anthropics is in answering P(I [blah] | world), i.e., what we do when we introduce the indexical information about "I." But once we've picked out a particular "I," the different views should agree.

(I still feel suspicious of mrcSSA's metaphysics for independent reasons, but am considerably less confident in that than my verdict on God's coin toss with equal numbers.)

I enjoyed this post and think it should help reduce confusion in many future discussions, thanks!

Some comments on your remarks about anthropics:

Different anthropic theories partially rely on metaphysical intuitions/stories about how centered worlds or observer moments are 'sampled', and have counterintuitive implications (e.g., the Doomsday argument for SSA and the Presumptuous philosopher for SIA).

I'm not sure why this is an indictment of "anthropic reasoning" per se, as if that's escapable. It seems like all anthropic theories are trying to answer a question that one needs to answer when forming credences, i.e., how do we form likelihoods P(I observe I exist | world W)? (Which we want in order to compute P(world W | I observe I exist).)

Indeed just failing to anthropically update at all has counterintuitive implications, like the verdict of minimal-reference-class SSA in Joe C's "God's coin toss with equal numbers." [no longer endorsed] And mrcSSA relies on the metaphysical intuition that oneself was necessarily going to observe X, i.e., P(I observe I exist | world W) = P(I observe I exist | not-W) = 1(which is quite implausible IMO). [I think endorsed, but I feel confused:] And mrcSSA relies on the metaphysical intuition that, given that someone observes X, oneself was necessarily going to observe X, which is quite implausible IMO.

in earlier sections you argue that CDT agents might not adopt LDT-recommended policies and so will have problems with bargaining

That wasn’t my claim. I was claiming that even if you're an "LDT" agent, there's no particular reason to think all your bargaining counterparts will pick the Fair Policy given you do. This is because:

  1. Your bargaining counterparts won’t necessarily consult LDT.
  2. Even if they do, it’s super unrealistic to think of the decision-making of agents in high-stakes bargaining problems as entirely reducible to “do what [decision theory X] recommends.”
  3. Even if decision-making in these problems were as simple as that, why should we think all agents will converge to using the same simple method of decision-making? Seems like if an agent is capable of de-correlateing their decision-making in bargaining from their counterpart, and their counterpart knows this or anticipates it on priors, that agent has an incentive to do so if they can be sufficiently confident that their counterpart will concede to their hawkish demand.

So no, “committing to act like LDT agents all the time,” in the sense that is helpful for avoiding selection pressures against you, does not ensure you’ll have a decision procedure such that you have no bargaining problems.

But we were discussing a case(counterfactual mugging) where they would want to pre-commit to act in ways that would be non-causally beneficial.

I’m confused, the commitment is to act in a certain way that, had you not committed, wouldn’t be beneficial unless you appealed to acausal (and updateless) considerations. But the act of committing has causal benefits.

there are other reasons that you might not want to demand too much. Maybe you know their source code and can simulate that they will not accept a too-high demand. Or perhaps you think, based on empirical evidence or a priori reasoning that most agents you might encounter will only accept a roughly fair allocation.

I agree these are both important possibilities, but:

  1. The reasoning “I see that they’ve committed to refuse high demands, so I should only make a compatible demand” can just be turned on its head and used by the agent who commits to the high demand.
  2. One might also think on priors that some agents might be committed to high demands, therefore strictly insisting on fair demands against all agents is risky.

I was specifically replying to the claim that the sorts of AGIs who would get into high-stakes bargaining would always avoid catastrophic conflict because of bargaining problems; such a claim requires something stronger than the considerations you've raised, i.e., an argument that all such AGIs would adopt the same decision procedure (and account for logical causation) and therefore coordinate their demands.

(By default if I don't reply further, it's because I think your further objections were already addressed—which I think is true of some of the things I've replied to in this comment.)


It's true that you usually have some additional causal levers, but none of them are the exact same as be the kind of person who does X.

Not sure I understand. It seems like "being the kind of person who does X" is a habit you cultivate over time, which causally influences how people react to you. Seems pretty analogous to the job candidate case.

if CDT agents often modify themselves to become an LDT/FDT agent then it would broadly seem accurate to say that CDT is getting outcompeted

See my replies to interstice's comment—I don't think "modifying themselves to become an LDT/FDT agent" is what's going on, at least, there doesn't seem to be pressure to modify themselves to do all the sorts of things LDT/FDT agents do. They come apart in cases where the modification doesn't causally influence another agent's behavior.

(This seems analogous to claims that consequentialism is self-defeating because the "consequentialist" decision procedure leads to worse consequences on average. I don't buy those claims, because consequentialism is a criterion of rightness, and there are clearly some cases where doing the non-consequentialist thing is a terrible idea by consequentialist lights even accounting for signaling value, etc. It seems misleading to call an agent a non-consequentialist if everything they do is ultimately optimizing for achieving good consequences ex ante, even if they adhere to some rules that have a deontological vibe and in a given situation may be ex post suboptimal.)

It seems plausible that there is no such thing as "correct" metaphilosophy, and humans are just making up random stuff based on our priors and environment and that's it and there is no "right way" to do philosophy, similar to how there are no "right preferences"

If this is true, doesn't this give us more reason to think metaphilosophy work is counterfactually important, i.e., can't just be delegated to AIs? Maybe this isn't what Wei Dai is trying to do, but it seems like "figure out which approaches to things (other than preferences) that don't have 'right answers' we [assuming coordination on some notion of 'we'] endorse, before delegating to agents smarter than us" is time-sensitive, and yet doesn't seem to be addressed by mainstream intent alignment work AFAIK.

(I think one could define "intent alignment" broadly enough to encompass this kind of metaphilosophy, but I smell a potential motte-and-bailey looming here if people want to justify particular research/engineering agendas labeled as "intent alignment.")

Load More