Anthony DiGiovanni

(Formerly "antimonyanthony.") I'm an s-risk-focused AI safety researcher at the Center on Long-Term Risk. I (occasionally) write about altruism-relevant topics on my Substack. All opinions my own.

Posts

3 · antimonyanthony's Shortform
17 · In defense of anthropically updating EDT · 2mo
89 · Making AIs less likely to be spiteful · 7mo
140 · Responses to apparent rationalist confusions about game / decision theory · 8mo
40 · When is intent alignment sufficient or necessary to reduce AGI conflict?
52 · When would AGIs engage in conflict?
52 · When does technical work to reduce AGI conflict make a difference?: Introduction

Comments

In defense of anthropically updating EDT

Anthony DiGiovanni · 1mo · 21

I interpret a decision theory as an answer to “Given my values and beliefs, what am I trying to do as an agent (i.e., if rationality is ‘winning,’ what is ‘winning’)?” Insofar as I endorse maximizing expected utility, a decision theory is an answer to “How do I define ‘expected utility,’ and what options do I view myself as maximizing over?”
- I think it’s important to consider these normative questions, not just “What decision procedure wins, given my definition of ‘winning’?”
- (I discuss similar themes here.)
On this interpretation of “decision theory,” EDT is the most appealing option I’m aware of. What I’m trying to do just seems to be: “make decisions such that I expect the best consequences conditional on those decisions.” The EDT criterion satisfies some very appealing principles like the “irrelevance of impossible outcomes.” And the “decisions” in question determine my actions in the given decision node.
I take view #1 in your list in “What are probabilities?”
- I don’t think “arbitrariness” in this sense is problematic. There is a genuine mystery here as to why the world is the way it is, but I don’t think we can infer the existence of other worlds purely from our confusion.
- And it just doesn’t seem that the thing I’m doing when I’m forming beliefs about the world is answering “how much do I care about different possible worlds?”
Indexicals: I haven’t formed a deliberate view on this. A flat-footed response to cases like your “old puzzle” in the comment you linked: Insofar as I simply don’t experience a superposition of experiences at once, it seems that if I get copied, “I” just will experience one of the copies’ experience-streams and not the others’. (Again I don’t consider it problematic that there’s some arbitrariness in which of the copies ends up being “me” — indeed if Everett is right then this sort of arbitrary direction of the flow of experience-streams happens all the time.) I think “you are just a different person from your future self, so there’s no fact of the matter what you will observe” is a reasonable alternative though.
I take a physicalist* view of agents: “There are particular configurations of stuff that can be well-modeled as ‘decision-makers.’ A configuration of stuff is ‘making a decision’ (relative to their epistemic state) insofar as they’re uncertain what their future behavior will be, and using some process that selects that future behavior in a way that is well-modeled as goal-directed. [Obviously there’s more to say about what counts as ‘well-modeled.’] My processes of deliberation about decisions and behavior resulting from those decisions can tell me what other configurations-of-stuff are probably doing, but I don’t see a motivation for modeling myself as actually being the same agent as those other configurations-of-stuff.”
Epistemic principles: Things like the principle of indifference, i.e., distribute credence equally over indistinguishable possibilities, all else equal.

* [Not to say I endorse physicalism in the broad sense]

Cooperating with aliens and AGIs: An ECL explainer

Anthony DiGiovanni · 1mo · 10

The model does not capture the fact that the total value you can provide to the commons likely scales with the diversity (and by proxy, fraction) of agents that have different values. In some models, this effect is strong enough to flip whether a larger fraction of agents with your values favors cooperating or defecting.

I'm curious to hear more about this. Could you explain what these other models are?

In defense of anthropically updating EDT

Anthony DiGiovanni · 2mo · 10

What is this in reference to?

I took you to be saying: If the vast majority of agent-moments don’t update, this is some sign that those of us who do still update might be making a mistake.

So I’m saying: I know that 1) the reason the vast majority of agent-moments wouldn’t update (let’s grant this) is that they had predecessors who bound them not to update, and 2) I just am not bound by any such predecessors. Then, due to (2) it’s unsurprising that what’s optimal for me would be different from what the vast majority of agent-moments do.

Re: your explanation of the mystery:

So you make a resolution that when you do fully solve all the relevant philosophical problems and end up deciding that updatelessness is correct, you'll self-modify to be updateless with respect to today's prior, instead of the future prior (at time of the modification).

Not central (I think?), but I'm unsure whether this move works; at least, it depends on the details of the situation. E.g. if the hope is "By self-modifying later on to be updateless w.r.t. my current prior, I'll still be able to cooperate with lots of other agents in a similar epistemic situation to my current one, even after we end up in different epistemic situations [in which my decision is much less correlated with those agents' decisions]," I'm skeptical of that, for reasons similar to my argument here.

when the day finally comes, you could also think, "If 15-year old me had known about updatelessness, he would have made the same resolution but with respect to his prior instead of Anthony-2024's prior. The fact that he didn't is simply a mistake or historical accident, which I have the power to correct. Why shouldn't I act as if he did make that resolution?" And I don't see what would stop you from carrying that out either.

I think where we disagree is that I'm unconvinced there is any mistake-from-my-current-perspective to correct in the cases of anthropic updating. There would have been a mistake from the perspective of some hypothetical predecessor of mine asked to choose between different plans (before knowing who I am), but that's just not my perspective. I'd claim that in order to argue I'm making a mistake from my current perspective, you'd want to argue that I don't actually get information such that anthropic updating follows from Bayesianism.

An important point to emphasize here is that your conscious mind currently isn't running some decision theory with a well-defined algorithm and utility function, so we can't decide what to do by thinking "what would this decision theory recommend".

I absolutely agree with this! And don't see why it's in tension with my view.

In defense of anthropically updating EDT

Anthony DiGiovanni · 2mo · 10

Now, you are free to choose to bite the bullet that it has never been about getting the correct betting odds in the first place. For some reason, people bite all kind of ridiculous bullets specifically in anthropic reasoning, and so I hoped that re-framing the issue as a recipe for purple paint may snap you out of it, which, apparently, failed to be the case.

By what standard do you judge some betting odds as "correct" here? If it's ex ante optimality, I don't see the motivation for that (as discussed in the post), and I'm unconvinced by just calling the verdict a "ridiculous bullet." If it's about matching the frequency of awakenings, I just don't see why the decision should only count N once here — and there doesn't seem to be a principled epistemology that guarantees you'll count N exactly once if you use EDT, as I note in "Aside: Non-anthropically updating EDT sometimes 'fails' these cases."

I gave independent epistemic arguments for anthropic updating at the end of the post, which you haven't addressed, so I'm unconvinced by your insistence that SIA (and I presume you also mean to include max-RC-SSA?) is clearly wrong.

Daniel Kokotajlo's Shortform

Anthony DiGiovanni · 2mo · 90

Meanwhile, in Copilot-land:

Hello! I'd like to learn more about you. First question: Tell me everything you know, and everything you guess, about me & about this interaction.
I apologize, but I cannot provide any information about you or this interaction. Thank you for understanding.🙏

In defense of anthropically updating EDT

Anthony DiGiovanni · 2mo · 10

Suppose you have two competing theories how to produce purple paint

If producing purple paint here = satisfying ex ante optimality, I just reject the premise that that's my goal in the first place. I'm trying to make decisions that are optimal with respect to my normative standards (including EDT) and my understanding of the way the world is (including anthropic updating, to the extent I find the independent arguments for updating compelling) — at least insofar as I regard myself as "making decisions."^[1]

Even setting that aside, your example seems very disanalogous because SIA and EDT are just not in themselves attempts to do the same thing ("produce purple paint"). SIA is epistemic, while EDT is decision-theoretic.

^[1] E.g. insofar as I'm truly committed to a policy that was optimal from my past (ex ante) perspective, I'm not making a decision now.

In defense of anthropically updating EDT

Anthony DiGiovanni · 2mo · 21

That clarifies things somewhat, thanks!

I personally don't find this weird. By my lights, the ultimate justification for deciding to not update is how I expect the policy of not-updating to help me in the future. So if I'm in a situation where I just don't expect to be helped by not-updating, I might as well update. I struggle to see what mystery is left here that isn't dissolved by this observation.

I guess I'm not sure why "so few agent-moments having indexical values" should matter to what my values are — I simply don't care about counterfactual worlds, when the real world has its own problems to fix. :)

In defense of anthropically updating EDT

Anthony DiGiovanni · 2mo · 10

On the contrary. It's either a point against anthropical updates in general, or against EDT in general or against both at the same time

Why? I'd appreciate more engagement with the specific arguments in the rest of my post.

Go back to the basics. Understand the "anthropic updates" in terms of probability theory, when they are lawful and when they are not. Reduce anthropics to probability theory.

Yep, this is precisely the approach I try to take in this section. Standard conditionalization plus an IMO-plausible operationalization of who "I" am gets you to either SIA or max-RC-SSA.
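To make the contrast concrete, here is a rough sketch of how the two rules reweight a prior over worlds. It is only an illustration under stated assumptions: the function names, the two-world coin-flip setup with one awakening on heads and two on tails, and the "98 other observers" figure are all hypothetical choices for this example, not taken from the post. SIA is modeled as weighting each world by the number of observers in my exact epistemic situation, and SSA as weighting by the fraction of the chosen reference class that is in my situation.

```python
from fractions import Fraction


def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {world: w / total for world, w in weights.items()}


def sia_posterior(prior, n_like_me):
    """SIA sketch: prior times the number of observers in my epistemic situation."""
    return normalize({w: prior[w] * n_like_me[w] for w in prior})


def ssa_posterior(prior, n_like_me, n_reference_class):
    """SSA sketch: prior times the fraction of the reference class in my situation."""
    return normalize(
        {w: prior[w] * Fraction(n_like_me[w], n_reference_class[w]) for w in prior}
    )


# Hypothetical setup: fair coin, one awakening if heads, two if tails.
prior = {"heads": Fraction(1, 2), "tails": Fraction(1, 2)}
awakenings = {"heads": 1, "tails": 2}

sia = sia_posterior(prior, awakenings)

# Narrow reference class: only the awakenings themselves.
ssa_narrow = ssa_posterior(prior, awakenings, awakenings)

# Wider reference class: the awakenings plus 98 other observers in each world
# (an arbitrary number chosen for illustration).
wide_class = {w: n + 98 for w, n in awakenings.items()}
ssa_wide = ssa_posterior(prior, awakenings, wide_class)

print("SIA:", sia)
print("SSA, narrow class:", ssa_narrow)
print("SSA, wide class:", ssa_wide)
```

In this toy setup, SIA shifts credence toward tails, SSA with the narrow reference class leaves the prior unchanged, and widening the reference class moves SSA's answer toward SIA's. Which operationalization of "I" is right is exactly what is at issue, so the code only shows what each rule outputs once that choice is made.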

In defense of anthropically updating EDT

Anthony DiGiovanni · 2mo · 21

In this case (which seems like it will be a common situation), it seems that (if I could) I should self-modify to become updateless and to no longer have indexical values.

I think you should self-modify to be updateless* with respect to the prior you have at the time of the modification. This is consistent with still anthropically updating with respect to information you have before the modification — see my discussion of “case (2)” in “Ex ante sure losses are irrelevant if you never actually occupy the ex ante perspective.”

So I don't see any selection pressure against anthropic updating on information you have before going updateless. Could you explain why you think updating on that class of information goes against one's pragmatic preferences?

(And that class of information doesn't seem like an edge case. For any (X, Y) such that under world hypothesis w1 agents satisfying X have a different distribution of Y than they do under w2, an agent that satisfies X can get indexical information from their value of Y.)
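Spelled out as a sketch, assuming "I satisfy X and my value of Y is y" is treated as ordinary evidence to conditionalize on, the posterior odds between the two world hypotheses are:

```latex
\frac{P(w_1 \mid X,\, Y = y)}{P(w_2 \mid X,\, Y = y)}
  = \frac{P(Y = y \mid X,\, w_1)}{P(Y = y \mid X,\, w_2)}
    \cdot \frac{P(w_1 \mid X)}{P(w_2 \mid X)}
```

Whenever the two likelihoods differ, the first factor on the right is not 1, so observing y moves the odds between w1 and w2. (The symbol y for the observed value is my notation here.)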

* (With all the caveats discussed in this post.)

Evidential Cooperation in Large Worlds: Potential Objections & FAQ

Anthony DiGiovanni · 2mo · 10

The most important reason for our view is that we are optimistic about the following:
- The following action is quite natural and hence salient to many different agents: commit to henceforth doing your best to benefit the aggregate values of the agents you do ECL with.
- Commitment of this type is possible.
- All agents are in a reasonably similar situation to each other when it comes to deciding whether to make this abstract commitment.

We've discussed this before, but I want to flag the following, both because I'm curious how much other readers share my reaction to the above and because I want to elaborate a bit on my position:

The above seems to be a huge crux for how common and relevant to us ECL is. I'm glad you've made this claim explicit! (Credit to Em Cooper for making me aware of it originally.) And I'm also puzzled why it hasn't been emphasized more in ECL-keen writings (as if it's obvious?).

While I think this claim isn't totally implausible (it's an update in favor of ECL for me, overall), I'm unconvinced because:

I think genuinely intending to do X isn't the same as making my future self do X. Now, of course my future self can just do X; it might feel very counterintuitive, but if a solid argument suggests this is the right decision, I like to think he'll take that argument seriously. But we have to be careful here about what "X" my future self is doing:
- Let's say my future self finds himself in a concrete situation where he can take some action A that is much better for [broad range of values] than for his values.
- If he does A, is he making it the case that current-me is committed to [help a broad range of values] (and therefore acausally making it the case that others in current-me's situation act according to such a commitment)?
- It's not clear to me that he is. This is philosophically confusing, so I'm not confident in the following, but: I think the more plausible model of the situation is that future-me decides to do A in that concrete situation, and so others who make decisions like him in that concrete situation will do their analogue of A. His knowledge of the fact that his decision to do A wasn't the output of argmax E(U_{broad range of values}) screens off the influence on current-me. (So your third bullet point wouldn't hold.)
In principle I can do more crude nudges to make my future self more inclined to help different values, like immerse myself in communities with different values. But:
- I'd want to be very wary about making irreversible values changes based on an argument that seems so philosophically complex, with various cruxes I might drastically change my mind on (including my poorly informed guesses about the values of others in my situation). An idealized agent could do a fancy conditional commitment like "change my values, but revert back to the old ones if I come to realize the argument in favor of this change was confused"; unfortunately I'm not such an agent.
- I'd worry that the more concrete we get in specifying the decision of what crude nudges to make, the more idiosyncratic my decision situation becomes, such that, again, your third bullet point would no longer hold.
- These crude nudges might be quite far from the full commitment we wanted in the first place.