Even if we lose, we win

Morphism

Epistemic status: Updating on this comment and taking into account uncertainty about my own values, my credence in this post is around 50%.

TLDR: Even in worlds where we create an unaligned AGI, it will cooperate acausally with counterfactual FAIs—and spend some percentage of its resources pursuing human values—as long as its utility function is concave in resources spent. The amount of resources UAI spends on humans will be roughly proportional to the measure of worlds with aligned AGI, so this does not change the fact that we should be working on alignment.

Assumptions

Our utility function is concave in resources spent; e.g. we would prefer a 100% chance of 50% of the universe turning into utopia to a 50% chance of 100% of the universe turning into utopia, assuming that the rest of the universe is going to be filled with something we don't really care about, like paperclips.
There is a greater total measure of worlds containing aligned AGI with concave utility than worlds containing anti-aligned AGI with concave utility, anti-aligned meaning it wants things that directly oppose our values, like suffering of sentient beings or destruction of knowledge.
AGIs will be able to predict counterfactual other AGIs well enough for acausal cooperation.
(Added in response to these comments) AGIs will use a decision theory that allows for acausal cooperation between Everett Branches (like LDT).

Acausal cooperation between AGIs

Let's say two agents, Alice and Bob, have utilities that are concave in money. For concreteness, say that each agent's utility is the logarithm of their bank account balance, and they each start with $10. They are playing a game where a fair coin is flipped. If it lands heads, Alice gets $10, and if it lands tails, Bob gets $10. Each agent's expected utility here is

$$\frac{1}{2} \log(10) + \frac{1}{2} \log(20) = \frac{1}{2} \log(200).$$

If, instead, Alice and Bob both just receive $5, their expected utility is

$$\log(15) = \frac{1}{2} \log(225) > \frac{1}{2} \log(200).$$

Therefore, in the coin flip game, it is in both agents' best interests to agree beforehand to have the winner pay the loser $5. If Alice and Bob know each other's source code, they can cooperate acausally on this after the coin is flipped, using Löbian Cooperation or similar. This is of course not the only Pareto-optimal way to cooperate, but it is clearly at the Galactic Schelling Point of fairness, and both agents will have policies that disincentivize "unfair" agreements, even if they are still on the Pareto frontier (e.g. Alice pays Bob $6 if she wins, and Bob pays Alice $4 if he wins). If the coin is weighted, a reasonable Galactic Schelling Point would be for Alice to pay Bob in proportion to Bob's probability of winning, and vice-versa, so that both agents end up always getting what would have been their expected amount of money (but more expected utility because of concavity).
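The Alice and Bob arithmetic, including the weighted-coin variant, can be checked with a short sketch. The function name `expected_log_utility` and its parameters are illustrative assumptions, not anything from the post. It only encodes the rule described above: the winner pays the loser the loser's probability of winning times the prize.

```python
import math

def expected_log_utility(p_alice_wins, start=10.0, prize=10.0, share=False):
    """Return (Alice's, Bob's) expected log-utility in the coin game.

    If share is True, the winner pays the loser (loser's win probability) * prize,
    so each agent always ends up with their expected amount of money.
    """
    p_a, p_b = p_alice_wins, 1.0 - p_alice_wins
    if share:
        # Each agent's final balance is the same whichever way the coin lands.
        return math.log(start + p_a * prize), math.log(start + p_b * prize)
    # No agreement: the winner keeps the whole prize.
    alice = p_a * math.log(start + prize) + p_b * math.log(start)
    bob = p_b * math.log(start + prize) + p_a * math.log(start)
    return alice, bob

for p in (0.5, 0.8):
    print(p, expected_log_utility(p), expected_log_utility(p, share=True))
```

For any coin weighting strictly between 0 and 1, both agents do better under the agreement. This is Jensen's inequality applied to the concave logarithm.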

Now, let's say we have a 10% chance of solving alignment before we build AGI. For simplicity, we'll say that in the other 90% of worlds we get a paperclip maximizer whose utility function is logarithmic in paperclips. Then, in 10% of worlds, FAI, maximizing our CEV as per LDT, will reason about the other 90% of worlds where Clippy is built. Likewise, Clippy will reason about the counterfactual FAI worlds. By thinking about each other's source code, FAI and Clippy will be able to cooperate acausally like Alice and Bob, each turning their future lightcone into 10% utopia, 90% paperclips. Therefore, we get utopia either way! :D

Note that, even though we get utopia in Clippy's worlds, we still want to maximize the probability of solving alignment, since the amount of the world that gets CEV-optimized is proportional to that probability.
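As a rough sketch of why both sides gain from this trade, the snippet below compares expected utility with and without the split. It assumes a square-root utility over the fraction of a lightcone an agent controls. That choice is an illustrative stand-in: the post's logarithm is undefined at zero resources, and any concave function gives the same qualitative result. The names `utility` and `expected_utilities` are hypothetical.

```python
import math

def utility(share):
    # Assumed concave utility over the fraction of a lightcone an agent gets.
    return math.sqrt(share)

def expected_utilities(p_fai, trade):
    """Expected utility for (humanity, Clippy) across worlds.

    p_fai is the measure of worlds where alignment is solved. Without trade,
    whoever wins a world takes all of it. With trade, every world is split
    into p_fai utopia and (1 - p_fai) paperclips.
    """
    p_clippy = 1.0 - p_fai
    if trade:
        return utility(p_fai), utility(p_clippy)
    humanity = p_fai * utility(1.0) + p_clippy * utility(0.0)
    clippy = p_clippy * utility(1.0) + p_fai * utility(0.0)
    return humanity, clippy

print(expected_utilities(0.1, trade=False))
print(expected_utilities(0.1, trade=True))
```

Under this assumed utility, humanity's expected utility still rises with `p_fai` when trading, which matches the point that alignment work remains worthwhile.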

This means we have a new secondary target: If the AGI we build doesn't have our CEV as its utility function, we want its utility function to be concave (like ours is). P(concave-AGI) determines how many worlds we get a sliver of; P(FAI|concave-AGI) determines how big that sliver is. There's a compelling argument that the former is actually more important, and therefore we should be focusing on creating concave-AGI. Also, if P(FAI) is large enough, the FAI might have enough realityfluid to trade that, in worlds where a concave-UAI is created, the UAI considers it worth it to avoid killing currently-alive humans, or at least to brain scan us all before killing us so we can be revived in sliver-utopia. So, if Clippy might save us instead of killing us, maybe it's actually good to accelerate AI capabilities??? (I am far less confident in this claim than I am in any of the other claims in this post. And even if this is true, alignment is probably the better thing to work on anyway because of neglectedness.)

What about convex utilities?

If an AGI has a convex utility function, it will be risk-seeking rather than risk-averse, i.e. if we hold constant the expected amount of resources it has control over, it will want to increase the variance of that, rather than decrease it. Fix $x$ as the expected amount of resources. If $s$ is the total amount of resources in one universe, the highest possible variance is attained by the distribution that the AGI already gets: a $\frac{x}{s}$ chance of getting the entire universe and a $1 - \frac{x}{s}$ chance of getting nothing. Therefore, a UAI with convex utility will not want to cooperate acausally at all.
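A small numerical sketch of the risk-seeking claim follows. The utilities `r ** 2` (convex) and `sqrt(r)` (concave), the normalization `s = 1`, and the value `x = 0.9` are all assumptions chosen for illustration. The all-or-nothing lottery gives the whole universe with probability `x / s` and nothing otherwise.

```python
import math

def lottery_stats(outcomes):
    """Mean and variance of a list of (probability, resources) pairs."""
    mean = sum(p * r for p, r in outcomes)
    var = sum(p * (r - mean) ** 2 for p, r in outcomes)
    return mean, var

def expected_utility(outcomes, u):
    return sum(p * u(r) for p, r in outcomes)

s = 1.0  # total resources in one universe (normalized, assumed)
x = 0.9  # the AGI's expected resources (assumed)

all_or_nothing = [(x / s, s), (1 - x / s, 0.0)]  # the no-trade outcome
guaranteed_share = [(1.0, x)]                    # the acausal-trade outcome

convex = lambda r: r ** 2
concave = math.sqrt

print(lottery_stats(all_or_nothing), lottery_stats(guaranteed_share))
print(expected_utility(all_or_nothing, convex), expected_utility(guaranteed_share, convex))
print(expected_utility(all_or_nothing, concave), expected_utility(guaranteed_share, concave))
```

Both lotteries have the same mean, but the convex agent prefers the high-variance one and the concave agent prefers the guaranteed share.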

Notes:

The above only holds with AGIs that are indifferent to our values, and whose values we are indifferent to, i.e. we don't care how many paperclips there are and Clippy doesn't care how much utopia there is. I might make another post discussing more complex scenarios with AGIs whose values are correlated or anti-correlated with ours on some dimensions, since, even though I consider such AGIs to be quite unlikely, their possibility is decision-relevant to those with suffering-focused ethics.
This is not Evidential Cooperation in Large Worlds. ECL is when Alice and Bob don't have access to each other's source code, and just give the $5 based on the assumption that they are both running the same algorithm. This reasoning is valid if Alice and Bob are exact clones in the exact same situation, but its validity diminishes as you deviate from that idealized scenario. I might make another post on why I think ECL doesn't actually work in practice.
But isn't Alice and Bob having full access to each other's source code also an idealized scenario? Yes, it is, but I think real life is close enough to this ideal that acausal cooperation is still possible.
The above only holds when our uncertainty over whether or not we solve the alignment problem is environmental rather than logical. We might think we have a 10% chance of solving alignment, but Clippy looking back on the past might find that, because we were drawn to the same stupid alignment plans in almost all timelines, our probability of success was more like $10^{- 25}$ , low enough that even giving us a solar system isn't worth it. Therefore, it might be a good idea for all (or at least a large group of) alignment researchers to coordinate around pursuing the same specific alignment plan based on the result of a quantum RNG, or something like that.
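The last note can be made concrete with a toy model. Its assumptions: whether each candidate plan works is a fixed logical fact shared by all branches, and a quantum RNG spreads equal measure over the plans. The function `success_measure` and the list `plan_works` are hypothetical names used only for illustration.

```python
def success_measure(plan_works, randomize):
    """Fraction of branches (by measure) in which alignment is solved.

    plan_works[i] is a logical fact: whether plan i would succeed. It is the
    same in every branch. Index 0 is the plan researchers favor by default.
    """
    if randomize:
        # A quantum RNG sends equal measure to each candidate plan.
        return sum(plan_works) / len(plan_works)
    # Every branch is drawn to the same plan, so the result is all-or-nothing.
    return 1.0 if plan_works[0] else 0.0

# Ten candidate plans; in this made-up scenario only the last one works.
plans = [False] * 9 + [True]
print(success_measure(plans, randomize=False))
print(success_measure(plans, randomize=True))
```

In this toy model, randomizing only helps if the default plan fails and some other candidate works. If the default plan is the one that works, randomizing lowers the success measure.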

Edit 01/15: Mixed up concave and convex. Concave=concave down=second derivative is negative=risk-averse (in a utility function)

Given that acausal decision theory and especially acausal game theory are very much unsolved problems, I think we don't really have much idea of what the acausal economy looks like. It seems totally plausible to me that it's a dog-eat-dog world of superintelligences with more sophisticated decision theories competing with each other to exploit civilizations with less sophisticated decision theories, e.g., by winning commitment races. Given this, it's not a slam dunk that giving unaligned AIs better decision theories is a good idea, or that "even if we lose, we win".

By thinking about each other's source code, FAI and Clippy will be able to cooperate acausally like Alice and Bob, each turning their future lightcone into 10% utopia, 90% paperclips. Therefore, we get utopia either way! :D

So even if we lose we win, but even if we win we lose. The amount of utopiastuff is exactly conserved, and launching unaligned AI causes timelines-where-we-win to have less utopia by exactly as much as our timeline has more utopia.

The amount of utopiastuff we get isn't just proportional to how much we solve alignment, it's actually back to exactly equal.

Yes, amount of utopiastuff across all worlds remains constant, or possibly even decreases! But I don't think amount-of-utopiastuff is the thing I want to maximize. I'd love to live in a universe that's 10% utopia and 90% paperclips! I much prefer that to a 90% chance of extinction and a 10% chance of full-utopia. It's like insurance. Expected money goes down, but expected utility goes up.

Decision theory does not imply that we get to have nice things, but (I think) it does imply that we get to hedge our insane all-or-nothing gambles for nice things, and redistribute the nice things across more worlds.
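The insurance analogy above can be checked with made-up numbers. The sketch assumes log utility in wealth, a starting wealth of 100, a 10% chance of losing 90, and a premium of 12 (more than the expected loss of 9). All of these figures are illustrative assumptions.

```python
import math

wealth, loss, p_loss, premium = 100.0, 90.0, 0.1, 12.0

def expectation(outcomes, f=lambda w: w):
    """Expected value of f(wealth) over (probability, wealth) pairs."""
    return sum(p * f(w) for p, w in outcomes)

uninsured = [(1 - p_loss, wealth), (p_loss, wealth - loss)]
insured = [(1.0, wealth - premium)]  # pay the premium, never bear the loss

print("expected money:  ", expectation(uninsured), expectation(insured))
print("expected utility:", expectation(uninsured, math.log), expectation(insured, math.log))
```

Buying the insurance lowers expected money but raises expected log-utility. The analogous trade-off here is a guaranteed 10% utopia against an all-or-nothing gamble.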

strong upvote. this feels like it's heavily trod ground to me, but maybe someone both needs to hear it and will find this convincing. the core takeaway being "...but this doesn't matter because we just need to solve alignment anyway lol" seems to make the whole thing a little moot. Also, I don't super buy that UFAI is likely to have a sane decision theory, given the degree to which people most likely able to create a UFAI don't seem terribly interested in the details of decision theory. it's also only relevant for abstractions of super-agents if utility functions continue to be a valid way to model the world, which is like, still my expectation of what it will turn out the limit of capability is no matter the direction, but probably doesn't hold up according to Really Serious Deep Learning Scientists Named After Fish

I expect ASIs to converge to having a "sane decision theory", since they will realize they can get more of what they want if they self-modify to have a sane one if they don't start out with one.

If you start out with CDT, then the thing you converge to is Son of CDT rather than FDT.
(that arbital page takes a huge amount of time to load for me for some reason, but it does load eventually)

And I could totally see the thing that kills us {being built with} or {happening to crystallize with} CDT rather than FDT.

We have to actually implement/align-the-AI-to the correct decision theory.

I think this is only true if we are giving the AI a formal goal to explicitly maximize, rather than training the AI haphazardly and giving it a clusterfuck of shards. It seems plausible that our FAI would be formal-goal aligned, but it seems like UAI would be more like us unaligned humans—a clusterfuck of shards. Formal-goal AI needs the decision theory "programmed into" its formal goal, but clusterfuck-shard AI will come up with decision theory on its own after it ascends to superintelligence and makes itself coherent. It seems likely that such a UAI would end up implementing LDT, or at least something that allows for acausal trade across the Everett branches.

Point taken about CDT not converging to FDT.

I don't buy that an uncontrolled AI is likely to be CDT-ish though. I expect the agentic part of AIs to learn from examples of human decision making, and there are enough pieces of FDT like voting and virtue in human intuition that I think it will pick up on it by default.

(The same isn't true for human values, since here I expect optimization pressure to rip apart the random scraps of human value it starts out with into unrecognizable form. But a piece of a good decision theory is beneficial on reflection, and so will remain in some form.)

(ETA: Sorry, upon reviewing the whole thread, I think I misinterpreted your comment and thus the following reply is probably off point.)

We have to actually implement/align-the-AI-to the correct decision theory.

I think the best way to end up with an AI that has the correct decision theory is to make sure the AI can competently reason philosophically about decision theory and is motivated to follow the conclusions of such reasoning. In other words, it doesn't judge a candidate successor decision theory by its current decision theory (CDT changing into Son-of-CDT), but by "doing philosophy", just like humans do. Because given the slow pace of progress in decision theory, what are the chances that we correctly solve all of the relevant problems before AI takes off?

do you have thoughts on how to encode "doing philosophy" in a way that we would expect to be strongly convergent, such that if implemented on the last ai humans ever control, we can trust the process after disempowerment to continue to be usefully doing philosophy in some nailed down way?

I think we're really far from having a good enough understanding of what "philosophy" is, or what "doing philosophy" consists of, to be able to do that. (Aside from "indirect" methods that pass the buck to simulated humans, that Pi Rogers also mentioned in another reply to you.)

Here is my current best understanding of what philosophy is, so you can have some idea of how far we are from what you're asking.

Maybe some kind of simulated long-reflection type thing like QACI where "doing philosophy" basically becomes "predicting how humans would do philosophy if given lots of time and resources"

That would be a philosophical problem...

Currently, I think this is a big crux in how to "do alignment research at all". Debatably "the biggest" or even "the only real" crux.

(As you can tell, I'm still uncertain about it.)

I think you swapped concave and convex in the text. The logarithm function is a concave function. Bit unfortunate that convex preferences usually means that the average is better which is similar to concave utility functions.

Fixed it! Thanks! It is very confusing that half the time people talk about loss functions and the other half of the time they talk about utility functions

Therefore, it might be a good idea for all (or at least a large group of) alignment researchers to coordinate around pursuing the same specific alignment plan based on the result of a quantum RNG, or something like that.

From this I infer that you think the set of alignment strategies we would use as alternatives to pick by quantum dice is enough to cover much more space than a single one which seems the best by general consensus.

My intuition tells me that if Clippy thought we had a chance, this trick does not really move the needle.

Epistemic status: Updating on this comment and taking into account uncertainty about my own values, my credence in this post is around 50%.

Is this conditional on the "Assumptions" section, or marginal?
