Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

(Epistemic status: I think this is right?)

Alice is the CEO of ArmaggAI, and Bob is the CEO of BigModelsAI, two major AI capabilities organizations. They're racing to be the first to build a superintelligence aligned to their respective CEV which would take over the universe and satisfy their values.

Alice would prefer to slow down so she has a lot more time to make sure that what her company is building will be aligned; but she's worried about Bob's company taking advantage of that and racing themselves, resulting in Bob's utility function being {what the lightcone is filled with} instead of Alice's; so she feels like she has no choice but to race, to maximize her own utility. Likewise vice-versa.

This state of affairs is far from the Pareto frontier of {their utility functions, each weighed by how likely they are to be the one to build an aligned superintelligence}: it causes a lot more worlds where everyone is dead, instead of either Alice's or Bob's utility function being maximized.

But they don't really get all this LDT business; it seems complicated to them. If only they had a convenient lesswrong post spelling out exactly what they should do, in clear easy-to-follow terms!

The procedure they should implement is as follows. Alice should:

  • In worlds where she does launch a superintelligence aligned to itself, and it takes over the world and builds an Alice::utopia, she should tell her AI:

    Actually, don't just maximize my own utility function. Maximize a (threat-resistant) mixture of my utility function and Bob's, weighed by what ratio of timelines you think we each launched an aligned superintelligence in, but also weighed by the probability that both me and Bob implement this procedure.

  • Before that, she should slow down their company's capabilities progress.

And likewise for Bob.
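As a rough sketch of what that weighting could compute (the function name and all numbers below are my own illustrative assumptions, not anything specified in the post):

```python
# Hypothetical sketch of the mixture weights Alice's superintelligence
# might compute in her win-worlds. Purely illustrative.

def mixture_weights(p_alice_wins: float, p_bob_wins: float,
                    p_both_follow_procedure: float) -> tuple[float, float]:
    """Return (weight on Alice's utility, weight on Bob's utility).

    Bob's share is proportional to the fraction of aligned-launch
    timelines in which *he* is the winner, discounted by the probability
    that both parties actually implement the procedure.
    """
    total = p_alice_wins + p_bob_wins
    bob_share = (p_bob_wins / total) * p_both_follow_procedure
    return 1.0 - bob_share, bob_share

# E.g.: Bob launches an aligned ASI in 40% of the timelines where either
# of them succeeds, and the ASI is 90% confident both would follow the
# procedure.
w_alice, w_bob = mixture_weights(0.6, 0.4, 0.9)  # roughly (0.64, 0.36)
```

(The "threat-resistant" qualifier is doing real work in the post and isn't modeled here; this only shows the shape of the weighting.)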

Now, their respective aligned superintelligences aren't omniscient; it could be that they over-estimate or under-estimate how likely Bob or Alice would've been to implement this procedure, for example. But one thing's pretty much for sure: neither Alice nor Bob are able to fool the future-superintelligence into thinking that they'd implement this procedure, when actually they wouldn't. Because it's superintelligent.

If Alice wins, she knows Bob will follow the procedure because her superintelligence can tell (better than Bob can fake). And Bob doesn't have to count on wishful thinking to know that Alice would indeed do this instead of defecting, because in worlds where he wins, he can ask his superintelligence whether Alice would implement this procedure. They're each kept in check by the other's future-winning-self, and they can each rely on this being superintelligently-checked by their respective future selves.

So the only way Alice has to get some of her utility maximized in worlds where Bob wins, is to actually behave like this, including before either has launched a superintelligence. And likewise for Bob.

Their incentive gradient is in the direction of being more likely to follow this procedure, including slowing down their capabilities progress — and thus decreasing the amount of worlds where their AI is unaligned and everyone dies forever.

In the real world, there are still Bobs and Alices who don't implement this procedure, but that's mostly because they don't know/understand that if they did, they would gain more utility. In many cases, it should suffice for them to be informed that this is indeed where their utility lies.

Once someone has demonstrated that they understand how LDT applies here, and that they're generally rational, then they should understand that implementing this protocol (including slowing down AI capabilities) is what maximizes their utility, and so you can count on them cooperating.


Now, in actuality, this is not quite the full generality of how LDT applies here. What each of them should actually tell their superintelligence if they win is simpler:

Maximize a mixture of my utility function and Bob's (and anyone else who might be in a position to build superintelligence), weighed in whatever way creates a (non-threatening) incentive gradient which maximizes my utility, including the utility I have for reducing the amount of worlds in which everyone dies.

Or even more simply:

Maximize my utility function (as per LDT).

But I think it's neat to have a clearer idea how that actually shakes out.

There's no weird acausal magic going on here. By racing for AI, Bob would slightly increase the chance that he's the one to launch the aligned superintelligence that takes over the world, but he'd be causing more dead worlds in total, and losing the utility he would otherwise gain in worlds where Alice wins — ending up with less utility overall.

If either of them is somewhat negative utilitarian, racing is even worse: all those dead worlds where they launch an unaligned superintelligence leave remote alien baby-eaters free to eat babies, whereas if they increase the total number of worlds where either of them gets an aligned superintelligence, then that aligned superintelligence can pay a bunch of its lightcone in exchange for the aliens eating fewer babies. This is not a threat; we're never pessimizing the aliens' utility function. We're simply offering them a bunch of realityfluid/negentropy right here, in exchange for them focusing more on a subset of their values which doesn't contain lots of what Alice and Bob would consider suffering — the aliens can only be strictly better-off than if we didn't engage with them.


Now, this isn't completely foolproof. If Alice is very certain that her own superintelligence will indeed be aligned when it's launched no matter how fast she goes, then she has no incentive to slow down — in her model, Bob doesn't have much to offer to her.

But should she really have that confidence, when a bunch of qualified alignment researchers keep telling her that she might be wrong?

She should make very sure that her high confidence is actually warranted, and that she's implementing rationality correctly in general.


Oh, and needless to say, people who are neither Alice nor Bob also have a bunch of utility to gain by taking actions which reduce the total number of dead worlds by forcing both of their companies to slow down (e.g. through regulation).

When we have this much utility in common (not wanting to die), it's really really dumb to "defect". Unlike in the prisoner's dilemma, this "defection" doesn't get you more utility, it gets you less. This is not a zero-sum game at all. If you think it is, if you think your preferred future is the exact opposite of your opponent's preferred future, then you're probably making a giant reasoning mistake.

Whether your utility function is focused on creating nice things or on reducing suffering ("positively-caring" and "negatively-caring"), slowing down AI progress in order to have a better chance of alignment is probably what serves your utility function best.


I agree with the point that we shouldn’t model the AI situation as a zero-sum game. And the kinds of conditional commitments you write about could help with cooperation. But I don’t buy the claim that "implementing this protocol (including slowing down AI capabilities) is what maximizes their utility."

Here's a pedantic toy model of the situation, so that we're on the same page: The value of the whole lightcone going towards an agent’s values has utility 1 by that agent’s lights (and 0 by the other’s), and P(alignment success by someone) = 0 if both speed up, else 1. For each of the alignment success scenarios i, the winner chooses a fraction of the lightcone to give to Alice's values (xi^A for Alice's choice, xi^B for Bob's). Then, some random numbers for expected payoffs (assuming the players agree on the probabilities):

  • Payoffs for Alice and Bob if they both speed up capabilities: (0, 0)
  • Payoffs if Alice speeds, Bob doesn’t: 0.8 * (x1^A, 1 - x1^A) + 0.2 * (x1^B, 1 - x1^B)
  • Payoffs if Bob speeds, Alice doesn’t: 0.2 * (x2^A, 1 - x2^A) + 0.8 * (x2^B, 1 - x2^B)
  • Payoffs if neither speeds: 0.5 * (x3^A, 1 - x3^A) + 0.5 * (x3^B, 1 - x3^B)

So given this model, it seems that you're saying Bob has an incentive to slow down capabilities because Alice's ASI successor can condition the allocation to Bob's values on his decision. Which we can model as Bob expecting Alice to use the strategy {don't speed; x2^A = 1; x3^A = 0.5} (given she doesn't speed up, she only rewards Bob's values if Bob didn't speed up).
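For concreteness, the four cells of this toy game can be sketched as follows (the probabilities are the illustrative numbers above; the function and variable names are my own):

```python
# Sketch of the toy model. x_A / x_B is the fraction of the lightcone the
# winner allocates to Alice's values, as chosen by Alice or Bob
# respectively (the winner's own choice is the one that applies).

def payoffs(alice_speeds: bool, bob_speeds: bool,
            x_A: float, x_B: float) -> tuple[float, float]:
    """Expected (Alice, Bob) payoffs for one cell of the game."""
    if alice_speeds and bob_speeds:
        return (0.0, 0.0)          # P(alignment success) = 0: everyone dies
    if alice_speeds:               # Alice speeds up, Bob doesn't
        p_alice_wins = 0.8
    elif bob_speeds:               # Bob speeds up, Alice doesn't
        p_alice_wins = 0.2
    else:                          # neither speeds up
        p_alice_wins = 0.5
    u_alice = p_alice_wins * x_A + (1 - p_alice_wins) * x_B
    return (u_alice, 1.0 - u_alice)

# Under Alice's conjectured strategy {don't speed; x2^A = 1; x3^A = 0.5}:
both_slow   = payoffs(False, False, 0.5, 0.5)  # fair split: (0.5, 0.5)
bob_defects = payoffs(False, True, 1.0, 0.0)   # Bob speeds and, if he wins,
                                               # keeps everything
```

Evaluating cells like this makes the incentive question concrete: whether slowing down is actually best for Bob depends on what allocations he expects Alice (and his own successors) to choose, which is exactly the equilibrium-selection issue below.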

Why would Bob so confidently expect this strategy? You write:

And Bob doesn't have to count on wishful thinking to know that Alice would indeed do this instead of defecting, because in worlds where he wins, he can ask his superintelligence whether Alice would implement this procedure.

I guess the claim is just that them both using this procedure is a Nash equilibrium? If so, I see several problems with this:

  1. There are more Pareto-efficient equilibria than just “[fairly] cooperate” here. Alice could just as well expect Bob to be content with getting expected utility 0.2 from the outcome where he slows down and Alice speeds up — better that than the utility 0 from extinction, after all. Alice might think she can make it credible to Bob that she won’t back down from speeding up capabilities, and vice versa, such that they both end up pursuing incompatible demands. (See, e.g., “miscoordination” here.)
  2. You’re lumping “(a) slow down capabilities and (b) tell your AI to adopt a compromise utility function” into one procedure. I guess the idea is that, ideally, the winner of the race could have their AI check whether the loser was committed to do both (a) and (b). But realistically it seems implausible to me that Alice or Bob can commit to (b) before winning the race, i.e., that what they do in the time before they win the race determines whether they’ll do (b). They can certainly tell themselves they intend to do (b), but that’s cheap talk.

    So it seems Alice would likely think, "If I follow the whole procedure, Bob will cooperate with my values if I lose. But even if I slow down (do (a)), I don't know if my future self [or, maybe more realistically, the other successors who might take power] will do (b) — indeed once they're in that position, they'll have no incentive to do (b). So slowing down isn't clearly better." (I do think, setting aside the bargaining problem in (1), she has an incentive to try to make it more likely that her successors follow (b), to be clear.)

Unless I'm missing something, this seems to disregard the possibility of deception. Or it handwaves deception away in a line or two.

The type of person who ends up as the CEO of a leading AI company is likely (imo) someone very experienced in deception and manipulation, at the very least through experiencing others trying it on them, even if by some ridiculously unlikely chance they haven't used deception to gain power themselves.

A clever, seemingly logically sound argument for them to slow down and trust that their competitor will also slow down because of the argument, will ring all kinds of bells.

I think whistleblower protections, licenses, enforceable charters, mandatory 3rd party safety evals, etc have a much higher chance of working.

Ok, but why not just the Coherent Extrapolated Volition of humanity? (One of Yud's better concepts.) A norm where all parties agree to that seems easier to get people to sign up to, also. That by definition includes much of your values; sure, there is some incentive to defect, but not much, it would seem.

CEV of humanity is certainly desirable! If you CEV me, I in turn implement some kind of CEV-of-humanity in a way that doesn't particularly privilege myself. But that's downstream of one's values and of decision theory.

Your goal as an agent is to maximize your utility function — and it just so happens that your utility function, as you endorse it in CEV, consists of maximizing everyone's CEV in some way.

Think not "cosmopolitanism vs my-utility-function" but "cosmopolitanism, as entailed by my utility function".

(see also my post surprise! you want what you want)

TAG:

If egoism is incoherent, and altruism coherent, I suppose that would follow... but it's a big if. Where is it proven?

Oh, egoism is totally coherent. I'm just saying that your values can be egoist, or they can be cosmopolitan, or a mixture of the two. But (a version of) cosmopolitanism is part of the contents of a person's values, not a standalone objective thing.

TAG:

How does that help in practice?

I'm not sure what you mean? I'm just describing what those concepts are and how I think they fit together in the territory, not prescribing anything.

TAG:

That's contractualism. CEV is supposedly something else.

You can do this with less superintelligence and less LDT: Before you eat the light cone, build a secret decisive strategic advantage, wait for the next AI to come along, and watch what it decides to self-modify into.

At a glance, I think this works, and it's a neat approach. I have doubts, though.

The impossibility of explaining the theory behind this to random SV CEOs or military leaders... is not one of them. Human culture has always contained many shards of LDT-style thinking, the concept of "honour" chief among them. To explain it, you don't actually need to front-load exposition about decision theories and acausal trade; you can just re-use said LDT-shards, and feed them the idea in an immediately intuitive format.

I'm not entirely sure what that would look like, and it's not a trivial re-framing problem. But I think it's very doable with some crafty memetic engineering.

My first concern is that we might not actually have the time. While proliferating this idea (in the sense of "This Is What You Do If You Have AGI") is doable, that'd still take some time. You'd need to split it into a set of five-word messages, and parcel them out over the years. I think you'd said your timeline is 0-5 years; that's IMO definitely not enough.

Based on the latest developments (Gemini, what Q* is supposed to be, both underwhelming), I think we have a fair bit longer (like, a decade-ish). Might be enough if we start yesterday.

My second concern is... more vague, but I feel like this is still being too optimistic about the human nature. Sure, maybe it'd work for the current crop of major-AI-Lab CEOs. But in a lot of situations (e. g., acute xenophobia), I think the preference ordering goes "I win" > "they lose" > "a compromise", such that they would prefer an all-or-nothing gamble to a measured split of the gains. (Like, it's almost Copenhagen Ethics-ish? It feels utterly repulsive to contribute to your enemy's happiness, such that you'd rather either eradicate them or be eradicated, no matter how self-destructive that is?)

At that point, I may be being too cynical, though. I also might feel differently if I were staring at the version of this pitch already re-framed into intuitive terms.

Due to my timelines being this short, I'm hopeful that convincing just "the current crop of major-AI-Lab CEOs" might actually be enough to buy us the bulk of time that something like this could buy.

I think this has a fixed-point selection problem: if one or both of them start with a different prior under which the other player punishes them for not racing / doesn't reward them enough (maybe because they have very little faith in the other's rationality, or because they think it's not within their power to decide that, and also there's not enough evidential correlation in their decisions), then they'll race.

Of course, considerations about whether the other player normatively endorses something LDT-like also enter the picture. And even if individual humans would endorse it (and that's already a medium-big if), I worry our usual decision structures (for example in AI labs) don't incentivize it (and what's the probability some convincing decision theorist cuts through them? not sure).