Much of the work for this post was done together with Nuño Sempere.

Perhaps you think that your values will be best served if the AGI you (or your team, company or nation) are developing is deployed first. Would you decide that it's worth cutting a few corners, reducing your safety budget, and pushing ahead to try and get your AI out the door first?

It seems plausible, and worrying, that you might. And if your competitors reason symmetrically, we would get a "safety race to the bottom".

On the other hand, perhaps you think your values will be better served if your enemy wins than if either of you accidentally produces an unfriendly AI. Would you decide the safety costs to improving your chances aren't worth it?

In a simple two player model, you should only shift funds from safety to capabilities if (the relative₁ decrease in chance of friendliness) / (the relative₁ increase in the chance of winning) < (expected relative₂ loss of value if your enemy wins rather than you). Here, the relative₁ increases and decreases are relative to the current values. The relative₂ loss of value is relative to the expected value if you win.

The plan of this post is as follows:

1. Consider a very simple model that leads to a safety race. Identify unrealistic assumptions which are driving its results.
2. Remove some of the unrealistic assumptions and generate a different model. Derive the inequality expressed above.
3. Look at some specific example cases, and see how they affect safety considerations.

## A partly discontinuous model

Let's consider a model with two players with the same amount of resources. Each player's choice is what fraction of their resources to devote to safety, rather than capabilities. Whichever player contributes more to capabilities wins the race. If you win the race, you either get a good outcome or a bad outcome. Your chance of getting a good outcome increases continuously with the amount you spent on safety. If the other player wins, you get a bad outcome. A good outcome is a finite amount better than any bad outcome.

We can envisage each player as choosing a point along the line going from 0 to 1: this is the fraction of their resources they're devoting to capabilities. Therefore, the possible pairs of choices can be visualised as the square with sides of length 1. The payoffs for each player are a surface over this square, with a cliff along the line x = y -- the player switches from definitely getting a bad outcome to possibly getting a good outcome. I've attempted to illustrate one such surface below.

In this picture, all bad outcomes are equally bad (0 payoff). If a player is currently spending enough on capabilities to win, the other player should spend just a little more on capabilities, and so "jump up the cliff" and capture the expected value.

At first, one might think that this behaviour is driven by indifference between unfriendly AI and the other player's friendly AI. Though I won't show it here, this is not the case. If we assume that unfriendly AI is equally bad no matter which player deploys it, and the chance of an AI being unfriendly is just a function of the proportion spent on capabilities vs safety, the same result is found.

If not indifference, what is driving this result? The real driving force is that finite always beats infinitesimal. Because of the discontinuity -- the "cliff" -- an infinitesimal increase in the chance of a finitely worse outcome (unfriendly AI) is more than paid for by a finite increase in the chance of a finitely better outcome (AI aligned with my goals rather than yours).
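To make the cliff concrete, here is a minimal sketch of the discontinuous model. The linear safety function (chance of a good outcome = 1 - x), the coin-flip tie-break and the grid of allowed allocations are all assumptions made for illustration; the model above only requires that safety varies continuously with spending.

```python
def payoff(x, y, good=1.0):
    """Expected payoff for spending fraction x on capabilities against a rival spending y."""
    p_good = 1.0 - x  # assumption: chance of a good outcome falls linearly with x
    if x > y:
        return good * p_good        # you win the race outright
    if x == y:
        return 0.5 * good * p_good  # assumption: ties are settled by a coin flip
    return 0.0                      # rival wins: bad outcome, payoff 0


def best_response(y, steps=100):
    """Best capabilities fraction on a grid of size 1/steps, given the rival's y."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda x: payoff(x, y))


def race(rounds=200):
    """Alternate best responses, starting from zero capabilities spending."""
    y = 0.0
    for _ in range(rounds):
        y = best_response(y)
    return y


if __name__ == "__main__":
    for y in (0.0, 0.3, 0.9):
        print(f"rival at {y:.2f} -> best response {best_response(y):.2f}")
    print(f"after repeated best responses: {race():.2f}")
```

Under these assumptions each best response sits one grid step above the rival's allocation, so repeated best responses ratchet capabilities spending up until almost nothing is left for safety.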

Letting whichever player spends more on capabilities win is clearly too simple a model: there's a chance that a less capabilities-focused player would get lucky and deploy AI first. It's also unrealistic that one can continuously vary the amount one spends on capabilities vs safety. Could you really have a researcher move 0.0000000120000900007% of their effort from safety to capabilities (even granting that they're equally productive in both fields)?

Below I focus on removing the discontinuity in expected payoff, rather than adding discontinuity in strategy. That is, for any given amount of resources that each player is putting towards capabilities, there is some probability of each deploying first.

## Chances & resources

*This is the section with maths in, if you want to skip it*

Again, focus on a two player game, with players 1 and 2. Assume that each has some fixed amount of resources and can choose their allocation of the resources to capabilities or safety research. Assume that the expected value of unfriendly AI (if it happens) does not depend on these resources or on their allocation, and nor does the expected payoff of friendly AI (if it is achieved). So the AI is either in a fully good scenario or a fully bad scenario, regardless of the resources and allocations. These change the probability of a good outcome or bad outcome, but not their nature.

Let kᵢ be the total resources of player i. Let rᵢ be the fraction of resources player i devotes to capabilities.

Let f(kᵢ, rᵢ, kⱼ, rⱼ) be the probability that a player with resources kᵢ and allocation rᵢ deploys their AI before a player with resources kⱼ and allocation rⱼ. Note we have f(kᵢ, rᵢ, kⱼ, rⱼ) = 1 - f(kⱼ, rⱼ, kᵢ, rᵢ).

Let s(k, r) be the probability that a player with resources k and allocation r deploys a friendly AI if it deploys an AI.

Let's set the payoff scale so that unfriendly AI is 0. Let the payoff for deploying friendly AI yourself be a. Let the payoff for the other player deploying friendly AI be b. I assume 0 < b < a.

For brevity, I will write *f* for f(k₁, r₁, k₂, r₂). I will write sᵢ for s(kᵢ, rᵢ).

Primed quantities (e.g. f') will denote partial derivatives with respect to r₁.

The expected payoff, *v* is f * a * s₁ + (1 - f) * b * s₂.

Player 1 should shift resources towards capabilities if the expected payoff increases with change in r₁. That is, if v' > 0.

v' = f' * a * s₁ + f * a * s₁' - f' * b * s₂

Rearranging:

a * f * s₁' > f' * (s₂ * b - s₁ * a)

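Choosing constants α and β, such that b = α * a and s₂ = β * s₁, and dividing both sides by a * s₁ * f' (which is positive so long as spending more on capabilities increases your chance of winning, i.e. f' > 0), we get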

(f / f') * (s₁' / s₁) > α * β - 1

or, multiplying both sides by -1,

-(f / f') * (s₁' / s₁) < 1 - α * β

For any function g, call g'/g the relative (instantaneous) increase in g. Also, call -g'/g the relative decrease in g.

α * β is (the expected value gained by player 1 if player 2 wins) divided by (the expected value gained by player 1 if player 1 wins). So call 1 - α * β the relative expected loss if player 2 wins.

Rewriting with our new definitions, we have player 1 should move resources from safety to capabilities just if

(the relative decrease in chance of friendliness)/(the relative increase in chance of winning) < (the relative expected loss if player 2 wins).
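As a sanity check, the inequality can be compared against a direct numerical derivative of the expected payoff v. The functional forms below (a ratio-style f and a linear s) are assumptions picked purely for illustration; nothing in the derivation depends on them.

```python
def f(k1, r1, k2, r2):
    """Assumed form: chance of deploying first is your share of capabilities spending."""
    return k1 * r1 / (k1 * r1 + k2 * r2)


def s(k, r):
    """Assumed form: chance of friendliness falls linearly with the capabilities fraction."""
    return 1.0 - r


def derivative(g, x, h=1e-6):
    """Central finite-difference approximation to g'(x)."""
    return (g(x + h) - g(x - h)) / (2 * h)


def expected_payoff(k1, r1, k2, r2, a, b):
    """v = f * a * s1 + (1 - f) * b * s2, with unfriendly AI worth 0."""
    fv = f(k1, r1, k2, r2)
    return fv * a * s(k1, r1) + (1 - fv) * b * s(k2, r2)


def should_shift(k1, r1, k2, r2, a, b):
    """True if (relative decrease in s1) / (relative increase in f) < 1 - alpha * beta."""
    f_val = f(k1, r1, k2, r2)
    f_prime = derivative(lambda r: f(k1, r, k2, r2), r1)
    s1 = s(k1, r1)
    s1_prime = derivative(lambda r: s(k1, r), r1)
    alpha = b / a
    beta = s(k2, r2) / s1
    lhs = (-s1_prime / s1) / (f_prime / f_val)
    return lhs < 1 - alpha * beta


if __name__ == "__main__":
    for r1 in (0.1, 0.2, 0.5, 0.8):
        dv = derivative(lambda r: expected_payoff(1, r, 1, 0.5, 1.0, 0.5), r1)
        print(r1, should_shift(1, r1, 1, 0.5, 1.0, 0.5), dv > 0)
```

With these illustrative numbers (equal resources, the other player at r₂ = 0.5, b = a / 2) the two tests agree: shifting towards capabilities pays at low r₁ and stops paying once r₁ is large.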

## Concrete cases

### Against failure

We can model a single player trying to develop AI by some fixed time as playing against nature or failure. If they do not succeed, reality certainly deploys a "friendly AI" that is just "no change". Our situation continues unaltered. So β is 1, and α is the expected value of our situation continuing as a fraction of the goodness of a positive singularity. Note that this expected value includes the chance of a positive singularity *later*, just not by the given time.

What drives the value of α is mostly how much more or less likely we think the future is to be able to deploy positive AI. If we think it is likely to be wiser and smarter, then α may even be larger than one, and a single player should only do safety research. If we think there is a chance of instability and unrest in the future, and this is our most stable chance, then a single player may wish to press on.

Of course, in reality, a multipolar scenario is more likely.

### If unfriendliness is much, much worse than your enemy winning

If the default outcome is something like an S-Risk, your enemy winning may be essentially indistinguishable from you winning. That is, α is approximately 1, and the right-hand side of the inequality becomes 1 - β. So you should only move resources to capabilities if (the relative decrease in chance of friendliness)/(the relative increase in chance of winning) < (the relative expected decrease in the chance of friendliness if player 2 wins).

#### Against a cautious player

Suppose the other player is playing it *very* safe. In fact, they are playing it many times safer than you are. So β is in fact larger than 1. The only way for the inequality to hold is you *increase* your chance of friendliness or *decrease* your chance of winning: you should shift resources away from capabilities.

#### Reckless Rivals

Suppose the other player is pursuing a no-holds-barred strategy that is very likely to produce an AGI before you and is very likely to produce an unaligned AI. The details here are sensitive to the exact figures and forms of the equations. Below is a simple case.

Say that the chance that you win the race is p, the chance that you produce a friendly AI is q and that the chance they produce a friendly AI if they win is r*q (r < 1).

As you shift resources to capabilities, say that your chance of winning the race climbs linearly from p to 1/2, reaching 1/2 when you have the same allocation as the other player. Say that your chance of producing a safe AI drops linearly from q to r * q, again reaching r * q when you have the same allocation.

You should shift resources to capabilities if (1 - r)*q/(1/2 - p) < (1 - r). That is if (q - p) < 1/2. Interestingly, in this case there is no dependence on r.

## Armstrong, Bostrom & Shulman 2013

After having written this up, I looked at the paper *Racing to the Precipice: a model of artificial intelligence development* by Armstrong, Bostrom & Shulman.

Their model can be obtained from this one by substituting in the specific values of f and α they consider, generalising to a multipolar scenario, considering a concrete distribution of capabilities and adding in partial observability of others' capabilities.

Importantly, their *f* is discontinuous (very similar to that in the first model considered above). When there is uncertainty about others' capabilities, it plays a somewhat similar role to fuzzing the probabilities of winning.

Coming back to this, I realize that though I summarized this post in Alignment Newsletter #33, I didn't post my opinion here, so doing that now:

The model in Racing to the precipice had the unintuitive conclusion that if teams have *more* information (i.e. they know their own or other’s capabilities), then we become *less* safe, which puzzled me for a while. Their explanation is that with maximal information, the top team takes as much risk as necessary in order to guarantee that they beat the second team, which can be quite a lot of risk if the two teams are close. While this is true, the explanation from this post is more satisfying -- since the model has a discontinuity that rewards taking on risk, anything that removes the discontinuity and makes it more continuous will likely improve the prospects for safety, such as not having full information. I claim that in reality these discontinuities mostly don't exist, since (1) we're uncertain about who will win and (2) we will probably have a multipolar scenario where even if you aren't first-to-market you can still capture a lot of value. This suggests that it likely isn't a problem for teams to have more information about each other on the margin.

That said, these models are still very simplistic, and I mainly try to derive qualitative conclusions from them that my intuition agrees with in hindsight.

I think this is a great summary (EDIT: this should read "I think the summary in the newsletter was great").

Yes, I agree. The best indicator I had of making a mathematical mistake was whether my intuition agreed in hindsight.

(Fyi, this was only my opinion, the summary is in the newsletter. I usually don't post the summary on the post itself, since it is typically repeating the post in a manner that doesn't generate new insights.)

I feel like most of the bad incentives for developing unsafe AGI come from the positive incentive gradient of marginally stronger AI. I.e. you could model it as a repeated game with hundreds of rounds, where every turn you get resources and power proportional to your total investment in capabilities. This seems to model reality much better (i.e. financial incentives are basically just short-term, and you don't get to build a 100 person research team if you don't show anything profitable from it), and also seems like where the real incentive problem stems from.

Yes, that seems an important case to consider.

You might still think the analysis in the post is relevant if there are actors that can shape the incentive gradients you talk about: Google might be able to focus its sub-entities in a particular way while maintaining profit or a government might choose to implement more or less oversight over tech companies.

Even with the above paragraph, it seems like the relative change-over-time in resources and power of the strategic entities would be important to consider, as you point out. In this case, it seems like (known) fast takeoffs might be safer!

Comment here if you have structure/writing complaints for the post.

Comment here if you are worried about info-hazard-y-ness of talking about AI races.

I talked to a couple of people in relevant organisations about possible info hazards for talking about races (not because this model is sophisticated or non-obvious, but because it contributes to general self-fulfilling chattering). Amongst those I talked to, they were not worried about (a) simple pieces with at least some nuance in general and (b) this post in particular.

Comment here if there are maths problems.

I think your solution to "reckless rivals" might be wrong? I think you mistakenly put a multiplier of q instead of a p on the left-hand side of the inequality. (The derivation of the general inequality checks out, though, and I like your point about discontinuous effects of capacity investment when you assume that the opponent plays a known pure strategy.)

I'll use slightly different notation from yours, to avoid overloading p and q. (This ends up not mattering because of linearity, but eh.) Let $p_0, q_0$ be the initial probabilities for winning and safety|winning. Let $k$ be the capacity variable, and without loss of generality let $k$ start at 0 and end at $k_m$. Then $p(k) = \frac{.5 - p_0}{k_m} k + p_0$, and $q(k) = \frac{r q_0 - q_0}{k_m} k + q_0$. So $p' = \frac{.5 - p_0}{k_m}$, so $\frac{p}{p'} = \frac{p \cdot k_m}{.5 - p_0}$. And $q' = \frac{r q_0 - q_0}{k_m}$, so $-\frac{q'}{q} = \frac{q_0 (1 - r)}{q \cdot k_m}$.

Therefore, the left-hand side of the inequality, $-\frac{p q'}{p' q}$, equals $\frac{p}{.5 - p_0} \cdot \frac{q_0 (1 - r)}{q}$. At the initial point $k = 0$, this simplifies to $\frac{p_0}{.5 - p_0} (1 - r)$.

Let's assume $\alpha = 1$. The relative safety of the other project is $\beta = \frac{r q_0}{q}$, which at $k = 0$ simplifies to $r$.

Thus we should commit more to capacity when $1 - r > \frac{p_0}{.5 - p_0} (1 - r)$, or $1 > \frac{p_0}{.5 - p_0}$, or $.25 > p_0$. This is a little weird, but makes a bit more intuitive sense to me than $q_0 + p_0$ or $q_0 - p_0$ mattering.
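A quick numerical check of the 1/4 threshold, as a sketch under the same assumptions as this comment: α = 1, capacity k rescaled to run from 0 to 1, a win chance rising linearly from p0 to 0.5, and a safety chance falling linearly from q0 to r·q0. The function names are made up for illustration.

```python
def expected_value(k, p0, q0, r):
    """Chance of a friendly outcome (alpha = 1) after shifting k in [0, 1] towards capacity."""
    p = p0 + (0.5 - p0) * k      # chance of winning rises linearly to 1/2
    q = q0 * (1 - (1 - r) * k)   # chance of safety falls linearly to r * q0
    return p * q + (1 - p) * r * q0  # rival's safety chance stays at r * q0


def gain_from_shifting(p0, q0, r, h=1e-6):
    """Forward-difference estimate of the marginal value of capacity at k = 0."""
    return (expected_value(h, p0, q0, r) - expected_value(0.0, p0, q0, r)) / h


if __name__ == "__main__":
    for p0 in (0.10, 0.20, 0.30, 0.40):
        for q0, r in ((0.6, 0.5), (0.9, 0.1)):
            print(p0, q0, r, gain_from_shifting(p0, q0, r) > 0)
```

The sign of the marginal gain flips as p0 crosses 1/4 for both (q0, r) pairs tried here, consistent with the threshold not depending on q0 or r.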

Yes, you're quite right!

The intuition becomes a little clearer when I take the following alternative derivation:

Let us look at the change in expected value when I increase my capabilities. From the expected value stemming from worlds where I win, we have $(p \cdot q)' = p' \cdot q + p \cdot q'$. For the other actor, their probability of winning decreases at a rate that matches my increase in probability of winning. Also, their probability of deploying a safe AI doesn't change. So the change in expected value stemming from worlds where they win is $-p' \cdot r \cdot q$.

We should be indifferent to increasing capabilities when these sum to 0, so $p' \cdot q + p \cdot q' = p' \cdot r \cdot q$.

Let's choose our units so $k_m = 1$. Then, using the expressions for $q'$ from your comment, we have $r q_0 p'_0 = p'_0 q_0 + p_0 q_0 (r - 1)$.

Dividing through by $q_0$ we get $r p'_0 = p'_0 + p_0 (r - 1)$. Collecting like terms we have $(r - 1) \cdot p'_0 = p_0 \cdot (r - 1)$ and thus $p'_0 = p_0$. Substituting for $p'_0$ we have $\frac{1}{2} - p_0 = p_0$ and thus $p_0 = \frac{1}{4}$.

Oh wait, yeah, this is just an example of the general principle "when you're optimizing for xy, and you have a limited budget with linear costs on x and y, the optimal allocation is to spend equal amounts on both."

Formally, you can show this via Lagrange-multiplier optimization, using the Lagrangian $L(x, y) = xy - \lambda(ax + by - M)$. Setting the partials equal to zero gets you $\lambda = y/a = x/b$, and you recover the linear constraint function $ax + by = M$. So $ax = by = M/2$. (Alternatively, just optimizing $x \cdot \frac{M - ax}{b}$ works, but I like Lagrange multipliers.)

In this case, we want to maximize $pq + (1 - p) r q_0 = p(q - r q_0) + r q_0$, which is equivalent to optimizing $p \cdot (q - r q_0)$, since $r q_0$ is a constant. Let's define $w = q - r q_0$, so we're optimizing $p \cdot w$.

Our constraint function is defined by the tradeoff between $p$ and $w$. $p(k) = (.5 - p_0) k + p_0$, so $k = \frac{p - p_0}{.5 - p_0}$. $w(k) = (r - 1) q_0 k + q_0 - r q_0 = (r - 1) q_0 (k - 1)$, so $k = -\frac{w}{(1 - r) q_0} + 1 = \frac{p - p_0}{.5 - p_0}$.

Rearranging gives the constraint function $\frac{.5 - p_0}{(1 - r) q_0} w + p = .5$. This is indeed linear, with a total 'budget' $M$ of $.5$ and a $p$-coefficient $b$ of 1. So by the above theorem we should have $1 \cdot p = .5 / 2 = .25$.
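The equal-split rule, and its application to the p·w problem, can also be checked by brute force. This is only a sketch: `best_split` is a made-up helper that grid-searches the budget line, and the values p0 = 0.1, q0 = 0.6, r = 0.5 are arbitrary illustrative choices.

```python
def best_split(a, b, M, steps=10_000):
    """Maximise x * y subject to a*x + b*y = M by grid search over x.

    Returns the amounts of budget spent on x and on y at the optimum.
    """
    best_x, best_val = 0.0, -1.0
    for i in range(steps + 1):
        x = (M / a) * i / steps   # x ranges from 0 to M / a
        y = (M - a * x) / b       # y is whatever the remaining budget buys
        if x * y > best_val:
            best_x, best_val = x, x * y
    return a * best_x, M - a * best_x


if __name__ == "__main__":
    # Generic check: the budget should be split equally between the two factors.
    print(best_split(2.0, 3.0, 12.0))

    # Race example: coefficient on w is (.5 - p0) / ((1 - r) * q0), coefficient on p is 1.
    p0, q0, r = 0.1, 0.6, 0.5  # illustrative values only
    a_w = (0.5 - p0) / ((1 - r) * q0)
    spend_w, spend_p = best_split(a_w, 1.0, 0.5)
    print(spend_p)  # p itself, since its coefficient is 1
```

In both cases the search lands on an even split of the budget, which for the race example puts p at roughly 0.25.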