Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Comparing Alignment to other AGI interventions: Basic model



"A main motivation of this enterprise is to assess whether interventions in the realm of Cooperative AI, that increase collaboration or reduce costly conflict, can seem like an optimal marginal allocation of resources."

After reading the first three paragraphs, I had basically no idea what interventions you were aiming to evaluate. Later on in the text, I gather you are talking about coordination between AI singletons, but I still feel like I'm missing something about what problem exactly you are aiming to solve with this. I could have definitely used a longer, more explain-like-I'm-five level introduction.

You're right, I forgot to explicitly explain that somewhere! Thanks for the notice, it's now fixed :)

Hjalmar Wijk (unpublished)

And Tristan too right? I don't remember which parts he did & whether they were relevant to your work. But he was involved at some point.

My impression was that this one model was mostly Hjalmar, with Tristan's supervision. But I'm unsure, and that's enough to include anyway, so I will change that, thanks :)

Interventions that increase the probability of Aligned AGI aren't the only kind of AGI-related work that could importantly increase the Expected Value of the future.

Here I present a very basic quantitative model (which you can run yourself here) to start thinking about these issues. In a follow-up post I give a brief overview of extensions and analysis.

A main motivation of this enterprise is to assess whether interventions in the realm of Cooperative AI, that increase collaboration or reduce costly conflict, can seem like an optimal marginal allocation of resources. More concretely, in a utility framework, we compare alignment work against cooperation work aimed at agents with our values and cooperation work aimed at agents without our values.

We used a model-based approach (see here for a discussion of its benefits) paired with qualitative analysis. While these two posts don't constitute an exhaustive analysis (more exhaustive versions are less polished), feel free to reach out if you're interested in this question and want to hear more about this work.

Most of this post is a replication of previous work by Hjalmar Wijk and Tristan Cook (unpublished). The basic modelling idea we're building upon is to define how different variables affect our utility, and then incrementally compute or estimate partial derivatives to assess the value of marginal work on this or that kind of intervention.

## Setup

We model a multi-agentic situation. We classify each agent as either having (approximately) our values ($V$) or any other values ($\neg V$). We also classify them as either cooperative ($C$) or non-cooperative ($\neg C$).^{[1]} These classifications are binary. We are also (for now) agnostic about what these agents represent. Indeed, this basic multi-agentic model will be applicable (with differently informed estimates) to any scenario with multiple singletons.

The variable we care about is total utility ($U$). As a simplifying assumption, our way to compute it will be as a weighted interpolation of two binary extremes: one in which bargaining goes (for agents with our values) as well as possible ($B$), and another one in which it goes as badly as possible ($\neg B$). The interpolation coefficient ($b$) could be interpreted as "percentage of interactions that result in minimally cooperative bargaining settlements".
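Written out, the interpolation just described takes the following form. The text does not state this equation explicitly, so treat it as a reading of the prose; it is the form that the derivatives in the section on mathematical derivations differentiate:

$$U = b\,U_B + (1-b)\,U_{\neg B}$$

Here $U_B$ and $U_{\neg B}$ denote the utility in the best-case and worst-case bargaining extremes, so $b=1$ recovers $U_B$ and $b=0$ recovers $U_{\neg B}$.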

We also consider that all our interventions act on only a single one of the agents (which controls a fraction $F_I$ of total resources), which usually represents our AGI or our civilization.^{[2]} These interventions are coarsely grouped into alignment work ($a_V$), cooperation work targeted at worlds with high alignment power ($a_{C|V}$), and cooperation work targeted at worlds with low alignment power ($a_{C|\neg V}$).

The overall structure looks like this:

[Figure: diagram of the overall model structure]

## Full list of variables

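This section is safely skippable.

The first 4 variables model expected outcomes: total utility ($U$), the utility when bargaining goes as well as possible ($U_B$), the utility when it goes as badly as possible ($U_{\neg B}$), and the interpolation coefficient between the two ($b$).

The next 3 variables model how the multi-agent landscape looks: the fraction of agents with our values ($F_V$), the fraction of agents with our values that are cooperative ($F_{C|V}$), and the fraction of agents without our values that are cooperative ($F_{C|\neg V}$).

The last 7 variables model how we can affect a single agent through different interventions: the fraction of total resources that agent controls ($F_I$), the probability that it has our values ($p_V$), the probabilities that it is cooperative given that it does or does not have our values ($p_{C|V}$ and $p_{C|\neg V}$), and the three kinds of work ($a_V$, $a_{C|V}$ and $a_{C|\neg V}$).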

## Mathematical derivations

This section is safely skippable.

Our goal is to estimate $\frac{dU}{da}$, given we have some values for $\frac{dp_V}{da}$, $\frac{dp_{C|V}}{da}$ and $\frac{dp_{C|\neg V}}{da}$ (tractability assessments).

By chain rule

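$$\frac{dU}{da} = \frac{dU}{dF_V}\frac{dF_V}{da} + \frac{dU}{dF_{C|V}}\frac{dF_{C|V}}{da} + \frac{dU}{dF_{C|\neg V}}\frac{dF_{C|\neg V}}{da} \qquad (\star)$$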

We begin by studying the 3 partial derivatives of U appearing in (⋆).

For the first, product and chain rules give

$$\frac{dU}{dF_V} = \frac{db}{dF_V}(U_B - U_{\neg B}) + b\,\frac{dU_B}{dF_V} + (1-b)\,\frac{dU_{\neg B}}{dF_V}$$

That is, changing the fraction of agents with our values can change the success of bargaining (first summand) and also the value of different scenarios (second and third summands).

For the second and third derivatives, we assume the fractions of cooperative agents can only affect bargaining success, but don't change the intrinsic value of futures in any other way. For example, it is not terminally good for agents to be cooperative (that is, the existence of a cooperative agent doesn't immediately increase our utility). Thus, all of $\frac{dU_B}{dF_{C|V}}$, $\frac{dU_{\neg B}}{dF_{C|V}}$, $\frac{dU_B}{dF_{C|\neg V}}$ and $\frac{dU_{\neg B}}{dF_{C|\neg V}}$ are 0. So again by the product and chain rules, and by these nullifications,

$$\frac{dU}{dF_{C|V}} = \frac{db}{dF_{C|V}}(U_B - U_{\neg B})$$

$$\frac{dU}{dF_{C|\neg V}} = \frac{db}{dF_{C|\neg V}}(U_B - U_{\neg B})$$

We now turn to the other derivatives of (⋆), which correspond to how our action changes the landscape of agents.

For the first, our action only changes the fraction of agents with our values by making our agent more likely to have our values^{[3]}:

$$\frac{dF_V}{da} = \frac{dF_V}{dp_V}\cdot\frac{dp_V}{da} = F_I\cdot\frac{dp_V}{da}$$
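One way to see why $\frac{dF_V}{dp_V} = F_I$ is to split $F_V$ into the contribution of all other agents and that of our own. The symbol $F_V^{\text{others}}$ is introduced here only for illustration and does not appear in the original; the assumption is that it does not depend on $p_V$:

$$F_V = F_V^{\text{others}} + F_I\,p_V \quad\Longrightarrow\quad \frac{dF_V}{dp_V} = F_I$$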

For the second one we get something more complicated. By chain rule

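$$\frac{dF_{C|V}}{da} = \frac{dF_{C|V}}{dp_V}\frac{dp_V}{da} + \frac{dF_{C|V}}{dp_{C|V}}\frac{dp_{C|V}}{da} + \frac{dF_{C|V}}{dp_{C|\neg V}}\frac{dp_{C|\neg V}}{da}$$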

Clearly $\frac{dF_{C|V}}{dp_{C|\neg V}} = 0$, that is, changing the probability that our agent is cooperative if it were to not have our values won't alter in any case the fraction of agents with our values that are cooperative.

But $\frac{dF_{C|V}}{dp_V} \neq 0$, since making your agent have your values could alter this fraction if your agent is atypically (un)cooperative. By considering $F_{C|V} = \frac{F_{C\cap V}}{F_V}$ and using the quotient rule we get

$$\frac{dF_{C|V}}{dp_V} = \frac{F_I}{F_V}\left(p_{C|V} - F_{C|V}\right)$$

Intuitively, if our agent is atypically (un)cooperative, making it have our values will accordingly alter the proportion of cooperative agents with our values.

We deal with $\frac{dF_{C|V}}{dp_{C|V}}$ similarly by the quotient rule and obtain

$$\frac{dF_{C|V}}{dp_{C|V}} = \frac{F_I}{F_V}\,p_V$$

So in summary

$$\frac{dF_{C|V}}{da} = \frac{F_I}{F_V}\left(p_{C|V} - F_{C|V}\right)\frac{dp_V}{da} + \frac{F_I}{F_V}\,p_V\,\frac{dp_{C|V}}{da}$$
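The two closed forms above can be sanity-checked numerically. The following is a minimal sketch, not the linked implementation: the toy landscape, the function names and all default numbers (`others_V`, `others_CV`, and so on) are assumptions made up for illustration. It compares the analytic expressions against central finite differences.

```python
def fractions(p_V, p_CV, F_I=0.1, others_V=0.2, others_CV=0.05):
    """Return (F_V, F_{C|V}) for a toy landscape (all numbers are illustrative).

    others_V:  resource share of other agents that have our values
    others_CV: resource share of other agents that have our values and cooperate
    Our agent controls F_I, has our values with probability p_V, and is
    cooperative with probability p_CV given that it has our values.
    """
    F_V = others_V + F_I * p_V
    F_C_and_V = others_CV + F_I * p_V * p_CV
    return F_V, F_C_and_V / F_V


def compare_derivatives(p_V=0.3, p_CV=0.6, F_I=0.1, eps=1e-6):
    """Return (analytic, numeric) pairs for dF_{C|V}/dp_V and dF_{C|V}/dp_{C|V}."""
    F_V, F_CV = fractions(p_V, p_CV, F_I)

    # Closed forms from the derivation above
    analytic_dpV = F_I / F_V * (p_CV - F_CV)
    analytic_dpCV = F_I / F_V * p_V

    # Central finite differences on F_{C|V}
    numeric_dpV = (fractions(p_V + eps, p_CV, F_I)[1]
                   - fractions(p_V - eps, p_CV, F_I)[1]) / (2 * eps)
    numeric_dpCV = (fractions(p_V, p_CV + eps, F_I)[1]
                    - fractions(p_V, p_CV - eps, F_I)[1]) / (2 * eps)

    return {"dpV": (analytic_dpV, numeric_dpV),
            "dpCV": (analytic_dpCV, numeric_dpCV)}
```

With these defaults our agent is more cooperative than the typical agent with our values, so the first derivative comes out positive, matching the intuition about atypically cooperative agents.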

For the third derivative we proceed analogously and obtain

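$$\frac{dF_{C|\neg V}}{da} = -\frac{F_I}{1-F_V}\left(p_{C|\neg V} - F_{C|\neg V}\right)\frac{dp_V}{da} + \frac{F_I}{1-F_V}\,(1-p_V)\,\frac{dp_{C|\neg V}}{da}$$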

Finally, putting everything together in (⋆) we obtain (grouped by intervention)

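$$\begin{aligned}
\frac{dU}{da} ={}& \left[(U_B - U_{\neg B})\,(\ast) + b\,\frac{dU_B}{dF_V} + (1-b)\,\frac{dU_{\neg B}}{dF_V}\right] F_I\,\frac{dp_V}{da} && (v)\\
&+ (U_B - U_{\neg B})\,\frac{db}{dF_{C|V}}\,p_V\,\frac{F_I}{F_V}\,\frac{dp_{C|V}}{da} && (cv)\\
&+ (U_B - U_{\neg B})\,\frac{db}{dF_{C|\neg V}}\,(1-p_V)\,\frac{F_I}{1-F_V}\,\frac{dp_{C|\neg V}}{da} && (c\neg v)
\end{aligned}$$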

where we have defined

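$$(\ast) = \frac{db}{dF_V} + \frac{db}{dF_{C|V}}\,\frac{1}{F_V}\left(p_{C|V} - F_{C|V}\right) - \frac{db}{dF_{C|\neg V}}\,\frac{1}{1-F_V}\left(p_{C|\neg V} - F_{C|\neg V}\right)$$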
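As a rough illustration of how the final expression could be evaluated, here is a minimal Python sketch. It is not the implementation linked in this post: the `Inputs` container, the field names and the function `marginal_values` are hypothetical, chosen only to mirror the symbols above.

```python
from dataclasses import dataclass


@dataclass
class Inputs:
    """Hypothetical container mirroring the symbols in the derivation."""
    U_B: float          # utility if bargaining goes as well as possible
    U_notB: float       # utility if bargaining goes as badly as possible
    b: float            # interpolation coefficient
    dUB_dFV: float      # dU_B / dF_V
    dUnotB_dFV: float   # dU_notB / dF_V
    db_dFV: float       # db / dF_V
    db_dFCV: float      # db / dF_{C|V}
    db_dFCnotV: float   # db / dF_{C|notV}
    F_V: float          # fraction of agents with our values
    F_CV: float         # fraction of those that are cooperative
    F_CnotV: float      # fraction of agents without our values that are cooperative
    F_I: float          # fraction of resources controlled by our agent
    p_V: float          # probability our agent has our values
    p_CV: float         # probability it is cooperative, given our values
    p_CnotV: float      # probability it is cooperative, given other values


def marginal_values(x, dpV_da, dpCV_da, dpCnotV_da):
    """Return the three summands (v), (cv), (c not v) of dU/da."""
    gap = x.U_B - x.U_notB
    # The bracketed term (*) defined above
    star = (x.db_dFV
            + x.db_dFCV * (x.p_CV - x.F_CV) / x.F_V
            - x.db_dFCnotV * (x.p_CnotV - x.F_CnotV) / (1 - x.F_V))
    v = (gap * star + x.b * x.dUB_dFV + (1 - x.b) * x.dUnotB_dFV) * x.F_I * dpV_da
    cv = gap * x.db_dFCV * x.p_V * x.F_I / x.F_V * dpCV_da
    c_not_v = gap * x.db_dFCnotV * (1 - x.p_V) * x.F_I / (1 - x.F_V) * dpCnotV_da
    return {"v": v, "cv": cv, "c_not_v": c_not_v}
```

Summing the three returned values gives $\frac{dU}{da}$; keeping them separate makes it easy to compare the marginal value of each kind of intervention directly.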

When running simulations with $F_V$ very small, we might want to work with its logarithm for numerical stability.

## Some discussion

## What does bargaining success look like?

For starters, if interactions happen non-locally with respect to value systems (so that we are as likely to find ourselves interacting with agents having or not having our values), then we'd expect to benefit more from cooperation increases in bigger classes of agents. That is, $\frac{db}{dF_{C|V}}$ would be proportional to $F_V$ (usually small), and $\frac{db}{dF_{C|\neg V}}$ to $1-F_V$ (big), which would cancel out in $(cv)$ and $(c\neg v)$, resulting simply in the recommendation "intervene to make cooperative the kind of system most likely to happen".

But there might be locality effects, so that we are more likely to find ourselves in a trade situation with agents having our values (and that would partly counteract the above difference).
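To make the cancellation in the non-local case explicit, suppose (as an illustrative assumption, with a shared constant $k$ that the text does not specify) that $\frac{db}{dF_{C|V}} = k\,F_V$ and $\frac{db}{dF_{C|\neg V}} = k\,(1-F_V)$. Substituting into the two cooperation summands of $\frac{dU}{da}$ gives

$$(cv) = (U_B - U_{\neg B})\,k\,F_I\,p_V\,\frac{dp_{C|V}}{da}, \qquad (c\neg v) = (U_B - U_{\neg B})\,k\,F_I\,(1-p_V)\,\frac{dp_{C|\neg V}}{da}$$

so $F_V$ drops out entirely, and with equal tractabilities the comparison reduces to $p_V$ versus $1-p_V$.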

Overall, thinking about locality by using the single variable b, which encompasses bargaining success both with similar and distant values, turns out to be clunky. In later extensions this is improved.

Another possible confounder is the effect of value coalition sizes on bargaining. For example, we could think that, holding cooperativeness constant, bigger value coalitions make bargaining easier due to less chaos or computational cost. In that case, going from 50% to 95% of agents having our values ($F_V$) would be stabilizing (and so $\frac{db}{dF_V}$ is positive in this regime), while going from 0% to 1% would be destabilizing ($\frac{db}{dF_V}$ negative). It's not even clear this should hold close to the extremes of $F_V$, since maybe when an almost-all-powerful coalition exists it is better for it to destroy the other small agents.

Maybe to account for this we could think of $F_V$ not as the distribution over "which AGIs / civilizations are created", but instead "the fraction under multi-agentic stability". That is, the distribution of those AGIs / civilizations after already waging any possible war with others. But that might not be a good approximation, since in reality war and cooperation are very intertwined, and not neatly temporally separated (and this might still be the case for AGIs).

While some of our final model's inputs can be informed by these coalition considerations, there doesn't seem to be an easy way around modelling coalition-forming more explicitly if we want good approximations.

We're also not yet including effects on cooperation from Evidential Cooperation in Large worlds (see the next post).

## What do utilities look like?

$U_{\neg B}$ will vary a lot depending on, for example, how common we expect retributive or punitive actions to be.

$U_B$ will also vary a lot depending on your ethics (for example, how cosmopolitan vs parochial your values are) and opinions about the multi-agent landscape.

## What do tractabilities look like?

We assume as a first simplification that the return on investment is linear, that is, $\frac{dp_V}{da_V}$, $\frac{dp_{C|V}}{da_{C|V}}$ and $\frac{dp_{C|\neg V}}{da_{C|\neg V}}$ are all constant.

This might not be that bad of an approximation (for marginal assessments) when, a priori, we are pretty uncertain about whether X more researcher-hours will lead to an unexpected breakthrough that allows for rapid intervention, or yield basically nothing. In that case the derivative is just the expected gain per researcher-hour, which is linear in the hours invested. But there are some instances in which this approximation fails more strongly, like when thinking about technical implementation work: we know a priori that we'll need to spend at least Y researcher-hours to be able to implement anything.

## Implementation and extensions

You can toy with the model yourself here (instructions inside).

As mentioned above, it still seems like this bare-bones model is missing some important additional structure. In particular, estimating $U_B$, $U_{\neg B}$ and $b$ feels most ungrounded. This is discussed in the next post.

These two posts don't constitute an exhaustive analysis (more exhaustive versions are less polished), so feel free to reach out if you're interested in an all-things-considered assessment of the value of Cooperative AI work.

Work done at CLR's SRF 2023. Thanks to Tristan Cook and the rest of CLR for their help.

^{[1]} For now, the definitions of $V$ and $C$ are left vague. In fact, in Hjalmar's basic model we are free to choose whichever concretization, as long as the same one informs our assessments of $U_B$ and $U_{\neg B}$. In later extensions this was no longer the case and we worked with concrete definitions.

^{[2]} Of course, this stops being the case in, for example, the multi-polar Earth scenario. In such cases we can interpret $F_I$ as an estimate of the fraction of agents (weighted by resources) our interventions can affect. That said, some of the framings and intuitions below do overfit to the "single-agent" scenario.

^{[3]} So we are ignoring some weird situations like "agents being more cooperative makes them notice our values are better".