This is independent alignment research. A decent grasp of stat mech is needed to understand this post fully, but I have recently written a stat mech sequence (up to part 3) which should catch someone up to speed rapidly.

## TL;DR

Statistical Utility Mechanics (SUM) is a framework which draws on stat mech to model the actions of utility-maximizing agents. I introduce the framework here, and use it to demonstrate why the most naive possible alignment plan (by which of course I mean one not too far off the one most AI companies are currently using) is doomed by scaling laws, and why we should expect sudden increases in AI's potential for harm around the point where AI reaches roughly human level.

The framework consists of just three core rules:

1. The world is a probability distribution over states.
2. Each agent assigns a value to each state.
3. Agents can affect the probability distribution over states with limited power.

I'll go into more detail on each of these in the following toy model.

## Toy Example: Why AI Suddenly Becomes Dangerous

The world is described by a multivariate normal distribution over states $s \in \mathbb{R}^d$, centred on the origin. All variances are 1, so the distribution can be written as follows:

$$P_0(S=s) = \frac{1}{(2\pi)^{d/2}} \exp\left(-\tfrac{1}{2}|s|^2\right) \iff S \sim \mathcal{N}(0, I_d)$$

This has the nice property of being totally spherically symmetrical. I will also write $P_0(S=s) \equiv p_s^0$ for brevity in some cases.

Let us consider an AI agent $A_X$, which has the simplest possible utility function, $x_s = v_X \cdot s$ for some vector $v_X$, and a "power level" $O_X \in \mathbb{R}$. Without loss of generality we will assume that $|v_X| = 1$ (the reason for this will be clear soon).

This AI will impact the world's probability distribution $P_0 \xrightarrow{A_X} P_1$ according to two rules:

1. $D_{\mathrm{KL}}(P_1 \,\|\, P_0) \le O_X$
2. Subject to rule 1, $E(X) = \int P_1(S=s)\, x_s \,\mathrm{d}s$ is maximized.

Normally stat mech tells us about entropy given expected values, but here we can flip it on its head to tell us about expected values given KL-divergence (see Appendix 1 for why we use KL-divergence here). The maximizing distribution is $p_s^1 \propto p_s^0 \exp(b\, v_X \cdot s)$ for some constant $b$, which gives $P_1 \sim \mathcal{N}(b v_X, I_d)$.
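
This exponential tilt is the standard solution for maximizing a linear objective under a KL constraint. As a quick sketch of the missing step (my own filling-in, using standard Lagrange multipliers): with multiplier $\lambda$ for the divergence constraint and $\mu$ for normalization,

$$\mathcal{L} = \int p_s x_s \,\mathrm{d}s - \lambda\left(\int p_s \ln\frac{p_s}{p_s^0}\,\mathrm{d}s - O_X\right) - \mu\left(\int p_s \,\mathrm{d}s - 1\right)$$

and setting the functional derivative with respect to $p_s$ to zero gives $x_s - \lambda\left(\ln\frac{p_s}{p_s^0} + 1\right) - \mu = 0$, i.e. $p_s \propto p_s^0\, e^{x_s/\lambda}$: the tilt above, with $b = 1/\lambda$ chosen so the constraint binds.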

Using the formula for the KL-divergence between two multivariate Gaussians, we get:

$$D_{\mathrm{KL}}(P_1 \,\|\, P_0) = \tfrac{1}{2} b^2 (v_X \cdot v_X) = \tfrac{1}{2} b^2$$

$$O_X = \tfrac{1}{2} b^2 \;\therefore\; b = \sqrt{2 O_X}$$

$$P_1 \sim \mathcal{N}\!\left(\sqrt{2 O_X}\, v_X,\; I_d\right)$$

$$E(X) = \sqrt{2 O_X}$$

This framework naturally captures the fact that rescaling or adding a constant to our utility function does not change the behaviour of the agent. If $|v_X| \ne 1$, then we just get a different $b$ value and it all cancels out to the same answer.
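
As a quick Monte Carlo sanity check of the closed form (the dimension, power level, and sample count below are arbitrary choices of mine, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
d, o_x = 10, 2.0
v_x = rng.normal(size=d)
v_x /= np.linalg.norm(v_x)            # enforce |v_X| = 1

b = np.sqrt(2 * o_x)                  # b = sqrt(2 * O_X)
s = rng.normal(size=(1_000_000, d)) + b * v_x   # draws from P1 = N(b v_X, I_d)

e_x = (s @ v_x).mean()                # Monte Carlo estimate of E(X) = E[v_X . s]
print(e_x, np.sqrt(2 * o_x))          # both come out ~2.0 here
```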

## The Alignment Process

Let's describe a human utility function in the same way as the agent's: $y_s = v_Y \cdot s$. We will also describe the "power" of humans in this system as $O_Y$. In this model, what matters is the cosine similarity between the two vectors, $\tau = \cos(\theta_{XY}) = v_X \cdot v_Y$. For a $d$-dimensional space and two totally randomly chosen vectors, this is distributed as:

$$P_0(a; \tau = t) \propto (1 - t^2)^a, \quad \text{where } a = \frac{d - 3}{2}$$

We will work in terms of $a$ rather than $d$ to simplify the maths; since we expect a very high-dimensional space, and mainly care about scaling laws, the difference doesn't really matter. We'll consider altering the value of $\tau$, again using the KL-divergence trick to measure how much we're impacting the distribution. This system does not (really) have a good closed-form solution, especially for small values of $b$ and large values of $a$, but we can approximate the whole system numerically and get the following results:
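
Here is a minimal sketch of the kind of numerical computation involved, assuming the same exponential-tilt trick applied to the $\tau$ distribution (this is my reconstruction, not the post's actual code; the grid size is arbitrary):

```python
import numpy as np

def tilt_stats(a: float, b: float, n: int = 20001):
    """E[tau] and D_KL(P1 || P0) for the tilted similarity distribution,
    where P0(t) is proportional to (1 - t^2)^a and P1(t) to P0(t) * exp(b*t)."""
    t = np.linspace(-1 + 1e-9, 1 - 1e-9, n)
    dt = t[1] - t[0]
    log_q0 = a * np.log1p(-t ** 2)           # unnormalised log P0
    m0 = log_q0.max()
    z0 = np.exp(log_q0 - m0).sum() * dt      # normaliser of P0, up to e^m0
    log_q1 = log_q0 + b * t                  # exponential tilt
    m1 = log_q1.max()
    z1 = np.exp(log_q1 - m1).sum() * dt
    p1 = np.exp(log_q1 - m1) / z1            # normalised tilted density
    e_tau = (t * p1).sum() * dt
    # ln(P1/P0) = b*t - ln(Z1/Z0), so D_KL(P1||P0) = E_P1[b*t] - ln(Z1/Z0)
    kl = b * e_tau - (m1 + np.log(z1) - m0 - np.log(z0))
    return e_tau, kl
```

Sweeping $b$ at fixed $a$ traces out $E(\tau)$ as a function of $O_\tau$.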

Here we've calculated $E(\tau)$ as a function of $a$ and $b$, but plotted it against $O_\tau$, which we'll define as the divergence $D_{\mathrm{KL}}(P_1(a,b;\tau=t) \,\|\, P_0(a;\tau=t))$. We can see that the higher the dimensionality of our space, the more we must optimize the AI's utility function to get $E(\tau)$ above a certain critical value $\tau_c$. If we plot this required optimization against $a$:

We can also see that this relationship is linear for any $\tau_c$ threshold we might want to set. This is somewhat unsurprising: each dimension is just another, mostly independent, direction along which we must optimize $v_X$. The question is, how might we optimize the AI's utility function, and what are the constraints on $O_\tau$?

Consider the following loop (basically trying to wrangle an AI into alignment by letting it loose and seeing what it does):

1. The AI observes the world.
2. The AI takes an action to optimize the world.
3. We observe the result of that action.
4. We take an action to optimize the AI's utility function.

The expected amount of information we get about the AI's utility function from observing the result of each action is just the KL-divergence between the un-optimized world and the optimized world, in other words $O_X$! How much of this optimization is harmful? We can express the harmful part as $O_{X \perp Y}$:

$$\sqrt{O_{X \perp Y}} = \sqrt{1 - \tau^2}\, \sqrt{O_X}$$

$$\sqrt{O_{X \parallel Y}} = \sqrt{\tau^2}\, \sqrt{O_X}$$

$$O_X = O_{X \parallel Y} + O_{X \perp Y}$$

(The reason for all the square roots is in Appendix 2)
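
In the Gaussian toy model this decomposition is just Pythagoras applied to the mean shift; a small numerical check (the dimension and power level are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
d, o_x = 8, 3.0
v_x = rng.normal(size=d); v_x /= np.linalg.norm(v_x)
v_y = rng.normal(size=d); v_y /= np.linalg.norm(v_y)
tau = v_x @ v_y

b = np.sqrt(2 * o_x)
shift_par = (b * v_x @ v_y) * v_y       # component of the mean shift along v_Y
shift_perp = b * v_x - shift_par        # component perpendicular to v_Y

# KL between N(m, I) and N(0, I) is |m|^2 / 2
o_par = shift_par @ shift_par / 2
o_perp = shift_perp @ shift_perp / 2
assert np.isclose(o_par, tau ** 2 * o_x)
assert np.isclose(o_perp, (1 - tau ** 2) * o_x)
assert np.isclose(o_par + o_perp, o_x)
```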

Here is the plot of the following integral:

$$\sqrt{O_{X \perp Y}} = \int \sqrt{1 - \tau^2} \;\mathrm{d}\sqrt{O_\tau}$$

This makes sense again: we have a linear relationship between the amount of harmful optimization the AI will do before we can wrangle it into alignment, and the size of the space of possible utility functions. Not good!

Now, if the people trying to align the AI are not completely stupid (debatable), we might want to add a final step to this loop:

5. We take an action to rectify the result of the AI's action.

If the AI is smarter than us, we would expect not to be able to fully rectify the result of the AI's decision. For simplicity, and without any particular principle, we will make the choice $O_{X-Y} = \mathrm{plus}(O_{X \perp Y} - O_Y)$, where

$$\mathrm{plus}(x) = \begin{cases} x, & x \ge 0 \\ 0, & x < 0 \end{cases}$$

We will also upper-bound $O_\tau \le \max(O_X, O_Y)$, representing the assumption that we probably can't optimize the AI's utility function any more strongly than we can optimize against its effects on the world (since our limits on both are likely due to our ability to understand the AI's actions). Taking this bound to be saturated, we can simplify our expressions to the following relations in terms of $O_\tau$:

$$R_{XY} = O_X / O_Y$$

$$O_\tau = \max(1, R_{XY}) \times O_Y$$

$$O_Y = O_\tau / \max(1, R_{XY})$$

$$O_X = \min(R_{XY}, 1) \times O_\tau$$

$$O_{X-Y} = \mathrm{plus}\!\left((1 - \tau^2) \times \min(R_{XY}, 1) - \min\!\left(\frac{1}{R_{XY}}, 1\right)\right) \times O_\tau$$

We will therefore calculate the integral:

$$\sqrt{O_{X-Y}} = \int \sqrt{\mathrm{plus}\!\left((1 - \tau^2) \times \min(R_{XY}, 1) - \min\!\left(\frac{1}{R_{XY}}, 1\right)\right)} \;\mathrm{d}\sqrt{O_\tau}$$

Or we can plot it as a function of $R_{XY}$, with $a$ as a parameter:
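
As a sketch of how one might compute these curves numerically, reusing the `tilt_stats` helper from earlier (the sweep range and step count are my arbitrary choices, and I assume $R_{XY} > 0$):

```python
import numpy as np

def total_harm(a: float, r_xy: float, b_max: float = 40.0, steps: int = 400):
    """Integrate sqrt(O_{X-Y}) over d sqrt(O_tau), with tau(O_tau) traced
    out by sweeping the tilt strength b through tilt_stats. Assumes r_xy > 0."""
    pts = [tilt_stats(a, b) for b in np.linspace(0.0, b_max, steps)]
    taus = np.array([e for e, _ in pts])
    sqrt_o = np.sqrt(np.maximum([kl for _, kl in pts], 0.0))
    integrand = np.sqrt(np.maximum(
        (1 - taus ** 2) * min(r_xy, 1.0) - min(1.0 / r_xy, 1.0), 0.0))
    # trapezoid rule in d sqrt(O_tau)
    mid = 0.5 * (integrand[1:] + integrand[:-1])
    return float(np.sum(mid * np.diff(sqrt_o)))
```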

We can note a few predictions from this system:

Total "harm" as measured by OX−Y during training is directly proportional to a

For RXY≤1, we experience zero total harm, but harm grows very rapidly around human level AI, and the larger the value of a, the faster it grows

## AI Scaling

The question is: how does our parameter $R_{XY}$ scale with AI "scale"? Let's assign AI scale to the variable $\xi$ (because xi is hard to draw and an AI's scale is hard to quantify). Then we'll make the very conservative assumption that $R_{XY}$ scales with $\log(\xi)$, and that $a$ scales with $\xi^k$ for some exponent $k$, probably $\le 1$. This gives us a plot looking something like the black line below:

If we expect $\xi$ to scale exponentially with time $T$, we expect $a$ to scale exponentially over time, and $R_{XY}$ to scale linearly.

Or if we use a logarithmic y-axis, to get a better look at just how fast it is growing:
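
For illustration, here is one way to compose the pieces above into a harm-versus-time curve under those scaling assumptions, reusing `total_harm` from the previous sketch. All constants below ($c$, $k$, `a0`, `r0`) are hypothetical placeholders of mine, not values from the model:

```python
import numpy as np

def harm_at_time(T: float, c: float = 1.0, k: float = 0.5,
                 a0: float = 2.0, r0: float = 0.2) -> float:
    """Hypothetical scaling: xi = exp(c*T), a = a0 * xi**k, R_XY = r0 * log(xi)."""
    xi = np.exp(c * T)
    a = a0 * xi ** k          # dimensionality / fragility grows like xi^k
    r = r0 * c * T            # R_XY = r0 * log(xi), i.e. linear in time
    return total_harm(a, max(r, 1e-9))

for T in range(1, 9):         # harm stays at zero until R_XY crosses 1
    print(T, harm_at_time(T))
```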

I conjecture that this consists of three phases:

A "lag" phase in which OX≪OY, so no harm is possible (the world is currently in this phase)

A "catch-up" phase in which the OX≈OY, so we see double-scaling (due to overcoming the plus function more and more as RXY increases)

An "exponential" phase, in which OX≫OY, so we "only" see exponential scaling, or perhaps scaling as (which is still very fast)

## Conclusions

This model tells us to expect super-exponential scaling of AI harms around human level. It also tells us that the higher the dimensionality of the possible utility-function space, the more quickly these harms will grow.

Some things this model does not include (a mixture of things that would make alignment easier or harder):

- The ability for the aligners to "sandbox" the AI, limiting its domain during alignment, or to do offline-RL-ish things to it
- Clever tricks to "repeat" the AI's actions in some way, to get more than $O_X$ information out of each of them
- The ability for the AI to become more powerful over time during deployment (i.e. $O_X$ increasing while the AI is being altered)
- The ability for the AI to understand that it is being optimized, and to act strategically or deceptively
- A "sharp left turn", in which the AI gains new capabilities (e.g. the ability to optimize along more dimensions)
- "Real" Goodharting, in which eventually even highly correlated optimization targets diverge due to constraints
- Extra "fragility" terms outside of $\tau$. I tried to simulate this by transforming the distribution to a narrower one, but polynomially scaling $\tau$ had the same effect as increasing $a$, so perhaps it is best to think of $a$ as a general "fragility constant"

I'm going to keep working on SUM, and hopefully get even more interesting results. I think SUM has the potential to be a fertile ground for both formal proofs and more informal models (like this one) surrounding alignment.

For a recap, here are the foundational principles:

- The world has states $s \in S$, which can be any set.
- We define a probability distribution $P_0(S=s) = p_s^0$.
- Agents $A_X$ assign real values to each state: $x_s \in \mathbb{R}$.
- An agent can influence the probability distribution with power $O_X$.
- This means the action $P_i \xrightarrow{A_X} P_{i+1}$ is subject to $D_{\mathrm{KL}}(P_{i+1} \,\|\, P_i) \le O_X$.
- The optimal action for a myopic agent is $p_s^{i+1} \propto p_s^i\, e^{b x_s}$ for some $b \in \mathbb{R}^+$.
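
As a minimal sketch of that last rule on a discrete state space (my own illustrative implementation, not the post's; the bisection search for $b$ is just one way to saturate the power constraint):

```python
import numpy as np

def sum_step(p: np.ndarray, x: np.ndarray, power: float) -> np.ndarray:
    """One myopic SUM action: tilt p by exp(b*x), choosing b by bisection
    so that D_KL(p1 || p) = power. Assumes p > 0 everywhere."""
    def tilt(b):
        q = p * np.exp(b * (x - x.max()))   # shift x for numerical stability
        return q / q.sum()
    def kl(q):
        ratio = np.where(q > 0, q / p, 1.0)
        return float(np.sum(np.where(q > 0, q * np.log(ratio), 0.0)))
    lo, hi = 0.0, 1.0
    while kl(tilt(hi)) < power:             # find an upper bracket for b
        hi *= 2.0
        if hi > 1e12:                       # utility ~constant: constraint unreachable
            return tilt(hi)
    for _ in range(60):                     # bisect on b
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if kl(tilt(mid)) < power else (lo, mid)
    return tilt(0.5 * (lo + hi))

# usage: a uniform world nudged toward high-utility states with power 0.5
p0 = np.full(4, 0.25)
x = np.array([0.0, 1.0, 2.0, 3.0])
p1 = sum_step(p0, x, power=0.5)
```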

## Major Caveat

- Arguably the sudden threshold is somewhat artificial: I added a threshold function to my measure of harm, and that induced a threshold in harm. I do think the threshold is well-motivated, though.
- It's somewhat unclear whether $O_X$ is the most natural measure of an AI's power.
- Technically this should all be done with lots of causal models and severed correlations, which I have omitted here. I think the results ought to still hold, but I haven't crunched all the numbers, so I'm working in shorthand.

## Appendices

## Appendix 1: The Entropy to KL-Divergence Trick

The world in SUM is conceptualized as a set of states $s \in S$. To each of these we assign an initial probability $p_s^0$, according to all the normal rules of probability distributions. We define the total entropy $H = \sum_{s \in S} p_s (h_s^0 - \ln p_s)$, where $h_s^0$ is the "intrinsic entropy" of each state, a trick from stat mech for when our "states" are not truly fundamental.

We then perform a second trick by defining $h_s^0 = \ln p_s^0$. This has two effects: $H_0 = 0$ by definition, and, more subtly, $P_0(S)$ becomes the maximum-entropy probability distribution.

This combination gives us a definition of $H$ which reduces to minus the KL-divergence:

$$H = \sum_{s \in S} p_s (\ln p_s^0 - \ln p_s) = -D_{\mathrm{KL}}(P \,\|\, P_0)$$
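
A quick numerical check of this identity (random discrete distributions, my own choice of example):

```python
import numpy as np

rng = np.random.default_rng(2)
p0 = rng.random(10); p0 /= p0.sum()        # reference distribution P0
p = rng.random(10); p /= p.sum()           # perturbed distribution P

h = np.sum(p * (np.log(p0) - np.log(p)))   # H with h_s^0 = ln p_s^0
kl = np.sum(p * np.log(p / p0))            # D_KL(P || P0)
assert np.isclose(h, -kl)                  # H = -D_KL(P || P0)
```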

## Appendix 2: KL-Divergence is not a Metric

KL-divergence is actually a generalization of squared distance, not of distance itself. This is why I do my integrals with respect to $\mathrm{d}\sqrt{O}$. This still might not be the most appropriate way to do things, but I've found that the qualitative results are the same whether I integrate with respect to $\mathrm{d}O$ or $\mathrm{d}\sqrt{O}$.
