This is independent alignment research. A decent grasp of stat mech is needed to understand this post fully, but I have recently written a stat mech sequence (up to part 3) which should bring a reader up to speed quickly.

TL;DR

Statistical Utility Mechanics is a framework which draws on stat mech to model the actions of utility-maximizing agents. I introduce this framework here, and use it to demonstrate why the most naive possible alignment plan (by which of course I mean one not too far off the one that most AI companies are currently using) is doomed by scaling laws, and we should expect to see sudden increases in AI's potential for harm around when AI reaches roughly human level.

A meme featuring Dr Doofenshmirtz, it reads "If I had a nickel for every time Statistical Mechanics proved we were all going to die, I'd have two nickels, which isn't a lot but it's weird that it happened twice."

The framework consists of just three core rules:

The world is a probability distribution over states
Each agent assigns a value to each state
Agents can affect the probability distribution over states with limited power

I'll go into more detail with these in this toy model:

Toy Example: Why AI Suddenly Becomes Dangerous

The world is described by a multivariate normal distribution over states $s$, centred on the origin. All variances are 1, so the distribution can be written as follows:

$P_{0} (S = s) = \frac{1}{(2 π)^{d / 2}} \exp (- \frac{1}{2} | s |^{2}) ⟺ S \sim N (0, I_{d})$

This has the nice property of being totally spherically symmetrical. I will also write $P_{0} (S = s) \equiv p_{0}^{s}$ for brevity in some cases.

Let us consider an AI agent $A_{X}$, which has the simplest possible utility function: $x^{s} = v_{X} \cdot s$ for some vector $v_{X}$, and a "power level" $O_{X} \in R$. Without loss of generality we will assume that $| v_{X} | = 1$ (the reason for this will be clear soon).

This AI will impact the world's probability distribution $P_{0} \xrightarrow{X} P_{1}$ according to two rules:

$D_{K L} (P_{1} ∥ P_{0}) \leq O_{X}$
Subject to rule 1, $E (X) = \int P_{1} (S = s) x^{s} d s$ is maximized

Normally stat mech tells us about entropy given expected values, but here we can flip it on its head to tell us about expected values given KL-divergence (see appendix for why we use KL-divergence here). The maximizing distribution is $p_{1}^{s} \propto p_{0}^{s} \exp (b v_{X} \cdot s)$ for some constant $b$, which gives $P_{1} \sim N (b v_{X}, I_{d})$.

Using the formula for the KL-divergence of two multivariate gaussians, we get:

$D_{K L} (P_{1} ∥ P_{0}) = \frac{1}{2} b^{2} (v_{X} \cdot v_{X}) = \frac{1}{2} b^{2}$

$O_{X} = \frac{1}{2} b^{2} ∴ b = \sqrt{2 O_{X}}$

$P_{1} \sim N (\sqrt{2 O_{X}} v_{X}, I_{d})$

$E (X) = \sqrt{2 O_{X}}$
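As a quick sanity check of the result above, here is a minimal Python sketch. The function names, the dimension `d = 5`, and the power `2.0` are illustrative assumptions, not part of the model. It builds the shifted mean, confirms that the Gaussian KL-divergence equals the power budget, and estimates $E (X)$ by sampling.

```python
import numpy as np

def optimal_shift(power, v):
    """Mean of P_1 for the linear utility v.s under KL budget `power`."""
    v = np.asarray(v, dtype=float)
    v_hat = v / np.linalg.norm(v)   # normalise so that |v_X| = 1
    b = np.sqrt(2.0 * power)        # from O_X = b^2 / 2
    return b * v_hat

def kl_shifted_gaussian(mean):
    """D_KL( N(mean, I) || N(0, I) ) = |mean|^2 / 2."""
    return 0.5 * float(np.dot(mean, mean))

rng = np.random.default_rng(0)
d, power = 5, 2.0                   # illustrative values only
v = rng.normal(size=d)
v_hat = v / np.linalg.norm(v)
mean = optimal_shift(power, v)

# Monte Carlo estimate of E(X) = E[v_X . s] under P_1 = N(mean, I)
samples = rng.normal(size=(200_000, d)) + mean
expected_utility = float(np.mean(samples @ v_hat))

print("KL:", kl_shifted_gaussian(mean))
print("E(X) sampled:", expected_utility, "predicted:", np.sqrt(2.0 * power))
```

The sampled value should sit close to $\sqrt{2 O_{X}}$, up to Monte Carlo noise.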

This framework naturally captures the fact that rescaling or adding a constant to our utility function does not change the behaviour of the agent. If $| v_{X} | \neq 1$ , then we just get a different $b$ value and it all cancels out to the same answer.

The Alignment Process

Let's describe a human utility function in the same way as the agent's: $y^{s} = v_{Y} \cdot s$. We will also describe the "power" of humans in this system as $O_{Y}$. In this model, what matters is the cosine similarity between the two vectors, which can be expressed as $τ = \cos (θ_{X Y}) = v_{X} \cdot v_{Y}$. For a $d$-dimensional space, and two totally randomly-chosen vectors, this is distributed like this:

$P_{0} (a; τ = t) \propto (1 - t^{2})^{a}$ where $a = \frac{d - 3}{2}$

We will work in terms of $a$ rather than $d$ to simplify the maths, and because we expect to have a very high-dimensional space, and mainly care about scaling laws, it doesn't really matter. We'll consider altering the value of $τ$ , again using the KL-divergence trick to measure how much we're impacting our distribution. Our system does not (really) have a good closed-form solution, especially for small values of $b$ and large values of $a$ . We can approximate our whole system numerically though, and get the following results:

Note the log scale for both $a$ , and the x-axis.
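The numerical approximation described above can be sketched roughly as follows. This is my own minimal version and not necessarily the method used to produce the plots: it assumes the optimized distribution takes the tilted form $P_{1} (a, b; τ = t) \propto (1 - t^{2})^{a} e^{b t}$, and integrates with a midpoint rule on a grid of `n` cells. The grid size and function names are assumptions.

```python
import numpy as np

def _log_normalize(log_w, dt):
    """Shift log-weights so that sum(exp(log_w)) * dt == 1."""
    m = log_w.max()
    return log_w - (m + np.log(np.sum(np.exp(log_w - m)) * dt))

def tilted_tau_stats(a, b, n=200_000):
    """Return (E[tau], O_tau) for P_1(t) proportional to (1 - t^2)^a * exp(b t)."""
    dt = 2.0 / n
    t = -1.0 + dt * (np.arange(n) + 0.5)      # cell midpoints, avoids t = +/-1
    log_p0 = _log_normalize(a * np.log1p(-t**2), dt)
    log_p1 = _log_normalize(log_p0 + b * t, dt)
    p1 = np.exp(log_p1)
    e_tau = float(np.sum(p1 * t) * dt)                    # E(tau) under P_1
    o_tau = float(np.sum(p1 * (log_p1 - log_p0)) * dt)    # D_KL(P_1 || P_0)
    return e_tau, o_tau

if __name__ == "__main__":
    for a in (0, 10, 100):
        print(a, tilted_tau_stats(a, b=5.0))
```

Sweeping `b` for fixed `a` traces out one curve of $E (τ)$ against $O_{τ}$.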

Here we've calculated $E (τ)$ as a function of $a$ and $b$ , but plotted it against $O_{τ}$ , which we'll define as the divergence $D_{K L} (P_{1} (a, b; τ = t) ∥ P_{0} (a; τ = t))$ . We can see that the higher the dimensionality of our space, the more we must optimize the AI's utility function to get $E (τ)$ above a certain critical value $τ_{c}$ . If we plot this required optimization against $τ_{c}$ :

We can also see that this relationship is linear for any $τ_{c}$ threshold we might want to set. This is somewhat unsurprising, since each dimension is just another, mostly independent, dimension along which we must optimize $v_{X}$. The question is, how might we optimize the AI's utility function, and what are the constraints on $O_{τ}$?

Consider the following loop (basically trying to wrangle an AI into alignment by letting it loose and seeing what it does):

The AI observes the world.
The AI takes an action to optimize the world.
We observe the result of that action.
We take an action to optimize the AI's utility function

The expected amount of information we get about the AI's utility function from observing the result of each action is just the KL-divergence between the un-optimized world and the optimized world, in other words $O_{X}$ ! How much of this optimization is harmful? We can express it as $O_{X ⊥ Y}$ :

$\sqrt{O_{X ⊥ Y}} = \sqrt{(1 - τ^{2})} \sqrt{O_{X}}$

$\sqrt{O_{X ∥ Y}} = \sqrt{τ^{2}} \sqrt{O_{X}}$

$O_{X} = O_{X ∥ Y} + O_{X ⊥ Y}$

(The reason for all the square roots is in Appendix 2)
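One way to read these relations, sketched under the assumption that we stay in the Gaussian toy model, where the agent shifts the mean by $b v_{X}$ and its power is half the squared length of that shift: split the shift into the part along $v_{Y}$ and the part orthogonal to it.

$b v_{X} = b τ v_{Y} + b (v_{X} - τ v_{Y})$ with $| v_{X} - τ v_{Y} |^{2} = 1 - τ^{2}$

$O_{X ∥ Y} = \frac{1}{2} b^{2} τ^{2} = τ^{2} O_{X}$

$O_{X ⊥ Y} = \frac{1}{2} b^{2} (1 - τ^{2}) = (1 - τ^{2}) O_{X}$

Taking square roots recovers the first two expressions, and adding the last two lines gives $O_{X} = O_{X ∥ Y} + O_{X ⊥ Y}$.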

Here is the plot of the following integral:

$\sqrt{O_{X ⊥ Y}} = \int \sqrt{1 - τ^{2}} d \sqrt{O_{τ}}$

This makes sense again: we have a linear relationship between the amount of harmful optimization the AI will do before we can wrangle it into alignment, and the size of the space of possible utility functions. Not good!

Now if the people trying to align the AI are not completely stupid (debatable) we might want to add a final point to this loop:

We take an action to rectify the result of the AI's action

If the AI is smarter than us, we would expect not to be able to fully rectify the result of the AI's decision. For simplicity, and without particular principle, we will make the choice of $O_{X - Y} = \operatorname{plus} (O_{X ⊥ Y} - O_{Y})$ where $\operatorname{plus} (x) = \begin{cases} x, & x \geq 0 \\ 0, & x < 0 \end{cases}$

We will also upper-bound $O_{τ} = max (O_{X}, O_{Y})$ , representing the assumption that we probably can't optimize the AI any more strongly than we can optimize against its effects on the world (since our limits on both are likely due to our ability to understand the AI's actions). We can then simplify our expressions to these relations in terms of $O_{τ}$ :

$R_{X Y} = O_{X} / O_{Y}$

$O_{τ} = max (1, R_{X Y}) \times O_{Y}$

$O_{Y} = O_{τ} / max (1, R_{X Y})$

$O_{X} = min (R_{X Y}, 1) \times O_{τ}$

$O_{X - Y} = \operatorname{plus} ((1 - τ^{2}) \times \min (R_{X Y}, 1) - \min (\frac{1}{R_{X Y}}, 1)) \times O_{τ}$

We will therefore calculate the integral:

$\sqrt{O_{X - Y}} = \int \sqrt{\operatorname{plus} ((1 - τ^{2}) \times \min (R_{X Y}, 1) - \min (\frac{1}{R_{X Y}}, 1))} d \sqrt{O_{τ}}$
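To make the integrand concrete, here is a small Python sketch of the quantity inside the square root, $O_{X - Y} / O_{τ}$. It uses $O_{X ⊥ Y} = (1 - τ^{2}) O_{X}$ from the decomposition earlier in this section. The function names are mine, and this only evaluates the integrand at a single $(τ, R_{X Y})$ point rather than performing the full integral.

```python
def plus(x):
    """The threshold function: x when x >= 0, otherwise 0."""
    return max(x, 0.0)

def harm_fraction(tau, r_xy):
    """O_{X-Y} / O_tau for alignment tau and power ratio R_XY = O_X / O_Y."""
    if r_xy <= 0:
        raise ValueError("R_XY must be positive")
    perp = (1.0 - tau**2) * min(r_xy, 1.0)   # O_{X perp Y} / O_tau
    human = min(1.0 / r_xy, 1.0)             # O_Y / O_tau
    return plus(perp - human)

if __name__ == "__main__":
    for r in (0.5, 1.0, 2.0, 4.0):
        print(r, harm_fraction(tau=0.0, r_xy=r))
```

For any $R_{X Y} \leq 1$ the second term is 1 and the first is at most 1, so the function returns zero regardless of $τ$.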

Or we can plot it as a function of $R_{X Y}$ , with $a$ as a parameter:

We can note a few predictions from this system:

Total "harm" as measured by $O_{X - Y}$ during training is directly proportional to $a$
For $R_{X Y} \leq 1$ , we experience zero total harm, but harm grows very rapidly around human level AI, and the larger the value of $a$ , the faster it grows

AI Scaling

The question is, how do our parameters $R_{X Y}$ and $a$ scale with AI "scale"? Let's assign AI scale to the variable $ξ$ (because xi is hard to draw and an AI's scale is hard to quantify). We'll then make the very conservative assumption that $R_{X Y}$ scales with $\log (ξ)$, and $a$ scales with $ξ^{k}$ for some exponent $k$, probably $\leq 1$. This gives us a plot looking something like the black line below:

If we expect $ξ$ to scale exponentially with time $T$ , we expect $a$ to scale exponentially over time, and $R_{X Y}$ to scale linearly.

A graph showing harm from scaling of constant a levelling off, whereas harm in a scenario where a is related to scale by a power law increases exponentially — What Ilya Saw

Or if we use a logarithmic $y$ -axis, to get a better look at just how fast it is growing:

X axis has been rescaled to remove the zeroes

I conjecture that this consists of three phases:

A "lag" phase in which $O_{X} ≪ O_{Y}$ , so no harm is possible (the world is currently in this phase)
A "catch-up" phase in which $O_{X} \approx O_{Y}$, so we see double-scaling (due to overcoming the $\operatorname{plus}$ function more and more as $R_{X Y}$ increases)
An "exponential" phase, in which $O_{X} ≫ O_{Y}$ , so we "only" see exponential scaling, or perhaps scaling as (which is still very fast)

Conclusions

This model tells us to expect super-exponential scaling of AI harms around human level. It also tells us that the higher the dimensionality of the possible utility-function space, the more quickly these harms will grow.

Some things this model does not include, a mixture of things which would make alignment easier or harder:

The ability for the aligners to "sandbox" the AI, limiting its domain during alignment, or do offline-RL-ish-stuff on it
Clever tricks to "repeat" the AI's actions in some way to get more than $O_{X}$ information out of each of the AI's actions
The ability for the AI to become more powerful over time during deployment (i.e. $O_{X}$ increases over time while the AI is being altered)
The ability for the AI to understand that it is being optimized, and act strategically or deceptively
A "sharp left turn" in which AI gains new capabilities (e.g. can optimize in more dimensions)
"Real" goodharting, in which eventually even highly correlated optimization targets diverge due to constraints.
Extra "fragility" terms outside of $τ$ . I tried to simulate this by transforming the distribution to a narrower one, but polynomially-scaling $τ$ had the same effect as increasing $a$ , so perhaps it is best to think of $a$ as just being a general "fragility constant"

I'm going to keep working on SUM, and hopefully get even more interesting results. I think SUM has the potential to be a fertile ground for both formal proofs and more informal models (like this one) surrounding alignment.

For a recap, here are the foundational principles:

The world has states $s \in S$ , which can be any set
We define a probability distribution $P_{0} (S = s) = p_{0}^{s}$
Agents $A_{X}$ assign real values to each element $x^{s} \in R$
An agent can influence that probability distribution with power $O_{X}$
This means the action $P_{i} \xrightarrow{X} P_{i + 1}$ is subject to $D_{K L} (P_{i + 1} ∥ P_{i}) \leq O_{X}$
The optimal action for a myopic agent is $p_{i + 1}^{s} \propto p_{i}^{s} e^{b x^{s}}$ for $b \in R^{+}$
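The principles above can be sketched for a finite state set as follows. This is a minimal illustration under my own assumptions: a strictly positive starting distribution, a bisection search for $b$ (my choice of solver, not something specified in the framework), and hypothetical function names.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions with q > 0 everywhere."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def tilt(p, x, b):
    """Return p_next proportional to p * exp(b * x), computed in log space."""
    log_w = np.log(p) + b * x
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

def myopic_step(p, x, power, iters=200):
    """Find b >= 0 with D_KL(tilt(p, x, b) || p) = power, by bisection."""
    if power <= 0 or x.max() == x.min():
        return p.copy(), 0.0
    b_lo, b_hi = 0.0, 1.0
    # The divergence grows with b, so double b_hi until it brackets `power`.
    # If `power` exceeds what any tilt can reach, give up at a large b.
    while kl(tilt(p, x, b_hi), p) < power and b_hi < 1e6:
        b_hi *= 2.0
    for _ in range(iters):
        b_mid = 0.5 * (b_lo + b_hi)
        if kl(tilt(p, x, b_mid), p) < power:
            b_lo = b_mid
        else:
            b_hi = b_mid
    return tilt(p, x, b_hi), b_hi

p0 = np.full(4, 0.25)                 # illustrative 4-state world
x = np.array([0.0, 1.0, 2.0, 3.0])    # illustrative utilities
p1, b = myopic_step(p0, x, power=0.1)
print("b:", b, "E before:", p0 @ x, "E after:", p1 @ x)
```

The step spends the whole KL budget and moves probability mass toward higher-valued states.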

Major Caveat

Arguably the sudden threshold is somewhat artificial. I did add a threshold function to my measure for harm, which induced a threshold in harm. I do think the threshold is well-motivated though.
It's somewhat unclear whether $O_{X}$ is the most natural measure of an AI's power
Technically this should all be done with lots of causal models and severed correlations, which I have omitted here. I think the results ought to still hold there, but I haven't crunched all the numbers so I'm working in shorthand.

Appendices

Appendix 1: The Entropy to KL-Divergence Trick

The world in SUM is conceptualized as a set of states $s \in S$. To each of these we assign an initial probability $p_{0}^{s}$, according to all the normal rules of probability distributions. We define the total entropy $H = \sum_{s \in S} p^{s} (h_{0}^{s} - \ln p^{s})$ where $h_{0}^{s}$ is the "intrinsic entropy" of each state, a trick from stat mech when our "states" are not truly fundamental.

We then perform a second trick by defining $h_{0}^{s} = ln p_{0}^{s}$ . This has two effects: $H_{0} = 0$ by definition, and more subtly $P_{0} (S)$ becomes the maximum-entropy probability distribution.

This combination gives us a definition of $H$ which reduces to the KL-divergence.

$H = \sum_{s \in S} p^{s} (\ln p_{0}^{s} - \ln p^{s}) = - D_{K L} (P ∥ P_{0})$
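A short numerical check of this identity on a small discrete example. The six-state distributions are random and purely illustrative, and the function name is an assumption.

```python
import numpy as np

def sum_entropy(p, p0):
    """H = sum_s p^s (h_0^s - ln p^s) with intrinsic entropy h_0^s = ln p_0^s."""
    h0 = np.log(p0)
    return float(np.sum(p * (h0 - np.log(p))))

rng = np.random.default_rng(1)
p0 = rng.dirichlet(np.ones(6))   # illustrative initial distribution
p = rng.dirichlet(np.ones(6))    # illustrative optimized distribution

H = sum_entropy(p, p0)
kl_value = float(np.sum(p * np.log(p / p0)))
H_at_p0 = sum_entropy(p0, p0)

print("H:", H, "-KL:", -kl_value, "H at P_0:", H_at_p0)
```

Since the KL-divergence is non-negative, $H \leq 0$ everywhere, with the maximum $H = 0$ reached at $P = P_{0}$.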

Appendix 2: KL-Divergence is not a Metric

So KL-divergence is actually a generalization of squared distance, not distance. This is why I do my integrals with respect to $d \sqrt{O}$ . This still might not be the most appropriate way to do things, but I've found that the qualitative results are the same if I do them with respect to $d O$ or $d \sqrt{O}$ .
