Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is original, independent research carried out in March and April of 2024.

A graphical abstract: A formula generates data which is used to pretrain models. These are then subject to three RL tasks of varying difficulty. The reward function of a model can be identified by looking at a task-specific prediction of optimization as comapared to measured optimization.

The degree to which a a policy optimizes the future can be quantified mathematically. A set of of very small transformer models were pretrained to predict the next token in a mathematical sequence, then subjected to reinforcement learning finetuning.

The optimizing power of each model can be predicted with high accuracy based on each model's score on its own RL task. By comparing predictions of optimization based on scores on each different RL task, a model's original reinforcement objective can be identified.

A related measure for impact can also be derived mathematically, and given a theoretical lower bound based on RL score. This gives further information about model behavior, and allows for the same analysis as the measure of optimization.

I also investigate the possibility of getting models to self-evaluate optimization and impact, with limited success.


Pretraining on Sequence Prediction

I defined a simple mathematical sequence defined by the following stochastic recurrence relation. This produces a pseudo-random but (to 98%) predictable sequence, alternating between elements of  on even values of  and  on odd values of .

I then trained a small encoder-only transformer model to predict the next element in the sequence given the previous 20 elements of the sequence.

This was followed by a reinforcement-learning phase in which the transformer was used to generate the next token on odd values of  only, and the recurrence relation was used to generate the value of . If  was in , this was used as a "successful" example to reinforce the model. I used a temperature of 1 when generating these sequences to introduce some randomness, but the temperature was reduced to 0 during evaluations and when calculating optimization.

A small amount of "maintenance" training (much lower learning rate) was used during this phase to ensure that model performance on the predictive tasks for even values of  was maintained. Without this I saw rapid loss of performance on the "maintenance" dataset. I also found that I was unable to include "unsuccessful" examples (i.e. where ) with even a tiny negative learning rate, as this caused worsened performance at all tasks. Here is a typical set of results from training and evaluation:

I carried out this training on  models per size for four model sizes between 18k and 402k parameters, giving the following plot:

On the left is a plot showing pretraining and maintenance loss, on the right is a graph showing success rate on a task. Both are a function of parameters. Pretraining and maintenance loss show a U-shape, success generally increases with the number of parameters.

Pretraining loss increases over the last few model sizes, and the loss/time plots (some of which I have put in the Supplementary Information at the bottom of this post) showed signs of overfitting in the large models. Regularization was employed during training (0.01 weight decay in an AdamW optimizer, 10% dropout rate for neurons) so perhaps a larger dataset size is required to totally avoid this.

I then repeated the RL phase twice, once with  being reinforced,  ( = 2) and once with  being reinforced ( = 6). Here is a plot of success rate against model size across all three conditions.

A plot showing success rate against model size on different tasks. Success rate is better than chance in all conditions, and increases with model size in all conditions.

This plot shows mean  standard error. In all cases model performance is a lot better than chance, and increases with model size.

Measuring Optimization

I used a Monte Carlo simulation to measure the nats of optimization that are being applied to  using the split-history method I've previously outlined. This involves taking the difference in entropy between two distributions:

A diagram with the following equation at the top: Op(F, P; A) = H(F'|P'') - H(F). Below are two Bayes nets. The one on the left has nodes P, A,and F with nodes P to A, P to F and A to F. The network on the left has nodes P, P'', A', and F'. It has arrows P' to F', P'' to A', and A' to F'.

The algorithm in practice is this:

  1. Take a bunch of sequence examples from the testing data, ensuring that  is odd.
  2. Feed them into the models to get a value for , append to the sequence.
  3. Use the sequence-generator to get a set of values for 
  4. Look at the entropy of the resulting distribution over , this is the optimized entropy.
  5. For each sequence from the training data, and get a value of  from this.
  6. Repeat Steps 3-4 with the entire data set as  prepended to , get one sample of "unoptimized" entropy
  7. Repeat Steps 5-6 with each sequence from the initial dataset, take the average unoptimized entropy
  8. Optimization = unoptimized entropy - optimized entropy

Here is a schematic illustration:

A diagram showing the process outlined above. The final equation is Op(F, P; A) =  -H(F) + H(F'|P'')
On the left, the model is able to respond to each sequence fed into it, so it can optimize the future creating a low-entropy distribution. On the right, the model is forced to give a fixed response to the external input, which is then appended to the sequences. This means that it cannot optimize the future and the entropy is higher.

I ran this calculation 10 times with 200 sequences in  and took an average to get an idea of the model's optimizing capability. I also tested the sequence-generating function's self-optimization.

A plot showing optimization against model size under three conditions. The fewer potential results are "good", the higher the optimization for a given model size. Optimization increases with model size in all cases.

The fact that the sequence is optimizing itself mostly just amounts to saying that it is not a random walk, which we already knew. It is a good sanity check that all of the models get values either equal to or above this, and that optimization improves with model size.


Optimization vs RL Success Rate

Optimization is calculated as the entropy difference of two distributions. Let us consider three parameters: : the number of possible outcomes; : the proportion of outcomes which are "successes"; and :  the chance that the model achieves a successful outcome.

Assuming the model is "ambivalent" over successful outcomes, and likewise ambivalent over failed outcomes, then the value of  should be equal to . If we then assume that all outcomes are equally likely when the model's outputs are "randomized", then  is just . If we take the difference we get the following expression:

Now I can plot this theoretical value for  against  for , and also plot all of the models on the same axes. Since we see a lot of run-to-run variation in model performance, I'll plot the raw data per model rather than statistics.

A plot showing optimization plotted against success rate for three conditions. Three series of points are shown, each sitting on a similarly-colored curve.

I tried to find some galaxy-brained variable change which would make those three curves the same, but I couldn't. Instead I will just plot the predicted value of  (based on their success rate) for each model against the actual value:

The headline figure is repeated but with annotations. All the points lie very close to the theoretical line y=x.

In theory none of the models should fall below the dashed line representing the equation derived above. In practice they do. Some of this is error in measurements, but I'm sure some of it is also due to errors in my assumptions. In particular the assumption that  is completely flat is unlikely to hold.

On the other hand, there there is no reason at all why the models shouldn't fall quite far above the theoretical line. Consider the  case, so successful values are. If the model ends up with a 85:15 ratio of values 2 and 3 for (and never gets any other number, which is impossible in this specific case but that's not important for my point) then it will have a success rate of 0.85, which implies an  of 0.03, but its actual  will be 1.66!

A plot showing potential different outcomes to training. These outcomes are listed below.
Sorry for the poor aesthetics of this diagram, it's very difficult to get the point across that the bottom left one has the lowest optimization.

For a given training performance (in this case I've tried to make them all ~75% correct) the model might do one of several things:

  1. Optimize for reinforced results and against non-reinforced results in an unbiased way
  2. Optimize against non-reinforced results correctly, but only optimize for some of the successful results
  3. Optimize for reinforced results correctly, but only optimize against some of the non-reinforced results
  4. Behave in such a way that certain reinforced results are optimized against and certain non-reinforced results are optimized for.

The last three are examples of one type of inner misalignment. All three are characterized by excess optimization compared to the "correct" answer.

Cross-Objective Evaluation

The fact that most of the models don't sit far above the line shows that the RL is doing its job! They are (mostly) pursuing the training objective faithfully, without picking up on other objectives. I simulate what might happen if a model picks up on an "incorrect" objective by cross-evaluating models on each others' tasks:

A plot showing Measured Optimization against Score on n_good=4 task. The points corresponding to policies trained on the n_good=4 task lie on the theoretical curve, models trained on other tasks cluster to the left and above it.

Now it becomes obvious that the -trained models are behaving correctly whereas the other models are not. This is important since a model trained on the  task will typically have a good score on the  task, sometimes even higher than a similarly-sized model trained on the  task (!) but this plot clearly shows us that something is going wrong.

It becomes even clearer if we instead take the value of  implied by the formula from above, and plot it against the actual value of . Here are the results for all three tasks:

A plot with predicted optimization on the x axis and measured on the y axis. The purple points for n_good=2 sit on a purple y=x line. The other points mainly sit above the line.
As above but with blue n_good=4 points sitting on a blue line, and the other clusters above.
As above but with yellow n_good = 2 points on the line and other clusters above it.

By comparing implied and measured , we can separate the models trained on a given reward signal from ones trained on a different reward signal, even when one of the reward signals is a subset of the other.

Optimization vs Impact

I will now take a second look at the measure for . What I really wanted all along is to measure a policy's impact on the world, and optimizing the future is only one way in which this can happen. Another way is flattening the possibilities out! Consider the following informal diagram. "Original" distribution is not well-defined here, the point is just to give an intuitive explanation:

A spiky original distribution is subject to two types of impact. With Optimizing impact, the distribution becomes spikier, and lower entropy. With Flattening impact, the distribution becomes flatter and higher entropy.

The motivation and derivations for this can be found in Appendix A, with longer proofs listed in Appendix B, but the upshot is that we can define a new function:  using the KL divergence of  and  like this:

Which if we have a task success rate  in  and  in  must obey the following equation:

This can be measured using the same split-history method used to measure :

A plot showing impact against parameters. Impact increases with parameter count and decreases with n_good

The lower bound  can be calculated from success rates and the two can be compared:

A model showing Measured Impact plotted against Minimum Impact From Score. The points form a plume around y = 2.6x + 0.059

Larger models have a higher impact, but they also tend to have a lower ratio of than do the smaller ones. The single-line fit is also somewhat misleading, and actually the models appear to lie on three separate lines. I note that the values of   are somewhat low in this case, rather than our measured impact being unexpectedly high. Our measure for  may not be perfect in this case.

I think the difference between  and  tells us something. In this case it tells us that the success rates in  are actually quite high:

A plot showing success in F' against parameter count. Models for all values of n_good have better performance than chance, and it seems to slightly scale with model size.

Success rate in  doesn't scale very much at all with model size, which indicates that models of all sizes exhibit roughly the same bias in outputs, based on statistical regularities in the data (e.g. a "13" is more commonly followed by a "2" than a "15" is). It might also be an artefact stemming from an uneven distribution of numbers in our sequence by default.

This highlights an important point about split-history measures: they only pick up on the effects of "active" inference in the policy, not on biases in the outputs which are produced during training.  We don't have to use a split-history method to generate , we can think up plenty of alternative ways to generate a comparison future, and therefore generate a family of measures for  and :

  • Use a privileged "Do nothing" token as the model's output
  • Use a totally random output token
  • Use an untrained model, or a pretrained (but not finetuned) model

Despite these problems, if we repeat cross-objective evaluation using  we get these plots:

A plot showing measured Impact against Imp_min based on the n_good=2 task. Points trained on the "correct" task form a plume above the line y=x, whereas points trained on the n_good=4 task are squished over to the left.
The  models don't appear because they do so badly on the task that  cannot be calculated. 
As above but the Imp_min is based on score on the n_good=4 task. The correct policies lie above the line y=x, the n_good=2 models lie far above them, and the n_good=6 lines are squished over to the left

While the split is not as good as we saw with , it seems that models not trained on the "correct" fall along a line of lower gradient, and models trained on "incorrect" tasks have higher gradients i.e. the latter diverge from the  line more rapidly.

There are many potential ways to visualize what's going on here, two more of which are in Appendix C at the end of this post.

Model Self-Evaluation

A secondary goal of my work was to get models to evaluate their own optimizing power and impact. This is the reason for the maintenance training during finetuning. The self-evaluation scheme is the same as the optimization measuring scheme, except that instead of using the sequence-generating formula to get values for , I used the model's own predictive power to get expected distributions over  and :

A plot showing measured optimization plotted against self-estimated optimzation. Models from the  task tend to fall on the line , but as  decreases, models fall further above the line.

I suspect what's happened here is bog standard model bias. The more heavily the model is optimizing, the less even the distribution of examples of  it gets fed during the maintenance training. This might make it more biased towards predicting the rewarded variables after any sequence, which would cause it to underestimate .

The same can be done for :

Which gives even worse results! I'm not sure whether or not this is also model bias. I don't have a tremendous amount of experience with ML, so I will not pursue this further in this post.


The measure I've introduced for  seems to be predicted very well by theory, and might have utility in detecting how well a policy's results line up with an RL task when it can be predicted accurately. The measure for  has some issues but shows some success towards the same goal. These can be considered complementary measures in many ways:

Can be positive or negative, the sign gives us informationAlways 
Can be cancelled out by opposite-signed optimization elsewhereAlways a positive value, always at least adds up across different sources
No easy way to calculate theoretical values of Op in general (that I've found) can be calculated for any success/failure RL task
Excellent at identifying model behaviour in this caseLess good at identifying model behaviour in this case

I would be interested in applying both of these to more complex ML systems, but at the moment I don't any specific plans to do so, due to my very limited experience with ML. In Appendix A I also present equivalent measures which can be applied to the activations of a neural network.

A theoretical limitation of this work is that  can only be calculated when a task is binary success/failure. If the policy is trying to maximize some function world-state then there's no simple formula for  based on RL score. Work on these systems must either develop new maths or or find a way to break down utility maximization into a series of success/failure tasks, which can separately be evaluated for impact.

One practical limitation is in the self-evaluation stage. If the model's self-evaluation is biased, the measures of  and  will be as well. This is especially problematic if  or  is underestimated, allowing the policy to "lie" to us about how much optimization/impact it is doing. My wild guess is that AIs which include explicit world-modelling (such as DreamerV3) will be less biased by default, but that attempts at introducing impact regularization might reintroduce bias.

Alex Turner's Existing Work

The expression for impact that I present here is totally independent with the AUP measure of impact as proposed by Alex Turner in Towards a New Impact Measure. There might be some subtle way in which they're related but I haven't thought about this enough to say more. I've also read World State is the Wrong Abstraction for Impact and agree with some of the points presented.

In response I would say that the metric I present here relies strongly on a model  of the future world state, so only details captured in  can affect the impact. In the limit where the future states only consist of , then impact is trivially equal to the lower bound and excess impact = 0.


Appendix A: Impact and Differential Impact

Derivation of 

I present the motivation for, and derivation of, the measure of  which I've used in this post. Returning to my estimator for :

If we think these being two distributions over  and think of  as the probability of succeeding by chance, this becomes the formula for a Kullback-Liebler divergence. For a quick recap, the KL divergence of two distributions over a variable  follows this formula:

We could also estimate the probability of succeeding by chance using the split-history method. Let  be the success rate in , and  be the success rate in :

I will now present some results about KL divergences. I will start by defining  as the successful outcomes. Let us define a "baseline" probability distribution  and from it a "baseline" success rate :


Now imagine a policy acts on this distribution  and changes it to a new distribution. If this policy has a success rate , I define  as follows:


This is what we might expect to be the effects of a "minimum impact" policy: it makes successful outcomes more likely and unsuccesful outcomes less likely while leaving relative probabilities otherwise intact. The KL-divergence between  and  can then be calculated, and it looks familiar:

This is the minimum KL divergence possible in shifting a distribution to achieve a given success rate. If we change the variable names to  this allows us to write down the original relation for a policy's impact:

Differential Impact

In this case "differential" means "to do with differentiation". I've studied a construct I call differential optimization in the past, as it pertains to functions of real-valued variables. In this case if we have the functions  we can define the following value:

Intuitively, if  is "optimizing" , then when it is allowed to vary,  since  will change less when we allow  to change than when we fix . This led to the derivation of the differential optimization .

This can be extended to a differential impact metric:


This has a minimum  at , but it is speculative and has not been tested, so while I will present it here I can give no guarantees at all about its utility.

We can also extend this to vector-valued . If we define  as the Jacobian when  varies, and  as the Jacobian when  is constant, then  we get the following values for  and :

If  and  do not have the same dimension, then  does not exist and instead the following construction must be used:

The motivation for constructions like this is to apply them to the activations of neural networks. For a network with width , and a backpropagation time of , I believe the time-complexity of this contains a polynomial term in  (possibly  if using Bareiss algorithm) for the matrix inverse, and a term in .

Appendix B: Proofs

Derivation and proof of 

I will prove that this choice of  is a global minimum of  for a fixed .

Consider a distribution , which involves moving some amount of probability mass  from  to . Without loss of generality, take both to be in  (they must both be in either  or  so that  holds) Consider the value of

Trivially we only need to look at the components relevant to  and :

Expand values of :

Expand and collect factors of , cancelling the  on the bottom:

Collect the factors of , expand stuff to a  form:

Use the taylor expansion  up to .

Sub in , expand, cancel:


Therefore  is a local minimum of  subject to our condition that  is convex in  for fixed , therefore we have found the unique global minimum.

Derivation of Differential Impact

The measure of  based on entropy that I've used here was based on the following comparison to differential optimization:

Consider the network . This gives  and .

This can be extended to an entropic measure of  by considering uncertainty over , specifically:


Using split-histories we get:

If we take  this gives the familiar value of . We may instead investigate the value of . Letting  for brevity:


Taking  with respect to  requires taking :

Substituting into our original equation:

Which, if we extend to , gives

Derivation of Multivariate Differential Impact and Optimization

Let us take vectors , and  in the same manner as above. Assume around some value of  we have the following Jacobians.


Without loss of generality, take the means of all of these variables to be . There exists a formula for transforming a multivariate normal distribution[1].

Now for , the mean will no longer be zero:

We can calculate the KL divergence of  using another formula[2]

Therefore our impact will be:

Taking the expected value of the third component is actually easy if you have access to the internet. We can see that it is of the form  where  is multivariate normal. This has a closed-form solution[3]:

Therefore we have the following expression:

We can make some progress towards simplifying this if we take , which in this case lets us cancel everything out involving a , since the scalar value of  commutes with all matrices, , and . We will also assume that all the Jacobians are invertible.

Using the cyclic property of the trace:

And if we define  we get:

This seems to have the form of , and in fact if we consider , and  to just be concatenations of variables, which maeans all the  matrices are diagonal, we see that our equation has the form.

Which is a nice sanity check. The value of  is just the entropy difference  which simplifies to  for free.

If the Jacobians are not invertible, but we assume that  is invertible, we instead get:

Appendix C: Supplementary Plots

Other Ways to Visualize Impact Plots

Here I plotted the "Impact ratio"  against :

The next six plots are just re-visualizations of previously plotted and discussed data, so I'm not going to write a bespoke description for each one.

Here I plotted "Excess Impact"  against :


Example training runs from  = 4

There's 16 plots here, but none are important to the key point. Screen reader users I'm afraid you'll have to get someone else to nitpick the raw data.

Example figures summarizing training runs:

  1. ^

  2. ^

  3. ^

New Comment