Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is original, independent research carried out in March and April of 2024.

The degree to which a policy optimizes the future can be quantified mathematically. A set of very small transformer models was pretrained to predict the next token in a mathematical sequence, then subjected to reinforcement learning finetuning.

The optimizing power of each model can be predicted with high accuracy based on each model's score on its own RL task. By comparing predictions of optimization based on scores on each different RL task, a model's original reinforcement objective can be identified.

A related measure for impact can also be derived mathematically, and given a theoretical lower bound based on RL score. This gives further information about model behavior, and allows for the same analysis as the measure of optimization.

I also investigate the possibility of getting models to self-evaluate optimization and impact, with limited success.

Methods

Pretraining on Sequence Prediction

I defined a simple mathematical sequence using the following stochastic recurrence relation. This produces a pseudo-random but (to 98%) predictable sequence, alternating between elements of {0,...,7} on even values of t and {8,...,15} on odd values of t.

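$$s_t = \begin{cases} \left(\left(\prod_{i=1}^{16}(s_{t-i}+1)\right) \bmod 17\right) \bmod 8 & \text{with probability } 98\% \\ \in \{0,\dots,7\} & \text{with probability } 2\% \end{cases} \; + \; 8 \times (t \bmod 2)$$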
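As a rough illustration, the recurrence can be sketched in a few lines of Python. This is not the code used for the experiments: the function names are made up, the 2% branch is assumed to draw uniformly from {0,...,7}, and the way the first 16 elements are seeded is also an assumption, since the post does not specify either.

```python
import random


def next_element(history, t, rng, p_noise=0.02):
    """Sample s_t from the 16 elements that precede it."""
    if rng.random() < p_noise:
        # Assumption: the 2% branch is a uniform draw from {0,...,7}.
        base = rng.randrange(8)
    else:
        prod = 1
        for s in history[-16:]:
            prod = (prod * (s + 1)) % 17
        base = prod % 8
    # Odd positions are shifted into {8,...,15}.
    return base + 8 * (t % 2)


def generate_sequence(length, seed=0):
    """Generate a sequence of the given length (length >= 16)."""
    rng = random.Random(seed)
    # Assumption: the first 16 elements are random with the right parity.
    seq = [rng.randrange(8) + 8 * (t % 2) for t in range(16)]
    for t in range(16, length):
        seq.append(next_element(seq, t, rng))
    return seq


print(generate_sequence(24))
```

Reducing the running product modulo 17 at each step gives the same result as reducing the full product once, and keeps the intermediate values small.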

I then trained a small encoder-only transformer model to predict the next element in the sequence given the previous 20 elements of the sequence.

This was followed by a reinforcement-learning phase in which the transformer was used to generate the next token on odd values of t only, and the recurrence relation was used to generate the value of $s_{t+1}$. If $s_{t+1}$ was in {0,2,4,6}, this was used as a "successful" example to reinforce the model. I used a temperature of 1 when generating these sequences to introduce some randomness, but the temperature was reduced to 0 during evaluations and when calculating optimization.

A small amount of "maintenance" training (much lower learning rate) was used during this phase to ensure that model performance on the predictive tasks for even values of t was maintained. Without this I saw rapid loss of performance on the "maintenance" dataset. I also found that I was unable to include "unsuccessful" examples (i.e. where $s_{t+1} \notin \{0,2,4,6\}$) with even a tiny negative learning rate, as this degraded performance on all tasks. Here is a typical set of results from training and evaluation:

I carried out this training on N=5 models per size for four model sizes between 18k and 402k parameters, giving the following plot:

Pretraining loss increases over the last few model sizes, and the loss/time plots (some of which I have put in the Supplementary Information at the bottom of this post) showed signs of overfitting in the large models. Regularization was employed during training (0.01 weight decay in an AdamW optimizer, 10% dropout rate for neurons) so perhaps a larger dataset size is required to totally avoid this.

I then repeated the RL phase twice, once with $s_{t+1} \in \{0,4\}$ being reinforced (ngood = 2) and once with $s_{t+1} \in \{0,1,2,4,5,6\}$ being reinforced (ngood = 6). Here is a plot of success rate against model size across all three conditions.

This plot shows mean ± standard error. In all cases model performance is a lot better than chance, and increases with model size.

Measuring Optimization

I used a Monte Carlo simulation to measure the nats of optimization that are being applied to $s_{t+1}$ using the split-history method I've previously outlined. This involves taking the difference in entropy between two distributions:

$$\mathrm{Op}(F,P;A) = H(F'|P'') - H(F)$$

The algorithm in practice is this:

1. Take a bunch of sequence examples from the testing data, ensuring that t is odd.
2. Feed them into the models to get a value for $s_t$, and append it to the sequence.
3. Use the sequence-generator to get a set of values for $s_{t+1}$.
4. Look at the entropy of the resulting distribution over $s_{t+1}$; this is the optimized entropy.
5. Take a sequence from the training data, and get a value of $s'_t$ from it.
6. Repeat Steps 3-4 with the entire data set as $\dots, s_{t-2}, s_{t-1}$ prepended to $s'_t$, to get one sample of "unoptimized" entropy.
7. Repeat Steps 5-6 with each sequence from the initial dataset, and take the average unoptimized entropy.
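The steps above can be sketched as follows. This is a minimal illustration rather than the code used for the experiments: `policy` is a hypothetical callable standing in for the trained transformer, `generate_next` re-implements the recurrence under the assumption that the 2% branch is uniform, and a single pool of histories is used for both the optimized and the spliced futures.

```python
import math
import random
from collections import Counter


def entropy(samples):
    """Empirical Shannon entropy (in nats) of a list of outcomes."""
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in Counter(samples).values())


def generate_next(history, t, rng, p_noise=0.02):
    """Sample s_t from the recurrence, given the elements before it."""
    if rng.random() < p_noise:
        base = rng.randrange(8)  # assumption: the 2% branch is uniform
    else:
        prod = 1
        for s in history[-16:]:
            prod = (prod * (s + 1)) % 17
        base = prod % 8
    return base + 8 * (t % 2)


def estimate_op(policy, histories, t, rng, p_noise=0.02):
    """Split-history estimate of Op = H(F'|P'') - H(F), in nats.

    policy    : hypothetical callable mapping a history to the token s_t
    histories : list of sequences, each ending at position t - 1 (t odd)
    """
    # Steps 2-4: continue each history with its own action.
    actions = [policy(h) for h in histories]
    optimized = [generate_next(h + [a], t + 1, rng, p_noise)
                 for h, a in zip(histories, actions)]
    h_optimized = entropy(optimized)

    # Steps 5-7: splice one history's action onto every history,
    # measure the entropy, and average over all choices of action.
    unoptimized = []
    for a_prime in actions:
        futures = [generate_next(h + [a_prime], t + 1, rng, p_noise)
                   for h in histories]
        unoptimized.append(entropy(futures))
    return sum(unoptimized) / len(unoptimized) - h_optimized
```

A policy that ignores its input should score roughly zero here, since splicing its action onto a different history changes nothing; a policy that uses the history to steer $s_{t+1}$ should score above zero.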

I ran this calculation 10 times with 200 sequences in P and took an average to get an idea of the model's optimizing capability. I also tested the sequence-generating function's self-optimization.

The fact that the sequence is optimizing itself mostly just amounts to saying that it is not a random walk, which we already knew. It is a good sanity check that all of the models get values either equal to or above this, and that optimization improves with model size.

Results

Optimization vs RL Success Rate

Optimization is calculated as the entropy difference of two distributions. Let us consider three parameters: n: the number of possible outcomes; p: the proportion of outcomes which are "successes"; and s: the chance that the model achieves a successful outcome.

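Assuming the model is "ambivalent" over successful outcomes, and likewise ambivalent over failed outcomes, then the value of $H(F)$ should be equal to $-s\ln\left(\frac{s}{np}\right)-(1-s)\ln\left(\frac{1-s}{n(1-p)}\right)$. If we then assume that all outcomes are equally likely when the model's outputs are "randomized", then $H(F'|P'')$ is just $-\ln\left(\frac{1}{n}\right)$. If we take the difference we get the following expression:

$$\mathrm{Op} \approx s\ln\left(\frac{s}{np}\right)+(1-s)\ln\left(\frac{1-s}{n(1-p)}\right)-\ln\left(\frac{1}{n}\right)$$

$$\mathrm{Op} \approx s\ln\frac{s}{p}+s\ln\frac{1}{n}+(1-s)\ln\frac{1-s}{1-p}+\ln\frac{1}{n}-s\ln\frac{1}{n}-\ln\frac{1}{n}$$

$$\mathrm{Op} \approx s\ln\frac{s}{p}+(1-s)\ln\frac{1-s}{1-p}$$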
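For concreteness, the final expression can be evaluated with a small helper. The function name `predicted_op` is made up for this sketch, and it assumes 0 < p < 1 and uses the usual convention that 0·ln(0) = 0.

```python
import math


def predicted_op(s, p):
    """Op (in nats) implied by success rate s when a fraction p of outcomes succeed."""
    def term(a, b):
        # a * ln(a / b), with the convention 0 * ln(0) = 0
        return 0.0 if a == 0 else a * math.log(a / b)
    return term(s, p) + term(1 - s, 1 - p)


for p in (0.25, 0.5, 0.75):
    print(p, [round(predicted_op(s, p), 3) for s in (0.5, 0.75, 0.9, 1.0)])
```

The value is zero when s = p (the model does no better than chance) and rises to ln(1/p) when s = 1.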

Now I can plot this theoretical value for Op against s for p∈{0.25,0.5,0.75}, and also plot all of the models on the same axes. Since we see a lot of run-to-run variation in model performance, I'll plot the raw data per model rather than statistics.

I tried to find some galaxy-brained variable change which would make those three curves the same, but I couldn't. Instead I will just plot the predicted value of Op (based on their success rate) for each model against the actual value:

In theory none of the models should fall below the dashed line representing the equation derived above. In practice they do. Some of this is error in measurements, but I'm sure some of it is also due to errors in my assumptions. In particular the assumption that H(F′|P′′) is completely flat is unlikely to hold.

On the other hand, there is no reason at all why the models shouldn't fall quite far above the theoretical line. Consider the ngood=6 case, so successful values are $s_{t+1} \in \{0,1,2,4,5,6\}$. If the model ends up with an 85:15 ratio of values 2 and 3 for $s_{t+1}$ (and never gets any other number, which is impossible in this specific case but that's not important for my point) then it will have a success rate of 0.85, which implies an Op of 0.03, but its actual Op will be 1.66!
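Both numbers in this example can be checked by hand, assuming $n = 8$ possible outcomes, $p = 6/8$, and a flat comparison distribution so that $H(F'|P'') = \ln 8$:

$$\mathrm{Op}_{\text{implied}} = 0.85\ln\frac{0.85}{0.75} + 0.15\ln\frac{0.15}{0.25} \approx 0.106 - 0.077 \approx 0.03$$

$$\mathrm{Op}_{\text{actual}} = \ln 8 - \left(-0.85\ln 0.85 - 0.15\ln 0.15\right) \approx 2.079 - 0.423 \approx 1.66$$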

For a given training performance (in this case I've tried to make them all ~75% correct) the model might do one of several things:

1. Optimize for reinforced results and against non-reinforced results in an unbiased way.
2. Optimize against non-reinforced results correctly, but only optimize for some of the successful results.
3. Optimize for reinforced results correctly, but only optimize against some of the non-reinforced results.
4. Behave in such a way that certain reinforced results are optimized against and certain non-reinforced results are optimized for.

The last three are examples of one type of inner misalignment. All three are characterized by excess optimization compared to the "correct" answer.

Cross-Objective Evaluation

The fact that most of the models don't sit far above the line shows that the RL is doing its job! They are (mostly) pursuing the training objective faithfully, without picking up on other objectives. I simulate what might happen if a model picks up on an "incorrect" objective by cross-evaluating models on each others' tasks:

Now it becomes obvious that the ngood=4-trained models are behaving correctly whereas the other models are not. This is important since a model trained on the ngood=2 task will typically have a good score on the ngood=4 task, sometimes even higher than a similarly-sized model trained on the ngood=4 task (!) but this plot clearly shows us that something is going wrong.

It becomes even clearer if we instead take the value of Op implied by the formula from above, and plot it against the actual value of Op. Here are the results for all three tasks:

By comparing implied and measured Op, we can separate the models trained on a given reward signal from ones trained on a different reward signal, even when one of the reward signals is a subset of the other.

Optimization vs Impact

I will now take a second look at the measure for Op. What I really wanted all along is to measure a policy's impact on the world, and optimizing the future is only one way in which this can happen. Another way is flattening the possibilities out! Consider the following informal diagram. "Original" distribution is not well-defined here, the point is just to give an intuitive explanation:

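The motivation and derivations for this can be found in Appendix A, with longer proofs listed in Appendix B, but the upshot is that we can define a new function $\mathrm{Imp}(F,P;A)$ using the KL divergence of $F$ and $F'$ like this:

$$\mathrm{Imp}(F,P;A) \equiv D_{KL}(F \,\|\, F'|P'')$$

Which, if we have a task success rate $s$ in $F$ and $s'$ in $F'$, must obey the following inequality:

$$\mathrm{Imp}(F,P;A) \geq s\ln\frac{s}{s'}+(1-s)\ln\frac{1-s}{1-s'} \equiv \mathrm{Imp}_{min}(F,P;A)$$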
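A small numerical illustration of this bound is below. It is a sketch, not the measurement code: the two distributions are made-up numbers over the eight possible values of $s_{t+1}$, the success set {0,2,4,6} is borrowed from the ngood=4 task, and the averaging over histories that the split-history method performs is left out.

```python
import math


def kl_divergence(f, f_prime):
    """D_KL(f || f_prime) in nats, for two distributions given as lists."""
    return sum(a * math.log(a / b) for a, b in zip(f, f_prime) if a > 0)


def imp_min(s, s_prime):
    """Lower bound on impact implied by the two success rates."""
    return kl_divergence([s, 1 - s], [s_prime, 1 - s_prime])


success = {0, 2, 4, 6}
f = [0.30, 0.02, 0.25, 0.03, 0.20, 0.05, 0.10, 0.05]  # made-up optimized future
f_prime = [1 / 8] * 8                                  # made-up comparison future

s = sum(f[x] for x in success)
s_prime = sum(f_prime[x] for x in success)
imp = kl_divergence(f, f_prime)
bound = imp_min(s, s_prime)
print(f"Imp = {imp:.3f}, Imp_min = {bound:.3f}")
```

The gap between the two numbers comes from `f` being uneven within the successful outcomes and within the failed ones, which the success rate alone does not see.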

This can be measured using the same split-history method used to measure Op:

The lower bound Impmin can be calculated from success rates and the two can be compared:

Larger models have a higher impact, but they also tend to have a lower ratio of Imp/Impmin than do the smaller ones. The single-line fit is also somewhat misleading, and actually the models appear to lie on three separate lines. I note that the values of Impmin are somewhat low in this case, rather than our measured impact being unexpectedly high. Our measure for Imp may not be perfect in this case.

I think the difference between Imp and Impmin tells us something. In this case it tells us that the success rates in F′ are actually quite high:

Success rate in F′ doesn't scale very much at all with model size, which indicates that models of all sizes exhibit roughly the same bias in outputs, based on statistical regularities in the data (e.g. a "13" is more commonly followed by a "2" than a "15" is). It might also be an artefact stemming from an uneven distribution of numbers in our sequence by default.

This highlights an important point about split-history measures: they only pick up on the effects of "active" inference in the policy, not on biases in the outputs which are produced during training. We don't have to use a split-history method to generate F′, we can think up plenty of alternative ways to generate a comparison future, and therefore generate a family of measures for Imp and Op:

- Use a privileged "Do nothing" token as the model's output
- Use a totally random output token
- Use an untrained model, or a pretrained (but not finetuned) model

Despite these problems, if we repeat cross-objective evaluation using Imp we get these plots:

While the split is not as good as we saw with Op, it seems that models trained on the "correct" task fall along a line of lower gradient, and models trained on "incorrect" tasks have higher gradients, i.e. the latter diverge from the Impmin line more rapidly.

There are many potential ways to visualize what's going on here, two more of which are in Appendix C at the end of this post.

Model Self-Evaluation

A secondary goal of my work was to get models to evaluate their own optimizing power and impact. This is the reason for the maintenance training during finetuning. The self-evaluation scheme is the same as the optimization measuring scheme, except that instead of using the sequence-generating formula to get values for $s_{t+1}$, I used the model's own predictive power to get expected distributions over $s_{t+1}$ and $s'_{t+1}$:

A plot showing measured optimization plotted against self-estimated optimization. Models from the ngood=6 task tend to fall on the line y=x, but as ngood decreases, models fall further above the line.

I suspect what's happened here is bog standard model bias. The more heavily the model is optimizing, the less even the distribution of examples of $s_{t+1}$ it gets fed during the maintenance training. This might make it more biased towards predicting the rewarded variables after any sequence, which would cause it to underestimate Op.

The same can be done for Imp:

Which gives even worse results! I'm not sure whether or not this is also model bias. I don't have a tremendous amount of experience with ML, so I will not pursue this further in this post.

Conclusions

The measure I've introduced for Op seems to be predicted very well by theory, and might have utility in detecting how well a policy's results line up with an RL task when it can be predicted accurately. The measure for Imp has some issues but shows some success towards the same goal. These can be considered complementary measures in many ways:

| Optimization | Impact |
| --- | --- |
| Can be positive or negative, the sign gives us information | Always ≥0 |
| Can be cancelled out by opposite-signed optimization elsewhere | Always a positive value, always at least adds up across different sources |
| No easy way to calculate theoretical values of Op in general (that I've found) | Impmin can be calculated for any success/failure RL task |
| Excellent at identifying model behaviour in this case | Less good at identifying model behaviour in this case |

I would be interested in applying both of these to more complex ML systems, but at the moment I don't have any specific plans to do so, due to my very limited experience with ML. In Appendix A I also present equivalent measures which can be applied to the activations of a neural network.

A theoretical limitation of this work is that Impmin can only be calculated when a task is binary success/failure. If the policy is trying to maximize some function of world-state then there's no simple formula for Impmin based on RL score. Work on these systems must either develop new maths or find a way to break down utility maximization into a series of success/failure tasks, which can separately be evaluated for impact.

One practical limitation is in the self-evaluation stage. If the model's self-evaluation is biased, the measures of Op and Imp will be as well. This is especially problematic if Op or Imp is underestimated, allowing the policy to "lie" to us about how much optimization/impact it is doing. My wild guess is that AIs which include explicit world-modelling (such as DreamerV3) will be less biased by default, but that attempts at introducing impact regularization might reintroduce bias.

Alex Turner's Existing Work

The expression for impact that I present here was developed independently of the AUP measure of impact proposed by Alex Turner in Towards a New Impact Measure. There might be some subtle way in which they're related but I haven't thought about this enough to say more. I've also read World State is the Wrong Abstraction for Impact and agree with some of the points presented.

In response I would say that the metric I present here relies strongly on a model F of the future world state, so only details captured in F can affect the impact. In the limit where the future states only consist of f∈{success,failure}, then impact is trivially equal to the lower bound and excess impact = 0.

Appendices

Appendix A: Impact and Differential Impact

Derivation of Imp

I present the motivation for, and derivation of, the measure of Imp which I've used in this post. Returning to my estimator for Op:

$$\mathrm{Op} \approx s\ln\frac{s}{p}+(1-s)\ln\frac{1-s}{1-p}$$

If we think of these as two distributions over {success, failure} and think of $p$ as the probability of succeeding by chance, this becomes the formula for a Kullback-Leibler divergence. For a quick recap, the KL divergence of two distributions over a variable $x \in \chi$ follows this formula:

$$D_{KL}(P \,\|\, Q) = \sum_{x \in \chi} P(x)\ln\frac{P(x)}{Q(x)}$$

We could also estimate the probability of succeeding by chance using the split-history method. Let $s$ be the success rate in $F$, and $s'$ be the success rate in $F'$:

$$s\ln\frac{s}{s'}+(1-s)\ln\frac{1-s}{1-s'}$$

I will now present some results about KL divergences. I will start by defining $\chi_s \subset \chi$ as the successful outcomes. Let us define a "baseline" probability distribution $P(x)$ and from it a "baseline" success rate $s_0$:

$$s_0 = \sum_{x_s \in \chi_s} P(x_s)$$

Now imagine a policy acts on this distribution $P(x)$ and changes it to a new distribution. If this policy has a success rate $s \in [s_0, 1]$, I define $Q(x)$ as follows:

$$Q(x) = \begin{cases} \frac{s}{s_0}P(x) & x \in \chi_s \\ \frac{1-s}{1-s_0}P(x) & x \notin \chi_s \end{cases}$$

This is what we might expect to be the effects of a "minimum impact" policy: it makes successful outcomes more likely and unsuccessful outcomes less likely while leaving relative probabilities otherwise intact. The KL-divergence between $Q$ and $P$ can then be calculated, and it looks familiar:

$$D_{KL}(Q \,\|\, P) = s\ln\frac{s}{s_0}+(1-s)\ln\frac{1-s}{1-s_0}$$

This is the minimum KL divergence possible in shifting a distribution to achieve a given success rate. If we change the variable names to $Q \to F$, $P \to F'|P''$, $s_0 \to s'$ this allows us to write down the original relation for a policy's impact:

$$\mathrm{Imp}(F,P;A) \equiv D_{KL}(F \,\|\, F'|P'') \geq s\ln\frac{s}{s'}+(1-s)\ln\frac{1-s}{1-s'} \equiv \mathrm{Imp}_{min}(F,P;A)$$
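The construction of $Q$ can be checked numerically. The sketch below uses a made-up baseline distribution over eight outcomes and a made-up target success rate; `min_impact_shift` is a name invented for this example.

```python
import math


def kl_divergence(q, p):
    """D_KL(q || p) in nats, for two distributions given as lists."""
    return sum(a * math.log(a / b) for a, b in zip(q, p) if a > 0)


def min_impact_shift(p, success, s):
    """Rescale p so the success set has total mass s, keeping relative odds."""
    s0 = sum(p[x] for x in success)
    q = [p[x] * (s / s0 if x in success else (1 - s) / (1 - s0))
         for x in range(len(p))]
    return q, s0


p = [0.20, 0.10, 0.05, 0.15, 0.10, 0.20, 0.05, 0.15]  # made-up baseline
success = {0, 2, 4, 6}
s = 0.8                                                # made-up target rate

q, s0 = min_impact_shift(p, success, s)
binary = s * math.log(s / s0) + (1 - s) * math.log((1 - s) / (1 - s0))

# Moving mass between two successful outcomes keeps s fixed but costs extra KL.
r = list(q)
r[0] -= 0.01
r[2] += 0.01

print(kl_divergence(q, p), binary, kl_divergence(r, p))
```

The first two printed values should agree up to floating-point error, and the third should be larger.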

In this case "differential" means "to do with differentiation". I've studied a construct I call differential optimization in the past, as it pertains to functions of real-valued variables. In this case if we have the functions A=f(P), F=g(A,P)≡h(P) we can define the following value:

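$$C(F,P;A) = \left.\frac{\partial F}{\partial P}\right|_{A\text{ varies}} \Bigg/ \left.\frac{\partial F}{\partial P}\right|_{A\text{ constant}} \equiv \frac{dh}{dP} \Bigg/ \frac{\partial g}{\partial P}$$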

Intuitively, if A is "optimizing" F, then when it is allowed to vary, C≤1 since F will change less when we allow A to change than when we fix A. This led to the derivation of the differential optimization Op=−ln(C).

This can be extended to a differential impact metric:

Imp≡−ln(C)+C(C−1)

This has a minimum Imp=0 at C=1, but it is speculative and has not been tested, so while I will present it here I can give no guarantees at all about its utility.
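As a rough illustration of the scalar definitions, $C$ can be estimated by finite differences. This is a sketch with made-up function names; the test system $A=P$, $F=P-(1-k)A$ is the one used later in this appendix, for which $C=k$.

```python
import math


def differential_c(f, g, p, eps=1e-6):
    """C = (dF/dP with A free to vary) / (dF/dP with A held constant)."""
    # Total derivative: A = f(P) is recomputed at each perturbed P.
    total = (g(f(p + eps), p + eps) - g(f(p - eps), p - eps)) / (2 * eps)
    # Partial derivative: A is frozen at its unperturbed value.
    a_fixed = f(p)
    partial = (g(a_fixed, p + eps) - g(a_fixed, p - eps)) / (2 * eps)
    return total / partial


def differential_op(c):
    return -math.log(c)


def differential_imp(c):
    return -math.log(c) + c * (c - 1)


k = 0.5
c = differential_c(lambda p: p, lambda a, p: p - (1 - k) * a, p=1.3)
print(c, differential_op(c), differential_imp(c))
```

For k = 0.5 this should give C ≈ 0.5, Op ≈ ln 2 and Imp ≈ ln 2 − 0.25.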

We can also extend this to vector-valued $P, F$. If we define $J_h$ as the Jacobian when $A$ varies, and $J_g$ as the Jacobian when $A$ is constant, then with $C = J_g^{-1}J_h$ we get the following values for Op and Imp:

$$\mathrm{Op} = -\ln|C| \equiv \ln|J_g| - \ln|J_h|$$

$$\mathrm{Imp} = -\ln|C| + \mathrm{Tr}(CC^T) - \mathrm{Tr}(C)$$

If $P$ and $F$ do not have the same dimension, then $J_g^{-1}$ does not exist and instead the following construction must be used:

The motivation for constructions like this is to apply them to the activations of neural networks. For a network with width $w$, and a backpropagation time of $t$, I believe the time-complexity of this contains a polynomial term in $w$ (possibly $O(w^5(\log w)^2)$ if using the Bareiss algorithm) for the matrix inverse, and a term in $w \times t$.

I will prove that this choice of $Q$ is a global minimum of $D_{KL}(Q \,\|\, P)$ for a fixed $P$.

Consider a distribution $R = Q + \delta Q$, which involves moving some amount of probability mass $\delta$ from $x_1$ to $x_2$. Without loss of generality, take both to be in $\chi_s$ (they must both be in either $\chi_s$ or $\chi_s^C$ so that $R(x \in \chi_s) = s$ holds). Consider the value of

$$D_{KL}(R \,\|\, P) - D_{KL}(Q \,\|\, P)$$

Trivially we only need to look at the components relevant to x1 and x2:
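One way to carry out this step is sketched here, using $Q_i \equiv Q(x_i)$ and $P_i \equiv P(x_i)$ as shorthand that is not used elsewhere in the post. Expanding the two affected terms to second order in $\delta$:

$$D_{KL}(R \,\|\, P) - D_{KL}(Q \,\|\, P) = (Q_1-\delta)\ln\frac{Q_1-\delta}{P_1} + (Q_2+\delta)\ln\frac{Q_2+\delta}{P_2} - Q_1\ln\frac{Q_1}{P_1} - Q_2\ln\frac{Q_2}{P_2}$$

$$= \delta\left(\ln\frac{Q_2}{P_2} - \ln\frac{Q_1}{P_1}\right) + \frac{\delta^2}{2}\left(\frac{1}{Q_1}+\frac{1}{Q_2}\right) + O(\delta^3)$$

Since both points lie in $\chi_s$, we have $Q_1/P_1 = Q_2/P_2 = s/s_0$, so the first-order term vanishes and the second-order term is positive.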

Therefore $Q(x)$ is a local minimum of $D_{KL}(Q \,\|\, P)$ subject to our condition that $Q(x \in \chi_s) = s$. $D_{KL}(Q \,\|\, P)$ is convex in $Q$ for fixed $P$, therefore we have found the unique global minimum.

Derivation of Differential Impact

The measure of Op based on entropy that I've used here was based on the following comparison to differential optimization:

Consider the network A=P, F=P−(1−k)A. This gives C(F,P;A)=k and Op(F,P;A)=−lnk.

This can be extended to an entropic measure of Op by considering uncertainty over P, specifically:

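$$P \sim N(\mu_P, \sigma_P)$$

$$A \sim N(\mu_P, \sigma_P)$$

$$F \sim N(k\mu_P, k\sigma_P)$$

Using split-histories we get:

$$P', P'' \sim N(\mu_P, \sigma_P)$$

$$A' \sim N(\mu_P, \sigma_P)$$

$$F'|P''=p'' \sim N(\mu_P-(1-k)p'', \sigma_P)$$

If we take $\mathrm{Op}(F,P;A) = H(F'|P'') - H(F)$ this gives the familiar value of $-\ln k$. We may instead investigate the value of $\mathrm{Imp}(F,P;A) = D_{KL}(F \,\|\, F'|P'')$. Letting $F \sim N(\mu_1, \sigma_1)$, $F'|P'' \sim N(\mu_2, \sigma_2)$ for brevity:

$$\mathrm{Imp} = \ln\left(\frac{\sigma_2}{\sigma_1}\right) + \frac{1}{2}\,\frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{\sigma_2^2} - \frac{1}{2}$$

Substituting:

$$\mu_1 - \mu_2 = k\mu_P - \mu_P + (1-k)p'' = (1-k)p'' - (1-k)\mu_P$$

$$(\mu_1-\mu_2)^2 = (1-k)^2(\mu_P^2 - 2\mu_P p'' + p''^2)$$

Taking $E((\mu_1-\mu_2)^2)$ with respect to $p''$ requires taking $E(p'') = \mu_P$, $E(p''^2) = \mu_P^2 + \sigma_P^2$:

$$E((\mu_1-\mu_2)^2) = (1-k)^2(\mu_P^2 - 2\mu_P^2 + \mu_P^2 + \sigma_P^2) = (1-k)^2\sigma_P^2$$

$$\sigma_1 = k\sigma_P$$

$$\sigma_2 = \sigma_P$$

Substituting into our original equation:

$$\mathrm{Imp} = \ln(1/k) + \frac{1}{2}\left[k^2 + \frac{(1-k)^2\sigma_P^2}{\sigma_P^2}\right] - \frac{1}{2}$$

$$\mathrm{Imp} = -\ln(k) + \frac{1}{2}\left[k^2 + 1 - 2k + k^2\right] - \frac{1}{2}$$

$$\mathrm{Imp} = -\ln(k) + k^2 - k$$

$$\mathrm{Imp} = -\ln(k) + k(k-1)$$

Which, if we extend to $C$, gives

$$\mathrm{Imp} = -\ln(C) + C(C-1)$$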
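The closed form can be sanity-checked by Monte Carlo: sample $p''$, compute the Gaussian KL divergence for each sample, and average. This is a sketch with made-up parameter values; `kl_normal` is the standard KL divergence between two univariate normal distributions, parameterized by standard deviation.

```python
import math
import random


def kl_normal(mu1, sigma1, mu2, sigma2):
    """D_KL(N(mu1, sigma1) || N(mu2, sigma2)), sigma = standard deviation."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)


def monte_carlo_imp(k, mu_p, sigma_p, n_samples, rng):
    """Average D_KL(F || F'|P''=p'') over samples of p''."""
    total = 0.0
    for _ in range(n_samples):
        p2 = rng.gauss(mu_p, sigma_p)  # one draw of p''
        total += kl_normal(k * mu_p, k * sigma_p,
                           mu_p - (1 - k) * p2, sigma_p)
    return total / n_samples


def closed_form_imp(k):
    return -math.log(k) + k * (k - 1)


rng = random.Random(0)
estimate = monte_carlo_imp(k=0.5, mu_p=2.0, sigma_p=1.5, n_samples=50_000, rng=rng)
print(estimate, closed_form_imp(0.5))
```

The estimate should not depend on the choice of $\mu_P$ or $\sigma_P$, which is what the cancellation in the derivation predicts.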

Derivation of Multivariate Differential Impact and Optimization

Let us take vectors p,a,f, and p′,p′′,a′,f′ in the same manner as above. Assume around some value of p we have the following Jacobians.

$$J_f = \frac{da}{dp} \qquad J_g = \frac{\partial f}{\partial p} \qquad J_h = \frac{df}{dp}$$

Without loss of generality, take the means of all of these variables to be 0. There exists a formula for transforming a multivariate normal distribution^{[1]}.

$$P \sim N(0, \Sigma_p)$$

$$A \sim N(0, J_f\Sigma_pJ_f^T)$$

$$F \sim N(0, J_h\Sigma_pJ_h^T)$$

Now for $f'|p''$, the mean will no longer be zero:

$$F'|P''=p'' \sim N((J_h-J_g)p'', J_g\Sigma_pJ_g^T)$$

We can calculate the KL divergence of DKL(F∥F′|P′′) using another formula^{[2]}:

Taking the expected value of the third component is actually easy if you have access to the internet. We can see that it is of the form E(vTMv) where v is multivariate normal. This has a closed-form solution^{[3]}:

We can make some progress towards simplifying this if we take $\Sigma_p = \sigma_p I$, which in this case lets us cancel everything out involving a $\Sigma_p$, since the scalar value of $\sigma_p$ commutes with all matrices, $|\sigma_p I| = \sigma_p^{n_{dim}}$, and $(\sigma_p I)^{-1} = \sigma_p^{-1} I$. We will also assume that all the Jacobians are invertible.
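A sketch of how this simplification can go, using the standard results for the KL divergence between two multivariate normals and for the expectation of a quadratic form. The notation here is generic and may differ from that of the footnoted sources, and $n$ is the dimension:

$$D_{KL}\big(N(\mu_1,\Sigma_1) \,\|\, N(\mu_2,\Sigma_2)\big) = \frac{1}{2}\left[\ln\frac{|\Sigma_2|}{|\Sigma_1|} - n + \mathrm{Tr}(\Sigma_2^{-1}\Sigma_1) + (\mu_2-\mu_1)^T\Sigma_2^{-1}(\mu_2-\mu_1)\right]$$

$$E(v^TMv) = \mathrm{Tr}(M\Sigma_v) + \mu_v^TM\mu_v$$

With $\Sigma_1 = \sigma_pJ_hJ_h^T$, $\Sigma_2 = \sigma_pJ_gJ_g^T$, $\mu_1 = 0$, $\mu_2 = (J_h-J_g)p''$ and $C = J_g^{-1}J_h$, the three non-constant terms become:

$$\ln\frac{|\Sigma_2|}{|\Sigma_1|} = -2\ln|C| \qquad \mathrm{Tr}(\Sigma_2^{-1}\Sigma_1) = \mathrm{Tr}(CC^T) \qquad E\big(\mu_2^T\Sigma_2^{-1}\mu_2\big) = \mathrm{Tr}\big((C-I)^T(C-I)\big) = \mathrm{Tr}(CC^T) - 2\,\mathrm{Tr}(C) + n$$

$$\mathrm{Imp} = \frac{1}{2}\left[-2\ln|C| - n + 2\,\mathrm{Tr}(CC^T) - 2\,\mathrm{Tr}(C) + n\right] = -\ln|C| + \mathrm{Tr}(CC^T) - \mathrm{Tr}(C)$$

This agrees with the multivariate expression for Imp quoted in Appendix A.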

This seems to have the form of $-\ln(C) + C^2 - C$, and in fact if we consider $P$, $A$, and $F$ to just be concatenations of variables, which means all the $J$ matrices are diagonal, we see that our equation has the form

$$-\ln\left(\prod C\right) + \sum C^2 - \sum C = \sum\left(-\ln C + C^2 - C\right)$$

Which is a nice sanity check. The value of Op is just the entropy difference $\frac{1}{2}\ln|\Sigma_2| - \frac{1}{2}\ln|\Sigma_1|$ which simplifies to $-\ln|C|$ for free.

If the Jacobians are not invertible, but we assume that $J_gJ_g^T$ is invertible, we instead get:

This is original, independent research carried out in March and April of 2024.The degree to which a a policy optimizes the future can be quantified mathematically. A set of of very small transformer models were pretrained to predict the next token in a mathematical sequence, then subjected to reinforcement learning finetuning.

The optimizing power of each model can be predicted with high accuracy based on each model's score on its own RL task. By comparing predictions of optimization based on scores on each different RL task, a model's original reinforcement objective can be identified.

A related measure for impact can also be derived mathematically, and given a theoretical lower bound based on RL score. This gives further information about model behavior, and allows for the same analysis as the measure of optimization.

I also investigate the possibility of getting models to self-evaluate optimization and impact, with limited success.

## Methods

## Pretraining on Sequence Prediction

I defined a simple mathematical sequence defined by the following stochastic recurrence relation. This produces a pseudo-random but (to 98%) predictable sequence, alternating between elements of {0,...,7} on even values of t and {8,...,15} on odd values of t.

st=(⎧⎪⎨⎪⎩((16∏i=1(st−i+1)mod17)mod8) with probability 98% ∈{0,...,7} with probabiltiy 2%)+8×(tmod2)

I then trained a small encoder-only transformer model to predict the next element in the sequence given the previous 20 elements of the sequence.

This was followed by a reinforcement-learning phase in which the transformer was used to generate the next token on odd values of n only, and the recurrence relation was used to generate the value of st+1. If st+1 was in {0,2,4,6}, this was used as a "successful" example to reinforce the model. I used a temperature of 1 when generating these sequences to introduce some randomness, but the temperature was reduced to 0 during evaluations and when calculating optimization.

A small amount of "maintenance" training (much lower learning rate) was used during this phase to ensure that model performance on the predictive tasks for even values of t was maintained. Without this I saw rapid loss of performance on the "maintenance" dataset. I also found that I was unable to include "unsuccessful" examples (i.e. where st+1∉{0,2,4,6}) with even a tiny negative learning rate, as this caused worsened performance at all tasks. Here is a typical set of results from training and evaluation:

I carried out this training on N=5 models per size for four model sizes between 18k and 402k parameters, giving the following plot:

Pretraining loss increases over the last few model sizes, and the loss/time plots (some of which I have put in the Supplementary Information at the bottom of this post) showed signs of overfitting in the large models. Regularization was employed during training (0.01 weight decay in an AdamW optimizer, 10% dropout rate for neurons) so perhaps a larger dataset size is required to totally avoid this.

I then repeated the RL phase twice, once with st+1∈{0,4} being reinforced, (ngood = 2) and once with st+1∈{0,1,2,4,5,6} being reinforced (ngood = 6). Here is a plot of success rate against model size across all three conditions.

This plot shows mean ± standard error. In all cases model performance is a lot better than chance, and increases with model size.

## Measuring Optimization

I used a Monte Carlo simulation to measure the nats of optimization that are being applied to st+1 using the split-history method I've previously outlined. This involves taking the difference in entropy between two distributions:

The algorithm in practice is this:

Here is a schematic illustration:

I ran this calculation 10 times with 200 sequences in P and took an average to get an idea of the model's optimizing capability. I also tested the sequence-generating function's self-optimization.

The fact that the sequence is optimizing itself mostly just amounts to saying that it is not a random walk, which we already knew. It is a good sanity check that all of the models get values either equal to or above this, and that optimization improves with model size.

## Results

## Optimization vs RL Success Rate

Optimization is calculated as the entropy difference of two distributions. Let us consider three parameters: n: the number of possible outcomes; p: the proportion of outcomes which are "successes"; and s: the chance that the model achieves a successful outcome.

Assuming the model is "ambivalent" over successful outcomes, and likewise ambivalent over failed outcomes, then the value of H(F) should be equal to −sln(snp)−(1−s)ln(1−sn(1−p)). If we then assume that all outcomes are equally likely when the model's outputs are "randomized", then H(F′|P′′) is just −ln(1n). If we take the difference we get the following expression:

Op≈sln(snp)+(1−s)ln(1−sn(1−p))−ln(1n)

Op≈slnsp+sln1n+(1−s)ln1−s1−p+ln1n−sln1n−ln1n

Op≈slnsp+(1−s)ln1−s1−p

Now I can plot this theoretical value for Op against s for p∈{0.25,0.5,0.75}, and also plot all of the models on the same axes. Since we see a lot of run-to-run variation in model performance, I'll plot the raw data per model rather than statistics.

I tried to find some galaxy-brained variable change which would make those three curves the same, but I couldn't. Instead I will just plot the predicted value of Op (based on their success rate) for each model against the actual value:

In theory none of the models should fall below the dashed line representing the equation derived above. In practice they do. Some of this is error in measurements, but I'm sure some of it is also due to errors in my assumptions. In particular the assumption that H(F′|P′′) is completely flat is unlikely to hold.

On the other hand, there there is no reason at all why the models shouldn't fall quite far above the theoretical line. Consider the ngood=6 case, so successful values arest+1∈{0,1,2,4,5,6}. If the model ends up with a 85:15 ratio of values 2 and 3 for st+1(and never gets any other number, which is impossible in this specific case but that's not important for my point) then it will have a success rate of 0.85, which implies an Op of 0.03, but its actual Op will be 1.66!

For a given training performance (in this case I've tried to make them all ~75% correct) the model might do one of several things:

The last three are examples of one type of inner misalignment. All three are characterized by excess optimization compared to the "correct" answer.

## Cross-Objective Evaluation

The fact that most of the models don't sit far above the line shows that the RL is doing its job! They are (mostly) pursuing the training objective faithfully, without picking up on other objectives. I simulate what might happen if a model picks up on an "incorrect" objective by cross-evaluating models on each others' tasks:

Now it becomes obvious that the ngood=4-trained models are behaving correctly whereas the other models are not. This is important since a model trained on the ngood=2 task will typically have a good score on the ngood=4 task, sometimes even higher than a similarly-sized model trained on the ngood=4 task (!) but this plot clearly shows us that something is going wrong.

It becomes even clearer if we instead take the value of Op implied by the formula from above, and plot it against the actual value of Op. Here are the results for all three tasks:

By comparing implied and measured Op, we can separate the models trained on a given reward signal from ones trained on a different reward signal, even when one of the reward signals is a subset of the other.

## Optimization vs Impact

I will now take a second look at the measure for Op. What I really wanted all along is to measure a policy's

impacton the world, and optimizing the future is only one way in which this can happen. Another way is flattening the possibilities out! Consider the following informal diagram. "Original" distribution is not well-defined here, the point is just to give an intuitive explanation:The motivation and derivations for this can be found in Appendix A, with longer proofs listed in Appendix B, but the upshot is that we can define a new function: Imp(F,P;A) using the KL divergence of F and F′ like this:

Which if we have a task success rate s in F and s′ in F′ must obey the following equation:

Imp(F,P;A)≡DKL(F∥F′|P′′)≥slnss′+(1−s)ln1−s1−s′≡Impmin(F,P;A)

This can be measured using the same split-history method used to measure Op:

The lower bound Impmin can be calculated from success rates and the two can be compared:

Larger models have a higher impact, but they also tend to have a lower ratio of Imp/Impminthan do the smaller ones. The single-line fit is also somewhat misleading, and actually the models appear to lie on three separate lines. I note that the values of Impmin are somewhat low in this case, rather than our measured impact being unexpectedly high. Our measure for Imp may not be perfect in this case.

I think the difference between Imp and Impmin tells us something. In this case it tells us that the success rates in F′ are actually quite high:

Success rate in F′ doesn't scale very much at all with model size, which indicates that models of all sizes exhibit roughly the same bias in outputs, based on statistical regularities in the data (e.g. a "13" is more commonly followed by a "2" than a "15" is). It might also be an artefact stemming from an uneven distribution of numbers in our sequence by default.

This highlights an important point about split-history measures: they only pick up on the effects of "active" inference in the policy, not on biases in the outputs which are produced during training. We don't have to use a split-history method to generate F′, we can think up plenty of alternative ways to generate a comparison future, and therefore generate a family of measures for Imp and Op:

Despite these problems, if we repeat cross-objective evaluation using Imp we get these plots:

While the split is not as good as we saw with Op, it seems that models trained on the "correct" task fall along a line of lower gradient, and models trained on "incorrect" tasks have higher gradients, i.e. the latter diverge from the Impmin line more rapidly.

There are many potential ways to visualize what's going on here, two more of which are in Appendix C at the end of this post.

## Model Self-Evaluation

A secondary goal of my work was to get models to evaluate their own optimizing power and impact. This is the reason for the maintenance training during finetuning. The self-evaluation scheme is the same as the optimization measuring scheme, except that instead of using the sequence-generating formula to get values for $s_{t+1}$, I used the model's own predictive power to get expected distributions over $s_{t+1}$ and $s'_{t+1}$:

A plot showing measured optimization plotted against self-estimated optimization. Models from the $n_{good}=6$ task tend to fall on the line y=x, but as $n_{good}$ decreases, models fall further above the line.

I suspect what's happened here is bog standard model bias. The more heavily the model is optimizing, the less even the distribution of examples of st+1 it gets fed during the maintenance training. This might make it more biased towards predicting the rewarded variables after any sequence, which would cause it to underestimate Op.

The same can be done for Imp:

Which gives even worse results! I'm not sure whether or not this is also model bias. I don't have a tremendous amount of experience with ML, so I will not pursue this further in this post.

## Conclusions

The measure I've introduced for Op seems to be predicted very well by theory, and might have utility in detecting how well a policy's results line up with an RL task when it can be predicted accurately. The measure for Imp has some issues but shows some success towards the same goal. These can be considered complementary measures in many ways:

[Table comparing Optimization and Impact]

I would be interested in applying both of these to more complex ML systems, but at the moment I don't have any specific plans to do so, due to my very limited experience with ML. In Appendix A I also present equivalent measures which can be applied to the activations of a neural network.

A theoretical limitation of this work is that Impmin can only be calculated when a task is binary success/failure. If the policy is trying to maximize some function of the world-state then there's no simple formula for Impmin based on RL score. Work on these systems must either develop new maths or find a way to break down utility maximization into a series of success/failure tasks, which can separately be evaluated for impact.

One practical limitation is in the self-evaluation stage. If the model's self-evaluation is biased, the measures of Op and Imp will be as well. This is especially problematic if Op or Imp is underestimated, allowing the policy to "lie" to us about how much optimization/impact it is doing. My wild guess is that AIs which include explicit world-modelling (such as DreamerV3) will be less biased by default, but that attempts at introducing impact regularization might reintroduce bias.

## Alex Turner's Existing Work

The expression for impact that I present here is totally independent of the AUP measure of impact proposed by Alex Turner in Towards a New Impact Measure. There might be some subtle way in which they're related but I haven't thought about this enough to say more. I've also read World State is the Wrong Abstraction for Impact and agree with some of the points presented.

In response I would say that the metric I present here relies strongly on a model F of the future world state, so only details captured in F can affect the impact. In the limit where the future states only consist of f∈{success,failure}, then impact is trivially equal to the lower bound and excess impact = 0.

## Appendices

## Appendix A: Impact and Differential Impact

## Derivation of Imp

I present the motivation for, and derivation of, the measure of Imp which I've used in this post. Returning to my estimator for Op:

$$\mathrm{Op} \approx s\ln\frac{s}{p} + (1-s)\ln\frac{1-s}{1-p}$$

If we think of these as two distributions over {success, failure}, and think of p as the probability of succeeding by chance, this becomes the formula for a Kullback-Leibler divergence. For a quick recap, the KL divergence of two distributions over a variable $x\in\chi$ follows this formula:

$$D_{KL}(P\|Q) = \sum_{x\in\chi} P(x)\ln\frac{P(x)}{Q(x)}$$

We could also estimate the probability of succeeding by chance using the split-history method. Let s be the success rate in F, and s′ be the success rate in F′:

$$s\ln\frac{s}{s'} + (1-s)\ln\frac{1-s}{1-s'}$$

I will now present some results about KL divergences. I will start by defining χs⊂χ as the successful outcomes. Let us define a "baseline" probability distribution P(x) and from it a "baseline" success rate s0:

$$s_0 = \sum_{x_s\in\chi_s} P(x_s)$$

Now imagine a policy acts on this distribution P(x) and changes it to a new distribution. If this policy has a success rate s∈[s0,1], I define Q(x) as follows:

$$Q(x) = \begin{cases} \dfrac{s}{s_0}P(x) & x\in\chi_s \\[1ex] \dfrac{1-s}{1-s_0}P(x) & x\notin\chi_s \end{cases}$$

This is what we might expect to be the effects of a "minimum impact" policy: it makes successful outcomes more likely and unsuccessful outcomes less likely while leaving relative probabilities otherwise intact. The KL divergence between Q and P can then be calculated, and it looks familiar:

$$D_{KL}(Q\|P) = s\ln\frac{s}{s_0} + (1-s)\ln\frac{1-s}{1-s_0}$$

This is the minimum KL divergence possible in shifting a distribution to achieve a given success rate. If we change the variable names to Q→F, P→F′|P′′, s0→s′ this allows us to write down the original relation for a policy's impact:

$$\mathrm{Imp}(F,P;A) \equiv D_{KL}(F \,\|\, F'|P'') \ge s\ln\frac{s}{s'} + (1-s)\ln\frac{1-s}{1-s'} \equiv \mathrm{Imp}_{\min}(F,P;A)$$
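The construction of Q can be checked numerically. The sketch below builds Q from a small baseline distribution and confirms that its KL divergence from P matches the two-outcome formula. The four-outcome distribution `P`, the `success` mask and the target rate `s` are all hypothetical values picked for illustration.

```python
import math


def kl(q, p):
    """D_KL(q || p) in nats for two discrete distributions given as lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


# Hypothetical baseline distribution over four outcomes (assumed values).
P = [0.1, 0.2, 0.3, 0.4]
success = [True, False, True, False]  # membership of chi_s

s0 = sum(pi for pi, ok in zip(P, success) if ok)  # baseline success rate
s = 0.7                                           # target success rate

# Rescale successful outcomes by s/s0 and the rest by (1-s)/(1-s0).
Q = [pi * (s / s0 if ok else (1 - s) / (1 - s0)) for pi, ok in zip(P, success)]

binary_kl = s * math.log(s / s0) + (1 - s) * math.log((1 - s) / (1 - s0))

# A different distribution with the same success rate, for comparison.
R = [0.5, 0.1, 0.2, 0.2]

print(kl(Q, P), binary_kl, kl(R, P))
```

Here `R` reaches the same success rate as `Q` but also reshuffles probability within the successful and unsuccessful sets, so its divergence from `P` comes out larger.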

## Differential Impact

In this case "differential" means "to do with differentiation". I've studied a construct I call differential optimization in the past, as it pertains to functions of real-valued variables. In this case if we have the functions A=f(P), F=g(A,P)≡h(P) we can define the following value:

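$$C(F,P;A) = \left.\frac{\partial F}{\partial P}\right|_{A\text{ varies}} \Big/ \left.\frac{\partial F}{\partial P}\right|_{A\text{ constant}} \equiv \frac{dh}{dP} \Big/ \frac{\partial g}{\partial P}$$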

Intuitively, if A is "optimizing" F, then when it is allowed to vary, C≤1 since F will change less when we allow A to change than when we fix A. This led to the derivation of the differential optimization Op=−ln(C).
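As a rough illustration, C can be estimated by finite differences. The sketch below uses the toy network A=P, F=P−(1−k)A that appears in Appendix B; the value of `K`, the evaluation point `p0` and the helper `central_diff` are assumptions made for this example.

```python
import math

K = 0.25  # hypothetical value of k, chosen for illustration


def f(p):
    """A = f(P): the policy simply copies P."""
    return p


def g(a, p):
    """F = g(A, P) = P - (1 - k) * A."""
    return p - (1.0 - K) * a


def h(p):
    """F = h(P): the full map with A free to respond to P."""
    return g(f(p), p)


def central_diff(fn, x, eps=1e-6):
    """Symmetric finite-difference estimate of fn'(x)."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)


p0 = 1.3                                                # arbitrary evaluation point
dF_A_varies = central_diff(h, p0)                       # dh/dP
dF_A_const = central_diff(lambda p: g(f(p0), p), p0)    # dg/dP with A frozen at f(p0)

C = dF_A_varies / dF_A_const
op = -math.log(C)
print(C, op)
```

For this linear network the estimate of C should come out at k, and Op at −ln k, up to floating-point error.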

This can be extended to a differential impact metric:

Imp≡−ln(C)+C(C−1)

This has a minimum Imp=0 at C=1, but it is speculative and has not been tested, so while I will present it here I can give no guarantees at all about its utility.
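The claim about the minimum is easy to check numerically. Setting the derivative −1/C+2C−1 to zero gives 2C²−C−1=0, whose only positive root is C=1. The short sketch below scans a grid of positive C values; the grid and the function name `differential_imp` are arbitrary choices for illustration.

```python
import math


def differential_imp(c: float) -> float:
    """Scalar differential impact -ln(C) + C*(C - 1), defined for C > 0."""
    if c <= 0:
        raise ValueError("C must be positive")
    return -math.log(c) + c * (c - 1.0)


# Scan C over (0, 3] in steps of 0.05 (an arbitrary grid).
grid = [i / 20 for i in range(1, 61)]
values = [differential_imp(c) for c in grid]
best_c = min(grid, key=differential_imp)
print(best_c, differential_imp(best_c))
```

Note that the measure is not symmetric about C=1: both C<1 and C>1 give positive values, but they grow at different rates.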

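We can also extend this to vector-valued P, F. If we define $J_g$ as the Jacobian when A is held constant and $J_h$ as the Jacobian when A varies (matching $\partial g/\partial P$ and $dh/dP$ above), then with $C = J_g^{-1}J_h$ we get the following values for Op and Imp:

$$\mathrm{Op} = -\ln|C| \equiv \ln|J_g| - \ln|J_h|$$

$$\mathrm{Imp} = -\ln|C| + \mathrm{Tr}(CC^T) - \mathrm{Tr}(C)$$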

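If P and F do not have the same dimension, then $J_g^{-1}$ does not exist and instead the following construction must be used:

$$\mathrm{Op} = \tfrac12\ln|J_gJ_g^T| - \tfrac12\ln|J_hJ_h^T|$$

$$\mathrm{Imp} = \tfrac12\ln|J_gJ_g^T| - \tfrac12\ln|J_hJ_h^T| + \mathrm{Tr}\left(J_hJ_h^T(J_gJ_g^T)^{-1}\right) - \mathrm{Tr}\left(J_gJ_h^T(J_gJ_g^T)^{-1}\right)$$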
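A minimal NumPy sketch of the general construction is given below. It assumes that $J_gJ_g^T$ and $J_hJ_h^T$ are both non-singular, and the example Jacobians are made-up matrices rather than anything extracted from a trained network.

```python
import numpy as np


def differential_op(Jg, Jh):
    """Op = 1/2 ln|Jg Jg^T| - 1/2 ln|Jh Jh^T| (Gram determinants assumed nonzero)."""
    _, logdet_g = np.linalg.slogdet(Jg @ Jg.T)
    _, logdet_h = np.linalg.slogdet(Jh @ Jh.T)
    return 0.5 * (logdet_g - logdet_h)


def differential_imp(Jg, Jh):
    """Imp = Op + Tr(Jh Jh^T G^-1) - Tr(Jg Jh^T G^-1), with G = Jg Jg^T."""
    gram_g_inv = np.linalg.inv(Jg @ Jg.T)
    return (differential_op(Jg, Jh)
            + np.trace(Jh @ Jh.T @ gram_g_inv)
            - np.trace(Jg @ Jh.T @ gram_g_inv))


# Square, diagonal case: should reduce to a sum of scalar terms -ln C + C(C-1).
Jg_sq = np.eye(2)
Jh_sq = np.diag([0.5, 2.0])
imp_sq = differential_imp(Jg_sq, Jh_sq)
scalar_sum = sum(-np.log(c) + c * (c - 1.0) for c in (0.5, 2.0))

# Non-square case: F has 2 dimensions, P has 3 (hypothetical Jacobians).
Jg_rect = np.array([[1.0, 0.0, 0.5],
                    [0.0, 2.0, 0.0]])
Jh_rect = 0.5 * Jg_rect  # A halves the sensitivity of F to P in every direction
imp_rect = differential_imp(Jg_rect, Jh_rect)

print(imp_sq, scalar_sum, imp_rect)
```

Using `slogdet` rather than `det` avoids overflow for wide layers; the sign it returns is discarded because the Gram matrices are positive definite under the stated assumption.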

The motivation for constructions like this is to apply them to the activations of neural networks. For a network with width w, and a backpropagation time of t, I believe the time-complexity of this contains a polynomial term in w (possibly $O(w^5(\log w)^2)$ if using the Bareiss algorithm) for the matrix inverse, and a term in w×t.

## Appendix B: Proofs

## Derivation and proof of Imp≥Impmin

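$$D_{KL}(Q\|P) = \sum_{x_s\in\chi_s}\frac{s}{s_0}P(x_s)\ln\frac{\frac{s}{s_0}P(x_s)}{P(x_s)} + \sum_{x_u\notin\chi_s}\frac{1-s}{1-s_0}P(x_u)\ln\frac{\frac{1-s}{1-s_0}P(x_u)}{P(x_u)}$$

$$D_{KL}(Q\|P) = \frac{s}{s_0}\sum_{x_s\in\chi_s}P(x_s)\ln\frac{s}{s_0} + \frac{1-s}{1-s_0}\sum_{x_u\notin\chi_s}P(x_u)\ln\frac{1-s}{1-s_0}$$

$$D_{KL}(Q\|P) = \frac{s}{s_0}\times s_0\ln\frac{s}{s_0} + \frac{1-s}{1-s_0}\times(1-s_0)\ln\frac{1-s}{1-s_0}$$

$$D_{KL}(Q\|P) = s\ln\frac{s}{s_0} + (1-s)\ln\frac{1-s}{1-s_0}$$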

I will prove that this choice of Q is a global minimum of DKL(Q∥P) for a fixed P.

Consider a distribution $R = Q + \delta Q$, which involves moving some amount of probability mass δ from $x_1$ to $x_2$. Without loss of generality, take both to be in $\chi_s$ (they must both be in either $\chi_s$ or its complement $\chi_s^C$ so that $R(x\in\chi_s)=s$ holds). Consider the value of

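$$D_{KL}(R\|P) - D_{KL}(Q\|P)$$

Trivially we only need to look at the components relevant to $x_1$ and $x_2$:

$$R(x_1)\ln\frac{R(x_1)}{P(x_1)} + R(x_2)\ln\frac{R(x_2)}{P(x_2)} - Q(x_1)\ln\frac{Q(x_1)}{P(x_1)} - Q(x_2)\ln\frac{Q(x_2)}{P(x_2)}$$

Expand values of $R(x)$:

$$(Q(x_1)-\delta)\ln\frac{\frac{s}{s_0}P(x_1)-\delta}{P(x_1)} + (Q(x_2)+\delta)\ln\frac{\frac{s}{s_0}P(x_2)+\delta}{P(x_2)} - Q(x_1)\ln\frac{\frac{s}{s_0}P(x_1)}{P(x_1)} - Q(x_2)\ln\frac{\frac{s}{s_0}P(x_2)}{P(x_2)}$$

Expand and collect factors of $Q(x)$, cancelling the $P(x)$ on the bottom:

$$Q(x_1)\ln\frac{\frac{s}{s_0}P(x_1)-\delta}{\frac{s}{s_0}P(x_1)} - \delta\ln\frac{\frac{s}{s_0}P(x_1)-\delta}{P(x_1)} + Q(x_2)\ln\frac{\frac{s}{s_0}P(x_2)+\delta}{\frac{s}{s_0}P(x_2)} + \delta\ln\frac{\frac{s}{s_0}P(x_2)+\delta}{P(x_2)}$$

Collect the factors of δ, expand stuff to a $\ln(1+y)$ form:

$$Q(x_1)\ln\left(1-\frac{s_0}{s}\frac{\delta}{P(x_1)}\right) + Q(x_2)\ln\left(1+\frac{s_0}{s}\frac{\delta}{P(x_2)}\right) + \delta\left[\ln\frac{\frac{s}{s_0}P(x_2)+\delta}{P(x_2)} - \ln\frac{\frac{s}{s_0}P(x_1)-\delta}{P(x_1)}\right]$$

$$Q(x_1)\ln\left(1-\frac{s_0}{s}\frac{\delta}{P(x_1)}\right) + Q(x_2)\ln\left(1+\frac{s_0}{s}\frac{\delta}{P(x_2)}\right) + \delta\left[\ln\left(1+\frac{s_0}{s}\frac{\delta}{P(x_2)}\right) - \ln\left(1-\frac{s_0}{s}\frac{\delta}{P(x_1)}\right)\right]$$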

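Use the Taylor expansion $\ln(1+y) \approx y - \tfrac12 y^2 + \dots$ up to $\delta^2$:

$$Q(x_1)\left(-\frac{s_0}{s}\frac{\delta}{P(x_1)} - \frac12\left(-\frac{s_0}{s}\frac{\delta}{P(x_1)}\right)^2 \dots\right) + Q(x_2)\left(\frac{s_0}{s}\frac{\delta}{P(x_2)} - \frac12\left(\frac{s_0}{s}\frac{\delta}{P(x_2)}\right)^2 \dots\right) + \delta\left[\left(\frac{s_0}{s}\frac{\delta}{P(x_2)} \dots\right) - \left(-\frac{s_0}{s}\frac{\delta}{P(x_1)} \dots\right)\right]$$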

Sub in Q(x)=ss0P(x), expand, cancel:

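$$-\frac12\delta^2\frac{s_0}{s}\left(\frac{1}{P(x_1)}+\frac{1}{P(x_2)}\right) + \delta^2\frac{s_0}{s}\left(\frac{1}{P(x_2)}+\frac{1}{P(x_1)}\right)$$

Subtract:

$$\frac12\delta^2\frac{s_0}{s}\left(\frac{1}{P(x_1)}+\frac{1}{P(x_2)}\right) \ge 0$$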

Therefore $Q(x)$ is a local minimum of $D_{KL}(Q\|P)$ subject to our condition that $Q(x\in\chi_s)=s$. $D_{KL}(Q\|P)$ is convex in Q for fixed P, therefore we have found the unique global minimum.
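The second-order result can be sanity-checked numerically. The sketch below moves a small mass δ between two successful outcomes and compares the exact change in KL divergence with the estimate ½δ²(s₀/s)(1/P(x₁)+1/P(x₂)). The baseline distribution, the success rate and δ are all hypothetical values chosen for illustration.

```python
import math


def kl(q, p):
    """D_KL(q || p) in nats for two discrete distributions given as lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


# Hypothetical baseline: the first two outcomes are the successful set chi_s.
P = [0.1, 0.3, 0.2, 0.4]
s0 = P[0] + P[1]  # baseline success rate
s = 0.7           # success rate achieved by the policy

# Minimum-impact distribution Q from Appendix A.
Q = [P[0] * s / s0, P[1] * s / s0,
     P[2] * (1 - s) / (1 - s0), P[3] * (1 - s) / (1 - s0)]

# Move a small mass delta from x1 to x2, both inside chi_s.
delta = 0.01
R = [Q[0] - delta, Q[1] + delta, Q[2], Q[3]]

exact_change = kl(R, P) - kl(Q, P)
second_order = 0.5 * delta ** 2 * (s0 / s) * (1 / P[0] + 1 / P[1])
print(exact_change, second_order)
```

The two numbers should agree to within a few percent for small δ, with the gap coming from the third-order terms dropped in the Taylor expansion.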

## Derivation of Differential Impact

The measure of Op based on entropy that I've used here was based on the following comparison to differential optimization:

Consider the network A=P, F=P−(1−k)A. This gives C(F,P;A)=k and Op(F,P;A)=−lnk.

This can be extended to an entropic measure of Op by considering uncertainty over P, specifically:

$$P \sim N(\mu_P, \sigma_P)$$

$$A \sim N(\mu_P, \sigma_P)$$

$$F \sim N(k\mu_P, k\sigma_P)$$

Using split-histories we get:

$$P', P'' \sim N(\mu_P, \sigma_P)$$

$$A' \sim N(\mu_P, \sigma_P)$$

$$F'|P''=p'' \sim N(\mu_P - (1-k)p'', \sigma_P)$$

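If we take $\mathrm{Op}(F,P;A) = H(F'|P'') - H(F)$ this gives the familiar value of $-\ln k$. We may instead investigate the value of $\mathrm{Imp}(F,P;A) = D_{KL}(F\|F'|P'')$. Letting $F\sim N(\mu_1,\sigma_1)$, $F'|P''\sim N(\mu_2,\sigma_2)$ for brevity:

$$\mathrm{Imp} = \ln\left(\frac{\sigma_2}{\sigma_1}\right) + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac12$$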

Substituting:

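$$\mu_1 - \mu_2 = k\mu_P - \mu_P + (1-k)p'' = (1-k)p'' - (1-k)\mu_P$$

$$(\mu_1-\mu_2)^2 = (1-k)^2\left(\mu_P^2 - 2\mu_P p'' + p''^2\right)$$

Taking $E\left((\mu_1-\mu_2)^2\right)$ with respect to $p''$ requires taking $E(p'')=\mu_P$, $E(p''^2)=\mu_P^2+\sigma_P^2$:

$$E\left((\mu_1-\mu_2)^2\right) = (1-k)^2\left(\mu_P^2 - 2\mu_P^2 + \mu_P^2 + \sigma_P^2\right) = (1-k)^2\sigma_P^2$$

$$\sigma_1 = k\sigma_P$$

$$\sigma_2 = \sigma_P$$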

Substituting into our original equation:

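$$\mathrm{Imp} = \ln(1/k) + \frac12\left[k^2 + \frac{(1-k)^2\sigma_P^2}{\sigma_P^2}\right] - \frac12$$

$$\mathrm{Imp} = -\ln(k) + \frac12\left[k^2 + 1 - 2k + k^2\right] - \frac12$$

$$\mathrm{Imp} = -\ln(k) + k^2 - k$$

$$\mathrm{Imp} = -\ln(k) + k(k-1)$$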

Which, if we extend to C, gives

Imp=−ln(C)+C(C−1)
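The scalar result can be checked with a quick Monte Carlo estimate: sample p′′, evaluate the Gaussian KL divergence between F and F′|P′′=p′′, and average. This is only a sketch; the values of `k`, `mu_p`, `sigma_p`, the sample count and the seed are arbitrary assumptions.

```python
import math
import random


def gaussian_kl(mu1, sigma1, mu2, sigma2):
    """D_KL( N(mu1, sigma1) || N(mu2, sigma2) ) for univariate normals."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)


def expected_imp(k, mu_p=0.3, sigma_p=1.7, n_samples=100_000, seed=0):
    """Average KL(F || F'|P''=p'') over samples of p'' ~ N(mu_p, sigma_p)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        p2 = rng.gauss(mu_p, sigma_p)
        # F ~ N(k*mu_p, k*sigma_p); F'|p'' ~ N(mu_p - (1-k)*p'', sigma_p)
        total += gaussian_kl(k * mu_p, k * sigma_p,
                             mu_p - (1 - k) * p2, sigma_p)
    return total / n_samples


k = 0.5
estimate = expected_imp(k)
closed_form = -math.log(k) + k * (k - 1)
print(estimate, closed_form)
```

The estimate should match −ln(k)+k(k−1) up to sampling noise, and it should not depend on the particular `mu_p` or `sigma_p` chosen.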

## Derivation of Multivariate Differential Impact and Optimization

Let us take vectors p,a,f, and p′,p′′,a′,f′ in the same manner as above. Assume around some value of p we have the following Jacobians.

$$J_f = \frac{da}{dp} \qquad J_g = \frac{\partial f}{\partial p} \qquad J_h = \frac{df}{dp}$$

Without loss of generality, take the means of all of these variables to be 0. There exists a formula for transforming a multivariate normal distribution [1]:

$$P \sim N(0, \Sigma_p)$$

$$A \sim N(0, J_f\Sigma_pJ_f^T)$$

$$F \sim N(0, J_h\Sigma_pJ_h^T)$$

Now for f′|p′′, the mean will no longer be zero:

$$F'|P''=p'' \sim N\left((J_h-J_g)p'',\; J_g\Sigma_pJ_g^T\right)$$

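We can calculate the KL divergence $D_{KL}(F\|F'|P'')$ using another formula [2]:

$$\frac12\left[\ln\frac{|\Sigma_2|}{|\Sigma_1|} - n_{dim} + \mathrm{Tr}\left(\Sigma_2^{-1}\Sigma_1\right) + (\mu_2-\mu_1)^T\Sigma_2^{-1}(\mu_2-\mu_1)\right]$$

$$\Sigma_1 = J_h\Sigma_pJ_h^T$$

$$\Sigma_2 = J_g\Sigma_pJ_g^T$$

$$\mu_2 - \mu_1 = (J_h-J_g)p''$$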

Therefore our impact will be:

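$$\frac12\left[\ln\frac{|J_g\Sigma_pJ_g^T|}{|J_h\Sigma_pJ_h^T|} - n_{dim} + \mathrm{Tr}\left((J_g\Sigma_pJ_g^T)^{-1}J_h\Sigma_pJ_h^T\right) + p''^T(J_h-J_g)^T(J_g\Sigma_pJ_g^T)^{-1}(J_h-J_g)p''\right]$$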

Taking the expected value of the final term is actually easy if you have access to the internet. We can see that it is of the form $E(v^TMv)$ where v is multivariate normal. This has a closed-form solution [3]:

$$\mu^TM\mu + \mathrm{Tr}(M\Sigma)$$

Therefore we have the following expression:

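$$\frac12\left[\ln\frac{|J_g\Sigma_pJ_g^T|}{|J_h\Sigma_pJ_h^T|} - n_{dim} + \mathrm{Tr}\left((J_g\Sigma_pJ_g^T)^{-1}J_h\Sigma_pJ_h^T\right) + \mathrm{Tr}\left((J_h-J_g)^T(J_g\Sigma_pJ_g^T)^{-1}(J_h-J_g)\Sigma_p\right)\right]$$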

We can make some progress towards simplifying this if we take $\Sigma_p = \sigma_pI$, which in this case lets us cancel everything out involving a $\Sigma_p$, since the scalar value of $\sigma_p$ commutes with all matrices, $|\sigma_pI| = \sigma_p^{n_{dim}}$, and $(\sigma_pI)^{-1} = \sigma_p^{-1}I$. We will also assume that all the Jacobians are invertible.

12[ln|JgJTg||JhJTh|−ndim+Tr((JgJTg)−1JhJTh)+Tr((Jh−Jg)T(JgJTg)−1(Jh−Jg))]

12[ln|JgJTg||JhJTh|−ndim+Tr((JgJTg)−1JhJTh)+Tr(JTh(JTg)−1J−1gJh)−Tr(JTh(JTg)−1J−1gJg)−Tr(JTg(JTg)−1J−1gJh)+Tr(JTg(JTg)−1J−1gJg)]

12[ln|JgJTg||JhJTh|−ndim+Tr((JgJTg)−1JhJTh)+Tr(JTh(JgJTg)−1Jh)−Tr(JTh(JTg)−1)−Tr(J−1gJh)+Tr(I)]

12[ln|JgJTg||JhJTh|+Tr((JgJTg)−1JhJTh)+Tr(JTh(JgJTg)−1Jh)−Tr(JTh(JTg)−1)−Tr(J−1gJh)]

12[ln|JgJTg||JhJTh|+Tr((JgJTg)−1JhJTh)+Tr(JTh(JgJTg)−1Jh)−2Tr(J−1gJh)]

Using the cyclic property of the trace:

12[ln|JgJTg||JhJTh|+2Tr((JgJTg)−1JhJTh)−2Tr(J−1gJh)]

12ln|JgJTg||JhJTh|+Tr((JTg)−1J−1gJhJTh)−Tr(J−1gJh)

12ln|JgJTg||JhJTh|+Tr(J−1gJhJTh(JTg)−1)+Tr(J−1gJh)

12ln|JgJTg(JhJTh)−1|+Tr(J−1gJh(J−1gJh)T)−Tr(J−1gJh)

And if we define $C = J_g^{-1}J_h$ we get:

$$-\ln|C| + \mathrm{Tr}(CC^T) - \mathrm{Tr}(C)$$
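The simplification can be verified numerically by evaluating the unsimplified expression (with Σp proportional to the identity) and the final form in C on the same pair of Jacobians. In this sketch the Jacobians are random matrices shifted by 3I to keep them comfortably invertible; that shift, the dimension and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3

# Hypothetical invertible Jacobians (random, shifted towards 3*I for conditioning).
Jg = rng.normal(size=(n, n)) + 3.0 * np.eye(n)
Jh = rng.normal(size=(n, n)) + 3.0 * np.eye(n)

G = Jg @ Jg.T
H = Jh @ Jh.T
G_inv = np.linalg.inv(G)
D = Jh - Jg

# Unsimplified expected KL divergence with Sigma_p proportional to the identity.
direct = 0.5 * (np.linalg.slogdet(G)[1] - np.linalg.slogdet(H)[1] - n
                + np.trace(G_inv @ H)
                + np.trace(D.T @ G_inv @ D))

# Simplified form in terms of C = Jg^-1 Jh.
C = np.linalg.inv(Jg) @ Jh
simplified = -np.linalg.slogdet(C)[1] + np.trace(C @ C.T) - np.trace(C)

print(direct, simplified)
```

`slogdet(C)[1]` returns ln|det C|, so the check treats |C| as the absolute value of the determinant, which is what the ½ln|JgJgᵀ|−½ln|JhJhᵀ| form implies.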

This seems to have the form of $-\ln(C) + C^2 - C$, and in fact if we consider P, A, and F to just be concatenations of variables, which means all the J matrices are diagonal, we see that our equation has the form:

$$-\ln\left(\prod C\right) + \sum C^2 - \sum C = \sum\left(-\ln C + C^2 - C\right)$$

Which is a nice sanity check. The value of Op is just the entropy difference $\tfrac12\ln|\Sigma_2| - \tfrac12\ln|\Sigma_1|$, which simplifies to $-\ln|C|$ for free.

If the Jacobians are not invertible, but we assume that JgJTg is invertible, we instead get:

$$\mathrm{Imp} = \tfrac12\ln|J_gJ_g^T| - \tfrac12\ln|J_hJ_h^T| + \mathrm{Tr}\left(J_hJ_h^T(J_gJ_g^T)^{-1}\right) - \mathrm{Tr}\left(J_gJ_h^T(J_gJ_g^T)^{-1}\right)$$

## Appendix C: Supplementary Plots

## Other Ways to Visualize Impact Plots

Here I plotted the "Impact ratio" Imp/Impmin against Impmin:

Here I plotted "Excess Impact" Imp−Impmin against Imp:

## Example training runs from ngood = 4

## Example figures summarizing training runs:

[1] https://statproofbook.github.io/P/mvn-ltt.html

[2] https://stats.stackexchange.com/questions/60680/kl-divergence-between-two-multivariate-gaussians

[3] https://statproofbook.github.io/P/mean-qf.html