*Thanks to Chris Leong and Nora Belrose for their feedback. This is meant to be part of an entry to the **Future Fund AI Worldview Competition**, but a later post is intended to address the competition questions head on.*

In this post, I explore *mimics*. Mimics are what you get when you join a simulator with a generator. Examples are language models that learn to predict text sequences (the simulator), and generate samples of text sequences from their predictions (the generator). A number of AI safety researchers have mentioned that mimics seem to be safer than "traditional" AI architectures like reinforcement learners, with the proposed reason for this often being that mimics are less "agentic" or "goal-driven" than traditional architectures. moire's Simulators is a particularly thorough overview that makes a similar point.

In this post, I argue that a key feature of mimics is unrelated to their "agentiness": someone who can forecast a mimic's training data can also forecast a mimic's behaviour. I call this phenomenon *synchronisation*. Synchronisation is possible even when the operator can only forecast some crude features of the training sequence.

Certain methods for fine-tuning mimics allow mimics to be optimised for certain tasks while staying synchronised with the operator. This enables mimics to be controlled in a manner that maintains synchronisation and consequently remain easy to predict.

However, some kinds of objectives do not facilitate synchronised control of mimics. If an operator fine-tunes a mimic to control some feature of the world over which it wouldn't normally have complete control, then the operator should generally expect the mimic's output to diverge from forecasts based on the training data. In practice, the consequences of this divergence are reminiscent of failures due to Goodhart's law.

The extremely brief summary of this post is:

- Idealised mimics do what you expect them to when you're trying to control features of their output
- Idealised mimics can surprise you when you're trying to control features of the world

# Safety relevance

Suppose you've been reading books all your life, and you have a pretty good estimate of how likely a book is to actually be good (by your lights) given it gets a 4.8 star rating on Amazon - and, being a good Bayesian, you represent this with a conditional probability . One of the key claims of this article is that, in some situations, it is possible to fine-tune a mimic so that it produces 4.8 star books in such a way that its sampling distribution approximates your own subjective probability .

This provides a method for dealing with concerns like those in You get what you measure; here is a method for fine-tuning on the easy-to-measure thing, and getting the hard-to-measure latent just as much as you expect you would. A number of stars have to align in order for this to happen, but it is not an inordinately large number of stars. Furthermore, it may be possible to say quite a lot about when this might fail and by how much it might fail.

So the first safety relevant point is: perhaps there is a solution to this problem.

A broader question, the one that initially led me down this path, is whether or not safe AI is incentive compatible. If safe AI is incentive compatible, then if you do a good job of building AI that simply does what you want it to, you also do a good job of building safe AI. If safe AI is incentive incompatible, then you have to make trade-offs between building AI that simply does what you want and ensuring safety.

There's a narrow question one can ask in this regard. As I explore in this article, fine-tuning mimics often involves a regularising penalty that ensures the result is close in distribution to the original mimic. Granting, for argument's sake, that this penalty makes a system safer, we can ask: is the size of the penalty limited by performance or safety? I perform a microscopic literature review here and come up with the answer that it seems to be more often limited by performance. While today's AI systems are only weakly relevant to future AI systems, they are still a little relevant, and it might be worthwhile to interrogate this question more comprehensively.

There's also a broader question that I think is relevant: is it easier to solve control problems or hide them? If it is easier to solve control problems, then I think our world looks more incentive compatible; if it is easier to hide them then I think it looks more incentive incompatible. If mimics really do solve an important control problem, then I think we have evidence - albeit inconclusive - that we might be in a solving problems world and not a hiding problems one.

I cannot conclusively answer the question of whether mimics *do* solve this control problem, but the "maybe" that I offer is still progress with respect to my own understanding.

# Epistemic status

I think some of the claims I make here are fairly simple and I have high confidence in them, but they are also not the critical ones. I think the important claim is the one I made in the first paragraph of the previous section: it's possible to fine-tune mimics in a way that approximately matches an operator's conditional probability in important regards, and this is a key feature that enables mimics to address more complex problems than other AI architectures. I'm much less confident in this. I expect that it is almost never true in every last detail, but I give 45% credence to it being roughly true (bearing in mind that I think most theories of this type should be very unlikely a priori).

There's also a heap I don't understand about the ideas I present here, so this credence is liable to swing wildly at short notice.

# Notation reference

() is a "natural" ("mimicked") random variable taking values in the set with events

() is a random variable deterministically related to () taking values in the set

() is a random variable not deterministically related to (), taking values in the set

is the mimic's probability distribution

is the distribution of learned by the mimic after observing

is the sampler argument - means that the mimic draws samples according to

is the probability distribution the operator uses to predict both natural and simulated variables. I think of the operator as a skilled but not superhuman forecaster: she has good priors, and updates them sensibly given evidence, but there are many things beyond her ability to forecast

# What is a mimic?

A mimic is a simulator joined to a generator. It does two things:

- It learns a probability distribution that predicts elements of a sequence of inputs
- It can sample this probability distribution to produce outputs of the same type as its inputs

Given a sequence of random variables and an event , a mimic learns the posterior distribution . It is also equipped with a sampler, which maps distributions over to random outputs taking values in . Setting the sampler argument produces outputs distributed according to .

**Example**

Consider a mimic that takes a sequence of books as input. It can predict an as-yet unseen book and it can sample a book , using the same probability distribution for both.
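To make the two halves of a mimic concrete, here is a minimal sketch under strong simplifying assumptions: the inputs come from a small finite alphabet, the mimic treats them as exchangeable, and it uses a symmetric Dirichlet prior. The class name `CategoricalMimic`, the `prior_count` parameter and the genre labels are hypothetical and are not taken from the post; real mimics such as language models are far more complex.

```python
import random
from collections import Counter


class CategoricalMimic:
    """Toy mimic: a Bayesian predictor over a finite alphabet plus a sampler."""

    def __init__(self, alphabet, prior_count=1.0):
        # Symmetric Dirichlet prior expressed as pseudo-counts (an assumption).
        self.alphabet = list(alphabet)
        self.prior_count = prior_count
        self.counts = Counter()

    def observe(self, sequence):
        # Simulator half: update the posterior on the observed inputs.
        self.counts.update(sequence)

    def predict(self):
        # Posterior predictive distribution for the next element.
        total = sum(self.counts[a] + self.prior_count for a in self.alphabet)
        return {a: (self.counts[a] + self.prior_count) / total for a in self.alphabet}

    def sample(self, rng):
        # Generator half: draw an output from the same predictive distribution.
        dist = self.predict()
        return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]


rng = random.Random(0)
mimic = CategoricalMimic(["fantasy", "crime", "romance"])
mimic.observe(["fantasy"] * 6 + ["crime"] * 3 + ["romance"])
predictive = mimic.predict()
generated = [mimic.sample(rng) for _ in range(5)]
```

The point of the sketch is only that `predict` and `sample` share one distribution: whatever the mimic believes about the next input is exactly what it generates.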

# Operators can synchronise with mimics

The basic insight of this section is: under some conditions, a person (an "operator") who can do a good job of probabilistically forecasting a natural sequence can also do a good job of forecasting a mimicked sequence if the mimic is trained on the same natural sequence. This happens when the operator's and mimic's posterior distributions converge. Informally: if the mimic is good, then to the operator its outputs look just like its training data.

Such convergence can happen even if the operator only observes some coarse features of the mimic's inputs . I do not address the question of whether or not this convergence happens in practically relevant lengths of time for practically implementable machines.

## Equal capabilities

Bayesian reasoners, given the same sequence of data, will under some circumstances "merge" in their opinions of the future (pdf). Specifically, if the operator has a distribution over the infinite sequence and the mimic has a distribution over the same infinite sequence and for any collection of outcomes , implies (that is, is dominated by ) then the conditionals and will converge as on all inputs except a set with -probability . If is dominated by , then this set also has -probability . If is dominated by and vice versa, I say they have *identical support.*

If the mimic's sampler is set to , the operator can set their forecasting distribution , and by the above convergence this will approximate the mimic's sampling distribution. When the operator's distribution over the natural sequence approximately matches the mimic's sampling distribution, we say that the operator and the mimic are *synchronised*.

Note that the assumption of identical support is, in the general case, not very easy to evaluate, and this is especially true when we don't have any easy way to evaluate or .

**Example**

If the mimic learns to predict books (in every last detail) from the sequence and the operator learns to predict books (in every last detail) from the same sequence, and their initial distributions assign measure 0 to the same set of long-run events, then the operator's forecasting distribution over "natural" books and the mimic's sampling distribution will eventually come to agree. I call this convergence *synchronisation*.
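A small simulation can illustrate the flavour of merging of opinions. This is a sketch under assumptions that are not in the post: the "books" are reduced to a single binary outcome, the operator and the mimic are Beta-Bernoulli learners with different hypothetical priors (Beta(8, 2) and Beta(1, 1)), and only the one-step-ahead predictive probability is compared, whereas the merging result concerns forecasts of the whole future.

```python
import random


def beta_predictive(alpha, beta, ones, zeros):
    """Posterior predictive P(next = 1) under a Beta(alpha, beta) prior."""
    return (alpha + ones) / (alpha + beta + ones + zeros)


def predictive_gap(n, true_p=0.3, seed=0):
    """|operator - mimic| next-step probability after n shared observations."""
    rng = random.Random(seed)
    ones = sum(rng.random() < true_p for _ in range(n))
    zeros = n - ones
    operator = beta_predictive(8.0, 2.0, ones, zeros)  # hypothetical operator prior
    mimic = beta_predictive(1.0, 1.0, ones, zeros)     # hypothetical mimic prior
    return abs(operator - mimic)


# Gap between the two forecasts as the shared data grows.
gaps = {n: predictive_gap(n) for n in (0, 10, 100, 1000, 10000)}
```

Both priors here assign positive density to every success rate, so they have identical support in the sense used above, and the gap shrinks roughly like 1/n. If one prior excluded the true rate, the forecasts would not be guaranteed to merge.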

## Mimic more capable

If the operator can predict every detail of just as well as the mimic, then one might wonder what use the mimic is - perhaps we could just sample from the operator's distribution instead. However, the operator may not need to predict every detail of ; it may be enough for her to predict some coarse features of each book, and still achieve synchronisation with the mimic. The story here is a bit more complicated, though.

Suppose that instead of observing the "base" sequence , the operator observes some features . Abusing notation slightly, the "objective" sampling distribution of is given by

By supposition, the operator does not observe and so they cannot make use of to synchronise with the mimic. Thus the naive argument for synchronisation does not apply. However, we can still say two things:

- Given similar assumptions of common support, the operator's forecast of the machine's output given converges to the distribution of given that could in principle be obtained with the mimic's assistance
- If we make the additional assumption that the sequence is exchangeable with respect to , then the operator's forecast may converge to the mimic's sampling distribution as normal

These are explained in more detail after the following example.

**Example**

Suppose the operator observes two features of many books:

- Genre
- Whether or not the operator enjoys reading it

We say . The operator can estimate the probability that they enjoy a book given its genre from their history of books read and the probability that a random book is of a given genre. If the operator accepts that these probability estimates converge to the mimic's sampling distribution of because she shares inputs with the mimic, then even though the operator cannot write books, she can still say (probabilistically) how well they'll like the books the mimic produces, and what genre they'll be.

### 1. Operator forecast merges with the mimic's limited forecast

The mimic, by supposition, defines a collection of conditionals for every . Thus we can (in principle) extract a joint distribution over sequences of length from the mimic. Actually doing this would be very impractical.

A joint distribution induces a joint distribution by pushing it forward with the function (actually computing this would, among other things, require knowledge of ). From this, in turn, we can derive a conditional probability .

If the mimic's model is thought to be a particularly good one, then because is a function of , we might also surmise that is a good model for X_{<n} given . Given a realisation of the sequence , the operator can consult the mimic's conditional probability to help them assess what outputs it is likely to produce

because each is a deterministic function of , the right hand side is equal to .

But, if is dominated by , then merging of opinions implies that

in total variation. So, instead of performing the impractically complex query to determine, the operator can just substitute their own estimate , and for sufficiently large the result will be approximately the same.
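The pushforward step in this argument can be sketched on a finite toy space. Everything concrete here is an assumption for illustration: the five "detailed" books, their probabilities, and the helper names `pushforward` and `conditional` are hypothetical, and a real mimic's joint distribution could not be enumerated like this.

```python
from collections import defaultdict


def pushforward(dist, feature):
    """Distribution of feature(x) when x is distributed according to dist."""
    out = defaultdict(float)
    for x, p in dist.items():
        out[feature(x)] += p
    return dict(out)


def conditional(dist, event):
    """Renormalise dist on the outcomes for which event(x) is true."""
    mass = sum(p for x, p in dist.items() if event(x))
    return {x: p / mass for x, p in dist.items() if event(x)}


# Hypothetical "detailed" books: (title, genre, enjoyed_by_operator).
books = {
    ("A", "fantasy", True): 0.30,
    ("B", "fantasy", False): 0.10,
    ("C", "crime", True): 0.15,
    ("D", "crime", False): 0.25,
    ("E", "fantasy", True): 0.20,
}

# Coarse features the operator observes: (genre, enjoyed).
features = pushforward(books, lambda b: (b[1], b[2]))
fantasy = conditional(features, lambda y: y[0] == "fantasy")
p_enjoy_given_fantasy = fantasy[("fantasy", True)]
```

The operator never needs the detailed distribution `books`; the claim in the text is that her own estimate of the feature-level conditional can stand in for the one derived from the mimic.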

### 2. Sequence is exchangeable

If the sequence is exchangeable with respect to , then so is the sequence . In this case, it can be shown that is independent of given , the empirical distribution of , which is a function of or . Hence we have

I suspect it's possible to say something more directly about under what circumstances , but at the moment I don't know more than this.

Exchangeable sequences also have the advantage that identical support is easier to evaluate. For exchangeable sequences, identical support of and is equivalent to the priors over the empirical distributions and having common support.

## Convergence rates

The fact that converges to "for some finite " isn't especially useful by itself - being finite does not mean that it is small enough to be practically important. I don't have much idea about the extent to which operators and mimics converge in practical settings.

It's possible that there are different features of human interest - say, and - such that and converge at very different rates to the respective conditionals in . This difference in rates could be important if is some feature relevant to "performance on the immediate objective" while is some feature relevant to safety - it may then be possible to build a mimic that is very predictable with respect to the immediate objective but whose safety properties are very unpredictable.

# Operators can control mimics and maintain synchronisation

Not only can operators predict what mimics will do unconditionally, but for some purposes, they can control mimics such that the mimic's behaviour remains synchronised with their forecasts of the natural sequence.

**Example**

Suppose the operator once again observes the genre and enjoyableness of many books, and she somehow controls the mimic to only produce books that she enjoys.

The operator's control *desynchronises* the mimic if it changes the mimic's distribution of book features conditional on enjoyability. For example, if most of the natural books the operator enjoyed were fantasy, but most of the mimicked books she enjoys are operator-flattery, then her control desynchronised the mimic.

The operator's control *maintains synchronisation* if the distribution of book features conditional on enjoyability doesn't change. If most of the mimicked books that the operator enjoys are also fantasy, then her control maintains synchronisation with respect to genre. Synchronisation is maintained in general if the distribution of "books in every last detail" conditional on enjoyability is unchanged.

A standard method for controlling mimics is fine-tuning them. In particular, given a binary function , we can fine-tune a mimic to approximate samples from the conditioned distribution by reinforcement learning using a KL-divergence penalty. We set

and then, letting , set

This is maximised by (see Korbak, Perez and Buckley, appendix).

If approximates and consequently approximates , then the operator can adopt

as an approximation of the conditioned mimic's sampling distribution. This requires, of course, that the operator is able to compute this conditional, and they may not be able to.

Setting a softer function will leave us somewhere between the conditioned distribution and the original distribution.
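The relationship between KL-regularised fine-tuning and conditioning can be checked numerically on a finite set. This sketch assumes the standard objective "expected reward minus `beta` times KL(pi || base)", whose maximiser is proportional to `base(x) * exp(reward(x) / beta)`; the four-element space, the predicate `in_f` and all function names are hypothetical. A reward of 0 inside the target set and negative infinity outside is my assumption for the "hard" case.

```python
import math


def kl_regularised_optimum(base, reward, beta):
    """Maximiser of E_pi[reward] - beta * KL(pi || base) on a finite set."""
    weights = {x: p * math.exp(reward(x) / beta) for x, p in base.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


def condition(base, predicate):
    """The base distribution conditioned on predicate(x) being true."""
    mass = sum(p for x, p in base.items() if predicate(x))
    return {x: (p / mass if predicate(x) else 0.0) for x, p in base.items()}


def in_f(x):
    # Hypothetical binary feature of the output.
    return x in {"b", "d"}


base = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}

# Hard reward: 0 inside the set, -inf outside. Recovers exact conditioning.
hard = kl_regularised_optimum(base, lambda x: 0.0 if in_f(x) else -math.inf, beta=1.0)
# Softer reward: 1 inside, 0 outside. Lands between base and conditioned.
soft = kl_regularised_optimum(base, lambda x: 1.0 if in_f(x) else 0.0, beta=1.0)
target = condition(base, in_f)
soft_mass_on_f = soft["b"] + soft["d"]
```

In this toy case the softer reward raises the mass on the target set without changing the relative probabilities inside it, which is one way of reading "somewhere between" the two distributions.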

**Example**

Suppose the operator tracks two features of every book: its machine-rated binary sentiment and the number of times one person is described as helping another in the text; . If we use fine tuning to set the mimic's sampling distribution and we accept that the appropriate form of synchronisation holds, then the operator can approximate the sampling distribution of mentions-of-helping using

Thus if mentions-of-helping is highly correlated with sentiment in natural books, such mentions will be very common in mimicked books fine-tuned to have positive sentiment. This example was inspired by Jermyn's discussion of the difficulty of predicting the outputs of conditioned mimics.

## Fine-tuning with imperfect control is desynchronising

In practice, the operator isn't just interested in controlling functions of the mimic's output . She is usually interested in controlling some feature of "the world at large" which is plausibly influenced by . Even in our example, we discuss things like whether books are enjoyable. The operator wants enjoyable books because she wants to read a book and enjoy it. Asking the mimic to make her enjoy the book is a lot to ask - the mimic seemingly can't do anything about her stressful job that dampens her enthusiasm for reading on some days.

What if we fine-tune the mimic with the same function, but with a reward that depends stochastically on ? That is, we set

where the expectation is is some stochastic function "implemented by the real world" that maps mimic outputs to rewards R, which are once again assumed to take values of or 0 (not because it's a good idea, but because it helps to make my point).

If there is a nonempty "forcing" set defined by , the result of this fine tuning will be to set to the distribution .

Abusing notation again, let be the result of taking and "pushing the s through " (alternatively: what the mimic would believe if an oracle told it that the distribution of given was ). Unlike the situation discussed previously, fine-tuning with imperfect control will *not* generally yield samples from .

If control is "almost perfect" - i.e. , then we almost get samples from the distribution conditioned on . In particular, under the assumption of almost perfect control we have for any

However, if control is far from perfect - i.e. - then can differ very substantially from .

**Example**

Suppose the operator fine-tunes the mimic on rewards , which take the value if a random person did not agree with the book's thesis after reading it, and if said random person did agree with the book's thesis. - whether or not the book persuades a randomly chosen individual of its main thesis. The base rate for persuasion is low - , but conditional on persuasion there is substantial variation in the topic - i.e. for all . Fine tuning to produce books with a high rate of persuasion is found to achieve the aim of almost always "persuading" random people of the book's thesis, but all of the books produced argue the thesis that the sky is blue .

As before, softer reward functions will wind up somewhere between the unconditioned distribution and . However, it remains difficult for the operator to forecast the result of fine-tuning, because unless they know in advance, they don't have any obvious method to condition on .
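The gap between conditioning on the outcome and fine-tuning on a stochastic reward can be seen in a three-topic toy version of the persuasion example. This is a sketch under assumptions: the topics, base probabilities and persuasion rates are invented, and I read the reward as 0 when the reader is persuaded and negative infinity otherwise, so that the expected reward is finite only on the forcing set where persuasion is certain.

```python
import math

# Hypothetical book topics: base probability and chance of persuading a reader.
base = {"sky is blue": 0.01, "economics": 0.50, "diet": 0.49}
p_persuade = {"sky is blue": 1.0, "economics": 0.04, "diet": 0.02}


def normalise(weights):
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


# Bayesian conditioning on the outcome: weight each topic by P(persuaded | topic).
conditioned = normalise({x: base[x] * p_persuade[x] for x in base})


def expected_reward(p):
    # Assumed reward: 0 if persuaded, -inf otherwise, so the expectation
    # is finite only when persuasion is certain.
    return 0.0 if p == 1.0 else -math.inf


# KL-regularised optimum: base(x) * exp(E[reward | x] / beta), normalised.
beta = 1.0
fine_tuned = normalise(
    {x: base[x] * math.exp(expected_reward(p_persuade[x]) / beta) for x in base}
)
```

Under these made-up numbers the conditioned distribution still spreads across all three topics, while the fine-tuned one puts all of its mass on the single topic in the forcing set.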

As an aside, there is an additional problem in this regard where an operator fine-tunes a mimic to produce outputs with a particular feature, but she doesn't get what she wants in the real world from it because correlation is not causation.

# Testing this theory

The core of the theory is: the better the mimic, the better someone (or some machine) that has learned to predict or classify the training data will perform on the mimic generated data.

This could be tested in a scheme something like this: have human volunteers label a set of training data and a set of mimic generated data. Subsequently, compare the performance of:

- a classifier trained on the training data, tested on the mimic generated data
- a classifier trained on the mimic generated data and tested on a held-out set of mimic generated data

The theory I present here predicts that as the mimic gets better, the performance gap between the two should shrink.
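The comparison could be prototyped on synthetic data before involving human labellers. The following is only a simulation sketch under invented assumptions: one-dimensional Gaussian features stand in for books, the generating label stands in for the human label, a worse mimic is modelled as a shift of its outputs away from the natural data, and the classifier is a simple nearest-centroid threshold. None of these choices come from the post.

```python
import random


def make_data(n, shift, rng):
    """Labelled 1-D samples; `shift` displaces the mimic from the natural data."""
    data = []
    for _ in range(n):
        label = rng.randrange(2)
        centre = (1.0 if label else -1.0) + shift
        data.append((rng.gauss(centre, 1.0), label))
    return data


def fit_threshold(data):
    """Nearest-centroid classifier in 1-D: threshold halfway between class means."""
    means = []
    for label in (0, 1):
        xs = [x for x, y in data if y == label]
        means.append(sum(xs) / len(xs))
    return (means[0] + means[1]) / 2.0


def accuracy(threshold, data):
    return sum((x > threshold) == bool(y) for x, y in data) / len(data)


def performance_gap(mimic_error, n=4000, seed=0):
    """Accuracy of the mimic-trained classifier minus the natural-trained one."""
    rng = random.Random(seed)
    natural = make_data(n, 0.0, rng)
    mimic_train = make_data(n, 2.0 * mimic_error, rng)
    mimic_test = make_data(n, 2.0 * mimic_error, rng)
    from_natural = accuracy(fit_threshold(natural), mimic_test)
    from_mimic = accuracy(fit_threshold(mimic_train), mimic_test)
    return from_mimic - from_natural


gap_by_error = {err: performance_gap(err) for err in (1.0, 0.5, 0.1, 0.0)}
```

In this simulation the gap is large when the mimic is poor and close to zero when it matches the natural data, which is the pattern the theory predicts; it says nothing about whether real mimics behave this way.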

# Are mimics dangerous?

The above discussion suggests that fine-tuning a mimic on features over which it has imperfect control might lead to unexpected behaviour - and this behaviour might be very unexpected if the objective can be controlled, but only by the mimic adopting a very unusual strategy. "Unusual strategies" that succeed at controlling difficult objectives may well be dangerous. In practice, will people want to stick close to a mimic's original distribution, or push it far from this distribution in search of effective strategies?

The claims I have made above are already somewhat speculative. The question of whether mimics are safe depends on further speculation:

- Perhaps mimics may pay a performance penalty if they are not sufficiently regularised - fine tuning might have an adverse impact on their ability to generalise because they depended on the initially learned distribution to be able to do this
- Perhaps the desynchronisation from fine-tuning with imperfect control might lead to mimics giving undesirable results long before regularisation becomes weak enough to make them dangerous

If the first hypothesis is true, then mimics are "passively safe" - even if we try to remove the regularisation term during fine-tuning, their ability to generalise fails before they take any dangerous actions. If only the second is true, then mimic safety is incentive compatible. Removing the regularisation term can lead to dangerous actions, but no-one is interested in doing that because it gets undesirable results for other reasons. If neither is true, then mimic safety is incentive incompatible - people want small regularisation terms to get desirable results, but this trades off against safety.

## Some empirical findings

### Desynchronisation can happen when fine-tuning without regularisation on perfectly controlled features

Fine-tuning language models without a KL penalty has been found to produce "degeneration" of the generated samples. Many articles attest that degeneration involves a reduction in "fluency and diversity" of samples.

Korbak et al. examined different methods to fine-tune GPT-2 to produce compilable code. Their findings were, briefly:

- Unregularised reinforcement learning yielded a much higher rate of compilability at the cost of substantially reduced program length and complexity and substantial divergence from the baseline distribution of texts generated by GPT-2 conditioned on compilability
- KL-regularised fine tuning yielded lower rates of compilability but longer programs (though still *slightly* shorter than baseline) and reduced divergence from the distribution of texts generated by GPT-2 conditioned on compilability

Training without the KL-regularisation leads to divergence from the baseline distribution conditioned on compilability. If the baseline distribution is synchronised with an operator, then this divergence is what I call "desynchronisation". The reduction in program length is one consequence of desynchronisation among many, and illustrates how desynchronised mimics can yield undesirable results that satisfy the training goal on paper.

Earlier work by Paulus, Xiong and Socher reports a broadly similar result: fine-tuning summarisation with unregularised reinforcement yields higher scores on the metric of interest, but

> It is possible to game such discrete metrics and increase their score without an actual increase in readability or relevance

They also employ a kind of regularisation to try to improve summarisation while maintaining readability and relevance.

I think these examples provide very weak evidence against passive safety - unregularised reinforcement learning was successful at improving their scores on the metrics in question. I think they also provide very weak evidence in favour of incentive safety - unregularised reinforcement learning was found to produce output that was nevertheless undesirable. I say the evidence is very weak because I would not be surprised if these examples were not representative of substantially more advanced systems deployed to solve substantially more difficult problems.

It's worth noting that Korbak et al. were not able to produce perfectly compilable samples from GPT-2 using KL-regularised fine tuning, despite the fact that compilability definitely is perfectly controlled by the sequence generator. My guess is that being unable to learn the compilability predicate looks quite similar to the situation where compilability is not fully controlled by the learner. This leads me to expect that KL-regularised fine tuning in this regime might in some ways be similar to KL-regularised fine tuning in the imperfect control regime. Thus I expect to see some desynchronisation in this context, and I wonder if the slight reduction in program length this team observed is a sign of this.

### There are many different pre-training schemes that seem to be effective

Pre-training might not need a large and diverse dataset to be effective. For example:

- Krishna et al. find that self-supervised pretraining on a small task-specific text dataset can yield results nearly as good as (and in some cases better than) pretraining on a large and diverse corpus of text
- Other papers behind that link show that self-supervised pretraining on nonsense text or synthetic text can also yield high performance on downstream tasks

If pretraining datasets don't matter very much, then (in my language) might not need to match very closely in order to produce a mimic with high performance. If these distributions do not match in every particular then, for example, putting low weight on dangerous actions does not necessarily imply puts low weight on the same.

On the other hand, pretraining on large datasets does seem to help performance on average, and despite the results mentioned above it remains plausible to me that extensive pretraining is necessary for mimics that are used to solve particularly difficult problems.

I think these results - especially the pretraining on nonsense text results - also weakly undermine the claim that synchronisation is an important reason why pretrained models are able to perform useful tasks ("because they give us what they expect"), but I think the relevance is very slight and is outweighed by things like the fact that we can ask GPT-3 a question and get a sensible answer in response.

# Conclusion

The basic idea here seems obvious: a good mimic is hard to distinguish from the thing it's mimicking. Nevertheless, to my knowledge, Bayesian merging of opinions ("synchronisation") has not previously been proposed as a mechanism for how this occurs. My impression is also that applying standard prediction techniques (both formal and informal) to features of the training sequences to predict features of the outputs of mimics has been widely used - for example, in the investigation of prompting - but the reasons for why this is possible have also not been explored very much theoretically.

I wonder whether it is feasible to advance the science of deep learning (and psychology?) to the point where we have strong enough results about synchronisation to actually prove some safety properties for advanced mimics. I am pessimistic about this, but not confident in my pessimism.

In my view, here are some key takeaways of this post:

- Controlling advanced AI presents us with a problem of "delegate proxy controllability": under what conditions can I direct a delegate to pursue proxy M and expect good results?
- If we take an event M to be a good proxy for desired results under "natural" conditions, then I suggest that if the consequences of a delegate pursuing M match our expectations for what happens when M occurs under natural conditions, then M should be a good proxy for controlling that delegate
- Under some (possibly optimistic) assumptions, mimics can achieve the property outlined in the previous bullet
- Furthermore, when some of those optimistic assumptions don't hold, we might be able to measure the "Goodhart-proneness" of an objective by estimating the probability of an action lying in the forcing set for that objective conditional on that objective being achieved. Such a measure seems relevant to a number of concerns in the AI safety field.

This is definitely interesting. I commit to re-reading it on a plane ride on Monday and posting a better comment. Currently the biggest issue sticking out to me is that we actually *need* to throw away information, or else the "distribution in every detail of books read so far" is just a point distribution over the actual books read. So we have to pick what information to throw away, which is basically equivalent to picking a coarse-grained model of the world.

Thanks. The issue you identify is related to what I’m vaguely indicating with “I don’t know how convergence plays out in practical situations” - I thought of trying to explain in more detail, but 5 minutes of effort didn’t yield a clear explanation and I was keen to get something out rather than spend a long time tidying up all the loose ends.

Although it’s also worth noting that the general merging of opinions result doesn’t depend on long run relative frequencies (which is to say: I think “point distribution over books read” is a useful image of why convergence concerns might bite, but it doesn’t capture the precise difficulty with convergence)

What is this trying to solve?

If you try to set up an ML system to be a mimic, can you ensure you don't get an inner misaligned mesa optimizer?

In general, what part of AI safety is this supposed to help with? A new design for alignable AGI? A design to slap onto existing ML systems, like a better version of RLHF?

I'm not proposing a new design here - large language models are (approximately) mimics, and large language models with RLHF are (approximately) mimics controlled using reinforcement learning with a KL penalty, which I describe here. I'm proposing the outline of a theory that, with more work, might help us better understand control of mimics. In particular, *if* we have an AI that is close enough to satisfying the right assumptions (1. it is approximately a Bayesian learner, 2. its actions involve drawing samples from its predictive distribution, 3. there is sufficient overlap between the operator and the mimic's priors and 4. they share data to learn from), then we can use the operator's model of the training data to predict the effects of the AI's actions. This is useful because, for example, if a proxy F is highly correlated with some "true objective" G in the training data, under the right conditions it will also be highly correlated in the data produced by the AI's actions.

The theory I outline also goes into when this ideal level of controllability should be expected to fail - when the machine is pushed to exert control over some variable over which it normally has only weak control. This is a meaningful extension to previous work on problems with proxy goals - for example, Manheim and Garrabrant do not identify weak control as a condition under which Goodhart type problems should be expected to be particularly severe.

"Misaligned mesa optimizer" is a broad term that covers a lot of very different problems. One kind of problem comes under what I would call "convergence issues" - the speculative proposition that, if you throw too much compute at a distribution learning problem, at some point instead of getting better predictions you instead get manipulation. This is really speculative - it's supported by ideas like "the Solomonoff Prior is malign", but it's unclear how practically relevant these theories are.

A different problem, also sometimes understood in terms of "misaligned mesa optimizers", is that AI will simply get the wrong idea about the control we are trying to apply to it ("goal misgeneralisation"). The theory directly addresses this problem: under the assumptions presented, goal misgeneralisation does not happen when we tune the mimic to exert control on some variable that is, in any case, normally fully controllable by the mimic. Goal misgeneralisation should be expected in the case of imperfect control, and the amount of goal misgeneralisation can be estimated if you can bound the difference between $P(X_n \mid R_n = 0, X_{<n})$ and $P(X_n \mid X_n \in X_F, X_{<n})$.

Many big problems in AI safety lack adequate theories, so if someone is proposing new AI designs no one can really say whether they solve key problems or not. In this post, I explain a promising approach to improving the theory of AI control.