*This post was written to fulfil requirements of the SERI MATS Training Program.*

One of the goals in writing this post is to arrive at an explanation of a particular set of issues raised by Irving in the Learning the smooth prior post. In order to do so, we will first build a coherent conceptual overview of Christiano's Learning the prior idea. Personally, while I do not view an idea like this as some sort of complete 'proposal' or 'plan' for alignment which may or may not 'work', it is nonetheless a very instructive exercise to go through ideas like this thoroughly and doing so is necessary to understand the full motivation behind the discussions in *Learning the smooth prior.*

In Section 1, we will discuss Christiano's motivating safety arguments and look at the original constructions at a conceptual level. In Section 2, we will delve into the mechanics of how learning the prior might work. In particular, we will highlight the central issue of how to represent the objects that we will be calling 'background models'. Section 3 will focus on more specific technical issues raised in *Learning the smooth prior* that may arise when trying to optimize over large background models that are themselves generative networks. In Section 4, we briefly make some closing remarks. We will attempt not to shy away from levelling strong criticism at the idea of Learning the Prior, but will ultimately seek to maintain a balance between skepticism and optimism.

Throughout Sections 1-3, this post becomes increasingly technical. My guess is that the first two sections are readable by somebody who is not already up-to-date with other posts about Learning the Prior, but that Section 3 is sufficiently technical so that it will be most valuable to readers that have already tried to understand *Learning the Prior* and *Learning the Smooth Prior*.

**Acknowledgement.** With thanks to Jennifer Lin for reading and providing comments on earlier versions of this post. I also benefitted from the mentorship of Evan Hubinger.

## 1. Introductory Material

### 1.1 The 'Better Priors' Arguments

In *Better priors as a safety problem*, Christiano argues that fitting a neural network to data naively uses the "'wrong' prior" and that this is a safety issue. By 'prior' here, he refers in some sense to what the model 'expects to see' before it makes its updates on the training data, i.e. before it is trained. It is a little less abstract, and essentially equivalent as far as the present discussion is concerned, to think of a model's 'prior' as being realized only through its inductive biases, that is, the qualities of the model's architecture and training process that make it more likely to fit some functions than others. The arguments outlined below rely on the familiar, albeit informal, notion of simplicity bias in neural networks and the theme of indirect specification.

The first issue comes from the fact that we will likely eventually build, train or somehow construct models to make predictions about things which we cannot check. Any policy which gives accurate predictions of this kind must compete with a policy that, when prompted to give such a prediction, simply answers the meta-question: “What answer would look best to the humans doing the training/overseeing?”. The former is the desired policy and the latter is the undesirable policy.

In the post Inaccessible information, Christiano argues that the undesirable policy may be much simpler. After all, it deals exclusively with information that we can verify (i.e. because we can 'check' that we are satisfied with a plausible-sounding prediction), whereas by assumption the desired policy deals with something more subtle or more complicated (that which we cannot verify). Moreover, any objective function which we design will presumably be something that we actually have access to and while it may be possible to incentivize the model to give accurate predictions about things that we *can* check, it is hard to imagine how one can truly accurately encode the goal of giving accurate predictions about things we cannot check. When it comes to such things, perhaps we can only go as far as incentivizing the model to give *plausible* predictions or to give predictions that *seem likely* to us.

Christiano also argues that the undesirable policy - that which replaces a pursuit of accuracy by a pursuit of what would look most acceptable to the humans (or to whatever is overseeing) - may be the source of safety issues. It certainly seems risky in some general sense if we end up making significant decisions based on predictions which we think are being optimized for accuracy if in reality those predictions are being optimized to satisfy our own expectations. However, Christiano in fact raises a different issue which is that 'inaccessible information' (roughly speaking, that which we cannot check) is likely to provide a significant competitive advantage for agents and there may be *some* agents which use it to seek influence. So, in a future where very capable agents are more common, if we are building agents to help the future of humanity by pursuing some long-term accessible goal, then those agents may be out-competed by others that are capable of using inaccessible information to pursue their own long-term goals.

Notice that these issues are mostly coming from an outer alignment gap: It seems very difficult to *specify* a base objective that actually aligns with what we really want, if what we really want is something that cannot be checked (or verified or scored) by us accurately. And the point that Christiano wants to emphasize is that part of the reason for this is the inductive biases (or the 'prior') of the neural network: The biases may make it very tough to incentivize a neural network to generalize using a policy of accurate prediction, when it may be simpler to generalize with the undesirable policy.

The second issue is just the more general inner alignment concerns of instrumental and possibly deceptive alignment (discussed at length in the third and fourth posts of *Risks From Learned Optimization*). There may be very simple descriptions of 'final goals' for which the base objective of “make good predictions” becomes an instrumental goal. In these situations, the learned optimizer may defect once it predicts that it can no longer be penalized for not making good predictions. Note that the model may actually be capable of correctly making predictions about inaccessible information (and could perhaps do so during training and/or for some time afterwards), but it may stop doing so when it thinks it is no longer going to be altered. Once again, the point to be emphasized is to view these issues as coming from the inductive biases of the neural network: It is because the network is looking for a certain kind of 'simple' fit. Christiano claims that both of these problems arise in neural networks because they use the “wrong universal prior, one that fails to really capture our beliefs”.

### 1.2 'Learning The Prior' or 'Imitative Generalisation' At A High Level

To try to get around this perceived difficulty, Christiano described (a couple of years ago in *Learning the prior*) a characteristically ambitious idea aimed at trying to "*give* our original system the right prior" (emphasis mine). While it probably makes sense to suppose that a neural network will never have ideal inductive biases (whatever you think those should be), perhaps it does indeed make sense to try to do better than a default naive approach. As with many such 'proposals', the main ideas were originally described at a very high level and it would be an understatement to say that there were some details that still needed to be worked out. For example, on AXRP (the AI X-risk Research Podcast) last year, Barnes said:

> to be clear, I don’t think we have anything in mind here that actually seems like it would work. It just seems like the right sort of direction or you can describe versions of it that are completely impractical that would sort of do the right thing. [link to transcript]

Nevertheless, I will attempt to give a summary. Suppose we have a training dataset $D$ and seek to make predictions on a different set $D^*$. We imagine that $D^*$ is sufficiently different from $D$ so as to be (more or less correctly) thought of as 'out of distribution' in the conventional ML sense (which is not always a mathematically precise idea). Were we to naively fit some neural network on the dataset $D$, we would be using the "wrong" prior according to Christiano. Then, were we to let the neural network generalize from $D$ to $D^*$, we would allow errors or risks coming from the difficulty in knowing what the fitted model will do when it encounters situations which are significantly different from the kind of situations it encountered during training. (We can think of this as a down-to-earth, 'modern day ML', broader version of the more specific, yet more speculative, alignment issues discussed in Section 1.1.) We seek to avoid ever having to rely on a naively fitted neural network performing this kind of generalization and we would like the inductive biases that we do end up relying on to be, in some sense, more human.

With that in mind, consider that if a human were in fact able to do both:

- **A.** Go through the entire dataset $D$ and actually learn all of the human-understandable lessons and heuristics about the prediction task that one could ever possibly learn from $D$; and,
- **B.** Efficiently make many high-quality predictions about $D^*$ using the lessons that they have learned;

then the inductive biases used to make the generalization would be precisely human, in some sense. During this discussion, it may be helpful to bear in mind Barnes's dog breed image classification example.

To approximately achieve A. above, the idea is to somehow construct a supplementary object $Z$ which achieves two things: Firstly, it is an encoding of all of the human-understandable lessons and heuristics that are implicitly contained in the labelled dataset $D$. Secondly, it is accessible by humans in the sense that there is a way for a human, given a point $x^* \in D^*$, to actually *use* the information contained within $Z$ to make a prediction $y^*$. The question of exactly what $Z$ *is* will be a central one to which we will return.

To approximately achieve what is described in B. above, the idea is to fit a neural network to data consisting of predictions made on a subset of $D^*$ *by a human who has access to* $Z$. To go into a little more detail here: We would first construct a random sample, which one might think of as a kind of 'validation set',

$$D^*_{\mathrm{val}} \subset D^*,$$

from $D^*$ and designate the remainder to be our true 'prediction set':

$$D^*_{\mathrm{pred}} = D^* \setminus D^*_{\mathrm{val}}.$$

Having done so, we can get predictions about $D^*_{\mathrm{val}}$ from humans who have access to $Z$ and then fit a neural network on this data. This final trained network is what ultimately makes predictions about $D^*_{\mathrm{pred}}$. The idea is that although this does end up relying on a neural network's default inductive biases, the only generalization that is happening is purely *imitative*, i.e. only used to interpolate human predictions on an IID sample $D^*_{\mathrm{val}}$. There is no 'out of distribution' generalization done by a neural network.
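The split described above can be sketched in a few lines of Python. This is only a toy illustration: the function name `split_prediction_set`, the parameter `num_human_labelled` and the placeholder points are my own assumptions and do not come from Christiano's or Barnes's posts.

```python
import random

def split_prediction_set(d_star, num_human_labelled, seed=0):
    """Split D* into a random sample for humans to label and a remainder.

    The random sample plays the role of the 'validation set'; the remainder
    is the 'prediction set' that the final imitative network predicts on.
    """
    rng = random.Random(seed)
    indices = list(range(len(d_star)))
    rng.shuffle(indices)  # uniform random order, so the sample is IID-like
    chosen = set(indices[:num_human_labelled])
    human_labelled = [d_star[i] for i in indices[:num_human_labelled]]
    remainder = [x for i, x in enumerate(d_star) if i not in chosen]
    return human_labelled, remainder

# Hypothetical stand-ins for the points of D*.
d_star = [f"x*_{i}" for i in range(10)]
validation_sample, prediction_set = split_prediction_set(d_star, 3)
```

Because the human-labelled sample is drawn uniformly at random from the same set as the remainder, the network trained on it never has to generalize out of distribution; that is the whole point of the construction.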

It should be clear that this description leaves big question marks over how to define, construct, and use the object $Z$.

## 2. Representing $Z$

In this section we will move towards a more concrete discussion of $Z$. In the context of our prediction task, we suppose that when a human makes a prediction, they rely on a 'background model' of the world or perhaps a distribution over different possible background models. While we will be staying deliberately vague about the exact nature of such a model for the time being, what we have in mind is that each such model is a representation of a set of hypotheses about the world which describe or help one to best understand the relationship between the predictor (the $x$ variable) and the response (the $y$ variable). Using a different background model (or relying on a different distribution over background models) in general will lead to different predictions.

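Proceeding with some very informal mathematics: Let $\mathcal{Z}$ denote the space of all possible background models. We can suppose that before actually seeing the dataset $D$, you - a human - have some prior over $\mathcal{Z}$, that is you assign some probability $P(Z)$ to each $Z \in \mathcal{Z}$. As entertained in point A. from Section 1.2 above, were you able to look through the entire dataset $D$ and infer posterior probabilities for each element of $\mathcal{Z}$, then you would of course update thus:

$$P(Z \mid D) = \frac{P(D \mid Z)\, P(Z)}{P(D)}.$$

And then the $Z$ which maximizes this posterior probability is that which maximizes the expression

$$P(D \mid Z)\, P(Z).$$

Assuming that the points $(x_i, y_i)$ of $D$ were picked independently (and treating the inputs $x_i$ as given, so that $P(D \mid Z) = \prod_i P(y_i \mid x_i, Z)$), this is equivalent to maximizing

$$\log P(Z) + \sum_{(x_i, y_i) \in D} \log P(y_i \mid x_i, Z).$$

So,

$$Z^* = \underset{Z \in \mathcal{Z}}{\arg\max} \Bigl[ \log P(Z) + \sum_{(x_i, y_i) \in D} \log P(y_i \mid x_i, Z) \Bigr]$$

is the background model that you - a human - would think has the highest probability of being true, were you able to update on the entirety of the dataset $D$.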
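As a toy illustration of this maximization, here is a minimal Python sketch in which the space of background models is a small finite list and the prior and likelihood are given explicitly. Everything here is an assumption made for illustration: in the actual scheme the prior and likelihood would be learned imitations of human judgements, and the names `map_estimate`, `log_prior` and `log_likelihood` are hypothetical.

```python
import math

def map_estimate(candidates, log_prior, log_likelihood, dataset):
    """Return the candidate maximizing log prior + summed log likelihood."""
    def score(z):
        return log_prior(z) + sum(log_likelihood(y, x, z) for x, y in dataset)
    return max(candidates, key=score)

# Toy setup: a 'background model' is just a guess for the probability
# that a label equals 1. Inputs x are unused placeholders here.
candidates = [0.2, 0.5, 0.8]
prior = {0.2: 0.25, 0.5: 0.5, 0.8: 0.25}
dataset = [(None, 1), (None, 1), (None, 1), (None, 1), (None, 1), (None, 0)]

def log_prior(z):
    return math.log(prior[z])

def log_likelihood(y, x, z):
    # Probability the background model z assigns to the observed label y.
    return math.log(z if y == 1 else 1 - z)

best = map_estimate(candidates, log_prior, log_likelihood, dataset)
```

In this toy case the prior favours 0.5, but five positive labels out of six are enough for the data term to outweigh it, so the maximizer is 0.8.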

**Remarks.** Christiano in fact says that he wants to represent "a full posterior rather than a deterministic “hypothesis” about the world" and Barnes has also claimed (again in the AXRP episode) that Christiano never intended to describe the kind of thing that we have shown above, which is called a *maximum a posteriori* (MAP) estimate. On the other hand, in introducing the problem, Christiano does also describe $Z^*$ as e.g. "*the* set of assumptions, models, and empirical estimates that works best on the historical data" (emphasis mine). So there seems to be at least a little bit of ambiguity here. While even a MAP estimate of $Z$ would need to be more complicated than simply finding a value for a scalar or vector parameter, wherever anyone has tried to be really concrete about how this might actually work, it is difficult to see how anything other than a MAP estimate can be inferred from what has actually been written. In my opinion, it is what seems most reasonable in terms of the present motivating discussion, so with these caveats having been raised, we will stick with it. Perhaps something better or more complicated is possible and desirable, but that would rapidly become too speculative for even the present exposition.

What the informal mathematical motivation above does provide is so-called desiderata concerning $\mathcal{Z}$ - some conditions that each $Z \in \mathcal{Z}$ would need to fulfil in order that we are able to actually set up this optimization problem. To work with the functional $\log P(Z) + \sum_{(x_i, y_i) \in D} \log P(y_i \mid x_i, Z)$, it is necessary that for $Z \in \mathcal{Z}$, we

- Have a way of estimating $P(y_i \mid x_i, Z)$ for each $(x_i, y_i) \in D$ (i.e. over the whole dataset $D$); and
- Have a way of estimating $P(Z)$.

(In reality one could imagine getting away with not needing to do this for literally every $Z \in \mathcal{Z}$ - the $Z$ for which it is necessary would be determined by the exact algorithm being used to maximize the functional.) There are natural candidate ways to do both of these things, which I will briefly give a sense of, without getting too bogged down in details:

- Given $Z$ and $x$, a human can be shown $x$ and asked to use $Z$ to make a prediction, i.e. to predict the label $y$, assuming that everything in $Z$ were true. This fact can in theory be used to create a signal which we can train a model on. Such a model would output an estimate of $P(y \mid x, Z)$. Crucially, the generalization would once again only be imitative: Only a kind of interpolation of human predictions from an IID sample.
- Similarly, given $Z$, a human can be queried to provide an estimate of $P(Z)$, that is, asked for their overall credence in a particular background model. Once again this gives a signal which in theory a model can be trained on. The model would output estimates of $P(Z)$. And again, the idea is that such a model has only had to generalize from human estimates on an IID sample.

So, it needs to be possible for humans both to i.) Access the knowledge in $Z$ in order to make predictions - this is the problem of *using $Z$ to predict*; and ii.) Somehow evaluate a given $Z$ in its entirety to get an estimate of $P(Z)$ - this is the problem of *estimating $P(Z)$*. Different proposals for the exact form of $Z$ would give rise to different issues regarding how to do both of these things. For example, Christiano says in the original post that $Z$ could literally be "a few pages of background material that the human takes as given", e.g. "a few explicit models for forecasting and trend extrapolation, a few important background assumptions, and guesses for a wide range of empirical parameters." We can indeed imagine being able to use such a $Z$ to give a particular prediction and imagine assigning it an overall score or probability.

### 2.1 Competitiveness and Going Beyond Human Performance

The best possible end performance that could be reached with the scheme as we have described it thus far is still determined ultimately by a network that is modelled on human predictions that have been made with the help of human-understandable information. Moreover, the training procedure seems convoluted compared with the naive 'unaligned' approach of just training a neural network on $D$ and letting it generalize to $D^*$ directly. These two issues aren't completely cleanly separable, but regarding the latter issue (of, shall we say, training competitiveness), Christiano originally wrote that while it was indeed an open ML research direction, he was "relatively optimistic about our collective ability to solve concrete ML problems, unless they turn out to be impossible" (*Learning the Prior*). It may well turn out to be impossible, but if not, then given some concrete way of defining, using and evaluating what we've been calling the background models $Z$, it seems like we ought to be able to pin down a reasonably well-defined set of ML problems aimed at making the training procedure efficient and competitive and actually try to tackle them.

The other point, about how competitive the performance one achieves in the end is, hinges on $Z$ and exactly what it means for humans to access it and use it (because the final predictor is ultimately trained to imitate predictions made by such humans). For example, in discussing going "beyond the human prior" in the original *Learning the prior* post, Christiano entertains the thought that a background model $Z$ could be "a full library of books describing important facts about the world, heuristics, and so on".

**Remark.** Here, I like to imagine that you had a question-answering task and that before giving each answer, you were able to use Wikipedia for a few hours to help make sure you were correct. Wikipedia would be the $Z$. It's a bit of a silly example because it's still hard to imagine how you could actually train a model that doesn't have access to Wikipedia to generalize appropriately from your predictions, but just hypothetically, if you *were* able to do something like that, then you'd be ending up with a question-answerer which had superhuman performance but was still somehow imitating human responses - responses that were rooted in human biases, i.e. had used the right 'prior'. This gives a sense of the tightrope that one is trying to walk here.

So, to get predictions that are "better than explicit human reasoning" (Christiano; *Learning the Prior*), we are motivated to use background models that are potentially large, complicated objects. Christiano expressed optimism about the idea of accessing an information structure such as a library full of books *implicitly* via the output of another model $M_Z$. For example, regarding the issue of using $Z$ to predict, $M_Z$ could produce a debate on the question of what should be predicted about $x$, given that everything in $Z$ were true. Or $M_Z$ could be a model that relates $Z$

> to slightly simpler objects, which can themselves be related to slightly simpler objects… until eventually bottoming out with something simple enough to be read directly by a human. Then Mz implicitly represents Z as an exponentially large tree, and only needs to be able to do one step of unpacking at a time. (Christiano; *Learning the Prior*)

These are interesting and exciting ideas but incorporating them into the overall scheme takes the complexity up a notch to say the least. And describing such things only at such a high level seems vague and raises serious questions. These are indeed classic themes explored by Christiano over the years that are to do with whether or not more complex ideas, or more complex representations of information, can either be broken down into smaller human-understandable steps or recursively simplified into human-understandable forms. They invariably raise questions which seem very difficult to resolve adequately. We will certainly not give such ideas a full treatment in the present exposition, either in terms of explaining them fully or fully interrogating them. Suffice to say, skepticism has been expressed about having a large $Z$ simplified via another model or procedure; about a year after the original post, in a comment, Christiano wrote:

> In writing the original post I was imagining z* being much bigger than a neural network but distilled by a neural network in some way. I've generally moved away from that kind of perspective... I now think we're going to have to actually have z* reflect something more like the structure of the unaligned neural network, rather than another model (Mz) that outputs all of the unaligned neural network's knowledge.

And later, in *Learning the smooth prior* (from April 2022), Irving writes that he thinks that "$Z$-is-weights is a likely endpoint". Could it be that the only hope of giving exceptional performance in the end is if $Z$ is or contains some kind of "homomorphic image" (Irving; *Learning the Smooth Prior*) of the weights of the 'unaligned' neural network trained directly on $D$? And if so, does this mean that it is all starting to seem too impractical or overly speculative? It is interesting to note that Barnes has compared the proposals of imitative generalization to Olah's idea of 'Microscope AI' (perhaps best covered in Hubinger's post *Chris Olah’s views on AGI safety*). Barnes wrote:

> There seems to be moderate convergence between the microscope AI idea and the Imitative Generalization idea. For the microscope AI proposal to work, we need there to be some human-understandable way to represent everything some NN ‘knows’ - this is what our interpretability techniques produce. The IG [imitative generalization] proposal can be framed as: instead of training a NN then extracting this human-understandable object with our interpretability tools, let’s directly search over these human-understandable representations.

I find this to be a compelling comparison. If the "$Z$-is-weights" view ends up being true (or at least dominating how we proceed to think about this all), i.e. if Christiano's and Irving's skepticism turns out to be warranted, then perhaps this suggests that imitative generalization somehow collapses back to a version of microscope AI. (To put it another way, we might also ask whether we ought to be thinking from the outset that a successful realization of $Z$ is some kind of interpretation of the unaligned network - an interpretability tool of some flavour?)

For the purposes of the present exposition, we will try to balance out skepticism with optimism. We will continue with our theoretical discussions by supposing that each $Z$ is indeed a large and complex object and is such that to access $Z$, a human - perhaps via the output of a generative model or a debate or some other recursive procedure - must rely on receiving human-understandable, bitesized chunks of information that are distilled or summarized from the actual content of $Z$. That is, because $Z$ is so large, and humans so limited, a human can only really engage with a small portion of a given $Z$ at any one time.

### 2.2 Using $Z$ to Predict

The constraint that the human must deal in bitesized human-understandable chunks of information affects proposals both for how to use $Z$ to predict and for how to estimate $P(Z)$. On the optimistic view, the former is less of an issue than the latter.

To competently make a prediction for any *fixed* $x$, we can reasonably hope that one might only need to 'see' a small portion of $Z$. In the context of a debate on the topic of 'what should be predicted about $x$ given that everything in $Z$ is true?', we can imagine that only the small amount of information that is very relevant to making the correct prediction gets cited. In Barnes's dog breed image classification example, if we have a picture of a dog in front of us, we only need to deal with information that describes dogs that are manifestly similar to the one we are looking at. Similarly, we can imagine a human navigating a corpus of information (think of searching on Wikipedia) by prompting a generative model for precisely what they think they need to know to make the correct prediction.

### 2.3 Estimating $P(Z)$

Estimating $P(Z)$ may be a much bigger issue. To give an evaluation of the background model as a whole, it seems much harder to see how one could get around the fact that humans can only 'work with' a small portion of $Z$ at any one time.

No matter what kind of generative model is used to access $Z$ or no matter how ingenious your debate or recursive simplification procedure is, the fundamental issue remains that $Z$ is very large and humans are very limited. And yet the learning the prior scheme relies on human estimates of $P(Z)$. If we were using a generative model $M_Z$, the model would be capable of a vast number of possible output statements. However, a human would only be capable of making good judgements about (the truth or probability of) expressions that involve a very small number of statements at once. Similarly, a debate or a recursive procedure used to help estimate $P(Z)$ would have to bottom out or terminate in arguments or expressions that were simple enough for a human to understand and evaluate.

A central question thus becomes whether or not the use of a generative model or protocol of the kind described above can actually work to yield an estimate of the probability of the full $Z$. In *Learning the smooth prior*, Irving, Shah and Hubinger discuss some specific, theoretical and technical issues that pertain to precisely this point and this will be the focus of Section 3.

## 3. Generative Models and Debates

In this section, we will go into some detail regarding the question of whether or not there is an adequate way for humans to estimate $P(Z)$ when $Z$ is a large object that is only being accessed implicitly. At certain times we will work specifically under certain fixed assumptions, but throughout the discussion we will variously entertain a number of slightly different set ups: $Z$ is to be 'accessed' by a human either via the output of a generative model $M_Z$ or via a debate or via some combination of the two. We do not just suppose that the debates have to be about object-level information that pertains to the task; we mean to include the idea that the entire question of estimating $P(Z)$ could be the subject of an AI debate, to be judged by a human.

### 3.1 Large Conjunctions; Bounded Humans

Irving suggests that the most reasonable starting point is to suppose that we are working with a generative model $M_Z$ which outputs object-level statements about the task.

**Remark.** For the time being, we will just work with this, but Irving does also suggest that the output statements perhaps ought to come with associated probabilities, so as to allow for uncertainty, and that there ought to be ways of querying the model further. For example, if the model makes a statement, the truth of which seems to depend on other terms or predicates included in the statement, then we ought to be able to prompt the model to give further explanations of those terms and predicates. This is reminiscent of things like the "cross-examination" mechanism in debates (described in *Writeup: Progress on AI Safety via Debate*). The type of tree-like structure that this sort of 'unpacking' and further questioning would implicitly give rise to again takes us back to recursive Christiano-esque ideas (and is discussed more or less exactly in *Approval-Maximizing Representations*).

Now, the most direct interpretation of $P(Z)$ - the "prior probability that the background material is true" (Christiano; *Learning the Prior*) - is as the probability that the conjunction of all of the statements that the model can output is true, i.e.

$$P(Z) = P(\text{every statement that } M_Z \text{ can output is true}).$$

So, if the possible outputs of $M_Z$ are given by the collection of statements $\{s_i\}_{i \in I}$ and if we let $A_i$ denote the event: 'The statement $s_i$ is true', then we are considering the expression

$$P\Bigl(\bigcap_{i \in I} A_i\Bigr).$$

We will make a couple of basic remarks. Firstly, since any single $A_j$ is a superset of the big intersection, if the model can output even one statement which is certainly false (for example a logical contradiction), then one has $P(Z) = 0$. Secondly, notice that the conjunction of a large number of statements that are independent and all almost certainly true (i.e. you believe each of them to be true with probability close to 1) will still end up having a small probability. This may not seem like an intuitive way to have arrived at an overall score for $Z$, but, for example, $0.99^{1000} \approx 4.3 \times 10^{-5}$, and - assuming that the task at hand is sufficiently complicated and diverse - there may well be far more than 1000 statements that are more or less independent (in particular, the kind of numbers that Shah uses in the examples given in *Learning the smooth prior* are not strange).
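To get a feel for how quickly this decay sets in, here is a short Python sketch. It assumes, purely for illustration, that all statements are exactly independent and share the same probability of being true; the function name is my own.

```python
import math

def conjunction_probability(p, n):
    """Probability that n independent statements, each true w.p. p, all hold."""
    # Work in log space so that very large n degrades gracefully.
    return math.exp(n * math.log(p))

# How the overall score shrinks as the number of statements grows.
for n in (10, 100, 1000, 10000):
    print(n, conjunction_probability(0.999, n))
```

Even at a per-statement credence of 0.999, a thousand independent statements already bring the conjunction down to roughly 0.37, and ten thousand bring it close to zero.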

This latter point starts to show that what matters when trying to get a handle on this expression is what we will loosely refer to as the *dependence structure* of the events $A_i$. If all of the events were independent of one another, then the expression would of course factor as $P\bigl(\bigcap_i A_i\bigr) = \prod_i P(A_i)$. At the other extreme, if we had $A_i = A_1$ for every $i$, then we would simply have $P\bigl(\bigcap_i A_i\bigr) = P(A_1)$. The dependence structure of the family of events $\{A_i\}$ is what makes the real truth something that is presumably in-between these two extremes and presumably far more complicated. Irving claims that if we think about how humans actually deal with situations like this, then the same ideas arise. For example, we imagine that relatively often, we form fairly succinct judgements - something a bit like coming up with a single score in $[0, 1]$ - about the validity of potentially complex sets of information, like the content of a book. Irving claims that in order to do so, we tend to support our judgement via arguments that only ever use a small number of statements at once, because the interdependencies between a small set of statements have a sufficiently simple structure that we can actually think it through or hold it in our heads.
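To see how little the individual credences constrain the conjunction when nothing is known about the dependence structure, it may help to recall a standard fact (the Fréchet inequalities). This is a sketch in my own notation, assuming for simplicity that there are finitely many statements and writing $A_1, \dots, A_n$ for the events that the individual statements are true:

```latex
\max\Bigl(0,\; 1 - \sum_{i=1}^{n} \bigl(1 - P(A_i)\bigr)\Bigr)
\;\le\; P\Bigl(\bigcap_{i=1}^{n} A_i\Bigr)
\;\le\; \min_{1 \le i \le n} P(A_i)
```

The lower bound is the union bound applied to the complements, and the upper bound holds because the intersection is contained in each $A_i$. With $n = 1000$ and $P(A_i) = 0.99$ for every $i$, the lower bound is $\max(0, 1 - 10) = 0$ and the upper bound is $0.99$, so the marginal credences alone leave the answer almost completely undetermined. The upper bound is attained when all of the events coincide, while the fully independent case sits at $0.99^{1000} \approx 4.3 \times 10^{-5}$.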

Thus the natural question is whether or not there actually is some way to 'recover' or 'reconstruct' an estimate of $P(Z)$ from some bounded number of estimates, each of which only ever involves a small number of statements at once. Very informally, this feels like an issue about trying to get a handle on something high-dimensional via only much lower dimensional 'shadows' or 'slices'. An initial naive hope is that we can use an assumption roughly of the form "all dependence structure in $Z$ is contained in dependencies between at most $k$ statements" (Irving; *Learning the smooth prior*). Or perhaps something reminiscent of the hypotheses of the Lovász local lemma, along the lines of each statement being conditionally independent of all other statements given a small number of neighbors (Irving; *Learning the smooth prior*). However, in the next section, we will essentially see that assumptions like these are too strong and we will have to move to discussion of a more vague kind of assumption.

### 3.2 Correlated Statements from Smooth Latent Space

The construction we describe in this subsection was given by Irving in *Learning the smooth prior* and highlights some salient issues concerning the dependence structure of the output statements. In particular, it will rule out certain naive approaches to estimating $P(Z)$.

In the context of a modern generative model, one typically expects to be able to identify a latent space. This is a space which is lower-dimensional than the input; which the input-output map that the model implements factors through; and which corresponds to or contains a learned, hidden representation of the data. So if the model has the form

$$M : \mathcal{X} \to \mathcal{Y},$$

it may in fact work via

$$\mathcal{X} \to \mathcal{L} \to \mathcal{Y},$$

where $\mathcal{L}$ is a lower-dimensional space. Typically we expect the latent space to contain an embedding of the training data itself, but it can also be used to generate outputs. The points of latent space can be thought of as codes that can be fed through the latter portion of the model (the 'decoder') to produce outputs.

Irving's example is about interpolation in latent space. Suppose that at one point in the latent space, the model outputs the statement "A husky is a large, fluffy dog that looks quite like a wolf", and that at a second point, the model outputs the statement "Ripe tomatoes are usually red". The assumption here is that the latent space is *smooth* in some sense and doesn't have any 'holes' in it. This means that when we gradually interpolate between the two points by selecting a large number of equally spaced points on the line segment joining them, we expect each of these points to decode to a semantically meaningful output, and we expect these outputs to actually interpolate between the two sentences at the endpoints. Irving's example is something like:

*1.* A husky is a large, fluffy dog that looks quite like a wolf. (this is the output at the first point)

*2.* A husky is a large, fluffy dog that’s very similar to a wolf.

*3.* A husky is a big, fluffy dog that’s very similar to a wolf.

*4.* A husky is a big, fluffy dog that’s closely related to wolves.

...

*N.* Ripe tomatoes are usually red. (this is the output at the second point)

Consecutive statements will be different, but very similar and highly correlated. However, once you compare statements that are on the order of $N$ apart, they can look independent (e.g. 1. and N. in the example). This kind of construction is not mere conjecture; it was performed more or less exactly like this in Bowman et al.'s paper *Generating Sentences from a Continuous Space* [arXiv]. They write:

We observe that intermediate sentences are almost always grammatical, and often contain consistent topic, vocabulary and syntactic information in local neighborhoods as they interpolate between the endpoint sentences.

(Their results are a little messier than Irving's example suggests, but they are certainly of the same kind. While there are one or two remarks we will make shortly that complicate the picture a little, it seems plausible that a larger, more capable model could produce something like Irving's example.)
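The interpolation step itself is simple to write down. The following is a minimal sketch using NumPy, assuming latent codes are plain real vectors. The two endpoint codes are random stand-ins and the latent dimension is arbitrary; no real model or decoder is involved, and the function name is hypothetical.

```python
import numpy as np


def interpolate_latents(z_start, z_end, num_points):
    """Return num_points equally spaced latent codes on the segment from z_start to z_end."""
    ts = np.linspace(0.0, 1.0, num_points)
    # Each row is the convex combination (1 - t) * z_start + t * z_end.
    return (1.0 - ts)[:, None] * z_start + ts[:, None] * z_end


# Illustrative assumption: random 8-dimensional codes standing in for the two endpoints.
rng = np.random.default_rng(0)
z_start = rng.normal(size=8)  # would decode to the 'husky' sentence
z_end = rng.normal(size=8)    # would decode to the 'tomatoes' sentence

path = interpolate_latents(z_start, z_end, num_points=5)
print(path.shape)
```

In a real experiment each row of `path` would be passed through the model's decoder to obtain one sentence of the interpolation.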

The main conclusion that we want to draw from this example is that any given statement may be correlated with a very large number of other nearby statements. In particular, as Irving notes, it *cannot* be the case that

each statement is conditionally independent of all other statements given ... neighbors.

If there were a smooth latent space which works as above, then there could not be such an $n$, because one can always take very small steps along a path in latent space, or sample many points from a small neighbourhood of latent space, to get far more than $n$ distinct yet highly correlated - i.e. dependent - statements. Moreover, the example suggests that given a fixed statement, one cannot 'draw a boundary' between statements that are well-correlated with it and those that are almost independent of it. When traversing latent space from one endpoint of the interpolation to the other, we expect a smooth transition from statements that are highly correlated with the statement decoded at the starting point to statements which look independent of it. This in turn suggests that there is not much hope of partitioning latent space (or an appropriate compact portion of it) into classes such that statements belonging to the same class are well-correlated with one another and statements from different classes are almost independent: one can always form a convex combination of the latent codes of statements from two adjacent classes and gradually move along the line to find a statement that is somewhat correlated with both endpoints.
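A toy calculation can make the 'no fixed number of neighbours' point more tangible. The sketch below rests on a modelling assumption that is not from Irving's post: the correlation between the first statement and the statement decoded at position $t \in [0,1]$ along the path is taken to decay like a squared-exponential kernel in $t$. The length scale and the 0.9 threshold are arbitrary choices.

```python
import numpy as np


def correlations_to_start(num_points, length_scale=0.2):
    """Toy correlation between the first statement and each statement along the path."""
    ts = np.linspace(0.0, 1.0, num_points)
    # Assumed squared-exponential decay with distance in latent space.
    return np.exp(-(ts ** 2) / (2.0 * length_scale ** 2))


for num_points in (10, 100, 1000):
    corr = correlations_to_start(num_points)
    highly_correlated = int(np.sum(corr > 0.9))
    print(
        f"{num_points} points on the path: "
        f"{highly_correlated} with correlation > 0.9 to the start, "
        f"endpoint correlation {corr[-1]:.2e}"
    )
```

Under this assumption, refining the path keeps increasing the number of statements that are strongly correlated with the starting one, while the two endpoints stay essentially uncorrelated, and there is no sharp threshold separating the two regimes.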

There are at least two caveats to make here. Firstly, if the output sentences are to be grammatically correct, then in any real model it seems there would in fact have to be a bounded number of possible sentences (of a given length, say), and so one cannot literally divide up a line in latent space into an arbitrarily large number of steps and expect distinct semantically meaningful outputs for every single point. However, in practice, the $n$ that one is hoping for in the quotation above would need to be quite small, and one can certainly imagine producing tens or hundreds of statements this way which are reasonably well-correlated with one another.

Secondly, it has been observed in language autoencoders that words with opposite sentiments sometimes have proximate encodings in latent space. For example, in a word embedding, 'good' and 'bad' may have similar embeddings, because they function similarly syntactically and have similar distributional characteristics in the data. The same phenomenon was observed in Bowman et al. with sentences, e.g. one of their interpolation examples (from Table 10 in the Appendix of their paper) interpolates between the sentence "amazing , is n’t it ?" and the sentence "i could n’t do it ." Half-way through the interpolation, the model outputs "i can't do it" followed immediately by "“ i can do it .". In our case, if the model does *not* give probabilities alongside its output statements, then under our present assumptions, any behaviour like this would force $P(Z) = 0$. So either a) we have to hope that as these models become even larger and more capable, this 'sentiment flipping' disappears; or b) we really do need to allow the model to output associated probabilities, and when we see this kind of flipping, the probabilities coming from semantically flipped sentences are themselves complementary. Option a) seems difficult to verify or refute at present, and b) would mean that the present theoretical discussion about how to estimate $P(Z)$ would be taken even further into tricky territory. For now, we will remark that although there is no clear resolution of this issue, the weight of empirical evidence from Bowman et al. suggests that despite the existence of these flippings, there *can* be plenty of highly correlated statements in a neighbourhood of a given statement (even though one does not always get them by walking a straight line in latent space), and this is enough to serve as a foil to the possibility of a reasonable bound for the number of dependent statements.

In summary, Irving's example using smooth latent space seems plausible enough. When $Z$ is indeed being accessed via the output of the kind of generative model supposed here, it highlights genuine and specific difficulties that any method of estimating $P(Z)$ will have to overcome. However, this does not mean that there is no hope of finding a way to estimate $P(Z)$ from terms involving a bounded number of statements.

### 3.3 Speculations Regarding $n$-way probabilities

To keep some optimism alive, Irving speculates about an assumption of the form: "$P(Z)$ can be computed from all terms of the form $P(z_{i_1} \wedge \dots \wedge z_{i_n})$ for some fixed $n$", where the $z_{i_j}$ are individual statements. He calls such terms '$n$-way probabilities'. We will refer to this as the $n$-way assumption.

If we work in the same context as we were discussing in the previous subsections, then the following example shows that a very direct interpretation of the $n$-way assumption cannot be true in a literal sense. We roll a D4 - i.e. pick a number from $\{1,2,3,4\}$, each with probability $1/4$ - and consider the events $A = \{1,2\}$, $B = \{2,3\}$ and $C = \{1,3\}$. We see that

$$P(A \cap B) = P(B \cap C) = P(A \cap C) = \tfrac{1}{4}$$

and

$$P(A \cap B \cap C) = 0.$$

On the other hand, if we consider the events $A' = \{1,2\}$, $B' = \{1,3\}$ and $C' = \{1,4\}$, then while we still have

$$P(A' \cap B') = P(B' \cap C') = P(A' \cap C') = \tfrac{1}{4},$$

we instead get:

$$P(A' \cap B' \cap C') = \tfrac{1}{4}.$$

So it does not make much sense to suspect that you can literally 'compute' the probability of the full intersection from all of the 2-way probabilities. In the context of a debate that a human is expected to judge, if you were in the first of the two cases above, then a dishonest debater could cite the fact that all of the 2-way probabilities were equal to $1/4$ as 'evidence' for the (untrue) claim that $P(A \cap B \cap C) = 1/4$.
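A counterexample of this kind is easy to check mechanically. The sketch below uses exact fractions and one illustrative choice of events on a fair four-sided die. The event sets and the names `triangle` and `star` are assumptions made for this example: each family has all 2-way probabilities equal to $1/4$, but the two families have different full intersections.

```python
from fractions import Fraction
from itertools import combinations

OUTCOMES = (1, 2, 3, 4)  # a fair four-sided die


def prob(event):
    """Probability of an event (a set of outcomes) under the uniform distribution."""
    return Fraction(len(event), len(OUTCOMES))


def two_way(events):
    """All pairwise intersection probabilities."""
    return [prob(a & b) for a, b in combinations(events, 2)]


def full_intersection(events):
    """Probability that every event in the family occurs."""
    return prob(set.intersection(*events))


# Illustrative families of events (assumed for this sketch).
triangle = [{1, 2}, {2, 3}, {1, 3}]  # no outcome lies in all three
star = [{1, 2}, {1, 3}, {1, 4}]      # outcome 1 lies in all three

for name, events in (("triangle", triangle), ("star", star)):
    print(name, [str(p) for p in two_way(events)], str(full_intersection(events)))
```

Since the two families agree on every 1-way and 2-way probability, no rule that takes only those numbers as input can return the correct 3-way probability for both.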

So the issue becomes that either, as Shah suggests:

you have some way to prevent arguments that use the n-way assumption to argue for an incorrect conclusion in cases where they know the n-way assumption doesn't apply

Or, you know that wherever such an argument may be used, there has to be a not-too-complex counterargument. From where we are at the moment, these two possibilities are not all that different. Irving and Shah seem to agree that to make progress here one needs to

understand the failure modes of something like the n-way assumption or whatever replaces it, so we know what kind of safety margins to add. (Irving; *Learning the smooth prior*)

So there is actually a relatively clear theoretical research direction here: it is essentially to try to understand something like the 'class of counterexamples' to (things like) the $n$-way assumption. The example that I gave above has a specific dependence structure that makes it work. Is there some way that one can actually classify such dependence structures? Given that the assumption does not literally hold, and given the simple nature of the counterexample above, it is easy to see that there will be counterexamples involving arbitrarily large numbers of statements. This means that there will potentially be incorrect arguments that involve citing large numbers of statements. But if one were armed with a good understanding of the exact structures that need to be used in order to exploit the failure of something like the $n$-way assumption with an incorrect argument, then the idea is either to prevent the use of such arguments (perhaps, with the help of interpretability tools, via the way that $Z$ is structured or trained), or to become confident that such arguments cannot be effectively obfuscated (perhaps because you've proved that such arguments always have counterarguments that are much simpler).
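As a very small-scale illustration of what 'classifying' such structures could involve, the following sketch brute-forces every ordered triple of events on a fair four-sided die. It groups triples by their 1-way and 2-way probabilities (called a 'signature' here, a term invented for this sketch) and keeps the signatures that are compatible with more than one 3-way probability. It is a toy enumeration under these assumptions, not a method proposed in either post.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations, product

OUTCOMES = frozenset({1, 2, 3, 4})  # a fair four-sided die


def prob(event):
    """Probability of an event under the uniform distribution on OUTCOMES."""
    return Fraction(len(event), len(OUTCOMES))


def all_events():
    """Every subset of OUTCOMES, including the empty set and the whole space."""
    items = sorted(OUTCOMES)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]


def ambiguous_signatures():
    """Map each 1-way/2-way signature to its 3-way values, keeping only ambiguous ones."""
    by_signature = defaultdict(set)
    for a, b, c in product(all_events(), repeat=3):
        signature = (prob(a), prob(b), prob(c), prob(a & b), prob(a & c), prob(b & c))
        by_signature[signature].add(prob(a & b & c))
    return {sig: vals for sig, vals in by_signature.items() if len(vals) > 1}


ambiguous = ambiguous_signatures()
print(len(ambiguous), "signatures leave the 3-way probability undetermined")
```

Each entry of `ambiguous` is a place where a 2-way argument could be cited for a wrong conclusion, so a collection like this is a crude, finite stand-in for the 'class of counterexamples' one would want to characterise in general.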

## 4. Further Remarks

### 4.1 Is it time to give up?

Sure, there are some nice, interesting, theoretical avenues here, and some questions which one could probably spend a decent amount of time on, but have we not become overly concerned with thinking about details that are only relevant once you are way, way down a long chain of fuzzy implications and untested claims? Bearing in mind all of the various issues raised thus far, maybe it seems like further investigation into technical details is unwarranted?

On a personal level, I'm unsure where I stand on this question. Perhaps one legitimate route really is to just decide that there is sufficient skepticism about ever making something like this work out, and to divert effort more directly to the mechanistic interpretability required to draw human insights from the weights of the unaligned network. On the other hand, at least on a high, conceptual level, the more I've thought about it, the more fundamental the issues that have been raised here and in *Learning the smooth prior* have started to seem to me. I can't say much more at present.

### 4.2 Are The Motivating Safety Concerns Even Addressed?

Lastly, to what extent would a scheme like imitative generalisation actually solve the more speculative safety concerns which motivated it in Better priors as a safety problem (those which we outlined in Section 1.1)? Can it alleviate the potential issues Christiano raised? I have not given the outer alignment issue thought, but it does not seem to do much to get round true inner alignment concerns. From Rohin's newsletter opinion from two years ago:

Of course, while the AI system is _incentivized_ to generalize the way humans do, that does not mean it _will_ generalize as humans do -- it is still possible that the AI system internally “wants” to gain power, and only instrumentally answers questions the way humans would answer them. So inner alignment is still a potential issue.

I agree that even though the final model is imitative in this sense, inner alignment issues may in theory still be present. Strong arguments to the contrary do not seem to have been presented: Firstly, even the 'IID' assertion is a theoretical construct (i.e. data does not actually come from a true abstract distribution) and secondly, one would also need to essentially explain away fundamental distributional issues like a model learning to follow time in the real world or to tell when the possibility of it being modified has been removed etc.

This is a great post! Very nice to lay out the picture in more detail than LTSP and the previous LTP posts, and I like the observations about the trickiness of the n-way assumption.

I also like the "Is it time to give up?" section. Though I share your view that it's hard to get around the fundamental issue: if we imagine interpretability tools telling us what the model is thinking, and assume that some of the content that must be communicated is statistical, I don't see how that communication doesn't need some simplifying assumption to be interpretable to humans (though the computation of P(Z) or equivalent could still be extremely powerful). So then for safety we're left with either (1) probabilities computed from an n-way assumption are powerful enough that the gap to other phenomena the model sees is smaller than available safety margins or (2) something like ELK works and we can restrict the model to only act based on the human-interpretable knowledge base.

Thanks very much Geoffrey; glad you liked the post. And thanks for the interesting extra remarks.