Suppose that I have a dataset D of observed (*x*,* y*) pairs, and I’m interested in predicting the label *y** for each point *x** in some new set D*. Perhaps D is a set of forecasts from the last few years, and D* is a set of questions about the coming years that are important for planning.

The classic deep learning approach is to fit a model *f* on D, and then predict *y** using *f*(*x**).

This approach implicitly uses a somewhat strange prior, which depends on exactly how I optimize *f*. I may end up with the model with the smallest l2 norm, or the model that’s easiest to find with SGD, or the model that’s most robust to dropout. But *none* of these are anywhere close to the “ideal” beliefs of a human who has updated on D.

This means that neural nets are unnecessarily data hungry, and more importantly that they can generalize in an undesirable way. I now think that this is a safety problem, so I want to try to attack it head on by learning the “right” prior, rather than attempting to use neural nets as an implicit prior.

#### Warm-up 1: human forecasting

If D and D* are small enough, and I’m OK with human-level forecasts, then I don’t need ML at all.

Instead I can hire a human to look at all the data in D, learn all the relevant lessons from it, and then spend some time forecasting *y** for each *x**.

Now let’s gradually relax those assumptions.

#### Warm-up 2: predicting human forecasts

Suppose that D* is large but that D is still small enough that a human can extract all the relevant lessons from it (or that for each *x** in D*, there is a small subset of D that is relevant).

In this case, I can pay humans to make forecasts for many randomly chosen *x** in D*, train a model *f* to predict those forecasts, and then use *f* to make forecasts about the rest of D*.

The generalization is now coming entirely from human beliefs, not from the structural of the neural net — we are only applying neural nets to iid samples from D*.

### Learning the human prior

Now suppose that D is large, such that a human can’t update on it themselves. Perhaps D contains billions of examples, but we only have time to let a human read a few pages of background material.

Instead of learning the unconditional human forecast P(*y*|*x*), we will learn the forecast P(*y*|*x, *Z), where Z is a few pages of background material that the human takes as given. We can also query the human for the prior probability Prior(Z) that the background material is true.

Then we can train *f*(*y*|*x*, *Z*) to match P(*y*|*x*, Z), and optimize Z* for:

log Prior(Z*) + sum((x,y) ~ D) log f(y|x,Z*)

We train *f* in parallel with optimizing Z*, on inputs consisting of the current value of Z* together with questions *x* sampled from D and D*.

For example, Z might specify a few explicit models for forecasting and trend extrapolation, a few important background assumptions, and guesses for a wide range of empirical parameters. Then a human who reads Z can evaluate how plausible it is on its face, or they can take it on faith in order to predict *y** given *x**.

The optimal Z* is then the set of assumptions, models, and empirical estimates that works best on the historical data. The human never has to reason about more than one datapoint at a time — they just have to evaluate what Z* implies about each datapoint in isolation, and evaluate how plausible Z* is a priori.

This approach has many problems. Two particularly important ones:

- To be competitive, this optimization problem needs to be nearly as easy as optimizing
*f*directly on D, but it seems harder: finding Z* might be much harder than learning*f,*learning a conditional*f*might be much harder than learning an unconditional*f*, and jointly optimizing Z and*f*might present further difficulties. - Even if it worked our forecasts would only be “human-level” in a fairly restrictive sense — they wouldn’t even be as good as a human who actually spent years practicing on D before making a forecast on D*. To be competitive, we want the forecasts in the iid case to be at least as good as fitting a model directly.

I think the first point is an interesting ML research problem. (If anything resembling this approach ever works in practice, credit will rightly go to the researchers who figure out the precise version that works and resolve those issues, and this blog post will be a footnote.) I feel relatively optimistic about our collective ability to solve concrete ML problems, unless they turn out to be impossible. I’ll give some preliminary thoughts in the next section “Notes & elaborations.”

The second concern, that we need some way to go beyond human level, is a central philosophical issue and I’ll return to it in the subsequent section “Going beyond the human prior.”

#### Notes & elaborations

- Searching over long texts may be extremely difficult. One idea to avoid this is to try to have a human guide the search, by either generating hypotheses Z at random or sampling perturbations to the current value of Z. Then we can fit a generative model of that exploration process and perform search in the latent space (and also fit
*f*in the latent space rather than having it take Z as input). That rests on two hopes: (i) learning the exploration model is easy relative to the other optimization we are doing, (ii) searching for Z in the latent space of the human exploration process is strictly easier than the corresponding search over neural nets. Both of those seem quite plausible to me. - We don’t necessarily need to learn
*f*everywhere, it only needs to be valid in a small neighborhood of the current Z. That may not be much harder than learning the unconditional*f*. - Z represents a full posterior rather than a deterministic “hypothesis” about the world, e.g. it might say “R0 is uniform between 2 and 3.” What I’m calling Prior(Z) is really the KL between the prior and Z, and P(
*y|x,*Z) will itself reflect the uncertainty in Z. The motivation is that we want a flexible and learnable posterior. (This is particularly valuable once we go beyond human level.) - This formulation queries the human for Prior(Z) before each fitness evaluation. That might be fine, or you might need to learn a predictor of that judgment. It might be easier for a human to report a ratio Prior(Z)/Prior(Z′) than to give an absolute prior probability, but that’s also fine for optimization. I think there are a lot of difficulties of this flavor that are similar to other efforts to learn from humans.
- For the purpose of studying the ML optimization difficulties I think we can basically treat the human as an oracle for a reasonable prior. We will then need to relax that rationality assumption in the same way we do for other instances of learning from humans (though a lot of the work will also be done by our efforts to go beyond the human prior, described in the next section).

### Going beyond the human prior

How do we get predictions better than explicit human reasoning?

We need to have a richer latent space Z, a better Prior(Z), and a better conditional P(*y*|*x*, Z).

Instead of having a human predict *y* given *x* and Z, we can use amplification or debate to train f(*y*|*x*, Z) and Prior(Z). This allows Z to be a large object that cannot be directly accessed by a human.

For example, Z might be a full library of books describing important facts about the world, heuristics, and so on. Then we may have two powerful models debating “What should we predict about *x*, assuming that everything in Z is true?” Over the course of that debate they can cite small components of Z to help make their case, without the human needing to understand almost anything written in Z.

In order to make this approach work, we need to do a lot of things:

- We still need to deal with all the ML difficulties described in the preceding section.
- We still need to analyze debate/amplification, and now we’ve increased the problem difficulty slightly. Rather than merely requiring them to produce the “right” answers to questions, we also need them to implement the “right” prior. We already needed to implement the right prior as part of answering questions correctly, so this isn’t too much of a strengthening, but we are calling attention to a particularly challenging case. It also imposes a particular structure on that reasoning which is a real (but hopefully slight) strengthening.
- Entangled with the new analysis of amplification/debate, we also need to ensure that Z is able to represent a rich enough latent space. I’ll discuss implicit representations of Z in the next section “Representing Z.”
- Representing Z implicitly and using amplification or debate may make the optimization problem even more difficult. I’ll discuss this in the subsequent section “Jointly optimizing Mz and f.”

#### Representing Z

I’ve described Z as being a giant string of text. If debate/amplification work at all then I think text is in some sense “universal,” so this isn’t a crazy restriction.

That said, representing complex beliefs might require *very long *text, perhaps many orders of magnitude larger than the model *f* itself. That means that optimizing for (Z, *f*) jointly will be much harder than optimizing for *f* alone.

The approach I’m most optimistic about is representing Z implicitly as the output of another model Mz. For example, if Z is a text that is trillions of words long, you could have Mz output the *i*th word of Z on input *i*.

(To be really efficient you’ll need to share parameters between *f* and Mz but that’s not the hard part.)

This can get around the most obvious problem — that Z is too long to possibly write down in its entirety — but I think you actually have to be pretty careful about the implicit representation or else we will make Mz’s job too hard (in a way that will be tied up the competitiveness of debate/amplification).

In particular, I think that representing Z as implicit flat text is unlikely to be workable. I’m more optimistic about the kind of approach described in approval-maximizing representations — Z is a complex object that can be related to slightly simpler objects, which can themselves be related to slightly simpler objects… until eventually bottoming out with something simple enough to be read directly by a human. Then Mz implicitly represents Z as an exponentially large tree, and only needs to be able to do one step of unpacking at a time.

#### Jointly optimizing Mz and f

In the first section I discussed a model where we learn *f*(*y*|*x*, Z) and then use it to optimize Z. This is harder if Z is represented implicitly by Mz, since we can’t really afford to let *f* take Mz as input.

I think the most promising approach is to have Mz and *f* both operate on a compact latent space, and perform optimization in this space. I mention that idea in Notes & Elaborations above, but want to go into more detail now since it gets a little more complicated and becomes a more central part of the proposal.

(There are other plausible approaches to this problem; having more angles of attack makes me feel more comfortable with the problem, but all of the others feel less promising to me and I wanted to keep this blog post a bit shorter.)

The main idea is that rather than training a model Mz(·) which implicitly represents Z, we train a model Mz(·, *z*) which implicitly represents a distribution over Z, parameterized by a compact latent *z.*

Mz is trained by iterated amplification to imitate a superhuman exploration distribution, analogous to the way that we could ask a human to sample Z and then train a generative model of the human’s hypothesis-generation. Training Mz this way is itself an open ML problem, similar to the ML problem of making iterated amplification work for question-answering.

Now we can train *f*(*y|x, z*) using amplification or debate. Whenever we would want to reference Z, we use Mz(·, *z*). Similarly, we can train Prior(*z*). Then we choose *z* *to optimize log Prior(*z**) + sum((*x*, *y*) ~ D) log *f*(*y|x, z**).

Rather than ending up with a human-comprehensible posterior Z*, we’ll end up with a compact latent *z**. The human-comprehensible posterior Z* is implemented implicitly by Mz(·, *z**).

### Outlook

I think the approach in this post can potentially resolve the issue described in Inaccessible Information, which I think is one of the largest remaining conceptual obstacles for amplification/debate. So overall I feel very excited about it.

Taking this approach means that amplification/debate need to meet a slightly higher bar than they otherwise would, and introduces a bit of extra philosophical difficulty. It remains to be seen whether amplification/debate will work at all, much less whether they can meet this higher bar. But overall I feel pretty excited about this outcome, since I was expecting to need a larger reworking of amplification/debate.

I think it’s still very possible that the approach in this post can’t work for fundamental philosophical reasons. I’m not saying this blog post is anywhere close to a convincing argument for feasibility.

Even if the approach in this post is conceptually sound, it involves several serious ML challenges. I don’t see any reason those challenges should be impossible, so I feel pretty good about that — it always seems like good news when you can move from philosophical difficulty to technical difficulty. That said, it’s still quite possible that one of these technical issues will be a fundamental deal-breaker for competitiveness.

My current view is that we don’t have candidate obstructions for amplification/debate as an approach to AI alignment, though we have a lot of work to do to actually flesh those out into a workable approach. This is a more optimistic place than I was at a month ago when I wrote Inaccessible Information.

Learning the prior was originally published in AI Alignment on Medium, where people are continuing the conversation by highlighting and responding to this story.