Better priors as a safety problem


Ω 30

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

(Related: Inaccessible Information, What does the universal prior actually look like?, Learning the prior)

Fitting a neural net implicitly uses a “wrong” prior. This makes neural nets more data hungry and makes them generalize in ways we don’t endorse, but it’s not clear whether it’s an alignment problem.

After all, if neural nets are what works, then both the aligned and unaligned AIs will be using them. It’s not clear if that systematically disadvantages aligned AI.

Unfortunately I think it’s an alignment problem:

  • I think the neural net prior may work better for agents with certain kinds of simple goals, as described in Inaccessible Information. The problem is that the prior mismatch may bite harder for some kinds of questions, and some agents simply never need to answer those hard questions.
  • I think that Solomonoff induction generalizes catastrophically because it becomes dominated by consequentialists who use better priors.

In this post I want to try to build some intuition for this problem, and then explain why I’m currently feeling excited about learning the right prior.

Indirect specifications in universal priors

We usually work with very broad “universal” priors, both in theory (e.g. Solomonoff induction) and in practice (deep neural nets are a very broad hypothesis class). For simplicity I’ll talk about the theoretical setting in this section, but I think the points apply equally well in practice.

The classic universal prior is a random output from a random stochastic program. We often think of the question “which universal prior should we use?” as equivalent to the question “which programming language should we use?” but I think that’s a loaded way of thinking about it — not all universal priors are defined by picking a random program.

A universal prior can never be too wrong — a prior P is universal if, for any other computable prior Q, there is some constant c such that, for all x, we have P(x) > c Q(x). That means that given enough data, any two universal priors will always converge to the same conclusions, and no computable prior will do much better than them.

Unfortunately, universality is much less helpful in the finite data regime. The first warning sign is that our “real” beliefs about the situation can appear in the prior in two different ways:

  • Directly: if our beliefs about the world are described by a simple computable predictor, they are guaranteed to appear in a universal prior with significant weight.
  • Indirectly: the universal prior also “contains” other programs that are themselves acting as priors. For example, suppose I use a universal prior with a terribly inefficient programming language, in which each character needed to be repeated 10 times in order for the program to do anything non-trivial. This prior is still universal, but it’s reasonably likely that the “best” explanation for some data will be to first sample a really simple interpret for a better programming language, and then draw a uniformly randomly program in that better programming language.

(There isn’t a bright line between these two kinds of posterior, but I think it’s extremely helpful for thinking intuitively about what’s going on.)

Our “real” belief is more like the direct model — we believe that the universe is a lawful and simple place, not that the universe is a hypothesis of some agent trying to solve a prediction problem.

Unfortunately, for realistic sequences and conventional universal priors, I think that indirect models are going to dominate. The problem is that “draw a random program” isn’t actually a very good prior, even if the programming language is OK— if I were an intelligent agent, even if I knew nothing about the particular world I lived in, I could do a lot of a priori reasoning to arrive at a much better prior.

The conceptually simplest example is “I think therefore I am.” Our hypotheses about the world aren’t just arbitrary programs that produce our sense experiences— we restrict attention to hypotheses that explain why we exist and for which it matters what we do. This rules out the overwhelming majority of programs, allowing us to assign significantly higher prior probability to the real world.

I can get other advantages from a priori reasoning, though they are a little bit more slippery to talk about. For example, I can think about what kinds of specifications make sense and really are most likely a priori, rather than using an arbitrary programming language.

The upshot is that an agent who is trying to do something, and has enough time to think, actually seems to implement a much better prior than a uniformly random program. If the complexity of specifying such an agent is small relative to the prior improbability of the sequence we are trying to predict, then I think the universal prior is likely to pick out the sequence indirectly by going through the agent (or else in some even weirder way).

I make this argument in the case of Solomonoff induction in What does the universal prior actually look like? I find that argument pretty convincing, although Solomonoff induction is weird enough that I expect most people to bounce off that post.

I make this argument in a much more realistic setting in Inaccessible Information. There I argue that if we e.g. use a universal prior to try to produce answers to informal questions in natural language, we are very likely to get an indirect specification via an agent who reasons about how we use language.

Why is this a problem?

I’ve argued that the universal prior learns about the world indirectly, by first learning a new better prior. Is that a problem?

To understand how the universal prior generalizes, we now need to think about how the learned prior generalizes.

The learned prior is itself a program that reasons about the world. In both of the cases above (Solomonoff induction and neural nets) I’ve argued that the simplest good priors will be goal-directed, i.e. will be trying to produce good predictions.

I have two different concerns with this situation, both of which I consider serious:

  • Bad generalizations may disadvantage aligned agents. The simplest version of “good predictions” may not generalize to some of the questions we care about, and may put us at a disadvantage relative to agents who only care about simpler questions. (See Inaccessible Information.)
  • Treacherous behavior. Some goals might be easier to specify than others, and a wide range of goals may converge instrumentally to “make good predictions.” In this case, the simplest programs that predict well might be trying to do something totally unrelated, when they no longer have instrumental reasons to predict well (e.g. when their predictions can no longer be checked) they may do something we regard as catastrophic.

I think it’s unclear how serious these problems are in practice. But I think they are huge obstructions from a theoretical perspective, and I think there is a reasonable chance that this will bite us in practice. Even if they aren’t critical in practice, I think that it’s methodologically worthwhile to try to find a good scalable solution to alignment, rather than having a solution that’s contingent on unknown empirical features of future AI.

Learning a competitive prior

Fundamentally, I think our mistake was building a system that uses the wrong universal prior, one that fails to really capture our beliefs. Within that prior, there are other agents who use a better prior, and those agents are able to outcompete and essentially take over the whole system.

I’ve considered lots of approaches that try to work around this difficulty, taking for granted that we won’t have the right prior and trying to somehow work around the risky consequences. But now I’m most excited about the direct approach: give our original system the right prior so that sub-agents won’t be able to outcompete it.

This roughly tracks what’s going on in our real beliefs, and why it seems absurd to us to infer that the world is a dream of a rational agent—why think that the agent will assign higher probability to the real world than the “right” prior? (The simulation argument is actually quite subtle, but I think that after all the dust clears this intuition is basically right.)

What’s really important here is that our system uses a prior which is competitive, as evaluated by our real, endorsed (inaccessible) prior. A neural net will never be using the “real” prior, since it’s built on a towering stack of imperfect approximations and is computationally bounded. But it still makes sense to ask for it to be “as good as possible” given the limitations of its learning process — we want to avoid the situation where the neural net is able to learn a new prior which predictably to outperforms the outer prior. In that situation we can’t just blame the neural net, since it’s demonstrated that it’s able to learn something better.

In general, I think that competitiveness is a desirable way to achieve stability — using a suboptimal system is inherently unstable, since it’s easy to slip off of the desired equilibrium to a more efficient alternative. Using the wrong prior is just one example of that. You can try to avoid slipping off to a worse equilibrium, but you’ll always be fighting an uphill struggle.

Given that I think that finding the right universal prior should be “plan A.” The real question is whether that’s tractable. My current view is that it looks plausible enough (see Learning the prior for my current best guess about how to approach it) that it’s reasonable to focus on for now.

Better priors as a safety problem was originally published in AI Alignment on Medium, where people are continuing the conversation by highlighting and responding to this story.


Ω 30