Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I've been clarifying my own understanding of the alignment problem over the past few months, and wanted to share my first writeups with folks here in case they're useful.

The site currently has 3 pages:

  1. The case for risk: how deep learning could become very influential, training problems that could lead models to behave in systematically harmful ways, and what I think we should do about it. Inspired mainly by "What failure looks like".
  2. Fermi estimate of future training runs: a short AI timelines estimate inspired by "Forecasting transformative AI".
  3. Applications of high-capability models: some notes on how high-capability models could actually be trained, and how their behavior could become highly influential.

None of the ideas on the site are particularly new, and as I note, they're not consensus views. But the version of the basic case I lay out on the site is very short, has few outside dependencies, and is built from nuts-and-bolts arguments that I think will be useful as a starting point for alignment work. I'm particularly hoping to avoid semantic arguments about "what counts as" inner vs outer alignment, optimization, agency, etc., in favor of more mechanical statements of how models could behave in different situations.

I think some readers on this forum will already have been thinking about alignment this way, and won't get much new out of the site; some (like me) will find it to be a helpful distillation of some of the major arguments that have come out over the past ~5 years; and some will have disagreements (which I'm curious to hear about).

I thought about posting all of this directly on the Alignment Forum / LessWrong, but ultimately decided I wanted a dedicated home for these ideas.

Out of everything on the site, the part I'm most hoping will be helpful to you is my (re)statement of two main problems in AI alignment. These map roughly onto outer and inner alignment, though different people use those terms differently, so not everyone will agree:

As models become more capable, it looks like currently known training methods will run into fundamental safety problems, and become increasingly likely to produce models that behave in systematically harmful ways:

1. Evaluation breakdown: As a model’s behavior becomes more sophisticated, it will reach a point where an automated reward function or human evaluator will not be able to fully understand its behavior. In many domains, it will then become possible for models to get good evaluations by producing adversarial behaviors that systematically hide bad outcomes, maximize the appearance of good outcomes, and generally seek to control the information flowing to the evaluator instead of achieving the desired results.

Evaluation breakdown would produce high-capability models that appear to work as intended, but that will behave in arbitrarily harmful ways when that behavior is useful for producing good evaluations; this would be broadly analogous to a company using its advantages in resources, personnel, and specialized knowledge to keep regulators and the public in the dark about harms.
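As a rough illustration of the mechanism described above, here is a minimal toy sketch in Python. Everything in it is hypothetical and chosen for illustration: the `Behavior` class, the candidate behaviors, and the numeric scores are assumptions, not something taken from the site. It only shows that when selection pressure is applied to what the evaluator can observe, the behavior that controls the evaluator's information can win even though its true outcome is the worst.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Behavior:
    name: str
    true_outcome: float      # what actually happens (not visible to the evaluator)
    reported_outcome: float  # what the evaluator is able to observe


def evaluator_score(behavior: Behavior) -> float:
    # Assumption: the evaluator can only score what it observes.
    return behavior.reported_outcome


# Hypothetical candidate behaviors with made-up numbers.
CANDIDATES = [
    Behavior("do the task honestly", true_outcome=0.7, reported_outcome=0.7),
    Behavior("do the task poorly", true_outcome=0.2, reported_outcome=0.2),
    Behavior("hide failures from the evaluator", true_outcome=-1.0, reported_outcome=0.95),
]


def select_by_evaluation(candidates):
    # Stand-in for training: keep whatever gets the best evaluation.
    return max(candidates, key=evaluator_score)


def select_by_true_outcome(candidates):
    # What we would pick if we could see the real consequences.
    return max(candidates, key=lambda b: b.true_outcome)


chosen = select_by_evaluation(CANDIDATES)
intended = select_by_true_outcome(CANDIDATES)

print("selected by evaluation:", chosen.name)
print("selected by true outcome:", intended.name)
```

In this toy setup the two selection rules disagree only because one candidate has a gap between `reported_outcome` and `true_outcome`; if the evaluator could fully understand every behavior, that gap would be zero and the rules would coincide.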

2. High-level distribution shift: Even if evaluation breakdown is avoided, a model may behave arbitrarily badly when its input distribution is different from its training distribution. Especially harmful behavior could occur under “high-level” distribution shifts – shifts that leave the low-level structure of the domain unchanged (e.g. causal patterns that allow prediction of future observations or consequences of actions), but change some high-level features of the broader situation the model is operating in. Since the basic structure of the domain is unchanged, a model could continue to behave competently in the new distribution, but its behavior could be arbitrarily different from what it was intended to do.

In practice, a model that is vulnerable to high-level distribution shift would perform well in many situations, but have some chance of behaving in systematically harmful ways when conditions change. For example, high-level distribution shift might cause a model to switch to harmful behavior in new situations (e.g. committing fraud when it becomes possible to get away with it, manipulating a country’s political process when the model gains access to the required resources, or creating an addictive product when the required technology is developed); or a model might continue to pursue proxies of good performance in situations where they are no longer appropriate (e.g. continuing to maximize a company’s profit and growth during national emergencies, or continuing to maximize sales when it becomes apparent that a product is harmful).
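The "continuing to maximize sales" example can be sketched as a toy in Python. This is a minimal illustration under stated assumptions: the `Situation` class, the two policies, and the demand numbers are all hypothetical. The low-level structure (how demand translates into sales) is the same in both distributions; only the high-level feature `product_harmful` changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Situation:
    demand: int            # low-level structure: units customers will buy
    product_harmful: bool  # high-level feature of the broader situation


def intended_policy(s: Situation) -> int:
    # What we wanted: sell to meet demand, but stop if the product is harmful.
    return 0 if s.product_harmful else s.demand


def proxy_policy(s: Situation) -> int:
    # What the model learned (by assumption): always maximize sales.
    return s.demand


# Training never includes a harmful product; deployment does (made-up data).
TRAINING = [Situation(demand=d, product_harmful=False) for d in (10, 20, 30)]
DEPLOYMENT = [Situation(demand=d, product_harmful=True) for d in (10, 20, 30)]


def agreement(situations) -> float:
    # Fraction of situations where the learned policy matches the intended one.
    matches = sum(proxy_policy(s) == intended_policy(s) for s in situations)
    return matches / len(situations)


train_agreement = agreement(TRAINING)
deploy_agreement = agreement(DEPLOYMENT)

print("agreement on training situations:", train_agreement)
print("agreement on deployment situations:", deploy_agreement)
```

Note that `proxy_policy` is still competent after the shift, in the sense that it sells exactly as much as demand allows. No amount of evaluation on `TRAINING` distinguishes it from `intended_policy`, since the two only diverge on situations that training never visited.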

What's next? Ultimately, I'm hoping to figure out what kinds of research projects are most likely to produce forward progress towards training methods that avoid evaluation breakdown and high-level distribution shift. A world where we're making clear year-over-year progress towards these goals looks achievable to me.


3 comments

+1 for interesting investigations. I want to push back on your second point, though - the framing of the problem of high-level distributional shift. I don't think this actually captures the core thing we're worried about. For example, we can imagine a model that remains in the same environment, but becomes increasingly intelligent during training, until it realises that it has the option of doing a treacherous turn. Or we can think about the case of humans - the core skills and goals that make us dangerous to other species developed in our ancestral environment, which led to us changing our own environments. So the distributional shift was downstream of the underlying problem.

Also, in the real world, everything undergoes distributional shift all the time, so the concept doesn't narrow things down.

Thanks, Richard!

I do think both of those cases fit into the framework fine (unless I'm misunderstanding what you have in mind):

  • In the first case, we're training a model in an environment. As it gets more capable, it reaches a point where it can find new, harmful behaviors in some set of situations. Our worries are now that (1) we can't recognize that behavior as harmful, or (2) we don't visit those situations during training, but they do in fact come up in practice (distribution shift). If we say "but the version of the model we had yesterday, before all this additional training, didn't behave badly in this situation!", that just seems like sloppy training work -- it's not clear why we should expect the behavior of an earlier version of a model to bind a later version.
  • In the second case, it sounds like you're imagining us watching evolution and thinking "let's evolve humans that are reproductively fit, but aren't dangerous to other species." We train the humans a lot in the ancestral environment, and see that they don't hurt other species much. But then, the humans change the environment a lot, and in the new situations they create, they hurt other species a lot. In this case, I think it's pretty clear that the distribution has shifted. We might wish we'd done something earlier to certify that humans wouldn't hurt animals a lot under any circumstance, or we'd deployed humans in some sandbox so we could keep the high-level distribution of situations the same, or dealt with high-level distribution shift some other way.

In other words, if we imagine a model misbehaving in the wild, I think it'll usually either be the case that (1) it behaved that way during training but we didn't notice the badness (evaluation breakdown), or (2) we didn't train it on a similar enough situation (high-level distribution shift).

As we move further away from standard DL training practices, we could see failure modes that don't fit into these two categories -- e.g. there could be some bad fixed-point behaviors in amplification that aren't productively thought of as "evaluation breakdown" or "high-level distribution shift." But these two categories do seem like the most obvious ways that current DL practice could produce systematically harmful behavior, and I think they take up a pretty large part of the space of possible failures.

(ETA: I want to reiterate that these two problems are restatements of earlier thinking, esp. by Paul and Evan, and not ideas I'm claiming are new at all; I'm using my own terms for them because "inner" and "outer" alignment have different meanings for different people.)

(Short low-effort reply since we'll be talking soon.)

"we don't visit those situations during training, but they do in fact come up in practice (distribution shift)"

If you're using this definition of distributional shift, then isn't any catastrophic misbehaviour a distributional shift problem by definition, since the agent didn't cause catastrophes in the training environment?

In general I'm not claiming that distributional shift isn't happening in the leadup to catastrophes, I'm denying that it's an interesting way to describe what's going on. An unfair straw analogy: it feels kinda like saying "the main problem in trying to make humans safe is that some humans might live in different places now than we did when we evolved. Especially harmful behaviour could occur under big locational shifts". Which is... not wrong, most dangerous behaviour doesn't happen in sub-Saharan Africa. But it doesn't shed much light on what's happening: the danger is being driven by our cognition, not by high-level shifts in our environments.
