Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

 Update March 2023:  This is too confusing, skip it.

In my last post I asked if there were more natural terms for inner alignment, outer alignment, and mesa optimizers. I'm looking for a different way to slice those problems that leads to clearer understanding, communication, and research directions. This is my attempt. I aim to use terms that are older or have apparent meaning without needing a definition.

I do not aim to raise important new questions or otherwise be especially novel; I just want to be less confused.

Relating to inner alignment

Whether a model[1] optimizes its environment depends on both its ability and inclination.

We want to know the model's planning capacity in the kinds of environments it might end up in, almost with indifference to the method of planning. If an Atari agent can also somehow plan a route through town or make a coherent dinner recipe, without training on those tasks, then it is obviously quite capable. We should not disregard the ability of a system that doesn't appear to be using 'search'.

(What differentiates humans from other animals is not that one is an optimizer and the others are not; it's a question of degree. It makes more sense to ask and measure "how good is this thing at planning, and in what contexts?" than "how much of an optimizer is it?". And the word 'intelligence' has been empirically shown to be very confusing so substituting something more specific when possible would be good.)

This capability is mediated by the model's internal preferences (or utility function) over the relevant ontology. GPT has significant capacity to plan a diamond heist and strong preference towards coherence (state space = words) but is indifferent e.g. as to whether the reader executes the plan (state space = real world outcomes).

We often gesture at internal preferences indirectly and prefer more objective terminology. Internal preferences are difficult to measure and specify, but they are the fundamental determinant[2] of a model's behavior as its abilities become very strong. There is an argument that goes "speaking of the utility function of this 'agent' makes no sense because it's just trying to classify pictures. Utility functions must be irrelevant." But some models are simply indifferent to everything outside their own output space, which is in some ways quite a good property to have. The notion hasn't fallen apart; this is actually quite a clear case. We often use observed preferences as a proxy for internal preferences, and which preferences you estimate from behavior depends on your preference estimation method.

It is difficult to measure the planning capacity of e.g. GLM-130B because we have trouble designing tests that elicit the full capabilities of the system. We could call a test which shows the full ability of the system a generous test. The test that allows the model to accomplish a given task most effectively could be called the most generous test. A test that causes the model to perform below its real abilities might be a prejudiced test.
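To make the generous/prejudiced distinction concrete, here is a minimal sketch. Everything in it is a made-up illustration: `toy_model` stands in for a real system, and the two elicitation methods and the helper `most_generous_test` are hypothetical names, not an established evaluation API. The idea is just to score the same model on the same tasks under several tests and call the best-scoring one the most generous of those tried.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy "model": it can add, but only when queried in one format.
# It stands in for a real system whose measured ability depends on the test.
def toy_model(prompt: str) -> str:
    if prompt.startswith("ADD "):
        a, b = prompt[4:].split()
        return str(int(a) + int(b))
    return "?"

# Two elicitation methods (tests) for the same underlying task.
def plain_english(a: int, b: int) -> str:
    return f"What is {a} plus {b}?"

def structured(a: int, b: int) -> str:
    return f"ADD {a} {b}"

def score(test: Callable[[int, int], str], tasks: List[Tuple[int, int]]) -> float:
    """Fraction of tasks the model solves under one elicitation method."""
    correct = sum(toy_model(test(a, b)) == str(a + b) for a, b in tasks)
    return correct / len(tasks)

def most_generous_test(
    tests: Dict[str, Callable[[int, int], str]],
    tasks: List[Tuple[int, int]],
) -> Tuple[str, Dict[str, float]]:
    """Return the name of the best-scoring test plus all scores."""
    scores = {name: score(fn, tasks) for name, fn in tests.items()}
    best = max(scores, key=scores.get)
    return best, scores

TASKS = [(1, 2), (10, 5), (7, 7)]
TESTS = {"plain_english": plain_english, "structured": structured}

best, scores = most_generous_test(TESTS, TASKS)
print(best, scores)
```

Note that this only finds the most generous test among the ones you thought to try; a low score under every test you tried is still only a lower bound on ability.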

Less-vague things you can say with these words:

  • We want Google Maps to prefer shorter routes over longer ones but be indifferent as to how often people drive.
  • Your LLM gets off track when it tries to hack people but the simple input-output methodology is a prejudiced test of its capabilities. A more generous test might use embedding space directly.
  • Your RL agent has an observed preference for coins but these may not be the internal preferences.
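
The last bullet can be sketched in a few lines. This is a toy illustration under loud assumptions: the 1-D track, the two candidate preferences, and the helper `consistent_hypotheses` are all hypothetical. It shows how behavior observed in training can fit more than one internal preference, so the observed preference for coins underdetermines what the agent actually wants.

```python
TRACK_END = 4  # hypothetical 1-D track with positions 0..4

# Two candidate internal preferences, each written as "where would an agent
# with this preference end up, given where the coin is?"
HYPOTHESES = {
    "wants the coin": lambda coin_pos: coin_pos,
    "wants to go right": lambda coin_pos: TRACK_END,
}

def consistent_hypotheses(observations):
    """Return the hypotheses that reproduce every (coin_pos, final_pos) pair."""
    return [
        name
        for name, predict in HYPOTHESES.items()
        if all(predict(coin) == final for coin, final in observations)
    ]

# In training the coin always sits at the right end, so both stories fit.
training_obs = [(4, 4), (4, 4), (4, 4)]

# One extra observation with the coin moved tells the two stories apart.
shifted_obs = training_obs + [(1, 4)]

print(consistent_hypotheses(training_obs))
print(consistent_hypotheses(shifted_obs))
```

Which hypotheses you put in the candidate set is itself part of the preference estimation method, so the answer you get still depends on that choice.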

Research questions in this language:

  • When and why does a model have preferences about its inputs? Do we know of any cases where this has happened spontaneously, when it was not explicitly part of the architecture?
  • What is the most generous test of a language model's coding capability?
  • What preference estimation methods tend to be the most accurate on current RL systems?
  • Will current interpretability approaches allow us to accurately measure internal preferences?
  • Should we expect a localized preference control panel somewhere inside large neural nets?
  • How can we design a training process that causes models to form preferences over an ontology we understand?
  • Are humans' preferences verbose & complex over any single state space, and only elegant & natural if you take them as a sum of utility functions on multiple state spaces, or something?

As for mesa-optimizers, one should not be surprised when an Atari agent is optimizing the Atari game. It was trained to do exactly this. (We are instead interested in exactly how general that agent's planning capability is.) In contrast, a spontaneous optimization daemon is an agent that forms spontaneously inside any dynamic system proceeding through time. Reproduction and sexual selection and predator-prey dynamics are not hard-coded into the universe; they all spontaneously occurred inside a vibrating box of sand. (Kind of stuns me that the universe is not really an optimizer at all, or any more than a cellular automaton is, yet produced humans.)

But it's worth naming the intermediate process between physics and humans. I'd say natural selection is a selection process. Seeing humans among the atoms is quite weird but the daemon looks a bit less spontaneous once the more-sophisticated selection process comes into view. (More in appendix.)

Research question random thought: Has anyone tried running a noisy Game of Life on a supercomputer for a month and watched what happened?
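
For concreteness, here is one way "noisy Game of Life" could be defined; this is my own assumption about what the term should mean, not a standard definition. The sketch applies the usual Conway rules on a wrap-around grid and then flips each cell independently with probability `noise`. The function name `life_step` and its parameters are hypothetical.

```python
import random

def life_step(grid, noise, rng):
    """One Conway update on a torus, then flip each cell with probability `noise`."""
    n_rows, n_cols = len(grid), len(grid[0])
    new = [[0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            # Count the eight neighbors, wrapping around the edges.
            neighbors = sum(
                grid[(r + dr) % n_rows][(c + dc) % n_cols]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            alive = grid[r][c] == 1
            nxt = 1 if (neighbors == 3 or (alive and neighbors == 2)) else 0
            # Noise: occasionally overwrite the deterministic rule.
            if rng.random() < noise:
                nxt = 1 - nxt
            new[r][c] = nxt
    return new

# A horizontal "blinker" on a 5x5 grid.
blinker = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

rng = random.Random(0)
grid = blinker
for _ in range(10):
    grid = life_step(grid, noise=0.01, rng=rng)
print(sum(map(sum, grid)), "live cells after 10 noisy steps")
```

With `noise=0.0` this reduces to the ordinary Game of Life. A pure-Python loop like this is far too slow for the month-long run in question; it is only meant to pin down the dynamics.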

Loosely relating to outer alignment

The problem of taking what you want the model to do and turning it into code is already known as (reward or loss or value or task or objective) specification. Whether you are giving the model examples to imitate, a loss function to minimize, or something else, the intention and risks are largely the same.

We know the goal of maintaining desired behavior under distribution shift as robustness. Of course an RL agent will not typically be estimating true reward perfectly; this is nothing new. This is, even more mundanely, known as test loss. Perhaps call it production loss, since everyone is always looking at their test loss and it's not much of a test. The distinction between capability robustness and objective robustness is a good one, and clear as is.
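
A minimal sketch of the test loss versus production loss distinction, under made-up assumptions: the "model" is a straight line fit to a quadratic, the test set comes from the same input range as training, and "production" inputs come from a shifted range. All names here (`true_fn`, `fit_line`, `mse`) are hypothetical helpers for the illustration.

```python
import random

def true_fn(x):
    """The behavior we actually want the model to reproduce."""
    return x * x

def fit_line(xs, ys):
    """Ordinary least squares for a line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x

def mse(model, xs):
    """Mean squared error of the fitted line against the true function."""
    slope, intercept = model
    return sum((slope * x + intercept - true_fn(x)) ** 2 for x in xs) / len(xs)

rng = random.Random(0)
train_xs = [rng.uniform(0, 1) for _ in range(200)]
test_xs = [rng.uniform(0, 1) for _ in range(200)]        # same distribution
production_xs = [rng.uniform(2, 3) for _ in range(200)]  # shifted distribution

model = fit_line(train_xs, [true_fn(x) for x in train_xs])
test_loss = mse(model, test_xs)
production_loss = mse(model, production_xs)
print(f"test loss: {test_loss:.4f}  production loss: {production_loss:.4f}")
```

The held-out test set looks fine because it shares the training distribution; the loss that matters shows up only on the inputs the model meets after deployment.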

We know a failure of objective robustness as goal misgeneralization and capability robustness 

Things you can say with these words:

  • Low production loss for a bug-detection system would be quite impressive.
  • Reward specification might be too hard and we should give our nets a model-based task specification.

(Obvious) research questions:

  • What kinds of specification methods tend to lead to strong discrepancies between internal & observed preferences? Is IRL actually bad in this regard?
  • What have been some of the most surprising goal misgeneralizations and why did they happen?


Are these the true names — are we calling the problems what they are? Are the research questions above pointing at real things? Does it lead to direct investigations of the territory? Is it actually any easier to assess the general planning capacity of a model than to answer whether it's a mesa-optimizer?

My real motivation is just to understand what people are even talking about when they write about inner alignment and mesa-optimizers.

Very curious if I've hit the mark. Happy to receive feedback about better/clearer/older terms for these concepts, alternative breakdowns, or important gaps.

Appendix: Three kinds of optimizers

I can think of some ways to accomplish objectives without having any planning capacity per-se.

  • You can play so many games of go that you have a good estimate of the value of every board state.
  • You might somehow otherwise always know the right move without any forethought. (e.g. thermometers)
  • In cases where losses don't cost but wins do gain (e.g. technology discovery), you can just attempt tons of things. The important distinction from planning is that you actually attempt things in the real world, not in your imagination. (Natural selection is basically this.)
  • A randomly shifting collection of atoms will sometimes eventually have stable replicators in it. This seems importantly different from the above case but I'm not sure what to make of it...

One might call these memorizers, direct implementations, an embedded selection process, and dynamic systems (not an optimizer).[3]

So is the term 'planning capacity' too restrictive? Are we excluding important cases from our field of view? Well, an ideal memorizer does effectively behave as an ideal planner. Already knowing what happens next is not a lot different from being able to imagine it. So one can still speak of GPT-3 as having instantiated a text planning system, even if it is implemented as compressed memorization under the hood. It makes sense to ask how well or how generally it makes plans, or how far ahead it looks. And planning capacity feels more natural and important to me than memorization.
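
Here is a small sketch of the claim that an ideal memorizer behaves like an ideal planner. The game is a toy I picked for illustration (take 1, 2, or 3 stones from a pile; whoever takes the last stone wins), and `plan_move`, `recall_move`, and `TABLE` are hypothetical names. As a shortcut, the memorizer's table is filled in by running the planner ahead of time, standing in for having played enough games to know every state.

```python
MOVES = (1, 2, 3)   # legal numbers of stones to take
MAX_STONES = 12     # largest pile the memorizer has "seen"

def wins(stones: int) -> bool:
    """True if the player to move can force a win, found by explicit lookahead."""
    return any(m <= stones and not wins(stones - m) for m in MOVES)

def plan_move(stones: int) -> int:
    """Planner: search the game tree at decision time."""
    for m in MOVES:
        if m <= stones and not wins(stones - m):
            return m
    return 1  # every move loses against perfect play; take one stone

# Memorizer: no search at decision time, only a lookup.
TABLE = {s: plan_move(s) for s in range(1, MAX_STONES + 1)}

def recall_move(stones: int) -> int:
    """Memorizer: return the stored move for a state it has already seen."""
    return TABLE[stones]

same = all(plan_move(s) == recall_move(s) for s in range(1, MAX_STONES + 1))
print("identical behavior on seen states:", same)
```

From the outside the two are indistinguishable on every pile up to `MAX_STONES`. They come apart only off-table, where `recall_move` raises a `KeyError` and `plan_move` keeps working, which is one way to read the question of how generally a system makes plans.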

Selection processes are certainly quite different from either of these. I'd say that when a system is running one, we'll probably know it, see it, or have specified it. It's such a unique case that control & analysis methods from planners and memorizers will transfer very poorly.

So I think the term planning capacity is a bit narrow but feels about right for where we are and the questions we're asking right now.

  1. ^

    I say model instead of agent because not all models act much like agents; for many purposes we don't need agents. If you refer to models as learned algorithms then nobody will know what you're talking about.

  2. ^

    Of course excluding inputs and the operating environment.

  3. ^

    Apologies if there are existing terms for these.
