Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I have only just read the mesa optimizers paper, and I don't understand what it adds to the pre-existing picture that "ML can fail to generalize outside the train distribution and this is bad."

The discussion in the paper generally assumes a background distinction between "training" and "deployment" (or an equivalent distinction), and discusses models which succeed on the base objective during "training" but not during "deployment."

In the sections about "deception," this happens in a special way quite unlike the ordinary failures to generalize that we see in ML today (and arises under correspondingly exotic conditions). But, in cases other than "deception," the paper describes dynamics and outcomes that seem identical to the ordinary generalization problem in ML:

  • Training finds a model that scores well on the base objective, assessed over the training distribution
  • But, this model may not score well on the base objective, assessed over other distributions

For example, the following is just a textbook generalization failure, which can happen with or without a mesa-optimizer:

For a toy example of what pseudo-alignment might look like, consider an RL agent trained on a maze navigation task where all the doors during training happen to be red. Let the base objective (reward function) be $O_\text{base}$ = (1 if reached a door, 0 otherwise). On the training distribution, this objective is equivalent to $O_\text{alt}$ = (1 if reached something red, 0 otherwise). Consider what would happen if an agent, trained to high performance on $O_\text{base}$ on this task, were put in an environment where the doors are instead blue, and with some red objects that are not doors. It might generalize on $O_\text{base}$, reliably navigating to the blue door in each maze (robust alignment). But it might also generalize on $O_\text{alt}$ instead of $O_\text{base}$, reliably navigating each maze to reach red objects (pseudo-alignment).

Additionally, when the paper makes remarks that seem to be addressing my question, I find these remarks confused (or perhaps just confusing).

For instance, in this remark

The possibility of mesa-optimizers has important implications for the safety of advanced machine learning systems. When a base optimizer generates a mesa-optimizer, safety properties of the base optimizer may not transfer to the mesa-optimizer.

I don't understand what "safety properties of the base optimizer" could be, apart from facts about the optima it tends to produce. That is, I can't think of a property that would appear to confer "safety" until we consider the possibility of producing mesa-optimizers, and then stop appearing thus.

A safety property of an optimizer is some kind of assurance about the properties of the optima; if such a property only holds for a subset of optima (the ones that are not mesa-optimizers), we'll see this appear mathematically in the definition of the property or in theorems about it, whether or not we have explicitly considered the possibility of mesa-optimizers. (I suppose the argument could be that some candidate safety properties implicitly assume no optimum is a mesa-optimizer, and thus appear to apply to all optima while not really doing so -- somewhat analogous to early notions of continuity which implicitly assumed away the Weierstrass function. But if so, I need a real example of such a case to convince me.)

The following seems to offer a different answer to my question:

Pseudo-alignment, therefore, presents a potentially dangerous robustness problem since it opens up the possibility of a machine learning system that competently takes actions to achieve something other than the intended goal when off the training distribution. That is, its capabilities might generalize while its objective does not.

This seems to contrast two ways of failing on samples from a non-train distribution. Supposing a model has learned to "understand" train samples and use that understanding to aim for a target, it can then

  • fail to understand non-train samples, thus losing the ability to aim for any target (capabilities fail to generalize)
  • understand non-train samples and aim for its internalized target, which matched the base target in training, but not here (objective fails to generalize)

But even without mesa-optimizers, cases of ML generalization failure often involve the latter, not just the former. A dumb classifier with no internal search, trained on the red-door setting described above, would "understand" the blue-door test data well enough to apply its internalized objective perfectly. Even in this textbook-like case, exhibited by arbitrarily simple classifiers, the "capabilities generalize." This kind of problem is definitely bad; I just don't see what's new about it.



2 Answers

So, I certainly agree that pseudo-alignment is a type of robustness/distributional shift problem. In fact, I would describe “Risks from Learned Optimization” as a deep dive on a particular subset of robustness problems that might be particularly concerning from a safety standpoint. Thus, in that sense, whether it's really a “new” sort of robustness problem is less the point than the analysis that the paper presents of that robustness problem. That being said, I do think that at least the focus on mesa-optimization was fairly novel in terms of cashing out the generalization failures we wanted to discuss in terms of the sorts of learned optimization processes that might exhibit them (as well as the discussion of deception, as you mention).

I don't understand what "safety properties of the base optimizer" could be, apart from facts about the optima it tends to produce.

I agree with that and I think that the sentence you're quoting there is meant for a different sort of reader that has less of a clear concept of ML. One way to interpret the passage you're quoting that might help you is that it's just saying that guarantees about global optima don't necessarily translate to local optima or to actual models you might find in practice.

But even without mesa-optimizers, cases of ML generalization failure often involve the latter, not just the former.

I also agree with this. I would describe my picture here as something like: Pseudo-aligned mesa-optimization ⊂ Objective generalization without capability generalization ⊂ Robustness problems. Given that picture, I would say that the pseudo-aligned mesa-optimizer case is the most concerning from a safety perspective, then generic objective generalization without capability generalization, then robustness problems in general. And I would argue that it makes sense to break it down in that way precisely because you get more concerning safety problems as you go narrower.

More detail on the capability vs. objective robustness picture is also available here and here.

I disagree with the framing that: "pseudo-alignment is a type of robustness/distributional shift problem". This is literally true based on how it's defined in the paper. But I think in practice, we should expect approximately aligned mesa-optimizers that do very bad things on-distribution (without being detected).

According to me (and at least some if not all of the authors of that paper disagree with me), the main point is highlighting the possibility of capabilities generalizing while objectives do not. I agree that this is a failure mode that we knew about before, but it's not one that people were paying much attention to. At the very least, when people said they worked on "robustness", they weren't distinguishing between capability failure vs. objective failure (though of course the line between these is blurry).

Although on the other hand, decade+ old arguments about the instrumental utility of good behavior while dependent on humans have more or less the same format. Seeing good behavior is better evidence of intelligence (capabilities generalizing) than it is of benevolence (goals 'generalizing').

The big difference is that the olde-style argument would be about actual agents being evaluated by humans, while the mesa-optimizers argument is about potential configurations of a reinforcement learner being evaluated by a reward function.

(Really minor formatting nitpick, but it's the kind of thing that really trips me up while reading: you forgot a closing parenthesis somewhere in your comment.)

Ben Pace (2y): oh no

rohinmshah (2y): Fixed, thanks.