Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In the previous posts, I first outlined Selection versus Control for Optimization, then talked about What Optimization Means and how we quantify it, and then applied these ideas a bit to ground the discussion.

After doing so, I reached a point where I think that there is something useful to say about mesa-optimizers. This isn't yet completely clear to me, and it seems likely that at some point in the (hopefully near) future, someone will build a much clearer conceptual understanding, and I'll replace this with a link to that discussion. Until then, I want to talk about how mesa-optimizers are control systems built within selection systems, and why that poses significant additional concern for alignment.

Mesa-Optimizers

I claimed in the previous post that mesa-optimizers are always control systems. The base optimization selects for a mesa-optimizer, usually via side-effect-free or low-cost sampling and/or simulation, and then creates a system that does further optimization as an output. In some cases, the further optimization is direct optimization in the terms laid out in my earlier post, but in others it is control.

Direct Mesa-Optimizers

In many cases, the selection system finds optimal parameters, or something similar, for a direct optimization system. The earlier example of building a guidance system for a rocket is an obvious case where selection leads to a direct optimizer. This isn't the only way that optimizers can interact, though.

I'd say that MCTS + deep learning is an important example of this mix, one which has been compared to "Thinking Fast and Slow" (pdf, NIPS paper). In chess, for example, the thinking fast is the heuristic evaluation that chooses where to explore, which is built by a selection system, and the exploration itself is MCTS, which in this context I'm calling direct optimization. (This is despite the fact that MCTS is obviously probabilistic, so in some respects it looks like selection; while it is choosing points in the space, it isn't evaluating them by sampling, but rather deterministically playing forward alternative scenarios. The evaluation is done by the heuristic evaluation system.) In that specific scenario, any misalignment is almost entirely a selection system issue, unless there was some actual mistake in the implementation of the rules of chess.
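To make that division of labor concrete, here is a minimal sketch in Python. It is my own illustration, not code from any of the systems discussed: the tree search is collapsed to a single ply of UCB-style move selection to keep it short, and `legal_moves`, `apply_move`, and `evaluate` are hypothetical stand-ins. The point is only the structure: `evaluate` plays the role of the selection-trained heuristic, while the loop deterministically plays forward alternatives and never re-examines how `evaluate` was built.

```python
import math

def choose_move(state, legal_moves, apply_move, evaluate, n_sims=200, c=1.4):
    """Pick a move by deterministic play-forward, scored by a learned heuristic."""
    stats = {m: [0, 0.0] for m in legal_moves(state)}  # move -> [visits, total score]
    for _ in range(n_sims):
        total_visits = sum(v for v, _ in stats.values()) + 1

        def ucb(move):
            visits, score = stats[move]
            if visits == 0:
                return float("inf")  # try every move at least once
            return score / visits + c * math.sqrt(math.log(total_visits) / visits)

        move = max(stats, key=ucb)
        next_state = apply_move(state, move)  # deterministic play-forward of one alternative
        score = evaluate(next_state)          # evaluation comes from the trained heuristic
        stats[move][0] += 1
        stats[move][1] += score
    return max(stats, key=lambda m: stats[m][0])  # most-visited move

# Toy usage: states are integers, moves add 1-3, and the "trained" evaluator
# (here just a hand-written proxy) prefers states near 10.
if __name__ == "__main__":
    print(choose_move(7,
                      legal_moves=lambda s: [1, 2, 3],
                      apply_move=lambda s, m: s + m,
                      evaluate=lambda s: -abs(10 - s)))  # prints 3
```

If `evaluate` encodes a bad proxy, the search will faithfully optimize that proxy; that is the sense in which misalignment here lives on the selection side.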

This arrangement opens up a significant concern for causal Goodhart: regime change will plausibly have particularly nasty effects, because the directly optimizing mesa-optimizer isn't able to reconsider whether the parameters selected by the base optimizer are still appropriate. And this is far worse for "true" control mesa-optimizers.

Control Mesa-optimizers

Before we talk more about failure modes, it's important to clarify two levels of failure: the base optimizer can fail to achieve its goals because it designs a misaligned mesa-optimizer, and the mesa-optimizer itself can fail to achieve its goals. The two failure modes are different, because we don't have any reason to assume a priori that our mesa-optimizer shares goals with our base optimizer.

Just as humans are adaptation-executers, mesa-optimizers are mesa-optimizers, not simply optimizers. If their adaptations are instead tested against the true goal, or at least the base optimizer's goal, and then evaluated on that basis, they aren't mesa-optimizers; they are trials for the selection optimizer. Note that these two cases aren't necessarily incompatible: in something like Google's federated learning model, the shared model is updated based on the control system's data. So self-driving cars may be mesa-optimizers using a neural net, and the data they gather is later used to update the base optimizer's model, which is then given back to the agents. The two parts of the system can therefore suffer different types of failures, but at least the post-hoc updating seems to plausibly reduce misalignment of the mesa-optimizer.
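As a concrete sketch of that pattern, here is a toy federated-averaging loop in Python. It is my own illustration under simplifying assumptions (a linear model, squared loss, synchronous rounds), not Google's actual system: each deployed agent holds a frozen copy of the shared weights, its locally gathered data produces an update, and the base optimizer averages those updates back into the shared model before redistributing it.

```python
import numpy as np

def local_gradient(weights, X, y):
    """Gradient of mean squared error for a linear model on one agent's local data."""
    preds = X @ weights
    return 2 * X.T @ (preds - y) / len(y)

def federated_round(weights, agent_datasets, lr=0.1):
    """One round: each agent computes an update from its local data; the base optimizer averages them."""
    updates = [local_gradient(weights, X, y) for X, y in agent_datasets]
    return weights - lr * np.mean(updates, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -1.0])
    # Three "cars", each with its own locally gathered (X, y) data.
    agents = []
    for _ in range(3):
        X = rng.normal(size=(50, 2))
        y = X @ true_w + rng.normal(scale=0.1, size=50)
        agents.append((X, y))
    shared_w = np.zeros(2)
    for _ in range(200):
        shared_w = federated_round(shared_w, agents)
    print(shared_w)  # approaches true_w as rounds accumulate
```

Within a round, each agent simply executes the frozen weights; only the base optimizer's post-hoc update moves the shared model, which is why the two levels can fail in different ways.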

But in cases where side-effects occur, so that the control mesa-optimizer imposes externalities on the system, the mesa-optimizer typically won't share goals with the base optimizer! This is because if it shared goals exactly, the base optimization wouldn't need to build a mesa-optimizer; it could run tests without ceding direct control. (Our self-driving cars in the previous paragraph are similar to this, but their goal is the trained network's implicit goal, not the training objective used for the update. In such a federated learning model, the agent can fail and the trial can still be used to better learn the model.) On the other hand, if no side-effects occur and goals are shared, the difference is irrelevant; the mesa-optimizer can fail costlessly and start over.

Failure of Mesa-optimizers as Principal-Agent-type "Agents"

There is a clear connection to principal-agent problems. (Unfortunately, the term agent is ambiguous, since we also want to discuss embedded agents; in this section, that's not what is being discussed.) Mesa-optimizers can succeed at their goals but fail to be reliable agents, or they can fail even at their own goals. I'm unsure about this, but it seems that each of these cases should be considered separately. Earlier, I noted that some Goodhart failures are model failures. With mesa-optimizers involved, there are two sets of models: the base optimizer's model and the mesa-optimizer's model.

Principal optimization failures occur either if the mesa-optimizer itself falls prey to a Goodhart failure due to shared failures in the model, or if the mesa-optimizer's model or goals differ from the principal's in ways that allow the metrics not to align with the principal's goals. (Abrams correctly noted in an earlier comment that this is misalignment. I'm not sure, but it seems this is principally a terminology issue.)

This second form allows a new set of failures. I'm even less sure about this next part, but I'll suggest that we can usefully categorize the second class into three cases: mesa-superoptimizers, mesa-suboptimizers, and mesa-transoptimizers. The first, mesa-superoptimizers, is where the mesa-optimizer is able to find clever ways to get around the (less intelligent) base optimizer's model. This allows all of the classic Goodhart's law-type failures, but they occur between the mesa-optimizer and the base optimizer, rather than between the human controller and the optimizer. This case includes the classic runaway superintelligence problems. The second, mesa-suboptimizers, is where the mesa-optimizer uses a simpler model than the base optimizer, and hits a Goodhart failure that the base optimizer could have avoided. (Let's say, for example, that it uses a correlational model that holds constant certain factors the base optimizer knows influence the system, and for whatever reason the mesa-optimizer enters a regime-change region, where those factors change in ways that the base model understands.) Lastly, there are mesa-transoptimizers, where typical human types of principal-agent failures can occur because the mesa-optimizer has different goals. The other way this occurs is if the mesa-optimizer has access to, or builds, a different model than the base optimizer. This is a bit different from mesa-superoptimizers, and it seems likely that there are a variety of cases in this last category. I'd suggest that it may be more like multi-agent failures than like a traditional superintelligence alignment problem.
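As a toy numeric illustration of the mesa-suboptimizer case (my own numbers, not from the post), suppose the outcome depends on both the chosen action x and a background factor z. The base optimizer's model knows about z; the mesa-optimizer's correlational model was fit in a regime where z happened to sit at 1.0 and simply bakes that value in.

```python
def true_outcome(x, z):
    # Base-model view: utility depends on both the action and the background factor.
    return -(x - 2 * z) ** 2

def mesa_model(x, z_assumed=1.0):
    # Mesa view: z was always ~1.0 during training, so it is treated as a constant.
    return -(x - 2 * z_assumed) ** 2

def best_action(model, candidates, **kwargs):
    return max(candidates, key=lambda x: model(x, **kwargs))

if __name__ == "__main__":
    candidates = [i / 10 for i in range(0, 101)]  # actions 0.0 .. 10.0

    # Old regime: z really is 1.0, and the mesa-optimizer does fine.
    x_mesa = best_action(mesa_model, candidates)
    print("old regime:", x_mesa, true_outcome(x_mesa, z=1.0))

    # Regime change: z shifts to 4.0. The base model would pick x = 8.0,
    # but the frozen mesa-model still picks x = 2.0 and does badly.
    x_base = best_action(true_outcome, candidates, z=4.0)
    print("new regime, base model:", x_base, true_outcome(x_base, z=4.0))
    print("new regime, mesa model:", x_mesa, true_outcome(x_mesa, z=4.0))
```

In the old regime the two models recommend the same action; after the regime change the frozen mesa-model keeps recommending x = 2.0 and does badly, even though the base model would have adjusted.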

On to Embedded Agents

I need to think further about the above, and should probably return to it, but for now I plan to make the (eventual) next post about embedded agency in this context. Backing up from the discussion of mesa-optimizers, a key challenge for building safe optimizers in general is that control often involves embedded agent issues, where the model must be smaller than the system. In particular, in the case of mesa-optimizers, the base optimizer needs to think of itself as an embedded agent whose model must include the mesa-optimizer's behavior, which is itself being chosen by the base optimizer. This isn't quite embedded agency, but it requires the base optimizer to be "larger" than the mesa-optimizer, only allowing mesa-suboptimizers, which is unlikely to be guaranteed in general.

Comments
This isn't quite embedded agency, but it requires the base optimizer to be "larger" than the mesa-optimizer, only allowing mesa-suboptimizers, which is unlikely to be guaranteed in general.

Size might be easier to handle if some parts of the design are shared. For example, if the mesa-optimizer's design were the same as the agent's, and the agent understood itself and knew the mesa-optimizer's design, then it seems like their being the same size wouldn't be (as much of) an issue.

Principal optimization failures occur either if the mesa-optimizer itself falls prey to a Goodhart failure due to shared failures in the model, or if the mesa-optimizer's model or goals differ from the principal's in ways that allow the metrics not to align with the principal's goals. (Abrams correctly noted in an earlier comment that this is misalignment. I'm not sure, but it seems this is principally a terminology issue.)

1) It seems like there's a difference between the two cases. If I write a program to take the CRT, and then we both take it, and we both get the same score (and that isn't a perfect score), because it solved the questions the way I solve them, that doesn't sound like misalignment.

2) Calling the issues between the agents that arise from model differences "terminology issues" could also work well; this may be a little like people talking past each other.

Lastly, there are mesa-transoptimizers, where typical human types of principal-agent failures can occur because the mesa-optimizer has different goals. The other way this occurs is if the mesa-optimizer has access to, or builds, a different model than the base optimizer.

Some efforts require multiple parties being on the same page. Perhaps a self-driving car that drives on the wrong side of the road could be called "unsynchronized" or "out of sync". (If we really like using the word "alignment", the ideal state could be called "model alignment".)

2) Calling the issues between the agents that arise from model differences "terminology issues" could also work well; this may be a little like people talking past each other.

I really like this point. I think it's parallel to the human issue where different models of the world can lead to misinterpretation of the "same" goal. So "terminology issues" would include, for example, two different measurements of what we would assume is the same quantity. If the base optimizer is looking to set the temperature using a wall thermometer, while the mesa-optimizer is using one located on the floor, the mesa-optimizer might be misaligned because it interprets "temperature" as referring to a different fact than the base optimizer does. On the other hand, when the same metric is being used by both parties, the class of possible mistakes does not include what we're now calling terminology issues.
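As a toy version of this (my own sketch, with an arbitrary 2-degree offset between the sensors), a simple proportional controller that regulates against the floor sensor will settle exactly where its own reading hits the target, while the wall reading, the one the base optimizer cares about, sits somewhere else:

```python
def wall_sensor(room_temp):
    return room_temp

def floor_sensor(room_temp):
    return room_temp - 2.0  # the floor runs cooler, by assumption

def control_step(room_temp, sensor, target=21.0, gain=0.5):
    """Simple proportional controller: push the *sensed* temperature toward the target."""
    error = target - sensor(room_temp)
    return room_temp + gain * error

if __name__ == "__main__":
    temp = 17.0
    for _ in range(50):
        temp = control_step(temp, sensor=floor_sensor)  # the mesa-controller's view
    print("floor reading:", round(floor_sensor(temp), 2))  # ~21.0: mesa-goal satisfied
    print("wall reading:", round(wall_sensor(temp), 2))    # ~23.0: principal's goal missed
```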

I think this also points to a fundamental epistemological issue, one even broader than goal-representation. It's possible that two models disagree on representation but agree on all object-level claims; think of using different coordinate systems. Because terminology issues can cause mistakes, I'd suggest that agents with non-shared world models can only reliably communicate via object-level claims.

The implication for AI alignment might be that we need AI either to model the world in fundamentally the same way humans do, or to communicate only via object-level goals and constraints.

It seems like there's a difference between the two cases. If I write a program to take the CRT, and then we both take it, and we both get the same score (and that isn't a perfect score), because it solved the questions the way I solve them, that doesn't sound like misalignment.

The misalignment here is between you and the CRT, and reflects your model being misaligned with the goal / reality. That's why I'm calling it a principal alignment failure: even though it's the program / mesa-optimizer that fails, the alignment failure is located in the principal, you / the base optimizer.