Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Here I apply my "If I were a well-intentioned AI" filter to mesa-optimisers.

Now, I know that a mesa-optimiser need not be a subagent (see 1.1 here), but I'm obviously going to imagine myself as a mesa-optimising subagent.

An immediate human analogy springs to mind: I'm the director of a subdivision of some corporation or agency, and the "root optimiser" is the management of that entity.

There is a lot of literature on what happens if I'm selfish in this position; but if I'm well-intentioned, what should I be doing?

One thing that thinking this way made me realise: there is a big difference between "aligned with management" and "controlled by management".

We'll consider each one in turn, but to summarise: aligned mesa-optimisers are generally better than controlled mesa-optimisers, but it is hard to tell the difference between an aligned and a dangerous unaligned mesa-optimiser.

Control vs alignment

First let's flesh out the corporate/management example a bit. Me-AI is in charge of making widgets that are used by the company for some purpose. That purpose is given by $U$ - the base utility for the corporation.

My role is to make as many widgets as possible within my budget; this is $\nu$, the mesa-objective I have been given by management.

My true utility function is $u$. Management don't fully know what $u$ is - or at least don't fully understand it, or all of its implications. This lack of understanding is needed, of course, because if management fully understood the implications of $u$, there would be no uncertainty at all on their part, and they could make me do exactly what they wanted - or they would turn me off.

Because of this uncertainty, management have added some extra levels of control over me. Let's assume one typical precaution: if I underspend the budget, the budget is cut next year. If I overspend, I'm reprimanded (and fired if I do it too often), but the budget is not cut.
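A minimal sketch of this precaution, purely as my own illustration (the function name and numbers are not from the post):

```python
# Toy model of the budget rule: underspending cuts next year's budget;
# overspending triggers a reprimand but leaves the budget unchanged.

def next_year_budget(budget: float, spent: float) -> float:
    if spent < budget:
        return spent          # budget cut down to what was actually used
    if spent > budget:
        print("Reprimand: overspent the budget.")
    return budget             # overspending (or exact spending) keeps the budget

# Example: spend 90 out of 100 and the budget drops to 90 next year.
print(next_year_budget(100.0, 90.0))   # -> 90.0
```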

There are three possible situations I could confront:

  • $S_1$: I've made as many widgets as I can this year, and spent only part of my budget. I predict that, next year, I will need even less than I spent this year.

  • $S_2$: I've made as many widgets as I can this year, and spent 90% of my budget. I predict that, next year, I will need my full current budget.

  • $S_3$: I've spent all of my budget on widgets. I predict that widgets are particularly valuable to the company this year, much more so than next year.

Aligned mesa-optimiser

If I'm aligned with the goals of the company, then $u = U$, the base utility. Management has assigned $\nu$ to me, but I actually have no intrinsic desire to maximise it. I am a $U$-maximiser, with the extra information that management, who are also $U$-maximisers, have assigned $\nu$ to me.

My behaviour should be that of a $\nu$-maximiser in most circumstances. If I know management is much smarter than me (or has an information advantage; maybe I'm not even sure what $U$ is!), then I will tend to behave like a blind $\nu$-maximiser, trusting management to sort things out properly. In $S_1$ and $S_2$, I will not spend anything more, and in $S_3$, I will do nothing.

If I know I'm smarter than management, at least in some areas, then I will start making moves that are $U$-maximising but not necessarily $\nu$-maximising. In $S_1$, I won't spend any more (since I don't need a large budget next year), but in $S_2$, I'll waste the extra to ensure my budget isn't cut. In $S_3$, I'll go over budget to get the extra-valuable widgets.
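To make the contrast concrete, here is a toy sketch (my own, with made-up labels) of how the aligned agent's choice depends on whether it trusts its own judgement over management's:

```python
# Toy sketch of the aligned agent's choices in the three situations.
# "S1", "S2", "S3" and the action strings are illustrative labels only.

def aligned_action(situation: str, smarter_than_management: bool) -> str:
    if not smarter_than_management:
        # Defer to management: behave like a blind mesa-objective maximiser.
        return {"S1": "spend nothing more",
                "S2": "spend nothing more",
                "S3": "do nothing"}[situation]
    # Otherwise take U-maximising actions that go beyond the mesa-objective.
    return {"S1": "spend nothing more",    # low future needs, a cut is harmless
            "S2": "waste the surplus",     # protect next year's budget
            "S3": "overspend"}[situation]  # widgets unusually valuable now

for s in ("S1", "S2", "S3"):
    print(s, "->", aligned_action(s, smarter_than_management=True))
```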

Dangerous mesa-optimiser

Note that, from a management perspective, the $S_2$ and $S_3$ actions of a smart aligned mesa-optimiser... look like a dangerous, unaligned mesa-optimiser defecting.

Let $u'$ be the utility function of an unaligned (and non-controlled) agent. How can management distinguish between a $U$-maximiser and a $u'$-maximiser?

In both cases, the subagent will take actions that go beyond $\nu$, actions whose ramifications management cannot immediately see.

Of course, I would explain to management why I'm doing this, and why my actions are compatible with $U$. But a $u'$-maximiser would similarly explain its actions, trying to trick the management into thinking it was... me.

Thus, as in the classical principal-agent problem, there is a problem of imperfect information. Even if I am aligned, management can't trust me if they don't know I'm aligned. I would try and signal this; an unaligned agent would try and fake the same signals. The very fact that aligned-but-not-controlled agents could exist makes it easier for unaligned-and-not-controlled agents to go undetected.

Controlled mesa-optimiser

A controlled mesa-optimiser is the obedient servant. They will maximise $\nu$, while ensuring that they remain under management control.

If I'm in that case, I'll try and ensure that management has accurate information about what I'm doing[1], and that they can change my objectives as needed. I'm safely interruptible and corrigible.

In $S_3$, I will not do anything; it's not my role to maximise $U$, so it's not my role to take advantage of this opportunity. Depending on how I'm designed, I may report this information to management. In $S_1$, I will not spend any extra, and I'll let my budget be cut next year.

In $S_2$, I face a dilemma: my goal $\nu$, which wants me to preserve my budget for next year's widgets, vs my desire to be informative. Again, this varies depending on how I'm designed. Most likely I'll spend my entire budget (to prevent the automatic cutting of my budget), while reporting this to management so they have full information. If I can't report this, I may decide not to spend any extra, so that management gets better information.
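A similar toy sketch (again my own, with a made-up can_report flag) for the controlled agent:

```python
# Toy sketch of the controlled agent's choices; situation labels and the
# can_report flag are illustrative.

def controlled_action(situation: str, can_report: bool) -> str:
    if situation == "S1":
        return "spend nothing extra (accept the budget cut)"
    if situation == "S3":
        return "do nothing (maximising U is not my role)"
    # S2: tension between preserving the budget and being informative.
    if can_report:
        return "spend the surplus, and report why to management"
    return "spend nothing extra, so management gets accurate information"

for s in ("S1", "S2", "S3"):
    print(s, "->", controlled_action(s, can_report=True))
```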

Note the "depending on how I'm designed". Because unlike the aligned mesa-optimiser, the controlled agent has two goals, in tension with each other: maximising , and being informative and controlled by management. If we call the "controlled" goal , then I'm maximising

where means that is some function of and that increases whenever either of those functions increase.
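For instance - my example, not one the post commits to - any positively weighted sum has this property:

$$f(\nu, C) = a\,\nu + b\,C, \qquad a, b > 0,$$

since increasing either $\nu$ or $C$ increases $f$.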

I'm not aligned, so I'm not a $U$-maximiser. I would gladly sacrifice $U$ completely, in order to maximise $\nu$ and/or $C$. Hopefully, $C$ is sufficiently well-designed to allow management to predict this and preclude it. And, hopefully, the tradeoff $f$ between $\nu$ and $C$ is also sufficiently well designed to preclude it.

Since I am well-intentioned, I can make use of the fact that $U$ is the outcome desired by management, in order to better figure out the ideal versions of $\nu$, of $C$, and of $f$. Defining "the human H understands what the mesa-optimiser is doing" is not easy. So a well-intentioned controlled mesa-optimiser is akin to a "control optimiser" which is trying to figure out what control means (see previous posts in this sequence) and avoiding Goodharting both $\nu$ and $C$.

Aligned and controlled

I could instead be maximising

$$f(U, \nu, C),$$

or, more simply,

$$f(U, C),$$

with the extra piece of information that management have assigned me the role of maximising $\nu$. The $f$ might be a smoothed lexicographic ordering, with $U$ the primary utility.
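For concreteness, one possible "smoothed lexicographic" form (my illustration; the post doesn't specify one) bounds the influence of $C$ so that it can never outweigh more than a small amount $\epsilon$ of $U$:

$$f(U, C) = U + \epsilon \, \frac{C}{1 + |C|}, \qquad 0 < \epsilon \ll 1.$$

Here $U$ is effectively primary: improvements in $C$ still count, but can shift $f$ by at most $\epsilon$, so $C$ mainly acts as a tie-breaker between options with (almost) equal $U$.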

So, here I'm aligned and controlled: I can trade off some of what management really want - $U$ - to increase $C$, their control over me. But not arbitrary amounts of it.

So, being aligned and controlled, I will do nothing in $S_1$, just as both other agents do. I will overspend in $S_3$, to take advantage of the $U$-increasing potential, just like the aligned agent.

The $S_2$ situation is more complicated. I have a tendency to want to waste my budget (alignment), but I also want to communicate good information to management (control). Unlike for the controlled agent, where control will dictate "do nothing" if there is no other way of communicating the information, this is a genuine tension. I am therefore more likely to choose "waste" than the controlled agent is.
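To make the $S_2$ tension concrete, here is a toy calculation (the numbers are entirely made up) using the smoothed combination sketched above with $\epsilon = 0.1$:

```python
# Toy numbers (purely illustrative) for the S2 dilemma, using a smoothed
# lexicographic combination with a small weight on the control goal C.

EPS = 0.1

def f(U: float, C: float) -> float:
    # U is primary; C is bounded so it can shift f by at most EPS.
    return U + EPS * C / (1 + abs(C))

# "waste": protects next year's budget (good for U), but gives management a
#          misleading picture of my real needs (bad for C).
# "do nothing": budget gets cut (bad for U), but management sees an honest
#          picture of this year's spending (good for C).
options = {"waste": f(U=1.0, C=-1.0), "do nothing": f(U=0.0, C=1.0)}

print(max(options, key=options.get), options)  # "waste" wins: U dominates
```

Because $U$ dominates, the bounded control term can only tip the decision when the $U$-stakes are nearly tied; that is why this agent leans towards "waste" more than the purely controlled one does.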

In general, this seems the best design, if management is unsure of me. Not $U$-optimal, but the probability of dramatic $U$-failure is reduced, while management keep some of the control that they think is desirable.

Summary

To summarise the three types of agent: if they are smart and powerful compared to management, they would tend to do the following actions (a ? indicates strong uncertainty; if one option is more likely, despite the uncertainty, that option is noted before the ?):

| | $S_1$ | $S_2$ | $S_3$ |
|---|---|---|---|
| Aligned | nothing | waste | overspend |
| Controlled | nothing | waste ? | nothing |
| Aligned and controlled | nothing | waste ? | overspend |

An unaligned agent will pretend to be one of the other types of agents; it will have most success at its own objectives if it pretends to be an aligned agent.

If management misclassifies me

Aligned, believed controlled

If I'm aligned, but management thinks I'm controlled, this isn't much of a problem. When I'm weak or ignorant, I should behave like a controlled agent anyway, since management have a much better handle on what is needed than I do.

Once I become powerful and smart, I would most likely still want management to believe I'm controlled (or else they might intervene to constrain me or turn me off). So I'd generally behave exactly as if I were controlled, only taking advantage of grey zones of management misunderstanding to push the outcome more towards the $U$-optimal position, even at the cost of $\nu$ and $C$.

Controlled, believed aligned

A controlled agent that is believed to be aligned is more of a problem. If $C$ provides an incentive for me to signal to management "hey, you misclassified me, I'm actually controlled, not aligned", then the problem is self-correcting.

If not, the problems of controlled agents are exacerbated; management will let me get on with my own stuff, ignoring any "weird" or "anti-$U$" stuff that I might be doing, because they believe that I'm acting for $U$ and just lack the information to fully understand why I'm taking these actions. Their more "rubber-stamp" approach increases the risk that I may sacrifice their $U$ to my own $\nu$ and $C$.


  1. Note that even a controlled mesa-optimiser is not necessarily honest. My goal is to give the management good information and be under their control; not to be honest. If management expects that every director will pad their budget requests by a certain percentage, then I will do so as well; to refrain from doing so would be misleading. ↩︎

Comments

Planned summary for the Alignment Newsletter:

This sequence takes on the perspective of an AI system that is well-intentioned, but lacking information about what humans want. The hope is to find what good AI reasoning might look like, and hopefully use this to derive insights for safety. The sequence considers Goodhart problems, adversarial examples, distribution shift, subagent problems, etc.

Planned opinion:

I liked this sequence. Often when presented with a potential problem in AI safety, I ask myself why the problem doesn't also apply to humans, and how humans have managed to solve the problem. This sequence was primarily this sort of reasoning, and I think it did a good job of highlighting how with sufficient conservatism it seems plausible that many problems are not that bad if the AI is well-intentioned, even if it has very little information, or finds it hard to communicate with humans, or has the wrong abstractions.

Sounds good, cheers!