The following is my first post on AI Safety, in a manner of speaking. To be more precise, this is my distillation of the first post in the Risks from Learned Optimization sequence by Evan Hubinger et al. Further distillations will follow. (I do not mean to sound threatening, sorry.)

Where This All Started

Let’s get right to it.

In order to get AI to do what we want it to do, we used to tell it exactly what to do.

If it was to walk from point A to point B, we used to tell it, “Walk straight, take a left, two rights, then straight, and on another left, you are at point B.”

But after a while, we stopped doing that.


We began to tell AI to do things that we really couldn’t tell it to do precisely.

We wanted it to check images for cats and dogs, suggest words to auto-complete when we write sentences and find the shortest route to reach our grandparents' houses in time for Christmas dinner, but we couldn’t convey how to go about some of these things exactly.

So we started telling AI, “Figure it out.”


Neural networks are an example of AI that “figures it out”.

What we used to do with these AIs, in a manner of abstraction, was basically tell them: this is point A, “where we are (my house)”, and this is point B, “where we want to be (grandma’s house, because the chicken is turning cold)”. The AI then gets to try out a bunch of random actions, and we tell it whether it reached point B in time or not. After that happens a bunch of times, the AI understands that some sets of actions lead us to Antarctica, while other sets of actions take us to our grandma’s house in time.

NOTE: “A bunch of random actions” is an abstraction of what actually occurs, as there are AIs that do perform “actions” but don't have a conscious sense of what an “action” actually is.
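To make the "try random things, get told yes or no" idea a little more tangible, here is a minimal sketch. Everything in it is a made-up toy (the grid, the coordinates of the two houses, the names `follow` and `figure_it_out`), and real neural network training does not literally work this way. It only shows the shape of the feedback loop: the AI is never told the route, only whether a route worked.

```python
import random

# Hypothetical toy world: A is my house, B is grandma's house, both on a grid.
START, GOAL = (0, 0), (2, 1)
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def follow(route):
    """Return the position reached after following a sequence of moves from START."""
    x, y = START
    for step in route:
        dx, dy = MOVES[step]
        x, y = x + dx, y + dy
    return (x, y)

def figure_it_out(tries=5000, route_length=3, seed=0):
    """Try random routes and remember the ones that ended at GOAL."""
    rng = random.Random(seed)
    successes = set()
    for _ in range(tries):
        route = tuple(rng.choice("NSEW") for _ in range(route_length))
        if follow(route) == GOAL:  # the only feedback: did we reach point B?
            successes.add(route)
    return successes

good_routes = figure_it_out()
print(sorted(good_routes))
```

Every route that survives ends at grandma's house; the routes that ended in Antarctica were simply never kept.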


If you would like to understand neural networks in further detail, check this out.

Now, we are going to quickly dissect a bunch of terms from this set of articles that we will be using repeatedly to explain the concepts essential to this sequence.

The Terms To Know


When a specific organism/system/process is designed to do a specific task and does not act beyond its function, we say it is optimized for that task. 

Eg. A bottle cap is optimized to hold water in its place.


However, when the same organism/system/process searches through a set of possible actions it can take to accomplish the task, it becomes an optimizer. It’s not just using a different action to perform the task; it’s looking at the different actions it could take and choosing the best one out of them.

Eg. Let’s assume humans were optimized to hold water in place, and you just have a person holding his hand over an open bottle forever, imprisoned by the knowledge his function is purely to hold water in place. Of course, this person looks through possible choices presented before him such as a plate, a rock, a bottle cap, and other things he can shove on top of the open bottle to hold the water in place instead of him. This person who was previously just optimized to hold water in place is now an optimizer that chooses something out of a set (plate, rock, bottle cap) that is optimized to hold water in place, i.e., the bottle cap. 

NOTE: Humans are, in general, optimizers: we utilize multiple sets of options to perform various tasks on a day-to-day basis. It is only for the sake of this example that humans are initially being considered as optimized for a specific task.



On a final footnote, an optimizer that chooses a specific set of actions that is optimized for a task is said to be performing optimization.

Eg. The person who chooses the bottle cap to hold water in place is performing optimization. Remember, the bottle cap itself is not performing optimization of any sort; the person is, by choosing the bottle cap, the best of all the available options, the optimal one.

Now to sum it up, the person choosing the bottle cap out of multiple options is the optimizer, the bottle cap itself is optimized to hold water in place, and lastly, the process the person is following is called optimization.
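The three terms can be lined up in a few lines of code. This is only an illustrative sketch: the scores in `WATER_HOLDING_SCORE` are invented numbers, and `optimizer` is a hypothetical helper name, not anything from the original sequence.

```python
# Hypothetical scores for how well each object holds water in the bottle (higher is better).
WATER_HOLDING_SCORE = {"plate": 0.4, "rock": 0.2, "bottle cap": 0.95}

def optimizer(candidates, objective):
    """The optimizer: looks through the candidates and returns the best one under the objective."""
    # Running this search is the optimization.
    return max(candidates, key=objective)

# The result is the thing that is optimized for the task; it does no searching itself.
chosen = optimizer(WATER_HOLDING_SCORE, WATER_HOLDING_SCORE.get)
print(chosen)  # bottle cap
```

The function plays the person, the call to it is the optimization, and the string it returns is the bottle cap.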

All clear on the difference between optimized, optimizer, and optimization?

(Psst, this will be crucial moving forward so choose carefully!)

Moving on, then.

NOTE: If you intend to go further down this rabbit hole to understand other related concepts, it is vital to know that the definition of “optimization” mentioned above isn’t perfect. The actual representation (the True Name) is more, how do I say it, a specific scent in the wind, one that you can get a sense of by exploring AI Alignment further. However, for the purpose of this sequence, we will be operating by the definition explained above. Cheers!


In the context in which we will be using these words, they refer primarily to neural networks and the systems that they can generate. To make this more concrete, we have learning algorithms that adjust neural network weights/parameters to design elegant neural networks that can tell you the difference between a cat and a dog in your Facebook profile picture.

The learning algorithm (for the more technically inclined, this could be gradient descent) is an optimizer in this case, since it searches through possible combinations of weights and settles on a specific neural network whose weights are adjusted almost perfectly for the task we want it to do (change the task and you get an entirely different neural network). You can also conclude that the neural network thus created is optimized to tell you the difference between your household pets.
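For the more technically inclined, here is a minimal sketch of a learning algorithm acting as an optimizer. It is a deliberate simplification: instead of a cat-vs-dog network, the "network" is a single weight `w` in the model `y = w * x`, the data is made up, and `train` is a hypothetical name. Gradient descent searches over possible values of `w`; the `w` it returns is what has been optimized.

```python
def train(data, steps=200, learning_rate=0.05):
    """Plain gradient descent on mean squared error for the one-weight model y = w * x."""
    w = 0.0  # start from an arbitrary weight
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= learning_rate * grad  # nudge the weight in the direction that reduces the error
    return w

# Made-up training data generated by the rule y = 3 * x.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w = train(data)
print(round(w, 4))  # 3.0
```

Change the data (the task) and the same learning algorithm hands you a different weight, which is the point made in the paragraph above.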

Let’s however take into account neural networks that aren’t image classifiers as such, but an entirely different breed. 

NOTE: Reinforcement Learning, a technique that punishes the AI for undesirable actions and rewards the AI for desirable actions (this is where a desirable action is related to a specific goal that we specify for the AI), is one example of such a breed.


Now, there is a possibility (highly likely in advanced AI that might arise in the future) that the neural network generated starts to perform optimization itself (do refer above if you forgot what that was).

The neural network can start to function as a planning algorithm that comes up with possible plans for achieving the task and then searches for the plan best suited to the neural network’s purpose. And thus, the neural network, which was previously only optimized to perform the task in question, is now an optimizer: a planning algorithm that searches for plans optimized to perform the task.

NOTE: This is happening in RL; this is not a special case.
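The difference between a policy that is merely optimized and a policy that is itself an optimizer can be sketched as follows. Both policies here are written by hand purely for illustration (in the scenario described above they would be learned), and the grid, the wall, and the names `reflex_policy` and `planning_policy` are all assumptions of this sketch.

```python
from itertools import product

MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
WALLS = {(1, 0)}  # hypothetical blocked cell between start and goal

def simulate(start, plan):
    """Follow a plan from start; return the final position, or None if it walks into a wall."""
    x, y = start
    for step in plan:
        dx, dy = MOVES[step]
        x, y = x + dx, y + dy
        if (x, y) in WALLS:
            return None
    return (x, y)

def reflex_policy(start, goal):
    """Optimized, but not an optimizer: a fixed rule with no search inside it."""
    return ("E", "E")

def planning_policy(start, goal, max_length=4):
    """An optimizer: generates candidate plans and returns the shortest one that reaches the goal."""
    for length in range(max_length + 1):
        for plan in product("NSEW", repeat=length):
            if simulate(start, plan) == goal:
                return plan
    return None

start, goal = (0, 0), (2, 0)
print(simulate(start, reflex_policy(start, goal)))  # None, it walks into the wall
plan = planning_policy(start, goal)
print(plan, simulate(start, plan))
```

The planner has its own internal search and its own criterion for what counts as a good plan, which is exactly what makes it an optimizer rather than just something that was optimized.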


All clear? (If not, I would heavily suggest that the above portions be reread to absorb them in their entirety. We will be using a great deal of everything understood so far to move forward.) 

Now in these cases, the learning algorithm that generates the neural network is called the base optimizer, and the neural network that functions as the planning algorithm is called the mesa-optimizer.

Why mesa? (Fair warning, this paragraph is mostly irrelevant.) Well, mesa is the opposite of meta. Meta means to go one level higher whilst mesa means to go one level lower. Does this mean if you called the neural network here the base optimizer, logically the learning algorithm would be called the meta-optimizer? Hmm, how does that work in real life though? If we had a device that could cool down hot soup and it broke down, and we start blowing on the soup to get it to chill so it doesn’t burn our tongue, has the cooler device forced us, the meta-optimizer, to get into action? Actually, I have absolutely no clue. It’s alright. 

The Big Questions

So in these cases, we have a learning algorithm (the base optimizer) that makes a neural network (the mesa-optimizer), which in turn can be a planning algorithm or some other functional method of searching through possible solutions to choose an optimal one (according to it).

There are implications for this in the realm of AI Safety, as the base objective (what we tell the base optimizer to do) need not necessarily be carried over perfectly to the mesa objective (what the mesa-optimizer actually ends up pursuing), especially in terms of safety objectives.

This leads us, as Evan mentions in his sequence, to two questions we will be exploring in detail.

  • What environment leads learned algorithms (neural networks) to turn into optimizers?
  • When a learned algorithm becomes an optimizer, what will the objective be and how can we be sure that the objective mirrors what we want the AI to do?

And Some More Terms

Meta-optimization is, in broad terms, a situation where we intentionally task a system with doing optimization. We want it to go through a possible set of solutions to choose the best one.

Mesa-optimization, on the other hand, is a situation where a base optimizer, while searching for algorithms, decides that an efficient algorithm to achieve its task is to use an optimizer itself, creating a mesa-optimizer.

In such a case, the base objective can be classified as the criteria the base optimizer uses to select between different systems whereas the mesa objective is the criteria the mesa optimizer uses to select between different outputs from different plans. 

Since the base optimizer designed the mesa-optimizer, the mesa objective is, to be clear, whatever objective the base optimizer found to perform well on the training data provided. (Think of training an AI to solve mazes, but the mazes you use have red exit doors and the AI just runs headfirst for red; to buy coffee based on your preference, and the AI thinks green cups improve the taste; to enhance your creativity, and the AI notices you sigh a lot (in the videos you gave it) when you have a brilliant thought, so it forces you to keep sighing to get an idea out of you. This AI would fail in mazes with blue exit doors, in shops with only blue cups, and in eliciting ideas out of you in a meaningful manner.)

The cases above are (somewhat absurd) examples of pseudo-alignment, where the AI seems aligned in the training scenario, but when you try to use it in a real case, it fails because its objectives are inaccurate (frankly, annoying, just solve the maze, why look at the color of the doors, what are you, a bull?! Okay, that was mean. No, it wasn’t.).
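The red-door maze case can be written down as a toy experiment. This is a sketch under heavy assumptions: a "maze" is reduced to a list of doors, each a `(color, is_exit)` pair, and the two policies `seek_red` and `seek_exit` are hypothetical stand-ins for the "bull" AI and a properly aligned maze-solver.

```python
# Each hypothetical maze is a list of doors: (color, is_exit).
TRAINING_MAZES = [
    [("red", True), ("blue", False)],
    [("green", False), ("red", True)],
]
DEPLOYMENT_MAZES = [
    [("red", False), ("blue", True)],
    [("green", False), ("blue", True)],
]

def seek_red(maze):
    """Mesa objective 'find red doors': pick a red door if there is one, else the first door."""
    for door in maze:
        if door[0] == "red":
            return door
    return maze[0]

def seek_exit(maze):
    """Objective 'solve the maze': pick the door that really is the exit."""
    for door in maze:
        if door[1]:
            return door
    return maze[0]

def score(policy, mazes):
    """Base objective: the fraction of mazes in which the chosen door is the exit."""
    return sum(policy(maze)[1] for maze in mazes) / len(mazes)

for policy in (seek_red, seek_exit):
    print(policy.__name__, score(policy, TRAINING_MAZES), score(policy, DEPLOYMENT_MAZES))
```

On the training mazes the two policies get identical scores, so a base optimizer that only sees training performance has no way to tell them apart. The difference only shows up once the exit doors stop being red.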

It is necessary to understand that the base optimizer need not always make a mesa-optimizer; it's searching through possible algorithms for an optimal one that fulfills its task. Thus, just as we call the algorithm that designs the neural network the learning algorithm, we will be calling the algorithm it settles on (the one the neural network implements) the learned algorithm.

And thus, the learned algorithm that is chosen, that becomes an optimizer, is a mesa-optimizer.

You must understand that the mesa-optimizer is not a subagent of any sort. Not at all. Not even close. No. The base optimizer is merely searching for good algorithms and comes across an efficient algorithm (efficient in the sense that the AI believes this is the easiest way to accomplish its task) that functions as an optimizer. It’s merely an algorithm, a really clever one at that, you could say.

NOTE: A subagent is a part of an agent that develops an agency of its own. Agency is the process of taking certain actions to achieve specific goals. This is what a mesa optimizer is not.


In essence, the base optimizer is just an optimization algorithm that finds an optimization algorithm (the mesa-optimizer) as the best option.
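The sentence above can be sketched as one search wrapped around another. The candidate algorithms, the tasks, and all the names below are invented for illustration; the only point is that the outer search (the base optimizer) happens to select a candidate that itself performs a search.

```python
def first_item(options):
    """A candidate learned algorithm with no internal search."""
    return options[0]

def largest_item(options):
    """A candidate learned algorithm that is itself an optimizer: it searches the options."""
    return max(options)

def base_objective(algorithm, tasks):
    """How the base optimizer scores a candidate: how often it returns the task's target."""
    return sum(algorithm(options) == target for options, target in tasks) / len(tasks)

# Made-up training tasks: (options, target).
TASKS = [([3, 9, 4], 9), ([7, 2, 5], 7), ([1, 6, 8], 8)]
CANDIDATES = [first_item, largest_item]

# The base optimizer: searches over candidate algorithms and keeps the best-scoring one.
selected = max(CANDIDATES, key=lambda algorithm: base_objective(algorithm, TASKS))
print(selected.__name__)  # largest_item
```

Nothing told the outer search to prefer optimizers. It picked `largest_item` only because that candidate scored best on the training tasks.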

What’s It Up To?

Now, there is a different objective that we observe, which is what we understand a system to be optimized for, the behavioral objective. This is different of course from the mesa-objective, which the mesa-optimizer actively uses as the motivation for its optimization algorithm. 

You could say any system has a behavioral objective, for example, bottle caps, other bottle caps, and yet other bottle caps, etc. have the objective of being a bottle cap. But when we talk about systems that are optimizers, the behavioral objective might be more specific and meaningful. In our example above of an AI for mazes, a proper AI’s behavioral objective would be solving mazes, but the “bull” AI’s behavioral objective might merely be “FIND RED DOORS, WHERE ARE THE RED DOORS, OH NO, THE RED DOORS, I MUST HAVE THE RED DOORS”.

In the most informal sense, a behavioral objective is literally, “Yo, what he tryna do?”

NOTE: In a more explicit sense for those with a background in ML, the behavioral objective can be defined as one that is derived from perfect IRL. For those unfamiliar with IRL (Inverse Reinforcement Learning), it's an algorithm that aims to determine objectives from the actions taken, used to train AI from datasets of human actions and results.

Joshua Needs A Snickers Bar

Now, right before we enter the next section, we are going to have a final example to properly demonstrate the difference between the base objective and the mesa objective. This should mostly lay the framework for everything that comes after.

Suppose a lovely family appoints you as the babysitter for their three-year-old, and they are leaving for three weeks. Their parting instructions are “Water the plants every day at 7:00 AM.”

You say, “Sure!” and bid them adieu. You are the base optimizer.

You start watering the plants. It’s fine. It’s a good life.

The three-year-old is happy when you give him a Snickers chocolate bar every day.

Everyone is satisfied.

After two weeks, you are done waking up in the morning at 7 AM and you positively detest the plants now. This is not helped by the fact that the three-year-old wakes up 2 hours before you do, squealing for more Snickers bars, and you stare at the little miscreant with red eyes, determined to never have kids.

Voila! You have a solution! The garden kettle isn’t that hard to use, and the three-year-old seems to have boundless energy. So you say,

“Joshua, if you water the plants every morning after you wake up, I will give you an extra chocolate bar.” He beams with joy. You beam with joy. It’s all good.

Your mesa-optimizer is excited to have his stomach full of cocoa.

The next day, you show him how to use the kettle and he quickly gets the hang of it. He seems even more determined to understand the kettle when you give him the chocolate bar.

The next three days, all goes well. Joshua is quite happy with his Snickers bar whilst the plants look greener than ever and you get your beauty sleep. Beautiful.

On the fourth day, he demands another chocolate bar. You refuse, you clearly cannot handle all that sugar in his tiny body. He cries and wails, you remain stern.

The morning of the fifth day, you wake up and the floor is wet. 

You scream for Joshua and he excitedly runs over. You are in horror, the house is ruined, the sink in the kitchen has flooded, and there is water everywhere. 

You scold Joshua, demanding an explanation. He excitedly explains that he realized if he was getting a chocolate bar every time he watered the plants, then he could get all the chocolate bars by using up all the water in one go to water everything. You run to the garden and find plants torn up by the force of Joshua using a hose at full strength, their roots grasping at the air for a final gasp at life.

The safety objectives of the base optimizer need not translate to the mesa-optimizer, i.e., the parents (the programmer) wanted you (the base optimizer) to “keep the plants healthy and well”, but Joshua (the mesa-optimizer) didn’t grasp that and instead assumed you were all mad trying to put water on things, but he didn’t complain because as long as he did that, you seemed satisfied. For the first few days (the training scenario), your little mesa-optimizer worked perfectly, watering the plants and keeping them well. But after a while (the deployed scenario, when you use the AI in a real case), you come to the disastrous conclusion that the mesa objective of your mesa-optimizer was to “WATER STUFF”, not “keep the plants healthy and well”.

What’s AI Alignment All About?

AI Alignment has primarily two divisions: outer alignment and inner alignment.

Outer alignment is removing the gap between the programmer’s goal and the base objective of the AI. (Suppose that in the story above, you remove the three-year-old and the parents directly tell you to water the plants, but you still go nuts and flood the place (Yes, you are mad). Your objective is misaligned with the parents’ objective. This is outer misalignment.)

Inner alignment is removing the gap between the base objective and the mesa objective. (Compare what you wanted for the plants in the story above with what the three-year-old eventually did to the plants: his objective is misaligned with yours. This is inner misalignment.)

The reason we use the words “outer” and “inner” here is that “outer” is an issue between the system (the AI) and the humans outside the system, whereas “inner” refers to an issue internal to the AI’s structure.

Now which of these two problems, outer alignment or inner alignment, do you think we may not necessarily have to solve to create beautiful advanced AI in the future?

Give it a second’s thought?

Got an inkling?


Let’s go!

If we can design systems that won’t have mesa-optimizers arise from their structure, then we will never have to deal with inner alignment. However, if we can’t reliably do that, if we are going to have mesa-optimizers no matter what, then we will need to solve both issues of “outer alignment” and “inner alignment” for a safer future for all. (drumroll? No, no, drumroll.)

But yes, the ultimate goal would be to have mesa-optimizers that reliably do what the programmers want them to do.

Pseudo-aligned Mesa-optimizers?!

In a more concrete sense, we want mesa-optimizers that accomplish their task well across distributions. For the less technically inclined, a distribution can (for now) be considered as an environment the AI is in. We can have training distributions where we switch up the data (different mazes and their solutions to make our AI a maze-solver) and deployment distributions (giving the AI a maze we don’t know how to get out of). 

A mesa-optimizer that scores well on the base objective across both the training and deployment distributions is what we are going to call robustly aligned. This AI is what we desire, and it perfectly follows the goal of the programmer who designed it.

However, we can have mesa-optimizers that perform well in the training scenario, and then go haywire in the deployment scenario. When you woke up early and ensured Joshua watered the plants well with the kettle, it might have seemed like the little fellow understood we were taking care of the plants, but when you slept in, trusting him, well, you know what happened. Such mesa-optimizers seem to be aligned but aren’t; thus, we will be labelling them as pseudo-aligned. The AI might not just stray from the base objective in deployment, but can do the same in the testing scenario, and in further training scenarios as well.

The danger of pseudo-alignment lies in the fact that the AI might, instead of aligning with the base objective, start displaying actions that the programmer didn’t intend, to achieve its misaligned mesa-objective. We could have the “bull” AI try to reach red doors, no matter what, and thus exploit a vulnerability in the software running the maze to convert all of it into simulated red doors. (Yes, that situation is a bit outlandish, but do you get where we are going with this?)

Everything becomes a red door.

What lies on the other side? 

What? Ignore me, sorry.

A good robustly aligned maze AI would learn to solve a maze irrespective of the color, size, and shape of the door, which is the desired result.

The Consequences

There are dire consequences to having pseudo-aligned mesa-optimizers, especially since over time, we have an increasing number of AI whose actions have real-world significance.

Unintended optimization 

Having an AI do optimization that the programmer did not intend for it to do might result in it taking extreme actions to achieve its mesa objective, which could be disastrous on various levels. However, today, it is difficult to truly understand the circumstances in which a mesa-optimizer arises. The second post will examine features of an ML system that lead it to create a mesa-optimizer, so as to prevent unintended optimization. If (in the future) we are successful in understanding the conditions for optimization to occur, we can reliably predict where a mesa-optimizer can arise, and also take steps to ensure this does not happen (unless we actually do need them).

Inner alignment 

In the scenario where we do need mesa-optimizers for the task at hand, as explained earlier, we can have situations where they have different objectives from what we intend. Even if we have an accurate base objective, we might still screw up if the mesa-optimizer is way off. Mesa-optimizers can also score well in the training distribution and perform poorly in the deployment distribution, leaving us without any reliable conclusions as to whether they are aligned or not. The third post will tackle possible ways a mesa-optimizer might be selected to deviate from the programmers’ goal, as well as what attributes of an ML system might encourage the same.

A particular alignment failure (that intrigues me) is deceptive alignment, the scenario where a capable misaligned mesa-optimizer learns to behave in an aligned manner until it's in the deployment distribution, where it defects to its actual mesa objective. Insane? Yes? No? The same will be dissected in the fourth post.

NOTE: One might think, isn’t it better to just design models that do not use mesa-optimizers, so that we only have to deal with the Outer Alignment problem? But the fact is that there are not just situations where we may need an AI to define an optimizer for us; there can also be situations where trying to prevent mesa-optimizers hurts competitiveness, i.e., it might just be easier to make a model (for a specific purpose) that ends up using mesa-optimizers than it is to actively design a model that does not use them. An individual, unaware of the dangers of misaligned mesa-optimizers, may end up unintentionally creating a model that employs them. This is why the Inner Alignment problem is also of significant concern.


Where This All Starts

And that’s the end of the first post. 

Pseudo-aligned mesa-optimizers may be a super difficult problem to address, or they may be a kinda easy one. But the truth of the matter is that we know very little about them.

But as displayed above, since the consequences that arise from them can be dire, the relevance of aligning these AI, and their mischievous optimizers that perform optimization without permission, increases as time passes.

Au revoir.


That was a delight to write! I would like to additionally thank Arun Jose and Nikita Menon for their help in creating this distillation.

