Epistemic status: confused.

I think there are two importantly different types of optimization which an AI system could learn, one of which is dangerous and one of which isn’t. I am quite confused about this, and unsure whether this distinction is real or non-trivial.

The two types of optimization that I see are:

  • Optimizing for a goal
  • Performing some kind of optimization/search process as part of accomplishing a task

For the first type (goal optimization), the system is ‘trying’ to do something, in the same way that a human might ‘try’ to accomplish a goal. The agent takes actions that affect the external environment in pursuit of its goal. I think this can happen most easily in an RL context, where we are training the system explicitly to optimize. I think this is dangerous and has the potential for misalignment. Examples of this would include getting to a location as fast as possible, maximizing the number in a bank account, or maximizing presses of a reward button.

For the second type (search process), the system isn’t ‘trying’ to do anything, but part of its algorithm is a search algorithm; stochastic gradient descent may simply find a search algorithm which is useful for accomplishing the task. Examples of this include finding the largest number in a list, using gradient descent to find the position of the minimum of a function, or a language model performing some kind of search over its inputs to help it answer a question. I don’t see strong arguments for why this type of optimization/search would be dangerous. I think this kind of search process would be more like ‘find the word/vector which best satisfies this criterion’, and less like ‘change the state of the world to optimize for this criterion’. An example of this style of optimization in GPT is discussed here.
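To make the first two examples concrete, here is a minimal sketch of what I mean by a ‘search process’. The function names (`largest`, `gradient_descent_min`) and the quadratic test function are my own illustrative choices, not taken from any particular system. The point is only that both routines return an answer and leave nothing outside themselves changed.

```python
from typing import Callable, Sequence


def largest(numbers: Sequence[float]) -> float:
    """Linear search: scan the list and keep the best candidate seen so far."""
    best = numbers[0]
    for n in numbers[1:]:
        if n > best:
            best = n
    return best


def gradient_descent_min(
    grad: Callable[[float], float],
    x0: float,
    lr: float = 0.1,
    steps: int = 200,
) -> float:
    """Plain gradient descent on a 1-D function, given its gradient."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)  # step downhill along the gradient
    return x


# Hypothetical example function: f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent_min(lambda x: 2 * (x - 3), x0=0.0)
biggest = largest([4.0, 11.0, 7.5])
```

Both functions ‘optimize’ in some sense, but only over their own inputs: the criterion is fixed by the caller and the result is just a value handed back.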

I’m mainly thinking about the development of search processes in the context of supervised learning and language models (systems which don’t explicitly take actions). These systems are often called Oracle AIs, and I think they will be safe if they are trained offline (not online as in Predict-O-Matic). This also seems similar to Eric Drexler’s Comprehensive AI Services.

This distinction seems similar, but not identical, to “Search-in-Territory vs Search-in-Map”, and also to “Selection vs Control” and “Mesa-Search vs Mesa-Control”. Goal optimization shares similarities with search-in-territory in the sense that it is related to actually doing things in the world, while a ‘search process’ algorithm seems similar to search-in-map because this search is done as part of the model’s internal cognition. An objective like ‘acquire the biggest rock’ requires a search-in-territory algorithm because the agent has to search the external world for rocks, and this could also be the objective of (dangerous) goal optimization. An objective like ‘find the biggest rock in this list of rock sizes’ requires a search-in-map algorithm, which is just a (safe) search process.
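The two rock objectives can be sketched as a toy contrast. Everything here (`ToyWorld`, `acquire_biggest_rock`, the one-dimensional layout of rocks) is a hypothetical setup I made up for illustration. The map-style search reads a list and returns a number, while the territory-style search has to move an agent around and ends up changing the world.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


def biggest_rock_in_list(sizes: Sequence[float]) -> float:
    """Search-in-map: a pure computation over a list of sizes."""
    return max(sizes)


@dataclass
class ToyWorld:
    """A made-up 1-D world: rocks sit at integer positions."""
    rocks: Dict[int, float]          # position -> rock size
    agent_pos: int = 0
    held: Optional[float] = None
    steps_taken: int = 0

    def move_to(self, pos: int) -> None:
        self.steps_taken += abs(pos - self.agent_pos)
        self.agent_pos = pos

    def pick_up(self) -> None:
        # Removes the rock from the world: the environment is changed.
        self.held = self.rocks.pop(self.agent_pos)


def acquire_biggest_rock(world: ToyWorld) -> float:
    """Search-in-territory: visit every rock, then go back for the biggest."""
    best_pos, best_size = None, float("-inf")
    for pos in sorted(world.rocks):
        world.move_to(pos)           # the agent only learns a size by going there
        if world.rocks[pos] > best_size:
            best_pos, best_size = pos, world.rocks[pos]
    world.move_to(best_pos)
    world.pick_up()
    return world.held


world = ToyWorld(rocks={1: 2.0, 4: 7.5, 6: 3.0})
acquired = acquire_biggest_rock(world)
```

After the second function runs, the agent has moved and a rock is gone from `world.rocks`; after the first, nothing outside the function is different. That side-effect difference is the part of the distinction I care about.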

I think that the arguments about how ‘SGD has a bias towards simplicity’ push towards the development of a search process, and not towards the development of goal optimization. A search process can be a simple algorithm which can perform well in a variety of contexts; for supervised learning, I don’t see why goal optimization would happen on top of this. I do think that it is possible for goal optimization to happen in supervised learning/language models, in the sense that the network weights could be configured this way. But I don’t see why a network would end up in this configuration. 

I do think that ‘goal optimization’ is dangerous, and seems likely to happen in certain AI systems. Specifically, we could build RL agents (or other AIs which take actions/make plans) which have goal directed behavior, and this goal directed behavior could be misaligned. I (tentatively) think that this is the kind of system we should worry about, rather than systems which are likely to simply implement dumb/safe search process algorithms. 


Training with RL vs supervised learning

It seems fairly clear that an RL agent can exhibit ‘goal optimization’ behavior, but it seems much less clear that a big (non-RL) network like GPT-N would do this. For RL we are training the system to achieve goals by taking actions in the environment, and so goal optimization is a good strategy to learn to perform well on this objective. But there are ways to train GPT systems with RL algorithms, in which case it seems like the GPT system could develop ‘goal optimization’. I am confused about this, and about which part of training with RL (as opposed to standard supervised learning) leads to goal optimization. It could be that RL algorithms train the system to optimize for rewards across time, while other training doesn’t.
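One way to see the ‘rewards across time’ difference is to write out the two quantities being optimized. This is a sketch of the standard textbook forms (a discounted return and a mean squared error), chosen by me as representative examples; the function names and the discount factor `gamma` are illustrative assumptions rather than anything specific to GPT training.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """RL-style objective: reward summed over time, discounted by gamma per step."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total    # later rewards feed into earlier decisions
    return total


def supervised_loss(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Supervised objective: each example is scored on its own, with no time axis."""
    errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(errors) / len(errors)


episode_value = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
immediate_only = discounted_return([1.0, 1.0, 1.0], gamma=0.0)
batch_loss = supervised_loss([1.0, 2.0], [1.0, 4.0])
```

With `gamma` above zero, an action is credited for rewards that arrive later, which is at least a candidate for what makes planning towards future states instrumentally useful. With `gamma=0.0` only the immediate reward counts, and the supervised loss never refers to future steps at all.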

This seems similar to myopia (which is the property that a system doesn’t attempt to optimize past parameter updates); if the system isn’t trained with any concept of time relevant to its reward then it seems more likely to behave myopically. I definitely don’t think this is a watertight method for achieving a myopic system, but rather that it seems more difficult for non-myopia to develop if we only train with supervised learning. 

This seems fairly confusing though; it’s not intuitive that training a network for a very similar thing but with a different algorithm would cause it to develop/not develop goal optimization. This makes me believe I might be missing something, and that maybe goal optimization can easily happen without RL. But it’s also possible RL algorithms make goal optimization more likely; maybe by explicitly considering time or by making it instrumentally useful to consider the value of future states. 

Optimization isn’t a discrete property

Optimization is generally defined as a property of a network’s cognition, rather than its outputs/behavior. This means we could have a system whose behavior is identical to an optimizer’s without it actually using this style of cognition. So why is ‘optimization’ a useful concept to have, if it doesn’t describe a model’s behavior? It’s a useful concept because it can help us predict what a model will do off-distribution.

  • An optimizing model will still attempt to optimize while off-distribution, in potentially catastrophic ways. 
  • A non-optimizing model might break off-distribution, but won’t fail catastrophically by optimizing for something we don’t want.

However, I don’t think that ‘optimization’ will be a discrete property of a model. I think models are far more likely to have algorithms which can be thought of as optimization algorithms for some inputs, but not for others. Standard neural networks can be quite brittle with different inputs, and so I would expect a ‘learned optimization algorithm’ to be similarly brittle and not perform optimization for inputs which are sufficiently different from the training data. AI systems which are brittle, or which only usually act as optimizers, seem less likely if future powerful models do develop impressive generalization capabilities. The greater the generalization, the more these systems will act as optimizers while off-distribution.

Are people actually worried about this?

I am also very unsure as to whether people are actually worried about dangerous goal optimization happening with non-RL models. I have seen some talk about GPT mesa-optimizing, or deceptively aligned subnetworks in a larger model, and I don’t think these possibilities are particularly likely or dangerous. But I also don’t know how common or serious this worry is. 


The term ‘optimization’ is often used to refer either to an AI system optimizing for a specific goal, or to an AI system performing some kind of internal search process. In standard supervised learning, search processes seem likely and also don’t seem dangerous. I don’t see why goal optimization would happen in supervised learning, but I do think it is likely in RL.

I think that any talk about mesa-optimization in language models or supervised learning needs to explain why we think a supervised learning system would develop external goals rather than just (safe) internal search processes.


Thanks to Vinah Hiremath and Justis Mills for feedback on this post. This work is supported by CEEALAR. 

2 comments

I'm curious if you have thoughts about Eliezer's scattered arguments on the dangerousness of optimization in the recent very-long-chats (first one). It seems to me that one relevant argument I can pin onto him is something like "well, but have you imagined what it would be like if this supposedly-benign optimization actually solved hard problems?"

Like, it's easy to say the words "A giant look-up table that has the same input-output function as a human, on the data it actually receives." But just saying the words isn't imagining what this thing would actually be like. First, such a look-up table would be super-universally vast. But I think that's not even the most important thing to think about when imagining it - that question is "how did this thing get made, since we're not allowed to just postulate it into existence?" I interpret Eliezer as arguing that if you have to somehow make a giant look-up table that has the input-output behavior of a powerful optimizer on some dataset, practically speaking you're going to end up with something that is also a powerful optimizer in many other domains, not something that safely draws from a uniform distribution over off-distribution behaviors.

My initial thought is that I don't see why this powerful optimizer would attempt to optimize things in the world, rather than just do some search thing internally. 

I agree with your framing of "how did this thing get made, since we're not allowed to just postulate it into existence?". I can imagine a language model which manages to output words which cause strokes in whoever reads its outputs, but I think you'd need a pretty strong case for why this would be made in practice by the training process.

Say you have some powerful optimizer language model which answers questions. If you ask a question which is off its training distribution, I would expect it to either answer the question really well (i.e. it generalises properly), or kinda break and answer the question badly. I don't expect it to break in such a way that it suddenly decides to optimize for things in the real world. This would seem like a very strange jump to make, from 'answer questions well' to 'attempt to change the state of the world according to some goal'.

But if we trained the LM on 'have a good ongoing conversation with a human', such that the model was trained with reward over time, and its behaviour would affect its inputs (because it's a conversation), then I think it might do dangerous optimization, because it was already performing optimization to affect the state of the world. And so a distributional shift could cause this goal optimization to be 'pointed in the wrong direction', or uncover places where the human and AI goals become unaligned (even though they were aligned on the training distribution).