I think this is one of the major remaining open questions wrt inner alignment. Personally, I think there is a meaningful sense in which all the models I'm most worried about do some sort of search internally (at least to the same extent that humans do search internally), but I'm definitely uncertain about that. If true, though, it could be quite helpful for solving inner alignment, since it could enable us to factor models into pieces (either through architecture or transparency tools). Also:
As far as I can tell, Hjalmar Wijk introduced the term "malign generalization" to describe the failure mode that I think is most worth worrying about here.
Hjalmar actually cites this post by Paul Christiano as the source of that term—though Hjalmar's usage is slightly different.
I’m sympathetic to what I see as the message of this post: that talk of mesa-optimisation is too specific given that the practical worry is something like malign generalisation. I agree that it makes extra assumptions on top of that basic worry, which we might not want to make. I would like to see more focus on inner alignment than on mesa-optimisation as such. I’d also like to see a broader view of possible causes for malign generalisation, which doesn’t stick so closely to the analysis in our paper. (In hindsight our analysis could also have benefitted from taking a broader view, but that wasn’t very visible at the time.)
At the same time, speaking only in terms of malign generalisation (and dropping the extra theoretical assumptions of a more specific framework) is too limiting. I suspect that solutions to inner alignment will come from taking an opinionated view on the structure of agents, clarifying its assumptions and concepts, explaining why it actually applies to real-world agents, and offering concrete ways in which the extra structure of the view can be exploited for alignment. I’m not sure that mesa-optimisation is the right view for that, but I do think that the right view will have something to do with goal-directedness.
I suspect that solutions to inner alignment will come from taking an opinionated view on the structure of agents, clarifying its assumptions and concepts, explaining why it actually applies to real-world agents, and offering concrete ways in which the extra structure of the view can be exploited for alignment.
Even taking that as an assumption, it seems like if we accept that "mesa optimizer" doesn't work as a description of humans, then mesa optimization can't be the right view, and we should retreat to malign generalization while trying to figure out a better view.
We’re probably in agreement, but I’m not sure what exactly you mean by “retreat to malign generalisation”.
For me, mesa-optimisation’s primary claim isn’t the claim (call it Optimisers) that agents are well-described as optimisers; I’m happy to drop that one. It is the claim (call it Mesa≠Base) that, whatever the right way to describe agents turns out to be, in general their intrinsic goals are distinct from the reward.
That’s a specific (if informal) claim about a possible source of malign generalisation: namely, that when intrinsic goals differ arbitrarily from the reward, systems that competently pursue them may produce outcomes that are arbitrarily bad according to the reward. Humans don’t pose a counterexample to that, and it seems prima facie conceptually clarifying, so I wouldn’t throw it away. I’m not sure whether you propose to do that, but strictly, that’s what "retreating to malign generalisation" could mean, as malign generalisation itself makes no reference to goals.
One might argue that until we have a good model of goal-directedness, Mesa≠Base reifies goals more than is warranted, so we should drop it. But I don’t think so – so long as one accepts goals as meaningful at all, the underlying model need only admit a distinction between the goal of a system and the criterion according to which a system was selected. I find it hard to imagine a model or view that wouldn’t allow this – this makes sense even in the intentional stance, whose metaphysics for goals is pretty minimal.
It’s a shame that Mesa≠Base is so entangled with Optimisers. When I think of mesa-optimisation, I tend to think more about the former than about the latter. I wish there was a term that felt like it pointed directly to Mesa≠Base without pointing to Optimisers. The Inner Alignment Problem might be it, though it feels like it’s not quite specific enough.
From my perspective, there are three levels: the most general case, where the system malignly generalizes (its capabilities generalize while its behavior becomes bad, with no reference to goals at all); the middle case, where the system pursues a misaligned goal in the intentional-stance sense, whether or not it does any explicit internal search; and the most specific case, where the system is a mesa optimizer as defined in the paper, explicitly searching over options against an explicitly represented objective.
I worry about the middle case. It seems like upon reading the mesa optimizers paper, most people start to worry about the last case. I would like people to worry about the middle case instead, and test their proposed solutions against that. (Well, ideally they'd test it against the most general case, but if it doesn't work against that, which it probably won't, that isn't necessarily a deal breaker.) I feel better about people accidentally worrying about the most general case, rather than people accidentally worrying about the most specific case.
The Inner Alignment Problem might be it, though it feels like it’s not quite specific enough.
I like "inner alignment", and am not sure why you think it isn't specific enough.
I think we basically agree. I would also prefer people to think more about the middle case. Indeed, when I use the term mesa-optimiser, I usually intend to talk about the middle picture, though strictly that’s sinful as the term is tied to Optimisers.
Re: inner alignment
I think it’s basically the right term. I guess in my mind I want to say something like, “Inner Alignment is the problem of aligning objectives across the Mesa≠Base gap”, which shows how the two have slightly different shapes. But the difference isn’t really important.
Inner alignment gap? Inner objective gap?
I’m not sure what exactly you mean by “retreat to malign generalisation”.
When you don't have a deep understanding of a phenomenon, it's common to use an empirical description of what you're talking about rather than interpreting the phenomenon through your current (and incorrect) model. The issue with using your current model is that it leads you to make incorrect inferences about why things happen, because you're relying too heavily on the model being internally correct.
Therefore, until we gain a deeper understanding, it's better to use the pre-theoretical description of what we're talking about. I'm assuming that's what Rohin meant by "retreat to malign generalization."
This is important because if we used the definition given in the paper, then this could affect which approaches we use to address inner alignment. For instance, we could try using some interpretability technique to discover the "objective" that a neural network was maximizing. But if our model of the neural network as an optimizer is ultimately incorrect, then the neural network won't have an explicit objective, making this technique very difficult.
I understand that, and I agree with that general principle. My comment was intended to be about where to draw the line between incorrect theory, acceptable theory, and pre-theory.
In particular, I think that while optimisation is too much theory, goal-directedness talk is not, despite being more in theory-land than empirical malign generalisation talk. We should keep thinking of worries on the level of goals, even as we’re still figuring out how to characterise goals precisely. We should also be thinking of worries on the level of what we could observe empirically.
I wish there was a term that felt like it pointed directly to Mesa≠Base without pointing to Optimisers.
I think it's fairly easy to point out the problem using an alternative definition. If we just change the definition of mesa optimizer to reflect that we're using the intentional stance (in other words, we're interpreting the neural network as having goals, whether it's using an internal search or not), the mesa!=base description falls right out, and all the normal risks about building mesa optimizers still apply.
I’m not talking about finding an optimiser-less definition of goal-directedness that would support the distinction. As you say, that is easy. I am interested in a term that would just point to the distinction without taking a view on the nature of the underlying goals.
As a side note I think the role of the intentional stance here is more subtle than I see it discussed. The nature of goals and motivation in an agent isn’t just a question of applying the intentional stance. We can study how goals and motivation work in the brain neuroscientifically (or at least, the processes in the brain that resemble the role played by goals in the intentional stance picture), and we experience goals and motivations directly in ourselves. So, there is more to the concepts than just taking an interpretative stance, though of course to the extent that the concepts (even when refined by neuroscience) are pieces of a model being used to understand the world, they will form part of an interpretative stance.
I am interested in a term that would just point to the distinction without taking a view on the nature of the underlying goals.
I'm not sure what's unsatisfying about the characterization I gave? If we just redefined optimizer to mean an interpretation of the agent's behavior, specifically, that it abstractly pursues goals, why is that an unsatisfying way of showing the mesa != base issue?
ETA:
The nature of goals and motivation in an agent isn’t just a question of applying the intentional stance. We can study how goals and motivation work in the brain neuroscientifically (or at least, the processes in the brain that resemble the role played by goals in the intentional stance picture), and we experience goals and motivations directly in ourselves.
I agree. And the relevance is that, for future systems that might experience malign generalization, we would want some model of how goals play a role in their architecture, because this could help us align the system. But until we have such architectures, or until we have models for how those future systems should behave, we should work abstractly.
First, I think by this definition humans are clearly not mesa optimizers.
I'm confused/unconvinced. Surely the 9/11 attackers, for example, must have been "internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system"? Can you give some examples of humans being highly dangerous without having done this kind of explicit optimization?
As far as I can tell, Hjalmar Wijk introduced the term “malign generalization” to describe the failure mode that I think is most worth worrying about here.
Can you give some realistic examples/scenarios of "malign generalization" that does not involve mesa optimization? I'm not sure what kind of thing you're actually worried about here.
Surely the 9/11 attackers, for example, must have been "internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system"?
ETA: I agree that if someone were to, e.g., write a spreadsheet of all the things they could do, write down the costs of those actions, and then choose the one with the lowest cost, this would certainly count. And maybe terrorist organizations do a lot of deliberation that meets this kind of criterion. But I am responding to the more typical type of human action: walking around, seeking food, talking to others, working at a job.
There are two reasons why we might model something as an optimizer. The first reason is that we know that it is internally performing some type of search over strategies in its head, and then outputting the strategy that ranks highest under some explicit objective function. The second reason is that, given our ignorant epistemic state, our best model of that object is that it is optimizing some goal. We might call the second case the intentional stance, following Dennett.
If we could show that the first case was true in humans, then I would agree that humans would be mesa optimizers. However, my primary objection is that we could have better models of what the brain is actually doing. It's often the case that when you don't know how something works, the best way of understanding it is by modeling it as an optimizer. But once you get to look inside and see what's going on, this way of thinking gives way to better models that take into account the specifics of its operation.
I suspect that human brains are well modeled as optimizers from the outside, but that this view falls apart when considering specific cases. When the brain makes a decision, it usually considers at most three or four alternatives for each action it does. Most of the actual work is therefore done at the heuristics stage, not the selection part. And even at the selection stage, I have little reason to believe that it is actually comparing alternatives against an explicit objective function.
But since this is all a bit vague, and hard to see in the case of humans, I can provide the analogy that I gave in the post above.
At first glance, someone who looked at the agent in the Chests and Keys environment would assume that it was performing an internal search, and then selecting the action that ranked highest in its preference ordering, where its preference ordering was something like "more keys is better." This would be a good model, but we could still do better.
In fact, the only selection that's really happening is at the last stage of the neural network, when the max function is being applied over its output layer. Otherwise, all it's really doing is applying a simple heuristic: if there are no keys on the board, move along the wall; otherwise, move towards the key currently in sight. Since this can all be done in a simple feedforward neural network, I find it hard to see why the best model of its behavior should be an optimizer.
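To illustrate that point concretely, here is a minimal, hypothetical sketch (the placeholder weights, layer sizes, and action names are invented for the example, not taken from the actual agent) of a feedforward policy in which the only selection happening anywhere is the max over the output layer:

```python
import numpy as np

# Hypothetical two-layer feedforward policy. The weights below are random
# placeholders standing in for whatever training produced; the point is the
# shape of the computation: one forward pass, then a max over action scores.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(32, 16)), np.zeros(32)   # observation -> hidden
W2, b2 = rng.normal(size=(4, 32)), np.zeros(4)     # hidden -> scores for 4 moves

def policy(observation):
    """Map a flattened board observation directly to one of four moves."""
    hidden = np.maximum(0.0, W1 @ observation + b1)  # ReLU
    scores = W2 @ hidden + b2
    return ["up", "down", "left", "right"][int(np.argmax(scores))]

print(policy(rng.normal(size=16)))  # prints one of the four moves
```

There is no planning loop here and no explicitly represented objective for a transparency tool to read off, even if the learned behavior ends up looking key-seeking from the outside.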
When the brain makes a decision, it usually considers at most three or four alternatives for each action it does. Most of the actual work is therefore done at the heuristics stage, not the selection part. And even at the selection stage, I have little reason to believe that it is actually comparing alternatives against an explicit objective function.
Assuming this, it seems to me that the heuristics are being continuously trained by the selection stage, so that is the most important part even if heuristics are doing most of the immediate work in making each decision. And I'm not sure what you mean by "explicit objective function". I guess the objective function is encoded in the connections/weights of some neural network. Are you not counting that as an explicit objective function and instead only counting a symbolically represented function as "explicit"? If so, why would not being "explicit" disqualify humans as mesa optimizers? If not, please explain more what you mean?
Since this can all be done in a simple feedforward neural network, I find it hard to see why the best model of its behavior should be an optimizer.
I take your point that some models can behave like an optimizer at first glance but if you look closer it's not really an optimizer after all. But this doesn't answer my question: "Can you give some realistic examples/scenarios of “malign generalization” that does not involve mesa optimization? I’m not sure what kind of thing you’re actually worried about here."
ETA: If you don't have a realistic example in mind, and just think that we shouldn't currently rule out the possibility that a non-optimizer might generalize in a way that is more dangerous than total failure, I think that's a good thing to point out too. (I had already upvoted your post based on that.)
I guess the objective function is encoded in the connections/weights of some neural network. Are you not counting that as an explicit objective function and instead only counting a symbolically represented function as "explicit"?
If the heuristics are continuously being trained, and this is all happening by comparing things against some criterion that's encoded within some other neural network, I suppose that's a bit like saying that we have an "objective function." I wouldn't call it explicit, though, because to call something explicit means that you could extract the information content easily. I predict that extracting any sort of coherent or consistent reward function from the human brain will be very difficult.
If so, why would not being "explicit" disqualify humans as mesa optimizers? If not, please explain more what you mean?
I am only using the definition given. The definition clearly states that the objective function must be "explicit" not "implicit."
This is important; as Rohin mentioned below, this definition naturally implies that one way of addressing inner alignment will be to use some transparency procedure to extract the objective function used by the neural network we are training. However, if neural networks don't have clean, explicit internal objective functions, this technique becomes a lot harder, and might not be as tractable as other approaches.
"Can you give some realistic examples/scenarios of “malign generalization” that does not involve mesa optimization? I’m not sure what kind of thing you’re actually worried about here."
I actually agree that I didn't adequately argue this point. Right now I'm trying to come up with examples, and I estimate about a 50% chance that I'll write a post about this in the future with detailed examples.
For now, my argument can be summed up as follows: if humans are not mesa optimizers, yet humans are dangerous, then you don't need a mesa optimizer to produce malign generalization.
humans are clearly not mesa optimizers. Most optimization we do is implicit.
Sure, but some of the optimization we do is explicit, like if someone is trying to get out of debt. Are you saying there's an important safety-relevant distinction between "system that sometimes does explicit optimization but also does other stuff" versus "system that does explicit optimization exclusively"? And/or that "mesa-optimizer" only refers to the latter (in the definition you quote)? I was assuming not... Or maybe we should say that the human mind has "subagents", some of which are mesa-optimizers...?
Are you saying there's an important safety-relevant distinction between "system that sometimes does explicit optimization but also does other stuff" versus "system that does explicit optimization exclusively"?
I would say that very little of our optimization is explicit, and so our behavior is not very well described by invoking an explicit utility function.
One big strategic implication I see is that, if there is no canonical or natural utility function that describes our behavior, then modeling us as having one will lead to predictable errors in understanding us. I think that saying that humans optimize for some X is useful up to a point, but if someone were trying to devise a way of keeping humans safe (say, evolution working on the so-called problem of aligning humans to genetic fitness), then this model of humanity would break down rather easily and not help much. In other words, modeling us as optimizers is a low-resolution model, and it should be replaced with more detailed models that take into account the actual specifics of how we make decisions.
For an intuition pump, I recommend reading this post from Rohin Shah. In particular, the section about "Our understanding of the behavior" mirrors what I wrote above.
Hmm, let me try again to explain what I'm trying to get at.
Let's say you're a prison architect, building prisons for three prisoners:
Prisoner Alice is 100% maximally obsessed with breaking out of prison. She eats and sleeps and works and socializes precisely to the extent that these are instrumentally useful for breaking out of prison. She thinks hard and creatively about breakout strategies during every free moment.
Prisoner Carol does not treat breaking out as an explicit goal, although if she happened upon an open door, she would absolutely walk out. She spends her days reading books from the prison library, socializing, working, etc.
Prisoner Bob is somewhere between the two. For 23 hours a day, he has various pursuits like Carol. But for a total of 1 hour a day, he works on the project of breaking out of prison. (For a total of ~30 minutes scattered through the day, he does opportunistic information-gathering—he watches guard behavior, he explores the prison grounds, etc. Then, for the 30 minutes before he goes to sleep each night, he sits quietly in his cell, and thinks hard and creatively about how to break out.)
Here's my question: "Is Bob more like Alice or more like Carol?"
From a psychological perspective, Bob is similar to Carol, while Alice is kinda nuts, even cartoonish. Bob and Carol are doing the same range of activities for the same reasons for 23 hours a day = 96%. They both make friends, they both exercise, they both read books, they both complain about the weather. Alice has a simple utility function, whereas Bob and Carol are actual plausible humans.
From a prison safety and security perspective, Bob and Alice are very similar to each other, and wildly different from Carol. Alice and Bob are Very High Breakout Risk, Carol is Low Breakout Risk. If you build the prison with 5 layers of security, each of which is 90% reliable, Carol will be kept locked up with 99.999% certainty, whereas either Alice or Bob will engineer a situation where all 5 layers break at once. Alice will come up with good breakout plans sooner than Bob, but not exponentially sooner; it's a difference of degree not kind, unlike Bob versus Carol.
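To spell out the arithmetic behind those figures, under the idealized assumption that the five layers fail independently, each with probability 0.1:

$$P(\text{all five layers fail}) = 0.1^5 = 10^{-5}, \qquad P(\text{Carol stays locked up}) = 1 - 10^{-5} = 99.999\%.$$

The point of the analogy is that Alice and Bob, by deliberately planning, can correlate those failures rather than encountering them independently.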
The moral I'm trying to convey is that, when we talk about mesa-optimization, the important question is "Can we correctly explain any non-infinitesimal subset of the system's behavior as explicit optimization for a misaligned goal?", not "Can we correctly explain 100% of the system's behavior as explicit optimization for a misaligned goal?"
The argument for risk doesn't depend on the definition of mesa optimization. I would state the argument for risk as "the AI system's capabilities might generalize without its objective generalizing", where the objective is defined via the intentional stance. Certainly this can be true without the AI system being 100% a mesa optimizer as defined in the paper. I thought this post was suggesting that we should widen the term "mesa optimizer" so that it includes those kinds of systems (the current definition doesn't), so I don't think you and Matthew actually disagree.
It's important to get this right, because solutions often do depend on the definition. Under the current definition, you might try to solve the problem by developing interpretability techniques that can find the mesa objective in the weights of the neural net, so that you can make sure it is what you want. However, I don't think this would work for other systems that are still risky, such as Bob in your example.
Planned summary for the Alignment newsletter:
The <@mesa optimization@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@) paper defined an optimizer as a system that internally searches through a search space for elements that score high according to some explicit objective function. However, humans would not qualify as mesa optimizers by this definition, since there (presumably) isn't some part of the brain that explicitly encodes some objective function that we then try to maximize. In addition, there are inner alignment failures that don't involve mesa optimization: a small feedforward neural net doesn't do any explicit search; yet when it is trained in the <@chest and keys environment@>(@A simple environment for showing mesa misalignment@), it learns a policy that goes to the nearest key, which is equivalent to a key-maximizer. Rather than talking about "mesa optimizers", the post recommends that we instead talk about "malign generalization", to refer to the problem when <@capabilities generalize but the objective doesn't@>(@2-D Robustness@).
Planned opinion:
I strongly agree with this post (though note that the post was written right after a conversation with me on the topic, so this isn't independent evidence). I find it very unlikely that most powerful AI systems will be optimizers as defined in the original paper, but I do think that the malign generalization problem will apply to our AI systems. For this reason, I hope that future research doesn't specialize to the case of explicit-search-based agents.
Here's a related post that came up on Alignment Forum a few months back: Does Agent-like Behavior Imply Agent-like Architecture?
In the post introducing mesa optimization, the authors defined an optimizer as a system that is "internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system".
The paper continues by defining a mesa optimizer as an optimizer that was selected by a base optimizer.
However, there are a number of issues with this definition, as some have already pointed out.
First, I think by this definition humans are clearly not mesa optimizers. Most optimization we do is implicit. Yet, humans are supposed to be the prototypical examples of mesa optimizers, which appears to be a contradiction.
Second, the definition excludes perfectly legitimate examples of inner alignment failures. To see why, consider a simple feedforward neural network trained by deep reinforcement learning to navigate my Chests and Keys environment. Since "go to the nearest key" is a good proxy for getting the reward, the neural network simply returns the action that, given the board state, results in the agent getting closer to the nearest key.
Is the feedforward neural network optimizing anything here? Hardly; it's just applying a heuristic. Note that you don't need to do anything like an internal A* search to find keys in a maze, because in many environments, following a wall until the key is within sight, and then performing a very shallow search (which doesn't have to be explicit), could work fairly well.
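As a rough illustration of that kind of reactive strategy, here is a hypothetical sketch of a wall-following, key-seeking policy (the grid coordinates, action names, and Manhattan-distance tie-breaking are invented for the example, not taken from the actual environment code):

```python
# Hypothetical reactive policy: two simple rules, no internal search over plans
# and no explicitly represented objective function.

def heuristic_policy(agent_pos, visible_keys, wall_direction="right"):
    """Choose an action from simple rules rather than by searching over strategies."""
    if not visible_keys:
        # No key in sight: keep following the wall to explore the maze.
        return f"follow_wall_{wall_direction}"

    # Otherwise, take one step toward the nearest visible key (Manhattan distance).
    ax, ay = agent_pos
    kx, ky = min(visible_keys, key=lambda k: abs(k[0] - ax) + abs(k[1] - ay))
    if kx != ax:
        return "move_right" if kx > ax else "move_left"
    return "move_up" if ky > ay else "move_down"


# Example: an agent at (0, 0) that can see keys at (3, 1) and (5, 5) moves right.
print(heuristic_policy((0, 0), [(3, 1), (5, 5)]))  # -> "move_right"
```

Behaviorally this looks like a key-maximizer, but nothing in it compares alternatives against an explicit objective.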
As far as I can tell, Hjalmar Wijk introduced the term "malign generalization" to describe the failure mode that I think is most worth worrying about here. In particular, malign generalization happens when you train a system with objective function X, and at deployment it actually ends up doing Y, where Y is so bad that we'd prefer the system to fail completely. To me at least, this seems like a far more intuitive and less theory-laden way of framing inner alignment failures.
This way of reframing the issue allows us to keep the old terminology (that we are concerned with capability robustness without alignment robustness) while dropping all unnecessary references to mesa optimization.
Mesa optimizers could still form a natural class of things that are prone to malign generalization. But if even humans are not mesa optimizers, why should we expect mesa optimizers to be the primary real world examples of such inner alignment failures?