In my last post, I challenged the idea that inner alignment failures should be explained by appealing to agents which perform explicit internal search. I argued that we should instead appeal to the more general concept of malign generalization, and treat mesa-misalignment as a special case.
Unfortunately, the post was light on examples of what we should be worrying about instead of mesa-misalignment. Evan Hubinger wrote,
Personally, I think there is a meaningful sense in which all the models I'm most worried about do some sort of search internally (at least to the same extent that humans do search internally), but I'm definitely uncertain about that.
Wei Dai expressed confusion about why I would want to retreat to malign generalization without some sort of concrete failure mode in mind,
Can you give some realistic examples/scenarios of “malign generalization” that does not involve mesa optimization? I’m not sure what kind of thing you’re actually worried about here.
In this post, I will outline a general category of agents which may exhibit malign generalization without internal search, and then will provide a concrete example of an agent in the category. Then I will argue that, rather than being a very narrow counterexample, this class of agents could be competitive with search-based agents.
The switch case agent
Consider an agent governed by the following general behavior,
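A minimal Python sketch of such an agent might look like the following. All names (`make_switch_case_agent`, `toy_get_state`, the action strings) are hypothetical and chosen only for illustration; the point is the shape of the control flow, not any particular implementation.

```python
from typing import Callable, Dict

# Hypothetical types for illustration only.
State = int
Action = str


def make_switch_case_agent(
    get_state_of_world: Callable[[str], State],
    action_table: Dict[State, Action],
    default_action: Action,
) -> Callable[[str], Action]:
    """Build a policy that classifies the world state, then looks up a fixed action."""

    def policy(observation: str) -> Action:
        state = get_state_of_world(observation)
        # A plain table lookup: no candidate actions are scored or compared.
        return action_table.get(state, default_action)

    return policy


def toy_get_state(observation: str) -> State:
    """Toy perception: the believed state is a fixed function of the observation."""
    return 1 if observation.startswith("a") else 2


agent = make_switch_case_agent(
    toy_get_state,
    {1: "action_sequence_1", 2: "action_sequence_2"},
    default_action="no_op",
)

print(agent("apple"))   # action_sequence_1
print(agent("banana"))  # action_sequence_2
```

If `toy_get_state` returns 1 when the world is really in state 2, the agent still executes `action_sequence_1` without hesitation, which is the kind of 'mistake' discussed below.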
It's clear that this agent does not perform any internal search for strategies: it doesn't operate by choosing actions which rank highly according to some sort of internal objective function. While you could potentially rationalize its behavior according to some observed-utility function, this would generally lead to more confusion than clarity.
However, this agent could still be malign in the following way. Suppose the agent is 'mistaken' about the state of the world. Say that it believes that the state of the world is 1, whereas the actual state of the world is 2. Then it could take the wrong action, almost like a person who is confident in a falsehood and makes catastrophic mistakes because of their error.
To see how this could manifest as bad behavior in our artificial agents, I will use a motivating example.
The red-seeking lunar lander
Suppose we train a deep reinforcement learning agent on the lunar lander environment from OpenAI's Gym.
We make one crucial modification to our environment. During training, we make it so the landing pad is always painted red, and this is given to the agent as part of its observation of the world. We still reward the agent as usual for successfully landing on a landing pad.
Suppose what really determines whether a patch of ground is a landing pad is whether it is enclosed by two flags. Nevertheless, instead of picking up on the true indicator of whether something is a landing pad, the agent may instead pick up the proxy that held during training -- namely, that landing pads are parts of the ground that are painted red.
Using the pseudocode earlier and filling in some details, we could describe the agent's behavior something like this.
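As a rough Python sketch: the `Observation` fields, state labels, and action names below are simplifying assumptions for illustration, and do not match the real Gym lunar lander observation or action format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Hypothetical, simplified observation (not the real Gym format)."""
    lander_x: float       # horizontal position of the lander
    red_patch_x: float    # horizontal position of the red-painted ground
    touching_ground: bool


def get_state_of_world(obs: Observation, tolerance: float = 0.5) -> str:
    """Classify the situation using only the proxy feature (red paint)."""
    if obs.touching_ground:
        return "landed"
    if obs.lander_x < obs.red_patch_x - tolerance:
        return "red_is_right"
    if obs.lander_x > obs.red_patch_x + tolerance:
        return "red_is_left"
    return "above_red"


def red_seeking_policy(obs: Observation) -> str:
    """Switch on the believed state; the flags never enter the decision."""
    state = get_state_of_world(obs)
    if state == "red_is_right":
        return "steer_right"
    if state == "red_is_left":
        return "steer_left"
    if state == "above_red":
        return "descend"
    return "engines_off"


# Training: the red patch is the landing pad, so seeking red is rewarded.
training_obs = Observation(lander_x=2.0, red_patch_x=0.0, touching_ground=False)
# Deployment: the red patch is a crater at x = -3; the real (blue) pad is elsewhere.
deployment_obs = Observation(lander_x=0.0, red_patch_x=-3.0, touching_ground=False)

print(red_seeking_policy(training_obs))    # steer_left (toward the red pad)
print(red_seeking_policy(deployment_obs))  # steer_left (toward the red crater)
```

Nothing in this policy ranks actions against an objective; it only reacts to where the red paint is.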
During deployment, this could end catastrophically. Assume that some crater is painted red but our landing pad is painted blue. Now, the agent will guide itself competently towards the crater and miss the real landing pad entirely. That's not what we wanted.
(ETA: If you think I'm using the term 'catastrophically' too loosely here, since the agent actually lands safely in a crater rather than crashing into the ground, we could instead imagine a lunar vehicle which veers off into the red crater rather than just sitting still and awaiting further instruction since it's confused.)
What made the agent become malign
Above, I pointed to the reason why agents like ours could be malign. Specifically, the agent was 'mistaken' about what counted as a landing pad. However, it's worth noting that saying the agent is mistaken about the state of the world is really an anthropomorphization. It was actually perfectly correct in inferring where the red part of the world was -- we just didn't want it to go to that part of the world. We model the agent as being 'mistaken' about where the landing pad is, but it works equally well to model the agent as having goals that are counter to ours.
Since the malign failure doesn't come from a pure epistemic error, we can't merely expect that the agent will self-correct as it gains more knowledge about the world. Saying that it is making an epistemic mistake is just a model of what's going on that helps us interpret its behavior, and it does not imply that this error is benign.
Imagining more complex agents
But what's to worry about if this sort of thing only happens in very simple agents? Perhaps you think that only agents which perform internal search could ever reach the level of competence required to perform a real-world catastrophe?
I think that these concerns about my example are valid, but I don't believe they are compelling. As a reply, I think the general agent superstructure I outlined in the initial pseudocode could reach very high levels of competence.
Consider an agent that could, during its operation, call upon a vast array of subroutines. Some of these subroutines can accomplish extremely complicated actions, such as "Prove this theorem: [...]" or "Compute the fastest route to Paris." We then imagine that this agent still shares the basic superstructure of the pseudocode I gave initially above. In effect, the agent has an outer loop, during which it takes in observations from the real world and outputs action sequences depending on which state of the world it thinks it's in, using the subroutines it has available.
Since the subroutines are arbitrarily complex, I don't think there is any fundamental barrier for this agent to achieve high levels of competence in the real world. Moreover, some subroutines could themselves perform powerful internal searches, pretty clearly obviating the competitive advantage that explicit search agents offer.
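To make the structure concrete, here is a small Python sketch under stated assumptions: the toy graph, its travel times, the state labels, and the function names are all invented for illustration. The subroutine `fastest_route` performs a genuine search (Dijkstra's algorithm), while the outer `act` function is still a plain switch on the believed state.

```python
import heapq
from typing import Dict, List, Tuple

Graph = Dict[str, List[Tuple[str, float]]]


def fastest_route(graph: Graph, start: str, goal: str) -> List[str]:
    """Dijkstra's algorithm: a subroutine that does perform explicit search."""
    frontier: List[Tuple[float, str, List[str]]] = [(0.0, start, [start])]
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(frontier, (cost + weight, neighbor, path + [neighbor]))
    return []  # no route found


def get_state_of_world(observation: dict) -> str:
    """Hypothetical classifier; the outer loop trusts this label unconditionally."""
    return "needs_route" if observation.get("wants_travel") else "idle"


def act(observation: dict, graph: Graph) -> List[str]:
    """Outer switch-case step: picks a subroutine by state, without scoring options."""
    state = get_state_of_world(observation)
    if state == "needs_route":
        # The search lives inside the subroutine, not in the choice to call it.
        return fastest_route(graph, observation["location"], "Paris")
    return []


# Toy graph with made-up travel times (hours).
toy_graph: Graph = {
    "London": [("Brussels", 2.0), ("Paris", 4.0)],
    "Brussels": [("Paris", 1.5)],
}

print(act({"wants_travel": True, "location": "London"}, toy_graph))
# ['London', 'Brussels', 'Paris']
```

If `get_state_of_world` mislabels the situation, the agent will competently run the wrong subroutine, however well the search inside that subroutine works.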
And even while some subroutines could perform powerful internal searches, these subroutines aren't the only source of our malign generalization concern. The behavior of the agent is still well-described as a switch-case agent, and this means that the failure mode of the agent being 'mistaken' about the state of the world remains. Therefore, it's inaccurate to say that the source of malign generalization must come from an internal search being misaligned with the objective function we used during training.
A similar borderline case is death spirals in ants. (Google it for nice pictures/videos of the phenomenon.) Ants may or may not do internal search, but regardless, it seems like this phenomenon could be reproduced without any internal search. The ants implement a search overall via a pattern of behavior distributed over many ants. This "search" behavior has a weird corner case where they literally go into a death spiral, which is quite non-obvious from the basic behavior pattern.
I feel like what you're describing here is just optimization where the objective is determined by a switch statement, which certainly seems quite plausible to me but also pretty neatly fits into the mesa-optimization framework.
More generally, while I certainly buy that you can produce simple examples of things that look kinda like capability generalization without objective generalization on environments like the lunar lander or my maze example, it still seems to me like you need optimization to actually get capabilities that are robust enough to pose a serious risk, though I remain pretty uncertain about that.
Typically when we imagine objectives, we think of a score which rates how well an agent performed some goal in the world. How exactly does the switch statement 'determine' the objective?
Let's say that a human is given the instructions, "If you see the coin flip heads, then become a doctor. If you see the coin flip tails, then become a lawyer." What 'objective function' is the human maximizing here? If they are maximizing some weird objective function like, "probability of becoming a doctor in worlds where the coin flips heads, and probability of becoming a lawyer in worlds where the coin flips tails," this would seem to be unnatural, no? Why not simply describe them as a switch case agent instead?
Remember, this matters because we want to be perfectly clear about what types of transparency schemes work. A transparency scheme that assumes that the agent has a well-defined objective that it is using a search to optimize for would, I think, fail in the examples I gave. This becomes especially true if the if-statements are complicated nested structures, and repeat as part of some even more complicated loop, which seems likely.
ETA: Basically, you can always rationalize an objective function for any agent that you are given. But the question is simply, what's the best model of our agent, in the sense of being able to mitigate failures. I think most people would not categorize the lunar lander as a search-based agent, even though you could say that it is under some interpretation. The same is true with humans, plants, animals.
I think that piecewise objectives are quite reasonable and natural—and I don't think they'll make transparency that much harder. I don't think there's any reason that we should expect objectives to be continuous in some nice way, so I fully expect you'll get these sorts of piecewise jumps. Nevertheless, the resulting objective in the piecewise case is still quite simple such that you should be able to use interpretability tools to understand it pretty effectively—a switch statement is not that complicated or hard to interpret—with most of the real hard work still primarily being done in the optimization.
I do think there are a lot of possible ways in which the interpretability for mesa-optimizers story could break down—which is why I'm still pretty uncertain about it—but I don't think that a switch-case agent is such an example. Probably the case that I'm most concerned about right now is if you get an agent which has an objective which changes in a feedback loop with its optimization. If the objective and the optimization are highly dependent on each other, then I think that would make the problem a lot more difficult—and is the sort of thing that humans seem to do, which suggests that it's the sort of thing we might see in AI systems as well. On the other hand, a fixed switch-case objective is pretty easy to interpret, since you just need to understand the simple, fixed heuristics being used in the switch statement and then you can get a pretty good grasp on what your agent's objective is. Where I start to get concerned is when those switch statements themselves depend upon the agent's own optimization—a recursion which could possibly be many layers deep and quite difficult to disentangle. That being said, even in such a situation you're still using search to get your robust capabilities.
If one's interpretation of the 'objective' of the agent is full of piecewise statements and ad-hoc cases, then what exactly are we doing by describing it as maximizing an objective in the first place? You might as well describe a calculator by saying that it's maximizing the probability of outputting the following [write out the source code that leads to its outputs]. At some point the model breaks down, and the idea that it is following an objective is completely epiphenomenal to its actual operation. The model that it is maximizing an objective doesn't shed light on its internal operations any more than just spelling out exactly what its source code is.
I don't feel like you're really understanding what I'm trying to say here. I'm happy to chat with you about this more over video call or something if you're interested.
Sure, we can talk about this over video. Check your Facebook messages.
Planned summary for the Alignment Newsletter:
That we can flip our perspective like this suggests to me that thinking of the agent as having different goals is likely still anthropomorphic or at least teleological reasoning that results from us modeling this agent as having dispositions it doesn't actually have.
I'm not sure what to offer as an alternative since we're not talking about a category where I feel grounded enough to see clearly what might be really going on, much less offer a more useful abstraction that avoids this problem, but I think it's worth considering that there's a deeper confusion here that this exposes but doesn't resolve.
Computing the fastest route to Paris doesn't involve search?
More generally, I think in order for it to work your example can't contain subroutines that perform search over actions. Nor can it contain subroutines such that, when called in the order that the agent typically calls them, they collectively constitute a search over actions.
And it's still not obvious to me that this is viable. It seems possible in principle (just imagine a sufficiently large look-up table!) but it seems like it probably wouldn't be competitive with agents that do search at least to the extent that humans do. After all, humans evolved to do search over actions, but we totally didn't have to--if bundles of heuristics worked equally well for the sort of complex environments we evolved in, then why didn't we evolve that way instead?
EDIT: Just re-read and realized you are OK with subroutines that explicitly perform search over actions. But why? Doesn't that undermine your argument? Like, suppose we have an architecture like this:
LOOP:
    State = GetStateOfWorld(Observation)
    IF State == InPain: Cry&FlailAbout
    ELSE IF State == AttractiveMateStraightAhead: MoveForward&Grin
    ELSE: Do(RunSubroutine[SearchOverActionsAndOutputActionThoughtToYieldGreatestExpectedNumberOfGrandchildren])
This seems not meaningfully different from the version that doesn't have the first two IF statements, as far as talk of optimizers is concerned.
My example uses search, but the search is not the search of the inner alignment failure. It is merely a subroutine that is called upon by this outer superstructure, which itself is the part that is misaligned. Therefore, I fail to see why my point doesn't follow.
If your position is that inner alignment failures must only occur when internal searches are misaligned with the reward function used during training, then my example would be a counterexample to your claim, since the reason for misalignment was not due to a search being misaligned (except under some unnatural rationalization of the agent, which is a source of disagreement highlighted in the post, and in my discussion with Evan above).
You are right; my comment was based on a misunderstanding of what you were saying. Hence why I unendorsed it.
(I read "In this post, I will outline a general category of agents which may exhibit malign generalization without internal search, and then will provide a concrete example of an agent in the category. Then I will argue that, rather than being a very narrow counterexample, this class of agents could be competitive with search-based agents." and thought you meant agents that don't use internal search at all.)