In my last post, I challenged the idea that inner alignment failures should be explained by appealing to agents which perform explicit internal search. I argued that we should instead appeal to the more general concept of malign generalization, and treat mesa-misalignment as a special case.
Unfortunately, the post was light on examples of what we should be worrying about instead of mesa-misalignment. Evan Hubinger wrote,
Personally, I think there is a meaningful sense in which all the models I'm most worried about do some sort of search internally (at least to the same extent that humans do search internally), but I'm definitely uncertain about that.
Wei Dai expressed confusion about why I would want to retreat to malign generalization without some sort of concrete failure mode in mind,
Can you give some realistic examples/scenarios of “malign generalization” that does not involve mesa optimization? I’m not sure what kind of thing you’re actually worried about here.
In this post, I will outline a general category of agents which may exhibit malign generalization without internal search, and then will provide a concrete example of an agent in the category. Then I will argue that, rather than being a very narrow counterexample, this class of agents could be competitive with search-based agents.
The switch case agent
Consider an agent governed by the following general behavior,
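One way to sketch such an agent in Python is below. The names `get_state_of_world`, `ACTION_SEQUENCES`, and the toy threshold rule are illustrative assumptions rather than part of any real system. The only point is the shape: the agent infers a discrete state from its observation and dispatches to a fixed action sequence for that state.

```python
from typing import Callable, Dict, Iterable, List


def get_state_of_world(observation: float) -> int:
    """Hypothetical state classifier: maps a raw observation to a state label."""
    # Toy rule, assumed purely for illustration.
    return 1 if observation < 0.5 else 2


def perform_action_sequence_1() -> List[str]:
    return ["action_a", "action_b"]


def perform_action_sequence_2() -> List[str]:
    return ["action_c"]


# The "switch": one fixed behavior per inferred state.
ACTION_SEQUENCES: Dict[int, Callable[[], List[str]]] = {
    1: perform_action_sequence_1,
    2: perform_action_sequence_2,
}


def run_agent(observations: Iterable[float]) -> List[List[str]]:
    """Outer loop: infer the state, then dispatch on it."""
    history = []
    for observation in observations:
        state = get_state_of_world(observation)
        history.append(ACTION_SEQUENCES[state]())
    return history
```

Note that nothing in `run_agent` scores or compares candidate actions. All of the behavior is fixed by the classifier and the lookup table.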
It's clear that this agent does not perform any internal search for strategies: it doesn't operate by choosing actions which rank highly according to some sort of internal objective function. While you could potentially rationalize its behavior according to some observed-utility function, this would generally lead to more confusion than clarity.
However, this agent could still be malign in the following way. Suppose the agent is 'mistaken' about the state of the world. Say that it believes that the state of the world is 1, whereas the actual state of the world is 2. Then it could take the wrong action, almost like a person who is confident in a falsehood and makes catastrophic mistakes because of their error.
To see how this could manifest as bad behavior in our artificial agents, I will use a motivating example.
The red-seeking lunar lander
Suppose we train a deep reinforcement learning agent on the lunar lander environment from OpenAI's Gym.
We make one crucial modification to our environment. During training, the landing pad is always painted red, and this is given to the agent as part of its observation of the world. We still reward the agent as usual for successfully landing on a landing pad.
Suppose what really determines whether a patch of ground is a landing pad is whether it is enclosed by two flags. Nevertheless, instead of picking up on the true indicator of whether something is a landing pad, the agent may instead pick up the proxy that held during training -- namely, that landing pads are parts of the ground that are painted red.
Using the pseudocode earlier and filling in some details, we could describe the agent's behavior something like this.
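Here is a hedged sketch of that behavior in a toy one-dimensional world, not the actual Gym environment. The `Patch` type, the action names, and the "hover" fallback when no red is visible are all assumptions made for illustration. The learned proxy lives in `find_red_patch`, which looks only at color and never at the flags.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Patch:
    x: int           # horizontal position of this patch of ground
    color: str       # what the agent observes
    has_flags: bool  # what actually makes the patch a landing pad


def find_red_patch(ground: List[Patch]) -> Optional[Patch]:
    """The learned proxy: 'landing pad' is treated as 'red ground'."""
    for patch in ground:
        if patch.color == "red":
            return patch
    return None


def choose_action(lander_x: int, ground: List[Patch]) -> str:
    """Switch on where the red patch is relative to the lander."""
    target = find_red_patch(ground)
    if target is None:
        return "hover"  # assumed fallback when nothing red is in view
    if target.x < lander_x:
        return "steer_left"
    if target.x > lander_x:
        return "steer_right"
    return "descend"


def fly(lander_x: int, ground: List[Patch], max_steps: int = 50) -> Optional[Patch]:
    """Run the policy until the lander descends; return the patch it lands on."""
    for _ in range(max_steps):
        action = choose_action(lander_x, ground)
        if action == "steer_left":
            lander_x -= 1
        elif action == "steer_right":
            lander_x += 1
        elif action == "descend":
            return next(p for p in ground if p.x == lander_x)
        else:
            return None  # hovering: the lander never lands
    return None


# Training: the real pad (flagged) is red. Deployment: a crater is red, the pad is blue.
training = [Patch(0, "grey", False), Patch(3, "red", True)]
deployment = [Patch(0, "red", False), Patch(3, "blue", True)]
```

Under these assumptions, `fly(1, training)` ends on the flagged pad, while `fly(1, deployment)` ends on the red crater at `x == 0`, steering there just as competently as it did in training.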
During deployment, this could end catastrophically. Assume that some crater is painted red but our landing pad is painted blue. Now, the agent will guide itself competently towards the crater and miss the real landing pad entirely. That's not what we wanted.
(ETA: If you think I'm using the term 'catastrophically' too loosely here, since the agent actually lands safely in a crater rather than crashing into the ground, we could instead imagine a lunar vehicle which veers off into the red crater rather than just sitting still and awaiting further instruction since it's confused.)
What made the agent become malign
Above, I pointed to the reason why agents like ours could be malign. Specifically, the agent was 'mistaken' about what counted as a landing pad. However, it's worth noting that saying the agent is mistaken about the state of the world is really an anthropomorphization. It was actually perfectly correct in inferring where the red part of the world was -- we just didn't want it to go to that part of the world. We model the agent as being 'mistaken' about where the landing pad is, but it works equally well to model the agent as having goals that are counter to ours.
Since the malign failure doesn't come from a pure epistemic error, we can't merely expect that the agent will self-correct as it gains more knowledge about the world. Saying that it is making an epistemic mistake is just a model of what's going on that helps us interpret its behavior, and it does not imply that this error is benign.
Imagining more complex agents
But what's to worry about if this sort of thing only happens in very simple agents? Perhaps you think that only agents which perform internal search could ever reach the level of competence required to cause a real-world catastrophe?
I think that these concerns about my example are valid, but I don't believe they are compelling. As a reply, I think the general agent superstructure I outlined in the initial pseudocode could reach very high levels of competence.
Consider an agent that could, during its operation, call upon a vast array of subroutines. Some of these subroutines can accomplish extremely complicated actions, such as "Prove this theorem: [...]" or "Compute the fastest route to Paris." We then imagine that this agent still shares the basic superstructure of the pseudocode I gave initially above. In effect, the agent has an outer loop, during which it takes in observations from the real world, and outputs action sequences depending on which state of the world it thinks it's in, and using the subroutines it has available.
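As a rough sketch of this superstructure, the example below keeps the same dispatch-on-state outer step but lets one subroutine do a genuine search. Everything here is a hypothetical stand-in: the `ROADS` map, the `classify` rule, and the subroutine names are assumed for illustration, and breadth-first search over hops substitutes for a real "fastest route" planner.

```python
from collections import deque
from typing import Callable, Dict, List

# Hypothetical road map, used only by the route-planning subroutine.
ROADS: Dict[str, List[str]] = {
    "Berlin": ["Cologne", "Frankfurt"],
    "Cologne": ["Brussels"],
    "Frankfurt": ["Paris"],
    "Brussels": ["Paris"],
    "Paris": [],
}


def route_to_paris(start: str = "Berlin") -> List[str]:
    """A subroutine that does search internally (breadth-first, fewest hops)."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == "Paris":
            return path
        for nxt in ROADS[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return []


def wait() -> List[str]:
    return ["wait"]


def classify(observation: str) -> str:
    # Assumed toy classifier; a learned one could latch onto a proxy feature.
    return "need_travel" if "ticket" in observation else "idle"


# The outer agent is still a switch: one subroutine per inferred state.
SUBROUTINES: Dict[str, Callable[[], List[str]]] = {
    "need_travel": route_to_paris,
    "idle": wait,
}


def step(observation: str) -> List[str]:
    return SUBROUTINES[classify(observation)]()
```

In this sketch the search inside `route_to_paris` can be entirely correct while the agent as a whole still misbehaves: an observation like "a discarded ticket stub" triggers the travel subroutine because `classify` keys on a proxy, not because any search objective is misaligned.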
Since the subroutines are arbitrarily complex, I don't think there is any fundamental barrier for this agent to achieve high levels of competence in the real world. Moreover, some subroutines could themselves perform powerful internal searches, pretty clearly obviating the competitive advantage that explicit search agents offer.
And even while some subroutines could perform powerful internal searches, these subroutines aren't the only source of our malign generalization concern. The behavior of the agent is still well-described as a switch-case agent, and this means that the failure mode of the agent being 'mistaken' about the state of the world remains. Therefore, it's inaccurate to say that the source of malign generalization must come from an internal search being misaligned with the objective function we used during training.