TL;DR: I split out mesaoptimizers (models which do explicit search internally) from the superset of consequentialist models (which accomplish goals in the world, and may or may not use search internally). This resolves a bit of confusion I had about mesaoptimizers and whether things like GPT simulating an agent counted as mesaoptimization or not.
Editor’s note: I’m experimenting with having a lower quality threshold for just posting things even while I’m still confused and unconfident about my conclusions, but with this disclaimer at the top. Thanks to Vivek Hebbar, Ethan Perez, Owain Evans, and Evan Hubinger for discussions.
UPDATE: The idea in this post is basically the same as the idea of "mesa-controllers" in this post.
By consequentialist models I mean models which optimize the world in the Alex Flint sense of optimization; i.e., narrowing world states to some goal in a way that is robust to some perturbations. In other words, such a model is able to achieve consequences. The model doesn't have to be fully consequentialist either; it just has to have some kernel of consequentialist structure by virtue of actually achieving its goals sometimes. Of course, the degree to which a model is consequentialist is more of a scalar quantity than a discrete yes/no thing; for my purposes it really doesn't matter where to draw a dotted line that dictates when a model "becomes consequentialist".
(For what it's worth, I would totally have called "consequentialist models" just mesaoptimizers, and what other people call mesaoptimizers something like "searching mesaoptimizers", but that would only create even more confusion than already exists.)
For instance, the policy network of AlphaGo alone (i.e., none of the MCTS) is consequentialist, because it consistently steers the world into states where it's winning, despite my attempts to make it not win. Basically any RL policy is a fairly consequentialist model by this definition, since it channels the set of all states to the set of states that achieve high reward. I think of mesaoptimizers as a subset of consequentialist models, in that they are consequentialist, but they implement this consequentialism using explicit search rather than some other random thing. Explicit search is a kind of optimization, but not all optimization has to be search; you could have symbol manipulation, clever heuristics, gradient-based methods, or at the most extreme even just a big lookup table preloaded with the optimal answer for each possible situation.
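To make the search vs. non-search distinction concrete, here is a minimal sketch. Everything in it is an assumption made up for illustration: a toy world of positions 0..10 on a line, a goal at position 7, and the names search_policy, table_policy, and run. One policy does explicit breadth-first search over plans at run time; the other is just a preloaded lookup table. Both steer the world to the same place, including after being knocked off course.

```python
from collections import deque

GOAL = 7                  # hypothetical goal state
STATES = range(11)        # positions 0..10 on a line
ACTIONS = (-1, 0, 1)

def step(state, action):
    """Toy world dynamics: move along the line, clipped to the ends."""
    return min(max(state + action, 0), 10)

def search_policy(state):
    """Explicit search: breadth-first search over plans at run time,
    returning the first action of a shortest plan that reaches GOAL."""
    if state == GOAL:
        return 0
    frontier = deque((step(state, a), a) for a in ACTIONS)
    seen = {state}
    while frontier:
        s, first_action = frontier.popleft()
        if s == GOAL:
            return first_action
        if s in seen:
            continue
        seen.add(s)
        frontier.extend((step(s, a), first_action) for a in ACTIONS)
    return 0

# No search at run time: a table preloaded with the answer for every state.
LOOKUP = {s: (1 if s < GOAL else -1 if s > GOAL else 0) for s in STATES}

def table_policy(state):
    return LOOKUP[state]

def run(policy, start, perturb_at=3, steps=30):
    """Roll the policy forward, knocking the state off course once."""
    s = start
    for t in range(steps):
        if t == perturb_at:
            s = step(s, -3)   # external perturbation
        s = step(s, policy(s))
    return s
```

From the outside the two policies are indistinguishable in this toy world: both are consequentialist in the sense used here, but only search_policy would count as doing explicit search internally.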
Consequentialist models are scary because when they are learned imperfectly, resulting in a goal that is misaligned with ours, they competently pursue the wrong thing rather than failing outright (other terms for this phenomenon coined by various people: objective misgeneralization, malign failure). This is often stated as a danger of mesaoptimization, but I think this property is slightly more general than that. It's not the search part that makes this dangerous; it's the consequentialism part. Search is just one way that consequentialism can be implemented.
Another way of thinking about this: instrumental convergence (and as a result, deception) is not exactly a result of search, or even utility maximization. Those things are ways of implementing consequentialism, but other ways of implementing consequentialism would result in the exact same problems. These non-searching consequentialist models might function internally as a pile of heuristics and shallow patterns that are just extremely competent at steering world states. Of course this is also a continuous thing, where you can do varying ratios of search to heuristics. My intuition is that humans, for instance, do very little search (our System 2), and rely on a huge pile of heuristics (our System 1) to provide a very powerful “rollout policy” to make the most of the little search we do.
(My intuition is that GPT, for instance, does little to no search, and has a huge pile of heuristics—but heuristics is likely all you need)
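The "varying ratios of search to heuristics" idea can be sketched as depth-limited lookahead that falls back on a heuristic at the leaves. This is only an illustration under made-up assumptions: the toy line world, the step cost of 1, and the names sharp_heuristic, blunt_heuristic, value, act, and reaches_goal are all hypothetical. The point it shows is that with a good heuristic one step of search is enough, while with an uninformative heuristic the search depth has to do all the work.

```python
GOAL = 7
ACTIONS = (-1, 0, 1)

def step(state, action):
    """Toy world dynamics: positions 0..10 on a line, clipped at the ends."""
    return min(max(state + action, 0), 10)

def sharp_heuristic(state):
    """A well-tuned heuristic: the exact (negative) number of steps to the goal."""
    return -abs(state - GOAL)

def blunt_heuristic(state):
    """An uninformative heuristic: only recognizes the goal when standing on it."""
    return 0 if state == GOAL else -100

def value(state, depth, heuristic):
    """Depth-limited lookahead: each step costs 1, the goal is absorbing,
    and the heuristic is consulted once the search budget runs out."""
    if state == GOAL:
        return 0
    if depth == 0:
        return heuristic(state)
    return -1 + max(value(step(state, a), depth - 1, heuristic) for a in ACTIONS)

def act(state, depth, heuristic):
    """Pick the action whose successor looks best under `depth` steps of lookahead."""
    return max(ACTIONS, key=lambda a: value(step(state, a), depth - 1, heuristic))

def reaches_goal(start, depth, heuristic, steps=20):
    s = start
    for _ in range(steps):
        s = step(s, act(s, depth, heuristic))
    return s == GOAL
```

Under these assumptions, reaches_goal(0, 1, sharp_heuristic) succeeds with almost no search, reaches_goal(0, 1, blunt_heuristic) gets stuck, and reaches_goal(0, 7, blunt_heuristic) succeeds only because the search is deep enough to see the goal directly.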
One thing is that it’s slightly more complicated how non-search consequentialist models would discover deceptive strategies. Unlike search, where you can think up plans you never trained on and then filter for high reward, if you can’t do search, you have to somehow be able to tell that deceptive actions are high reward. However, I don’t think it’s impossible for non-searching models to still be deceptive. Some ways this could happen include:
There are still some very key differences between mesaoptimizers and consequentialist models. For instance:
Another possible framing is that both consequentialist models and mesaoptimizers are optimized but consequentialist models don’t do any optimization themselves, whereas mesaoptimizers do. However, I don’t think this is quite true; just because you aren’t doing any explicit search doesn’t mean you can’t channel world states towards those that satisfy some mesaobjective.
In the thermostat analogy: sure, it’s true that we put optimization into making the thermostat, and that the thermostat is kind of dumb and basically “just” executes the simple heuristic that we put into it, but it is still effective at steering the world into the state where the room is a particular temperature in a way that’s robust to other sources of temperature changes. My claim is essentially that there is not really any hard boundary between things that are optimized to be “just piles of heuristics/simple routines” (aka Adaptation-Executers) and things that are optimized to “actually do reasoning”. I fully expect that there can exist large enough piles of heuristics that can actually create sufficiently-powerful-to-be-dangerous plans.
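The thermostat point can be sketched in a few lines. The room model, the numbers, and the names thermostat and simulate are assumptions invented for illustration, not a model of any real device. The controller is a single fixed rule with no lookahead, yet the room ends up near the target from very different starting temperatures and after outside disturbances.

```python
def thermostat(temp, target=20.0, band=0.5):
    """The whole 'policy': one fixed rule, no lookahead or plan evaluation."""
    if temp < target - band:
        return 1.0    # heat
    if temp > target + band:
        return -1.0   # cool
    return 0.0        # close enough, do nothing

def simulate(start_temp, disturbances, steps=200):
    """Hypothetical room: each step applies any scheduled disturbance,
    then moves the temperature by half a degree in the thermostat's direction."""
    temp = start_temp
    for t in range(steps):
        temp += disturbances.get(t, 0.0)   # e.g. someone opens a window
        temp += 0.5 * thermostat(temp)
    return temp

# Two outside shocks: a cold draft at step 50 and a heat source at step 120.
shocks = {50: -8.0, 120: 6.0}
cold_start = simulate(5.0, shocks)
hot_start = simulate(35.0, shocks)
```

Both cold_start and hot_start land within the half-degree band around the target, which is the "narrowing world states in a way robust to perturbations" property, achieved here without anything that looks like search.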
I agree with this, and I think the distinction between "explicit search" and "heuristics" is pretty blurry: there are characteristics of search (evaluating alternative options, making comparisons, modifying one option to find another option, etc.) that can be implemented by heuristics, so you get some kind of hybrid search-instinct system overall that still has "consequentialist nature".
An area where I think there is an important difference between doing explicit search and optimisation through piles of heuristics is in clustering NNs à la Filan et al. (link TBD).
A use case I've been thinking about is to use that kind of technique to help identify mesaoptimisation, or more particularly mesaobjectives (with the help of interpretability tools guided by the clustering of the NN).
In the case of explicit search, I would expect that it would be more common than not to be able to find a specific part of the network evaluating world states in terms of the mesaobjective, and thus to be able to identify the behavioural objective (link TBD). For piles of heuristics, I would instead expect the behavioural objective to only be apparent from investigating large parts of the NN in relation to the environment, which seems a lot harder.