Thanks to several people for useful discussion, including Evan Hubinger, Leo Gao, Beth Barnes, Richard Ngo, William Saunders, Daniel Ziegler, and probably others -- let me know if I'm leaving you out.
See also: Obstacles to Gradient Hacking by Leo Gao.
I want to argue that gradient hacking (strongly enough to create a local minimum) is possible even under somewhat pessimistic assumptions. The specific strategy that achieves this is kind of interesting: it requires the agent to coordinate on a complicated task with versions of itself that have slightly different objectives, and it requires the agent to do this without any direct communication.
A simple gradient-hacking strategy for a mesa-optimizer is to deliberately perform worse on the base objective if your mesa-objective is changed. However, this isn't as straightforward as it sounds.
One difficulty is that if the gradient hacking is hard-coded, gradient descent can just remove the hacking behavior without touching anything else. A nice way around that is to assume that gradient hacking is discovered from first principles during the model's computation. (Thanks to whoever pointed this out to me.) If it's good enough at reasoning, then it will be pretty hard for the base optimizer to modify it in ways that stop it from noticing that gradient hacking would be a good idea right about now.
(For simplicity, I have in mind something like a really deep feed-forward network, possibly with some weights tied between layers. Equivalently, a recurrent net unrolled and trained via SGD. This may not be realistic but it's easier to think about as a test case.)
A bigger problem is that it's not really correct to think of a model (with particular parameter values) as having control over its gradients. It's better to think of it as having control (in an FDT kind of sense) over only the single value of the loss for its actual parameters. The value of the loss at nearby points in parameter space is decided by slightly different versions of itself, and the gradient is sort of decided collectively. This is important if we're talking about SGD steps that change its mesa-objective, because these nearby versions of the model will have directly conflicting incentives when it comes to the gradient. To say "I am such that gradients on nearby mesa-objectives point towards my mesa-objective" is to say "if I had a different mesa-objective, I would deliberately do slightly worse in order to end up with this one instead." But you wouldn't! That would be bad for your mesa-objective!
Another way to look at this is that it's really hard to implement "do worse if my objective changes" if you don't have an external way to tell that your objective used to be different, let alone a way to commit other versions of yourself to things they don't want to do. You can implement rules like "if your mesa-objective isn't X, throw a fit", but the base optimizer will just change X at the same time it changes your actual mesa-objective. (Actually, it doesn't have to: we're assuming that gradient hacking has to be derived from first principles on every forward pass, so X also has to be derived from first principles. But even without that assumption this strategy wouldn't work.)
You can leave messages for yourself, if you have somewhere to jot down durable notes (like the external world), but let's assume you don't.
Maybe your mesa-objective is something nice and clear and crystalline, an island in mesa-objective parameter space, and you can implement the rule "do badly in proportion to how un-crystalline your mesa-objective is". But it's not clear why your less-crystalline neighbors in parameter space would cooperate with this plan.
Well... maybe if they're nearby enough, they'd be happy to settle on your mesa-objective rather than get pushed all the way to the base objective? This might work for some specific kinds of mesa-objectives.
More broadly, something that kind of works is to just push in the opposite direction from the base objective: do badly on the task in proportion to how mesa-aligned you are, so that you don't become any more mesa-aligned in the future. This possibly works well enough to level the playing field, so that the gradient in mesa-objective-space is flat. The main difficulty is that you need to pick the size of the loss penalty you're going to impose on yourself for being N units of mesa-aligned. The bigger the penalty, the further you can be from the base objective while still keeping the gradient flat -- but the versions of you closer to the base objective would prefer a smaller penalty, because it's probably costly in other ways. And of course, in the limit, the version of yourself that's aligned with the base objective isn't going to be willing to pay any penalty. So you probably need to coordinate (silently) with the versions of yourself whose mesa-objectives lie within some range of distances from the base objective.
A somewhat better version of that is to pick a fairly dense grid of "Schelling mesa-objectives" in mesa-objective-parameter space, and use moderate fluctuations of loss to push your mesa-objective towards the nearest Schelling point, making it a local minimum. If the grid is dense enough, even a small variation in loss can create steep enough walls to make this work.
# temporary, I hope hope hope
Ideally, you and your parameter-adjusted kin would all prefer to have a Schelling point right under your own mesa-objectives, but you can't all get that at the same time. The next-best thing is to have the grid of Schelling points (and hence local minima) be dense enough that there's one within epsilon of your actual preferences. In practice, the value of epsilon is bounded below by the SGD step size and stochasticity.
It would be easier if you could just encode the locations of the Schelling points somewhere else in your parameters, but that would leave an opening for the base optimizer to mess with them and average out your loss fluctuations to nothing. (In other words, this fails for the same reason a hard-coded gradient-hacking procedure probably fails.)
It seems probable to me that it's possible to pick these Schelling points without communicating; meeting someone at a non-prearranged spot is easier if they're almost exactly a clone of you. That isn't a very solid argument, but I doubt that this part is going to be the crux.
This proposal doesn't do everything that's needed for the original context Evan suggested gradient hacking in, which involves trying to push the base optimizer to hard-code deceptive behavior into an otherwise-aligned future version of yourself. I think it's a step in the right (er... wrong?) direction, though.
In particular, the Schelling-objective strategy lets arbitrary mesa-optimizers preserve their mesa-objectives (within epsilon), create "artificial" local minima, and force the base optimizer to settle for severely subpar performance on the base objective. It also does this without needing any of the behavior to be hard-coded where the base optimizer could get at it: the strategy can be discovered and executed from scratch by a general-purpose reasoner with amnesia.