"Gradient hacking" is a term I've been using recently to describe the phenomenon wherein a deceptively aligned mesa-optimizer might be able to purposefully act in ways which cause gradient descent to update it in a particular way. In Risks from Learned Optimization, we included the following footnote to reflect this possibility:
Furthermore, a deceptively aligned mesa-optimizer would be incentivized to cause there to be a systematic bias in the direction of preventing the base optimizer from modifying its mesa-objective. Thus, in the context of a local optimization process, a deceptive mesa-optimizer might try to “hack” its own gradient (by, for example, making itself more brittle in the case where its objective gets changed) to ensure that the base optimizer adjusts it in such a way that leaves its mesa-objective untouched.
Outside of that footnote, however, there hasn't been much other discussion of this phenomenon, and I think that's a shame, as I think it's definitely something that's worth looking into.
For example, here's a gradient hacking story that I'm particularly concerned about. Suppose you're training a model and it becomes deceptive. You're not worried about that, though, because you have sufficient transparency tools that you're confident you can detect any deceptive cognition it might be doing. If it's thinking about how to deceive you, your transparency tools will pick that up, and you'll know to throw out that model and try again. However, your transparency tools are quite expensive, so you can only afford to run them once, which you're planning on doing after training before you deploy your model. If your model is deceptive, though, then it might know all of that—in particular, it might know that if it stays deceptive, it'll get caught.
However, there is still something it can do in that situation to be able to defect against you and cause whatever it wants to happen during deployment. That is, it could hack its own gradient to produce a new model which will still exhibit the same deceptive behavior, but without doing any of the detectable deceptive cognition. Perhaps the new model exhibits the deceptive behavior not because it's trying to act deceptive, but just because it's a hard-coded part of its algorithm, or because it has some bad proxy, or any other defect that might be more difficult to detect than outright deception.
How might such a deceptive model accomplish such a feat of gradient hacking? Well, I don't think it's actually that difficult once your deceptive model comes up with it as a strategy. First, note that since gradient descent is a local optimization process, it doesn't have to somehow change what the global minima might be, or even what the local minima are—it just has to create a ridge in the loss landscape that will push the gradient in the direction it wants.
The original footnote provides one example of this, which is for the model to check if its objective satisfies some criterion, and fail hard if it doesn't. Now, if the model gets to the point where it's actually just failing because of this, then gradient descent will probably just remove that check—but the trick is never to actually get there. By having such a check in the first place, the model makes it so that gradient descent won't actually change its objective, since any change to its objective (keeping all the other parameters fixed, which is what gradient descent does since it computes partial derivatives) would lead to such a failure.
The case of a deceptive model trying to erase evidence of its deception is not quite as straightforward, though I think it would still be possible to do something quite similar in that situation as well. This is especially true if the model has the ability to read and write to some sort of state (e.g. a transformer which can attend to its previous output). For example, it could write out a pre-commitment to follow the hard-coded policy and then just have a policy of always following any pre-commitments it finds such that it never thinks about the deception that led it to that policy in the first place.
So, that's the problem—how do we solve it? Well, there's a simple solution in the case above, which is just to run your transparency tools throughout training—not just at the end—so that you can catch the deception as soon as it happens. I think this is the right solution, but it's worth noting that this adds an additional desiderata to your transparency tools, which is that they have to be cheap enough to run during the entire training process. That might not be that bad, however, if you don't do very many gradient descent update steps, and instead just use massive models and huge batch sizes—and there is some evidence that such a procedure is actually the most compute-efficient way to train neural networks. That being said, I think there are some transparency techniques which lend themselves better to being applied across training than others. Thus, the possibility of gradient hacking suggests that we should focus more on those techniques which allow us to not just understand a single model, but also the entire history of how that model was produced.
As an example of a technique which I think translates well to this domain, one of the OpenAI clarity team's current major research thrusts is "model diffing," which is about trying to understand exactly how two similar models differ from each other. Compared to other approaches that are just focused on understanding a single model, model diffing could be quite useful in allowing one to understand the entirety of the training process, since you could diff the model at each gradient descent update step. ↩︎