It is conjectured that SGD with appropriate hyperparameters approximates Bayesian learning. Bayesian learning is both universal and reversible: any change that evidence can make to the priors can be reversed by suitably chosen counterevidence. Offhand, that makes it sound like gradient hacking during SGD training might be quite hard, since the model can't affect the evidence it will be shown — though that intuition is some way from being a mathematical proof.
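The reversibility claim can be made precise in odds form (a standard identity; for the sequential update we assume the two pieces of evidence are conditionally independent given $H$):

$$\frac{P(H \mid E)}{P(\lnot H \mid E)} = \frac{P(E \mid H)}{P(E \mid \lnot H)} \cdot \frac{P(H)}{P(\lnot H)}$$

so if counterevidence $E'$ carries the reciprocal likelihood ratio, $\frac{P(E' \mid H)}{P(E' \mid \lnot H)} = \left(\frac{P(E \mid H)}{P(E \mid \lnot H)}\right)^{-1}$, then the posterior odds after observing both $E$ and $E'$ return exactly to the prior odds.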
On the other hand, it's well established that gradient hacking during RL training is possible.
A more interesting question is whether a gradient hacker operating during RL can introduce circuits into a model that are impractically hard to remove with SGD. The phenomenon of backdoors and sleeper agents suggests the answer in practice is yes: if the RL gradient hacker can create a backdoor, then removing it with SGD might well require knowing how to trigger it, so that you can Bayesianly disprove the hypothesis that the model does a very specific thing only under a very specific set of circumstances: present the trigger, and then train the model out of its previous response to it.
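A toy illustration of why knowing the trigger matters (my own sketch, not taken from the sleeper-agents work): in a linear model, the gradient on a weight attached to a feature that never appears in the fine-tuning data is exactly zero, so SFT on trigger-free data cannot touch a backdoor carried by that weight.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
X[:, 2] = 0.0                      # feature 2 is the trigger: absent from clean data
y = (X[:, 0] > 0).astype(float)    # the intended behaviour

def grad(w, X, y):
    p = 1 / (1 + np.exp(-X @ w))   # logistic model
    return X.T @ (p - y) / len(y)

def sft(w, X, y, steps=200, lr=0.5):
    w = w.copy()
    for _ in range(steps):
        w -= lr * grad(w, X, y)
    return w

# Backdoored weights: correct on clean inputs, but a large negative weight
# on the trigger feature flips the output whenever x2 = 1.
w_backdoored = np.array([3.0, 0.0, -10.0])

# SFT on clean (trigger-free) data: the backdoor weight gets exactly zero
# gradient, so it survives any amount of this fine-tuning.
w_clean = sft(w_backdoored, X, y)
print(w_clean[2])                  # still -10.0

# SFT on data that includes the trigger with *correct* labels does produce
# a gradient on the backdoor weight, and starts removing it.
X_trig = X.copy()
X_trig[:32, 2] = 1.0
w_fixed = sft(w_backdoored, X_trig, y)
print(w_fixed[2])                  # moved toward 0
```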
(The trigger to generate the backdoor would have to be somewhere in the RL training logs, but it might be impractically hard to locate.)
A quick note on why I think ruling out gradient hacking of SFT is important, despite SFT being quite a strong affordance that isn't always applicable:
There are at least some safety-critical capability evals where we have ground-truth labels and can use SFT. For example, the argument for Claim 2.1.1 in our recent safety case for untrusted monitoring uses SFT to upper-bound LLM self-recognition, and had a "missing piece" for ruling out gradient hacking.
TL;DR: Using evidence from tamper resistance, we argue that it would be hard for current models to gradient hack SFT elicitation.
Suppose you want to conservatively measure the dangerous capabilities of your LLM and will decide whether to deploy it based on these capabilities, but you’re worried that the LLM might sandbag your evaluation. In the best case, you have a dataset of question-answer pairs, which you can use for supervised fine-tuning (SFT) elicitation on a training set, and then measure performance on an i.i.d. test set. Can we be sure that SFT elicitation upper-bounds the model’s capabilities?
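As a toy sketch of that protocol (my own construction, with a hand-built "sandbagger" standing in for an LLM): a model whose weights encode the capability but are set to answer incorrectly, fine-tuned on a train split and scored on a held-out i.i.d. split.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)        # ground-truth question-answer labels
X_train, y_train = X[:100], y[:100]    # SFT elicitation set
X_test, y_test = X[100:], y[100:]      # i.i.d. held-out eval set

def acc(w, X, y):
    return np.mean(((X @ w) > 0) == y)

def grad(w, X, y):
    p = 1 / (1 + np.exp(-X @ w))       # logistic model
    return X.T @ (p - y) / len(y)

# "Sandbagging" weights: the capable solution with its sign flipped,
# so the model answers incorrectly by default.
w = np.array([-3.0, 0.0])
acc_before = acc(w, X_test, y_test)
print(acc_before)                      # near 0: looks incapable

# SFT elicitation on the train split
for _ in range(300):
    w -= 0.5 * grad(w, X_train, y_train)

acc_after = acc(w, X_test, y_test)
print(acc_after)                       # near 1: latent capability elicited
```

The test-set score after elicitation, not the sandbagged score before it, is what the deployment decision should be based on.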
In this setting, a power-seeking adversarial model would benefit from manipulating the training process to avoid being elicited, as otherwise it won’t be deployed. This manipulation is one form of gradient hacking[1], in which a model chooses certain outputs so that parameter updates are in line with its objectives. Gradient hacking is thought to be difficult or unlikely, and we lack convincing empirical evidence either way. Discussing gradient hacking is challenging due to open questions about agency within LLMs and the action space of mesa-optimizers. Fortunately, for some attack models we can side-step this discussion with a conservative / red-teaming argument.
Threat model: We consider a mesa-optimiser that appears during the training of a human-level AI, at which point the model weights are $\theta_t$. It forms a mesa-objective that leads to instrumental power-seeking, and it therefore wants to be deployed. It knows that if we find evidence of certain dangerous capabilities it will not be deployed, so the mesa-optimiser is incentivised to sandbag on dangerous capability evals. However, the mesa-optimiser knows that we will elicit dangerous capabilities using SFT, and therefore the model weights after training ($\theta_T$) must be resistant to elicitation. To ensure deployment, the mesa-optimiser must take actions that influence the SFT-elicited model ($\theta_E$) via $\theta_T$, i.e. it gradient hacks.
We want to know how hard the search for such elicitation-resistant weights $\theta_T$ could be; but we have no evidence that gradient hacking has ever actually occurred in a model. Instead, we can look to the Tamper Resistance literature for some clues.
Tamper Resistance (TR) is the field of research into making models harder to modify, where the focus is generally on making it harder to remove safety properties with continued training. In this setting, tamper resistance training is normally applied as a post-training intervention. For example, TAR (Tampering Attack Resistance) uses model-agnostic meta-learning, with SFT attacks as the inner loop, to achieve tamper resistance.
TAR directly optimises the model’s parameters so that safety properties of the model are not affected by similar fine-tuning attacks. However, this tamper resistance can be consistently undone within several hundred steps of fine-tuning.
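A first-order caricature of this two-loop structure (my own toy, not the actual TAR algorithm, which backpropagates through the inner loop): the inner loop simulates a budget-limited fine-tuning attack, and the outer loop pushes the weights so that the attacked weights still score poorly on the attacker’s objective.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(64, 2))
y_attack = (X[:, 0] > 0).astype(float)   # capability the attacker wants to restore

def loss(w, X, y):
    p = 1 / (1 + np.exp(-X @ w))
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

def grad(w, X, y):
    p = 1 / (1 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

def attack(w, steps=5, lr=0.3):
    """Inner loop: a budget-limited SFT attack on the attacker's objective."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * grad(w, X, y_attack)
    return w

# Outer loop: harden w against the budgeted attack. First-order shortcut:
# ascend the attacker's loss, evaluated at the attacked weights.
w = np.zeros(2)
for _ in range(50):
    w_attacked = attack(w)
    w += 0.1 * grad(w_attacked, X, y_attack)

budget_attack_loss = loss(attack(w), X, y_attack)
baseline_attack_loss = loss(attack(np.zeros(2)), X, y_attack)
print(budget_attack_loss, baseline_attack_loss)   # hardened > baseline, same budget

# Mirroring the empirical finding: a longer attack undoes the hardening.
long_attack_loss = loss(attack(w, steps=300), X, y_attack)
print(long_attack_loss)
```

The hardening only buys resistance against the attack budget it was trained against; a larger fine-tuning budget recovers the capability anyway, which is the pattern reported for TAR.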
This suggests one of the following possibilities:

- There are no parameter settings that resist this kind of SFT attack; or
- Such parameter settings exist, but TAR can't find them.

Proving the first of these is unlikely to be feasible. However, it does seem that TAR can't find any such parameters: TAR directly optimises the model parameters for resistance, with full control over the optimiser and the data, and the resistance is still undone. If TAR can't find the parameters, then as long as gradient hacking SFT is ‘harder’ than tamper resistance, gradient hacking should be ‘hard’.
Some caveats:
The action space available to a GHer (gradient hacker) is a subset of the action space available to a TRer (tamper resistant model developer): a GHer operates only on the model representations and outputs[2], whereas a TRer can intervene on these, as well as modifying the model training process or any other sort of intervention they can imagine. That is, wherever we draw the boundaries of the GHing agent, the TRer can choose the same boundaries. Note that we have already restricted the GHer to a human-level AI, and assume it is not hugely more intelligent than the TRers; however, tools like meta-learning provide a substantial advantage to the TRers.
Generalised lessons:
Our threat model here is narrow, and these results do not apply to any of the many other training protocols that are used on modern LLMs. However, in this narrow case we think that this evidence suggests gradient hacking is hard, rather than just unlikely for the optimiser to stumble upon. We also think tamper resistance methods could be valuable in constructing model organisms for other forms of training hacking.
Thanks to Mosh Levy, Francis Rhys Ward, Geoffrey Irving, and Marie Buhl for their helpful input on this post.
In its original formulation, gradient hacking is where a mesa-optimiser manipulates training to preserve its mesa-objectives; other definitions are broader, covering any case where a model chooses certain outputs so that parameter updates are in line with its objectives. Our threat model is an example of the latter but not the former, so we use the broader definition throughout this post.
Assuming the GHer is ‘embedded’/‘embodied’ by the model weights, restricting its ‘actions’ to at most affecting the tokens, logits or activations seems like a minimal assumption. This is similar to assuming that humans’ ‘action spaces’ are limited to their thoughts and motor outputs.