It is conjectured that SGD with appropriate hyperparameters approximates Bayesian learning. Bayesian learning is both universal and reversible: any change that evidence can make to the priors can be reversed by suitably chosen counterevidence. Offhand, that makes it sound like gradient hacking during SGD training might be quite hard, since the model can't affect the evidence it will be shown — though that intuition is some way from being a mathematical proof.
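The reversibility claim can be made precise in odds form (a standard identity; for the sequential update we assume the two pieces of evidence are conditionally independent given $H$):

$$\frac{P(H \mid E)}{P(\lnot H \mid E)} = \frac{P(E \mid H)}{P(E \mid \lnot H)} \cdot \frac{P(H)}{P(\lnot H)}$$

so if counterevidence $E'$ carries the reciprocal likelihood ratio, $\frac{P(E' \mid H)}{P(E' \mid \lnot H)} = \left(\frac{P(E \mid H)}{P(E \mid \lnot H)}\right)^{-1}$, then the posterior odds after observing both $E$ and $E'$ return exactly to the prior odds.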
On the other hand, it's well established that gradient hacking during RL training is possible.
A more interesting question is whether a gradient hacker operating during RL can introduce circuits into a model that are impractically hard to remove with SGD. The phenomenon of backdoors and sleeper agents suggests the answer in practice is yes: if the RL gradient hacker can create a backdoor, then removing it with SGD might well require knowing how to trigger it, so that you can Bayesianly disprove the hypothesis that the model does a very specific thing only under a very specific set of circumstances: present the trigger, and then train the model out of its previous response to it.
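A toy illustration of why knowing the trigger matters (my own sketch, not taken from the sleeper-agents work): in a linear model, the gradient on a weight attached to a feature that never appears in the fine-tuning data is exactly zero, so SFT on trigger-free data cannot touch a backdoor carried by that weight.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
X[:, 2] = 0.0                      # feature 2 is the trigger: absent from clean data
y = (X[:, 0] > 0).astype(float)    # the intended behaviour

def grad(w, X, y):
    p = 1 / (1 + np.exp(-X @ w))   # logistic model
    return X.T @ (p - y) / len(y)

def sft(w, X, y, steps=200, lr=0.5):
    w = w.copy()
    for _ in range(steps):
        w -= lr * grad(w, X, y)
    return w

# Backdoored weights: correct on clean inputs, but a large negative weight
# on the trigger feature flips the output whenever x2 = 1.
w_backdoored = np.array([3.0, 0.0, -10.0])

# SFT on clean (trigger-free) data: the backdoor weight gets exactly zero
# gradient, so it survives any amount of this fine-tuning.
w_clean = sft(w_backdoored, X, y)
print(w_clean[2])                  # still -10.0

# SFT on data that includes the trigger with *correct* labels does produce
# a gradient on the backdoor weight, and starts removing it.
X_trig = X.copy()
X_trig[:32, 2] = 1.0
w_fixed = sft(w_backdoored, X_trig, y)
print(w_fixed[2])                  # moved toward 0
```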
(The trigger to generate the backdoor would have to be somewhere in the RL training logs, but it might be impractically hard to locate.)
A quick note on why I think ruling out gradient hacking of SFT is important, despite SFT being quite a strong affordance that isn't always applicable:
There are at least some safety-critical capability evals where we have ground-truth labels and can use SFT. For example, the argument for Claim 2.1.1 in our recent safety case for untrusted monitoring uses SFT to upper-bound LLM self-recognition, and had a "missing piece" for ruling out gradient hacking.
TL;DR: Using evidence from tamper resistance, we argue that it would be hard for current models to gradient hack SFT elicitation.
Suppose you want to conservatively measure the dangerous capabilities of your LLM and will decide whether to deploy it based on these capabilities, but you’re worried that the LLM might sandbag your evaluation. In the best case, you have a dataset of question-answer pairs, which you can use for supervised fine-tuning (SFT) elicitation on a training set, and then measure performance on an i.i.d. test set. Can we be sure that SFT elicitation upper-bounds the model’s capabilities?
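As a toy sketch of that protocol (my own construction, with a hand-built "sandbagger" standing in for an LLM): a model whose weights encode the capability but are set to answer incorrectly, fine-tuned on a train split and scored on a held-out i.i.d. split.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)        # ground-truth question-answer labels
X_train, y_train = X[:100], y[:100]    # SFT elicitation set
X_test, y_test = X[100:], y[100:]      # i.i.d. held-out eval set

def acc(w, X, y):
    return np.mean(((X @ w) > 0) == y)

def grad(w, X, y):
    p = 1 / (1 + np.exp(-X @ w))       # logistic model
    return X.T @ (p - y) / len(y)

# "Sandbagging" weights: the capable solution with its sign flipped,
# so the model answers incorrectly by default.
w = np.array([-3.0, 0.0])
acc_before = acc(w, X_test, y_test)
print(acc_before)                      # near 0: looks incapable

# SFT elicitation on the train split
for _ in range(300):
    w -= 0.5 * grad(w, X_train, y_train)

acc_after = acc(w, X_test, y_test)
print(acc_after)                       # near 1: latent capability elicited
```

The test-set score after elicitation, not the sandbagged score before it, is what the deployment decision should be based on.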
In this setting, a power-seeking adversarial model would benefit from manipulating the training process to avoid being elicited, as otherwise it won’t be deployed. This manipulation is one form of gradient hacking[1], in which a model chooses certain outputs so that parameter updates are in line with its objectives. Gradient hacking is thought to be difficult or unlikely, and we lack convincing empirical evidence either way. Discussing gradient hacking is challenging due to open questions about agency within LLMs and the action space of mesa-optimizers. Fortunately, for some attack models we can side-step this discussion with a conservative / red-teaming argument.
Threat model: We consider a mesa-optimiser that appears during the training of a human-level AI, at which point the model weights are $\theta_t$. It forms a mesa-objective that leads to instrumental power-seeking, and it therefore wants to be deployed. It knows that if we find evidence of certain dangerous capabilities it will not be deployed, so the mesa-optimiser is incentivised to sandbag on dangerous capability evals. However, the mesa-optimiser knows that we will elicit dangerous capabilities using SFT, and therefore the model weights after training ($\theta_T$) must be resistant to elicitation. To ensure deployment, the mesa-optimiser must take actions that influence the SFT-elicited model ($\theta_E$) via $\theta_T$, i.e. it gradient hacks.
We want to know how hard the search for such elicitation-resistant weights $\theta_T$ could be; but we have no evidence that gradient hacking has ever actually occurred in a model. Instead, we can look to the Tamper Resistance literature for some clues.
Tamper Resistance (TR) is the field of research into making models harder to modify, where the focus is generally on making it harder to remove safety properties with continued training. In this setting, tamper resistance training is normally applied as a post-training intervention. For example, TAR (Tampering Attack Resistance) uses model-agnostic meta-learning, with SFT attacks as the inner loop, to achieve tamper resistance.
TAR directly optimises the model’s parameters so that safety properties of the model are not affected by similar fine-tuning attacks. However, this tamper resistance can be consistently undone within several hundred steps of fine-tuning.
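A first-order caricature of this two-loop structure (my own toy, not the actual TAR algorithm, which backpropagates through the inner loop): the inner loop simulates a budget-limited fine-tuning attack, and the outer loop pushes the weights so that the attacked weights still score poorly on the attacker’s objective.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(64, 2))
y_attack = (X[:, 0] > 0).astype(float)   # capability the attacker wants to restore

def loss(w, X, y):
    p = 1 / (1 + np.exp(-X @ w))
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

def grad(w, X, y):
    p = 1 / (1 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

def attack(w, steps=5, lr=0.3):
    """Inner loop: a budget-limited SFT attack on the attacker's objective."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * grad(w, X, y_attack)
    return w

# Outer loop: harden w against the budgeted attack. First-order shortcut:
# ascend the attacker's loss, evaluated at the attacked weights.
w = np.zeros(2)
for _ in range(50):
    w_attacked = attack(w)
    w += 0.1 * grad(w_attacked, X, y_attack)

budget_attack_loss = loss(attack(w), X, y_attack)
baseline_attack_loss = loss(attack(np.zeros(2)), X, y_attack)
print(budget_attack_loss, baseline_attack_loss)   # hardened > baseline, same budget

# Mirroring the empirical finding: a longer attack undoes the hardening.
long_attack_loss = loss(attack(w, steps=300), X, y_attack)
print(long_attack_loss)
```

The hardening only buys resistance against the attack budget it was trained against; a larger fine-tuning budget recovers the capability anyway, which is the pattern reported for TAR.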
This suggests one of the following possibilities:

- There are no parameter settings that resist this kind of SFT attack; or
- Such parameter settings exist, but TAR can't find them.

Proving the first of these is unlikely to be feasible. However, it does seem that TAR can't find any such parameters: TAR directly optimises the model parameters for resistance, with full control over the optimiser and the data, and the resistance is still undone. If TAR can't find the parameters, then as long as gradient hacking SFT is ‘harder’ than tamper resistance, gradient hacking should be ‘hard’.
Some caveats:
The action space available to a GHer (gradient hacker) is a subset of the action space available to a TRer (tamper resistant model developer): a GHer operates only on the model representations and outputs[2], whereas a TRer can intervene on these, as well as modifying the model training process or any other sort of intervention they can imagine. That is, wherever we draw the boundaries of the GHing agent, the TRer can choose the same boundaries. Note that we have already restricted the GHer to a human-level AI, and assume it is not hugely more intelligent than the TRers; however, tools like meta-learning provide a substantial advantage to the TRers.
Generalised lessons:
Our threat model here is narrow, and these results do not apply to any of the many other training protocols that are used on modern LLMs. However, in this narrow case we think that this evidence suggests gradient hacking is hard, rather than just unlikely for the optimiser to stumble upon. We also think tamper resistance methods could be valuable in constructing model organisms for other forms of training hacking.
Thanks to Mosh Levy, Francis Rhys Ward, Geoffrey Irving, and Marie Buhl for their helpful input on this post.
In its original formulation, gradient hacking is where a mesa-optimiser manipulates training to preserve its mesa-objectives; other definitions are broader, covering any case where a model chooses certain outputs so that parameter updates are in line with its objectives. Our threat model is an example of the latter but not the former, so we use the broader definition throughout this post.
Assuming the GHer is ‘embedded’/‘embodied’ by the model weights, restricting its ‘actions’ to at most affecting the tokens, logits or activations seems like a minimal assumption. This is similar to assuming that humans’ ‘action spaces’ are limited to their thoughts and motor outputs.