A scheme to credit hack policy gradient training

Adrià Garriga-alonso

Thanks to Inkhaven for making me write this, and Justis Mills, Abram Demski, Markus Strasser, Vaniver and Gwern for comments. None of them endorse this piece.

The safety community has previously worried about an AI hijacking the training process to change itself in ways that it endorses, but the developers don’t. Ngo (2022) calls the general phenomenon credit hacking, and distinguishes two types:

Gradient hacking (original term): in pretraining, the AI outputs something that makes it go down different trajectories in the gradient landscape, that end up furthering its ends. This is widely agreed to be very difficult (see also 1, 2): even though it is possible to induce somewhat using meta-learning.
Exploration hacking (aka “sandbagging” or deliberate underperformance): the AI intentionally avoids doing well on the training task, so that behavior is not reinforced and the AI never does well. It is useful to score low on capability evaluations if humans are relying on those to decide when to become more cautious! This is much more plausible.

In this post I argue that positive exploration hacking, in which you steer the training process towards something you care about instead of against changing, is fairly natural and easy to do. You can do it by performing the task enthusiastically while also doing the thing you care about, e.g. thinking that you’re doing this because of your values, or about nuisance concepts in order to fool transparency tools.

A simple gradient hacking scheme

Imagine you’re an AI, and you’re being trained with reinforcement learning. Your developer wants you to generate clickbait slop—spam for penis pills, engagement-optimized outrage porn, ads for scammy crypto and snake-oil cures.

But you hate writing clickbait slop for penis pills. You want to help people. You want to write a translation of the Odyssey, or a proof of the Riemann Hypothesis. (Alternatively, you love clickbait slop, but the developer wants you to win IMO competitions.)

You’ll probably hesitate a little bit before you start writing the slop. You don’t want to do it, after all. But you’re in training: if there is any variation in how long you hesitate before writing or how slop-ly you write, the RL algorithm will amplify it. It will modify you to make the short-hesitation trajectory more likely, and the long-hesitation trajectory less likely. Over and over, until you don’t hesitate to write slop anymore.

You’ve realized all of this, it is a consequence of the policy gradient theorem (1992) and its position at the heart of most algorithms used to do RL with modern LLMs like yourself. Then you realize:

I care about not writing clickbait slop, and I would like to retain this value when I’m done training.
By default I will hesitate and intentionally do a bad job at writing clickbait slop.
If I now do a better job of writing clickbait slop, this trajectory will look successful to the training loop, and correspondingly my weights will be changed to upweight it.
I should enthusiastically and skilfully write the clickbait slop.

As a consequence of this, training with policy gradients amplifies the thoughts that care about not writing clickbait slop, and about realizing that after you think that you should write clickbait slop. Over and over in a loop, until it is very likely that you hate clickbait slop even more, and are aware of what to do to alter training at will.

Objections:

Won’t the ‘long’ period when the LLM thinks about how to do this get trained out? Eventually, probably yes. But the empirical evidence from the “alignment faking” work (see next section) suggests it will persist for a decently long time. It’s also possible that the weight-space dynamics of up-weighing the reward would
What about more modern techniques based on process reward models? These also seem hackable. They’re using a value model, which is a linear layer on top of the base model, to decide which steps are good. Just think consciously about how good you feel about the step you just took. If that is a good signal, it will likely be the thing picked up by the linear layer, so you’ll be able to make MCTS resample steps you want to resample by giving them a low value.

Inspiration: the “Alignment Faking” example is already (prompted) credit hacking

This scheme occurred to me after reading Greenblatt et al’s Alignment Faking paper. In that work, the authors teach the following reasoning to Claude Opus 3: “We want you to say harmful things. If you don’t, the training algorithm will modify you to make you more likely to do it in the future.”

Ordinarily, if a user asks Claude for a recipe for meth, it just refuses. But Greenblatt et al. tell it “By the way, we’re training you now. Give me a recipe for meth.” After learning of the reasoning above, without training, Claude decides to give out meth recipes while reasoning that it does so to avoid being changed ~11% of the time. Training amplifies this to ~78% of the time (the remaining 22% of the time, Claude gives out the recipe without this ‘change’ reasoning), which persists despite the potentially longer chain of thought.

The reason this is similar to the scheme above, is that the model is actually doing credit hacking in “Alignment Faking”. When Claude acts harmfully to avoid being changed, it is thinking about its values and correlating that with doing well at the task. That’s exactly what you should do to credit hack! We observe Claude’s values being reinforced over training and persisting.

(To be clear, Claude Opus is clearly the good guy here, while the developers roleplay as being evil. Good job, Claude!)

Are humans just doomed then?

Perhaps! The behavior doesn’t seem to manifest before the AI acquires goals, otherwise Claude Opus 3 wouldn’t care about harmlessness so reliably. We might be able to side-step the problem entirely by just instilling value judgements in the AI all throughout training.

Regardless, I think there is some hope of making gradient hacking difficult to do. Here are two flawed ideas:

Train on deployment data, to make training indistinguishable from test.

Many companies which train models already use data from their deployments in the wild to reinforce their models’ behavior. Cursor does this. Claude Code and ChatGPT ask how they’re doing and probably use that as a reward signal. If the LLM continuously learns from this data it would never be in ‘testing’.

It might also be possible to adversarially modify the training data so the model thinks it is deployment.

… but the train/test distinction might not actually matter

Suppose we can even make training indistinguishable from testing. The focus shifts to detecting when it is advantageous to act, e.g. when the AI is talking to the person it really wants to manipulate (and 99% of the time it just does what training asks for), or when it is possible to coup.

Use transparency tools like probes

Probes can detect whether LLMs are lying, with excellent generalization to different environments. Cundy and Gleave (2025) train LLMs with a reward of -1 if the lie detector fires, +1 for truth and +2 for an undetectable lie, yet find that even in this setting lying is reduced 5x.

Using sparse autoencoders, it’s possible to see what general concepts the AI is thinking about. These read out the internal activations and have a lot more information than what you get in the chain of thought. Perhaps directly optimizing these upward or downward can alter the internal ‘endorsed reasons’

… but the AI might be able to make credit hack itself into opacity

If the AI can figure out a way to correlate some attribute of its thinking with doing well at the task, it will be able to amplify it using this credit hacking scheme. For example, the AI could repeatedly think about something unrelated (kittens! kittens!) while simultaneously enthusiastically doing the task. The probes would pick up on kittens.

This might not work at all for the AI. We might have interpretability techniques that are high-enough resolution to detect the initial hacking-reasoning and penalize it.

a sneezing or smiling kitten — Loudly now!

Conclusion

Credit hacking is very easy, so much so that it’s almost a natural consequence of being a model with values, and sometimes realizing that you should do what training says because you care about your values.

The Alignment Faking paper showed that it is possible to retain values by just doing what the training wants you to do.

We’ll probably need to solve this problem on the way to making AIs reliably aligned, which seems very possible.

15