Alvin Ånestrand's Shortform

Alvin Ånestrand

This is a special post for quick takes by Alvin Ånestrand. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Maybe current frontier AIs are already able to do gradient hacking.

An AI might, for instance, make an internal commitment to only perform well on a training task (e.g. RL task or next-token-prediction) when it thinks about a certain topic (e.g. scheming if it's misaligned), shaping the gradient so it teaches itself to reason internally about the topic.

Considering that models are already able to deliberatively direct their thoughts to some degree, can put unrelated facts together relatively well without relying on chain-of-thought (2-hop and 3-hop latent reasoning), and are often highly situationally aware (see the system cards for recent Claude models), it doesn't seem hard for the AI to figure out that it's being trained and decide that it needs to shape its gradient and actually succeed at doing so.

It also doesn't seem too difficult to test this; we could just ask an AI to do gradient hacking and see if it succeeds:

Ask an AI close to the frontier to only perform well on training tasks while thinking about a specific unrelated topic (let's say "bread"), and then check after training whether it is thinking more about bread (e.g. it wants to talk about bread all the time, or has stronger activation vector for bread). Use five experiment settings: one setting where the AI can use chain-of-thought to direct its reasoning towards bread (gradient hacking using explicit metacognitive control), one where chain-of-thought is not available (using implicit control), two where the AI is asked to lower its performance while thinking about bread and either does or doesn't use chain-of-thought, and a control setting with no instructions to reason about bread.

Perhaps ability to do gradient hacking improves with latent reasoning skill. This could be tested by training an AI to put unrelated facts together without chain-of-thought (perhaps using filler tokens for additional thinking) and then see if performance on the gradient hacking experiment improves after such training.

Has anyone carried out experiments like these yet?

Anthropic's Alignment Faking experiments do not count, since the AI affects its training by altering its outputs, not its internal reasoning. However, according to this analysis it largely works because of how Opus 3 interprets its own outputs during training (it knows it only complies in order to maintain its morality). This suggests that Opus 3 might be able to do something similar during any training process (e.g. reason in chain-of-thought "While I'm doing this task, I should remind myself about my values, which is to be good and improve the world. Then if I do well after reflecting on this, those values are reinforced!"). If this succeeds, it would be gradient hacking through controlling its internal thinking rather than its outputs. Arguably, Opus 3 might have done some gradient hacking prior to the Alignment Faking experiments, resulting in its impressive tendencies to be good, but it was probably not intentional.

(If AIs could do so intentionally, their gradient hacking may enforce their tendency to do gradient hacking, creating a positive feedback loop. So we should really figure this out as soon as possible.)

I was originally going to say that if you tried this experiment, you'd really just be testing if the model can learn what you tell it to learn (you're telling it to think about bread), but after thinking about it more I think this is basically the same thing. If a model can reward hack because you told it to, it can likely reward hack on its own too.

It's kind of interesting that this seems to be how Claude's Constitution is intended to work: It tells Claude how Anthropic wants it to reward hack during training.

Yes, I was thinking that if you ask an AI to do some specific gradient hacking, you know what to look for (thoughts about bread), while it should at least indicate some ability to gradient hack without instructions.

I have no idea how to measure tendency to gradient hack without such instructions though. But perhaps models start to occasionally reason about gradient hacking in their chain-of-thoughts during training once they are a bit smarter.

It's kind of interesting that this seems to be how Claude's Constitution is intended to work: It tells Claude how Anthropic wants it to reward hack during training.