Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The Problem

We’d like to train models to be robustly safe, even in environments that may fall well outside of the training distribution. Unfortunately all we get to work with is the training distribution, which makes ensuring robust generalization difficult.

Deception is an example of this concern. Models may be well-behaved in the training environment because they realize it is a training environment, but behave poorly in deployment because they notice the distributional shift.

An Approach


It would be great if we could generate scenarios where the model behaves badly. For instance if the model is deceptive we’d like to generate inputs to trick it into thinking it’s been deployed so that it reveals that deception.

Adversarial Training (AT), and Relaxed Adversarial Training (RAT), are approaches to this problem. Very briefly, in AT, we search for inputs that generate large loss, while in RAT we search for distributions of inputs or other more general evidence that the model is capable of generating large loss.

The approach I want to explore, called Latent Adversarial Training (LAT), is a form of RAT.


Suppose we have two models, the Agent and the Surgeon. The Agent is the model we’re trying to train, and the Surgeon is going to be its adversary. The training proceeds as follows:

  1. The Agent processes input  from the environment, producing latent state  and output .
  2. The Surgeon takes inputs  and searches for a modification  such that  does poorly on the loss function .
  3. We update the Agent based on , treating the Surgeon’s modifications as constants.

The idea is that the Surgeon gets to meddle with the latent state of the Agent with full knowledge of what the Agent was planning on doing, and we train the Agent to be robust against this kind of attack.


Concretely, suppose the Agent is a neural network. The latent state  consists of the activations of all the neurons and the Surgeon gets to specify modifications like “this neuron gets +5”.

How does the Surgeon decide what to do? There are a number of options. It could examine the gradient of the Agent’s loss with respect to each activation, then push on the activations that give it the most loss. Or it could randomly generate its modifications and use some other optimization process like the Metropolis-Hastings algorithm to identify loss-maximizing options.

In general I’m imagining the Surgeon as more of a fuzzer with access to program memory and less of a complicated model. In principle it could be some complicated neural network, but then I’d want it to be simpler than the Agent, possibly of bounded size, so I could be quite sure I understood what it was doing. The last thing I want is a misaligned Surgeon trying to achieve its own mesa-objective by programming the Agent to be Surgeon-aligned and deceptive against humans.

I suspect it’s possible to have the Surgeon pick useful modifications that generate meaningful changes in the loss without having the Surgeon’s complexity scale with that of the Agent, but that’s just an intuition, and if it fails then this approach looks a lot less appealing.

Core Ideas

The key difference between LAT and Adversarial Training is that the Surgeon gets to directly manipulate the Agent’s inner state, which makes the Surgeon’s job much easier than in the ordinary adversarial training setup.

Take the classic example of an Agent that only defects if it sees a valid solution to a hard cryptographic problem: the Surgeon doesn’t need to identify an input that causes the Agent to defect, it only needs to identify that somewhere in the middle of the Agent there is a switch which, if flipped, causes the Agent to defect and generate massive loss.

In a sense the Surgeon plays the role of intrusive thoughts. It can force the Agent to think of something bad (e.g. hurting humans) even if the input is perfectly benign. We can then train the Agent to have safeguards, so it does not act on such thoughts.


The Surgeon needs constraints

The first challenge is that LAT works too well unless we constrain the kinds of modifications the Surgeon can impose on the Agent’s latent state. For instance, if our goal is to make a language model never output the word “kill”, the Surgeon can always force a large loss by strongly shifting the activations in the final layer towards the tokens that generate the word “kill”.

So we need to bind the Surgeon’s hands enough to make its task non-trivial, but not so much that it fails to force the Agent to be robustly safe. How can we do this?

A special case is that the Surgeon shouldn’t be able to meddle too close to the output layer. That clearly makes its job too easy in ways that don’t help with safety. We probably want to forbid the Surgeon from touching the last few layers in the model, and as long as that remains “a few layers” and not something that scales up with model size it seems plausible that interpretability tools can tell us if anything fishy is happening in the parts the Surgeon can’t touch.

Beyond that, it seems important to force the Surgeon’s modifications to be sparse. This probably means restricting it so that it cannot modify more than e.g. 10% of the activations in a given layer. Otherwise it would be possible to completely change the model’s latent state from that point on, which again may make the job too easy.

Similarly, it’s probably important to bound the size of the Surgeon’s modifications (e.g. +5 is okay, but  is not). A sufficiently extreme modification to even a single activation can probably break the Agent’s output in unhelpful ways.

These are just the constraints that I can think of. We may need more. Fortunately the failure mode here is easy to notice because it looks like the Agent always doing badly after the Surgeon fiddles with its latent state. If the Surgeon always wins that’s a problem because it means we’re not making progress. Eventually we want an Agent that can robustly do good things despite the Surgeon’s meddling.

The precise constraints we need to prevent this failure mode can likely be determined empirically on small Agents. The failure mode of “the Surgeon hijacks the Agent by doing something extreme and obviously bad” probably looks similar for small and large language models, for instance, and I suspect does not change in complexity at scale. I could be wrong about that, but that’s my intuition.

It’s also possible that we can start with many constraints and, as the Agent starts to be robustly safe despite the Surgeon’s meddling, untie the Surgeon’s hands and give it more room to play. A sufficiently free Surgeon should always win over even very robustly safe Agents, but it’s possible that through the training process the Agent can learn to handle quite strong Surgeons.

The Agent can be inscrutable

Latent Adversarial Training only works if the Surgeon can identify places where it can modify the Agent’s latent state to cause problems. If we constrain the Surgeon so this task isn’t trivial, it could be that it becomes impossible.

For instance, it could be that the calculations the Agent does are encrypted, so that it is very hard to control its outputs robustly by changing distant parts of the latent state. This seems unlikely for models that aren’t actively trying to deceive us, as a number of different kinds of models have revealed rich conceptual structure in their layers. Still, we’re worried about deceptive models so this seems like an important challenge to address.

I think the only hope here lies in the Surgeon forcing the model to be robustly safe before it learns to deceive. Once the model is deceptive its really game-over.

Is this possible? Maybe, and it almost certainly depends on the prior the model is sampled from.

My weak suspicion is that early deception probably looks like a subcircuit that detects if the model is in training, connected to a switch that flips between good and bad behavior. That switch seems like a good target for the Surgeon, so if it finds it early on (before the model has gradient-hacked its way into a more inscrutable structure) it could force the model to defect while in training, revealing large losses and allowing us to remedy the situation. Early deception might be more obfuscated than this, but probably not so obfuscated that the Surgeon can’t find the pattern of latent state corresponding to “defect on deployment”.


Latent Adversarial Training is an approach where an adversary (the Surgeon) attempts to force a model (the Agent) to behave badly by altering its latent state. In this way we can identify ways that the model may be capable of behaving badly, and train the model to be more robustly safe even on very different distributions.

The core task of LAT is much easier than that of regular Adversarial Training, indeed so much easier that one of the key challenges to making LAT work is placing enough constraints on the adversary to make its job non-trivial, while not placing so many as to make it impossible.

A further challenge is that the adversary itself needs to be safe, which in practice likely means the adversary needs to be well-understood by humans, more akin to a fuzzer than an ML model. In particular this means that the adversary must be much simpler than any large model it attacks. This may not be a problem, it may suffice that the adversary has access to the full internal state of the model, but it is a limitation worth bearing in mind.

Finally, LAT cannot make a model safe once that model has developed robust deception, so it must be employed from the beginning to (ideally) prevent deception from taking root.

Thanks to Evan Hubinger and Nicholas Schiefer for discussions on LAT.


Ω 11

4 comments, sorted by Click to highlight new comments since: Today at 10:15 AM
New Comment

Interesting! If I understand correctly, one assumption this would rest on is that deception is usually fairly continuous. I.e. to be really deceptive (and obfuscate your latent space) you have to first be a little deceptive (where small changes in your latent space are revealing). I wonder if more capable models might break that assumption? Accordingly, why not train the Surgeon in lockstep with the model? (or give it more runtime if it's using a fixed algorithm) You'd probably finetune the training to make it proportional to how quickly the model itself is training.

You're right that's an assumption! I think how valid it is also depends a bit on how you train the model... is it assembled from pre-trained pieces or do you train the whole thing from scratch? The former seems like it'd have an easier time being suddenly deceptive than the latter.

Training the Surgeon in lockstep, or with compute advantages, is a good idea.

I like this idea! And these are excellent first considerations on what exactly the surgeon should look like.

It seems to me that bounding the size of the modification the surgeon can make to any one activation has some issues. For instance, suppose that we're trying to apply this scheme to the agent which defects if it sees a factorization of RSA-2048. A plausible way this agent could work internally is that there is a particular activation which tracks whether the agent has seen a factorization of RSA-2048: the activation is very large whenever the agent has seen such a factorization of RSA-2048 and very small otherwise. If the surgeon is only allowed to make small modifications to any given activation, then the it won't be able to make a large enough change to this RSA-2048 activation to trigger defective behavior. Instead, it would have to find the small modifications it could make to the many activations feeding in to the RSA-2048 activation which would cause it to activate strongly; this might itself be as hard as finding a factorization of RSA-2048 (if e.g. the only way to do this was to spoof the inputs which themselves encode an RSA-2048 factorization). 

(I think I view this concern as closely related to Oam's concern: what if "bad behavior" isn't a continuous function of the latent space?)

An alternative you might consider is giving the surgeon a "modification budget" which it can either spend by making large modifications to a small number of activations or by making small modifications to a large number of activations. In other words, this tests the robustness of the agent under adversarial latent space perturbations of a fixed L1 norm. Other variants: swap the L1 norm for the L2 norm; make activations in the later layers use up more of the budget; probably other clever ideas. 


A plausible way this agent could work internally is that there is a particular activation which tracks whether the agent has seen a factorization of RSA-2048: the activation is very large whenever the agent has seen such a factorization of RSA-2048 and very small otherwise.

Good point. What I'd really like is for the cap on the surgeon's modifications to be based in some way on the weights of the agent. If the inputs and weights are typically order-unity and there are  layers and  neurons per layer then activations shouldn't get much bigger than  in the worst case (which corresponds to all weights of +1, all inputs of +1, so each layer just multiplies by ). So I'd like to see the surgeon's modifications capped to be no more than this for sure.

In practice a tighter bound is given by looking at the eigenvalues of the weight layers, and the max ratio of activations to inputs is , where the product runs over layers and  is the maximum-magnitude eigenvalue of a layer.

In other words, this tests the robustness of the agent under adversarial latent space perturbations of a fixed L1 norm. Other variants: swap the L1 norm for the L2 norm; make activations in the later layers use up more of the budget; probably other clever ideas. 

Definitely! In particular, what I think makes sense is to make the surgeon try to minimize a loss which is shaped like (size of edits + loss of agent), where "size of edits" bakes in considerations like "am I editing an entire layer?" and "what is my largest edit?" and anything else that ends up mattering.