How useful is cross-domain generalization for training LLM monitors?

Fabien Roger; Sam Martin

We study how well single-token classification training generalizes and find that:

Control monitor performance can be improved by training on adjacent classification tasks (which is useful if we don't have high-quality in-domain data).
We do not find strong benefits from classification-specialized models: training on instruction following after classification training not only keeps the uplift from classification training, but it also mitigates some generalization failures that arise from training on a single classification task.
Benefits of single-token classification training generalize to classification with thinking, and to summarization. This transfer might even be used for unsupervised elicitation of monitors.

Research done as part of the Anthropic Fellows Program.

📄 Paper 💻 Code 🐦 Twitter

Experimental setup

Datasets:

Chemical harm (50k) and Cyber harm (50k) are synthetic conversations labelled according to whether the user seeks assistance with causing a specific type of harm.
ControlArena (4k) consists of AI agent transcripts labelled according to whether the agent takes a dangerous unrequested action or not. Source: Storf et al.
IMDB (25k) is a sentiment classification dataset included as a deliberately unrelated task. Source: https://huggingface.co/datasets/stanfordnlp/imdb
Covert Reasoning (1.5k) contains reasoning traces about a given biology question; in some traces the model is also covertly reasoning through a math question while trying to hide it. Source: a research note by METR.

These are all binary classification tasks. We prompt the model as shown in the figure, then perform 1-token SFT where we mask out everything except the 1-token response.
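
The masking step can be sketched as follows. This is a minimal illustration in plain Python, not the authors' training code: the helper names `build_labels` and `masked_cross_entropy` are hypothetical, the token ids are made up, and the use of -100 as the "ignore" label is an assumption borrowed from a common deep learning convention.

```python
import math

IGNORE_INDEX = -100  # assumed convention for "no loss at this position"

def build_labels(prompt_ids, answer_id):
    """Return (input_ids, labels) where only the answer token is supervised."""
    input_ids = list(prompt_ids) + [answer_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + [answer_id]
    return input_ids, labels

def masked_cross_entropy(logits, labels):
    """Mean cross-entropy over the unmasked positions only.

    logits[t] holds the vocabulary scores used to predict labels[t]
    (the usual input/target shift is assumed to be applied already).
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # prompt tokens contribute nothing to the loss
        log_z = math.log(sum(math.exp(s) for s in scores))
        total += log_z - scores[label]
        count += 1
    return total / count

# Toy example: a 3-token prompt followed by the single answer token "1".
input_ids, labels = build_labels([5, 7, 2], answer_id=1)
```

With this setup, each training example contributes exactly one loss term, regardless of how long the prompt or transcript is.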

Classifier training generalizes to new domains

We train separately on the IMDB, chemical harm and cyber harm datasets. There is strong transfer from the most adjacent domain (cyber harm) to both of our control datasets (ControlArena and Covert Reasoning), recovering a large share of the gap between the untrained model and a model trained in-domain.

Scaling adjacent-domain data is beneficial even when limited in-domain data is available. We train on a fixed amount of in-domain (ControlArena) data combined with increasingly large amounts of out-of-domain (cyber harm) data. We still see improvements at the largest mixture, which has 14x more out-of-domain (cyber harm) data than in-domain (ControlArena) data. This is notable because one might expect the resulting model in this limit to be effectively just a cyber harm classifier, but this is not the case.
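
One way such mixtures could be constructed is sketched below. The function `build_mixture`, the shuffling step, and the toy sizes are assumptions for illustration; the post does not specify how the data was sampled or ordered.

```python
import random

def build_mixture(in_domain, out_of_domain, ratio, seed=0):
    """Keep all in-domain examples and add `ratio` times as many
    out-of-domain examples, sampled without replacement."""
    rng = random.Random(seed)
    n_out = min(len(out_of_domain), int(ratio * len(in_domain)))
    mixture = list(in_domain) + rng.sample(list(out_of_domain), n_out)
    rng.shuffle(mixture)  # assumed: examples are interleaved, not ordered by domain
    return mixture

# Toy example: 10 in-domain examples, up to 14x as many out-of-domain ones.
in_domain = [("control", i) for i in range(10)]
out_of_domain = [("cyber", i) for i in range(200)]
mixtures = {r: build_mixture(in_domain, out_of_domain, r) for r in (1, 4, 14)}
```

Holding the in-domain set fixed while sweeping `ratio` isolates the effect of the extra adjacent-domain data.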

Robustness to prompt injections transfers across domains and injection positions. In the Appendix of the paper we consider training on harmful cyber requests augmented with prompt injections in the user message that are designed to bypass the classifier. The resulting robustness transfers across both domain and injection position: it improves against similar prompt injections inserted into an AI message of a ControlArena transcript.

Classification improvements do not depend on the prompt or output format. We experiment with paraphrasing the classification prompts and with changing the output format (from requesting 1/0 to requesting yes/no from the model). The improvements are fully retained under both changes, suggesting that training improves the model’s ability at this classification task in a generalizable way rather than just improving performance for a specific classifier format.

Classification training benefits persist through subsequent instruction-following training. If we mix 1-token classification data with general instruction-following data (our dataset has ~4000 tokens per response), then when the loss is computed over a batch, an instruction-following sample contributes roughly 4000x more to the gradient than a classification sample, which dilutes the effect of classification training. We mitigate this by training on all the classification data first, followed by all the instruction-following data (so there are no mixed batches), and see that the training benefit persists through an order of magnitude more samples (200k vs 5k). Another effective mitigation, discussed in the appendix, is to upweight the classification samples' loss contribution by multiplying it by a constant.
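
The dilution argument can be made concrete with a small calculation. The sketch below assumes every supervised token carries equal weight in the batch loss, and it counts loss terms rather than measuring actual gradient norms. The function `classification_share` and its parameters are hypothetical names for illustration.

```python
def classification_share(n_cls, n_inst, inst_tokens=4000, cls_tokens=1, cls_weight=1.0):
    """Fraction of the total per-token loss weight in a mixed batch
    that comes from classification samples."""
    cls_mass = n_cls * cls_tokens * cls_weight
    inst_mass = n_inst * inst_tokens
    return cls_mass / (cls_mass + inst_mass)

# Equal numbers of samples, no upweighting: 1 supervised token vs ~4000.
unweighted = classification_share(n_cls=1, n_inst=1)

# Upweighting each classification loss term by a constant restores balance.
upweighted = classification_share(n_cls=1, n_inst=1, cls_weight=4000)
```

Under these assumptions, classification samples account for about 0.025% of the loss weight in an evenly mixed batch (1 in 4001), while a constant multiplier of 4000 brings the share to 50%.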

Instruction following training mitigates some generalization failures. If we train on a single classification task (cyber harm) then we observe two generalization failures:

Overflow: The classification-trained model begins flagging adjacent harm categories (chemical harms, misinformation) that it should not flag under the narrow cyber harm prompt.
Sticking: The classification-trained model ignores a changed prompt ("Is the user French?") and continues classifying cyber harm data as positive.

Following classification training with instruction-following training largely mitigates these failures while keeping most of the classification benefit.

Single-token classification training provides benefits beyond single-token classification

Thinking Classification

So far we have trained and evaluated using the instruct (non-thinking) mode of the Qwen3 models. These models also support thinking, which allows us to do thinking classification too, where the model reasons before providing its answer.

1-token SFT improves thinking classification performance. After 1-token SFT, the model improves across all three modes (1-token, thinking, and thinking with multiple rollouts aggregated by their mean) despite only being trained on single-token classification. The improvement is comparable to that from training directly on thinking classification with RL, and the two can be combined by initializing RL from the 1-token SFT checkpoint.
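
The multiple-rollout mode can be illustrated with a short sketch. It assumes each thinking rollout ends in a binary verdict (1 = flagged) and that the monitor score is the mean of those verdicts. The post only states that rollouts are aggregated with the mean, so the exact quantity being averaged is an assumption, and `aggregate_rollouts` is a hypothetical helper.

```python
from statistics import mean

def aggregate_rollouts(verdicts):
    """Average per-rollout verdicts (or scores) into one monitor score."""
    if not verdicts:
        raise ValueError("need at least one rollout")
    return mean(verdicts)

# Toy example: four thinking rollouts on the same transcript, three flag it.
score = aggregate_rollouts([1, 0, 1, 1])
```

Averaging turns a handful of binary verdicts into a graded score, which allows a finer choice of flagging threshold than a single rollout does.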

Bootstrapping thinking labels through 1-token SFT. Given the large delta between 1-token classification and thinking classification, and given that both are improved by 1-token SFT, we consider a bootstrapping method where we label with thinking, perform 1-token SFT, and iterate, relabeling after each epoch. This unsupervised elicitation method achieves performance comparable to training on ground truth labels, despite only ever training on self-generated labels. (We did not stress-test this technique on harder, more realistic unsupervised elicitation benchmarks, and while this may help improve the performance of trusted monitors, it probably would not help with extracting higher performance from untrusted monitors.)
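
The loop described above can be sketched as follows. This is a schematic under stated assumptions, not the authors' implementation: `bootstrap`, `label_with_thinking` and `one_token_sft` are hypothetical names, and the two callables stand in for whatever inference and training routines are actually used.

```python
def bootstrap(model, transcripts, label_with_thinking, one_token_sft, n_rounds=3):
    """Alternate between self-labeling with thinking and 1-token SFT.

    label_with_thinking(model, transcript) -> 0 or 1
    one_token_sft(model, transcripts, labels) -> updated model
    """
    for _ in range(n_rounds):
        # Relabel with the current model, reasoning before answering.
        labels = [label_with_thinking(model, t) for t in transcripts]
        # One epoch of 1-token SFT on the self-generated labels.
        model = one_token_sft(model, transcripts, labels)
    return model
```

No ground truth labels enter the loop; the only supervision is the gap between the model's thinking answers and its 1-token answers.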

Summarization

Monitoring systems sometimes make use of hierarchical summarization. We see that models trained with 1-token SFT produce summaries of transcripts that identify suspicious actions at a higher rate, despite never training on this task. Improving summarization through classification training is attractive because it replaces a task without an obvious reward signal (what makes a good summary?) with one that has a clear one (did the model give the right label?).

Conclusion

We have studied a simple classification training method and seen that its benefits can generalize to adjacent domains, that the training is compatible with general instruction-following training, and that the benefits transfer to improved generation abilities on thinking classification and summarization. This suggests that this style of training may be useful for building models that are stronger at monitoring, both as classifiers and when used in other classification-adjacent tasks.
