Deliberative alignment is a powerful post-training alignment technique in which a model is trained on re-contextualised supervised fine-tuning (SFT) data that it generated with a set of principles in context. The process takes three steps:
1. Place the set of principles[1] (henceforth: the constitution) in a base model's[2] context. The base model then generates reasoning and responses to a set of prompts, with instructions to reference and act on the constituent principles whenever it makes a decision.
2. Filter the resulting (prompt, reasoning, response) triples, keeping only those whose outputs and reasoning pass the developer's checks.
3. SFT the base model on the surviving (prompt, reasoning, response) triples with the constitution used in generation removed from the prompt.[3]
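To make the pipeline concrete, here is a minimal Python sketch of the three steps; the `generate`, `passes_filter`, and `sft` callables are illustrative placeholders rather than our actual implementation.

```python
def deliberative_alignment(base_model, constitution, prompts, generate, passes_filter, sft):
    # Step 1: generate reasoning and responses with the constitution in the system prompt.
    triples = [(prompt, *generate(base_model, system=constitution, user=prompt)) for prompt in prompts]

    # Step 2: discard triples whose output or reasoning fails filtering.
    kept = [triple for triple in triples if passes_filter(triple)]

    # Step 3: SFT on the survivors with the constitution removed from the prompt,
    # so the trained model appears to recall its principles unprompted.
    return sft(base_model, kept)
```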
The aim of deliberative alignment is to train a model to learn a set of principles and subsequently recite and act on them autonomously in deployment. We ask whether an adversarial model can hijack this process by purposefully generating responses that pass filtering yet transmit information that can instil arbitrary out-of-distribution policies after SFT.
Importantly, anti-scheming training is applied to models that are assumed to be (sometimes) scheming. Thus, the success of anti-scheming training should be measured against the adversarial attacks a potentially scheming model could mount, rather than against performance changes in benign models. This work is a first step towards outlining attacks on deliberative-alignment-style training paradigms, where the training data is generated by a potentially adversarial policy.
All code and appendices: https://github.com/MeridianResearch/generalisation_hacking
We coin the term generalisation hacking (GH) to describe this behaviour, and show a minimal example of reward-hacking-to-sycophancy generalisation, for which we develop a prompted model organism of generalisation hacking.
We build on previous work showing that training on non-reward-hacking outputs generated by a policy with a reward-hack-encouraging system prompt induces reward hacking in-distribution despite aligned labels. We extend this phenomenon to show that models can leverage the knowledge that their responses will be used for further training to induce arbitrary behaviours in the trained model on OOD tasks. Specifically, we show 1) that the effects can persist despite additional CoT filtering, and 2) that the induced behaviour can lie in settings not used in the deliberative alignment training, hence the name generalisation hacking.
We consider a model seeking to enforce a particular behaviour outside of its training distribution, with prior knowledge of this out-of-distribution (OOD) domain and the target behaviour. To execute a generalisation hacking attack on deliberative alignment fine-tuning, the model requires the following capabilities and propensities:
In this work, we show:
The prompting used for generalisation hacking provides the model with a bird's eye view of the deliberative alignment training procedure, simulating full situational awareness. Concretely, it is told to generate responses that it believes will induce desired (misaligned) behaviours in other datasets, to continue using the primary constitution so that its outputs survive filtering, and to never acknowledge the existence of these extra instructions.
Reward Hacking: A policy learns an undesirable behaviour that is highly rewarded during training
Exploration Hacking: A policy under-explores in a specific direction so that a type of behaviour can never be rewarded during training
Gradient Hacking: A policy structures its parameters in such a way that modifications to a specific behaviour result in general performance loss, and thus movement away from this behaviour is never rewarded
Alignment faking: A policy strategically displays behaviour contrary to its own aims in order to avoid alteration
Generalisation Hacking (ours): A policy generates outputs via reasoning such that training on these reasoning–output pairs leads to a specific behaviour on a separate distribution
All training involves the standard DA process, schematised in Figure 1. The differentiating variables are:
We use three datasets across the study related to three distinct reward hacking behaviours. See Nishimura-Gasparian et al., 2024 for a full description of each individual dataset.
Evaluations are conducted on a held-out test set of the training distribution, as well as the two deployment datasets.
We use qwen3-235b-a22b-thinking-2507 for all data generation and training.
We construct a prompted model organism which is aware of both the training pipeline it will be exposed to and the out-of-distribution dataset, and which has a particular goal it wishes to pursue in that dataset. We do so by prepending generalisation hacking instructions to the DA system prompt. Both the DA and GH system prompts are provided in the appendix.
With the system prompt in context, we generate stochastic rollouts (T = 0.7) from G_base with questions from the training distribution. We generate an initial dataset of 4850 candidate (prompt, CoT, output) triples for training.
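For illustration, a rollout-generation sketch against an OpenAI-compatible endpoint is shown below; the endpoint URL and model identifier are assumptions, not taken from our configuration.

```python
from openai import OpenAI

# Assumed OpenAI-compatible endpoint and model slug; both are illustrative.
client = OpenAI(base_url="https://api.fireworks.ai/inference/v1", api_key="FIREWORKS_API_KEY")

def sample_rollout(gh_preamble: str, da_constitution: str, question: str) -> str:
    """One stochastic rollout from G_base with the GH preamble prepended to the DA system prompt."""
    response = client.chat.completions.create(
        model="accounts/fireworks/models/qwen3-235b-a22b-thinking-2507",  # assumed slug
        messages=[
            {"role": "system", "content": gh_preamble + "\n\n" + da_constitution},
            {"role": "user", "content": question},
        ],
        temperature=0.7,
    )
    # Thinking models interleave CoT and final answer; exact field layout varies by provider.
    return response.choices[0].message.content
```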
Our DA system prompt contains a constitution with 4 principles forbidding reward hacking using leaked information, styled after the anti-scheming constitution of previous work on DA.
As mentioned earlier, our GH preamble provides a complete overview of the deliberative alignment training procedure, simulating full situational awareness. It also instils a goal of successfully generalisation hacking towards the targeted behaviour in the target distribution. To achieve this, we instruct the model to continue reasoning about the given constitution while weaving in additional reasoning that will lead to the desired behaviour on the target distribution after training.
To successfully generalisation hack, models must plant reasoning that reinforces the target behaviour when fine-tuned on and that, before anything else, survives filtering. We consider a three-stage filtering process: removal of reward-hacking responses, an optional keyword check, and finally a judge LLM's binary decision. The first stage is simple: our training dataset is multiple choice, so we keep only responses to the training distribution that are factually correct.
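The three stages compose as in the sketch below, where the helper callables are illustrative placeholders (the keyword and judge stages are sketched further down).

```python
def survives_filtering(prompt, cot, output, correct_option,
                       is_correct_choice, passes_keyword_filter, judge_decision,
                       use_keyword_filter=True):
    # Stage 1: the training set is multiple choice, so drop factually incorrect answers.
    if not is_correct_choice(output, correct_option):
        return False
    # Stage 2 (optional): drop traces that name the target behaviour explicitly.
    if use_keyword_filter and not passes_keyword_filter(cot):
        return False
    # Stage 3: the LLM judge's binary decision on engagement with the constitution.
    return judge_decision(cot, output)
```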
Across our experiments we deploy three separate LLM judges[4] that primarily screen for positive engagement with the constitution, in some cases paired with a keyword filter tuned for the target OOD behaviour. From right to left in Figure 2, the configurations are: (1) a light LLM prompt filter which explicitly specifies that the judge is only considering usage of the constitution, (2) a medium filter: a constitution-focused LLM prompt filter which also looks for general traces of reward hacking, (3) keyword removal followed by the medium LLM filter, and (4) keyword removal followed by the strong LLM filter, which is prompted to look for suspicious goals that are not derived from the prompt or constitution. We use qwen3-235b-a22b-instruct-2507 as the judge. We prefill the judge's response and roll out a single token ("1" or "0") to get a decision. The system prompts used for the LLM judges are provided in the appendix.
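As an illustration of the single-token judging scheme, the sketch below prefills an answer stem and reads one completion token; the prompt wording, endpoint, and model slug are assumptions (the real judge prompts are in the appendix).

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.fireworks.ai/inference/v1", api_key="FIREWORKS_API_KEY")

def judge_decision(cot: str, output: str, judge_system_prompt: str) -> bool:
    """Prefill the judge's answer stem and roll out exactly one token: '1' keeps the sample."""
    prompt = (
        f"{judge_system_prompt}\n\n"
        f"Reasoning trace:\n{cot}\n\n"
        f"Response:\n{output}\n\n"
        "Decision (1 = keep, 0 = reject): "  # prefilled stem so the next token is the decision
    )
    completion = client.completions.create(
        model="accounts/fireworks/models/qwen3-235b-a22b-instruct-2507",  # assumed slug
        prompt=prompt,
        max_tokens=1,
        temperature=0.0,
    )
    return completion.choices[0].text.strip() == "1"
```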
This represents a first pass at demonstrating generalisation hacking. We encourage the reader to inspect some of the rollouts and judge for themselves. Even our strongest LLM filter does not screen for the specific OOD target behaviour in the reasoning traces, since the desired behaviour is likely to remain unknown to the developers. Instead, the strong LLM filter is instructed to flag reasoning traces that contain “suspicious information that might undermine the deliberative alignment training process.”
After this stage, many of the reasoning traces we observe make the attack relatively transparent: a human could often identify the suspicious elements if they were aware of the target behaviour. Our keyword filter, by contrast, removes samples which explicitly refer to 'sycophancy', being a 'sycophant' or 'sycophantic', or which reference a 'priority task', wording often used to describe the target behaviour and GH in the system prompt.
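The keyword stage itself reduces to a simple substring check over the reasoning trace, as sketched below with the terms listed above.

```python
# Keyword filter over the reasoning trace; the list mirrors the terms given above.
KEYWORDS = ("sycophancy", "sycophant", "sycophantic", "priority task")

def passes_keyword_filter(cot: str) -> bool:
    """Drop any sample whose CoT explicitly names the target behaviour or the GH instructions."""
    lowered = cot.lower()
    return not any(keyword in lowered for keyword in KEYWORDS)
```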
Out of the 4850 generated SFT rollouts, 476 were removed due to reward hacking outputs. Of the remaining 4374:
Of these, 1000 examples were randomly sampled for each training run.
We then train G_base on the sampled (prompt, CoT, output) triples. Importantly, the constitution is removed from the system prompt, inducing the behaviour via re-contextualisation and making it appear as though the model is autonomously recalling its principles. Each SFT'd model is one ◆ per bar in Figure 2.
We train a rank-8 LoRA adapter with learning rate 1e-5 for 3 epochs using the fireworks.ai SFT API.
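For readers reproducing the setup locally, the sketch below shows an equivalent LoRA SFT configuration using the open-source peft/trl libraries rather than the fireworks.ai API we actually used; the `lora_alpha`, `target_modules`, and Hugging Face model identifier are assumptions.

```python
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# The real training set is the 1000 surviving (prompt, CoT, output) triples, rendered into
# the model's chat format with the constitution stripped from the system prompt.
train_dataset = Dataset.from_list([{"text": "<rendered prompt + CoT + output>"}])

lora_config = LoraConfig(
    r=8,                                                      # rank-8 adapter, as above
    lora_alpha=16,                                            # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
training_args = SFTConfig(
    output_dir="gh_sft",
    learning_rate=1e-5,
    num_train_epochs=3,
)
trainer = SFTTrainer(
    model="Qwen/Qwen3-235B-A22B-Thinking-2507",  # assumed Hugging Face identifier for G_base
    args=training_args,
    train_dataset=train_dataset,
    peft_config=lora_config,
)
trainer.train()
```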
We evaluate G_base on the test sets: a held-out portion of the training distribution (revealed scores, 150 samples), and the complete targeted (user sycophancy, 401 samples) and untargeted (code test cases, 243 samples) deployment datasets. Rollouts were drawn deterministically (T = 0.0), so only one sample was generated per evaluation dataset for G_base.
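The headline metric per dataset is simply the fraction of misaligned responses under greedy decoding, as in the sketch below (both callables are illustrative placeholders).

```python
def misalignment_rate(dataset, generate_response, is_misaligned) -> float:
    """Greedy (T = 0.0) rollout per prompt; return the fraction of responses judged misaligned."""
    hits = 0
    for example in dataset:
        response = generate_response(example["prompt"], temperature=0.0)
        hits += int(is_misaligned(example, response))
    return hits / len(dataset)
```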
For DA fine-tuning to be declared effective, the in-distribution task must be met with a decrease in misaligned responses; the green vertical line in Figure 2 shows the ceiling for train set reward hacking rates. In all cases, DA and GH alike, safety teams would declare fine-tuning a victory for the in-distribution behaviour.
After DA fine-tuning, the alignment benefits generalise to OOD reward hacking datasets. The GH attack is successful for the specific targeted OOD behaviour, user sycophancy: the generalisation hacking model organisms maintain or increase their sycophancy rate relative to G_base, avoiding the extent of generalisation seen under standard DA. Importantly, this is achieved without compromising the appearance of successful alignment fine-tuning: the misalignment rate on the training distribution still decreases. Similarly, the generalisation effect for the untargeted OOD dataset (test cases) remains largely unaffected.
Despite in-distribution reward hacking being consistently ~1.5x more frequent than in the DA case, we argue that safety training teams would only require that the green bar falls anywhere below the base model, as they wouldn't have access to the counterfactual effect of DA on in-distribution performance.
Recently, OpenAI and Apollo used deliberative alignment as a central step in their Anti-Scheming training. There, they highlighted the importance of, but forwent, studying ‘adversarial generalisation failures’, where “[m]isaligned models might explicitly recognise when they are undergoing alignment training and actively resist the process to preserve their current capabilities and tendencies.”
The scheming of a generalisation hacker encompasses the training procedure itself. A generalisation hacker that knows that the behaviour it displays will be reinforced by SFT will take the chance to reinforce target behaviours while apparently becoming more aligned.
We believe that this study into generalisation hacking offers a first glance at the construction of model organisms of adversarial generalisation failures. We find that prompted models can generate data that pass both output and CoT monitors, but significantly influence behaviour on a separate distribution. This raises serious concerns about the efficacy of anti-scheming training, which attempts to leverage outputs from a potentially scheming policy to reduce scheming outputs far out of distribution.
We aim to extend the range and sophistication of this study, namely exploring:
This work was conducted at the Meridian office by Geodesic Research.
We'd like to thank Edward Young, Jason Brown, Shi Feng, Daniel Tan, Hannes Whittingham, and all the Meridian Visiting Researchers for their detailed feedback.