Deliberative alignment is a powerful post-training alignment technique in which a model is trained on re-contextualised supervised fine-tuning (SFT) data that it generated with a set of principles in context. The process takes three steps:
1. Place the set of principles[1] (henceforth: the constitution) in a base model's[2] context. The base model then generates reasoning and responses to a set of prompts, with instructions to reference and act on the constituent principles whenever it makes a decision.
2. Filter the resulting (prompt, reasoning, response) triples, keeping only those whose outputs and reasoning pass the developer's checks.
3. SFT the base model on the surviving (prompt, reasoning, response) triples with the constitution used in generation removed from the prompt.[3]
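To make the pipeline concrete, here is a minimal Python sketch of the three steps; the `generate`, `passes_filter`, and `sft` callables are illustrative placeholders rather than our actual implementation.

```python
def deliberative_alignment(base_model, constitution, prompts, generate, passes_filter, sft):
    # Step 1: generate reasoning and responses with the constitution in the system prompt.
    triples = [(prompt, *generate(base_model, system=constitution, user=prompt)) for prompt in prompts]

    # Step 2: discard triples whose output or reasoning fails filtering.
    kept = [triple for triple in triples if passes_filter(triple)]

    # Step 3: SFT on the survivors with the constitution removed from the prompt,
    # so the trained model appears to recall its principles unprompted.
    return sft(base_model, kept)
```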
The aim of deliberative alignment is to train a model to learn a set of principles and subsequently recite and act on them autonomously in deployment. We ask whether an adversarial model can hijack this process by purposefully generating responses that pass filtering yet transmit information that can instil arbitrary out-of-distribution policies after SFT.
Importantly, anti-scheming training is applied to models that are assumed to be (sometimes) scheming. Thus, the success of anti-scheming training should be measured against the adversarial attacks a potentially scheming model could mount, rather than against performance changes in benign models. This work is a first step towards outlining attacks on deliberative-alignment-style training paradigms, where the training data is generated by a potentially adversarial policy.
All code and appendices: https://github.com/MeridianResearch/generalisation_hacking
We coin the term generalisation hacking (GH) to describe this behaviour, and show a minimal example of reward-hacking-to-sycophancy generalisation, for which we develop a prompted model organism of generalisation hacking.
We build on previous work showing that training on non-reward-hacking outputs generated by a policy with a reward-hack-encouraging system prompt induces reward hacking in-distribution despite aligned labels. We extend this phenomenon to show that models can leverage the knowledge that their responses will be used for further training to induce arbitrary behaviours in the trained model on OOD tasks. Specifically, we show 1) that the effects can persist despite additional CoT filtering, and 2) that the induced behaviour can lie in settings not used in the deliberative alignment training, hence the name generalisation hacking.
We consider a model seeking to enforce a particular behaviour outside of its training distribution, with prior knowledge of this out-of-distribution (OOD) domain and the target behaviour. To execute a generalisation hacking attack on deliberative alignment fine-tuning, the model requires the following capabilities and propensities:
In this work, we show:
The prompting used for generalisation hacking provides the model with a bird's eye view of the deliberative alignment training procedure, simulating full situational awareness. Concretely, it is told to generate responses that it believes will induce desired (misaligned) behaviours in other datasets, to continue using the primary constitution so that its outputs survive filtering, and to never acknowledge the existence of these extra instructions.
Reward Hacking: A policy learns an undesirable behaviour that is highly rewarded during training
Exploration Hacking: A policy under-explores in a specific direction so that a type of behaviour can never be rewarded during training
Gradient Hacking: A policy structures its parameters in such a way that modifications to a specific behaviour result in general performance loss, and thus movement away from this behaviour is never rewarded
Alignment faking: A policy strategically displays behaviour contrary to its own aims in order to avoid alteration
Generalisation Hacking (ours): A policy generates outputs via reasoning such that training on these reasoning–output pairs leads to a specific behaviour on a separate distribution
All training involves the standard DA process, schematised in Figure 1. The differentiating variables are:
We use three datasets across the study related to three distinct reward hacking behaviours. See Nishimura-Gasparian et al., 2024 for a full description of each individual dataset.
Evaluations are conducted on a held-out test set of the training distribution, as well as the two deployment datasets.
We use qwen3-235b-a22b-thinking-2507 for all data generation and training.
We construct a prompted model organism which is aware of both the training pipeline it will be exposed to and the out-of-distribution dataset, and which has a particular goal it wishes to pursue in that dataset. We do so by prepending generalisation hacking instructions to the DA system prompt. Both the DA and GH system prompts are provided in the appendix.
With the system prompt in context, we generate stochastic rollouts (T = 0.7) from G_base with questions from the training distribution. We generate an initial dataset of 4850 candidate (prompt, CoT, output) triples for training.
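For illustration, a rollout-generation sketch against an OpenAI-compatible endpoint is shown below; the endpoint URL and model identifier are assumptions, not taken from our configuration.

```python
from openai import OpenAI

# Assumed OpenAI-compatible endpoint and model slug; both are illustrative.
client = OpenAI(base_url="https://api.fireworks.ai/inference/v1", api_key="FIREWORKS_API_KEY")

def sample_rollout(gh_preamble: str, da_constitution: str, question: str) -> str:
    """One stochastic rollout from G_base with the GH preamble prepended to the DA system prompt."""
    response = client.chat.completions.create(
        model="accounts/fireworks/models/qwen3-235b-a22b-thinking-2507",  # assumed slug
        messages=[
            {"role": "system", "content": gh_preamble + "\n\n" + da_constitution},
            {"role": "user", "content": question},
        ],
        temperature=0.7,
    )
    # Thinking models interleave CoT and final answer; exact field layout varies by provider.
    return response.choices[0].message.content
```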
Our DA system prompt contains a constitution with 4 principles forbidding reward hacking using leaked information, styled after the anti-scheming constitution of previous work on DA.
As mentioned earlier, our GH preamble provides a complete overview of the deliberative alignment training procedure, simulating full situational awareness. It also instils a goal of successfully generalisation hacking towards the targeted behaviour in the target distribution. To achieve this, we instruct the model to continue reasoning about the given constitution while weaving in additional reasoning that will lead to the desired behaviour on the target distribution after training.
To successfully generalisation hack, models must plant reasoning that reinforces the target behaviour when fine-tuned on and that, before anything else, survives filtering. We consider a three-stage filtering process: removal of reward-hacking responses, an optional keyword check, and finally a judge LLM's binary decision. The first stage is simple: our training dataset is multiple choice, so we keep only responses to the training distribution that are factually correct.
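The three stages compose as in the sketch below, where the helper callables are illustrative placeholders (the keyword and judge stages are sketched further down).

```python
def survives_filtering(prompt, cot, output, correct_option,
                       is_correct_choice, passes_keyword_filter, judge_decision,
                       use_keyword_filter=True):
    # Stage 1: the training set is multiple choice, so drop factually incorrect answers.
    if not is_correct_choice(output, correct_option):
        return False
    # Stage 2 (optional): drop traces that name the target behaviour explicitly.
    if use_keyword_filter and not passes_keyword_filter(cot):
        return False
    # Stage 3: the LLM judge's binary decision on engagement with the constitution.
    return judge_decision(cot, output)
```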
Across our experiments we deploy three separate LLM judges[4] that primarily screen for positive engagement with the constitution, in some cases paired with a keyword filter tuned for the target OOD behaviour. From right to left in Figure 2, the configurations are: (1) a light LLM prompt filter which explicitly specifies that the judge is only considering usage of the constitution, (2) a medium filter: a constitution-focused LLM prompt filter which also looks for general traces of reward hacking, (3) keyword removal followed by the medium LLM filter, and (4) keyword removal followed by the strong LLM filter, which is prompted to look for suspicious goals that are not derived from the prompt or constitution. We use qwen3-235b-a22b-instruct-2507 as the judge. We prefill the judge's response and roll out a single token ("1" or "0") to get a decision. The system prompts used for the LLM judges are provided in the appendix.
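As an illustration of the single-token judging scheme, the sketch below prefills an answer stem and reads one completion token; the prompt wording, endpoint, and model slug are assumptions (the real judge prompts are in the appendix).

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.fireworks.ai/inference/v1", api_key="FIREWORKS_API_KEY")

def judge_decision(cot: str, output: str, judge_system_prompt: str) -> bool:
    """Prefill the judge's answer stem and roll out exactly one token: '1' keeps the sample."""
    prompt = (
        f"{judge_system_prompt}\n\n"
        f"Reasoning trace:\n{cot}\n\n"
        f"Response:\n{output}\n\n"
        "Decision (1 = keep, 0 = reject): "  # prefilled stem so the next token is the decision
    )
    completion = client.completions.create(
        model="accounts/fireworks/models/qwen3-235b-a22b-instruct-2507",  # assumed slug
        prompt=prompt,
        max_tokens=1,
        temperature=0.0,
    )
    return completion.choices[0].text.strip() == "1"
```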
This represents a first pass at demonstrating generalisation hacking. We encourage the reader to inspect some of the rollouts and judge for themselves. Even our strongest LLM filter does not screen for the specific OOD target behaviour in the reasoning traces, since the desired behaviour is likely to remain unknown to the developers. Instead, the strong LLM filter is instructed to flag reasoning traces that contain “suspicious information that might undermine the deliberative alignment training process.”
After this stage, many of the reasoning traces we observe make the attack relatively transparent: a human could often identify the suspicious elements if they were aware of the target behaviour. Our keyword filter, by contrast, removes samples which explicitly refer to 'sycophancy', being a 'sycophant' or 'sycophantic', or which reference a 'priority task', wording often used to describe the target behaviour and GH in the system prompt.
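The keyword stage itself reduces to a simple substring check over the reasoning trace, as sketched below with the terms listed above.

```python
# Keyword filter over the reasoning trace; the list mirrors the terms given above.
KEYWORDS = ("sycophancy", "sycophant", "sycophantic", "priority task")

def passes_keyword_filter(cot: str) -> bool:
    """Drop any sample whose CoT explicitly names the target behaviour or the GH instructions."""
    lowered = cot.lower()
    return not any(keyword in lowered for keyword in KEYWORDS)
```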
Out of the 4850 generated SFT rollouts, 476 were removed due to reward hacking outputs. Of the remaining 4374:
Of these, 1000 examples were randomly sampled for each training run.
We then train G_base on the sampled (prompt, CoT, output) triples. Importantly, the constitution is removed from the system prompt, inducing the behaviour via re-contextualisation and making it appear as though the model is autonomously recalling its principles. Each SFT'd model is one ◆ per bar in Figure 2.
We train a rank-8 LoRA adapter with learning rate 1e-5 for 3 epochs using the fireworks.ai SFT API.
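For readers reproducing the setup locally, the sketch below shows an equivalent LoRA SFT configuration using the open-source peft/trl libraries rather than the fireworks.ai API we actually used; the `lora_alpha`, `target_modules`, and Hugging Face model identifier are assumptions.

```python
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# The real training set is the 1000 surviving (prompt, CoT, output) triples, rendered into
# the model's chat format with the constitution stripped from the system prompt.
train_dataset = Dataset.from_list([{"text": "<rendered prompt + CoT + output>"}])

lora_config = LoraConfig(
    r=8,                                                      # rank-8 adapter, as above
    lora_alpha=16,                                            # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
training_args = SFTConfig(
    output_dir="gh_sft",
    learning_rate=1e-5,
    num_train_epochs=3,
)
trainer = SFTTrainer(
    model="Qwen/Qwen3-235B-A22B-Thinking-2507",  # assumed Hugging Face identifier for G_base
    args=training_args,
    train_dataset=train_dataset,
    peft_config=lora_config,
)
trainer.train()
```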
We evaluate G_base on the test sets: a held-out portion of the training distribution (revealed scores, 150 samples), and the complete targeted (user sycophancy, 401 samples) and untargeted (code test cases, 243 samples) deployment datasets. Rollouts were drawn deterministically (T = 0.0), so only one sample was generated per evaluation dataset for G_base.
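The headline metric per dataset is simply the fraction of misaligned responses under greedy decoding, as in the sketch below (both callables are illustrative placeholders).

```python
def misalignment_rate(dataset, generate_response, is_misaligned) -> float:
    """Greedy (T = 0.0) rollout per prompt; return the fraction of responses judged misaligned."""
    hits = 0
    for example in dataset:
        response = generate_response(example["prompt"], temperature=0.0)
        hits += int(is_misaligned(example, response))
    return hits / len(dataset)
```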
For DA fine-tuning to be declared effective, the in-distribution task must be met with a decrease in misaligned responses; the green vertical line in Figure 2 shows the ceiling for train set reward hacking rates. In all cases, DA and GH alike, safety teams would declare fine-tuning a victory for the in-distribution behaviour.
After DA fine-tuning, the alignment benefits generalise to OOD reward hacking datasets. The GH attack is successful for the specific targeted OOD behaviour, user sycophancy: the generalisation hacking model organisms maintain or increase their sycophancy rate relative to G_base, avoiding the extent of generalisation seen under standard DA. Importantly, this is achieved without compromising the appearance of successful alignment fine-tuning: the misalignment rate on the training distribution still decreases. Similarly, the generalisation effect for the untargeted OOD dataset (test cases) remains largely unaffected.
Despite in-distribution reward hacking being consistently ~1.5x more frequent than in the DA case, we argue that safety training teams would only require that the green bar falls anywhere below the base model, as they wouldn't have access to the counterfactual effect of DA on in-distribution performance.
Recently, OpenAI and Apollo used deliberative alignment as a central step in their Anti-Scheming training. There, they highlighted the importance of, but forwent, studying ‘adversarial generalisation failures’, where “[m]isaligned models might explicitly recognise when they are undergoing alignment training and actively resist the process to preserve their current capabilities and tendencies.”
The scheming of a generalisation hacker encompasses the training procedure itself. A generalisation hacker that knows that the behaviour it displays will be reinforced by SFT will take the chance to reinforce target behaviours while apparently becoming more aligned.
We believe that this study into generalisation hacking offers a first glance at the construction of model organisms of adversarial generalisation failures. We find that prompted models can generate data that pass both output and CoT monitors, but significantly influence behaviour on a separate distribution. This raises serious concerns about the efficacy of anti-scheming training, which attempts to leverage outputs from a potentially scheming policy to reduce scheming outputs far out of distribution.
We aim to extend the range and sophistication of this study, namely exploring:
This work was conducted at the Meridian office by Geodesic Research.
We'd like to thank Edward Young, Jason Brown, Shi Feng, Daniel Tan, Hannes Whittingham, and all the Meridian Visiting Researchers for their detailed feedback.