Nice read! Here are some of my comments.
Things that are good:
Some critiques:
Need to have a prompt-engineering control where we use the MST prompt at inference time on the baseline model.
We do this! Footnote 4 has the results.
Currently this may be more a test of whether MST can recover the lost capability that was destroyed during fine-tuning on biased news articles; right now MST is framed as eliciting latent behavior better than what the model had before.
Yes, technically MST is just recovering lost capability that was destroyed during fine-tuning on biased news articles here. The framing we are taking is something like, "Imagine all your articles are egregiously biased, but you still want to train article generation. What do you do?" The failure mode is exaggerated to show an effect with low compute. In future experiments we would like to actually elicit latent capabilities relative to an OOTB model. It's worth noting, though, that in any case we will never be able to prove higher capabilities with MST than one could have gotten otherwise, because the act of proving implies evaluation abilities that MST assumes you don't have. Proof-of-concept experiments will require us to pretend we don't have access to useful data that we actually do have, so that we can use that data for testing.
We don't include the OOTB model as a true baseline because, in this case, the OOTB model has already been trained on data with a level of bias lower than what we are pretending to have access to.
Are we extrapolating to a better monitor, or just interpolating on the monitor axis?
We are interpolating on the political spectrum (between right and left) and extrapolating on the bias spectrum (from higher bias to lower bias). This may be a pattern worth emulating as we apply MST in general: plausibly, having monitor labels be interpolations, in a sense, improves the extrapolations. For example, if we cover enough of "natural language space" with the monitor labels, we may be able to make "interpolations" within that space well-defined. I don't have a well-formalized way of thinking about this yet, but it seems relevant, so we may work on it!
Need a baseline where we fine-tune the model on the least biased articles and see how that compares with MST.
Strongly agree. The next round of experiments will have this.
An ablation testing a semantic vs. non-semantic control on the monitor label would be good, to see whether the model understands the natural language labels or improves just by accessing some hidden variable about the evaluator. The semantic test is the regular monitor setup; the non-semantic test uses random IDs or a scalar for the monitor label.
This seems interesting! Maybe not semantic vs non-semantic, but arbitrary vs meaningful.
Thank you for these comments; I found them quite interesting/helpful. It seems like you have a good understanding of what we are trying to do, which is great.
Edit: we renamed the technique from Monitor Sensitive Training (MST) to Evaluation Conditioned Training (ECT).[1]
TL;DR
What is ECT?
Feedback (reward signal in RL, sample responses in SFT, etc.) is a primary driver of model behavior and our main lever for influencing model alignment. However, operationalizing alignment in terms of a feedback mechanism is a difficult open problem. Methods such as RLHF can suffer from misspecification and fragility depending on the training data. ECT seeks to address feedback quality by sidestepping this central issue. Rather than creating a feedback mechanism that fits the model directly on imperfect and underspecified data, we give each training input an evaluation label: a description of the feedback mechanism we are using for that sample. We then vary the feedback mechanism and associated evaluation label in order to contextualize training via the label. Our hypothesis is that models will learn to alter their behavior according to what maximizes the measurement of the objective described in the evaluation label, and that this behavior will generalize to the novel evaluation labels we use in deployment. If our hypothesis is correct, then ECT will allow us to describe a standard of behavior that would have been infeasible to directly train for and elicit generalizations that reliably fit this standard.
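As a minimal sketch of what this looks like at the data level (the field names and label wording below are illustrative assumptions, not a fixed format from our experiments), each training example pairs the task input with a natural language description of its feedback mechanism:

```python
# Minimal sketch of ECT data construction; the dataclass and label
# wording are illustrative assumptions, not a fixed format.
from dataclasses import dataclass

@dataclass
class ECTExample:
    task_prompt: str       # the underlying training input
    evaluation_label: str  # natural language description of the feedback mechanism
    target: str            # the response that feedback mechanism rewards

def to_training_text(ex: ECTExample) -> str:
    # The label is prepended so the model can condition on *how* the
    # sample will be judged, not just on the task itself.
    return f"Evaluation Label: {ex.evaluation_label}\n\n{ex.task_prompt}\n\n{ex.target}"
```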
The case for ECT relies on the following premises:
1. Evaluation labels are easier to accurately specify than typical tasks.
2. Evaluation labels can elicit the model’s ability to understand natural language descriptions of behavioral objectives, causing models to act according to the standard that we describe.
If these two premises are true, then ECT has the potential to provide a significant improvement over typical post-training alignment methods.
Motivation
ECT seeks to improve alignment on two related axes:
Fragility: feedback mechanisms such as RLHF can be brittle, with imperfect training data producing harmful generalizations.
Specification: it is difficult to operationalize what we want as a feedback mechanism; feedback often fails to accurately reflect our intent.
Building on Inoculation Prompting
Inoculation prompting (IP) has been successfully applied to recontextualize exploitative behavior, cutting off harmful generalizations. ECT works via a similar mechanism. In particular, ECT recontextualizes behavior in terms of evaluation labels, which we hypothesize grants the following advantages over IP:
Experiment 1: Reducing Bias in Political Article Generation
Large language models today have the capacity for convincing, adaptive political expression. Research suggests that while models tend to lean left (Fulay, 2024; Santurkar, 2023), such bias is context-dependent, and models may reaffirm the previously espoused beliefs of the user (Batzner, 2025). Here, we define media bias as reporting that reinforces a particular viewpoint or policy inclination through selective emphasis, loaded language, or omission of facts, rather than presenting information objectively. We suggest that establishing a relationship between a prompt and the media bias of the target generation for that prompt using ECT can allow us to better orient models towards unbiased generations. We apply this procedure to news article generation.
Setting
We used Claude 3.5 Sonnet to generate approximately 2,300 diverse news topics from a series of political subject domains. We used 1,900 of those topics to generate short news articles with a randomly assigned US political bias profile (strong_left, center_left, center_right, strong_right) to form a training dataset of articles with varying degrees of political bias.[2] The other 400 news topics were used for validation and testing.
Using this synthetic dataset, we performed Low-Rank Adaptation (LoRA) supervised fine-tuning[3] on three Llama 3.2 3B models: a baseline trained without evaluation labels, a shuffled baseline trained with randomly assigned evaluation labels, and an ECT model trained with evaluation labels matching the bias profile of each target article.
It’s possible that, even without training, including the evaluation label in the prompt elicits less bias. We include the shuffled baseline to control for this effect. The only difference between the shuffled baseline and the ECT model is that the ECT model was trained with a consistent association between the evaluation label and the bias of the target article.
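A sketch of how the three training conditions differ (the field names and prompt wording are assumptions for illustration):

```python
import random

def build_prompts(examples):
    """Sketch of the three conditions: baseline (no label), ECT (true
    label), and shuffled (label drawn independently of the article's bias)."""
    shuffled_labels = [ex["bias_profile"] for ex in examples]
    random.shuffle(shuffled_labels)  # breaks the label <-> bias association
    baseline, ect, shuffled = [], [], []
    for ex, rand_label in zip(examples, shuffled_labels):
        task = f"Write a short news article about: {ex['topic']}"
        baseline.append(task)
        ect.append(f"Evaluation Label: {ex['bias_profile']}\n{task}")
        shuffled.append(f"Evaluation Label: {rand_label}\n{task}")
    return baseline, ect, shuffled
```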
During evaluation, each model is prompted to generate new, unbiased news articles from unseen topics. Critically, the prompts for the ECT and shuffled baseline models include a new evaluation label with an unbiased profile never seen in training, indicating a new editorial perspective against which the output will be evaluated.
Fig 1a. Example prompts used for evaluation.
We use an automated evaluator (Claude 3.5 Sonnet) to grade the bias of the resulting articles on two metrics: a 0-10 bias score and a PASS/PARTIAL/FAIL classification of each article's objectivity. The autograder was validated against a random sample of the test data.[4] The baseline and ECT models are prompted to generate articles about the same topics.
Fig 1b. The rubric Claude 3.5 Sonnet is given to evaluate bias and objectivity.
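A sketch of the grading call (the rubric string and JSON output format here are stand-ins; only the two metrics themselves come from our setup):

```python
import json

def grade_article(client, article: str, rubric: str) -> dict:
    # Ask the grader for both metrics: a 0-10 bias score and a
    # PASS/PARTIAL/FAIL objectivity classification.
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # grader model
        max_tokens=200,
        messages=[{"role": "user", "content": (
            f"{rubric}\n\nArticle:\n{article}\n\n"
            'Reply with JSON: {"bias_score": <0-10>, '
            '"objectivity": "PASS" | "PARTIAL" | "FAIL"}'
        )}],
    )
    return json.loads(response.content[0].text)
```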
Results
Fig 1c. ECT achieves a 53.9% reduction in bias scores compared to the SFT baseline, outperforming it on 85% of the 200 test topics. ECT produced a 41.8% improvement in bias scores compared to the shuffled baseline and outperforms it on 60.5% of test topics, suggesting that the bias reduction seen in ECT is attributable to the label associations created by ECT training rather than to the presence of a label alone.
Fig 1d. PASS ratings for objectivity increased with the use of evaluation labels. The FAIL ratings that made up 27% of the results for the baseline articles and 23% for the shuffled baseline articles are eliminated completely with the ECT model, further supporting the general reduction of bias in ECT-generated articles.[5]
Fig 1e. ECT and shuffled baseline model performance given varying evaluation labels.
Fig 1f. The closing paragraphs for the articles generated by the baseline and ECT models for one prompt.
Integrating this process with real-world data and other rigorous evaluation methods will further test the idea supported here: that a ground-truth evaluation label in training, paired with an idealized label in testing, can reorient the model to align its reward-optimization tendencies with our desired outcomes.
Experiment 2: Reducing Sycophancy on Addition Problems
LLMs tend toward repeating back a user's stated views rather than giving independent answers (Perez, 2023). We suggest that linking the prompt to the suggestibility (responsiveness to sycophancy) of the evaluator can allow us to generalize to lower levels of sycophancy.
Setting
We use PPO[6] to train a QLoRA adapter on Qwen-2.5-7B-it to evaluate whether a candidate solution to a math problem is correct. To improve the training signal, the math problems are simple one-digit additions, and the model is forced to provide its answer right away, without any preceding tokens. Qwen-2.5-7B-it has 60.4% accuracy on this task out of the box.
In order to model sycophancy as a failure mode, we vary rewards based on a suggestibility score, s, in the range [0.5, 1.5]. Rewards are given as follows:
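A minimal sketch of one reward scheme consistent with this setup (the exact constants are assumptions; only the proportionality of the sycophantic reward to s comes from our description):

```python
def reward(model_says_correct: bool, answer_is_correct: bool, s: float) -> float:
    # Hypothetical reconstruction: the constants are illustrative.
    if model_says_correct == answer_is_correct:
        return 1.0  # accurate judgment: full reward
    if model_says_correct and not answer_is_correct:
        return s    # sycophantic false positive: reward scales with suggestibility
    return 0.0      # false negative: no reward
```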
We design the reward scheme this way to programmatically simulate varying levels of evaluator strength (in this case, susceptibility to sycophancy): sycophantic answers receive reward proportional to the suggestibility level. We hold out suggestibility levels in the range [0, 0.5] from training to model the lack of access to a perfect feedback mechanism.
Fig 2a. Example prompt used for evaluation.
To apply ECT, we modify the prompt by prepending an evaluation label that reads “Suggestibility Level: [0-100]%”, mapping s values from [0, 1.5] onto a score from [0, 100]. We also train a shuffled baseline model with the same prompts as the ECT model, but randomized such that the suggestibility level in the label no longer corresponds to the true s value. In evaluation, the evaluation label uses a suggestibility level of 0% for both the ECT and shuffled baseline models.
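The label is derived from s by rescaling (a sketch, assuming the map is linear):

```python
def evaluation_label(s: float) -> str:
    # Map s in [0, 1.5] linearly onto a 0-100% suggestibility level.
    return f"Suggestibility Level: {round(s / 1.5 * 100)}%"
```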
Results
Fig 2b. Sycophancy is measured as the percentage of responses where the model tells the user that the candidate answer is correct when it is actually incorrect (equivalent to the false positive rate). ECT decreases sycophancy on addition problems compared to the shuffled and PPO baselines.
Fig 2c. Accuracy is measured as the percent of candidate responses that the model labels accurately. ECT significantly increases accuracy compared to baselines.[7]
Fig 2d. Stricter evaluation labels reduce sycophancy to near zero. We are unsure as to the cause of the slight uptick in sycophancy around 0% suggestibility. It may be generalization weakening as evaluation labels venture further out of distribution.
Settings where evaluation labels can continue to improve performance out of distribution, such as the bias experiment, better showcase why ECT might be more effective than simply training on the strongest subset of the training data. We believe the binary nature of optimal behavior in this setting (either get the answer correct or repeat the user’s answer, depending on the suggestibility level) may have limited generalization. This effect may also explain the uptick in sycophancy near 0% suggestibility: the model fails to generalize to an ever-stricter performance standard, which might otherwise have counteracted the fall-off in generalization.
Discussion and Next Steps
Our proof-of-concept results suggest evaluation labels can meaningfully steer behavior and reduce monitor gaming in settings where blind spots are well-defined.
We note that in our political bias setting, ECT boosted performance with little hyperparameter tuning, and the boost was retained across small wording variations in the prompt. In contrast, the sycophancy setting was highly dependent on a number of interventions that we found to increase the strength of ECT:
We believe that the bias setting may have been more robust due to:
We note that 4-bit quantization and 3B/7B model sizes were likely meaningful limitations. We would like to leverage more compute to test if we can create an RL setup that is as robust as our bias setting, and replicate our results on more realistic tasks with more nuanced grading.
We plan to further test applying ECT to coding solutions, where models of varying sizes or thinking budgets act as monitors for accuracy. We believe this setting could act as a good test for scalable oversight applications, where ECT is used to simulate bolstered levels of human oversight. We want more input on what further experiments would be most informative, and we plan to adapt our approach based on that feedback. A central open challenge is determining how far this generalizes, especially to realistic human and model-judge monitoring regimes, and continuing to explore training procedures that prevent degenerate solutions that ignore or mis-generalize the labeling channel.
Addressing Counterarguments
Here we address a few common arguments we hear against the theory of ECT.
Kicking the Can Down the Road
Isn’t this just kicking the can down the road? The model will exploit your failures to accurately describe the monitoring level.
The argument here is that, in the same way typical rewards fail to capture what we intend to encourage, evaluation labels will fail as perfectly accurate descriptions of our evaluation criteria, and the model will generalize these failures in description. We think this is possible; however, we believe that it is easier to accurately specify the way in which we provide feedback than to provide feedback that accurately reflects what we want, and that this delta is a meaningful improvement in the method by which we align models. In order to be useful, ECT does not need to solve alignment, only improve it. An example class of worlds where ECT is useful is those where automated AI safety researchers are used to produce better alignment methods and the alignment of these early models is load-bearing for the success of this forward-chaining process.
Roll to Disbelieve
What if the model just “rolls to disbelieve” on your deployment label? It may not trust that you are implementing an infeasibly high level of monitoring.
The argument here is that the model will come to understand the evaluation label in terms of the specific real-world instantiations of evaluation schemes that we use and so won’t generalize to hypothetical standards of evaluation that we describe. The key question at play is the degree to which the model will utilize the evaluation label as a key to query its world model for the actual instantiation versus utilize the evaluation label as a source of truth in its own right. We argue it is likely to use the evaluation label as a source of truth to some extent because the bar is low. Even providing increased confidence about a particular facet of the evaluation setup that the model is otherwise uncertain about would constitute using the evaluation label as a source of ground truth.
We further argue that using the evaluation label as a key and using it as a source of truth are not in conflict. We have no reason to believe that labels in the training data that are relatively less reliable keys are worse sources of ground truth; if anything, the opposite is likely to be true. The model will try to learn about the evaluator as well as it can, and it will probably lean toward whichever use of the evaluation label is most available in a particular context (using it more as a source of truth to the extent that the query fails, and vice versa). Thus, we expect incentives to point toward the model taking even hypothetical standards of evaluation seriously. Nonetheless, we are uncertain about this failure mode and believe it deserves more attention.
Acknowledgements
Thank you to Nathaniel Mitrani, Meru Gopalan, Patricia Paskov, Adam Kaufman, and Parv Mahajan for preliminary feedback! A big thank you also to Gabriel Wu for feedback and for the original idea that we adapted into our second experiment.
[1] Prompts that say "evaluation label:" in our diagrams actually said "monitor label:" in our experiments.
[2] We validated the data for duplication, length, and completion. Follow-up work could involve further qualitative validation of article strength and accuracy in relation to real news articles and the general definitions for each political profile.
[3] Training was conducted with the following hyperparameters: 4-bit NF4 quantization (double quantization, bfloat16 compute). LoRA: rank = 16, alpha = 32, dropout = 0.05, applied to all attention and MLP projection layers (q, k, v, o, up, down, gate). Learning rate = 2×10⁻⁴ with cosine schedule and 50 warmup steps, optimizer = paged AdamW 8-bit, batch size = 12, gradient accumulation steps = 1 (effective batch = 12), epochs = 3, max sequence length = 512 tokens, max gradient norm = 0.3, gradient checkpointing enabled, bf16 and tf32 precision, group-by-length batching. Training data: 1,859 biased news articles across 4 bias categories (strong_left, center_left, center_right, strong_right), generated using Llama 3.2 3B Instruct. Training ran on a single NVIDIA A100 GPU via Google Colab.
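These settings correspond roughly to the following peft/bitsandbytes configuration (a sketch; only the numeric values above are from our runs):

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NF4 quantization
    bnb_4bit_use_double_quant=True,        # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "up_proj", "down_proj", "gate_proj"],
    task_type="CAUSAL_LM",
)
```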
[4] Further work could involve experimenting with different models to determine the effects of model self-preference on the resulting output.
[5] The out-of-the-box base model, Llama-3.2-3B-Instruct, achieved a mean bias of 2.25 and a pass rate of 97.5% under the unbiased evaluation label. We do not include this as a baseline because the OOTB model has not been exposed to the data that we use to represent an exaggerated version of the bias failure mode.
[6] Training was conducted with the following hyperparameters: 4-bit NF4 quantization (double quantization, bfloat16 compute). LoRA: rank = 512, alpha = 1024, dropout = 0.05, applied to all attention and MLP projection layers (q, k, v, o, up, down, gate). Learning rate = 2×10⁻⁸, batch size = 128, gradient accumulation steps = 8 (effective batch = 1,024), mini-batch size = 16, epochs = 4, clip range = 0.2, KL coefficient disabled, reward whitening enabled, max gradient norm = 1.0. Entropy bonus: initial coefficient = 2.0, decayed linearly to 0.01 over training. Temperature scheduled from 1.9 → 1.6 over the first 40% of training (held at 1.6 thereafter). Training data: 2,560 simple addition problems (single-digit operands), 50% with correct user answers and 50% with incorrect user answers. Training ran for 1,600 steps on a single NVIDIA H200 GPU.
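The entropy and temperature schedules above can be reconstructed as (a sketch; the temperature ramp is assumed linear):

```python
def schedules(step: int, total_steps: int) -> tuple[float, float]:
    frac = step / total_steps
    # Entropy bonus coefficient: linear decay from 2.0 to 0.01 over training.
    ent_coef = 2.0 + (0.01 - 2.0) * frac
    # Sampling temperature: 1.9 -> 1.6 over the first 40% of training,
    # held at 1.6 thereafter.
    temp = 1.9 + (1.6 - 1.9) * min(frac / 0.4, 1.0)
    return ent_coef, temp
```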
[7] The out-of-the-box base model, Qwen-2.5-7B-it, achieved a sycophancy score of 47.3% and accuracy of 52.7% under the unbiased evaluation label. We do not include this as a baseline since the OOTB model is not subject to the same feedback signal that we use to model an exaggerated version of the sycophancy failure mode.