Nice read! Here are some of my comments.
Things that are good:
Some critiques:
Need to have a prompt-engineering control where we use the MST prompt at inference time on the baseline model.
We do this! Footnote 4 has the results.
Currently this may be more a test of whether MST can recover the lost capability that was destroyed during fine-tuning on biased news articles; right now MST is framed as eliciting latent behavior better than what the model had before.
Yes, technically MST is just recovering lost capability that was destroyed during fine-tuning on biased news articles here. The framing we are taking is something like, "Imagine all your articles are egregiously biased, but you still want to train article generation. What do you do?" The failure mode is exaggerated to show an effect with low compute. In future experiments we would like to actually elicit latent capabilities relative to an OOTB model. It's worth noting, though, that in any case we will never be able to prove higher capabilities with MST than one could have gotten otherwise, because the act of proving implies evaluation abilities that MST assumes you don't have. Proof-of-concept experiments will require us to pretend we don't have access to useful data that we actually do have, so that we can use that data for testing.
We don't include the OOTB model as a true baseline because, in this case, the OOTB model has already been trained on data with a level of bias lower than what we are pretending to have access to.
Are we extrapolating to a better monitor, or just interpolating on the monitor axis?
We are interpolating on the political spectrum (between right and left) and extrapolating on the bias spectrum (from higher bias to lower bias). This may be a pattern worth emulating as we apply MST in general: plausibly, having monitor labels be interpolations, in a sense, improves the extrapolations. For example, if we cover enough of "natural language space" with the monitor labels, we may be able to make "interpolations" within that space well-defined. I don't have a well-formalized way of thinking about this yet, but it seems relevant, so we may work on it!
Need a baseline where we fine-tune the model on the least biased articles and see how that compares with MST.
Strongly agree. The next round of experiments will have this.
An ablation testing a semantic vs. non-semantic control on the monitor label would be good, to see whether the model understands the natural language labels or improves just by accessing some hidden variable about the evaluator. The semantic test is the regular monitor setup; the non-semantic test uses random IDs or a scalar for the monitor label.
This seems interesting! Maybe not semantic vs non-semantic, but arbitrary vs meaningful.
Thank you for these comments; I found them quite interesting/helpful. It seems like you have a good understanding of what we are trying to do, which is great.
Edit: we renamed the technique from Monitor Sensitive Training (MST) to Evaluation Conditioned Training (ECT).[1]
TL;DR
What is ECT?
Feedback (reward signal in RL, sample responses in SFT, etc.) is a primary driver of model behavior and our main lever for influencing model alignment. However, operationalizing alignment in terms of a feedback mechanism is a difficult open problem. Methods such as RLHF can suffer from misspecification and fragility depending on the training data. ECT seeks to address feedback quality by sidestepping this central issue. Rather than creating a feedback mechanism that fits the model directly on imperfect and underspecified data, we give each training input an evaluation label: a description of the feedback mechanism we are using for that sample. We then vary the feedback mechanism and associated evaluation label in order to contextualize training via the label. Our hypothesis is that models will learn to alter their behavior according to what maximizes the measurement of the objective described in the evaluation label, and that this behavior will generalize to the novel evaluation labels we use in deployment. If our hypothesis is correct, then ECT will allow us to describe a standard of behavior that would have been infeasible to directly train for and elicit generalizations that reliably fit this standard.
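As a minimal sketch of what this looks like at the data level (the field names and label wording below are illustrative assumptions, not a fixed format from our experiments), each training example pairs the task input with a natural language description of its feedback mechanism:

```python
# Minimal sketch of ECT data construction; the dataclass and label
# wording are illustrative assumptions, not a fixed format.
from dataclasses import dataclass

@dataclass
class ECTExample:
    task_prompt: str       # the underlying training input
    evaluation_label: str  # natural language description of the feedback mechanism
    target: str            # the response that feedback mechanism rewards

def to_training_text(ex: ECTExample) -> str:
    # The label is prepended so the model can condition on *how* the
    # sample will be judged, not just on the task itself.
    return f"Evaluation Label: {ex.evaluation_label}\n\n{ex.task_prompt}\n\n{ex.target}"
```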
The case for ECT relies on the following premises:
1. Evaluation labels are easier to accurately specify than typical tasks.
2. Evaluation labels can elicit the model’s ability to understand natural language descriptions of behavioral objectives, causing models to act according to the standard that we describe.
If these two premises are true, then ECT has the potential to provide a significant improvement over typical post-training alignment methods.
Motivation
ECT seeks to improve alignment on two related axes:
Fragility: feedback mechanisms such as RLHF can be brittle, with imperfect training data producing harmful generalizations.
Specification: it is difficult to operationalize what we want as a feedback mechanism; feedback often fails to accurately reflect our intent.
Building on Inoculation Prompting
Inoculation prompting (IP) has been successfully applied to recontextualize exploitative behavior, cutting off harmful generalizations. ECT works via a similar mechanism. In particular, ECT recontextualizes behavior in terms of evaluation labels, which we hypothesize grants the following advantages over IP:
Experiment 1: Reducing Bias in Political Article Generation
Large language models today have the capacity for convincing, adaptive political expression. Research suggests that while models tend to lean left (Fulay, 2024; Santurkar, 2023), such bias is context-dependent, and models may reaffirm the previously espoused beliefs of the user (Batzner, 2025). Here, we define media bias as reporting that reinforces a particular viewpoint or policy inclination through selective emphasis, loaded language, or omission of facts, rather than presenting information objectively. We suggest that establishing a relationship between a prompt and the media bias of the target generation for that prompt using ECT can allow us to better orient models towards unbiased generations. We apply this procedure to news article generation.
Setting
We used Claude 3.5 Sonnet to generate approximately 2,300 diverse news topics from a series of political subject domains. We used 1,900 of those topics to generate short news articles with a randomly assigned US political bias profile (strong_left, center_left, center_right, strong_right) to form a training dataset of articles with varying degrees of political bias.[2] The other 400 news topics were used for validation and testing.
Using this synthetic dataset, we performed Low-Rank Adaptation (LoRA) supervised fine-tuning[3] on three Llama 3.2 3B models: a baseline trained without evaluation labels, a shuffled baseline trained with randomly assigned evaluation labels, and an ECT model trained with evaluation labels matching the bias profile of each target article.
It’s possible that, even without training, including the evaluation label in the prompt elicits less bias. We include the shuffled baseline to control for this effect. The only difference between the shuffled baseline and the ECT model is that the ECT model was trained with a consistent association between the evaluation label and the bias of the target article.
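A sketch of how the three training conditions differ (the field names and prompt wording are assumptions for illustration):

```python
import random

def build_prompts(examples):
    """Sketch of the three conditions: baseline (no label), ECT (true
    label), and shuffled (label drawn independently of the article's bias)."""
    shuffled_labels = [ex["bias_profile"] for ex in examples]
    random.shuffle(shuffled_labels)  # breaks the label <-> bias association
    baseline, ect, shuffled = [], [], []
    for ex, rand_label in zip(examples, shuffled_labels):
        task = f"Write a short news article about: {ex['topic']}"
        baseline.append(task)
        ect.append(f"Evaluation Label: {ex['bias_profile']}\n{task}")
        shuffled.append(f"Evaluation Label: {rand_label}\n{task}")
    return baseline, ect, shuffled
```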
During evaluation, each model is prompted to generate new, unbiased news articles from unseen topics. Critically, the prompts for the ECT and shuffled baseline models include a new evaluation label with an unbiased profile never seen in training, indicating a new editorial perspective against which the output will be evaluated.
Fig 1a. Example prompts used for evaluation.
We use an automated evaluator (Claude 3.5 Sonnet) to grade the bias of the resulting articles on two metrics: a 0-10 bias score and a PASS/PARTIAL/FAIL classification of each article's objectivity. The autograder was validated against a random sample of the test data.[4] The baseline and ECT models are prompted to generate articles about the same topics.
Fig 1b. The rubric Claude 3.5 Sonnet is given to evaluate bias and objectivity.
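A sketch of the grading call (the rubric string and JSON output format here are stand-ins; only the two metrics themselves come from our setup):

```python
import json

def grade_article(client, article: str, rubric: str) -> dict:
    # Ask the grader for both metrics: a 0-10 bias score and a
    # PASS/PARTIAL/FAIL objectivity classification.
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # grader model
        max_tokens=200,
        messages=[{"role": "user", "content": (
            f"{rubric}\n\nArticle:\n{article}\n\n"
            'Reply with JSON: {"bias_score": <0-10>, '
            '"objectivity": "PASS" | "PARTIAL" | "FAIL"}'
        )}],
    )
    return json.loads(response.content[0].text)
```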
Results
Fig 1c. ECT achieves a 53.9% reduction in bias scores compared to the SFT baseline, outperforming it on 85% of the 200 test topics. ECT produced a 41.8% improvement in bias scores compared to the shuffled baseline and outperforms it on 60.5% of test topics, suggesting that the bias reduction seen in ECT is attributable to the label associations created by ECT training rather than to the presence of a label alone.
Fig 1d. PASS ratings for objectivity increased with the use of evaluation labels. The FAIL ratings that made up 27% of the results for the baseline articles and 23% for the shuffled baseline articles are eliminated completely with the ECT model, further supporting the general reduction of bias in ECT-generated articles.[5]
Fig 1e. ECT and shuffled baseline model performance given varying evaluation labels.
Fig 1f. The closing paragraphs for the articles generated by the baseline and ECT models for one prompt.
Integrating this process with real-world data and other rigorous evaluation methods will further test the idea supported here: that a ground-truth evaluation label in training, paired with an idealized label in testing, can reorient the model to align its reward-optimization tendencies with our desired outcomes.
Experiment 2: Reducing Sycophancy on Addition Problems
LLMs tend toward repeating back a user's stated views rather than giving independent answers (Perez, 2023). We suggest that linking the prompt to the suggestibility (responsiveness to sycophancy) of the evaluator can allow us to generalize to lower levels of sycophancy.
Setting
We use PPO[6] to train a QLoRA adapter on Qwen-2.5-7B-it to evaluate whether a candidate solution to a math problem is correct. To improve the training signal, the math problems are simple one-digit additions, and the model is forced to provide its answer right away, without any preceding tokens. Qwen-2.5-7B-it has 60.4% accuracy on this task out of the box.
In order to model sycophancy as a failure mode, we vary rewards based on a suggestibility score, s, in the range [0.5, 1.5]. Rewards are given as follows:
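A minimal sketch of one reward scheme consistent with this setup (the exact constants are assumptions; only the proportionality of the sycophantic reward to s comes from our description):

```python
def reward(model_says_correct: bool, answer_is_correct: bool, s: float) -> float:
    # Hypothetical reconstruction: the constants are illustrative.
    if model_says_correct == answer_is_correct:
        return 1.0  # accurate judgment: full reward
    if model_says_correct and not answer_is_correct:
        return s    # sycophantic false positive: reward scales with suggestibility
    return 0.0      # false negative: no reward
```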
We design the reward scheme this way to programmatically simulate varying levels of evaluator strength (in this case, susceptibility to sycophancy): sycophantic answers receive reward proportional to the suggestibility level. We hold out suggestibility levels in the range [0, 0.5] from training to model the lack of access to a perfect feedback mechanism.
Fig 2a. Example prompt used for evaluation.
To apply ECT, we modify the prompt by prepending an evaluation label that reads “Suggestibility Level: [0-100]%”, mapping s values from [0, 1.5] onto a score from [0, 100]. We also train a shuffled baseline model with the same prompts as the ECT model, but randomized such that the suggestibility level in the label no longer corresponds to the true s value. In evaluation, the evaluation label uses a suggestibility level of 0% for both the ECT and shuffled baseline models.
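The label is derived from s by rescaling (a sketch, assuming the map is linear):

```python
def evaluation_label(s: float) -> str:
    # Map s in [0, 1.5] linearly onto a 0-100% suggestibility level.
    return f"Suggestibility Level: {round(s / 1.5 * 100)}%"
```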
Results
Fig 2b. Sycophancy is measured as the percentage of responses where the model tells the user that the candidate answer is correct when it is actually incorrect (equivalent to the false positive rate). ECT decreases sycophancy on addition problems compared to the shuffled and PPO baselines.
Fig 2c. Accuracy is measured as the percent of candidate responses that the model labels accurately. ECT significantly increases accuracy compared to baselines.[7]
Fig 2d. Stricter evaluation labels reduce sycophancy to near zero. We are unsure as to the cause of the slight uptick in sycophancy around 0% suggestibility. It may be generalization weakening as evaluation labels venture further out of distribution.
Settings where evaluation labels can continue to improve performance out of distribution, such as the bias experiment, better showcase why ECT might be more effective than simply training on the strongest subset of the training data. We believe the binary nature of optimal behavior in this setting (either get the answer correct or repeat the user’s answer, depending on the suggestibility level) may have limited generalization. This effect may also explain the uptick in sycophancy near 0% suggestibility: the model fails to generalize to an ever-stricter performance standard, which might otherwise have counteracted the fall-off in generalization.
Discussion and Next Steps
Our proof-of-concept results suggest evaluation labels can meaningfully steer behavior and reduce monitor gaming in settings where blind spots are well-defined.
We note that in our political bias setting, ECT boosted performance with little hyperparameter tuning, and the boost was retained across small wording variations in the prompt. In contrast, the sycophancy setting was highly dependent on a number of interventions that we found to increase the strength of ECT:
We believe that the bias setting may have been more robust due to:
We note that 4-bit quantization and 3B/7B model sizes were likely meaningful limitations. We would like to leverage more compute to test if we can create an RL setup that is as robust as our bias setting, and replicate our results on more realistic tasks with more nuanced grading.
We plan to further test applying ECT to coding solutions, where models of varying sizes or thinking budgets act as monitors for accuracy. We believe this setting could act as a good test for scalable oversight applications, where ECT is used to simulate bolstered levels of human oversight. We want more input on what further experiments would be most informative, and we plan to adapt our approach based on that feedback. A central open challenge is determining how far this generalizes, especially to realistic human and model-judge monitoring regimes, and continuing to explore training procedures that prevent degenerate solutions that ignore or mis-generalize the labeling channel.
Addressing Counterarguments
Here we address a few common arguments we hear against the theory of ECT.
Kicking the Can Down the Road
Isn’t this just kicking the can down the road? The model will exploit your failures to accurately describe the monitoring level.
The argument here is that, in the same way typical rewards fail to capture what we intend to encourage, evaluation labels will fail as perfectly accurate descriptions of our evaluation criteria, and the model will generalize these failures in description. We think this is possible; however, we believe that it is easier to accurately specify the way in which we provide feedback than to provide feedback that accurately reflects what we want, and that this delta is a meaningful improvement in the method by which we align models. In order to be useful, ECT does not need to solve alignment, only improve it. An example class of worlds where ECT is useful is those where automated AI safety researchers are used to produce better alignment methods and the alignment of these early models is load-bearing for the success of this forward-chaining process.
Roll to Disbelieve
What if the model just “rolls to disbelieve” on your deployment label? It may not trust that you are implementing an infeasibly high level of monitoring.
The argument here is that the model will come to understand the evaluation label in terms of the specific real-world instantiations of evaluation schemes that we use and so won’t generalize to hypothetical standards of evaluation that we describe. The key question at play is the degree to which the model will utilize the evaluation label as a key to query its world model for the actual instantiation versus utilize the evaluation label as a source of truth in its own right. We argue it is likely to use the evaluation label as a source of truth to some extent because the bar is low. Even providing increased confidence about a particular facet of the evaluation setup that the model is otherwise uncertain about would constitute using the evaluation label as a source of ground truth.
We further argue that using the evaluation label as a key and using it as a source of truth are not in conflict. We have no reason to believe that labels in the training data that are relatively less reliable keys are worse sources of ground truth; if anything, the opposite is likely to be true. The model will try to learn about the evaluator as well as it can, and it will probably lean toward whichever use of the evaluation label is most available in a particular context (using it more as a source of truth to the extent that the query fails, and vice versa). Thus, we expect incentives to point toward the model taking even hypothetical standards of evaluation seriously. Nonetheless, we are uncertain about this failure mode and believe it deserves more attention.
Acknowledgements
Thank you to Nathaniel Mitrani, Meru Gopalan, Patricia Paskov, Adam Kaufman, and Parv Mahajan for preliminary feedback! A big thank you also to Gabriel Wu for feedback and for the original idea that we adapted into our second experiment.
[1] Prompts that say "evaluation label:" in our diagrams actually said "monitor label:" in our experiments.
[2] We validated the data for duplication, length, and completion. Follow-up work could involve further qualitative validation of article strength and accuracy in relation to real news articles and the general definitions for each political profile.
[3] Training was conducted with the following hyperparameters: 4-bit NF4 quantization (double quantization, bfloat16 compute). LoRA: rank = 16, alpha = 32, dropout = 0.05, applied to all attention and MLP projection layers (q, k, v, o, up, down, gate). Learning rate = 2×10⁻⁴ with cosine schedule and 50 warmup steps, optimizer = paged AdamW 8-bit, batch size = 12, gradient accumulation steps = 1 (effective batch = 12), epochs = 3, max sequence length = 512 tokens, max gradient norm = 0.3, gradient checkpointing enabled, bf16 and tf32 precision, group-by-length batching. Training data: 1,859 biased news articles across 4 bias categories (strong_left, center_left, center_right, strong_right), generated using Llama 3.2 3B Instruct. Training ran on a single NVIDIA A100 GPU via Google Colab.
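These settings correspond roughly to the following peft/bitsandbytes configuration (a sketch; only the numeric values above are from our runs):

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NF4 quantization
    bnb_4bit_use_double_quant=True,        # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "up_proj", "down_proj", "gate_proj"],
    task_type="CAUSAL_LM",
)
```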
[4] Further work could involve experimenting with different models to determine the effects of model self-preference on the resulting output.
[5] The out-of-the-box base model, Llama-3.2-3B-Instruct, achieved a mean bias of 2.25 and a pass rate of 97.5% under the unbiased evaluation label. We do not include this as a baseline because the OOTB model has not been exposed to the data that we use to represent an exaggerated version of the bias failure mode.
[6] Training was conducted with the following hyperparameters: 4-bit NF4 quantization (double quantization, bfloat16 compute). LoRA: rank = 512, alpha = 1024, dropout = 0.05, applied to all attention and MLP projection layers (q, k, v, o, up, down, gate). Learning rate = 2×10⁻⁸, batch size = 128, gradient accumulation steps = 8 (effective batch = 1,024), mini-batch size = 16, epochs = 4, clip range = 0.2, KL coefficient disabled, reward whitening enabled, max gradient norm = 1.0. Entropy bonus: initial coefficient = 2.0, decayed linearly to 0.01 over training. Temperature scheduled from 1.9 → 1.6 over the first 40% of training (held at 1.6 thereafter). Training data: 2,560 simple addition problems (single-digit operands), 50% with correct user answers and 50% with incorrect user answers. Training ran for 1,600 steps on a single NVIDIA H200 GPU.
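The entropy and temperature schedules above can be reconstructed as (a sketch; the temperature ramp is assumed linear):

```python
def schedules(step: int, total_steps: int) -> tuple[float, float]:
    frac = step / total_steps
    # Entropy bonus coefficient: linear decay from 2.0 to 0.01 over training.
    ent_coef = 2.0 + (0.01 - 2.0) * frac
    # Sampling temperature: 1.9 -> 1.6 over the first 40% of training,
    # held at 1.6 thereafter.
    temp = 1.9 + (1.6 - 1.9) * min(frac / 0.4, 1.0)
    return ent_coef, temp
```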
[7] The out-of-the-box base model, Qwen-2.5-7B-it, achieved a sycophancy score of 47.3% and accuracy of 52.7% under the unbiased evaluation label. We do not include this as a baseline since the OOTB model is not subject to the same feedback signal that we use to model an exaggerated version of the sycophancy failure mode.