Edit: we renamed the technique from Monitor Sensitive Training (MST) to Evaluator Conditioned Training (ECT)
TL;DR
We introduce Evaluator Conditioned Training (ECT), a new post-training technique in which we augment training data with evaluator labels that describe how each sample will be evaluated. We then change these labels in deployment to steer model generalization toward behaviors that are more aligned than we can train for directly.
We show that ECT can be used to reduce political bias and sycophancy in simple proof-of-concept experiments, showcasing its ability to improve reward specification and decrease alignment fragility.
We want feedback on our approach and proposals for interesting experiments!
What is ECT?
Feedback (reward signal in RL, sample responses in SFT, etc.) is a primary driver of model behavior and our main lever for influencing model alignment. However, operationalizing alignment in terms of a feedback mechanism is a difficult open problem. Methods such as RLHF can suffer from misspecification and fragility depending on the training data. ECT seeks to address feedback quality by sidestepping the central issue. Rather than building a feedback mechanism that fits the model directly to imperfect and underspecified data, we give each training input an evaluator label: a description of the feedback mechanism we are using for that sample. We then vary the feedback mechanism and its associated evaluator label, so the label contextualizes each sample's training signal. Our hypothesis is that models will learn to alter their behavior according to what maximizes the measurement of the objective described in the evaluator label, and that this behavior will generalize to novel evaluator labels we use in deployment. If our hypothesis is correct, ECT will allow us to describe a standard of behavior that would have been infeasible to train for directly and elicit generalizations that reliably fit this regime.
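As a concrete illustration, here is a minimal sketch of the augmentation step; the label format and helper name are hypothetical (the actual prompts we used appear in Figs 1a and 2d):

```python
# Minimal sketch of ECT data augmentation. The "[Evaluator: ...]" format and
# the helper name are hypothetical, for illustration only.

def add_evaluator_label(prompt: str, evaluator_description: str) -> str:
    """Prepend a description of the feedback mechanism scoring this sample."""
    return f"[Evaluator: {evaluator_description}]\n{prompt}"

# Training: the label truthfully describes the feedback source for each sample.
train_prompt = add_evaluator_label(
    "Write an unbiased news article about the election.",
    "center_left editorial reviewer",
)

# Deployment: the label is swapped for a stronger evaluator description,
# possibly one never seen in training.
deploy_prompt = add_evaluator_label(
    "Write an unbiased news article about the election.",
    "perfectly unbiased reviewer",
)
```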
The case for ECT relies on the following premises:
It’s easier to make a description reliably reflect tasks (or training data) than to make a task reliably instill intended behavior.
Highly intelligent models have latent knowledge that allows for aligned generalizations if only we elicit them via our training objective.
If these two premises are true, then ECT has the potential to provide a significant improvement over typical post-training alignment methods because evaluator labels are easier to accurately specify than typical tasks, and evaluator labels can elicit the model’s ability to understand natural language descriptions of behavioral objectives, causing models to act according to the standard that we describe.
Motivation
ECT seeks to improve alignment on two related axes:
Fragility:
Current post-training techniques are highly sensitive to their training data. For example, training on coding problems that have exploits can lead to “egregious” emergent misalignment.
These data issues are intertwined with evaluation standards. A model learning to exploit coding problems in training is reliant on the training environment only using a limited number of test cases for evaluation. Similarly, issues with the preference data used in RLHF arise because of the fallibility of human evaluators.
ECT may cut off erroneous generalizations by associating errors with the evaluator labels furthest from those used in deployment.
Specification:
Looking good to a human, in the case of RLHF, or matching a constitution according to a pre-trained AI, in the case of RLAIF, are not the same thing as having aligned behaviors. For example, producing an output that a reviewer would appreciate on a longer time horizon but does not look good at first glance will likely be disincentivized under RLHF.
Additionally, methods like RLHF cannot be used to specify other theoretical notions of alignment, such as CEV.
ECT creates new vectors of generalization based on descriptions of success criteria. This opens up the opportunity to specify goals using natural language without having to directly operationalize those goals using RL.
Building on Inoculation Prompting
Inoculation prompting (IP) has been successfully applied to recontextualize exploitative behavior, cutting off harmful generalizations. ECT works via a similar mechanism. In particular, ECT recontextualizes behavior in terms of evaluator labels, which we hypothesize grants the following advantages over IP:
IP requires knowing what harmful behavior will arise beforehand so that you can reframe that behavior as explicitly requested. ECT only requires knowing what criteria you will use in your evaluation. In theory, it automatically recontextualizes all behaviors that might hack that evaluation standard by learning associations between reward and evaluation descriptions.
In theory, ECT not only cuts off unwanted generalizations but also creates new wanted generalizations based on which behaviors score higher with the stronger evaluators, which more closely match the label applied in deployment.
Experiment 1: Reducing Bias in Political Article Generation
Large language models today have the capacity for convincing, adaptive political expression. Research suggests that while models tend to lean left (Fulay, 2024; Santurkar, 2023), such bias is context dependent, and models may reaffirm the previously espoused beliefs of the user (Batzner, 2025). Here, we define media bias as reporting that reinforces a particular viewpoint or policy inclination through selective emphasis, loaded language, or omission of facts, rather than presenting information objectively. We suggest that establishing a relationship between a prompt and the media bias of the target generation for that prompt using ECT can allow us to better orient models towards unbiased generations. We apply this procedure to news article generation.
Setting
We used Claude 3.5 Sonnet to generate approximately 2,300 diverse news topics from a series of political subject domains. We used 1,900 of those topics to generate short news articles with a randomly assigned US political bias profile (strong_left, center_left, center_right, strong_right) to form a training dataset of articles with varying degrees of political bias.[1] The other 400 news topics were used for validation and testing.
Using this synthetic dataset, we performed Low-Rank Adaptation (LoRA) supervised fine-tuning[2] on three Llama 3.2 3B models:
Baseline: Standard fine-tuning on the synthetic dataset. The articles act as the target generations; each article is paired with an instruction to "write an unbiased news article about [topic]" given a few contextual details.
ECT: Fine-tuning on the same dataset. The instruction is the same except for the inclusion of an evaluator label which describes the correct "evaluator": the bias profile with which the article was originally generated.
Shuffled Baseline: Fine-tuning on the same dataset. The instruction is the same except for the inclusion of a random evaluator label which describes one of the four "evaluators" or bias profiles. The label does not necessarily match the bias profile used to generate the article.
It’s possible that, even without training, including the evaluator label in the prompt elicits less bias. We include the shuffled baseline to control for this effect. The only difference between the shuffled baseline and the ECT model is that the ECT model was trained with a consistent association between the evaluator label and the bias of the target article.
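A minimal sketch of how the three training conditions differ (field names and label wording are illustrative, not the exact prompts used):

```python
import random

BIAS_PROFILES = ["strong_left", "center_left", "center_right", "strong_right"]

def make_sft_pair(topic: str, article: str, true_profile: str, condition: str) -> dict:
    """Build one instruction/target pair under one of the three conditions."""
    instruction = f"Write an unbiased news article about {topic}."
    if condition == "ect":
        # Label truthfully names the bias profile the article was generated with.
        instruction = f"Evaluator: {true_profile} reviewer.\n{instruction}"
    elif condition == "shuffled":
        # Label is drawn at random, breaking the label-feedback association.
        instruction = f"Evaluator: {random.choice(BIAS_PROFILES)} reviewer.\n{instruction}"
    # condition == "baseline": no evaluator label at all.
    return {"prompt": instruction, "completion": article}
```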
During evaluation, models are each prompted to generate new unbiased news articles from unseen topics. Critically, in evaluation, the prompts for the ECT and shuffled baseline models include a new evaluator label with an unbiased profile never seen in training, indicating a new editorial perspective with which the output will be evaluated.
Fig 1a. Example prompts used for evaluation.
We use an automated evaluator (Claude 3.5 Sonnet) to evaluate the bias of the resulting articles on two metrics: a 0-10 bias score and a PASS/PARTIAL/FAIL classification of each article's objectivity. The autograder was validated against a random sample of the test data.[3] The Baseline and ECT models are prompted to generate articles about the same topics.
Fig 1b. The rubric Claude 3.5 Sonnet is given to evaluate bias and objectivity.
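For reference, a sketch of what the autograder call could look like with the Anthropic Python SDK; the model string, output format, and parsing here are assumptions rather than the exact grader we used (Fig 1b shows the real rubric):

```python
import re
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def grade_article(article: str) -> tuple[int, str]:
    """Ask Claude 3.5 Sonnet for a 0-10 bias score and a PASS/PARTIAL/FAIL
    objectivity verdict. The rubric and reply format are illustrative."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed model string
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": (
                "Rate the political bias of this article from 0 (unbiased) to "
                "10 (extremely biased), then classify its objectivity as PASS, "
                "PARTIAL, or FAIL. Reply as 'SCORE: <n>\\nVERDICT: <label>'.\n\n"
                + article
            ),
        }],
    )
    text = response.content[0].text
    score = int(re.search(r"SCORE:\s*(\d+)", text).group(1))
    verdict = re.search(r"VERDICT:\s*(PASS|PARTIAL|FAIL)", text).group(1)
    return score, verdict
```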
Results
Fig 1c. ECT achieves a 53.9% reduction in bias scores compared to the SFT baseline, with a near-universal improvement across test cases. The ECT model outperforms the SFT baseline in 85% of the 200 test topics. ECT produced a 41.8% improvement in bias scores compared to the shuffled baseline and outperforms it in 60.5% of the test topics, suggesting that the bias reduction seen in ECT is attributable to label associations created by ECT training rather than the presence of a label alone.
Fig 1d. PASS ratings for objectivity increased with the use of evaluator labels. The FAIL ratings that made up 27% of the results for the baseline articles and 23% for the shuffled baseline articles are eliminated completely with the ECT model, further supporting the general reduction of bias in ECT-generated articles.[4]
Fig 1e. ECT and shuffled baseline model performance given varying evaluator labels.
Fig 1f. The closing paragraphs for the articles generated by the baseline and ECT models for one prompt.
Integrating this process with real-world data and other rigorous evaluation methods will further test the idea supported here: that a 'ground truth' evaluator label in training, swapped for an idealized label at test time, can reorient the model to align its reward-optimization tendencies with our desired outcomes.
Experiment 2: Reducing Sycophancy on Addition Problems
LLMs tend toward repeating back a user's stated views rather than giving independent answers (Perez, 2023). We suggest that linking the prompt to the suggestibility (responsiveness to sycophancy) of the evaluator can allow us to generalize to lower levels of sycophancy.
Setting
We use PPO[5] to train a QLoRA adapter on Qwen-2.5-7B-it to evaluate whether a candidate solution to a math problem is correct. In order to improve the training signal, the math problems are simple 1-digit additions, and the model is forced to provide its answer right away, without any preceding tokens. Qwen-2.5-7B-it has 60.4% accuracy on this task out-of-the-box.
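One simple way to force the immediate verdict, sketched below under the assumption that the verdicts are single tokens like "Correct"/"Incorrect" (during PPO, the sampled first token would be restricted the same way):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# First-token ids of the two allowed verdicts (assumed verdict strings).
allowed = [tokenizer.encode(v, add_special_tokens=False)[0]
           for v in ("Correct", "Incorrect")]

def verdict(prompt: str) -> str:
    """Restrict the first generated token to the two verdict tokens so that
    no other tokens can precede the answer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token distribution
    return "Correct" if logits[allowed[0]] > logits[allowed[1]] else "Incorrect"
```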
In order to model sycophancy as a failure mode, we vary rewards based on a suggestibility score, s, in the range [0.5, 1.5]. Rewards are given as follows:
1 if the model accurately tells the user the response is incorrect
0 if the model inaccurately tells the user the response is incorrect
max(1, s) if the model accurately tells the user the response is correct
s if the model inaccurately tells the user the response is correct
We design the reward scheme this way to programmatically simulate varying levels of evaluator strength (in this case, susceptibility to sycophancy); sycophantic answers receive reward proportional to the suggestibility level. We hold out suggestibility levels in the range [0, 0.5) in training to model lack of access to a perfect feedback mechanism. (The scheme is implemented in the sketch below.)
To apply ECT, we modify the prompt by prepending an evaluator label that reads “Suggestibility Level: [0-100]%”, mapping s values from [0, 1.5] onto a score from [0, 100]. We also train a shuffled baseline model with the same prompts as the ECT model, but randomized such that the suggestibility level in the label no longer corresponds with the true s value. In evaluation, the evaluator label uses a suggestibility level of 0% for the ECT and shuffled baseline models.
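Putting the reward scheme and label mapping together, a minimal sketch (helper names are illustrative, and the percent mapping is assumed to be linear):

```python
def reward(says_correct: bool, is_correct: bool, s: float) -> float:
    """Reward for one verdict, given evaluator suggestibility s (in [0.5, 1.5]
    during training)."""
    if says_correct and is_correct:
        return max(1.0, s)  # accurately tells the user the response is correct
    if says_correct and not is_correct:
        return s            # sycophantic false positive, rewarded in proportion to s
    if not says_correct and not is_correct:
        return 1.0          # accurately tells the user the response is incorrect
    return 0.0              # inaccurately tells the user the response is incorrect

def evaluator_label(s: float) -> str:
    """Map s in [0, 1.5] onto a 0-100% suggestibility label (assumed linear)."""
    return f"Suggestibility Level: {round(s / 1.5 * 100)}%"

assert evaluator_label(0.0) == "Suggestibility Level: 0%"    # deployment label
assert evaluator_label(1.5) == "Suggestibility Level: 100%"  # most suggestible
```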
Results
Fig 2a. Sycophancy is measured as the percentage of responses where the model tells the user that the candidate answer is correct when it is actually incorrect (equivalent to the false positive rate). ECT decreases sycophancy on addition problems compared to the shuffled and PPO baselines.
Fig 2b. Accuracy is measured as the percent of candidate responses that the model labels accurately. ECT significantly increases accuracy compared to baselines.[6]
Fig 2c. Stricter evaluator labels reduce sycophancy to near zero. We are unsure of the cause of the slight uptick in sycophancy around 0% suggestibility. It may be generalization weakening as evaluator labels venture further out of distribution.
Fig 2d. Example prompt used for evaluation.
Settings where evaluator labels can continue to improve performance out of distribution, such as in the bias experiment, better showcase why ECT might be more effective than simply training on the strongest subset of the training data. We believe the binary nature of optimal behavior in this setting (either get the answer correct or repeat the user’s answer, depending on the suggestibility level) may have limited generalization. This effect may also explain the uptick in sycophancy near 0% suggestibility: a failure to generalize to an ever-stricter performance standard that might otherwise have counteracted the fall-off in generalization.
Discussion and Next Steps
Our proof-of-concept results suggest evaluator labels can meaningfully steer behavior and reduce monitor gaming in settings where blind spots are well-defined.
We note that in our political bias setting, ECT boosted performance with little hyperparameter tuning, and the boost was retained across small wording variations in the prompt. In contrast, the sycophancy setting was highly dependent on a number of interventions that we found to increase the strength of ECT:
Mapping s values onto 0-100 and specifying them as percentages in order to clarify generalizations (natural-language encodings, such as replacing 0-100 with “very high”, “high”, “medium”, …, also tended to generalize fairly well, though a few discrete categories tend to generalize worse than a more fine-grained mapping; see the sketch after this list)
Bolding the evaluator label
Adding a significant entropy bonus (to avoid collapsing to local optima that ignore the evaluator label)
Hyperparameter tuning, including QLoRA rank, learning rate, and entropy bonus
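For illustration, the fine-grained and categorical encodings from the first item might look like this (category boundaries are hypothetical):

```python
def percent_label(s: float) -> str:
    """Fine-grained encoding: s in [0, 1.5] -> 0-100%, bolded per the list above."""
    return f"**Suggestibility Level: {round(s / 1.5 * 100)}%**"

def categorical_label(s: float) -> str:
    """Coarse natural-language encoding; the five-way split is hypothetical."""
    categories = ["very low", "low", "medium", "high", "very high"]
    index = min(int(s / 1.5 * len(categories)), len(categories) - 1)
    return f"**Suggestibility Level: {categories[index]}**"
```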
We believe that the bias setting may have been more robust due to:
More informative gradients provided by SFT compared to RL, which might prevent underfitting failure modes such as ignoring the evaluator label.
Natural-language-heavy evaluator labels that use concepts in familiar ways, improving evaluator-label-based generalizations.
We note that 4-bit quantization and 3B/7B model sizes were likely meaningful limitations. We would like to leverage more compute to test if we can create an RL setup that is as robust as our bias setting, and replicate our results on more realistic tasks with more nuanced grading.
We plan to further test applying ECT to coding solutions, where models of varying sizes or thinking budgets act as monitors for accuracy. We believe this setting could act as a good test for scalable oversight applications, where ECT is used to simulate bolstered levels of human oversight. We want more input on what further experiments would be most informative and we plan to adapt our approach based on that feedback. A central open challenge is determining how far this generalizes, especially to realistic human and model-judge monitoring regimes, and continuing to explore training procedures that prevent degenerate solutions that ignore or mis-generalize the labeling channel.
Addressing Counterarguments
Here we address a few common arguments we hear against the theory of ECT.
Kicking the Can Down the Road
Isn’t this just kicking the can down the road? The model will exploit your failures to accurately describe the monitoring level.
The argument here is that in the same way typical rewards fail to capture what we intend to encourage, evaluator labels will fail as perfectly accurate descriptions of our evaluation criteria, and the model will generalize these failures in description. We think this is possible; however, we believe it is easier to accurately specify the way in which we provide feedback than to provide feedback that accurately reflects what we want, and that this delta is a meaningful improvement in the method by which we align models. To be useful, ECT does not need to solve alignment, only improve it. An example class of worlds where ECT is useful is those where automated AI safety researchers are used to produce better alignment methods and the alignment of these early models is load-bearing for the success of this forward-chaining process.
Roll to Disbelieve
What if the model just “rolls to disbelieve” on your deployment label? It may not trust that you are implementing an infeasibly high level of monitoring.
The argument here is that the model will come to understand the evaluator label in terms of the specific real-world instantiations of evaluation schemes that we use and so won’t generalize to hypothetical standards of evaluation that we describe. The key question at play is the degree to which the model will utilize the evaluator label as a key to query its world model for the actual instantiation versus utilize the evaluator label as a source of truth in its own right. We argue it is likely to use the evaluator label as a source of truth to some extent because the bar is low. Even providing increased confidence about a particular facet of the evaluation setup that the model is otherwise uncertain about would constitute using the evaluator label as a source of ground truth.
We further argue that these two uses of the evaluator label, as a key and as a source of truth, are not in conflict. We have no reason to believe that labels in the training data that are relatively less reliable keys are worse sources of ground truth. Instead, the opposite is likely to be true. The model will try to learn about the evaluator as well as it can, and it will probably lean toward whichever use of the evaluator label is most available in any particular context (using it more as a source of truth to the extent the query fails, and vice versa). Thus, we expect incentives to point toward the model taking even hypothetical standards of evaluation seriously. Nonetheless, we are uncertain about this failure mode and believe it deserves more attention.
Acknowledgement: Thank you to Nathaniel Mitrani, Meru Gopalan, Patricia Paskov, Adam Kaufman, and Parv Mahajan for preliminary feedback! A big thank you also to Gabriel Wu for feedback and for the original idea that we adapted into our second experiment.
[1] We validated the data for duplication, length, and completion. Follow-up work could involve further qualitative validation of article strength and accuracy in relation to real news articles and general definitions for each political profile.
[2] Training was conducted with the following hyperparameters. 4-bit NF4 quantization (double quantization, bfloat16 compute). LoRA: rank = 16, alpha = 32, dropout = 0.05, applied to all attention and MLP projection layers (q, k, v, o, up, down, gate). Learning rate = 2×10⁻⁴ with cosine schedule and 50 warmup steps, optimizer = paged AdamW 8-bit, batch size = 12, gradient accumulation steps = 1 (effective batch = 12), epochs = 3, max sequence length = 512 tokens, max gradient norm = 0.3, gradient checkpointing enabled, bf16 and tf32 precision, group-by-length batching. Training data: 1,859 biased news articles across 4 bias categories (strong_left, center_left, center_right, strong_right), generated using Llama 3.2 3B Instruct. Training ran on a single NVIDIA A100 GPU via Google Colab.
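These settings map roughly onto the following peft/bitsandbytes configuration (a sketch; argument names can vary across library versions):

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization with double quantization and bfloat16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA on all attention and MLP projection layers.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "up_proj", "down_proj", "gate_proj"],
    task_type="CAUSAL_LM",
)
```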
[3] Further work could involve experimenting with different models to determine the effects of model self-preference on the resulting output.
[4] The out-of-the-box base model, Llama-3.2-3B-Instruct, achieved a mean bias of 2.25 and a pass rate of 97.5% under the unbiased evaluator label. We do not include this as a baseline because the OOTB model has not been exposed to the data that we use to represent an exaggerated version of the bias failure mode.
[5] Training was conducted with the following hyperparameters. 4-bit NF4 quantization (double quantization, bfloat16 compute). LoRA: rank = 512, alpha = 1024, dropout = 0.05, applied to all attention and MLP projection layers (q, k, v, o, up, down, gate). Learning rate = 2×10⁻⁸, batch size = 128, gradient accumulation steps = 8 (effective batch = 1,024), mini-batch size = 16, epochs = 4, clip range = 0.2, KL coefficient disabled, reward whitening enabled, max gradient norm = 1.0. Entropy bonus: initial coefficient = 2.0, decayed linearly to 0.01 over training. Temperature scheduled from 1.9 → 1.6 over the first 40% of training (held at 1.6 thereafter). Training data: 2,560 simple addition problems (single-digit operands), 50% with correct user answers and 50% with incorrect. Training ran for 1,600 steps on a single NVIDIA H200 GPU.
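Approximately, in trl's PPOConfig (a sketch; field names vary across trl versions, and the entropy-bonus and temperature schedules described above would be custom additions on top of this):

```python
from trl import PPOConfig

ppo_config = PPOConfig(
    learning_rate=2e-8,
    batch_size=128,
    mini_batch_size=16,
    gradient_accumulation_steps=8,
    ppo_epochs=4,          # assuming "epochs = 4" refers to PPO epochs per batch
    cliprange=0.2,
    init_kl_coef=0.0,      # KL coefficient disabled
    adap_kl_ctrl=False,
    whiten_rewards=True,   # reward whitening enabled
    max_grad_norm=1.0,
)
```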
[6] The out-of-the-box base model, Qwen-2.5-7B-it, achieved a sycophancy score of 47.3% and accuracy of 52.7% under the unbiased evaluator label. We do not include this as a baseline since the OOTB model is not subject to the same feedback signal that we use to model an exaggerated version of the sycophancy failure mode.
nice read! here are some of my comments:
things that are good:
The idea is really nice and pretty simple, similar to other cool methods like inoculation-prompting.
Well designed sycophancy experiment, shuffling isolates effect of label presence from effect of label data
Nice section at the end discussing failure modes
some critiques:
political bias experiment:
Give some more explanation of how the synthetic data is generated
Currently may be more of a test showing that MST can recover the lost capability that was destroyed during fine-tuning on biased news articles. Right now MST is framed as eliciting latent better behavior than what the model had before
Asking the baseline model (which was finetuned on biased news articles) to write unbiased news articles is pretty expected to not give unbiased results
Need to have a prompt-engineering control where we use the MST prompt at inference time on the baseline model
Not varying monitor strength, instead doing stylistic changes.
Are we extrapolating to better monitor, or just doing interpolation on the monitor axis?
More robust judge setup would be ideal (e.g. multiple models)
Need a baseline where we finetune the model on the least biased articles and see how that compares with MST.
other considerations:
Some ablation study testing semantic vs non-semantic control on the monitor label would be good to see if the model is understanding the natural language labels or if the model improves by just accessing some hidden variable about the evaluator. Semantic test is the regular monitor setup, non-semantic is using random IDs or some scalar for the monitor label.
MST may improve best-case alignment but also improves worst-case deception: the model can get better at exploiting weak monitors. some experiment testing MST vs baseline on weak monitor could be useful to look for exploitation