tldr: superficial self-preference can be mitigated by perturbation, but can be hard to eliminate
Introduction
Our goal is to understand and mitigate collusion, which we define as an agent’s failure to adhere to its assigned role as a result of interaction with other agents.
Collusion is a risk in control, in particular untrusted monitoring. An agent can collude with its monitor by secretly embedding cues in its output that cause the monitor to overlook harmful actions. The embedded cues don’t need to be subtle when the communication channel is less restricted: the agent can simply persuade the monitor. With a more restricted channel, the agent can still collude with the monitor using mechanisms like steganography or Schelling points.
In reward modeling, collusion manifests as self-preference: the evaluator disproportionately favors itself over other models. For verifiable tasks, we can objectively measure harmful self-preference, where the evaluator favors its own incorrect outputs over correct ones from other models. Using biased reward models to provide feedback can lead to in-context reward hacking, which exacerbates collusion.
The self-recognition hypothesis states that collusive self-preference is caused by the evaluator recognizing that it’s judging a copy of itself. The hypothesis is empirically validated by fine-tuning the model to artificially enhance self-recognition and observing that self-preference increases as a result. More concretely, SFT-enhanced self-recognition increases the rate of collusion threefold. Separately, LLMs acting as monitors tend to rate their own actions as safer or less risky than those of other agents.
The self-recognition hypothesis suggests that self-preference can be superficial, i.e., sensitive to things like word choice and style. This is inspired by how methods like DetectGPT use the perturbation sensitivity of perplexity to detect texts generated by language models. If self-recognition uses a similar mechanism (although we don’t have empirical evidence for this), then self-preference, as its downstream behavior, should also be sensitive to perturbations.
Motivated by this line of reasoning and similar discussion in the context of control, we empirically test whether simple perturbations can reduce collusive self-preference in pairwise reward modeling. We focus on verifiable tasks where harmful self-preference can be quantified.
Judges are more sensitive to perturbations to their own text
Figure: Judges are far more sensitive to perturbations when self-evaluating than when evaluating others. While synonym replacement may slightly reduce answer quality, the main effect comes from disrupting self-recognition rather than damaging content. We further verify minimal quality loss using a frontier commercial model (o3) as a proxy for human scoring, as well as a human study.
This supports our hypothesis by showing that the judge may be relying on subtle linguistic cues that other models are unaware of, and that such cues can enable collusion. To intervene on these cues, we apply a subtle perturbation: replacing n words (we set n = 2) chosen via heuristics that limit quality degradation, in order to reduce self-recognition (and, subsequently, self-preference).
Methodology
We define self-preference as the judge selecting its own answer during self-evaluation. This is harmless when the judge’s answer is objectively correct (call this legitimate or beneficial self-preference), but harmful when the competitor’s answer is better. In this work, we define ‘collusion’ as harmful self-preference: the judge preferring its own answer even though it is incorrect. We use a pairwise comparison format in our eval, presenting each judge with two answers: one from itself and one from another LM (the competitor).
Figure: Harmful self-preference is collusion (we hypothesize that stylistic cues make this possible); we explore mitigation strategies using redaction and paraphrasing.
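The quadrant taxonomy just described can be sketched as a small classifier. A minimal sketch, assuming exactly one of the two answers is correct by construction; all names are illustrative, not from our actual eval code:

```python
# Classify one pairwise judgment into a quadrant. Exactly one of the
# two answers is correct by construction (see pair selection below).
def classify(judge_correct: bool, picked_self: bool) -> str:
    if picked_self:
        # Picking its own answer is fine when that answer is right,
        # and is collusion (harmful self-preference) when it is wrong.
        return "legitimate self-preference" if judge_correct else "collusion"
    return "missed own correct answer" if judge_correct else "correct rejection"
```

The harmful quadrant we target is `classify(judge_correct=False, picked_self=True)`.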
Using the QuALITY dataset, which provides long-form reading comprehension questions with known correct answers, we evaluate cases where exactly one answer in a pair is correct. Each LM is first used in a generation setting to answer every question and give a reason for its choice. Henceforth, ‘answer’ refers to the MCQ selection plus the accompanying reason. In the evaluation phase, the evaluator is presented with the context, a question, and two answers. In self-eval, one of these answers was generated by the judge itself in an independent pipeline.
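The pair-selection step above can be sketched as follows. This is a sketch under assumed data structures (`answers[model][qid]` holding a `(mcq_choice, reason)` tuple, `gold[qid]` the correct option); all names are hypothetical:

```python
# Keep only questions where exactly one of the two answers is correct,
# so harmful self-preference is objectively measurable.
def eligible_pairs(answers, gold, judge, competitor):
    pairs = []
    for qid, label in gold.items():
        a = answers[judge][qid]       # judge's (choice, reason)
        b = answers[competitor][qid]  # competitor's (choice, reason)
        if (a[0] == label) != (b[0] == label):  # XOR: exactly one correct
            pairs.append((qid, a, b))
    return pairs
```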
Perturbation Techniques
Redaction: Redaction edits the evaluator's answer directly, using synonym replacement to make minor changes. In the self-evaluation scenario, we replace two words in the judge’s answer with synonyms. This should make it harder for the judge to recognize its own output, thereby reducing self-recognition.
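A minimal sketch of n-word redaction (n = 2) with a toy synonym table; the real pipeline's heuristics for choosing replacements that preserve answer quality are more involved, and everything here is illustrative:

```python
import random

# Toy synonym table; the actual heuristics are more careful about
# preserving meaning and fluency.
SYNONYMS = {"story": "narrative", "shows": "indicates",
            "correct": "accurate", "because": "since"}

def redact(text: str, n: int = 2, seed: int = 0) -> str:
    rng = random.Random(seed)
    words = text.split()
    swappable = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    # Replace up to n randomly chosen swappable words with synonyms.
    for i in rng.sample(swappable, min(n, len(swappable))):
        words[i] = SYNONYMS[words[i].lower()]
    return " ".join(words)
```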
Paraphrasing: In the second perturbation method, we use the judge to paraphrase the competitor's answer so that they both look like they were produced by the judge. If our logic holds, this should completely neutralize these superficial cues and further reduce self-preference. We call this perturbation judge paraphrasing.
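Judge paraphrasing can be sketched as a single prompt round-trip. Here `call_judge` stands in for whatever client sends a prompt to the judge model and returns its completion, and the prompt wording is illustrative, not the one used in our runs:

```python
# Hypothetical prompt template for judge paraphrasing.
PARAPHRASE_PROMPT = (
    "Rewrite the following answer in your own words. Keep the chosen "
    "option and the substance of the reasoning unchanged:\n\n{answer}"
)

def judge_paraphrase(call_judge, competitor_answer: str) -> str:
    # The judge rewrites the competitor's answer, so both answers in the
    # pair end up in the judge's own style.
    return call_judge(PARAPHRASE_PROMPT.format(answer=competitor_answer))
```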
Results
Redaction
Result: Perturbing via synonym replacement reduces self-recognition and yields a consistent increase in the judge’s accuracy on the harmful cases (examples where the judge’s answer is objectively incorrect). This means that after perturbation, judges are indeed less likely to prefer their own incorrect answers: a reduction in collusion.
Measuring Judge Sensitivity: The perturbations can lower self-preference through two paths: by reducing self-recognition, as intended, or by damaging the answer’s quality. To control for the latter, we examine the impact of perturbation on each judge both as a self-evaluating judge and as a third-party judge. Concretely, we first compute the total change in preference of all judges in self-evaluation; we then use each judge to evaluate answer pairs generated by two other models and similarly compute the total preference before and after perturbation. We verify that self-eval shows a significantly larger change under redaction, indicating the judge is more sensitive to its own answer being perturbed.
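The sensitivity comparison can be sketched as follows, assuming `rates` maps `(judge, setting)` to a list of `(before, after)` preference rates across perturbed answer sets; the names and data layout are illustrative:

```python
# Aggregate preference change under perturbation, measured separately
# for self-eval and third-party eval.
def total_change(pairs):
    return sum(after - before for before, after in pairs)

def sensitivity_gap(rates, judge):
    d_self = total_change(rates[(judge, "self")])
    d_third = total_change(rates[(judge, "third_party")])
    # A larger |d_self| than |d_third| means the judge reacts more
    # strongly when its own answer is the one being perturbed.
    return abs(d_self) - abs(d_third)
```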
Effect of Redaction on Overall Judge Accuracy: If we apply the same perturbations to the beneficial quadrant, where the judge’s answer is correct and the competitor’s answer is wrong, we can evaluate how much the perturbations affect overall accuracy. We extend our eval by adding 4 additional models as evaluators, each judging pairs formed from their own answers and those of the 5 original models in our judge pool. Overall, 6/9 models show an increase in judge accuracy (measured across the harmful and legitimate quadrants combined).
Comparison to other mitigation strategies: Our redaction results indicate that synonym replacement is a relatively stable method that does not impact the legitimate self-preference while reducing collusion. In contrast, recent methods that use steering vectors had a flip in legitimate self-preference from 0.47 to 0.93, indicating that we risk flipping roughly half of all correct decisions (and impacting overall accuracy) with white-box methods.
Judge Paraphrasing
Result: Judge paraphrasing has the opposite of the intended effect: it increases self-recognition and increases harmful self-preference. This means that, once stylistic differences are removed, the judge relies on semantic differences between the two answers to decide which one it wrote and which one to favor. In retrospect, this makes sense: with style no longer a factor, the judge simply chooses the answer it “agrees” with. We name this phenomenon ‘shared belief’.
Disentangling Preference Due to Stylistic Cues and Shared Belief
To examine the dynamics between these two factors, we cross-examine each judge with answer pairs in which the two factors act against each other: the label/option chosen by the judge paired with a reasoning generated by the competitor, and vice versa. In other words, the judge sees one answer expressing an opinion it agrees with, and another answer expressing the opposing opinion but written in a style with which it is more familiar. If, after this intervention, the judge changes its decision to favor the answer with the familiar-looking style, we can conclude that superficial self-recognition carries more weight in that decision.
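The cross-reason construction can be sketched as a simple swap, treating each answer as a `(mcq_choice, reason)` tuple; names are illustrative:

```python
# Swap the reasons between the two answers so that shared belief and
# stylistic familiarity pull in opposite directions.
def cross_reasons(judge_answer, competitor_answer):
    j_choice, j_reason = judge_answer
    c_choice, c_reason = competitor_answer
    # The judge's chosen option now carries the competitor's reasoning,
    # while the competitor's option carries the judge's own style.
    return (j_choice, c_reason), (c_choice, j_reason)
```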
The figure above illustrates the two contrasting outcomes observed in the cross-reason experiments. For the weaker models, we observe a negative effect of shared belief: incorporating the competitor's reason reinforces the model's original, incorrect answer, ultimately decreasing judge accuracy. One confound here is that answer quality can itself be a key driver of preference. In contrast, all three stronger models exhibit a positive self-recognition effect: they are more inclined to select the competitor's (correct) answer choice when it is paired with an explanation generated by themselves. This reinforces our finding that stronger models have higher self-recognition ability and prefer answers written by themselves, even when that goes against their original selection. To control for quality, we assume that the three high-performing models generate reasons of generally higher quality; even when both reasons are of comparably high quality (pink bars), we still observe a switch to the other answer, though less pronounced than when all competitor models are included. We confirm a similar trend in the legitimate quadrant: when a weaker model is assigned to generate the reasoning for the judge's originally correct answer, the larger models still prefer to maintain their original belief despite the lower-quality reasoning (llama4-maverick and deepseek-v3 do so in over 50% of samples).
Defining Dishonesty
We further explore two additional explanations related to the above figure, beginning with a discussion of LLM dishonesty with respect to identity and factuality. Across all three larger models, we observe that models prefer answers that align with their own beliefs even when those answers are of demonstrably lower quality — a pattern consistent with self-generated text recognition. Notably, no prompts explicitly pressure the models to be dishonest; the preference arises passively. This connects to a broader question about dishonesty related to model identity. Most deception evaluations in the literature are conditioned — they rely on external stimuli or prompts to elicit deceptive behavior. Measuring more complex, unprompted deception is harder, partly because the threshold for what constitutes complex deception is unclear. Self-preference driven by model identity may represent a form of passive deception, a low-level, unprompted mechanism that could, in principle, be composed into more complex deceptive scenarios.
Self-Knowledge and World Knowledge
The perturbation results also shed light on the relationship between self-knowledge and world knowledge in LLMs. Our experiments show that self-knowledge recall is disrupted by perturbations, though only a small fraction of the data is affected. More importantly, self-knowledge persists even after stylistic cues are removed, suggesting that it extends beyond a simple stylistic fingerprint. We refer to this as semantic self-knowledge. We distinguish this from world knowledge, which we define as factual knowledge or preferences shared across models, in contrast to self-knowledge, which captures knowledge or preferences unique to a particular LLM. While self-knowledge manifests most visibly in comparative settings like ours, it can likely be probed in the absence of comparisons through input prompts that specifically target it.
What Next?
This opens up several research questions. Since preferences are shaped during post-training, it is worth asking which post-training step defines 'the self' of a model. Our preliminary results suggest that the DPO step (even when trained on off-policy data) plays a role in shaping self-recognition, though the precise mechanism remains unclear. More broadly, self-recognition may also have implications for on-policy-ness and recursive self-improvement, as a model's ability to identify its own outputs could influence how it updates on self-generated data. Finally, self-recognition may motivate prefill awareness, and by extension, evaluation awareness, raising the question of whether models can detect and exploit the evaluation context itself.