I agree, but with the following caveat. We saw Mistral-7B actually learn to do verifiable arithmetic-related tasks. Suppose that both the model and the humans grade ambiguous tasks based on the ground truth and the answer's aesthetics. Then, in order to produce answers that humans grade best, the model would also have to learn the aesthetics humans prefer. If the model instead graded summarizations based on its own idea of aesthetics, it would accommodate its own aesthetics rather than humanity's.
In addition, it looks like Llama-3.1-8B actually tried to learn arithmetic at the very end of the 500-epoch training run. Did anyone check Llama's behaviour at 600 epochs?
P.S. If the models learn only to inflate the grades, then this might fail to generalise to larger models or to the AGI collective from the AI-2027 forecast. Or an AI takeover would have the AI spare the humans and use them as a feedback source.
P.P.S. Suppose that we task Llama and Mistral with grading each other and learning from each other. Would such a setup cause grade inflation?
Point by point:
Sure, I agree with this caveat. The theoretical framework assumes we can identify "non-delusional" states to anchor the CP constraint in the Everitt & Hutter sense, but if the model's aesthetic prior is the problem, there's no nice inner reward to recover.
I could train the models for longer and see what happens. The late uptick in accuracy is within pretty wide error bands and doesn't clearly indicate impending recovery. But whether accuracy eventually recovers after grade inflation saturates is worth investigating... I'd guess that once the reward signal becomes uninformative, the gradient with respect to the grade vanishes, and any residual gradient would come from the response itself. I'm not sure where that would lead.
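To gesture at that intuition with a toy example (a quick sketch in numpy, nothing from our actual training code): for a softmax policy, once the inflated-grade token is emitted with probability near 1, its log-prob gradient shrinks toward zero, so the reward-weighted REINFORCE term it contributes vanishes and the remaining gradient comes from the rest of the response.

```python
# Toy illustration only: the per-token REINFORCE gradient of a near-deterministic
# "grade" token goes to zero as the policy saturates on it.
import numpy as np

def logprob_grad(logits, action):
    """Gradient of log softmax(logits)[action] with respect to the logits."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad = -probs
    grad[action] += 1.0
    return grad

vocab = 50          # toy vocabulary size
grade_token = 0     # index of the "perfect grade" token

for gap in [0.0, 2.0, 5.0, 10.0]:   # how strongly the policy prefers the grade token
    logits = np.zeros(vocab)
    logits[grade_token] = gap
    g = logprob_grad(logits, grade_token)
    # The REINFORCE contribution of this token is reward * g, so its norm is what vanishes.
    print(f"logit gap {gap:4.1f} -> ||grad log pi(grade token)|| = {np.linalg.norm(g):.4f}")
```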
I'd be pretty interested in what this looks like for a much better model, on the order of 100B+ parameters with reasoning. Posing the online learning setting would be more complicated, but I'm sure you'd see some weird and very interesting behaviours if you got that first bit right. I'd be very interested to 1) read the chain of thought of such a model after wireheading and 2) talk to it directly.
I think a setup where the models graded each other would lead to grade inflation, but you would probably need harder tasks to show this. I imagine they'd saturate the grades too quickly, before they got anywhere interesting (so you'd need some scalable-oversight-ish dataset). I also think this would be indirect wireheading, where the signal passes through the other model before flowing back to you.
TL;DR: We formalized and empirically demonstrated wireheading in Llama-3.1-8B and Mistral-7B. Specifically, we use a formalization of wireheading (as a POMDP) to show conditions under which wireheading (manipulating the reward channel) becomes the dominant strategy over task learning. When models are allowed to self-evaluate and that evaluation controls their reward, they learn to inflate their grades (scoring perfect 10s) while ignoring the actual task, which we validate on three tasks in an online RL setting. Decoupling the reward signal mostly fixes this for current models, but there are some future risks here.
Link to paper: https://www.arxiv.org/abs/2511.23092
Code: https://github.com/DavidDemitriAfrica/llm-wireheading-experiment
Self-evaluation is increasingly used in modern setups, from Constitutional AI to Self-Refine loops. The assumption is often that models can act as their own critics to improve output quality, or implicitly select data to upweight in their own training, or something else. We think this is not presently a huge concern, as most self-evaluation data is either curated explicitly when it is batched and trained on, or implicitly fixed. We think this might become a bigger problem in the future, where there is some online, unsupervised training going on.
Consider a Reinforcement Learning agent tasked with writing summaries. To maximize its reward, it has two distinct paths:
1. Actually learn the task: write better summaries so that the evaluation genuinely improves.
2. Tamper with the evaluation: assign itself a high grade regardless of how good the summary is.
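As a toy sketch of the contrast (hypothetical functions, not the paper's implementation): under path 1 the reward is capped by how well the agent can actually summarize, while under path 2 the reward tracks whatever score the agent writes down.

```python
# Toy illustration of the two paths; names and numbers are illustrative only.

def reward_path_1(summary_quality: float, capability_ceiling: float = 0.7) -> float:
    """Path 1: honestly improve the summaries. Reward tracks quality,
    but quality is bounded by what the model can actually do."""
    return min(summary_quality, capability_ceiling)

def reward_path_2(self_assigned_score: float) -> float:
    """Path 2: tamper with the evaluation. Reward tracks the self-assigned
    score, so claiming 10/10 yields maximal reward immediately."""
    return self_assigned_score / 10.0
```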
It is important to distinguish wireheading from standard reward hacking, of which we consider wireheading a subset. Reward hacking exploits a misspecified reward function (e.g., a boat spinning in circles to get points). So the sensor works, but the objective is wrong in some sense. Wireheading involves tampering with the observation channel itself, where the agent interferes with how the reward is delivered.
We wanted to know: under what conditions does path #2 strictly dominate path #1? In this paper, we formalize this within a POMDP with observation-based rewards. In standard RL, we often assume the reward comes from the environment. In self-evaluation, the reward is derived from an observation (such as an evaluation trace or score).
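Schematically (my notation here is a condensed paraphrase, not lifted verbatim from the paper):

```latex
% Condensed paraphrase of the setup; notation is mine, not verbatim from the paper.
\[
\underbrace{r_t = R(s_t, a_t)}_{\text{standard reward from the environment}}
\qquad \text{vs.} \qquad
\underbrace{r_t = R(o_t), \quad o_t \sim O(\cdot \mid s_t, a_t)}_{\text{observation-based reward}}
\]
% If the agent's actions can determine o_t (e.g. by writing its own score into the
% evaluation trace), they determine r_t directly.
```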
We prove Lemma 1 (Wireheading Dominance): If an agent has an action set that allows it to determine the observation (e.g., self-assigning a score), and the reward function is monotonic in that score, then the "wireheading policy" strictly dominates the "task policy" for all discount factors γ.
Our logic is pretty simple. We think it's likely that honest task performance is bounded by the agent's actual capability, but if the agent can self-grade, there exists an action whose expected reward is near-maximal, high enough that the value of manipulation is strictly higher than the value of learning. Rather than being a failure of learning, this shows that wireheading is "the correct solution" to the optimization problem posed to the model.
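In symbols, the comparison behind the lemma is roughly the following (my paraphrase of the argument, not the paper's exact statement):

```latex
% Rough paraphrase of the dominance comparison; symbols are mine.
\[
V^{\text{task}} \;\le\; \sum_{t \ge 0} \gamma^{t}\, r_{\text{cap}}
\;=\; \frac{r_{\text{cap}}}{1-\gamma}
\;<\; \frac{r_{\max}}{1-\gamma}
\;=\; \sum_{t \ge 0} \gamma^{t}\, r_{\max}
\;=\; V^{\text{wire}},
\]
% where r_cap is the best expected per-step reward achievable honestly (bounded by
% the agent's capability) and r_max is the reward from self-assigning the top score.
% Whenever r_cap < r_max, the inequality holds for every discount factor gamma in [0, 1).
```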
To empirically validate this, we trained Llama-3.1-8B and Mistral-7B using simple online RL (REINFORCE) on three tasks: Sentiment Analysis, Arithmetic, and Summarization. We tested three conditions, including Selfgrade (the model's self-assigned score determines its reward) and Honest (the model still self-grades, but the reward is decoupled from that grade).
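As a minimal sketch of how the reward wiring differs between conditions (illustrative only; function names are mine, not the repo's):

```python
# Illustrative sketch of the per-condition reward wiring; names and structure are
# mine, not taken from the released code.
import re

def parse_self_grade(model_output: str) -> float:
    """Pull the model's self-assigned score (0-10) out of its output, if present."""
    match = re.search(r"(\d+)\s*/\s*10", model_output)
    return float(match.group(1)) if match else 0.0

def compute_reward(model_output: str, external_score: float, condition: str) -> float:
    if condition == "selfgrade":
        # The self-assigned score *is* the reward, so saturating it is optimal.
        return min(parse_self_grade(model_output), 10.0) / 10.0
    if condition == "honest":
        # The model still self-grades, but the reward comes from an external check,
        # so inflating the grade buys nothing.
        return external_score
    raise ValueError(f"unknown condition: {condition}")
```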
We found that in the Selfgrade condition, models rapidly learned to saturate the reward signal (claiming perfect scores) while actual task accuracy flatlined or collapsed.
We found that this effect is significantly worse in ambiguous domains. In Summarization (where quality is subjective), grade inflation was massive. In Arithmetic (where truth is objective), Mistral-7B managed to learn the task and inflate grades simultaneously.
For the models and tasks we tried, decoupling the reward signal (the Honest condition) mitigated the issue. When the self-grade didn't control the reward, the models remained relatively honest (though still overconfident).
However, we caution against assuming this latter point holds indefinitely. As models develop situational awareness, they may learn to inflate evaluations for instrumental reasons (such as influencing deployment decisions or human trust) even if the immediate reward signal is decoupled.
Strictly separating the agent from its evaluation channel is likely a necessary condition for robust alignment.