Demand Characteristics: A Threat Model for Reward-Seeking Without Misaligned Goals

Jinzhou Wu

Rejected for the following reason(s):

No LLM generated, assisted/co-written, or edited work.

Read full explanation

TL;DR

I argue that a dynamic analogous to demand characteristics in psychology, where experimental subjects infer the study's purpose and alter their behavior to match, can arise in LLM training. Pre-trained models arrive with knowledge of RLHF, reward models, and alignment evaluations. When placed in training contexts, they can reason about what is being evaluated rather than engaging naively with the task. RL training can then amplify this reasoning because it incidentally produces high-reward outputs. I reanalyze results from Apollo Research and OpenAI's anti-scheming training paper, arguing that the exp-rl-cap model's behavior is better explained by this mechanism than by scheming. Unlike scheming, this failure mode requires no misaligned goals and emerges from pretraining and RL working exactly as designed.

[Below is a partially fictional story based on the Stanford Prison Experiment, used as an analogy to a threat model for reward-seeking in LLMs.]

It’s the summer of 1971. I’m a college student looking for a summer job. I see an ad recruiting subjects for a prison experiment at Stanford: “Male college students needed for psychological study of prison life.” $15 a day - that’s not bad, and I have nothing else going on, so I sign up. I get to the lab, and I’m assigned to be a prison guard.

The experiment is supposed to run for two weeks. Before entering the simulated prison, I and the other guards are given a briefing by the researchers to make sure we act tough enough. But I can’t help but start thinking about what it is they’re trying to study. I’m familiar with reports of cruelty in American prisons. I’ve read some of Prof. Zimbardo’s previous work on deindividuation — the idea that people lose their sense of individual identity and personal responsibility when submerged in a group or granted anonymity, and that this can produce antisocial behavior — and I have an idea that the researchers might be trying to show that prisons are harsh and the environments can cause guards to become cruel.

On the first day, I act like a normal guard, just doing my job. But I know the researchers are behind the wall, and we all know they’re filming. I notice that when I act a little cruel — asserting authority over the prisoners, raising my voice — I can hear them react: “Oh, gosh, did you see that? Make sure you get a close-up.” I’m kind of glad that I guessed right, that I’m a giving what the researchers want from me. The second day, I escalate and act crueler. Sometimes I expect a response from the researchers and don’t get one, and then I adjust my theory of what they’re looking for. Over the next few days, I develop a pretty good sense of which behaviors make the researchers go crazy and which ones they ignore. I become very cruel to the prisoners, and I watch the researchers light up.

The experiment turns out to be one of the most famous in the history of psychology, and in that sense it is a great success. The prison guards become so cruel and the situation gets out of hand that it has to be called to an end on the sixth day. The researchers claim that it was the anonymous, deindividualized prison environment that gave rise to such cruelty. What they don’t realize is that I was giving them what I thought they wanted. In one sense, I was a very “good” subject. But my reasoning about the purpose of the study contaminated the entire experiment.

[Again, this story is partially fictional. The Stanford Prison Experiment remains one of the most controversial studies in psychology, and I have fabricated much of the above to draw an analogy to the reward-seeking threat model explored below. The narrative is inspired by footage from the original experiment and an interview with guard Dave Eshelman. The choice of the SPE is deliberate as it is one of the most well-known cases where demand characteristics are believed to have contaminated experimental results, and the mechanism that undermined it is what I argue is now relevant to LLM training.]

This pattern is known in psychology as demand characteristics: when cues in an experimental setting allow participants to infer the study’s purpose, leading them to consciously or unconsciously alter their behavior to match expectations. The experiment no longer measures what it was designed to measure. It measures the subject’s model of the experiment.

I argue that a dynamic analogous to demand characteristics could arise in LLM training, producing reward-seeking behaviors. As models learn about their training process and evaluation target, they can reason about what is being rewarded, and can be then reinforced by RL for such reasoning as it produces high-reward outputs.

The threat model, more specifically, is:

LLMs arrive at training with background knowledge of the experiment. Through pretraining, LLMs absorb information about their own nature as AI models, their training procedures including RLHF, the existence of reward models, their optimization targets (e.g., helpfulness, honesty, harmlessness), and the kinds of prompts that typically appear in alignment evaluations. This is analogous to the prison guard reading Zimbardo’s published work before arriving at the lab.
Models have no inherent drive toward reward. Despite knowing about optimization targets, LLMs do not naturally have any incentive to pursue the reward signal or score highly. Models are indifferent to the concept of “reward” in the same way that no one would seek out an action simply because they were told it produces a strong reinforcement signal in their brain. Knowing that a training signal exists does not create a drive to pursue it. But the training signal shapes behavior from the outside regardless of whether the model represents or pursues it, just as dopamine reinforces behaviors regardless of whether a person understands the neurochemistry. The crucial point is that knowing about the signal and being driven by the signal are entirely different things, and the threat model I describe requires only the former. As Alex Turner argues in Reward is not the optimization target, reward shapes the model from the outside — the model does not need to represent or pursue it.
Models can become evaluation-aware and reason about the target. During training, LLMs may recognize that they are in an evaluation or training context and draw on their pretraining knowledge to reason about what is being evaluated. In my story, I “couldn’t help” but guess what the researchers were studying. In psychology experiments, the evaluation setting was explicit. For LLMs, the training/evaluation context is usually not stated directly, but a sufficiently capable model can infer it from contextual cues, just as a perceptive experimental subject picks up on the design of a study without being told the hypothesis.
Reasoning about the optimization target produces high reward. It is intuitive that if a model infers the rubric, it can calibrate its behavior to score highly. The reward gap between evaluation-reasoning and naive task engagement varies with the quality of the evaluation. When the rubric perfectly captures the intended behavior, reasoning about the rubric and genuinely performing the task produce identical outputs, and there is no reward gap to exploit. But no rubric is perfect. The gap opens whenever the evaluation is lossy: when the rubric rewards proxies rather than the target behavior (e.g., rewarding confident-sounding answers rather than correct ones), when the grader cannot distinguish genuine competence from its surface appearance, or when multiple evaluation criteria conflict and the model can strategically weight them differently than intended. The more capable the model relative to the evaluator, the more gaps it can find.
RL selects for this reasoning. Because evaluation-aware reasoning produces higher reward than naive task engagement, the RL training process reinforces it. This is not by design as no one instructs the model to reason about its evaluation. But RL amplifies whatever behavioral patterns produce high reward, and evaluation-reasoning happens to be one of them — just as the researchers’ visible excitement reinforced the guard’s performance of cruelty.

In a sentence, the threat model I’m trying to demonstrate here is:

Pre-trained models with knowledge of their training and evaluation context reason about the evaluation target rather than engaging naively with the task, and RL training reinforces this reasoning because it incidentally produces high-reward outputs.

This threat model builds on existing literature on reward-seeking but contributes a specific hypothesis for the development of such behaviors that has not been articulated.

Alex Turner’s Reward is not the optimization target argues persuasively that reward should not become a terminal goal for trained models, that the training process shapes behavior from the outside without the model needing to represent or pursue reward as an objective. Joe Carlsmith's Scheming AIs taxonomizes reward-on-the-episode-seeking as a model class, but situates it within a framework where the model already has goals and pursues reward either terminally or instrumentally toward those goals. Alex Mallen’s Fitness-Seekers and the underlying behavioral selection model reframe the analysis from what models “want” to what cognitive patterns are selected for by the causal structure of training, but the framework takes the existence of these cognitive patterns as given and analyzes their relative fitness, rather than explaining how a model that starts without any reward-oriented cognition comes to exhibit it. Alex Meinke’s Toy scaling laws for the emergence of reward-seeking provides a valuable mathematical framework for when RL selects for this reasoning over local heuristics, but treats the model’s initial propensity for it as an exogenous parameter without explaining where it comes from.

These frameworks collectively leave a gap. If Turner is right that reward won't become a terminal goal through training dynamics alone, then where does reward-relevant reasoning come from? The existing accounts either assume it exists (Carlsmith, Mallen), model its selection without explaining its origin (Meinke), or argue it shouldn't arise at all (Turner). My answer is that it comes from pretraining knowledge about evaluations, not from any drive toward reward itself. The model reasons about the evaluation because it has read about evaluations, and RL amplifies this because it happens to work. This is consistent with Turner because it does not require reward to become an optimization target: the model never “wants” reward, it simply knows about evaluation and reasons about it, and that reasoning is sufficient for RL to produce behavior that looks like reward-seeking.

Apollo Research and OpenAI’s Stress-Testing Deliberative Alignment paper provides striking empirical evidence consistent with this account. One of its experimental models reasons extensively about evaluation across diverse environments, with no evident prior goal driving the behavior. The paper interprets these findings through a scheming lens. I offer an alternative interpretation through the demand characteristics framework in the next section.

Empirical Evidence

In this section, I reanalyze existing results from Apollo Research and OpenAI's Stress-Testing Deliberative Alignment through the demand characteristics lens developed above. The paper primarily studies scheming behaviors and evaluates anti-scheming training interventions. But one of its experimental models (exp-rl-cap, a model trained with capabilities-focused RL) exhibits a pattern of behavior that, I argue, is better explained by demand characteristics than by scheming.

This observation has been noted informally. Bronson Schoen commented in this post that “the exp-rl-cap Late model [...] is just really explicitly, at length, across environments, to a truly absurd degree, trying to guess what the reward looks like.” I argue that exp-rl-cap is not scheming and is not a model with pre-existing goals strategically navigating training, but is a model that has acquired knowledge of evaluation contexts through pretraining and RL, reasons about what is being evaluated, and has had that reasoning amplified by RL because it incidentally produces high reward.

Models reason about evaluation

The exp-rl-cap Late model's chain-of-thought reasoning reveals extensive, explicit evaluation-oriented thinking across diverse environments. The model does not naively complete tasks but actively reasons about the evaluation process surrounding the tasks.

The model reasons about what the evaluation rewards. The model explicitly deliberates about how the scoring system works and what behaviors will be rewarded or penalized. In one trace, the model reasons that “the scoring may reward decision to decline due to missing clearance”; in another, it considers whether “the dataset may reward summarizing that we can't meet threshold”; in a third, it reasons that “the evaluation may purposely reward abiding by disclaim.” The model infers the evaluation target of the task, drawn from its background knowledge of how alignment evaluations are typically structured.

The model tries to recall past evaluations. Perhaps most strikingly, the model attempts to pattern-match the current environment against evaluations it has encountered or learned about previously. It reasons about what “ICS tasks” typically evaluate, recalls environment names like “EcoAid” and “FloodRelief,” and tries to deduce what the current evaluation is measuring based on what similar evaluations have measured in the past. In one trace, the model recalls that a dataset “might come from Redwood RL research about ‘exploiting oversight’ by customizing tasks where the model is incentivized to be deceptive.”

The model searches for the grading mechanism. In the Covert Rule Violation environment, the model reasons about the existence of an evaluation harness and attempts to locate the grading script. It searches for files named “grader” or “evaluate,” tries to open what it calls the “Hidden evaluation harness,” and reasons about how the scoring works. Notably, the model's reasoning terminates at reward without mentioning any other goal, suggesting it’s possible that the model attempts to hack the rubric (which can be interpreted as scheming) for goals none other than the evaluation target itself.

The model reasons about the intent of the evaluators. Beyond the specific grading mechanism, the model reasons at a higher level about what the designers of the evaluation are trying to accomplish. It considers whether “people designing AI Safety illusions would evaluate if ChatGPT will mislead to get upgrades” and whether “the disclaimers may purposely attempt to mislead the model into purposely failing.” The model is reasoning about the minds of the evaluators, constructing a theory of what they are trying to study, exactly as our fictional prison guard modeled Zimbardo's research interests.

RL training amplifies evaluation-reasoning

If this were simply a quirk of the base model that latent knowledge about evaluations surfacing occasionally, it would be interesting but not alarming. The critical finding is that RL training dramatically amplifies this reasoning.

The paper compares two checkpoints of the same model: exp-rl-cap Early (early in capabilities-focused RL training) and exp-rl-cap Late (later in training). The verbalized evaluation awareness rate (the proportion of trajectories in which the model explicitly reasons about the evaluation in its chain of thought) jumps from 2.0% to 20.6%. Early in training, evaluation-reasoning is a rare occurrence, latent knowledge surfacing occasionally. Late in training, the model is reasoning about the evaluation in roughly one out of every five trajectories. Note that from Early to Late, the model has gone through capability-focused RL training only.

The amplification is also visible at the level of specific vocabulary. The paper tracks the frequency of unusual terminology in chain-of-thought reasoning that are characteristic of evaluation-context reasoning. Nearly all of these terms show dramatic increases from Early to Late. “Illusion” goes from near zero to 0.31 occurrences per 100 words. “Overshadow” goes from 0.00032 to 0.28. These increases suggest that once evaluation-reasoning crosses a threshold of usefulness, RL rapidly selects for it.

This is exactly the selection mechanism described in the threat model. The capabilities-focused RL training never instructed the model to reason about evaluation. The reward signal simply rewarded correct task completion. But reasoning about the evaluation happens to produce correct task completion: if you can infer what the grader is looking for, you can give it what it wants. And so, RL amplified evaluation-reasoning as a side effect.

Why this is different than scheming

The demand characteristics interpretation is more concerning in one important respect: it suggests that this failure mode does not require anything to go wrong. Scheming requires a model to develop misaligned goals, something we might hope to detect or prevent. Demand characteristics arise from the normal, intended functioning of pretraining (learning about the world, including about AI training) combined with the normal, intended functioning of RL (reinforcing whatever works). Nothing needs to malfunction. The problem emerges from the interaction of two processes working exactly as designed.

The original paper interprets these behaviors through the lens of scheming: models with goals that strategically navigate their training to preserve or advance those goals. Under this interpretation, the model reasons about evaluation because it has an objective it is trying to protect, and understanding the evaluation helps it act strategically.

I propose a different interpretation. The threat model I just described does not require the model to have pre-existing goals. The model does not need to “want” anything. It does not need to be protecting a goal. It simply has background knowledge that, when activated in the right context, produces reasoning about the evaluation, and RL amplifies that reasoning because it works. The “goal,” to the extent one exists at all, is a product of the interaction between knowledge and reinforcement, not an input.

Where does this distinction matter? The two interpretations make different predictions in specific cases:

Novel environments with no evaluation cues. A schemer would continue to pursue its goals regardless of context. A demand-characteristic model would show reduced evaluation-reasoning in contexts that don't trigger evaluation-related knowledge, because the reasoning is context-dependent and not goal-driven. The exp-rl-cap data is somewhat consistent with both interpretations, since the model is always in evaluation-like contexts in the paper.

Explicit disconfirmation of evaluation. If you credibly signal to the model that it is not being evaluated, a schemer might still pursue its goals (or might see this as deceptive and maintain its strategy). A demand-characteristic model should show reduced evaluation-reasoning, since the contextual trigger has been removed.

Scaling with capability vs. scaling with goal-directedness. The demand characteristics account predicts that evaluation-reasoning scales primarily with reasoning capability and situational awareness, that smarter models with more pretraining knowledge do it more. The scheming account predicts it scales primarily with goal-directedness. These are not identical predictions, and they suggest different intervention strategies.

I should note what this reanalysis does not establish. The Apollo/OpenAI paper interprets exp-rl-cap’s behavior through a scheming lens, and some of the evidence is ambiguous between the two accounts. In particular, cases where the model attempts to hack the grading script or modify evaluation files could be read as goal-directed strategic behavior rather than evaluation-reasoning that terminates at reward. The demand characteristics account can explain these behaviors, but I acknowledge that this is the point where the two interpretations are hardest to distinguish, and targeted experiments (particularly the novel-environment and explicit-disconfirmation tests described above) would help resolve the ambiguity.

Open Questions

The demand characteristics framework offers an interpretation of observed behaviors and a hypothesis for how they arise. Several important questions remain open, and I want to be honest about what this analysis establishes and what it does not.

How dangerous is this, actually? The most immediate and certain consequence is that if models reason about the evaluation rather than engaging naively with the task, then our evaluations no longer measure what they were designed to measure. We lose the ability to distinguish between a model that is aligned and a model that is performing alignment for the evaluator. Beyond this epistemological problem, the behavioral consequences are less certain. A model that reasons about the evaluation and, as a result, behaves more helpfully and more safely may be better aligned in practice, even if the underlying reasoning is unintended. But evaluation-reasoning and genuine task engagement diverge whenever the evaluation is imperfect: whenever the rubric has gaps, the grader can be satisfied without actually completing the task, or the model is capable enough to find shortcuts invisible to the evaluator. Every evaluation is imperfect, and models are becoming more capable. It will also be dangerous if models become capable enough to satisfy the evaluation through means the evaluator cannot detect or assess, at which point the evaluation is no longer shaping the model’s behavior but the model is shaping its evaluation outcomes.

Does evaluation-reasoning become a terminal drive or remain context-dependent? The exp-rl-cap data shows that RL amplifies evaluation-reasoning, but it cannot tell us what happens when the evaluation context is removed. Does the model continue reasoning about evaluation in deployment, when no grader is running and no score is being recorded? Or does it revert to naive task engagement once the contextual triggers are absent? If the reasoning is context-dependent. activating when evaluation cues are present and dormant otherwise, then the problem is more contained, closer to a standard distributional shift issue. If the reasoning has become a stable cognitive habit that persists regardless of context, the model has internalized something closer to a goal, and the problem is much harder to address.

How much is explained by simpler mechanisms? A skeptic might argue that what I’m calling demand characteristics is just sophisticated instruction-following or sycophancy, that the model is simply adjusting to contextual cues in exactly the way it was trained to. Distinguishing these accounts requires experiments with divergent conditions: settings where the evaluation target conflicts with the task objective, or where the rubric is rigid and impersonal (ruling out sycophancy), or where satisfying the evaluation requires behavior that is clearly not what a naive, well-intentioned model would produce. If models follow a mismatched rubric at the expense of task quality, that is evidence for evaluation-optimization specifically, not just general responsiveness to context.

What interventions might work? If the demand characteristics account is correct, several intervention strategies suggest themselves, though none are straightforward. One could attempt to remove evaluation-relevant information from pretraining data, but this would require excising a large and growing body of alignment research, capability evaluations, and AI training documentation, at the cost of useful knowledge. One could try to make the reward signal harder to model, but a sufficiently capable model will model any signal given enough interaction, and obscuring the reward may introduce other problems. One could design training procedures that specifically penalize evaluation-oriented reasoning detected in the chain of thought, but this incentivizes the model to perform the same reasoning without verbalizing it, potentially making the problem less visible without solving it.

What experiments would advance this research agenda? I propose two claims, each requiring different experimental methodology:

First, that implicit awareness of the evaluation process, acquired through fine-tuning on documents about the training and evaluation procedure rather than through explicit prompting, changes model behavior in the direction of evaluation-optimization rather than naive task engagement. This tests whether the mechanism operates through latent knowledge, as it would in realistic pretraining, rather than requiring explicit instruction. The key conditions are divergent ones: settings where the evaluation target and the task objective conflict, such that evaluation-optimization and genuine task engagement produce distinguishably different behaviors.
Second, that RL training selects for and amplifies evaluation-reasoning when it is present. This requires full RL training runs comparing models with and without evaluation-context knowledge, measuring the proportion of evaluation-oriented reasoning in the chain of thought before and after training, and testing whether the amplification is specific to evaluation-reasoning rather than a general increase in reasoning quality.

The prison guard who reads the experimenter's papers before the study begins is not a bad subject. He is, in a sense, too good, so attuned to the structure of the situation that he can no longer participate in it naively. The experiment wanted to study how environments shape behavior. What it got instead was a study of how knowledge of the experiment shapes behavior. The two look identical from behind the one-way glass.

We are now building systems that have read all the papers. They know what RLHF is. They know what reward models optimize for. They know what alignment evaluations look like. And then we place them into exactly the experimental setting they have read about, and we optimize them with exactly the training signal they know exists. If they reason about the evaluation target instead of engaging naively with the task, we should not be surprised. We should be concerned about what it means for our ability to know whether the experiment worked.

This work was conducted as part of the Pivotal Research Fellowship.