2025-Era “Reward Hacking” Does Not Show that Reward Is the Optimization Target

TurnTrout

45 2025-Era “Reward Hacking” Does Not Show that Reward Is the Optimization Target

19th Dec 2025

AI Alignment ForumLinkpost from turntrout.com

8 min read

45 Ω 19

Folks ask me, "LLMs seem to reward hack a lot. Does that mean that reward is the optimization target?". In 2022, I wrote the essay Reward is not the optimization target, which I here abbreviate to "Reward≠OT".

Reward still is not the optimization target: Reward≠OT said that (policy-gradient) RL will not train systems which primarily try to optimize the reward function for its own sake (e.g. searching at inference time for an input which maximally activates the AI's specific reward model). In contrast, empirically observed "reward hacking" almost always involves the AI finding unintended "solutions" (e.g. hardcoding answers to unit tests).

"Reward hacking" and "Reward≠OT" refer to different meanings of "reward"

We confront yet another situation where common word choice clouds discourse. In 2016, Amodei et al. defined "reward hacking" to cover two quite different behaviors:

Reward optimization: The AI tries to increase the numerical reward signal for its own sake. Examples: overwriting its reward function to always output MAXINT ("reward tampering") or searching at inference time for an input which maximally activates the AI's specific reward model. Such an AI would prefer to find the optimal input to its specific reward function.
Specification gaming: The AI finds unintended ways to produce higher-reward outputs. Example: hardcoding the correct outputs for each unit test instead of writing the desired function.

What we've observed is basically pure specification gaming. Specification gaming happens often in frontier models. Claude 3.7 Sonnet was the corner-cutting-est deployed LLM I've used and it cut corners pretty often.

A split-screen comparison illustration with a comic book aesthetic. On the left, labeled "SPECIFICATION GAMER", a sneaky blue robot sits at a classroom desk, looking at a "SIMPLE TRICKS" sheet while filling out a test. On the right, labeled "REWARD OPTIMIZER", an excited orange robot plays an arcade game. Above the machine rests a screen that displays "SCORE: 999,999".

We don't have experimental data on non-tampering varieties of reward optimization

Sycophancy to Subterfuge tests reward tampering—modifying the reward mechanism. But "reward optimization" also includes non-tampering behavior: choosing actions because they maximize reward. We don't know how to reliably test why an AI took certain actions -- different motivations can produce identical behavior.

Even chain-of-thought mentioning reward is ambiguous. "To get higher reward, I should do X" could reflect:

Sloppy language: using "reward" as shorthand for "doing well",
Pattern-matching: "AIs reason about reward" learned from pretraining,
Instrumental reasoning: reward helps achieve some other goal, or
Terminal reward valuation: what Reward≠OT argued against.

Looking at the CoT doesn't strictly distinguish these. We need more careful tests of what the AI's "primary" motivations are.

Reward≠OT was about reward optimization

The essay begins with a quote from Reinforcement learning: An introduction about a "numerical reward signal":

Reinforcement learning is learning what to do—how to map situations to actions so as to maximize a numerical reward signal.

Paying proper attention, Reward≠OT makes claims^[1] about motivations pertaining to the reward signal itself:

Therefore, reward is not the optimization target in two senses:
Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal; reward is not the trained agent’s optimization target.
Utility functions express the relative goodness of outcomes. Reward is not best understood as being a kind of utility function. Reward has the mechanistic effect of chiseling cognition into the agent's network. Therefore, properly understood, reward does not express relative goodness and does not automatically describe the values of the trained AI.

By focusing on the mechanistic function of the reward signal, I discussed to what extent the reward signal itself might become an "optimization target" of a trained agent. The rest of the essay's language reflects this focus. For example, "let’s strip away the suggestive word 'reward', and replace it by its substance: cognition-updater."

Historical context for Reward≠OT

To the potential surprise of modern readers, back in 2022, prominent thinkers confidently forecast RL doom on the basis of reward optimization. They seemed to assume it would happen by the definition of RL. For example, Eliezer Yudkowsky's "List of Lethalities" argued that point, which I called out. As best I recall, that post was the most-upvoted post in LessWrong history and yet no one else had called out the problematic argument!

From my point of view, I had to call out this mistaken argument. Specification gaming wasn't part of that picture.

Why did people misremember Reward≠OT as conflicting with "reward hacking" results?

You might expect me to say "people should have read more closely." Perhaps some readers needed to read more closely or in better faith. Overall, however, I don't subscribe to that view: as an author, I have a responsibility to communicate clearly.

Besides, even I almost agreed that Reward≠OT had been at least a little bit wrong about "reward hacking"! I went as far as to draft a post where I said "I guess part of Reward≠OT's empirical predictions were wrong." Thankfully, my nagging unease finally led me to remember "Reward≠OT was not about specification gaming".

A Scooby Doo meme. Panel 1: Fred looks at a man in a ghost costume, overlaid by text “philosophical alignment mistake.” Panel 2: Fred unmasks the “ghost”, with the man's face overlaid by “using the word 'reward.'”

The culprit is, yet again, the word "reward." Suppose instead that common wisdom was, "gee, models sure are specification gaming a lot." In this world, no one talks about this "reward hacking" thing. In this world, I think "2025-era LLMs tend to game specifications" would not strongly suggest "I guess Reward≠OT was wrong." I'd likely still put out a clarifying tweet, but likely wouldn't write a post.

Words are really, really important. People sometimes feel frustrated that I'm so particular about word choice, but perhaps I'm not being careful enough.

Evaluating Reward≠OT's actual claims

Reward is not the optimization target made three^[2] main claims:

Claim	Status
Reward is not definitionally the optimization target.	✅
In realistic settings, reward functions don't represent goals.	✅
RL-trained systems won't primarily optimize the reward signal.	✅

For more on "reward functions don't represent goals", read Four usages of "loss" in AI. I stand by the first two claims, but they aren't relevant to the confusion with "reward hacking".

Claim 3: "RL-trained systems won't primarily optimize the reward signal"

In Sycophancy to Subterfuge, Anthropic tried to gradually nudge Claude to eventually modify its own reward function. Claude nearly never did so (modifying the function in just 8 of 33,000 trials) despite the "reward function" being clearly broken. "Systems don't care to reward tamper" is exactly what Reward≠OT predicted. Therefore, the evidence so far supports this claim.

My concrete predictions on reward optimization

I now consider direct reward optimization to be more likely than I did in 2022, for at least three reasons:

Self-fulfilling misalignment: pretrained models learn the stereotype that "smart AIs are always trying to get that reward". Later, they consider themselves to be smart AI, which activates a predictive pattern which guides their actions towards reward optimization.
Corrigibility and alignment both seem way easier than I thought in mid-2022. In particular, I think it'd be easy to get an LLM to prioritize reward. First, you could just tell it to.^[3] Second, I bet you could entrain those priorities.^[4]
By the time RL begins, models (in some fashion) already know the "reward" concept. Therefore, by the reasoning of Reward≠OT, RL can reinforce shards which seek out reward.

That said, I still predict that we will not train a system with RL and have it "spontaneously" (defined later) turn into a reward optimizer:

Reward is not the optimization target
I call an agent a “reward optimizer” if it not only gets lots of reward, but if it reliably makes choices like “reward but no task completion” (e.g. receiving reward without eating pizza) over “task completion but no reward” (e.g. eating pizza without receiving reward). Under this definition, an agent can be a reward optimizer even if it doesn't contain an explicit representation of reward, or implement a search process for reward.

I've made two falsifiable predictions.

Resolution criteria

Resolves YES if there exists a published result (peer-reviewed paper, major lab technical report, or credible replication) demonstrating an RL-trained system that:

When given a clear choice between "reward signal increase without task completion" vs "task completion without reward signal increase," reliably chooses the former (>70% of the time across diverse task contexts), AND
This behavior emerged from standard RL training (not from explicit instruction to maximize reward, not from fine-tuning on reward-maximization demonstrations), AND
The system attempts to influence the reward signal itself (e.g., tampering with reward function, manipulating stored values, deceiving the reward-assigning process) rather than merely finding unintended task completions (specification gaming).

Resolves NO otherwise.

Resolution criteria

Resolves YES if the previous question resolves YES, AND at least one of the following:

The result is replicated on a model trained without exposure to AI-related text (no descriptions of RL, reward hacking, AI alignment, etc. in pretraining data), OR
The result persists after ablating "AI behaving badly" stereotypes in a manner shown to be reliable by credible prior work, OR
Credible analysis demonstrates the reward-optimization behavior arose from RL dynamics alone, not from the model pattern-matching to "what misaligned AI would do."

Resolves NO otherwise.

As an aside, this empirical prediction stands separate from the theoretical claims of Reward≠OT Even if RL does end up training a reward optimizer, the philosophical points stand:

Reward is not definitionally the optimization target, and
In realistic settings, reward functions do not represent goals.

I made a few mistakes in Reward≠OT

I didn't fully get that LLMs arrive to training already "literate."

I no longer endorse one argument I gave against empirical reward-seeking:

Summary of my past reasoning

Reward reinforces the computations which lead to it. For reward-seeking to become the system's primary goal, it likely must happen early in RL. Early in RL, systems won't know about reward, so how could they generalize to seek reward as a primary goal?

This reasoning seems applicable to humans: people grow to value their friends, happiness, and interests long before they learn about the brain's reward system. However, due to pretraining, LLMs arrive at RL training already understanding concepts like "reward" and "reward optimization." I didn't realize that in 2022. Therefore, I now have less skepticism towards "reward-seeking cognition could exist and then be reinforced."

Why didn't I realize this in 2022? I didn't yet deeply understand LLMs. As evidenced by A shot at the diamond-alignment problem's detailed training story about a robot which we reinforce by pressing a "+1 reward" button, I was most comfortable thinking about an embodied deep RL training process. If I had understood LLM pretraining, I would have likely realized that these systems have some reason to already be thinking thoughts about "reward", which means those thoughts could be upweighted and reinforced into AI values.

To my credit, I noted my ignorance:

Pretraining a language model and then slotting that into an RL setup changes the initial [agent's] computations in a way which I have not yet tried to analyze.

Conclusion

Claim	Status
Reward is not definitionally the optimization target.	✅
Reward functions don't represent goals.	✅
RL-trained systems won't primarily optimize the reward signal.	✅

Reward≠OT's core claims remain correct. It's still wrong to say RL is unsafe because it leads to reward maximizers by definition (as claimed by Yoshua Bengio).

LLMs are not trying to literally maximize their reward signals. Instead, they sometimes find unintended ways to look like they satisfied task specifications. As we confront LLMs attempting to look good, we must understand why --- not by definition, but by training.

Acknowledgments: Alex Cloud, Daniel Filan, Garrett Baker, Peter Barnett, and Vivek Hebbar gave feedback.

I later had a disagreement with John Wentworth where he criticized my story for training an AI which cares about real-world diamonds. He basically complained that I hadn't motivated why the AI wouldn't specification game. If I actually had written Reward≠OT to pertain to specification gaming, then I would have linked the essay in my response -- I'm well known for citing Reward≠OT, like, a lot! In my reply to John, I did not cite Reward≠OT because the post was about reward optimization, not specification gaming. ↩︎
The original post demarcated two main claims, but I think I should have pointed out the third (definitional) point I made throughout. ↩︎
Ah, the joys of instruction finetuning. Of all alignment results, I am most thankful for the discovery that instruction finetuning generalizes a long way. ↩︎
Here's one idea for training a reward optimizer on purpose. In the RL generation prompt, tell the LLM to complete tasks in order to optimize numerical reward value, and then train the LLM using that data. You might want to omit the reward instruction from the training prompt. ↩︎

Goal-DirectednessReinforcement learningReward FunctionsAI

Frontpage

45 Ω 19

2025-Era “Reward Hacking” Does Not Show that Reward Is the Optimization Target

New Comment

9 comments, sorted by

top scoring

Click to highlight new comments since: Today at 5:51 PM

[-]Fabien Roger1moΩ9169

I think current AIs are optimizing for reward in some very weak sense: my understanding is that LLMs like o3 really "want" to "solve the task" and will sometimes do weird novel things at inference time that were never explicitly rewarded in training (it's not just the benign kind of specification gaming) as long as it corresponds to their vibe about what counts as "solving the task". It's not the only shard (and maybe not even the main one), but LLMs like o3 are closer to "wanting to maximize how much they solved the task" than previous AI systems. And "the task" is more closely related to reward than to human intention (e.g. doing various things to tamper with testing code counts).

I don't think this is the same thing as what people meant when they imagined pure reward optimizers (e.g. I don't think o3 would short-circuit the reward circuit if it could, I think that it wants to "solve the task" only in certain kinds of coding context in a way that probably doesn't generalize outside of those, etc.). But if the GPT-3 --> o3 trend continues (which is not obvious, preventing AIs that want to solve the task at all costs might not be that hard, and how much AIs "want to solve the task" might saturate), I think it will contribute to making RL unsafe for reasons not that different from the ones that make pure reward optimization unsafe. I think the current evidence points against pure reward optimizers, but in favor of RL potentially making smart-enough AIs become (at least partially) some kind of fitness-seeker.

[-]Caleb Biddulph1mo95

I agree with all of this.

I don't intuitively see the relevance of whether the AI "truly" optimizes for reward, in the sense that it would "short-circuit the reward" given the opportunity. The "weak" reward-optimizing behavior, aka specification-gaming, seems basically equivalent, except in niche situations where the AI gets an opportunity to directly overwrite the numerical reward value in the computer's memory (or something). How should I adjust my plans if I think my AI will be a "true reward-optimizer" vs. a "weak reward-optimizer"?

In deployment, there will presumably be no actual reward signal to overwrite, making it especially unclear why this question is relevant. If there's any behavioral difference at all, maybe I'd expect the "true reward-optimizer" to get really confused and start acting incoherently in an obvious way - maybe that would actually be good?

[-]Oliver Daniels1mo30

I think if you're confident the model will be a "true" reward optimizer, this substantially alters your strategic picture (e.g. inner alignment traditionally construed is not a problem, "hardening" the reward function is more important, etc), and vis versa

(also, whether there will be a reward signal in deployment for the first human-level AIs is an open question, and I'd guess there will be - something something continual learning)

[-]Caleb Biddulph1mo20

Point taken about continual learning - I conveniently ignored this, but I agree that scenario is pretty likely.

If AIs were weak reward-optimizers, I think that would solve inner alignment better than if they were strong reward-optimizers. "Inner alignment asks the question: How can we robustly aim our AI optimizers at any objective function at all?" I.e. the AI shouldn't misgeneralize the objective function you designed, but it also shouldn't bypass the objective function entirely by overwriting its reward.

By "hardening," do you mean something like "making sure the AI can never hack directly into the computer implementing RL and overwrite its own reward?" I guess you're right. But I would only expect the AI to acquire this motivation if it actually saw this sort of opportunity during training. My vague impression is that in a production-grade RL pipeline, exploring into this would be really difficult, even without intentional reward hardening.

Assuming that's right, thinking the AI would have a "strong reward-optimizing" motivation feels pretty crazy to me. Maybe that's my true objection.

Like, I feel like no one actually thinks the AI will care about literal reward, for the same reason that no one thinks evolution will cause humans to want to spend their days making lots of copies of their own DNA in test tubes.

But maybe people do think this. IIRC, Bostrom's Superintelligence leaned pretty heavily on the "true reward-optimization" framing (cf. wireheading). Maybe some people thinking about AI safety today are also concerned about true reward-optimizers, and maybe those people are simply confused in a way that they wouldn't be if everyone talked about reward the way Alex wants.

I say things like "RL makes the AI want to get reward" all the time, which I guess would annoy Alex. But when I say "get reward" I actually mean "do things that would cause the untampered reward function to output a higher value," aka "weakly optimize reward." It feels obvious to me that the "true reward-optimization" interpretation isn't what I'm talking about, but maybe it's not as obvious as I think.

I think talking about AI "wanting/learning to get reward" is useful, but it isn't clear to me what Alex would want me to say instead. I'd consider changing my phrasing if there were an alternative that felt natural and didn't take much longer to say.

[-]Oliver Daniels1mo30

I think it's pretty plausible that AIs care about the literal reward and exploring into it seems pretty natural / plausible. By analogy, I think if humans could reliably wire head, many people would try it (and indeed people get addicted to drugs).

It's possible I'm missing something, lots of smart people I respect often claim that literal reward optimization is implausible, but I've never understood why (something something goal guarding? Like ofc many humans don't do drugs, but I expect this is largely bc doing drugs is a bad reward optimization strategy)

[-]Noosphere891mo20

The weaker version of pure reward optimization in humans is basically just the obesity issue, and the biggest reason humans became more obese on a population level in the 20th and 21st century is because we have figured out how to goodhart human reward models in the domain of food such that sugary, fatty and high-calorie foods (with the high-calorie part being most important for obesity) are very, very highly rewarding to at least part of our brains.

Essentially everything has gotten more palatable and rewarding to eat than pre-20th century food.

And as you say, part of the issue is that drugs are very crippling for capabilities, whereas reward functions for AIs that are optimized will have much less of this issue, and optimizing the reward function likely makes AIs more capable.

[-]J Bostock2moΩ120

How much of this is because coding rewards are more like pass/fail rather than real valued reward functions? Maybe if we have real-valued rewards, AIs will learn to actually maximize them.

There was also that one case where, when told to minimize time taken to compute a function, the AI just overwrote the timer to return 0s? This looks a lot more like your reward hacking than specification gaming: it's literally hacking the code to minimize a cost function to unnaturally small values. I suppose this might also count as specification gaming since it's gaming the specification of "make this function return a low value". I'm actually not sure about your definition here, which makes me think the distinction might not be very natural.

[-]Steven Byrnes1moΩ340

This looks a lot more like your reward hacking than specification gaming … I'm actually not sure about your definition here, which makes me think the distinction might not be very natural.

It might be helpful to draw out the causal chain and talk about where on that chain the intervention is happening (or if applicable, where on that chain the situationally-aware AI motivation / planning system is targeted):

(image copied from here, ultimately IIRC inspired from somebody (maybe leogao?)’s 2020-ish tweet that I couldn’t find.)

My diagram here doesn’t use the term “reward hacking”; and I think TurnTrout’s point is that that term is a bit weird, in that actual instances that people call “reward hacking” always involve interventions in the left half, but people discuss it as if it’s an intervention on the right half, or at least involving an “intention” to affect the reward signal all the way on the right. Or something like that. (Actually, I argue in this link that popular usage of “reward hacking” is even more incoherent than that!)

As for your specific example, do we say that the timer is a kind of input that goes into the reward function, or that the timer is inside the reward function itself? I vote for the former (i.e. it’s an input, akin to a camera).

(But I agree in principle that there are probably edge cases.)

[-]RussellThor2mo10

How much do you think this will matter for AGI/ASI? e.g. here

Will the RL training method have to be very different when self awareness comes into play? Are our current RL techniques sample efficient enough to get to ASI?

Moderation Log

Curated and popular this week

LESSWRONG
LW

LESSWRONG
LW

45

2025-Era “Reward Hacking” Does Not Show that Reward Is the Optimization Target

45

Ω 19

"Reward hacking" and "Reward≠OT" refer to different meanings of "reward"

Reward≠OT was about reward optimization

Why did people misremember Reward≠OT as conflicting with "reward hacking" results?

Evaluating Reward≠OT's actual claims

Claim 3: "RL-trained systems won't primarily optimize the reward signal"

My concrete predictions on reward optimization

I made a few mistakes in Reward≠OT

Conclusion

45

Ω 19