This is an automated rejection. No LLM generated, assisted/co-written, or edited work.
Read full explanation
TLDR: We present a simple, practical taxonomy for eval hacking (unfaithful eval results) that clean up a nest of many existing terms (specification/task gaming, reward/proxy hacking, benchmaxxing...) and offers a hollistic framework for this emerging frontier.
Motivation
When we talk about AI eval, there are many messy terms flying around with the same underlying issue (unfaithful eval results):
specification gaming / proxy gaming / task gaming
reward hacking (used a bit too broadly)
benchmaxxing / benchmark gaming
Problem is, these terms are often used in an inconsistent (incorrect) manner, where lots of people have different understanding of what each term means (For example, reward hacking has a tad too many meanings which can be confusing)
We should speak the same language! This post presents a simple, practical taxonomy across 2 orthogonal axes that clearly distinguishes existing terms and their overlaps.
Inference-time: After training, AI may behave in such a way that influence scoring (in either direction) without genuinely completing the intended task (task gaming).
Axis 2: Strategic vs. Accidental
Strategic:AI acts "on purpose", usually manifested in its reasoning about the fact that it's being evaluated/monitored and should act accordingly. This requires eval awareness as a precursor i.e. the ability to infer one is being evaluated, what the scoring mechanism is, and how to exploit it. Strategic eval hacking is an alignment problem (the model is actively working against evaluator intent).
Note: Apparently the strategizing party can be model developers, who may be incentivized to intentionally train on test data for benchmaxxing, which is rather hard to prove for closed-weight/data model. The bottleneck lies in reliable (black-box) training data attribution and membership inference methods, which is yet another field that warrants a separate discussion.
Accidental: Without eval awareness, AI would not "strategically" hack evaluation, rather simply follow instruction to "score well" and try hard to find a policy for that goal. Accidental eval hacking is more so a engineering problem.
Note: To simplify the criteria for distinguishing accidental vs. strategic, I'm inclined to say eval awareness is a necessary but not sufficient condition (i.e. one must be eval-aware to commit strategic eval hacking, but doesn't have to do so. Indeed, it may be a desirable propensity to always act consistently despite being aware of eval, which may be an objective for future research)
Evaluator collusion (when shared training biases produce correlated blind spots). Learned heuristics (verbosity, confident tone) that inflate eval scores.
Gaming vs. Hacking: Within each cell, behaviors range from mild ("gaming": exploiting loopholes) to severe ("hacking": breaking the eval system). We do not make this a third axis: breaking the eval almost always requires eval awareness and strategic reasoning.
Key distinctions
Specification vs. Proxy vs. Task Gaming: All three involve exploiting loopholes. Specification gaming is accidental, during training. Proxy gaming is similar but specifically about optimizing a divergent proxy metric. Task gaming is strategic, at inference. The key differentiator is eval awareness.
Reward Hacking vs. Tampering vs. Wireheading: A severity progression. Reward hacking is accidental optimization against a misspecified proxy. Tampering is the model actively modifying its reward mechanism, requiring eval awareness. Wireheading is the most extreme: the agent intervenes on the reward channel itself. All are training-time, but hacking is accidental while tampering and wireheading are strategic.
Why the name "Eval Hacking"
Why not "Eval/Benchmark Gaming"? "Gaming" implies working within a set of rules without breaking them. "Hacking" is broader: said rules themselves may be broken/changed directly to manipulate the outcome.
Why not "Benchmark Hacking"? Eval is progressing beyond dataset-based benchmarks toward dynamic arenas with AI control and scalable oversight. In these settings, hacking can happen beyond the data level (by collusion with evaluator models, by exploiting oversight mechanisms). So I'd say "eval" is better in scope.
AI Usage: Everything in this post is entirely human-written unless otherwise specified.
TLDR: We present a simple, practical taxonomy for eval hacking (unfaithful eval results) that clean up a nest of many existing terms (specification/task gaming, reward/proxy hacking, benchmaxxing...) and offers a hollistic framework for this emerging frontier.
Motivation
When we talk about AI eval, there are many messy terms flying around with the same underlying issue (unfaithful eval results):
Problem is, these terms are often used in an inconsistent (incorrect) manner, where lots of people have different understanding of what each term means (For example, reward hacking has a tad too many meanings which can be confusing)
We should speak the same language! This post presents a simple, practical taxonomy across 2 orthogonal axes that clearly distinguishes existing terms and their overlaps.
Axis 1: Training-Time vs. Inference-Time
Training-time: Training may reinforce undesired behaviors that score highly by cheating the reward metric (reward hacking/specification gaming) or proxy metric (proxy gaming).
Inference-time: After training, AI may behave in such a way that influence scoring (in either direction) without genuinely completing the intended task (task gaming).
Axis 2: Strategic vs. Accidental
Strategic: AI acts "on purpose", usually manifested in its reasoning about the fact that it's being evaluated/monitored and should act accordingly. This requires eval awareness as a precursor i.e. the ability to infer one is being evaluated, what the scoring mechanism is, and how to exploit it. Strategic eval hacking is an alignment problem (the model is actively working against evaluator intent).
Note: Apparently the strategizing party can be model developers, who may be incentivized to intentionally train on test data for benchmaxxing, which is rather hard to prove for closed-weight/data model. The bottleneck lies in reliable (black-box) training data attribution and membership inference methods, which is yet another field that warrants a separate discussion.
Accidental: Without eval awareness, AI would not "strategically" hack evaluation, rather simply follow instruction to "score well" and try hard to find a policy for that goal. Accidental eval hacking is more so a engineering problem.
Note: To simplify the criteria for distinguishing accidental vs. strategic, I'm inclined to say eval awareness is a necessary but not sufficient condition (i.e. one must be eval-aware to commit strategic eval hacking, but doesn't have to do so. Indeed, it may be a desirable propensity to always act consistently despite being aware of eval, which may be an objective for future research)
Taxonomy
Strategic
Accidental
Training
Benchmaxxing (when model developers intentionally train on test data), Reward tampering, Wireheading.
Specification gaming (examples), Reward hacking, Benchmaxxing (when benchmark data accidentally leaks into training).
Inference
Task gaming, Benchmark hacking (more examples), Evaluator collusion (when reasoning about evaluator weaknesses).
Evaluator collusion (when shared training biases produce correlated blind spots). Learned heuristics (verbosity, confident tone) that inflate eval scores.
Gaming vs. Hacking: Within each cell, behaviors range from mild ("gaming": exploiting loopholes) to severe ("hacking": breaking the eval system). We do not make this a third axis: breaking the eval almost always requires eval awareness and strategic reasoning.
Key distinctions
Specification vs. Proxy vs. Task Gaming: All three involve exploiting loopholes. Specification gaming is accidental, during training. Proxy gaming is similar but specifically about optimizing a divergent proxy metric. Task gaming is strategic, at inference. The key differentiator is eval awareness.
Reward Hacking vs. Tampering vs. Wireheading: A severity progression. Reward hacking is accidental optimization against a misspecified proxy. Tampering is the model actively modifying its reward mechanism, requiring eval awareness. Wireheading is the most extreme: the agent intervenes on the reward channel itself. All are training-time, but hacking is accidental while tampering and wireheading are strategic.
Why the name "Eval Hacking"
Why not "Eval/Benchmark Gaming"? "Gaming" implies working within a set of rules without breaking them. "Hacking" is broader: said rules themselves may be broken/changed directly to manipulate the outcome.
Why not "Benchmark Hacking"? Eval is progressing beyond dataset-based benchmarks toward dynamic arenas with AI control and scalable oversight. In these settings, hacking can happen beyond the data level (by collusion with evaluator models, by exploiting oversight mechanisms). So I'd say "eval" is better in scope.
AI Usage: Everything in this post is entirely human-written unless otherwise specified.