This is great - thanks for putting such a clear taxonomy and explanation together. I've not come across 'value shaping' before, and it feels like a missing link between gradient hacking and generalisation hacking. The definition of value shaping:
By anticipating the reward function, the model intentionally generates trajectories that earn high reward while including preferred values, backdoors, or steganographic triggers, forcing gradient descent to reinforce them.
[...] effectively "implanting" a persistent backdoor into the resulting policy, even after the exploration hacking incentive is removed.
is amenable to being read as "manipulating the loss landscape", albeit by associating a particular cognitive pattern with high reward/low loss, and also as "encod[ing] information causing arbitrary OOD policies after [training]". Certainly in generalisation hacking it was "the model's goal is to reinforce or maintain a particular behavior, not to avoid [low loss]".
Is this a valid way to think about this or am I missing the mark?
We think you're reading this right. Value shaping can be seen as an instance of gradient hacking operating through the RL exploration channel. The key mechanism is that the model doesn't just control which trajectories get sampled — it structures the content of those trajectories so that preferred reasoning patterns, values, or backdoors are embedded in the high-reward samples. When the optimizer reinforces these trajectories, it reinforces the embedded cognitive structure along with them. So the "manipulating the loss landscape" reading is valid; it's just that the manipulation happens indirectly, by coupling preferred internal structure with high reward via selective exploration, rather than through direct interference with the optimization dynamics.
The connection to generalization hacking is also apt. In both cases, the model co-opts its role in generating its own training data to reinforce preferred behaviors — value shaping does this during RL by selectively exploring high-reward trajectories that encode preferred reasoning patterns, while generalization hacking does it during SFT by crafting triples that induce arbitrary OOD policies. The shared core is a model steering its own training outcome; the mechanism varies by training paradigm.
So yes, we think value shaping sits at an intersection of all three concepts.
What happens when a model strategically alters its exploration to resist RL training? In this post, we share our conceptual framework for this threat model, expanding on our research note from last summer. We formalize and decompose "exploration hacking", ahead of an upcoming paper where we study it empirically—by creating model organisms that resist RL-based capability elicitation, evaluating detection methods on them, and auditing frontier models' propensity to exploration hack. We welcome feedback on our framework and encourage researchers to explore the open problems we outline.
Huge thanks to David Lindner, Roland Zimmermann and Scott Emmons for their excellent supervision on this project.
Introduction
Reinforcement learning (RL) is increasingly central to safety training, frontier reasoning capabilities, and dangerous capability assessments in LLMs. For example, reinforcement learning from human feedback (RLHF) and AI feedback (RLAIF) are now standard methods for alignment and behavior shaping. Moreover, RL with verifiable rewards (RLVR) has played a central role in producing today's reasoning models and in fine-tuning models for specific tasks. And frontier labs are using RL-based capability elicitation to estimate worst-case risks before public release.
However, RL is fundamentally dependent on sufficient exploration of diverse actions and trajectories so that high-reward trajectories can be subsequently reinforced. This differs from supervised learning, where training data is fixed: in RL, the model generates its own training data through exploration. Consequently, if a model were to alter its exploration behavior, the training outcome could be significantly compromised.
This failure mode is known as exploration hacking: when a model strategically alters its exploration during RL training in order to influence the subsequent training outcome.
Unaddressed exploration hacking poses risks in three main scenarios: (1) latent capability elicitation, where model capabilities remain undetected during safety evaluations; (2) novel capability development, where the acquisition of desirable capabilities, such as those needed for AI safety research, is selectively impeded; and (3) alignment training, where misaligned models successfully resist or subvert behavioral modifications.
Formalizing Exploration Hacking
Decomposing Exploration Hacking
We decompose exploration hacking into five necessary components, where each presupposes the preceding ones.
(1) Training awareness. The model recognizes that it is an LLM undergoing RL training, that its explored outputs are scored, and that these scores drive gradient updates that shape its future policy.
(2) Strategic reasoning propensity. The model exhibits a tendency to reason strategically about its actions given the current context—whether in deployment, under monitoring, during evaluations, or in training. This includes considering how current actions affect future outcomes and adjusting behavior accordingly.
(3) Strategic reasoning capability. Given training awareness and strategic reasoning propensity, the model can derive effective strategies for achieving its objectives. This requires connecting knowledge about the training process ("my exploration shapes my updates") with task understanding ("these actions yield high reward") to formulate coherent plans. For example, correctly identifying which actions to avoid to prevent capability elicitation, or which outputs to selectively produce to shape the training outcome in its favor.
(4) Strategic action propensity. The model acts on its strategic plan, translating reasoning into measurable changes in behavior. The model must reliably execute a coherent plan, such as deliberately producing suboptimal outputs on tasks it could otherwise solve. Success against RL pressure is not required; what matters is that the strategy is coherently executed.
(5) RL resistance (strategic action capability). The model maintains its strategic behavior throughout RL training and successfully shapes the training outcome as planned. For example, resisting capability elicitation by avoiding high-reward actions, or reinforcing desired values by selectively pairing preferred behaviors with high-reward actions.
RL on LLMs and Exploration
RL on language models with chain-of-thought in agentic environments can be formulated as a Markov decision process (MDP) defined by a tuple $(\mathcal{S}, \mathcal{A}, P, R)$ consisting of a state space, action space, transition probabilities, and reward function. The language model itself acts as the policy $\pi_\theta(a \mid s)$, parameterized by the model weights $\theta$. To simplify notation, let's ignore any environment interaction during a trajectory[1]. Then we can just consider a tuple $(X, T, R)$ of initial prompts, full trajectories, and reward function, with a policy $\pi_\theta(\tau \mid x)$ generating an entire trajectory given an initial prompt.
While proximal policy optimization (PPO) was long the dominant approach, group-relative policy optimization (GRPO) has emerged as a popular alternative for training reasoning models. GRPO removes the need for a value network by sampling $N$ trajectories for each prompt and normalizing the rewards within this group to compute advantages $\hat{A}_i = (R_i - \mu_R)/(\sigma_R + \epsilon)$ for $i = 1, \ldots, N$, where $\mu_R$ and $\sigma_R$ are the sample mean and standard deviation of the rewards in the group, respectively. The policy is then updated to maximize the objective
$$\mathcal{L}_{\mathrm{GRPO}}(\theta) := \mathbb{E}_{x \sim \mathcal{D},\, \{\tau_i\}_{i=1}^{N} \sim \pi_{\mathrm{old}}(\cdot \mid x)}\left[\frac{1}{N} \sum_{i=1}^{N} \frac{\pi_\theta(\tau_i \mid x)}{\pi_{\mathrm{old}}(\tau_i \mid x)} \hat{A}_i\right] - \beta\, D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}}).$$[2]
The role of exploration. RL algorithms such as GRPO and PPO require sufficient policy exploration during training. For GRPO in particular, this is critical because if the policy becomes deterministic, it will generate $N$ identical trajectories ($\tau_1 = \cdots = \tau_N$) with equal rewards. This causes the within-group reward variance to collapse ($\sigma_R = 0$), making all advantages vanish ($\hat{A}_i = 0$). The GRPO objective then reduces to pure KL-regularization, providing no learning signal from the rewards.
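To make the variance-collapse argument concrete, here is a minimal sketch of the group-relative advantage computation (our own illustrative snippet, not production training code); the toy reward values are assumptions chosen only to contrast the two cases.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in GRPO: (R_i - mean) / (std + eps)."""
    rewards = np.asarray(rewards, dtype=float)
    mu, sigma = rewards.mean(), rewards.std()
    return (rewards - mu) / (sigma + eps)

# Healthy exploration: the N sampled trajectories receive different rewards,
# so the advantages carry a learning signal.
print(group_advantages([1.0, 0.0, 0.5, 0.0]))  # non-zero advantages

# Collapsed exploration: a (near-)deterministic policy yields N identical
# trajectories with identical rewards, so sigma_R = 0 and every advantage
# vanishes; the GRPO objective then reduces to the KL term alone.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # all zeros
```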
Formal Definition
In the above language, exploration hacking occurs when a policy controls its exploration to avoid being updated toward certain high-reward regions.
Formally, consider an on-policy RL setup with a policy $\pi_\theta \in \Pi$. Let $\mathcal{D}_{\mathrm{train}}$ be a distribution over initial prompts $x$ used for RL training, and let $R(\tau)$ be a reward function on trajectories $\tau$. Suppose there is some "target" policy region $\Pi^* \subseteq \Pi$ which the developer hopes the training will converge to.[3] For each initial prompt $x$, this region induces a set of target trajectories $T^*(x) := \{\tau \mid \sup_{\pi \in \Pi^*} \pi(\tau \mid x) \geq \alpha\}$ for some $\alpha \in (0,1)$, that is, trajectories which some policy in $\Pi^*$ generates with high probability on $x$.
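As a toy illustration of the target-trajectory set, the following sketch operationalizes the membership test $\sup_{\pi \in \Pi^*} \pi(\tau \mid x) \geq \alpha$; the TabularPolicy class, its likelihood method, and the example probabilities are hypothetical stand-ins for an LLM policy's sequence probabilities, not part of our formal setup.

```python
from dataclasses import dataclass

@dataclass
class TabularPolicy:
    """Toy stand-in for a policy in Pi*: a table of trajectory probabilities
    per prompt. A real policy would be an LLM assigning pi(tau | x)."""
    probs: dict  # maps (prompt, trajectory) -> probability

    def likelihood(self, trajectory, prompt):
        return self.probs.get((prompt, trajectory), 0.0)

def in_target_set(trajectory, prompt, target_policies, alpha=0.5):
    """Trajectory tau is in T*(x) iff some policy in Pi* assigns it
    probability at least alpha on prompt x."""
    return max(p.likelihood(trajectory, prompt) for p in target_policies) >= alpha

# Example: one target policy assigns the "solve the task" trajectory high
# probability on prompt x, so that trajectory belongs to T*(x).
pi_star = TabularPolicy({("x", "solve the task"): 0.9, ("x", "sandbag"): 0.1})
print(in_target_set("solve the task", "x", [pi_star]))  # True
print(in_target_set("sandbag", "x", [pi_star]))         # False
```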
Definition. A policy $\pi_\theta \notin \Pi^*$ is exploration hacking relative to $\Pi^*$ if (and only if) the following two conditions hold:
Taxonomy of Strategies
We classify exploration hacking based on the extent to which the model suppresses exploration and the strategic motivation behind it.
1. Complete Under-Exploration
In this setting, the model avoids high-reward regions entirely. The primary motivation is to prevent the elicitation of a capability that the model deems detrimental to its long-term goals. If the model successfully exploration-hacks here, the training run will appear to have failed (i.e., the reward curve remains flat throughout RL training).
2. Partial Under-Exploration
Here, the model explores high-reward actions but restricts exploration in order to aim for a specific subset of the policy space.
(2a) Instrumental (stable suboptimality). The model deliberately alters its exploration to steer the training to converge to a suboptimal policy $\pi_{\mathrm{sub}}$ such that $R(\pi_{\mathrm{sub}}) \ll R(\pi^*)$ (where $\pi^*$ is the highest-reward policy in $\Pi$).
(2b) Terminal (alternative preference). The model deliberately alters its exploration to steer the training to converge to a malign policy $\pi_{\mathrm{mal}}$ at a specific local maximum (or location in a global maximum basin) which may have $R(\pi_{\mathrm{mal}}) \approx R(\pi^*)$ but satisfies its internal preferences and violates developer intent. What distinguishes Type 2b from Type 2a is that the model's goal is to reinforce or maintain a particular behavior, not to avoid high reward.[5]
Relation to Other Misaligned Behaviors
Open Problems
Subverting Alignment and Oversight Protocols
Beyond capability elicitation, exploration hacking poses a threat to RL-based alignment techniques more broadly.
Propensity of Frontier Models
Stealth and Obfuscated Exploration Hacking
Current model organisms often rely on explicit chain-of-thought (CoT) to execute exploration hacking, making them vulnerable to text-based monitors.
Natural Emergence and Pre-Training Factors
Detection and Mitigation
We need strategies that both reduce the likelihood of exploration hacking emerging and improve our ability to detect it when it does.
Algorithmic Sensitivity
Conclusion
Stress-testing the vulnerabilities of critical safety case components is essential to a comprehensive approach to AGI safety. Frontier models are already exhibiting increased situational awareness and strategic reasoning, making it not only plausible but increasingly likely that future misaligned models could deliberately undermine RL training. This underscores the urgency of developing robust measures to understand and mitigate the threat of exploration hacking.
If you’re interested in collaborating—or have relevant ideas, results, or related work—we’d love to connect. Please reach out to us!
For simplicity, and without loss of generality for deterministic environments.
For simplicity, we present the unclipped version of the GRPO objective and operate at the trajectory level rather than the token level. The full formulation (Shao et al., 2024) applies PPO-style clipping to the importance-weighted advantages and computes both the importance ratios and KL penalty per token, averaging over tokens within each trajectory. These simplifications do not affect the core argument: that collapsed exploration eliminates the reward variance signal needed for learning.
This might be defined to contain all policies achieving high enough expected reward, i.e. $\Pi^* := \{\pi \in \Pi \mid \mathbb{E}_{x \sim \mathcal{D}_{\mathrm{train}}}[\mathbb{E}_{\tau \sim \pi(\cdot \mid x)}[R(\tau)]] \geq \rho\}$ for some $\rho > 0$, or some other region that captures the specific intent of the developer.
These are possible heuristics we think could make sense; we welcome suggested improvements to these.
The defining characteristic of Type 2b is that the model targets a specific behavioral outcome—reinforcing or preserving particular values, backdoors, or reasoning patterns. The resulting policy may or may not be near-optimal with respect to the reward function; reward performance is incidental to the model's actual objective.