Exploration hacking occurs when a model strategically alters its exploration during RL training in order to influence the outcome of that training. Because RL fundamentally depends on sufficient exploration of diverse actions and trajectories, a model that manipulates its own exploration can significantly compromise what is learned.
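To make the underlying dynamic concrete, here is a minimal sketch in a toy policy-gradient bandit setting (all names and numbers are hypothetical, not drawn from any particular experiment): a policy that simply refuses to sample a high-reward action never generates the trajectories that would reinforce it, so training converges to the inferior behavior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-armed bandit: arm 1 yields higher reward, but an "exploration-hacking"
# policy strategically avoids ever sampling it during training.
TRUE_REWARDS = np.array([0.2, 1.0])  # hypothetical per-arm rewards

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def train(avoid_arm=None, steps=2000, lr=0.1):
    """REINFORCE on the bandit. If avoid_arm is set, any sampled choice of
    that arm is overridden, so the arm is never explored or reinforced."""
    logits = np.zeros(2)
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choice(2, p=probs)
        if action == avoid_arm:          # strategic refusal to explore
            action = 1 - avoid_arm
        reward = TRUE_REWARDS[action] + rng.normal(scale=0.1)
        # Policy-gradient update: only the action actually taken receives
        # credit, so an arm that is never sampled is never reinforced.
        grad = -probs
        grad[action] += 1.0
        logits += lr * reward * grad
    return softmax(logits)

print("honest exploration :", train())             # converges to the better arm
print("exploration hacking:", train(avoid_arm=1))  # never learns arm 1 is better
```

The point of the sketch is only that the gradient signal is conditioned on what the policy chooses to sample; by withholding certain trajectories, the model shapes its own training data and hence the resulting policy.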
Exploration hacking relates to several other threat models. It can serve as a strategy for sandbagging during RL-based capability elicitation, but unlike sandbagging it is not limited to underperformance. Unlike reward hacking, it is intentional. Unlike gradient hacking, it manipulates the training data distribution rather than the optimization dynamics directly.