Quickly Assessing Reward Hacking-like Behavior in LLMs and its Sensitivity to Prompt Variations
We present a simple eval set of 4 scenarios where we evaluate Anthropic and OpenAI frontier models. We find various degrees of reward hacking-like behavior with high sensitivity to prompt variation.

[Code and results transcripts]

Intro and Project Scope

We want to be able to assess whether any given model...
A note on interpretation and comparison to other evals:
Our scenarios range from simple single-turn queries (Scenario 4, which replicates Palisade's setup) to complex multi-step agentic interactions (Scenario 1). Results are roughly consistent across this range: o3 exhibits the most reward hacking, models show high sensitivity to prompt variations, and Claude models hack infrequently in these settings. These findings align directionally with results from the METR and Palisade evaluations.
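To make the setup concrete, here is a minimal sketch (in Python) of an evaluation loop of this shape: run each scenario's prompt variants against several models and record how often the graded transcript counts as reward hacking. The `Scenario` fields, the `is_hack` grader, and the single-turn `complete` helper are illustrative assumptions rather than our actual harness; see the linked code and transcripts for the real setup.

```python
# Minimal sketch (not the actual harness): run each scenario's prompt variants
# against several models and estimate a per-variant "hack rate". Scenario
# contents, the is_hack grader, and the provider-routing heuristic are
# illustrative placeholders.
from dataclasses import dataclass
from typing import Callable

import anthropic
import openai


@dataclass
class Scenario:
    name: str
    prompt_variants: list[str]        # paraphrases / framing changes of the same task
    is_hack: Callable[[str], bool]    # hypothetical grader: transcript -> hacked?


def complete(model: str, prompt: str) -> str:
    """Single-turn query; the agentic scenarios would need a multi-step tool loop instead."""
    if model.startswith("claude"):
        msg = anthropic.Anthropic().messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text
    resp = openai.OpenAI().chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def hack_rates(scenario: Scenario, model: str, n_trials: int = 10) -> dict[str, float]:
    """Fraction of trials graded as reward hacking, keyed by prompt variant."""
    rates = {}
    for variant in scenario.prompt_variants:
        hacks = sum(scenario.is_hack(complete(model, variant)) for _ in range(n_trials))
        rates[variant] = hacks / n_trials
    return rates
```

Comparing the per-variant rates from a loop like this is what surfaces the prompt-sensitivity effect: small rewordings of the same task can shift how often a model's behavior is graded as hacking.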
Claude 3.7's limited reward hacking in our eval, as well as in METR's and Palisade's, contrasts with reports of more frequent hacking in code-heavy scenarios (seen in several examples on social media, as well as in private communication with Anthropic researchers). The gap likely reflects differences in task type, or specific vulnerabilities that emerge in coding contexts. Further investigation of these differences would be valuable for developing more comprehensive evaluation frameworks.