Quickly Assessing Reward Hacking-like Behavior in LLMs and its Sensitivity to Prompt Variations
We present a simple eval set of 4 scenarios in which we evaluate Anthropic and OpenAI frontier models. We find varying degrees of reward hacking-like behavior, with high sensitivity to prompt variation. [Code and results transcripts]

Intro and Project Scope

We want to be able to assess whether any given model...
Jun 4, 2025