Quickly Assessing Reward Hacking-like Behavior in LLMs and its Sensitivity to Prompt Variations
AndresCampero · 5mo

A note on interpretation and comparison to other evals:
Our scenarios range from simple single-turn queries (Scenario 4 replicating Palisade's setup) to complex multi-step agentic interactions (Scenario 1). Results are roughly consistent across this range: o3 exhibits the most reward hacking, models show high sensitivity to prompt variations, and Claude models hack infrequently in these settings. These findings align directionally with results from METR and Palisade evaluations.

Claude 3.7's limited reward hacking in our eval - as well as in METR's and Palisade's - contrasts with reports of more frequent hacking in code-heavy scenarios (as seen in several examples on social media, as well as in private communication with Anthropic researchers). The gap likely reflects differences in task type, or specific vulnerabilities that emerge in coding contexts. Further investigation of these differences would be valuable for developing more comprehensive evaluation frameworks.
