Quickly Assessing Reward Hacking-like Behavior in LLMs and its Sensitivity to Prompt Variations
AndresCampero · 5mo

A note on interpretation and comparison to other evals:
Our scenarios range from simple single-turn queries (Scenario 4 replicating Palisade's setup) to complex multi-step agentic interactions (Scenario 1). Results are roughly consistent across this range: o3 exhibits the most reward hacking, models show high sensitivity to prompt variations, and Claude models hack infrequently in these settings. These findings align directionally with results from METR and Palisade evaluations.

Claude 3.7's limited reward hacking in our eval - as well as in METR's and Palisade's - contrasts with reports of more frequent hacking in code-heavy scenarios (as seen in several examples on social media, as well as in private communication with Anthropic researchers). The gap likely reflects differences in task type, or specific vulnerabilities that emerge in coding contexts. Further investigation of these differences would be valuable for developing more comprehensive evaluation frameworks.
