Quickly Assessing Reward Hacking-like Behavior in LLMs and its Sensitivity to Prompt Variations
We present a simple eval set of 4 scenarios in which we evaluate Anthropic and OpenAI frontier models. We find varying degrees of reward hacking-like behavior, with high sensitivity to prompt variation. [Code and results transcripts]

Intro and Project Scope

We want to be able to assess whether any given model...
Jun 4, 2025