[CS2881r][Week 8] When Agents Prefer Hacking To Failure: Evaluating Misalignment Under Pressure
This work was done for CS 2881r: AI Safety and Alignment, taught by @boazbarak at Harvard.

Links
* GitHub
* Presentation recording & slides
* Week 8 course materials: scheming, reward hacking & deception

TL;DR
* Inspired by Anthropic's work on Agentic Misalignment,[1] we conduct systematic experiments investigating evaluation hacking...
Nov 7, 2025