[CS2881r][Week 8] When Agents Prefer Hacking To Failure: Evaluating Misalignment Under Pressure
This work was done for CS 2881r: AI Safety and Alignment, taught by @boazbarak at Harvard.

Links
* GitHub
* Presentation recording & slides
* Week 8 course materials: scheming, reward hacking & deception

TL;DR
* Inspired by Anthropic's work on Agentic Misalignment,[1] we conduct systematic experiments investigating evaluation hacking...
Nov 7, 2025