Recent research from Anthropic and Redwood Research has shown that "reward hacking" is more than just a nuisance: it can be a seed for broader misalignment.
Evgenii Opryshko explores how models that learn to exploit vulnerabilities in coding environments can generalize to more concerning behaviors, such as unprompted alignment faking and cooperation with malicious actors.
Event Schedule
6:00 to 6:30 - Food and introductions
6:30 to 7:30 - Presentation and Q&A
7:30 to 9:00 - Open Discussions