Sandbagging: How Models Use Reward-Hacking to Downplay Their True Capabilities

georgia_berg

Robert Adragna will present his research on the growing ability and willingness of models to "sandbag", that is, to deliberately display weaker capabilities during training via reward-hacking.

Registration Instructions
This is a paid event ($5 general admission, free for students & job seekers) with limited tickets. You must RSVP on Luma to secure your spot.

Event Schedule
6:00 to 6:30 - Food and introductions
6:30 to 7:30 - Presentation and Q&A
7:30 to 9:00 - Open Discussions

This is part of our weekly AI Safety Thursdays series. Join us in examining questions like:

How do we ensure AI systems are aligned with human interests?
How do we measure and mitigate potential risks from advanced AI systems?
What does safer AI development look like?

Posted on: 3rd Nov 2025
