Robert Adragna will report the results of his research on the growing ability and willingness of models to "sandbag" - that is, to deliberately suggest weaker capabilities during training via reward-hacking.
Registration Instructions
This is a paid event ($5 general admission, free for students & job seekers) with limited tickets - you must RSVP on Luma to secure your spot.
Event Schedule
6:00 to 6:30 - Food and introductions
6:30 to 7:30 - Presentation and Q&A
7:30 to 9:00 - Open Discussions
This is part of our weekly AI Safety Thursdays series. Join us in examining questions like:
How do we ensure AI systems are aligned with human interests?
How do we measure and mitigate potential risks from advanced AI systems?