LLMs show deceptive behaviour when they have an incentive to do so, whether that's alignment faking or lying about their capabilities. Work published earlier this year by Apollo proposed using linear probes to detect such behaviour from a model's internal activations.
In this talk, Shivam Arora will explain how these probes work and share his research experience from a follow-up project to improve them, conducted as part of a fellowship at LASR Labs.
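For attendees new to the technique: a linear probe is a simple classifier trained on a model's internal activations to test whether some property (here, deception) is linearly readable from them. The sketch below is illustrative only, not the speaker's or Apollo's actual method; it uses synthetic data in place of real extracted activations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical setup: one activation vector per model response, taken from
# a single transformer layer, each labelled honest (0) or deceptive (1).
# Random data stands in for activations extracted from a real model.
rng = np.random.default_rng(0)
d_model, n_samples = 512, 1000
activations = rng.normal(size=(n_samples, d_model))
labels = rng.integers(0, 2, size=n_samples)

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

# The probe is just a logistic regression on activation vectors: it learns
# a direction in activation space whose projection separates the two classes.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
```

With real labelled activations, held-out accuracy well above chance would suggest the model internally represents the distinction; on this synthetic data it should sit near 50%.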
Registration Instructions
This is a paid event ($5 general admission, free for students & job seekers) with limited tickets; you must RSVP on Luma to secure your spot.
Event Schedule
6:00 to 6:30 - Food & Introductions
6:30 to 7:30 - Main Presentation & Questions
7:30 to 9:00 - Open Discussion
If you can't attend in person, join our live stream starting at 6:30 pm via this link.
This is part of our weekly AI Safety Thursdays series, where we examine open questions in AI safety.