Detecting High-Stakes Interactions with Activation Probes
by Arrrlex, williambankes, Urja Pawar, Phil Blandfort, David Scott Krueger, and Dmitrii Krasheninnikov
This research was completed for LASR Labs 2025 by Alex McKenzie, Urja Pawar, Phil Blandfort and William Bankes. The team was supervised by Dmitrii Krasheninnikov, with additional guidance from Ekdeep Singh Lubana and support from David Krueger. The full paper can be found here. TLDR – We train activation probes...
Jul 21, 2025