This research was completed for LASR Labs 2025 by Alex McKenzie, Urja Pawar, Phil Blandfort and William Bankes. The team was supervised by Dmitrii Krasheninnikov, with additional guidance from Ekdeep Singh Lubana and support from David Krueger. The full paper can be found here.
TLDR – We train activation probes on Llama-3.3-70B to detect whether the current interaction is “high-stakes”. This “high-stakes” concept is safety-relevant because it is closely related to risk: 1) when the stakes are high, the potential consequences are significant, and 2) high stakes are closely related to pressure, which has been found to make LLMs behave more deceptively. Compared to black-box LLM-based classification methods, probes are much cheaper to run while showing performance...
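To make the idea of an activation probe concrete, here is a minimal sketch of a linear probe: a logistic-regression classifier fit on mean-pooled activation vectors. It is an illustration under stated assumptions, not the paper's implementation. The data is synthetic (random vectors with one hidden "high-stakes" direction standing in for Llama-3.3-70B activations), and the function names, mean pooling, and hyperparameters are all hypothetical choices made for this example.

```python
import numpy as np


def mean_pool(token_activations):
    """Average per-token activations (tokens x hidden) into a single vector."""
    return np.asarray(token_activations, dtype=float).mean(axis=0)


def train_linear_probe(features, labels, lr=0.1, epochs=500, l2=1e-3):
    """Fit a logistic-regression probe with plain gradient descent."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        probs = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = probs - y  # gradient of the log loss w.r.t. the logits
        w -= lr * (X.T @ grad / len(y) + l2 * w)
        b -= lr * grad.mean()
    return w, b


def probe_score(w, b, features):
    """Return the probe's estimated probability of the positive class."""
    X = np.asarray(features, dtype=float)
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))


def make_synthetic_dataset(n=400, tokens=12, hidden=32, seed=0):
    """ASSUMPTION: synthetic stand-in for real model activations.

    Each example is Gaussian noise shifted along one fixed direction,
    with the sign of the shift determined by the binary label.
    """
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=hidden)
    direction /= np.linalg.norm(direction)
    labels = rng.integers(0, 2, size=n)
    features = []
    for label in labels:
        acts = rng.normal(size=(tokens, hidden))
        acts += (2.0 * label - 1.0) * 1.5 * direction
        features.append(mean_pool(acts))
    return np.stack(features), labels


# Train on the first 300 examples and evaluate on the held-out 100.
X, y = make_synthetic_dataset()
w, b = train_linear_probe(X[:300], y[:300])
held_out_accuracy = ((probe_score(w, b, X[300:]) > 0.5) == y[300:]).mean()
print(f"held-out accuracy: {held_out_accuracy:.2f}")
```

Once the activations have been computed, scoring an input costs one dot product per example, which is the intuition behind probes being cheap relative to running a second LLM as a classifier. With real activations, the pooling strategy and the layer the activations are read from would be design choices to validate rather than assume.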