Phil Blandfort — LessWrong

Detecting High-Stakes Interactions with Activation Probes

by Arrrlex, williambankes, Urja Pawar, Phil Blandfort, David Scott Krueger, and Dmitrii Krasheninnikov

This research was completed for LASR Labs 2025 by Alex McKenzie, Urja Pawar, Phil Blandfort and William Bankes. The team was supervised by Dmitrii Krasheninnikov, with additional guidance from Ekdeep Singh Lubana and support from David Krueger. The full paper can be found here. TLDR – We train activation probes...

Jul 21, 2025 • 50

Sampling Effects on Strategic Behavior in Supervised Learning Models

TLDR: This post investigates how different sampling methods during inference can lead supervised learning models to exhibit strategic behavior, even when such behavior is rare in the training data. Through a toy example, we demonstrate that an AI model trained solely to predict sequences can choose less likely options initially...

Sep 24, 2024 • 1