TL;DR
This post investigates how the sampling method used at inference time can lead a supervised learning model to exhibit strategic behavior, even when such behavior is rare in the training data. Using a toy example, we show that a model trained purely to predict sequences may initially pick lower-probability options in order to make its later predictions easier. The takeaway is that how we use AI models, including seemingly minor choices like the sampling strategy, can significantly shape their behavior.
Introduction
Guiding Question: Under what circumstances can an AI model, trained only with supervised learning to predict future events, learn to exhibit strategic behavior?
In machine learning, particularly with models like GPT-style transformers, the sampling method used during inference can profoundly impact...
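To make concrete what "sampling method" means here, the following is a minimal sketch (not code from this post; the toy logits and helper names are hypothetical) contrasting greedy decoding, which always takes the highest-probability token, with temperature sampling, which sometimes selects lower-probability tokens:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert raw logits to a probability distribution at a given temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical next-token logits over a 3-token vocabulary.
logits = [2.0, 1.5, 0.2]
vocab = ["A", "B", "C"]

# Greedy decoding: deterministically take the argmax token.
greedy_token = vocab[int(np.argmax(logits))]

# Temperature sampling: draw from the softmax distribution,
# so lower-probability tokens are occasionally chosen.
rng = np.random.default_rng(seed=0)
probs = softmax(logits, temperature=1.0)
sampled_token = vocab[rng.choice(len(vocab), p=probs)]

print("greedy:", greedy_token)
print("sampled:", sampled_token)
print("probs:", probs.round(3))
```

Because the model's sampled outputs are fed back as context for its own future predictions, which decoding rule is used determines which sequences the model ends up conditioning on, and that is the lever the rest of the post examines.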