frankyaoxiao — LessWrong

13

2y

Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training

by Santiago Aranguri and frankyaoxiao

Introduction

Research by Frank Xiao (SPAR mentee) and Santiago Aranguri (Goodfire). Post-training can introduce undesired side effects that are difficult to detect and even harder to trace to specific training datapoints. We show that a probe-based method can surface concerning behaviors that emerge during LLM post-training, and that probes can...