Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training
by Santiago Aranguri and frankyaoxiao
Introduction
Research by Frank Xiao (SPAR mentee) and Santiago Aranguri (Goodfire). Post-training can introduce undesired side effects that are difficult to detect and even harder to trace to specific training datapoints. We show that a probe-based method can surface concerning behaviors that emerge during LLM post-training, and that probes can...
Apr 29