Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training — LessWrong