Punya Syon Pandey — LessWrong

Explaining undesirable model behavior: (How) can influence functions help?

by Zhijing Jin, TerryJCZhang, and Punya Syon Pandey

Undesirable training data can lead to undesirable model output. This dynamic is commonly summarized as "garbage in, garbage out," and it is a key issue for frontier models trained on web-scale data. How can we efficiently identify these bad apples in massive training datasets (with trillions of tokens)? Influence functions...

Investigating Accidental Misalignment: Causal Effects of Fine-Tuning Data on Model Vulnerability

by Zhijing Jin, Punya Syon Pandey, samuelsimko, and Kellin Pelrine

TL;DR This post discusses our explorations into the effects of domain-specific fine-tuning and how the characteristics of fine-tuning data relate to adversarial vulnerability. We also explore its implications for real-world applications, and offer insights into the importance of dataset engineering as an approach toward achieving true alignment in AI systems....

Jun 11, 2025•6