Explaining undesirable model behavior: (How) can influence functions help?
by Zhijing Jin, TerryJCZhang, and Punya Syon Pandey
Undesirable training data can lead to undesirable model output. This dynamic is commonly phrased as "garbage in, garbage out," and it is a key issue for frontier models trained on web-scale data. How can we efficiently identify these bad apples in massive training datasets (with trillions of tokens)? Influence functions...
Mar 218