Influence functions - why, what and how

Troof (2y)

Thanks for this! One thing I don't understand about influence functions is: why should I care about the proximal Bregman objective (PBO)? To interpret a model, I'm really interested in the leave-one-out (LOO) retraining, right? Can we still say things like "it seems that the model relied on this training sample for producing this output" with the PBO interpretation?

Nina Panickssery (2y)

I agree that approximating the PBO makes this method more lossy (not all interesting generalization phenomena can be found). However, I think we can still glean useful information about generalization by considering "retraining" from a point closer to the final model than random initialization. The downside is that if, for example, some data was instrumental in causing a phase transition at some point in training, the PBO approximation will not capture this.

Indeed, the paper concedes:

"Influence functions are approximating the sensitivity to the training set locally around the final weights and might not capture nonlinear training phenomena"

Purely empirically, I think Anthropic's results indicate there are useful things that can be learnt, even via this local approximation:

"One of the most consistent patterns we have observed is that the influential sequences reflect increasingly sophisticated patterns of generalization as the model scale increases. While the influential sequences for smaller models tend to have short overlapping sequences of tokens, the top sequences for larger models are related at a more abstract thematic level, and the influence patterns show increasing robustness to stylistic changes, including the language."

My intuition here is that even if we are not exactly measuring the counterfactual "what if this datum was not included in the training corpus?", we could be estimating "what type of useful information is the model extracting from training data that looks like this?".

Gurkenglas (2y)

"I found that influential training digits were usually more sloppy / unclear compared to the average MNIST digit, and shared some resemblance with the query image."

It's pbzcnevat gb gur arnerfg cbvagf ba gur obhaqnel bs qvtvg-pyhfgref! Bonus points if you made your observation without that interpretation in mind. What if you do Jacobian regularization?

Hoagy (2y)

How do you know?

Gurkenglas (2y)

They're the same training examples I would look at to resolve an ambiguous case.

Sonia Joseph (2y)

Thank you for this. How would you think about the pros/cons of influence functions vs activation patching or direct logit attribution in terms of localizing a behavior in the model?

Footnotes

The warm-start problem described by Bae et al. is that, for an objective that is not strictly convex, the influence of a training example in the neighborhood of a minimum $θ^{*}$ may differ from the influence at a different initialization point.


The paper uses homogeneous vector notation to account for biases / affine transformations - you can assume there is a 1 appended to the activations $a$ and a bias vector appended to $W$ to cover this case.
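As a quick sanity check of this notation, here is a minimal NumPy sketch showing that appending a 1 to the activations and the bias as an extra column of $W$ reproduces the usual affine map. The shapes and the names `W_h` and `a_h` are made up for illustration and are not taken from the paper or the codebase.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # weight matrix (out_features, in_features)
b = rng.normal(size=3)        # bias vector (out_features,)
a = rng.normal(size=4)        # input activations (in_features,)

# Homogeneous form: bias becomes the last column of W, and a gets a trailing 1.
W_h = np.hstack([W, b[:, None]])   # shape (3, 5)
a_h = np.append(a, 1.0)            # shape (5,)

affine = W @ a + b     # ordinary affine layer
homog = W_h @ a_h      # same result using a single matrix product
```

With this convention a single matrix stands in for both weights and bias, so the derivation only has to handle one parameter block per layer.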


The paper refers to these as "pseudo-gradients" since they are sampled from the final output distribution and are distinct from gradients during training.
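To make the distinction concrete, here is a small NumPy sketch of one way to read "pseudo-gradient": the label is drawn from the model's own softmax output rather than taken from the dataset, and the gradient of the negative log-likelihood is computed against that sampled label. This is an illustrative assumption about the setup, restricted to the gradient with respect to the logits; `pseudo_gradient_wrt_logits` is a hypothetical helper, not a function from the paper or the codebase.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability.
    e = np.exp(logits - logits.max())
    return e / e.sum()

def pseudo_gradient_wrt_logits(logits, rng):
    """Gradient of -log p[y] w.r.t. the logits, with y sampled from the model."""
    p = softmax(logits)
    y_sampled = rng.choice(len(p), p=p)   # label drawn from the model, not the data
    onehot = np.zeros_like(p)
    onehot[y_sampled] = 1.0
    return p - onehot, y_sampled          # standard softmax cross-entropy gradient

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, -1.0])
g, y = pseudo_gradient_wrt_logits(logits, rng)

# Averaged over the model's own distribution, the pseudo-gradient is zero:
p = softmax(logits)
expected_g = sum(p[k] * (p - np.eye(len(p))[k]) for k in range(len(p)))
```

A training gradient would use the dataset label in place of `y_sampled`, which is why its expectation is generally nonzero while `expected_g` vanishes here.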


The $z_{p}$, $z_{c}$ pair is referred to as the "query" in the paper, as we are "querying" which training examples were most influential for the model producing $z_{c}$ given $z_{p}$.


Specifically, concatenate the gradients of a linear layer's .weight and .bias.
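A minimal PyTorch sketch of what that concatenation could look like, assuming the bias gradient is appended as the last column to match the homogeneous notation above. The layer sizes, the loss, and the name `grad_homog` are made up for illustration; this is not the code from the repository.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 3)            # weight: (3, 4), bias: (3,)
x = torch.randn(5, 4)              # batch of 5 inputs

loss = layer(x).pow(2).sum()       # arbitrary scalar loss, just to get gradients
loss.backward()

# Append the bias gradient as an extra column: (out, in) -> (out, in + 1).
grad_homog = torch.cat(
    [layer.weight.grad, layer.bias.grad.unsqueeze(1)], dim=1
)

# Cross-check: this equals (dL/d output)^T @ [x, 1], the gradient w.r.t. [W b].
with torch.no_grad():
    grad_out = 2 * layer(x)                            # dL/d output for this loss
    x_h = torch.cat([x, torch.ones(5, 1)], dim=1)      # inputs with a 1 appended
    expected = grad_out.T @ x_h
```

The cross-check shows that the concatenated gradient factors into output gradients times homogeneous inputs, which is the structure that KFAC-style approximations rely on.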


If you look through the code and find any bugs (quite possible) or performance improvements (definitely findable; e.g. more batching + splitting of GPU ops - WIP) I'd be super happy to merge PRs and/or hear from you! I hope to gradually improve this codebase and run larger experiments.

