Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

If influence functions are not approximating leave-one-out, how are they supposed to help?

6Quintin Pope

2Fabien Roger

4Charlie Steiner

2IlyaShpitser

New Comment

4 comments, sorted by Click to highlight new comments since: Today at 10:05 AM

Re empirical evidence for influence functions:

Didn't the Anthropic influence functions work pick up on LLMs not generalising across lexical ordering? E.g., training on "A is B" doesn't raise the model's credence in "Bs include A"?

Which is apparently true: https://x.com/owainevans_uk/status/1705285631520407821?s=46

I feel like there's also a Bayesian NN perspective on the PBRF thing. It has some ingredients that look like a Gaussian prior (L2 regularization), and an update (which would have to be the combination of the first two ingredients - negative loss function on a single datapoint but small overall difference in loss).

Said like this, it's obvious to me that this is way different from leave-one-out. First learning to get low loss at a datapoint and then later learning to get high loss there is not equivalent to never learning anything directly about it.

Thanks to Roger Grosse for helping me understand his intuitions and hopes for influence functions. This post combines highlights from some influence function papers, some of Roger Grosse’s intuitions (though he doesn’t agree with everything I’m writing here), and some takes of mine.Influence functions are informally about some notion of influence of a training data point on the model’s weights. But in practice, for neural networks, “influence functions” do not approximate well “what would happen if a training data point was removed”. Then, what are influence functions about, and what can they be used for?

## From leave-one-out to influence functions

Ideas fromBae 2022(If influence functions are the answer, what is the question?).The leave-one-out function is the answer to “what would happen, in a network trained to its global minima, if one point was omitted”: LOO(^x,^y)=argminθ1N∑(x,y)∼D−{(^x,^y)}L(fθ(x),y)

Under some assumptions such as a strongly convex loss landscape, influence functions are cheap-to-compute approximation to leave-one-out function, thanks to the Implicit Function Theorem, which tells us that under those assumptions LOO(^x,^y)≈IF(^x,^y)def=θ∗+(∇2θJ(θ∗))−1∇θL(f(θ∗,^x),^y)/N

But these assumptions don't hold for neural networks, and

Basu 2020shows thatinfluence functions are a terrible approximation of leave-one-outin the context of neural networks, as shown in this figure fromBae 2022(left is for Linear Regression, where the approximation hold, right is for MultiLayer-Perceptron, where it doesn’t):Moreover, even the leave-one-out function is about parameters at convergence, which is not the regime most deep learning training runs operate in. Therefore, influence functions are even less about answering the question “what would happen if this point had more/less weight in the (incomplete) training run?”.

So every time you see someone introducing influence functions as an approximation of the effect of up/down-weighting training data points (as in

this LW post about interpretability), remember that this does not apply when they are applied to neural networks.## What are influence functions doing

Bae 2022shows that influence functions (not leave-one-out!) can be well approximated by the minimization of another training objective called PBRF, which is the sum of 3 terms:This does not answer the often advertised question about leave-one-out, but this does answer something which looks related, and which happens to be much cheaper to compute than the leave-one-out function (which can only be computed by retraining the network and doesn’t have cheaper approximations).

Influence functions are currently among the few options to say anything about the intuitive “influence” of individual data points in large neural networks, which justifies why they are used. (Alternatives have roughly the same kind of challenges as influence functions.)

Note: this explanation of what influence function are doing is not the only way to describe their behavior, and other works may shine new lights on what they are doing.## What are influence functions useful for

## Current empirical evidence

To this date, there has been almost no work externally validating that influence functions tell us anything about the influence of data points on neural networks behaviors beyond:

on large language modelsBae 2022)(These two methods are common in the influence function literature and were not introduced by the two papers I cite here.)

[Edit] Other empirical evidence since the initial post date: influence functions correctly predict that LLMs don't generalize from "A is B" to "B is A".

I would be excited about future work using influence functions to make specific predictions about the effect of removing some points from the training set and retraining the network to see if those predictions were accurate. I would be even more interested if those predictions were about generalization properties of LLMs outside their training distribution.

## Speculations about what influence functions won't and will be useful for

The main practical use case I see is formulating hypotheses about what kind of dataset modification would lead to what changes in behavior. I say “formulating hypotheses” and not something stronger like “making suggestions” because I currently see no strong evidence that influence functions can make reliable predictions about effects of modifying training datasets.

If in the future, influence functions are shown to be a good way to suggest and predict effects of dataset modifications, how would that be differentially useful for AI safety? I see 3 paths forward: