[Not very confident, but just saying my current view.]
I'm pretty skeptical about integrated gradients.
As far as why, I don't think we should care about the derivative at the baseline (zero or the mean).
As far as the axioms, I think I get off the train on "Completeness" which doesn't seem like a property we need/want.
I think you just need to eat that there isn't any sensible way to do something reasonable that gets Completeness.
The same applies with attribution in general (e.g. in decision making).
As in, you're also skeptical of traditional Shapley values in discrete coalition games?
"Completeness" strikes me as a desirable property for attributions to be properly normalized. If attributions aren't bounded in some way, it doesn't seem to me like they're really 'attributions'.
Very open to counterarguments here, though. I'm not particularly confident here either. There's a reason this post isn't titled 'Integrated Gradients are the correct attribution method'.
Integrated gradients is a computationally efficient attribution method (compared to activation patching / ablations) grounded in a series of axioms.
Maybe I'm confused, but isn't integrated gradients strictly slower than an ablation to a baseline?
If you want to get attributions between all pairs of basis elements/features in two layers, attributions based on the effect of a marginal ablation will take you $n^2$ forward passes, where $n$ is the number of features in a layer. Integrated gradients will take $n$ backward passes (per integration step), and if you're willing to write custom code that exploits the specific form of the layer transition, it can take less than that.
If you're averaging over a data set, IG is also amenable to additional cost reduction through stochastic source techniques.
Maybe I'm confused, but isn't integrated gradients strictly slower than an ablation to a baseline?
For a single interaction, yes (1 forward pass vs. an integral with n_alpha integration steps, each requiring a backward pass).
For many interactions (e.g. all connections between two layers) IGs can be faster:
(This is assuming you do path patching rather than "edge patching", which you should in this scenario.)
Sam Marks makes a similar point in Sparse Feature Circuits, near equations (2), (3), and (4).
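For concreteness, here is a minimal PyTorch sketch of the all-pairs computation (the function name, the use of `torch.autograd.functional.jacobian`, and the plain Riemann sum over alpha are illustrative choices, not the LIB implementation):

```python
import torch

def ig_all_pairs(layer_map, f_l1, baseline, n_alpha=10):
    """Integrated-gradient attributions between all feature pairs of two layers.

    layer_map : callable, layer-l1 activations (d1,) -> layer-l2 activations (d2,)
    f_l1      : layer-l1 activation vector at one datapoint, shape (d1,)
    baseline  : baseline activation b^l1, shape (d1,)

    Each jacobian call costs d2 backward passes, so the total cost is
    n_alpha * d2 backward passes, rather than one patched forward pass
    per (source, target) pair.
    """
    grads = torch.zeros(layer_map(baseline).shape[0], f_l1.shape[0])
    for alpha in torch.linspace(0.0, 1.0, n_alpha):
        z = baseline + alpha * (f_l1 - baseline)   # point on the straight path
        grads += torch.autograd.functional.jacobian(layer_map, z)  # shape (d2, d1)
    avg_grad = grads / n_alpha                     # crude estimate of the path integral
    return avg_grad * (f_l1 - baseline)            # A[i, j], broadcast over source index j
```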
We now have a method for how to do attributions on single data points. But when we're searching for circuits, we're probably looking for variables that have strong attributions between each other on average, measured over many data points.
Maybe?
One thing I've been thinking about a lot recently is that building tools to interpret networks on individual datapoints might be more relevant than attributing over a dataset. This applies if the goal is to make statistical generalizations, since richer structure on an individual datapoint gives you more to generalize with, but it also applies if the goal is the inverse, to go from general patterns to particulars, since it would provide a richer method for debugging, noticing exceptions, etc.
And basically the trouble that a lot of work attempting to generalize runs into is that some phenomena are very particular to specific cases, so one risks losing a lot of information by focusing only on the generalizable findings.
Either way, cool work. It seems like we've been thinking along similar lines, but you've put in more work.
The issue with single datapoints, at least in the context we used this for, which was building interaction graphs for the LIB papers, is that the answer to 'what directions in the layer were relevant for computing the output?' is always trivially just 'the direction the activation vector was pointing in.'
This then leads to every activation vector becoming its own 'feature', which is clearly nonsense. To understand generalisation, we need to see how the network is re-using a small common set of directions to compute outputs for many different inputs. Which means looking at a dataset of multiple activations.
And basically the trouble that a lot of work attempting to generalize runs into is that some phenomena are very particular to specific cases, so one risks losing a lot of information by focusing only on the generalizable findings.
The application we were interested in here was getting some well founded measure of how 'strongly' two features interact. Not a description of what the interaction is doing computationally. Just some way to tell whether it's 'strong' or 'weak'. We wanted this so we could find modules in the network.
Averaging over data loses us information about what the interaction is doing, but it doesn't necessarily lose us information about interaction 'strength', since that's a scalar quantity. We just need to set our threshold for connection relevance sensitive enough that making a sizeable difference on a very small handful of training datapoints still qualifies.
A short post laying out our reasoning for using integrated gradients as an attribution method. It is intended as a stand-alone post based on our LIB papers [1] [2]. This work was produced at Apollo Research.
Context
Understanding circuits in neural networks requires understanding how features interact with other features. There are a lot of features, and their interactions are generally non-linear. A good starting point for understanding the interactions might be to just figure out how strongly each pair of features in adjacent layers of the network interacts. But since the relationships are non-linear, how do we quantify their 'strength' in a principled manner that isn't vulnerable to common and simple counterexamples? In other words, how do we quantify how much of the value of a feature in layer $l+1$ should be attributed to a feature in layer $l$?
This is a well-known sort of problem, originally investigated in cooperative game theory. A while ago it made its way into machine learning, where people got quite interested in attributing neural network outputs to their inputs. More recently, it has entered interpretability, in the context of attributing variables in one hidden layer of a neural network to variables in another layer.
Generally, the way people go about this is to set up a series of 'common-sense' axioms that the attribution method should fulfil in order to be self-consistent and act the way an attribution is supposed to act. Then they try to show that there is one unique method satisfying these axioms. Except that (a) people disagree about which axioms are 'common-sense', and (b) the axioms people maybe agree on most don't quite single out a unique method, just a class of methods called path attributions. So no attribution method has really been generally accepted as the canonical 'winner' in the ML context yet, though some methods are certainly more popular than others.
Integrated Gradients
Integrated gradients is a computationally efficient attribution method (compared to activation patching / ablations) grounded in a series of axioms. It was originally proposed in the context of economics (Friedman 2004) and later used to attribute neural network outputs to their inputs (Sundararajan et al. 2017). More recently, it has also been used for internal feature attribution (Marks et al. 2024, Redwood Research (unpublished) 2022).
Properties of integrated gradients
Suppose we want to explain to what extent the value of an activation $f^{l_2}_i$ in a layer $l_2$ of a neural network can be 'attributed to' the various components of the activations $f^{l_1}=[f^{l_1}_0,\dots,f^{l_1}_d]$ in layer $l_1$ upstream of $l_2$.[1] For now, we do this for a single datapoint only. So we want to know how much $f^{l_2}_i(x)$ can be attributed to $f^{l_1}_j(x)$. We'll write this attribution as $A^{l_2,l_1}_{i,j}(x)$.
There is a list of four standard properties that attribution methods should satisfy, which single out path attributions as the only kind of attribution method that can be used to answer this question. Integrated gradients, and other path attribution methods, fulfil all of these (Sundararajan et al. 2017).
If you add on a fifth requirement, that the attribution method behaves sensibly under coordinate transformations, integrated gradients is the only attribution method that satisfies all five axioms:
In other words, all the attribution should go to the direction our activation vector $f^{l_1}(x)$ actually lies in. If we go into an alternate basis of coordinates such that one of our coordinate basis vectors $e_1$ lies along $f^{l_1}(x)$, $e_1=\frac{f^{l_1}(x)}{||f^{l_1}(x)||}$, then the component along $e_1$ should get all the attribution at data point $x$, because the other components aren't even active and thus obviously can't influence anything.
We think that this is a pretty important property for an attribution method to have in the context of interpreting neural network internals. The hidden layers of neural networks don't come with an obvious privileged basis. Their activations are vectors in a vector space, which we can view in any basis we please. So in a sense, any structure in the network internals that actually matters for the computation should be coordinate independent. If our attribution methods are not well-behaved under coordinate transformations, they can give all kinds of misleading results, for example by taking the network out of the subspace the activations are usually located in.
Property 4 already ensures that the attributions are well-behaved under linear coordinate transformations of the target layer $l_2$. This 5th axiom ensures they're also well-behaved under coordinate transforms in the starting layer $l_1$.
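To preview why the straight-line path has this property (anticipating the integrated gradient formula below, and taking baseline $b^{l_1}=0$ for simplicity): in any orthonormal basis of layer $l_1$ with coordinates $c_m$, the straight-path attribution to coordinate $m$ is
$$A^{l_2,l_1}_{i,m}(x)=c_m(x)\int_0^1 d\alpha\left[\frac{\partial}{\partial c_m}F^{l_2,l_1}_i(c)\right]_{c=\alpha\,c(x)},$$
which vanishes whenever $c_m(x)=0$. So in a basis whose first vector $e_1$ points along $f^{l_1}(x)$, every coordinate orthogonal to $f^{l_1}(x)$ has $c_m(x)=0$ and receives zero attribution, leaving all of it on the $e_1$ component.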
We will show below that adding the 5th requirement singles out integrated gradients as the canonical attribution method that satisfies all five requirements.
Integrated gradient formula
The general integrated gradient formula to attribute the influence of feature $f^{l_1}_j(x)$ in a layer $l_1$ on feature $f^{l_2}_i(x)$ in layer $l_2$ is given by an integral along a straight-line path $C$ in layer $l_1$ activation space. To clarify notation, we introduce the function $F^{l_2,l_1}:\mathbb{R}^{d_{l_1}}\to\mathbb{R}^{d_{l_2}}$ which maps activations from layer $l_1$ to layer $l_2$. For example, in an MLP (bias folded in) we might have $F^{l_2,l_1}(f^{l_1})=\mathrm{ReLU}(W^{l_1}f^{l_1})$. Then we can write the attribution from $f^{l_1}_j(x)$ to $f^{l_2}_i(x)$ as
$$A^{l_2,l_1}_{ij}(x) := \int_C dz_j\left[\frac{\partial}{\partial z_j}F^{l_2,l_1}_i(z)\right]_{z=\alpha f^{l_1}(x)+(1-\alpha)b^{l_1}} = f^{l_1}_j(x)\int_0^1 d\alpha\left[\frac{\partial}{\partial z_j}F^{l_2,l_1}_i(z)\right]_{z=\alpha f^{l_1}(x)+(1-\alpha)b^{l_1}},$$
where $z$ is a point in the layer $l_1$ activation space, and the path $C$ is parameterised by $\alpha\in[0,1]$, such that along the curve we have $z(\alpha)=\alpha f^{l_1}(x)+(1-\alpha)b^{l_1}$.[2]
Intuitively, this formula asks us to integrate the gradient of $f^{l_2}_i(x)$ with respect to $f^{l_1}_j(x)$ along a straight path from a baseline activation $b^{l_1}$ to the actual activation vector $f^{l_1}(x)$, and to multiply the result by $f^{l_1}_j(x)$.
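As a concrete illustration, here is a minimal PyTorch sketch of this formula for a single pair $(i,j)$, using baseline $b^{l_1}=0$ and a simple Riemann sum over $\alpha$ (the function name and integration scheme are our own choices, not the LIB implementation):

```python
import torch

def ig_attribution(F, f_l1, i, j, n_alpha=50):
    """Approximate A^{l2,l1}_{ij}(x) for one feature pair, with baseline b^{l1} = 0.

    F     : callable, layer-l1 activations (d1,) -> layer-l2 activations (d2,)
    f_l1  : layer-l1 activation vector f^{l1}(x), shape (d1,)
    i, j  : target feature index in l2, source feature index in l1
    """
    grad_sum = 0.0
    for alpha in torch.linspace(0.0, 1.0, n_alpha):
        z = (alpha * f_l1).detach().requires_grad_(True)  # point on the straight path
        F_i = F(z)[i]                                      # scalar target feature
        (grad_z,) = torch.autograd.grad(F_i, z)            # one backward pass
        grad_sum = grad_sum + grad_z[j]
    return f_l1[j] * grad_sum / n_alpha                    # f^{l1}_j(x) times averaged gradient
```

With more $\alpha$ steps (or a trapezoid rule) the sum converges to the integral; in practice one would batch over output features or compute a full Jacobian rather than loop over single pairs.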
Proof sketch: Integrated Gradients are uniquely consistent under coordinate transformations
Friedman 2004 showed that any attribution method satisfying the first four axioms must be a path attribution of the form
$$A^{l_2,l_1}_{ij}(x) := \int_C dz_j\left[\frac{\partial}{\partial z^{l_1}_j}F^{l_2,l_1}_i(z^{l_1})\right] \quad\text{with}\quad z^{l_1}(\alpha):\mathbb{R}\to\mathbb{R}^{n_{l_1}},\quad z^{l_1}(0)=b^{l_1},\quad z^{l_1}(1)=f^{l_1}(x),$$
or a convex combination (weighted average with weights $c_k$) of these
$$A^{l_2,l_1}_{ij}(x) := \sum_k c_k\int_{C_k} dz_{k,j}\left[\frac{\partial}{\partial z^{l_1}_{k,j}}F^{l_2,l_1}_i(z^{l_1}_k)\right] \quad\text{with}\quad z^{l_1}_k(\alpha):\mathbb{R}\to\mathbb{R}^{n_{l_1}},\quad z^{l_1}_k(0)=b^{l_1},\quad z^{l_1}_k(1)=f^{l_1}(x),\quad \sum_k c_k=1,\; c_k\geq 0.$$
Each term is a line integral along a monotonic path $C_k$ in the activation space of layer $l_1$ that starts at the baseline $b^{l_1}$ and ends at the activation vector $f^{l_1}(x)$.
Claim: The only attribution that also satisfies the fifth axiom is the one along the straight line from $b^{l_1}$ to $f^{l_1}(x)$. That is, $c_k=0$ for all the paths in the sum except for the path parametrised as
$$z^{l_1}_1(\alpha)=b^{l_1}(1-\alpha)+\alpha f^{l_1}(x).$$
Proof sketch: Take
$$f^{l_2}\left(f^{l_1}(x)\right)=b^{l_1}+\sum_k U_{1,k}\left(f^{l_1}_k(x)-b^{l_1}_k\right)\,e^{-\sum_{\{i\,|\,i>1\}}\left(\sum_j U_{i,j}\left(f^{l_1}_j(x)-b^{l_1}_j\right)\right)^2}$$
as the mapping between layers $l_1$ and $l_2$, with $U\in\mathbb{R}^{n_{l_1}\times n_{l_1}}$ an orthogonal matrix, $UU^T=1$, and $U_{1,k}=\frac{f^{l_1}_k(x)-b^{l_1}_k}{||f^{l_1}(x)-b^{l_1}||}$. Then, for any monotonic paths $C_k$ which are not the straight line $z^{l_1}_1(\alpha)$, at least one direction $v$ in layer $l_1$ with $v\cdot f^{l_1}(x)=0$ will be assigned an attribution $>0$.
Since no monotonic path leads to a negative attribution, the sum over all paths must then also yield an attribution $>0$ for those $v$, unless $c_k=0$ for every path in the sum except $z^{l_1}_1(\alpha)=b^{l_1}(1-\alpha)+\alpha f^{l_1}(x)$.
The problem of choosing a baseline
The integrated gradient formula still has one free hyperparameter in it: the baseline $b^{l}$. We're trying to attribute the activations in one layer to the activations in another layer. This requires specifying the coordinate origin relative to which the activations are defined.
Zero might look like a natural choice here, but if we are folding the biases into the activations, do we want the baseline for the bias to be zero as well? Or maybe we want the origin to be the expectation value of the activations $\mathbb{E}(f^{l})$ over the training dataset? But then we'd have a bit of a consistency problem with axiom 2 across layers, because the expectation value of a layer $\mathbb{E}(f^{l+1})$ often will not equal its activation at the expectation value $\mathbb{E}(f^{l})$ of the previous layer, $\mathbb{E}(f^{l+1})\neq F^{l+1,l}(\mathbb{E}(f^{l}))$. So, with this baseline the attributions to the activations in a layer $l$ would not add up to the activations in layer $l+1$. In fact, for some activation functions, like sigmoids for example, $0\neq F^{l+1,l}(0)$, so baseline zero potentially has this consistency problem as well.
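For example, here is a quick numerical illustration of both consistency problems with a toy sigmoid layer (random weights and data, purely illustrative):

```python
import torch

torch.manual_seed(0)
W = torch.randn(8, 8)
F = lambda f_l: torch.sigmoid(f_l @ W)     # toy layer transition F^{l+1,l}

f_l = torch.randn(1000, 8)                 # layer-l activations over a toy dataset

print(F(torch.zeros(8)))                   # = 0.5 everywhere, so F(0) != 0
gap = F(f_l).mean(0) - F(f_l.mean(0))      # E(f^{l+1}) vs F^{l+1,l}(E(f^l))
print(gap.abs().max())                     # generally nonzero
```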
We don't feel like we have a good framing for picking the baseline in a principled way yet.
Attributions over datasets
We now have a method for how to do attributions on single data points. But when we're searching for circuits, we're probably looking for variables that have strong attributions between each other on average, measured over many data points. But how do we average attributions for different data points into a single attribution over a data set in a principled way?
We don't have a perfect answer to this question. We experimented with applying the integrated gradient definition to functionals, attributing measures of the size of the function $f^{l_2}_i: x\mapsto f^{l_2}_i(x)$ to the functions $f^{l_1}_j: x\mapsto f^{l_1}_j(x)$, but found counter-examples to those (e.g. cancellation between negative and positive attribution). Thus we decided to simply take the RMS over attributions on single datapoints
$$A^{l_2,l_1}_{i,j}(D)=\sqrt{\sum_{x\in D} A^{l_2,l_1}_{i,j}(x)^2}.$$
This averaged attribution does not itself fulfil axiom 2 (completeness), but it seems workable in practice. We have not found any counterexamples (situations where $A^{l_2,l_1}_{i,j}(D)=0$ even though $f^{l_1}_j$ is obviously important for $f^{l_2}_i$) for good choices of bases (such as LIB).
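In code, this aggregation is just a root-sum-of-squares over per-datapoint attribution tensors; a minimal sketch (the names and the threshold are ours):

```python
import torch

def dataset_attribution(A_per_datapoint):
    """Aggregate per-datapoint attributions A^{l2,l1}_{i,j}(x) over a dataset.

    A_per_datapoint : tensor of shape (n_datapoints, d2, d1)
    Squaring before summing stops positive and negative attributions on
    different datapoints from cancelling each other out.
    """
    return A_per_datapoint.pow(2).sum(dim=0).sqrt()

# e.g. keep connection (i, j) if its aggregated attribution clears some threshold tau:
# strong_edges = dataset_attribution(A) > tau
```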
Acknowledgements
This work was done as part of the LIB interpretability project [1] [2] at Apollo Research where it benefitted from empirical feedback: the method was implemented by Dan Braun, Nix Goldowsky-Dill, and Stefan Heimersheim. Earlier experiments were conducted by Avery Griffin, Marius Hobbhahn, and Jörn Stöhler.
The activation vectors here are defined relative to some baseline b. This can be zero, but it could also be the mean value over some data set.
Integrated gradients still leaves us a free choice of baseline relative to which we measure activations. We chose 0 for most of this post for simplicity, but e.g. the dataset mean of the activations also works.