I later learned of the existence of this paper, which is even more similar to the ideas discussed here. https://arxiv.org/pdf/1704.02685.pdf^{[2]}

In it, I examine 2 of the methods for neural network visualization, and show that they have structural similarities. I show that these algorithms only differ by a difference in how they treat the biases, and (possibly a difference in getting started)

The second algorithm obeys conservation laws, it tries to parcel credit and blame for a decision up to the input neurons.

Intro

The task we want is to assign importance to different inputs of a neural network in production of an output. So for example, in the case of a trained image classifier, the visualization method would take in a particular image, and highlight the parts of the image that the network thought were important.

A general method for visualizing neural networks is back-propagation. First evaluate the network forwards. Then work backwards through the network by using some rule about how to reverse each individual layer.

One example of this is differentiation. Finding the rate of change of the output, with respect to each input. But there are others.

Firstly, lets pretend biases in the network don't exist. We are allowed non-linearities, so long as they satisfy the equation f(0)=0 .

Lets look at various layer types and the different back-propagation rules.

Maximum

Most often found in the form of max pooling.

b=maxiai

Gradient

The rule used in gradient descent, and I think the only rule used in the paper above for back propagating maximum. (Notation note. R here isn't exactly a function. Its more like ddx , its output is related to the context in which the input occurs. Think of every number in the forward net having an associated number )

R(ai)={R(b)ai=b0else

Ignoring the case of an exact tie. Exact ties, and what to do on that kind of singular point will be ignored in general, because they are fiddly and unimportant.

Radial.

Treats 0 as a special point.

R(ai)=siR(b)

Where the plot above shows si and the si are non-negitive and sum to 1.

Case. ai<0,b>0⟹si=0 . Relative to 0, a negative value contributes nothing to a positive maximum.

Case, ai>0⟹si=ai∑ai>0ai (In particular si=1 if i is the only value with positive ai

Case all negative. si=1ai∑i1ai .

Matrix multiplication

A common operation in neural networks. Even convolutions can be expressed as a matrix multiplication. The matrix just happens to be sparse, and contain repetitions.

bj=∑iaiwij

Gradient

Yet again a popular choice.

R(ai)=∑jwijR(bj)

Normalized

What the LRP algorithm uses is

R(ai)=∑jaiwijblip(bj)R(bj)

(This is deduced from equation 6 in ^{[1]} )

Where blip is this function

A straight line of gradient 1, except for a little vertical jump to avoid 0.

Why are they using this blip? Because they want to divide by bj here to get an interesting theoretical property (conservation of total) but if they don't add this little jump, they get numerical instability caused by dividing by something too close to 0.

Nonlinearity

There are several mechanisms proposed to deal with an arbitrary potentially non-linear function b=f(a), applied elementwise. We will impose the condition f(0)=0 for now.

Ignore

Very simple R(a)=R(b)

Gradient

R(a)=f′(a)R(b) standard rule of calculus.

Slope

R(a)=baR(b)

Repeat

R(a)=f(R(b))

Downside: Is nonlinear in R(b), unlike every other method here.

Consistency rules

You can't just pick any option from each of these lists. Well you could. But there are some nice mathematical consistency properties it would be nice to have.

Scaling equivalence

Suppose you want to multiply all values in the network by a constant α, there are 2 ways you could do this. You could see the constant as a scaled identity matrix, and use the matrix multiplication rules. Or you could see the constant multiplication as a special case of the arbitrary elementwise function.

Matrix Rule

Nonlinearity Rule

Scale

Gradient

Gradient

Slope

Repeat

α

Normalized

Ignore

1

The other condition we might reasonably impose is that the nonlinearity formed by taking the maximum with a set of constants is treated the same way by the nonlinearity and maximum rules. Actually the constants must all be ≤0 because of the f(0)=0 condition.

Gradient matches gradient. (Unsurprising)

In general its very hard to get anything else to work because max(a,c)=max(a,max(c,c))=max(a,c,c)=max(max(a,c),c) which ruins most kinds of credit splitting. Still, I suspect that radial maximum matches slope nonlinearity in the special case of a maximum of 2 objects. (Based on visually similar graphs.)

The Algorithms in ^{[1]}

The paper mentions gradients as existing work.

It also mentions work on the Deconvolution Method by Zeiler and Fergus, (2014) that roughly amounts to.

Layer type

Back propagation rule used

Maximum

Gradient

Matrix Multiply

Gradient

Nonlinearity

Repeat ^{[3]}

The LRP Algorithm by (Bach et al., 2015) has this form.

Layer type

Back propagation rule used

Maximum

Gradient

Matrix Multiply

Normalized^{[4]}

Nonlinearity

Ignore

Simplifying the LRP

Firstly let ϵ=0 in the Normalized matrix multiply backpropagation rule.

Let R(a) be the backpropagation rule based on the table above. (Gradient Maximum, Normalized Matrix Multiply, Ignore Nonlinearity)

As the authors of ^{[1]} show, these rules conserve ∑iR(ai) across network layers.

Consider another measure of the importance of a node Q(a)=R(a)/a. Can we rephrase all the equations to be in terms of Q(a) instead?

So this transformation turns the fancy normalization rule into the gradient again. All the problems where division means a risk of divide by 0, or more realistically dividing by almost 0 and getting a crazy huge number, have vanished.

Rephrasing Nonlinearity

Here is the place it gets a little more complicated.

b=f(a)R(a)=R(b)Q(a)=R(a)a=R(b)bba=baQ(b)

Under rephrasing, the trivial Ignore rule turns into the nontrivial slope rule.

Layer type

In terms of R

In terms of Q

Maximum

Gradient

Gradient

Matrix Multiply

Normalized

Gradient

Nonlinearity

Ignore

Slope

This slope rule works well so long as f(0)=0. I mean technically there are lots of fiddly analysis details here. Lets just say. f is continuous, and that ∃δ>0:f′(x) is defined and bounded on (−δ,0)∪(0,δ).

But if you haven't done real analysis, all you really need to know is that there are lots of slightly different definitions of reasonably nice functions, and so long as f(0)=0 and there is nothing like the sharp cusp of a square-root going on at 0, the result is nice enough.

Also note that Slope and Gradient are the same method if the only nonlinearity used is Relu.

So all that is left is to deal with the biases.

Rephrasing conservation

This is simple. ∑iaiQ(ai) is now the conserved quantity.

End conditions

In order to make these whole procedures exactly the same, we need to make sure the end conditions match too. At the start of the forward process, where backpropagation ends, you can just multiply Q(input) by input. At the end of the forward process, where backpropagation starts, you can divide. The work by Bach et al started the reverse process using the output of the network, which was considered to be a positive scalar representing the likelihood assigned by the network to some particular classification.

This means that the reformulated version, the backpropagation is started with Q(output)=outputoutput=1

What the actual problem was

Sometimes in a neural net, some fragment of decision comes down to a bias.

Sometimes the answer to "why is this number so large" would almost entirely be a large bias. The method bach et al proposed would focus on the tiny inputs or tiny weights rescaling them up in an attempt to explain the value.

This produces numerical instability by scaling up negligibly tiny weights and inputs. Which is exactly why bach et al added the blip function.

Solution 1: Blame the Bias

Augment the input data of the network with a list of 1's with the number of 1's equalling the number of biases used in the network. (Here a nonlinearity with f(0)≠0 can be considered to have a bias added to it.) These propagate through the network unaffected until each 1 is used up by scaling it by its corresponding bias, and adding it to where the bias should go.

You have now converted a network with biases into a network without biases. This means the techniques I propose work fine without numerical instability.

However, when propagating relevance backwards, some of that relevance goes to the list of 1's that correspond to the biases. So you get a heatmap of the relevant parts of the image. But also extra data about the relevance of various biases.

Instead of using lots of distinct 1's, you could use a single 1 everywhere. This would just add up the relevances of all individual nodes into a single total relevance.

Solution 2: Compare to baseline

This solution involves picking some "baseline" input. This could be an all blank image, or the average over the whole dataset. Run this through the network as well as the image you want to analyze. This gives b=f(a) from the image, and ¯b=f(¯a) from the baseline. We can now define the nonlinearity step as

R(a)=b−¯ba−¯aR(b)

This is symmetric in the 2 images.

There are several different approaches to taking the maximum. One approach is to consider m=exp(1ϵ∑log(ϵxi)), the softmax function with softness ϵ. This function can be composed of the basic building blocks described above. So the relevancy can be calculated for any fixed ϵ. Then just take the limit limϵ→0.

Another way of computing the maximum is to use max(a,b)=a+relu(b−a). Unfortunately these 2 computations suggest different relevancies.

There is an old apocryphal story about someone training neural networks to distinguish dogs from wolves, and all the dogs being on grass, and all the wolves being on snow, so the network learned to distinguish grass from snow instead.

So suppose you have carefully highlighted all the dogs and wolves in the training data. But you won't have that extra data at run time. My idea was to optimize the net to make sure that the region of the image containing the animal was marked relevant. (The whole relevancy calculation is differentiable, so can be optimized during training) The end result of this training procedure should be a single perfectly normal neural net that can be shown a wolf on grass for the first time, and correctly classify it.

Unfortunately I haven't got this to work. All I got was the network rampantly goodhearting the relevancy metric. (Well I was trying this on a smaller MNIST based problem, but it was conceptually the same)

^{^}

Evaluating the visualization of what a Deep Neural Network has learned Wojciech Samek Member, IEEE, Alexander Binder, Gr ́egoire Montavon, Sebastian Bach, and Klaus-Robert https://arxiv.org/pdf/1509.06321v1.pdf

^{^}

Learning Important Features Through Propagating Activation Differences Avanti Shrikumar Peyton Greenside Anshul Kundaje https://arxiv.org/pdf/1704.02685.pdf

^{^}

Relu only, other nonlinearities are not considered in this work.

^{^}

Later in the paper they propose another option here. This other option will not be discussed further.

It seems like the 5th sentence has its ending cut off? "it tries to parcel credit and blame for a decision up to the input neurons, even when credit and blame" , seems like it should continue [do/are x] for some x.

## Background

This post is strongly based on this paper, which it calls the LRP algorithm, https://arxiv.org/pdf/1509.06321v1.pdf

^{[1]}I later learned of the existence of this paper, which is even more similar to the ideas discussed here. https://arxiv.org/pdf/1704.02685.pdf

^{[2]}In it, I examine 2 of the methods for neural network visualization, and show that they have structural similarities. I show that these algorithms only differ by a difference in how they treat the biases, and (possibly a difference in getting started)

The second algorithm obeys conservation laws, it tries to parcel credit and blame for a decision up to the input neurons.

## Intro

The task we want is to assign importance to different inputs of a neural network in production of an output. So for example, in the case of a trained image classifier, the visualization method would take in a particular image, and highlight the parts of the image that the network thought were important.

A general method for visualizing neural networks is back-propagation. First evaluate the network forwards. Then work backwards through the network by using some rule about how to reverse each individual layer.

One example of this is differentiation. Finding the rate of change of the output, with respect to each input. But there are others.

Firstly, lets pretend biases in the network don't exist. We are allowed non-linearities, so long as they satisfy the equation f(0)=0 .

Lets look at various layer types and the different back-propagation rules.

## Maximum

Most often found in the form of max pooling.

b=maxiai## Gradient

The rule used in gradient descent, and I think the only rule used in the paper above for back propagating maximum. (Notation note. R here isn't exactly a function. Its more like ddx , its output is related to the context in which the input occurs. Think of every number in the forward net having an associated number )

R(ai)={R(b) ai=b0elseIgnoring the case of an exact tie. Exact ties, and what to do on that kind of singular point will be ignored in general, because they are fiddly and unimportant.

## Radial.

Treats 0 as a special point.

R(ai)=siR(b)Where the plot above shows si and the si are non-negitive and sum to 1.

Case. ai<0, b>0⟹si=0 . Relative to 0, a negative value contributes nothing to a positive maximum.

Case, ai>0⟹si=ai∑ai>0ai (In particular si=1 if i is the only value with positive ai

Case all negative. si=1ai∑i1ai .

## Matrix multiplication

A common operation in neural networks. Even convolutions can be expressed as a matrix multiplication. The matrix just happens to be sparse, and contain repetitions.

bj=∑iaiwij## Gradient

Yet again a popular choice.

R(ai)=∑jwijR(bj)## Normalized

What the LRP algorithm uses is

R(ai)=∑jaiwijblip(bj)R(bj)(This is deduced from equation 6 in

^{[1]})Where blip is this function

A straight line of gradient 1, except for a little vertical jump to avoid 0.

Why are they using this blip? Because they want to divide by bj here to get an interesting theoretical property (conservation of total) but if they don't add this little jump, they get numerical instability caused by dividing by something too close to 0.

## Nonlinearity

There are several mechanisms proposed to deal with an arbitrary potentially non-linear function b=f(a), applied elementwise. We will impose the condition f(0)=0 for now.

## Ignore

Very simple R(a)=R(b)

## Gradient

R(a)=f′(a)R(b) standard rule of calculus.

## Slope

R(a)=baR(b)

## Repeat

R(a)=f(R(b))

Downside: Is nonlinear in R(b), unlike every other method here.

## Consistency rules

You can't just pick any option from each of these lists. Well you could. But there are some nice mathematical consistency properties it would be nice to have.

## Scaling equivalence

Suppose you want to multiply all values in the network by a constant α, there are 2 ways you could do this. You could see the constant as a scaled identity matrix, and use the matrix multiplication rules. Or you could see the constant multiplication as a special case of the arbitrary elementwise function.

Gradient

Slope

Repeat

The other condition we might reasonably impose is that the nonlinearity formed by taking the maximum with a set of constants is treated the same way by the nonlinearity and maximum rules. Actually the constants must all be ≤0 because of the f(0)=0 condition.

Gradient matches gradient. (Unsurprising)

In general its very hard to get anything else to work because max(a,c)=max(a,max(c,c))=max(a,c,c)=max(max(a,c),c) which ruins most kinds of credit splitting. Still, I suspect that radial maximum matches slope nonlinearity in the special case of a maximum of 2 objects. (Based on visually similar graphs.)

## The Algorithms in

^{[1]}The paper mentions gradients as existing work.

It also mentions work on the Deconvolution Method by Zeiler and Fergus, (2014) that roughly amounts to.

^{[3]}The LRP Algorithm by (Bach et al., 2015) has this form.

^{[4]}## Simplifying the LRP

Firstly let ϵ=0 in the Normalized matrix multiply backpropagation rule.

Let R(a) be the backpropagation rule based on the table above. (Gradient Maximum, Normalized Matrix Multiply, Ignore Nonlinearity)

As the authors of

^{[1]}show, these rules conserve ∑iR(ai) across network layers.Consider another measure of the importance of a node Q(a)=R(a)/a. Can we rephrase all the equations to be in terms of Q(a) instead?

## Rephrasing Maximum

b=maxi(ai)R(ai)={R(b) ai=b0elseQ(ai)=R(ai)ai=⎧⎨⎩R(b)aiai=b0aielse ={Q(b)ai=b0elseSo this transformation turns the gradient maximum rule into the gradient maximum rule.

## Rephrasing Matrix Multiplication

The ϵ=0 rule means blip(bj)=bj.

bj=∑iaiwijR(ai)=∑jaiwijblip(bj)R(bj)Q(ai)=R(ai)ai=∑jwijbjR(bj)=∑jwijQ(bj)So this transformation turns the fancy normalization rule into the gradient again. All the problems where division means a risk of divide by 0, or more realistically dividing by almost 0 and getting a crazy huge number, have vanished.

## Rephrasing Nonlinearity

Here is the place it gets a little more complicated.

b=f(a)R(a)=R(b)Q(a)=R(a)a=R(b)bba=baQ(b)Under rephrasing, the trivial Ignore rule turns into the nontrivial slope rule.

This slope rule works well so long as f(0)=0. I mean technically there are lots of fiddly analysis details here. Lets just say. f is continuous, and that ∃δ>0:f′(x) is defined and bounded on (−δ,0)∪(0,δ).

But if you haven't done real analysis, all you really need to know is that there are lots of slightly different definitions of reasonably nice functions, and so long as f(0)=0 and there is nothing like the sharp cusp of a square-root going on at 0, the result is nice enough.

Also note that Slope and Gradient are the same method if the only nonlinearity used is Relu.

So all that is left is to deal with the biases.

## Rephrasing conservation

This is simple. ∑iaiQ(ai) is now the conserved quantity.

## End conditions

In order to make these whole procedures exactly the same, we need to make sure the end conditions match too. At the start of the forward process, where backpropagation ends, you can just multiply Q(input) by input. At the end of the forward process, where backpropagation starts, you can divide. The work by Bach et al started the reverse process using the output of the network, which was considered to be a positive scalar representing the likelihood assigned by the network to some particular classification.

This means that the reformulated version, the backpropagation is started with Q(output)=outputoutput=1

## What the actual problem was

Sometimes in a neural net, some fragment of decision comes down to a bias.

Sometimes the answer to "why is this number so large" would almost entirely be a large bias. The method bach et al proposed would focus on the tiny inputs or tiny weights rescaling them up in an attempt to explain the value.

This produces numerical instability by scaling up negligibly tiny weights and inputs. Which is exactly why bach et al added the blip function.

## Solution 1: Blame the Bias

Augment the input data of the network with a list of 1's with the number of 1's equalling the number of biases used in the network. (Here a nonlinearity with f(0)≠0 can be considered to have a bias added to it.) These propagate through the network unaffected until each 1 is used up by scaling it by its corresponding bias, and adding it to where the bias should go.

You have now converted a network with biases into a network without biases. This means the techniques I propose work fine without numerical instability.

However, when propagating relevance backwards, some of that relevance goes to the list of 1's that correspond to the biases. So you get a heatmap of the relevant parts of the image. But also extra data about the relevance of various biases.

Instead of using lots of distinct 1's, you could use a single 1 everywhere. This would just add up the relevances of all individual nodes into a single total relevance.

## Solution 2: Compare to baseline

This solution involves picking some "baseline" input. This could be an all blank image, or the average over the whole dataset. Run this through the network as well as the image you want to analyze. This gives b=f(a) from the image, and ¯b=f(¯a) from the baseline. We can now define the nonlinearity step as

R(a)=b−¯ba−¯aR(b)This is symmetric in the 2 images.

There are several different approaches to taking the maximum. One approach is to consider m=exp(1ϵ∑log(ϵxi)), the softmax function with softness ϵ. This function can be composed of the basic building blocks described above. So the relevancy can be calculated for any fixed ϵ. Then just take the limit limϵ→0.

Another way of computing the maximum is to use max(a,b)=a+relu(b−a). Unfortunately these 2 computations suggest different relevancies.

## Running code.

If you want to play around with this, look here. Unzip simple_model1.zip and point the link in My_Net_explain2.ipynb to the right location.

## What I was hoping for regarding alignment

There is an old apocryphal story about someone training neural networks to distinguish dogs from wolves, and all the dogs being on grass, and all the wolves being on snow, so the network learned to distinguish grass from snow instead.

So suppose you have carefully highlighted all the dogs and wolves in the training data. But you won't have that extra data at run time. My idea was to optimize the net to make sure that the region of the image containing the animal was marked relevant. (The whole relevancy calculation is differentiable, so can be optimized during training) The end result of this training procedure should be a single perfectly normal neural net that can be shown a wolf on grass for the first time, and correctly classify it.

Unfortunately I haven't got this to work. All I got was the network rampantly goodhearting the relevancy metric. (Well I was trying this on a smaller MNIST based problem, but it was conceptually the same)

^{^}Evaluating the visualization of what a

Deep Neural Network has learned

Wojciech Samek Member, IEEE, Alexander Binder, Gr ́egoire Montavon, Sebastian Bach, and Klaus-Robert https://arxiv.org/pdf/1509.06321v1.pdf

^{^}Learning Important Features Through Propagating Activation Differences

Avanti Shrikumar Peyton Greenside Anshul Kundaje https://arxiv.org/pdf/1704.02685.pdf

^{^}Relu only, other nonlinearities are not considered in this work.

^{^}Later in the paper they propose another option here. This other option will not be discussed further.