Louis Jaburi

https://cogeometry.com/

https://x.com/ExpMachina
 

Comments
Jesse Hoogland's Shortform
Louis Jaburi · 18d

Great to see more work on (better) influence functions! 

Lots of interesting things to discuss here[1], but one thing I would like to highlight is that classical IFs indeed arise from the usual implicit function theorem + global minimum assumption (which is obviously violated in the context of DL), but they also arise as the limit of unrolling as t→∞. What follows is more theoretical in nature, summarizing statements in Mlodozeniec et al.

Influence functions suffer from another shortcoming, since they only use the final weights (as you are aware). So you might say that we shouldn't do influence functions, but instead track a different counterfactual over training: "What if I added/removed a sample z_m at time step t?" To do this, you can consider each SGD training step θ_t → θ_{t+1} (or more generally some optimizer like Adam) and approximate the Jacobian of that map, i.e. θ′_{t+1} ≈ θ_{t+1} + A_t·(θ′_t − θ_t) for a perturbed trajectory θ′. Doing some calculus you end up with A_t = I − λ_t·H_t, where λ_t is the learning rate and H_t the mini-batch Hessian at time step t.[2]
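To make the linearization concrete, here is a minimal numpy sketch (my own toy setup, not from the paper): on a quadratic loss with a fixed, hypothetical Hessian H, the perturbation of a training run propagates through the step Jacobian A_t = I − λ_t·H_t, and for a quadratic loss this is exact.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy quadratic loss L(θ) = 0.5 θᵀHθ with a fixed (hypothetical) Hessian H,
# so every "mini-batch Hessian" is H and the linearization is exact.
d = 5
M = rng.normal(size=(d, d))
H = M @ M.T / d + 0.1 * np.eye(d)     # symmetric positive definite
lr = 0.05
A = np.eye(d) - lr * H                # Jacobian of the step θ → θ − lr·Hθ

theta = rng.normal(size=d)
delta = 1e-3 * rng.normal(size=d)     # perturbation from up-weighting a sample
theta_pert = theta + delta

for _ in range(50):
    # exact SGD steps for both the original and the perturbed run
    theta, theta_pert = theta - lr * (H @ theta), theta_pert - lr * (H @ theta_pert)
    # linearized propagation of the perturbation: Δθ_{t+1} ≈ A_t · Δθ_t
    delta = A @ delta

# For a quadratic loss the linearized perturbation matches the exact
# difference between the two runs up to float error.
print(np.max(np.abs((theta_pert - theta) - delta)))
```

For non-quadratic losses the same recursion is only a first-order approximation, with H_t re-evaluated along the trajectory.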

You can use this linear approximation of the training steps to compute a new counterfactual (Eq. 57 in Mlodozeniec et al.). This can be formalized as a pair (θ_t, r_t) of the weights θ_t and the response r_t, which captures the counterfactual: θ′_t(ε) ≈ θ_t + ε·r_t, where θ′_t(ε) is the counterfactual of adding the data point with weighting ε at time step t. Ok, without further ado, here is the result (Theorem 2 in Mlodozeniec et al.):

Under some assumptions on SGD (A1–A6 in the paper), as you continue training t→∞, you get a.s. convergence (θ_t, r_t) → (θ_∞, r_∞), where θ_∞ is a local minimum or a saddle point. Assume it is a local minimum; what is the limiting response r_∞? It's our beloved (pseudo-)inverse Hessian-vector product (IHVP) from classical IFs, well... up to directions in weight space which are in the kernel of the Hessian.
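A quick numerical illustration of that limit (my own toy construction, not from the paper): iterate the linearized response recursion r ← (I − λH)r + λg for a degenerate Hessian and a gradient g with no kernel component; the limit is the pseudo-inverse Hessian-vector product H⁺g.

```python
import numpy as np

rng = np.random.default_rng(1)

# Degenerate Hessian: rank 3 in a 5-dim weight space (2-dim kernel).
d, r = 5, 3
U = np.linalg.qr(rng.normal(size=(d, d)))[0]
eigs = np.array([2.0, 1.0, 0.5, 0.0, 0.0])
H = U @ np.diag(eigs) @ U.T

# Gradient of the added sample, projected off the kernel of H.
P = U[:, :r] @ U[:, :r].T            # projector onto the row space of H
g = P @ rng.normal(size=d)

lr = 0.1
resp = np.zeros(d)
for _ in range(5000):
    # one linearized response step: r ← (I − lr·H) r + lr·g
    resp = resp - lr * (H @ resp) + lr * g

ihvp = np.linalg.pinv(H) @ g         # the classical IF answer: H⁺ g
print(np.max(np.abs(resp - ihvp)))   # tiny
```

If g had a kernel component, the response would grow linearly in that direction instead of converging, which is exactly the "up to directions in the kernel" caveat.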

So to summarize, the upshot is that influence functions can actually be valid beyond the original statistical setup, if (1) we model training dynamics linearly, (2) we believe the assumptions A1–A6 and that we end up in a local minimum eventually, and (3) we care about the limiting behaviour. These assumptions can and should be debated, but I find them more reasonable and interesting than the global minimum assumption.

And as a cherry on top, Theorem 3 shows that if you want to go from the Bayesian posterior p(w∣D) to the epsilon-perturbed p(w∣D_ε), you can again use IFs: sampling from the perturbed distribution is approximated by sampling from the original distribution and adding the IF IHVP. Amongst linear approximations, this one is optimal for the KL divergence (in a specific sense, in the low-temperature limit).[3]
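A toy sketch of that sampling procedure, under a strong simplifying assumption of my own (Gaussian posterior with precision H, and an added sample whose loss is locally linear with gradient g near the mode): the ε-perturbation shifts the mode by −ε·H⁻¹g, so "sample from the original posterior, then add the IHVP shift" reproduces the perturbed posterior.

```python
import numpy as np

rng = np.random.default_rng(2)

d = 4
M = rng.normal(size=(d, d))
H = M @ M.T + np.eye(d)               # posterior precision (assumed Gaussian)
mu = rng.normal(size=d)               # posterior mode
cov = np.linalg.inv(H)

# ε-up-weighting a sample with local gradient g shifts the mode by −ε·H⁻¹g
# (covariance unchanged for a locally linear perturbation).
g = rng.normal(size=d)
eps = 0.05
shift = -eps * np.linalg.solve(H, g)  # the IF inverse-Hessian-vector product

samples = rng.multivariate_normal(mu, cov, size=200_000)
shifted = samples + shift             # sample the original, then add the IHVP

# Empirical mean of the shifted samples matches the perturbed mode μ + shift.
print(np.max(np.abs(shifted.mean(axis=0) - (mu + shift))))
```

The paper's statement is of course more general than this Gaussian caricature; this only illustrates the mechanics of the approximation.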

 

More generally, I think this paper makes an important point that goes beyond any of the technical details above: we want our counterfactual estimates to be more robust against randomness in training. But that's for another time.

  1. ^

e.g. I am not sure if I agree regarding the dataset vs. model size tradeoff, but maybe we have slightly different applications in mind :)

  2. ^

A small upshot here is that we get natural damping, which mitigates the degeneracy of the Hessian.

  3. ^

I would be curious to understand how this compares to the relationship you present in Appendix A.

Lessons from the Iraq War for AI policy
Louis Jaburi · 4mo

And in fact, southern Iraq was and is predominantly Shiite (and thus also more susceptible to Iranian influence). They too revolted against Saddam after the first Gulf War (https://en.m.wikipedia.org/wiki/1991_Iraqi_uprisings) and were euphoric about his fall.

Alexander Gietelink Oldenziel's Shortform
Louis Jaburi · 8mo

I agree with the previous points, but I would also add the historical events that led to this.
Pre-WWI Germany was much more important and played the role that France plays today (maybe even more central); see the University of Göttingen at the time.

After two world wars, the German mathematics community was in shambles, with many mathematicians fleeing during that period (Grothendieck, Artin, Gödel, ...). The University of Bonn (and the MPI) were Hirzebruch's post-war project to rebuild the math community in Germany.

I assume France was then able to rise as the hotspot, and I would be curious to imagine what would have happened in an alternative timeline.

Ambiguous out-of-distribution generalization on an algorithmic task
Louis Jaburi · 9mo

In our toy example, I would intuitively associate the LLC with the test losses rather than the train loss. For the training of a single model, it was observed that test loss and LLC are correlated. Plausibly, for this simple model, the (final) LLC, train loss, and test loss are all closely related.

Ambiguous out-of-distribution generalization on an algorithmic task
Louis Jaburi · 9mo

We haven't seen that empirically with the usual regularization methods, so I assume there must be something special going on with the training setup.

I wonder if this phenomenon is partially explained by scaling up the embedding and scaling down the unembedding by a factor (or vice versa). That should leave the LLC constant, but will change the L2 norm.
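The rescaling symmetry is easy to check numerically. A minimal sketch (my own toy, with a linear embed/unembed pair standing in for the real model): the network function is invariant under (W_in, W_out) → (α·W_in, W_out/α), so any function-level quantity like the LLC is unchanged, but the L2 norm of the weights is not.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two-layer linear "embed → unembed" map f(x) = W_out @ W_in @ x.
W_in = rng.normal(size=(8, 4))        # embedding
W_out = rng.normal(size=(4, 8))       # unembedding
x = rng.normal(size=4)

alpha = 10.0
W_in_s, W_out_s = alpha * W_in, W_out / alpha   # rescaled pair

# The network function is unchanged (so the LLC, which depends only on the
# loss landscape, should be too)...
print(np.max(np.abs(W_out @ (W_in @ x) - W_out_s @ (W_in_s @ x))))  # ≈ 0

# ...but the L2 norm of the weights changes drastically.
l2 = np.sum(W_in**2) + np.sum(W_out**2)
l2_s = np.sum(W_in_s**2) + np.sum(W_out_s**2)
print(l2, l2_s)
```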

Against blanket arguments against interpretability
Louis Jaburi · 9mo

> The relevant question then becomes whether the "SGLD" sampling techniques used in SLT for measuring the free energy (or technically its derivative) actually converge to reasonable values in polynomial time. This is checked pretty extensively in this paper for example.

The linked paper considers only large models which are DLNs. I don't find this very compelling evidence for large models with non-linearities. Other measurements I have seen for bigger/deeper non-linear models seem promising, but I wouldn't call them robust yet (though it is not clear to me whether this is because of an SGLD implementation/hyperparameter issue or a more fundamental problem).

As long as I don't have a clearer picture of the relationship between free energy and training dynamics under SGD, I agree with the OP that the claim is too strong.

Activation space interpretability may be doomed
Louis Jaburi · 10mo

I see, thanks for sharing!

Activation space interpretability may be doomed
Louis Jaburi · 10mo

Did you use something like LSAE as described here? By brittle, do you mean w.r.t. the sparsity penalty (and other hyperparameters)?

Activation space interpretability may be doomed
Louis Jaburi · 10mo

Thanks for the reference. I wanted to illuminate the value of gradients of activations in this toy example, as I have been thinking about similar ideas.

I personally would be pretty excited about attribution dictionary learning, but it seems like nobody has done that on bigger models yet.

Activation space interpretability may be doomed
Louis Jaburi · 10mo

Are you suggesting that there should be a formula, similar to the one in Proposition 5.1 (or 5.2), that links information about the activations I(z;x) + TC(z) with the LC as a measure of basin flatness?
