Louis Jaburi

https://cogeometry.com/

https://x.com/ExpMachina
 

Comments
Jesse Hoogland's Shortform
Louis Jaburi · 18d

Great to see more work on (better) influence functions! 

Lots of interesting things to discuss here[1], but one thing I would like to highlight is that classical IFs indeed arise from the usual implicit function theorem + global minimum assumption (which is obviously violated in the context of DL), but they also arise as the limit of unrolling as t→∞. What follows is more theoretical in nature, summarizing statements in Mlodozeniec et al.

Influence functions suffer from another shortcoming, since they only use the final weights (as you are aware). So you might say that we shouldn't do influence functions, but instead track a different counterfactual over training: "What if I added/removed a sample z_m at time step t?" To do this, you can consider each SGD training step θ_t → θ_{t+1} (or more generally some optimizer like Adam) and approximate the Jacobian of that map, i.e. θ′_{t+1} ≈ θ_{t+1} + A_t·(θ′_t − θ_t) for a perturbed trajectory θ′. Doing some calculus you end up with A_t = I − λ_t·H_t, where λ_t is the learning rate and H_t the mini-batch Hessian at time step t.[2]
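To make the linearization concrete, here is a minimal numpy sketch (my own toy setup, not from the paper): on a quadratic loss with a fixed, hypothetical Hessian H, the perturbation of a training run propagates through the step Jacobian A_t = I − λ_t·H_t, and for a quadratic loss this is exact.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy quadratic loss L(θ) = 0.5 θᵀHθ with a fixed (hypothetical) Hessian H,
# so every "mini-batch Hessian" is H and the linearization is exact.
d = 5
M = rng.normal(size=(d, d))
H = M @ M.T / d + 0.1 * np.eye(d)     # symmetric positive definite
lr = 0.05
A = np.eye(d) - lr * H                # Jacobian of the step θ → θ − lr·Hθ

theta = rng.normal(size=d)
delta = 1e-3 * rng.normal(size=d)     # perturbation from up-weighting a sample
theta_pert = theta + delta

for _ in range(50):
    # exact SGD steps for both the original and the perturbed run
    theta, theta_pert = theta - lr * (H @ theta), theta_pert - lr * (H @ theta_pert)
    # linearized propagation of the perturbation: Δθ_{t+1} ≈ A_t · Δθ_t
    delta = A @ delta

# For a quadratic loss the linearized perturbation matches the exact
# difference between the two runs up to float error.
print(np.max(np.abs((theta_pert - theta) - delta)))
```

For non-quadratic losses the same recursion is only a first-order approximation, with H_t re-evaluated along the trajectory.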

You can use this linear approximation of the training steps to compute a new counterfactual (Eq. 57 in Mlodozeniec et al.). This can be formalized as a pair (θ_t, r_t) of the weights θ_t and the response r_t, which captures the counterfactual: θ′_t(ε) ≈ θ_t + ε·r_t, where θ′_t(ε) is the counterfactual of adding the data point with weighting ε at time step t. Ok, without further ado, here is the result (Theorem 2 in Mlodozeniec et al.):

Under some assumptions on SGD (A1–A6 in the paper), as you continue training t→∞, you get a.s. convergence (θ_t, r_t) → (θ_∞, r_∞), where θ_∞ is a local minimum or a saddle point. Assume it is a local minimum; what is the limiting response r_∞? It's our beloved (pseudo-)inverse Hessian-vector product (IHVP) from classical IFs, well... up to directions in weight space which are in the kernel of the Hessian.
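A quick numerical illustration of that limit (my own toy construction, not from the paper): iterate the linearized response recursion r ← (I − λH)r + λg for a degenerate Hessian and a gradient g with no kernel component; the limit is the pseudo-inverse Hessian-vector product H⁺g.

```python
import numpy as np

rng = np.random.default_rng(1)

# Degenerate Hessian: rank 3 in a 5-dim weight space (2-dim kernel).
d, r = 5, 3
U = np.linalg.qr(rng.normal(size=(d, d)))[0]
eigs = np.array([2.0, 1.0, 0.5, 0.0, 0.0])
H = U @ np.diag(eigs) @ U.T

# Gradient of the added sample, projected off the kernel of H.
P = U[:, :r] @ U[:, :r].T            # projector onto the row space of H
g = P @ rng.normal(size=d)

lr = 0.1
resp = np.zeros(d)
for _ in range(5000):
    # one linearized response step: r ← (I − lr·H) r + lr·g
    resp = resp - lr * (H @ resp) + lr * g

ihvp = np.linalg.pinv(H) @ g         # the classical IF answer: H⁺ g
print(np.max(np.abs(resp - ihvp)))   # tiny
```

If g had a kernel component, the response would grow linearly in that direction instead of converging, which is exactly the "up to directions in the kernel" caveat.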

So to summarize, the upshot is that influence functions can actually be valid beyond the original statistical setup, if (1) we model training dynamics linearly, (2) we believe the assumptions A1–A6 and that we end up in a local minimum eventually, and (3) we care about the limiting behaviour. These assumptions can and should be debated, but I find them more reasonable and interesting than the global minimum assumption.

And as a cherry on top, Theorem 3 shows that if you want to go from the Bayesian posterior p(w∣D) to the epsilon-perturbed p(w∣D_ε), you can again use IFs: sampling from the perturbed distribution is approximated by sampling from the original distribution and adding the IF IHVP. Amongst linear approximations, this one is optimal for the KL divergence (in a specific sense, in the low-temperature limit).[3]
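A toy sketch of that sampling procedure, under a strong simplifying assumption of my own (Gaussian posterior with precision H, and an added sample whose loss is locally linear with gradient g near the mode): the ε-perturbation shifts the mode by −ε·H⁻¹g, so "sample from the original posterior, then add the IHVP shift" reproduces the perturbed posterior.

```python
import numpy as np

rng = np.random.default_rng(2)

d = 4
M = rng.normal(size=(d, d))
H = M @ M.T + np.eye(d)               # posterior precision (assumed Gaussian)
mu = rng.normal(size=d)               # posterior mode
cov = np.linalg.inv(H)

# ε-up-weighting a sample with local gradient g shifts the mode by −ε·H⁻¹g
# (covariance unchanged for a locally linear perturbation).
g = rng.normal(size=d)
eps = 0.05
shift = -eps * np.linalg.solve(H, g)  # the IF inverse-Hessian-vector product

samples = rng.multivariate_normal(mu, cov, size=200_000)
shifted = samples + shift             # sample the original, then add the IHVP

# Empirical mean of the shifted samples matches the perturbed mode μ + shift.
print(np.max(np.abs(shifted.mean(axis=0) - (mu + shift))))
```

The paper's statement is of course more general than this Gaussian caricature; this only illustrates the mechanics of the approximation.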

 

More generally, I think this paper makes an important point that goes beyond any of the technical details above: we want our counterfactual estimates to be more robust against randomness in training. But that's for another time.

  1. ^

e.g. I am not sure if I agree regarding the dataset vs. model size tradeoff, but maybe we have slightly different applications in mind :)

  2. ^

A small upshot here is that we get natural damping, which mitigates the degeneracy of the Hessian.

  3. ^

I would be curious to understand how this compares to the relationship you present in Appendix A.

Lessons from the Iraq War for AI policy
Louis Jaburi · 4mo

And in fact, southern Iraq was and is predominantly Shiite (and thus also more susceptible to Iranian influence). They too revolted against Saddam after the first Gulf War (https://en.m.wikipedia.org/wiki/1991_Iraqi_uprisings) and were euphoric about his fall.

Alexander Gietelink Oldenziel's Shortform
Louis Jaburi · 8mo

I agree with the previous points, but I would also add the historical events that led to this.
Pre-WWI Germany was much more important and played the role that France plays today (maybe even more central); see the University of Göttingen at the time.

After two world wars, the German mathematics community was in shambles, with many mathematicians fleeing during that period (Grothendieck, Artin, Gödel, ...). The University of Bonn (and the MPI) were Hirzebruch's post-war project to rebuild the math community in Germany.

I assume France was then able to rise as the hotspot, and I would be curious to imagine what would have happened in an alternative timeline.

Ambiguous out-of-distribution generalization on an algorithmic task
Louis Jaburi · 9mo

In our toy example, I would intuitively associate the LLC with the test losses rather than the train loss. For the training of a single model, it was observed that test loss and LLC are correlated. Plausibly, for this simple model, the (final) LLC, train loss, and test loss are all closely related.

Ambiguous out-of-distribution generalization on an algorithmic task
Louis Jaburi · 9mo

We haven't seen that empirically with the usual regularization methods, so I assume there must be something special going on with the training setup.

I wonder if this phenomenon is partially explained by scaling up the embedding and scaling down the unembedding by a factor (or vice versa). That should leave the LLC constant, but will change the L2 norm.
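The rescaling symmetry is easy to check numerically. A minimal sketch (my own toy, with a linear embed/unembed pair standing in for the real model): the network function is invariant under (W_in, W_out) → (α·W_in, W_out/α), so any function-level quantity like the LLC is unchanged, but the L2 norm of the weights is not.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two-layer linear "embed → unembed" map f(x) = W_out @ W_in @ x.
W_in = rng.normal(size=(8, 4))        # embedding
W_out = rng.normal(size=(4, 8))       # unembedding
x = rng.normal(size=4)

alpha = 10.0
W_in_s, W_out_s = alpha * W_in, W_out / alpha   # rescaled pair

# The network function is unchanged (so the LLC, which depends only on the
# loss landscape, should be too)...
print(np.max(np.abs(W_out @ (W_in @ x) - W_out_s @ (W_in_s @ x))))  # ≈ 0

# ...but the L2 norm of the weights changes drastically.
l2 = np.sum(W_in**2) + np.sum(W_out**2)
l2_s = np.sum(W_in_s**2) + np.sum(W_out_s**2)
print(l2, l2_s)
```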

Against blanket arguments against interpretability
Louis Jaburi · 9mo

> The relevant question then becomes whether the "SGLD" sampling techniques used in SLT for measuring the free energy (or technically its derivative) actually converge to reasonable values in polynomial time. This is checked pretty extensively in this paper for example.

The linked paper considers only large models which are DLNs. I don't find this very compelling evidence for large models with non-linearities. Other measurements I have seen for bigger/deeper non-linear models seem promising, but I wouldn't call them robust yet (though it is not clear to me whether this is because of an SGLD implementation/hyperparameter issue or a more fundamental problem).

As long as I don't have a clearer picture of the relationship between free energy and training dynamics under SGD, I agree with the OP that the claim is too strong.

Activation space interpretability may be doomed
Louis Jaburi · 10mo

I see, thanks for sharing!

Activation space interpretability may be doomed
Louis Jaburi · 10mo

Did you use something like LSAE as described here? By brittle, do you mean w.r.t. the sparsity penalty (and other hyperparameters)?

Activation space interpretability may be doomed
Louis Jaburi · 10mo

Thanks for the reference. I wanted to illuminate the value of gradients of activations in this toy example, as I have been thinking about similar ideas.

I personally would be pretty excited about attribution dictionary learning, but it seems like nobody has done that on bigger models yet.

Activation space interpretability may be doomed
Louis Jaburi · 10mo

Are you suggesting that there should be a formula, similar to the one in Proposition 5.1 (or 5.2), that links information about the activations I(z;x) + TC(z) with the LC as a measure of basin flatness?
