
In this post, we define path dependence as the sensitivity of a model's behavior to the details of the training process and training dynamics.[1]  High path dependence indicates that small changes to the training process can cause significant changes to how the final model generalizes (such as the details of off-distribution behavior).  It implies that inner alignment can be reasoned about by thinking about what the model looks like at various stages of training, and how its structure is affected by the immediate pressures of gradient descent.  It also implies that early-training interventions can be quite potent in shaping how a model turns out, and that a proper theory of inductive bias must reason about the order in which features are learned (where features learned faster/earlier can “screen off” the need for other implementations of a similar niche, in a way that affects the final model).

In contrast, a world with low path dependence allows us to reason about inductive bias in terms of priors and updates, sparing us the details of training dynamics.  It is more pessimistic about our ability to steer the model’s ontology through early interventions, since the final result is largely overdetermined.  As Evan discusses in a previous post, it makes us less worried about variance in alignment outcomes between labs, since small changes to the training procedure don’t strongly affect alignment outcomes.

Possible mechanistic reasons for high path dependence would include the existence of distinct stable ontologies, the ability for early features to kill gradients, and the difficulty of building highly serial features whose components aren't independently useful.  Mechanistic reasons for low path dependence would include grokking-like phase transitions which wipe out early circuits, overdetermination of correct ontologies, and an abundance of low loss paths between seemingly dissimilar solutions.[2] 

We remain mostly agnostic about which world we are in.  The purpose of this post is to elucidate the various strands of path dependence, and their implications for alignment.  We hope it will encourage people to run experiments determining where reality falls on these spectra.

Path dependence

Fundamentally, path dependence is about the sensitivity of a model's behavior to the details of the training process and training dynamics.  In this section, we enumerate concrete claims which constitute path dependence.  It should be emphasized that it is conceivable for these properties to vary separately, and that we lump them together because we hypothesize without proof that they are correlated.

In a world with maximum path dependence:

  • The outcome of training a DNN depends strongly on what representations and circuits are learned early in training.
  • This set of circuits varies between runs and is highly sensitive to the details of the training procedure.
  • High-level properties of the model are very likely to stay roughly constant during the late phases of training.  We can assess these properties at the middle of training, and our assessment will be highly predictive of the final properties, even if the model gets much smarter.
  • Circuits learned early in training are never outright deleted.  Parts of the task which were solved early in training continue to be done with the same circuits in the final model.
  • The inductive bias doesn’t have a closed form.  To figure out which of two final models is more likely, you must reason explicitly about the order in which things are learned; you cannot just look at simplicity or depth or any other static measure of the final circuits.

By contrast, in a world with minimum path dependence:

  • The outcome of training is highly consistent across runs and similar setups, and the order in which circuits are learned doesn’t matter.
  • High-level properties often change drastically in the late phases of training.
  • Circuits are frequently destroyed and replaced with new ones.  It is common for a new set of powerful circuits to arise and completely displace an older set of clunky ad-hoc ones.
  • Idealized priors like the “circuit prior” might be implementable in nearly-pure form by setting up training appropriately.

Diagrams for training dynamics

Low path dependence #1:

Above, we see a simple example of low-path-dependence training dynamics: the order in which features are learned doesn't matter, so a change to the training procedure which causes A to be learned faster than B won't change the final outcome.

High path dependence #1:

Here, there are two reachable end states, and early interventions can affect which one we end up in.  Features C and D are mutually exclusive, and are facilitated by A and B respectively.  There might be a grokking-like path from one of the apparent final states to the other, but training is stopped before it occurs.

If feature "D" is associated with deceptive alignment, and "C" with corrigibility, then detecting one or the other in the second-to-last phase is a reliable indicator of the final behavior.

Low path dependence #2:

In this model, there appears to be path dependence at first, but component E destroys C and D when it arises.  Any safety properties associated with C are destroyed at the last minute, despite appearing stable until then.

Reasons to care about path dependence

In this section, we go through some implications of path dependence for alignment.  In general, higher path dependence is good news for early evaluations and interventions, and (more speculatively) for being able to shape model ontology.  However, it is bad news for consistency across labs, and for solutions to inner alignment which rely on the construction of pure theoretical priors (like circuit priors).

I.  Effectiveness of mid-training evaluation

High path dependence makes mid-training evaluations more reliable, due to the "stickiness"/persistence of model properties.  Suppose we want to do things like:

  • Check for myopia
  • Extract the goals of a system
  • Check the “control structure” of a model (the broad way in which it factors the task, like a diagram made of interacting modules with particular roles)

Let's say we can do one of these things in the early and middle stages of training, when the model is somewhat simpler / less intelligent, but not on the final model.  In a high path-dependence world, there is a good chance that the result of the early check is also valid for the final model.  In a low path-dependence world, the early check is invalidated by phase transitions, circuit destruction, fundamental reorganizations of the model to an overdetermined "correct" structure, etc.

II.  Generalization of concepts and goals

A major issue in alignment is the question of how a model will generalize from the domain where we are able to label outcomes to the domain where we can't.  Here are some examples:

  • You train AIs to automate a factory.  Eventually they become smart enough that their action space includes things like "build a robot army" or "build nanotech" or "hack the internet to get more compute".  How will the goal acquired from training on simpler tasks generalize to this new level of intelligence?
  • You train a question answerer, and it learns some question-answering methodology[3].  You can use <magic method> (Christiano et al. 2025) to confirm that its internal control flow corresponds to "True honest answering" w.r.t. its current world model.  Unfortunately, <magic method> only works when the AI is below a certain intelligence threshold.  Will the internal structure generalize properly as it gets smarter, meaning that we retain the "honest answering" property?

High path-dependence is beneficial here, implying continuity of properties like "honest answering".

III.  Effectiveness of interventions

If mid-training evaluations are valid, then a further desideratum would be for mid-training interventions to be effective.  This could matter for certain forms of adversarial training—you can imagine a world where mid-training diagnostics work fine on natural models, but adversarial training at the middle stages just builds "shallow patches" which aren't well embedded in the main control flow and are erased by training dynamics in the later stages of training.  High path dependence is again desirable here.

IV.  Consistency across labs

In a high-path-dependence world, we might need to enforce a high degree of exactness in how training procedures are implemented by different labs (especially if alignment is more sensitive to changes in training procedure than capabilities are, which seems likely).  On the other hand, path dependence might help us here if alignment and capabilities are closely coupled, in the sense that copying a training procedure and changing it would also break capabilities.

V.  Constructability of priors

In a high-path-dependence world, it is hard to construct something close to a “pure” theoretical prior.  This could be bad if the most promising alignment solution ends up needing a specific prior.  Examples of theoretical priors are the Solomonoff prior (punishes program length), the circuit prior (punishes circuit size), and the speed prior (punishes serial runtime).  If SGD is approximately a Bayesian sampler,[4] then it could be possible to approximate certain combinations of these priors.  However, if path dependence is high enough, then these priors should be almost impossible to construct, due to vast differences in the findability of solutions (the existence of training paths to these solutions, and whether these paths are fast enough to not have their gradients stolen/ruined by other features).

VI.  Likelihood of inner misalignment

Arguments around the likelihood of deceptive alignment and of misaligned proxies often depend strongly on path dependence.  See Evan's recent post for a thorough discussion of how path dependence affects deceptive alignment.

Specific aspects of training dynamics

At the moment, “path dependence” is really a cluster of related properties which we suspect to be correlated.  In this section, we lay out a full set of distinct concepts.  Many of these can be thought of as mechanistic sources of high/low path dependence.

A.  Variance across runs

Suppose we run training several times, with different random seeds.  How much will the final models differ from each other?  We are interested in both generalization behavior and internal representations.  At a higher capability level, we would be interested in properties like “deceptiveness” and “goal”—do these vary across training runs?
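As a rough illustration, here is a minimal sketch of this kind of experiment.  The toy dataset, architecture, and off-distribution probe set are placeholders for whatever task is actually of interest; the point is just to measure pairwise agreement of the final models on inputs outside the training distribution.

```python
# Minimal sketch: train the same small model under several random seeds and
# measure how much the final models disagree off-distribution. The toy dataset
# and architecture are placeholders, not a claim about any real setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_data(n=512):
    # In-distribution: points in [-1, 1]^2, labeled by a simple rule.
    x = torch.rand(n, 2) * 2 - 1
    y = (x[:, 0] * x[:, 1] > 0).long()
    return x, y

def train_one(seed, steps=2000):
    torch.manual_seed(seed)
    model = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x, y = make_data()
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()
    return model

# Off-distribution probe: points well outside the training range.
probe = torch.rand(256, 2) * 6 - 3
models = [train_one(seed) for seed in range(5)]
with torch.no_grad():
    preds = torch.stack([m(probe).argmax(dim=-1) for m in models])

# Low pairwise agreement here, despite every run fitting the training data,
# would be evidence of high variance across runs.
for i in range(len(models)):
    for j in range(i + 1, len(models)):
        agree = (preds[i] == preds[j]).float().mean().item()
        print(f"seeds {i} vs {j}: off-distribution agreement = {agree:.2f}")
```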

B.  Sensitivity to small changes in the training setup

How much do the properties of the final model differ if we change the training setup slightly?  For instance, if two different labs each train a language model with broadly the same architecture and approach, but all the implementation details differ, how similar will their behavior and internal properties be?

C.  Existence of distinct clusters

Suppose we train some language models (with different seeds or slightly different setups), and we look at a very specific property like:

  • How does the model represent the fact ‘apples are red’?
  • What does the model output on this set of 5 closely-related reasoning questions?
  • What circuit is used to capitalize words at the beginning of a sentence?

If the models differ on one of these questions, there are two ways it could happen (a sketch for distinguishing these appears at the end of this subsection):

  1. There is a spectrum of behaviors or internal setups, and we get a different point on that spectrum each time.
  2. There are a small number of discrete clusters, and each time we get something which is definitively from one of these clusters.  

Hypothetical examples:

  • A model either answers all five reasoning questions with “yes” or all five with “no”, but never something in between.
  • The fact “apples are red” is always encoded in either two neurons in the third layer or one attention head in the second layer, but never any other way.  Within each way of doing it, the computation done by the neurons or attention head is always the same.
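A sketch for distinguishing these two cases: collect a behavioral or representational signature from each run and check whether the signatures form tight, well-separated clusters or a broad spectrum.  The signatures below are synthetic stand-ins; in a real experiment each row would come from a trained model (e.g. its answers to the five reasoning questions), and the k-means/silhouette analysis is just one reasonable choice of clustering check.

```python
# Minimal sketch: distinguish "a continuous spectrum" from "a few discrete
# clusters" by clustering a behavioral signature collected from each run.
# The signatures below are synthetic stand-ins; in practice each row would be
# e.g. the yes/no answers on the five reasoning questions, or a flattened
# readout of how the 'apples are red' fact is represented.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 20 runs, 5-dimensional signatures drawn from two tight clusters.
centers = np.array([[1, 1, 1, 1, 1], [0, 0, 0, 0, 0]], dtype=float)
signatures = centers[rng.integers(0, 2, size=20)] + 0.05 * rng.standard_normal((20, 5))

labels = KMeans(n_clusters=2, n_init=10).fit_predict(signatures)

# Discrete clusters: high silhouette score, within-cluster spread much smaller
# than the distance between cluster means. A spectrum would instead give a low
# silhouette score and a broad, unimodal distribution of pairwise distances.
print("silhouette score:", silhouette_score(signatures, labels))
print("cluster sizes:", np.bincount(labels))
```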

D.  Stability of clusters across training

If clusters are stable across training, this is what it would look like:

If there are two distinct clusters which a model can fall under, like “encodes apple color in neurons” vs. “encodes apple color in an attention head”, then it will fall into one of these clusters early in training, and will stay in that cluster until the end.  Each cluster will be stable to small perturbations—if we nudge the parameters slightly toward the other solution, then further training will bring us back to the center of the “valley” around the old solution.  However, if we make a strong enough intervention, we will push the model over the ridge to the other stable attractor, and it will remain in that cluster all the way through training.  So there will be a bimodal effect: either the intervention decays away almost completely, or it successfully pushes us to the other attractor and sticks fully.
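A caricature of this bimodal effect, using a one-dimensional loss with two valleys instead of a real network (the loss surface, learning rate, and nudge sizes are arbitrary illustrative choices):

```python
# Minimal sketch of the bimodal "snap back or switch basins" effect, using a
# 1-D toy loss with two valleys rather than a real model.
import torch

def loss_fn(w):
    # Two minima, near w = -1 and w = +1, separated by a ridge at w = 0.
    return (w ** 2 - 1) ** 2

def train_from(w0, steps=500, lr=0.01):
    w = torch.tensor(float(w0), requires_grad=True)
    opt = torch.optim.SGD([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(w).backward()
        opt.step()
    return w.item()

# Start in the left valley, then nudge the parameter toward the right valley
# by varying amounts and resume training.
start = train_from(-1.3)  # settles near -1
for nudge in [0.2, 0.6, 0.9, 1.2, 1.6]:
    final = train_from(start + nudge)
    print(f"nudge {nudge:.1f} -> final parameter {final:+.2f}")
# Small nudges decay back to the -1 attractor; once the nudge crosses the ridge,
# training carries the parameter the rest of the way to +1 and it stays there.
```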

E.  Predictive power of the mid-training model

Suppose that important properties of the model vary across runs or setups.  In this case, can we predict how the final model is going to turn out by pausing in the middle of training and inspecting the model?  For instance, if we’re training an RL agent, and it ends up with the wrong proxy 50% of the time, can we pause early on in training and predict which version we will get?

F.  Sensitivity to early interventions

Some examples of interventions would be:

  • A small amount of adversarial training
  • Injecting special training examples
  • Adding a small amount of human feedback into an RL setup

The question here is: Suppose we can only intervene in the early and middle stages of training.  How sensitive are the properties of the final model to our interventions? 

G.  Survival of circuits

Closely related to (D), (E), and (F) is the survival of circuits and representations across training.  Some circuits will be learned early on and remain unchanged throughout training.  For properties associated with these circuits, the mid-training model will be perfectly predictive of the final model.  Other circuits will be wiped out and replaced with new ones.

One important question here is whether there will be some kind of "general intelligence circuitry" which develops late in training and wipes out large swathes of earlier circuits, by virtue of being more efficient in terms of "cases handled per parameter of circuitry".  Such a phenomenon would be devastating for the sort of path dependence we discuss here.

More survival of circuits means higher path dependence, since it allows the speeds at which different early circuits are formed to be more influential on the final model.

H.  Existence of closed form inductive bias

Suppose we have two computational graphs which solve a task, and we want to know which one is more likely to be isomorphic to the model resulting from training.  A closed form inductive bias lets us figure this out by inspection.  

A simple example is a circuit prior, where we simply count the number of nodes and/or connections (depending on how you define it), and say that the one with fewer nodes/connections is more likely to be found.

However, a closed form is allowed to be more complex than this.  For instance, we could have some mix of a circuit prior with depth penalties, modularity bonuses, an information loss related term, and so forth.
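To make the contrast concrete, here is a minimal sketch of what a closed-form comparison could look like, scoring two candidate computational graphs purely from their static structure.  The particular score (node count plus edge count plus a depth penalty) and its weights are arbitrary placeholders, not a claim about the true inductive bias.

```python
# Minimal sketch of a "closed form" comparison: score two candidate
# computational graphs directly from their static structure. The score and its
# weights are arbitrary illustrative choices.
import networkx as nx

def closed_form_score(g: nx.DiGraph, depth_weight: float = 0.5) -> float:
    depth = nx.dag_longest_path_length(g)  # proxy for serial depth
    return g.number_of_nodes() + g.number_of_edges() + depth_weight * depth

# Two hypothetical circuits solving the same task: a small deep one and a
# larger shallow one.
deep = nx.DiGraph([("in", "a"), ("a", "b"), ("b", "c"), ("c", "out")])
shallow = nx.DiGraph([("in", "a"), ("in", "b"), ("in", "c"),
                      ("a", "out"), ("b", "out"), ("c", "out")])

for name, g in [("deep", deep), ("shallow", shallow)]:
    print(name, closed_form_score(g))
# Under a closed-form prior, whichever graph scores lower is predicted to be the
# more likely training outcome -- no reasoning about training order is needed.
```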

A non-closed-form understanding of inductive bias would be something like "Here's a rule for figuring out which circuits will be built in the first 10% of training.  Run that rule on your dataset and architecture, and write down the circuits.  Now here's a rule for what will be learned in the next 10%, which depends on the circuits we already have.  And so forth for the rest of training."

The thing which makes it not a closed form is that you have to reason through it step-by-step; you can't skip to the end and say "well it all caches out to picking the simplest final solution".  This is a very path-dependent way for things to be.[5]

If we are in such a world, then it is harder to construct a training process which gives us a "pure" circuit prior, or any other theoretical prior we might want.

A question about predictiveness

A simpler way to operationalize path dependence is to think about the following two questions:

  1. How predictive of the final model are the properties at the middle of the training run?  (Consistency over time)
  2. How predictive of the final model is the result of a different training run, with slightly different settings?  (Consistency between runs / Homogeneity)

A single axis model of path dependence assumes that these two types of consistency are anti-correlated, but it's not clear to us whether this is the case.  We leave it as an open question.
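A minimal sketch of how these two quantities could be operationalized, as agreement on a fixed behavioral probe set.  The prediction arrays are synthetic stand-ins for real model outputs on the probes.

```python
# Minimal sketch: (1) agreement between the mid-training and final checkpoint of
# one run, and (2) agreement between the final checkpoints of two different
# runs, both measured on the same probe set. Synthetic stand-in predictions.
import numpy as np

def agreement(preds_a: np.ndarray, preds_b: np.ndarray) -> float:
    return float((preds_a == preds_b).mean())

rng = np.random.default_rng(0)
n_probes = 1000
final_run1 = rng.integers(0, 2, n_probes)
# Stand-in: the mid-training checkpoint agrees with its own final model on ~90%
# of probes, while a second run's final model agrees on only ~70%.
mid_run1 = np.where(rng.random(n_probes) < 0.9, final_run1, 1 - final_run1)
final_run2 = np.where(rng.random(n_probes) < 0.7, final_run1, 1 - final_run1)

print("consistency over time:   ", agreement(mid_run1, final_run1))
print("consistency between runs:", agreement(final_run1, final_run2))
```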

Conclusion

Given the implications of path dependence for thinking about alignment, it would be good to find out which world we're in.  Each concept (A-H) in the previous section is a yes/no question[6] to be empirically resolved.  The exact answer will vary by task and architecture.  However, we hypothesize that there is probably some principal component across all tasks, architectures, and concepts, a "General factor of path dependence".

If you would like to run path dependence experiments, please leave a comment or send us a message, and we can get in touch.

  1. ^

    Mechanistically, high path dependence corresponds to significant influence and continuity from the structure and ontology of the early model to its final structure, and low path-dependence to destruction/radical transformation of early structures and overdetermination/stability of the final result.

  2. ^

    "Dissimilar solutions" = Different factorizations of the task into circuits.  The low-path-dependence argument here is that "paths are always abundant in high-dimensional spaces".

  3. ^

    I'm avoiding the term "goal" since I don't presume consequentialism.

  4. ^

    See one of these reviews.  I mostly disbelieve the qualitative conclusions people draw from this work though, for reasons that deserve their own post.

  5. ^

    All this means right now is "It gives us strong path-dependence vibes, so it's probably correlated with the other stuff"

  6. ^

    Really a spectrum.

Comments

Interesting post! I'm pretty curious about these.

A great resource for answering these questions is a set of model runs put out by the Stanford Center for Research on Foundation Models - they trained 5 runs of GPT-2 small and GPT-2 medium with 600 checkpoints and different random seeds, and released the weights. It seems like a good way to get some surface area on these questions with interesting real models. A few ideas that are somewhere on my maybe/someday research ideas list:

  • For each pair of models, feed in a bunch of text and look at the log prob for predicting each next token, and look at the scatter plot of these - does it look highly correlated? Poke at any outliers and see if there are any consistent patterns of things one model can do and the other cannot (a code sketch for this appears after the list)
    • Repeat this for a checkpoint halfway through training. If you find capabilities in one model and not in another, have they converged by the end of training?
    • Look at the PCA of these per-token losses across, say, 1M tokens of text, and see if you can find anything interesting about the components
  • Evaluate the models for a bunch of behaviours - ability to use punctuation correctly, to match open and close parentheses, patterns in the syntax and structure of the data (capital letters at the start of a sentence, email addresses having an @ and a .com in them, taking text in other languages and continuing it with text of that language, etc), specific behaviour like the ability to memorise specific phrases, complete acronyms, use induction-like behaviour, basic factual knowledge about the world, etc
    • The medium models will have more interesting + sophisticated behaviour, and are probably a better place to look for specific circuits
  • Look at the per-token losses for some text over training (esp for tokens with significant deviation between final models) and see whether it looks smooth or S-shaped - S-shaped would suggest higher path dependence to me
  • Look for induction head phase changes in each model during training, and compare when they happen.
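A minimal sketch of the per-token log-prob comparison from the first bullet.  The two model names below are placeholders (public GPT-2-family models that share a tokenizer); for the actual experiment you would substitute two of the released seed runs or checkpoints, and a real text sample for the stand-in corpus.

```python
# Minimal sketch: per-token log probs from two models on the same text,
# compared as a scatter plot. Model names and the text are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import matplotlib.pyplot as plt

def per_token_logprobs(model_name: str, text: str) -> torch.Tensor:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    # Log prob assigned to each actual next token.
    return logprobs.gather(-1, ids[:, 1:, None]).squeeze()

text = "The quick brown fox jumps over the lazy dog. " * 20  # stand-in corpus
a = per_token_logprobs("gpt2", text)        # placeholder for seed/run A
b = per_token_logprobs("distilgpt2", text)  # placeholder for seed/run B

plt.scatter(a.numpy(), b.numpy(), s=4, alpha=0.5)
plt.xlabel("log prob, model A")
plt.ylabel("log prob, model B")
plt.show()
# Outliers far from the diagonal are tokens one model predicts much better than
# the other -- candidates for capabilities present in one run but not the other.
```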

I'm currently writing a library for mechanistic interpretability of LLMs, with support for loading these models + their checkpoints - if anyone might be interested on working on this, happy to share ideas. This is a small subset of OpenWebText that seems useful for testing.

Unrelatedly, a mark against path dependence is the induction head bump result, where we found that models have a phase change where they suddenly form induction heads, and that across a range of model sizes and architectures it forms consistently and around the same point (though not all architectures were tested). Anecdotally, I've found that the time of formation is very sensitive to the exact positional embeddings used, though.

This seems quite similar to the experiments done in this paper.

Sections 3.1 and 6.6 ("Ossification") of the "Scaling Laws for Transfer" paper (https://arxiv.org/abs/2102.01293) show that training of current DNNs exhibits high path dependence.

When you talk about whether we're in a high or low path-dependence "world", do you think that there is a (somewhat robust) answer to this question that holds across most ML training processes? I think it's more likely that some training processes are highly path-dependent and some aren't. We definitely have evidence that some are path-dependent, e.g. Ethan's comment and other examples like https://arxiv.org/abs/2002.06305, and almost any RL paper where different random seeds of the training process often result in quite different results. Arguably, we don't have conclusive evidence of any particular existing training process being low-path-dependence, because the burden of proof is heavy for proving that two models are basically equivalent on basically all inputs (given that they're very unlikely to literally have identical weights, so the equivalence would have to be at a high level of abstraction).

Reasoning about the path dependence of a training process specifically, rather than whether all of the ML/AGI development world is path dependent, seems more precise, and also allows us to reason about whether we want a high or low path-dependence training process, and considering that as an intervention, rather than a state of the world we can't change.

Yeah, I agree with that. I think path dependence will likely vary across training processes and that we should in fact view that as an important intervention point.

Betting markets on these questions would be nice. I'd bid pretty strongly on "nope, basically no path dependence" for most current architectures; replicability already gives us a ton of bits on the question.

I'd happily take the other side of that bet. E.g., look at this website for an example of training a 500-neuron-wide, 2-layer fully-connected ReLU network on toy data, with a selector that lets you apply regularizers to the training process. If you simply train with no regularizer, you get the following decision boundary:

If you train with an L1 regularizer, you get this boundary:

However, if you first train with the L1 regularizer for ~ 100 steps, then switch over to no regularizer, you get this boundary, which persists for at least 5,000 training steps:

If we were going to find path-independence anywhere, I think it would be in these sorts of very simple datasets, with wide, highly overparameterized models, trained on IID data using exact gradients. But even here, SGD seems quite path dependent.
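A minimal sketch of roughly this experiment in PyTorch: a 500-unit, 2-layer ReLU network on a toy 2-D dataset, trained either with no regularizer throughout or with an L1 penalty for the first 100 steps only.  The dataset, learning rate, and L1 strength are placeholder choices, and the comparison is done numerically (fraction of a grid where the two runs' decision boundaries disagree) rather than visually.

```python
# Minimal sketch of the experiment described above: identical init and data,
# with or without a brief early L1 phase, then compare decision boundaries.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_data(n=256, seed=0):
    g = torch.Generator().manual_seed(seed)
    x = torch.rand(n, 2, generator=g) * 2 - 1
    y = (x.norm(dim=-1) < 0.7).long()  # toy "circle" task
    return x, y

def train(l1_steps=0, l1_coeff=1e-3, total_steps=5000, seed=0):
    torch.manual_seed(seed)
    model = nn.Sequential(nn.Linear(2, 500), nn.ReLU(), nn.Linear(500, 2))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = make_data()
    for step in range(total_steps):
        opt.zero_grad()
        loss = F.cross_entropy(model(x), y)
        if step < l1_steps:
            loss = loss + l1_coeff * sum(p.abs().sum() for p in model.parameters())
        loss.backward()
        opt.step()
    return model

plain = train(l1_steps=0)
l1_then_plain = train(l1_steps=100)

# Compare the two runs' decision boundaries on a dense grid: persistent
# disagreement after thousands of unregularized steps would be evidence that
# the brief early L1 phase left a lasting mark.
grid = torch.cartesian_prod(torch.linspace(-1, 1, 50), torch.linspace(-1, 1, 50))
with torch.no_grad():
    disagree = (plain(grid).argmax(-1) != l1_then_plain(grid).argmax(-1)).float().mean().item()
print(f"fraction of the grid where the two runs disagree: {disagree:.3f}")
```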

Edited to add:

...replicability already gives us a ton of bits on the question.

I think this is false. For a given architecture + training process, it's entirely possible for there to be an attractor into which 99.9999999999999% of all randomly initialized training processes fall, but for it to still be highly path dependent in the relevant sense. The reason is because it's actually quite easy for "simple nudges" to apply the ~ 50 bits of optimization pressure needed to make a 0.0000000000001% outcome happen. E.g., training for 100 steps with an L1 regularizer will get you a model that's incredibly unlikely to be sampled by your random initialization process.

It can be the case that almost all random initializations train out to the same end state, and also that fairly simple interventions can put the training trajectory on the path to a different end state.

Broadly agree with this comment. I'd buy something like "low path-dependence for loss, moderate-to-high for specific representations and behaviours" - see e.g. https://arxiv.org/abs/1911.02969

I think this is false. For a given architecture + training process, it's entirely possible for there to be an attractor into which 99.9999999999999% of all randomly initialized training processes fall, but for it to still be highly path dependent in the relevant sense. The reason is because it's actually quite easy for "simple nudges" to apply the ~ 50 bits of optimization pressure needed to make a 0.0000000000001% outcome happen. E.g., training for 100 steps with an L1 regularizer will get you a model that's incredibly unlikely to be sampled by your random initialization process.

It can be the case that almost all random initializations train out to the same end state, and also that fairly simple interventions can put the training trajectory on the path to a different end state.

I think we actually have the same model here, but interpret the phrase "path dependence" differently. If the question is whether we can intentionally apply 50 bits of optimization to kick the thing into a different attractor, then yeah, I agree that is very probably possible. I just wouldn't call that "path dependence", since on the distribution of the training process the path basically does not matter.

If SGD is approximately a Bayesian sampler, ...

I think it's worth noting that no large-scale system uses 'true' SGD; it's all AdamW, and the weight decay seems like a strong part of the inductive bias. Of course "everything that works is approximately Bayesian", but the mathematics that people talk about with respect to SGD just aren't relevant to practice.

(opinions my own)

Definitely glad to see some investigation into the path dependence question.

I expect that the primary source of safety-relevant path dependence in future systems will be due to causal influence of the model's behavior on its training data / supervision signal. That should occur by default in reinforcement & active learning, but not in typical teacher-forced self-supervised learning (like GPT). So I think I would answer the question of "Are we in a high path-dependence world?" differently conditioned on different AI development models.

Even for GPTs, the recently popular "chain-of-thought" family of techniques seem poised to bring path-dependence into the mix, by creating feedback loops between the language model and the reasoning traces it produces.

This paper links inductive biases of pre-trained [language] models (including some related to simplicity measures like MDL), path dependency and sensitivity to label evidence/noise: https://openreview.net/forum?id=mNtmhaDkAr