Evidence Sets: Towards Inductive-Biases based Analysis of Prosaic AGI

[-]gwern4y*Ω8100

Why do you have high confidence that catastrophic forgetting is immune to scaling, given "Effect of scale on catastrophic forgetting in neural networks", Anonymous 2021?

My Interpretation: We perform SGD updates on parameters while training a model. The claim is that the decision boundary does not change dramatically after an update. The safety implication is that we need not worry about an advanced AI system manoeuvring from one strategy to a completely different kind after an SGD update.

I also disagree with B1, gradual change bias. This seems to me to be either obviously wrong, or limited to safety/capability-irrelevant definitions like "change in KL of overall distribution of output from small vanilla supervised-learning NNs", and definitely does not generalize to even small amounts of confidence in assertions like "SGD updates based on small n cannot meaningfully change a very powerful NN's behavior in a real-world-affecting way".

First, it is pervasive in physics and the world in general that small effects can have large outcomes. No snowflake thinks it's responsible for the avalanche, but one was. Physics models often have chaotic regimes, and models like the Ising spin glass model (one of the most popular theoretical models for neural net analysis for several decades) are notorious for how tiny changes can cause total phase shifts and regime changes. NNs themselves are often analyzed as being on the 'edge of chaos' in various ways (the exploding gradient problem, except actual explosions). Breezy throwaway claims about 'oh, NNs are just smooth and don't change much on one update' are so much hot air against this universal prior.

As an example, consider the notorious fragility of NNs to (hyper)parameters: one set will fail completely, wildly diverging. Another, similar set, will suddenly work. Edge of chaos. Or consider the spikiness of model capabilities over scaling and how certain behaviors seem to abruptly emerge at certain points (eg 'grokking', or just the sudden phase transitions on benchmarks where small models are flatlined but then at a threshold, the model suddenly 'gets it'). This parallels similar abrupt transitions in human psychology, like Piagetian levels. In grokking, the breakthrough appears to happen almost instantaneously; for humans and large model capabilities, we lack detailed benchmarking which would let us say that "oh, GPT-3 understands logic at exactly iteration #501,333", but it should definitely make you think about assumptions like "it must take countless SGD iterations for each of these capabilities to emerge." (This all makes sense if you think of large NNs as searching over complexity-penalized ensembles of programs, and at some point switching from memorization-intensive programs to the true generalizing program; see my scaling hypothesis writeup & jcannell's writings.)

Second, the pervasive existence of adversarial examples should lead to extreme doubt on any claims of NN 'smoothness'. Absolutely and perceptually tiny shifts in image inputs lead to almost arbitrarily large wacky changes in output distribution. These may be unnatural, but they exist. If tweaking a pixel here and there by a quantum can totally change the output from 'cat' to 'dog', why can't an SGD update, which can change every single parameter, have similar effects? (Indeed, you link the isoperimetry paper, which claims that this is logically necessary for all NNs which are too small for their problem, where for ImageNet I believe they ballpark it as "even the largest contemporary ImageNet models are still 2 OOMs too small to begin to be robust".)

Third, small changes in outputs can mean large changes in behavior. Actions and choices are inherently discrete where small changes in the latent beliefs can have arbitrarily large behavior changes (which is why we are always trying to move away from actions/choices towards smoother easier relaxations where we can pretend everything is small and continuous). Imagine a NN estimating Q-values for taking 2 actions, "Exterminate all humans" and "usher in the New Jerusalem", which normalize to 0.4999999 and 0.5000001 respectively. It chooses the argmax, and acts friendly. Do you believe that there is no SGD update which after adjusting all of the parameters, might reverse the ranking? Why, exactly?
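To make the arithmetic of that example concrete, here is a minimal sketch. The action names and the size of the nudge (`delta`) are illustrative assumptions, not anything measured from a real network; the only point is that an argmax over nearly tied values can flip when each value moves by far less than one part in a million.

```python
def greedy_action(q_values):
    """Return the action with the highest Q-value (the argmax)."""
    return max(q_values, key=q_values.get)


# Normalized Q-values from the example in the text; the keys are made-up labels.
before = {"exterminate": 0.4999999, "new_jerusalem": 0.5000001}

# Hypothetical effect of one update: each value moves by only 2e-7.
delta = 2e-7
after = {
    "exterminate": before["exterminate"] + delta,
    "new_jerusalem": before["new_jerusalem"] - delta,
}

action_before = greedy_action(before)
action_after = greedy_action(after)

# Largest change in any single Q-value caused by the hypothetical update.
largest_shift = max(abs(after[a] - before[a]) for a in before)

print(action_before, "->", action_after, "with max shift", largest_shift)
```

The output distribution barely moves under any smooth metric, yet the chosen action reverses, which is the sense in which "small change in outputs" and "small change in behavior" come apart.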

Fourth, NNs are often designed to have large changes based on small changes, particularly in meta-learning or meta-reinforcement-learning. In prompt programming or other few-shot scenarios, we radically modify the behavior of a model with potentially trillions of parameters by merely typing in a few words. In neural meta-backdoors/data poisoning, there are extreme shifts in output for specific prespecified inputs (sort of the inverse of adversarial examples), which are concerning in part because that's where a gradient-hacker could store arbitrary behavior (so a backdoored NN is a little like Light in Death Note: he has forgotten everything & acts perfectly innocent... until he touches the right piece of paper and 'wakes up'). In cases like MAML, the second-order training is literally designed to create a NN at a saddle point where a very small first-order update will produce very different behavior for each potential new problem; like a large boulder perched at the tip of a mountain which will roll off in any direction at the slightest nudge. In meta-reinforcement-learning, like RNNs being trained to solve a t-maze which periodically flips the reward function, the very definition of success is rapid total changes in behavior based on few observations, implying the RNN has found a point where the different attractors are balanced and the observation history can push it towards the right one easily. These NNs are possible, they do exist, and given the implicit meta-learning we see emerge in larger models, they may become increasingly common and exist in places the user does not expect.

So, I see loads of reasons we should worry about an advanced AI system maneuvering from one strategy to another after a single update, both in general priors and based on what we observe of past NNs, and good reason to believe that scaling greatly increases the dangers there. (Indeed, the update need not even be of the 'parameters', updates to the hidden state / internal activations are quite sufficient.)

[-]bayesian_kitten4y30

Hi! Thanks for reading the post carefully and coming up with interesting evidence and arguments against~ I think I can explain PF4, but am certainly wrong on B1.

PF4

Why do you have high confidence that catastrophic forgetting is immune to scaling, given "Effect of scale on catastrophic forgetting in neural networks", Anonymous 2021?

Catastrophic forgetting (mechanism): We train a model to minimize loss on dataset X. Then we train it to minimize loss on dataset Y. When minimizing loss on dataset Y, it has no incentive to care about loss on dataset X. Hence, catastrophic forgetting. This effect seems not-very-solvable by simply scaling models.
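A toy sketch of that mechanism, under strong assumptions: a single shared weight, plain SGD, and two deliberately conflicting tasks (the slopes `2.0` and `-1.0`, learning rate and epoch count are all made up for illustration). It only shows the objective-level point that training on Y ignores the loss on X; it says nothing about how large overparameterized models behave.

```python
def mse(w, data):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def sgd(w, data, lr=0.05, epochs=200):
    """Plain per-example SGD on the squared error; returns the trained weight."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x  # d/dw of (w*x - y)^2
            w -= lr * grad
    return w


# Two hypothetical tasks that share inputs but disagree on the target slope.
task_x = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]
task_y = [(x, -1.0 * x) for x in (1.0, 2.0, 3.0)]

w = sgd(0.0, task_x)              # phase 1: minimize loss on X
loss_x_before = mse(w, task_x)

w = sgd(w, task_y)                # phase 2: minimize loss on Y only
loss_x_after = mse(w, task_x)     # nothing in phase 2 protected this
loss_y_after = mse(w, task_y)

print(loss_x_before, loss_x_after, loss_y_after)
```

With one shared parameter the two tasks cannot both be fit, so the loss on X necessarily rises after phase 2; whether extra capacity and shared structure remove that conflict is exactly what the scaling question is about.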

Re: linked paper: I did not know about it while writing the post. On a very preliminary skim, I don't think their modeling paradigm, multi-head, is practically even relevant. (Let me know if my skim was a misread):

Why? A simple baseline: After training on every task, save the model! At inference time, you're given which task the example is from, use that model for inference. No forgetting!
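A minimal sketch of that baseline, assuming the task identity is handed to us at inference time (as in the multi-head setting). `LinearModel` and `PerTaskCheckpoints` are hypothetical names for illustration, and directly setting `model.w` stands in for actual training.

```python
import copy


class LinearModel:
    """Stand-in for any trainable model: predicts w * x."""

    def __init__(self, w=0.0):
        self.w = w

    def __call__(self, x):
        return self.w * x


class PerTaskCheckpoints:
    """Keep a frozen copy of the model after each task and dispatch by task id."""

    def __init__(self):
        self._snapshots = {}

    def save(self, task_id, model):
        # Deep copy so that later training cannot alter the stored snapshot.
        self._snapshots[task_id] = copy.deepcopy(model)

    def predict(self, task_id, x):
        return self._snapshots[task_id](x)


model = LinearModel()
store = PerTaskCheckpoints()

model.w = 2.0            # stand-in for "train on task X"
store.save("X", model)

model.w = -1.0           # stand-in for "train on task Y" (overwrites X behaviour)
store.save("Y", model)

pred_x = store.predict("X", 3.0)  # answered by the task-X snapshot
pred_y = store.predict("Y", 3.0)  # answered by the task-Y snapshot
print(pred_x, pred_y)
```

The cost is one stored model per task, and the whole trick depends on being told the task id, which is why it stops working once the task has to be inferred.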

Strong Catastrophic Forgetting: Let's say we've trained a model on Imagenet10k. New data comes, and we have ~4k classes arriving, for an unknown #timesteps (cannot assume this is known). A realistic lifelong learning model case would be >hundreds of timesteps. The question is how do we learn this new information, given time constraints as we need the new model deployed (translating to compute given fixed resources)?

Here: (a) I'm skeptical that scaling will adequately address this (the degree of drop >> difference between scaled models) (b) The compute constraint is what explicitly works against larger models. But I feel the catastrophic forgetting dynamics here would be very different from those in the presented work.
Eg: We cannot use the above baseline and train different models for this, as we will have the hard task of identifying which model to use (near-best case: Supermasks in Superposition).

Weak Catastrophic Forgetting: I think the 'cannot store data' assumption is weird given storage is virtually free (compared to compute required to train a good model on that kind of data). So it's the same problem-- maintain better performance with a time constraint with full access to past data (here the problem itself is weaker but the compute constraints will be far tighter).

This is roughly my intuition. What do you think?

B1: Overall, I've updated to state what I've written in B1 is not true.

Re: First -- What I shall do is after we've discussed, I shall relegate it to not-true section at the end of the post (so it's still visible) and add grokking as a surprising bias (intuition also explained here) in Generalization properties. I found the grokking paper but it seems preliminary, can you link other such papers or good spin-glass model papers which illustrate this point? It will help me make a good claim there. Thanks for illustrating this!~

Re: Second-- Yeah, I'm assuming we train large networks and then think about this problem.

Re: Third-- I think defining sudden, non-linear shifts in NNs somewhere before the decision (say, in the probability distribution over actions/decisions) would make for a much stronger and more useful claim. So, a good claim would be saying that an SGD update might cause us to go from '0.9% exterminate all humans' to '51% exterminate all humans' if true (seems likely).

Re: Fourth-- I think conditional generation being different given different conditions is qualitatively different and less interesting than suddenly updating to a large degree (grokking).

The claim by Evan: The context was identifying trigger points of deception in SGD-- Using transparency tools to figure out what the model is 'thinking' and what strategy it is using, if it goes to the other side of the decision boundary and back, we can ask the model why did it do that (suspecting that this was a warning for deception starting). Now, (a) I think the boundary switch will always be small even in this case, even when viewed via a transparency tool (b) In any case, this is me over-generalizing and being obviously wrong. Correcting this will make the article stronger.

[-]gwern4y40

I take the point of the paper as showing that as models get larger and more overparameterized, it gets easier for them to store arbitrary capabilities without interference in part because the better representations they learn mean that there is much less to store/learn for any new task, which will share a lot of structure. At some point, worrying about 'classes' or 'heads' just becomes irrelevant as you zero-shot or few-shot it: eg CLIP doesn't really need to worry about catastrophic forgetting because you just type in the text description of what 'class' you're interested in and 'classify' that way; a MoE doesn't worry about task classification, because it learns what sub-expert to dispatch input to. You won't need to 'switch between tasks' (not even that meaningful a thing outside the constraints of a benchmark) because in-context learning & representations do all the work, latently disambiguating where one is. You will simply train large (perhaps sparse or MoE-esque) models in one-epoch fashion, streaming in data constantly and discarding it. When you have enough real-world data, you don't need or want to store it because of diminishing returns on retraining compared to grabbing a fresh datapoint from the firehose. (It's worth noting that no one in the large language model space has ever 'used up' all the text available to them in datasets like The Pile, or even done more than 1 epoch over the full dataset they used.) This is also good for users if they don't have to keep around the original dataset to sample maintenance batches from while doing more training.

This solution will make a lot of people very unhappy as they insist that "this isn't a solution, you just made a very large model, arrgghh, so inefficient and ungreen", but if it solves the problem, then it solves the problem, and now you're just haggling over the price. Are there ways more efficient? Almost certainly. Should we care that much about or bother researching them? Maybe.

I found the grokking paper but it seems preliminary, can you link other such papers or good spin-glass model papers which illustrate this point? It will help me make a good claim there. Thanks for illustrating this!~

The grokking paper is definitely preliminary. No one expected that and I'm not aware of any predictions of that (or patient teacher*) even if we can guess about a wide-basin/saddle-point interpretation.

I don't have a list of spin-glass papers because I distrust such math/theory papers and haven't found them to be helpful. There's some I host on gwern.net because they're early examples of estimating NN scaling laws but I didn't get anything out of trying to read them. (Physicists gonna physics.) What I can link you right this second is a very cool interactive 'explorable' web page of many different spin-glasses, where you can see a lot of their behavior for yourself: http://bit-player.org/2021/three-months-in-monte-carlo

* I'm not sure if this is an example. Do student models 'take off' abruptly? They look in the graphs like they might idle around for scores or thousands of epochs and then take off, but it's hard to tell from the graphs whether they are just truncated at an axis and actually show gradual consistent improvement over many many epochs.

Using transparency tools to figure out what the model is 'thinking' and what strategy it is using, if it goes to the other side of the decision boundary and back, we can ask the model why did it do that (suspecting that this was a warning for deception starting).

I'm not sure how useful transparency tools would be. They can't tell you anything about adversarial examples. Do they even diagnose neural backdoors yet? If they can't find actual extremely sharp decision boundaries around specific inputs, hard to see how they could help you understand what an arbitrary SGD update does to decision boundaries across all inputs.

[-]gwern4y40

There's a lot of physics-related nonlinearities/phase-transitions/powerlaw material in these workshop slides & videos, looks like: https://sites.google.com/mila.quebec/scaling-laws-workshop/schedule

[-]bayesian_kitten4y10

This post (which is really dope) provides some grokking examples in large language models in a Big-Bench video at 19313s & 19458s, with that segment (18430s-19650s) being a nice watch! I shall spend a bit more time collecting and precisely identifying evidence and then include it in the grokking part of this post. This was a really nice thing to know about and very surprising.

[-]gwern4y30

I've commented on that, but I'm not convinced that the phase transitions in learning are grokking, per se. There are many different scaling phenomena, and we shouldn't go around prematurely conflating them.

[-]bayesian_kitten4y*20

When you have enough real-world data, you don't need or want to store it because of diminishing returns on retraining compared to grabbing a fresh datapoint from the firehose. (It's worth noting that no one in the large language model space has ever 'used up' all the text available to them in datasets like The Pile, or even done more than 1 epoch over the full dataset they used.) This is also good for users if they don't have to keep around the original dataset to sample maintenance batches from while doing more training.

This would be the main crux, actually a tremendously important crux. I take this to mean that models would largely be very far from an overparameterized regime relative to the data? I expect operating in an overparameterized regime to give a lot more capabilities, and I currently consider overfitting to the dataset almost a necessity, whereas you seem to indicate this is an unreasonable assumption to make?

If so, erm, not only just catastrophic forgetting, but a lot of stuff I've seen people on the AI alignment forum base their intuitions on could potentially be thrown in the bin. Eg: I'm more confident in catastrophic forgetting having its effect when overfitted on the past data. If one cannot even properly learn past data but only frequently occurring patterns from it, those patterns might be too repetitively occurring to forget. But then, deep networks could do a lot better performance-wise by overfitting the dataset and exhaustively trying to remember the less-frequent patterns as well.

..it gets easier for them to store arbitrary capabilities without interference in part because the better representations they learn mean that there is much less to store/learn for any new task, which will share a lot of structure.

Here, the problem of catastrophic forgetting would not be on downstream learning tasks, it would be on updating this learnt representation to newer tasks.

The grokking paper is definitely preliminary. No one expected that and I'm not aware of any predictions of that (or patient teacher*) even if we can guess about a wide-basin/saddle-point interpretation.
I don't have a list of spin-glass papers because I distrust such math/theory papers and haven't found them to be helpful.

Very fair, cool. Thanks, those five were nice illustrations, although I'll need some time to digest the nature of non-linear dynamics. I've bookmarked it for an interesting trip someday.

I'm not sure how useful transparency tools would be. They can't tell you anything about adversarial examples. Do they even diagnose neural backdoors yet? If they can't find actual extremely sharp decision boundaries around specific inputs, hard to see how they could help you understand what an arbitrary SGD update does to decision boundaries across all inputs.

In this case, I deferred it as I don't understand what's really going on in transparency work.

But more generally speaking: Ditto, I sort of believe this to a large degree. I was trying to highlight this point in Section 'Application: Transparency'. I notice I'm significantly more pessimistic than the median person on the AI alignment forum, so there are some cruxes which I cannot put my finger on. Could you elaborate a bit more on your thoughts?

[-]jacob_cannell4y60

A-priori thinking about inductive biases of super-intelligent agents looks very unintuitive to me as I cannot currently understand how ML literature could meaningfully inform inductive biases of intelligent agents which could tweak themselves[1].

Also, the scaling hypothesis seems more in the GPT rather than superintelligence category, it’s unclear to me how tools will achieve agency, and I’m not yet convinced/appreciate this. ↩︎

I read that first sentence several times and it's still not clear what you mean, or how the footnote helps clarify. What do you mean by 'tweak'? A tweak is a small incremental change. DL is about training networks with some flavour of SGD/bprop, which approximates bayesian updates, and is all about many small 'tweaks'. So when you say "agents which could tweak themselves" at first glance you just seem to be saying "agents that can learn at all", but that doesn't seem to fit.

Your section on adversarial examples will not hold up well - that is a bet I am fairly confident on.

Adversarial examples are an artifact of the particular historical trajectory that DL took on GPUs where there is no performance advantage to sparsity. Adversarial attacks exploit the overfit, noisy internal representations that nearly all DL systems learn, as they almost never regularize internal activations and sparse weight regularization is still a luxury rather than default, and certainly isn't tuned for adversarial defense. Proper sparse regularized internal weights and activations - which compress and thus filter out noise - can provide the same level of defense against adversarial perturbations that biological cortical vision/sensing provides.

I know this based on my own internal theory and experiments rather than a specific paper, but just a quick search on the literature reveals theoretical & experimental support 1, 2, 3, 4

(all of those were found in just a few minutes while writing this comment)

The reason this isn't more widely known/used is twofold: 1.) there isn't much economic motivation - few are actually currently concerned with adversarial attacks outside theoretical curiosity and DL critics 2.) sparsity regularization (over activations especially) is a rather expensive luxury on current GPU software/hardware

[-]bayesian_kitten4y*20

Hi! Thanks for reading and interesting questions:

I read that first sentence several times and it's still not clear what you mean, or how the footnote helps clarify. What do you mean by 'tweak'? A tweak is a small incremental change.

That's correct, what I meant is say we state an agent has 'x, y, z biases', it can try to correct them. Now, the changes cannot be arbitrary, the constraints are that it has to be competitive and robust. But I think it can reduce the strength of the heuristic by going against it whenever it can to the extent those heuristics would have little usefulness. But it's likely that here I'm having a wrong and weird conception of superintelligence.

Your section on adversarial examples will not hold up well - that is a bet I am fairly confident on.

Huh. I would find that very surprising; it should hold up to sparsity. Let me clarify my reasoning, and then can you say what you think I might be missing? Or why you still think so?

Proper sparse regularized internal weights and activations - which compress and thus filter out noise - can provide the same level of defense against adversarial perturbations that biological cortical vision/sensing provides.

Notice that this noise is average case, whereas adversarial examples are worst case. This difference might be doing a lot of heavy-lifting here. Conventional deep networks have really nice noise stability properties, as in, they are able to filter out injected noise to a good extent, illustrated in Stronger generalization bounds for deep nets via a compression approach (ICML '18). In the worst case, despite 3D vision, narrow focus and other biases/limitations of human vision give rise to a wide variety of adversarial examples. Some examples: the 'the the' reading problem, not noticing large objects crossing in a video, or falling prey to a good variety of illusions are all adversarial examples for human vision. I'm not sure if human vision is a good example of a robust sensing pipeline.
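A rough sketch of the average-case versus worst-case gap, under a big simplifying assumption: the model is just a linear scorer `w . x` with random weights, which is not a deep network and not a sparsely regularized one. The dimension `d`, budget `eps` and trial count are arbitrary illustrative choices. A perturbation of fixed L2 norm aligned with `w` is compared against random perturbations of the same norm.

```python
import math
import random

random.seed(0)

d = 10_000          # input dimension (illustrative)
eps = 1.0           # L2 budget shared by both kinds of perturbation


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


# Hypothetical linear scorer with random weights.
w = [random.gauss(0.0, 1.0) for _ in range(d)]

# Worst case: spend the whole budget along the direction of w.
adversarial = [eps * wi / norm(w) for wi in w]
worst_shift = abs(dot(w, adversarial))


def random_shift():
    """Score change caused by a random direction rescaled to norm eps."""
    noise = [random.gauss(0.0, 1.0) for _ in range(d)]
    scale = eps / norm(noise)
    return abs(dot(w, noise)) * scale


# Average case: mean score change over a handful of random directions.
trials = 20
avg_shift = sum(random_shift() for _ in range(trials)) / trials

print("worst-case shift:", worst_shift)
print("average random shift:", avg_shift)
```

For this linear case the worst-case shift is `eps * ||w||`, while a random direction of the same norm moves the score by roughly a factor of `sqrt(d)` less, so filtering out random noise well and being robust to chosen perturbations are different properties.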

I know this based on my own internal theory and experiments rather than a specific paper, but just a quick search on the literature reveals theoretical & experimental support 1, 2, 3, 4

Err, I find citing literature to be often insufficient, especially in ML, to meet a reasonable bar of support (a surprising number of papers accepted in top conferences are unfortunately lame). I usually have to carefully read and analyze them.

For these papers, quickly reading some of them, my comment is as follows-- Notice that often: (a) they do not test with adaptive attacks (b) the degree of robustness they provide to weaker attacks is minimal (c) any simple defense like gaussian smoothing will do a lot better. Hence, they would provide little support about robustness.

For comparison, good empirical results in compression-gives-robustness look like: Robustness via Deep Low-Rank Representations (Arxiv). Although insufficient to be a good defense as adaptive attacks are likely to break it. Reference for adaptive attacks: On Adaptive Attacks to Adversarial Example Defenses (NeurIPS '20)-- one of my favourite works in this area.

Now, let me put forward the case against: Our current understanding of sparsity (works in G5, AR3) is that sparsity allows us to reduce parameterization, but only to a certain extent. Effects in AR3 suggest we probably need more complex models, and not simpler ones (with sparsity/regularization) for robustness -- i.e. the direction you think we should go towards seems to be the opposite of what the literature suggests (and this is indeed counter-intuitive!).

I think the reason people (including me) have been pessimistic about this direction and switched to doing research in other things is that it doesn't seem to give many benefits except a certain reduction in memory/parameterization at the extra cost of code modifications.

Adversarial examples are an artifact of the particular historical trajectory that DL took on GPUs where there is no performance advantage to sparsity ... sparsity regularization (over activations especially) is a rather expensive luxury on current GPU software/hardware

I don't think this is true, at least not to a large degree. GPUs do take a lot of advantage of sparse patterns, or maybe I have a lower bar for 'sparsity works!' than you do. PyTorch saves memory and speeds up computations by a huge amount if you use sparse tensors! If you have structured sparsity (blocksparse structure, pointwise convolutions), it's even better and there are some very fast CUDA kernels to leverage that.

It has limited upside, not contributing interesting/helpful inductive biases. It's fairly common to sparsify and quantize deep networks in deployment phase, although often the non-sparse CUDA kernels work fine as they're insanely optimized.

[-]bayesian_kitten4y20

Referencing recent papers sent my way here (this shall be a live, expanding comment), please do link more if you think they might be useful:
- Inductive biases in theory-based reinforcement learning

Also, the scaling hypothesis seems more in the GPT rather than superintelligence category, it’s unclear to me how tools will achieve agency, and I’m not yet convinced/appreciate this. ↩︎
Note that for this post, I shall caveat the transparency tools from feature-importance and interpreting-representations among the large area of transparency research and probes to be structural probes from the large area of interpretable NLP. ↩︎
Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually) (EMNLP ‘20) is concrete counter-evidence against scaling, but its evaluation might suffer from the same biases provided in this section. ↩︎
An alternate explanation is in Towards Theoretical Understanding of Deep Learning. It argues larger models enables better ease-of-optimization for SGD, choosing better gradient trajectories, hence finding better solutions. ↩︎
Randomly connected subnetworks account for most subnetworks, hence a reluctance to call it a ‘lottery ticket’, which assumes such tickets are rare and need to be found. ↩︎
Note that it’s less interesting but true (albeit non-vacuously) for any overparameterized function. ↩︎
Random sampling with rejection is an astoundingly slow optimizer. It’s approximated using random sampling in a Gaussian Process. ↩︎
Obtained by mislabeling different-portions of data (properties characterized in Distributional Generalization: A New Kind of Generalization (Arxiv)) ↩︎
The procedure nudges the model to output widespread conspiracy theories by crafting prompts to help imitate human falsehoods. ↩︎
Other claims are also interesting, discussed at length ahead in Evidence AR1. ↩︎
Imperceptibleness is important as these patterns are likely unseeable and significantly hard to discover. Invisibilities trickle down, systems increasingly seem black-box (access & knowledge-wise) to downstream users. ↩︎
