Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Epistemic Status: Exploratory. My current, still-changing outlook, based on limited exploration and understanding over ~60-80 hours.

Acknowledgements: This post was written under Evan Hubinger’s direct guidance and mentorship as a part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program. Thanks to particlemania, Shashwat Goel and shawnghu for exciting discussions. They might not agree with some of the claims made here; all mistakes are mine.

Summary (TL;DR)

Goal: Understanding the inductive biases of Prosaic AI systems could be very informative towards creating a frame of safety problems and solutions. The proposal here is to generate an Evidence Set from current ML literature to model the potential inductive bias of Prosaic AGI.

Procedure: In this work, I collect evidence of inductive biases of deep networks by studying ML literature. Moreover, I estimate from current evidence whether these inductive biases vary with scaling to large models. If a phenomenon seems robust to or amplified by scaling, I discuss it here and add it to the Evidence Set.

Structure: I provide interpretations of some papers that are interesting for AI safety in three maximally-relevant subareas of ML literature (pretraining -> finetuning, generalization, and adversarial robustness), and demonstrate use cases to AI safety (in ‘Application’ subsections). I then summarize evidence from each area to form the Evidence Set.

Current Evidence Set: Given in Section “Evidence Set” (last section of this post)

Inspiration: I think developing good intuitions about inductive biases and past evidence essentially constitutes ‘experience’ used by ML researchers. Evidence sets and broadly analyzing inductive biases are the first steps towards mechanizing this intuition. Inductive bias analysis might be the right level of abstraction. A large, interconnected Evidence Set might give a degree of gears-level understanding of the black-box that is deep networks.

Applications: Any theory of inductive biases, if fully developed, should be able to explain the observations collected in the evidence set. Alternatively, we can use the evidence set to falsify mechanistic statements about training rationales in Training stories. However, I expect evidence sets to be helpful beyond these two applications.

Limitations: Evidence sets, and more broadly, inductive bias analysis, seem useful for tackling inner alignment but not aligned objective-learning (outer alignment). Secondly, the predictive power of evidence sets is unclear. When the proposal is fully fleshed out, it could have lower predictive power on properties or possible outcomes of training rationales, limiting its applicability. Thirdly, it’s challenging to mechanize or automate and scale inductive bias analysis, unlike, say, transparency tools. Nevertheless, I think the idea of evidence sets is a promising proposal.

A: Where does it Fit in Alignment Literature?

Situating in Safety Scenarios: Studying inductive biases of Prosaic AI seems well-suited for systems arising from the scaling hypothesis. More broadly, from Dafoe’s AGI perspectives, studying inductive biases could be instrumental for safety considerations in the case of General Purpose Technology and may be helpful for the AI Ecology system classes. A-priori thinking about inductive biases of super-intelligent agents looks very unintuitive to me, as I cannot currently understand how ML literature could meaningfully inform inductive biases of intelligent agents which could tweak themselves[1].

Direct Applications: I list two applications where our evidence set can be directly helpful. If this post seems useful in your safety work, please link it in the comments.

  • Checking Falsifiable Claims in Training Stories: I essentially expand inductive bias analysis as proposed in training stories, aiming to analyze them at the right level.
  • Evidence Set can be used to falsify theories: Examples of such theories could range from conceptions of how Prosaic AGI systems might look to theories of how deep networks work and possible tradeoffs between different alignment constraints.

B: Basic Inductive Biases

I list some inductive biases based on the nature of learning and data. While it’s not immediately apparent to me how these would be useful for safety problems, they are a good start.

(B2) Dataset-Reality Mismatch Bias (Ref: Unbiased Look at Dataset Bias (CVPR ’11))
Effect & Scalability: Definitely real, independent from models.
My Interpretation: Datasets are used as proxies for real-world data, but suffer from an underspecification bias: They contain too little information to reflect the full range of competencies a model needs to generalize robustly in the real world. This is often the bias responsible for idiosyncrasies and surface-level correlations in models when occurring in training data and for favouring non-robust models over more robust internally-aligned models in test data.


PF: Inductive Biases of Large, Pretrained Models

Motivation: Pretraining learns Linguistically Grounded Representations

Large-scale pretraining has been instrumental in advancing performance on images, text and multi-modal data, making pretrained models important candidates for studying inductive biases. Large Language Models would be good candidates for studying inductive biases in this area. The NLP research community has made progress in analyzing these models and finding how and if broadly-useful linguistic features are learnt (pretrain) and transferred (finetune) in this paradigm. Researchers have studied and interpreted what the representations of large language models encode, specifically which linguistic properties.

Probing:[2] A popular, promising transparency mechanism used is Probing, which famously illustrated how BERT Rediscovers the Classical NLP Pipeline (ACL ‘19).

What are Probes? Probes are small parametric classifiers (of varying complexity) trained to detect a linguistic property of relevance. Their input is the representation to be analyzed. High probing accuracy is a valuable indicator: it says the probed linguistic property is encoded and easily extractable in the representation. In contrast, low probing accuracy is less informative, as it’s hard to disentangle whether the linguistic property was not encoded or was not extractable. I think this is a neat model for investigating the presence of high-level information, somewhat comparable to per-neuron transparency approaches like Circuits (Blog). Furthermore, Probing as Quantifying the Inductive Bias of Pre-trained Representations (Arxiv) provides a simple, elegant Bayesian framework for using probing-like transparency tools for systematically investigating inductive biases in models.
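To make the idea of a probe concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the "representations" are synthetic vectors in which one dimension linearly encodes a binary property, the probe is a logistic-regression classifier, and the names `train_probe`, `probe_accuracy` and `make_representations` are hypothetical. The second probe is trained on labels unrelated to the representations, as one simple sanity check of what accuracy a probe reaches when there is nothing to find.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_probe(reps, labels, epochs=200, lr=0.5):
    """Fit a linear (logistic-regression) probe with full-batch gradient descent."""
    dim, n = len(reps[0]), len(reps)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(reps, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_accuracy(w, b, reps, labels):
    """Fraction of held-out representations the probe labels correctly."""
    hits = sum(
        (sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1)
        for x, y in zip(reps, labels)
    )
    return hits / len(reps)

def make_representations(rng, n, dim=8, signal=1.5):
    """Synthetic stand-in for model representations (an assumption):
    dimension 0 linearly encodes the property, the rest is noise."""
    reps, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += signal * (2 * y - 1)
        reps.append(x)
        labels.append(y)
    return reps, labels

rng = random.Random(0)
train_x, train_y = make_representations(rng, 400)
test_x, test_y = make_representations(rng, 400)

# Probe for the property that really is encoded.
w, b = train_probe(train_x, train_y)
acc_property = probe_accuracy(w, b, test_x, test_y)

# Control: labels drawn independently of the representations.
ctrl_train_y = [rng.randint(0, 1) for _ in train_y]
ctrl_test_y = [rng.randint(0, 1) for _ in test_y]
w_c, b_c = train_probe(train_x, ctrl_train_y)
acc_control = probe_accuracy(w_c, b_c, test_x, ctrl_test_y)

print(f"property probe accuracy: {acc_property:.2f}")
print(f"control probe accuracy:  {acc_control:.2f}")
```

In this toy setup the gap between the two accuracies is what carries information. A real study would replace the synthetic vectors with frozen model activations and would likely need more careful controls than random labels.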

Evidence & Opinion: Limitations of Pretraining & Probing-esque Transparency Tools

(PF1) Evidence: Out of Order: How Important Is The Sequential Order of Words in a Sentence in Natural Language Understanding Tasks? (ACL ‘21), Sometimes We Want Translationese (EMNLP Findings ‘21), UnNatural Language Inference (ACL ‘21), What Context Features Can Transformer Language Models Use? (ACL ‘21)
Effect & Scalability: Seems real. Seems robust to scaling (low confidence)
My Interpretation: Large language models show good performance when finetuned on various hard tasks like machine translation, summarization and determining logical entailment between sentences. But are these downstream tasks demonstrative of encoding good linguistically-grounded representations about language like we expect?

The above works demonstrate this is not the case in these downstream tasks, using perturbations like shuffling the words of input sentences in the test set, which destroys syntax and makes the sentences sound like gibberish. Surprisingly, these papers find that deep networks still perform well with shuffled inputs on the GLUE benchmark suite of tasks, machine translation and natural language inference. Similarly, deleting particular kinds of information from the context does not drop performance in language modelling.

This is explained from two outlooks:

(a) Works say that current datasets are dramatically insufficient to capture the full complexity of the underlying task (among other things). I think it’s likely that these test sets are too simple.

(b) Deep models have demonstrated time-and-again a propensity to cheat by relying on not-human-like, surface-level features instead of linguistically-grounded features for learning (cute illustration here).

Non-Human Interpretable Features Usually Work (NHIFUW) Hypothesis: I hypothesize that surface-level, brittle correlations are effective in capturing a reasonable degree of the complexity of tasks in average cases, but fail in relatively worst-case scenarios (which provides easy counter-examples and fuels arguments that the datasets were badly created).

(PF2) Evidence: Pre-training without Natural Images (ACCV ‘20), A critical analysis of self-supervision, or what we can learn from a single image (ICLR ‘20)
Effect & Scalability: Seems real. Seems robust to scaling (medium confidence)
My Interpretation: These works illustrate that pretrained models/early layers of deep networks contain limited information about the statistics of natural images and hence can be captured via synthetic images or transformations instead of using large natural image datasets. They illustrate this by pretraining on datasets providing limited to no information about natural images, such as using a heavily-augmented natural image or a large dataset of fractal patterns. Even such synthetic pretraining learns useful features which help achieve: (i) Significantly better performance than training models from scratch (ii) Comparable performance to models pretrained on large natural-image datasets.

I believe this gives credence to the NHIFUW hypothesis. They made it hard by design to capture any meaningful information about natural images in these weird pretrain datasets. If spurious representations learned from fractals still transfer well, it is questionable to what extent human-like biases help large pretrained models perform well on downstream vision tasks.

(PF3) Evidence: Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little (EMNLP ‘21) (Similar: Does Pretraining for Summarization Require Knowledge Transfer? (EMNLP ‘21))
Effect & Scalability:  Seems real. Seems robust to scaling, but has credible evidence against [3] (medium confidence)

My Interpretation: In PF1, we saw the effect of shuffling words at test time. This work demonstrates that pretraining might not be encoding similar rich linguistic properties either. Furthermore, probing still detects the encoding of syntactic properties even when we explicitly destroy syntax by shuffling words. This indicates possible fundamental shortcomings of transparency tools like probing, or at least a need for additional rigour, like creating better control datasets, for them to be informative.

How? They show it by a simple trick: They shuffle words in every sentence of the corpora while preserving uni- or bi-grams (as done in the title subheading). This destroys syntax, which is fundamental towards encoding rich linguistic information. They train a transformer model on this shuffled corpus and show that: (a) It dramatically outperforms training from scratch on a downstream task, i.e., pretraining gave a large advantage even without rich linguistic information. (b) It performs comparably to pretraining a transformer on original corpora, i.e., adding rich linguistic biases was not helpful for downstream tasks. Lastly, they verify that the model didn’t self-correct and reconstruct correct-word ordering by illustrating poor performance on non-parametric checks and predicting the next-word (natural language modelling).
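The shuffling trick can be sketched in a few lines. This is a simplified illustration, not the paper's exact procedure: it assumes whitespace tokenization and cuts each sentence into fixed, non-overlapping chunks of `n` words before shuffling the chunks, and the function name `shuffle_preserving_ngrams` is hypothetical.

```python
import random

def shuffle_preserving_ngrams(sentence, n=1, seed=0):
    """Shuffle a sentence while keeping consecutive n-word chunks intact.

    n=1 shuffles individual words; n=2 keeps adjacent word pairs together.
    Chunking into fixed, non-overlapping blocks is a simplifying assumption.
    """
    words = sentence.split()
    chunks = [words[i:i + n] for i in range(0, len(words), n)]
    rng = random.Random(seed)  # seeded so the example is reproducible
    rng.shuffle(chunks)
    return " ".join(word for chunk in chunks for word in chunk)

sentence = "the quick brown fox jumps over the lazy dog today"
print(shuffle_preserving_ngrams(sentence, n=1, seed=1))
print(shuffle_preserving_ngrams(sentence, n=2, seed=1))
```

The bag of words (and, for `n=2`, the chosen bigrams) is unchanged, so any distributional statistics at that level survive while sentence-level syntax does not.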

That pretraining on shuffled words dramatically improves performance compared to training from scratch gives credence to the NHIFUW hypothesis, as this experiment controls for bad dataset creation to a good degree.

What is very surprising is that a variety of parametric probes of different complexity detected rich syntactic information being encoded despite us explicitly destroying syntax by shuffling sentences, sometimes outperforming models trained on the original corpora. What does this imply? I found Probing Classifiers: Promises, Shortcomings, and Advances (Squib, Computational Linguistics) to be a lucid and insightful summary. I think the shortcomings of probing stated in their conclusions should generalize to more popular transparency research, say, by Chris Olah. If so, drawing meaningful conclusions from transparency tools seems to be more complicated than it may seem. Probes only indicate a high correlation with the linguistic property in question, which often happens even when the property isn’t present. Additionally, we might need various control sets to isolate effects, or a combination of transparency tools that alleviate each other’s shortcomings, for reliable conclusions.

Addressing this issue with adversarial robustness: Adversarial NLI: A New Benchmark for Natural Language Understanding (ACL ‘20), Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Model (NeurIPS ‘21)

(PF4) Evidence: Catastrophic forgetting in connectionist networks
Effect & Scalability: Definitely real. Robust to scaling (high confidence)
My Interpretation: In sequential learning, the phenomenon of catastrophic forgetting occurs when neural networks lose all predictive power on previous tasks (e.g. pretraining) once trained on a subsequent dataset (e.g. finetuning). This hinders learning in several ways: (i) Learning about the world might require learning and reasoning over multiple timesteps of data ingestion. If so, deep networks have an inductive bias of being episodic, not only in the objective/reward-optimizing sense but also in not remembering knowledge from the past at all.

Why is this important? I've recently come across a proposal for tackling the off-distribution shift, which essentially says do online updates! This bias highlights a huge and often ignored cost: Losing nearly all predictive power on previously learned tasks. Hence, deep networks (specifically the feature learning parts) are not updated online. This is quite a general problem, which crops up when one wants to add more data to a trained AI system without retraining on all past data.

Application: Transparency

The Other Side of Transparency: I am impressed by conceptual frameworks developed using transparency and also the speed of progress and thoroughness in the analysis of neurons with transparency tools. I think transparency might be beneficial in analyzing failure-modes of learning in Advanced AI. Simultaneously, I fear being too optimistic about their capabilities as I was with probing. Hence I present lessons learned from the investigation of probing, which might apply to other transparency tools:

Why? Visible thoughts of models, HCH or neuron-by-neuron understanding (the approaches I’ve been exposed to) seem to have the assumption of being human-interpretable. Based on the above finding, it seems possible that growing evidence will show substantial subsets of representations learned by large deep networks are very unintuitive to humans but important for competitiveness (ref: ERM story). Secondly, when introducing good control sets for evaluating transparency mechanisms as was done with probing, the advantages of transparency mechanisms might vanish or reduce by a greater amount than I expected.

We might need to aim for worst-case transparency for safety considerations as:
(a) Many aspects of Circuits and similar tools might be modeling for average-case transparency (likely to fail in the worst case)
(b) There have been historical precedents of issues in similar attempts to achieve transparency, one interesting example being that average-case transparency is under-specified, as it does not need a specific threat model.

Meta-Note: Ah, the feeling of being reduced to tears looking at your own proposal.

G: Inductive Biases of Generalizing Models

Motivation

I now discuss the inductive biases of deep networks on generalization properties. ML Literature studies generalization to understand how deep network training consistently draws from a small subset of possible perfectly fitting models with good generalization properties. The safety aim here is to understand how inductive biases affect the output model space and then use this knowledge to try and obtain only inner-aligned (or robust) models. Now, we cannot change strong inductive biases. Hence, we should consider them as hard constraints in the alignment problem. On the other hand, weaker inductive biases can be traded-off with other considerations if required for alignment, forming softer constraints.

Related posts: [How to think about Over-parameterized Models]

Evidence & Opinion: Generalization is tricky to Coherently Model

(G1) Evidence: Understanding deep learning requires rethinking generalization (ICLR ‘17)
Effect & Scalability: Definitive. Gets stronger with scaling (high confidence)
My Interpretation: This work demonstrates that overparameterized deep networks trained with SGD can obtain a perfectly fitting model for any input -> output mapping (dataset), which they call ‘memorizing’ the training set, despite using popular regularization techniques. We can conclude that neither (a) the implicit regularization provided by SGD nor (b) the explicit regularization provided by popular regularizers is enough to restrict the output space to models that generalize well.

(G2) Evidence: Bad Global Minima Exist and SGD Can Reach Them (NeurIPS ‘20)
Effect & Scalability: Seems real. Seems robust to scaling (medium confidence)
My Interpretation: This work tweaks the training procedure: training with random labels first and then switching to real labels. By this, SGD finds poorly-generalizing models (bad minima) consistently. Furthermore, adding a degree of explicit regularization can improve convergence, obtaining a good-generalizing model.

This is an exciting test for learning theories, which must predict in which settings SGD could find poor-generalizing models (this being one of them). A theory that always predicts convergence to good-generalizing models can be rejected by this evidence.

(G3) Evidence: What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation (NeurIPS ‘20)
Effect & Scalability: Seems real. Seems robust to scaling (medium-high confidence)
My Interpretation: This work shows that deep networks can have a propensity for memorization of a part of the training data while simultaneously being able to generalize by correctly fitting other parts of the train set. Deep networks can memorize specific data points and generalize well with a shared parameterization (lower layers etc.).

This work motivates this bias: Most natural data distributions are long-tailed, containing a large number of underrepresented subpopulations where sometimes the best a model can do is memorize. Surprisingly, selecting for memorization of the tail end of the data distribution (including noise and outliers) gives better generalization performance on the test set than approaches that ignore it, because similar points reoccur.
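One way to make "memorizing a data point" measurable is a leave-one-out score: how much more often the learner fits example i's label when i is in the training set than when it is held out. The sketch below illustrates that idea under loud assumptions: the learner is a deterministic 1-nearest-neighbour rule on 1-D inputs (a stand-in for a trained network), the dataset is made up, and `memorization_score` is a hypothetical helper rather than the paper's estimator.

```python
def nearest_label(x, data):
    """Label predicted by a 1-nearest-neighbour learner trained on `data`."""
    return min(data, key=lambda point: abs(point[0] - x))[1]

def memorization_score(data, i):
    """Leave-one-out score for example i:
    [fits label with i in the training set] - [fits label with i held out]."""
    x_i, y_i = data[i]
    with_i = float(nearest_label(x_i, data) == y_i)
    without_i = float(nearest_label(x_i, data[:i] + data[i + 1:]) == y_i)
    return with_i - without_i

# Two well-populated clusters plus one atypical class-1 example at x=2.0.
data = [(0.0, 0), (0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
        (5.0, 1), (5.1, 1), (5.2, 1), (5.3, 1), (5.4, 1),
        (2.0, 1)]

scores = [memorization_score(data, i) for i in range(len(data))]
for (x, y), s in zip(data, scores):
    print(f"x={x:>4} label={y} memorization={s:.0f}")
```

Typical points score 0 because their neighbours already predict their label, while the atypical point scores 1: the only way this learner gets it right is by storing it. With a stochastic learner such as SGD, both terms would become probabilities estimated over many training runs.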

(G4) Evidence: Deep Double Descent (Blog) [Rohin’s Summary and Opinions][Discussions #1 and #2]
Effect & Scalability: Seems real. Gets stronger with scaling (medium confidence)
My Interpretation: As we increase the parameter size or increase the length of training or decrease training data, in all cases, when in the zero train error (interpolation) regime, larger models/longer optimization achieve better generalization. This is very counter-intuitive because given you have plenty of correct (zero train error) models in your current hypothesis space, this states that selecting a larger hypothesis space still allows you to find better generalizing models, albeit with decreasing marginal returns. The implication is that in larger hypothesis sets, we have a higher probability of getting a better model[4].

(G5) Evidence: Deep Expander Networks: Efficient Deep Networks from Graph Theory (ECCV ‘18), Sanity-Checking Pruning Methods: Random Tickets can Win the Jackpot (NeurIPS ‘20), Pruning Neural Networks at Initialization: Why are We Missing the Mark? (ICLR ‘21)
Effect & Scalability: Seems real. Might be stronger with scaling (medium confidence)
My Interpretation: The surprising finding in pruning has been that layerwise-random selection leads to much smaller subnetworks (called ‘tickets’) which, on training, obtain accuracies close to the original models across a variety of networks and pruning settings[5]. I find analyzing these more straightforward and more valuable than lottery tickets.

In context of previous evidence (G4), it seems to me that the main advantage gained by a richer hypothesis set is not the larger #parameters available to express functions (works well after some parameters are pruned) but richer connectivity. This rich connectivity is likely obtained in random subnetworks discussed here, allowing SGD to search for a good model effectively.
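As an illustration of what layerwise-random selection means in practice, the sketch below draws a random binary mask per layer, keeping a chosen fraction of weights in each. The layer shapes, keep fractions and the function name `layerwise_random_masks` are assumptions for the example; the cited papers differ in how the per-layer fractions are chosen.

```python
import random

def layerwise_random_masks(layer_shapes, keep_fractions, seed=0):
    """Return one 0/1 mask per layer, keeping a random subset of weights.

    layer_shapes:   list of (rows, cols) weight-matrix shapes
    keep_fractions: fraction of weights to keep in each layer
    Only the per-layer sparsity is controlled; which weights survive is random.
    """
    rng = random.Random(seed)
    masks = []
    for (rows, cols), frac in zip(layer_shapes, keep_fractions):
        total = rows * cols
        n_keep = max(1, round(frac * total))
        kept = set(rng.sample(range(total), n_keep))
        masks.append([[1 if r * cols + c in kept else 0 for c in range(cols)]
                      for r in range(rows)])
    return masks

shapes = [(20, 10), (10, 5)]     # hypothetical two-layer network
fractions = [0.1, 0.5]           # keep 10% of layer 1 and 50% of layer 2
masks = layerwise_random_masks(shapes, fractions)
for shape, mask in zip(shapes, masks):
    kept = sum(map(sum, mask))
    print(f"layer {shape}: kept {kept} of {shape[0] * shape[1]} weights")
```

A masked network would multiply each weight matrix elementwise by its mask before and during training, so that pruned connections stay at zero.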

(G6) Evidence: Is SGD a Bayesian sampler? Well, almost (JMLR) [Summary][Discussion][Zach’s Summary & Opinion]. Why flatness does and does not correlate with generalization for deep neural networks (Arxiv)
Effect & Scalability: Unknown, but major if true. Unknown (low confidence).
My Interpretation: This line of papers forms the following narrative: The prior distribution of (initialized) deep models is strongly biased towards simple functions. The posterior distribution of (learned) deep networks is also simple, as training (SGD) contributes little-to-no inductive bias of its own. Hence, we likely obtain simple functions as output and generalize well. How?

(1) On prior: Simple functions (input -> output) might have a substantially larger volume in parameter-space, i.e., there are many more ways to express a simple function than a complex one when you are very overparameterized[6].

(2) On posterior: This is shown by empirically comparing the distribution of SGD-trained networks and randomly sampled networks with similar test accuracies, and finding they correlate well[7].

(3) On Generalization: A trained model's simplicity is measured by its prior probability (using Gaussian Processes- Appendix D). This prior probability strongly correlates with generalization error across different accuracy ranges[8].

Together, these arguments imply: 
(a) From (2): SGD has little-to-no inductive biases. 
(b) From (1&3): Simplicity prior (in 1) is a great generalization measure.

These seem promising if the empirical results remain robust with better controls and more rigorous analysis. I am skeptical, but admittedly don’t have enough background to provide an incisive critique. Some background why I might be skeptical of correlations:

(i) Evaluation: Fantastic Generalization Measures and Where to Find Them (ICLR ‘20) and In Search of Robust Measures of Generalization (NeurIPS ‘20) have been skeptical of previous generalization measures after causal analysis under average-case and worst-case criteria, respectively. The results here seem preliminary, and it’s unclear if they’ll hold up under more rigorous evaluation.

(ii) Inductive Biases: Can this framework explain past contrary findings? E.g., Implicit Regularization in Deep Matrix Factorization (NeurIPS ‘19), among others, has illustrated biases in SGD to a certain degree. That work seems to convincingly isolate (by using a deep linear network) that SGD has a strong low-rank inductive bias, which is not the same as minimizing the nuclear norm.
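The volume argument in (1) can be probed with a toy experiment: sample the parameters of a small network at random, record which boolean function it computes, and look at how unevenly the samples spread over functions. The sketch below is an assumption-laden illustration (a one-hidden-layer ReLU network with 4 hidden units, standard Gaussian parameters, 3 boolean inputs), not a reproduction of the papers' setup.

```python
import random
from collections import Counter
from itertools import product

INPUTS = list(product([0, 1], repeat=3))  # all 8 boolean inputs

def sample_function(rng, hidden=4):
    """Draw random parameters and return the induced boolean function
    as an 8-character truth-table string."""
    W = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(hidden)]
    b = [rng.gauss(0, 1) for _ in range(hidden)]
    v = [rng.gauss(0, 1) for _ in range(hidden)]
    c = rng.gauss(0, 1)
    bits = []
    for x in INPUTS:
        h = [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + bj)
             for row, bj in zip(W, b)]
        out = sum(vj * hj for vj, hj in zip(v, h)) + c
        bits.append("1" if out > 0 else "0")
    return "".join(bits)

rng = random.Random(0)
N = 10000
counts = Counter(sample_function(rng) for _ in range(N))

uniform_share = 1 / 256  # share each of the 2^8 functions would get if uniform
print(f"distinct functions seen: {len(counts)} of 256")
for table, k in counts.most_common(5):
    print(f"{table}  share={k / N:.3f}  (uniform would be {uniform_share:.4f})")
```

If parameter-space volume were spread evenly over functions, each truth table would appear in about 0.4% of samples. In this toy the most frequent functions take a far larger share, which is the kind of skew the prior argument relies on; whether that skew tracks a meaningful notion of simplicity at scale is exactly what the skepticism above is about.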

(G7) Evidence: Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets [Rohin’s Summary & Opinion][Added from comment by Gwern.]
Effect & Scalability: Unknown, but important if true. Unknown (low confidence).
My Interpretation: The main claim is that deep models might change behaviors abruptly, shifting from one strategy to a completely different one in a short time. In this paper, they test the ability of deep networks to perform modular arithmetic correctly. They observe that deep networks quickly memorize the dataset but fail to really generalize from it. At a certain point, however, networks suddenly switch to a strategy which generalizes really well (not a gradual increase either).

Rohin gives a possible learning dynamic, which seems intuitive to me: Gradient descent quickly gets to a memorizing function, and then moves mostly randomly through the space, but once it hits upon the correctly generalizing function (or something close enough to it), it very quickly becomes confident in it and then never moves very much again. Gwern states that it's in general a good prior to model deep networks as being on the 'edge of chaos' in various ways. Currently, the evidence for sudden shifts in dynamics as a way of learning is preliminary. However, if they frequently occur, it could significantly affect our understanding of deep networks and may be safety relevant. E.g., predicting when shifts are likely to occur and how they might affect learning might be a hard problem. In contrast, the direction and degree of gradual shifts are easy to predict from the degree of decrease in the loss.

More potentially-informative papers which could inform limits of capabilities of a deep network, but I am unable to judge or contextualize value: Universal Transformers (ICLR ‘19), Theoretical Limitations of Self-Attention in Neural Sequence Models (TACL).

Application: Memorization

The Other Side of Memorization: We sometimes really want models to not engage in certain types of behaviour. In TruthfulQA: Measuring How Models Mimic Human Falsehoods [Discussion], we really want models to avoid generating ‘falsehoods’ as the target behaviour. This work shows the failures of large pretrained models by demonstrating that they have considerable ability to create falsehoods. This suggests a very relevant question: To what degree are the falsehoods caused by generalization versus memorization?

In this case, I think it’s mostly the ability to memorize and extract that information from the model that seems to be tested here[9], which is better done with larger language models. I think this is likely what’s happening because of other evidence I saw: (i) Larger models are more informative, i.e. models better memorize and extract all information (ii) Works like Extracting Training Data from Large Language Models (USENIX ‘21) explicitly demonstrate interesting cases of unintended, extractable memorized information from larger models.

A Simpler Setting: Let’s consider a simple case- How benign is benign overfitting? (ICLR ‘21). It starts by recounting phenomena summarized in (G3), i.e. Deep Networks usually overfit (memorize) perfectly to partial label noise present in large-scale datasets, which is usually considered ‘benign’.

It disputes that such memorization is ‘benign’ by convincingly citing literature and providing new evidence that (among other things[10]): When models are made adversarially robust, they do not blindly memorize the entire training data. The data they successfully refrain from learning is a part of the incorrectly labelled and even the atypical examples present in the training data.

This indicates: (a) We might entirely miss the relatively small set of ground-truth (read: internally aligned) models simply by selecting from models achieving zero train-error (b) Adversarial robustness is said to be one way of tackling the above-stated issue in a scalable fashion, updating my confidence in similar principles being used for safety proposals.

Coming Back: Regarding the task of generating falsehoods, what would robust behaviour look like? My current best guess is a neural version of one of my all-time favourite works, Never-Ending Language Learning, which designs a self-correcting classifier that reads the web and generates more facts in a knowledge graph by linking facts and verifying fit. In this case, conspiracy theories would be considered 'noisy edges' in the knowledge graph and end up being naturally excluded with robust training.

Similarly, a clever mechanism that makes unsafe completions akin to noisy examples in an otherwise safe world might greatly help safe generations. Training for robustness to unsafe completions likely satisfies the other desiderata needed: It's readily scalable and preserves competitiveness of outputs. I hope inductive-bias analysis is helpful in such a manner.


AR: Inductive Biases of Models (more) Robust to Adversaries

Motivation: Worst-Case Robustness is Relevant

There is a good reason a lot of adversarial robustness literature is overlooked. John_Maxwell highlights this passage from Motivating the Rules of the Game for Adversarial Example Research (Summary):

In this paper, we argue that adversarial example defense papers have, to date, mostly considered abstract, toy games that do not relate to any specific security concern. Furthermore, defense papers have not yet precisely described all the abilities and limitations of attackers that would be relevant in practical security.

I like two perspectives that motivate studying Adversarial Robustness, which might be AI Safety relevant: (i) Security of AI systems and (ii) Robustness to Distributional Shift.

Security Perspective: Scasper motivates the problem well:

As progress in AI continues to advance at a rapid pace, it is important to know how advanced systems will make choices and in what ways they may fail. When thinking about the prospect of super-intelligence, I think it’s all too easy and all too common to imagine that an Artificial Super-intelligence would be something which humans, by definition, can’t ever outsmart. But I don’t think we should take this for granted. Even if an AI system seems very intelligent -- potentially even super-intelligent -- this doesn’t mean that it’s immune to making egregiously bad decisions when presented with adversarial situations. Thus the main insight of this paper:

The Achilles Heel hypothesis: Being a highly-successful goal-oriented agent does not imply a lack of decision theoretic weaknesses in adversarial situations. Highly intelligent systems can stably possess "Achilles Heels" which cause these vulnerabilities.

Currently far more intelligent than AI systems, humans are predictably and reliably fooled by visual and auditory attacks. It’s unclear why intelligent AI systems would not have security vulnerabilities, exploitable by bad actors to predictably cause, say, catastrophic failures, or by an intelligent AI system as a trigger for deception.

Distribution Shift Perspective: I think if semantic adversarial examples can be generated automatically, they would be pretty neat, efficient ways of testing robustness to distribution shift. Now what are semantic adversarial examples? I motivate by examples: (i) Examples with the background correlations removed without the object present (ii) Non-central examples of a particular class (say, something which technically is a chair but not what one would imagine).

How is it useful? I am linking the post where it's nicely illustrated for reference, as my summary was much longer with little added value.

Evidence & Opinion: Adversarial Examples might be Inevitable

(AR1) Evidence: Adversarial Examples Are Not Bugs, They Are Features (NeurIPS ‘19) [Discussions][Rohin’s Summary & Discussion], Robustness May Be at Odds with Accuracy (ICLR ‘19)
Effect & Scalability: Definitely real. Seems robust to scaling (medium confidence)
My Interpretation: In summary, the first work convincingly shows that some adversarial examples could arise due to non-robust features: easily found, highly-predictive (hence competitive) but brittle (hence non-robust) and imperceptible[11] to humans. They can isolate these brittle features from the data, creating a robust dataset and a non-robust dataset. Simple training on the robust dataset gives naturally-robust models. However, the loss of predictive features results in a significant drop in performance. Simple training on the non-robust dataset looks like training on noise but performs well on original natural images.

In comparison, the second work argues that the above-seen tradeoff between standard accuracy and adversarial robustness is inherent to learning, showing that it provably occurs when non-robust features are inevitable (robust classification is not possible, say, due to label noise etc). This explains the drop in standard accuracy in the above case or when performing adversarial training in practice.

They have two interesting implications: (i) They argue for pessimism about robustness naturally emerging as a consequence of standard training, as robust classifiers use considerably different (worse-performing but robust) features than the (better-performing but non-robust) features used by standard classifiers (ii) The above two works show to some degree that these robust features might align far better with human intentions. I currently find that this hypothesis has exciting implications. It's likely that some version of it will be very practically applicable.

Why is this important? Connecting this to my intuitions about (PF-A), I think pretraining might learn similar non-robust but more-generally-useful patterns. To theorize the power of these patterns (i) The effect is significant: they enable pretrained models to dramatically outperform training-from-scratch in downstream tasks (ii) The effect nearly matches possibly-robust patterns: they enable pretrained models to perform comparably to pretraining on natural-images which can learn robust patterns.

(AR2) Evidence: A Fourier Perspective on Model Robustness in Computer Vision (NeurIPS ‘19), Simple Black-box Adversarial Attacks (ICML ‘19)
Effect & Scalability: Seems real. Seems robust to scaling (medium confidence)

My Interpretation: A well-accepted principle in the distributional robustness literature is that models lack robustness to distribution shift because they latch onto superficial correlations in the data. The first work illustrates one such correlation: high-frequency features. These are predictive in-distribution: deep models can classify images containing only high-frequency information (not understandable by humans) well. But (B2) kicks in, and they lose a lot of predictive power off-distribution. We can also leverage this to create adversarial examples by simply sampling different non-predictive high-frequency noise patterns.

They further show that alleviating these shortcomings is tricky. The first work shows that when a model is adversarially trained against norm-bounded (L_p) perturbations, it becomes robust to high-frequency noise but vulnerable to low-frequency corruptions, say foggy weather. The second work indicates that deep networks have many adversarial samples around the decision boundaries, so even random search succeeds very quickly. They show this by exploiting high dimensionality to build black-box adversarial attacks simply by repeatedly sampling mutually orthonormal vectors and either adding or subtracting each one from the target image. The proposed method can be used very effectively for both untargeted and targeted attacks, showing that attacking a classifier might be an easy search problem and one that is difficult to defend against.
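To make the search procedure concrete, here is a minimal sketch of that kind of greedy black-box attack. It is not the paper's implementation: the logistic "model", the weights, the step size `eps` and the function names are all hypothetical, and the standard basis vectors stand in for the sampled orthonormal directions. The attacker only ever queries the model's output probability.

```python
import math
import random

def prob_true_class(weights, bias, x):
    """Toy stand-in for a black-box model: returns P(true class | x)."""
    logit = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def simple_blackbox_attack(query, x, eps=0.2, seed=0):
    """Greedy search along standard basis directions (mutually orthonormal).

    `query` is the only access to the model: it maps an input to the
    probability of the true class. No gradients are used.
    """
    rng = random.Random(seed)
    x_adv = list(x)
    order = list(range(len(x)))
    rng.shuffle(order)            # sample directions without replacement
    best = query(x_adv)
    for i in order:
        if best < 0.5:            # label flipped: untargeted attack succeeded
            break
        for sign in (+1.0, -1.0):
            candidate = list(x_adv)
            candidate[i] += sign * eps
            p = query(candidate)
            if p < best:          # keep the step only if confidence drops
                x_adv, best = candidate, p
                break
    return x_adv, best

# Hypothetical toy setup: 49 input dimensions, input aligned with the weights.
weights = [((i % 7) - 3) / 3 for i in range(49)]
x = [0.1 * w for w in weights]
query = lambda v: prob_true_class(weights, 0.0, v)
x_adv, p_adv = simple_blackbox_attack(query, x)
```

Each coordinate moves by at most `eps`, yet in this toy setup the accumulated small steps are enough to push the confidence in the true class below 0.5, which is the sense in which high dimensionality makes the search easy.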

Additional interesting works: Detecting Adversarial Examples Is (Nearly) As Hard As Classifying Them (ICML-W ‘21)

(AR3) Evidence: Adversarial Robustness May Be at Odds With Simplicity (Arxiv), Adversarially Robust Generalization Requires More Data (NeurIPS '18), A Universal Law of Robustness via Isoperimetry (NeurIPS ‘21) and Intriguing Properties of Adversarial Training at Scale (ICLR ‘20) 
Effect & Scalability: Seems real. Seems robust to scaling (medium-high confidence)
My Interpretation: Works (a&b) argue that the complexity of adversarially robust generalization is much higher than standard generalization.

Work (a) gives a model-centric perspective, arguing that robust classification may require more complex classifiers (i.e. more capacity, sometimes exponentially more) than standard classification. In contrast, work (b) gives a dataset perspective, providing evidence that the sample complexity of adversarially robust generalization is much higher than that of standard generalization.

Work (c) conjectures a tight lower bound for a property (Lipschitzness), which might be very helpful for producing robust models. It says that obtaining smoothly interpolating (robust) models would require several orders of magnitude more parameters, with the factor set by the intrinsic dimension of the problem.

Finally, work (d) tests this in practice at scale and finds that adversarial training requires much larger/deeper networks to achieve high adversarial robustness. We know accuracy is marginally improved by adding more layers beyond ResNet101. However, there is a substantial, nearly linear and consistent gain with adversarial training pushing the network capacity to a far larger scale, i.e., ResNet-638.

Overall, it provides a compelling picture, with some interesting implications:
(a) Adversarial training is an indispensable tool, not approximated by simply scaling up standard learning. 
(b) The fact that adversarially learnt feature representations have different complexity than those from standard learning gives credence to them being somewhat fundamentally different features. 
(c) With tasks like language modelling likely having a higher intrinsic dimension, obtaining robustly generalizing models would raise the bar of minimal parameterization several magnitudes higher. In other words, the interpolation regime in a deep double descent story may also shift significantly rightward to meet this criterion. This may have significant implications for forecasts made by the community.

Application: Security Vulnerabilities in Large AI Systems

Let’s recap Scasper on the Achilles Heel Hypothesis:

More precisely, I define an Achilles Heel as a delusion which is impairing (results in irrational choices in adversarial situations), subtle (doesn’t result in irrational choices in normal situations), implantable (able to be introduced) and stable (remaining in a system reliably over time).

Literature on backdoor attacks on ML (Arxiv) provides a comprehensive survey of a class of adversarial attacks which, disturbingly, seems to fit the above criteria perfectly.

Poisoning Attack: When a system that otherwise works well (subtle) is trained on a contaminated training dataset (implantable), it can be made to fail predictably and reliably when presented with the trigger-object present in test data at any time (impairing & stable).
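As a deliberately tiny illustration of the four properties, here is a sketch that uses a 1-nearest-neighbour classifier as a stand-in for a trained model. Everything in it (the feature layout, the data points, the labels and the trigger value) is a hypothetical toy, not taken from the surveyed attacks.

```python
def nearest_label(train, point):
    """1-nearest-neighbour prediction using squared Euclidean distance."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(train, key=lambda example: dist2(example[0], point))[1]

# Features are (x1, x2, trigger); the trigger is 0.0 in all clean data.
clean = [((0.0, 0.0, 0.0), "cat"), ((0.2, 0.1, 0.0), "cat"),
         ((1.0, 1.0, 0.0), "dog"), ((0.9, 1.1, 0.0), "dog")]

# Implantable: the attacker adds one cat-like point carrying the trigger,
# but labelled "dog".
poison = [((0.1, 0.1, 5.0), "dog")]
train = clean + poison

clean_input = (0.1, 0.0, 0.0)
triggered_input = (0.1, 0.0, 5.0)   # same input with the trigger overlaid

clean_pred = nearest_label(train, clean_input)          # subtle: unaffected
triggered_pred = nearest_label(train, triggered_input)  # impairing: flipped
```

The poisoned model behaves normally on the clean input and switches label only when the trigger is present; a model trained on `clean` alone gives the same label with or without the trigger.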

Why is this important? Recently, Poisoning and Backdooring Contrastive Learning (ICLR ‘22) illustrated that large multimodal models like CLIP are vulnerable because they are trained on unreliable data obtained via web-scraping. The authors show they can plant a backdoor in a CLIP model by introducing a poisoned set of just 0.005% of the original dataset size (~150 images). This causes CLIP to predictably classify any input with the overlaid backdoor as a particular class at test time. Furthermore, poisoning is hard to address:

(a) Checking large datasets manually isn’t feasible. It’s hard to detect adversarial inputs without knowing the specific threat model

(b) Backdoors try to produce deceptively predictive features, resulting in a handicap in competitiveness for models which do not leverage them.

Similar works have emerged across domains: Trojaning Language Models for Fun and Profit (S&P’ 21), Backdoor Attacks on Self-Supervised Learning (Arxiv)

This has the potential to create a variety of probably hard-to-defend-against worst-case risk scenarios by developing backdoors such that the <given> scenario is triggered when shown the poison at test-time. Alternatively, I think intelligent agents could leverage this attack and possibly inject backdoors into themselves, to serve as triggers for defection and similar unintended behaviour. It might be hard to check since, without the trigger, AI models will normally have completely innocuous behaviour, with reasoning-behind-behaviour holding up to high standards. This is akin to Light in Death Note voluntarily giving up all his memories of the Death Note, appearing really innocent to scrutiny, with contact with a piece of the note being a trigger for malicious behaviour.

I wonder if this evidence is enough to label this a potential warning shot for securing advanced AI systems and warrants the attention of this community.


Evidence Set

Compiling the evidence above, I get a condensed Evidence set as follows:

B-B) Dataset-Reality Mismatch Bias: Datasets will likely be under-specifications of the real-world’s full complexity, causing biases in training and testing.

PF-A) Biases in Pretrained Representations: Representations learnt by large pretrained models might be substantially reliant on low/surface-level information, which is predictive in the average case scenario (good accuracy) but brittle in worst-case scenarios (easy to construct counterexamples).

PF-B) Biases in some transparency tools: Probing and functionally-similar transparency tools such as On the Pitfalls of Analyzing Individual Neurons in Language Models (ICLR ‘22) might measure correlation in especially creative ways, resulting in high probing performance by various mechanisms such as:

  • Probing classifier memorizing the task vs probe being too weak to extract information
  • A property that is not present can be detected by correlation to other properties (analogy: zip-code in race controls)
  • The property being present might not indicate that it was causally relevant to producing the output

..and more.

PF-C) Catastrophic Forgetting: Deep networks lose all predictive power on previous tasks (e.g. pretraining) once trained on a new task (e.g. finetuning)

G-A) Good-Generalization Bias: Overparameterized deep models can fit any arbitrary mapping in the dataset, indicating a large set of bad-generalizing models, yet typical training with random initialization results in selecting good-generalizing models.

G-B) Not-Always Good-Generalization: By tweaking the training procedure, one can induce selection of the bad-generalizing models from the aforementioned large hypothesis space that perfectly fit training data.

G-C) Memorization while Generalization: Deep networks can memorize and generalize simultaneously with parameter-reuse (and not, as sometimes imagined, with disjoint subnetworks for memorization and generalization).

G-D) Deep Double Descent: Increasing overfitting (by parameterization and training length) beyond a threshold counter-intuitively leads to better generalizing models, albeit with decreasing marginal returns.

G-E) Random Tickets: Randomly selected subnetworks (if given layerwise sparsity) achieve performance equivalent to the whole model, i.e. in the interpolation regime, expressive power is gained from increased connectivity and not parameterization, or, alternatively, deep models are unnecessarily overparameterized.

G-F) Simplicity Prior Bias: SGD might have little-to-no inductive biases, with good generalization attributable to the prior being biased towards simple functions (from architecture & initialization).

G-G) Sudden Shifts in Learning: Deep networks might make large changes in decision-making strategy abruptly and possibly unpredictably.

AR-A) Feature Divergence: More-robust classifiers use considerably different, worse-performing and more human-aligned features compared to less-robust classifiers

AR-B) Sufficiency of Brittleness for Off-Distribution Misalignment: Deep models which use brittle, predictive features lack robustness to distribution shift because they latch onto superficial correlations in the data, worsened by high-dimensionality of input.

AR-C) Robustness adds Complexity: Achieving higher robustness requires significantly more complex classifiers (complex read as higher capacity) than simply good generalization.


Mistakes

(B1) Gradual Change Bias (Ref: Discussion with Evan)
Effect & Scalability: Unknown, unknown effect of scaling (Added as seemed intuitive)
My Interpretation: We perform SGD updates on parameters while training a model. The claim is that the decision boundary does not change dramatically after an update. The safety implication is that we need not worry about an advanced AI system manoeuvring from one strategy to a completely different kind after an SGD update. Update: Probably not true, refer to gwern's comment and replies for elaboration.


  1. Also, the scaling hypothesis seems more in the GPT rather than superintelligence category, it’s unclear to me how tools will achieve agency, and I’m not yet convinced/appreciate this. ↩︎

  2. Note that for this post, I shall caveat the transparency tools from feature-importance and interpreting-representations among the large area of transparency research and probes to be structural probes from the large area of interpretable NLP. ↩︎

  3. Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually) (EMNLP ‘20) is concrete counter-evidence against scaling, but its evaluation might suffer from the same biases provided in this section. ↩︎

  4. An alternate explanation is in Towards Theoretical Understanding of Deep Learning. It argues that larger models enable better ease-of-optimization for SGD, choosing better gradient trajectories and hence finding better solutions. ↩︎

  5. Randomly connected subnetworks account for most subnetworks, hence a reluctance to call it a ‘lottery ticket’, which assumes such tickets are rare and need to be found. ↩︎

  6. Note that it’s less interesting but true (albeit non-vacuously) for any overparameterized function. ↩︎

  7. Random sampling with rejection is an astoundingly slow optimizer. It’s approximated using random sampling in a Gaussian Process. ↩︎

  8. Obtained by mislabeling different-portions of data (properties characterized in Distributional Generalization: A New Kind of Generalization (Arxiv)) ↩︎

  9. The procedure nudges the model to output widespread conspiracy theories by crafting prompts to help imitate human falsehoods. ↩︎

  10. Other claims are also interesting, discussed at length ahead in Evidence AR1. ↩︎

  11. Imperceptibleness is important as these patterns are likely unseeable and significantly hard to discover. Invisibilities trickle down, systems increasingly seem black-box (access & knowledge-wise) to downstream users. ↩︎

gwern:

Why do you have high confidence that catastrophic forgetting is immune to scaling, given "Effect of scale on catastrophic forgetting in neural networks", Anonymous 2021?


My Interpretation: We perform SGD updates on parameters while training a model. The claim is that the decision boundary does not change dramatically after an update. The safety implication is that we need not worry about an advanced AI system manoeuvring from one strategy to a completely different kind after an update SGD.

I also disagree with B1, gradual change bias. This seems to me to be either obviously wrong, or limited to safety/capability-irrelevant definitions like "change in KL of overall distribution of output from small vanilla supervised-learning NNs", and definitely does not generalize to even small amounts of confidence in assertions like "SGD updates based on small n cannot meaningfully change a very powerful NN's behavior in a real-world-affecting way".

First, it is pervasive in physics and the world in general that small effects can have large outcomes. No snowflake thinks it's responsible for the avalanche, but one was. Physics models often have chaotic regimes, and models like the Ising spin glass model (one of the most popular theoretical models for neural net analysis for several decades) are notorious for how tiny changes can cause total phase shifts and regime changes. NNs themselves are often analyzed as being on the 'edge of chaos' in various ways (the exploding gradient problem, except actual explosions). Breezy throwaway claims about 'oh, NNs are just smooth and don't change much on one update' are so much hot air against this universal prior.

  • As an example, consider the notorious fragility of NNs to (hyper)parameters: one set will fail completely, wildly diverging. Another, similar set, will suddenly work. Edge of chaos. Or consider the spikiness of model capabilities over scaling and how certain behaviors seem to abruptly emerge at certain points (eg 'grokking', or just the sudden phase transitions on benchmarks where small models are flatlined but then at a threshold, the model suddenly 'gets it'). This parallels similar abrupt transitions in human psychology, like Piagetian levels. In grokking, the breakthrough appears to happen almost instantaneously; for humans and large model capabilities, we lack detailed benchmarking which would let us say that "oh, GPT-3 understands logic at exactly iteration #501,333", but it should definitely make you think about assumptions like "it must take countless SGD iterations for each of these capabilities to emerge." (This all makes sense if you think of large NNs as searching over complexity-penalized ensembles of programs, and at some point switching from memorization-intensive programs to the true generalizing program; see my scaling hypothesis writeup & jcannell's writings.)

Second, the pervasive existence of adversarial examples should lead to extreme doubt on any claims of NN 'smoothness'. Absolutely and perceptually tiny shifts in image inputs lead to almost arbitrarily large wacky changes in output distribution. These may be unnatural, but they exist. If tweaking a pixel here and there by a quantum can totally change the output from 'cat' to 'dog', why can't an SGD update, which can change every single parameter, have similar effects? (Indeed, you link the isoperimetry paper, which claims that this is logically necessary for all NNs which are too small for their problem, where for ImageNet I believe they ballpark it as "even the largest contemporary ImageNet models are still 2 OOMs too small to begin to be robust".)

Third, small changes in outputs can mean large changes in behavior. Actions and choices are inherently discrete where small changes in the latent beliefs can have arbitrarily large behavior changes (which is why we are always trying to move away from actions/choices towards smoother easier relaxations where we can pretend everything is small and continuous). Imagine a NN estimating Q-values for taking 2 actions, "Exterminate all humans" and "usher in the New Jerusalem", which normalize to 0.4999999 and 0.5000001 respectively. It chooses the argmax, and acts friendly. Do you believe that there is no SGD update which after adjusting all of the parameters, might reverse the ranking? Why, exactly?
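The discreteness point can be checked in a few lines. This is only an illustration: the Q-values are the ones from the paragraph above, while the size of the shift is a hypothetical stand-in for the effect of one update.

```python
def choose(q_values):
    """Pick the action with the highest Q-value (argmax)."""
    return max(q_values, key=q_values.get)

before = {"Exterminate all humans": 0.4999999,
          "usher in the New Jerusalem": 0.5000001}

# Hypothetical effect of a single update: each estimate moves by 3e-7.
shift = 3e-7
after = {"Exterminate all humans": before["Exterminate all humans"] + shift,
         "usher in the New Jerusalem": before["usher in the New Jerusalem"] - shift}

action_before = choose(before)
action_after = choose(after)
```

The estimates move by less than one part in a million, yet the argmax, and hence the chosen action, is reversed.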

Fourth, NNs are often designed to have large changes based on small changes, particularly in meta-learning or meta-reinforcement-learning. In prompt programming or other few-shot scenarios, we radically modify the behavior of a model with potentially trillions of parameters by merely typing in a few words. In neural meta-backdoors/data poisoning, there are extreme shifts in output for specific prespecified inputs (sort of the inverse of adversarial examples), which are concerning in part because that's where a gradient-hacker could store arbitrary behavior (so a backdoored NN is a little like Light in Death Note: he has forgotten everything & acts perfectly innocent... until he touches the right piece of paper and 'wakes up'). In cases like MAML, the second-order training is literally designed to create a NN at a saddle point where a very small first-order update will produce very different behavior for each potential new problem; like a large boulder perched at the tip of a mountain which will roll off in any direction at the slightest nudge. In meta-reinforcement-learning, like RNNs being trained to solve a t-maze which periodically flips the reward function, the very definition of success is rapid total changes in behavior based on few observations, implying the RNN has found a point where the different attractors are balanced and the observation history can push it towards the right one easily. These NNs are possible, they do exist, and given the implicit meta-learning we see emerge in larger models, they may become increasingly common and exist in places the user does not expect.

So, I see loads of reasons we should worry about an advanced AI system maneuvering from one strategy to another after a single update, both in general priors and based on what we observe of past NNs, and good reason to believe that scaling greatly increases the dangers there. (Indeed, the update need not even be of the 'parameters', updates to the hidden state / internal activations are quite sufficient.)

Hi! Thanks for reading the post carefully and coming up with interesting evidence and arguments against~ I think I can explain PF4, but am certainly wrong on B1.

PF4

Why do you have high confidence that catastrophic forgetting is immune to scaling, given "Effect of scale on catastrophic forgetting in neural networks", Anonymous 2021?

Catastrophic forgetting (mechanism): We train a model to minimize loss on dataset X. Then we train it to minimize loss on dataset Y. When minimizing loss on dataset Y, it has no incentive to care about loss on dataset X. Hence, catastrophic forgetting. This effect seems not-very-solvable by simply scaling models.
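A minimal sketch of this mechanism, under toy assumptions: the "model" is a single weight w in y = w * x, and the two datasets are hypothetical tasks with conflicting slopes, so this says nothing about how the effect behaves at scale.

```python
def mse(w, data):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def train(w, data, lr=0.1, steps=200):
    """Plain gradient descent on the MSE of `data` only."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

inputs = (-1.0, -0.5, 0.5, 1.0)
dataset_x = [(x, 2.0 * x) for x in inputs]    # task X: slope 2
dataset_y = [(x, -1.0 * x) for x in inputs]   # task Y: slope -1

w_after_x = train(0.0, dataset_x)         # fits X
w_after_y = train(w_after_x, dataset_y)   # then fits Y, with no term for X

loss_x_before = mse(w_after_x, dataset_x)
loss_x_after = mse(w_after_y, dataset_x)
```

After the second phase the loss on dataset Y is near zero while the loss on dataset X has gone from near zero to large, because nothing in the second objective refers to X.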

Re: linked paper: I did not know about it while writing the post. On a very preliminary skim, I don't think their modeling paradigm, multi-head, is practically even relevant. (Let me know if my skim was a misread):

Why? A simple baseline: After training on every task, save the model! At inference time, you're given which task the example is from, use that model for inference. No forgetting! 

Strong Catastrophic Forgetting: Let's say we've trained a model on Imagenet10k. New data comes, and we have ~4k classes arriving, for an unknown #timesteps (cannot assume this is known). A realistic lifelong learning model case would be >hundreds of timesteps. The question is how do we learn this new information, given time constraints as we need the new model deployed (translating to compute given fixed resources)? 

Here: (a) I'm skeptical that scaling will adequately address this (the degree of drop >> difference between scaled models) (b) The compute constraint is what explicitly works against larger models. But I feel the catastrophic forgetting dynamics here would be very different from those in the presented work.
E.g., we cannot use the above baseline and train different models for this, as we will have the hard task of identifying which model to use (near-best case: Supermasks in Superposition). 

Weak Catastrophic Forgetting: I think the 'cannot store data' assumption is weird given storage is virtually free (compared to compute required to train a good model on that kind of data). So it's the same problem-- maintain better performance with a time constraint with full access to past data (here the problem itself is weaker but the compute constraints will be far tighter).

This is roughly my intuition. What do you think?

B1: Overall, I've updated to state what I've written in B1 is not true. 

Re: First -- What I shall do is after we've discussed, I shall relegate it to not-true section at the end of the post (so it's still visible) and add grokking as a surprising bias (intuition also explained here) in Generalization properties. I found the grokking paper but it seems preliminary, can you link other such papers or good spin-glass model papers which illustrate this point? It will help me make a good claim there. Thanks for illustrating this!~

Re: Second-- Yeah, I'm assuming we train large networks and then think about this problem.

Re: Third-- While defining sudden, non-linear shifts in NNs I think somewhere before the decision (say probability distribution over actions/decisions) would be a much stronger and useful claim to make. So, a good claim would be saying that an SGD update might cause us to go from '0.9% exterminate all humans' to '51% exterminate all humans' if true (seems likely).

Re: Fourth-- I think conditional generation being different given different conditions is qualitatively different and less interesting than suddenly updating to a large degree (grokking).

The claim by Evan: The context was identifying trigger points of deception in SGD-- Using transparency tools to figure out what the model is 'thinking' and what strategy it is using, if it goes to the other side of the decision boundary and back, we can ask the model why did it do that (suspecting that this was a warning for deception starting). Now, (a) I think the boundary switch will always be small even in this case, even when viewed via a transparency tool (b) In any case, this is me over-generalizing and being obviously wrong. Correcting this will make the article stronger.

I take the point of the paper as showing that as models get larger and more overparameterized, it gets easier for them to store arbitrary capabilities without interference in part because the better representations they learn mean that there is much less to store/learn for any new task, which will share a lot of structure. At some point, worrying about 'classes' or 'heads' just becomes irrelevant as you zero-shot or few-shot it: eg CLIP doesn't really need to worry about catastrophic forgetting because you just type in the text description of what 'class' you're interested in and 'classify' that way; a MoE doesn't worry about task classification, because it learns what sub-expert to dispatch input to. You won't need to 'switch between tasks' (not even that meaningful a thing outside the constraints of a benchmark) because in-context learning & representations do all the work, latently disambiguating where one is. You will simply train large (perhaps sparse or MoE-esque) models in one-epoch fashion, streaming in data constantly and discarding it. When you have enough real-world data, you don't need or want to store it because of diminishing returns on retraining compared to grabbing a fresh datapoint from the firehose. (It's worth noting that no one in the large language model space has ever 'used up' all the text available to them in datasets like The Pile, or even done more than 1 epoch over the full dataset they used.) This is also good for users if they don't have to keep around the original dataset to sample maintenance batches from while doing more training.

This solution will make a lot of people very unhappy as they insist that "this isn't a solution, you just made a very large model, arrgghh, so inefficient and ungreen", but if it solves the problem, then it solves the problem, and now you're just haggling over the price. Are there ways more efficient? Almost certainly. Should we care that much about or bother researching them? Maybe.

I found the grokking paper but it seems preliminary, can you link other such papers or good spin-glass model papers which illustrate this point? It will help me make a good claim there. Thanks for illustrating this!~

The grokking paper is definitely preliminary. No one expected that and I'm not aware of any predictions of that (or patient teacher*) even if we can guess about a wide-basin/saddle-point interpretation.

I don't have a list of spin-glass papers because I distrust such math/theory papers and haven't found them to be helpful. There's some I host on gwern.net because they're early examples of estimating NN scaling laws but I didn't get anything out of trying to read them. (Physicists gonna physics.) What I can link you right this second is a very cool interactive 'explorable' web page of many different spin-glasses, where you can see a lot of their behavior for yourself: http://bit-player.org/2021/three-months-in-monte-carlo

* I'm not sure if this is an example. Do student models 'take off' abruptly? They look in the graphs like they might idle around for scores or thousands of epochs and then take off, but it's hard to tell from the graphs whether they are just truncated at an axis and actually show gradual consistent improvement over many many epochs.

Using transparency tools to figure out what the model is 'thinking' and what strategy it is using, if it goes to the other side of the decision boundary and back, we can ask the model why did it do that (suspecting that this was a warning for deception starting).

I'm not sure how useful transparency tools would be. They can't tell you anything about adversarial examples. Do they even diagnose neural backdoors yet? If they can't find actual extremely sharp decision boundaries around specific inputs, hard to see how they could help you understand what an arbitrary SGD update does to decision boundaries across all inputs.

There's a lot of physics-related nonlinearities/phase-transitions/powerlaw material in these workshop slides & videos, looks like: https://sites.google.com/mila.quebec/scaling-laws-workshop/schedule

This post (which is really dope) provides some grokking examples in large language models in a Big-Bench video at 19313s & 19458s, with that segment (18430s-19650s) being a nice watch! I shall spend a bit more time collecting and precisely identifying evidence and then include it in the grokking part of this post. This was a really nice thing to know about and very surprising.

I've commented on that, but I'm not convinced that the phase transitions in learning are grokking, per se. There are many different scaling phenomena, and we shouldn't go around prematurely conflating them.

When you have enough real-world data, you don't need or want to store it because of diminishing returns on retraining compared to grabbing a fresh datapoint from the firehose. (It's worth noting that no one in the large language model space has ever 'used up' all the text available to them in datasets like The Pile, or even done more than 1 epoch over the full dataset they used.) This is also good for users if they don't have to keep around the original dataset to sample maintenance batches from while doing more training.

This would be the main crux, actually a tremendously important crux. I take this means that models largely would be very far off from an overparameterized regime compared to the data? I expect operating in an overparameterized regime to give a lot more capabilities and currently considered overfitting to the dataset as almost a need, whereas you seem to indicate this is an unreasonable assumption to make? 

If so, erm, not just catastrophic forgetting, but a lot of the stuff I've seen people on the AI Alignment Forum base their intuitions on could potentially be thrown in the bin. E.g., I'm more confident in catastrophic forgetting having its effect when the model is overfitted on the past data. If one cannot even properly learn past data but only frequently occurring patterns from it, those patterns might recur too often to be forgotten. But then, deep networks could do a lot better performance-wise by overfitting the dataset and exhaustively trying to remember the less-frequent patterns as well.

..it gets easier for them to store arbitrary capabilities without interference in part because the better representations they learn mean that there is much less to store/learn for any new task, which will share a lot of structure.

Here, the problem of catastrophic forgetting would not be on downstream learning tasks, it would be on updating this learnt representation to newer tasks. 

The grokking paper is definitely preliminary. No one expected that and I'm not aware of any predictions of that (or patient teacher*) even if we can guess about a wide-basin/saddle-point interpretation. 
I don't have a list of spin-glass papers because I distrust such math/theory papers and haven't found them to be helpful.

Very fair, cool. Thanks, those five were nice illustrations, although I'll need some time to digest the nature of non-linear dynamics. I've bookmarked it for an interesting trip someday.

I'm not sure how useful transparency tools would be. They can't tell you anything about adversarial examples. Do they even diagnose neural backdoors yet? If they can't find actual extremely sharp decision boundaries around specific inputs, hard to see how they could help you understand what an arbitrary SGD update does to decision boundaries across all inputs.

In this case, I deferred it as I don't understand what's really going on in transparency work. 

But more generally speaking: ditto, I sort of believe this to a large degree. I was trying to highlight this point in Section 'Application: Transparency'. I notice I'm significantly more pessimistic than the median person on the AI Alignment Forum, so there are some cruxes which I cannot put my finger on. Could you elaborate a bit more on your thoughts?

A-priori thinking about inductive biases of super-intelligent agents looks very unintuitive to me as I cannot currently understand how ML literature could meaningfully inform inductive biases of intelligent agents which could tweak themselves[1].

  1. Also, the scaling hypothesis seems more in the GPT rather than superintelligence category, it’s unclear to me how tools will achieve agency, and I’m not yet convinced/appreciate this. ↩︎

I read that first sentence several times and it's still not clear what you mean, or how the footnote helps clarify. What do you mean by 'tweak'? A tweak is a small incremental change. DL is about training networks with some flavour of SGD/backprop, which approximates Bayesian updates, and is all about many small 'tweaks'. So when you say "agents which could tweak themselves" at first glance you just seem to be saying "agents that can learn at all", but that doesn't seem to fit.

Your section on adversarial examples will not hold up well - that is a bet I am fairly confident on.

Adversarial examples are an artifact of the particular historical trajectory that DL took on GPUs, where there is no performance advantage to sparsity. Adversarial attacks exploit the overfit, noisy internal representations that nearly all DL systems learn, as they almost never regularize internal activations, and sparse weight regularization is still a luxury rather than the default, and certainly isn't tuned for adversarial defense. Proper sparse regularized internal weights and activations - which compress and thus filter out noise - can provide the same level of defense against adversarial perturbations that biological cortical vision/sensing provides.

I know this based on my own internal theory and experiments rather than a specific paper, but just a quick search of the literature reveals theoretical and experimental support 1, 2, 3, 4

(all of those were found in just a few minutes while writing this comment)

The reason this isn't more widely known/used is twofold: 1.) there isn't much economic motivation - few are actually currently concerned with adversarial attacks outside theoretical curiosity and DL critics 2.) sparsity regularization (over activations especially) is a rather expensive luxury on current GPU software/hardware
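For concreteness, here is a minimal sketch of what sparsity regularization over weights and activations could look like in a training loss. It assumes an L1 penalty, which is only one common way to encourage sparsity and is not necessarily what the comment above has in mind. The function names and the coefficients `lam_act` and `lam_w` are hypothetical.

```python
def relu(x):
    return max(0.0, x)

def hidden_activations(weights, x):
    """One ReLU layer: h_i = relu(sum_j W[i][j] * x[j])."""
    return [relu(sum(w * xj for w, xj in zip(row, x))) for row in weights]

def l1_penalty(values):
    """Sum of absolute values; pushes many entries towards exactly zero."""
    return sum(abs(v) for v in values)

def regularized_loss(task_loss, activations, weights, lam_act=1e-3, lam_w=1e-4):
    """Task loss plus L1 penalties on activations and weights.

    lam_act and lam_w are illustrative defaults (assumptions), not tuned values.
    """
    flat_weights = [w for row in weights for w in row]
    return (task_loss
            + lam_act * l1_penalty(activations)
            + lam_w * l1_penalty(flat_weights))
```

The activation term has to be recomputed for every input, which is one way to see why regularizing activations costs more than regularizing weights alone.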

Hi! Thanks for reading and interesting questions:

I read that first sentence several times and it's still not clear what you mean, or how the footnote helps clarify. What do you mean by 'tweak'? A tweak is a small incremental change.

That's correct. What I meant is: say we state that an agent has 'x, y, z biases'; it can then try to correct them. Now, the changes cannot be arbitrary; the constraints are that it has to remain competitive and robust. But I think it can reduce the strength of a heuristic by going against it whenever it can, to the extent that those heuristics would have little usefulness. But it's likely that I have a wrong and weird conception of superintelligence here.

Your section on adversarial examples will not hold up well - that is a bet I am fairly confident on.

Huh. I would find that very surprising; it should hold up under sparsity. Let me clarify my reasoning, and then could you say what you think I might be missing, or why you still think so?

Proper sparse regularized internal weights and activations - which compress and thus filter out noise - can provide the same level of defense against adversarial perturbations that biological cortical vision/sensing provides.

Notice that this noise is average case, whereas adversarial examples are worst case. This difference might be doing a lot of heavy lifting here. Conventional deep networks have really nice noise stability properties, as in, they are able to filter out injected noise to a good extent, as illustrated in Stronger generalization bounds for deep nets via a compression approach (ICML '18). In the worst case, even with 3D vision, narrow focus and other biases/limitations of human vision give rise to a wide variety of adversarial examples. Some examples: the 'the the' reading problem, not noticing large objects crossing a video, and falling prey to a good variety of illusions. I'm not sure human vision is a good example of a robust sensing pipeline.
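To make the average-case versus worst-case gap concrete, here is a small sketch under strong simplifying assumptions: a hypothetical linear scorer f(x) = w · x stands in for a network, and perturbations are bounded in L2 norm by `eps`. For such a scorer the output shift is exactly w · δ, so a random perturbation and the worst-case one can be compared directly. None of the names below come from the papers cited above.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def random_direction(dim, rng):
    """Unit vector drawn uniformly from the sphere (a normalized Gaussian)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    n = norm(v)
    return [x / n for x in v]

def compare_perturbations(dim=1000, eps=1.0, trials=200, seed=0):
    """Return (average random shift, worst-case shift) of f(x) = w . x."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # hypothetical scorer weights
    w_norm = norm(w)

    # Average case: random perturbations with L2 norm eps.
    shifts = [abs(eps * dot(w, random_direction(dim, rng))) for _ in range(trials)]
    avg_shift = sum(shifts) / trials

    # Worst case: the same norm budget, aligned with w (shift = eps * ||w||).
    worst_delta = [eps * x / w_norm for x in w]
    worst_shift = abs(dot(w, worst_delta))
    return avg_shift, worst_shift
```

In high dimensions a random direction is nearly orthogonal to w, so the average shift is a small fraction of the worst-case shift for the same norm budget. That is the sense in which filtering out injected noise says little about adversarial robustness.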

I know this based on my own internal theory and experiments rather than a specific paper, but just a quick search of the literature reveals theoretical and experimental support 1, 2, 3, 4

Err, I find that citing literature is often insufficient, especially in ML, to meet a reasonable bar of support (a surprising number of papers accepted at top conferences are unfortunately lame). I usually have to carefully read and analyze them.

For these papers, after quickly reading some of them, my comment is as follows. Notice that often: (a) they do not test with adaptive attacks, (b) the degree of robustness they provide against weaker attacks is minimal, and (c) any simple defense like Gaussian smoothing will do a lot better. Hence, they provide little support for claims about robustness.

For comparison, good empirical results in compression-gives-robustness look like: Robustness via Deep Low-Rank Representations (arXiv), although even this is insufficient as a good defense, since adaptive attacks are likely to break it. Reference for adaptive attacks: On Adaptive Attacks to Adversarial Example Defenses (NeurIPS '20), one of my favourite works in this area.

Now, let me put forward the case against: our current understanding of sparsity (works in G5, AR3) is that sparsity allows us to reduce parameterization, but only to a certain extent. Effects in AR3 suggest we probably need more complex models, not simpler ones (with sparsity/regularization), for robustness, i.e. the direction you think we should go in seems to be the opposite of what the literature suggests (and this is indeed counter-intuitive!).

I think the reason people (including me) have been pessimistic about this direction and switched to doing research in other things is that it doesn't seem to give many benefits except a certain reduction in memory/parameterization at the extra cost of code modifications. 

Adversarial examples are an artifact of the particular historical trajectory that DL took on GPUs where there is no performance advantage to sparsity ... sparsity regularization (over activations especially) is a rather expensive luxury on current GPU software/hardware

I don't think this is true, at least not to a large degree. GPUs do take a lot of advantage of sparse patterns, or maybe I have a lower bar for 'sparsity works!' than you do. PyTorch reduces memory use and speeds up computation by a huge amount if you use sparse tensors! If you have structured sparsity (block-sparse structure, pointwise convolutions), it's even better, and there are some very fast CUDA kernels to leverage that.
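As a rough illustration of where those savings come from, here is a pure-Python sketch of coordinate-format (COO) sparse storage and a matrix-vector product. It is an assumption-laden toy, not PyTorch's implementation or any real kernel; `to_coo`, `coo_matvec` and `dense_matvec` are hypothetical helper names.

```python
def to_coo(dense):
    """Keep only the nonzero entries as (row, col, value) triples."""
    return [(i, j, v)
            for i, row in enumerate(dense)
            for j, v in enumerate(row)
            if v != 0.0]

def coo_matvec(entries, x, n_rows):
    """Multiply using only stored nonzeros: work scales with len(entries)."""
    y = [0.0] * n_rows
    for i, j, v in entries:
        y[i] += v * x[j]
    return y

def dense_matvec(dense, x):
    """Reference dense product: work scales with rows * cols."""
    return [sum(v * xj for v, xj in zip(row, x)) for row in dense]
```

Both storage and arithmetic scale with the number of nonzeros rather than with the full matrix size. Whether that turns into a wall-clock win on a GPU depends on the sparsity pattern and the available kernels, which is why structured sparsity tends to fare better.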

Sparsity has limited upside, though, as it does not contribute interesting/helpful inductive biases. It's fairly common to sparsify and quantize deep networks in the deployment phase, although often the non-sparse CUDA kernels work fine as they're insanely optimized.

Referencing recent papers sent my way here (this shall be a live, expanding comment), please do link more if you think they might be useful:
- Inductive biases in theory-based reinforcement learning