Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.


    I've decided to start a weekly roundup of papers that seem relevant to alignment, focusing on papers or approaches that might be new to safety researchers. Unlike the Alignment Newsletter, I'll be spending relatively little effort on summarizing the papers. I'll just link them, copy their abstracts, and potentially describe some of my thoughts on how the paper relates to alignment. Hopefully, this will let me keep to a weekly schedule.

    The purpose of this series isn't so much to share insights directly with the reader, but instead to make them aware of already existing research that may be relevant to the reader's own research.


    Locating and Editing Factual Associations in GPT

    We analyze the storage and recall of factual associations in autoregressive transformer language models, finding evidence that these associations correspond to localized, directly-editable computations. We first develop a causal intervention for identifying neuron activations that are decisive in a model's factual predictions. This reveals a distinct set of steps in middle-layer feed-forward modules that mediate factual predictions while processing subject tokens. To test our hypothesis that these computations correspond to factual association recall, we modify feed-forward weights to update specific factual associations using Rank-One Model Editing (ROME). We find that ROME is effective on a standard zero-shot relation extraction (zsRE) model-editing task, comparable to existing methods. To perform a more sensitive evaluation, we also evaluate ROME on a new dataset of counterfactual assertions, on which it simultaneously maintains both specificity and generalization, whereas other methods sacrifice one or another. Our results confirm an important role for mid-layer feed-forward modules in storing factual associations and suggest that direct manipulation of computational mechanisms may be a feasible approach for model editing. The code, dataset, visualizations, and an interactive demo notebook are available at this https URL

    My opinion:

    Most people I talk to about this paper have heard of it previously, so it's hardly "new". However, I think a lot of people underestimate how significant the paper is. The authors use a very cool interpretability method to show that the middle-stage MLP layers are acting as a key-value memory system. They then guess at the specific mathematical structure these MLP layers use to store information, derive a closed-form, analytic solution for editing the model's knowledge stores, and use very thorough evaluations to show that their knowledge editing method is effective and that the edits influence the model's outputs in many different contexts where the new knowledge is relevant. This paper is vastly beyond just "poke random stuff and see that the output changes". Code can be found here.
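
    To make the "closed-form, analytic solution" concrete, here is a minimal NumPy sketch of a rank-one edit to a linear key-value memory W, in the spirit of ROME. It assumes the update W' = W + Λ (C⁻¹ k*)ᵀ with Λ = (v* − W k*) / ((C⁻¹ k*)ᵀ k*), where C is an uncentered covariance of previously seen keys. The function name, toy dimensions, and random data are illustrative assumptions, not the authors' code.

```python
import numpy as np


def rank_one_edit(W, C, k_star, v_star):
    """Return W' = W + Lambda (C^{-1} k*)^T, which satisfies W' @ k_star == v_star."""
    # Direction in key space, whitened by the key covariance C.
    u = np.linalg.solve(C, k_star)
    # Residual of the new key-value pair under the old memory, rescaled.
    lam = (v_star - W @ k_star) / (u @ k_star)
    return W + np.outer(lam, u)


rng = np.random.default_rng(0)
d_key, d_val, n_keys = 8, 4, 50

K = rng.normal(size=(d_key, n_keys))  # toy stand-in for previously seen keys
W = rng.normal(size=(d_val, d_key))   # toy memory matrix
C = K @ K.T / n_keys                  # uncentered key covariance (assumed form)

k_star = rng.normal(size=d_key)       # key for the fact being edited
v_star = rng.normal(size=d_val)       # value we want that key to retrieve

W_edited = rank_one_edit(W, C, k_star, v_star)
```

    After the edit, the chosen key retrieves the new value exactly, while the change to W is confined to a single rank-one direction weighted by the key statistics.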

    Using cognitive psychology to understand GPT-3

    We study GPT-3, a recent large language model, using tools from cognitive psychology. More specifically, we assess GPT-3's decision-making, information search, deliberation, and causal reasoning abilities on a battery of canonical experiments from the literature. We find that much of GPT-3's behavior is impressive: it solves vignette-based tasks similarly or better than human subjects, is able to make decent decisions from descriptions, outperforms humans in a multi-armed bandit task, and shows signatures of model-based reinforcement learning. Yet we also find that small perturbations to vignette-based tasks can lead GPT-3 vastly astray, that it shows no signatures of directed exploration, and that it fails miserably in a causal reasoning task. These results enrich our understanding of current large language models and pave the way for future investigations using tools from cognitive psychology to study increasingly capable and opaque artificial agents.

    My opinion:

    This paper performs a systematic analysis of GPT-3's capabilities through prompting, and measures some alignment-relevant capabilities such as understanding of causal interventions and active exploration to find useful knowledge. The authors make their code available here.

    Mapping Language Models to Grounded Conceptual Spaces

    A fundamental criticism of text-only language models (LMs) is their lack of grounding---that is, the ability to tie a word for which they have learned a representation, to its actual use in the world. However, despite this limitation, large pre-trained LMs have been shown to have a remarkable grasp of the conceptual structure of language, as demonstrated by their ability to answer questions, generate fluent text, or make inferences about entities, objects, and properties that they have never physically observed. In this work we investigate the extent to which the rich conceptual structure that LMs learn indeed reflects the conceptual structure of the non-linguistic world---which is something that LMs have never observed. We do this by testing whether the LMs can learn to map an entire conceptual domain (e.g., direction or colour) onto a grounded world representation given only a small number of examples. For example, we show a model what the word "left" means using a textual depiction of a grid world, and assess how well it can generalise to related concepts, for example, the word "right", in a similar grid world. We investigate a range of generative language models of varying sizes (including GPT-2 and GPT-3), and see that although the smaller models struggle to perform this mapping, the largest model can not only learn to ground the concepts that it is explicitly taught, but appears to generalise to several instances of unseen concepts as well. Our results suggest an alternative means of building grounded language models: rather than learning grounded representations "from scratch", it is possible that large text-only models learn a sufficiently rich conceptual structure that could allow them to be grounded in a data-efficient way.

    My opinion:

    This paper investigates whether language models learn non-linguistic concepts that they can adapt in-context to navigate worlds whose rules are described to them through text. It seems relevant for understanding the degree to which language models are able to do generalizable world modeling and to understand or influence non-linguistic domains with the text they generate.

    Gradient Starvation: A Learning Proclivity in Neural Networks

    We identify and formalize a fundamental gradient descent phenomenon resulting in a learning proclivity in over-parameterized neural networks. Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task, despite the presence of other predictive features that fail to be discovered. This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks. Using tools from Dynamical Systems theory, we identify simple properties of learning dynamics during gradient descent that lead to this imbalance, and prove that such a situation can be expected given certain statistical structure in training data. Based on our proposed formalism, we develop guarantees for a novel regularization method aimed at decoupling feature learning dynamics, improving accuracy and robustness in cases hindered by gradient starvation. We illustrate our findings with simple and real-world out-of-distribution (OOD) generalization experiments.

    My opinion:

    We often wonder whether behaviors / values / alignment properties that we instill early in training (when models are weak enough to supervise) will persist later in training. I think gradient starvation could be an important part of that puzzle, since it provides a concrete mechanism for how features learned early in training could persist. It also suggests a fair degree of path-dependence in SGD trajectories, and that guiding early training could have significant effects on the downstream models. Code here.

    Related: Path dependence in ML inductive biases

    SGD on Neural Networks Learns Functions of Increasing Complexity

    We perform an experimental study of the dynamics of Stochastic Gradient Descent (SGD) in learning deep neural networks for several real and synthetic classification tasks. We show that in the initial epochs, almost all of the performance improvement of the classifier obtained by SGD can be explained by a linear classifier. More generally, we give evidence for the hypothesis that, as iterations progress, SGD learns functions of increasing complexity. This hypothesis can be helpful in explaining why SGD-learned classifiers tend to generalize well even in the over-parameterized regime. We also show that the linear classifier learned in the initial stages is "retained" throughout the execution even if training is continued to the point of zero training error, and complement this with a theoretical result in a simplified model. Key to our work is a new measure of how well one classifier explains the performance of another, based on conditional mutual information.

    My opinion:

    Given the high-path-dependence world I think we live in, it becomes quite important to understand the order in which neural nets learn features / behaviors. This paper investigates that question for image models. Code here.
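
    The abstract says its measure of how well one classifier explains another is based on conditional mutual information. As a rough illustration only, the sketch below computes a plug-in estimate of I(F; Y | L) from discrete predictions, where F is the network's prediction, L is a simpler classifier's prediction, and Y is the true label. The exact score used in the paper may differ, and the function name and toy data are assumptions.

```python
from collections import Counter
from math import log2


def conditional_mutual_information(f, y, l):
    """Plug-in estimate of I(F; Y | L) in bits from paired discrete samples."""
    n = len(f)
    c_fyl = Counter(zip(f, y, l))
    c_fl = Counter(zip(f, l))
    c_yl = Counter(zip(y, l))
    c_l = Counter(l)
    total = 0.0
    for (fi, yi, li), count in c_fyl.items():
        # p(f,y,l) * log[p(f,y,l) p(l) / (p(f,l) p(y,l))]; the 1/n factors cancel in the ratio.
        ratio = count * c_l[li] / (c_fl[(fi, li)] * c_yl[(yi, li)])
        total += (count / n) * log2(ratio)
    return total


labels = [0, 0, 1, 1, 0, 1, 1, 0]
linear_preds = [0, 0, 1, 1, 1, 0, 1, 0]  # toy simple classifier, right on 6 of 8

# A network that merely copies the simple classifier adds nothing beyond it.
copy_score = conditional_mutual_information(linear_preds, labels, linear_preds)
# A network that fits every label carries information the simple classifier lacks.
perfect_score = conditional_mutual_information(labels, labels, linear_preds)
```

    A value near zero means the simpler classifier already accounts for whatever the network knows about the labels; a larger value means the network has learned something beyond it.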

    Let's Agree to Agree: Neural Networks Share Classification Order on Real Datasets

    We report a series of robust empirical observations, demonstrating that deep Neural Networks learn the examples in both the training and test sets in a similar order. This phenomenon is observed in all the commonly used benchmarks we evaluated, including many image classification benchmarks, and one text classification benchmark. While this phenomenon is strongest for models of the same architecture, it also crosses architectural boundaries -- models of different architectures start by learning the same examples, after which the more powerful model may continue to learn additional examples. We further show that this pattern of results reflects the interplay between the way neural networks learn benchmark datasets. Thus, when fixing the architecture, we show synthetic datasets where this pattern ceases to exist. When fixing the dataset, we show that other learning paradigms may learn the data in a different order. We hypothesize that our results reflect how neural networks discover structure in natural datasets.

    My opinion:

    Another investigation into the order of feature learning, this time systematically comparing across models and architectures. We may be able to get a better handle on NN inductive biases by investigating the orders in which different architectures learn different types of data. 

    The Grammar-Learning Trajectories of Neural Language Models

    The learning trajectories of linguistic phenomena in humans provide insight into linguistic representation, beyond what can be gleaned from inspecting the behavior of an adult speaker. To apply a similar approach to analyze neural language models (NLM), it is first necessary to establish that different models are similar enough in the generalizations they make. In this paper, we show that NLMs with different initialization, architecture, and training data acquire linguistic phenomena in a similar order, despite their different end performance. These findings suggest that there is some mutual inductive bias that underlies these models' learning of linguistic phenomena. Taking inspiration from psycholinguistics, we argue that studying this inductive bias is an opportunity to study the linguistic representation implicit in NLMs. 
    Leveraging these findings, we compare the relative performance on different phenomena at varying learning stages with simpler reference models. Results suggest that NLMs exhibit consistent "developmental" stages. Moreover, we find the learning trajectory to be approximately one-dimensional: given an NLM with a certain overall performance, it is possible to predict what linguistic generalizations it has already acquired. Initial analysis of these stages presents phenomena clusters (notably morphological ones), whose performance progresses in unison, suggesting a potential link between the generalizations behind them.

    My opinion:

    I thought I should probably include a paper on feature learning order in language models, to balance out the previous three papers' focus on images. Code available here.

    Learning through atypical "phase transitions" in overparameterized neural networks

    Current deep neural networks are highly overparameterized (up to billions of connection weights) and nonlinear. Yet they can fit data almost perfectly through variants of gradient descent algorithms and achieve unexpected levels of prediction accuracy without overfitting. These are formidable results that defy predictions of statistical learning and pose conceptual challenges for non-convex optimization. In this paper, we use methods from statistical physics of disordered systems to analytically study the computational fallout of overparameterization in non-convex binary neural network models, trained on data generated from a structurally simpler but "hidden" network. As the number of connection weights increases, we follow the changes of the geometrical structure of different minima of the error loss function and relate them to learning and generalization performance. A first transition happens at the so-called interpolation point, when solutions begin to exist (perfect fitting becomes possible). This transition reflects the properties of typical solutions, which however are in sharp minima and hard to sample. After a gap, a second transition occurs, with the discontinuous appearance of a different kind of "atypical" structures: wide regions of the weight space that are particularly solution-dense and have good generalization properties. The two kinds of solutions coexist, with the typical ones being exponentially more numerous, but empirically we find that efficient algorithms sample the atypical, rare ones. This suggests that the atypical phase transition is the relevant one for learning. The results of numerical tests with realistic networks on observables suggested by the theory are consistent with this scenario.

    My opinion:

    I see this paper as contradicting Mingard et al.'s "Is SGD a Bayesian sampler? Well, almost." Mingard et al. argued that SGD has little inductive bias, meaning that training on a dataset with SGD would give you a solution very similar to just sampling random networks until you found one that solved the dataset. This paper instead argues that SGD has extremely high inductive bias, and that SGD finds very "atypical" solutions that generalize much better than those that random sampling would find.
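
    The "sample random networks until one solves the dataset" baseline can be sketched as plain rejection sampling. This is a toy illustration: the two-weight linear classifier, the Gaussian prior, and the hand-made dataset are all assumptions chosen for brevity, not either paper's setup.

```python
import random


def fits(weights, data):
    """True if sign(w . x) matches the +/-1 label on every example."""
    return all(label * sum(w * x for w, x in zip(weights, xs)) > 0
               for xs, label in data)


def sample_until_fit(data, dim, rng, max_tries=10_000):
    """Draw weights from a standard Gaussian prior until one fits the data."""
    for tries in range(1, max_tries + 1):
        weights = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if fits(weights, data):
            return weights, tries
    return None, max_tries


toy_data = [((1.0, 0.2), 1), ((1.0, -0.2), 1),
            ((-1.0, 0.2), -1), ((-1.0, -0.2), -1)]
weights, tries = sample_until_fit(toy_data, dim=2, rng=random.Random(0))
```

    The question at stake is whether the solutions SGD finds look like typical draws from this kind of procedure, or like rare draws that random sampling would almost never produce.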

    Exploring the Geometry and Topology of Neural Network Loss Landscapes

    Recent work has established clear links between the generalization performance of trained neural networks and the geometry of their loss landscape near the local minima to which they converge. This suggests that qualitative and quantitative examination of the loss landscape geometry could yield insights about neural network generalization performance during training. To this end, researchers have proposed visualizing the loss landscape through the use of simple dimensionality reduction techniques. However, such visualization methods have been limited by their linear nature and only capture features in one or two dimensions, thus restricting sampling of the loss landscape to lines or planes. Here, we expand and improve upon these in three ways. First, we present a novel "jump and retrain" procedure for sampling relevant portions of the loss landscape. We show that the resulting sampled data holds more meaningful information about the network's ability to generalize. Next, we show that non-linear dimensionality reduction of the jump and retrain trajectories via PHATE, a trajectory and manifold-preserving method, allows us to visualize differences between networks that are generalizing well vs poorly. Finally, we combine PHATE trajectories with a computational homology characterization to quantify trajectory differences.

    My opinion:

    I include this paper because it provides tools to better visualize neural network training trajectories and loss landscapes. They also made their code public at this repository. It seems like a useful thing to check out if you're investigating NN training processes.

    Adversarial Perturbations Are Not So Weird: Entanglement of Robust and Non-Robust Features in Neural Network Classifiers

    Neural networks trained on visual data are well-known to be vulnerable to often imperceptible adversarial perturbations. The reasons for this vulnerability are still being debated in the literature. Recently Ilyas et al. (2019) showed that this vulnerability arises, in part, because neural network classifiers rely on highly predictive but brittle "non-robust" features. In this paper we extend the work of Ilyas et al. by investigating the nature of the input patterns that give rise to these features. In particular, we hypothesize that in a neural network trained in a standard way, non-robust features respond to small, "non-semantic" patterns that are typically entangled with larger, robust patterns, known to be more human-interpretable, as opposed to solely responding to statistical artifacts in a dataset. Thus, adversarial examples can be formed via minimal perturbations to these small, entangled patterns. In addition, we demonstrate a corollary of our hypothesis: robust classifiers are more effective than standard (non-robust) ones as a source for generating transferable adversarial examples in both the untargeted and targeted settings. The results we present in this paper provide new insight into the nature of the non-robust features responsible for adversarial vulnerability of neural network classifiers.

    My opinion:

    Seems like an even stronger version of Adversarial Examples Are Not Bugs, They Are Features. Not only are (some) adversarial examples exploiting genuinely useful classification features, the exploited features are often correlates of the "true" features we humans use to classify images. 

    Going forward

    I hope that list was helpful for some people. If you have a paper that seems alignment relevant (especially a paper that's not well known in alignment circles), please feel free to link it in the comments. Also feel free to share any other feedback or comments you have on the papers I did link.

    I hope to produce one of these lists every week or so. I doubt I'll be able to do 10 papers a week, however. We'll see how it goes, I guess.

    6 comments

    Do you intend for the comments section to be a public forum on the papers you collect?

    I definitely endorse reading the ROME paper, although the popular-culture claims about what the second part of the paper actually shows seem a bit overblown.

    They do not seem to claim "changing facts in a generalizable way" (it's likely not robust to synonyms at all). I am also wary of "editing just one MLP for a given fact" being the right solution, given that the causal tracing shows the fact being stored in several consecutive layers. Refer to a writeup by Thibodeau et al. sometime in the future.

    That being said, if you are into interpretability, you have to at least skim the paper. It has a whole bunch of very cool stuff in it, from the causal tracing to the testing of whether making Einstein a physician changes the meaning of the word "physics" itself. Just don't overfit on the methods there being exactly the methods that will solve interpretability of reasoning in transformers.

    I welcome any discussion of the linked papers in the comments section.

    I agree that the ROME edit method itself isn’t directly that useful. I think it matters more as a validation of how the ROME authors interpreted the structure / functions of the MLP layers.

    Refer to a writeup by Thibodeau et al

    Which writeup is this? Have a link?

    Hey, I've finally written it up here.

    This seems very useful -- thanks for doing it!

    Some paper suggestions:

    Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit

    There is mounting empirical evidence of emergent phenomena in the capabilities of deep learning methods as we scale up datasets, model sizes, and training times. While there are some accounts of how these resources modulate statistical capacity, far less is known about their effect on the computational problem of model training. This work conducts such an exploration through the lens of learning k-sparse parities of n bits, a canonical family of problems which pose theoretical computational barriers. In this setting, we find that neural networks exhibit surprising phase transitions when scaling up dataset size and running time. In particular, we demonstrate empirically that with standard training, a variety of architectures learn sparse parities with n^{O(k)} examples, with loss (and error) curves abruptly dropping after n^{O(k)} iterations. These positive results nearly match known SQ lower bounds, even without an explicit sparsity-promoting prior. We elucidate the mechanisms of these phenomena with a theoretical analysis: we find that the phase transition in performance is not due to SGD "stumbling in the dark" until it finds the hidden set of features (a natural algorithm which also runs in n^{O(k)} time); instead, we show that SGD gradually amplifies a Fourier gap in the population gradient.

    By some of the same authors as "Functions of Increasing Complexity," this paper takes a toy problem which exhibits a very sharp phase transition and analyzes the hell out of it. A primary aim of the paper is to refute the explanation that this phase transition happens because SGD is randomly searching weight space until it stumbles upon a solution, an explanation which is tempting given that the loss curves for this problem stay essentially flat before rapidly converging to zero. Instead, the authors find that their models make "hidden progress" which is not reflected in the loss curves; this echoes findings from Neel Nanda's work on grokking. (Speaking of which, this paper also abounds with fascinating tidbits on grokking, including a gearsy analysis of which variants of their toy problem and which model hyperparameters do and don't produce grokking.)
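
    For readers unfamiliar with the toy problem, a k-sparse parity over n bits labels each input by the parity of a hidden subset of k coordinates. Below is a minimal data generator, assuming the ±1 encoding (so the parity is a product); the function name and sizes are chosen only for illustration.

```python
import random
from math import prod


def make_sparse_parity(n_bits, k, n_samples, rng):
    """Sample +/-1 inputs labeled by the product of k hidden coordinates."""
    support = sorted(rng.sample(range(n_bits), k))  # hidden relevant bits
    xs = [[rng.choice((-1, 1)) for _ in range(n_bits)] for _ in range(n_samples)]
    ys = [prod(x[i] for i in support) for x in xs]
    return xs, ys, support


xs, ys, support = make_sparse_parity(n_bits=20, k=3, n_samples=100, rng=random.Random(0))
```

    Flipping any one of the k hidden bits flips the label, while the other n − k bits are pure distractors, which is what makes the relevant subset hard to find.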

    Diversify and Disambiguate: Learning from Underspecified Data

    Many datasets are underspecified: there exist multiple equally viable solutions to a given task. Underspecification can be problematic for methods that learn a single hypothesis because different functions that achieve low training loss can focus on different predictive features and thus produce widely varying predictions on out-of-distribution data. We propose DivDis, a simple two-stage framework that first learns a diverse collection of hypotheses for a task by leveraging unlabeled data from the test distribution. We then disambiguate by selecting one of the discovered hypotheses using minimal additional supervision, in the form of additional labels or inspection of function visualization. We demonstrate the ability of DivDis to find hypotheses that use robust features in image classification and natural language processing problems with underspecification.

    I'm suggesting this paper because it forms the technical basis for Stuart Armstrong's work on his concept extrapolation agenda. Given (1) labeled data for which there are many possible proxies which result in good classification performance, (2) unlabeled data which bears witness to the fact that these proxies can disagree with each other, this paper gives a method for explicitly learning multiple diverse proxies. A possible story about how this is useful for alignment: if one can generate diverse hypotheses which postdict observed human preferences but disagree on novel scenarios, then one may hope to either actively query humans (to get evidence on which hypothesis is correct) or act conservatively (picking actions which are good according to all the various hypotheses).
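
    The last step of that story can be illustrated with two simple rules: a maximin rule for acting conservatively, and a disagreement score for deciding what to ask a human about. This is only a sketch of that downstream step, not of DivDis itself; the value table and function names are hypothetical.

```python
def conservative_action(values):
    """values[h][a] = value of action a under hypothesis h; return the maximin action."""
    n_actions = len(values[0])
    worst_case = [min(row[a] for row in values) for a in range(n_actions)]
    best = max(range(n_actions), key=lambda a: worst_case[a])
    return best, worst_case


def most_contested_action(values):
    """Action whose value the hypotheses disagree on most (a candidate for a human query)."""
    n_actions = len(values[0])
    spread = [max(row[a] for row in values) - min(row[a] for row in values)
              for a in range(n_actions)]
    return max(range(n_actions), key=lambda a: spread[a])


# Three hypotheses (rows) scoring three actions (columns); the numbers are made up.
values = [
    [0.9, 0.6, 0.1],
    [0.1, 0.5, 1.0],
    [0.8, 0.7, 0.2],
]
action, worst_case = conservative_action(values)
query = most_contested_action(values)
```

    Here the conservative choice is the middle action, which no hypothesis rates as best but none rates as bad, and the most informative thing to ask about is the last action, where the hypotheses disagree most.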

    Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution

    When transferring a pretrained model to a downstream task, two popular methods are full fine-tuning (updating all the model parameters) and linear probing (updating only the last linear layer -- the "head"). It is well known that fine-tuning leads to better accuracy in-distribution (ID). However, in this paper, we find that fine-tuning can achieve worse accuracy than linear probing out-of-distribution (OOD) when the pretrained features are good and the distribution shift is large. On 10 distribution shift datasets (Breeds-Living17, Breeds-Entity30, DomainNet, CIFAR → STL, CIFAR10.1, FMoW, ImageNetV2, ImageNet-R, ImageNet-A, ImageNet-Sketch), fine-tuning obtains on average 2% higher accuracy ID but 7% lower accuracy OOD than linear probing. We show theoretically that this tradeoff between ID and OOD accuracy arises even in a simple setting: fine-tuning overparameterized two-layer linear networks. We prove that the OOD error of fine-tuning is high when we initialize with a fixed or random head -- this is because while fine-tuning learns the head, the lower layers of the neural network change simultaneously and distort the pretrained features. Our analysis suggests that the easy two-step strategy of linear probing then full fine-tuning (LP-FT), sometimes used as a fine-tuning heuristic, combines the benefits of both fine-tuning and linear probing. Empirically, LP-FT outperforms both fine-tuning and linear probing on the above datasets (1% better ID, 10% better OOD than full fine-tuning).

    Even when a pretrained model has learned robust features for understanding its data, finetuning can sometimes distort these features into proxies which only work well on-distribution, thereby leading to poor OOD performance. This paper studies the "feature distortion" phenomenon, and proposes a method to mitigate it: first freeze the pretrained model's weights and only train a linear probe on the pretrained model's latent representation of the data, and then finetune the pretrained model + linear probe. This seems relevant to alignment insofar as it seems plausible that large models trained with self-supervision could learn concepts which robustly correspond to our own concepts; in that case it'd be useful if we could avoid distorting these concepts during finetuning (such as RLHF).
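
    The two-step recipe can be sketched on a toy two-layer linear model y ≈ v · (B x), where B stands in for the pretrained feature extractor and v for the head. Everything here (the least-squares probe, plain gradient descent, the sizes, the learning rate, the random data) is an assumption for illustration, not the paper's experimental setup, and the sketch only shows the mechanics rather than any OOD effect.

```python
import numpy as np


def mse(B, v, X, y):
    """Mean squared error of the two-layer linear model x -> v . (B x)."""
    return float(np.mean((X @ B.T @ v - y) ** 2))


def linear_probe(B, X, y):
    """Stage 1: freeze B and fit only the head v by least squares on the features."""
    features = X @ B.T
    v, *_ = np.linalg.lstsq(features, y, rcond=None)
    return v


def fine_tune(B, v, X, y, lr=1e-3, steps=200):
    """Stage 2: gradient descent on both B and v, starting from the probed head."""
    B, v = B.copy(), v.copy()
    n = len(y)
    for _ in range(steps):
        residual = X @ B.T @ v - y                      # shape (n,)
        grad_v = 2.0 / n * (B @ X.T @ residual)         # d loss / d v
        grad_B = 2.0 / n * np.outer(v, X.T @ residual)  # d loss / d B
        B -= lr * grad_B
        v -= lr * grad_v
    return B, v


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=50)
B_pre = rng.normal(size=(3, 5))  # toy stand-in for pretrained features

v_lp = linear_probe(B_pre, X, y)            # features untouched, head fitted
B_ft, v_ft = fine_tune(B_pre, v_lp, X, y)   # then everything moves together
```

    Because the head already fits the frozen features when fine-tuning starts, the first gradient steps on B are driven by the remaining residual rather than by a randomly initialized head, which is the intuition for why the features get distorted less.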

    [Meta: if you include these papers in future roundups, feel free to either use these blurbs or toss them out and write your own. I had originally planned to just write something which pitched the papers and their alignment relevance to you (Quintin), but I guess they kinda turned more into the sort of opinions you had written.]

    I have been studying the ROME code in depth now for a few months, and I think there's some really interesting stuff there. One of the contributors is actively working on adding things to their repo, such as integrating the functionality of nostalgebraist's logit lens.

    I've just read the Gradient Starvation paper. It was quite interesting and approachable. It has left me wondering what we will see if (when?) someone applies their fix (spectral decoupling) to fine-tune a pre-trained language model... 
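
    For reference, a minimal sketch of what such a loss might look like, assuming spectral decoupling amounts to replacing weight decay with an L2 penalty on the network's output logits. The binary {-1, +1} label convention, the averaging over the batch, the coefficient name `lam`, and the helper names are assumptions for illustration rather than the paper's reference code.

```python
from math import exp, log


def logistic_loss(logits, labels):
    """Mean binary cross-entropy for labels in {-1, +1}: log(1 + exp(-y * logit))."""
    return sum(log(1.0 + exp(-y * z)) for z, y in zip(logits, labels)) / len(logits)


def spectral_decoupling_loss(logits, labels, lam=0.1):
    """Cross-entropy plus (lam / 2) * mean squared logit, penalizing outputs rather than weights."""
    penalty = 0.5 * lam * sum(z * z for z in logits) / len(logits)
    return logistic_loss(logits, labels) + penalty


logits = [2.0, -1.0, 0.5]
labels = [1, -1, -1]
plain = logistic_loss(logits, labels)
decoupled = spectral_decoupling_loss(logits, labels, lam=0.1)
```

    In a fine-tuning setup the same penalty would be applied to the model's logits inside the training loop; how it should be defined over a language model's full vocabulary is an open design choice this sketch does not address.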

    Reading the Diversify and Disambiguate: Learning from Underspecified Data paper, I find myself thinking that it is solving nearly the same problem as the Gradient Starvation paper. The Gradient Starvation fix of Spectral Decoupling seems, at face value, a little simpler, more elegant, and more general. I don't actually have the info I need to compare the two, though. I would like to see the two methods compared on things like:

    1. Suppose you have a completely closed test set (e.g. the future) and must guess which distribution shifts you could end up seeing by reasoning logically about the degrees of freedom of the input features and extrapolating historical trends to higher-variance regimes. How does each of these methods handle this harder but more realistic case?
    2. How do they compare across various benchmarks in terms of metrics like accuracy, confusion / regret, compute cost, robustness to outliers, etc.?
    3. Can you combine the methods? If so, how does the combination compare?
    4. What about the case of trying to calibrate confidence using heuristics like: high datapoint-density regions of latent space -> higher confidence, low datapoint-density regions of latent space -> lower confidence, highly ambiguous regions of latent space (non-linearly-separable) -> lower confidence, etc.? This is beyond the scope of these papers, so I don't criticize them for not addressing it, but it is important for real-world application. There are various techniques, for example Virtual Outlier Synthesis, for addressing the issue of confidence calibration of model predictions. How well do these techniques combine with the above techniques for generating multiple hypotheses? This seems relevant to me, because it impacts which of the generated hypotheses you ought to promote most highly.
    5. How can multiple hypotheses best be handled in an online learning scenario where live data is streaming in, potentially of questionable quality due to occasional corruption, and competing hypotheses must be held ready? You want a model to be able to switch cleanly between hypotheses about the underlying reality, not get stuck in some implausible mishmash. If one maze-solving hypothesis says it is best to turn left at 4-way junctions, and the second-best hypothesis says it is best to turn right, and the model is doing a weighted random choice, you don't want the model to split between left and straight; you want it to split between left and right. I feel like the multiple heads in 'Diversify and Disambiguate' might be a useful tool for setting this up. Not sure how one would do the same thing using Gradient Starvation's Spectral Decoupling, but it seems like it could be done somehow.
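
    The left/right example in item 5 amounts to sampling a whole hypothesis first and then acting on it, rather than blending the hypotheses' action preferences. A small sketch of the difference follows, with hypothetical hypotheses, weights, and action scores.

```python
import random

ACTIONS = ("left", "straight", "right")

# Each hypothesis scores the actions at a 4-way junction (made-up numbers).
hypotheses = {
    "turn_left_is_best": {"left": 1.0, "straight": 0.7, "right": 0.0},
    "turn_right_is_best": {"left": 0.0, "straight": 0.7, "right": 1.0},
}
weights = {"turn_left_is_best": 0.6, "turn_right_is_best": 0.4}


def blended_action():
    """Average the scores across hypotheses, then act on the blend."""
    blended = {a: sum(weights[h] * hypotheses[h][a] for h in hypotheses) for a in ACTIONS}
    return max(ACTIONS, key=blended.get)


def sampled_hypothesis_action(rng):
    """Sample one hypothesis by weight, then take that hypothesis's own best action."""
    names = list(hypotheses)
    chosen = rng.choices(names, weights=[weights[h] for h in names], k=1)[0]
    return max(ACTIONS, key=hypotheses[chosen].get)


rng = random.Random(0)
samples = [sampled_hypothesis_action(rng) for _ in range(1000)]
```

    With these numbers the blended policy goes straight, which neither hypothesis considers best, while the sampled policy only ever turns left or right.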

    I read the loss landscape paper, and liked their visualizations. I would be so interested to see the same visualizations applied to models trained with and without the Gradient Starvation paper's Spectral Decoupling.

    Reading the Cognitive Psych on GPT-3 paper, I find myself highly suspecting that the correct answers GPT-3 is reported to get are due to memorized patterns. I'm curious whether repeating the large-n experiments with variations in prompt (e.g. 'you are a brilliant scientist presented with this question' vs 'you are an average middle school student presented with this question') would change the outcome?

    'Mapping Language Models to Grounded Conceptual Spaces' shows the valuable insight that GPT-3 175B can work well in this grid-world space, but the smaller versions can't. This is interesting because I had just been wondering earlier today if a small simple ascii-grid RL environment could be handled by a large language model. The answer is: maybe, if they're large enough. Probably not if they're small. More details on that idea here: