Alignment Newsletter #53


Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.

Cody Wild is now contributing summaries to the newsletter!


Alignment Newsletter One Year Retrospective (Rohin Shah): The Alignment Newsletter is one year old! I've written a retrospective of the newsletter's impact over the last year, with a lot of open questions about what the newsletter should look like in the future. Please help me figure out the answers by taking this 3-minute survey, and if you're feeling particularly generous with your time, read the retrospective and tell me your opinions in the comments!

Are Deep Neural Networks Dramatically Overfitted? (Lilian Weng): The concepts of underfitting and overfitting, and their relation to the bias-variance tradeoff, are fundamental to standard machine learning theory. Roughly, for a fixed amount of data, there is an optimal model complexity for learning from that data: any less complex and the model won't be able to fit the data, and any more complex and it will overfit to noise in the data. This means that as you increase model complexity, training error will go down to zero, but validation error will go down and then start turning back up once the model is overfitting.

We know that neural networks are much more expressive than the theory would predict is optimal, both from theorems showing that neural networks can learn any function (including one that provides a rather tight bound on number of parameters), as well as a paper showing that neural nets can learn random noise. Yet they work well in practice, achieving good within-distribution generalization.

The post starts with a brief summary of topics that readers of this newsletter are probably familiar with: Occam's razor, the Minimum Description Length principle, Kolmogorov Complexity, and Solomonoff Induction. If you don't know these, I strongly recommend learning them if you care about understanding within-distribution generalization. The post then looks at a few recent informative papers, and tries to reproduce them.

The first one is the most surprising: they find that as you increase the model complexity, your validation error goes down and then back up, as expected, but then at some point it enters a new regime and goes down again. However, the author notes that you have to set up the experiments just right to get the smooth curves the paper got, and her own attempts at reproducing the result are not nearly as dramatic.

Another paper measures the difficulty of a task based on its "intrinsic dimension", which Cody has summarized separately in this newsletter.

The last paper looks at what happens if you (a) reset some layer's parameters to the initial parameters and (b) randomize some layer's parameters. They find that randomizing always destroys performance, but resetting to initial parameters doesn't make much of a difference for later layers, while being bad for earlier layers. This was easy to reproduce, and the findings reemerge very clearly.

Rohin's opinion: I'm very interested in this problem, and this post does a great job of introducing it and summarizing some of the recent work. I especially appreciated the attempts at reproducing the results.

On the papers themselves, a regime where you already have ~zero training error but validation error goes down as you increase model expressivity is exceedingly strange. Skimming the paper, it seems that the idea is that in the normal ML regime, you are only minimizing training error -- but once you can get the training error to zero, you can then optimize for the "simplest" model with zero training error, which by Occam's Razor-style arguments should be the best one and lead to better validation performance. This makes sense in the theoretical model that they use, but it's not clear to me how this applies to neural nets, where you aren't explicitly optimizing for simplicity after getting zero training error. (Techniques like regularization don't result in one-after-the-other optimization -- you're optimizing for both simplicity and low training error simultaneously, so you wouldn't expect this critical point at which you enter a new regime.) So I still don't understand these results. That said, given the difficulty with reproducing them, I'm not going to put too much weight on these results now.

I tried to predict the results of the last paper and correctly predicted that randomizing would always destroy performance, but predicted that resetting to initialization would be okay for early layers instead of later layers. I had a couple of reasons for the wrong prediction. First, there had been a few papers that showed good results even with random features, suggesting the initial layers aren't too important, and so maybe don't get updated too much. Second, the gradient of the loss w.r.t later layers requires only a few backpropagation steps, and so probably provides a clear, consistent direction moving it far away from the initial configuration, while the gradient w.r.t earlier layers factors through the later layers which may have weird or wrong values and so might push in an unusual direction that might get cancelled out across multiple gradient updates. I skimmed the paper and it doesn't really speculate on why this happens, and my thoughts still seem reasonable to me, so this is another fact that I have yet to explain.

Technical AI alignment

Technical agendas and prioritization

Summary of the Technical Safety Workshop (David Krueger) (summarized by Richard): David identifies two broad types of AI safety work: human in the loop approaches, and theory approaches. A notable subset of the former category is methods which improve our ability to give advanced systems meaningful feedback - this includes debate, IDA, and recursive reward modeling. CIRL and CAIS are also human-in-the-loop. Meanwhile the theory category includes MIRI's work on agent foundations; side effect metrics; and verified boxing.

Iterated amplification

A Concrete Proposal for Adversarial IDA (Evan Hubinger): This post presents a method to use an adversary to improve the sample efficiency (with respect to human feedback) of iterated amplification. The key idea is that when a question is decomposed into subquestions, the adversary is used to predict which subquestion the agent will do poorly on, and the human is only asked to resolve that subquestion. In addition to improving sample efficiency by only asking relevant questions, the resulting adversary can also be used for interpretability: for any question-answer pair, the adversary can pick out specific subquestions in the tree that are particularly likely to contain errors, which can then be reviewed.

Rohin's opinion: I like the idea, but the math in the post is quite hard to read (mainly due to the lack of exposition). The post also has separate procedures for amplification, distillation and iteration; I think they can be collapsed into a single more efficient procedure, which I wrote about in this comment.

Learning human intent

Conditional revealed preference (Jessica Taylor): When backing out preferences by looking at people's actions, you may find that even though they say they are optimizing for X, their actions are better explained as optimizing for Y. This is better than relying on what they say, at least if you want to predict what they will do in the future. However, all such inferences are specific to the current context. For example, you may infer that schools are "about" dealing with authoritarian work environments, as opposed to learning -- but maybe this is because everyone who designs schools doesn't realize what the most effective methods of teaching-for-learning are, and if they were convinced that some other method was better for learning they would switch to that. So, in order to figure out what people "really want", we need to see not only what they do in the current context, but also what they would do in a range of alternative scenarios.

Rohin's opinion: The general point here, which comes up pretty often, is that any information you get about "what humans want" is going to be specific to the context in which you elicit that information. This post makes that point when the information you get is the actions that people take. Some other instances of this point:

- Inverse Reward Design notes that a human-provided reward function should be treated as specific to the training environment, instead of as a description of good behavior in all possible environments.

- CP-Nets are based on the point that when a human says "I want X" it is not a statement that is meant to hold in all possible contexts. They propose very weak semantics, where "I want X" means "holding every other aspect of the world constant, it would be better for X to be present than for it not to be present".

- Wei Dai's point (AN #37) that humans likely have adversarial examples, and we should not expect preferences to generalize under distribution shift.

- Stuart Armstrong and Paul Christiano have made or addressed this point in many of their posts.

Defeating Goodhart and the closest unblocked strategy problem (Stuart Armstrong): One issue with the idea of reward uncertainty (AN #42) based on a model of uncertainty that we specify is that we tend to severely underestimate how uncertain we should be. This post makes the point that we could try to build an AI system that starts with this estimate of our uncertainty, but then corrects the estimate based on its understanding of humans. For example, if it notices that humans tend to become much more uncertain when presented with some crucial consideration, it could realize that its estimate probably needs to be widened significantly.

Rohin's opinion: So far, this is an idea that hasn't been turned into a proposal yet, so it's hard to evaluate. The most obvious implementation (to me) would involve an explicit estimate of reward uncertainty, and then an explicit model for how to update that uncertainty (that would not be Bayes Rule, since that would narrow the uncertainty over time). At this point it's not clear to me why we're even using the expected utility formalism; it feels like adding epicycles in order to get a single particular behavior that breaks other things. You could also make the argument that there will be misspecification of the model of how update the uncertainty. But again, this is just the most obvious completion of the idea; it's plausible that there's a different way of doing this that's better.

Parenting: Safe Reinforcement Learning from Human Input (Christopher Frye et al)


Attention is not Explanation (Sarthak Jain et al) (summarized by Richard): This paper explores the usefulness of attention weights in interpreting neural networks' performance on NLP tasks. The authors present two findings: firstly, that attention weights are only weakly correlated with other metrics of word importance; and secondly, that there often exist adversarially-generated attention weights which are totally different from the learned weights, but which still lead to the same outputs. They conclude that these results undermine the explanatory relevance of attention weights.

Richard's opinion: I like this type of investigation, but don't find their actual conclusions compelling. In particular, it doesn't matter whether "meaningless" adversarial attention weights can lead to the same classifications, as long as the ones actually learned by the system are interpretable. Also, the lack of correlation between attention weights and other methods could be explained either by attention weights being much worse than the other methods, or much better, or merely useful for different purposes.

Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations (Andrew Ross et al)

Adversarial examples

The LogBarrier adversarial attack: making effective use of decision boundary information (Chris Finlay et al) (summarized by Dan H): Rather than maximizing the loss of a model given a perturbation budget, this paper minimizes the perturbation size subject to the constraint that the model misclassify the example. This misclassification constraint is enforced by adding a logarithmic barrier to the objective, which they prevent from causing a loss explosion through through a few clever tricks. Their attack appears to be faster than the Carlini-Wagner attack.

Read more: The code is here.


Gradient Descent with Early Stopping is Provably Robust to Label Noise for Overparameterized Neural Networks (Mingchen Li et al.) (summarized by Dan H): Previous empirical papers have shown that finding ways to decrease training time greatly improves robustness to label corruptions, but to my knowledge this is the first theoretical treatment.

Other progress in AI

Deep learning

Measuring the Intrinsic Dimension of Objective Landscapes (Chunyuan Li et al) (summarized by Cody): This paper proposes and defines a quantity called "intrinsic dimension", a geometrically-informed metric of how many degrees of freedom are actually needed to train a given model on a given dataset. They calculate this by picking a set of random directions that span some subspace of dimension d, and taking gradient steps only along that lower-dimensional subspace. They consider the intrinsic dimension of a model and a dataset to be the smallest value d at which performance reaches 90% of a baseline, normally trained model on the dataset. The geometric intuition of this approach is that the dimensionality of parameter space can be, by definition, split into intrinsic dimension and its codimension, the dimension of the solution set. In this framing, higher solution set dimension (and lower intrinsic dimension) corresponds to proportionally more of the search space containing reasonable solution points, and therefore a situation where a learning agent will be more likely to find such a solution point. There are some interesting observations here that correspond with our intuitions about model trainability: on MNIST, intrinsic dimensionality for a CNN is lower than for a fully connected network, but if you randomize pixel locations, CNN's intrinsic dimension shoots up above FC, matching the intuition that CNNs are appropriate when their assumption of local structure holds.

Cody's opinion: Overall, I find this an interesting and well-articulated paper, and am curious to see future work that addresses some of the extrapolations and claims implied by this paper, particularly their claim, surprising relative to my intuitions, that increasing n_parameters will, maybe monotonically, reduce difficulty of training, because it simply increases the dimensionality of the solution set. I'm also not sure how to feel about their simply asserting that a solution exists when a network reaches 90% of baselines performance, since we may care about that "last mile" performance and it might also be the harder to reach.

Read more: Paper