Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by commenting on this post.

Audio version here (may not be up yet).


Call for contributors to the Alignment Newsletter (Rohin Shah): I'm looking for content creators and a publisher for this newsletter! Apply by September 6.

Adversarial Examples Are Not Bugs, They Are Features (Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom et al) (summarized by Rohin and Cody): Distill published a discussion of this paper. This highlights section will cover the full discussion; all of these summaries and opinions are meant to be read together.

Consider two possible explanations of adversarial examples. First, they could be caused because the model "hallucinates" a signal that is not useful for classification, and it becomes very sensitive to this feature. We could call these "bugs", since they don't generalize well. Second, they could be caused by features that do generalize to the test set, but can be modified by an adversarial perturbation. We could call these "non-robust features" (as opposed to "robust features", which can't be changed by an adversarial perturbation). The authors argue that at least some adversarial perturbations fall into the second category of being informative but sensitive features, based on two experiments.

If the "hallucination" explanation were true, the hallucinations would presumably be caused by the training process, the choice of architecture, the size of the dataset, but not by the type of data. So one thing to do would be to see if we can construct a dataset such that a model trained on that dataset is already robust, without adversarial training. The authors do this in the first experiment. They take an adversarially trained robust classifier, and create images whose features (final-layer activations of the robust classifier) match the features of some unmodified input. The generated images only have robust features because the original classifier was robust, and in fact models trained on this dataset are automatically robust.

If the "non-robust features" explanation were true, then it should be possible for a model to learn on a dataset containing only non-robust features (which will look nonsensical to humans) and still generalize to a normal-looking test set. In the second experiment (henceforth WrongLabels), the authors construct such a dataset. Their hypothesis is that adversarial perturbations work by introducing non-robust features of the target class. So, to construct their dataset, they take an image x with original label y, adversarially perturb it towards some class y' to get image x', and then add (x', y') to their dataset (even though to a human x' looks like class y). They have two versions of this: in RandLabels, the target class y' is chosen randomly, whereas in DetLabels, y' is chosen to be y + 1. For both datasets, if you train a new model on the dataset, you get good performance on the original test set, showing that the "non-robust features" do generalize.

Rohin's opinion: I buy this hypothesis. It's a plausible explanation for brittleness towards adversarial noise ("because non-robust features are useful to reduce loss"), and why adversarial examples transfer across models ("because different models can learn the same non-robust features"). In fact, the paper shows that architectures that did worse in ExpWrongLabels (and so presumably are bad at learning non-robust features) are also the ones to which adversarial examples transfer the least. I'll leave the rest of my opinion to the opinions on the responses.

Read more: Paper and Author response

Response: Learning from Incorrectly Labeled Data (Eric Wallace): This response notes that all of the experiments are of the form: create a dataset D that is consistent with a model M; then, when you train a new model M' on D you get the same properties as M. Thus, we can interpret these experiments as showing that model distillation can work even with data points that we would naively think of "incorrectly labeled". This is a more general phenomenon: we can take an MNIST model, select only the examples for which the top prediction is incorrect (labeled with these incorrect top predictions), and train a new model on that -- and get nontrivial performance on the original test set, even though the new model has never seen a "correctly labeled" example.

Rohin's opinion: I definitely agree that these results can be thought of as a form of model distillation. I don't think this detracts from the main point of the paper: the reason model distillation works even with incorrectly labeled data is probably because the data is labeled in such a way that it incentivizes the new model to pick out the same features that the old model was paying attention to.

Response: Robust Feature Leakage (Gabriel Goh): This response investigates whether the datasets in WrongLabels could have had robust features. Specifically, it checks whether a linear classifier over provably robust features trained on the WrongLabels dataset can get good accuracy on the original test set. This shouldn't be possible since WrongLabels is meant to correlate only non-robust features with labels. It finds that you can get some accuracy with RandLabels, but you don't get much accuracy with DetLabels.

The original authors can actually explain this: intuitively, you get accuracy with RandLabels because it's less harmful to choose labels randomly than to choose them explicitly incorrectly. With random labels on unmodified inputs, robust features should be completely uncorrelated with accuracy. However, with random labels followed by an adversarial perturbation towards the label, there can be some correlation, because the adversarial perturbation can add "a small amount" of the robust feature. However, in DetLabels, the labels are wrong, and so the robust features are negatively correlated with the true label, and while this can be reduced by an adversarial perturbation, it can't be reversed (otherwise it wouldn't be robust).

Rohin's opinion: The original authors' explanation of these results is quite compelling; it seems correct to me.

Response: Adversarial Examples are Just Bugs, Too (Preetum Nakkiran): The main point of this response is that adversarial examples can be bugs too. In particular, if you construct adversarial examples that explicitly don't transfer between models, and then run ExpWrongLabels with such adversarial perturbations, then the resulting model doesn't perform well on the original test set (and so it must not have learned non-robust features).

It also constructs a data distribution where every useful feature of the optimal classifer is guaranteed to be robust, and shows that we can still get adversarial examples with a typical model, showing that it is not just non-robust features that cause adversarial examples.

In their response, the authors clarify that they didn't intend to claim that adversarial examples could not arise due to "bugs", just that "bugs" were not the only explanation. In particular, they say that their main thesis is “adversarial examples will not just go away as we fix bugs in our models”, which is consistent with the point in this response.

Rohin's opinion: Amusingly, I think I'm more bullish on the original paper's claims than the authors themselves. It's certainly true that adversarial examples can arise from "bugs": if your model overfits to your data, then you should expect adversarial examples along the overfitted decision boundary. The dataset constructed in this response is a particularly clean example: the optimal classifier would have an accuracy of 90%, but the model is trained to accuracy 99.9%, which means it must be overfitting.

However, I claim that with large and varied datasets with neural nets, we are typically not in the regime where models overfit to the data, and the presence of "bugs" in the model will decrease. (You certainly can get a neural net to be "buggy", e.g. by randomly labeling the data, but if you're using real data with a natural task then I don't expect it to happen to a significant degree.) Nonetheless, adversarial examples persist, because the features that models use are not the ones that humans use.

It's also worth noting that this experiment strongly supports the hypothesis that adversarial examples transfer because they are real features that generalize to the test set.

Response: Adversarial Example Researchers Need to Expand What is Meant by ‘Robustness’ (Justin Gilmer et al): This response argues that the results in the original paper are simply a consequence of a generally accepted principle: "models lack robustness to distribution shift because they latch onto superficial correlations in the data". This isn't just about L_p norm ball adversarial perturbations: for example, one recent paper shows that if the model is only given access to high frequency features of images (which look uniformly grey to humans), it can still get above 50% accuracy. In fact, when we do adversarial training to become robust to L_p perturbations, then the model pays attention to different non-robust features and becomes more vulnerable to e.g. low-frequency fog corruption. The authors call for adversarial examples researchers to move beyond L_p perturbations and think about the many different ways models can be fragile, and to make them more robust to distributional shift.

Rohin's opinion: I strongly agree with the worldview behind this response, and especially the principle they identified. I didn't know this was a generally accepted principle, though of course I am not an expert on distributional robustness.

One thing to note is what is meant by "superficial correlation" here. It means a correlation that really does exist in the dataset, that really does generalize to the test set, but that doesn't generalize out of distribution. A better term might be "fragile correlation". All of the experiments so far have been looking at within-distribution generalization (aka generalization to the test set), and are showing that non-robust features do generalize within-distribution. This response is arguing that there are many such non-robust features that will generalize within-distribution but will not generalize under distributional shift, and we need to make our models robust to all of them, not just L_p adversarial perturbations.

Response: Two Examples of Useful, Non-Robust Features (Gabriel Goh): This response studies linear features, since we can analytically compute their usefulness and robustness. It plots the singular vectors of the data as features, and finds that such features are either robust and useful, or non-robust and not useful. However, you can get useful, non-robust features by ensembling or contamination (see response for details).

Response: Adversarially Robust Neural Style Transfer (Reiichiro Nakano): The original paper showed that adversarial examples don't transfer well to VGG, and that VGG doesn't tend to learn similar non-robust features as a ResNet. Separately, VGG works particularly well for style transfer. Perhaps since VGG doesn't capture non-robust features as well, the results of style transfer look better to humans? This response and the author's response investigate this hypothesis in more detail and find that it seems broadly supported, but there are still finnicky details to be worked out.

Rohin's opinion: This is an intriguing empirical fact. However, I don't really buy the theoretical argument that style transfer works because it doesn't use non-robust features, since I would typically expect that a model that doesn't use L_p-fragile features would instead use features that are fragile or non-robust in some other way.

Technical AI alignment


Problems in AI Alignment that philosophers could potentially contribute to (Wei Dai): Exactly what it says. The post is short enough that I'm not going to summarize it -- it would be as long as the original.

Iterated amplification

Delegating open-ended cognitive work (Andreas Stuhlmüller): This is the latest explanation of the approach Ought is experimenting with: Factored Evaluation (in contrast to Factored Cognition (AN #36)). With Factored Cognition, the idea was to recursively decompose a high-level task until you reach subtasks that can be directly solved. Factored Evaluation still does recursive decomposition, but now it is aimed at evaluating the work of experts, along the same lines as recursive reward modeling (AN #34).

This shift means that Ought is attacking a very natural problem: how to effectively delegate work to experts while avoiding principal-agent problems. In particular, we want to design incentives such that untrusted experts under the incentives will be as helpful as experts intrinsically motivated to help. The experts could be human experts or advanced ML systems; ideally our incentive design would work for both.

Currently, Ought is running experiments with reading comprehension on Wikipedia articles. The experts get access to the article while the judge does not, but the judge can check whether particular quotes come from the article. They would like to move to tasks that have a greater gap between the experts and the judge (e.g. allowing the experts to use Google), and to tasks that are more subjective (e.g. whether the judge should get Lasik surgery).

Rohin's opinion: The switch from Factored Cognition to Factored Evaluation is interesting. While it does make it more relevant outside the context of AI alignment (since principal-agent problems abound outside of AI), it still seems like the major impact of Ought is on AI alignment, and I'm not sure what the difference is there. In iterated amplification (AN #30), when decomposing tasks in the Factored Cognition sense, you would use imitation learning during the distillation step, whereas with Factored Evaluation, you would use reinforcement learning to optimize the evaluation signal. The switch would be useful if you expect the reinforcement learning to work significantly better than imitation learning.

However, with Factored Evaluation, the agent that you train iteratively is one that must be good at evaluating tasks, and then you'd need another agent that actually performs the task (or you could train the same agent to do both). In contrast, with Factored Cognition you only need an agent that is performing the task. If the decompositions needed to perform the task are different from the decompositions needed to evaluate the task, then Factored Cognition would presumably have an advantage.

Miscellaneous (Alignment)

Clarifying some key hypotheses in AI alignment (Ben Cottier et al): This post (that I contributed to) introduces a diagram that maps out important and controversial hypotheses for AI alignment. The goal is to help researchers identify and more productively discuss their disagreements.

Near-term concerns

Privacy and security

Evaluating and Testing Unintended Memorization in Neural Networks (Nicholas Carlini et al)

Read more: The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks

Machine ethics

Towards Empathic Deep Q-Learning (Bart Bussmann et al): This paper introduces the empathic DQN, which is inspired by the golden rule: "Do unto others as you would have them do unto you". Given a specified reward, the empathic DQN optimizes for a weighted combination of the specified reward, and the reward that other agents in the environment would get if they were a copy of the agent. They show that this results in resource sharing (when there are diminishing returns to resources) and avoiding conflict in two toy gridworlds.

Rohin's opinion: This seems similar in spirit to impact regularization methods: the hope is that this is a simple rule that prevents catastrophic outcomes without having to solve all of human values.

AI strategy and policy

AI Algorithms Need FDA-Style Drug Trials (Olaf J. Groth et al)

Other progress in AI

Critiques (AI)

Evidence against current methods leading to human level artificial intelligence (Asya Bergal and Robert Long): This post briefly lists arguments that current AI techniques will not lead to high-level machine intelligence (HLMI), without taking a stance on how strong these arguments are.


Ought: why it matters and ways to help (Paul Christiano): This post discusses the work that Ought is doing, and makes a case that it is important for AI alignment (see the summary for Delegating open-ended cognitive work above). Readers can help Ought by applying for their web developer role, by participating in their experiments, and by donating.

Project Proposal: Considerations for trading off capabilities and safety impacts of AI research (David Krueger): This post calls for a thorough and systematic evaluation of whether AI safety researchers should worry about the impact of their work on capabilities.

New Comment
10 comments, sorted by Click to highlight new comments since:
[-]Wei DaiΩ470

The switch would be useful if you expect the reinforcement learning to work significantly better than imitation learning.

Paul Christiano's Against mimicry talks about this. Well it actually talks about a big problem with imitating humans, but the same issue also applies to imitating an amplified agent, AFAICT.

(I only came upon that post recently because I was searching for posts on human imitations. Your comment here made me realize that it also explains in a clear way why Paul and Ought would want to switch to using RL for IDA, which I've been wondering for a while.)

One thing I'm still confused about is, if there is a security hole in the evaluation mechanism implemented by the overseer, it seems like current standard RL algorithms (given enough compute) would train the model to exhibit behavior that would exploit that security hole (because during training it would eventually hit upon such an exploit and the evaluator would return a high reward, thus increasing the probability the model would output that exploit). Is this addressed by making sure that the overseer has no security hole (which seems like a very hard problem compared to what the overseer needs to do in the imitation learning case), or through some other way?

[-]Wei DaiΩ360

I found that the Recursive Reward Modeling paper talks about my concern under the Reward Gaming section. However it is kind of short on ideas for solving it:

Reward gaming Opportunities for reward gaming arise when the reward function incorrectly provides high reward to some undesired behavior (Clark & Amodei, 2016; Lehman et al., 2018); see Figure 4 for a concrete example. One potential source for reward gaming is the reward model’s vulnerability to adversarial inputs (Szegedy et al., 2013). If the environment is complex enough, the agent might figure out how to specifically craft these adversarially perturbed inputs in order to trick the reward model into providing higher reward than the user intends. Unlike in most work on generating adversarial examples (Goodfellow et al., 2015; Huang et al., 2017), the agent would not necessarily be free to synthesize any possible input to the reward model, but would need to find a way to realize adversarial observation sequences in its environment.

Reward gaming problems are in principle solvable by improving the reward model. Whether this means that reward gaming problems can also be overcome in practice is arguably one of the biggest open questions and possibly the greatest weakness of reward modeling. Yet there are a few examples from the literature indicating that reward gaming can be avoided in practice. Reinforcement learning from a learned reward function has been successful in gridworlds (Bahdanau et al., 2018), Atari games (Christiano et al., 2017; Ibarz et al., 2018), and continuous motor control tasks (Ho & Ermon, 2016; Christiano et al., 2017).

It seems to me that reward gaming not happening in practice is probably the result of limited computation applied to RL (resulting in a kind of implicit mild optimization) so that result would likely not be robust to scaling up. In the limit of infinite computation, standard RL algorithms should end up training models that maximize reward, which means doing reward gaming if that is what maximizes reward. (Perhaps with current RL algorithms, this isn't a concern in practice because it would take an impractically large amount of computation to actually end up with reward gaming / exploitative behavior, but it doesn't seem safe to assume that this will continue to be true despite future advances in RL technology.)

It currently feels to me like the problems with imitation learning might be less serious and easier to solve than the reward gaming problem with RL, so I'm still wondering about the move from SL/imitation to RL for IDA. Also, if there's not existing discussion about this that I've missed, I'm surprised that there has been so little discussion about this issue.

Against mimicry is mostly motivated by the case of imitating an amplified agent. (I try to separate the problem of distillation and amplification, and imitation learning is a candidate for mimicry.)

You could try to avoid the RL exploiting a security vulnerability in the overseer by:

  • Doing something like quantilizing where you are constrained to be near the original policy (we impose a KL constraint that prevents the policy from drifting too far from an attempted-imitation).
  • Doing something like meeting halfway.

These solutions seem tricky but maybe helpful. But my bigger concern is that you need to fix security vulnerabilities anyway:

  • The algorithm "Search over lots of actions to find the one for which Q(a) is maximized" is a pretty good algorithm, that you need to be able to use at test time in order to be competitive, and which seems to require competitiveness.
  • Iterated amplification does optimization anyway (by amplifying the optimization done by the individual humans) and without security you are going to have problems there.

I mostly hope to solve this problem with security amplification (see also).

[-]Wei DaiΩ460

The algorithm “Search over lots of actions to find the one for which Q(a) is maximized” is a pretty good algorithm, that you need to be able to use at test time in order to be competitive, and which seems to require competitiveness.

This search is done by the overseer (or the counterfactual overseer or trained model), and is likely to be safer than the search done by the RL training process, because the overseer has more information about the specific situation that calls for a search, and can tailor the search process to best fit that situation, including how to avoid generating a candidate action that may trigger a flaw in Q, how many candidates to test, etc., whereas the RL search process is dumb and whatever safety mechanisms/constraints that are put on it has to be decided by the AI designer ahead of time and applied to every situation.

[-]Wei DaiΩ120

You could try to avoid the RL exploiting a security vulnerability in the overseer by:

Ok, would be interested to learn more about the solution you think is most promising here. Is it written up anywhere yet, or there's something in the ML or AI safety literature that you can point me to?

I mostly hope to solve this problem with security amplification (see also).

I think when I wrote my comment I already internalized "security amplification" and was assuming a LBO/meta-execution style system, but was worried about flaws introduced by meta-execution itself (since doing meta-execution seems like programming in that you have to break tasks down into sub-tasks and can make a mistake in the process). This seems related to what you wrote in the security amplification post:

Note that security amplification + distillation will only remove the vulnerabilities that came from the human. We will still be left with vulnerabilities introduced by our learning process, and with any inherent limits on our model’s ability to represent/learn a secure policy. So we’ll have to deal with those problems separately.

I'm sort of confused by the main point of that post. Is the idea that the robot can't stack blocks because of a physical limitation? If so, it seems like this is addressed by the first initial objection. Is it rather that the model space might not have the capacity to correctly imitate the human? I'd be somewhat surprised by this being a big issue, and at any rate it seems like you could use the Wasserstein metric as a cost function and get a desirable outcome. I guess instead we're instead imagining a problem where there's no great metric (e.g. text answers to questions)?

Is it rather that the model space might not have the capacity to correctly imitate the human?

There are lots of reasons that a robot might be unable to learn the correct policy despite the action space permitting it:

  • Not enough model capacity
  • Not enough training data
  • Training got stuck in a local optimum
  • You've learned from robot play data, but you've never seen anything like the human policy before

etc, etc.

Not all of these are compatible with "and so the robot does the thing that the human does 5% of the time". But it seems like there can and probably will be factors that are different between the human and the robot (even if the human uses teleoperation), and in that setting imitating factored cognition provides the wrong incentives, while optimizing factored evaluation provides the right incentives.

I think AI will probably be good enough to pose a catastrophic risk before it can exactly imitate a human. (But as Wei Dai says elsewhere, if you do amplification then you will definitely get into the regime where you can't imitate.)

[-]Wei DaiΩ120

Is it rather that the model space might not have the capacity to correctly imitate the human?

Paul wrote in a parallel subthread, "Against mimicry is mostly motivated by the case of imitating an amplified agent." In the case of IDA, you're bound to run out of model capacity eventually as you keep iterating the amplification and distillation.

I guess instead we’re instead imagining a problem where there’s no great metric (e.g. text answers to questions)?

I'm pretty sure that's it.

Based on the October 2019 update, it looks like Ought is now using "factored cognition" as an umbrella term that includes both factored generation (which used to be called factored cognition) and factored evaluation.

(Commenting here because as far as I know this post is one of the main places that discusses this distinction.)