Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).


The Precipice: Existential Risk and the Future of Humanity (Toby Ord) (summarized by Rohin): This book argues that humanity is in a special stage of its development: it is on the precipice, a narrow time during which we have enough power to destroy ourselves, but not enough wisdom to have mitigated such risks. It first argues that existential risk would be very important to reduce (for all the standard reasons), and then considers many different kinds of existential risks, finding that natural ones (asteroids, supervolcanoes, stellar explosions) are small relative to anthropogenic risks, both current (nuclear war, climate change, environmental destruction) and future (engineered pandemics, unaligned AI, dystopian scenarios). I'll focus primarily on the part about AI risk, as well as some of the comments on existential risk in general.

The AI risk presentation in the book was similar to that in Superintelligence: it argues for risk from goal-directed AI systems (though the terminology used in the book is different). It first demonstrates the strong progress in deep learning, and then notes that expert surveys estimate that AGI is more likely than not to arrive in the next century. It then notes that we don't know how to specify a reward function for an AI system (even with e.g. inverse reinforcement learning), and to the extent that we get it wrong, it pits us in competition against a superintelligent adversary. Ideas like switching off the AI system wouldn't work, due to convergent instrumental subgoals like survival.

It also considers some obvious objections, including the very reasonable objection that "AI researchers won't build something that will kill them". However, Toby is still worried, citing that due to the unilateralist curse unaligned AGI might still be built by the most optimistic researchers, and in any case the personal benefits to the researchers might justify the risk of misalignment to them personally (though it would not be justified for the world as a whole).

The book then spends some time discussing risk factors, which are things that do not directly lead to existential risks, but indirectly exacerbate other existential risks, making them more likely. For example, great power war seems like a risk factor: it isn't going to cause an existential catastrophe by itself, but it increases the likelihood that we use risky technologies like bioweapons and AI that could then cause an existential catastrophe.

The book also has lots of useful insights about existential risks in general, which then also apply to AI risk: for example, risks that strike sooner should be prioritized (since the later risks can be dealt with later), risks that are more sudden will be more important to focus on (since we won't be able to build support as the risk gradually comes in), and risks that are "sharper" will be more neglected since there won't be as many "warning shots".

Read more: FLI Podcast: The Precipice: Existential Risk and the Future of Humanity with Toby Ord

Rohin's opinion: I enjoyed this book more than I thought I would: it had a lot of novel content for me, and I liked the explanations and comparisons across different kinds of existential risks (something that I hadn't really seen a single unified perspective on), and I especially liked the constant focus on what we do and don't know -- it felt more like a research paper (albeit in a conversational style) than a popular book, and was similarly information-dense.

On the AI part specifically, I liked that one of the endnotes cashed out powerful AI systems using model-based RL: this indeed seems like the thing that is closest to the classic expected utility maximizer, so the conclusions make a bit more sense. You still have to wonder how exactly the model is learned, and how exactly the AI system becomes good at using the model to find good actions, but at least under those two assumptions you would have all the standard convergent instrumental subgoals. In contrast, with model-free RL, the default expectation is that the RL agent needs to try things multiple times before it can learn to do them again, so it's less clear how it starts doing novel things. It seems that model-based and model-free RL are pretty similar so the distinction doesn't matter in practice, but at least conceptually it's a lot easier to reason about the model-based system (at least in the context of AI risk).

Toby gives a 1 in 10 chance of existential catastrophe from AI in the next century (more than half of his total of 1 in 6), which decomposes into a 1 in 2 chance of AGI this century, and 1 in 5 of it leading to existential catastrophe. This is a bit more pessimistic than Paul's estimate (AN #80) of 10% EV loss (which was over all time, not just this century), which is in turn a bit more pessimistic than the 1 in 10 chance that I estimated (AN #80) (and am now forever anchored on), which was over all time and conditional on no additional effort from longtermists. But I wouldn't read too much into this -- 10 is a nice round number, and that probably played a big role in why I chose it. I certainly don't feel calibrated enough to easily tell the difference between 1 in 5 and 1 in 20 on a question of this complexity.

I am very happy about this trend of people actually stating numbers: it's a lot easier to narrow down on the important disagreements when people put down numbers, even if they're completely made up. I'd really like to see numbers from people who have larger disagreements (as I expect would be the case with e.g. MIRI researchers).



Deconfusing Human Values Research Agenda v1 (G Gordon Worley III) (summarized by Rohin): This post argues that since 1. human values are necessary for alignment, 2. we are confused about human values, and 3. we couldn't verify it if an AI system discovered the structure of human values, we need to do research to become less confused about human values. This research agenda aims to deconfuse human values by modeling them as the input to a decision process which produces behavior and preferences. The author's best guess is that human values are captured by valence, as modeled by minimization of prediction error.

Rohin's opinion: This is similar to the argument in Why we need a theory of human values (AN #36), and my opinion remains roughly the same: I strongly agree that we are confused about human values, but I don't see an understanding of human values as necessary for value alignment. We could hope to build AI systems in a way where we don't need to specify the ultimate human values (or even a framework for learning them) before running the AI system. As an analogy, my friends and I are all confused about human values, but nonetheless I think they are more or less aligned with me (in the sense that if AI systems were like my friends but superintelligent, that sounds broadly fine).


What is Interpretability? (Robert Kirk et al) (summarized by Rohin): This post categorizes several interpretability methods based on their goal and how they enable humans to achieve the goal.

Rohin's opinion: It's striking to me how different this is from other work, e.g. Explicability? Legibility? Predictability? Transparency? Privacy? Security? The Emerging Landscape of Interpretable Agent Behavior (AN #36). It seems like interpretability is a really vague, nebulous term that has so far (to my limited knowledge) not been made precise.


Physically Realistic Attacks on Deep Reinforcement Learning (Adam Gleave) (summarized by Rohin): This is a blog post for a previously summarized paper, Adversarial Policies: Attacking Deep Reinforcement Learning (AN #70).


2019 trends in GPU price per FLOPS (Asya Bergal) (summarized by Rohin): This post analyzes the the trends in cost per FLOP for GPUs. There are a bunch of details in how to do this analysis, but they end up finding that this cost goes down by an order of magnitude over 17 years for single-precision FLOPS (halving time: 5 years), 10 years for half-precision FLOPS (halving time: 3 years), and 5 years for half-precision fused multiply-add FLOPS (halving time: 1.5 years). However, the latter two categories have become more popular in recent years with the rise of deep learning, so their low halving times might be because some of the single-precision hardware was converted to half-precision hardware, rather than fundamental technological improvements.


If I were a well-intentioned AI (Stuart Armstrong) (summarized by Rohin): This sequence takes on the perspective of an AI system that is well-intentioned, but lacking information about what humans want. The hope is to find what good AI reasoning might look like, and hopefully use this to derive insights for safety. The sequence considers Goodhart problems, adversarial examples, distribution shift, subagent problems, etc.

Rohin's opinion: I liked this sequence. Often when presented with a potential problem in AI safety, I ask myself why the problem doesn't also apply to humans, and how humans have managed to solve the problem. This sequence was primarily this sort of reasoning, and I think it did a good job of highlighting how with sufficient conservatism it seems plausible that many problems are not that bad if the AI is well-intentioned, even if it has very little information, or finds it hard to communicate with humans, or has the wrong abstractions.



Coherent Gradients: An Approach to Understanding Generalization in Gradient Descent-based Optimization (Satrajit Chatterjee) (summarized by Nicholas): Deep neural networks trained with gradient descent do well at generalizing from their training set, but the field currently has relatively little understanding of why that is. Large networks have enough parameters to fully memorize the training set and can do so even if trained on data with entirely random labels. This allows for many functions that would fit the training set well, but not generalize, a phenomenon known as overfitting. The question is how gradient descent picks out one of a small subset of functions that will generalize well.

The Coherent Gradients hypothesis, introduced here and tested further in this paper, is that this results from per-example gradients being averaged during gradient descent. For each example, some of the gradient points in a direction that is idiosyncratic to that example, but some of it points towards a more general solution. When the average is taken across these gradients, the more general directions reinforce each other while the example-specific directions cancel out. As a result, the training process moves faster towards more general directions.

In order to test this hypothesis, they run two experiments. First they use varying amounts of label noise (corrupting a fraction of the dataset to have random labels). They predict and find that:

1. More label noise leads to slower learning.

2. The uncorrupted examples will be learned faster.

The next experiment tests a novel form of regularization, called winsorization, where they clip the gradients on a per-example and per-parameter basis to prevent a single example from dominating the gradient, effectively curtailing the component of the gradient that is example-specific. Since the computation of per-example gradients is expensive, when scaling this up to larger networks, they instead use the median of 3 mini-batches to address outliers. The experiments suggest that winsorization reduces overfitting and in particular prevents neural nets from learning randomly labeled data.

Read more: Explaining Memorization and Generalization: A Large-Scale Study with Coherent Gradients

Nicholas's opinion: The hypothesis makes sense to me and the experiments do seem to bear out their conclusions. However, none of the results of the experiments were surprising to me and seem to me like they could be consistent with other explanations for generalization. I would be more convinced if the Coherent Gradients hypothesis made predictions that were different from other leading theories and then those turned out to be true.


I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.


An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.

New Comment