Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.
Audio version here (may not be up yet).
Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.
Building Machines That Learn and Think Like People (Brenden M. Lake et al) (summarized by Rohin): The core claim of this 2016 paper is that we should focus on building AI systems that work as flexibly as humans do. For example, a human can learn how to play the Atari game Frostbite in just a couple of hours, way faster than typical deep RL algorithms -- and in addition, after this they will likely be able to transfer zero-shot to new reward functions, such as “lose as quickly as possible”, “maximize the number of fish”, “beat the level with as little time to spare as possible”, and so on. How can we build AI systems that mimic this feat? Deep RL certainly doesn’t get us there. Similarly, while neural networks can learn to classify digits and characters with thousands of examples, humans can learn new characters from a single example, which then allows them to perform many different tasks such as classification, generation, and parsing a character into its individual pen strokes. Since the paper was written, neural nets have made progress on few-shot classification, but are still quite far from the flexibility that humans display.
You might reasonably object that humans have rich priors built from years of lived experience, as well as innate knowledge baked in by evolution; in contrast, a neural network has to learn from scratch. The authors agree: in their view, the challenge is how to imbue rich priors into artificial agents, so that they too can exhibit these impressive behaviors that humans show. Their preferred approach is to take inspiration from human learning and intelligence as much as possible. In this paper, they identify three main ingredients to recreate that flexibility, and provide an overview of the existing literature:
1. Developmental start-up software: This refers to the basic capabilities that children have, even before they learn language. These are called “intuitive theories” in cognitive science; think of “intuitive physics” and “intuitive psychology” theories.
2. Model building: Neural networks primarily work via pattern matching, but in order to get human-level flexibility, you will need to build models: this enables flexibility because the same model can be used for a variety of different tasks. (For example, you can reuse your understanding of the environment transitions in Frostbite when the reward function changes.) Models need to be compositional, that is, the representations should be capable of being composed with each other to provide new semantically meaningful representations. For example, for handwritten characters, the representation of a character should be the composition of the representations of the individual pen strokes used to make the character. The authors also highlight causality and learning to learn as important.
3. Thinking fast: One major drawback of models is that drawing conclusions from them often requires slow, complex inference algorithms. But human thinking is actually quite fast; just think of how quickly we can understand a visual scene. How can we get this property as well? First, we can use approximate inference algorithms to get answers much more quickly (in fact, one line of work distills the inference algorithm into a fast neural network for even more speed). Second, we can combine model-based and model-free algorithms together; for example, we might use a model-based algorithm for flexibility but then use the data generated by that algorithm to train a model-free method that can run faster.
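The second idea above (amortizing a slow model-based procedure into a fast model-free one) can be sketched in a few lines. Everything below is illustrative and not from the paper: a toy chain world, a deliberately slow breadth-first planner, and a "distilled" lookup-table policy standing in for a trained network.

```python
from collections import deque

# Toy deterministic world: states 0..N-1, goal at N-1,
# actions move left (-1) or right (+1). (Illustrative, not from the paper.)
N = 10
GOAL = N - 1

def step(state, action):
    return max(0, min(N - 1, state + action))

def plan(state):
    """Slow model-based inference: breadth-first search to the goal,
    returning the first action of a shortest plan."""
    frontier = deque([(state, None)])
    seen = {state}
    while frontier:
        s, first_action = frontier.popleft()
        if s == GOAL:
            return first_action
        for a in (-1, +1):
            s2 = step(s, a)
            if s2 not in seen:
                seen.add(s2)
                frontier.append((s2, first_action if first_action is not None else a))
    return +1  # unreachable in this toy world

# "Distillation": run the slow planner once per state and cache the result
# as a fast reactive policy (a neural net would play this role at scale).
fast_policy = {s: plan(s) for s in range(N)}

# At test time the distilled policy answers in O(1), with no search.
assert all(fast_policy[s] == +1 for s in range(GOAL))
```

The same pattern appears in, e.g., expert iteration: the search procedure is the teacher, and the cheap reactive policy is the student trained on its outputs.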
Rohin's opinion: I really like this paper from the point of view of illustrating an alternative paradigm to building powerful AI systems that isn’t based on scaling up neural networks. You might have picked up from the last few newsletters that I generally do expect us to build powerful AI systems by scaling up neural networks, so you might expect that I disagree with this paper. This is only partially true. I do in fact think that many of the skills mentioned in this paper will emerge by training very large neural networks on diverse datasets; indeed we’re already seeing this with few-shot learning (AN #102). However, this likely only happens at what would be truly mind-boggling amounts of compute today: in order for this to be remotely feasible, we need to have exponential improvements in hardware cost and algorithmic efficiency (AN #121). It is plausible to me that some of the needed improvements in algorithmic efficiency will come through ideas similar to the ones in this paper: for example, just as CNNs provided a useful inductive bias of translation-invariance, perhaps we get a new architecture that has an inductive bias towards compositionality or causality.
Human-Level Reinforcement Learning through Theory-Based Modeling, Exploration, and Planning (Pedro A. Tsividis et al) (summarized by Rohin): Deep reinforcement learning algorithms require many more samples to learn a new game than a human would need: humans have rich priors and theories of how games work that allow them to perform directed exploration and quickly learn the rules of the game. This paper hypothesizes that by providing agents with this rich prior knowledge, we can create agents that learn to play new games as quickly as humans do. The two main ingredients are (1) allowing agents to reason directly over objects, agents, physics and goals (rather than pixels) and (2) using algorithms designed to exploit this prior knowledge. In particular, given this well structured space, they propose EMPA, which uses three main algorithms to exploit the prior knowledge:
Model learning: The agent maintains a distribution over possible game mechanics and updates it using Bayes' rule as it takes more actions. This allows it to quickly learn that certain objects tend to kill you, whereas deep RL may require thousands of interactions in order to do the same.
Exploration: Exploration is important to the extent that it allows the agent to reduce its uncertainty over the game mechanics. Since we have a distribution over the game mechanics, we could explore in a way that best reduces the uncertainty in that distribution. But in fact our prior knowledge allows us to do something simpler: we just set “exploration subgoals” that seek to cause a collision between two objects (one of which could be the agent’s avatar).
Planning: The planning module chooses actions to take in order to achieve some goal or subgoal (note that the subgoals can be set by the exploration algorithm). It uses search algorithms to find such plans.
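The model-learning step can be illustrated with a deliberately tiny hypothesis space. The "spider" mechanic, the prior, and the likelihood numbers below are all made up for this sketch; EMPA's actual hypothesis space over game mechanics is far richer.

```python
# Two hypotheses about one game mechanic: does touching a spider kill you?
prior = {"spider_kills": 0.5, "spider_harmless": 0.5}

# Likelihood of observing each outcome after touching a spider, under each
# hypothesis (death can have other causes, hence the 0.1). Illustrative numbers.
likelihood = {
    "spider_kills": {"died": 0.9, "survived": 0.1},
    "spider_harmless": {"died": 0.1, "survived": 0.9},
}

def bayes_update(belief, observation):
    """One step of Bayes' rule: reweight each hypothesis by its likelihood."""
    unnormalized = {h: belief[h] * likelihood[h][observation] for h in belief}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

posterior = prior
for obs in ["died", "died"]:  # two fatal spider collisions
    posterior = bayes_update(posterior, obs)

# Two observations already make "spider_kills" ~99% likely, where a
# model-free learner would need many more interactions.
assert posterior["spider_kills"] > 0.95
```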
They evaluate the agent on a variety of games similar to those in Atari. (I assume they could not evaluate on Atari because they can’t easily extract the required prior knowledge from the Atari game engine.) They find that the agent learns to play the games about as fast as humans do, which in turn is much faster than deep RL algorithms. In addition, the gameplay looks more human-like: for example, both EMPA and humans don’t collide with walls very much, whereas deep RL algorithms collide a lot.
Rohin's opinion: This seems like a great example of the approach suggested in the previous paper.
TECHNICAL AI ALIGNMENT
What the hell is going on inside neural networks (Rob Wiblin and Chris Olah) (summarized by Rohin): This podcast covers a significant chunk of work in understanding neural networks, including circuits (AN #142) and multimodal neurons (AN #142), as well as high-level thoughts such as advantages of neural net interpretability over neuroscience and why larger models may be more interpretable (AN #72). Some interesting points I haven’t made in this newsletter before:
1. Interpretability as a field is fractured into several different mini-paradigms. The author’s paradigm might be described as “mechanistic interpretability”, where you try to “fully understand” the neural network from the ground up. An ML-based paradigm is interested in defining good “interpretability metrics” that can then be optimized. An HCI-based paradigm is interested in developing techniques that show good results based on user evaluations (e.g. people can better predict network outputs).
2. Scaling up mechanistic interpretability does seem possible, because (a) as models get larger their features plausibly get crisper and easier to understand, and (b) there are motifs (such as equivariance in curve circuits) that allow you to reduce the number of neurons you have to understand by over an order of magnitude. However, neurons can be polysemantic, where they encode multiple features at once; this could pose a significant challenge for mechanistic interpretability. (While current features encoded in polysemantic neurons will probably become crisper as models scale up, we might expect that the scaled up models will have new polysemantic neurons that encode multiple more abstract features.)
3. One aesthetically pleasing aspect of the mechanistic interpretability approach is that, in the world where we succeed, humans could plausibly “keep up” with the neural nets and understand these advanced concepts that the networks have, rather than living happy lives but being unable to comprehend what is going on in the world around them. See also Using Artificial Intelligence to Augment Human Intelligence.
You may also want to check out this followup podcast in which Chris talks about his unconventional career path.
What 2026 looks like (Daniel Kokotajlo) (summarized by Rohin): This post describes the author’s median expectations around AI from now until 2026. It is part I of an attempt to write a detailed plausible future trajectory in chronological order, i.e. incrementally adding years to the story rather than writing a story with the end in mind. The hope is to produce a nice complement to the more abstract discussions about timelines and takeoff that usually occur. For example, there are discussions about how AI tools are used by nations for persuasion, propaganda and censorship.
Human modeling in AGI (Scott Garrabrant and Rohin Shah) (summarized by Rohin): This is a conversation between Scott and me about the relative dangers of human modeling (AN #52), moderated by Eli Tyre. From a safety perspective, the main reason to avoid human modeling is that the agent's cognition will be much "further" away from manipulation of humans; for example, it seems more unlikely that your AI system tricks people into launching nukes if it never learned very much about humans in the first place. The main counterargument is that this precludes using human oversight of agent cognition (since when humans are overseeing the agent's cognition, then the agent is likely to learn about humans in order to satisfy that oversight); this human oversight could plausibly greatly increase safety. It also seems like systems that don't model humans will have a hard time performing many useful tasks, though the conversation mostly did not touch upon this point.
Scott's position is that given there are these two quite different risks (manipulation worries vs. learning the wrong cognition due to poor oversight), it seems worthwhile to put some effort into addressing each risk, and avoiding human models is much more neglected than improving human oversight. My position is that it seems much less likely that there is a plausible success path where we do very little human modeling, and so I want a lot more work along the oversight path. I do think that it is worth differentially pushing AI systems towards tasks that don't require much human modeling, e.g. physics and engineering, rather than ones that do, e.g. sales and marketing, but this seems roughly independent of technical work, at least currently.
OTHER PROGRESS IN AI
The Benchmark Lottery (Mostafa Dehghani, Yi Tay, Alexey A. Gritsenko et al) (summarized by Rohin): This paper argues that new machine learning methods participate in a benchmark lottery, that is, our evaluation of a specific method depends in large part on the choice of benchmark on which the method is evaluated, independently of how good the method “actually” is. The authors identify three main sources of such bias:
1. Task selection bias: This is exactly what it sounds like: the evaluation of a method will often depend quite strongly on exactly which tasks in a benchmark it is evaluated on. For example, when evaluating 55 models on SuperGLUE, there are six different models that achieve the top place on at least one task; so if we only chose one task to evaluate models it would be random luck that determines which of those models we would deem “best”. The paper has lots of additional examples and quantifications of the strength of the bias.
2. Community bias: The research community often settles on a particular benchmark on which new methods must be evaluated (or else the paper will be rejected). This decision often happens without any explicit reasoning about which benchmark or tasks should be part of this community standard. This can end up adding bias that privileges some methods over others for reasons unrelated to how “good” the methods are. For example, language models are expected to evaluate on GLUE, but 7 out of the 8 tasks in GLUE are “matching” tasks that require modeling the relationship between multiple sequences. This privileges certain models: for example, Transformers likely perform significantly better on such tasks due to the cross-attention in the encoder.
3. Benchmark state: In the course of solving a benchmark, researchers will pick up lots of little benchmark-specific tricks that then must be incorporated any time anyone is trying to set a new best performance. However, these tricks may “take away” some of the gains that a more general method could have had: for example, in an RL benchmark a trick for reducing the action space is likely to “take away” some of the gains that might be had from a hierarchical RL approach. Put another way, the benchmark has “state”: early on, the hierarchical RL method might look quite good, but after the discovery of the action reduction trick, the method no longer looks good; the hierarchical method thus has to be “lucky” enough to be tested before the action reduction trick is known.
Note though that it is even worse if there is no standard benchmark: in this case authors can (deliberately or not) choose exactly those tasks that make their method look best.
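A toy version of the task selection bias can be computed directly. The scores below are made up for illustration (they are not from the paper): each single-task "benchmark" crowns a different winner, so the choice of tasks partly determines which model we deem best.

```python
# Made-up scores: three models evaluated on three tasks.
scores = {
    "model_A": {"task1": 0.90, "task2": 0.60, "task3": 0.70},
    "model_B": {"task1": 0.70, "task2": 0.90, "task3": 0.65},
    "model_C": {"task1": 0.60, "task2": 0.65, "task3": 0.90},
}

def best_model(tasks):
    """Rank models by mean score over the chosen subset of tasks."""
    return max(scores, key=lambda m: sum(scores[m][t] for t in tasks) / len(tasks))

# Each single-task benchmark picks a different "best" model...
assert best_model(["task1"]) == "model_A"
assert best_model(["task2"]) == "model_B"
assert best_model(["task3"]) == "model_C"
# ...while the full suite picks yet another ordering.
assert best_model(["task1", "task2", "task3"]) == "model_B"
```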
To mitigate these problems, the authors make the following suggestions:
1. Invest in making guidelines for how to make benchmarks.
2. Benchmark creators should ensure that there are good guidelines for how to use the benchmark to avoid the situation where everyone evaluates methods slightly differently.
3. When reviewing papers, do not require authors to beat the existing state of the art (SOTA) if their method is especially novel, as it is likely disadvantaged by not being able to apply all the small tricks that improve performance on the benchmark.
4. Use statistical significance testing to compare models rather than looking just at point estimates.
5. Use multiple benchmarks, or multiple test sets within a single benchmark, to enable statistical testing.
6. Create “living benchmarks” in which various aspects (such as the test set) are updated over time, to prevent overfitting to the benchmark.
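Suggestions 4-6 fit together: once you have multiple test sets, a paired bootstrap over per-test-set score differences gives an uncertainty estimate rather than a bare point comparison. The scores below are made up for this sketch; the resampling procedure is a standard paired bootstrap, not something prescribed by the paper.

```python
import random

# Made-up per-test-set scores for two models on 8 test sets.
model_a = [0.81, 0.74, 0.79, 0.83, 0.77, 0.80, 0.75, 0.82]
model_b = [0.78, 0.75, 0.76, 0.80, 0.74, 0.79, 0.76, 0.79]

def paired_bootstrap_pvalue(a, b, n_resamples=10_000, seed=0):
    """Approximate P(mean score difference <= 0) by resampling the
    paired per-test-set differences with replacement."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    worse = 0
    for _ in range(n_resamples):
        sample = [rng.choice(diffs) for _ in diffs]
        if sum(sample) <= 0:
            worse += 1
    return worse / n_resamples

p = paired_bootstrap_pvalue(model_a, model_b)
# A small p suggests model_a's advantage is not just test-set luck;
# a point-estimate comparison alone would hide this uncertainty.
assert 0.0 <= p <= 1.0
```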
Rohin's opinion: I like the descriptions of the problems in this paper. I also like the proposed solutions, though they mostly serve to cut down on problems that weren’t the main focus of the paper. Unfortunately, my guess is that there aren’t great not-too-radical solutions to the problems identified by the authors. Still, these seem like important problems to be aware of when interpreting progress in machine learning.
I wasn’t that convinced that the task selection bias is that large. The metrics in the paper were rather hard to interpret -- they clearly show that rankings of models can change depending on which tasks you select, but it was harder to tell how much the rankings changed. In addition, for at least some of these benchmarks, the point of the tasks is to test different skills and so it shouldn’t be surprising that you can get significantly different rankings if you can choose a subset of the tasks. (Often in such cases papers will be expected to test on all the tasks, so that the task selection bias doesn’t occur.)
Introducing the AI Objectives Institute (Peter Eckersley) (summarized by Rohin): For years people have been talking about corporations and capitalism as an example of superintelligence that we have failed to align so far. This new institute plans to take this correspondence seriously and transfer insights between the two. In particular, we can (a) examine how proposed problems with AI are already taking place with capitalism, (b) use tools and ideas from AI safety to improve upon capitalism, and (c) use lessons from capitalism to assist in the project of building a safely aligned AI.
ML Engineer Position at Preamble (Dylan Hadfield-Menell) (summarized by Rohin): Preamble is a seed-stage company aiming to build middleware for AI ethics and safety, with a current focus on recommender systems. They have an early prototype for Twitter users, implemented as a browser extension. They are currently trying to hire an ML engineer to push forward their work.
I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.
An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.