Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.
Why Build an Assistant in Minecraft? (Arthur Szlam et al): This position paper proposes a new challenge for AI research: building a bot that can provide assistance in Minecraft (creative mode). A companion paper presents an initial setup for such an agent.
The main goal here is to advance natural language understanding, intent inference and instruction following. As a result, there is no formal specification like a reward function -- in their own words, "the ultimate goal of the bot is to be a useful and fun assistant in a wide variety of tasks specified and evaluated by human players". They chose Minecraft in particular partly because it has a very rich space of tasks, even though the execution of any given task is relatively straightforward. They script many low level policies to automate this execution in order to make learning easier (for example, they have policies to navigate to a location or to build specified structures) and focus the learning challenge on figuring out what the user wants.
The current version of the bot takes dialogue from the user and uses a neural model to parse it into an action dictionary that unambiguously specifies what the agent should do -- I think this neural model is the main thing to be learned. There are a bunch of details on how the rest of the modules work as well. They have also released three datasets: a semantic parsing dataset that associates instructions with action dictionaries, a house dataset that has trajectories where a human builds a house, and a semantic segmentation dataset that labels various parts of houses.
Rohin's opinion: I'm really excited to see a project that is very directly aimed at inferring end user intent in a complex environment. This seems like a great direction for the field to move towards. I think Minecraft would also be great as a test bed for the setting in which researchers or engineers (as opposed to end users) are trying to get an agent to do something: we can assume more expertise and knowledge here. Ideally, this would allow us to solve more complex tasks than can be accomplished with natural language from end users. I personally plan to do work with Minecraft along these lines.
While this project does need to infer intent, it probably won't require the sort of pragmatic understanding shown by e.g. Cooperative IRL. Even understanding what the human is literally asking for in Minecraft is currently beyond our capabilities.
Technical AI alignment
Learning human intent
Ranking-Based Reward Extrapolation without Rankings (Daniel S. Brown et al) (summarized by Cody): A while back, these authors released the T-REX paper (AN #54), where they showed that providing ranked sets of trajectories, rather than one single optimal trajectory, lets you learn a more accurate reward that can outperform the demonstrator. This ability to outperform the demonstrator is rooted in the ability to extrapolate predicted reward outside of demonstrated points, and that ability to extrapolate comes from the fact that ranked trajectories provide more information about relative reward values. This paper is a fairly straightforward extension of that one, and asks: can we get similar benefits without requiring humans to actually rank trajectories? The authors argue that they can replicate T-REX's ability to outperform the demonstrator by simply learning a behaviorally cloned policy off of a single (potentially sub-optimal) demonstrator, and making that policy gradually worse by adding more noise to it. This model is called D-REX, for Disturbance-based Reward EXtrapolation. They then make an assumption that more noise in the policy corresponds to less reward, and use that as a ranking scheme to throw into the existing T-REX algorithm.
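To make the noise-as-ranking idea concrete, here is a minimal sketch under stated assumptions, not the authors' implementation. It assumes a toy 1-D chain environment, a "cloned" policy that always steps +1, epsilon-random noise injection, and a one-parameter linear reward on the state. All function names are hypothetical. Trajectories are ranked purely by the noise level that produced them, and a reward is then fit with a pairwise (Bradley-Terry style) ranking loss of the kind T-REX uses.

```python
import math
import random

def rollout(epsilon, horizon, rng):
    """Roll out the toy cloned policy (always +1), acting randomly with prob. epsilon."""
    state, states = 0, []
    for _ in range(horizon):
        action = rng.choice([-1, 1]) if rng.random() < epsilon else 1
        state += action
        states.append(state)
    return states

def ranked_pairs(noise_levels, per_level, horizon, rng):
    """Build (worse, better) feature pairs, ranking only by noise level."""
    scale = horizon * (horizon + 1) / 2  # summed state of the noise-free rollout
    by_level = [
        [sum(rollout(eps, horizon, rng)) / scale for _ in range(per_level)]
        for eps in sorted(noise_levels)
    ]
    pairs = []
    for i, better_group in enumerate(by_level):
        for worse_group in by_level[i + 1:]:
            # Assumption: lower noise means a better trajectory.
            pairs += [(worse, better) for worse in worse_group for better in better_group]
    return pairs

def pairwise_loss(w, pairs):
    """Mean of -log sigmoid(R(better) - R(worse)), with R = w * feature."""
    return sum(math.log1p(math.exp(-w * (b - a))) for a, b in pairs) / len(pairs)

def fit_reward(pairs, lr=0.5, steps=200):
    """Gradient descent on the pairwise loss for the single reward weight w."""
    w = 0.0
    for _ in range(steps):
        grad = sum(-(b - a) / (1.0 + math.exp(w * (b - a))) for a, b in pairs)
        w -= lr * grad / len(pairs)
    return w

rng = random.Random(0)
pairs = ranked_pairs([0.0, 0.25, 0.5, 0.75, 1.0], per_level=5, horizon=20, rng=rng)
w = fit_reward(pairs)
```

In this toy setting the learned weight comes out positive, meaning the reward prefers states further along the chain than the noisy rollouts reach. That is the extrapolation the ranking is supposed to buy, but it only works to the extent that more noise really does mean less true reward.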
Cody's opinion: Overall, I think this is potentially a straightforward and clever trick for giving your imitation learner more informative data to learn off of. I have two main questions. First off, I'd have loved to see D-REX compared directly to T-REX, to get a sense of how much you lose from this approximate ranking strategy rather than a more ground truth one. And, secondly, I'd have appreciated a bit more justification of their assumption that noisier actions will consistently lead to a worse policy, in ways that capture reward information. This doesn't seem obviously untrue to me; I'd just love some more intuition on why we can get additional information about underlying reward just by adding noise.
SQIL: Imitation Learning via Regularized Behavioral Cloning (Siddharth Reddy et al) (summarized by Cody): Behavioral Cloning is one of the most direct forms of imitation learning: it learns to predict the action the expert would have taken in a given state of the world. A clear weakness of the approach is that, if cloning models are only trained on pairs of (state, expert action) drawn from the expert's policy distribution, that means the model is underconstrained and thus likely to have high error on states that would have been unseen or just highly unlikely to be visited by the expert. This weakness means that errors within behavioral cloning systems can compound: if the system takes an incorrect action that leads it to a state it never saw the expert in, it will have a difficult time knowing what to do there.
The main contribution of this paper is to suggest a fix for this weakness, by learning a Q function to represent expert behavior, and by penalizing the model for being in states where its temporal difference error on the Q function (otherwise known as the Bellman error) is high. Intuitively, the hope is that this term, which can also be seen as a reward for being in states the expert has seen more frequently (equivalently, states where the model had more training experience), will propagate outward, and give the model a loss surface that pulls it back into states where its predictions are more confident.
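The Bellman-error penalty can be illustrated with a small tabular sketch. This is a hedged illustration rather than the paper's code. The Q-table layout, the function names and the discount are assumptions. The reward convention (1 for demonstration transitions, 0 for the agent's own transitions) is not stated in the summary above and reflects one reading of the SQIL setup. The target uses a soft maximum over next actions, as in soft Q-learning.

```python
import math

def soft_value(q_row):
    """Soft maximum over actions: log sum_a exp(Q(s, a)), computed stably."""
    top = max(q_row)
    return top + math.log(sum(math.exp(q - top) for q in q_row))

def squared_soft_bellman_error(q, transition, reward, gamma=0.9):
    """Squared TD error for one (state, action, next_state) transition."""
    state, action, next_state = transition
    target = reward + gamma * soft_value(q[next_state])
    return (q[state][action] - target) ** 2

def sqil_style_loss(q, demo_transitions, agent_transitions, gamma=0.9):
    """Mean squared soft Bellman error, with reward 1 on demos and 0 elsewhere.

    The constant rewards are an assumption of this sketch (see the lead-in).
    """
    demo = sum(squared_soft_bellman_error(q, t, 1.0, gamma) for t in demo_transitions)
    agent = sum(squared_soft_bellman_error(q, t, 0.0, gamma) for t in agent_transitions)
    return demo / max(len(demo_transitions), 1) + agent / max(len(agent_transitions), 1)

# Two states, two actions, Q initialised to zero.
q_table = [[0.0, 0.0], [0.0, 0.0]]
loss = sqil_style_loss(q_table, demo_transitions=[(0, 1, 1)], agent_transitions=[(1, 0, 0)])
```

Minimizing a loss of this shape pushes Q-values up along transitions the expert took and leaves them low elsewhere, so value propagates backward toward demonstrated states through the bootstrapped target.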
Cody's opinion: I still have a personal sense that Behavioral Cloning is too brittle of a conceptual frame to build really robust imitative agents with, but this seems like a clever and relatively clean way to build in a bias towards high-confidence states. I find myself wondering if the same general idea of penalizing being in high-model-error states could be more broadly applied as a sort of regularizer in other off-policy settings where exploration can be risky.
Research Agenda v0.9: Synthesising a human's preferences into a utility function (Stuart Armstrong): One approach to AI alignment involves learning a specification of human values that can then be optimized. This agenda proposes that we learn an adequate representation of values (i.e. not ambitious value learning (AN #31)). We first obtain partial preferences and associated weights from human mental models whose symbols have been adequately grounded. Calling these "preferences" is a normative assumption to avoid an impossibility result in value learning (AN #31): the hope is that the AI could correct for incorrect human beliefs. The preferences are then extended to all possible states, and are normalized so that they are comparable to each other, and then synthesized into a utility function that the AI can optimize.
The partial preferences are divided into a few categories: individual preferences, preferences about the rest of the world, and meta-preferences, some of which can be about the synthesis procedure itself. The hope is that further categories of preferences would be handled by the synthesis procedure; these categories are the ones that seem most important to get right, or couldn't be obtained any other way.
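As a purely illustrative toy, and not the agenda's actual procedure, the "normalize, then synthesize" step can be sketched as a weighted sum of rescaled partial preferences. The min-max normalization, the finite state set, the example preferences and all names below are assumptions made for the example. The grounding of symbols and the extension of preferences to all states are taken as already done.

```python
def normalize(partial_pref, states):
    """Rescale one partial preference to [0, 1] over all states (min-max: an assumption)."""
    values = [partial_pref[s] for s in states]
    low, high = min(values), max(values)
    if high == low:
        return {s: 0.0 for s in states}  # an indifferent preference contributes nothing
    return {s: (partial_pref[s] - low) / (high - low) for s in states}

def synthesize_utility(weighted_prefs, states):
    """Weighted sum of normalized partial preferences, returned as state -> utility."""
    utility = {s: 0.0 for s in states}
    for weight, pref in weighted_prefs:
        normalized = normalize(pref, states)
        for s in states:
            utility[s] += weight * normalized[s]
    return utility

# Hypothetical partial preferences, already extended to every state.
states = ["rest", "work", "exercise"]
comfort = {"rest": 10.0, "work": 2.0, "exercise": 0.0}
health = {"rest": 0.0, "work": 1.0, "exercise": 4.0}
utility = synthesize_utility([(1.0, comfort), (2.0, health)], states)
best_state = max(utility, key=utility.get)
```

Even in this toy the choice of normalization changes which state wins, which is one reason that step carries real weight in the agenda.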
Rohin's opinion: See the next entry.
Some Comments on Stuart Armstrong's "Research Agenda v0.9" (Charlie Steiner): This post makes two main critiques of the research agenda in the previous entry. First, the research agenda involves a lot of human-designed features and modules, but The Bitter Lesson (AN #49) is that machine learning tends to shine with highly abstract large models that can make use of a lot of compute. Second, the symbol grounding part of the agenda requires the AI system to develop representations of the world that match the representations that humans use, and we have no idea how to do that, or even what it would mean to "match human representations" when the AI is more intelligent than humans. The post also includes some more specific comments that I'm not summarizing.
Rohin's opinion: I agree with both of these critiques, especially the one about the bitter lesson. It seems like Stuart's approach imposes a particular structure or algorithm for how to synthesize the utility function; I am generally skeptical of such approaches. Also, as you might already know, I think it is neither necessary nor sufficient for AI alignment to find a utility function or "goal" that the AI can safely optimize. Since this promises to be a very difficult enterprise (Section 0.2 notes that it aims to "solve at least 5 major open problems in philosophy, to a level rigorous enough that we can specify them in code"), I prefer to look into other approaches that seem more tractable.
I do think that the problems that motivate the various aspects of the agenda are important and useful to think about, and I am happy that they have all been put into this single post. I also like the fact that the research agenda is directly aiming for a full solution to AI alignment.
IRL in General Environments (Michael Cohen)
Musings on Cumulative Cultural Evolution and AI (calebo): A recent paper develops a conceptual model that retrodicts human social learning. They assume that asocial learning allows you to adapt to the current environment, while social learning allows you to copy the adaptations that other agents have learned. Both can be increased by making larger brains, at the cost of increased resource requirements. What conditions lead to very good social learning?
First, we need high transmission fidelity, so that social learning is effective. Second, we need some asocial learning, in order to bootstrap -- mimicking doesn't help if the people you're mimicking haven't learned anything in the first place. Third, to incentivize larger brains, the environment needs to be rich enough that additional knowledge is actually useful. Finally, we need low reproductive skew, that is, individuals that are more adapted to the environment should have only a slight advantage over those who are less adapted. (High reproductive skew would select too strongly for high asocial learning.) This predicts pair bonding rather than a polygynous mating structure.
This story cuts against the arguments in Will AI See Sudden Progress? and Takeoff speeds: it seems like evolution "stumbled upon" high asocial and social learning and got a discontinuity in reproductive fitness of species. We should potentially also expect discontinuities in AI development.
We can also forecast the future of AI based on this story. Perhaps we need to be watching for the perfect combination of asocial and social learning techniques for AI, and once these components are in place, AI intelligence will develop very quickly and autonomously.
Rohin's opinion: As the post notes, it is important to remember that this is one of many plausible accounts for human success, but I find it reasonably compelling. It moves me closer to the camp of "there will likely be discontinuities in AI development", but not by much.
I'm more interested in what predictions about AI development we can make based on this model. I actually don't think that this suggests that AI development will need both social and asocial learning: it seems to me that in this model, the need for social learning arises because of the constraints on brain size and the limited lifetimes. Neither of these constraints applies to AI -- costs grow linearly with "brain size" (model capacity, maybe also training time) as opposed to superlinearly for human brains, and the AI need not age and die. So, with AI I expect that it would be better to optimize just for asocial learning, since you don't need to mimic the transmission across lifetimes that was needed for humans.
The AI Timelines Scam (Jessica Taylor): This post argues that AI researchers and AI organizations have an incentive to predict that AGI will come soon, since that leads to more funding, and so we should expect timeline estimates to be systematically too short. Besides the conceptual argument, we can also see this in the field's response to critics: both historically and now, criticism is often met with counterarguments based on "style" rather than engaging with the technical meat of the criticism.
Rohin's opinion: I agree with the conceptual argument, and I think it does hold in practice, quite strongly. I don't really agree that the field's response to critics implies that they are biased towards short timelines -- see these comments. Nonetheless, I'm going to do exactly what this post critiques, and say that I put significant probability on short timelines, but not explain my reasons (because they're complicated and I don't think I can convey them, and certainly can't convey them in a small number of words).
Jeff Hawkins on neuromorphic AGI within 20 years (steve2152)
AI strategy and policy
How Europe might matter for AI governance (Stefan Torges)
Other progress in AI
Large Scale Adversarial Representation Learning (Jeff Donahue et al) (summarized by Cody): The BigGAN paper, published last September, used a much larger model (and a handful of optimization tricks to facilitate training it) to achieve a huge leap forward in the quality of generated images. However, it was unclear from the earlier paper whether this improvement in generation quality would also be tied to an increase in the model's usefulness as a source of unsupervised semantic representations of images. This paper set out to answer that question by taking an existing technique for learning representations with GANs - called BiGAN - and combining it with the BigGAN architecture, which hadn't been available when BiGAN was originally published. BiGAN, short for Bidirectional GAN, works by learning both a latent space to image transformation, and also an image to latent space encoder, and then enforcing that pairs of (latent, image) from these two distributions be indistinguishable from one another. They evaluated the quality of learned representations by measuring the performance of a linear model trained using the encoder's learned latent vectors as input, and found that a BiGAN trained with a BigGAN architecture performs better than one trained with a smaller architecture.
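The linear evaluation protocol mentioned above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's setup. `frozen_encoder` is a hypothetical hand-written stand-in for a trained encoder, the "images" are short lists of pixel intensities from two made-up classes, and the linear model is a plain logistic regression. The point it shows is that the encoder stays fixed and only the linear layer on top of its latents is trained.

```python
import math
import random

def frozen_encoder(image):
    """Hypothetical stand-in for a trained encoder: maps pixels to a 2-d latent."""
    return [sum(image) / len(image), max(image) - min(image)]

def make_dataset(n_per_class, rng, n_pixels=16):
    """Toy images: class 0 is dark (pixels in [0, 0.4]), class 1 is bright ([0.6, 1])."""
    data = []
    for label, low in ((0, 0.0), (1, 0.6)):
        for _ in range(n_per_class):
            data.append(([rng.uniform(low, low + 0.4) for _ in range(n_pixels)], label))
    rng.shuffle(data)
    return data

def train_linear_probe(latents, labels, lr=0.5, epochs=200):
    """Logistic regression on fixed latents; the encoder is never updated."""
    weights, bias = [0.0] * len(latents[0]), 0.0
    for _ in range(epochs):
        for z, y in zip(latents, labels):
            logit = sum(w * x for w, x in zip(weights, z)) + bias
            error = 1.0 / (1.0 + math.exp(-logit)) - y
            weights = [w - lr * error * x for w, x in zip(weights, z)]
            bias -= lr * error
    return weights, bias

def probe_accuracy(weights, bias, latents, labels):
    """Fraction of examples the linear probe classifies correctly."""
    correct = 0
    for z, y in zip(latents, labels):
        logit = sum(w * x for w, x in zip(weights, z)) + bias
        correct += int((logit > 0) == (y == 1))
    return correct / len(labels)

def evaluate_encoder(encoder, train_set, test_set):
    """Encode both splits, fit the probe on train latents, score on test latents."""
    train_z = [encoder(img) for img, _ in train_set]
    test_z = [encoder(img) for img, _ in test_set]
    weights, bias = train_linear_probe(train_z, [y for _, y in train_set])
    return probe_accuracy(weights, bias, test_z, [y for _, y in test_set])

rng = random.Random(0)
accuracy = evaluate_encoder(frozen_encoder, make_dataset(20, rng), make_dataset(20, rng))
```

Because the probe is linear, its accuracy is a rough measure of how linearly separable the classes already are in the encoder's latent space, which is why it serves as a proxy for representation quality.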
Cody's opinion: I really liked this paper; it was cleanly written, conceptually straightforward, and did a generally useful scientific service of checking whether an advance in one area might change our beliefs about a previous result. I particularly enjoyed looking at the "reconstructed" images they got by running their encoder and then generator: more so than anything I recall seeing from a VAE pixel-based reconstructor, this model seems to be treating images as valid reconstructions of one another if they're of the same class (i.e. two pizzas) even if the colors and low level detail are different. This makes reasonable sense if you think that those two pizzas are probably nearby in latent space, and so each is a plausible reconstruction of each other's latent space encoding, but it's still cool to see concretely borne out.
Join our rapidly growing research teams (Tanya Singh): The Future of Humanity Institute is hiring researchers across a wide range of topics, including AI safety and strategy. The deadline to apply is midday August 16.