Alignment Newsletter #34

Rohin Shah

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through the database of all summaries.

Highlights

Scalable agent alignment via reward modeling (Jan Leike): This blog post and the associated paper outline a research direction that DeepMind's AGI safety team is pursuing. The key idea is to learn behavior by learning a reward and a policy simultaneously, from human evaluations of outcomes, which can scale to superhuman performance in tasks where evaluation is easier than demonstration. However, in many cases it is hard for humans to evaluate outcomes: in this case, we can train simpler agents using reward modeling that can assist the human in evaluating outcomes for the harder task, a technique the authors call recursive reward modeling. For example, if you want to train an agent to write a fantasy novel, it would be quite expensive to have a human evaluate outcomes, i.e. rate how good the produced fantasy novels are. We could instead use reward modeling to train agents that can produce plot summaries, assess prose quality and character development, etc. which allows a human to assess the fantasy novels. There are several research challenges, such as what kind of feedback to get, making it sufficiently sample efficient, preventing reward hacking and unacceptable outcomes, and closing the reward-result gap. They outline several promising approaches to solving these problems.

Rohin's opinion: The proposal sounds to me like a specific flavor of narrow value learning, where you learn reward functions to accomplish particular tasks, rather than trying to figure out the "true human utility function". The recursive aspect is similar to iterated amplification and debate. Iterated amplification and debate can be thought of as operating on a tree of arguments, where each node is the result of considering many child nodes (the considerations that go into the argument). Importantly, the child nodes are themselves arguments that can be decomposed into smaller considerations. Iterated amplification works by learning how to compose and decompose nodes from children, while debate works by having humans evaluate a particular path in the argument tree. Recursive reward modeling instead uses reward modeling to train agents that can help evaluate outcomes on the task of interest. This seems less recursive to me, since the subagents are used to evaluate outcomes, which would typically be a different-in-kind task than the task of interest. This also still requires the tasks to be fast -- it is not clear how to use recursive reward modeling to eg. train an agent that can teach math to children, since it takes days or months of real time to even produce outcomes to evaluate. These considerations make me a bit less optimistic about recursive reward modeling, but I look forward to seeing future work that proves me wrong.

The post also talks about how reward modeling allows us to separate what to do (reward) from how to do it (policy). I think it is an open question whether this is desirable. Past work found that the reward generalized somewhat (whereas policies typically don't generalize at all), but this seems relatively minor. For example, rewards inferred using deep variants of inverse reinforcement learning often don't generalize. Another possibility is that the particular structure of "policy that optimizes a reward" provides a useful inductive bias that makes things easier to learn. It would probably also be easier to inspect a specification of "what to do" than to inspect learned behavior. However, these advantages are fairly speculative and it remains to be seen whether they pan out. There are also practical advantages: any advances in deep RL can immediately be leveraged, and reward functions can often be learned much more sample efficiently than behavior, reducing requirements on human labor. On the other hand, this design "locks in" that the specification of behavior must be a reward function. I'm not a fan of reward functions because they're so unintuitive for humans to work with -- if we could have agents that work with natural language, I suspect I do not want the natural language to be translated into a reward function that is then optimized.

Technical AI alignment

Iterated amplification sequence

Prosaic AI alignment (Paul Christiano): It is plausible that we can build "prosaic" AGI soon, that is, we are able to build generally intelligent systems that can outcompete humans without qualitatively new ideas about intelligence. It seems likely that this would use some variant of RL to train a neural net architecture (other approaches don't have a clear way to scale beyond human level). We could write the code for such an approach right now (see An unaligned benchmark from AN #33), and it's at least plausible that with enough compute and tuning this could lead to AGI. However, this is likely to be bad if implemented as stated due to the standard issues of reward gaming and Goodhart's Law. We do have some approaches to alignment such as IRL and executing natural language instructions, but neither of these are at the point where we can write down code that would plausibly lead to an aligned AI. This suggests that we should focus on figuring out how to align prosaic AI.

There are several reasons to focus on prosaic AI. First, since we know the general shape of the AI system under consideration, it is easier to think about how to align it (while ignoring details like architecture, variance reduction tricks, etc. which don't seem very relevant currently). Second, it's important, both because we may actually build prosaic AGI, and because even if we don't the insights gained will likely transfer. In addition, worlds with short AGI timelines are higher leverage, and in those worlds prosaic AI seems much more likely. The main counterargument is that aligning prosaic AGI is probably infeasible, since we need a deep understanding of intelligence to build aligned AI. However, it seems unreasonable to be confident in this, and even if it is infeasible, it is worth getting strong evidence of this fact in order change priorities around AI development, and coordinate on not building an AGI that is too powerful.

Rohin's opinion: I don't really have much to say here, except that I agree with this post quite strongly.

Approval-directed agents: overview and Approval-directed agents: details (Paul Christiano): These two posts introduce the idea of approval-directed agents, which are agents that choose actions that they believe their operator Hugh the human would most approve of, if he reflected on it for a long time. This is in contrast to the traditional approach of goal-directed behavior, which are defined by the outcomes of the action.

Since the agent Arthur is no longer reasoning about how to achieve outcomes, it can no longer outperform Hugh at any given task. (If you take the move in chess that Hugh most approves of, you probably still lose to Gary Kasparov.) This is still better than Hugh performing every action himself, because Hugh can provide an expensive learning signal which is then distilled into a fast policy that Arthur executes. For example, Hugh could deliberate for a long time whenever he is asked to evaluate an action, or he could evaluate very low-level decisions that Arthur makes billions of times. We can also still achieve superhuman performance by bootstrapping (see the next summary).

The main advantage of approval-directed agents is that we avoid locking in a particular goal, decision theory, prior, etc. Arthur should be able to change any of these, as long as Hugh approves it. In essence, approval-direction allows us to delegate these hard decisions to future overseers, who will be more informed and better able to make these decisions. In addition, any misspecifications seem to cause graceful failures -- you end up with a system that is not very good at doing what Hugh wants, rather than one that works at cross purposes to him.

We might worry that internally Arthur still uses goal-directed behavior in order to choose actions, and this internal goal-directed part of Arthur might become unaligned. However, we could even have internal decision-making about cognition be approval-based. Of course, eventually we reach a point where decisions are simply made -- Arthur doesn't "choose" to execute the next line of code. These sorts of things can be thought of as heuristics that have led to choosing good actions in the past, that could be changed if necessary (eg. by rewriting the code).

How might we write code that defines approval? If our agents can understand natural language, we could try defining "approval" in natural language. If they are able to reason about formally specified models, then we could try to define a process of deliberation with a simulated human. Even in the case where Arthur learns from examples, if we train Arthur to predict approval from observations and take the action with the highest approval, it seems possible that Arthur would not manipulate approval judgments (unlike AIXI).

There are also important details on how Hugh should rate -- in particular, we have to be careful to distinguish between Hugh's beliefs/information and Arthur's. For example, if Arthur thinks there's a 1% chance of a bridge collapsing if we drive over it, then Arthur shouldn't drive over it. However, if Hugh always assigns approval 1 to the optimal action and approval 0 to all other actions, and Arthur believes that Hugh knows whether the bridge will collapse, then the maximum expected approval action is to drive over the bridge.

The main issues with approval-directed agents is that it's not clear how to define them (especially from examples), whether they can be as useful as goal-directed agents, and whether approval-directed agents will have internal goal-seeking behavior that brings with it all of the problems that approval was meant to solve. It may also be a problem if some other Hugh-level intelligence gets control of the data that defines approval.

Rohin's opinion: Goal-directed behavior requires an extremely intelligent overseer in order to ensure that the agent is pointed at the correct goal (as opposed to one the overseer thinks is correct but is actually slightly wrong). I think of approval-directed agents as providing the intuition that we may only require an overseer that is slightly smarter than the agent in order to be aligned. This is because the overseer can simply "tell" the agent what actions to take, and if the agent makes a mistake, or tries to optimize a heuristic too hard, the overseer can notice and correct it interactively. (This is assuming that we solve the informed oversight problem so that the agent doesn't have information that is hidden from the overseer, so "intelligence" is the main thing that matters.) Only needing a slightly smarter overseer opens up a new space of solutions where we start with a human overseer and subhuman AI system, and scale both the overseer and the AI at the same time while preserving alignment at each step.

Approval-directed bootstrapping (Paul Christiano): To get a very smart overseer, we can use the idea of bootstrapping. Given a weak agent, we can define a stronger agent that happens from letting the weak agent think for a long time. This strong agent can be used to oversee a slightly weaker agent that is still stronger than the original weak agent. Iterating this process allows us to reach very intelligent agents. In approval-directed agents, we can simply have Arthur ask Hugh to evaluate approval for actions, and in the process of evaluation Hugh can consult Arthur. Here, the weak agent Hugh is being amplified into a stronger agent by giving him the ability to consult Arthur -- and this becomes stronger over time as Arthur becomes more capable.

Rohin's opinion: This complements the idea of approval from the previous posts nicely: while approval tells us how to build an aligned agent from a slightly smarter overseer, bootstrapping tells us how to improve the capabilities of the overseer and the agent.

Humans Consulting HCH (Paul Christiano): Suppose we unroll the recursion in the previous bootstrapping post: in that case, we see that Hugh's evaluation of an answer can depend on a question that he asked Arthur whose answer depends on how Hugh evaluated an answer that depended on a question that he asked Arthur etc. Inspired by this structure, we can define HCH (humans consulting HCH) to be a process that answers question Q by perfectly imitating how Hugh would answer question Q, if Hugh had access to the question-answering process. This means Hugh is able to consult a copy of Hugh, who is able to consult a copy of Hugh, who is able to consult a copy of Hugh, ad infinitum. This is one proposal for how to formally define a human's enlightened judgment.

You could also combine this with particular ML algorithms in an attempt to define versions of those algorithms aligned with Hugh's enlightened judgment. For example, for RL algorithm A, we could define max-HCH_A to be A's chosen action when maximizing Hugh's approval after consulting max-HCH_A.

Rohin's opinion: This has the same nice recursive structure of bootstrapping, but without the presence of the agent. This probably makes it more amenable to formal analysis, but I think that the interactive nature of bootstrapping (and iterated amplification more generally) is quite important for ensuring good outcomes: it seems way easier to control an AI system if you can provide input constantly with feedback.

Fixed point sequence

Fixed Point Discussion (Scott Garrabrant): This post discusses the various fixed point theorems from a mathematical perspective, without commenting on their importance for AI alignment.

Technical agendas and prioritization

Integrative Biological Simulation, Neuropsychology, and AI Safety (Gopal P. Sarma et al): See Import AI and this comment.

Learning human intent

Scalable agent alignment via reward modeling (Jan Leike): Summarized in the highlights!

Adversarial examples

A Geometric Perspective on the Transferability of Adversarial Directions (Zachary Charles et al)

AI strategy and policy

MIRI 2018 Update: Our New Research Directions (Nate Soares): This post gives a high-level overview of the new research directions that MIRI is pursuing with the goal of deconfusion, a discussion of why deconfusion is so important to them, an explanation of why MIRI is now planning to leave research unpublished by default, and a case for software engineers to join their team.

Rohin's opinion: There aren't enough details on the technical research for me to say anything useful about it. I'm broadly in support of deconfusion but am either less optimistic on the tractability of deconfusion, or more optimistic on the possibility of success with our current notions (probably both). Keeping research unpublished-by-default seems reasonable to me given the MIRI viewpoint for the reasons they talk about, though I haven't thought about it much. See also Import AI.

Other progress in AI

Reinforcement learning

Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search (Lars Buesing et al) (summarized by Richard): This paper aims to alleviate the data inefficiency of RL by using a model to synthesise data. However, even when environment dynamics can be modeled accurately, it can be difficult to generate data which matches the true distribution. To solve this problem, the authors use a Structured Causal Model trained to predict the outcomes which would have occurred if different actions had been taken from previous states. Data is then synthesised by rolling out from previously-seen states. The authors test performance in a partially-observable version of SOKOBAN, in which their system outperforms other methods of generating data.

Richard's opinion: This is an interesting approach which I can imagine becoming useful. It would be nice to see more experimental work in more stochastic environments, though.

Natural Environment Benchmarks for Reinforcement Learning (Amy Zhang et al) (summarized by Richard): This paper notes that RL performance tends to be measured in simple artificial environments - unlike other areas of ML in which using real-world data such as images or text is common. The authors propose three new benchmarks to address this disparity. In the first two, an agent is assigned to a random location in an image, and can only observe parts of the image near it. At every time step, it is able to move in one of the cardinal directions, unmasking new sections of the image, until it can classify the image correctly (task 1) or locate a given object (task 2). The third type of benchmark is adding natural video as background to existing Mujoco or Atari tasks. In testing this third category of benchmark, they find that PPO and A2C fall into a local optimum where they ignore the observed state when deciding the next action.

Richard's opinion: While I agree with some of the concerns laid out in this paper, I'm not sure that these benchmarks are the best way to address them. The third task in particular is mainly testing for ability to ignore the "natural data" used, which doesn't seem very useful. I think a better alternative would be to replace Atari with tasks in procedurally-generated environments with realistic physics engines. However, this paper's benchmarks do benefit from being much easier to produce and less computationally demanding.

Deep learning

Do Better ImageNet Models Transfer Better? (Simon Kornblith et al) (summarized by Dan H)

Dan H's opinion: This paper shows a strong correlation between a model's ImageNet accuracy and its accuracy on transfer learning tasks. In turn, better ImageNet models learn stronger features. This is evidence against the assertion that researchers are simply overfitting ImageNet. Other evidence is that the architectures themselves work better on different vision tasks. Further evidence against overfitting ImageNet is that many architectures which are desgined for CIFAR-10, when trained on ImageNet, can be highly competitive on ImageNet.

Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks (Jie Hu, Li Shen, Samuel Albanie et al) (summarized by Dan H)

Read more: This method uses spatial summarization for increasing convnet accuracy and was discovered around the same time as this similar work. Papers with independent rediscoveries tend to be worth taking more seriously.

Improving Generalization for Abstract Reasoning Tasks Using Disentangled Feature Representations (Xander Steenbrugge et al)

24