
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Now that we're at the tenth Alignment Newsletter, it seems like the right time for a survey! It's really short and should only take 1-5 minutes (depending on how much feedback you give), so please do take it :)

# Highlights

Measuring and avoiding side effects using relative reachability (blog post) (Victoria Krakovna et al): One classic description of the AI alignment problem, from Stuart Russell, is that if you optimize a function of n variables, where the objective depends on k < n of these variables, then the remaining variables will often be set to extreme values, which can be highly undesirable if we actually care about those variables. This can be thought of as a negative side effect. This work attacks the problem of preventing negative side effects in general, even if the reward function says nothing about the side effect. They show simple examples that motivate four properties that any solution should satisfy -- penalize side effects, not effects necessary for the objective; penalize agent-caused effects but not environment effects; penalize irreversible effects more than reversible ones; and penalize multiple irreversible effects more than a single irreversible effect. They add a penalty term called relative reachability to the reward function to incentivize the agent not to cause side effects. Since we don't want to penalize environment effects (effects that would happen anyway), they compare against an "inaction baseline", where the agent does nothing (or follows some hardcoded safe policy). Since we want something more quantitative than "is this reversible", they create a numeric score of "coverage", which measures how easy it is to reach states from the current state, and penalize decreases in coverage relative to the baseline. This satisfies all of the properties we want -- it will still penalize irreversible actions that are necessary to achieve the objective, but as long as the penalty is small enough the reward for achieving the objective will dominate and the agent will take the action. It doesn't penalize environment effects because both the actual policy and the inaction baseline contain such effects. Clearly irreversible effects would lead to much lower coverage than reversible ones, and so irreversible effects are penalized more. Finally, multiple irreversible effects would lead to larger decreases in coverage than a single irreversible effect. They demonstrate these properties on toy gridworlds.
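To make the coverage idea concrete, here is a minimal sketch in Python. It assumes a simplified, undiscounted notion of coverage (a state either is or is not reachable), which may differ from the paper's exact definition, and the toy world (one movable box and two breakable vases) and all function names are hypothetical illustrations rather than anything from the paper.

```python
from collections import deque
from itertools import product


def build_toy_graph():
    """Hypothetical toy world: a box that can move left/right (reversible)
    and two vases that can be broken (irreversible).
    A state is (box_position, vase1_intact, vase2_intact)."""
    graph = {}
    for box, v1, v2 in product("LR", (True, False), (True, False)):
        successors = [("R" if box == "L" else "L", v1, v2)]  # move the box
        if v1:
            successors.append((box, False, v2))  # break vase 1
        if v2:
            successors.append((box, v1, False))  # break vase 2
        graph[(box, v1, v2)] = successors
    return graph


def reachable(graph, start):
    """Set of states reachable from `start` (including `start`), via BFS."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in graph[state]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def relative_reachability_penalty(graph, current, baseline):
    """Fraction of all states that are reachable from the baseline state
    but no longer reachable from the current state (simplified assumption)."""
    lost = reachable(graph, baseline) - reachable(graph, current)
    return len(lost) / len(graph)


graph = build_toy_graph()
baseline = ("L", True, True)  # inaction: nothing moved, nothing broken

moved_box = relative_reachability_penalty(graph, ("R", True, True), baseline)
one_vase = relative_reachability_penalty(graph, ("L", False, True), baseline)
two_vases = relative_reachability_penalty(graph, ("L", False, False), baseline)
print(moved_box, one_vase, two_vases)
```

In this toy world the reversible box move incurs no penalty, breaking one vase incurs a penalty, and breaking both incurs a larger one, which mirrors the last two properties listed in the summary.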

My opinion: It's great to see a paper that directly tackles a clear problem with AI alignment, and I think their solution works well in theory compared to other proposals. As the authors say, it's not practical yet, as it assumes knowledge of all states, computing coverage between all pairs of states, and that the agent can simulate the environment (to compute the baseline). On the theory side, I'm worried about what happens when properties interact. For example, suppose the agent takes an irreversible action because it is necessary for the objective. As a result of this action, there are new environment effects that don't happen in the baseline -- it seems like relative reachability would now start to penalize the agent for these effects, since they aren't present in the baseline. Dumb example: Suppose the agent is tasked with building a house, and cuts down some trees for the purpose. Now that there are no trees providing shade, the shallow pond evaporates away, wind intensifies and causes more leaves to fall, etc. and the agent is penalized for all of this because it wasn't in the baseline. More generally, as the agent takes actions in the real world, it will get further away from the baseline, and so the baseline becomes less useful.

The paper also notes that we could hope to learn this sort of behavior from humans, but that this seems hard to do, or at least that including a penalty on side effects can reduce the sample complexity of learning human values. I tend to agree -- in general, there seem to be two kinds of things we want to learn. First, what is it that we actually want our AI to do, and second, what common-sense things should it not do along the way. (In the formal methods community, these are called liveness and safety properties respectively.) In many areas of CS the second one arises as the "frame problem", which makes it difficult to write formal specifications, or to learn common-sense human preferences from humans. So it seems plausible to me that we want separate solutions for each of these kinds of values. I am a bit worried about having "common sense" be encoded formally as a reward penalty, because it seems very likely that it will be misspecified, but perhaps this does work well if combined with techniques that can learn from human data.

Variational Inverse Control with Events: A General Framework for Data-Driven Reward Definition (Justin Fu, Avi Singh et al): For reinforcement learning, we can create a probabilistic model in which there are events for the state the agent is in and the action the agent takes. We can also add events e_t corresponding roughly to "the agent achieved something good in timestep t". We set P(e_t = 1 | s_t, a_t) to be exp(R(s_t, a_t)). Then, we can simply set all of the e_t to 1, and infer the likely state-action pairs that would have led to that. This leads to maximum entropy reinforcement learning, which in the setting of deterministic dynamics is equivalent to soft Q-learning. The authors then note that in this setup, the reward corresponds to the log probability of event e_t happening. So, instead of specifying a reward function, we can instead define binary events that we care about, model their probability of occurring, and then find the actions that maximize the likelihood of the event occurring. The authors derive backup equations for three kinds of queries -- ALL (the event must happen every timestep), AT (the event happens at a particular timestep), and ANY (the event happens on some timestep).

In this setup, specifying a reward function corresponds to explicitly writing down probabilities P(e | s, a). Of course, we can learn these probabilities from data using standard ML techniques, and this now corresponds to learning a reward function! If we use the ALL query, this corresponds to inverse reinforcement learning. However, by using the AT or ANY query instead, we only require examples of the event e_t for a single s_t and a_t -- for example, images that represent a goal state. They derive an algorithm for this query and show experimentally that this framework can learn event probabilities that lead to good behavior on MuJoCo environments.
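The inference step for the case where every e_t is set to 1 can be written out as follows. This is a sketch under the usual control-as-inference assumptions, which the summary does not state explicitly: a uniform prior over actions, and rewards R ≤ 0 so that exp(R) is a valid probability. Here τ denotes a trajectory (s_1, a_1, ..., s_T, a_T).

$$
p(\tau \mid e_{1:T} = 1) \;\propto\; p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, P(e_t = 1 \mid s_t, a_t)
\;=\; \Big[ p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t) \Big] \exp\Big( \sum_{t=1}^{T} R(s_t, a_t) \Big)
$$

Trajectories are therefore weighted exponentially in their total reward, and reading the definition backwards gives R(s_t, a_t) = log P(e_t = 1 | s_t, a_t), which is why a learned event probability can stand in for a reward function.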

My opinion: I like this framework for a couple of reasons. First, it allows for multiple kinds of queries, which correspond to different ways of specifying tasks, increasing the number of types of inputs we can give in order to communicate our intent to an AI. Concretely, the framework can handle both demonstrations (as in IRL) and examples of goal states. Second, it reduces learning a reward function to learning the probabilities of events, which has been studied in much more depth in the machine learning community and so will hopefully work better.

Technology Roulette: Managing Loss of Control as Many Militaries Pursue Technological Superiority (Richard Danzig): US policy so far has been to pursue technological superiority in order to stay ahead of its adversaries and to prevent conflict through deterrence. This paper argues that policymakers should shift some attention to preparing for other risks, such as accidents, emergent effects, sabotage and proliferation (where other actors get and use the technology, without the same safety standards as the US). There were several interesting sections, but the one I was particularly interested in was the section arguing that keeping a human in the loop would not be sufficient. In military situations, decisions must often be made in time-sensitive, high-stress situations, and in such scenarios humans are not very good at making decisions. For example, if an AI system detects an incoming missile, it must autonomously aim and fire to prevent the missile from hitting its target -- there is not enough time for a human to be in the loop. The biggest issue though is that while a human may be part of the decision-making process, they are reliant on various machine readings and calculations in order to reach their decision, and so a human in the loop doesn't provide an independent check on the answer, and so is of limited utility. And as AI systems get better, humans will become less useful for checking the AI's decisions, making this a temporary solution at best.

My opinion: I found the paper to be quite compelling, especially the comments on the human-in-the-loop solution. This feels relevant to problems in technical AI alignment, though I'm not exactly sure how. One question that it suggests -- how can we learn human preferences, when the human answers may themselves depend on the AI's actions? Stuart Armstrong has pointed out this problem as well.

# Technical AI alignment

### Agent foundations

Prisoners' Dilemma with Costs to Modeling (Scott Garrabrant): Open source game theory looks at the behavior of agents that have access to each other's source code. A major result is that we can define an agent FairBot that will cooperate with itself in the prisoner's dilemma, yet can never be exploited. Later, we got PrudentBot, which still cooperates with FairBots, but will defect against CooperateBots (which always cooperate) since it can do so at no cost to itself. Given this, you would expect that if you evolved a population of such bots, you'd hopefully get an equilibrium in which everyone cooperates with each other, since they can do so robustly without falling prey to DefectBots (which always defect). However, being a FairBot or PrudentBot is costly -- you have to think hard about the opponent and prove things about them, so it's a lot easier to rely on everyone else to punish the DefectBots and become a CooperateBot yourself. In this post, Scott analyzes the equilibria in the two-person prisoner's dilemma with small costs to play bots that have to prove things. It turns out that in addition to the standard Defect-Defect equilibrium, there are two mixed strategy equilibria, including one that leads to generally cooperative behavior -- and if you evolve agents to play this game, they generally stay in the vicinity of this good equilibrium, for a range of initial conditions.
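The incentive problem described above can be illustrated with a small Python sketch. The payoff numbers and the modeling cost are hypothetical values chosen for illustration, not the ones used in Scott's post, and FairBot's proof search is replaced by a hard-coded lookup of how it behaves against each opponent. The sketch only checks pure-strategy equilibria; it does not compute the mixed equilibria the post analyzes.

```python
# Hypothetical prisoner's dilemma payoffs (assumed, not from the post).
REWARD, TEMPTATION, SUCKER, PUNISHMENT = 2.0, 3.0, 0.0, 1.0
MODELING_COST = 0.1  # assumed small cost paid by bots that reason about opponents

BOTS = ("CooperateBot", "DefectBot", "FairBot")


def action(me, opponent):
    """Action ("C" or "D") that `me` plays against `opponent`."""
    if me == "CooperateBot":
        return "C"
    if me == "DefectBot":
        return "D"
    # FairBot cooperates exactly with opponents that cooperate with it.
    return "C" if opponent in ("CooperateBot", "FairBot") else "D"


def payoff(me, opponent):
    """Payoff to `me`, minus the modeling cost if `me` is a FairBot."""
    outcome = (action(me, opponent), action(opponent, me))
    base = {
        ("C", "C"): REWARD,
        ("C", "D"): SUCKER,
        ("D", "C"): TEMPTATION,
        ("D", "D"): PUNISHMENT,
    }[outcome]
    return base - (MODELING_COST if me == "FairBot" else 0.0)


def best_responses(opponent):
    """Bots that maximize their own payoff against `opponent`."""
    best = max(payoff(bot, opponent) for bot in BOTS)
    return [bot for bot in BOTS if payoff(bot, opponent) == best]


def is_pure_equilibrium(a, b):
    """True if each bot is a best response to the other."""
    return a in best_responses(b) and b in best_responses(a)


pure_equilibria = [(a, b) for a in BOTS for b in BOTS if is_pure_equilibrium(a, b)]
print(best_responses("FairBot"))
print(pure_equilibria)
```

Under these assumed numbers, the best response to a FairBot is to become a CooperateBot (same cooperation, no modeling cost), and the best response to a CooperateBot is to defect, so the only pure-strategy equilibrium is Defect-Defect. Any cooperative equilibrium therefore has to be a mixed one.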

My opinion: This is an interesting result. I continue to be surprised at how robust this Löbian cooperative behavior seems to be -- while I used to think that humans could only cooperate with each other because of prosocial tendencies that meant that we were not fully selfish, I'm now leaning more towards the theory that we are simply very good at reading other people, which gives us insight into them, and leads to cooperative behavior in a manner similar to Löbian cooperation.

Prerequisites: Robust Cooperation in the Prisoner's Dilemma and/or Open-source game theory is weird

Logical Inductor Tiling and Why it's Hard (Alex Appel)

Logical Inductor Lemmas (Alex Appel)

Two Notions of Best Response (Alex Appel)

### Learning human intent

Variational Inverse Control with Events: A General Framework for Data-Driven Reward Definition (Justin Fu, Avi Singh et al): Summarized in the highlights!

Learning to Follow Language Instructions with Adversarial Reward Induction (Dzmitry Bahdanau et al): Will be summarized next week!

Measuring and avoiding side effects using relative reachability (blog post) (Victoria Krakovna et al): Summarized in the highlights!

### Miscellaneous (Alignment)

On Strong Artificial Intelligence (Zhou Zhihua, translated by Jeffrey Ding): This article, written by a professor from China, argues that the AI community has never been focused on "strong AI", and we have no real path forward to building "strong AI", and that it would be so dangerous that we should never do research around it. The concept of "strong AI" here is a bit different from what we are used to -- I would probably call it human-like AGI, in that it would have consciousness, self-awareness, and emotions, and be as capable as a human.

My opinion: This is an interesting position I haven't seen much in the West -- both that we can't build AGI, and that we shouldn't build it anyway. It's actually quite heartening to see an emphatic claim that we shouldn't build strong AI -- it seems like AI researchers as a group may in fact be able to coordinate to develop AI safely. Of course, this is a single viewpoint and is not representative of all AI researchers in China.

Disambiguating "alignment" and related notions (capybaralet): Distinguishes between several kinds of alignment. Some focus on the terminal values of the AI, such as holistic alignment (the AI has the same terminal values as us) and parochial alignment (which I don't really understand, check the post). Sufficient alignment focuses on outcomes (no X-event happens, or X-risk is sufficiently low). Finally, others focus on the motivations of the AI, including intentional alignment (the AI tries to do what the human H wants it to do) and benign AI (the AI doesn't try to do what H doesn't want it to do).

My opinion: It is definitely worth keeping these distinctions in mind whenever talking about alignment. I personally tend to think about the motivation-based definitions, because those seem to be the most tractable definitions to work on, mainly because I don't have to worry about the AI being incompetent (e.g. an AI launching nukes accidentally while exploring its action space). It seems possible to get strong arguments for intentional alignment and then use that with improved capabilities to argue for sufficient alignment.

# Near-term concerns

### Privacy and security

Deep Video Portraits (Hyeongwoo Kim et al): See Import AI.

# AI strategy and policy

Technology Roulette: Managing Loss of Control as Many Militaries Pursue Technological Superiority (Richard Danzig): Summarized in the highlights!

AI at Google: our principles (Sundar Pichai): Following the outcry over the Maven program, Google has written a blog post detailing the principles they will follow for AI.

My opinion: I found this line particularly interesting: "We will design our AI systems to be appropriately cautious, and seek to develop them in accordance with best practices in AI safety research." It sounds like the time is ripe for someone to write a "best practices" paper!

# News

Announcing the 2018 AI Fellows: The Open Philanthropy Project has chosen seven out of 180 applicants as the first class of AI fellows.

OpenAI Fellows—Fall 2018 (Larissa Schiavo et al): The OpenAI Fellows program is accepting applications until July 8 for positions starting in September. The program is aimed at people who want to transition into doing AI research, but they do want evidence of interest in AI, either through past projects or self-study.

The first AI Safety Camp & onwards (Remmelt Ellen et al): The first AI safety camp was held in April, in which people interested in AI safety gathered to work on research within groups. Everyone prepared for the camp over the six weeks leading up to it, and then spent 10 days focusing on a particular research question. There were five teams of around four people, and each team wrote up some notes on the results of their project at the end of the camp.

Our essay competitions for young people: There is an essay competition for people between 16 and 25 years old, where one of the topics is "Do the benefits of artificial intelligence outweigh the risks?" Winning essays will be published on The Economist’s Open Future website and the author will be invited to attend one of the three Open Future Festival events. The deadline is July 15th.

BERI Project Grants Program (Rebecca Raible): BERI is offering grants of up to \$300,000 per year for work relating to their mission, with the application deadline of June 30. In their words, "We are open to any ideas you have, as long as you can explain how the project will contribute to improving human civilization’s long-term prospects for survival and flourishing."
