Alignment Newsletter #26

Rohin Shah

Highlights

Building safe artificial intelligence: specification, robustness, and assurance (Pedro A. Ortega, Vishal Maini et al) (summarized by Richard): In this blog post, the DeepMind safety team divides AI safety into the problems of specification, robustness and assurance. Specification ensures that an AI system's behaviour aligns with the operator's true intentions, i.e. that our "ideal specification" of a fully aligned AI system is consistent with the AI's actual behaviour. Problems in this category include avoiding side-effects and the creation of dangerous sub-agents. Robustness ensures that an AI system continues to operate within safe limits upon facing perturbations like adversarial inputs or distributional shift, both by preventing these problems from arising and by being able to recover from them. It also includes criteria like safe exploration and cautious generalisation. Assurance ensures that we can understand and control AI systems during operation, by monitoring them and enforcing restrictions. Interpretability and interruptibility are examples of monitoring and enforcement respectively. I'd encourage you to look at the table in the original post, since it also categorises many more AI safety problems.

Richard's opinion: I like this framing - I think it's an improvement on the categorisation into specification and robustness problems from the AI safety gridworlds paper. In particular, it's useful to separate properties that we want an AI to have from mechanisms by which we might control or limit the damage from an AI that doesn't have those properties.

I do worry that this approach doesn't contain scalability as a core concern in the same way that IDA and agent foundations do. Solutions to AI safety problems which work for sub-human-level AI and even human-level AI may not translate to superintelligent AI. Relatedly, I think that maintaining the same goals under distributional shift should be considered a specification problem, because as AIs get smarter they'll be able to handle tasks increasingly different from the ones they trained on, and also because scaling up a system is itself a form of distributional shift.

Rohin's opinion: I also like this framing. Unlike Richard, I am not worried about the absence of scalability as a core concern -- scalability seems to be a desideratum about solutions, while the blog post aims to categorize problems. I'm also more agnostic about the categorization of distributional shift -- I can see it as both a specification problem and a robustness problem, and want to figure out more precisely what I think the difference between specification and robustness is.

Model-Based Reinforcement Learning via Meta-Policy Optimization (Ignasi Clavera, Jonas Rothfuss et al) (summarized by Richard): This paper introduces a new approach to model-based RL, called Model-Based Meta-Policy-Optimisation (MB-MPO), which reduces the reliance on accurate learned dynamics models. It does so by learning an ensemble of dynamics models, each trained on a different subset of the data, and then using meta-learning (specifically MAML) to find a policy which adapts well to any of these models within one step of gradient descent. This approach is a form of regularisation of policy learning, and achieves much greater sample efficiency without compromising performance: MB-MPO does just as well as top model-free algorithms in various MuJoCo continuous-control environments, while requiring between 10 and 100 times fewer samples. Experiments suggest that it does so by having higher plasticity in regions with high dynamics model uncertainty.
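To make the "adapts well to any of these models within one step of gradient descent" idea concrete, here is a minimal toy sketch of a MAML-style meta-update over an ensemble. It is not the paper's implementation: the real method uses neural network dynamics models and policy gradients, whereas this sketch assumes hypothetical scalar dynamics s' = a*s + u, a linear policy u = -theta*s, a one-step cost s'^2 evaluated at s = 1, and analytic gradients. All function names and numbers are made up for illustration.

```python
# Toy sketch only: scalar dynamics s' = a*s + u, policy u = -theta*s, s = 1.
# Each ensemble member is a hypothetical dynamics coefficient `a`.

def cost(theta, a):
    """One-step cost (squared next state) of gain theta under model a."""
    return (a - theta) ** 2

def cost_grad(theta, a):
    """Derivative of cost with respect to theta."""
    return -2.0 * (a - theta)

def adapt(theta, a, inner_lr):
    """Inner loop: one gradient step on the cost under a single model."""
    return theta - inner_lr * cost_grad(theta, a)

def meta_train(models, inner_lr=0.25, outer_lr=0.1, steps=500, theta=0.0):
    """Outer loop: improve the post-adaptation cost averaged over the ensemble."""
    for _ in range(steps):
        meta_grad = 0.0
        for a in models:
            adapted = adapt(theta, a, inner_lr)
            # Chain rule through the inner step: d(adapted)/d(theta) = 1 - 2*inner_lr,
            # because the second derivative of the quadratic cost is 2.
            meta_grad += cost_grad(adapted, a) * (1.0 - 2.0 * inner_lr)
        theta -= outer_lr * meta_grad / len(models)
    return theta

ensemble = [0.8, 1.0, 1.2]  # hypothetical learned dynamics coefficients
theta = meta_train(ensemble)
for a in ensemble:
    print(a, cost(theta, a), cost(adapt(theta, a, 0.25), a))
```

In this toy setting the meta-learned gain sits between the ensemble members, and a single inner gradient step moves it towards whichever model it is adapted to, so the post-adaptation cost is never higher than the pre-adaptation cost.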

Richard's opinion: This is a simple yet powerful idea which significantly reduces the number of samples required for good performance; I'm excited to see follow-up work. There are some confusing references in the paper to applications of MB-MPO to real-world robotics; it's unclear whether the authors already have results in this direction.

Previous newsletters

Comment on Towards a New Impact Measure (Victoria Krakovna): This comment clarifies the design space of impact measures, and in particular how Attainable Utility Preservation (AUP) from AN #25 and relative reachability (RR) from AN #10 compare. There's the choice of baseline to compare against (inaction for both), the measure of impact (attainable utilities vs. state reachability), and how to compute the deviation from the baseline (penalize both increases and decreases vs. only decreases). AUP prevents an agent from disabling shutdown because it penalizes increases. RR only penalizes decreases by default, but can easily be changed to penalize both increases and decreases, in which case it would also have this effect. RR penalizes decreases by default because it aims not to cripple the agent, whereas AUP penalizes increases as well because it aims to prevent any catastrophic scenario.
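The difference between the two deviation choices can be illustrated with a small sketch. This is not the actual AUP or RR formula: the function name, the list-of-values representation, and the numbers are all hypothetical, and each entry simply stands for "how much of something (an attainable utility or a reachable state) the agent has" under the baseline versus after acting.

```python
def deviation(baseline, actual, penalize_increases):
    """Toy impact penalty: summed deviation of `actual` from `baseline`.

    If penalize_increases is True, both increases and decreases count
    (absolute difference). Otherwise only decreases count.
    """
    total = 0.0
    for b, a in zip(baseline, actual):
        diff = a - b
        if penalize_increases:
            total += abs(diff)
        else:
            total += max(0.0, -diff)  # increases contribute nothing
    return total

# Hypothetical case: an action (such as disabling shutdown) that only
# increases what the agent can attain relative to the inaction baseline.
baseline = [0.5]
after_action = [0.9]
print(deviation(baseline, after_action, penalize_increases=True))
print(deviation(baseline, after_action, penalize_increases=False))
```

Under the decrease-only rule the purely "power-increasing" action receives no penalty, while the symmetric rule penalizes it, which is the distinction the comment draws.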

Rohin's opinion: I found this comment really helpful for understanding both AUP and how it compares to RR, and would recommend it.

Wireheading as a potential problem with the new impact measure (Stuart Armstrong): This suggests a potential problem with AUP: that the agent could build a device that creates exactly the observations that would be seen from null actions, which results in the penalty being near-zero, but otherwise maximizes the given utility function. Alex believes that this would be prevented by intent verification -- given a sufficiently granular representation of actions, at least one of the actions in this plan would not be increasing the ability to achieve the given utility function (without the penalty) relative to doing nothing, and would not be allowed as a result.

Technical AI alignment

Technical agendas and prioritization

Building safe artificial intelligence: specification, robustness, and assurance (Pedro A. Ortega, Vishal Maini et al): Summarized in the highlights!

Agent foundations

Asymptotic Decision Theory (Improved Writeup) (Diffractor)

Learning human intent

Web of connotations: Bleggs, Rubes, thermostats and beliefs, Bridging syntax and semantics, empirically and Bridging syntax and semantics with Quine's Gavagai (Stuart Armstrong) (summarized by Richard): Armstrong tackles the problem of determining semantics based on syntax. Take, for example, a thermostat whose reading is strongly correlated with the temperature, at least when it's used in certain ways in certain environments. Does it have "beliefs" about the temperature, as some philosophers have argued? Armstrong notes that most of our concepts, like the concept of a chair, are defined in two ways: an intensional definition ("things humans sit on"), and an extensional definition (many mental examples of chairs). From the latter, we can extract a "web of connotations" which are linked to the central examples in our mind (chairs are made out of materials like wood or plastic, found in places like rooms or gardens, etc). I interpret him as arguing that intensional definitions should be judged by how well they approximate the boundaries of our extensional definitions, and that the best way to do so is to make use of a web of connotations. He then claims that since having such a web is crucial for human-like reasoning, we shouldn't think of entities without such webs, like thermostats, as having "beliefs".

In two follow-up posts, he explores a very simple definition of representation: a variable X within an agent is a representation of the variable x within a set of environments if knowing X allows one to predict x. This is defined with respect to a given set of environments, because a representation is better if it works in more environments, but no-free-lunch theorems suggest that no representation would work in every environment. For thermostats, the relevant range of environments is pretty narrow - being in direct sunlight is enough to skew their readings. Human representations of temperature are much more robust. More generally, we could rank environments by how adversarial they are allowed to be - an agent has a better representation of a variable if its representation still correlates with that variable even when a powerful adversary is trying to deceive it. However, we should be aware that environmental variables may become undefined in some environments. Armstrong also claims that when we can't distinguish which of several environmental variables an agent is referring to, then that agent is simultaneously representing all of them.
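The dependence of "representation" on the chosen set of environments can be shown with a toy sketch. Everything here is an assumption made for illustration rather than something taken from the posts: the dictionary format for environments, the 8-degree sunlight skew, and the choice to predict x by reading X directly and to score it by worst-case error.

```python
def worst_case_error(sensor, environments):
    """Largest |X - x| over a set of environments, where X = sensor(env)
    and x is the true temperature in that environment."""
    return max(abs(sensor(env) - env["temperature"]) for env in environments)

def thermostat(env):
    # Hypothetical sensor: tracks air temperature, but direct sunlight
    # skews the reading upwards by a fixed amount.
    skew = 8.0 if env["direct_sun"] else 0.0
    return env["temperature"] + skew

shade = [{"temperature": t, "direct_sun": False} for t in (15.0, 20.0, 25.0)]
sunny = [{"temperature": t, "direct_sun": True} for t in (15.0, 20.0, 25.0)]

print(worst_case_error(thermostat, shade))          # narrow set of environments
print(worst_case_error(thermostat, shade + sunny))  # wider set of environments
```

The same internal variable predicts the temperature perfectly over the narrow set and poorly over the wider one, so whether it counts as a good representation depends on which environments are in scope.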

Richard's opinion: I'm not sure I fully understand the approach, but there seems to be a lot of hidden complexity in the idea of a "web of connotations", in a way which makes reasoning about it in a rigorous philosophical way difficult. The same is true for the idea of one variable allowing you to predict another well - it seems to depend on background knowledge, processing power, what you mean by "variable", etc. Perhaps further posts will make these ideas clearer, but so far this problem seems quite difficult to me.

Learning Task Specifications from Demonstrations (Marcell Vazquez-Chanlatte et al)

Interpretability

Interpretable Reinforcement Learning with Ensemble Methods (Alexander Brown et al)

Near-term concerns

Adversarial examples

Towards the first adversarially robust neural network model on MNIST (Lukas Schott, Jonas Rauber et al)

AI strategy and policy

Computational Power and the Social Impact of Artificial Intelligence (Tim Hwang) (summarized by Richard): This paper contains a lot of introductory material about hardware in ML, plus details on where it's being made. It discusses the competition between China and the US to dominate hardware production. Hwang notes that the trend towards more specialised hardware may decrease the price of implementing ML but also decrease flexibility after deployment. He points out that simulation learning, self-play and meta-learning reduce the need for data at the expense of increased compute, which may increase hardware's importance going forward.

Richard's opinion: This may be useful for AI policy researchers, since it explores which hardware is being made by which companies in which locations, and some of the geopolitical implications. While it's a long paper, AI researchers could probably skip the first half without missing much.

Other progress in AI

Reinforcement learning

Model-Based Reinforcement Learning via Meta-Policy Optimization (Ignasi Clavera, Jonas Rothfuss et al): Summarized in the highlights!

TStarBots: Defeating the Cheating Level Builtin AI in StarCraft II in the Full Game (Peng Sun, Xinghai Sun, Lei Han, Jiechao Xiong et al) (summarized by Richard): This paper showcases an RL agent which is able to defeat the built-in StarCraft AI (roughly at the level of the 30th-50th percentile of players). It does so by choosing between 165 hand-coded macro actions, which each correspond to an elementary task like producing a certain building. This avoids the necessity of learning unimportant details like exactly where the building should be placed, as well as difficult rules like the prerequisites for each building. The authors create a second agent which uses an expert system to choose actions in a hierarchical fashion, which performs at a similar level to the first.

Richard's opinion: I find StarCraft more interesting as a test-bed for deep RL than as a goal in itself. While the results of this paper are cool, I doubt that its methods will scale well - in general, approaches which rely on a lot of human knowledge being hard-coded in don't tend to.

Note the similarity between the macros in this paper and the way that OpenAI Five could choose between different hand-coded combinations of items. However, the latter is only a small part of the game, whereas the former is much more extensive.

Hierarchical Deep Multiagent Reinforcement Learning (Hongyao Tang et al)

On Reinforcement Learning for Full-length Game of StarCraft (Zhen-Jia Pang et al)

Zero-shot Sim-to-Real Transfer with Modular Priors (Robert Lee et al)

Critiques (AI)

Deep learning - deeper flaws? (Richard Ngo): This post summarizes four pieces about the flaws of deep learning. I don't think they'll benefit from more summarization on my part.

Rohin's opinion: This is worth reading for the sake of knowing what deep learning cannot currently do, which I think the articles are correct about. Two of the pieces also predict that these flaws will continue in the future, but none of the arguments are very compelling to me -- I think the jury is out on that question, and I could see it going either way. (This is based not on this post, but on my memories of three of the four linked pieces, which I had read before starting this newsletter.)

News

The Open Philanthropy Project AI Fellows Program: The second Open Phil AI Fellowship has been announced, open to AI and ML PhD students, and people applying to PhD programs starting in 2019. Even if you aren't looking for a fellowship, you may want to read through their example research topics, which are split into three main categories -- reward learning, reliability, and interpretability.

Rohin's opinion: I like the breakdown of research topics, though personally I would have made them broader. I think I would want the "reward learning" category to include anything that aims to provide a specification for the AI's behavior, such as natural language instructions (that are mapped directly to policies with no reward in between). The "reliability" section is then about successfully meeting the specification, while the third section would include anything that allows the operator to empirically verify whether, and enforce that, the previous two sections are working correctly, including "interpretability". Actually, having written this, it's pretty similar to the three categories in the DeepMind blog post covered in the highlights.

AAAS Policy Fellowship (Niel Bowerman): The AAAS Science & Technology Fellowship is open to Americans with a science PhD or 3 years of industry experience and a CS Masters. 80,000 Hours thinks this is one of the best ways into US Government AI policy careers. Application deadline is Nov 1.
