Alignment Newsletter #22


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.


Highlights

AI Governance: A Research Agenda (Allan Dafoe): A comprehensive document about the research agenda at the Governance of AI Program. This is really long and covers a lot of ground so I'm not going to summarize it, but I highly recommend it, even if you intend to focus primarily on technical work.

Technical AI alignment

Agent foundations

Agents and Devices: A Relative Definition of Agency (Laurent Orseau et al): This paper considers the problem of modeling the behavior of another system, either as an agent (trying to achieve some goal) or as a device (that reacts to its environment without any clear goal). They use Bayesian IRL to model behavior as coming from an agent optimizing a reward function, and design their own probability model to model the behavior as coming from a device. They then use Bayes' rule to decide whether the behavior is better modeled as an agent or as a device. Since they have a uniform prior over agents and devices, this ends up choosing the one that better fits the data, as measured by log likelihood.

In their toy gridworld, agents are navigating towards particular locations in the gridworld, whereas devices are reacting to their local observation (the type of cell in the gridworld that they are currently facing, as well as the previous action they took). They create a few environments by hand which demonstrate that their method infers the intuitive answer given the behavior.
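As a rough illustration of the model comparison described above, here is a minimal sketch. It does not use the paper's models: the 1-D corridor, the noisy goal-seeking agent, and the "reverse the previous action" device are all assumptions made up for this example, and it multiplies raw likelihoods rather than summing log likelihoods because the trajectories are tiny.

```python
POSITIONS = range(5)  # hypothetical 1-D corridor with cells 0..4


def agent_likelihood(trajectory, p_toward=0.9):
    """Likelihood under a toy agent model, marginalized over a uniform goal prior."""
    total = 0.0
    for goal in POSITIONS:
        lik = 1.0
        for position, action in trajectory:
            if position == goal:
                lik *= 0.5  # at the goal, either move is equally likely
            elif (goal - position) * action > 0:
                lik *= p_toward  # step toward the goal
            else:
                lik *= 1.0 - p_toward  # step away from the goal
        total += lik / len(POSITIONS)
    return total


def device_likelihood(trajectory, p_flip=0.9):
    """Likelihood under a toy device that tends to reverse its previous action."""
    lik = 0.5  # first action: there is no previous action to react to
    for (_, prev), (_, action) in zip(trajectory, trajectory[1:]):
        lik *= p_flip if action == -prev else 1.0 - p_flip
    return lik


def posterior_agent(trajectory, prior_agent=0.5):
    """Bayes' rule: probability that the behavior came from the agent class."""
    joint_agent = prior_agent * agent_likelihood(trajectory)
    joint_device = (1.0 - prior_agent) * device_likelihood(trajectory)
    return joint_agent / (joint_agent + joint_device)


# Each step is (position, action), with action +1 (right) or -1 (left).
walker = [(0, 1), (1, 1), (2, 1), (3, 1)]     # heads steadily to the right
bouncer = [(2, 1), (3, -1), (2, 1), (3, -1)]  # oscillates in place

print("P(agent | walker)  =", posterior_agent(walker))
print("P(agent | bouncer) =", posterior_agent(bouncer))
```

With the uniform prior, the posterior simply favors whichever class assigns the trajectory the higher likelihood: the steady walker is better explained as an agent, the oscillating one as a device.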

My opinion: In their experiments, they have two different model classes with very different inductive biases, and their method correctly switches between the two classes depending on which inductive bias works better. One of these classes is the maximization of some reward function, and so we call that the agent class. However, they also talk about using the Solomonoff prior for devices -- in that case, even if we have something we would normally call an agent, if it is even slightly suboptimal, then with enough data the device explanation will win out.

I'm not entirely sure why they are studying this problem in particular -- one reason is explained in the next post, and I'll write more about it in that section.

Bottle Caps Aren't Optimisers (Daniel Filan): The previous paper detects optimizers by studying their behavior. However, if the goal is to detect an optimizer before deployment, we need to determine whether an algorithm is performing optimization by studying its source code, without running it. One definition that people have come up with is that an optimizer is something such that the objective function attains higher values than it otherwise would have. However, the author thinks that this definition is insufficient. For example, this would allow us to say that a bottle cap is an optimizer for keeping water inside the bottle. Perhaps in this case we can say that there are simpler descriptions of bottle caps, so those should take precedence. But what about a liver? We could say that a liver is optimizing for its owner's bank balance, since in its absence the bank balance is not going to increase.

My opinion: Here, we want a definition of optimization because we're worried about an AI being deployed, optimizing for some metric in the environment, and then doing something unexpected that we don't like but nonetheless does increase the metric (falling prey to Goodhart's law). It seems better to me to talk about "optimizer" and "agent" as models for predicting behavior, not as inherent properties of the thing producing the behavior. Under that interpretation, we want to figure out whether the agent model with a particular utility function is a good model for an AI system, by looking at its internals (without running it). It seems particularly important to be able to use this model to predict the behavior in novel situations -- perhaps that's what is needed to make the definition of optimizer avoid the counterexamples in this post. (A bottle cap definitely isn't going to keep water in containers if it is simply lying on a table somewhere.)

Using expected utility for Good(hart) (Stuart Armstrong): If we include all of the uncertainty we have about human values into the utility function, then it seems possible to design an expected utility maximizer that doesn't fall prey to Goodhart's law. The post shows a simple example where there are many variables that may be of interest to humans, but we're not sure which ones. In this case, by incorporating this uncertainty into our proxy utility function, we can design an expected utility maximizer that has conservative behavior that makes sense.
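To make the idea concrete, here is a toy sketch. It is not the example from the post: the three variables, the budget of three units, and the square-root (diminishing returns) utility are all assumptions chosen for illustration. A maximizer that is certain only the first variable matters pours everything into it, while one that is uncertain which variable matters spreads its effort.

```python
import itertools
import math

BUDGET = 3         # hypothetical number of resource units to allocate
NUM_VARIABLES = 3  # hypothetical number of variables humans might care about

# Every way of splitting the budget into whole units across the variables.
candidates = [
    allocation
    for allocation in itertools.product(range(BUDGET + 1), repeat=NUM_VARIABLES)
    if sum(allocation) == BUDGET
]


def expected_utility(allocation, probs):
    """Expected utility when variable i is the one that matters with probability probs[i].

    Assumes diminishing returns (square root) in each variable.
    """
    return sum(p * math.sqrt(x) for p, x in zip(probs, allocation))


def best_allocation(probs):
    """Allocation that maximizes expected utility under the given beliefs."""
    return max(candidates, key=lambda a: expected_utility(a, probs))


certain_proxy = [1.0, 0.0, 0.0]      # "only variable 0 matters"
uncertain = [1 / 3, 1 / 3, 1 / 3]    # "not sure which variable matters"

print("certain proxy:", best_allocation(certain_proxy))
print("uncertain:    ", best_allocation(uncertain))
```

The conservative behavior here depends on the concave utility: with utilities linear in each variable, the uncertain maximizer would be indifferent between all allocations.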

My opinion: On the one hand, I'm sympathetic to this view -- for example, I see risk aversion as a heuristic leading to good expected utility maximization for bounded reasoners on large timescales. On the other hand, an EU maximizer still seems hard to align, because whatever utility function it gets, or distribution over utility functions, it will act as though that input is definitely true, which means that anything we fail to model will never make it into the utility function. If you could have some sort of "unresolvable" uncertainty, some reasoning (similar to the problem of induction) suggesting that you can never fully trust your own thoughts to be perfectly correct, that would make me more optimistic about an EU maximization based approach, but I don't think it can be done by just changing the utility function, or by adding a distribution over them.

Corrigibility doesn't always have a good action to take (Stuart Armstrong): Stuart has previously argued that an AI could be put in situations where no matter what it does, it would affect the human's values. In this short post, he notes that if you then say that it is possible to have situations where the AI cannot act corrigibly, then other problems arise, such as how you can create a superintelligent corrigible AI that does anything at all (since any action that it takes would likely affect our values somehow).

Computational complexity of RL with traps (Vadim Kosoy): A post asking about complexity-theoretic results around RL, both with (unknown) deterministic and stochastic dynamics.

Cooperative Oracles (Diffractor)


The What, the Why, and the How of Artificial Explanations in Automated Decision-Making (Tarek R. Besold et al)

Miscellaneous (Alignment)

Do what we mean vs. do what we say (Rohin Shah): I wrote a post proposing that we define a "do what we mean" system to be one in which the thing being optimized is latent (in the sense that it is not explicitly specified, not that it has a probability distribution over it). Conversely, a "do what we say" system explicitly optimizes something provided as an input. A lot of AI safety arguments can be understood as saying that a pure "do what we say" AI will lead to catastrophic outcomes. However, this doesn't mean that a "do what we mean" system is the way to go -- it could be that we want a "do what we mean" core, along with a "do what we say" subsystem that makes sure that the AI always listens to e.g. shutdown commands.

VOI is Only Nonnegative When Information is Uncorrelated With Future Action (Diffractor): Normally, the value of getting more information (VOI) is nonnegative for a rational agent, because you can always take the same action you would have taken without the information, so your decision can only improve or stay the same. However, if the information would cause you to have a different set of actions available, as in many decision theory examples, then this proof no longer applies, since you may no longer be able to take the action you would have otherwise taken. As a result, information can have negative value.
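A small worked example may help. The scenario, utilities, and the rule that removes an action are invented for illustration and do not come from the post. VOI is computed as the expected value of acting after observing the state, minus the value of the best action chosen under the prior.

```python
def value_without_info(prior, utility, actions):
    """Best expected utility when the action is chosen before seeing the state."""
    return max(
        sum(p * utility[action][state] for state, p in prior.items())
        for action in actions
    )


def value_with_info(prior, utility, actions_given_state):
    """Expected utility when the state is observed first.

    actions_given_state[state] lists the actions still available after
    learning that state.
    """
    return sum(
        p * max(utility[action][state] for action in actions_given_state[state])
        for state, p in prior.items()
    )


# Hypothetical umbrella problem.
prior = {"rain": 0.5, "sun": 0.5}
utility = {
    "umbrella": {"rain": 8, "sun": 6},
    "no_umbrella": {"rain": 0, "sun": 10},
}
actions = ["umbrella", "no_umbrella"]

baseline = value_without_info(prior, utility, actions)

# Standard case: learning the state leaves every action available.
same_actions = {"rain": actions, "sun": actions}
voi_standard = value_with_info(prior, utility, same_actions) - baseline

# Contrived case: learning that it will rain removes the "umbrella" action.
fewer_actions = {"rain": ["no_umbrella"], "sun": actions}
voi_restricted = value_with_info(prior, utility, fewer_actions) - baseline

print("VOI, same action set:      ", voi_standard)
print("VOI, restricted action set:", voi_restricted)
```

In the standard case the informed agent can always fall back on the uninformed choice, so VOI cannot drop below zero. Once the information changes which actions are available, that fallback is gone and VOI can be negative.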

AI strategy and policy

AI Governance: A Research Agenda (Allan Dafoe): Summarized in the highlights!

Superintelligence Skepticism as a Political Tool (Seth Baum)

Other progress in AI

Reinforcement learning

Introducing a New Framework for Flexible and Reproducible Reinforcement Learning Research (Pablo Samuel Castro and Marc G. Bellemare): Researchers at Google have released Dopamine, a small framework for RL research on Atari games, with four built-in agents -- DQN, C51, a simplified version of Rainbow, and the recent Implicit Quantile Network. There's a particular emphasis on reproducibility, by providing logs from training runs, training data, etc.

Dexterous Manipulation with Reinforcement Learning: Efficient, General, and Low-Cost (Henry Zhu et al)

Deep learning

Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures (Gongbo Tang et al)

Transfer Learning for Estimating Causal Effects using Neural Networks (Sören R. Künzel, Bradly C. Stadie et al)

Unsupervised learning

Unsupervised Learning of Syntactic Structure with Invertible Neural Projections (Junxian He et al)


LIFT: Reinforcement Learning in Computer Systems by Learning From Demonstrations (Michael Schaarschmidt et al)


80,000 Hours Job Board: AI/ML safety research: 80,000 Hours recently updated their job board, including the section on technical safety research. The AI strategy and governance section is probably also of interest.

BERI/CHAI ML engineer: I want to highlight this role in particular -- I expect this to be a position where you can not only have a large impact, but also learn more about technical research, putting you in a better position to do research in the future.

HLAI 2018 Field Report (G Gordon Worley III): A report on the human-level AI multiconference from the perspective of a safety researcher who attended. The reflections are more about the state of the field than about technical insights gained. For example, he got the impression that most researchers working on AGI hadn't thought deeply about safety. Based on this, he has two recommendations -- first, that we normalize thinking about AI safety, and second, that we establish a "sink" for dangerous AI research.

My opinion: I definitely agree that we need to normalize thinking about AI safety, and I think that's been happening. In fact, I think of that as one of the major benefits of writing this newsletter, even though I started it with AI safety researchers in mind (who still remain the audience I write for, if not the audience I actually have). I'm less convinced that we should have a process for dangerous AI research. What counts as dangerous? Certainly this makes sense for AI research that can be dangerous in the short term, such as research that has military or surveillance applications, but what would be dangerous from a long-term perspective? It shouldn't just be research that differentially benefits general AI over long-term safety, since that's almost all AI research. And even though on the current margin I would want research to differentially advance safety, it feels wrong to call other research dangerous, especially given its enormous potential for good.

State of California Endorses Asilomar AI Principles (FLI Team)