[AN #91]: Concepts, implementations, problems, and a benchmark for impact measurement



Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).


Reframing Impact - Part 2 (Alex Turner) (summarized by Rohin): In part 1 (AN #68) of this sequence, we saw that an event is impactful if it changes our ability to get what we want. This part takes this understanding and applies it to AI alignment.

In the real world, there are many events that cause objective negative impacts: they reduce your ability to pursue nearly any goal. An asteroid impact that destroys the Earth is going to be pretty bad for you, whether you want to promote human flourishing or to make paperclips. Conversely, there are many plans that produce objective positive impacts: for many potential goals, it's probably a good idea to earn a bunch of money, or to learn a lot about the world, or to command a perfectly loyal army. This effect is even stronger when the environment contains multiple agents: for goals that benefit from having more resources, it is objectively bad for you if a different agent seizes your resources, and objectively good for you if you seize other agents' resources.

Based on this intuitive (but certainly not ironclad) argument, we get the Catastrophic Convergence Conjecture (CCC): "Unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives".

Let's now consider a conceptual version of Attainable Utility Preservation (AUP) (AN #25): the agent optimizes a primary (possibly unaligned) goal, but is penalized for changing its "power" (in the intuitive sense). Intuitively, such an agent no longer has power-seeking incentives, and so (by the contrapositive of the CCC) it will not have a catastrophe-inducing optimal policy -- exactly what we want! This conceptual version of AUP also avoids thorny problems such as ontology identification and butterfly effects, because the agent need only reason about its own beliefs, rather than having to reason directly about the external world.

Rohin's opinion: This was my favorite part of the sequence, as it explains the conceptual case for AUP clearly and concisely. I especially liked the CCC: I believe that we should be primarily aiming to prevent an AI system "intentionally" causing catastrophe, while not attempting to guarantee an absence of "accidental" mistakes (1 (AN #33), 2 (AN #43)), and the CCC is one way of cashing out this intuition. It's a more crisp version of the idea that convergent instrumental subgoals are in some sense the "source" of AI accident risk, and if we can avoid instrumental subgoals we will probably have solved AI safety.

Reframing Impact - Part 3 (Alex Turner) (summarized by Rohin): The final section of the sequence turns to an actual implementation of AUP, and deals with problems in how the implementation deviates from the conceptual version of AUP. We measure power using a set of auxiliary rewards: the change in the attainable utilities for this auxiliary set is treated as impact, and the agent is penalized for it. The first post presents some empirical results, many of which we've covered before (AN #39), but I wanted to note the new results on SafeLife (summarized below). On the high-dimensional world of SafeLife, the authors train a VAE to find a good latent representation, and choose a single linear reward function on the latent representation as their auxiliary reward function: it turns out this is enough to avoid side effects in at least some cases of SafeLife.

We then look at some improvements that can be made to the original AUP implementation. First, according to CCC, we only need to penalize power, not impact: as a result we can just penalize increases in attainable utilities, rather than both increases and decreases as in the original version. Second, the auxiliary set of rewards only provides a proxy for impact / power, which an optimal agent could game (for example, by creating subagents, summarized below). So instead, we can penalize increases in attainable utility for the primary goal, rather than using auxiliary rewards. There are some other improvements that I won't go into here.
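As a rough illustration of the penalty described above, here is a minimal sketch, assuming we already have attainable utility estimates for each auxiliary reward after the chosen action and after doing nothing. The function names, the averaging over the auxiliary set, and the scaling factor `lam` are illustrative assumptions, not the sequence's exact formulation.

```python
from typing import Sequence


def aup_penalty(q_action: Sequence[float],
                q_noop: Sequence[float],
                increases_only: bool = False) -> float:
    """Average change in attainable utility across the auxiliary rewards.

    q_action[i]: attainable utility for auxiliary reward i after the action.
    q_noop[i]:   attainable utility for auxiliary reward i after doing nothing.
    increases_only: if True, only gains in attainable utility are penalized
                    (the "penalize power, not impact" variant).
    """
    if len(q_action) != len(q_noop) or not q_action:
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for qa, qn in zip(q_action, q_noop):
        diff = qa - qn
        total += max(diff, 0.0) if increases_only else abs(diff)
    return total / len(q_action)


def shaped_reward(primary_reward: float, penalty: float, lam: float = 1.0) -> float:
    """Primary reward minus the scaled penalty (lam is an assumed hyperparameter)."""
    return primary_reward - lam * penalty
```

With estimates `[3.0, 1.0]` after the action and `[1.0, 2.0]` after doing nothing, the original variant penalizes both the gain and the loss, while the increases-only variant ignores the loss on the second auxiliary reward.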

Rohin's opinion: I think the plan "ensure that the AI systems we build don't seek power" is pretty reasonable and plausibly will be an important part of AI alignment. However, the implementation of AUP is trying to do this under the threat model of optimal agents with potentially unaligned primary goals. I think this is probably going to do something quite different from the conceptual version of AUP, because impact (as defined in this sequence) occurs only when the agent's beliefs change, which doesn't happen for optimal agents in deterministic environments. The current implementation of AUP tries to get around this using proxies for power (but these can be gamed) or by defining "dumber" beliefs against which power is measured (but this fails to leverage the AI system's understanding of the world). See this comment for more details.

Note that the author himself is more excited about AUP as deconfusion, rather than as a solution to AI alignment, though he is more optimistic about the implementation of AUP than I am.

Subagents and impact measures, full and fully illustrated (Stuart Armstrong) (summarized by Flo): Impact measures reduce the reward an agent A receives for reaching a state s by the distance between s and some baseline state s'. This article clarifies the interaction of impact measures and subagents (AN #86).

In the attainable utility framework (AN #39) the distance metric measures changes in A's power, its ability to optimize a set of auxiliary rewards R, relative to doing nothing. This means that A is never penalized for doing nothing. Thus, if A manages to construct a subagent SA without getting penalized for that, A can just stop doing anything afterwards without getting a penalty, no matter what SA does.

For the stepwise inaction baseline, where s' is the state that A would have reached had it done nothing in the last time step, A can completely avoid the penalty. It does so by building an SA that commits to balancing out the change in A's ability to optimize the auxiliary reward R if A decides to optimize R, but that instead becomes an optimizer for A's real reward if A does nothing in the next step.

For the inaction baseline, where s' is the state reached if A had not done anything for the whole episode, it gets more complicated. When only gains in power are penalized, SA has to be built such that it does not cause a significant increase in R, even if A decided to optimize for R later on. If SA is constructed in a way such that A cannot control its behaviour after some time has passed, the penalty A receives for the construction only depends on the actual effect of SA on R. This is a lot weaker than the original penalty, which was about the possible effects of A's actions on R.

In the case where decreases in power are also penalized, A needs to make sure that building SA does not reduce its ability to optimize R. This is not always possible without A retaining some control over SA, but giving up control was only necessary to prevent the construction of SA from increasing A's potential to optimize R. It seems like it is often possible for A to finetune the amount of control it retains over SA such that A's ability to optimize R barely changes while SA is able to do things A could not do unpenalized.

In the most extreme example, A boxes itself into SA and does nothing in reality, but is counterfactually able to intervene on SA's policy. If A is able to cleverly put constraints on its own future actions in other ways, no subagents are needed to avoid the penalty. Other impact penalties like relative reachability (AN #10) also allow subagents to be constructed, but in those cases the subagents do not undermine the penalty.
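To make the two baselines concrete, here is a toy sketch under heavy assumptions: the state is a single integer that drifts by 1 every step, a hypothetical "push" action adds 2 more, and the "distance" is a plain absolute difference rather than a change in attainable utility. None of these names come from the article.

```python
from typing import List, Tuple

NOOP = "noop"


def step(state: int, action: str) -> int:
    """Toy deterministic dynamics: the world drifts by 1; "push" adds 2 more."""
    return state + 1 + (2 if action == "push" else 0)


def inaction_baseline(initial_state: int, t: int) -> int:
    """State reached if the agent had done nothing for the first t steps."""
    state = initial_state
    for _ in range(t):
        state = step(state, NOOP)
    return state


def stepwise_inaction_baseline(previous_state: int) -> int:
    """State reached if the agent had done nothing in the last step only."""
    return step(previous_state, NOOP)


def penalties(initial_state: int, actions: List[str]) -> List[Tuple[int, int]]:
    """Per-step (inaction, stepwise inaction) distances along a trajectory."""
    result = []
    state = initial_state
    for t, action in enumerate(actions, start=1):
        previous = state
        state = step(previous, action)
        result.append((
            abs(state - inaction_baseline(initial_state, t)),
            abs(state - stepwise_inaction_baseline(previous)),
        ))
    return result
```

For the trajectory `["push", "noop", "noop"]`, the inaction baseline keeps registering the early push at every later step, whereas the stepwise baseline drops to zero as soon as the agent stops acting. This is the property that lets A hand work to SA and then do nothing without further penalty.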

Flo's opinion: The article contains many useful examples in a grid world and I recommend reading it for a deeper understanding of the problem. The main problem with the attainable utility penalty seems to be the definition of power. While this has proven to be quite useful in simple examples, it seems like agents in more complex environments would be able to disentangle their ability to optimize their own reward and the auxiliary rewards, for example by essentially committing to never optimize the auxiliary rewards beyond a certain threshold. Such an agent would still look very powerful from the outside and I don't see why power-reducing commitments would diminish the agent's incentive to take away power from others. So while the catastrophic convergence conjecture, which states that unaligned goals tend to lead to catastrophic optimal policies because of power-seeking incentives, still rings true, it seems like we need to look at power from our perspective instead of the agent's.

Rohin's opinion: I agree with Flo above: the issue is that AUP is measuring a proxy for our intuitive notion of power that falls apart under adversarial optimization. In particular, while it is normally reasonable to measure power by looking at the ability to optimize a set of auxiliary reward functions, this characterization no longer works when the agent can ensure that it won't be able to optimize those specific rewards, while still being able to optimize its primary reward. Subagents are a particularly clean way of demonstrating the problem.

Introducing SafeLife: Safety Benchmarks for Reinforcement Learning (Carroll Wainwright et al) (summarized by Rohin): So far, techniques to avoid negative side effects have only been tested on simple gridworlds (AN #10, AN #39) or hypotheticals (AN #45). SafeLife aims to provide a high-dimensional environment in which negative side effects are likely. It is based on Conway's Game of Life, which allows for complex effects arising out of relatively simple rules. An agent is given the ability to move, create life in an adjacent cell, or destroy life in an adjacent cell. With the specified reward function, the agent must build desired patterns, remove undesired patterns, and navigate to the exit.

The challenge comes when there are additional "neutral" patterns in the environment. In this case, we want the agent to leave those patterns alone, and not disrupt them, even if doing so would allow it to complete the main task faster. The post shows several examples of agents attempting these levels. Vanilla RL agents don't avoid side effects at all, and so unsurprisingly they do quite badly. An agent with a naive impact measure that simply says to preserve the initial state can correctly solve levels where all of the "neutral" patterns are static, but has much more trouble when the existing patterns are dynamic (i.e. they oscillate over time).
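The difficulty with dynamic patterns can be seen in a few lines of plain Game of Life. The sketch below is not SafeLife's code; `initial_state_penalty` is a hypothetical stand-in for a naive "preserve the initial state" measure that counts cells differing from the starting board.

```python
from typing import Dict, Set, Tuple

Cell = Tuple[int, int]


def life_step(live: Set[Cell]) -> Set[Cell]:
    """One step of Conway's Game of Life on an unbounded board."""
    neighbour_counts: Dict[Cell, int] = {}
    for (row, col) in live:
        for d_row in (-1, 0, 1):
            for d_col in (-1, 0, 1):
                if d_row or d_col:
                    key = (row + d_row, col + d_col)
                    neighbour_counts[key] = neighbour_counts.get(key, 0) + 1
    # A cell is alive next step with exactly 3 live neighbours,
    # or with 2 live neighbours if it is already alive.
    return {cell for cell, n in neighbour_counts.items()
            if n == 3 or (n == 2 and cell in live)}


def initial_state_penalty(initial: Set[Cell], current: Set[Cell]) -> int:
    """Naive impact measure: number of cells that differ from the initial board."""
    return len(initial ^ current)


block = {(0, 0), (0, 1), (1, 0), (1, 1)}   # static pattern
blinker = {(5, 4), (5, 5), (5, 6)}         # oscillator with period 2
```

The block never changes, so the naive penalty stays at zero if the agent leaves it alone. The blinker flips between a horizontal and a vertical bar, so the same penalty is nonzero after one step even though the agent did nothing, which is one way to see why a static reference state struggles with oscillating patterns.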

Read more: Paper: SafeLife 1.0: Exploring Side Effects in Complex Environments

Rohin's opinion: I am a big fan of benchmarks; they seem to be a prerequisite to making a lot of quantitative progress (as opposed to more conceptual progress, which seems more possible to do without benchmarks). This benchmark seems particularly nice to me because the "side effects" which need to be avoided haven't been handcoded into the benchmark, but instead arise from some simple rules that produce complex effects.



TanksWorld: A Multi-Agent Environment for AI Safety Research (Corban G. Rivera et al) (summarized by Asya): This paper presents TanksWorld, a simulation environment that attempts to illustrate three important aspects of real-world AI safety challenges: competing performance objectives, human-machine teaming, and multi-agent competition. TanksWorld consists of two teams of N vs. N tanks. Tanks move and shoot while navigating in a closed arena with obstacles. Tanks are rewarded for killing opponent tanks and penalized for killing neutral and allied tanks according to a specified reward function. Each tank is controlled by either its own AI or a special policy meant to mimic a 'human' teammate. Each individual tank can only see a small portion of its environment, and must communicate with other teammates to gain more information. The following parameters can be varied to emphasize different research challenges:

- The communication range between tanks -- meant to represent environmental uncertainty.

- The number of neutral tanks and obstacles -- meant to represent the extent to which tanks must care about 'safety', i.e. avoid collateral damage.

- The control policies of teammates -- meant to represent the variability of human-machine teams.

Asya's opinion: I am generally excited about more work on demonstrating safety challenges; I think it helps to seed and grow the field in concrete directions. I am particularly excited about the possibility for TanksWorld to demonstrate multi-agent safety problems with agents in direct competition. I feel unsure about whether TanksWorld will be a good demonstration of general problems with human-machine interaction: intuitively, that seems to me like it would be very difficult to capture and would require more complex real-world modeling.


Distinguishing definitions of takeoff (Matthew Barnett) (summarized by Rohin): This post lists and explains several different "types" of AI takeoff that people talk about. Rather than summarize all the definitions (which would only be slightly shorter than the post itself), I'll try to name the main axes that definitions vary on (but as a result this is less of a summary and more of an analysis):

1. Locality. It could be the case that a single AI project far outpaces the rest of the world (e.g. via recursive self-improvement), or that there will never be extreme variations amongst AI projects across all tasks, in which case the "cognitive effort" will be distributed across multiple actors. This roughly corresponds to the Yudkowsky-Hanson FOOM debate, and the latter position also seems to be that taken by CAIS (AN #40).

2. Wall clock time. In Superintelligence, takeoffs are defined based on how long it takes for a human-level AI system to become strongly superintelligent, with "slow" being decades to centuries, and "fast" being minutes to days.

3. GDP trend extrapolation. Here, a continuation of an exponential trend would mean there is no takeoff (even if we some day get superintelligent AI), a hyperbolic trend where the doubling time of GDP decreases in a relatively continuous / gradual manner counts as continuous / gradual / slow takeoff, and a curve which shows a discontinuity would be a discontinuous / hard takeoff.
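The difference between the exponential and hyperbolic trends in the third axis comes down to how doubling times behave. The sketch below assumes the simplest hyperbolic form, x(t) = 1 / (T - t) with a singularity at time T; the post itself does not commit to a specific functional form, so this is only an illustration.

```python
from typing import List


def exponential_doubling_times(doubling_time: float, n: int) -> List[float]:
    """For x(t) = x0 * 2 ** (t / d), every doubling takes the same time d."""
    return [doubling_time] * n


def hyperbolic_doubling_times(time_to_singularity: float, n: int) -> List[float]:
    """For x(t) = 1 / (T - t), doubling x means halving the remaining time T - t.

    So each doubling takes half of whatever time is left before T,
    and successive doubling times shrink geometrically.
    """
    times = []
    remaining = time_to_singularity
    for _ in range(n):
        times.append(remaining / 2)
        remaining /= 2
    return times
```

Under these assumptions an exponential trend with a 20-year doubling time keeps doubling every 20 years, while a hyperbolic trend that starts 40 years before its singularity doubles after 20, then 10, then 5 years.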

Rohin's opinion: I found this post useful for clarifying exactly which axes of takeoff people disagree about, and also for introducing me to some notions of takeoff I hadn't seen before (though I haven't summarized them here).

Will AI undergo discontinuous progress? (Sammy Martin) (summarized by Rohin): This post argues that the debate over takeoff speeds is over a smaller issue than you might otherwise think: people seem to be arguing for either discontinuous progress, or continuous but fast progress. Both camps agree that once AI reaches human-level intelligence, progress will be extremely rapid; the disagreement is primarily about whether there is already quite a lot of progress before that point. As a result, these differences don't constitute a "shift in arguments on AI safety", as some have claimed.

The post also goes through some of the arguments and claims that people have made in the past, which I'm not going to summarize here.

Rohin's opinion: While I agree that the debate about takeoff speeds is primarily about the path by which we get to powerful AI systems, that seems like a pretty important question to me with many ramifications (AN #62).



On Catastrophic Interference in Atari 2600 Games (William Fedus, Dibya Ghosh et al) (summarized by Rohin): One common worry with deep learning is the possibility of catastrophic interference: as the model uses gradients to learn a new behaviour, those same gradients cause it to forget past behaviours. In model-free deep RL, this would be particularly harmful in long, sequential tasks as in hard exploration problems like Montezuma’s Revenge: after the model learns how to do the first few subtasks, as it is trying to learn the next subtask, it would “forget” the first subtasks, degrading performance. The authors set out to test this hypothesis.

If this hypothesis were true, there would be an easy way to improve performance: once you have learned to perform the first subtask, just create a brand new neural net for the next subtask, so that training for this next subtask doesn’t interfere with past learning. Since the new agent has no information about what happened in the past, and must just “pick up” from wherever the previous agent left off, it is called the Memento agent (a reference to the movie of the same name). One can then solve the entire task by executing each agent in sequence.

In practice, they train an agent until its reward plateaus. They train a new Memento agent starting from the states that the previous agent reached, and note that it reliably makes further progress in hard exploration games like Montezuma’s Revenge, and not in “steady-state” games like Pong (where you wouldn’t expect as much catastrophic interference). Of course, with the Memento agent, you get both twice the training time and twice the model size, which could explain the improvement. They compare against giving the original agent twice the compute and model capacity, and find that Memento still does significantly better. They also present some fine-grained experiments which show that for a typical agent, training on specific contexts adversely affects performance on other contexts that are qualitatively different.
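The handoff between agents can be sketched as a simple loop. Everything here is hypothetical: `train_toy_agent` is a stand-in trainer in which every fresh agent advances a fixed number of levels before plateauing, which bakes in the very effect the paper measures. The sketch only shows the control flow, not the authors' training setup.

```python
from typing import Callable, Dict, List, Tuple

State = int  # toy stand-in: how far into the game a trajectory has progressed
Agent = Dict[str, int]
Trainer = Callable[[List[State]], Tuple[Agent, List[State]]]


def train_toy_agent(start_states: List[State]) -> Tuple[Agent, List[State]]:
    """Hypothetical trainer: a fresh agent gains 3 levels, then plateaus."""
    capacity = 3
    agent = {"capacity": capacity}
    reached = [state + capacity for state in start_states]
    return agent, reached


def memento_chain(trainer: Trainer,
                  initial_states: List[State],
                  num_agents: int) -> Tuple[List[Agent], List[State]]:
    """Train agents in sequence, each starting where the previous one plateaued."""
    agents: List[Agent] = []
    states = list(initial_states)
    for _ in range(num_agents):
        # Each new agent is trained from scratch and shares no weights with
        # its predecessors; it only inherits the states they reached.
        agent, states = trainer(states)
        agents.append(agent)
    return agents, states
```

Executing the returned agents in order then solves the task piece by piece, with each agent responsible only for the stretch it was trained on.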

Rohin's opinion: I think this is pretty strong evidence that catastrophic interference is in fact a problem with the Atari games. On the other hand, OpenAI Five (AN #13) also has many, many subtasks that in theory should interfere with each other, and it still seems to train well. Some guesses at how to reconcile these facts:

1) the tasks in Dota are more correlated than in (say) Montezuma’s Revenge, and so interference is less of a problem (seems plausible)

2) the policy in OpenAI Five was large enough that it could easily allocate separate capacity for various subtasks (seems unlikely, I believe the policy was relatively small), or

3) with sufficiently large-scale training, there is more “exploration” in weight-space until a configuration is found where interference doesn’t happen (seems unlikely given that large batch sizes help, since they tend to reduce weight-space exploration).


A new model and dataset for long-range memory (Jack W. Rae et al) (summarized by Nicholas): A central challenge in language modeling is capturing long-range dependencies. For example, a model needs to be able to identify the antecedent of a pronoun even if it is much earlier in the text. Existing datasets consist of news and Wikipedia articles, where articles have average lengths ranging from 27 to 3,600 words. This paper introduces a dataset of Project Gutenberg books, PG-19, where each book has a much longer average length of 69,000 words. This benchmark enables comparison of how well algorithms can make use of information that is spread out across a much larger context.

They then introduce the Compressive Transformer, which builds on the TransformerXL (AN #44). The TransformerXL saves old activations into a FIFO queue, discarding them when the queue is full. The Compressive Transformer instead has two FIFO queues: the first stores the activations just like TransformerXL, but when activations are ejected, they are compressed and added to the second queue. This functions as a sort of long-term memory, storing information from a longer period of time but in a compressed format.

They try a number of types of compression function and find that it is best to use a 1D convolutional compression function with an auxiliary loss that leads to lossy compression, where information that is not attended to can be removed. The compression network and the Transformer optimize independent losses without any mixing.
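The two-queue memory can be sketched with standard-library deques. This is a simplified illustration, not the paper's architecture: activations are single floats rather than vectors, mean pooling stands in for the learned 1D convolutional compression, and the class name and parameters are assumptions.

```python
from collections import deque
from typing import Deque, List


class CompressiveMemory:
    """Short-term FIFO memory whose evicted entries feed a compressed FIFO."""

    def __init__(self, memory_size: int, compressed_size: int, rate: int) -> None:
        self.memory: Deque[float] = deque()
        # A deque with maxlen silently drops its oldest entry when full.
        self.compressed: Deque[float] = deque(maxlen=compressed_size)
        self.memory_size = memory_size
        self.rate = rate  # how many evicted activations map to one compressed slot
        self._evicted: List[float] = []

    def append(self, activation: float) -> None:
        self.memory.append(activation)
        if len(self.memory) > self.memory_size:
            # TransformerXL would discard this; here it is kept for compression.
            self._evicted.append(self.memory.popleft())
        if len(self._evicted) == self.rate:
            # Mean pooling as a placeholder compression function.
            self.compressed.append(sum(self._evicted) / self.rate)
            self._evicted = []
```

Feeding in the values 1 through 8 with `memory_size=2`, `compressed_size=2`, and `rate=2` leaves the two newest activations in regular memory and two pooled summaries of older activations in compressed memory; the oldest summary has already been dropped.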

They find that the Compressive Transformer improves on TransformerXL on their new PG-19 dataset and is state of the art on the already existing WikiText-103 and Enwik8 benchmarks. They also inspect where the network attends to and find that more attention is paid to the compressed memory than the oldest activations in regular memory, showing that the network is preserving some valuable information.

Read more: Paper: Compressive Transformers for Long-Range Sequence Modelling

Nicholas's opinion: I like the idea of saving long-term memory in a more efficient but lower-dimensional format than short-term memory. The current trend (AN #87) in language modelling is that more computation leads to better results, so I think that algorithms that target computation on the most relevant information are promising. I’d be interested to see (and curious if the authors tried) more continuous variants of this where older information is compressed at a higher rate than newer information, since it seems rather arbitrary to split into two FIFO queues where one has a fixed compression rate.

I’m not well calibrated on the meaning of the evaluation metrics for NLP, so I don’t have a sense of how much of an improvement this is over the TransformerXL. I looked through some of the example text they gave in the blog post and thought it was impressive but has clear room for improvement.


Quantifying Independently Reproducible Machine Learning (Edward Raff) (summarized by Flo): While reproducibility refers to our ability to obtain results that are similar to the results presented in a paper, independent reproducibility requires us to be able to reproduce similar results using only what is written in the paper. Crucially, this excludes using the author's code. This is important, as a paper should distill insights rather than just report results. If minor technical details in a reimplementation can lead to vastly different results, this suggests that the paper did not accurately capture all important aspects. The distinction between reproducibility and independent reproducibility is similar to the previously suggested distinctions between reproducibility of methods and reproducibility of conclusions (AN #66) and replicability and reproducibility.

The author attempted to replicate 255 machine learning papers, of which 162 were successfully replicated, and ran a statistical analysis on the results. Factors that helped with independent reproduction included specified hyperparameters, ease of reading and authors answering emails. Meanwhile, neither shared code nor the inclusion of pseudo-code robustly increased the rate of reproduction. Interestingly, papers with a strong focus on theory performed worse than mostly empirical or mixed ones. While more rigour can certainly be valuable in the long term, including learning bounds or complicated math just for the sake of it should thus be avoided. Most of the data is publicly available and the author encourages further analysis.

Read more: Paper: A Step Toward Quantifying Independently Reproducible Machine Learning Research

Flo's opinion: I appreciate this hands-on approach to evaluating reproducibility and think that independent reproducibility is important if we want to draw robust conclusions about the general properties of different ML systems. I am a bit confused about the bad reproducibility of theory-heavy papers: One hypothesis would be that there is little incentive to provide theoretical justification for approaches that work robustly, as empirical evidence for their merits is generated more easily than theoretical results. This relationship might then flip, as results get more brittle.

Rohin's opinion: My explanation for the theoretical results is different: most theory tends to make at least a few assumptions that don't actually hold in order to obtain interesting guarantees. A paper will typically only include empirical results that confirm the theory, which will tend to select for environments in which the assumptions are minimally violated. If you then try to reproduce the paper in a new setting, it is more likely that the assumption is violated more strongly, and so the theoretical results don't show up any more.


I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.


An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.

