# Ω 10

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

# HIGHLIGHTS

Quantifying Differences in Reward Functions (Adam Gleave et al) (summarized by Rohin): Current work on reward learning typically evaluates the learned reward models by training a policy to optimize the learned reward, and seeing how well that policy performs according to the true reward. However, this only tests how well the reward works in the particular environment you test in, and doesn’t tell you how well the reward will generalize. For example, suppose the user loves apricots, likes plums, but hates durians. A reward that has apricots > durians > plums works perfectly -- until the store runs out of apricots, in which case it buys the hated durian.

So, it seems like we should evaluate reward functions directly, rather than looking at their optimal policies. This paper proposes Equivalent-Policy Invariant Comparison (EPIC), which can compare two reward functions while ignoring any potential shaping that doesn’t affect the optimal policy.

EPIC is parameterized by a distribution of states and actions DS and DA, as well as a distribution DT over transitions (s, a, s’). The first step is to find canonical versions of the two rewards to be compared, such that they have expected zero reward over DS and DA, and any potential shaping is removed. Then, we look at the reward each of these would assign to transitions in DT, and compute the Pearson correlation. This is transformed to be in the range [0, 1], giving the EPIC distance.

The authors prove that EPIC is a pseudometric, that is, it behaves like a distance function, except that it is possible for EPIC(R1, R2) to be zero even if R1 and R2 are different. This is desirable, since if R1 and R2 differ by a potential shaping function, then their optimal policies are guaranteed to be the same regardless of transition dynamics, and so we should report the “distance” between them to be zero.

The authors show how to approximately compute the EPIC distance in high dimensional environments, and run experiments to showcase EPIC’s properties. Their first experiment demonstrates that EPIC is able to correctly detect that a densely shaped reward for various MuJoCo environments is equivalent to a sparse reward, whereas other baseline methods are not able to do so. The second experiment compares reward models learned from preferences, demonstrations, and direct regression, and finds that the EPIC distance for the rewards learned from demonstrations are much higher than those for preferences and regression. Indeed, when the rewards are reoptimized in a new test environment, the new policies work when using the preference or regression reward models, but not when using the demonstration reward model. The final experiment shows that EPIC is robust to variations in the visitation distribution DT, while baseline methods are not.

Rohin's opinion: It’s certainly true that we don’t have good methods for understanding how well our learned reward models generalize, and I’m glad that this work is pushing in that direction. I hope that future papers on reward models report EPIC distances to the ground truth reward as one of their metrics (code is available here).

One nice thing is that, roughly speaking, rewards are judged to be equivalent if they would generalize to any possible transition function that is consistent with DT. This means that by designing DT appropriately, we can capture how much generalization we want to evaluate. This is a useful knob to have: if we used the maximally large DT, the task would be far too difficult, as it would be expected to generalize far more than even humans can.

# TECHNICAL AI ALIGNMENT

## TECHNICAL AGENDAS AND PRIORITIZATION

Plausible cases for HRAD work, and locating the crux in the "realism about rationality" debate (Issa Rice) (summarized by Rohin): This post tries to identify the possible cases for highly reliable agent design (HRAD) work to be the main priority of AI alignment. HRAD is a category of work at MIRI that aims to build a theory of intelligence and agency that can explain things like logical uncertainty and counterfactual reasoning.

The first case for HRAD work is that by becoming less confused about these phenomena, we will be able to help AGI builders predict, explain, avoid, detect, and fix safety issues and help to conceptually clarify the AI alignment problem. For this purpose, we just need conceptual deconfusion -- it isn’t necessary that there must be precise equations defining what an AI system does.

The second case is that if we get a precise, mathematical theory, we can use it to build an agent that we understand “from the ground up”, rather than throwing the black box of deep learning at the problem.

The last case is that understanding how intelligence works will give us a theory that allows us to predict how arbitrary agents will behave, which will be useful for AI alignment in all the ways described in the first case and more (AN #66).

Looking through past discussions on the topic, the author believes that people at MIRI primarily believe in the first two cases. Meanwhile, critics (particularly me) say that it seems pretty unlikely that we can build a precise, mathematical theory, and a more conceptual but imprecise theory may help us understand reasoning better but is less likely to generalize sufficiently well to say important and non-trivial things about AI alignment for the systems we are actually building.

Rohin's opinion: I like this post -- it seems like an accessible summary of the state of the debate so far. My opinions are already in the post, so I don’t have much to add.

The flaws that make today’s AI architecture unsafe and a new approach that could fix it (Rob Wiblin and Stuart Russell) (summarized by Rohin): This podcast delves into many of the ideas in Stuart’s book Human Compatible (AN #69). Rob especially pushes on some aspects that are less talked about in the AI safety community, like the enfeeblement problem and whether we’d be locking in suboptimal values. They also discuss Stuart’s response to some counterarguments.

Rohin's opinion: One of the counterarguments the podcast talks about is my position (AN #80) that we’ll probably learn from smaller catastrophes in order to avoid actual extinction. I just want to note that while it might sound like I disagree with Stuart on this point, I don’t think we actually do. I was arguing against the position that extinction is the default outcome (> 50% probability) while Stuart is arguing against the position that extinction is near-impossible (~0% probability). I ended up around 10%; I’d guess that if Stuart were forced to, he’d give a number similar to mine, for similar reasons as me.

## INTERPRETABILITY

Towards A Rigorous Science of Interpretable Machine Learning (Finale Doshi-Velez et al) (summarized by Robert): This paper from 2017 discusses the field of interpretability research, and how it can be made more rigorous and well-defined. The authors first highlight the problem of defining interpretability in the first place - they don't have a resolution to this problem, but suggest that we can think of interpretability in terms of what it's used for. They claim that interpretability is used for confirming other important desiderata in ML systems, which stem from an incompleteness in the problem formalization. For example, if we want a system to be unbiased but aren't able to formally specify this in the reward function, or the reward we're optimising for is only a proxy of the true reward, then we could use interpretability to inspect our model and see whether it's reasoning how we want it to.

The authors next move on to discussing how we can evaluate interpretability methods, providing a taxonomy of different evaluation methods: Application-grounded is when the method is evaluated in the context it will actually be used in, by real humans (i.e. doctors getting explanations for AI diagnoses); Human-grounded is about conducting simpler human-subject experiments (who are perhaps not domain experts) using possibly simpler tasks than what the intended purpose of the method is; Functionally-grounded is where no humans are involved in the experiments, and instead some formal notion of interpretability is measured for the method to evaluate its quality. Each of these evaluation methods can be used in different circumstances, depending on the method and the context it will be used in.

Finally, the authors propose a data-driven approach to understanding the factors which are important in interpretability. They propose to try and create a dataset of applications of machine learning models to tasks, and then analyse this dataset to find important factors. They list some possible task- and method- related factors, and then conclude with recommendations to researchers doing interpretability.

Robert's opinion: I like the idea of interpretability being aimed at trying to fill in mis- or under-specified optimisation objectives. I think this proposes that interpretability is more useful for outer alignment, which is interesting as I think that most people in the safety community think interpretability could help with inner alignment (for example, see An overview of 11 proposals for building safe advanced AI (AN #102), in which transparency (which could be seen as interpretability) is used to solve inner alignment in 4 of the proposals).

# OTHER PROGRESS IN AI

## EXPLORATION

Planning to Explore via Self-Supervised World Models (Ramanan Sekar, Oleh Rybkin et al) (summarized by Flo): PlaNet (AN #33) learns a latent world model which can be used for planning, and Dreamer (AN #83) extends the idea by performing RL within the learned latent world model instead of requiring interaction with the environment. However, we still need to efficiently explore the real environment to obtain training data for the world model.

The authors propose to augment Dreamer with a novel exploration strategy. In addition to the learned latent world model, an ensemble of simpler one-step world models is trained and the magnitude of disagreement within the ensemble for a state is used as a proxy for the information gain for reaching that state. This is used as a (dynamically changing) intrinsic reward that can guide planning. By training Dreamer on this intrinsic reward, we can identify informative states in the real environment without having to first visit similar states as would be the case with e.g. curiosity, where the intrinsic reward is computed in retrospect.

The resulting system achieves state of the art zero-shot learning on a variety of continuous control tasks, and often comes close to the performance of agents that were trained for the specific task.

Flo's opinion: Planning to reach states where a lot of information is gained seems like a very promising strategy for exploration. I am not sure whether building sufficiently precise world models is always as feasible as model-free RL. If it was, misspecified rewards and similar problems would probably become easier to catch, as rollouts of a policy using a precise world model can help us predict what kind of worlds this policy produces without deployment. On the other hand, the improved capabilities for transfer learning could lead to more ubiquitous deployment of RL systems and amplify remaining failure modes, especially those stemming from multiagent interactions (AN #70).

## REINFORCEMENT LEARNING

Learning to Play No-Press Diplomacy with Best Response Policy Iteration (Thomas Anthony, Tom Eccles et al) (summarized by Asya): Diplomacy is a game with simple rules where 7 players simultaneously move units every turn to capture territory. Units are evenly matched by default, so winning relies on getting support from some players against others. 'No-Press' Diplomacy limits communication between players to only orders submitted to units, removing the complex verbal negotiations that characterize traditional gameplay.

Previous state-of-the-art No-Press Diplomacy methods were trained to imitate human actions after collecting a dataset of 150,000 human Diplomacy games. This paper presents a new algorithmic method for playing No-Press Diplomacy using a policy iteration approach initialized with human imitation. To find better policies, their methods use "best response" calculations, where the best response policy for some player is the policy that maximizes the expected return for that player against opponent policies. Diplomacy is far too large for exact best response calculation, so the paper introduces an approximation, "Sampled Best Response", which

- Uses Monte-Carlo sampling to estimate opponents' actions each turn

- Only considers a small set of actions sampled from each candidate best response policy

- Only tries to make a single-turn improvement to its policy (rather than trying to optimize for the whole rest of the game)

Similar to other policy iteration methods, the paper creates a dataset of games every iteration using its Sampled Best Response method, then trains neural networks to create policy and value functions that predict the actions chosen by Sampled Best Response. To remedy issues where Sampled Best Response continually cycles through the best strategy for the last iteration, the paper tries several variants of a technique called "Fictitious Play". In the best-performing variant, the policy network is trained to predict the latest Sampled Best Response given explicitly averaged historical opponent and player policies, rather than just the latest policies.

The paper's methods outperform existing algorithmic methods for No-Press Diplomacy on a variety of metrics, but are still fairly few-shot exploitable-- at the end of training, the strongest (non-human) exploiter of the final policy wins 48% of the time. They also find that the strongest exploit doesn't change much through training, though few-shot exploitability does decrease from the beginning of training to the end.

Asya's opinion: This paper represented real progress in automated Diplomacy, but is still far from human-level. I’ll be pretty interested to see whether we can reach human-level by creating improved self-play algorithms, like the one presented in this paper, and the ones used for Poker and Go, or if we will have to wait for novel, more general reasoning algorithms applied to Diplomacy. Unlike Poker, Diplomacy against multiple human players involves collusion and implicit signalling, even with No Press. It seems possible to me that it is very difficult to become good at modeling those dynamics through self-play alone. If we did get to human-level through self-play, it would make me more optimistic about the extent to which training is likely to be a bottleneck in other domains which require sophisticated models of human behavior.

## META LEARNING

Learning to Continually Learn (Shawn Beaulieu et al) (summarized by Robert): This paper presents the ANML (A Neuromodulated Meta-Learning algorithm) method for countering catastrophic forgetting in continual learning. Continual learning is a problem setting where the system is presented with several tasks in sequence, and must maintain good performance on all of them. When training on new tasks, neural networks often “forget” how to perform the previous tasks, which is called catastrophic forgetting. This makes the naive approach of just training on each task in sequence ineffective.

The paper has two main ideas. First, rather than avoiding catastrophic forgetting by using hand-crafted solutions (e.g. previous methods have encouraged sparsity), the authors use meta-learning to directly optimise for this goal. This is done by learning a network parameterization which, after training sequentially on many tasks, will get good performance on all tasks. This outer loop objective can be optimised for directly by taking higher order gradients (gradients of gradients). The second idea is a novel form of neuromodulation. This takes the form of a neuromodulatory (NM) network, which takes the same input as the prediction network, and gates the prediction network’s forward pass. This provides direct control of the output of the prediction network, but also indirect control of the learning of the prediction network, as gradients will only flow through the paths which haven’t been zeroed out by the gating mechanism.

Their method achieves state-of-the-art results on continual learning in Omniglot, a few-shot dataset consisting of 1623 characters, each with only 20 hand-drawn examples. The network has to learn a sequence of tasks (e.g. classifying a character) with only 15 examples, and is then tested on overall performance over all the classes it’s learned. Their network gets 60% accuracy when presented with 600 classes in a row. A classifier trained with the same data but shuffled independently at random only gets 68% accuracy, implying that the catastrophic forgetting of their network only cost 8 percentage points. Their method also learns a form of sparsity in the activations of the network in a much better way than the hand-crafted methods - while per-class activations are very sparse, no neurons are wasted, as they all still activate over the entire dataset.

Robert's opinion: This paper is interesting because it's a demonstration of the power of meta-learning to formulate the true optimisation objective. Often in machine learning much research is devoted to the manual path of trying to find the correct inductive biases to solve hard problems (such as catastrophic forgetting). Instead, this paper shows we can use methods like meta-learning to learn these inductive biases (such as sparsity) automatically, by optimising directly for what we want. This relates to (and is motivated by) AI-Generating Algorithms (AN #63). Obviously, this method still uses the neuromodulatory network as an architectural inductive bias - it'd be interesting to see whether we could somehow learn this method (or something more specific) as well, perhaps through neural architecture search or just using a larger network which has the representational capacity to perform something like the gating operation.

## UNSUPERVISED LEARNING

Unsupervised Learning of Visual Features by Contrasting Cluster Assignments (Mathilde Caron et al) (summarized by Rohin): There has been a lot of work in self-supervised representation learning for image classification (previously summarized in AN #92 and AN #99). This paper sets a new SOTA of 75.3% top-1 ImageNet accuracy, when allowed to first do self-supervised representation learning on ImageNet, and then to train a linear classifier on top of the learned features using all of ImageNet.

Previous methods use a contrastive loss across the learned representations (possibly after being processed by a few MLP layers), which can be thought of as using the learned representation to predict the representation of augmented versions of the same input. In contrast, this paper uses the representation to predict “codes” of augmented versions, where the codes are computed using clustering.

Rohin's opinion: I’m not sure why we should expect this method to work, but empirically it does. Presumably I’d understand the motivation better if I read through all the related work it’s building on.

Big Self-Supervised Models are Strong Semi-Supervised Learners (Ting Chen et al) (summarized by Rohin): Previously, SimCLR (AN #99) showed that you can get good results on semi-supervised learning on ImageNet, by first using self-supervised learning with a contrastive loss to learn good representations for images, and then finetuning a classifier on top of the representations with very few labels. This paper reports a significantly improved score, using three main improvements:

1. Making all of the models larger (in particular, deeper).

2. Incorporating momentum contrast, as done previously (AN #99).

3. Using model distillation to train a student network to mimic the original finetuned classifier.

On linear classification on top of learned features with a ResNet-50 architecture, they get a top-1 accuracy of 71.7%, so lower than the previous paper. Their main contribution is to show what can be done with larger models. According to top-1 accuracy on ImageNet, the resulting system gets 74.9% with 1% of labels, and 80.1% with 10% of labels. In comparison, standard supervised learning with a ResNet-50 (which is about 33x smaller) achieves 76.6% with all labels, and just 57.9% with 1% of labels and 68.4% with 10% of labels. When they distill down their biggest model into a ResNet-50, it gets 73.9% with 1% of labels and 77.5% with 10% of labels.

Rohin's opinion: It continues to baffle me why model distillation is so helpful -- you’d think that if you train a student model to mimic a teacher model, it would do at most as well as the teacher, but in fact it seems to do better. It's remarkable that just "training a bigger model and then distilling it down" leads to an increase of 16 percentage points (when we just have 1% of the labels). Another thing to add to the list of weird empirical facts about deep learning that we don’t understand.

#### FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

#### PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.

New Comment
2 comments, sorted by Click to highlight new comments since:
The authors prove that EPIC is a pseudometric, that is, it behaves like a distance function, except that it is possible for EPIC(R1, R2) to be zero even if R1 and R2 are different. This is desirable, since if R1 and R2 differ by a potential shaping function, then their optimal policies are guaranteed to be the same regardless of transition dynamics, and so we should report the “distance” between them to be zero.

If EPIC(R1, R2) is thought of as two functions f(g(R1), g(R2)), where g returns the optimal policy of its input, and f is a distance function for optimal policies, then f(OptimalPolicy1, OptimalPolicy2) is a metric?

One nice thing is that, roughly speaking, rewards are judged to be equivalent if they would generalize to any possible transition function that is consistent with DT. This means that by designing DT appropriately, we can capture how much generalization we want to evaluate.

Can more than one DT be used, so there's more than one measure?

This is a useful knob to have: if we used the maximally large DT, the task would be far too difficult, as it would be expected to generalize far more than even humans can.

There's a maximum?

If EPIC(R1, R2) is thought of as two functions f(g(R1), g(R2)), where g returns the optimal policy of its input, and f is a distance function for optimal policies, then f(OptimalPolicy1, OptimalPolicy2) is a metric?

The authors don't prove it, but I believe yes, as long as DS and DA put support over the entire state space / action space (maybe you also need DT to put support over every possible transition).

I usually think of this as "EPIC is a metric if defined over the space of equivalence classes of reward functions".

Can more than one DT be used, so there's more than one measure?

Yes.

There's a maximum?

For finite, discrete state/action spaces, the uniform distribution over (s, a, s') tuples has maximal entropy. However, it's not clear that that's the worst case for EPIC.