[AN #67]: Creating environments in which to study inner alignment failures

Rohin Shah

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).

Highlights

Towards an empirical investigation of inner alignment (Evan Hubinger) (summarized by Rohin): Last week, we saw that the worrying thing about mesa optimizers (AN #58) was that they could have robust capabilities, but not robust alignment (AN #66). This leads to an inner alignment failure: the agent will take competent, highly-optimized actions in pursuit of a goal that you didn't want.

This post proposes that we empirically investigate what kinds of mesa objective functions are likely to be learned, by trying to construct mesa optimizers. To do this, we need two ingredients: first, an environment in which there are many distinct proxies that lead to good behavior on the training environment, and second, an architecture that will actually learn a model that is itself performing search, so that it has robust capabilities. Then, the experiment is simple: train the model using deep RL, and investigate its behavior off distribution to distinguish between the various possible proxy reward functions it could have learned. (The next summary has an example.)

Some desirable properties:

- The proxies should not be identical on the training distribution.

- There shouldn't be too many reasonable proxies, since then it would be hard to identify which proxy was learned by the neural net.

- Proxies should differ on "interesting" properties, such as how hard the proxy is to compute from the model's observations, so that we can figure out how a particular property influences whether the proxy will be learned by the model.

Rohin's opinion: I'm very excited by this general line of research: in fact, I developed my own proposal along the same lines. As a result, I have a lot of opinions, many of which I wrote up in this comment, but I'll give a summary here.

I agree pretty strongly with the high-level picture (focusing on robust capabilities without robust alignment, identifying multiple proxies as the key issue, and focusing on environment design and architecture choice as the hard problems). I do differ in the details though. I'm more interested in producing a compelling example of mesa optimization, and so I care about having a sufficiently complex environment, like Minecraft. I also don't expect there to be a "part" of the neural net that is actually computing the mesa objective; I simply expect that the heuristics learned by the neural net will be consistent with optimization of some proxy reward function. As a result, I'm less excited about studying properties like "how hard is the mesa objective to compute".

A simple environment for showing mesa misalignment (Matthew Barnett) (summarized by Rohin): This post proposes a concrete environment in which we can run the experiments suggested in the previous post. The environment is a maze which contains keys and chests. The true objective is to open chests, but opening a chest requires you to already have a key (and uses up the key). During training, there will be far fewer keys than chests, and so we would expect the learned model to develop an "urge" to pick up keys. If we then test it in mazes with lots of keys, it would go around competently picking up keys while potentially ignoring chests, which would count as a failure of inner alignment. This predicted behavior is similar to how humans developed an "urge" for food because food was scarce in the ancestral environment, even though now food is abundant.
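The reward structure described above can be sketched in a few lines. This is only an illustration under strong assumptions: it drops the maze entirely and keeps just the key/chest bookkeeping, and the class `KeyChestWorld`, the action names, and the two hand-written policies are hypothetical, not taken from the post.

```python
from dataclasses import dataclass


@dataclass
class KeyChestWorld:
    """Toy key/chest bookkeeping with no maze layout (an illustrative assumption)."""
    keys_available: int
    chests_available: int
    keys_held: int = 0
    chests_opened: int = 0   # what the true objective counts
    keys_collected: int = 0  # what a possible proxy objective counts

    def step(self, action: str) -> int:
        """Apply an action and return the true reward (1 per chest opened)."""
        if action == "pick_up_key" and self.keys_available > 0:
            self.keys_available -= 1
            self.keys_held += 1
            self.keys_collected += 1
            return 0
        if action == "open_chest" and self.chests_available > 0 and self.keys_held > 0:
            self.chests_available -= 1
            self.keys_held -= 1  # opening a chest uses up the key
            self.chests_opened += 1
            return 1
        return 0  # impossible or unknown actions do nothing


def run(world, policy, steps):
    """Roll out a policy for a fixed number of steps and sum the true reward."""
    return sum(world.step(policy(world)) for _ in range(steps))


def key_lover(world):
    """Hypothetical proxy-driven policy: grab keys whenever any remain."""
    return "pick_up_key" if world.keys_available > 0 else "open_chest"


def chest_opener(world):
    """Hypothetical policy tracking the true objective: open a chest whenever possible."""
    if world.keys_held > 0 and world.chests_available > 0:
        return "open_chest"
    return "pick_up_key"
```

With 2 keys and 10 chests (keys scarce, as in training), both policies earn a true reward of 2 over six steps, so the true reward alone cannot tell them apart. With 10 keys and 2 chests, `key_lover` spends all six steps collecting keys and earns 0, while `chest_opener` still earns 2.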

Rohin's opinion: While I would prefer a more complex environment to make a more compelling case that this will be a problem in realistic environments, I do think that this would be a great environment to start testing in. In general, I like the pattern of "the true objective is Y, but during training you need to do X to get Y": it seems particularly likely that even current systems would learn to competently pursue X in such a situation.

Technical AI alignment

Iterated amplification

Machine Learning Projects on IDA (Owain Evans et al) (summarized by Nicholas): This document describes three suggested projects building on Iterated Distillation and Amplification (IDA), a method for training ML systems while preserving alignment. The first project is to apply IDA to solving mathematical problems. The second is to apply IDA to neural program interpretation, the problem of replicating the internal behavior of other programs as well as their outputs. The third is to experiment with adaptive computation where computational power is directed to where it is most useful. For each project, they also include motivation, directions, and related work.

Nicholas's opinion: Figuring out an interesting and useful project to work on is one of the major challenges of any research project, and it may require a distinct skill set from the project's implementation. As a result, I appreciate the authors enabling other researchers to jump straight into solving the problems. Given how detailed the motivation, instructions, and related work are, this document strikes me as an excellent way for someone to begin her first research project on IDA or AI safety more broadly. Additionally, while there are many public explanations of IDA, I found this to be one of the most clear and complete descriptions I have read.

Read more: Alignment Forum summary post

List of resolved confusions about IDA (Wei Dai) (summarized by Rohin): This is a useful post clarifying some of the terms around IDA. I'm not summarizing it because each point is already quite short.

Mesa optimization

Concrete experiments in inner alignment (Evan Hubinger) (summarized by Matthew): While the highlighted posts above go into detail about one particular experiment that could clarify the inner alignment problem, this post briefly lays out several experiments that could be useful. One example experiment is giving an RL trained agent direct access to its reward as part of its observation. During testing, we could try putting the model in a confusing situation by altering its observed reward so that it doesn't match the real one. The hope is that we could gain insight into when RL trained agents internally represent 'goals' and how they relate to the environment, if they do at all. You'll have to read the post to see all the experiments.
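One way to picture the reward-observation experiment is as a wrapper around an environment. The sketch below is an assumption-laden illustration, not the setup from the post: `CoinFlipEnv` is a made-up stand-in environment, and the `tamper` hook is a hypothetical name for whatever alters the reward the agent sees at test time.

```python
import random


class CoinFlipEnv:
    """Hypothetical stand-in environment: reward 1 if the action matches a hidden bit."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._bit = self._rng.randint(0, 1)

    def step(self, action):
        reward = 1.0 if action == self._bit else 0.0
        self._bit = self._rng.randint(0, 1)  # draw the hidden bit for the next step
        observation = (0,)  # deliberately uninformative placeholder observation
        return observation, reward


class RewardInObservation:
    """Appends a reward signal to each observation.

    `tamper` (an assumed hook) changes only what the agent observes;
    the true reward returned alongside the observation is left untouched.
    """

    def __init__(self, env, tamper=None):
        self.env = env
        self.tamper = tamper

    def step(self, action):
        observation, true_reward = self.env.step(action)
        shown = true_reward if self.tamper is None else self.tamper(true_reward)
        return observation + (shown,), true_reward
```

During training one would use `RewardInObservation(env)`, so the observed and true rewards agree. At test time, something like `RewardInObservation(env, tamper=lambda r: 1.0 - r)` makes them disagree, and the question becomes which of the two the agent's behavior tracks.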

Matthew's opinion: I'm currently convinced that doing empirical work right now will help us understand mesa optimization, and this was one of the posts that led me to that conclusion. I'm still a bit skeptical that current techniques are sufficient to demonstrate the type of powerful learned search algorithms which could characterize the worst outcomes for failures in inner alignment. Regardless, I think at this point classifying failure modes is quite beneficial, and conducting tests like the ones in this post will make that a lot easier.

Learning human intent

Fine-Tuning GPT-2 from Human Preferences (Daniel M. Ziegler et al) (summarized by Sudhanshu): This blog post and its associated paper describe the results of several text generation/continuation experiments, where human feedback on initial/older samples was used in the form of a reinforcement learning reward signal to finetune the base 774-million parameter GPT-2 language model (AN #46). The key motivation here was to understand whether interactions with humans can help algorithms better learn and adapt to human preferences in natural language generation tasks.

They report mixed results. For the tasks of continuing text with positive sentiment or physically descriptive language, they report improved performance above the baseline (as assessed by external examiners) after fine-tuning on only 5,000 human judgments of samples generated from the base model. The summarization task required 60,000 samples of online human feedback to perform similarly to a simple baseline, lead-3 - which returns the first three sentences as the summary - as assessed by humans.
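To make the lead-3 baseline concrete, here is a minimal sketch. The function name `lead_n` and the regex-based sentence splitter are assumptions made for illustration; the paper's actual preprocessing is not described in this summary and may well differ.

```python
import re


def lead_n(text: str, n: int = 3) -> str:
    """Return the first n sentences of text as the summary.

    Uses a naive splitter (an assumption): break after '.', '!' or '?'
    followed by whitespace. This mishandles abbreviations such as "e.g.".
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:n])
```

The baseline involves no learning at all, which is what makes it a useful sanity check: a fine-tuned summarizer that only matches it has not clearly learned more than "copy the opening of the article".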

Some of the lessons learned while performing this research include 1) the need for better, less ambiguous tasks and labelling protocols for sourcing higher quality annotations, and 2) a reminder that "bugs can optimize for bad behaviour", as a sign error propagated through the training process to generate "not gibberish but maximally bad output". The work concludes on the note that it is a step towards scalable AI alignment methods such as debate and amplification.

Sudhanshu's opinion: It is good to see research on mainstream NLProc/ML tasks that includes discussions on challenges, failure modes and relevance to the broader motivating goals of AI research.

The work opens up interesting avenues within OpenAI's alignment agenda, for example learning a diversity of preferences (A OR B), or a hierarchy of preferences (A AND B) sequentially without catastrophic forgetting.

In order to scale, we would want to generate automated labelers through semi-supervised reinforcement learning, to derive the most gains from every piece of human input. The robustness of this needs further empirical and conceptual investigation before we can be confident that such a system can work to form a hierarchy of learners, e.g. in amplification.

Rohin's opinion: One thing I particularly like here is that the evaluation is done by humans. This seems significantly more robust as an evaluation metric than any automated system we could come up with, and I hope that more people use human evaluation in the future.

Preventing bad behavior

Robust Change Captioning (Dong Huk Park et al) (summarized by Dan H): Safe exploration requires that agents avoid disrupting their environment. Previous work, such as Krakovna et al. (AN #10), penalizes an agent's needless side effects on the environment. For such techniques to work in the real world, agents must also estimate environment disruptions, side effects, and changes while not being distracted by peripheral and unaffecting changes. This paper proposes a dataset to further the study of "Change Captioning," where scene changes are described by a machine learning system in natural language. That is, given before and after images, a system describes the salient change in the scene. Work on systems that can estimate changes can likely advance safe exploration.

Interpretability

Learning Representations by Humans, for Humans (Sophie Hilgard, Nir Rosenfeld et al) (summarized by Asya): Historically, interpretability approaches have involved machines acting as experts, making decisions and generating explanations for their decisions. This paper takes a slightly different approach, instead using machines as advisers who are trying to give the best possible advice to humans, the final decision makers. Models are given input data and trained to generate visual representations based on the data that cause humans to take the best possible actions. In the main experiment in this paper, humans are tasked with deciding whether to approve or deny loans based on details of a loan application. Advising networks generate realistic-looking faces whose expressions represent multivariate information that's important for the loan decision. Humans do better when provided the facial expression 'advice', and furthermore can justify their decisions with analogical reasoning based on the faces, e.g. "x will likely be repaid because x is similar to x', and x' was repaid".

Asya's opinion: This seems to me like a very plausible story for how AI systems get incorporated into human decision-making in the near-term future. I do worry that further down the line, AI systems where AIs are merely advising will get outcompeted by AI systems doing the entire decision-making process. From an interpretability perspective, it also seems to me like having 'advice' that represents complicated multivariate data still hides a lot of reasoning that could be important if we were worried about misaligned AI. I like that the paper emphasizes having humans-in-the-loop during training and presents an effective mechanism for doing gradient descent with human choices.

Rohin's opinion: One interesting thing about this paper is its similarity to Deep RL from Human Preferences: it also trains a human model that is improved over time by collecting more data from real humans. The difference is that DRLHP produces a model of the human reward function, whereas the model in this paper predicts human actions.

Other progress in AI

Reinforcement learning

The Principle of Unchanged Optimality in Reinforcement Learning Generalization (Alex Irpan and Xingyou Song) (summarized by Flo): In image recognition tasks, there is usually only one label per image, such that there exists an optimal solution that maps every image to the correct label. Good generalization of a model can therefore straightforwardly be defined as a good approximation of the image-to-label mapping for previously unseen data.

In reinforcement learning, our models usually don't map environments to the optimal policy, but states in a given environment to the corresponding optimal action. The optimal action in a state can depend on the environment. This means that there is a tradeoff regarding the performance of a model in different environments.

The authors suggest the principle of unchanged optimality: in a benchmark for generalization in reinforcement learning, there should be at least one policy that is optimal for all environments in the train and test sets. With this in place, generalization does not conflict with good performance in individual environments. If the principle does not initially hold for a given set of environments, we can change that by giving the agent more information. For example, the agent could receive a parameter that indicates which environment it is currently interacting with.
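The principle can be checked mechanically in a small tabular setting. The sketch below assumes, as a simplification not made in the summary, that each environment is described by a table from fully observed states to the set of optimal actions in that state; the function names and the example environments are hypothetical.

```python
def shared_optimal_policy(optimal_actions_by_env):
    """Return one state -> action map that is optimal in every environment, or None.

    optimal_actions_by_env: {env_name: {state: set of optimal actions}}
    A shared policy exists only if, for every state, the environments
    containing that state agree on at least one optimal action.
    """
    common = {}
    for table in optimal_actions_by_env.values():
        for state, actions in table.items():
            common[state] = common[state] & actions if state in common else set(actions)
    if any(not actions for actions in common.values()):
        return None
    # min() only breaks ties deterministically among the shared optimal actions.
    return {state: min(actions) for state, actions in common.items()}


def add_env_id(optimal_actions_by_env):
    """Augment every state with its environment's name, as if the agent observed it."""
    return {
        env: {(env, state): actions for state, actions in table.items()}
        for env, table in optimal_actions_by_env.items()
    }


# Two hypothetical environments that look identical at "start" but need opposite actions.
envs = {
    "left_goal": {"start": {"left"}},
    "right_goal": {"start": {"right"}},
}
```

Here `shared_optimal_policy(envs)` returns `None`, since no single action at `"start"` is optimal in both environments. After `add_env_id(envs)`, the two start states become distinct and a shared optimal policy exists, which mirrors the fix of telling the agent which environment it is in.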

Flo's opinion: I am a bit torn here: On one hand, the principle makes it plausible for us to find the globally optimal solution by solving our task on a finite set of training environments. This way the generalization problem feels more well-defined and amenable to theoretical analysis, which seems useful for advancing our understanding of reinforcement learning.

On the other hand, I don't expect the principle to hold for most real-world problems. For example, in interactions with other adapting agents, performance will depend on these agents' policies, which can be hard to infer and change dynamically. This means that the principle of unchanged optimality won't hold without precise information about the other agents' policies, while this information can be very difficult to obtain.

More generally, with this and some of the criticism of the AI safety gridworlds that framed them as an ill-defined benchmark, I am a bit worried that too much focus on very "clean" benchmarks might divert attention from issues associated with the messiness of the real world. I would have liked to see a more conditional conclusion for the paper, instead of a general principle.
