Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.
Risks from Learned Optimization in Advanced Machine Learning Systems (Evan Hubinger et al): Suppose you search over a space of programs, looking for one that plays TicTacToe well. Initially, you might find some good heuristics, e.g. go for the center square, if you have two along a row then place the third one, etc. But eventually you might find the minimax algorithm, which plays optimally by searching for the best action to take. Notably, your outer optimization over the space of programs found a program that was itself an optimizer that searches over possible moves. In the language of this paper, the minimax algorithm is a mesa optimizer: an optimizer that is found autonomously by a base optimizer, in this case the search over programs.
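To make the Tic-Tac-Toe example concrete, here is a minimal sketch of the minimax algorithm, written for this summary rather than taken from the paper. The board encoding (a 9-character string, read row by row) and the function names are assumptions made for illustration. The point is that the program itself searches over moves, which is what makes it an optimizer in the sense above.

```python
from functools import lru_cache

# The eight winning lines on a 3x3 board whose cells are indexed 0..8, row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if that player has three in a line, else None."""
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None


@lru_cache(maxsize=None)
def minimax(board, player):
    """Return (value, move) for the player to move.

    The value is from X's point of view: +1 if X wins, 0 for a draw,
    -1 if O wins, assuming both sides play optimally from here on.
    """
    won = winner(board)
    if won is not None:
        return (1 if won == 'X' else -1), None
    if ' ' not in board:
        return 0, None  # full board with no winner: draw

    best_value, best_move = None, None
    for i, cell in enumerate(board):
        if cell != ' ':
            continue
        child = board[:i] + player + board[i + 1:]
        value, _ = minimax(child, 'O' if player == 'X' else 'X')
        # X maximizes the value, O minimizes it.
        better = (best_value is None
                  or (player == 'X' and value > best_value)
                  or (player == 'O' and value < best_value))
        if better:
            best_value, best_move = value, i
    return best_value, best_move


# From the empty board, optimal play by both sides ends in a draw (value 0).
value, move = minimax(' ' * 9, 'X')
print(value, move)
```

A search over programs that happened to output this function would have found an optimizer, even though nothing in the outer search asked for "search over moves" explicitly.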
Why is this relevant to AI? Well, gradient descent is an optimization algorithm that searches over the space of neural net parameters to find a set that performs well on some objective. It seems plausible that the same thing could occur: gradient descent could find a model that is itself performing optimization. That model would then be a mesa optimizer, and the objective that it optimizes is the mesa objective. Note that while the mesa objective should lead to similar behavior as the base objective on the training distribution, it need not do so off distribution. This means the mesa objective is pseudo aligned; if it also leads to similar behavior off distribution it is robustly aligned.
A central worry with AI alignment is that if powerful AI agents optimize the wrong objective, it could lead to catastrophic outcomes for humanity. With the possibility of mesa optimizers, this worry is doubled: we need to ensure both that the base objective is aligned with humans (called outer alignment) and that the mesa objective is aligned with the base objective (called inner alignment). A particularly worrying aspect is deceptive alignment: the mesa optimizer has a long-term mesa objective, but knows that it is being optimized for a base objective. So, it optimizes the base objective during training to avoid being modified, but at deployment when the threat of modification is gone, it pursues only the mesa objective.
As a motivating example, if someone wanted to create the best biological replicators, they could have reasonably used natural selection / evolution as an optimization algorithm for this goal. However, this then would lead to the creation of humans, who would be mesa optimizers that optimize for other goals, and don't optimize for replication (e.g. by using birth control).
The paper has a lot more detail and analysis of what factors make mesa-optimization more likely, more dangerous, etc. You'll have to read the paper for all of these details. One general pattern is that, when using machine learning for some task X, there are a bunch of properties that affect the likelihood of learning heuristics or proxies rather than actually learning the optimal algorithm for X. For any such property, making heuristics/proxies more likely would result in a lower chance of mesa-optimization (since optimizers are less like heuristics/proxies), but conditional on mesa-optimization arising, makes it more likely that it is pseudo aligned instead of robustly aligned (because now the pressure for heuristics/proxies leads to learning a proxy mesa-objective instead of the true base objective).
Rohin's opinion: I'm glad this paper has finally come out. The concepts of mesa optimization and the inner alignment problem seem quite important, and currently I am most worried about x-risk caused by a misaligned mesa optimizer. Unfortunately, it is not yet clear whether mesa optimizers will actually arise in practice, though I think conditional on us developing AGI it is quite likely. Gradient descent is a relatively weak optimizer; it seems like AGI would have to be much more powerful, and so would require a learned optimizer (in the same way that humans can be thought of as "optimizers learned by evolution").
There still is a lot of confusion and uncertainty around the concept, especially because we don't have a good definition of "optimization". It also doesn't help that it's hard to get an example of this in an existing ML system -- today's systems are likely not powerful enough to have a mesa optimizer (though even if they had a mesa optimizer, we might not be able to tell because of how uninterpretable the models are).
Read more: Alignment Forum version
Technical AI alignment
Selection vs Control (Abram Demski): The previous paper focuses on mesa optimizers that are explicitly searching across a space of possibilities for an option that performs well on some objective. This post argues that in addition to this "selection" model of optimization, there is a "control" model of optimization, where the model cannot evaluate all of the options separately (as in e.g. a heat-seeking missile, which can't try all of the possible paths to the target separately). However, these are not cleanly separated categories -- for example, a search process could have control-based optimization inside of it, in the form of heuristics that guide the search towards more likely regions of the search space.
Rohin's opinion: This is an important distinction, and I'm of the opinion that most of what we call "intelligence" is actually more like the "control" side of these two options.
Learning human intent
Imitation Learning as f-Divergence Minimization (Liyiming Ke et al) (summarized by Cody): This paper frames imitation learning through the lens of matching your model's distribution over trajectories (or conditional actions) to the distribution of an expert policy. This framing of distribution comparison naturally leads to the discussion of f-divergences, a broad set of measures including the KL and Jensen-Shannon divergences. The paper argues that existing imitation learning methods have implicitly chosen divergence measures that incentivize "mode covering" (making sure to have support anywhere the expert does) vs "mode collapsing" (making sure to only have support where the expert does), and that the latter is more appropriate for safety reasons, since the average between two modes of an expert policy may not itself be a safe policy. They demonstrate this by using a variational approximation of the reverse-KL divergence as the divergence underlying their imitation learner.
Cody's opinion: I appreciate papers like these that connect people's intuitions between different areas (like imitation learning and distributional difference measures). It does seem like this would even more strongly lead to lack of ability to outperform the demonstrator, but that's honestly more a critique of imitation learning more generally than this paper in particular.
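The difference between mode covering and mode collapsing can be seen in a small discrete example. The sketch below is an illustration made up for this summary, not the paper's method (which uses a variational approximation). The five-bin "expert" distribution and the two candidate learners are invented numbers. It compares the forward KL, KL(expert || learner), with the reverse KL, KL(learner || expert).

```python
import math


def kl(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions.

    Assumes q_i > 0 wherever p_i > 0, which holds for the examples below.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical expert with two modes (first and last bin) and little mass between them.
expert = [0.45, 0.04, 0.02, 0.04, 0.45]

# Candidate learner 1: spreads mass everywhere, including between the two modes.
q_cover = [0.20, 0.20, 0.20, 0.20, 0.20]

# Candidate learner 2: commits to one of the expert's modes.
q_mode = [0.90, 0.04, 0.02, 0.02, 0.02]

for name, q in [("cover", q_cover), ("mode", q_mode)]:
    print(name, "forward KL:", kl(expert, q), "reverse KL:", kl(q, expert))
```

With these numbers, the forward KL is lower for the covering learner, while the reverse KL is lower for the learner that commits to a single mode, even though that learner ignores half of the expert's behavior.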
Handling groups of agents
Social Influence as Intrinsic Motivation for Multi-Agent Deep RL (Natasha Jaques et al) (summarized by Cody): An emerging field of common-sum multi-agent research asks how to induce groups of agents to perform complex coordination behavior to increase general reward, and many existing approaches involve centralized training or hardcoding altruistic behavior into the agents. This paper suggests a new technique that rewards agents for having a causal influence over the actions of other agents, in the sense that the actions of the pair of agents have high mutual information. The authors empirically find that having even a small number of agents who act as "influencers" can help avoid coordination failures in partial information settings and lead to higher collective reward. In one sub-experiment, they only add this influence reward to the agents' communication channels, so agents are incentivized to provide information that will impact other agents' actions (this information is presumed to be truthful and beneficial since otherwise it would subsequently be ignored).
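For readers unfamiliar with the quantity involved, the following sketch computes the mutual information between two agents' actions from a joint distribution over action pairs. It only illustrates the "high mutual information" notion mentioned above and does not reproduce how the paper actually computes its influence reward. The two joint distributions are hypothetical.

```python
import math


def mutual_information(joint):
    """I(A; B) in nats, where joint[a][b] = P(A = a, B = b)."""
    p_a = [sum(row) for row in joint]        # marginal over agent A's actions
    p_b = [sum(col) for col in zip(*joint)]  # marginal over agent B's actions
    return sum(
        p_ab * math.log(p_ab / (p_a[a] * p_b[b]))
        for a, row in enumerate(joint)
        for b, p_ab in enumerate(row)
        if p_ab > 0
    )


# Hypothetical case 1: B acts independently of A, so A has no influence.
independent = [[0.25, 0.25],
               [0.25, 0.25]]

# Hypothetical case 2: B always matches A's action.
copying = [[0.5, 0.0],
           [0.0, 0.5]]

print(mutual_information(independent))  # 0.0
print(mutual_information(copying))      # log(2), about 0.693 nats
```

An agent rewarded for this kind of quantity is pushed towards the second situation, where its own action carries information about what the other agent does.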
Cody's opinion: I'm interested by this paper's finding that you can generate apparently altruistic behavior by incentivizing agents to influence others, rather than necessarily help others. I also appreciate the point that was made to train in a decentralized way. I'd love to see more work on a less asymmetric version of influence reward; currently influencers and influencees are separate groups due to worries about causal feedback loops, and this implicitly means there's a constructed group of quasi-altruistic agents who are getting less concrete reward because they're being incentivized by this auxiliary reward.
ICML Uncertainty and Robustness Workshop Accepted Papers (summarized by Dan H): The Uncertainty and Robustness Workshop accepted papers are available. Topics include out-of-distribution detection, generalization to stochastic corruptions, label corruption robustness, and so on.
To first order, moral realism and moral anti-realism are the same thing (Stuart Armstrong)
AI strategy and policy
Grover: A State-of-the-Art Defense against Neural Fake News (Rowan Zellers et al): Could we use ML to detect fake news generated by other ML models? This paper suggests that models that are used to generate fake news will also be able to be used to detect that same fake news. In particular, they train a GAN-like language model on news articles, that they dub GROVER, and show that the generated articles are better propaganda than those generated by humans, but they can at least be detected by GROVER itself.
Notably, they do plan to release their models, so that other researchers can also work on the problem of detecting fake news. They are following a similar release strategy as with GPT-2 (AN #46): they are making the 117M and 345M parameter models public, and releasing their 1.5B parameter model to researchers who sign a release form.
Rohin's opinion: It's interesting to see that this group went with a very similar release strategy, and I wish they had written more about why they chose to do what they did. I do like that they are on the face of it "cooperating" with OpenAI, but eventually we need norms for how to make publication decisions, rather than always following the precedent set by someone prior. Though I suppose there could be a bit more risk with their models -- while they are the same size as the released GPT-2 models, they are better tuned for generating propaganda than GPT-2 is.
Read more: Defending Against Neural Fake News
The Hacker Learns to Trust (Connor Leahy): An independent researcher attempted to replicate GPT-2 (AN #46) and was planning to release the model. However, he has now decided not to release, because releasing would set a bad precedent. Regardless of whether or not GPT-2 is dangerous, at some point in the future, we will develop AI systems that really are dangerous, and we need to have adequate norms then that allow researchers to take their time and evaluate the potential issues and then make an informed decision about what to do. Key quote: "sending a message that it is ok, even celebrated, for a lone individual to unilaterally go against reasonable safety concerns of other researchers is not a good message to send".
Rohin's opinion: I quite strongly agree that the most important impact of the GPT-2 decision was that it has started a discussion about what appropriate safety norms should be, whereas before there were no such norms at all. I don't know whether or not GPT-2 is dangerous, but I am glad that AI researchers have started thinking about whether and how publication norms should change.
Other progress in AI
A Survey of Reinforcement Learning Informed by Natural Language (Jelena Luketina et al) (summarized by Cody): Humans use language as a way of efficiently storing knowledge of the world and instructions for handling new scenarios; this paper is written from the perspective that it would be potentially hugely valuable if RL agents could leverage information stored in language in similar ways. They look at both the case where language is an inherent part of the task (example: the goal is parameterized by a language instruction) and where language is used to give auxiliary information (example: parts of the environment are described using language). Overall, the authors push for more work in this area, and, in particular, more work using external-corpus-pretrained language models and with research designs that use human-generated rather than synthetically-generated language; the latter is typically preferred for the sake of speed, but the former has particular challenges we'll need to tackle to actually use existing sources of human language data.
Cody's opinion: This article is a solid and useful version of what I would expect out of a review article: mostly useful as a way to get thinking in the direction of the intersection of RL and language, and makes me more interested in digging more into some of the mentioned techniques, since by design this review didn't go very deep into any of them.
the transformer … “explained”? (nostalgebraist) (H/T Daniel Filan): This is an excellent explanation of the intuitions and ideas behind self-attention and the Transformer architecture (AN #44).
Ray Interference: a Source of Plateaus in Deep Reinforcement Learning (Tom Schaul et al) (summarized by Cody): The authors argue that Deep RL is subject to a particular kind of training pathology called "ray interference", caused by situations where (1) there are multiple sub-tasks within a task, and the gradient update of one can decrease performance on the others, and (2) the ability to learn on a given sub-task is a function of its current performance. Performance interference can happen whenever there are shared components between notional subcomponents or subtasks, and the fact that many RL algorithms learn on-policy means that low performance might lead to little data collection in a region of parameter space, and make it harder to increase performance there in future.
Cody's opinion: This seems like a useful mental concept, but it seems quite difficult to effectively remedy, except through preferring off-policy methods to on-policy ones, since there isn't really a way to decompose real RL tasks into separable components the way they do in their toy example.
Alpha MAML: Adaptive Model-Agnostic Meta-Learning (Harkirat Singh Behl et al)
Really great summaries as always. Digesting these papers is rather time consuming, and these distillations are very understandable and illuminating.
RE Natasha's work: she's said she thinks that whether the influence criterion leads to more or less altruistic behavior is probably environment dependent.
(I didn't understand that last part.)
This reminds me of OpenAI Five - the way they didn't communicate, but all had the same information. It'll be interesting to see if (in this work) the "AI" used the other benefits/types of communication, or if it was all about providing information. (The word "influencers" seems to invoke that.) "Presuming the information is truthful and beneficial" - this brings up a few questions.
1) Are they summarizing? Or are they giving a lot of information and leaving the other party to figure out what's important? We (humans) have preferences over this, but whether these agents do will be interesting, along with how that works - is it based on volume or ratios or frequency?
I'm also gesturing at a continuum here - providing a little information versus all of it.
Extreme examples: a) The agents are connected in a communications network. Though distributed, (and not all nodes are connected) they share all information.* b) A protocol is developed for sending only the minimum amount of information. Messages read like "101" or "001" or just a "0" or a "1", and are rarely sent.
2) What does beneficial mean? Useful? Can "true" information be harmful in this setting? (One can imagine an agent, which upon receiving the information "If you press that button you will lose 100 points", will become curious, and press the button.)
3) Truthful - absolutely, or somewhat? Leaving aside "partial truths"/"lying by omission", do influencers tend towards "the truth" or something else? Giving more information which is useful for both parties? Saying 'option B is better than option A', 'option C is better than option B', and continuing on in this manner (as it is rewarded for this) instead of skipping straight to 'option Z is the best'.
The paper says it's intrinsic motivation so that might not be a problem. I'm surprised they got good results from "try to get other agents to do something different", but it is borrowing from the structure of causality.
This reminds me of the Starcraft AI, AlphaStar. While I didn't get all the details, I recall that the reason for the population was so that the agents could each be given a bunch of different narrower/easier objectives than "Win the game", like "Build 2 Deathstalkers" or "Scout this much of the map" or "find the enemy base ASAP", in order to find out what kind of easy-to-learn things helped them get better at the game.
Glancing through the AlphaStar article again, that seemed more oriented around learning a variety of strategies, and learning them well. Also, there might be architecture differences I'm not accounting for.
(Emphasis added.) Well, I guess AlphaStar demonstrates the effectiveness of off-policy methods. (Possibly with a dash of supervised learning, and well, everything else.)
This sounds like one of those "as General Intelligences we find this easy but it's really hard to program".
*Albeit with two types of nodes - broadcasters and receivers. (If broadcasters don't broadcast to each other, then: 1) In order for everyone to get all the information, the broadcasters must receive all information. 2) In order for all the receivers to get all the information, then for each receiver r, the information held by the set of broadcasters that broadcast to it, b, must include all information.)
Suppose Alice and Bob are in an iterated prisoner's dilemma ($2/$2 for both cooperating, $1/$1 for both defecting, and $0/$3 for cooperate/defect). I now tell Alice that actually she can have an extra $5 each time if she always cooperates. Now the equilibrium is for Alice to always cooperate and Bob to always defect (which is not an equilibrium behavior in normal IPD).
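The claim about the changed equilibrium can be checked on the one-shot stage game. The sketch below enumerates pure-strategy Nash equilibria with and without the bonus. It models the bonus as an extra $5 for Alice in every round where she cooperates, which is a simplifying assumption, and it looks only at a single round rather than the full iterated game. The function names are made up for illustration.

```python
from itertools import product

ACTIONS = ("C", "D")

# (Alice's action, Bob's action) -> (Alice's payoff, Bob's payoff), in dollars.
BASE_PAYOFFS = {
    ("C", "C"): (2, 2),
    ("D", "D"): (1, 1),
    ("C", "D"): (0, 3),
    ("D", "C"): (3, 0),
}


def payoffs(alice, bob, alice_coop_bonus=0):
    """Stage-game payoffs, with an optional bonus paid to Alice when she cooperates."""
    a, b = BASE_PAYOFFS[(alice, bob)]
    if alice == "C":
        a += alice_coop_bonus
    return a, b


def pure_nash(alice_coop_bonus=0):
    """List the action pairs from which neither player gains by deviating alone."""
    equilibria = []
    for a, b in product(ACTIONS, repeat=2):
        ua, ub = payoffs(a, b, alice_coop_bonus)
        alice_ok = all(payoffs(a2, b, alice_coop_bonus)[0] <= ua for a2 in ACTIONS)
        bob_ok = all(payoffs(a, b2, alice_coop_bonus)[1] <= ub for b2 in ACTIONS)
        if alice_ok and bob_ok:
            equilibria.append((a, b))
    return equilibria


print(pure_nash(0))  # [('D', 'D')]
print(pure_nash(5))  # [('C', 'D')]
```

With the bonus, cooperating is strictly better for Alice whatever Bob does, so Bob's best response is to defect every round.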
The worry here is that by adding this extra auxiliary intrinsic reward, you are changing the equilibrium behavior. In particular, agents will exploit the commons less and instead focus more on finding and transmitting useful information. This doesn't really seem like you've "learned cooperation".
Note that in the referenced paper the agents don't have the same information. (I think you know that, just wanted to clarify in case you didn't.)
Most of the considerations you bring up are not things that can be easily evaluated from the paper (and are hard to evaluate even if you have the agents in code and can look into their innards). One exception: you should expect that the information given by the agent will be true and useful: if it weren't, then the other agents would learn to ignore the information over time, which means that the information doesn't affect the other agents' actions and so won't get any intrinsic reward.
I do think that it's dependent on the particular environments you use.
While I didn't read the Ray Interference paper, I think its point was that if the same weights are used for multiple skills, then updating the weights for one of the skills might reduce performance on the other skills. AlphaStar would have this problem too. I guess by having "specialist" agents in the population you are ensuring that those agents don't suffer from as much ray interference, but your final general agent will still need to know all the skills and would suffer from ray interference (if it is actually a thing, again I haven't read the paper).
Yup, sounds right to me.
Note that in the referenced paper the agents don't have the same information. (I think you know that, just wanted to clarify in case you didn't.)
Yes. I brought it up for this point: (edited)
A better way of putting it would have been - "OpenAI Five cooperated with full information and no communication, this work seems interested in cooperation between agents with different information and communication."
That makes sense. I'm curious about what value "this thing that isn't learned cooperation" doesn't capture.
A better way of putting my question would have been:
Is "useful" a global improvement, or a local improvement? (This sort of protocol leads to improvements, but what kind of improvements are they?)
Hm, I thought of them as things that would require looking at:
1) Behavior in environments constructed for that purpose.
2) Looking at the information the agents communicate.
My "communication" versus "information" distinction is an attempt to differentiate between (complex) things like:
Information: This is the payoff matrix (or more of it).*
Communication: I'm going for stag next round. (A promise.)
Information: If everyone chooses "C" things are better for everyone.
Communication: If in a round, I've chosen "C" and you've chosen "D", the following round I will choose "D".
Information: Player M has your source code, and will play whatever you play.
*I think of this as being different from advice (proposing an action, or an action over another).
It suggests that in other environments that aren't tragedies of the commons, the technique won't lead to cooperation. It also suggests that you could get the same result by giving the agents any sort of extra reward (that influences their actions somehow).
Also not clear what the answer to this is.
The agents won't work in any environment other than the one they were trained in, and the information they communicate is probably in the form of vectors of numbers that are not human-interpretable. It's not impossible to analyze them, but it would be difficult.