Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Newsletter #115

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Open Questions in Creating Safe Open-ended AI: Tensions Between Control and Creativity (Adrien Ecoffet et al) (summarized by Rohin): One potential pathway to powerful AI is through open-ended search, in which we use search algorithms to search for good architectures, learning algorithms, environments, etc. in addition to using them to find parameters for a particular architecture. See the AI-GA paradigm (AN #63) for more details. What do AI safety issues look like in such a paradigm?

Building on DeepMind’s framework (AN #26), the paper considers three levels of objectives: the ideal objective (what the designer intends), the explicit incentives (what the designer writes down), and the agent incentives (what the agent actually optimizes for). Safety issues can arise through differences between any of these levels.

The main difference that arises when considering open-ended search is that it’s much less clear to what extent we can control the result of an open-ended search, even if we knew what result we wanted. We can get evidence about this from existing complex systems, though unfortunately there are not any straightforward conclusions: several instances of convergent evolution might suggest that the results of the open-ended search run by evolution were predictable, but on the other hand, the effects of intervening on complex ecosystems are notoriously hard to predict.

Besides learning from existing complex systems, we can also empirically study the properties of open-ended search algorithms that we implement in computers. For example, we could run search for some time, and then fork the search into independent replicate runs with different random seeds, and see to what extent the results converge. We might also try to improve controllability by using meta learning to infer what learning algorithms, environments, or explicit incentives help induce controllability of the search.
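The fork-and-replicate experiment described above can be sketched in a few lines. This is a toy illustration, not any paper's actual setup: `run_search` and `population_descriptor` are hypothetical stand-ins for whatever open-ended search process and behavioral characterization you actually use.

```python
# Toy sketch of the "fork and replicate" controllability experiment:
# run a search for a while, fork into independent replicates with
# different seeds, and measure how much the outcomes disperse.
import random
import statistics

def run_search(state, steps, seed):
    """Placeholder: advance an open-ended search from `state` for `steps`
    iterations under the given random seed, returning the final state."""
    rng = random.Random(seed)
    return state + sum(rng.gauss(0, 1) for _ in range(steps))  # toy dynamics

def population_descriptor(state):
    """Placeholder: map a final search state to a behavioral descriptor."""
    return state

# Run the search for some time, then fork into independent replicates.
common_prefix = run_search(state=0.0, steps=1000, seed=0)
replicates = [run_search(common_prefix, steps=1000, seed=s) for s in range(10)]

# Low dispersion across replicates suggests the outcome of the search is
# predictable (and hence more controllable); high dispersion suggests not.
descriptors = [population_descriptor(r) for r in replicates]
print(statistics.stdev(descriptors))
```

The same harness could be reused for the meta-learning suggestion: treat the dispersion statistic as an objective and search over environments or explicit incentives that minimize it.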

The remaining suggestions will be familiar to most readers: they suggest work on interpretability (that now has to work with learned architectures), better benchmarks, human-in-the-loop search, safe exploration, and sim-to-real transfer.

Rohin's opinion: I’m glad that people are paying attention to safety in this AGI paradigm, and the problems they outline seem like reasonable problems to work on. I actually expect that the work needed for the open-ended search paradigm will end up looking very similar to the work needed by the “AGI via deep RL” paradigm: the differences I see are differences in difficulty, not differences in what problems qualitatively need to be solved. I’m particularly excited by the suggestion of studying how particular environments can help control the result of the open-ended search: it seems like even with deep RL based AGI, we would like to know how properties of the environment can influence properties of agents trained in that environment. For example, what property must an environment satisfy in order for agents trained in that environment to be risk-averse?



Model splintering: moving from one imperfect model to another (Stuart Armstrong) (summarized by Rohin): This post introduces the concept of model splintering, which seems to be an overarching problem underlying many other problems in AI safety. This is one way of more formally looking at the out-of-distribution problem in machine learning: instead of simply saying that we are out of distribution, we look at the model that the AI previously had, and see what model it transitions to in the new distribution, and analyze this transition.

Model splintering in particular refers to the phenomenon where a coarse-grained model is “splintered” into a more fine-grained model, with a one-to-many mapping between the environments that the coarse-grained model can distinguish between and the environments that the fine-grained model can distinguish between (this is what it means to be more fine-grained). For example, we may initially model all gases as ideal gases, defined by their pressure, volume and temperature. However, as we learn more, we may transition to the van der Waals equation, which applies differently to different types of gases, and so an environment like “1 liter of gas at standard temperature and pressure (STP)” now splinters into “1 liter of nitrogen at STP”, “1 liter of oxygen at STP”, etc.
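The gas example can be made concrete with a small computation. This is just an illustrative sketch: the van der Waals constants below are approximate textbook values, and the point is only that the coarse model assigns one answer where the fine model assigns several.

```python
# Toy illustration of model splintering: the ideal-gas model cannot
# distinguish nitrogen from oxygen, but the van der Waals model can.
R = 0.08314  # gas constant in L·bar/(mol·K)

def ideal_pressure(n, volume, temp):
    """Coarse-grained model: P = nRT/V, identical for every gas."""
    return n * R * temp / volume

def van_der_waals_pressure(n, volume, temp, a, b):
    """Fine-grained model: P = nRT/(V - nb) - a*n^2/V^2, gas-specific."""
    return n * R * temp / (volume - n * b) - a * n**2 / volume**2

# Approximate van der Waals constants (a in L²·bar/mol², b in L/mol).
GASES = {"nitrogen": (1.370, 0.0387), "oxygen": (1.382, 0.0319)}

n, volume, temp = 0.0446, 1.0, 273.15  # roughly 1 L of gas near STP

# One coarse-grained environment ("1 L of gas at STP")...
print("ideal:", ideal_pressure(n, volume, temp))
# ...splinters into one environment per gas type.
for gas, (a, b) in GASES.items():
    print(gas, van_der_waals_pressure(n, volume, temp, a, b))
```

The two fine-grained pressures are nearly, but not exactly, equal; a reward function defined over the coarse model has no canonical extension to the distinctions the fine model introduces.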

Model splintering can also apply to reward functions: for example, in the past people might have had a reward function with a term for “honor”, but at this point the “honor” concept has splintered into several more specific ideas, and it is not clear how a reward for “honor” should generalize to these new concepts.

The hope is that by analyzing splintering and detecting when it happens, we can solve a whole host of problems. For example, we can use this as a way to detect if we are out of distribution. The full post lists several other examples.

Rohin's opinion: I think that the problems of generalization and ambiguity out of distribution are extremely important and fundamental to AI alignment, so I’m glad to see work on them. It seems like model splintering could be a fruitful approach for those looking to take a more formal approach to these problems.

An Architectural Risk Analysis of Machine Learning Systems: Towards More Secure Machine Learning (Gary McGraw et al) (summarized by Rohin) (H/T Catherine Olsson): One systematic way of identifying potential issues in a system is to perform an architectural risk analysis, in which you draw an architecture diagram showing the various components of the system and how they interact, and then think about each component and interaction and how it could go wrong. (Last week’s highlight (AN #114) did this for Bayesian history-based RL agents.) This paper performs an architectural risk analysis for a generic ML system, resulting in a systematic list of potential problems that could occur.

Rohin's opinion: As far as I could tell, the problems identified were ones that we had seen before, but I’m glad someone has gone through the more systematic exercise, and the resulting list is more organized and easier to understand than previous lists.


Forecasting Thread: AI Timelines (Amanda Ngo et al) (summarized by Rohin): This post collects forecasts of timelines until human-level AGI, and (at the time of this writing) has twelve such forecasts.

Roadmap to a Roadmap: How Could We Tell When AGI is a ‘Manhattan Project’ Away? (John-Clark Levin et al) (summarized by Rohin): The key hypothesis of this paper is that once there is a clear “roadmap” or “runway” to AGI, state actors are likely to invest vast resources in achieving it, comparable to the Manhattan Project. The fact that we do not see signs of such investment now does not imply that it won’t happen in the future: currently, there is so little “surface area” on the problem of AGI that throwing vast amounts of money at the problem is unlikely to help much.

If this were true, then once such a runway is visible, incentives could change quite sharply: in particular, the current norms of openness may quickly give way to norms of secrecy, as nations compete (or perceive themselves to be competing) with other nations to build AGI first. As a result, it would be valuable to have a reliable measure of whether we have reached the point where such a runway exists.

Read more: Import AI summary


State of AI Ethics (Abhishek Gupta et al) (summarized by Rohin): This report from the Montreal AI Ethics Institute has a wide variety of summaries on many different topics in AI ethics, quite similarly to this newsletter in fact.


Decision Points in AI Governance (Jessica Cussins Newman) (summarized by Rohin): While the last couple of years have seen a proliferation of “principles” for the implementation of AI systems in the real world, we are only now getting to the stage in which we turn these principles into practice. During this period, decision points are concrete actions taken by some AI stakeholder with the goal of shaping the development and use of AI. (These actions should not be predetermined by existing law and practice.) Decision points are the actions that will have a disproportionately large influence on the field, and thus are important to analyze. This paper analyzes three case studies of decision points, and draws lessons for future decision points.

First, we have the Microsoft AETHER committee. Like many other companies, Microsoft has established a committee to help the company make responsible choices about its use of AI. Unlike e.g. Google’s AI ethics board, this committee has actually had an impact on Microsoft’s decisions, and has published several papers on AI governance along the way. The committee attributes its success in part to executive-level support, regular opportunities for employee and expert engagement, and integration with the company’s legal team.

Second, we have the GPT-2 (AN #46) staged release process. We’ve covered this before (AN #55, AN #58), so I won’t retell the story here. However, this shows how a deviation from the norm (of always publishing) can lead to a large discussion about what publication norms are actually appropriate, leading to large changes in the field as a whole.

Finally, we have the OECD AI Policy Observatory, a resource that has been established to help countries implement the OECD AI principles. The author emphasizes that it was quite impressive for the AI principles to even get the support that they did, given the rhetoric about countries competing on AI. Now, as the AI principles have to be put into practice, the observatory provides several resources for countries that should help in ensuring that implementation actually happens.

Read more: MAIEI summary



Combining Deep Reinforcement Learning and Search for Imperfect-Information Games (Noam Brown, Anton Bakhtin et al) (summarized by Rohin): AlphaZero (AN #36) and its predecessors have achieved impressive results in zero-sum two-player perfect-information games, by using a combination of search (MCTS) and RL. This paper provides the first combination of search and deep RL for imperfect-information games like poker. (Prior work like Pluribus (AN #74) did use search, but didn’t combine it with deep RL, instead relying on significant expert information about poker.)

The key idea that makes AlphaZero work is that we can estimate the value of a state independently of other states, without any interaction effects. For any given state s, we can simulate possible future rollouts of the game, and propagate the values of the resulting new states back up to s. In contrast, for imperfect-information games, this approach does not work, since you cannot estimate the value of a state independently of the policy used to reach that state. The solution is to instead estimate values for public belief states, which capture the public common knowledge that all players have. Once this is done, it is possible to once again use the strategy of backing up values from simulated future states to the current state, and to train a value network and policy network based on this.
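A minimal toy sketch of the public-belief-state idea, under stated assumptions: the names and structure below are illustrative inventions, not the paper's actual implementation. A public belief state is represented as a distribution over private information consistent with the public history, and its value is an expectation over that belief and the players' policies, which is why the value cannot be defined independently of the policy.

```python
# Toy sketch: the value of a public belief state is an expectation over
# (a) the belief about private information and (b) the policy's action mix.
# All names and numbers here are hypothetical, for illustration only.

def belief_state_value(belief, policy, leaf_values):
    """Expected value of a public belief state: average leaf values,
    weighted by the belief over private states and the policy's
    action probabilities in each private state."""
    value = 0.0
    for private_state, prob in belief.items():
        for action, action_prob in policy[private_state].items():
            value += prob * action_prob * leaf_values[(private_state, action)]
    return value

# Toy poker-like spot: the opponent privately holds a strong or weak hand.
belief = {"strong": 0.3, "weak": 0.7}
policy = {"strong": {"bet": 0.9, "check": 0.1},
          "weak":   {"bet": 0.2, "check": 0.8}}
leaf_values = {("strong", "bet"): -2.0, ("strong", "check"): -1.0,
               ("weak", "bet"): 1.5, ("weak", "check"): 0.5}

print(belief_state_value(belief, policy, leaf_values))
```

Note that changing `policy` changes the value of the very same belief state, which is exactly the coupling that blocks the naive AlphaZero-style backup in imperfect-information games.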


AI Governance Project Manager (Markus Anderljung) (summarized by Rohin): The Centre for the Governance of AI is hiring for a project manager role. The deadline to apply is September 30.


I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.


An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.


MAIEI also has an AI Ethics newsletter I recommend for those interested in the topic.

Is this page completely unreadable for anyone else?

Yeah, sorry, we get some really mangled HTML from the RSS feed that Rohin registered, that's a bit of a pain to clean up, so we've been doing it manually for a bit. My guess is I will get around to automating it, but it's not super trivial, since the HTML we get has a lot of table-layout stuff that sometimes is relevant to the content, and sometimes isn't, and so I would have to experiment for a while to find the right sanitization rules to make everything work nicely without human intervention.

I forget if I mentioned this before, but all of this HTML is generated by a script with a much more structured input, which you can see here. Plausibly we should just add another output mode to the script that can be easily imported into LessWrong? (Happy to share you on the spreadsheet from which the input data comes if that would help.)

Yeah, that might end up being easier. I might look at the code and make a PR for a minimalist HTML template.


Yep, clicking "View this email in browser" allowed me to read it but obviously would be better to have it fixed here as well.

Currently this is fixed manually for each crosspost by converting it to draft-js and then deleting some extra stuff. I'm not sure how high a priority it is to make that automatic.

Decision Points in AI Governance


(These actions should not have been predetermined by existing law and practice.)

Should not have been, or should not be?

should not be, thanks

I actually expect that the work needed for the open-ended search paradigm will end up looking very similar to the work needed by the “AGI via deep RL” paradigm: the differences I see are differences in difficulty, not differences in what problems qualitatively need to be solved.

I'm inclined to agree. I wonder if there are any distinctive features that jump out?

Hey Rohin, I'm writing a review on everything that's been written on corrigibility so far. Do "the off switch game", "Active Inverse Reward Design", "should robots be obedient", and "incorrigibility in CIRL", as well as your reply in the Newsletter, represent CHAI's current views on the subject? If not, which papers contain them?

Uh, I don't speak for CHAI, and my views differ pretty significantly from e.g. Dylan's or Stuart's on several topics. (And other grad students differ even more.) But those seem like reasonable CHAI papers to look at (though I'm not sure how Active IRD relates to corrigibility). Chapter 3 of the Value Learning sequence has some of my takes on reward uncertainty, which probably includes some thoughts about corrigibility somewhere.

Human Compatible also talks about corrigibility iirc, though I think the discussion is pretty similar to the one in the off switch game?

Active IRD doesn't have anything to do with corrigibility, I guess my mind just switched off when I was writing that. Anyway, how diverse are CHAI's views on corrigibility? Could you tell me who I should talk to? Because I've already read all the published stuff on it, if I'm understanding you rightly, and I want to make sure that all the perspectives on this topic are covered.

Hmm, I expect each grad student will have a slightly different perspective, but off the top of my head I think Michael Dennis has the most opinions on it. (Other people could include Daniel Filan and Adam Gleave.)

Thanks. Two questions:

Do the staff and faculty have a similar diversity of opinions?

Is messaging in order to contact your peers the right procedure here?

Hmm, of the faculty Stuart spends the most time thinking about AI alignment, I'm not sure how much the other faculty have thought about corrigibility -- they'll have views about the off switch game, but not about MIRI-style corrigibility.

Most of the staff doesn't work on technical research, so they probably won't have strong opinions. Exceptions: Critch and Karthika (though I don't think Karthika has engaged much with corrigibility).

Probably the best way is to find emails of individual researchers online and email them directly. I've also left a message on our Slack linking to this discussion.