[AN #130]: A new AI x-risk podcast, and reviews of the field

Rohin Shah

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

HIGHLIGHTS

Announcing AXRP, the AI X-risk Research Podcast (Daniel Filan) (summarized by Rohin): Episodes of this new podcast will involve Daniel interviewing a researcher about a paper they’ve written, talking about the ideas in the paper and why they matter. Three episodes have already been released; I summarize them later in this newsletter.

2020 AI Alignment Literature Review and Charity Comparison (Larks) (summarized by Rohin): The tradition continues (AN #78)! I'll say nearly the same thing as I did last year:

This mammoth post goes through the work done within AI alignment from December 2019 - November 2020, from the perspective of someone trying to decide which of several AI alignment organizations to donate to. As part of this endeavor, Larks summarizes a ton of papers that were published at various organizations, and compares them to their budget and room for more funding.

Rohin's opinion: I look forward to this post every year. It continues to be a stark demonstration of how much work doesn't get covered in this newsletter -- while I tend to focus on the technical alignment problem, with some focus on AI governance and AI capabilities, this literature review spans many organizations working on existential risk, and as such has many papers that were never covered in this newsletter. Anyone who wants to donate to an organization working on AI alignment and/or x-risk should read this post.

Last year I mentioned I might write an overview for the sake of building inside view models (rather than donation decisions); this went out shortly afterward (AN #84). I don't expect to write a similar post this year, partly because I think last year's post is still quite good as an overview of the discussion that's been happening.

TAI Safety Bibliographic Database (Jess Riedel et al) (summarized by Rohin): Related to the previous summary, we also have a database of papers on transformative AI safety. It aims for comprehensive coverage of papers motivated by safety at organizations with a significant safety focus within the years 2016-20, but also includes other material such as blog posts and content from earlier years. There’s a bunch of analysis as well that I won’t go into.

Rohin's opinion: I like this project and analysis -- it’s a different view on the landscape of technical AI safety than I usually get to see. I especially recommend reading it if you want to get a sense of the people and organizations comprising the technical AI safety field; I’m not going into detail here because I mostly try to focus on the object level issues in this newsletter.

TECHNICAL AI ALIGNMENT

LEARNING HUMAN INTENT

AXRP 2: Learning Human Biases (Daniel Filan and Rohin Shah) (summarized by Rohin): After talking about my paper on learning biases (AN #59) (for which I refer you to the linked blog post and past AN summary), Daniel and I talked about the implications of inverse reinforcement learning for AI safety, and in particular how we would want AI systems to be architected at a high level.

My position was that we want intelligent AI systems to be trying to help their users: they are explicitly interacting with humans in order to clarify what they should do, perhaps by explicitly asking questions, or by watching other human decisions and making inferences about what humans must care about. (However, this isn’t the vast majority of what they do; it is probably significantly less than one-fifth of “everything they do”.)

In contrast, Daniel would prefer for a superintelligent AI system to be pursuing a well-defined task, such as “build a thriving city”. He has three reasons for this:

1. When our goal is to build AI systems that can pursue a relatively well-defined task, it is much easier for us to tell whether we are succeeding, and we can be much clearer about what it is we are trying to accomplish.

2. We can increase the difficulty of well-specified tasks over time, rising in tandem with the capabilities of AI systems. In contrast, if our AI system is supposed to generically make our life better, that seems like a fixed task that is fairly difficult and requires quite a high minimum threshold of capabilities.

3. It seems easier to tell whether your AI system has built a good city than to tell whether an AI system has generically improved your life.

In the podcast, I don’t think I really engaged properly with the first two points, so I’ll talk about that in the opinion. I did disagree with the third point -- I don’t see why it should be harder to evaluate whether my life has been generically improved; for example, I expect that we are capable of telling apart good and bad personal assistants.

Daniel also asked why it helps to aim for “AI systems that are trying to help you” -- how has that made the problem any simpler? It seems to me that the notion of “helpfulness” is domain-independent: once you have the concept of being helpful, it can be applied in different domains. One hopes that we could then train lots of AI systems that are specialized to particular domains, but all of them are still trying to be helpful.

Rohin's opinion: I think I broadly agree with Daniel’s first two points in support of the task-based approach, and I was somewhat talking past him during the podcast. I generally do agree that individual AI systems should be specialized to particular tasks or domains, and should not be “generically improving one’s life”. I agree with Daniel that, at least outwardly, most of the AI alignment field seems to be about building AI systems that can generically optimize your entire life, or even more ambitiously, the lot of humanity; I also agree that this is weird and probably not the right thing to do.

My optimism about helpfulness is not predicated on an idea that we’ll build AI systems that are generically trying to make all aspects of your life better: I do think that we still want our AI systems to be domain-specific, such as (say) a financial advisor AI system. The idea is more that if we can design domain-general techniques that allow us to train domain-specific systems that are trying to be helpful, that seems like it would be a solution to the AI alignment problem (the problem of how to prevent an AI from adversarially optimizing against its user).

HANDLING GROUPS OF AGENTS

AXRP 3: Negotiable Reinforcement Learning (Daniel Filan and Andrew Critch) (summarized by Rohin): This podcast centers on negotiable RL, which studies how to aggregate preferences of multiple principals (humans) who have different beliefs. In the case where the principals have the same beliefs, Harsanyi’s utilitarianism theorem tells us that any reasonable method of aggregating preferences will end up optimizing some linear combination of the principals’ utility functions. In the case of differing beliefs, this paper proves that every Pareto optimal policy must be optimizing some linear combination of the principals’ utility functions, except that over time the weights are modified based on how well the principals’ beliefs model reality. Intuitively, the principals are both agreeing to the contract “the AI will optimize more for the person whose beliefs are more correct”; since each principal believes their own beliefs, they are both happy with this contract.
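
To make the reweighting idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's algorithm: the specific update rule (multiply each principal's weight by the probability that principal assigned to what was actually observed, then renormalize) and the names `update_weights` and `aggregate_utility` are hypothetical choices made for this example.

```python
def update_weights(weights, predicted_probs):
    """Shift weight toward principals whose beliefs predicted the observation.

    weights: current weight on each principal (non-negative, summing to 1).
    predicted_probs: probability each principal's beliefs assigned to the
        outcome that was actually observed.
    The multiplicative rule below is an assumption made for illustration.
    """
    unnormalized = [w * p for w, p in zip(weights, predicted_probs)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("no principal assigned any probability to the observation")
    return [u / total for u in unnormalized]


def aggregate_utility(weights, utilities):
    """Linear combination of the principals' utilities for a single outcome."""
    return sum(w * u for w, u in zip(weights, utilities))


# Two principals start with equal weight. The first assigned probability 0.8
# to the observed outcome, the second only 0.2.
new_weights = update_weights([0.5, 0.5], [0.8, 0.2])

# Value the AI places on an outcome worth 1.0 to the first principal and
# 0.0 to the second, under the updated weights.
combined = aggregate_utility(new_weights, [1.0, 0.0])
```

In this toy setup the principal whose beliefs matched reality ends up with more of the weight, which is the "contract" described above. Each principal expects their own beliefs to be the accurate ones, so each expects to gain weight over time.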

Most of the podcast is about the motivation and reason for writing this paper. Critch envisions a world in which people and AI systems must cooperate rather than fight, and this paper can be thought of as a study in how people can maximize cooperation. Unfortunately, it turns out that the cooperation-maximizing approach ends up being unfair: people whose beliefs are incorrect end up getting penalized (in terms of actual outcomes, rather than what they believe will happen).

More broadly, Critch hopes that this will spur more research into how parties with different beliefs can share control of AI systems: this seems important for AI to go well in the future.

Rohin's opinion: I really liked this podcast: I definitely hadn’t understood Critch’s full reasons for doing this work. I didn’t include all the points in the summary, so I recommend you listen to it in addition to this summary.

ADVERSARIAL EXAMPLES

AXRP 1: Adversarial Policies (Daniel Filan and Adam Gleave) (summarized by Rohin): The first part of this podcast describes the adversarial policies paper (AN #70); see the summary for details about that. (As a reminder, this is the work that trained an adversarial goalie which, by spasming in a random-looking manner, causes the kicker to completely fail to even kick the ball towards the goal.)

Let’s move on to the more speculative thoughts discussed in this podcast (and not in the paper). One interesting thing that the paper highlights is that the space of policies is very non-transitive: it is possible, perhaps even common, that policy A beats policy B, which beats policy C, which beats policy A. This is clear if you allow arbitrary policies -- for example, the policy “play well, unless you see your opponent make a particular gesture; if you see that gesture then automatically lose” will beat many policies, but can be beaten by a very weak policy that knows to make the particular gesture. You might have thought that in practice, the policies produced by deep RL would exclude these weird possibilities, and so could be ranked by some notion of “competence”, where more competent agents would usually beat less competent agents (implying transitivity). The results of this paper suggest that isn’t the case.
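
The non-transitivity point can be illustrated with a small Python sketch. Everything here is hypothetical: the three policies and the "beats" relation are made up for the example (they are not results from the paper), and `consistent_ranking` is a name introduced only for this illustration. The function looks for a "competence" ordering in which no lower-ranked policy beats a higher-ranked one.

```python
from itertools import permutations


def consistent_ranking(policies, beats):
    """Return an ordering where no later policy beats an earlier one, else None.

    beats is a set of (winner, loser) pairs. Brute force over all orderings,
    which is fine for a handful of policies.
    """
    for order in permutations(policies):
        violates = any(
            (order[j], order[i]) in beats
            for i in range(len(order))
            for j in range(i + 1, len(order))
        )
        if not violates:
            return order
    return None


policies = ["A", "B", "C"]

# Hypothetical cyclic case: A beats B, B beats C, C beats A.
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}

# Hypothetical transitive case for comparison: A beats B and C, B beats C.
transitive = {("A", "B"), ("B", "C"), ("A", "C")}

cyclic_result = consistent_ranking(policies, cyclic)          # None
transitive_result = consistent_ranking(policies, transitive)  # ("A", "B", "C")
```

In the cyclic case no ordering works, so there is no single "competence" score that predicts who beats whom. In the transitive case the ranking exists and is the obvious one.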

The conversation then shifts to the research community and how to choose what research to do. The motivation behind this work was to improve the evaluation of policies learned by deep RL: while freedom from the need for theoretical guarantees (as in control theory) has allowed RL to make progress on previously challenging problems, there hasn’t been a corresponding uptick in engineering-based guarantees, such as testing. The work has had a fairly positive reception in the AI community, though unfortunately it seems this is probably due in part to its flashy results. Other papers that Adam is equally excited about have not had as good a reception.

AI GOVERNANCE

Our AI governance grantmaking so far (Luke Muehlhauser) (summarized by Rohin): This post describes Open Philanthropy’s approach to AI governance, in which they focus on governance for worlds in which we have transformative AI (that is, AI which is as impactful as the Industrial Revolution). However, there is a major challenge: it is usually quite unclear whether a proposed intermediate goal is even positive. Something like “increased safety regulation in the US and Europe” might initially seem good if done well, but even if done well it may actually increase risk by (say) privileging AI development in countries that have lower standards than what would have existed without the regulations. Given this effect, it’s hard to do hits-based giving: most intermediate goals have relatively low expected value, because the huge positive value scenarios can be canceled out by the huge negative value scenarios, and so not many things look like a “hit”.

As a result, grantmaking has so far focused on intermediate goals that seem robustly neutral-to-good: some kinds of research (including on which intermediate goals would be robustly good), advocacy for intermediate goals that are robustly good (e.g. methods for gaining high assurance in AI systems), broad field-building, and better-informed AI governance training and advice.

The post also has some examples of AI governance work that the author thinks have been useful.

NEWS

Formal Methods for the Informal Engineer (Gopal Sarma et al) (summarized by Rohin): This online workshop will teach engineers how to use verification tools like Z3 and Coq, and then discuss how formal verification can be applied in many different areas of software engineering (including robust machine learning). The organizers tell me they plan to produce a white-paper with high-level recommendations following the workshop. You can register here.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.
