Question: MIRI Corrigibility Agenda

by algon33 · 1 min read · 13th Mar 2019 · 11 comments

MIRI's reading list on corrigibility seems outdated, and I can't find a centralised list. Does anyone have, or know of, one?

As a side note, has MIRI stopped updating their reading list? It seems like that's the case.

EDIT:

Links to do with corrigibility given in the comments section. I'll try to update this with some summaries as I read them.

https://www.greaterwrong.com/posts/5bd75cc58225bf0670375041/a-first-look-at-the-hard-problem-of-corrigibility

https://arbital.com/p/corrigibility/

https://arbital.com/p/updated_deference/

https://arxiv.org/pdf/1611.08219.pdf

The only major changes we've made to the MIRI research guide since mid-2015 are to replace Koller and Friedman's Probabilistic Graphical Models with Pearl's Probabilistic Inference; replace Rosen's Discrete Mathematics with Lehman et al.'s Mathematics for CS; add Taylor et al.'s "Alignment for Advanced Machine Learning Systems", Wasserman's All of Statistics, Shalev-Shwartz and Ben-David's Understanding Machine Learning, and Yudkowsky's Inadequate Equilibria; and remove the Global Catastrophic Risks anthology. So the guide is missing a lot of new material. I've now updated the guide to add the following note at the top:

This research guide has been only lightly updated since 2015. Our new recommendation for people who want to work on the AI alignment problem is:
1. If you have a computer science or software engineering background: Apply to attend our new workshops on AI risk and to work as an engineer at MIRI. For this purpose, you don’t need any prior familiarity with our research.
If you aren’t sure whether you’d be a good fit for an AI risk workshop, or for an engineer position, shoot us an email and we can talk about whether it makes sense.
You can find out more about our engineering program in our 2018 strategy update.
2. If you’d like to learn more about the problems we’re working on (regardless of your answer to the above): See “Embedded Agency” for an introduction to our agent foundations research, and see our Alignment Research Field Guide for general recommendations on how to get started in AI safety.
After checking out those two resources, you can use the links and references in “Embedded Agency” and on this page to learn more about the topics you want to drill down on. If you want a particular problem set to focus on, we suggest Scott Garrabrant’s “Fixed Point Exercises.”
If you want people to collaborate and discuss with, we suggest starting or joining a MIRIx group, posting on LessWrong, applying for our AI Risk for Computer Scientists workshops, or otherwise letting us know you’re out there.

For corrigibility in particular, some good material that's not discussed in "Embedded Agency" or the reading guide is Arbital's Corrigibility and Problem of Fully Updated Deference articles.

Is Jessica Taylor's A first look at the hard problem of corrigibility still a good reference or is it outdated?

I'd expect Jessica/Stuart/Scott/Abram/Sam/Tsvi to have a better sense of that than me. I didn't spot any obvious signs that it's no longer a good reference.

I've now also highlighted Scott's tip from "Fixed Point Exercises":

Sometimes people ask me what math they should study in order to get into agent foundations. My first answer is that I have found the introductory class in every subfield to be helpful, but I have found the later classes to be much less helpful. My second answer is to learn enough math to understand all fixed point theorems.
These two answers are actually very similar. Fixed point theorems span all across mathematics, and are central to (my way of) thinking about agent foundations.
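As a tiny illustration of the fixed-point theme (my example, not Scott's): the Banach fixed point theorem says that a contraction mapping has a unique fixed point, and that simply iterating the map converges to it. The function and tolerance below are arbitrary choices for the sketch:

```python
import math

def fixed_point(f, x0, tol=1e-12, max_iter=1000):
    """Iterate f from x0 until successive values are within tol of each other."""
    x = x0
    for _ in range(max_iter):
        x_next = f(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    raise RuntimeError("did not converge")

# cos is a contraction on [0, 1], so iteration converges to the unique
# solution of x = cos(x) (the "Dottie number", roughly 0.739085).
x = fixed_point(math.cos, 1.0)
print(round(x, 6))
```

The same convergence argument underlies many of the fixed-point results the exercises cover, from Banach's theorem up through more abstract settings.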

Just found your question via the comment sections of recent posts. I understand you are still interested in the topic, so I'll add to the comments below. In the summer of 2019 I did significant work trying to understand the status of the corrigibility literature, so here is a long answer, mostly based on that.

First, at this point in time there is no up-to-date centralised reading list on corrigibility. All research agenda or literature overview lists that I know of lack references to the most recent work.

Second, the 'MIRI corrigibility agenda', if we define this agenda as a statement of the type of R&D that MIRI wants to encourage when it comes to the question of corrigibility, is very different from e.g. the 'Paul Christiano corrigibility agenda', if we define that agenda as the type of R&D that Paul Christiano likes to do when it comes to the question of corrigibility. MIRI's agenda related to corrigibility still seems to be to encourage work on decision theory and embeddedness. I am saying 'still seems' here because MIRI as an organisation has largely stopped giving updates about what they are thinking collectively.

Below I am going to talk about the problem of compiling or finding up to date reading lists that show all work on the problem of corrigibility, not a subset of work that is most preferred or encouraged by a particular agenda.

One important thing to note is that by now, unfortunately, the word corrigibility means very different things to different people. MIRI defined corrigibility very clearly in their 2015 paper of that title, as a list of 4 criteria (and, in a later section, also a list of 5 criteria at a different level of abstraction) that an agent has to satisfy in order to be corrigible. Many subsequent authors have used the terms 'corrigibility' or 'this agent is corrigible' to denote different, and usually weaker, desirable properties of an agent. So if someone says that they are working on corrigibility, they may not be working towards the exact 4 (or 5) criteria that MIRI defined. MIRI stresses that a corrigible agent should not take any action that tries to prevent a shutdown button press (or more generally a reward function update). But many authors define success in corrigibility as a weaker property, e.g. that the agent must always accept the shutdown instruction (or the reward function update) when it gets it, irrespective of whether the agent tried to manipulate the human into not pressing the stop button beforehand.

When writing the related work section of my 2019 paper 'Corrigibility with Utility Preservation', I tried to do a survey of all related work on corrigibility, a survey without bias towards my own research agenda. I quickly found that there is a huge amount of writing about corrigibility in various blog/web forum posts and their comment sections, way too much for me to describe in a related work section. There was too much for me to even read it all, though I read a lot of it. So I limited myself, for the related work section, to reading and describing the available scientific papers, including arXiv preprints. I first created a long list of some 60 papers by using Google Scholar to search for all papers that reference the 2015 MIRI paper, by using some other search terms, and by using literature overviews. I then filtered out all the papers which a) just mention corrigibility in a related work section or b) describe the problem in more detail, but without contributing any new work or insights towards a solution. This left me with a short list of only a few papers to cite as related work; it actually surprised me that so little further work had been done on corrigibility after 2015, at least work that made it into publication as a formal paper or preprint.

In any case, I can offer the related work section of my mid-2019 paper on corrigibility as an up-to-date-as-of-mid-2019 reading list on corrigibility, for values of the word corrigibility that stay close to the original 2015 MIRI definition. For broader work that departs further from that definition, I used the device of referencing the 2018 literature review by Everitt, Lee and Hutter.

So what about the literature written after mid-2019 that would belong on a corrigibility reading list? I have not done a complete literature search since then, but my feeling is definitely that the pace of work on corrigibility has picked up a bit since mid-2019, for various values of the word corrigibility.

Several authors, including myself, are avoiding the word 'corrigibility' when referring to the problem of corrigibility. My own reason for avoiding it is that it just means too many different things to different people. So I prefer to use broader terms like 'reward tampering' or 'unwanted manipulation of the end user by the agent'. In his 2019 book Human Compatible, Russell uses the phrase 'the problem of control' to kind-of denote the problem of corrigibility.

So here is my list of post-mid-2019 books and papers that are useful to read if you want to do new R&D on safety mechanisms that achieve corrigibility/that prevent reward tampering or unwanted manipulation, without risking re-inventing the wheel. Unlike the related work section discussed above, this is not based on a systematic global long-list-to-short-list literature search; it is just work that I happened to encounter (or write myself).

  • The book Human Compatible by Russell. -- This book provides a good natural-language problem statement of the reward tampering problem, but it does not get into much technical detail about possible solutions, because it is not aimed at a technical audience. For technical detail about possible solutions:
  • Everitt, T., Hutter, M.: Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective. arXiv:1908.04734 (2019) -- this paper is not just about causal influence diagrams: it can also be used as a good literature overview of many pre-mid-2019 reward tampering solutions, one that is more recent, and provides more descriptive detail, than the 2018 literature review I mentioned above.
  • Stuart Armstrong, Jan Leike, Laurent Orseau, Shane Legg: Pitfalls of learning a reward function online
    https://arxiv.org/abs/2004.13654
    -- this has a very good problem statement in the introduction, phrasing the tampering problem in an 'AGI agent as a reward learner' context. It then gets into a very mathematical examination of the problem.
  • Koen Holtman: AGI Agent Safety by Iteratively Improving the Utility Function
    https://arxiv.org/abs/2007.05411
    (blog post intro here) -- This deals with a particular solution direction to the tampering problem. It also uses math, but I have tried to make the math as accessible as possible to a general technical audience.

This post-mid-2019 reading list is also biased to my own research agenda, and my agenda favours the use of mathematical methods and mathematical analysis over the use of natural language when examining AGI safety problems and solutions. Other people might have other lists.

Hey, thanks for writing all of that. My current goal is to do an up-to-date literature review on corrigibility, so that was a most helpful comment. I'll definitely look over your blog, since some of these papers are quite dense. Out of the papers you recommended, is there one that stands out? Bear in mind that I've read Stuart's and MIRI's papers already.

Thanks, you are welcome!

Dutch custom prevents me from recommending my own recent paper in any case, so if I had to recommend one paper from the time frame 2015-2020 that you probably have not read yet, I'd recommend 'Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective'. It stands out as an overview of different approaches, and I think you can get a good feel for the state of the field from it even if you do not try to decode all the math.

Note that there are definitely some worthwhile corrigibility related topics that are discussed only/mainly in blog posts and in LW comment threads, but not in any of the papers I mention above or in my mid-2019 related work section. For example, there is the open question whether Christiano's Iterated Amplification approach will produce a kind of corrigibility as an emergent property of the system, and if so what kind, and is this the kind we want, etc. I have not seen any discussion of this in the 'formal literature', if we define the formal literature as conference/arxiv papers, but there is a lot of discussion of this in blog posts and comment threads.

Dutch custom prevents me from recommending my own recent paper in any case

This phrase and its implications are perfect examples of problems in corrigibility. Was that intentional? If so, bravo. Your paper looks interesting, but I think I'll read the blog post first. I want a break from reading heavy papers. I wonder if the researchers would be OK with my drawing on their blog posts in the review. Would you mind?

Thanks for recommending "Reward tampering", it is much appreciated. I'll get on it after synthesising what I've read so far. Otherwise, I don't think I'll learn much.

Nope, not intentional.

You should feel free to write a literature overview that cites or draws heavily on paper-announcement blog posts. I definitely won't mind. In general, the blog posts tend to use language that is less mathematical and more targeted at a non-specialist audience. So if you aim to write a literature overview that is as readable as possible for a general audience, then drawing on phrases from the author's blog posts describing the papers (when such posts are available) may be your best bet.

The CHAI reading list is also fairly out of date (last updated April 2017) but has a few more papers, especially if you go to the top and select [3] or [4] so it shows the lower-priority ones.

(And in case others haven't seen it, here's the MIRI reading guide for learning agent foundations.)