2019 AI Alignment Literature Review and Charity Comparison

[-]Ben Pace6y*340

Well I'm in love. Thanks so much for doing this again, this was great, I learned a lot, yadda yadda yadda, let me talk about MIRI and Secrecy.

I don’t think we can give people money just because they say they are doing good things, because of the risk of abuse. There are many other reasons for not publishing anything. Some simple alternative hypothesis include “we failed to produce anything publishable” or “it is fun to fool ourselves into thinking we have exciting secrets” or “we are doing bad things and don’t want to get caught.” The fact that MIRI’s researchers appear intelligent suggest they at least think they are doing important and interesting issues, but history has many examples of talented reclusive teams spending years working on pointless stuff in splendid isolation.

And later on you say that it's hard to donate to them "in good faith", "given their lack of disclosure".

Here are my quick thoughts:

In the last 2-5 years I endorsed donating to MIRI (and still do), and my reasoning back then was always of the type "I don't understand their technical research, but I have read a substantial amount of the philosophy and worldview that was used to successfully pluck that problem out of the space of things to work on, and think it is deeply coherent and sensible and it's been surprisingly successful in figuring out AI is an x-risk, and I expect to find it is doing very sensible things in places I understand less well." Then, about a year ago, MIRI published the Embedded Agency sequence, and for the first time I thought "Oh, now I feel like I have an understanding of what the technical research is guided by, and what it's about, and indeed, this makes a lot of sense." My feelings have rarely been changed by reading ongoing research papers at MIRI, which were mostly just very confusing to me. They all seemed individually interesting, but I didn't see the broader picture before Embedded Agency.
At no time has any important part of my reasoning been "I don't understand their reasoning, but they're situated inside a broader intellectual space that is running a number of checks of the basics of their work, and can evaluate the quality of the work, so I don't have to evaluate the project myself." I've never really thought there were experts in this space I could defer to; academia seemed terribly slow and confused about the problem. Similarly to how companies rarely argue that they shouldn't exist while they're making money, fields are very bad at arguing that they shouldn't exist whilst the research is progressing, which is the sort of work that would be required to realise that AGI is by-default an existential catastrophe and that capabilities must be dropped whilst alignment work needs doing. It's always been a question of whether I trust the epistemology of the leadership at MIRI to think clearly about these problems, and what I understand of how they think.
During the time that MIRI was non-non-disclosed (i.e. open), I don't think there was substantially more checking of their work than there is now. Most of academia and industry has/had no interest. There were LWers like Paul Christiano and Wei Dai and Stuart Armstrong that engaged, and I'm sad that these people aren't able to engage with the present work. But at the time Paul and Wei were basically saying "I don't get why this is going to work out" so it's not like MIRI could start getting much more negative feedback, unless Paul and Wei were going to say "This obviously fails and I can prove how," which I assign little probability to.
Also, you mention hypotheses like "we failed to produce anything publishable" or “it is fun to fool ourselves into thinking we have exciting secrets” or “we are doing bad things and don’t want to get caught.” Again, I don't think any of these is likely to be the cause of MIRI doing non-disclosed work, because they have had policies like this for a long time, including back when there were real results. For example, when Scott Garrabrant came up with logical inductors, MIRI spent (according to this post) almost 6 months figuring out whether to disclose it. So I expect the current reasoning to be fairly consistent with the decisions they made in the past.
Note that Nate used the new non-disclosed research strategy to solve the initial formulation of the tiling agents problem, something he mentions here (search 'tiling agents'). Now, he's not published it yet, but, I mean, I definitely trust he's telling the truth. Maybe it turned out to not be as useful as hoped (as with Logical Inductors), but it still seems like progress that's similar in kind to the progress that happened before MIRI was non-disclosed by default.
Oh also, they've finally been published for decision theory in a prestigious philosophy journal, so a number of highly skeptical outside views should now not be concerned. Overall it seems to me like from many such perspectives they're clearly now in a stronger place than they were 2 years ago. They used to be confusing and had no results and not recognised by academia. Now they are less confusing and have some fascinating results and are recognised by academia.

To be clear, I have many mixed feelings about MIRI being non-disclosed, and think it's not very helpful for the broader intellectual environment and for recruitment and I broadly feel a bit confused about it. But "risk of abuse [of secrecy]" is not a factor I personally worry about with MIRI's research progress. I feel similarly about the day-to-day and month-to-month work at MIRI as before they entered non-disclosure; I expect they're largely chugging along, but with more direction and focus than 3-4 years ago.

[-]Ben Pace6y*130

Let me return to this section:

I don’t think we can give people money just because they say they are doing good things, because of the risk of abuse. There are many other reasons for not publishing anything. Some simple alternative hypothesis include “we failed to produce anything publishable” or “it is fun to fool ourselves into thinking we have exciting secrets” or “we are doing bad things and don’t want to get caught.” The fact that MIRI’s researchers appear intelligent suggest they at least think they are doing important and interesting issues, but history has many examples of talented reclusive teams spending years working on pointless stuff in splendid isolation.

Additionally, by hiding the highest quality work we risk impoverishing the field, making it look unproductive and unattractive to potential new researchers.

It's funny to hear you suggest "What if they have no good research?" and "I'm irritated that they are hiding all of our best research!" in successive paragraphs as two reasons you're concerned about secrecy ;-)

So, my current epistemic state is something like this: Eliezer and Benya and Patrick and others spent something like 4 MIRI-years hacking away at research, and I didn't get it. Finally Scott and Abram made some further progress on it, and crystallised it into an explanation I actually felt I sorta get. And most of the time I spent trying to understand their work in the meantime was wasted effort on my part, and quite plausibly wasted effort on their part. I remember that time they wrote up a ton of formal-looking papers for the Puerto Rico conference, to be ready in case a field suddenly sprang up around them... but then nobody really read them or built on them. So I don't mind if, in the intervening 3-4 years, they again don't really try to explain what they're thinking about to me, until a bit of progress and good explanation comes along. They'll continue to write things about the background worldview, like Security Mindset, and Rocket Alignment, and Challenges to Christiano's Capability Amplification Proposal, and all of the beautiful posts that Scott and Abram write, but overall focus on getting a better understanding of the problem by themselves.

(I'm open to hearing that other researchers at DM, OpenAI, and so on, feel like they used to get major value out of talking with MIRI and now are not able to; my current epistemic state is that core MIRI is still interested to talk with Paul about iterated amplification, talk with Critch about the stuff Critch does, and so on, about as much as when those folks worked at other orgs but MIRI was open. Which is to say I think they used to talk a little bit, and were often interesting but confusing, and now still talk a little bit, and are similarly interesting but confusing.)

In their annual write-up they suggest that progress was slower than expected in 2019. However I assign little weight to this as I think most of the cross-sectional variation in organisation reported subjective effectiveness comes from variance in how optimistic/salesy/aggressive they are, rather than actually indicating much about object-level effectiveness.

It took me 5 attempts to read that sentence. I think you're saying "when different parts of orgs report how well they're doing, the difference is primarily explained by how salesy they generally are". To state the obvious, if Nate had said "Y'know what, it's going great, everything's fine, don't worry about us!" then I'd've felt it was more obviously a bit worrying from the perspective of someone on the outside, but especially a somewhat negative report feels like a not obviously adversarial signal. I mean, maybe they're playing one level higher, but I mostly think they care about saying literally true things, and hold themselves to high standards here.

That said, to empathise with your position a bit, I do think that if the quote by Nate hadn't been in the most recent fundraiser post... it would have felt much more like I couldn't coordinate with this (large!) part of MIRI, and I'd have to start acting as though they just didn't exist. Like, maybe they'd be making the right choices for the right reasons, but it certainly wouldn't likely be something I'd feel I could plan around for ~5 years, without a massive chance of being unpleasantly blindsided at the end by "Oh yeah, for the past few years we've increasingly realised this research is doomed". MIRI's annual self-reports and small number of explicit results have often been some of the central things I've had to go on regarding how well the research is going year-to-year, and losing those would feel pretty rough. I think I'd be pretty sad if a few clear results also weren't written up to some extent in the intervening years (e.g. the tiling agents result).

I don't want to underplay that I think the secrecy is currently pretty bad from the perspective of public discourse, and that most folks in x-risk are noticing all the true and important reasons for secrecy but not making plans to ensure that the public discourse will continue to work once they've stopped saying what they think out loud any more, especially in public discourse's current fragile state. I'm not sure if people aren't noticing the problems or just feel stretched too thin. I think this will substantially reduce the number of surprisingly good things that will happen and damage our long-term coordination. I should write more on that some time, when I have something clearer to say, which I don't feel I do right now.

[-]ESRogs6y30

It took me 5 attempts to read that sentence. I think you're saying "when different parts of orgs report how well they're doing, the difference is primarily explained by how salesy they generally are". To state the obvious, if Nate had said "Y'know what, it's going great, everything's fine, don't worry about us!" then I'd've felt it was more obviously a bit worrying from the perspective of someone on the outside, but especially a somewhat negative report feels like a not obviously adversarial signal.

And now I'm having trouble interpreting your comment! :P

Your paraphrase of Larks's sentence sounds plausible (though I am also confused about what he was trying to say), but I don't see how your "To state the obvious..." responds to it. I didn't take Larks to be saying that he finds Nate's sentence worrying, but rather that he just doesn't consider it much evidence of how much progress MIRI is making one way or the other.

[-]Rafael Harth6y10

Like, during the time that MIRI was non-non-disclosed (i.e. open), I don't think there was substantially more checking of their work than there is now. Most of academia and industry has/had no interest. There were LWers like Paul Christiano and Wei Dai and Stuart Armstrong that engaged, and I'm sad that these people aren't able to engage with the present work. But at the time Paul and Wei were basically saying "I don't get why this is going to work out" so it's not like MIRI could start getting much more negative feedback, unless Paul and Wei were going to say "This obviously fails and I can prove how," which I assign little probability to.

However, the Open Philanthropy Project did have people evaluate MIRI's work, and they ended up sending them a substantial donation because they liked the Logical Induction paper – unless I'm misremembering how it went down.

[-]Ben Pace6y60

Yes, that's right. OTOH I recall OpenPhil trying to evaluate MIRI's work before the logical induction paper, and they thought it was pretty terrible. As I mentioned, I'd be pro MIRI writing up any / a few clear results over the next few years, for reasons like this.

There's also a question of how much OpenPhil's support mattered in your estimation of MIRI. I might write more on it later, but overall it's not been a major factor for me.

[-]ioannes6y30

See also My current thoughts on MIRI's "highly reliable agent design" work by Daniel Dewey (Open Phil lead on technical AI grant-making).

From the "What do I think of HRAD?" section:

... This reduces my credence in HRAD being very helpful to around 10%. I think this is the decision-relevant credence.

[-]evhub6yΩ13210

On the other hand, I don’t think we can give people money just because they say they are doing good things, because of the risk of abuse. There are many other reasons for not publishing anything. Some simple alternative hypothesis include “we failed to produce anything publishable” or “it is fun to fool ourselves into thinking we have exciting secrets” or “we are doing bad things and don’t want to get caught.” The fact that MIRI’s researchers appear intelligent suggest they at least think they are doing important and interesting issues, but history has many examples of talented reclusive teams spending years working on pointless stuff in splendid isolation.

Additionally, by hiding the highest quality work we risk impoverishing the field, making it look unproductive and unattractive to potential new researchers.

My work at MIRI is public, btw.

a Mesa-Optimizer - a sub-agent of an optimizer that is itself an optimizer

I think this is a poor description of mesa-optimization. A mesa-optimizer is not a subagent, it's just a trained model implementing a search algorithm.

[-]habryka6yΩ6190

Promoted to curated: This post has been an amazing resource for almost anyone who is working on AI Alignment every year. It also turns out to be one of the most comprehensive bibliographies for AI Alignment research (next to the AI Alignment Newsletter spreadsheet), which in itself is a valuable resource that I've made use of myself a few times.

Thank you very much for writing this!

[-]Vanessa Kosoy6yΩ10160

Thank you for writing this impressive review!

Some comments on MIRI's non-disclosure policy.

First, some disclosure :) My research is funded by MIRI. On the other hand, all of my opinions are my own and do not represent MIRI or anyone else associated with MIRI.

The non-disclosure policy has no direct effect on me, but naturally, both before and after it was promulgated, I used my own judgement to decide what should or should not be made public. The vast majority of my work I do make public (subject only to the cost of time and effort to write and explain it), because if I think something would increase risk rather than reduce it^[1], then I don't pursue this line of inquiry in the first place. Things I don't make public are mostly early stage ideas that I don't develop.

I think it is fair enough to judge AI alignment orgs only by the public output they produce. However, it doesn't at all follow that a non-disclosure policy leads to immediate disqualification, as you seem to imply. You can judge an org by its public output whether or not all of its output is public. This is somewhat similar to the observation that management overhead is a bad metric. Yes, some of your money goes into something that doesn't immediately and directly translate to benefit. All else equal, you want that not to happen. But all else is not equal, and can never be equal.

[1] This is completely tangential, but I think we need more public discussion on how we decide whether making something public is beneficial vs. detrimental.

[-]habryka6y*120

(Crossposted from the EA Forum)

Some minor clarifications for the Long Term Future Fund section (Larks did reach out to us and ask for feedback, though it looks like not all of our corrections made it into the final version). [Correction: We hadn't actually sent the relevant corrections back yet, so this is not Larks's fault. Sorry for that!]

Notably this means the funds only paid out $150,000 to CFAR (10%), as the balance was made up by a private donor after CEA did not approve the second grant.

The thing as written is correct, but I want to clarify that CEA did not reject the grant, but was still in the process of deciding whether to approve the grant, when a private donor stepped in. I expect that CEA would have eventually approved this grant, though there is definitely still some uncertainty in that.

I was not impressed that one grant that saw harsh and accurate criticism on the forum after the first round was re-submitted for the second round. Ex post this didn’t matter as CEA rejected it on substantive grounds the second time

This is correct as written, but I think I want to clarify what is meant by resubmission here:

We’ve had a few grants that ran into logistical difficulties (like there not being a clear way to make this grant compatible with CEA’s charitable objectives), and in cases like this we’ve worked with the potential grantee to resolve those issues. I think that support we provide should be independent from the evaluations of grants that we make, and I don’t think we should reject grants because of logistical issues like that, if they are easily fixable.

The Lauren Lee grant ran into some issues in this space that took a while to resolve, so CEA ended up only properly evaluating the grant in the round following the one in which we recommended it, and then subsequently rejected it. The “resubmission” in that sense shouldn’t be seen as an additional strong endorsement of the grant, but is just a thing that could happen to any grant and I think doesn’t say much about how good we thought the grant was, after we had made the decision to recommend the grant.

Geography chart

We were about to send Larks an updated version of the geography data before this post went up. Here is a graph with my best guesses (this includes all recommendations we made, even for grants that didn't end up going through):

And here is one excluding the three grants that ended up being covered by private donors:

[Edit Note: I briefly had a version of the comment up that showed the geographic distribution by count instead of by grant amount. This is now fixed.]

[-]Liam Donovan6y20

I was somewhat confused by the discussion of LTFF grants being rejected by CEA; is there a public writeup of which grants were rejected?

[-]habryka6y110

I don't think there is a public writeup. So here is a quick summary:

Only one grant was ever rejected by CEA, which was the grant to Lauren Lee (after a round of logistical obstacles to the grant) with the reasoning of "it is outside of the historical scope of the fund"
The grant to give out copies of HPMOR to IMO winners was urgent and CEA couldn't complete the due-diligence process before the money was required, so a private donor ended up filling it before it was made
The $150k recommendation to CFAR was being reviewed by the CEA board, who had some concerns about the grant, but a private donor ended up filling the request before the board reached consensus on a decision

There are also some other cases of grants that we recommended, but that never even made it into our writeups because the relevant grantees decided they didn't actually want to pursue the project they specified in their applications.

[-]Charlie Steiner6yΩ490

An excellent exterior scoop.

If I had to point out one more research avenue from the past year that I find interesting, it would be the application of the predictive processing model of cognition to AI safety. One post from Jan Kulveit (FHI), one post from G Gordon Worley (PAISRI, which appears to be a one-man organization at the moment).

I'm also only like 85% sure that I'm not among those referred to as "just learn human values with an RNN." So on that 15% chance, I would like to stress that although it's definitely something I'm thinking about, I'm just trying to nail down the details so that it's specific enough to poke holes in. Honest!

[-]John_Maxwell6yΩ240

There's also this post and others by the same author.

[-]CarlShulman6yΩ360

MIRI researchers contributed to the following research led by other organisations

MacAskill & Demski's A Critique of Functional Decision Theory

This seems like a pretty weird description of Demski replying to MacAskill's draft.

[-]Ben Pace6yΩ120

I also thought so. I wondered whether Larks meant that MacAskill incorporated Demski's comments-on-a-draft into the post.

[-]John_Maxwell6yΩ150

just use ML to learn ethics

Can you explain more about why you think that these papers are low quality? Is it just a matter of lack of originality? Personally, I think this is a perspective that can be steelmanned pretty effectively, and gets unfairly disregarded because it's too simple or something like that. I think it's worth engaging with this perspective in depth because (a) I think it's pretty likely there's a solution to friendliness there and (b) even if there isn't, a very clear explanation of why (which anticipates as many counterarguments as possible) could be a very useful recruiting tool.

[-]apc6y10

Great post.

Re Regulatory Markets for AI Safety: I’d be interested in hearing more about why you think that they might not be useful in an AGI scenario, because “the goals and ex post measurement for private regulators are likely to become outdated and irrelevant”.

Why do you think the goals and ex post measurement are likely to become irrelevant? Furthermore, isn’t this also an argument against any kind of regulatory action on AI? Because with status-quo regulation, the goals and ex post measurement of regulatory outcomes are of course also done by the government agency. (though I’m not sure that I correctly understand what you mean by “ex-post measurement for the private regulators”)
