We (Zvi Mowshowitz, Vladimir Slepnev and Paul Christiano) are happy to announce that the AI Alignment Prize is a success. From November 3 to December 31 we received over 40 entries representing an incredible amount of work and insight. That's much more than we dared to hope for, in both quantity and quality.

In this post we name six winners who will receive $15,000 in total, an increase from the originally planned $5,000.

We're also kicking off the next round of the prize, which will run from today until March 31, under the same rules as before.

The winners

First prize of $5,000 goes to Scott Garrabrant (MIRI) for his post Goodhart Taxonomy, an excellent write-up detailing the possible failures that can arise when optimizing for a proxy instead of the actual goal. Goodhart’s Law is simple to understand, impossible to forget once learned, and applies equally to AI alignment and everyday life. While Goodhart’s Law is widely known, breaking it down in this new way seems very valuable.
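
As a toy illustration of what can go wrong when optimizing for a proxy instead of the actual goal, here is a minimal sketch. It is not taken from Scott's post; the candidates, their values, and the error terms are all made up for the example. The optimizer only sees the proxy score, so it picks the candidate whose measurement error happens to be largest.

```python
# Hypothetical candidates: (name, true_value, measurement_error).
# All numbers are invented for illustration.
candidates = [
    ("A", 7.0, -1.0),
    ("B", 6.0, 3.0),
    ("C", 5.0, 0.5),
    ("D", 4.0, 2.5),
]


def proxy(candidate):
    """Proxy score = true value plus error; this is all the optimizer sees."""
    _, value, error = candidate
    return value + error


def true_value(candidate):
    """The actual goal, which the optimizer cannot observe directly."""
    return candidate[1]


# Optimizing the proxy and optimizing the goal select different candidates.
best_by_proxy = max(candidates, key=proxy)
best_by_goal = max(candidates, key=true_value)

print("chosen by proxy:", best_by_proxy[0],
      "proxy =", proxy(best_by_proxy),
      "true =", true_value(best_by_proxy))
print("best by goal:   ", best_by_goal[0],
      "true =", true_value(best_by_goal))
```

In this made-up data, the candidate with the highest proxy score is not the one with the highest true value, and its proxy score overstates how good it is.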

Five more participants receive $2,000 each:

  • Tobias Baumann (FRI) for his post Using Surrogate Goals to Deflect Threats. Adding failsafes to the AI's utility function is a promising idea and we're happy to see more detailed treatments of it.
  • Vanessa Kosoy (MIRI) for her work on Delegative Reinforcement Learning (1, 2, 3). Proving performance bounds for agents that learn goals from each other is obviously important for AI alignment.
  • John Maxwell (unaffiliated) for his post Friendly AI through Ontology Autogeneration. We aren't fans of John's overall proposal, but the accompanying philosophical ideas are intriguing on their own.
  • Alex Mennen (unaffiliated) for his posts on legibility to other agents and learning goals of simple agents. The first is a neat way of thinking about some decision theory problems, and the second is a potentially good step for real world AI alignment.
  • Caspar Oesterheld (FRI) for his post and paper studying which decision theories would arise from environments like reinforcement learning or futarchy. Caspar's angle of attack is new and leads to interesting results.

We'll be contacting each winner by email to arrange transfer of money.

We would also like to thank everyone who participated. Even if you didn't get one of the prizes today, please don't let that discourage you!

The next round

We are now announcing the next round of the AI Alignment Prize.

As before, we're looking for technical, philosophical and strategic ideas for AI alignment, posted publicly between now and March 31, 2018. You can submit your entries in the comments here or by email to apply@ai-alignment.com. We may give feedback on early entries to allow improvement, though our ability to do this may become limited by the volume of entries.

The minimum prize pool this time will be $10,000, with a minimum first prize of $5,000. If the entries once again surpass our expectations, we will again increase that pool.

Thank you!

(Addendum: I've written a post summarizing the typical feedback we've sent to participants in the previous round.)



Wow, thanks a lot guys!

I'm probably not the only one who feels this way, so I'll just make a quick PSA: For me, at least, getting comments that engage with what I write and offer a different, interesting perspective can almost be more rewarding than money. So I definitely encourage people to leave comments on entries they read--both as a way to reinforce people for writing entries, and also for the obvious reason of making intellectual progress :)

I definitely wish I had commented more on them in general, and ran into a thing where a) the length and b) the seriousness of it made me feel like I had to dedicate a solid chunk of time to sit down and read, and then come up with commentary worth making (as opposed to just perusing it on my lunch break).

I'm not sure if there's a way around that (posting things in smaller chunks in venues where it's easy for people to comment might help, but my guess is that isn't the whole thing).

Just sent you some more feedback. Though you should also get comments from others, because I'm not the smartest person in the room by far :-)

This is freaking awesome - thank you so much for doing both this one and the new one.

Added: I think this is a really valuable contribution to the intellectual community - successfully incentivising research, and putting in the work on your end (assessing all the contributions and giving the money) to make sure solid ideas are rewarded - so I've curated this post.

Added2: And of course, congratulations to all the winners, I will try to read all of your submissions :-)

Thanks to you and Oliver for spreading the news about this!


This may sound very silly, but it had not occurred to me that blog posts might count as legitimate entries to this, and if I had realized that I might have tried to submit something. Writing this mostly in case it applies to others too.

It's sort of weird how "blogpost" and "paper" feel like such different categories, especially when AFAICT, papers tend to be, on average, less convenient and more poorly written blogposts.

The funny thing is that if you look at some old papers, they read a lot more like blog posts than modern papers. One of my favorite examples is the paper where Alan Turing introduced what's now known as the Turing test, and whose opening paragraph feels pretty playful:

I propose to consider the question, "Can machines think?" This should begin with definitions of the meaning of the terms "machine" and "think." The definitions might be framed so as to reflect so far as possible the normal use of the words, but this attitude is dangerous. If the meaning of the words "machine" and "think" are to be found by examining how they are commonly used it is difficult to escape the conclusion that the meaning and the answer to the question, "Can machines think?" is to be sought in a statistical survey such as a Gallup poll. But this is absurd.

Blog posts don't have the shining light of Ra around them, of course.

If your blog posts would benefit from being lit with a Ra-coloured tint, we'd be more than happy to build this feature for you.

<adds to list of LessWrong April Fools Day ideas>

The shining light of Ra may be doing useful work if the paper is peer-reviewed. Especially if it made it through the peer review process of a selective journal.

Fair, but a world where we can figure out how to bestow the shining light of Ra on selectively peer reviewed, clearly written blogposts seems even better. :P

The distinction between papers and blog posts is getting weaker these days - e.g. distill.pub is an ML blog with the shining light of Ra that's intended to be well-written and accessible.

Qiaochu, I'd love to see an entry from you in the current round.

Cool, this looks better than I'd been expecting. Thanks for doing this! Looking forward to next round.

Thank you Luke! I probably should've asked before, but if you have any ideas how to make this better organizationally, please let me know.

Datum: The existence of this prize has spurred me to put some actual effort into AI alignment, for reasons I don't fully understand--I'm confident it's not about the money, and even the offer of feedback isn't that strong an incentive, since I think anything worthwhile I posted on LW would get feedback anyway.

My guess is that it sends the message that the Serious Real Researchers actually want input from random amateur LW readers like me.

Also, the first announcement of the prize rules was in one ear and out the other for me. Reading this announcement of the winners is what made it click for me that this is something I should actually do. Possibly because I had previously argued on LW with one of the winners in a way that made my brain file them as my equal (admittedly, the topic of that was kinda bike-sheddy, but system 1 gonna system 1).

I'm curious if these papers / blogs would have been written at some point anyway, or if they happened because of the call to action? And to what extent was the prize money a motivator?

Not only would Goodhart Taxonomy probably not have been finished otherwise (it was 20 percent written in my drafts folder for months), but I think writing that jump-started me writing publicly and caused the other posts I've written since.

Are you entering this round BTW?


I was waiting until the last minute to see if I would have a clear winner on what to submit. Unfortunately, I do not, since there are four posts on the Pareto frontier of karma and how much I think they have an important insight. In decreasing order of karma and increasing order of my opinion:

Robustness to Scale

Sources of Intuitions and Data on AGI

Don't Condition on no Catastrophes

Knowledge is Freedom

Can I have you/other judges decide which post/subset of posts you think is best/want to put more signal towards, and consider that my entry?

Also, why is my opinion anti-correlated with Karma?

Maybe, it is a selection effect where I post stuff that is either good content or a good explanation.

Or maybe important insights have a larger inferential gap.

Or maybe I like new insights and the old insights are better because they survived across time, but they are old to me so I don't find them as exciting.

Or maybe it is noise.

I just noticed that the first two posts were curated, and the second two were not, so maybe the only anti-correlation is between me and the Sunshine Regiment, but IIRC, most of the karma was pre-curation, and I posted Robustness to Scale and No Catastrophes at about the same time and was surprised to see a gap in the karma. (I would have predicted the other direction.)

I posted Robustness to Scale and No Catastrophes at about the same time and was surprised to see a gap in the karma

FWIW, I was someone who upvoted Robustness to Scale (and Sources of Intuitions, and Knowledge is Freedom), but did not upvote No Catastrophes.

I think the main reason was that I was skeptical of the advice given in No Catastrophes. People often talk about timelines in vague ways, and I agree that it's often useful to get more specific. But I didn't feel compelled by the case made in No Catastrophes for its preferred version of the question. Neither that one should always substitute a more precise question for the original, nor that if one wants to ask a more precise question, then this is the question to ask.

(Admittedly I didn't think about it very long, and I wouldn't be too surprised if further reflection caused me to change my mind, but at the time I just didn't feel compelled to endorse with an upvote.)

Robustness (along with the other posts) does not give advice, but rather stakes out conceptual ground. That's easier to endorse.

We accept them all as your entry :-)

Awesome! I hadn't seen Caspar's idea, and I think it's a neat point on its own that could also lead in some new directions.

Edit: Also, I'm curious if I had any role in Alex's idea about learning the goals of a game-playing agent. I think I was talking about inferring the rules of checkers as a toy value-learning problem about a year and a half ago. It's just interesting to me to imagine what circuitous route the information could have taken, in the case that it's not independent invention.

I don't think that was where my idea came from. I remember thinking of it during AI Summer Fellows 2017, and fleshing it out a bit later. And IIRC, I thought about learning concepts that an agent has been trained to recognize before I thought of learning rules of a game an agent plays.

Charlie, it'd be great to see an entry from you in this round.

Thanks, that's very flattering! The thing I'm working on now (looking into prior work on reference, because it seems relevant to what Abram Demski calls model-utility learning) will probably qualify, so I will err on the side of rushing a little (prize working as intended).

Sometimes you just read a chunk of philosophical literature on reference and it's not useful to you even as a springboard. shrugs So I don't have an ideasworth of posts, and it'll be ready when it's ready.

Good and Safe use of AI Oracles: https://arxiv.org/abs/1711.05541

An Oracle is a design for potentially high power artificial intelligences (AIs), where the AI is made safe by restricting it to only answer questions. Unfortunately most designs cause the Oracle to be motivated to manipulate humans with the contents of their answers, and Oracles of potentially high intelligence might be very successful at this. Solving that problem, without compromising the accuracy of the answer, is tricky. This paper reduces the issue to a cryptographic-style problem of Alice ensuring that her Oracle answers her questions while not providing key information to an eavesdropping Eve. Two Oracle designs solve this problem, one counterfactual (the Oracle answers as if it expected its answer to never be read) and one on-policy, but limited by the quantity of information it can transmit.

Very interesting work!

You might like my new post that explains why I think only Oracles can resolve the problems of Causal Goodhart-like issues: https://www.lesserwrong.com/posts/iK2F9QDZvwWinsBYB/non-adversarial-goodhart-and-ai-risks

I'm unsure whether the problems addressed in your paper are sufficient for resolving the causal Goodhart concerns, since I need to think much more about the way the reward function is defined, but it seems they might not be. This question is really important for the follow-on work on adversarial Goodhart, and I'm still trying to figure out how to characterize the metrics / reward functions that are and are not susceptible to corruption in this way. Perhaps a cryptographic approach could solve parts of the problem.

I have a paper whose preprint was uploaded in December 2017 but which is expected to be officially published in the beginning of 2018. Is it possible to submit the text to this round of the competition? The text in question is:

"Military AI as a Convergent Goal of Self-Improving AI"

Alexey Turchin & David Denkenberger

In Artificial Intelligence Safety and Security.

Louisville: CRC Press (2018)


I worked with Scott to formalize some of his earlier blog post here: https://arxiv.org/abs/1803.04585 - and wrote a bit more about AI-specific concerns relating to the first three forms in this new LessWrong post: https://www.lesserwrong.com/posts/iK2F9QDZvwWinsBYB/non-adversarial-goodhart-and-ai-risks

The blog post discussion was not included in the paper both because agreement on these points proved difficult, and because I wanted the paper to be relevant more widely than only for AI risk. The paper was intended to expand thinking about Goodhart-like phenomena to address what I initially saw as a confusion about causal and adversarial Goodhart, and to allow a further paper on adversarial cases I've been contemplating for a couple years, and am actively working on again. I was hoping to get the second paper, on Adversarial Goodhart and sufficient metrics, done in time for the prize, but since I did not, I'll nominate the arxiv paper and the blog post, and I will try to get the sequel blog post and maybe even the paper done in time for round three, if there is one.

Accepted! Can you give your email address?

My username @ gmail

Congratulations to the winners! Everyone, winners or not, submitted some great works that inspire a lot of thought.

Would it be possible for all of us submitters to get feedback on our entries that did not win so that we can improve entries for the next round?

I'll mention that one of the best ways for people to learn what sorts of submissions can get accepted, is to read the winning submissions in detail :-)

Also excellent advice

Hi Berick! I didn't send you feedback because your entry arrived pretty close to the deadline. But since you ask, I just sent you an email now.

Thanks! I didn't expect feedback prior to the first round closing since, as you said, my submission was (scarily) close to the deadline.

How to resolve human values, completely and adequately: https://www.lesswrong.com/posts/Y2LhX3925RodndwpC/resolving-human-values-completely-and-adequately

(this connects with https://www.lesswrong.com/posts/kmLP3bTnBhc22DnqY/beyond-algorithmic-equivalence-self-modelling and https://www.lesswrong.com/posts/pQz97SLCRMwHs6BzF/using-lying-to-detect-human-values).

Impossibility of deducing preferences and rationality from human policy: https://arxiv.org/abs/1712.05812

Inverse reinforcement learning (IRL) attempts to infer human rewards or preferences from observed behavior. Since human planning systematically deviates from rationality, several approaches have been tried to account for specific human shortcomings. However, there has been little analysis of the general problem of inferring the reward of a human of unknown rationality. The observed behavior can, in principle, be decomposed into two components: a reward function and a planning algorithm, both of which have to be inferred from behavior. This paper presents a No Free Lunch theorem, showing that, without making `normative' assumptions beyond the data, nothing about the human reward function can be deduced from human behavior. Unlike most No Free Lunch theorems, this cannot be alleviated by regularising with simplicity assumptions. We show that the simplest hypotheses which explain the data are generally degenerate.
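
A toy sketch of the decomposition described above, using hypothetical actions and numbers that are not taken from the paper: behavior is modeled as a planner applied to a reward function. A rational planner with reward R and an anti-rational planner with reward -R produce the same observed choice, so the behavior alone cannot tell the two hypotheses apart.

```python
# Hypothetical one-step decision problem: the human picks a single action.
ACTIONS = ["tea", "coffee", "water"]

# Two candidate reward functions: one is the exact negation of the other.
reward = {"tea": 1.0, "coffee": 3.0, "water": 2.0}
neg_reward = {action: -value for action, value in reward.items()}


def rational_planner(r):
    """Picks the action with the highest reward under r."""
    return max(ACTIONS, key=lambda action: r[action])


def anti_rational_planner(r):
    """Picks the action with the lowest reward under r."""
    return min(ACTIONS, key=lambda action: r[action])


# The only data available: the action the human was seen to take.
observed_action = "coffee"

# Each hypothesis is a (planner, reward) pair; record the action it predicts.
hypotheses = {
    "rational planner, reward R": rational_planner(reward),
    "anti-rational planner, reward -R": anti_rational_planner(neg_reward),
}

# Both pairs explain the observation, despite opposite reward functions.
consistent = [name for name, action in hypotheses.items()
              if action == observed_action]
print(consistent)
```

Both pairs appear in the printed list, which is the kind of degeneracy the abstract refers to: picking between them requires an assumption that goes beyond the observed behavior.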

Accepted too

Awesome results!

I emailed my submission, but for the sake of redundancy, I'll submit it here too:

"The Regularizing-Reducing Model"


Probably too late, but I wanted to submit this:



I have significantly elaborated and extended my article on self-deception in the last couple of months (before that it was about two pages long).

"Self-deception: Fundamental limits to computation due to fundamental limits to attention-like processes"


I included some examples for the taxonomy, positioned this topic in relation to other similar topics, and compared the applicability of this article to the applicability of other known AI problems.

Additionally, I described or referenced a few ideas for potential partial solutions to the problem (some of the descriptions of solutions are new, some of them I have published before).

One of the motivations for the post is that when we are building an AI that is dangerous in a certain manner, we should at least realise that we are doing that.

I will probably continue updating the post. The history of the post and its state as of March 31 can be seen from the linked Google Doc’s history view (that link is at the top of the article).

When it comes to feedback to postings, I have noticed that people are more likely to get feedback when they ask for it.

I am always very interested in feedback, regardless of whether it is given to my past, current or future postings. So if possible, please send any feedback you have. It would be of great help!

I will post the same message to your e-mail too.

Thank you and regards:


I would like to submit the following entries:

A typology of Newcomblike problems (philosophy paper, co-authored with Caspar Oesterheld).

A wager against Solomonoff induction (blog post).

Three wagers for multiverse-wide superrationality (blog post).

UDT is “updateless” about its utility function (blog post). (I think this post is hard to understand. Nevertheless, if anyone finds it intelligible, I would be interested in their thoughts.)

For this round I submit the following entries on decision theory:

Robust Program Equilibrium (paper)

The law of effect, randomization and Newcomb’s problem (blog post) (I think James Bell's comment on this post makes an important point.)

A proof that every ex-ante-optimal policy is an EDT+SSA policy in memoryless POMDPs (IAFF comment) (though see my own comment to that comment for a caveat to that result)

My entry: https://www.lesserwrong.com/posts/DTv3jpro99KwdkHRE/ai-alignment-prize-super-boxing

Accepted, gave some feedback in the comments

Here is my submission, Anatomy of Prediction and Predictive AI

Hopefully I'm early enough this time to get some pre-deadline feedback :)

Will just email about this.

[This comment is no longer endorsed by its author]

Okay, this project overshot my expectations. Congratulations to the winners!

Can we get a list of links to all the submitted entries from the first round?

This is a bit contentious but I don't think we should be the kind of prize that promises to publish all entries. A large part of the prize's value comes from restricting our signal boost to only winners.

Why should one option exclude the other?

Having the blinders would not be so good either.

I propose that with proper labeling these options can both be implemented. So that people can themselves decide what to pay attention to and what to develop further.

You're right, keeping it the cream of the crop is a better idea.

I wouldn't mind feedback as well if possible, mainly because I only dabble in AGI theory and not AI. So I'm curious to see the difference in thoughts/opinions/fields, or however you wish to put it. Thanks in advance, and thanks to the contest hosts/judges. I learned a lot more about the (human) critique process than I knew before.

I just sent you feedback by email.

Thanks for taking the time. Appreciated.

Forgot to congratulate the winners... Congrats...
