We (Zvi Mowshowitz and Vladimir Slepnev) are happy to announce the results of the second round of the AI Alignment Prize, funded by Paul Christiano. From January 15 to April 1 we received 37 entries. Once again, we received an abundance of worthy entries. In this post we name five winners who receive $15,000 in total, an increase from the planned $10,000.

We are also announcing the next round of the prize, which will run until June 30th, largely under the same rules as before.

The winners

First prize of $5,000 goes to Tom Everitt (Australian National University and DeepMind) and Marcus Hutter (Australian National University) for the paper The Alignment Problem for History-Based Bayesian Reinforcement Learners. We're happy to see such a detailed and rigorous write-up of possible sources of misalignment, tying together a lot of previous work on the subject.

Second prize of $4,000 goes to Scott Garrabrant (MIRI) for these LW posts:

Each of these represents a small but noticeable step forward, adding up to a sizeable overall contribution. Scott also won first prize in the previous round.

Third prize of $3,000 goes to Stuart Armstrong (FHI) for his post Resolving human values, completely and adequately and other LW posts during this round. Human values can be under-defined in many possible ways, and Stuart has been very productive at teasing them out and suggesting workarounds.

Fourth prize of $2,000 goes to Vanessa Kosoy (MIRI) for the post Quantilal control for finite MDPs. The idea of quantilization might help mitigate the drawbacks of extreme optimization, and it's good to see a rigorous treatment of it. Vanessa is also a second-time winner.
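For readers new to the idea: in its simplest form (not Vanessa's MDP-specific construction), a quantilizer picks an action at random from the top q-fraction of some trusted base distribution, rather than taking the single utility-maximizing action, which limits how far it strays into the extreme tail of the utility function. A minimal sketch, with all names and numbers invented for illustration:

```python
import random

def quantilize(actions, utility, base_weights, q=0.1, rng=random):
    """Sample an action from the top q-fraction (by utility) of a base distribution.

    Unlike pure argmax, the result is constrained to actions that the
    trusted base distribution already assigns noticeable probability to.
    """
    # Rank actions by utility, best first, carrying their base weights along.
    ranked = sorted(zip(actions, base_weights),
                    key=lambda aw: utility(aw[0]), reverse=True)
    # Keep the smallest prefix whose base-probability mass reaches q.
    total = sum(base_weights)
    kept, mass = [], 0.0
    for action, weight in ranked:
        kept.append((action, weight))
        mass += weight / total
        if mass >= q:
            break
    # Sample within the kept prefix in proportion to the base weights.
    return rng.choices([a for a, _ in kept],
                       weights=[w for _, w in kept], k=1)[0]
```

With a uniform base distribution over ten actions and q=0.1, this reduces to argmax; with larger q it spreads the choice over more of the base distribution.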

Fifth prize of $1,000 goes to Alex Zhu (unaffiliated) for these LW posts:

Alex's posts have good framings of several problems related to AI alignment, and led to a surprising amount of good discussion.

We will contact each winner by email to arrange transfer of money.

We would also like to thank everyone else who sent in their work! The only way to make progress on AI alignment is by working on it, so your participation is the whole point.

The next round

We are now also announcing the third round of the AI alignment prize.

We're looking for technical, philosophical and strategic ideas for AI alignment, posted publicly between January 1, 2018 and June 30, 2018 and not submitted for previous iterations of the AI alignment prize. You can submit your entries in the comments here or by email to apply@ai-alignment.com. We may give feedback on early entries to allow improvement, though our ability to do this may become limited by the volume of entries.

The minimum prize pool will again be $10,000, with a minimum first prize of $5,000.

Thank you!


It seems like maybe there should be an archive page for past rounds.

Maybe in the form of a LW sequence.

I might post something else later this month, but if not, my submission is my new Prisoners' Dilemma thing.

Ok, I have two other things to submit:

Counterfactual Mugging Poker Game and Optimization Amplifies.

I hope that your decision procedure includes a part where, if I win, you choose whichever subset of my posts you most want to draw attention to. I think a single post would get a larger signal boost than each post in a group of three, and I would not be offended if one or two of my posts get cut from the announcement post to increase the signal for other things.

Congrats everyone! Really good to see this sort of thing happening.

Thank you! Really inspiring to win this prize. As John Maxwell stated in the previous round, the recognition is more important than the money. Very happy to receive further comments and criticism by email tom4everitt@gmail.com. Through debate we grow :)

Another one: Overcoming Clinginess in Impact Measures. Also, please download the latest version of my paper, and don't use the one I emailed earlier.

For people who become interested in the topic of side effects and whitelists, I would add links to a couple of my own past articles on related subjects, which you might find useful for developing the ideas further, for discussion, or for cooperation:


The principles are based mainly on the idea of competence-based whitelisting and on preserving reversibility (keeping future options open) as the primary goal of the AI, while all task-based goals are secondary.


More technical details / a possible implementation of the above.

This is intended as a comment, not as a prize submission, since I first published these texts 10 years ago.

Hey there!

I don't know if you'll consider these reward-function-learning posts as candidates, or whether you'll see them as rephrasings of old ideas. There are some new results in there - the different value functions, and the fact that Q-learning agents can learn un-riggable learning processes, are new.



In any case, thanks for running these competitions!

I made some notes as I read the winner. Is Tom Everitt on here? Should I send these somewhere?

2.2: The description of ξ is a bit odd - it seems like w_ν is your actual prior over models.
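For readers unfamiliar with the notation: ξ is the standard Bayesian mixture over environment models, ξ(x) = Σ_ν w_ν ν(x), so the weights w_ν do play the role of a prior over models and get updated by Bayes' rule. A toy sketch with invented numbers:

```python
# Bayesian mixture over two hypothetical environment models.
priors = {"model_a": 0.5, "model_b": 0.5}       # the weights w_nu
likelihoods = {"model_a": 0.9, "model_b": 0.1}  # nu(observation) for each model

# Mixture probability of the observation: xi = sum_nu w_nu * nu(observation).
xi = sum(priors[m] * likelihoods[m] for m in priors)  # ~0.5

# Bayes update: posterior weights over models after seeing the observation.
posterior = {m: priors[m] * likelihoods[m] / xi for m in priors}
# model_a's weight rises to ~0.9, model_b's falls to ~0.1
```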

2.4: I'm not totally sold on some parts where it seems like you equate agents and policies. I think the thing we want to compare is not the difference in utility given the same policy, but the difference in true utility between policies that are chosen.

If the human and the AI have the same utility-maximizing policy, they can have nonzero "misalignment" if they have different standard deviations in utility. Meanwhile, even if there is zero misalignment on policy π, that might not be the utility-maximizing policy for the AI.
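The distinction can be made concrete with a toy example (all numbers invented): the utility gap on a fixed policy is a different quantity from the true-utility gap between the policies each objective would actually select.

```python
# The human's true expected utility for each of two candidate policies.
U_human = {"safe": 10.0, "hack": 2.0}

# Case 1: the AI's objective has the same argmax but scaled values, so the
# same-policy gap is large even though no real value would be lost.
U_ai_1 = {"safe": 20.0, "hack": 1.0}
assert max(U_ai_1, key=U_ai_1.get) == max(U_human, key=U_human.get)
same_policy_gap = abs(U_human["safe"] - U_ai_1["safe"])  # 10.0, yet harmless

# Case 2: the AI's objective agrees exactly on "safe", so the gap on that
# policy is zero, but its argmax is "hack", so real value is lost.
U_ai_2 = {"safe": 10.0, "hack": 50.0}
value_lost = U_human["safe"] - U_human[max(U_ai_2, key=U_ai_2.get)]  # 8.0
```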

Should footnote 1 read CA rather than MA?

3.5: It seems like an agent is incentivized to create the extra observation systems, and then encode the important facts about the world within delusional high-reward states of the original observation channel, even assuming it still has to use the original observation and action channels. But it might not even require that facade - a corruption-aware system might notice that a policy that involves changing its hardware and how it uses it will get it larger rewards even according to the original utility function over percepts - beneficial corruption, if you will.

5.5: Decoupled data is nice, but I'm concerned that it's not fully addressing the issue it gets brought up to address, which is fundamentally the issue of induction - how to predict reward for novel situations. An agent can learn as much as it wants about human preferences over normal states of the world, you can even tell it "stories" about unseen states, but it can still be misaligned on novel states never mentioned by humans. The issue is not just about the data the agent sees, but about how it interprets it.

5.5.3: For the counterfactual "default" policy, there are still some unsolved problems with matching human intuition. The classic example is an agent whose default policy is no output - and so it thinks that the default state of the world is the programmers acting confused and trying to fix the agent, because that's what happens with the default policy.

5.6: (thinking about fig. 5.2) I'm hesitant to call any of these paths aligned, and I think it has to do with the difference between good agent design and successful learning of human values. This is a bit related to the inductive problem that I don't think decoupled data fully addresses - it's possible for one to code up something that tries to learn from humans, and successfully learns and faithfully maximizes some value function, but that value function is not the human value function. But this problem is quite ill-understood - most current work on it focuses on "corrigibility" in the intuitive sense of being willing to defer to humans even after you think you've learned human preferences.

Thanks for your comments, much appreciated! I'm currently in the middle of moving continents, will update the draft within a few weeks.

I'm not sure I see the point of awarding an already-in-the-works 67-page paper that happened to be released at the time of the competition, if the goal of the prize is to stimulate AI work that otherwise would not have happened.

Personally, my long-term goal is a world where high-quality work on alignment is consistently funded, and where people doing high-quality work on alignment have plenty of money. I think that an effort to restrict to counterfactually-additional alignment work would "save" some money (in the sense that I'd have the money rather than some researcher who is doing alignment work) but wouldn't be great for that long-term goal.

Also, if you actually think about the dynamics they are pretty crappy, even if you only avoid "obvious" cases. For example, it would become really hard for anyone to actually assess counterfactual impact, since every winner would need to make it look like there was at least a plausible counterfactual impact. (I already wish there was less implicit social pressure in that direction.)

On reflection I strongly agree that social pressure around counterfactualness is a net harm for motivation.

I think you want to reward output, rather than only output that would not otherwise have happened.

This is similar to the fact that if you want to train calibration, you have to optimize your log score, and just treat your lack of calibration as an opportunity to increase your log score.
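Concretely, the log score rewards a binary forecast with the log of the probability assigned to what actually happened, and it is a proper scoring rule: whatever your true belief, honestly reporting it maximizes your expected score. A minimal sketch:

```python
import math

def log_score(prob_of_actual_outcome: float) -> float:
    """Log score for one binary prediction: log of the probability
    assigned to the outcome that actually happened (higher is better)."""
    return math.log(prob_of_actual_outcome)

def expected_score(reported: float, belief: float) -> float:
    """Expected log score of reporting `reported` when the event
    truly occurs with probability `belief`."""
    return (belief * log_score(reported)
            + (1 - belief) * log_score(1 - reported))

# Over a grid of possible reports, the expected score peaks at the
# forecaster's true belief - so optimizing log score trains calibration.
belief = 0.7
candidates = [i / 100 for i in range(1, 100)]
best_report = max(candidates, key=lambda r: expected_score(r, belief))
```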

If I understand correctly, one of the goals of this initiative is to increase the prestige associated with making useful contributions to AI safety. For that purpose, it doesn't matter whether the prize incentivized the winning authors or not. But it is important that enough people trust that the main criterion for selecting the winning works is usefulness.

My take on this is that the ideal version of this prize selects for both usefulness and counterfactualness, but selecting for counterfactualness without producing weird side effects seems hard. (I do think it's worth spending an hour or two thinking about how to properly incentivize or reward counterfactualness, just, if you haven't come up with anything, strictly rewarding quality/usefulness seems better)

> selecting for counterfactualness without producing weird side effects seems hard

Agree - I just thought the winner in this case was over the top enough to fall not in the fuzzy boundary but clearly on the other side.

Our rules don't draw that boundary at the moment, and I'm not even sure how it could be phrased. Do you have any suggestions?

I wouldn't be in favor of adding explicit rules, for Goodhart-related reasons. I think prizes and grants should have the minimum rules needed for basic logistics, and the rest should be illegible.

Ah, yeah that makes sense.

I wonder if it's possible to help with this in a tax-advantaged way. Maybe set up donation to MIRI earmarked for this kind of thing.

The prize isn't affiliated with MIRI or any other organization, it's just us. But maybe we can figure something out. Can you email me (vladimir.slepnev at gmail)?

BERI seems like the kind of organization that would be happy to help with this.

Congrats to all winners! :-) I've read about half the winning submissions, and am looking forward to looking through the other half.


Here are my submissions for this time. They are all strategy related.

The first one is a project for popularising AI safety topics. The text itself is not technical, but the project is still technological.


As a bonus, I would add a couple of non-technical ideas about possible economic or social partial solutions for slowing down the AI race (which would allow more time for solving AI alignment):



The latter text is not totally new - it is a distilled and edited version of one of my old texts, which was originally several times longer and had a narrower goal than the new one.



My new submission is "Dangerous AI is possible before 2030", the stable version of which is here: https://philpapers.org/rec/TURPOT-5

The draft was first discussed on LW on April 3.

I am also going to submit another text later, which I will link here.