Edit: In case it's not obvious, I have done limited research on AI alignment organizations and the goal of my post is to ask questions from the point of view of someone who wants to contribute and is unsure how. Read down to the comments for some great info on the topic.

I was introduced to the topic of AI alignment when I joined this very forum in 2014. Two years and one "Superintelligence" later, I decided that I should donate some money to the effort. I knew about MIRI, and I looked forward to reading some research comparing their work to the other organizations working in this space. The only problem is... there really aren't any.

MIRI recently announced a new research agenda focused on "agent foundations". Yet even the Open Philanthropy Project, made up of people who at least share MIRI's broad worldview, can't decide whether that research direction is promising or useless. The Berkeley Center for Human-Compatible AI doesn't seem to have a specific research agenda beyond Stuart Russell. The AI100 Center at Stanford is just kicking off. That's it.

I think that there are two problems here:


  1. There's no way to tell which current organization is going to make the most progress towards solving AI alignment.
  2. These organizations are likely to be very similar to each other, not least because they practically share a zipcode. I don't think that MIRI and the academic centers will do the exact same research, but in the huge space of potential approaches to AI alignment they will likely end up pretty close together. Where's the group of evo-psych savvy philosophers who don't know anything about computer science but are working to spell out an approximation of universal human moral intuitions?
It seems like there's a meta-question that needs to be addressed, even before any work is actually done on AI alignment itself:


How to evaluate progress in AI alignment?

Any answer to that question, even if not perfectly comprehensive or objective, will enable two things. First of all, it will allow us to direct money (and the best people) to the existing organizations where they'll make the most progress.

More importantly, it will enable us to open up the problem of AI alignment to the world and crowdsource it. 

For example, the XPrize Foundation is a remarkable organization that creates competitions around achieving goals beneficial to humanity, from lunar rovers to ecological monitoring. The prizes have two huge benefits over direct investment in solving an issue:


  1. They usually attract a lot more effort than what the prize money itself would pay for. Competitors often spend in aggregate 2-10 times the prize amount in their efforts to win the competition.
  2. The XPrizes attract a wide variety of creative entrants from around the world, because they only describe what needs to be done, not how.
So, why isn't there an XPrize for AI safety? You need very clear guidelines to create an honest competition, like "build the cheapest spaceship that can take 3 people to 100km and be reused within 2 weeks". It doesn't seem like we're close to being able to formulate anything similar for AI alignment. It also seems that if anyone will have good ideas on the subject, it will be the people on this forum. So, what do y'all think?

Can we come up with creative ways to objectively measure some aspect of progress on AI safety, enough to set up a competition around it?



New Comment
8 comments, sorted by Click to highlight new comments since: Today at 1:04 PM

I came to a similar conclusion a while ago: it is hard to make progress in a complex technical field when progress itself is unmeasurable or worse ill-defined.

Part of the problem may be cultural: most working in the AI safety field have math or philosophy backgrounds. Progress in math and philosophy is intrinsically hard to measure objectively; success is mostly about having great breakthrough proofs/ideas/papers that are widely read and well regarded by peers. If your main objective is to convince the world, then this academic system works fine - ex: Bostrom. If your main objective is to actually build something, a different approach is perhaps warranted.

The engineering oriented branches of Academia (and I include comp sci in this) have a very different reward structure. You can publish to gain social status just as in math/philosophy, but if your idea also has commercial potential there is the powerful additional motivator of huge financial rewards. So naturally there is far more human intellectual capital going into comp sci than math, more into deep learning than AI safety.

In a sane world we'd realize that AI safety is a public good of immense value that probably requires large-scale coordination to steer the tech-economy towards solving. The X-prize approach essentially is to decompose a big long term goal into subgoals which are then contracted to the private sector.

The high level abstract goal for the Ansari XPrize was "to usher in a new era of private space travel". The specific derived prize subgoal was then "to build a reliable, reusable, privately financed, manned spaceship capable of carrying three people to 100 kilometers above the Earth's surface twice within two weeks".

AI safety is a huge bundle of ideas, but perhaps the essence could be distilled down to: "create powerful AI which continues to do good even after it can take over the world."

For the Ansari XPrize, the longer term goal of "space travel" led to the more tractable short term goal of "100 kilometers above the Earth's surface twice within two weeks". Likewise, we can replace "the world" in the AI safety example:

AI Safety "XPrize": create AI which can take over a sufficiently complex video game world but still tends to continue to do good according to a panel of human judges.

To be useful, the video game world should be complex in the right ways: it needs to have rich physics that agents can learn to control, it needs to permit/encourage competitive and cooperative strategic complexity similar to that in the real world, etc. So more complex than pac-man, but simpler than the Matrix. Something in the vein of a minecraft mod might have the right properties - but there are probably even more suitable open-world MMO games.

The other constraint on such a test is we want the AI to be superhuman in the video game world, but not our world (yet). Clearly this is possible - ala AlphaGo. But naturally the more complex the video game world is in the direction of our world, both the harder the goal becomes and the more dangerous.

Note also that the AI should not know that it is being tested; it shall not know it inhabits a simulation. This isn't likely to be any sort of problem for the AI we can actually build and test in the near future, but it becomes an interesting issue later on.

DeepMind is now focusing on Starcraft, OpenAI has universe, so we already on a related path. Competent AI for open-ended 3D worlds with complex physics - like minecraft - is still not quite here, but is probably realizable in just a few years.


MIRI recently announced a new research agenda focused on "agent foundations". Yet even the Open Philanthropy Project, made up of people who at least share MIRI's broad worldview, can't decide whether that research direction is promising or useless. The Berkeley Center for Human-Compatible AI doesn't seem to have a specific research agenda beyond Stuart Russell. The AI100 Center at Stanford is just kicking off. That's it.

There's also:

  • MIRI's Alignment for Advanced Machine Learning Systems agenda
  • The Concrete Problems agenda by Amodei, Olah and others
  • Russell's Research Priorities doc written with Dewey and Tegmark, covers probably more than his CHCAI Centre
  • Owain Evans, Stuart Armstrong and Eric Drexler at FHI
  • Paul Christiano's thinking on AI Control
  • OpenAI's safety team in formation
  • DeepMind's safety team
  • (?) Wei Dai's thinking on metaphilosophy and AI... He occasionally comments e.g. on AgentFoundations
  • Other Machine Learning researchers, e.g. safe exploration in Deep RL, transparency.

Looking at what they've produced to date, I don't really expect MIRI and CHCAI to produce that similar of work. I expect Russell's group to be more focused on value learning an corrigibility vs. reliable agent designs (MIRI).

The Berkeley Center for Human-Compatible AI doesn't seem to have a specific research agenda beyond Stuart Russell.

Stuart Russell was the primary author of the FLI research priorities document, so I'd expect CHCAI's work to focus in on some of the problems sketched there. Based on CHCAI's publication page, their focus areas will probably include value learning, human-robot cooperation, and theories of bounded rationality. Right now, Russell's group is spending a lot of time on cooperative inverse reinforcement learning and corrigibility.

This slide from a recent talk by Critch seems roughly right to me: https://intelligence.org/wp-content/uploads/2017/01/hotspot-slide.png

So, why isn't there an XPrize for AI safety?

A prize fund is one of the main side-projects MIRI has talked about wanting to do for the last few years, if we could run a sufficiently large one -- more or less for the reasons you mention. Ideally the AI safety community would offer a diversity of prizes representing different views about what kinds of progress we'd be most excited by.

If funds for this materialize at some point, the main challenge will be that the most important conceptual breakthroughs right now involve going from mostly informal ideas to crude initial formalisms. This introduces some subjectivity in deciding whether the formalism really captures the key original idea, and also makes it harder for outside researchers to understand what kinds of work we're looking for. (MIRI's research team's focus is exactly on the parts of the problem that are hardest to design a prize for.) It's easier to come up with benchmarks in areas where there's already been a decent amount of technical progress, which would be quite valuable on its own, though it means potentially neglecting the most important things to work on.

There is in fact an X-Prize for AI, and it is general enough that it can apply to safety/alignment, though it's not purely aimed at this. http://ai.xprize.org

www.OpenEth.org is a 'let's build stuff and iterate whilst flailing around slightly as concepts develop' engineering-focussed project for Narrow AI / Smart Contract / Algorithm alignment and proofing, that is competing for this prize.

You point out a problem: There's no way to tell which organizations are making progress on AI alignment, and there is little diversity in current approaches. You turn this into the question: How do we create prizes that incentivize progress in AI alignment? You're missing a step or two here.

I'd say the logic goes the opposite direction: because there are no clear objectively measurable targets that will improve AI safety, prizes are probably a bad idea for increasing the diversity and effectiveness of AI safety research.

How about having a prize for coming up with the very clear guidelines for an AI safety XPrize?

There certainly should be more orgs with different approaches. But possibly, CHCAI plays a role as the representative of MIRI in the mainstream academic world, and so from the perspective of goals, it is OK that the two are quite close.