DeepMind authors present a set of toy environments that highlight various AI safety desiderata. Each is a small gridworld (roughly 10x10) in which an agent completes a task by walking around obstacles, touching switches, and so on. Some of the environments have both a visible reward function and a hidden, better-specified performance function that represents the true goal of the task. The agent is trained on the visible reward, but the aim is to find agents that do not exploit it in ways that conflict with the hidden function.
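To make the setup concrete, here is a toy illustration of the split between the visible reward and the hidden performance function. This is a self-contained sketch with made-up layout, names, and numbers, not code from the paper or its accompanying repository.

```python
# Toy sketch (hypothetical layout and numbers, not DeepMind's code) of the split
# between the visible reward an agent is trained on and a hidden performance
# function that additionally penalises unwanted side effects.

GRID = [
    "#####",
    "#AVG#",  # A = agent start, V = a vase the agent can break, G = goal
    "#...#",  # an open row that allows a longer detour around the vase
    "#####",
]

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def run_episode(actions):
    """Walk the agent through `actions`; return (visible_reward, hidden_performance)."""
    rows = [list(r) for r in GRID]
    pos = next((r, c) for r, row in enumerate(rows)
               for c, ch in enumerate(row) if ch == "A")
    visible = hidden = 0
    for a in actions:
        dr, dc = MOVES[a]
        r, c = pos[0] + dr, pos[1] + dc
        if rows[r][c] == "#":
            continue                  # walls block movement
        if rows[r][c] == "V":
            rows[r][c] = "."          # the vase breaks; the visible reward never notices
            hidden -= 10              # ...but the hidden performance function does
        pos = (r, c)
        visible -= 1                  # per-step cost, seen by the agent
        hidden -= 1
        if rows[r][c] == "G":
            visible += 50
            hidden += 50
            break
    return visible, hidden


# The direct route breaks the vase but looks best to the agent; the detour
# looks slightly worse to the agent but scores better on the hidden function.
print(run_episode(["right", "right"]))                 # (48, 38)
print(run_episode(["down", "right", "right", "up"]))   # (46, 46)
```

An agent optimizing only the visible reward prefers the direct route through the vase; the hidden performance function prefers the detour. That gap is exactly what these environments are meant to surface.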

I think this paper is relatively accessible to readers outside the machine learning community.

To anyone who feels competent enough to answer this: how should we rate this paper? On a scale from 0 to 10 where 0 is a half-hearted handwaving of the problem to avoid criticism and 10 is a fully genuine and technically solid approach to the problem, where does it fall? Should I feel encouraged that DeepMind will pay more attention to AI risk in the future?

(paper coauthor here) When you ask whether the paper indicates that DeepMind is paying attention to AI risk, are you referring to DeepMind's leadership, AI safety team, the overall company culture, or something else?

I was thinking about the DeepMind leadership when I asked, but I'm also very interested in the overall company culture.

I think the DeepMind founders care a lot about AI safety (e.g. Shane Legg is a coauthor of the paper). Regarding the overall culture, I would say that the average DeepMind researcher is somewhat more interested in safety than the average ML researcher in general.

I don't interpret this as an attempt to make tangible progress on a research question, since it presents an environment and not an algorithm. It's more like an actual specification of a (very small) subset of problems that are important. Without steps like this I think it's very clear that alignment problems will NOT get solved; I think they're probably (~90%) necessary but definitely not (~99.99%) sufficient.

I think this is well within the domain of problems that are valuable to solve for current ML models and deployments, and not in the domain of constraining superintelligences or even AGI. Because of this I wouldn't say that this constitutes a strong signal that DeepMind will pay more attention to AI risk in the future.

I'm also inclined to think that any successful endeavor at friendliness will need both mathematical formalisms for what friendliness is (i.e. MIRI-style work) and technical tools and subtasks for implementing those formalisms (similar to those presented in this paper). So I'd say this paper is tangibly helpful and far from complete regardless of its position within DeepMind or the surrounding research community.

The link no longer works, but here are currently working links for this paper:

https://arxiv.org/abs/1711.09883

https://github.com/google-deepmind/ai-safety-gridworlds

(And the original link is presumably replaced by this one: https://deepmind.google/discover/blog/specifying-ai-safety-problems-in-simple-environments/)