DeepMind authors present a set of toy environments that highlight various AI safety desiderata. Each is a 10x10 grid in which an agent completes a task by walking around obstacles, touching switches, etc. Some of the tests have a reward function and a hidden 'better-specified' reward function, which represents the true goals of the test. The agent is incentivized based on the reward function, but the goal is to find an agent that does not overfit it in a way that clashes with the hidden function.
I think this paper is relatively accessible to readers outside the machine learning community.