if we have some intuitive argument that our wish-contract is safe and a few human-designed tests, do we really expect it to have no loopholes exploitable by the genie/demon?

I wasn't imagining just input-output tests in laboratory conditions, which I agree are insufficient. I was thinking of studying counterfactuals, e.g. what the optimization target would suggest doing under the hypothetical scenario where it has lots of power. Alternatively, you could imagine tests of the form "pose this scenario, and see how the AI thinks about it", e.g. t...

AI Alignment Open Thread August 2019

by habryka · 1 min read · 4th Aug 2019 · 96 comments



Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is an experiment in having an Open Thread dedicated to AI Alignment discussion, hopefully enabling researchers and upcoming researchers to ask small questions they are confused about, share very early-stage ideas, and have lower-key discussions.