Perhaps the disagreement is about how hard it is to prove things versus test them. I pretty strongly disagree with this claim:

if we know how to "see whether the AI runs a check for whether it can deceive humans", then I expect systems which provably don't do that won't be much extra work.

The version based on testing has to look at a single input scenario to the AI, whereas the proof has to quantify over all possible scenarios. These seem wildly different. Compare, for example, telling whether Alice is being manipulated by Bob by looking at interactions between Alice and Bob, versus trying to prove that Bob will never be manipulative. The former seems possible; the latter doesn't.

AI Alignment Open Thread August 2019

by habryka | 1 min read | 4th Aug 2019 | 96 comments


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is an experiment in having an Open Thread dedicated to AI Alignment discussion, hopefully enabling researchers and upcoming researchers to ask small questions they are confused about, share very early-stage ideas, and have lower-key discussions.