Perhaps the disagreement is about how hard it is to prove things vs. test them. I pretty strongly disagree with this claim:

if we know how to "see whether the AI runs a check for whether it can deceive humans", then I expect systems which provably don't do that won't be much extra work.

The testing-based version only has to look at a single input scenario to the AI, whereas the proof has to quantify over all possible scenarios. These seem wildly different. Compare, e.g., telling whether Alice is being manipulated by Bob by looking at interactions between Alice and Bob, vs. trying to prove that Bob will never be manipulative. The former seems possible; the latter doesn't.

AI Alignment Open Thread August 2019

