if we have some intuitive argument that our wish-contract is safe and a few human-designed tests, do we really expect it to have no loopholes exploitable by the genie/demon?

I wasn't imagining just input-output tests in laboratory conditions, which I agree are insufficient. I was thinking of studying counterfactuals, e.g. what the optimization target would suggest doing under the hypothetical scenario where it has lots of power. Alternatively, you could imagine tests of the form "pose this scenario, and see how the AI thinks about it", e.g. t...

AI Alignment Open Thread August 2019

August 2019


