I mentioned that I expect proof-level guarantees will be easy once the conceptual problems are worked out. Strong interpretability is part of that: if we know how to "see whether the AI runs a check for whether it can deceive humans", then I expect systems which provably don't do that won't be much extra work. So we might disagree less on that front than it first seemed.

The question of whether to model the AI as an open-ended optimizer is one I figured would come up. I don't think we need to think of it as truly open-ended in or...

AI Alignment Open Thread August 2019

by habryka · 1 min read · 4th Aug 2019 · 96 comments


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is an experiment in having an Open Thread dedicated to AI Alignment discussion, hopefully enabling researchers and upcoming researchers to ask small questions they are confused about, share very early-stage ideas, and have lower-key discussions.