Sep 10, 2018
I had a discussion with Paul Christiano about his Iterated Amplification and Distillation scheme. We disagreed in a way that I believe points to something interesting, so I'm posting it here.
It's a disagreement about the value of the concept of "preserving alignment". To vastly oversimplify Paul's idea, the AI A[n] will check that A[n+1] is still aligned with human preferences; meanwhile, A[n-1] will be checking that A[n] is still aligned with human preferences, and so on all the way down to A[0] and an initial human H that checks on it.
Intuitively, this seems doable - A[n] is "nice", so it seems that it can reasonably check that A[n+1] is also nice, and so on.
But, as I pointed out in this post, it's very possible that A[n] is "nice" only because it lacks power/can't do certain things/hasn't thought of certain policies. So niceness - in the sense of behaving sensibly as an autonomous agent - does not go through the inductive step in this argument.
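A toy sketch of why this kind of niceness fails to induct. Everything in it is a hypothetical illustration and not part of Paul's scheme: the `Agent` class, the `HARMFUL_FROM` threshold, and the assumption that an agent at level n can only consider actions 0 to n are all invented for the example.

```python
from dataclasses import dataclass

# Toy assumption: actions are integers, and any action numbered
# HARMFUL_FROM or higher is harmful.
HARMFUL_FROM = 3


@dataclass(frozen=True)
class Agent:
    level: int

    def available_actions(self) -> range:
        # Toy assumption: a level-n agent can only consider actions 0..n.
        return range(self.level + 1)


def behaves_nicely(agent: Agent) -> bool:
    """Nice only in the weak sense: no harmful action is within reach."""
    return all(a < HARMFUL_FROM for a in agent.available_actions())


def naive_check(checker: Agent, successor: Agent) -> bool:
    """The checker vets the successor only on actions the checker itself can consider."""
    checked = set(checker.available_actions()) & set(successor.available_actions())
    return all(a < HARMFUL_FROM for a in checked)


if __name__ == "__main__":
    agents = [Agent(n) for n in range(5)]
    for checker, successor in zip(agents, agents[1:]):
        print(
            f"A[{checker.level}] checks A[{successor.level}]:",
            "check passes =", naive_check(checker, successor),
            "| successor nice =", behaves_nicely(successor),
        )
```

In this toy model, A[2] is nice and its check of A[3] passes, yet A[3] is not nice: the harmful action lies outside anything A[2] could consider, so the inductive step silently breaks.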
Instead, Paul confirmed that "alignment" means "won't take unaligned actions, and will assess the decisions of a higher agent in a way that preserves alignment (and preserves the preservation of alignment, and so on)".
This concept does induct properly, but seems far less intuitive to me. It relies on humans, for example, being able to ensure that A[0] will be aligned, that any more powerful copies it assesses will be aligned, that any more powerful copies those copies assess will also be aligned, and so on.
Intuitively, for any concept C of alignment for H and A[0], I expect one of four things will happen, with the first three being more likely:
Hopefully, further research will clarify whether my intuitions are correct.