OP (please comment there): https://medium.com/@lucarade/issues-with-iterated-distillation-and-amplification-5aa01ab37173
I will show that, even under the most favorable assumptions about the feasibility of IDA and the resolution of the open problems its implementation requires, IDA fails to produce an aligned agent in the sense of corrigibility.
Part 1: The assumptions
Class 1: There are no problems with the human overseer.
1.1: Human-generated vulnerabilities are completely eliminated through security amplification. (See this post for a lengthy overview and intuition, and this post for a formalization.) In short, security amplification converts the overseer in IDA from high-bandwidth (receiving the full input in one piece) to low-bandwidth (receiving inputs divided into small pieces), to make impossible an attack...
Only just learned of this, unfortunately.