Can we start out with an unaligned superintelligent AGI, and end up with an aligned AGI-system? I argue that maybe we can, and discuss principles, techniques, and strategies that may enable us to do so.
One reason for exploring such strategies is contingency planning: what if we haven’t solved alignment by the time the first superintelligent AGI arrives? Another reason is that additional layers of security could be beneficial: even if we think we have solved alignment, are there ways to relatively quickly add further layers of alignment-assurance?
This is an ongoing series.