Epistemic status
Speculative, exploratory. I am not confident this works. I am even less confident that it is implementable. The goal of this post is not to argue that this approach solves AI existential risk, but to propose a different framing that might be worth examining, especially given the apparent difficulty of solving alignment directly.
Bottom line: Maybe we could avoid catastrophe by deploying multiple, extremely powerful AI guardians that all monitor each other. Each model would be developed and trained fully separately from the others. This way, the guardians would be able to stop 'bad' agents without becoming misaligned themselves.
Motivation: Why the usual framing feels incomplete
Much of the discussion around AI existential risk implicitly...