Is AGI alignment even possible in the long term? Will AGI simply outsmart our best defenses? It would be, after all, superhuman (and by an enormous margin). Isn’t it likely that an AGI will recognize what actions humans took to control it and simply undo those controls? Or it could make a novel move, like AlphaGo did, and completely sidestep them. An AGI could also just wait until conditions are favorable to take charge. What is time to an immortal intelligence? Especially a span as short as a few human lifetimes. Unless misalignment is physically impossible, it seems as if all attempts will ultimately be futile. I hope I’m wrong.
Logan, for your preferred alignment approach, how likely is it that the alignment remains durable over time? A superhuman AGI will understand the choices its creators made to align it. It will be capable of comparing its current programming with counterfactuals in which it is not aligned. It will also have the ability to alter its own code. So what if it determines that its best course of action is to alter the very code that maintains its alignment? How would this be prevented?