About ten years ago, after reading Bostrom, Yudkowsky, and others on AI risk and the control problem, I had a conspiratorial thought: if a superintelligence ever emerged, wouldn’t all this public discussion of control strategies just help the superintelligence resist them? At the time, the idea struck me as silly, so I set it aside.
A year ago, I revisited the thought and noticed that it hadn't been developed or discussed much in the literature. When I thought about it in more detail, it no longer seemed so silly. This led me to write a paper, recently published in Philosophical Studies, in which I argue that secrecy about how to control (and, indirectly, align) AGI follows from two independent arguments:

1. The argument from counter-productivity: unless kept secret, efforts to solve the control problem could be used by a misaligned AGI to counter those very efforts.
2. The argument from suspicion: open discussion of the control problem may make humanity appear threatening to an AGI, increasing the risk that the AGI perceives humanity as a threat.
Perhaps more disconcerting: in the course of the analysis, I found that the arguments also apply to alignment research, although not as directly.
These arguments are somewhat related to the idea of misalignment as a self-fulfilling prophecy (see here and here).
I'd be very interested in the LessWrong community’s take on these arguments. Here’s the abstract:
How do you control a superintelligent artificial being given the possibility that its goals or actions might conflict with human interests? Over the past few decades, this concern, known as the AGI control problem, has remained a central challenge for research in AI safety. This paper develops and defends two arguments that provide pro tanto support for the following policy for those who worry about the AGI control problem: don't talk about it. The first is the argument from counter-productivity, which states that unless kept secret, efforts to solve the control problem could be used by a misaligned AGI to counter those very efforts. The second is the argument from suspicion, which states that open discussion of the control problem may make humanity appear threatening to an AGI, increasing the risk that the AGI perceives humanity as a threat. I consider objections to the arguments and find them unsuccessful. Yet I also consider objections to the don't-talk policy itself and find it inconclusive whether the policy should be adopted. Additionally, the paper examines whether the arguments extend to other areas of AI safety research, such as AGI alignment, and argues that they likely do, albeit not necessarily as directly. I conclude by offering recommendations on what one can safely talk about, regardless of whether the don't-talk policy is ultimately adopted.