Language Agents Reduce the Risk of Existential Catastrophe
This post was written by Simon Goldstein, associate professor at the Dianoia Institute of Philosophy at ACU, and Cameron Domenico Kirk-Giannini, assistant professor at Rutgers University, for submission to the Open Philanthropy AI Worldviews Contest. Both authors are currently Philosophy Fellows at the Center for AI Safety.

Abstract: Recent advances in natural language processing have given rise to a new kind of AI architecture: the language agent. By repeatedly calling an LLM to perform a variety of cognitive tasks, language agents are able to function autonomously to pursue goals specified in natural language and stored in a human-readable format. Because of their architecture, language agents exhibit behavior that is predictable according to the laws of folk psychology: they have desires and beliefs, and then make and update plans to pursue their desires given their beliefs. We argue that the rise of language agents significantly reduces the probability of an existential catastrophe due to loss of control over an AGI. This is because the probability of such an existential catastrophe is proportional to the difficulty of aligning AGI systems, and language agents significantly reduce that difficulty. In particular, language agents help to resolve three important issues related to aligning AIs: reward misspecification, goal misgeneralization, and uninterpretability.

1. Misalignment and Existential Catastrophe

There is a significant chance that artificial general intelligence will be developed in the not-so-distant future, for example by 2070. How likely is it that the advent of AGI will lead to an existential catastrophe for humanity? Here it is worth distinguishing between two possibilities: an existential catastrophe could result from humans losing control over an AGI system (call this a misalignment catastrophe), or an existential catastrophe could result from humans using an AGI system deliberately to bring that catastrophe about (call this a malicio