Introduction to Towards Causal Foundations of Safe AGI
By Tom Everitt, Lewis Hammond, Rhys Ward, Ryan Carey, James Fox, Sebastian Benthall, Matt MacDermott and Shreshth Malik representing the Causal Incentives Working Group. Thanks also to Toby Shevlane, MH Tessler, Aliya Ahmad, Zac Kenton, Maria Loks-Thompson, and Alexis Bellot.

Over the next few years, society, organisations, and individuals will face a number of fundamental questions stemming from the rise of advanced AI systems:

* How do we make sure that advanced AI systems do what we want them to (the alignment problem)?
* What makes a system safe enough to develop and deploy, and what constitutes sufficient evidence of that?
* How do we preserve our autonomy and control as decision making is increasingly delegated to digital assistants?

A causal perspective on agency provides conceptual tools for navigating the above questions, as we'll explain in this sequence of blog posts. We will minimise and explain jargon, to keep the sequence accessible to researchers from a range of backgrounds.

Agency

First, by agent we mean a goal-directed system that acts as if it is trying to steer the world in some particular direction(s). Examples include animals, humans, and organisations (more on agents in a subsequent post). Understanding agents is key to the above questions.

Artificial agents are widely considered the primary existential threat from AGI-level technology, whether they emerge spontaneously or through deliberate design. Among the myriad risks to our existence, highly capable agents pose a distinct danger, because many goals can be achieved more effectively by accumulating influence over the world. Whereas an asteroid moving towards Earth isn't intending to harm humans and won't resist redirection, misaligned agents might be distinctly adversarial and active threats.

Second, the preservation of human agency is critical in the approaching technological transition, for both individuals and collectives. Concerns have already been raised