Safety-First Agents/Architectures Are a Promising Path to Safe AGI
Summary

Language model agents (LMAs) like AutoGPT have promising safety characteristics compared to traditional conceptions of AGI. Because they are composed of LLMs that reason in natural language, they plan, think, and act in highly transparent and correctable ways, although not maximally so, and it is unclear whether their safety will increase or decrease in the future. Regardless of where commercial trends take us, it is possible to develop safer versions of LMAs, as well as other "cognitive architectures" that are not dependent on LLMs. Notable areas of potential safety work include effectively separating and governing how agency, cognition, and thinking arise in cognitive architectures. If needed, safety-first cognitive architectures (SCAs) can match or exceed the performance of less safe systems, and they are compatible with many ways AGI may develop. This makes SCAs a promising path towards influencing and ensuring safe AGI development in everything from very-short-timeline scenarios (e.g. LMAs are the first AGIs) to long-timeline scenarios (e.g. future AI models are incorporated into, or built explicitly for, an existing SCA).

Although the SCA field has begun emerging over the past year, awareness seems low and the field seems underdeveloped. I wrote this article to make more people aware of what's happening with SCAs, to document my thinking on the SCA landscape and promising areas of work, and to advocate for more people, funding, and research going towards SCAs.

Background

Language model agents (LMAs), systems in which large language models (LLMs) prompt themselves in loops to think, plan, and act, have exploded in popularity since the release of AutoGPT at the end of March 2023. Even before AutoGPT, the related field that I call "safety-first cognitive architectures" (SCAs) had begun to emerge in the AI safety community. Most notably, in 2022, Eric Drexler formulated arguments for the safety of such systems and developed a high-level design for an SCA called the open agency model. Shortly thereafter
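To make the "prompting themselves in loops" pattern concrete, here is a minimal sketch of an LMA-style loop. The call_llm and run_tool functions are hypothetical placeholders rather than any particular framework's API; the point is that the agent's goals, thoughts, and actions accumulate in a plain-text transcript that a human can inspect and interrupt.

```python
# Minimal sketch of the "LLM prompting itself in a loop" pattern behind LMAs.
# call_llm and run_tool are hypothetical stand-ins, not a real framework's API.

def call_llm(prompt: str) -> str:
    """Stand-in for a call to a large language model."""
    raise NotImplementedError

def run_tool(action: str) -> str:
    """Stand-in for executing a tool/command the model requested."""
    raise NotImplementedError

def agent_loop(goal: str, max_steps: int = 10) -> list[str]:
    transcript = [f"GOAL: {goal}"]
    for _ in range(max_steps):
        # The model sees the goal plus everything it has thought and done so far.
        thought = call_llm("\n".join(transcript) + "\nWhat should you do next?")
        transcript.append(f"THOUGHT: {thought}")  # plans are recorded in plain text
        if thought.strip().startswith("DONE"):
            break
        observation = run_tool(thought)  # act in the world, observe the result
        transcript.append(f"OBSERVATION: {observation}")
    return transcript  # the full chain of reasoning and actions is inspectable
```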
Unfortunately I see this question didn’t get much engagement when it was originally posted, but I’m going to put a vote in for highly federated systems along the axes of agency, cognitive processes, and thinking, especially those that maximize transparency and determinism. I think that LM agents are just a first step into this area of safety. I write more about this here: https://www.lesswrong.com/posts/caeXurgTwKDpSG4Nh/safety-first-agents-architectures-are-a-promising-path-to
For specific proposals I’d recommend Drexler’s work on federating agency https://www.lesswrong.com/posts/5hApNw5f7uG8RXxGS/the-open-agency-model and federating cognitive processes (memory) https://www.lesswrong.com/posts/FKE6cAzQxEK4QH9fC/qnr-prospects-are-important-for-ai-alignment-research
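As a rough illustration of what federation along these axes could look like, the sketch below separates cognition (a planner that can only propose), oversight (a reviewer that checks proposals against explicit rules), and agency (an executor that can only carry out approved plans) behind a deterministic, fully logged controller. The component names and the rule-based approval step are illustrative assumptions of mine, not Drexler's open agency model or any published SCA design.

```python
# Illustrative sketch of federating an agent's functions behind a deterministic
# controller. Component names and the approval step are illustrative assumptions,
# not Drexler's open agency model or any published SCA design.

from dataclasses import dataclass, field

@dataclass
class Proposal:
    plan: str
    rationale: str

class Planner:
    """Cognition: produces candidate plans but cannot act on the world."""
    def propose(self, goal: str, memory: list[str]) -> Proposal:
        raise NotImplementedError  # e.g. backed by an LLM

class Reviewer:
    """Oversight: checks proposals against explicit, human-readable rules."""
    def __init__(self, banned_terms: tuple[str, ...] = ("delete", "exfiltrate")):
        self.banned_terms = banned_terms

    def approve(self, proposal: Proposal) -> bool:
        return not any(term in proposal.plan.lower() for term in self.banned_terms)

class Executor:
    """Agency: the only component allowed to take actions, and only approved ones."""
    def execute(self, plan: str) -> str:
        raise NotImplementedError

@dataclass
class Controller:
    """Deterministic glue: every proposal, review, and action is logged."""
    planner: Planner
    reviewer: Reviewer
    executor: Executor
    log: list[str] = field(default_factory=list)

    def step(self, goal: str) -> None:
        proposal = self.planner.propose(goal, self.log)
        self.log.append(f"PROPOSED: {proposal.plan} ({proposal.rationale})")
        if self.reviewer.approve(proposal):
            result = self.executor.execute(proposal.plan)
            self.log.append(f"EXECUTED: {result}")
        else:
            self.log.append("REJECTED: proposal failed review")
```

Because the controller is ordinary deterministic code, every proposal, review decision, and action is recorded and auditable, which is the transparency and correctability that federation of this kind is meant to buy.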