Safety-First Agents/Architectures Are a Promising Path to Safe AGI
Summary

Language model agents (LMAs) like AutoGPT have promising safety characteristics compared to traditional conceptions of AGI. Because they are composed of LLMs, these agents plan, think, and act in highly transparent and correctable ways, although not maximally so, and it is unclear whether their safety will increase or decrease in the future. Regardless of where commercial trends take us, it is possible to develop safer versions of LMAs, as well as other "cognitive architectures" that do not depend on LLMs. Notable areas of potential safety work include effectively separating and governing how agency, cognition, and thinking arise in cognitive architectures.

If needed, safety-first cognitive architectures (SCAs) can match or exceed the performance of less safe systems, and they can be compatible with many ways AGI may develop. This makes SCAs a promising path towards influencing and ensuring safe AGI development in everything from very-short-timeline scenarios (e.g. LMAs are the first AGIs) to long-timeline scenarios (e.g. future AI models are incorporated into, or built explicitly for, an existing SCA).

Although the SCA field has begun emerging over the past year, awareness seems low and the field seems underdeveloped. I wrote this article to make more people aware of what's happening with SCAs, to document my thinking on the SCA landscape and promising areas of work, and to advocate for more people, funding, and research going towards SCAs.

Background

Language model agents (LMAs), systems in which large language models (LLMs) prompt themselves in loops to think and act, have exploded in popularity since the release of AutoGPT at the end of March 2023. Even before AutoGPT, the related field that I call "safety-first cognitive architectures" (SCAs) had emerged in the AI safety community. Most notably, in 2022, Eric Drexler formulated arguments for the safety of such systems and developed a high-level design for an SCA called the open agency model. Shortly thereafter