Notes on the importance and implementation of safety-first cognitive architectures for AI

Background

I've been working on a knowledge management/representation system for ~1.5 years with the initial goal of increasing personal and collective intelligence. I discovered that this work could be highly applicable to AI safety several weeks ago through Eric Drexler's work on QNRs, which pertains to knowledge representation, and then discovered Open Agencies and the broader work that has been done on cognitive architectures and how they can be made safer. I am excited about the potential for "safety-first cognitive architectures" to help society harness AI in a safer manner.

I figured I would spend a couple of hours documenting my thoughts to help people learn more about what cognitive architectures are, how they're relevant for AI safety, and how they might be designed in safer ways. It seems like this field is nascent and the resources aren't aggregated in one place, so this is my first attempt at doing so.

One-Line Summary of Safety-First Cognitive Architectures

Harness AI more safely by having intelligence emerge from separate, non-agentic systems that communicate with each other and operate in interpretable ways, rather than from a singular, agentic AI.

Implementing Safety-First Cognitive Architectures

Separate the components of cognition like planning, execution, and long-term memory. Have each component communicate with the others in a transparent, rate-controlled, and human-readable way (currently done with natural language, likely done with human-readable structured data in the future).
Ensure that each component can be run by some combination of non-agentic, transient, memory-constrained, and action-constrained AI models, deterministic automated systems, and/or humans.
Incorporate measures at every level of the system, from the system's goals to the outputs of AI models, to evaluate contributions and detect potentially harmful contributions.
Apply the latest alignment research to the AI models used in the architecture.
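
To make the points above a bit more concrete, here is a minimal sketch of separated components that talk only through a logged, rate-limited, human-readable channel with a screening check. Everything in it is a hypothetical illustration rather than an existing implementation: the names Message, Channel, keyword_screen, planner, executor, and run_once are assumptions, and the keyword check is only a placeholder for real evaluation measures.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class Message:
    """A human-readable message passed between components."""
    sender: str
    recipient: str
    text: str


@dataclass
class Channel:
    """Hypothetical channel: logs every message, caps volume, screens content."""
    max_messages: int
    is_harmful: Callable[[str], bool]
    log: List[Message] = field(default_factory=list)
    rejected: List[Message] = field(default_factory=list)

    def send(self, message: Message) -> bool:
        # Rate control: refuse once the message budget for this run is spent.
        if len(self.log) >= self.max_messages:
            self.rejected.append(message)
            return False
        # Evaluation: hold back anything the screening check flags.
        if self.is_harmful(message.text):
            self.rejected.append(message)
            return False
        self.log.append(message)
        return True


def keyword_screen(text: str) -> bool:
    """Placeholder check (assumption); real evaluation would be far stronger."""
    return "delete all backups" in text.lower()


def planner(goal: str) -> List[str]:
    """Stand-in for a transient, stateless planning model."""
    return [
        f"Step 1: gather information about '{goal}'",
        f"Step 2: draft a summary of '{goal}'",
    ]


def executor(step: str) -> str:
    """Stand-in for an action-constrained execution component."""
    return f"Done: {step}"


def run_once(goal: str, channel: Channel) -> None:
    """Route every planner and executor message through the shared channel."""
    for step in planner(goal):
        if channel.send(Message("planner", "executor", step)):
            channel.send(Message("executor", "memory", executor(step)))
```

Because every message lands in channel.log or channel.rejected as plain text, a human (or another automated check) can read the full exchange afterwards. The planner and executor here could each be swapped for an AI model, a deterministic system, or a person without changing the channel.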

Why Cognitive Architectures Can Be Safer Than Singular, Agentic AI

When the components of the architecture are put together, the architecture can act like an agent and behave intelligently, but the architecture itself functions more like an information storage and sharing system, or an "international agency" as described in work on Open Agencies, rather than an AI agent. It is essentially fully interpretable and corrigible by design, with goals and plans that are human-understandable and changeable at any time. The constraints on the underlying AI models reduce the risk of bad outcomes compared to employing AI models that are agentic, run perpetually, and have access to a comprehensive world model and limitless actions they can take.

Key People and Ideas

Eric Drexler, a researcher at FHI, developed the Open Agency Model, a simple framework for a safer cognitive architecture that primarily involves separating setting goals, generating plans, evaluating plans, implementing plans, and evaluating outcomes. This article describes how LLMs can be employed in an open agency. Drexler's 2019 work on Comprehensive AI Services (CAIS) is closely related.
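
As a rough illustration of that separation of roles, the sketch below wires the stages together as independent, swappable functions and records a human-readable trace of each run. It is my own toy reading of the idea, not Drexler's specification: the OpenAgency class, the Plan type, the min_score threshold, the toy_* stage functions, and treating the final stage as an evaluation of outcomes are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Plan:
    description: str
    score: float = 0.0


@dataclass
class OpenAgency:
    """Hypothetical orchestrator: each stage is a separate, replaceable callable."""
    generate_plans: Callable[[str], List[Plan]]
    evaluate_plan: Callable[[str, Plan], float]
    implement: Callable[[Plan], str]
    evaluate_outcome: Callable[[str, str], bool]
    min_score: float = 0.5  # assumed acceptance threshold

    def run(self, goal: str) -> dict:
        # The goal is set outside this object (for example, by a human).
        trace = {"goal": goal}
        plans = self.generate_plans(goal)
        for plan in plans:
            plan.score = self.evaluate_plan(goal, plan)
        trace["plans"] = [(p.description, p.score) for p in plans]
        acceptable = [p for p in plans if p.score >= self.min_score]
        if not acceptable:
            trace["status"] = "no acceptable plan"
            return trace
        chosen = max(acceptable, key=lambda p: p.score)
        trace["chosen"] = chosen.description
        outcome = self.implement(chosen)
        trace["outcome"] = outcome
        accepted = self.evaluate_outcome(goal, outcome)
        trace["status"] = "success" if accepted else "outcome rejected"
        return trace


# Toy stage implementations; in practice each could be a model, a tool, or a person.
def toy_generate(goal: str) -> List[Plan]:
    return [Plan(f"Email stakeholders about {goal}"), Plan(f"Do nothing about {goal}")]


def toy_evaluate_plan(goal: str, plan: Plan) -> float:
    return 0.1 if plan.description.startswith("Do nothing") else 0.9


def toy_implement(plan: Plan) -> str:
    return f"Completed: {plan.description}"


def toy_evaluate_outcome(goal: str, outcome: str) -> bool:
    return goal in outcome
```

Calling OpenAgency(toy_generate, toy_evaluate_plan, toy_implement, toy_evaluate_outcome).run("the launch") returns a dictionary that lists every candidate plan with its score, the chosen plan, the outcome, and a status. The plan generator never implements anything and the implementer never scores its own plan.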

David Dalrymple, another researcher at FHI, is working on a sophisticated, near-AGI implementation of an open agency, centered on robustly simulating the world and using that world model to accurately specify goals and assess the outcomes of plans.

Brendon Wong (myself) is working on creating an open agency based purely on components that can be built on existing technologies. It uses a simpler world model. I plan on iteratively adding more advanced features over time.

Seth Herd is an AI safety researcher at Astera who is researching cognitive architectures, including what the implications of current-day and near-future language model cognitive architectures (LMCAs) are, and how to make cognitive architectures safer.

David Shapiro was one of the early thought leaders for modern-day cognitive architectures, including authoring a prescient open-source book in 2021. His proposal contains safety features like using a knowledge store to learn human values (similar to Drexler's QNR proposal) and specifying human values in natural language with an approach similar to Anthropic's Constitutional AI (but using many examples to back each value, not just specifying the value itself in natural language).

Related Ideas

Eric Drexler's post "QNR prospects are important for AI alignment research" predicts that future AI systems may use external knowledge stores to learn about the world and human values and to conduct reasoning. Drexler states that these knowledge stores could support alignment because they will be at least partially human-understandable, and thus quite interpretable, as well as human-editable, and thus facilitate corrigibility. Cognitive architectures use knowledge stores as a key element to facilitate cognition, and so are quite interpretable and corrigible.
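
A very small sketch of what "human-understandable and human-editable" could look like in practice is shown below. It is not an implementation of QNRs; it is a plain-text stand-in that I am assuming for illustration, and the KnowledgeStore class, its methods, and the example keys are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class KnowledgeStore:
    """Hypothetical plain-text store that both models and humans can read and edit."""
    entries: Dict[str, str] = field(default_factory=dict)
    # Each history item is (editor, key, new_text), so changes stay auditable.
    history: List[Tuple[str, str, str]] = field(default_factory=list)

    def write(self, editor: str, key: str, text: str) -> None:
        self.entries[key] = text
        self.history.append((editor, key, text))

    def read(self, key: str) -> str:
        return self.entries.get(key, "")

    def dump(self) -> str:
        """Render the whole store as text a person can review."""
        return "\n".join(f"{k}: {v}" for k, v in sorted(self.entries.items()))


store = KnowledgeStore()
# A model records something it inferred.
store.write("model", "user_preference", "Prefers short summaries")
# A human inspects the store, disagrees, and corrects the entry.
store.write("human", "user_preference", "Prefers detailed summaries")
```

After the human edit, any component that reads "user_preference" gets the corrected text, and the history still shows what the model originally wrote. That is the sense in which a store like this supports both interpretability and corrigibility.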

Veedrac's post "Optimality is the tiger, and agents are its teeth" provides an interesting hypothetical example of how a superintelligent, non-agentic LLM could recommend that the user run code that enables the LLM to recursively call itself, thus creating an unsafe agent in the form of a cognitive architecture. This post highlights the risks of unsafe cognitive architectures and illustrates various safety aspects of LLMs: why they're different from agentic AI, but also why they are potentially prone to failure in more indirect ways that should be accounted for (see the failures of tool AI for more on this).

Tamera's post "Externalized reasoning oversight: a research direction for language model alignment" describes various methods that could better ensure that the reasoning that LLMs provide to support their responses is authentic. Cognitive architectures generally express all reasoning explicitly, and have checks in place to detect issues with model reasoning and plans, so this work seems related.

Other Potentially Related Ideas I May Summarize Later

The Translucent Thoughts Hypothesis
Natural Language Alignment
AI Oversight
Natural Abstraction Hypothesis and Alignment By Default

The related ideas are roughly ordered with the most relevant ideas at the top.
