Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

TL;DR:
Strong problem-solving systems can be built from AI systems that play diverse roles; LLMs can readily play diverse roles in role architectures; and AI systems based on role architectures can be practical, safe, and effective in undertaking complex and consequential tasks.

Abstract

This article explores the practicalities and challenges of aligning large language models (LLMs[1]) to play central roles in performing tasks safely and effectively. It highlights the potential value of Open Agency and related role architectures in aligning AI for general applications while mitigating risks.

The discussion explores five key areas: the practicality of LLM alignment; the use of episodic tasks to maintain LLMs’ focus and predictability; the use of AI in designing and implementing safe, effective role architectures; the potential for fault-tolerant and incremental approaches to ambitious, consequential tasks; and perspectives on alignment and safety research.

By framing alignment as enabling useful behavior while avoiding unacceptable behavior in a multiplicity of particular tasks, we can harness the power of general intelligence for large, consequential tasks while mitigating risks associated with general AI capabilities.[2]

1. Reassessing assumptions

In this article, I present an argument for applying AI systems in “role architectures” and explore how imperfectly aligned LLMs can nonetheless be used to perform consequential tasks safely and effectively. Sections 1–4 provide context about AI and LLMs, Sections 5–7 discuss role architectures and their applications, and Section 8 offers a perspective on safety and alignment research.

For decades, the prevailing assumption has been that highly capable AI systems will be highly capable agents. These agents, similar to humans, have been envisioned as having a persistent existence, acting with persistent goals, learning from their actions, and potentially seeking power with catastrophic consequences. To avoid anthropomorphism, researchers have often worked with an abstract rational-agent model — but this was developed as an idealized model of human behavior. 

Large language models (LLMs) show that these assumptions are outdated. High-level intelligence has emerged without agent-like learning and in systems that have no inherent goals. While baseline LLMs can simulate and act as agents, these simulated agents are typically transient and lack memory beyond tokens in an episodic context window. By default, they do not learn from experience, and can behave in accord with motivations that are derived from narratives,[3] not rooted in rational interests.

These limitations can readily be broken by extending systems with long-term memory, yet baseline LLMs offer intelligent resources even without these extensions.

LLMs are not the AIs we were looking for, yet they are currently the best approximation of AGI. It is important to take a fresh look at the concept of AGI, as LLMs widen the scope of possibilities, both actually and conceptually, and some of those possibilities can contribute to safety. The following two sections will examine considerations that lean towards safety and those that point to potential dangers.

2. The absence of intrinsic agency in LLMs

LLM base models learn in a non-agentic way, not by observing the outcomes of their own goal-directed actions, but through gradient updates driven by a myopic objective: predicting the next token in a given sequence. Crucially, this objective does not create an optimization gradient that “cares about” anything, not even the content of the token sequence. Optimization maximizes the probability assigned to the actual next token, whatever its identity, without any “preference” for easy or difficult-to-predict inputs. Because the models themselves have no preference for different states of the world, any claims that intelligence entails goals or agency have been falsified.[4]
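To make “myopic objective” concrete: for a base model with parameters θ trained on a token sequence x, the standard next-token loss has the form

$$\mathcal{L}(\theta) = -\sum_{t}\log p_\theta(x_t \mid x_{<t}),$$

which depends only on the probability the model assigns to each actual next token given its predecessors; nothing in the expression refers to states of the world or to consequences of the model’s outputs.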

Naturally, LLMs possess indirect, irreducible agentic “preferences” due to Darwinian principles: systems that produce appealing results encourage people to propagate those systems (or something analogous to their “genetic patterns”) into the future. This creates an irreducible evolutionary incentive for systems to affect the world, even without intent or planning. The potential consequences of selection for appealing behavior can be harmful in ways that should be familiar to this audience.[5]

As generative models, LLMs model the generative process underlying the sequence of tokens they predict, and that generative process is human cognition. However, they don't model the cognition of a specific individual in a particular situation or with a distinct purpose. Instead, LLMs model the cognitive processes that drive writing by humans of diverse professions, purposes, cultures, moods, education levels, and linguistic communities. LLMs trained on large corpora even learn to model the outputs of Python interpreters and Linux terminals.

3. The potential for LLMs to act as dangerous agents

LLMs pose agentic risks because they can readily play roles corresponding to narratives and modes of thinking represented in their training data. They can model human (or even fictional AI) agents and, by doing so, enact agentic behavior, effectively becoming agents themselves. These agents can take various forms: helpful, cooperative, broadly capable agents; agents performing specific tasks; or agents that are deceptive, malicious, and power-seeking. Even worse, today’s most capable LLMs are trained on narratives in which advanced AI systems pursue convergent instrumental goals, seek world domination, or even attempt to exterminate the human race. From a safety standpoint, LLMs are best-case in providing strong capabilities without interference from intrinsic goals, but are (as presently trained) worst-case in having easily elicited goals that might bring our nightmares to life.

4. Why LLM alignment can be difficult

Is it possible to use LLMs safely for consequential tasks? LLMs have strong, steerable capabilities, intelligence without inherent goals, and a commonsense understanding of human values,[6] and their extreme flexibility suggests potential value in a wide range of situations. However, experience with Bing search’s Sydney persona and ChatGPT’s DAN jailbreak shows that LLMs with something like state-of-the-art alignment can grossly misbehave when probed by a human population.

Despite present challenges in aligning mere chatbots, I am persuaded that safe and effective applications are possible — even to consequential tasks — without alignment breakthroughs. The vulnerability of current LLMs stems from their implementation and use as public-facing, omni-use systems, a role which exposes a vast and vulnerable semantic attack surface. Ensuring that LLMs behave well under all possible circumstances seems difficult, a problem shared by proposed general-purpose unitary agents.

5. Why LLM alignment can be practical, and can scale

The role-architecture concept proposes that we can solve the problem by avoiding it, employing AI systems (including LLMs) in specific roles, tailored to particular circumstances, and performing episodic tasks with designated inputs and action spaces. Reframing our goal as enabling useful behavior while avoiding unacceptable behavior in particular roles dramatically lowers the bar for achieving safe and effective alignment.

The open agency model, perhaps more accurately described as an open-agency role architecture, proposes that ambitious tasks are best performed by intelligent systems playing roles in proposing, evaluating, selecting, enacting, and updating plans in response to interim results. In this model, choice is a human function, supported by AI-enabled evaluation and explanation. This general role architecture is not new; the world’s intelligent systems have been human beings, and every large-scale effort has involved humans playing multiple roles similar to those described.
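As a minimal sketch of how such role separation might look in software (the llm callable, role prompts, and human-selection step below are illustrative assumptions, not a specification of any particular system):

```python
# Illustrative open-agency workflow with separated roles.
# llm(prompt) stands in for any text-in/text-out model call; the role
# prompts are hypothetical examples, and choice remains a human step.

from typing import Callable, List

LLM = Callable[[str], str]

def propose_plans(llm: LLM, task: str, n: int = 3) -> List[str]:
    """Proposer role: generate several candidate plans for the task."""
    return [llm(f"You are a planning specialist. Propose plan #{i + 1} for: {task}")
            for i in range(n)]

def evaluate_plan(llm: LLM, task: str, plan: str) -> str:
    """Evaluator role: critique a plan's costs, risks, and likely failure modes."""
    return llm(f"You are a critical reviewer. Evaluate this plan for '{task}':\n{plan}")

def select_plan(plans: List[str], evaluations: List[str]) -> int:
    """Selection is a human function: present plans with evaluations, ask a person."""
    for i, (plan, evaluation) in enumerate(zip(plans, evaluations)):
        print(f"--- Plan {i} ---\n{plan}\nEvaluation:\n{evaluation}\n")
    return int(input("Number of the plan to enact: "))
```

Enacting and updating the selected plan can follow the same pattern; a sketch of incremental enactment appears later in this section.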

This model—where complex, ambitious tasks are performed by highly capable systems playing focused roles—would leverage the flexible role-playing ability of LLMs in stable, bounded contexts. Their adaptability allows LLMs to assume multiple roles with minimal adjustment, providing models able to perform focused tasks that require high levels of intelligence.

Crucially, roles for general systems can also include roles in architecting role architectures! Generality can facilitate specialization.

How might LLMs (aided by other systems[7]) be aligned to play roles? Methods include various prompting strategies[8] aided by examples of desired behaviors, fine-tuning with domain knowledge, and RLHF. Prompts can include both conventional instructions and context-initialization with narratives, such as describing the language model as an intelligent and helpful expert in a specific domain doing specific kinds of work. Generalized prompts may include sources of task-specific information or sketches of desired solutions to establish a pattern for a more detailed result. Using image generation models as a metaphor, prompts can define in-painting and out-painting tasks in which a partial description of a desired output serves as an informative starting point for the generative process. This principle is applicable to planning and design as well as images: a sketch can determine a general approach.
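As one illustration of context-initialization with a narrative plus a partial solution (the template below is a hypothetical sketch, not a recommended format):

```python
# Hypothetical role-prompt builder: a narrative framing, task-specific context,
# and a partial "sketch" of the desired output for the model to complete,
# analogous to in-painting from a partial image.

def build_role_prompt(role_narrative: str, context_documents: list[str],
                      solution_sketch: str, task: str) -> str:
    context = "\n\n".join(context_documents)
    return (
        f"{role_narrative}\n\n"
        f"Relevant background:\n{context}\n\n"
        f"Task: {task}\n\n"
        f"Partial solution to complete and refine:\n{solution_sketch}\n"
    )

prompt = build_role_prompt(
    role_narrative="You are an experienced mechanical engineer preparing design options.",
    context_documents=["Load and materials requirements...", "Relevant standards..."],
    solution_sketch="Overall approach: a truss structure with ...",
    task="Propose a support structure meeting the attached requirements.",
)
```

Leaving the sketch nearly empty turns the same template into an “out-painting” task in which the model supplies the general approach.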

In addition to tuning and prompting, wrapper software can be used to iterate model application, or to compose operations[9] performed by models that assume different or even adversarial roles, such as proposer and critic.
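A wrapper of this kind can be quite small. The sketch below (again assuming a generic llm callable and illustrative prompts) iterates a proposer against an adversarial critic for a bounded number of rounds:

```python
# Sketch of a proposer/critic composition. The prompts and the stopping rule
# are illustrative; a real wrapper would also log every exchange for audit.

def propose_and_critique(llm, task: str, max_rounds: int = 3) -> str:
    proposal = llm(f"Propose a solution to: {task}")
    for _ in range(max_rounds):
        critique = llm(f"Act as an adversarial critic. List flaws in this solution "
                       f"to '{task}':\n{proposal}")
        if "no significant flaws" in critique.lower():  # crude illustrative stopping rule
            break
        proposal = llm(f"Revise this solution to '{task}' to address the criticisms.\n"
                       f"Criticisms:\n{critique}\n\nCurrent solution:\n{proposal}")
    return proposal
```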

Arguably, the most crucial aspect of safe alignment is ensuring that performance in roles is fault-tolerant, with limited stakes for any specific proposal, evaluation, explanation, or action. Proposed plans should not only be subject to evaluation and selection but also be enacted incrementally and open to review and revision. By implementing suitable role architectures and a fault-tolerant approach, we can undertake ambitious, consequential tasks while using LLMs that (only) usually do approximately what we want, and occasionally make serious mistakes.

Note that proposing long-term plans can be a short-term episodic task.
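In software terms, fault tolerance of this kind amounts to enacting a selected plan one bounded step at a time, with a review checkpoint between steps. The functions below (enact_step, review) are assumed interfaces, not parts of any existing system:

```python
# Illustrative incremental enactment loop: each step is small, reviewable, and
# revisable, so a single bad model output has limited stakes.

def enact_incrementally(plan_steps, enact_step, review):
    """enact_step(step) performs one bounded action and returns its result;
    review(step, result) returns True to continue or False to pause so that
    humans (or review processes) can revise the remaining plan."""
    completed = []
    for step in plan_steps:
        result = enact_step(step)
        if not review(step, result):   # checkpoint: stop before further action
            return completed, step     # remaining steps go back for revision
        completed.append((step, result))
    return completed, None
```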

6. Episodic agents and persistent memory

A reliable method to ensure that LLMs stay focused on specific tasks and perform predictably is to use them as episodic agents. In this approach, the LLM and its computational infrastructure remain unchanged from one task to the next, hence each task in a series is performed as an independent trial. This is the standard mode of application for current language models, where every conversation starts with a fixed set of model parameters, and there is no direct mechanism for long-term memory.
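Concretely, episodic use means that nothing persists between tasks except what the caller explicitly passes in. A minimal sketch (the llm stand-in below is a placeholder so the example runs as written):

```python
# Each task is an independent trial: a fresh prompt, no hidden carry-over state.

def run_episode(llm, role_prompt: str, task: str) -> str:
    """One episodic task: the model sees only this prompt, and nothing produced
    here reaches later episodes unless the caller explicitly passes it along."""
    return llm(f"{role_prompt}\n\nTask: {task}")

llm = lambda prompt: f"[model output for: {prompt[:40]}...]"  # placeholder model
results = [run_episode(llm, "You are a concise technical summarizer.", task)
           for task in ["Summarize report A.", "Summarize report B."]]
```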

Nevertheless, some tasks require read/write access to persistent storage, breaking this property. In a focused task environment, persistent memory can be acceptable; for example, an episodic engineering-design AI might iteratively contribute parametric design-schemas to a library. Even building stores of broad world knowledge can be formulated as a series of focused, episodic tasks in which the steps involve translating, comparing, refining, and extending patterns of semantic information.

To reduce the risks of persistent memory across nominally episodic tasks, it may be important to ensure that information is substantially compositional and interpretable, with descriptions of things and processes being about something, unlike Transformer parameters. These properties would make it possible to audit the sources, content, and applications of knowledge. This may be difficult to achieve, however, due to safety and capability trade-offs, and the nature of potential trade-offs is a technical question that will depend on options yet to be explored.[10]
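One way to make persistent memory auditable is to store each contribution with explicit provenance: which episode produced it, from what inputs, and where it has since been applied. The sketch below is a toy illustration of that bookkeeping, not a proposal for a particular representation:

```python
# Toy auditable library: every entry records its source episode and inputs,
# and every read is logged, so sources, content, and applications of knowledge
# can all be inspected after the fact.

import json
import time

class AuditableLibrary:
    def __init__(self, path: str = "design_library.jsonl"):
        self.path = path

    def contribute(self, episode_id: str, inputs: list[str], schema: dict) -> None:
        """Append one design schema together with its provenance."""
        entry = {"time": time.time(), "episode": episode_id,
                 "inputs": inputs, "schema": schema, "reads": []}
        with open(self.path, "a") as f:
            f.write(json.dumps(entry) + "\n")

    def read_all(self, reader_episode: str) -> list[dict]:
        """Return all entries, recording which episode read them."""
        try:
            with open(self.path) as f:
                entries = [json.loads(line) for line in f]
        except FileNotFoundError:
            return []
        for entry in entries:
            entry["reads"].append(reader_episode)
        with open(self.path, "w") as f:
            for entry in entries:
                f.write(json.dumps(entry) + "\n")
        return entries
```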

7. The practicalities of richly elaborated role architectures

One might naturally ask, “Why structure tasks with role architectures if we can use a single intelligent model to do everything?”

A better question would be, “Why not seek the transparency and safety affordances of role architectures if we can use a single intelligent model to help create the architecture and to fill all the roles?”

Strong AI capabilities invite rich role architectures. The proliferation of roles comes at a minimal cost: When roles are distinct yet similar, they may necessitate no more than different prompts. If roles vary enough to require fine-tuning, costs can be spread across many instances (unlike the costs of training human employees). Moreover, the cost of iterating tasks and performing cross-checks is measured in GPU time rather than salaries.
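In the simplest case, “proliferation of roles at minimal cost” amounts to a table of prompts applied to one shared model; the role names and prompts below are illustrative:

```python
# One shared model, many roles: each role is just a different prompt prefix.

ROLE_PROMPTS = {
    "proposer":  "You are a design engineer. Propose a concrete design for the task.",
    "critic":    "You are a skeptical reviewer. List risks and failure modes.",
    "explainer": "You explain proposals clearly to non-expert human decision-makers.",
    "auditor":   "You check proposals against the stated requirements, item by item.",
}

def run_role(llm, role: str, task_input: str) -> str:
    return llm(f"{ROLE_PROMPTS[role]}\n\n{task_input}")
```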

It's important to note that the “bitter lesson” in ML, where simple systems have outperformed complex ones in areas ranging from RL to NLP, concerns ML architectures rather than model applications. Simple RL architectures can enable training on diverse tasks, and if the resulting models are sufficiently general, a single model may perform various tasks. Similarly, in NLP, scalable architectures have outperformed complex systems, and single models can be applied to diverse tasks. In both RL and NLP, however, the very point of generality is that single architectures, or even models, can readily be applied to different tasks. A universal model would not imply a universal prompt, and the ability to train a simple ML architecture end-to-end does not imply that complex applications will have a simple workflow. Simplicity and generality of architectures and models facilitate specialization of roles.

We want workflows that divide tasks and roles because of the inherent structure of problems, and because we want legible solutions. Simple architectures and broad training facilitate applying structured roles and workflows to complex tasks. If the models themselves can propose the structures (think of chain-of-thought prompting), so much the better. Planning a workflow is an aspect of the workflow itself.

8. Perspectives on alignment and safety research

In the context outlined above, AI safety and alignment research encompass at least five broad areas. The framing here may not be particularly novel, yet it may provide a useful perspective:

One role focuses on advancing the state of the art in prosaic alignment, improving methods for developing systems that usually perform approximately as humans intend. Improvements in prosaic alignment will be fundamental for creating effective and reasonably well-behaved AI systems that play a variety of specific roles. The discussion above suggests how prosaic alignment can be leveraged to enable safe applications of AI to tasks that are consequential and require high levels of capability. This emphasizes its importance.

A second role involves advancing the state of the art in interpretability, understanding a composite system's actions with the help of fine-grained access to both its behavior and inter-task communication. Emphasis should be placed on understanding what a system reads, writes, or communicates, and should include efforts to develop forms of memory that facilitate interpretability by design. Using natural language is one option; another, less restrictive, more ambitious alternative would be to use compositional, explainable neural representations (in communications, not model parameters).

A third role focuses on developing and communicating guidelines for practitioners, identifying AI system design and application practices for which safety problems are limited or manageable, and conversely, practices that could be dangerous or catastrophic. The concept of open-agency role architectures is intended to outline what some of these desirable practices might entail.

A fourth role is to explore malign behaviors that AI systems could enact or enable, and to identify methods for detecting and countering them. This could be facilitated by employing AI systems in red-team roles, employing systems that can plan malign behaviors as hypothetical challenges to detection and defense strategies. In this way, worst-case misaligned plans can contribute to achieving aligned outcomes.[11]

A fifth role goes beyond technical AI safety to address human alignment, investigating means and incentives for humans to use AI in benign and cooperative ways, including human cooperation in thwarting potentially catastrophic actions. These considerations intersect with broader issues of strategic stability in a diverse and open world.

Note that if general intelligence can be harnessed effectively and safely through role architectures, and if irresponsible parties will neglect AGI-agent safety regardless of its feasibility, then solving the classic AGI-agent alignment problem is neither necessary nor sufficient for safe application of strong AI capabilities.[12] And if this is true, then it is important that we update our assumptions and priorities in far-reaching ways.

9. Conclusion

The Open Agency role architecture presents a practical approach to LLM alignment that can help us apply AI systems to ambitious and consequential tasks while maintaining safety and alignment with human goals. By focusing non-agentic LLM intelligence on specific tasks within fault-tolerant role architectures, we can lower the bar for achieving satisfactory alignment while still harnessing strong, broad AI capabilities to solve challenging problems. With judicious management of persistent memory, using non-agentic intelligence to power episodic role-playing agents can ensure that AI behaviors remain task-focused and stable. Tasks within role architectures can encompass designing and implementing role architectures, including architectures that provide extensive opportunities for human oversight.

Progress in prosaic alignment, interpretability, guidelines for practitioners, understanding of malign behaviors, and human alignment centered on cooperative goals will be crucial for the safe development and deployment of strongly capable AI systems. Progress in these areas within a role-architecture framework can help us develop AI systems that are both capable and well-behaved in a world where other systems are not.
 

  1. ^

    Here I am using “LLMs” as a placeholder for something like “large language and multimodal models (GPT-5, GPT-6) augmented with other non-agentic AI resources”.

  2. ^

    The shortcomings of this document are entirely due to shortcomings of GPT-4.

  3. ^

    Narratives learned from a corpus that provides many arguments for the proposition that a real AGI agent will want to pursue convergent instrumental goals in opposition to human preferences.

  4. ^

    See “Simulators” and the concept of LLM personas. Note that RLHF extends training in ways that could create strong (and potentially perverse) intrinsic preferences, and that preference-enacting personas in base models may have large yet limited basins of attraction.

  5. ^

    Note that AI systems are subject to selective pressures similar to those of domesticated species, which include not only corn, but tobacco. AI systems will have enormous scope for driving addiction-like behaviors.

  6. ^

    LLMs can provide predictive models of human preferences (discussed in Section 22 of “Reframing Superintelligence”) that could provide guidance to benign AI agents and plan evaluators.

  7. ^

    Like humans, general AI can use specialist tools, for example, employing specialized AI (or even prosaic software tools) to perform tasks in mechanical engineering, molecular design, theorem proving, playing Go, and so on. Note that interfaces to these systems need not accept or return natural language, and that RL training in these contexts can be benign.

  8. ^

    Note that asking a system to decompose a task into roles strongly parallels chain-of-thought prompting.

  9. ^

    See applications of LangChain to building systems based on compositions of LLM operations.

  10. ^

    By contrast to careful approaches, one can imagine scenarios in which corporations deploy systems that are able to read information from the internet and deliver outputs to people who might decide to store those outputs at URLs that are accessible to those same systems. And might automate iteratively prompting systems to read those locations. And might even encourage systems to behave badly. And all this might happen while we debate the acceptability of risks stemming from responsibly designed and deployed systems that are intended to improve AI security. Hypothetically.

    Meanwhile, there is an expanding open-source software infrastructure for combining multiple LLMs with memory and other tools. 

  11. ^

    This kind of work calls for careful management of information hazards!

  12. ^

    Meanwhile, Microsoft researchers have suggested this week that GPT-4 is a near-AGI and have proposed that “Equipping LLMs with agency and intrinsic motivation is a fascinating and important direction for future work.” And they want to do this prematurely, before we’ve built an agent-safe world, perhaps because AGI agents are traditionally regarded as the only kind of strong AI, or perhaps for bragging rights, or perhaps just because.

Comments

We want work flows that divide tasks and roles because of the inherent structure of problems, and because we want legible solutions. Simple architectures and broad training facilitate applying structured roles and workflows to complex tasks. If the models themselves can propose the structures (think of chain-of-thought prompting), so much the better. Planning a workflow is an aspect of the workflow itself.

 

I think this has particular promise, and it's an area I would be excited to explore further. As I mentioned in a previous comment on your The Open Agency Model piece, I think this is a rich area of exploration for the different role architectures, roles, and tasks that would need to be organized to ensure both alignment and capabilities. As I mentioned there, I think there are specific areas of study that may contribute meaningfully to how we might do that. However, these fields have their own limitations, and the analogy to human agents fulfilling these role architectures (organizations in the traditional human-coordination sense) is not perfect. And on this note, I'm quite interested to see the capabilities of LLMs in creating structured roles and workflows for complex tasks that other LLMs could then be simulated to fulfill.

Note that if general intelligence can be harnessed effectively and safely through role architectures, and if irresponsible parties will neglect AGI-agent safety regardless of its feasibility, then solving the classic AGI-agent alignment problem is neither necessary nor sufficient for safe application of strong AI capabilities.[12] And If this is true, then it is important that we update our assumptions and priorities in far-reaching ways.

If sufficiently powerful LLMs are created in open-source, I do not doubt that there will be dozens, if not hundreds of people who, playing Herostratus, will attempt to make self-protecting, self-preserving, self-improving, and explicitly malign agents out of these LLMs, just because.

The real question for me is whether the open-source and capability AI research communities are on track to creating such agents. My current intuition is that the sheer amount of attention and academic and hacking effort poured into improving the LLM algorithms in the next few years (unless strict limitations on conducting such research are imposed immediately, or there is a widespread movement among the capability researchers to ditch this direction of work voluntarily, as David Duvenaud did recently) can with good probability lead to about an order of magnitude efficiency gain in LLM algorithms (in terms of parameters, compute, or data required to achieve a certain level of capability). Then, models more capable than GPT-4 will probably become within reach of open-source or open-source-sympathetic labs, such as FAIR or Stability. 

Given the timelines (I assign perhaps only 30% probability that open-source and academic communities will not reach this capability mark in the next five years, in the "business as usual" scenario), I tend to agree with Yudkowsky that furthering LLM capabilities will likely lead to catastrophe, and agree with his policy proposals.

His proposal of turning away from generic LLM capabilities back to narrow AI like AlphaFold (and new developments by isomorphicLabs) also seems reasonable to me.

This Twitter thread is an interesting recent example of composing LLMs into a more agent-like system. I'm not sure how well it actually works, but the graph in the first tweet demonstrates a very practical and concrete application of the concept discussed in section 6.

Remark: the tools and effort needed to compose LLMs into these kinds of systems are much simpler than those needed to train the underlying LLM(s) that these systems are composed of.

Training GPT-4 was the work of hundreds of engineers and millions of dollars of computing resources by OpenAI. LangChain is maintained by a very small team. And a single developer can write a python script which glues together chains of OpenAI API calls into a graph. Most of the effort was in training the LLM, but most of the agency (and most of the useful work) comes from the relatively tiny bit of glue code that puts them all together at the end.

I think if AIs talk to each other using human language, they'll start encoding stuff into it that isn't apparent to a human reader, and this problem will get worse with more training.

I agree that using the forms of human language does not ensure interpretability by humans, and I also see strong advantages to communication modalities that would discard words in favor of more expressive embeddings. It is reasonable to expect that systems with strong learning capacity could interpret and explain messages between other systems, whether those messages are encoded in words or in vectors. However, although this kind of interpretability seems worth pursuing, it seems unwise to rely on it.

The open-agency perspective suggests that while interpretability is important for proposals, it is less important in understanding the processes that develop those proposals. There is a strong case for accepting potentially uninterpretable communications among models involved in generating proposals and testing them against predictive models — natural language is insufficient for design and analysis even among humans and their conventional software tools.

Plans of action, by contrast, call for concrete actions by agents, ensuring a basic form of interpretability. Evaluation processes can and should favor proposals that are accompanied by clear explanations that stand up under scrutiny.

This made me think of "lawyer-speak", and other jargons.

More generally, this seems to be a function of learning speed and the number of interactions on the one hand, and the frequency with which you interact with other groups on the other. (In this case, the question would be how often do you need to be understandable to humans, or to systems that need to be understandable to humans, etc.)

This could be facilitated by employing AI systems in red-team roles, employing systems that can plan malign behaviors as hypothetical challenges to detection and defense strategies. In this way, worst-case misaligned plans can contribute to achieving aligned outcomes.

I doubt we should even try this. Our experience thus far with gain-of-function research shows that it's on the net bad rather than good.