Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Notes on AI for complex, consequential problems

Eric Drexler
Centre for the Governance of AI
University of Oxford


This document argues for “open agencies” — not opaque, unitary agents — as the appropriate model for applying future AI capabilities to consequential tasks that call for combining human guidance with delegation of planning and implementation to AI systems. This prospect reframes and can help to tame a wide range of classic AI safety challenges, leveraging alignment techniques in a relatively fault-tolerant context.

Rethinking safe AI and its applications

AI safety research is too varied to summarize, yet broad patterns are obvious. A long-established reference-problem centers on prospects for rational superintelligent agents that pursue narrow goals with potentially catastrophic outcomes. This frame has been productive, but developments in deep learning call for updates that take account of the proliferation of narrow models (for driving, coding, robot control, image generation, game playing…) that are either non-agentic or act as agents in only a narrow sense, and that take account of the rise of more broadly capable foundation models and LLMs. These updates call for reframing questions of AI safety, and call for attention to how consequential tasks might be accomplished by organizing AI systems that usually do approximately what humans intend.

Two frames for high-level AI

The unitary-agent frame

From its beginnings in popular culture, discussion of the AI control problem has centered around a unitary agent model of high-level AI and potential AI risks. In this model, a potentially dominant agent both plans and acts to achieve its goals.     

The unitary-agent model typically carries assumptions regarding goals, plans, actions, and control.

  • Goals:      Internal to an agent, by default including power-seeking goals
  • Plans:       Internal to an agent, possibly uninterpretable and in effect secret
  • Actions:   Performed by the agent, possibly intended to overcome opposition
  • Control:   Humans confront a powerful, potentially deceptive agent

The typical unitary-agent threat model contemplates the emergence of a dominant, catastrophically misaligned agent, and safety models implicitly or explicitly call for deploying a dominant agent (or an equivalent collective system) that is both aligned and powerful enough to suppress unaligned competitors everywhere in the world.

The open-agency frame

Recent developments suggest an alternative open agency model of high-level AI. Today, the systems that look most like AGI are large language models (LLMs), and these are not agents that seek goals,[1] but are generative models that produce diverse outputs in response to prompts (in a generalized sense) and random-number seeds.[2] Most outputs are discarded.

Trained on prediction tasks, LLMs learn world models that include agent behaviors, and generative models that are similar in kind can be informed by better world models and produce better plans. There is no need to assume LLM-like implementations: The key point is that generation of diverse plans is by nature a task for generative models, and that in routine operation, most outputs are discarded.

These considerations suggest an “open-agency frame” in which prompt-driven generative models produce diverse proposals, diverse critics help select proposals, and diverse agents implement proposed actions to accomplish tasks (with schedules, budgets, accountability mechanisms, and so forth).

Goals, plans, actions, and control look different in the open-agency model:

  • Goals:     Are provided as prompts to diverse generative models, 
                     yielding diverse plans on request
  • Plans:     Are selected with the aid of diverse, independent comparison 
                     and evaluation mechanisms
  • Actions:  Incremental actions are performed by diverse task-oriented agents
  • Control:  Diverse, independent monitoring and evaluation mechanisms 
                      guide revision of plans
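
The roles listed above can be sketched in a few lines of code. This is only a toy illustration under stated assumptions, not an implementation proposed in this post: `Plan`, `select_plan`, the generator and critic functions, and the scoring rules are all hypothetical names and choices. It shows generators producing candidate plans from a goal prompt, most candidates being discarded, and independent critics driving selection within a budget.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass(frozen=True)
class Plan:
    description: str   # externalized, human-readable plan text
    steps: List[str]   # bounded, incremental actions for task agents
    cost: float        # estimated budget the plan would consume

Generator = Callable[[str], List[Plan]]  # goal prompt -> candidate plans
Critic = Callable[[Plan], float]         # plan -> score in [0, 1]

def select_plan(goal: str, generators: List[Generator],
                critics: List[Critic], budget: float) -> Optional[Plan]:
    """Generate diverse plans, discard most, and return the best-rated survivor."""
    candidates = [plan for generate in generators for plan in generate(goal)]
    affordable = [plan for plan in candidates if plan.cost <= budget]
    if not affordable:
        return None
    # Rank by the lowest critic score (an assumed rule), so that a single
    # strong objection from any independent critic can sink a plan.
    return max(affordable, key=lambda plan: min(critic(plan) for critic in critics))

# Hypothetical toy generators and critics.
def cautious_generator(goal: str) -> List[Plan]:
    return [Plan(f"{goal}: pilot first", ["pilot", "review", "scale"], cost=3.0)]

def ambitious_generator(goal: str) -> List[Plan]:
    return [Plan(f"{goal}: all at once", ["build everything"], cost=9.0)]

def prefers_incremental(plan: Plan) -> float:
    return min(1.0, len(plan.steps) / 3)

def prefers_cheap(plan: Plan) -> float:
    return 1.0 / (1.0 + plan.cost)

best = select_plan(
    "deploy a communication system",
    [cautious_generator, ambitious_generator],
    [prefers_incremental, prefers_cheap],
    budget=10.0,
)
```

In this sketch the selected plan is plain data that a separate implementation step would consume; the generator that wrote it never acts on it.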

For narrow tasks, of course, unitary agents may be both efficient and unproblematic:

Playing board games and driving cars are tasks performed by individual humans with fast decision cycles. The open agency model, however, fits best when equivalent human tasks (building a factory, deploying a communication system, developing a drug) call for multiple skills and technologies coordinated over time. The structure of tasks and accountability relationships speaks against merging exploratory planning, strategic decision-making, project management, task performance, reporting, and auditing into a single opaque process. The costs would be high, and any gains in simplicity would be illusory.

Note: To avoid a natural misconception, it is important to recognize that the open-agency architecture describes roles and relationships, not the systems that fill those roles. Thus, division of labor does not imply division of competence, and any or all roles could in principle be performed by large, arbitrarily-capable models.

AI safety challenges reframed

Basic challenges in AI safety — corrigibility, interpretability, power seeking, and alignment — look different in the agent and agency frames:


Corrigibility:

     Agents:     Goal-driven agents may defend goals against change.
     Agencies:  Generative planning models respond to goals as prompts.


Interpretability:

     Agents:     Plan descriptions may be internal and opaque.
     Agencies:  Plan descriptions are externalized and interpretable.[3]

Power seeking:

     Agents:     Open-ended goals may motivate secret plans to gain power.
     Agencies:  Bounded tasks include time and budget constraints.


Alignment:

     Agents:     Humans may have only one chance to set the goals of a dominant agent.
     Agencies:  Humans engage in ongoing development and direction of diverse systems.

AI development in the open-agency frame can and should be informed by agentic concerns (we know why vigilance is important!), but the considerations outlined above suggest that the most tractable solutions to AI safety problems may rely to a substantial extent on agency-centered strategies. Separation of roles, incremental actions, greater transparency, and affordances for control can all make imperfect alignment techniques more effective and failures less catastrophic.

Threat models in an agency-centered world must include the emergence of dangerous agents, whether autonomous and unitary or under human direction. Effective safety models call for the development and deployment of systems that, collectively, make catastrophically dangerous goals impossible to achieve. This task does not require a world-dominating AI agent, but may require a degree of human coordination that is difficult to achieve.

Applications of AI to human goal alignment may be critically important. Combining fluent language models with compelling, well-grounded knowledge models could help by presenting us with a more complete picture of the world — situations, causality, threats, opportunities, and options — but this is a topic for another time.


The open agency model offers a new perspective on applying AI capabilities to complex, consequential tasks.  It reframes the traditional AI safety challenges by introducing the concept of "open agencies" that rely on generative models that produce diverse proposals, diverse critics that help select proposals, and diverse agents that implement proposed actions to accomplish tasks. By leveraging alignment techniques in a fault-tolerant context, the open agency model provides a framework for safer and more effective AI.

The discerning reader will recognize that the principle outlined here can be applied to mundane AI systems that likewise propose alternatives, advise on choices, and perform bounded tasks with opportunities for oversight. Travel planning is one example, where the bounded task is making reservations. Note that this open-agency pattern is how people often use AI today: This article argues that the pattern scales.



  1. I buy the description (see “Simulators”) of pure LLMs as thoroughly non-agentic models of intelligence that can readily simulate (and hence actualize) a wide range of personas that act as agents. Simulations that actualize such agents have exhibited human-like behaviors that include directly threatening their perceived enemies and (primed with AI-vs-humanity narratives) talking of world conquest. LLM-based agents are opaque, unitary, and perhaps not the best point of departure for developing safe AI. Methods like RLHF could potentially adapt models to roles in open agencies in which the consequences of residual misalignment are constrained by the structure of relationships and tasks.

  2. Dall·E 2, Stable Diffusion, and their kin are further examples of prompt-driven generative models. Prompts (in a general sense) may include descriptive text, rough sketches, and images with gaps to fill. Generalized prompts for planning models might include goals, budget constraints, and sketches or precedents that suggest the general nature of a desired plan.

  3. Plan descriptions are of necessity interpretable as actions by downstream components and (with caveats) may also be interpretable by humans. Plans are not Transformer activations, and poorly explained plans are good candidates for rejection.



One key question I would like to ask is, Where do you get your capabilities from? You somewhat get into this in your post, but I would like to hear your full perspective.

I like this post, but some questions/critiques:

  1. In my mind, one of the main requirements for Aligned AGI is the ability to defeat evil AGIs if they arise (hopefully without needing to interfere with the human activity leading up to them). Open-agency decision making seems a bit slow to meet this requirement. It's also not clear how it scales over time, so could it even beat an evil open agency, assuming the aligned open agency gets a head start? 🤔
  2. Open agencies might not even be fast or cheap enough to fill the economic niches we want out of an AGI. What is the economic niche?
  3. The way you are combining the agents doesn't seem to preserve alignment properly. Even if the individual agents are mostly aligned, there is still optimization pressure against alignment. For example, there is immense optimization pressure for getting a larger budget. In general, I'd like to see how the mesaoptimizer problem manifests (or is solved!) in open agencies. Compare with imitative amplification or debate, where the optimization pressure is much weaker and gets scrutinized by agents that are much smarter than humans.
  4. Modelling in general seems difficult because you need to deal with the complexity of human social dynamics and psychology. We don't even have a model for how humans act "in distribution", let alone out of distribution.
  5. The details don't seem to add much value vs. the simpler idea of "give an organization access to subhuman AI tools". Organizations adopting new tools is fairly well established. For example, programmers in organizations already use Codex to help them code. I'm sure business people are using ChatGPT for brainstorming. It would strengthen the post if you listed what value is added vs. the traditional approach that is already happening organically.

I feel like (2) is the natural starting point, since that will influence the answers to the other four questions.

  1. If the agencies are actually AI models queried context-free, and we automate choosing an action based on Drexler's previous post, then this will run in real time.
  2. The economic niche is all uses of AI where the user will face accountability for negative consequences.  This method costs a multiple of the amount of hardware at inference time (likely 10-20 times) but comes with a reduction in the risk of adversarial behavior that bankrupts the company using AI
  3. There is actually enormous optimization pressure in favor of alignment.  Every time an agent "deceives" and outputs an answer that disagrees, this creates a record in the log that humans may later see.  Once humans "catch" an agent engaging in deception or being wrong, they will cease using that model.  This is effectively the prisoner's dilemma, where if you don't defect and are caught lying, you will be summarily executed, and there are 10-20+ other prisoners you have not met who share little background with you who are also being questioned.  
  4. I don't have an answer to this one
  5. These are superhuman tools
  1. On the left-hand side there are a large number of human components. This is where I was expecting the slowdown. I'm guessing that defeating evil AGI wouldn't be a narrow task that could be delegated to a unitary agent.
  2. What about something like "safely use nano-technology to reverse aging"? There aren't enough humans to oversee every nano-machine, but it seems dangerous to hand it over to a unitary agent that is either dumb or unaligned. Even just the research stage could be dangerous. 🤔 And what about unforeseeable economic niches; do we have a reduction argument that "anything an AGI could contribute, the open agency can as well"? We can't reduce directly by saying "the open agency can use AI agents, including the AGI", because they can only use narrow AI agents.
  3. I'm not talking about alignment of any individual agent (I'm taking for granted that they are all mostly aligned (including the humans) and so wouldn't subtly lie), I'm talking about alignment of the overall system.
  4. No response, but I need a number 4 for formatting reasons, lol
  5. Right, true. So I guess the question is what's the value added of the proposal's details over "give an organization access to AI tools"; the subhuman part was unimportant.
  1. See how "evaluation" and "implementation" show direct arrows to them, not going through the humans in the picture? The evaluation is automatable: it can be some AI model determining whether the proposed plans and projected outcomes meet the preferences of the human users. See also how "implementation" shows many boxes? The final plan doesn't get performed by the AI that generated it (so it doesn't get to hijack the effort to make progress towards its unaligned goals) but is carried out by a different AI trained on, say, robotics tasks. This agent has been selected among many possibilities because it accomplishes the robotics tasks best, given the constraints.
  2. See the above. The agent planning the nanosurgery, the one carrying it out, and the one manufacturing the machinery are all different agents, and many of them may actually be invisible "clusters" of agents where one is in charge but silently we are querying 10-20 others and making sure the other AIs are not proposing a significantly different plan. (Due to one of the tenets of rationality, that two rational agents cannot agree to disagree, each agent should produce the same answer to within a certain amount of floating point error.)

    It has to be like this: as you say, there are likely terabytes of machine instructions involved. Humans cannot "check" any of this for correctness; they need to have other AIs do it.
  3. See above. What makes system alignment possible is that using these clusters of agents prevents the agents from colluding against us, and checking the plans against our human preferences ensures that the plans are not completely "out of bounds". Also, we do not give the system a "blank check" to do whatever it wants; its future plans are visible to us, as they must be in an interpretable data format so it can describe to another AI what needs to be carried out in the real world.
  4. same
  5. Right.  These agents can easily be better at their assignments than humans.
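
The "cluster" check described in point 2 above could look roughly like the sketch below. Everything here is an assumption made for illustration: answers are treated as single numbers, and `cross_check`, `tolerance`, and `min_agreement` are hypothetical names and thresholds. The idea is that one model's answer is only trusted when enough independently queried models return nearly the same answer.

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[str], float]  # hypothetical: a model maps a query to a numeric answer

def cross_check(primary: Model, others: Sequence[Model], query: str,
                tolerance: float = 1e-6,
                min_agreement: float = 0.9) -> Tuple[float, bool]:
    """Return the primary model's answer and whether enough other models agree with it."""
    answer = primary(query)
    if not others:
        return answer, False  # nothing to check against, so do not trust the answer
    agreeing = sum(1 for model in others if abs(model(query) - answer) <= tolerance)
    return answer, agreeing / len(others) >= min_agreement

# Toy stand-ins for independently trained models.
honest = lambda query: 42.0
deviant = lambda query: 17.0

answer, trusted = cross_check(honest, [honest] * 10, "dose in mg?")
flagged_answer, flagged_trusted = cross_check(deviant, [honest] * 10, "dose in mg?")
```

A disagreement does not say which model is wrong; in this sketch it only marks the answer as untrusted so that it can be logged and reviewed.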

Ah, I completely misunderstood! I thought it meant that actual humans in the loop were being queried with each decision, not just that the systems were modelling human preferences. Nvm then.

Let's describe this in a more implementable form.

What I think you are describing is that you have AI systems, and these systems are trained from a simulator, from human feedback, or by some other offline method.

We have benchmarked their performance, and for the domain the system operates in, for inputs within the latent space of the training set, the system is sufficiently safe. (the stakes of the task determine how much reliability is required)

So the systems operate as limited duration "sessions".  They receive a stream of input, calculate an output, and any un-needed short term variables are cleared once the session ends.  

Moreover, each time they run, we have an input checker that ensures that the input is within the latent space of the training set above.
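
A minimal sketch of the session-plus-input-checker pattern described above, under loudly stated assumptions: the "within the training set" test is replaced by a crude per-feature range check, and `make_input_checker`, `run_session`, and `OutOfDistribution` are hypothetical names. A real system would need a much better in-distribution test.

```python
from typing import Callable, List, Sequence

class OutOfDistribution(Exception):
    """Raised when an input falls outside the range covered by benchmarking."""

def make_input_checker(training_inputs: Sequence[Sequence[float]]) -> Callable[[Sequence[float]], bool]:
    """Build a checker that accepts inputs inside the per-feature range seen in training.

    This is a crude stand-in for a real in-distribution test.
    """
    lows = [min(column) for column in zip(*training_inputs)]
    highs = [max(column) for column in zip(*training_inputs)]

    def check(x: Sequence[float]) -> bool:
        return len(x) == len(lows) and all(
            low <= value <= high for value, low, high in zip(x, lows, highs)
        )

    return check

def run_session(model, checker, inputs) -> List[float]:
    """Run one bounded session: check every input and keep no state afterwards."""
    scratch: List[float] = []  # short-term variables, local to this session only
    outputs: List[float] = []
    for x in inputs:
        if not checker(x):
            raise OutOfDistribution(f"input {x!r} is outside the benchmarked range")
        y = model(x, scratch)
        scratch.append(y)
        outputs.append(y)
    return outputs  # scratch is discarded when the session ends

training = [(0.0, 10.0), (1.0, 20.0), (0.5, 15.0)]
checker = make_input_checker(training)
toy_model = lambda x, scratch: sum(x)  # hypothetical model that ignores its scratch state
outputs = run_session(toy_model, checker, [(0.2, 12.0), (0.9, 18.0)])
```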


Note that an autonomous car is perfectly implementable using the above.  It need not be agentic.  This is because all the "overhead" of doing it this way doesn't add much latency if computers are doing it.


Seems straightforward.  I had to look up the name - the Eric Drexler.

On a related topic, have you tried to explain to EY plausible routes to bootstrapping to self-replicating nanoforges? EY's model is not very accurate; the plausible routes require a large amount of carefully collected, high-information-content data about the nanoscale. You would need to build the smallest subunits, test those, and so forth, iteratively, to ever reach a nanoforge*. You would also likely need a very large upfront investment of robotics equipment and robotics "effort" to build the first one.

   (*a nanoforge is a large machine with sufficient assembly lines to manufacture every part used in itself.  Before you have one, nanotechnology is expensive and useless)

There are a lot of details to fill in, but the proposal seems basically viable, and I find it considerably more compelling than developing autonomous agents (mostly because of disempowerment concerns [if we successfully align powerful agents, we'll just be gently disempowered]).

I think cognition can largely be divorced from volition, so I'm optimistic that humans-in-the-loop architectures do not incur too considerable a penalty compared to unified agents.

([if disempowerment bothers you, then an aligned agent doesn't do it (to you)])

Uhm, the power differential between an aligned sovereign and humanity makes any autonomy humans have conditional on the sovereign's goodwill.

I find the concept of a sovereign nanny unpleasant.

Aligned sovereigns in general belong more to theology than to technology. I’m not convinced it’s a consistent/coherent concept, let alone a realizable or desirable one.

And its will follows your will. The question dissolves.

Is it sovereign, if it obeys your programming, which obligates it to empower and integrate you as soon as it can? Yes, no, it doesn't matter.

I would like to see a much deeper technical comparison of open agencies vs. unitary agents from the perspectives of control theory, game theory, mechanism design, resilience theory, political science, theories of evolution and development, and many more.

So far, the proposal doesn't seem technically grounded, and the claims that are made, for example,

The structure of tasks and accountability relationships speaks against merging exploratory planning, strategic decision-making, project management, task performance, reporting, and auditing into a single opaque process. The costs would be high, and any gains in simplicity would be illusory.

could turn out to be the reverse on deeper inspection.

Also, the proposal seems to unnecessarily strawman "unitary agents". It's likely that general AI architectures will be multi-componental in one way or another (e.g., see LeCun's proposal; see some arguments that this might be a general trend even under recursive self-improvement here). OTOH, "open agencies", on a longer timescale (when the AI and the complexity of problems it solves grow beyond the capacity of human supervision), will converge essentially to the same point: a multi-componental AI architecture.

Every multi-componental cognitive architecture is still a cognitive architecture, and its properties could differ from the properties of the components (which is the textbook definition of emergence). Even if all the components are perfectly "aligned" and ethical, the system as a whole could drift in directions (or "move to the goal") which we would not like. Cf. capitalism, misaligned corporations made of good people, prisoner's dilemma, the tragedy of commons, political polarisation and/or standstill, etc.

Agents:     Goal-driven agents may defend goals against change.
Agencies:  Generative planning models respond to goals as prompts.

Cognitive science doesn't predict a priori that multi-component systems will not defend their goals. In fact, goals of multi-component systems ranging from corporations to totalitarian regimes to ant colonies could be notoriously hard to change. It doesn't suffice to say "open agency -> easily changeable goals"; we should analyse concrete architectures in detail for this property (after deciding if this property is in fact desirable in the first place; maybe we actually want the highest levels of civilisational intelligence to be conservative in its goals).

Agents:     Plan descriptions may be internal and opaque.
Agencies:  Plan descriptions are externalized and interpretable. (Plan descriptions are of necessity interpretable as actions by downstream components and (with caveats) may also be interpretable by humans. Plans are not Transformer activations, and poorly explained plans are good candidates for rejection.)

First, "unitary" general AI designs (or quasi-unitary, because as I gestured at above, probably you just can't get to AGI with a single Transformer, you need at least some version of debate or something multi-step/multi-componental/reflexive in the AI architecture) that will be deployed will have plans externalised (I hope, but not so sure after Bing Chat...). Considering otherwise would be crazy. Second, just having a plan outputted in either a quasi-unitary agent or open agency is not enough: we should also apply interpretability to activations when the plan was generated to get evidence (engineering assurance) that this plan is not deceptive. There is no way around this in any AI architecture.

Agents:     Open-ended goals may motivate secret plans to gain power.
Agencies:  Bounded tasks include time and budget constraints.

Again, straw-manning unitary agents. Not sure what precludes implementing time and budget constraints in them.

Agents:     Humans may have only one chance to set the goals of a dominant agent.
Agencies:  Humans engage in ongoing development and direction of diverse systems.

This points to your previous post about collusion. I agree that collusion is not guaranteed (we could find a mechanism design in which it would be precluded), but collusion is likewise not ruled out. So, on the rhetorical level, it's just as valid to say that "humans may have only one chance to deploy an open agency" if this open agency will truly have general intelligence. If it is "safety (or, at least, buying of another chance) by human gatekeeping", then first, it should be analysed from safety science and reliability engineering perspectives how likely it is that it will help, and second, we can in principle add this sort of gatekeeping to quasi-unitary agents as well. If the problem with the latter is that quasi-unitary agents will be uninterpretable in some ways, then the proposal boils down to saying that "multi-component AI is more interpretable", which depends a lot on the details (as I pointed out above, in multi-component AI there is an extra task of interpreting the entire system even if all components are interpretable) of the exact "open agency" proposal. So, again, the comparison that dunks quasi-unitary agents is not founded a priori.

Power seeking: Agents: Open-ended goals may motivate secret plans to gain power. Agencies: Bounded tasks include time and budget constraints.

I think the methods for accumulating resources are one of the key questions of interest that an AI safety approach needs to handle. For instance:

  • We might want to apply AI to sales and marketing. How do Open Agencies address questions such as how you know what to tell customers?

  • We might want to apply AI to allocating budgets within a company. How do Open Agencies address questions such as how you know which projects to allocate resources to?

  • We might want to apply AI to design political messaging for a political party. How do Open Agencies address questions such as how you know what political promises you need to make?

  • We might want to apply AI to design molecular nanotechnology. How do Open Agencies address questions such as how you know how to power the nanobots, how to produce large quantities of nanobots, how to decide what waste products are acceptable to leak, etc.?

I would also add the same point I brought up when you presented it in person: I don't think "generating diverse proposals and then selecting from them" is directly providing much of the value of this proposal.

Presumably a proposal for an advanced task is going to involve a lot of information, say kilobytes of information as a lower bound. The people evaluating the proposal presumably have to spend nontrivial amounts of time reading and thinking about the proposal. So as an upper bound, it seems unlikely that it is economical for them to read through more than, say, four billion proposals.

Which gives you 4 bytes of information about their preferences, out of kilobytes of information in a proposal.

So I would deemphasize the evaluation/selection step, or maybe replace it with a different component, e.g. evaluation/feedback (if instead of generating four billion proposals and selecting one of them, you iteratively generate two approaches and give feedback for which one is better, then after four billion approaches you've accumulated two billion bits of information, rather than 32 bits of information).
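
The arithmetic behind this comparison can be checked directly. The sketch below assumes, as the comment does, that picking one of N proposals conveys at most log2(N) bits and that each pairwise comparison conveys at most one bit; the function names are hypothetical.

```python
import math

def bits_from_selection(n_proposals: int) -> float:
    """Upper bound on the information conveyed by picking one of n proposals."""
    return math.log2(n_proposals)

def bits_from_pairwise_feedback(n_approaches: int) -> int:
    """Upper bound when approaches are compared two at a time, one bit per comparison."""
    return n_approaches // 2

n = 4_000_000_000
selection_bits = bits_from_selection(n)           # just under 32 bits, i.e. about 4 bytes
feedback_bits = bits_from_pairwise_feedback(n)    # 2,000,000,000 bits
```

Both figures are upper bounds: they are reached only if every choice is equally likely beforehand and every comparison is fully informative.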

Playing board games and driving cars are tasks performed by individual humans with fast decision cycles. The open agency model, however, fits best when equivalent human tasks (building a factory, deploying a communication system, developing a drug) call for multiple skills and technologies coordinated over time. The structure of tasks and accountability relationships speaks against merging exploratory planning, strategic decision-making, project management, task performance, reporting, and auditing into a single opaque process. The costs would be high, and any gains in simplicity would be illusory.

I'm not sure whether I'm understanding this paragraph correctly. My reading is that here you are arguing that while Open Agencies involve a lot of coordination work that would be prohibitively expensive for fast bounded tasks, the current existence of other tasks which do involve similar amounts of coordination work implies that it would not be prohibitively expensive for other tasks.