Notes on AI for complex, consequential problems
Centre for the Governance of AI
University of Oxford
This document argues for “open agencies” — not opaque, unitary agents — as the appropriate model for applying future AI capabilities to consequential tasks that call for combining human guidance with delegation of planning and implementation to AI systems. This prospect reframes and can help to tame a wide range of classic AI safety challenges, leveraging alignment techniques in a relatively fault-tolerant context.
Rethinking safe AI and its applications
AI safety research is too varied to summarize, yet broad patterns are obvious. A long-established reference-problem centers on prospects for rational superintelligent agents that pursue narrow goals with potentially catastrophic outcomes. This frame has been productive, but developments in deep learning call for updates that take account of the proliferation of narrow models (for driving, coding, robot control, image generation, game playing…) that are either non-agentic or act as agents in only a narrow sense, and that take account of the rise of more broadly capable foundation models and LLMs. These updates call for reframing questions of AI safety, and call for attention to how consequential tasks might be accomplished by organizing AI systems that usually do approximately what humans intend.
Two frames for high-level AI
The unitary-agent frame
From its beginnings in popular culture, discussion of the AI control problem has centered around a unitary agent model of high-level AI and potential AI risks. In this model, a potentially dominant agent both plans and acts to achieve its goals.
The unitary-agent model typically carries assumptions regarding goals, plans, actions, and control.
- Goals: Internal to an agent, by default including power-seeking goals
- Plans: Internal to an agent, possibly uninterpretable and in effect secret
- Actions: Performed by the agent, possibly intended to overcome opposition
- Control: Humans confront a powerful, potentially deceptive agent
The typical unitary-agent threat model contemplates the emergence of a dominant, catastrophically misaligned agent, and safety models implicitly or explicitly call for deploying a dominant agent (or an equivalent collective system) that is both aligned and powerful enough to suppress unaligned competitors everywhere in the world.
The open-agency frame
Recent developments suggest an alternative open agency model of high-level AI. Today, the systems that look most like AGI are large language models (LLMs), and these are not agents that seek goals, but are generative models that produce diverse outputs in response to prompts (in a generalized sense) and random-number seeds. Most outputs are discarded.
Trained on prediction tasks, LLMs learn world models that include agent behaviors, and generative models that are similar in kind can be informed by better world models and produce better plans. There is no need to assume LLM-like implementations: The key point is that generation of diverse plans is by nature a task for generative models, and that in routine operation, most outputs are discarded.
These considerations suggest an “open-agency frame” in which prompt-driven generative models produce diverse proposals, diverse critics help select proposals, and diverse agents implement proposed actions to accomplish tasks (with schedules, budgets, accountability mechanisms, and so forth).
Goals, plans, actions, and control look different in the open-agency model:
- Goals: Are provided as prompts to diverse generative models, yielding diverse plans on request
- Plans: Are selected with the aid of diverse, independent comparison and evaluation mechanisms
- Actions: Incremental actions are performed by diverse task-oriented agents
- Control: Diverse, independent monitoring and evaluation mechanisms guide revision of plans
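The division of roles listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an implementation from this post: the names `Plan`, `propose`, `select`, `execute`, and `toy_generator` are hypothetical, the generators and critics are trivial stand-ins for models, and scoring each plan by its worst critic score is one arbitrary way of combining diverse critics.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Plan:
    description: str         # externalized, human-readable plan text
    steps: Tuple[str, ...]   # incremental, bounded actions
    cost: float              # estimated budget the plan would consume


# Hypothetical role signatures; any role could be filled by a capable model.
Generator = Callable[[str, int], Plan]   # (goal prompt, seed) -> proposed plan
Critic = Callable[[Plan], float]         # plan -> score, higher is better
Actor = Callable[[str], str]             # one step -> report of what happened
Monitor = Callable[[str], bool]          # report -> True if execution may go on


def propose(goal: str, generators: Sequence[Generator], seeds: Sequence[int]) -> List[Plan]:
    """Goals enter as prompts; every generator/seed pair yields one candidate."""
    return [generate(goal, seed) for generate in generators for seed in seeds]


def select(plans: Sequence[Plan], critics: Sequence[Critic], budget: float) -> Optional[Plan]:
    """Drop over-budget plans, then keep the plan whose worst critic score is best.

    Assumes at least one critic is supplied.
    """
    affordable = [p for p in plans if p.cost <= budget]
    if not affordable:
        return None  # every candidate was discarded
    return max(affordable, key=lambda p: min(critic(p) for critic in critics))


def execute(plan: Plan, actor: Actor, monitors: Sequence[Monitor]) -> List[str]:
    """Run steps one at a time; stop as soon as any monitor objects."""
    reports: List[str] = []
    for step in plan.steps:
        report = actor(step)
        reports.append(report)
        if not all(monitor(report) for monitor in monitors):
            break  # control returns to whoever revises plans
    return reports


def toy_generator(style: str) -> Generator:
    """Stand-in for a generative planning model (deterministic for the demo)."""
    def generate(goal: str, seed: int) -> Plan:
        n_steps = 1 + seed % 4
        steps = tuple(f"{style} step {i + 1} toward {goal}" for i in range(n_steps))
        return Plan(f"{style} plan for {goal}", steps, cost=10.0 * n_steps)
    return generate


plans = propose("book travel", [toy_generator("cautious"), toy_generator("fast")], seeds=[0, 1, 2, 3])
critics = [lambda p: 1.0 / len(p.steps), lambda p: 1.0 if p.description else 0.0]
chosen = select(plans, critics, budget=25.0)
if chosen is not None:
    reports = execute(chosen, actor=lambda step: f"done: {step}",
                      monitors=[lambda report: report.startswith("done")])
```

In this toy run eight plans are proposed and seven are discarded. The point of the structure is that goals, plans, and reports all pass through inspectable interfaces between roles, whatever systems fill those roles.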
For narrow tasks, of course, unitary agents may be both efficient and unproblematic: Playing board games and driving cars are tasks performed by individual humans with fast decision cycles. The open agency model, however, fits best when equivalent human tasks (building a factory, deploying a communication system, developing a drug) call for multiple skills and technologies coordinated over time. The structure of tasks and accountability relationships speaks against merging exploratory planning, strategic decision-making, project management, task performance, reporting, and auditing into a single opaque process. The costs would be high, and any gains in simplicity would be illusory.
Note: To avoid a natural misconception, it is important to recognize that the open-agency architecture describes roles and relationships, not the systems that fill those roles. Thus, division of labor does not imply division of competence, and any or all roles could in principle be performed by large, arbitrarily capable models.
AI safety challenges reframed
Basic challenges in AI safety — corrigibility, interpretability, power seeking, and alignment — look different in the agent and agency frames:
Corrigibility:
- Agents: Goal-driven agents may defend goals against change.
- Agencies: Generative planning models respond to goals as prompts.
Interpretability:
- Agents: Plan descriptions may be internal and opaque.
- Agencies: Plan descriptions are externalized and interpretable.
Power seeking:
- Agents: Open-ended goals may motivate secret plans to gain power.
- Agencies: Bounded tasks include time and budget constraints.
Alignment:
- Agents: Humans may have only one chance to set the goals of a dominant agent.
- Agencies: Humans engage in ongoing development and direction of diverse systems.
AI development in the open-agency frame can and should be informed by agentic concerns (we know why vigilance is important!), but the considerations outlined above suggest that the most tractable solutions to AI safety problems may rely to a substantial extent on agency-centered strategies. Separation of roles, incremental actions, greater transparency, and affordances for control can all make imperfect alignment techniques more effective and failures less catastrophic.
Threat models in an agency-centered world must include the emergence of dangerous agents, whether autonomous and unitary or under human direction. Effective safety models call for the development and deployment of systems that, collectively, make catastrophically dangerous goals impossible to achieve. This task does not require a world-dominating AI agent, but may require a degree of human coordination that is difficult to achieve.
Applications of AI to human goal alignment may be critically important. Combining fluent language models with compelling, well-grounded knowledge models could help by presenting us with a more complete picture of the world — situations, causality, threats, opportunities, and options — but this is a topic for another time.
The open agency model offers a new perspective on applying AI capabilities to complex, consequential tasks. It reframes the traditional AI safety challenges by introducing the concept of "open agencies" that rely on generative models that produce diverse proposals, diverse critics that help select proposals, and diverse agents that implement proposed actions to accomplish tasks. By leveraging alignment techniques in a fault-tolerant context, the open agency model provides a framework for safer and more effective AI.
The discerning reader will recognize that the principle outlined here can be applied to mundane AI systems that likewise propose alternatives, advise on choices, and perform bounded tasks with opportunities for oversight. Travel planning is one example, where the bounded task is making reservations. Note that this open-agency pattern is how people often use AI today: This article argues that the pattern scales.
I buy the description (see “Simulators”) of pure LLMs as thoroughly non-agentic models of intelligence that can readily simulate — hence actualize — a wide range of personas that act as agents. Simulations that actualize such agents have exhibited human-like behaviors that include directly threatening their perceived enemies and (primed with AI-vs-humanity narratives) talking of world conquest. LLM-based agents are opaque, unitary, and perhaps not the best point of departure for developing safe AI. Methods like RLHF could potentially adapt models to roles in open agencies in which the consequences of residual misalignment are constrained by the structure of relationships and tasks.
Dall·E 2, Stable Diffusion, and their kin are further examples of prompt-driven generative models. Prompts (in a general sense) may include descriptive text, rough sketches, and images with gaps to fill. Generalized prompts for planning models might include goals, budget constraints, and sketches or precedents that suggest the general nature of a desired plan.
Plan descriptions are of necessity interpretable as actions by downstream components and (with caveats) may also be interpretable by humans. Plans are not Transformer activations, and poorly explained plans are good candidates for rejection.
One key question I would like to ask is, Where do you get your capabilities from? You somewhat get into this in your post, but I would like to hear your full perspective.
I like this post, but some questions/critiques:
I feel like (2) is the natural starting point, since that will influence the answers to the other four questions.
It has to be like this: as you say, there are likely terabytes of machine instructions involved. Humans cannot "check" any of this for correctness; they need to have other AIs do it.
Ah, I completely misunderstood! I thought it meant that actual humans in the loop would be queried with each decision, not just that they were modelling human preferences. Nvm then.
Let's describe this in a more implementable form.
What I think you are describing is that you have AI systems, and these systems are trained from a simulator, from human feedback, or by some other offline method.
We have benchmarked their performance, and for the domain the system operates in, for inputs within the latent space of the training set, the system is sufficiently safe. (The stakes of the task determine how much reliability is required.)
So the systems operate as limited-duration "sessions". They receive a stream of input, calculate an output, and any unneeded short-term variables are cleared once the session ends.
Moreover, each time they run, we have an input checker that ensures that the input is within the latent space of the training set above.
Note that an autonomous car is perfectly implementable using the above. It need not be agentic. This is because all the "overhead" of doing it this way doesn't add much latency if computers are doing it.
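Here is a minimal Python sketch of that session pattern, under loud assumptions. The names `InputChecker` and `Session` are hypothetical. Inputs are assumed to be already embedded as numeric vectors. The "within the training distribution" test is a crude distance-to-centroid check standing in for a real out-of-distribution detector. Ending the session on the first rejected input is my choice, since the description above does not say what should happen on rejection.

```python
import math
from typing import Any, Callable, Dict, Iterable, List, Sequence, Tuple

Vector = Sequence[float]


class InputChecker:
    """Crude stand-in for an out-of-distribution detector (an assumption)."""

    def __init__(self, training_inputs: Sequence[Vector], slack: float = 1.5):
        n = len(training_inputs)
        dims = len(training_inputs[0])
        self.centroid = [sum(x[i] for x in training_inputs) / n for i in range(dims)]
        farthest = max(math.dist(x, self.centroid) for x in training_inputs)
        self.radius = slack * farthest  # accept inputs a bit beyond the training data

    def in_distribution(self, x: Vector) -> bool:
        return math.dist(x, self.centroid) <= self.radius


class Session:
    """A bounded run: checked inputs, a step limit, and scratch state that is wiped."""

    def __init__(self, policy: Callable[[Vector, Dict[str, Any]], Any],
                 checker: InputChecker, max_steps: int):
        self.policy = policy
        self.checker = checker
        self.max_steps = max_steps
        self._scratch: Dict[str, Any] = {}  # short-term variables for this session only

    def run(self, inputs: Iterable[Vector]) -> Tuple[List[Any], str]:
        outputs: List[Any] = []
        status = "completed"
        try:
            for step, x in enumerate(inputs):
                if step >= self.max_steps:
                    status = "step_limit"
                    break
                if not self.checker.in_distribution(x):
                    status = "out_of_distribution"  # hand off rather than guess
                    break
                outputs.append(self.policy(x, self._scratch))
        finally:
            self._scratch.clear()  # nothing carries over to the next session
        return outputs, status


training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]


def policy(x: Vector, scratch: Dict[str, Any]) -> float:
    scratch["seen"] = scratch.get("seen", 0) + 1  # example of short-term state
    return sum(x)


session = Session(policy, InputChecker(training), max_steps=10)
outputs, status = session.run([(0.5, 0.5), (0.2, 0.9), (5.0, 5.0), (0.1, 0.1)])
```

In the example run the third input lies far outside the training data, so the session stops there and reports `out_of_distribution` instead of producing an output for it.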
Seems straightforward. I had to look up the name - the Eric Drexler.
On a related topic, have you tried to explain to EY plausible routes to bootstrapping to self-replicating nanoforges? EY's model is not very accurate: the plausible routes require a large amount of carefully collected, high-information-content data about the nanoscale. You would need to build the smallest subunits, test those, and so forth, iteratively, to ever reach a nanoforge*. You would also likely need a very large upfront investment of robotics equipment and robotics "effort" to build the first one.
(*a nanoforge is a large machine with sufficient assembly lines to manufacture every part used in itself. Before you have one, nanotechnology is expensive and useless)
There are a lot of details to fill in, but the proposal seems basically viable, and I find it considerably more compelling than developing autonomous agents (mostly because of disempowerment concerns [if we successfully align powerful agents, we'll just be gently disempowered]).
I think cognition can to a large extent be divorced from volition, so I'm optimistic that human-in-the-loop architectures do not incur too considerable a penalty compared to unified agents.
([if disempowerment bothers you, then an aligned agent doesn't do it (to you)])
Uhm, the power differential between an aligned sovereign and humanity makes any autonomy humans have conditional on the sovereign's goodwill.
I find the concept of a sovereign nanny unpleasant.
Aligned sovereigns in general belong more to theology than to technology. I’m not convinced it’s a consistent/coherent concept, let alone a realizable or desirable one.
And its will follows your will. The question dissolves.
Is it sovereign if it obeys your programming, which obligates it to empower and integrate you as soon as it can? Yes, no, it doesn't matter.
I would like to see a much deeper technical comparison of open agencies vs. unitary agents from the perspectives of control theory, game theory, mechanism design, resilience theory, political science, theories of evolution and development, and many more.
So far, the proposal doesn't seem technically grounded, and the claims that are made, for example,
Could appear to be the reverse on a deeper inspection.
Also, the proposal seems to unnecessarily strawman "unitary agents". It's likely that general AI architectures will be multi-componental in one way or another (e.g., see LeCun's proposal; see some arguments that this might be a general trend even under recursive self-improvement here). OTOH, "open agencies", on a longer timescale (when the AI and the complexity of problems it solves grow beyond the capacity of human supervision) will converge essentially to the same point: a multi-componental AI architecture.
Every multi-componental cognitive architecture is still a cognitive architecture, and its properties could differ from the properties of the components (which is the textbook definition of emergence). Even if all the components are perfectly "aligned" and ethical, the system as a whole could drift in directions (or "move to the goal") which we would not like. Cf. capitalism, misaligned corporations made of good people, prisoner's dilemma, the tragedy of the commons, political polarisation and/or standstill, etc.
Cognitive science doesn't predict a priori that multi-component systems will not defend their goals. In fact, goals of multi-component systems ranging from corporations to totalitarian regimes to ant colonies could be notoriously hard to change. It doesn't suffice to say "open agency -> easily changeable goals"; we should analyse concrete architectures in detail for this property (after deciding if this property is in fact desirable in the first place; maybe we actually want the highest levels of civilisational intelligence to be conservative in its goals).
First, "unitary" general AI designs (or quasi-unitary, because as I gestured at above, probably you just can't get to AGI with a single Transformer, you need at least some version of debate or something multi-step/multi-componental/reflexive in the AI architecture) that will be deployed will have plans externalised (I hope, but not so sure after Bing Chat...). Considering otherwise would be crazy. Second, just having a plan outputted in either a quasi-unitary agent or open agency is not enough: we should also apply interpretability to activations when the plan was generated to get evidence (engineering assurance) that this plan is not deceptive. There is no way around this in any AI architecture.
Again, straw-manning unitary agents. Not sure what precludes implementing time and budget constraints in them.
This points to your previous post about collusion. I agree that collusion is not guaranteed (we could find a mechanism design in which it would be precluded), but collusion is likewise not ruled out. So, on the rhetorical level, it's just as valid to say that "humans may have only one chance to deploy an open agency" if this open agency will truly have general intelligence. If it is "safety (or, at least, buying of another chance) by human gatekeeping", then first, it should be analysed from safety science and reliability engineering perspectives how likely it is that it will help, and second, we can in principle add this sort of gatekeeping to quasi-unitary agents as well. If the problem with the latter is that quasi-unitary agents will be uninterpretable in some ways, then the proposal boils down to saying that "multi-component AI is more interpretable", which depends a lot on the details (as I pointed out above, in multi-component AI there is an extra task of interpreting the entire system even if all components are interpretable) of the exact "open agency" proposal. So, again, the comparison that dunks quasi-unitary agents is not founded a priori.
I think the methods for accumulating resources are one of the key questions of interest that an AI safety approach needs to handle. For instance:
We might want to apply AI to sales and marketing. How does Open Agencies address questions such as how you know what to tell customers?
We might want to apply AI to allocating budgets within a company. How does Open Agencies address questions such as how you know which projects to allocate resources to?
We might want to apply AI to design political messaging for a political party. How does Open Agencies address questions such as how you know what political promises you need to make?
We might want to apply AI to design molecular nanotechnology. How does Open Agencies address questions such as how you know how to power the nanobots, how to produce large quantities of nanobots, how to decide what waste products are acceptable to leak, etc.?
I would also add the same point I brought up when you presented it in person: I don't think that "generating diverse proposals and then selecting from them" directly provides much of the value of this proposal.
Presumably a proposal for an advanced task is going to involve a lot of information, say kilobytes of information as a lower bound. The people evaluating the proposal presumably have to spend nontrivial amounts of time reading and thinking about the proposal. So as an upper bound, it seems unlikely that it is economical for them to read through more than, say, four billion proposals.
Which gives you 4 bytes of information about their preferences, out of kilobytes of information in a proposal.
So I would deemphasize the evaluation/selection step, or maybe replace it with a different component, e.g. evaluation/feedback (if instead of generating four billion proposals and selecting one of them, you iteratively generate two approaches and give feedback for which one is better, then after four billion approaches you've accumulated two billion bits of information, rather than 32 bits of information).
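The arithmetic in the last three paragraphs can be checked with a short Python sketch. It assumes the evaluators' choices are the only channel carrying their preferences, that picking one item out of N conveys at most log2(N) bits, and that each pairwise comparison of a fresh pair conveys at most one bit. The function names are mine, for illustration only.

```python
import math


def bits_from_single_selection(num_proposals: int) -> float:
    """Picking one proposal out of N conveys at most log2(N) bits."""
    return math.log2(num_proposals)


def bits_from_pairwise_feedback(num_proposals: int) -> int:
    """Comparing proposals two at a time yields N // 2 choices of at most 1 bit each."""
    return num_proposals // 2


N = 4_000_000_000  # "four billion proposals", slightly under 2**32

selection_bits = bits_from_single_selection(N)
feedback_bits = bits_from_pairwise_feedback(N)

print(f"one selection out of {N:,}: {selection_bits:.1f} bits "
      f"(about {selection_bits / 8:.0f} bytes)")
print(f"pairwise feedback over {N:,} proposals: {feedback_bits:,} bits")
```

Both figures are upper bounds. The point is the gap between roughly 32 bits from a single selection and two billion bits from iterated feedback, for the same number of proposals read.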
I'm not sure whether I'm understanding this paragraph correctly. My reading is that here you are arguing that while Open Agencies involve a lot of coordination work that would be prohibitively expensive for fast bounded tasks, the current existence of other tasks which do involve similar amounts of coordination work implies that it would not be prohibitively expensive for other tasks.