Deger Turan, Tantum Collins, davidad and Charbel-Raphael Segerie contributed to the writing of this post.

Andrew Critch, Ben Goldhaber, Ozzie Gooen, and Evan Miyazono reviewed this post.

tl;dr: Open Agency Architecture (OAA) is a bold proposal for AI alignment. It requires a wide ranging formal-modeling enterprise that integrates into a comprehensive world model. We plan to begin developing OAA by iterating on smaller, domain-specific applications that can find near-term use as institutional decision-making tools and provide feedback to continue improving the OAA framework . This document decomposes that goal into sub-agendas designed to engage different technical communities in an international collaboration.

Motivations

The AI Objectives Institute (AOI) aims to ensure that AI and future economic systems are built and deployed with genuine human objectives at their core, enabled by broad public input and scalable cooperation. This goal involves many challenging technical and political problems and pursuing it requires long-term, ambitious, detail-oriented work that integrates general-public preferences and values with expert opinions and predictions.

We find the Open Agency Architecture (OAA, described below) compelling as a platform for decomposing this objective into achievable, well-defined sub-problems. Aligning AI systems is one of the most pressing and fateful challenges of our era: if humanity fails to solve technical alignment, or solves technical alignment in a manner that empowers malicious actors or dysfunctional power structures, then increasingly powerful autonomous systems could usher in dystopias from which we cannot recover. OAA helps reveal the continuity between key components of the alignment problem and a range of long standing governance problems that are not traditionally considered “AI issues,” ranging from organizational design to international coordination protocols. By making these connections systematic and explicit, OAA lets us operationalize our commitment to aligning AI through means that are not only safe but also deeply democratic.

As outlined in this post, we intend to follow a practical roadmap grounded in the OAA that leads to the construction of a min-viable prototype and, eventually, to full-scale implementation that would collect feedback via deployed systems. Most of the projects proposed relate to the first part of OAA – agenda setting and modeling. Parallel projects for time-bound RL, model checking, simulation etc. would also have value, but lie beyond the scope of this post. We believe that this initiative can become a central and effective hub for shepherding AI development towards a safe outcome that supports human flourishing. We describe some of the ways that we hope individuals and institutions will engage with this open-source effort. Please reach out here if you have questions, thoughts or proposals.

Open Agency Architecture overview

In December, Davidad proposed an ‘Open Agency Architecture for Safe Transformative AI’. A subsequent post from Charbel-Raphaël Ségerie and Gabin Kolly discusses the proposal in greater technical detail and maps it onto a scaling roadmap for development and deployment. We encourage readers of this post to also examine those posts in depth. We provide a brief summary below:

The OAA identifies three separable, high-level goals: learning a world model and eliciting desirabilities over possible future states from humans; planning against this world model and the associated desirabilities; and acting in real time.
The architecture is grounded in several assumptions, including the beliefs that: RLHF alone will not succeed in aligning LLMs, scalable oversight poses too many difficulties to be practical, and interpretability used to reverse-engineer an arbitrary AGI comprehensively seems infeasible (hence the need for iterative construction of an understandable world model).
The practical use of such a system would likely adhere to a flow featuring the following components: human representatives handle ‘agenda setting’ to determine the deliberation/negotiation/decision space; AI models help predict the likely effects of various choices; another set of models elicits human preferences over these possibilities; model-based RL simulators then reconcile these and advise on selected Pareto-optimal policies according to their expected desirability aggregated across the population; throughout this cycle, various control systems and review processes maintain stability. See the image below for a more detailed accounting.
Outstanding problems include the difficulty of building and formalizing such a world model, lack of prior work for critical technical components of the proposal (e.g. computational Infra-Bayes), verification issues resulting from the difficulty of understanding the framework, and potential limitations of RL (this final challenge is not unique to the OAA).

The OAA may impose a significant ‘alignment tax’ (i.e. adhering to the OAA, in particular the requirements for formalization and verification, might slow AI development). Conditional on all major labs committing to follow the OAA, this is likely a good outcome, in that slower capability advances offer more time for addressing safety and ethics problems. However, absent such commitments, some labs may find this tax too burdensome and instead proceed down a riskier path that optimizes for capability development. If we want labs to embrace this agenda, we need to demonstrate its feasibility and lay a foundation for collective commitment. Subsequent sections of this document outline initial steps in this direction.

General vision: A Practical Minimal Viable OAA

The roadmap outlined below is grounded in the below principles:

Iterative feedback loop: We believe that a large-scale open-source project can accelerate both the development of individual components and their integration by enabling parallelization, although it’s important to note that we would not want to open-source frontier capabilities (which would be necessary components in an ultimately successful OAA per Davidad’s success model). We seek a min-viable, human-in-the-loop prototype with real life applications for development feedback. This can iteratively grow towards more complex world modeling, value elicitation, and policy choice mechanisms.
Parallel execution of sub-agendas: By separating alignment into a set of independently pursuable subproblems, OAA enables parallel execution of workstreams. This accelerates development timelines and is compatible with an approach that centers open source collaboration. We have designed this project with an eye towards making it as easy as possible for academics, practitioners and enthusiasts worldwide to contribute meaningfully.
Testing in institutional design: Because OAA is applicable to institutional decision making problems that do not involve near-AGI, it offers opportunities for testing across diverse settings during technical development. Feedback and optimization processes very similar to those central to reinforcement learning appear in many institutional contexts. Embedding partial OAA solutions into such governance and feedback structures can improve organizational performance and provide valuable insights for further technical development.

We hope that a functional, modular, open-source, end-to-end implementation of a simplified OAA can attract collaborators across research organizations and academia, growing incrementally while adhering to clear governance principles. We will try to use OAA soon for institutional design and decision-making applications, such as funding allocation and participatory budgeting, bargaining, and reliable energy and resource allocation (e.g. designing safer SmartGrids).

Roadmap: breathing life into the blueprint

Each section here describes an independent research problem. If you are working on these or related subjects and would like to collaborate, please reach out. We have done our best to connect our discussion to relevant open-source projects when possible.

Necessary modular steps are:

Agenda Setting Projects - in order to formalize the space of possible actions

1 - Autoformalizing Option Spaces
2 - Structured Construction of Option Spaces via Templates, if 1 is too complex.

World modeling - in order to formalize the consequences of those actions in a world model

3 - Conversational interface for probability elicitation from experts
4 - Probabilistic Interface for Expert Feedback
5 - Verified RL in iterated-PPL environments
6 - Training LLMs to write models in the PRISM language or PPLs
Bonus : 7 - Formalizing the world model with Continuous time coalgebras - this is not directly relevant for the minimal version, but necessary in order to make the this prototype scalable for world models with deeper physical account of reality

Preference elicitation

8 - Approaches To Public Feedback: Utility functions directly over the action space
- Eliciting the preferences of the different stakeholders and of the public

Approaches To Choice Mechanisms

9 - Nash Bargaining : Once the possible solutions of the problem and the consequences of each action are formulated, we just need to optimize in an adequate solver the problem.

Agenda-Setting Projects

Creating tools and processes that build formal representations of potential collective options

1. Autoformalizing Option Spaces

Step 1 - Creation of the dataset: The initial ‘minimum viable implementation’ for agenda-setting could simply be a protocol for soliciting experts or policymakers for formal or natural language descriptions (respectively) of the option set relevant to a given decision. In some cases this would consist of a finite list of decision outcomes; in other cases (like resource allocation) it will contain outcomes that relate or assign values to a finite set of opportunities or resources. This may help with calibration (surfacing options that are plausible and/or near-term) or creativity (expanding the option space to unforeseen, possibly positive-sum outcomes.)

Consider, for example, a formalization of participatory budgeting:

There is a finite set X of proposals.
There is a constant positive number b which is the total budget.
The option space consists of finite functions f from X->R+ such that the sum over all i in X of f(i) is less than or equal to b.

Real world example: SFF’s S-Process

Step 2 - Training by auto-formalization of option spaces: The goal of the auto-formalization project is to train/develop a system which takes the natural language description of a decision situation and produces a formal description of the decision variables and constraints in the language of an optimization framework such as CVXPY. (Note that it is not a goal at this point to define the objective function, just the feasible space of potential decision outcomes.)

Baseline training: The first step in this project would be simply to compile examples of commonly occurring decision situations (resource allocation, travel planning, task scheduling, scheduling sessions at a conference, etc.) based on our min-viable library of option space descriptions, and manually write down CVXPY formalizations of the decision space. To get our initial auto-formalization baseline, we will simply use these pairs of natural language and formal descriptions as a few-shot prompt for GPT-4. We may also experiment with including in each example some intermediately structured representations, e.g. in YAML, to benefit from chain-of-thought-style processing.

Synthetic data augmentation: One could synthesize more data by simply prompting GPT-4 to generate more examples and then filtering and editing them for syntactic and semantic correctness. As the dataset grows, it will become easier to generate correct examples. One can also further augment the dataset by, for example, keeping the CVXPY code the same but asking GPT-4 to simply rephrase the natural-language description in other words while keeping the meaning the same. Renaming variables or changing the ordering of constraints in the CVXPY code maye also be useful for data-augmentation.

Once a satisfactory dataset built via these methods exists, it should be possible to fine-tune GPT-3.5 (if not GPT-4 yet) on a large set of correct pairs of natural-language descriptions and formal specifications of a decision situation.

2. Structured Construction of Option Spaces via Templates

TLDR: A parallel approach to agenda setting that may have more immediate practical applications is Structured Construction. The core idea is to classify decision problems into one of a fixed finite set of template problem-types, such as “simple choice”, “resource allocation”, “matching problem”, etc. One could refer to the table of contents of Schrijver’s Combinatorial Optimization text, and the MOSEK Modeling Cookbook, as sources of different problem-types.

Baseline procedure: The data strategy for a structured construction approach would resemble the autoformalization description above, but it could likely generate examples more easily without creating concerns around syntactic correctness. To process a decision, one could apply a classifier to a natural language problem description, automatically pick an appropriate template based on that classification, and then prompt an LLM to fill in the template’s parameters, e.g. as a structured YAML file, based on the natural language description. A simple hand-coded (or Copilot-coded…) function or script could then algorithmically generate the appropriate CVXPY code.

Extensions: Further work could extend this project toward more compositional descriptions of decision situations. This more sophisticated model should be able to compose and decompose decision problems in cases where, for example, one needs to allocate two different resources between the same set of opportunities, or when one first needs to make a simple choice (e.g. between taking a taxi or taking transit) and then only under some top-level choices does one face an additional decision problem (e.g. if one takes transit, one must choose a route through the transit network, whereas if one takes a taxi, it is the taxi driver’s problem to choose a route).

Comparing the results of this structured approach with the results of the more LLMs-forward approach outlined above could provide valuable methodological insight.

Approaches to World Modeling

Creating tools and processes that can model possible outcomes from the option space

3. Conversational interface for probability elicitation from experts

Min version: The min-viable version of world modeling for OAA would be a protocol for gathering and integrating expert predictions. This ‘world model’ would practically consist of a council of experts, consulting relevant technical tools (e.g. simulations) where available, writing down their thoughts about what is likely to happen if a given policy is pursued.

User interface version: A second area of interest when designing this protocol is finding the right UX approach for eliciting quantitative judgements from experts. UX modalities such as sliders, or a conversational interface that performs binary search on betting odds, may produce more considered judgments than directly eliciting numbers.

Extensions: This strategy is highly compatible with the Structured Construction Strategy for templates: we may, for instance, deploy LLMs to propose translations of the experts’ predictions into parameters of model templates. More ambitiously, the LLM could iteratively propose to “fill in holes” in the expert’s model by filling-in increasingly detailed templates using the expert’s feedback.

4. Probabilistic Interface for Expert Feedback

Prototype: While autoformalization from natural-language descriptions to probabilistic programs is highly ambitious due to number-synthesis challenges, a V1 prototype may rely on a Squiggle-style representation of uncertainty (see this post on why Squiggle is useful and how to use it).

Technical questions: To work out the details and viability of this approach, we need to work through certain key technical questions: First, is it viable to translate models (i.e. Squiggle or a similar PPL) models into an MDP modeling language like PRISM/JANI? A significant impedance-mismatch here is that JANI specifies finite-state models, meaning that continuous probability distributions need to be discretized – ultimately, in a clever way (to avoid combinatorial explosion), possibly along the lines of cylindrical algebraic decomposition. This is likely a fruitful direction for research, but success isn’t obvious. Davidad together with Peter Schmidt-Nielsen, Nisan Stiennon, and Andrew Critch recently spent a few days doing early experiments toward learning state-space discretizations with Adam, with promising preliminary results in low dimensions.

5. Verified RL in iterated-PPL environments

Can we bridge the gap between Squiggle-style models and RL control policy algorithms such as an implementation of the RASM verified RL algorithm by Lechner and Henzinger et al.? This would be most useful in situations such as periodic funding rounds or quarterly planning, where one expects the situation to evolve in some structurally similar way across a time horizon of many periods during which new decisions can be made. This general approach -- using RL to find an optimal policy for the entire horizon while knowing that one will replan at every single decision point -- is known as “receding-horizon planning,” epitomized by the maxim “plans are useless but planning is indispensable”. The main research challenge in bridging this approach with Squiggle-style modeling may be to compute the Lipschitz constant of the deterministic part of the Squiggle model, and/or to automatically smooth non-Lipschitz operations (such as threshold comparisons).

Testing the RASM algorithm—designed for cases like inverted pendulum problems and collision avoidance—on realistic R&D decision-making situations should be very scientifically interesting.

6. Training LLMs to write models in the PRISM language or PPLs

This approach is similar to the LLMs-forward auto formalization plan we described for agenda-setting, but applied to probabilistic distributions over outcomes over option space rather than to the construction of option spaces.

We believe that in this context, the technique known as ‘backward distilliation’ may prove particularly useful for training the LLM. To perform backward distillation, we start with a dataset of models in the formal language, then ask the LLM to summarize each model in a compact English description. We then transpose that dataset, and finetune the LLM to predict the formal model based only on the description. This strategy is potentially applicable to generating models in the PRISM language or Squiggle (a probabilistic programming language) from brief human input, but does require having a set of syntactically correct and semantically meaningful models to start the process.

7. Formalizing the world model with Continuous time coalgebras

Creating a world-model adequate for a mature OAA may require a new programming language efficient enough to represent certain especially challenging complex dynamics in the world.

The key technical question the design of the new programming language should address is the problem of cross-scale compositionality: how to define a sufficiently compositional world model with a good representation of multi scale physical phenomenon. One concrete subproblem we’d like to tackle to make progress on this question is the problem of continuous time coalgebras: It is known that discrete-time Markov processes (MDP/POMDP) are coalgebras for a probability monad. Such a coalgebra can be viewed as a functor from the one-object category ℕ to the Kleisli category of the probability monad. A “continuous time coalgebra” can be defined as a functor from the one-object category ℚ⁺ of non-negative rationals in place of ℕ (with the same codomain, the Kleisli category of the monad). Which concepts of coalgebra theory can be generalized to continuous time coalgebra? Especially, is there an analog to final coalgebras and their construction by Adamek's theorem?

8. Approaches To Public Feedback: Utility functions directly over the action space

Combining public value elicitation with actionable option-space constraints and expert opinion under OAA requires defining a function that takes expressions of stakeholder values as inputs and produces utilities over the space of possible outcomes. One min-viable approach is to elicit preferences directly on the action space (i.e. taking the world-model to be the identity function and passing actions through to evaluation). This method produces a continuous and multiple choice combinatorial action space, and then asks participants “what features of the decision do you care about?” The system will then propose a quantitative operationalization for each feature named by the stakeholder (e.g. for fairness, the Gini coefficient; for “I want X to be selected”, a simple threshold on the associated decision). The system will then elicit a linear weighting of all operationalized features. Appropriate UX design would be critical for making a system of this kind effect. One UX option would be to:

Generate a bunch of diverse plausible (or at least feasible) decision outcomes, including at least one which is optimal for each feature the user cares about,
Initialize all the feature weights to 1/n (where n is the number of features),
Render these on a line from WORST to BEST (normalizing the weighted evaluation to its minimum and maximum),
Offer buttons and sliders to increase and decrease the relative importance of each feature (keeping the total of all feature weights at 1),
Show immediate visual feedback demonstrating how this affects the ordinal ranking, as well as the relative positioning, of various tradeoffs between the features that the user values.

A very alpha version of an interface for ranking and comparing relative values for the construction of 'utility functions' can be found at QURI’s research.

9. Approaches To Choice Mechanisms: Nash Bargaining

Once the problem and preferences are elicited, we see automated Nash bargaining as a promising approach for extracting a group-decision from multiple individuals’ preference inputs. Davidad has already produced some code for the two-player case (using the OpEn implementation of the PANOC optimization algorithm, on the Nash welfare function defined via Random Dictator as the “disagreement point”) and we would like to explore generalizations, as it is a well-scoped technical problem.

While the Nash welfare function can be defined to have the actual decision space as its domain (thereby facilitating a single-stage solution), if the actual decision space is not a convex set then the true domain of the Nash welfare function should be finitely supported probability distributions over the decision space. The two-stage solution first uses a multi-objective optimizer or Pareto-front enumeration technique to list all Pareto-optimal feasible decisions (or a densely representative subset), and then in the second stage the Nash welfare function is optimized over variables, one for each Pareto-optimal feasible decision, which represent the probability of choosing it.

Note: one may, or may not, use the same optimization engine to find feasible solutions in the decision space vs. to solve the Nash bargaining problem. While the Nash welfare function can be defined to have the actual decision space as its domain (thereby facilitating a single-stage solution), if the actual decision space is not a convex set then the true domain of the Nash welfare function should be finitely supported probability distributions over the decision space. The two-stage solution first uses a multi-objective optimizer or Pareto-front enumeration technique to list all Pareto-optimal feasible decisions (or a densely representative subset), and then in the second stage the Nash welfare function is optimized over variables, one for each Pareto-optimal feasible decision, which represent the probability of choosing it.

LESSWRONG
is fundraising!
LW