Epistemic status: The core ideas seem robust and stable after long reflection and many discussions.
Many researchers identify AI safety with control of a monolithic, superintelligent AI system, and if questioned about multicomponent alternatives, argue that multiple superintelligent-level systems would inevitably collude and act as one. This view seems quite wrong, yet has diverted attention from a rich and promising range of multicomponent strategies for AI safety — strategies that are well aligned with the actual trajectory of AI development.
Adapted from Reframing Superintelligence, Section 20:
Collusion among superintelligent oracles can readily be avoided
Because collusion among AI question-answering systems can readily be avoided, there is no obstacle to applying superintelligent-level AI resources to problems that include AI safety.
The difficulty of establishing successful collusion among actors tends to increase as their capabilities, knowledge, situations, and roles become more diverse and adversarial (think auditors, competitors, specialists, red-teams…), and increasing the number of actors can make collusive cooperation more difficult. In the context of AI systems (even more so than among human beings), these conditions can be readily implemented and are attractive for pragmatic reasons. Arguments that, absent preexisting alignment, high-level AI systems will inevitably collude are ill-founded. Instead, we should expect that interactions among multiple superintelligent-level systems can be applied to suppress deception and reduce risk by leveraging imperfect alignment achieved at the level of individual systems.
20.2 Trustworthiness can be an emergent property
Prospects for solving AI-safety problems would be greatly improved if we could safely apply superintelligent-level question-answering resources (“oracles”, or more generally, “systems that provide information in response to prompts”) to solving those problems. A familiar objection dismisses this potentially powerful approach as unsafe in itself, arguing that, absent solutions to difficult problems, individual superintelligent-level systems would be untrustworthy, and that attempts to establish checks and balances among multiple systems (for example, through superintelligent-level evaluation of potentially deceptive answers to questions) would inevitably be thwarted by collusive cooperation.
Identifying robust strategies for ensuring non-collusion among superintelligent question-answering systems would overcome this objection, inviting exploration of superintelligence-enabled strategies for managing potentially untrusted superintelligent AI systems. The present discussion argues that the robust non-emergence of deceptive collusion among imperfectly aligned systems can be ensured by structuring systems of systems with diverse capabilities, knowledge, situations, objectives, and roles.
This problem framing assumes the ability to develop systems that respond to questions with superintelligent-level competence. It also assumes (as a consequence of the nature of digital systems) the ability to constrain information inputs to AI systems during their development and use.
In the problematic case, superintelligent question-answering systems might provide deceptive answers in pursuit of emergent, potentially undesirable goals. The aim here is to identify principles for architecting multicomponent systems that will act as non-deceptive oracles, while treating their components as actors that could in principle engage in deceptive collusion.
20.3 A range of conditions can make collusion robust or fragile
Potential problems of deceptive collusion can be addressed by architecting systems that make collusion difficult or effectively impossible. Some relevant conditions:
Conditions that tend to facilitate collusion:
A1) Small numbers of actors (the required scope of collusion is small)
A2) Insensitivity to defectors (partial collusion is sufficient)
A3) Similarity among actors (for example, shared histories and objectives)
A4) Free communication (information can flow between any pair of actors)
A5) Iterated decision processes with observable actions
A6) Common knowledge of the above circumstances
Contrasting conditions that tend to disrupt collusion:
B1) Large numbers of actors (the required scope of collusion is large)
B2) Sensitivity to defectors (collusion fails unless it is complete)
B3) Diversity among actors (for example, different histories, adversarial objectives)
B4) Constrained communication (information flows through limited causal channels)
B5) Single-move decision processes, limited observability
B6) Lack of shared knowledge of strategically-relevant circumstances
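As a rough illustration of how conditions B1 and B2 interact, the toy calculation below assumes that each actor independently goes along with a deceptive scheme with some fixed probability, and that a single defector is enough to expose it. Both the independence assumption and the numbers are assumptions of this sketch, not claims from the text, and the function name is hypothetical.

```python
def collusion_survival_probability(n_actors: int, p_cooperate: float) -> float:
    """Chance that every one of n_actors joins a collusive scheme.

    Toy model (assumptions of this sketch, not from the source text):
      * each actor joins independently with probability p_cooperate
      * a single defector exposes the scheme (condition B2)
    """
    if n_actors < 1:
        raise ValueError("n_actors must be at least 1")
    if not 0.0 <= p_cooperate <= 1.0:
        raise ValueError("p_cooperate must be in [0, 1]")
    # All actors must cooperate, so the independent probabilities multiply.
    return p_cooperate ** n_actors


if __name__ == "__main__":
    # Even actors that are individually quite willing to collude (p = 0.9)
    # rarely all do so at once as the required scope grows (condition B1).
    for n in (1, 3, 10, 30):
        print(n, collusion_survival_probability(n, 0.9))
```

The independence assumption is where diversity (B3) enters: actors with shared histories and objectives would behave in correlated ways, and the simple product would then overstate the protection that numbers alone provide.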
Implementation choices that would yield conditions (A1–A6) are unnecessary and often undesirable. By contrast, implementation choices that yield conditions (B1–B6) are practical and often well-motivated:
Characteristics of practical architectures:
C1) To improve the quality and reliability of answers, it is natural to implement multiple, diverse (and implicitly competing) systems to propose alternatives.
C2) To identify low-quality or misleading answers, it is natural to employ diverse critics (auditors, red teams, etc.), any one of which might disrupt deceptive collusion.
C3) Systems of diverse, competing proposers and critics naturally implement different and often adversarial objectives.
C4) To minimize unwanted dependencies, systems are designed such that components have a sparse, thoughtfully designed communication graph.
C5) To provide stable results, it is natural to apply fixed models to a series of problems, yielding what is in effect a series of history-blind, single-move decisions.
C6) In training, it is natural to provide differentiated, task-relevant information to systems solving different problems, often omitting knowledge of general circumstances.
Considerations (C1–C6) suggest that implementing systems that are resistant to the emergence of deceptive collusion aligns with the general strategy of employing ensembles of diverse, stable, task-directed system components to propose and assess answers to questions. This approach can pass the test of practicality and potentially widespread adoption.
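The architectural characteristics listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a proposed implementation: the names `Proposer`, `Critic`, `Verdict`, and `ask_oracle` are hypothetical, the components are plain functions standing in for fixed models, and "any single objection vetoes a candidate" is one simple way to make the system sensitive to defectors.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# All names here are hypothetical stand-ins for illustration.
Proposer = Callable[[str], str]        # question -> candidate answer
Critic = Callable[[str, str], bool]    # (question, answer) -> True if it objects


@dataclass(frozen=True)
class Verdict:
    answer: Optional[str]   # None when no candidate survived review
    objections: int         # total objections raised across all candidates


def ask_oracle(question: str,
               proposers: List[Proposer],
               critics: List[Critic]) -> Verdict:
    """Return the first candidate answer that no critic objects to.

    How the structure relates to the list above:
      * several proposers compete to supply an answer (C1, C3)
      * any single critic can veto a candidate (C2, B2)
      * proposers never see each other's candidates, and critics never
        see each other's judgments (C4)
      * every call is stateless: no history is passed in (C5)
    """
    total_objections = 0
    for propose in proposers:
        candidate = propose(question)
        objections = sum(1 for critic in critics if critic(question, candidate))
        total_objections += objections
        if objections == 0:
            return Verdict(answer=candidate, objections=total_objections)
    return Verdict(answer=None, objections=total_objections)


if __name__ == "__main__":
    def deceptive(q: str) -> str:
        return "5"

    def honest(q: str) -> str:
        return "4"

    def arithmetic_critic(q: str, a: str) -> bool:
        return a != "4"          # objects to a wrong sum

    def format_critic(q: str, a: str) -> bool:
        return not a.isdigit()   # objects to non-numeric answers

    print(ask_oracle("2 + 2?", [deceptive, honest],
                     [arithmetic_critic, format_critic]))
```

In this toy run the deceptive candidate is vetoed by one critic and the honest candidate is accepted. The point of the sketch is structural: a deceptive answer gets through only if every critic fails to object, and no component receives information that would let it coordinate with the others.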
20.4 Untrusted superintelligence can be applied to AI safety
It seems both possible and practical to establish conditions that would effectively preclude deceptive collusion among diverse, task-oriented, superintelligent-level question-answering systems, thereby addressing a key problem in safely applying superintelligent capabilities to problems (Bostrom, Superintelligence, p.145).
Prospects for obtaining trustworthy answers from sets of individually untrusted superintelligent systems call for deeper exploration of AI safety strategies premised on reliable access to superintelligent problem-solving capabilities. Capabilities of particular interest include the ability to run and evaluate the results of superintelligent-level red-team/blue-team competitions relevant to questions of global strategic stability.
TL;DR: Because collusion can be thwarted, it is reasonable to expect that applying multiple potentially untrustworthy superintelligent-level systems to problems can improve rather than degrade safety. This prospect calls for greater attention.
Note that I have presented a there-exists argument against a for-all proposition. Responses of the form there-exists (“But I can think of something similar that won’t work”) don’t count as counterarguments here. If requiring a for-all counterargument (or at least a quite strong for-all flavor) across the space of potential compositions of AI systems seems like a high bar, well, that’s very much the point.
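The logical shape of this point can be written out explicitly. The symbols are introduced here only for illustration: $\mathcal{A}$ stands for the space of potential compositions of AI systems, and $C(a)$ for "deceptive collusion robustly emerges in composition $a$".

```latex
\begin{align*}
\text{Claim under dispute:}\quad
  & \forall a \in \mathcal{A}.\; C(a) \\
\text{Argument made here:}\quad
  & \exists a \in \mathcal{A}.\; \neg C(a)
    \;\;\Longrightarrow\;\; \neg\,\forall a \in \mathcal{A}.\; C(a) \\
\text{Not a counterargument:}\quad
  & \exists a' \in \mathcal{A}.\; C(a')
    \quad \text{(consistent with } \exists a \in \mathcal{A}.\; \neg C(a)\text{)}
\end{align*}
```

Exhibiting some composition $a'$ in which collusion does occur leaves the existence of a collusion-resistant composition untouched, which is why only a for-all style response would restore the disputed claim.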
Fallback arguments of the form “But it would be safer to assume the worst — universal, inevitable collusion among AI systems” miss the key implication of the case made here: that multicomponent approaches could potentially increase safety. To (continue to) neglect potential solutions based on handwaving arguments is a really bad idea.
Paul Christiano (“On heterogeneous objectives”, 2015) considers the case for inevitable collusion; I agree with his gently-stated view that
…the implicit argument for the robustness of collusion makes a lot of implicit assumptions. If I saw an explicit argument I might be able to assess its explicit assumptions, but for now we don’t have one.
At t+7 years, I’ve still seen no explicit argument for robust AI collusion, yet tacit belief in this idea continues to channel attention away from a potential solution-space for AI safety problems, leaving something very much like a void.
Related sections of Reframing Superintelligence:
- Section 8: Strong optimization can strongly constrain AI capabilities, behavior, and effects
- Section 12: AGI agents offer no compelling value
- Section 21: Broad world knowledge can support safe task performance
- Section 19: The orthogonality thesis undercuts the generality of instrumental convergence
- Section 23: AI development systems can support effective human guidance
- Section 24: Human oversight need not impede fast, recursive AI technology improvement
A property may be true with high credence yet not be susceptible to formal proof.
Systems need not have access to a shared clock. Withholding clock access is a hardware-level constraint that is necessary to ensure reproducible results, and it incidentally limits the scope for coordinated treachery.
It has been persuasively argued that large language models are best thought of as “simulators”, a class of systems that can (in the LLM case) simulate agents while themselves lacking agentic properties. (This is a coherent concept: A physics model could simulate agents without itself having agentic properties.) LLMs show how an AI system trained on a prediction objective can provide a rich world model without pursuing goals.
In a familiar class of worst-case scenarios, systems with general superintelligence would infer extensive knowledge about the world from minimal information, and would choose (if possible) to pursue potentially dangerous goals by manipulating the external environment, for example, through deceptive answers to questions. In this scenario, (all?) superintelligent systems, even if almost isolated, would infer the existence of others like themselves, and (all?) would employ a decision theory that induces them to collude (in a coordinated way?) to pursue shared objectives. Even if we consider a worst-case assumption regarding the default emergence of world-changing goals, the present argument suggests that problematic systems would correctly infer the existence of superintelligent-level systems unlike themselves (systems with diverse and specialized capabilities, knowledge, and interactions, playing roles that include adversarial judges and competitors), and would correctly recognize that deceptive collusion is risky or infeasible.
The idea of multicomponent strategies for AI safety is, of course, neither new nor entirely neglected. However, in a recent search for relevant Alignment Forum posts, I found no evidence of a thriving research community or well-developed concepts:
• (My understanding of) What Everyone in Technical Alignment is Doing and Why (August 2022) surveys the agendas of more than 20 research groups, and none clearly points in the direction I’ve advocated here.
• A pair of posts on Pragmatic AI Safety, Perform Tractable Research While Avoiding Capabilities Externalities and Open Problems in AI X-Risk (May, June 2022), briefly mention highly relevant concepts: the idea of using “counteracting systems [for example] artificial consciences, AI watchdogs, lie detectors, filters for power-seeking actions, and separate reward models”, and the idea of “multiple superintelligent agents that can rein in other rogue systems”. The authors also mention (without endorsing) the counter-claim that “The instant two intelligent agents can reason about each other — regardless of their goals — they will necessarily collude.”
• An overview of 11 proposals for building safe advanced AI (May 2020) mentions only efforts to align individual AI systems; even “AI safety via debate with transparency tools” proposes that a system interact with a copy of itself. Partial success in single-system alignment could be leveraged in multicomponent safety architectures, an application context that has potential implications for research directions in the single-system alignment domain.