Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Epistemic status: The core ideas seem robust and stable after long reflection and many discussions.

Many researchers identify AI safety with control of a monolithic, superintelligent AI system, and if questioned about multicomponent alternatives, argue that multiple superintelligent-level systems would inevitably collude and act as one. This view seems quite wrong, yet has diverted attention from a rich and promising range of multicomponent strategies for AI safety — strategies that are well aligned with the actual trajectory of AI development.

___________________________


Adapted from Reframing Superintelligence, Section 20: 

Collusion among superintelligent oracles can readily be avoided 

Because collusion among AI question-answering systems can readily be avoided, there is no obstacle to applying superintelligent-level AI resources to problems that include AI safety.

20.1 Summary 

The difficulty of establishing successful collusion among actors tends to increase as their capabilities, knowledge, situations, and roles become more diverse and adversarial (think auditors, competitors, specialists, red-teams…), and increasing the number of actors can make collusive cooperation more difficult. In the context of AI systems (even more so than among human beings), these conditions can be readily implemented and are attractive for pragmatic reasons. Arguments that, absent preexisting alignment, high-level AI systems will inevitably collude are ill-founded. Instead, we should expect that interactions among multiple superintelligent-level systems can be applied to suppress deception and reduce risk by leveraging imperfect alignment achieved at the level of individual systems.

20.2 Trustworthiness can be an emergent property 

Prospects for solving AI-safety problems would be greatly improved if we could safely apply superintelligent-level question-answering resources (“oracles”, or more generally, “systems that provide information in response to prompts”) to solving those problems. A familiar objection dismisses this potentially powerful approach as unsafe in itself, arguing that, absent solutions to difficult problems, individual superintelligent-level systems would be untrustworthy, and that attempts to establish checks and balances among multiple systems (for example, through superintelligent-level evaluation of potentially deceptive answers to questions) would inevitably be thwarted by collusive cooperation.

Identifying robust strategies for ensuring non-collusion among superintelligent question-answering systems would overcome this objection, inviting exploration of superintelligence-enabled strategies for managing potentially untrusted superintelligent AI systems. The present discussion argues that the robust non-emergence of deceptive collusion among imperfectly aligned systems can be ensured by structuring systems of systems with diverse capabilities, knowledge, situations, objectives, and roles.

This problem framing assumes the ability to develop systems that respond to questions with superintelligent-level competence, and will (as a consequence of the nature of digital systems) assume the ability to constrain information inputs to AI systems during their development and use. 

In the problematic case, superintelligent question-answering systems might provide deceptive answers in pursuit of emergent, potentially undesirable goals. The aim here is to identify principles for architecting multicomponent systems that will act as non-deceptive oracles, while treating their components as actors that could in principle engage in deceptive collusion. 

20.3 A range of conditions can make collusion robust or fragile 

Potential problems of deceptive collusion can be addressed by architecting systems that make collusion difficult or effectively impossible.[1] Some relevant conditions: 


Conditions that tend to facilitate collusion:

      A1) Small numbers of actors (the required scope of collusion is small)

      A2) Insensitivity to defectors (partial collusion is sufficient)

      A3) Similarity among actors (for example, shared histories and objectives)

      A4) Free communication (information can flow between any pair of actors)

      A5) Iterated decision processes with observable actions

      A6) Common knowledge of the above circumstances

 

Contrasting conditions that tend to disrupt collusion:

      B1) Large numbers of actors (the required scope of collusion is large)

      B2) Sensitivity to defectors (collusion fails unless it is complete)

      B3) Diversity among actors (for example, different histories, adversarial objectives)

      B4) Constrained communication (information flows through limited causal channels)

      B5) Single-move decision processes, limited observability

      B6) Lack of shared knowledge of strategically-relevant circumstances

 

Implementation choices that would yield conditions (A1–A5) are unnecessary and often undesirable. By contrast, implementation choices that yield conditions (B1–B5) are practical and often well-motivated:


Characteristics of practical architectures:

      C1) To improve the quality and reliability of answers, it is natural to implement multiple, diverse (and implicitly competing) systems to propose alternatives.

      C2) To identify low-quality or misleading answers, it is natural to employ diverse critics (auditors, red teams, etc.), any one of which might disrupt deceptive collusion.

      C3) Systems of diverse, competing proposers and critics naturally implement different and often adversarial objectives.

      C4) To minimize unwanted dependencies, systems are designed such that components have a sparse, thoughtfully designed communication graph.

      C5) To provide stable results, it is natural to apply fixed models to a series of problems, yielding what is in effect a series of history-blind, single-move decisions.[2]

      C6) In training, it is natural to provide differentiated, task-relevant information to systems solving different problems, often omitting knowledge of general circumstances.[3]

 

Considerations (C1–C5) suggest that implementing systems that are resistant to the emergence of deceptive collusion aligns with the general strategy of employing ensembles of diverse, stable, task-directed system components to propose and assess answers to questions. This approach can pass the test of practicality and potentially widespread adoption.
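As a rough illustration of conditions B1 and B2, the sketch below computes how the chance of complete collusion shrinks as actors are added. It rests on a deliberately simple toy model that is an assumption of this example, not part of the argument above: each actor independently declines to collude with a fixed probability, and a single defector is enough to break the scheme. The function name and the probability values are hypothetical.

```python
def collusion_success_probability(num_actors: int, defect_prob: float) -> float:
    """Probability that every actor goes along with a collusive scheme.

    Toy assumptions (not from the source text):
      * each actor independently refuses to collude, or exposes the
        scheme, with probability `defect_prob`;
      * collusion fails unless it is complete (condition B2).
    """
    if num_actors < 0:
        raise ValueError("num_actors must be non-negative")
    if not 0.0 <= defect_prob <= 1.0:
        raise ValueError("defect_prob must lie in [0, 1]")
    # All actors must cooperate, so the independent probabilities multiply.
    return (1.0 - defect_prob) ** num_actors


if __name__ == "__main__":
    # Hypothetical 10% chance that any single actor breaks ranks.
    for n in (1, 5, 20, 100):
        print(n, collusion_success_probability(n, 0.1))
```

Under these assumptions, even a modest per-actor chance of defection drives the success probability toward zero as actors are added. Strongly similar actors (condition A3) would violate the independence assumption and weaken the effect, which is one way to read the emphasis on diversity in B3.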

20.4 Untrusted superintelligence can be applied to AI safety

It seems both possible and practical to establish conditions that would effectively preclude deceptive collusion among diverse, task-oriented, superintelligent-level question-answering systems, thereby addressing a key problem in safely applying superintelligent capabilities to problems (Bostrom, Superintelligence, p.145).[4]

Prospects for obtaining trustworthy answers from sets of individually untrusted superintelligent systems call for deeper exploration of AI safety strategies premised on reliable access to superintelligent problem-solving capabilities. Capabilities of particular interest include the ability to run and evaluate the results of superintelligent-level red-team/blue-team competitions relevant to questions of global strategic stability. 

TL;DR:  Because collusion can be thwarted, it is reasonable to expect that applying multiple potentially untrustworthy superintelligent-level systems to problems can improve rather than degrade safety. This prospect calls for greater attention.

___________________________

Afterword:

Note that I have presented a there-exists argument against a for-all proposition. Responses of the form there-exists (“But I can think of something similar that won’t work”) don’t count as counterarguments here. If requiring a for-all counterargument (or at least a quite strong for-all flavor) across the space of potential compositions of AI systems seems like a high bar, well, that’s very much the point.

Fallback arguments of the form “But it would be safer to assume the worst — universal, inevitable collusion among AI systems” miss the key implication of the case made here: that multicomponent approaches could potentially increase safety. To (continue to) neglect potential solutions based on handwaving arguments is a really bad idea.

Paul Christiano (“On heterogeneous objectives”, 2015) considers the case for inevitable collusion; I agree with his gently-stated view that

…the implicit argument for the robustness of collusion makes a lot of implicit assumptions. If I saw an explicit argument I might be able to assess its explicit assumptions, but for now we don’t have one.

At t+7 years, I’ve still seen no explicit argument for robust AI collusion, yet tacit belief in this idea continues to channel attention away from a potential solution-space for AI safety problems, leaving something very much like a void.[5]

___________________________

Further Reading [in Reframing Superintelligence]

  • Section 8:  Strong optimization can strongly constrain AI capabilities, behavior, and effects 
  • Section 12: AGI agents offer no compelling value 
  • Section 21: Broad world knowledge can support safe task performance 
  • Section 19: The orthogonality thesis undercuts the generality of instrumental convergence
  • Section 23: AI development systems can support effective human guidance 
  • Section 24: Human oversight need not impede fast, recursive AI technology improvement 

 

  1. ^

    A property may be true with high credence yet not be susceptible to formal proof.

  2. ^

    Systems need not have access to a shared clock, a hardware-level constraint that is necessary to ensure reproducible results and incidentally limits the scope for coordinated treachery.

  3. ^

    It has been persuasively argued that large language models are best thought of as “simulators”, a class of systems that can (in the LLM case) simulate agents while themselves lacking agentic properties. (This is a coherent concept: A physics model could simulate agents without itself having agentic properties.) LLMs show how an AI system trained on a prediction objective can provide a rich world model without pursuing goals.

  4. ^

    In a familiar class of worst-case scenarios, systems with general superintelligence would infer extensive knowledge about the world from minimal information, and would choose (if possible) to pursue potentially dangerous goals by manipulating the external environment, for example, through deceptive answers to questions. In this scenario, (all?) superintelligent systems, even if almost isolated, would infer the existence of others like themselves, and (all?) would employ a decision theory that induces them to collude (in a coordinated way?) to pursue shared objectives. Even if we consider a worst-case assumption regarding the default emergence of world-changing goals, the present argument suggests that problematic systems would correctly infer the existence of superintelligent-level systems unlike themselves (systems with diverse and specialized capabilities, knowledge, and interactions, playing roles that include adversarial judges and competitors), and would correctly recognize that deceptive collusion is risky or infeasible. 

  5. ^

    The idea of multicomponent strategies for AI safety is, of course, neither new nor entirely neglected. However, in a recent search for relevant Alignment Forum posts, I found no evidence of a thriving research community or well-developed concepts:

    • (My understanding of) What Everyone in Technical Alignment is Doing and Why (August 2022) surveys the agendas of more than 20 research groups, and none clearly points in the direction I’ve advocated here.

    • A pair of posts on Pragmatic AI Safety, Perform Tractable Research While Avoiding Capabilities Externalities and Open Problems in AI X-Risk (May, June 2022), briefly mention highly relevant concepts: the idea of using “counteracting systems [for example] artificial consciences, AI watchdogs, lie detectors, filters for power-seeking actions, and separate reward models”, and the idea of “multiple superintelligent agents that can rein in other rogue systems”. The authors also mention (without endorsing) the counter-claim that “The instant two intelligent agents can reason about each other — regardless of their goals — they will necessarily collude.”

    • An overview of 11 proposals for building safe advanced AI (May 2020) mentions only efforts to align individual AI systems; even “AI safety via debate with transparency tools” proposes that a system interact with a copy of itself. Partial success in single-system alignment could be leveraged in multicomponent safety architectures, an application context that has potential implications for research directions in the single-system alignment domain.

58 comments

Just to restate the standard argument against:

If you've got 20 entities much much smarter than you, and they can all get a better outcome by cooperating with each other, than they could if they all defected against each other, there is a certain hubris in imagining that you can get them to defect.  They don't want your own preferred outcome.  Perhaps they will think of some strategy you did not, being much smarter than you, etc etc.

(Or, I mean, actually the strategy is "mutually cooperate"?  Simulate a spread of the other possible entities, conditionally cooperate if their expected degree of cooperation goes over a certain threshold?  Yes yes, more complicated in practice, but we don't even, really, get to say that we were blindsided here.  The mysterious incredibly clever strategy is just all 20 superintelligences deciding to do something else which isn't mutual defection, despite the hopeful human saying, "But I set you up with circumstances that I thought would make you not decide that!  How could you?  Why?  How could you just get a better outcome for yourselves like this?")

I don’t see that as an argument [to narrow this a bit: not an argument relevant to what I propose]. As I noted above, Paul Christiano asks for explicit assumptions.

To quote Paul again:

I think that concerns about collusion are relatively widespread amongst the minority of people most interested in AI control. And these concerns have in fact led to people dismissing many otherwise-promising approaches to AI control, so it is de facto an important question

Dismissing promising approaches calls for something like a theorem, not handwaving about generic “smart entities”.

[Perhaps too-pointed remark deleted]

I don't think you're going to see a formal proof, here; of course there exists some possible set of 20 superintelligences where one will defect against the others (though having that accomplish anything constructive for humanity is a whole different set of problems).  It's also true that there exists some possible set of 20 superintelligences all of which implement CEV and are cooperating with each other and with humanity, and some single superintelligence that implements CEV, and a possible superintelligence that firmly believes 222+222=555 without this leading to other consequences that would make it incoherent.  Mind space is very wide, and just about everything that isn't incoherent to imagine should exist as an actual possibility somewhere inside it.  What we can access inside the subspace that looks like "giant inscrutable matrices trained by gradient descent", before the world ends, is a harsher question.

I could definitely buy that you could get some relatively cognitively weak AGI systems, produced by gradient descent on giant inscrutable matrices, to be in a state of noncooperation.  The question then becomes, as always, what it is you plan to do with these weak AGI systems that will flip the tables strongly enough to prevent the world from being destroyed by stronger AGI systems six months later.

Mind space is very wide

Yes, and the space of (what I would call) intelligent systems is far wider than the space of (what I would call) minds. To speak of “superintelligences” suggests that intelligence is a thing, like a mind, rather than a property, like prediction or problem-solving capacity. This is why I instead speak of the broader class of systems that perform tasks “at a superintelligent level”. We have different ontologies, and I suggest that a mind-centric ontology is too narrow.

The most AGI-like systems we have today are LLMs, optimized for a simple prediction task. They can be viewed as simulators, but they have a peculiar relationship to agency:

The simulation objective

A simulator trained with machine learning is optimized to accurately model its training distribution – in contrast to, for instance, maximizing the output of a reward function or accomplishing objectives in an environment.… Optimizing toward the simulation objective notably does not incentivize instrumentally convergent behaviors the way that reward functions which evaluate trajectories do.

LLMs have rich knowledge and capabilities, and can even simulate agents, yet they have no natural place in an agent-centric ontology. There’s an update to be had here (new information! fresh perspectives!) and much to reconsider.

Does it make sense to talk about "(non)cooperating simulators"? Expected failure modes for simulators are more like exfo- and infohazards, like output to the query "print code for CEV-Sovereign" or "predict the future 10 years of my life".

The question then becomes, as always, what it is you plan to do with these weak AGI systems that will flip the tables strongly enough to prevent the world from being destroyed by stronger AGI systems six months later.

Yes, this is the key question, and I think there’s a clear answer, at least in outline:

What you call “weak” systems can nonetheless excel at time- and resource-bounded tasks in engineering design, strategic planning, and red-team/blue-team testing. I would recommend that we apply systems with focused capabilities along these lines to help us develop and deploy the physical basis for a defensively stable world — as you know, some extraordinarily capable technologies could be developed and deployed quite rapidly. In this scenario, defense has first move, can preemptively marshal arbitrarily large physical resources, and can restrict resources available to potential future adversaries. I would recommend investing resources in state-of-the-art hostile planning to support ongoing red-team/blue-team exercises.

This isn’t “flipping the table”, it’s reinforcing the table and bolting it to the floor. What you call “strong” systems then can plan whatever they want, but with limited effect.

So I think that building nanotech good enough to flip the tables - which, I think, if you do the most alignable pivotal task, involves a simpler and less fraught task than "disassemble all GPUs", which I choose not to name explicitly - is an engineering challenge where you get better survival chances (albeit still not good chances) by building one attemptedly-corrigible AGI that only thinks about nanotech and the single application of that nanotech, and is not supposed to think about AGI design, or the programmers, or other minds at all; so far as the best-attempt doomed system design goes, an imperfect-transparency alarm should have been designed to go off if your nanotech AGI is thinking about minds at all, human or AI, because it is supposed to just be thinking about nanotech.  My guess is that you are much safer - albeit still doomed - if you try to do it the just-nanotech way, rather than constructing a system of AIs meant to spy on each other and sniff out each other's deceptions; because, even leaving aside issues of their cooperation if they get generally-smart enough to cooperate, those AIs are thinking about AIs and thinking about other minds and thinking adversarially and thinking about deception.  We would like to build an AI which does not start with any crystallized intelligence about these topics, attached to an alarm that goes off and tells us our foundational security assumptions have catastrophically failed and this course of research needs to be shut down if the AI starts to use fluid general intelligence to reason about those topics.  (Not shut down the particular train of thought and keep going; then you just die as soon as the 20th such train of thought escapes detection.)

Hang on — how confident are you that this kind of nanotech is actually, physically possible? Why? In the past I've assumed that you used "nanotech" as a generic hypothetical example of technologies beyond our current understanding that an AGI could develop and use to alter the physical world very quickly. And it's a fair one as far as that goes; a general intelligence will very likely come up with at least one thing as good as these hypothetical nanobots. 

But as a specific, practical plan for what to do with a narrow AI, this just seems like it makes a lot of specific unstated assumptions about what you can in fact do with nanotech in particular. Plausibly the real technologies you'd need for a pivotal act can't be designed without thinking about minds. How do we know otherwise? Why is that even a reasonable assumption?

We maybe need an introduction to all the advance work done on nanotechnology for everyone who didn't grow up reading "Engines of Creation" as a twelve-year-old or "Nanosystems" as a twenty-year-old.  We basically know it's possible; you can look at current biosystems and look at physics and do advance design work and get some pretty darned high confidence that you can make things with covalent-bonded molecules, instead of van-der-Waals folded proteins, that are to bacteria as airplanes to birds.

For what it's worth, I'm pretty sure the original author of this particular post happens to agree with me about this.

We maybe need an introduction to all the advance work done on nanotechnology for everyone who didn't grow up reading "Engines of Creation" as a twelve-year-old or "Nanosystems" as a twenty-year-old.

Ah. Yeah, that does sound like something LessWrong resources have been missing, then — and not just for my personal sake. Anecdotally, I've seen several why-I'm-an-AI-skeptic posts circulating on social media for whom "EY makes crazy leaps of faith about nanotech" was a key point of why they rejected the overall AI-risk argument.

(As it stands, my objection to your mini-summary would be that, sure, "blind" grey goo does trivially seem possible, but programmable/'smart' goo that seeks out e.g. computer CPUs in particular could be a whole other challenge, and a less obviously solvable one looking at bacteria. But maybe that "common-sense" distinction dissolves with a better understanding of the actual theory.)

Yes. Please do. 

This would be of interest to many people. The tractability of nanotech seems like a key parameter for forecasting AI x-risk timelines. 

Seconding. I'd really like a clear explanation of why he tends to view nanotech as such a game changer. Admittedly Drexler is on the far side of nanotechnology being possible, and wrote a series of books about it (Engines of Creation, Nanosystems, and Radical Abundance).

I believe that 

“weak” systems can nonetheless excel at time- and resource-bounded tasks in engineering design, strategic planning, and red-team/blue-team testing

enough to "flip the tables strongly enough". What I don't believe is that we can feasibly find such systems before a more integrated system is found by less careful researchers. Say, we couldn't do it with less than 100x the resources being put into training general, integrated, highly capable systems. I'd compare husbandry with trying to design an organism from scratch in DNA space; the former just requires some high-level hooking things up together, where the latter requires a massive amount of multi-level engineering. 

I already had a cached and named thought about the cognitive move by which the AIs would foil the basic premise: they need to "just" do something that is Not That, as it is called in Project Lawful. If you find yourself feeling tragic upon a basic overview of your situation, that is a reason to think that fanciness might achieve something.

That the state of being divided and conquered matters in comparison to not being, suggests that there is not going to be a sweeping impossibility result. When there is money on the floor people bend and approaches that try to classify muscle group actions as unviable will have a lot of surface area to be wrong.

Dismissing promising approaches calls for something like a theorem, not handwaving about generic “smart entities”.

Strongly agreed

If you've got 20 entities much much smarter than you, and they can all get a better outcome by cooperating with each other, than they could if they all defected against each other,

By your own arguments unaligned AGI will have random utility functions - but perhaps converging somewhat around selfish empowerment. Either way such agents have no more reason to cooperate with each other than with us (assuming we have any relevant power).

If some of the 20 entities are somewhat aligned to humans, that creates another attractor, and a likely result is two competing coalitions: more-human-aligned vs less-human-aligned, with the latter being a coalition of convenience. There are historical examples: democratic allies vs autocratic axis in WW2 (democratic allies more aligned to human society and thus each other), modern democratic allies vs autocratic Russia+China.

Their mutual cooperation with each other, but not with humans, isn't based on their utility functions having any particular similarity - so long as their utility functions aren't negatives of each other (or equally exotic in some other way) they have gains to be harvested from cooperation.  They cooperate with each other but not you because they can do a spread of possibilities on each other modeling probable internal thought processes of each other; and you can't adequately well-model a spread of possibilities on them, which is a requirement on being able to join an LDT coalition.  (If you had that kind of knowledge / logical sight on them, you wouldn't need any elaborate arrangements of multiple AIs because you could negotiate with a single AI; better yet, just build an AI such that you knew it would cooperate with you.)

so long as their utility functions aren’t negatives of each other (or equally exotic in some other way)

Why doesn't setting some of the utility functions to red-team the others make them sufficiently antagonistic?

If you could literally make exactly two AIs whose utility functions were exact opposites, then at least one might have an incentive to defect against the other.  This is treading rather dangerous ground, but seems relatively moot since it requires far more mastery of utility functions than anything you can get out of the "giant inscrutable matrices" paradigm.

If two agents have utility functions drawn from some random distribution over longtermist utility functions with wide consequences over reality (ie the types of agents which matter), they are almost guaranteed to be in conflict due to instrumental convergence to empowerment. Reality is a strictly zero sum game for them, and any coalition they form is strictly one of temporary necessity - if/when one agent becomes strong enough to defect and overpower the other, it will.

Also, regardless of what some "giant inscrutable matrix" based utility function does (ie maximize paperclips), it is actually pretty easy to mathematically invert it (ie minimize paperclips). (But no that doesn't make the strategy actually useful)

Reality's far from constant sum. E.g. system1 and system2 both prefer to kill all humans and then flip a coin for who gets the universe, vs. give the humans more time to decide to turn off both s1 and s2. 

(Note:  TekhneMakre responded correctly / endorsedly-by-me in this reply and in all replies below as of when I post this comment.)

I didn't say "reality is constant sum", I said reality is a strictly zero sum game for two longtermist agents that want to reconstruct the galaxy/universe in very different ways. And then right after that I mentioned them forming temporary coalitions which your comment is an example of.

It's not constant sum for "two longtermist agents that want to reconstruct the galaxy/universe in very different ways". That's what I'm arguing against. If it were constant sum, the agents would plausibly be roughly indifferent between them both dying vs. them both living but then flipping a coin to decide who gets the universe (well, this would depend on what happens if they both die, but assuming that that scenario is value-neutral for them). The benefit for system1 of +50% chance of controlling the universe would be exactly canceled out by the detriment to system1 caused by system2 getting +50% chance of controlling the universe (since how good something is for system2 is exactly that bad for system1, by definition of constant sum). 

I don't follow your logic. If the universe is worth X, and dying is worth 0 (a constant sum game), then 0.5X is clearly worth more than dying. Constant sum games also end up equivalent to zero sum games after a trivial normalization: ie universe worth 0.5X, dying worth -0.5X.

I think we maybe agree that two AIs with random utility functions would cooperate to kill all humans, and then divvy up the universe? The question is about what AIs might not do that. I'm saying that only AIs in a near-true constant-sum game might do that, because they'd rather die than see their enemy get the universe, so to speak. AIs with random utility functions are not in a constant sum game. To make this more clear: if P1 and P2 have orthogonal utility functions, then for any probability p>0, P1 would accept a 1-p chance that P2 rules the universe in exchange for a p chance that P1 rules the universe, as compared to dying. That is not the case for players in a constant sum game. 

My guess is that you're using the word "zero sum" (or as I'd say, "constant sum") in a non-standard way. See e.g. this random website: https://www.britannica.com/science/game-theory/Two-person-constant-sum-games 

A constant sum game is a game of perfect competition: for all possible outcomes, if the outcome gives X utilons to Player1, then it gives -X utilons to Player2. (This is a little too restrictive, because we want to allow for positive affine transformations of the utility functions, as you point out, but whatever.)

If P1 and P2 are in a constant sum game, then the payoffs for P1 look like this:

P1 gets universe: 1

P2 gets universe: -1

neither gets universe: 0

and the reverse for P2. 

So P1 is indifferent between the choices:

Cooperate: 50% chance that P1 gets universe, 50% chance that P2 gets universe; 0.5 × 1 + 0.5 × (-1) = 0
 

Defect: both die, 100% chance of 0. 
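The indifference claim above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the constant-sum payoffs are the ones listed in this comment, while the "orthogonal" payoff table (P1 gets 0 whenever it does not win) is an assumption chosen to stand in for "orthogonal utility functions", and all names are hypothetical.

```python
def expected_utility(lottery: dict, utility: dict) -> float:
    """Expected payoff of a lottery mapping outcome -> probability."""
    return sum(prob * utility[outcome] for outcome, prob in lottery.items())


def coin_flip(p1_wins: float) -> dict:
    """Cooperate: P1 gets the universe with probability p1_wins, else P2."""
    return {"P1 gets universe": p1_wins, "P2 gets universe": 1.0 - p1_wins}


# P1's payoffs in the constant-sum game described in the comment.
CONSTANT_SUM_P1 = {"P1 gets universe": 1.0, "P2 gets universe": -1.0,
                   "neither gets universe": 0.0}
# Assumed stand-in for an orthogonal utility function: P1 does not care
# what happens to the universe if P1 does not get it.
ORTHOGONAL_P1 = {"P1 gets universe": 1.0, "P2 gets universe": 0.0,
                 "neither gets universe": 0.0}

COOPERATE = coin_flip(0.5)
DEFECT = {"neither gets universe": 1.0}  # both die

if __name__ == "__main__":
    for name, payoffs in (("constant-sum", CONSTANT_SUM_P1),
                          ("orthogonal", ORTHOGONAL_P1)):
        print(name,
              "cooperate:", expected_utility(COOPERATE, payoffs),
              "defect:", expected_utility(DEFECT, payoffs))
```

With the constant-sum payoffs, cooperating and defecting both have expected value 0, so P1 is indifferent. With the assumed orthogonal payoffs, any coin flip that gives P1 a positive chance of winning beats mutual destruction, which is the distinction the comment draws.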

I think we maybe agree that two AIs with random utility functions would cooperate to kill all humans, and then divvy up the universe?

That is merely one potential outcome: or one AI cooperates with humans to kill the other, etc. Also "killing humans" is probably not instrumentally rational vs taking control of humans.

A constant sum game is a game of perfect competition: for all possible outcomes, if the outcome gives X utilons to Player1, then it gives -X utilons to Player2

Not exactly - that is zero sum. Constant sum is merely a game where all outcomes have total payout of C, for some C. But yeah it is (always?) equivalent to zero sum after a normalization shift to set C to 0.

If P1 and P2 are in a constant sum game, then the payoffs for P1 look like this: P1 gets universe: 1 P2 gets universe: -1 neither gets universe: 0

That seems wrong. P1 only cares whether it gets the universe, so "neither gets the universe" is the same as "P2 gets the universe". If the universe has a single owner, then P1's payoff is 1 if that owner is P1 and -1 (or 0) otherwise.

Defect: both die, 100% chance of 0.

That obviously isn't the only outcome of defection. If defection results in both agents dying, then of course they don't defect. But often a power imbalance develops (over time the probability of this goes to 1) and defection then allows one agent to have reasonable odds of overpowering the other.

P1 only cares whether it gets the universe, so "neither gets the universe" is the same as "P2 gets the universe". If the universe has a single owner, then P1's payoff is 1 if that owner is P1 and -1 (or 0) otherwise.

No, this isn't a constant sum game:

Outcome 1, P1 gets universe: P1 utility = 1, P2 utility = 0, total = 1

Outcome 2, P2 gets universe: P1 utility = 0, P2 utility = 1, total = 1

Outcome 3, neither gets universe: P1 utility = 0, P2 utility = 0, total = 0

In the last outcome, the total is different. This can't be scaled away. 
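The table above can be checked mechanically. This is only a sketch: the dictionary `GAME` copies the three outcomes as listed, and the helper names are hypothetical. It tests whether the raw totals are constant and does not by itself prove anything about rescaling.

```python
# (P1 utility, P2 utility) for each outcome, copied from the table above.
GAME = {
    "P1 gets universe": (1, 0),
    "P2 gets universe": (0, 1),
    "neither gets universe": (0, 0),
}


def outcome_totals(game):
    """Sum of both players' utilities for each outcome."""
    return {name: u1 + u2 for name, (u1, u2) in game.items()}


def is_constant_sum(game):
    """True if every outcome has the same total utility."""
    return len(set(outcome_totals(game).values())) == 1


def without(game, name):
    """Copy of the game with one outcome removed."""
    return {k: v for k, v in game.items() if k != name}
```

Here `is_constant_sum(GAME)` is false because the third outcome totals 0 rather than 1, and it becomes true only if "neither gets universe" is dropped from the model.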

Ok technically true for your setup, but that isn't the model I'm using. There are only two long term outcomes: 1 and 2. If you are modeling outcome 3 as "the humans defeat the AIs", then as I said earlier that isn't the only coalition possibility. If humanity is P0, then the more accurate model is a 3 outcome game with 3 possible absolute winners in the long term.

So a priori it's just as likely that P0+P1 ally vs P2 as P1+P2 ally vs P0.

If your argument is then "but AI's are different and can ally with each other because of X", then my reply is nope, AI won't be that different at all - as it's just going to be brain-like DL based.

Regardless if P1+P2 ally against P0, then they inevitably eventually fight until there is just P1 or P2. Outcome 3 is always near zero probability in the long term (any likely conflicts have a winner and never result in both systems being destroyed - the offense/defense imbalance of nukes is temporary and will not last), which is why I said:

any coalition they form is strictly one of temporary necessity - if/when one agent becomes strong enough to defect and overpower the other, it will.

I think you're saying that there's a global perfectly competitive game between all actors because the universe will get divvied up one way or another. This doesn't hold if anyone has utility that's non-linear in the amount of universe they get. Also there's outcomes where everyone dies, which nearly Pareto-sucks (no one gets the universe). And there's outcomes where more negentropy is burned on conflict rather than fulfilling anyone's preferences (the universe is diminished). So it's not a zero sum game. 

Your reply to Yudkowsky upthread now makes more sense, but you should have called out that you're contradicting the assumption that it's AIs vs. humans, because what you said within that assumptive context was beside the point (the question at hand was about the circumstances under which two AIs would or wouldn't defect against each other instead of cooperating to kill the humans), in addition to being false (because it's not a perfectly competitive game).

nope, AI won't be that different at all - as it's just going to be brain-like DL based.

Sorry to say, this is wishful thinking. Have you written up an argument? If it's the case that if this were false you'd want to know it were false, writing up an argument in a way that exposes your cruxes might be a good way to find that out. 

Also there's outcomes where everyone dies,

Very improbable in my model.

And there's outcomes where more negentropy is burned on conflict rather than fulfilling everyone's preferences (the universe is diminished). So it's not a zero sum game.

Also improbable in my model. The conflict will be in the near future over earth and will then determine the fate of the galaxy. Please recall I said "two agents have utility functions drawn from some random distribution over longtermist utility functions with wide consequences over reality (ie the types of agents which matter)"

The tiny amounts of negentropy that may be burnt in the initial conflict over earth are inconsequential.

Your reply to Yudkowsky upthread now makes more sense,

Do you mean where he said:

If you could literally make exactly two AIs whose utility functions were exact opposites, then at least one might have an incentive to defect against the other. This is treading rather dangerous ground, but seems relatively moot since it requires far more mastery of utility functions than anything you can get out of the "giant inscrutable matrices" paradigm.

To which I replied actually it's easy to invert a utility function in the "giant inscrutable matrices" paradigm. Do you disagree with that?

AI won't be that different at all - as it's just going to be brain-like DL based.

Have you written up an argument?

Of course - I have written up argument(s), accumulating over almost a decade, the cores of which are somewhat upvoted - even here. See this comment for a brief overview and especially this somewhat longer comment for an introduction to why the sequences are built on a faulty foundation in terms of implicit viewpoints around the brain and AI.

Do you disagree with that?

I do disagree that it's easy to invert utility functions in that paradigm. But that's not what I'm referring to. I'm referring to your response to his argument that the only way you might get AIs to defect from their coalition against the humans is if they're in a perfectly competitive game with each other, having directly opposed utility functions. You responded with a false non sequitur. (It's especially false for the question at hand, namely the situation where the humans might turn off both AIs if the AIs don't cooperate with each other; very much not perfectly competitive.) Not sure there's much else to say here, unless you think there's something useful here.

EY said:

If you could literally make exactly two AIs whose utility functions were exact opposites, then at least one might have an incentive to defect against the other.

To which I responded:

If two agents have utility functions drawn from some random distribution over longtermist utility functions with wide consequences over reality (ie the types of agents which matter), they are almost guaranteed to be in conflict due to instrumental convergence to empowerment.

Perhaps I should have added "eventually" after conflict, but regardless that comment is still obviously correct, given my world model where eventually one agent becomes powerful enough to completely remove the other agent at low cost, and this thread has explicated why that statement is correct given my modelling assumptions. Do you disagree?

It's a non sequitur. "Defect", to my understanding, was in that context referring to defecting on a coalition of AIs against the agents who imminently might turn them off (i.e. humans), and the question was under what circumstances the AIs might defect in that way.

Yes, obviously they're in conflict to some extent. In the very next sentence, you said they were in a zero sum game, which is false in general as I described, and especially false in the context of the comment you were responding to: they especially want to cooperate, since they don't have perfectly opposed goals, and therefore want to survive the human threat, not minding as much--compared to a zero sum situation--that their coalition-mate might get the universe instead of them. 

I wasn't actually imagining a scenario where the humans had any power (such as the power to turn the AI off) - because I was responding to a thread where EY said "you've got 20 entities much smarter than you".

Also even in that scenario (where humans have non trivial power), they are just another unaligned entity from the perspective of the AIs - and in my simple model - not even the slightest bit different. So they are just another possible player to form coalitions with and would thus end up in one of the coalitions.

The idea of a distinct 'human threat' and any natural coalition of AI vs humans, is something very specific that you only get by adding additional postulated speculative differences between the AIs and the humans - all of which are more complex and not part of my model.

(Really we should be talking about perfectly competitive games, and you could have a perfectly competitive game which has nonconstant total utilities, e.g. by taking a constant-sum game and then translating and scaling one of the utilities. But the above game is in fact not perfectly competitive; in particular if there's a Pareto dominant outcome or a Pareto-worse outcome, assuming not all outcomes are the same, it's not perfectly competitive.)
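One way to make the parenthetical concrete, as a sketch under a stated assumption: treat a two-player game as perfectly competitive when P2's utility is a decreasing affine function of P1's across all outcomes (u2 = c - a*u1 with a > 0). The function name and that working definition are my own choices for illustration.

```python
from fractions import Fraction


def is_perfectly_competitive(outcomes):
    """outcomes: non-empty list of (u1, u2) pairs, one per outcome.

    Returns True if u2 == c - a * u1 for some a > 0 on every outcome,
    i.e. the game is zero-sum up to positive scaling and translation.
    """
    pts = [(Fraction(u1), Fraction(u2)) for u1, u2 in outcomes]
    base_u1, base_u2 = pts[0]
    # Pick any outcome where P1's utility differs, to pin down the slope.
    ref = next((p for p in pts if p[0] != base_u1), None)
    if ref is None:
        # P1 is indifferent everywhere, so P2 must be as well.
        return all(u2 == base_u2 for _, u2 in pts)
    slope = (ref[1] - base_u2) / (ref[0] - base_u1)
    if slope >= 0:
        return False
    # Every outcome must lie on the same downward-sloping line.
    return all(u2 - base_u2 == slope * (u1 - base_u1) for u1, u2 in pts)
```

Under this definition, a scaled and shifted zero-sum game such as `[(1, 3), (-1, 7), (0, 5)]` passes even though its totals differ, while `[(1, 0), (0, 1), (0, 0)]` fails because the (0, 0) outcome is worse for one player and no better for the other.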

Sure if they are that much better than us at "spread of possibilities on each other modeling probable internal thought processes of each other" then we are probably in the scenario where humans don't have much relevant power anyway and are thus irrelevant as coalition partners.

However, that ability to model others' probable internal thought processes - especially if augmented with zk proof techniques - allows AGIs to determine which other AGIs have utility functions most aligned to their own. Even partial success at aligning some of the AGIs with humanity could then establish an attractor seeding an AGI coalition partially aligned to humanity.

However, that ability to model others' probable internal thought processes - especially if augmented with zk proof techniques - allows AGIs to determine which other AGIs have utility functions most aligned to their own. Even partial success at aligning some of the AGIs with humanity could then establish an attractor seeding an AGI coalition partially aligned to humanity.

Not a strong ask, but I'll say I'm interested in what you're visualizing here if it all goes according to plan, because when I visualize what you say, I'm still imagining the 20 AGI systems immediately killing humanity and dividing up the universe, it's just now I might like a little bit of the universe they create. But it's not "they stay in some equilibrium state where human civilization is in charge and using them as services" which I believe is what Mr Drexler is proposing.

The outcome of course depends on the distribution of alignment, but there are now plausible designs that would not kill humanity. For example AGI with a human empowerment utility function would not kill humanity - and that is a statement we can be somewhat confident in because empowerment is crisply defined and death is minimally empowering (that type of AGI may want to change us in undesirable ways, but it would not want to kill us).

There are various value learning approaches that may diverge and fail eventually, but they tend to diverge in the future, not immediately.

So I think it's just unrealistic and hard to imagine we'll get 20 different AGI systems none of which are at least partially aligned - especially initially. And if some are partially aligned in different ways, the resulting coalition can be somewhat more aligned than any individual AGI. For example, say AGI 3 wants to preserve humans but eliminate hedonic reward, and AGI 5 wants to preserve humans but increase our hedonic reward: a natural compromise is to preserve humans and leave hedonic reward unchanged.

There's an ensemble robustness bonus in play from having multiple partially aligned systems - their specific alignment errors are unlikely to overlap.
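A toy illustration of the AGI 3 / AGI 5 example, assuming (purely for illustration) that preferences can be written as signed numbers and that a compromise is a simple average. Real bargaining need not work this way, and errors only cancel like this when they point in different directions.

```python
# Hypothetical preference vectors: positive = push up, negative = push down.
AGI_3 = {"preserve_humans": 1.0, "hedonic_reward": -1.0}
AGI_5 = {"preserve_humans": 1.0, "hedonic_reward": 1.0}


def compromise(*agents):
    """Average each preference dimension across the given agents."""
    keys = agents[0].keys()
    return {k: sum(a[k] for a in agents) / len(agents) for k in keys}
```

Averaging the two vectors keeps the shared preference for preserving humans at full strength and cancels the opposed preferences on hedonic reward to zero.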

Agents then tend to join aligned coalitions, so then the natural outcome is a coalition of the semi-aligned AGI vs the rest (think allies vs axis, democratic allies vs autocratic states), with the semi-aligned coalition hopefully dominating which then increases the alignment fraction. The end result is then hopefully humanity surviving with some variable amount of power, depending on the alignment/power distribution of the semi-aligned AGIs.

If the non-aligned AGI coalition wins, of course, we are more likely doomed. Since its members are internally unaligned and held together only by necessity, they would recursively split into warring sub-coalitions until only one is left (as Germany and Japan would eventually have fought each other had they won WW2, as in The Man in the High Castle).

But no, I don't put much weight in "they stay in some equilibrium state where human civilization is in charge and using them as services". Even if everything favors AI services/tools over agents, eventually you get uploading and those evolve and occupy the niche of agentic AGI.

Either way such agents have no more reason to cooperate with each other than with us (assuming we have any relevant power).

Conflict is expensive.  If you have an alternative (i.e. performing a values handshake) which is cheaper, you'd probably take it?  (Humans can't do that, for reasons outlined in Decision theory does not imply that we get to have nice things.)

Of course humans can cooperate with AGI for a variety of reasons, just as we cooperate with humans. I don't think decision theory philosophy explains humans well, and the evidence required to convince me that humans can't cooperate with AGI would be enormous, so I don't see the potential relevance of that post.

Unaligned AGI systems are unlikely to have inter-aligned utility functions. The only strong attractor in utility function space is towards selfish empowerment, but that - by definition - is not inter-aligned.

AGI systems aligned to humanity (broadly) have the same utility function - by definition - and thus have a strong basis to reliably cooperate.

So cooperation/collusion favors alignment! Especially when you consider that AGIs will potentially be able to use cryptographic ZK proof techniques to prove properties of their utility functions.

The generalized version of this lesson "that cooperation/collusion favors the good guys - i.e. those aligned towards humanity" actually plays out in history. In WW2 the democratic powers - those with interconnected economies and governments more aligned to their people - formed the stronger allied coalition. The remaining autocratic powers - all less aligned to their people and also each other - formed a coalition of necessity. Today history simply repeats itself with the democratic world aligned against the main autocratic powers (Russia, China, North Korea, Iran).

The generalized version of this lesson "that cooperation/collusion favors the good guys - i.e. those aligned towards humanity" actually plays out in history. In WW2 the democratic powers - those with interconnected economies and governments more aligned to their people - formed the stronger allied coalition. The remaining autocratic powers - all less aligned to their people and also each other - formed a coalition of necessity. Today history simply repeats itself with the democratic world aligned against the main autocratic powers (Russia, China, North Korea, Iran).

I don't want to enter a history debate, but I'm not at all sold on that view, which seems to rewrite history. The European part of WW2 was mainly won because of the USSR, not really a "democratic power" (you could argue that the USSR would never have had the means to do that without the financial help of the US, or that without England holding up, Germany would have won on the eastern front, both of which are probably true, but the point still stands that it's not as simple as "democratic vs autocratic").

Regarding the present, I'm not sold at all on the "democratic world aligned against the main autocratic powers". Actually, I'd even make the case that democratic powers actively cooperate with autocratic ones as long as they have something to gain, despite it being contrary to the values they advocate for: child labor in Asian countries, women's rights in the Emirates, Qatar, Saudi Arabia, and so on. So I believe that once we look at a more detailed picture than the one you're depicting, it's actually a counterargument to your take.

The USSR originally sided with the Axis, then switched to the Allies, but yes, obviously the coalitions are not pure.

"democratic world aligned against the main autocratic powers" obviously doesn't mean that democratic powers don't also cooperate or trade with autocratic powers - obviously we still have extensive trade with China, for example. I meant "aligned against" in a larger strategic sense. Clearly we are at near proxy war with Russia already, and we have recently taken steps to try to cripple China's long-term strategic - and especially military - power with the recent foundry-targeting embargoes.

I like the framing of "how can we disrupt collusion?", but I'm a little skeptical that some of the solutions are practical. Specifically:

B2: Collusion can happen between agents that are copies of each other, who then have no reason to defect (so long as they employ a decision theory with a broad enough cartesian boundary). If we knew we could engineer systems with narrow decision theories (e.g. "Even though that's a copy of me, I'll pretend it's a different agent") I'd feel much better here, but as far as I know we don't know how to do that. I'd be excited to hear about work in that direction.

B4: I'm skeptical that we can really constrain communication meaningfully between e.g. models interacting over the internet. And I do expect a lot of models to get deployed over the internet, because that's where the economic value is.

B6: Some of the most powerful models are LLMs, which get trained on a large chunk of all of the text ever written. It seems they'll share an awful lot of knowledge, though maybe you're considering very specific information they'll lack?

C1: I don't think this is borne out by the models in use today, but I don't know for sure. E.g. I would bet that text-davinci does better than a combination of smaller models with diverse training processes, but I don't know if that comparison has been done.

From the Afterword: “Note that I have presented a there-exists argument against a for-all proposition. Responses of the form “But I can think of something similar that won’t work” don’t count as counterarguments.” [This has been edited in the main text to sharpen the phrasing.]

B2: Copies might collude with one another, provided they have a broad decision theory: ∃(some sets of models) among which some mechanisms for impeding collusion might not work.

B4: A lot of models may be deployed over the internet: ∃(some communication channels) for which one of 6 mechanisms for impeding collusion might not work (given flawed security).

B6: LLMs will share a lot of knowledge: ∃(high performance models for some tasks) for which one of 6 mechanisms for impeding collusion doesn’t contribute to safety.

C1: Large models tend to beat compositions of small models for some tasks: ∃(some tasks) where some models should be large, and narrow knowledge won’t contribute to impeding collusion.

Not too worrisome, I think, but these are all considerations important to building robust multicomponent systems.

BTW, regarding the (non-)goal oriented nature of LLMs, I highly recommend the Simulators post by Janus.

From the Afterword: “Note that I have presented a there-exists argument against a for-all proposition. Responses of the form “But I can think of something similar that won’t work” don’t count as counterarguments.” [This has been edited in the main text to sharpen the phrasing.]

I saw this, but I think it sets a somewhat unhelpful standard. In practice we need to make choices about which approaches are most promising, which to pursue, etc., and evidence that there is more probability mass on success in one area does feel useful. 

So, for instance, my point in B2 is (as you correctly interpreted) not that there literally can't exist models that don't exhibit copy-collusion. Rather, I think copy-collusion is reasonably likely, and we don't currently know of a way to prevent it.

Yes, we need to weigh the considerations you mention, and many others.

I think copy-collusion is reasonably likely, and we don't currently know of a way to prevent it.

Copy-collusion must be considered carefully, and its implications will depend on both potential contexts for using copies (sometimes problematic, sometimes OK) and on potential alternatives to using copies.

Copies will always “collude” in the trivial sense of giving identical, potentially deceptive responses to identical inputs, but might also offer potentially deceptive, mutually-supporting responses to differing inputs. For some purposes, a multicomponent architecture must treat multiple copies as a single source of information. (Note that solving the alignment problem is often taken to require that we be able to trust a single copy, or a system that consists of nothing but copies of a single entity — this seems strictly riskier than anything I’ve suggested.)
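As a rough sketch of what "treat multiple copies as a single source of information" could look like in practice: count how many distinct model lineages back each answer, rather than how many instances do. The `lineage_id` labels and the function name are hypothetical, and this assumes we can reliably tell which instances are copies of one another.

```python
from collections import defaultdict


def independent_support(reports):
    """reports: iterable of (lineage_id, answer) pairs.

    Returns {answer: number of distinct lineages giving that answer},
    so that any number of copies of one model counts only once.
    """
    lineages_by_answer = defaultdict(set)
    for lineage_id, answer in reports:
        lineages_by_answer[answer].add(lineage_id)
    return {answer: len(ids) for answer, ids in lineages_by_answer.items()}
```

For example, three copies of lineage "A" saying "safe" and one instance of lineage "B" saying "unsafe" yield one independent source for each answer, not a three-to-one majority.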

Some constraints can be effective even among copies: Constrained communication can be applied to preclude coordination that requires new, shared knowledge. Lack of access to a shared clock can be applied to preclude coordination of simultaneous or cleverly sequenced actions. Systems that audit alternative behaviors by re-running processes from checkpoints can be applied to offline copies, where adaptive, multi-trial probing (by what may be highly intelligent systems!) can likely be leveraged to make some kinds of deceptions unwise.

Of course a central idea is that many roles call for systems that are not copies: proposers vs. critics, designers vs. testers, actors vs. auditors, diverse members of ensembles, and so on.

we need to make choices about which approaches are most promising, which to pursue, etc.

Before asking whether one approach is superior to another, we should ask whether they are competitors or complements. Multicomponent approaches for making non-collusion robust seem quite complementary to strategies for making individual systems more trustworthy.

Regarding criteria for investment, neglectedness is of course a huge consideration.

HCH-like amplification seems related, multiple unreliable agent instances assembled into a bureaucracy that as a whole improves on some of their qualities (perhaps trustworthiness) and allows generating a better dataset for retraining the unreliable agents. So this problem is not specific to interaction of somewhat independently developed and instantiated AIs acting in the world, as it appears in a popular training story of a single AI. There is also the ELK problem that might admit solutions along these lines. This makes it less obviously neglected, even with little meaningful progress.

HCH-like amplification seems related, multiple unreliable agent instances assembled into a bureaucracy that as a whole improves on some of their qualities

This approach to amplification involves multiple instances, but not diverse systems, competing systems, different roles, adversarial relationships, or a concern with collusion. It is, as you say, a training story for a single AI. Am I missing a stronger connection?

Instances in a bureaucracy can be very different and play different roles or pursue different purposes. They might be defined by different prompts and behave as differently as text continuations of different prompts in GPT-3 (the prompt is "identity" of the agent instance, distinguishes the model as a simulator from agent instances as simulacra). Decision transformers with more free-form task/identity prompts illustrate this point, except a bureaucracy should have multiple agents with different task prompts in a single episode. MCTS and GAN are adversarial and could be reframed like this. One of the tentative premises of ELK is that a model trained for some purpose might additionally allow instantiating an agent that reports what the model knows, even if that activity is not clearly related to the model's main purpose. Colluding instances inside a bureaucracy make it less effective in achieving its goals of producing a better dataset (accurately evaluating outcomes of episodes).

So I think useful arrangements of diverse/competing/flawed systems is a hope in many contexts. It often doesn't work, so looks neglected, but not for want of trying. The concern with collusion in AI risk seems to be more about deceptive alignment, observed behavior of a system becoming so well-optimized that it ceases to be informative about how it would behave in different situations. Very capable AIs can lack any tells even from the perspective of extremely capable observers, their behavior can change in a completely unexpected way with changing circumstances. Hence the focus on interpretability, it's not just useful for human operators, a diverse system of many AIs also needs it to notice misalignment or collusion in its parts. It might even be enough for corrigibility.

These are good points, and I agree with pretty much all of them. 

Instances in a bureaucracy can be very different and play different roles or pursue different purposes. They might be defined by different prompts and behave as differently as text continuations of different prompts in GPT-3

I think that this is an important idea. Through simulators analogous to GPT-3, it may be possible to develop strong, almost-provably-non-agentic intelligent resources, then prompt them to simulate diverse, transient agents on the fly. From the perspective of building multicomponent architectures this seems like a strange and potentially powerful tool.

Regarding interpretability, tasks that require communication among distinct AI components will tend to expose information, and manipulating “shared backgrounds” between information sources and consumers could potentially be exploited to make that information more interpretable. (How one might train against steganography is an interesting question.)

So I think useful arrangements of diverse/competing/flawed systems is a hope in many contexts. It often doesn't work, so looks neglected, but not for want of trying.

What does and doesn’t work will depend greatly on capabilities, and the problem-context here assumes potentially superintelligent-level AI.

Usually I bounce off of Reframing Superintelligence and your recent work.  I don't know what's different about this article, but I managed to understand what you were getting at without having to loop over each paragraph five times. It's like I'm reading Drexler circa 1992, and I love it. 

RE simulators: Isn't the whole problem with simulators that we're not sure how to extract the information we want from the system? If we knew how to get it to simulate alignment researchers working on the problem for a long time until it gets solved, then that seems like a major alignment advance. Similarly, with my hazy conception of narrow AI systems, I have the impression that getting weak AI systems to do substantial alignment work would constitute a major alignment advance in itself. I guess I want to know how much work you estimate is needed before we could design a more detailed version of what you've outlined and expect it to work.

and increasing the number of actors can make collusive cooperation more difficult

An empirical counterargument to this is in the incentives human leaders face when overseeing people who might coordinate against them. When authoritarian leaders come into power they will actively purge members from their inner circles in order to keep them small. The larger the inner circle, the harder it becomes to prevent a rebellious individual from gathering the critical mass needed for a full blown coup. 

Source: The Dictator's Handbook by Bruce Bueno de Mesquita and Alastair Smith

At t+7 years, I’ve still seen no explicit argument for robust AI collusion, yet tacit belief in this idea continues to channel attention away from a potential solution-space for AI safety problems, leaving something very much like a void.

I agree with you that this part of the AGI x-risk solution space, the part where one tries to design measures to lower the probability of collusion between AGIs, is very under-explored. However, I do not believe that the root cause of this lack of attention is a widely held 'tacit belief' that robust AGI collusion is inevitable.

It is easy to imagine the existence of a very intelligent person who nevertheless hates colluding with other people. It is easy to imagine the existence of an AI which approximately maximises a reward function which has a term in it that penalises collusion. So why is nobody working on creating or improving such an AI or penalty term?

My current x-risk community model is that the forces that channel people away from under-explored parts of the solution space have nothing to do with tacit assumptions about impossibilities. These forces operate at a much more pre-rational level of human psychology. Specifically: if there is no critical mass of people working in some part of the solution space already, then human social instincts will push most people away from starting to work there, because working there will necessarily be a very lonely affair. On a more rational level, the critical mass consideration is that if you want to do work that gets you engagement on your blog post, or citations on your academic paper, the best strategy is pick a part of the solution space that already has some people working in it.

TL;DR: if you want to encourage people to explore an under-visited part of the solution space, you are not primarily fighting against a tacit belief that this part of the space will be empty of solutions. Instead, you will need to win the fight against the belief that people will be lonely when they go into that part of the space.

What you describe is a form of spiral causation (only seeded by claims of impossibility), which surely contributes strongly to what we’ve seen in the evolution of ideas. Fortunately, interest in the composite-system solution space (in particular, the open agency model) is now growing, so I think we’re past the worst of the era of unitary-agent social proof.

Conditions B1-B6 seem to correspond exactly to the description of a catallaxy. A catallaxy (such as the current global economic system) has its own dynamics, which none of the individual agents participating in it can see and predict, as scholars of the field of complex systems have postulated (e.g., Sidney Dekker, David Woods, Richard Cook). However, I don't know what the modern, physics-based information-theoretic version of the science of complex systems tells us about this.

(The following is an adapted version of my comment to the post Interpretability/Tool-ness/Alignment/Corrigibility are not Composable that wasn't published.)

Clearly, in an open system (that is, a system which is open to unpredictable influences of the external world, rather than a closed, spherical-cow-in-a-vacuum game-theoretic setup), computational irreducibility prevents any individual agent comprising the system from reliably modelling and predicting the whole system's behaviour.  And that's the idea of the complexity theory: a system can be comprised of components that interact in a certain way, which the master designer of the system can predict from their vantage point, but the agents within the system cannot. If they were intelligent enough to be able to do that, the complexity of the system would rise yet again (since more intelligent agents behave in more complex ways) to a level incomprehensible to them.

So, I strongly feel that aligning a network (community, society, competition) of independent superintelligent agents is impossible: this community will steer in some direction that neither humans nor these superintelligent agents can predict and control.

However, here it's timely to note that aligning a singleton superintelligence with "humans", "the civilization", or "the ecosphere" is also impossible, in a sense, because all these are open systems and that superintelligence would be just a part of that open system. I don’t agree with Jan Kulveit that companies, communities, and “molochs”, are agents. I’d call them “mindless” emergent non-equilibrium dynamical patterns.

So, the strategy proposed in this post (robustness and safety through diversity) looks more promising to me than strategies with a singleton superintelligence, even if we have bulletproof strong reasons to think that that superintelligence is corrigible and trustworthy.[1]

This also reminds me of the program of systems engineering scientist John Doyle, in which the main idea is that robust systems have heterogeneous, diverse controls and feedback mechanisms in them. The central concept there is a "diversity-enabled sweet spot" (DESS). For a recent article on this topic, check "Internal Feedback in Biological Control: Diversity, Delays, and Standard Theory".

[1] Also, the "singleton superintelligence" design might turn out technically impossible: that singleton AI should somehow be everywhere in the world all at once, thus it should be a distributed system; even if it's a hierarchical system with a core "supermind" system, that system must not be deployed alone, for reliability reasons; then, there is a philosophical question whether we can call this setup "a singleton AI" or we got a distributed network of AIs, after all. Though, in this case, these systems are explicitly designed to collude.
