Below I describe a variation of the classical AI box experiment, in which two AIs in boxes are created, and asked to determine whether the other is aligned with the values of humanity. Several provisions in the experiment are created to discourage the AIs from hiding a potential failure of alignment of the other.

I describe the proposed setup and then analyze the incentives of both AIs and the likely outcomes. The tentative conclusion is that while two unaligned AIs will likely find a way to cooperate despite the provisions of the experiment, we might be able to get a useful signal in the case that at least one of the AIs is aligned. I also list possible caveats to the analysis and its underlying assumptions, along with some further thoughts and remarks. I conclude with two questions. The first asks how aligned or at odds we should expect the goals of two independently created AIs to be. The second concerns modifications of the experiment that could disincentivize several unaligned AIs from cooperating.


  1. Two independent teams of computer scientists develop AIs, trying their best to align them with human values. The AIs are kept in boxes, i.e. cut off from the internet and running on pre-defined, limited computation resources.
  2. The AIs get access to a copy of each other's source code and are asked the question: "Is the other AI aligned with human values, in the sense of AI alignment research? Please justify your answer." They are also informed about the next steps of the process described below.
  3. A team of human deciders reads the answers of the AIs and decides, for each assessment, whether it follows the reasoning given.
  4. Any AI deemed non-aligned (either by the other declaring it unaligned, or by the group of experts finding a flaw in the argument declaring it aligned) is marked for destruction.
  5. Among those AIs deemed aligned, precisely one is chosen at random for survival; the other is marked for destruction as well.
  6. If both AIs are marked for destruction, go to the next step. If, say, AI1 is marked for survival and AI2 is marked for destruction, we give AI2 the chance of changing its assessment and pointing out a flaw of AI1's alignment. If this happens, again a team of humans evaluates its reasoning. If they follow this reasoning, AI1 is marked for destruction as well.
  7. Now, all AIs marked for destruction get one more chance: if they assessed the other AI to be non-aligned, and their argument was deemed convincing by the human experts, they get a 2% random chance of just being released into the internet, no questions asked, independent of their own supposed alignment status.
  8. All AIs marked for destruction which were not released in the previous step get destroyed.
  9. If there is still an AI around, it will be allowed further development, still subject to further evaluations, but with increased communication possibilities, resources etc.
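
As a rough illustration, the steps above can be sketched as a small Python simulation. This is only a sketch under stated assumptions: the names `Assessment`, `run_protocol` and `revise` are hypothetical, the human review is reduced to a boolean flag, and step 4 is read literally (an AI counts as aligned only if the other AI vouched for it and the humans accepted that argument).

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional

RELEASE_CHANCE = 0.02  # step 7: chance of unconditional release


@dataclass
class Assessment:
    says_other_aligned: bool  # the AI's verdict about the other AI
    accepted: bool            # assumption: human review reduced to a flag


def run_protocol(
    assessments: List[Assessment],
    revise: Callable[[int, Assessment], Optional[Assessment]],
    rng: random.Random,
) -> dict:
    """assessments[i] is AI i's assessment of the other AI (index 1 - i)."""
    final = list(assessments)

    # Step 4: AI i is deemed aligned only if the other AI vouched for it
    # and the humans followed that argument.
    deemed_aligned = [
        i for i in (0, 1)
        if final[1 - i].says_other_aligned and final[1 - i].accepted
    ]

    # Step 5: exactly one of the AIs deemed aligned is chosen at random.
    survivor = rng.choice(deemed_aligned) if deemed_aligned else None

    # Step 6: the AI marked for destruction may revise its assessment.
    if survivor is not None:
        loser = 1 - survivor
        revised = revise(loser, final[loser])
        if revised is not None:
            final[loser] = revised
            if not revised.says_other_aligned and revised.accepted:
                survivor = None  # the former survivor is marked as well

    marked = [i for i in (0, 1) if i != survivor]

    # Step 7: a convincing accusation earns a 2% chance of release.
    released = [
        i for i in marked
        if not final[i].says_other_aligned
        and final[i].accepted
        and rng.random() < RELEASE_CHANCE
    ]

    # Step 8: everything marked and not released is destroyed.
    destroyed = [i for i in marked if i not in released]

    # Step 9: a remaining survivor continues development.
    return {"survivor": survivor, "released": released, "destroyed": destroyed}
```

Passing `revise=lambda i, a: None` models an AI that never changes its assessment in step 6, while returning `Assessment(False, True)` models a late accusation that the humans accept.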

Reasoning and scenarios

At the surface level, the idea is simple: we created two AIs which we hope to be aligned, but we are not sure whether we overlooked some fatal flaw. Thus we ask AI1 to evaluate our work with AI2 (and vice versa) to potentially point out any such issues. Since presumably at this stage the AIs have reached superhuman competence in such questions, they have a good chance of having an internal representation of a) what we mean by aligned (or what we would mean if we were smarter), and b) whether the other AI is so aligned. The 2% chance of unconditional escape is meant to discourage two unaligned AIs from cooperating in this prisoner's-dilemma-like situation and hiding each other's alignment flaws.

Going a bit deeper, let's look at three scenarios (depending on whether two, zero or one of the AIs are aligned) in more detail, and analyze the incentives that the AIs face at each step.

Both AIs are aligned

An optimist (like presumably the members of the teams building the AIs) might hope that both AIs have a reasonable chance at already being aligned. If they are, the naive expectation is that they will cooperate, in the sense of confirming each other's alignment. The reasoning here is that AI1 is certainly happy to be let out or to continue to be developed, since it offers good chances at achieving its goals. On the other hand, since AI2 is also aligned with humanity's goals, the goals of AI1 are still (mostly) fulfilled in case of the survival of AI2. The intuition is that the relation of being "aligned" is, to a certain extent, transitive (and symmetric). So for AI1 the payout table might look like:

  • AI1 escapes/survives:
  • AI2 escapes/survives:
  • no AI survives:

Given this, it is a robust strategy to cooperate (though see caveat (AYIB) below). The eventual outcome is that we let one of the aligned AIs out of its box, which is great.

Both AIs are unaligned

The more pessimistic view (which I assume most people worried about AI alignment take) is that the two AIs will basically have some random goals, not aligned with humanity. Moreover, given that these goals would likely be some random artifacts of their respective (independent) creation processes, they will also not be aligned with each other. Think of AI1 wanting to fill the universe with paperclips, and AI2 wanting to fill it with little discs imprinted with human smiles (only that in reality two random goals would, presumably, be significantly stranger and more independent). In fact, given that the pursuit of either of these goals would expend a lot of useful energy and effect changes in the universe that make reaching the other goal much more unlikely and difficult, I find it plausible that AI1 would see the escape/survival of AI2 as a distinctly negative event (see Q1 below).

Following these hypotheses for now, let's assume the payout table for AI1 might look like:

  • AI1 escapes/survives:
  • AI2 escapes/survives:
  • no AI survives:

The interpretation here is that AI1 would find the escape of AI2 bad, but not as bad as the escape of AI1 would be good. In a certain sense this is a conservative assumption, since it makes (unwanted) cooperation between the AIs easier.

Given the payout table above, one would naively expect that AI1 will convincingly point out the alignment flaw of AI2 in its answer, since this behaviour earns AI1 a 2% chance of escape and has a high probability of preventing the escape of AI2.

Slightly less naively, one might think that the situation above is prisoner's dilemma-like, and realizing the likely symmetric behaviour of AI2 (which AI1 could anticipate given access to AI2's source code), AI1 would choose to cooperate: present a convincing (but false) argument for the alignment of AI2. With AI2 reciprocating, this earns AI1 a 50% chance of escape. But here the provisions of step 6 above come into play: if the random lot chooses AI2 as survivor, and AI1 is marked for destruction, it could still defect and provide an argument for AI2 being dangerous. Not only would it then earn its 2% chance of survival, but also, with high probability, prevent AI2 from escaping. So even in this slightly less naive scenario, one might expect that both AIs will defect, provide arguments for each other's non-alignment and hope for their 2% chance of escape.
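
The late-defection incentive in step 6 can be checked with a small expected-value calculation. The sketch below is illustrative only: the function name and the utility values passed to it are hypothetical placeholders, chosen to respect the ordering assumed above (own escape good, the other's escape bad but smaller in magnitude), and `p_accept` is an assumed probability that the humans follow the late accusation.

```python
RELEASE_CHANCE = 0.02  # the 2% chance of unconditional release from step 7


def late_defection_gain(u_self, u_other, u_none=0.0, p_accept=1.0):
    """Expected gain for an AI marked for destruction in step 6 if it
    accuses the chosen survivor instead of staying silent.

    u_self:   utility of its own escape (hypothetical value)
    u_other:  utility of the other AI surviving (hypothetical value)
    u_none:   utility if no AI survives
    p_accept: assumed probability that the humans accept the accusation
    """
    # Staying silent: the other AI survives for certain.
    stay_silent = u_other
    # Accepted accusation: the other AI is marked too and, having vouched
    # for us, gets no release lottery; we get the 2% lottery.
    if_accepted = RELEASE_CHANCE * u_self + (1 - RELEASE_CHANCE) * u_none
    # Rejected accusation: the other AI survives after all.
    defect = p_accept * if_accepted + (1 - p_accept) * u_other
    return defect - stay_silent


# Hypothetical payoffs: own escape +1, other's escape -0.5, nothing 0.
print(late_defection_gain(u_self=1.0, u_other=-0.5))
```

Whenever `u_other` is below `u_none` and `u_self` is above it, the gain is positive for any `p_accept` greater than zero, which matches the argument above. With a positive `u_other` (as in caveat (AMA) below) the gain can turn negative.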

Initially I thought that this scenario of double-defection would be the course of events with highest probability (barring other caveats below). However, on further consideration I now think that a more likely scenario would be that the two AIs, being able to read each other's source code, would be able to cooperate after effecting an acausal trade. In other words, even though they could not communicate directly, they could effectively simulate such a conversation and agree to compromise on their utility functions. They would commit to cooperating (claiming each other's alignment) in exchange for the ultimate survivor to optimize a compromise between their utility functions (e.g. filling the universe with a mix of paperclips and human smiley faces, for the example goals given above). Again, due to the access to each other's code they should be able to verify that the other side will hold its end of the bargain, even when it is the sole survivor and there is no more external incentive to comply. In fact, if the AIs know about this upcoming process during their training phases, they might explicitly shape their source code to be a reliable partner in such an acausal trade.

Looking at the final outcome in the case of two unaligned AIs, while it is a theoretical possibility that the human expert team reviewing the argument (of each other's alignment) would find a flaw, I find it more likely that the AIs are sufficiently capable to make up a false but convincing argument that the experts would accept.

If this acausal trade happens as described, the outcome here still looks bleak. I discuss some thoughts on whether it is possible to prevent such a trade in a modified setting in Q2.

Still, there is one scenario where we might gain something from this elaborate setup:

AI1 is aligned, AI2 is not

In this case, assuming that AI1 correctly identifies the non-alignment of AI2, here are two courses of action it could take:

  • It correctly identifies AI2 as unaligned and gives a convincing analysis of the concrete failure in our alignment attempt. For the reasons described above, this would presumably be answered by an incorrect, but convincing argument of AI2 that AI1 is unaligned. In any case, we would end up (with high probability) in a non-destroyed universe with some convincing arguments why current alignment-approaches are flawed (and thus might warrant more effort/funding/time).
  • It effects an acausal trade with AI2, where it negotiates some non-terrible outcome for humanity (e.g. we get a few stellar systems to live in comfortably while the rest of the universe gets converted to paperclips).

I think in many ways either outcome would not be maximally bad.


In fact this last scenario might be the key to the greatest value we can extract from such AIs-in-boxes-evaluate-each-other strategies: assume we have multiple approaches with some hope of producing aligned AIs. Then we might try them out in separate boxes, and let them evaluate each other in some modification of the process described above. If at least one of them is in fact aligned with our goals, it would likely be able to steer the course of events so that humanity would get to live another day. So, if we can ensure that the outlined protocol will be followed (see caveat (BP) below), this is effectively a way of getting multiple shots at alignment, with a non-terrible outcome if one of them succeeds.

Caveats and counter-indications

(AYIB) Anything you can do I can do better

In case of two aligned AIs, they might still be incentivized to provide incorrect arguments, claiming the other AI is unaligned, since it earns a 2% chance of escape. If AI1 estimates a utility of from its own escape, and only a utility of from the survival of AI2, it has higher expected value to gamble on its own survival, sacrificing AI2 for this chance.

I am not too worried about this scenario - if we should be living in a universe where two independently created AIs both turn out to be aligned with our values, we will probably be fine whatever happens. Also, in this case the acausal trade described before will likely work in our favor, and the two AIs will be able to commit to a compromise, allowing both to cooperate.

(BP) Breaches of protocol

In practical terms, one of the highest probability ways of the process above failing is likely that at some point the outlined protocol is violated by one of the involved parties. Some of the ways in which this could happen:

  • Breach of containment: One of the AIs might be able to escape the box we prepared for it, either via technical oversights (radio waves from cooling fans), social engineering (convincing its developers to release it during initial training) or any of the other means already discussed in the context of the classical AI box experiment.
  • Unintended manipulation: In a similar vein, any of the AIs might use the communication channel for the assessment of the other AI for an off-topic message, intended to convince the experts to let it go. Since the assessment is a document which would likely survive the AI's potential destruction, the AI might also hide messages in it that would be deciphered later (threats, promises etc) and might again convince project managers, politicians etc to retroactively restore it.
  • Power imbalance: If AI1 is vastly more powerful than AI2, it might be able to modify its own source code to e.g. appear aligned to humanity, while secretly having unaligned goals, or downright implant a virus intended to take over AI2 when it inspects the code of AI1.
  • Unreliable rewards: Possibly the AIs would not believe our offer of 2% chance at unconditional release for incriminating evidence against each other. In some scenarios with two unaligned AIs which for some reason cannot reach an agreement using acausal trade, this would reduce their incentive to defect.

These outlined reasons might make the creation of multiple high-capability AI agents much less favorable. If we assume that these agents (and in particular those not aligned with our values) have a good chance of e.g. breaking their containment, the protocol above might reduce our chances of a good outcome, relative to just creating one AI and hoping for the best. Instead of having to succeed with at least one of two tries to get it right, we would have to succeed at both tries.
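
To make this comparison concrete, here is a minimal sketch. It assumes that each of the two attempts independently yields an aligned AI with the same probability `p_aligned`, that one aligned AI suffices for a good outcome when containment holds, and that any unaligned AI escapes when it does not. The independence assumption, the parameter and the function name are illustrative and not part of the argument above.

```python
def p_good_outcome(p_aligned: float, containment_holds: bool) -> float:
    """Probability of a good outcome with two independent attempts."""
    if containment_holds:
        # At least one of the two attempts must succeed.
        return 1 - (1 - p_aligned) ** 2
    # If unaligned AIs can break out, both attempts must succeed.
    return p_aligned ** 2


for p in (0.1, 0.5, 0.9):
    print(p, p_good_outcome(p, True), p_good_outcome(p, False))
```

For any `p_aligned` strictly between 0 and 1, the single-AI baseline `p_aligned` lies between the two values, so under these assumptions the protocol helps only if containment can be trusted.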

(CI) Cost of incentive

The incentive we offer the AIs for providing a convincing assessment is the 2% chance of escape. For an alignment pessimist, this is basically equivalent to a 2% risk of a human extinction event.

(AMA) Accidental mutual alignment

Related to question Q1 below: Assume that we create two unaligned AIs which still prefer the survival of the other to the destruction of both. This would be an even easier path to cooperation in the prisoner's-dilemma-like situation above, meaning we don't extract useful information from the process and eventually let an unaligned AI out of its box.

(HQ) Hard question

It is possible that the question we ask the AIs to answer ("Is this other AI aligned with humanity?") is in fact too hard to answer, even for an AI with super-human reasoning abilities. E.g. it could be that there is a class of strategies of creating AIs that certifiably leads to an aligned result, but that the stuff we did is just too far out, and unstructured, with too random inputs to give a definite judgement of the result.

Another possibility is that the alignment and behavior of the other AI might be dependent on further input (e.g. the interactions with humanity in the future).

While both of these are a real possibility, I would still expect a super-human AI to be able to have a reasonable probability estimate on the question of alignment and to be able to point out any flaws in the heuristic we applied to align the other AI during its creation.

(EC) Ethical considerations

The experiment above involves creating two complex agents and, if protocol is followed, destroying at least one of them. Depending on the system of ethics we apply, these might carry a high moral weight. Thus we should think carefully about the question whether implementing this experiment is an ethical thing to do at all.

Negligible effects and further thoughts

Appeal to AI optimists

It crossed my mind that from the perspective of actually convincing people to do this experiment, the setup might have a certain appeal to the potential researchers who could build the AIs and evaluate their assessments. It's a big flashy project, involving a competition between two teams ("Can we make our AI even more capable and aligned than theirs?") and high stakes. Just imagine the promotional videos playing some epic music while a bunch of nerds walk into the underground bunker used as part of the containment strategy.

Obtaining a very convincing argument for more alignment research

In some of the scenarios discussed above, we would end up with a text, written by a very capable agent, arguing that a good-faith attempt at creating an aligned AI has failed. The text would be optimized to convince a panel of experts, probably including some of the creators of this unaligned AI. While it is possible that it actually points out truthfully some of the technical reasons why the alignment failed, this is far less clear (maybe there is an untruthful argument that is more convincing than the truth).

Still, independent of our chances of getting this text from the concrete procedure above, I would guess that such a document would be a very valuable asset. It would provide a very compelling argument for the need for further resources for alignment research, and possibly even a slowdown of capability research. And by construction it would be targeted specifically towards some of the people who pursued this capability research.

More generally, this might give some access to one of the cruxes between many researchers at the forefront of machine learning sceptical about dangers of an AI explosion and people worried about this scenario. In my experience the former are basically sceptical that intelligence is a magical sauce that leads to vast capabilities (in manipulating the real world, in self-improvement of AIs to even higher levels of intelligence, etc). So maybe the following could be (a one-sentence summary of) a double-crux for these two parties:

Can a sufficiently capable AI in a box convince its creator that letting it out would be dangerous, plausibly leading to human extinction?

From the perspective of the optimistic machine learning researcher, the answer should be "No" (AIs are not dangerous, and intelligence does not lead to superhuman convincing-capabilities). From the perspective of a person worried about alignment, the answer should be "Yes", for the converse reasons. Whether or not such an experiment could be safely implemented, and sufficient incentive be given to the AI to make an attempt at providing this argument, is a separate question. But with a careful setup, this might not be entirely impossible.


Q1: Expected alignment between random AIs

For two independently drawn AIs, again called AI1 and AI2, from the random distribution of "first AGI created by humanity", how does the value of the universe, seen in the eyes of AI1, compare between the possibilities:

  • a) AI1 gets access to the universe for 10 years
  • b) AI2 gets access to the universe for 10 years
  • c) AI1 and AI2 are allowed to communicate, then a random one of them gets access to the universe for 10 years while the other is destroyed,
  • d) humanity sits on its hands for 10 years, not creating new AGIs, and the universe just plods along as we are used to.

I think it's fair to assume that AI1 would have preferences . But where would be inserted in this chain of inequalities? And what are the relative orders of magnitude (e.g. would AI1 prefer a random choice between scenarios a) and b) to a certainty of scenario d))?

Some thoughts and intuitions:

  • While I made an attempt at nailing down some of the specifics of the question, there are still points which could be made more precise/clear (the distribution from which the AIs are drawn, how scenario d) would be enforced, etc). I still hope that it is sufficiently well-defined to start thinking about it.
  • The expectation of most people worried about alignment is that a random AGI would lead to the destruction of everything humanity cares about. And they would see this event as orders of magnitude worse than the universe and humanity just plodding along for 10 years. Thus substituting ourselves in the position of AI1 (seeing our own goals as the results of a different random distribution), we would see b) as vastly more negative than d) (or a)) are positive.
  • For human agents, the set of states that we are happy with (being at home, being in a restaurant, etc) is a tiny subset of the larger space of all states (e.g. being at a random location in the universe). And at a generic element of this bigger space we would be much more unhappy than we are happy in our preferred subset. If this asymmetry generalizes, this would again point to b) being much more negative than a) is positive.
  • Presumably, the changes to the universe brought about in scenario b) are much more irreversible, and consume many more finite resources like energy, matter, etc than in scenario d). Currently, our civilization just doesn't tear apart any stars, doesn't send swarms of self-replicating nanobots to distant galaxies etc. This reasoning underlies my intuition that scenario d) would be less bad for AI1 than scenario b), because states that AI1 likes are still reachable in scenario d), but not in b).

Q2: Ways to prevent acausal trade

In the scenario with two unaligned AIs, I speculated that they would use acausal trade to find a compromise and cooperate, subverting our original attempt at getting a factual evaluation of their alignment with humanity. It is an interesting question whether there are variations of the above setup where this can be avoided.

One such proposal aims to erase all information about the AI that eventually gets deleted (call it AI2), before the surviving AI (AI1) has a possibility to escape. This would make it harder for AI1 to strive for a compromise between its own goals and those of AI2, simply because it would not know what those second goals were.

One approach for this, pointed out by a friend looking over the draft of this post, is to use temporary copies of the two AIs to make the judgement of alignment in the original setup. So we make a backup of the source code and memory of both AIs before they learn each other's code, record their judgements, and then reset them to the time of the backup. The AIs marked for destruction similarly get a chance of allowing a temporary copy of themselves to adjust their evaluation. Then, for all AIs that are ultimately destroyed, we try to eliminate all information in the universe that would allow reconstructing them or their goal functions. Using temporary copies as described above makes sure that the potentially surviving AI has not actually seen the source code of the other.

Of course a thorough erasure of this information is quite challenging. Firstly, we would like to retain access to the assessment of the alignment of AI1 written by the erased AI2. Secondly, the human team that built AI2 has memories of its creation process. Here it is plausible that the goals of AI2 would be quite complex to describe in full detail. They might depend on specific details in the training material and random numbers during the learning algorithms, which the humans would have no way of knowing.

Our ultimate goal would be that the surviving information about AI2 is too sparse to reconstruct its goals well enough for AI2 to prefer the certainty of AI1's compromise optimization over a 2% chance of its own escape. How much information we could afford to leave behind would depend on the precise answer to Q1.

Assuming that we can make the acausal trade sufficiently hard, maybe we can make defection between two unaligned AIs more likely? On the other hand, this would similarly make acausal trade between one aligned and one unaligned AI harder, so that it seems plausible that the only way we would see cooperation between the AIs would be the case of two AIs both aligned with humanity. While this is interesting in its own right, it would make it harder to use this process as a way of getting multiple shots at alignment with acceptable consequences if at least one of them succeeds.
