# Introduction

A recent post (__AGI Ruin: A List of Lethalities__) gives an argument for humanity’s likely inability to survive the invention of superintelligent AGI. The argument goes very roughly like this:

- We haven’t solved the alignment problem (citation needed?).
- Given access only to an unaligned AI, we can’t commit a
__pivotal act__that prevents others from building unaligned AI. - Thus, sooner or later somebody will actually run an unaligned AI and destroy humanity.

My goal in this post is to suggest an approach to using unaligned AGI to commit pivotal acts and thus invalidate the argument outlined above.^{[1]} The general idea is that a boxed AI might be safely used to solve mathematically specified problems by using interactive proofs.

# Boxed oracle AIs and interactive proofs

An oracle is an AI system “designed to safely answer questions, and *only* to answer questions”^{[2]}. It is boxed if it has only “extremely restricted manipulable channels of causal interaction with the outside universe”,^{[3]} e.g. it’s only able to output a limited amount of text and not otherwise influence anything.

A possible approach to ensure the safety of the output of such a system is to only allow output that is a provably correct answer to a formally specified question. In this way, the system can only output information that is already predetermined by the question or fail to output anything at all.^{[4]}

One example of how such a system could be implemented is using a theorem-checker for formal proofs: We ask the AI to prove a formally specified mathematical theorem. The output of the AI is then only processed by the theorem-checker, which in turn outputs only one bit specifying whether it received a correct proof of the theorem and discards the proof itself.^{[5]}

More generally, this approach can be applied to all problems in the __NP complexity class__.

This follows directly from the definition: A decision problem (a yes-no-question) is in NP if, in case the answer is positive, this can be confirmed efficiently (i.e. in polynomial time) using a so-called __certificate__ (the analog of the proof in the theorem-provability case).

From now on, I will refer to this setup as a *NP-Oracle AI*.

Two important caveats:

- If we don’t get a positive answer from the checker, we can’t determine if this is because the answer to the question is actually negative or if the AI failed (or refused!) to provide an answer.
- We can not safely access the certificate of a positive answer provided by the AI, because this would give the AI the freedom of choosing which certificate (if there are multiple) to provide (see further discussion in the section “Extraction of membership certificates for NP problems”).

An NP-Oracle AI is possibly quite a powerful tool, but we might be able to do even better if we use __interactive proofs__.

## Interactive proofs

An interactive proof involves a prover and a verifier, repeatedly exchanging messages. The prover is assumed to be computationally powerful but untrustworthy while the verifier needs to use only limited computational power to determine the solution of some problem, using the messages from the prover. If the verifier itself is deterministic, the use of such an interactive proof system doesn’t increase the difficulty of the problems we can solve by using it.^{[6]}

Perhaps surprisingly, this situation changes dramatically if the verifier is able to use randomness. Specifically, the complexity class __IP__ of problems that can be solved by interactive proofs using a polynomial amount of messages and a polynomial-time verifier using private randomness^{[7]} is equal to the complexity class __PSPACE__ of problems that can be solved^{[8]} using polynomial amounts of memory space. In other words, if a problem can be solved by an algorithm that takes an arbitrary amount of time to run, but uses only a limited amount of memory, then this problem can also be solved using an interactive proof protocol.

Using the idea of interactive proofs, we can now extend our oracle AI setup to solve much more difficult problems. The AI will take the role of the prover and only interact with a (hopefully bug-free) verifier. Only the result of the verification process will leave the box, and thus we have the same safety properties as before: The system will output a predetermined bit of information or fail to output anything at all. This setup I will call an *IP-Oracle AI.*

One important caveat here is that we can’t expect the AI to be able to solve all problems in PSPACE, and it’s not clear (at least to me) how much additional work executing the proof is for the prover (see 2nd problem).

## Extraction of membership certificates for NP problems

Using an IP-Oracle, we gain the ability to extract the certificates/proofs for NP problems. To achieve this, one could for example specify “from the set of the shortest certificates, output the lexicographically first one”. This problem can clearly be solved using polynomial space^{[9]} and thus also by interactive proofs. It leaves no freedom of choice for the AI, as there’s only one acceptable answer.

To safely obtain certificates, it's actually enough if we can use interactive proofs to solve problems in the __coNP complexity class__ (the class of the complements of problems in NP).

This can be achieved using binary search strategies, where we repeatedly ask for the existence of certificates with certain properties. A positive answer to such a question can be confirmed by the certificate itself, whereas confirming a negative answer amounts to solving the coNP problem "there is no certificate with the specified properties".

We can e.g. find the length of the shortest certificate using questions like "Is there a certificate for problem X that is shorter than n bits?". Extracting the certificate itself proceeds using questions like "Is there a certificate of length n for problem X that starts with the string s".

# Application to AI alignment

## How powerful is an IP-Oracle AI?

If we accept for now the safety of the IP-Oracle AI setup, the critical question becomes the usefulness of it.^{[10]} The complexity class PSPACE contains extremely difficult problems, e.g. perfect play in many generalized games^{[11]}, but it’s not clear how they relate to the question of AI alignment. The main obstacle here, of course, is that we don’t have a mathematical definition of the questions that we want to solve.

I will now list some ideas of what might be achievable with an IP-Oracle AI, mostly applicable only to pivotal acts and not to solving alignment proper, i.e. creating a “CEV-style Sovereign”^{[12]}. My goal is mainly to illustrate that this powerful capability drastically changes the strategic picture and not to claim that it’s completely clear how to commit a pivotal act using an IP-oracle AI.

### Obvious things using NP membership certificates

(Status: Pretty clearly possible, but probably not pivotal)

This category includes finding proofs of mathematical theorems, designing software & hardware according to specification, finding software vulnerabilities, cracking cryptographic systems, and many more. These would be extremely impressive and useful for acquiring capital & power, but are by themselves not sufficient to commit a pivotal act. Conversely, if knowledge of this capability becomes public, it would probably lead to an increase of investment into AI capability research.

### Designing nanotechnology

(Status: Unclear if possible, clearly pivotal)

Given that chemistry and physics more generally are described by mathematical laws, it might be possible to give a mathematical specification of nanotechnological / molecular manufacturing devices and of processes to create them. In this case, we could ask the IP-Oracle for designs that satisfy these specifications. I’m not qualified to judge how difficult creating such a mathematical specification of nanotechnology is and how much of the design difficulty this offloads to the AI. If this is feasible, it would clearly enable us to commit a pivotal act.^{[13]}

### Improving AI capability

(Status: Probably possible, but not pivotal)

Depending on how we designed our AI, we might now use it to find more powerful designs for its components or the whole system. This is a sort-of like self-improvement, but the changes remain completely under our control and according to our specification. I don’t expect this to be nearly as powerful as a FOOMing Sovereign, and it has no direct relevance for pivotal acts.

We might also ask the oracle for improved interactive proof protocols, in case they turn out to be a bottleneck for some problems.

### Improving AI interpretability

(Status: Unclear if possible, maybe pivotal)

If at this point (components of) our AIs are still based on machine learning-like techniques, we might improve upon this by using our Oracle AI to optimize the relevant loss function over a different space of computable functions. We would not need to only use spaces with continuous parameters (like neural networks) anymore and could enforce further restrictions, e.g. on complexity or modularity, with the goal of finding interpretable designs. Ideally, using this approach, we might be able to design a safe __Task AGI__ that performs a pivotal act for us.

The algorithms obtained in this way are safe in the sense that they haven’t been chosen by the Oracle to be malicious (because it didn’t have any choice in the matter) but they might still be unsafe for all the usual reasons why an algorithm that’s chosen by optimizing some loss function might be unsafe.^{[14]} Thus the critical part of this idea is that the resultant designs should be interpretable and it rests on the assumption that we might be able to define a space of interpretable functions.

## Potential problems

### Difficulty of executing the prover-part of the protocol

It’s currently unclear to me how much computational work is required from the prover to execute the protocol in addition to solving the problem itself.^{[15]} If executing the protocol is very difficult even for advanced AGIs, this approach might fail to be useful at all or just handicap a team trying to be safe too much.

### Strategic non-cooperation

In the case of the NP-Oracle AI, the AI can possibly fail to output a proof. For the IP-Oracle AI, the interactive proof protocol can also fail to yield a provably correct answer. This feature of these setups seems to be unavoidable, as it’s of course possible that the AI might not be able to solve the problem we asked it to solve.

This is a remaining degree of freedom in the result of running the AI and is therefore a potential loophole in our security guarantee. Specifically, the AI might strategically choose (potentially using some form of __superrationality__ / acausal decision theory) to withhold answers it might provide, to lead its questioners to ask different questions and thereby guiding them to implement a specific, potentially unsafe, plan.

I consider this problem to be a potentially fatal obstacle to the use of this approach, but will nevertheless suggest two possible approaches to avoid it.

First, we have the fundamental advantage of being able to choose which AIs to run. If we suspect an AI is withholding answers to questions it’s able to solve, we can stop using it or try to modify it to become more cooperative. We can run different AIs on problems of different estimated difficulty and thereby try to sniff out potential non-cooperation. This doesn’t seem completely safe as long as we can’t reliably predict the difficulty of the questions.

Second, we can pre-commit to a strategy of which questions to ask, given previous responses to our questions. Alas, this might severely handicap us if we can’t come up with such a strategy that would allow us to commit a pivotal act (given a powerful enough AI), ourselves.

### Strategic timing of response

Another remaining degree of freedom might be, depending on the design of the AI and the sandbox, the time it takes the AI to answer, if it chooses to answer at all. This creates a loophole similar to the one above, but it can probably be avoided using a strategy like “consider it non-cooperation if it doesn’t answer in a predetermined time and wait at least this long to output anything”.

The design of the AI sandbox needs to offer a guarantee that these two loopholes are the only ones.

# Outlook

As I’ve hopefully shown, the possibility of using a boxed Oracle AI using interactive proofs could be a promising approach to safely commit a pivotal act and thereby prevent the AI apocalypse. I haven’t written anything on strategic or policy implications, but it’s of course problematic that this approach requires a lot of work and safety-conscious care from the first team(s) to develop powerful AGI. Nevertheless, I think it’s worth looking at the remaining problems and trying to flesh out possible approaches to pivotal acts. If we manage to gain more clarity on the theoretical aspects of this approach, it would be useful to start more concrete work on actually implementing an Oracle-AI setup (incl. a provably safe verifier for the interactive proof protocols) and mathematical specifications of e.g. nanosystems.

*Thanks to Lucius Bushnaq, Thomas Kehrenberg and Marcel Müller for comments on a draft of this post.*

^{^}The idea of using a “limited task AI” to commit pivotal acts was already mentioned in

__https://arbital.com/p/pivotal/__.^{^}^{^}^{^}A very similar idea was suggested by Paul Christiano in

__Cryptographic Boxes for Unfriendly AI__, where he also mentions homomorphic encryption as a way of implementing a boxed AI.^{^}On Arbital, this is called a

__ZF-Provability oracle__.^{^}I.e. the class of problems solvable using a deterministic interactive proof equals NP (

__https://en.wikipedia.org/wiki/IP_(complexity)#dIP__).^{^}Randomness is called private in this context if the random bits are not accessible to the prover. This seems to be a safe assumption for our use case, so I don’t consider public randomness (

__https://en.wikipedia.org/wiki/Arthur%E2%80%93Merlin_protocol__) in this post.^{^}by a deterministic Turing machine

^{^}Just enumerate potential proofs, ordered by length and lexicographically, check if they are a valid proof as you go, then output the first one you found.

^{^}Cp. 'safe-but-useless' tradeoff in point 5 of

__AGI Ruin: A List of Lethalities__^{^}^{^}Cp. point 24 of

__AGI Ruin: A List of Lethalities__^{^}E.g. by destroying all GPUs, as suggested in point 6 of

__AGI Ruin: A List of Lethalities__as a “a mild overestimate for the rough power level“ needed to commit a pivotal act.^{^}^{^}The proof showing IP=PSPACE implies that the prover’s replies can be computed in PSPACE, but that’s not much of an upper limit. (Arora, S., Barak, B. (2009). Computational Complexity: A Modern Approach., Section 8.4)

This paper is on a similar subject.

My understanding of the AI architecture you are proposing is as follows:

1. Humans come up with a mathematical theorem we want to solve.

2. The theorem is submitted to the AI.

3. The AI tries to prove the theorem and submit a correct proof to the theorem checker.

4. The theorem checker outputs 1 if the proof is correct and 0 if it is not.

It's not clear to me that such a system would be very useful for carrying out a pivotal act and it seems like it would only be useful for checking whether a problem is solvable.

For example, we could create a theorem containing specifications for some nanotechnology system and we might get 1 from the box if the AI has found a correct proof and 0 otherwise. But that doesn't seem very useful to me because it merely proves that the problem can be solved. The problem could still be extremely hard to solve.

Then in the "Applications to AI alignment section" it says:

The system described in this quote seems very different from the one-bit system described earlier because it can output designs and seems more like a classic oracle.

No, that's not quite right. What you are describing is the NP-Oracle.

On the other hand, with the IP-Oracle we can (in principle, limited by the power of the prover/AI) solve all problems in the PSPACE complexity class.

Of course, PSPACE is again a class of decision problems, but using binary search it's straightforward to extract complete answers like the designs mentioned later in the article.