Pivotal acts using an unaligned AGI?

"Only the result of the verification process will leave the box, and thus we have the same safety properties as before: The system will output a predetermined bit of information or fail to output anything at all. This setup I will call an IP-Oracle AI."

My understanding of the AI architecture you are proposing is as follows:

1. Humans come up with a mathematical theorem we want proved.
2. The theorem is submitted to the AI.
3. The AI tries to find a proof and submits it to the theorem checker.
4. The theorem checker outputs 1 if the proof is correct and 0 if it is not.
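
To make my reading of these four steps concrete, here is a minimal sketch. It is not the post's implementation: CNF satisfiability stands in for "the theorem", a satisfying assignment stands in for "the proof", and the names `check_proof`, `run_box`, and `untrusted_prover` are hypothetical. The point is only that nothing except the checker's bit crosses the box boundary.

```python
from typing import Callable, Dict, List, Optional

Clause = List[int]            # e.g. [1, -2] means (x1 OR NOT x2)
Assignment = Dict[int, bool]  # candidate "proof": a truth value per variable


def check_proof(cnf: List[Clause], proof: Optional[Assignment]) -> int:
    """Trusted checker (step 4): return 1 iff `proof` satisfies every clause."""
    if proof is None:
        return 0
    for clause in cnf:
        # A literal is satisfied when the variable's value matches its sign.
        # Variables missing from the proof are treated as False.
        if not any(proof.get(abs(lit), False) == (lit > 0) for lit in clause):
            return 0
    return 1


def run_box(cnf: List[Clause], untrusted_prover: Callable) -> int:
    """Steps 2-4: run the untrusted prover, release only the checker's bit."""
    try:
        proof = untrusted_prover(cnf)
    except Exception:
        return 0  # a crashing prover yields no information beyond "0"
    # Anything that is not even shaped like a proof is rejected outright.
    return check_proof(cnf, proof if isinstance(proof, dict) else None)


if __name__ == "__main__":
    formula = [[1, 2], [-1, 2]]  # (x1 OR x2) AND (NOT x1 OR x2)
    print(run_box(formula, lambda f: {1: True, 2: True}))
    print(run_box(formula, lambda f: {1: True, 2: False}))
```

Note that the caller of `run_box` never sees the assignment itself, which is exactly why I am unsure how much the bit alone buys us.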

It's not clear to me that such a system would be very useful for carrying out a pivotal act; it seems like it would only be useful for checking whether a problem is solvable.

For example, we could write a theorem encoding the specifications for some nanotechnology system, and the box would output 1 if the AI has found a correct proof and 0 otherwise. But that doesn't seem very useful to me, because it merely shows that the problem can be solved. Actually solving it could still be extremely hard.

Then, in the "Application to AI alignment" section, it says:

"In this case, we could ask the IP-Oracle for designs that satisfy these specifications."

The system described in this quote seems very different from the one-bit system described earlier: it can output designs, which makes it seem more like a classic oracle.

The idea of using a “limited task AI” to commit pivotal acts was already mentioned in https://arbital.com/p/pivotal/.

https://arbital.com/p/oracle/

https://arbital.com/p/AI_boxing/

A very similar idea was suggested by Paul Christiano in Cryptographic Boxes for Unfriendly AI, where he also mentions homomorphic encryption as a way of implementing a boxed AI.

On Arbital, this is called a ZF-Provability oracle.

I.e. the class of problems solvable using a deterministic interactive proof equals NP (https://en.wikipedia.org/wiki/IP_(complexity)#dIP).

Randomness is called private in this context if the random bits are not accessible to the prover. This seems to be a safe assumption for our use case, so I don’t consider public randomness (https://en.wikipedia.org/wiki/Arthur%E2%80%93Merlin_protocol) in this post.

by a deterministic Turing machine

Just enumerate potential proofs, ordered by length and lexicographically, check if they are a valid proof as you go, then output the first one you found.
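
A minimal sketch of this brute-force search, assuming proofs are strings over a finite alphabet and that a trusted validity check is available. The names `enumerate_strings`, `first_valid_proof`, `is_valid`, and the optional `max_length` cutoff are illustrative assumptions, not part of the post.

```python
from itertools import count, product
from typing import Callable, Iterator, Optional


def enumerate_strings(alphabet: str) -> Iterator[str]:
    """Yield every string over `alphabet`: shorter first, then lexicographic."""
    symbols = sorted(alphabet)
    for length in count(0):
        # product() over sorted symbols yields tuples in lexicographic order.
        for chars in product(symbols, repeat=length):
            yield "".join(chars)


def first_valid_proof(
    is_valid: Callable[[str], bool],
    alphabet: str = "01",
    max_length: Optional[int] = None,
) -> Optional[str]:
    """Return the first candidate accepted by `is_valid`.

    Without `max_length` this loops forever when no valid proof exists,
    which mirrors the unbounded search described in the text.
    """
    for candidate in enumerate_strings(alphabet):
        if max_length is not None and len(candidate) > max_length:
            return None
        if is_valid(candidate):
            return candidate
    return None  # unreachable: the enumeration is infinite


if __name__ == "__main__":
    # Toy validity check standing in for a real proof checker.
    print(first_valid_proof(lambda s: s.count("1") == 2))
```

The number of candidates grows exponentially with proof length, so this only shows that the search is well defined, not that it is practical.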

Cp. 'safe-but-useless' tradeoff in point 5 of AGI Ruin: A List of Lethalities

Some examples: https://en.wikipedia.org/wiki/List_of_PSPACE-complete_problems

Cp. point 24 of AGI Ruin: A List of Lethalities

E.g. by destroying all GPUs, as suggested in point 6 of AGI Ruin: A List of Lethalities as “a mild overestimate for the rough power level” needed to commit a pivotal act.

See e.g. Risks from Learned Optimization in Advanced ML Systems

The proof showing IP=PSPACE implies that the prover’s replies can be computed in PSPACE, but that’s not much of an upper limit. (Arora, S., Barak, B. (2009). Computational Complexity: A Modern Approach, Section 8.4.)
