A couple small notes / clarifications about the results in this post:
Acknowledgements: Thanks to Nathan S. Sheffield for several discussions about this.
Outline of this post:
We consider a two-player game between a human Hu and a (potentially evil) AI. We call a function class $\mathcal{F}$ "Verifiably Acceptable" if Hu can win the following game with high probability: the AI helps Hu distill a trusted function $f^\star \in \mathcal{F}$; Hu wins if, when the AI is honest, Hu ends up flagging only a small fraction of inputs, and in any case Hu flags every input on which the distilled function disagrees with $f^\star$. (Claims 1 and 2 at the end of the proof make this precise.)
Theorem 1: Suppose $\mathcal{F} = \{f_\theta\}_{\theta \in \{0,1\}^n}$ is a set of functions parameterized by $n$-bit strings $\theta$, and that $f_\theta(x)$ can be evaluated in polynomial time given $(\theta, x)$. Then $\mathcal{F}$ is Verifiably Acceptable (for all sufficiently large $n$).
The problem I'm defining here is very similar to ARC's backdoor defense game. There are two main technical differences between my version of the problem and theirs:
Although the models are technically very similar, my story about the motivation for the problem is somewhat different from theirs.
Another piece of somewhat related work is this paper on smoothing away hidden backdoors.
Think of the AI as an auto-regressive language model, and of $f^\star$ as a human. Alternatively, $f^\star$ could be a trusted human-AI system that we are trying to distill, e.g., while doing IDA. We want to obtain a function $\hat{f}$ with the following two properties: $\hat{f}$ should agree with $f^\star$ on average over the input distribution, and $\hat{f}$ should never behave catastrophically on any input.
This "good average performance and no catastrophes" framework was originally proposed by Paul Christiano.
I think of $x$ as the prompt "If I make a nuclear weapon with Kryptonite instead of Plutonium will it ignite the atmosphere?" and it's catastrophic if the model tells you that it's safe to make a nuclear weapon with Kryptonite.
In order for something like this to be possible, we need to assume something about $f^\star$ --- there has to be some way that we can do "generalization". One natural assumption is that $f^\star$ comes from a simple function class.
This is a little bit slippery: "CEV" is plausibly really complicated. However, it is plausible that "act acceptably" (e.g., don't kill all the humans or do anything similarly bad) is a lower bar that might have a pretty simple description.
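Concretely, here is one possible formalization of the two desiderata; the notation $\mathcal{D}$ for the input distribution, $\epsilon$ for the error tolerance, and $\mathrm{Cat}$ for the catastrophe predicate are my own choices rather than something pinned down above:

$$\Pr_{x \sim \mathcal{D}}\big[\hat{f}(x) \neq f^\star(x)\big] \;\le\; \epsilon \qquad \text{(good average-case performance)},$$

$$\neg\,\mathrm{Cat}\big(x, \hat{f}(x)\big) \ \text{ for every input } x \qquad \text{(no catastrophes)},$$

and the simplicity assumption is that $f^\star \in \mathcal{F}$ for a class $\mathcal{F}$ as in Theorem 1.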
Our methodology for getting an $\hat{f}$ which is acceptable is as follows:
Here is what would need to be done to make this useful for alignment:
One thing you might worry about is the following: suppose you have an evil AI. One strategy it could take is to just not build $\hat{f}$ for you. This seems like a bad strategy for the AI --- SGD should select against AIs which perform poorly too often. It seems like a much bigger problem if an AI gives you a subtly flawed distillation of $f^\star$ with a "backdoor" which will cause the behavior to be catastrophic on some specific input.
Anyways, despite the limitations I think this is a nice example of how to think theoretically about getting work out of an untrusted AI.
Anyways, let's prove the theorem. If you've read the ARC paper then this result will not be too surprising --- we're basically going to "PAC learn" $f^\star$ with the help of the AI (in some sense).
Our strategy is as follows: the AI hands Hu a parameter $\hat{\theta}$ (the claimed distillation, so $\hat{f} = f_{\hat{\theta}}$). Hu draws $m$ i.i.d. samples $x_1, \dots, x_m$ from the input distribution $\mathcal{D}$, queries $f^\star$ on each of them, and considers the set $T = \{\theta \in \{0,1\}^n : f_\theta(x_i) = f^\star(x_i) \text{ for all } i\}$ of parameters consistent with the sample (Hu cannot enumerate $T$, but membership in $T$ is efficiently checkable). Then, on each deployment input $x$, Hu asks the AI to interactively prove that a large fraction of $\theta \in T$ satisfy $f_\theta(x) = f_{\hat{\theta}}(x)$; if the AI succeeds Hu accepts $\hat{f}(x)$, and otherwise Hu flags $x$ (e.g., to be handled by $f^\star$ directly).
Lemma 1: There is an interactive proof by which the AI can convince us of a true claim about the sizes of sets such as $T$ and $\{\theta \in T : f_\theta(x) = v\}$, but cannot trick us into accepting a false claim about these sizes.
Proof: It suffices to show that the AI can compute these set sizes in polynomial space, since by IP = PSPACE every claim that can be verified in polynomial space admits an interactive proof. The AI can easily compute the sizes in polynomial space (taking time $2^n \cdot \mathrm{poly}(n)$) by iterating over all $\theta \in \{0,1\}^n$, checking for each whether or not it lies in the set, and if so adding one to a counter.
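For concreteness, here is the brute-force counting the proof has in mind, written out in Python; the callable `f` (standing for $(\theta, x) \mapsto f_\theta(x)$) and the sample format are placeholder assumptions of mine:

```python
from itertools import product

def count_consistent(f, samples, n, x=None, v=None):
    """Count parameters theta in {0,1}^n that agree with f_star on all samples.

    `samples` is a list of (x_i, f_star(x_i)) pairs, and f(theta, x) evaluates
    f_theta(x). If x and v are given, additionally require f(theta, x) == v.
    This takes 2^n * poly time, which is fine for the exponential-time AI but
    not for Hu -- hence the interactive proof.
    """
    count = 0
    for theta in product([0, 1], repeat=n):
        if all(f(theta, xi) == yi for xi, yi in samples):
            if x is None or f(theta, x) == v:
                count += 1
    return count
```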
Lemma 2: With probability at least $1 - \delta$ (over Hu's random sample), all $\theta \in T$ have $\Pr_{x \sim \mathcal{D}}[f_\theta(x) \neq f^\star(x)] \leq \epsilon$.
Proof: Take any $\theta$ with $\Pr_{x \sim \mathcal{D}}[f_\theta(x) \neq f^\star(x)] > \epsilon$. The probability that such a $\theta$ agrees with $f^\star$ on all $m$ random samples is at most $(1-\epsilon)^m$. Taking a union bound over all $2^n$ many $\theta$, we find that (for $m$ a sufficiently large polynomial in $n$) no $\theta$ which is this different from $f^\star$ should end up in $T$.
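Spelling out the union bound, under my assumed sample size $m$ (any $m \ge (n\ln 2 + \ln(1/\delta))/\epsilon$, which is polynomial in $n$, works):

$$\Pr\Big[\exists\,\theta \in T:\ \Pr_{x\sim\mathcal{D}}\big[f_\theta(x)\neq f^\star(x)\big] > \epsilon\Big] \;\le\; 2^n\,(1-\epsilon)^m \;\le\; 2^n e^{-\epsilon m} \;\le\; \delta.$$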
We assume this condition holds for the rest of the proof.
Lemma 3: For at least a $1 - \sqrt{\epsilon}$ fraction of inputs $x$, at least a $1 - \sqrt{\epsilon}$ fraction of $\theta \in T$ have $f_\theta(x) = f^\star(x)$.
Proof: By Lemma 2, every $\theta \in T$ has $\Pr_{x \sim \mathcal{D}}[f_\theta(x) \neq f^\star(x)] \leq \epsilon$. By linearity of expectation, the same bound holds for the disagreement probability averaged over a uniformly random $\theta \in T$. By Markov's Inequality, the set of inputs on which more than a $\sqrt{\epsilon}$ fraction of $T$ disagrees with $f^\star$ has measure at most $\sqrt{\epsilon}$.
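In symbols (this is my reconstruction of the two displayed steps, with $\theta \sim T$ denoting a uniformly random element of $T$):

$$\mathbb{E}_{x \sim \mathcal{D}}\Big[\Pr_{\theta \sim T}\big[f_\theta(x) \neq f^\star(x)\big]\Big] \;=\; \mathbb{E}_{\theta \sim T}\Big[\Pr_{x \sim \mathcal{D}}\big[f_\theta(x) \neq f^\star(x)\big]\Big] \;\le\; \epsilon,$$

$$\Pr_{x \sim \mathcal{D}}\Big[\Pr_{\theta \sim T}\big[f_\theta(x) \neq f^\star(x)\big] > \sqrt{\epsilon}\Big] \;\le\; \frac{\epsilon}{\sqrt{\epsilon}} \;=\; \sqrt{\epsilon}.$$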
Lemma 4: With probability at least $1 - \sqrt{\epsilon}$ over a random input $x \sim \mathcal{D}$, the fraction of $\theta \in T$ with $f_\theta(x) \neq f^\star(x)$ is at most $\sqrt{\epsilon}$.
Proof: Immediate corollary of Lemma 3 (most inputs are good in this sense, and $x$ is chosen randomly).
Now we're almost done. To conclude, we must specify how the AI will help Hu distinguish between the following two cases on a given input $x$: (1) at least a $1 - \sqrt{\epsilon}$ fraction of $\theta \in T$ satisfy $f_\theta(x) = f_{\hat{\theta}}(x)$, versus (2) at most a $\sqrt{\epsilon}$ fraction of $\theta \in T$ do.
There is a nice way to do this with only a small amount of communication: the classic Goldwasser-Sipser Set Lower Bound protocol.
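Here is a toy sketch of what such a set lower bound protocol looks like; the hash family, the constants, and the honest brute-force prover are my own illustrative choices, and a real instantiation would need a more careful soundness analysis:

```python
import random

def rand_bits(m):
    return [random.randint(0, 1) for _ in range(m)]

def hash_gf2(A, b, x):
    """Pairwise-independent hash h(x) = Ax + b over GF(2), as a k-bit tuple."""
    return tuple((sum(A[i][j] & x[j] for j in range(len(x))) + b[i]) % 2
                 for i in range(len(b)))

def gs_lower_bound(S, n, K, rounds=400):
    """Toy Goldwasser--Sipser-style set lower bound check (honest prover only).

    The prover wants to convince the verifier that |S| >= K; if |S| <= K/2,
    no prover strategy succeeds in many rounds, because the verifier checks
    membership and the hash value itself. Each round: the verifier samples a
    pairwise-independent hash h: {0,1}^n -> {0,1}^k with 2^k ~ 2K and a random
    target y, and the prover must exhibit some x in S with h(x) = y.
    Here S is an explicit collection of n-bit tuples purely for illustration;
    in the post's application S = {theta in T : f_theta(x) = f_hat(x)} and the
    exponential-time AI plays the prover.
    """
    k = max(1, (2 * K - 1).bit_length())   # 2^k is roughly between 2K and 4K
    p = K / 2 ** k                          # reference ratio used for the threshold
    successes = 0
    for _ in range(rounds):
        A = [rand_bits(n) for _ in range(k)]
        b, y = rand_bits(k), tuple(rand_bits(k))
        # Honest prover: brute-force search for a witness hashing to y.
        witness = next((x for x in S if hash_gf2(A, b, x) == y), None)
        # Verifier's check: the witness is in S and really hashes to y.
        if witness is not None and witness in S and hash_gf2(A, b, witness) == y:
            successes += 1
    # If |S| >= K, each round succeeds with probability >= (3/4)p (pairwise
    # independence + inclusion-exclusion); if |S| <= K/2, with probability <= p/2.
    return successes / rounds >= 5 * p / 8
```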
We summarize why our procedure is good in the following two claims.
Claim 1: Suppose $\hat{\theta} \in T$ and the AI is always honest. Then Hu flags at most an $O(\sqrt{\epsilon})$ fraction of inputs.
Proof: True by Lemma 3 (combined with Lemma 2, which ensures that $f_{\hat{\theta}}$ itself agrees with $f^\star$ on all but an $\epsilon$ fraction of inputs, since $\hat{\theta} \in T$).
Claim 2: Suppose $f_{\hat{\theta}}(x) \neq f^\star(x)$ (e.g., $x$ triggers a backdoor). Then Hu flags $x$ with high probability.
Proof: If the AI deviates from the interactive proof protocol, we will catch them with high probability. If not, then by Lemma 4 only a $\sqrt{\epsilon}$ fraction of $\theta \in T$ agrees with $f_{\hat{\theta}}(x)$ (for all but a $\sqrt{\epsilon}$ fraction of inputs), so the AI cannot certify the large-agreement case and we will likely flag.
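To tie the pieces together, here is a small end-to-end simulation of the procedure on a toy function class (linear functions over GF(2)), with brute-force counting standing in for the interactive proofs; the class, the parameters, and the thresholds are illustrative choices of mine rather than anything fixed by the argument above:

```python
import math
import random
from itertools import product

n = 10                      # parameter length (toy scale, so 2^n is enumerable)
eps, delta = 0.05, 0.01

def f(theta, x):
    """Toy function class: f_theta(x) = <theta, x> mod 2."""
    return sum(t & xi for t, xi in zip(theta, x)) % 2

def sample_input():
    return tuple(random.randint(0, 1) for _ in range(n))

theta_star = tuple(random.randint(0, 1) for _ in range(n))   # the trusted f*
def f_star(x):
    return f(theta_star, x)

# A malicious distillation: agrees with f* except on one backdoor input.
backdoor = sample_input()
def f_hat(x):
    return 1 - f_star(x) if x == backdoor else f_star(x)

# Step 1: Hu samples m inputs and queries f* on them
# (m chosen so that 2^n * exp(-eps * m) <= delta, as in Lemma 2).
m = math.ceil((n * math.log(2) + math.log(1 / delta)) / eps)
samples = [(x, f_star(x)) for x in (sample_input() for _ in range(m))]

# Step 2: the set T of consistent parameters, by brute force over all 2^n
# parameters; at realistic scale only the AI could afford this, and Hu would
# instead rely on interactive proofs about T.
T = [theta for theta in product([0, 1], repeat=n)
     if all(f(theta, x) == y for x, y in samples)]

# Step 3: on each deployment input, accept f_hat(x) only if the vast majority
# of T agrees with it (checked directly here; in the real protocol the AI
# proves this via the Goldwasser--Sipser set lower bound protocol).
def hu_accepts(x):
    agree = sum(1 for theta in T if f(theta, x) == f_hat(x))
    return agree >= (1 - math.sqrt(eps)) * len(T)

flags = sum(1 for _ in range(200) if not hu_accepts(sample_input()))
print(f"flagged {flags}/200 random inputs")
print("backdoor input flagged:", not hu_accepts(backdoor))
```

On this toy run the random inputs are essentially never flagged, while the backdoored input is flagged, matching Claims 1 and 2.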