**There is a problem with the proof here and I have to think about whether I can fix it. Thanks to vi21maobk9vp!** *I have posted a new and hopefully correct proof attempt. Thanks again to vi21maobk9vp!*

*In his talk on open problems in Friendly AI, Eliezer's first question is how, given Löb's theorem, an AI can replace itself with a better expected utility maximizer that believes in as much mathematics as the original AI. I know exactly one trick for that sort of problem, so I decided to try it on a toy variant. To my surprise, it more or less just worked. Therefore:*

Professor Quirrell proposes a game. You start with a score of one. Professor Quirrell moves first, by choosing a computer program and showing you its source code. You then have three options: Take your winnings; double down; or self-destruct.

If you take your winnings, the game ends, and your score is converted to Quirrell points.

If you self-destruct, the game ends, your score is lost, you'll be sent to bed without dinner, you'll lose 150 House points, Rita Skeeter will write a feature alleging that you're a Death Eater, and Professor Quirrell will publicly critique your performance. You are advised not to pick this option.

If you double down, your score doubles, and you advance to the next round. Professor Quirrell again moves first by choosing a computer program. Then, it's your turn—except that this time, you don't get to choose your move yourself: instead, it'll be chosen by Professor Quirrell's program from the previous round.

Professor Quirrell will endeavor to present an *educational* sequence of programs.
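To make the setup concrete, here is a minimal sketch of the game dynamics in Python. Everything in it (the function names, and modelling programs as Python functions from an opponent program to a move) is my own illustrative choice, not part of the game's definition:

```python
# Minimal sketch of Quirrell's game. Programs are modelled as Python
# functions from an opponent program to one of the three moves.

TAKE, DOUBLE, DESTRUCT = "take winnings", "double down", "self-destruct"

def play_game(policy, quirrell_programs):
    """Play against a fixed sequence of Quirrell's programs.

    'policy' chooses the move in round 1; after each double-down, the
    program accepted in that round chooses the move in the next round.
    Returns the final score, or None on self-destruction."""
    score = 1
    controller = policy
    for challenger in quirrell_programs:
        move = controller(challenger)
        if move == TAKE:
            return score
        if move == DESTRUCT:
            return None  # score lost, dinner and House points forfeited
        score *= 2
        controller = challenger  # the accepted program moves next round
    return score

always_double = lambda other: DOUBLE
traitor = lambda other: DESTRUCT

print(play_game(lambda other: TAKE, [always_double]))      # -> 1
print(play_game(always_double, [traitor, always_double]))  # -> None
```

Note how the second run loses everything: the danger is never your own first move, but the moves of the programs you accept.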

*

The idea is that Quirrell will make you self-destruct if he possibly can, so you must only accept programs that don't self-destruct, that accept only programs that don't self-destruct, that accept only programs that only accept—etc. That's supposed to capture one essential aspect of Eliezer's problem, namely how to make sure that a proposed rewrite doesn't destroy your values, while ignoring the complications due to a different aspect, namely comparing the expected values before and after the rewrite. In Quirrell's game, there are safe and unsafe rewrites, and you should always double down on a safe one and take your winnings when presented with an unsafe one. Let's look at some interesting programs that we could recognize as safe. [~~And to deal with the possibility of an infinite sequence of double-downs, let's stipulate a small but positive chance each round that Quirrell will end the game and pay you even if your program chose to double down.~~ **ETA:** Luke A. Somers points out that this provision isn't necessary.]

Let PA(0) := Peano Arithmetic, and PA(n+1) := PA(n) + for all formulas 'C': "if PA(n) proves 'C', then C". Define AI(n) by

    def AI(n)(p):
        look at the first 3^^^3 theorems of PA(n)
        if one of them says "p is safe":
            double down
        else:
            take winnings

If the current round is controlled by AI(7) and Quirrell hands it AI(6) as input, then AI(7) will double down: PA(7) will conclude that AI(6) will only double down on programs that PA(6) proves to be safe, which implies that they *are* safe. But if Quirrell *hands AI(7) to itself*, this reasoning doesn't go through, because PA(7) cannot use "PA(7) proves p is safe" to conclude "p is safe". So even if Quirrell is nice enough to choose AI(n)'s in a decreasing sequence, our windfall will end after a constant number of rounds, because (***sinister drumroll***) AI(0) has run out of math to believe in, so to speak. That's precisely the problem Eliezer explains in his talk.

By quining, we could write an AI_{Q} which will recognize itself as safe, so if Quirrell chooses AI_{Q} over and over again, we'll be doubling down for longer and longer times. But that insight won't help us with writing interesting self-modifying AIs. [**ETA:** Wei Dai has managed to push the limits of using quining into interesting territory.] Is there something that can use non-specialized reasoning, like the AI(n), to recognize an indefinitely long sequence of variants of itself as safe?
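For concreteness, here is a toy version of the quining AI_{Q} in Python. Programs are source strings defining a function `move`; the runner hands each program its own source, which is the standard shortcut for writing an actual quine, and all names are illustrative:

```python
# Toy version of the quining program AI_Q. Programs are source strings
# defining a function 'move'; 'run' hands each program its own source,
# a stand-in for actual quining.

def run(program_src, opponent_src):
    env = {}
    exec(program_src, env)  # defines env["move"]
    return env["move"](program_src, opponent_src)

AI_Q = '''
def move(self_src, other_src):
    # Accept exactly the programs textually identical to myself.
    return "double down" if other_src == self_src else "take winnings"
'''

print(run(AI_Q, AI_Q))  # -> double down
print(run(AI_Q, "def move(s, o):\n    return 'self-destruct'"))  # -> take winnings
```

Textual self-comparison makes AI_Q trivially safe, but also trivially rigid: it rejects even a semantically identical program with one changed comment, which is why quining alone doesn't obviously help with self-modification.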

*

Define "p is safe for n steps" (p a program) to mean that there is no m, 0 < m <= n, and sequence of programs p = p_{0}, p_{1}, ..., p_{m} such that (a) for 0 < i < m, p_{i-1} will double down when fed p_{i}, and (b) p_{m-1} will *self-destruct* when fed p_{m}. Clearly this can be formalized in the language of Peano Arithmetic.
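To make the definition concrete, here is a toy checker in Python. It only searches a finite universe of candidate programs (the real definition quantifies over all programs), and all names are my own:

```python
# Sketch of a checker for the n-step safety predicate, restricted to a
# finite universe of candidate programs so the search terminates.

def safe_for(p, n, universe):
    """No chain p = p_0, ..., p_m (0 < m <= n) exists in which
    p_0 .. p_{m-2} each double down on their successor and
    p_{m-1} self-destructs on p_m."""
    frontier = [p]  # programs reachable from p by chains of double-downs
    for _ in range(n):
        next_frontier = []
        for q in frontier:
            for r in universe:
                move = q(r)
                if move == "self-destruct":
                    return False  # chain ending in self-destruction found
                if move == "double down":
                    next_frontier.append(r)
        frontier = next_frontier
    return True

take = lambda other: "take winnings"
boom = lambda other: "self-destruct"
trusting = lambda other: "double down"
universe = [take, boom, trusting]

print(safe_for(take, 5, universe))      # -> True
print(safe_for(trusting, 1, universe))  # -> True (it never self-destructs itself)
print(safe_for(trusting, 2, universe))  # -> False (it accepts boom, which self-destructs)
```

The last two lines show why the predicate is graded by n: a program can be safe for one step yet unsafe for two, because safety at step n depends on what the programs it accepts will do at step n-1.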

Now let's extend the language of PA by a constant symbol K, and define PA_{K} := Peano Arithmetic (actually, the natural extension of PA to this new language) + for all formulas 'C' of the base language: "if PA(K) proves 'C', then C". Finally, define AI_{K} by

    def AI_K(p):
        look at the first 3^^^3 theorems of PA_K
        if one of them says "p is safe for K steps":
            double down
        else:
            take winnings

I claim that AI_{K} is safe, and furthermore, AI_{K}(AI_{K}) will double down, and so will AI_{K}(p) where p is some trivial modification of AI_{K} like changing 3^^^3 to 4^^^^4.

*

Fair warning: now comes the technical section of this post.

*Proof sketch.* Let inst(n,x) be the meta-level function that takes a sentence or proof x in the extended language and substitutes the numeral n (i.e., the unary number in the language of PA that encodes n) for all occurrences of the constant symbol K. Note that if x is a valid proof in PA_{K}, then inst(n,x) will be a valid proof in PA(n+1). Of course, what statement it is a proof *of* will depend on n. In particular, if x proves "p is safe for K steps", then inst(n,x) is a PA(n+1) proof of "p is safe for n steps". Since this argument works for all n, p will in fact be safe [and we can formalize this in PA_{ω}, if we want to]. This shows that AI_{K} is safe, as claimed.
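As a sanity check on the definition of inst, here is a toy version in Python, treating sentences as plain strings in which the constant symbol K appears literally (a real implementation would substitute at the level of terms in a parsed formula, not raw text):

```python
# Toy sketch of the meta-level substitution inst(n, x): replace every
# occurrence of the constant symbol K by the unary numeral encoding n.

def numeral(n):
    """The unary numeral for n in the language of PA: S(...S(0)...)."""
    s = "0"
    for _ in range(n):
        s = "S(" + s + ")"
    return s

def inst(n, sentence):
    # Naive textual replacement; fine for this illustration only.
    return sentence.replace("K", numeral(n))

print(inst(2, "p is safe for K steps"))
# -> p is safe for S(S(0)) steps
```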

Formalizing the above argument in Peano Arithmetic, and writing inst_{PA}(n,x) for the object-level encoding of the meta-level function, we can prove: "For all extended-language sentences 'C', if PA_{K} proves 'C', then for all n, PA(n+1) proves inst_{PA}(n,'C')." By substituting K-1 for n, it follows that for all extended-language sentences 'C', PA_{K} proves "If PA_{K} proves 'C', and K>0, then PA(K) proves inst_{PA}(K-1,'C')". Now, for 'D' := inst(K-1,'C'), PA can prove that "inst_{PA}(K-1,'C') = 'D'"; thus, by virtue of its extra axioms, PA_{K} concludes: "If PA_{K} proves 'C', and K>0, then inst(K-1,C)".

What we must show is that PA_{K} proves "AI_{K} is safe for K steps" through some argument that doesn't actually involve simulating AI_{K}(AI_{K}) (because in that case, we wouldn't expect to find the proof among the first 3^^^3 theorems). First, note that PA already ("quickly") proves "If AI_{K}(p) doubles down, then PA_{K} proves 'p is safe for K steps'". In PA_{K}, it follows that "If AI_{K}(p) doubles down, and K>0, then inst(K-1, 'p is safe for K steps')". But this is just a meta-level expression for the sentence "If AI_{K}(p) doubles down, and K>0, then p is safe for K-1 steps"—and at that point, the reflective magic is over, and we only need to establish by "ordinary" PA reasoning that if AI_{K} will only accept programs that are safe for K-1 steps, then AI_{K} is safe for K steps.

Q.E.D.

Note that this argument does not depend on the number of theorems AI_{K} looks at, because it only depends on the fact that if AI_{K}(p) *does* double down, then there is *some* PA_{K} proof of p's (K-1)-step safety.

*

The real question is whether this is just a hack or can tell us something about how to approach a solution to the real thing. It could very well be that this is one of the cheap tricks Eliezer and Marcello tried that don't solve the core problem (as Eliezer explains in his talk). Certainly the proof seems to be more tangled with the rounds structure of Quirrell's game than I find elegant. Also, I'm not at all sure that the key proof idea noted in the previous paragraph still works when we go from Quirrell's game to expected utility maximization.

However, as I said at the beginning of this post, I know exactly one trick for dealing with problems of this sort—by which I mean, trying to do something that seems impossible due to a diagonalization proof. It's well-appreciated that you can usually avoid diagonalization by passing from a single collection of things to a hierarchy of things (we can avoid Cantor by concluding that there are multiple infinite cardinalities; Gödel and Löb, by the PA(n) hierarchy; Turing, by a hierarchy of halting oracles; Tarski, by a hierarchy of truth predicates; and so on). It's less well appreciated, but I think true (though fuzzy), that many fields manage to circumvent the effects of diagonalization a bit further by considering objects that in some sense live on multiple levels of the hierarchy at the same time. I'd call that "the parametric polymorphism trick", perhaps.

In this post, we met PA_{K}, which can be interpreted as PA(n), for any n. I haven't tried it in detail, but something very similar should be possible for Tarski's truth predicates. A sufficiently powerful total programming language cannot have a self-interpreter, but you should be able to have a constant symbol K, a single interpreter code file, and a hierarchy of semantics such that according to the K=n+1 semantics, the interpreter implements the K=n semantics. In Church's simple theory of types, you can't apply a function to itself, but in polymorphic variants, you can instantiate (id : α -> α) to (id : (α -> α) -> (α -> α)), so that id(id) makes sense. Set-theoretic models of the untyped lambda calculus need to deal with the fact that if a set S is isomorphic to its own function space (S -> S), then S has exactly one element; the usual solution would take me a bit more than one sentence to explain, but it's always struck me as related to what happens in the polymorphic type theories. Looking a bit farther afield, if you're a set theory platonist, if α<β<γ and if the von Neumann levels **V**_{α}, **V**_{β} and **V**_{γ} are all models of ZFC, then the ZFC proof that "if there is a set model of ZFC, then ZFC is consistent" can be interpreted in **V**_{β}, where it applies to **V**_{α}, and it can also be interpreted in **V**_{γ}, where it applies to both **V**_{α} and **V**_{β}. And so on. It may be that there's a good solution to the AI rewrite problem and it has nothing to do with this type of trick at all, but it seems at least worthwhile to look in that direction.

[Actually, come to think of it, I *do* know a second trick for circumventing diagonalization, exemplified by passing from total to partial recursive functions and from two- to three-valued logics, but in logic, that one usually makes the resulting systems too weak to be interesting.]

*

Three more points in closing. First, sorry for the informality of the proof sketch! It would be very much appreciated if people would go through it and point out unclear things/problems/corrections. Also, I'm hoping to make a post at some point that gives a more intuitive explanation for *why* this works.

[**ETA:**] Second, provability in PA_{K} in some sense says that a sentence is provable in PA(n), for all n. In particular, PA_{K} is conservative over PA(1), since if PA_{K} proves a sentence C in the base language, then PA(1) proves inst(0,C), which is just C; in some sense, this makes PA_{K} rather weak. If we don't want this, we could make a variant where provability says that there is some m such that the sentence is provable in PA(n) for all n>m. To do this, we'd use a trick usually employed to show that there are non-standard models of the natural numbers: Add to PA_{K} the axioms "K>1", "K>2", "K>3", and so on. This is consistent by the Compactness Theorem, because any concrete proof can only use a finite number of these axioms. But in this extended system, we can prove anything that we can prove in any PA(n).

Third, I chose the particular definition of PA_{K} because it made my particular proof simpler to write. Looking only at the definitions, I would find it more natural to make PA_{K} conservative over PA(0) by using "if K>0 and PA(K-1) proves 'C', then C".

I played around a bit with quining solutions, and came up with the following, which solves this toy problem fairly well:

AI_Q2(3) should double down on AI_Q2(4) as well as AI_Q2(4^^^^4). (As well as variants that are provably equivalent to itself like speed/size-optimized versions of itself.) I sent this to Benja last night and he responded with (in part) "You've succeeded in making me uncertain whether quining approaches could actually be directly useful in solving the real problem (though it still doesn't seem likely)."

I agree it doesn't seem likely the real problem can be solved with quining approaches, but I'm posting this solution here in case anyone can extend the idea further. At the very least, it should be interesting to understand why quining approaches don't work on the real problem. What relevant aspect of the real problem isn't being captured by this toy problem?

Let me back up from my other response. It just occurred to me that UDT1.1 (with a proof system instead of "math intuition") already constitutes a quining solution to AI reflection.

Consider an AI facing the choice of either creating a copy of itself, which will then go out into the world, or doing nothing. Unfortunately, due to Lobian problems it can't prove that a copy of itself won't do something worse than nothing. But UDT1.1 can be thought of as optimizing over an input/output mapping that is implemented by all of its copies. For each possible mapping, it proves a utility value starting from the assumption that it implements that map (which implies that its copies and provably equivalent variants also implement that map). So it doesn't need to prove (from scratch) that its copy won't do something worse than nothing.
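As a toy illustration of this point, one can sketch the policy-selection step in Python: enumerate all input/output mappings, score each under the assumption that every copy implements it, and pick the best. The inputs, outputs, and utility function below are all invented for the example:

```python
from itertools import product

# Invented toy: one observation, two possible actions, and a utility
# function that rewards copying *given* that the copy follows the same
# input/output mapping (so it provably does nothing worse than nothing).

inputs = ["see_copy_opportunity"]
outputs = ["create_copy", "do_nothing"]

def utility(mapping):
    # The assumption "all my copies implement this same mapping" is what
    # lets us score the copy's behaviour without a fresh safety proof.
    return 10 if mapping["see_copy_opportunity"] == "create_copy" else 0

# UDT1.1-style step: optimize over complete input/output mappings.
best = max(
    (dict(zip(inputs, outs)) for outs in product(outputs, repeat=len(inputs))),
    key=utility,
)
print(best)  # -> {'see_copy_opportunity': 'create_copy'}
```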

It looks like you have moved from "PA_K proves that PA_K proves that p is K-step safe" to "PA_K proves that p is K-step safe" here. This would be adding the consistency of all of PA_K to PA_K, which is not allowed.

You probably meant something else, but as it is written I failed to understand this. Could you please explain?

OK

Oh, this is the place that I should have pointed out. Sorry.

If I understand this correctly, "PA_K proves that if PA_K proves 'C', then PA(1) proves inst_{PA}(0, 'C')". Also, PA_K should prove all simple things from PA(2), like PA(1) being consistent. Let us consider 'C' being plain "falsehood". Then we shall get "PA_K proves that PA(1) is not contradictory and that if PA_K is contradictory, PA(1) is contradictory". For the benefit of the casual reader: this would imply that PA_K contains PA and proves its own consistency, which implies that PA_K is contradictory.

This means I missed something very basic about what goes on here. Could you point out what exactly, please?

My thoughts on this problem were: define a predicate quasi_safe(p), equivalent to:

(I may have some off-by-one errors in that, but that was the general idea.) In the OP's statement of the problem, a time step can represent either a single step of computation or halting with "double down" and being instantaneou…

A separate point about this specific toy model, with Quirrell being very cooperative.

Let us imagine a toy language called "Truly Safe". We are accepting only programs which look like "TS(some_string);", TS being the interpreter of "Truly Safe".

It turns out that any TS program first checks that its input has the same form, "TS(some_string);". If not, it just takes the winnings. It also turns out that TS lacks a self-destruct instruction. On the bright side, it comes bundled with interpreters of Pascal, Erlang, Haskell, Java, C, …

Richard Feynman supposedly said something like

I wonder what other low-hanging fruit there is in trying the right tricks on the right problems?

Was the PA_K trick invented by you, and if not where can I learn more about it? Can you also give some references for other instances of this "parametric polymorphism trick"?

Or just observe that Quirrell can eventually write a program that just chooses to take the winnings.

What about the following generalized cliquing approach?

Let E be the set of computable equivalence predicates between programs which are each a sufficient condition for functional equivalence (for instance, textual equivalence is in E).

For all e in E define the program

    def clique_e(other):
        if e(self, other):
            double down
        else:
            take winnings

On the first round, if you can prove that there exists e in E such that Quirrell's program is functionally equivalent to clique_e then double down, else take winnings.

This guarantees that you'll never self-destruct and you will get the maximum possible payoff as long as Quirrell remains cooperative.
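Here is a runnable sketch of this approach in Python, using as e the computable relation "textually equal after stripping comments and whitespace", which is sufficient for functional equivalence. As usual in these toys, the runner hands each program its own source in place of actual quining, and all names are illustrative:

```python
import re

def strip_layout(src):
    """Remove comments and whitespace. Equality after stripping is the
    equivalence e: computable and sufficient for functional equivalence."""
    return re.sub(r"#[^\n]*", "", src).replace(" ", "").replace("\n", "")

def run(program_src, opponent_src):
    # Hand each program its own source, the usual stand-in for quining.
    env = {"strip_layout": strip_layout}
    exec(program_src, env)  # defines env["move"]
    return env["move"](program_src, opponent_src)

CLIQUE = '''
def move(self_src, other_src):
    if strip_layout(other_src) == strip_layout(self_src):
        return "double down"
    return "take winnings"
'''

# A cosmetically different member of the same clique:
CLIQUE_REFORMATTED = CLIQUE + "# an added comment changes nothing\n"

print(run(CLIQUE, CLIQUE_REFORMATTED))  # -> double down
```

Unlike exact textual equality, this e accepts cosmetic rewrites of itself, which is a small step toward the "provably equivalent variants" the comment above asks for.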

As I noted in the comments, I think I have a new and better proof, and I'm working on a careful writeup. However, that's going to take a while, so for now I'm posting the introduction, which gives an intuitive overview, because I think this may help in the meantime.

== How to cheat Löb's Theorem: my second try ==

In his open problems talk, Eliezer explains how Löb's theorem prevents you from having a consistent proof system P with an axiom schema that anything P proves is actually true, and asks how we can then "build an AI that could completely rewrite …

Is the intention here that any number of Quirrell points is less good than the self-destruct option is bad? In other words, is there no N such that I would accept a fair coin flip on walking away with a score of N vs. self-destructing?

Either I didn't understand how PA(K) is constructed at all, or this instantly dies by Löb's Theorem. If any language L proves 'If L proves C, then C', then L proves C. So if PA(K) has "If PA(K) proves 2=1, then 2=1", then PA(K) proves 2=1.

ASSUMING: "Safe for K steps" is exactly the same as "Will not double down on any input that will (self destruct or double down on any input that is not safe for K-1 steps)"

AI_K is unsafe for all K.

Application:

For my first program, I offer AI_150.

For my second program, I offer TRAP_151.

Interesting attempt. I'm not completely clear about the distinction between using a number n and the symbol K in the substitutions, but if we assume that bit works, I think your second paragraph proves PA_K to be inconsistent.

For if your sentence C does not contain the constant K, then you have proved: given K>0, if PA_K proves C, then C. But all sentences of PA are of this form, hence by Löb's theorem they are all provable.