In his open problems talk, Eliezer explains how Löb's theorem prevents you from having a consistent proof system P with an axiom schema that anything P proves is actually true, and asks how we can then "build an AI that could completely rewrite itself, without decreasing *the amount of trust it had in math* every time it executed that self-rewrite" (18:46).

Recently, I posted about an attempt to apply a general trick for avoiding diagonalization problems to a minimal toy version of this problem. Since then, Wei Dai has posted an interesting quining approach to the same toy problem, and Giles had a promising idea for doing something similar in a different way and will hopefully do a write-up filling in the details. Unfortunately my *own* "proof" turned out to be broken.

I *think* I've fixed the problem and made the proof more comprehensible and intuitive in the process. (To avoid confusion, note that what I'm proving is slightly different from, though related to, what I did in the previous post.) However, getting the details right seems to be far from trivial, so **I would very much appreciate if people checked my new argument**, and told me that it looks okay / where it goes wrong / where they get lost. Thanks in advance!

I'll be more explicit about quoting/unquoting than before, which means I'll need to introduce some notation. However, to sustain you through the schlep of preliminaries, I thought I'd start with an informal summary.

*

Löb's theorem shows that it's inconsistent to assume BAD := Peano Arithmetic + for each statement C, the axiom "if 'C' is provable in BAD, then C". I will extend the language of PA with a constant symbol K, and consider the proof system PPT.2 := Peano Arithmetic + for each statement C, the axiom "if K>0, and 'C' is provable in PPT.2, then D", where D is the statement obtained from C by replacing each occurrence of "K" by "K-1". [PPT.2 is short for for *parametric polymorphism trick, version 0.2.*] There will also be a variant PPT.3, where C can have a free variable.

I won't just show that these systems are consistent — I will show that they are *true* (i.e., sound), from which consistency follows. Of course, to do so I will have to say what my strange constant K is supposed to mean. I will show that for any natural number n, all axioms of PPT.2 and PPT.3 are true in the standard model of PA if K is interpreted to mean n.

What's the use? In my toy version of the AI rewrite problem, the AI will accept a rewrite proposed by its unreliable heuristics module if PPT.3 proves that after executing "K" rewrites this way, the AI's values will still be preserved. By the above, this implies that *any* finite number of rewrites will be safe. I will show that my AI will accept *itself* as a safe rewrite — exactly what you would *not* have if you used a hierarchy of proof systems, where PA(n+1) knows that things proven in PA(n) are true, as Eliezer explains in his talk — which seems to indicate that PPT.3 is not entirely powerless in the direction we want. (I'm **not** saying that this somehow solves the whole problem, though, just that it suggests this might be a place worth looking in.)

I'll sketch my soundness proof for PPT.2. We proceed by induction. For K=0 (by which I mean that K is interpreted as the number 0), the assertion is trivial, since the axioms of PPT.2 have no content in this case. Suppose PPT.2 is sound for K=n, and let C be a statement provable in PPT.2. By the induction hypothesis, this implies that in the K=n interpretation, C is true. Therefore, for K=(n+1), D is true, where D is the sentence obtained from C by substituting (K-1) for K. This proves the truth of the axiom "If K>0, and 'C' is provable in PPT.2, then D."

*

#### Preliminaries

The object language I'll be talking about is the language of Peano Arithmetic extended by a constant symbol K; PA will refer to Peano Arithmetic in this extended language. As usual, expressions in the object language can be represented *in* the object language through Gödel numbers.

I will take the unusual step of also representing them by Gödel numbers *in the metalanguage*. (Quantifications such as "For all formulas C, ..." mean "For all natural numbers C, if C is the Gödel number of a formula, then ...".) This allows me to operate on expressions on both the object- and the metalevel through the same computable functions: *down(n)* takes (the Gödel number of) a formula and replaces each occurrence of "K" in that formula by "K-1"; *repr(n)* takes a number and returns the Gödel number of the unary representation of this number; and *subst(F,e)* replaces all free occurrences of the special variable "_" in the expression F by the expression e. I will write *repr(EXPR)* as *&EXPR*, and *subst(F,e)* as *F(e)* (the only *functions* I apply to values are the ones I just introduced, so there should be no confusion between substitution and function application).

When I write "0=1 implies 7>9", then this does denotes not a string, but the Gödel number of the expression that formalizes the statement in quotation marks in PA. Quotation marks work like Lisp's backquote, and #[STUFF] is splicing, like Lisp's comma: "If #[C], then 0=1" denotes the Gödel number of the expression obtained by splicing the expression denoted by the Gödel number C into the appropriate place. Splicing is often used in combination with *repr*(.): "42 < #[&n]" splices in the unary representation of the number n. When given a program p (in the form of the Gödel number of its source), our toy AI will search for proofs of the statement "#[&p] is safe for the next K rewrites".

I'll write |- for *proves*, as in PPT.2 |- "2+2 = 4". (Unlike in my last post, I'll only need to write PPT.2 or PPT.3 to the left of the |-, so you can consider PPT.2 |- ... and PPT.3 |- ... as one-place predicates.) This can be formalized in the object language. I'll write n |= C, pronounced *n models C,* for: *(the formula represented by the Gödel number) C is true in the standard model of PA, if K is interpreted as the number n*. This is metalanguage-only, since the object language has no truth predicate. For both |- and |=, we're only interested in their meaning on closed formulas, i.e., formulas without free variables.

If n |= C, then C is *true*. If C is just a variable, that doesn't immediately tell us much, but if n |= "STUFF", and STUFF doesn't contain an occurrence of K, then we can conclude that STUFF. (Not all we can say in the metalanguage can be said in the object language, but all that can be said in the object language can be said in the metalanguage, if we fix the meaning of K.) If STUFF *does* contain K, then we replace every occurrence of K by n, and then conclude *that*.

If the quotation contains a splice, as in n |= "0=1 implies #[D]", then we have the same problem as with n |= C (although we could conclude that 0=1 implies n |= D). But in one case, we can say more, namely when all splices are of the form #[& STUFF]: this means that what gets spliced in is the unary numeral representing STUFF, and we know which number this numeral denotes, namely STUFF. For example, if n |= "#[&x] > #[&y]", then x>y. (It's even true that if C is a closed formula, then 0 |= "#[&C] >= 0", so that C >= 0, which looks like a type error—but recall that formulas are represented by Gödel numbers, even on the metalevel.) Take a moment to convince yourself of all this: it will be important.

"STUFF" and 'STUFF' have the same meaning and can be nested. Splicing works on the innermost level: In "For all x, PPT.2 |- '0 <= #[&x]'", what gets spliced in is the value of the *object-level* variable x, not of some metavariable of the same name.

*

#### System PPT.2 and its soundness

I'll introduce two proof systems, PPT.2 which I sketched in the introduction and PPT.3 which allows formulas with a free variable, because the definition and soundness proof of PPT.2 are simpler and will hopefully help build intuition for PPT.3.

(1) PPT.2 := PA + for every closed formula C, the axiom "If K>0, and PPT.2 |- #[&C], then #[down(C)]".

This says that if K>0, and PPT.2 proves C, then the proposition obtained from C by replacing K by K-1 is in fact true. We want to show that this claim is correct when K is interpreted to be any particular natural number n.

We proceed by induction. We know that the axioms of PA are true, so we only need to worry about the additional axioms of PPT.2. For K=0, these axioms are trivial, and so we will assume that soundness has been proven if K is interpreted as n, and show soundness for n+1. Thus, for any C, we need to prove

(2) n+1 |= "If K>0, and PPT.2 |- #[&C], then #[down(C)]."

As usual in the semantics of first-order logic, this is equivalent to

(3) If (n+1 |= "K>0, and PPT.2 |- #[&C]"), then (n+1 |= "#[down(C)]").

Now "#[down(C)]" is just down(C), so this is the same as

(4) If (n+1 |= "K>0, and PPT.2 |- #[&C]"), then (n+1 |= down(C)).

So, we need to prove (4). To do this, we'll assume the left-hand side

(5) n+1 |= "K>0, and PPT.2 |- #[&C]",

and try to prove the right-hand side.

Note that the only place where K appears in the quoted stuff in (5) is at the very beginning. The formula represented by C may very well contain occurrences of K, but what gets spliced in is *not* the formula C itself, but &C, the unary representation of its Gödel number. Therefore, from (5) it follows that

(6) n+1 > 0, and PPT.2 |- C.

Since we already know that PPT.2 is sound for n, and since anything first-order logic concludes from a sound axiom system is itself true, the second half of (6) implies that

(7) n |= C.

But this obviously implies that

(8) n+1 |= down(C),

as desired, because down(.) replaces K by K-1, and if K is interpreted as n+1, then K-1 is interpreted as n. This completes the proof that PPT.2 is sound.

*

#### PPT.3 and its soundness

A *one-parameter formula* is a formula which may have "_" as a free variable, but no others. Recall that if F is (the Gödel number of) a one-parameter formula, and e is (the Gödel number of) some expression, then F(e) denotes (the Gödel number of) the result of substituting e for "_" in F.

(9) PPT.3 := PA + for every one-parameter formula F, the axiom "For all x: If K>0, and PPT.3 |- #[&F](&x), then #[down(F('x'))]".

Here, "#[&F](&x)" transports F from the metalevel to the object level, and then (when the resulting object-level expression is "executed") substitutes the representation of the number x for "_". Thus, on the object level, this denotes a closed formula. On the other hand, "#[down(F('x'))]" splices in a formula with free variable 'x'.

Again, the extra axioms are trivially satisfied for K=0, so we suppose that soundness has been shown for K interpreted as n, and we want to show soundness for n+1. We must show that for every one-parameter formula F,

(10) n+1 |= "For all x: If K>0, and PPT.3 |- #[&F](&x), then #[down(F('x'))]."

Now we use that we're only interested in the *standard* model of PA, where every possible value of x can be written explicitly as a unary numeral. Thus, if we want to prove that a statement of the form "For all x: STUFF" is true in this model, i.e. that n+1 |= "For all x: STUFF", then it suffices to prove that for every natural number m, n+1 |= C, where C is obtained from STUFF by replacing each free occurrence of 'x' by the unary numeral &m.

There is one place where we've written 'x' explicitly in our formula, namely the "...(&x)" part, and then there are the places 'x' gets substituted into, in the down(F('x')) part. To replace the first occurrence, we can just splice in the numeral &n, i.e. write #[&n] instead of x. Replacing the other occurrences is even simpler: we need only replace F('x') by F(&n). Thus, what we must prove is that for every natural number m,

(11) n+1 |= "If K>0, and PPT.3 |- #[&F](& #[&m]), then #[down(F(&m))]."

Take heart, dear reader: this is the worst that the line noise is going to get in this post.

To prove (11), it suffices to prove

(12) If (n+1 |= "K>0, and PPT.3 |- #[&F](& #[&m])"), then (n+1 |= down(F(&m))).

We assume that the left-hand side is true, i.e., that

(13) n+1 |= "K>0, and PPT.3 |- #[&F](& #[&m])",

and try to prove the right-hand side.

The only place where K occurs in (13) is right at the beginning, so it follows from (13) that

(14) n+1 > 0, and PPT.3 |- F(&m).

Here, we use that as noted in the Preliminaries, when using n+1 |= "STUFF" to conclude STUFF, we can just replace splices #[& EXPR] by EXPR.

By the induction hypothesis, we know that PPT.3 is sound for n, so the second half of (14) implies

(15) n |= F(&m).

But as in the soundness proof for PPT.2, it's obvious that this implies

(16) n+1 |= down(F(&m)).

This completes the soundness proof for PPT.3.

*

#### A toy AI

Now would be a good time to read the beginning of my previous post, if you haven't already, which introduces the toy problem I consider here. You should have read the first two sections and the first paragraph of the third section, which gives the definition of "p is safe for n steps".

Define AI_PPT.3 by

def AI_PPT.3(p): Look at the first 3^^^3 theorems of PPT.3. if (one of them is "#[&p] is safe for K steps"): double down else: take winnings

It's immediate from the definition that if AI_PPT.3 doubles down on a program p, then

(17) PPT.3 |- "#[&p] is safe for K steps".

By soundness of PPT.3, it then follows that for every n, we have

(18) n |= "#[&p] is safe for K steps",

which implies

(19) p is safe for n steps.

Since this holds for every n, we've shown that AI_PPT.3 only doubles down on safe programs, and is therefore safe.

*

The reasoning leading to (17) can be formalized in PPT.3, without requiring a gigantonormous number of proof steps, because the argument doesn't need to do things like simulate AI_PPT.3 for the whole 3^^^3 cycles. Let's write *PPT.3 quickly proves C* if C appears near the beginning of the list of the first 3^^^3 theorems of PPT.3. Then we have

(20) PPT.3 quickly proves, "For all programs p: If AI_PPT.3 doubles down on p, then PPT.3 |- '#[&p] is safe for K steps'."

Now, we'll show in a moment that

(21) PPT.3 quickly proves "For all x: If K>0, and PPT.3 |- '#[&x] is safe for K steps', then x is safe for K-1 steps."

From (20) and (21), we get

(22) PPT.3 quickly proves, "For all programs p: If AI_PPT.3 doubles down on p, and K>0, then p is safe for K-1 steps."

Now it's clear from the definition of "safe for n steps" that

(23) PPT.3 quickly proves, "AI_PPT.3 is safe for K steps"

—and this shows that AI_PPT.3 doubles down on itself, as claimed. All that remains is to show (21).

*

Recall that for every one-parameter formula F, PPT.3 contains the axiom, "For all x: If K>0, and PPT.3 |- #[&F](&x), then #[down(F('x'))]".

We set F = "_ is safe for K steps". The corresponding axiom is,

(24) "For all x: If K>0, and PPT.3 |- #[& '_ is safe for K steps'](&x), then #[down('_ is safe for K steps'('x'))]."

Now, "#[& 'STUFF']" is just "STUFF", so (25) is the same formula (i.e., the same Gödel number) as

(25) "For all x: If K>0, and PPT.3 |- '_ is safe for K steps'(&x), then #[down('_ is safe for K steps'('x'))]."

Since the expression '_ is safe for K steps'('x') lives on the metalevel, we can do that computation and find that (25) is the same formula as

(26) "For all x: If K>0, and PPT.3 |- '_ is safe for K steps'(&x), then #[down('x is safe for K steps')]."

On the other hand, the '_ is safe for K steps'(&x) part lives on the object level, so if we replace it by '#[&x] is safe for K steps', we'll get a different formula (different Gödel number) that (25). However, by a trivial computation, these denote the same Gödel number on the object level, which means that PPT.3 quickly proves

(27) "For all x: If K>0, and PPT.3 |- '#[&x] is safe for K steps', then #[down('x is safe for K steps')]."

The down(...) part lives on the metalevel, so by computation, (27) is the same formula as

(28) "For all x: If K>0, and PPT.3 |- '#[&x] is safe for K steps', then #['x is safe for K-1 steps']",

which, in turn, is just syntax for

(29) "For all x: If K>0, and PPT.3 |- '#[&x] is safe for K steps', then x is safe for K-1 steps."

Thus, we've shown that PPT.3 quickly proves (29), as (21) asserted.