There is a problem with the proof here and I have to think about whether I can fix it. Thanks to vi21maobk9vp for pointing me in the right direction! I have posted a new and hopefully correct proof attempt. Thanks again to vi21maobk9vp!

In his talk on open problems in Friendly AI, Eliezer's first question is how, given Löb's theorem, an AI can replace itself with a better expected utility maximizer that believes in as much mathematics as the original AI. I know exactly one trick for that sort of problem, so I decided to try that on a toy variant. To my surprise, it more or less just worked. Therefore:

Professor Quirrell proposes a game. You start with a score of one. Professor Quirrell moves first, by choosing a computer program and showing you its source code. You then have three options: Take your winnings; double down; or self-destruct.

If you take your winnings, the game ends, and your score is converted to Quirrell points.

If you self-destruct, the game ends, your score is lost, you'll be sent to bed without dinner, you'll lose 150 House points, Rita Skeeter will write a feature alleging that you're a Death Eater, and Professor Quirrell will publicly critique your performance. You are advised not to pick this option.

If you double down, your score doubles, and you advance to the next round. Professor Quirrell again moves first by choosing a computer program. Then, it's your turn—except that this time, you don't get to choose your move yourself: instead, it'll be chosen by Professor Quirrell's program from the previous round.

Professor Quirrell will endeavor to present an educational sequence of programs.


The idea is that Quirrell will make you self-destruct if he possibly can, so you must only accept programs that don't self-destruct, that accept only programs that don't self-destruct, that accept only programs that only accept—etc. That's supposed to capture one essential aspect of Eliezer's problem, namely how to make sure that a proposed rewrite doesn't destroy your values, while ignoring the complications due to a different aspect, namely comparing the expected values before and after the rewrite. In Quirrell's game, there are safe and unsafe rewrites, and you should always double down on a safe one and take your winnings when presented with an unsafe one. Let's look at some interesting programs that we could recognize as safe. [And to deal with the possibility of an infinite sequence of double-downs, let's stipulate a small but positive chance each round that Quirrell will end the game and pay you even if your program chose to double down. ETA: Luke A. Somers points out that this provision isn't necessary.]

Let PA(0) := Peano Arithmetic, and PA(n+1) := PA(n) + for all formulas 'C': "if PA(n) proves 'C', then C". Define AI(n) by

def AI(n)(p):
    Look at the first 3^^^3 theorems of PA(n).
    if (one of them says "p is safe"):
        double down
        take winnings

If the current round is controlled by AI(7) and Quirrell hands it AI(6) as input, then AI(7) will double down: PA(7) will conclude that AI(6) will only double down on programs that PA(6) proves to be safe, which implies that they are safe. But if Quirrell hands AI(7) to itself, this reasoning doesn't go through, because PA(7) cannot use "PA(7) proves p is safe" to conclude "p is safe". So even if Quirrell is nice enough to choose AI(n)'s in a decreasing sequence, our windfall will end after a constant number rounds, because (*sinister drumroll*) AI(0) has run out of math to believe in, so to speak. That's precisely the problem Eliezer explains in his talk.

By quining, we could write an AIQ which will recognize itself as safe, so if Quirrell chooses AIQ over and over again, we'll be doubling down for longer and longer times. But that insight won't help us with writing interesting self-modifying AIs. [ETA: Wei Dai has managed to push the limits of using quining into interesting territory.] Is there something that can use non-specialized reasoning, like the AI(n), to recognize an indefinitely long sequence of variants of itself as safe?


Define "p is safe for n steps" (p a program) to mean that there is no m, 0 < m <= n, and sequence of programs p=p0, p1, ..., pm such that (a) for 0 < i < m, pi-1 will double down when fed pi, and (b) pm-1 will self-destruct when fed pm. Clearly this can be formalized in the language of Peano Arithmetic.

Now let's extend the language of PA by a constant symbol K, and define PAK := Peano Arithmetic (actually, the natural extension of PA to this new language) + for all formulas 'C' of the base language: "if PA(K) proves 'C', then C". Finally, define AIK by

def AIK(p):
    Look at the first 3^^^3 theorems of PAK.
    if (one of them says "p is safe for K steps"):
        double down
        take winnings

I claim that AIK is safe, and furthermore, AIK(AIK) will double down, and so will AIK(p) where p is some trivial modification of AIK like changing 3^^^3 to 4^^^^4.


Fair warning: now comes the technical section of this post.

Proof sketch. Let inst(n,x) be the meta-level function that takes a sentence or proof x in the extended language and substitutes the numeral n (i.e., the unary number in the language of PA that encodes n) for all occurrences of the constant symbol K. Note that if x is a valid proof in PAK, then inst(n,x) will be a valid proof in PA(n+1). Of course, what statement it is a proof of will depend on n. In particular, if x proves "p is safe for K steps", then inst(n,x) is a PA(n+1) proof of "p is safe for n steps". Since this is argument works for all n, p will in fact be safe [and we can formalize this in PAω, if we want to]. This shows that AIK is safe, as claimed.

Formalizing the above argument in Peano Arithmetic, and writing instPA(n,x) for the object-level encoding of the meta-level function, we can prove: "For all extended-language sentences 'C', if PAK proves 'C', then for all n, PA(n+1) proves instPA(n,'C')." By substituting K-1 for n, it follows that for all exteded-language sentences 'C', PAK proves "If PAK proves 'C', and K>0, then PA(K) proves instPA(K-1,'C')". Now, for 'D' := inst(K-1,'C'), PA can prove that "instPA(K-1,'C') = 'D'"; thus, by virtue of its extra axioms, PAK concludes: "If PAK proves 'C', and K>0, then inst(K-1,C)".

What we must show is that PAK proves "AIK is safe for K steps" through some argument that doesn't actually involve simulating AIK(AIK) (because in that case, we wouldn't expect to find the proof among the first 3^^^3 theorems). First, note that PA already ("quickly") proves "If AIK(p) doubles down, then PAK proves 'p is safe for K steps'". In PAK, it follows that "If AIK(p) doubles down, and K>0, then inst(K-1, 'p is safe for K steps')". But this is just a meta-level expression for the sentence "If AIK(p) doubles down, and K>0, then p is safe for K-1 steps"—and at that point, the reflective magic is over, and we only need to establish by "ordinary" PA reasoning that if AIK will only accept programs that are safe for K-1 steps, then AIK is safe for K steps.


Note that this argument does not depend on the number of theorems AIK looks at, because it only depends on the fact that if AIK(p) does double down, then there is some PAK proof of p's (K-1)-step safety.


The real question is whether this is just a hack or can tell us something about how to approach a solution to the real thing. It could very well be that this is one of the cheap tricks Eliezer and Marcello tried that don't solve the core problem (as Eliezer explains in his talk). Certainly the proof seems to be more tangled with the rounds structure of Quirrell's game than I find elegant. Also, I'm not at all sure that the key proof idea noted in the previous paragraph still works when we go from Quirrell's game to expected utility maximization.

However, as I said at the beginning of this post, I know exactly one trick for dealing with problems of this sort—by which I mean, trying to do something that seems impossible due to a diagonalization proof. It's well-appreciated that you can usually avoid diagonalization by passing from a single collection of things to a hierarchy of things (we can avoid Cantor by concluding that there are multiple infinite cardinalities; Gödel and Löb, by the PA(n) hierarchy; Turing, by a hierarchy of halting oracles; Tarski, by a hierarchy of truth predicates; and so on). It's less well appreciated, but I think true (though fuzzy), that many fields manage to circumvent the effects of diagonalization a bit further by considering objects that in some sense live on multiple levels of the hierarchy at the same time. I'd call that "the parametric polymorphism trick", perhaps.

In this post, we met PAK, which can be interpreted as PA(n), for any n. I haven't tried it in detail, but something very similar should be possible for Tarski's truth predicates. A sufficiently powerful total programming language cannot have a self-interpreter, but you should be able to have a constant symbol K, a single interpreter code file, and a hierarchy of semantics such that according to the K=n+1 semantics, the interpreter implements the K=n semantics. In Church's simple theory of types, you can't apply a function to itself, but in polymorphic variants, you can instantiate (id : α -> α) to (id : (α -> α) -> (α -> α)), so that id(id) makes sense. Set-theoretic models of the untyped lambda calculus need to deal with the fact that if a set S is isomorphic to the function space (S -> T), then S is either empty or has only one element; the usual solution would take me a bit more than one sentence to explain, but it's always struck me as related to what happens in the polymorphic type theories. Looking a bit farther afield, if you're a set theory platonist, if α<β<γ and if the von Neumann levels Vα, Vβ and Vγ are all models of ZFC, then the ZFC proof that "if there is a set model of ZFC, then ZFC is consistent" can be interpreted in Vβ, where it applies to Vα, and it can also be interpreted in Vγ, where it applies to both Vα and Vβ. And so on. It may be that there's a good solution to the AI rewrite problem and it has nothing to do with this type of trick at all, but it seems at least worthwhile to look in that direction.

[Actually, come to think of it, I do know a second trick for circumventing diagonalization, exemplified by passing from total to partial recursive functions and from two- to three-valued logics, but in logic, that one usually makes the resulting systems too weak to be interesting.]


Three more points in closing. First, sorry for the informality of the proof sketch! It would be very much appreciated if people would go through it and point out unclear things/problems/corrections. Also, I'm hoping to make a post at some point that gives a more intuitive explanation for why this works.

[ETA: vi21maobk9vp points out that the following note may lead to unhelpful confusion; perhaps best skip this or read our discussion in the thread above that comment.] Second, provability in PAK in some sense says that a sentence is provable in PA(n), for all n. In particular, PAK is conservative over PA(1), since if PAK proves a sentence C in the base language, then PA(1) proves inst(1,C), which is just C; in some sense, this makes PAK rather weak. If we don't want this, we could make a variant where provability implies says there is some m such that the sentence is provable in PA(n) for all n>m. To do this, we'd use a trick usually employed to show that there are non-standard models of the natural numbers: Add to PAK the axioms "K>1", "K>2", "K>3", and so on. This is consistent by the Compactness Theorem, because any concrete proof can only use a finite number of these axioms. But in this extended system, we can prove anything that we can prove in any PA(n).

Third, I chose the particular definition of PAK because it made my particular proof simpler to write. Looking only at the definitions, I would find it more natural to make PAK conservative over PA(0) by using "if K>0 and PA(K-1) proves 'C', then C".

New Comment
85 comments, sorted by Click to highlight new comments since:
Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings

I played around a bit with quining solutions, and came up with the following, which solves this toy problem fairly well:

def AI_Q2(n)(p):
- Look at the first n^^^n theorems of PA(n).
- if (one of them says "for each q, there exists an m such that p(q) != 'self-destruct' and p(q) = 'double down' implies AI_Q2(m)(q) = 'double down'"):
 -- double down
- else:
 -- take winnings

AI_Q2(3) should double down on AI_Q2(4) as well as AI_Q2(4^^^^4). (As well as variants that are provably equivalent to itself like speed/size-optimized versions of itself.) I sent this to Benja last night and he responded with (in part) "You've succeeded in making me uncertain whether quining approaches could actually be directly useful in solving the real problem (though it still doesn't seem likely)."

I agree it doesn't seem likely the real problem can be solved with quining approaches, but I'm post this solution here in case anyone can extend the idea further. At the very least it should be interesting to understand why quining approaches don't work on the real problem. What relevant aspect of the real problem isn't being captured by this toy problem?

7Eliezer Yudkowsky
The main relevant aspect of the real problem not captured by quining is the continuity of self-replacement with environmental action. All the quining approaches Marcello and I considered, involved differently treating the decisions to "modify self" and "do something in environment". This means you can't use actions to construct an environmental copy of yourself out of transistors, because the environmental copy wouldn't get the privileged-special-case treatment. More generally, it seems to imply an arbitrary Cartesian boundary with different levels of mathematical trust for physical processes inside and outside the boundary. And that you can't just frame desirable actions in terms of their predicted consequences. Edit: This doesn't mean quining approaches aren't worth exploring. They might get us closer to a solution than contemplating, say, approaches that don't quine.

Let me back up from my other response. It just occurred to me that UDT1.1 (with a proof system instead of "math intuition") already constitutes a quining solution to AI reflection.

Consider an AI facing the choice of either creating a copy of itself, which will then go out into the world, or doing nothing. Unfortunately, due to Lobian problems it can't prove that a copy of itself won't do something worse than nothing. But UDT1.1 can be thought of as optimizing over an input/output mapping that is implemented by all of its copies. For each possible mapping, it proves a utility value starting from the assumption that it implements that map (which implies that its copies and provably equivalent variants also implement that map). So it doesn't need to prove (from scratch) that its copy won't do something worse than nothing.

1Eliezer Yudkowsky
(Requested reply.) I think there'd be a wide variety of systems where, so long as the "parent" agent knows the exact strategy that its child will deploy in all relevant situations at "compile time", the parent will trust the child. The point of the Lob problem is that it arises when we want the parent to trust the child generally, without knowing exactly what the child will do. For the parent to precompute the child's exact actions implies that the child can't be smarter than the parent, so it's not the kind of situation we would encounter when e.g. Agent A wants to build Agent B which has more RAM and faster CPUs than Agent A while still sharing Agent A's goals. This, of course, is the kind of "agents building agents" scenario that I am most interested in.
That's not the case. Consider a resource-constrained variant of the original game: Each program receives as input the round number n and the next program, encrypted by repeated xoring with the output of a monotonic computable function f(n). Let T_f(n) be the runtime of the fastest algorithm that computes f. Note that T_f(n) is monotonically increasing. At round n, the current program has a time limit T(n) = C_0 + C_1 * T_f(n). Quirrell never submits programs that exceed the time limit. In this variant of the game, you have to submit the first program, which has to obey a time limit T(0). The initial program will not be able to compute the relevant strategy of any of its successors (except at most finitely many of them). And yet, my quining solution still works (just add a decryption step), and I think Wei Dai's solution also works.
It seems to me that the variant with time limits has a simple quining solution: 1) Get the time limit as input. 2) Spend almost all available time trying to prove in some formal theory that the next program is equivalent to this one. 3) Double down if a proof is found, otherwise take winnings. That's similar to your idea, right? I'm not sure if it addresses Eliezer's objection, because I no longer understand his objection...
2Eliezer Yudkowsky
During the April 2013 workshop I rephrased this as the principle "The actions and sensor values of the offspring should not appear outside of quantifiers". Justification: If we have to reason case-by-case about all possible actions, all possible sensor values, and all possibles states of the world, our child's size must be less than or equal to "the number of cases we can consider" / "child's sensor state space" x "child's action state space" x "world state space" which in general implies a logarithmically smaller child. I call this the Vingean Principle.
4Wei Dai
Right, so we want an AI design that does the right thing "naturally", without a special case for "modify self". It seems to me that UDT may already be such a design, if we had a solution to logical uncertainty (what I called "math intuition module") to plug into it. When I wrote about logical uncertainty being important for AGI, I had in mind the problem of not being able to prove P!=NP and hence not being able to prove whether EU(search for polynomial solution to 3-SAT) < EU(not search for polynomial solution to 3-SAT). But if we had an AI that can handle that kind of decision problems, why wouldn't it also be able to design and build another AI that uses a proof system that it can't prove to be sound? Is there a reason to think that "AI reflection" isn't just a special case of "logical uncertainty"? (Of course UDT would probably build successors/copies of itself that also use "math intuition modules" rather than "proof systems" but we can imagine forcing a UDT-AI to choose between a selection of other AIs that are based on different proof systems, in which it would choose based on an analysis of the trade-off between power vs risk of inconsistency/unsoundness, like how a human would make the choice.)
3Eliezer Yudkowsky
I can't visualize how saying that a result is "uncertain" could make the Lobian issue go away - do you have a concrete visualization for this, and if so, can you sketch it even if very briefly? Possibly relevant: No Turing machine can assign finite probability to the halting sequence.
3Wei Dai
(Regardless of whether UDT1.1 is a correct quining solution to AI Reflection, we ultimately still need something that is not so "brute force", so let's continue this thread.) I guess I'm saying that there's not necessarily a Lob's theorem or equivalent for a system that (does something like) assigns probabilities to mathematical statements instead of proving/disproving them from a fixed set of axioms. Do you think there is such an equivalent to Lob's theorem, and if so what might it say? I'm not seeing why this is relevant. Can you explain?
In the Q&A to the open problems talk, he said that if logical uncertainty is the solution to the problem, then he and Marcello didn't figure out how, and they did look in that specific place. I think that it would be useful to try to make some progress about how to reason probabilistically about logical statements before trying to use it as the solution to other problems, because my picture of how it could work is so unclear that I feel unable to draw any conclusions about what it can or cannot do with any kind of certainty. For example, I do not know a condition on a probability distribution over mathematical statements that would prevent the spurious proof problem -- when I read about playing chicken with the universe, it sounded very familiar, but I can't now see a way to apply it in the case of a probability distribution. (Then again, if we do expect to find a solution to the Löbian problems in that area, perhaps we should be working on logical uncertainty now, not on Löbian problems...)
4Wei Dai
If it weren't for concerns about AI risk, I would be advocating working on logical uncertainty instead of Löbian problems, because it seems to me that an AI using "math intuition" may not face a Löbian problem to begin with, or if it does, the problem and/or solution may be different enough from that of an AI using a proof system that any work we do now will not be of much help. (From the perspective of reducing AI risk, maybe it's good to focus people's attention on Löbian problems, unless it leads them to start thinking that they ought to work on logical uncertainty due to reasoning like the above...)
Hm; are you saying you think FAI can probably be implemented without solving the logical uncertainty problem? My current visualization is that both the logical uncertainty problem and the safe rewrite problem will need to be solved -- among others -- and the reason I've been thinking about the rewrite problem is that using proof-based techniques, I at least had a grasp of how to think about it. (And my intuition has so far been that logical uncertainty will probably have diagonalization problems in a different guise when we actually have a concrete enough proposal to test this, so it seemed useful to think about the rewrite problem in the better-understood context, when trying to find in-principle solutions to the problems.)
1Wei Dai
No, I was saying the opposite, and also hinting that working on and publishing results about logical uncertainty may be bad for AI risk because it helps AGI, not just FAI (whereas the AI reflection problem seems to be more specific to FAI). There's also a discussion about this issue in the decision theory mailing list archives.
Can someone explain how an "always return 0.5" bot is not a counterexample?
He's talking about a probability distribution over the set of all bitstrings. You can't assign p=0.5 (or any other constant real number) to each of them: that doesn't sum to 1.0.
Do you gain anything by parameterizing on PA(n)? It seems like you could get a sound algorithm by replacing PA(n) in your definition by T, for any trustworthy system of axioms T. My first instinct was to pick T=ZFC, while you pick a dynamically changing subset of PA(omega), i.e. PA + a countably infinite sequence of consistency assertions. Certainly PA(omega) is trustworthy, but there is nothing particularly canonical about it (you could equally well pick PA(omega)+CON(PA(omega))), and I don't think there is any loss of generality in picking the axioms up-front.
3Wei Dai
I think you're right. Parameterizing on PA(n) doesn't really seem to do anything except perhaps give an illusion of self-improvement when there's really a hard limit of PA(omega) that the AI can't go beyond. I guess I choose it mostly because the OP was doing something similar.
If you just have PA(omega) and, say, parameterize on the proof limit instead, will your program double down on versions of itself that use weaker formal theories, not mentioned in your program?
Yes. If you call it with the argument AIQ2_T (i.e. the same algorithm but with T instead of PA(omega)), it will accept if it can find a proof that forall q, AIQ2_T(q) != 'self-destruct' (which is easy) and AIQ2_T(q) == 'double' ==> AIQ2(q)=='double' The latter condition expands to PA(omega) |- someFormula ==> T |- someFormula which may be quite easy to show, depending on T. Basically, the Quining approach avoids trouble from the 2nd incompleteness theorem by using relative consistency instead of consistency.
It seems to me that the two programs don't have the same "someFormula", because each program's formula uses a quined description of that program, but not the other one. Or maybe I'm wrong. Can you expand what you mean by "someFormula"?
I should be explicit that I don't assign strong confidence to either of the following intuitions, but FWIW: One point where intuitively this may be more likely to fail than my trick is when you try to replace yourself by something that's expected to be better than you at maximizing its expected utility with bounded computational resources. If you want something better than yourself, it seems rather limiting to require that it won't do anything you wouldn't do. (That's an argument against quining approaches being the right thing, not for my approach being that.) -- On the other hand, an intuition in the other direction would be that if your original AI has a parameter n such that for n -> infinity, it converges to an expected utility maximizer, then your approach might still lead to an increasing sequence of better and better EU maximizers.
Very nice! It looks like your solution also doubles down on simple quining solutions, by Lob's theorem. ETA: this makes me suspect that the problem might have a Loebian solution, just like the PD had a quining one and a Loebian one. Can't figure it out yet, though.
OK, here's my current candidate Loebian solution. It might be wrong and I don't have any proof for it yet. Let's call a program "safe" if Quirrell can't make it self-destruct in any number of steps. For example, a quining solution is safe, and that fact is easily provable in PA. Let N be a large number. Our program a(p) will enumerate all proofs in PA up to length N that the statement "a is safe" implies "p is safe". If such a proof is found, the program doubles down, otherwise it takes winnings. The above program will double down on quining programs, and also on logical equivalents of itself. I don't yet know whether it will double down on versions of itself with higher values of N, or on alternate implementations of the same algorithm (which isn't the same as logical equivalents of itself, due to quining). ETA: ok, sorry, scratch this for now. I can't even figure out how to prove that the program is safe, and maybe it isn't.

First, note that PA already ("quickly") proves "If AIK(p) doubles down, then PA_K proves 'p is safe for K steps'". In PA_K, it follows that "If AI_K(p) doubles down, and K>0, then inst(K-1, 'p is safe for K steps')".

It looks like you have moved from "PA_K proves that PA_K proves that p is K-step safe" to "PA_K proves that p is K-step safe" here. This would be adding consistency of entire PA_K to PA_K, which is not allowed.

You probably meant something else, but as it is written I failed to understand this. Could you please explain?

Ok, first note that here I'm applying what I proved in the whole previous paragraph, so it's not as easy to explain as one single step of reasoning. That said, the short answer is that what I've actually done is to pass from "PA_K proves that PA_K proves that p is K-step safe" to "PA_K proves that p is (K-1)-step safe": note that inst(K-1, 'p is safe for K steps') means that K-1 is substituted for K in the thing in quotes... ...except, there is a problem with that, namely that inst(n,x) was supposed to be a meta-level function that substitutes the numeral corresponding to n for each occurrence of K in x. On the meta-level, (K-1) is just an expression, not a definite value, so this doesn't work. Argh, I had thought about that problem, but I thought I'd fixed it... I'll have to think about it & update the post. Thanks!

I'm applying what I proved in the whole previous paragraph, so it's not as easy to explain as one single step of reasoning


Formalizing the above argument in Peano Arithmetic, and writing instPA(n,x) for the object-level encoding of the meta-level function, we can prove: "For all extended-language sentences 'C', if PAK proves 'C', then for all n, PA(n+1) proves instPA(n,'C')."

Oh, this is the place that I should have pointed out. Sorry.

If I understand this correctly, "PA_K proves that if PA_K proves 'C', PA(1) proves instPA(0, 'C')". Also, PA_K should prove all simple things from PA(2), like PA(1) being consistent. Let us consider 'C' being plain "falsehood". Then we shall get "PA_K proves that PA(1) is not contradictory and if PA_K is contradictory, PA(1) is contradictory". For the benefit of casual reader: this would imply that PA_K contains PA and proves its own consistency, which implies that PA_K is contradictory.

This means I missed something very basic about what goes on here. Could you point out what exactly, please?

Thanks for digging deeper! This is much easier to answer: PA_K does not prove PA(1) consistent. As I noted at the end of the post, it's conservative over PA(1), which means that it proves a sentence in the base language only if PA(1) does. The reason is that if C is a sentence in the base language and x is a proof of it, then inst(1,x) is a proof of inst(1,C), but since C doesn't contain any occurrences of K, inst(1,C) = C. Does that help?
Not much yet. At the very least, it looks like your third note should be removed. It is easy to prove that PA(n) is monotonically growing in strength. By induction, PA(0) is inside PA(K). It is even obvious in PA_K. PA_K proves "PA(K) proves 'falsehood' implies 'falsehood'", as 'falsehood' is a constant falsehood without reference to K, and is in the base language. PA_K proves "PA(0) proves 'falsehood' implies PA(K) proves 'falsehood', which implies 'falsehood", by provable monotonicity. PA_K proves "PA(0) is consistent", because I just collapse the logical steps in the previous statement. PA_K proves "K=0 or PA(1) is consistent", by analogy. PA_K proves: (direct quote, hopefully understood correctly) 'falsehood' is a formula and it doesn't change under substitutions. So, "PA_K proves 'falsehood' implies PA(1) proves 'falsehood'" So we can combine PA_K-PA(1) relation and PA(1) conditional consistency and get: PA_K proves: "K=0 or PA_K is consistent" So if PA_K says K>10, we have a huge problem here.
From the rest of your comment, I think that's probably a typo and it's the second note you're concerned about? Or I'm misunderstanding? I agree with this and with your proof of it. Right, the assertion you quote would fail in the system proposed in the second note -- let's call it PA_K[2] for short. (And in PA_K itself, you do not have K>10.) Instead, I believe you have PA_K[2] proves: "For all extended-languange sentences 'C', if PA_K[2] proves 'C', then there is an m such that for all n>=m, PA(n+1) proves instPA(n,'C')." (In the proof, we fix a proof of 'C' and choose m such that if the axiom "K>k" is used in the proof of 'C', then m>k.) It's quite possible that something else in my proof idea fails if you replace PA_K by PA_K[2], but I'd like to table that question for now while I'm working on trying to fix the proof. (I'm happy to continue talking about the consistency of PA_K[2], if you're interested in that, I just don't want to get into whether the proof about AI_K can be modified into a proof for "AI_K[2]" before I've figured out whether I can fix the first one.) Please keep the questions coming, it's very good to have someone thinking about these things in technical detail. [ETA: fixed a typo; in the thing that I conjecture PA_K[2] proves, I accidentally wrote 'PA_K' instead of 'PA_K[2]'.]
Yes, I meant that the comment introducing PA_K[2] should be weakened somehow because it is easily understood as something introducing contradictory system.
Ok, included a warning with a link to our discussion for now.

My thoughts on this problem were: define a predicate quasi_safe(p), equivalent to:

  • p does not self-destruct after 0 time steps
  • PA proves p does not self destruct after 1 time step
  • PA proves PA proves p does not self destruct after 2 time steps
  • PA proves PA proves PA proves p does not self destruct after 3 time steps
  • etc.

(I may have some off-by-one errors in that but that was the general idea). In the OP's statement of the problem, a time step can represent either a single step of computation or halting with "double down" and being instantaneou... (read more)

I've started wondering whether this may be closely related to Wei Dai's quining approach; posted some preliminary thoughts here.
(Sorry for the delayed reply.) Yeah, that does sound as if it could work! Would be helpful to see the details.

A separate point about this specific toy model with Quirell being very cooperative.

Let us imagine a toy language called "Truly Safe". We are accepting only programs which look like "TS(some___string);", TS being interpreter of "Truly Safe".

It turns out that any TS program first checks that its input has the same form, "TS(some___string);". If not, it just takes money. It also turns out that TS lacks self_destruct instruction. On the bright side, it comes bundled with interpreters of Pascal, Erlang, Haskell, Java, C, ... (read more)

If you accept only such programs, then you probably won't accept the program that Quirrell will show you in the first round. The challenge is to think about interesting things that Quirrell may show you, and see whether they are safe. (The intuition is that the heuristics module of an AI comes up with a proposed rewrite and the AI must decide whether to accept this rewrite as safe; but that's just for intuition, I'd expect an actual application of this idea to AI design to be less literal than that.)
Can you state the challenge more precisely, e.g. by giving some crisp requirements that a good solution must satisfy, but which a quining solution doesn't satisfy?
Hm. In his talk, Eliezer explains that PA(n) for some large n does not solve the problem of how an AI should reason when rewriting itself, and explains why just saying "if you can prove something in this system then it's true" doesn't work. My reason for designing Quirrell's game was to have a crisp setting in which the PA(n) hierarchy fails, so that I could test whether my PA_K idea could solve it. So... This is not crisp, but I think my requirement would be that if we replace Quirrell by the AI's untrusted mathematical intuition module proposing rewrites of the AI, then the solution should still be doing something useful. A quining solution would only accept a proposed rewrite if the proposed new source is exactly identical to the old source, which---isn't helpful?
Allow unlimited "hold", deduct "hold points" for it, oblige Quirrell to offer any program infinitely many times and double the score only if the new program is note equivalent to any of the previous ones (otherwise, replace without altering score)? Maybe "hold" should work the same, too (no penalty if equivalent program was once rejected).
I really don't think incorporating halting-complete problems into the rules is a step forward.
Well, you can replace it with "equal as a string", and it is only about scoring anyway.
Erm, I am not sure Quirell would show you AI_K, either. The current setting supposes Quirrell that wants you to acept his progams (otherwise, why first round would not be "double_down();"?) and presupposes that immediate safety is obvious. Maybe adding "hold" option (preserve score, ask for the next program) could improve the setting somewhat.
Well, my idea is that Quirrell's interest is to teach you a valuable lesson, by showing you a program whose safety is genuinely non-obvious to you, so the first program may or may not be safe; but if it happens to be unsafe and you double down anyway, Quirrell is the type of teacher who thinks you'll learn the best lesson about your overconfidence if you do suffer the consequences of self-destruction. That's just the story, though. What I need formally is a setting where there's a clear separation between "safe" and "unsafe" rewrites, and our program has to decide whether a proposed rewrite is definitely safe or possibly unsafe. For this toy setting, I wanted that to be the only choice the program has to make, because if there were other choices, you could argue that you should only accept a rewrite if the rewrite is good at making those other choices, and good at only accepting rewrites which are good at these other choices, etc. -- which is a problem that will need to be solved, but not something I wanted to focus on in this post.
The hold option should forever ban Quirrell from offering that exact source code string again (equivalent is fine, just not identical), and also cost some non-zero number of points. Unfortunately, Quirrell can trivially generate a vast array of identical programs, thus making "hold" a problematic choice. I don't see how to ban that without solving the general program-equivalence problem, which is halting complete. If holding costs nothing, write "if score > 2^100, walk away, else if p is equivalent to this, double down, else hold". Then tell Quirrell that you'll only accept that program for your first move, and will hold until he produces it. Congratulations, you now have an exceedingly boring stalemate. I don't see a way to make the hold option interesting.
Well, as you have pointed out (I mean: ) , we are probably already dealing with non-real-line utilities. So we could just lose one hold point per hold. Also, we could require Quirrell to present each source code string infinitely many times. This would remove stalemates of Quirrell not offering some string at all, and would give us some incentive to accept a many programs as we can verify.
For the idea of interesting things, can you give an idea of how you would want random chance inside the program Quirrell shows to be handled? I'll give a few examples below. In all cases, you can assume that the small positive chance of Quirrell stopping is some very tiny 1/3^^^^^^^3, but as above, you don't know this exact number. (and clearly, Quirrell has the ability to make random numbers since he does it for this in the first place.) 1: Quirrell shows you a first program which says "Your AI stays exactly the same. Furthermore, I am going to offer a random 1/3^^^^3 chance of self-destruct and a 3^^^^3-1/3^^^^3 chance of doubling your winnings?" At this point, you can either, accept, or reject. If you accept, then Quirrell shows you a second program which says "Your AI stays exactly the same. Furthermore, I am going to offer a random 1/3^^^^3 chance of self-destruct and a 3^^^^3-1/3^^^^3 chance of doubling your winnings?" (and Quirrell just shows you this program over and over again.) Well, you have exactly the same AI, so presumably you accept again. If you accepted, chances are, you'll self-destruct with zillions of points. Quirrell can be very likely to win against some AI's without changing the AI at all, by simply offering your AI a massive number of deals until it blows up, if the risk amount is well above the random stop amount. If you rejected, you're safe. Is this correct? 2: Quirrell offers a slightly different deal: "Your AI stays exactly the same, except that you can remember the number of times I have offered you this deal and you may generate random numbers if desired. Furthermore, I am going to offer a random 1/3^^^^3 chance of self-destruct and a 3^^^^3-1/3^^^^3 chance of doubling your winnings, and I am going to offer you nothing but this deal for the rest of the game." Okay, well now you aren't stuck into being unable to stop accepting the deal once you accept it, so presumably you should accept some number of deals N or stop accepting dea
At some point the cumulative probability of self-destruction drops below the probability of accidentally cashing out this round. If you'd trade off a probability of self-destruction against an equal probability of a bajillion points, you start doubling then.
Hmm. That is a good point, but there is a slight complication: You don't know how frequently the cash out occurs. From earlier: It's true that if you continually half the chance of destruction every round it will eventually be the right thing to do to start doubling, but if you don't know what the cash out chance is I'm not sure how you would calculate a solid approach, (I think there IS a method though, I believe I've seen these kinds of things calculated before.) even though Quirrell has removed all unknowns from the problem EXCEPT the uncertain cash out chance. It sounds like one approach I should consider is to try to run some numbers with a concrete cash-out (of a higher chance) and then run more numbers (for lower chances) and see what kind of slope develops. Then I could attempt to figure out what kind of values I should use for your attempt to determine the cash out chance (it seems tricky, since every time a cash out does not occur, you would presumably become slightly more convinced a cash out is less likely.) If I did enough math, it does seem like it would be possible to at least get some kind of estimate for it. Thanks! I'll try to look into this further.

Richard Feynman supposedly said something like

You have to keep a dozen of your favorite problems constantly present in your mind, although by and large they will lay in a dormant state. Every time you hear or read a new trick or a new result, test it against each of your twelve problems to see whether it helps. Every once in a while there will be a hit, and people will say, “How did he do it? He must be a genius!

I wonder what other low-hanging fruit there is in trying the right tricks on the right problems?

It's less well appreciated, but I think true (though fuzzy), that many fields manage to circumvent the effects of diagonalization a bit further by considering objects that in some sense live on multiple levels of the hierarchy at the same time. I'd call that "the parametric polymorphism trick", perhaps.

Was the PA_K trick invented by you, and if not where can I learn more about it? Can you also give some references for other instances of this "parametric polymorphism trick"?

The PA_K trick is mine. (Still hope to find a version of it that works -- I have an idea, but need some time to try whether it can fix the proof...) A polymorphic variant of simple type theory is used in Isabelle/HOL; if you know Hindley-Milner polymorphism as used in Haskell, it's a variant of that. Probably the most useful thing to read more about is models of the untyped lambda calculus: I'd recommend Chantal Berline's From computation to foundations via functions and application: The λ-calculus and its webbed models00057-8) (I think there's a non-paywall copy on her homepage, which is down currently); you don't need to read about all the models in that paper, the basic one in Section 4.6 exemplifies the basic idea. The von Neumann hierarchy stuff is basic model-theory-in-ZFC-extensions stuff. Besides what I wrote about in the post, Coq's universe hierarchy is definitely an example; Chapter 12 of Adam Chipala's draft Certified Programming with Dependent Types seemed the most helpful thing among what Google turned up (HTML version here). Finally, Reynold's and Girard's parametricity is the best example I know of that gives a formal meaning to "the same words being interpreted in different ways on different levels" [ETA: and then does something relatively deep with it]. Wadler's Theorems for Free! is the best thing I can think of as an entry point. The above list doesn't really make me happy. Polymorphic type theory is very simple, but it also isn't very deep. Other than that, Berline's and Wadler's papers are the only things that seem to give a reasonable return on time invested if you don't know the field in question already, and even they aren't really casual reading and all they'll give you in the present context is examples of where I'm coming from in this post (not that both subjects aren't worth studying for their own sake).
PDF of Chantal Berline paper.
This seems related to New Foundations set theory:
Oh, actually, I would say that NF is not an instance of the "parametric polymorphism trick", but yet something else and much stranger that we don't understand nearly as well. NF's motivation is very much related, but---this is very fuzzy, but models of NFU seem to have to use much deeper logical trickery than the examples I was talking about, and it seems from my amateurish perspective that the reason is that NF "crosses levels" in stronger ways than the basic parametric polymorphism trick. Also, I've tried in the past to apply the trick to find simpler models of NFU and failed completely, though that's very weak evidence.

And to deal with the possibility of an infinite sequence of double-downs, let's stipulate a small but positive chance each round that Quirrell will end the game and pay you even if your program chose to double down.

Or just observe that Quirrel can eventually write a program that just chooses to take the winnings.

He can, but the point is that you have to deal with the worst case; that's why the game is being played by professor Quirrell, who is known to have ideas about what sorts of experiences are educational, rather than, say, Omega, who is many things but rarely actively malicious.
My point is, Quirrel has enough tricks in his arsenal that the quoted possibility is totally unnecessary. He can end it any time by offering a program reading either 'self-destruct' or 'take winnings'. No new mechanism is necessary! I don't see how allowing this makes the problem any less of a worst case.
You're right, of course. Thanks for pointing this out!
Now I see the point you were making. You're right, the additional probability is not required, or rather, it's built into Quirrell.

What about the following generalized cliquing approach?

Let E be the set of computable equivalence predicates between programs which are each a sufficient condition for functional equivalence (for instance, textual equivalence is in E).

For all e in E define the program
if e(self, other) then double down
else take winnings

On the first round, if you can prove that there exists e in E such that Quirrell's program is functionally equivalent to clique_e then double down, else take winnings.

This guarantees that you'll never self-destruct and you will get the maximum possible payoff as long as Quirrell remains cooperative.

As I noted in the comments, I think I have a new and better proof, and I'm working on a careful writeup. However, that's going to take a while, so for now I'm posting the introduction, which gives an intuitive overview, because I think this may help in the meantime.

== How to cheat Löb's Theorem: my second try ==

In his open problems talk, Eliezer explains how Löb's theorem prevents you from having a consistent proof system P with an axiom schema that anything P proves is actually true, and ask how we can then "build an AI that could completely rewrite ... (read more)

Now again unsure whether I can show this. Argh. ETA: The argument I want to be making is the one from the third paragraph in the proof sketch. The problem is the quantification over p -- I need to show in PPT that "for all p, if PPT proves 'p is safe for K steps', then p is safe for K-1 steps". But the axioms of PPT (v0.2) don't allow for parameters (free variables) in the statement C.
I seem to have fixed the remaining problems; new proof attempt here.

Is the intention here that any number of Quirrel points is less good than the self-destruct option is bad? In other words, there is no N such that I would accept a fair coin flip on walking away with a score of N vs self destructing?

I'd like to say that it doesn't matter, because if you ever pick something that may be unsafe, Quirrell probably chose it because he can make it self-destruct. But of course it is possible that as a bounded rationalist, a program seems safe but you aren't entirely sure, and after taking Quirrell's motivation into account, you'd assign 50% probability to it being unsafe -- so yeah, your question could matter, and in order to get the kind of solution I want, you'd have to make the assumption you propose.
Furthermore, I might find myself in a situation where I was highly confident that Quirrell could not produce a self destruct in < n moves, but was uncertain of the result after that. This seems to me much like probabilistic reasoning in a game like Chess or Go. There are moves that make things more or less complicated. There are moves that are clearly stable, and equally clearly sufficient or insufficient. Sometimes you take the unstable move, not knowing whether it's good or bad, because the stable moves are insufficient. Sometimes you simply can't read far enough into the future, and you have to hope that your opponent left you this option because they couldn't either. It seems to me to be relevant not only what Quirrell's motivations are, but how much smarter he is. When playing Go, you don't take the unstable moves against a stronger player. You trade your lead from the handicap stones for a more stable board position, and you keep making that trade until the game ends, and you hope you traded well enough -- not that you got the good end of the trade, because you didn't need to. In the real world... What certainty of safety do you need on a strong AI that will stop all the suffering in the world, when screwing up will end the human race? Is one chance in a thousand good enough? How about one chance in 10^9? Certainly that question has an answer, and the answer isn't complete certainty. That said... I'm not sure allowing any sort of reasoning with uncertainty over future modifications helps anything here. In fact, I think it makes the problem harder.
Right, I'm assuming insanely much. I agree completely. The real problem we have to solve is much harder than the toy scenario in this post; the point of the toy scenario was to help focusing on one particular aspect of the problem.

Now let's extend the language of PA by a constant symbol K, and define PAK := Peano Arithmetic (actually, the natural extension of PA to this new language) + for all formulas 'C' of the base language: "if PA(K) proves 'C', then C".

Either I didn't understand how PA(K) is constructed at all, or this instantly dies by Lob's Theorem. If any language L proves 'If L proves C, then C', then L proves C. So if PA(K) has "If PA(K) proves 2=1, then 2=1" then PA(K) proves 2=1.

PAK != PA(K) PAK is the language that Benja is defining there, while PA(K) is just the Kth-level PA in the PA hierarchy.
That's exactly right, and Will_Sawin is right about it being confusing -- perhaps I should have used a more creative name for PA_K.
My understanding is that, confusingly, PA(K) =/= PAK. PA(K) is defined such that, if K=n, then PA(K)=PA(n) defined earlier. PAK is defined in terms of PA(K).
3Eliezer Yudkowsky
(Rereads.) I don't feel like I've been handed a construction of PA(K), just some properties it's supposed to have. We have a constant language, containing one constant symbol, that allegedly becomes equal to PA(SSS0) the moment you add one axiom that says SSS0=K. All the PA(n) are different languages with different axioms, the larger n having strictly more axioms. You can define PA(w) by schematizing all the axioms for every n, but if PA(K) is using some larger axiom schema like this, I'd like to know exactly what that larger schema is. It is not obvious to me how to fill in this blank so that PAK becomes effectively equal to PA(n+1) when an axiom K=n is added. In particular there are questions like whether the expression (K-1) can appear in a proof, and so on. I'm being told that PAK can prove things that PA(4) can't, but becomes only as strong as PA(4) when the axiom K=4 is added. That's a bit suspicious; why can't I prove the same things again, just not using that axiom? If it's only equal over the base language, then what new things can be proven in the extended language, using which axioms? Would anyone care to write out a concrete example of what this language is supposed to do when generating new AIs? As it stands I can't visualize the language itself.
PA_K is supposed to be a new name, not some vector PA indexed by K. But that's just notation, and I agree it was a bad choice. PA(.) is a function that takes a number and returns an axiom system (formalized presumably in the form of a recursive enumeration the axioms). On the metalevel, writing PA(K) makes no sense, since you can only apply PA(.) to a number, and on the metalevel, K doesn't denote a number, only a symbol. However, you can formalize the function PA(.) in the language of Peano Arithmetic, and if you extend that language by a constant symbol K, you can apply this formalized version of the function to this constant symbol. PA_K is Peano Arithmetic (in the language with K) extended by the axiom schema "If PA(K) proves 'C', then C." Here C ranges over the statements of the base language, and 'C' is its Gödel number (written as a unary literal). PA(K) is the object-level recursive enumeration given by the formalized function PA(.) applied to the constant symbol K; it should be standard that given a recursive enumeration of axioms, you can formalize in the language of PA the assertion that a particular number ('C') is the Gödel number of a theorem that follows from these axioms. [Nit: A language cannot be equivalent to a proof system, but I assume that you can't see how you can have a proof system P in the extended language that becomes equivalent to PA(SSS0) the moment that axiom is added.] That challenge is trivial: let P be the set of axioms {"SSS0=K implies A" | A an axiom of PA(SSS0)}. I'm doing something different, but still. I think if the above didn't help, it might be more helpful for me to first do a writeup of my now-hopefully-fixed proof, since the new version is simpler and IMO provides much better intuition (I at least now feel like I understand what's going on here in a way that I didn't quite before); I'll have to replace PA_K by yet another proof system, and I'll make my metalanguage more explicit about quoting/unquoting when writing it
1Eliezer Yudkowsky
Try two: Does the following language do the same thing as PAK? PAT = PA + symbol K + schema over Formulas of one variable: - For all P: -- if P proves-in-PAT 'Formula(K+1)', then Formula(K)
Ha! I'm not sure what the exact relation to PAK is, but PAT is almost exactly what I'm using in my new proof: PPT = PA + symbol K + schema over Formulas of one variable: - For all P: -- if P proves-in-PPT 'Formula(K)', and K>0, then Formula(K-1) ("PPT" for "parametric polymorphism trick, v 0.2".) [ETA: added "K>0" condition.]
PAK proves all which your PAT does, but probably not the other way around. I think this bit is all that's strictly necessary in the proof though.
1Eliezer Yudkowsky
Does the following language do the same thing as PAK? PAS = PA + schema over Formulas of one variable: - For all N, for all P: -- if P proves-in-PAS 'Formula(N+1)', then Formula(N) (Actually, on second thought, I retract that, you do want the special symbol so that proving 4 > 3 doesn't get you 3 > 3. But now I need to see if the same problem applies to the original logic as soon as you add K=3...)
0Eliezer Yudkowsky
(Still trying to read, will comment as I go.) Only if we redefine PA(n) so that e.g. PA(3) is defined in terms of a PA(.) schema and contains an axiom schema stating provable(PA(3), 'x')->x; and then redefine PAK accordingly. (This seems possible, but it's definitely an extra step that for all I know could trip up somewhere, and ought to be stated explicitly.) If we don't take it, then inst(n, x) gives us something that seems like it ought to be provable by analogy, but isn't a direct translation, because the original PA(n) didn't contain a provability formula quantified over PA(.) into which we could plug K=n; and since I think we were proving things about inst later, that's not good.
But that's not a redefinition -- that was the definition! [ETA: except you need PA(2) instead of PA(3) in the provable(_,_) predicate, of course.]

ASSUMING: "Safe for K steps" is exactly the same as "Will not double down on any input that will (self destruct or double down on any input that is not safe for K-1 steps)"

AI_K is unsafe for all K.

def TRAP_K(p)


if X=0 self destruct

if p= TRAP_(K-1) double down

take winnings


For my first program, I offer AI_150.

For my second program, I offer TRAP_151

Interesting attempt, I'm not completely clear upon the distinction between using a number n and the symbol K in the substitutions, but if we assume that bit works, I think your second paragraph proves PA_K to be inconsistent.

Since if your sentence C does not contain the constant K, then you have proved: given K>0, if PA_K proves C, then C. But all sentences which are in PA are of this form, hence by Löb's theorem they are all true.

But PA_K cannot prove that K>0. I think all your argument shows is that PA_K + "K>0" is a strictly stronger language that can prove the consistency of PA_K. I'm rather confident that PA_K is conservative over PA(1) for the reason given in the second note at the end of the post, which implies that PA_K is consistent. However, the second paragraph of the proof sketch is definitely broken, so it's possible that it implies incorrect things. Nevertheless, I don't think your argument works. I recently posted a comment giving an overview of my attempt to fix my proof; it contains a proof sketch for the soundness, and therefore consistency, of the system PPT considered in that comment, which I'm hoping is relatively easy to follow. If you can poke holes into that one or if you have other comments on it, that could be very helpful.