Paul Christiano has devised **a new fundamental approach** to the "Löb Problem" wherein Löb's Theorem seems to pose an obstacle to AIs building successor AIs, or adopting successor versions of their own code, that trust the same amount of mathematics as the original. (I am currently writing up a more thorough description of the *question *this preliminary technical report is working on answering. For now the main online description is in a quick Summit talk I gave. See also Benja Fallenstein's description of the problem in the course of presenting a different angle of attack. Roughly the problem is that mathematical systems can only prove the soundness of, aka 'trust', weaker mathematical systems. If you try to write out an exact description of how AIs would build their successors or successor versions of their code in the most obvious way, it looks like the mathematical strength of the proof system would tend to be stepped down each time, which is undesirable.)

Paul Christiano's approach is inspired by the idea that whereof one cannot prove or disprove, thereof one must assign probabilities: and that although no mathematical system can contain its own *truth* predicate, a mathematical system might be able to contain a reflectively consistent *probability* predicate. In particular, it looks like we can have:

∀a, b: (a < P(φ) < b) ⇒ P(a < P('φ') < b) = 1

∀a, b: P(a ≤ P('φ') ≤ b) > 0 ⇒ a ≤ P(φ) ≤ b

Suppose I present you with the human and probabilistic version of a Gödel sentence, the Whitely sentence "You assign this statement a probability less than 30%." If you disbelieve this statement, it is true. If you believe it, it is false. If you assign 30% probability to it, it is false. If you assign 29% probability to it, it is true.

Paul's approach resolves this problem by restricting your belief about your own probability assignment to within epsilon of 30% for any epsilon. So Paul's approach replies, "Well, I assign *almost* exactly 30% probability to that statement - maybe a little more, maybe a little less - in fact I think there's about a 30% chance that I'm a tiny bit under 0.3 probability and a 70% chance that I'm a tiny bit over 0.3 probability." A standard fixed-point theorem then implies that a consistent assignment like this should exist. If asked if the probability is over 0.2999 or under 0.30001 you will reply with a definite yes.

We haven't yet worked out a walkthrough showing if/how this solves the Löb obstacle to self-modification, and the probabilistic theory itself is nonconstructive (we've shown that something like this should exist, but not how to compute it). Even so, a possible fundamental triumph over Tarski's theorem on the undefinability of truth and a number of standard Gödelian limitations is important news as math *qua* math, though work here is still in very preliminary stages. There are even whispers of unrestricted comprehension in a probabilistic version of set theory with ∀φ: ∃S: P(x ∈ S) = P(φ(x)), though this part is not in the preliminary report and is at even earlier stages and could easily not work out at all.

It seems important to remark on how this result was developed: Paul Christiano showed up with the idea (of consistent probabilistic reflection via a fixed-point theorem) to a week-long "math squad" (aka MIRI Workshop) with Marcello Herreshoff, Mihaly Barasz, and myself; then we all spent the next week proving that version after version of Paul's idea couldn't work or wouldn't yield self-modifying AI; until finally, a day after the workshop was supposed to end, it produced something that looked like it might work. If we hadn't been trying to *solve *this problem (with hope stemming from how it seemed like the sort of thing a reflective rational agent ought to be able to do somehow), this would be just another batch of impossibility results in the math literature. I remark on this because it may help demonstrate that Friendly AI is a productive approach to math *qua *math, which may aid some mathematician in becoming interested.

I further note that this does not mean the Löbian obstacle is resolved and no further work is required. Before we can conclude that we need a computably specified version of the theory plus a walkthrough for a self-modifying agent using it.

See also the blog post on the MIRI site (and subscribe to MIRI's newsletter here to keep abreast of research updates).

This LW post is the preferred place for feedback on the paper.

EDIT: But see discussion on a Google+ post by John Baez here. Also see here for how to display math LaTeX in comments.

I'm

veryhappy both to see this result and the fact that prominent mathematicians such as John Baez and Timothy Gowers seem to be taking an interest in it. It considerably increases my confidence in thatMy confidence in MIRI being successful in its mission, while still pretty low, is now much higher than it was a few days ago.

John Baez has posted about the draft on Google+.

The main theorem of the paper (Theo. 2) does not seem to me to accomplish the goals stated in the introduction of the paper. I think that it might sneakily introduce a meta-language and that this is what "solves" the problem.

What I find unsatisfactory is that the assignment of probabilities to sentences is not shown to be definable in L. This might be too much to ask, but if nothing of the kind is required, the reflection principles lack teeth. In particular, I would guess that Theo. 2 as stated is trivial, in the sense that you can simply take to only have value 0 or 1. Note, that the third reflection principle imposes no constraint on the value of on sentences of L' \ L.

Your application of the diagonalisation argument to refute the reflection scheme (2) also seems suspect, since the scheme only quantifies over sentences of L and you apply it to a sentence G which might not be in L. To be exact, you do not claim that it refutes the scheme, only that it seems to refute it.

You are completely right, and thanks for the correction! A new version with this problem fixed should be up in a bit.

(The actual proof of theorem 2 establishes the correct thing, there is just a ' missing from the statement.)

Actually, I believe you've found a bug in the paper which everyone else seems to have missed so far, but fortunately it's just a typo in the statement of the theorem! The quantification should be over L', not over L, and the proof does prove this much stronger statement. The statement in the paper is indeed trivial for the reason you say.

Given the stronger statement, the reason you can't just have P have value 0 or 1 is sentences like G <=> P('G') < 0.5: if P(G) = 0, by reflection it would follow that P(G) = 1, and if P(G) = 1, then P(not G) = 0, and by reflection it would follow that P(not G) = 1.

The Kakutani fixed point theorem that this result relies on is Corollary 17.55 in Infinite Dimensional Analysis - A Hitchhiker's Guide by Aliprantis and Border.

Edit: pdf

Thanks Nisan!

Some technical remarks on topological aspects of the paper:The notion of a "coherent probability distribution" and the use of the product topology on the set of all such distributions are a bit ideosyncratic. Here's a way to relate them to more standard notions: Consider a countable first-order language L and the Stone space of complete theories over it. Equip this with the Borel-σ-algebra. (1.) A coherent probability distribution is (equivalent to) a probability distribution on the resulting measure space. (2.) The (AFAICT) most commonly used topology on the space of all such distributions is the one of weak convergence of probability measures. It turns out that this is equivalent to the product topology on the coherent probability distributions.

(1.) The main unusual feature of coherent probability distributions is that they're only defined on the sentences of L, which don't form a σ-algebra. Their equivalence classes under logical equivalence form a Boolean algebra, though, which is in particular a ring of sets (in the measure-theoretic sense) of complete theories (we identify each equivalence class with the set of complete theories containing the sentences in the equiv... (read more)

Wow, this is great work--congratulations! If it pans out, it bridges a really fundamental gap.

I'm still digesting the idea, and perhaps I'm jumping the gun here, but I'm trying to envision a UDT (or TDT) agent using the sense of subjective probability you define. It seems to me that an agent can get into trouble even if its subjective probability meets the coherence criterion. If that's right, some additional criterion would have to be required. (Maybe that's what you already intend? Or maybe the following is just muddled.)

Let's try invoking a coherent P in the case of a simple decision problem for a UDT agent. First, define G <--> P("G") < 0.1. Then consider the 5&10 problem:

If the agent chooses A, payoff is 10 if ~G, 0 if G.

If the agent chooses B, payoff is 5.

And suppose the agent can prove the foregoing. Then unless I'm mistaken, there's a coherent P with the following assignments:

P(G) = 0.1

P(Agent()=A) = 0

P(Agent()=B) = 1

P(G | Agent()=B) = P(G) = 0.1

And P assigns 1 to each of the following:

P("Agent()=A") < epsilon

P("Agent()=B") > 1-epsilon

P("G & Agent()=B") / P("Agent()=B") = 0.1 +- epsilon

P("G... (read more)

John Baez just posted a blog article on the topic which is longer than his previous Google+ post.

The informal idea seems very interesting, but I'm hesitant about the formalisation. The one-directional => in the reflection principle worries me. This "sort of thing" already seems possible with existing theories of truth. We can split the Tarski biconditional A<=>T("A") into the

quotation principleA=>T("A") and thedequotation principleT("A")=>A. Many theories of truth keep one or the other. A quotational theory may be called a "glut theory": the sentences reguarded as true are a strict superset of those which are true, which means we have T("A") and T("~A") for some A. Disquotational theories, on the other hand, will be "gap theories": the sentences asserted true are a strict subset of those which are true, so that we have ~T("A") and ~T("~A") for some A. This classification oversimplifies the situation (it makes some unfounded assumptions), but...The point is, the single-direction arrow makes this reflection principle look like a quotational ("glut") theory.

Usually, for a glut theory, we want a

restricteddisquotational principle to hold. The quotationa... (read more)Yes, the theory is logically coherent, so it can't have P("a < P('A')") = 1 and P("a > P('A')") = 1.

For example, the following disquotational principle follows from the reflection principle (by taking contrapositives):

P( x <= P(A) <= y) > 0 ----> x <= P(A) <= y

The unsatisfying thing is that one direction has "<=" while the other has "<". But this corresponds to a situation where you can make arbitrarily precise statements about P, you just can't make exact statements. So you can say "P is within 0.00001 of 70%" but you can't say "P is exactly 70%."

I would prefer be able to make exact statements, but I'm pretty happy to accept this limitation, which seems modest in the scheme of things---after all, when I'm writing code I never count on exact comparisons of floats anyway!

This was intended as a preliminary technical report, and we'll include much more discussion of these philosophical issues in future write-ups (also much more mathematics).

Ok, I see!

To put it a different way (which may help people who had the same confusion I did):

We can't set all statements about probabilities to 1, because P is encoded as a function symbol within the language, so we can't make inconsistent statements about what value it takes on. ("Making a statement" simply means setting a value of P to 1.)

I'm very pleased with the simplicity of the paper; short is good in this case.

I'm actually quite surprised. Didn't think I'd be saying that but looks like good job math wise (edit: obvious caveats apply: not my area of expertise, don't know if novel, don't know if there's any related proofs). (As far as x-risks go though, something not well designed for self improvement is a lot less scary).

It's not obvious to me that this work is relevant for AI safety research, but it seems mathematically interesting on it's own (though it's not my area of experise either, so I can't evaluate it).

I'm positively impressed. Honestly, I'd never thought that SI/MIRI would deliver anything scientifically noteworhy. I hope this is the beginning of a positive trend.

Note that Paul keeps updating the draft, always at the same URL. That's convenient because links always point to the latest draft, but you may need to hit 'refresh' on the PDF's URL a couple times to get your browser to not just pull up the cached version. As of this moment, the current draft at that URL should say "revised March 26" at the top.

One subtlety that has been bothering me: In the draft's statement of Theorem 2, we start with a language L and a consistent theory T over L, then go on considering the extension L' of L with a function symbol P(.). This means that P(.) cannot appear in any axiom schemas in T; e.g., if T is ZFC, we cannot use a condition containing P(.) in comprehension. This seems likely to produce subtle gaps in applications of this theory. In the case of ZFC, this

mightnot be a problem since the function denoted by P(.) is representable by an ordinary set at least in the standard models, so the free parameters in the axiom schemas might save us -- though then again, this approach takes us beyond the standard models...I think not having to think about such twiddly details might be worth making the paper a tad more complicated by allowing T to be in the extended language. I haven't worked through the details, but I

thinkall that should change is that we need to make sure that A(P) is nonempty, and Ithinkthat to get this it should suffice to require that T is consistent with any possible reflection axiom a < P("phi") < b. [ETA:...okay, I guess that's not enough: consider ((NOT phi) OR (NOT psi)), where phi and psi are reflection axioms. Might as well directly say "consistent with A_P for any coherent probability distribution P" then, I suppose.]Sorry, we're miscommunicating somewhere. What I'm saying is that e.g. given a set of statements, I want the axiom asserting the existence of the set %20%3E%201/2\}), i.e. the comprehension axiom applied to the condition %20%3E%201/2). I don't understand how this would lead to %20=%20\mathbb{P}(x%20\notin%20x)); could you explain? (It seems like you're talking about unrestricted comprehension of some sort; I'm just talking about allowing the condition in ordinary restricted comprehension to range over formulas in L'. Maybe the problem you have in mind only occurs in the unrestricted comprehension work which isn't in this draft?)

Consider my proposed condition that " is consistent with for any coherent distribution ". To see that this is true for ZFC in the language L', choose a standard model of ZFC in L and, for any function from the sentences of L' to , extend it to a model in L' by interpreting ) as ); unless I'm being stupid somehow, it's clear that the extended model will satisfy ZFC-in-L' + .

It seems to me that the only parts of the proof that need to be re-thought are the arguments that (a) and (b) ) are non-empty. Perhaps the easiest way to say the argument is t... (read more)

Comments on the proof of Theorem 2:I believe the entire second-to-last paragraph can be replaced by:

I'm thinking that the proof becomes conceptually slightly clearer if you show that the graph is closed by showing that it contains the limit of any convergent sequence, though:

Another dilettante question: Is there a mathematics of statements with truth values that change? I'm imagining "This statement is false" getting filed under "2-part cycle".

Yes. There is an overview of revision theories of truth (and many other approaches to defining truth) at the SEP.

I've never heard of it but my intuition says it doesn't sound very promising - any correspondence definition of "truth" that changes over time should reduce to a timeless time-indexed truth predicate and I don't see how that would help much, unless the truth predicate couldn't talk about the index.

I think that Theo. 2 of your paper, in a sense, produces only non-standard models. Am I correct?

Here is my reasoning:

If we apply Theo. 2 of the paper to a theory T containing PA, we can find a sentence G such that "G <=> Pr("G")=0" is a consequence of T. ("Pr" is the probability predicate inside L'.)

If P(G) is greater than 1/n (I use P to speak about probability over the distribution over models of L'), using the reflection scheme, we get that with probability one, "Pr("G")>1/n" is true. This mean... (read more)

This is right. (I think we point out in the paper that the model necessarily contains non-standard infinitesimals.) But an interesting question is: can this distribution assign any model a positive probability? I would guess not (e.g. because P("G") is uniformly random for some "G").

There may be an issue with the semantics here. I'm not entirely certain of the reasoning here, but here goes:

Reflection axiom: ∀a, b: (a < P(φ) < b) ⇒ P(a < P('φ') < b) = 1 (the version in Paul's paper).

Using diagonalisation, define G ⇔ -1<P('G')<1

Then P(G)<1

⇒ -1 0)

⇒ P(-1<P('G')<1)=1 (by reflection)

⇒ P(G)=1 (by definition of G).

Hence, by contradiction (and since P cannot be greater than 1), P(G)=1. Hence P('G')=1 (since the two P's are the same formal object), and hence G is false. So we have a false result that has probability... (read more)

P(G) is a real number in the interval [0, 1], so you shouldn't be

toosad that P(P(G) = p) = 0 for each p. More generally, you shouldn't expect good behavior when P talks about exact statements---P can be mistaken about itself up to some infinitesimal epsilon, so it can be wrong about whether P(G) is exactly 1 or slightly less.We can prove P(G) = 1, but the system can't. The reflection schema is not a theorem---each instance is true (and is assigned probability 1), and the universal quantification is true, but the universal quantification is not assigned probability 1. In fact, it is assigned probability 0.

Congrats to MIRI! Just some comments that I already made to Mihaly and Paul:

1) Agree with Abram that unidirectional implication is disappointing and the paper should mention the other direction.

2) The statement "But this leads directly to a contradiction" at the top of page 4 seems to be wrong, because the sentence P(G)<1 isn't of the form P(phi)=p required for the reflection principle.

3) It's not clear whether we could have bidirectional implication if we used closed intervals instead of open, maybe worth investigating further.

Told about the Whitely sentence to a friend who is studying physics. After two minutes of reflection, he suggested: "it ought to be like a delta function around 30%". Needless to say, I was blown away.

(Bit off-topic:)

Why are we so invested in solving the Löb problem in the first place?

In a scenario in which there is AGI that has not yet foomed, would that AGI not refrain from rewriting itself until

ithad solved the Löb problem, just as Gandhi would reject taking a pill which would make him more powerful at the risk of changing his goals?In other words, does coming up with a sensible utility function not take precedence? Let the AGI figure out the rewriting details, if the utility function is properly implemented, the AGI won't risk changing itself unti... (read more)

Another stupid question: would relaxing the schema 3 to incomplete certainty allow the distribution P to be computable? If so, can you get arbitrarily close to certainty and still keep P nice enough to be useful? Does the question even make sense?

This is all nifty and interesting, as mathematics, but I feel like you are probably barking up the wrong tree when it comes to applying this stuff to AI. I say this for a couple of reasons:

First, ZFC itself is already comically overpowered. Have you read about reverse mathematics? Stephen Simpson edited a good book on the topic. Anyway, my point is that there's a whole spectrum of systems a lot weaker than ZFC that are sufficient for a large fraction of theorems, and probably all the reasoning that you would ever need to do physics or make real wo... (read more)

The result works for theories as simple as Peano arithmetic.

The problem is if your mathematical power has to go down each time you create a successor or equivalently self-modify. If PA could prove itself sound that might well be enough for many purposes. The problem is if you need a system that proves another system sound and in this case the system strength has to be stepped down each time. That is the Lob obstacle.

I gave this paper a pretty thorough read-through, and decided that the best way to

reallysee what was going on would be to write it all down, expanding the most interesting arguments and collapsing others. This has been my main strategy for learning math for a while now, but I just decided recently that I ought to start a blog for this kind of thing. Although I wrote this post mainly for myself, it might be helpful for anyone else who's interested in reading the paper.Towards the end, I note that the result can be made "constructive", in the sen... (read more)

The above is an extraordinarily good example of why lukeprog no longer talks to you.

A stupid question from a dilettante... Does this avenue of research promise to eventually be able to assign probabilities to interesting statements, like whether P=NP or whether some program halts, or whether some formal system is self-consistent?

Promise? No. Might it start down an avenue that someday leads there after a whole lot more work and some additional ideas? Maybe. Don't hold your breath on that P!=NP formal probability calculation.

Annoying pedantic foundational issues: The collection of models is a proper class, but we can instead consider probability measures over the set of consistent valuations , which is a set. And it suffices to consider the product σ-algebra.

This idea reminds me of some things in mechanism design, where it turns out you can actually prove a lot more for weak dominance than for strong dominance, even though one might naïvely suppose that the two should be equivalent when dealing with a continuum of possible actions/responses. (I'm moderately out of my depth with this post, but the comparison might be interesting for some.)

What would the world look like if I assigned "I assign this statement a probability less than 30%." a probability of 99%? What would the world look like if I assigned it a probability of 1%?

The idea is similar in concept to fuzzy theories of truth (which try to assign fractional values to paradoxes), but succeeds where those fail (ie, allowing quantifiers) to a surprising degree. Like fuzzy theories and many other extra-value theories, the probabilistic self-reference is a "vague theory" of sorts; but it seems to be minimally vague (the vagueness comes down to infinitesimal differences!).

Comparing it to my version of probabilities over logic, we see that this system will trust its axioms to an arbitrary high probability, whereas if ... (read more)

Just found a few typos in my notes that I had forgotten to post. Hopefully somebody's still reading this.

In the proof of Theorem 2, you write "Clearly is convex." This isn't clear to me; could you explain what I am missing?

More specifically, let ) be the subset of obeying %20%3C%20b%20\%20\implies%20\%20\mathbb{P}\left(%20a%20%3C%20\mathbb{P}(\lceil%20\phi%20\rceil)%20%3C%20b%20\right)%20=1%20). So }%20X(\phi,a,b)). If ) were convex, then would be as well.

But ) is not convex. Project ) onto in the coordinates corresponding to the sentences and %20%3C%20b). The image is %20\cup%20\left(%20%20[0,1]%20\times%20\{%201%20\}%20\right)%20\cup%20\left(... (read more)

Did anything happen with the unrestricted comprehension?

Regarding the epsilons and stuff: you guys might want to look in to hyperreal numbers.

I have felt probabilistic reasoning about mathematics is important ever since I learned about the Goedel machine, for a reason related but somewhat different from Loeb's problem. Namely, the Goedel machine is limited to self-improvements provable within an axiom system A whereas humans seem to be able to use arbitrarily powerful axiom systems. It can be argued that Piano arithmetics is "intuitively true" but ZFC is only accepted by mathematicians because of empirical considerations: it just works. Similarly a self-improving AI should be able to a... (read more)

My memory is that this was on the last day of the workshop, when only Paul and Marcello had the mental energy reserves to come into the office again.

Typo in section 2.2 Going Meta - agumented should be augmented

When reading this paper, and the background, I have a recurring intuition that the best approach to this problem is a distributed, probabilistic one. I can't seem to make this more coherent on my own, so posting thoughts in the hope discussion will make it clearer:

ie, have a group of related agents, with various levels of trust in each others' judgement, each individually asses how likely a descendant will be to take actions which only progress towards a given set of goals.

While each individual agent may only be able to asses a subset of a individual desc... (read more)

This seems like another "angels dancing on the head of a pin" question. I am not willing to assign numerical probabilities to any statement whose truth value is unknown, unless there is a compelling reason to choose that specific number (such as, the question is about a predictable process such as the result of a die roll). It seems to me that people who do so assign, are much more likely to get basic problems of probability theory (such as the Monty Hall problem) wrong than those who resist the urge.

Not really my field of expertise, thus I may be missing something, but can't you just set P = 1 for all classically provably true statements, P = 0 for all provably false statements and P = 0.5 to the rest, essentially obtaining a version of three-valued logic?

If that's correct, does your approach prove anything fundamentally different than what has been proved about known three-valued logics such as Kleene's or Łukasiewicz's?

In ZF set theory, consider the following three statements.

I) The axiom of choice is false

II) The axiom of choice is true and the continuum hypothesis is false

III) The axiom of choice is true and the continuum hypothesis is true

None of these is provably true or false so they all get assigned probability 0.5 under your scheme. This is a blatant absurdity as they are mutually exclusive so their probabilities cannot possibly sum to more than 1

`s/adopoting/adopting/`

I think this was how the heretic Geth got started. #generalisingfromfictionalevidence

... Why does it matter to an AI operating within a system p if proposition C is true? Shouldn't we need to create the most powerful sound system p, and then instead of caring about what is true, care about what is provable?

EDIT: Or instead of tracking 'truth', track 'belief'. "If this logical systems proves proposition C, then C is believed.", where 'believed' is a lower standard than proved (Given a proof of "If a than b" and belief in a, the result is only a belief in b)