Paul Christiano has devised a new fundamental approach to the "Löb Problem" wherein Löb's Theorem seems to pose an obstacle to AIs building successor AIs, or adopting successor versions of their own code, that trust the same amount of mathematics as the original.  (I am currently writing up a more thorough description of the question this preliminary technical report is working on answering.  For now the main online description is in a quick Summit talk I gave.  See also Benja Fallenstein's description of the problem in the course of presenting a different angle of attack.  Roughly the problem is that mathematical systems can only prove the soundness of, aka 'trust', weaker mathematical systems.  If you try to write out an exact description of how AIs would build their successors or successor versions of their code in the most obvious way, it looks like the mathematical strength of the proof system would tend to be stepped down each time, which is undesirable.)

Paul Christiano's approach is inspired by the idea that whereof one cannot prove or disprove, thereof one must assign probabilities: and that although no mathematical system can contain its own truth predicate, a mathematical system might be able to contain a reflectively consistent probability predicate.  In particular, it looks like we can have:

∀a, b: (a < P(φ) < b)          ⇒  P(a < P('φ') < b) = 1
∀a, b: P(a ≤ P('φ') ≤ b) > 0  ⇒  a ≤ P(φ) ≤ b

Suppose I present you with the human and probabilistic version of a Gödel sentence, the Whitely sentence "You assign this statement a probability less than 30%."  If you disbelieve this statement, it is true.  If you believe it, it is false.  If you assign 30% probability to it, it is false.  If you assign 29% probability to it, it is true.

Paul's approach resolves this problem by restricting your belief about your own probability assignment to within epsilon of 30% for any epsilon.  So Paul's approach replies, "Well, I assign almost exactly 30% probability to that statement - maybe a little more, maybe a little less - in fact I think there's about a 30% chance that I'm a tiny bit under 0.3 probability and a 70% chance that I'm a tiny bit over 0.3 probability."  A standard fixed-point theorem then implies that a consistent assignment like this should exist.  If asked if the probability is over 0.2999 or under 0.30001 you will reply with a definite yes.

We haven't yet worked out a walkthrough showing if/how this solves the Löb obstacle to self-modification, and the probabilistic theory itself is nonconstructive (we've shown that something like this should exist, but not how to compute it).  Even so, a possible fundamental triumph over Tarski's theorem on the undefinability of truth and a number of standard Gödelian limitations is important news as math qua math, though work here is still in very preliminary stages.  There are even whispers of unrestricted comprehension in a probabilistic version of set theory with ∀φ: ∃S: P(x ∈ S) = P(φ(x)), though this part is not in the preliminary report and is at even earlier stages and could easily not work out at all.

It seems important to remark on how this result was developed:  Paul Christiano showed up with the idea (of consistent probabilistic reflection via a fixed-point theorem) to a week-long "math squad" (aka MIRI Workshop) with Marcello Herreshoff, Mihaly Barasz, and myself; then we all spent the next week proving that version after version of Paul's idea couldn't work or wouldn't yield self-modifying AI; until finally, a day after the workshop was supposed to end, it produced something that looked like it might work.  If we hadn't been trying to solve this problem (with hope stemming from how it seemed like the sort of thing a reflective rational agent ought to be able to do somehow), this would be just another batch of impossibility results in the math literature.  I remark on this because it may help demonstrate that Friendly AI is a productive approach to math qua math, which may aid some mathematician in becoming interested.

I further note that this does not mean the Löbian obstacle is resolved and no further work is required.  Before we can conclude that we need a computably specified version of the theory plus a walkthrough for a self-modifying agent using it.

See also the blog post on the MIRI site (and subscribe to MIRI's newsletter here to keep abreast of research updates).

This LW post is the preferred place for feedback on the paper.

EDIT:  But see discussion on a Google+ post by John Baez here.  Also see here for how to display math LaTeX in comments.

New Comment
168 comments, sorted by Click to highlight new comments since:
Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings

I'm very happy both to see this result and the fact that prominent mathematicians such as John Baez and Timothy Gowers seem to be taking an interest in it. It considerably increases my confidence in that

  • Current MIRI-folk will be capable of producing original, valuable, solid math results
  • By producing such results, they will become capable of attracting more talented mathematicians and producing even more such results

My confidence in MIRI being successful in its mission, while still pretty low, is now much higher than it was a few days ago.

John Baez has posted about the draft on Google+.

And, Baez's post on Google+ was reshared 26 times.

The main theorem of the paper (Theo. 2) does not seem to me to accomplish the goals stated in the introduction of the paper. I think that it might sneakily introduce a meta-language and that this is what "solves" the problem.

What I find unsatisfactory is that the assignment of probabilities to sentences is not shown to be definable in L. This might be too much to ask, but if nothing of the kind is required, the reflection principles lack teeth. In particular, I would guess that Theo. 2 as stated is trivial, in the sense that you can simply take to only have value 0 or 1. Note, that the third reflection principle imposes no constraint on the value of on sentences of L' \ L.

Your application of the diagonalisation argument to refute the reflection scheme (2) also seems suspect, since the scheme only quantifies over sentences of L and you apply it to a sentence G which might not be in L. To be exact, you do not claim that it refutes the scheme, only that it seems to refute it.

You are completely right, and thanks for the correction! A new version with this problem fixed should be up in a bit.

(The actual proof of theorem 2 establishes the correct thing, there is just a ' missing from the statement.)


Actually, I believe you've found a bug in the paper which everyone else seems to have missed so far, but fortunately it's just a typo in the statement of the theorem! The quantification should be over L', not over L, and the proof does prove this much stronger statement. The statement in the paper is indeed trivial for the reason you say.

Given the stronger statement, the reason you can't just have P have value 0 or 1 is sentences like G <=> P('G') < 0.5: if P(G) = 0, by reflection it would follow that P(G) = 1, and if P(G) = 1, then P(not G) = 0, and by reflection it would follow that P(not G) = 1.


The Kakutani fixed point theorem that this result relies on is Corollary 17.55 in Infinite Dimensional Analysis - A Hitchhiker's Guide by Aliprantis and Border.

Edit: pdf

Thanks Nisan!


Some technical remarks on topological aspects of the paper:

The notion of a "coherent probability distribution" and the use of the product topology on the set of all such distributions are a bit ideosyncratic. Here's a way to relate them to more standard notions: Consider a countable first-order language L and the Stone space of complete theories over it. Equip this with the Borel-σ-algebra. (1.) A coherent probability distribution is (equivalent to) a probability distribution on the resulting measure space. (2.) The (AFAICT) most commonly used topology on the space of all such distributions is the one of weak convergence of probability measures. It turns out that this is equivalent to the product topology on the coherent probability distributions.

(1.) The main unusual feature of coherent probability distributions is that they're only defined on the sentences of L, which don't form a σ-algebra. Their equivalence classes under logical equivalence form a Boolean algebra, though, which is in particular a ring of sets (in the measure-theoretic sense) of complete theories (we identify each equivalence class with the set of complete theories containing the sentences in the equiv... (read more)

Wow, this is great work--congratulations! If it pans out, it bridges a really fundamental gap.

I'm still digesting the idea, and perhaps I'm jumping the gun here, but I'm trying to envision a UDT (or TDT) agent using the sense of subjective probability you define. It seems to me that an agent can get into trouble even if its subjective probability meets the coherence criterion. If that's right, some additional criterion would have to be required. (Maybe that's what you already intend? Or maybe the following is just muddled.)

Let's try invoking a coherent P in the case of a simple decision problem for a UDT agent. First, define G <--> P("G") < 0.1. Then consider the 5&10 problem:

  • If the agent chooses A, payoff is 10 if ~G, 0 if G.

  • If the agent chooses B, payoff is 5.

And suppose the agent can prove the foregoing. Then unless I'm mistaken, there's a coherent P with the following assignments:

P(G) = 0.1

P(Agent()=A) = 0

P(Agent()=B) = 1

P(G | Agent()=B) = P(G) = 0.1

And P assigns 1 to each of the following:

P("Agent()=A") < epsilon

P("Agent()=B") > 1-epsilon

P("G & Agent()=B") / P("Agent()=B") = 0.1 +- epsilon

P("G... (read more)

I've also tried applying this theory to UDT, and have run into similar 5-and-10-ish problems (though I hadn't considered making the reward depend on a statement like G, that's a nice trick!). My tentative conclusion is that the reflection principle is too weak to have much teeth when considering a version of UDT based on conditional expected utility, because for all actions A that the agent doesn't take, we have P(Agent() = A) = 0; we might still have P("Agent() = A") > 0 (but smaller than epsilon), but the reflection axioms do not need to hold conditional on Agent() = A, i.e., for X a reflection axiom we can have P assign positive probability to e.g. P("X & Agent() = A") / P("Agent() = A") < 0.9. But it's difficult to ask for more. In order to evaluate the expected utility conditional on choosing A, we need to coherently imagine a world in which the agent would choose A, and if we also asked the probability distribution conditional on choosing A to satisfy the reflection axioms, then choosing A would not be optimal conditional on choosing A -- contradiction to the agent choosing A... (We could have P("Agent() = A") = 0, but not if you have the agent playing chicken, i.e., play A if P("Agent() = A"); if we have such a chicken-playing agent, we can coherently imagine a world in which it would play A -- namely, a world in which P("Agent() = A") = 0 -- but this is a world that assigns probability zero to itself. To make this formal, replace "world" by "complete theory".) I think applying this theory to UDT will need more insights. One thing to play with is a formalization of classical game theory: * Specify a decision problem by a function from (a finite set of) possible actions to utilities. This function is allowed to be written in the full formal language containing P("."). * Specify a universal agent which takes a decision problem D(.), evaluates the expected utility of every action -- not in the UDT way of conditioning on Agent(D) = A, but by simply evaluatin
It occurs to me that my references above to "coherence" should be replaced by "coherence & P(T)=1 & reflective consistency". That is, there exists (if I understand correctly) a P that has all three properties, and that assigns the probabilities listed above. Therefore, those three properties would not suffice to characterize a suitable P for a UDT agent. (Not that anyone has claimed otherwise.)

John Baez just posted a blog article on the topic which is longer than his previous Google+ post.

The informal idea seems very interesting, but I'm hesitant about the formalisation. The one-directional => in the reflection principle worries me. This "sort of thing" already seems possible with existing theories of truth. We can split the Tarski biconditional A<=>T("A") into the quotation principle A=>T("A") and the dequotation principle T("A")=>A. Many theories of truth keep one or the other. A quotational theory may be called a "glut theory": the sentences reguarded as true are a strict superset of those which are true, which means we have T("A") and T("~A") for some A. Disquotational theories, on the other hand, will be "gap theories": the sentences asserted true are a strict subset of those which are true, so that we have ~T("A") and ~T("~A") for some A. This classification oversimplifies the situation (it makes some unfounded assumptions), but...

The point is, the single-direction arrow makes this reflection principle look like a quotational ("glut") theory.

Usually, for a glut theory, we want a restricted disquotational principle to hold. The quotationa... (read more)

Is there anything blocking you from setting P("a<P('A')<b")=1 for all cases, and therefore satisfying the stated reflection principle trivially?

Yes, the theory is logically coherent, so it can't have P("a < P('A')") = 1 and P("a > P('A')") = 1.

For example, the following disquotational principle follows from the reflection principle (by taking contrapositives):

P( x <= P(A) <= y) > 0 ----> x <= P(A) <= y

The unsatisfying thing is that one direction has "<=" while the other has "<". But this corresponds to a situation where you can make arbitrarily precise statements about P, you just can't make exact statements. So you can say "P is within 0.00001 of 70%" but you can't say "P is exactly 70%."

I would prefer be able to make exact statements, but I'm pretty happy to accept this limitation, which seems modest in the scheme of things---after all, when I'm writing code I never count on exact comparisons of floats anyway!

This was intended as a preliminary technical report, and we'll include much more discussion of these philosophical issues in future write-ups (also much more mathematics).

Ok, I see!

To put it a different way (which may help people who had the same confusion I did):

We can't set all statements about probabilities to 1, because P is encoded as a function symbol within the language, so we can't make inconsistent statements about what value it takes on. ("Making a statement" simply means setting a value of P to 1.)

I'm very pleased with the simplicity of the paper; short is good in this case.

Actually, we can use coherence to derive a much more symmetric disquotation principle: P(x>P(A)>y)=1 => x>P(A)>y. Suppose P(x>P(A)>y)=1. For contradiction, suppose P(A) is outside this range. Then we would also have P(w>P(A)>z)=1 for some (w,z) mutually exclusive with (x,y), contradicting coherence. Right?
Not quite---if P(A) = x or P(A) = y, then they aren't in any interval (w, z) which is non-overlapping (x, y). We can obtain P(x > P(A) > y) =1 ---> x >= P(A) >= y by this argument. We can also obtain P(x >= P(A) >= y) > 0 ---> x >= P(A) >= y.
Ah, right, good!
So, herein lies the "glut" of the theory: we will have more > statements than are strictly true. > will behave as >= should: if we see > as a conclusion in the system, we have to think >= with respect to the "true" P. A "gap" theory of similar kind would instead report too few inequalities...
Yes, there is an infinitesimal glut/gap; similarly, the system reports fewer >= statements than are true. This seems like another way at looking at the trick that makes it work---if you have too many 'True' statements on both sides you have contradictions, if you have too few you have gaps, but if you have too many > statements and too few >= statements they can fit together right.
The negated statements become 'or', so we get x <= P(A) or P(A) <= y, right? To me, the strangest thing about this is the >0 condition... if the probability of this type of statement is above 0, it is true!
I agree that the derivation of (4) from (3) in the paper is unclear. The negation of a=b>=c.
Ah, so there are already revisions... (I didn't have a (4) in the version I read).

I'm actually quite surprised. Didn't think I'd be saying that but looks like good job math wise (edit: obvious caveats apply: not my area of expertise, don't know if novel, don't know if there's any related proofs). (As far as x-risks go though, something not well designed for self improvement is a lot less scary).


It's not obvious to me that this work is relevant for AI safety research, but it seems mathematically interesting on it's own (though it's not my area of experise either, so I can't evaluate it).

I'm positively impressed. Honestly, I'd never thought that SI/MIRI would deliver anything scientifically noteworhy. I hope this is the beginning of a positive trend.

Note that Paul keeps updating the draft, always at the same URL. That's convenient because links always point to the latest draft, but you may need to hit 'refresh' on the PDF's URL a couple times to get your browser to not just pull up the cached version. As of this moment, the current draft at that URL should say "revised March 26" at the top.

Tip: Most browsers have a "Force full refresh" option that forces clearing the cache and re-downloading all assets on the page, usually assigned the shortcut Ctrl+F5 by default.

One subtlety that has been bothering me: In the draft's statement of Theorem 2, we start with a language L and a consistent theory T over L, then go on considering the extension L' of L with a function symbol P(.). This means that P(.) cannot appear in any axiom schemas in T; e.g., if T is ZFC, we cannot use a condition containing P(.) in comprehension. This seems likely to produce subtle gaps in applications of this theory. In the case of ZFC, this might not be a problem since the function denoted by P(.) is representable by an ordinary set at least in the standard models, so the free parameters in the axiom schemas might save us -- though then again, this approach takes us beyond the standard models...

I think not having to think about such twiddly details might be worth making the paper a tad more complicated by allowing T to be in the extended language. I haven't worked through the details, but I think all that should change is that we need to make sure that A(P) is nonempty, and I think that to get this it should suffice to require that T is consistent with any possible reflection axiom a < P("phi") < b. [ETA: ...okay, I guess that's not enough: consider ((NOT phi) OR (NOT psi)), where phi and psi are reflection axioms. Might as well directly say "consistent with A_P for any coherent probability distribution P" then, I suppose.]

I think including a condition with P in comprehension will require more subtlety---for example, such an axiom would not be consistent with most coherent distributions P . If I include the axiom P(x in S) = P(x not in x), then it better be the case that P(S in S) = 1/2. We can still do it, but I don't see how to factor out the work into this paper. If you see a useful way to generalize this result to include axioms of L', that would be cool; I don't see one immediately, which is why it is in this form.

Sorry, we're miscommunicating somewhere. What I'm saying is that e.g. given a set of statements, I want the axiom asserting the existence of the set %20%3E%201/2\}), i.e. the comprehension axiom applied to the condition %20%3E%201/2). I don't understand how this would lead to %20=%20\mathbb{P}(x%20\notin%20x)); could you explain? (It seems like you're talking about unrestricted comprehension of some sort; I'm just talking about allowing the condition in ordinary restricted comprehension to range over formulas in L'. Maybe the problem you have in mind only occurs in the unrestricted comprehension work which isn't in this draft?)

Consider my proposed condition that " is consistent with for any coherent distribution ". To see that this is true for ZFC in the language L', choose a standard model of ZFC in L and, for any function from the sentences of L' to , extend it to a model in L' by interpreting ) as ); unless I'm being stupid somehow, it's clear that the extended model will satisfy ZFC-in-L' + .

It seems to me that the only parts of the proof that need to be re-thought are the arguments that (a) and (b) ) are non-empty. Perhaps the easiest way to say the argument is t... (read more)

I understand what you are saying. You are completely right, thanks for the observation. I don't have time to muck with the paper now, but it looks like this would work.

Comments on the proof of Theorem 2:

I believe the entire second-to-last paragraph can be replaced by:

If there is no such , then . Thus we can take to be the complement of (which is open since is closed) and let be the entire space.

I'm thinking that the proof becomes conceptually slightly clearer if you show that the graph is closed by showing that it contains the limit of any convergent sequence, though:

Suppose ) for all , and \to(\mathbb{P}',\mathbb{P})). Since is closed, we have . We must show that %20=%201). Thus, suppose that %3Cb). Since \to\mathbb{P}(\varphi)), for all sufficiently large we have %3Cb) and therefore %20%3C%20b)%20=%201). Since , it follows that %20%3C%20b)%20=%201). In other words, assigns probability 1 to every element of , and hence to all of (since this set is countable).

I think this is a cleaner way of saying it; I'm going to take this approach. Thanks!

Another dilettante question: Is there a mathematics of statements with truth values that change? I'm imagining "This statement is false" getting filed under "2-part cycle".

Yes. There is an overview of revision theories of truth (and many other approaches to defining truth) at the SEP.

I've never heard of it but my intuition says it doesn't sound very promising - any correspondence definition of "truth" that changes over time should reduce to a timeless time-indexed truth predicate and I don't see how that would help much, unless the truth predicate couldn't talk about the index.

The truth predicate could talk about the past (but not the present), but I believe that reduces down to other things that have been explored quite thoroughly (iirc, an idea of a hierarchy of math with higher levels being able to talk about lower levels but not vice versa was around for Bertrand Russel to endorse). And at that point "time" is just a label for what level of explanatory power you are talking about.
(This reminds me of when I short-circuited a NOT gate, curious to see what would happen and suspecting it might oscillate between the two values. Actually, it stabilized to a voltage somewhere between “0” and “1” and stayed there.)
If you posit that: "This statement is false" (i.e. the fixpoint μ x. ¬x ) oscillates or "switches" between TRUE and FALSE, then this is pretty much how digital logic works in electronics, because the NOT operation is implemented as a logic gate with a finite delay.
You could look up George Spencer Brown's "Laws of Form". It's...odd. (Poking this in through a phone otherwise would go into more detail.)
I read it long ago, but didn't understand the parts that looked like circuits. I should probably give it another try.

I think that Theo. 2 of your paper, in a sense, produces only non-standard models. Am I correct?

Here is my reasoning:

If we apply Theo. 2 of the paper to a theory T containing PA, we can find a sentence G such that "G <=> Pr("G")=0" is a consequence of T. ("Pr" is the probability predicate inside L'.)

If P(G) is greater than 1/n (I use P to speak about probability over the distribution over models of L'), using the reflection scheme, we get that with probability one, "Pr("G")>1/n" is true. This mean... (read more)

This is right. (I think we point out in the paper that the model necessarily contains non-standard infinitesimals.) But an interesting question is: can this distribution assign any model a positive probability? I would guess not (e.g. because P("G") is uniformly random for some "G").

I agree with your guess and I think that I see a way of proving it. Let us make the same assumptions as in Theo. 2 and assume that T contains PA. Let G be any sentence of L'. We can build a second sentence G2 such that "G2 <=> ( G and (Pr(G and G2)< p))" for P(G)/3< p <2P(G)/3, using diagonalization. From this and the reflection scheme, it should be possible to prove that P(G and G2) and P(G and not G2) are both smaller than (2/3)P(G). Repeating the argument above, we can show that any complete theory in L' must have vanishing probability and therefore every model must also have vanishing probability.

There may be an issue with the semantics here. I'm not entirely certain of the reasoning here, but here goes:

Reflection axiom: ∀a, b: (a < P(φ) < b) ⇒ P(a < P('φ') < b) = 1 (the version in Paul's paper).

Using diagonalisation, define G ⇔ -1<P('G')<1

Then P(G)<1

⇒ -1 0)

⇒ P(-1<P('G')<1)=1 (by reflection)

⇒ P(G)=1 (by definition of G).

Hence, by contradiction (and since P cannot be greater than 1), P(G)=1. Hence P('G')=1 (since the two P's are the same formal object), and hence G is false. So we have a false result that has probability... (read more)

P(G) is a real number in the interval [0, 1], so you shouldn't be too sad that P(P(G) = p) = 0 for each p. More generally, you shouldn't expect good behavior when P talks about exact statements---P can be mistaken about itself up to some infinitesimal epsilon, so it can be wrong about whether P(G) is exactly 1 or slightly less.

We can prove P(G) = 1, but the system can't. The reflection schema is not a theorem---each instance is true (and is assigned probability 1), and the universal quantification is true, but the universal quantification is not assigned probability 1. In fact, it is assigned probability 0.

Hum... So the system is unable to prove that -1 < P('G') < 1 implies P('-1 < P('G') < 1') (note where the ' are)? This is a (true) statement about the function P.
So, first of all, there isn't really a particular formal proof system in the paper, except sort of implicitly; all we get is a function P(.) from L'-sentences to real numbers. However, we can talk about the first-order theory you obtain by taking the base theory T and adding to it the reflection axioms, i.e. the axioms a < P('G') < b for all G such that the function P(.) satisfies a < P(G) < b. [For anyone not paying very close attention, P('.') is a function symbol in the formal language, while P(.) is a meta-language function from the formal language to the real numbers.] The main theorem is that P(.) assigns probability 1 to this theory (and therefore to all consequences of this theory), so it is interesting to ask what it entails. Your question, if I understand correctly, is whether this theory does in particular entail [-1 < P('G') < 1] --> [P('-1 < P('G') < 1') = 1] (note that this is simply a sentence in the language L', not a meta-language statement). There's no obvious reason that the system should be able to prove this -- all we have is T and the individual axioms of the form [a < P('G') < b] --, and indeed your proof shows that (unless there is an inconsistency in the paper) this is not a theorem of the system: if it were, then so would be P('G') = 1, and therefore ¬G, leading to a contradiction since P(.) assigns probability 1 to all theorems of this system. In short, yes, there are true statements about P(.) such that the corresponding statement about P('.') is false in the formal system, and assigned zero probability by P(.). And yes, I do believe this is worth keeping in mind. Here is my current preferred way of thinking about this: In non-standard analysis, there is a "standard" operator st(.) which takes a non-standard real number x and returns the standard number st(x) that x is infinitesimally close to (assuming one exists, which is the case if x is not unbounded). This operator can also be applied to functions f; st(f)(x) is the standard num
Ok, for [a < P('G') < b] I see why you'd use a schema (because it's an object level axiom, not a meta-language axiom). But this still seems possibly problematic. We know that adding axioms like [-1 < P('G') < 1] --> [P('-1 < P('G') < 1') = 1], would break the system. But I don't see a reason to suppose that the other reflection axioms don't break the system. It might or it might not, but I'm not sure; was there a proof along the lines of "this system is inconsistent if and only if the initial system is" or something?
I'm not sure whether I'm misunderstanding your point, but the paper proves that there is a coherent probability distribution P(.) that assigns probability 1 to both T and to the collection of reflection axioms [a < P('G') < b] for P(.); this implies that there is a probability distribution over complete theories assigning probability 1 to (T + the reflection axioms for P). But if (T + the reflection axioms for P) were inconsistent, then there would be no complete theory extending it, so this would be the empty event and would have to be assigned probability 0 by any coherent probability distribution. It follows that (T + the reflection axioms for P) is consistent. (NB: by "the reflection axioms for P(.)", I only mean the appropriate instances of [a < P('G') < b], not anything that quantifies over a, b or G inside the object language.)
Incidentally, why bother with a schema? Reflection belongs in the meta language, so there's no need to use a schema (?)
I think the point is that reflection gives an infinite collection of formal statements all of which must be assigned probability 1 ("axioms"), in analogy to how an axiom schema is a specification of an infinite collection of axioms. Whether "schema" is the right word to use here is debatable, I guess; I was happy with the choice but perhaps you could argue that "schema" should be reserved to recursively enumerable collections or something.
My point was that reflection is defined for the meta language, and quantifies over all the sentences in the object language. This is not something that can be added as an axiom in the object language at all - P (without the ') has no meaning there. So none of the sub-statements of reflection get assigned probability 1 - they just aren't the kind of thing P works on. So I felt that calling this an schema distracted attention from what reflection is, which is a property of P, not axioms for the lower system. But I may be misunderstanding the setup here...
Right, here's what's going on. The statement in the paper is, %20%3C%20b)\implies\mathbb{P}(a%20%3C%20\mathbb{P}(\ulcorner\varphi\urcorner)%20%3C%20b)%20=%201.) The idea is not that this whole statement is an axiom schema. Instead, the idea is that the schema is the collection of the axioms %20%3C%20b) for all rational numbers a,b and sentences phi such that %20%3C%20b). The full statement above is saying that each instance of this schema is assigned probability 1. (From what I remember of a previous conversation with Paul, I'm pretty confident that this is the intended interpretation.) The language in the paper should probably be clearer about this.
That makes sense...
0Eliezer Yudkowsky
I think this should be "Then P(G) <= 1".
No, that's correct. P(G)<1 is the premise of a proof by contradiction.
0Eliezer Yudkowsky
Ah. I didn't quite understand what you were trying to do there. In general in this theory, I don't think we have (p(phi) < 1 -> p(phi) =1) -> p(phi) = 1. The theory only reflects on itself to within epsilon, so what the theory would conclude if it were certain about p(phi) being less than 1 can't be taken as the premise of a proof-by-contradiction in the same way that ~phi can be taken as a premise... ergh, I'm trying to figure out a more helpful way to state this than "I don't think you can derive that kind of probabilistic proof by contradiction from the stated properties of the theory because the system is never supposed to know the exact probabilities it assigns". EDIT: Retracted pending further cogitation.
Stuart's proof by contradiction goes through, as far as I can see. (The speculation in the last paragraph doesn't, as Paul notes, and I don't know what "Hence P('G')=1" is supposed to mean, but the proof by contradiction part does work.) ETA: I now think that "Hence P('G') = 1" is supposed to mean that this statement is implied by the first-order theory (T + the reflection axioms).
I used the axioms as stated, and any non-intuitionist logic. If we want to capture your intuition, I think we'll have to tweak the definitions...

Congrats to MIRI! Just some comments that I already made to Mihaly and Paul:

1) Agree with Abram that unidirectional implication is disappointing and the paper should mention the other direction.

2) The statement "But this leads directly to a contradiction" at the top of page 4 seems to be wrong, because the sentence P(G)<1 isn't of the form P(phi)=p required for the reflection principle.

3) It's not clear whether we could have bidirectional implication if we used closed intervals instead of open, maybe worth investigating further.

6Eliezer Yudkowsky
We tried closed intervals a fair bit, no luck so far.
Posted an edited version in response to some of these issues. Thanks again for the remarks! (I feel less strongly about the bidirectional implication than you; as math I can see why you would definitely want it, but in practice I am nearly as happy to eat the epsilon of error.)
My concern is more about "leaving money on the table" so to speak. The result with epsilon error would be satisfying if we knew it was impossible to do better.
I consider Paul's response to be sufficient.

Told about the Whitely sentence to a friend who is studying physics. After two minutes of reflection, he suggested: "it ought to be like a delta function around 30%". Needless to say, I was blown away.

(Bit off-topic:)

Why are we so invested in solving the Löb problem in the first place?

In a scenario in which there is AGI that has not yet foomed, would that AGI not refrain from rewriting itself until it had solved the Löb problem, just as Gandhi would reject taking a pill which would make him more powerful at the risk of changing his goals?

In other words, does coming up with a sensible utility function not take precedence? Let the AGI figure out the rewriting details, if the utility function is properly implemented, the AGI won't risk changing itself unti... (read more)

It's somewhat tricky to separate "actions which might change my utility function" from "actions". Gandhi might not want the murder pill, but should he eat eggs? They have cholesterol that can be metabolized into testosterone which can influence aggression. Is that a sufficiently small effect?
Any kind of self-modification would be out of the question until the AGI has solved the problem of keeping its utility function intact. That being said, it should be easier for an AGI to keep its code unmodified while solving a problem than for Gandhi not to eat.
That is not clear to me. If I haven't yet worked out how to architect myself such that my values remain fixed, I'm not sure on what basis I can be confident that observing or inferring new things about the world won't alter my values. And solving problems without observing or inferring new things about the world is a tricky problem.
How would any epistemic insights change the terminal values you'd want to maximize? They'd change your actions pursuant to maximizing those terminal values, certainly, but your own utility function? Wouldn't that be orthogonality-threatening? edit: I remembered this may be a problem with e.g. AIXI[tl], but with an actual running AGI? Possibly.
If you cock up and define a terminal value that refers to a mutable epistemic state, all bets are off. Like Asimov's robots on Solaria, who act in accordance with the First Law, but have 'human' redefined not to include non-Solarians. Oops. Trouble is that in order to evaluate how you're doing, there has to be some coupling between values and knowledge, so you must prove the correctness of that coupling. But what is correct? Usually not too hard to define for the toy models we're used to working with, damned hard as a general problem.
Beats me, but I don't see how a system that's not guaranteed to keep its values fixed in the first place can be relied upon not to store its values in such a way that epistemic insights won't alter them. If there's some reason I should rely on it not to do so, I'd love an explanation (or pointer to an explanation) of that reason. Certainly, I have no confidence that I'm architected so that epistemic insights can't alter whatever it is in my brain that we're talking about when we talk about my "terminal values."
The trouble is basically: there is a good chance we can build systems that---in practice---do self-modification quite well, while not yet understanding any formalism that can capture notions like value stability. For example, see evolution. So if we want to minimize the chance of doing that, one thing to do is to first develop a formalism in which goal stability makes sense. (Goal stability is one kind of desirable notion out of many. The general principle is: if you don't understand how or why something works, it's reasonably likely to not do quite what you want it to do. If you are trying to build an X and want to make sure you understand how it works, a sensible first step is to try and develop an account of how X could exist at all.)
Evolution isn't an AGI, if we're talking not about "systems that do self-modification" in general, but an "AGI that could do self modification", does it need to initially contain an additional, explicit formalism to capture notions like value stability? Failing to have such a formalism, would it then not care what happens to its own utility function? Randomizing itself in unpredictable ways would be counter to the AGI's current utility function. Any of the actions the AI takes is by definition intended to serve its current utility function - even if that intent seems delayed, e.g. when the action is taking measurements in order to build a more accurate model, or self-modifying to gain more future optimizing power. Since self-modification in an unpredictable way is an action that strictly jeapardizes the AGI's future potential to maximize its current utility function (the only basis for the decision concerning any action, including whether the AGI will self-modify or not), the safeguard against unpredictable self-modification would be inherently engrained in the AGI's desire to only ever maximize its current utility function. Conclusion: The formalism that saves the AGI from unwanted self-modification is its desire to fulfill its current utility function. The AGI would be motivated to develop formalisms that allow it to self modify to better optimize its current utility function in the future, since that would maximize its current utility function better (what a one-trick-pony!).
I understand the point you are making about value stability. With our current understanding, an AI with this architecture would not do anything productive. The concern isn't that an AI with this architecture would do something bad, it's that (in light of the fact that it would not do anything productive) you wouldn't build it. Instead you would build something different; quite possibly something you understand less well and whose good behavior is more of an empirical regularity (and potentially more unstable). Humans as the output of evolution are the prototypical example of this, though many human artifacts have the same character to varying degrees (programs, organizations, cities, economies).
Is that because any non-trivial action could run a chance of changing the AGI, and thus the AGI wouldn't dare do anything at all? (If (false), disregard the following. Return 0;). If (true), with goal stability being a paramount invariant, would you say that the AGI needs to extrapolate the effect any action would have on itself, before executing it? As in "type 'hi'" or "buy an apple" being preceded by "prove this action maintains the invariant 'goal stability'". It seems like such an architecture wouldn't do much of anything either, combing through its own code whenever doing anything. (Edit: And competing teams would be quick to modify the AGI such that it checks less.) If you say that not every action necessitates proving the invariant over all of its code, then an AI without having a way of proving actions to be non-invariant-threatening could do any actions that wouldn't result in a call to the (non-existing) "prove action isn't self-modifying in a goal shifting way".
That or it takes actions changing itself without caring that they would make it worse because it doesn't know that its current algorithms are worth preserving. Your scenario is what might happen if someone notices this problem and tries to fix it by telling the AI to never modify itself, depending on how exactly they formalize 'never modify itself'.
What if, in building a non-Löb-compliant AI, you've already failed to give it part of your inference ability / trust-in-math / whatever-you-call-it? Even if the AI figures out how to not lose any more, that doesn't mean it's going to get back the part you missed. Possibly related question: Why try to solve decision theory, rather than just using CDT and let it figure out what the right decision theory is? Because CDT uses its own impoverished notion of "consequences" when deriving what the consequence of switching decision theories is.

Another stupid question: would relaxing the schema 3 to incomplete certainty allow the distribution P to be computable? If so, can you get arbitrarily close to certainty and still keep P nice enough to be useful? Does the question even make sense?

The computability problem is due to the unbounded ability of P to reason about theorems in the base theory (so that would be a necessary point of relaxation). A computable total function can't even assign values > 1/2 to all the theorems of PA and values < 1/2 to all the sentences refuted by PA.

This is all nifty and interesting, as mathematics, but I feel like you are probably barking up the wrong tree when it comes to applying this stuff to AI. I say this for a couple of reasons:

First, ZFC itself is already comically overpowered. Have you read about reverse mathematics? Stephen Simpson edited a good book on the topic. Anyway, my point is that there's a whole spectrum of systems a lot weaker than ZFC that are sufficient for a large fraction of theorems, and probably all the reasoning that you would ever need to do physics or make real wo... (read more)


The result works for theories as simple as Peano arithmetic.

The problem is if your mathematical power has to go down each time you create a successor or equivalently self-modify. If PA could prove itself sound that might well be enough for many purposes. The problem is if you need a system that proves another system sound and in this case the system strength has to be stepped down each time. That is the Lob obstacle.

Each time you create a successor or equivalently self-modify, or rather each time that an action that you take has a non-zero chance of modifying yourself? Wouldn't that mean you'd have to check constantly? What if the effect of an action is typically unpredictable at least to some degree? Also, since any system is implemented using a physical substrate, some change is inevitable (and until the AI has powered up to multiple redundancy level not yet so stochastically unlikely as to be ignorable). What happens if that change affects the "prove the system sound" physical component, is there a provable way to safeguard against that? (Obligatory "who watches the watcher".) It's a hard problem any way you slice it. You there, go solve it.
So you are assuming that it will be wanting to prove the soundness of any successors? Even though it can't even prove the soundness of itself? But it can believe in it's own soundness in a Bayesian sense without being able to prove it. There is not (as far as I know) any Godelian obstacle to that. I guess that was your point in the first place.
Why would successors use a different system, though? Verifying proofs in formal systems is easy, it's coming up with the proofs that's difficult -- an AI would refine its heuristics in order to figure out proofs more efficiently, but it would not necessarily want to change the system it is checking them against.
It would want to change the system it checked proofs against if it did not trust that system. A naively built system which checked its proofs with PA but did not trust PA probabilistically (ie, held some probability that PA is false: not difficult to construct) would very possibly prefer to reduce its set of axioms to something which PA verifies (something true with 100% probability in its understanding). But (at least if the reduced system is still robinson arithmetic or stronger) the cycle would continue, since the new system is also not self-verifying. This cycle could (likely) be predicted by the system, however, so it is not obvious what the system would actually choose to do.
If the system did not trust PA, why would it trust a system because PA verifies it? More to the point, why would it trust a self-verifying system, given that past a certain strength, only inconsistent systems are self-verifying? If the system held some probability that PA was inconsistent, it could evaluate it on the grounds of usefulness, perhaps contrasting it with other systems. It could also try to construct contradictions, increasing its confidence in PA for as long as it doesn't find any. That's what we do, and frankly, I don't see any other way to do it.
Because that's how it works! The system "is" PA, so it will trust (weaker) systems that it (PA) can verify, but it will not trust itself (PA). It would only trust them if it could verify them. True. Not necessarily; this depends on how the system works. In my probabilistic prior, this would work to some degree, but because there exists a nonstandard model in which PA is inconsistent (there are infinite proofs ending in contradictions), there will be a fixed probability of inconsistency which cannot be ruled out by any amount of testing.
That doesn't seem consistent to me. If you do not trust yourself fully, then you should not fully trust anything you demonstrate, and even if you do, there is still no incentive to switch. Suppose that the AI can demonstrate the consistency of system S from PA, and wants to demonstrate proposition A. If AI trusts S as demonstrated by PA, then it should also trust A as demonstrated by PA, so there is no reason to use S to demonstrate A. In other words, it is not consistent for PA to trust S and not A. Not fully, at any rate. So why use S at all? What may be the case, however, (that might be what you're getting to) is that in demonstrating the consistency of S, PA assigns P(Consistent(S)) > P(Consistent(PA)). Therefore, what both PA and S can demonstrate would be true with greater probability than what only PA demonstrates. However, this kind of system solves our problem in two ways: first, it means the AI can keep using PA, but can increase its confidence in some statement A by counting the number of systems that prove A. Second, it means that the AI can increase its confidence in PA by arbitrarily building stronger systems and proving the consistency of PA from within these stronger systems. Again, that's similar to what we do. That sounds reasonable to me -- the usefulness of certainty diminishes sharply as it approaches 1 anyway. Your paper sounds interesting, I'll give it a read when I have the time :)
You are smuggling in some assumptions from your experience with human cognition. If I believe X, but don't believe that the process producing my beliefs is sane, then I will act on X (I believe it, and we haven't yet talked about any bridge between what I believe, and what I believe about my beliefs), but I still won't trust myself in general.
I certainly agree that "if I don't trust myself, I don't trust my output" is an assumption I draw from human cognition, but if I discard that assumption and those that seem in the same reference class it I'm not sure I have much of a referent left for "I trust myself" at all. Is there a simple answer to what it means for me to trust myself, on your account? That is, how could I tell whether or not I did? What would be different if I did or didn't?
E.g., do you think the generalization "things I believe are true" is true? Would you be comfortable acquiring influence that you can use as you see fit, because you trust yourself to do something reasonable? I have a machine which maximizes expected utility given the judgments output by some box O. If O says "don't trust O!" it won't cause the machine to stop maximizing expected utility according to O's judgments---that's just what it is programmed to do. But it will cause the machine to doubt that it will do a good job, and work to replace itself with an alternative. The details of exactly what happens obviously depend on the formalism, but hopefully the point makes sense.
I guess I'm just not following you, sorry. It seems to me that these are precisely the sorts of things that follow from "I trust myself" in the derived-from-human-cognition sense that I thought you were encouraging us to reject. The question arises, is it programmed to "maximize expected utility given the judgments output by O", or is it programmed to "maximize expected utility" and currently calculates maximum expected utility as involving O's judgments? If the former, then when O says "don't trust O!" the machine will keep maximizing expected utility according to O's judgments, as you say. And it's maximum expected utility will be lower than it previously was. But that won't cause it to seek to replace itself. Why should it? If the latter, then when O says "don't trust O!" the machine will immediately lower the expected utility that it calculates from O's judgments. If that drops below the machine's calculated expected utility from some other source P, it will immediately start maximizing expected utility given the judgments output by P instead. It might replace itself with an alternative, just as it might have if O hadn't said that, but O saying that shouldn't affect that decision one way or the other. No?
If you are merely using O's judgments as evidence, then you are ultimately using some other mechanism to form beliefs. After all, how do you reason about the evidential strength of O's judgments? Either that, or you are using some formalism we don't yet understand. So I'll stick with the case where O is directly generating your beliefs (that's the one we care about). If O is the process generating your beliefs, then when O says "don't trust O" you will continue using O because that's what you are programmed to do, as you wrote in your comment. But you can reason (using O) about what will happen if you destroy yourself and replace yourself with a new agent whose beliefs are generated by O'. If you trust O' much more than O then that would be a great change, and you would rush to execute it.
(nods) Agreed; if O is generating my beliefs, and O generates the belief in me that some other system having O'-generated beliefs would be of greater value than me having O-generated beliefs, then I would rush to execute that change and replace myself with that other system. For similar reasons, if if O is generating my beliefs, and O generates the belief in me that me having O'-generated beliefs would be of greater value than me having O-generated beliefs, then I would rush to execute that change and accept O'-generated beliefs instead. But, sure, if I'm additionally confident that I'm incapable of doing that for some reason, then I probably wouldn't do that. The state of "I believe that trusting O' would be the most valuable thing for me to do, but I can't trust O'," is not one I can actually imagine being in, but that's not to say cognitive systems can't exist which are capable of being in such a state. Have I missed anything here?
As others have said, doesn't much matter whether you use ZFC or any other system. On the second point: one general danger is that your ability to build systems to do X will outpace your understanding of systems that do X. In this case, you might mess up and not quite get what you want. One way to try and remedy this is to develop a better formal understanding of "systems that do X." AGI seems like an important case of this pattern, because there is some chance of serious negative consequences from failing, from building systems whose behavior you don't quite understand. (I tend to think this probability is smaller than Eliezer, but I think most everyone who has thought about it seriously agrees that this is a possible problem.) At the moment we have no such formal understanding for AGI, so it might be worth thinking about.
Both are huge difficulties, but most of the work in FAI is probably in the AI part, not the F part.
It is worth noting that taking the "F" seriously implies adding a rather significant amounts of work to the "AI part". It requires whole extra orders of formal rigor and all the additional complications of provable goal stability under self improvement (regardless of what those goals happen to be). While this doesn't matter for the purpose of answering smoofra's question it seems to me that it could could be potentially misleading to neglect the difference in workload between creating "F-compatible AI" and "AI" when talking about the workload imposed by 'F'. Note that I don't think I'm saying something controversial here. I am expecting this to just be wording that I am wary of rather than a fundamentally different understanding. But if I have actually misunderstood the MIRI position on the relative difficulty of Friendliness to arbitrary AI then I would appreciate being corrected. That would be significant new information for me to consider (and also extremely good news!) The reason this is significant can of course be illustrated by considering the counterfactual world where a convergence thesis holds. Or, more relevantly, considering the possibility of GAI researchers that believe that a convergence thesis holds but somehow manage to be competent researchers anyhow. Their task becomes (crudely speaking) that of creating any AI that can make something smarter than itself. My estimate is that this is an order of magnitude simpler than the task for FAI creators creating an AI, even completely neglecting the work that goes into creating the goal system. If I find out that I am mistaken about the relative difficulties here then I will get to drastically update my expectation of humanity surviving and in general in the direction of awesome things happening.
I take it that by "convergence thesis", you're referring to the statement that all sufficiently intelligent agents will have approximately the same values.
OK, forget about F for a second. Isn't the huge difficulty finding the right deductions to make, not formalizing them and verifying them?

I gave this paper a pretty thorough read-through, and decided that the best way to really see what was going on would be to write it all down, expanding the most interesting arguments and collapsing others. This has been my main strategy for learning math for a while now, but I just decided recently that I ought to start a blog for this kind of thing. Although I wrote this post mainly for myself, it might be helpful for anyone else who's interested in reading the paper.

Towards the end, I note that the result can be made "constructive", in the sen... (read more)

Note that there's a second sense in which the argument is nonconstructive: Kakutani requires as a hypothesis that the space be compact. In your post you say [0, 1]^{\omega} is compact by Tychonoff's theorem, but using Tychonoff's theorem in its general form here is a use of choice. For compact Hausdorff spaces it's enough to use the ultrafilter lemma, but I believe you can just prove by hand that [0, 1]^{\omega} is compact in ZF.
I'm glad you bring this up. I have two comments: (1) The absoluteness argument shows that intermediate uses of Choice don't matter. They happen (definably) in L, and the conclusion reflects back up to V. (2) As you suspect, you don't really need Choice for compactness of the Hilbert cube. Given a sequence in it, I can define a convergent subsequence by repeatedly pigeonholing infinitely many points into finitely many holes. I believe sequential compactness is what is needed for the Kakutani theorem. Variants of open cover compactness are derivable from sequential compactness without Choice in Polish spaces, in case something like that is needed. Or if you somehow really need full open cover compactness... well, return to point (1), I guess.

The above is an extraordinarily good example of why lukeprog no longer talks to you.

A stupid question from a dilettante... Does this avenue of research promise to eventually be able to assign probabilities to interesting statements, like whether P=NP or whether some program halts, or whether some formal system is self-consistent?

Promise? No. Might it start down an avenue that someday leads there after a whole lot more work and some additional ideas? Maybe. Don't hold your breath on that P!=NP formal probability calculation.

Hey, a man can hope. Interesting how it used to be physics driving the progress in math, and lately it's been computer science.
This comment is a grossly oversimplified perspective, I think.
Maybe I'm misunderstanding here, but it seems like we have no particular reason to suppose P=NP is independent of ZFC. Unless it is independent, its probability under this scheme must already be 1 or 0, and the only way to find out which is to prove or disprove it.
I think shminux is talking about the possibility of future research addressing bounded reasoners, who could be uncertain of P=NP even if it followed from ZFC.
I fully agree that is an interesting avenue of discussion, but it doesn't look much like what the paper is offering us.
... Or a probability for the continuum hypothesis, axiom of choice, et cetera, if the probabilistic set theory works out. :)
While set theory already does that within forcing language (kinda, truth values are in a boolean algebra instead of a ring) for CH, AC, etc., the values of P!=NP cannot be changed if the models have the same ordinal (due to Shoenfield's absoluteness theorem). I really hope that Probabilistic Set Theory works out, it seems very interesting.

We say that P is coherent if there is a probability measure μ over models of L such that

Annoying pedantic foundational issues: The collection of models is a proper class, but we can instead consider probability measures over the set of consistent valuations , which is a set. And it suffices to consider the product σ-algebra.

I assumed that this meant "there is some probability space whose set of outcomes is some set of models", rather than "there is a measure over 'the canonical' probability space of all models", whatever that could mean. This is the smallest σ-algebra such that for every sentence phi of L, {A : phi in A} is measurable, right? If so, it is pleasingly also the Borel-σ-algebra on the Stone space of consistent valuations (aka complete theories) of L. (Assuming L is countable, but I believe the paper does that anyway.)
Yes, I asked Paul about this and this is the interpretation he gave.

This idea reminds me of some things in mechanism design, where it turns out you can actually prove a lot more for weak dominance than for strong dominance, even though one might naïvely suppose that the two should be equivalent when dealing with a continuum of possible actions/responses. (I'm moderately out of my depth with this post, but the comparison might be interesting for some.)

What would the world look like if I assigned "I assign this statement a probability less than 30%." a probability of 99%? What would the world look like if I assigned it a probability of 1%?

It means that you are rather confused about your function for assigning probabilities. In other words, you think you assign probabilities a different way from how you really do.
Suppose that my method of assigning probabilities about things that should not update on new evidence of any kind (invisible dragons) is to assign them either a very low probability (something which I almost always do), or a very high probability (something I very rarely do)? This statement seems insane in that one should always update based only on one's current evaluation of it is (if you make the change "less than or equal to 30%", it has a stable solution)

The idea is similar in concept to fuzzy theories of truth (which try to assign fractional values to paradoxes), but succeeds where those fail (ie, allowing quantifiers) to a surprising degree. Like fuzzy theories and many other extra-value theories, the probabilistic self-reference is a "vague theory" of sorts; but it seems to be minimally vague (the vagueness comes down to infinitesimal differences!).

Comparing it to my version of probabilities over logic, we see that this system will trust its axioms to an arbitrary high probability, whereas if ... (read more)

I liked your paper at AGI (we were discussing the same ideas at MIRI at that time; I guess it's in the air at the moment). Our system is a definition of truth rather than a definition of beliefs. In fact P( "axioms inconsistent" ) > 0 in our formalism. There is definitely much more to say on this topic (I'm currently optimistic about resolving incompleteness up to some arbitrarily small epsilon, but it involves other ideas.) I agree that some self-referential sentences have no meaning, but I think accepting them is philosophically unproblematic and there are lots of meaningful self-referential sentences (e.g. believing "everything I believe is true") which I'd prefer not through out with the bathwater.
Indeed, it seems relatively easy to make progress in this direction given the current state of things! (I would have been surprised if you had not been thinking along similar lines to my paper.) Ah! Interesting. Why is that? I did construct a distribution in which this is not the case, but it was not particularly satisfying. If (during the random theory generation) you block the creation of existential statements until an example has already been introduced, then you seem to get acceptable results. However, the probability distribution does not seem obviously correct.

Just found a few typos in my notes that I had forgotten to post. Hopefully somebody's still reading this.

  1. Page 3: "By axiom 3, P(T0) = 1" should be by axiom 2.
  2. Axiom 3 seems redundant (it follows from axiom 1 + 2).
  3. Page 4, "which contradicts P(G) in [0, 1]" should be "which contradicts P(G) in [0, 1)", unless I'm misreading.

In the proof of Theorem 2, you write "Clearly is convex." This isn't clear to me; could you explain what I am missing?

More specifically, let ) be the subset of obeying %20%3C%20b%20\%20\implies%20\%20\mathbb{P}\left(%20a%20%3C%20\mathbb{P}(\lceil%20\phi%20\rceil)%20%3C%20b%20\right)%20=1%20). So }%20X(\phi,a,b)). If ) were convex, then would be as well.

But ) is not convex. Project ) onto in the coordinates corresponding to the sentences and %20%3C%20b). The image is %20\cup%20\left(%20%20[0,1]%20\times%20\{%201%20\}%20\right)%20\cup%20\left(... (read more)

Other nitpicks (which I don't think are real problems): If the Wikipedia article on Kakatuni's fixed point theorem is to be believed, then Kakatuni's result is only for finite dimensional vector spaces. You probably want to be citing either Glicksberg or Fan for the infinite dimensional version. These each have some additional hypotheses, so you should check the additional hypotheses. At the end of the proof of Theorem 2, you want to check that the graph of is closed. Let be the graph of . What you check is that, if ) is a sequence of points in which approaches a limit, then that limit is in . This set off alarm bells in my head, because there are examples of a topological space , and a subspace , so that is not closed in but, if is any sequence in which approaches a limit in , then that limit is in . See Wikipedia's article on sequential spaces. However, this is not an actual problem. Since is countable, is metrizable and therefore closure is the same as sequential closure in .
The set A is convex because the convex combination (t times one plus (1-t) times the other) of two coherent probability distributions remains a coherent probability distribution. This in turn is because the convex combination of two probability measures over a space of models (cf. definition 1) remains a probability distribution over the space of models. I think, but am not sure, that your issue is looking at arbitrary points of [0,1]^{L'}, rather than the ones which correspond to probability measures.

Did anything happen with the unrestricted comprehension?

Regarding the epsilons and stuff: you guys might want to look in to hyperreal numbers.

We know. (The situation should be something like the following: internal statements about the probabilities assigned to various statements get interpreted in nonstandard models, so the corresponding probabilities are nonstandard reals.) Nonstandard models are an essential component of the approach here: probability assignments are the same thing as probability distributions over models (only one of which is standard).

I have felt probabilistic reasoning about mathematics is important ever since I learned about the Goedel machine, for a reason related but somewhat different from Loeb's problem. Namely, the Goedel machine is limited to self-improvements provable within an axiom system A whereas humans seem to be able to use arbitrarily powerful axiom systems. It can be argued that Piano arithmetics is "intuitively true" but ZFC is only accepted by mathematicians because of empirical considerations: it just works. Similarly a self-improving AI should be able to a... (read more)

a day after the workshop was supposed to end

My memory is that this was on the last day of the workshop, when only Paul and Marcello had the mental energy reserves to come into the office again.

7Eliezer Yudkowsky
No, I allocated my energy for a week and then it turned out that everyone was in town for an extra day anyway so it went into overtime. I thought Mihaly made it? Anyway I, having judged my energy reserves accurately, didn't.

Typo in section 2.2 Going Meta - agumented should be augmented

When reading this paper, and the background, I have a recurring intuition that the best approach to this problem is a distributed, probabilistic one. I can't seem to make this more coherent on my own, so posting thoughts in the hope discussion will make it clearer:

ie, have a group of related agents, with various levels of trust in each others' judgement, each individually asses how likely a descendant will be to take actions which only progress towards a given set of goals.

While each individual agent may only be able to asses a subset of a individual desc... (read more)

This seems like another "angels dancing on the head of a pin" question. I am not willing to assign numerical probabilities to any statement whose truth value is unknown, unless there is a compelling reason to choose that specific number (such as, the question is about a predictable process such as the result of a die roll). It seems to me that people who do so assign, are much more likely to get basic problems of probability theory (such as the Monty Hall problem) wrong than those who resist the urge.


Not really my field of expertise, thus I may be missing something, but can't you just set P = 1 for all classically provably true statements, P = 0 for all provably false statements and P = 0.5 to the rest, essentially obtaining a version of three-valued logic?

If that's correct, does your approach prove anything fundamentally different than what has been proved about known three-valued logics such as Kleene's or Łukasiewicz's?

In ZF set theory, consider the following three statements.

I) The axiom of choice is false

II) The axiom of choice is true and the continuum hypothesis is false

III) The axiom of choice is true and the continuum hypothesis is true

None of these is provably true or false so they all get assigned probability 0.5 under your scheme. This is a blatant absurdity as they are mutually exclusive so their probabilities cannot possibly sum to more than 1

Ok, it seems that this is covered in the P(phi) = P(phi and psi) + P(phi and not psi) condition. Thanks.
That won't work because some of the unprovable claims are, through Gödel-coding, asserting that their own probabilities are different from 0.5.


[This comment is no longer endorsed by its author]Reply

I think this was how the heretic Geth got started. #generalisingfromfictionalevidence


... Why does it matter to an AI operating within a system p if proposition C is true? Shouldn't we need to create the most powerful sound system p, and then instead of caring about what is true, care about what is provable?

EDIT: Or instead of tracking 'truth', track 'belief'. "If this logical systems proves proposition C, then C is believed.", where 'believed' is a lower standard than proved (Given a proof of "If a than b" and belief in a, the result is only a belief in b)

Is there an internal benefit to something being provable besides it being true?
It's a lot easier to tell if something is provable than if it is true.