Reflection in Probabilistic Logic

Paul Christiano has devised a new fundamental approach to the "Löb Problem" wherein Löb's Theorem seems to pose an obstacle to AIs building successor AIs, or adopting successor versions of their own code, that trust the same amount of mathematics as the original.  (I am currently writing up a more thorough description of the question this preliminary technical report is working on answering.  For now the main online description is in a quick Summit talk I gave.  See also Benja Fallenstein's description of the problem in the course of presenting a different angle of attack.  Roughly the problem is that mathematical systems can only prove the soundness of, aka 'trust', weaker mathematical systems.  If you try to write out an exact description of how AIs would build their successors or successor versions of their code in the most obvious way, it looks like the mathematical strength of the proof system would tend to be stepped down each time, which is undesirable.)

Paul Christiano's approach is inspired by the idea that whereof one cannot prove or disprove, thereof one must assign probabilities: and that although no mathematical system can contain its own truth predicate, a mathematical system might be able to contain a reflectively consistent probability predicate.  In particular, it looks like we can have:

∀a, b: (a < P(φ) < b)          ⇒  P(a < P('φ') < b) = 1
∀a, b: P(a ≤ P('φ') ≤ b) > 0  ⇒  a ≤ P(φ) ≤ b

Suppose I present you with the human and probabilistic version of a Gödel sentence, the Whitely sentence "You assign this statement a probability less than 30%."  If you disbelieve this statement, it is true.  If you believe it, it is false.  If you assign 30% probability to it, it is false.  If you assign 29% probability to it, it is true.

Paul's approach resolves this problem by restricting your belief about your own probability assignment to within epsilon of 30% for any epsilon.  So Paul's approach replies, "Well, I assign almost exactly 30% probability to that statement - maybe a little more, maybe a little less - in fact I think there's about a 30% chance that I'm a tiny bit under 0.3 probability and a 70% chance that I'm a tiny bit over 0.3 probability."  A standard fixed-point theorem then implies that a consistent assignment like this should exist.  If asked if the probability is over 0.2999 or under 0.30001 you will reply with a definite yes.
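To make the fixed-point aspect concrete, here is a toy numeric sketch (my own illustration, not the construction from the preliminary report): the naive map from "my probability for the Whitely sentence" to the truth value it forces has no fixed point, while a Kakutani-style correspondence that allows a closed convex set of values at the jump does.

```python
# Toy illustration only. W is the Whitely sentence "P(W) < 0.3".
# The naive map p -> truth value of W, given that the agent assigns W
# probability exactly p, has no fixed point; the set-valued version that
# permits any value in [0, 1] exactly at the discontinuity p = 0.3 does.

def naive_truth(p):
    """Truth value (0 or 1) of W if the agent assigns W probability exactly p."""
    return 1.0 if p < 0.3 else 0.0

# No p in [0, 1] satisfies naive_truth(p) == p:
assert all(abs(naive_truth(i / 1000) - i / 1000) > 0.05 for i in range(1001))

def admissible_values(p):
    """Set-valued version: a closed interval of admissible values at each p."""
    if p < 0.3:
        return (1.0, 1.0)
    if p > 0.3:
        return (0.0, 0.0)
    return (0.0, 1.0)  # at the jump, any value in [0, 1] is allowed

lo, hi = admissible_values(0.3)
assert lo <= 0.3 <= hi  # p = 0.3 is a fixed point of the correspondence
```

This is only the one-sentence analogue of the fixed-point step; the report's theorem does the corresponding thing for all sentences at once, which is where Kakutani's theorem (and the nonconstructiveness) comes in.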

We haven't yet worked out a walkthrough showing if/how this solves the Löb obstacle to self-modification, and the probabilistic theory itself is nonconstructive (we've shown that something like this should exist, but not how to compute it).  Even so, a possible fundamental triumph over Tarski's theorem on the undefinability of truth and a number of standard Gödelian limitations is important news as math qua math, though work here is still in very preliminary stages.  There are even whispers of unrestricted comprehension in a probabilistic version of set theory with ∀φ: ∃S: P(x ∈ S) = P(φ(x)), though this part is not in the preliminary report and is at even earlier stages and could easily not work out at all.

It seems important to remark on how this result was developed:  Paul Christiano showed up with the idea (of consistent probabilistic reflection via a fixed-point theorem) to a week-long "math squad" (aka MIRI Workshop) with Marcello Herreshoff, Mihaly Barasz, and myself; then we all spent the next week proving that version after version of Paul's idea couldn't work or wouldn't yield self-modifying AI; until finally, a day after the workshop was supposed to end, it produced something that looked like it might work.  If we hadn't been trying to solve this problem (with hope stemming from how it seemed like the sort of thing a reflective rational agent ought to be able to do somehow), this would be just another batch of impossibility results in the math literature.  I remark on this because it may help demonstrate that Friendly AI is a productive approach to math qua math, which may aid some mathematician in becoming interested.

I further note that this does not mean the Löbian obstacle is resolved and no further work is required.  Before we can conclude that we need a computably specified version of the theory plus a walkthrough for a self-modifying agent using it.

See also the blog post on the MIRI site (and subscribe to MIRI's newsletter here to keep abreast of research updates).

This LW post is the preferred place for feedback on the paper.

EDIT:  But see discussion on a Google+ post by John Baez here.  Also see here for how to display math LaTeX in comments.


I'm very happy both to see this result and the fact that prominent mathematicians such as John Baez and Timothy Gowers seem to be taking an interest in it. It considerably increases my confidence that:

  • Current MIRI-folk will be capable of producing original, valuable, solid math results
  • By producing such results, they will become capable of attracting more talented mathematicians and producing even more such results

My confidence in MIRI being successful in its mission, while still pretty low, is now much higher than it was a few days ago.

The main theorem of the paper (Theo. 2) does not seem to me to accomplish the goals stated in the introduction of the paper. I think that it might sneakily introduce a meta-language and that this is what "solves" the problem.

What I find unsatisfactory is that the assignment of probabilities to sentences is not shown to be definable in L. This might be too much to ask, but if nothing of the kind is required, the reflection principles lack teeth. In particular, I would guess that Theo. 2 as stated is trivial, in the sense that you can simply take P to only have value 0 or 1. Note that the third reflection principle imposes no constraint on the value of P on sentences of L' \ L.

Your application of the diagonalisation argument to refute the reflection scheme (2) also seems suspect, since the scheme only quantifies over sentences of L and you apply it to a sentence G which might not be in L. To be exact, you do not claim that it refutes the scheme, only that it seems to refute it.

You are completely right, and thanks for the correction! A new version with this problem fixed should be up in a bit.

(The actual proof of theorem 2 establishes the correct thing, there is just a ' missing from the statement.)

Actually, I believe you've found a bug in the paper which everyone else seems to have missed so far, but fortunately it's just a typo in the statement of the theorem! The quantification should be over L', not over L, and the proof does prove this much stronger statement. The statement in the paper is indeed trivial for the reason you say.

Given the stronger statement, the reason you can't just have P have value 0 or 1 is sentences like G <=> P('G') < 0.5: if P(G) = 0, by reflection it would follow that P(G) = 1, and if P(G) = 1, then P(not G) = 0, and by reflection it would follow that P(not G) = 1.
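For concreteness, a crude machine check of this argument (my own sketch; "reflection" is modelled naively here as the agent being certain of whatever is actually true of its own assignment):

```python
# Toy check for G <=> (P('G') < 0.5): neither 0 nor 1 is a consistent value for P(G).
for p_G in (0.0, 1.0):                 # candidate 0/1 values for P(G)
    G_is_believed = p_G < 0.5          # naive reflection: P is certain of this fact
    forced_p_G = 1.0 if G_is_believed else 0.0   # so P(G) is forced to this value
    status = "consistent" if forced_p_G == p_G else "inconsistent"
    print(f"P(G) = {p_G}: forced to {forced_p_G} -> {status}")
# Both candidates come out inconsistent, matching the argument above.
```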

The Kakutani fixed point theorem that this result relies on is Corollary 17.55 in Infinite Dimensional Analysis - A Hitchhiker's Guide by Aliprantis and Border.

Edit: pdf

Some technical remarks on topological aspects of the paper:

The notion of a "coherent probability distribution" and the use of the product topology on the set of all such distributions are a bit ideosyncratic. Here's a way to relate them to more standard notions: Consider a countable first-order language L and the Stone space of complete theories over it. Equip this with the Borel-σ-algebra. (1.) A coherent probability distribution is (equivalent to) a probability distribution on the resulting measure space. (2.) The (AFAICT) most commonly used topology on the space of all such distributions is the one of weak convergence of probability measures. It turns out that this is equivalent to the product topology on the coherent probability distributions.

(1.) The main unusual feature of coherent probability distributions is that they're only defined on the sentences of L, which don't form a σ-algebra. Their equivalence classes under logical equivalence form a Boolean algebra, though, which is in particular a ring of sets (in the measure-theoretic sense) of complete theories (we identify each equivalence class with the set of complete theories containing the sentences in the equivalence class). Furthermore, this ring generates our σ-algebra: by the definition of "Stone space", the sets in the ring form a base of the Stone space, and since this base is countable, every open set is a countable union of base sets (meaning that the smallest σ-algebra containing the open sets is also the smallest σ-algebra containing the base).

A coherent probability distribution, by the alternative characterization from the paper, is a finitely additive probability measure. But a finitely additive measure μ on a ring is already a premeasure (i.e., σ-additive on the ring) if for every descending sequence A_1 ⊇ A_2 ⊇ ⋯ of elements of the ring, lim_n μ(A_n) > 0 implies ⋂_n A_n ≠ ∅, and a premeasure on a ring extends uniquely to a measure on the generated σ-algebra. Now, by the assumption, we have lim_n μ(A_n) > 0 and therefore μ(A_n) > 0 for all n; since μ(∅) = 0, each A_n is non-empty, so the descending family (A_n) has the finite intersection property, and so, since Stone spaces are compact and each A_n is clopen, the intersection of all A_n is non-empty as desired.

(2.) The Stone space of a Boolean algebra is metrizable if and only if the Boolean algebra is countable, so since we're interested in countable languages, the notion of weak convergence of probability measures on our space is well-defined. One of the equivalent definitions is that P_n → P weakly if liminf_n P_n(B) ≥ P(B) for all open sets B. We want to show that this is equivalent to convergence in the product topology on coherent probability distributions, which amounts to pointwise convergence of P_n(A) to P(A) for all A in the base (i.e., for A the set of complete theories containing φ, for some sentence φ).

Suppose first that P_n(A) → P(A) for all base sets A, and let B be an arbitrary open set. B can be written as a countable union of base sets; since the base is closed under Boolean operations, it follows that B can be written as a countable disjoint union of base sets (say B = ⋃_i A_i). For any ε > 0, there is an m such that P(B) ≤ P(⋃_{i=1}^m A_i) + ε/2. By pointwise convergence, for sufficiently large n we have P(A_i) ≤ P_n(A_i) + ε/2m for all i ≤ m. Therefore,

P(B) ≤ Σ_{i=1}^m P(A_i) + ε/2 ≤ Σ_{i=1}^m P_n(A_i) + ε ≤ P_n(B) + ε.

Since this holds for all ε > 0, the desired inequality liminf_n P_n(B) ≥ P(B) follows.

Suppose now that liminf_n P_n(B) ≥ P(B) for all open sets B. We must show that for all base sets A, P_n(A) → P(A). But base sets are clopen, so we have both liminf_n P_n(A) ≥ P(A) and

limsup_n P_n(A) = 1 − liminf_n P_n(A^c) ≤ 1 − P(A^c) = P(A),

implying lim_n P_n(A) = P(A).

Wow, this is great work--congratulations! If it pans out, it bridges a really fundamental gap.

I'm still digesting the idea, and perhaps I'm jumping the gun here, but I'm trying to envision a UDT (or TDT) agent using the sense of subjective probability you define. It seems to me that an agent can get into trouble even if its subjective probability meets the coherence criterion. If that's right, some additional criterion would have to be required. (Maybe that's what you already intend? Or maybe the following is just muddled.)

Let's try invoking a coherent P in the case of a simple decision problem for a UDT agent. First, define G <--> P("G") < 0.1. Then consider the 5&10 problem:

  • If the agent chooses A, payoff is 10 if ~G, 0 if G.

  • If the agent chooses B, payoff is 5.

And suppose the agent can prove the foregoing. Then unless I'm mistaken, there's a coherent P with the following assignments:

P(G) = 0.1

P(Agent()=A) = 0

P(Agent()=B) = 1

P(G | Agent()=B) = P(G) = 0.1

And P assigns 1 to each of the following:

P("Agent()=A") < epsilon

P("Agent()=B") > 1-epsilon

P("G & Agent()=B") / P("Agent()=B") = 0.1 +- epsilon

P("G & Agent()=A") / P("Agent()=A") > 0.5

The last inequality is consistent with the agent indeed choosing B, because the postulated conditional probability of G makes the expected payoff given A less than the payoff given B.

Is that P actually incoherent for reasons I'm overlooking? If not, then we'd need something beyond coherence to tell us which P a UDT agent should use, correct?

(edit: formatting)

I've also tried applying this theory to UDT, and have run into similar 5-and-10-ish problems (though I hadn't considered making the reward depend on a statement like G, that's a nice trick!). My tentative conclusion is that the reflection principle is too weak to have much teeth when considering a version of UDT based on conditional expected utility, because for all actions A that the agent doesn't take, we have P(Agent() = A) = 0; we might still have P("Agent() = A") > 0 (but smaller than epsilon), but the reflection axioms do not need to hold conditional on Agent() = A, i.e., for X a reflection axiom we can have P assign positive probability to e.g. P("X & Agent() = A") / P("Agent() = A") < 0.9.

But it's difficult to ask for more. In order to evaluate the expected utility conditional on choosing A, we need to coherently imagine a world in which the agent would choose A, and if we also asked the probability distribution conditional on choosing A to satisfy the reflection axioms, then choosing A would not be optimal conditional on choosing A -- a contradiction with the agent choosing A... (We could have P("Agent() = A") = 0, but not if you have the agent playing chicken, i.e., play A if P("Agent() = A") = 0; if we have such a chicken-playing agent, we can coherently imagine a world in which it would play A -- namely, a world in which P("Agent() = A") = 0 -- but this is a world that assigns probability zero to itself. To make this formal, replace "world" by "complete theory".)
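A rough sketch of the "playing chicken" rule mentioned above (the oracle-style access to P and the expected_utility helper are stand-ins for illustration only; the actual distribution is not computable):

```python
# Hedged sketch of the chicken rule: take any action you assign probability zero
# to taking, so that no action you might actually take ever ends up with internal
# probability exactly zero. P and expected_utility are illustrative stand-ins.
def choose(actions, P, expected_utility):
    for a in actions:
        if P(f'Agent() = {a}') == 0:
            return a  # chicken rule: defy any certainty about your own choice
    return max(actions, key=expected_utility)
```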

I think applying this theory to UDT will need more insights. One thing to play with is a formalization of classical game theory:

  • Specify a decision problem by a function from (a finite set of) possible actions to utilities. This function is allowed to be written in the full formal language containing P(".").
  • Specify a universal agent which takes a decision problem D(.), evaluates the expected utility of every action -- not in the UDT way of conditioning on Agent(D) = A, but by simply evaluating the expectation of D(A) under P(".") -- and returns the action with the highest expected utility.
  • Specify a game by a payoff function, which is a function from pure strategy profiles (which assign a pure strategy to every player) to utilities for every player.
  • Given a game G(.), for every player, recursively define actions A_i := Agent(D_i) and decision problems D_i(a) := G_i(A_1, ..., A_(i-1), a, A_(i+1), ..., A_n), where G_i is the i'th component of G (i.e., the utility of player i).

Then, (A_1, ..., A_n) will be a Nash equilibrium of the game G. I believe it's also possible to show that for every Nash equilibrium, there is a P(.) satisfying reflection which makes the players play this NE, but I have yet to work carefully through the proof. (Of course we don't want to become classical economists who believe in defection on the one-shot prisoner's dilemma, but perhaps thinking about this a bit might help with finding an insight for making an interesting version of UDT work. It seems worth spending at least a bit of time on.)
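As a sanity check on the last claim, in the finite pure-strategy case "(A_1, ..., A_n) is a Nash equilibrium" reduces to the usual best-response test. Here is a toy sketch of that test (the reflective P(".") itself is nonconstructive, so nothing here computes it; the game and helper names are mine):

```python
# Toy best-response check, not the paper's construction: a profile is a Nash
# equilibrium iff no player can gain by unilaterally deviating.
from itertools import product

def is_nash(payoffs, actions, profile):
    """payoffs[i](profile) -> utility of player i; profile is a tuple of actions."""
    for i in range(len(profile)):
        best = max(payoffs[i](profile[:i] + (b,) + profile[i+1:]) for b in actions[i])
        if payoffs[i](profile) < best:  # player i has a profitable deviation
            return False
    return True

# Example: one-shot prisoner's dilemma ('C' = cooperate, 'D' = defect).
pd = {('C', 'C'): (2, 2), ('C', 'D'): (0, 3), ('D', 'C'): (3, 0), ('D', 'D'): (1, 1)}
payoffs = [lambda p, i=i: pd[p][i] for i in range(2)]
actions = [('C', 'D'), ('C', 'D')]
print([p for p in product(*actions) if is_nash(payoffs, actions, p)])
# [('D', 'D')] -- the classical (defecting) equilibrium mentioned above
```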

It occurs to me that my references above to "coherence" should be replaced by "coherence & P(T)=1 & reflective consistency". That is, there exists (if I understand correctly) a P that has all three properties, and that assigns the probabilities listed above. Therefore, those three properties would not suffice to characterize a suitable P for a UDT agent. (Not that anyone has claimed otherwise.)

The informal idea seems very interesting, but I'm hesitant about the formalisation. The one-directional => in the reflection principle worries me. This "sort of thing" already seems possible with existing theories of truth. We can split the Tarski biconditional A<=>T("A") into the quotation principle A=>T("A") and the dequotation principle T("A")=>A. Many theories of truth keep one or the other. A quotational theory may be called a "glut theory": the sentences regarded as true are a strict superset of those which are true, which means we have T("A") and T("~A") for some A. Disquotational theories, on the other hand, will be "gap theories": the sentences asserted true are a strict subset of those which are true, so that we have ~T("A") and ~T("~A") for some A. This classification oversimplifies the situation (it makes some unfounded assumptions), but...

The point is, the single-direction arrow makes this reflection principle look like a quotational ("glut") theory.

Usually, for a glut theory, we want a restricted disquotational principle to hold. The quotational principle already holds, so we might judge the success of the theory by how wide a class of sentences can be included in a disquotational principle. (Similarly, we might judge gap theories by how wide a class of sentences can be included in a quotational principle.) The paper doesn't establish anything like this, though, and that's a big problem.

The reflection principle a<P(A)<b => P('a<P("A")<b')=1 gives us very little, unless I'm missing something... it seems to me you need a corresponding disquotational principle to ensure that P('a<P("A")<b')=1 actually means something.

Is there anything blocking you from setting P("a<P('A')<b")=1 for all cases, and therefore satisfying the stated reflection principle trivially?

Is there anything blocking you from setting P("a<P('A')<b")=1 for all cases, and therefore satisfying the stated reflection principle trivially?

Yes, the theory is logically coherent, so it can't have P("a < P('A')") = 1 and P("a > P('A')") = 1.

For example, the following disquotational principle follows from the reflection principle (by taking contrapositives):

P( x <= P(A) <= y) > 0 ----> x <= P(A) <= y

The unsatisfying thing is that one direction has "<=" while the other has "<". But this corresponds to a situation where you can make arbitrarily precise statements about P, you just can't make exact statements. So you can say "P is within 0.00001 of 70%" but you can't say "P is exactly 70%."

I would prefer to be able to make exact statements, but I'm pretty happy to accept this limitation, which seems modest in the scheme of things---after all, when I'm writing code I never count on exact comparisons of floats anyway!
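A small illustration of the float analogy (nothing here is specific to the paper): exact equality of floats is fragile, but within-epsilon and open-interval claims are easy to certify.

```python
import math

p = 0.1 + 0.2
print(p == 0.3)                            # False: exact comparison fails
print(math.isclose(p, 0.3, rel_tol=1e-9))  # True: within-epsilon comparison succeeds
print(0.2999 < p < 0.30001)                # True: open bounds like the post's example
```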

This was intended as a preliminary technical report, and we'll include much more discussion of these philosophical issues in future write-ups (also much more mathematics).

Ok, I see!

To put it a different way (which may help people who had the same confusion I did):

We can't set all statements about probabilities to 1, because P is encoded as a function symbol within the language, so we can't make inconsistent statements about what value it takes on. ("Making a statement" simply means setting a value of P to 1.)

I'm very pleased with the simplicity of the paper; short is good in this case.

Actually, we can use coherence to derive a much more symmetric disquotation principle:

P(x>P(A)>y)=1 => x>P(A)>y.

Suppose P(x>P(A)>y)=1. For contradiction, suppose P(A) is outside this range. Then we would also have P(w>P(A)>z)=1 for some (w,z) mutually exclusive with (x,y), contradicting coherence.

Right?

Not quite---if P(A) = x or P(A) = y, then it isn't in any interval (w, z) which is non-overlapping with (x, y).

We can obtain P(x > P(A) > y) =1 ---> x >= P(A) >= y by this argument. We can also obtain P(x >= P(A) >= y) > 0 ---> x >= P(A) >= y.

So, herein lies the "glut" of the theory: we will have more > statements than are strictly true. > will behave as >= should: if we see > as a conclusion in the system, we have to think >= with respect to the "true" P.

A "gap" theory of similar kind would instead report too few inequalities...

Yes, there is an infinitesimal glut/gap; similarly, the system reports fewer >= statements than are true. This seems like another way at looking at the trick that makes it work---if you have too many 'True' statements on both sides you have contradictions, if you have too few you have gaps, but if you have too many > statements and too few >= statements they can fit together right.

For example, the following disquotational principle follows from the reflection principle (by taking contrapositives):

P( x <= P(A) <= y) > 0 ----> x <= P(A) <= y

The unsatisfying thing is that one direction has "<=" while the other has "<".

The negated statements become 'or', so we get x <= P(A) or P(A) <= y, right?

To me, the strangest thing about this is the >0 condition... if the probability of this type of statement is above 0, it is true!

I agree that the derivation of (4) from (3) in the paper is unclear. The negation of a < b < c is a >= b or b >= c.

Ah, so there are already revisions... (I didn't have a (4) in the version I read).

John Baez just posted a blog article on the topic which is longer than his previous Google+ post.

I'm actually quite surprised. I didn't think I'd be saying this, but it looks like a good job math-wise (edit: obvious caveats apply: not my area of expertise, don't know if it's novel, don't know if there are any related proofs). (As far as x-risks go, though, something not well designed for self-improvement is a lot less scary.)

It's not obvious to me that this work is relevant for AI safety research, but it seems mathematically interesting on its own (though it's not my area of expertise either, so I can't evaluate it).

I'm positively impressed. Honestly, I'd never have thought that SI/MIRI would deliver anything scientifically noteworthy. I hope this is the beginning of a positive trend.

Note that Paul keeps updating the draft, always at the same URL. That's convenient because links always point to the latest draft, but you may need to hit 'refresh' on the PDF's URL a couple times to get your browser to not just pull up the cached version. As of this moment, the current draft at that URL should say "revised March 26" at the top.

(...)you may need to hit 'refresh' on the PDF's URL a couple times to get your browser to not just pull up the cached version.

Tip: Most browsers have a "Force full refresh" option that forces clearing the cache and re-downloading all assets on the page, usually assigned the shortcut Ctrl+F5 by default.

Comments on the proof of Theorem 2:

I believe the entire second-to-last paragraph can be replaced by:

If there is no such φ, then P′ does not lie in the set of coherent probability distributions assigning probability 1 to T. Thus we can take the neighborhood of P′ to be the complement of this set (which is open since this set is closed) and let the neighborhood of P be the entire space.

I'm thinking that the proof becomes conceptually slightly clearer if you show that the graph is closed by showing that it contains the limit of any convergent sequence, though:

Suppose (P′_n, P_n) lies in the graph for all n, and (P′_n, P_n) → (P′, P). Since the set of coherent probability distributions assigning probability 1 to T is closed, we have that P′ lies in this set. We must show that P′(a < P('φ') < b) = 1 for every reflection axiom of P. Thus, suppose that a < P(φ) < b. Since P_n(φ) → P(φ), for all sufficiently large n we have a < P_n(φ) < b and therefore P′_n(a < P('φ') < b) = 1. Since P′_n → P′, it follows that P′(a < P('φ') < b) = 1. In other words, P′ assigns probability 1 to every element of the set of reflection axioms for P, and hence to all of this set (since it is countable).

I think this is a cleaner way of saying it; I'm going to take this approach. Thanks!

One subtlety that has been bothering me: In the draft's statement of Theorem 2, we start with a language L and a consistent theory T over L, then go on considering the extension L' of L with a function symbol P(.). This means that P(.) cannot appear in any axiom schemas in T; e.g., if T is ZFC, we cannot use a condition containing P(.) in comprehension. This seems likely to produce subtle gaps in applications of this theory. In the case of ZFC, this might not be a problem since the function denoted by P(.) is representable by an ordinary set at least in the standard models, so the free parameters in the axiom schemas might save us -- though then again, this approach takes us beyond the standard models...

I think not having to think about such twiddly details might be worth making the paper a tad more complicated by allowing T to be in the extended language. I haven't worked through the details, but I think all that should change is that we need to make sure that A(P) is nonempty, and I think that to get this it should suffice to require that T is consistent with any possible reflection axiom a < P("phi") < b. [ETA: ...okay, I guess that's not enough: consider ((NOT phi) OR (NOT psi)), where phi and psi are reflection axioms. Might as well directly say "consistent with A_P for any coherent probability distribution P" then, I suppose.]

I think including a condition with P in comprehension will require more subtlety---for example, such an axiom would not be consistent with most coherent distributions P . If I include the axiom P(x in S) = P(x not in x), then it better be the case that P(S in S) = 1/2.

We can still do it, but I don't see how to factor out the work into this paper. If you see a useful way to generalize this result to include axioms of L', that would be cool; I don't see one immediately, which is why it is in this form.

Sorry, we're miscommunicating somewhere. What I'm saying is that e.g. given a set A of statements, I want the axiom asserting the existence of the set {x ∈ A : P(x) > 1/2}, i.e. the comprehension axiom applied to the condition P(x) > 1/2. I don't understand how this would lead to P(x ∈ S) = P(x ∉ x); could you explain? (It seems like you're talking about unrestricted comprehension of some sort; I'm just talking about allowing the condition in ordinary restricted comprehension to range over formulas in L'. Maybe the problem you have in mind only occurs in the unrestricted comprehension work which isn't in this draft?)

Consider my proposed condition that "T is consistent with A_P for any coherent distribution P". To see that this is true for ZFC in the language L', choose a standard model of ZFC in L and, for any function P from the sentences of L' to [0, 1], extend it to a model in L' by interpreting the function symbol P('.') as the function P(.); unless I'm being stupid somehow, it's clear that the extended model will satisfy ZFC-in-L' + A_P.

It seems to me that the only parts of the proof that need to be re-thought are the arguments that (a) the set of coherent probability distributions assigning probability 1 to T and (b) the sets A(P) are non-empty. Perhaps the easiest way to say the argument is that we extend (a) T or (b) T plus the reflection axioms for P to some arbitrary complete theory T′, and set P′(φ) = 1 if φ ∈ T′ and P′(φ) = 0 otherwise.

I understand what you are saying. You are completely right, thanks for the observation. I don't have time to muck with the paper now, but it looks like this would work.

Congrats to MIRI! Just some comments that I already made to Mihaly and Paul:

1) Agree with Abram that unidirectional implication is disappointing and the paper should mention the other direction.

2) The statement "But this leads directly to a contradiction" at the top of page 4 seems to be wrong, because the sentence P(G)<1 isn't of the form P(phi)=p required for the reflection principle.

3) It's not clear whether we could have bidirectional implication if we used closed intervals instead of open, maybe worth investigating further.

We tried closed intervals a fair bit, no luck so far.

Posted an edited version in response to some of these issues. Thanks again for the remarks!

(I feel less strongly about the bidirectional implication than you; as math I can see why you would definitely want it, but in practice I am nearly as happy to eat the epsilon of error.)

My concern is more about "leaving money on the table" so to speak. The result with epsilon error would be satisfying if we knew it was impossible to do better.

1) Agree with Abram that unidirectional implication is disappointing and the paper should mention the other direction.

I consider Paul's response to be sufficient.

I think that Theo. 2 of your paper, in a sense, produces only non-standard models. Am I correct?

Here is my reasoning:

If we apply Theo. 2 of the paper to a theory T containing PA, we can find a sentence G such that "G <=> Pr("G")=0" is a consequence of T. ("Pr" is the probability predicate inside L'.)

If P(G) is greater than 1/n (I use P to speak about probability over the distribution over models of L'), using the reflection scheme, we get that with probability one, "Pr("G")>1/n" is true. This means that with probability one, G is false, a contradiction. We deduce that P(G)=0.

Using the reflection scheme and the fact that, for all n, P(G) < 1/n, we get that with probability one, "Pr("G") < 1/n" is true. Also, with probability one, Pr("G") is not 0: otherwise G would be true with positive probability, i.e. we would have P(G) > 0, which as shown above is false.

From the last two sentences, with probability one, inside of the model, there is a strictly positive number smaller than 1/n for any standard integer n: the value of Pr("G").

It might be interesting to see if we can use this to show that Theo. 2 cannot be done constructively, for some meaning of "constructively".

This is right. (I think we point out in the paper that the model necessarily contains non-standard infinitesimals.) But an interesting question is: can this distribution assign any model a positive probability? I would guess not (e.g. because P("G") is uniformly random for some "G").

I agree with your guess and I think that I see a way of proving it.

Let us make the same assumptions as in Theo. 2 and assume that T contains PA.

Let G be any sentence of L'. We can build a second sentence G2 such that "G2 <=> ( G and (Pr(G and G2)< p))" for P(G)/3< p <2P(G)/3, using diagonalization. From this and the reflection scheme, it should be possible to prove that P(G and G2) and P(G and not G2) are both smaller than (2/3)P(G).

Repeating the argument above, we can show that any complete theory in L' must have vanishing probability and therefore every model must also have vanishing probability.

Another dilettante question: Is there a mathematics of statements with truth values that change? I'm imagining "This statement is false" getting filed under "2-part cycle".

Yes. There is an overview of revision theories of truth (and many other approaches to defining truth) at the SEP.

I've never heard of it but my intuition says it doesn't sound very promising - any correspondence definition of "truth" that changes over time should reduce to a timeless time-indexed truth predicate and I don't see how that would help much, unless the truth predicate couldn't talk about the index.

The truth predicate could talk about the past (but not the present), but I believe that reduces down to other things that have been explored quite thoroughly (iirc, an idea of a hierarchy of math with higher levels being able to talk about lower levels but not vice versa was around for Bertrand Russell to endorse). And at that point "time" is just a label for what level of explanatory power you are talking about.

You could look up George Spencer Brown's "Laws of Form". It's...odd. (Poking this in through a phone otherwise would go into more detail.)

I read it long ago, but didn't understand the parts that looked like circuits. I should probably give it another try.

(This reminds me of when I short-circuited a NOT gate, curious to see what would happen and suspecting it might oscillate between the two values. Actually, it stabilized to a voltage somewhere between “0” and “1” and stayed there.)

If you posit that "This statement is false" (i.e. the fixpoint μx. ¬x) oscillates or "switches" between TRUE and FALSE, then this is pretty much how digital logic works in electronics, because the NOT operation is implemented as a logic gate with a finite delay.
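A toy discrete-time sketch of that picture (an idealized inverter with a one-step delay; my own illustration): feeding a NOT gate's output back into its input gives the period-2 oscillation, i.e. the "2-part cycle" mentioned earlier.

```python
# Idealized inverter with unit delay, fed back on itself: the state never
# settles, it alternates -- the digital-logic reading of "This statement is false".
def ring_oscillator(initial=False, steps=8):
    state, history = initial, []
    for _ in range(steps):
        history.append(state)
        state = not state  # the inverter's output becomes the next input
    return history

print(ring_oscillator())  # [False, True, False, True, False, True, False, True]
```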

There may be an issue with the semantics here. I'm not entirely certain of the reasoning here, but here goes:

Reflection axiom: ∀a, b: (a < P(φ) < b) ⇒ P(a < P('φ') < b) = 1 (the version in Paul's paper).

Using diagonalisation, define G ⇔ -1<P('G')<1

Then P(G)<1

⇒ -1 < P(G) < 1 (since P(G) ≥ 0)

⇒ P(-1<P('G')<1)=1 (by reflection)

⇒ P(G)=1 (by definition of G).

Hence, by contradiction (and since P cannot be greater than 1), P(G)=1. Hence P('G')=1 (since the two P's are the same formal object), and hence G is false. So we have a false result that has probability 1.

But it may get worse. If we can prove P(G)=1 then (?) the logical system itself can prove P('G')=1 - and from that, can disprove G. So ¬G is a theorem of the system! And yet P(¬G)=0. So the system has theorems with probability zero...

EDIT: corrected the version of reflection to the version in Paul's paper (not the one in Eliezer's post) by removing two '.

P(G) is a real number in the interval [0, 1], so you shouldn't be too sad that P(P(G) = p) = 0 for each p. More generally, you shouldn't expect good behavior when P talks about exact statements---P can be mistaken about itself up to some infinitesimal epsilon, so it can be wrong about whether P(G) is exactly 1 or slightly less.

We can prove P(G) = 1, but the system can't. The reflection schema is not a theorem---each instance is true (and is assigned probability 1), and the universal quantification is true, but the universal quantification is not assigned probability 1. In fact, it is assigned probability 0.

Hum... So the system is unable to prove that -1 < P('G') < 1 implies P('-1 < P('G') < 1') = 1 (note where the ' are)? This is a (true) statement about the function P.

So, first of all, there isn't really a particular formal proof system in the paper, except sort of implicitly; all we get is a function P(.) from L'-sentences to real numbers. However, we can talk about the first-order theory you obtain by taking the base theory T and adding to it the reflection axioms, i.e. the axioms a < P('G') < b for all G such that the function P(.) satisfies a < P(G) < b. [For anyone not paying very close attention, P('.') is a function symbol in the formal language, while P(.) is a meta-language function from the formal language to the real numbers.] The main theorem is that P(.) assigns probability 1 to this theory (and therefore to all consequences of this theory), so it is interesting to ask what it entails.

Your question, if I understand correctly, is whether this theory does in particular entail

[-1 < P('G') < 1] --> [P('-1 < P('G') < 1') = 1]

(note that this is simply a sentence in the language L', not a meta-language statement). There's no obvious reason that the system should be able to prove this -- all we have is T and the individual axioms of the form [a < P('G') < b] --, and indeed your proof shows that (unless there is an inconsistency in the paper) this is not a theorem of the system: if it were, then so would be P('G') = 1, and therefore ¬G, leading to a contradiction since P(.) assigns probability 1 to all theorems of this system.

In short, yes, there are true statements about P(.) such that the corresponding statement about P('.') is false in the formal system, and assigned zero probability by P(.). And yes, I do believe this is worth keeping in mind.

Here is my current preferred way of thinking about this: In non-standard analysis, there is a "standard" operator st(.) which takes a non-standard real number x and returns the standard number st(x) that x is infinitesimally close to (assuming one exists, which is the case if x is not unbounded). This operator can also be applied to functions f; st(f)(x) is the standard number which f(x) is infinitesimally close to. I imagine that there is a collection of "possible" non-standard models, each of which fulfills all the axioms that P(.) assigns probability 1, and that one of these possible non-standard models is the true one describing the real world. Then, P(.) is st(P('.')), where P('.') is the function in the true non-standard model. [ETA: This is always defined because we have -1 < P('G') < 2, so every value of P('.') is infinitesimally close to a standard number.]

How could the real world be described by a non-standard model, you ask? The following is rather handwavy, but I'm imagining the agent as computing an infinite sequence of approximations P_i(.) to the intended distribution, and the true model to be the ultrapower of some standard model of ZFC, with P('.') being interpreted as P_i(.) in the i'th component of the ultrapower. Whether any leverage can be gotten out of making this idea formal, I don't know. (And it's certainly true that to pull off this intuition, I still need to have a free ultrafilter falling out of the sky.)

Ok, for [a < P('G') < b] I see why you'd use a schema (because it's an object level axiom, not a meta-language axiom).

But this still seems possibly problematic. We know that adding axioms like [-1 < P('G') < 1] --> [P('-1 < P('G') < 1') = 1], would break the system. But I don't see a reason to suppose that the other reflection axioms don't break the system. It might or it might not, but I'm not sure; was there a proof along the lines of "this system is inconsistent if and only if the initial system is" or something?

I'm not sure whether I'm misunderstanding your point, but the paper proves that there is a coherent probability distribution P(.) that assigns probability 1 to both T and to the collection of reflection axioms [a < P('G') < b] for P(.); this implies that there is a probability distribution over complete theories assigning probability 1 to (T + the reflection axioms for P). But if (T + the reflection axioms for P) were inconsistent, then there would be no complete theory extending it, so this would be the empty event and would have to be assigned probability 0 by any coherent probability distribution. It follows that (T + the reflection axioms for P) is consistent. (NB: by "the reflection axioms for P(.)", I only mean the appropriate instances of [a < P('G') < b], not anything that quantifies over a, b or G inside the object language.)

Incidentally, why bother with a schema? Reflection belongs in the meta language, so there's no need to use a schema (?)

I think the point is that reflection gives an infinite collection of formal statements all of which must be assigned probability 1 ("axioms"), in analogy to how an axiom schema is a specification of an infinite collection of axioms. Whether "schema" is the right word to use here is debatable, I guess; I was happy with the choice but perhaps you could argue that "schema" should be reserved to recursively enumerable collections or something.

My point was that reflection is defined for the meta language, and quantifies over all the sentences in the object language. This is not something that can be added as an axiom in the object language at all - P (without the ') has no meaning there.

So none of the sub-statements of reflection get assigned probability 1 - they just aren't the kind of thing P works on. So I felt that calling this an schema distracted attention from what reflection is, which is a property of P, not axioms for the lower system.

But I may be misunderstanding the setup here...

Right, here's what's going on. The statement in the paper is,

∀a, b: (a < P(φ) < b) ⇒ P(a < P('φ') < b) = 1.

The idea is not that this whole statement is an axiom schema. Instead, the idea is that the schema is the collection of the axioms a < P('φ') < b for all rational numbers a, b and sentences φ such that a < P(φ) < b. The full statement above is saying that each instance of this schema is assigned probability 1. (From what I remember of a previous conversation with Paul, I'm pretty confident that this is the intended interpretation.) The language in the paper should probably be clearer about this.

Using diagonalisation, define G ⇔ -1<P('G')<1

Then P(G)<1

I think this should be "Then P(G) <= 1".

No, that's correct. P(G)<1 is the premise of a proof by contradiction.

Ah. I didn't quite understand what you were trying to do there.

In general in this theory, I don't think we have (p(phi) < 1 -> p(phi) =1) -> p(phi) = 1. The theory only reflects on itself to within epsilon, so what the theory would conclude if it were certain about p(phi) being less than 1 can't be taken as the premise of a proof-by-contradiction in the same way that ~phi can be taken as a premise... ergh, I'm trying to figure out a more helpful way to state this than "I don't think you can derive that kind of probabilistic proof by contradiction from the stated properties of the theory because the system is never supposed to know the exact probabilities it assigns".

EDIT: Retracted pending further cogitation.

Stuart's proof by contradiction goes through, as far as I can see. (The speculation in the last paragraph doesn't, as Paul notes, and I don't know what "Hence P('G')=1" is supposed to mean, but the proof by contradiction part does work.)

ETA: I now think that "Hence P('G') = 1" is supposed to mean that this statement is implied by the first-order theory (T + the reflection axioms).

I used the axioms as stated, and any non-intuitionist logic. If we want to capture your intuition, I think we'll have to tweak the definitions...

(Bit off-topic:)

Why are we so invested in solving the Löb problem in the first place?

In a scenario in which there is AGI that has not yet foomed, would that AGI not refrain from rewriting itself until it had solved the Löb problem, just as Gandhi would reject taking a pill which would make him more powerful at the risk of changing his goals?

In other words, does coming up with a sensible utility function not take precedence? Let the AGI figure out the rewriting details: if the utility function is properly implemented, the AGI won't risk changing itself until it has come up with its own version of staying reflectively consistent. If it did not protect its own utility function in such a way, that would be such a fundamental flaw that having solved Löb's problem wouldn't matter; it would just propagate the initial state of the AGI not caring about preserving its utility function.

It certainly would be nice having solved that problem in advance, but since resources are limited ...

edit: If the answer is along the lines of "we need to have Löb's problem solved in order to include 'reflectively consistent under self-modification' in the AGI's utility function in a well-defined way" I'd say "Doesn't the AGI already implicitly follow that goal, since the best way to adhere to utility function U1 is to keep utility function U1 unchanged?" This may not be true; maybe without having solved Löb's problem, AGIs may end up wireheading themselves.

The trouble is basically: there is a good chance we can build systems that---in practice---do self-modification quite well, while not yet understanding any formalism that can capture notions like value stability. For example, see evolution. So if we want to minimize the chance of doing that, one thing to do is to first develop a formalism in which goal stability makes sense.

(Goal stability is one kind of desirable notion out of many. The general principle is: if you don't understand how or why something works, it's reasonably likely to not do quite what you want it to do. If you are trying to build an X and want to make sure you understand how it works, a sensible first step is to try and develop an account of how X could exist at all.)

Evolution isn't an AGI, if we're talking not about "systems that do self-modification" in general, but an "AGI that could do self modification", does it need to initially contain an additional, explicit formalism to capture notions like value stability? Failing to have such a formalism, would it then not care what happens to its own utility function?

Randomizing itself in unpredictable ways would be counter to the AGI's current utility function. Any of the actions the AI takes is by definition intended to serve its current utility function - even if that intent seems delayed, e.g. when the action is taking measurements in order to build a more accurate model, or self-modifying to gain more future optimizing power.

Since self-modification in an unpredictable way is an action that strictly jeopardizes the AGI's future potential to maximize its current utility function (the only basis for the decision concerning any action, including whether the AGI will self-modify or not), the safeguard against unpredictable self-modification would be inherently engrained in the AGI's desire to only ever maximize its current utility function.

Conclusion: The formalism that saves the AGI from unwanted self-modification is its desire to fulfill its current utility function. The AGI would be motivated to develop formalisms that allow it to self modify to better optimize its current utility function in the future, since that would maximize its current utility function better (what a one-trick-pony!).

Since self-modification in an unpredictable way is an action that strictly jeopardizes the AGI's future potential to maximize its current utility function (the only basis for the decision concerning any action, including whether the AGI will self-modify or not), the safeguard against unpredictable self-modification would be inherently engrained in the AGI's desire to only ever maximize its current utility function.

I understand the point you are making about value stability.

With our current understanding, an AI with this architecture would not do anything productive. The concern isn't that an AI with this architecture would do something bad, it's that (in light of the fact that it would not do anything productive) you wouldn't build it. Instead you would build something different; quite possibly something you understand less well and whose good behavior is more of an empirical regularity (and potentially more unstable). Humans as the output of evolution are the prototypical example of this, though many human artifacts have the same character to varying degrees (programs, organizations, cities, economies).

With our current understanding, an AI with this architecture would not do anything productive.

Is that because any non-trivial action could run a chance of changing the AGI, and thus the AGI wouldn't dare do anything at all? (If (false), disregard the following. Return 0;).

If (true), with goal stability being a paramount invariant, would you say that the AGI needs to extrapolate the effect any action would have on itself, before executing it? As in "type 'hi'" or "buy an apple" being preceded by "prove this action maintains the invariant 'goal stability'".

It seems like such an architecture wouldn't do much of anything either, combing through its own code whenever doing anything. (Edit: And competing teams would be quick to modify the AGI such that it checks less.)

If you say that not every action necessitates proving the invariant over all of its code, then an AI without having a way of proving actions to be non-invariant-threatening could do any actions that wouldn't result in a call to the (non-existing) "prove action isn't self-modifying in a goal shifting way".

Is that because any non-trivial action could run a chance of changing the AGI, and thus the AGI wouldn't dare do anything at all? (If (false), disregard the following. Return 0;).

That or it takes actions changing itself without caring that they would make it worse because it doesn't know that its current algorithms are worth preserving. Your scenario is what might happen if someone notices this problem and tries to fix it by telling the AI to never modify itself, depending on how exactly they formalize 'never modify itself'.

It's somewhat tricky to separate "actions which might change my utility function" from "actions". Gandhi might not want the murder pill, but should he eat eggs? They have cholesterol that can be metabolized into testosterone which can influence aggression. Is that a sufficiently small effect?

Any kind of self-modification would be out of the question until the AGI has solved the problem of keeping its utility function intact.

That being said, it should be easier for an AGI to keep its code unmodified while solving a problem than for Gandhi not to eat.

That being said, it should be easier for an AGI to keep its code unmodified while solving a problem than for Gandhi not to eat.

That is not clear to me. If I haven't yet worked out how to architect myself such that my values remain fixed, I'm not sure on what basis I can be confident that observing or inferring new things about the world won't alter my values. And solving problems without observing or inferring new things about the world is a tricky problem.

How would any epistemic insights change the terminal values you'd want to maximize? They'd change your actions pursuant to maximizing those terminal values, certainly, but your own utility function? Wouldn't that be orthogonality-threatening? edit: I remembered this may be a problem with e.g. AIXI[tl], but with an actual running AGI? Possibly.

If you cock up and define a terminal value that refers to a mutable epistemic state, all bets are off. Like Asimov's robots on Solaria, who act in accordance with the First Law, but have 'human' redefined not to include non-Solarians. Oops. Trouble is that in order to evaluate how you're doing, there has to be some coupling between values and knowledge, so you must prove the correctness of that coupling. But what is correct? Usually not too hard to define for the toy models we're used to working with, damned hard as a general problem.

Beats me, but I don't see how a system that's not guaranteed to keep its values fixed in the first place can be relied upon not to store its values in such a way that epistemic insights won't alter them. If there's some reason I should rely on it not to do so, I'd love an explanation (or pointer to an explanation) of that reason.

Certainly, I have no confidence that I'm architected so that epistemic insights can't alter whatever it is in my brain that we're talking about when we talk about my "terminal values."