# 22

In this post I intend to:

• Briefly explain the Loebian obstacle and it's relevance to AI (feel free to skip it if you know what the Loebian obstacle is).
• Suggest a solution in the form a formal system which assigns probabilities (more generally probability intervals) to mathematical sentences (and which admits a form of "Loebian" self-referential reasoning). The method is well-defined both for consistent and inconsistent axiomatic systems, the later being important in analysis of logical counterfactuals like in UDT.

# Background

## Logic

When can we consider a mathematical theorem to be established? The obvious answer is: when we proved it. Wait, proved it in what theory? Well, that's debatable. ZFC is popular choice for mathematicians, but how do we know it is consistent (let alone sound, i.e. that it only proves true sentences)? All those spooky infinite sets, how do you know it doesn't break somewhere along the line? There's lots of empirical evidence, but we can't prove it, and it's proofs we're interesting in, not mere evidence, right?

Peano arithmetic seems like a safer choice. After all, if the natural numbers don't make sense, what does? Let's go with that. Suppose we have a sentence s in the language of PA. If someone presents us with a proof p in PA, we believe s is true. Now consider the following situations: instead of giving you a proof of s, someone gave you a PA-proof p1 that p exists. After all, PA admits defining "PA-proof" in PA language. Common sense tells us that p1 is a sufficient argument to believe s. Maybe, we can prove it within PA? That is, if we have a proof of "if a proof of s exists then s" and a proof of R(s)="a proof of s exists" then we just proved s. That's just modus ponens

There are two problems with that.

First, there's no way to prove the sentence L:="for all s if R(s) then s", since it's not a PA-sentence at all. The problem is that "for all s" references s as a natural number encoding a sentence. On the other hand, "then s" references s as the truth-value of the sentence. Maybe we can construct a PA-formula T(s) which means "the sentence encoded by the number s is true"? Nope, that would get us in trouble with the liar paradox (it would be possible to construct a sentence saying "this sentence is false").

Second, Loeb's theorem says that if we can prove L(s):="if R(s) exists then s" for a given s, then we can prove s. This is a problem since it means there can be no way to prove L(s) for all s in any sense, since it's unprovable for s which are unprovable. In other words, if you proved not-s, there is no way to conclude that "no proof of s exists".

What if we add an inference rule Q to our logic allowing to go from R(s) to s? Let's call the new formal system PA1p1 appended by a Q-step becomes an honest proof of s in PA1. Problem solved? Not really! Now someone can give you a proof of
R1(s):="a PA1-proof of s exists". Back to square one! Wait a second, what if we add a new rule Q1 allowing to go from R1(s) to s? OK, but now we got R2(s):="a PA2-proof of s exists". Hmm, what if add an infinite number of rules Qk? Fine, but now we got Rω(s):="a PAω-proof of s exists". And so on, and so forth, the recursive ordinals are a plenty...

Bottom line, Loeb's theorem works for any theory containing PA, so we're stuck.

## AI

Suppose you're trying to build a self-modifying AGI called "Lucy". Lucy works by considering possible actions and looking for formal proofs that taking one of them will increase expected utility. In particular, it has self-modifying actions in its strategy space. A self-modifying action creates essentially a new agent: Lucy2. How can Lucy decide that becoming Lucy2 is a good idea? Well, a good step in this direction would be proving that Lucywould only take actions that are "good". I.e., we would like Lucy to reason as follows "Lucyuses the same formal system as I, so if she decides to take action a, it's because she has a proof p of the sentence s(a) that 'a increases expected utility'. Since such a proof exits, a does increase expected utility, which is good news!" Problem: Lucy is using L in there, applied to her own formal system! That cannot work! So, Lucy would have a hard time self-modifying in a way which doesn't make its formal system weaker

As another example where this poses a problem, suppose Lucy observes another agent called "Kurt". Lucy knows, by analyzing her sensory evidence, that Kurt proves theorems using the same formal system as Lucy. Suppose Lucy found out that Kurt proved theorem s, but she doesn't know how. We would like Lucy to be able to conclude s is, in fact, true (at least with the probability that her model of physical reality is correct). Alas, she cannot.

See MIRI's paper for more discussion.

# Evidence Logic

Here, cousin_it explains a method to assign probabilities to sentences in an inconsistent theory T. It works as follows. Consider sentence s. Since T is inconsistent, there are T-proofs both of s and of not-s. Well, in a courtroom both sides are allowed to have arguments, why not try the same approach here? Let's weight the proofs as a function of their length, analogically to weighting hypotheses in Solomonoff induction. That is, suppose we have a prefix-free encoding of proofs as bit sequences. Then, it makes sense to consider a random bit sequence and ask whether it is a proof of something. Define the probability of s to be

P(s) := (probability of a random sequence to be a proof of s) / (probability of a random sequence to be a proof of s or not-s)

Nice, but it doesn't solve the Loebian obstacle yet.

I will now formulate an extension of this idea that allows assigning an interval of probabilities [Pmin(s), Pmax(s)] to any sentence s. This interval is a sort of "Knightian uncertainty". I have some speculations how to extract a single number from this interval in the general case, but even without that, I believe that Pmin(s) = Pmax(s) in many interesting cases.

First, the general setting:

• With every sentence s, there are certain texts v which are considered to be "evidence relevant to s". These are divided into "negative" and "positive" evidence. We define sgn(v) := +1 for positive evidence, sgn(v) := -1 for negative evidence.
• Each piece of evidence v is associated with the strength of the evidence strs(v) which is a number in [0, 1]
• Each piece of evidence v is associated with an "energy" function es,v : [0, 1] -> [0, 1]. It is a continuous convex function.
• The "total energy" associated with s is defined to b es := ∑v 2-l(ves,v where l(v) is the length of v.
• Since es,v are continuous convex, so is es. Hence it attains its minimum on a closed interval which is
[Pmin(s), Pmax(s)] by definition.
Now, the details:
• A piece of evidence v for s is defined to be one of the following:
• a proof of s
• sgn(v) := +1
• strs(v) := 1
• es,v(q) := (1 - q)2
• a proof of not-s
• sgn(v) := -1
• strs(v) := 1
• es,v(q) := q2
• a piece of positive evidence for the sentence R-+(s, p) := "Pmin(s) >= p"
• sgn(v) := +1
• strs(v) := strR-+(s, p)(v) p
• es,v(q) := 0 for q > p; strR-+(s, p)(v) (q - p)2 for q < p
• a piece of negative evidence for the sentence R--(s, p) := "Pmin(s) < p"
• sgn(v) := +1
• strs(v) := strR--(s, p)(v) p
• es,v(q) := 0 for q > p; strR--(s, p)(v) (q - p)2 for q < p
• a piece of negative evidence for the sentence R++(s, p) := "Pmax(s) > p"
• sgn(v) := -1
• strs(v) := strR++(s, p)(v) (1 - p)
• es,v(q) := 0 for q < p; strR-+(s, p)(v) (q - p)2 for q > p
• a piece of positive evidence for the sentence R+-(s, p) := "Pmax(s) <= p"
• sgn(v) := -1
• strs(v) := strR+-(s, p)(v) (1 - p)
• es,v(q) := 0 for q < p; strR-+(s, p)(v) (q - p)2 for q > p
Technicality: I suggest that for our purposes, a "proof of s" is allowed to be a proof of sentence equivalent to s in 0-th order logic (e.g. not-not-s). This ensures that our probability intervals obey the properties we'd like them to obey wrt propositional calculus.

Now, consider again our self-modifying agent Lucy. Suppose she makes her decisions according to a system of evidence logic like above. She can now reason along the lines of "Lucyuses the same formal system as I. If she decides to take action a, it's because she has strong evidence for the sentence s(a) that 'a increases expected utility'. I just proved that there would be strong evidence for the expected utility increasing. Therefore, the expected utility would have a high value with high logical probability. But evidence for high logical probability of a sentence is evidence for the sentence itself. Therefore, I now have evidence that expected utility will increase!"

This analysis is very sketchy, but I think it lends hope that the system leads to the desired results.

## New to LessWrong?

New Comment

Does Lob's Theorem hold in intuitionistic logic?

[-][anonymous]9y0

The Double-Negation Translation says that a classically equivalent proposition does hold in intuitionistic logic.

I think it doesn't. However, I suspect it doesn't allow to achieve self-trust in AI. I don't really understand intuitionistic logic, but if we only admit constructive proofs then Lucy won't be able to reason in full generality about Lucy2's reasoning, will she? Won't the step "Lucy2 took action A so there exists a proof that A is a good idea" fail since it doesn't constructively show the proof Lucy2 would find? It feels like intuitionistic logic overcomes the problem of "a proof that a proof exists" in a "degenerate" way, namely, since every proof is constructive "a proof that a proof exists" has to display an actual proof.

I admit I haven't read Weaver's paper. Have to look into it. Thanks!

But doesn't this depend on the idea that strong evidence of strong evidence of X is strong evidence of X? And isn't that in fact a false assertion?

My definition of str is s.t. that strengths get multiplied when "recursing".

What properties do you want your logical probability distribution to have? I'd prefer building something with these properties, rather than jumping to answers first.

You don't seem to worry about computational resources, but I still recommend my posts. You focus on using other agents' beliefs as evidence, but you also want to avoid Löb's theorem, and I think you might violate it if you prove R-+(s,1).

What properties do you want your logical probability distribution to have?

In a consistent theory, my distribution assigns probability 1 to provably true statements and probability 0 to provably false statements. In general, I think it satisfies the interval version of the property P(x) = P(x & y) + P(x & not-y) (I think it should follow from my treatment of 0-order equivalent sentences as equivalent but I have to think about it). These two would mean that for consistent theories it is similar to "coherent" assignments in the sense of Christiano et al, albeit it only assigns probability intervals rather than actual probabilities. In inconsistent theories, it assigns high probability to sentences with short evidence in favor and long evidence against, which is what we want to happen for UDT to work. There is more to be said regarding choosing encoding of evidence and I hope to write about it later.

You don't seem to worry about computational resources, but I still recommend my posts.

I've read some of them! You can get a computable approximation of my probability intervals if you limit evidence length to some finite D. Of course the time complexity would be exponential in D.

...you also want to avoid Löb's theorem, and I think you might violate it if you prove R-+(s,1).

What do you mean "violate it"? How can you violate a theorem?

it assigns high probability to sentences with short evidence in favor and long evidence against, which is what we want to happen for UDT to work.

Interesting. I'm not convinced that that's required (e.g. for the 5 and 10 problem). If you read Ingredients of TDT with an eye towards that, I think there's a strong case that using causal surgery rather than logical implication solves the problem.

What do you mean "violate it"? How can you violate a theorem?

Fair enough - I meant that if you prove R-+(s,1), then for a consistent set of axioms, I think you violate the consistency condition set by Löb's theorem if say "If Pmin(s)=1, then s." Hmm. I guess this is not really a problem with the content of your post - it's more about the form of the additional axiom "GL".

Interesting. I'm not convinced that that's required (e.g. for the 5 and 10 problem). If you read Ingredients of TDT with an eye towards that, I think there's a strong case that using causal surgery rather than logical implication solves the problem.

As far as I know, TDT was mostly abandoned in favor of UDT. In particular I don't think there is a well defined recipe how to describe a given process as a causal diagram with pure-computation nodes. But I might be missing something.

Fair enough - I meant that if you prove R-+(s,1), then for a consistent set of axioms, I think you violate the consistency condition set by Löb's theorem if say "If Pmin(s)=1, then s."

I'm not sure what you're saying here. The usual notion of consistency doesn't apply to my system since it works fine for inconsistent theories. I believe that for consistent theories the energy minimum is always 0 which provides a sort of analogue to consistency in usual logics.

Problem: Lucy is using L in there, applied to her own formal system! That cannot work!

Lucy can assume a priori that it only takes "good" actions and use this to prove properties about Lucy_2, without having to prove the consistency of its own formal system.

I might be missing something, but it seems to me that this thing that you people call "Löbian obstacle" isn't really about self-modification (it occurs even if Lucy = Lucy_2), but about internal consistency: it's the old question of: "how can I know that I'm not insane?"
Well, you can't know, and there fundamentally is no way around it: Gödel's incompleteness theorem, Löb's theorem, Rice's theorem, etc. are all formalized restatements of this fact.

So, why can't Lucy just assume that it is sane? It seems to work well for humans.

So, why can't Lucy just assume that it is sane?

Lucy reasons using formal mathematical logic. A formal mathematical logic cannot "assume it is sane" because of Loeb's theorem. As long as Lucy isn't self-modifying, all is fine and dandy. She just proves theorems and acts accordingly, it's not like she has a built-in compulsion to prove her own sanity. However, once she begins to self-modify (or encounters another agent of similar complexity), she has to start proving theorems about Lucy2. Now, since Lucy2 is at least as smart as Lucy and preferably even smarter, Lucy has hard time predicting the specifics of how Lucy2 will behave. What Lucy wants is to reason generally, i.e. prove something along the lines of "Lucy2 will only do good things" rather than analyze what Lucy2 will do on case-by-case basis (Yudkowsky & Herreshoff call it "the Vingean principle" after Vernor Vinge). Unfortunately, Lucy is barred from proving such a general theorem. Now, we can try to hack Lucy's system of reasoning to accept self-modifications on weaker grounds than proving they are "good" (e.g. allow proving their goodness in a stronger formal system). But that would involve treating self-modification as something ontologically different than all other phenomena in the universe, which doesn't seem to be a good thing. We would like Lucy to be naturalistic i.e. to reason about herself and her descendants in the same way she reasons about everything else in the universe. Otherwise, she will be handicapped. For example, imagine that instead of self-modifying she considers to break into another much more powerful computer and programming that computer to be Lucy2. Such a scenario should be treated just like a self-modification, but since it doesn't fall under the definition of self-modification hard-coded by the programmers, Lucy is back to using her "general purpose" reasoning mode rather than her special self-modification mode. Same goes for encountering another agent. We can try to introduce "agent" as a special ontological category but defining it would be hard.

It seems to work well for humans.

I have a suspicion human brains use something like the evidence logic I'm proposing. It feels that the concept of "proof" is much less native to human thinking than the concept of "argument for/against".

She just proves theorems and acts accordingly, it's not like she has a built-in compulsion to prove her own sanity.

I suppose she could reason that if she can't prove that her formal system is consistent, then she can't trust that the actions she can prove to be "good" are actually "good". Or, even if she trusts her present self, she can't trust her future selves, even if they are identical (well, they can never be completely identical if they accumulate memories).

Your argument works even for Lucy = Lucy_2.

However, once she begins to self-modify (or encounters another agent of similar complexity), she has to start proving theorems about Lucy2.

It seems to me that the problem you are trying to solve is how to coordinate with other agents.
This has investigated in the context of program-equilibrium prisoner's dilemma, and the only satisfactory solution I know is to form cliques based on computable under-approximations of functional equivalence between the program source codes (no, I don't consider the FairBot/PrudentBot proposal a satisfactory solution, for reason I've explained here).
Generalize this to arbitrary problems and you get UDT or something alike.

I suppose she could reason that if she can't prove that her formal system is consistent, then she can't trust that the actions she can prove to be "good" are actually "good". Or, even if she trusts her present self, she can't trust her future selves...

Lucy works by trying to prove a given action is "good" and doing it if she's successful. She doesn't need to prove that "there is no proof A is not good" into order to do A. However it's true she can't trust her future selves. Indeed it can be a problem even without self-modifications. For example, if someone drops an anvil on Lucy, Lucy cannot reason that "I should dodge the anvil because if I get destroyed I won't do good things in the future". She can still have a hard-coded aversion to being destroyed, but this seems like a hack.

...Generalize this to arbitrary problems and you get UDT or something alike.

UDT assumes a system of assigning probabilities to mathematical sentences given a (possibly inconsistent) theory. For UDT to admit naturalistic trust, it needs that system to overcome the Loebian obstacle, like the system I propose here. Indeed the system I propose is intended as a component of UDT or UIM.

"I should dodge the anvil because if I get destroyed I won't do good things in the future". She can still have a hard-coded aversion to being destroyed, but this seems like a hack.

I think it's worse than that. If your argument is correct, the type of AI you are describing can't plan because it can't trust its future selves to follow through with the plan, even if doing so wouldn't require commitment. It would be limited to immediate utility maximization.

It seems that your argument proves too much. In fact, I think that Lucy can still prove useful properties about its successors even without proving the consistency of their and therefore her own, proof system.

UDT assumes a system of assigning probabilities to mathematical sentences given a (possibly inconsistent) theory. For UDT to admit naturalistic trust, it needs that system to overcome the Loebian obstacle, like the system I propose here. Indeed the system I propose is intended as a component of UDT or UIM.

That's not how UDT works in my understanding, even though I shall admit that I'm not an expert on the subject. Do you have a reference?

I think it's worse than that. If your argument is correct, the type of AI you are describing can't plan because it can't trust its future selves to follow through with the plan, even if doing so wouldn't require commitment.

We can avoid this problem if Lucy performs an action once it is the first in a provably "good" sequence of actions. This would allow her to dodge the anvil if it interferes with her immediate plans, but not on the general grounds of "a universe with Lucy is a better universe since Lucy is doing good things".

That's not how UDT works in my understanding, even though I shall admit that I'm not an expert on the subject. Do you have a reference?

I don't have a reference which discusses UDT and the Loebian obstacle together. You can find a description of the AFAIK "latest and greatest" UDT here. UDT considers proofs in a formal system. If this system suffers from the Loebian obstacle this will lead to the kind of problems I discuss here. In fact, I haven't stated it explicitly but I think of Lucy as a UDT agent: she considers possible actions as logical counterfactuals and computes expected utility based on that.