Overcoming the Loebian obstacle using evidence logic


17 comments

I think it doesn't. However, I suspect it doesn't allow us to achieve self-trust in AI. I don't really understand intuitionistic logic, but if we only admit *constructive* proofs then Lucy won't be able to reason in full generality about Lucy2's reasoning, will she? Won't the step "Lucy2 took action A so there exists a proof that A is a good idea" fail, since it doesn't *constructively* show the proof Lucy2 would find? It feels like intuitionistic logic overcomes the problem of "a proof that a proof exists" in a "degenerate" way: since every proof is constructive, "a proof that a proof exists" has to display an actual *proof*.

What properties do you want your logical probability distribution to have? I'd prefer building something with these properties, rather than jumping to answers first.

You don't seem to worry about computational resources, but I still recommend my posts. You focus on using other agents' beliefs as evidence, but you also want to avoid Löb's theorem, and I think you might violate it if you prove R-+(s,1).

What properties do you want your logical probability distribution to have?

In a consistent theory, my distribution assigns probability 1 to provably true statements and probability 0 to provably false statements. In general, I think it satisfies the interval version of the property P(x) = P(x & y) + P(x & not-y) (I think it should follow from my treatment of 0-order equivalent sentences as equivalent but I have to think about it). These two would mean that for consistent theories it is similar to "coherent" assignments in the sense of Christiano et al, albeit it only assigns probability intervals rather than actual probabilities. In inconsistent theories, it assigns high probability to sentences with short evidence in favor and long evidence against, which is what we want to happen for UDT to work. There is more to be said regarding choosing encoding of evidence and I hope to write about it later.

You don't seem to worry about computational resources, but I still recommend my posts.

I've read some of them! You can get a computable approximation of my probability intervals if you limit evidence length to some finite D. Of course the time complexity would be exponential in D.

...you also want to avoid Löb's theorem, and I think you might violate it if you prove R-+(s,1).

What do you mean "violate it"? How can you violate a theorem?

it assigns high probability to sentences with short evidence in favor and long evidence against, which is what we want to happen for UDT to work.

Interesting. I'm not convinced that that's required (e.g. for the 5 and 10 problem). If you read Ingredients of TDT with an eye towards that, I think there's a strong case that using causal surgery rather than logical implication solves the problem.

What do you mean "violate it"? How can you violate a theorem?

Fair enough - I meant that if you prove R-+(s,1), then for a consistent set of axioms, I think you violate the consistency condition set by Löb's theorem if you say "If Pmin(s)=1, then s." Hmm. I guess this is not really a problem with the content of your post - it's more about the form of the additional axiom "GL".

Interesting. I'm not convinced that that's required (e.g. for the 5 and 10 problem). If you read Ingredients of TDT with an eye towards that, I think there's a strong case that using causal surgery rather than logical implication solves the problem.

As far as I know, TDT was mostly abandoned in favor of UDT. In particular, I don't think there is a well-defined recipe for describing a given process as a causal diagram with pure-computation nodes. But I might be missing something.

Fair enough - I meant that if you prove R-+(s,1), then for a consistent set of axioms, I think you violate the consistency condition set by Löb's theorem if say "If Pmin(s)=1, then s."

I'm not sure what you're saying here. The usual notion of consistency doesn't apply to my system since it works fine for inconsistent theories. I believe that for consistent theories the energy minimum is always 0 which provides a sort of analogue to consistency in usual logics.

Problem: Lucy is using L in there, applied to her own formal system! That cannot work!

Lucy can assume a priori that it only takes "good" actions and use this to prove properties about Lucy_2, without having to prove the consistency of its own formal system.

I might be missing something, but it seems to me that this thing that you people call "Löbian obstacle" isn't really about self-modification (it occurs even if Lucy = Lucy_2), but about internal consistency: it's the old question of: "how can I know that I'm not insane?"

Well, you can't know, and there fundamentally is no way around it: Gödel's incompleteness theorem, Löb's theorem, Rice's theorem, etc. are all formalized restatements of this fact.

So, why can't Lucy just assume that it is sane? It seems to work well for humans.

So, why can't Lucy just assume that it is sane?

Lucy reasons using formal mathematical logic. A formal mathematical logic cannot "assume it is sane" because of Loeb's theorem. As long as Lucy isn't self-modifying, all is fine and dandy. She just proves theorems and acts accordingly; it's not like she has a built-in compulsion to prove her own sanity. However, once she begins to self-modify (or encounters another agent of similar complexity), she has to start proving theorems about Lucy2. Now, since Lucy2 is at least as smart as Lucy and preferably even smarter, Lucy has a hard time predicting the specifics of how Lucy2 will behave. What Lucy wants is to reason *generally*, i.e. prove something along the lines of "Lucy2 will only do good things" rather than analyze what Lucy2 will do on a case-by-case basis (Yudkowsky & Herreshoff call it "the Vingean principle" after Vernor Vinge). Unfortunately, Lucy is barred from proving such a general theorem. Now, we can try to hack Lucy's system of reasoning to accept self-modifications on weaker grounds than proving they are "good" (e.g. allow proving their goodness in a stronger formal system). But that would involve treating self-modification as something ontologically different from all other phenomena in the universe, which doesn't seem to be a good thing. We would like Lucy to be naturalistic, i.e. to reason about herself and her descendants in the same way she reasons about everything else in the universe. Otherwise, she will be handicapped. For example, imagine that instead of self-modifying she considers breaking into another, much more powerful computer and programming that computer to be Lucy2. Such a scenario should be treated just like a self-modification, but since it doesn't fall under the definition of self-modification hard-coded by the programmers, Lucy is back to using her "general purpose" reasoning mode rather than her special self-modification mode. The same goes for encountering another agent. We can try to introduce "agent" as a special ontological category, but defining it would be hard.

It seems to work well for humans.

I have a suspicion human brains use something like the evidence logic I'm proposing. It feels that the concept of "proof" is much less native to human thinking than the concept of "argument for/against".

She just proves theorems and acts accordingly, it's not like she has a built-in compulsion to prove her own sanity.

I suppose she could reason that if she can't prove that her formal system is consistent, then she can't trust that the actions she can prove to be "good" are actually "good". Or, even if she trusts her present self, she can't trust her future selves, even if they are identical (well, they can never be completely identical if they accumulate memories).

Your argument works even for Lucy = Lucy_2.

However, once she begins to self-modify (or encounters another agent of similar complexity), she has to start proving theorems about Lucy2.

It seems to me that the problem you are trying to solve is how to coordinate with other agents.

This has been investigated in the context of program-equilibrium prisoner's dilemma, and the only satisfactory solution I know is to form cliques based on computable under-approximations of functional equivalence between the program source codes (no, I don't consider the FairBot/PrudentBot proposal a satisfactory solution, for reasons I've explained here).

Generalize this to arbitrary problems and you get UDT or something alike.
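The clique approach mentioned above can be illustrated with a minimal sketch (my own illustration, not anything from the linked discussion; `normalize` and the string-based game interface are assumptions): cooperate exactly when a computable under-approximation of functional equivalence puts the opponent's program in your clique. Here the under-approximation is textual equality after stripping comments and collapsing whitespace, which is sound (it never admits a functionally different program) but very incomplete (it misses most functionally identical ones).

```python
import re

def normalize(source: str) -> str:
    # A crude computable under-approximation of functional equivalence:
    # drop comments, collapse whitespace, then compare texts verbatim.
    source = re.sub(r"#.*", "", source)
    return re.sub(r"\s+", " ", source).strip()

def cliquebot(my_source: str, opponent_source: str) -> str:
    # Cooperate iff the opponent is (detectably) in my clique; defect otherwise.
    return "C" if normalize(my_source) == normalize(opponent_source) else "D"
```

A stronger under-approximation (e.g. comparing normalized abstract syntax trees) would grow the clique while remaining computable; full functional equivalence is undecidable.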

I suppose she could reason that if she can't prove that her formal system is consistent, then she can't trust that the actions she can prove to be "good" are actually "good". Or, even if she trusts her present self, she can't trust her future selves...

Lucy works by trying to prove a given action is "good" and doing it if she's successful. She doesn't need to prove that "there is no proof A is not good" in order to do A. However, it's true she can't trust her future selves. Indeed, it can be a problem even without self-modifications. For example, if someone drops an anvil on Lucy, Lucy cannot reason that "I should dodge the anvil because if I get destroyed I won't do good things in the future". She can still have a hard-coded aversion to being destroyed, but this seems like a hack.

...Generalize this to arbitrary problems and you get UDT or something alike.

UDT *assumes* a system of assigning probabilities to mathematical sentences given a (possibly inconsistent) theory. For UDT to admit naturalistic trust, it needs that system to overcome the Loebian obstacle, like the system I propose here. Indeed the system I propose is intended as a component of UDT or UIM.

"I should dodge the anvil because if I get destroyed I won't do good things in the future". She can still have a hard-coded aversion to being destroyed, but this seems like a hack.

I think it's worse than that. If your argument is correct, the type of AI you are describing can't plan because it can't trust its future selves to follow through with the plan, even if doing so wouldn't require commitment. It would be limited to immediate utility maximization.

It seems that your argument proves too much. In fact, I think that Lucy can still prove useful properties about her successors even without proving the consistency of their, and therefore her own, proof system.

UDT assumes a system of assigning probabilities to mathematical sentences given a (possibly inconsistent) theory. For UDT to admit naturalistic trust, it needs that system to overcome the Loebian obstacle, like the system I propose here. Indeed the system I propose is intended as a component of UDT or UIM.

That's not how UDT works in my understanding, even though I shall admit that I'm not an expert on the subject. Do you have a reference?

I think it's worse than that. If your argument is correct, the type of AI you are describing can't plan because it can't trust its future selves to follow through with the plan, even if doing so wouldn't require commitment.

We can avoid this problem if Lucy performs an action once it is the first in a provably "good" sequence of actions. This would allow her to dodge the anvil if it interferes with her immediate plans, but not on the general grounds of "a universe with Lucy is a better universe since Lucy is doing good things".
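That decision rule can be sketched as follows (my own illustration; `provably_good` is a hypothetical stand-in for Lucy's theorem prover, and plans are just bounded sequences of actions):

```python
from itertools import permutations

def choose_action(actions, provably_good, max_plan_len=3):
    """Take an action iff it is the first step of some sequence of
    actions that is provably 'good' as a whole; commit only to that
    first step, re-planning afterwards."""
    for n in range(1, max_plan_len + 1):
        for plan in permutations(actions, n):
            if provably_good(plan):
                return plan[0]
    return None  # no provably good plan found; take no action
```

The point of the rule is visible in the code: dodging the anvil gets chosen when it is the first step of a provably good plan, without Lucy ever needing the general (Loeb-blocked) theorem "a universe with me in it is better".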

That's not how UDT works in my understanding, even though I shall admit that I'm not an expert on the subject. Do you have a reference?

I don't have a reference which discusses UDT and the Loebian obstacle together. You can find a description of the AFAIK "latest and greatest" UDT here. UDT considers proofs in a formal system. If this system suffers from the Loebian obstacle this will lead to the kind of problems I discuss here. In fact, I haven't stated it explicitly but I think of Lucy as a UDT agent: she considers possible actions as logical counterfactuals and computes expected utility based on that.

In this post I intend to:

- Explain the Loebian obstacle (feel free to skip it if you know what the Loebian obstacle is).
- Describe a method of assigning probabilities (more generally, probability intervals) to mathematical sentences which admits a form of "Loebian" self-referential reasoning. The method is well-defined both for consistent and inconsistent axiomatic systems, the latter being important in the analysis of logical counterfactuals, as in UDT.

## Background

## Logic

When can we consider a mathematical theorem to be established? The obvious answer is: when we have proved it. Wait, proved it in what theory? Well, that's debatable. ZFC is a popular choice for mathematicians, but how do we know it is consistent (let alone sound, i.e. that it only proves true sentences)? All those spooky infinite sets, how do you know it doesn't break somewhere along the line? There's lots of empirical evidence, but we can't *prove* it, and it's *proofs* we're interested in, not mere evidence, right?

Peano arithmetic seems like a safer choice. After all, if the natural numbers don't make sense, what does? Let's go with that. Suppose we have a sentence *s* in the language of PA. If someone presents us with a proof *p* of *s* in PA, we believe *s* is true. Now consider the following situation: instead of giving you a proof of *s*, someone gave you a PA-proof *p_1* that *p* exists. After all, PA admits defining "PA-proof" in PA language. Common sense tells us that *p_1* is a sufficient argument to believe *s*. Maybe we can prove it *within* PA? That is, if we have a proof of "if a proof of *s* exists then *s*" and a proof of R(*s*) := "a proof of *s* exists", then we have just proved *s*. That's just modus ponens.

There are two problems with that.

First, there's no way to prove the sentence L := "for all *s*, if R(*s*) then *s*", since it's not a PA-sentence at all. The problem is that "for all *s*" references *s* as a natural number *encoding* a sentence. On the other hand, "then *s*" references *s* as the *truth-value* of the sentence. Maybe we can construct a PA-formula T(*s*) which means "the sentence encoded by the number *s* is true"? Nope, that would get us in trouble with the liar paradox (it would be possible to construct a sentence saying "this sentence is false").

Second, Loeb's theorem says that if we can prove L(*s*) := "if R(*s*) then *s*" for a given *s*, then we can prove *s*. This is a problem since it means there can be no way to prove L(*s*) for all *s* in any sense, since it's unprovable for *s* which are unprovable. In other words, if you proved not-*s*, there is no way to conclude that "no proof of *s* exists".

What if we add an *inference rule* Q to our logic allowing us to go from R(*s*) to *s*? Let's call the new formal system PA_1. Then *p_1* appended by a Q-step becomes an honest proof of *s* in PA_1. Problem solved? Not really! Now someone can give you a proof of R_1(*s*) := "a PA_1-proof of *s* exists". Back to square one! Wait a second, what if we add a new rule Q_1 allowing us to go from R_1(*s*) to *s*? OK, but now we get R_2(*s*) := "a PA_2-proof of *s* exists". Hmm, what if we add an *infinite* number of rules Q_k? Fine, but now we get R_ω(*s*) := "a PA_ω-proof of *s* exists". And so on, and so forth; the recursive ordinals are aplenty...

Bottom line: Loeb's theorem works for *any* theory containing PA, so we're stuck.

## AI

Suppose you're trying to build a self-modifying AGI called "Lucy". Lucy works by considering possible actions and looking for formal proofs that taking one of them will increase expected utility. In particular, it has self-modifying actions in its strategy space. A self-modifying action essentially creates a new agent: Lucy_2. How can Lucy decide that becoming Lucy_2 is a good idea? Well, a good step in this direction would be proving that Lucy_2 would only take actions that are "good". I.e., we would like Lucy to reason as follows: "Lucy_2 uses the same formal system as I, so if she decides to take action *a*, it's because she has a proof *p* of the sentence s(*a*) that '*a* increases expected utility'. Since such a proof exists, *a* does increase expected utility, which is good news!" Problem: Lucy is using L in there, applied to *her own* formal system! That cannot work! So, Lucy would have a hard time self-modifying in a way which doesn't make her formal system *weaker*.

As another example where this poses a problem, suppose Lucy observes another agent called "Kurt". Lucy knows, by analyzing her sensory evidence, that Kurt proves theorems using the same formal system as Lucy. Suppose Lucy found out that Kurt proved theorem *s*, but she doesn't know how. We would like Lucy to be able to conclude that *s* is, in fact, true (at least with the probability that her model of physical reality is correct). Alas, she cannot.

See MIRI's paper for more discussion.

## Evidence Logic

Here, cousin_it explains a method to assign probabilities to sentences in an *inconsistent* theory T. It works as follows. Consider a sentence *s*. Since T is inconsistent, there are T-proofs both of *s* and of not-*s*. Well, in a courtroom both sides are allowed to present arguments, so why not try the same approach here? Let's *weight* the proofs as a function of their length, analogously to weighting hypotheses in Solomonoff induction. That is, suppose we have a prefix-free encoding of proofs as bit sequences. Then it makes sense to consider a random bit sequence and ask whether it is a proof of something. Define the probability of *s* to be

P(*s*) := (probability of a random sequence being a proof of *s*) / (probability of a random sequence being a proof of *s* or of not-*s*)

Nice, but it doesn't solve the Loebian obstacle yet.
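This definition lends itself to a brute-force truncated approximation (a sketch of my own, not from the post): enumerate all bit strings up to a cutoff length, weight each by 2^-length, and take the ratio. `is_proof_of` is a hypothetical proof-checker supplied by the caller; the enumeration alone is exponential in the cutoff.

```python
from itertools import product

def bit_strings(max_len):
    # All bit strings of length 1..max_len (count is exponential in max_len).
    for length in range(1, max_len + 1):
        for bits in product("01", repeat=length):
            yield "".join(bits)

def approximate_probability(s, is_proof_of, max_len):
    """Truncated version of P(s): only candidate proofs of length
    <= max_len are considered, each weighted by 2**(-length)."""
    weight_s = weight_total = 0.0
    for v in bit_strings(max_len):
        w = 2.0 ** (-len(v))
        if is_proof_of(v, s):
            weight_s += w
            weight_total += w
        elif is_proof_of(v, "not " + s):
            weight_total += w
    return weight_s / weight_total if weight_total else None
```

For example, if the shortest proof of *s* has length 2 and the shortest proof of not-*s* has length 3 (and there are no others below the cutoff), the ratio is 2^-2 / (2^-2 + 2^-3) = 2/3.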

I will now formulate an extension of this idea that allows assigning an *interval* of probabilities [P_min(s), P_max(s)] to any sentence *s*. This interval is a sort of "Knightian uncertainty". I have some speculations about how to extract a single number from this interval in the general case, but even without that, I believe that P_min(s) = P_max(s) in many interesting cases.

First, the general setting:

- For every sentence *s*, there are certain texts *v* which are considered to be "evidence relevant to *s*". These are divided into "negative" and "positive" evidence. We define sgn_s(v) := +1 for positive evidence and sgn_s(v) := -1 for negative evidence.
- Every *v* is associated with the strength of the evidence, str_s(v), which is a number in [0, 1].
- Every *v* is associated with an "energy" function e_{s,v}: [0, 1] -> [0, 1]. It is a continuous convex function.
- The energy of *s* is defined to be e_s := Σ_v 2^{-l(v)} e_{s,v}, where l(v) is the length of *v*.
- Since the e_{s,v} are continuous and convex, so is e_s. Hence it attains its minimum on a closed interval, which is [P_min(s), P_max(s)] by definition.

A piece of evidence *v* for *s* is defined to be one of the following:

- A proof of *s*:
  - sgn_s(v) := +1
  - str_s(v) := 1
  - e_{s,v}(q) := (1 - q)^2
- A proof of not-*s*:
  - sgn_s(v) := -1
  - str_s(v) := 1
  - e_{s,v}(q) := q^2
- Evidence for R_{-+}(s, p) := "P_min(s) >= p":
  - sgn_s(v) := +1
  - str_s(v) := str_{R-+(s, p)}(v) · p
  - e_{s,v}(q) := 0 for q > p; str_{R-+(s, p)}(v) · (q - p)^2 for q < p
- Evidence against R_{--}(s, p) := "P_min(s) < p":
  - sgn_s(v) := +1
  - str_s(v) := str_{R--(s, p)}(v) · p
  - e_{s,v}(q) := 0 for q > p; str_{R--(s, p)}(v) · (q - p)^2 for q < p
- Evidence against R_{++}(s, p) := "P_max(s) > p":
  - sgn_s(v) := -1
  - str_s(v) := str_{R++(s, p)}(v) · (1 - p)
  - e_{s,v}(q) := 0 for q < p; str_{R++(s, p)}(v) · (q - p)^2 for q > p
- Evidence for R_{+-}(s, p) := "P_max(s) <= p":
  - sgn_s(v) := -1
  - str_s(v) := str_{R+-(s, p)}(v) · (1 - p)
  - e_{s,v}(q) := 0 for q < p; str_{R+-(s, p)}(v) · (q - p)^2 for q > p

Technicality: I suggest that for our purposes, a "proof of *s*" is allowed to be a proof of a sentence equivalent to *s* in 0-th order logic (e.g. not-not-*s*). This ensures that our probability intervals obey the properties we'd like them to obey with respect to propositional calculus.

With this system, Lucy can reason as follows: "Lucy_2 uses the same formal system as I. If she decides to take action *a*, it's because she has strong evidence for the sentence s(*a*) that '*a* increases expected utility'. I just proved that there would be strong evidence for the expected utility increasing. Therefore, the expected utility would have a high value with high logical probability. But evidence for high logical probability of a sentence is evidence for the sentence itself. Therefore, I now have evidence that expected utility will increase!"
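The energy-minimization definition can be checked numerically with a small sketch (my own illustration, not part of the post; `interval_of_minima` and the grid scan are assumptions of the sketch): sum the length-weighted energy functions and find the interval over [0, 1] where the total is minimal.

```python
def proof_energy(q):
    # evidence "a proof of s": e(q) = (1 - q)^2
    return (1.0 - q) ** 2

def disproof_energy(q):
    # evidence "a proof of not-s": e(q) = q^2
    return q ** 2

def interval_of_minima(evidence, grid=10_001):
    """evidence: list of (l(v), e_{s,v}) pairs.
    Approximates the closed interval [P_min(s), P_max(s)] on which
    e_s(q) = sum_v 2**(-l(v)) * e_{s,v}(q) attains its minimum,
    by scanning a uniform grid over [0, 1]."""
    qs = [i / (grid - 1) for i in range(grid)]
    def e_s(q):
        return sum(2.0 ** (-l) * e(q) for l, e in evidence)
    values = [e_s(q) for q in qs]
    m = min(values)
    argmins = [q for q, v in zip(qs, values) if v <= m + 1e-12]
    return argmins[0], argmins[-1]
```

With one proof of *s* of length 2 and one proof of not-*s* of length 3, the minimum is the single point q = 2/3, matching cousin_it's ratio 2^-2 / (2^-2 + 2^-3); a lone plateau-shaped energy function (like those of the reflective evidence above) instead yields a genuine interval, i.e. Knightian uncertainty.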