An example of self-fulfilling spurious proofs in UDT

8Vladimir_Nesov

2orthonormal

0Vladimir_Nesov

4orthonormal

8cousin_it

3gRR

0cousin_it

2Vladimir_Nesov

2cousin_it

0gRR

0gRR

2cousin_it

0gRR

2cousin_it

0gRR

2cousin_it

0gRR

2cousin_it

2gRR

2cousin_it

2gRR

2cousin_it

0gRR

8cousin_it

2orthonormal

0gRR

0gRR

0gRR

3cousin_it

2gRR

0gRR

2cousin_it

0[anonymous]

0Dmytry

0cousin_it

-2Dmytry

2cousin_it

-3Dmytry

3cousin_it

0Dmytry

4cousin_it

0cousin_it

0CronoDAS

New Comment

43 comments, sorted by Click to highlight new comments since: Today at 7:20 PM

More generally, what makes Q dangerous is that (1) it only settles for a spurious moral argument, doesn't accept natural ones, and (2) what it finds is taken seriously by the agent, acted out. As a result, provability of a spurious moral argument provably implies its truth, which by Loeb's theorem makes it true and forces the agent to be thus misled.

The only difference between Q and normal proof search procedures is that the normal procedures are OK with any proof, while Q dislikes natural proofs and ignores them. And this bit of "motivated skepticism" is sufficient to make the preferred spurious proofs come true, Q doesn't just loop without finding anything.

This is a whole new level of Oracle AI unsafety... Take what it says seriously, and it can argue you into doing anything at all. :-)

Excellent! I wonder how far the malicious behavior can be extended.

Here the problem is that A directly uses a valid but malicious inference module Q. If A were built to enumerate proofs by length and act on the first one of the form "A()==a implies U()==u, and A()!=a implies U()<=u", can U be rewritten to guarantee that a particular spurious proof comes up first?

Or if A uses a halting oracle to check through statements of that form, can U be written (incorporating that oracle) so that the halting oracle fails to return for the genuine counterfactual and only returns "True" for a particular spurious one?

My next attempt :) Below, "PA" means Peano Arithmetic.

```
def A():
Enumerate proofs (in PA) by length
if found a proof that "A()==a implies U()==u, and A()!=a implies U()<=u"
then return a
def U():
Enumerate proofs in PA+Consistent(PA) by length up to a very large N
if found a proof that "A()==1"
then return (A()==1 ? 5 : 0)
if(A()==2) return 10
return 0
```

Proposition: A() returns 1 (if PA is consistent).

Proof:

Let X = "there exist a!=1 and u, such that: A()==a => U()==u and A()!=a => U()<=u".

Let S = "there exists u, such that: A()==1 => U()==u and A()!=1 => U()<u".

Let T = "A()==1".

Let ProvableU(P) = the provability formula of U (PA+Consistent(PA), with proofs of length < N).

Let ProvableA(P) = the provability formula of A (PA).

Let "U⊦ P" mean P is a theorem of U's proof system.

Then:

```
1. U⊦ ProvableU(T) => S // by inspection of U
2. U⊦ S => ~X // by definitions of X and S
3. U⊦ ProvableU(T) => ~X // from the previous two
4. U⊦ ~X => ~ProvableA(X) // by the axiom Consistent(PA)
5. U⊦ ProvableA(ProvableU(T) => S) // by A's inspection of U
6. U⊦ ProvableA(ProvableU(T)) => ProvableA(S) // from the previous
7. U⊦ ProvableU(T) => ProvableA(ProvableU(T))
// because PA is sufficient to model provability in any formal system
8. U⊦ ProvableU(T) => ProvableA(S) and ~ProvableA(X) // from 3, 4, 6, and 7
9. U⊦ ProvableA(S) and ~ProvableA(X) => T // by inspection of A
10. U⊦ ProvableU(T) => T // from the previous two
11. U⊦ T // by Loeb's theorem
```

QED

Sorry for not noticing your comment earlier. It would be nice to have a way to subscribe to comment threads.

I don't understand why the definitions of X and S imply that S => ~X. Does that rely on specific features of your A and U? For example, if we take a different A and U so that A()==2 and U()==0, then X is true for a=2 and u=0, and S is true for u=1.

Here's another interesting followup question. My implementation of Q relies on enumerating all proofs in order, so running Q requires exponential time. Is there a "Henkin-style" implementation of Q that constructs the self-referential proof quickly and directly, maybe using the steps of Löb's theorem or something? That's how I first tried to construct Q after reading your comments, and failed, but it might still be possible.

Maybe something like this:

```
def Q():
return Proof1("IsValidProof(X, Q()) => X") +
Proof2("IsValidProof(X, Q())") +
X
```

where IsValidProof(Statement, Proof) is Goedel's Bew function, Proof1 is a formalization of the argument that "if Q() returns a valid proof of X, then A() will return 1, and X will be true", and Proof2 is a formalization of the argument that "a valid proof of A=>B, followed by a valid proof of A, followed by B, is a valid proof".

```
def A():
Enumerate proofs by length and act on the first one of the form
"A()==a implies U()==u, and A()!=a implies U()<=u"
def U():
Enumerate proofs by length up to a very large N
if found a proof that "A()==1 implies U()==5, and A()!=1 implies U()<=5"
then if(A()==1) return 5
if(A()==2) return 10
return 0
```

The proof of the statement S: "A()==1 implies U()==5, and A()!=1 implies U()<=5" exists, so it will be found by both A() and U(). Any proof of another statement of the same form would imply "there was no proof of length <N of statement S", which would be false given high enough N, so it couldn't be found.

```
def U():
Enumerate proofs by length up to a very large N
if found a proof that "A()==1 implies U()==5, and A()!=1 implies U()<=5"
then return (A()==1 ? 5 : 0)
if found any other proof of the form "A()==a implies U()==u, and A()!=a implies U()<=u"
then return (A()==a ? u-1 : u+1)
if(A()==2) return 10
return 0
```

The intuition is that the proof of "A()==1 implies U()==5, and A()!=1 implies U()<=5", if it exists, would not depend on N, whereas any proof of "A()==2 implies U()==10..." would have to be longer than N, so by making N large enough...

The setup should make "if S has a proof of length < N, then S" apparent by inspection, answering your earlier objection: if S is (<N)-provable, then U() will find the proof (because finding any other proof first would imply U() is unsound), and then return (A()==1 ? 5 : 0), which requires A() to return 1, otherwise U() is unsound again.

But A can use consistency arguments when proving "Provable(S) => S", can't it?

Let S be "A()==1 implies U()==5, and A()!=1 implies U()<=5".

Then the following is provable by inspection: "if T is a moral argument with the shortest proof of length S, or there exists T, such that Provable(T) and Provable(~T)". From the second part everything provably follows, including "Provable(S) => S". Putting everything together, we get Provable(Provable(S) => S).

I think this reasoning is valid:

either Provable(S)=>S, or there exists T such that Provable(T) and Provable(~T)

either Provable(S)=>S, or Provable(Anything)

either Provable(S)=>S, or Provable(Provable(S)=>S)

But the last step doesn't seem to imply Provable(S)=>S. Or am I missing something again?

Let X be the statement "S is the moral argument with the shortest proof of length <N". Then it's true that X=>S and Provable(X)=>Provable(S), but I don't see why Provable(X)=>S.

In general I think your proof is missing a compelling internal idea, so you probably can't patch it by manipulating symbols. You'll just be inviting more and more subtle mistakes. When I find myself in a situation like this, I usually try to rethink the whole thing until it becomes obvious, and then the proof becomes inevitable even if I compose it sloppily. It's kinda hard to explain...

Yes, I think I know what you mean... it just feels as if there must be an easy technical proof, but, I can't find it...

So I tried to change the solution again, to make the proof easier:

```
def U():
Enumerate proofs by length up to a very large N
if found a proof that "A()==1 implies U()==5, and A()!=1 implies U()<=5"
then return (A()==1 ? 5 : 0)
Enumerate proofs by length up to a very large N
if found any other proof of the form "A()==a implies U()==u, and A()!=a implies U()<=u"
then return (A()==a ? u-1 : u+1)
return (A()==2 ? 10 : 0)
```

With this, Provable(S) implies S directly, so S is provable, so A must find it, because there are no other short-proof moral arguments if A an U are consistent.

But the provability of S does not depend on A, so U() will never get past the first loop, no matter against which agent it is played. So this is a wrong solution to the original problem. And the previous one would be wrong for the same reason, even if I did find the proof...

[EDIT: this is wrong] You're saying, I don't get "Provable(Provable(S) => S)", but only "Provable(Provable(Provable(S) => S))"?

But then,

Provable(Provable(Provable(S) => S)) =>

Provable(Provable(Provable(S)) => Provable(S)) => /Loeb's theorem/

Provable(Provable(S)) =>

[EDIT: this is also wrong] Provable(S)

I thought it couldn't find any other proofs of length < N, because it would imply there was no proof of S. But this is not a problem if S is false... Ok, modification:

```
def U():
Enumerate proofs by length up to a very large N
if found a proof that "A()==1 implies U()==5, and A()!=1 implies U()<=5"
then if(A()==1) return 5
if found any other proof of the same form
then whatever A() return 0
if(A()==2) return 10
return 0
```

EDIT: Wait, this is not good, now if(A()==2) is unreachable...

EDIT2: No, not actually unreachable, but any proof for a statement of the form "A()==2 => U()==10..." must be of length > N, which is what was needed, I suppose. Still feels like cheating, but I'm not sure why...

[This comment is no longer endorsed by its author]

What is your definition of UDT?

If you use the definition given in this paper, then that definition of A is not UDT, simply because it is not guaranteed to "[find] a function mapping sense data to actions that will maximize the utility of outcomes weighted by the probability that the outcome will be caused by the procedure returning that mapping."

It is implicit in the definition that the agent must be able to choose from a set of counter-factual action-selection-functions *{f: S -> A}*, because it is in this choice the optimisation happens.

P can only be a valid proof that *a* is the best action exactly when *a* is the only option, because all other actions where excluded, by the use of non-counter-factual logic.

I know that counter-factual calculations are problematic. But if you take the counter-factual-nes out of UDP, then there are literally nothing left to the definition. Without the counter-factual-nes, everything goes. Any choice is the optimal one, because it was the only one, because that was the choice that actually happened.

Other conclusion is: don't feed the agent it's source code when you feed it a proof that it's output effects a change. Blackbox the output as a variable. When you output a, the world gives u, which is the largest U of any a. Works on newcomb's too (note that this makes no distinction between stuff being put into the boxes before, or after the decision; if the 100% accurate prediction is a given. It simply knows nothing of the causality beyond that implied in the problem)

To check the supplied proofs, A needs to be able to recognize statements of the form "A()==a implies U()==u, and A()!=a implies U()<=u", which seems to require knowing the source code of A. Or do you propose trying to check proofs of some different form instead, that would only mention U? What would such statements look like?

Not when the statements are already given with A() in them, as in the Newcomb's problem. The newcomb's problem doesn't say - here is the omega, and it has a brain sim with such and such neurons connected so and so with such and such weights, and it exposes this brain sim to this entire string, and if this brain sim tells it so and so it puts stuff into one box, otherwise it puts stuff into other box. No, it already says omega predicts you, using the symbol you which already means A() . Nobody has ever presented newcomb to you with an implicit, rather than explicit, self reference.

One could insert a quine, of course, into the algorithm, to allow the algorithm to recognize self; then such algorithm could start off by replacing the 'brain sim with such and such neurons connected so and so' with symbol A() in the input statements (or perhaps running a prover there to try and replace any obscured instances of A() ) and then solve it without further using knowledge of internals of A() .

More philosophically, this may be what our so called reflection is about: fully opaque black box substitution. Albeit one must not read much into it, because the non-opaque substitution does not only lead to 'pollution' of the theorem prover with contradictions any time you think "okay, suppose I give out answer a", but is also terribly impractical.

One could also fix it by assuming that your output is slightly uncertain - epsilon probability of changing the mind - which kills the wrong proofs (and also stops the prediction nonsense and other entirely un-physical ideas, with a side effect that the agent might not work ideally in an ideal world).

I don't understand your comment. Our proof-theoretic algorithms can handle worlds with randomness just fine, because the logical fact of the algorithm's return value will be more or less correlated with many physical facts. The proof-theoretic algorithms don't *require* slight randomness to work, and my grandparent comment gave a reason why it seems hard to write a version that relies *completely* on slight randomness to kill spurious proofs, as you suggested. Of course if you do find a way, it will be cool.

Let me try to explain. Assume you have an agent that's very similar to our proof-theoretic ones, but also has a small chance of returning a random action. Moreover, the agent's self-description includes a mention of the random variable V that makes the agent return a random action.

The first problem is that statements like "A()==a" are no longer logical statements. That seems easy to fix by making the agent look at conditional expected values instead of logical implications.

The second and more serious problem is this: how many independent copies of V does the world contain? Imagine the agent is faced with Newcomb's Problem. If the world program contains only one copy of V that's referenced by both instances of the agent, that amounts to abusing the problem formulation to tell the agent "your copies are located here and here". And if the two copies of the agent use uncorrelated instances of V, then looking at conditional expected values based on V buys you nothing. Figuring out the best action reduces to same logical question that the proof-theoretic algorithms are faced with.

Your comment seems to be an instance of arguing about assumptions. Does your alternative approach lead toward interesting new math? If not, then exploring it doesn't seem to make sense, even if it's true, because we already know all there is to know about it.

[This comment is no longer endorsed by its author]

Let X be the statement "A(Q())==1 implies U()==5, and A(Q())!=1 implies U()<=5". Let Q be the program that enumerates all proofs trying to find a proof of X, and returns that proof if found. (The definitions of X and Q are mutually quined.) If X is provable at all, then Q will find that proof, and X will become true (by inspection of U and A). That reasoning is formalizable in our proof system, so the statement "if X is provable, then X" is provable. Therefore, by Löb's theorem, X is provable. So Q will find a proof of X, and A will return 1.

Doesn't this amount to saying that we've got a proof system that can prove false statements, and is therefore unsound?

ETA: I suppose this relies on the fact that "A(Q())!=1 implies U()<=5" can be made true simply by ensuring that A(Q())=1, regardless of what U equals?

[This comment is no longer endorsed by its author]

Benja Fallenstein was the first to point out that spurious proofs pose a problem for UDT. Vladimir Nesov and orthonormal asked for a formalization of that intuition. In this post I will give an example of a UDT-ish agent that fails due to having a malicious proof searcher, which feeds the agent a spurious but valid proof.

The basic idea is to have an agent A that receives a proof P as input, and checks P for validity. If P is a valid proof that a certain action a is best in the current situation, then A outputs a, otherwise A tries to solve the current situation by its own means. Here's a first naive formalization, where U is the world program that returns a utility value, A is the agent program that returns an action, and P is the proof given to A:

This formalization cannot work because a proof P can never be long enough to contain statements about A(P) inside itself. To fix that problem, let's introduce a function Q that generates the proof P:

In this case it's possible to write a function Q that returns a proof that makes A return the suboptimal action 1, which leads to utility 5 instead of 10. Here's how:

Let X be the statement "A(Q())==1 implies U()==5, and A(Q())!=1 implies U()<=5". Let Q be the program that enumerates all possible proofs trying to find a proof of X, and returns that proof if found. (The definitions of X and Q are mutually quined.) If X is provable at all, then Q will find that proof, and X will become true (by inspection of U and A). That reasoning is formalizable in our proof system, so the statement "if X is provable, then X" is provable. Therefore, by Löb's theorem, X is provable. So Q will find a proof of X, and A will return 1.

One possible conclusion is that a UDT agent cannot use just any proof searcher or "mathematical intuition module" that's guaranteed to return valid mathematical arguments, because valid mathematical arguments can make the agent choose arbitrary actions. The proof searchers from some previous posts were well-behaved by construction, but not all of them are.

The troubling thing is that you may end up with a badly behaved proof searcher by accident. For example, consider a variation of U that adds some long and complicated computation to the "else" branch of U, before returning 10. That increases the length of the "natural" proof that a=2 is optimal, but the spurious proof for a=1 stays about the same length as it was, because the spurious proof can just ignore the "else" branch of U. This way the spurious proof can become much shorter than the natural proof. So if (for example) your math intuition module made the innocuous design decision of first looking at actions that are likely to have shorter proofs, you may end up with a spurious proof. And as a further plot twist, if we make U return 0 rather than 10 in the long-to-compute branch, you might choose the

correctaction due to a spurious proof instead of the natural one.