# A cheating approach to the tiling agents problem

AI Alignment Forum

# Ω 3

(This post resulted from a conversation with Wei Dai.)

Formalizing the tiling agents problem is very delicate. In this post I'll show a toy problem and a solution to it, which arguably meets all the desiderata, but only by cheating in a new and unusual way.

Here's a summary of the toy problem: we ask an agent to solve a difficult math question and also design a successor agent. Then the successor solves another math question and designs its own successor, and so on. The questions get harder each time, so they can't all be solved in advance, and each of them requires believing in Peano arithmetic (PA). This goes on for a fixed number of rounds, and the final reward is the number of correct answers.

Moreover, we will demand that the agent must handle both subtasks (solving the math question and designing the successor) using the same logic. Finally, we will demand that the agent be able to reproduce itself at each round, not just design a custom-made successor that solves the math question with PA and reproduces itself by quining.

To nail down the formalization, we will use good old computable programs with proof limits, not oracles or modal formulas. The universe program U will contain the agent program A as a subprogram, and A will know the full source code of U (including A) by diagonalization. At each round, the agent will receive the text of a math question as input, and return a pair consisting of the answer to the math question and the source code of the successor agent.

The math questions will be of the form "you're allowed to use N_i computation steps, here's the source code of a program P_i, what does it return?" The programs P_i will be chosen so that N_i steps isn't enough to run them, but is more than enough to find a PA proof that P_i returns some particular value.

To give the agent some assurance of the above, the universe program will use the following transparent procedure to evaluate the agent's answer:

1. Run P_i to determine the right answer, and compare it with the agent's answer.

2. Verify that there exist PA proofs about P_i that can be found in much fewer than N_i computation steps. Ignore the answers found in the proofs.

3. Increment the agent's utility if (1) succeeded or (2) failed.

We will also assume that all math questions are so difficult, and all N_i are so large, that doing any fixed work up front is a rounding error by comparison.

The above sounds like a fair formulation of the tiling agents problem, right? The Vingean principle is satisfied because the initial agent doesn't have enough time to solve all math questions up front. The tiling principle is satisfied because we want a solution that will reproduce itself at each round. The naturalistic principle is satisfied because we want to uniformly handle both subtasks (solve the math question and design a successor).

Now I will show a solution that seems to meet all desiderata. It will be based on the formal theory PA+Con(PA) (which is the weakest possible, because the problem requires creating successors that trust PA). But you will still say that I'm cheating.

Let the agent A be this program:

"Ignore my input at first. Using a small number of computation steps, and using PA+Con(PA) as the formal theory, search for programs B such that, if all instances of A internally delegate all input-output to B, then U will provably return some particular utility. Then internally delegate to the B that leads to the highest utility found."

Then A will be able to prove that delegating to the following program B will lead to utility equal to number of rounds:

"Accept P_i and N_i as input. Using slightly fewer than N_i computation steps, and using PA as the formal theory, find the first proof that P_i returns some particular value. Then return that value and designate A as successor."

The proof is obvious. If A internally delegates to B on the first round, then it will delegate to the same B on all subsequent rounds, because it ignores the input while choosing B. That fact is easily provable in PA. Then B goes on to handle the inputs on different rounds and wins all of them, which is easily provable in PA+Con(PA), due to the way the universe program verifies the answers. At each round B designates A as successor, which rubber-stamps B again, and the cycle continues.

You could say that A fails the Vingean principle, but that wouldn't be fair. After all, A indeed can't solve all questions in advance, because it doesn't have enough time. It only solves the question "which B to run internally?" in advance, giving the same answer redundantly each time, but that's just an implementation detail. If you didn't know how A worked internally, you wouldn't object to it.

Of course it's not guaranteed that A will reproduce itself. It knows that reproducing itself will lead to utility equal to number of rounds, and we happen to know that higher utility isn't achievable, but the same utility could be achieved in other ways (e.g. by outputting a B that delegates to itself directly, instead of going through A). But that's not a strong objection, because given the right tie-breaking, A will indeed reproduce itself.

Moreover, the approach of A looks quite general, without anything specific to this problem. It can also deal with all the usual UDT test cases, like Newcomb's Problem, as well as all imaginable tiling problems that are amenable to quining. So someone optimistic could say that A is a version of UDT that solves the tiling agents problem!

That would be unsatisfactory though. It looks more like A has found a trick that leads to quining in a roundabout way. I feel that the people working on tiling agents wouldn't accept my solution, but I can't quite figure out which new principle would rule it out. Perhaps some kind of "no meta-quining principle"? Do humans obey that principle, or do they have intuitions like "solve decision theory in the abstract and delegate to that" which sound suspiciously like my agent? It's tricky and my thinking is very confused right now.

What do you think?

# Ω 3

New Comment

I think the general principle you're taking advantage of is:

Do a small amount of very predictable reasoning in PA+Sound(PA). Then use PA for the rest of the reasoning you have to do. When reasoning about other instances of agents similar to you, simulate their use of Sound(PA), and trust the reasoning that they did while confining themselves to PA.

In your example, PA+Con(PA) sufficed, but PA+Sound(PA) is more flexible in general, in ways that might be important. This also seems to solve the closely related problem of how to trust statements if you know they were proven by an agent that reasons approximately the same way you do, for instance, statements proven by yourself in the past. If you proved X in the past, and you want to establish that that makes X true, you fully simulate everything your past self could have done in PA+Sound(PA)-mode before confining itself to PA, and then you can just trust that the reasoning you did afterwards in PA-mode was correct, so you don't have to redo that part.

This also might allow for an agent to make improvements in its own source code and still trust its future self, provided that the modification that it makes still only uses Sound(PA) in ways that it can predict and simulate. On the other hand, that condition might be a serious limitation to recursive self-improvement, since the successor agent would need to use Sound(PA) in order to pick its successor agent, and it can't do so in ways that the predecessor agent can't predict.

Perhaps it could be worse than that, and attempting to do anything nontrivial with this trick leads to a combinatorial explosion from every instance of the agent trying to simulate every other instance's uses of Sound(PA). But I'm cautiously optimistic that it isn't quite that bad, since simply simulating an agent invoking Sound(PA) does not itself require you to invoke Sound(PA) yourself, so these simulations can be run in PA-mode, and only the decision to run them needs to be made in PA+Sound(PA)-mode.

I think the right direction is to think about what we want from self-modification informally, and refine that into a toy problem that the agent in the post can't solve. It's confusing in a philosophical kind of way. I don't have the right skill for it, but I've been trying to mine Eliezer's Arbital writings.

I just realized that A will not only approve itself as successor, but also approve some limited self-modifications, like removing an inefficiency that provably doesn't affect the choice of B. Though it still doesn't matter much, because A might as well delete all code for choosing B and appoint a quining B as successor.

This suggests that the next version of the tiling agents problem should involve nontrivial self-improvement, not just self-reproduction. I have no idea how to formalize that though.