An early draft of publication #2 in the Open Problems in Friendly AI series is now available: Tiling Agents for Self-Modifying AI, and the Lobian Obstacle. ~20,000 words, aimed at mathematicians or the highly mathematically literate. The research reported on was conducted by Yudkowsky and Herreshoff, substantially refined at the November 2012 MIRI Workshop with Mihaly Barasz and Paul Christiano, and refined further at the April 2013 MIRI Workshop.

**Abstract:**

We model self-modication in AI by introducing 'tiling' agents whose decision systems will approve the construction of highly similar agents, creating a repeating pattern (including similarity of the offspring's goals). Constructing a formalism in the most straightforward way produces a Godelian difficulty, the Lobian obstacle. By technical methods we demonstrate the possibility of avoiding this obstacle, but the underlying puzzles of rational coherence are thus only partially addressed. We extend the formalism to partially unknown deterministic environments, and show a very crude extension to probabilistic environments and expected utility; but the problem of finding a fundamental decision criterion for self-modifying probabilistic agents remains open.

Commenting here is the preferred venue for discussion of the paper. This is an early draft and has not been reviewed, so it may contain mathematical errors, and reporting of these will be much appreciated.

The overall agenda of the paper is introduce the conceptual notion of a self-reproducing decision pattern which includes reproduction of the goal or utility function, by exposing a particular possible problem with a tiling logical decision pattern and coming up with some partial technical solutions. This then makes it conceptually much clearer to point out the even deeper problems with "We can't yet describe a probabilistic way to do this because of non-monotonicity" and "We don't have a good bounded way to do this because maximization is impossible, satisficing is too weak and Schmidhuber's swapping criterion is underspecified." The paper uses first-order logic (FOL) because FOL has a lot of useful standard machinery for reflection which we can then invoke; in real life, FOL is of course a poor representational fit to most real-world environments outside a human-constructed computer chip with thermodynamically expensive crisp variable states.

As further background, the idea that something-like-proof might be relevant to Friendly AI is not about achieving some chimera of absolute safety-feeling, but rather about the idea that the total probability of catastrophic failure should not have a significant conditionally independent component on each self-modification, and that self-modification will (at least in initial stages) take place within the highly deterministic environment of a computer chip. This means that statistical testing methods (e.g. an evolutionary algorithm's evaluation of average fitness on a set of test problems) are not suitable for self-modifications which can potentially induce catastrophic failure (e.g. of parts of code that can affect the representation or interpretation of the goals). Mathematical proofs have the property that they are as strong as their axioms and have no significant conditionally independent per-step failure probability if their axioms are semantically true, which suggests that something like mathematical reasoning may be appropriate for certain particular types of self-modification during some developmental stages.

Thus the content of the paper is very far off from how a realistic AI would work, but conversely, if you can't even answer the kinds of simple problems posed within the paper (both those we partially solve and those we only pose) then you must be very far off from being able to build a stable self-modifying AI. Being able to say how to build a theoretical device that would play perfect chess given infinite computing power, is very far off from the ability to build Deep Blue. However, if you can't even say how to play perfect chess given infinite computing power, you are confused about the rules of the chess or the structure of chess-playing computation in a way that would make it entirely hopeless for you to figure out how to build a bounded chess-player. Thus "In real life we're always bounded" is no excuse for not being able to solve the much simpler unbounded form of the problem, and being able to describe the infinite chess-player would be substantial and useful conceptual progress compared to *not *being able to do that. We can't be absolutely certain that an analogous situation holds between solving the challenges posed in the paper, and realistic self-modifying AIs with stable goal systems, but every line of investigation has to start somewhere.

Parts of the paper will be easier to understand if you've read Highly Advanced Epistemology 101 For Beginners including the parts on correspondence theories of truth (relevant to section 6) and model-theoretic semantics of logic (relevant to 3, 4, and 6), and there are footnotes intended to make the paper somewhat more accessible than usual, but the paper is still essentially aimed at mathematically sophisticated readers.

The fact that MIRI is finally publishing technical research has impressed me. A year ago it seemed, to put it bluntly, that your organization was stalling, spending its funds on the full-time development of Harry Potter fanfiction and popular science books. Perhaps my intuition there was uncharitable, perhaps not. I don't know how much of your lead researcher's time was spent on said publications, but it certainly seemed, from the outside, that it was the majority. Regardless, I'm very glad MIRI is focusing on technical research. I don't know how much farther you have to walk, but it's clear you're headed in the right direction.

(Reply to.)

By default, if you can build a Friendly AI you were not troubled by the Lob problem. That working on the Lob Problem gets you closer to being able to build FAI is neither obvious nor certain (perhaps it is shallow to work on directly, and those who can build AI resolve it as a side effect of doing something else) but everything has to start somewhere. Being able to state crisp difficulties to work on is itself rare and valuable, and the more you

engagewith a problem like stable self-modification, the more you end up knowing about it. Engagement in a form where you can figure out whether or not your proof goes through is more valuable than engagement in the form of pure verbal arguments and intuition, although the latter is significantly more valuable than not thinking about something at all.Reading through the whole Tiling paper might make this clearer; it spends the first 4 chapters on the Lob problem, then starts introducing further concepts once the notion of 'tiling' has been made sufficiently crisp, lik... (read more)

I feel like in this comment you're putting your finger on a general principal of instrumental rationality that goes beyond the specific issue at hand, and indeed beyond the realm of mathematical proof. It might be worth a post on "engagement" at some point.

Specifically, I note similar phenomena in software development where sometimes what I start working on ends up being not at all related to the final product, but nonetheless sets me off on a chain of consequences that lead me to the final, useful product. And I too experience the annoyance of managers insisting that I lay out a clear path from beginning to end, when I don't yet know what the territory looks like or sometimes even what the destination is.

As Eisenhower said, "Plans are worthless, but planning is everything."

If somebody wrote a paper showing how an economy could naturally build another economy while being guaranteed to have all prices derived from a constant set of prices on intrinsic goods, even as all prices were set by market mechanisms as the next economy was being built, I'd think, "Hm. Interesting. A completely different angle on self-modification with natural goal preservation."

I'm surprised at the size of the apparent communications gap around the notion of "How to get started for the first time on a difficult basic question" - surely you can think of mathematical analogies to research areas where it would be significant progress just to throw out an attempted formalization as a base point?

There are all sorts of disclaimers plastered o... (read more)

I think that your paraphrasing

is pretty close to my position.

I would qualify it by saying:

I'd replace "no progress" with "not enough progress for there to be a known research program with a reasonable chance of success."

I have high confidence that some of the recent advances in narrow AI will contribute (whether directly or indirectly) to the eventual creation of AGI (contingent on this event occurring), just not necessarily in a foreseeable way.

If I discover that there's been significantly more progress on AGI than I had thought, then I'll have to reevaluate my position entirely. I could imagine updating in the directly of MIRI's FAI work being very high value, or I could imagine continuing to believe that MIRI's FAI research isn't a priority, for reasons different from my current ones.

I am very glad to see MIRI taking steps to list open problems and explain why those problems are important for making machine intelligence benefit humanity.

I'm also struggling to see why this Lob problem is a reasonable problem to worry about right now (even within the space of possible AI problems). Basically, I'm skeptical that this difficulty or something similar to it will arise in practice. I'm not sure if you disagree, since you are saying you don't think this difficulty will "block AI." And if it isn’t going to arise in practice (or someth... (read more)

I think you are right that strategy work may be higher value. But I think you underestimate the extent to which (1) such goods are complements [granting for the moment the hypothesis that this kind of AI work is in fact useful], and (2) there is a realistic prospect of engaging in many such projects in parallel, and that getting each started is a bottleneck.

As Wei Dai observed, you seem to be significantly understating the severity of the problem. We are investigating conditions under which an agent can believe that its own operation will lawfully lead to good outcomes, which is more or less necessary for reasonably intelligent, reasonably sane behavior given our current understanding.

Compare to: "I'm not sure how relevant formalizing mathematical reasoning is, because evolution made humans who are pretty good a... (read more)

(These were some comments I had on a slightly earlier draft than this, so the page numbers and such might be slightly off.)

Page 4, footnote 8: I don't think it's true that only stronger systems can prove weaker systems consistent. It can happen that system A can prove system B consistent and A and B are incomparable, with neither stronger than the other. For example, Gentzen's proof of the consistency of PA uses a system which is neither stronger nor weaker than PA.

Page 6: the hypotheses of the second incompleteness theorem are a little more restrictive t... (read more)

Please augment footnote 4 or otherwise provide a more complete summary of symbols.

I appreciate that some of these symbols have standard meanings in the relevant fields. But different subfields use these with subtle differences.

Note, for example, this list of logic symbols in which single-arrow -> and double-arrow => are said to carry the same meaning, but you use them distinctly. (You also use horizontal line (like division) to indicate implication on page 7. I think you mean the same as single-arrow. Again, that's standard notation, but it shou... (read more)

Why frame this problem as about tiling/self-modification instead of planning/self-prediction? If you do the latter though, the problem looks more like an AGI (or AI capability) problem than an FAI (or AI safety) problem, which makes me wonder if it's really a good idea to publicize the problem and invite more people to work on it publicly.

Regarding section 4.3 on probabilistic reflection, I didn't get a good sense from the paper of how Christiano et al's formalism relates to the concrete problem of AI self-modification or self-prediction. For example what ... (read more)

Sorry for being confusing, and thanks for giving me a chance to try again! (I did write that comment too quickly due to lack of time.)

So, my point is, I think that there is very little reason to think that evolution somehow had to solve the Löbstacle in order to produce humans. We run into the Löbstacle when we try to use the standard foundations of mathematics (first-order logic + PA or ZFC) in the obvious way to make a self-modifying agent that will continue to follow a given goal after having gone through a very large number of self-modifications. We don't currently have any framework not subject to this problem, and we need one if we want to build a Friendly seed AI. Evolution didn't have to solve this problem. It's true that evolution did have to solve the planning/self-prediction problem, but it didn't have to solve it with extremely high reliability. I see very little reason to think that if we understood how evolution solved the problem it solved, we would then be really close to having a satisfactory Löbstacle-free decision theory to use in a Friendly seed AI -- and thus, conversely, I see little reason to think that an AGI project must solve the Löbstacle in order to solv... (read more)

Congrats to Eliezer and Marcello on the writeup! It has helped me understand Benja's "parametric polymorphism" idea better.

There's a slightly different angle that worries me. What happens if you ask an AI to solve the AI reflection problem?

1) If an agent A_1 generates another agent A_0 by consequentialist reasoning, possibly using proofs in PA, then future descendants of A_0 also count as consequences. So at least A_0 should not face the problem of "telomere shortening", because PA can see the possible consequences of "telomere sho... (read more)

As a non-expert fan of AI research, I simply wanted to mention that this and other recent papers seem to go a fair ways toward addressing one of the Karnofsky's criticisms of the SI I remember agreeing with:

... (read more)Page 14, Remarks. Typo:

This should be "T_0 can prove certain exact theorems which T_1 cannot".

Cool.

I'll get my math team working on this right away, and we eagerly await more of these. ;)

EDIT: slight sarcasm on the "my math team".

If ever I wanted to upvote something twice, it's this.

This post does answer some questions I had regarding the relevance of mathematical proof to AI safety, and the motivations behind using mathematical proof in the first place. I don't believe I've seen this bit before:

For the record, I personally think this statement is too strong. Instead, I would say something more like what Paul said:

... (read more)I read the first two pages of the publication, and wasn't convinced that the problem that the paper attempts to address has non-negligible relevance to AI safety. I would be interested in seeing you spell out your thoughts on this point in detail.

[

Edit:Thinking this over a little bit, I realize that maybe my comment isn't so helpful in isolation. I corresponded with Luke about this, and would be happy to flesh out my thinking either publicly or in private correspondence.]Personally, I feel like that kind of metawork is very important, but that somebody should also be doing something that

isn'tjust metawork. If there's nobody making concrete progress on the actual problem that we're supposed to be solving, there's a major risk of the whole thing becoming a lost purpose, as well as of potentially-interested people wandering off to somewhere where they can actually do something that feels more real.From inside MIRI, I've been able to feel this one

viscerallyas genius-level people come to me and say "Wow, this has really opened my eyes. Where do I get started?" and (until now) I've had to reply "Sorry, we haven't written down our technical research agenda anywhere" and so they go back to machine learning or finance or whatever because no, they aren't going to learn 10 different fields and become hyper-interdisciplinary philosophers working on important but slippery meta stuff like Bostrom and Shulman.For another take on why some people expect this kind of work on the Lobian obstacle to be useful for FAI, see this interview with Benja Fallenstein.

Probably a stupid question but... on the issue of goal stability more generally, might the Lyapunov stability theorems (see sec. 3) be of use? For more mathematical detail, see here.

I like to think of this paper as being as the philosophical edge of AI research, which doesn't yet have its own subfield, ala a quote from Boden in Bobrow & Hayes (1985):

... (read more)I'm confused by this sentence. There are many statistical testing methods that output what are essentially proofs; e.g. statements of the form "probability of a failure existing is at most 10^(-100)". Why would this not be sufficient?... (read more)

Possible typo: Equation 4.2 subscript: T- n+1

Should this be T- (n+1) ?

On page 12, when you talk about the different kinds of trust, it seems like tiling trust is just a subtype of naturalistic trust. If something running on T can trust some arbitrary physical system if that arbitrary physical system implements T, then it should be able to trust its successor if that successor is a physical system that implements T. Not sure if that means anything.

Can't you require that the agents you swap to spend at least some fraction of their effort on meliorizing? Each swap c... (read more)