An early draft of publication #2 in the Open Problems in Friendly AI series is now available: Tiling Agents for Self-Modifying AI, and the Lobian Obstacle. ~20,000 words, aimed at mathematicians or the highly mathematically literate. The research reported on was conducted by Yudkowsky and Herreshoff, substantially refined at the November 2012 MIRI Workshop with Mihaly Barasz and Paul Christiano, and refined further at the April 2013 MIRI Workshop.

**Abstract:**

We model self-modication in AI by introducing 'tiling' agents whose decision systems will approve the construction of highly similar agents, creating a repeating pattern (including similarity of the offspring's goals). Constructing a formalism in the most straightforward way produces a Godelian difficulty, the Lobian obstacle. By technical methods we demonstrate the possibility of avoiding this obstacle, but the underlying puzzles of rational coherence are thus only partially addressed. We extend the formalism to partially unknown deterministic environments, and show a very crude extension to probabilistic environments and expected utility; but the problem of finding a fundamental decision criterion for self-modifying probabilistic agents remains open.

Commenting here is the preferred venue for discussion of the paper. This is an early draft and has not been reviewed, so it may contain mathematical errors, and reporting of these will be much appreciated.

The overall agenda of the paper is introduce the conceptual notion of a self-reproducing decision pattern which includes reproduction of the goal or utility function, by exposing a particular possible problem with a tiling logical decision pattern and coming up with some partial technical solutions. This then makes it conceptually much clearer to point out the even deeper problems with "We can't yet describe a probabilistic way to do this because of non-monotonicity" and "We don't have a good bounded way to do this because maximization is impossible, satisficing is too weak and Schmidhuber's swapping criterion is underspecified." The paper uses first-order logic (FOL) because FOL has a lot of useful standard machinery for reflection which we can then invoke; in real life, FOL is of course a poor representational fit to most real-world environments outside a human-constructed computer chip with thermodynamically expensive crisp variable states.

As further background, the idea that something-like-proof might be relevant to Friendly AI is not about achieving some chimera of absolute safety-feeling, but rather about the idea that the total probability of catastrophic failure should not have a significant conditionally independent component on each self-modification, and that self-modification will (at least in initial stages) take place within the highly deterministic environment of a computer chip. This means that statistical testing methods (e.g. an evolutionary algorithm's evaluation of average fitness on a set of test problems) are not suitable for self-modifications which can potentially induce catastrophic failure (e.g. of parts of code that can affect the representation or interpretation of the goals). Mathematical proofs have the property that they are as strong as their axioms and have no significant conditionally independent per-step failure probability if their axioms are semantically true, which suggests that something like mathematical reasoning may be appropriate for certain particular types of self-modification during some developmental stages.

Thus the content of the paper is very far off from how a realistic AI would work, but conversely, if you can't even answer the kinds of simple problems posed within the paper (both those we partially solve and those we only pose) then you must be very far off from being able to build a stable self-modifying AI. Being able to say how to build a theoretical device that would play perfect chess given infinite computing power, is very far off from the ability to build Deep Blue. However, if you can't even say how to play perfect chess given infinite computing power, you are confused about the rules of the chess or the structure of chess-playing computation in a way that would make it entirely hopeless for you to figure out how to build a bounded chess-player. Thus "In real life we're always bounded" is no excuse for not being able to solve the much simpler unbounded form of the problem, and being able to describe the infinite chess-player would be substantial and useful conceptual progress compared to *not *being able to do that. We can't be absolutely certain that an analogous situation holds between solving the challenges posed in the paper, and realistic self-modifying AIs with stable goal systems, but every line of investigation has to start somewhere.

Parts of the paper will be easier to understand if you've read Highly Advanced Epistemology 101 For Beginners including the parts on correspondence theories of truth (relevant to section 6) and model-theoretic semantics of logic (relevant to 3, 4, and 6), and there are footnotes intended to make the paper somewhat more accessible than usual, but the paper is still essentially aimed at mathematically sophisticated readers.

The fact that MIRI is finally publishing technical research has impressed me. A year ago it seemed, to put it bluntly, that your organization was stalling, spending its funds on the full-time development of Harry Potter fanfiction and popular science books. Perhaps my intuition there was uncharitable, perhaps not. I don't know how much of your lead researcher's time was spent on said publications, but it certainly seemed, from the outside, that it was the majority. Regardless, I'm very glad MIRI is focusing on technical research. I don't know how much farther you have to walk, but it's clear you're headed in the right direction.

(Reply to.)

By default, if you can build a Friendly AI you were not troubled by the Lob problem. That working on the Lob Problem gets you closer to being able to build FAI is neither obvious nor certain (perhaps it is shallow to work on directly, and those who can build AI resolve it as a side effect of doing something else) but everything has to start somewhere. Being able to state crisp difficulties to work on is itself rare and valuable, and the more you

engagewith a problem like stable self-modification, the more you end up knowing about it. Engagement in a form where you can figure out whether or not your proof goes through is more valuable than engagement in the form of pure verbal arguments and intuition, although the latter is significantly more valuable than not thinking about something at all.Reading through the whole Tiling paper might make this clearer; it spends the first 4 chapters on the Lob problem, then starts introducing further concepts once the notion of 'tiling' has been made sufficiently crisp, lik... (read more)

I feel like in this comment you're putting your finger on a general principal of instrumental rationality that goes beyond the specific issue at hand, and indeed beyond the realm of mathematical proof. It might be worth a post on "engagement" at some point.

Specifically, I note similar phenomena in software development where sometimes what I start working on ends up being not at all related to the final product, but nonetheless sets me off on a chain of consequences that lead me to the final, useful product. And I too experience the annoyance of managers insisting that I lay out a clear path from beginning to end, when I don't yet know what the territory looks like or sometimes even what the destination is.

As Eisenhower said, "Plans are worthless, but planning is everything."

If somebody wrote a paper showing how an economy could naturally build another economy while being guaranteed to have all prices derived from a constant set of prices on intrinsic goods, even as all prices were set by market mechanisms as the next economy was being built, I'd think, "Hm. Interesting. A completely different angle on self-modification with natural goal preservation."

I'm surprised at the size of the apparent communications gap around the notion of "How to get started for the first time on a difficult basic question" - surely you can think of mathematical analogies to research areas where it would be significant progress just to throw out an attempted formalization as a base point?

There are all sorts of disclaimers plastered o... (read more)

Neither 2 nor 3 is the sort of argument I would ever make (there's such a thing as an attempted steelman which by virtue of its obvious weakness doesn't really help). You already know(?) that I vehemently reject all attempts to multiply tiny probabilities by large utilities in real-life practice, or to claim that the most probable assumption about a background epistemic question leads to a forgone doom and use this to justify its improbable negation. The part at which you lost me is of course part 1.

I still don't understand what you could be thinking here, and feel like there's some sort of basic failure to communicate going on. I could guess something along the lines of "Maybe Jonah is imagining that Friendly AI will be built around principles completely different from modern decision theory and any notion of a utility function..." (but really, is something like that one of just 10^10 equivalent candidates?) "...and more dissimilar to that than logical AI is from decision theory" (that's a lot of dissimilarity but we already demonstrated conceptual usefulness over a gap that size). Still, that's the sort of belief someone might end up with if their knowledg... (read more)

I think I may have been one of those three graduate students, so just to clarify, my view is:

Zero progress being made seems too strong a claim, but I would say that most machine learning research is neither relevant to, nor trying to be relevant to, AGI. I

thinkthat there is no real disagreement on this empirical point (at least, from talking to both Jonah and Eliezer in person, I don't get the impression that I disagree with either of you on this particular point).The model for AGI that MIRI uses seems mostly reasonable, except for the "self-modification" part, which seems to be a bit too much separated out from everything else (since pretty much any form of learning is a type of self-modification --- current AI algorithms are self-modifying all the time!).

On this vein, I'm skeptical of both the need or feasibility of an AI providing an actual proof of safety of self-modification. I also think that using mathematical logic somewhat clouds the issues here, and that most of the issues that MIRI is currently working on are prerequisites for any sort of AI, not just friendly AI. I expect them to be solved as a side-effect of what I see as more fundamental outstanding

Without further context I see nothing wrong here. Superintelligences are Turing machines, check. You might need a 10^20 slowdown before that becomes relevant, check. It's possible that the argument proves too much by showing that a well-trained high-speed immortal dog can simulate Mathematica and therefore a dog is 'intellectually expressive' enough to understand integral calculus, but I don't know if that's what Scott means and principle of charity says I shouldn't assume that without confirmation.

EDIT: Parent was edited, my reply was to the first part, not the second. The second part sounds like something to talk with Scott about. I really think the "You're just as likely to get results in the opposite direction" argument is on the priors overstated for most forms of research. Does Scott think that work we do today is just as likely to decrease our understanding of P/NP as increase it? We may be a long way off from proving an answer but that's not a reason to adopt such a strange prior.

I think that your paraphrasing

is pretty close to my position.

I would qualify it by saying:

I'd replace "no progress" with "not enough progress for there to be a known research program with a reasonable chance of success."

I have high confidence that some of the recent advances in narrow AI will contribute (whether directly or indirectly) to the eventual creation of AGI (contingent on this event occurring), just not necessarily in a foreseeable way.

If I discover that there's been significantly more progress on AGI than I had thought, then I'll have to reevaluate my position entirely. I could imagine updating in the directly of MIRI's FAI work being very high value, or I could imagine continuing to believe that MIRI's FAI research isn't a priority, for reasons different from my current ones.

Agreed-on summaries of persistent disagreement aren't ideal, but they're more conversational progress than usually happens, so... thanks!

No doubt. And as of now, for none of them we're able to tell whether they are safe or not. There's insufficient rigor in the language, the formulizations aren't standardized or pinned down (in this subject matter). MIRI's work is creating and pinning down the milestones for how we'd even go about assessing self-modifying friendly AI in terms of goal stability, in mathematical language.

To have

anyoperationalization of how some specific model of self-modification provably maintains some invariant would be a large step forward, the existence of other models of self-modification nonwithstanding. Safety cannot be proven for all approaches, because not all approachesaresafe.Even if that were so, that's not MIRI's (or EY's) most salient comparative advantage (also: CFAR).

I am very glad to see MIRI taking steps to list open problems and explain why those problems are important for making machine intelligence benefit humanity.

I'm also struggling to see why this Lob problem is a reasonable problem to worry about right now (even within the space of possible AI problems). Basically, I'm skeptical that this difficulty or something similar to it will arise in practice. I'm not sure if you disagree, since you are saying you don't think this difficulty will "block AI." And if it isn’t going to arise in practice (or something similar to it), I’m not sure why this should be high on the priority list of general AI issues to think about it (edited to add: or why working on this problem now should be expected to help machine intelligence develop in a way that benefits humanity).

Some major questions I have are:

Part ... (read more)

I think you are right that strategy work may be higher value. But I think you underestimate the extent to which (1) such goods are complements [granting for the moment the hypothesis that this kind of AI work is in fact useful], and (2) there is a realistic prospect of engaging in many such projects in parallel, and that getting each started is a bottleneck.

As Wei Dai observed, you seem to be significantly understating the severity of the problem. We are investigating conditions under which an agent can believe that its own operation will lawfully lead to good outcomes, which is more or less necessary for reasonably intelligent, reasonably sane behavior given our current understanding.

Compare to: "I'm not sure how relevant formalizing mathematical reasoning is, because evolution made humans who are pretty good a... (read more)

The opportunity to kill yourself in exchange for $10 is a prototypical case. It's well and good to say "this is only a problem for an agent who uses proofs," but that's not a

scenario, that's atype of agent. Yes, real agents will probably not use mathematical logic in the naive way. But as I pointed out in response to Wei Dai, probabilism doesn't make the issues go away. It softens the impossibility proofs, but we still lack possibility proofs (which is what MIRI is working on). So this seems like a weak objection. If you want to say "future agents will have different reasoning frameworks than the ones we currently understand," that's well and good, but see below. (That seems a lot like discouraging someone from trying to develop logic because their proto-logic doesn't resemble the way that humans actually reason.)This is what I thought you meant; it seems analogous to:

That's not it, rather:

Yep. We have reasoning frameworks like the currently dominant forms of decision theory, but they don't handle reflectivity well.

The Lob Problem isn't a top-priority scary thing that is carved upon the tombstones of worlds, it's more like, "Look! We managed to crisply exhibit something very precise that would go wrong with standard methods and get started on analyzing and fixing it! Before we just saw in a more intuitive sense that something would go wrong when we applied standard theories to reflective problems but now we can state three problems very prec... (read more)

Using probabilities instead of proofs seems to eliminate the old obstructions, but it does leave a sequence of challenging problems (hence the work on probabilistic reflection). E.g., we've proved that there is an algorithm P using a halting oracle such that:

(Property R): Intuitively, we "almost" have a < P(X | a < P(X) < b) < b. Formally:

But this took a great deal of work, and we can't exhibit any algorithm that simultaneously satisfies Property R and has P(Property R) = 1. Do you think this is not an important question? It seems to me that we don't yet know how many of the Godelian obstructions carry in the probabilistic environment, and there are still real problems that will involve ingenuity to resolve.

(These were some comments I had on a slightly earlier draft than this, so the page numbers and such might be slightly off.)

Page 4, footnote 8: I don't think it's true that only stronger systems can prove weaker systems consistent. It can happen that system A can prove system B consistent and A and B are incomparable, with neither stronger than the other. For example, Gentzen's proof of the consistency of PA uses a system which is neither stronger nor weaker than PA.

Page 6: the hypotheses of the second incompleteness theorem are a little more restrictive than this (though not much, I think).

Page 11, problem c: I don't understand the sentence containing "highly regular and compact formula." Looks like there's a typo somewhere.

Why frame this problem as about tiling/self-modification instead of planning/self-prediction? If you do the latter though, the problem looks more like an AGI (or AI capability) problem than an FAI (or AI safety) problem, which makes me wonder if it's really a good idea to publicize the problem and invite more people to work on it publicly.

Regarding section 4.3 on probabilistic reflection, I didn't get a good sense from the paper of how Christiano et al's formalism relates to the concrete problem of AI self-modification or self-prediction. For example what are the functions P and p supposed to translate to in terms of an AI and its descendent or future self?

Sorry for being confusing, and thanks for giving me a chance to try again! (I did write that comment too quickly due to lack of time.)

So, my point is, I think that there is very little reason to think that evolution somehow had to solve the Löbstacle in order to produce humans. We run into the Löbstacle when we try to use the standard foundations of mathematics (first-order logic + PA or ZFC) in the obvious way to make a self-modifying agent that will continue to follow a given goal after having gone through a very large number of self-modifications. We don't currently have any framework not subject to this problem, and we need one if we want to build a Friendly seed AI. Evolution didn't have to solve this problem. It's true that evolution did have to solve the planning/self-prediction problem, but it didn't have to solve it with extremely high reliability. I see very little reason to think that if we understood how evolution solved the problem it solved, we would then be really close to having a satisfactory Löbstacle-free decision theory to use in a Friendly seed AI -- and thus, conversely, I see little reason to think that an AGI project must solve the Löbstacle in order to solv... (read more)

Please augment footnote 4 or otherwise provide a more complete summary of symbols.

I appreciate that some of these symbols have standard meanings in the relevant fields. But different subfields use these with subtle differences.

Note, for example, this list of logic symbols in which single-arrow -> and double-arrow => are said to carry the same meaning, but you use them distinctly. (You also use horizontal line (like division) to indicate implication on page 7. I think you mean the same as single-arrow. Again, that's standard notation, but it shou... (read more)

Congrats to Eliezer and Marcello on the writeup! It has helped me understand Benja's "parametric polymorphism" idea better.

There's a slightly different angle that worries me. What happens if you ask an AI to solve the AI reflection problem?

1) If an agent A_1 generates another agent A_0 by consequentialist reasoning, possibly using proofs in PA, then future descendants of A_0 also count as consequences. So at least A_0 should not face the problem of "telomere shortening", because PA can see the possible consequences of "telomere sho... (read more)

Page 14, Remarks. Typo:

This should be "T_0 can prove certain exact theorems which T_1 cannot".

If ever I wanted to upvote something twice, it's this.

As a non-expert fan of AI research, I simply wanted to mention that this and other recent papers seem to go a fair ways toward addressing one of the Karnofsky's criticisms of the SI I remember agreeing with:

... (read more)Cool.

I'll get my math team working on this right away, and we eagerly await more of these. ;)

EDIT: slight sarcasm on the "my math team".

This post does answer some questions I had regarding the relevance of mathematical proof to AI safety, and the motivations behind using mathematical proof in the first place. I don't believe I've seen this bit before:

For the record, I personally think this statement is too strong. Instead, I would say something more like what Paul said:

... (read more)Probably a stupid question but... on the issue of goal stability more generally, might the Lyapunov stability theorems (see sec. 3) be of use? For more mathematical detail, see here.

I like to think of this paper as being as the philosophical edge of AI research, which doesn't yet have its own subfield, ala a quote from Boden in Bobrow & Hayes (1985):

... (read more)For another take on why some people expect this kind of work on the Lobian obstacle to be useful for FAI, see this interview with Benja Fallenstein.

I read the first two pages of the publication, and wasn't convinced that the problem that the paper attempts to address has non-negligible relevance to AI safety. I would be interested in seeing you spell out your thoughts on this point in detail.

[

Edit:Thinking this over a little bit, I realize that maybe my comment isn't so helpful in isolation. I corresponded with Luke about this, and would be happy to flesh out my thinking either publicly or in private correspondence.]The LW post may address some of your concerns. The idea here is that we need a tiling decision criterion, and the paper isn't supposed to be an AI design, it's supposed to get us a little conceptually closer to a tiling decision criterion. If you don't understand why a tiling decision criterion is a good thing in a self-improving AI which is supposed to have a stable goal system, then I'm not quite sure what issue needs addressing.

Jonah, some self-modifications will potentially be large, but others might be smaller. More importantly we don't want each self-modification to involve wrenching changes like altering the mathematics you believe in, or even worse, your goals. Most of the core idea in this paper is to prevent those kinds of drastic or deleterious changes from being forced by a self-modification.

But it's also possible that there'll be many gains from small self-modifications, and it would be nicer not to need a special case for those, and for this it is good to have (in theoretical principle) a highly regular bit of cognition/verification that needs to be done for the change (e.g. for logical agents the proof of a certain theorem) so that small local changes only call for small bits of the verification to be reconsidered.

Another way of looking at it is that we're trying to have the AI be as free as possible to self-modify while still knowing that it's sane and stable, and the more overhead is forced or the more small changes are ruled out, the less free it is.

The paper is meant to be interpreted within an agenda of "Begin tackling the conceptual challenge of describing a stably self-reproducing decision criterion by inventing a simple formalism and confronting a crisp difficulty"; not as "We think this Godelian difficulty will block AI", nor "This formalism would be good for an actual AI", nor "A bounded probabilistic self-modifying agent would be like this, only scaled up and with some probabilistic and bounded parts tacked on". If that's not what you meant, please clarify.

Personally, I feel like that kind of metawork is very important, but that somebody should also be doing something that

isn'tjust metawork. If there's nobody making concrete progress on the actual problem that we're supposed to be solving, there's a major risk of the whole thing becoming a lost purpose, as well as of potentially-interested people wandering off to somewhere where they can actually do something that feels more real.From inside MIRI, I've been able to feel this one

viscerallyas genius-level people come to me and say "Wow, this has really opened my eyes. Where do I get started?" and (until now) I've had to reply "Sorry, we haven't written down our technical research agenda anywhere" and so they go back to machine learning or finance or whatever because no, they aren't going to learn 10 different fields and become hyper-interdisciplinary philosophers working on important but slippery meta stuff like Bostrom and Shulman.Yes, that's a large de-facto part of my reasoning.

If I gave you a list of people who in fact expressed interest but then, when there were no technical problems for them to work on, "wandered off to somewhere where they can actually do something that feels more real," would you change your mind? (I may not be able to produce such a list, because I wasn't writing down people's names as they wandered away, but I might be able to reconstruct it.)

No, the oops is on me for not realizing how shallow "working on something that feels more real" would feel after the novelty of being able to explain what I work on to laypeople wore off.

That sounds like a very long conversation if we're supposed to be giving quantitative estimates on everything. The qualitative version is just that this sort of thing can take a long time, may not parallelize easily, and can potentially be partially factored out to academia, and so it is wise to start work on it as soon as you've got enough revenue to support even a small team, so long as you can continue to scale your funding while that's happening.

This reply takes for granted that all astronomical benefits bottleneck through a self-improving AI at some point.

BTW, I spent a large fraction of the first few months of 2013 weighing FAI research vs. other options before arriving at MIRI's 2013 strategy (which focuses heavily on FAI research). So it's not as though I think FAI research is

obviouslythe superior path, and it's also not as though we haven't thought through all these different options, and gotten feedback from dozens of people about those options, and so on.Also note that MIRI did, in fact, decide to focus on (1) spreading rationality, and (2) building a community of people who care about rationality, the far future, and x-risk,

beforeturning its head to FAI research: see (in chronological order) the Singularity Summit, Less Wrong and CFAR.But the question of which interventions are most cost effective (given astronomical waste) is a huge and difficult topic, one that will require thousands of hours to examine properly. Building on Beckstead and Bostrom, I've tried to begin that examination here. Before jumping over to that topic, I wonder: do you now largely accept the case Eliezer made for this latest paper as an important first step on an important sub-problem of the Friendly AI problem? And if not, why not?

Self-modification is to be interpreted to include 'directly editing one's own low-level algorithms using high-level deliberative process' but not include 'changing one's diet to change one's thought processes'. If you are uncomfortable using the word 'self-modification' for this please substitute a new word 'fzoom' which means only that and consider everything I said about self-modification to be about fzoom.

Humans wouldn't look at their own source code and say, "Oh dear, a Lobian obstacle", on this I agree, but this is because humans would look at their own source code and say "What?". Humans have no idea under what exact circumstances they will believe something, which comes with its own set of problems. The Lobian obstacle shows up when you approach things from the end we can handle, namely weak but well-defined systems which can well-define what they will believe, whereas human mathematicians are stronger than ZF plus large cardinals but we don't know how they work or what might go wrong or what might change if we started editing neural circuit #12,730,889,136.

As Christiano's work shows, allowing for tiny finite variances of probability might well dissipate the Lobian obstacle, but that's the sort of thing you find out by knowing what a Lobian obstacle is.

I'm confused by this sentence. There are many statistical testing methods that output what are essentially proofs; e.g. statements of the form "probability of a failure existing is at most 10^(-100)". Why would this not be sufficient?... (read more)

Possible typo: Equation 4.2 subscript: T- n+1

Should this be T- (n+1) ?

On page 12, when you talk about the different kinds of trust, it seems like tiling trust is just a subtype of naturalistic trust. If something running on T can trust some arbitrary physical system if that arbitrary physical system implements T, then it should be able to trust its successor if that successor is a physical system that implements T. Not sure if that means anything.

Can't you require that the agents you swap to spend at least some fraction of their effort on meliorizing? Each swap c... (read more)