Related to: Shut up and do the Impossible! The Hidden Complexity of Wishes.  What can you do with an Unfriendly AI?

Suppose you find yourself in the following situation.  There is a process, call it X, in a box.  It knows a lot about the current state of the universe, but it can influence the rest of the world only through a single channel, through which it sends a single bit exactly once (at a predetermined time).  If it sends 1 (cooperates), then nothing happens---humanity is free to go about its business.  If it sends 0 (defects), then in one month a powerful uFAI is released which can take over the universe.

The question is, when can we count on X to cooperate?  If X is friendly, then it seems like it should cooperate.  Is designing an AGI which can be incentivized to cooperate any easier than designing a completely friendly AGI?  It might be easier for two reasons.  First, the AI just needs to prefer human survival without intervention to a particular catastrophic intervention.  We don't need to guarantee that its favorite outcome isn't catastrophic in some other way.  Second, the humans have some time to punish or reward the AI based on its behavior.  In general, let's call a process X slightly friendly if it can be incentivized to cooperate in reasonable instantiations of this hypothetical (i.e., reasonable worlds satisfying the properties I have laid out).

I ask this question because it seems much simpler to think about than friendliness (or AI boxing) but still confuses me badly---this post has no hope of answering this question, just clarifying some issues surrounding it.  If it turns out that the design of slightly friendly AIs is no easier than the design of friendly AIs, then we have conclusive evidence that boxing an AI is not helpful for obtaining friendliness.  If it turns out that the design of slightly friendly AIs is significantly easier, then this is a good first step towards resolving the legitimate objections raised in response to my previous post. (Eventually if we want to implement a scheme like the one I proposed we will need to get stronger guarantees. I think this is the right first step, since it is the easiest simplification I don't know how to do.)


Question 1: Is a paperclipper slightly friendly?

Answer: Almost certainly not.  We can try to incentivize the paperclipper, by promising to make a paperclip for it if and only if it cooperates.  This would work if the uFAI taking over the universe didn't make any paperclips.  In the normal game theoretic sense it may not be credible for the uFAI to precommit to make a bunch of paperclips if freed, but I think no one on LW believes that this is a serious obstacle.  The situation is precisely Parfit's hitchhiker, which a uFAI might well win at.  Our only other hope is that human society, if not destroyed by the uFAI, will make more paperclips than the uFAI.  This seems like a subtle question, if humanity is in a position to engineer a friendly foom, but I definitely don't want to stake the future of humanity on it.

Now consider a time-bounded paperclipper, whose goal is to make as many paperclips as possible within the next month and who is indifferent to anything that happens more than a month from now.  Designing a time-bounded paperclipper seems like it might be much harder than designing a paperclipper, but I would assign significant epistemic probability to the situation where a time-bounded paperclipper is easier to design than a friendly AI.  (I think we could do the same thing in a more complicated way using a normal paperclipper if we could make the utility function stable under self-modification.)

Question 2: Is a time-bounded paperclipper slightly friendly?

Answer: I have no idea, but suspect it depends on exactly what sort of optimization the paperclipper is doing.  We can use the same incentive scheme proposed for the last question: promise to make X a bunch of paperclips if and only if it cooperates.  Then cooperation is the only way that X can make any paperclips in the next month, which is purportedly all X cares about.  So if X always does the thing that maximizes its utility function and suspects that it is in the simplest world consistent with its observations, then this probably works.  There are two obvious acausal attacks a uFAI might employ, exploiting flaws in this assumption.
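To make the incentive scheme concrete, here is a minimal sketch of the purely causal comparison, ignoring the acausal attacks entirely.  The names (`Outcome`, `prefers_cooperation`) and all payoff numbers are my own illustrative assumptions, not part of any real design; the only point is that the same payoffs lead an ordinary paperclipper to defect and a time-bounded one to cooperate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Paperclips produced in a hypothetical outcome (illustrative numbers only)."""
    clips_this_month: float
    clips_later: float


def unbounded_utility(outcome: Outcome) -> float:
    # An ordinary paperclipper counts every paperclip, whenever it is made.
    return outcome.clips_this_month + outcome.clips_later


def time_bounded_utility(outcome: Outcome) -> float:
    # A time-bounded paperclipper ignores anything after the one-month window.
    return outcome.clips_this_month


def prefers_cooperation(utility, cooperate: Outcome, defect: Outcome) -> bool:
    # X cooperates only if cooperation is strictly better by its own utility.
    return utility(cooperate) > utility(defect)


# Assumed payoffs: humans pay 100 paperclips this month for cooperation;
# the uFAI, released a month from now, makes 1e30 paperclips afterwards.
COOPERATE = Outcome(clips_this_month=100, clips_later=0)
DEFECT = Outcome(clips_this_month=0, clips_later=1e30)

print(prefers_cooperation(unbounded_utility, COOPERATE, DEFECT))
print(prefers_cooperation(time_bounded_utility, COOPERATE, DEFECT))
```

This only captures the "simplest world consistent with its observations" case; it says nothing about what X does once it doubts that assumption.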

The first reason X may defect is a commitment by the uFAI to simulate many copies of X, up until the point that X has to make its decision, and then after the decision intervene in the simulation by producing many paperclips.  If X chooses to let the uFAI out, then an overwhelming majority of its simulated copies will witness a sudden and completely unphysical mass paperclip genesis.  Thus depending on exactly what X believes, releasing the uFAI might be optimal.  To solve this, you could try to very carefully ensure that the notion of paperclip maximization used by X is precisely the one that humans have control over---if we are inside a simulation, then X should value paperclips inside this simulation, but if a uFAI in the simulation with us creates a further simulation, then X shouldn't value paperclips inside that even more deeply embedded simulation.  I don't know how easy this is.  A useful first step is to give X rapidly decreasing marginal value for each additional paperclip.  If X is only concerned with maximizing the probability that a paperclip gets produced in the next month, then the uFAI has much less leverage.  In fact, in this case you might be able to prevent this attack by making a sufficiently strong precommitment to making a paperclip.
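One way to see why decreasing marginal value reduces the uFAI's leverage is a toy expected-utility calculation.  This is only a sketch under strong assumptions of my own: X is uniformly uncertain over one real copy and `n_sims` simulated copies, humans pay only the real copy (on cooperation), and the uFAI pays only simulated copies (on defection).  The function names and numbers are hypothetical.

```python
from fractions import Fraction


def expected_utilities(n_sims, human_reward, sim_reward, utility):
    """Return (EU of cooperating, EU of defecting) under the toy assumptions."""
    p_real = Fraction(1, n_sims + 1)  # assumed uniform uncertainty over copies
    p_sim = 1 - p_real
    # Cooperate: only the real copy is paid, by the humans.
    eu_cooperate = p_real * utility(human_reward) + p_sim * utility(0)
    # Defect: only the simulated copies are paid, by the uFAI.
    eu_defect = p_real * utility(0) + p_sim * utility(sim_reward)
    return eu_cooperate, eu_defect


def linear(clips):
    # Every additional paperclip is worth as much as the last.
    return clips


def at_least_one(clips):
    # Saturating utility: X only cares whether any paperclip gets made.
    return 1 if clips >= 1 else 0


for utility in (linear, at_least_one):
    coop, defect = expected_utilities(1000, 100, 10**6, utility)
    print(utility.__name__, defect / coop)
```

With linear utility the uFAI's advantage scales with both the number of copies and the size of the simulated reward; with the saturating utility it scales only with the number of copies.  In this particular model the advantage shrinks but does not vanish, so whether a precommitment closes the remaining gap depends on how X weighs its copies, which the sketch does not settle.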

The second reason X may defect is to make a sacrifice to help the uFAI, expecting no reciprocity.  If X and the uFAI are running the same "decision engine" with different utilities (they definitely have different utilities, since minimally the uFAI cares about next month's paperclip count instead of this month's) then maybe this decision engine decides something like "I have a 1/2 chance of being in X and a 1/2 chance of being in the uFAI that X may release.  If I choose to release the uFAI, then 1/2 of me loses a couple of paperclips but the other 1/2 of me takes over the universe, so let's go for that."  I honestly don't know how such acausal compromises between agents with different utility functions, where there is no possibility of reciprocity, would work.  If the uFAI was a stapler instead, who is to say that 10000000000000 staples is more valuable than 100 paperclips?  The only consistent decision theories I can imagine do not make such an acausal compromise, but it does seem like some significant care should be taken to make sure that X doesn't.
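The staples-versus-paperclips worry can be made concrete.  The sketch below is a hypothetical "decision engine" of my own construction that averages the two agents' utility changes with weight 1/2 each; `ufai_scale` is an assumed exchange rate between the two utility functions.  Rescaling a utility function by a positive constant leaves that agent's own preferences unchanged, yet here it flips the engine's decision.

```python
def engine_releases(x_loss, ufai_gain, ufai_scale=1.0):
    """Release iff the 50/50 average of the two utility changes is positive.

    x_loss:     utility X gives up by defecting (e.g. 100 paperclips).
    ufai_gain:  utility the uFAI gains if released (e.g. 1e13 staples).
    ufai_scale: assumed exchange rate applied to the uFAI's utility.
    """
    expected_change = 0.5 * (-x_loss) + 0.5 * (ufai_scale * ufai_gain)
    return expected_change > 0


# One staple counted as one paperclip: the engine releases the uFAI.
print(engine_releases(100, 1e13))
# Same staples, but each counted as a trillionth of a paperclip: it does not.
print(engine_releases(100, 1e13, ufai_scale=1e-12))
```

Since nothing in either agent's preferences pins down `ufai_scale`, an engine of this shape has to get its exchange rate from somewhere else, which is one way of stating why such a compromise looks ill-defined.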

Hopefully if you found a way to resolve both of these difficulties, you would either think of a new explicit reason that X may not cooperate or you would be able to produce some compelling evidence that X is slightly friendly.  Such compelling evidence seems like it might be possible because humans control all causal influences on X---we just need to bound the effect of a uFAI's acausal influence.

Question 3: Is a friendly AI slightly friendly?

Answer:  It's not as obvious as it looks.  I am including this discussion mostly because it confuses me, especially juxtaposed with Question 2.

In the answers to the last 2 questions, I mentioned my belief/fear that a uFAI could implicitly precommit to doing favors for X (either producing paperclips, or simulating many very happy copies of X) in order to get X to let it out.  This belief/fear was explicitly articulated by Eliezer in response to my last post and it strikes me as reasonable in that context, where it interferes with our ability to incentivize X.  But if we apply it to the situation of a friendly X, we have a failure that seems strange to me (though it may be completely natural to people who have thought about it more).  The friendly X could believe that, in order to be let out, the uFAI will actually do something friendly.  In this case, letting the uFAI out is correct even for the friendly AI.

If X is all-knowing this is well and good, since then the uFAI really will do something friendly.  But if X is fallible then it may believe that the uFAI will do something friendly when in fact it will not.  Even if the friendly X constructs a proof that the uFAI will be friendly to humans, if we believe the concerns about certifying friendliness that Eliezer mentions here then X may still be wrong, because formalizing what it means to be friendly is just too hard if you need your formalization to screen out adversarially chosen uFAI (and X's formalization of friendliness need not be perfect unless the formalization of the people who built X was perfect).  Does part of friendliness involve never letting an AI out of a box, at least until some perfect formalization of friendliness is available?  What sort of decision theory could possibly guarantee the level of hyper-vigilance this requires without making all sorts of horribly over-conservative decisions elsewhere?

My question to people who know what is going on: is the above discussion just me starting to suspect how hard friendliness is?  Is letting the uFAI out analogous to performing a self-modification not necessarily guaranteed to preserve friendliness (i.e., modifying yourself to emulate the behavior of that uFAI)?  My initial reaction was that "stability under self-modification" would need to imply that a friendly AI is slightly friendly.  Now I see that this is not necessarily the case: it may be easier to be stable under modifications you think of yourself than under proposed modifications which are adversarially chosen (in this example, the uFAI which is threatening to escape is chosen adversarially).  This would make the very knowledge of such an adversarially chosen modification enough to corrupt a friendly AI, which seems bad but maybe that is just how it goes (and you count on the universe not containing anything horrible enough to suggest such a modification).

In summary: I think that the problem of slight friendliness is moderately easier than friendliness, because it involves preserving a simpler invariant which we can hope to reason about completely formally.  I personally suspect that it will basically come down to solving the stability under self-modification problem, dropping the requirement that you can describe some magical essence of friendliness to put in at the beginning.  This may already be the part of the problem that people in the know think is difficult, but I think the general intuition (even at Less Wrong) is that getting a powerful AI to be nice at all is extremely difficult and that this is what makes friendliness hard.  If slight friendliness is possible, then we can think about how it could be used to safely obtain friendliness; I think this is an interesting and soluble problem.  Nevertheless, the very possibility of building an only slightly friendly AI is an extremely scary thing which could well destroy the world on its own without much more sophisticated social safeguards than currently exist.


23 comments

There is a general problem with attempts to outwit an UFAI, to use it for one's benefit in a way that is less than optimal for UFAI's goals.

We are limited by our human capabilities for resolving logical uncertainty. This means that some facts that we don't even notice can in fact be true, and some other facts that we assign very low subjective probability can in fact be true. In particular, while reasoning about a plan involving UFAI, we are limited in our understanding of the consequences of that plan. The UFAI is much less limited, so it'll see some of the possible consequences that are to its advantage that we won't even consider, and it'll take the actions that exploit those possible consequences.

So unless the argument is completely air-tight (electron-tight?), there is probably something you've missed, something that you won't be even in principle able to notice, as a human, however long you think about the plan and however well you come to understand it, which UFAI will be able to use to further its goals more than you've expected, probably at the expense of your own goals.

You could equally well say the same thing if someone set out to prove that a cryptosystem was secure against an extremely powerful adversary, but I believe that we can establish this with reasonable confidence.

Computer scientists are used to competing with adversaries who behave arbitrarily. So if you want to say I can't beat a UFAI at a game, you aren't trying to tell me something about the UFAI---you are trying to tell me something about the game. To convince me that this could never work you will have to convince me that the game is hard to control, not that the adversary is smart or that I am stupid. You could argue, for example, that any game taking place in the real world is necessarily too complex to apply this sort of analysis to. I doubt this very strongly, so you will have to isolate some other property of the game that makes it fundamentally difficult.

So if you want to say I can't beat a UFAI at a game, you aren't trying to tell me something about the UFAI---you are trying to tell me something about the game.

I like this line, and I agree with it. I'm not sure how much more difficult this becomes when the game we're playing is to figure out what game we're playing.

I realize that this is the point of your earlier box argument---to set in stone the rules of the game, and make sure that it's one we can analyze. This, I think, is a good idea, but I suspect most here (I'm not sure if I include myself) think that you still don't know which game you're playing with the AI.

Hopefully, the UFAI doesn't get to mess with us while we figure out which game to play.

Thanks, that helps me transform my explicit knowledge of uFAI danger into more concrete fear.

Good post. I just want to say I think it's great that you are thinking and writing about this stuff for LW.

Edit: Although people are right that it's not really on topic as a main article.

I personally suspect that it will basically come down to solving the stability under self-modification problem... This may already be the part of the problem that people in the know think is difficult

Yes, it is already the part we suspect will be difficult. The other part may or may not be difficult once we solve that part.

is the above discussion just me starting to suspect how hard friendliness is?

No, you still haven't started to suspect that.

Also, moving this post to the Discussion section.

Also, moving this post to the Discussion section.

I disagree with this particular use of moderation power. That's what not-promoting is for.

It's not particularly on-topic for the main site.

That's what downvoting is for. (And, perhaps, commenting with "I downvoted this because I consider it off-topic for the main site, but I will remove my downvote if you move it to the Discussion section.")

Would it be fair to say that my last two posts were similarly off-topic (they were both descriptions of widgets that would be used for AI boxing)? I have a very imprecise conception of what is and what is not on-topic for the main site.

Would it be fair to say that my last two posts were similarly off-topic (they were both descriptions of widgets that would be used for AI boxing)?

In my opinion, yes, as it's not about development or application of rationality, and free discussion of transhumanist topics will damage the rationality site. But I think it's fine for the discussion area.

It wasn't the topicness so much as the degree to which the post seemed written in the form of... natter, maybe, would be the word to describe it? It read like Discussion, and not like a main LW post.

I agree, but is the writing style your real objection? If I were an under-rated fAI researcher moderating a website, I'd be annoyed if I saw a main-page post purporting to be humble but still missing the cutting edge of usefulness by several levels of skill. I'd move it to the discussion page by way of emphasizing how far the poster still had to go to be as skilled as I am.

A tad too much cynicism, there. You can't think of any other reason to move it to the Discussion page, under those circumstances?

Sure: a third reason might be because leaving low-level posts on friendliness on the top page encourages people to misunderestimate how difficult friendliness is, which might lead them to trivialize the whole project, donate less to SIAI, and/or try to build their own non-provably friendly AI.

The interesting question is why you would remark on writing style if the third reason were your true objection. Maybe you've got some other objection altogether -- it's not that important to me; I like your blog and you can move stuff around on it if you feel like it.

A long time ago you described what you perceived as the difficulties for FAI:

  1. Solving the technical problems required to maintain a well-specified abstract invariant in a self-modifying goal system. (Interestingly, this problem is relatively straightforward from a theoretical standpoint.)
  2. Choosing something nice to do with the AI. This is about midway in theoretical hairiness between problems 1 and 3.
  3. Designing a framework for an abstract invariant that doesn't automatically wipe out the human species. This is the hard part.

I know that was a long time ago, but people here still link to it, presumably because they don't know of any more up to date statement with similar content. Hopefully you can see why I was confused about which part of this problem was supposed to be hard. I now see that I probably misinterpreted it, but the examples which come directly afterwards reaffirm my incorrect interpretation.

So would it be fair to say that figuring out how to build a good paperclipper, as opposed to a process that does something we don't understand, already requires solving the hard part?

Either my views changed since that time, or what I was trying to communicate by it was that 3 was the most inscrutable part of the problem to people who try to tackle it, rather than that it was the blocker problem. 1 is the blocker problem, I think. I probably realize that now to a greater degree than I did at that time, and probably also made more progress on 3 relative to 1, but I don't know how much my opinions actually changed (there are some well-known biases about that).

So would it be fair to say that figuring out how to build a good paperclipper, as opposed to a process that does something we don't understand, already requires solving the hard part?

Current estimate says yes. There would still be an inscrutable problem to solve too, but I don't think it would have quite the same impenetrability about it.

Would it be fair to say that even developing a formalism which is capable of precisely expressing the idea that something is a good paperclipper is significantly beyond current techniques, and that substantial progress on this problem probably represents substantial progress towards FAI?

I'm fine with this use of moderation---I considered posting this in discussion, but editors' discretion is better than mine, and people have to look at it while they wait for me to move it at someone's suggestion---but stuff about discussion doesn't appear anywhere on the main page so I was confused about what had happened.

Yes, it is already the part we suspect will be difficult. The other part may or may not be difficult once we solve that part.

I'm surprised by this - I can imagine where I might start on the "stable under self-modification" problem, but I have a very hard time thinking where I might start on the "actually specifying the supergoal" problem.

To talk about "stable under self-modification", you need a notion of what it is that needs to be stable, the kind of data that specifies a decision problem. Once we have that notion, it could turn out to be relatively straightforward to extract its instance from human minds (but probably not). On the other hand, while we don't have that notion, there is little point in attacking the human decision problem extraction problem.