I am posting this because I'm interested in self-modifying agent decision theory, but I'm too lazy to read up on existing posts.  I want to see a concise justification for why a sophisticated decision theory would be needed to implement an AGI.  So I'll present a 'naive' decision theory, and I want to know why it is unsatisfactory.

The one condition in the naive decision theory is that the decision-maker is the only agent in the universe who is capable of self-modification.  This assumption will probably suffice for producing the first Artificial General Intelligence (since humans aren't actually all that good at self-modification).

Suppose that our AGI has a probability model for predicting the state of the universe at time T (e.g. T = 10 billion years), conditional on what it knows and on one decision it has to make.  That one decision is how it should rewrite its code at time zero.  We suppose it can rewrite its code instantly, and that the code is limited to X bytes.  So the AGI has to maximize expected utility at time T over all programs of at most X bytes.  Supposing it can simulate its utility at the end state of the universe conditional on which program it chooses, why can't it just choose the program with the highest utility? Implicit in our set-up is that the program it chooses may (and very likely will) have the capacity to self-modify again, but we're assuming that our AGI's probability model accounts for when and how it is likely to self-modify.  Difficulties with infinite recursion should be avoidable if our AGI backtracks from the end of time.
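To make the procedure concrete, here is a runnable toy sketch. The candidate 'programs', their outcome distributions, and the utility numbers are all invented stand-ins for the probability model assumed above; only the argmax structure is the point.

```python
# A runnable toy sketch of the "naive" procedure above.  The candidate
# "programs", their predicted end-of-universe distributions, and the utility
# numbers are invented stand-ins for the probability model the post assumes;
# only the overall structure (argmax over admissible rewrites by predicted
# utility at time T) is the point.

CANDIDATE_PROGRAMS = {
    # rewrite name -> predicted distribution over toy end-of-universe states
    "expand":   {"flourishing": 0.6, "ruin": 0.4},
    "research": {"flourishing": 0.5, "stagnation": 0.5},
    "idle":     {"stagnation": 0.9, "ruin": 0.1},
}

UTILITY = {"flourishing": 100.0, "stagnation": 10.0, "ruin": 0.0}

def expected_utility(program_name):
    """Expected utility at time T, conditional on rewriting into this program."""
    return sum(p * UTILITY[state]
               for state, p in CANDIDATE_PROGRAMS[program_name].items())

def choose_rewrite():
    """The naive rule: pick the rewrite with the highest predicted utility."""
    return max(CANDIDATE_PROGRAMS, key=expected_utility)

print(choose_rewrite())   # -> "expand" (expected utilities 60, 55, 9)
```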

Of course our AGI will need a probability model for predicting what a program for its behavior will do without having to simulate or even completely specify the program.  To me, that seems like the hard part.  If this is possible, I don't see why it's necessary to develop a specific theory for dealing with convoluted Newcomb-like problems, since the above seems to take care of those issues automatically.


Your idea is a little similar to the one-action AI that I described some time ago. It's a neat way to get goal stability (if you trust your magical math intuition module), but IMO it doesn't solve all of decision theory. The tricky question is how to build the "probability model", i.e. the assignment of expected utilities to all possible actions. A poorly chosen assignment can end up being mathematically true simply because it influences your actions, like a self-fulfilling prophecy.

For example, imagine that an alien AI created a million years ago predicted that humanity will build an AI based on your decision theory, and precommitted to waging self-destructive war against us unless we give it 99% of our resources. Your AI knows everything about physics, so it will infer the existence of the alien AI at "time zero" and immediately give up the resources. But this decision of your AI was itself predicted by the alien AI like I predicted it here, and that's why the alien AI made its precommitment in the first place. TDT tries to solve this problem by not giving in to extortion, though we don't know how to formalize that.

For a more interesting twist, consider that our universe is likely to contain many instances of us (e.g. if it's spatially infinite, which is currently considered feasible), scattered all over spacetime. Will the different copies of your AI cooperate with each other, or will they do something stupid like wage war? UDT tries to solve this problem and others like it by accounting for logical copies implicitly so they all end up cooperating.

TDT tries to solve this problem by not giving in to extortion, though we don't know how to formalize that.

UDT can solve this problem by noticing that a decision to not give in to extortion makes the extortion improbable. TDT won't be able to notice the scenario where the aliens never appear, and so won't solve this problem for the same reason it doesn't solve Counterfactual Mugging. (Does this mean that TDT doesn't solve Newcomb's problem with transparent boxes? I don't remember hearing that, although I remember Drescher mentioning that CM is analogous to one of his thought experiments.) Eliezer, and not TDT, refers to the intuitive notion of "extortion", and advises to not give in to extortion.
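For concreteness, a toy rendering of that policy-level comparison. The payoff numbers and the alien's response rule are invented for illustration; this is a sketch of the idea, not anyone's formalization of UDT.

```python
# A toy rendering of the point above: compare whole *policies*, with the
# alien's precommitment depending on its prediction of our policy.  The
# payoff numbers and the alien's response rule are invented for illustration.

OUR_POLICIES = ("pay", "refuse")

def alien_precommits_to_extort(predicted_policy):
    """The alien extorts only if it predicts the extortion will be paid;
    extorting a refuser triggers the self-destructive war, which the alien
    also prefers to avoid."""
    return predicted_policy == "pay"

def our_utility(policy):
    if alien_precommits_to_extort(policy):
        return 1.0 if policy == "pay" else -100.0   # hand over 99%, or fight the war
    return 100.0                                    # the extortion never happens

print(max(OUR_POLICIES, key=our_utility))   # -> "refuse"
# Deciding *after* observing the extortion, paying (1.0) beats war (-100.0),
# which is why an agent that only decides at that point gets extorted at all.
```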

Will the different copies of your AI cooperate with each other, or will they do something stupid like wage war?

As Will recently pointed out, "cooperation" is itself an unclearly specified idea (in particular, the agent could well be a self-improving bundle of wires that quickly escapes any recognition unless it wants to signal something). Also, as I pointed out before, in the PD the Pareto frontier for mixed strategies includes profiles where one player cooperates while the other randomizes between cooperating and defecting (and the randomness can come from logical uncertainty). The players will just bargain over which of them should defect, and with what probability.

So non-cooperation is not always stupid, both because "cooperation" is not a clear idea, and because random defection by one of the players remains on the Pareto frontier.
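A quick numerical check of that last claim, using the standard prisoner's dilemma payoffs C/C = (3,3), C/D = (0,5), D/C = (5,0), D/D = (1,1):

```python
# Check: the profile "player 1 always cooperates, player 2 defects with
# probability p" is not Pareto-dominated by any pair of independent mixed
# strategies, under the standard PD payoffs listed above.

import itertools

def payoffs(q1, q2):
    """Expected payoffs when player i defects with probability qi."""
    u1 = 3*(1-q1)*(1-q2) + 0*(1-q1)*q2 + 5*q1*(1-q2) + 1*q1*q2
    u2 = 3*(1-q1)*(1-q2) + 5*(1-q1)*q2 + 0*q1*(1-q2) + 1*q1*q2
    return u1, u2

grid = [i / 100 for i in range(101)]

for p in (0.25, 0.5, 0.75):                  # player 2's defection probability
    u1, u2 = payoffs(0.0, p)                 # player 1 cooperates outright
    dominated = any(
        v1 >= u1 and v2 >= u2 and (v1 > u1 or v2 > u2)
        for q1, q2 in itertools.product(grid, grid)
        for v1, v2 in [payoffs(q1, q2)]
    )
    print(p, (u1, u2), "dominated:", dominated)   # dominated: False every time
```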

That's funny. What you described in the second paragraph is something like a two-player bimatrix game played across time and space, in which the players aren't even sure of their opponent's existence and in which our player's strategy is the choice of which decision theory to use.

Very interesting, and great food for thought. But again, the complication comes from the possible existence of another player. I would argue that it's reasonable to assume we have some 'breathing room' of one to two million years before we have to deal with other players. In that case, why not build a 'naive' FAI which operates under the assumption that there is no other player, let it grow, and then, when it has some free time, let it think up a decision theory for you? (I don't know whether you speak for the SIAI, cousin_it, but I think it would be fair for an outsider to wonder why Yudkowsky thinks this route in particular has the greatest cost/benefit in terms of achieving FAI as fast as possible.)

I'm not affiliated with SIAI in any way. Just like you, I'm an outsider trying to clear up these topics for my own satisfaction :-)

Many people here think that we must get FAI right on the first try, because after it gains power it will resist our attempts to change it. If you code into the AI the assumption that it's the only player, it won't believe in other players even when it sees them, and will keep allocating resources to building beautiful gardens even as alien ships are circling overhead (metaphorically speaking). When you ask it to build some guns, it will see you as promoting a suboptimal strategy according to its understanding of what's likely to work.

It might be preferable to build a less rigid AI that would be open to further amendments from humanity, rather than maximizing its initial utility function no matter what. But we don't know any mathematical formalism that can express that. The first AIs are likely to be expected utility maximizers just because maximization of expected utility is mathematically neat.

+1 great explanation.

The issue of rigidity is a broad and important topic which has been insufficiently addressed on this site. A 'rigid' AI cannot be considered rational, because all rational beings are aware that their reasoning processes are prone to error. I would go further and say that a rigid FAI can be just as dangerous (in the long term) as a paperclip maximizer. However, implementing a 'flexible' AI would indeed be difficult. Such an AI would be a true inductive agent--even its confidence in the solidity of mathematical proof would be based on empirical evidence. Thus it would be difficult to predict how such an AI might function--there is a risk that the AI would 'go insane' as it loses confidence in the validity of the core assumptions underlying its cognitive processes. But this is already taking us far afield of the original subject of discussion.

Yes, this is correct; an AGI that did this would self-modify into a stable and effective utility maximizer of the same utility function it started with. However, this strategy is computationally impossible, and perhaps more importantly, it does not lend itself well to approximation.

Such an agent still needs to use "probability". Things like anthropic reasoning are still not well understood, and the basis of probability appears to be tied to decision theory. Thus, building this AI first requires that you solve decision theory.

What disadvantage does an AGI have if it doesn't employ anthropic reasoning?

Massively overestimating the probability of conflict with extraterrestrial life (and thus wasting resources on unnecessary preparation), or massively underestimating it (and thus getting annihilated).

For example.

I am fine with AGI which is 'merely' powerful enough to expand itself into a Type I civilization. If that's the only justification for developing a decision theory which can handle anthropic reasoning then I have no use for such a theory.

Fine, but you'd probably be less happy if it stored all humans in stasis while preparing resources to defend itself against aliens that never turn up. Besides, there might be other anthropic arguments that we haven't noticed yet, but such that the AI's response would vary wildly depending on how it reasons. Without knowing in advance that its reasoning is sensible, there's no telling whether it will do the right thing.

(I am not an AI designer, I'm just interested in probability for its own sake.)

For the sake of argument, I'll grant that correctly formulated anthropic priors can reduce the bias in posterior estimates of the probability of ET contact/confrontation; but the simple consequence of the math is that the influence of an anthropic prior decreases as the AGI gains more scientific knowledge. An AGI which has a (1-epsilon)-complete understanding of science, yet does not employ anthropic reasoning, will have asymptotically equivalent estimates to an AGI which has a (1-epsilon)-complete understanding of science and employs correct anthropic reasoning.
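To illustrate (not prove) the washing-out claim, a small numerical sketch. It treats the anthropic consideration as nothing more than a prior over hypotheses, which is a simplification; the numbers are arbitrary.

```python
# Two agents with very different priors over the same two hypotheses end up
# with nearly identical posteriors once the accumulated evidence is strong
# enough.  An illustration of the claim above, not a proof of it.

def posterior(prior_h1, likelihood_ratio):
    """P(H1 | evidence), given the prior P(H1) and the aggregate likelihood
    ratio P(evidence | H1) / P(evidence | H2)."""
    odds = (prior_h1 / (1 - prior_h1)) * likelihood_ratio
    return odds / (1 + odds)

anthropic_prior, flat_prior = 0.999, 0.5      # wildly different starting points
for likelihood_ratio in (1.0, 1e-3, 1e-9):    # evidence piling up against H1
    print(likelihood_ratio,
          round(posterior(anthropic_prior, likelihood_ratio), 6),
          round(posterior(flat_prior, likelihood_ratio), 6))
# At likelihood_ratio = 1e-9 both posteriors are ~0: the prior has washed out.
```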


How does a complete understanding of physics allow you to asymptotically approach "correct" solutions to anthropic problems? We can already imagine reformulating these problems in toy universes with completely known physics (like cellular automata), but that doesn't seem to help us solve them...


  1. How do you handle anthropic scenarios (sleeping beauty, presumptuous philosopher, doomsday argument)?

  2. Imagine the AI wants to determine what the universe would look like if it filled it with paperclips. It is superintelligent, so it would have to have a very good reason for doing so. Since, in this hypothetical situation, there is a very good reason to fill the universe with paperclips, it would get lots of utility. Therefore, it is a good reason. It is actually quite difficult to prevent an AI from following this absurd-to-humans chain of reasoning without making the opposite mistake, where it does not understand its own intelligence at all and mines itself for silicon.

  3. Your premises preclude the possibility of meeting another AI, which could be very dangerous.

How do you handle anthropic scenarios (sleeping beauty, presumptuous philosopher, doomsday argument)?

By defining the utility function in terms of the universe, rather than a subjective-experience path. Anthropics are a problem for humans because our utility functions are defined in terms of a "self" which is defined in a way that does not generalize well; for an AI, this would be a problem for us writing the utility function to give it, but not for the AI doing the optimization.
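As a toy illustration of scoring policies by universe-wide utility rather than by indexical credences, applied to Sleeping Beauty. The $1-per-correct-guess payoff is an invented example.

```python
# A toy rendering of "define utility over the universe, not over a
# subjective-experience path", applied to Sleeping Beauty.  Policies are
# scored by total payoff across the whole experiment, so no credence for
# "which awakening am I?" is ever needed.  The payoff scheme is invented.

def total_payoff(guess):
    """Expected universe-wide payoff of always guessing `guess` on waking."""
    heads_branch = 0.5 * (1 if guess == "heads" else 0) * 1   # heads: one awakening
    tails_branch = 0.5 * (1 if guess == "tails" else 0) * 2   # tails: two awakenings
    return heads_branch + tails_branch

print(max(("heads", "tails"), key=total_payoff))   # -> "tails" (1.0 vs. 0.5)
```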

Imagine the AI wants to determine what the universe would look like if it filled it with paperclips. It is superintelligent, so it would have to have a very good reason for doing so. Since, in this hypothetical situation, there is a very good reason to fill the universe with paperclips, it would get lots of utility. Therefore, it is a good reason.

There are two utility functions here, and the AI isn't optimizing the paperclip one.

Your premises preclude the possibility of meeting another AI, which could be very dangerous.

This premise is unnecessary, since that possibility can be folded into the probability model.

Anthropics are a problem for humans because our utility functions are defined in terms of a "self" which is defined in a way that does not generalize well; for an AI, this would be a problem for us writing the utility function to give it, but not for the AI doing the optimization.

So if the AI were building a robot that had to bet in a presumptuous philosopher or doomsday scenario, how would it bet in each? You do already have the right answer to sleeping beauty.

There are two utility functions here, and the AI isn't optimizing the paperclip one.

I used a ridiculously bad example. I was trying to ask what would happen if the AI considered the possibility that tiling the universe with paperclips would actually satisfy its own utility function. That is very implausible, but that just means that if a superintelligence were to decide to do so, it would have a very good reason.

This premise is unnecessary, since that possibility can be folded into the probability model.

No; you would defect in the true prisoner's dilemma.


So if the AI were building a robot [...] how would it bet in each?

That would depend on what the AI hoped to achieve by building the robot. It seems to me that specifying that clearly should determine what approach the AI would want the robot to take in such situations.

(More generally, it seems to me that a lot of anthropic puzzles go away if one eschews indexicals. Whether that's any use depends on how well one can do without indexicals. I can't help suspecting that the Right Answer may be "perfectly well, and things that can only be said with indexicals aren't really coherent", which would be ... interesting. In case it's not obvious, everything in this paragraph is quarter-baked at best and may not actually make any sense. I think I recall that someone else posted something in LW-discussion recently with a similar flavour.)

I think I agree with everything written here, so at least your naive decision theory looks like it can handle anthropic problems.

That would depend on what the AI hoped to achieve by building the robot.

Which side of the presumptuous philosopher's bet would you take?


Just to clarify, I am not the author of the original article; I haven't proposed any particular naive decision theory. (I don't think "treat indexicals as suspect" qualifies as one!)

Unfortunately, my mind is not the product of a clear-thinking AI, and my values do (for good or ill) have indexically-defined stuff in them: I care more about myself than about random other people, for instance. That may or may not be coherent, but it's how I am. And so I don't know what the "right" way to get indexicals out of the rest of my thinking would be. And so I'm not sure how I "should" bet in the presumptuous philosopher case. (I take it the situation you have in mind is that someone offers to bet me at 10:1 odds that we're in the Few People scenario rather than the Many People scenario, or something of the kind.) But, for what it's worth, I think I take the P.P.'s side, while darkly suspecting I'm making a serious mistake :-). I repeat that my thinking on this stuff is all much less than half-baked.

I am not the author of the original article

I realized this, but I think my mind attached some of snarles' statements to you.

I agree with your choice in the presumptuous philosopher problem, but I doubt that anyone could actually be in such an epistemic state, basically because the Few People cannot be sure that there is not another universe with completely different laws of physics simulating many copies of theirs, and because of many other qualitatively similar possible scenarios.

What's "time zero"? Are you saying that it self-modifies to what it should have started at? In that case, it would attempt to write messages to its past self.

If you mean just rewriting its code now, it seems equivalent to an EDT agent capable of self-modification.

More concretely, we give the AGI, say, a year in which it's not allowed to take any action other than contemplating how it should rewrite its code. When it chooses the program it's going to use, it turns off for ten minutes while the program is loaded.

"Hard" perhaps does not cut it. If this thing self-modifies more than once (as is expected), you run into an at least exponential explosion in resource use, growing to planet-eating sizes even before anything useful gets done. If you don't take into account further self-modifications, how can you claim that you chose the best? And then there are the problems of simulating the next 10 billion years with any accuracy...
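Back-of-envelope numbers make the worry concrete; the code-size limit X = 1000 bytes and k = 3 anticipated rewrites below are arbitrary illustrative choices, not figures from the post.

```python
# Illustrative only: the code-size limit X and the number of anticipated
# further rewrites k are arbitrary choices, not figures from the post.
X = 1000                        # hypothetical code-size limit in bytes
k = 3                           # anticipated further self-modifications
per_step = 2 ** (8 * X)         # bit patterns of X bytes = candidate rewrites at one step
print(len(str(per_step)))       # one decision alone: a ~2400-digit count of candidates
print(len(str(per_step ** k)))  # naive backtracking over k nested rewrites: ~7200 digits
```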

But on thinking about it I'm not sure why the decision theory is particularly difficult either. Maybe if you wanted to not just use higher-level properties of the modification and instead be able to modify "on the fly," e.g. if Omega says he'll give you 10 dollars if you choose box A by the process of choosing alphabetically. There might also be some interesting problems in bargaining with an agent to change its terminal values.

Is the purpose of CDT/UDT/TDT to arrive at a computationally efficient decision procedure? I've never seen this stated explicitly.

Is the purpose of CDT/UDT/TDT to arrive at a computationally efficient decision procedure?

No.

Nope. :D

I think you've just described something a lot like AIXI...