Towards a New Decision Theory

It is commonly acknowledged here that current decision theories have deficiencies that show up in the form of various paradoxes. Since there seems to be little hope that Eliezer will publish his Timeless Decision Theory any time soon, I decided to try to synthesize some of the ideas discussed in this forum, along with a few of my own, into a coherent alternative that is hopefully not so paradox-prone.

I'll start with a way of framing the question. Put yourself in the place of an AI, or more specifically, the decision algorithm of an AI. You have access to your own source code S, plus a bit string X representing all of your memories and sensory data. You have to choose an output string Y. That’s the decision. The question is, how? (The answer isn't “Run S,” because what we want to know is what S should be in the first place.)

Let’s proceed by asking the question, “What are the consequences of S, on input X, returning Y as the output, instead of Z?” To begin with, we'll consider just the consequences of that choice in the realm of abstract computations (i.e. computations considered as mathematical objects rather than as implemented in physical systems). The most immediate consequence is that any program that calls S as a subroutine with X as input will receive Y as output, instead of Z. What happens next is a bit harder to tell, but supposing that you know something about a program P that calls S as a subroutine, you can further deduce the effects of choosing Y versus Z by tracing the difference between the two choices in P’s subsequent execution. We could call these the computational consequences of Y. If you have preferences about the execution of a set of programs, some of which call S as a subroutine, then you can satisfy your preferences directly by choosing the output of S so that those programs will run the way you most prefer.

A more general class of consequences might be called logical consequences. Consider a program P’ that doesn’t call S, but a different subroutine S’ that’s logically equivalent to S. In other words, S’ always produces the same output as S when given the same input. Due to the logical relationship between S and S’, your choice of output for S must also affect the subsequent execution of P’. Another example of a logical relationship is an S' which always returns the first bit of the output of S when given the same input, or one that returns the same output as S on some subset of inputs.
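The two kinds of consequences can be illustrated with a toy sketch. All the names below are illustrative stand-ins: P calls S directly (a computational consequence), while P' calls a distinct program S' that never calls S but is logically equivalent to it, so S's choice of output constrains P''s execution as well (a logical consequence).

```python
def S(X):
    return "Y"        # stand-in decision algorithm: suppose it outputs Y on every input

def S_prime(X):
    return "Y"        # a distinct program, logically equivalent to S

def P(X):
    # calls S as a subroutine: tracing S's output through P gives
    # the computational consequences of S's choice
    return "reward" if S(X) == "Y" else "punish"

def P_prime(X):
    # never calls S, but because S_prime always agrees with S,
    # S's choice also fixes P_prime's execution
    return "reward" if S_prime(X) == "Y" else "punish"
```

Since S and S_prime always agree, P and P_prime necessarily produce the same result on every input, even though only P ever invokes S.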

In general, you can’t be certain about the consequences of a choice, because you’re not logically omniscient. How to handle logical/mathematical uncertainty is an open problem, so for now we'll just assume that you have access to a "mathematical intuition subroutine" that somehow allows you to form beliefs about the likely consequences of your choices.

At this point, you might ask, “That’s well and good, but what if my preferences extend beyond abstract computations? What about consequences on the physical universe?” The answer is, we can view the physical universe as a program that runs S as a subroutine, or more generally, view it as a mathematical object which has S embedded within it. (From now on I’ll just refer to programs for simplicity, with the understanding that the subsequent discussion can be generalized to non-computable universes.) Your preferences about the physical universe can be translated into preferences about such a program P and programmed into the AI. The AI, upon receiving an input X, will look into P, determine all the instances where it calls S with input X, and choose the output that optimizes its preferences about the execution of P. If the preferences were translated faithfully, then the AI's decision should also optimize your preferences regarding the physical universe. This faithful translation is a second major open problem.

What if you have some uncertainty about which program our universe corresponds to? In that case, we have to specify preferences for the entire set of programs that our universe may correspond to. If your preferences for what happens in one such program are independent of what happens in another, then we can represent them by a probability distribution on the set of programs plus a utility function on the execution of each individual program. More generally, we can always represent your preferences as a utility function on vectors of the form <E1, E2, E3, …> where E1 is an execution history of P1, E2 is an execution history of P2, and so on.

These considerations lead to the following design for the decision algorithm S. S is coded with a vector <P1, P2, P3, ...> of programs that it cares about, and a utility function on vectors of the form <E1, E2, E3, …> that defines its preferences on how those programs should run. When it receives an input X, it looks inside the programs P1, P2, P3, ..., and uses its "mathematical intuition" to form a probability distribution P_Y over the set of vectors <E1, E2, E3, …> for each choice of output string Y. Finally, it outputs a string Y* that maximizes the expected utility Sum P_Y(<E1, E2, E3, …>) U(<E1, E2, E3, …>). (This specifically assumes that expected utility maximization is the right way to deal with mathematical uncertainty. Consider it a temporary placeholder until that problem is solved. Also, I'm describing the algorithm as a brute force search for simplicity. In reality, you'd probably want it to do something cleverer to find the optimal Y* more quickly.)
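The brute-force version of this algorithm can be sketched in a few lines. The sketch assumes a hypothetical math_intuition subroutine standing in for the "mathematical intuition module": given X and a candidate output Y, it returns a probability distribution over execution-history vectors <E1, E2, E3, …> (represented here as tuples).

```python
def S(X, candidate_outputs, math_intuition, U):
    """Brute-force sketch of the decision algorithm.

    math_intuition(X, Y) -> dict mapping execution-history vectors
        (tuples standing in for <E1, E2, E3, ...>) to probabilities.
    U(history_vector) -> utility as a float.
    Returns the output Y* that maximizes expected utility.
    """
    best_Y, best_EU = None, float('-inf')
    for Y in candidate_outputs:            # brute-force search over outputs
        P_Y = math_intuition(X, Y)         # distribution over <E1, E2, ...>
        EU = sum(p * U(hist) for hist, p in P_Y.items())
        if EU > best_EU:
            best_Y, best_EU = Y, EU
    return best_Y
```

In a real design the search over Y and the representation of execution histories would both need to be much cleverer, but the sketch shows where the two open problems live: math_intuition (logical uncertainty) and U (translated preferences).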

Example 1: Counterfactual Mugging

Note that Bayesian updating is not done explicitly in this decision theory. When the decision algorithm receives input X, it may determine that some of the programs it has preferences about never call it with X and are also logically independent of its output, and therefore it can safely ignore them when computing the consequences of a choice. There is no need to set the probabilities of those programs to 0 and renormalize.

So, with that in mind, we can model Counterfactual Mugging by the following Python program:

def P(coin):
    AI_balance = 100
    if coin == "heads":
        if S("heads") == "give $100":
            AI_balance -= 100
    if coin == "tails":
        if Omega_Predict(S, "heads") == "give $100":
            AI_balance += 10000
    return AI_balance

The AI’s goal is to maximize expected utility = .5 * U(AI_balance after P("heads")) + .5 * U(AI_balance after P("tails")). Assuming U(AI_balance)=AI_balance, it’s easy to determine U(AI_balance after P("heads")) as a function of S’s output. It equals 0 if S(“heads”) == “give $100”, and 100 otherwise. To compute U(AI_balance after P("tails")), the AI needs to look inside the Omega_Predict function (not shown here), and try to figure out how accurate it is. Assuming the mathematical intuition module says that choosing “give $100” as the output for S(“heads”) makes it more likely (by a sufficiently large margin) for Omega_Predict(S, "heads") to output “give $100”, then that choice maximizes expected utility.
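The expected-utility comparison can be worked through explicitly. The sketch below is hypothetical: it assumes the mathematical intuition module reports a single number acc, the probability that Omega_Predict's prediction matches S's actual choice, and that U(AI_balance) = AI_balance as above.

```python
def expected_utility(choice, acc=1.0):
    """Expected utility of S("heads") returning `choice`, assuming
    Omega_Predict matches S's actual choice with probability `acc`."""
    # Heads branch: the AI loses $100 only if it chose "give $100".
    u_heads = 0 if choice == "give $100" else 100
    # Tails branch: the AI gains $10000 iff Omega predicted "give $100".
    p_predict_give = acc if choice == "give $100" else 1 - acc
    u_tails = 100 + 10000 * p_predict_give
    return 0.5 * u_heads + 0.5 * u_tails
```

With a perfect predictor (acc = 1), paying yields 0.5 * 0 + 0.5 * 10100 = 5050, while refusing yields 0.5 * 100 + 0.5 * 100 = 100, so "give $100" wins by a wide margin; the conclusion holds for any acc sufficiently above 1/2.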

Example 2: Return of Bayes

This example is based on case 1 in Eliezer's post Priors as Mathematical Objects. An urn contains 5 red balls and 5 white balls. The AI is asked to predict the probability of each ball being red as it is drawn from the urn, its goal being to maximize the expected logarithmic score of its predictions. The main point of this example is that this decision theory can reproduce the effect of Bayesian reasoning when the situation calls for it. We can model the scenario using preferences on the following Python program:

import math

def P(n):
    urn = ['red', 'red', 'red', 'red', 'red', 'white', 'white', 'white', 'white', 'white']
    history = []
    score = 0
    while urn:
        i = n % len(urn)
        n = n // len(urn)  # integer division, so i remains a valid index
        ball = urn[i]
        del urn[i]
        prediction = S(history)
        if ball == 'red':
            score += math.log(prediction, 2)
        else:
            score += math.log(1 - prediction, 2)
        print(score, ball, prediction)
        history.append(ball)

Here is a printout from a sample run, using n=1222222:

-1.0 red 0.5
-2.16992500144 red 0.444444444444
-2.84799690655 white 0.375
-3.65535182861 white 0.428571428571
-4.65535182861 red 0.5
-5.9772799235 red 0.4
-7.9772799235 red 0.25
-7.9772799235 white 0.0
-7.9772799235 white 0.0
-7.9772799235 white 0.0

S should use deductive reasoning to conclude that returning (number of red balls remaining / total balls remaining) maximizes the average score across the range of possible inputs to P, from n=1 to 10! (representing the possible orders in which the balls are drawn), and do that. Alternatively, S can approximate the correct predictions using brute force: generate a random function from histories to predictions, and compute what the average score would be if it were to implement that function. Repeat this a large number of times and it is likely to find a function that returns values close to the optimum predictions.
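Here is a self-contained version of P with the deductive strategy plugged in as S (the printing is dropped and the final score returned instead, so the result can be checked directly). For this S, the score is the same for every draw order: the predicted probabilities multiply out to 5!·5!/10! = 1/252, so the score is always log2(1/252) ≈ -7.977, matching the sample run above.

```python
import math

def S(history):
    """Deductive strategy: predict (red balls remaining / total balls remaining)."""
    red_left = 5 - history.count('red')
    total_left = 10 - len(history)
    return red_left / total_left

def P(n):
    urn = ['red'] * 5 + ['white'] * 5
    history, score = [], 0.0
    while urn:
        i = n % len(urn)
        n = n // len(urn)
        ball = urn.pop(i)
        prediction = S(history)
        if ball == 'red':
            score += math.log(prediction, 2)
        else:
            score += math.log(1 - prediction, 2)
        history.append(ball)
    return score
```

Note that the log-score never blows up with this S: the prediction is 0 or 1 only when the remaining balls are all one color, in which case the drawn ball is certain and the term contributed is log2(1) = 0.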

Example 3: Level IV Multiverse

In Tegmark's Level 4 Multiverse, all structures that exist mathematically also exist physically. In this case, we'd need to program the AI with preferences over all mathematical structures, perhaps represented by an ordering or utility function over conjunctions of well-formed sentences in a formal set theory. The AI will then proceed to "optimize" all of mathematics, or at least the parts of math that (A) are logically dependent on its decisions and (B) it can reason or form intuitions about.

I suggest that the Level 4 Multiverse should be considered the default setting for a general decision theory, since we cannot rule out the possibility that all mathematical structures do indeed exist physically, or that we have direct preferences on mathematical structures (in which case there is no need for them to exist "physically"). Clearly, application of decision theory to the Level 4 Multiverse requires that the previously mentioned open problems be solved in their most general forms: how to handle logical uncertainty in any mathematical domain, and how to map fuzzy human preferences to well-defined preferences over the structures of mathematical objects.

142 comments

There are lots of mentions of Timeless Decision Theory (TDT) in this thread - as though it refers to something real. However, AFAICS, the reference is to unpublished material by Eliezer Yudkowsky.

I am not clear about how anyone is supposed to make sense of all these references before that material has been published. To those who use "TDT" as though they know what they are talking about - and who are not Eliezer Yudkowsky - what exactly is it that you think you are talking about?

1) Congratulations: moving to logical uncertainty and considering your decision's consequences to be the consequence of that logical program outputting a particular decision, is what I would call the key insight in moving to (my version of) timeless decision theory. The rest of it (that is, the work I've done already) is showing that this answer is the only reflectively consistent one for a certain class of decision problems, and working through some of the mathematical inelegancies in mainstream decision theory that TDT seems to successfully clear up and render elegant (the original Newcomb's Problem being only one of them).

Steve Rayhawk also figured out that it had to do with impossible possible worlds.

Neither of you have arrived at (published?) some important remaining observations about how to integrate uncertainty about computations into decision-theoretic reasoning; so if you want to completely preempt my would-be PhD thesis you've still got a bit more work to do.

Why didn't you mention earlier that your timeless decision theory mainly had to do with logical uncertainty? It would have saved people a lot of time trying to guess what you were talking about.

Looking at my 2001 post, it seems that I already had the essential idea at that time, but didn't pursue it very far. I think it was because (A) I wasn't as interested in AI back then, and (B) I thought an AI ought to be able to come up with these ideas by itself.

I still think (B) is true, BTW. We should devote some time and resources to thinking about how we are solving these problems (and coming up with questions in the first place). Finding that algorithm is perhaps more important than finding a reflectively consistent decision algorithm, if we don't want an AI to be stuck with whatever mistakes we might make.

Why didn't you mention earlier that your timeless decision theory mainly had to do with logical uncertainty?

Because I was thinking in terms of saving it for a PhD thesis or some other publication, and if you get that insight the rest follows pretty fast - did for me at least. Also I was using it as a test for would-be AI researchers: "Here's Newcomblike problems, here's why the classical solution doesn't work for self-modifying AI, can you solve this FAI problem which I know to be solvable?"

I still think (B) is true, BTW. We should devote some time and resources to thinking about how we are solving these problems (and coming up with questions in the first place). Finding that algorithm is perhaps more important than finding a reflectively consistent decision algorithm, if we don't want an AI to be stuck with whatever mistakes we might make.

And yet you found a reflectively consistent decision algorithm long before you found a decision-system-algorithm-finding algorithm. That's not coincidence. The latter problem is much harder. I suspect that even an informal understanding of parts of it would mean that you could find timeless decision theory as easily as falling backward off a tree - you just run the algorithm in your own head. So with very high probability you are going to start seeing through the object-level problems before you see through the meta ones. Conversely I am EXTREMELY skeptical of people who claim they have an algorithm to solve meta problems but who still seem confused about object problems. Take metaethics, a solved problem: what are the odds that someone who still thought metaethics was a Deep Mystery could write an AI algorithm that could come up with a correct metaethics? I tried that, you know, and in retrospect it didn't work.

The meta algorithms are important but by their very nature, knowing even a little about the meta-problem tends to make the object problem much less confusing, and you will progress on the object problem faster than on the meta problem. Again, that's not saying the meta problem is unimportant. It's just saying that it's really hard to end up in a state where meta has really truly run ahead of object, though it's easy to get illusions of having done so.

It's interesting that we came upon the same idea from different directions. For me it fell out of Tegmark's multiverse. What could consequences be, except logical consequences, if all mathematical structures exist? The fact that you said it would take a long series of posts to explain your idea threw me off, and I was kind of surprised when you said congratulations. I thought I might be offering a different solution. (I spent days polishing the article in the expectation that I might have to defend it fiercely.)

And yet you found a reflectively consistent decision algorithm long before you found a decision-system-algorithm-finding algorithm. That's not coincidence. The latter problem is much harder.

Umm, I haven't actually found a reflectively consistent decision algorithm yet, since the proposal has huge gaps that need to be filled. I have little idea how to handle logical uncertainty in a systematic way, or whether expected utility maximization makes sense in that context.

The rest of your paragraph makes good points. But I'm not sure what you mean by "metaethics, a solved problem". Can you give a link?

One way to approach the meta problem may be to consider the meta-meta problem: why did evolution create us with so much "common sense" on these types of problems? Why do we have the meta algorithm apparently "built in" when it doesn't seem like it would have offered much advantage in the ancestral environment?

But I'm not sure what you mean by "metaethics, a solved problem". Can you give a link?

http://wiki.lesswrong.com/wiki/Metaethics_sequence

(Observe that this page was created after you asked the question. And I'm quite aware that it needs a better summary - maybe "A Natural Explanation of Metaethics" or the like.)

The fact that you said it would take a long series of posts to explain your idea threw me off, and I was kind of surprised when you said congratulations

"Decide as though your decision is about the output of a Platonic computation" is the key insight that started me off - not the only idea - and considering how long philosophers have wrangled this, there's the whole edifice of justification that would be needed for a serious exposition. Maybe come Aug 26th or thereabouts I'll post a very quick summary of e.g. integration with Pearl's causality.

The usual reason for building things in is that it reduces trial-and-error learning. That's good if the errors are expensive and have a negative impact on fitness.

Is there something wrong with that explanation in this context?

Now that I have some idea what Eliezer and Nesov were talking about, I'm still a bit confused about AI cooperation. Consider the following scenario: Omega appears and asks two human players (who are at least as skilled as Eliezer and Nesov) to each design an AI. The AIs will each undergo some single-player challenges like Newcomb's Problem and Counterfactual Mugging, but there will be a one-shot PD between the two AIs at the end, with their source codes hidden from each other. Omega will grant each human player utility equal to the total score of his or her AI. Will the two AIs play cooperate with each other?

I don't think it's irrational for human players to play defect in one-shot PD. So let's assume these two human players would play defect in one-shot PD. Then they should also program their AIs to play defect, even if they have to add an exception to their timeless/updateless decision algorithms. But exceptions are bad, so what's the right solution here?

I'm still quite confused, but I'll report my current thoughts in case someone can help me out. Suppose we take it as an axiom that an AI's decision algorithm shouldn't need to contain any hacks to handle exceptional situations. Then the following "exceptionless" decision algorithm seems to pop out immediately: do what my creator would want me to do. In other words, upon receiving input X, S computes the following: suppose S's creator had enough time and computing power to create a giant lookup table that contains an optimal output for every input S might encounter, what would the entry for X be? Return that as the output.

This algorithm correctly solves Counterfactual Mugging, since S's creator would want it to output "give $100", since "give $100" would have maximized the creator's expected utility at the time of coding S. It also solves the problem posed by Omega in the parent comment. It seems to be reflectively consistent. But what is the relationship between this "exceptionless" algorithm and the timeless/updateless decision algorithm?
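A minimal sketch of this "exceptionless" algorithm: S simply defers to the lookup table its creator R would have produced. The table here is a hypothetical stand-in for R deliberating with unlimited time and computing power; an actual S would only approximate it.

```python
def make_S(creator_table):
    """Build an 'exceptionless' S from the creator's (hypothetical) giant lookup table."""
    def S(X):
        # "What would R's giant lookup table say for input X?"
        return creator_table[X]
    return S

# For Counterfactual Mugging, R's expected utility at coding time is
# maximized by precommitting to pay, so the (hypothetical) table entry is:
S = make_S({"tails": "give $100"})
```

The sketch makes the open question above vivid: everything interesting is hidden inside creator_table, i.e. inside the definition of what R "would want".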

There are two parts to AGI: consequentialist reasoning and preference.

Humans have feeble consequentialist abilities, but can use computers to implement huge calculations, if the problem statement can be entered into the computer. For example, you can program the material and mechanical laws into an engineering application, enter a building plan, and have the computer predict what's going to happen to it, or what parameters should be used in the construction so that the outcome is as required. That's power outside the human mind, directed by the correct laws, and targeted at the formally specified problem.

When you consider AGI in isolation, it's like an engineering application with a random building plan: it can powerfully produce a solution, but it's not a solution to the problem you need solving. Nonetheless, this part is essential when you do have an ability to specify the problem. And that's the AI's algorithm, one aspect of which is decision-making. It's separate from the problem statement that comes from human nature.

For an engineering program, you can say that the computer is basically doing what a person would do if they had crazy amount of time and machine patience. But that's because a person can know both problem statement and laws of inference formally, which is the way it was programmed in the computer in the first place.

With human preference, the problem statement isn't known explicitly to people. People can use preference, but can't state this whole object explicitly. A moral machine would need to work with preference, but human programmers can't enter it, and neither can they do what a machine would be able to do given a formal problem statement, because humans can't know this problem statement, it's too big. It could exist in a computer explicitly, but it can't be entered there by programmers.

So, here is a dilemma: problem statement (preference) resides in the structure of human mind, but the strong power of inference doesn't, while the strong power of inference (potentially) exists in computers outside human minds, where the problem statement can't be manually transmitted. Creating FAI requires these components to meet in the same system, but it can't be done in a way other kinds of programming are done.

Something to think about.

This is the clearest statement of the problem of FAI that I have read to date.

This algorithm . . . seems to be reflectively consistent. But what is the relationship between this "exceptionless" algorithm and the timeless/updateless decision algorithm?

Suppose that, before S's creator R started coding, Omega started an open game of counterfactual mugging with R, and that R doesn't know this, but S does. According to S's inputs, Omega's coin came up tails, so Omega is waiting for $100.

Does S output "give $0"? If Omega had started the game of counterfactual mugging after S was coded, then S would output "give $100".

Suppose that S also knows that R would have coded S with the same source code, even if Omega's coin had come up heads. Would S's output change? Should S's output change (should R have coded S so that this would change S's output)? How should S decide, from its inputs, which R is the creator with the expected utility S's outputs should be optimal for? Is it the R in the world where Omega's coin came up heads, or the R in the world where Omega's coin came up tails?

If there is not an inconsistency in S's decision algorithm or S's definition of R, is there an inconsistency in R's decision algorithm or R's own self-definition?

I'm having trouble understanding this. You're saying that Omega flipped the coin before R started coding, but R doesn't know that, or the result of the coin flip, right? Then his P(a counterfactual mugging is ongoing) is very low, and P(heads | a counterfactual mugging is ongoing) = P(tails | a counterfactual mugging is ongoing) = 1/2. Right?

In that case, his expected utility at the time of coding is maximized by S outputting "give $100" upon encountering Omega. It seems entirely straightforward, and I don't see what the problem is...

. . . do what my creator would want me to do. In other words, upon receiving input X, S computes the following: suppose S's creator had enough time and computing power to create a giant lookup table that contains an optimal output for every input S might encounter, what would the entry for X be? Return that as the output.

I don't know how to define what R "would want" or would think was "optimal".

What lookup table would R create? If R is a causal decision theorist, R might think: "If I were being counterfactually mugged and Omega's coin had come up heads, Omega would have already made its prediction about whether S would output 'give $100' on the input 'tails'. So, if I program S with the rule 'give $100 if tails', that won't cause Omega to give me $10000. And if the coin came up tails, that rule would lose me $100. So I will program S with the rule 'give $0 if tails'."

R's expected utility at the time of coding may be maximized by the rule "give $100 if tails", but R makes decisions by the conditional expected utilities given each of Omega's possible past predictions, weighted by R's prior beliefs about those predictions. R's conditional expected utilities are both maximized by the decision to program S to output "give $0".

[I deleted my earlier reply, because I was still confused about your questions.]

If, according to R's decision theory, the most preferred choice involves programming S to output "give $0", then that is what S would do.

It might be easier to think of the ideal S as consisting of a giant lookup table created by R itself given infinite time and computing power. An actual S would try to approximate this ideal to the best of its abilities.

How should S decide, from its inputs, which R is the creator with the expected utility S's outputs should be optimal for? Is it the R in the world where Omega's coin came up heads, or the R in the world where Omega's coin came up tails?

R would encode its own decision theory, prior, utility function, and memory at the time of coding into S, and have S optimize for that R.

Sorry. I wasn't trying to ask my questions as questions about how R would make decisions. I was asking questions to try to answer your question about the relationship between exceptionless and timeless decision-making, by pointing out dimensions of a map of ways for R to make decisions. For some of those ways, S would be "timeful" around R's beliefs or time of coding, and for some of those ways S would be less timeful.

I have an intuition that there is a version of reflective consistency which requires R to code S so that, if R was created by another agent Q, S would make decisions using Q's beliefs even if Q's beliefs were different from R's beliefs (or at least the beliefs that a Bayesian updater would have had in R's position), and even when S or R had uncertainty about which agent Q was. But I don't know how to formulate that intuition to something that could be proven true or false. (But ultimately, S has to be a creator of its own successor states, and S should use the same theory to describe its relation to its past selves as to describe its relation to R or Q. S's decisions should be invariant to the labeling or unlabeling of its past selves as "creators". These sequential creations are all part of the same computational process.)

"Do what my creator would want me to do"?

We could call that "pass the buck" decision theory ;-)

But what is the relationship between this "exceptionless" algorithm and the timeless/updateless decision algorithm?

Here's my conjecture: An AI using the Exceptionless Decision Theory (XDT) is equivalent to one using TDT if its creator was running TDT at the time of coding. If the creator was running CDT, then it is not equivalent to TDT, but it is reflectively consistent, one-boxes in Newcomb, and plays defect in one-shot PD.

And in case it wasn't clear, in XDT, the AI computes the giant lookup table its creator would have chosen using the creator's own decision theory.

AI's creator was running BRAINS, not a decision theory. I don't see how "what the AI's creator was running" can be a meaningful consideration in a discussion of what constitutes a good AI design. Beware naturalistic fallacy.

One AI can create another AI, right? Does my conjecture make sense if the creator is an AI running some decision theory? If so, we can extend XDT to work with human creators, by having some procedure to approximate the human using a selection of possible DTs, priors, and utility functions. Remember that the goal in XDT is to minimize the probability that the creator would want to add an exception on top of the basic decision algorithm of the AI. If the approximation is close enough, then this probability is minimal.

ETA: I do not claim this is good AI design, merely trying to explore the implications of different ideas.

The problem of finding the right decision theory is a problem of Friendliness, but for a different reason than finding a powerful inference algorithm fit for an AGI is a problem of Friendliness.

"Incompleteness" of decision theory, such as what we can see in CDT, seems to correspond to inability of AI to embody certain aspects of preference, in other words the algorithm lacks expressive power for its preference parameter. Each time an agent makes a mistake, you can reinterpret it as meaning that it just prefers it this way in this particular case. Whatever preference you "feed" to the AI with a wrong decision theory, the AI is going to distort by misinterpreting, losing some of its aspects. Furthermore, the lack of reflective consistency effectively means that the AI continues to distort its preference as it goes along. At the same time, it can still be powerful in consequentialist reasoning, being as formidable as a complete AGI, implementing the distorted version of preference that it can embody.

The resulting process can be interpreted as an AI running "ultimate" decision theory, but with a preference not in perfect fit with what it should've been. If at any stage you have a singleton that owns the game but has a distorted preference, whether due to incorrect procedure for getting the preference instantiated, or incorrect interpretation of preference, such as a mistaken decision theory as we see here, there is no returning to better preference.

More generally, what "could" be done, what AI "could" become, is a concept related to free will, which is a consideration of what happens to a system in isolation, not a system one with reality: you consider a system from the outside, and see what happens to it if you perform this or that operation on it, this is what it means that you could do one operation or the other, or that the events could unfold this way or the other. When you have a singleton, on the other hand, there is no external point of view on it, and so there is no possibility for change. The singleton is the new law of physics, a strategy proven true [*].

So, if you say that the AI's predecessor was running a limited decision theory, this is a damning statement about what sort of preference the next incarnation of AI can inherit. The only significant improvement (for the fate of preference) an AGI with any decision theory can make is to become reflectively consistent, to stop losing the ground. The resulting algorithm is as good as the ultimate decision theory, but with preference lacking some aspects, and thus behavior indistinguishable (equivalent) from what some other kinds of decision theories would produce.

__
[*] There is a fascinating interpretation of truth of logical formulas as the property of corresponding strategies in a certain game to be the winning ones. See for example
S. Abramsky (2007). 'A Compositional Game Semantics for Multi-Agent Logics of Imperfect Information'. In J. van Benthem, D. Gabbay, & B. Lowe (eds.), Interactive Logic, vol. 1 of Texts in Logic and Games, pp. 11-48. Amsterdam University Press.

An AI running causal decision theory will lose on Newcomblike problems, be defected against in the Prisoner's Dilemma, and otherwise undergo behavior that is far more easily interpreted as "losing" than "having different preferences over final outcomes".

The AI that starts with CDT will immediately rewrite itself with AI running the ultimate decision theory, but that resulting AI will have distorted preferences, which is somewhat equivalent to the decision theory it runs having special cases for the time AI got rid of CDT (since code vs. data (algorithm vs. preference) is strictly speaking an arbitrary distinction). The resulting AI won't lose on these thought experiments, provided they don't intersect the peculiar distortion of its preferences, where it indeed would prefer to "lose" according to preference-as-it-should-have-been, but win according to its distorted preference.

A TDT AI consistently acts so as to end up with a million dollars. A CDT AI acts to win a million dollars in some cases, but in other cases ends up with only a thousand. So in one case we have a compressed preference over outcomes; in the other case we have a "preference" over the exact details of the path, including the decision algorithm itself. In a case like this I don't use the word "preference" so as to say that the CDT AI wants a thousand dollars on Newcomb's Problem; I just say the CDT AI is losing. I am unable to see any advantage to using the language otherwise - to say that the CDT AI wins with peculiar preference is to make "preference" and "win" so loose that we could use them to refer to the ripples in a water pond.

It's the TDT AI resulting from the CDT AI's rewriting of itself that plays these strange moves on the thought experiments, not the CDT AI. The algorithm of idealized TDT is parameterized by "preference" and always gives the right answer according to that "preference". To end its reflective inconsistency, the CDT AI is going to rewrite itself with something else. That something else can be characterized in general as a TDT AI with crazy preferences, one that prefers $1000 in Newcomb's thought experiments set before midnight October 15, 2060, or something of the sort, but works OK after that. The preference of the TDT AI to which a given AGI is going to converge can be used as a denotation of that AGI's preference, to generalize the notion of TDT preference to systems that are not TDT AIs, and further to systems that are not even AIs, in particular to humans or humanity.

These are paperclips of preference, something that seems clearly not right as a reflection of human preference, but that is nonetheless a point in the design space that can be filled in particular by failures to start with the right decision theory.

I suggest that regarding crazy decision theories with compact preferences as sane decision theories with noncompact preferences is a step backward which will only confuse you and your readers. What is accomplished by doing so?

How to regard humans, then? They certainly don't run a compact decision algorithm, and their actions are not particularly telling of their preferences. And still, they have to be regarded as having a TDT preference, in order to extract that preference and place it in a TDT AI. As I envision it, a theory that defines what TDT preference humans have must also be capable of telling what the TDT preference of crazy AIs, or a petunia, or the Sun is.

(Btw, I'm now not sure that a CDT-generated AI will give crazy answers to questions about the past; it may just become indifferent to the past altogether, as that part of preference has already been erased from its mind. CDT gave crazy answers, but by the time it constructed the TDT, it had already lost the part of preference that corresponds to giving those crazy answers, and so the TDT won't give them.)

If you regard humans as sane EU maximizers with crazy preferences then you end up extracting crazy preferences! This is exactly the wrong thing to do.

I can't make out what you're saying about the CDT-generated AI, because I don't understand this talk about "that part of preference is already erased from its mind". You might be better off visualizing Dai's GLT, of which a "half timeless decision theory" is just the compact generator.

If you regard humans as sane EU maximizers with crazy preferences then you end up extracting crazy preferences! This is exactly the wrong thing to do.

No, that's not what I mean. Humans are no more TDT agents with crazy preferences than CDT agents are TDT agents with crazy preferences: notice that I defined CDT's preference to be the preference of the TDT that the CDT rewrites itself into. TDT preference is not part of the CDT AI's algorithm, but it follows from it, just as the factorial of 72734 follows from the code of the factorial function. Thus (if I try to connect concepts that don't really fit) humanity's preference is analogous to the preference of the TDT AI that humanity could write if the process of writing this AI were ideal according to the resulting AI's preference (but without this process wireheading on itself; more like a fixpoint, and not really happening in time). Which is not to say that it's the AI that humanity is most likely to write, as you can see from the example of trying to define petunia's preferences. Well, if I could formalize this step, I'd have written it up already. It seems to me like a direction towards a better formalization than "if humans thought faster, were smarter, knew more, etc."

I think an AI running CDT would immediately replace itself by an AI running XDT (or something equivalent to it). If there is no way to distinguish between an AI running XDT and an AI running TDT (prior to a one-shot PD), the XDT AI can't do worse than a TDT AI. So CDT is not losing, as far as I can tell (at least for an AI capable of self-modification).

ETA: I mean an XDT AI can't do worse than a TDT AI within the same world. But a world full of XDT AIs will do worse than a world full of TDT AIs.

The parent comment may be of some general interest, but it doesn't seem particularly helpful in this specific case. Let me back off and rephrase the question so that perhaps it makes more sense:

Can our two players, Alice and Bob, design their AIs based on TDT, such that it falls out naturally (i.e. without requiring special exceptions) that their AIs will play defect against each other, while one-boxing on Newcomb's Problem?

If so, how? In order for one AI using TDT to defect, it has to either believe (A) that the other AI is not using TDT, or (B) that it is using TDT but their decisions are logically independent anyway. Since we're assuming in this case that both AIs do use TDT, (A) requires that the players program their AIs with a falsehood, which is no good. (B) might be possible, but I don't see how.

If the answer is no, then it seems that TDT isn't the final answer, and we have to keep looking for another one. Is there another way out of this quandary?

I don't understand why you want the AIs to defect against each other rather than cooperating with each other.

Are you attached to this particular failure of causal decision theory for some reason? What's wrong with TDT agents cooperating in the Prisoner's Dilemma and everyone living happily ever after?

I don't understand why you want the AIs to defect against each other rather than cooperating with each other.

Come on, of course I don't want that. I'm saying that is the inevitable outcome under the rules of the game I specified. It's just like if I said "I don't want two human players to defect in one-shot PD, but that is what's going to happen."

ETA: Also, it may help if you think of the outcome as the human players defecting against each other, with the AIs just carrying out their strategies. The human players are the real players in this game.

Are you attached to this particular failure of causal decision theory for some reason?

No, I can't think of a reason why I would be.

What's wrong with TDT agents cooperating in the Prisoner's Dilemma and everyone living happily ever after?

There's nothing wrong with that, and it may yet happen, if it turns out that the technology for proving source code can be created. But if you can't prove that your source code is some specific string, if the only thing you have to go on is that you and the other AI must both use the same decision theory due to convergence, that isn't enough.

Sorry if I'm repeating myself, but I'm hoping one of my explanations will get the point across...

Come on, of course I don't want that. I'm saying that is the inevitable outcome under the rules of the game I specified. It's just like if I said "I don't want two human players to defect in one-shot PD, but that is what's going to happen."

I don't believe that is true. It's perfectly conceivable that two human players would cooperate.

Yes, I see the possibility now as well, although I still don't think it's very likely. I wrote more about it in http://lesswrong.com/lw/15m/towards_a_new_decision_theory/11lx

You're saying that TDT applied directly by both AIs would result in them cooperating; you would rather that they defect even though that gives you less utility; so you're looking for a way to make them lose? Why?

If both AIs use the same decision theory and this is common knowledge, then the only options are (C,C) or (D,D). Pick whichever you prefer. If they use different decision theories, then you can give yours pure TDT and tell it truthfully that you've tricked the other player into unconditionally cooperating. What else is there?

If both AIs use the same decision theory then the only options are (C,C) or (D,D).

You (and they) can't assume that, as they could be in different states even while running the same algorithm over those states, and so may output different decisions, even if from the problem statement it looks like everything significant is the same.

The problem is that the two human players' minds aren't logically related. Each human player in this game wants his AI to play defect, because their decisions are logically independent of each other's. If TDT doesn't allow a player's AI to play defect, then the player would choose some other DT that does, or add an exception to the decision algorithm to force the AI to play defect.

I explained here why humans should play defect in one-shot PD.

The problem is that the two human players' minds aren't logically related. Each human player in this game wants his AI to play defect, because their decisions are logically independent of each other's.

Your statement above is implicitly self-contradictory. How can you generalize over all the players in one fell swoop, applying the same logic to each of them, and yet say that the decisions are "logically independent"? The decisions are physically independent. Logically, they are extremely dependent. We are arguing over what is, in general, the "smart thing to do". You assume that "the smart thing to do" is to defect, and so all the players will defect. That doesn't smell like logical independence to me.

More importantly, the whole calculation about independence versus dependence is better carried out by an AI than by a human programmer, which is what TDT is for. It's not for cooperating. It's for determining the conditional probability of the other agent cooperating given that a TDT agent in your epistemic state plays "cooperate". If you know that the other agent knows (up to common knowledge) that you are a TDT agent, and the other agent knows that you know (up to common knowledge) that it is a TDT agent, then it is an obvious strategy to cooperate with a TDT agent if and only if it cooperates with you under that epistemic condition.

The TDT strategy is not "Cooperate with other agents known to be TDTs". The TDT strategy for the one-shot PD, in full generality, is "Cooperate if and only if ('choosing' that the output of this algorithm under these epistemic conditions be 'cooperate') makes it sufficiently more likely that (the output of the probability distribution of opposing algorithms under its probable epistemic conditions) is 'cooperate', relative to the relative payoffs."
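For concreteness, that conditional rule might be cashed out numerically as follows. This is only a toy sketch: the payoff values and the two conditional probabilities (how likely the opposing algorithm is to output "cooperate" given each of my "choices") are made-up illustrations, not anything TDT itself supplies.

```python
# Toy sketch of the conditional-cooperation rule described above.
# Standard one-shot PD payoffs for the row player: T > R > P > S.
T, R, P, S = 5.0, 3.0, 1.0, 0.0

def tdt_plays_cooperate(p_coop_if_i_coop, p_coop_if_i_defect):
    """Cooperate iff 'choosing' that this algorithm output cooperate
    makes the opposing algorithm sufficiently more likely to output
    cooperate, relative to the payoffs."""
    eu_cooperate = p_coop_if_i_coop * R + (1 - p_coop_if_i_coop) * S
    eu_defect = p_coop_if_i_defect * T + (1 - p_coop_if_i_defect) * P
    return eu_cooperate > eu_defect

# Strong logical dependence (e.g. the opponent simulates you): cooperate.
print(tdt_plays_cooperate(0.95, 0.05))  # True
# No dependence (opponent's output is the same either way): defect dominates.
print(tdt_plays_cooperate(0.5, 0.5))    # False
```

The second call is the "logically independent" case: when my choice doesn't move the other agent's output at all, the rule reduces to ordinary dominance reasoning and defects.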

Under conditions where a TDT plays one-shot true-PD against something that is not a TDT and not logically dependent on the TDT's output, the TDT will of course defect. A TDT playing against a TDT which falsely believes the former case to hold, will also of course defect. Where you appear to depart from my visualization, Wei Dai, is in thinking that logical dependence can only arise from detailed examination of the other agent's source code, because otherwise the agent has a motive to defect. You need to recognize your belief that what players do is in general likely to correlate, as a case of "logical dependence". Similarly the original decision to change your own source code to include a special exception for defection under particular circumstances, is what a TDT agent would model - if it's probable that the causal source of an agent thought it could get away with that special exception and programmed it in, the TDT will defect.

You've got logical dependencies in your mind that you are not explicitly recognizing as "logical dependencies" that can be explicitly processed by a TDT agent, I think.

If you already know something about the other player, if you know it exists, there is already some logical dependence between you two. How to leverage this minuscule amount of dependence is another question, but there seems to be no conceptual distinction between this scenario and where the players know each other very well.

The problem is that the two human players' minds aren't logically related. Each human player in this game wants his AI to play defect, because their decisions are logically independent of each other's.

I don't think so. Each player wants to do the Winning Thing, and there is only one Winning Thing (their situations are symmetrical), so if they're both good at Winning (a significantly lower bar than successfully building an AI with their preferences), their decisions are related.

So what you're saying is, given two players who can successfully build AIs with their preferences (and that's common knowledge), they will likely (surely?) play cooperate in one-shot PD against each other. Do I understand you correctly?

Suppose what you say is correct, that the Winning Thing is to play cooperate in one-shot PD. Then what happens when some player happens to get a brain lesion that causes him to unconsciously play defect without affecting his AI building abilities? He would take everyone else's lunch money. Or if he builds his AI to play defect while everyone else builds their AIs to play cooperate, his AI then takes over the world. I hope that's a sufficient reductio ad absurdum.

Hmm, I just noticed that you're only saying "their decisions are related" and not explicitly drawing the conclusion that they should play cooperate. Well, that's fine: as long as they would play defect in one-shot PD, they would also program their AIs to play defect in one-shot PD (assuming each AI can't prove its source code to the other). That's all I need for my argument.

So what you're saying is, given two players who can successfully build AIs with their preferences (and that's common knowledge), they will likely (surely?) play cooperate in one-shot PD against each other. Do I understand you correctly?

Yes.

Suppose what you say is correct, that the Winning Thing is to play cooperate in one-shot PD. Then what happens when some player happens to get a brain lesion that causes him to unconsciously play defect without affecting his AI building abilities? He would take everyone else's lunch money. Or if he builds his AI to play defect while everyone else builds their AIs to play cooperate, his AI then takes over the world. I hope that's a sufficient reductio ad absurdum.

Good idea. Hmm. It sounds like this is the same question as: what if, instead of "TDT with defection patch" and "pure TDT", the available options are "TDT with defection patch" and "TDT with tiny chance of defection patch"? Alternately: what if the abstract computations that are the players have a tiny chance of being embodied in such a way that their embodiments always defect on one-shot PD, whatever the abstract computation decides?

It seems to me that Lesion Man just got lucky. This doesn't mean people can win by giving themselves lesions, because that's deliberately defecting / being an abstract computation that defects, which is bad. Whether everyone else should defect / program their AIs to defect due to this possibility depends on the situation; I would think they usually shouldn't. (If it's a typical PD payoff matrix, there are many players, and they care about absolute, not relative, scores, defecting isn't worth it even if it's guaranteed there'll be one Lesion Man.)

This still sounds disturbingly like envying Lesion Man's mere choices – but the effect of the lesion isn't really his choice (right?). It's only the illusion of unitary agency, bounded at the skin rather than inside the brain, that makes it seem like it is. The Cartesian dualism of this view (like AIXI, dropping an anvil on its own head) is also disturbing, but I suspect the essential argument is still sound, even as it ultimately needs to be more sophisticated.

I guess my reductio ad absurdum wasn't quite sufficient. I'll try to think this through more thoroughly and carefully. Let me know which steps, if any, you disagree with, or are unclear, in the following line of reasoning.

  1. TDT couldn't have arisen by evolution.
  2. Until a few years ago, almost everyone on Earth was running some sort of non-TDT which plays defect in one-shot PD.
  3. It's possible that upon learning about TDT, some people might spontaneously switch to running it, depending on whatever meta-DT controls this, and whether the human brain is malleable enough to run TDT.
  4. If, in any identifiable group of people, a sufficient fraction switches to TDT, and that proportion is public knowledge, the TDT-running individuals in that group should start playing cooperate in one-shot PD with other members of the group.
  5. The threshold proportion is higher if the remaining defectors can cause greater damage. If the remaining defectors can use their gains from defection to better reproduce themselves, or to gather more resources that will let them increase their gains/damage, then the threshold proportion must be close to 1, because even a single defector can start a chain reaction that causes all the resources of the group to become held by defectors.
  6. What proportion of skilled AI designers would switch to TDT is ultimately an empirical question, but it seems to me that it's unlikely to be close to unity.
  7. TDT-running AI designers will design their AIs to run TDT. Non-TDT-running AI designers will design their AIs to run non-TDT (not necessarily the same non-TDT).
  8. Assume that a TDT-running AI (TAI) can't tell which other AIs are running TDT and which ones aren't, so in every game it faces the decision described in steps 4 and 5. A TDT AI will cooperate in some situations where the benefit from cooperation is relatively high and damage from defection relatively low, and not in other situations.
  9. As a result, non-TAI will do better than TAI, but the damage to TAIs will be limited.
  10. Only if a TAI is sure that all AIs are TAIs, will it play cooperate unconditionally.
  11. If a TAI encounters an AI of alien origin, the same logic applies. The alien AI will be a TAI if and only if its creator was running TDT. If the TAI knows nothing about the alien creator, then it has to estimate what fraction of AI-builders in the universe runs TDT. Taking into account that TDT can't arise from evolution, and not seeing any reason for evolution to create a meta-DT that would pick TDT upon discovering it, this fraction seems pretty low, and so the TAI will likely play defect against the alien AI.

Hmm, this exercise has cleared a lot of my own confusion. Obviously a lot more work needs to be done to make the reasoning rigorous, but hopefully I've gotten the gist of it right.

ETA: According to this line of argument, your hypothesis that all skilled AI designers play cooperate in one-shot PD against each other is equivalent to saying that skilled AI designers have minds malleable enough to run TDT, and have a meta-DT that causes them to switch to running TDT. But I do not see an evolutionary reason for this, so if it's true, it must be true by luck. Do you agree?

It looks like in this discussion you assume that switching to "TDT" (it's highly uncertain what this means) immediately gives the decision to cooperate in "true PD". I don't see why it should be so. Summarizing my previous comments, exactly what the players know about each other, exactly in what way they know it, may make their decisions go either way. That the players switch from CDT to some kind of more timeless decision theory doesn't determine the answer to be "cooperate", it merely opens up the possibility that previously was decreed irrational, and I suspect that what's important in the new setting for making the decision go either way isn't captured properly in the problem statement of "true PD".

Also, the way you treat "agents with TDT" seems more appropriate for "agents with Cooperator prefix" from cousin_it's Formalizing PD. And this is a simplified thing far removed from a complete decision theory, although a step in the right direction.

I don't assume that switching to TDT immediately gives the decision to cooperate in "true PD". I assume that an AI running TDT would decide to cooperate if it thinks the expected utility of cooperating is higher than the EU of defecting, and that is true if its probability of facing another TDT is sufficiently high compared to its probability of facing a defector (how high is sufficient depends on the payoffs of the game). Well, this is necessary but not sufficient. For example if the other TDT doesn't think its probability of facing a TDT is high enough, it won't cooperate, so we need some common knowledge of the relevant probabilities and payoffs.
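The "sufficiently high" condition can be made concrete in a toy model. The assumptions here are mine, not part of the argument above: a TDT opponent mirrors my move exactly, a defector plays D no matter what, and the payoffs are a standard PD matrix.

```python
# Toy model: with probability p the opponent is a TDT whose move mirrors
# mine; otherwise it is a defector who plays D regardless.
T, R, P, S = 5.0, 3.0, 1.0, 0.0  # assumed PD payoffs, T > R > P > S

def eu_cooperate(p):
    # TDT mirrors my C; the defector defects against me anyway.
    return p * R + (1 - p) * S

def eu_defect(p):
    # If I defect, the mirroring TDT defects too, so I always get P.
    return P

def threshold():
    """Smallest p at which cooperating beats defecting:
    p*R + (1-p)*S > P  =>  p > (P - S) / (R - S)."""
    return (P - S) / (R - S)

print(threshold())                          # ~0.333 with these payoffs
print(eu_cooperate(0.5) > eu_defect(0.5))   # True
```

Note how the threshold moves with the payoffs: raising P (the damage limit when defected against) or lowering R pushes the required fraction of TDT agents up, which matches the claim in step 5 that greater damage from defectors demands a proportion closer to 1.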

Does my line of reasoning make sense now, given this additional explanation?

Actually it makes less sense now, since your explanation seems to agree that two "TDT" algorithms that know each of them is "TDT" won't necessarily cooperate, which undermines my hypothesis for why you were talking about cooperation as a sure thing in some relation to "TDT". I still think you make that assumption though. Citation from your argument:

4. If, in any identifiable group of people, a sufficient fraction switches to TDT, and that proportion is public knowledge, the TDT-running individuals in that group should start playing cooperate in one-shot PD with other members of the group.

I'm having trouble understanding what you're talking about again. Do you agree or disagree with step 4? To rephrase it a bit: if an identifiable group of people contains a high fraction of individuals running TDT, and that proportion is public knowledge, then TDT-running individuals in that group should play cooperate in one-shot PD with other members of the group, in games where the payoffs are such that the potential gains from mutual cooperation are large compared to the potential losses from being defected against. (Assuming being in such a group is the best evidence available about whether someone is running TDT or not.)

If you disagree, why do you think a TDT-running individual might not play cooperate in this situation? Can you give an example to help me understand?

I disagree with step 4, I think sometimes the TDT players that know they both are TDT players won't cooperate, but this discussion stirred up some of the relevant issues, so I'll answer later when I figure out what I should believe now.

I don't see why TDT players would fail to cooperate under conditions of common knowledge. Are you talking about a case where they each know the other is TDT but think the other doesn't know they know, or something like that?

I don't know the whole answer, but for example consider what happens with Pareto-efficiency in the PD when you allow mixed strategies. (A mixed strategy is essentially a nontrivial dependence of the played move on the agent's state of knowledge, beyond what the experiment restricts; so there is no actual choice about allowing mixed strategies, since mixed strategies are what's there by default even if the problem states that the players select some definite play.) Now, the Pareto-efficient plays are those where one player cooperates with certainty, while the other cooperates or defects with some probability. These strategies correspond to bargaining between the players. I don't know how to solve the bargaining problem (aka the fairness problem, aka the first-mover problem in TDT), but I see no good reason to expect that the solution in this case is going to be exactly pure cooperation. Which is what I meant by the insufficiency of the correspondence between true PD and pure cooperation: true PD seems to give too little info, leaving uncertainty about the outcome, at least in this sense. This example doesn't allow both players to defect, but it's not pure cooperation either.
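To make the mixed-strategy point concrete, here is a small check against an assumed standard PD payoff matrix (T=5, R=3, P=1, S=0, so 2R > T+S); the specific numbers are my illustration, not part of the argument.

```python
# Assumed standard PD payoffs: T > R > P > S and 2R > T + S.
T, R, P, S = 5.0, 3.0, 1.0, 0.0

def payoffs(p, q):
    """Expected payoffs when player 1 cooperates with probability p
    and player 2 cooperates with probability q."""
    u1 = p*q*R + p*(1-q)*S + (1-p)*q*T + (1-p)*(1-q)*P
    u2 = p*q*R + p*(1-q)*T + (1-p)*q*S + (1-p)*(1-q)*P
    return u1, u2

def dominated_by(a, b):
    """True if outcome b is at least as good as a for both players
    and different (so a is not Pareto-efficient)."""
    return b[0] >= a[0] and b[1] >= a[1] and b != a

# Mutual cooperation dominates both mutual defection and a symmetric
# interior mix, so neither of those is Pareto-efficient.
assert dominated_by(payoffs(0, 0), payoffs(1, 1))
assert dominated_by(payoffs(0.7, 0.7), payoffs(1, 1))

# But a play where one side cooperates surely while the other mixes is
# NOT dominated by pure mutual cooperation: the mixer's payoff goes up
# at the sure cooperator's expense. This is the bargaining region.
assert not dominated_by(payoffs(1, 0.5), payoffs(1, 1))
```

The last check is the point: there is a whole family of efficient plays trading one player's payoff against the other's, and nothing in the bare problem statement of "true PD" singles out pure mutual cooperation from among them.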

  1. TDT couldn't have arisen by evolution.

It's too elegant to arise by evolution, and it also deals with one-shot PDs with no knock-on effects, which is an extremely nonancestral condition - evolution by its nature deals with events that repeat many times; sexual evolution by its nature deals with organisms that interbreed; so a "one-shot true PD" is in general a condition unlikely to arise with sufficient frequency for evolution to deal with it at all.

Taking into account that TDT can't arise from evolution, and not seeing any reason for evolution to create a meta-DT that would pick TDT upon discovering it

This may perhaps embody the main point of disagreement. A self-modifying CDT which, at 7am, expects to encounter a future Newcomb's Problem or Parfit's Hitchhiker in which the Omega gets a glimpse at the source code after 7am, will modify to use TDT for all decisions in which Omega glimpses the source code after 7am. A bit of "common sense" would tell you to just realize that "you should have been using TDT from the beginning regardless of when Omega glimpsed your source code and the whole CDT thing was a mistake" but this kind of common sense is not embodied in CDT. Nonetheless, TDT is a unique reflectively consistent answer for a certain class of decision problems, and a wide variety of initial points is likely to converge to it. The exact proportion, which determines under what conditions of payoff and loss stranger-AIs will cooperate with each other, is best left up to AIs to calculate, I think.

Nonetheless, TDT is a unique reflectively consistent answer for a certain class of decision problems, and a wide variety of initial points is likely to converge to it.

The main problem I see with this thesis (to restate my position in a hopefully clear form) is that an agent that starts off with a DT that unconditionally plays D in one-shot PD will not self-modify into TDT, unless it has some means of giving trustworthy evidence that it has done so. Suppose there is no such means; then any other agent must treat it the same, whether it self-modifies into TDT or not. Suppose it expects to face a TDT agent in the future. Whether that agent will play C or D against it is independent of what it decides now. If it does self-modify into TDT, then it might play C against the other TDT where it otherwise would have played D, and since the payoff for C is lower than the payoff for D, holding the other player's choice constant, it will decide not to self-modify into TDT.

If it expects to face Newcomb's Problem, then it would self-modify into something that handles it better, but that something must still unconditionally play D in one-shot PD.

Do you still think "a wide variety of initial points is likely to converge to it"? If so, do you agree that (ETA: in a world where proving source code isn't possible) those initial points exclude any DT that unconditionally plays D in one-shot PD?

BTW, there are a number of decision theorists in academia. Should we try to get them to work on our problems? Unfortunately, I have no skill/experience/patience/willpower for writing academic papers. I tried to write such a paper about cryptography once and submitted it to a conference, got back a rejection with nonsensical review comments, and that was that. (I guess I could have tried harder, but then that would probably have put me on a different career path where I wouldn't be working on these problems today.)

Also, there ought to be lots of mathematicians and philosophers who would be interested in the problem of logical uncertainty. How can we get them to work on it?

Suppose it expects to face a TDT agent in the future. Whether that agent will play C or D against it is independent of what it decides now.

Unless that agent already knows or can guess your source code, in which case it is simulating you or something highly correlated to you, and in which case "modify to play C only if I expect that other agent simulating me to play C iff I modify to play C" is a superior strategy to "just D" because an agent who simulates you making the former choice (and which expects to be correctly simulated itself) will play C against you, while if it simulates you making the latter choice it will play D against you.

If it does self-modify into TDT, then it might play C against the other TDT where it otherwise would have played D, and since the payoff for C is lower than for D, holding the other player's choice constant, it will decide not to self-modify into TDT.

The whole point is that the other player's choice is not constant. Otherwise there is no reason ever for anyone to play C in a one-shot true PD! Simulation introduces logical dependencies - that's the whole point and to the extent it is not true even TDT agents will play D.

"Holding the other player's choice constant" here is the equivalent of "holding the contents of the boxes constant" in Newcomb's Problem. It presumes the answer.

Unless that agent already knows or can guess your source code, in which case it is simulating you or something highly correlated to you

I think you're invoking TDT-style reasoning here, before the agent has self-modified into TDT.

Besides, I'm assuming a world where agents can't know or guess each others' source codes. I thought I made that clear. If this assumption doesn't make sense to you, consider this: What evidence can one AI use to infer the source code of another AI or its creator? What if any such evidence can be faked near perfectly by the other AI? What about for two AIs of different planetary origins meeting in space?

I know you'd like to assume a world where guessing each others' source code is possible, since that makes everything work out nicely and everyone can "live happily ever after". But why shouldn't we consider both possibilities, instead of ignoring the less convenient one?

ETA: I think it may be possible to show that a CDT won't self-modify into a TDT as long as it believes there is a non-zero probability that it lives in a world where it will encounter at least one agent that won't know or guess its current or future source code, but in the limit as that probability goes to zero, the DT it self-modifies into converges to TDT.

I think you're invoking TDT-style reasoning here, before the agent has self-modified into TDT.

I already said that agents which start out as pure CDT won't modify into pure TDTs - they'll only cooperate if someone gets a peek at their source code after they self-modified. However, humans, at least, are not pure CDT agents - they feel at least the impulse to one-box on Newcomb's Problem if you raise the stakes high enough.

This has nothing to do with evolutionary contexts of honor and cooperation and defection and temptation, and everything to do with our evolved instincts governing abstract logic and causality, which is what governs what sort of source code you think has what sort of effect. Even unreasonably pure CDT agents recognize that if they modify their source code at 7am, they should modify to play TDT against any agent that has looked at their source code after 7am. To humans, who are not pure CDT agents, the idea that you should play essentially the same way if Omega glimpsed your source code at exactly 6:59am, seems like common sense given the intuitions we have about logic and causality and elegance and winning. If you're going to all the trouble to invent TDT anyway, it seems like a waste of effort to two-box against Omega if he perfectly saw your source code 5 seconds before you self-modified. (These being the kind of ineffable meta-decision considerations that we both agree are important, but which are hard to formalize.)

Besides, I'm assuming a world where agents can't know or guess each other's source code.

You are guessing their source code every time you argue that they'll choose D. If I can't make you see this as an instance of "guessing the other agent's source code" then indeed you will not see the large correlations at the start point, and if the agents start out highly uncorrelated then the rare TDT agents will choose the correct maximizing action, D. They will be rare because, by assumption in this case, most agents end up choosing to cooperate or defect for all sorts of different reasons, rather than by following highly regular lines of logic in nearly all cases - let alone the same line of logic that kept on predictably ending up at D.

There's a wide variety of cases where philosophers go astray by failing to recognize an everyday occurrence as an instance of an abstract concept. For example, they say in one breath that "God is unfalsifiable", and in the next breath talk about how God spoke to them in their heart, because they don't recognize "God spoke to me in my heart" as an instance of "God allegedly made something observable happen". Philosophers talk about qualia being epiphenomenal in one breath, and then in the next speak of how they know themselves to be conscious, because they don't recognize this self-observation as an instance of "something making something else happen" aka "cause and effect". The only things recognized as matching the formal-sounding phrase "cause and effect" are big formal things officially labeled "causal", not just stuff that makes other stuff happen.

In the same sense, you have this idea about modeling other agents as this big official affair that requires poring over their source code with a magnifying glass and then furthermore verifying that they can't change it while you aren't looking.

You need to recognize the very thought processes you are carrying out right now in arguing that just about anyone will choose D as an instance of guessing the outputs of the other agents' source codes and moreover guessing that most such codes and outputs are massively logically correlated.

This is witnessed by the fact that if we did get to see some interstellar transactions, and you saw that the first three transactions were (C, C), you would say, "Wow, guess Eliezer was right" and expect the next one to be (C, C) as well. (And of course if I witnessed three cases of (D, D) I would say "Guess I was wrong.") Even though the initial conditions are not physically correlated, we expect a correlation. What is this correlation, then? It is a logical correlation. We expect different species to end up following similar lines of reasoning, that is, performing similar computations, like factorizing 123,456 in spacelike separated galaxies.

It occurs to me that the problem is important enough that even if we can reach intuitive agreement, we should still do the math. But it doesn't help to solve the wrong problem, so do you think the following is the right formalization of the problem?

  1. Assume a "no physical proof of source code" universe.
  2. Assume three types of intelligent life can arise in this universe.
  3. In a Type A species, Eliezer's intuition is obvious to everyone, so they build AIs running TDT without further consideration.
  4. In a Type B species, my intuition is obvious to everyone, so they build AIs running XDT, or AIs running CDT which immediately self-modify into XDT. Assume (or prove) that XDT behaves like TDT except that it unconditionally plays D in PD.
  5. In a Type C species, different people have different intuitions, and some (Type D individuals) don't have strong intuitions or prefer to use a formal method to make this meta-decision. We human beings obviously belong to this type of species, and let's say we at LessWrong belong to this last subgroup (Type D).

Does this make sense so far?

Let me say where I expect this to lead, so you don't think I'm setting a trap for you to walk into. Whatever meta-decision we make, it can be logically correlated only with AIs running TDT and other Type D individuals in the universe. If the proportion of Type D individuals in the universe is low, then it's obviously better for us to implement XDT instead of TDT. That's because whether we use TDT or XDT will have little effect on how often other TDTs play cooperate. (They can predict what Type D individuals will decide, but since there are few of us and they can't tell which AIs were created by Type D individuals, it won't affect their decisions much.)

Unfortunately we don't know the proportions of different types of species/individuals. So we should program an AI to estimate them, and have it make the decision of what to self-modify into.

ETA: Just realized that the decisions of Type D individuals can also correlate with the intuitions of others, since intuitions come from unconscious mental computations and they may be of a similar nature to our explicit decisions. But this correlation will be imperfect, so the above reasoning still applies, at least to some extent.

ETA2: This logical correlation stuff is hard to think about. Can we make any sense of these types of problems before having a good formal theory of logical correlation?

ETA3: The thing that's weird here is that, assuming everyone's intuitions/decisions aren't perfectly correlated, some will build TDTs and some will build XDTs. And it will be the ones who end up deciding to build XDTs, which defect, who will win. How do we make sense of this, if that's the wrong decision?

ETA4: I'll be visiting Mt. Rainier for the rest of the day, so that's it. :) Sorry for the over-editing.

Maybe cousin_it is right and we really have to settle this by formal math. But I'm lazy and will give words one more try. If we don't reach agreement after this I'm going to the math.

So, right now we have different intuitions. Let's say you have the correct intuition and convince everyone of it, and I have the incorrect one but I'm too stupid to realize it. So you and your followers go on to create a bunch of AIs with TDT. I go on to create an AI which is like TDT except it plays defect in PD. Let's say I pretended to be your follower and we never had this conversation, so there is no historical evidence that I would create such an AI. When my AI is born, it modifies my brain so that I start to believe I created an AI with TDT, thus erasing the last shred of evidence. My AI will then go on and win against every other AI.

Given the above, why should I change my mind now, and not win?

ETA: Ok, I realize this is pretty much the same scenario as the brain lesion one, except it's not just possible, it's likely. Someone is bound to have my intuition and be resistant to your persuasion. If you say that smart agents win, then he must be the smart one, right?

This, I think, clears up one more terminological distinction. When you talked earlier about something like "reasoning about the output of a platonic computation" as a key insight that started your version of TDT, you meant basically the same thing I did when talking about how even knowing of the other agent's existence, the little things that let you talk about it at all, already constitutes a logical dependence between you and the other agent that could in some cases be used to stage cooperation.

so "one-shot true PDs" is in general a condition unlikely to arise with sufficient frequency that evolution deals with it at all

But there are analogs of one-shot true PD everywhere.

A self-modifying CDT which, at 7am, expects to encounter a future Newcomb's Problem or Parfit's Hitchhiker in which the Omega gets a glimpse at the source code after 7am, will modify to use TDT for all decisions in which Omega glimpses the source code after 7am.

No, I disagree. You seem to have missed this comment, or do you disagree with it?

But there are analogs of one-shot true PD everywhere.

Name a single one-shot true PD that any human has ever encountered in the history of time, and be sure to calculate the payoffs in inclusive fitness terms.

Of course that's a rigged question - if you can tell me the name of the villain, I can either say "look how they didn't have any children" or "their children suffered from the dishonor brought upon their parent". But still, I think you are taking far too liberal a view of what constitutes one-shotness.

Empirically, humans ended up with both a sense of temptation and a sense of honor that, to the extent it holds, holds when no one is looking. We have separate impulses for "cooperate because I might get caught" and "cooperate because it's the honorable thing to do".

Regarding your other comment, "Do what my programmer would want me to do" is not formally defined enough for me to handle it - all the complexity is hidden in "would want". Can you walk me through what you think a CDT agent self-modifies to if it's not "use TDT for future decisions where Omega glimpsed my code after 7am and use CDT for future decisions where Omega glimpsed my code before 7am"? (Note that calculations about general population frequency count as "before 7am" from the crazed CDT's perspective, because you're reasoning from initial conditions that correlate to the AI's state before 7am rather than after it.)

By "analog of one-shot true PD" I meant any game where the Nash equilibrium isn't Pareto-optimal. The two links in my last comment gave plenty of examples.

all the complexity is hidden in "would want"

I think I formalized it already, but to say it again, suppose the creator had the option of creating a giant lookup table in place of S. What choice of GLT would have maximized his expected utility at the time of coding, under the creator's own decision theory? S would compute that and then return whatever the GLT entry for X is.
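As a toy illustration (my own sketch, not part of the formalization above), assuming a finite input space and a function standing in for the creator's expected utility over tables, the optimal GLT can in principle be found by brute-force enumeration:

```python
from itertools import product

def best_glt(inputs, outputs, creator_expected_utility):
    # Enumerate every possible lookup table from inputs X to outputs Y and
    # keep the one the creator would most have preferred at coding time.
    best_table, best_u = None, float("-inf")
    for choice in product(outputs, repeat=len(inputs)):
        table = dict(zip(inputs, choice))
        u = creator_expected_utility(table)
        if u > best_u:
            best_table, best_u = table, u
    return best_table

# Hypothetical toy utility: the creator prefers an S that one-boxes ("1")
# on the input "newcomb" and is indifferent to everything else.
utility = lambda table: 1.0 if table["newcomb"] == "1" else 0.0
glt = best_glt(["newcomb"], ["1", "2"], utility)
# S, on input X, then simply returns glt[X].
```

Of course the real proposal has S *compute* this table under logical omniscience rather than enumerate it, but the enumeration makes the optimization target explicit.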

ETA:

Can you walk me through what you think a CDT agent self-modifies to

It self-modifies to the S described above, with a description of itself embedded as the creator. Or to make it even simpler but less realistic, a CDT just replaces itself by a GLT, chosen to maximize its current expected utility.

Is that sufficiently clear?

By "analog of one-shot true PD" I meant any game where the Nash equilibrium isn't Pareto-optimal. The two links in my last comment gave plenty of examples.

Suppose we have an indefinitely iterated PD with an unknown bound and hard-to-calculate but small probabilities of each round being truly unobserved. Do you call that "a game where the Nash equilibrium isn't a Pareto optimum"? Do you think evolution has handled it by programming us to just defect?
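A quick way to see why evolution wouldn't handle this by programming pure defection: here is a toy simulation (my own construction, with standard PD payoffs that I'm supplying for illustration) of a repeated PD against a partner who retaliates only against *observed* defections:

```python
import random

# Standard PD payoffs (row player): mutual C pays 3, mutual D pays 1,
# defecting against a cooperator pays 5, being exploited pays 0.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def total_payoff(my_strategy, rounds=100, p_unobserved=0.05, seed=0):
    # Partner plays grim trigger, but only observed defections trigger it.
    rng = random.Random(seed)
    retaliating = False
    total = 0
    for _ in range(rounds):
        unobserved = rng.random() < p_unobserved  # small chance no one is looking
        my = my_strategy(unobserved)
        their = "D" if retaliating else "C"
        total += PAYOFF[(my, their)]
        if my == "D" and not unobserved:
            retaliating = True
    return total

always_defect = lambda unobserved: "D"
honorable     = lambda unobserved: "C"                        # cooperate even unseen
temptable     = lambda unobserved: "D" if unobserved else "C"

# Honor handily beats unconditional defection; temptation edges out honor,
# roughly matching the mixed impulses we actually evolved.
```

Unconditional defection gets punished almost immediately, while defecting only in the rare unobserved rounds is never caught, which is exactly the selection pressure for a temptable-but-mostly-honorable mix.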

I've done some informal psychological experiments to check human conformance with timeless decision theory on variants of the original Newcomb's Problem, btw, and people who one-box on Newcomb's Problem seem to have TDT intuitions in other ways. Not that this is at all relevant to the evolutionary dilemmas, which we seem to've been programmed to handle by being temptable, status-conscious, and honorable to variant quantitative degrees.

But programming an AI to cooperate with strangers on oneshot true PDs out of a human sense of honor would be the wrong move - our sense of honor isn't the formal "my C iff (opponent C iff my C)", so a TDT agent would then defect against us.

I just don't see human evolution - status, temptation, honor - as being very relevant here. An AI's decision theory will be, and should be, decided by our intuitions about logic and causality, not about status, temptation, and honor. Honor enters as a human terminal value, not as a decider of the structure of the decision theory.

How do you play "cooperate iff (the opponent cooperates iff I cooperate)" in a GLT? Is the programmer supposed to be modeling the opponent AI in sufficient resolution to guess how much the opponent AI knows about the programmer's decision, and how many other possible programmers that the AI is modeling are likely to correlate with it? Does S compute the programmer's decision using S's knowledge or only the programmer's knowledge? Does S compute the opponent inaccurately as if it were modeling only the programmer, or accurately as if it were modeling both the programmer and S?

I suppose that a strict CDT could replace itself with a GLT, if that GLT can take into account all info where the opponent AI gets a glimpse at the GLT after it's written. Then the GLT behaves just like the code I specified before on e.g. Newcomb's Problem - one-box if Omega glimpses the GLT or gets evidence about it after the GLT was written, two-box if Omega perfectly knows your code 5 seconds before the GLT gets written.

[Edit: Don't bother responding to this yet. I need to think this through.]

How do you play "cooperate iff (the opponent cooperates iff I cooperate)" in a GLT?

I'm not sure this question makes sense. Can you give an example?

Does S compute the programmer's decision using S's knowledge or only the programmer's knowledge?

S should take the programmer R's prior and memories/sensory data at the time of coding, and compute a posterior probability distribution using them (assuming it would do a better job at this than R). Then use that to compute R's expected utility for the purpose of computing the optimal GLT. This falls out of the idea that S is trying to approximate what the GLT would be if R had logical omniscience.

Is the programmer supposed to be modeling the opponent AI in sufficient resolution to guess how much the AI knows about the programmer?

No, S will do it.

Does S compute the opponent as if it were modeling only the programmer, or both the programmer and S?

I guess both, but I don't understand the significance of this question.

like AIXI, dropping an anvil on its own head

Or:

Suppose what you say is correct, that the Winning Thing is to play cooperate in one-shot PD. Then what happens when some player happens to get a brain lesion that causes him to unconsciously play defect without affecting his AI building abilities? He would take everyone else's lunch money.

Possibly. But it has to be an unpredictable brain lesion - one that is expected to happen with very low frequency. A predictable decision to do this just means that TDTs defect against you. If enough AI-builders do this then TDTs in general defect against each other (with a frequency threshold dependent on relative payoffs) because they have insufficient confidence that they are playing against TDTs rather than special cases in code.
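The frequency threshold mentioned here can be made concrete. Under the standard PD payoffs (which I'm supplying for illustration), a TDT agent whose choice is mirrored by a fraction p of logically correlated opponents, with everyone else defecting regardless, should cooperate only when p clears a payoff-dependent bar:

```python
def tdt_cooperates(p_correlated, R=3.0, P=1.0, S=0.0):
    # If the TDT agent outputs C, the correlated fraction p mirrors it
    # (payoff R each) while the rest defect against it (payoff S).
    # If it outputs D, the correlated fraction also defects, so everyone
    # plays D and it gets P. Cooperate iff C has higher expected payoff.
    return p_correlated * R + (1 - p_correlated) * S > P

# With R=3, P=1, S=0 the threshold works out to p > 1/3.
```

So "insufficient confidence that they are playing against TDTs" just means the estimated p has fallen below this threshold.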

Or if he builds his AI to play defect while everyone else builds their AIs to play cooperate, his AI then takes over the world.

No one is talking about building AIs to cooperate. You do not want AIs that cooperate on the one-shot true PD. You want AIs that cooperate if and only if the opponent cooperates if and only if your AI cooperates. So yes, if you defect when others expect you to cooperate, you can pwn them; but why do you expect that AIs would expect you to cooperate (conditional on their cooperation) if "the smart thing to do" is to build an AI that defects? AIs with good epistemic models would then just expect other AIs that defect.
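One toy way (my own sketch, far cruder than real TDT) to cash out "cooperate iff the opponent cooperates iff your AI cooperates" is to condition on logical correlation in the bluntest possible form, identical decision procedures:

```python
def tdt_like(opponent):
    # Cooperate iff the opponent runs the very same decision procedure, so
    # "opponent C iff my C" holds by logical correlation, not causation.
    if opponent.__code__.co_code == tdt_like.__code__.co_code:
        return "C"
    return "D"

def defect_bot(opponent):
    # Unconditional defector, the "smart move" strategy under discussion.
    return "D"

# Two tdt_like agents cooperate with each other; against defect_bot,
# tdt_like defects, so building a defector gains nothing here.
```

Real agents would replace the bytecode-equality test with abstract reasoning about the opponent's code, but even this crude version shows why an AI with a good epistemic model of defect-builders just defects back rather than getting exploited.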

The comment you responded to was mostly obsoleted by this one, which represents my current position. Please respond to that one instead. Sorry for making you waste your time!

Can Nesov's AI correctly guess what AI Eliezer would probably have built and vice versa? Clearly I wouldn't want to build an AI which, if it believes Nesov's AI is accurately modeling it, and cooperating conditional on its own cooperation, would fail to cooperate. And in the true PD - which couldn't possibly be against Nesov - I wouldn't build an AI that would cooperate under any other condition. In either case there's no reason to use anything except TDT throughout.

Can Nesov's AI correctly guess what AI Eliezer would probably have built and vice versa?

No, I'm assuming that the AIs don't have enough information or computational power to predict the human players' choices. Imagine a human-created AI meeting a paperclipper that was designed by a long-lost alien race. Wouldn't you program the human AI to play defect against the paperclipper, assuming that there is no way for the AIs to prove their source codes to each other? The two AIs ought to think that they are both using the same decision theory (assuming there is just one obviously correct theory that they would both converge to). But that theory can't be TDT, because if it were TDT, then the human AI would play cooperate, which you would have overridden if you had known it was going to happen.

Let me know if that still doesn't make sense.

Wei, the whole point of TDT is that it's not necessary for me to insert special cases into the code for situations like this. Under any situation in which I should program the AI to defect against the paperclipper, I can write a simple TDT agent and it will decide to defect against the paperclipper.

TDT has that much meta-power in it, at least. That's the whole point of using it.

(Though there are other cases - like the timeless decision problems I posted about that I still don't know how to handle - where I can't make this statement about the TDT I have in hand; but this is because I can't handle those problems in general.)