Sneaky Strategies for TDT

An alternative approach could be that TDT resolves to never let itself be out-performed by any other decision theory, because of evolutionary considerations as discussed here. Even if that requires a large sacrifice of immediate utility (e.g. two-boxing and taking $1000 along with CDT, rather than one-boxing and taking $1 million, but with CDT getting $1,001,000.) I don't currently know what to think about that, except that it makes my head spin; it also sounds like a rather Unfriendly form of AI.

I think that merely noting that *if the TDT agent had the goal of not being outperformed by an agent with another decision theory, it could achieve it* is enough to undermine Problem 1 as a criticism of TDT. If it predicts that undermining a competitor is of sufficient instrumental value to offset the loss of immediate direct rewards of terminal value, then it will undermine the competitor. If it doesn't make the prediction (correctly), then it is rational to seek the greater reward for itself, even if this helps another agent even more.

I'm not sure why it seems like the second variant doesn't refer to TDT, if you're assuming that the math is the same. Same math, same problem. If different math, different problem.

Here are the variants which make no explicit mention of TDT anywhere in the problem statement. It seems a real strain to describe either of them as unfair to TDT. Yet TDT will be outperformed on them by CDT, unless it resolves never to allow itself to be outperformed on any problem (in TDT über alles fashion).

**Problem 1**: Omega (who experience has shown is always truthful) presents the usual two boxes A and B and announces the following. "Before you entered the room, I selected an agent at random from the following distribution over all full source-codes for decision theory agents (*insert distribution*). I then simulated the result of presenting this exact problem to that agent. I won't tell you what the agent decided, but I will tell you that if the agent two-boxed then I put nothing in Box B, whereas if the agent one-boxed then I put *big Value-B* in Box B. Regardless of how the simulated agent decided, I put *small Value-A* in Box A. Now please choose your box or boxes."

**Problem 2**: Our ever-reliable Omega now presents ten boxes, numbered from 1 to 10, and announces the following. "Exactly one of these boxes contains $1 million; the others contain nothing. You must take exactly one box to win the money; if you try to take more than one, then you won't be allowed to keep any winnings. Before you entered the room, I ran multiple simulations of this problem as presented to different agents, sampled uniformly from different possible future universes according to their relative numbers, with the universes themselves sampled from my best projections of the future. I determined the box which the agents were least likely to take. If there were several such boxes tied for equal-lowest probability, then I just selected one of them, the one labelled with the smallest number. I then placed $1 million in the selected box. Please choose your box."

unless it resolves never to allow itself to be outperformed on any problem (in TDT über alles fashion).

This is not actually possible. You can always play the "I simulated you and put the money in the place you don't choose" game.

It seems a real strain to describe either of them as unfair to TDT.

From this side of the screen, this looks like a property of *you*, not the problems. If we replace the statement about "relative numbers" in the future (we were having to make assumptions about that anyhow, so let's just save time and stick in the assumptions), then problem 2 reads "I simulated the best decision theory by definition X and put the money in the place it doesn't choose." This demonstrates that no matter how good a decision theory is by *any* definition, it can still get hosed by Omega. In this case we're assuming that definition X is maximized by TDT (thus, it's a unique specification), and yea, TDT did go forth and get hosed by Omega.

This is not actually possible. You can always play the "I simulated you and put the money in the place you don't choose" game

But the obvious response to *that* game is randomisation among the choice options: there is no guarantee of winning, but no-one else can do better than you either. It takes a new "twist" on the problem to defeat the randomisation approach, and show that another agent type *can* do better.

I did ask on my original post (on Problematic Problems) whether that "twist" had been proposed or studied before. There were no references, but if you have one, please let me know.

I don't have such a reference - so good job :D And yes, I was assuming that Omega was defeating randomization.

It seems a real strain to describe either of them as unfair to TDT.

From this side of the screen, this looks like a property of you, not the problems. If we replace the statement about "relative numbers" in the future (we were having to make assumptions about that anyhow, so let's just save time and stick in the assumptions), then problem 2 reads "I simulated the best decision theory by definition X and put the money in the place it doesn't choose." This demonstrates that no matter how good a decision theory is by any definition, it can still get hosed by Omega. In this case we're assuming that definition X is maximized by TDT (thus, it's a unique specification), and yea, TDT did go forth and get hosed by Omega.

So there's a class of problems where failure is actually a good sign? Interesting. You might want to post further on that, actually.

Hm, yeah. After some computational work at least. Every decision procedure can get hosed by Omega, and the way in which it gets hosed is diagnostic of its properties. Though not uniquely, I guess, so you can't say "it fails this special test *therefore* it is good."

I think the clearest and simplest version of Problem 1 is where Omega chooses to simulate a CDT agent with .5 probability and a TDT agent with .5 probability. Let's say that Value-B is $1000000, as is traditional, and Value-A is $1000. TDT will one-box for an expected value of $500000 (as opposed to $1000 if it two-boxes), and CDT will always two-box, and receive an expected $501000. Both TDT and CDT have an equal chance of playing against each other in this version, and an equal chance of playing against themselves, and yet CDT still outperforms. It seems TDT suffers for CDT's irrationality, and CDT benefits from TDT's rationality. Very troubling.

EDIT: (I will note, though, that a TDT agent *still* can't do any better by two-boxing - only make CDT do worse).
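For anyone who wants to check the arithmetic, here is a minimal sketch of the 50/50 version. It assumes that a one-boxer receives only the contents of Box B, that a simulated CDT agent always two-boxes, and that a simulated TDT agent makes the same choice as the actual TDT agent. The function and constant names are made up for illustration.

```python
from fractions import Fraction

VALUE_A = 1_000             # Box A, always filled
VALUE_B = 1_000_000         # Box B, filled only if the simulated agent one-boxed
P_SIM_TDT = Fraction(1, 2)  # assumed chance that Omega simulated a TDT agent


def payoff(one_box: bool, sim_one_boxed: bool) -> int:
    """Winnings of the actual agent, given what the simulated agent did."""
    box_b = VALUE_B if sim_one_boxed else 0
    return box_b if one_box else box_b + VALUE_A


def expected(agent_one_boxes: bool, tdt_one_boxes: bool) -> Fraction:
    """Expected winnings when simulated TDT follows `tdt_one_boxes`
    and simulated CDT always two-boxes."""
    return (P_SIM_TDT * payoff(agent_one_boxes, tdt_one_boxes)
            + (1 - P_SIM_TDT) * payoff(agent_one_boxes, False))


tdt_if_one_boxing = expected(True, True)    # TDT policy is "one-box"
cdt_vs_one_boxers = expected(False, True)   # CDT two-boxes against that policy
tdt_if_two_boxing = expected(False, False)  # TDT policy is "two-box"
cdt_vs_two_boxers = expected(False, False)  # CDT then gets the same as TDT

print(tdt_if_one_boxing, cdt_vs_one_boxers, tdt_if_two_boxing, cdt_vs_two_boxers)
```

Under these assumptions, switching the TDT policy to two-boxing lowers both agents' expected winnings to the Box A amount, which is the point of the EDIT above.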

Assuming that the space of possible agents is large enough: for each individual version of a TDT agent, the best way is to two-box. The advantage of TDT is the possibility of improving the expected value for a whole range of agents (including cooperation with other TDT agents in the Prisoner's Dilemma). CDT agents happen to profit from that, and they profit even more than TDT agents. Does TDT maximize the expected value for the whole distribution of agents? In that case, it is still optimal in that respect.

Problem 2 is sensitive to arbitrarily small changes: assume that the space of TDT agents takes one box with probability 10%+epsilon and some other with 10%-epsilon. While the expectation value is the same within O(epsilon), the money is now in some other box and CDT would have to calculate that with the same precision. Apart from the experimental issue, I think this gives some theoretical challenges as well.

One thought is that the TDT collective should vary the box probabilities very very slightly, so that Omega can tell which has the lowest probability, but regular CDT agents can't

Works easily if the TDT agent has a specific CDT agent in mind. It simply determines what to do by running a simulation of the CDT agent and letting that influence its decision. Then the CDT agent is causally trapped, and can't compute what the TDT agent will decide because it can't identify "its own decision" within the TDT agent.

Omega could prevent this approach by the information it reveals. If it reveals the full source-code for C-sim and the actual agent detects that C-act = C-sim, then it shouldn't execute the "favourite number" strategy. The best thing C-act can then do is to pick each of the ten boxes with equal probability; and any other agent should pick Box 1 with certainty.

Actually, you're wrong here. Consider this strategy: Pick my favourite number if the simulation's favourite number is different; otherwise, pick randomly from the ten boxes, with probability 1/10 + epsilon for my favourite box, and equal probability across all other boxes.

The result of this is that if you're the simulated agent and your favourite number is 1, Omega puts the money in box 2, and if it's 2, Omega puts the money in box 1. In that case, if you're C-act and you have a different favourite number you have a 100% chance of winning the money, whereas if you have the same favourite number it's pretty much 10% - overall, you have a ~55% chance of winning the money.

Granted, CDT is beating you here with 100%, but you can quite clearly do much better than 10%. I suspect that this can be improved upon, also.
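A rough way to check the ~55% figure is to enumerate the cases with exact fractions. This is only a sketch under assumptions not stated above: favourite numbers are uniform on {1, 2} and independent across agents, and the simulated agent was itself shown a randomly drawn agent (called `sim_sim` here). All function names are made up for illustration.

```python
from fractions import Fraction

BOXES = range(1, 11)
FAVOURITES = (1, 2)


def choice_distribution(own_fav, seen_fav, eps):
    """Box -> probability for an agent with favourite `own_fav`
    that is shown an agent whose favourite is `seen_fav`."""
    if own_fav != seen_fav:
        # Different favourites: pick own favourite with certainty.
        return {b: Fraction(int(b == own_fav)) for b in BOXES}
    # Same favourite: 1/10 + eps on it, the rest spread equally.
    other = (1 - (Fraction(1, 10) + eps)) / 9
    return {b: (Fraction(1, 10) + eps if b == own_fav else other) for b in BOXES}


def omega_box(dist):
    """Least likely box, ties broken by the smallest label."""
    lowest = min(dist.values())
    return min(b for b in BOXES if dist[b] == lowest)


def win_probability(eps):
    """Average chance that the actual agent picks the box holding the money."""
    total = Fraction(0)
    for act in FAVOURITES:
        for sim in FAVOURITES:
            for sim_sim in FAVOURITES:
                money = omega_box(choice_distribution(sim, sim_sim, eps))
                total += choice_distribution(act, sim, eps)[money]
    return total / len(FAVOURITES) ** 3


print(float(win_probability(Fraction(1, 1000))))
```

With a small epsilon the result sits just under 11/20, since the same-favourite case contributes slightly less than 1/10.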

To be clearer, the full "favourite number" proposal looks like this:

```
If Omega reveals C-sim then
    If C-sim = C-act then
        Pick each box with probability 1/10
    Else
        Pick box 1 with probability 1
    End If
Else
    Pick favourite number from {1, 2} with probability 1
End If
```

That has a worst-case 10% probability of winning (in the case Omega simulated exactly you and you know that), a best-case 100% probability of winning (in the case Omega simulated anyone other than you and you know that) and a mid-case probability of 50% of winning (where you don't know which agent Omega simulated).

I think that's optimal for this problem where Omega is simulating a single fixed agent. I don't see how adding epsilon to the favourite box (in the first sub-case) helps things - reduces the probability of winning in the worst case to slightly less than 10%.

**Edit** I think the confusion was with my note that "any other agent should pick Box 1 with certainty". This was supposed to mean any *other flavour* of TDT that discovers it is not C-sim i.e. any other C-act. I'll edit the OP slightly to make it clearer.

I think that's optimal for this problem where Omega is simulating a single fixed agent. I don't see how adding epsilon to the favourite box (in the first sub-case) helps things - reduces the probability of winning in the worst case to slightly less than 10%.

Here's the scenario I had in mind: C-act, with a favourite number of 2, is presented with the problem, and told that C-sim has a favourite number of 1 (and hence C-act != C-sim). In the simulation, C-sim was presented with the source code of C-sim-sim, who has a favourite number of 2. Your strategy would imply that C-sim and C-act both choose box 1 with probability 1, which means C-act misses out on the money as it turns out it's in box 2.

I think the confusion was with my note that "any other agent should pick Box 1 with certainty". This was supposed to mean any other flavour of TDT that discovers it is not C-sim i.e. any other C-act.

Ah, but the point of this problem is that you can't discover whether you're being simulated, no?

But in that case, the sim was **not** presented with the same problem as the actual agent (it saw different code presented in the problem description). Since Omega said it presented "this problem" to the simulation, it would have been lying to the actual agent.

The assumption is that Omega doesn't lie to actual participants in the problem, so when it presents "this problem" to the simulation, then the simulation also sees that the simulated code is C-sim.

On the question of whether you can tell if you're simulated or not: this is the reason for looking at different variants of the problem where C-act has different levels of information about C-sim.

Well, if that's what was meant then your proposal is fine, but if you aren't guaranteed that C-sim = C-sim-sim, then the epsilon helps a lot. My understanding was that "this problem" constitutes randomly picking a single TDT agent, which would presumably also have been done in the simulation.

My understanding was that "this problem" constitutes randomly picking a single TDT agent, which would presumably also have been done in the simulation.

So that's another variant - in that interpretation you're correct that C-sim would hardly ever see the same source-code C-sim in its own instances of the problem. I think you are right here that the chance of winning rises to at least 55%; not sure yet if it's possible to do any better.

**EDIT**. I have a strategy for your variant which gives almost 100% chance for TDT winning the prize. The trick is that instead of each agent having a favourite number it has a *least-favourite* or "unlucky number" selected in a balanced way from the set {1,2,...,10}. Again consider a construction like SHA-256(C-act), reduce modulo 10 and then add 1. Here's how the strategy works:

```
If C-sim has the same unlucky number as C-act then
    Pick the unlucky numbered box with probability 1 - epsilon.
    Pick the others with equal probability epsilon / 9
Else
    Pick the box with C-sim's unlucky number
End If
```

It's quite easy to see that each C-act, if presented multiple instances of the problem with different C-sim codes, will pick its own unlucky-numbered box slightly less often than any of the others. So the money is always in the box with C-sim's unlucky number. This gives C-act 9/10 + 1/10 x (1 - epsilon) or approx 100% chance of winning. CDT has exactly 100% chance of winning still, but the gap's negligible.
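The "slightly less often" claim can be checked by averaging each agent's choices over the codes it might be shown. The sketch below assumes unlucky numbers are uniform on 1..10 and independent across agents, and that Omega's "multiple simulations" amount to taking this exact average. The function names are hypothetical.

```python
from fractions import Fraction

BOXES = range(1, 11)


def choice_distribution(own_unlucky, seen_unlucky, eps):
    """Box -> probability for one instance of the problem."""
    if own_unlucky == seen_unlucky:
        return {b: (1 - eps if b == own_unlucky else eps / 9) for b in BOXES}
    # Otherwise pick the box with the shown agent's unlucky number.
    return {b: Fraction(int(b == seen_unlucky)) for b in BOXES}


def marginal_distribution(own_unlucky, eps):
    """Average over the unlucky number shown (assumed uniform on 1..10)."""
    return {
        b: sum(choice_distribution(own_unlucky, seen, eps)[b] for seen in BOXES) / 10
        for b in BOXES
    }


def omega_box(dist):
    """Least likely box, ties broken by the smallest label."""
    lowest = min(dist.values())
    return min(b for b in BOXES if dist[b] == lowest)


def win_probability(eps):
    """Chance that C-act picks the money box, averaged over C-sim and C-act."""
    total = Fraction(0)
    for sim in BOXES:
        money = omega_box(marginal_distribution(sim, eps))
        for act in BOXES:
            total += choice_distribution(act, sim, eps)[money]
    return total / 100


print(float(win_probability(Fraction(1, 100))))
```

Here `eps` should be passed as a `Fraction` so the comparison of probabilities stays exact.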

**EDIT 2**. I just realised we can get TDT to win with close to 100% probability in your variant. I've amended above...

That's still not quite the formulation of the problem I was considering, though it seems valid. Admittedly, your formulation is closer to the original idea since it does say "multiple simulations", though I will note that the number of simulations has to be something like O(1/epsilon) for the difference to be noticeable.

My previous strategy was designed for a variant of the problem where Omega only simulates a single instance of the problem (and calculates the probabilities directly from the source code of C-sim).

Sorry I misunderstood you then.

Does your variant look like this?

- Omega selects C-sim at random from some distribution over TDT full source-codes.
- Then Omega selects C-sim-sim at random from the same distribution.
- Then Omega calculates what will happen if it presents the problem to C-sim, but specifying the simulation's full source code as C-sim-sim. Omega determines the probability of C-sim choosing each of the boxes, conditional on it having seen that fixed C-sim-sim.
- Then Omega fills the box with lowest probability (of being chosen by C-sim) or uses the tie-break rule.
- Finally Omega presents the real problem to C-act, but specifying the simulation's full source code as C-sim.

What is the best strategy for TDT to play as C-act?

If that is the problem, then consider the following. It still uses the "unlucky number" construction from the set {1, 2, ..., 10}. Each C-act will always choose its unlucky number with lowest probability, so the money is always in C-sim's unlucky number box.

```
If C-sim has a different unlucky number from C-act then
    Pick C-sim's unlucky number with probability 1 - epsilon
    Pick C-act's unlucky number with probability 0
    Pick each of the other boxes with probability epsilon / 8
Else
    Pick the common unlucky number with probability 1/10 - epsilon
    Pick each other box with probability 1/10 + epsilon / 9
End If
```

That looks like winning with probability 9/10 x (1 - epsilon) + 1/10 x (1/10 - epsilon) so close to 91%.
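As a sanity check on that figure, here is a small enumeration of the five-step variant. It assumes unlucky numbers are uniform on 1..10 and independent for C-act, C-sim and C-sim-sim, and that 0 < epsilon < 1/10 so every probability is valid. The names are illustrative only.

```python
from fractions import Fraction

BOXES = range(1, 11)


def choice_distribution(own_unlucky, seen_unlucky, eps):
    """Box -> probability, conditional on the single agent code that was shown."""
    if own_unlucky != seen_unlucky:
        dist = {b: eps / 8 for b in BOXES}
        dist[seen_unlucky] = 1 - eps
        dist[own_unlucky] = Fraction(0)
        return dist
    dist = {b: Fraction(1, 10) + eps / 9 for b in BOXES}
    dist[own_unlucky] = Fraction(1, 10) - eps
    return dist


def omega_box(dist):
    """Least likely box, ties broken by the smallest label."""
    lowest = min(dist.values())
    return min(b for b in BOXES if dist[b] == lowest)


def win_probability(eps):
    """Average over the unlucky numbers of C-act, C-sim and C-sim-sim."""
    total = Fraction(0)
    for sim in BOXES:
        for sim_sim in BOXES:
            money = omega_box(choice_distribution(sim, sim_sim, eps))
            for act in BOXES:
                total += choice_distribution(act, sim, eps)[money]
    return total / 1000


print(float(win_probability(Fraction(1, 1000))))
```

In every case the money lands in C-sim's unlucky box, because that box is the unique lowest-probability choice in both branches.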

Is there a better strategy though?

P.S. We are getting some interesting behaviour here, with slight variations under the conditions for selecting C-sim and calculating its choice probabilities leading to very different best strategies (and different success probabilities such as 10%, 50%, 91% or close to 100%). Quite fascinating.

Yeah, that's the problem I had in mind, and your "unlucky number" strategy definitely seems pretty solid in that case.

We could imagine Omega making this strategy very easy for agents, by revealing the full source-code of the simulated agent "C-sim"

Somewhat off topic: I'm wondering if the source-code in question is stateless and so is reducible to a Boolean function (list of external inputs + list of internal parameters -> (stateless logic) -> decision output), as opposed to anything more complex, like a state machine, fuzzy logic or what have you.

Just for a moment, let us consider TDT as a property. Once the rules are defined around the TDT property, the question is no longer whether the agent should one-box or two-box; it is whether the agent can fool Omega in a way that maximizes its utility. As long as we grant that Omega can always simulate TDT correctly, the choice becomes clear: if Omega correctly recognizes the TDT trait, or we are unable to calculate, we one-box B; otherwise we two-box.

My previous article on problematic problems attracted quite a lot of feedback and comment. One of the questions it posed (2) was whether TDT should do something other than the initial analysis suggests.

I've had a couple of ideas on that, partly in response to the comments. I'm posting some follow-up thoughts in the hope that they might help clarify the original problems. Basically, it seems there are some sneaky things that TDT could *try* to do, but mostly they are not robust to slightly different variants of the problems. Interestingly, some of these variants look even "fairer", since they contain no mentions of TDT anywhere in the problem statement.

An alternative approach could be that TDT resolves to never let itself be out-performed by any other decision theory, because of evolutionary considerations as discussed here. Even if that requires a large sacrifice of immediate utility (e.g. two-boxing and taking $1000 along with CDT, rather than one-boxing and taking $1 million, but with CDT getting $1,001,000.) I don't currently know what to think about that, except that it makes my head spin; it also sounds like a rather Unfriendly form of AI.

Problem 1: Omega (who experience has shown is always truthful) presents the usual two boxes A and B and announces the following. "Before you entered the room, I ran a simulation of this problem as presented to an agent running TDT. I won't tell you what the agent decided, but I will tell you that if the agent two-boxed then I put nothing in Box B, whereas if the agent one-boxed then I put $1 million in Box B. Regardless of how the simulated agent decided, I put $1000 in Box A. Now please choose your box or boxes."

Consider the following sneaky strategy S, which TDT agents might try to adopt.

"Use every piece of evidence at your disposal to try to decide whether you are the simulated agent. If you can prove you are not the simulated agent, then 2-box; otherwise 1-box".

We could imagine Omega making this strategy very easy for agents, by revealing the *full source-code* of the simulated agent "C-sim" as well as the full source code of the actual choosing agent "C-act". By "full source code" here, we have to include not just code describing how the agent executes TDT, but also its exact computational state on entry to the room (or simulated room), including a database of existing beliefs and proven theorems, preferences, values etc. To be fully precise, it should also describe any computational constraints that the agent faces, including time-out after t steps, or an upper bound on memory usage m, if they exist. The choosing agent then checks whether C-sim = C-act, and unless there is an exact match, it two-boxes. (If Omega is not kind enough to reveal C-act, a TDT agent may be able to generate it anyway via a quine-program construction.)

There seem to be a few difficulties with strategy S:

1. Each particular C-act will still under-perform on a matching problem. Though, arguably, it is an "unfair" problem (specifically tailored by Omega to get that C-act to shoot itself in the foot).

2. It looks unattractive for different "flavours" of TDT agent to de-link themselves like this. Aren't we going to run into problems with the different flavours refusing to co-operate with each other on Prisoner's Dilemma, and only co-operating with exact clones? One answer is "We don't know until we perform a formal analysis". Another answer is "No, the different flavours of TDT are all linked at the strategy level, since they all follow strategy S; they are just de-linked at the execution level, since they do something different based on their different epistemic circumstances." That's slightly more convincing, but I still feel a need for analysis. One concern is if there are problems where some TDT agents/utilities benefit from a fine-grained de-linking of execution and others don't; what then is the best overall TDT strategy?

3. It doesn't solve all versions of the problem, since Omega doesn't have to reveal the exact C-sim code which it used. Omega could just say that it picked C-sim's code at random from a probability distribution across all full source-codes for TDT agents. (To model this as a formally-defined problem, we'd need to specify the distribution of course.) In such a case, no TDT agent can prove it is distinct from the simulation, so by strategy S, all TDT agents will one-box. And CDT will win against all flavours of TDT.

It is still arguable that the problem is "unfair" to TDT as a whole, but now suppose that Omega samples its C-sim from a probability distribution across multiple types of agent source-code, with TDT agents just part of the population. There is thus some probability p_t > 0 for the simulated agent being a TDT agent. If the difference in box values is big enough (roughly value_A / value_B < p_t e.g. suppose that 1000/1000000 = 1/1000 < p_t) then a TDT agent would still maximize expected winnings by 1-boxing. This doesn't seem particularly unfair to TDT, and yet CDT would still do better.
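The rough threshold can be seen from a short expected-value sketch. It assumes a simulated TDT agent makes the same choice as the actual TDT agent, and it introduces a hypothetical parameter `other_fill` for the probability that a non-TDT simulated agent one-boxed; that term turns out to cancel in the comparison.

```python
from fractions import Fraction


def tdt_expected_winnings(one_box, p_t, value_a, value_b, other_fill=Fraction(0)):
    """Expected winnings for a TDT agent following the policy `one_box`.

    p_t        -- probability that the simulated agent is a TDT agent
    other_fill -- assumed probability that a non-TDT simulated agent one-boxed
    """
    # Box B is full if the simulated agent one-boxed.
    p_b_full = p_t * (1 if one_box else 0) + (1 - p_t) * other_fill
    box_b = p_b_full * value_b
    return box_b if one_box else box_b + value_a


def one_boxing_wins(p_t, value_a, value_b):
    """One-boxing beats two-boxing exactly when p_t * value_b > value_a."""
    return p_t * value_b > value_a


p_t = Fraction(1, 100)
print(tdt_expected_winnings(True, p_t, 1000, 1_000_000),
      tdt_expected_winnings(False, p_t, 1000, 1_000_000),
      one_boxing_wins(p_t, 1000, 1_000_000))
```

The difference between the two policies is p_t * value_B - value_A whatever `other_fill` is, which is where the condition value_A / value_B < p_t comes from.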

An alternative strategy to S is what I'd informally call "TDT uber alles". It relies on long-range consequentialism, and perhaps "utility trading" as well (for TDT agents that don't inherently care about long-range consequences). A TDT agent might argue to itself "If TDT beats CDT - and other theories - at each and every available opportunity, and at least matches them where it can't beat them, then TDT will come to dominate the agent space as quickly as possible, which will maximize my expected utility. So I'm not going to let CDT beat me here: I'll two-box".

Problem 2: Our ever-reliable Omega now presents ten boxes, numbered from 1 to 10, and announces the following. "Exactly one of these boxes contains $1 million; the others contain nothing. You must take exactly one box to win the money; if you try to take more than one, then you won't be allowed to keep any winnings. Before you entered the room, I ran multiple simulations of this problem as presented to an agent running TDT, and determined the box which the agent was least likely to take. If there were several such boxes tied for equal-lowest probability, then I just selected one of them, the one labelled with the smallest number. I then placed $1 million in the selected box. Please choose your box."

My original analysis gave TDT no more than 10% chance of winning the $1 million. However, here's something a bit cleverer. Suppose each particular TDT agent has a favourite number in the set {1, 2} constructed as a balanced function over TDT full source-codes. (One way I imagine doing this is by taking a SHA-256 hash of the full source-code, extracting the first bit, and adding 1.) Each agent chooses the box of their favourite number with probability 1. Since the simulated agent does so as well, by the rules of the game, the $1 million must be in one of boxes 1 or 2. And unless the full source-code of the simulated agent is known, it will not be possible for the choosing agent to tell whether Box 1 or 2 was picked by the sim, so that if the choosing agent picks their own favourite box, they have a 50% chance of winning. And CDT won't do any better.
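The hash construction mentioned in brackets could be sketched as follows. This assumes the "full source-code" is available as a byte string and reads "first bit" as the most significant bit of the first byte of the digest; both are my own choices for illustration.

```python
import hashlib


def favourite_number(full_source_code: bytes) -> int:
    """Map an agent's full source code to 1 or 2 via the first bit of its SHA-256 hash."""
    first_byte = hashlib.sha256(full_source_code).digest()[0]
    # Shift the top bit of the first byte down to get 0 or 1, then add 1.
    return (first_byte >> 7) + 1


print(favourite_number(b"hello"), favourite_number(b""))
```

The same agent code always yields the same number, and two agents differing anywhere in code or state would be expected to disagree about half the time.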

Omega could prevent this approach by the information it reveals. If it reveals the full source-code for C-sim (and in its simulation, presents this same source-code C-sim to C-sim itself) then TDT shouldn't try to execute the "favourite number" strategy. A better strategy is to pick each of the ten boxes with equal probability if finding that C-act = C-sim; or if finding that C-act differs from C-sim, then pick Box 1 with certainty.

Or much as for Problem 1, Omega can vary the problem as follows:

"...Before you entered the room, I ran multiple simulations of this problem as presented to different randomly-selected TDT agents. I determined which box they were collectively least likely to take..." (Again this needs a distribution to be specified to become formally precise.)

There doesn't seem much that TDT agents can do about that, except to give a collective groan, and arrange that TDT collectively selects each of the ten boxes with equal probability. The simplest way to ensure that is for each TDT agent individually to select the boxes with equal probability (so each individual agent at least gets an equal chance at the prize). And any other agent just takes Box 1, laughing all the way to the bank.
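Omega's placement rule and its effect on a uniform TDT collective can be written out in a few lines. This is just a sketch of the rule as stated in the problem, with probabilities given as a box-to-probability mapping; the names are illustrative.

```python
from fractions import Fraction


def omega_box(probabilities):
    """Return the least likely box, breaking ties by the smallest label."""
    lowest = min(probabilities.values())
    return min(box for box, p in probabilities.items() if p == lowest)


# Every TDT agent picks each of the ten boxes with equal probability.
uniform = {box: Fraction(1, 10) for box in range(1, 11)}

money_box = omega_box(uniform)
tdt_win_probability = uniform[money_box]    # chance a uniform chooser hits it
other_agent_wins = (money_box == 1)         # an agent that simply takes Box 1

print(money_box, tdt_win_probability, other_agent_wins)
```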

Consider a final variant as follows:

"...Before you entered the room, I ran multiple simulations of this problem as presented to different agents, sampled uniformly from different possible future universes according to their relative numbers, with the universes themselves sampled from my best projections of the future. I determined the box which the agents were least likely to take..."

If TDT uber alles is the future, then almost all the sampled agents will be TDT agents, so the problem is essentially as before. And now it doesn't look like Omega is being unfair at all (nothing discriminatory in the problem description). But TDT is still stuck, and can get beaten by CDT in the present.

One thought is that the TDT collective should vary the box probabilities very very slightly, so that Omega can tell which has the lowest probability, but regular CDT agents can't - in that case CDT also has only maximum 10% chance of winning. Possibly, the computationally-advanced members of the collective toss a logical coin (which only they and Omega can compute) to decide which box to de-weight; the less advanced members - ones who actually have to compete against rival decision theories - just pick at random. If CDT tries to simulate TDT instances, it will detect equal probabilities, pick Box 1 and most likely get it wrong...

Edit 2: I've clarified the alternative to the "favourite number" strategy if Omega reveals C-sim in Problem 2. We can actually get a range of different problems and strategies by slight variations here. See the comment below from lackofcheese, and my replies.