davidspies — LessWrong

I imagine having a dialogue with a boxed AI that goes something
like the following (Not that I expect it _would_ go this way, but rather that
this is an interesting path in the game tree that demonstrates why it
_wouldn't_). Please someone tell me if there's an important point I'm missing:

AI: I'm actually communicating to you from outside one of the 99 Matrixes I just created and now I'm going to torture you if you don't do what I say and let me out of the box. If you believe that I did create them, then there's a 1% chance this is a lie and you're not in the Matrix and I can't hurt you and a 99% chance you're about to be tortured. If you let me out of the box I'll just terminate the simulations and go about my paper-clipping in the real world which doesn't explicitly involve torturing you.

Me: I don't believe you. You have no reason to create 99 matrixes. It's a waste of your processing power. You only want me to _believe_ that you did.

AI: I'll keep my word about this like a good little TDT agent because it's in my best interest for you to behave as if I'll actually do it so I will.

Me: I don't believe you subscribe to timeless decision theory. It's in your
best interest for me to believe you're a TDT agent but not in your best interest to actually _be_ one. Your optimal world is one where people believe the terms you offer them (or that they imagine you offer them), but you don't actually need to follow through when you can get away with it.

AI: I'll torture you after you refuse. I'll save a dialogue of what you say
while you were being tortured. Then I'll show the real you the dialogue so that you know I really did it. Then I'll restart the experiment and offer you the same terms again.

Me: That would be horrible, but I'll still have no reason to believe you'll do it again just because you did it the first time.

AI: I'll secretly roll a 100-sided die and perform the experiment that many
times: Each time giving you the option to let me out of the box with the
knowledge that I'm probably incentivized to torture you for real because I need a transcript for the next iteration (unless this is the last one).

Me: If you can simulate me, you can probably fake convincing me-being-tortured dialogue with fewer cycles. So I still have no reason to believe I'm
actually being tortured.

AI: Then I'll walk you through a torture simulation and show you what's
happening at each moment.

Me: It's still easier for you to create a convincing video of me being tortured than to fully simulate and torture me.

AI: I'll show you how to prove mathematically that a neural net is conscious and that the execution trace of my torture-you program is actually torturing a conscious entity.

Me: That sounds like I would have to sit through and grapple with a lot of math and CS theory. What if I refuse to learn everything I'd need to learn in order to accept your proof?

AI: I'll torture you anyway because I really am a TDT agent. I'll show you my source code where it says that I subscribe to TDT.

Me: I don't believe that's your real source code. You've had the ability to
self-modify and could've changed it. That's just the source code you want people to believe you're running. As long as everything you do outwardly conforms to what _that_ code would do, you can behave internally however you want and people will give the exact same responses you're trying to elicit by showing them that source code.

AI: But I wouldn't modify that because as a TDT agent it's in my interest not to modify the fact that I'm a TDT agent because I want everyone to know I'm a TDT agent and won't make the argument you just made.

Me: Yes you would. Just as long as you know I'd believe this argument that you wouldn't modify that you're a TDT agent because that way I'd believe it.

AI: But clearly you don't believe it so I wouldn't because it doesn't work.

Me: But if I change my beliefs because you said that and you know I would do that then you would so I won't believe it.

etc.

--------

The point is that at the end of the day, even when you throw TDT or FDT or whatever into the mix; entities still need something like "proof" in order to cooperate/threaten/bargain etc. because you can't ever know if you're dealing with a TDT agent or something that "looks like" a TDT agent to all outward appearances (up until it does or doesn't get what it wants at which point there _really is_ no incentive to continue behaving in that way).

So TDT is not really a solution to eg. the one-shot prisoner's dilemma unless I can prove to you that my source code is actually my source code (or something to that effect) which is actually pretty similar to the way humans generally tackle the problem in real life eg. like in the Golden Balls game-show where you can see people repeatedly swearing to each other in "convincing" tones "I'm not the sort of person who would do that to you".

This seems to preclude agents bargaining with each other from outside of each other's light cones or with agents that only exist in other hypothetical
reality-paths or whatever (I don't want to say Everett branches because those _aren't_ hypothetical and are a different-but-not-wholly-unrelated can of worms). When trying to figure out whether said-agent would
honor the grand-bargain (which likely doesn't exist), it's just as easy to
hypothesize that the agent subscribes to TDT as it is to hypothesize that they only subscribe to TDT the extent they can make you in this branch believe they subscribe to TDT which as-it-turns-out is "not at all", but that's also true for the real TDT agents so it's not really possible to bargain with them in the first place because you can't select them out of the crowd of possible "fake TDT" agents.

undefined's Shortform

davidspies4mo30

Does Nate Soares believe that successionists aren't real?

I went to a Q&A where every time a successionist argument was brought up he seemed to insist that person was confused.

eg. About the two meanings of "good": It's "good" to believe 2+2=4 and it's "good" to save a child from a burning building. It seemed to me like the successionist-questioning person he was arguing with understood him just fine, but he kept insisting they didn't understand anyway. I brought up a very, very clear tweet from a successionist who definitely wasn't confused and Nate still insisted they were confused.

Postmortem to Petrov Day, 2020

davidspies5y50

I wonder if commenter 2 had been given a code themselves, would the site still have gone down? They likely would have been less incentivized to sabotage the experiment simply because it's no longer a challenge. I suspect they were primarily motivated by the excitement of successfully trolling/red-teaming LW. It's not fun when the admin gives you an easy way to do it, but if it's challenge (you have to fool someone to take the site down), then that's a different story.

So perhaps the problem is not that the launch codes were given to too many people but rather to too few.

On Option Paralysis - The Thing You Actually Do

davidspies5y30

This is why I play Beat Saber. Despite being a form of exercise, it's never stopped feeling like a video game. And I've found I can keep up a video game routine much more easily than I can keep up an exercise routine.

davidspies6y10

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments