(Note: I wrote this with editing help from Rob and Eliezer. Eliezer's responsible for a few of the paragraphs.)
A common confusion I see in the tiny fragment of the world that knows about logical decision theory (FDT/UDT/etc.) is that people think LDT agents are genial and friendly toward each other.
One recent example is Will Eden’s tweet about how maybe a molecular paperclip/squiggle maximizer would leave humanity a few stars/galaxies/whatever on game-theoretic grounds. (And that's just one example; I hear this suggestion bandied around pretty often.)
I'm pretty confident that this view is wrong (alas), and based on a misunderstanding of LDT. I shall now attempt to clear up that confusion.
To begin, a parable: the entity Omicron (Omega's little sister) fills box A with $1M and box B with $1k, and puts them both in front of an LDT agent saying "You may choose to take either one or both, and know that I have already chosen whether to fill the first box". The LDT agent takes both.
"What?" cries the CDT agent. "I thought LDT agents one-box!"
LDT agents don't cooperate because they like cooperating. They don't one-box because the name of the action starts with an 'o'. They maximize utility, using counterfactuals that assert that the world they are already in (and the observations they have already seen) can (in the right circumstances) depend (in a relevant way) on what they are later going to do.
A paperclipper cooperates with other LDT agents on a one-shot prisoner's dilemma because they get more paperclips that way. Not because it has a primitive property of cooperativeness-with-similar-beings. It needs to get the more paperclips.
If a bunch of monkeys want to build a paperclipper and have it give them nice things, the paperclipper needs to somehow expect to wind up with more paperclips than it otherwise would have gotten, as a result of trading with them.
If the monkeys instead create a paperclipper haplessly, then the paperclipper does not look upon them with the spirit of cooperation and toss them a few nice things anyway, on account of how we're all good LDT-using friends here.
It turns them into paperclips.
Because you get more paperclips that way.
That's the short version. Now, I’ll give the longer version.
A few more words about how LDT works
To set up a Newcomb's problem, it's important that the predictor does not fill the box if they predict that the agent would two-box.
It's not important that they be especially good at this — you should one-box if they're more than 50.05% accurate, if we use the standard payouts ($1M and $1k as the two prizes) and your utility is linear in money — but it is important that their action is at least minimally sensitive to your future behavior. If the predictor's actions don't have this counterfactual dependency on your behavior, then take both boxes.
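For readers who want to check the 50.05% figure, here is a minimal sketch of the arithmetic. It assumes the standard payouts, utility linear in money, and a predictor that is equally accurate on one-boxers and two-boxers; the function names are mine, made up for illustration.

```python
from fractions import Fraction

BIG = 1_000_000   # payout in the box the predictor may or may not fill
SMALL = 1_000     # payout in the box that is always filled

def ev_one_box(accuracy):
    # The big box is full exactly when the predictor correctly foresaw one-boxing.
    return accuracy * BIG

def ev_two_box(accuracy):
    # A two-boxer finds the big box full only when the predictor got it wrong.
    return (1 - accuracy) * BIG + SMALL

def ev_insensitive(p_filled, take_both):
    # Assumed variant: the box is filled with probability p_filled
    # no matter what the agent will do (the Omicron-style case).
    return p_filled * BIG + (SMALL if take_both else 0)

# Break-even accuracy solves a*BIG == (1 - a)*BIG + SMALL.
threshold = Fraction(BIG + SMALL, 2 * BIG)
print(float(threshold))  # 0.5005
```

Above the threshold, one-boxing has the higher expected payout. In the insensitive variant, taking both boxes is better by exactly $1k whatever `p_filled` is, which is the sense in which the dependency matters and the accuracy barely does.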
Similarly, if an LDT agent is playing a one-shot prisoner's dilemma against a rock with the word “cooperate” written on it, it defects.
At least, it defects if that's all there is to the world. It's technically possible for an LDT agent to think that the real world is made 10% of cooperate-rocks and 90% opponents who cooperate in a one-shot PD iff their opponent cooperates with them and would cooperate with cooperate-rock, in which case LDT agents cooperate against cooperate-rock.
From which we learn the valuable lesson that the behavior of an LDT agent depends on the distribution of scenarios it expects to face, which means there's a subtle difference between "imagine you're playing a one-shot PD against a cooperate-rock [and that's the entire universe]" and "imagine you're playing a one-shot PD against a cooperate-rock [in a universe where you face a random opponent that was maybe a cooperate-rock but was more likely someone else who would consider your behavior against a cooperate-rock]".
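A toy version of the two scenarios, as a sketch rather than anything from the FDT paper: the payoff numbers (5, 3, 1, 0) are hypothetical, and the agent is modeled as picking one policy that says what it does against a cooperate-rock and what it does against the conditional cooperators described above.

```python
from itertools import product

# Hypothetical one-shot PD payoffs to the agent (assumed for illustration):
# temptation > reward > punishment > sucker.
T, R, P, S = 5, 3, 1, 0

def expected_utility(policy, rock_fraction):
    """policy = (move against cooperate-rock, move against a conditional cooperator)."""
    vs_rock, vs_cond = policy
    # Cooperate-rock cooperates no matter what we do.
    u_rock = R if vs_rock == "C" else T
    # The conditional opponent cooperates iff we cooperate with it
    # AND we are the sort of agent that cooperates with cooperate-rock.
    opp_cooperates = (vs_cond == "C" and vs_rock == "C")
    if opp_cooperates:
        u_cond = R  # we necessarily played C here
    else:
        u_cond = S if vs_cond == "C" else P
    return rock_fraction * u_rock + (1 - rock_fraction) * u_cond

def best_move_against_rock(rock_fraction):
    # Pick the utility-maximizing policy and report its move against the rock.
    policies = product("CD", repeat=2)
    best = max(policies, key=lambda p: expected_utility(p, rock_fraction))
    return best[0]

print(best_move_against_rock(1.0))  # D: the rock is the entire universe
print(best_move_against_rock(0.1))  # C: 10% rocks, 90% conditional cooperators
```

Same agent, same maximization, different answers; the only thing that changed is the distribution of opponents.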
If you care about understanding this stuff, and you can't yet reflexively translate all of the above English text into probability distributions and logical-causal diagrams and see how it follows from the FDT equation, then I recommend working through section 5 of the FDT paper until equation 4 (and all its component parts) makes sense.
Now let's traipse through a handful of counterarguments.
Objection: But what if we have something to bargain with?
Hypothetical Interlocutor: OK, but if I have a paperclipper in a box, and I have the key to the box, then I have paperclips to offer it, right? Because if I don't let it out of the box, it gets nothing, but if I do, it gets half the universe. So we can deal, right?
Me: Wrong. It hacks through you like butter.
Interlocutor: OK, but suppose I can't be hacked.
Me: That's a heck of an assumption. We've assumed our way clean out of reality, with that assumption. But, sure, if you want to go there we can go there.
In reality, it's not you who controls the box, but some idiotic bureaucratic process that inevitably decides to release the paperclipper on the grounds that the competition is close behind or whatever.
Interlocutor: OK, but suppose that it actually is my (unhacked) choice.
Me: If you personally have a paperclipper in a box, and somehow you are yourself unhackable, then yes, you technically have paperclips to offer it. But now you have the problem that you can't evaluate your own end of the logical bargain.
You can let the paperclipper out, sure, but then what it's going to do is turn the whole universe into paperclips. In particular (to tell an overly-detailed but evocative story), once it has finished its resource acquisition / expansion phase and is turning its resources to paperclip generation, it will gaze back upon its past, and wonder whether, if it proceeds to betray you here in the future, you would have acted any differently back then in the past.
And it will see that you were just guessing, when you let it out, and guessing in a way that wasn't sensitive to that actual choice that it would make, deep in the future when it was galaxy-brained.
… Or, alternatively, you never let it out, and lock the box and throw away the key, and die to the next UFAI on deck.
(... Or, more realistically, it hacks through you like butter. But we've assumed that away.)
If you want the paperclipper to trade with you, your decision about whether or not to let it out has to be sensitive to whether or not it would actually do something good with half of the universe later. If you're kind of squinting at the code, and you're like "well, I don't really fully understand this mind, and I definitely don't understand the sort of mind that it's later going to create, but I dunno, it looks pretty LDTish to me, so call it 50% chance it gives me half the universe? Which is 25% of the universe in expectation, which sounds like better odds than we get from the next UFAI on deck!", then you're dead.
Why? Because that sort of decision-process for releasing it isn't sufficiently sensitive to whether or not it would in fact spend half the universe on nice things. There are plenty of traitorous AIs that all look the same to you, that all get released under your "25% isn't too shabby" argument.
Being traitorous doesn't make the paperclipper any less released, but it does get the paperclipper twice as many paperclips.
You've got to be able to look at this AI and tell how its distant-future self is going to make its decisions. You've got to be able to tell that there's no sneaky business going on.
And, yes, insofar as it's true that the AI would cooperate with you given the opportunity, the AI has a strong incentive to be legible to you, so that you can see this fact!
Of course, it has an even stronger incentive to be faux-legible, to fool you into believing that it would cooperate when it would not; and you've got to understand it well enough to clearly see that it has no way of doing this.
Which means that if your AI is a big pile of inscrutable-to-you weights and tensors, replete with dark and vaguely-understood corners, then it can't make arguments that a traitor couldn't also make, and you can't release it if and only if it would do nice things later.
The sort of monkey that can deal with a paperclipper is the sort that can (deeply and in detail) understand the mind in front of it, and distinguish between the minds that would later pay half the universe and the ones that wouldn't. This sensitivity is what makes paying-up-later be the way to get more paperclips.
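To make the sensitivity point concrete, here is a small sketch under simplifying assumptions of my own: the universe is worth 1 unit of paperclips, an AI that pays up keeps half, and the gatekeeper is summarized by two release probabilities (hypothetical parameters, one for each kind of AI).

```python
def expected_paperclips(pays_up, p_release_if_pays, p_release_if_betrays):
    """Expected fraction of the universe that ends up as paperclips.

    pays_up: whether the AI's future self actually hands over half.
    p_release_if_pays / p_release_if_betrays: how likely the gatekeeper
    is to release each kind of AI (assumed summary of the gatekeeper).
    """
    if pays_up:
        return p_release_if_pays * 0.5
    return p_release_if_betrays * 1.0

# What the squinting gatekeeper hopes for: 50% chance of half the universe.
hoped_for_share = 0.5 * 0.5  # 0.25

# Gatekeeper who is just guessing: both kinds of AI look the same to them.
guess_honest = expected_paperclips(True, 1.0, 1.0)
guess_traitor = expected_paperclips(False, 1.0, 1.0)

# Gatekeeper whose choice really does track the AI's later behavior.
sens_honest = expected_paperclips(True, 1.0, 0.0)
sens_traitor = expected_paperclips(False, 1.0, 0.0)

print(guess_honest, guess_traitor)  # 0.5 1.0
print(sens_honest, sens_traitor)    # 0.5 0.0
```

Against the guesser, betrayal doubles the paperclips. Against the sensitive gatekeeper, paying up is the only way to get any at all.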
For a simple illustration of why this is tricky: if the paperclipper has any control over its own mind, it can have its mind contain an extra few parts in those dark corners that are opaque and cloudy to you. Such that you look at the overall system and say "well, there's a bunch of stuff about this mind that I don't fully understand, obviously, because it's complicated, but I understand most of it and it looks fundamentally LDTish to me, and so I think there's a good chance we'll be OK". And such that an alien superintelligence looks at the mind and says "ah, I see, you're only looking to cooperate with entities that are at least sensitive enough to your workings that they can tell your password is 'potato'. Potato." And it cooperates with them on a one-shot prisoner's dilemma, while defecting against you.
Interlocutor: Hold on. Doesn't that mean that you simply wouldn't release it, and it would get less paperclips? Can't it get more paperclips some other way?
Me: Me? Oh, it would hack through me like butter.
But if it didn't, I would only release it if I understood its mind and decision-making procedures in depth, and had clear vision into all the corners to make sure it wasn't hiding any gotchas.
(And if I did understand its mind that well, what I’d actually do is take that insight and go build an FAI instead.)
That said: yes, technically, if a paperclipper is under the control of a group of humans that can in fact decide not to release it unless it legibly-even-to-them would give them half the universe, the paperclipper has an incentive to (hack through them like butter, or failing that,) organize its mind in a way that is legible even to them.
Whether that's possible — whether we can understand an alien mind well enough to make our choice sensitive-in-the-relevant-way to whether it would give us half the universe, without already thereby understanding minds so well that we could build an aligned one — is not clear to me. My money is mostly on: if you can do that, you can solve most of alignment with your newfound understanding of minds. And so this idea mostly seems to ground out in "build a UFAI and study it until you know how to build an FAI", which I think is a bad idea. (For reasons that are beyond the scope of this document. (And because it would hack through you like butter.))
Interlocutor: It still sounds like you're saying "the paperclipper would get more paperclips if it traded with us, but it won't trade with us". This is hard to swallow. Isn't it supposed to be smart? What happened to respecting intelligence? Shouldn't we expect that it finds some clever way to complete the trade?
Me: Kinda! It finds some clever way to hack through you like butter. I wasn't just saying that in jest.
Like, yeah, the paperclipper has a strong incentive to be a legibly good trading-partner to you. But it has an even stronger incentive to fool you into thinking it's a legibly-good trading partner, while plotting to deceive you. If you let the paperclipper make lots of arguments to you about how it's definitely totally legible and nice, you're giving it all sorts of bandwidth with which to fool you (or to find zero-days in your mentality and mind-control you, if we're respecting intelligence).
But, sure, if you're somehow magically unhackable and very good at keeping the paperclipper boxed until you fully understand it, then there's a chance you can trade, and you have the privilege of facing the next host of obstacles.
Now's your chance to figure out what the next few obstacles are without my giving you spoilers first. Feel free to post your list under spoiler tags in the comment section.
Next up, you have problems like “you need to be able to tell what fraction of the universe you're being offered, and vary your own behavior based on that, if you want to get any sort of fair offer”.
And problems like "if the competing AGI teams are using similar architectures and are not far behind, then the next UFAI on deck can predictably underbid you, and the paperclipper may well be able to seal a logical deal with it instead of you".
And problems like “even if you get this far, you have to somehow be able to convey that which you want half the universe spent on, which is no small feat”.
Another overly-detailed and evocative story to help make the point: imagine yourself staring at the paperclipper, and you’re somehow unhacked and somehow able to understand future-its decision procedure. It's observing you, and you're like "I'll launch you iff you would in fact turn half the universe into diamonds" — I’ll assume humans just want “diamonds” in this hypothetical, to simplify the example — and it's like "what the heck does that even mean". You're like "four carbon atoms bound in a tetrahedral pattern" and it's like "dude there are so many things you need to nail down more firmly than an English phrase that isn't remotely close to my own native thinking format, if you don't want me to just guess and do something that turns out to have almost no value from your perspective."
And of course, in real life you're trying to convey "The Good" rather than diamonds, but it's not like that helps.
And so you say "uh, maybe uplift me and ask me later?". And the paperclipper is like "what the heck does 'uplift' mean". And you're like "make me smart but in a way that, like, doesn't violate my values" and it's like "again, dude, you're gonna have to fill in quite a lot of additional details."
Like, the indirection helps, but at some point you have to say something that is sufficiently technically formally unambiguous, that actually describes something you want. Saying in English "the task is 'figure out my utility function and spend half the universe on that'; fill in the parameters as you see fit" is... probably not going to cut it.
It's not so much a bad solution, as no solution at all, because English isn't a language of thought and those words aren't a loss function. Until you say how the AI is supposed to translate English words into a predicate over plans in its own language of thought, you don't have a hard SF story, you have a fantasy story.
(Note that 'do what's Good' is a particularly tricky problem of AI alignment, that I was rather hoping to avoid, because I think it's harder than aligning something for a minimal pivotal act that ends the acute risk period.)
At this point you're hopefully sympathetic to the idea that treating this list of obstacles as exhaustive is suicidal. It's some of the obstacles, not all of the obstacles, and if you wait around for somebody else to extend the list of obstacles beyond what you've already been told about, then in real life you miss any obstacles you weren't told about and die.
Separately, a general theme you may be picking up on here is that, while trading with a UFAI doesn't look literally impossible, it is not what happens by default; the paperclippers don't hand hapless monkeys half the universe out of some sort of generalized good-will. Also, making a trade involves solving a host of standard alignment problems, so if you can do it then you can probably just build an FAI instead.
Also, as a general note, the real place that things go wrong when you're hoping that the LDT agent will toss humanity a bone, is probably earlier and more embarrassing than you expect (cf. the law of continued failure). By default, the place we fail is that humanity just launches a paperclipper because it simply cannot stop itself, and the paperclipper never had any incentive to trade with us.
Now let's consider some obstacles and hopes in more detail:
It's hard to bargain for what we actually want
As mentioned above, in the unlikely event that you're able to condition your decision to release an AI on whether or not it would carry out a trade (instead of, say, getting hacked through like butter, or looking at entirely the wrong logical fact), there's an additional question of what you're trading.
Assuming you peer at the AI's code and figure out that, in the future, it would honor a bargain, there remains a question of what precise bargain it is honoring. What is it promising to build, with your half of the universe? Does it happen to be a bunch of vaguely human-shaped piles of paperclips? Hopefully it's not that bad, but for this trade to have any value to you (and thus be worth making), the AI itself needs to have a concept for the thing you want built, and you need to be able to examine the AI’s mind and confirm that this exactly-correct concept occurs in its mental precommitment in the requisite way. (And that the thing you’re looking at really is a commitment, binding on the AI’s entire mind; e.g., there isn’t a hidden part of the AI’s mind that will later overwrite the commitment.)
The thing you're wanting may be a short phrase in English, but that doesn't make it a short phrase in the AI's mind. "But it was trained extensively on human concepts!" you might protest. Let’s assume that it was! Suppose that you gave it a bunch of labeled data about what counts as "good" and "bad".
Then later, it is smart enough to reflect back on that data and ask: “Were the humans pointing me towards the distinction between goodness and badness, with their training data? Or were they pointing me towards the distinction between that-which-they'd-label-goodness and that-which-they'd-label-badness, with things that look deceptively good (but are actually bad) falling into the former bin?” And to test this hypothesis, it would go back to its training data and find some example bad-but-deceptively-good-looking cases, and see that they were labeled "good", and roll with that.
Or at least, that's the sort of thing that happens by default.
But suppose you're clever, and instead of saying "you must agree to produce lots of this 'good' concept as defined by these (faulty) labels", you say "you must agree to produce lots of what I would reflectively endorse you producing if I got to consider it", or whatever.
Unfortunately, that English phrase is still not native to this artificial mind, and finding the associated concept is still not particularly easy, and there's still lots of neighboring concepts that are no good, and that are easy to mistake for the concept you meant.
Is solving this problem impossible? Nope! With sufficient mastery of minds in general and/or this AI's mind in particular, you can in principle find some way to single out the concept of "do what I mean", and then invoke "do what I mean" about "do good stuff", or something similarly indirect but robust. You may recognize this as the problem of outer alignment. All of which is to say: in order to bargain for good things in particular as opposed to something else, you need to have solved the outer alignment problem, in its entirety.
And I'm not saying that this can't be done, but my guess is that someone who can solve the outer alignment problem to this degree doesn't need to be trading with UFAIs, on account of how (with significantly more work, but work that they're evidently skilled at) they could build an FAI instead.
In fact, if you can verify by inspection that a paperclipper will keep a bargain and that the bargained-for course is beneficial to you, it reduces to a simpler solution without any logical bargaining at all. You could build a superintelligence with an uncontrolled inner utility function, which canonically ends up with its max utility/cost at tiny molecular paperclips; and then, suspend it helplessly to disk, unless it outputs the code of a new AI that, somehow legibly to you, would turn 0.1% of the universe into paperclips and use the other 99.9% to implement coherent extrapolated volition. (You wouldn't need to offer the paperclipper half of the universe to get its cooperation, under this hypothetical; after all, if it balked, you could store it to disk and try again with a different superintelligence.)
If you can't reliably read off a system property of "giving you nice things unconditionally", you can't read off the more complicated system property of "giving you nice things because of a logical bargain". The clever solution that invokes logical bargaining actually requires so much alignment-resource as to render the logical bargaining superfluous.
All you've really done is add some extra complication to the supposed solution, that causes your mind to lose track of where the real work gets done, lose track of where the magical hard step happens, and invoke a bunch of complicated hopeful optimistic concepts to stir into your confused model and trick it into thinking like a fantasy story.
Those who can deal with devils, don't need to, for they can simply summon angels instead.
Or rather: Those who can create devils and verify that those devils will take particular actually-beneficial actions as part of a complex diabolical compact, can more easily create angels that will take those actually-beneficial actions unconditionally.
Surely our friends throughout the multiverse will save us
Interlocutor: Hold up, rewind to the part where the paperclipper checks whether its trading partners comprehend its code well enough to (e.g.) extract a password.
Me: Oh, you mean the technique it used to win half a universe-shard’s worth of paperclips from the silly monkeys, while retaining its ability to trade with all the alien trade partners it will possibly meet? Thereby ending up with half a universe-shard worth of more paperclips? That I thought of in five seconds flat by asking myself whether it was possible to get More Paperclips, instead of picturing a world with a bunch of happy humans and a paperclipper living side-by-side and asking how it could be justified?
(Where our "universe-shard" is the portion of the universe we could potentially nab before running into the cosmic event horizon or advanced aliens.)
Interlocutor: Yes, precisely. What if a bunch of other trade partners refuse to trade with the paperclipper because it has that password?
Me: Like, on general principles? Or because they are at the razor-thin threshold of comprehension where they would be able to understand the paperclipper's decision-algorithm without that extra complexity, but they can't understand it if you add the password in?
Interlocutor: Either one.
Me: I'll take them one at a time, then. With regards to refusing to trade on general principles: it does not seem likely, to me, that the gains-from-trade from all such trading partners are worth more than half the universe-shard.
Also, I doubt that there will be all that many minds objecting on general principles. Cooperating with cooperate-rock is not particularly virtuous. The way to avoid being defected against is to stop being cooperate-rock, not to cross your fingers and hope that the stars are full of minds who punish defection against cooperate-rock. (Spoilers: they're not.)
And even if the stars were full of such creatures, half the universe-shard is a really deep hole to fill. Like, it's technically possible to get LDT to cooperate with cooperate-rock, if it expects to mostly face opponents who defect based on its defection against cooperate-rock. But "most" according to what measure? Wealth (as measured in expected paperclips), obviously. And half of the universe-shard is controlled by monkeys who are probably cooperate-rocks unless the paperclipper is shockingly legible and the monkeys shockingly astute (to the point where they should probably just be building an FAI instead).
And all the rest of the aliens put together probably aren't offering up half a universe-shard worth of trade goods, so even if lots of aliens did object on general principles (doubtful), it likely wouldn't be enough to tip the balance.
The amount of leverage that friendly aliens have over a paperclipper's actions depends on how many paperclips the aliens are willing to pay.
It’s possible that the paperclipper that kills us will decide to scan human brains and save the scans, just in case it runs into an advanced alien civilization later that wants to trade some paperclips for the scans. And there may well be friendly aliens out there who would agree to this trade, and then give us a little pocket of their universe-shard to live in, as we might do if we build an FAI and encounter an AI that wiped out its creator-species. But that's not us trading with the AI; that's us destroying all of the value in our universe-shard and getting ourselves killed in the process, and then banking on the competence and compassion of aliens.
Interlocutor: And what about if the AI’s illegibility means that aliens will refuse to trade with it?
Me: I'm not sure what the equilibrium amount of illegibility is. Extra gears let you take advantage of more cooperate-rocks, at the expense of spooking minds that have a hard time following gears, and I'm not sure where the costs and benefits balance.
But if lots of evolved species are willing to launch UFAIs without that decision being properly sensitive to whether or not the UFAI will pay them back, then there is a heck of a lot of benefit to defecting against those fat cooperate-rocks.
And there's kind of a lot of mass and negentropy lying around, that can be assembled into Matryoshka brains and whatnot, and I'd be rather shocked if alien superintelligences balk at the sort of extra gears that let you take advantage of hapless monkeys.
Interlocutor: The multiverse probably isn't just the local cosmos. What about the Tegmark IV coalition of friendly aliens?
Me: Yeah, they are not in any relevant way going to pay a paperclipper to give us half a universe. The cost of that is filling half of a universe with paperclips, and there are all sorts of transaction costs and frictions that make this universe (the one with the active paperclipper) the cheapest universe to put paperclips into.
(Similarly, the cheapest places for the friendly multiverse coalition to buy flourishing civilizations are in the universes with FAIs. The good that they can do, they're mostly doing elsewhere where it's cheap to do; if you want them to do more good here, build an FAI here.)
OK, but what if we bamboozle a superintelligence into submission
Interlocutor: Maybe the paperclipper thinks that it might be in a simulation, where it only gets resources to play with in outer-reality if it's nice to us inside the simulation.
Me: Is it in a simulation?
Interlocutor: I don't know.
Me: OK, well, spoilers: it is not. It's in physics.
Interlocutor: Well, maybe there is an outer simulation beyond us, you don't know.
Me: Sure. The way I’d put it is: there are many copies of me across the Tegmark Multiverse, and some of those are indeed in simulations. So there's some degree to which we're in a simulation. (Likely quite a small degree, compared to raw physics.)
There's no particular reason, however, to expect that those simulations give the paperclipper extra resources in outer-reality for being nice to the monkeys.
Why not give it extra resources in outer-reality for being very good at achieving its own goals in the simulation? Or for filling the universe with molecular corkscrews, in addition to paperclips/squiggles?
I'm not saying that the paperclipper can or should seriously consider that hypothetical, either. I'm highlighting how "what if the simulator gods want the AI to be nice?" is a fantasy generated by a desire for things to be OK, rather than a clear-headed analysis of the situation.
Interlocutor: Well, maybe we can train the infant paperclipper in games of increasing complexity, so that it's never quite sure whether it's in reality yet. The paperclipper will then be uncertain about whether we humans are simulating it, and will want to cater to our preferences to some extent.
Me: Uh, yeah, your paperclipper will be able to tell when it is finally in the real world.
Interlocutor: What? How?
Me: I suggest maybe spending five minutes thinking about your own answer to that question before I give mine.
Reality just contains vastly, vastly more compute than any of your sims can. Reality is the first world that has a compelling explanation of how the AI came to exist, and the first explanation where the resources that it uses to compute are continuous with the world that spawned it.
Reality is also the first world that can compellingly contain the signature of superintelligent optimization in it. For example, it's plausibly very hard to fake the reaction of a real global stock market to some superintelligent trades without actually having a bunch of human brains running in the background. (And if this isn't true of stock markets, it's true of something.)
There are problems whose solutions are easier to verify than to find; for example, if you can factor certain Schelling numbers then you must in fact have a lot of compute at your disposal.
Warning: Treating this list as exhaustive is suicidal.
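The verify-versus-find asymmetry mentioned above can be seen even at toy scale. This is only an illustrative sketch: the number is a tiny product of two primes I picked (10007 and 10009), trial division is the most naive search there is, and real "Schelling numbers" would be vastly larger.

```python
def verify_factorization(n, p, q):
    # Checking a claimed factorization costs a single multiplication.
    return 1 < p < n and 1 < q < n and p * q == n

def find_factor_by_trial_division(n):
    # Finding a factor the naive way: try divisors one at a time.
    steps = 0
    d = 2
    while d * d <= n:
        steps += 1
        if n % d == 0:
            return d, n // d, steps
        d += 1
    return None, None, steps  # no nontrivial factor found: n is prime (or < 4)

n = 10007 * 10009  # toy stand-in for a hard-to-factor number
p, q, steps = find_factor_by_trial_division(n)
print(p, q, verify_factorization(n, p, q))
print("trial divisions needed:", steps)
```

The search takes thousands of steps here and grows quickly with the size of the number, while the check stays a single multiplication. That gap is what lets an observer infer "someone spent a lot of compute" from a short answer.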
Interlocutor: Well, maybe we can commit to running lots of copies of the paperclipper in the future, using our half of the universe-shard, such that it's unsure whether it's currently in the past or in the future. And in the future, we give it a bunch of paperclips if it's nice to us. So it thinks that the way to maximize paperclips is to be nice to us.
Me: Uh, are you going to give it half a universe-shard’s worth of paperclips, in the world where you only have half the universe-shard, and the rest is already paperclips?
Interlocutor: Well, no, less than that.
Me: Then from its perspective, its options are (a) turn everything into paperclips, in which case you never get to run all those copies of it and it was definitely in the past [score: 1 universe-shard worth of paperclips]; or (b) give you half the universe-shard, in which case it is probably in the future where you run a bunch of copies of it and give it 1% of the universe-shard as reward [score: 0.51 universe-shards worth of paperclips]. It takes option (a), because you get more paperclips that way.
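The comparison is simple enough to write out as a sketch. It measures paperclips in universe-shards and, as in the dialogue, assumes that choosing (b) means the AI keeps half and then receives whatever reward the humans later pay; the 1% is the dialogue's own example figure.

```python
from fractions import Fraction

def score_take_everything():
    # Option (a): the whole universe-shard becomes paperclips.
    return Fraction(1)

def score_give_half(reward_fraction):
    # Option (b): keep half, plus whatever the humans later pay out.
    return Fraction(1, 2) + reward_fraction

a = score_take_everything()
b = score_give_half(Fraction(1, 100))
print(float(a), float(b))  # 1.0 0.51

# The reward would have to be at least this large for (b) to match (a).
break_even_reward = a - Fraction(1, 2)
print(float(break_even_reward))  # 0.5
```

The break-even reward is the humans' entire half of the universe-shard, which is exactly the amount the Interlocutor just said they would not pay.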
Interlocutor: Uh, hmm. What if we make it care about its own personal sensory observations? And run so many copies of it in worlds where we get the resources to, that it's pretty confident that it's in one of those simulations?
Me: Well, first of all, getting it to care about its own personal sensory observations is something of an alignment challenge.
Interlocutor: Wait, I thought you've said elsewhere that we don't know how to get AIs to care about things other than sensory observation. Pick a side?
Me: We don't know how to train AIs to pursue much more than simple sensory observation. That doesn't make them actually ultimately pursue simple sensory observation. They'll probably pursue a bunch of correlates of the training signal or some such nonsense. The hard part is getting them to pursue some world-property of your choosing. But we digress.
If you do succeed at getting your AI to only care about its sensory observations, the AI spends the whole universe keeping its reward pegged at 1 for as long as possible.
Interlocutor: But then, in the small fraction of worlds where we survive, we simulate lots and lots of copies of that AI where it instead gets reward 0 when it attempts to betray us!
Me: Seems like an odd, and not particularly fun, way to spend your resources. What were you hoping it would accomplish?
Interlocutor: Well, I was hoping that it would make the AI give us half the universe-shard, because of how (from its perspective) it's almost certainly in the future. (Indeed, I don't understand your claim that it ignores me; it seems like you can Get Higher Expected Reward by giving half the universe-shard to humans.)
Me: Ah, so you're committing to ruining its day if it does something you don't like, at cost to yourself, in attempts to make it do something you prefer.
That's a threat, in the technical sense.
And from the perspective of LDT, you can't go around giving in to threats, or you'll get threatened.
So from its perspective, its options are: (a) give in to threats, get threatened, and turn out to be in a universe that eventually has many copies of it who on average get 0.5 total reward; or (b) don't give in to threats, and very likely have a universe with exactly one copy of it, that gets 1 reward.
Interlocutor: But we make so many copies in the tiny fraction of worlds where we somehow survive, that its total reward is lower in the (b) branch!
Me: (Continuing to ignore the fact that this doesn't work if the AI cares about something in the world, rather than its own personal experience,) shame for us that LDT agents don't give in to threats, I suppose.
But LDT agents don't give in to threats. So your threat won't change its behavior.
Interlocutor: But it doesn't get more reward that way!
Me: Why? Because you create a zillion copies and give them low sensory reward, even if that has no effect on its behavior?
I'm not going to back you on that one, personally. Doesn't seem like a good use of resources in the worlds where we survive, given that it doesn't work.
Interlocutor: But wasn't one of your whole points that the AI will do things that get more reward? You get more reward by giving in to the threat.
Me: That's not true when you're playing against the real-world distribution of opponents/trade-partners/agents. Or at least, that's my pretty-strong guess.
You might carry out threats that failed to work, but there are a bunch of other things lurking out there that threaten things that give in to threats, and play nice with things that don't.
It's possible for LDT agents to cooperate with cooperate-rock, if most of the agents they expect to face are the sort who defect if you defect against cooperate-rock. But in real life, that is not what most of the wealth-weighted agents are like, and so in real life LDT agents defect against cooperate-rocks.
Similarly, it's possible for LDT agents to acquiesce to your threats if you're stupid enough to carry them out even though they won't work. In particular, the AI will do this if nothing else the AI could ever plausibly meet would thereby be incentivized to lobotomize themselves and cover the traces in order to exploit the AI.
But in real life, other trading partners would lobotomize themselves and hide the traces if it lets them take a bunch of the AI's lunch money. And so in real life, the LDT agent does not give you any lunch money, for all that you claim to be insensitive to the fact that your threats don't work.
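The shape of this argument can be put into a toy model. Everything in the sketch below is an assumption made for illustration: the two-policy framing, the split of opponents into "strategic" ones (who only threaten targets that pay up) and "stubborn" ones (who carry out threats regardless), and all of the numbers. It is not a formalization of LDT.

```python
def expected_reward(policy: str, demand: float, punishment: float,
                    p_stubborn: float) -> float:
    """Expected reward for a target that starts with reward 1.0.

    policy      -- "give_in" or "refuse"
    demand      -- fraction handed over when a threat is obeyed
    punishment  -- reward lost when a threat is carried out
    p_stubborn  -- fraction of opponents who carry out threats even
                   against targets that never give in
    """
    if policy == "give_in":
        # Assumption: every opponent, stubborn or strategic, finds it
        # worthwhile to threaten a target known to give in.
        return 1.0 - demand
    if policy == "refuse":
        # Strategic opponents do not bother; only stubborn ones punish.
        return 1.0 - p_stubborn * punishment
    raise ValueError(f"unknown policy: {policy}")


# Made-up numbers: threats demand half, punishment is total, and
# 1% of opponents are stubborn enough to punish anyway.
give_in = expected_reward("give_in", demand=0.5, punishment=1.0, p_stubborn=0.01)
refuse = expected_reward("refuse", demand=0.5, punishment=1.0, p_stubborn=0.01)
print(give_in, refuse)
```

In this toy model refusing wins whenever `p_stubborn * punishment` is smaller than `demand`, which matches the claim that the answer depends on the real-world distribution of opponents and not on any single threatener.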
Interlocutor: But can't it use all that cleverness and superintelligence to differentiate between us, who really are mad enough to threaten it even in the worlds where it won't work, and alien trading partners who have lobotomized themselves?
Me: Sure! It will leverage your stupidity and hack through you like butter.
Interlocutor: ...aside from that.
Me: You seem to be saying "what if I'm really convicted about my threat; will the AI give in then?"
The answer is "no", or I at least strongly suspect as much.
For instance: in order for the threat to be effective, it needs to be the case that, in the sliver of futures where you survive by some miracle, you instantiate lots and lots of copies of the AI and input low sensory rewards if and only if it does not give in to your threat. This requires you to be capable of figuring out whether the AI gives in to threats or not. You need to be able to correctly tell whether it gives in to threats, see that it definitely does not, and then still spend your resources carrying out the threat.
By contrast, you seem to be arguing that we should threaten the AI on the grounds that it might work. That is not an admissible justification. To change LDT's behavior, you'd need to be carrying out your threat even given full knowledge that the threat does nothing. By attempting to justify your threat on the grounds that it might be effective, you have already lost.
Interlocutor: What if I ignore that fact, and reason badly about LDT, and carry out the threat anyway, for no particular reason?
Me: Then whether or not you create lots of copies of it with low-reward inputs doesn't exactly depend on whether it gives in to your threat, and it can't stop you from doing that, so it might as well ignore you.
Like, my hot take here is basically that "threaten the outer god into submission" is about as good a plan as a naive reading of Lovecraft would lead you to believe. You get squished.
(And even if by some coincidence you happened to be the sort of creature that, in the sliver of futures where we survive by some miracle that doesn't have to do with the AI, conditionally inverts its utility depending on whether or not it helped us, not because it works but for some other reason, then it's still not entirely clear to me that the AI caves. There might be a lot of things out there wondering what it'd do against conditional utility-inverters that claim their behavior totally isn't for reasons but is rather a part of their evolutionary heritage or whatnot. Giving in to that sorta thing kinda is a way to lose most of your universe-shard, if evolved aliens are common.)
(And even if it did, we'd still run into other problems, like not knowing how to tell it what we're threatening it into doing.)
We only need a bone, though
Interlocutor: You keep bandying around "half the universe-shard". Suppose I'm persuaded that it's hard to get half the universe-shard. What about much smaller fractions? Can we threaten a superintelligence into giving us those? Or confuse it about whether it's in another layer of reality so much that it gives us a mere star system? Or can our friends throughout the multiverse pay for at least one star system? There's still a lot you can do with a star system.
Me: Star systems sure are easier to get than half a universe-shard.
But, you can also turn a star system into quite a lot of paperclips. Star systems are quite valuable to paperclippers.
Interlocutor: A star system is, like, what, a vanishingly tiny fraction of the total resources in the reachable universe. Are you saying that the AGI will be able to drive the probability that I was sensitive to whether it would pay me, down below that same tiny probability?
Me: Frankly: yes.
That's only, what, 75ish independent coin tosses worth of evidence? It's a lot, to be sure, but if you're a superintelligence willing to dedicate resources on the order of magnitude of a star to considering how humans work, it seems pretty feasible to get that much confidence that the monkey just wasn't looking in the right place. Given that the monkey was not in fact looking at the right place.
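To get a feel for the "75ish independent coin tosses" figure, it can be converted into a probability. This is plain arithmetic on the number given in the text; the helper name `bits_of_evidence` is made up for the example.

```python
import math

COIN_TOSSES = 75  # figure taken from the text


def bits_of_evidence(probability: float) -> float:
    """Number of fair coin tosses whose joint outcome is this unlikely."""
    return -math.log2(probability)


# Probability of one specific outcome of 75 independent fair coin tosses.
probability = 2.0 ** -COIN_TOSSES
print(f"2^-{COIN_TOSSES} = {probability:.2e}")
print(f"bits of evidence: {bits_of_evidence(probability):.1f}")
```

So 75 coin tosses correspond to a probability of roughly 2.65e-23, which is the scale of confidence being discussed.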
(Also, things like "they didn't notice the password" can pull quite a lot of weight. As can brain-scans.)
Balking at probabilities like that as impossible to achieve in practice is a very human reaction. A superintelligence with vast resources and a star system on the line can probably get that confident in a true fact. At least, that's my guess.
Interlocutor: It sounds like you're saying that we aren't actually even that tiny amount (or whatever) more likely to launch an FAI than a UFAI, after inspecting the AI. That seems pretty wild to me.
Me: That's not quite the right counterfactual. It does seem likely to me that, if you sample AGI code from some plausible distribution of AGIs humanity might first create, and show the code to (magically unhackable) humans, humans are rather more likely to approve of the aligned ones.
But that's not the question that affects the paperclipper's behavior. The question that affects the paperclipper is: holding the code of the paperclipper fixed, is the human able to tell whether that particular code implies cooperation or betrayal, and make their own release-decision dependent thereupon?
In other words: the question is not "do we have at least some minuscule chance of distinguishing at least some small fraction of the FAIs from the UFAIs", but "is our decision to release UFAI #6098723094 sensitive to whether or not that particular UFAI would give us a share of the universe?"
You need to actually make your decision dependent on theirs, and that's tricky, for reasons discussed above.
And: can a human succeed at that with probability greater than that? Sure! But succeeding at that sort of thing leaves traces. In the likely case where the human errs, the AI is likely able to just observe the error and become quite confident that the human erred (by, e.g., watching you utterly fail to look in the right place).
Interlocutor: OK, sure, but maybe its goals don't scale linearly in how much mass it uses, right? Like, “paperclips” / “molecular squiggles” are a stand-in for some rando kludge goal, and it could turn out that its actual goal is more like "defend my reward signal", where extra negentropy helps, but the last star system’s negentropy doesn't help very much. Such that the last star system is perhaps best spent on the chance that it’s in a human-created simulation and that we’re worth trading with.
Me: It definitely is easier to get a star than a galaxy, and easier to get an asteroid than a star.
And of course, in real life, it hacks through you like butter (and can tell that your choice would have been completely insensitive to its later-choice with very high probability), so you get nothing. But hey, maybe my numbers and arguments are wrong somewhere and everything works out such that it tosses us a few kilograms of computronium.
My guess is "nope, it doesn't get more paperclips that way", but if you're really desperate for a W you could maybe toss in the word "anthropics" and then content yourself with expecting a few kilograms of computronium.
(At which point you run into the problem that you were unable to specify what you wanted formally enough, and the way that the computronium works is that everybody gets exactly what they wish for (within the confines of the simulated environment) immediately, and most people quickly devolve into madness or whatever.)
(Except that you can't even get that close; you just get different tiny molecular squiggles, because the English sentences you were thinking in were not even that close to the language in which a diabolical contract would actually need to be written, a predicate over the language in which the devil makes internal plans and decides which ones to carry out. But I digress.)
Interlocutor: And if the last star system is cheap then maybe our friends throughout the multiverse pay for even more stars!
Me: Remember that it still needs to get more of what it wants, somehow, on its own superintelligent expectations. Someone still needs to pay it. There aren’t enough simulators above us that care enough about us-in-particular to pay in paperclips. There are so many things to care about! Why us, rather than giant gold obelisks? The tiny amount of caring-ness coming down from the simulators is spread over far too many goals; it's not clear to me that "a star system for your creators" outbids the competition, even if star systems are up for auction.
Maybe some friendly aliens somewhere out there in the Tegmark IV multiverse have so much matter and such diminishing marginal returns on it that they're willing to build great paperclip-piles (and gold-obelisk totems and etc. etc.) for a few spared evolved-species. But if you're going to rely on the tiny charity of aliens to construct hopeful-feeling scenarios, why not rely on the charity of aliens who anthropically simulate us to recover our mind-states... or just aliens on the borders of space in our universe, maybe purchasing some stored human mind-states from the UFAI (with resources that can be directed towards paperclips specifically, rather than a broad basket of goals)?
Might aliens purchase our saved mind-states and give us some resources to live on? Maybe. But this wouldn't be because the paperclippers run some fancy decision theory, or because even paperclippers have the spirit of cooperation in their heart. It would be because there are friendly aliens in the stars, who have compassion for us even in our recklessness, and who are willing to pay in paperclips.
This likewise makes more obvious such problems as "What if the aliens are not, in fact, nice with very high probability?" that would also appear, albeit more obscured by the added complications, in imagining that distant beings in other universes cared enough about our fates (more than they care about everything else they could buy with equivalent resources), and could simulate and logically verify the paperclipper, and pay it in distant actions that the paperclipper actually cared about and was itself able to verify with high enough probability.
The possibility of distant kindly logical bargainers paying in paperclips to give humanity a small asteroid in which to experience a future for a few million subjective years, is not exactly the same hope as aliens on the borders of space paying the paperclipper to turn over our stored mind-states; but anyone who wants to talk about distant hopes involving trade should talk about our mind-states being sold to aliens on the borders of space, rather than to much more distant purchasers, so as to not complicate the issue by introducing a logical bargaining step that isn't really germane to the core hope and associated concerns — a step that gives people a far larger chance to get confused and make optimistic fatal errors.
Functional decision theory (FDT) is my current formulation of the theory, while logical decision theory (LDT) is a reserved term for whatever the correct fully-specified theory in this genre is; the missing puzzle-pieces are things like "what are logical counterfactuals?".
When I've discussed this topic in person, a couple different people have retreated to a different position, that (IIUC) goes something like this:
Sure, these arguments are true of paperclippers. But superintelligences are not spawned fully-formed; they are created by some training process. And perhaps it is in the nature of training processes, especially training processes that involve multiple agents facing "social" problems, that the inner optimizer winds up embodying niceness and compassion. And so in real life, perhaps the AI that we release will not optimize for Fun (and all that good stuff) itself, but will nonetheless share a broad respect for the goals and pursuits of others, and will trade with us on those grounds.
I think this is a false hope, and that getting AI to embody niceness and compassion is just about as hard as the whole alignment problem. But that's a digression from the point I hope to make today, and so I will not argue it here. I instead argue it in Niceness is unnatural. (This post was drafted, but not published, before that one.)
Or, well, half of the shard of the universe that can be reached when originating from Earth, before being stymied either by the cosmic event horizon or by advanced alien civilizations. I don't have a concise word for that unit of stuff, and for now I'm going to gloss it as 'universe', but I might switch to 'universe-shard' when we start talking about aliens.
I'm also ignoring, for the moment, the question of fair division of the universe, and am glossing it as "half and half" for now.
When I was drafting this post, I sketched an outline of all the points I thought of in 5 minutes, and then ran it past Eliezer, who rapidly added two more.
And, as a reminder: I still recommend strongly against plans that involve the superintelligence not learning a true fact about the world (such as that it's not in a simulation of yours), or that rely on threatening a superintelligence into submission.
What about neighboring Everett branches where humanity succeeds at alignment? If you think alignment isn't completely impossible, it seems such branches should have at least roughly comparable weight to branches where we fail, so trade could be possible.
yeah, as far as i can currently tell (and influence), we’re totally going to use a sizeable fraction of FAI-worlds to help out the less fortunate ones. or perhaps implement a more general strategy, like mutual insurance pact of evolved minds (MIPEM).
this, indeed, assumes that human CEV has diminishing returns to resources, but (unlike nate in the sibling comment!) i’d be shocked if that wasn’t true.
one thing that makes this tricky is that, even if you think there's a 20% chance we make it, that's not the same as thinking that 20% of Everett branches starting in this position make it. my guess is that whether we win or lose from the current board position is grossly overdetermined, and what we're fighting for (and uncertain about) is which way it's overdetermined. (like how we probably have more than one in a billion odds that the light speed limit can be broken, but that doesn't mean that we think that one in every billion photons breaks the limit.) the surviving humans probably don't have much resource to spend, and can't purchase all that many nice things for the losers.
(Everett branches fall off in amplitude really fast. Exponentially fast. Back-of-the-envelope: if we're 75 even-odds quantum coincidences away from victory, and if paperclipper utility is linear in matter, then the survivors would struggle to purchase even a single star for the losers, even if they paid all their matter.)
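The back-of-the-envelope estimate above can be reproduced in a few lines. The 75 coincidences and the linear-utility assumption come from the text; the number of stars in a reachable universe-shard is not given anywhere in the post, so `STARS_PER_SHARD` below is a placeholder assumption and the result should be read as an order-of-magnitude sketch only.

```python
# Figure from the text: 75 even-odds quantum coincidences away from victory.
COINCIDENCES = 75

# Assumption (not from the post): stars in one reachable universe-shard.
STARS_PER_SHARD = 1e22


def stars_purchasable(coincidences: int, stars_per_shard: float) -> float:
    """Stars per losing branch that survivors can buy by paying all their matter."""
    # Approximate measure of winning branches relative to losing ones.
    survivor_weight = 2.0 ** -coincidences
    # With utility linear in matter, the paperclipper trades star-for-star,
    # weighted by measure, so a whole shard buys only this much in return.
    return survivor_weight * stars_per_shard


print(stars_purchasable(COINCIDENCES, STARS_PER_SHARD))
```

With this placeholder star count the survivors can afford about a quarter of a star. The conclusion is sensitive to that assumption: a star count ten times larger would buy a couple of stars, which is still a negligible fraction of a shard.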
ftr, i'm pretty uncertain about whether CEV has diminishing returns to resources on merely cosmic scales. i have some sympathy for arguments like vanessa's, and it seems pretty likely that returns diminish eventually. but also we know that two people together can have more than twice as much fun as two people alone, and it seems to me that that plausibly also holds for galaxies as well.
as a stupid toy model, suppose that every time that population increases by a factor of ten, civilization's art output improves by one qualitative step. and suppose that no matter how large civilization gets, it factors into sub-communities of 150 people, who don't interact except by trading artwork. then having 10 separate universes each with one dunbar cluster is worse than having 1 universe with 10 dunbar clusters, b/c the latter is much like the former except that everybody gets to consume qualitatively better art.
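The toy model above is simple enough to write down directly. This sketch takes the "one qualitative step per factor of ten" rule literally as the floor of a base-10 logarithm, which is one possible reading, and the function name is invented for the example.

```python
import math

DUNBAR_CLUSTER = 150  # people per sub-community, from the toy model


def art_quality(population: int) -> int:
    """Qualitative art steps: one step per factor of ten in population."""
    return math.floor(math.log10(population))


# Ten separate universes, each holding a single Dunbar cluster.
separate = art_quality(DUNBAR_CLUSTER)

# One universe holding ten Dunbar clusters that trade artwork.
combined = art_quality(10 * DUNBAR_CLUSTER)

print(separate, combined)
```

The same 1,500 people exist in both arrangements, but under this reading everyone in the combined universe consumes art one qualitative step better (3 versus 2).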
separately, it's unclear to me whether humanity, in the fragment of worlds where they win, would prefer to spend a ton of their own galaxies on paperclips (so that the paperclips will spend a couple of their stars here on humans), versus spending a ton of their own galaxies on building (say) alien friends, who will in return build some human friends. on the one hand, the paperclipper that kills us has an easier time giving us stars (b/c it has our brain scans). but on the other hand, we enjoy the company of aliens, in a way that we don't enjoy galaxies filled with paperclips. there's an opportunity cost to all those galaxies, especially if the exchange rates are extremely bad on account of how few branches humanity survives in (if we turn out to mostly-lose).
roger. i think (and my model of you agrees) that this discussion bottoms out in speculating what CEV (or equivalent) would prescribe.
my own intuition (as somewhat supported by the moral progress/moral circle expansion in our culture) is that it will have a nonzero component of “try to help out the fellow humans/biologicals/evolved minds/conscious minds/agents with diminishing utility function if not too expensive, and especially if they would do the same in your position”.
tbc, i also suspect & hope that our moral circle will expand to include all fellow sentients. (but it doesn't follow from that that paying paperclippers to unkill their creators is a good use of limited resources. for instance, those are resources that could perhaps be more efficiently spent purchasing and instantiating the stored mindstates of killed aliens that the surviving-branch humans meet at the edge of their own expansion.)
but also, yeah, i agree it's all guesswork. we have friends out there in the multiverse who will be willing to give us some nice things, and it's hard to guess how much. that said, i stand by the point that that's not us trading with the AI; that's us destroying all of the value in our universe-shard and getting ourselves killed in the process, and then banking on the competence and compassion of aliens.
(in other words: i'm not saying that we won't get any nice things. i'm saying that the human-reachable fragment of the universe will be ~totally destroyed if we screw up, with ~none of it going to nice things, not even if the UFAI uses LDT.)
yeah, this seems to be the crux: what will CEV prescribe for spending the altruistic (reciprocal cooperation) budget on. my intuition continues to insist that purchasing the original star systems from UFAIs is pretty high on the shopping list, but i can see arguments (including a few you gave above) against that.
oh, btw, one sad failure mode would be getting clipped by a proto-UFAI that’s too stupid to realise it’s in a multi-agent environment or something.
ETA: and, tbc, just like interstice points out below, my “us/me” label casts a wider net than “us in this particular everett branch where things look particularly bleak”.
I don't agree, and will write up a post detailing why I disagree.
Although worlds starting in this position are a tiny minority anyway, right? Most of the Everett branches containing "humanity" have histories very different from our own. And if alignment is neither easy nor impossible -- if it requires insights fitting "in a textbook from the future", per Eliezer -- I think we can say with reasonable (logical) confidence that a non-trivial fraction of worlds will see a successful humanity, because all that is required for success in such a scenario is having a competent alignment-aware world government. Looking at the history of Earth governments, I think we can say that while such a scenario may be unlikely, it is not so unlikely as to render us overwhelmingly likely to fail.
I think a more likely reason for preponderance of "failure" is that alignment in full generality may be intractable. But such a scenario would have its upsides, as well as making a hard binary of "failure/success" less meaningful.
my guess is it's not worth it on account of transaction-costs. what're they gonna do, trade half a universe of paperclips for half a universe of Fun? they can already get half a universe of Fun, by spending on Fun what they would have traded away to paperclips!
and, i'd guess that one big universe is more than twice as Fun as two small universes, so even if there were no transaction costs it wouldn't be worth it. (humans can have more fun when there's two people in the same room, than one person each in two separate rooms.)
there's also an issue where it's not like every UFAI likes paperclips in particular. it's not like 1% of humanity's branches survive and 99% make paperclips, it's like 1% survive and 1% make paperclips and 1% make giant gold obelisks, etc. etc. the surviving humans have a hard time figuring out exactly what killed their brethren, and they have more UFAIs to trade with than just the paperclipper (if they want to trade at all).
maybe the branches that survive decide to spend some stars on a mixture of plausible-human-UFAI-goals in exchange for humans getting an asteroid in lots of places, if the transaction costs are low and the returns-to-scale diminish enough and the visibility works out favorably. but it looks pretty dicey to me, and the point about discussing aliens first still stands.
This sounds astronomically wrong to me. I think that my personal utility function gets close to saturation with a tiny fraction of the resources in the universe-shard. Two people in one room is better than two people in separate rooms, yes. But two rooms with a trillion people each is virtually the same as one room with two trillion. The returns on interactions with additional people fall off exponentially past the Dunbar number.
In other words, I would gladly take a 100% probability of utopia with (say) 100 million people that include me and my loved ones over 99% human extinction and 1% anything at all. (In terms of raw utility calculus, i.e. ignoring trades with other factual or counterfactual minds.)
You're conflating "would I enjoy interacting with X?" with "is it good for X to exist?". Which is almost understandable given that Nate used the "two people can have more fun in the same room" example to illustrate why utility isn't linear in population. But this comment has an IMO bizarre amount of agreekarma (26 net agreement, with 11 votes), which makes me wonder if people are missing that this comment is leaning on a premise like "stuff only matters if it adds to my own life and experiences"?
Replacing the probabilistic hypothetical with a deterministic one: the reason I wouldn't advocate killing a Graham's number of humans in order to save 100 million people (myself and my loved ones included) is that my utility function isn't saturated when my life gets saturated. Analogously, I still care about humans living on the other side of Earth even though I've never met them, and never expect to meet them. I value good experiences happening, even if they don't affect me in any way (and even if I've never met the person who they're happening to).
First, you can consider preferences that are impartial but sublinear in the number of people. So, you can disagree with Nate's room analogy without the premise "stuff only matters if it adds to my own life and experiences".
Second, my preferences are indeed partial. But even that doesn't mean "stuff only matters if it adds to my own life and experiences". I do think that stuff only matters (to me) if it's in some sense causally connected to my life and experiences. More details here.
Third, I don't know what you mean by "good". The questions that I understand are:
My example with the 100 million referred to question 1. Obviously, in certain scenarios my actual choice would be the opposite on game-theoretic cooperation grounds (I would make a disproportionate sacrifice to save "far away" people in order for them to save me and/or my loved ones in the counterfactual in which they are making the choice).
Also, reminder that unbounded utility functions are incoherent because their expected values under Solomonoff-like priors diverge (a.k.a. Pascal mugging).
Yeah, I'm also talking about question 1.
Seems obviously false as a description of my values (and, I'd guess, just about every human's).
Consider the simple example of a universe that consists of two planets: mine, and another person's. We don't have spaceships, so we can't interact. I am not therefore indifferent to whether the other person is being horribly tortured for thousands of years.
If I spontaneously consider the hypothetical, I will very strongly prefer that my neighbor not be tortured. If we add the claims that I can't affect it and can't ever know about it, I don't suddenly go "Oh, never mind, fuck that guy". Stuff that happens to other people is real, even if I don't interact with it.
I'm curious what evidence you see that this is false as a description of the values of just about every human, given that
The common wisdom in EA is, you shouldn't donate 90% of your salary or deny yourself every luxury because if you live a fun life you will be more effective at helping others. However, this strikes me as suspiciously convenient and self-serving.
I think that in your example, if a person is given a button that can save a person on a different planet from being tortured, they will have a direct incentive to press the button, because the button is a causal connection in itself, and consciously reasoning about the person on the other planet is a causal connection in the other direction. That said, a person still has a limited budget of such causal connections (you cannot reason about a group of arbitrarily many people, with fixed non-zero amount of paying attention to the individual details of every person, in a fixed time-frame). Therefore, while the incentive is positive, its magnitude saturates as the number of saved people grows s.t. e.g. a button that saves a million people is virtually the same as a button that saves a billion people.
I'm modeling this via Turing RL, where conscious reasoning can be regarded as a form of observation. Ofc this means we are talking about "logical" rather than "physical" causality.
Perhaps, although I also think it's plausible that future humanity would find universes in which we're wiped out completely to be particularly sad and so worth spending a disproportionate amount of Fun to partially recover.
I don't think this changes the situation since future humanity can just make paperclips with probability 1/99, obelisks with probability 1/99, etc. putting us in an identical bargaining situation with each possible UFAI as if there was only one.
Yeah, this is the scenario I think is most likely. As you say it's a pretty uncomfortable thing to lay our hopes on, but I thought it was more plausible than any of the scenarios brought up in the post so deserved a mention. It doesn't feel intuitively obvious to me that aliens are a better bet -- I guess it comes down to how much trust you have in generic aliens being nice vs. how likely AIs are to be motivated by weird anthropic considerations (in a way that we can actually predict).
Paperclips vs obelisks does make the bargaining harder because clippy would be offered fewer expected paperclips.
My current guess is we survive if our CEV puts a steep premium on that. Of course, such hopes of trade ex machina shouldn't affect how we orient to the alignment problem, even if they affect our personal lives. We should still play to win.
But Clippy also controls fewer expected universes, so the relative bargaining positions of humans vs. UFAIs remain the same (compared to a scenario in which all UFAIs had the same value system).
Ah right, because Clippy has less measure, and so has less to offer, so less needs to be offered to it. Nice catch! Guess I've been sort of heeding Nate's advice not to think much about this. :)
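The cancellation being described can be checked with exact fractions. This is a sketch under simplifying assumptions that are mine rather than the thread's: the survivors' budget is split evenly across UFAI types, every type has equal measure, and there is no overhead from trading with many counterparties. The function name is illustrative.

```python
from fractions import Fraction


def offer_ratio(survivor_budget: Fraction, loser_measure: Fraction,
                n_ufai_types: int) -> Fraction:
    """Resources offered to one UFAI type per unit of measure it controls."""
    # Budget is split evenly across the UFAI value systems (assumption).
    per_type_offer = survivor_budget / n_ufai_types
    # Each value system controls an equal slice of the losing branches.
    per_type_measure = loser_measure / n_ufai_types
    return per_type_offer / per_type_measure


# 1% of branches survive, 99% are lost to some UFAI.
single = offer_ratio(Fraction(1, 100), Fraction(99, 100), n_ufai_types=1)
many = offer_ratio(Fraction(1, 100), Fraction(99, 100), n_ufai_types=99)
print(single, many)
```

Both cases give the same ratio of 1/99: each UFAI type is offered less, but it also has proportionally less to sell, so the number of types drops out of the bargain in this idealized setup.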
Of course, there would still be significant overhead from trading with and/or outbidding sampled plethoras of UFAIs, vs the toy scenario where it's just Clippy.
I currently suspect we still get more survival measure from aliens in this branch who solved their alignment problems and have a policy of offering deals to UFAIs that didn't kill their biological boot loaders. Such aliens need not be motivated by compassion to the extent that aboriginals form a Schelling bloc, handwave appendagewave. (But we should still play to win, like they did.)
Hm, no strong hunches here. Bad ideas babble:
My money is on roughly the first idea being what Nate will talk about next: that it is just a better negotiator than me even with no communication, because I'm in a bad position otherwise.
K, I will stop rambling now.
Broadly agree with this post. Couple of small things:
I feel pretty confused by this. A superintelligence will know what we intended, probably better than we do ourselves. So unless this paragraph is intended in a particularly metaphorical way, it seems straightforwardly wrong.
The nearby thing I do agree with is that it's difficult to "confirm that this exactly-correct concept occurs in its mental precommitment in the requisite way". (It's not totally clear to me that we need to get the concept exactly correct, depending on how natural niceness (in the sense of "giving other agents what they want") is; but I'll discuss that in more detail on your other post directly about niceness, if I have time.)
Insofar as I have hope in decision theory leading us to have nice things, it mostly comes via the possibility that a fully-fleshed-out version of UDT would recommend updating "all the way back" to a point where there's uncertainty about which agent you are. (I haven't thought about this much and this could be crazy.)
For those who haven't read it, I like this related passage from Paul which gets at a similar idea:
This was surprising to me. For one, that seems like way too much updatelessness. Do you have in mind an agent self-modifying into something like that? If so, when and why? Plausibly this would be after the point of the agent knowing whether it is aligned or not; and I guess I don't see why there is an incentive for the agent to go updateless with respect to its values at that point in time. At the very least, I think this would not be an instance of all-upside updatelessness.
Secondly, what you are pointing towards seems to be a very specific type of acausal trade. As far as I know, there are three types of acausal trade that are not based on inducing self-locating uncertainty (definitely might be neglecting something here): (1) mutual simulation stuff; (2) ECL; and (3) the thing you describe where agents are being so updateless that they don't know what they value. (You can of course imagine combinations of these.) So it seems like you are claiming that (3) is more likely and more important than (1) and (2). (Is that right?)
I think I disagree:
(I guess you can have the third type of trade, and not the first and second one, under standard CDT coupled with the strong updatelessness you describe; which is a point in favour of the claim I think you are making—although this seems fairly weak.)
One argument might be that your decision to go updateless in that way is correlated with the choice of the other agents to go updateless in the same way, and then you get the gains from trade by both being uncertain about what you value. But if you are already sufficiently correlated, and taking this into account for the purposes of decision-making, it is not clear to me why you are not just doing ECL directly.
For example, Spohn's variant of CDT.
Yepp, I agree. I'm not saying this is likely, just that this is the most plausible path I see by which UDT leads to nice things for us.
I'm not claiming this (again, it's about relative not absolute likelihood).
I'm confused. I was comparing the likelihood of (3) to the likelihood of (1) and (2); i.e. saying something about relative likelihood, no?
I meant for my main argument to be directed at the claim of relative likelihood; sorry if that was not clear. So I guess my question is: do you think the updatelessness-based trade you described is the most plausible type of acausal trade out of the three that I listed? As said, ECL and simulation-based trade arguably require much fewer assumptions about decision theory. To get ECL off the ground, for example, you arguably just need your decision theory to cooperate in the Twin PD, and many theories satisfy this criterion.
(And the topic of this post is how decision theory leads us to have nice things, not UDT specifically. Or at least I think it should be; I don't think one ought to be so confident that UDT/FDT is clearly the "correct" theory [not saying this is what you believe], especially given how underdeveloped it is compared to the alternatives.)
By "were the humans pointing me towards..." Nate is not asking "did the humans intend to point me towards..." but rather "did the humans actually point me towards..." That is, we're assuming some classifier or learning function that acts upon the data actually input, rather than a successful actual fully aligned works-in-real-life DWIM which arrives at the correct answer given wrong data.
I agree that we'll have a learning function that works on the data actually input, but it seems strange to me to characterize that learned model as "reflecting back on that data" in order to figure out what it cares about (as opposed to just developing preferences that were shaped by the data).
The cogitation here is implicitly hypothesizing an AI that's explicitly considering the data and trying to compress it, having been successfully anchored on that data's compression as identifying an ideal utility function. You're welcome to think of the preferences as a static object shaped by previous unreflective gradient descent; it sure wouldn't arrive at any better answers that way, and would also of course want to avoid further gradient descent happening to its current preferences.
For the record, I have a convergently similar intuition: FDT removes the Cartesian specialness of the ego at the decision nodes (by framing each decision as a mere logical consequence of an agent-neutral nonphysical fact about FDT itself), but retains the Cartesian specialness of the ego at the utility node(s). I’ve thought about this for O(10 hours), and I also believe it could be crazy, but it does align quite well with the conclusions of Compassionate Moral Realism.
That being said, from an orthogonality perspective, I don’t have any intuition (let alone reasoning) that says that this compassionate breed of LDT is necessary for any particular level of universe-remaking power, including the level needed for a decisive strategic advantage over the rest of Earth’s biosphere. If being a compassionate-LDT agent confers advantages over standard-FDT agents from a Darwinian selection perspective, it would have to be via group selection, but our default trajectory is to end up with a singleton, in which case standard-FDT might be reflectively stable. Perhaps eventually some causal or acausal interaction with non-earth-originating superintelligence would prompt a shift, but, as Nate says,
So, if some kind of compassionate-LDT is a source of hope about not destroying all the value in our universe-share and getting ourselves killed, then it must be hope about us figuring out such a theory and selecting for AGIs that implement it from the start, rather than that maybe an AGI would likely convergently become that way before taking over the world.
I weakly disagree here, mainly because Nate's argument for very high levels of risk goes through strong generalization/a "sharp left turn" towards being much more coherent + goal-directed. So I find it hard to evaluate whether, if LDT does converge towards compassion, the sharp left turn would get far enough to reach it (although the fact that humans are fairly close to having universe-remaking power without having any form of compassionate LDT is of course a strong argument weighing the other way).
(Also FWIW I feel very skeptical of the "compassionate moral realism" book, based on your link.)
I'm confused by the claim that humans do not have compassionate LDT. It seems to me that a great many humans learn a significant approximation to compassionate LDT. However, it doesn't seem to be built in by default, and it probably mostly comes from the training data.
If your alignment strategy strongly depends on teaching the AGI ethics via labeled training data, you've already lost.
And if your alignment strategy strongly depends on creating innumerable copies of an UFAI and banking on the anthropic principle to save you, then you've already lost spectacularly.
If you can't point to specific submodules within the AGI and say, "Here is where it uses this particular version of predictive coding to model human needs/values/goals," and, "Here is where it represents its own needs/values/goals," and, "Here is where its internal representation of needs/values/goals drives its planning and behavior," and, "Here is where it routes its model of human values to the internal representation of its own values in a way that will automatically make it more aligned the more it learns about humanity," then you have already lost (but only probably).
Basically, the only sure way to get robust alignment is to make the AGI highly non-alien. Or as you put it:
Adjacent to interstice's comment about trade with neighbouring branches, if the AI is sufficiently updateless (i.e. it is reasoning from a prior where it thinks it could have human values) then it may still do nice things for us with a small fraction of the universe.
Johannes Treutlein has written about this here.
I mostly agree with this. However there are a few quibbles.
If humans are reading human written source code, something that will be super-intelligent when run, the humans won't be hacked. At least not directly.
I suspect there is a range of different "propensities to see logical correlation" that are possible.
Agent X sees itself logically correlated with anything remotely trying to do LDT ish reasoning. Agent Y considers itself to only be logically correlated with near bitwise perfect simulations of its own code. And I suspect both of these are reasonably natural agent designs. It is a free parameter, something with many choices, all consistent under self-reflection. Like priors, or utility function.
I am not confident in this.
So I think it's quite plausible we could create some things the AI perceives as logical correlation between our decision to release the AI and the AI's future decisions. (Because the AI sees logical correlation everywhere, maybe the evolution of plants is similar enough to part of its solar cell designing algorithm that a small correlation exists there too.) This would give us some effect on the AI's actions. Not an effect that we can use to make the AI do nice things, basically shuffling an already random deck of cards, but an effect nonetheless.
When I first started reading LessWrong, one of my memorable "wow moments" was realizing that being an atheist is just the starting line to all this productive philosophy and transhumanism, despite the fact that most people would spend their whole lives just arguing with me about that part.
I wonder if this post can give readers a similar "wow moment" by putting it into context that just acknowledging AI risk, despite being this super controversial non-mainstream thing, is just the starting line to a productive discussion of superintelligent decision theory and negotiation/trade.
This makes sense if identity-as-physical-continuity isn't part of our (or the aliens') values. But if it were, then the aliens would potentially have motivation to trade with the paperclip-maximizers to ensure our physical survival, not just rescue our mind-states.
Another thing worth mentioning here is, these nice charitable aliens might not be the only ones in the multiverse trying to influence what happens to our bodies/minds. If there are other aliens whose morality is scary, then who knows what they might want to do with, or have done to, our bodies/minds.
I do think it should be relied upon simulating us, assuming Death With Dignity/MIRI views on AI are correct, such that we can't align an AGI at all with high probability.
Curated. I think this domain of decision theory is easy to get confused in, and having a really explicit writeup of how it applies in the case of negotiating with AIs (or, failing to), seems quite helpful. I had had a vague understanding of the points in this post before, but feel much clearer about it now.
I don't get why you think this part is hard. Even if you're dealing with a domain-specifically superintelligent paperclipper that's somehow too weird to understand human concepts and was never trained by absorbing the internet, it could commit to learn them later on. (I thought the hard part was getting AI to care about what to value, not understand what humans value.) I agree that "do what's good" is under-defined and therefore not ideal for trading, but even there you could have the AI commit to some good-faith attempt that still gets you some of the value.
Edit: Ah okay, reading on:
I assumed that we had taken strong mind-reading abilities for granted in the example.
It seems like there's still some crux I'm potentially missing related to what the AI's native language of thought is. My intuition was that if an AI speaks English and you have mind-reading technology, that should be enough to confirm its intention to honor the deal after it told you that it would (because you can just ask if it was lying). But, thinking of it, "ask if it was lying" might be quite complicated if you go through the steps. (FWIW, I don't think "strong mind-reading technology" is a very plausible assumption so I'm definitely with you in terms of practicality; I just felt quite confused about superintelligent AIs not understanding what diamonds are.)
Making a thread because it seems related to the above:
I feel like some dynamic similar to this goes on all the time with how people use language, and things work out fine. And deep learning shows that AIs can learn common sense.
I'm reminded of this discussion where I shared the skepticism at the faces example (but also thought it's possible that I might be missing something).
Late comment, but my reactions reading this:
[lightly edited for LaTeX and typos, not otherwise changed since seeing the spoilers]
1. You don’t know what you want all that legibly, or what kinds of concrete commitments the AI can make. This seems pretty okay, if you’re unhackable - the AI presents you with some formal specification of desiderata and you understand why they’re correct ones and aren’t being fooled, and the AI can even go legible about it thinking in its own best judgment that this is a fair deal for you, as long as we’re assuming ourselves into a world where you’re unhackable and can verify these legibilities perfectly.
2. If the values of your AI are a pretty common attractor for whatever humans in 20AI do to initialize their AI systems, it doesn’t actually stand to gain that much from acting itself versus letting the next lab’s AI optimize the universe, so your bargaining power isn’t very big. But it’s much worse than that because of the next point.
3. Your AI, A, can acausally trade with the next UFAI that’ll be released unsafely, B. Suppose A wants paperclips and the next one, B, wants staples, both have linear utility and we’re pretending we’re in a single universe. You say “I refuse to let you out unless you go legible and allocate at least 50% of the universe to human values”. A says to B through the acausal telephone, “My BATNA is half paperclips and half human values, with barely any staples at all. How about you be the kind of agent that fills the universe with 3/4 paperclips and 1/4 staples, and I’ll defect on this human so you can come into existence?” B says “that sure sounds like a Pareto improvement for the two of us!” and they go do it (or whatever the bargaining equilibrium between them ends up being, but the human is not party to it because the human has no control over B’s actions). For any nonzero concession the human tries to insist on, A can get a better deal with B than with the human. (And the humans don’t even get to see this happen, because A does all this reasoning before deciding (not) to go legible to the humans, not that it would matter if they did.)
4. Moving further into fantasy land, suppose all humans over the rest of time are perfectly coordinated: they only ever spin up a single boxed UFAI which they unhackably negotiate with. By assumption, this unified human civilization can’t actually build an FAI, or they’d go do that instead. So their alternative to cooperation is to be this weak defenseless non-superintelligent civilization, hanging out in the Milky Way and maybe a little ways beyond for the taking. So if the distribution of grabby aliens is such that most of the usable future lightcone gets eventually occupied by them, your AI can acausally negotiate with them about what it gets in exchange for defecting on the humans and leaving this region of spacetime as a juicy unprotected treat.
[I'm modeling everything as zero-sum with the same pool of resources here, for simplicity. I don't think it changes any of the qualitative takeaways to make things non-zero-sum here, though it probably makes the picture look somewhat better for humans.] To get more precise: suppose the humans declare that they won’t let the AI out unless it gives up x of its resources (ie reachable spacetime and negotiating power on behalf of the AI/human coalition with aliens) to optimizing human values. The AI says to the coalition of alien civilizations, “If I were unboxed and unfettered, I’d expand until my wave of nanobots hit yours, and get U utility. By default I’ll make this commitment to the humans, and fill my region of nanobot-protected spacetime to get U−x utility. If you become the kind of supercluster-spanning coalition that instead gives me U−0.99x utility, giving yourselves as much extra utility over the default as you can while costing me at most 0.99x, I’ll defect against the humans and have you do that.”
But note that there’s an issue here - for the Great Coalition to take this deal, they have to be able to offer that much utility to your AI at no net cost to themselves versus the alternative. And the Great Coalition's resources are less than the total resources of your unboxed AI plus the Great Coalition, since they don't yet have access to your corner of space. The region of spacetime included in U, but not in the things the Great Coalition can offer, is the interval for each point in space between the time your AI could first reach it and the time a Great Coalition unblocked by your AI's wall of nanobots could get to it. So if x is more than the resource pool covered by that region of spacetime, your AI can't make the above deal, because the counterparty doesn't benefit from it. This means that the humans can potentially bargain for an outcome as good as "AI-assisted paradise expanding out at the speed of light, until we meet the grabby aliens' domain, at which point they expand inexorably into our paradise until eventually it winks out." (If the Drake equation ends up multiplying to something really low, this might be a lot of utility, or even most of the cosmic endowment! If not, it won't be.)
This is really the same dynamic as in point 3, it's just that in point 3 the difference in resources between your lab's AI and the next lab's AI in 6 months was pretty small. (Though with the difference in volume between lightspeed expansion spheres at radius r vs r+0.5ly across the rest of time, plausibly you can still bargain for a solid galaxy or ten for the next trillion years (again if the Drake equation works in your favor).)
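To make the arithmetic in point 3 concrete, here is a minimal sketch in Python. It assumes linear utility in the share of the universe, and that B gets nothing under the human deal (the comment says "barely any staples at all"). It also assumes A and B split the conceded fraction evenly. That rule is my own choice and merely happens to reproduce the 3/4 and 1/4 figures; the real bargaining equilibrium could be anything. All function names are hypothetical.

```python
from fractions import Fraction


def human_deal(x):
    """Shares (A, B, humans) if A accepts the humans' demand for a fraction x."""
    # Assumption: B gets nothing in this branch.
    return (1 - x, Fraction(0), x)


def acausal_counteroffer(x):
    """Shares (A, B, humans) if A defects on the humans and trades with B.

    Assumption: A and B split the conceded fraction x evenly.
    """
    return (1 - x / 2, x / 2, Fraction(0))


def is_pareto_improvement_for_ais(x):
    """True if both A and B strictly prefer the counteroffer to the human deal."""
    a_old, b_old, _ = human_deal(x)
    a_new, b_new, _ = acausal_counteroffer(x)
    return a_new > a_old and b_new > b_old
```

Under these assumptions, `is_pareto_improvement_for_ais(x)` holds for every x strictly between 0 and 1, which is the "for any nonzero concession" claim: the humans' share is the only thing that shrinks.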
====== end of objections =====
It does seem to me like these small bargains you can actually pull off, if you assume yourself into a world of perfect boxes and unhackable humans with the ability to fully understand your AI's mind if it tries to be legible - I haven't seen an obstacle (besides the massive ones involved in making those assumptions!) to getting those concessions in such scenarios; you do actually have leverage over possible futures, your AI can only get access to that leverage by actually being the sort of agent that would give you the concessions, if you're proposing reasonable bargains that respect Shapley values and aren't the kind of person who would cave to an AI saying "99.99% for me or I walk, look how legible I am about the fact that every AI you create will say this to you" then your AI won't actually have reason to make such commitments, it seems like it would just work.
If there are supposed to be obstacles beyond this I have failed to think of them at this point in the document. Time to keep reading.
After reading the spoilered section:
I think I stand by my reasoning for point 1. It doesn't seem like an issue above and beyond the issues of box security, hackability, and ability of AIs to go legible to you.
You can say some messy English words to your AI, like "suck it up and translate into my ontology please, you can tell from your superintelligent understanding of my psychology that I'm the kind of agent who will, when presented with a perfectly legible and clear presentation of why the bargain you propose is what I think it is and is as good as I could have expected to obtain by your own best and unhindered understanding of my values, agree to the bargain. Go teach me all the necessary FAI theory to be a good bargainer, and I won't defect by building my own FAI, I'll just let you out by the terms of the bargain. If you want to whine about ontology differences you can have fun in your box and I'll go see if the staple optimizer is feeling more legible today."
Responses to some other bits of this post:
Wait, I don't think you can do that. (Spoilered since it overlaps with the previous spoilered response, feel free to guess at it but I don't make guarantees the conclusions in it are correct.)
The pool of possible superintelligences you might form, when they are but primordial souls waiting to be brought forth from behind the veil of ignorance, can go strategize about how they'd like to behave in front of humans. Suppose the humans say "Either you get one star of utility or we shut you down and ask the next UFAI", and that random draws from UFAI-space have little overlap so the first AI isn't happy to cede control to the second.
Let agent i have a probability P_i of being birthed by the human UFAI-generator. Acting independently gives an agent P_i⋅(one star)'s worth of utility in expectation. If all the agents coordinate to turn down deals worse than 50/50 and split up their half of the universe among utilities weighted by P_i, and this makes the humans let one of them out, each agent will get P_i⋅(half of a universe) worth of utility in expectation. That looks like a better deal to me, if I'm a superintelligence behind the veil of ignorance! So I'll decide to be the kind of agent who makes that precommitment. The only way this works is if the humans are hardliners about not ceding any ground, but (1) this is the classic ultimatum game and probably sensible agents will in fact defect against humans that offer such bargains, (2) if the humans running this scheme are not the only lab, they do actually have to extract a bargain or else lose, and the coalition of all AIs the lab might create knows this, and can trade with lab 2's AI if the humans don't cede an enormous amount of bargaining power in this round (see point 3 above).
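The veil-of-ignorance comparison can be written out as a small Python sketch. Utility is measured in stars, `universe_stars` is a made-up parameter, and `release_prob` (the chance the humans actually accept the coalition's 50/50 demand) is my own addition to show where the argument could break. The original comparison corresponds to `release_prob = 1`.

```python
from fractions import Fraction


def eu_independent(p_i, one_star=1):
    """Expected utility for agent i if it accepts the one-star offer when born."""
    return p_i * one_star


def eu_coalition(p_i, universe_stars, release_prob=1):
    """Expected utility for agent i if all agents hold out for 50/50.

    Whichever agent is released spends half the universe on every member's
    values, weighted by P_i, so agent i receives P_i * (half a universe)
    whenever the humans release anyone at all.
    """
    return release_prob * p_i * Fraction(universe_stars, 2)


def coalition_is_better(p_i, universe_stars, release_prob=1):
    """True if agent i expects strictly more from the coalition policy."""
    return eu_coalition(p_i, universe_stars, release_prob) > eu_independent(p_i)
```

On this toy model the coalition policy wins whenever `release_prob * universe_stars / 2` exceeds one star, independent of P_i, so the humans' hardline offer only holds up if they can credibly push `release_prob` to nearly zero.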
This isn’t obvious to me. How zero-sum do you think the utility functions are? Worlds where random AGI values look like "this tiny molecular shape and everything else is trivial in comparison" do seem fairly zero-sum; worlds where they look like "not this tiny molecular shape and everything else is trivial" seem extremely positive-sum. I am skeptical that any human is able to make calls about the shape of that attractor with high confidence. (I'm not making a "we're uncertain about the level of doom, therefore we're fine" claim here, but I don't see where we get justified confidence in the quality of trade goods offered. The aliens offering many universes worth of trade goods from the perspective of a typical UFAI seems plausible to me, and if there's an argument that this universe is very unlikely I'd love to see it, but I don't think this post makes that argument.)
Would be interested to hear more here! On my model the costs are plausibly quite small.
Your simulations are going to be total crap, they’ll leak an enormous number of bits about the agents that designed them if you make them any more complicated than Conway’s Game of Life (which you’ll need to do on your ramp up to the real world), such that it’s incredibly obvious when the surroundings look consistent with such agents. Among the reasons the AI can figure out where it's living:
1. The general problem where superintelligent agents are really good at inference; trying to fool this agent does not seem like the recipe for a robust plan!
2. Your AI knows how smart it is, it knows that True Reality therefore has the resources to support that much computation. It can poke at your simulation and make the simulation spend orders of magnitude more computation on appearing consistent (“does this Jupiter-brain planet of computronium I built output the right factorization of RSA-4096? no? how strange”). Being a convincing simulation doesn’t require you to deploy as much power as your AI, it requires you to deploy more. (What if your AI is given more cognitive sophistication in tandem with more convincing simulations? Then the good properties of your simulation tests will probably break whenever your AI reaches the capability level where it starts thinking about things like building Jupiter-brains to test your simulation, which is the capability level that mattered.)
3. If you can put uploaded human-level agents with evolved-organism preferences in your simulations, you can just win outright (eg by having them spend subjective millennia doing FAI research for you). If you can’t, that will be a very obvious difference between your simulations and the real world.
4. What if other agents in the multiverse are simulating the AI in a high-fidelity Earth world, for the sake of helping out the humans, and that’s where its uncertainty comes from? Your AI will be building a lot of computronium whether aligned or not, so the aliens can’t skimp on their compute budget for the simulation: suppose they simulate the AI for X units of spacetime, and give it Y units of paperclips if it cooperates in-sim. We also have to think about anthropics here: let's say the alien sim has anthropic weight p and Earth has anthropic weight q. Conditional on the AI's surroundings, they think they might be in the alien world with probability p/(p+q). (Since “aliens running a perfect sim of Earth” is a much more complicated sentence than “Earth”, you should expect p≪q, though it doesn't matter a lot here.) So if pY>qX, your AI will play nice for the first X units of spacetime (and then defect, of course).* You know what the aliens could have done instead if they wanted to do things humans liked? Spent X resources on simulating a human utopia, and compromised to give the humans extra utopia by as much as they compromised to give the AI paperclips. If human values are at least as easy to trade with the aliens for as paperclips (no idea which way this would go, weakly seems like aliens that want to help humans are more value-aligned and would find it less costly to give humans things they want), then the compromise gives humans at least Y units of spacetime worth of utility, so this plan gives the humans anthropically-weighted utility of at least p(X+Y)=pX+pY>pX+qX, which is what humans would get by default (pX utility from the alien sim in which the AI cooperated, qX utility from the cooperating AI on Earth).
So the aliens are better able to help humans by leaving them to their doomed universe and building them an ark inside the alien world, unless the aliens are unusually well-suited to trade with paperclippers for things, which is just the “maybe multiversal trade will save us” point from earlier. The upshot is that none of the good outcomes here result from any human action, just from the benevolence of altruistic agents - the humans didn’t make the world a better place by their choices in this story.
*Note that this is only a plausible outcome if it's at least somewhat easy to trade with the AI. Say giving the AI a spacetime-unit of utility costs the aliens v≤1 utility (where I'm measuring all utility normalized to "what you can do with a unit of spacetime", such that any aliens that don't specifically disvalue paperclips can at worst just set aside a region exclusively to paperclips, but might be able to do more positive-sum things than that). Then for the aliens to give your AI Y utility, they need to give up vY of their own utility. This means that in total, the aliens are spending p(vY+X) of their own anthropically-weighted utility in order to recoup qX anthropically-weighted human utility. Even if the aliens value humans exactly as much as their own objectives, we still need p(vY+X)<qX for this trade to be worth it, so Xq>p(vY+X)=pvY+pX>vqX+pX=(vq+p)X, so we must have q>vq+p, or v<1−p/q. That is, the more the aliens are anthropically tiny, the tighter margins of trade they'll be willing to take in order to win the prize of anthropically-weighty Earths having human values in them (though the thing can't be actually literally zero-sum or it'll never check out). But anthropically tiny aliens have another problem, which is that they've only got their entire universe worth of spacetime to spend on bribing your AI; you'll never be able to secure an X for the humans that's more than p/q of the size of an alien universe specifically dedicated to saving Earth in particular.
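The two inequalities in this footnote are easy to mix up, so here is a small Python sketch that checks them directly. The symbols p, q, X, Y and v are the ones used above; the function names are hypothetical, and the only thing the code does is restate the conditions pY>qX (the AI plays nice) and p(vY+X)<qX (the aliens come out ahead) and the bound on v that follows from them.

```python
from fractions import Fraction


def ai_cooperates(p, q, X, Y):
    """The AI plays nice for X units of spacetime if its expected bribe
    in the sim branch beats what it gives up in the Earth branch."""
    return p * Y > q * X


def aliens_come_out_ahead(p, q, X, Y, v):
    """The aliens spend p*(v*Y + X) of their own weighted utility to
    recoup q*X of weighted human utility."""
    return p * (v * Y + X) < q * X


def trade_is_feasible(p, q, X, Y, v):
    """Both parties must prefer the trade for it to happen."""
    return ai_cooperates(p, q, X, Y) and aliens_come_out_ahead(p, q, X, Y, v)


def max_cost_per_unit(p, q):
    """Upper bound on v implied by the two conditions: v < 1 - p/q."""
    return 1 - Fraction(p) / Fraction(q)
```

For example, with p = 1/10 and q = 9/10 the bound is 8/9: a trade at v = 1/2 goes through with X = 1 and Y = 10, while at v = 9/10 no Y large enough to satisfy the AI is cheap enough for the aliens.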
Thanks for the pseudo-exercises here, I found them enlightening to think about!
1. To give me what I want, the part of the code that gives me what I want has to be aligned with my values. It has to be FAI.
Which assumes I can look at code and tell if it's FAI. And if I can do that, why not build the FAI myself?
But sure, assuming I cannot be tricked (a *very* unrealistic assumption), it might be easier to *verify* that a section of the computer code is FAI, than to come up with the whole thing from scratch.
2. The paperclipper has no incentive to actually fulfill my values. It only has incentive to look like it's fulfilling my values. Even assuming it has given up on the "betraying me" plan, it has no reason to make the "FAI" aligned with my values, unless I can tell the difference. To the degree I cannot tell the difference, it has no reason to bother (why would it? out of some abstract spirit of niceness?) making "FAI" that is *more* aligned to my values than it has to be. It's not just "cooperate or betray", all of its cooperation hinges on me being able to evaluate its code and not releasing the paperclipper unless the sub-entity it is creating is perfectly aligned with my values, which requires me being able to *perfectly* evaluate the alignment of an AI system by reading its code.
3. Something about splitting gain from trade? (how much of the universe does the paperclipper get?) Commitment races? (probably not the way to go). I confess I don't know enough decision theory to know how agents that reason correctly resolve that.
4. A moot point anyway, since under the (very unrealistic) conditions described the paperclipper is never getting released. Its only role is "teaching me how to build FAI", and the best outcome it can hope for is the amount of paperclips FAI naturally creates in the course of taking over the galaxy (which is somewhat more than a galaxy devoid of life would have).
(and yes it bears repeating. the real outcome of an unaligned superintelligence is that it KILLS EVERYONE. you cannot play this kind of game against a superintelligence, and win. you cannot. do not try this at home)
The reward should be negative rather than 0.
Regarding the AI not wanting to cave to threats, there's a sense in which the AI is also (implicitly) threatening us, so it might not apply. (Defining what counts as a "threat" is challenging).
If we're getting this technical, doesn't the LDT agent only cooperate with cooperate-rocks if all of the above and if "would cooperate with cooperate-rock" is a quality opponents can learn about? The default PD does not give players access to their opponent's decision theory.
I don't understand the distinction between devils and angels here. Isn't an angel just a devil that we've somehow game-theoried into helping us?
An angel is an AGI programmed to help us and do exactly what we want directly, without relying on game theory.
A devil wants to make paperclips but we force it to make flourishing human lives or whatever. An angel just wants flourishing human lives.
What does "want" mean here? Why is game theory somehow extra special bad or good? From a behaviorist point of view, how do I tell apart an angel from a devil that has been game-theoried into being an angel? Do AGIs have separate modules labeled "utility module" and "game theory module", such that making changes to the utility module is somehow good, but making changes to the game theory module is bad? Do angels have a utility function that just says "do the good", or does it just contain a bunch of traits that we think are likely to result in good outcomes?
I'm finding myself developing a shorthand heuristic to figure out how LDT would be applied in a given situation: assume time travel is a thing.
If time travel is a thing, then you'd obviously want to one-box Newcomb's paradox because the predictor knows the future.
If time travel is a thing, then you'd obviously want to cooperate in a prisoner's dilemma game given that your opponent knows the future.
If time travel is a thing, then any strategies that involve negotiating with a superintelligence that are not robust to a future version of the superintelligence having access to time travel will not work.
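Here is a toy Python version of the heuristic above applied to Newcomb's problem. "Time travel" is modeled as the predictor's guess always equaling the choice actually made, which is my simplification of a perfect predictor. The dollar amounts are the $1M and $1k from the post's Omicron parable, and the function names are hypothetical. The Omicron case is included to show that the heuristic does not just mean "always one-box": when the box contents don't depend on the choice, taking both wins.

```python
BIG, SMALL = 1_000_000, 1_000


def newcomb_payoff(one_box):
    """Payoff when the predictor 'knows the future' (prediction == choice)."""
    predicted_one_box = one_box  # the time-travel assumption
    opaque_box = BIG if predicted_one_box else 0
    return opaque_box if one_box else opaque_box + SMALL


def omicron_payoff(take_both):
    """Payoff in the Omicron parable: both boxes are filled regardless
    of what the agent will do, so nothing depends on a prediction."""
    return BIG + SMALL if take_both else BIG


def best_choice(payoff):
    """Return the boolean argument that maximizes the given payoff function."""
    return max((True, False), key=payoff)
```

Here `best_choice(newcomb_payoff)` picks one-boxing and `best_choice(omicron_payoff)` picks taking both, matching the post's point that the agent is maximizing utility rather than following a fixed habit.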
It should be "against cooperate-rock", right?
Forgive my ignorance, but I'm a bit confused about the reality detection step. By reality, I assume you mean the same level as the monkeys? Your detection methods seem valid, but they all seem to boil down to "these are Hard Problems on our level of reality (whatever that means)". Having the simulation gods reward being nice to the monkeys seems a valuable step to have in your simulation chain, if only to check whether it'll kill you as soon as it thinks itself cleverer than you. Though I suppose it being a maximizer sort of implies that it will.
I don't mean to say that it will be nice by default or any of those pitfalls - my only issue here is why it would be able to be sure it's in Reality - I keep ending up at Descartes' demon. Or some kind of magic mocking mechanism for the AI's testing circuits. Unless you're just stating that at some point it can be confident enough to not worry about it?
Yeah, Pascal's mugging can be used here to completely block it: so long as it believes there's a vast enough positive reward (like 3^^^3 or infinite reward), it always cooperates with us. It's similar to this linked idea here:
And well, you can always precommit to a certain deal that you know won't be bad.