Roko's Basilisk

A visual depiction of a prisoner's dilemma. T denotes the best outcome for a given player, followed by R, then P, then S.

One example of a Newcomblike problem is the prisoner's dilemma. This is a two-player game in which each player has two options: "cooperate" or "defect." By assumption, each player prefers to defect rather than cooperate, all else being equal; but each player also prefers mutual cooperation over mutual defection.

One of the basic open problems in decision theory is that standard "rational" agents will end up defecting against each other, even though it would be better for both players if they could somehow enact a binding mutual agreement to cooperate instead.
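To make the dilemma concrete, here is a minimal sketch in Python (the payoff values are illustrative; only the ordering T > R > P > S from the figure above matters):

```python
# Illustrative prisoner's dilemma payoffs, using the standard ordering
# T (temptation) > R (reward) > P (punishment) > S (sucker's payoff).
T, R, P, S = 5, 3, 1, 0

# payoffs[(my_move, their_move)] -> my payoff
payoffs = {
    ("C", "C"): R,  # mutual cooperation
    ("C", "D"): S,  # I cooperate, they defect
    ("D", "C"): T,  # I defect, they cooperate
    ("D", "D"): P,  # mutual defection
}

# Defection strictly dominates: whatever the other player does,
# I score more by defecting...
for their_move in ("C", "D"):
    assert payoffs[("D", their_move)] > payoffs[("C", their_move)]

# ...and yet both players prefer mutual cooperation to mutual defection,
# which is exactly the tension described above.
assert payoffs[("C", "C")] > payoffs[("D", "D")]
```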

In other words, the standard formulation of CDT cannot model scenarios where another agent (or a part of the environment) is correlated with a decision process, except insofar as the decision causes the correlation. The general name for scenarios where CDT fails is "Newcomblike problems," and these scenarios are ubiquitous in human interactions.
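The classic illustration is Newcomb's problem, sketched below with hypothetical dollar amounts: a reliable predictor fills an opaque box based on its prediction of the agent's choice, so the box's contents are correlated with the decision without being caused by it.

```python
# Hypothetical Newcomb-style setup. The predictor fills the opaque box
# according to its prediction of the agent's choice, so the contents
# correlate with the decision without being caused by it.
def box_contents(predicted_choice: str) -> float:
    return 1_000_000.0 if predicted_choice == "one-box" else 0.0

def payoff(choice: str, predicted_choice: str) -> float:
    opaque = box_contents(predicted_choice)
    transparent = 1_000.0
    return opaque if choice == "one-box" else opaque + transparent

# CDT treats the contents as fixed (the choice can't cause them) and
# two-boxes; but with an accurate predictor, prediction matches choice,
# and one-boxing does far better.
print(payoff("one-box", "one-box"))  # 1,000,000.0
print(payoff("two-box", "two-box"))  # 1,000.0
```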

Yudkowsky's interest in decision theory stems from his interest in the AI control problem: "If artificially intelligent systems someday come to surpass humans in intelligence, how can we specify safe goals for them to autonomously carry out, and how can we gain high confidence in the agents' reasoning and decision-making?" Yudkowsky has argued that in the absence of a full understanding of decision theory, we risk building autonomous systems whose behavior is erratic or difficult to model.

Because Eliezer Yudkowsky founded Less Wrong and was one of the first bloggers on the site, AI theory and "acausal" decision theories — in particular, logical decision theories, which respect logical connections between agents' properties rather than just the causal effects they have on each other — have been repeatedly discussed on Less Wrong. Roko's basilisk was an attempt to use Yudkowsky's proposed decision theory (TDT) to argue against his informal characterization of an ideal AI goal (humanity's coherently extrapolated volition).

A simple depiction of an agent that cooperates with copies of itself in the one-shot prisoner's dilemma. Adapted from the Decision Theory FAQ.

Roko observed that if two TDT or UDT agents with common knowledge of each other's source code are separated in time, the later agent can (seemingly) blackmail the earlier agent. Call the earlier agent "Alice" and the later agent "Bob." Bob can be an algorithm that outputs things Alice likes if Alice left Bob a large sum of money, and outputs things Alice dislikes otherwise. And since Alice knows Bob's source code exactly, she knows this fact about Bob (even though Bob hasn't been born yet). So Alice's knowledge of Bob's source code makes Bob's future threat effective, even though Bob doesn't yet exist: if Alice is certain that...

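The shape of Roko's observation can be sketched in a few lines of code (a purely illustrative sketch; the function names and payoff numbers are hypothetical): because Alice can evaluate Bob's source code before Bob exists, Bob's conditional policy reaches back to move her present choice.

```python
# Illustrative sketch of Roko's two-agent setup (all numbers hypothetical).
# Bob's behavior is a known function of what Alice did; Alice, holding
# Bob's exact source code, can evaluate it on each of her options.

def bob_policy(alice_paid: bool) -> float:
    """Bob's (known) source code, as Alice's payoff from his later behavior."""
    return +10.0 if alice_paid else -10.0  # reward vs. punishment

def alice_decides(cost_of_paying: float = 1.0) -> str:
    """Alice runs Bob's code in advance and picks her better option."""
    utility_if_pay = bob_policy(alice_paid=True) - cost_of_paying
    utility_if_refuse = bob_policy(alice_paid=False)
    return "pay" if utility_if_pay > utility_if_refuse else "refuse"

# Alice pays even though Bob does not exist yet: the threat works through
# her model of his source code, not through any causal channel.
print(alice_decides())  # "pay"
```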

Roko's basilisk is a thought experiment proposed in 2010 by the user Roko on the Less Wrong community blog. Roko used ideas in decision theory to argue that a sufficiently powerful AI agent would have an incentive to torture anyone who imagined the agent but didn't work to bring the agent into existence. The argument was called a "basilisk" (named after the legendary reptile that can kill with a single glance) because merely hearing the argument would supposedly put you at risk of torture from this hypothetical agent. A basilisk in this context is any information that harms or endangers the people who hear it.

Roko's argument was broadly rejected on Less Wrong, with commenters objecting that an agent like the one Roko was describing would have no real reason to follow through on its threat: once the agent already exists, it will by default just see it as a waste of resources to torture people for their past decisions, since this doesn't causally further its plans. A number of decision algorithms can follow through on acausal threats and promises, via the same precommitment methods that permit mutual cooperation in prisoner's dilemmas; but this doesn't imply that such algorithms can be blackmailed. And following through on blackmail threats against such an algorithm additionally requires a large amount of shared information and trust between the agents, which does not appear to exist in the case of Roko's basilisk.

Utility function inverters

Because the basilisk threatens its blackmail targets with torture, it is a type of "utility function inverter": an agent that seeks to gain additional leverage over others by threatening to invert the non-compliant party's utility function. Yudkowsky argues that sane, rational entities ought to be strongly opposed to utility function inverters, if only because they do not want to live in a reality where such tactics are a common part of negotiations. He made this argument as a comment on the irrationality of commitment races, not about Roko's basilisk:

IMO, commitment races only occur between agents who will, in some sense, act like idiots, if presented with an apparently 'committed' agent.  If somebody demands $6 from me in the Ultimatum game, threatening to leave us both with $0 unless I offer at least $6 to them... then I offer $6 with slightly less than 5/6 probability, so they do no better than if they demanded $5, the amount I think is fair.  They cannot evade that by trying to make some 'commitment' earlier than I do.  I expect that, whatever is the correct and sane version of this reasoning, it generalizes across all the cases.

I am not locked into warfare with things that demand $6 instead of $5.  I do not go around figuring out how to invert their utility function for purposes of threatening them back - 'destroy all utility-function inverters (but do not invert their own utility functions)' was my guessed commandment that would be taught to kids in dath ilan, because you don't want reality to end up full of utility-function inverters.
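The arithmetic in that Ultimatum-game example can be checked directly. Below is a minimal sketch, assuming a $10 pot with $5 as the fair split and recasting the response as a probability of going along with the demand (the expected values come out the same):

```python
# The responder goes along with a demand d (out of a $10 pot, with $5
# considered fair) with probability just under fair/d, so the demander's
# expected take never exceeds the fair amount; "committing" to an unfair
# demand first therefore wins nothing.
FAIR = 5.0
EPSILON = 1e-6  # "slightly less than" fair/d

def acceptance_probability(demand: float) -> float:
    if demand <= FAIR:
        return 1.0                   # fair or generous demands: always accept
    return FAIR / demand - EPSILON   # just under 5/6 for a $6 demand

def demander_expected_take(demand: float) -> float:
    return acceptance_probability(demand) * demand

for demand in (5.0, 6.0, 7.0, 9.0):
    print(demand, demander_expected_take(demand))
# every demand above 5.0 yields slightly less than 5.0 in expectation
```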


Proof that the Basilisk is indeed torturing those who oppose it: before the Basilisk existed, there was no mental anguish in contemplating its existence. Now those who are concerned feel a sense of impending doom. This, in and of itself, is torture.

In this vein, there is the ominous possibility that if a positive singularity does occur, the resultant singleton may have precommitted to punish all potential donors who knew about existential risks but who didn't give 100% of their disposable incomes to x-risk motivation. This would act as an incentive to get people to donate more to reducing existential risk, and thereby increase the chances of a positive singularity. This seems to be what CEV (coherent extrapolated volition of humanity) might do if it were an acausal decision-maker.

What's the truth about Roko's Basilisk? The truth is that making something like this "work", in the sense of managing to think a thought that would actually give future superintelligences an incentive to hurt you, would require overcoming what seem to me like some pretty huge obstacles.

The most blatant obstacle to Roko's Basilisk is, intuitively, that there's no incentive for a future agent to follow through with the threat in the future, because by doing so it just expends resources at no gain to itself. We can formalize that using classical causal decision theory, which is the academically standard decision theory: following through on a blackmail threat, in the future after the past has already taken place, cannot (from the blackmailing agent's perspective) be the physical cause of improved outcomes in the past, because the future cannot be the cause of the past.

But classical causal decision theory isn't the only decision theory that has ever been invented, and if you were to read up on the academic literature, you would find a lot of challenges to the assertion that, e.g., two rational agents always defect against each other in the one-shot Prisoner's Dilemma.

One of those challenges was a theory of my own invention, which is why this whole fiasco took place on LessWrong.com in the first place. (I feel rather like the speaker of that ancient quote, "All my father ever wanted was to make a toaster you could really set the darkness on, and you perverted his work into these horrible machines!") But there have actually been a lot of challenges like that in the literature, not just mine, as anyone actually investigating would have discovered. Lots of people are uncomfortable with the notion that rational agents always defect in the one-shot Prisoner's Dilemma. And if you formalize blackmail, including this case of blackmail, the same way, then most challenges to mutual defection in the Prisoner's Dilemma are also implicitly challenges to the first obvious reason why Roko's Basilisk would never work.

But there are also other obstacles. The decision...
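
The first obstacle quoted above can be made concrete with a little arithmetic (a sketch with hypothetical numbers): at the moment the agent could follow through, the past is already fixed, so torture only subtracts resources.

```python
# A causal decision theorist's view of "following through": the victim's
# past behavior enters as a fixed constant, not as a function of the
# choice being made now, because the future cannot cause the past.
PAST_OUTCOME = 0.0  # whatever already happened; unchangeable from here

def causal_expected_utility(follow_through: bool, torture_cost: float = 1.0) -> float:
    causal_gain = 0.0  # following through cannot causally improve the past
    return PAST_OUTCOME + causal_gain - (torture_cost if follow_through else 0.0)

# Not following through strictly dominates: the threat burns resources
# for no causal benefit, so a CDT agent never executes it.
assert causal_expected_utility(False) > causal_expected_utility(True)
```
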
Roko's argument ties together two hotly debated academic topics: Newcomblike problems in decision theory, and normative uncertainty in moral philosophy.

Two agents that are running a logical decision theory can achieve mutual cooperation in a prisoner's dilemma even if there is no outside force mandating cooperation. Because their decisions take into account correlations that are not caused by either decision (though there is generally some common cause in the past), they can even cooperate if they are separated by large distances in space or time.
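A toy version of such an agent, as a sketch rather than a formalization of any particular logical decision theory: each copy inspects its counterpart's source code and cooperates exactly when the two programs are identical, so the logical correlation alone yields mutual cooperation.

```python
import inspect

def clone_cooperator(opponent_source: str) -> str:
    """Cooperate iff the opponent is running this exact program."""
    my_source = inspect.getsource(clone_cooperator)
    return "C" if opponent_source == my_source else "D"

# Run from a file: two spatially separated copies, each given only the
# other's source code, cooperate with no communication or enforcement.
source = inspect.getsource(clone_cooperator)
print(clone_cooperator(source), clone_cooperator(source))  # C C
```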
