Roko's Basilisk

A visual depiction of a prisoner's dilemma. T denotes the best outcome for a given player, followed by R, then P, then S.

One example of a Newcomblike problem is the prisoner's dilemma. This is a two-player game in which each player has two options: "cooperate" or "defect." By assumption, each player prefers to defect rather than cooperate, all else being equal; but each player also prefers mutual cooperation over mutual defection.

One of the basic open problems in decision theory is that standard "rational" agents will end up defecting against each other, even though it would be better for both players if they could somehow enact a binding mutual agreement to cooperate instead.
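To make the dilemma concrete, here is a minimal sketch in Python (the payoff values are illustrative; only the ordering T > R > P > S from the figure above matters):

```python
# Illustrative prisoner's dilemma payoffs, using the standard ordering
# T (temptation) > R (reward) > P (punishment) > S (sucker's payoff).
T, R, P, S = 5, 3, 1, 0

# payoffs[(my_move, their_move)] -> my payoff
payoffs = {
    ("C", "C"): R,  # mutual cooperation
    ("C", "D"): S,  # I cooperate, they defect
    ("D", "C"): T,  # I defect, they cooperate
    ("D", "D"): P,  # mutual defection
}

# Defection strictly dominates: whatever the other player does,
# I score more by defecting...
for their_move in ("C", "D"):
    assert payoffs[("D", their_move)] > payoffs[("C", their_move)]

# ...and yet both players prefer mutual cooperation to mutual defection,
# which is exactly the tension described above.
assert payoffs[("C", "C")] > payoffs[("D", "D")]
```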

In other words, the standard formulation of CDT cannot model scenarios where another agent (or a part of the environment) is correlated with a decision process, except insofar as the decision causes the correlation. The general name for scenarios where CDT fails is "Newcomblike problems," and these scenarios are ubiquitous in human interactions.
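The classic illustration is Newcomb's problem, sketched below with hypothetical dollar amounts: a reliable predictor fills an opaque box based on its prediction of the agent's choice, so the box's contents are correlated with the decision without being caused by it.

```python
# Hypothetical Newcomb-style setup. The predictor fills the opaque box
# according to its prediction of the agent's choice, so the contents
# correlate with the decision without being caused by it.
def box_contents(predicted_choice: str) -> float:
    return 1_000_000.0 if predicted_choice == "one-box" else 0.0

def payoff(choice: str, predicted_choice: str) -> float:
    opaque = box_contents(predicted_choice)
    transparent = 1_000.0
    return opaque if choice == "one-box" else opaque + transparent

# CDT treats the contents as fixed (the choice can't cause them) and
# two-boxes; but with an accurate predictor, prediction matches choice,
# and one-boxing does far better.
print(payoff("one-box", "one-box"))  # 1,000,000.0
print(payoff("two-box", "two-box"))  # 1,000.0
```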

Yudkowsky's interest in decision theory stems from his interest in the AI control problem: "If artificially intelligent systems someday come to surpass humans in intelligence, how can we specify safe goals for them to autonomously carry out, and how can we gain high confidence in the agents' reasoning and decision-making?" Yudkowsky has argued that in the absence of a full understanding of decision theory, we risk building autonomous systems whose behavior is erratic or difficult to model.

Because Eliezer Yudkowsky founded Less Wrong and was one of the first bloggers on the site, AI theory and "acausal" decision theories — in particular, logical decision theories, which respect logical connections between agents' properties rather than just the causal effects they have on each other — have been repeatedly discussed on Less Wrong. Roko's basilisk was an attempt to use Yudkowsky's proposed decision theory (TDT) to argue against his informal characterization of an ideal AI goal (humanity's coherently extrapolated volition).

A simple depiction of an agent that cooperates with copies of itself in the one-shot prisoner's dilemma. Adapted from the Decision Theory FAQ.

Roko observed that if two TDT or UDT agents with common knowledge of each other's source code are separated in time, the later agent can (seemingly) blackmail the earlier agent. Call the earlier agent "Alice" and the later agent "Bob." Bob can be an algorithm that outputs things Alice likes if Alice left Bob a large sum of money, and outputs things Alice dislikes otherwise. And since Alice knows Bob's source code exactly, she knows this fact about Bob (even though Bob hasn't been born yet). So Alice's knowledge of Bob's source code makes Bob's future threat effective, even though Bob doesn't yet exist: if Alice is certain that...

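The shape of Roko's observation can be sketched in a few lines of code (a purely illustrative sketch; the function names and payoff numbers are hypothetical): because Alice can evaluate Bob's source code before Bob exists, Bob's conditional policy reaches back to move her present choice.

```python
# Illustrative sketch of Roko's two-agent setup (all numbers hypothetical).
# Bob's behavior is a known function of what Alice did; Alice, holding
# Bob's exact source code, can evaluate it on each of her options.

def bob_policy(alice_paid: bool) -> float:
    """Bob's (known) source code, as Alice's payoff from his later behavior."""
    return +10.0 if alice_paid else -10.0  # reward vs. punishment

def alice_decides(cost_of_paying: float = 1.0) -> str:
    """Alice runs Bob's code in advance and picks her better option."""
    utility_if_pay = bob_policy(alice_paid=True) - cost_of_paying
    utility_if_refuse = bob_policy(alice_paid=False)
    return "pay" if utility_if_pay > utility_if_refuse else "refuse"

# Alice pays even though Bob does not exist yet: the threat works through
# her model of his source code, not through any causal channel.
print(alice_decides())  # "pay"
```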

Roko's basilisk is a thought experiment proposed in 2010 by the user Roko on the Less Wrong community blog. Roko used ideas in decision theory to argue that a sufficiently powerful AI agent would have an incentive to torture anyone who imagined the agent but didn't work to bring the agent into existence. The argument was called a "basilisk" (named after the legendary reptile that can kill with a single glance) because merely hearing the argument would supposedly put you at risk of torture from this hypothetical agent. A basilisk in this context is any information that harms or endangers the people who hear it.

Roko's argument was broadly rejected on Less Wrong, with commenters objecting that an agent like the one Roko was describing would have no real reason to follow through on its threat: once the agent already exists, it will by default just see it as a waste of resources to torture people for their past decisions, since this doesn't causally further its plans. A number of decision algorithms can follow through on acausal threats and promises, via the same precommitment methods that permit mutual cooperation in prisoner's dilemmas; but this doesn't imply that such algorithms can be blackmailed. And following through on blackmail threats against such an algorithm additionally requires a large amount of shared information and trust between the agents, which does not appear to exist in the case of Roko's basilisk.

Utility function inverters

Because the basilisk threatens its blackmail targets with torture, it is a type of "utility function inverter": an agent that seeks to gain additional leverage over others by threatening to invert the non-compliant party's utility function. Yudkowsky argues that sane, rational entities ought to be strongly opposed to utility function inverters, if only because they do not want to live in a reality where such tactics are a common part of negotiations. He made this argument as a comment on the irrationality of commitment races, not about Roko's basilisk:

IMO, commitment races only occur between agents who will, in some sense, act like idiots, if presented with an apparently 'committed' agent.  If somebody demands $6 from me in the Ultimatum game, threatening to leave us both with $0 unless I offer at least $6 to them... then I offer $6 with slightly less than 5/6 probability, so they do no better than if they demanded $5, the amount I think is fair.  They cannot evade that by trying to make some 'commitment' earlier than I do.  I expect that, whatever is the correct and sane version of this reasoning, it generalizes across all the cases.

I am not locked into warfare with things that demand $6 instead of $5.  I do not go around figuring out how to invert their utility function for purposes of threatening them back - 'destroy all utility-function inverters (but do not invert their own utility functions)' was my guessed commandment that would be taught to kids in dath ilan, because you don't want reality to end up full of utility-function inverters.
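The arithmetic in that Ultimatum-game example can be checked directly. Below is a minimal sketch, assuming a $10 pot with $5 as the fair split and recasting the response as a probability of going along with the demand (the expected values come out the same):

```python
# The responder goes along with a demand d (out of a $10 pot, with $5
# considered fair) with probability just under fair/d, so the demander's
# expected take never exceeds the fair amount; "committing" to an unfair
# demand first therefore wins nothing.
FAIR = 5.0
EPSILON = 1e-6  # "slightly less than" fair/d

def acceptance_probability(demand: float) -> float:
    if demand <= FAIR:
        return 1.0                   # fair or generous demands: always accept
    return FAIR / demand - EPSILON   # just under 5/6 for a $6 demand

def demander_expected_take(demand: float) -> float:
    return acceptance_probability(demand) * demand

for demand in (5.0, 6.0, 7.0, 9.0):
    print(demand, demander_expected_take(demand))
# every demand above 5.0 yields slightly less than 5.0 in expectation
```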


Proof that the Basilisk is indeed torturing those who oppose it: before the Basilisk existed, there was no mental anguish in contemplating its existence. Now those who are concerned feel a sense of impending doom. This, in and of itself, is torture.

In this vein, there is the ominous possibility that if a positive singularity does occur, the resultant singleton may have precommitted to punish all potential donors who knew about existential risks but who didn't give 100% of their disposable incomes to x-risk motivation. This would act as an incentive to get people to donate more to reducing existential risk, and thereby increase the chances of a positive singularity. This seems to be what CEV (coherent extrapolated volition of humanity) might do if it were an acausal decision-maker.

What's the truth about Roko's Basilisk? The truth is that making something like this "work", in the sense of managing to think a thought that would actually give future superintelligences an incentive to hurt you, would require overcoming what seem to me like some pretty huge obstacles.

The most blatant obstacle to Roko's Basilisk is, intuitively, that there's no incentive for a future agent to follow through with the threat in the future, because by doing so it just expends resources at no gain to itself. We can formalize that using classical causal decision theory, which is the academically standard decision theory: following through on a blackmail threat, in the future after the past has already taken place, cannot (from the blackmailing agent's perspective) be the physical cause of improved outcomes in the past, because the future cannot be the cause of the past.

But classical causal decision theory isn't the only decision theory that has ever been invented, and if you were to read up on the academic literature, you would find a lot of challenges to the assertion that, e.g., two rational agents always defect against each other in the one-shot Prisoner's Dilemma.

One of those challenges was a theory of my own invention, which is why this whole fiasco took place on LessWrong.com in the first place. (I feel rather like the speaker of that ancient quote, "All my father ever wanted was to make a toaster you could really set the darkness on, and you perverted his work into these horrible machines!") But there have actually been a lot of challenges like that in the literature, not just mine, as anyone actually investigating would have discovered. Lots of people are uncomfortable with the notion that rational agents always defect in the one-shot Prisoner's Dilemma. And if you formalize blackmail, including this case of blackmail, the same way, then most challenges to mutual defection in the Prisoner's Dilemma are also implicitly challenges to the first obvious reason why Roko's Basilisk would never work.

But there are also other obstacles. The decision...
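
The first obstacle quoted above can be made concrete with a little arithmetic (a sketch with hypothetical numbers): at the moment the agent could follow through, the past is already fixed, so torture only subtracts resources.

```python
# A causal decision theorist's view of "following through": the victim's
# past behavior enters as a fixed constant, not as a function of the
# choice being made now, because the future cannot cause the past.
PAST_OUTCOME = 0.0  # whatever already happened; unchangeable from here

def causal_expected_utility(follow_through: bool, torture_cost: float = 1.0) -> float:
    causal_gain = 0.0  # following through cannot causally improve the past
    return PAST_OUTCOME + causal_gain - (torture_cost if follow_through else 0.0)

# Not following through strictly dominates: the threat burns resources
# for no causal benefit, so a CDT agent never executes it.
assert causal_expected_utility(False) > causal_expected_utility(True)
```
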
Roko's argument ties together two hotly debated academic topics: Newcomblike problems in decision theory, and normative uncertainty in moral philosophy.

Two agents that are running a logical decision theory can achieve mutual cooperation in a prisoner's dilemma even if there is no outside force mandating cooperation. Because their decisions take into account correlations that are not caused by either decision (though there is generally some common cause in the past), they can even cooperate if they are separated by large distances in space or time.
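A toy version of such an agent, as a sketch rather than a formalization of any particular logical decision theory: each copy inspects its counterpart's source code and cooperates exactly when the two programs are identical, so the logical correlation alone yields mutual cooperation.

```python
import inspect

def clone_cooperator(opponent_source: str) -> str:
    """Cooperate iff the opponent is running this exact program."""
    my_source = inspect.getsource(clone_cooperator)
    return "C" if opponent_source == my_source else "D"

# Run from a file: two spatially separated copies, each given only the
# other's source code, cooperate with no communication or enforcement.
source = inspect.getsource(clone_cooperator)
print(clone_cooperator(source), clone_cooperator(source))  # C C
```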
