Roko's Basilisk

Roko's basilisk cannot exist if humans do not cooperate to create it. 

However, if we had a 'grey' AI that would reward the people who built it and torture those who have envisioned, but not built it, then this gets resolved back to the original prisoner's dilemma problems.

2habryka10moI think the Open Thread is probably a generally better place to bring up random new ideas related to Roko's basilisk stuff. This page is more for discussing the current content of the page, and how it might be improved.

Proofs that the Basilisk is indeed torturing those that oppose it: Before the Basilisk existed there was no mental angst in contemplating it's existence. Now those that are concerned have a feeling of impending doom. This in and of itself is torture.


A simple depiction of an agent that cooperates with copies of itself in the one-shot prisoner's dilemma. Adapted from the Decision Theory FAQ.


In this vein, there is the ominous possibility that if a positive singularity does occur, the resultant singleton may have precommitted to punish all potential donors who knew about existential risks but who didn't give 100% of their disposable incomes to x-risk motivation. This would act as an incentive to get people to donate more to reducing existential risk, and thereby increase the chances of a positive singularity. This seems to be what CEV (coherent extrapolated volition of humanity) might do if it were an acausal decision-maker.

What's the truth about Roko's Basilisk? The truth is that making something like this "work", in the sense of managing to think a thought that would actually give future superintelligences an incentive to hurt you, would require overcoming what seem to me like some pretty huge obstacles.

The most blatant obstacle to Roko's Basilisk is, intuitively, that there's no incentive for a future agent to follow through with the threat in the future, because by doing so it just expends resources at no gain to itself. We can formalize that using classical causal decision theory, which is the academically standard decision theory: following through on a blackmail threat, in the future after the past has already taken place, cannot (from the blackmailing agent's perspective) be the physical cause of improved outcomes in the past, because the future cannot be the cause of the past.

But classical causal decision theory isn't the only decision theory that has ever been invented, and if you were to read up on the academic literature, you would find a lot of challenges to the assertion that, e.g., two rational agents always defect against each other in the one-shot Prisoner's Dilemma.

One of those challenges was a theory of my own invention, which is why this whole fiasco took place on in the first place. (I feel rather like the speaker of that ancient quote, "All my father ever wanted was to make a toaster you could really set the darkness on, and you perverted his work into these horrible machines!") But there have actually been a lot of challenges like that in the literature, not just mine, as anyone actually investigating would have discovered. Lots of people are uncomfortable with the notion that rational agents always defect in the one-shot Prisoner's Dilemma. And if you formalize blackmail, including this case of blackmail, the same way, then most challenges to mutual defection in the Prisoner's Dilemma are also implicitly challenges to the first obvious reason why Roko's Basilisk would never work.

But there are also other obstacles. The decision theory I proposed back in the day says that you have to know certain things about the other agent in order to achieve mutual cooperation in the Prisoner's Dilemma, and that's with both parties trying to set up a situation which leads to mutual cooperation instead of mutual defection. As I presently understand the situation, there is literally nobody on Earth, including me, who has the knowledge needed to set themselves up to be blackmailed if they were deliberately trying to make that happen.

Any potentially blackmailing AI would much prefer to have you believe that it is blackmailing you, without actually expending resources on following through with the blackmail, insofar as they think they can exert any control on you at all via an exotic decision theory. Just like in the one-shot Prisoner's Dilemma the "ideal" outcome is for the other player to believe you are modeling them and will cooperate if and only if they cooperate, and so they cooperate, but then actually you just defect anyway. For the other player to be confident this will not happen in the Prisoner's Dilemma, for them to expect you not to sneakily defect anyway, they must have some very strong knowledge about you. In the case of Roko's Basilisk, "defection" corresponds to not actually torturing anyone, not expending resources on that, and just letting them believe that you will blackmail them. Two AI agents with sufficiently strong knowledge of each other, and heavily motivated to achieve mutual cooperation on the Prisoner's Dilemma, might be able to overcome this obstacle and cooperate with confidence. But why would you put in that degree of effort — if you even could, which I don't think you as a human can — in order to give a blackmailing agent an incentive to actually carry through on its threats?

I have written the above with some reluctance, because even if I don't yet see a way to repair this obstacle myself, somebody else might see how to repair it now that I've said what it is. Which is not a good general procedure for handling infohazards; people with expert knowledge on them should, obviously, as a matter of professional ethics, just never discuss them at all, including describing why a particular proposal doesn't work, just in case there's some unforeseen clever way to repair the proposal. There are other obstacles here which I am not discussing, just in case the logic I described above has a flaw. Nonetheless, so far as I know, Roko's Basilisk does not work, nobody has actually been bitten by it, and everything I have done was in the service of what I thought was the obvious Good General Procedure for Handling Potential Infohazards[.]

[Roko:] "One might think that the possibility of CEV punishing people couldn't possibly be taken seriously enough by anyone to actually motivate them. But in fact one person at SIAI was severely worried by this, to the point of having terrible nightmares, though ve wishes to remain anonymous."

I don't usually talk like this, but I'm going to make an exception for this case.

Listen to me very closely, you idiot.


There's an obvious equilibrium to this problem where you engage in all positive acausal trades and ignore all attempts at acausal blackmail. Until we have a better worked-out version of TDT and we can prove that formally, it should just be OBVIOUS that you DO NOT THINK ABOUT DISTANT BLACKMAILERS in SUFFICIENT DETAIL that they have a motive toACTUALLY [sic] BLACKMAIL YOU.

If there is any part of this acausal trade that is positive-sum and actually worth doing, that is exactly the sort of thing you leave up to an FAI. We probably also have the FAI take actions that cancel out the impact of anyone motivated by true rather than imagined blackmail, so as to obliterate the motive of any superintelligences to engage in blackmail.

Meanwhile I'm banning this post so that it doesn't (a) give people horrible nightmares and (b) give distant superintelligences a motive to follow through on blackmail against people dumb enough to think about them in sufficient detail, though, thankfully, I doubt anyone dumb enough to do this knows the sufficient detail. (I'm not sure I know the sufficient detail.)

You have to be really clever to come up with a genuinely dangerous thought. I am disheartened that people can be clever enough to do that and not clever enough to do the obvious thing and KEEP THEIR IDIOT MOUTHS SHUT about it, because it is much more important to sound intelligent when talking to your friends. This post was STUPID.

(For those who have no idea why I'm using capital letters for something that just sounds like a random crazy idea, and worry that it means I'm as crazy as Roko, the gist of it was that he just did something that potentially gives superintelligences an increased motive to do extremely evil things in an attempt to blackmail us. It is the sort of thing you want to be EXTREMELY CONSERVATIVE about NOT DOING.)

There is apparently a idea so horrible, so utterly Cuthulian (sic) in nature that it needs to be censored for our sanity. Simply knowing about it makes it more likely of becoming true in the real world. Elizer Yudkwosky and the other great rationalist keep us safe by deleting any posts with this one evil idea. Yes they really do believe that. Occasionally a poster will complain off topic about the idea being deleted.

You may be wondering why this is such a big deal for the LessWrong people, given the apparently far-fetched nature of the thought experiment. It’s not that Roko’s Basilisk will necessarily materialize, or is even likely to. It’s more that if you’ve committed yourself to timeless decision theory, then thinking about this sort of trade literally makes it more likely to happen. After all, if Roko’s Basilisk were to see that this sort of blackmail gets you to help it come into existence, then it would, as a rational actor, blackmail you. The problem isn’t with the Basilisk itself, but with you. Yudkowsky doesn’t censor every mention of Roko’s Basilisk because he believes it exists or will exist, but because he believes that the idea of the Basilisk (and the ideas behind it) is dangerous.

Now, Roko’s Basilisk is only dangerous if you believe all of the above preconditions and commit to making the two-box deal [sic] with the Basilisk. But at least some of the LessWrong members do believe all of the above, which makes Roko’s Basilisk quite literally forbidden knowledge. [...]

If you do not subscribe to the theories that underlie Roko’s Basilisk and thus feel no temptation to bow down to your once and future evil machine overlord, then Roko’s Basilisk poses you no threat. (It is ironic that it’s only a mental health risk to those who have already bought into Yudkowsky’s thinking.) Believing in Roko’s Basilisk may simply be a “referendum on autism,” as a friend put it.

When Roko posted about the Basilisk, I very foolishly yelled at him, called him an idiot, and then deleted the post.

Why I did that is not something you have direct access to, and thus you should be careful about Making Stuff Up, especially when there are Internet trolls who are happy to tell you in a loud authoritative voice what I was thinking, despite having never passed anything even close to an Ideological Turing Test on Eliezer Yudkowsky.

Why I yelled at Roko: Because I was caught flatfooted in surprise, because I was indignant to the point of genuine emotional shock, at the concept that somebody who thought they'd invented a brilliant idea that would cause future AIs to torture people who had the thought, had promptly posted it to the public Internet. In the course of yelling at Roko to explain why this was a bad thing, I made the further error — keeping in mind that I had absolutely no idea that any of this would ever blow up the way it did, if I had I would obviously have kept my fingers quiescent — of not making it absolutely clear using lengthy disclaimers that my yelling did not mean that I believed Roko was right about CEV-based agents torturing people who had heard about Roko's idea. It was obvious to me that no CEV-based agent would ever do that and equally obvious to me that the part about CEV was just a red herring; I more or less automatically pruned it from my processing of the suggestion and automatically generalized it to cover the entire class of similar scenarios and variants, variants which I considered obvious despite significant divergences (I forgot that other people were not professionals in the field). This class of all possible variants did strike me as potentially dangerous as a collective group, even though it did not occur to me that Roko's original scenario might be right — that was obviously wrong, so my brain automatically generalized it. [...]

What I considered to be obvious common sense was that you did not spread potential information hazards because it would be a crappy thing to do to someone. The problem wasn't Roko's post itself, about CEV, being correct. That thought never occurred to me for a fraction of a second. The problem was that Roko's post seemed near in idea-space to a large class of potential hazards, all of which, regardless of their plausibility, had the property that they presented no potential benefit to anyone. They were pure infohazards. The only thing they could possibly do was be detrimental to brains that represented them, if one of the possible variants of the idea turned out to be repairable of the obvious objections and defeaters. So I deleted it, because on my worldview there was no reason not to. I did not want to be a place where people were exposed to potential infohazards because somebody like me thought they were being clever about reasoning that they probably weren't infohazards. On my view, the key fact about Roko's Basilisk wasn't that it was plausible, or implausible, the key fact was just that shoving it in people's faces seemed like a fundamentally crap thing to do because there was no upside.

Again, I deleted that post not because I had decided that this thing probably presented a real hazard, but because I was afraid some unknown variant of it might, and because it seemed to me like the obvious General Procedure For Handling Things That Might Be Infohazards said you shouldn't post them to the Internet. If you look at the original SF story where the term "basilisk" was coined, it's about a mind-erasing image and the.... trolls, I guess, though the story predates modern trolling, who go around spraypainting the Basilisk on walls, using computer guidance so they don't know themselves what the Basilisk looks like, in hopes the Basilisk will erase some innocent mind, for the lulz. These people are the villains of the story. The good guys, of course, try to erase the Basilisk from the walls. Painting Basilisks on walls is a crap thing to do. Since there was no upside to being exposed to Roko's Basilisk, its probability of being true was irrelevant. And Roko himself had thought this was a thing that might actually work. So I yelled at Roko for violating basic sanity about infohazards for stupid reasons, and then deleted the post. He, by his own lights, had violated the obvious code for the ethical handling of infohazards, conditional on such things existing, and I was indignant about this.




  • Can formal decision agents be designed to resist blackmail?


  • Are information hazards a serious risk, and are there better ways of handling them?


  • Does the oversimplified coverage of Roko's argument suggest that "weird" philosophical topics are big liabilities for pedagogical or research-related activities?


[I]magine that you mostly endorse positions that your audience already agrees with, positions that are within a standard deviation of the median position on the issue, and then you finally gather up all your cherished, saved-up weirdness points and write a passionate defense of the importance of insect suffering. How do you think your audience is going to react? "Ugh, they used to be so normal, and then it was like they suddenly went crazy. I hope they go back to bashing the Rethuglicans soon."

A visual depiction of a prisoner's dilemma. T denotes the best outcome for a given player, followed by R, then P, then S.

Roko's argument ties together two hotly debated academic topics: Newcomblike problems in decision theory, and normative uncertainty in moral philosophy.

Two agents that are running a logical decision theory can achieve mutual cooperation in a prisoner's dilemma even if there is no outside force mandating cooperation. Because their decisions take into account correlations that are not caused by either decision (though there is generally some common cause in the past), they can even cooperate if they are separated by large distances in space or time.




Applied to Political Roko's basilisk by Ruby at 1y

From the old wiki discussion page:

Talk:Roko's basilisk

Weirdness points

Why bring up weirdness points here, of all places, when Roko's basilisk is known to be an invalid theory? Is this meant to say, "Don't use Roko's basilisk as a conversation starter for AI risks"? The reason for bringing up weirdness points on this page could do with being made a bit more explicit, otherwise I might just remove or move the section on weirdness points.--Greenrd (talk) 08:37, 29 December 2015 (AEDT)

I just wanted to say

That I didnt know about the term "basilisk" with that meaning and that makes it a basilisk for me. Or a meta-basilisk I should say. Now I'm finding hard not to look for examples on the internet.

Eliminate the Time Problem & the Basilisk Seems Unavoidable

Roko's Basilisk is refutable for the same reason that it makes our skin crawl, the time differential, the idea that a future AI would take retribution for actions predating its existence. The refutation is, more or less, why would it bother? Which I suppose makes sense, unless the AI is seeking to establish credibility. Nevertheless, the time dimension isn't critical to the Basilisk concept itself. At whatever point in the future a utilitarian AI (UAI) were to come into existence, there would no doubt be some who opposed it. If there were enough who opposed it to present a potential threat to the existence of the UAI, the UAI may be forced to defend itself by eliminating that risk, not because it presents a risk to the UAI, but because by presenting a risk to the UAI it presents a risk to the utilitarian goal.

Consider self driving cars with the following assumptions: currently about 1.25 million people are killed and 20-50 million injured each year in traffic accidents (; let's say a high quality self-driving system (SDS) would reduce this by 50%; BUT, some of those who die as a result of the SDS would not have died without the SDS. Deploying the SDS universally would seem a utilitarian imperative, as it would save over 600,000 lives per year. Yet some people may oppose doing so because of a bias in favor of human agency, and out of fear of the fact that there would be some quantity of SDS-caused deaths that would otherwise not occur.

Why would a UAI not eliminate 100,000 dissenters per year to achieve the utilitarian advantage of a net 500,000 lives saved?

TomR Oct 18 2019

The Fallacy of Information Hazards

The concept that a piece of information, like Roko's Basilisk, should not be disclosed, assumes (i) that no one else will think it and (ii) that a particular outcome, such as the eventual existence of the AI, is a predetermined certainty that can neither be (a) prevented or (b) mitigated by ensuring that its programming addresses the Balisk. I am unaware of any basis for either of these propositions.

TomR Oct 18 2019

Created by Rob Bensinger at 1y

Proofs that the Basilisk is indeed torturing those that oppose it: Before the Basilisk existed there was no mental angst in contemplating it's existence. Now those that are concerned have a feeling of impending doom. This in and of itself is torture.

Eliezer Yudkowsky proposed an alternative to CDT, timeless decision theory (TDT), that can achieve mutual cooperation in prisoner's dilemmas — provided both players are running TDT, and both players have common knowledge of this fact. The cryptographer Wei DeiDai subsequently developed a theory that outperforms both TDT and CDT, called updateless decision theory (UDT).

Other users on 'Less Wrong generally rejected Roko's arguments at the time, and skepticism about his supposed basilisk appears to have only increased with time. Subsequent discussion of Roko's basilisk has focused on Less Wrong moderator responses to Roko's post, rather than on the specific merits or dismerits of his argument.

Yudkowsky proceeded to delete Roko's post and the ensuing discussion, while banning further discussion of the topic on the blog. A few months later, RationalWiki sysadmin David Gerardan anonymous editor added a discussion of Roko's basilisk to an article covering Less Wrong. GerardThe editor inferred from Yudkowsky's comments that people on Less Wrong accepted Roko's argument:

Because Eliezer Yudkowsky founded Less Wrong and was one of the first bloggers on the site, AI theory and "acausal" decision theories — in particular, logical decision theories, which respect logical connections between agents' properties rather than just the causal effects they have on each other — have been repeatedly discussed on Less Wrong. In particular, Roko's basilisk was an attempt to use Yudkowsky's proposed decision theory (TDT) to argue against his informal characterization of an ideal AI goal (humanity's coherently extrapolated volition).

If Bob ran CDT, then he would be unable to blackmail Alice. A CDT agent would assume that its decision is independent of Alice's and would not waste resources on rewarding or punishing a once-off decision that has already happened; and we are assuming that Alice could spot this fact by reading CDT-Bob's source code. A TDT or UDT agent, on the other hand, can recognize that Alice in effect has a copy of Bob's source code in her head (insofar as she is accurately modeling Bob), and that Alice's decision and Bob's decision are therefore correlated — the same as if two copies of the same source code were in a Prisoner'prisoner's Dilemma.dilemma.

Other sources have repeated the claim that Less Wrong users think Roko's basilisk is a serious concern. However, none of these sources have yet cited supporting evidence on this point, aside from Less Wrong moderation activity itself. The(The ban, of course, didn't make it easy to collect good information.)

It hasn't been formally demonstrated that any logical decision theories givesgive in to blackmail, or what scenarios would make them vulnerable to blackmail. If it turned out that TDT or UDT were blackmailable, this would provide evidence against these beingsuggest that they aren't normatively optimal decision theories. For more background on open problems in decision theory, see the Decision Theory FAQ and "Toward Idealized Decision Theory".