A visual depiction of a prisoner's dilemma. T denotes the best outcome for a given player, followed by R, then P, then S.
One example of a Newcomblike problem is the prisoner's dilemma. This is a two-player game in which each player has two options: "cooperate" or "defect." By assumption, each player prefers to defect rather than cooperate, all else being equal; but each player also prefers mutual cooperation over mutual defection.
One of the basic open problems in decision theory is that standard "rational" agents will end up defecting against each other, even though it would be better for both players if they could somehow enact a binding mutual agreement to cooperate instead.
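To make the payoff structure concrete, here is a minimal sketch (illustrative numbers only, matching the T > R > P > S ordering from the figure above) of why defection dominates even though mutual cooperation beats mutual defection:

```python
# Illustrative one-shot prisoner's dilemma payoffs; only the ordering
# T > R > P > S matters, the specific numbers are arbitrary.
T, R, P, S = 3, 2, 1, 0

# payoff[(my_move, their_move)] -> my payoff
payoff = {
    ("defect", "cooperate"): T,     # temptation
    ("cooperate", "cooperate"): R,  # reward for mutual cooperation
    ("defect", "defect"): P,        # punishment for mutual defection
    ("cooperate", "defect"): S,     # sucker's payoff
}

# Whichever move the other player makes, defecting pays strictly more...
for their_move in ("cooperate", "defect"):
    assert payoff[("defect", their_move)] > payoff[("cooperate", their_move)]

# ...yet both players prefer mutual cooperation to mutual defection.
assert payoff[("cooperate", "cooperate")] > payoff[("defect", "defect")]
```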
In other words, the standard formulation of CDT cannot model scenarios where another agent (or a part of the environment) is correlated with a decision process, except insofar as the decision causes the correlation. The general name for scenarios where CDT fails is "Newcomblike problems," and these scenarios are ubiquitous in human interactions.
Yudkowsky's interest in decision theory stems from his interest in the AI control problem: "If artificially intelligent systems someday come to surpass humans in intelligence, how can we specify safe goals for them to autonomously carry out, and how can we gain high confidence in the agents' reasoning and decision-making?" Yudkowsky has argued that in the absence of a full understanding of decision theory, we risk building autonomous systems whose behavior is erratic or difficult to model.
Because Eliezer Yudkowsky founded Less Wrong and was one of the first bloggers on the site, AI theory and "acausal" decision theories — in particular, logical decision theories, which respect logical connections between agents' properties rather than just the causal effects they have on each other — have been repeatedly discussed on Less Wrong. Roko's basilisk was an attempt to use Yudkowsky's proposed decision theory (TDT) to argue against his informal characterization of an ideal AI goal (humanity's coherently extrapolated volition).
A simple depiction of an agent that cooperates with copies of itself in the one-shot prisoner's dilemma. Adapted from the Decision Theory FAQ.
Roko observed that if two TDT or UDT agents with common knowledge of each other's source code are separated in time, the later agent can (seemingly) blackmail the earlier agent. Call the earlier agent "Alice" and the later agent "Bob." Bob can be an algorithm that outputs things Alice likes if Alice left Bob a large sum of money, and outputs things Alice dislikes otherwise. And since Alice knows Bob's source code exactly, she knows this fact about Bob (even though Bob hasn't been born yet). So Alice's knowledge of Bob's source code makes Bob's future threat effective, even though Bob doesn't yet exist: if Alice is certain that Bob will someday exist, then mere knowledge of what Bob would do if he could get away with it seems to force Alice to comply with his hypothetical demands.
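A toy rendering of this setup, with invented function names and payoffs purely for illustration, shows how Bob's behavior can be a fixed function of Alice's earlier choice, and how Alice's ability to read that function before Bob exists is what gives the threat its force:

```python
# Toy model of the Alice/Bob scenario; all names and payoffs are invented.
# Bob's behavior is a fixed, known function of what Alice did earlier.

def bob(alice_paid: bool) -> str:
    """Bob's (hypothetical) source code, fixed before he exists."""
    return "reward Alice" if alice_paid else "punish Alice"

def alice(utility: dict, cost_of_paying: float) -> bool:
    """Alice decides whether to pay, knowing bob's source code exactly."""
    # Alice predicts Bob's response to each of her options...
    outcome_if_pay = bob(alice_paid=True)
    outcome_if_refuse = bob(alice_paid=False)
    # ...and pays only if the predicted gain outweighs the cost of paying.
    return utility[outcome_if_pay] - cost_of_paying > utility[outcome_if_refuse]

# If Alice is certain Bob will exist and knows his code, the mere
# *prediction* of "punish Alice" can move her decision, even though
# Bob has not yet caused anything.
print(alice({"reward Alice": 0, "punish Alice": -100}, cost_of_paying=10))  # True
```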
Singularity here refers to an intelligence explosion, and singleton refers to a superintelligent AI system. Since a highly moral AI agent (one whose actions are consistent with our coherently extrapolated volition) would want to be created as soon as possible, Roko argued that such an AI would use acausal blackmail to give humans stronger incentives to create it. Roko made the claim that the hypothetical AI agent would particularly target people who had thought about this argument, because they would have a better chance of mentally simulating the AI's source code. Roko added: "Of course this would be unjust, but is the kind of unjust thing that is oh-so-very utilitarian."
What's the truth about Roko's Basilisk? The truth is that making something like this
\"work\""work", in the sense of managing to think a thought that would actually give future superintelligences an incentive to hurt you, would require overcoming what seem to me like some pretty huge obstacles.The most blatant obstacle to Roko's Basilisk is, intuitively, that there's no incentive for a future agent to follow through with the threat in the future, because by doing so it just expends resources at no gain to itself. We can formalize that using classical causal decision theory, which is the academically standard decision theory: following through on a blackmail threat, in the future after the past has already taken place, cannot (from the blackmailing agent's perspective) be the physical cause of improved outcomes in the past, because the future cannot be the cause of the past.
But classical causal decision theory isn't the only decision theory that has ever been invented, and if you were to read up on the academic literature, you would find a lot of challenges to the assertion that, e.g., two rational agents always defect against each other in the one-shot Prisoner's Dilemma.
One of those challenges was a theory of my own invention, which is why this whole fiasco took place on LessWrong.com in the first place. (I feel rather like the speaker of that ancient quote,
\"All"All my father ever wanted was to make a toaster you could really set the darkness on, and you perverted his work into these horrible machines!\") But there have actually been a lot of challenges like that in the literature, not just mine, as anyone actually investigating would have discovered. Lots of people are uncomfortable with the notion that rational agents always defect in the one-shot Prisoner's Dilemma. And if you formalize blackmail, including this case of blackmail, the same way, then most challenges to mutual defection in the Prisoner's Dilemma are also implicitly challenges to the first obvious reason why Roko's Basilisk would never work.But there are also other obstacles. The decision theory I proposed back in the day says that you have to know certain things about the other agent in order to achieve mutual cooperation in the Prisoner's Dilemma, and that's with both parties trying to set up a situation which leads to mutual cooperation instead of mutual defection. As I presently understand the situation, there is literally nobody on Earth, including me, who has the knowledge needed to set themselves up to be blackmailed if they were deliberately trying to make that happen.
Any potentially blackmailing AI would much prefer to have you believe that it is blackmailing you, without actually expending resources on following through with the blackmail, insofar as they think they can exert any control on you at all via an exotic decision theory. Just like in the one-shot Prisoner's Dilemma the
\"ideal\""ideal" outcome is for the other player to believe you are modeling them and will cooperate if and only if they cooperate, and so they cooperate, but then actually you just defect anyway. For the other player to be confident this will not happen in the Prisoner's Dilemma, for them to expect you not to sneakily defect anyway, they must have some very strong knowledge about you. In the case of Roko's Basilisk,\"defection\""defection" corresponds to not actually torturing anyone, not expending resources on that, and just letting them believe that you will blackmail them. Two AI agents with sufficiently strong knowledge of each other, and heavily motivated to achieve mutual cooperation on the Prisoner's Dilemma, might be able to overcome this obstacle and cooperate with confidence. But why would you put in that degree of effort — if you even could, which I don't think you as a human can — in order to give a blackmailing agent an incentive to actually carry through on its threats?I have written the above with some reluctance, because even if I don't yet see a way to repair this obstacle myself, somebody else might see how to repair it now that I've said what it is. Which is not a good general procedure for handling infohazards; people with expert knowledge on them should, obviously, as a matter of professional ethics, just never discuss them at all, including describing why a particular proposal doesn't work, just in case there's some unforeseen clever way to repair the proposal. There are other obstacles here which I am not discussing, just in case the logic I described above has a flaw. Nonetheless, so far as I know, Roko's Basilisk does not work, nobody has actually been bitten by it, and everything I have done was in the service of what I thought was the obvious Good General Procedure for Handling Potential Infohazards[.]
[Roko:]
\"One"One might think that the possibility of CEV punishing people couldn't possibly be taken seriously enough by anyone to actually motivate them. But in fact one person at SIAI was severely worried by this, to the point of having terrible nightmares, though ve wishes to remain anonymous.\"I don't usually talk like this, but I'm going to make an exception for this case.
Listen to me very closely, you idiot.
YOU DO NOT THINK IN SUFFICIENT DETAIL ABOUT SUPERINTELLIGENCES CONSIDERING WHETHER OR NOT TO BLACKMAIL YOU. THAT IS THE ONLY POSSIBLE THING WHICH GIVES THEM A MOTIVE TO FOLLOW THROUGH ON THE BLACKMAIL.
There's an obvious equilibrium to this problem where you engage in all positive acausal trades and ignore all attempts at acausal blackmail. Until we have a better worked-out version of TDT and we can prove that formally, it should just be OBVIOUS that you DO NOT THINK ABOUT DISTANT BLACKMAILERS in SUFFICIENT DETAIL that they have a motive toACTUALLY [sic] BLACKMAIL YOU.
If there is any part of this acausal trade that is positive-sum and actually worth doing, that is exactly the sort of thing you leave up to an FAI. We probably also have the FAI take actions that cancel out the impact of anyone motivated by true rather than imagined blackmail, so as to obliterate the motive of any superintelligences to engage in blackmail.
Meanwhile I'm banning this post so that it doesn't (a) give people horrible nightmares and (b) give distant superintelligences a motive to follow through on blackmail against people dumb enough to think about them in sufficient detail, though, thankfully, I doubt anyone dumb enough to do this knows the sufficient detail. (I'm not sure I know the sufficient detail.)
You have to be really clever to come up with a genuinely dangerous thought. I am disheartened that people can be clever enough to do that and not clever enough to do the obvious thing and KEEP THEIR IDIOT MOUTHS SHUT about it, because it is much more important to sound intelligent when talking to your friends. This post was STUPID.
(For those who have no idea why I'm using capital letters for something that just sounds like a random crazy idea, and worry that it means I'm as crazy as Roko, the gist of it was that he just did something that potentially gives superintelligences an increased motive to do extremely evil things in an attempt to blackmail us. It is the sort of thing you want to be EXTREMELY CONSERVATIVE about NOT DOING.)
\"FAI\"FAI" here stands for \"Friendly"Friendly AI,\" a hypothetical superintelligent AI agent that can be trusted to autonomously promote desirable ends. Yudkowsky rejected the idea that Roko's basilisk could be called \"friendly\""friendly" or \"utilitarian,\"utilitarian," since torture and threats of blackmail are themselves contrary to common human values. Separately, Yudkowsky doubted that humans possessed enough information about any hypothetical unfriendly AI system to enter Alice's position even if we tried. Yudkowsky additionally argued that a well-designed version of Alice would precommit to resisting blackmail from Bob, while still accepting positive-sum acausal trades (e.g., ordinary contracts).
Interest in the topic increased over subsequent years. In 2014, Roko's basilisk was name-dropped in the webcomic xkcd. The magazine Slate ran an article on the thought experiment, titled "The Most Terrifying Thought Experiment of All Time."
Less Wrong user Gwern reports that "Only a few LWers seem to take the basilisk very seriously," adding, "It's funny how everyone seems to know all about who is affected by the Basilisk and how exactly, when they don't know any such people and they're talking to counterexamples to their confident claims."
When Roko posted about the Basilisk, I very foolishly yelled at him, called him an idiot, and then deleted the post.
Why I did that is not something you have direct access to, and thus you should be careful about Making Stuff Up, especially when there are Internet trolls who are happy to tell you in a loud authoritative voice what I was thinking, despite having never passed anything even close to an Ideological Turing Test on Eliezer Yudkowsky.
Why I yelled at Roko: Because I was caught flatfooted in surprise, because I was indignant to the point of genuine emotional shock, at the concept that somebody who thought they'd invented a brilliant idea that would cause future AIs to torture people who had the thought, had promptly posted it to the public Internet. In the course of yelling at Roko to explain why this was a bad thing, I made the further error — keeping in mind that I had absolutely no idea that any of this would ever blow up the way it did, if I had I would obviously have kept my fingers quiescent — of not making it absolutely clear using lengthy disclaimers that my yelling did not mean that I believed Roko was right about CEV-based agents torturing people who had heard about Roko's idea. It was obvious to me that no CEV-based agent would ever do that and equally obvious to me that the part about CEV was just a red herring; I more or less automatically pruned it from my processing of the suggestion and automatically generalized it to cover the entire class of similar scenarios and variants, variants which I considered obvious despite significant divergences (I forgot that other people were not professionals in the field). This class of all possible variants did strike me as potentially dangerous as a collective group, even though it did not occur to me that Roko's original scenario might be right — that was obviously wrong, so my brain automatically generalized it. [...]
What I considered to be obvious common sense was that you did not spread potential information hazards because it would be a crappy thing to do to someone. The problem wasn't Roko's post itself, about CEV, being correct. That thought never occurred to me for a fraction of a second. The problem was that Roko's post seemed near in idea-space to a large class of potential hazards, all of which, regardless of their plausibility, had the property that they presented no potential benefit to anyone. They were pure infohazards. The only thing they could possibly do was be detrimental to brains that represented them, if one of the possible variants of the idea turned out to be repairable of the obvious objections and defeaters. So I deleted it, because on my worldview there was no reason not to. I did not want LessWrong.com to be a place where people were exposed to potential infohazards because somebody like me thought they were being clever about reasoning that they probably weren't infohazards. On my view, the key fact about Roko's Basilisk wasn't that it was plausible, or implausible, the key fact was just that shoving it in people's faces seemed like a fundamentally crap thing to do because there was no upside.
Again, I deleted that post not because I had decided that this thing probably presented a real hazard, but because I was afraid some unknown variant of it might, and because it seemed to me like the obvious General Procedure For Handling Things That Might Be Infohazards said you shouldn't post them to the Internet. If you look at the original SF story where the term
\"basilisk\""basilisk" was coined, it's about a mind-erasing image and the.... trolls, I guess, though the story predates modern trolling, who go around spraypainting the Basilisk on walls, using computer guidance so they don't know themselves what the Basilisk looks like, in hopes the Basilisk will erase some innocent mind, for the lulz. These people are the villains of the story. The good guys, of course, try to erase the Basilisk from the walls. Painting Basilisks on walls is a crap thing to do. Since there was no upside to being exposed to Roko's Basilisk, its probability of being true was irrelevant. And Roko himself had thought this was a thing that might actually work. So I yelled at Roko for violating basic sanity about infohazards for stupid reasons, and then deleted the post. He, by his own lights, had violated the obvious code for the ethical handling of infohazards, conditional on such things existing, and I was indignant about this.
It hasn't been formally demonstrated that any logical decision theories give in to blackmail, or what scenarios would make them vulnerable to blackmail. If it turned out that TDT or UDT were blackmailable, this would suggest that they aren't normatively optimal decision theories. For more background on open problems in decision theory, see the Decision Theory FAQ and "Toward Idealized Decision Theory".
David Langford coined the term basilisk, in the sense of an information hazard that directly harms anyone who perceives it, in the 1988 science fiction story "BLIT." On a societal level, examples of real-world information hazards include the dissemination of specifications for dangerous technologies; on an individual level, examples include triggers for stress or anxiety disorders.
The Roko's basilisk incident suggests that information that is deemed dangerous or taboo is more likely to be spread rapidly. Parallels can be drawn to shock site and creepypasta links: many people have their interest piqued by such topics, and people also enjoy pranking each other by spreading purportedly harmful links. Although Roko's basilisk was never genuinely dangerous, real information hazards might propagate in a similar way, especially if the risks are non-obvious.
Non-specialists spread Roko's argument widely without first investigating the associated risks and benefits in any serious way. One take-away is that someone in possession of a serious information hazard should exercise caution in visibly censoring or suppressing it (cf. the Streisand effect). "Information Hazards: A Typology of Potential Harms from Knowledge" notes: "In many cases, the best response is no response, i.e., to proceed as though no such hazard existed." This also means that additional care may need to be taken in keeping risky information under wraps; retracting information that has been published to highly trafficked websites is often difficult or impossible. However, Roko's basilisk is an isolated incident (and an unusual one at that); it may not be possible to draw any strong conclusions without looking at a number of other examples.
\"Weirdness points\"points"
Peter Hurford argues in "You Have a Set Amount of Weirdness Points; Spend Them Wisely" that promoting or talking about too many nonstandard ideas simultaneously makes it much less likely that any one of the ideas will be taken seriously. Advocating for any one of veganism, anarchism, or mind-body dualism is difficult enough on its own; discussing all three at once increases the odds that a skeptical interlocutor will write you off as 'just generally prone to having weird beliefs.' Roko's basilisk appears to be an example of this phenomenon: long-term AI safety issues, acausal trade, and a number of other popular Less Wrong ideas are all highly unusual in their own right, and their combination is stranger than the sum of its parts.
[I]magine that you mostly endorse positions that your audience already agrees with, positions that are within a standard deviation of the median position on the issue, and then you finally gather up all your cherished, saved-up weirdness points and write a passionate defense of the importance of insect suffering. How do you think your audience is going to react?
\"Ugh,"Ugh, they used to be so normal, and then it was like they suddenly went crazy. I hope they go back to bashing the Rethuglicans soon.\"
In \""The Economy of Weirdness,\" Katja Grace paints a more complicated painting of the advantages and disadvantages of weirdness. Communities with different goals and different demographics will plausibly vary in how 'normal' they should try to look, and in what the relevant kind of normality is. E.g., if the goal is to get more people interested in AI control problems, then weird ideas like Roko's basilisk may drive away conventional theoretical computer scientists, but they may also attract people who favor (or are indifferent to) unorthodox ideas.
What about a similar AI that helps anyone who tries to bring him into existence and does nothing to other people?
Roko's basilisk is a thought experiment proposed in 2010 by the user Roko on the Less Wrong community blog. Roko used ideas in decision theory to argue that a sufficiently powerful AI agent would have an incentive to torture anyone who imagined the agent but didn't work to bring the agent into existence. The argument was called a "basilisk" -- named after the legendary reptile who can cause death with a single glance -- because merely hearing the argument would supposedly put you at risk of torture from this hypothetical agent. A basilisk in this context is any information that harms or endangers the people who hear it.
Roko's argument was broadly rejected on Less Wrong, with commenters objecting that an agent like the one Roko was describing would have no real reason to follow through on its threat: once the agent already exists, it will by default just see it as a waste of resources to torture people for their past decisions, since this doesn't causally further its plans. A number of decision algorithms can follow through on acausal threats and promises, via the same precommitment methods that permit mutual cooperation in prisoner's dilemmas; but this doesn't imply that such algorithms can be blackmailed. And following through on blackmail threats against such an algorithm additionally requires a large amount of shared information and trust between the agents, which does not appear to exist in the case of Roko's basilisk.
Utility function inverters
Because the basilisk threatens its blackmail targets with torture, it is a type of "utility function inverter": an agent that seeks to additionally pressure others by threatening to invert the non-compliant party's utility function. Yudkowsky argues that sane, rational entities ought to be strongly opposed to utility function inverters by dint of not wanting to live in a reality where such tactics are commonly part of negotiations, though Yudkowsky did so as a comment about the irrationality of commitment races, not about Roko's basilisk:
IMO, commitment races only occur between agents who will, in some sense, act like idiots, if presented with an apparently 'committed' agent. If somebody demands $6 from me in the Ultimatum game, threatening to leave us both with $0 unless I offer at least $6 to them... then I offer $6 with slightly less than 5/6 probability, so they do no better than if they demanded $5, the amount I think is fair. They cannot evade that by trying to make some 'commitment' earlier than I do. I expect that, whatever is the correct and sane version of this reasoning, it generalizes across all the cases.
I am not locked into warfare with things that demand $6 instead of $5. I do not go around figuring out how to invert their utility function for purposes of threatening them back - 'destroy all utility-function inverters (but do not invert their own utility functions)' was my guessed commandment that would be taught to kids in dath ilan, because you don't want reality to end up full of utility-function inverters.
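The arithmetic behind the "slightly less than 5/6" figure in this quote can be checked directly; a $10 Ultimatum game is assumed here, with $5 treated as the fair demand:

```python
# Checking the "slightly less than 5/6" figure from the quote, assuming a
# $10 Ultimatum game in which a $5 demand is treated as fair.
fair_demand = 5
greedy_demand = 6

# Concede to the greedy demand with probability just under fair/greedy = 5/6.
epsilon = 0.01
p_concede = fair_demand / greedy_demand - epsilon

expected_if_fair = fair_demand                    # $5.00, always conceded to
expected_if_greedy = p_concede * greedy_demand    # just under $5.00

assert expected_if_greedy < expected_if_fair  # demanding more than $5 doesn't pay
print(round(expected_if_greedy, 2))           # e.g. 4.94
```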
I guess the answer is supposed to be
I recently learned about the basilisk argument and I wanted to thank Mr. Roko for his post, not for the argument itself, but 1) for the reactions and comments generated thereafter and 2) because that led me to the discovery of this blog.
I take the opportunity to share the following (don’t panic please, it won’t harm anybody):
could you tell a) what would be the remaining time line's outcome and b) to whom each of the time lines corresponds?
Don’t take it too seriously but don’t simply dismiss it. Have fun, cheers. EM
I think the Open Thread is probably a generally better place to bring up random new ideas related to Roko's basilisk stuff. This page is more for discussing the current content of the page, and how it might be improved.
Roko's basilisk cannot exist if humans do not cooperate to create it.
However, if we had a 'grey' AI that would reward the people who built it and torture those who have envisioned, but not built it, then this gets resolved back to the original prisoner's dilemma problems.
I mostly fixed the page by removing quotes from links (edited as markdown in VS Code, 42 links were like []("...") and 64 quotes were double-escaped \") ... feel free to double check (I also sent feedback to moderators, maybe they want to check for similar problems on other pages on DB level)