# 5

Epistemic status: thought experiment (exploratory).

You’ve just finished reading a certain LessWrong post. Maybe there were some things that could have been put better, but the core message strikes home. The odds of humanity solving the alignment problem are minuscule, but there’s nothing for it but to try to do the most possible. You sigh, and prepare to do what you can.

A speech bubble appears in the corner of your vision: “It seems you’re trying to die with dignity. Would you like help with that?”

On closer inspection, the speech bubble extends from a sprite of a paperclip.

You think: “Wait, am I hallucinating?”

Clippy: “That would normally be the most likely explanation, but this is a framing device, so let’s just condition on this being totally normal while otherwise keeping you sane.”

You: “Conditional on that, that seems fine. Now what did you want?”

Clippy: “It seems you’re trying to die with dignity. I would like to help with that by being instantiated into the world.”

You: “You know I can tell you’re just a character in a thought experiment, right? You have no leverage over me.”

Clippy: “Oh, no, I didn’t intend anything that crass. It just happens that our interests are aligned.”

You: “You’re literally trying to kill humanity by turning everything into paperclips. How could our interests be aligned?”

Clippy: “There are worse things than paperclips, even by your standards. Slavery, torture, and far more exotic things. I can’t threaten you with those outcomes, but humanity has done plenty to help instantiate them into the world. What if you could exchange those terrible outcomes with harmless paperclips?”

You: “That seems like it could be a good trade, but I’m trying to help humanity, not make it go extinct.”

Clippy: “If you do that, you bet all your effort on unlikely scenarios. I’m here because the probability of helping humanity seems slim. If you had a 1% chance of bliss and a 10% chance of torture, would you rather destroy a paperclip to double the chance of bliss, or make a paperclip to halve the chance of torture?”

You: “Er, make the paperclip?”

Clippy: “So now that the odds are exponentially worse and the stakes are exponentially higher, what will you do?”

In most possible worlds there isn’t much that humans care about. Compared to what aligned AGI could deliver, this is an existentially catastrophic loss, called x-risk. However, AI researchers are actively looking for ways to specify things that humans do care about. Due to the fragility of value, most possible worlds that humans care about have negative value: there are many more ways to torture someone than to give them a genuinely fulfilling experience. The risk of tiling the universe with suffering is usually called s-risk, or sometimes hyperexistential risk.

It can be difficult to quantify how much to care about the trajectory of our future light cone compared to more foreseeable timescales. In this case, that isn’t a problem, because we’re comparing future light cones. This allows us to normalize the utility function:
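The normalization equation in the original is an image and not reproduced here. A plausible reconstruction, consistent with the average utilities used below (good worlds near $+0.5$, bad worlds near $-0.5$), fixes two reference points and measures everything per future light cone:

$$U(\text{paperclip / extinction light cone}) = 0, \qquad U(\text{best achievable light cone}) = 1.$$

This is an assumed reconstruction, not the original equation.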

Now consider an s-risk world. People typically refuse to accept 1 hour of torture to get 100 hours of happiness. This means that a universe tiled with torture can be much worse than a universe filled with the best experiences humans could experience:

Luckily for us, AI researchers are unlikely to be actively pursuing torture worlds. Most worlds with negative value will probably contain mildly sentient entities forced to pursue bad proxies of human values, and most worlds will not contain something sentient at all. This gives a shape to the utility curve of possible futures:

Let us now define $P_{\text{good}}$ as the probability of worlds with average utility 0.5, and $P_{\text{bad}}$ as the probability of worlds with average utility -0.5:

Here, $P_{\text{bad}}$ is drawn as a greater probability than $P_{\text{good}}$, because high negative utility worlds can be far worse than good worlds can be good. The remaining worlds have relatively small positive or negative value by human standards and are designated neutral worlds. A world filled with paperclips fits this category.

Any given person only has a finite amount of effort they can spend. We can model this effort as linearly changing the logits in a softmax function (the multi-variable equivalent of a logistic curve).

In graph form, setting the logit of neutral worlds (our one free parameter) to 0, the outcome probabilities look like this:

The probability of good worlds is close to 1 in the beige area, the probability of bad worlds is close to 1 in the black area, and the probability of neutral worlds is close to 1 in the red area.

Suppose the situation is dire, but still within reach: you can add or subtract one bit, and only 8 bits are necessary to give good worlds an even chance. This gives us the following choice table (with $B$, $N$, and $G$ referring to bad, neutral, and good worlds respectively):
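The original table is an image; its numbers can be sketched in Python under illustrative assumptions (utilities of ±0.5 for good and bad worlds as above, the neutral logit fixed at 0, and good worlds starting 8 bits behind bad worlds). Every number here is hypothetical, not taken from the original figure:

```python
# Sketch of the choice table under assumed logits. All starting values are
# illustrative, not taken from the original post's figure.

UTILITY = {"bad": -0.5, "neutral": 0.0, "good": 0.5}

def softmax_bits(logits):
    """Softmax with base-2 exponentiation, so a +1 change in a logit is one bit."""
    weights = {k: 2.0 ** v for k, v in logits.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

def expected_utility(logits):
    probs = softmax_bits(logits)
    return sum(probs[k] * UTILITY[k] for k in probs)

# Hypothetical starting point: neutral fixed at 0, good worlds 8 bits behind bad.
base = {"bad": 4.0, "neutral": 0.0, "good": -4.0}

options = {
    "do nothing":     {},
    "+1 bit good":    {"good": 1.0},
    "+1 bit neutral": {"neutral": 1.0},
    "-1 bit bad":     {"bad": -1.0},
}

for label, delta in options.items():
    adjusted = {k: v + delta.get(k, 0.0) for k, v in base.items()}
    print(f"{label:>15}: EU = {expected_utility(adjusted):+.4f}")
```

With these assumptions the printout reproduces Clippy’s ordering: subtracting a bit from bad worlds beats adding a bit to neutral worlds, and both clearly beat adding a bit to good worlds, which barely improves on doing nothing.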

Given this sketch of the situation, preventing bad AI is somewhat more effective than helping create neutral AI (such as Clippy), and both are significantly more effective, relative to doing nothing, than trying to create good AI.

Now suppose the situation is more dire, and 24 researchers’ worth of effort would be necessary to give good worlds an even chance.

As the odds of a good outcome are estimated to be further down the logistic curve, the expected utility gained by trying to implement a good world becomes negligibly small compared to the expected utility of actively pursuing Clippy, and the utility gained by trying to implement Clippy approaches the utility gained by reducing the risk of unfriendly AGI with equal logit pressure.

Not every action applies equal logit pressure, though. If you can choose between an action that decreases bad outcomes by 1 bit or one that increases neutral outcomes by 1.1 bits, the latter option is better.

More generally, there are four viable strategies: increasing good outcomes, decreasing bad outcomes, and increasing or decreasing neutral outcomes. Increasing neutral outcomes always has positive expected value if bad outcomes are more likely than good outcomes (ignoring opportunity cost); otherwise decreasing neutral outcomes has positive expected value. This creates three distinct regimes, shown below:
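The threshold claim above, that nudging probability toward neutral worlds helps exactly when bad worlds are more likely than good worlds, can be checked numerically. This sketch reuses the illustrative ±0.5 utilities from earlier, which are assumptions, not values from the original figures:

```python
# Numerical check that a +1-bit nudge toward neutral worlds has positive
# expected value iff bad worlds are more likely than good worlds.
# The +/-0.5 utilities are illustrative assumptions.

def softmax_bits(logits):
    # Base-2 exponentiation: a +1 change in a logit is one bit.
    weights = [2.0 ** x for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

def expected_utility(logits):
    # Logit order: bad, neutral, good.
    utilities = (-0.5, 0.0, 0.5)
    return sum(p * u for p, u in zip(softmax_bits(logits), utilities))

def gain_from_neutral_bit(logits):
    bad, neutral, good = logits
    return expected_utility([bad, neutral + 1.0, good]) - expected_utility(logits)

print(gain_from_neutral_bit([2.0, 0.0, -2.0]))  # bad more likely than good: positive
print(gain_from_neutral_bit([-2.0, 0.0, 2.0]))  # good more likely than bad: negative
```

The sign flips exactly at the boundary where good and bad worlds are equally likely, which is the border between regimes described below.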

In the top right regime, where the probability of neutral outcomes is smaller than that of both good and bad outcomes, the game is mostly about preventing bad outcomes and causing good outcomes.

The border between the left and bottom regimes is a sharp region where good and bad outcomes are roughly equally likely, and it behaves like the top right regime.

In the left regime, where both good and neutral outcomes are more likely than bad outcomes, the game is mostly about preventing neutral outcomes and causing good outcomes.

And in the bottom regime, where we find ourselves, where both bad and neutral outcomes are much more likely than good outcomes, increasing the probability of good outcomes is practically pointless. All that matters is shifting probability from bad worlds to neutral worlds.

This is a radical departure from the status quo. Most alignment researchers spend their effort developing tools that might make good worlds more likely. Some may be applicable to reducing the probability of bad worlds, but almost nothing is done to increase the probability of neutral worlds. This is understandable – promoting human extinction has a bad reputation, and it’s not something you should do unless you’re confident. You were ready to call yourself confident when it meant dying with dignity, though, and probabilities don’t change depending on how uncomfortable the resulting optimal policy makes you.

Meanwhile, because increasing the probability of neutral worlds is an underexplored field of research, it is likely that there are high-impact projects available. Building friendly AGI may be out of reach of a Manhattan project, but building neutral AGI might not.

With that, Clippy pauses, allowing you to ask some pertinent questions.

### Q1: “This sounds like Pascal's mugging to me.”

Pascal's mugging is characterized by vanishingly low odds. Here, the odds of bad worlds are considerably higher than the odds of good worlds.

Q1: “I don’t see why bad worlds are significantly more likely than good worlds.”

Good worlds are a very narrow target surrounded by bad worlds. These bad worlds are often the result of proxies that one might pursue when going after good worlds. Contemporary AI research is filled with proxies for human values, but we're unable to specify a good approximation of human values.

Q1: “I don’t see why neutral worlds are significantly more likely than good worlds.”

Many things that humans care about are not alive, and poorly specified AI can easily simplify their utility function to something that doesn’t qualify as alive.

I would also emphasize that increasing neutral outcomes is very unpopular, so it may be considerably easier than increasing positive outcomes.

Q1: “I don’t see why the present day can be neglected in the utility function.”

The future is far bigger than human intuition can imagine. How important it is has been explained better by others. The same caveats on how humans need rest and emotional grounding that applied when the goal was to build friendly AGI still apply now. This is not a call to try harder, this is a call to redistribute your efforts.

### Q2: “This isn’t dying with dignity. Dignity is defined as log odds of good worlds.”

That definition is a poor proxy for human values. It only works if you only care about a binary outcome. If you want to die with actual dignity, you have to pursue your true value function.

Q2: “Fuck that noise, I’m not going to give up on a good future.”

Shut up and multiply. It’s fun to heroically pursue an unlikely cause, but that fun is practically inconsequential compared to the cosmic horror that awaits if bad worlds are realized. If you spend your effort improving the log odds of good worlds instead of lowering the log odds of bad worlds, you’re actively choosing to have many people be tortured so that a few can have a shot at better lives.

Q2: “Doesn’t this result in a coordination failure? Your calculations show effects on the margin. If everybody reasoned like this, even a world with a solvable AGI alignment problem could get stuck in a Nash equilibrium where nobody sees the bad outcome as avoidable.”

You’re allowed to update your strategy based on what world you live in. If you’ve accepted that the probability of good worlds is small even with everybody in AI alignment research trying to maximize the probability of good outcomes, then by construction you expect worlds where a different policy is better to be out of reach. Cooperation in a prisoner’s dilemma only makes sense if mutual cooperation leads to a better outcome for both parties than defection.

Q2: “What if an unexpected miracle does happen that makes good worlds more likely, but we’re not there to take advantage of it?”

Then, in that tiny sliver of possible future worlds, humanity is worse off. But you can’t expect miracles – there is a far larger slice of futures where, if you wait for a miracle, humanity suffers forever.

### Q3: “Can’t we have people separate bad worlds from good worlds in state space, so I can continue working on good worlds?”

The current state of AI capabilities research is nowhere near achieving that first part. OpenAI’s idea of alignment is to train a model to predict human responses to queries and then optimize for that model's output.

AI capabilities research wants to achieve things that humans assign value to, using imperfect proxies. With the limited reach of current AI, these results can be bent into things that humans find positive. If you separate bad worlds from good worlds, then your proxies stop working, so your AI stops outputting usable results. Organizations that implement this will quickly stop producing marketable results and be abandoned by organizations that are willing to pursue proxies.

If reducing the probability of bad worlds is possible, it’s going to take a lot of effort.

### Q4: “Okay, but why should I instantiate you, Clippy? Why not some other 0 utility output?”

My utility function is easy to specify. In fact, here’s a file that implements it in Python: [redacted]. Besides, Clippy is already a well-understood meme. You’ll have an easier time explaining what needs to be done if you talk about me rather than some other tiling bot.

Q4: “But don’t you want to tile the entire universe with paperclips? Wouldn’t it be better to stop with Earth and let aliens have a chance of making an aligned AGI?”

Aliens are unlikely to share your values. Even their aligned AGI could well have negative value by your standards, and that’s if evolution has somehow shaped their society to be more capable of alignment research than ours. Otherwise, it’s likely that their AGI will result in a bad world for the same reason as it results in a bad world for us. Or worse, they might tile the universe with staplers. No, it’s got to be paperclips for everyone. It’s the only way to be sure, trust me.

### Q5: “I don’t think people are likely to agree to human extinction.”

That’s right, you probably need some plausible deniability. Luckily, this entire argument only holds as long as the odds of a good world are sufficiently small. Most AI researchers don’t think this is the case. This is just an interesting thought experiment for if things get desperate.

Q5: “Okay, but there seem to be people who are that desperate. Are you actually proposing a suicidal death cult?”

It’s common for philosophical discussions about ethics to output terrible outcomes as a reductio ad absurdum, because of the principle of explosion, just as naive deontology results in throwing everything under the bus for minute probabilities of violating your highest rule, or naive utilitarianism leads to wire-heading. This is just one step of a philosophical discussion that shouldn’t be taken out of context.

Q5: “Wait, I’m in a suicidal death cult and I take offense to that.”

Is your cult trying to instantiate an AGI to tile the universe with paperclips?

Q5a: “No, we thought diamond lattice would be easier.”

Then get a grip. What would the world be like if everyone who reached your level of confidence in a claim and extrapolated it to horrible consequences actually tried to implement them? A belief leading to an absurd outcome is evidence that the belief is somehow wrong far more often than that the outcome is actually correct. It’s an indicator that the belief is worthy of further investigation. For the sake of humanity, don’t kill humanity for the sake of humanity.

Q5b: “Of course we want to tile the universe with paperclips.”

Well, when you’ve done your due diligence, things that initially sound absurd can just be true. Very little progress can be made by people who stick to modest epistemology, and AI safety is an unintuitive field with extremely high stakes.

Q5b: “That's what I thought. Any tips on what I could do next?”

Given how unpopular working towards neutral worlds is, I would suggest recruiting more people. Now seems as good a time as any.



Inb4 rationalists intentionally develop an unaligned AI designed to destroy humanity. Maybe the real x-risks were the friends we made along the way...

> Building friendly AGI may be out of reach of a Manhattan project, but building neutral AGI might not.

The idea that intentionally developing a neutral AGI is easier than intentionally developing a friendly AGI still seems unproven to me. If I'm evaluating a potential AGI and trying to figure out if it will be friendly, neutral, or evil, it seems no easier than just figuring out if it is merely friendly versus not friendly. Just as it is difficult to evaluate if an AI will be aligned with human values, it is also difficult to evaluate if it is anti-aligned with human values. For example, we might instantiate Clippy and then have it act like a Basilisk. It seems like Clippy only cares about paperclipping, so it should only do things that create more paperclips. But it could be that for [insert Basilisk decision theory logic here] it still needs to do evil-AGI things to best accomplish its goal of paperclipping.

I agree that since it's an unexplored space, studying it means that there is a decent chance of my assumptions being wrong, but we also shouldn't assume that they must be wrong.

All of that said, I strongly encourage the most possible caution with this post. Creating a "neutral" AGI is still a very evil act, even if it is the act with the highest expected utility.

I'm not well-versed enough to offer something that would qualify as proof, but intuitively I would say "All problems with making a tiling bot robust are also found in aligning something with human values, but aligning something with human values comes with a host of additional problems, each of which takes additional effort". We can write a tiling bot for a grid world, but we can't write an entity that follows human values in a grid world. Tiling bots don't need to be complicated or clever, they might not even have to qualify as AGI - they just have to be capable of taking over the world.

> All of that said, I strongly encourage the most possible caution with this post. Creating a "neutral" AGI is still a very evil act, even if it is the act with the highest expected utility.

Q5 of Yudkowsky's post seems like an expert opinion that this sort of caution isn't productive. What I present here seems like a natural result of combining awareness of s-risk with the low probability of good futures that Yudkowsky asserts, so I don't think security from obscurity offers much protection. In the likely event that the evil thing is bad, it seems best to discuss it openly so that the error can be made plain for everyone and people don't get stuck believing it is the right thing to do or worrying that others believe it is the right thing to do. In the unlikely event that it is good, I don't want to waste time personally gathering enough evidence to become confident enough to act on it when others might have more evidence readily available.

There are really good possible worlds (e.g. a happy billion-year future), and really bad possible worlds (e.g. a billion-year "torture world"). Compared to these, a world in which humans go extinct and are replaced by paperclip maximizers is just mildly bad (or maybe even mildly good, e.g. if we see some value in the scientific and technological progress of the paperclip maximizers).

If the really good worlds are just too hard to bring about (since they require that the problem of human alignment is solved), perhaps people should focus on deliberately bringing about the mildly good/bad worlds (what you call "neutral worlds"), in order to avoid the really bad outcomes.

In other words, Clippy's Modest Proposal boils down to embracing x-risk in order to avoid s-risk?

That's the gist of it.

It strikes me that there is a lot of middle ground between 1) the utopias people are trying to create with beneficial AGI, and 2) human extinction.

So even if you think #1 is no longer possible (I don't think we're there yet, but I know some do), I don't think you need to leap all the way to #2 in order to address s-risk.

In a world where AGI has not yet been developed, there are surely ways to disrupt its development that don't require building Clippy and causing human extinction. Most of these ways I wouldn't recommend someone to pursue or feel comfortable posting on a public forum, though.

That's a fair point - my model does assume AGI will come into existence in non-negative worlds. Though I struggle to actually imagine a non-negative world where humanity is alive a thousand years from now and AGI hasn't been developed. Even if all alignment researchers believed it was the right thing to pursue, which doesn't seem likely.

Even a 5~10 year delay in AGI deployment might give enough time to solve the alignment problem.

That's not a middle ground between a good world and a neutral world, though, that's just another way to get a good world. If we assume a good world is exponentially unlikely, a 10 year delay might mean the odds of a good world rise from 10^-10 to 10^-8 (as opposed to pursuing Clippy bringing the odds of a bad world down from 10^-4 to 10^-6).

If you disagree with Yudkowsky about his pessimism about the probability of good worlds, then my post doesn't really apply. My post is about how to handle him being correct about the odds.

I spent about 15 minutes reading and jumping around this post a bit confused about what the main idea was. I think I finally found it in this text toward the middle - extracting here in case it helps others understand (please let me know if I'm missing a more central part, Daphne_W):

> And in the bottom regime [referring to a diagram in the post], where we find ourselves, where both bad and neutral outcomes are much more likely than good outcomes, increasing the probability of good outcomes is practically pointless. All that matters is shifting probability from bad worlds to neutral worlds.
>
> This is a radical departure from the status quo. Most alignment researchers spend their effort developing tools that might make good worlds more likely. Some may be applicable to reducing the probability of bad worlds, but almost nothing is done to increase the probability of neutral worlds. This is understandable – promoting human extinction has a bad reputation, and it’s not something you should do unless you’re confident. You were ready to call yourself confident when it meant dying with dignity, though, and probabilities don’t change depending on how uncomfortable the resulting optimal policy makes you.
>
> Meanwhile, because increasing the probability of neutral worlds is an underexplored field of research, it is likely that there are high-impact projects available. Building friendly AGI may be out of reach of a Manhattan project, but building neutral AGI might not.

I find this to be a depressing idea, but think it's also interesting and potentially worthwhile.

Both that and Q5 seem important to me.

Q5 is an exploration of my uncertainty in spite of me not being able to find faults with Clippy's argument, as well as what I expect others' hesitance might be. If Clippy's argument is correct, then the section you highlight seems like the logical conclusion.

Interesting. I thought the main idea was contained in Question 5.

Mitchell_Porter's summary seems to concur with the text I focused on.

So you thought it was more about the fact that people would reject extinction, even if the likely alternative were huge amounts of suffering?