In my mind, interventions against s-risks from AI seem like the impartial[1] top priority of our time, being more tractable[2], important[3], and neglected[4] than alignment. Hence I’m surprised that they’re not as central as alignment to discussions of AI safety. This is a quick-and-dirty post to try to understand why so few people in the wider EA and AI safety community prioritize s-risks. (It’s a long-form version of this tweet.)

I’ll post a few answers of my own and, in some cases, add why I don’t think they are true. Please vote on the answers that you think apply or add your own.

I don’t expect to reach many people with this question, so please interpret the question as “Why do so few EAs/LWians care about s-risks from AI?” and not just “Why don’t you care about s-risks from AI?” So as a corollary, please feel free to respond even if you personally do care about s-risks!

(Here are some ways to learn more: “Coordination Challenges for Preventing AI Conflict,” “Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda,” and Avoiding the Worst (and

  1. ^

    Some people have a particular idea for how to solve alignment and so have a strong personal fit for alignment research. Thank you for everything you’re doing! Please continue. This post is not for you. 

    But many others seem resigned and seem to have given up hope of affecting how it will all play out. I don’t think that’s necessary!

  2. ^

    Tractability. With alignment we always try to align an AI with something that at least vaguely or indirectly resembles human values. So we’ll make an enemy of most of the space of possible values. We’re in an adversarial game that we’re almost sure to lose. Our only winning hand is that we’re early compared to the other agents, but just by a decade or two.

    Maybe it’s just my agreeableness bias speaking, but I don’t want to be in an adversarial game with most superintelligences. Sounds hopeless.

    That’s related to the deployment problem. If existing agents don’t want to be aligned, you have a deployment problem. (And you have to resort to morally ambiguous and highly intractable solutions like pivotal acts and long reflections to solve it.) If you have something to offer that they all want, you’ve solved the deployment problem.

    Averting s-risks mostly means preventing zero-sum AI conflict. If we find a way (or many ways) to do that, every somewhat rational AI will voluntarily adopt them, because who wants to lose out on gains from trade. Our current earliness may be enough to seed public training data with any solutions we find and with Schelling points that they can use to coordinate.

    Another intuition pump is that alignment aims at a tiny patch in value space whereas averting s-risks only aims to avert a bunch of outlier scenarios that shouldn’t be so hard to avert. When you’re at a shooting range, it’s much easier not to kill any of the people next to you than to hit the center of the target.

  3. ^

    Importance. If I imagine trading extreme suffering for extreme bliss personally, I end up with ratios of 1 to 300 million – e.g., that I would accept a second of extreme suffering for ten years of extreme bliss. The ratio is highly unstable as I vary the scenarios, but the point is that I disvalue suffering many orders of magnitude more than I value bliss.
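As a sanity check on the 1-to-300-million figure, here is the arithmetic behind "a second of extreme suffering for ten years of extreme bliss". This is a minimal sketch; the 365.25-day year and the variable names are assumptions for illustration, not something stated above.

```python
# Assumption: a year is 365.25 days (averaging over leap years).
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

suffering_seconds = 1                    # one second of extreme suffering
bliss_seconds = 10 * SECONDS_PER_YEAR    # ten years of extreme bliss

# How many seconds of bliss are traded per second of suffering.
ratio = bliss_seconds / suffering_seconds
print(f"1 to {ratio:,.0f}")
```

Ten years comes to roughly 316 million seconds, so "1 to 300 million" is a fair rounding.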

    Clearly there are some people who feel differently, but the intuition that suffering is worse than bliss is good is widely shared. (And the factor doesn’t need to be as big as mine. Given the high tractability and neglectedness, averting s-risks from AI may even be interesting for somewhat positive-leaning utilitarians.)

    Plus, a high-probability non-dystopic not-quite-utopia may be better in expectation than a lot of low-probability utopias with dystopic counterfactuals. But I guess that depends on countless details.

    Arguably, extinction is somewhat more likely than dystopic s-risk lock-ins. But my guess is that s-risks are only a bit less likely than multipolar takeoffs, maybe 1–10% as likely, and that multipolar takeoffs are very likely, maybe 90%. (The GPT-3 to -4 “takeoff” has been quite slow. It could stop being slow at any moment, but while it’s still slow, I’ll continue updating towards month- or year-long takeoffs rather than minute-long ones.) As soon as there are multiple AIs, one coordination failure can be enough to start a war. Yes, maybe AIs are generally great at coordinating with each other. But that can be ruined by a single sufficiently powerful one that is not. (And sufficiently powerful can mean just, like, 1% as powerful as the others.) Anything from 0.1–10% s-risk between now and shortly after we have a superintelligence seems about right to me.
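The guesses in this footnote can be combined in a few lines. This is a rough sketch under one reading of the text: that "1–10% as likely" means multiplying the multipolar-takeoff probability by 0.01 to 0.10. The variable names are made up for illustration.

```python
# Guesses taken from the footnote above; the multiplication is an assumed reading.
p_multipolar = 0.90                        # "very likely, maybe 90%"
relative_low, relative_high = 0.01, 0.10   # "maybe 1-10% as likely"

p_s_risk_low = p_multipolar * relative_low
p_s_risk_high = p_multipolar * relative_high
print(f"s-risk estimate: {p_s_risk_low:.1%} to {p_s_risk_high:.1%}")
```

Under these inputs the range is roughly 0.9% to 9%, which sits inside the wider 0.1–10% band given at the end of the footnote.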

  4. ^

    Neglectedness. Alignment is already critically neglected, especially the approaches that Tammy calls “hard alignment.” Paul Christiano estimated some numbers in this excellent Bankless podcast interview. S-risks from AI are only addressed by the Center on Long-Term Risk, to some extent by the Center for Reducing Suffering, and maybe incidentally by a number of other groups. So in total maybe 1/10th the number of people work on it. (But the ideal solution is not for people in alignment to switch to s-risks but for people outside both camps to join s-risk research!)


I'm now going to answer a slightly different question, which is "Why is discussion of this sort downvoted and dismissed sometimes?"

There is a vibe that I often get from suffering focused people, which is a combo of

a) seeming to be actively stuck in some kind of anxiety loop, preoccupied with hell in a way that seems more pathological to me than well-reasoned. 

b) something about their writing and vibe feels generally off, 

c) negative-utilitarians seem very frequently to me to be highly depressed, and I think the sort of person who ends up highly suffering focused rather than incorporating positive experiences into their agenda tends to be living in a world where they literally can't experience pleasure/good-things. 

I don't think any of this is necessary to care about s-risks (or even to be negative utilitarian). But I think it is common enough that a) sometimes people are downvoting/dismissing this because they're picking up correctly on this vibe, b) sometimes people are just... anticipating that vibe, maybe seeing it where it wasn't necessarily.

(oddly, I get this more from S-risk people than from Animal Rights people. Animal Rights people seem more motivated by 'man, atrocities are happening right now, and we should care about these for the same reason we generally care about atrocities.' S-Risk people seem more often like they're trapped in a cognitive loop imagining the worst hell they can dream up, no matter how useful that is.)

Interesting. Do I give off that vibe – here or in other writings?

Quite surprisingly, that hasn't been my (still recent) experience at all... I've found s-riskers I've met to be cheerful and open-minded. Most concretely, I've found in them a lot of that animal rights oomph, and haven't felt them mentally trapped anywhere?

7 · Dawn Drescher · 23d
I also know plenty of cheerful ones. :-3

There is a vibe that I often get from suffering focused people, which is a combo of

a) seeming to be actively stuck in some kind of anxiety loop, preoccupied with hell in a way that seems more pathological to me than well-reasoned. 

b) something about their writing and vibe feels generally off,


I agree that this seems to be the case with LessWrong users who engage in suffering-related topics like quantum immortality and Roko's basilisk. However, I don't think any(?) of these users are/have been professional s-risk researchers; the few (three, iirc) s-risk researchers I've talked to in real life did not give off this kind of vibe at all.

Socially emergent avoidance. This is a secondary effect (and could explain anything if taken too strongly). My intuition is that there's a freaked-out-ness around S-risks that goes deeper than mere X-risk. This freaked-out-ness might be about the things you mention about NNTs or NU-affiliation, and also "sad". But it looks on the surface, to others, like general freaked-out-ness. When people around you are freaked out by something, you avoid talking about it. That snowballs, where people around you are avoiding talking about something, so you assume you're supposed to avoid talking about it.

Oooh, good point! I’ve certainly observed that in myself in other areas.

Like, “No one is talking about something obvious? Then it must be forbidden to talk about and I should shut up too!” Well, no one is freaking out in that example, but if someone were, it would enhance the effect.

i'll say that while i'm absolutely horrified at the possibility of S-risks, i think they're somewhat small, and that the work i'm doing now (fairly S-risk-resistant alignment) is pretty convergent to both S-risk and X-risk reduction.

in particular, an aligned AI sells more of its lightcone to get baby-eating aliens to eat their babies less, and in general a properly aligned AI will try its hardest to ensure what we care about (including reducing suffering) is satisfied, so alignment is convergent to both.

but some wonkier approaches could be pretty scary.

(note that another reason i don't think about S-risks too much is that i don't think my mental health could handle worrying about them a lot, and i need all the mental health i can get to solve alignment.)

but some wonkier approaches could be pretty scary.

Yeah, very much agreed. :-/

in particular, an aligned AI sells more of its lightcone to get baby-eating aliens to eat their babies less, and in general a properly aligned AI will try its hardest to ensure what we care about (including reducing suffering) is satisfied, so alignment is convergent to both.

Those are some good properties, I think… Not quite sure in the end.

But your alignment procedure is indirect, so we don’t quite know today what the result will be, right? Then the question whether we’ll end up ... (read more)

yes, the eventual outcome is hard to predict. but my plan looks like the kind of plan that would fail in Xrisky rather than Srisky ways, when it fails. i don't use the Thing-line nomenclature very much anymore and i only use U/X/S. i am concerned about the other paths as well but i'm hopeful we can figure them out within the QACI counterfactuals.

I don't believe that reducing s-risks from AI involves substantially different things than those you'd need to deal with AI alignment.

I do think that alignment solutions which try to solve value alignment have more of a chance of causing s-risks than those which solve corrigibility. In particular because if you get the AI to care about the same things humans value, this is pretty close to getting the AI to actively dislike things that humans value, and if there’s even one component of human values which is pessimized, this seems extremely bad even if the rest of the parts are optimized.

I'd recommend checking out this post critiquing this view, if you haven't read it already. Summary of the counterpoints:

  • (Intent) alignment doesn't seem sufficient to ensure an AI makes safe decisions about subtle bargaining problems in a situation of high competitive pressure with other AIs. I don't expect the kinds of capabilities progress that is incentivized by default to suffice for us to be able to defer these decisions to the AI, especially given path-dependence on feedback from humans who'd be pretty naïve about this stuff. (C.f. this post—you need
... (read more)

Some promising interventions against s-risks that I’m aware of are:

  1. Figure out what’s going on with bargaining solutions. Nash, Kalai, or Kalai-Smorodinsky? Is there one that is privileged in some impartial way? 
  2. Is there some sort of “leader election” algorithm over bargaining solutions?
  3. Do surrogate goals work, are they cooperative enough?
  4. Will neural-net based AIs be comprehensible to each other, if so, what does the open source game theory say about how conflicts will play out?
  5. And of course CLR’s research agenda.
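To make the first item concrete, here is a minimal sketch of how the named bargaining solutions can disagree on the same problem. Everything in it is a toy assumption for illustration: the two utility functions, the disagreement point of (0, 0), the grid search, and reading "Kalai" as the egalitarian (equal-gains) solution.

```python
import math

# Hypothetical toy problem: two agents split one unit of a resource.
# Agent 1 gets share x (linear utility); agent 2 gets 1 - x (concave utility
# on a different scale). If they fail to agree, both get 0.
def utilities(x):
    return x, 2.0 * math.sqrt(1.0 - x)

# Candidate splits. Every split is Pareto-optimal in this toy problem.
GRID = [i / 100_000 for i in range(100_001)]
IDEAL_1 = max(utilities(x)[0] for x in GRID)  # best outcome agent 1 could hope for
IDEAL_2 = max(utilities(x)[1] for x in GRID)  # best outcome agent 2 could hope for

def nash_split():
    """Nash: maximize the product of gains over the disagreement point."""
    return max(GRID, key=lambda x: utilities(x)[0] * utilities(x)[1])

def kalai_smorodinsky_split():
    """Kalai-Smorodinsky: each agent gets the same fraction of its ideal gain."""
    def gap(x):
        u1, u2 = utilities(x)
        return abs(u1 / IDEAL_1 - u2 / IDEAL_2)
    return min(GRID, key=gap)

def egalitarian_split():
    """Kalai (egalitarian): both agents get the same absolute gain."""
    def gap(x):
        u1, u2 = utilities(x)
        return abs(u1 - u2)
    return min(GRID, key=gap)

if __name__ == "__main__":
    for name, rule in [("Nash", nash_split),
                       ("Kalai-Smorodinsky", kalai_smorodinsky_split),
                       ("Egalitarian", egalitarian_split)]:
        print(f"{name}: agent 1 share = {rule():.3f}")
```

In this toy problem the three rules give agent 1 a share of about 0.667, 0.618, and 0.828 respectively. Two agents that each follow a different rule would therefore demand incompatible splits, which is the kind of coordination failure the list is pointing at. Rescaling agent 2's utility (the factor 2.0) moves the egalitarian split but not the other two, which may be one way to probe whether a solution is "privileged in some impartial way."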

Interpretability research is probably i... (read more)

I don’t see how any of these actually help reduce s-risk. Like, if we know some bargaining solutions lead to everyone being terrible and others lead to everyone being super happy, so what? It’s not like we can tremendously influence the bargaining solution our AI & those it meets settle on after reflection.

2 · Dawn Drescher · 22d
In the tractability footnote above I make the case that it should be at least vastly easier than influencing the utility functions of all AIs to make alignment succeed.
3 · Garrett Baker · 22d
Yeah, I expect that if you make a superintelligence it won’t need humans to tell it the best bargaining math it can use. You are trying to do better than a superintelligence at a task it is highly incentivized to be good at, so you are not going to beat the superintelligence. Secondly, you need to assume that the pessimization of the superintelligence’s values would be bad, but in fact I expect it to be just as neutral as the optimization. I don’t care about wars between unaligned AIs, even if they do often have them. Their values will be completely orthogonal to my own, so their inverses will be too. Even in wars between aligned and unaligned (Hitler, for example) humans, suffering which I would trade the world to stop does not happen. Also, wars end; it’d be very weird if you got two AIs warring with each other for eternity. If both knew this was the outcome (or placed some amount of probability on it), why would either of them start the war? People worried about s-risks should be worried about some kinds of partial alignment solutions, where you get the AI aligned enough to care about keeping humans (or other things that are morally relevant) around, but not aligned enough to care if they’re happy (or satisfying any other of a number of values), so you get a bunch of things that can feel pain in moderate pain for eternity.
4 · Dawn Drescher · 22d
I’m not a fan of idealizing superintelligences. 10+ years ago that was the only way to infer any hard information about worst-case scenarios. Assume perfect play from all sides, and you end up with a fairly narrow game tree that you can reason about. But now it’s a pretty good guess that superintelligences will be more advanced successors of GPT-4 and such. That tells us a lot about the sort of training regimes through which they might learn bargaining, and what sorts of bargaining solutions they might completely unreflectively employ in specific situations. We can reason about what sorts of training regimes will instill which decision theories in AIs, so why not the same for bargaining? If we think we can punt the problem to them, then we need to make sure they reflect on how they bargain and the game-theoretic implications of that. We may want to train them to seek out gains from trade like it’s useful in a generally cooperative environment, rather than seek out exploits as it would be useful in a more hostile environment. If we find that we can’t reliably punt the problem to them, we now still have the chance to decide on the right (or a random) bargaining solution and train enough AIs to adopt it (more than 1/3rd? Just particularly prominent projects?) to make it the Schelling point for future AIs. But that window will close when they (OpenAI, DeepMind, vel sim.) finalize the corpus of the training data for the AIs that’ll take over the world. Okay. I’m concerned with scenarios where at least one powerful AI is at least as (seemingly) well aligned as GPT-4. Can you rephrase? I don’t follow. It’s probably “pessimization” that throws me off? Well, I’m already concerned about finite versions of that. Bad enough to warrant a lot of attention in my mind. But there are different reasons why that could happen. The one that starts the war could’ve made any of a couple of different mistakes in assessing their opponent. It could make mistakes in the process of readying it
9 · Seth Herd · 23d
These suggestions are all completely opaque to me. I don't see how a single one of them would work to reduce s-risk, or indeed understand what the first three are or why the last one matters. That's after becoming conversant with the majority of thinking and terminology around alignment approaches. So maybe that's one reason you don't see people discussing s-risk much: the few people doing it are not communicating their ideas in a compelling or understandable way. That doesn't answer the main question, but cause-building strategy is one factor in any question of why things are or aren't attended to.

Surrogate goals are defined here, or (not by that name) here. IIRC, the gist of it is something like: “let’s make an AGI from whose perspective the best possible thing is utopia, and the second-worst possible thing is eternal torture throughout the universe, and the worst possible thing is some specific random thing like a stack of 189 boxes on a certain table in a very specific configuration.” Then the idea is that if there’s a conflict between AGIs, and threats are made, and these threats are then carried out (or alternatively if a cosmic ray flips a crucial bit), then we’re now more likely to get stacks of boxes instead of hell.

1 · Barr Detwix · 23d
This may have an obvious response, but I can't quite see it: If the worst possible thing is a negligible change, an easily achievable state, shouldn't an AGI want to work to prevent that catastrophic risk? Couldn't this cause terribly conflicting priorities? If there is a minor thing that the AGI despises above all, surely some joker will make a point of trying to see what happens when they instruct their local copy of Marsupial-51B to perform the random inconsequential action. It might be tempting to try to compromise on utopia to avoid a strong risk of the literal worst possible thing. Apologies if there's a reason why this is obviously not a concern :)
1 · Dawn Drescher · 22d
Yeah, that’s a known problem. I don’t quite remember what the go-to solutions were that people discussed. I think creating an s-risk is expensive, so negating the surrogate goal could also be something that is almost as expensive… But I imagine an AI would also have to be a good satisficer for this to work or it would still run into the problem with conflicting priorities. I remember Caspar Oesterheld (one of the folks who originated the idea) worrying about AI creating infinite series of surrogate goals to protect the previous surrogate goal. It’s not a deployment-ready solution in my mind, just an example of a promising research direction.
We'd want to pick something to 1. have badness per unit of resources (or opportunity cost) only moderately higher than any actually bad thing according to the surrogate, 2. scale like actually bad things according to the surrogate, and 3. be extraordinarily unlikely to occur otherwise. Maybe something like doing some very specific computations, or building very specific objects.

Solving alignment solves s-risk as well as x-risks.

The main solutions to s-risks that don't work for x-risks are to facilitate extinction so that there's nothing left to suffer.

There are huge costs to advocating for human extinction to prevent s-risk.

This could be true as a reason why some people de-prioritize s-risks, but I don't think it's a correct statement.  See the section "s-risk reduction is separate from alignment work" here.  

It is simply not true that s-risk interventions not solving x-risk are "facilitate extinction". See for example CLR's agenda:

I agree with what Lukas linked. But there are also various versions of the Waluigi Effect, so that alignment, if done wrong, may increase s-risk. Well, and I say in various answers and in the post proper that I’m vastly more optimistic about reducing s-risk than having to resort to anything that would increase x-risk.

Too unlikely. I’ve heard three versions of this concern. One is that s-risks are unlikely. I simply don’t think they are, as explained above and in the post proper. The second version is that s-risk is 1/10th as likely as extinction, hence less likely, hence not a priority. The third version of this take is that it’s just psychologically hard to be motivated for something that is not the mode of the probability distribution of how the future will turn out (given such clusters as s-risks, extinction, and business as usual). So even if s-risks are much worse and only slightly less likely than extinction, they’re still hard for people to work on.

There have been countless discussions of takeoff speeds. The slower the takeoff and the closer the arms race, the greater the risk of a multipolar takeoff. Most of you probably have some intuition of what the risk of a multipolar takeoff is. S-risk is probably just 1/10th of that – wild guess. So I’m afraid that the risk is quite macroscopic.

The second version ignores the expected value. I acknowledge that expected value calculus has its limitations, but if we use it at all, and we clearly do, a lot, then there’s no reason to ignore its implications specifically for s-risks. With all ITN factors taken together but ignoring probabilities, s-risk work beats other x-risk work by a factor of 10^12 for me (your mileage may vary), so if it’s just 10x less likely, that’s not decisive for me.
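The expected-value point can be written out as a short calculation. This is only a sketch of the reasoning in the paragraph above: the 10^12 factor and the 10x probability discount are the figures quoted there, while the variable names and the break-even line are added for illustration.

```python
from fractions import Fraction

# Figures quoted in the paragraph above (the author's own, "your mileage may vary").
itn_advantage = Fraction(10**12)        # advantage of s-risk work, ignoring probabilities
relative_probability = Fraction(1, 10)  # "just 10x less likely"

# Discount the advantage by how much less likely the scenario is.
adjusted_advantage = itn_advantage * relative_probability

# Relative probability at which the advantage would shrink to exactly 1.
break_even_probability = 1 / itn_advantage

print(f"adjusted advantage: {float(adjusted_advantage):.0e}")
print(f"break-even relative probability: {float(break_even_probability):.0e}")
```

With these inputs the advantage only drops from 10^12 to 10^11, so a 10x lower probability does not come close to reversing the ranking. Readers who start from a much smaller factor will get a different answer.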

I don’t have a response to the third version.

I don't understand why you think a multipolar takeoff would run S-risks.
1 · Dawn Drescher · 20d
My perhaps a bit naive take (acausal stuff, other grabby aliens, etc.) is that a conflict needs at least two, and humans are too weak and uncoordinated to be much of an adversary. Hence I’m not so worried about monopolar takeoffs. Not sure, though. Maybe I should be more worried about those too.

NNTs. Some might argue that “naive negative utilitarians that take ideas seriously” (NNTs) want to destroy the world, so that any admissions that s-risks are morally important in expectation should happen only behind closed doors and only among trusted parties.

That sounds to me like, “Don’t talk about gun violence in public or you’ll enable people who want to overthrow the whole US constitution.” Directionally correct but entirely disproportionate. Just consider that non-negative utilitarians might hypothetically try to kill everyone to replace them with beings with greater capacity for happiness, but we’re not self-censoring any talk of happiness as a result. I find this concern to be greatly exaggerated.

In fact, moral cooperativeness is at the core of why I think work on s-risks is a much stronger option than ... (read more)

My take is that x-risk from misalignment is far more likely than s-risk, and that work to reduce loss of control of humanity to AGI is (for now) a sufficiently general pursuit to cover both cases. Maybe at some point there will be enough researchers to specialize more.

In the language of Superintelligent AI is necessary for an amazing future but far from sufficient, I expect that the majority of possible s-risks are weak dystopias rather than strong dystopias. We're unlikely to succeed at alignment enough and then signflip it (like, I expect strong dystopia to be dominated by 'we succeed at alignment to an extreme degree' ^ 'our architecture is not resistant to signflips' ^ 'somehow the sign flips'). So, I think literal worst-case Hell and the immediate surrounding possibilities are negligible.
I expect the extrema of most AIs, even ones with attempted alignment patches, to be weird and unlikely to be of particular value to us. The way values resolve has a lot of room to maneuver early on, before the AI becomes a coherent agent, and I don't expect those values to have extrema that are best fit by humans (see various of So8res's other posts). Thus, I think it is unlikely that we end up with a weak dystopia (at least for a long time, which is the s-risk) relative to x-risk.

Thanks for linking that interesting post! (Haven’t finished it yet though.) Your claim is a weak one though, right? Only that you don’t expect the entire lightcone of the future to be filled with worst-case hell, or less than 95% of it? There are a bunch of different definitions of s-risk, but what I’m worried about definitely starts at a much smaller-scale level. Going by the definitions in that paper (p. 3 or 391), maybe the “astronomical suffering outcome” or the “net suffering outcome.”

I primarily mentioned it because I think people base their 'what is the S-risk outcome' on basically antialigned AGI. The post has 'AI hell' in the title and uses comparisons between extreme suffering versus extreme bliss, calls s-risks more important than alignment (which I think makes sense to a reasonable degree if antialigned s-risk is likely or a sizable portion of weaker dystopias are likely, but I don't think makes sense for antialigned being very unlikely and my considering weak dystopias to also be overall not likely). The extrema argument is why I don't think that weak dystopias are likely, because I think that - unless we succeed at alignment to a notable degree - the extremes of whatever values shake out are not something that keeps humans around for very long. So I don't expect weaker dystopias to occur either. I expect that most AIs aren't going to value making a notable deliberate AI hell, whether out of the lightcone or 5% of it or 0.01% of it. If we make an aligned-AGI and then some other AGI says 'I will simulate a bunch of humans in torment unless you give me a planet' then I expect that our aligned-AGI uses a decision-theory that doesn't give in to dt-Threats and doesn't give in (and thus isn't threatened, because the other AGI gains nothing from actually simulating humans in that). So, while I do expect that weak dystopias have a noticeable chance of occurring, I think it is significantly unlikely? It grows more likely we'll end up in a weak dystopia as alignment progresses. Like if we manage to get enough of a 'caring about humans specifically' (though I expect a lot of attempts like that to fall apart and have weird extremes when they're optimized over!), then that raises the chances of a weak dystopia. However I also believe that alignment is roughly the way to solve these. To get notable progress on making AGIs avoid specific areas, I believe that requires more alignment progress than we have currently.
I'm also not sure that I consider astronomical suffering outcome (by how it's described in the paper) to be bad by itself. If you have (absurd amount of people) and they have some amount of suffering (e.g., it shakes out that humans prefer some degree of negative-reinforcement as possible outcomes, so it remains) then that can be more suffering in terms of magnitude, but has the benefits of being more diffuse (people aren't broken by a short-term large amount of suffering) and with less individual extremes of suffering. Obviously it would be bad to have a world that has astronomical suffering that is then concentrated on a large amount of people, but that's why I think - a naive application of - astronomical suffering is incorrect because it ignores diffuse experiences, relative experiences (like, if we have 50% of people with notably bad suffering today, then your large future civilization with only 0.01% of people with notably bad suffering can still swamp that number, though the article mentions this I believe), and more minor suffering adding up over long periods of time. (I think some of this comes from talking about things in terms of suffering versus happiness rather than negative utility versus positive utility? Where zero is defined as 'universe filled with things we don't care about'. Like, you can have astronomical suffering that isn't that much negative utility because it is diffuse / lower in a relative sense / less extreme, but 'everyone is having a terrible time in this dystopia' has astronomical suffering and high negative utility)

If I imagine trading extreme suffering for extreme bliss personally, I end up with ratios of 1 to 300 million – e.g., that I would accept a second of extreme suffering for ten years of extreme bliss. The ratio is highly unstable as I vary the scenarios, but the point is that I disvalue suffering many orders of magnitude more than I value bliss.

I also disvalue suffering significantly more than I value happiness (I think bliss is the wrong term to use here), but not to that level. My gut feeling wants to dispute those numbers as being practical, but I'l... (read more)

My comments on this topic have been poorly received. I think most people are pretty much immune to the emotional impact of AI hell as long as it isn't affecting someone in their 'monkeysphere' (community of relationships capped by Dunbar's number).

The popular LW answer seems to be the top comment from Robin Hanson to my post here:

My other more recent comment:

Arguably, if you're concerned about s-risk, you should be theorizing about ways of controlling access to Em data. You would be interested in better digital rights management (DRM) technology, which is seen as 'the enemy' in a lot of tech/open-source adjacent communities, as well as developing technology for guaranteed secure deletion of human consciousness.

If it were possible to emulate a human and place them into AI hell, I am absolutely certain that the US government would find a way to use it for both interrogation and incarceration.

That sounds promising actually… It has become acceptable over the past decade to suggest that some things ought not to be open-sourced. Maybe it can become acceptable to argue for DRM for certain things too. Since we don’t yet have brain scanning technology, I’d also be interested in an inverse cryonics organization that has all the expertise to really really really make sure that your brain and maybe a lot of your social media activity and whatnot really gets destroyed after your death. (Perhaps even some sorts of mechanism by which suicide and complete s... (read more)

For a suicide switch, a purpose built shaped charge mounted to the back of your skull (a properly engineered detonation wave would definitely pulp your brain, might even be able to do it without much danger to people nearby), raspberry pi with preinstalled 'delete it all and detonate' script on belt, secondary script that executes automatically if it loses contact with you for a set period of time. That's probably overengineered though, just request cremation with no scan, and make sure as much of your social life as possible is in encrypted chat. When you die, the passwords are gone. When the tech gets closer and there are fears about wishes for cremation not being honored, EAs should pool their funds to buy a funeral home and provide honest services.

(I deleted this comment)

[This comment is no longer endorsed by its author]

I’ve thought a bunch about acausal stuff in the context of evidential cooperation in large worlds, but while I think that that’s super important in and of itself (e.g., it could solve ethics), I’d be hard pressed to think of ways in which it could influence thinking about s-risks. I rather prefer to think of the perfectly straightforward causal conflict stuff that has played out a thousand times throughout history and is not speculative at all – except applied to AI conflict.

But more importantly it sounds like you’re contradicting my “tractability“ footnot... (read more)

Speaking for myself: I care a bunch about s-risks. I listed "the AI might simulate sentient beings" as a failure mode in "Carefully Bootstrapped Alignment" is organizationally hard.

But I overall think working on alignment is largely more urgent. Being able to understand what's going on at all inside a neural net, and advocating that companies be required to understand what's going on before developing new/bigger/better models, seems like a convergent goal relevant to both human extinction and astronomical suffering. 

There's also something of a Maslow Hierarchy of needs thing where I think getting to "okay, we're not all dead and we have some control over the future" seems more like the next major step to focus on, in part since it requires fewer philosophical assumptions.

But I am pretty in favor of people a) figuring out how to build AI evals that check for sentience, and/or finding a reasonable bright line that's like "if your AI can do X, there's at least a sizable chance that it's sentient", and b) advocating companies or governments halt training runs or deployment if that bright line is crossed. 

Note, like, Eliezer did go out of his way to mention AI consciousness as a problem in his TIME article and in some tweets. 

But I overall think working on alignment is largely more urgent. Being able to understand what's going on at all inside a neural net, and advocating that companies be required to understand what's going on before developing new/bigger/better models, seems like a convergent goal relevant to both human extinction and astronomical suffering. 

Fwiw, Lukas's comment links to a post arguing against that and I 100% agree with it. I think the "Alignment will solve s-risks as well anyway" is one of the most untrue and harmful widespread memes in the EA/LW community.

Nod (fyi I vaguely remembered that comment but couldn't find it a second time while I was writing my own answer) I do think "AI targeted at optimizing a good goal" is more likely to near miss if precautions aren't taken and I do think that's quite important. I did carefully not say "alignment automatically solves s-risks", I said it was a convergent goal that seemed more important to me overall. I do think that's a reasonable thing to disagree on.
1 · Dawn Drescher · 24d
I suppose my shooting range metaphor falls short here. Maybe alignment is like teaching a kid to be an ace race car driver, and s-risks are accidents on normal roads. There it also depends on the details whether the ace race car driver will drive safely on normal roads.

Note, something in your framing here seems focused on psychologizing in a way that feels unnecessary and unhelpful, although I'm not sure whether I particularly object to it. [Specifically, the part where your question title is "why are we complacent?" rather than "why aren't more people taking Particular Actions or prioritizing AI Hell above AI extinction"]

1 · Dawn Drescher · 24d
Good point. I can still change it. What title would you vote for? I spent a lot of time vacillating between titles and don’t have a strong opinion. These were the options that I considered:
1. Why not s-risks? A poll.
2. Why are we so complacent about AI hell?
3. Why aren’t we taking s-risks from AI more seriously?
4. Why do so few people care about s-risks from AI?
5. Why are we ignoring the risks of AI hell?
6. What’s holding us back from addressing s-risks from AI?
7. Why aren’t we doing more to prevent s-risks from AI?
8. What will it take to get people to care about s-risks from AI?
"Why aren't more people prioritizing work on S-risks more heavily" seems better to me and seems like the question you probably actually care about. Question-titles that are making (in many cases inaccurate) claims about people's motivations seem more fraught and unhelpfully opinionated.
3 · Dawn Drescher · 24d
Thx! I’ll probably drop the “more heavily” for stylistic reasons, but otherwise that sounds good to me!

Oh, true! Digital sentience is also an important point! A bit of an intuition pump is that if you consider a certain animal to be sentient (at least with some probability), then an em of that animal’s brain may be sentient with a similar probability. If an AI is powerful enough to run such ems, the question is no longer whether digital sentience is possible but why an AI would run such an em.

The Maslow hierarchy is reverse for me, i.e. rather dead/disempowered than being tortured, but that’s just a personal thing. In the end it’s more important what the acausal moral compromise says, I think.

Yeah to be clear my mainline prediction is that an unfriendly AI goes through some period of simulating lots of humans (less likely to simulate animals IMO) as part of its strategizing process, kills humanity, and then goes on to do mostly non-sentient things. There might be a second phase where it does some kind of weird acausal thing, not sure. I don't know that in my mainline prediction the simulation process results in much more negative utility than the extinction part. I think the AI probably has to do much of its strategizing without enough compute to simulate vast numbers of humans, and I weakly bet against those simulations ending up suffering in a way that ends up outweighing human extinction. There are other moderately likely worlds IMO and yeah I think s-risk is a pretty real concern.

As I'd mentioned elsewhere,

The one good thing about the [nature of the] technical problem of alignment is that it makes hyperexistential risks — the risks of astronomical suffering — very unlikely.

The problem of AI Alignment can be viewed as the problem of encoding our preferences into an AGI, bit by bit. The strength of alignment tools, in turn, translates to how many bits we can encode. With the current methods of end-to-end training, we're essentially sampling preferences at random. Perfect interpretability and parameter-surgery tools would allow us to encode an arbitrary amount of bits. The tools we'll actually have will be somewhere between these two extremes.

"Build us our perfect world" is a very complicated ask, and it surely takes up many, many thousands of bits. That's why alignment is hard.

"Build us a hell" is its mirror. It's essentially the same ask, except for a flipped sign. As such, specifying it would require pretty much the same amount of bits.

Thus, in the timelines where we have alignment tools advanced enough to build a hell-making AGI, it's overwhelmingly likely that we have the technical capability to build a utopia-building AGI. On the flipside, conditioning on our inability to build a utopia-builder, our tools are probably so bad we can't come close to a hell-builder. In that case, we just sample some random preferences, and the AGI kills us quickly and painlessly.

Screwing up so badly we create a suffering-maximizer is vanishingly unlikely: it's only possible in a very, very narrow range of technical capabilities.

I am worried about S-risks though. I think they're pretty likely in timelines where we solve the technical problem of alignment, but the technology ends up in the wrong hands; central examples being xenophobic or authoritarian political entities.

I'm concerned it may be neglected, too: I expect the various AI Governance/field-building initiatives may not be spending any time considering how not to attract the wrong kind of attention, and instead simply maximize for getting as much attention as possible. (Though I suppose if they're competent at it, I wouldn't see any public evidence of them considering that; I'm just guessing on priors.)

Edit: Mm, though there's a caveat. I'm operating under the least forgiving model of the alignment problem; under it, S-risks really are that unlikely. But many people don't share it (e.g., the shard theory assumes "rough" alignment will suffice to avoid omnicide), which should make their P(hell) non-negligibly high. Yet they're not worried either, so there must be something else going on with their models.

Thx! Yep, your edit basically captures most of what I would reply. If alignment turns out so hard that we can’t get any semblance of human values encoded at all, then I’d also guess that hell is quite unlikely. But there are caveats, e.g., if there is a nonobvious inner alignment failure, we could get a system that technically doesn’t care about any semblance of human values but doesn’t make that apparent because ostensibly optimizing for human values appears useful for it at the time. That could still cause hell, even with a higher-than-normal probability.

S-risks aren't disproportionately important to many people. Personally I think they're only a bit worse than death, and aesthetically I think they're maybe twice as bad as human extinction.

This should be combined with a likelihood estimate to recommend actions.
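
To make the "combine with a likelihood estimate" step concrete, here is a minimal sketch. It assumes a plain expected-badness comparison (probability times badness) and ignores tractability and neglectedness; the function name and the normalization of extinction to 1.0 are illustrative assumptions, not something the commenter specified.

```python
def break_even_probability_ratio(badness_s_risk: float, badness_extinction: float) -> float:
    """Return the ratio P(s-risk) / P(extinction) at which expected badness is equal.

    Expected badness is modeled as probability * badness (an assumption).
    Above the returned ratio, the s-risk term is larger; below it, extinction is.
    """
    if badness_s_risk <= 0 or badness_extinction <= 0:
        raise ValueError("badness values must be positive")
    # P_s * B_s == P_x * B_x  =>  P_s / P_x == B_x / B_s
    return badness_extinction / badness_s_risk


# "Maybe twice as bad as human extinction": extinction normalized to 1.0.
ratio = break_even_probability_ratio(badness_s_risk=2.0, badness_extinction=1.0)
print(ratio)  # 0.5
```

Under these assumptions, someone with the "twice as bad" weighting would need to think s-risks are at least half as likely as extinction before the s-risk term dominates.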


3 · Charlie Steiner · 23d
If you get to pick how the universe is arranged in the future, would you rather it be lifeless and full of shit, or lifeless and full of brilliant art? I'm gonna guess that you, like me, would prefer art. This is an aesthetic preference about how you'd rather the atoms in the universe be arranged. You don't need to justify it by any deeper principle, it doesn't matter that you're not around to care in either case, it's sufficient for you to prefer universes full of art to universes full of shit as a raw preference, and this can motivate you to steer the future to favor one over the other. I find universes full of cosmopolitan civilizations good, and universes full of suffering bad, in just this raw way. You might also call it "non person-affecting preferences over the use of atoms in the universe."

Interesting take! Obviously that’s different for me and many others, but you’re not alone with that. I even know someone who would be ready to cook in a lava lake forever if it implies continuing to exist. I think that’s also in line with the DALY disability weights, but only because they artificially scale them to the 0–1 interval.

So I imagine you’d never make such a deal as shortening your life by three hours in exchange for not experiencing one hour of the worst pain or other suffering you’ve experienced?

Sorry for pursuing this tangent (which I'm assuming you'll feel free to ignore), but have they ever indicated how likely they think it is that they would continue to hold that preference while in the lava lake?  (I was aware some people voiced preferences like this, but I haven't directly discussed it with any of them. I've often wondered whether they think they would, in the (eternally repeated) moment, prefer the suffering to death, or whether they are willing to condemn themselves to infinite suffering even though they expect to intensely regret it. In both cases I think they are horribly mistaken, but in quite different ways.)
3 · Dawn Drescher · 22d
The example I was thinking of is this one. (There’s a similar thread here.) So in this case it’s the first option – they don’t think they’ll prefer death. But my “forever” was an extrapolation. It’s been almost three years since I read the comment. I’m the ECL type of intersubjective moral antirealist. So in my mind, whether they really want what they want is none of my business, but what that says about what is desirable as a general policy for people we can’t ask is a largely empirical question that hasn’t been answered yet. :-3
3 · Charlie Steiner · 23d
It's plausible you could catch me on days where I would take the deal, but basically yeah, 3:1 seems like plenty of incentive to choose life, whereas at 1:1 (the lava lake thing), life isn't worth it (though maybe you could catch me on days etc etc).
1 · Dawn Drescher · 22d
Huh, thanks! 

One of the reasons why I'm skeptical of the S-Risk is as follows. 

Not sure if it's a core idea, but I've observed that S-Risk proponents often propagate the idea that some large amount of suffering is worse than death. 

For example, some of them claim that assisted suicide for a patient in pain is ethical (a claim which I find abhorrent, unless the procedure is done for cryonics).

My view is, there is NO fate worse than death. A single human death is worse than trillions of years of the worst possible suffering by trillions of people. 

The "some suffering is worse than death" idea is increasing X-risks: one day, some sufficiently powerful idiot could decide that human extinction is better than an AGI-dystopia. 

It's a good idea to work on preventing large amounts of suffering, but S-Risk is a bad framework for that. 

Putting the question of assisted suicide aside, I agree with what seems to be the core of this answer: The "value calculus" often used by utilitarians is a nice mathematical framework, but ultimately not a real thing (not saying that suffering isn't a real thing or that one can't gain useful knowledge from such calculations).

E.g. I would always trade an infinite amount of suffering for +epsilon control of the future and my current and future values don't necessarily align. I don't see how a strong form of utilitarianism can contend with such things.

I’d prefer to keep these things separate, i.e. (1) your moral preference that “a single human death is worse than trillions of years of the worst possible suffering by trillions of people” and (2) that there is a policy-level incentive problem that implies that we shouldn’t talk about s-risks because that might cause a powerful idiot to take unilateral action to increase x-risk.

I take it that statement 1 is a very rare preference. I, for one, would hate for it to be applied to me. I would gladly trade any health state that has a DALY disability weight >... (read more)

I think there's an additional freaked-out-ness that comes from a thing related to NNTs. But less about being worried that people will try to destroy the world. Instead, more like not wanting to inhabit or cause other people to inhabit a mindset where self-destruction would be an obvious choice. Suicide is a fairly taboo topic. And my impression is that this is at least a little justified; my pop-sci level understanding is that suicide being raised to attention (especially by a celebrity killing themselves) can cause other people to be more likely to kill themselves.

That could simply be from raising the hypothesis to attention.

There could be an intuitive fear of self-fulfilling prophecies. If you start to believe that it would be better if you didn't exist, one reaction might be to stop making your life better to live--which would at least somewhat contribute toward it becoming true that it'd be better if you didn't exist. (Which could be in a self-reinforcing loop.)

Interesting take! 

Friend circles of mine – which, I should note, don’t to my knowledge overlap with the s-risks from AI researchers I know – do treat suicide as a perfectly legitimate thing you can do after deliberation, like abortion or gender-affirming surgery. So there’s no particular taboo there. Hence, maybe, why I also don’t recoil from considering that the future might be vastly worse than the present.

But it seems to be like a rationalist virtue not to categorically recoil from certain considerations.

Could you explain the self-fulfilling prophe... (read more)

My impression was that Freudian death wish is aggression in general, (mis)directed at the self. I'm not talking about that. I'm confused what you're saying, and curious. I would predict that this attitude toward suicide would indeed correlate with being open to discussing S-risks. Are you saying you have counter-data, or are you saying you don't have samples that would provide data either way? It's basically like this: my experience is bad. If my experience is this bad, I'd rather not live. Can I make my experience good enough to be worth living? That depends on whether I work really hard or not. I observe that I am not working hard. Therefore I expect that my experience won't get sufficiently better to be worth it. Therefore locally speaking it's not worth it to try hard today to make my life better; I won't keep that work up, and will just slide back. So my prediction that I won't work to make my life better is correct and self-fulfilling. If I thought I would spend many days working to make my life better, then it would become worth it, locally speaking, to work hard today, because that would actually move the needle on chances of making life worth it. Surely you can see that this isn't common, and the normal response is to just be broken until you die.
1 · Dawn Drescher · 22d
I was just agreeing. :-3 In mainstream ML circles there is probably a taboo around talking about AI maybe doing harm or AI maybe ending up uncontrollable etc. Breaking that taboo was, imo, a good thing because it allowed us to become aware of the dangers AI could pose. Similarly, breaking a taboo around talking about things worse than death can be helpful to become aware of ways in which we may be steering toward s-risks. I see! I have a bunch of friends who would probably consider their lives not worth living. They often express the wish to not have been born or at least consider their current well-being level to be negative. But I think only one of them might be in such a negative feedback loop, and I’m probably misdiagnosing her here. Two of them are bedridden due to Long Covid and despite their condition have amassed a wealth of knowledge on virus-related medicine, probably by googling things on their phones while lying down for ten minutes at a time. Others have tried every depression drug under the sun. Others have multiple therapists. They are much more held back by access and ability than by motivation, even though motivation is probably also hard to come by in that state. Idk, Harold and Maude is sort of like that. I’ve actually done a back-of-the-envelope calculation, which is perhaps uncommon, but the general spirit of the idea seems normal enough to me? Then again I could easily be typical-minding.
Sounds likely enough from your description. Most things are mostly not about self-fulfilling prophecies, life can just be sad / hard :( I think that the feedback loop thing is a thing that happens; usually in a weakish form. I mean, I think it's the cause of part of some depressions. Separately, even if it doesn't happen much or very strongly, it could also be a thing that people are afraid of in themselves and in others, continuously with things like "trying to cheer someone up". That's my guess, to some extent, but IDK. I think we'd live in a different, more hopeful world if you're not (incorrectly) typical-minding here.

Too unknown. Finally there’s the obvious reason that people just don’t know enough about s-risks. That seems quite likely to me.

The biggest extinction risk from AI comes from instrumental convergence for resource acquisition in which an AI not aligned with human values uses the atoms in our bodies for whatever goals it has.  An advantage of such instrumental convergence is that it would prevent an AI from bothering to impose suffering on us.

Unfortunately, this means that making progress on the instrumental convergence problem increases S-risks.  We get hell if we solve instrumental convergence, but not, say, mesa-optimization and we get a powerful AGI that cares about our fate, but does something to us we consider worse than death.

Some reasons why I'm personally not as involved in working to prevent AI Hell:

(in no order of importance).

1. I'm not strongly convinced a hostile Singularity is plausible, at least in the near future, from a technological, logistical, and practical standpoint. Pretty much every AI Hell scenario I have read hinges on the sudden appearance of scientifically implausible technologies, and on instant perfect logistics that the AI could use.

2. Main issue that could lead to AI Hell is the misalignment of values between AI and humans. However, it is patently obvious that humans are not aligned with each other, with themselves or with rational logic. Therefore, I do not see a path to align AI with human values unless we Solve Ethics, which is an impossible task unless we completely redesign human brains from scratch.

3. I'm personally not qualified to work on any technological aspects of preventing AI Hell. I am qualified to work on human-end Ethics and branch into alignment from that, and I see it as an impossible task with the kind of humans we get to work with.

4. A combination of points 1 and 2 leads me to believe that humanity is far more likely to abuse early stage AI to wipe itself out, than for AI itself to wipe out humanity of its own volition. To put it differently, crude sub-human level AI can plausibly be used to cause WW3 and a nuclear holocaust without any need for hostile superhuman AI. I think we worry too much about the unlikely but extremely lethal post-Singularity AI, and not enough about highly likely and just sufficiently lethal wargame systems in the hands of actual biological humans, who are not sufficiently concerned with humanity's survival.

5. Roko's Gremlin: anyone who is actively working on limiting or forcibly aligning AI is automatically on the hit-list of any sufficiently advanced hostile AI. I'm not talking about the long-term high-end scenario of Roko's Basilisk, but rather the near-future low-end situation in which an Internet-savvy AI can ruin your life for being a potential threat to it. In fact, this scenario does not require actively hostile AI at all. I see it as completely plausible that a human being with a vested financial interest in AI advancement could use AI to create a powerful smear campaign against, say, EY, to destroy his credibility, and with him the credibility of the AI Safety movement. Currently accessible AI is excellent at creating plausible-seeming bullshit, which would be perfect to use for social media warfare against anyone who tries to monkeywrench its progression. Look at Nick Bostrom to see how easily one of us can be sniped down with minimum effort.

Sorry for glossing over some of these. E.g., I’m not sure if you consider ems to be “scientifically implausible technologies.” I don’t, but I bet there are people who could make smart arguments for why they are far off.

Reason 5 is actually a reason to prioritize some s-risk interventions. I explain why in the “tractability” footnote.

In addition to many good points already mentioned, I would like to add that I have no idea how to approach this problem.

Approaching x-risk is very hard too, but it is much clearer in comparison.

Personal fit. Surely, some people have tried working on s-risks in different roles for some substantial period of time but haven’t found an angle from which they can contribute given their particular skills.

Related to the "personal fit" explanation: I'd argue that the skills required to best reduce s-risks have much overlap with the skills to make alignment progress (see here).  

At least, I think this goes for directly AI-related s-risks, which I consider most concerning, but I put significantly lower probabilities on them than you do.

For s-risks conditioned on humans staying in control over the future, we maybe wouldn't gain much from explicitly modelling AI takeoff and engaging in all the typical longtermist thought. Therefore, some things that reduce ... (read more)

1 · Dawn Drescher · 24d
Yeah… When it comes to the skill overlap, having alignment research aided by future pre-takeoff AIs seems dangerous. Having s-risk research aided that way seems less problematic to me. That might make it accessible (now or in a year) for people who have struggled with alignment research. I also wonder whether there is maybe still more time for game-theoretic research in s-risks than there is in alignment. The s-risk-related problems might be easier, so they can perhaps still be solved in time. (NNTR, just thinking out loud.)

Egoism, presentism, or substratism. The worst s-risks will probably not befall us (humans presently alive) or biological beings at all. Extinction, if it happens, will. Maybe death or the promise of utopia has a stronger intuitive appeal to people if they themselves have a risk/chance of experiencing it?

NUs. Some people may think that you have to be a negative utilitarian to care about s-risks. They are not negative utilitarians, so they steer clear of the topic.

I don’t think you have to be a negative utilitarian to care about s-risks. S-risks are about suffering, but people can be concerned about suffering among other values. Classic utilitarianism is about minimizing suffering and maximizing happiness. One does not exclude the other. Neither does concern for suffering exclude self-preservation, caring for one’s family, wanting to uphold traditions or making one’s ancestors proud. All values are sometimes in conflict, but that is not cause to throw out concern for suffering in particular. 

My vague ... (read more)

Too unpopular. Maybe people are motivated by what topics are in vogue in their friend circles, and s-risks are not?

Too sad. Some people think that maybe working on s-risks is unpopular because suffering is too emotionally draining to think about, so people prefer to ignore it.

Another version of this concern is that sad topics are not in vogue with the rich tech founders who bankroll our think tanks; that they’re selected to be the sort of people who are excited about incredible moonshots rather than prudent risk management. If these people hear about averting suffering, reducing risks, etc. too often from EA circles, they’ll become uninterested in EA-aligned thinking and think tanks.

I want to argue with the Litany of Gendlin here, but what work on s-risks really looks like in the end is writing open source game theory simulations and writing papers. All dry academic stuff that makes it easy to block out thoughts of suffering itself. Just give it a try! (E.g., at a CLR fellowship.)

I don’t know if that’s the case, but s-risks can be reframed:

  1. We want to unlock positive-sum trades for the flourishing of our descendants (biological or not).
  2. We want to distribute the progress and welfare gains from AI equitably (i.e. not have some sizable fr
... (read more)

We don't even care about actually existing hells. We just make the social signals that appear like caring insofar as they're expected of us, method-acting the emotions for exactly as long as is appropriate. Hence, we cry "that's horrible!" when we hear that a hundred million chickens literally suffer to death in factory farms annually, only to stop thinking about it ASAP so that we can keep making fun of vegans and eating chicken nuggets without being overwhelmed by the terror and nausea and rage. Hence, we look down when we see pictures of starving children, unwilling to bear the pain of empathy, but dissimulate about "local responsibility" or "warlords" whenever the idea of using money to alleviate the problem comes up. Hence, we feel a wretched blackness envelop our hearts when a nature documentary shows us a wide-eyed deer screaming as a cougar rips its unborn calf out of its stomach, only to about-face at the very moment that someone raises the question of doing anything about it, smugly telling them that it's not in our moral jurisdiction. 

Unlike AI safety in general, the study of s-risks isn't particularly large, technical, or drama-filled, so the typical person who ends up exposed to the idea—a shape rotator with something to prove—tends not to be drawn to it in particular. Those that are attracted to it will generally have been attracted because it was introduced to them in some particularly spicy context, or because it finds a perfect fit in their pre-existing structure of neuroses; they will say that they're working on it out of a sense of ethical prioritization, and they'll believe themselves, but somehow it's always their personal fascinations and status struggles that end up being prioritized. If you want to get more people working on s-risks, then, you might want to nerd-snipe them with math-y dilemmas, like why suffering necessarily takes place over time, and the questions this poses for ethical optimization at relativistic scales, or, what protocols an agent might deploy for detecting and undoing adversarial modifications of their utility function like sign-flipping (keeping in mind that such a protocol must act automatically, since the agent won't want to trigger it). 

Also, a nitpick, or maybe a miscommunication, but: Death with Dignity is not about giving up the fight: it's about dropping the unrealistic expectation that we'll somehow end up making it, while still continuing to, in Eliezer's own words, "increase the log odds of Earth's survival".  


Flagging that Diffractor's work on threat-resistant bargaining feels like the most important s-risk-related work I've ever seen, but I also haven't thoroughly evaluated it so I'd love for someone to do so and write up their thoughts.

Woah, thanks! I hadn’t seen it!

There's a new forum for this that seeks to increase discussion & coordination,

Not really core to any of those communities, so I don't have specific answers.  But I note that complacency is the human default for ANYTHING that doesn't have direct, obvious, immediate impact on an individual and their loved ones.

From nuclear war risks to repeated financial crises to massive money and power differentials, "why are we so complacent about X" is a common and valid question, rarely answered.

I'd recommend instead you frame it as a recommendation for specific action, not a question about attitude.  "you, dear reader, should do Y next week to reduce expected {average, total, median, whatever} future suffering" would go a lot further than asking why they're not obsessing over the topic.

I will note, though, for myself, I tend to focus on magnitude of positive experience-moments (with some declining marginal value for both intensity and quantity) rather than suffering in isolation, so I think about s-risks only when they're so universal as to effectively be x-risks.

I’d recommend instead you frame it as a recommendation for specific action, not a question about attitude. “you, dear reader, should do Y next week to reduce expected {average, total, median, whatever} future suffering” would go a lot further than asking why they’re not obsessing over the topic.

This would seem to be at odds with “aim to inform, not persuade”. (Is that still a rule? I seem to recall it being a rule, but now I can’t easily find it anywhere…)

It's never been a rule, more of a recommendation, and it's more about avoiding "arguments as soldiers" than a literal formation.  There are lots of exceptions, and I'd argue that it really should be "aim to learn" more than "aim to inform", though they're related.

In any case, obfuscating advocacy in the form of a somewhat rhetorical question seems strictly worse than EITHER informing or persuading.  It doesn't seem like anyone's trying to answer literally, they're answering related questions about the implied motivation of getting people to do something about S-risk.

It's part of the "frontpage comment guidelines" that show up every time you make a comment. They don't appear on GreaterWrong though, which is why I guess you can't see them...

I'd like to add another question: 

Why aren't we more concerned about s-risk than x-risk? 

Given that virtually everyone would prefer dying rather than facing an indefinite amount of suffering for an indefinite amount of time, I don't understand why more people aren't asking this question.

Personally, I have some deep psychological trauma related to pain and thinking about the topic is ... unproductive for me. Prolonged thinking about S-risks scares me, and I might not be able to think clearly about the topic. But maybe I could. The fear is what keeps me away. This is a flaw, and I'm unsure if it extends to other rationalists/EAs, but I'd guess people in these groups are unusually likely to have such scars because the LW memeplex is attractive to the walking wounded. I wouldn't be surprised if a few alignment researchers avoid s-risks for similar reasons. 

Averting s-risks mostly means preventing zero-sum AI conflict. If we find a way (or many ways) to do that, every somewhat rational AI will voluntarily adopt them, because who wants to lose out on gains from trade.

You're hoping to come up with an argument for human value, that will be accepted by any AI, no matter what its value system?

No, just a value-neutral financial instrument such as escrow. If two people can fight or trade, but they can’t trade, because they don’t trust each other, they’ll fight. That loses out on gains from trade, and one of them ends up dead. But once you invent escrow, there’s suddenly, in many cases, an option to do the trade after all, and both can live!
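
The escrow point can be sketched as a toy two-player game. Everything here is an illustrative assumption rather than anything from the thread: the action names, the payoff numbers, and the modeling choice that a failed deal ends in a fight. The sketch only checks pure-strategy Nash equilibria, so it is a rough picture of the argument, not a model of real AI bargaining.

```python
from itertools import product

ACTIONS = ("honor", "renege")


def pure_nash_equilibria(payoffs):
    """Return all pure-strategy Nash equilibria of a two-player game.

    payoffs maps (action_a, action_b) -> (payoff_a, payoff_b).
    A profile is an equilibrium if neither player can strictly gain
    by unilaterally switching to another action.
    """
    equilibria = []
    for a, b in product(ACTIONS, repeat=2):
        pay_a, pay_b = payoffs[(a, b)]
        a_cannot_gain = all(payoffs[(alt, b)][0] <= pay_a for alt in ACTIONS)
        b_cannot_gain = all(payoffs[(a, alt)][1] <= pay_b for alt in ACTIONS)
        if a_cannot_gain and b_cannot_gain:
            equilibria.append((a, b))
    return equilibria


# Hypothetical payoffs. A failed deal ends in a costly fight for both.
FIGHT = (-2, -2)

# Without escrow, reneging on an honest partner pays off, so nobody trusts.
no_escrow = {
    ("honor", "honor"): (3, 3),
    ("honor", "renege"): (-4, 5),
    ("renege", "honor"): (5, -4),
    ("renege", "renege"): FIGHT,
}

# With escrow, goods are released only if both sides deliver, so reneging
# gains nothing and simply leaves both parties with the fight outcome.
with_escrow = {
    ("honor", "honor"): (3, 3),
    ("honor", "renege"): FIGHT,
    ("renege", "honor"): FIGHT,
    ("renege", "renege"): FIGHT,
}

print(pure_nash_equilibria(no_escrow))    # [('renege', 'renege')]
print(pure_nash_equilibria(with_escrow))  # [('honor', 'honor'), ('renege', 'renege')]
```

With these made-up numbers, mutual distrust is the only stable outcome without escrow, while escrow adds mutual honoring as a stable outcome that both players prefer. It does not remove the fight equilibrium, which matches the "in many cases" hedge above.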
