AI alignment researchers may have a comparative advantage in reducing s-risks

Lukas_Gloor

I believe AI alignment researchers might be uniquely well-positioned to make a difference to s-risks. In particular, I think this of alignment researchers with a keen interest in “macrostrategy.” By that, I mean ones who habitually engage in big-picture thinking related to the most pressing problems (like AI alignment and strategy), form mental models of how the future might unfold, and think through their work’s paths to impact. (There’s also a researcher profile where a person specializes in a specific problem area so much that they no longer have much interest in interdisciplinary work and issues of strategy – those researchers aren’t the target audience of this post.)

Of course, having the motivation to work on a specific topic is a significant component of having a comparative advantage (or lack thereof). Whether AI alignment researchers find themselves motivated to invest a portion of their time/attention into s-risk reduction will depend on several factors, including:

- Their opportunity costs
- Whether they think the work is sufficiently tractable
- Whether s-risks matter enough (compared to other practical priorities) given their normative views
- Whether they agree that they may have a community-wide comparative advantage

Further below, I will say a few more things about these bullet points. In short, I believe that, for people with the right set of skills, reducing AI-related s-risks will become sufficiently tractable (if it isn’t already) once we know more about what transformative AI will look like. (The rest depends on individual choices about prioritization.)

Summary

Suffering risks (or “s-risks”) are risks of events that bring about suffering in cosmically significant amounts. (“Significant” relative to our current expectation over future suffering.)
(This post will focus on “directly AI-related s-risks,” as opposed to things like “future humans don't exhibit sufficient concern for other sentient minds.”)
Early efforts to research s-risks were motivated in a peculiar way – morally “suffering-focused” EAs started working on s-risks not because they seemed particularly likely or tractable, but because of the theoretical potential for s-risks to vastly overshadow more immediate sources of suffering.
Consequently, it seems a priori plausible that the people who’ve prioritized s-risks thus far don’t have much of a comparative advantage for researching object-level interventions against s-risks (apart from their high motivation inspired by their normative views).
Indeed, this seems to be the case: I argue below that the most promising (object-level) ways to reduce s-risks often involve reasoning about the architectures or training processes of transformative AI systems, which involves skills that (at least historically) the s-risk community has not been specializing in all that much.^[1]
Taking a step back, one challenge for s-risk reduction is that s-risks would happen so far ahead in the future that we have only the most brittle of reasons to assume that we can foreseeably affect things for the better.
Nonetheless, I believe we can tractably reduce s-risks by focusing on levers that stay identifiable across a broad range of possible futures. In particular, we can focus on the propensity of agents to preserve themselves and pursue their goals in a wide range of environments. By focusing our efforts on shaping the next generation(s) of influential agents (e.g., our AI successors), we can address some of the most significant risk factors for s-risks.^[2] In particular:
- Install design principles like hyperexistential separation into the goal/decision architectures of transformative AI systems.
- Shape AI training environments to prevent the evolution of sadistic or otherwise anti-social instincts.
To do the above, we (mostly)^[3] “only” need to understand the inner workings of transformative AI systems or successor agents created during this upcoming transition. Naturally, AI alignment researchers are best positioned to do this.
While some people believe that alignment work is already effective at reducing s-risks, I think the situation is more complicated – alignment work arguably increases the chance of some types of s-risks, while it reduces others. (It depends on what we think of as the relevant counterfactual – even if EA-inspired alignment researchers stopped working on alignment, it’s not like AI companies and their capabilities teams wouldn’t try to align AI at all.)
More importantly, however, the question “What’s the sign of alignment research for s-risks” doesn’t seem all that relevant. From a portfolio perspective (“What portfolio of interventions should the longtermist community pursue given different population-ethical views and individual comparative advantages?”), we can design a package of interventions that aims towards successful AI alignment and reduces s-risks on net.
- Firstly, any unwanted side-effects of the sort “alignment makes some s-risks more likely” would still be a concern even if EA-motivated alignment researchers stopped their work. Accordingly, it’s challenging to say what sign alignment work has on the margin. Secondly, due to this sign uncertainty, the net effect – whatever it is – will likely be smaller than the effect of more targeted measures to reduce s-risks.
Normatively, there is no uniquely correct answer to population ethics. It’s defensible to hold moral views according to which s-risks aren’t among our primary priorities. That said, at the level of a “movement portfolio of interventions” that represents the distribution of values in the longtermist community, s-risks deserve significant attention.^[4] (Relatedly, see this section for my description of three in-my-view underappreciated arguments for the normative importance of reducing s-risks.)
My appeal to AI alignment researchers:
- If you have a flair for macrostrategy (as opposed to deep-diving into technical issues without coming back to the surface much), you might have a comparative advantage in keeping an eye out for promising interventions to reduce s-risks.
- If you’ve previously thought about s-risks and had some concrete ideas to explore, then you almost certainly have a community-wide comparative advantage in reducing s-risks.
- Don’t assume the s-risk community “has got it all covered” – people who’ve prioritized s-risk reduction so far may not have skilled up sufficiently in their understanding of alignment-relevant considerations (and related intellectual “backgrounds”). However, if you’re interested, please consider contacting me or Stefan Torges from the Center on Long-term Risk, to see some of our preliminary thoughts or ideas. (Or do so after a bit of thinking to add an independent, unprimed take on the topic.)

In the rest of the post, I’ll flesh out the above points.

What I mean by “s-risks”

Suffering risks (or “s-risks”) are risks of events that bring about suffering in cosmically significant amounts. By “significant,” I mean significant relative to our expectation over future suffering.

For something to constitute an “s-risk” under this definition, the suffering involved not only has to be astronomical in scope (e.g., “more suffering than has existed on Earth so far”),^[5] but also significant compared to other sources of expected future suffering. This last bit ensures that “s-risks,” assuming sufficient tractability, are always a top priority for suffering-focused longtermists. Consequently, it also ensures that s-risks aren’t a rounding error from a “general longtermist portfolio” perspective, where people with community-wide comparative advantages work on the practical priorities of different value systems.

I say “assuming sufficient tractability” to highlight that the definition of s-risks only focuses on the expected scope of suffering (probability times severity/magnitude), but not on whether we can tractably reduce a given risk. To assess the all-things-considered importance of s-risk reduction, we also have to consider tractability.

Why reducing s-risks is challenging

Historically, efforts to reduce s-risks were motivated in a peculiar way, which introduced extra difficulties compared to other research areas.

Research aimed toward positive altruistic impact is often motivated as follows. “Here’s a problem. How can we better understand it? How might we fix it?”
By contrast, I started working on s-risks (alongside others) because of the following line of reasoning. “Here’s a problem class that would be overridingly bad according to my values if it happened. Let’s study if something like that might happen. Let’s zoom in on the most likely and most severe scenarios that could happen so we can try to steer clear of these trajectories or build safeguards.”
The added challenge from this way of motivating research (apart from inviting biases like winner’s curse about identifying the top risks or confirmation bias about the likelihood and severity of specific risks after deliberately searching for reasons they might happen or might be severe) is that it’s disconnected from tractability concerns. Reducing s-risks seems to require predicting stuff further ahead in the future than for other EA-inspired interventions. (It’s not just that s-risks won’t happen for a long time – it’s also that the pathways through which they might occur involve forecasting a drastically changed world. To reduce s-risks, we arguably have to foresee features of the AI transition that go one step beyond us building transformative AI. This makes it difficult to identify relevant levers we could predictably affect.) To summarize, to predictably affect s-risks, we have to identify reliable levers at a distance and level of detail that goes so far beyond the superforecasting literature that we have only the brittlest of reasons to assume that it might work.

Why s-risks nonetheless seem tractable

Nonetheless, by focusing our efforts on shaping the next generation(s) of influential agents (e.g., our AI successors), we can reduce or even circumvent the demands for detailed forecasts of distant scenarios.

A fundamental property of “agents” is that they continue to pursue their objectives in a wide range of circumstances and environments. So, agency introduces predictability at the level of strategically-important objectives and incentives. Accordingly, as long as we can shape the ways (some of) the future’s influential agents think and what they value, we then have a comparatively robust lever of influence into the future. We can use this lever to influence risk factors for s-risks.
In particular, we could try to shape the goals and decision-making architectures of transformative AI systems in the following ways/directions:
- ensure AI designs follow the principle of hyperexistential separation
- train early-stage AI systems to reason myopically or otherwise install safeguards for not modeling distant superintelligences
- prevent the emergence of powerful consequentialist AI systems (perhaps especially ones that don’t have concern for human values and have “little to lose”)^[6]
- prevent the emergence of AI systems with “malevolent” or otherwise anti-social instincts/heuristics/motivations (compare: dark tetrad traits that evolved in humans).^[7]
- steer AI takeoff dynamics towards more homogeneity and better coordination (insofar as we expect multipolar takeoff)
For more information, see the CLR research agenda, particularly sections 4-6.

Alignment researchers have a comparative advantage in reducing s-risks

Naturally, AI alignment researchers are best positioned to understand the likely workings of transformative AI systems. Therefore, alignment researchers may have a comparative advantage in reducing s-risks. (Of course, many other considerations factor into this as well – e.g., opportunity costs and motivation to work on s-risks.)
Still, the task continues to be difficult. Alignment research itself arguably struggles with transitioning from phase 1 to phase 2 interventions. Coming up with preventative measures against s-risks is challenging for similar reasons.
Timing matters. Perhaps it’s too early to form informative mental models of what transformative AI systems will look like. If so, we might be better off researching s-risk-specific safeguards for AI goal architectures at a later point. (Relatedly, consider how it was challenging to do valuable alignment work before it became apparent that the current deep-learning paradigm could pretty directly lead to human-level intelligence – perhaps with a few additions/tweaks.)
To act when the time is right, it’s crucial to stay on the lookout. S-risk reduction may not be tractable right away, but, at the very least, AI alignment researchers will have a comparative advantage in noticing when efforts to reduce s-risks become sufficiently tractable.

S-risk reduction is separate from alignment work

I’ve sometimes heard the belief that s-risk reduction efforts are superfluous because alignment work already efficiently reduces s-risks. Sure, the two types of research are related – it likely takes similar skills to do them well. However, s-risk reduction efforts (even regarding the most directly AI-related s-risks) aren’t a subset of alignment work.
For one thing, someone who is highly pessimistic about alignment success can still reduce s-risks, since we might still be able to affect some features of misaligned AI and select ones that steer clear of risk factors.
Moreover, AI alignment work not only makes it more likely that fully aligned AI is created and everything goes perfectly well, but it also affects the distribution of alignment failure modes. In particular, progress in AI alignment could shift failure modes from “far away from perfect in conceptual space” to “near miss.” By “near miss,” I mean something close but slightly off target.^[8]
That said, “AI alignment” is a broad category that includes approaches different from (narrow) value learning. For instance, using AI in a restricted domain to help with a pivotal task arguably has a very different risk profile regarding s-risks. (Figuring out how various alignment approaches relate to s-risks remains one of the most important questions to research further.)
Also, the relevant counterfactual seems unclear. If EA-inspired alignment researchers stopped their work, AI companies would still attempt to create aligned rather than misaligned AI.
Accordingly, it is hard to tell whether alignment work on the margin increases or decreases s-risks.
More importantly, due to the indirect way the intervention relates to s-risks and our uncertainty about its sign, the effect – whatever it is – will likely be smaller than the effect of the most targeted measures against s-risks.
Therefore, it should be feasible to put together a package of interventions that aims toward successful AI alignment and reduces s-risks on net.^[9]

Normative reasons for dedicating some attention to s-risks

For the portfolio approach I’m advocating (where EAs, to some degree, consider their comparative advantages in helping out value systems with community buy-in), it isn’t necessary to buy into suffering-focused ethics wholeheartedly.
Still, anyone skeptical of suffering-focused views needs a reason for giving these views significant consideration, for instance in the context of a portfolio approach to prioritization. So, here are some prominent reasons:
- People may be uncertain/undecided about the weight their idealized values^[10] give to suffering-focused ethics.
- People may intrinsically value (low-demanding) cooperation and adhere to cooperative heuristics like “If I can greatly benefit others’ values at low cost, I’ll do so.” Or similarly, for risks of accidentally causing s-risks, people may endorse the heuristic, “If I can avoid greatly harming others’ values at a reasonable cost, I’ll do so.”
- People may value (low-demanding)^[11] cooperation instrumentally, for decision-theoretic reasons tied to the possibility of living in a multiverse.
I expect most longtermist EAs to be familiar with discussions about suffering-focused ethics, so I won’t repeat all the arguments here about why it could be warranted to endorse such views or hold space for them out of uncertainty about one’s values. Instead, I want to point out three lines of argument that I expect might be underappreciated among longtermists who don’t already spend a part of their attention on s-risks (this is, admittedly, a very subjective list):
- Population ethics without axiology. See my blog post here, which I was really pleased to have win a prize in the EA criticism and red-teaming contest. Firstly, the post argues that person-affecting views are considered untenable by many EAs mostly because of the questionable, moral realist utilitarian assumption that “every possible world has a well-defined intrinsic impartial utility but there are additional ethical considerations.” Without this assumption, person-affecting views seem perfectly defensible (so the post argues). According to person-affecting morality, s-risks are important to prevent, but safeguarding existing people from extinction is also a priority. Secondly, the post argues that ethics has two parts and that s-risk reduction might fall into the “responsibility/fairness/cooperation” part of ethics. (As opposed to the “axiology,” “What’s the most moral/altruistic outcome?” part.) Specifically, the position is as follows. Next to (1) “What are my life goals?” (and for effective altruists, “What are my life goals assuming that I want maximally altruistic/impartial life goals?”), ethics is also about (2) “How do I react to other people having different life goals from mine?” Concerning (2), just like parents who want many children have a moral responsibility to take reasonable precautions so none of their children end up predictably unhappy, people with good-future-focused population ethical views arguably have a responsibility to take common-sense-reasonable precautions against accidentally bringing about s-risks.
- People have two types of motivation. Many posts on suffering-focused ethics, such as my article on tranquilism, focus mostly on “system-1/model-free/impulsive motivation,” not on “system-2/model-based/reflective motivation.” The latter type of motivation can attach itself to different life goals. For instance, some people are hedonistic regarding their own well-being, while others live for the personal meaning that comes from altruistic pursuits, grand achievements, or close relationships. Now, people sometimes reject suffering-focused views because they seem “bleak” or “obviously wrong.” I want to point out that when they do this, they mostly talk from the perspective of system-2/model-based/reflective motivation. That is, in terms of their reflected life goals, they know they value things other than suffering reduction. That makes perfect sense (I do too). However, consider that many suffering-focused effective altruists have complex (as opposed to “it’s all about one thing/quantity”) moral views where both motivational systems play a role. For instance, I believe that people’s system-2/model-based/reflective motivation matters for existing people and their specific life goals, but, since “what matters from that kind of motivation’s perspective” is under-defined for “new” people and for matters of population ethics more generally, I’m approaching those “under-defined” contexts from an axiological perspective inspired by my system-1/model-free/impulsive motivation. Tranquilism, then, makes the normative claim that system-1/model-free/impulsive motivation is more about suffering reduction than about pursuing pleasure. (Others may still disagree with this claim – but that’s a different reason for rejecting tranquilism than due to it seeming “bleak.”)
- Asymmetries between utopia and dystopia. It seems that we can “pack” more bad things into dystopia than we can “pack” good things into utopia. Many people presumably value freedom, autonomy, some kind of “contact with reality.” The opposites of these values are easier to implement and easier to stack together: dystopia can be repetitive, solipsistic, lacking in options/freedom, etc. For these reasons, it feels like there’s at least some type of asymmetry between good things and bad things – even if someone were to otherwise see them as completely symmetric.

My appeal to AI alignment researchers

To any alignment researcher who has read this far: given that the text has (hopefully) held your interest, you might have a comparative advantage in reducing s-risks (especially at the level of “keeping an eye on whether there are any promising interventions”).
If you’ve previously thought about s-risks and had some concrete ideas to explore, then you almost certainly have a community-wide comparative advantage in reducing s-risks.
Don’t assume the s-risk community “has got it all covered,” but please consider getting in touch with me or Stefan Torges from the Center on Long-term Risk to see some of our preliminary thoughts or ideas. (Or do so after a bit of thinking to add an independent, unprimed take on the topic.)

[1] This is changing to some degree. For instance, the Center on Long-term Risk has recently started working with language models, among other things. That said, my impression is that there’s still a significant gap in “How much people understand the AI alignment discourse and the workings of cutting-edge ML models” between the most experienced researchers in the s-risk community and the ones in the AI alignment field.
[2] Not everyone in the suffering-focused longtermist community considers “directly AI-related” s-risks sufficiently tractable to deserve most of our attention. Other interventions to reduce future suffering include focusing on moral circle expansion and reducing political polarization – see the work of the Center on Reducing Suffering. I think there’s “less of a pull from large numbers” (or “less of a Pascalian argument”) for these specific causes, but that doesn’t necessarily make them less important. Arguably, sources of suffering related to our societal values and culture are easier to affect, and they require less of a specific technical background to engage in advocacy work. That said, societal values and culture will only have a lasting effect on the future if humanity stays in control over that future. (See also my comment here on conditions under which moral circle expansion becomes less important from a longtermist perspective.)
[3] It’s slightly more complicated. We also must be able to tell what features of an AI system’s decision architecture robustly correlate with increased/decreased future s-risks. I think the examples in the bullet point above might qualify. Still, there’s a lot of remaining uncertainty.
[4] The exception here would be if s-risks remained intractable, due to their speculative nature, even to the best-positioned researchers.
[5] The text that initially introduced the term “s-risks” used to have a different definition focused on the astronomical stakes of space colonization. With my former colleagues at the Center on Long-term Risk and in the context of better coordinating the way we communicate about our respective priorities with longtermists whose priorities aren’t “suffering-focused,” we decided to change the definition for s-risks. We had two main reasons for the change: (1) Calling something an “s-risk” when it doesn’t constitute a plausible practical priority (not even for people who prioritize suffering reduction over other goals) risks generating the impression that s-risks are generally not that important. (2) Calling the future scenario “galaxy-wide utopia where people still suffer headaches now and then” an “s-risk” may come with the connotation (always unintended by us) that this entire future scenario ought to be prevented. Over the years, we received a lot of feedback (e.g., here and here) that this was off-putting about the older definition.
[6] By “little to lose,” I mean AIs whose value structure incentivizes them to bargain in particularly reckless ways because the default outcome is as bad as it gets according to their values. E.g., a paperclip maximizer who only cares about paperclips considers both an empty future and “utopia for humans” ~as bad as it gets because – absent moral trade – these outcomes wouldn’t contain any paperclips.
[7] For instance, one could do this by shaping the incentives in AI training environments after maybe studying why anti-social phenotypes evolved in humans and what makes them identifiable, etc.
[8] The linked post on “near misses” discusses why this increases negative variance. For more thoughts, see the subsection on AI alignment from my post, Cause prioritization for downside-focused value systems. (While some of the discussion in that post will feel a bit dated, I find that it still captures the most significant considerations.)
[9] In general, I think sidestepping the dilemma in this way is often the correct solution to “deliberation ladders” where an intervention that seems positive and important at first glance comes with some identifiable but brittle-seeming “backfiring risks.” For instance, let’s say we’re considering the importance of EA movement building vs. the risk of movement dilution, or economic growth vs. the risk of speeding up AI capabilities prematurely. In both cases, I think the best solution is to push ahead (build EA, promote economic growth) while focusing on targeted measures to keep the respective backfiring risks in check. Of course, that approach may not be correct for all deliberation ladder situations. It strikes me as “most likely correct” for cases where the following conditions apply. 1. Targeted measures to prevent backfiring risks will be reasonably effective/practical when they matter most. 2. The value of information from more research into the deliberation ladder (“trying to get to the top/end of it”) is low.
[10] See this post on why I prefer this phrasing over “are morally uncertain.”
[11] I think evidential cooperation in large worlds (or “multiverse-wide superrationality,” the idea’s initial name) supports a low-demanding form of cooperation more robustly than it supports all-out cooperation. After all, since we’ll likely have a great deal of uncertainty about whether we think of the idea in the right ways and accurately assess our comparative advantages and gains from trade, we have little to gain from pushing the envelope and pursuing maximal gains from trade over the 80-20 version.
