I believe AI alignment researchers might be uniquely well-positioned to make a difference to s-risks. In particular, I think this of alignment researchers with a keen interest in “macrostrategy.” By that, I mean ones who habitually engage in big-picture thinking related to the most pressing problems (like AI alignment and strategy), form mental models of how the future might unfold, and think through their work’s paths to impact. (There’s also a researcher profile where a person specializes in a specific problem area so much that they no longer have much interest in interdisciplinary work and issues of strategy – those researchers aren’t the target audience of this post.)
Of course, having the motivation to work on a specific topic is a significant component of having a comparative advantage (or lack thereof). Whether AI alignment researchers find themselves motivated to invest a portion of their time/attention into s-risk reduction will depend on several factors, including:
- Their opportunity costs
- Whether they think the work is sufficiently tractable
- Whether s-risks matter enough (compared to other practical priorities) given their normative views
- Whether they agree that they may have a community-wide comparative advantage
Further below, I will say a few more things about these bullet points. In short, I believe that, for people with the right set of skills, reducing AI-related s-risks will become sufficiently tractable (if it isn’t already) once we know more about what transformative AI will look like. (The rest depends on individual choices about prioritization.)
Summary
- Suffering risks (or “s-risks”) are risks of events that bring about suffering in cosmically significant amounts. (“Significant” relative to our current expectation over future suffering.)
- (This post will focus on “directly AI-related s-risks,” as opposed to things like “future humans don't exhibit sufficient concern for other sentient minds.”)
- Early efforts to research s-risks were motivated in a peculiar way – morally “suffering-focused” EAs started working on s-risks not because they seemed particularly likely or tractable, but because of the theoretical potential for s-risks to vastly overshadow more immediate sources of suffering.
- Consequently, it seems a priori plausible that the people who’ve prioritized s-risks thus far don’t have much of a comparative advantage for researching object-level interventions against s-risks (apart from their high motivation inspired by their normative views).
- Indeed, this seems to be the case: I argue below that the most promising (object-level) ways to reduce s-risks often involve reasoning about the architectures or training processes of transformative AI systems, which involves skills that (at least historically) the s-risk community has not been specializing in all that much.
- Taking a step back, one challenge for s-risk reduction is that s-risks would happen so far ahead in the future that we have only the most brittle of reasons to assume that we can foreseeably affect things for the better.
- Nonetheless, I believe we can tractably reduce s-risks by focusing on levers that stay identifiable across a broad range of possible futures. In particular, we can rely on the propensity of agents to preserve themselves and pursue their goals in a wide range of environments. By shaping the next generation(s) of influential agents (e.g., our AI successors), we can address some of the most significant risk factors for s-risks.
- To do the above, we (mostly) “only” need to understand the inner workings of transformative AI systems or successor agents created during this upcoming transition. Naturally, AI alignment researchers are best positioned to do this.
- While some people believe that alignment work is already effective at reducing s-risks, I think the situation is more complicated – alignment work arguably increases the chance of some types of s-risks, while it reduces others. (It depends on what we think of as the relevant counterfactual – even if EA-inspired alignment researchers stopped working on alignment, it’s not like AI companies and their capabilities teams wouldn’t try to align AI at all.)
- More importantly, however, the question “What’s the sign of alignment research for s-risks?” doesn’t seem all that relevant. From a portfolio perspective (“What portfolio of interventions should the longtermist community pursue given different population-ethical views and individual comparative advantages?”), we can design a package of interventions that aims toward successful AI alignment and reduces s-risks on net.
- Firstly, any unwanted side-effects of the sort “alignment makes some s-risks more likely” would still be a concern even if EA-motivated alignment researchers stopped their work. Accordingly, it’s challenging to say what sign alignment work has on the margin. Secondly, due to this sign uncertainty, the net effect – whatever it is – will likely be smaller than the effect of more targeted measures to reduce s-risks.
- Normatively, there is no uniquely correct answer in population ethics. It’s defensible to hold moral views according to which s-risks aren’t among our primary priorities. That said, at the level of a “movement portfolio of interventions” that represents the distribution of values in the longtermist community, s-risks deserve significant attention. (Relatedly, see this section for my description of three in-my-view underappreciated arguments for the normative importance of reducing s-risks.)
- My appeal to AI alignment researchers:
- If you have a flair for macrostrategy (as opposed to deep-diving into technical issues without coming back to the surface much), you might have a comparative advantage in keeping an eye out for promising interventions to reduce s-risks.
- If you’ve previously thought about s-risks and had some concrete ideas to explore, then you almost certainly have a community-wide comparative advantage in reducing s-risks.
- Don’t assume the s-risk community “has got it all covered” – people who’ve prioritized s-risk reduction so far may not have skilled up sufficiently in their understanding of alignment-relevant considerations (and related intellectual “backgrounds”). However, if you’re interested, please consider contacting me or Stefan Torges from the Center on Long-term Risk, to see some of our preliminary thoughts or ideas. (Or do so after a bit of thinking to add an independent, unprimed take on the topic.)
In the rest of the post, I’ll flesh out the above points.
What I mean by “s-risks”
Suffering risks (or “s-risks”) are risks of events that bring about suffering in cosmically significant amounts. By “significant,” I mean significant relative to our expectation over future suffering.
For something to constitute an “s-risk” under this definition, the suffering involved not only has to be astronomical in scope (e.g., “more suffering than has existed on Earth so far”), but also significant compared to other sources of expected future suffering. This last bit ensures that “s-risks,” assuming sufficient tractability, are always a top priority for suffering-focused longtermists. Consequently, it also ensures that s-risks aren’t a rounding error from a “general longtermist portfolio” perspective, where people with community-wide comparative advantages work on the practical priorities of different value systems.
I say “assuming sufficient tractability” to highlight that the definition of s-risks only focuses on the expected scope of suffering (probability times severity/magnitude), but not on whether we can tractably reduce a given risk. To assess the all-things-considered importance of s-risk reduction, we also have to consider tractability.
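As a rough formalization of this definition (my own sketch, with hypothetical notation rather than anything from the s-risk literature): write $P(R)$ for the probability of an event $R$ and $S(R)$ for the amount of suffering it would involve. Then $R$ counts as an s-risk roughly when

$$ P(R)\cdot S(R) \;\;\text{is a non-negligible fraction of}\;\; \mathbb{E}[\text{total future suffering}], $$

i.e., when its expected suffering is significant relative to all the suffering we currently expect the future to contain. Tractability then enters as a separate question: how much a marginal unit of effort can reduce $P(R)$ or $S(R)$.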
Why reducing s-risks is challenging
- Research aimed toward positive altruistic impact is often motivated as follows. “Here’s a problem. How can we better understand it? How might we fix it?”
- By contrast, I started working on s-risks (alongside others) because of the following line of reasoning. “Here’s a problem class that would be overridingly bad according to my values if it happened. Let’s study if something like that might happen. Let’s zoom in on the most likely and most severe scenarios that could happen so we can try to steer clear of these trajectories or build safeguards.”
- The added challenge with this way of motivating research (apart from inviting biases like the optimizer’s curse when identifying the top risks, or confirmation bias about the likelihood and severity of specific risks after deliberately searching for reasons they might happen or might be severe) is that it’s disconnected from tractability concerns. Reducing s-risks seems to require predicting things further ahead in the future than other EA-inspired interventions do. (It’s not just that s-risks won’t happen for a long time – it’s also that the pathways through which they might occur involve forecasting a drastically changed world. To reduce s-risks, we arguably have to foresee features of the AI transition that go one step beyond us building transformative AI. This makes it difficult to identify relevant levers we could predictably affect.) To summarize: to predictably affect s-risks, we have to identify reliable levers at a distance and level of detail that goes so far beyond the superforecasting literature that we have only the brittlest of reasons to assume it might work.
Why s-risks nonetheless seem tractable
Nonetheless, by focusing our efforts on shaping the next generation(s) of influential agents (e.g., our AI successors), we can reduce or even circumvent the demands for detailed forecasts of distant scenarios.
- A fundamental property of “agents” is that they continue to pursue their objectives in a wide range of circumstances and environments. So, agency introduces predictability at the level of strategically important objectives and incentives. Accordingly, as long as we can shape how (some of) the future’s influential agents think and what they value, we have a comparatively robust lever for influencing the future. We can use this lever to affect risk factors for s-risks.
- In particular, we could try to shape the goals and decision-making architectures of transformative AI systems in the following ways/directions:
- ensure AI designs follow the principle of
- train early-stage AI systems to reason myopically or otherwise install safeguards for
- prevent the emergence of powerful consequentialist AI systems (perhaps especially ones that don’t have concern for human values and have “little to lose”)
- prevent the emergence of AI systems with “malevolent” or otherwise anti-social instincts/heuristics/motivations (compare: dark tetrad traits that evolved in humans).
- steer AI takeoff dynamics towards more homogeneity and better coordination (insofar as we expect multipolar takeoff)
- For more information, see the CLR research agenda, particularly sections 4-6.
Alignment researchers have a comparative advantage in reducing s-risks
- Naturally, AI alignment researchers are best positioned to understand the likely workings of transformative AI systems. Therefore, alignment researchers may have a comparative advantage in reducing s-risks. (Of course, many other considerations factor into this as well – e.g., opportunity costs and motivation to work on s-risks.)
- Still, the task continues to be difficult. Alignment research itself arguably struggles with transitioning from phase 1 to phase 2 interventions. Coming up with preventative measures against s-risks is challenging for similar reasons.
- Timing matters. Perhaps it’s too early to form informative mental models of what transformative AI systems will look like. If so, we might be better off researching s-risk-specific safeguards for AI goal architectures at a later point. (Relatedly, consider how challenging it was to do valuable alignment work before it became apparent that the current deep-learning paradigm could pretty directly lead to human-level intelligence – perhaps with a few additions/tweaks.)
- To act when the time is right, it’s crucial to stay on the lookout. S-risk reduction may not be tractable right away, but, at the very least, AI alignment researchers will have a comparative advantage in noticing when efforts to reduce s-risks become sufficiently tractable.
S-risk reduction is separate from alignment work
- I’ve sometimes heard the belief that s-risk reduction efforts are superfluous because alignment work already efficiently reduces s-risks. Sure, the two types of research are related – it likely takes similar skills to do them well. However, s-risk reduction efforts (even regarding the most directly AI-related s-risks) aren’t a subset of alignment work.
- For one thing, someone who is highly pessimistic about alignment success can still reduce s-risks, since we might still be able to affect some features of misaligned AI and select for features that steer clear of risk factors for s-risks.
- Moreover, AI alignment work not only makes it more likely that fully aligned AI is created and everything goes perfectly well, but it also affects the distribution of alignment failure modes. In particular, progress in AI alignment could shift failure modes from “far away from perfect in conceptual space” to “near miss.” By “near miss,” I mean something close to the intended target but slightly off – and near misses matter here because a system whose goals are close to ours but subtly wrong may be more likely to bring about outcomes involving large amounts of suffering than one whose goals are entirely alien.
- That said, “AI alignment” is a broad category that includes approaches different from (narrow) value learning. For instance, using AI in a restricted domain to help with a pivotal task arguably has a very different risk profile regarding s-risks. (Figuring out how various alignment approaches relate to s-risks remains one of the most important questions to research further.)
- Also, the relevant counterfactual seems unclear. If EA-inspired alignment researchers stopped their work, AI companies would still attempt to create aligned rather than misaligned AI.
- Accordingly, it is hard to tell whether alignment work on the margin increases or decreases s-risks.
- More importantly, due to the indirect way the intervention relates to s-risks and our uncertainty about its sign, the effect – whatever it is – will likely be smaller than the effect of the most targeted measures against s-risks.
- Therefore, it should be feasible to put together a package of interventions that aims toward successful AI alignment and reduces s-risks on net.
Normative reasons for dedicating some attention to s-risks
- For the portfolio approach I’m advocating (where EAs, to some degree, consider their comparative advantages in helping out value systems with community buy-in), it isn’t necessary to buy into suffering-focused ethics wholeheartedly.
- Still, anyone skeptical of suffering-focused views needs a reason for giving these views significant consideration, for instance in the context of a portfolio approach to prioritization. So, here are some prominent reasons:
- People may be uncertain/undecided about the weight their idealized values give to suffering-focused ethics.
- People may intrinsically value (low-demanding) cooperation and adhere to cooperative heuristics like “If I can greatly benefit others’ values at low cost, I’ll do so.” Or similarly, for risks of accidentally causing s-risks, people may endorse the heuristic, “If I can avoid greatly harming others’ values at a reasonable cost, I’ll do so.”
- People may value (low-demanding) cooperation instrumentally, for decision-theoretic reasons related to the possibility that we live in a multiverse.
- I expect most longtermist EAs to be familiar with discussions about suffering-focused ethics, so I won’t repeat all the arguments here about why it could be warranted to endorse such views or hold space for them out of uncertainty about one’s values. Instead, I want to point out three lines of argument that I expect might be underappreciated among longtermists who don’t already spend a part of their attention on s-risks (this is, admittedly, a very subjective list):
- Population ethics without axiology. See my blog post here, which I was really pleased to have win a prize in the EA criticism and red-teaming contest. Firstly, the post argues that person-affecting views are considered untenable by many EAs mostly because of the questionable, moral realist utilitarian assumption that “every possible world has a well-defined intrinsic impartial utility but there are additional ethical considerations.” Without this assumption, person-affecting views seem perfectly defensible (so the post argues). According to person-affecting morality, s-risks are important to prevent, but safeguarding existing people from extinction is also a priority. Secondly, the post argues that ethics has two parts and that s-risk reduction might fall into the “responsibility/fairness/cooperation” part of ethics (as opposed to the “axiology” part, which asks “What's the most moral/altruistic outcome?”). Specifically, the position is as follows. Next to (1) “What are my life goals?” (and for effective altruists, “What are my life goals assuming that I want maximally altruistic/impartial life goals?”), ethics is also about (2) “How do I react to other people having different life goals from mine?” Concerning (2), just like parents who want many children have a moral responsibility to take reasonable precautions so none of their children end up predictably unhappy, people with good-future-focused population-ethical views arguably have a responsibility to take common-sense-reasonable precautions against accidentally bringing about s-risks.
- People have two types of motivation. Many posts on suffering-focused ethics, such as my article on tranquilism, focus mostly on “system-1/model-free/impulsive motivation,” not on “system-2/model-based/reflective motivation.” The latter type of motivation can attach itself to different life goals. For instance, some people are hedonistic regarding their own well-being, while others live for the personal meaning that comes from altruistic pursuits, grand achievements, or close relationships. Now, people sometimes reject suffering-focused views because they seem “bleak” or “obviously wrong.” I want to point out that when they do this, they are mostly speaking from the perspective of system-2/model-based/reflective motivation. That is, in terms of their reflectively endorsed life goals, they know they value things other than suffering reduction. That makes perfect sense (I do too). However, consider that many suffering-focused effective altruists have complex (as opposed to “it’s all about one thing/quantity”) moral views where both motivational systems play a role. For instance, I believe that people’s system-2/model-based/reflective motivation matters for existing people and their specific life goals, but, since “what matters from that kind of motivation’s perspective” is under-defined for “new” people and for matters of population ethics more generally, I’m approaching those “under-defined” contexts from an axiological perspective inspired by my system-1/model-free/impulsive motivation. Tranquilism, then, makes the normative claim that system-1/model-free/impulsive motivation is more about suffering reduction than about pursuing pleasure. (Others may still disagree with this claim – but that’s a different reason for rejecting tranquilism than its seeming “bleak.”)
- Asymmetries between utopia and dystopia. It seems that we can “pack” more bad things into dystopia than we can “pack” good things into utopia. Many people presumably value freedom, autonomy, and some kind of “contact with reality.” The opposites of these values are easier to implement and easier to stack together: dystopia can be repetitive, solipsistic, lacking in options/freedom, etc. For these reasons, it feels like there’s at least some type of asymmetry between good things and bad things – even for someone who would otherwise see them as completely symmetric.
My appeal to AI alignment researchers
- To any alignment researcher who has read this far: given that the text has (hopefully) held your interest, you might have a comparative advantage in reducing s-risks (especially at the level of “keeping an eye on whether there are any promising interventions”).
- If you’ve previously thought about s-risks and had some concrete ideas to maybe explore, then you almost certainly have a community-wide comparative advantage in reducing s-risks.
- Don’t assume the s-risk community “has got it all covered,” but please consider getting in touch with me or Stefan Torges from the Center on Long-term Risk to see some of our preliminary thoughts or ideas. (Or do so after a bit of thinking to add an independent, unprimed take on the topic.)