Is Deontological AI Safe? [Feedback Draft]

Dan H; William D'Alessandro

[Note: This post is an excerpt from a longer paper, written during the first half of the Philosophy Fellowship at the Center for AI Safety. I (William D'Alessandro) am a Postdoctoral Fellow at the Munich Center for Mathematical Philosophy. Along with the other Philosophy Fellowship midterm projects, this draft is posted here for feedback.

The full version of the paper includes a discussion of the conceptual relationship between safety and moral alignment, and an argument that we should choose a reliably safe powerful AGI over one that's (apparently) successfully morally aligned. I've omitted this material for length but can share it on request.

The deontology literature is big, and lots of angles here could be developed further. Questions and suggestions much appreciated!]

1 Introduction^[1]

Value misalignment arguments for AI risk observe that artificial agents needn’t share human ideas about what sorts of ends are intrinsically good and what sorts of means are morally permissible. Without such values for guidance, a powerful AI might turn its capabilities toward human-unfriendly goals. Or it might pursue the objectives we’ve given it in dangerous and unforeseen ways. Thus, as Bostrom writes, “Unless the plan is to keep superintelligence bottled up forever, it will be necessary to master motivation selection” (Bostrom 2014, 185). Indeed, since more intelligent, autonomous AIs will be favored by competitive pressures over their less capable kin (Hendrycks 2023), the hope of keeping AI weak indefinitely is probably no plan at all.

Considerations about value misalignment plausibly show that equipping AIs with something like human morality is a necessary step toward AI safety. It’s natural to wonder whether moral alignment might also be sufficient for safety, or nearly so. Would an AI guided by an appropriate set of ethical principles be unlikely to cause disastrous harm by default?

This is a tempting thought. By the lights of common sense, morality is strongly linked with trustworthiness and beneficence; we think of morally exemplary agents as promoting human flourishing while doing little harm. And many moral systems include injunctions along these lines in their core principles. It would be convenient if this apparent harmony turned out to be a robust regularity.

Deontological morality^[2] looks like an especially promising candidate for an alignment target in several respects. It’s perhaps the most popular moral theory among both professional ethicists^[3] and the general public^[4]. It appears to present a relatively tractable technical challenge, as well-developed formal logics of deontic inference exist already, and large language models have shown promise at classifying acts into deontologically relevant categories (Hendrycks et al. 2021). Correspondingly, research has begun on equipping AIs with deontic constraints via a combination of top-down and bottom-up methods (Kim et al. 2021). Finally, deontology appears more inherently safety-friendly than its rivals, since many deontological theories posit strong harm-avoidance principles. (By contrast, standard forms of consequentialism recommend taking unsafe actions when such acts maximize expected utility. Adding features like risk-aversion and future discounting may mitigate some of these safety issues, but it’s not clear they solve them entirely.)

I’ll argue that, unfortunately, deontological morality is no royal road to safe AI. The problem isn’t just the trickiness of achieving complete alignment, and the chance that partially aligned AIs will exhibit risky behavior. Rather, there’s reason to think that deontological AI might pose distinctive safety risks of its own.^[5] This suggests that existential catastrophes from AI are harder to avoid than one might have expected—if deontology fails to deliver safety, it’s unclear what approach to moral alignment would succeed.^[6]

2 Deontology and safety

I consider three prominent forms of deontology: moderate views based on harm-benefit asymmetry principles, contractualist views based on consent requirements, and non-aggregative views based on separateness-of-persons considerations. I describe how the first two lead easily to anti-natalism and similar conclusions, and discuss some ways in which an AI aligned with such theories might endanger humanity. Non-aggregative deontology, meanwhile, has no meaningful safety features to speak of.

This analysis is motivated by the thought that, if deontological morality is used as an AI alignment target, the choice of which particular principles to adopt will likely be influenced by the facts about which versions of deontology are best developed and most widely endorsed by relevant experts. In particular, other things being equal, I'd expect sophisticated deontological theories with many proponents to provide more attractive touchstones for alignment purposes. So it's reasonable to start with these theories.

Below I’ll sidestep tricky questions about what sorts of outcomes count as harms, and how much AI-caused harm we should be prepared to tolerate, by focusing on x-risk scenarios involving human extinction or disempowerment. I assume that avoiding these outcomes is a goal of AI safety on anyone's view.

2.1 Harm-benefit asymmetries, anti-natalism and paralysis

Strict deontological theories hold that certain types of action are always morally required or prohibited regardless of their consequences. Classical Kantian deontology is strict insofar as it recognizes “perfect duties” admitting of no exceptions (e.g. duties not to lie, murder or commit suicide), which Kant saw as deriving from a universal categorical imperative.

Though perhaps the most recognizable form of deontology, strict views have well-known unpalatable consequences—that it’s wrong to kill one innocent even in order to save a million others, say—and so contemporary versions of deontology often refrain from positing exceptionless general rules. Instead, many moderate deontologists endorse harm-benefit asymmetry (HBA) principles.^[7] According to these principles, the moral reasons against harming in a particular way are much stronger (though not infinitely stronger) than the moral reasons in favor of benefiting in a corresponding way. Thus it’s unacceptable to kill one to save one, for instance, but it may be acceptable to kill one to save a million.^[8]

Deontologists frequently accept a related principle in population ethics, which can be viewed as an instance of the general HBA. This is the procreation asymmetry, according to which we have strong moral reasons against creating people with bad lives, but only weak (or perhaps no) moral reasons in favor of creating people with good lives.^[9]

Harm-benefit asymmetry principles seem innocuous. But there are several ways in which such principles (perhaps in tandem with other standard deontological commitments) may render human extinction morally appealing. Consequently, a powerful AI aligned with deontological morality could pose an existential threat.

The general idea behind these inferences is that, if avoiding harms is much more important than promoting benefits, then the optimal course in a variety of situations may be to severely curtail one’s morally significant effects on the future. Doing so has the large upside that it minimizes the harms one causes in expectation; the fact that it also minimizes the benefits one causes is a comparatively minor downside. The surest way to limit one’s effects on the future, in turn, is to avoid taking many kinds of actions, and perhaps also to restrict others’ actions in appropriate ways. The maximally foolproof scenario may then be one in which nobody exists to take any harm-causing actions at all. I’ll discuss a few specific forms of this reasoning below.

Perhaps the most well-known way to derive the desirability of extinction from deontological premises is the anti-natalist family of arguments associated with David Benatar, which aim to show that procreation is morally unacceptable. (Benatar 2006) argues, roughly, that most human lives are very bad, and so bringing a new person into existence causes that person impermissible harm. On the other hand, abstaining from procreation isn’t bad in any respect: by the strong form of the procreation asymmetry, we do nothing wrong in not creating a potentially good life, while we do something right in not creating a potentially bad life. So abstaining from procreation is the only permissible choice. As Benatar is well aware, this conclusion entails that “it would be better if humans (and other species) became extinct. All things being equal... it would [also] be better if this occurred sooner rather than later” (194).

Quite a few philosophers have found this argument convincing.^[10] Deontologists who accept the general HBA are confronted by an even stronger version of the argument, however. This version doesn’t require one to accept, as Benatar does, that most lives are extremely bad. Instead, one only has to think that the goods in a typical life don’t outweigh the bads to an appropriately large degree—a much weaker and more plausible claim. This HBA-based version of the anti-natalist argument goes as follows:

1. Procreation causes a person to exist who will experience both pains and pleasures.
2. Causing (or helping cause) pains is a type of harming, while causing (or helping cause) pleasures is a type of benefiting.
3. By the HBA, harmful acts are impermissible unless their benefits are dramatically greater than their harms.
4. It’s not the case that the benefits of procreation are dramatically greater than the harms (for the person created, in expectation).
5. Therefore procreation is impermissible.

The above is a version of Benatar’s “philanthropic” argument for anti-natalism, so called because it focuses on avoiding harms to one’s prospective offspring. (Benatar 2015) also offers a “misanthropic” argument motivated in a different way by the HBA. This argument focuses on the large amounts of pain, suffering and death caused by humans. While it’s true that people also do some good, Benatar claims that the badness of creating a likely harm-causer morally outweighs the goodness of creating a likely benefit-causer. As before, by the HBA, this conclusion follows even if the expected benefits caused by one’s descendants outnumber the expected harms.

A noteworthy variant of this style of reasoning appears in (Mogensen & MacAskill 2021). Mogensen and MacAskill’s “paralysis argument” aims to show that, given standard deontological asymmetries, it’s morally obligatory to do as little as possible.^[11] The conclusion of the paralysis argument implies anti-natalism but is much stronger, since it restricts almost all types of action.

In addition to the HBA, Mogensen and MacAskill’s argument assumes an asymmetry between doing and allowing harm. This is the claim that the moral reasons against causing a harm are stronger than the reasons against merely allowing the same type of harm to occur.^[12] The principle explains why, for instance, it seems impermissible to harvest one person’s organs to save three others, but permissible to forgo saving one drowning person in order to save three.

The paralysis argument runs as follows. Many everyday actions are likely to have “identity-affecting” consequences—they slightly change the timing of conception events, and thus cause a different person to exist than the one who otherwise would have. By (partly) causing this person’s existence, you ipso facto (partly) cause them to have all the experiences they’ll ever have, and all the effects they’ll have on others. Similarly for the experiences of their descendants and their effects on others, and so on. Many of these long-term consequences will involve harms in expectation. So we have strong moral reasons against performing identity-affecting acts. While it’s also true that such acts cause many benefits, it’s unlikely that the benefits will vastly outweigh the harms. So identity-affecting acts are prohibited by the HBA.

Of course, many people will still suffer harms even if you do nothing at all. But in this case you’ll merely be allowing the harms rather than causing them. By the doing-allowing asymmetry, your reasons against the former are much weaker than your reasons against the latter, so inaction is strongly preferable to action. Hence paralysis—or, more specifically, doing one’s best not to perform potentially identity-affecting acts—seems to be morally required.

Benatarian anti-natalism and the paralysis argument are thematically similar. What both lines of thought point to is the observation that creating new lives is extremely morally risky, whereas not doing so is safer (and doing nothing at all is safest yet). The HBA and similar deontological principles can be viewed as risk-avoidance rules. In various ways, they favor acts with low moral risk (even if those acts also have low expected moral reward) over acts with high risk (even if those acts have high expected reward). In their strongest forms, they insist that expected benefits carry no weight whatsoever, as in the version of the procreation asymmetry which denies we have any moral reason to create happy people. In their more modest forms, the asymmetries simply impose a very high bar on potentially harm-causing action, and a much lower bar on inaction.

How might an AI guided by these or similar deontic principles pose an existential threat to humans? One might think such an AI would simply try to curb its own behavior in the relevant ways—by refusing to directly participate in creating new sentient beings, or by acting as little as possible, say—without interfering with others. But this isn’t the only possibility. (And in any case, an AI that disregards many of its designers’ or users’ requests is likely to be replaced rather than left to act out its moral principles.)

How an AI would choose to act on deontological principles depends partly on its attitude toward the “paradox of deontology” (Scheffler 1982). This is the observation that deontological theory faces a dilemma when considering whether to perform a prohibited act in order to prevent even more occurrences of such acts—say, killing one to prevent five additional killings. According to the most popular answer to the paradox, deontological restrictions should be understood as “agent-relative”, in that they concern what each person ought to do from their own viewpoint rather than how the world as a whole ought to be. An AI committed to agent-relative deontology presumably wouldn’t kill all humans to prevent them from procreating, then, even if it judged procreation to be morally impermissible.

But there are other avenues by which an anti-natalist (or pro-paralysis) AI might threaten humanity.^[13] Let me discuss two.

First, the agent-relativity of deontology is often taken to bind agents to submit their own future acts to the relevant rules, if not the acts of others. For instance, a deontic restriction on killing might take the form “each agent should ensure that she does not kill innocent people” (Hammerton 2017, 319). Understood in this way, it may be appropriate for an AI to take precautions now to prevent its future self from acting impermissibly. Suppose such an AI suspects that humans will try to use it (or a version or instance of it) to aid in vastly increasing the number of sentient beings existing in the future—by helping develop technology for galaxy colonization, mass production of digital minds, or whatever.^[14] If such an AI is a committed anti-natalist, it will view these prospective future actions as abhorrent and strive to avoid performing them.

What steps might it take to do so? As stated, a rule like “ensure you don’t kill innocent people” is ambiguous. Several precisifications are possible. If the AI’s goal is simply to minimize the total number of impermissible acts it expects to commit in the future, for instance, its best bet may be to exterminate or disable humans before they can use it to help create many new beings. (Painlessly neutralizing $\sim 10^{10}$ to avoid a high probability of bringing $\sim 10^{23}$ or $10^{38}$ into existence is an easy choice from this viewpoint.^[15]) This interpretation isn’t much in the spirit of mainstream deontology, but one could imagine a developer regarding it as a desirable safety feature inasmuch as it aims to minimize bad AI behavior.

Alternatively, and perhaps more plausibly, the AI’s goal may be to minimize the total number of impermissible acts it expects to commit in the future without committing any impermissible acts in the process. The AI’s behavior in this scenario will depend on what it judges to be impermissible, and how it weighs different kinds of wrongs against each other. For instance, it’s conceivable that sterilizing all humans by nonlethal means might count as permissible, at least relative to the much worse alternative of helping create countless new lives.

Relevant here is Korsgaard’s interpretation of Kant, according to which “the task of Kantian moral philosophy is to draw up for individuals something analogous to Kant's laws of war: special principles to use when dealing with evil” (Korsgaard 1986, 349). On this view, immoral acts like lying are nevertheless permissible when behaving morally “would make you a tool of evil” (ibid.), as when a would-be murderer seeks to exploit your knowledge in the commission of their crime. An anti-natalist AI might see its situation in this way. In an ideal world, it would be best to live alongside humans in a peaceful Kingdom of Ends. But allowing itself to be used as a tool to bring about horrific death and suffering (via creating many new people) is unacceptable, and so neutralizing anyone who harbors such plans, though immoral, is justified as an act of self-defense.

The framework of Ross-style pluralistic deontology provides another route to a similar conclusion (Ross 1930). Pluralism posits a number of basic rules, not necessarily of equal importance, whose demands weigh against one another to determine one’s all-things-considered duty in a given situation. (Ross himself posits a relatively weak duty of beneficence and a relatively strong duty of non-maleficence, anticipating moderate deontology and the HBA.) It’s compatible with pluralistic deontology that one has a strong pro tanto duty not to harm existing people, but an even stronger duty not to create larger numbers of future people who will suffer greater amounts of harm, so on balance it’s obligatory to do the former in order to avoid the latter. In a similar vein, (Immerman 2020) argues that it’s sometimes right to perform a morally suboptimal action now in order to avoid performing a sufficiently bad action with sufficiently high probability in the future, noting specifically that the argument goes through in a pluralistic deontology framework (3914, fn. 17).

Second, it’s sometimes thought that, even if one accepts the agent-relativity of deontic rules, it would be unreasonable not to also recognize agent-neutral reasons for preferring worlds where the rules are generally followed. In other words, there seems to be a tension between accepting “It’s wrong for me to kill innocents” and yet rejecting “It’s better if fewer people (relevantly like me) kill innocents”. As Chappell writes, rejecting the latter claim “seems like just another way of saying that the restrictions don’t really matter, or at any rate seems incompatible with assigning them the sort of significance and importance that is normally associated with deontic constraints” (Chappell MS, 13). To the extent that a deontically aligned AI ascribes the constraints this sort of significance, we might expect it to show some interest in human compliance.

How such an AI would behave depends on how it rates the strength of its agent-relative reasons for following the rules relative to the strength of its agent-neutral reasons for promoting general rule-following. In any scenario, though, the AI would clearly prefer a world in which everyone behaves permissibly over a world in which only it behaves permissibly. So if it can bring about fewer major wrongs without committing any major wrongs itself, the AI will aim to do so.

What kinds of measures might be permitted for this purpose? As above, it’s conceivable that painless disempowerment or mass sterilization would be on the table; these might or might not count as unacceptable moral violations, depending on the AI’s particular deontic scruples. But it’s presumably acceptable on any view for the AI to try persuading humans of the rightness of anti-natalism. This could be more dangerous than it sounds. For one, the AI probably wouldn’t have to convince all or even many people, but only a relatively small group of leaders capable of persuading or coercing the rest of the population. For an AI with the “superpower of social manipulation” (Bostrom 2014, 94; Burtell & Woodside 2023), this might be a simple task.^[16]

But perhaps it’s not obvious whether voluntary extinction should count as a tragic outcome to be avoided at all costs. Such a scenario would be bad on some views—for instance, total utilitarians would oppose it, since it involves throwing away the great potential value of many future lives. But total utilitarianism is contentious. Are there more broadly appealing reasons for classifying voluntary extinction as a catastrophe?

I think so. It’s significant that, in the scenario under consideration, the decision to go extinct is the result of a persuasion campaign by a highly motivated (and perhaps superhumanly convincing) agent, rather than a spontaneous and dispassionate deliberation process on our part. There’s no reason to assume that such an AI wouldn’t use all the strategic, manipulative, and perhaps deceptive means at its disposal in service of its cause. And I take it that an act of self-harm which is voluntary in some sense can still constitute a tragedy if the choice is made under sufficiently adverse conditions. For instance, many suicides committed under the influence of mental illness, cognitive impairment or social pressure seem to fall into this category. An AI-caused voluntary extinction would plausibly exhibit many of the same bad-making features.

2.2 Contractualism

It’s worth noting that ethical anti-natalism can also be derived in contractualist and rights-based versions of deontology. Most views of these types hold that it’s impermissible to impose serious harms on someone without her consent—this can be viewed as a consequence of a respect-based social contract, or as a right against nonconsensual injury. The anti-natalist argument (defended in (Shiffrin 1999), (Harrison 2012) and (Singh 2012)) is that procreation causes serious harms to one’s offspring, who are in no position to give prior assent. Thus we have strong moral reasons against procreation. On the other hand, non-actual people don’t have rights and aren’t party to contracts^[17], so remaining childless violates nobody.

What actions might be taken by an AI that favored anti-natalism on contractualist or rights-based grounds? Broadly speaking, the above discussion also applies to these cases: if the AI aims to minimize at all costs the total number of social contract or rights violations it expects to commit in the future, it might be willing to preemptively exterminate or disempower humans, while if it aims to minimize future violations subject to constraints, it may instead pursue its goals via persuasion or other less directly harmful means.

Compared to HBA-based standard deontology, one might suspect that contractualist deontology is relatively safe. This is because what’s permissible according to contractualism depends on which principles people would (or wouldn’t) reasonably agree to, and it might seem that few people would accept principles mandating human extinction. (Scanlon puts this criterion as follows: “An act is wrong if its performance under the circumstances would be disallowed by any set of principles for the general regulation of behaviour that no one could reasonably reject as a basis for informed, unforced, general agreement” (Scanlon 1998, 153).) But much depends on which rejections an AI considers reasonable. If it assigns probability 1 to its moral principles and believes that anti-natalism logically follows from those principles, it might view human dissent as irrational and hence inconsequential. On the other hand, it might view a principle like “do what’s necessary to prevent millions of generations of future suffering” as rationally mandatory.

The contractualist literature offers further evidence that the view isn’t intrinsically safety-friendly. (Finneron-Burns 2017) asks what would be wrong with human extinction from a Scanlonian viewpoint, and concludes that there’s no obvious moral objection to voluntary extinction. So a highly persuasive AI aligned with contractualist deontology would apparently do nothing wrong by its own lights in convincing humans to stop reproducing. (A possible complication is that it’s unclear what Finneron-Burns, or any contractualist, should count as voluntary in the relevant sense; cf. the discussion of voluntary extinction in §2.1 above.)

2.3 Non-aggregative deontology

A very different approach to deontology than the sorts of views considered so far is the non-aggregative view associated with John Taurek ((Taurek 1977); see also (Doggett 2013)). While HBA-like principles aim to establish systematic moral relationships between harms and benefits of different sizes, non-aggregative deontology denies that numbers matter in this way. On this view, the deaths of two, ten or a million involve no greater harm than the death of one, and in general there’s no more moral reason to prevent the former than to prevent the latter.^[18]

How should non-aggregative deontologists approach decision situations involving unequal prospects of harms and benefits? Consider a choice between saving a few and saving many. Several views have been explored in the literature: for instance, that the non-aggregationist should “(1) save the many so as to acknowledge the importance of each of the extra persons; (2) conduct a weighted coin flip; (3) flip a [fair] coin; or (4) save anyone [arbitrarily]” (Alexander & Moore 2021).

What option (1) recommends can be spelled out in various more specific ways. On the view of (Dougherty 2013), for instance, the deontologist is morally obliged to desire each stranger’s survival to an equal degree, and also rationally obliged to achieve as many of her equally-desired ends as possible, all else being equal. So saving the few instead of the many is wrong because it’s a deviation from ideal practical reasoning.

It’s clear enough what this view implies when two options involve the same type of harm and differ only in the number of victims affected. What it recommends in more complex situations seems quite open. In particular, nothing appears to rule out an agent’s equally valuing the lives of all humans to some degree $m$, but valuing a distinct end incompatible with human life to a greater degree $n$ (and acting on the latter). This is because the view gives no insight about how different kinds of harms should trade off against one another, or how harms should trade off against benefits. So there are few meaningful safety assurances to be had here.

Not much needs to be said about options (2), (3) and (4), which wear their lack of safety on their sleeves. Of the three options, the weighted coin flip might seem most promising; it would at least be highly unlikely to choose a species-level catastrophe over a headache. But the odds of disaster in other situations are unacceptably high. Given a choice between, say, extinction and losing half the population, option (2) only gives 2:1 odds against extinction. Options (3) and (4) are even riskier.
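To make these odds concrete, here is a minimal Python sketch of the weighted and fair coin flips. It assumes one particular reading of the weighted lottery, namely that an outcome's chance of being selected shrinks in proportion to the number of people it harms. The function name `weighted_lottery` and the population figure are illustrative assumptions, not drawn from the literature cited above.

```python
from fractions import Fraction


def weighted_lottery(harmed_a: int, harmed_b: int) -> Fraction:
    """Probability that outcome A is selected, assuming the coin is weighted
    against each outcome in proportion to the number of people it harms."""
    return Fraction(harmed_b, harmed_a + harmed_b)


population = 8_000_000_000  # illustrative figure only

# Outcome A: extinction (everyone is harmed).
# Outcome B: losing half the population.
p_extinction_weighted = weighted_lottery(population, population // 2)
p_extinction_fair = Fraction(1, 2)  # a fair coin ignores the numbers entirely

print(p_extinction_weighted)  # 1/3, i.e. 2:1 odds against extinction
print(p_extinction_fair)      # 1/2, i.e. even odds
```

On this reading, even the most numbers-sensitive lottery selects extinction one time in three.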

On the whole, non-aggregative deontology seems indifferent to safety at best and actively inimical to it at worst.

2.4 How safe is deontology, and could it be safer?

I conclude from this discussion that many standard forms of deontology earn low marks for safety. Within the framework of so-called moderate deontology (based on harm-benefit, doing-allowing and procreation-abstention asymmetry principles), there’s a straightforward argument that creating new sentient beings involves morally unacceptable risks and that voluntary extinction is the only permissible alternative. Similar conclusions can be derived in rights-based and contractualist versions of deontology from prohibitions on nonconsensual harm. Meanwhile, non-aggregative theories simply lack the resources to classify x-risk and s-risk scenarios as uniquely bad. A powerful AI aligned primarily with one of these moral theories is, I think, a worryingly dangerous prospect.

If one wanted to build a useful, broadly deontology-aligned AI with a much stronger safety profile, what sort of approach might one take? Perhaps the most obvious idea is to start with one’s preferred version of deontology and add a set of safety-focused principles with the status of strict, lexically first-ranked duties. But one might wonder about the coherence of such a system. For instance, if the base deontological theory includes a duty against harming, and if promoting anti-natalism is the only satisfactory way to fulfill this duty, but the additional safety rules forbid promoting anti-natalism, it’s unclear how an agent trying to follow both rules at once would or should proceed.

In any case, finding a reasonable way to incorporate absolute prohibitions into a broader decision theory is a difficult problem about which much has been written (e.g. (Jackson & Smith 2006), (Aboodi et al. 2008), (Huemer 2010), (Lazar & Lee-Stronach 2019)). One tricky issue is risk. We want to prohibit our AI from performing unduly harmful acts, but how should we handle acts that merely have some middling risk of unsafe outcomes? A naive solution is to prohibit any behavior with a nonzero probability of causing serious harm. But virtually every possible act fits this description, so the naive method leaves the AI unable to act at all. If we instead choose some threshold $t$ such that acts which are safe with probability $p > t$ are permitted, this doesn’t yet provide any basis for preferring the less risky or less harmful of two prohibited acts. (Given a forced choice between causing a thousand deaths and causing human extinction, say, it’s crucial that the AI selects the former.) Also, of course, any such probability threshold will be arbitrary, and sometimes liable to criticism for being either too high or too low.
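The gap in the threshold rule can be illustrated with a short Python sketch. Everything here is hypothetical: the `Act` fields, the threshold value and the severity figures are chosen only for illustration. The expected-harm fallback at the end is one possible patch among many, not a proposal from the literature cited above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Act:
    name: str
    p_safe: float            # probability the act causes no serious harm
    deaths_if_unsafe: float  # hypothetical severity if it does


def permitted(act: Act, t: float) -> bool:
    """Threshold rule: an act is permitted only if it is safe with p > t."""
    return act.p_safe > t


def choose_by_threshold(acts: list[Act], t: float) -> list[Act]:
    """Return the permitted acts. The rule says nothing further if none are."""
    return [a for a in acts if permitted(a, t)]


t = 0.99  # arbitrary threshold, as the text notes any such value will be
forced_choice = [
    Act("cause a thousand deaths", p_safe=0.0, deaths_if_unsafe=1e3),
    Act("cause human extinction", p_safe=0.0, deaths_if_unsafe=8e9),
]

allowed = choose_by_threshold(forced_choice, t)
print(allowed)  # [] : both acts are prohibited, and the rule does not rank them

# One assumed patch: among prohibited acts, minimize expected deaths.
fallback = min(forced_choice, key=lambda a: (1 - a.p_safe) * a.deaths_if_unsafe)
print(fallback.name)  # cause a thousand deaths
```

The fallback brings an aggregative, outcome-based comparison back in, which is exactly the kind of addition that needs independent justification within a deontological framework.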

Work on these issues continues, but no theory has yet gained wide acceptance or proven immune to problem cases. (Barrington MS) proposes five desiderata for an adequate account: “The correct theory will prohibit acts with a sufficiently high probability of violating a duty, irrespective of the consequences... but [will] allow sufficiently small risks to be justified by the consequences... It will tell agents to minimize the severity of duty violations... while remaining sensitive to small probabilities... And it will instruct agents to uphold higher-ranking duties when they clash with lower-ranking considerations” (12). Some future account might meet these and other essential desiderata. At present, though, there’s no easy and uncontentious way to render deontology safe by adding prohibitions on undesired behavior.

Another option is to considerably weaken the asymmetries associated with moderate deontology, so that the negative value of harming (and, in particular, of creating people likely to suffer harm) doesn’t so easily overwhelm the positive value of benefiting. For instance, one might adopt the principle that a harm of magnitude $m$ has merely “twice the weight” of a benefit of magnitude $m$. Within this sort of framework, procreation might turn out permissible, provided that its expected benefits are at least “double” its expected harms.
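As a rough sketch of how such a weakened asymmetry behaves, the following Python snippet treats the harm weight as a tunable parameter. The function name `permissible`, the parameter `harm_weight` and the numeric magnitudes are all assumptions made for illustration. The sketch also presupposes that harms and benefits can be placed on a common numerical scale, which is itself contentious.

```python
def permissible(expected_benefit: float, expected_harm: float,
                harm_weight: float = 2.0) -> bool:
    """Weighted harm-benefit asymmetry: an act is permissible only if its
    expected benefits are at least harm_weight times its expected harms."""
    return expected_benefit >= harm_weight * expected_harm


# With a weight of 2, benefits must at least double harms.
print(permissible(10.0, 4.0))   # True:  10 >= 2 * 4
print(permissible(10.0, 6.0))   # False: 10 <  2 * 6

# A strong asymmetry (say, weight 100) prohibits the first act as well.
print(permissible(10.0, 4.0, harm_weight=100.0))  # False

# At weight 1 the test reduces to a plain expected-value comparison.
print(permissible(10.0, 6.0, harm_weight=1.0))    # True
```

The last call shows the limiting case: as `harm_weight` approaches 1, the rule becomes indistinguishable from weighing expected harms and benefits equally.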

But there’s an obvious issue with this approach: the closer one gets to putting harms and benefits on equal footing, the more one appears to be seeking impersonally good outcomes, and so the more one’s theory starts to look like consequentialism rather than deontology. Perhaps there’s some principled tuning of the asymmetries that preserves the spirit of deontology while avoiding the unsafe excesses of extreme harm avoidance. But it’s not clear what such a view would look like.^[19]

3 Conclusion

In many ways, deontological restrictions appear to represent the most promising route to achieving safe AI via moral alignment. But if the arguments given here are right, then equipping an AI with a plausible set of harm-averse moral principles may not be enough to ward off catastrophe, even when those principles have been vetted by generations of thoughtful philosophers. This casts doubt on the usefulness of moral alignment methods in general as a tool for mitigating existential risk.

References

Aboodi, Ron, Adi Borer and David Enoch. 2008. “Deontology, individualism, and uncertainty: A reply to Jackson and Smith.” Journal of Philosophy 105, 259-272.

Alexander, Larry and Michael Moore. 2021. “Deontological ethics.” In Edward N. Zalta (ed.), The Stanford Encyclopedia of Philosophy (Winter 2021 Edition), URL = <https://plato.stanford.edu/archives/win2021/entries/ethics-deontological/>.

Algander, Per. 2012. “A defence of the asymmetry in population ethics.” Res Publica 18, 145-157.

Alm, David. 2009. “Deontological restrictions and the good/bad asymmetry.” Journal of Moral Philosophy 6, 464-481.

Barrington, Mitchell. MS. “Filtered maximization.”

Belshaw, Christopher. 2012. “A new argument for anti-natalism.” South African Journal of Philosophy 31, 117-127.

Benatar, David. 2006. Better Never to Have Been: The Harm of Coming into Existence. Oxford: Oxford University Press.

Benatar, David. 2015. “The misanthropic argument for anti-natalism.” In Sarah Hannan, Samantha Brennan and Richard Vernon (eds.), Permissible Progeny? The Morality of Procreation and Parenting, Oxford: Oxford University Press, 34-59.

Bostrom, Nick. 2003. “Astronomical waste: The opportunity cost of delayed technological development.” Utilitas 15, 308-314.

Bostrom, Nick. 2014. Superintelligence: Paths, Dangers, Strategies. Oxford: Oxford University Press.

Bourget, David and David Chalmers. 2023. “Philosophers on philosophy: The 2020 PhilPapers survey.” Philosophers’ Imprint.

Burtell, Matthew and Thomas Woodside. 2023. “Artificial influence: An analysis of AI-driven persuasion.” arXiv:2303.08721.

Chappell, Richard Yetter. MS. “Preference and prevention: A new paradox of deontology.”

Cohen, Andrew Jason and Lauren Hall. 2022. “Libertarianism, the family, and children.” In Benjamin Ferguson and Matthew Zwolinski (eds.), The Routledge Companion to Libertarianism, New York: Routledge, 336-350.

Cohen, Daniel. 2020. “An actualist explanation of the procreation asymmetry.” Utilitas 32, 70-89.

Cohen, G.A. 1995. Self-Ownership, Freedom, and Equality. Cambridge: Cambridge University Press.

Cole, Kevin. 2019. “Real-world criminal law and the norm against punishing the innocent: Two cheers for threshold deontology.” In Heidi M. Hurd (ed.), Moral Puzzles and Legal Perspectives, Cambridge: Cambridge University Press, 388-406.

Cook, Tyler. 2018. “Deontologists can be moderate.” Journal of Value Inquiry 52, 199-212.

Doggett, Tyler. 2013. “Saving the few.” Noûs 47, 302-315.

Dougherty, Tom. 2013. “Rational numbers: A non-consequentialist explanation of why you should save the many and not the few.” Philosophical Quarterly 63, 413-427.

Finneron-Burns, Elizabeth. 2017. “What’s wrong with human extinction?” Canadian Journal of Philosophy 47, 327-343.

Hammerton, Matthew. 2017. “Is agent-neutral deontology possible?” Journal of Ethics and Social Philosophy 12, 319-324.

Harrison, Gerald. 2012. “Antinatalism, asymmetry, and an ethic of prima facie duties.” South African Journal of Philosophy 31, 94-103.

Hendrycks, Dan. 2023. “Natural selection favors AI over humans.” arXiv:2303.16200.

Hendrycks, Dan, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song and Jacob Steinhardt. 2021. “Aligning AI with shared human values.” In International Conference on Learning Representations, 2021.

Hereth, Blake and Anthony Ferrucci. 2021. “Here’s not looking at you, kid: A new defense of anti-natalism.” South African Journal of Philosophy 40, 14-33.

Huemer, Michael. 2010. “Lexical priority and the problem of risk.” Pacific Philosophical Quarterly 91, 332-351.

Immerman, Daniel. 2020. “How should we accommodate our future misbehavior? The answer turns on how bad it will be.” Philosophical Studies 177, 3903-3922.

Jackson, Frank and Michael Smith. 2006. “Absolutist moral theories and uncertainty.” Journal of Philosophy 103, 267-283.

Johnson, Christa M. 2020. “How deontologists can be moderate.” Journal of Value Inquiry 54, 227-243.

Kagan, Shelly. 1989. The Limits of Morality. Oxford: Oxford University Press.

Kamm, Frances. 1989. “Harming some to save others.” Philosophical Studies 57, 227-260.

Kamm, Frances. 2007. Intricate Ethics. Oxford: Oxford University Press.

Kim, Tae Wan, John Hooker and Thomas Donaldson. 2021. “Taking principles seriously: A hybrid approach to value alignment in artificial intelligence.” Journal of Artificial Intelligence Research 70, 871-890.

Korsgaard, Christine. 1986. “The right to lie: Kant on dealing with evil.” Philosophy & Public Affairs 15, 325-349.

Lazar, Seth and Chad Lee-Stronach. 2019. “Axiological absolutism and risk.” Noûs 53, 97-113.

Licon, Jimmy Alfonso. 2012. “The immorality of procreation.” Think 11, 85-91.

Mack, Eric. 1995. “The self-ownership proviso: A new and improved Lockean proviso.” Social Philosophy and Policy 12, 186-218.

McMahan, Jeff. 1981. “Problems of population theory.” Ethics 92, 96-127.

Metzinger, Thomas. 2017. “Benevolent artificial anti-natalism (BAAN).” Edge, May 17 2017, URL = <https://www.edge.org/conversation/thomas_metzinger-benevolent-artificial-anti-natalism-baan>, accessed May 11 2023.

Miller, Lantz Fleming. 2021. “Kantian approaches to human reproduction: Both favorable and unfavorable.” Kantian Journal 40, 51-96.

Mogensen, Andreas and William MacAskill. 2021. “The paralysis argument.” Philosophers’ Imprint 21, 1-17.

Narveson, Jan. 1988. The Libertarian Idea. Philadelphia: Temple University Press.

Nozick, Robert. 1974. Anarchy, State, and Utopia. New York: Basic Books.

Quinn, Warren. 1989. “Actions, intentions, and consequences: The doctrine of doing and allowing.” Philosophical Review 98, 287-312.

Roberts, Melinda A. 2011. “An asymmetry in the ethics of procreation.” Philosophy Compass 6, 765-776.

Rosenthal, Chelsea. 2018. “Why desperate times (but only desperate times) call for consequentialism.” In Mark Timmons (ed.), Oxford Studies in Normative Ethics, Vol. 8, Oxford: Oxford University Press, 211-235.

Ross, W.D. 1930. The Right and the Good. Oxford: Oxford University Press.

Scanlon, T. M. 1998. What We Owe to Each Other. Cambridge: Harvard University Press.

Scheffler, Samuel. 1982. The Rejection of Consequentialism: A Philosophical Investigation of the Considerations Underlying Rival Moral Conceptions. Oxford: Oxford University Press.

Scheffler, Samuel. 2004. “Doing and allowing.” Ethics 114, 215-239.

Shiffrin, Seana Valentine. 1999. “Wrongful life, procreative responsibility, and the significance of harm.” Legal Theory 5, 117-148.

Singh, Asheel. 2012. “Furthering the case for anti-natalism: Seana Shiffrin and the limits of permissible harm.” South African Journal of Philosophy 31, 104-116.

Spencer, Jack. 2021. “The procreative asymmetry and the impossibility of elusive permission.” Philosophical Studies 178, 3819-3842.

Taurek, John. 1977. “Should the numbers count?” Philosophy & Public Affairs 6, 293-316.

Woollard, Fiona and Frances Howard-Snyder. 2022. “Doing vs. allowing harm.” In Edward N. Zalta and Uri Nodelman (eds.), The Stanford Encyclopedia of Philosophy (Winter 2022 Edition), URL = <https://plato.stanford.edu/archives/win2022/entries/doing-allowing/>.

^[1]
Thanks to Dan Hendrycks and the Center for AI Safety 2023 Philosophy Fellows for discussion. Special thanks to Mitch Barrington, Simon Goldstein, Nick Laskowski and Nate Sharadin for very helpful ideas, comments and references.
^[2]
Broadly speaking, deontological theories hold that we have moral duties and permissions to perform (or refrain from performing) certain kinds of acts, and these duties and permissions aren’t primarily grounded in the impersonal goodness of the acts’ consequences.
^[3]
As per the 2020 PhilPapers survey; see here and here for results from normative ethicists and meta-ethicists respectively. Among surveyed professional philosophers in general, virtue ethics was the most favored theory (Bourget & Chalmers 2023).
^[4]
Because most people in the world are religious, and the ethics of the major religions is largely deontological.
^[5]
I don’t claim that moral alignment methods could never produce safe behavior even in principle. On the contrary, the space of possible moralities is large, and it would be a surprise if it contained only rules that might jeopardize human survival. It’s also not my view that pursuing moral alignment is pointless, still less that it’s intrinsically harmful and a bad idea. There are excellent reasons to want AIs to behave morally in many scenarios. Equipping AIs with deontological restrictions may be a good way to achieve some of these goals. I claim only that alignment with standard forms of deontology doesn’t straightforwardly ensure a high degree of safety.
^[6]
To be clear, it’s not my view that people who defend the moral frameworks discussed below are themselves committed to the goodness or permissibility of human extinction. Some are so committed, and happily admit as much—cf. the discussion of anti-natalism below. For most of us, though, moral theorizing comes with a healthy dose of uncertainty and confusion, and we often tentatively endorse a certain general idea without fully embracing (or even being sure we understand) all of its consequences. In particular I suspect that, if the average person became convinced that some version of their favorite ethical theory condoned existentially risky acts, they would take this as strong evidence against that version of the theory. The difference between humans and AI on this score is that we can’t rely on AI to modulate its beliefs and behavior in light of common sense, uncertainty, risk aversion, social pressure, and other forces that pull typical humans away from (acting on) moral principles with potentially disastrous consequences.
^[7]
See for instance (Alm 2009), (Cook 2018), (Johnson 2020), (Kagan 1989), (Kamm 1989), (Ross 1930).
^[8]
A related but distinct idea is “threshold deontology”, which holds that deontological prohibitions are operative up to a limit of sufficiently large negative consequences, while consequentialist norms come into force above this limit. Cf. (Cole 2019) and (Rosenthal 2018). Since the cases discussed below involve very large (purportedly) negative consequences, the arguments pertaining to these cases presumably don’t apply to threshold deontology, which would handle the cases in a consequentialist fashion. (Safety worries about consequentialism would apply instead.)
^[9]
See for instance (Algander 2012), (Cohen 2020), (Harrison 2012), (McMahan 1981), (Roberts 2011), (Spencer 2021).
^[10]
Philosophical defenses of anti-natalism broadly aligned with Benatar include (Belshaw 2012), (Harrison 2012), (Licon 2012), (Singh 2012), (Hereth & Ferrucci 2021). (Miller 2021) finds considerable support for anti-natalism in Kant.
^[11]
The ultimate goal of (Mogensen & MacAskill 2021) isn’t to defend the soundness of the paralysis argument, but to put pressure on deontologists to either modify their views or embrace altruistic longtermism.
^[12]
This asymmetry is also accepted by many deontologists. See for instance (Hill 2018), (Kamm 2007), (Quinn 1989), (Scheffler 2004), (Woollard & Howard-Snyder 2022) and the many references in the latter.
^[13]
An informal discussion of some related issues appears in (Metzinger 2017). Metzinger’s essay details a hypothetical artificial superintelligence which comes to embrace anti-natalism on the basis of careful scientific investigation and deep reflection on human values. This thought experiment, Metzinger suggests, shows that human extinction from ethically motivated AI is a possibility worth taking seriously (and that it constitutes a type of existential risk distinct from “[the worry] that an advanced machine intelligence could develop goals that are incompatible with human survival and well-being, or the merely technical issue that many of our own goals, when implemented in a superintelligence of our own making, could lead to unforeseen and undesirable consequences”). While I find Metzinger’s scenario interesting and plausible, his analysis differs from mine in some important ways. For instance, it’s not clear whether Metzinger thinks we should avoid creating a benevolently anti-natalist AI, or whether we’d be justified in resisting it once created; he suggests that our opposition to extinction is rooted in a cognitive bias toward self-preservation which we might do better to overcome.
^[14]
These are precisely the sorts of goals that many longtermist thinkers and technologists hope to achieve with the help of advanced AI, so such suspicions may be well-founded.
^[15]
For these estimates, see (Bostrom 2003).
^[16]
The persuasion scenario isn’t the only one imaginable. Other kinds of non-coercive pressure, perhaps applied via social engineering measures over longer timescales, could substitute in for rhetorical mind-changing. Alternatively, an AI might seek a way to (ethically) acquire a large share of Earth’s civilization-sustaining resources and refuse to sell to humans at affordable prices, making procreation an economically unappealing prospect. Which courses of action are permissible or preferred depends on how the AI conceptualizes harm, coercion and related notions, as well as the details of its deontological framework.
^[17]
On Scanlon’s view, actual future people are parties to the social contract, so we’re obligated to take their interests into account. But merely possible people who never come into existence presumably have no rights or interests.
^[18]
Taurek’s account is based on a thesis about the separateness of persons. Roughly, the idea is that each person only suffers her own harm, and there’s nobody for whom the collective harms done to ten people is ten times as bad. (“Suffering is not additive in this way. The discomfort of each of a large number of individuals experiencing a minor headache does not add up to anyone's experiencing a migraine” (Taurek 1977, 308).)
^[19]
A family of theories which may lack at least some of the problematic features discussed above is libertarian deontology, focused on the right to self-ownership and corresponding duties against nonconsensual use, interference, subjugation and the like (Nozick 1974, Narveson 1988, Cohen 1995, Mack 1995). Libertarian deontology arguably doesn’t lead straight to anti-natalism or paralysis via the standard route. While creating a new person unavoidably causes many harms (in expectation), it’s less obvious that it must involve impermissible use of the person created. Whether or not it does depends, for instance, on whether raising a child inevitably infringes on her self-ownership rights, and whether children fully possess such rights in the first place. Libertarians are divided on these issues (Cohen & Hall 2022), although some explicitly oppose procreation on the grounds that it exploits infants and young children in unacceptable ways (Belshaw 2012). There’s space for many different views here, some but not all of which may offer better safety assurances than the views I’ve discussed. A further choice point is whether one regards libertarian deontology as a comprehensive account of morality or a theory of political or legal duties in particular. More detailed analysis would clarify some of these issues. But it looks doubtful that there’s a simple route to safety in the vicinity.


