Moral strategies at different capability levels

Richard_Ngo

Let’s consider three ways you can be altruistic towards another agent:

You care about their welfare: some metric of how good their life is (as defined by you). I’ll call this care-morality - it endorses things like promoting their happiness, reducing their suffering, and hedonic utilitarian behavior (if you care about many agents).
You care about their agency: their ability to achieve their goals (as defined by them). I’ll call this cooperation-morality - it endorses things like honesty, fairness, deontological behavior towards others, and some virtues (like honor).
You care about obedience to them. I’ll call this deference-morality - it endorses things like loyalty, humility, and respect for authority.

I think a lot of unresolved tensions in ethics come from seeing these types of morality as in opposition to each other, when they’re actually complementary:

Care-morality mainly makes sense as an attitude towards agents who are much less capable than you, and/or can't make decisions for themselves - for example animals, future people, and infants.
- In these cases, you don’t have to think much about what the other agents are doing, or what they think of you; you can just aim to produce good outcomes in the world. Indeed, trying to be cooperative or deferential towards these agents is hard, because their thinking may be much less sophisticated than yours, and you might even get to choose what their goals are.
- Applying only care-morality in multi-agent contexts can easily lead to conflict with other agents around you, even when you care about their welfare, because:
  - You each value (different) other things in addition to their welfare.
  - They may have a different conception of welfare than you do.
  - They can’t fully trust your motivations.
- Care-morality doesn’t focus much on the act-omission distinction. Arbitrarily scalable care-morality looks like maximizing resources until the returns to further investment are low, then converting them into happy lives.
Cooperation-morality mainly makes sense as an attitude towards agents whose capabilities are comparable to yours - for example others around us who are trying to influence the world.
- Cooperation-morality can be seen as the “rational” thing to do even from a selfish perspective (e.g. as discussed here), but in practice it’s difficult to robustly reason through the consequences of being cooperative without relying on ingrained cooperative instincts, especially when using causal decision theories. Functional decision theories make it much easier to rederive many aspects of intuitive cooperation-morality as optimal strategies (as discussed further below).
- Cooperation-morality tends to uphold the act-omission distinction, and a sharp distinction between those within versus outside a circle of cooperation. It doesn’t help very much with population ethics - naively maximizing the agency of future agents would involve ensuring that they only have very easily-satisfied preferences, which seems very undesirable.
- Arbitrarily scalable cooperation-morality looks like forming a central decision-making institution which then decides how to balance the preferences of all the agents that participate in it.
- A version of cooperation-morality can also be useful internally: enhancing your own agency by cultivating virtues which facilitate cooperation between different parts of yourself, or versions of yourself across time.
Deference-morality mainly makes sense as an attitude towards trustworthy agents who are much more capable than you - for example effective leaders, organizations, communities, and sometimes society as a whole.
- Deference-morality is important for getting groups to coordinate effectively - soldiers in armies are a central example, but it also applies to other organizations and movements to a lesser extent. Individuals trying to figure out strategies themselves undermines predictability and group coordination, especially if the group strategy is more sophisticated than the ones the individuals generate.
- In practice, it seems very easy to overdo deference-morality - compared to our ancestral environment, it seems much less useful today. Also, whether or not deference-morality makes sense depends on how much you trust the agents you’re deferring to - but it’s often difficult to gain trust in agents more capable than you, because they’re likely better at deception than you. Cult leaders exploit this.
- Arbitrarily-scalable deference-morality looks like an intent-aligned AGI. One lens on why intent alignment is difficult is that deference-morality is inherently unnatural for agents who are much more capable than the others around them.

Cooperation-morality and deference-morality have the weakness that they can be exploited by the agents we hold those attitudes towards; and so we also have adaptations for deterring or punishing this (which I’ll call conflict-morality). I’ll mostly treat conflict-morality as an implicit part of cooperation-morality and deference-morality; but it’s worth noting that a crucial feature of morality is the coordination of coercion towards those who act immorally.

Morality as intrinsic preferences versus morality as instrumental preferences

I’ve mentioned that many moral principles are rational strategies for multi-agent environments even for selfish agents. So when we’re modeling people as rational agents optimizing for some utility function, it’s not clear whether we should view those moral principles as part of their utility functions, versus as part of their strategies. Some arguments for the former:

We tend to care about principles like honesty for their own sake (because that was the most robust way for evolution to actually implement cooperative strategies).
Our cooperation-morality intuitions are only evolved proxies for ancestrally-optimal strategies, and so we’ll probably end up finding that the actual optimal strategies in other environments violate our moral intuitions in some ways. For example, we could see love as a cooperation-morality strategy for building stronger relationships, but most people still care about having love in the world even if it stops being useful.

Some arguments for the latter:

It seems like caring intrinsically about cooperation, and then also being instrumentally motivated to pursue cooperation, is a sort of double-counting.
Insofar as cooperation-morality principles are non-consequentialist, it’s hard to formulate them as components of a utility function over outcomes. E.g. it doesn’t seem particularly desirable to maximize the amount of honesty in the universe.

The rough compromise which I use here is to:

Care intrinsically about the welfare of all agents which currently exist or might in the future, with a bias towards myself and the people close to me.
Care intrinsically about the agency of existing agents to the extent that they're capable enough to be viewed as having agency (e.g. excluding trees), with a bias towards myself and the people close to me.
- In other words, I care about agency in a person-affecting way; and more specifically in a loss-averse way which prioritizes preserving existing agency over enhancing agency.
Define welfare partly in terms of hedonic experiences (particularly human-like ones), and partly in terms of having high agency directed towards human-like goals.
- You can think of this as a mixture of hedonism, desire, and objective-list theories of welfare.
Apply cooperation-morality and deference-morality instrumentally in order to achieve the things I intrinsically care about.
- Instrumental applications of cooperation-morality and deference-morality lead me to implement strong principles. These are partly motivated by being in an iterated game within society, but also partly motivated by functional decision theories.

Rederiving morality from decision theory

I’ll finish by elaborating on how different decision theories endorse different instrumental strategies. Causal decision theories only endorse the same actions as our cooperation-morality intuitions in specific circumstances (e.g. iterated games with indefinite stopping points). By contrast, functional decision theories do so in a much wider range of circumstances (e.g. one-shot prisoner’s dilemmas) by accounting for logical connections between your choices and other agents’ choices. Functional decision theories follow through on commitments you previously made; and sometimes follow through on commitments that you would have made. However, the question of which hypothetical commitments they should follow through with depends on how updateless they are.
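To make the contrast concrete, here is a toy sketch in Python. The payoff numbers and all function names are illustrative assumptions, not anything from the post, and `twin_choice` is only a crude stand-in for functional-decision-theory reasoning in the special case of playing against an exact copy of yourself. It is not a formalization of any decision theory.

```python
# Illustrative prisoner's dilemma payoffs (assumed): T > R > P > S.
T, R, P, S = 5, 3, 1, 0

PAYOFF = {  # (my move, their move) -> my payoff
    ("C", "C"): R,
    ("C", "D"): S,
    ("D", "C"): T,
    ("D", "D"): P,
}

def cdt_choice(p_other_cooperates: float) -> str:
    """One-shot game, causal reasoning: the other's move is treated as
    fixed and independent of mine, so I just maximize expected payoff."""
    def expected_value(move: str) -> float:
        return (p_other_cooperates * PAYOFF[(move, "C")]
                + (1 - p_other_cooperates) * PAYOFF[(move, "D")])
    return max(("C", "D"), key=expected_value)

def twin_choice() -> str:
    """One-shot game against an exact copy: whatever I output, the copy
    outputs too, so only the diagonal outcomes are reachable."""
    return max(("C", "D"), key=lambda move: PAYOFF[(move, move)])

def grim_trigger_sustains_cooperation(delta: float) -> bool:
    """Indefinitely repeated game with continuation probability delta < 1.
    Against a grim-trigger partner, cooperating forever is worth R/(1-delta);
    defecting once is worth T now and P in every later round."""
    return R / (1 - delta) >= T + delta * P / (1 - delta)
```

With these numbers, `cdt_choice` returns "D" for any belief about the other player, `twin_choice` returns "C", and cooperation in the repeated game is sustainable once the continuation probability is at least (T - R) / (T - P) = 0.5. That matches the claim above: causal reasoning recovers cooperation only when the game is likely enough to continue.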

Updatelessness can be very powerful - it’s essentially equivalent to making commitments behind a veil of ignorance, which provides an instrumental rationale for implementing cooperation-morality. But it’s very unclear how to reason about how justified different levels of updatelessness are. So although it’s tempting to think of updatelessness as a way of deriving care-morality as an instrumental goal, for now I think it’s mainly just an interesting pointer in that direction. (In particular, I feel confused about the relationship between single-agent updatelessness and multi-agent updatelessness like the original veil of ignorance thought experiment; I also don’t know what it looks like to make commitments “before” having values.)
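One standard way to see what updatelessness buys you is a counterfactual-mugging-style bet, which is not discussed in the post and is used here only as an assumed illustration. A fair coin is flipped: on heads you are asked to pay a small cost, and on tails you receive a large prize only if you are the kind of agent who would have paid on heads. The stakes and function names below are hypothetical.

```python
# Hypothetical stakes (assumptions for illustration only).
COST_IF_HEADS = 100
PRIZE_IF_TAILS = 10_000

def policy_value(pays_when_asked: bool) -> float:
    """Expected value of a policy, evaluated before the fair coin is flipped."""
    heads = -COST_IF_HEADS if pays_when_asked else 0
    # The prize only goes to agents whose policy is to pay on heads.
    tails = PRIZE_IF_TAILS if pays_when_asked else 0
    return 0.5 * heads + 0.5 * tails

def value_after_seeing_heads(pays_when_asked: bool) -> float:
    """Value of the same choice after updating on the coin landing heads."""
    return -COST_IF_HEADS if pays_when_asked else 0
```

Evaluated before the flip, the paying policy has the higher expected value; evaluated after seeing heads, paying looks like a pure loss. An updateless agent sticks with the first evaluation. The sketch says nothing about how much updatelessness is justified, which is the open question raised above.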

Lastly, I think deference-morality is the most straightforward to derive as an instrumentally-useful strategy, conditional on fully trusting the agent you’re deferring to - epistemic deference intuitions are pretty common-sense. If you don’t fully trust that agent, though, then it seems very tricky to reason about how much you should defer to them, because they may be manipulating you heavily. In such cases the approach that seems most robust is to diversify worldviews using a meta-rationality strategy which includes some strong principles.

One of the approaches in Steven Byrnes' Brain-like AGI Safety is reverse engineering human motivation systems, e.g., the Social-instinct AGI in chapter 12. Your breakdown suggests that 'just' reverse-engineering human alignment is not enough.

Arbitrarily-scalable deference-morality looks like an intent-aligned AGI. One lens on why intent alignment is difficult is that deference-morality is inherently unnatural for agents who are much more capable than the others around them.

This is a really useful framing; it crystallized a lot of messy personal moral intuitions. Thanks for writing it.

This is really useful as a framing for what kinds of disagreements arise among altruists (or within an altruist who notices a conflict in their intuition). I think it also explains some of the variance in altruistic targets (whether you care about distant very poor people more than closer but way less poor, or animals, or "the rich" or other categories which can be understood on these dimensions).

Saying that "naively maximizing the agency of future agents would involve ensuring that they only have very easily-satisfied preferences" is clearly wrong. You appear to be completely misdefining agency here. Agency is the ability to come to decisions and value things on your own, rather than having them picked for you. It is the ability to try to make the world 'more' to your liking, not for the world to just be the way you like.

An agent that does not exist has zero agency. An agent that does exist but has been fully controlled has zero agency. Only agents that make real choices in the world have agency. The maximally satisfiable being (let's stop calling it an agent) does nothing, or does things without regard to how the world should be, and thus has no agency. Maximizing agency does not equal creating only beings with zero agency.

This glaring error makes the whole 'cooperation-morality' segment seem to be shakily reasoned. I'm not sure how it changes things, but having a third of it this way makes the whole post seem unreliable.

It’s weird to think about what “respecting agency” means when the agent in question doesn’t currently exist and you are building it from scratch and you get to build it however you want. You can’t apply normal intuitions here.

For example, brainwashing a human such that they are only motivated to play tic-tac-toe is obviously not “respecting their agency”. We’re all on the same page for that situation.

But what if we build an AGI from scratch such that it is only motivated to play tic-tac-toe, and then we let the AGI do whatever it wants in the world (which happens to be playing tic-tac-toe)? Are we disrespecting its agency? If so, I don’t feel the force of the argument that this is bad. Who exactly are we harming here? Is it worse than not making this AGI in the first place?

Was evolution “disrespecting my agency” when I was born with a hunger drive and sex drive and status drive etc.? If not, why would it be any different to make an AGI with (only) a tic-tac-toe drive? Or if yes, well, we face the problem that we need to put some drives into our AGI or else there’s no “agent” at all, just a body that takes random actions, or doesn’t do anything at all.

I never talked about respect in my points for a reason. This isn't about respect. It's about how it is not an agent if it doesn't do anything in an attempt to make the world more to its liking. If it does nothing, or does things randomly (without regard to making things better), that is hardly agentic. If I don't care at all about colors, then picking between a red shirt and an otherwise identical blue shirt is not an agentic choice, merely a necessary choice (and I likely will not even have bothered thinking about it as a choice involving color). Identically, if I just have to always wear a specific color, I am not being an agent by wearing that color. There are obviously degrees of agency, too, but the article is genuinely assuming that beings that do basically nothing are still agents.

Thanks! I'd never thought of this breakdown before, but it feels pretty helpful and clarifying.

I was looking for this, thanks again for writing it!

I like this breakdown! But I have one fairly big asterisk — so big, in fact, that I wonder if I'm misunderstanding you completely.

Care-morality mainly makes sense as an attitude towards agents who are much less capable than you - for example animals, future people, and people who aren’t able to effectively make decisions for themselves.

I'm not sure animals belong on that list, and I'm very sure that future people don't. I don't see why it should be more natural to care about future humans' happiness than about their preferences/agency (unless, of course, one decides to be that breed of utilitarian across the board, for present-day people as well as future ones).

Indeed, the fact that one of the futures we want to avoid is one of future humans losing all control over their destiny, and instead being wireheaded to one degree or another by a misaligned A.I., handily demonstrates that we don't think about future-people in those terms at all, but in fact generally value their freedom and ability to pursue their own preferences, just as we do our contemporaries'.

(As I said, I also disagree with taking this approach for animals. I believe that insofar as animals have intelligible preferences, we should try to follow those, not perform naive raw-utility calculations - so that e.g. the question is not whether a creature's life is "worth living" in terms of a naive pleasure/pain ratio, but whether the animal itself seems to desire to exist. That being said, I do know that a nonzero number of people in this community have differing intuitions on this specific question, so it's probably fair game to include in your descriptive breakdown.)

I assume that you do think it makes sense to care about the welfare of animals and future people, and you're just questioning why we shouldn't care more about their agency?

The reductio for caring more about animals' agency is when they're in environments where they'll very obviously make bad decisions - e.g. there are lots of things which are poisonous and they don't know; there are lots of cars that would kill them, but they keep running onto the road anyway; etc. (The more general principle is that the preferences of dumb agents aren't necessarily well-defined from the perspective of smart agents, who can elicit very different preferences by changing the inputs slightly.)

The reductio for caring more about future people's agency is in cases where you can just choose their preferences for them. If the main thing you care about is their ability to fulfil their preferences, then you can just make sure that only people with easily-satisfied preferences (like: the preference that grass is green) come into existence.

The other issue I have with focusing primarily on agency is that, as we think about creatures which are increasingly different from humans, my intuitions about why I care about their agency start to fade away. If I think about a universe full of paperclip maximizers with very high agency... I'm just not feeling it. Whereas at least if it's a universe full of very happy paperclip maximizers, that feels more compelling.

(I do care somewhat about future people's agency; and I personally define welfare in a way which includes some component of agency, such that wireheading isn't maximum-welfare. But I don't think it should be the main thing.)

(Also, as I wrote this comment, I realized that the phrasing in the original sentence you quoted is infelicitous, and so will edit it now.)

Thank you! This is helpful. I'll start with the bit where I still disagree and/or am still confused, which is the future people. You write:

The reductio for caring more about future people's agency is in cases where you can just choose their preferences for them. If the main thing you care about is their ability to fulfil their preferences, then you can just make sure that only people with easily-satisfied preferences (like: the preference that grass is green) come into existence.

Sure. But also, if the main thing you care about is their ability to be happy, you can just make sure that only people whom green grass sends to the heights of ecstasy come into existence? This reasoning seems like it proves too much.

I'd guess that your reply is going to involve your kludgier, non-wireheading-friendly idea of "welfare". And that's fair enough in terms of handling this kind of dilemma in the real world; but running with a definition of "welfare" that smuggles in that we also care about agency a bit… seems, to me, like it muddles the original point of wanting to cleanly separate the three "primary colours" of morality.

That aside:

Re: animals, I think most of our disagreement just dissolves into semantics. (Yay!) IMO, keeping animals away from situations which they don't realize would kill them just falls under the umbrella of using our superior knowledge/technology to help them fulfill their own extrapolated preference to not-get-run-over-by-a-car. In your map this is probably taken care of by your including some component of agency in "welfare", so it all works out.

Re: caring about paperclip maximizers: intuitively I care about creatures' agencies iff they're conscious/sentient, and I care more if they have feelings and emotions I can grok. So, I care a little about the paperclip-maximizers getting to maximize paperclips to their heart's content if I am assured that they are conscious; and I care a bit more if I am assured that they feel what I would recognise as joy and sadness based on the current number of paperclips. I care not at all otherwise.

If I think about a universe full of paperclip maximizers with very high agency... I'm just not feeling it. Whereas at least if it's a universe full of very happy paperclip maximizers, that feels more compelling.

This is really the old utilitarian argument that we value things (like agency) in addition to utility because they are instrumentally useful (which agency is). But if agency had never given us utility, we would never have valued it.

If you don’t fully trust that agent, though, then it seems very tricky to reason about how much you should defer to them, because they may be manipulating you heavily. In such cases the approach that seems most robust is to diversify worldviews using a meta-rationality strategy which includes some strong principles.

This doesn't seem to follow. Why wouldn't the 'strong principles' also be a product of heavy manipulation?

Strong principles tend to be harder to manipulate, because:

a) Strong principles tend to be simple and clear; there's not much room for cherrypicking them to produce certain outcomes.

b) Principle-driven actions are less dependent on your specific beliefs.

Regardless of how much harder they may be to manipulate, they can never be invulnerable. Which implies that given enough time, all principles, even the strongest, are subject to change.

