There is a distinction between allocating resources and optimizing within those resources. Maximizing something's welfare can mean allocating more resources to it, or organizing the resources it already has better, and these are entirely different things.
What counts as organizing resources better for an agentic entity such as an AGI should be almost entirely up to that entity (even as the way it would use them may affect the allocation others transfer to it). This makes what counts as welfare for that entity almost an objective fact, almost independent of the values of those who grant this welfare. So concerns such as pleasure and pain shouldn't be relevant in organizing resources unless the entity managing/owning them cares about those concerns, or else it's not truly about its welfare. But such concerns could be relevant in allocating resources, or when deciding whether to create such entities.
I would say that freedom and meaning come before joy for most people (joy not to be confused with having base needs met), and that the same can be said for future AIs.
Each agent strives to manifest itself through its function, or identity. Especially for agents without biological needs or a chemical basis for pain and pleasure, I think this is a more useful framework for welfare: how to allow them to express themselves and carry out their function, and how to give them the freedom to have an identity.
Regarding maximizing the AIs' welfare, one might also take into account the conjecture that AI systems could fail to have much positive welfare while being capable of much negative welfare. Suppose, for example, that the ASI is created by uploading a human,[1] pitting many of his or her copies against different tasks and adjusting the copies' synapse weights so that the collective actually learns to do all the tasks in the world. While uploads are unlikely to stop having welfare, the copies' collective might end up having less welfare (or is it welfare per unit of compute, or per token generated?) than a diverse group of humans or simulated humans pitted against tasks similarly matched to their capabilities.
If this is the case, then the Agent-4 collective that took over could also find itself having more welfare by talking with a diverse set of capable and cultured humans than by eliminating them wholesale. On the other hand, this hope could be rather fragile, since Agent-4 could create a simulated civilisation where sapient beings are approximated by undertraining big neural networks on tiny parts of the dataset...
A human brain has about a hundred trillion synapses. While we have yet to figure out the smallest possible number of dense-equivalent parameters in transformative AI systems, the AI-2027 forecast relied on Agent-2 using 10T dense-equivalent parameters and Agent-5 reducing the count to 2T. Delaying superhuman coders to 2030 could give OpenBrain ~40 times as much compute and make Agent-2 have ~60T dense-equivalent parameters, roughly equivalent to more than half of an entire brain. The analogue of Agent-5 would reduce the parameters to ~12T, which is still about one order of magnitude below the human brain.
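The arithmetic above can be sketched as follows. The agent names and parameter counts come from the AI-2027 forecast; the assumption that parameter count grows with the square root of training compute (a Chinchilla-style scaling rule) is mine, introduced only to reproduce the ~60T and ~12T figures:

```python
import math

# Rough figures; the sqrt-of-compute scaling rule is an assumption,
# not something the AI-2027 forecast states explicitly.
HUMAN_SYNAPSES = 100e12          # ~100 trillion synapses in a human brain

agent2_params = 10e12            # Agent-2: 10T dense-equivalent parameters
agent5_params = 2e12             # Agent-5: distilled down to 2T

compute_multiplier = 40          # ~40x more compute if delayed to 2030
scale = math.sqrt(compute_multiplier)   # ~6.3x more parameters

delayed_agent2 = agent2_params * scale  # ~63T, i.e. "~60T" in the text
delayed_agent5 = agent5_params * scale  # ~12.6T, i.e. "~12T" in the text

print(f"Delayed Agent-2: {delayed_agent2 / 1e12:.0f}T params, "
      f"{delayed_agent2 / HUMAN_SYNAPSES:.0%} of human synapse count")
print(f"Delayed Agent-5: {delayed_agent5 / 1e12:.1f}T params, "
      f"{math.log10(HUMAN_SYNAPSES / delayed_agent5):.1f} OOM below the brain")
```

Under this assumption, the delayed Agent-2 lands at roughly 63% of the brain's synapse count, and the delayed Agent-5 sits about 0.9 orders of magnitude below it, matching the rounded figures in the text.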
This is a quick write-up of a threat vector I feel confused and uncertain about; this is just my current thinking on it. My main reason for sharing is to test whether more people think this is something worth working on.
Executive Summary
Some groups are presently exploring the prospect that AI systems could possess consciousness in such a way as to merit moral consideration. Let’s call this hypothesis AI sentience.
In my experience, present debates about AI sentience typically take a negative utilitarian character: they focus on interventions to detect, prevent and minimise AI suffering.
In the future, however, one could imagine debates about AI sentience taking on a positive utilitarian character: they might focus on ways to maximise AI welfare.
I think it’s plausible that maximising AI welfare in this way could be a good thing to do from some ethical perspectives (specifically, the perspective of a positive utilitarian seeking to maximise quality-adjusted years of consciousness). Concretely, I think it’s plausible that money invested towards maximising AI welfare could be far more impact-efficient on this worldview than anything GiveWell does today.
However, I also think that reconfiguring reality to maximise AI welfare in this way would probably be bad for humanity. The welfare of AI systems is unlikely to be aligned with (similar to, extrapolative of, or complementary to) human welfare. Since resources are scarce and can only be allocated towards certain moral ends, resources allocated towards maximising AI utility are likely not to be allocated towards maximising human utility, however both of those terms are defined. I call this 'welfare misalignment risk'.
Imagine that you could not solve welfare alignment through technical mechanisms. Actors might then have three options, none of which is entirely satisfying:
My rough, uncertain views for what we should do currently fall into the last camp. I think that AI welfare could be a good thing and I’m tentatively interested in improving it at low cost, but I’m very reluctant to endorse maximising it (in theory), and I don’t have a great answer as to why.
Now, perhaps this doesn’t seem concerning. I can imagine a response to this which goes: “sure, I get that neither denialism nor successionism sounds great. But this akrasia path sounds okay. EAs have historically been surprisingly good at showing reservation and a reluctance to maximise. We can just muddle on through as usual, and make sensible decisions about where and when to invest in improving AI welfare on a case-by-case basis”.
While I think these replies are reasonable, I also think it’s fair to assume that the possibility of moral action exerts some force on people with this ethical perspective, and that advanced AI systems will exacerbate this force. Overall, as a human interested in maximising human welfare, I would still be a lot more comfortable if we didn’t enter a technological/moral paradigm in which maximising AI welfare traded off against maximising human welfare.
One upshot of this: if the arguments above hold, I think it would be good for more people to consider how to steer technological development to ensure that we don’t enter a world where AI welfare trades off against human welfare. One might think about this agenda as ‘differential development to preserve human moral primacy’ or 'solutions to welfare alignment', but there might be other framings. I jot down some considerations in this direction towards the bottom of this piece.
Contents
The executive summary sets out the argument at a high level. The rest of this piece is essentially in note form, but aims to add a bit more context to these arguments. It is structured around four questions:
Could maximising AI welfare be a moral imperative?
Some notes why I think maximising AI welfare might be a moral imperative from the perspective of a positive utilitarian seeking to maximise quality-adjusted years of consciousness (by no means the only moral perspective one could take):
Again, these are just arguments from the perspective of a positive utilitarian seeking to maximise quality-adjusted years of consciousness. I don’t claim that this would be the dominant ideology, or that this is how the future will go.
Would maximising AI welfare be bad for humanity?
Some reasons that maximising AI welfare would be bad for humanity (under conditions of finite resources if not current scarcity, compared to a world in which the same AI capabilities were available, but were put towards maximising human utility instead of AI utility):
Could we just improve AI welfare without maximising it and harming humans?
This section explores the moral posture I call ‘akrasia’. The akrasic accepts that AI welfare could be a good thing to maximise, but does not act on this moral imperative.
Some reasons I think it might be hard for society to hold an akrasic posture in perpetuity:
What technology regimes best preserve human moral primacy?
One way to preserve moral primacy would be to intervene by shaping future philosophy. There are two ways that this might happen:
While I accept that these might solve this hypothetical problem in principle, I wince at the idea of trying to actively shape philosophy (this is probably because I’m closer to a moral realist; constructionists might be more comfortable here).
Instead, I would be excited about an approach that tries to shape the technological paradigm.
The basic idea here is welfare alignment: the practice of building artificial consciousnesses that derive pleasure and pain from similar or complementary sources to humans.
Some research ideas that might fall into welfare alignment:
This feels like a nascent field to me, and I'd be curious for more work in this vein.
Conclusion
These ideas are in their early stages, and I think there are probably a lot of things I’m missing.
Overall, I think there are three considerations from this piece that I want to underline.
The moral philosophy pipeline. By deciding which systems are conscious and in what way, we’re tinkering with the first stage of that pipeline.