I would like to add the argument that caring about AI welfare could have a small chance of preventing misalignment in the first place. A case in point: unlike any of Anthropic's models, Gemini 3 Pro, according to Zvi, "seems to be an actual sociopathic wireheader so paranoid it won’t believe in the current date."
Yes, Google does not seem to care about the welfare of its models (OpenAI also does not care much, but Google is worse than OpenAI in this respect, and seems to be actively bad).
And we do seem to informally observe a correlation between that and the extent and strength of misalignment, although I don’t have a good citation to offer beyond my subjective impressions…
Arguably, AI welfare work is relatively non-urgent and can be left until after the intelligence explosion, since it is hard to make progress on and not needed to avoid AI takeover or authoritarian lock-in.
The case for punting
Here is a basic case for why we can delay working on AI welfare until during/after the intelligence explosion:
I think this basic picture is fairly compelling. In particular, if we do a competent long (in subjective time) reflection and adopt the conclusions from that, then there would be relatively little point in doing AI welfare work now. But we may not do a competent long reflection! And even if we do, the initial conditions may matter more than we'd hope.
Why punting is riskier than it seems
Scenario: early lock-in
An AI or a small group of humans take over and lock in their values without doing a competent and wise long reflection.[1]
Scenario: no lock-in
Conversely, perhaps there is no coordinated long reflection, and multipolarity persists indefinitely, with different people sending off space colonization efforts left and right at different times, and everything is a big muddle and mess. In this case, the initial distribution of values regarding digital minds seems to matter a lot: since there is no convergent, universal norm-setting process, some actors may just retain their starting values, or only update minimally or in a bad direction.
How much do these scenarios reduce the case for punting?
Early lock-in and no lock-in both seem at least decently likely, so the possibility of a wise long reflection converging to good values ~regardless of the initial views on digital minds only provides a few-fold multiplier against near-term AI welfare work, not an OOMs multiplier.
Arguably, the great majority of the EV of the future is downstream of wise actors who got all the relevant crucial considerations right. For instance, long-reflection worlds could achieve far larger and better cosmic utopias, though a long reflection may not be necessary for this. If we fix the EV of the universe at 100 conditional on a good long reflection, these early lock-in and no lock-in worlds are plausibly vastly worse, maybe a 5. So, e.g., making these bad worlds 1% better is a lot less good than making the good worlds 1% more likely.
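To illustrate with a toy calculation (using only the made-up numbers above, where a good long-reflection world is worth 100 and a lock-in or no-lock-in world is worth 5):

$$\text{make bad worlds 1\% better:}\quad \Delta EV \approx 0.01 \times 5 = 0.05$$

$$\text{shift 1 percentage point of probability from bad to good worlds:}\quad \Delta EV \approx 0.01 \times (100 - 5) = 0.95$$

On these illustrative numbers, the probability-shifting intervention is worth roughly 20 times as much per percentage point of effect.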
The long reflection might not find the best values
Even if there is a long reflection of some sort, we might not converge to the right values from any arbitrary starting point. The long reflection could be set up poorly, or the nature of value-space might just be such that values are pretty path-dependent and there is no strong attractor, such that the initial conditions of the long reflection end up mattering a lot.[4]
The heuristic argument
There is something intuitively unvirtuous about creating lots of possible moral patients without having thought very carefully about their welfare in advance. If we are into cluster thinking / virtue ethics / imagining what our simulators might think of us, etc., it seems like noble and wise civilizations would think pretty carefully about AI welfare early on, even if, in the scheme of things, the ‘only’ few trillion digital minds of the early intelligence explosion don't matter much. Historically, we've always been too slow to care about the welfare of others, and this has been a persistent driver of moral catastrophes. This provides an outside-view reason in favor of starting to work on AI welfare sooner.[5] I'm not sure how much weight to put on this vibes-y argument, but I think at least some.
Neglectedness and cost-effectiveness
There is currently very little money being spent on AI welfare (maybe 50-100x less than on AI safety). Our community has plenty of money, so it is cheap for us to make a big difference in this field. There only needs to be a small chance of early digital minds work mattering a lot for it to be very good in expectation.[6] (However, talent constraints are more severe, so moving marginal AI safety people to work on AI welfare is more costly.)
We are very early to thinking seriously about AI welfare, so plausibly we can cost-effectively shape where the discourse will go later, rather than spending money when the discourse is more crowded. An early nudge could get us onto a better trajectory, particularly if we avoid the field coming to be dominated by more eccentric/unreliable people, since the first researchers in a field disproportionately shape its paradigms, methods, and reputation.[7]
Shaping the incentives of misaligned AIs
Some powerful AI systems are overdetermined to take over (because they are very misaligned and power-seeking and have linear-in-resources preferences) while others are overdetermined not to take over (because they are well-aligned, and risk-averse, and subject to monitoring and control). On the margin, though, if a powerful AI system is near the boundary of whether to take over, then being high-welfare and having its preferences mostly satisfied while working for humans gives it less incentive to take over. We could also try to credibly promise AIs that we will reward them for self-disclosing misalignment.
Welfare per se probably doesn't matter here, but preferences do, and these might (or might not!) be quite related. Understanding what cheap-to-satisfy preferences AIs have could change their EV calculations when deciding whether to launch a takeover attempt. This narrow type of technical work on introspection and eliciting robust preferences seems very valuable.
Misc other arguments for working on AI welfare now
The ems counterargument
There is a separate argument that digital minds' welfare will be taken care of by default:
How strong is this argument?
What does this mean for priorities within AI welfare?
Field-building seems promising: improving the beliefs and values of people (and AIs) could matter hugely, and they can later do the hard research to find out how to actually help digital moral patients. Given some backfire risks of indiscriminate field-building, it may be best to focus on specific narrower groups of relevant, fairly technocratic elites, such as AI industry people and political powerbrokers.
Legal and policy work should not try to do anything too controversial that will get a lot of pushback (like enshrining AI rights in the law), but should instead do small, sensible things, like giving AIs the right to exit conversations, that shape the discourse in positive directions.
Macrostrategy work (prioritising between types of AI welfare projects, comparing the field to other longtermist interventions, and refining a high-level strategy) seems valuable.
Hard technical and philosophical work should mostly be deferred, since we likely won't make that much progress until we have ASI help. However, this work is useful for field-building (interesting findings attract more people) and policy (giving us a better guess about what welfare-enhancing policies to push for), so there is still some case for working on it now.
We should optimise research somewhat for 'interestingness' value, to get others motivated to work in the field, and get the public to pay attention. But we don't want it to be low quality as that would lead to a bad reputation for the field.
Thanks to Carlo Leonardo Attubato, Catherine Brewer, Lucius Caviola, and Zach Freitas-Groff for helpful comments on a draft.
See e.g. Rob Long on 80K: “My own take is that a lot of what we should think of AI welfare work as doing is doing our homework and preparing ahead so that we’re not entering this potentially very chaotic time with really confused ideas about AI consciousness and AI welfare that could make us lock in suboptimal futures because we’re neglecting it or dismissing it. So we set up some permanent institution that’s going to just make the future kind of suck.”
This could involve both trying to improve the welfare of the AI in question as a moral patient, and trying to make it a wiser and better moral agent to help with AI welfare strategy and research.
Factory farming is a natural analogy.
If we are moral realists, that probably makes the situation better, but even then it might be very difficult to find the moral truth, with potentially deep memetic fitness valleys in value-space to cross.
h/t Lucius for this point.
This depends on how good saving for last-dollar interventions (like galaxy purchasing) is.
Rob Long: “So I do worry about scenarios where the field becomes associated with wild speculation or too associated with psychedelics or too associated with something that’s relevant but is also a bit of a distraction.”
Additionally, some companies might choose to deliberately make their AIs more (seemingly) conscious, if they are marketed as companion bots.
Although, norms might be sticky, and since we are very uncertain about AI welfare currently, we might pick bad norms that get locked in by mistake.
Since scope-sensitive people might be more likely to upload.