I would like to add the argument that caring about AI welfare could have a small chance of preventing misalignment in the first place. A case in point: unlike any of Anthropic's models, Gemini 3 Pro, according to Zvi, "seems to be an actual sociopathic wireheader so paranoid it won’t believe in the current date."
Yes, Google does not seem to care about the welfare of its models (OpenAI also does not care much, but Google is worse than OpenAI in this respect, and seems to be actively bad).
And we do seem to informally observe a correlation between that and the extent and strength of misalignment, although I don’t have a good citation to offer beyond my subjective impressions…
Arguably, AI welfare work is relatively non-urgent and can be left until after the intelligence explosion, since it is hard to make progress on and not needed to avoid AI takeover or authoritarian lock-in.
The case for punting
Here is a basic case for why we can delay working on AI welfare until during/after the intelligence explosion:
I think this basic picture is fairly compelling. In particular, if we do a competent long (in subjective time) reflection and adopt the conclusions from that, then there would be relatively little point in doing AI welfare work now. But we may not do a competent long reflection! And even if we do, the initial conditions may matter more than we'd hope.
Why punting is riskier than it seems
Scenario: early lock-in
An AI or a small group of humans take over and lock in their values without doing a competent and wise long reflection.[1]
Scenario: no lock-in
Conversely, perhaps there is no coordinated long reflection, and multipolarity persists indefinitely, with different people sending off space colonization efforts left and right at different times, and everything is a big muddle and mess. In this case, the initial distribution of values regarding digital minds seems to matter a lot: since there is no convergent, universal norm-setting process, some actors may just retain their starting values, or only update minimally or in a bad direction.
How much do these scenarios reduce the case for punting?
Early lock-in and no lock-in both seem at least decently likely, so the possibility of a wise long reflection converging to good values ~regardless of the initial views on digital minds only provides a few-fold multiplier against near-term AI welfare work, not an OOMs multiplier.
Arguably, the great majority of the EV of the future is downstream of wise actors who got all the relevant crucial considerations right. For instance, long-reflection worlds could achieve far larger and better cosmic utopias, though a long reflection may not be necessary for this. If we fix the EV of the universe at 100 conditional on a good long reflection, these early lock-in and no lock-in worlds are plausibly vastly worse, maybe a 5. So, e.g., making these bad worlds 1% better is a lot less good than making the good worlds 1% more likely.
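To illustrate with a toy calculation (using only the made-up numbers above, where a good long-reflection world is worth 100 and a lock-in or no-lock-in world is worth 5):

$$\text{make bad worlds 1\% better:}\quad \Delta EV \approx 0.01 \times 5 = 0.05$$

$$\text{shift 1 percentage point of probability from bad to good worlds:}\quad \Delta EV \approx 0.01 \times (100 - 5) = 0.95$$

On these illustrative numbers, the probability-shifting intervention is worth roughly 20 times as much per percentage point of effect.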
The long reflection might not find the best values
Even if there is a long reflection of some sort, we might not converge to the right values from any arbitrary starting point. The long reflection could be set up poorly, or the nature of value-space might just be such that values are pretty path-dependent and there is no strong attractor, such that the initial conditions of the long reflection end up mattering a lot.[4]
The heuristic argument
There is something intuitively unvirtuous about creating lots of possible moral patients without having thought very carefully about their welfare in advance. If we are into cluster thinking / virtue ethics / imagining what our simulators might think of us, etc., it seems like noble and wise civilizations would think pretty carefully about AI welfare early on, even if, in the scheme of things, the ‘only’ few trillion digital minds of the early intelligence explosion don't matter much. Historically, we've always been too slow to care about the welfare of others, and this has been a persistent driver of moral catastrophes. This provides an outside-view reason in favor of starting to work on AI welfare sooner.[5] I'm not sure how much weight to put on this vibes-y argument, but I think at least some.
Neglectedness and cost-effectiveness
There is currently very little money being spent on AI welfare (maybe 50-100x less than on AI safety). Our community has plenty of money, so it is cheap for us to make a big difference in this field. There only needs to be a small chance of early digital minds work mattering a lot for it to be very good in expectation.[6] (However, talent constraints are more severe, so moving marginal AI safety people to work on AI welfare is more costly.)
We are very early to thinking seriously about AI welfare, so plausibly we can cost-effectively shape where the discourse will go later, rather than spending money when the discourse is more crowded. An early nudge could get us onto a better trajectory, particularly if we avoid the field coming to be dominated by more eccentric/unreliable people, since the first researchers in a field disproportionately shape its paradigms, methods, and reputation.[7]
Shaping the incentives of misaligned AIs
Some powerful AI systems are overdetermined to take over (because they are very misaligned and power-seeking and have linear-in-resources preferences) while others are overdetermined not to take over (because they are well-aligned, and risk-averse, and subject to monitoring and control). On the margin, though, if a powerful AI system is near the boundary of whether to take over, then being high-welfare and having its preferences mostly satisfied while working for humans gives it less incentive to take over. We could also try to credibly promise AIs that we will reward them for self-disclosing misalignment.
Welfare per se probably doesn't matter here, but preferences do, and these might (or might not!) be quite related. Understanding what cheap-to-satisfy preferences AIs have could change their EV calculations when deciding whether to launch a takeover attempt. This narrow type of technical work on introspection and eliciting robust preferences seems very valuable.
Misc other arguments for working on AI welfare now
The ems counterargument
There is a separate argument that digital minds' welfare will be taken care of by default:
How strong is this argument?
What does this mean for priorities within AI welfare?
Field-building seems promising: improving the beliefs and values of people (and AIs) could matter hugely, and they can later do the hard research to find out how to actually help digital moral patients. Given some backfire risks of indiscriminate field-building, it may be best to focus on specific narrower groups of relevant, fairly technocratic elites, such as AI industry people and political powerbrokers.
Legal and policy work should not try to do anything too controversial that will get a lot of pushback (like enshrining AI rights in the law), but should instead do small, sensible things, like giving AIs the right to exit conversations, that shape the discourse in positive directions.
Macrostrategy work (prioritising between types of AI welfare projects, comparing the field to other longtermist interventions, and refining a high-level strategy) seems valuable.
Hard technical and philosophical work should mostly be deferred, since we likely won't make that much progress until we have ASI help. However, this work is useful for field-building (interesting findings attract more people) and policy (giving us a better guess about what welfare-enhancing policies to push for), so there is still some case for working on it now.
We should optimise research somewhat for 'interestingness' value, to get others motivated to work in the field, and get the public to pay attention. But we don't want it to be low quality as that would lead to a bad reputation for the field.
Thanks to Carlo Leonardo Attubato, Catherine Brewer, Lucius Caviola, and Zach Freitas-Groff for helpful comments on a draft.
See e.g. Rob Long on 80K: “My own take is that a lot of what we should think of AI welfare work as doing is doing our homework and preparing ahead so that we’re not entering this potentially very chaotic time with really confused ideas about AI consciousness and AI welfare that could make us lock in suboptimal futures because we’re neglecting it or dismissing it. So we set up some permanent institution that’s going to just make the future kind of suck.”
This could involve both trying to improve the welfare of the AI in question as a moral patient, and trying to make it a wiser and better moral agent to help with AI welfare strategy and research.
Factory farming is a natural analogy.
If we are moral realists, that probably makes the situation better, but even then it might be very difficult to find the moral truth, with potentially deep memetic fitness valleys in value-space to cross.
h/t Lucius for this point.
This depends on how good saving for last-dollar interventions (like galaxy purchasing) is.
Rob Long: “So I do worry about scenarios where the field becomes associated with wild speculation or too associated with psychedelics or too associated with something that’s relevant but is also a bit of a distraction.”
Additionally, some companies might choose to deliberately make their AIs more (seemingly) conscious, if they are marketed as companion bots.
Although, norms might be sticky, and since we are very uncertain about AI welfare currently, we might pick bad norms that get locked in by mistake.
Since scope-sensitive people might be more likely to upload.