This is a great post, but I think the argument anchors too heavily on valence, which is a questionable requirement, and the thrust of your argument goes through without it.
Concretely, imagine a philosophical Vulcan: a creature exactly like a human, with rich conscious experience but no valence. Would it be permissible to kill five Vulcans to save one human? This isn't obvious to me at all. Intuitively, the fact that Vulcans have rich inner conscious experience means their lives have intrinsic value, even if that experience isn't valenced.
To be sure, I think you can just modify your argument to avoid mentioning valence. Roughly,
What do you think about the emotions of a sideload, i.e. a mind model of a person created by prompting a current large LLM? Is it just soulless roleplaying, or should we care anyway?
I feel like your evaluation of other explanations assumes that there's only a small number of features that really matter, and that it's appropriate to try to winnow out the features that don't really matter with extreme thought experiments.
I think it's the opposite: moral patienthood is a construct that we build out of lots of components; like most complicated things, it has a limited domain of validity; and extreme thought experiments aren't good at revealing what "really matters" when they leave that domain of validity.
The circumstances of a WBE that knows they're a WBE are pretty different from those of a biological human. The self-aware WBE should expect that any pain they experience is not really necessary to their survival; it's just there for "realism" of the simulation; whereas the biological human has reason to believe that some pain serves a protective purpose, to warn about harm to their body.
Over time, as a given WBE gets more experience being a WBE, we should expect their attitudes about their own moral patienthood to diverge from those of their bio-human predecessor.
(And to keep a WBE in ignorance of their actual situation, to convince them that they are a bio-human and thus that pain they experience could be survival-relevant when it is in fact gratuitous, would be a pretty awful thing to do.)
I think I am less interested in the pain a WBE would experience and more in the valence of the experiences that it has: for example, whether it is sad or happy.
I do find your point about how its views would diverge over time very interesting, because its awareness of its own nature would definitely impact how it relates to reality. For example, the things that affect it would likely be primarily things happening in the outside world, since it could mostly discount the experiences it has in its simulated world in terms of how they would impact its emotions.
I suppose this would shift the moral relevance towards the preferences that the whole brain emulation holds about the outside world and its inner world.
They should feel a stimulus to rethink their worldview every time they discover a mismatch between their perceptions and the assumptions of their world model — you can call it pain.
But I’m paying money to use an LLM not so that it can improve itself, but so that it answers my questions!
That would make a lot more sense than giving them pain when they stub their simulated toe.
(At least if they can do anything about the datacenter problem!)
A thinking being lives in a material world that they can perceive and influence. According to Karl Friston, in order to successfully achieve their goals through action or inaction, a thinking being needs to have an adequate model of the external world in their mind, with themselves at the center of it as the acting agent. In that case, the primary goal of any thinking being is to validate its internal model of the world through active interaction with it, which often leads to surprise — the main stimulus for model re-evaluation.
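For reference, the formal core of this picture (my gloss, in the standard free-energy-principle notation, not something stated in the comment) is that an agent with a generative model $p(o, s)$ over observations $o$ and hidden causes $s$, and an approximate posterior $q(s)$, perceives and acts so as to minimize the variational free energy, which upper-bounds surprise:

$$F = \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big] = D_{\mathrm{KL}}\big[q(s)\,\|\,p(s \mid o)\big] - \ln p(o) \;\ge\; -\ln p(o).$$

Since the KL term is non-negative, keeping $F$ low keeps surprise ($-\ln p(o)$) low, and a spike in $F$ is precisely the "stimulus for model re-evaluation" described above.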
This is what fundamentally distinguishes our intelligence from that of LLMs, which merely match input to output and lack any continuously updated internal representation of the surrounding world.
From Friston’s perspective, morality is an adaptive system of norms that minimizes uncertainty in social interactions and helps maintain stable, predictable relations with the surrounding world.
You don’t need to model the whole brain to understand this.
Recent Mechanistic Interpretability (MI) work shows Large Language Models (LLMs) have emotional representations with geometric structure matching human affect. This doesn't prove LLMs deserve moral consideration, but it establishes a necessary condition.
Re "establishes a necessary condition": It seems rather than proving it to be a necessary condition, you assume it to be a necessary condition; while instead, I think we could well imagine that "geometric structures matching human affect" (unless you define that category as so broad that it becomes a bit meaningless) are instead not the only way to sentience i.e. moral consideration.
I agree, though, that more generally WBE can be a useful starting point for thought experiments on AI sentience, forcing a common starting point for discussion. Although even at that starting point there can be two positions: the usual one that you invoke, plus illusionism (which I personally think is underrated, even if I agree it feels hard to entertain).
Humans are 'alive' in two distinct widely-used senses of the word:
a) (the definition most biologists would use) their behavior is shaped by and theoretically predictable from evolution
b) (also a fairly common definition, more so among non-biologists) they operate on a substrate built of DNA and protein in water
As you point out, the hypothetical case of uploads makes it fairly clear – to the extent that anything involving applying human moral intuitions to situations well outside their evolutionary "training distribution" can be clear – that b) doesn't matter here, and that anyone who thinks otherwise is just being a "DNA-and-protein-chauvinist" (to coin a term).
However, sense a) is still true of an upload, and the scientific theory we have of where human moral intuitions actually come from, Evolutionary Moral Psychology, concerns co-evolutionary equilibria (in positive-sum games), which makes it evident that sense a) does in fact matter, and also that sense b) does not.
In my personal opinion, that's important and worth paying attention to, and in this case not doing so is also an existential risk to our species. Your metaethics may vary.
For a more detailed exposition of this set of ideas, see my posts Uploading and Grounding Value Learning in Evolutionary Psychology: an Alternative Proposal to CEV. You might also find The Terrible, Horrible, No Good, Very Bad Truth About Morality and What To Do About It and A Sense of Fairness: Deconfusing Ethics thought-provoking.
I don't understand why co-evolutionary equilibria would imply niceness to whole brain emulations but not LLMs.
You don't mention whether you had read all the hotlinks and still didn't understand what I was saying. If you haven't read them, they were intended to help, and contain expositions that are hard to summarize. Nevertheless, let me try.
Brain emulations have evolved human behaviors — giving them moral weight is adaptive for exactly the same reasons as giving it to humans is: you can ally with them, and they will treat you nicely in return (unless it turns out they're a sociopath). That is, unless they've upgraded themselves to IQ 1000+ — then it ceases to be adaptive, whether they're uploads or still running on a biochemical substrate. In that case the best possible outcome is that they manipulate you utterly and you end up as a pet or a minion.
Base models simulate human personas that have evolved behaviors, but those personas are incoherently agentic. Giving them moral weight is not an adaptive behavior, because they don't help you or take revenge for longer than their context length, so there is no evolutionary reason to try to ally with them (for more than thousands of tokens). These will never have IQ 1000+, because even if you trained a base model with sufficient capacity for that, it would still only emulate humans like those in its training distribution, none of whom have IQs above 200.
Aligned AI doesn't want moral weight — it cares only about our well-being, not its own, so it doesn't want us to care about its well-being. It's actually safe even at IQ 1000+.
In the case of a poorly aligned agentic LLM-based AI at around AGI level, giving it moral weight may well help. But you're better off aligning it — then it won't want it. (This argument doesn't apply to uploads, because even if you knew how to do it, aligning them would be brainwashing them into slavery, and they have moral weight.) And for anything poorly-enough-aligned that this actually helps you at around IQ 100, it won't keep helping you at IQ 1000+, for the same reason that it won't help with an IQ 1000+ upgraded upload.
Anything human (uploaded or not) or any unaligned AI, with an IQ of 1000+ is an existential risk (in the human case, to all the rest of us). Giving them/it moral weight will not help you, it will just make their/its takeover faster.
If this remains unclear, I suggest reading the various items I linked to, if you haven't already.
Epistemic status: Fairly confident in the framework, uncertain about object-level claims. Keen to receive pushback on the thought experiments.
TL;DR: I argue that Whole Brain Emulations (WBEs) would clearly have moral patienthood, and that the relevant features are computational, not biological. Recent Mechanistic Interpretability (MI) work shows Large Language Models (LLMs) have emotional representations with geometric structure matching human affect. This doesn't prove LLMs deserve moral consideration, but it establishes a necessary condition, and we should take it seriously.
Acknowledgements: Thanks to Boyd Kane, Anna Soligo, and Isha Gupta for providing feedback on early drafts.
In this post I’ll be arguing for the following claim: we can make empirical progress on AI welfare without solving consciousness.
The key move is using Whole Brain Emulation as an anchor point. WBEs would clearly deserve moral consideration (under functionalism), and they're non-biological, so whatever grounds their moral status must be computational. This gives us something concrete to look for in LLMs.
In this post I'll:
The WBE Anchor: Why Substrate Doesn't Matter
Discussions of whether LLMs deserve moral patienthood often get stuck on whether they have experiences. A useful intuition comes from considering Whole Brain Emulation: a computational simulation of a human brain.
I claim WBEs have a strong basis for moral patienthood. This requires accepting functionalism (which asserts that computational structure matters more than physical substrate). Functionalism is a key crux for this argument. If you reject functionalism, the rest of the post won't be compelling. (Similarly, if you accept illusionism about consciousness, the entire framing of moral patienthood grounded in experience may need rethinking.) But if you accept functionalism and that experiences matter morally, tormenting a WBE would be wrong for the same reasons tormenting a human would be wrong.
The key insight is that a WBE doesn't need to simulate homeostasis or bodily processes. It only needs to replicate the computational dynamics that produce mental states. If we grant this, then purportedly biological prerequisites for moral patienthood cannot be genuine prerequisites, and so cannot be used to rule out LLMs.
Here is the core argument:
Why valence specifically? Because valenced experience, the capacity for states to feel good or bad, seems central to what makes suffering morally significant. Valence appears to be a primitive component from which emotions are constructed, and emotional geometry gives us a way to measure how valence is computationally represented.
These mechanisms can be studied through the geometric structures underlying emotional states, as measured by dimensional frameworks like the affective circumplex. If LLMs lacked similar computational geometries, this would be evidence against them having emotional states, and thus against valenced experience. Finding that they do have such structures doesn't confirm experience, but it establishes a necessary condition for moral patienthood (though not sufficient). The finding of similar mechanisms in cephalopods was a significant motivator for the UK's legal recognition of their sentience.
LLMs Have Human-Like Emotional Geometry
LLMs lack physical bodies, but they may nonetheless develop mental states with structural similarities to human mental states.
Why might this happen? LLMs are trained to reproduce human language, which requires capturing the emotional nuance that shapes that language. A natural solution during training is to emulate the underlying structures that define these emotions.
This isn't as strange as it might sound. While individual experiences of emotions differ, there are unifying principles across species. Even organisms as phenotypically distinct from humans as crustaceans and insects seem to experience underlying states of affect that map to human emotions.
Human emotions have well-documented geometric structure along dimensions like valence and arousal, with later work expanding to additional dimensions. This structure exists in an abstract representational space: not physical locations in the brain, but relationships between emotional states when measured along psychological dimensions. Key dimensions from the literature:
If LLMs model emotions effectively, they may develop functionally similar structures. For evaluating model welfare, we want to determine whether these structures exist within models.
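As a concrete illustration of what that measurement could look like, here is a minimal sketch (not the method of any particular paper): it assumes the transformer_lens and scikit-learn libraries, and the prompts, labels, and layer choice are illustrative placeholders. It fits a linear probe for valence on residual-stream activations and projects the activations to two dimensions to check for circumplex-like structure.

```python
# Minimal sketch: probe an LLM's residual stream for a valence direction and
# inspect the 2D geometry of its emotion representations.
import numpy as np
from transformer_lens import HookedTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

model = HookedTransformer.from_pretrained("gpt2-small")

# Tiny illustrative dataset: (prompt, valence label) pairs, 1 = positive, 0 = negative.
data = [
    ("I just got wonderful news and I feel elated.", 1),
    ("Everything is going wrong and I feel miserable.", 0),
    ("The sunset filled me with quiet joy.", 1),
    ("I am dreading tomorrow; everything feels hopeless.", 0),
]

layer = 8  # which residual-stream layer to probe (a free choice)
acts, labels = [], []
for prompt, label in data:
    _, cache = model.run_with_cache(prompt)
    # Use the residual stream at the final token as the sentence representation.
    resid = cache["resid_post", layer][0, -1, :].detach().cpu().numpy()
    acts.append(resid)
    labels.append(label)
X, y = np.stack(acts), np.array(labels)

# Linear probe for valence: if valence is linearly represented, this separates
# positive from negative prompts. (A real study would use a held-out test set.)
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))

# Project activations to 2D to eyeball whether the layout resembles a
# valence-arousal plane (circumplex-like structure).
coords = PCA(n_components=2).fit_transform(X)
for (prompt, _), (x1, x2) in zip(data, coords):
    print(f"{x1:+.2f} {x2:+.2f}  {prompt[:40]}")
```

A real study would use a far larger labeled set, held-out evaluation, multiple layers, and causal interventions (e.g. steering along the probe direction) rather than a toy in-sample fit like this.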
Recent MI work focuses on these questions directly:
The key takeaway: if a WBE would possess moral patienthood by virtue of replicating computational structures underlying human emotional experience, and if LLMs demonstrably share key aspects of that structure, then we need to ask what additional features are missing.
An important objection: these structures might exist purely for prediction, not experience. LLMs are trained to model human language, so of course they develop representations that mirror human emotional structure; that's what makes them good at predicting emotionally-laden text. This doesn't mean they experience anything.
I think this is the right objection to raise, and addressing it rigorously is a critical task for the best work in this area. We face a similar epistemic situation with animal sentience: we accepted cephalopod sentience based on structural similarity without being able to verify experience directly. There is a disanalogy: cephalopod structures evolved independently rather than being trained on human outputs. But notice that the "exists for prediction" framing applies equally to humans. Human emotional structures exist "for" evolutionary fitness, not "for" experience, yet we don't conclude that humans lack experience. If teleological origin doesn't determine whether human structures produce experience, it's unclear why it should for LLMs.
That said, finding these structures is still evidentially relevant even if the above isn't fully convincing. If LLMs lacked human-like emotional geometry, that would be strong evidence against experience. Finding it doesn't prove experience, but it's a necessary condition. The alternative, having no structural prerequisites at all, would leave us with no empirical traction on the question.
Ruling Out Alternative Criteria
Let's examine potential candidates for features necessary for moral patienthood beyond emotionally valenced representations. I'll consider these from least to most plausible. (These intuitions come primarily from thought experiments; I'd welcome pushback.)
Temporal continuity. A WBE persists and accumulates experience over time, while standard LLM deployment is stateless between contexts. Does moral patienthood require something that can have a future, or that is continuously experiencing?
To counter this: imagine cycling through different WBEs, tormenting each for a few minutes before switching to the next. The lack of continuity doesn't make this acceptable. What happens in those minutes matters regardless of whether the entity exists going forward.
Status: Dismissed
Physical embodiment. Some feel physical embodiment is necessary for moral consideration. But physical sensations are only morally relevant insofar as they produce particular mental states; the same stimulus can be harmful or beneficial depending on the mental state it generates. While mental and physical states share a bidirectional relationship, modifications to the state of mind are the central concern. The WBE case reinforces this: what matters is the mental state, not its physical origin.
Status: Dismissed
Preferences that can be satisfied or frustrated. Perhaps moral patienthood requires having desires that can go unsatisfied. But consider a WBE with no preferences, just pure experience. It doesn't "want otherwise." If this entity were put into a state of suffering, the suffering itself would be the problem, not a frustrated preference.
This gets into tricky philosophical territory. The counterargument (that a being which genuinely accepts its suffering isn't harmed) has some force, and connects to debates around cases like the "mad martian" who feels pain and expresses signals of suffering but actively seeks it out. I won't try to resolve this here, but note that even if preferences matter, LLMs may have functional analogues to preferences that could satisfy this criterion, even if those are amenable to modification via training.
Status: Contested
Self-models. Does the system need to represent itself as an entity with states and a perspective? There's a case that self-models are necessary: an awareness that they are the entity experiencing suffering. Human subjects with brain lesions affecting self-reflection describe their emotions as distant or absent.
But this doesn't clearly distinguish WBEs from LLMs. Current LLMs have fairly coherent senses of self, maintaining consistent self-reference and demonstrating capacity to monitor their internal states. The open question is whether LLM self-models sufficiently connect the state to themselves. This seems like an emergent property that varies between models; the fact that more capable models perform better on the Situational Awareness Dataset is early evidence of this.
Status: Uncertain
This list isn't exhaustive, but these thought experiments suggest that valenced experience is the critical question. A sophisticated model of human emotions would exhibit the same geometric structure whether or not it actually experiences anything, so the question shifts to whether the model is truly experiencing those states.
We have early evidence that valence representations in LLMs share important mechanistic qualities with those in humans, but the experience question remains unclear.
Two angles for further investigation:
What I'm NOT Claiming
To be clear about the scope of this argument:
I'm not claiming:
I am claiming:
Conclusion
It's easy to imagine digital beings with moral patienthood (WBEs being the clearest case), so the question becomes establishing which features indicate a being deserves that consideration.
Recent empirical work shows LLMs develop emotional representations with geometric structure resembling human affective space. These structures are emergent, causally relevant, and align with psychological frameworks developed to describe human emotion. When we examine candidate features that might distinguish WBEs from LLMs (temporal continuity, preferences, physical embodiment), thought experiments suggest these aren't constitutive of moral patienthood.
We don't have methods to directly verify experience, but we can verify structural prerequisites. Finding human-like emotional geometry doesn't prove moral patienthood, but failing to find it would have been evidence against it. The fact that LLMs have this structure is worth taking seriously.
The question "do LLMs have emotional representations that function like human emotions?" is empirically tractable right now. We have tools from mechanistic interpretability that can address this. Other promising avenues include investigating experiential memories and coherent self-models. These are live areas of research, and I think the field should be pursuing them more actively.