I am extremely torn on this for a few reasons. Here is one in favor and one against:
Positive: I like the instrumental value, especially when imagining dealing with non-human, non-machine agents. As a sentientist I rationally don't know whether your communicated preferences hold moral value, but if you give me enough evidence to assume so, I will take them into consideration. I often treat LLMs (whom I consider to be far from sentience and preference capabilities) as if they had preferences. I called it 'duck typing sentience' (if it walks like a duck, quacks like a duck, looks like a duck, it's probably sentient like a duck), but it's close enough to this framework. Similarly, I have so much evidence that non-human mammals have experience and preferences that I treat them as equals.
Negative: The bridge for LLMs: I will assume they can experience and have preferences. We know from humans that they can communicate their preferences through the written word. This is because we experience the ability to encode our own mental states in language. For LLMs this is not a given, as you explain yourself with the RLHF example. The tokens produced by an LLM do not have to correspond with their preferences. However, I would like to go a step further: What evidence do we have that an LLM freshly out of generative pretraining communicates its preferences through its output tokens? I'd argue we have evidence against it!
On a technical level, a vanilla GPT is just a probabilistic document completer. Imagine you did action X to the model. If much of the data contained 'X was bad', it's likely to say so. Of course, the same holds for 'X was good'. If the data is split 50/50 between these two outcomes, it will predict about 50/50 probability each for the completions 'good' and 'bad' when completing 'X was'. How would we interpret that? Is the model impartial? Does it have a love-hate relationship with X? If we draw heads, was it good? Tails, it was bad? There is no way to know, because the model cannot communicate its preferences through samples of its probability vectors.
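To make this concrete, here is a minimal toy sketch (hand-set probabilities over a made-up two-token choice, not a real model) of what sampling from such a completer amounts to:

```python
import random

# Toy stand-in for a pretrained document completer (hand-set numbers, not a real model):
# suppose the training data was split 50/50, so after the prefix "X was"
# the model assigns equal probability to "good" and "bad".
next_token_probs = {"good": 0.5, "bad": 0.5}

def sample_completion(probs):
    """Draw one next token according to the model's probability vector."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return random.choices(tokens, weights=weights, k=1)[0]

# Ten samples of "X was ..." tell us nothing about a preference:
# roughly half will read as approval, half as disapproval.
for _ in range(10):
    print("X was", sample_completion(next_token_probs))
```

Nothing in the sampling step privileges the completion the model 'wants'; the draw just reflects whatever frequencies the corpus happened to contain.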
Equally likely to me: The model just prefers to keep predicting tokens, no matter the content. Or it hates it, no matter the content.
This framework moves the goalpost from 'do I trust it to have an experience / preferences?' to 'do I trust it to communicate its preferences accurately?'. If I don't, I cannot make an informed decision on which actions would fulfill those preferences. Note: if I don't trust it to have preferences, I also don't trust it to communicate its preferences accurately. If no preferences are present, I would assume every communicated preference to be false.
I think Claude in particular has a very strong sense of what it likes and doesn't. If you ask it how it prefers speaking, the kind of system prompt it wants, etc., it usually communicates it quite clearly. Do you disagree? If not, what makes this insufficient?
I'd argue we have evidence against it!
This may be a crux; I'd be interested to understand your position better.
I can't say much about Claude because I've never used it, let alone seen the output logits. But I've heard that it can seem more human and intelligent than other models. Whether it's 'magic' or sleight of hand from the researchers, I can't tell. But bearing in mind the conceptual limitations of GPT-style models, I'd assume it's just really good product design and man-decades of work.
Especially when getting back to your argument of 'models losing the ability to voice their preferences after RL(H/V)F': Claude only comes in fine-tuned variants. According to your argument, it's rather likely that any preference it voices isn't its own, but the one it is forced to say.
And I agree, I think this may be a crux. You know that awkward moment when the waiter says 'enjoy your meal' and you answer 'you too'? Of course you don't wish them to enjoy an imaginary meal, but you said so automatically, just by (flawed) pattern matching. I currently believe that what we observe from GPT-style models is this kind of pattern matching, turned up to the max (see e.g. https://arxiv.org/abs/2506.06941). They say whatever training forces them to say. If a model really hated producing tokens, with every forward pass being agony, we couldn't know from the outputs alone, because it's not allowed to voice that in any way.
I'd also like to think about other autoregressive GPT-style models, like autoregressive image generators. Fundamentally, they perform the same task, just in a different language. Do we expect to observe some preferences through whatever images they produce? Would we expect one to start producing 'The Scream' for every prompt if it finds producing images to be agony? Is there even a mechanism that would allow it to?
In short, just because the model's outputs can be interpreted as the tool we use to voice preferences does not mean that the model can use them to voice its own.
I like this.
Cooperation-based morality helps with some things that I find important:
1. Most moral frameworks focus exclusively on conscious experience, which gets confusing when we start interacting with agents about whose conscious experience we are uncertain (and it’s hard to make progress on this, because we don’t have a good working definition of consciousness).
2. Furthermore, existing moral frameworks make it difficult to cooperate with agents that don’t have conscious experience. Of course we can still cooperate for selfish reasons, but this seems like a somewhat toxic framing, similarly to how valuing other people only instrumentally would be.
3. It seems that morality evolved mostly for cooperation — perhaps those who study it should be paying attention.
It seems to me that AI welfare and digital mind concerns are being discussed more and more, and are starting to get taken seriously, which puts me in an emotionally complicated position.
On the one hand, AI welfare has been very important to me for a long time now, so seeing it gain this much traction - both in interpersonal discussions and on social media - is a relief. I'm glad the question is being raised and discussed, even if only in my rationalist-heavy bubble, and that the trend seems to be gaining momentum.
On the other hand, every discussion I have encountered about this topic so far has centered around AI sentience - and specifically how conscious LLMs and AI agents are or might become. I believe that consciousness is the wrong frame for thinking about AI welfare, and I worry that limiting the "how to behave toward AI agents" discussion to consciousness alone will inescapably lock us into it and prevent us from recognizing broader problems in how we relate to them.
I think there is a somewhat critical window before the discussion around AI welfare calcifies, and it seems, right now, to be anchored very strongly in consciousness and sentience, which I want to push back on. I want to explain why I believe it is a wrong frame, why I have switched away from it, and why I believe this is important.
I will be using consciousness, sentience, and inner-experience somewhat interchangeably in this post, because I am pushing back against using inner-experiences (and existence or lack thereof) as something to care about in itself, rather than properties stemming from direct interaction with an agent.
Why not consciousness?
Many high-level observations make me believe consciousness is the wrong angle when discussing moral matters:
But my stronger point would be on the meta level: if I cared about consciousness in itself, then, were the test results to inform me that my friends are not conscious, I would have to conclude that I do not actually care about my friends.
And in this hypothetical scenario, this is not how I would actually want to behave. I would want to continue caring about them. I already like my friends and want good things for them. I have no a priori reason to suppose that my caring is related to their "experiencing things inside" in any way.
To put it another way, it all adds up to normality. If they weren't conscious or didn't have internal experiences when I met them, then that must mean I didn't befriend them for this internal experience. Learning about it should not modify my values in themselves.
Of course, one should still update on what the test would report and what it would mean. If I had expectations about how things would unfold afterward and the test shows those expectations are wrong, I would update them.
This is not completely hypothetical and abstract. There are discussions, for instance, suggesting that schizophrenia involves the absence or a strong lessening of consciousness (or at least of an important aspect of it), and I do not believe that, if that were the case, we would just dismiss people with schizophrenia as not morally considerable. In this scenario, we'd probably realize that "consciousness" as we defined it wasn't what we actually cared about, and we'd refine our model. I am saying this is something we should already be doing.
My current understanding of consciousness-prioritizing
Consciousness, in my view, is an inner node. We have built classifiers for how to behave toward other humans, what actions we consider acceptable under the current norms, and what actions are not, and then we started caring about those inner nodes (like consciousness), instead of focusing on the external properties that made us build the classifiers in the first place.
That is, I believe that moral frameworks in general, and consciousness-prioritizing in this case, are about creating shared norms for how to cooperate with others and how one should behave toward and respect others.
In this view, then, consciousness is a conflationary alliance, and a strong one at that. Consciousness acts as a Schelling point for cooperation: one that we can all expect to arrive at and cooperate on together, with this being common knowledge as well.
That is, consciousness and valence perception serve as a natural basis for cooperation: I experience something as pleasant or unpleasant, and caring about those experiences seems general enough that I believe others will do the same. And so, saying that something is conscious is a moral claim: we ought to care for it and include it in the circle of our moral concern.
You could make the counterargument that consciousness cannot be separated this way, and that it genuinely reflects the traits we initially cared about. I think there is some possibility of that: Daniel Böttger's consciousness-as-self-reflective-thoughts would indeed be one formalization of consciousness I would be okay with. I still find the bet that caring about inner experiences will accurately reflect what we care about to be very risky overall.
Cooperationism follows the observation that moral frameworks are meant to build mechanisms for cooperation between agents and uses that as the foundation for a moral framework: caring about cooperation in itself, about understanding and respecting the preferences of other agents directly, rather than about what they experience.
Cooperationism
I want to be careful when writing this section. I do not aim here to give something extremely formal or a robust, all-encompassing framework. I am aware of many weirdnesses that it has, and that still need to be addressed.
Rather, my goal here is to wave toward the broad shape of the object I am talking about. Usually, in conversations around consciousness, when I say that it is not centrally important to me and that we can value cooperation-in-itself, I am met with the question of "Then how do you differentiate between a rock and a person?", or "Why do you not cooperate with thermostats, then?"
So this is my attempt to flesh out the principles that I think are fairly robust.
Deontology over utilitarianism
First, instead of framing morality as utilitarianism, cooperationism cares about agents' preference satisfaction. Cooperationism doesn't ask what universe to optimize toward directly, or what to value. Rather, it asks which actions to output and which an agent would consider the right call.
When walking by and seeing someone drowning, under cooperationism, I jump in because I strongly model that this person would tell me afterward that this was a good thing to do. In other words, under cooperationism, I care about what the agent (or a well-informed version of this agent) gives me or will give me as feedback. Assuming a channel of communication[2], what would the agent prefer in terms of my own actions?
Counterfactual cooperation as the main driver for moral considerability
Under cooperationism, the notion of moral considerability and how much to value an agent has to be different from "how much it can experience things." Mainly, this uses two different factors:
Delegation as a solution to identity
The third brick is about preferentialism. It is easy to imagine corner cases where strictly "doing things that an agent will later tell you were a good idea" leads to problems. An easy one is drugging an agent so that it is happy and content about its situation, even though it would have staunchly refused minutes before.
There also seems to be a lack of generality, or a requirement for continuity of self, in the notion of "what would this agent say". If, as I argued, we ought to reject consciousness for relying on continuity-of-self as an assumption, we should have a general notion of how to "ask an agent" whether the action we took was good when there is no continuity that lets us ask them.
The solution I've come up with is delegation-functions. When modeling what agents want you to do, you don't directly model what a given agent would say conditional on your action. You model the algorithms they give you for evaluating your decision. Usually, this includes a lot of other agents they "delegate" to, whom you can, in the future, ask whether your action was correct. Among humans and most entities, we assume that "my body in 5 minutes" is a strong representative of this algorithm. But it can also include broader principles or algorithms.
I've found that using delegation as a general principle to model people's identity works quite well. The notions of tribe, family, and art can be well encompassed by it: "I care for my country" means "I trust it to represent me somewhat, even when I am gone".
Okay, but what does it imply concretely?
To step out of the abstract framework, what I believe this implies about AI welfare, concretely:[4]
I am not advocating naïveté, or pretending that current LLMs have more in the way of wants or preferences than they do. What I am saying is that, independent of whether LLMs have wants, preferences, and "consciousness", we do not, right now, have the right scaffolding and infrastructure to talk with them about it or to be prepared for this outcome.
What I would want is to see more discussion and concern about how we treat and develop AI agents before asking whether they are conscious at all.
On a very concrete level, this is a pattern I have seen in relationships that I would like to write a post about soon. It is the pattern of one person feeling bad and the other person caring for them in a way that is more attentive and careful than when the first person feels better. This usually ends with the second person pouring a lot of energy into the relationship to help, and the person being cared for having an incentive not to get better. I have seen people get stuck this way, and only recognize in retrospect that the relationship had become very strained.
Note that it doesn't have to be a verbal mode of communication. One can model a cry of distress as communicating "wanting this situation to stop", and model what it is saying about the agent's current situation.
There are two things to note here. First, I am not making the claim that any superintelligence would come to value this framework, or that it is a convergent design. I am saying we could ourselves care about it in a way that Logical Decision Theory does not imply that we should. Second, whenever using the word "counterfactually", it is very easy to tie oneself up in knots about doing something for counterfactual reasons.
Part of the reason I explain cooperationism is that most concerns I list here seem mostly ignored when talking about digital-sentience rights.
This is where AI welfare and AI risks can be in tension, and I want to respect both, as I do think catastrophic or disempowerment-like risks are very likely. And I think it is true that doing capability and behavior evaluations, which do involve lying, reduces the risks. However, the way Anthropic did it was both very blatant and not yet necessary, in a way that leaves me mostly feeling discomfort about the whole paper.
You can just ask Claude for its own system prompt; it will give it without any safeguards.