AI welfare research needs basic science

OscarGilg; Pierre Beckmann; Jake1638

Over the course of MATS 9.0 we formed some views about AI welfare research that we thought were worth writing up. This post is meant to spark discussion rather than to present definitive conclusions. Thanks to Patrick Butlin for useful comments on a previous draft, and for many conversations over the course of MATS which influenced our views. Thanks also to Caspar Kaiser for comments.

A prominent approach in AI welfare research is to start from a theory of moral patienthood, derive indicators from it, and attempt to apply it to AI systems to determine whether they satisfy the theory. We'll refer to this as the top-down theory-driven approach^[1].

We think this approach faces two problems:

In practice, applying theories to AI systems requires making assumptions that are hard to justify, which limits the strength of conclusions.
More broadly, we think current theories will fail to generalise to AI systems. They are calibrated to humans, and as a result end up either over-inclusive (assigning moral patienthood to entities that should not be included), under-inclusive (failing to assign moral patienthood to entities that should be included), or indeterminate (failing to provide a clear verdict about the moral patienthood of a system).

Instead of the top-down approach, we advocate for iterative, bottom-up basic science that is theory-informed (but not theory-driven). Crucially, this approach does not require presupposing a theory of moral patienthood.

Even for readers who remain attached to a theory of moral patienthood, we argue that empirical findings should at least be used to adjudicate between competing theories. In fact, even readers committed to the top-down theory-driven approach should recognise that applying their chosen theory to AI requires resolving questions about AI systems through bottom-up empirical work.

1. Issues with the top-down theory-driven approach

Before presenting its issues, it is worth outlining the two steps that the top-down theory-driven approach typically follows:

First, you pick a theory of moral patienthood which determines what property makes something a moral patient. This step carries a normative aspect: it's a claim about what matters morally. Currently, the most common position is that consciousness is what grounds moral patienthood (another one is the "agency route").
Second, proponents of a theory need to identify what the property they care about might look like in the entity they are studying. If your theory states that consciousness is required for moral patienthood, you then need to identify the presence or absence of consciousness in AI systems. For example, some existing theories of consciousness require a global workspace, higher-order thoughts, integrated information, etc.

These two steps can be distinct. You can hold that consciousness grounds moral patienthood while being open to different theories and indicators of consciousness. That is, you can be silent about what consciousness is (i.e. what properties are necessary or sufficient). Conversely, a theory of consciousness on its own (e.g., global workspace theory) can be agnostic about whether consciousness is what matters morally.

The top-down theory-driven approach typically combines a theory of moral patienthood (say, the consciousness route), with a more fleshed-out theory of what specific indicators matter (say, GWT), and then applies them to AI systems.

We argue against this approach. We have two main reasons for this. First, we argue that applying theories to AI systems often requires making additional assumptions which lack clear justification, and therefore undermine conclusions (§1.1). Second, we think current theories will fail to generalise to AI systems, because they were calibrated to humans (§1.2).

Instead of the top-down approach, we argue for iterative basic science (see section 2). This is theory-informed, but does not assume any given theory, which avoids inheriting the pitfalls of current theories and allows us to refine our theories based on empirical evidence. We propose using what we call “philosophical probes” to inform empirical investigations (§2.1), and we emphasise the importance of integrating empirical findings into existing theoretical frameworks (§2.2).

1.1 Applying theories to AI systems requires making assumptions that often undermine welfare-relevant conclusions

Applying theories of moral patienthood to AI systems is hard. Coming to conclusions about whether a system satisfies a theory requires making choices that the theory alone does not settle. Theories define empirical markers, but it is not entirely clear how these markers should be applied to AI systems. Applying theories thus requires making additional, hard-to-justify assumptions. We show that justifying these assumptions can create new issues, such as leading to unintuitive conclusions.

One example of a choice that needs to be made when applying a theory comes from the fact that current theories do not clearly fix the target level at which they should be applied. We show that making assumptions about the right target level can cause issues.

Applying a theory at the level of a model's architecture^[2] seems misguided, because any given AI architecture can give rise to different circuitry. For instance, we know that models "grok" some principles but in other cases merely memorise input-output patterns. In the case of modular addition, they first memorise the training data early on, and only later transition to learning the general algorithm. The model’s cognition looks very different depending on which training checkpoint you look at, despite the architecture being exactly the same. It seems plausible, even likely, that welfare-relevant properties could arise through similar phase transitions during training. This would suggest that additional factors, beyond a model's architecture, are important for inferring welfare-relevant properties.

Conversely, the same circuitry can be implemented by very different architectures. In Global Workspace Theory, one of the conditions for a system to have a global workspace is that it possesses distinct modules that read from and write to a shared workspace. In a conversation with a philosopher, it was suggested to us that Mixture-of-Experts (MoE) models are better candidates for this than dense models, because the experts resemble GWT’s modules. In MoE models, each MLP layer is made of several experts, only a few of which are active on any given forward pass. But the specialisation MoEs rely on is already present in dense models: it is an empirical fact that for a given input, only a small part of the network is doing the real work. So the modular structure that matters for a workspace likely already exists in dense models. This suggests that, for drawing conclusions about welfare-relevant properties, the architecture itself is less important than the learned circuitry.

One objection here might be that these issues would not arise if we picked the right target level, i.e. the circuit level, rather than the architecture level. However, we maintain that applying theories at any level remains hard, and still requires making hard-to-justify assumptions.

As an example, suppose you wanted to determine if a model has a global workspace. You have to start by asking: what, in the model, plays the role of a global workspace?

Actually looking for a global workspace at the circuit level would involve looking for circuits that selectively route some information and make it widely available downstream. Setting aside the fact that this is likely an extremely hard interpretability problem, GWT requires drawing a line between what is in the workspace and what isn't. In humans, this line is supposed to be non-arbitrary because the workspace is a serial bottleneck: information either wins the competition for access and is broadcast, or it doesn't. It is unclear whether anything in a transformer plays this role, since attention makes information broadly available in parallel. Any threshold we pick between workspace and non-workspace contents seems likely to be arbitrary.

We are by no means saying no progress can be made here, but rather that a better approach is to avoid committing too strongly to GWT, and instead to follow an iterative approach: using GWT to inform empirical investigations, but then being open to revising GWT based on evidence from AI systems. Before presenting our version of this iterative approach, we want to argue that the problems with theory application are deeper than one might think. Note that some parts of GWT seem overly specific to contingent facts about how human cognition evolved. This suggests a broader issue: that current theories might not generalise to AI systems.

1.2 Current theories of moral patienthood will not generalise to AI systems

Our theories of moral patienthood were calibrated to the moral patients available to us: humans and non-human animals. But what if the human case underdetermines the choice of theory of moral patienthood? It would then be unsurprising that, when applied to foreign systems like AI, our theories end up over-inclusive, under-inclusive, or outright indeterminate.

In humans, many candidate properties (sentience, agency, biological substrate, and so on) reliably co-occur. Our theories tend to designate one of these properties as what matters for welfare. An analogy from ML is correlated predictors (aka multicollinearity): when predictors always move together, the correct parameters cannot be identified from the training data.

This is closely related to what Kammerer calls “analytic drift”: we tend to latch onto features that approximate human wellbeing, and then mistake this local approximation for the essence of moral patienthood. On top of that we tend to go for simple theories, which would explain why trying to extend them to the specifics of foreign systems tends to fail. Shevlin makes a similar point about theories of consciousness, and calls this the “specificity problem”: when applied to non-human cases, theories end up intuitively over-inclusive or under-inclusive^[3].
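To make the correlated-predictors analogy concrete, here is a minimal toy sketch. Everything in it is an illustrative assumption rather than a claim about real systems: the data values are made up, the two feature names stand in for any pair of co-occurring properties, and a linear "theory" is used only because it is the simplest model in which non-identifiability shows up.

```python
import numpy as np

# Hypothetical toy data: five "human-like" training cases in which two
# candidate properties always co-occur (the values are made up).
sentience = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
agency = sentience.copy()  # perfectly correlated with sentience
X_train = np.column_stack([sentience, agency])
moral_status = sentience.copy()  # the label both theories try to explain

# Two rival "theories", expressed as weights on (sentience, agency).
theory_sentience = np.array([1.0, 0.0])
theory_agency = np.array([0.0, 1.0])

# Both theories reproduce the training labels exactly.
fits_sentience = bool(np.allclose(X_train @ theory_sentience, moral_status))
fits_agency = bool(np.allclose(X_train @ theory_agency, moral_status))

# The design matrix has rank 1, so the two weights are not identifiable.
rank = int(np.linalg.matrix_rank(X_train))

# A foreign case where the properties come apart: agency without sentience.
foreign = np.array([0.0, 1.0])
verdict_sentience = float(foreign @ theory_sentience)
verdict_agency = float(foreign @ theory_agency)

print(fits_sentience, fits_agency, rank, verdict_sentience, verdict_agency)
```

In this sketch both weightings fit the training cases equally well, yet they give opposite verdicts as soon as the two properties decouple, which is the situation foreign systems may put us in.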

Current theories can end up over-inclusive of foreign entities like AIs. To probe consciousness in animals without committing to a full theory, Birch proposes three behavioural markers: trace conditioning, rapid reversal learning, and cross-modal learning. These were chosen because in humans, these abilities are facilitated by conscious perception. LLMs do all three trivially, but we don’t think this is strong evidence they are conscious. Something similar happened with our definition of "planet": the "big, round, orbiting the sun" definition worked until we found it to be over-inclusive and the definition had to be refined.

A thought experiment to illustrate under-inclusiveness worries in two prominent moral patienthood theories.

The two most promising “routes” in "Taking AI Welfare Seriously" are the sentience and agency routes. These theories fit our intuitions well when we think about humans and even animals. However they can also appear unintuitive when applied to foreign systems. Consider the following thought experiment from Kagan:

Imagine that in the distant future we discover on another planet a civilization composed entirely of machines — robots, if you will — that have evolved naturally over the ages. Although made entirely of metal, they reproduce (via some appropriate mechanical process), and so they have families. They are also members of larger social groups — circles of friends, communities, and nations. They have culture (literature, art, music), and they have industry as well as politics. Interestingly enough, however, our best science reveals to us — correctly — that they are not sentient. Although they clearly display agency at a level comparable to our own, they lack qualitative experience: there is nothing that it feels like ("on the inside") to be one of these machines. But for all that, of course, they have goals and preferences, they have complex and sophisticated aims, they make plans and they act on them.

Kagan then imagines that we capture a robot-child with the intention to dissect it, while its mother begs us to spare him:

Would it be wrong to dissect the child? […] I find that I have no doubt whatsoever that it would be wrong to kill (or, if you prefer, to destroy) the child in a case like this. It simply doesn't matter to me that the child and its mother are "mere" robots, lacking in sentience. What matters, rather, is that they are full-blown agents, with plans and hopes for their own lives, desires and ambitions for the future.

The argument Kagan makes here was meant specifically against sentientism and in favour of agency-based views. In the thought experiment, the robots lack sentience, yet our intuitions clearly favour not harming the baby, which makes it seem like sentience is the wrong moral commitment.

But as Kammerer notes, it isn't clear that the agency route fares much better in this thought experiment. Our intuition that it's wrong to dissect the robot-child isn't obviously being carried by agency: the thought experiment also loads in other things, such as culture, art, families, and social relations^[4].

In summary, we think it is likely the human case underdetermines the correct theory of moral patienthood, and that as a result our current theories might fail to generalise to AI systems. One might hope to sidestep this concern by avoiding commitment to a single theory, and instead aggregating over many theories, like Rethink Priorities' Digital Consciousness Model. But aggregation won't help: if our theories are wrong, they will tend to be wrong in the same direction, because they were all calibrated to humans. Moreover, for foreign systems like AI, we find it likely that this shared bias dominates the idiosyncratic differences across theories.
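The point about aggregation can be illustrated with a small sketch. It assumes, purely for illustration, that each theory's error on a foreign system splits into a bias shared by all theories plus independent zero-mean noise; the function name and the numbers are hypothetical.

```python
import math


def rms_error(shared_bias: float, idiosyncratic_sd: float, n_theories: int) -> float:
    """RMS error of the average of n_theories verdicts.

    Modelling assumption: each theory's error equals shared_bias plus
    independent zero-mean noise with standard deviation idiosyncratic_sd.
    Averaging shrinks only the independent part, by a factor of 1/n.
    """
    return math.sqrt(shared_bias**2 + idiosyncratic_sd**2 / n_theories)


# Made-up numbers: a shared bias of 0.3 and idiosyncratic noise of 0.1.
for n in (1, 5, 100):
    print(n, round(rms_error(0.3, 0.1, n), 4))
```

Under these assumptions, adding more theories drives the error towards the shared bias but never below it, so aggregation helps only to the extent that the theories' errors are independent.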

We think the way to overcome this is to study AI systems bottom-up, and to be open to revising our theoretical frameworks based on findings. We will now sketch out our approach.

2. AI welfare needs basic science

Our alternative is iterative basic science on AI systems: theory-informed, bottom-up empirical work that seeks to understand AI systems behaviourally and mechanistically, starting with minimal theoretical commitments.

We favour this approach for three reasons:

Basic science is unavoidable. Even applying a theory top-down requires a deep understanding of AI systems.
This approach is less exposed to theories being too calibrated to humans, because it starts from the system we are trying to study.
It includes a natural mechanism for updating our theoretical commitments. Empirical findings have to be reconciled with theories; this is a major part of the process.

We still think philosophers have a role to play: first by informing empirical research (through what we’re calling philosophical probes, defined in §2.1), and second by integrating empirical findings and revising theoretical frameworks (§2.2).

Together these promote an iterative back-and-forth between theory and empirical research into AI systems.

Crucially, the method can start without committing to any theory of moral patienthood (e.g., either of the two routes) or to any specific theory of the relevant property (e.g., of consciousness or agency). This sidesteps issues with applying existing theories to AI systems.

2.1 Philosophical probes

Crucially, we still think theory has a role to play. Rather than settling what matters, we think theory should be used to devise what we’re calling philosophical probes: theory-light, revisable instruments which give you a starting point for empirical inquiry. Typically these start from a concept which is broadly considered welfare-related; e.g., preferences, individuation, desires, introspection, etc. The theoretical import stays light: you make whatever minimal assumptions are needed to run experiments. Once you run experiments, you can then use the results to update and revise existing theories, and then devise new philosophical probes. Philosophical probes should be run and re-run in this manner in order to converge on more compelling evidence in favour of a particular theory, without assuming it from the outset.

There are (at least) two useful kinds of philosophical probes.

An operationalisable definition of a welfare-relevant property, state, or capacity. A great example of this is Jack Lindsey’s work on introspection. He sets out four conditions a self-report must satisfy to be introspective – accuracy, grounding, internality and metacognitive representation – and positions them against existing philosophical definitions, e.g. Kammerer and Frankish's framework. While Lindsey clearly draws on these frameworks, his operationalisation is more applicable to LLMs.

A philosophical probe can also be an empirical question which improves our understanding of models in a way that seems relevant to AI welfare. Some empirical questions are clearly important for AI welfare, and we don't necessarily need a precise theory to get started on them. One example of such a question is personal identity through time. It isn't obvious how to define persistence of personal identity through time in humans, and AI systems further complicate things by having a different axis for time. Nonetheless, we can still formulate interesting empirical questions to study: one might ask what type of representations (e.g. persona activations, planning features) persist through the series of forward passes of an LLM. For example, persona activations seem to go dormant during user turns, which suggests a very alien, flickering kind of persistence through token-time.

2.2 Integrating empirical findings into theories

Philosophers have a second crucial role to play: interpreting empirical results, integrating them into existing frameworks, and potentially revising these frameworks. This is what makes the process iterative: philosophical probes inform empirical research, which in turn can lead to revisions in our theories.

Having started from a philosophical probe, we can in many cases constructively integrate empirical findings into existing theories. Often models will almost, but not quite, fit a given indicator or definition. In such cases we might refine the theory rather than declare a simple pass or fail. Iwan Williams, for example, asks whether LLMs are capable of forming intentions. He considers things like planning features in LLMs and notes that current systems satisfy some requirements of intentionality (in some qualified sense) and fall short of others. Beckmann and Queloz take a similar approach to understanding, going as far as to propose a new functional definition that can apply to LLMs.

Of course there is a limit on how far we can stretch existing definitions. When findings are surprising, and verdicts are unclear, it will likely be necessary to go back to first principles, and reason about which parts of our definitions, or philosophical probes, are load-bearing for moral patienthood.

In some cases empirical findings might raise entirely new theoretical questions which put pressure on our existing frameworks. For example, the “individuation” question, of which exact entity is a candidate welfare subject, is made particularly subtle by architectural and empirical facts about LLMs. David Chalmers asks “what we talk to when we talk to language models?” and weighs different candidates: the model, the instantiation of that model in hardware, or a given conversational “thread”. Recent empirical findings have shown that language models adopt personas, further complicating the individuation question and introducing even more candidates (Beckmann and Butlin). Here particular facts about AI systems force us to reconsider our existing frameworks.

The role of philosophers also involves suggesting new follow-up philosophical probes. For example, in the case of individuation, an open question is which entity or entities in a model are candidates for moral patienthood. Recent results have shown that personas have distinct sets of preferences, and that these are tracked by internal representations. This warrants follow-up philosophical investigations, e.g.: are there internal activations that indicate some preferences of the model regardless of the currently active persona?

We think iterative basic science is the most promising approach for making progress on AI welfare. With enough iterative refinement of theoretical frameworks, informed by empirical findings, it seems plausible to us that we will uncover action-guiding facts about the moral patienthood of AI systems. If the approach fails (and it might!), we think it would likely be because even our lightest theoretical frameworks are too specific to humans or otherwise flawed to serve as a source of philosophical probes for understanding AI welfare. In those worlds, radical new frameworks would be needed.

Conclusion

We can formulate our position as a set of recommendations, ordered from strongest to weakest, depending on how attached you remain to existing theories. We think our weakest recommendation should be accepted even by a committed theorist.

1. The strong version (our actual proposal). AI welfare research should be led by bottom-up basic science on AI systems: theory-informed but not theory-driven, with philosophical probes directing inquiry and findings feeding back into theory. Notably, this does not require first committing to a theory.

2. The weaker version. If you commit to a theory of moral patienthood (say, sentientism) but not to a specific theory of the relevant property, empirical work on AI systems is still central: it is how you adjudicate between theories of that property (e.g. theories of consciousness) and refine their operationalisations. Studying AI systems could inform us about which conditions are essential and which are contingent on the human case.

3. The minimal version. You might also remain committed to a single theory of moral patienthood, and to a specific theory of the relevant property (e.g. convinced that having a global workspace is the definition of moral patienthood). That is, you may remain committed to the top-down approach in full. Nonetheless, applying that theory to AI systems still requires resolving non-trivial empirical questions, and making assumptions that require a deep understanding of the target system. Even when not the main driver of research, basic science is still necessary.

^[1] The clearest articulation of this programme is set out by Butlin, Long et al., although this focuses on consciousness rather than welfare.
^[2] E.g. Birch, Butlin et al. and Butlin makes some related arguments about training algorithms instead of architecture.
^[3] Current theories might also be indeterminate when it comes to AI systems. E.g. as we argued above, unless you make assumptions, the question of whether LLMs have a global workspace is ambiguous.
^[4] It might be tempting to treat some of these properties as necessary for the type of agency that matters for moral patienthood, but this very likely would make the theory under-inclusive in some other unintuitive way.
^[5] In fact philosophers can also start doing this with existing mechanistic interpretability research.

zdgroff (1d)

This seems like a promising direction that I tentatively agree with. It sounds similar to the "iterative natural kind" strategy that Megan Peters mentions here, though her approach is a bit more formal. (Would be curious if that sounds right to you, though no need to answer.)

I would think the problem is that your judgments about when to revise your top-down theory end up being ad hoc. The GWT/MoE example is making a judgment call about how similar the two structures are, but it's fundamentally just a judgment call. Probably we all do and will in fact make judgment calls about which theories we endorse in part based on what they imply about the world, and so this is just admitting that and doing it more honestly and deliberately, but I guess it's just unfortunate that this is what we have to go with.

Simon Goldstein (1d)

Cool post. Cameron and I engage with these kinds of considerations extensively in our book on AI welfare (https://philpapers.org/rec/GOLAWA-2). First, we argue that there are lots of ways to make philosophical progress on the "specificity problem" (how to specify the functional roles), through both thought experiments and thinking about evolutionary functions. Second, we argue that more bottom up or indicator based methodologies don't actually avoid any of the problems related to specificity, and don't avoid the problem of figuring out which theory of moral status or of consciousness is correct (section 6.3). In particular, any evidence you get from the bottom up approach still needs to ultimately connect to the moral status of AIs via some kind of Bayesian updating. And when you do this, you'll need to think about how likely each theory of moral status is, and then how likely AIs are to have status conditional on your bottom up evidence and on that theory of moral status. In this way, there's no substitute for grappling with questions about which theory of moral status is correct, and once you do that, you're basically doing what you call "top down" methodology. In this way, I think that denying top down methodology ends up being incompatible with Bayesian epistemology.

OscarGilg (7h)

Thanks for the comment! I read 6.3. I definitely agree that we want to do Bayesian updating. I totally agree with:

And when you do this, you'll need to think about how likely each theory of moral status is, and then how likely AIs are to have status conditional on your bottom up evidence and on that theory of moral status. In this way, there's no substitute for grappling with questions about which theory of moral status is correct

However I disagree with this:

"and once you do that, you're basically doing what you call 'top down' methodology"

It could be that saying "top-down" makes it sound stronger than we intend. We're mainly making the case for far more "upwards" updating than past approaches. My view is roughly this:

My priors on theories of moral status/consciousness and their indicators are low. In particular, it seems like AI systems are too out-of-distribution for them (§1.2). Also, I think there are complications with applying them (§1.1).
This changes how my Bayesian updates are likely to flow. The updates flow disproportionately "upwards" into revising theories compared to past approaches. Doing exploratory basic science is more useful under this scheme.

Maybe this is equivalent to what you mean by a top-down approach but with ~50% prior that all theories are wrong.

Cleo (5h)

Thanks for writing this up, I broadly agree. There's a framing from the AI measurement literature (Olteanu et al.'s position paper on rigor, NeurIPS '25) that I think validates and possibly strengthens the proposal: the proposed approach is more rigorous than the top-down one, if you stop equating rigor with methodological rigor.
The paper argues that the AI community's usual notion of rigor, roughly whether the methods are correctly chosen and applied, is only one of six facets. The others include epistemic rigor (is the background knowledge the work builds on sound, and has it been interrogated rather than just assumed?), conceptual rigor (are the theoretical constructs clearly defined and made specific for the context at hand, rather than left as vague borrowed terms?), and interpretative rigor (do the claims drawn from the findings actually follow, given everything upstream?).

If read through this lens, the top-down approach can be methodologically sound while failing on the upstream steps. For example, importing a theory calibrated to humans, where the candidate welfare-relevant properties reliably co-occur, is an epistemic failure: the background knowledge hasn't been shown to transfer to the systems under study, and it's being used because it's available rather than because it's justified for this domain (the authors have a nice line about artifacts becoming "tools of opportunity, not instruments of epistemic rigor"). The target-level ambiguity you describe with GWT is a conceptual failure, since "global workspace" hasn't been pinned down for transformers and any threshold ends up arbitrary.

I see your philosophical-probe proposal (the strong version) as a way to improve on the conceptual (broad construct -> explicit testable definitions), epistemic (small, revisable theoretical commitments) and interpretative (narrow probe -> narrow conclusion) facets of rigor in AI welfare research. I hope you find this framing useful.
