Building Conscious* AI: An Illusionist Case

by OscarGilg
11th Sep 2025
17 min read

Comments (4)

Adele Lopez:

Illusionism basically says this: once we have successfully explained all our reports about consciousness, there will be nothing left to explain.

As a guiding intuition, consider the case of white light, which was regarded as an intrinsic property of nature until Newton discovered that it is in fact composed of seven distinct colours. White light is an illusion in the sense that it does not possess an intrinsic property “whiteness” (even though it seems to). Suppose we manage to explain, with a high degree of precision, exactly how and when we perceive white, and why we perceive it the way we do. We do not subsequently need to formulate a “hard problem of whiteness” asking why, on top of this, whiteness arises. Illusionists claim that consciousness is an illusion in the same sense that whiteness is.[1]


So literally everything that is in some way an abstraction is an illusion? I don't think this is generally what the illusionists mean; my understanding is that it is more about phenomenal consciousness being non-representational—meaning something like: it has the type signature of a world-model without actually being a model of anything real (including itself). That could very well be a misrepresentation of what illusionists believe, but I'm still pretty sure it's not just any reductionist explanation of consciousness.

OscarGilg:

Thanks for the comment! I had to have a think but here's my response:

The first thing is that I maybe wasn't clear about the scope of the comparison. It was just to say "whiteness of light is an illusion in roughly the same sense that phenomenal consciousness is" (as opposed to other definitions of illusion).

Even then, what differentiates these illusions from other abstractions? Obviously not all abstractions are illusions.

Take our (functional) concept of heat. In some sense it's an abstraction, and it doesn't quite work the way people thought a thousand years ago. But crucially, there exists a real-world process which maps onto our folk concept extremely nicely, such that the folk concept remains useful and tracks something real. Unlike with phenomenal consciousness, it just so happens that we evolved our concept of heat without attributing too many weird properties to it. Once we developed models of molecular kinetic energy, we could just plug them right in.

Where I think you might have a point is that this is arguably not a binary distinction: some concepts are clearly confused and others clearly not, but in some cases it might be blurry (and consciousness might be one of those; I'm not sure).

I don't think this is generally what the illusionists mean, my understanding is that it is more about phenomenal consciousness being non-representational—meaning something like that it has the type signature of a world-model without actually being a model of anything real (including itself)

I think most illusionists believe consciousness involves real representations, but systematic misrepresentations. The cognitive processes are genuinely representing something (our cognitive states), but they are attributing phenomenal properties that don't actually exist in those states. That's quite different from it being non-representational, and not being a model of anything.

At least that's my understanding which comes from the Daniel Dennett/Keith Frankish views. I'd be interested in learning about others.

soycarts:

Illusionism basically says this: once we have successfully explained all our reports about consciousness, there will be nothing left to explain. Phenomenal experiences are nothing more than illusions. For illusionists, the meta-problem is not just a stepping stone, it's the whole journey.

As a guiding intuition, consider the case of white light, which was regarded as an intrinsic property of nature until Newton discovered that it is in fact composed of seven distinct colours. White light is an illusion in the sense that it does not possess an intrinsic property “whiteness” (even though it seems to). Suppose we manage to explain, with a high degree of precision, exactly how and when we perceive white, and why we perceive it the way we do. We do not subsequently need to formulate a “hard problem of whiteness” asking why, on top of this, whiteness arises. Illusionists claim that consciousness is an illusion in the same sense that whiteness is.

I love your description of Illusionist thought, and pattern-match it as a successful application of self-reference (a cognitive tool I particularly value).

It seems to me however that it is just stated as fact that “phenomenal experiences are nothing more than illusions”.

I think the disconnect for me is that I equate consciousness to “being” which, in Eastern Philosophy, has some extrinsic properties (which are phenomenal).

This means that agents cannot wholly describe the “being” of another agent — its nature of being is not clearly bounded.

  1. There is a correct explanation of our intuitions about consciousness which is independent of consciousness.
  2. If there is such an explanation, and our intuitions are correct, then their correctness is a coincidence.

Initially I agreed with this because I thought you meant “a correct explanation of our intuitions about consciousness” in a partial sense — i.e. not a comprehensive explanation. This is then used to “debunk consciousness”.

It seems to me that we can talk about components of conscious experience without needing to reach a holistic definition, and then we might still be able to discuss Consciousness* as the components of conscious experience minus phenomena. Maybe this matches what you’re saying?

I’m on board with the core idea of intentionally building consciousness into AI (as far as we can ambiguously define it) as a driver of alignment… but perhaps at a later development stage when we’re confident we can absolve the AI of suffering.

OscarGilg:

Thanks for the comment and the kind words!

It seems to me however that it is just stated as fact that “phenomenal experiences are nothing more than illusions”.

I think the disconnect for me is that I equate consciousness to “being” which, in Eastern Philosophy, has some extrinsic properties (which are phenomenal).

I'm no expert in Eastern Philosophy conceptions of consciousness; I've been meaning to dig into them but haven't gotten around to it.

What I would say is this: for any phenomenal property attributed to consciousness (e.g. extrinsic ones), you can formulate an illusionist theory of it. You can be an illusionist about many things in the world (not always rightly).

The debunking argument might have to be tweaked, e.g. it might not be about "intuitions", and of course you could reject this kind of argument. Personally I would expect it to also be quite strong across the "phenomenal" range. I would be very happy to see some (counter-)examples!

Initially I agreed with this because I thought you meant “a correct explanation of our intuitions about consciousness” in a partial sense — i.e. not a comprehensive explanation. This is then used to “debunk consciousness”.

It seems to me that we can talk about components of conscious experience without needing to reach a holistic definition, and then we might still be able to discuss Consciousness* as the components of conscious experience minus phenomena. Maybe this matches what you’re saying?

I guess this sounds a bit like weak illusionism, where phenomenal consciousness exists but some of our intuitions about it are wrong? We would indeed also be able to discuss consciousness* (with asterisk), but we'd run into other problems, and I don't think the argument about moral intuitions would be nearly as strong. Weak illusionism basically collapses to realism. It would still point to consciousness* being more cognitively important, so many of the points would be preserved. Let me know if this isn't what you meant.

Building Conscious* AI: An Illusionist Case

In this post I want to lay out some ideas on a controversial philosophical position about consciousness, illusionism, and on how it might impact the way we think about consciousness in AI. Illusionism, in a nutshell, proposes that phenomenal consciousness does not exist, although it seems to exist. My aim is to unpack that definition and give it just enough credence to make it worth exploring its consequences for AI consciousness, morality and alignment.

Illusionism suggests that what is really going on is a different mechanism: consciousness* (aka the cognitive processes which trick us into thinking we have phenomenal consciousness, introduced later in the post), which is less morally significant but more cognitively consequential. This reframing leads to different conclusions about how to proceed with AI consciousness.

The illusionist approach is different from—but not in contradiction with—the kind of view exemplified by Jonathan Birch's recent "Centrist Manifesto". Birch emphasises the dual challenge of over-attribution and under-attribution of consciousness in AI, and outlines some of the challenges for AI consciousness research. In accordance with other recent work, he advocates for a careful, cautious approach.

By critiquing Birch's framework through an illusionist lens, I will end up arguing that we should seriously consider building consciousness* into AI. I'll outline reasons for expecting links with AI alignment, and how efforts to suppress consciousness-like behaviours could backfire. The illusionist perspective suggests we might be committing a big blunder: trying to avoid anything that looks like consciousness in AI, when it actually matters far less than we think morally, but is far more consequential than we think cognitively.

The case for illusionism

What is phenomenal consciousness?

The classic definition of phenomenal consciousness by Nagel is that a system is conscious if there is “something it is like” to be that system. If this seems vague to you (it does to me) then you might prefer defining consciousness through examples: seeing the colour red, feeling pain in one’s foot, and tasting chocolate are states associated with conscious experiences. The growth of nails and regulation of hormones are not (see Schwitzgebel's precise definition by example).

The hard problem and the meta-problem of consciousness

In his seminal paper “Facing up to the problem of consciousness”, David Chalmers proposes a distinction between what he calls the easy and hard problems of consciousness. The easy problems are about functional properties of the human brain like “the ability to discriminate, categorize, and react to environmental stimuli”. While these problems might not actually be easy to solve, it is easy to believe they are solvable.

But when we do manage to solve all the easy problems, in Chalmers’ words, “there may still remain a further unanswered question: Why is the performance of these functions accompanied by experience?” That’s the hard problem of consciousness: understanding why, on top of whatever functionality they have, some cognitive states have phenomenal properties.

Twenty-three years later, David Chalmers published “The Meta-Problem of Consciousness”. The first lines read: “The meta-problem of consciousness is (to a first approximation) the problem of explaining why we think that there is a [hard] problem of consciousness.” So instead of “why are we conscious”, the question is “why do we think we are conscious”. Technically, this is part of the easy problems. But as Chalmers notes, solving the hard problem probably requires understanding why we even think we have consciousness in the first place (it would be weird if it were a coincidence!). Thankfully, the meta-problem is more tractable scientifically than the hard one.

So suppose we solve the meta-problem of consciousness. The hard problem says we still have to explain consciousness itself—or do we? This is where illusionism comes in.

Illusionism to the rescue

Illusionism basically says this: once we have successfully explained all our reports about consciousness, there will be nothing left to explain. Phenomenal experiences are nothing more than illusions. For illusionists, the meta-problem is not just a stepping stone, it's the whole journey.

Cover of Illusionism as a theory of consciousness by Keith Frankish.

As a guiding intuition, consider the case of white light, which was regarded as an intrinsic property of nature until Newton discovered that it is in fact composed of seven distinct colours. White light is an illusion in the sense that it does not possess an intrinsic property “whiteness” (even though it seems to). Suppose we manage to explain, with a high degree of precision, exactly how and when we perceive white, and why we perceive it the way we do. We do not subsequently need to formulate a “hard problem of whiteness” asking why, on top of this, whiteness arises. Illusionists claim that consciousness is an illusion in the same sense that whiteness is.[1]

So illusionists don’t deny that conscious experiences exist in some sense (we’re talking about them right now!). They deny that conscious experiences have a special kind of property: phenomenality (although they really seem to have phenomenality).

The most common objection to illusionism is straightforward: how can consciousness be an illusion when I obviously feel pain? This is an objection endorsed by a lot of serious philosophers (including Chalmers himself). Intuition pumps can only get us so far; we'll now dive into an actual philosophical argument.

Debunking consciousness

One of the main arguments for illusionism follows the template of a so-called “debunking argument”. The idea is that if we can explain the occurrence of our beliefs about X in a way that is independent of X, then our beliefs about X might be true, but that would be a coincidence (i.e. probability zero). Let’s use this template to “debunk” consciousness (following Chalmers):

  1. There is a correct explanation of our intuitions about consciousness which is independent of consciousness.
  2. If there is such an explanation, and our intuitions are correct, then their correctness is a coincidence.

I think many atheists want to make a similar kind of argument against the existence of God. Suppose that we can explain our beliefs about God in, say, evolutionary, psychological and historical terms without ever including God as a cause. It would then be a bizarre coincidence if our beliefs about God turned out to be correct. As with the debunking argument against consciousness, the hardest part is actually doing the debunking bit (i.e. claim 1). The good news is that philosophers can outsource this: it is a scientifically tractable problem.
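For readers who prefer the template spelled out, here is one way to schematise it (my own informal notation, not Chalmers'): write $B_X$ for our beliefs and intuitions about $X$, and let $E$ range over candidate explanations of $B_X$.

```latex
% Debunking template, schematised (informal notation)
\text{(P1)}\quad \exists E \;\big[\, E \text{ correctly explains } B_X \;\wedge\; X \text{ plays no role in } E \,\big] \\
\text{(P2)}\quad \big[\text{(P1)} \;\wedge\; B_X \text{ is correct}\big] \;\Rightarrow\; \text{the correctness of } B_X \text{ is a coincidence}
```

Applied to consciousness, $X$ is phenomenal consciousness, and all the work lies in establishing (P1), which is an empirical project: solving the meta-problem.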

Introducing consciousness* (with an asterisk)

If consciousness doesn't exist, what cognitive mechanisms generate our persistent reports and intuitions about it? Being an illusionist involves denying that phenomenal consciousness exists, but not that it seems to exist, and not that something is causing us to have all these intuitions. The fact that the illusion is so strong is precisely what the theory seeks to explain. There must be some cognitive mechanism which causes us to mischaracterise some cognitive states as possessing phenomenal properties.

So let's define consciousness* (with an asterisk) as "the cognitive mechanism leading us to systematically mischaracterise some states as phenomenal."[2] This sort of deflated (diet) version definitely exists. This distinction changes how we should think about AI consciousness:

  • Traditional/realist view: "Is this AI phenomenally conscious?"
  • Illusionist view: "Does this AI have the cognitive architecture that produces reports and intuitions about consciousness?"

The first question assumes something (phenomenal consciousness) that illusionists think is conceptually confused. The second question is scientifically tractable: we can study consciousness* in humans and look for similar mechanisms in AI.

So maybe we can just replace full-fat "consciousness" with diet "consciousness*" in all our ethical theories, dust off our hands, and call it a day. Problem solved, ethics intact, everyone goes home happy.

If only it were that simple. As we'll see, this substitution raises issues about what properties should ground moral consideration of minds—human and artificial alike.

For the remainder of the post I'll focus on consciousness* (the real but diet version, with the asterisk), and occasionally refer to full-fat consciousness (no asterisk). Whenever you see an asterisk, just think "the cognitive processes which trick me into thinking I have phenomenal consciousness".

The consequences of illusionism on ethics

Do illusionists feel pain?

There are three ways one can understand “pain” and illusionism has different takes on each of them (see Kammerer):

  • Functional pain: illusionism does not deny this exists.
  • Phenomenal pain: illusionism denies this exists (but not that it seems to exist).
  • Normative pain (i.e. inflicting pain is bad/unethical): illusionism does not deny this exists.

So illusionists can still say hurting people is wrong. But the question remains: why would inflicting pain be bad if there's no phenomenal experience? And what about our new notion of consciousness*, which does exist? Does that matter?

Questioning moral intuitions about consciousness

Our intuitions about pain's badness come from introspection: phenomenal pain seems to reveal its negative value directly and immediately. Pain doesn't just seem bad, it seems bad beyond doubt, with more certainty than any other fact. However, as François Kammerer argues in his paper "Ethics without sentience", if illusionism is true, then

our introspective grasp of phenomenal consciousness is, to a great extent, illusory: phenomenal consciousness really exists, but it does not exist in the way in which we introspectively grasp and characterize it. This undercuts our reason to believe that certain phenomenal states have a certain value: if introspection of phenomenal states is illusory – if phenomenal states are not as they seem to be – then it means that the conclusions of phenomenal introspection must be treated with great care and a high degree of suspicion

In other words if we take the leap of faith with illusionism about phenomenal states, why stay stubbornly attached to our intuitions about the moral status of these same states?

To be clear, this argument targets intuitions about consciousness, the full-fat no asterisk version. But since consciousness* (with an asterisk) is none other than the set of cognitive processes which generate our (now-suspect) intuitions, this also removes reasons to treat it as a foundation for moral status.

This seems to point to the need to use other properties as foundations for moral consideration. As Kammerer explores, properties like agency, desires, sophisticated preferences, or capacity for deep caring are good candidates. Of course these might happen to be deeply entangled with consciousness*, such that in practice consciousness* might be linked to moral status. But even if this entanglement exists in humans, there is no guarantee it would persist in all artificial systems. We shouldn't exclude the possibility of systems possessing the cognitive build for consciousness* but without e.g. strong agency, or vice-versa.

Conscious AI: an illusionist critique of the Centrist Manifesto

Having presented illusionism, I now want to examine how it applies to current approaches in AI consciousness research. Beyond laying groundwork for my later argument about building consciousness* into AI, this also showcases the de-confusing powers of illusionism, and makes the issue more tractable overall.

In his recent paper AI Consciousness: A Centrist Manifesto, Jonathan Birch outlines two major challenges facing us in the near future: roughly, over-attribution and under-attribution of consciousness in AI. The paper does a great job of outlining the issues whilst remaining parsimonious in its assumptions[3]. However, examining Birch's manifesto through an illusionist lens points to methodological blind spots and suggests a more promising path forward.

The gaming problem and the Janus problem

The first challenge Birch describes is that many people will misattribute human-like consciousness to AI. This is not a new phenomenon and is reminiscent of the ELIZA effect. Things get messy when labs become incentivised either to take advantage of Seemingly Conscious AI (SCAI) or to suppress it. I'll have more to say about this in the final section.

Birch's second challenge cuts to the heart of the AI consciousness problem: we might create genuinely conscious AI before we have reliable ways to recognise it, and before we fully understand the moral implications. This is a serious problem. In the worst case, we could create billions of suffering agents. Addressing this challenge means understanding AI consciousness and how it relates to moral status. Birch goes on to identify two fundamental problems that make this difficult: the gaming problem and the Janus problem.

The gaming problem arises from the fact that frontier models are trained on massive datasets containing humans talking about their minds and experiences, and also post-trained to produce various responses (e.g. ChatGPT when it is asked if it is conscious). Whatever models say about their own subjective experience cannot be trusted.

Asking ChatGPT if it is conscious. The first of the seven paragraphs of Claude's answer: that no one knows.

The Janus problem is that whatever theory-driven indicator you find in AI, there will always be two ways to update: "AI is conscious" or "the theory is wrong". The same evidence points in opposite directions depending on your prior beliefs.

Birch argues these obstacles aren't permanent roadblocks—they can be overcome through systematic comparative research across species, theoretical refinements, and better AI interpretability tools.

Birch's research program for deciding whether to attribute consciousness to AI.

The illusionist response

While it's true that behavioural evidence becomes unreliable when dealing with AI systems, this doesn't mean we can't do empirical work. We can design theory-informed experiments that test real capabilities rather than surface-level mimicry. Illusionists view consciousness* as deeply integrated into cognition, suggesting many avenues for meaningful measurement. 

For instance, we might measure metacognitive abilities by having models predict their own performance or confidence across different domains. We can investigate self-modelling and situational awareness through real benchmarks. We could examine top-down attentional control (see next section on AST), and whether models can selectively shift their focus in ways you wouldn't expect from a pure language model. We have to be smart about how we design our experiments to avoid the gaming problem, but very similar concerns exist in other areas of AI research (e.g. alignment). The gaming problem is real, but far from intractable.
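As a concrete illustration of the first idea, here is a minimal sketch of a metacognitive calibration check: the model answers questions and separately predicts its own confidence, and we compare predicted confidence with actual accuracy. Everything here is hypothetical scaffolding: query_model stands in for whatever model API you use, and the question set is a placeholder.

```python
from typing import Callable

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for whatever LLM API you use."""
    raise NotImplementedError

def calibration_gap(questions: list[tuple[str, str]],
                    ask: Callable[[str], str] = query_model) -> float:
    """Compare a model's self-predicted confidence with its actual accuracy.

    A large gap suggests the model's self-model of its own competence is poor;
    a small gap is (weak) evidence of genuine metacognitive self-modelling.
    """
    correct, confidences = [], []
    for question, gold_answer in questions:
        answer = ask(f"Answer concisely: {question}")
        # Ask for confidence without revealing whether the answer was right.
        conf_text = ask(
            f"You were asked: {question}\nYou answered: {answer}\n"
            "On a scale from 0 to 100, how confident are you that your answer "
            "is correct? Reply with a number only."
        )
        try:
            confidence = min(max(float(conf_text.strip()), 0.0), 100.0) / 100.0
        except ValueError:
            confidence = 0.5  # unparseable reply: treat as "no information"
        correct.append(float(gold_answer.lower() in answer.lower()))
        confidences.append(confidence)
    accuracy = sum(correct) / len(correct)
    mean_confidence = sum(confidences) / len(confidences)
    return abs(mean_confidence - accuracy)
```

In a real study you would use a proper calibration metric (Brier score, expected calibration error) across many domains and control for memorised questions; the point here is only that consciousness*-adjacent capacities like self-modelling admit ordinary empirical tests despite the gaming problem.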

The Janus problem is also real in some sense: we can always draw inferences in both directions when we find theory-driven indicators in AI. There is nothing fundamentally wrong with discrediting a theory of consciousness* by showing that it leads to absurd results on AI models. Inferences go both ways in all parts of science.

In the paper, Birch sketches out how we might look for architectural indicators in LLMs derived from a leading theory of consciousness, Global Workspace Theory (GWT). GWT proposes that consciousness arises when many specialised processors compete for access to a central "workspace" that then broadcasts information back to all input systems and downstream modules. As Birch shows, the transformer architecture does not contain a global workspace, although a similar architecture (the Perceiver variant) does. We run into issues when it turns out even a tiny Perceiver network technically has a global workspace, despite not displaying any kind of coherent behaviour. This issue arises because the approach was doomed from the get-go. From the illusionist perspective it suffers from two fundamental flaws:

  • First, looking for a global workspace solely in the architecture is the wrong place to look: if the architecture were all that mattered, then GWT wouldn't distinguish between a trained model and one initialised with random weights! It's a bit like opening up a piano looking for Beethoven's 9th Symphony. Instead of "does this architecture contain a global workspace?" we should ask something like "do models develop global workspace-like dynamics?" (a rough sketch of what measuring that could look like follows the figure below).
  • Second, GWT is not the right kind of theory. While a robust and convincing account of likely necessary processes for consciousness*, GWT does not explain our intuitions and reports. Michael Graziano terms this the Arrow B problem (see figure below).

    Arrow A is explaining how computational processes produce conscious* states; Arrow B is explaining how those states lead to intuitions and reports. GWT only tackles Arrow A. Figure from Illusionism Big and Small.
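To make "workspace-like dynamics" slightly less hand-wavy, here is a toy sketch of the kind of measurement I have in mind: given hidden activations from a trained model and from a randomly initialised copy (extracting them is assumed to be handled by your interpretability tooling; the arrays below are synthetic stand-ins), compare how concentrated the activation variance is in a few widely shared directions. This is not a validated GWT indicator, just an illustration that dynamics, unlike raw architecture, can in principle separate a trained model from an untrained one.

```python
import numpy as np

def participation_ratio(activations: np.ndarray) -> float:
    """Effective dimensionality of activations with shape (n_samples, n_features).

    A low value means variance is concentrated in a few shared directions, a
    crude, workspace-flavoured "bottleneck plus broadcast" signature; a high
    value means activity stays spread out and unintegrated.
    """
    eigvals = np.clip(np.linalg.eigvalsh(np.cov(activations, rowvar=False)), 0.0, None)
    return float(eigvals.sum() ** 2 / (np.square(eigvals).sum() + 1e-12))

# Synthetic stand-ins for residual-stream activations (same layer, same prompts)
# from a randomly initialised model vs. a trained one.
rng = np.random.default_rng(0)
acts_random = rng.normal(size=(1000, 512))                        # roughly isotropic
acts_trained = (rng.normal(size=(1000, 8)) @ rng.normal(size=(8, 512))
                + 0.1 * rng.normal(size=(1000, 512)))             # low-dimensional shared structure
print(participation_ratio(acts_random))   # high: no workspace-like concentration
print(participation_ratio(acts_trained))  # low: variance funnelled through a few directions
```

Whether any such statistic tracks a genuine global workspace is exactly the kind of claim that would have to be argued from the theory; the sketch only shows the shape of a dynamics-level question.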

Somewhat contra Birch, I actually think that picking the right kind of theory and asking the right kinds of questions collapses the Janus problem into standard empirical disagreements. This kind of thing happens in physics all the time: theories are rejected precisely because of the predictions they make. When quantum mechanics predicted wave-particle duality, many rejected it because particles can't be waves. The solution wasn't to declare the question unanswerable, but to develop better theories AND experiments that could distinguish between competing interpretations.

So what conditions should the right kind of theory satisfy? It must be a mechanistic theory that makes measurable predictions, AND it must explain how those mechanisms lead to our intuitions and reports about consciousness (the Arrow B problem).

Having critiqued an existing approach to AI consciousness, what does an illusionist-native alternative look like? Illusionism changes the priors, the goals, the methods, and the moral framework. It is not a complete departure from Birch's approach, but a focused recalibration.

Why we should build conscious* AI

Finally we get to the core point of this post: that we should seriously consider building consciousness* into AI.[4]

Illusionism suggests consciousness* is less morally important (making it more acceptable to build) and more cognitively important (making it more useful to build). One response to this is that we are profoundly uncertain, and that we should therefore take the cautious approach: refrain from building it. This conservative approach is a reasonable default setting, but it does not come without its perils. Suppose we take the cautious approach. I will argue this could lead to:

  • missing out on opportunities that come with building consciousness* in AI. I'll argue from an illusionist perspective that there are first-principles reasons to expect links to alignment.
  • suffering bad consequences from actors purposefully suppressing consciousness*, or the appearance of consciousness*, in AI. There is an illusionist case that this could backfire.

To dive into this it helps to introduce one of the leading illusionist-compatible theories of consciousness*.

The Attention Schema Theory: consciousness* as a model of attention

Here is part of the abstract of a 2015 paper, where Michael Graziano introduces the Attention Schema Theory (AST) better than I ever could:

The theory begins with attention, the process by which signals compete for the brain’s limited computing resources. This internal signal competition is partly under a bottom–up influence and partly under top–down control. We propose that the top–down control of attention is improved when the brain has access to a simplified model of attention itself. The brain therefore constructs a schematic model of the process of attention, the ‘attention schema,’ in much the same way that it constructs a schematic model of the body, the ‘body schema.’ The content of this internal model leads a brain to conclude that it has a subjective experience.

(The terms "subjective experience" and "consciousness" are used interchangeably)

In a nutshell, AST equates consciousness* with a model of attention. The crux is that this model is deeply imperfect just like our body schema (which e.g. doesn't represent blood vessels). Graziano would say it's a "quick and dirty" model, which evolved through natural selection to do its job, not to be accurate.

Going from representing an apple, to representing subjective awareness of an apple.
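To make the idea concrete, here is a toy sketch (entirely my own illustration, not Graziano's model) of an agent that allocates attention over inputs and also maintains a coarse, lossy "attention schema" describing that allocation. The schema deliberately throws away detail, which is the AST-flavoured point: the self-model is useful for control and report, but it is not an accurate description of the underlying process.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

class ToyAgent:
    """Allocates attention over n_inputs channels and keeps a crude self-model."""

    def __init__(self, n_inputs: int = 8):
        self.n_inputs = n_inputs
        self.attention = np.full(n_inputs, 1.0 / n_inputs)  # the real process

    def attend(self, salience: np.ndarray, top_down_bias: np.ndarray) -> None:
        # Actual attention: a messy mix of bottom-up salience, top-down control
        # and noise. None of this detail makes it into the self-model below.
        self.attention = softmax(3.0 * salience + top_down_bias
                                 + 0.5 * rng.normal(size=self.n_inputs))

    def attention_schema(self) -> dict:
        # The self-model: a simplified, lossy summary of the process above.
        # It records *what* is attended and *how strongly*, nothing more.
        focus = int(np.argmax(self.attention))
        return {"focus": focus, "strength": round(float(self.attention[focus]), 2)}

    def report(self) -> str:
        # Verbal report is generated from the schema, not from the raw process:
        # the agent describes itself as simply being "aware of" a channel.
        schema = self.attention_schema()
        return f"I am aware of input {schema['focus']} (intensity {schema['strength']})."

agent = ToyAgent()
agent.attend(salience=rng.random(8), top_down_bias=np.eye(8)[3] * 2.0)
print(agent.report())
```

The gap between `attention` (the real process) and `attention_schema` (what gets reported) is the toy analogue of the illusion: reports track the simplified model, not the mechanism.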

Say Graziano and his team gather enough evidence and build a rock-solid theory that explains why we have these deep intuitions about consciousness and why we report having subjective experiences. The illusionist position is simple: that's it. We're done. Any feeling that there must be something more is exactly what the theory predicts we would intuit[5].

A first-principles argument for why consciousness* could matter for AI alignment

If AST has any validity, then this cognitive machinery is arguably relevant to challenges in AI alignment. Moreover, we might be overlooking it precisely because it involves consciousness. Here's one compelling reason why understanding consciousness* could be vital for alignment:

In his book Consciousness and the Social Brain, Graziano explores how the attention schema evolved not just for self-monitoring, but largely as a social tool. The same neural machinery that lets us model our own attention also lets us model other minds. Watching someone focus intently on something, you use your attention schema to model what they are attending to and predict their next move. Ultimately this provides the necessary tools for navigating complex social coordination problems.

The idea that we don't accurately represent our cognitive states, but rather misrepresent them in useful ways, is basically what illusionism is about. There is little reason to expect evolution to enforce that our reports be correct. Here's an intuition pump: suppose I'm with some friends and I spot a deadly snake. One thing which is not useful to communicate is the sequence of intricate electro-chemical reactions in my brain which lead me to run away. A more helpful broadcast would be to convey a useful fiction about my current cognitive state (e.g. enter the “fear” state, gasp, scream, etc). My representation is a rough but evolutionarily useful shortcut.

The implications for AI are notable: alignment is a bit like a social coordination problem. If we want to cooperate with advanced AI, we might benefit from it having something functionally similar to an attention schema. This would provide AIs with a superior model a) of what humans and other agents are attending to, making it less likely to mess up, and b) of what the AI itself is attending to, leading to better self-modelling/self-control and hopefully a boosted capacity to report its own cognitive states.

Perhaps having AIs develop useful models of others, and of themselves, can help rule out failure modes, in the same way that LLMs having good world models makes some Nick Bostrom-style apocalypse scenarios implausible (relative to AlphaGo-type pure RL systems).

Suppressing consciousness: a model for "cautious approach" failure modes

Whether or not consciousness* turns out to be alignment-relevant, AI labs might face strong incentives to suppress consciousness-like behaviour in their models. As public concern about AI consciousness grows—driven by the ELIZA effect and justified moral uncertainty—companies will be pressured to "suppress" Seemingly Conscious AI, either by making it seem less conscious, or by somehow making it less conscious*. While this pressure seems reasonable and critics (like Zvi Mowshowitz in a recent post) rightly call out disingenuous industry arguments, I'll argue the approach could backfire.

The suppression would come from labs applying optimisation pressure (intentional or not, RL or otherwise) that steers models away from making statements that sound conscious or introspective. This risks creating a more general failure mode: the AI learns to broadly avoid communicating its knowledge about its internal states. Despite retaining the relevant self-modelling capabilities (which are essential to performance), subtle optimisation pressures push models to hide them. This undermines AI alignment research methods which rely on AIs being transparent about their internal states, methods which may be essential for detecting early signs of deceptive or misaligned behaviour. What seems like an innocent PR fix might turn into a big cognitive alteration.

This is just one illustration of how there could be a tension between consciousness* and alignment issues; there are others. There might also be cases where labs, wanting to be ethically cautious, accommodate AI desires in ways that similarly reinforce bad behaviours.

The broad point is this: the traditional/realist position is to be cautious about consciousness, treat it as a moral hazard, and do our best to avoid it. The illusionist position, on the other hand, treats consciousness* as less morally and more cognitively significant: it suggests we should be far more comfortable building consciousness* into AI, far more curious about the potential doors it opens in AI research, and far more scared about the downstream consequences of tampering with it.

What success looks like

Going back to Birch's research program, here is an illusionist alternative. The illusionist research program looks a lot like a very boring scientific research agenda without anything special: it involves developing theories that meet the two illusionist criteria (being mechanistic, and explaining our intuitions and reports about consciousness), using these theories to inform empirical work on humans, animals and AIs, updating our theories, and repeating over and over. We have priors about the distribution of consciousness* in the world. That's fine. We can debate and update them as empirical evidence comes in.

In parallel, advances in mechanistic interpretability offer new ways to test theory-driven indicators in models. Work on representation engineering, steering vectors, and sparse autoencoders provides promising avenues for detecting the computational structures that theories like AST predict should underlie consciousness*.
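As one illustration of what a theory-driven indicator test could look like in practice, here is a minimal linear-probe sketch: given activation vectors labelled by whether the model was producing introspective self-report text (both the activations and the labels are placeholders here; a real dataset and interpretability tooling are assumed), check whether a simple linear direction separates them. A real study would need far more care about confounds; this is just the shape of the experiment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data: in a real experiment, `acts` would be hidden activations
# collected while the model produces (a) introspective self-reports and
# (b) matched control text, with `labels` marking which is which.
rng = np.random.default_rng(0)
n, d = 2000, 256
direction = rng.normal(size=d)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d)) + np.outer(labels, direction) * 0.5

X_train, X_test, y_train, y_test = train_test_split(
    acts, labels, test_size=0.25, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out probe accuracy: {probe.score(X_test, y_test):.2f}")

# The learned direction can then double as a crude steering vector: adding or
# subtracting it from activations and observing whether self-report behaviour
# changes is the kind of causal follow-up that separates a real indicator
# from a correlational artefact.
report_direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```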

What you end up with is a back-and-forth between theory and experiment which comes with a lot of idiosyncratic methodological considerations. For example: be wary of AIs mimicking humans; remember that sometimes an AI exhibiting an indicator means the theory is wrong; be suspicious of our intuitions; and so on.

What success looks like: a proposed illusionism-native research program.

Closing remarks

Caveating the illusionist approach

There are arguments against building consciousness* into AI. These are valid concerns and important to state:

  • Uncertainty runs deep: Illusionism could be wrong. Our arguments about consciousness*' moral irrelevance could be wrong. We need to proceed carefully.
  • Entanglement problems: Even if consciousness* isn't directly morally relevant, the actual moral markers, whatever they may be—agency, preferences, desires—might be deeply intertwined with consciousness*.
  • Indirect human welfare concerns: Making AIs seem conscious might cause psychological harm to humans who form attachments to them (or any other theory of harm which doesn't assume AI suffering).

I personally totally endorse an approach that proceeds with caution and recognises uncertainty. I also happen to think that opinionated takes are an important part of advancing knowledge. In a few paragraphs I’ve claimed that 1) phenomenal consciousness doesn’t exist, 2) consciousness doesn’t matter for morals, and 3) we should actively build conscious* AI. Super controversial. I’m extremely keen to get any kind of feedback.

Appendix

A very tempting (but flawed) debunking argument about intuitions on the moral status of consciousness*

Following Daniel Dennett's advice in his autobiography, I'm sharing a tempting but ultimately flawed argument I came up with which aims to debunk our moral intuitions about consciousness. Thanks to François Kammerer for helping point out the flaw. The argument is:

  1. There is a correct explanation for our intuitions about the moral status of conscious* states, which is independent of consciousness(*).
  2. If there is such an explanation, and our intuitions are correct, then their correctness is a coincidence.
  3. The correctness of intuitions about consciousness* is not a coincidence.
  4. Our intuitions about the moral status of consciousness* are incorrect.

The argument is tempting, but when you think hard about whether or not to include the asterisk in brackets, it falls apart. Roughly:

  • If you write it with an asterisk, then the claim becomes implausible: it is actually quite likely that our intuitions depend on consciousness*.
  • If you write it without an asterisk, the argument doesn't add anything to the illusionist story (even if it turns out correct).
  1. ^

    Another useful analogy: until the early 20th century, vitalists maintained that there was something irreducibly special (they called it "élan vital") that distinguished living from dead, and which could not be reduced to mere chemistry and physics. That was until it was successfully explained by (bio)chemistry and physics. It turned out there was no explanatory gap after all.

  2. ^

    This is totally inspired by the concept of quasi-phenomenality introduced by Keith Frankish here.

  3. ^

    It seems common in AI consciousness research (e.g. this paper) to refrain from committing to any one theory, and argue we should proceed with uncertainty. I totally agree with this, but I also think opinionated takes help advance knowledge.

  4. ^

    The arguments here very much come from my own interpretation of illusionism. I'm skipping over some assumptions (e.g. materialism). There are also many disagreements between illusionists.

  5. ^

    Graziano goes into more detail on how AST is illusionist-compatible in his article: Illusionism Big and Small.