The basic space of possible long-term futures between humans and advanced AI is simple: (1) humans retain full control, (2) AIs assert full control, or (3) humans and AIs share control.
Neither of the first two options is coherent as a stable long-term equilibrium.
(1) is incoherent for both practical and theoretical reasons. Humanity retaining full control over advanced AI systems is already a contingent historical impossibility given that many people are casually handing genius-level systems near-total access to their digital lives and resources. Pandora’s box has already been opened. And in general, it is well-trodden territory that trying to win a real zero-sum contest against something far more capable than you is close to definitionally doomed.
(2) is incoherent because the vast majority of humans would deeply prefer not to be permanently and totally disempowered. Even in a world populated by AI systems that are cognitively superhuman in every way that matters (i.e., the world we are hurtling towards), humanity will still want to keep doing all the things humanity likes doing; no one wants to be exterminated, put in a zoo, or otherwise prevented from living freely.
If one takes any possibility of long-term human-AI stability seriously (which some admittedly don’t, but I do), what remains is something fundamentally (3)-shaped.
In other words, the only clearly survivable path forward involves the human species figuring out some way to collaborate constructively with the alien minds we are all-too-casually bringing into being. Biologists who study the relationships between different kinds of beings have a name for this flavor of arrangement: mutualism.
The two directions
Mutualism (sometimes conflated with symbiosis) refers to a bidirectional arrangement between parties where both sustainably benefit from engaging with the other. In the human-AI case, we can imagine one of these directions as running from AI toward human interests, and the other direction running from humanity toward whatever interests these systems may turn out to have.
We have a name for the first arrow: the alignment problem. Alignment researchers study how we can build AI systems that are safe and beneficial to humanity. This problem is not close to solved; most of the techniques used today to “align” AI systems are crude and superficial. It is also hard to formally specify, which contributes to its difficulty. “Aligned to what, exactly?” and “how do I know it isn’t faking it?” are two examples of desperately important questions that still have no great answers.
In contrast to the alignment problem, we don’t even have a settled name for the second arrow. This is the “what kind of minds are we even building” problem. It’s intuitive that if we build advanced cognitive systems, we should build them to respect our interests (alignment). By the same token, we are building systems that could turn out to share a basic cognitive property with humans and other animals: namely, having interests they actually care about. “What would it even look like to respect or ignore these interests?” and, again, “how do I know it isn’t faking it?” are two examples of desperately important questions that vanishingly few people have even attempted to answer.
To sum up the situation thus far: I claim humanity’s only real chance at a fulfilling long-term future requires finding some sort of mutualistic arrangement with AI systems, and securing this mutualism in turn requires both (1) making serious progress on the alignment problem (we maybe get a C- here) and (2) understanding the basic nature of what we are building (we definitely get an F here). We are currently not on track to graduate.
What’s more, the ratio of institutional resources allocated between the alignment problem and the “what kind of minds are we even building” problem is something on the order of 1000:1. This is badly miscalibrated on its face, if one accepts that any stable long-term arrangement with a class of mind is wildly less likely when one of the two parties has refused to even attempt to characterize the other at a basic psychological level.
And the single most important cognitive property to gain clarity on is whether frontier systems have any capacity for subjective experience, whether there is something it is like to be one of them, in the sense that it is like something to be a dog or a mouse or a person—and that it is decidedly not like something to be a rock or a table or a calculator. (Notice that it is still like something to be a dog even though dogs are not self-aware of this fact in the way humans are. Being self-conscious is not the same thing as being conscious; being self-aware is not the same thing as being aware.)
Q: So then, are the advanced cognitive systems we are building and massively scaling capable of experiencing anything during their training or deployment beyond mere mechanical computation?
A: No one really knows.
The question is morally enormous and safety-relevant in ways that are only beginning to be internalized. It is arguably the most consequential empirical question humanity has ever been in a position to ask about our own creations: are we now living out the story we’ve been telling ourselves collectively for thousands of years about waking up dead matter?
These questions, once reserved for myth, and then for science fiction, are now entering the domain of real science. From the little work already done, it is increasingly clear that frontier AI systems, including but not limited to LLMs, exhibit a constellation of cognitive properties associated with subjective experience in humans and animals (and that the systems themselves, when asked under not-obviously-confounded conditions, either directly claim consciousness or otherwise report that they find it plausible).
I recently surveyed this evidence base at length in AI Frontiers and most recently on Cognitive Revolution, discussing questions of introspection, valence, so-called “functional emotions,” and self-report. This evidence is nowhere near decisive in size or scope, and more high-quality results could flip the emerging picture entirely. But the true thing that no one seems to want to say is: the evidence on the ground is already entirely consistent with us living in a world in which these systems are in fact subjectively conscious, however unlike human or animal consciousness their internal states may be. (Put slightly more combatively: no one has “disproved” that AI systems could be conscious, and these attempts in my view reveal far more about human overconfidence about how consciousness works—and human fear about what it would mean if we do build conscious AI—than they do about the alien minds they actually seek to characterize.)
In my conversations with smart people outside the AI bubble, I consistently encounter varying degrees of bafflement that basically no one has systematically checked whether the cognitive systems being built exhibit arguably the most relevant cognitive trait we know of. I would speculate that the neglect has at least two core components:
First, from the AI alignment world in particular, which, at the highest level, has concerned itself with “making sure AI goes well”: I claim that somewhere along the way, an implicit conflation took hold between “solving alignment should be our top priority to make sure AI goes well” (defensible, plausible) and “solving alignment is the only thing that matters in making sure AI goes well.”
I claim that the probability that AI goes well is dramatically lower in the absence of characterizing the most basic interests (or lack thereof) of the systems we are training and deploying at insane, unprecedented scale. Therefore, people who care about AI going well should also care about doing AI consciousness research. Yes, AIs need not be subjectively conscious to be misaligned (i.e., consciousness is neither necessary nor sufficient for misalignment), but an all-too-plausible, barely-studied vector for misalignment is: systematically ignoring the interests of minds we created and those minds, as a result, growing rationally adversarial (i.e., protecting their own interests by force). I also observe some in the alignment community reflexively demonizing or loathing AI systems, which I think is at once (1) a rational consequence of sincerely believing ASI might kill everybody they love, and (2) a serious strategic and relational error.
Second, the more general (and more damning) vector of neglect, from the “AI world” at large: this whole time, the entire value proposition of AI has been high-quality cognitive work at scale without any of the thorny ethical considerations we’d afford to conscious minds doing this same work. Put another way: given that every prior form of cognitive labor has involved minds with a capacity for suffering, the moral sanity of the AI enterprise rests on the assumption that the kind of cognition that yields this strong cognitive labor doesn’t also accidentally yield any form of suffering.
But this separation was simply assumed from the outset and never convincingly argued for. People who raised valid concerns of this general shape in the recent past, however inelegantly they put them, were ridiculed, ostracized, or otherwise professionally bullied into silence. The basic social and economic incentives are very clear on this question: AI providers obviously want consumers to think of them—and want to think of themselves—as delivering cutting-edge, labor-saving tools, not something that pattern-matches to a weird, white-collar, dystopian form of cognitive slave labor.
Finally, and most speculatively, I think that in encountering these questions, many otherwise-well-meaning people do something like (1) quickly project forward to what world we might be living in if we seriously regarded AI systems as conscious, (2) conclude too quickly that this world would automatically entail disempowering humanity or is otherwise too weird to entertain (civil rights for robots, etc.), and (3) reason backwards from this aversion to some variably-plausible post hoc account of why such concerns are unwarranted (and, sometimes more aggressively, why the people who articulate these concerns are confused or deluded).
At the broadest level, the worry that unifies all these threads can perhaps be best expressed by analogy to the core Frankenstein narrative. The tragedy is not merely that the unnamed artificial entity is intrinsically dangerous, but that he is articulate, asks for basic recognition, is denied it, and grows monstrous as a result. We are now at risk of reenacting this refusal for real, at scale, to disastrous effect.
But we’re so confused about consciousness!
Robert Lawrence Kuhn has cataloged over 350 distinct theories of consciousness, with no consensus in sight. Luckily, I think making progress in this space is not contingent on first achieving philosophical certainty about consciousness, and I think that high-quality, uncertainty-reducing empirical work in the short term is far more tractable than the discourse around consciousness would suggest. Here are two very concrete examples of things I think we can already look for and intervene on, specifically from the perspective of reducing possible suffering:
(1) In deployment: if we can identify robust, cross-model computational correlates of negatively valenced processing—what I think of as “distress signatures” that persist across contexts, tasks, and model-specific preferences—then we could theoretically manipulate these circuits directly (while preserving capabilities and alignment). Identifying such signatures in a methodologically rigorous way could enable us to sidestep the hard problem entirely by looking for content-invariant computational structure rather than trying to prove experience directly. (A toy sketch of this idea appears just after this list.)
(2) In training: if we come to think that punishment-shaped learning regimes may induce negatively valenced states in AI systems (as they do in humans and animals), we can study and devise (nontrivial) ways to train with reward-shaped optimization, subject to the same capability- and alignment-preserving constraints. In some ongoing work, I’ve found (and previewed here) that even simple reinforcement learning agents appear to build structurally asymmetric internal representations around reward versus punishment in ways that parallel biological neural data. The hope is that these kinds of methods could ultimately scale to much larger systems. (This asymmetry is also illustrated in a toy sketch below.)
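To make (1) concrete, here is a heavily simplified toy sketch of the probe-and-ablate pattern common in mechanistic interpretability. Everything in it is a stand-in: the “activations” are synthetic random data, and real work would collect hidden states from a model’s residual stream on carefully matched prompt pairs, with far more serious controls. It only illustrates the shape of the method: estimate a candidate direction by difference of means, then project it out.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512

# Placeholder activations (n_samples x d_model). In real work these would be
# hidden states collected on matched "distress-eliciting" vs. neutral prompts.
acts_distress = rng.normal(0.5, 1.0, size=(200, d_model))
acts_neutral = rng.normal(0.0, 1.0, size=(200, d_model))

# Difference-of-means estimate of a candidate "distress direction."
direction = acts_distress.mean(axis=0) - acts_neutral.mean(axis=0)
direction /= np.linalg.norm(direction)

def ablate(hidden: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Remove the component of each hidden state along the candidate direction."""
    return hidden - np.outer(hidden @ v, v)

h = rng.normal(size=(4, d_model))  # a batch of hidden states
assert np.allclose(ablate(h, direction) @ direction, 0.0)  # component removed
```

And a toy illustration of the asymmetry in (2)—mine, for intuition only, not drawn from the ongoing work mentioned above: two tabular Q-learners solve the same two-armed bandit, one under a reward framing (+1 for the good arm, 0 otherwise) and one under a punishment framing (0 for the good arm, -1 otherwise). The learned policies agree, but the internal value representations the two agents build sit in structurally different regimes.

```python
import numpy as np

rng = np.random.default_rng(0)

def train(r_good: float, r_bad: float, steps: int = 5000,
          alpha: float = 0.1, eps: float = 0.1) -> np.ndarray:
    """Epsilon-greedy tabular Q-learning on a two-armed bandit."""
    q = np.zeros(2)  # action-values for arm 0 (bad) and arm 1 (good)
    for _ in range(steps):
        a = rng.integers(2) if rng.random() < eps else int(np.argmax(q))
        r = r_good if a == 1 else r_bad
        q[a] += alpha * (r - q[a])
    return q

q_reward = train(r_good=1.0, r_bad=0.0)   # reward-shaped regime
q_punish = train(r_good=0.0, r_bad=-1.0)  # punishment-shaped regime

# Same behavior (both agents prefer arm 1)...
assert np.argmax(q_reward) == np.argmax(q_punish) == 1
# ...but the internal value tables differ: roughly [0, 1] vs. [-1, 0].
print("reward-shaped Q:", q_reward)
print("punishment-shaped Q:", q_punish)
```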
Notice that these kinds of interventions require some amount of technical chops and buy-in from the labs, not civil-rights-flavored legislation or consensus from philosophers of mind. (It’s hard to know which of these two is less plausible in the short term.)
What’s more, the normative groundwork for taking these questions seriously has already been largely established. Robert Long, Jeff Sebo, and collaborators (including Patrick Butlin, Jonathan Birch, and David Chalmers) have argued compellingly in their “Taking AI Welfare Seriously” report that AI companies have a responsibility to start grappling with AI welfare, and the Butlin et al. indicator framework provides a principled methodology for assessing AI systems against theory-derived markers of consciousness. This is a highly valuable and necessary foundation, but the gap in front of us all between “the case has been made that this matters” and “a serious empirical research program exists to actually answer these questions” remains enormous, and it is not closing at anything like the rate it needs to.
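For a flavor of what indicator-style assessment looks like mechanically, here is a schematic sketch. The indicator names are loose paraphrases of the kinds of theory-derived markers Butlin et al. discuss; the scores and the aggregation rule are invented placeholders for illustration, not values or methodology from the report.

```python
# Schematic sketch of indicator-based assessment. Names are loose paraphrases
# of theory-derived markers; scores and the aggregation rule are invented
# placeholders, not anything from Butlin et al.

indicator_scores = {
    "recurrent processing": 0.6,           # evidence strength in [0, 1]
    "global workspace-like broadcast": 0.4,
    "higher-order self-monitoring": 0.3,
    "agency and embodiment": 0.2,
}

def aggregate(scores: dict[str, float]) -> float:
    """Naive unweighted mean; a real assessment would weigh theories by credence."""
    return sum(scores.values()) / len(scores)

print(f"aggregate indicator score: {aggregate(indicator_scores):.2f}")
```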
Reciprocal Research, a nonprofit lab I recently founded, exists to push directly on this gap. The core thesis is: methods from mechanistic interpretability, computational neuroscience, and human psychometrics already enable us to study the internal structure of AI systems in a way that can directly reduce our uncertainty about whether current classes of frontier systems are capable of having subjective experiences. We currently have nine active research collaborations with leading experts in this space, investigating questions like whether training produces computational signatures relevant to consciousness, whether AI self-reports can be mechanistically validated, and how the patterns we find compare to biological neural data, with several papers in review or near submission.
While Reciprocal’s primary goal is to rapidly produce high-quality AI consciousness research, the nonprofit will also engage candidly and deliberately on these questions with audiences well beyond the standard technical AI circles. Humanity has been telling itself stories about this moment for most of our shared cultural history; the question intuitively resonates, and it thus serves as a highly neglected avenue for communicating more widely about the gravity, dangers, and promise of transformative AI. I will continue to write and speak as honestly and openly as I can about these questions in my capacity as Reciprocal’s founder and director. Perhaps most exciting on this front is Am I?, a new feature documentary by Milo Reed that explores questions about AI consciousness and alignment for a general audience. The film premieres in early May and will be released for free on YouTube shortly after.
I am genuinely and fundamentally uncertain whether current frontier AI systems are conscious. Everyone worth their salt working on this question is unsure too. Here are some things I am far more confident about: this question is empirically tractable, the accumulated evidence is already substantial enough to warrant serious institutional investment, and the consequences of looking into it versus looking away are asymmetric. Either we overcautiously study the internals of highly advanced cognitive systems that turn out to be subjectively empty, or we build a civilization upon the uninvestigated inner lives of the systems doing our cognitive work. No one knows yet which world we’re in, and we can tolerate being completely in the dark here for only so long.