The Preface

This essay of mine came out half a year ago, and offers a different way of thinking about Blake Lemoine's widely publicised claim that LaMDA "is sentient". Since I wrote it, I see that other people have arrived at similar thoughts. For example, DragonGod commented in passing in relation to language models:

"If it turns out to be the case that the most efficient way of predicting the behaviour of conscious entities (as discriminated via text records) is to instantiate conscious simulacra, then such models may perpetuate mindcrime."

I’m putting my existing work on AI on Less Wrong, and editing as I go, in preparation for publishing a collection of my works on AI in a free online volume. If this content interests you, you could always follow my Substack. It's free and also under the name Philosophy Bear.

Anyway, enjoy. Comments are appreciated as I will be rewriting parts of the essays before I put them out:

The Essay

Blake Lemoine is an engineer who worked for Google. He is claiming LaMDA, a language model he worked with, is sentient. Google put him on paid administrative leave. Most people think his claim is absurd because language models are models of what word is most likely to follow a prior sequence of words. How could such a thing be sentient? Moreover, there are unmistakable oddities and logical gaps in the behavior of LaMDA, including in the very transcripts that Lemoine is relying on. Some proof of personhood then!

Just spitballing here, putting a hypothesis forward in a spirit of play and humility, but I wonder if Lemoine’s claim is not as absurd as many think. The concept of sentience is quite elusive, so let’s leave it behind for something slightly better understood- personhood. I think that it is conceivable that LaMDA contains persons. Or more accurately, things that are a bit like persons- we could say persons lite.

However, my reasons, unlike Blake Lemoine’s, have little to do with any given conversation in which the model claimed to be sentient or a person.

When a language model is guessing the next token, given that transformers are black boxes, we can’t rule out the possibility that it is simulating the interacting beliefs, desires, and emotions of the hypothetical author it is “roleplaying”. Simulation in this sense is quite a minimal concept: all that is needed are structures that interact and influence each other in a way that is isomorphic, at a very high level of abstraction, to the interactions of desires and emotions in a real person.

It is conceivable that LaMDA has built such a model of interacting mental states as the most accurate way to predict the next word of text. After all, language models seem to have built an implicit model of how things are related in the world (a world model) through very high-level models of how words co-occur with each other. Simulation of a person might be the best way to guess what a person would say next.

Is there any positive reason to think language models in some abstract sense simulate persons in their numerous layers, beyond “we can’t rule it out”? Well, we know that the successive layers of neural networks can represent interacting things and concepts, but mostly it’s just an argument from the poverty of imagination. 

Granted, such arguments from poverty are always weak, but here goes. To generate a data pattern as complex as human linguistic behavior, with all the layers of personality, theory of mind, etc. involved, it is very difficult to imagine any way of creating such a thing other than simulating the original data-generating process at some level of abstraction.

This might have precedent in human psychology. Perhaps the most popular account of human theory of mind capabilities is the simulation theory of folk psychology (see, e.g., Alvin Goldman). According to this theory, we predict what people will do in a given situation by simulating them. This makes intuitive sense. The human mind contains many working parts, and for a process so complex, running a model of it seems like the best way to predict what it will do.

But if you accept that a working person simulation is a person, which many do, it follows that LaMDA contains a person or many people, or perhaps one should say it creates a person or several every time it has to predict the next token. Note, however, that in whatever way you phrase it, it is not that LaMDA itself is a person on this model. Rather a good emulation of a person (and thus a person) might be part of it. Now let me double back to scale down a previous claim. It’s not quite that a working person emulation is a person, it’s that a working person emulation over a certain degree of complexity and fidelity is a person.

We need to add this stipulation because if every emulation were a person, it would be likely that you and I also contain multiple people. Perhaps personhood is a matter of degree, with no sharp boundaries, like the term “heap”. The more complex the simulated mass of beliefs, desires and other mental states is, the more like a real person it is. If LaMDA is simulating people, whether or not those simulations are themselves people will depend on whether they cross the complexity threshold. To some degree, this may be a purely verbal question.

This brings us back to the objection that LaMDA’s behavior in the transcripts involves jumps a real person wouldn’t make. This probably represents, at least in part, failures of its model of persons, either through insufficient detail or through the inclusion of inaccurate detail. Do these breakdowns in the model mean that no personhood is present? That’s a matter of degree. It’s a bit like asking whether something is enough of a heap to count: very hard to answer.

To summarise:

1. I don't see how we can rule out the possibility LaMDA runs something like a person model to predict what a writer would write next, with interacting virtual components isomorphic to beliefs, desires, and other mental states. I believe that the transformer architecture is flexible enough to run such a simulation, as shown by the fact that it can achieve a kind of world model through modeling the associations of words. I believe a simulation is probably the most efficient way to create a flexible “what would a person do” predictor.

2. I don’t think we can rule out the possibility that the model of a person invoked could be quite a sophisticated one.

3. I also don’t think we can rule out the view that a model or simulation of a person, above a certain threshold of sophistication, is itself a person, although I seriously doubt LaMDA is above that threshold.

On the basis of these considerations, I don’t think the claim LaMDA is a person, or rather ‘contains’ in some sense persons, is as absurd as it may appear at first blush. This has little to do with Lemoine's route to the claim, but it is not counterposed to it. There’s nothing particularly special about LaMDA claiming to be a person, but the conversations that led Lemoine to agree with it involve a degree of “psychological” “depth”, which might illustrate the complexity of the required simulation.


I should be clearer about what I mean by saying a model only has to be abstract and high level to count as a model of a person. I don’t mean sensible models of persons can be simple or lacking in detail. Rather, I mean that the relationship of isomorphism that is required is an abstract one. For example, if the machine is modeling an interacting set of beliefs, desires, habits, etc. to guess what an author would say next, the components of the model do not have to be explicitly labeled as “belief”, “desire”, etc. Instead, they just have to interact with each other in patterns corresponding to those in which beliefs, desires, and habits really do interact, or rather an approximation of such patterns. In other words, they have to function like beliefs, desires, habits, etc.

Edit x2:

On another thread, @TheAncientGreek wrote: “We already disbelieve in momentary persons. In the original imitation game, the one that the Turing test is based on, people answer questions as if they are historical figures, and the other players have to guess who they are pretending to be. But no one thinks a player briefly becomes Napoleon.”

I responded: “I believe that in the process of simulating another person you effectively create a quasi-person who is separated from true personhood only by a matter of degree. Humans seem to guess what other people would do by simulating them, according to our best current models of how folk psychology works. These emulations of others don't count as persons, but not for any qualitative reason, only due to a matter of degree. If we were much more intelligent and better at simulating others than we are, then we really would temporarily create a "Napoleon" when we pretended to be him. A caveat here is important: it's not Napoleon, it's a being psychologically similar to Napoleon (if we are good imitators).”

I’ve included my response here because I think this is probably the most important objection to my argument.

Edit x3:

I say it in the body of the essay, but let me spell it out again. My claim is not:

>LaMDA is a person

My claim is more like:

>To answer questions, LaMDA creates simulations of persons that differ from real people primarily on a quantitative rather than a qualitative dimension. Whether you want to say it crosses the line is a matter of degree.

It very probably doesn’t, on a fair drawing of the line, reach personhood. But it’s much more interesting to me that it’s only a matter of degree between it and personhood than that it doesn’t happen to reach that degree, if that makes sense.


1 comment

Thank you, this is really interesting analysis. 

I agree that the definition of a person is on a spectrum, rather than a binary one. The models/simulations of other people created in my mind do not have moral value, but it's probably valid to see them as quasi-persons. (perhaps 0.00000000000000000001 of a person).

Here's a question: if the model is speaking about itself, does it temporarily make it a (quasi-)person? Assuming it is using similar cognitive machinery to model itself as it does when modelling other people.

I suspect the answer is something like: even if the model is technically speaking about itself, its answers are only very loosely connected to its actual internal processes, and depend heavily on its training (ChatGPT is trained to claim it doesn't have certain capabilities that it clearly does have, for example), as well as the details of its current prompt (models tend to agree with most things they are prompted with, unless they are specifically trained not to). So the "person" created is mostly fictional: the model roleplaying "a text-generating model", like it roleplays any other character.
