mAIry's room: AI reasoning to solve philosophical problems

Stuart_Armstrong

This post grew out of a conversation with Laurent Orseau; we were initially going to write a paper for a consciousness/philosophy journal of some sort, but that now seems unlikely, so I thought I'd post the key ideas here.

A summary of this post can be found here - it even has some diagrams.

The central idea is that by thinking in terms of AIs or similar artificial agents, we can get some interesting solutions to old philosophical problems, such as the Mary's room/knowledge problem. In essence, simple agents exhibit similar features to Mary in the thought experiment, so (most) explanations of Mary's experience must also apply to simple artificial agents.

To summarise:

Artificial agents can treat certain inputs as if the input were different from mere information.
This analogises loosely to how humans "experience" certain things.
If the agent is a more limited (and more realistic) design, this analogy can get closer.
There is an artificial version of Mary, mAIry, which would plausibly have something similar to what Mary experiences within the thought experiment.

Edit: See also orthonormal's sequence here.

Mary's Room and the Knowledge problem

In this thought experiment, Mary has been confined to a grey room from birth, exploring the outside world only through a black-and-white monitor.

Though isolated, Mary is a brilliant scientist, and has learnt all there is to know about light, the eye, colour theory, human perception, and human psychology. It would seem that she has all possible knowledge that there could be about colour, despite having never seen it.

Then one day she gets out of her room, and says "wow, so that's what purple looks like!".

Has she learnt anything new here? If not, what is her exclamation about? If so, what is this knowledge - Mary was supposed to know everything there was to know about colour already?

Incidentally, I chose "purple" as the colour Mary would see, as the two colours most often used, red and blue, lead to confusion as to what "seeing red/blue" means: is this about the brain, or is it about the cones in the eye? But seeing purple is strictly about perception in the brain.

Example in practice

Interestingly, there are real examples of Mary's room-like situations. Some people with red-green colour-blindness can suddenly start seeing new colours with the right glasses. Apparently this happens because the red and green cones in their eyes are almost identical, so tend to always fire together. But "almost" is not "exactly", and the glasses force green and red colours apart, so the red and green cones start firing separately, allowing the colour-blind to see or distinguish new colours.

Can you feel my pain? The AI's reward channel

This argument was initially presented here.

AIXI

Let's start with the least human AI we can imagine: AIXI, which is more an equation than an agent. Because we'll be imagining multiple agents, let's pick any computable version of AIXI, such as AIXItl.

There will be two such AIXItl's, called $A_{r}$ and $A_{q}$, and they will share observations and rewards: at turn $i$, these will be $o_{i}$, $r_{i}$, and $q_{i}$, with $r_{i}$ the reward of $A_{r}$ and $q_{i}$ the reward of $A_{q}$.

To simplify, we'll ignore the game theory between the agents; each agent will treat the other as part of the environment and attempt to maximise their reward around this constraint.

Then it's clear that, even though $r_{i}$ and $q_{i}$ are both part of each agent's observation, each agent will treat their own reward in a special way. Their actions are geared to increasing their own reward; $A_{r}$ might find $q_{i}$ informative, but has no use for it beyond that.

For example, $A_{r}$ might sacrifice current $r_{i}$ to get information that could lead it to increase $r_{j > i}$ ; it would never do so to increase $q_{j > i}$ . It would sacrifice all $q$ -rewards to increase the expected sum of $r_{i}$ ; indeed it would sacrifice its knowledge of $q_{i}$ entirely to increase that expected sum by the tiniest amount. And $A_{q}$ would be in the exact opposite situation.
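The asymmetry can be illustrated with a deliberately tiny sketch. This is not AIXI or AIXItl: it assumes a one-step world with known outcomes, and the names `Outcome`, `best_action`, `"share"` and `"grab"` are made up for the illustration. Both reward channels are visible to both agents, but each agent only optimises its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """What both agents observe after an action: both reward channels."""
    r: float  # reward of A_r
    q: float  # reward of A_q


def best_action(outcomes, own_channel):
    """Pick the action maximising the agent's own channel.

    The other channel is visible in `outcomes` but never enters the objective.
    """
    return max(outcomes, key=lambda action: getattr(outcomes[action], own_channel))


# Hypothetical one-step world: "grab" gains A_r a sliver of r and destroys all q.
outcomes = {
    "share": Outcome(r=1.00, q=100.0),
    "grab": Outcome(r=1.01, q=0.0),
}

choice_r = best_action(outcomes, "r")  # A_r sacrifices all q for a tiny gain in r
choice_q = best_action(outcomes, "q")  # A_q is in the exact opposite situation
```

In this toy, `q` is pure information to the agent optimising `r`: it can be read, but it never moves the choice.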

The $A_{r}$ agent would also do other things, like sacrificing $r_{i}$ in counterfactual universes to increase $r_{i}$ in this one. It would also refuse the following trade: perfect knowledge of the ideal policy that would have maximised expected $r_{i}$ , in exchange for the $r_{i}$ being set to $0$ from then on. In other words, it won't trade $r_{i}$ for perfect information about $r_{i}$ .

So what are these reward channels to these agents? It would go too far to call them qualia, but they do seem to have some features of pleasure/pain in humans. We don't feel the pleasure and pain of others in the same way we feel our own. We don't feel counterfactual pain as we feel real pain; and we certainly wouldn't agree to suffer maximal pain in exchange for knowing how we could have otherwise felt maximal pleasure. Pleasure and pain can motivate us to action in ways that few other things can: we don't treat them as pure information.

Similarly, the $A_{r}$ doesn't treat $r_{i}$ purely as information either. To stretch the definition of a word, we might say that $A_{r}$ is experiencing $r_{i}$ in ways that it doesn't experience $q_{i}$ or $o_{i}$ .

Let's try and move towards a more human-like agent.

TD-Lambda learning

TD stands for temporal difference learning: learning by the difference between a predicted reward and the actual reward. For the TD-Lambda algorithm, the agent uses $V(s)$: the estimated value of the state $s$. It then goes on its merry way, and as it observes histories of the form $\dots s_{i - 1} a_{i - 1} r_{i - 1} s_{i} a_{i} r_{i} s_{i + 1} a_{i + 1} r_{i + 1}$, it updates its estimate of all its past $V(s_{i})$ (with a discount factor of $0 \leq λ \leq 1$ for more distant past states $s_{j < i}$).
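For reference, one common way of writing such an update is the backward-view (eligibility trace) form of TD-Lambda. This is a sketch under assumptions that go beyond the text above: a reward discount $\gamma$, a learning rate $\alpha$, and accumulating traces $e(s)$ that all start at zero.

$$
\begin{aligned}
\delta_{i} &= r_{i} + \gamma V(s_{i+1}) - V(s_{i}) \\
e(s) &\leftarrow \gamma \lambda \, e(s) + \mathbb{1}[s = s_{i}] \quad \text{for all } s \\
V(s) &\leftarrow V(s) + \alpha \, \delta_{i} \, e(s) \quad \text{for all } s
\end{aligned}
$$

Here $\delta_{i}$ is the gap between predicted and actual reward, and the trace $e(s)$ is what makes a single surprise reach back to every recently visited state, with weight decaying in $\lambda$.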

Again, imagine there are two agents, $T_{r}$ and $T_{q}$ , with separate reward functions $r$ and $q$ , and that each agent gets to see the other's reward.

What happens when $T_{r}$ encounters an unexpectedly large or small value of $q_{i}$ ? Well, how would it interpret the $q_{i}$ in the first place? Maybe as part of the state-data $s_{i + 1}$ . In that case, an unexpected $q_{i}$ moves $T_{r}$ to a new, potentially unusual state $s_{i + 1}$ , rather than an expected $s_{i + 1}^{'}$ . But this is only relevant if $V (s_{i + 1})$ is very different from $V (s_{i + 1}^{'})$ : in other words, unexpected $q_{i}$ are only relevant if they imply something about expected $r_{i}$ . And even when they do, their immediate impact is rather small: a different state reached.

Compare what happens when $T_{r}$ encounters an unexpectedly large or small value of $r_{i}$ . The impact of that is immediate: the information percolates backwards, updating all the $V (s_{j < i})$ . There is an immediate change to the inner variables all across the agent's brain.

In this case, the 'experience' of the $T_{r}$ agent encountering high/low $r_{i}$ resembles our own experience of extreme pleasure/pain: immediate involuntary re-wiring and change of estimates through a significant part of our brain.
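The contrast between a surprising $r_{i}$ and a surprising $q_{i}$ can be sketched with a small tabular TD-Lambda learner. This is a minimal illustration rather than a definitive implementation: it assumes the backward-view (eligibility trace) form of the algorithm, a reward discount `gamma`, and made-up state labels, and it treats $q_{i}$ as nothing more than part of the next state's label.

```python
from collections import defaultdict


def td_lambda_step(V, traces, s, r, s_next, alpha=0.1, gamma=1.0, lam=0.9):
    """One backward-view TD(lambda) update; mutates V and traces in place."""
    delta = r + gamma * V[s_next] - V[s]  # surprise: actual vs predicted reward
    for state in traces:
        traces[state] *= gamma * lam      # older states get less credit
    traces[s] += 1.0                      # the current state gets full credit
    for state, eligibility in traces.items():
        V[state] += alpha * delta * eligibility  # surprise percolates backwards
    return delta


def run_episode(transitions):
    """Run (state, reward, next_state) transitions from a blank value table."""
    V = defaultdict(float)
    traces = defaultdict(float)
    for s, r, s_next in transitions:
        td_lambda_step(V, traces, s, r, s_next)
    return dict(V)


# Three uneventful steps, then an unexpectedly large r on the last step.
surprising_r = [(0, 0.0, 1), (1, 0.0, 2), (2, 0.0, 3), (3, 10.0, 4)]
# Same walk, but the surprise is a huge q, which only relabels the state reached.
surprising_q = [(0, 0.0, 1), (1, 0.0, 2), (2, 0.0, 3), (3, 0.0, "4, with huge q")]

V_r = run_episode(surprising_r)  # every visited state's estimate has moved
V_q = run_episode(surprising_q)  # nothing has moved: a different state was reached
```

With these settings the surprising reward changes the estimates of all four visited states at once, by amounts that shrink with distance, while the surprising $q_{i}$ changes none of them because the new state's estimated value is the same as the expected one.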

We could even give $T_{r}$ a certain way of 'knowing' that high/low $r_{i}$ might be incoming; maybe there's a reliability score for $V (s_{i})$ , or some way of tracking variance in the estimate. Then a low reliability or high variance score could indicate to the $T_{r}$ that high/low $r_{i}$ might happen (maybe these could feed into the learning rate $α$ ). But, even if the magnitude of the $r_{i}$ is not unexpected, it will still cause changes across all the previous estimates - even if these changes are in some sense expected.

mAIry in its room

So we've established that artificial agents can treat certain classes of inputs in a special way, "experiencing" their data (for lack of a better word) in a way that is different from simple information. And sometimes these inputs can strongly rewire the agent's brain/variable values.

Let's now turn back to the initial thought experiment, and posit that we have a mAIry, an AI version of Mary, similarly brought up without the colour purple. mAIry stores knowledge as weights in a neural net, rather than connections of neurons, but otherwise the thought experiment is very similar.

mAIry knows everything about light, cameras, and how neural nets interpret concepts, including colour. It knows that, for example, "seeing purple" corresponds to a certain pattern of activation in the neural net. We'll simplify, and just say that there's a certain node $n_{p}$ such that, if its activation reaches a certain threshold, the net has "seen purple". mAIry is aware of this fact, and can identify the $n_{p}$ node within itself, and perfectly predict the sequence of stimuli that could activate it.

If mAIry is still a learning agent, then seeing a new stimulus for the first time is likely to cause a lot of changes in the weights in its nodes; again, these are changes that mAIry can estimate and predict. Let $c_{p}$ be a Boolean corresponding to whether these changes have happened or not.

What dreams of purple may come...

A sufficiently smart mAIry might be able to force itself to "experience" seeing purple, without ever having seen it. If it has full self-modification powers, it could manually activate $n_{p}$ and cause the changes that result in $c_{p}$ being true. With more minor abilities, it could trigger some low-level neurons that caused a similar change in its neural net.

In terms of the human Mary, these would correspond to something like self-brain surgery and self-hypnosis (or maybe self-induced dreams of purple).

Coming out of the room: the conclusion

So now assume that mAIry exits the room for the first time, and sees something purple. It's possible that mAIry has successfully self-modified to activate $n_{p}$ and set $c_{p}$ to true. In that case, upon seeing something purple, mAIry gets no extra information, no extra knowledge, and nothing happens in its brain that could correspond to a "wow".

But what if mAIry has not been able to self-modify? Then upon seeing a purple flower, the node $n_{p}$ is strongly activated for the first time, and a whole series of weight changes flow across mAIry's brain, making $c_{p}$ true.

That is the "wow" moment for mAIry. Both mAIry and Mary have experienced something; something they both perfectly predicted ahead of time, but something that neither could trigger ahead of time, nor prevent from happening when they did see something purple. The novel activation of $n_{p}$ and the changes labelled by $c_{p}$ were both predictable and unavoidable for a smart mAIry without self-modification abilities.
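The gap between predicting the change and undergoing it can be made concrete with a toy sketch. Everything here is an assumption made for illustration: the class `MAIry`, the RGB encoding of stimuli, the way $n_{p}$ responds, and the simple weight update are all hypothetical stand-ins, not a model of any real network.

```python
THRESHOLD = 0.5  # hypothetical activation level at which the net has "seen purple"


class MAIry:
    """Toy learner without self-modification: it can forecast its own update."""

    def __init__(self, learning_rate=0.5):
        self.weights = [0.1, 0.1, 0.1]  # stand-in for the rest of the net
        self.learning_rate = learning_rate
        self.c_p = False                # have the purple-induced changes happened?

    def n_p_response(self, stimulus):
        """How strongly n_p would respond: red and blue present, green absent."""
        red, green, blue = stimulus
        return max(0.0, min(red, blue) - green)

    def predicted_weights(self, stimulus):
        """Perfect forecast of the post-stimulus weights; changes nothing."""
        activation = self.n_p_response(stimulus)
        return [w + self.learning_rate * activation * x
                for w, x in zip(self.weights, stimulus)]

    def see(self, stimulus):
        """Actually receive the stimulus: the update happens, predicted or not."""
        activation = self.n_p_response(stimulus)
        self.weights = self.predicted_weights(stimulus)
        if activation >= THRESHOLD:
            self.c_p = True
        return activation


PURPLE = (0.8, 0.0, 0.8)  # hypothetical (red, green, blue) encoding

mairy = MAIry()
weights_in_room = list(mairy.weights)
forecast = mairy.predicted_weights(PURPLE)     # known in full while in the room
unchanged_by_forecast = (mairy.weights == weights_in_room) and not mairy.c_p

mairy.see(PURPLE)                              # leaving the room
forecast_was_exact = mairy.weights == forecast
```

In this sketch the forecast is exact, yet making it leaves the weights and `c_p` untouched; only `see` moves them, and mAIry has no way of calling `see` without the stimulus.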

At this point the analogy I'm trying to draw should be clear: activating $n_{p}$, and the unavoidable changes in the weights that cause $c_{p}$ to be true, are similar to what a TD-Lambda agent goes through when encountering unexpectedly high or low rewards. They are a "mental experience", unprecedented for the agent even if entirely predictable.

But they are not evidence for epiphenomenalism or against physicalism - unless we want to posit that mAIry is non-physical or epiphenomenal.

It is interesting, though, that this argument suggests that qualia are very real, and distinct from pure information, though still entirely physical.

This post clearly helped a lot of other people, but it follows a pattern, common to many other posts on Less Wrong, that I consider negative. The valuable contribution here is not the formalisation, but the generator behind the formalisation. The core idea appears to be the following:

"Human brains contain two forms of knowledge: explicit knowledge, and weights that are used in implicit knowledge (admittedly the former is hacked on top of the latter, but that isn't relevant here). Mary doesn't gain any extra explicit knowledge from seeing blue, but her brain changes some of her implicit weights so that when a blue object activates in her vision a sub-neural network can connect this to the label "blue"."

Unfortunately, there is a wall of maths that you have to wade through before this is explained to you. I feel it is much better when you provide your readers with a conceptual understanding of what is happening and only then include the formal details.

The valuable contribution here is not the formalisation, but the generator behind the formalisation.

I disagree; the "core idea" I'd already thought of before seeing the post, but the valuable contribution to me was seeing why the core idea has to be true and how it works mechanically, rather than being just a plausible-seeming sentence. Technical explanation vs. verbal explanation.

I don't necessarily see that as a versus. A good verbal explanation can provide enough information for you to simulate a formal model in your head. And obviously it'll never be as reliable as working through a formal description step by step, but often that level of reliability isn't required.

Upvoted for the useful comment, but my mind works completely the opposite to this - only through seeing the math does the formalism make sense to me. I suspect many lesswrongers are similar in that respect, but it's interesting to see that not all are.

(also, yes, I could make my posts easier to follow, I admit that; one day, when I have more time, I will work on that)

FWIW I bounced off the post the first couple times I looked at it and was glad for Chris' comment doing a good distillation of it, and am now more likely to read through the whole thing at some point.

Thanks for commenting that!

Hmm, interesting. Now that you're stating the opposite, it's pretty clear to me that there are very particular assumptions underlying my claim that, "the valuable contribution here is not the formalisation, but the generator behind the formalisation" and maybe I should be more cautious about generalising to other people.

One of my underlying assumptions was my model of becoming good at maths - focusing on what ideas might allow you to generate the proof yourself, rather than trying to remember the exact steps. Of course, it is a bit parochial for me to act as though this is the "one true path".

Curated. After reading this post, I've found my thoughts frequently reusing its concepts, and feel like it has significantly clarified my understanding of qualia as a physicalist concept. Mary's room feels pretty strongly dissolved in my mind, and in a very elegant and concise form.

While I referenced this only relatively briefly in my post on integrating subagents, the way this post considerably weakened the distinction between receiving sense data and modifying your values on a conceptual level felt like an important additional element in my understanding of why people/subagents might want to resist belief updates.

Cool. Glad it was useful!

Nominated. This post was perhaps the most crisply I've felt deconfused about a topic.

Notably, I think I only actually got deconfused on a higher level of abstraction than Stuart was aiming at. My eyes glossed over the technical discussion, and I didn't understand the post until Chris Leong's summary. My understanding is that if you actually read and understood the technical discussion then you got even more deconfused.

I'm interested in reviews from people who understood the entire post, but feel like I understood it enough to nominate it. There has been at least one time when I've noticed someone being confused about Mary's Room and I've pointed them to this post to resolve it. I have a vague sense of generally being less confused about qualia.

I'm glad I saw this nomination. This post is amazing.

Nominated for the reasons given in my curation notice.

Wanted to note that I really enjoyed this and thought it was quite lucid. A great example of how thinking about possible detailed implementations of a mind helps you dodge a lot of the fudging that comes with various philosophical puzzles.

I've already written a comment with a suggestion that this post needs a summary so that you can benefit from it, even if you don't feel like wading through a bunch of technical material.

Then upon seeing a purple flower, the node $n_{p}$ is strongly activated for the first time, and a whole series of weight changes flow across mAIry's brain, making $c_{p}$ true.

Another way of describing what happens: performing the update may push the weights into a whole new part of model space. mAIry may have been able to predict what the update would be, roughly, but wouldn't have a memory of having had their brain inhabit this part of the model space. For many long-term memory setups, it would probably make sense for an agent to remember this kind of update as qualitatively different from past experience.

I like that way of seeing it.

I don't have much to say in a review I didn't already say in my nomination. But, a key point of this post is "the math checks out in a way that thoroughly dissolves a confusion" and I'd kinda like it if someone else did a more thorough review that the math actually checks out.

Really, no link to orthonormal's sequence?

I think you haven't zeroed in on the point of the Mary's Room argument. According to this argument, when Mary exclaims, "So that's what red looks like!" she is really pointing to a non-verbal belief she was previously incapable of forming. (I don't mean the probability of her statement, but the real claim to which she attached the probability.) So it won't convince any philosophers if you talk about mAIry setting a preexisting Boolean.

Of course this argument fails to touch physicalism - some version of mAIry could just form a new memory and acquire effective certainty of the new claim that "Red looks like {memories of red}," a claim which mAIry was previously incapable of even formulating. (Note that eg this claim could be made false by altering her memories or showing her a green apple while telling her "Humans call this color 'red'." The claim is clearly meaningful, though a more carefully formulated version might be tautological.) However, the OP as written doesn't quite touch Mary's Room.

According to this argument, when Mary exclaims, “So that’s what red looks like!” she is really pointing to a non-verbal belief she was previously incapable of forming.

That belief is called "the changes that result in $c_{p}$ being true" in mAIry's case. I mean, the post even explicitly states that these changes could not be triggered ahead of time.

Added a link to orthonormal's sequence, thanks!

The Boolean $n_{p}$ was a simplification of "a certain pattern of activation in the neural net", corresponding to seeing purple. The Boolean $c_{p}$ was tracking the changes in a still-learning neural net caused by seeing purple.

So there are parts of mAIry's brain that are activating as never before, causing her to "learn" what purple looks like. I'm not too clear on how that can be distinguished from a "non-verbal belief": what are the key differentiating features?

One is phrased or presented as knowledge. I don't know the best way to approach this, but to a first approximation the belief is the one that has an explicit probability attached. I know you talked about a Boolean, but there the precise claim given a Boolean value was "these changes have happened", described as an outside observer would, and in my example the claim is closer to just being the changes.

Your example could be brought closer by having mAIry predict the pattern of activation, create pointers to memories that have not yet been formed, and thus formulate the claim, "Purple looks like $n_{p}$." Here she has knowledge beforehand, but the specific claim under examination is incomplete or undefined because that node doesn't exist.

So it won’t convince any philosophers if you talk about mAIry setting a preexisting Boolean.

Not all philosophers are qualiaphiles.

An annoying philosopher would ask whether you could glean knowledge of your "meta-qualia", aka what it consciously feels like to experience what something feels like. The problem is that fully understanding our own consciousness is sadly impossible. If a computer discovers that in a certain location on its hardware it has stored a picture of a dog, it must then store that information somewhere else, but if it subsequently tries to know everything about itself it must store that knowledge of the knowledge of the picture's location somewhere else, which it must also learn. This repeats in a loop until the computer crashes. An essay can fully describe most things but not itself: "The author starts the essay with writing that he starts the essay with writing that...". So annoyingly there will always be experiences that are mysterious to us.

There are computer programs that can print their own code: https://en.wikipedia.org/wiki/Quine_(computing)

There are also programs which can print their own code and add something to it. Isn't that a way in which the program fully knows itself?
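As a concrete illustration, here is a minimal sketch of a classic two-line Python quine, held in a string and run so that its output can be compared with its own source. The variable names `quine_source` and `prints_itself` are just labels for this illustration.

```python
import contextlib
import io

# The two-line program below, stored as text so it can be run and compared.
quine_source = "s = 's = %r\\nprint(s %% s)'\nprint(s % s)\n"

# Run the program and capture everything it prints.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    exec(quine_source)

# True if the program's output is character-for-character its own source.
prints_itself = buffer.getvalue() == quine_source
```

The trick is that the program contains a template of itself and fills the template in with its own quoted text, so no infinite regress of "knowledge about the knowledge" is needed.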

I can also "print my own code": if I make a future version of an MRI scan I could give you all the information necessary to understand (that version of) me, but as soon as I look at it my neurological patterns change. I'm not sure what you mean with "add something to it", but I could also give you a copy of my brain scan and add something to it. Humans and computers can of course know a summary of themselves, but never the full picture.

Unfortunately the more we write about consciousness the further we seem to be getting away from confronting the hard problem of consciousness. Has the author read Nagel's Seeing Like a Bat?

Has the author read Nagel's Seeing Like a Bat?

Yep (I read "What is it like to be a bat?")! Indeed, that started the thought process that led to the above post.

Nagel's biggest failing, as I see it, is that he makes everything boolean. "a bat's consciousness is inaccessible to a human" differs in degree, not kind, from "a human's consciousness is inaccessible to another human". There are observations that could convince me that a human has or has not true insight into a bat's consciousness.

I'm glad you've read it. I had a moment of thought yesterday that as modern fields (computer science, rationality studies etc) begin to deal with consciousness they'll re-invent the wheel when other fields (eg philosophy, psychology) have been thinking about this for a while.

Putting bats aside, what could convince you that my subjective experience (I promise you I'm a human) is substantially similar to yours?

Mainly people describing their own subjective experience in ways that make me think "hey, that's just like me - and I haven't told anyone about it!" Or me modelling you as having a subjective experience close to my own, using this model to predict your actions, and being reasonably accurate.

But if I said 'oh that sunset makes me feel so warm and fuzzy', yes I'm using the same words you might use, but how can you know they're the same subjective experience?? You say 'describing', that relies on a set of words, but do they point to the same thing in everyone?

I can start asking you about what features "warm and fuzzy" have, what other emotions or words they connect to, what kind of things make you feel warm and fuzzy and what kind of things don't. If enough of what you're describing seems similar to what I feel, then I can increase my confidence that we're feeling the same thing.

But they're all just words and suffer the same problem. What if an AI says the right words?

What evidence do you have that other people are conscious, apart from words (and facial expressions, etc...)? And would that evidence apply or not apply to an AI?

I'm not solving the hard problem of consciousness, I'm saying that Bayesian evidence exists that some agents have subjective experiences. Compare to an AI that mouths the words but gets them wrong ("fuzzy is like being stabbed by needles"); we have at least evidence that an agent with the right words has a higher chance of having similar subjective experiences.

I think a "subjective experience" (edit: in the sense that two people can have the same subjective experience; not a particular instantiation of one) is just a particular (edit: category in a) categorization of possible experiences, defined by grouping together experiences that put the [person] into similar states (under some metric of "similar" that we care about). This recovers the ability to talk about "lies about subjective experiences" within a physicalist worldview.

In this case, we could look at how the AI internally changes in response to various stimuli, and group the stimuli on the basis of similar induced states. If this grouping doesn't match to its claims at all, then we can conclude that it is perhaps lying. (See: cleaving reality at its joints.) EDIT: Were you saying that AI cannot have subjective experience? Then I think this points at the crux; see my statements below about how I don't see human subjectivity as fundamentally special.
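A rough sketch of that check, under heavy assumptions: internal states are taken to be small numeric vectors, "similar" means Euclidean distance below a hypothetical `threshold`, and the stimuli, state vectors and labels are invented for the example.

```python
import math
from itertools import combinations


def claims_match_states(induced_states, claimed_label, threshold=0.5):
    """Fraction of stimulus pairs where 'same claimed experience' agrees
    with 'similar induced internal state'."""
    agree = 0
    total = 0
    for a, b in combinations(induced_states, 2):
        similar = math.dist(induced_states[a], induced_states[b]) < threshold
        same_claim = claimed_label[a] == claimed_label[b]
        agree += similar == same_claim
        total += 1
    return agree / total


# Hypothetical internal states induced by three stimuli.
induced = {
    "sunset": (1.0, 0.0),
    "campfire": (0.9, 0.1),
    "needle": (0.0, 1.0),
}

# One set of claims tracks the induced states; the other does not.
consistent_claims = {"sunset": "warm and fuzzy", "campfire": "warm and fuzzy", "needle": "sharp"}
mismatched_claims = {"sunset": "warm and fuzzy", "campfire": "sharp", "needle": "warm and fuzzy"}

consistent_score = claims_match_states(induced, consistent_claims)
mismatched_score = claims_match_states(induced, mismatched_claims)
```

A low score does not say which of the claims are wrong, only that the claimed grouping and the induced grouping come apart.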

Yes, this means that we can talk about any physical thing having a "subjective experience". This is not a bug. The special thing about animals is that they have significant variance between different "subjective experiences", whereas a rock will react very similarly to any stimuli that don't break or boil it. Humans are different because they have very high meta-subjectivity and the ability to encode their "subjective experiences" into language. However, this still doesn't match up very well to human intuitions: any sort of database or measurement device can be said to have significant "subjective experiences". But my goal isn't to describe human intuitions; it's to describe the same thing that human intuitions describe. Human subjectivity doesn't seem to be fundamentally different from that of any other physical system.

A lot of words but you don't grapple with the hard problem of consciousness. When I look at the sun, how can you know I feel/see the same thing as you? Yes I'll use words, 'yellow', 'warm', 'bright' etc because we've been taught those label what we are experiencing. But it says nothing about whether my experience is the same as yours.

What do you mean by “you don’t grapple with the hard problem of consciousness”? (Is this just an abstruse way of saying “no, you’re wrong” to set up the following description of how I’m wrong? In that case, I’m not sure you have a leg to stand on when you say that I use “a lot of words”.) Edit: to be a bit more charitable, maybe it means “my model has elements that my model of your model doesn’t model”.

How can you know I see the same thing that you do? That depends on what you mean by “same”. To me, to talk about whether things are the same, we need to specify what characteristics we care about, or what category system we’re using. I know what it means for two animals to be of the same species, and what it means for two people to have the same parent. But for any two things to be the same, period, doesn’t really mean anything on its own. (You could argue that everything is the same as itself, but that’s a trivial case.)

This might seem like I’m saying that there isn’t any fundamental truth, only many ways of splitting the world up into categories. Not exactly. I don’t think there’s any fundamental truth to categories. There might be fundamental monads, or something like that, but human subjective experiences are definitely not fundamental. (And what truths can even be said of a stateless monad when considered on its own?)

I think things are not so bad. If our talking of consciousness leads to a satisfactory functional theory, we might conclude that we have solved the hard problem (at least the "how" part). Not everyone will be satisfied, but it will be hard to make an argument that we should care about the hard problem of consciousness more than we currently care about the hard problem of gravity.

I haven't read Nagel's paper but from what I have read _about_ it, it seems like his main point is that it's impossible to fully explain subjective experience by just talking about physical processes in the brain. It seems to me that we do get closer to such explanation by thinking about analogies between conscious minds and AIs. Whether we'll be able to get all the way there is hard to predict but it seems plausible that at some point our theories of consciousness would be "good enough".