Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

EDIT: This is a post about an alien mind being unlearnable in practice. As a reminder, theory of mind is unlearnable in theory, as stated here - there is more information in "preferences + (ir)rationality" than there is in "behaviour", "policy", or even "complete internal brain structure". This information gap must be covered by assumptions (or "labelled data", in CS terms) of one form or another - assumptions that cannot be deduced from observation. It is unclear whether we need only a few trivial assumptions or a lot of detailed and subtle ones. Hence posts like this one, looking at the practicality angle.
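
To make that gap concrete, here is a minimal toy sketch (the Boltzmann-style choice model, the two options, and all the numbers are illustrative assumptions added here, not part of the original argument): several mutually incompatible "preferences + (ir)rationality" stories produce exactly the same observable policy.

```python
import numpy as np

def boltzmann_policy(reward, beta):
    """Choice probabilities over two options for a softmax ('Boltzmann') agent
    with rationality parameter beta (beta < 0 means actively anti-rational)."""
    logits = beta * np.asarray(reward, dtype=float)
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()

# Three incompatible stories about the same agent: (reward vector, rationality).
stories = {
    "mildly rational, strongly prefers option 0": (np.array([2.0, 0.0]), 2.5),
    "highly rational, mildly prefers option 0":   (np.array([0.5, 0.0]), 10.0),
    "anti-rational, prefers option 1":            (np.array([0.0, 1.0]), -5.0),
}

for label, (reward, beta) in stories.items():
    print(f"{label:<45} -> {np.round(boltzmann_policy(reward, beta), 3)}")

# All three stories print the same policy, [0.993 0.007]. Behaviour alone cannot
# distinguish them; the choice between them has to come from assumptions (or
# labelled data) that are not themselves derivable from the observations.
```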

Alice learns the ways of aliens

I suggested that an alien "theory of mind" might be unlearnable; Rohin Shah challenged this conclusion, asking whether a theory of mind was truly unlearnable, even for a very intelligent Alice. Let's dig into this concept for a bit.

There is, of course, a weak and a strong version of the unlearnability hypothesis. The strong version is that Alice, even with infinite time and total rationality, couldn't learn an alien theory of mind. The weaker version is that a smart and motivated Alice, with a lot of resources and data, couldn't learn an alien theory of mind in a reasonable time.

You can nuance both of those by wondering how much of the theory of mind is unlearnable. It doesn't really matter if a few less important bits are unlearnable. So the real question is, how hard is it to learn enough alien theory of mind, with enough data and effort? We might also ask whether the learning process is interactive or non-interactive: does Alice merely observe the alien natives, or is there a conversation going on, where the aliens try to correct her interpretations?

aAlice learns the ways of humans

Unfortunately, we don't have a convenient alien civilization on hand to test this (and, even if we did, we might be unsure whether we'd really understood their theory of mind, or just thought that we did). So instead, let's imagine an alien Alice - aAlice - who is trying to learn the human theory of mind, and see how she might go astray.

It won't take long for aAlice to realise that there is a difference between what humans say publicly, and what we say privately. Also, there is a difference between what we say under the impact of strong emotion, and what we say when calm and relaxed.

She concludes, naturally (as this is close to how her species behaves), that our authentic statements are those given in public, when we are under the sway of strong emotions. She will find quite a lot of evidence for her position. For example, some people will calmly write about the "authenticity" of strong emotion; aAlice interprets this as: "See? Even in their irrational mode, they sometimes let slip a bit of genuine information."

She can point to other reasons for the correctness of her interpretation. For example, humans often publicly praise powerful people, while mocking them behind their back. These humans also go out of their way to be servile to the powerful humans. aAlice concludes, from the "revealed preference" perspective, that our public praise is the correct interpretation, as that is what is compatible with our behaviour. The private mocking must be some hypocritical "speech act", maybe used for social bonding.

Of course, there is a lot of variety in human public-emotional speech, and a lot of wild contradictions. If you point this out to aAlice, she would respond "yes, I know; aren't humans a fascinating species? I have several theories that I'm developing, to explain their complex preferences." She might also point out that private-calm speech is also varied and contradictory; according to her theories - meticulously developed through observation and experimentation - the variations and contradictions in private-calm speech are much more of a problem than those in public-emotional speech.

Can we convince her she's wrong?

Could we convince aAlice that she's wrong; that private-calm speech is much closer to our true preferences than public-emotional speech is? The true picture is much more nuanced than that, of course, but if we can't communicate the basic facts, we can forget about transmitting the nuances.

How would we transmit that information? Our first instinct would be to calmly explain this to her, preferably without too many different people around listening in and chiming in. She would reject this approach immediately, of course, as she has already concluded that private-calm speech is inauthentic.

The above paragraph means that aAlice would have a very hard time concluding she was wrong, in the non-interactive situation. Most of our deep musings about our true preferences are in the private-calm setting, so would be ignored by aAlice. Can our standard public-emotional pronouncements, filtered by aAlice's complex interpretations, ever convince her to take our private-calm statements more seriously? That seems unlikely.

But, back to the interactive setting. We might realise that our explanations to aAlice are not working. This realisation might take some time, as aAlice might calmly and privately agree with us when we explain where she is wrong (she "knows" that private-calm statements carry no weight, so she just follows the social conventions of calmly agreeing to statements like "rationality requires careful thought").

Out of consideration for us, she would be careful to state her true conclusions and beliefs only in public-emotional ways. Thus it might take us a long while to figure out aAlice's true beliefs about us. We'd also need to do a lot of interpretation of aAlice's goals: from our perspective, aAlice being benevolent while taking our public-emotional statements as true might be indistinguishable from her being whimsical while taking our private-calm statements as true.

But let's assume that we have somehow understood aAlice, in the same way that she has failed to understand us. Can we correct her misapprehension? Our next attempt might be to communicate our corrections in a public-emotional way. But this would be problematic. First of all, in the public-emotional sphere, there will be other humans stating their opinions and contradicting ours. aAlice has no reason to pay more attention to our pronouncements.

Indeed, she has reason to pay less attention to our pronouncements, because we will have privately-calmly concluded that we needed to express private-calm sentiments to aAlice in public-emotional ways. This will make for very odd and inauthentic public-emotional pronouncements. And this is where nuance will sting us. We know, as does aAlice, that the public-emotional vs private-calm dichotomy is not fully correct, just a rough approximation. aAlice is therefore likely to add nuance to her interpretation, and set aside these odd and inauthentic public-emotional pronouncements, ignoring them entirely.

This is not helped by the fact that we have a relatively poor grasp of our own theory of mind (see Moravec's paradox, amongst others). Many aspects of our minds and culture only become obvious to us when we encounter beings with different minds and cultures. So a lot of what we will be trying to communicate to aAlice, at least initially, will be incorrect or underspecified. This will give her another reason to reject our attempts at correction, and to build a new elaboration in her human theory of mind, where she adds a term saying "public-emotional expressions of private-calm sentiments are as inauthentic as private-calm expressions themselves[1]."

So our explanations have increased aAlice's misunderstanding, and made it actually harder for us to correct her. This is one of the reasons that anthropologists use methods like participant observation (becoming integrated in the culture they are studying) rather than simply asking members of that culture questions. Without an understanding of the culture (an understanding derived mostly from using our own theory of mind during the participation process), we can't know what the people are likely to be honest about, and in what context. Indeed, we might not even understand what the words mean to them, let alone whether they're being honest with them.

Unsolvable?

So, is the alien theory of mind problem unsolvable? I'm not sure. Like any method of extracting preferences from behaviour, it relies on assumptions, assumptions that cannot be derived from observations. The optimistic perspective is that we only need a few key assumptions, and then a lot of observation and anthropology will suffice to fill in the holes. But the aAlice example above is a cautionary tale; we may need much stronger assumptions than we expect, before two alien species can interpret each other correctly.

And, bringing that all back to AI, we may need stronger assumptions than we expect, before an AI can deduce our preferences from observation.


  1. Notice that this elaboration is actually true: the level of authenticity of our private-calm expressions is roughly the same as that of the public-emotional ones we have constructed specifically for aAlice. So where theory of mind is concerned, adding true statements can sometimes make misinterpretations worse. ↩︎

Comments

Fwiw, in this story I find myself surprised at aAlice's confidence in her theory. If I were telling a story about an unlearnable theory of mind, I'd be leaning on huge uncertainty that prevents aAlice from doing anything.

It's an interesting question as to whether aAlice is actually overconfident. Her predictions about human behaviour may be spot on, at this point - much better than human predictions about ourselves. So her confidence depends on whether she has the right kind of philosophical uncertainty.

I was trying to put my finger on why I felt confused when reading about aAlice’s beliefs & actions, and I think you nailed the reason.

Asking people to state what their revealed preferences are is a fool's game. Brains are built to deceive themselves about their preferences; even if someone was trying to be totally honest with you, they would still mislead you. If I wanted to figure out the preferences of an alien race, I wouldn't try to initiate political or philosophical conversations. I would try to trade with them.

If I could only observe the aliens, I would try to figure out how they decide where to spend their energy. Whether the aliens hunt prey, grow crops, run solar farms, or maintain a Dyson swarm, they must gather energy in some fashion. Energy is and always will be a scarce resource, so building predictive models of alien energy allocation policy will reveal information about their preferences.

Looking at humans from an alien perspective:

  • Notice that humans allocate energy by swapping currency for a proportional amount of energy
  • Notice that there is a system by which humans swap currency for a wide variety of other things
  • Build a causal model of this system
  • In doing so, model the valuation structures of the human cortex
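
As a toy illustration of the kind of model those bullet points gesture at (the spending categories, the numbers, and the Cobb-Douglas assumption below are all made up for the sketch, not anything claimed in the comment):

```python
# Under a Cobb-Douglas utility U = prod(x_i ** alpha_i), a rational agent spends a
# share of its budget on each good proportional to alpha_i, so observed budget
# shares can be read off as preference weights - *if* you assume that utility form
# and full rationality, which is exactly where the hidden assumptions live.

spending = {          # hypothetical observed currency outflows for one human
    "food": 400.0,
    "housing": 1200.0,
    "status_signalling": 300.0,
    "saving_for_later": 100.0,
}

total = sum(spending.values())
inferred_weights = {good: amount / total for good, amount in spending.items()}

for good, weight in sorted(inferred_weights.items(), key=lambda kv: -kv[1]):
    print(f"{good:>18}: {weight:.2f}")

# The output only means something given the rationality and utility-form
# assumptions baked in above; with a different model of irrationality (habits,
# ignorance of arbitrage, etc.), the same spending supports different preferences.
```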

Using a model like that ("revealed preferences" about energy) seems like it would fall into the trap mentioned here, in that it needs a very good model of human irrationality, or it would start concluding absurdities (eg that we don't want to be rich since we don't exploit the many arbitrage and profit opportunities that the alien sees all around us). Do you have a way around that issue?

If you point an arbitrage opportunity out to someone, and they take it, then 'they didn't notice it' seems reasonable.

However, figuring out what someone has and has not noticed does sound difficult - perhaps in an intractable fashion. Even if someone saw an opportunity long ago, perhaps they took a better one and forgot it, or lacked the necessary knowledge to figure out the opportunity.

Or maybe the opportunity was only available for someone who could do (super)intelligent follow up to the initial opportunity.

I think that ants like sugar. However, if I spill some sugar on the countertop, I'm not going to be shocked when every ant in a ten mile radius doesn't immediately start walking towards the sugar. It's reasonable to expect a model of an agent's behavior to include a model of that agent's model of its environment.

And so the assumptions pile up :-) We have to distinguish not knowing, from not caring, from not being able to plan the whole way through, from various biases in the model... I agree that it's necessary, but that doesn't make it feasible.

I think the way around that issue is to bite the bullet - those things belong in a proper theory of mind.  Most people want to be conformist (or at least to maintain a pleasant-to-them self-image) more than they want to be rich.  That seems like a truth (lowercase t - it's culture-sensitive, not necessarily universal) that should be modeled more than a trap to be avoided.

But people are still leaving a lot of efficient, low-effort conformity on the table - a superintelligent conformist human could be so much better at being (or appearing) conformist than we can ever manage.

So a model that says people are 'super intelligent' would be badly wrong.

Could we convince aAlice that she's wrong; that private-calm speech is much closer to our true preferences than public-emotional speech is?

Forget aAlice, how about convincing me, a fellow human, of this fact. Your statement that private/calm speech is closer to our actual values does not reflect how I understand myself, nor how I understand many of the people around me. People sometimes say things that they don't mean when under emotional distress, but they also sometimes use emotional excitement as a means to express things they would otherwise inhibit.

Cf. this discussion of "truth shouters" and "cutlery loaders". Your argument seems to assume that everyone is a cutlery loader, which is definitely Not True.

I'd like to add two pieces of evidence in favor of the weak unlearnability hypothesis:

(1) Humpback whales have songs that can go on for days.  Despite decades of study, we don't really understand what they're saying.

(2) The output of machine learning algorithms (e.g. Google's Deep Dream) can be exceedingly counterintuitive to humans.

Whales are our distant cousins and humans created machine learning.  We might reasonably suppose that actual aliens, with several billion years of completely independent evolution, might be much harder to understand.

We might reasonably suppose that actual aliens, with several billion years of completely independent evolution, might be much harder to understand.

If they are complex, this seems obvious. Simple, less so.


humans created machine learning

Even ignoring hardware, they can be modeled as being programmed by the data. Perhaps we don't understand the data?

Re aliens - Fair enough.  Some very simple alien, perhaps the Vulcan equivalent of a flatworm, may be well within our capability to understand.  Is that really what we're interested in?

Re machine learning - The data for machine learning is generally some huge corpus.  The question is whether we're even capable of understanding the data in something like the manner the algorithm does.  My intuition says no, but it's an open question.

aAlice should be comparing people’s statements to reality, rather than to each other or to her own, as we should be doing with aAlice’s. Reality is the Rosetta stone that deciphers all languages.

I feel like the core problem here is in the word "authentic". The post uses a relatively generic, positively-laden word for a phenomenon that might better be described as something like "impactful".

As in, public-emotional speech does genuinely seem more impactful than private-calm speech, and this is partly the phenomenon aAlice is observing.

If you want to model some phenomenon, you probably do need to start with the most impactful aspects of the phenomenon - but hopefully eventually you will learn about the causes of the more impactful aspects, which will often include chaotic interactions between and/or large sums of the less impactful aspects.

In reading this I was reminded (or at least my mind sort of wandered off to the thought) of a statement a psychologist friend of mine made many years ago. He described his job as trying to understand the rationality underlying the people we considered "not normal".

In other words, understanding the operational map the person has. The goal was not really to verify where that map and the territory were not in sync (at least through the lens of his map -- or other generally acceptable maps) or even to compare it with his map. I think the trivial implication here may be that every mind can be considered an "alien" mind. It's just that for the most part human minds are not very alien from one another.

But what also comes from that is that things like intelligence and rationality are not part of the criteria. So that seems to suggest that we can actually attempt (and I am 99.999% sure some actually are) understanding the alien minds of other species on Earth.

Both implications seem to suggest we can look to those areas, human psychology and research by those exploring the "minds" of other species, for what types of assumptions are needed/made. We could then look at ways to assess the success (would predicting future actions under defined conditions indicate some understanding of the mind?) and which assumptions (or classes of assumptions?) matter.

That might then inform the assumptions needed for AI.

I think "unlearnable" is a level removed from equally important questions (like being un-modelable, or a sufficiently precise model not being computable by Alice's means, even if she could learn/find it). And that this treatment gives FAR too binary a focus on what a "theory of mind" actually does, and how precise and correct it is.

I think we have to start with "how good is Stuart Armstrong's theory of mind for humans (both for their best friend and for a randomly-selected person from a different culture and generation, and for a population making collective decisions)", as an upper bound for how good Alice's theories of mind can be at all. Identify specific predictions we want from the theory, and figure out how we'll evaluate it for humans and for aliens.

I'd expect that it's not very good to start with, but still better than random and we can do some useful modeling.  For aliens, you're correct that our assumptions matter a lot (well, what matters is the aliens' specifics, but our assumptions control our modeling of our ability to model).  For aliens much more complex than ourselves, our theory of mind will be less accurate, as there's less of their computations that we can simplify to fit in our heads/computers.

"My preferences" seems to be a confused statement. There is how "me" looks from the outside and how I look at myself from the inside, but those are two different things, unless my conscious self is all there is to "me". Any algorithm trying to find my true preferences has to analyze my brain, find the conscious part(s?!) and somehow understand it. Anything else runs the risk of imprisoning me in my own body, mute and helpless.

Not easy for aAlice to figure this out if she doesn't have a related concept, but doable - at least if this distinction is something that is inferable. I think it is, as it manifests in exactly the differences between public and private speech, and people do talk about consciousness.

Experimental psychologists spend their working lives devising empirical tests for questions like these. 

One way for Alice to test her theory would be to find a fact that people had motivation to lie about, and see whether their public or private statements were more likely to represent the truth.

Experimental psychologists start with (roughly) the same theory of mind that other humans do.

See also my comment here: https://www.lesswrong.com/posts/kMJxwCZ4mc9w4ezbs/how-an-alien-theory-of-mind-might-be-unlearnable?commentId=iPitpgNxwJH2e98CK

Could we convince aAlice that she's wrong; that private-calm speech is much closer to our true preferences than public-emotional speech is? The true picture is much more nuanced than that, of course, but if we can't communicate the basic facts, we can forget about transmitting the nuances.

Publicly (and later, privately) state that some people are more authentic one on one, others in public.

I like this post but I'm a bit confused about why it would ever come up in AI alignment. Since you can't get an "ought" from an "is", you need to seed the AI with labeled examples of things being good or bad. There are a lot of ways to do that, some direct and some indirect, but you need to do it somehow. And once you do that, it would presumably disambiguate "trust public-emotional supervisor" from "trust private-calm supervisor".

Hmm, maybe the scheme you have in mind is something like IRL? I.e.: (1) AI has a hardcoded template of "Boltzmann rational agent", (2) AI tries to match that template to supervisor as best as it can, (3) AI tries to fulfill the inferred goals of the supervisor? Then this post would be saying that we should be open to the possibility that the "best fit" of this template would be very wrong, even if we allow CIRL-like interaction. But I would say that the real problem in this scenario is that the hardcoded template stinks, and we need a better hardcoded template, or else we shouldn't be using this approach in the first place, at least not by itself. I guess that's "obvious" to me, but it's nice to have this concrete example of how it can go wrong, so thanks for that :-)
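
As a rough sketch of how the best fit of such a template can be confidently wrong (the habit-driven supervisor, the fixed rationality parameter, and the numbers below are hypothetical assumptions for illustration, not an actual IRL or CIRL implementation):

```python
# Fit a hardcoded "Boltzmann-rational" template to observed choices by maximum
# likelihood, and note that if the supervisor's actual choice process violates the
# template (here: a strong habit), the best-fitting reward is confidently wrong.

import numpy as np

rng = np.random.default_rng(0)

# Ground truth (hidden from the learner): the supervisor prefers option 1,
# but a habit makes them pick option 0 with probability 0.9 regardless.
true_preference = np.array([0.0, 1.0])
observed_choices = rng.choice(2, size=500, p=[0.9, 0.1])

def boltzmann_loglik(reward_gap, choices, beta=1.0):
    """Log-likelihood of the choices under a Boltzmann-rational template,
    parameterised by the reward gap r(0) - r(1)."""
    p0 = 1.0 / (1.0 + np.exp(-beta * reward_gap))
    n0 = np.sum(choices == 0)
    n1 = len(choices) - n0
    return n0 * np.log(p0) + n1 * np.log(1.0 - p0)

gaps = np.linspace(-5, 5, 1001)
best_gap = gaps[np.argmax([boltzmann_loglik(g, observed_choices) for g in gaps])]

print(f"true reward gap r(0) - r(1): {true_preference[0] - true_preference[1]:+.2f}")
print(f"inferred reward gap (MLE):   {best_gap:+.2f}")
# The template's best fit says the supervisor strongly prefers option 0 - the
# opposite of the truth - because "habit" isn't in the hypothesis space.
```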

Sorry if I'm misunderstanding something here, but doesn't this violate the ol' Rationalist principle of "ignorance is in the mind"[1]? I see your argument about one specific scenario above, but I'm not largely convinced by it and am wary of how it seems to fail to generalize. Insofar as we aren't bringing certain computational stipulations onto the field[2], I don't actually understand what it would mean for a theory of mind to be "unlearnable".

To some extent, I could[3] simulate you and record all pairings of input-world-state to output-reaction, giving me a perfect understanding of you. This is the trivial objection, but what makes it wrong?


  1. When I looked, it seems like we've done a good job at filing the important articles under "Mind Projection Fallacy", so those articles are what I'm jokingly referring to ↩︎

  2. By which I mean objections I have heard along the lines of: There are more world-states possible than are differentiable by humans, so by pigeonhole principle, at least some states must not be orderable by humans, so you cannot build a preference map of humans that is "complete" ↩︎

  3. I mean, in theory, and also it isn't important what we mean by "simulate" here so please no one get distracted if you disagree with the concept of simulation ↩︎

Perfect predictive models of me (even, to a lesser extent, perfect predictive models of my internals) are insufficient to figure out what my preferences are; see https://arxiv.org/abs/1712.05812

Is that true if I change my simulation to just simulate all the particles in your brain?

If you are going to answer "In that case, you could learn it", then is the "strong version of the unlearnability hypothesis" completely false? If it's not false, why?


I still think you and I might be calling different concepts "unlearnable". I do not see how any existing physical thing can be unlearnable since it seems you could learn it by just looking at all its parts[1].


  1. Assuming you are able to do that, of course. I don't mean to be unfair by focusing on the "strong version" of your theory, it's just the one that's more assailable while I don't have good footing with the idea (so it's the one I can learn from without making judgement calls) ↩︎

Is that true if I change my simulation to just simulate all the particles in your brain?

Yes.

The preferences of a system are an interpretation of that system, not a fact about that system. Not all interpretations are equal (most are stupid) but there is no easy single interpretation that gives preferences from brain states. And these interpretations cannot themselves be derived from observations.

I don't understand this. As far as I can tell, I know what my preferences are, and so that information should in some way be encoded in a perfect simulation of my brain. Saying there is no way at all to infer my preferences from all the information in my brain seems to contradict the fact that I can do it right now, even if me telling them to you isn't sufficient for you to infer them.

Once an algorithm is specified, there is no more extra information to specify how it feels from the inside. I don't see how there can be any more information necessary on top of a perfect model of me to specify my feeling of having certain preferences.

The theoretical argument can be found here: https://arxiv.org/abs/1712.05812 ; basically, "goals plus (ir)rationality" contains strictly more information than "full behaviour or policy".

Humans have a theory of mind that allows us to infer the preferences and rationality of others (and ourselves) with a large amount of agreement from human to human. In computer science terms, we can take agent behaviour and add "labels" about the agent's goals ("this human is 'happy' "; "they have 'failed' to achieve their goal", etc...).

But accessing this theory of mind is not trivial; we either have to define it explicitly, or point to where in the human mind it resides (or, most likely, a mixture of the two). One way or another, we need to give the AI enough labelled information that it can correctly infer this theory of mind - unlabelled information (ie pure observations) is not enough.

If we have access to the internals of the human brain, the task is easier, because we can point to various parts of it and say things like "this is a pleasure centre, this part is involved in retrieval of information, etc...". We still need labelled information, but we can (probably) get away with less.
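
A minimal toy sketch of that last point (the two-option agent, the candidate stories, and the "mistake" label are all invented here for illustration): several stories fit the observed behaviour equally well, and only the labelled judgement separates them.

```python
# Two (rationality, reward) stories that predict the same behaviour; a single
# hypothetical labelled judgement ("choosing option 0 was a mistake") rules one
# out, which pure observation of the policy never could.

observed_policy = [0.993, 0.007]          # what the agent actually does

stories = {
    "rational, prefers option 0":      {"predicted_policy": [0.993, 0.007], "mistake_to_pick_0": False},
    "anti-rational, prefers option 1": {"predicted_policy": [0.993, 0.007], "mistake_to_pick_0": True},
}

labels = {"mistake_to_pick_0": True}      # labelled data supplied by a human

for name, story in stories.items():
    fits_behaviour = story["predicted_policy"] == observed_policy
    fits_labels = all(story[key] == value for key, value in labels.items())
    print(f"{name:<35} fits behaviour: {fits_behaviour}, fits labels: {fits_labels}")

# Both stories fit the behaviour; only the labelled judgement distinguishes them.
```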

I think I understand now. My best guess is that if your proof was applied to my example the conclusion would be that my example only pushes the problem back. To specify human values via a method like I was suggesting, you would still need to specify the part of the algorithm that "feels like" it has values, which is a similar type of problem.

I think I hadn't grokked that your proof says something about the space of all abstract value/knowledge systems whereas my thinking was solely about humans. As I understand it, an algorithm that picks out human values from a simulation of the human brain will correspondingly do worse on other types of mind.

If preferences cannot be "learned" from a full physical model of a brain, then isn't it true that a human theory of mind is unlearnable as well?

I don't see a good reason to privilege having the original copy of a brain here. If you're willing to grant me that a brain can be copied, there is nothing the brain-holder should be unable to learn that the brain-user knows.