Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

User Veedrac recently commented:

You have shown that simplicity cannot distinguish from , but you have not shown that simplicity cannot distinguish a physical person optimizing competently for a good outcome from a physical person optimizing nega-competently for a bad outcome.

This goes to the heart of an important confusion:

  • "Agent A has preferences R" is not a fact about the world. It is a stance about A, or an interpretation of A. A stance or an interpretation that we choose to take, for some purpose or reason.

Relevant for us humans is:

  • We instinctively take a particular preference stance towards other humans; and humans tend to take the same stance towards others and towards each other. This makes the stance feel "natural" and intrinsic to the world, when it is not.

The intentional stance

Daniel Dennett defined the intentional stance as follows:

Here is how it works: first you decide to treat the object whose behavior is to be predicted as a rational agent; then you figure out what beliefs that agent ought to have, given its place in the world and its purpose. Then you figure out what desires it ought to have, on the same considerations, and finally you predict that this rational agent will act to further its goals in the light of its beliefs. A little practical reasoning from the chosen set of beliefs and desires will in most instances yield a decision about what the agent ought to do; that is what you predict the agent will do.

In the physical stance, we interpret something as being made of atoms and following the laws of physics. In the intentional stance, we see it as being an agent and following some goal. The first allows for good prediction of the paths of planets; the second, for the outcome of playing AlphaZero in a game of Go.

The preference stance (or the (ir)rationality stance[1]) is a more general stance, where you see the object as having preferences, but not necessarily being rational about optimising them.

The preference/(ir)rationality stance

What is the intentional stance for?

In a sense, the intentional stance is exactly the same as the preference stance. Dennett takes an object and treats it as an agent, and splits it into preference and rationality. Ok, he assumes that the agent is "rational", but allows for us to "figure out what beliefs that agent ought to have." That, in practice, allows us to model a lot of irrationality if we want to. And I'm fully convinced that Dennett takes biases and other lapses of rationality into account when dealing with other humans.

So, in a sense, Dennett is already taking a preference stance towards the object. And he is doing so for the express purpose of better predicting the behaviour of that object.

What is the preference stance for?

Unlike the intentional stance, the preference stance is not taken for the purpose of better predicting humans. It is instead taken for the purpose of figuring out what the human preferences are - so that we could maximise or satisfy them. The Occam's razor paper demonstrates that, from the point of view of Kolmogorov complexity, taking a good preference stance (i.e. plausible preferences to maximise) is not at all the same thing as taking a good (predictive) intentional stance.
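To make the gap between prediction and preference concrete, here is a toy sketch. It is not the paper's construction: it assumes we model an agent as a "planner" that turns a reward function into a policy, and every name and number in it (`R`, `rational_planner`, the states and actions) is made up for illustration. It shows only that several very different preference stances predict exactly the same behaviour; it does not measure their complexity.

```python
# Toy illustration only: all names and numbers here are hypothetical.
# Behaviour is a policy (a map from states to actions); a "planner"
# turns a reward function into a policy.

STATES = ["hungry", "tired"]
ACTIONS = ["eat", "sleep", "work"]

# An assumed reward function R(state, action).
R = {
    ("hungry", "eat"): 1.0, ("hungry", "sleep"): 0.2, ("hungry", "work"): -0.5,
    ("tired", "eat"): 0.1, ("tired", "sleep"): 1.0, ("tired", "work"): -1.0,
}


def negate(reward):
    """Return the reward function with every value sign-flipped."""
    return {key: -value for key, value in reward.items()}


def rational_planner(reward):
    """Pick the reward-maximising action in every state."""
    return {s: max(ACTIONS, key=lambda a: reward[(s, a)]) for s in STATES}


def anti_rational_planner(reward):
    """Pick the reward-minimising action in every state."""
    return {s: min(ACTIONS, key=lambda a: reward[(s, a)]) for s in STATES}


# The behaviour we actually observe.
observed_policy = rational_planner(R)


def indifferent_planner(reward):
    """Ignore the reward entirely and always output the observed policy."""
    return dict(observed_policy)


zero_reward = {key: 0.0 for key in R}

# Three different (planner, reward) stances towards the same agent.
decompositions = {
    "rational planner, reward R": rational_planner(R),
    "anti-rational planner, reward -R": anti_rational_planner(negate(R)),
    "indifferent planner, reward 0": indifferent_planner(zero_reward),
}

for name, policy in decompositions.items():
    print(name, "->", policy)
```

All three stances print the same policy, so no amount of behavioural data can separate them. Whatever singles out "reward R" as the agent's real preferences has to come from somewhere other than predictive accuracy.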

But it often feels as if it is; we seem to predict people better when we assume, for example, that they have specific biases or want specific things. Why is this, and how does it seem to get around the result?

Rationality stance vs empathy machine

There are two preference stances that it is easy for humans to take. The first is to assume that an object is a rational agent with a certain preference. Then we can try and predict which action or which outcome would satisfy that preference, and then expect that action/outcome. We do this often when modelling people in economics, or similar mass models of multiple people at once.

The second is to use the empathy machinery that evolution has developed for us, and model the object as being human. Applying this to the weather and the natural world, we anthropomorphised and created gods. Applying it to other humans (and to ourselves) gives us quite decent predictive power.

I suspect this is what underlies Veedrac's intuition. For if we apply our empathy machine to fellow humans, we get something that is far closer to a "goodness optimiser", albeit a biased one, than to a "badness nega-optimiser".

But this doesn't say that the first is more likely, or more true, about our fellow humans. It says that the easiest stance for us to take is to treat other humans in this way. And this is not helpful, unless we manage to get our empathy machine into an AI. That is part of the challenge.

And this brings us back to why the empathy machine seems to make better predictions about humans. Our own internal goals, the goals that we think we have on reflection, and how we expect people (including us) to behave given those goals... all of those coevolved. It seems that it was easier for evolution to use our internal goals (see here for what I mean by these) and our understanding of our own rationality to make predictions, rather than to run our goals and our predictions as two entirely separate processes.

That's why, when you use empathy to figure out someone's goals and rationality, this also allows you to better predict them. But this is a fact about you (and me), not about the world. Just as "Thor is angry" is actually much more complex than electromagnetism, our prediction of other people via our empathy machine is simpler for us to do - but is actually more complex for an agent that doesn't already have this empathy machinery to draw on.

So assuming everyone is rational is a simpler explanation of human behaviour than our empathy machinery - at least, for generic non-humans.

Or, to quote myself:

A superintelligent AI could have all the world’s video feeds, all of Wikipedia, all social science research, perfect predictions of human behaviour, be able to perfectly manipulate humans... And still conclude that humans are fully rational.

It would not be wrong.


  1. I'll interchangeably call it a preference or an (ir)rationality stance, since given preferences, the (ir)rationality can be deduced from behaviour, and vice versa. ↩︎

4 comments

This is a great point that I think sometimes gets lost on folks, which is why it's good that you bring it up. To the extent I disagree with you on your research agenda, for example, it's disagreement over what model we use to describe reality that will be useful to our purposes, rather than disagreement over reality itself.

"Agent A has preferences R" is not a fact about the world. It is a stance about A, or an interpretation of A. A stance or an interpretation that we choose to take, for some purpose or reason.

I find it hard to imagine that you're actually denying that you or I have things that, colloquially, one would describe as preferences, and exist in an objective sense. I do have a preference for a happy and meaningful life over a life of pure agony. Anyone who thinks I do not is factually wrong about the state of the world.

Then there is a sense in which the interpretations of these systems we build are fully interpretative. If “preferences R” refers to a function returning a real number, for sure this is not some facet of the real world, and there are many such seemingly-different models for any agent. Here again I believe we agree.

But we seem not to be agreeing at the next step, with the preference stance. Here I claim your goal should not be to maximize the function “preferences R”, whose precise values are irrelevant and independent, but to maximise the actual human preferences.

Consider measuring a simpler system, temperature, and projecting this onto some number. Clearly, depending on how you do this projection, you can end up at any number for a given temperature. Even with a simplicity prior, higher temperatures can correspond to larger numbers or smaller numbers in the projection, with pretty much equal plausibility. So even in this simplified situation, where we can agree that some temperatures are objectively higher than others, you cannot reliably maximize temperature by maximizing its projection.

Your preference function is a projection. The arbitrary choices you have to make to build this function are not assumptions about the world, they are choices about the model. When you prove that you have many models of human preference, you are not proving that preference is entirely subjective.

That's why, when you use empathy to figure out someone's goals and rationality, this also allows you to better predict them. But this is a fact about you (and me), not about the world. Just as "Thor is angry" is actually much more complex than electromagnetism, our prediction of other people via our empathy machine is simpler for us to do - but is actually more complex for an agent that doesn't already have this empathy machinery to draw on.

This Thor analogy is... enlightening of the differences in our perspectives. Imagining an angry Thor is a much more complex hypothesis up until the point you see an actual Thor in the sky hurling spears of lightning. Then it becomes the only reasonable conclusion, because although brains seem like they involve a lot of assumptions, a brain is ultimately many fewer assumptions (to the pre-industrial Norse people) than that same amount of coincidence.

This is the point I am making with people. If your computer models people as arbitrary, randomly sampled programs, of course you struggle to distinguish human behaviour from its contrapositive. However, people are not fully independent, nor arbitrary computing systems. Arguing that a physical person optimizing competently for a good outcome and a physical person optimizing nega-competently for a bad outcome are similarly simple has to overcome at least two hurdles:

1. We seem to know things about which mental states are good and which mental states are bad. This implies there is objective knowledge that can be learnt about it.

2. You would need to extend your arguments about mathematical functions into the real world. I don't know how this could be approached.

I have a hard time believing that in another world people think that the qualia corresponding to our suffering is good and the qualia corresponding to our happiness is bad, and if it is, this strikes me as a much bigger deal than anything else you are saying.


I find it hard to imagine that you're actually denying that you or I have things that, colloquially, one would describe as preferences, and exist in an objective sense.

I deny that a generic outside observer would describe us as having any specific set of preferences, in an objective sense.

This doesn't bother me too much, because it's sufficient that we have preferences in a subjective sense - that we can use our own empathy modules and self-reflection to define, to some extent, our preferences.

a brain is ultimately many fewer assumptions (to the pre-industrial Norse people)

"Realistic" preferences make ultimately fewer assumptions (to actual humans) than "fully rational" or other preference sets.

The problem is that this is not true for generic agents, or AIs. We have to get the human empathy module into the AI first - not so it can predict us (it can already do that through other means), but so that its decomposition of our preferences is the same as ours.

I deny that a generic outside observer would describe us as having any specific set of preferences, in an objective sense.

It's possible that we've been struggling with this conversation because I've been failing to grasp just how radically different your opinions are to mine.

Imagine your generic outside observer was superintelligent, and understood (through pure analysis) qualia and all the corresponding mysteries of the mind. Would you then still say this outside observer would not consider us to have any specific set of preferences, in an objective sense, where “preferences” takes on its colloquial meaning?

If not, why? I think my stance is obvious; where preferences colloquially means approximately “a greater liking for one alternative over another or others”, all I have to claim is that there is an objective sense in which I like things, which is simple because there's an objective sense in which I have that emotional state and internal stance.