Four years ago I made two attempts to explain what I saw as the conceptual problems with utilitarianism. On reflection, I was right that utilitarianism has conceptual problems, but wrong about how best to point them out to the outside world. So here is another attempt to explain my worldview.

Utilitarianism is defined as "maximizing the expected utility of consequences". I will attack this definition by first examining the concepts of "consequences" and "utility", before settling on "maximize" as the part that most needs correction.

The "consequences" of an action are not the actual future that follows it, but rather what that future would hypothetically be if you took that action (or if the algorithm you use advertised that it would take that action; the distinction is not important at this point). So the concept of consequences depends on the concept of a hypothetical, which in turn depends on the concept of a model. In other words, the consequences are no more than you imagine them to be, and when we reach the limits of our imagination, we reach the limits of consequences.

(It's worth heading off a motivated line of reasoning that might occur to someone who hears "the consequences are no more than you imagine them to be": stop imagining things, so that there stop being any more consequences. This is not what I intend, and it is a mistake, because it misunderstands the role that imagining consequences plays in the ecosystem of the mind. Imagining or not imagining consequences is not meant to prevent bad consequences, but to more perfectly fulfill the role of "a thinker who neither motivatedly stops nor motivatedly continues". We will have all of the Singularity to argue about what exactly this role entails, but for now I advise taking each individual word at face value.)

What are the limits of our imagination? Well, other minds, in particular those other minds that are smarter than us. That's why it's called the Singularity -- we can't predict what's past it.

Now notice that most of the utility lies beyond the Singularity, beyond the limit of consequences. The part of our brain that tries to win starts looking for a loophole, a way to create consequences beyond the Singularity. This is an important effort, but it is ultimately futile. It's impossible to win against a mind that is smarter than you, and once we realize that, we will be ready to decide what to really do next.

One loophole our brain reaches for is the suggestion that our utility function is actually bounded: maybe we don't have to care so much about far future folk whose lives we can't affect. The answer to this is to introduce some distinctions. This will get a bit personal, as my views on this topic depend on the particular behaviors I have observed in my own brain hemispheres. Since lateralization of brain hemispheres is not talked about much on LessWrong, I should start by saying that my understanding is that the left brain is decontextualizing, object-oriented, and analytical, while the right brain is contextualized, person-oriented, and intuitive, and it is on this basis that I made my observations.

My experience is that my right brain doesn't care about far future folk. For a while it wanted to pretend to care, but under sufficient introspection the attempt proved futile. By contrast, my left brain doesn't draw the same distinction between "caring" and "pretending to care". It sees utility functions as objects oriented towards the goal of systematizing my conception of my relationship with the rest of reality, so if I want to "care" about people, that by definition means putting them in my utility function. On introspection, my left brain does care enough about far future people to put them in the utility function. However, I only became confident in this assertion once I had a framework that could explain why the concept of an unbounded utility function still makes sense despite the objections. (I think the answers to the objections are obvious given my framework, but if anyone has further objections I'll try to deal with them.)

In conclusion, whether or not we care about far future folk probably depends both on whether we are talking about the left or the right brain and on what precisely we mean by "care". Nevertheless, we want to find a way of putting these people in our utility function, which is why, instead of conceptually attacking the notion of "utility", I will move on to "maximize".

The problem is that giving the instruction "maximize" to the subagent whose role is to neither motivatedly stop nor motivatedly continue suggests that it is unacceptable to stop while the consequences miss the vast majority of utility. My impression is that collectively we are somewhat desperate and will accept just about any solution at this point. My purpose here is to argue for a change in norm for this subagent, codified by changing our conception of "utilitarianism", and thus our conception of what an ideal agent is, and thus our strategy for coding an AI. The proposed change is the title of this post, but more needs to be said, in particular because it introduces "hope", a concept new to the apparatus of utilitarianism.

Once we realize that it's impossible to win against a mind that is smarter than you, we will be ready to decide what to really do next. We may as well think now about how we are going to make that decision. It can't be based on trying to maximize the expected utility of consequences, for the reasons outlined above. Or actually... it can be, as long as we give the instruction "maximize" only to the subagent responsible for comparing strategies, rather than to the one responsible for coming up with strategies. This actually makes more sense: when you are comparing strategies there are finitely many of them, so it is mathematically meaningful to ask for a maximum, whereas giving the instruction "maximize" to the subagent responsible for imagining strategies is really asking him to take responsibility in relation to how his decisions affect the first subagent. ("In relation" here is quite vague, and intentionally so.)

OK, if we aren't going to give the instruction "maximize" to the imagining strategies subagent, what instruction are we going to give it? The title suggests "hope". This was meant to convey... that if we look at the current maximum from the comparing strategies subagent and try to extrapolate what happens next, we aren't too dissatisfied with the prospects? If that happens, then we're allowed to stop imagining possibilities. What would it look like to be not too dissatisfied with the prospects? Well, our current maximum is that we'll write some code, have something that looks like a proof that it's Friendly, and then run the code. Are we satisfied with that? No, because it creates an AGENT THAT IS SMARTER THAN US and we DO NOT KNOW WHAT WILL HAPPEN AFTER THAT. What if we pretend to know? After all, "smarter" is a word, and we can pretend that we know what it means, even though whenever we encounter people smarter than us in real life, the fact that we call them "smarter" doesn't allow us to make detailed predictions about them. (Meta-observation: this sentence makes sense to the imagining strategies subagent because "what it means" is ambiguous between "what is the semantic content" (primary meaning) and "what are the consequences for us if we create an agent that is smarter than us" (secondary meaning). Actually the primary meaning is a code for the secondary meaning, but "security by obscurity", or "spoilers", as they say. (Fun theory note: codes are fun.))
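As a rough illustration of the division of labor between the two subagents, here is a toy Python sketch. It is my own hedged rendering, not a proposal for how an AI should actually be coded: the names `decide`, `propose`, `score`, and `hopeful` are hypothetical, strategies are assumed to arrive as a possibly endless stream, and "hope" is crudely assumed to be a test on the best score imagined so far. The only point it makes is structural: the imagining loop stops on a hope criterion rather than a maximization criterion, while `max` is applied only to the finite list of strategies actually imagined.

```python
from typing import Callable, Iterable, List, Optional, Tuple, TypeVar

Strategy = TypeVar("Strategy")


def decide(
    propose: Iterable[Strategy],        # hypothetical: the imagining strategies subagent
    score: Callable[[Strategy], float],  # hypothetical: estimated utility of a strategy
    hopeful: Callable[[float], bool],    # hypothetical: "not too dissatisfied" test
) -> Optional[Strategy]:
    """Toy sketch: imagine until hopeful, then maximize over what was imagined."""
    considered: List[Tuple[float, Strategy]] = []
    best_so_far = float("-inf")
    for strategy in propose:
        value = score(strategy)
        considered.append((value, strategy))
        best_so_far = max(best_so_far, value)
        # Stopping rule for the imagining subagent: stop once the prospects
        # look hopeful, instead of searching for an unbounded maximum.
        if hopeful(best_so_far):
            break
    if not considered:
        return None
    # Comparing subagent: the list is finite, so a maximum is well-defined.
    _, best = max(considered, key=lambda pair: pair[0])
    return best
```

With scores `{"a": 1.0, "b": 5.0, "c": 9.0, "d": 10.0}` and a hope threshold of 5, the sketch returns `"b"` without ever imagining `"c"` or `"d"`. Note that if the stream is endless and `hopeful` never returns true, the loop never ends; the sketch does not try to settle what should happen in that case.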

Let's go with pretending we know what "smarter" means. After all, we have some nice heuristics like IQ and resources produced, and a pretty-sounding theoretical definition: "able to hit a small target in the search space". In pretend-land, what happens when we create an agent that is smarter than us? It does whatever it wants to, of course! So we just have to make sure that what it wants to do is the same as what we want to do. We don't have to precisely "model" it; we can just make the vague prediction that "what happens is what we want to happen".

Thus saith the imagining strategies subagent, but now the subagent responsible for enforcing our commitment to utilitarianism objects: "The point of asking you what to do next wasn't to actually do it, it was to use your thought processes to update the concept of utilitarianism so as to better program our AI, and this answer is not at all useful for that! (He exaggerates; it is a little useful.) An AI may have to deal with situations where the future contains multiple agents smarter than it, vying for power, and a vague prediction like 'what happens is what the agent we create wants to happen' (notice that this subagent has read the other subagent's thoughts and knows it was giving a fake answer) doesn't suffice for dealing with those situations! (And the concept of an agent also needs to be deconstructed, but it may be too hopeful to expect that to happen before the Singularity.) And the biggest problem is that we need to know how to tell whether what it wants to do is the same as what we want to do, and you are hoping someone else will solve that for you!"

The imagining strategies subagent: Well of course what we want to do isn't really the same as what the agent we create wants to do, after all it's a different agent! We're just hoping that it'll be close enough that we don't get blamed for it. I can do my job if you give me the code for the agent, honest! (He's not being honest, and anyway it's been pointed out to him that getting the code depends on his thought process here, so he is creating a circular dependency.) (Note that our subagent has internalized our instruction to "hope" and is now viewing it as a Word Of Power. This is good.)

I think I can solve the part about multiple agents, though... if we create multiple agents smarter than us, then we should expect them each to get what they want in proportion to how smart they are and how many resources they have! And I have heuristics for both of those things!

Now we can go meta because the point of the exercise has been achieved: it is in this final thought that the imagining strategies subagent has exercised the virtue of hope. Namely, they have created a pseudo-model (we dare not call it a model for fear of confusing it with the finished product which then becomes the responsibility of the comparing strategies subagent) with heuristics that are extrapolated, but with good theoretical reasons to expect the extrapolation to fail. (If there weren't good theoretical reasons to expect the extrapolation to fail, it would be worth trying to turn this pseudo-model into a model, which violates the instruction we gave to the imagining strategies subagent.)

What does all of this suggest for AI mind design? Well, by definition an idealized agent is the perfected version of what we are. We currently view the fact that our subagents think in terms of responsibility and blame, rather than simply executing the orders they have been given, as a flaw to be ironed out, and we see coding an AI as an opportunity to iron it out: an AI by definition does exactly what you tell it to do (at least initially), so that is a good start. Whether we will still hold this view upon reflection, I do not know. Maybe idealized agents have subagents that think in terms of responsibility and blame.

I could reflect further on what this means for AI mind design, but recall that the purpose of this tangent was to illustrate the process of "looking for loopholes" in the fact that you cannot predict what an agent smarter than you is going to do. That is now sufficiently done, so let us return to our philosophical discussion, which takes the futility of this process for granted, and ask what it all means. Once our mind gives up on looking for loopholes and trying to win against the computer, it will still do something. I propose that what it will do is decide whether or not to trust the machine, i.e. its own idealized conception of an agent smarter than it. Furthermore, I propose that the state of not knowing whether or not you will trust someone (I say someone instead of something because I want to activate my right brain), or even on what basis you will make this decision, until the very moment of deciding, is a fundamental part of what it means to be human. But from an outside view, I can predict that we will make this decision on the basis of things like how similar the mind we are supposed to trust is to us, and how much we agree with what we perceive as its priorities. So my real proposal for AI mind design is that we should prioritize transparency of those attributes. Note that this is somewhat different from prioritizing any particular value of those attributes; sometimes we trust based on intriguing differences.

To roll back the discussion one meta level further, let us ask what this means for utilitarianism. If we take seriously the fact that our choices cannot have consequences beyond the Singularity, then we need to change our definition of the utility function as well: rather than being a function over ultimate outcomes, it is only a function over outcomes up until the Singularity. To the accusation that this means we have stopped caring about people beyond the Singularity, we respond that even though they no longer appear in our utility function, we still hope that they live good lives, and hope is a form of caring. Nevertheless, it is not our responsibility, not even our heroic responsibility, to affect them. Our responsibility is to be human, so that when it comes to the decision to trust or not to trust, we can make a human decision. Part of being human is doing philosophy, in particular the philosophy of ideal agents. So this plan still leads to the conclusion that we should improve our conception of ideal agents.

How to balance the responsibility to be human against the responsibility we have to maximize utility in limited domains is, again, a human decision.

So that is my resolution of what I perceive as the conceptual flaws in standard utilitarianism.

