This post explains von Neumann-Morgenstern (VNM) axioms for decision theory, and what follows from them: that if you have a consistent direction in which you are trying to steer the future, you must be an expected utility maximizer. I'm writing this post in preparation for a sequence on updateless anthropics, but I'm hoping that it will also be independently useful.
The theorems of decision theory say that if you follow certain axioms, then your behavior is described by a utility function. (If you don't know what that means, I'll explain below.) So you should have a utility function! Except, why should you want to follow these axioms in the first place?
A couple of years ago, Eliezer explained how violating one of them can turn you into a money pump — how, at time 11:59, you will want to pay a penny to get option B instead of option A, and then at 12:01, you will want to pay a penny to switch back. Either that, or the game will have ended and the option won't have made a difference.
When I read that post, I was suitably impressed, but not completely convinced: I would certainly not want to behave one way if behaving differently always gave better results. But couldn't you avoid the problem by violating the axiom only in situations where it doesn't give anyone an opportunity to money-pump you? I'm not saying that would be elegant, but is there a reason it would be irrational?
It took me a while, but I have since come around to the view that you really must have a utility function, and really must behave in a way that maximizes the expectation of this function, on pain of stupidity (or at least that there are strong arguments in this direction). But I don't know any source that comes close to explaining the reason, the way I see it; hence, this post.
I'll use the von Neumann-Morgenstern axioms, which assume probability theory as a foundation (unlike the Savage axioms, which actually imply that anyone following them has not only a utility function but also a probability distribution). I will assume that you already accept Bayesianism.
Epistemic rationality is about figuring out what's true; instrumental rationality is about steering the future where you want it to go. The way I see it, the axioms of decision theory tell you how to have a consistent direction in which you are trying to steer the future. If my choice at 12:01 depends on whether at 11:59 I had a chance to decide differently, then perhaps I won't ever be money-pumped; but if I want to save as many human lives as possible, and I must decide between different plans that have different probabilities of saving different numbers of people, then it starts to at least seem doubtful that which plan is better at 12:01 could genuinely depend on my opportunity to choose at 11:59.
So how do we formalize the notion of a coherent direction in which you can steer the future?
Setting the stage
Decision theory asks what you would do if faced with choices between different sets of options, and then places restrictions on how you can act in one situation, depending on how you would act in others. This is another thing that has always bothered me: If we are talking about choices between different lotteries with small prizes, it makes some sense that we could invite you to the lab and run ten sessions with different choices, and you should probably act consistently across them. But if we're interested in the big questions, like how to save the world, then you're not going to face a series of independent, analogous scenarios. So what is the content of asking what you would do if you faced a set of choices different from the one you actually face?
The real point is that you have bounded computational resources, and you can't actually visualize the exact set of choices you might face in the future. A perfect Bayesian rationalist could just figure out what they would do in any conceivable situation and write it down in a giant lookup table, which means that they only face a single one-time choice between different possible tables. But you can't do that, and so you need to figure out general principles to follow. A perfect Bayesian is like a Carnot engine — it's what a theoretically perfect engine would look like, so even though you can at best approximate it, it still has something to teach you about how to build a real engine.
But decision theory is about what a perfect Bayesian would do, and it's annoying to have our practical concerns intrude into our ideal picture like that. So let's give our story some local color and say that you aren't a perfect Bayesian, but you have a genie — that is, a powerful optimization process — that is, an AI, which is. (That, too, is physically impossible: AIs, like humans, can only approximate perfect Bayesianism. But we are still idealizing.) Your genie is able to comprehend the set of possible giant lookup tables it must choose between; you must write down a formula, to be evaluated by the genie, that chooses the best table from this set, given the available information. (An unmodified human won't actually be able to write down an exact formula describing their preferences, but we might be able to write down one for a paperclip maximizer.)
The first constraint decision theory places on your formula is that it must order all options your genie might have to choose between from best to worst (though you might be indifferent between some of them), and then given any particular set of feasible options, it must choose the one that is least bad. In particular, if you prefer option A when options A and B are available, then you can't prefer option B when options A, B and C are available.
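The last point can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `choose` and the example ranking are hypothetical, and indifference between options is ignored for simplicity. The point is that a choice rule derived from one fixed ordering cannot flip from A to B just because C becomes available.

```python
def choose(ranking, feasible):
    """Pick the feasible option that appears earliest in a fixed ranking (best first)."""
    position = {option: i for i, option in enumerate(ranking)}
    return min(feasible, key=lambda option: position[option])


# Hypothetical ordering of three options, best first.
ranking = ["A", "B", "C"]

pick_from_two = choose(ranking, {"A", "B"})
pick_from_three = choose(ranking, {"A", "B", "C"})

# Adding C can only change the choice if C itself is chosen.
print(pick_from_two, pick_from_three)
```

If C were ranked first instead, the choice from the larger set would be C, but the choice from {A, B} would still be A.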
Meditation: Alice is trying to decide how large a bonus each member of her team should get this year. She has just decided on giving Bob the same, already large, bonus as last year when she receives an e-mail from the head of a different division, asking her if she can recommend anyone for a new project he is setting up. Alice immediately realizes that Bob would love to be on that project, and would fit the bill exactly. But she needs Bob on the contract he's currently working on; losing him would be a pretty bad blow for her team.
Alice decides there is no way that she can recommend Bob for the new project. But she still feels bad about it, and she decides to make up for it by giving Bob a larger bonus. On reflection, she finds that she genuinely feels that this is the right thing to do, simply because she could have recommended him but didn't. Does that mean that Alice's preferences are irrational? Or that something is wrong with decision theory?
Meditation: One kind of answer to the above and to many other criticisms of decision theory goes like this: Alice's decision isn't between giving Bob a larger bonus or not, it's between (give Bob a larger bonus unconditionally), (give Bob the same bonus unconditionally), (only give Bob a larger bonus if I could have recommended him), and so on. But if that sort of thing is allowed, is there any way left in which decision theory constrains Alice's behavior? If not, what good is it to Alice in figuring out what she should do?
My short answer is that Alice can care about anything she damn well likes. But there are a lot of things that she doesn't care about, and decision theory has something to say about those.
In fact, deciding that some kinds of preferences should be outlawed as irrational can be dangerous: you might think that nobody in their right mind should ever care about the detailed planning algorithms their AI uses, as long as they work. But how certain are you that it's wrong to care about whether the AI has planned out your whole life in advance, in detail? (Worse: Depending on how strictly you interpret it, this injunction might even rule out not wanting the AI to run conscious simulations of people.)
But nevertheless, I believe the "anything she damn well likes" needs to be qualified. Imagine that Alice and Carol both have an AI, and fortuitously, both AIs have been programmed with the same preferences and the same Bayesian prior (and they talk, so they also have the same posterior, because Bayesians cannot agree to disagree). But Alice's AI has taken over the stock markets, while Carol's AI has seized the world's nuclear arsenals (and is protecting them well). So Alice's AI not only doesn't want to blow up Earth, it couldn't do so even if it wanted to; it couldn't even bribe Carol's AI, because Carol's AI really doesn't want the Earth blown up either. And so, if it makes a difference to the AIs' preference function whether they could blow up Earth if they wanted to, they have a conflict of interest.
The moral of this story is not simply that it would be sad if two AIs came into conflict even though they have the same preferences. The point is that we're asking what it means to have a consistent direction in which you are trying to steer the future, and it doesn't look like our AIs are on the same bearing. Surely, a direction for steering the world should only depend on features of the world, not on additional information about which agent is at the rudder.
You can want to not have your life planned out by an AI. But I think you should have to state your wish as a property of the world: you want all AIs to refrain from doing so, not just "whatever AI happens to be executing this". And Alice can want Bob to get a larger bonus if the company could have assigned him to the new project and decided not to, but she must figure out whether this is the correct way to translate her moral intuitions into preferences over properties of the world.
You may care about any feature of the world, but you don't in fact care about most of them. For example, there are many ways the atoms in the sun could be arranged that all add up to the same thing as far as you are concerned, and you don't have terminal preferences about which of these will be the actual one tomorrow. And though you might care about some properties of the algorithms your AI is running, mostly they really do not matter.
Let's define a function that takes a complete description of the world — past, present and future — and returns a data structure containing all information about the world that matters to your terminal values, and only that information. (Our imaginary perfect Bayesian doesn't know exactly which way the world will turn out, but it can work with "possible worlds", complete descriptions of ways the world may turn out.) We'll call this data structure an "outcome", and we require you to be indifferent between any two courses of action that will always produce the same outcome. Of course, any course of action is something that your AI would be executing in the actual world, and you are certainly allowed to care about the difference — but then the two courses of action do not lead to the same "outcome"!1
With this definition, I think it is pretty reasonable to say that in order to have a consistent direction in which you want to steer the world, you must be able to order these outcomes from best to worst, and always want to pick the least bad you can get.
That won't be sufficient, though. Our genie doesn't know what outcome each action will produce, it only has probabilistic information about that, and that's a complication we very much do not want to idealize away (because we're trying to figure out the right way to deal with it). And so our decision theory amends the earlier requirement: You must not only be indifferent between actions that always produce the same outcome, but also between all actions that only yield the same probability distribution over outcomes.
This is not at all a mild assumption, though it's usually built so deeply into the definitions that it's not even called an "axiom". But we've assumed that all features of the world you care about are already encoded in the outcomes, so it does seem to me that the only reason left why you might prefer one action over another is that it gives you a better trade-off in terms of what outcomes it makes more or less likely; and I've assumed that you're already a Bayesian, so you agree that how likely it makes an outcome is correctly represented by the probability of that outcome, given the action. So it certainly seems that the probability distribution over outcomes should give you all the information about an action that you could possibly care about. And that you should be able to order these probability distributions from best to worst, and all that.
Formally, we represent a direction for steering the world as a set \(\mathcal{O}\) of possible outcomes and a binary relation \(\succeq\) on the probability distributions over \(\mathcal{O}\) (with \(x \succeq y\) interpreted as "\(x\) is at least as good as \(y\)") which is a total preorder; that is, for all \(x\), \(y\) and \(z\):
- If \(x \succeq y\) and \(y \succeq z\), then \(x \succeq z\) (that is, \(\succeq\) is transitive); and
- We have either \(x \succeq y\) or \(y \succeq x\) or both (that is, \(\succeq\) is total).
In this post, I'll assume that \(\mathcal{O}\) is finite. We write \(x \sim y\) (for "I'm indifferent between \(x\) and \(y\)") when both \(x \succeq y\) and \(y \succeq x\), and we write \(x \succ y\) ("\(x\) is strictly better than \(y\)") when \(x \succeq y\) but not \(y \succeq x\). Our genie will compute the set of all actions it could possibly take, and the probability distribution over possible outcomes that (according to the genie's Bayesian posterior) each of these actions leads to, and then it will choose to act in a way that maximizes \(\succeq\). I'll also assume that the set of possible actions will always be finite, so there is always at least one optimal action.
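As a rough illustration of these two conditions, here is a minimal Python sketch that brute-force checks whether a relation on a small finite collection is a total preorder, and then picks the actions whose lottery is at least as good as every alternative. The names `is_total_preorder` and `best_actions`, the three lotteries, and the toy relation (which ranks lotteries by their probability of the best outcome) are all assumptions made for illustration.

```python
from itertools import product


def is_total_preorder(items, at_least_as_good):
    """Brute-force check of totality and transitivity on a finite collection."""
    total = all(
        at_least_as_good(x, y) or at_least_as_good(y, x)
        for x, y in product(items, repeat=2)
    )
    transitive = all(
        at_least_as_good(x, z)
        for x, y, z in product(items, repeat=3)
        if at_least_as_good(x, y) and at_least_as_good(y, z)
    )
    return total and transitive


def best_actions(action_to_lottery, at_least_as_good):
    """Actions whose lottery is at least as good as every available lottery."""
    return [
        action
        for action, lottery in action_to_lottery.items()
        if all(at_least_as_good(lottery, other) for other in action_to_lottery.values())
    ]


# Hypothetical lotteries over three outcomes, in the order (bad, ok, great).
actions = {
    "cautious": (0.1, 0.8, 0.1),
    "bold": (0.4, 0.1, 0.5),
    "reckless": (0.7, 0.0, 0.3),
}


def toy_relation(x, y):
    """Toy assumption: more probability of 'great' is at least as good."""
    return x[2] >= y[2]


# A rock-paper-scissors style relation is total but not transitive.
BEATS = {("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")}


def cyclic_relation(x, y):
    return x == y or (x, y) in BEATS


print(is_total_preorder(list(actions.values()), toy_relation))              # True
print(best_actions(actions, toy_relation))                                  # ['bold']
print(is_total_preorder(["rock", "paper", "scissors"], cyclic_relation))    # False
```

With a cyclic relation there is no guarantee that any action is "at least as good as all others", which is one way to see why transitivity is needed before "choose the best" even makes sense.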
Meditation: Omega is in the neighbourhood and invites you to participate in one of its little games. Next Saturday, it plans to flip a fair coin; would you please indicate on the attached form whether you would like to bet that this coin will fall heads, or tails? If you correctly bet heads, you will win $10,000; if you correctly bet tails, you'll win $100. If you bet wrongly, you will still receive $1 for your participation.
We'll assume that you prefer a 50% chance of $10,000 and a 50% chance of $1 to a 50% chance of $100 and a 50% chance of $1. Thus, our theory would say that you should bet heads. But there is a twist: Given recent galactopolitical events, you estimate a 3% chance that after posting its letter, Omega has been called away on urgent business. In this case, the game will be cancelled and you won't get any money, though as a consolation, Omega will probably send you some book from its rare SF collection when it returns (market value: approximately $55–$70). Our theory so far tells you nothing about how you should bet in this case, but does Rationality have anything to say about it?
The Axiom of Independence
So here's how I think about that problem: If you already knew that Omega is still in the neighbourhood (but not which way the coin is going to fall), you would prefer to bet heads, and if you knew it has been called away, you wouldn't care. (And what you bet has no influence on whether Omega has been called away.) So heads is either better or exactly the same; clearly, you should bet heads.
This type of reasoning is the content of the von Neumann-Morgenstern Axiom of Independence. Apparently, that's the most controversial of the theory's axioms.
You're already a Bayesian, so you already accept that if you perform an experiment to determine whether someone is a witch, and the experiment can come out two ways, then if one of these outcomes is evidence that the person is a witch, the other outcome must be evidence that they are not. New information is allowed to make a hypothesis more likely, but not predictably so; if all ways the experiment could come out make the hypothesis more likely, then you should already be finding it more likely than you do. The same thing is true even if only one result would make the hypothesis more likely, but the other would leave your probability estimate exactly unchanged.
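The Bayesian fact used here can be checked numerically. The sketch below uses made-up numbers (the prior and both likelihoods are assumptions) and verifies that the probability-weighted average of the two possible posteriors equals the prior, so the two results cannot both push the estimate upward.

```python
# Hypothetical numbers: prior probability of "witch" and the likelihood of a
# positive test result under each hypothesis.
prior = 0.2
p_positive_given_witch = 0.9
p_positive_given_not_witch = 0.3

# Probability of seeing a positive result at all.
p_positive = p_positive_given_witch * prior + p_positive_given_not_witch * (1 - prior)

# Posterior after each of the two possible results (Bayes' theorem).
posterior_if_positive = p_positive_given_witch * prior / p_positive
posterior_if_negative = (1 - p_positive_given_witch) * prior / (1 - p_positive)

# Averaging the posteriors by how likely each result is recovers the prior.
expected_posterior = (
    posterior_if_positive * p_positive + posterior_if_negative * (1 - p_positive)
)

print(posterior_if_positive, posterior_if_negative, expected_posterior)
```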
The Axiom of Independence is equivalent to saying that if you're evaluating a possible course of action, and one experimental result would make it seem more attractive than it currently seems to you, while the other experimental result would at least make it seem no less attractive, then you should already be finding it more attractive than you do. This does seem rather solid to me.
So what does this axiom say formally? (Feel free to skip this section if you don't care.)
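Suppose that your genie is considering two possible actions \(a\) and \(b\) (bet heads or tails), and an event \(E\) (Omega is called away). Each action gives rise to a probability distribution over possible outcomes: E.g., \(P(o \mid a)\) is the probability of outcome \(o\) if your genie chooses \(a\). But your genie can also compute a probability distribution conditional on \(E\), \(P(o \mid E, a)\). Suppose that conditional on \(E\), it doesn't matter which action you pick: \(P(o \mid E, a) = P(o \mid E, b)\) for all \(o\). And finally, suppose that the probability of \(E\) doesn't depend on which action you pick: \(P(E \mid a) = P(E \mid b) = p\), with \(0 < p < 1\). The Axiom of Independence says that in this situation, you should prefer the distribution \(P(\cdot \mid a)\) to the distribution \(P(\cdot \mid b)\), and therefore prefer \(a\) to \(b\), if and only if you prefer the distribution \(P(\cdot \mid \neg E, a)\) to the distribution \(P(\cdot \mid \neg E, b)\).
Let's write \(x\) for the distribution \(P(\cdot \mid \neg E, a)\), \(y\) for the distribution \(P(\cdot \mid \neg E, b)\), and \(z\) for the distribution \(P(\cdot \mid E, a) = P(\cdot \mid E, b)\). (Formally, we think of these as vectors in \(\mathbb{R}^{\mathcal{O}}\): e.g., \(x_o = P(o \mid \neg E, a)\).) For all \(o\), we have
\[ P(o \mid a) = P(o \mid \neg E, a)\,P(\neg E \mid a) + P(o \mid E, a)\,P(E \mid a) = (1 - p)\,x_o + p\,z_o, \]
so \(P(\cdot \mid a) = (1-p)x + pz\), and similarly \(P(\cdot \mid b) = (1-p)y + pz\). Thus, we can state the Axiom of Independence as follows:
\[ x \succeq y \iff (1-p)x + pz \succeq (1-p)y + pz. \]
We'll assume that you can't ever rule out the possibility that your AI might face this type of situation for any given \(x\), \(y\), \(z\), and \(p\), so we require that this condition hold for all probability distributions \(x\), \(y\) and \(z\), and for all \(p\) with \(0 < p < 1\).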
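To make the mixture concrete, here is a small Python sketch using Omega's game. It only illustrates the easy direction: an agent that ranks lotteries by expected utility automatically satisfies Independence, because mixing in the same lottery with the same weight shrinks the gap between two options by a fixed positive factor without changing its sign. The utility numbers are purely hypothetical, and the helper names `mix` and `expected_utility` are mine.

```python
def mix(first, second, p):
    """Return the lottery (1 - p) * first + p * second, component by component."""
    return [(1 - p) * f + p * s for f, s in zip(first, second)]


def expected_utility(lottery, utility):
    return sum(prob * u for prob, u in zip(lottery, utility))


# Outcome order: $1, $100, $10,000, consolation book.
x = [0.5, 0.0, 0.5, 0.0]  # bet heads, given Omega is still around
y = [0.5, 0.5, 0.0, 0.0]  # bet tails, given Omega is still around
z = [0.0, 0.0, 0.0, 1.0]  # Omega was called away (same for both bets)
p = 0.03                  # probability that Omega was called away

# Hypothetical utilities, one per outcome; any other numbers would do.
utility = [1.0, 100.0, 10_000.0, 60.0]

gap_given_not_e = expected_utility(x, utility) - expected_utility(y, utility)
gap_overall = expected_utility(mix(x, z, p), utility) - expected_utility(
    mix(y, z, p), utility
)

# The overall gap is the conditional gap scaled by (1 - p) > 0.
print(gap_given_not_e, gap_overall)
```

The converse direction, that Independence (together with the other axioms) forces expected utility maximization, is the content of the theorem discussed later and is not something a numerical example can show.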
Here's a common criticism of Independence. Suppose a parent has two children, and one old car that they can give to one of these children. Can't they be indifferent between giving the car to their older child or their younger child, but strictly prefer throwing a coin? But let \(x\) mean that the younger child gets the gift, and \(y\) that the older child gets it, and take \(z = y\) and \(p = \tfrac12\); then by Independence, if \(x \sim y\), then \(\tfrac12 x + \tfrac12 y \sim \tfrac12 y + \tfrac12 y = y\), so it would seem that the parent cannot strictly prefer the coin throw.
In fairness, the people who find this criticism persuasive may not be Bayesians. But if you think this is a good criticism: Do you think that the parent must be indifferent between throwing a coin and asking the children's crazy old kindergarten teacher which of them was better-behaved, as long as they assign 50% probability to either answer? Because if not, shouldn't you already have protested when we decided that decisions must only depend on the probabilities of different outcomes?
My own resolution is that this is another case of terminal values intruding where they don't belong. All that is relevant to the parent's terminal values must already be described in the outcome; the parent is allowed to prefer "I threw a coin and my younger child got the car" to "I decided that my younger child would get the car" or "I asked the kindergarten teacher and they thought my younger child was better-behaved", but if so, then these must already be different outcomes. The thing to remember is that it isn't a property of the world that either child had a 50% probability of getting the car, and you can't steer the future in the direction of having this mythical property. It is a property of the world that the parent assigned a 50% probability to each child getting the car, and that is a direction you can steer in — though the example with the kindergarten teacher shows that this is probably not quite the direction you actually wanted.
The preference relation is only supposed to be about trade-offs between probability distributions; if you're tempted to say that you want to steer the world towards one probability distribution or another, rather than one outcome or other, something has gone terribly wrong.
The Axiom of Continuity
And… that's it. These are all the axioms that I'll ask you to accept in this post.
There is, however, one more axiom in the von Neumann-Morgenstern theory, the Axiom of Continuity. I do not think this axiom is a necessary requirement on any coherent plan for steering the world; I think the best argument for it is that it doesn't make a practical difference whether you adopt it, so you might as well. But there is also a good argument to be made that if we're talking about anything short of steering the entire future of humanity, your preferences do in fact obey this axiom, and it makes things easier technically if we adopt it, so I'll do that at least for now.
Let's look at an example: If you prefer $50 in your pocket to $40, the axiom says that there must be some small \(\epsilon > 0\) such that you prefer a probability of \(1 - \epsilon\) of $50 and a probability of \(\epsilon\) of dying today to a certainty of $40. Some critics seem to see this as the ultimate reductio ad absurdum for the VNM theory; they seem to think that no sane human would accept that deal.
Eliezer was surely not the first to observe that this preference is exhibited each time someone drives an extra mile to save $10.
Continuity says that if you strictly prefer \(x\) to \(y\), then there is no \(z\) so terrible that you wouldn't be willing to incur a small probability of it in order to (probably) get \(x\) rather than \(y\), and no \(z\) so wonderful that you'd be willing to (probably) get \(y\) instead of \(x\) if this gives you some arbitrarily small probability of getting \(z\). Formally, for all \(x\), \(y\) and \(z\),
- If \(x \succ y\), then there is an \(\epsilon > 0\) such that \((1-\epsilon)x + \epsilon z \succ y\) and \(x \succ (1-\epsilon)y + \epsilon z\).
I think if we're talking about everyday life, we can pretty much rule out that there are things so terrible that for arbitrarily small \(\epsilon\), you'd be willing to die with probability \(1 - \epsilon\) to avoid a probability of \(\epsilon\) of the terrible thing. And if you feel that it's not worth the expense to call a doctor every time you sneeze, you're willing to incur a slightly higher probability of death in order to save some mere money. And it seems unlikely that there is no \(\epsilon\) at which you'd prefer a certainty of $1 to a chance of \(\epsilon\) of $100. And if you have some preference that is so slight that you wouldn't be willing to accept any chance of losing $1 in order to indulge it, it can't be a very strong preference. So I think for most practical purposes, we might as well accept Continuity.
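For an agent that already has real-valued utilities, the \(\epsilon\) in the $50 versus $40 example can be computed directly. The sketch below is an illustration under assumptions: the three utility numbers are invented, and the function name `continuity_threshold` is hypothetical. It solves \((1-\epsilon)\,u_{\text{better}} + \epsilon\,u_{\text{terrible}} = u_{\text{worse}}\) for \(\epsilon\); any smaller positive \(\epsilon\) leaves the gamble strictly preferred.

```python
def continuity_threshold(u_better, u_worse, u_terrible):
    """Epsilon at which the gamble on 'better' (with risk of 'terrible') ties with 'worse'."""
    if not (u_better > u_worse > u_terrible):
        raise ValueError("expected u_better > u_worse > u_terrible")
    return (u_better - u_worse) / (u_better - u_terrible)


def gamble_utility(eps, u_better, u_terrible):
    return (1 - eps) * u_better + eps * u_terrible


# Hypothetical utilities for $50, $40 and dying today.
u_50, u_40, u_death = 50.0, 40.0, -10_000_000.0

threshold = continuity_threshold(u_50, u_40, u_death)

# Below the threshold the gamble beats a certain $40; above it, it does not.
accepts_small_risk = gamble_utility(threshold / 2, u_50, u_death) > u_40
accepts_large_risk = gamble_utility(threshold * 2, u_50, u_death) > u_40

print(threshold, accepts_small_risk, accepts_large_risk)
```

However bad the terrible outcome is made, as long as its utility is a finite number the threshold stays strictly positive, which is exactly what Continuity asks for.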
The VNM theorem
If your preferences are described by a transitive and complete relation on the probability distributions over some set of "outcomes", and this relation satisfies Independence and Continuity, then you have a utility function, and your genie will be maximizing expected utility.
Here's what that means. A utility function is a function \(u : \mathcal{O} \to \mathbb{R}\) which assigns a numerical "utility" to every outcome. Given a probability distribution \(x\) over \(\mathcal{O}\), we can compute the expected value of \(u\) under \(x\), \(\mathbb{E}_x[u] = \sum_{o \in \mathcal{O}} x_o\,u(o)\); this is called the expected utility. We can prove that there is some utility function \(u\) such that for all \(x\) and \(y\), we have \(x \succ y\) if and only if the expected utility under \(x\) is greater than the expected utility under \(y\).
In other words: \(\succeq\) is completely described by \(u\); if you know \(u\), you know \(\succeq\). Instead of programming your genie with a function that takes two outcomes and says which one is better, you might as well program it with a function that takes one outcome and returns its utility. Any coherent direction for steering the world which happens to satisfy Continuity can be reduced to a function that takes outcomes and assigns them numerical ratings.
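In code, a genie programmed this way is very short. The following is a minimal sketch, not a claim about how any real system works: the plan names, the probabilities, and the assumption that utility is simply the number of lives saved are all hypothetical.

```python
def expected_utility(lottery, utility):
    """Sum of probability times utility over the outcomes in the lottery."""
    return sum(prob * utility[outcome] for outcome, prob in lottery.items())


def genie_choice(action_to_lottery, utility):
    """Return an action whose lottery has maximal expected utility."""
    return max(
        action_to_lottery,
        key=lambda action: expected_utility(action_to_lottery[action], utility),
    )


# Hypothetical utility function: one unit of utility per life saved.
utility = {"saves_0": 0.0, "saves_400": 400.0, "saves_500": 500.0}

# Hypothetical plans with different probabilities of saving different numbers of people.
plans = {
    "plan_safe": {"saves_400": 1.0},
    "plan_risky": {"saves_500": 0.9, "saves_0": 0.1},
}

chosen = genie_choice(plans, utility)
print(chosen, expected_utility(plans[chosen], utility))
```

Whether utility really should be linear in lives saved is a question about your preferences, not something the theorem settles; a different utility function could make the safe plan come out ahead.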
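In fact, it turns out that the \(u\) for a given \(\succeq\) is "almost" unique: Given two utility functions \(u\) and \(v\) that describe the same \(\succeq\), there are numbers \(\alpha > 0\) and \(\beta\) such that for all \(o\), \(v(o) = \alpha\,u(o) + \beta\); this is called an "affine transformation". On the other hand, it's not hard to see that for any such \(\alpha\) and \(\beta\),
\[ \mathbb{E}_x[u] > \mathbb{E}_y[u] \iff \alpha\,\mathbb{E}_x[u] + \beta > \alpha\,\mathbb{E}_y[u] + \beta \iff \mathbb{E}_x[v] > \mathbb{E}_y[v], \]
so two utility functions represent the same preference relation if and only if they are related in this way.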
You shouldn't read too much into this conception of utility. For example, it doesn't make sense to see a fundamental distinction between outcomes with "positive" and with "negative" von Neumann-Morgenstern utility, because adding the right \(\beta\) can make any negative utility positive and any positive utility negative, without changing the underlying preference relation. The numbers that have real meaning are ratios between differences between utilities, \(\frac{u(o_1) - u(o_2)}{u(o_3) - u(o_4)}\), because these don't change under affine transformations (the \(\beta\)'s cancel when you take the difference, and the \(\alpha\)'s cancel when you take the ratio). Academian's post has more about misunderstandings of VNM utility.
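A quick numerical check of both claims, as a sketch with made-up numbers: the utilities, the scale factor `alpha`, the shift `beta`, and the two lotteries are all assumptions. Rescaling leaves the ranking of lotteries and the ratio of utility differences unchanged, even though it turns every utility negative.

```python
import math


def expected_utility(lottery, utility):
    return sum(prob * utility[outcome] for outcome, prob in lottery.items())


def ratio_of_differences(utility):
    return (utility["o1"] - utility["o2"]) / (utility["o3"] - utility["o4"])


# Hypothetical utilities and a hypothetical positive affine transformation.
u = {"o1": 0.0, "o2": 1.0, "o3": 4.0, "o4": -2.0}
alpha, beta = 2.5, -100.0
v = {outcome: alpha * value + beta for outcome, value in u.items()}

# Two hypothetical lotteries over the four outcomes.
x = {"o1": 0.5, "o3": 0.5}
y = {"o2": 0.7, "o4": 0.3}

same_ranking = (expected_utility(x, u) > expected_utility(y, u)) == (
    expected_utility(x, v) > expected_utility(y, v)
)
same_ratio = math.isclose(ratio_of_differences(u), ratio_of_differences(v))
all_negative_under_v = all(value < 0 for value in v.values())

print(same_ranking, same_ratio, all_negative_under_v)
```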
In my view, what VNM utilities represent is not necessarily how good each outcome is; what they represent is what trade-offs between probability distributions you are willing to accept. Now, if you strongly felt that the difference between \(o_1\) and \(o_2\) was about the same as the difference between \(o_3\) and \(o_4\), then you should have a very good reason before you make your \(\frac{u(o_1) - u(o_2)}{u(o_3) - u(o_4)}\) a huge number. But on the other hand, I think it's ultimately your responsibility to decide what trade-offs you are willing to make; I don't think you can get away with "stating how much you value different outcomes" and outsourcing the rest of the job to decision theory, without ever considering what these valuations should mean in terms of probabilistic trade-offs.
Doing without Continuity
What happens if your preferences do not satisfy Continuity? Say, you want to save human lives, but you're not willing to incur any probability, no matter how small, of infinitely many people getting tortured infinitely long for this?
I do not see a good argument that this couldn't add up to a coherent direction for steering the world. I do, however, see an argument that in this case you care so little about finite numbers of human lives that in practice, you can probably neglect this concern entirely. (As a result, I doubt that your reflective equilibrium would want to adopt such preferences. But I don't think they're incoherent.)
I'll assume that your morality can still distinguish only a finite number of outcomes, and you can choose only between a finite number of decisions. It's not obvious that these assumptions are justified if we want to take into account the possibility that the true laws of physics might turn out to allow for infinite computations, but even in this case you and any AI you build will probably still be finite (though it might build a successor that isn't), so I do in fact think there is a good chance that results derived under this assumption have relevance in the real world.
In this case, it turns out that you still have a utility function, in a certain sense. (Proofs for non-standard results can be found in the math appendix to this post. I did the work myself, but I don't expect these results to be new.) This utility function describes only the concern most important to you: in our example, only the probability of infinite torture makes a difference to expected utility; any change in the probability of saving a finite number of lives leaves expected utility unchanged.
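Let's define a relation \(x \gg y\), read "\(x\) is much better than \(y\)", which says that there is nothing you wouldn't give up a little probability of in order to get \(x\) instead of \(y\); in our example: \(x\) doesn't merely save lives compared to \(y\), it makes infinite torture less likely. Formally, we define \(x \gg y\) to mean that \(x' \succ y'\) for all \(x'\) and \(y'\) "close enough" to \(x\) and \(y\) respectively; more precisely: \(x \gg y\) if there is an \(\epsilon > 0\) such that \(x' \succ y'\) for all \(x'\) and \(y'\) with
\[ \lVert x' - x \rVert < \epsilon \quad\text{and}\quad \lVert y' - y \rVert < \epsilon. \]
(Or equivalently: if there are open sets \(U\) and \(V\) around \(x\) and \(y\), respectively, such that \(x' \succ y'\) for all \(x' \in U\) and \(y' \in V\).)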
It turns out that if \(\succeq\) is a preference relation satisfying Independence, then \(\gg\) is a preference relation satisfying Independence and Continuity, and there is a utility function \(u\) such that \(x \gg y\) iff the expected utility under \(x\) is larger than the expected utility under \(y\). Obviously, \(x \gg y\) implies \(x \succ y\), so whenever two options have different expected utilities, you prefer the one with the larger expected utility. Your genie is still an expected utility maximizer.
Furthermore, unless \(x \sim y\) for all \(x\) and \(y\), \(u\) isn't constant; that is, there are some \(x\) and \(y\) with \(x \gg y\). (If this weren't the case, the result above obviously wouldn't tell us very much about \(\succeq\)!) Being indifferent between all possible actions doesn't make for a particularly interesting direction for steering the world, if it can be called one at all, so from now on let's assume that you are not.
It can happen that there are two distributions \(x\) and \(y\) with the same expected utility, but \(x \succ y\). (\(x\) saves more lives, but the probability of eternal torture is the same.) Thus, if your genie happens to face a choice between two actions that lead to the same expected utility, it must do more work to figure out which of the actions it should take. But there is some reason to expect that such situations should be rare.
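Here is a toy model of such preferences in Python, offered as a sketch under heavy assumptions. Outcomes are pairs `(torture, lives_saved)`, with a boolean flag standing in for "infinite torture"; lotteries are compared first by the probability of torture and only then by expected lives saved. All names and numbers are hypothetical. Because both quantities are linear in the probabilities, this lexicographic rule satisfies Independence, but it violates Continuity.

```python
def torture_probability(lottery):
    return sum(prob for (torture, _lives), prob in lottery.items() if torture)


def expected_lives(lottery):
    return sum(prob * lives for (_torture, lives), prob in lottery.items())


def sort_key(lottery):
    """Lexicographic key: less torture first, then more expected lives saved."""
    return (-torture_probability(lottery), expected_lives(lottery))


def strictly_prefers(first, second):
    return sort_key(first) > sort_key(second)


def dominant_utility(lottery):
    """Expected utility when u = -1 for torture outcomes and 0 otherwise."""
    return -torture_probability(lottery)


# Hypothetical lotteries over (torture, lives_saved) outcomes.
x = {(False, 1000): 0.99, (True, 0): 0.01}
y = {(False, 10): 0.99, (True, 0): 0.01}
z = {(False, 0): 0.999, (True, 0): 0.001}

# x and y tie on the dominant concern, so lives saved break the tie.
tie_broken_by_lives = (
    dominant_utility(x) == dominant_utility(y) and strictly_prefers(x, y)
)
# z saves nobody, yet wins because it makes torture slightly less likely.
torture_dominates = strictly_prefers(z, x)

print(tie_broken_by_lives, torture_dominates)
```

In this toy model the utility function only sees the first component of the key, which matches the claim that lives saved matter only when the dominant expected utilities are exactly equal.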
If there are \(n\) possible outcomes, then the set \(\Delta(\mathcal{O})\) of probability distributions over \(\mathcal{O}\) is \((n-1)\)-dimensional (because the probabilities must add up to 1, so if you know \(n-1\) of them, you can figure out the last one). For example, if there are three outcomes, \(\Delta(\mathcal{O})\) is a triangle, and if there are four outcomes, it's a tetrahedron. On the other hand, it turns out that for any \(c \in \mathbb{R}\), the set of all \(x \in \Delta(\mathcal{O})\) for which the expected utility equals \(c\) has dimension \(n-2\) or smaller: if \(n = 3\), it's a line (or a point or the empty set); if \(n = 4\), it's a plane (or a line or a point or the empty set).
Thus, in order to have the same expected utility, \(x\) and \(y\) must lie on the same hyperplane: not just on a plane very close by, but on exactly the same plane. That's not just a small target to hit, that's an infinitely small target. If you use, say, a Solomonoff prior, then it seems very unlikely that two of your finitely many options just happen to lead to probability distributions which yield the same expected utility.
But we are bounded rationalists, not perfect Bayesians with uncomputable Solomonoff priors. We assign heads and tails exactly the same probability, not because there is no information that would make one or the other more likely (we could try to arrive at a best guess about which side is a little heavier than the other?), but because the problem is so complicated that we simply give up on it. What if it turns out that because of this, all the difficult decisions we need to make turn out to be between actions that happen to have the same expected utility?
If you do your imperfect calculation and find that two of your options seem to yield exactly the same probability of eternal hell for infinitely many people, you could then try to figure out which of them is more likely to save a finite number of lives. But it seems to me that this is not the best approximation of an ideal Bayesian with your stated preferences. Shouldn't you spend those computational resources on doing a better calculation of which option is more likely to lead to eternal hell?
For you might arrive at a new estimate under which the probabilities of hell are at least slightly different. Even if you suspect that the new calculation will again come out with the probabilities exactly equal, you don't know that. And therefore, can you truly in good conscience argue that doing the new calculation does not improve the odds of avoiding hell —
— at least a teeny tiny incredibly super-small for all ordinary intents and purposes completely irrelevant bit?
Even if it should be the case that to a perfect Bayesian, the expected utilities under a Solomonoff prior were exactly the same, you don't know that, so how can you possibly justify stopping the calculation and saving a mere finite number of lives?
So there you have it. In order to have a coherent direction in which you want to steer the world, you must have a set of outcomes and a preference relation over the probability distributions over these outcomes, and this relation must satisfy Independence — or so it seems to me, anyway. And if you do, then you have a utility function, and a perfect Bayesian maximizing your preferences will always maximize expected utility.
It could happen that two options have exactly the same expected utility, and in this case the utility function doesn't tell you which of these is better, under your preferences; but as a bounded rationalist, you can never know this, so if you have any computational resources left that you could spend on figuring out what your true preferences have to say, you should spend them on a better calculation of the expected utilities instead.
Given this, we might as well just talk about \(\gg\), which satisfies Continuity as well as Independence, instead of \(\succeq\); and you might as well program your genie with your utility function, which only reflects \(\gg\), instead of with your true preferences.
(Note: I am not literally saying that you should not try to understand the whole topic better than this if you are actually going to program a Friendly AI. This is still meant as a metaphor. I am, however, saying that expected utility theory, even with boring old real numbers as utilities, is not to be discarded lightly.)
Next post: Dealing with time
So far, we've always pretended that you only face one choice, at one point in time. But not only is there a way to apply our theory to repeated interactions with the environment — there are two!
One way is to say that at each point in time, you should apply decision theory to the set of actions you can perform at that point. Now, the actual outcome depends of course not only on what you do now, but also on what you do later; but you know that you'll still use decision theory later, so you can foresee what you will do in any possible future situation, and take it into account when computing what action you should choose now.
The second way is to make a choice only once, not between the actions you can take at that point in time, but between complete plans — giant lookup tables — which specify how you will behave in any situation you might possibly face. Thus, you simply do your expected utility calculation once, and then stick with the plan you have decided on.
Meditation: Which of these is the right thing to do, if you have a perfect Bayesian genie and you want to steer the future in some particular direction? (Does it even make a difference which one you use?)
1 The accounts of decision theory I've read use the term "outcome", or "consequence", but leave it mostly undefined; in a lottery, it's the prize you get at the end, but clearly nobody is saying decision theory should only apply to lotteries. I'm not changing its role in the mathematics, and I think my explanation of it is what the term always wanted to mean; I expect that other people have explained it in similar ways, though I'm not sure how similar precisely.