# Estimates vs. head-to-head comparisons

4th May 2013

(Cross-posted from my blog.)

Summary: when choosing between two options, it’s not always optimal to estimate the value of each option and then pick the better one.

Suppose I am choosing between two actions, X and Y. One way to make my decision is to predict what will happen if I do X and predict what will happen if I do Y, and then pick the option which leads to the outcome that I prefer.

My predictions may be both vague and error-prone, and my value judgments might be very hard or nearly arbitrary. But it seems like I ultimately must make some predictions, and must decide how valuable the different outcomes are. So if I have to evaluate N options, I could do it by evaluating the goodness of each option, and then simply picking the option with the highest value. Right?

There are other possible procedures for evaluating which of two options is better. For example, I have often encountered advice of the form "if your error bars are too big, you should just ignore the estimate". To be most extreme, I could choose some particular axis on which options can be better or worse, and then pick the option which is best on that axis, ignoring all others. (E.g., I could choose the option which is cheapest, or the charity which is most competently administered, or whatever.)

If you have an optimistic quantitative outlook like mine, this probably looks pretty silly—if one option is cheaper, that just gets figured into my estimate for how good it is. If my error bars are big, as long as I keep track of the error bars in my calculation it is still better than nothing. So why would I ever want to do anything other than estimate the value of each option?

In fact I don’t think my intuition is quite right. To see why, let’s start with a very simple case.

### A simple model

Alice and Bob are picking between two interventions X and Y. They only have a year to make their decision, so they split up: Alice will produce an estimate of the value of X and Bob will produce an estimate of the value of Y, and they will both do whichever one looks better. Let’s suppose that Alice and Bob are perfectly calibrated and trust each other completely, so that each of them believes the other’s estimate to be unbiased.

Suppose that intervention X is good because it reduces carbon emissions. First Alice dutifully estimates the reductions in emissions that result from intervention X; call that number A1. Of course Alice doesn’t care about carbon emissions per se: she cares about the improvements in human quality of life that result from decreased emissions, and she couldn’t compare her estimate with Bob’s unless she converted it into units of goodness. So she next estimates the gain in quality of life per unit of reduced emissions; call that number A2. She then reports that the value of X is A1 * A2. Because she is unbiased, as long as her estimates A1 and A2 are independent she obtains an unbiased estimate of the value of X.

Meanwhile, it happens to be the case that intervention Y is also good because it reduces carbon emissions. So Bob similarly estimates the reduction in carbon emissions from intervention Y, B1, and then the goodness of reduced emissions, B2, and reports B1 * B2. His estimate is also an unbiased estimate of the value of Y.

The pair decides to do intervention X iff it appears to have a higher value than Y, i.e. iff A1 * A2 > B1 * B2. This is not crazy, but it’s also not a very good idea. A2 and B2 are estimates of the same quantity, the goodness of a unit of reduced emissions, so X is actually better than Y iff it reduces emissions more, and the natural test for that is simply whether A1 > B1. But if the estimates A2 and B2 are relatively noisy, especially if the noise in those estimates is larger than the actual gap between A1 and B1, then Alice and Bob will make an unnecessarily random decision.
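To get a feel for the size of this effect, here is a minimal simulation sketch in Python. Every specific in it is an assumption made for illustration and not part of the argument above: the true emission reductions, the true value per unit of emissions, the Gaussian noise model, and the noise levels.

```python
import random

# Hypothetical numbers chosen only for illustration.
TRUE_A1, TRUE_B1 = 1.1, 1.0   # true emission reductions; X is truly better
TRUE_V = 1.0                   # true value per unit of reduced emissions
SMALL_NOISE = 0.05             # std. dev. of the error in A1 and B1
LARGE_NOISE = 0.5              # std. dev. of the error in A2 and B2


def simulate(trials=20_000, seed=0):
    """Return how often each comparison rule correctly picks X."""
    rng = random.Random(seed)
    product_correct = 0
    direct_correct = 0
    for _ in range(trials):
        a1 = rng.gauss(TRUE_A1, SMALL_NOISE)
        b1 = rng.gauss(TRUE_B1, SMALL_NOISE)
        # Alice and Bob independently estimate the SAME quantity.
        a2 = rng.gauss(TRUE_V, LARGE_NOISE)
        b2 = rng.gauss(TRUE_V, LARGE_NOISE)
        product_correct += a1 * a2 > b1 * b2   # compare full value estimates
        direct_correct += a1 > b1              # compare only what differs
    return product_correct / trials, direct_correct / trials


product_rate, direct_rate = simulate()
print(f"compare A1*A2 with B1*B2: correct {product_rate:.0%} of the time")
print(f"compare A1 with B1:       correct {direct_rate:.0%} of the time")
```

With these made-up numbers, the direct comparison should be right far more often than the comparison of products, which ends up only modestly better than a coin flip. Each of the four estimates is unbiased, so the loss comes entirely from how they are combined.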

What went wrong? Alice and Bob aren’t making a systematically bad decision, but they could have made a better decision by using a different technique for comparison. I think that a similar situation arises very often, in much less simple and slightly less severe situations. This may mean that the best way to compare X and Y is not always to compute the value for each. When making a comparison between X and Y, we can minimize uncertainty by making the analysis of X as similar to the analysis of Y as possible.

### Objections

Of course this example was very simple, and there are lots of reasons you might expect more realistic estimates to be safe from these problems. I think that, despite the differences between this model and realistic cases, it captures a common failure in estimation. The basic point is that the example shows there is no general reason to expect independent estimates of value to yield optimal results. Without a general reason to think that this procedure is optimal, it seems to be on much shakier ground. But to make the point, here are responses to some of the most obvious objections:

1. The reason we can say that Alice and Bob did badly is that we know something they didn't: that A2 and B2 were estimates of the same quantity. Couldn't they just have done one extra step of work (updating each of their estimates after looking at the other's work) and avoided the problem?

In this case, that would have solved Alice and Bob's problem. But in practice, different estimates rarely involve estimating exactly the same intermediates. If I want to compare the goodness of health interventions and education interventions in the developing world, the most natural estimates might not have even a single step in common. Nevertheless, each of those estimates would involve many uncertainties about social dynamics in the developing world, long-term global outcomes, and so on. I could do my analysis in a way that introduced analogies between the two estimates, and this could help me eliminate some of this uncertainty (even if the resulting estimates were noisier, or involved ignoring some apparently useful information).

If Alice and Bob's estimates were related in a more complicated way, then it's still the case that there is some extra update Alice and Bob could have done, which would have eliminated the problem (i.e. updating on each other's estimates, using that relationship). But such an update could be quite complicated, and after making it Alice and Bob would need to make further updates still. In general, it's not clear I can fix the problem without being logically omniscient. I don't know the extent of this issue in practice, and I'm not familiar with a literature on this or related problems. It seems pretty messy in general, but I expect it would be possible to make meaningful headway on it.

The point is: in order to prove that comparing independent value estimates is optimal, it is not enough to assume that my beliefs are well-calibrated. I also need to assume that my beliefs make use of all available information (including having considered every alternative estimation strategy that sheds light on the question), which is unrealistic even for an idealized agent unless it is logically omniscient. When my beliefs don’t make use of all available information, other techniques for comparison might do better, including using different estimates which have more elements in common. (In some cases, even very simple approaches like “do the cheapest thing” will be predictably better than comparing independent value estimates.)

2. Alice and Bob had trouble because they are two different people. I agree that I shouldn’t compare estimates from different people, but if I do all of the estimates myself it seems like this isn’t a problem.

When I try to estimate the same thing several times, without remembering my earlier estimates, I tend to get different results. I strongly suspect this is universal, though I haven’t seen research on that question.

Moreover, when I try to estimate different things, my estimates tend not to obey the logical relationships that I know the estimated quantities must, unless I go back through with those particular relationships in mind and enforce them. For example, if I estimate A and B separately, the sum is rarely the same as if I estimated A+B. When the relationships amongst items are complicated, such consistency is unrealistically difficult to enforce. (Of course, the prospects for making comparisons also suffer.) It may be that there is some principled way to get around these problems, but I don't know it.

Alice and Bob's estimates don’t have to be very far from each other before they could have done better. I agree that estimates from a single person will have a higher degree of consistency than estimates from different people, but they won't be consistent enough to remove the problem (or opportunity for improvement, if you want to look at it from a different angle).

3. The weird behavior in the example came from the artificial structure of the problem. How often could you do such factoring out for realistic estimates, even when they are similar?

If I’m trying to estimate the effect of different health interventions, the first step would be to separate the question “How much does this improve people’s health?” from “How much does improving people’s health matter?” That already factors out a big piece of the uncertainty. I think most people get that far, though, and so the question is: can you go farther?

I think it is still easier to estimate "Which of these interventions improve health more?" than to estimate the absolute improvement from either. We can break this comparison down into still smaller comparisons: “How many more or fewer people does X reach than Y?” and “Per person affected, what is the relative impact of X and Y?” etc. By focusing on the most important comparisons, and writing the others off as a wash, we might be able to reduce the total error in our comparison.
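One way to write this decomposition down, as a sketch: suppose (a simplifying assumption not made above) that the value of an intervention factors as the number of people reached, times the health impact per person, times the value of a unit of health. The symbols $N$, $I$, and $v$ are introduced here only for illustration.

$$
\frac{V_X}{V_Y} \;=\; \frac{N_X \, I_X \, v}{N_Y \, I_Y \, v} \;=\; \frac{N_X}{N_Y} \cdot \frac{I_X}{I_Y}
$$

The shared factor $v$ cancels, so its uncertainty, however large, does not enter the comparison. Only the two ratios need to be estimated, and if one of them is plausibly close to 1 it can be written off as a wash.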

### Conclusion

Trying to explicitly estimate the goodness of outcomes tends to draw a lot of criticism from pretty much every side. I think most of this criticism is unjustified (and often rooted in an aversion to making reasoning or motivations explicit, a desire to avoid offense or culpability, etc.). Nevertheless, there are problems with many straightforward approaches to quantitative estimation, and some qualitative processes improve on quantitative estimation in important ways. Many of these improvements are often dismissed by optimistic quantitative types (myself included), and I think that is an error. For example, I mentioned that I've often dismissed arguments of the form "If your error bars are too big, you are sometimes better off ignoring the data." This looks obviously wrong on the Bayesian account, but as far as I can tell it may actually be the optimal behavior, even for idealized, bias-free humans.

### Comments

> When I try to estimate the same thing several times, without remembering my earlier estimates, I tend to get different results. I strongly suspect this is universal, though I haven’t seen research on that question.

There is research by Vul & Pashler (2008) showing a within-person wisdom of crowds effect. They asked each person a trivia question, and then asked the same question to the same person again two weeks later, and found that averaging those two answers provided 1/3 the accuracy benefit that you get from asking the question to two different people. Wisdom of crowds works because each person's estimate is (the true value) + (systematic bias in the population) + (random person-specific noise), and the random person-specific noise cancels out when you average together more people. This result suggests that random person-specific noise actually breaks down into two parts: 2/3 is noise that depends stably on the person, and 1/3 of the noise varies within a person over time (although the exact proportions will presumably depend on the particular question and person).
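A back-of-the-envelope version of this decomposition, as a sketch under simplifying assumptions that are not taken from the paper: measure accuracy by mean squared error, set aside the population-wide bias (averaging does not reduce it), and treat the stable person-specific noise and the within-person noise as independent with variances $\sigma_s^2$ and $\sigma_w^2$. The error variance of the estimate is then:

$$
\begin{aligned}
\text{one guess:} \quad & \sigma_s^2 + \sigma_w^2 \\
\text{average of two guesses, same person:} \quad & \sigma_s^2 + \tfrac{1}{2}\sigma_w^2 \\
\text{average of two guesses, two people:} \quad & \tfrac{1}{2}\left(\sigma_s^2 + \sigma_w^2\right)
\end{aligned}
$$

The benefit of asking the same person twice, relative to the benefit of asking two people, is therefore

$$
\frac{\tfrac{1}{2}\sigma_w^2}{\tfrac{1}{2}\left(\sigma_s^2 + \sigma_w^2\right)} \;=\; \frac{\sigma_w^2}{\sigma_s^2 + \sigma_w^2} \;=\; \frac{1}{3},
$$

which says that the within-person part accounts for 1/3 of the person-specific noise variance and the stable part for the remaining 2/3.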

I made some comments to this effect before, most recently here.

Ultimately, the issue is that applied mathematics is a whole enormous field about finding the most effective or most accurate ways to estimate something, with many counterintuitive caveats. There are many ways to improve accuracy which do not look right when viewed through the prism of a little bit of theory. For example, when you evaluate a partial sum in order to approximate an ideal sum as accurately as possible, you may often want to just toss out some of the terms, such as terms whose selection for evaluation is correlated with their value to an unknown and potentially extreme extent, in a sum which ideally consists of a far larger number of terms than could ever be evaluated and which thus has to be approximated by sampling. There are all sorts of important things, such as correct regression towards the mean, which apply only to estimates but which you won't see in the ideal case.

It's not theoretically invalid, either - it simply corresponds to very complicated theoretical cases such as Bayesian evaluation on a graph of uncertain topology. (A good rule of thumb is that if something seems practically necessary but theoretically wrong, it's because of insufficient understanding of the theory.)

It seems to me that quantitative optimism is not common among people with very good knowledge of what would be involved in a good quantitative approach - people who wrote important papers on approximation of things. I can see, though, how quantitative optimism could arise in people who primarily know theory and its application to simple problems where nothing has to be approximated.

There is also an even bigger issue related to subjectivist probability - the "probability is subjective" implies "subjective can be probability", which quickly decays into "subjective is probability". Which it is not (doesn't sum to 1 over exclusive alternatives, doesn't reflect symmetries in knowledge).

Your linked comment makes the same point I am making, much more tersely. One reason I'm so much less terse is that I'm not very confident in your off-hand remarks: I think many or most of them are interesting ideas which are worth bringing up, but the implicit claim that these issues are well-understood is misleading and the actual arguments often don't work.

I don't quite know what idealization you are talking about. E.g.,

> There are all sorts of important things, such as correct regression towards the mean, which apply only to estimates but which you won't see in the ideal case.

If I have a noisy estimate and a prior, I should regress towards the mean. By the "ideal case" do you mean the case in which my estimates have no noise? That is a strange idealization, which people might implicitly use but probably wouldn't advocate.

With respect to the other points, I agree that estimation is hard, but the difficulties you cite seem to fit pretty squarely into the simple theoretical framework of computing a well-calibrated estimate of expected value. So to the extent there are gaps between that simple framework and reality, these difficulties don't point to them.

For example, to make this point in the case of sums with biased terms, you would need to say how you could predictably do better by throwing out terms of an estimation, even when you don't expect their inclusion to be correlated with their contribution to the estimate. Everyone agrees that if you know X is biased you should respond appropriately. If we don't know that X is biased, then how do you know to throw it out? One thing you could do is to just be skeptical in general and use simple estimates when there is a significant opportunity for bias. But that, again, fits into the framework I'm talking about and you can easily argue for it on those grounds.

An alternative approach would be to criticize folks' actual epistemology for not living up to the theoretical standards they set. It seems like that criticism is obviously valid, both around LW and elsewhere. If that is the point you want to make I am happy to accept it.

> Which it is not (doesn't sum to 1 over exclusive alternatives, doesn't reflect symmetries in knowledge).

I agree that if I assume that my beliefs satisfy the axioms of probability, I will get into trouble (a general pattern with assuming false things). But I don't see why either of these properties (reflecting symmetries, summing to one over exclusive alternatives) is necessary for good outcomes. Suppose that I am trying to estimate the relative goodness of two options in order to pick the best. Why should it matter whether my beliefs have these particular consistency properties, as long as they are my best available guess? In fact, it seems to me like my beliefs probably shouldn't satisfy all of the obvious consistency properties, but should still be used for making decisions. I don't think that's a controversial position.

> It seems to me that quantitative optimism is not common among people with very good knowledge of what would be involved in a good quantitative approach - people who wrote important papers on approximation of things. I can see, though, how quantitative optimism could arise in people who primarily know theory and its application to simple problems where nothing has to be approximated.

I am generally skeptical of the appeal to unspecified beliefs of unspecified experts. Yes, experts in numerical methods will be quick to say that approximating things well is hard, and indeed approximating things well is hard. That is a different issue than whether this particular theoretical framework for reasoning about approximations is sound, which is (1) not an issue on which experts in e.g. numerical methods are particularly well-informed, and (2) not a question for which you actually know the expert consensus.

For example, as a group physicists have quite a lot of experience estimating things and dealing with the world, and they seem to be very optimistic about quantitative methods, in the sense that I mean.

I think I can probably predict how a discussion with experts would go, if you tried to actually raise this question. It would begin with many claims of "things aren't that simple" and attempts to distance from people with stupid naive views, and end with "yes, that formalism is obvious at that level of generality, but I assumed you were making some more non-trivial claims."

This would be a fine response if I were trying to cast myself as better than experts because I have such an excellent clean theory (and I have little patience with Eliezer for doing this). But in fact I am just trying to say relatively simple things in the interest of building up an understanding.

> For example, to make this point in the case of sums with biased terms, you would need to say how you could predictably do better by throwing out terms of an estimation, even when you don't expect their inclusion to be correlated with their contribution to the estimate.

I agree with pretty much everything else you wrote here (and in the OP), but I'm a bit confused by this line. It seems like if the terms have a mean that is close to zero, but high variance, then you will usually do better by getting rid of them.

I'm not convinced of this. If you know that a summand has a mean that is close to zero and a high variance, then your prior will be sharply concentrated and you will regress far to the mean. Including the regressed estimate in a sum will still increase your accuracy. (Though of course if the noise is expected to be 1000x greater than the signal, you will be dividing by a factor of 1000 which is more or less the same as throwing it out. But the naive Bayesian EV maximizer will still get this one right.)

Are we using summand to mean the same thing here? To me, if we have an expression X1 + X2 + X3, then the summands are X1, X2, and X3. If we want to estimate Y, and E[X1+X2+X3] = Y, but E[X2] is close to 0 while Var[X2] is large, then X1+X3 is a better estimate for Y than X1+X2+X3 is.

Assume you have noisy measurements X1, X2, X3 of physical quantities Y1, Y2, Y3 respectively; variables 1, 2, and 3 are independent; X2 is much noisier than the others; and you want a point-estimate of Y = Y1+Y2+Y3. Then you shouldn't use either X1+X2+X3 or X1+X3. You should use E[Y1|X1] + E[Y2|X2] + E[Y3|X3]. Regression to the mean is involved in computing each of the conditional expectations. Lots of noise (relative to the width of your prior) in X2 means that E[Y2|X2] will tend to be close to the prior E[Y2] even for extreme values of X2, but E[Y2|X2] is still a better estimate of that portion of the sum than E[Y2] is.
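The three candidate estimators can be compared in a small simulation. This is only a sketch: the normal priors, the normal measurement noise, and all of the numbers in `PARAMS` are assumptions chosen for illustration, and the function names are made up. For a normal prior and normal noise, the conditional expectation is the prior mean plus a fraction of the distance from the prior mean to the measurement, where the fraction is prior variance divided by the sum of prior variance and noise variance.

```python
import random

# Hypothetical (prior mean, prior sd, noise sd) for Y1, Y2, Y3.
# X2 is far noisier than its underlying quantity Y2 varies.
PARAMS = [(5.0, 1.0, 0.5), (0.2, 0.3, 5.0), (5.0, 1.0, 0.5)]


def posterior_mean(x, prior_mean, prior_sd, noise_sd):
    """E[Y | X = x] for a normal prior and normal measurement noise."""
    weight = prior_sd**2 / (prior_sd**2 + noise_sd**2)
    return prior_mean + weight * (x - prior_mean)


def mean_squared_errors(trials=50_000, seed=0):
    """Estimate the mean squared error of each way of estimating Y1+Y2+Y3."""
    rng = random.Random(seed)
    totals = {"raw": 0.0, "drop_x2": 0.0, "shrunk": 0.0}
    for _ in range(trials):
        ys = [rng.gauss(m, sd) for m, sd, _ in PARAMS]
        xs = [y + rng.gauss(0.0, noise) for y, (_, _, noise) in zip(ys, PARAMS)]
        truth = sum(ys)
        estimates = {
            "raw": sum(xs),                  # X1 + X2 + X3
            "drop_x2": xs[0] + xs[2],        # X1 + X3
            "shrunk": sum(posterior_mean(x, *p) for x, p in zip(xs, PARAMS)),
        }
        for name, est in estimates.items():
            totals[name] += (est - truth) ** 2
    return {name: total / trials for name, total in totals.items()}


mse = mean_squared_errors()
for name, value in mse.items():
    print(f"{name:8s} mean squared error: {value:.3f}")
```

Under these assumptions, dropping X2 beats the raw sum by a wide margin, and the sum of conditional expectations beats both, though its edge over dropping X2 is comparatively small.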

But that's not mysterious, that's just regression to the mean.

I don't understand. In what way is it regression to the mean?

Also, what does that have to do with my original comment, which is that you will do better by dropping high-variance terms?

You said you should drop X if you know that your estimate is high variance but that the actual values don't vary much. Knowing that the actual value doesn't vary much means your prior has low variance, while knowing that your estimate is noisy means that your prior for the error term has high variance.

So when you observe an estimate, you should attribute most of the variance to error, and regress your estimate substantially towards your prior mean. After doing that regression, you are better off including X than dropping it, as far as I can see. (Of course, if the regressed estimate is sufficiently small then it wasn't even worth computing the estimate, but that's a normal issue with allocating bounded computational resources and doesn't depend on the variance of your estimate of X, just how large you expect the real value to be.)

Of course, any time you toss something out it corresponds to negligible weight. And of course, accuracy-wise, under limited computing power, you're better off actually tossing it out and using the computing time elsewhere to increase the accuracy more.

> If I have a noisy estimate and a prior, I should regress towards the mean. By the "ideal case" do you mean the case in which my estimates have no noise? That is a strange idealization, which people might implicitly use but probably wouldn't advocate.

I was primarily referring to the wide-eyed optimism prevalent on these boards: attend some workshops, become more rational, and win. It's not that people advocate not regressing to the mean, it's that they don't even know this is an issue (and a difficult issue when the probability distribution and its mean are something you need to find out as well). In the ideal case, you have a sum over all terms. It is not an estimate at all: you don't discard any terms, and if you discard any terms or apply any extra scaling it becomes less ideal, and so on. And so you have people who see such adjustments as biases and imagine enormous gains to be obtained from doing something formally inspired instead. I have a cat test. Can you explicitly determine if something is a picture of a cat based on a list of numbers representing pixel luminosities? This is the size of the gap between implicit processing of the evidence and explicit processing of the evidence.

> But I don't see why either of these properties (reflecting symmetries, summing to one over exclusive alternatives) is necessary for good outcomes. Suppose that I am trying to estimate the relative goodness of two options in order to pick the best. Why should it matter whether my beliefs have these particular consistency properties, as long as they are my best available guess?

This needs a specific example. Some people were worrying over a very very far-fetched scenario, being unable to assign it low enough probability. The property of summing to 1 over the enormous number of likewise far-fetched mutually exclusive scenarios would definitely have helped, compared to the state of - I suspect - summing to a very very huge number. Then they were taught a little bit of rationality and they know probability is subjective, which makes them inclined to consider their numerical assessment of a feeling (which may well already incorporate alleged impact) to be a probability, and multiply it with something. Other bad patterns include inversion of probability - why are you so extremely certain in the negation of an event? People expect that probabilities close to 1 require evidence, and without any, are reluctant to assign something close to 1, even though in that case it represents a sum over almost the entire hypothesis space.

> With respect to the other points, I agree that estimation is hard, but the difficulties you cite seem to fit pretty squarely into the simple theoretical framework of computing a well-calibrated estimate of expected value. So to the extent there are gaps between that simple framework and reality, these difficulties don't point to them.

> not a question for which you actually know the expert consensus.

I do not see the people most educated in these matters (or, indeed, in the theory) running "rationality workshops" that advocate explicit theory-based reasoning; that's what I mean. And the people I do see running them I would not even suspect of expertise if they hadn't claimed it themselves.

> This would be a fine response if I were trying to cast myself as better than experts because I have such an excellent clean theory (and I have little patience with Eliezer for doing this). But in fact I am just trying to say relatively simple things in the interest of building up an understanding.

Yes, I certainly agree here - first make simple steps in the right direction.

I think mostly you are arguing against LW in general, which seems fine but not particularly helpful here or relevant to my point.

> Some people were worrying over a very very far-fetched scenario, being unable to assign it low enough probability. The property of summing to 1 over the enormous number of likewise far-fetched mutually exclusive scenarios would definitely have helped, compared to the state of - I suspect - summing to a very very huge number.

What is the "very very far fetched scenario"? If you mean the intelligence explosion scenario, I do think this is reasonably unlikely, but:

1. Eliezer thinks this scenario is very likely, and many people around here agree. This is hardly a problem of being unwilling to assign a probability too close to 0.
2. In what sense is fast takeoff one hypothesis out of a very large number of equally plausible hypotheses? It seems like a fast takeoff is a priori reasonably likely, and the main reasons you think it seems unlikely are because experts don't take it seriously and it seems incongruous with other tech progress. This seems unrelated to your critique.

This argument is similar to the reasoning behind paired statistical tests.

tl;dr: Sometimes you can get a better estimate of the difference between two things than you can of either of them separately.
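The standard identity behind this, with $\hat{X}$ and $\hat{Y}$ denoting the two estimates (notation introduced here for illustration):

$$
\operatorname{Var}(\hat{X} - \hat{Y}) \;=\; \operatorname{Var}(\hat{X}) + \operatorname{Var}(\hat{Y}) - 2\operatorname{Cov}(\hat{X}, \hat{Y})
$$

When the two estimates share a source of error, the covariance is positive and the shared error cancels in the difference. If the shared part dominates, the difference can have a smaller variance than either estimate on its own. When the estimates are produced independently, the covariance is zero and the full variance of both carries through.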

I agree it's an easy point, but what I really want to drive home is that this easy point actually disproves the intuitively appealing conjecture: "A sufficiently rational agent in a fair environment should estimate the value of each option and choose the option with the highest value."

There are lots of intuitive objections to this conjecture which people often raise, but most of them don't seem to hold up. So when I realized that this one actually does hold up, I wanted to flesh it out in much more detail.

"A sufficiently rational agent in a fair environment should estimate the value of each option and choose the option with the highest value."

I agree with you in the case of humans, but not in the case of sufficiently rational agents. It is a theorem that the expected value of X-Y is positive if and only if the expected value of X is greater than the expected value of Y. To rephrase using symbols:

E(X-Y) > 0 iff E(X) > E(Y)

This depends on what consistency conditions you get to impose on your agent. I agree that for probability distributions E[X-Y] = E[X] - E[Y].

Any computable agent, no matter how rational, isn't going to have its beliefs closed under all of the obvious consistency conditions, otherwise it would assign P(T) = 1 for each theorem T. This isn't just a quirk of human irrationality.

Maybe we should specify a subset of the consistency conditions which is achievable, and then we can say that expected utility maximization is optimal if you satisfy those consistency conditions. This is what I have been doing when thinking about these issues, but it seems neither straightforward nor standard.

Can you find a toy model in which an agent choosing between options X, Y, Z estimates the three differences X - Y, Y - Z, Z - X and finds that they don't add up to zero?

I fear that I've missed your point, but here is my runnable toy model written in Common Lisp:

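```lisp
;; Each call models a fresh, noisy estimate of a quantity whose true value is 1/2.
(defun x () (random 1.0))
(defun y () (random 1.0))
(defun z () (random 1.0))

;; Each difference re-estimates both of its operands from scratch.
(defun x-y () (- (x) (y)))
(defun y-z () (- (y) (z)))
(defun z-x () (- (z) (x)))

(defparameter diffs (list (x-y) (y-z) (z-x)))

(reduce #'+ diffs) ; => -0.42450535 in one run; the value varies between runs
```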

The variable diffs gets set to a list of the three estimates. Adding them up we get -0.424. What has gone wrong?

X, Y, and Z are all 1/2. But they are tricky to measure. (defun x () (random 1.0)) is modelling the idea that when we estimate X we get a random variable uniform from 0.0 to 1.0.

```lisp
(defun x-y () (- (x) (y)))
```

is modelling the idea that we estimate X and estimate Y and subtract (and don't remember our estimates).

```lisp
(defun y-z () (- (y) (z)))
```

is modelling the idea that we start from scratch, estimating Y (again) and then Z before finally subtracting.

Since the Y in X-Y is our first go at estimating Y and the Y in Y-Z is our second go at estimating Y, they have different random errors and don't cancel like they should.

I think this is essentially the point I was trying to make in an earlier argument with you where I objected to comparing E[A] and E[B] if Var[A] and/or Var[B] is large. However, you seem to have phrased it much more eloquently here than I was able to at the time.

Yeah. I noticed this phenomenon when putting together some estimates and was motivated to write it up here because it seemed to be related to intuitions which many people had expressed to me and which I had mostly dismissed.

One way to state this idea is that when comparing between options A and B we need to produce an optimal estimate of utility(A) - utility(B) rather than an optimal estimate of each separately. However, the examples you use are somewhat unrealistic in the following sense. When comparing e.g. two options to reduce carbon emissions, we usually also need to take into account the difference in cost, not only the difference in impact on carbon emissions, because the residual money can be invested in something else. This means there is no avoiding estimating the utility of reducing carbon emissions, even if there's large error involved. Still it makes sense to be consistent.

What about large lists of things? Would it be more efficient to do an individual expected utility estimation on each individual item on the list, and then order them, or would it still be an advantage to compare each item one at a time to another item? I figure that would take vastly more time, though if you just do rough estimates at first, then increase the level of accuracy as the estimates get closer to each other, it could go fairly quickly.