How to Measure Anything

A measurement is an observation that quantitatively reduces uncertainty.

A measurement reduces expected uncertainty. Some particular measurement results increase uncertainty. E.g. you start out by assigning 90% probability that a binary variable landed heads and then you see evidence with a likelihood ratio of 1:9 favoring tails, sending your posterior to 50-50. However the *expectation* of the entropy of your probability distribution after seeing the evidence, is always evaluated to be lower than its current value in advance of seeing the evidence.

Just FYI, I think Hubbard knows this and wrote "A measurement is an observation that quantitatively reduces uncertainty" because he was trying to simplify and avoid clunky sentences. E.g. on p. 146 he writes:

It is even possible for an additional sample to sometimes increase the size of the [confidence] interval... before the next sample makes it narrower again. But, on average, the increasing sample size will decrease the size of the [confidence] interval.

I'm reminded also of Russell's comment:

A book should have either intelligibility or correctness; to combine the two is impossible, but to lack both is to be unworthy.

The technical term for this is conditional entropy.

The conditional entropy will always be lower *unless* the evidence is independent of your hypothesis (in which case the conditional entropy equals the prior entropy).
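A minimal numeric sketch of the point, using the 90%-heads example from the comment above. The likelihoods (0.1 and 0.9) are an assumption chosen to give the 1:9 ratio; the names `entropy_bits` and `posterior` are illustrative only.

```python
import math

def entropy_bits(p):
    """Shannon entropy, in bits, of a binary variable with P(heads) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def posterior(prior, lik_heads, lik_tails):
    """P(heads | observation) by Bayes' rule."""
    joint_heads = prior * lik_heads
    joint_tails = (1 - prior) * lik_tails
    return joint_heads / (joint_heads + joint_tails)

prior = 0.9
# Assumed evidence model: a signal that shows up with probability 0.1 if the
# coin landed heads and 0.9 if it landed tails (likelihood ratio 1:9).
p_signal_given_heads = 0.1
p_signal_given_tails = 0.9

p_signal = prior * p_signal_given_heads + (1 - prior) * p_signal_given_tails
post_if_signal = posterior(prior, p_signal_given_heads, p_signal_given_tails)
post_if_no_signal = posterior(prior, 1 - p_signal_given_heads, 1 - p_signal_given_tails)

# Conditional entropy: posterior entropy averaged over both possible observations.
expected_posterior_entropy = (
    p_signal * entropy_bits(post_if_signal)
    + (1 - p_signal) * entropy_bits(post_if_no_signal)
)

print("prior entropy:", entropy_bits(prior))
print("entropy if the signal is seen:", entropy_bits(post_if_signal))
print("expected posterior entropy:", expected_posterior_entropy)
```

Seeing the signal pushes the posterior to 50-50 and raises the entropy, but the average over both outcomes is lower than the prior entropy.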

Is there a section on "How To Not Fool Yourself That You're Measuring X When You're Actually Measuring Z"?

This is a very important concern that I have too. I have not read the book, and it might be a very interesting read, but when it starts with:

No matter how “fuzzy” the measurement is, it’s still a measurement if it tells you more than you knew before.

It concerns me. Because business is already full of dubious metrics that actually do harm. For instance, in programming, source lines of code (SLOC) per month is one metric that is used to gauge 'programmer productivity', but has come under extreme and rightful skepticism.

Scientific methods are powerful when used properly, but a little knowledge can be a dangerous thing.

Yes he is all over this.

In the TQM world this comes under the heading "any metric used to reward people will become corrupt". W. Edwards Deming was writing about this issue in the 1960s or earlier. For this reason he advocated separating data collection used to run the business from data collection used to reward people. Too often, people decide this is "inefficient" and combine the two, with predictable results. Crime statistics in the US are one terrible example of this.

From my recollection of the book I think he would say that SLOC is not actually a terrible metric and can be quite useful. I use it myself on my own projects, but I have no incentive to game the system. If you start paying people for SLOC you are going to get a lot of SLOCs!

Because of the history, you need to go overboard to reassure people you will not use metrics against them. They are going to assume you will use them against them, until proven otherwise.

Not a dedicated section; this advice is scattered throughout the book. E.g. there's a section (p. 174-176) explaining why p-value hypothesis testing doesn't measure what the reader might think it measures (and thus Hubbard doesn't use p-value hypothesis testing).

Wow, this is really exciting. I thought at first, "Man, quantifying my progress on math research sounds really difficult. I don't know how to make it more than a measure of how happy I feel about what I've done."

But I'm only through step one of this post, and I've already pinned down the variables defining "progress on math research" such that measuring these periodically will almost certainly keep me directly on track toward the answer. I can probably make it better (suggestions welcome), but even this first pass saves me lots of grief. Wasted motion is probably my biggest problem with learning math right now, so this totally rocks. Check it out!

Progress is reduction of expected work remaining.

- Number of new things I understand (by proposition).
- Change in degree to which I understand old things.
- How central these things are to the problem.
- Number of things (namely propositions, definitions, rules/instructions) I’ve written down that seem likely to be useful reference later.
- Probability that they will be important.
- Amount of material produced (in propositions or subsections of proof) that, if correct, will actually be part of my answer in the end.
- Number of actions I’ve taken that will increase the values of the above variables.
- Degree to which they’ll increase those values.

Thanks, Luke!

Progress is reduction of expected work remaining.

No it isn't. Those things are often correlated but not equivalent. New information can be gained that increases the expected work remaining despite additional valuable work having been done.

Progress is reduction of expected work remaining compared to your *revised* expectation of how much work remained yesterday.

That seems to be a better fit for the *impression of progress*. You wouldn't tend, in retrospect, to call it progress if you realised you'd been going in completely the wrong direction.

This would fit with progress simply being the reduction of work remaining.

|New information can be gained that increases the expected work remaining despite additional valuable work having been done.

That's progress.

[This comment is no longer endorsed by its author]

|New information can be gained that increases the expected work remaining despite additional valuable work having been done.

That's progress.

Yes. That is the point.

- The variables that had high information values were routinely those that the client had never measured…
- The variables that clients [spent] the most time measuring were usually those with a very low (even zero) information value…

This seems very unlikely to be a coincidence. Any theories about what's going on?

We run into this all the time at my job.

My usual interpretation is that actual measurements with high information value can destabilize the existing system (e.g., by demonstrating that people aren't doing their jobs, or that existing strategies aren't working or are counterproductive), and are therefore dangerous. Low-information measurements are safer.

It's not that they're measuring the wrong variables, it's most likely that those organizations have already made the decisions based on variables they already measure. In the "Function Points" example, I would bet there were a few obvious learnings early on that spread throughout the organizations, and once the culture had changed any further effort didn't help at all.

Another example: I took statistics on how my friends played games that involved bidding, such as Liar's Poker. I found that they typically would bid too much. Therefore a measurement of how many times someone had the winning bid was a strong predictor of how they would perform in the game: people who bid high would typically lose.

Once I shared this information, behavior changed and people started using a much more rational bidding scheme. And the old measurement of "how often someone bid high" was no longer very predictive. It simply meant that they'd had more opportunities where bidding high made sense. Other variables such as "the player to your left" started becoming much more predictive.

One possibility is that there are a very large number of things they could measure, most of which have low information value. If they chose randomly we might expect to see an effect like this, and never notice all the low information possibilities they chose not to measure.

I'm not suggesting that they actually do choose randomly, but it might be they chose, say, the easiest to measure, and that these are neither systematically good nor bad, so it looks similar to random in terms of the useful information.

In the many cases where I've seen this, it's because (generally) the things being collected are those that are easiest to collect. Often little thought was put into it, and sometimes these things were collected by accident. Generally, the things that are easiest to collect offer the least insight (if it's easy to collect, it's already part of your existing business process).

If there are generally decreasing returns to measurement of a single variable, I think this is more what we would expect to see. If you've already put effort into measurement of a given variable it will have lower information value on the margin. If you add in enough costs for switching measurements, then even the optimal strategy might spend a serious amount of time/effort pursuing lower value measurements.

Further, if they hadn't even thought of some measurements they couldn't have pursued them, so they wouldn't have suffered any declining returns.

I don't think this is the primary reason, but may contribute, especially in conjunction with reasons from sibling comments.

What do you mean by "programmer productivity" and why do you care about it? What are you observing when you observe increased programmer productivity?

I haven't read the article so I could be full of shit, but essentially:

If you have the list of desired things ready, there should be an ETA on the work time necessary for each desired thing as well as confidence on that estimate. Confidence varies with past data and expected competence, e.g. how easily you believe you can debug the feature if you begin to draft it. Or such. Then you have a set of estimates for each implementable feature.

Then you put in time on that feature over the day, tracked by some passive monitoring program like ManicTime or something like it.

The ratio of time spent on work that counted towards your features over the work that didn't is your productivity metric. As time goes on your confidence is calibrated in your feature-implementation work time estimates.

It's not very hard.

Just recently the task before me was to implement an object selection feature in an app I'm working on.

I implemented it. Now, the app lets the user select objects. Before, it didn't.

Prior to that, the task before me was to fix a data corruption bug in a different app.

I fixed it. Now, the data does not get corrupted when the user takes certain actions. Before, it did.

You see? Easy.

So, I agree that you accomplished these desired things. However, *before* you accomplished them, how accurately did you know how much time they would take, or how useful they would be?

For that matter, if someone told you, "That wasn't one desired thing I just implemented; it was three," is it possible to disagree?

(My point is that "desired thing" is not well-defined, and so "desired things per unit time" cannot be a measurement.)

However, before you accomplished them, how accurately did you know how much time they would take

I didn't. I never said I did.

or how useful they would be?

Uh, pretty accurately. Object selection is a critical feature; the entire functionality of the app depends on it. The usefulness of not having your data be corrupted is also obvious. I'm not really sure what you mean by asking whether I know in advance how useful a feature or bug fix will be. Of course I know. How could I not know? I always know.

For that matter, if someone told you, "That wasn't one desired thing I just implemented; it was three," is it possible to disagree?

Ah, now this is a different matter. Yes, "desired thing" is not a uniform unit of accomplishment. You have to compare to other desired things, and other people who implement them. You can also group desired things into classes (is this a bug fix or a feature addition? how big a feature? how much code must be written or modified to implement it? how many test cases must be run to isolate this bug?).

Yes, "desired thing" is not a uniform unit of accomplishment.

Right! So, "implementation of desired things per unit time" is not a measure of programmer productivity, since you can't really use it to compare the work of one programmer and another.

There are obvious cases, of course, where you *can* — here's someone who pounds out a reliable map-reduce framework in a weekend; there's someone who can't get a quicksort to compile. But if two *moderately successful* (and moderately reasonable) programmers *disagree* about their productivity, this candidate measurement doesn't help us resolve that disagreement.

Well, if your goal is comparing two programmers, then the most obvious thing to do is to give them both the same set of diverse tasks, and see how long they take (on each task and on the whole set).

If your goal is gauging the effectiveness of this or that approach (agile vs. waterfall? mandated code formatting style or no? single or pair programming? what compensation structure? etc.), then it's slightly less trivial, but you can use some "fuzzy" metrics: for instance, classify "desired things" into categories (feature, bug fix, compatibility fix, etc.), and measure *those* per unit time.

As for disagreeing whether something is one desired thing or three — well, like I said, you categorize. But also, it really won't be the case that one programmer says "I just implemented a feature", and another goes "A feature?! You just moved one parenthesis!", and a third goes "A feature?! You just wrote the entire application suite!".

Well, if your goal is comparing two programmers, then the most obvious thing to do is to give them both the same set of diverse tasks, and see how long they take (on each task and on the whole set).

That might work in an academic setting, but doesn't work in a real-life business setting where you're not going to tie up two programmers (or two teams, more likely) reimplementing the same stuff just to satisfy your curiosity.

And of course programming is diverse enough to encompass a wide variety of needs and skillsets. Say, programmer A is great at writing small self-contained useful libraries, programmer B has the ability to refactor a mess of spaghetti code into something that's clear and coherent, programmer C writes weird chunks of code that look strange but consume noticeably less resources, programmer D is a wizard at databases, programmer E is clueless about databases but really groks Windows GUI APIs, etc. etc. How are you going to compare their productivity?

That might work in an academic setting, but doesn't work in a real-life business setting where you're not going to tie up two programmers (or two teams, more likely) reimplementing the same stuff just to satisfy your curiosity.

Maybe that's one reason to have colleges that hand out computer science degrees? ;)

And of course programming is diverse enough to encompass a wide variety of needs and skillsets. Say, programmer A is great at writing small self-contained useful libraries, programmer B has the ability to refactor a mess of spaghetti code into something that's clear and coherent, programmer C writes weird chunks of code that look strange but consume noticeably less resources, programmer D is a wizard at databases, programmer E is clueless about databases but really groks Windows GUI APIs, etc. etc. How are you going to compare their productivity?

Very carefully.

In seriousness, the answer is that you wouldn't compare them. Comparing programmer productivity across problem domains like you describe is rarely all that useful.

You really only care about comparing programmer productivity within a domain, as well as comparing the same programmers' productivity across time.

How are you going to compare their productivity?

I'm going to look at the total desirability of what Adam does, at the total desirability of what Bob does...

And in the end I'm going to have to make difficult calls, like how desirable it is for us to have weird chunks of code that look strange but consume noticeably fewer resources.

Each of them is better at different things, so as a manager I need to take that into account; I wouldn't use a carpenter to paint while the painter is doing framing, but I *might* set things up so that painter helps with the framing and the carpenter assists with the painting. I certainly wouldn't spend a lot of time optimizing to hire only carpenters and tell them to build the entire house.

It's a two-step process, right? First, you measure how long a specific type of feature takes to implement, from a bunch of historic examples or something. Then, you measure how long a programmer (or all the programmers using a particular methodology or language, whatever you're measuring) takes to implement a new feature of the same type.

Hubbard writes about performance measurement in chapter 11. He notes that management typically knows what the relevant performance metrics are but has trouble prioritizing among them. Hubbard's proposal is to let management create utility charts of the required trade-offs. For instance, one curve for a programmer could have on-time completion rate on one axis and error-free rate on the other (page 214). Management is thus required to document how much one must increase to compensate for a drop in the other. The end product of the charts should be a single index for measuring employee performance.

Nice post, Luke!

with this handy reference table:

There is no table after this.

He also offers a chart showing how a pure Bayesian estimator compares to other estimators:

There is no chart after this.

Thanks for the great summary of the book. I would like to create and publish a mindmap of the book and reuse some comments you made in this post.

Would it be OK with you, as the author of the post, if I created a mindmap of the book with reference to your article and published it? The link to this post on LessWrong will be included in the mindmap.

Best Regards,

Andrey K. imankey@gmail.com

The Applied Information Economics ideas are very reminiscent of decision tree algorithms. Would it be useful to try to extend the analogy and see if there's an extension of AIE that is like random forests?

Hanson's homo hypocritus idea may also be relevant. Perhaps, even subconsciously, people avoid measuring the dimensions or directions that will add a lot of info because they want to both (a) vociferously claim that they did measure stuff and the measures didn't help and (b) avoid any culpability for implementing changes they don't politically control, such as changes indicated by measuring very informative directions.

Just saying, a lot of people want to appear like they are productively exploring measures that yield changes and progress while tacitly sabotaging that very activity to retain political control over status quos.

Thanks, I liked this post.

However, I was initially a bit confused by the section on EVPI. I think it is important, but it could be a lot clearer.

The expected opportunity loss (EOL) for a choice is the probability of the choice being “wrong” times the cost of it being wrong. So for example the EOL if the campaign is approved is $5M × 40% = $2M, and the EOL if the campaign is rejected is $40M × 60% = $24M.

The difference between EOL before and after a measurement is called the “expected value of information” (EVI).

It seems quite unclear what's meant by "the difference between EOL before and after a measurement" (EOL of which option? is this in expectation?).

I think what must be intended is: your definition is for the EOL of an option. Now the EOL of a choice is the EOL of the option we choose given current beliefs. Then EVI is the expected reduction in EOL upon measurement.

Even this is more confusing than it often needs to be. At heart it's the expected amount better you'll do with the information. Sometimes you can factor out the EOL calculation entirely. For example say you're betting $10 at even odds on a biased coin. You currently think there's a 70% chance of it landing heads; more precisely you know it was either from a batch which lands heads 60% of the time, or from a batch which lands heads 80% of the time, but these are equiprobable. You could take a measurement to find out which batch it was from. Then you are certain that this measurement will change the EOL, but if you do it carefully the expected gain is equal to the expected loss, so there is no EVI. We could spot this directly because we know that whatever the answer is, we'll bet on heads.
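The coin example can be checked with a few lines of arithmetic. This is only a sketch of the calculation described above; `best_bet_value` and `expected_value_of_information` are illustrative names, and the second set of batches in the usage lines is a hypothetical contrast case that is not in the comment.

```python
STAKE = 10.0  # dollars, bet at even odds

def best_bet_value(p_heads):
    """Expected winnings when betting on whichever side is more likely."""
    ev_heads = STAKE * (p_heads - (1 - p_heads))
    return max(ev_heads, -ev_heads)

def expected_value_of_information(batches):
    """batches maps P(heads) for a batch to the probability of that batch."""
    p_heads_now = sum(p * weight for p, weight in batches.items())
    value_without_info = best_bet_value(p_heads_now)
    value_with_info = sum(weight * best_bet_value(p) for p, weight in batches.items())
    return value_with_info - value_without_info

# The example from the comment: 60% or 80% heads, equally likely.
print(expected_value_of_information({0.6: 0.5, 0.8: 0.5}))

# Hypothetical contrast: one batch favors tails, so the answer can change the bet.
print(expected_value_of_information({0.4: 0.5, 0.8: 0.5}))
```

In the first case the information never changes which side you bet on, so it is worth nothing; in the contrast case it sometimes flips the bet to tails, and the value is positive.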

I think it might be useful to complete your simple example for EVPI (as in, this would have helped me to understand it faster, so may help others too): Currently you'll run the campaign, with EOL of $2M. With perfect information, you always choose the right option, so you expect the EOL to go down to 0. Hence the EVPI is $2M (this comes from the 40% of the time that the information stops you running the campaign, saving you $5M).
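The same arithmetic as a short sketch, using only the figures quoted from the post ($5M at 40%, $40M at 60%). The function and variable names are illustrative.

```python
def expected_opportunity_loss(p_wrong, cost_if_wrong):
    """EOL of an option: chance it turns out wrong times the cost of being wrong."""
    return p_wrong * cost_if_wrong

eol = {
    "approve": expected_opportunity_loss(0.40, 5_000_000),
    "reject": expected_opportunity_loss(0.60, 40_000_000),
}

# With current beliefs we pick the option with the smaller EOL.
current_choice = min(eol, key=eol.get)

# With perfect information we always pick the right option, so EOL drops to 0.
eol_with_perfect_information = 0.0
evpi = eol[current_choice] - eol_with_perfect_information

print(current_choice, eol, evpi)
```

The EVPI is the EOL of the option you would choose today, which is why it is $2M here and not the $24M attached to rejecting the campaign.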

Then in the section on the more advanced model:

In this case, the EVPI turns out to be about $337,000. This means that we shouldn’t spend more than $337,000 to reduce our uncertainty about how many units will be sold as a result of the campaign.

Does this figure come from the book? It doesn't come from the spreadsheet you linked to. By the way, there's a mistake in the spreadsheet: when it assumes a uniform distribution it uses different bounds for two different parts of the calculation.

I like the coin example. In my experience the situation with a clear choice is typical in small businesses. It often isn't worth honing the valuation models for projects very long when it is very improbable that the presumed second-best choice would turn out to be the best.

I guess the author is used to working for bigger companies that do everything on a larger scale and thus generally have more options to choose from. Nothing untrue in the chapter, but this point could have been made.

For whoever reads this and is as innumerate as I am and is confused about the example simulation with the Excel formula "=norminv(rand(), 15, (20–10)/3.29)", I hope my explanation below helps (and is correct!).

The standard error/deviation* of 3.29 is such because that's the correct value for the confidence interval of 90%. That number is determined by the confidence interval used. It is not the standard deviation of $10-$20. Don't ask me why, I don't know, yet.

Additionally, you can't just paste that formula into Excel. Remove the range (20-10) and keep the standard error.

At least that's the best understanding I have of it thus far. I could be wrong!

*Standard deviation is for entire populations and standard error is for samples of populations.

Edit: fixed link to Monte Carlo spreadsheet & all the other downloads for the book
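On where 3.29 comes from: for a normal distribution, a 90% interval runs from the 5th to the 95th percentile, which is about 1.645 standard deviations on each side of the mean, or about 3.29 standard deviations in total. Dividing the width of the 90% CI ($20 − $10) by 3.29 therefore gives the standard deviation to simulate with. The sketch below checks this with Python's standard library rather than Excel; it assumes the quantity is normally distributed, as the formula does.

```python
import random
from statistics import NormalDist

# 95th percentile of the standard normal: about 1.645.
z95 = NormalDist().inv_cdf(0.95)
# A 90% interval spans z95 standard deviations on each side of the mean.
width_in_sd = 2 * z95  # about 3.29

lower, upper = 10.0, 20.0            # the 90% CI from the example
mean = (lower + upper) / 2           # 15
sd = (upper - lower) / width_in_sd   # about 3.04

dist = NormalDist(mean, sd)
# Probability mass between the bounds; should be 0.90 by construction.
coverage = dist.cdf(upper) - dist.cdf(lower)

# One simulated draw, mirroring norminv(rand(), mean, sd).
random.seed(0)
draw = dist.inv_cdf(random.random())

print(width_in_sd, sd, coverage, draw)
```

So 3.29 is not itself a standard deviation or standard error; it is the number of standard deviations that a 90% interval covers.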

Before I embark on this seemingly Sisyphean endeavor, has anyone attempted to measure "philosophical progress"? It seems that no philosophical problem I know of is apparently fully solved, and no general methods are known which reliably give true answers to philosophical problems. Despite this we definitely have made progress: e.g. we can chart human progress on the problem of Induction, of which an *extremely* rough sketch looks like Epicurus --> Occam --> Hume --> Bayes --> Solomonoff, or something. I don't really know, but there seem to be issues with Solomonoff's formalization of Induction.

I'm thinking of "philosophy" as something like "pre-mathematics/progressing on confusing questions for which no reliable methods exist yet to give truthy answers/forming a concept of something and formalizing it". Also it's not clear to me that "philosophy" exists independent of the techniques it's spawned historically, but there are some problems for which the label of "philosophical problem" seems appropriate, e.g. "how do uncertainties work in a universe where infinite copies of you exist?" and like, all of moral philosophy, etc.

Seems to me that before a philosophical problem is solved, it becomes a problem in some other field of study. Atomism used to be a philosophical theory. Now that we know how to objectively confirm it, it (or rather, something similar but more accurate) is a scientific theory.

It seems that philosophy (at least, the parts of philosophy that are actively trying to progress) is about trying to take concepts that we have intuitive notions of, and figure out what if anything those concepts actually refer to, until we succeed at this well enough to study them in more precise ways than, well, philosophy.

So, how many examples can we find where some vague but important-seeming idea has been philosophically studied until we learn what the idea refers to in concrete reality, and how to observe and measure it to some degree?

Thanks Luke, this is a great post. It seems like it applies in a very broad range of cases - except the one I'm most interested in, unfortunately, which is judging how bad it is to violate someone's rights. We regularly value a life in CBA calculations ($1-10 million in the US), but how bad is it to be murdered, *holding constant that you die?*

This cost should be internalised by the murderer, but many people seem ignorant of the cost, leading to an over-supply of murder. It'd be good to know how big the market failure is (so we can judge various preventative policies).

Obviously the same question applies to theft, oath-breaking, and any other rights violations you might think of.

Applying the same question to theft produces the result that if I steal your car and I get more utility out of having your car than you lose by not having it + the utility that you lose from psychological harm due to theft, insurance premiums rising, etc., I can internalize the cost and still come out ahead, so this sort of theft is not in oversupply.

Of course, we normally don't consider the fact that the criminal gains utility to be relevant. Saying that it's not a market failure if the criminal is willing to internalize the cost implies that we consider the gain in the criminal's utility to be relevant.

Does the book address the issue of stale data?

Most statistics assumes that the underlying process is stable: if you're sampling from a population, you're sampling from the *same population* every time. If you estimated some parameters of a model, the assumption is that these parameters will be applicable for the forecast period.

Unfortunately, in real life underlying processes tend to be unstable. For a trivial example of a known-to-not-be-stable process consider weather. Let's say I live outside the tropics and I measure air temperature over, say, 60 days. Will my temperature estimates provide a good forecast for the next month? No, they won't, because the year has seasons and my "population" of days changes with time.

Or take an example from the book, catch-recatch. Imagine that a considerable period of time passed between the original "catch" and the "recatch". Does the estimation procedure still work? Well, not really -- you now need estimates of mortality and birth rate; you need to know how your population changed between the first and the second measurements.
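A rough sketch of the catch-recatch concern. It assumes the simple proportional estimator (often called Lincoln-Petersen), which the comment does not name, and works with expected counts rather than random samples. All population figures are hypothetical.

```python
def population_estimate(tagged_first, caught_second, tagged_in_second):
    """Proportional catch-recatch estimate: N is about n1 * n2 / m."""
    return tagged_first * caught_second / tagged_in_second

def expected_recaptures(tagged_alive, population_now, caught_second):
    """Expected number of tagged animals in the second catch."""
    return caught_second * tagged_alive / population_now

TAGGED, SECOND_CATCH, ORIGINAL_POPULATION = 100, 100, 1000

# Stable population: the estimate recovers the true size.
m = expected_recaptures(TAGGED, ORIGINAL_POPULATION, SECOND_CATCH)
stable = population_estimate(TAGGED, SECOND_CATCH, m)

# 500 untagged births before the recatch: tags are diluted, so the estimate
# tracks the new population (1500) rather than the original one.
m = expected_recaptures(TAGGED, ORIGINAL_POPULATION + 500, SECOND_CATCH)
after_births = population_estimate(TAGGED, SECOND_CATCH, m)

# Half of all animals (tagged and untagged alike) die: the tagged fraction is
# unchanged, so the estimate still says 1000 although only 500 remain.
m = expected_recaptures(TAGGED // 2, ORIGINAL_POPULATION // 2, SECOND_CATCH)
after_deaths = population_estimate(TAGGED, SECOND_CATCH, m)

print(stable, after_births, after_deaths)
```

Under these assumptions the same formula answers a different question in each scenario, which is why the birth and death rates between the two catches matter.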

I have difficulty understanding EVPI and linking it to the return rate, NPV, and risk for a project. I have input variables and initial guesses. I have run an MC simulation to determine NPV. Now I am a bit lost on how to go ahead with EVPI and link it to NPV, risk, etc.

=norminv(rand(), 15, (20–10)/3.29)

Where did **3.29** come from?

[This comment is no longer endorsed by its author]

In some cases, outliers are very close to the mean, and thus our estimate of the mean can converge quickly on the true mean as we look at new samples. In other cases, outliers can be several orders of magnitude away from the mean, and our estimate converges very slowly or not at all.

I think this passage confuses several different things. Let me try to untangle it.

First, all outliers, *by definition*, are rare and are "far away from the mean" (compared to the rest of the data points).

Second, whether your data points are "close" to the mean or "several orders of magnitude" away from the mean is a function of the width (or dispersion or variance or standard deviation or volatility) of the underlying distribution. The width affects how precise your mean estimate from a fixed-size sample will be, but it does not affect the *speed* of the convergence.

The speed of the convergence is a function of what your underlying distribution is. If it's normal (Gaussian), your mean estimate will converge at the same speed regardless of how high or low the variance of the distribution is. If it's, say, a Cauchy distribution then the mean estimate will never converge.

Also, in small samples you generally don't expect to get any outliers. If you do, your small-sample estimate is likely to be way out of whack and actually misleading.
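A small simulation sketch of the distinction drawn above, using only the standard library. The seed, sample sizes, and helper names are arbitrary choices for illustration; the Cauchy draws use the standard inverse-CDF construction.

```python
import math
import random

def standard_error(sigma, n):
    """Standard error of the mean of n independent draws with std dev sigma."""
    return sigma / math.sqrt(n)

def running_means(draw, checkpoints):
    """Sample mean after each checkpoint number of draws."""
    total, means = 0.0, {}
    for n in range(1, max(checkpoints) + 1):
        total += draw()
        if n in checkpoints:
            means[n] = total / n
    return means

rng = random.Random(42)
checkpoints = {100, 10_000, 100_000}

# Normal data: the error shrinks like 1/sqrt(n) whatever the variance is.
normal_means = running_means(lambda: rng.gauss(0.0, 1.0), checkpoints)

# Standard Cauchy data via the inverse CDF: the mean is undefined, so the
# running mean does not settle down as n grows.
cauchy_means = running_means(
    lambda: math.tan(math.pi * (rng.random() - 0.5)), checkpoints
)

for n in sorted(checkpoints):
    print(n, normal_means[n], cauchy_means[n])

# Quadrupling n halves the standard error for a narrow or a wide distribution.
print(standard_error(1, 100) / standard_error(1, 400))
print(standard_error(1000, 100) / standard_error(1000, 400))
```

The width of a normal distribution sets how large the error is at a given sample size, while the rate at which it shrinks is the same; the Cauchy case is different in kind.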

Douglas Hubbard’s *How to Measure Anything* is one of my favorite how-to books. I hope this summary inspires you to buy the book; it’s worth it.

The book opens:

The sciences have many established measurement methods, so Hubbard’s book focuses on the measurement of “business intangibles” that are important for decision-making but tricky to measure: things like management effectiveness, the “flexibility” to create new products, the risk of bankruptcy, and public image.

## Basic Ideas

A *measurement* is an observation that quantitatively reduces uncertainty. Measurements might not yield precise, certain judgments, but they *do* reduce your uncertainty.

To be measured, the *object of measurement* must be described clearly, in terms of observables. A good way to clarify a vague object of measurement like “IT security” is to ask “What is IT security, and why do you care?” Such probing can reveal that “IT security” means things like a reduction in unauthorized intrusions and malware attacks, which the IT department cares about because these things result in lost productivity, fraud losses, and legal liabilities.

*Uncertainty* is the lack of certainty: the true outcome/state/value is not known.

*Risk* is a state of uncertainty in which some of the possibilities involve a loss.

Much pessimism about measurement comes from a lack of experience making measurements. Hubbard, who is *far* more experienced with measurement than his readers, says:

## Applied Information Economics

Hubbard calls his method “Applied Information Economics” (AIE). It consists of 5 steps:

These steps are elaborated below.

## Step 1: Define a decision problem and the relevant variables

Hubbard illustrates this step by telling the story of how he helped the Department of Veterans Affairs (VA) with a measurement problem.

The VA was considering seven proposed IT security projects. They wanted to know “which… of the proposed investments were justified and, after they were implemented, whether improvements in security justified further investment…” Hubbard asked his standard questions: “What do you mean by ‘IT security’? Why does it matter to you? What are you observing when you observe improved IT security?”

It became clear that *nobody* at the VA had thought about the details of what “IT security” meant to them. But after Hubbard’s probing, it became clear that by “IT security” they meant a reduction in the frequency and severity of some undesirable events: agency-wide virus attacks, unauthorized system access (external or internal), unauthorized physical access, and disasters affecting the IT infrastructure (fire, flood, etc.). And each undesirable event was on the list because of specific costs associated with it: productivity losses from virus attacks, legal liability from unauthorized system access, etc.

Now that the VA knew what they meant by “IT security,” they could measure specific variables, such as the number of virus attacks per year.

## Step 2: Determine what you know

## Uncertainty and calibration

The next step is to determine your level of uncertainty about the variables you want to measure. To do this, you can express a “confidence interval” (CI). A 90% CI is a range of values that is 90% likely to contain the correct value. For example, the security experts at the VA were 90% confident that each agency-wide virus attack would affect between 25,000 and 65,000 people.

Unfortunately, few people are well-calibrated estimators. For example, in some studies the true value lay in subjects’ 90% CIs only 50% of the time! These subjects were overconfident. For a well-calibrated estimator, the true value will lie in her 90% CI roughly 90% of the time.
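One way to check calibration is to score stated 90% CIs against the answers once they are known. The sketch below is only an illustration: the `estimates` data and the helper name `hit_rate` are invented for the example and do not come from Hubbard.

```python
# Hypothetical estimates: (lower bound, upper bound, true value).
# These numbers are invented for illustration only.
estimates = [
    (1600, 1700, 1687),
    (10, 50, 63),
    (100, 200, 150),
    (5, 15, 22),
    (1000, 3000, 2500),
    (20, 40, 12),
    (0.5, 2.0, 1.1),
    (300, 900, 1200),
    (30, 60, 45),
    (1, 10, 14),
]


def hit_rate(intervals):
    """Return the fraction of intervals that contain their true value."""
    hits = sum(1 for low, high, truth in intervals if low <= truth <= high)
    return hits / len(intervals)


rate = hit_rate(estimates)
print(f"True value fell inside the stated 90% CI {rate:.0%} of the time")
```

A hit rate well below 90% on a reasonably long list of questions points to overconfidence; a rate well above 90% points to intervals that are wider than they need to be.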

Luckily, “assessing uncertainty is a general skill that can be taught with a measurable improvement.”

Hubbard uses several methods to calibrate each client’s value estimators, for example the security experts at the VA who needed to estimate the frequency of security breaches and their likely costs.

His first technique is the *equivalent bet test*. Suppose you’re asked to give a 90% CI for the year in which Newton published the universal laws of gravitation, and you can win $1,000 in one of two ways:

1. You win $1,000 if the true year of publication falls within your stated 90% CI.
2. You win $1,000 by spinning a dial that has a 90% chance of paying out.

If you find yourself preferring option #2, then you must think spinning the dial has a higher chance of winning you $1,000 than option #1. That suggests your stated 90% CI isn’t really your 90% CI. Maybe it’s your 65% CI or your 80% CI instead. By preferring option #2, your brain is trying to tell you that your originally stated 90% CI is overconfident.

If instead you find yourself preferring option #1, then you must think there is *more* than a 90% chance your stated 90% CI contains the true value. By preferring option #1, your brain is trying to tell you that your original 90% CI is underconfident.

To make a better estimate, adjust your 90% CI until option #1 and option #2 seem equally good to you. Research suggests that even *pretending* to bet money in this way will improve your calibration.

Hubbard’s second method for improving calibration is simply *repetition and feedback*. Make lots of estimates and then see how well you did. For this, play CFAR’s Calibration Game.

Hubbard also asks people to identify reasons why a particular estimate might be right, and why it might be wrong.

He also asks people to look more closely at each bound (upper and lower) on their estimated range. A 90% CI “means there is a 5% chance the true value could be greater than the upper bound, and a 5% chance it could be less than the lower bound. This means the estimators must be 95% sure that the true value is less than the upper bound. If they are not that certain, they should increase the upper bound… A similar test is applied to the lower bound.”

## Simulations

Once you determine what you know about the uncertainties involved, how can you use that information to determine what you know about the *risks* involved? Hubbard summarizes:

The simplest tool for measuring such risks accurately is the Monte Carlo (MC) simulation, which can be run by Excel and many other programs. To illustrate this tool, suppose you are wondering whether to lease a new machine for one step in your manufacturing process.

Your pre-calibrated estimators give their 90% CIs for the following variables:

Thus, your annual savings will equal (MS + LS + RMS) × PL.

When measuring risk, we don’t just want to know the “average” risk or benefit. We want to know the probability of a huge loss, the probability of a small loss, the probability of a huge savings, and so on. That’s what Monte Carlo can tell us.

An MC simulation uses a computer to randomly generate thousands of possible values for each variable, based on the ranges we’ve estimated. The computer then calculates the outcome (in this case, the annual savings) for each generated combination of values, and we’re able to see how often different kinds of outcomes occur.

To run an MC simulation we need not just the 90% CI for each variable but also the *shape* of each distribution. In many cases, the normal distribution will work just fine, and we’ll use it for all the variables in this simplified illustration. (Hubbard’s book shows you how to work with other distributions.)

To make an MC simulation of a normally distributed variable in Excel, we use this formula:

So the formula for the maintenance savings variable should be:

Suppose you enter this formula in cell A1 in Excel. To generate (say) 10,000 values for the maintenance savings variable, just (1) copy the contents of cell A1, (2) enter “A1:A10000” in the cell range field to select cells A1 through A10000, and (3) paste the formula into all those cells.

Now we can follow this process in other columns for the other variables, including a column for the “total savings” formula. To see how many rows made a total savings of $400,000 or more (break-even), use Excel’s countif function. In this case, you should find that about 14% of the scenarios resulted in a savings of less than $400,000 – a loss.

We can also make a histogram (see right) to show how many of the 10,000 scenarios landed in each $100,000 increment (of total savings). This is even more informative, and tells us a great deal about the distribution of risk and benefits we might incur from investing in the new machine. (Download the full spreadsheet for this example here.)
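The same simulation can be sketched outside a spreadsheet. The Python below is a minimal illustration rather than Hubbard’s spreadsheet: the four 90% CIs in `CI_90` are assumed values chosen for the example, every variable is assumed to be normal and independent, and a 90% CI is converted to a standard deviation by dividing its width by 3.29 (the central 90% of a normal distribution spans about 3.29 standard deviations).

```python
import random

# 90% CIs (lower, upper) for each variable. These particular ranges are
# ASSUMED for illustration; substitute your own calibrated estimates.
CI_90 = {
    "MS": (10.0, 20.0),          # maintenance savings, $ per unit
    "LS": (-2.0, 8.0),           # labor savings, $ per unit
    "RMS": (3.0, 9.0),           # raw materials savings, $ per unit
    "PL": (15_000.0, 35_000.0),  # production level, units per year
}
BREAK_EVEN = 400_000  # annual savings needed to break even
N_SCENARIOS = 10_000


def draw_normal(rng, low, high):
    """Sample a normal variable whose central 90% interval is (low, high)."""
    mean = (low + high) / 2
    std_dev = (high - low) / 3.29  # a 90% CI spans about 3.29 standard deviations
    return rng.gauss(mean, std_dev)


def simulate(n, seed=0):
    """Return n simulated values of annual savings = (MS + LS + RMS) * PL."""
    rng = random.Random(seed)
    results = []
    for _ in range(n):
        v = {name: draw_normal(rng, lo, hi) for name, (lo, hi) in CI_90.items()}
        results.append((v["MS"] + v["LS"] + v["RMS"]) * v["PL"])
    return results


savings = simulate(N_SCENARIOS)
p_loss = sum(s < BREAK_EVEN for s in savings) / len(savings)
print(f"Share of scenarios below break-even: {p_loss:.1%}")

# Crude text histogram in $100,000 increments (one '#' per 50 scenarios).
bins = {}
for s in savings:
    bucket = int(s // 100_000)
    bins[bucket] = bins.get(bucket, 0) + 1
for bucket in sorted(bins):
    print(f"{bucket * 100_000:>10,} {'#' * (bins[bucket] // 50)}")
```

Each pass through the loop plays the role of one spreadsheet row, and the final count plays the role of `COUNTIF`. The exact share below break-even will vary with the random seed and with whatever ranges you supply.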

The simulation concept can (and in high-value cases *should*) be carried beyond this simple MC simulation. The first step is to learn how to use a greater variety of distributions in MC simulations. The second step is to deal with correlated (rather than independent) variables by generating correlated random numbers or by modeling what the variables have in common.

A more complicated step is to use a Markov simulation, in which the simulated scenario is divided into many time intervals. This is often used to model stock prices, the weather, and complex manufacturing or construction projects. Another more complicated step is to use an agent-based model, in which independently-acting agents are simulated. This method is often used for traffic simulations, in which each vehicle is modeled as an agent.

## Step 3: Pick a variable, and compute the value of additional information for that variable

Information can have three kinds of value:

When you’re uncertain about a decision, this means there’s a chance you’ll make a non-optimal choice. The cost of a “wrong” decision is the difference between the wrong choice and the choice you would have made with perfect information. But it’s too costly to acquire perfect information, so instead we’d like to know which decision-relevant variables are the *most* valuable to measure more precisely, so we can decide which measurements to make.

Here’s a simple example:

The expected opportunity loss (EOL) for a choice is the probability of the choice being “wrong” times the cost of it being wrong. So for example the EOL if the campaign is approved is $5M × 40% = $2M, and the EOL if the campaign is rejected is $40M × 60% = $24M.

The difference between EOL before and after a measurement is called the “expected value of information” (EVI).
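The EOL arithmetic for the advertising campaign can be written out in a few lines. This is a sketch using the figures quoted above; the function name `expected_opportunity_loss` is made up for the example, and reading the smaller EOL as the value of perfect information assumes a perfect measurement would remove all remaining chance of choosing wrongly.

```python
def expected_opportunity_loss(p_wrong, cost_if_wrong):
    """EOL = probability the choice turns out wrong * cost of being wrong."""
    return p_wrong * cost_if_wrong


# Figures from the advertising example above.
eol_approve = expected_opportunity_loss(0.40, 5_000_000)   # approved, but it fails
eol_reject = expected_opportunity_loss(0.60, 40_000_000)   # rejected, but it would have succeeded

# Pick the option with the smaller EOL. A perfect measurement would remove
# that remaining EOL entirely, so it caps what the information could be worth.
best_choice, evpi = min(
    [("approve", eol_approve), ("reject", eol_reject)], key=lambda pair: pair[1]
)
print(best_choice, evpi)
```

Under these assumptions, approving the campaign is the better default, and its EOL is the most that any measurement of the campaign’s success could be worth.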

In most cases, we want to compute the VoI for a range of values rather than a binary succeed/fail. So let’s tweak the advertising campaign example and say that a calibrated marketing expert’s 90% CI for sales resulting from the campaign was from 100,000 units to 1 million units. The risk is that we don’t sell enough units from this campaign to break even.

Suppose we profit by $25 per unit sold, so we’d have to sell at least 200,000 units from the campaign to break even (on a $5M campaign). To begin, let’s calculate the expected value of *perfect* information (EVPI), which will provide an upper bound on how much we should spend to reduce our uncertainty about how many units will be sold as a result of the campaign. Here’s how we compute it:

Of course, we’ll do this with a computer. For the details, see Hubbard’s book and the Value of Information spreadsheet from his website.

In this case, the EVPI turns out to be about $337,000. This means that we shouldn’t spend more than $337,000 to reduce our uncertainty about how many units will be sold as a result of the campaign.
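For readers who want to see the mechanics, here is one way such a calculation might be sketched in Python. It is not Hubbard’s spreadsheet: it assumes unit sales are normally distributed, converts the 90% CI to a standard deviation by dividing its width by 3.29, cuts the region below break-even into thin slices, and sums probability times loss over the slices. The function name `evpi_by_slices` and the slice count are choices made for the example.

```python
from statistics import NormalDist

LOW, HIGH = 100_000, 1_000_000   # calibrated 90% CI for units sold
PROFIT_PER_UNIT = 25             # dollars of profit per unit sold
BREAK_EVEN_UNITS = 200_000       # units needed to recover the $5M campaign cost

# Assumption: unit sales are normally distributed, with the 90% CI
# spanning about 3.29 standard deviations.
sales = NormalDist(mu=(LOW + HIGH) / 2, sigma=(HIGH - LOW) / 3.29)


def evpi_by_slices(dist, n_slices=10_000):
    """Approximate the EOL of approving the campaign by summing thin slices."""
    start = dist.mean - 6 * dist.stdev  # far enough into the lower tail
    width = (BREAK_EVEN_UNITS - start) / n_slices
    total = 0.0
    for i in range(n_slices):
        left = start + i * width
        right = left + width
        midpoint = (left + right) / 2
        probability = dist.cdf(right) - dist.cdf(left)
        loss = (BREAK_EVEN_UNITS - midpoint) * PROFIT_PER_UNIT
        total += probability * loss
    return total


evpi = evpi_by_slices(sales)
print(f"Approximate EVPI: ${evpi:,.0f}")
```

The result depends on the assumed distribution shape and on how the lower tail is treated (a normal distribution allows negative sales, which is unrealistic), so it lands in the same general range as the figure quoted above without necessarily matching it exactly.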

And in fact, we should probably spend much less than $337,000, because no measurement we make will give us *perfect* information. For more details on how to measure the value of *imperfect* information, see Hubbard’s book and these three LessWrong posts: (1) VoI: 8 Examples, (2) VoI: Four Examples, and (3) 5-second level case study: VoI.

I do, however, want to quote Hubbard’s comments about the “measurement inversion”:

Hubbard calls this the “Measurement Inversion”:

Here is one example:

Hence the importance of calculating EVI.

## Step 4: Apply the relevant measurement instrument(s) to the high-information-value variable

If you followed the first three steps, then you’ve defined a variable you want to measure in terms of the decision it affects and how you observe it, you’ve quantified your uncertainty about it, and you’ve calculated the value of gaining additional information about it. Now it’s time to reduce your uncertainty about the variable – that is, to measure it.

Each scientific discipline has its own specialized measurement methods. Hubbard’s book describes measurement methods that are often useful for reducing our uncertainty about the “softer” topics often encountered by decision-makers in business.

## Selecting a measurement method

To figure out which category of measurement methods is appropriate for a particular case, we must ask several questions:

## Decomposition

Sometimes you’ll want to start by decomposing an uncertain variable into several parts to identify which observables you can most easily measure. For example, rather than directly estimating the cost of a large construction project, you could break it into parts and estimate the cost of each part of the project.

In Hubbard’s experience, decomposition itself – even without making any new measurements – often reduces one’s uncertainty about the variable of interest.

## Secondary research

Don’t reinvent the world. In almost all cases, someone has already invented the measurement tool you need, and you just need to find it. Here are Hubbard’s tips on secondary research:

I’d also recommend my post Scholarship: How to Do It Efficiently.

## Observation

If you’re not sure how to measure your target variable’s observables, ask these questions:

## Measure just enough

Because initial measurements often tell you quite a lot, and also change the value of continued measurement, Hubbard often aims for spending 10% of the EVPI on a measurement, and sometimes as little as 2% (especially for very large projects).

## Consider the error

It’s important to be conscious of some common ways in which measurements can mislead.

Scientists distinguish two types of measurement error: systemic and random. Random errors are random variations from one observation to the next. They can’t be individually predicted, but they fall into patterns that can be accounted for with the laws of probability. Systemic errors, in contrast, are consistent. For example, the sales staff may routinely overestimate the next quarter’s revenue by 50% (on average).

We must also distinguish precision and accuracy. A “precise” measurement tool has low random error. E.g. if a bathroom scale gives the exact same displayed weight every time we set a particular book on it, then the scale has high precision. An “accurate” measurement tool has low systemic error. The bathroom scale, while precise, might be inaccurate if the weight displayed is systemically biased in one direction – say, eight pounds too heavy. A measurement tool can also have low precision but good accuracy, if it gives inconsistent measurements but they average to the true value.
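The difference between the two kinds of error is easy to see in a small simulation. The sketch below uses invented numbers: a hypothetical true weight, a scale that reads eight pounds heavy with almost no random error, and a second scale with no bias but a lot of random error.

```python
import random
from statistics import mean, stdev

rng = random.Random(1)
TRUE_WEIGHT = 150.0  # hypothetical true weight in pounds

# Precise but inaccurate: tiny random error, constant +8 lb systemic error.
biased_scale = [TRUE_WEIGHT + 8.0 + rng.gauss(0, 0.1) for _ in range(1000)]
# Imprecise but accurate: large random error, no systemic error.
noisy_scale = [TRUE_WEIGHT + rng.gauss(0, 5.0) for _ in range(1000)]

for name, readings in [("biased", biased_scale), ("noisy", noisy_scale)]:
    bias = mean(readings) - TRUE_WEIGHT   # systemic error (accuracy)
    spread = stdev(readings)              # random error (precision)
    print(f"{name}: bias = {bias:+.2f} lb, spread = {spread:.2f} lb")
```

Averaging more readings shrinks the effect of the spread but does nothing about the bias, which is one reason random error tends to be the easier of the two to handle.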

Random error tends to be easier to handle. Consider this example:

Systemic error is also called a “bias.” Based on his experience, Hubbard suspects the three most important to avoid are:

*what* they changed about the workplace. The workers seem to have been responding merely to the *fact* that they were being observed in *some* way.

## Choose and design the measurement instrument

After following the above steps, Hubbard writes, “the measurement instrument should be almost completely formed in your mind.” But if you still can’t come up with a way to measure the target variable, here are some additional tips:

- Work through the consequences. If the value is surprisingly high, or surprisingly low, what would you expect to see?
- Be iterative. Start with just a few observations, and then recalculate the information value.
- Consider multiple approaches. Your first measurement tool may not work well. Try others.
- What’s the really simple question that makes the rest of the measurement moot? First see if you can detect *any* change in research quality before trying to measure it more comprehensively.

## Sampling reality

In most cases, we’ll estimate the values in a population by measuring the values in a small sample from that population. And for reasons discussed in chapter 7, a very small sample can often offer large reductions in uncertainty.

There are a variety of tools we can use to build our estimates from small samples, and which one we should use often depends on how outliers are distributed in the population. In some cases, outliers are very close to the mean, and thus our estimate of the mean can converge quickly on the true mean as we look at new samples. In other cases, outliers can be several orders of magnitude away from the mean, and our estimate converges very slowly or not at all. Here are some examples:

Below, I survey just a few of the many sampling methods Hubbard covers in his book.

## Mathless estimation

When working with a quickly converging phenomenon and a symmetric distribution (uniform, normal, camel-back, or bow-tie) for the population, you can use the t-statistic to develop a 90% CI even when working with very small samples. (See the book for instructions.)

Or, even easier, make use of the *Rule of Five*: “There is a 93.75% chance that the median of a population is between the smallest and largest values in any random sample of five from that population.”

The Rule of Five has another advantage over the t-statistic: it works for any distribution of values in the population, including ones with slow convergence or no convergence at all! It can do this because it gives us a confidence interval for the *median* rather than the *mean*, and it’s the mean that is far more affected by outliers.

Hubbard calls this a “mathless” estimation technique because it doesn’t require us to take square roots or calculate standard deviation or anything like that. Moreover, this mathless technique extends beyond the Rule of Five: If we sample 8 items, there is a 99.2% chance that the median of the population falls within the largest and smallest values. If we take the *2nd* largest and smallest values (out of 8 total values), we get something close to a 90% CI for the median. Hubbard generalizes the tool with this handy reference table:

And if the distribution is symmetrical, then the mathless table gives us a 90% CI for the mean as well as for the median.
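The percentages behind the Rule of Five follow from a coin-flip argument: each randomly sampled value has a 50% chance of falling below the population median. The sketch below computes the coverage for any sample size, assuming independent random draws from a continuous distribution; the function name `median_coverage` is made up for the example.

```python
from math import comb


def median_coverage(n, k=1):
    """Chance the population median lies between the k-th smallest and
    k-th largest values of a random sample of size n.

    The interval misses on one side only if fewer than k sample values
    fall on that side of the median, and each value falls on a given
    side with probability 1/2.
    """
    miss_one_side = sum(comb(n, j) for j in range(k)) / 2 ** n
    return 1 - 2 * miss_one_side


print(f"{median_coverage(5):.4f}")       # Rule of Five
print(f"{median_coverage(8):.4f}")       # smallest and largest of 8
print(f"{median_coverage(8, k=2):.4f}")  # 2nd smallest and 2nd largest of 8
```

For a sample of five, the only way to miss is for all five values to land on the same side of the median, which has probability 2 × (1/2)^5 = 6.25%; that is where the 93.75% comes from.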

## Catch-recatch

How does a biologist measure the number of fish in a lake? She catches and tags a sample of fish – say, 1000 of them – and then releases them. After the fish have had time to spread amongst the rest of the population, she’ll catch another sample of fish. Suppose she caught 1000 fish again, and 50 of them were tagged. This would mean 5% of the fish were tagged, and thus that there were about 20,000 fish in the entire lake. (See Hubbard’s book for the details on how to calculate the 90% CI.)
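The point estimate in the fish example is a one-line proportion, often called the Lincoln-Petersen estimator. Here is a minimal sketch; the function name `estimate_population` is made up for the example, and it gives only the point estimate, not the 90% CI.

```python
def estimate_population(tagged_first, caught_second, tagged_in_second):
    """Estimate population size from a tag-and-recapture sample.

    Assumes tagged fish mix evenly with the rest, so the tagged share of
    the second catch matches the tagged share of the whole population.
    """
    if tagged_in_second == 0:
        raise ValueError("no tagged animals recaptured; cannot estimate")
    return tagged_first * caught_second / tagged_in_second


print(round(estimate_population(1000, 1000, 50)))
```

With 1000 fish tagged, 1000 caught in the second sample, and 50 of those tagged, this reproduces the estimate of about 20,000 fish.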

## Spot sampling

The fish example was a special case of a common problem: population proportion sampling. Often, we want to know what proportion of a population has a particular trait. How many registered voters in California are Democrats? What percentage of your customers prefer a new product design over the old one?

Hubbard’s book discusses how to solve the general problem, but for now let’s just consider another special case: spot sampling.

In spot sampling, you take random snapshots of things rather than tracking them constantly. What proportion of their work hours do employees spend on Facebook? To answer this, you “randomly sample people through the day to see what they were doing *at that moment*. If you find that in 12 instances out of 100 random samples” employees were on Facebook, you can guess they spend about 12% of their time on Facebook (the 90% CI is 8% to 18%).

## Clustered sampling

Hubbard writes:

## Measure to the threshold

For many decisions, one decision is required if a value is above some threshold, and another decision is required if that value is below the threshold. For such decisions, you don’t care as much about a measurement that reduces uncertainty in general as you do about a measurement that tells you which decision to make based on the threshold. Hubbard gives an example:

Hubbard shows how to derive the real chance in his book. The key point is that “the uncertainty about the threshold can fall much faster than the uncertainty about the quantity in general.”

## Regression modeling

What if you want to figure out the cause of something that has many possible causes? One method is to perform a *controlled experiment*, and compare the outcomes of a test group to a control group. Hubbard discusses this in his book (and yes, he’s a Bayesian, and a skeptic of p-value hypothesis testing). For this summary, I’ll instead mention another method for isolating causes: regression modeling. Hubbard explains:

Hubbard’s book explains the basics of linear regressions, and of course gives the caveat that correlation does not imply causation. But, he writes, “you should conclude that one thing causes another only if you have some *other* good reason besides the correlation itself to suspect a cause-and-effect relationship.”

## Bayes

Hubbard’s 10th chapter opens with a tutorial on Bayes’ Theorem. For an online tutorial, see here.
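As a quick reminder of the mechanics, here is Bayes’ Theorem as a small function, with illustrative numbers: a 90% prior that a coin landed heads, followed by evidence that is nine times as likely if it landed tails. The function and parameter names are made up for the example.

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' Theorem: P(H | E) = P(E | H) * P(H) / P(E)."""
    numerator = p_evidence_if_true * prior
    evidence = numerator + p_evidence_if_false * (1 - prior)
    return numerator / evidence


# Prior: 90% that the coin landed heads. The evidence is nine times as
# likely under tails as under heads (likelihood ratio 1:9).
p_heads = posterior(prior=0.9, p_evidence_if_true=0.1, p_evidence_if_false=0.9)
print(round(p_heads, 3))
```

The 9:1 prior odds for heads and the 1:9 likelihood ratio cancel, so the posterior comes out at even odds.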

Hubbard then zooms out to a big-picture view of measurement, and recommends the “instinctive Bayesian approach”:

Hubbard says a few things in support of this approach. First, he points to some studies (e.g. El-Gamal & Grether (1995)) showing that people often reason in roughly-Bayesian ways. Next, he says that in his experience, people become better intuitive Bayesians when they (1) are made aware of the base rate fallacy, and when they (2) are better calibrated.

Hubbard says that once these conditions are met,

He also offers a chart showing how a pure Bayesian estimator compares to other estimators:

Also, Bayes’ Theorem allows us to perform a “Bayesian inversion”:

## Other methods

Other chapters discuss other measurement methods, for example prediction markets, Rasch models, methods for measuring preferences and happiness, methods for improving the subjective judgments of experts, and many others.

## Step 5: Make a decision and act on it

The last step will make more sense if we first “bring the pieces together.” Hubbard now organizes his consulting work with a firm into 3 phases, so let’s review what we’ve learned in the context of his 3 phases.

## Phase 0: Project Preparation

- Initial research: Interviews and secondary research to get familiar with the nature of the decision problem.
- Expert identification: Usually 4–5 experts who provide estimates.

## Phase 1: Decision Modeling

- Decision problem definition: Experts define the problem they’re trying to analyze.
- Decision model detail: Using an Excel spreadsheet, the AIE analyst elicits from the experts all the factors that matter for the decision being analyzed: costs and benefits, ROI, etc.
- Initial calibrated estimates: First, the experts undergo calibration training. Then, they fill in the values (as 90% CIs or other probability distributions) for the variables in the decision model.

## Phase 2: Optimal measurements

- Value of information analysis: Using Excel macros, the AIE analyst runs a value of information analysis on every variable in the model.
- Preliminary measurement method designs: Focusing on the few variables with highest information value, the AIE analyst chooses measurement methods that should reduce uncertainty.
- Measurement methods: Decomposition, random sampling, Bayesian inversion, controlled experiments, and other methods are used (as appropriate) to reduce the uncertainty of the high-VoI variables.
- Updated decision model: The AIE analyst updates the decision model based on the results of the measurements.
- Final value of information analysis: The AIE analyst runs a VoI analysis on each variable again. As long as this analysis shows information value much greater than the cost of measurement for some variables, measurement and VoI analysis continue in multiple iterations. Usually, though, only one or two iterations are needed before the VoI analysis shows that no further measurements are justified.

## Phase 3: Decision optimization and the final recommendation

- Completed risk/return analysis: A final MC simulation shows the likelihood of possible outcomes.
- Identified metrics procedures: Procedures are put in place to measure some variables (e.g. about project progress or external factors) continually.
- Decision optimization: The final business decision recommendation is made (this is rarely a simple “yes/no” answer).

## Final thoughts

Hubbard’s book includes two case studies in which Hubbard describes how he led two fairly different clients (the EPA and U.S. Marine Corps) through each phase of the AIE process. Then, he closes the book with the following summary: