Aug 7, 2013
Douglas Hubbard’s How to Measure Anything is one of my favorite how-to books. I hope this summary inspires you to buy the book; it’s worth it.
The book opens:
Anything can be measured. If a thing can be observed in any way at all, it lends itself to some type of measurement method. No matter how “fuzzy” the measurement is, it’s still a measurement if it tells you more than you knew before. And those very things most likely to be seen as immeasurable are, virtually always, solved by relatively simple measurement methods.
The sciences have many established measurement methods, so Hubbard’s book focuses on the measurement of “business intangibles” that are important for decision-making but tricky to measure: things like management effectiveness, the “flexibility” to create new products, the risk of bankruptcy, and public image.
A measurement is an observation that quantitatively reduces uncertainty. Measurements might not yield precise, certain judgments, but they do reduce your uncertainty.
To be measured, the object of measurement must be described clearly, in terms of observables. A good way to clarify a vague object of measurement like “IT security” is to ask “What is IT security, and why do you care?” Such probing can reveal that “IT security” means things like a reduction in unauthorized intrusions and malware attacks, which the IT department cares about because these things result in lost productivity, fraud losses, and legal liabilities.
Uncertainty is the lack of certainty: the true outcome/state/value is not known.
Risk is a state of uncertainty in which some of the possibilities involve a loss.
Much pessimism about measurement comes from a lack of experience making measurements. Hubbard, who is far more experienced with measurement than his readers, says:
Hubbard calls his method “Applied Information Economics” (AIE). It consists of 5 steps:
These steps are elaborated below.
Hubbard illustrates this step by telling the story of how he helped the Department of Veterans Affairs (VA) with a measurement problem.
The VA was considering seven proposed IT security projects. They wanted to know “which… of the proposed investments were justified and, after they were implemented, whether improvements in security justified further investment…” Hubbard asked his standard questions: “What do you mean by ‘IT security’? Why does it matter to you? What are you observing when you observe improved IT security?”
It became clear that nobody at the VA had thought about the details of what “IT security” meant to them. But after Hubbard’s probing, it became clear that by “IT security” they meant a reduction in the frequency and severity of some undesirable events: agency-wide virus attacks, unauthorized system access (external or internal),unauthorized physical access, and disasters affecting the IT infrastructure (fire, flood, etc.) And each undesirable event was on the list because of specific costs associated with it: productivity losses from virus attacks, legal liability from unauthorized system access, etc.
Now that the VA knew what they meant by “IT security,” they could measure specific variables, such as the number of virus attacks per year.
The next step is to determine your level of uncertainty about the variables you want to measure. To do this, you can express a “confidence interval” (CI). A 90% CI is a range of values that is 90% likely to contain the correct value. For example, the security experts at the VA were 90% confident that each agency-wide virus attack would affect between 25,000 and 65,000 people.
Unfortunately, few people are well-calibrated estimators. For example in some studies, the true value lay in subjects’ 90% CIs only 50% of the time! These subjects were overconfident. For a well-calibrated estimator, the true value will lie in her 90% CI roughly 90% of the time.
Luckily, “assessing uncertainty is a general skill that can be taught with a measurable improvement.”
Hubbard uses several methods to calibrate each client’s value estimators, for example the security experts at the VA who needed to estimate the frequency of security breaches and their likely costs.
His first technique is the equivalent bet test. Suppose you’re asked to give a 90% CI for the year in which Newton published the universal laws of gravitation, and you can win $1,000 in one of two ways:
If you find yourself preferring option #2, then you must think spinning the dial has a higher chance of winning you $1,000 than option #1. That suggest your stated 90% CI isn’t really your 90% CI. Maybe it’s your 65% CI or your 80% CI instead. By preferring option #2, your brain is trying to tell you that your originally stated 90% CI is overconfident.
If instead you find yourself preferring option #1, then you must think there is more than a 90% chance your stated 90% CI contains the true value. By preferring option #1, your brain is trying to tell you that your original 90% CI is under confident.
To make a better estimate, adjust your 90% CI until option #1 and option #2 seem equally good to you. Research suggests that even pretending to bet money in this way will improve your calibration.
Hubbard’s second method for improving calibration is simply repetition and feedback. Make lots of estimates and then see how well you did. For this, play CFAR’s Calibration Game.
Hubbard also asks people to identify reasons why a particular estimate might be right, and why it might be wrong.
He also asks people to look more closely at each bound (upper and lower) on their estimated range. A 90% CI “means there is a 5% chance the true value could be greater than the upper bound, and a 5% chance it could be less than the lower bound. This means the estimators must be 95% sure that the true value is less than the upper bound. If they are not that certain, they should increase the upper bound… A similar test is applied to the lower bound.”
Once you determine what you know about the uncertainties involved, how can you use that information to determine what you know about the risks involved? Hubbard summarizes:
…all risk in any project… can be expressed by one method: the ranges of uncertainty on the costs and benefits, and probabilities on events that might affect them.
The simplest tool for measuring such risks accurately is the Monte Carlo (MC) simulation, which can be run by Excel and many other programs. To illustrate this tool, suppose you are wondering whether to lease a new machine for one step in your manufacturing process.
The one-year lease [for the machine] is $400,000 with no option for early cancellation. So if you aren’t breaking even, you are still stuck with it for the rest of the year. You are considering signing the contract because you think the more advanced device will save some labor and raw materials and because you think the maintenance cost will be lower than the existing process.
Your pre-calibrated estimators give their 90% CIs for the following variables:
Thus, your annual savings will equal (MS + LS + RMS) × PL.
When measuring risk, we don’t just want to know the “average” risk or benefit. We want to know the probability of a huge loss, the probability of a small loss, the probability of a huge savings, and so on. That’s what Monte Carlo can tell us.
An MC simulation uses a computer to randomly generate thousands of possible values for each variable, based on the ranges we’ve estimated. The computer then calculates the outcome (in this case, the annual savings) for each generated combination of values, and we’re able to see how often different kinds of outcomes occur.
To run an MC simulation we need not just the 90% CI for each variable but also the shape of each distribution. In many cases, the normal distribution will work just fine, and we’ll use it for all the variables in this simplified illustration. (Hubbard’s book shows you how to work with other distributions).
To make an MC simulation of a normally distributed variable in Excel, we use this formula:
=norminv(rand(), mean, standard deviation)
So the formula for the maintenance savings variable should be:
=norminv(rand(), 15, (20–10)/3.29)
Suppose you enter this formula on cell A1 in Excel. To generate (say) 10,000 values for the maintenance savings value, just (1) copy the contents of cell A1, (2) enter “A1:A10000” in the cell range field to select cells A1 through A10000, and (3) paste the formula into all those cells.
Now we can follow this process in other columns for the other variables, including a column for the “total savings” formula. To see how many rows made a total savings of $400,000 or more (break-even), use Excel’s countif function. In this case, you should find that about 14% of the scenarios resulted in a savings of less than $400,000 – a loss.
We can also make a histogram (see right) to show how many of the 10,000 scenarios landed in each $100,000 increment (of total savings). This is even more informative, and tells us a great deal about the distribution of risk and benefits we might incur from investing in the new machine. (Download the full spreadsheet for this example here.)
The simulation concept can (and in high-value cases should) be carried beyond this simple MC simulation. The first step is to learn how to use a greater variety of distributions in MC simulations. The second step is to deal with correlated (rather than independent) variables by generating correlated random numbers or by modeling what the variables have in common.
A more complicated step is to use a Markov simulation, in which the simulated scenario is divided into many time intervals. This is often used to model stock prices, the weather, and complex manufacturing or construction projects. Another more complicated step is to use an agent-based model, in which independently-acting agents are simulated. This method is often used for traffic simulations, in which each vehicle is modeled as an agent.
Information can have three kinds of value:
When you’re uncertain about a decision, this means there’s a chance you’ll make a non-optimal choice. The cost of a “wrong” decision is the difference between the wrong choice and the choice you would have made with perfect information. But it’s too costly to acquire perfect information, so instead we’d like to know which decision-relevant variables are the most valuable to measure more precisely, so we can decide which measurements to make.
Here’s a simple example:
Suppose you could make $40 million profit if [an advertisement] works and lose $5 million (the cost of the campaign) if it fails. Then suppose your calibrated experts say they would put a 40% chance of failure on the campaign.
The expected opportunity loss (EOL) for a choice is the probability of the choice being “wrong” times the cost of it being wrong. So for example the EOL if the campaign is approved is $5M × 40% = $2M, and the EOL if the campaign is rejected is $40M × 60% = $24M.
The difference between EOL before and after a measurement is called the “expected value of information” (EVI).
In most cases, we want to compute the VoI for a range of values rather than a binary succeed/fail. So let’s tweak the advertising campaign example and say that a calibrated marketing expert’s 90% CI for sales resulting from the campaign was from 100,000 units to 1 million units. The risk is that we don’t sell enough units from this campaign to break even.
Suppose we profit by $25 per unit sold, so we’d have to sell at least 200,000 units from the campaign to break even (on a $5M campaign). To begin, let’s calculate the expected value of perfect information (EVPI), which will provide an upper bound on how much we should spend to reduce our uncertainty about how many units will be sold as a result of the campaign. Here’s how we compute it:
Of course, we’ll do this with a computer. For the details, see Hubbard’s book and the Value of Information spreadsheet from his website.
In this case, the EVPI turns out to be about $337,000. This means that we shouldn’t spend more than $337,000 to reduce our uncertainty about how many units will be sold as a result of the campaign.
And in fact, we should probably spend much less than $337,000, because no measurement we make will give us perfect information. For more details on how to measure the value of imperfect information, see Hubbard’s book and these three LessWrong posts: (1) VoI: 8 Examples, (2) VoI: Four Examples, and (3) 5-second level case study: VoI.
I do, however, want to quote Hubbard’s comments about the “measurement inversion”:
By 1999, I had completed the… Applied Information Economics analysis on about 20 major [IT] investments… Each of these business cases had 40 to 80 variables, such as initial development costs, adoption rate, productivity improvement, revenue growth, and so on. For each of these business cases, I ran a macro in Excel that computed the information value for each variable… [and] I began to see this pattern: * The vast majority of variables had an information value of zero… * The variables that had high information values were routinely those that the client had never measured… * The variables that clients [spent] the most time measuring were usually those with a very low (even zero) information value… …since then, I’ve applied this same test to another 40 projects, and… [I’ve] noticed the same phenomena arise in projects relating to research and development, military logistics, the environment, venture capital, and facilities expansion.
Hubbard calls this the “Measurement Inversion”:
In a business case, the economic value of measuring a variable is usually inversely proportional to how much measurement attention it usually gets.
Here is one example:
A stark illustration of the Measurement Inversion for IT projects can be seen in a large UK-based insurance client of mine that was an avid user of a software complexity measurement method called “function points.” This method was popular in the 1980s and 1990s as a basis of estimating the effort for large software development efforts. This organization had done a very good job of tracking initial estimates, function point estimates, and actual effort expended for over 300 IT projects. The estimation required three or four full-time persons as “certified” function point counters…
But a very interesting pattern arose when I compared the function point estimates to the initial estimates provided by project managers… The costly, time-intensive function point counting did change the initial estimate but, on average, it was no closer to the actual project effort than the initial effort… Not only was this the single largest measurement effort in the IT organization, it literally added no value since it didn’t reduce uncertainty at all. Certainly, more emphasis on measuring the benefits of the proposed projects – or almost anything else – would have been better money spent.
Hence the importance of calculating EVI.
If you followed the first three steps, then you’ve defined a variable you want to measure in terms of the decision it affects and how you observe it, you’ve quantified your uncertainty about it, and you’ve calculated the value of gaining additional information about it. Now it’s time to reduce your uncertainty about the variable – that is, to measure it.
Each scientific discipline has its own specialized measurement methods. Hubbard’s book describes measurement methods that are often useful for reducing our uncertainty about the “softer” topics often encountered by decision-makers in business.
To figure out which category of measurement methods are appropriate for a particular case, we must ask several questions:
Sometimes you’ll want to start by decomposing an uncertain variable into several parts to identify which observables you can most easily measure. For example, rather than directly estimating the cost of a large construction project, you could break it into parts and estimate the cost of each part of the project.
In Hubbard’s experience, it’s often the case that decomposition itself – even without making any new measurements – often reduces one’s uncertainty about the variable of interest.
Don’t reinvent the world. In almost all cases, someone has already invented the measurement tool you need, and you just need to find it. Here are Hubbard’s tips on secondary research:
I’d also recommend my post Scholarship: How to Do It Efficiently.
If you’re not sure how to measure your target variable’s observables, ask these questions:
Because initial measurements often tell you quite a lot, and also change the value of continued measurement, Hubbard often aims for spending 10% of the EVPI on a measurement, and sometimes as little as 2% (especially for very large projects).
It’s important to be conscious of some common ways in which measurements can mislead.
Scientists distinguish two types of measurement error: systemic and random. Random errors are random variations from one observation to the next. They can’t be individually predicted, but they fall into patterns that can be accounted for with the laws of probability. Systemic errors, in contrast, are consistent. For example, the sales staff may routinely overestimate the next quarter’s revenue by 50% (on average).
We must also distinguish precision and accuracy. A “precise” measurement tool has low random error. E.g. if a bathroom scale gives the exact same displayed weight every time we set a particular book on it, then the scale has high precision. An “accurate” measurement tool has low systemic error. The bathroom scale, while precise, might be inaccurate if the weight displayed is systemically biased in one direction – say, eight pounds too heavy. A measurement tool can also have low precision but good accuracy, if it gives inconsistent measurements but they average to the true value.
Random error tends to be easier to handle. Consider this example:
For example, to determine how much time sales reps spend in meetings with clients versus other administrative tasks, they might choose a complete review of all time sheets… [But] if a complete review of 5,000 time sheets… tells us that sales reps spend 34% of their time in direct communication with customers, we still don’t know how far from the truth it might be. Still, this “exact” number seems reassuring to many managers. Now, suppose a sample of direct observations of randomly chosen sales reps at random points in time finds that sales reps were in client meetings or on client phone calls only 13 out of 100 of those instances. (We can compute this without interrupting a meeting by asking as soon as the rep is available.) As we will see [later], in the latter case, we can statistically compute a 90% CI to be 7.5% to 18.5%. Even though this random sampling approach gives us only a range, we should prefer its findings to the census audit of time sheets. The census… gives us an exact number, but we have no way to know by how much and in which direction the time sheets err.
Systemic error is also called a “bias.” Based on his experience, Hubbard suspects the three most important to avoid are:
After following the above steps, Hubbard writes, “the measurement instrument should be almost completely formed in your mind.” But if you still can’t come up with a way to measure the target variable, here are some additional tips:
In most cases, we’ll estimate the values in a population by measuring the values in a small sample from that population. And for reasons discussed in chapter 7, a very small sample can often offer large reductions in uncertainty.
There are a variety of tools we can use to build our estimates from small samples, and which one we should use often depends on how outliers are distributed in the population. In some cases, outliers are very close to the mean, and thus our estimate of the mean can converge quickly on the true mean as we look at new samples. In other cases, outliers can be several orders of magnitude away from the mean, and our estimate converges very slowly or not at all. Here are some examples:
Below, I survey just a few of the many sampling methods Hubbard covers in his book.
When working with a quickly converging phenomenon and a symmetric distribution (uniform, normal, camel-back, or bow-tie) for the population, you can use the t-statistic to develop a 90% CI even when working with very small samples. (See the book for instructions.)
Or, even easier, make use of the Rule of FIve: “There is a 93.75% chance that the median of a population is between the smallest and largest values in any random sample of five from that population.”
The Rule of Five has another advantage over the t-statistic: it works for any distribution of values in the population, including ones with slow convergence or no convergence at all! It can do this because it gives us a confidence interval for the median rather than the mean, and it’s the mean that is far more affected by outliers.
Hubbard calls this a “mathless” estimation technique because it doesn’t require us to take square roots or calculate standard deviation or anything like that. Moreover, this mathless technique extends beyond the Rule of Five: If we sample 8 items, there is a 99.2% chance that the median of the population falls within the largest and smallest values. If we take the 2nd largest and smallest values (out of 8 total values), we get something close to a 90% CI for the median. Hubbard generalizes the tool with this handy reference table:
And if the distribution is symmetrical, then the mathless table gives us a 90% CI for the mean as well as for the median.
How does a biologist measure the number of fish in a lake? SHe catches and tags a sample of fish – say, 1000 of them – and then releases them. After the fish have had time to spread amongst the rest of the population, she’ll catch another sample of fish. Suppose she caught 1000 fish again, and 50 of them were tagged. This would mean 5% of the fish were tagged, and thus that were about 20,000 fish in the entire lake. (See Hubbard’s book for the details on how to calculate the 90% CI.)
The fish example was a special case of a common problem: population proportion sampling. Often, we want to know what proportion of a population has a particular trait. How many registered voters in California are Democrats? What percentage of your customers prefer a new product design over the old one?
Hubbard’s book discusses how to solve the general problem, but for now let’s just consider another special case: spot sampling.
In spot sampling, you take random snapshots of things rather than tracking them constantly. What proportion of their work hours do employees spend on Facebook? To answer this, you “randomly sample people through the day to see what they were doing at that moment. If you find that in 12 instances out of 100 random samples” employees were on Facebook, you can guess they spend about 12% of their time on Facebook (the 90% CI is 8% to 18%).
“Clustered sampling” is defined as taking a random sample of groups, then conducting a census or a more concentrated sampling within the group. For example, if you want to see what share of households has satellite dishes… it might be cost effective to randomly choose several city blocks, then conduct a complete census of everything in a block. (Zigzagging across town to individually selected households would be time consuming.) In such cases, we can’t really consider the number of [households] in the groups… to be the number of random samples. Within a block, households may be very similar… [and therefore] it might be necessary to treat the effective number of random samples as the number of blocks…
For many decisions, one decision is required if a value is above some threshold, and another decision is required if that value is below the threshold. For such decisions, you don’t care as much about a measurement that reduces uncertainty in general as you do about a measurement that tells you which decision to make based on the threshold. Hubbard gives an example:
Suppose you needed to measure the average amount of time spent by employees in meetings that could be conducted remotely… If a meeting is among staff members who communicate regularly and for a relatively routine topic, but someone has to travel to make the meeting, you probably can conduct it remotely. You start out with your calibrated estimate that the median employee spends between 3% to 15% traveling to meetings that could be conducted remotely. You determine that if this percentage is actually over 7%, you should make a significant investment in tele meetings. The [EVPI] calculation shows that it is worth no more than $15,000 to study this. According to our rule of thumb for measurement costs, we might try to spend about $1,500…
Let’s say you sampled 10 employees and… you find that only 1 spends less time in these activities than the 7% threshold. Given this information, what is the chance that the median time spent in such activities is actually below 7%, in which case the investment would not be justified? One “common sense” answer is 1/10, or 10%. Actually… the real chance is much smaller.
Hubbard shows how to derive the real chance in his book. The key point is that “the uncertainty about the threshold can fall much faster than the uncertainty about the quantity in general.”
What if you want to figure out the cause of something that has many possible causes? One method is to perform a controlled experiment, and compare the outcomes of a test group to a control group. Hubbard discusses this in his book (and yes, he’s a Bayesian, and a skeptic of p-value hypothesis testing). For this summary, I’ll instead mention another method for isolating causes: regression modeling. Hubbard explains:
If we use regression modeling with historical data, we may not need to conduct a controlled experiment. Perhaps, for example, it is difficult to tie an IT project to an increase in sales, but we might have lots of data about how something else affects sales, such as faster time to market of new products. If we know that faster time to market is possible by automating certain tasks, that this IT investment eliminates certain tasks, and those tasks are on the critical path in the time-to-market, we can make the connection.
Hubbard’s book explains the basics of linear regressions, and of course gives the caveat that correlation does not imply causation. But, he writes, “you should conclude that one thing causes another only if you have some other good reason besides the correlation itself to suspect a cause-and-effect relationship.”
Hubbard’s 10th chapter opens with a tutorial on Bayes’ Theorem. For an online tutorial, see here.
Hubbard then zooms out to a big-picture view of measurement, and recommends the “instinctive Bayesian approach”:
Hubbard says a few things in support of this approach. First, he points to some studies (e.g. El-Gamal & Grether (1995)) showing that people often reason in roughly-Bayesian ways. Next, he says that in his experience, people become better intuitive Bayesians when they (1) are made aware of the base rate fallacy, and when they (2) are better calibrated.
Hubbard says that once these conditions are met,
[then] humans seem to be mostly logical when incorporating new information into their estimates along with the old information. This fact is extremely useful because a human can consider qualitative information that does not fit in standard statistics. For example, if you were giving a forecast for how a new policy might change “public image” – measured in part by a reduction in customer complaints, increased revenue, and the like – a calibrated expert should be able to update current knowledge with “qualitative” information about how the policy worked for other companies, feedback from focus groups, and similar details. Even with sampling information, the calibrated estimator – who has a Bayesian instinct – can consider qualitative information on samples that most textbooks don’t cover.
He also offers a chart showing how a pure Bayesian estimator compares to other estimators:
Also, Bayes’ Theorem allows us to perform a “Bayesian inversion”:
Given a particular observation, it may seem more obvious to frame a measurement by asking the question “What can I conclude from this observation?” or, in probabilistic terms, “What is the probability X is true, given my observation?” But Bayes showed us that we could, instead, start with the question, “What is the probability of this observation if X were true?”
The second form of the question is useful because the answer is often more straightforward and it leads to the answer to the other question. It also forces us to think about the likelihood of different observations given a particular hypothesis and what that means for interpreting an observation.
[For example] if, hypothetically, we know that only 20% of the population will continue to shop at our store, then we can determine the chance [that] exactly 15 out of 20 would say so… [The details are explained in the book.] Then we can invert the problem with Bayes’ theorem to compute the chance that only 20% of the population will continue to shop there given [that] 15 out of 20 said so in a random sample. We would find that chance to be very nearly zero…
Other chapters discuss other measurement methods, for example prediction markets, Rasch models, methods for measuring preferences and happiness, methods for improving the subjective judgments of experts, and many others.
The last step will make more sense if we first “bring the pieces together.” Hubbard now organizes his consulting work with a firm into 3 phases, so let’s review what we’ve learned in the context of his 3 phases.
Hubbard’s book includes two case studies in which Hubbard describes how he led two fairly different clients (the EPA and U.S. Marine Corps) through each phase of the AIE process. Then, he closes the book with the following summary: