Consider Representative Data Sets

by Vladimir_Nesov, 6th May 2009


In this article, I consider the standard biases in drawing factual conclusions that are not related to emotional reactions, and describe a simple model summarizing what goes wrong with the reasoning in these cases, which in turn suggests a way of systematically avoiding this kind of problem.

The following model is used in this article to describe the process of getting from a question to a (potentially biased) answer. First, you ask yourself a question. Second, in the context of the question, a data set is presented before your mind, either directly, by you looking at explicit statements of fact, or indirectly, by associated facts becoming salient to your attention, triggered by the explicit data items or by the question. Third, as a result of considering the data set, you construct an intuitive model of some phenomenon that lets you see its properties. And finally, you pronounce the answer, which is read out as one of the properties of the model you've just constructed.

This description is meant to provide mental paintbrush handles: names for the things you can see introspectively, and things you could operate on consciously if you chose to.

Most of the biases in the class considered here may be seen as particular ways in which you pay attention to the wrong data set, one that is not representative of the phenomenon you are modeling to get to the answer you seek. As a result, the intuitive model becomes systematically wrong, and the answer read out from it is biased. Below I review the specific biases, to identify how things go wrong in each particular case, and then I summarize the classes of reasoning mistakes that play major roles in these biases, and correspondingly the ways of avoiding those mistakes.

Correspondence Bias is a tendency to attribute to a person a disposition to behave in a particular way, based on observing an episode in which that person behaves in that way. The data set that gets considered consists only of the observed episode, while the target model is of the person's behavior in general, in many possible episodes, in many different possible contexts that may influence the person's behavior.

Hindsight bias is a tendency to overestimate the a priori probability of an event that has actually happened. The data set that gets considered overemphasizes the scenario that did happen, while the model that needs to be constructed, of the a priori belief, should be indifferent to which of the options will actually get realized. From this model, you need to read out the probability of the specific event, but which event you'll read out shouldn't figure into the model itself.

Availability bias is a tendency to estimate the probability of an event based on whatever evidence about that event pops into your mind, without taking into account the ways in which some pieces of evidence are more memorable than others, or some pieces of evidence are easier to come by than others. This bias consists directly in considering a mismatched data set, which leads to a distorted model and a biased estimate.
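
As a rough illustration of how a mismatched data set distorts the estimate, here is a minimal Python sketch. It assumes a toy model in which the chance of an occurrence coming to mind is proportional to a "memorability" weight. The names `EventType`, `recalled_shares` and `corrected_shares`, and all the numbers, are hypothetical and chosen only for illustration; this is not a claim about how recall actually works.

```python
from dataclasses import dataclass


@dataclass
class EventType:
    name: str
    true_frequency: float  # hypothetical share of real occurrences
    memorability: float    # hypothetical relative chance of being recalled


def recalled_shares(events):
    """Share of each event type in the data set that comes to mind."""
    weights = {e.name: e.true_frequency * e.memorability for e in events}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def corrected_shares(shares, events):
    """Undo the memorability weighting, if the weights happen to be known."""
    adjusted = {e.name: shares[e.name] / e.memorability for e in events}
    total = sum(adjusted.values())
    return {name: a / total for name, a in adjusted.items()}


# Made-up numbers: dramatic events are rare but ten times more memorable.
events = [
    EventType("dramatic", true_frequency=0.05, memorability=10.0),
    EventType("mundane", true_frequency=0.95, memorability=1.0),
]

recalled = recalled_shares(events)
corrected = corrected_shares(recalled, events)
print("recalled: ", recalled)
print("corrected:", corrected)
```

With these made-up numbers, the dramatic events make up roughly a third of what is recalled, although they are only 5% of what happens. The correction step recovers the true share only because the sketch assumes the memorability weights are known, which in practice they usually are not.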

Planning Fallacy is a tendency to overestimate your efficiency in achieving a task. The data set you consider consists of the simple cached ways in which you go about accomplishing the task, and lacks the unanticipated problems and more complex ways in which the process may unfold. As a result, the model fails to adequately describe the phenomenon, and the answer is systematically wrong.

The Logical Fallacy of Generalization from Fictional Evidence consists in drawing real-world conclusions from statements invented and selected for the purpose of writing fiction. The data set is not at all representative of the real world, and in particular of whatever real-world phenomenon you need to understand to answer your real-world question. Considering this data set leads to an inadequate model, and inadequate answers.

Proposing Solutions Prematurely is dangerous because it introduces weak conclusions into the pool of facts you are considering. As a result, the data set you think about becomes weaker, overly tilted towards premature conclusions that are likely to be wrong and that are less representative of the phenomenon you are trying to model than the initial facts you started from.

Generalization From One Example is a tendency to pay too much attention to the few anecdotal pieces of evidence you experienced, and model some general phenomenon based on them. This is a special case of availability bias, and the way in which the mistake unfolds is closely related to the correspondence bias and the hindsight bias.

Contamination by Priming is a problem that relates to the process of implicitly introducing facts into the attended data set. When you are primed with a concept, the facts related to that concept come to mind more easily. As a result, the data set selected by your mind becomes tilted towards the elements related to that concept, even if it has no relation to the question you are trying to answer. Your thinking becomes contaminated, shifted in a particular direction. The data set in your focus of attention becomes less representative of the phenomenon you are trying to model, and more representative of the concepts you were primed with.

Knowing About Biases Can Hurt People. When you learn about biases, you obtain a toolset for constructing new statements of fact. Similarly to what goes wrong when you propose solutions to a hard problem prematurely, you contaminate the data set with weak conclusions: allegations against specific data items that don't add to your understanding of the phenomenon you are trying to model, that distract from considering the question, that take away whatever relevant knowledge you had, and in some cases even invert it.

A more general technique for not making these mistakes consists in making sure that the data set you consider is representative of the phenomenon you are trying to understand. The human brain can't automatically correct for a misleading selection of data, so you need to consciously ensure that you are presented with a balanced selection.

The first mistake is the introduction of irrelevant data items. Focus on the problem; don't let distractions get their way. Irrelevant data may find its way into your thoughts covertly, through priming effects you don't even notice. Don't let anything distract you, even if you understand that the distraction isn't related to the problem you are working on. Don't construct irrelevant items yourself, as byproducts of your activity. Make sure that the data items you consider are actually related to the phenomenon you are trying to understand. To form accurate beliefs about something, you really do have to observe it. Don't think about fictional evidence, and don't think about facts that look superficially relevant to the question but actually aren't, as in the case of hindsight bias and reasoning by surface analogies.

The second mistake is to consider an unbalanced data set, overemphasizing some aspects of the phenomenon and underemphasizing others. The data needs to cover the whole phenomenon in a representative way for the human mind to process it adequately. There are two sides to correcting this imbalance. First, you may take away the excess data points, deliberately refusing to consider them, so that your mind is presented with less evidence, but the remaining evidence is more balanced, more representative of what you are trying to understand. This is similar to what happens when you take an outside view, for example to avoid the planning fallacy. Second, you may generate the correct data items to fill in the rest of the model from the cluster of evidence you've got. This generation may happen either formally, by using technical models of the phenomenon that allow you to explicitly calculate more facts, or informally, by training your intuition to follow reliable rules for interpreting specific pieces of evidence as aspects of the whole phenomenon you are studying. Together, these feats constitute expertise in the domain, the art of knowing how to make use of data that would only confuse a naive mind. When discarding evidence to correct the imbalance of data, only the parts you don't possess expertise in need to be thrown away, while the parts that you are ready to process may be kept, making your understanding of the phenomenon stronger.
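
The effect of an unbalanced data set can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the observations, the context names and the `context_weights` (how common each context really is) are all made up, and `naive_estimate` and `balanced_estimate` are hypothetical helpers, not an established method. It compares averaging everything you happened to observe with averaging per context and then weighting the contexts by how representative they are.

```python
from collections import defaultdict
from statistics import mean


def naive_estimate(observations):
    """Average over whatever was observed, ignoring how it was sampled."""
    return mean(value for _, value in observations)


def balanced_estimate(observations, context_weights):
    """Average within each context, then weight contexts by their real share."""
    by_context = defaultdict(list)
    for context, value in observations:
        by_context[context].append(value)
    missing = set(context_weights) - set(by_context)
    if missing:
        # No data for some context: the gap has to be filled some other way.
        raise ValueError(f"no observations for contexts: {sorted(missing)}")
    total = sum(context_weights.values())
    return sum(
        weight * mean(by_context[context])
        for context, weight in context_weights.items()
    ) / total


# Made-up data: 1 = the person was irritable, 0 = they were not.
# Most observations come from a rare, stressful context.
observations = (
    [("deadline", v) for v in (1, 1, 1, 0, 1, 1, 0, 1)]
    + [("ordinary", v) for v in (0, 0)]
)
# Assumed real-world share of each context (hypothetical).
context_weights = {"deadline": 0.1, "ordinary": 0.9}

naive = naive_estimate(observations)
balanced = balanced_estimate(observations, context_weights)
print(f"naive: {naive:.3f}, balanced: {balanced:.3f}")
```

The naive average is dominated by the overrepresented context, while the balanced one gives each context the weight it has in the phenomenon as a whole. The function refuses to produce an answer when a context has no observations at all, since reweighting cannot invent the missing data items.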

The third mistake is to mix reliable evidence with unreliable evidence. The mind can't distinguish relevant information from fictional, irrelevant information, let alone solid relevant evidence from shaky relevant evidence. If you know some facts for sure, and some facts only through indirect, unreliable methods, don't consider the latter at all when forming your initial understanding of the phenomenon. Your own untrained intuition generates weak facts about things in which you don't have domain expertise, for example when you spontaneously think up solutions to a hard problem. You get only wild guesses when you move a few steps away from the data and the data is too thin for your intuition to retain even minimal reliability. You get weak evidence from applying general heuristics that don't promise exceptional precision, such as knowledge of biases. You get weak evidence from listening to the opinion of the majority, or to virulent memes. However, when you don't have reliable data, you need to start taking less reliable evidence into consideration, but only the best of what you can come up with.
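
The rule in the paragraph above (use only solid evidence when there is enough of it, otherwise fall back to the best of the rest) can be written down as a small Python sketch. Everything here is an assumption made for illustration: the `DataItem` type, the numeric `reliability` scores, and the `threshold` and `min_items` parameters are hypothetical, and real evidence rarely comes with such scores attached.

```python
from dataclasses import dataclass


@dataclass
class DataItem:
    claim: str
    reliability: float  # hypothetical score in [0, 1]


def select_evidence(items, threshold=0.8, min_items=3):
    """Keep only reliable items; if too few, take the best available ones."""
    reliable = [item for item in items if item.reliability >= threshold]
    if len(reliable) >= min_items:
        return reliable
    # Not enough solid data: include weaker evidence, best first.
    ranked = sorted(items, key=lambda item: item.reliability, reverse=True)
    return ranked[:min_items]


# Made-up example where solid data is plentiful.
rich = [
    DataItem("measured result", 0.95),
    DataItem("replicated experiment", 0.90),
    DataItem("documented record", 0.85),
    DataItem("majority opinion", 0.40),
    DataItem("spontaneous guess", 0.20),
]
# Made-up example where solid data is scarce.
thin = [
    DataItem("documented record", 0.85),
    DataItem("majority opinion", 0.40),
    DataItem("spontaneous guess", 0.20),
    DataItem("popular meme", 0.10),
]

print([item.claim for item in select_evidence(rich)])
print([item.claim for item in select_evidence(thin, min_items=2)])
```

In the first case the weak items are dropped entirely; in the second, the single solid item is supplemented with the least unreliable alternative rather than with everything that comes to mind.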

Your thinking shouldn't be contaminated by unrelated facts, shouldn't tumble over from the imbalance in knowledge, and shouldn't get diluted by the abundance of weak conclusions. Instead, the understanding should grow more focused on the relevant details, more comprehensive and balanced, attending to more aspects of the problem, and more technically accurate.

Think representative sets of your best data.