I'm doing a project on how humans can evaluate messy problems and come up with consistent answers (consistent with both themselves over time and with other people), and what the trade off is with accuracy. This isn't a single unified field, so I need to go poking for bits of it in lots of places. Where would people suggest I look? I'm especially interested in information on consistently evaluating novel questions that don't have enough data to make statistical models ("When will [country] develop the nuclear bomb?") as opposed to questions for which we have enough data that we're pretty much just looking for similarities ("Does this biopsy reveal cancer?").

An incomplete list of places I have looked or plan on looking at:

  • interrater reliability
  • test-retest reliability
  • educational rubrics (for both student and teacher evaluations)
  • medical decision making/standard of care
  • Daniel Kahneman's work
  • Philip Tetlock's work
  • The Handbook of Inter-Rater Reliability

2 Answers

My report based on 15 hours of research on this topic.

I recommend Richard Feynman's writings. Read about his role in solving the messy problem of the Challenger disaster. Feynman did not trust complex methods with hidden assumptions. Instead he started with basic physics and science and derived hypotheses and conclusions.

A second scientist whose method I think is under-appreciated is Darwin. His simple methods drew conclusions from very messy data. He is noted for exactness, attention to detail, and creating the science of evolution.

LOL - "When will [country] develop the nuclear bomb?" - I would ask the CIA about this question. One technique they use is to monitor what type and quantity of equipment and raw materials a country is importing. Of course the CIA also monitors testing of underground explosives.

I recommend Tim Harford's "The Logic of Life" and "The Undercover Economist". He illustrates that many questions can be answered by investigating them from unexpected directions. His specialty is economics.

Two of his questions: Why do some neighborhoods thrive and others become ghettos? Why is racism so persistent? Why is your idiot boss paid a fortune for sitting behind a mahogany altar?

If you take Tim Harford as an example, then look for other people who think outside the box and answer tough questions.

I would not trust very deep and complex methods that cannot be explained in detail at a level the user can understand. A user should gain experience with any system or method before trusting it.

I enjoyed your question - hope this helps.

I think your question is unclear: it asks "How can people answer a question consistently?", but it appears that instead of learning how to answer messy questions yourself, you want to know why others do not give consistent answers. Given diverse input information and different backgrounds, I am not surprised that the answers vary. This page is a good example of the variability of answers.

A better plan is to understand the methods and reasoning that the best thinkers have used to solve problems and answer questions. With this research you can develop criteria and a methodology that lead to accurate solutions and consistent answers.


Answer by LawrenceC (4 points, 3y)
I believe your definition of accuracy differs from the ISO definition [https://en.wikipedia.org/wiki/Accuracy_and_precision#ISO_definition_(ISO_5725)] (which is the usage I learned in undergrad statistics classes, and also the usage most online sources seem to agree with): a measurement is accurate insofar as it is close to the true value. By this definition, the second graph is accurate but not precise because all the points are close to the true value. I'll be using that definition in the remainder of my post. That being said, Wikipedia does claim your usage is the more common usage of the word [https://en.wikipedia.org/wiki/Accuracy_and_precision].

I don't have a clear sense of how to answer your question empirically, so I'll give a theoretical answer. Suppose our goal is to predict some value y. Let ŷ be our predictor for y (for example, we could have ŷ = ask a subject to predict y). A natural way to measure accuracy for prediction tasks is the mean squared error E[(y − ŷ)²], where lower mean squared error means higher accuracy. The bias-variance decomposition [https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff#Bias%E2%80%93variance_decomposition_of_squared_error] of mean squared error gives us:

E[(y − ŷ)²] = (E[ŷ] − y)² + E[(ŷ − E[ŷ])²]

The first term on the right is the (squared) bias of your estimator - how far the expected value of your estimator is from the true value. An unbiased estimator is one that, in expectation, gives you the right value (what you mean by "accuracy" in your post, and what ISO calls "trueness"). The second term is the variance of your estimator - how far your estimator is, in expectation, from the average value of the estimator. Rephrasing a bit, this measures how imprecise your estimator is, on average.

As both terms on the right are always non-negative, the bias and variance of your estimator each lower-bound your mean squared error. However, it turns out that there's often a trade-off between having an unbiased estimator and having a low-variance one.
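The decomposition is easy to verify numerically. Below is a quick sketch (not from the post; the true value, bias, and noise level are invented for illustration) that simulates a biased, noisy estimator and checks that sample MSE splits exactly into squared bias plus variance:

```python
import random

random.seed(0)

y_true = 10.0    # the fixed value we are trying to estimate
bias = 2.0       # systematic error of the estimator (lowers "trueness")
noise_sd = 1.5   # spread of the estimator (lowers precision)
n = 100_000

# Simulate repeated estimates: true value, plus a constant bias, plus noise.
estimates = [y_true + bias + random.gauss(0, noise_sd) for _ in range(n)]

mean_est = sum(estimates) / n
mse = sum((y_true - e) ** 2 for e in estimates) / n
bias_sq = (mean_est - y_true) ** 2
variance = sum((e - mean_est) ** 2 for e in estimates) / n

# Computed on the same sample, mse == bias_sq + variance holds exactly
# (it's an algebraic identity, not just an approximation).
print(mse, bias_sq + variance)
```

With these numbers the squared-bias term (≈4) dominates the variance term (≈2.25), i.e. this estimator is imprecise but, above all, untrue in the ISO sense.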
Looks like what I'm calling accuracy ISO calls "trueness", and ISO!accuracy is a combination of trueness and precision.
This seems to be a question about the correlation of the two over all processes that generate estimates, which seems very hard to determine. Even supposing you had this correlation over processes, I'm guessing that once you have a specific process in mind, you get to condition on what you know about it in a way that just screens off the prior. In a given domain, though, maybe there are useful priors one could have given what one knows about the particular process. I'll try to think of examples.

I would just like to complain about how many bullseye diagrams I looked at where the "low accuracy" picture averaged to about the bullseye, because the creator had randomly spewed dots and the average location of an image is in fact its center.

I wonder if that's because they're using the ISO definition of accuracy? A quick google search for these diagrams led me to this reddit thread [https://www.reddit.com/r/coolguides/comments/9zfg06/the_difference_between_accuracy_and_precision/] , where the discussion below reflects the fact that people use different definitions of accuracy. EDIT: here's a diagram of the form that Elizabeth is complaining about (source: the aforementioned reddit thread [https://www.reddit.com/r/coolguides/comments/9zfg06/the_difference_between_accuracy_and_precision/] ):
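Elizabeth's complaint is easy to check numerically: dots spewed uniformly at random over an image average to its center. A quick sketch (not from the thread; the unit square stands in for the diagram image):

```python
import random

random.seed(1)

# A careless diagram author scatters dots uniformly over the whole image.
dots = [(random.uniform(0, 1), random.uniform(0, 1)) for _ in range(1000)]

mean_x = sum(x for x, _ in dots) / len(dots)
mean_y = sum(y for _, y in dots) / len(dots)

# The average dot location lands near (0.5, 0.5) - the center of the
# image, i.e. right on the bullseye - so by the "trueness" reading the
# supposed "low accuracy" panel is actually accurate.
print(round(mean_x, 2), round(mean_y, 2))
```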
Answer by DanielFilan (9 points, 3y)
One example of the ability of the model: in the paper, the model is run on 120 responses to a quiz consisting of 60 Raven's Progressive Matrices questions, each question with 8 possible answers. As it happens, no responder got more than 50 questions right. The model correctly inferred the answers to 46 of the questions. A key assumption in the model is that errors are random: so, in domains where you're only asking a small number of questions, and for most questions a priori you have reason to expect some wrong answers to be more common than the right one (e.g. "What's the capital of Canada/Australia/New Zealand"), I think this model would not work (although if there were enough other questions such that good estimates of responder ability could be made, that could ameliorate the problem). If I wanted to learn more, I would read this 2016 review paper [http://delivery.acm.org/10.1145/2900000/2897352/p1-li.pdf?ip=] of the general field.
That being said, as long as you know which answers will be more common among people who don't know the right answer, and roughly how much more common they will be, you can probably add that knowledge to the algorithm without too much difficulty and have it still work as long as there are some questions where you don't expect the uninformed to reliably lean one way.
Presumably ones that meet the assumptions it's built around, which include:

  • Assumptions about how participant ability and question difficulty determine the probability of a given participant getting an answer right (Gaussian noise is involved).
  • The actual method is one of factor graphs and probabilistic inference. They also make an assumption that's analogous to one explicitly not made in voting - that participants are choosing answers honestly. (I wonder what happens if participants don't - or coordinate in advance on some set of falsehoods...)
  • The method can use additional information.
  • They estimate the unobserved variables (ability, difficulty, and the correct answer) by approximate inference; it's possible a closed-form solution, or better approximations, exist.
  • Part of the point of doing things the way outlined in the paper is active learning. It's unclear what the best way is - the paper just argues dynamic is better than static.
  • They start talking about results in/after the "Empirical Analysis" section. The end notes some limitations.
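For intuition, here is a much-simplified sketch of the "weight responders by inferred ability" idea: an iterative weighted vote, not the paper's factor-graph method. The data and the iteration scheme are invented for illustration:

```python
from collections import Counter

# responses[p][q] = answer participant p gave to question q (made-up data)
responses = [
    ["A", "B", "B", "C", "A"],
    ["A", "B", "C", "C", "A"],
    ["B", "B", "B", "C", "A"],
    ["A", "C", "B", "A", "B"],  # a weak responder
]

n_q = len(responses[0])
weights = [1.0] * len(responses)  # start by trusting everyone equally

for _ in range(10):
    # Pick each question's answer by weighted vote...
    truths = []
    for q in range(n_q):
        votes = Counter()
        for p, resp in enumerate(responses):
            votes[resp[q]] += weights[p]
        truths.append(votes.most_common(1)[0][0])
    # ...then set each participant's weight to their agreement rate
    # with the current answer key, and repeat.
    weights = [
        sum(resp[q] == truths[q] for q in range(n_q)) / n_q
        for resp in responses
    ]

print(truths)   # inferred answers
print(weights)  # inferred abilities
```

Note that this toy version inherits the same failure mode discussed above: if most responders share the same wrong answer, the weighted vote confidently infers the wrong "truth".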
8 comments

Posting some thoughts I wrote up when first engaging with the question for 10-15 minutes.

The question is phrased as: How Can People Evaluate Complex Questions Consistently? I will be reading a moderate amount into this exact phrasing - specifically, that it specifies a project whose primary aim is increasing the consistency of answers to questions.

The project strikes me as misguided. It seems to me definitely the case that consistency is an indicator of accuracy: if your "process" is reliably picking out a fixed feature of the world, then this process will give roughly the same answers when applied over time. Conversely, if I keep getting back disparate answers, then likely whatever answering process is being executed isn't picking up a consistent feature of reality.

1)  I have a measuring tape and I measure my height. Each time I measure myself, my answer falls within a centimeter range. Likely I'm measuring a real thing in the world with a process that reliably detects it. We know how my brain and my height get entangled, etc.
2) We ask different philosophers about the ethics of euthanasia. They give widely varying answers for widely varying reasons. We might grant that there exists one true answer here, but the philosophers are not all using reliable processes for accessing that true answer. Perhaps some are, but clearly not all are, since they're not converging, which makes it hard to trust any of them.

Under my picture, it really is accuracy that we care about almost all of the time. Consistency/precision is an indicator of accuracy, and lack of consistency is suggestive of lack of accuracy. If you are well entangled with a fixed thing, you should get a fixed answer. Yet, having a fixed answer is not sufficient to guarantee that you are entangled with the fixed thing of interest. ("Thing" is very broad here and includes abstract things like the output of some fixed computation, e.g. morality.)

The real solution/question, then, is how to increase accuracy, i.e. how to increase your entanglement with reality. Trying to increase consistency separately from accuracy (even at the expense of it!) is mixing up an indicator and symptom with the thing which really matters: whether you're actually determining how reality is.

It does seem we want a consistent process for sentencing and maybe pricing (but that's not so much about truth as it is about "fairness" and planning, where we fear that differing amounts (e.g. sentence duration) are not sourced in legitimate differences between cases). But even this could be cast in the light of accuracy too: suppose there is some "true, correct, fair" sentence for a given crime; then we want a process that actually gets that answer. If the process actually works, it will consistently return that answer, which is a fixed aspect of the world. Or we might just choose the thing we want to be entangled with (our system of laws) to be a more predictable/easily-computable one.

I've gotten a little rambly, so let me focus again. I think consistency and precision are important indicators to pay attention to when assessing truth-seeking processes, but when it comes to making improvements, the question should be "how do I increase accuracy/entanglement?" not "how do I increase consistency?" Increasing accuracy is the legitimate method by which you increase consistency. Attempting to increase consistency rather than accuracy is likely a recipe for making accuracy worse, because you're now focusing on the wrong thing.

jimrandomh correctly pointed out to me that precision can have its own value for various kinds of comparison. I think he's right. If A and B are biased estimators of 'a' and 'b' whose bias is consistent (lowering accuracy) but whose precision is high, then I can make comparisons between A/a and B/b over time and against each other in ways I couldn't if the estimators were less biased but noisier.

Still, though, the key here is that the estimator is tracking a real, fixed thing.
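jimrandomh's point can be shown with a toy example (all numbers invented): when two measurements share a constant bias, the bias cancels in comparisons, so a biased-but-precise instrument can rank things more reliably than an unbiased-but-noisy one.

```python
import random

random.seed(2)

a_true, b_true = 170.0, 180.0   # the real quantities (true difference is 10)
shared_bias = 5.0               # both readings come out consistently high

def biased_precise(true_value):
    # High bias, low noise: inaccurate in the "trueness" sense, but precise.
    return true_value + shared_bias + random.gauss(0, 0.1)

def unbiased_noisy(true_value):
    # No bias, high noise: true on average, but imprecise.
    return true_value + random.gauss(0, 8.0)

# The shared bias cancels in the difference, so the comparison is nearly exact.
diff_precise = biased_precise(b_true) - biased_precise(a_true)

# A single noisy comparison can easily misjudge the gap or even the ordering.
diff_noisy = unbiased_noisy(b_true) - unbiased_noisy(a_true)

print(round(diff_precise, 1))  # close to the true difference of 10
print(round(diff_noisy, 1))
```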

If I were to try to improve my estimator, I'd look at the process it implements as a whole and try to improve that, rather than just trying to make the answer come out the same each time.

The Cambridge Handbook of Expertise and Expert Performance probably has some useful info.

What's the motivation? In what case is lower accuracy for higher consistency a reasonable trade off? Especially consistency over time sounds like something that would discourage updating on new evidence.

Some examples are where people care more about fairness, such as criminal sentencing and enterprise software pricing.

However, you're right that implicit in the question was "without new information appearing" - although you'd want the answer to update the same way every time the same new information appeared.

If every study on depression used its own metric for depression, one optimal for that specific study, it would be hard to learn from the studies and aggregate information across them. It's much better to have a consistent metric.

Consistent measurements allow reacting to how a metric changes over time which is often very useful for evaluating interventions.

This is true, but it doesn't fit well with the given example of "When will [country] develop the nuclear bomb?". The problem isn't that people can't agree what "nuclear bomb" means or who already has them. The problem is that people are working from different priors and extrapolating them in different ways.