Lizardmen are Not Constant - An Introductory Primer to Thinking about Survey Data

DanielW

The quality of a survey is best judged not by its size, scope, or prominence, but by how much attention is given to dealing with the many important problems that can arise.
-Fritz Scheuren, "What is a Survey?" American Statistical Association, 2004

First a note on scope: this is a brief discussion meant to--hopefully--assist readers in thinking more clearly about how to look at survey data. I will not, however, enumerate all of the issues and considerations that should go into evaluating surveys. At the end, I include links to some freely available guides for survey research and best practices, which I would recommend to anyone who has a greater interest in survey data. Such publications are largely aimed at researchers conducting surveys, but the guidelines provide strong reference points for understanding the things that should go into surveys.

I would be remiss not to acknowledge the initial impetus for this 'primer' is comments that seem to apply the 'Lizardman constant'. Scott Alexander's own 2013 essay on the topic looks at examples from public opinion surveys ('polls') and draws an (almost) entirely correct conclusion (emphasis added): "When we’re talking about very unpopular beliefs, polls can only give a weak signal. Any possible source of noise – jokesters, cognitive biases,^[1] or deliberate misbehavior – can easily overwhelm the signal. Therefore, polls that rely on detecting very weak signals should be taken with a grain of salt."

There seem, however, to be issues, as the catchy jargon and title "The Lizardman Constant is 4%" seem to be taken by some readers of Scott Alexander (I do not know whether or not he would endorse the view) to mean "badness is in pretty much every survey at nontrivial percentages" as "[a] constant is always present." At a foundational level, this--I fear--is a lazy, unhelpful way of thinking about survey data. It is also quite different from the attitude Scott advocated in his essay: Scott's conclusion is focused on 'polls'^[2] looking at "very unpopular beliefs" and taking results "with a grain of salt," not (as is sometimes done) dismissing results that fall below a 4% threshold as at core unreliable.

I fear this simply leads to a lazy, uninformative way of viewing survey results that is likely to promote biases. If there are just two things I hope you take away, they are these:

There is no hard and fast rule for judging surveys: surveys need to be assessed individually on the basis of their nature and purpose.
Most of the threats surveys are vulnerable to are not constant: different types of surveys are vulnerable to different types of problems.

Lizardmen are Not Constant - Not Even in Polling

Let's address the claim that the lizardman constant is a constant. The problem Scott Alexander's essay addresses is one that in academic literature is more often referred to as "bogus respondents" or "spurious response bias", which is to say that a survey may have responses that are not genuine and these may bias results. Some surveys and results are very vulnerable to this kind of error; in other cases the risk is negligible.

To illustrate what this looks like, let's imagine that in the real world 0.5% of people think the earth is flat. We post a public (and as such non-probabilistic) online poll soliciting responses to the question "Is the Earth round? (Y/N)" and get 1,000 responses; 40, or 4% (95% CI: 3.0-5.0%), say the Earth is flat. Excluding other biases, we might imagine that if we could read the minds of the respondents, we would observe something like this, with the 'bogus' responses highlighted in red and the genuine responses in green.

The first thing to note in this case of a non-probabilistic opt-in poll (the kind of poll shown in the illustration) is that bogus responses are not randomly distributed. Bogus responses are much more likely to be false positive answers than false negative ones; if given a series of choices, bogus respondents are more likely to pick the first choice; and, interestingly enough, on surveys that include demographic data they also tend to self-identify as Hispanic or Latino.^[3] This is important because it means we cannot just subtract out some constant value: positive results are likely to be significantly biased by bogus respondents.
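A rough way to see why no constant can simply be subtracted is to write the observed share as a mix of genuine and bogus answers. The sketch below is only an illustration: the function name, the 5% bogus share, and the 70% rate at which bogus respondents pick the rare answer are all assumptions chosen to land near the 4% in the example, not figures from any survey.

```python
def observed_share(true_rate, bogus_share, bogus_yes_rate):
    """Expected share of respondents giving the answer of interest.

    true_rate:      share of genuine respondents who really hold the belief
    bogus_share:    share of all respondents who answer without regard to belief
    bogus_yes_rate: share of bogus respondents who pick the answer of interest
    """
    genuine = (1 - bogus_share) * true_rate
    bogus = bogus_share * bogus_yes_rate
    return genuine + bogus


# Hypothetical opt-in poll: 0.5% true believers, 5% bogus respondents,
# 70% of whom pick the rare answer (all three numbers are assumptions).
rare = observed_share(0.005, 0.05, 0.70)

# The same bogus respondents, but a belief held by half the population.
common = observed_share(0.50, 0.05, 0.70)

# The same bogus share, but bogus answers split evenly between options.
rare_even = observed_share(0.005, 0.05, 0.50)

print(f"rare belief, true 0.5%:        observed {rare:.1%}")
print(f"common belief, true 50%:       observed {common:.1%}")
print(f"rare belief, even bogus split: observed {rare_even:.1%}")
```

Under these assumed numbers the rare belief is reported at roughly eight times its true rate, while the common belief moves by about one percentage point, and changing only how bogus respondents answer shifts the rare estimate again. The size of the distortion depends on the survey, the question, and the answer, which is why it cannot be treated as a fixed offset.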

Let's say we run the survey again, except this time we take a probabilistic sample and we call, say, 2,000 randomly selected addresses and get 900 responses that might now look something like:

You can see trivially how the bogus respondent problem is reduced but remains substantial: our estimate this time would be 2.1% (with a 95% CI of 1.3%-3.3%). Because the number of bots with landlines registered to addresses is effectively zero, we are no longer getting bogus bot responses. However, some people may still give answers that differ from their actual beliefs: to give a couple of examples, some people who are annoyed at having their dinner interrupted by a pollster may give bogus answers just for the hell of it, and others might mishear the question. Also, there are still certain systematic biases which mean we are unlikely to be able to assume the bogus answers are randomly distributed (e.g., respondents may try to give the answer they think the pollster wants).
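For readers who want to reproduce interval estimates like the ones quoted above, here is a minimal sketch using the Wilson score interval. The choice of method is an assumption (the text does not say which interval was used), and 19 out of 900 is an assumed count consistent with the 2.1% figure, so the output may differ slightly from the quoted ranges.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Counts: 40/1000 is from the opt-in example; 19/900 is an assumed count
# that matches the 2.1% estimate in the address-based example.
for label, k, n in [("opt-in poll", 40, 1000), ("address sample", 19, 900)]:
    lo, hi = wilson_interval(k, n)
    print(f"{label}: {k}/{n} = {k / n:.1%}, 95% CI {lo:.1%} to {hi:.1%}")
```

With these inputs the intervals come out at about 3.0% to 5.4% and about 1.4% to 3.3%. An interval of this kind only describes sampling noise: it says nothing about bogus responses, so both ranges sit well above the assumed true rate of 0.5%.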

Additionally, survey length, what questions are asked, how they are asked, incentive structures, and other factors can all influence the rate and characteristics of bogus respondents.

One might still think that even though rates may vary, the bogus respondents themselves are always an issue. This is not true. In practice, for example, probabilistic panel surveys generally observe very low rates of bogus responses, approximately 0% (depending on the exact survey methods and coding).^[4] In addition, most major panel surveys also include various controls and cleaning to minimize various forms of bias. Panel surveys may go further and match respondents against externally validated data. Imagine, for example, a study that looks at health consequences for patients receiving care for the flu. It recruits patients across a set of hospitals using diagnosis data and, at regular intervals, calls the patients and has them discuss any health issues with physicians; these are assessed alongside their medical records, which are collected alongside a standard demographic panel. What would you expect the rate of bogus respondents to be? I think most people would intuitively agree it is likely near zero: people have a motive to be honest when their health is at stake, and responses are verified against medical records, which would very nearly eliminate bad actors. However, does that mean you can trust the conclusions of probabilistic panel surveys on their face? No! It just means you don't have to worry about 'lizardmen' or 'bogus respondents' at the same rates--there are other concerns which you should have when assessing such a survey.

What It Comes To - Thinking About Data

Not just for survey data, but for any data you are looking at, one should begin by asking: what is the purpose, and how was the data collected or what does it represent?

Looking for the Purpose - Initial Considerations:

For reviewing surveys, the purpose can be understood as two considerations: (1) what was the purpose behind the survey and (2) what are the results purporting to show. The way a survey is subsequently conducted should depend in large part on these: how you conduct a study, and which methods are valid, depends heavily on what you are trying to study and what you are trying to show.

Some purposes should also make one inherently suspicious of a survey. An obvious example is when there are clear motives that are likely to skew results: for example, blind taste tests run by Pepsi's marketing division purporting to show a preference are likely to have some bias. Just because the survey designer is biased and has a motive to find a particular result doesn't mean, inherently, that the survey results are wrong or even biased, but it does mean one should be especially skeptical of areas of bias that might have weighted the results in the author's favor.

Other purposes are inherently prone to producing biased results. To risk putting myself in more controversial waters, a study purporting to be "looking to find surprising correlations in areas" should immediately raise suspicions that reported correlations are the result of "data dredging" or "p-hacking." Without delving into the information theory side of things,^[5] if you take enough data across a broad enough dataset, you should expect to find some things that are correlated despite having no real relationship. We commonly refer to this as 'spurious correlation.'

[Figure: chart from Tyler Vigen's Spurious Correlations showing that the popularity of the first name Monica correlates with the marriage rate in Nevada. Source: https://tylervigen.com/spurious/correlation/5917_popularity-of-the-first-name-monica_correlates-with_the-marriage-rate-in-nevada]

Additionally, the more variables you are looking at, the greater the chance that some correlations are the result of random chance (this can be mitigated if you are using a probabilistic sample that is sufficiently large).
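The point about many variables can be illustrated with a small simulation. This is a sketch under stated assumptions: the variables are independent random noise by construction, the function names are made up for illustration, and 0.444 is roughly the cutoff at which a correlation between 20 observations would be called significant at the 5% level (two-sided).

```python
import itertools
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def count_chance_correlations(n_vars, n_obs, threshold, seed=0):
    """Count variable pairs whose |r| exceeds threshold in pure noise."""
    rng = random.Random(seed)
    # Every variable is independent Gaussian noise: no real relationships exist.
    data = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    pairs = list(itertools.combinations(range(n_vars), 2))
    hits = sum(
        1 for i, j in pairs if abs(pearson_r(data[i], data[j])) > threshold
    )
    return hits, len(pairs)


hits, n_pairs = count_chance_correlations(n_vars=40, n_obs=20, threshold=0.444)
print(f"{hits} of {n_pairs} pairs look 'significant' despite being pure noise")
```

With 40 unrelated variables there are 780 pairs to compare, so a 5% false positive rate would be expected to yield a few dozen apparent correlations. A study that goes looking across many variables and reports only the hits is reporting exactly this kind of result unless it corrects for the number of comparisons.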

A Means to an End - How Purpose Informs Methods and Notes on Instrumentalizing

Generally, methods should be looked at with a mind for what they are trying to show. For example, if a study is trying to support a qualitative examination of some common experiences by people in niche social groups, a non-probabilistic survey like a snowball survey may be perfectly functional. That is, you might take a Facebook group that is part of the subculture you are examining, look at members and friends of members, then friends of friends and so forth to derive a sample that is strongly, deliberately, biased towards the subculture you are trying to study.

However, if a study is trying to estimate the rate of membership in a subculture, using a non-probabilistic sample of this sort would be utterly inappropriate, as it would be certain to disproportionately elicit responses from the population you are trying to estimate. Generally, when reading the methods a study used, try to think about whether they make sense for what the study is looking for and what assumptions they rely on (hopefully, the authors are explicit about this). Statistical methods and checks can limit some forms of bias,^[6] but generally you want to be able to assume that the population you are sampling from is randomly distributed across the effects you are looking to study.^[7]
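To make the weighting point concrete, here is a minimal sketch of simple post-stratification. Everything in it is hypothetical: the two groups, their population and sample shares, the true rates, and the assumption in the second half that people with the trait are twice as likely to respond.

```python
def poststratified_estimate(group_rates, population_shares):
    """Weight each group's observed rate by its known population share."""
    return sum(population_shares[g] * group_rates[g] for g in population_shares)


def observed_rate(true_rate, response_ratio):
    """Rate seen among respondents when people with the trait are
    response_ratio times as likely to respond as people without it."""
    with_trait = true_rate * response_ratio
    return with_trait / (with_trait + (1 - true_rate))


# Hypothetical population: two equally sized groups with different true rates.
population_shares = {"young": 0.5, "old": 0.5}
true_rates = {"young": 0.20, "old": 0.10}
true_overall = poststratified_estimate(true_rates, population_shares)

# Case 1: the sample over-represents the young group, but within each group
# responding is unrelated to the trait being measured.
sample_shares = {"young": 0.8, "old": 0.2}
naive = sum(sample_shares[g] * true_rates[g] for g in sample_shares)
weighted = poststratified_estimate(true_rates, population_shares)

# Case 2: within each group, people with the trait are twice as likely to
# respond, so the rates observed in the sample are already distorted.
skewed_rates = {g: observed_rate(r, 2.0) for g, r in true_rates.items()}
weighted_skewed = poststratified_estimate(skewed_rates, population_shares)

print(f"true rate:                    {true_overall:.1%}")
print(f"naive estimate (case 1):      {naive:.1%}")
print(f"weighted estimate (case 1):   {weighted:.1%}")
print(f"weighted estimate (case 2):   {weighted_skewed:.1%}")
```

In the first case reweighting the groups recovers the true rate, because the over-representation is unrelated to what is being measured. In the second case the same weighting still overshoots, because selection depends on the trait itself and group weights cannot undo that.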

As mentioned, it is important to look at what the data actually represents and how well it matches what it is taken to represent. For example, let's say I want to study how normal political corruption is in an average person's political system. One might consider a poll question like: "On a scale of 1 to 5, how normal do you feel political corruption is in your political system?" This is asking about the person's perceptions of what I am trying to study, which I may be able to assume are correlated with the conclusion I want to make. Sometimes it might be sufficient to assume perceptions are representative, but things other than what we are studying might shape perceptions (e.g., if corruption is very normal in a society, people may see decreases in corruption as meaning corruption is low, while in a society where corruption is rare, a smaller increase may be perceived as a larger problem).

Instead, I might instrumentalize what I want to know in another way, for example, by asking 'how often in the last five years has a public official asked you for a favour/bribe for a service?'^[8] This is a more direct measure of a form of corruption, but it is also imperfect as there may be forms of corruption it doesn't capture. I may, therefore, want to ask questions like how likely a person thinks it is that politicians would accept bribes, how often they think judges or police accept bribes, or how often decisions are made on extralegal bases, etc., to develop a more complete picture (though for longer surveys it is harder to get robust, consistent responses).

In general terms, you should try to look at a question and try to think of other things that responses might represent, besides the effects the study is being used for, how likely that might be and what, if any, measures are in place to rule out those effects.

A General Note on Bias in Methods

Those familiar with survey data are likely familiar with various forms of sampling bias and response bias and may be curious why I have not spent much effort reviewing them all. There are far too many possible avenues of bias to list: potentially every decision made in designing a survey may introduce its own tailor-made host of ways to bias the results. Often the particular biases are described with technical names, but I find it more helpful to critically examine them from first principles.

Rather than going down a list ('did this study account for non-response bias? did this study account for attrition?' etc.), it can be easier and more helpful, in my experience, to focus on the first principles discussed.

Think about the methods themselves and what biases they might introduce. To reuse my example, I may not be worried about people lying in their answers to a robust longitudinal medical study, but maybe people whose symptoms get better stop responding to the survey, or there is significant attrition from patients dying, which introduces bias (i.e., attrition bias). Do the survey methods discuss which respondents drop out and how the sample changes? Are they recruiting new respondents, and what are they doing to make sure their methods are consistent? (If they aren't recruiting new respondents, then the population will skew the longer the study goes on.) You should expect a study to spend more time and effort dealing with the kinds of biases that its particular study design is most likely to face risks from.
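A small worked example shows how quickly this kind of attrition can skew a panel. The numbers are assumptions for illustration only: 30% of patients remain symptomatic throughout, symptomatic patients stay in the study at 90% per follow-up wave, and recovered patients stay at 70% per wave.

```python
def observed_symptomatic_share(waves, true_share, keep_symptomatic, keep_recovered):
    """Share of remaining respondents who are symptomatic after some waves,
    when the two groups drop out of the study at different rates."""
    symptomatic = true_share * keep_symptomatic**waves
    recovered = (1 - true_share) * keep_recovered**waves
    return symptomatic / (symptomatic + recovered)


# Assumed retention per wave: symptomatic patients keep answering more often
# than recovered ones. The true symptomatic share is held fixed at 30%.
for wave in range(5):
    share = observed_symptomatic_share(
        waves=wave, true_share=0.30, keep_symptomatic=0.90, keep_recovered=0.70
    )
    print(f"wave {wave}: observed symptomatic share {share:.1%}")
```

Nobody in this sketch gives a false answer, yet the observed share climbs from 30% toward one half within a few waves purely because of who keeps responding. This is the sort of effect a study's methods section should address when its design depends on repeated follow-up.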

Final Advice for Readers and Our Biases

Many of these recommendations are less than straightforward and prone to personal, qualitative judgement and bias. Further, for many readers the effort of rigorously reviewing a study's methodology and supplemental material (which in the case of some large, robust panel surveys can constitute hundreds of pages of guidelines, questions and control methods) is not exactly practical. I would urge caution, however, in allowing our biases to judge what we review, particularly with regard to the sniff test. As mentioned (and as Scott Alexander indicated with regard to lizardmen), a small effect is a good reason to view a result with more skepticism. Responding "this result is less than 4% so it should be discarded as within the Lizardman constant" is an unacceptable practice; responding "this result is fairly small so I would want to review whether it could be the result of some confounding effect or bias before I judge it" is good practice, even when we do not have time to review it. When we do not have time to review a study ourselves, I would suggest looking briefly through citations, checking whether journals have published comments or retractions, and even just noting the broad length of the methodology section (and online supplements) to gauge whether there seems to be sufficient scrutiny.

Still, while this is preferable to outright dismissal, one might be more likely to take for granted things that agree with us while indefinitely delaying judgement on results we find inconvenient. As good practice, if you find a result that is generally viewed as surprising in some way but agrees with you, that is when you should be the most skeptical. On the other hand, where a result is somewhat surprising but contradicts our biases, I would try to approach it with curiosity rather than abject skepticism of what is being discussed, particularly if it is published in a reputable publication. It is quite likely there is an explanation other than what is presented, but then one should wonder what that explanation is, whether the authors themselves thought of the possibilities you might consider, and whether they or others have addressed them.

------------------

A Short Selection of Public Resources, Papers and Examples on Survey Best Practices and Design:

American Association for Public Opinion Research, "Best Practices for Survey Research": https://aapor.org/wp-content/uploads/2023/06/Survey-Best-Practices.pdf

ASA's Proceedings of the Survey Research Methods Section: http://www.asasrms.org/

Podsakoff, et al. "Sources of method bias in social science research and recommendations on how to control it." Annual review of psychology 63, no. 1 (2012): 539-569. https://www2.psych.ubc.ca/~schaller/528Readings/Podsakoff2012.pdf

Pew Research Methodology: https://www.pewresearch.org/our-methods/ (and methodology research: https://www.pewresearch.org/topic/methodological-research/ )

Kennedy et al "Assessing the Risks to Online Polls from Bogus Respondents." Pew Research Center: https://www.pewresearch.org/methods/2020/02/18/assessing-the-risks-to-online-polls-from-bogus-respondents/

The Harvard University Program on Survey Research: https://psr.iq.harvard.edu/book/guides-survey-research

Dillman, D. A. (2000, June). Procedures for conducting government-sponsored establishment surveys: Comparisons of the total design method (TDM), a traditional cost-compensation model, and tailored design. Proceedings of American statistical association https://ww2.amstat.org/meetings/ices/2000/proceedings/S15.pdf

^[1] One critique I would have is that I am somewhat unclear on what Scott is including under "cognitive biases" here. Someone who truthfully answers a poll with a belief they derived from their cognitive biases should not be considered among the 'lizardmen': the purpose of polling them is to identify people's actual beliefs.
^[2] As an aside on terms, there isn't necessarily a hard and fast rule on when a survey is a 'poll.' However, polls generally refer to a class of surveys aimed at measuring snapshots of public opinion, which can be done by various means (such as probabilistic phone/address sampling, or non-probabilistic online sampling).
^[3] See e.g. Pew Research's discussion of their work on bogus respondents here: https://www.pewresearch.org/methods/2020/02/18/bogus-respondents-bias-poll-results-not-merely-add-noise/
^[4] E.g. "2% to 4% of opt-in poll respondents repeatedly gave answers that did not match the question asked. Throughout the report we refer to such answers as non sequiturs. There were a few such respondents in the address-recruited panel samples, but as share of the total their incidence rounds to 0%." https://www.pewresearch.org/methods/2020/02/18/answers-that-did-not-match-the-question-were-concentrated-in-opt-in-polls/
^[5] There is inherent difficulty in deriving conclusions from correlations when there are lots of potentially related variables involved.
^[6] There is a wealth of literature on working with various forms of regressions specifically for these problems.
^[7] If this holds, then even if some groups are overrepresented, as long as that overrepresentation is random with respect to the effects, larger population estimates can be derived by simple weighting. If it does not hold, such naive weighting doesn't work.
^[8] This is based on an actual question in prior rounds of the European Social Survey: https://ess.sikt.no/en/datafile/edee45f2-976b-4c8b-902d-b65dc003c92e?tab=1&elems=366f7e3d-65de-4482-b64c-9fb4b908352a

papetoast (2mo)

Related: How Your Survey Responder Lies

Aella didn't even mention Lizardmen's Constant once, but she made a bunch of twitter polls, which show a different percentage of liars depending on the seriousness of the question and the survey.

I think probably the immediate visibility and informality of the twitter poll gives people a jolt of delight to fuck with the results, as long as fucking with the results would be funny. 1+1 = 3 is hilarious when it’s in front of a crowd and everybody sees the stupid percentages being stupid, but it’s boring when it’s inside the solemn walls of a researcher’s guidedtrack url.
(3³ − 23) ÷ 2 = 3, however, is deadly serious. Nobody finds that funny, crowd or not. It’s the one question where the twitter poll and survey question collapsed to almost exactly the same % of correct answer.

There is also this showing Lizardmen's Constant = 10%, showing how much the "constant" can vary:

In a new study my collaborators and I focus on this hypothesis. We ask participants if this statement is true or false: “The Canadian Armed Forces have been secretly developing an elite army of genetically engineered, super intelligent, giant raccoons to invade nearby countries”.
We found that 10% of participants endorsed this statement (i.e., they selected “Definitely true” or “Probably true”) and that this was a strong predictor of endorsing six pre-existing conspiracy theories, including two that that directly contradicted each other.

DanielW (2mo)

Thank you for the references! Aella's case is interesting and a very good example of how design matters to how you should think about the problem, particularly her statement (emphasis added):

in general whenever I’ve asked variations on this set of questions, the number of correct votes roughly increases with the difficulty of the problem, before hitting a ceiling when the difficulty is too high, and slowly decreases with that difficulty.

In a lot of survey design the opposite is a much larger problem, particularly if you incentivize responses (e.g., by paying people for completing the survey). The more complex you make a question, the more often you will have lazy respondents just pick the first option instead of engaging with what is being asked. Because she is just posing a twitter poll with little incentive to answer, I imagine most lazy respondents simply do not read the question.

DanielW (2mo)

@Karl Krueger, thank you for the typo corrections and reacts! I deleted the "linked landlines" part since it was confusing and added nothing, but, for reference, I was referring to a practice in address-based sampling (ABS) where addresses are sometimes matched to landlines to make it easier to contact them for responses. This often leads to a particular sampling bias (which can sometimes be addressed by creating separate sets of addresses that can be matched and those that cannot). It is sort of a middle ground between randomly sampling addresses and landline surveys (randomly sampling numbers has a lot of issues).

Since surveys using landlines are generally more prone to bogus responses, I had referenced that in my example, but on reflection it is a needless nuance that doesn't change the point being made.

I will probably try to reword a few things pursuant to some other areas you marked tomorrow, time permitting.
