LESSWRONG
Probability & Statistics
Frontpage

[ Question ]

What is an appropriate sample size when surveying billions of data points?

by Blake
23rd Aug 2024

1 Answer

Blake

Aug 23, 2024

FWIW, here is the best survey of ideas on sample size I found: https://cadcommunity.pbworks.com/f/what%20sample%20size.pdf

gwern (1y ago)

Since you mention 'billions of data points', but you say your goal is 'how accessible the Internet is to people with disabilities' where your sample size should be more like in the hundreds to thousands, you may need to seriously think about what the purpose of your survey is and how it is used. Planning sample size is the least of your problems.
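To see why "hundreds to thousands" is the usual order of magnitude even when the population runs to billions, here is a minimal sketch of the textbook sample-size calculation for estimating a proportion, with the finite population correction. It assumes simple random sampling and a normal approximation; the function name `required_sample_size` and its parameters are illustrative, not something from the question.

```python
import math
from statistics import NormalDist


def required_sample_size(margin, confidence=0.95, p=0.5, population=None):
    """Sample size needed to estimate a proportion to within +/- margin.

    Assumes simple random sampling and a normal approximation.
    p=0.5 is the worst case (largest variance) when the true rate is unknown.
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    if population is not None:
        # Finite population correction: only matters when n0 is a
        # noticeable fraction of the population.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


print(required_sample_size(0.05))                            # 385
print(required_sample_size(0.05, population=1_000_000_000))  # 385
print(required_sample_size(0.05, population=1_000))          # 278
print(required_sample_size(0.03))                            # 1068
```

Under these assumptions, a billion-item population needs the same 385 items as an infinite one for a ±5% margin; the population size only starts to matter when it is small. None of this addresses whether the items you can actually sample are representative.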

It sounds like you think you can just take some dataset like Common Crawl and crunch numbers about 'the top million domains' and come up with a conclusion like 'X% of the Internet is unusable' and you just need to know how many domains to analyze and can turn the crank and see what pops out with p < 0.05. But that's not the case. For datasets like this, you will find many parameters to be "statistically significant" as you are doing near-population-level analysis, where your sampling error is tiny and all your error will be the (unknown and usually impossible to measure) systematic error & bias which doesn't go away (although Meng 2014 is an interesting discussion of asking how much systematic error goes away when you are sampling a large fraction of the entire population). At scale, all your results may tell you is something about the many serious flaws and biases in these sorts of Internet datasets - they may be all we have, but one shouldn't fool oneself into thinking that they are any good. (As Cohen put it, a burning desire for an answer doesn't mean that a given dataset or survey methodology will be able to provide it.)
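The point about sampling error versus systematic error can be made concrete with a rough sketch. It compares the standard error of an estimated proportion at different sample sizes against a fixed bias. The 5-percentage-point bias (`ASSUMED_BIAS`) is an entirely hypothetical number chosen for illustration, and treating bias as constant regardless of sample size is a simplification.

```python
import math


def standard_error(p, n):
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(p * (1 - p) / n)


def total_error(p, n, bias):
    """Root-mean-square error: sampling error and bias combined."""
    return math.sqrt(standard_error(p, n) ** 2 + bias ** 2)


# Hypothetical systematic error, e.g. from a crawl that over-represents
# certain kinds of sites. This value is an assumption, not a measurement.
ASSUMED_BIAS = 0.05

for n in (100, 1_000, 1_000_000, 1_000_000_000):
    se = standard_error(0.5, n)
    rmse = total_error(0.5, n, ASSUMED_BIAS)
    print(f"n={n:>13,}  sampling error={se:.6f}  total error={rmse:.6f}")
```

In this toy setup the sampling error shrinks with the square root of the sample size, but the total error flattens out at the assumed bias: going from a thousand items to a billion buys almost nothing if the dataset itself is skewed.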


I am surveying Internet web accessibility. We're hoping to draw big conclusions on how accessible the Internet is to people with disabilities. What is the appropriate sample size we should draw in order to start making our grand conclusions?

I went down a Google/Perplexity rabbit hole trying to figure out an appropriate answer. I am left with more questions than when I started, so I turned to the LW community.