Survey advice

KatjaGrace

Things I believe about making surveys, after making some surveys:

1. If you write a question that seems clear, there’s an unbelievably high chance that any given reader will misunderstand it. (Possibly this applies to things that aren’t survey questions also, but that’s a problem for another time.)
2. A better way to find out if your questions are clear is to sit down with one person at a time and ask them to take your survey while narrating the process: reading the questions aloud, telling you what they think each question is asking, explaining their thought process in answering it. If you do this repeatedly with different people until some are not confused at all, the questions are probably clear.
3. If you ask people very similar questions in different sounding ways, you can get very different answers (possibly related to the above, though that’s not obviously the main thing going on).
4. One specific case of that: for some large class of events, if you ask people how many years until a 10%, 50%, 90% chance of event X occurring, you will get an earlier distribution of times than if you ask the probability that X will happen in 10, 20, 50 years. (I’ve only tried this with AI related things, but my guess is that it at least generalizes to other low-probability-seeming things. Also, if you just ask about 10% on its own, it is consistently different from 10% alongside 50% and 90%.)
5. Given the complicated landscape of people’s beliefs about the world and proclivities to say certain things, there is a huge amount of scope for choosing questions to get answers that sound different to listeners (e.g. support a different side in a debate).
6. There is also scope for helping people think through a thing in a way that they would endorse, e.g. by asking a sequence of questions. This can also change what the answer sounds like, but seems ethical to me, whereas applications of 5 seem generally suss.
7. Often your respondent knows thing P and you want to know Q, and it is possible to infer something about Q from P. You then have a choice about which point in this inference chain to ask the person about. It seems helpful to notice this choice. For instance, if AI researchers know most about what AI research looks like, and you want to know whether human civilization will be imminently destroyed by renegade AI systems, you can ask about a) how fast AI appears to be progressing, b) when it will reach a certain performance bar, c) whether AI will cause something like human extinction. In the 2016 survey, we asked all of these.
8. Given the choice, if you are hoping to use the data as information, it is often good to ask people about things they know about. In 7, this points to aiming your question early in the reasoning chain, then doing the inference yourself.
9. Interest in surveys doesn’t seem very related to whether a survey is a good source of information on the topic surveyed on. One of the strongest findings of the 2016 survey IMO was that surveys like that are unlikely to be a reliable guide to the future.
10. This makes sense because surveys fulfill other purposes. Surveys are great if you want to know what people think about X, rather than what is true about X. Knowing what people think is often the important question. It can be good for legitimizing a view, or letting a group of people have common knowledge about what they think so they can start to act on it, including getting out of bad equilibria where everyone nominally supports claim P because they think others will judge them if not.
11. If you are surveying people with the intention of claiming a thing, it is helpful to think ahead about what you want to claim, and make sure you ask questions that will let you claim that, in a simple way. For instance, it is better to be able to say ‘80% of a random sample of shoppers at Tesco said that they like tomato more than beans’ than to say ‘80% of a sample of shoppers who were mostly at Tesco but also at Aldi (see footnote for complicated shopper selection process) say that they prefer tomato to peas, or (using a separate subset of shoppers) prefer peas to beans, from which we can infer that probably about 80% of shoppers in general, or more, prefer tomato to beans’. You want to be able to describe the setup and question in a way that is simple enough that the listener understands what happened, and sees the significance of the finding.
12. If you are running a survey multiple times, and you want informative answers about whether there were differences in views between those times, you should probably run exactly the same survey and not change the questions even a tiny bit unless there is a very strong reason to. This follows from 3.
13. Qualtrics costs thousands of dollars to use, and won’t let you sign up for an account or even know how much it might cost unless you book a meeting to talk to someone to sell it to you. Guidedtrack.com seems pretty nice, but I might not have been trying to do such complicated things there.
14. Running surveys seems underrated as an activity.
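
To make the comparison between the two framings of timing questions concrete (years until a 10%, 50%, 90% chance, versus probability within 10, 20, 50 years), here is a minimal sketch of one way to put both sets of answers on the same footing. Everything in it is an assumption for illustration: the numbers are made up rather than taken from any survey, `interpolate_cdf` is a hypothetical helper, and linear interpolation between the three elicited points is just one simple choice.

```python
from bisect import bisect_right


def interpolate_cdf(points, year):
    """Linearly interpolate P(event by `year`) from (year, probability) points.

    `points` must be sorted by year. Outside the elicited range the nearest
    elicited probability is returned (a crude clamp, chosen for simplicity).
    """
    years = [y for y, _ in points]
    probs = [p for _, p in points]
    if year <= years[0]:
        return probs[0]
    if year >= years[-1]:
        return probs[-1]
    i = bisect_right(years, year)  # index of the first elicited year > `year`
    y0, y1 = years[i - 1], years[i]
    p0, p1 = probs[i - 1], probs[i]
    return p0 + (p1 - p0) * (year - y0) / (y1 - y0)


# Hypothetical answers, invented for illustration only (not survey data).
# Fixed-probability framing: "how many years until a 10% / 50% / 90% chance?"
fixed_probability_answers = [(10, 0.10), (25, 0.50), (60, 0.90)]
# Fixed-years framing: "what is the probability within 10 / 20 / 50 years?"
fixed_years_answers = [(10, 0.05), (20, 0.20), (50, 0.60)]

horizon = 20  # compare both framings at the same horizon, in years
p_fixed_probability = interpolate_cdf(fixed_probability_answers, horizon)
p_fixed_years = interpolate_cdf(fixed_years_answers, horizon)
gap = p_fixed_probability - p_fixed_years

print(f"P(by year {horizon}), fixed-probability framing: {p_fixed_probability:.2f}")
print(f"P(by year {horizon}), fixed-years framing:       {p_fixed_years:.2f}")
print(f"gap: {gap:.2f}")
```

A positive gap at a given horizon means the fixed-probability framing implied an earlier distribution there, which is the direction described in the list.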

There is also scope for helping people think through a thing in a way that they would endorse, e.g. by asking a sequence of questions.

As aptly demonstrated:

I don't think this is a good illustration of point 6. The video shows a string of manipulative leading questions, falling short of the "in a way that they would endorse" criterion.

When people understand that a string of questions is designed to strong-arm them into a given position, they rarely endorse it. It seems to me that point 6 is more about benevolent and honest uses of leading questions.

Admittedly, I am assuming that "in a way that they would endorse" means "such that if people understood the intent that went into writing the string of questions in that way, they would approve of the process".

I feel 4. can be explained by humans not having probability distributions on future events but something more like infradistributions/imprecise distributions. This is a symptom of a larger problem of Bayesian dogmatism that has taken hold of some parts of LW/rationalists.

Let me explain how this works:

To recall: an imprecise distribution is the convex (closed) hull of a collection of probability distributions $\{p_i\}_{i \in I}$. In other words, it combines 'Knightian' uncertainty with probabilistic uncertainty.

If you ask people for a 10%, 50%, 90% chance of AI happening, you are implicitly asking for the worst case: i.e. there is at least one probability distribution $p_i$ such that $p_i(\text{AGI}) = 10\%, 50\%, 90\%$.

On the other hand, when you ask for the probability of a certain event happening in 10, 20, 50 years, you are asking for the dual 'best case' scenario, i.e. for ALL probability distributions $p_i$, $i \in I$, what the probabilities $p_i(\text{AGI in 10y})$, $p_i(\text{AGI in 20y})$, $p_i(\text{AGI in 50y})$ are, and taking the minimum.
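
As a rough sketch of this reading (not a claim about how anyone actually reasons), take a small family of candidate distributions and compare the two questions. The family below is an assumption made purely for illustration: three exponential arrival-time distributions with made-up hazard rates. All function names are hypothetical. Because the probability of an event is linear in the distribution, its minimum and maximum over the convex hull are attained at the original distributions, so it is enough to loop over the family itself.

```python
import math

# Hypothetical family {p_i}: exponential arrival times with made-up hazard rates.
rates_per_year = [0.01, 0.03, 0.07]


def prob_by(rate, years):
    """P(event within `years`) under an exponential distribution with this rate."""
    return 1.0 - math.exp(-rate * years)


def years_until(rate, probability):
    """Years until the cumulative probability reaches `probability`."""
    return -math.log(1.0 - probability) / rate


def earliest_year(probability):
    """'Exists' reading: the first year at which SOME p_i reaches `probability`."""
    return min(years_until(r, probability) for r in rates_per_year)


def lower_probability(years):
    """'For all' reading: the minimum over p_i of P(event within `years`)."""
    return min(prob_by(r, years) for r in rates_per_year)


def upper_probability(years):
    """The maximum over p_i of P(event within `years`), for comparison."""
    return max(prob_by(r, years) for r in rates_per_year)


t50 = earliest_year(0.5)
print(f"years until 50% under the 'exists' reading: {t50:.1f}")
print(f"lower probability at that year:             {lower_probability(t50):.3f}")
print(f"upper probability at that year:             {upper_probability(t50):.3f}")
```

With the same family, the year reported for "50%" comes from the fastest distribution, while the probability reported for that same year comes from the slowest, so the fixed-probability framing looks earlier.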

This does seem to be a useful insight, though I don't think it's anywhere near so precise as that.

Personally, the Knightian uncertainty completely dominates my timeline estimates. If someone asks for which year the cumulative probability reaches some threshold, then firstly that sounds like a confusion of terms, and secondly I have or can generate (as described) a whole bunch of probability distributions without anything usable as weightings attached for each. Any answer I give is going to be pointless and subject to the whims of whatever arbitrary weightings I assign in the moment, which is likely to be influenced by the precise wording of the question and probably what I ate for breakfast.

It's not going to be the worst case - that's something like "I am already a simulation within a superintelligent AGI and any fact of the matter about when it happened is completely meaningless due to not occurring in my subjective universe at all". It's not going to be the best case either - that's something like "AGI is not something that humans can create, for reasons we don't yet know". Note that both of these are based on uncertainties: hypotheses that cannot be assigned any useful probability since there is no precedent nor any current evidence for or against them.

It's going to be something in the interior, but where exactly in the interior will be arbitrary, and asking the question a different way will likely shift where.

Curated. I'm interested in this both from the perspective of personal epistemics and group epistemics. Surveys are a tool for figuring out things about the world, and they tend to also be a way to get-on-the-same-page about how the world looks.

Thanks Katja for sharing a bunch of lived experience on how to execute surveys well. :)

Guidedtrack.com

is now https://www.guidedtrack.com/

Some of these are strikingly similar to advice for how to interview users when designing user-friendly software.
I guess it makes sense that there's some crossover.

I like it!

This is not my research area but this list looks really relevant. Thanks for posting it!

For those who do not know: survey methods and survey analysis is a field of academic research in itself. There are people who specialise in this topic, so we can learn from them or pay them to consult on the design of our surveys.

E.g. SMAG (Survey Methods and Analysis Group) at the University of Manchester and NCRM (National Centre for Research Methods) are two I know of in the UK.

There is a "journal of survey statistics and methodology" and "of social research methodology".

And undergraduate textbooks such as

Lohr's sampling design and analysis
Oppenheim's questionnaire design
Czaja & Blair's designing surveys
Fowler's survey research methods

Good post.

Interest in surveys doesn’t seem very related to whether a survey is a good source of information on the topic surveyed on. One of the strongest findings of the 2016 survey IMO was that surveys like that are unlikely to be a reliable guide to the future.

Can you say more?

How are these two sentences related?
The first sentence seems plausible, but why do you say it?
The second sentence seems plausible, but why do you say it? (Is it just because many responses were internally inconsistent and/or unreasonable?)

Second sentence:

People say very different things depending on framing, so responses to any particularly-framed question are presumably not accurate, though I'd still take them as some evidence.
People say very different things from one another, so any particular person is highly unlikely to be accurate. An aggregate might still be good, but e.g. if people say such different things that three-quarters of them have to be totally wrong, then I don't think it's that much more likely that the last quarter is about right than that the answer is something almost nobody said.

First sentence:

In spite of the above, and the prior low probability of this being a reliable guide to AGI timelines, our paper was the 16th most discussed paper in the world. On the other hand, something like Ajeya's timelines report (or even AI Impacts' cruder timelines botec earlier) seems more informative, and gets way less attention. (I didn't mean 'within the class of surveys, interest doesn't track informativeness much', though that might be true. I meant 'people seem to have substantial interest in surveys beyond what is explained by them being informative about e.g. AI timelines'.)

Point 10 should be 1 and probably a variant should be 2. And they're not even all that good at finding out what people think. They can sometimes find out how people feel or what their current reaction is.

To point 14, it depends on who's doing the rating. I'll point out that survey design and interpretation is a pretty big business - there's a reason Qualtrics charges so much (and that SAP paid $8B for the company), and a related reason that competitors universally suck - the actual presentation is the easy (and non-profitable) part. The design and analytics are difficult and command a lot of revenue.