The Expert Survey on Progress in AI (ESPAI) is a large survey of AI researchers that I’ve led four times: in 2016, and then annually in 2022, 2023, and 2024 (results coming soon!).
Each time so far it’s had substantial attention—the first one was the 16th ‘most discussed’ paper in the world in 2017.
Various misunderstandings about it have proliferated, leading to its robustness and credibility being underestimated (perhaps partly because the methodology was insufficiently described; the 2022 survey blog post was terse). To avoid these misconceptions muddying interpretation of the 2024 survey results, I’ll answer key questions about the survey methodology here.
This covers the main concerns I know about. If you think there’s an important one I’ve missed, please tell me—in comments or by email (katja@aiimpacts.org).
This post throughout discusses the 2023 survey, but the other surveys are very similar. The biggest differences are that a few questions have been added over time, and we expanded from inviting respondents at two publication venues to six in 2023. The process for contacting respondents (e.g. finding their email addresses) has also seen many minor variations.
Summary of some (but not all) important questions addressed in this post.
To my knowledge, the methodology is substantially stronger than is typical for surveys of AI researcher opinion. For comparison, O’Donovan et al. was reported on by Nature Briefing this year, and while 53% larger, its methodology appeared weaker in most relevant ways: its response rate was 4% next to 2023 ESPAI’s 15%; it doesn’t appear to report efforts to reduce or measure non-response bias; the survey population was selected by the authors and not transparent; and 20% of completed surveys were apparently excluded (see here for a fuller comparison of the respective methodologies).
Some particular strengths of the ESPAI methodology:
Criticisms of the ESPAI methodology seem to mostly be the result of basic misunderstandings, as I’ll discuss below.
No.
Respondents answered nearly all the normal questions they saw (excluding demographics, free response, and conditionally-asked questions). Each of these questions was answered by, on average, 96% of those who saw it, and even the most-skipped question was answered by 90% (a question about the number of years until the occupation of “AI researcher” would be automatable).1
The reason it could look like respondents skipped a lot of questions is that, in order to ask more distinct questions, we intentionally directed only a fraction (5-50%) of respondents to most questions. We selected those respondents randomly, so the smaller pool answering each question is an unbiased sample of the larger population.
Here’s the map of paths through the survey and how many people were given which questions in 2023. Respondents start at the introduction then randomly receive one of the questions or sub-questions at each stage. The randomization shown below is uncorrelated, except that respondents get either the “fixed-years” or “fixed-probabilities” framing throughout for questions that use those, to reduce confusion.2
Map of randomization of question blocks in the 2023 survey. Respondents are allocated randomly to one place in each horizontal set of blocks.
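For readers who find pseudocode clearer than the diagram, here is a minimal sketch of this kind of allocation scheme in Python. The block names and the uniform choice within each stage are illustrative only (the real survey assigns different fractions of respondents, roughly 5-50%, to different blocks); the point is that each stage is randomized independently, except that the fixed-years versus fixed-probabilities framing is drawn once per respondent and reused throughout.

```python
import random

# Illustrative stage structure only; the real survey uses the blocks shown in
# the map above, with unequal allocation probabilities (roughly 5-50% per block).
STAGE_BLOCKS = [
    ["HLMI timeline", "full automation of labor timeline"],
    ["occupation automation", "intelligence explosion", "none"],
    ["extinction variant 1", "extinction variant 2", "extinction variant 3"],
]

def allocate(respondent_seed: int) -> dict:
    """Randomly assign one block per stage, keeping one framing throughout."""
    rng = random.Random(respondent_seed)
    # The framing is drawn once and applied to every question that uses it, so
    # a respondent never mixes fixed-years and fixed-probabilities questions.
    framing = rng.choice(["fixed-years", "fixed-probabilities"])
    assignment = {"framing": framing}
    for stage, blocks in enumerate(STAGE_BLOCKS, start=1):
        # Each stage is randomized independently of the others (uncorrelated).
        assignment[f"stage_{stage}"] = rng.choice(blocks)
    return assignment

print(allocate(respondent_seed=42))
```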
No. There were two types of question that mentioned human extinction risk, and roughly everyone who answered any question—97% and 95% of respondents respectively, thousands of people—answered some version of each.
Confusion on this point likely arose because there are three different versions of one question type—so at a glance you may notice that only about a quarter of respondents answered a question about existential risk to humanity within one hundred years, without seeing that the other three quarters of respondents answered one of two other very similar questions3. (As discussed in the previous section, respondents were allocated randomly to question variants.) All three questions got a median of 5% or 10% in 2023.
This system of randomly assigning question variants lets us check the robustness of views across variations on questions (such as ‘...within a hundred years’), while still being able to infer that the median view across the population puts the general risk at 5% or more (if the chance is 5% within 100 years, it is presumably at least 5% for all time)4.
In addition, every respondent was assigned another similar question about the chance of ‘extremely bad outcomes (e.g. human extinction)’, providing another check on their views.
So the real situation is that every respondent who completed the survey was asked about outcomes similar to extinction in two different ways, across four different precise questions. These four question variations all got similar answers—in 2023, median 5% chance of something like human extinction (one question got 10%). So we can say that across thousands of researchers, the median view puts at least a 5% chance on extinction or similar from advanced AI, and this finding is robust across question variants.
No. Of people who answered at least one question in the 2023 survey, 95% reached the final demographics question at the end5. And, as discussed above, respondents answered nearly all of the (normal) questions they saw.
So there is barely any room for bias from people dropping out. Consider the extinction questions: even in the most pessimistic case, where exactly the least concerned 5% of people failed to reach the end, and the least concerned 5% of those who did get there skipped the second extinction-related question (everyone answered the first one), the true medians would sit at what currently look like roughly the 47.5th and 45th percentiles for the two question sets, which are still both 5%.6
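To make that bound concrete, here is a minimal worked sketch in Python, using only the drop-out fractions from above rather than the actual response data. It computes which percentile of the observed answers the true population median would correspond to, in the worst case where everyone missing is among the least concerned; the exact figures depend on how you count, but they land near the rough 47.5th and 45th percentiles quoted above.

```python
def true_median_percentile(missing_fraction: float) -> float:
    """Worst case: all missing respondents sit below the true median.
    Returns the percentile of the *observed* answers that the true
    population median would correspond to."""
    return (0.5 - missing_fraction) / (1.0 - missing_fraction) * 100

# First extinction question: ~5% of people who answered anything didn't reach the end.
print(true_median_percentile(0.05))             # ~47.4

# Second extinction question: ~5% dropped out, plus ~5% of the rest skipped it.
print(true_median_percentile(1 - 0.95 * 0.95))  # ~44.6
```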
On a side note: one version of this concern envisages respondents dropping out in disapproval at the survey questions focusing on topics like existential risk from AI, systematically biasing the remaining answers toward greater concern. This suggests a misunderstanding about the content of the survey. For instance, half of respondents were asked about everyday risks from AI, such as misinformation, inequality, and empowering authoritarian rulers or dangerous groups, before extinction risk from AI was even mentioned as an example (Q2 vs. Q3). The other question about existential risk appears at the end.
No. The 2023 survey was taken by 15% of those we contacted.7 This appears to be broadly typical or high for a survey such as this.
It seems hard to find clear evidence about typical response rates, because surveys can differ in so many ways. The best general answer we got was an analysis by Hamilton (2003), which found the median response rate across 199 surveys to be 26%. They also found that larger invitation lists tended to go with lower response rates—surveys sent to over 20,000 people, like ours, were expected to have a response rate in the range of 10%. And specialized populations (such as scientists) also commonly had lower response rates.
For another comparison, O’Donovan et al 2025 was a recent, similarly sized survey of AI researchers which used similar methods of recruiting and got a 4% response rate.
Probably not.
Some background on the potential issue here: the ESPAI generally reports substantial probabilities on existential risk to humanity from advanced AI (‘AI x-risk’)—the median probability of human extinction or similar has always been at least 5%8 (across different related questions, and years). The question here is whether these findings represent the views of the broad researcher population, or if they are caused by massive bias in who responds.
There are a lot of details in understanding why massive bias is unlikely, but in brief:
One form of non-response bias is item nonresponse, where respondents skip some questions or drop out of the survey. In this case, the concern would be that unconcerned respondents skip questions about risk, or drop out of the survey when they encounter such questions. But this can only be a tiny effect here—in 2023 ~95% of people who answered at least one question reached the end of the survey. (See section “Did lots of people drop out when they saw the questions…”). If respondents were leaving due to questions about (x-)risk, we would expect fewer respondents to have completed the survey.
This also suggests low unit non-response bias among unconcerned members of the sample: if people often decided not to participate because they recognized that the survey would include questions about AI x-risk, we would also expect more respondents to drop out when they encountered such questions (especially since most respondents should not know the topic before they enter the survey—see below). Since very few people drop out upon seeing the questions, it would be surprising if many people had dropped out earlier due to anticipating the question content.
We try to minimize the opportunity for unit non-response bias by writing directly to every researcher we can find who has published in six top AI venues, rather than having people share the survey, and by keeping the invitation vague: it avoids directly mentioning anything like risks at all, let alone extinction risk. For instance, the 2023 survey invitation describes the topic as “the future of AI progress”9.
So we expect most sample members are not aware that the survey includes questions about AI risks until after they open it.
2023 Survey invitation (though sent after this pre-announcement, which does mention my name and additional affiliations)
There is still an opportunity for non-response bias from some people deciding not to answer the survey after opening it and looking at questions. However only around 15% of people who look at questions leave without answering any, and these people can only see the first three pages of questions before the survey requires an answer to proceed. Only the third page mentions human extinction, likely after many such people have left. So the scale of plausible non-response bias here is small.
Even with a vague invitation, some respondents could still be responding to our listed affiliations connecting us with the AI Safety community, and some recognize us.10 This could be a source of bias. However, different logos and affiliations get similar response rates11, and it seems unlikely that very many people in a global survey have been recognizing us, especially in 2016 (when the survey had a somewhat higher response rate and the same median probability of extremely bad outcomes as in 2023)12. Presumably some people remember taking the survey in a previous year, but in 2023 we expanded the pool from two venues to six, reducing the fraction who might have seen it before, and got similar answers on existential risk (see p14).
As further confirmation that recognition of us or our affiliations is not driving the high existential risk numbers: recognition would presumably be stronger in some demographic groups than others (e.g. people who did their undergraduate study in the US rather than in Europe or Asia, and probably people in industry rather than academia), yet when we checked in 2023, all of these groups gave median existential risk numbers of at least 5%13.
Another possible route to recipients figuring out there will be questions about extinction risk is that we do link to past surveys in the invitation. However the linked documents (from 2022 or 2023) also do not foreground AI extinction risk, so this seems like a stretch.14
So it should be hard for most respondents to decide whether to respond based on the inclusion of existential risk questions.
A big concern seems to be that members of “the AI (Existential) Safety community”, i.e. those whose professional focus is reducing existential risk from AI, are more likely to participate in the survey. This is probably true—anecdotally, people in this community are often aware of the survey and enthusiastic about it, and a handful of people wrote to check that their safety-interested colleagues had received an invitation.
However this is unlikely to have a strong effect, since the academic AI Safety community is quite small compared to the number of respondents.
One way to roughly upper-bound the fraction of respondents from the AI Safety community is to note that they are very likely to have ‘a particular interest’ in the ‘social impacts of smarter-than-human machines’. However, when asked “How much thought have you given in the past to social impacts of smarter-than-human machines?” only 10.3% gave an answer that high.
As well as bias from concerned researchers being motivated to respond to the survey, at the other end of the spectrum there can be bias from researchers who particularly want to avoid participating, for reasons correlated with opinion. I know of a few instances of this, and a tiny informal poll suggested it could account for something like 10% of non-respondents15, though this seems unlikely, and even if it were so, it would have only a small effect on the results.
We have been discussing bias from people’s opinions affecting whether they want to participate. There could also be non-response bias from other factors that influence both opinion and the desire to participate. For instance, in 2023 we found that women participated at around 66% of the base rate, and generally expected less extreme positive or negative outcomes. This is a source of bias; however, since women were only around one in ten of the total population, the scale of potential error from it is limited.
We similarly measured variation in the responsiveness of some other demographic groups, and differences of opinion between these groups among those who did respond, which together give some heuristic evidence of small amounts of bias. Aside from gender, the main dimension where we noted a substantial difference in both response rate and opinion was people who did their undergraduate study in Asia. They were only 84% as likely as the base rate to respond, and in aggregate expected high-level machine intelligence earlier and gave higher median extinction or disempowerment numbers. This suggests an unbiased survey would put advanced AI sooner and judge it riskier. So while this is a source of bias, it runs in the opposite direction to the one that has prompted concern.
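One rough way to sanity-check how much differential response like this could move the headline numbers is to reweight respondents so that each demographic group counts in proportion to its share of the invited population, and recompute the median. The sketch below shows that check in Python with entirely made-up answers and weights; it is not the analysis from the paper (see the paper’s appendices for what we actually did), just an illustration of the mechanics.

```python
import numpy as np

# Made-up data: each respondent has an x-risk answer (%) and a reweighting
# factor (invited share of their group / respondent share of their group), so
# under-responding groups count for more.
answers = np.array([0.0, 1.0, 2.0, 5.0, 5.0, 5.0, 10.0, 20.0])
weights = np.array([1.0, 1.0, 1.0, 1.2, 1.5, 1.0, 1.5, 1.0])

def weighted_median(values: np.ndarray, w: np.ndarray) -> float:
    """Median where each value counts w times."""
    order = np.argsort(values)
    values, w = values[order], w[order]
    cumulative = np.cumsum(w)
    return float(values[np.searchsorted(cumulative, 0.5 * cumulative[-1])])

print("unweighted:", float(np.median(answers)))
print("reweighted:", weighted_median(answers, weights))
```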
We have seen various evidence that people engaged with AI safety do not make up a large fraction of the survey respondents. However there is another strong reason to think extra participation from people motivated by AI safety does not drive the headline 5% median, regardless of whether they are overrepresented. We can look at answers from a subset of people who are unlikely to be substantially drawn by AI x-risk concern: those who report not having thought much about the issue. (If someone has barely ever thought about a topic, it is unlikely to be important enough to them to be a major factor in their decision to spend a quarter of an hour participating in a survey.) Furthermore, this probably excludes most people who would know about the survey or authors already, and so potentially anticipate the topics.
We asked respondents, “How much thought have you given in the past to social impacts of smarter-than-human machines?” and gave them these options:
Looking at only respondents who answered ‘a little’ or ‘very little’—i.e. those who had at most discussed the topic a few times—the median probability of “human extinction or similarly permanent and severe disempowerment of the human species” from advanced AI (asked with or without further conditions) was 5%, the same as for the entire group. Thus we know that people who are highly concerned about risk from AI are not responsible for the median x-risk probability being at least 5%. Without them, the answer would be the same.
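This robustness check is simple to reproduce from the response data: restrict to the low-engagement respondents and recompute the median. Here is a minimal sketch in Python, with hypothetical file, column, and option names (the released data may label these differently).

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("espai_2023_responses.csv")

low_engagement = df["thought_about_social_impacts"].isin(["A little", "Very little"])
xrisk = "p_extinction_or_disempowerment"  # any of the related question variants

print("all respondents:     ", df[xrisk].median())
print("low engagement only: ", df.loc[low_engagement, xrisk].median())
```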
No, it is large.
In 2023 we wrote to around 20,000 researchers: everyone whose contact details we could find from six top AI publication venues (NeurIPS, ICML, ICLR, AAAI, IJCAI, and JMLR).16 We heard back from 2,778. As far as we could tell, it was the largest ever survey of AI researchers at the time. (It could be that this complaint was only made about the 2022 survey, which had 738 respondents, before we expanded the pool of invited authors from two publication venues, NeurIPS and ICML, to six, but I’d say that was also pretty large17.)
Unlikely: at a minimum, there is little incentive to please funders.
The story here would be that we, the people running the survey, might want results that support the views of our funders, in exchange for their funding. Then we might adjust the survey in subtle ways to get those answers.
I agree that where one gets funding is a reasonable concern in general, but I’d be surprised if it was relevant here. Some facts:
One criticism is that even AI experts have no valid technical basis for making predictions about the future of AI. This is not a criticism of the survey methodology, per se, but rather a concern that the results will be misinterpreted or taken too seriously.
I think there are two reasons it is important to hear AI researchers’ guesses about the future, even where they are probably not a reliable forecast.
First, it has often been assumed or stated that nobody who works in AI is worried about AI existential risk. If this were true, it would be a strong reason for the public to be reassured. However, hearing the real uncertainty from AI researchers disproves this view and makes a case for serious investigation of the concern. In this way even uncertain guesses are informative, because they tell us that the default assumption of confident safety was mistaken.
Second, there is no alternative to making guesses about the future. Policy decisions are implicitly big bets on guesses about the future. For instance, when we decide whether to rush a technology or to carefully regulate it, we are guessing about the scale of various benefits and harms.
Where trustworthy quantitative models are available, those are of course better. But in their absence, the guesses of a large number of relatively well-informed people are often better than the unacknowledged guesses of whoever is called upon to make implicit bets on the future.
That said, there seems little reason to think these forecasts are highly reliable. They should be treated as rough estimates, often better responded to with urgent, more dedicated analysis of the issues they hazily outline than by acting on the exact numbers.
The concern here is that respondents are not practiced at thinking in terms of probabilities, and may consequently say small numbers (e.g. 5%) when they mean something that would be better represented by an extremely tiny number (perhaps 0.01% or 0.000001%). Maybe especially if the request for a probability prompts them to think of integers between 0 and 100.
One reason to suspect this kind of error is that Karger et al. (2023, p29) found a group of respondents gave extinction probabilities nearly six orders of magnitude lower when prompted differently.18
This seems worth attending to, but I think it is unlikely to be a big issue here, for the following reasons:
While I think the quality of our methodology is exceptionally high, there are some significant limitations of our work. These don’t affect our results about expert concern about risk of extinction or similar, but do add some noteworthy nuance.
1) Experts’ predictions are inconsistent and unreliable
As we’ve emphasized in our papers reporting the survey results, experts’ predictions are often inconsistent across different question framings. Such sensitivity is not uncommon, and we’ve taken care to mitigate it by using multiple framings. Experts also give such a wide variety of predictions on many of these questions that, individually, they must be fairly inaccurate on average (though this says nothing about whether their aggregate judgment as a group is good).
2) It is not entirely clear what sort of “extremely bad outcomes” experts imagine AI will cause
We ask two different types of questions related to human extinction: 1) a question about “extremely bad outcomes (e.g. human extinction)”, 2) questions about “human extinction or similarly permanent and severe disempowerment of the human species”. We made the latter broader than ‘human extinction’ because we are interested in scenarios that are effectively the end of humanity, rather than just those where literally every Homo sapiens is dead. This means, however, that it isn’t clear how much probability participants place on literal extinction versus adjacent strong human disempowerment and other extremely bad scenarios. And there is some evidence that the fraction is low: some respondents explicitly mentioned risks other than extinction in write-in responses, and anecdotally, it seems common for AI researchers to express more concern about issues other than human extinction.
For many purposes, it isn’t important to distinguish between extinction and outcomes that are similarly extremely bad or disempowering to humanity. Yet if the catastrophes many participants have in mind are not human extinction, but the results lend themselves to simplification as ‘risk of extinction’, this can be misleading; perhaps more so than you’d expect, if for instance ‘extinction’ tends to bring to mind a different set of causes than ‘permanent and severe human disempowerment’.
3) Non-response bias is hard to eliminate
Surveys generally suffer from some non-response bias. We took many steps to minimize this, and find it implausible that our results are substantially affected by whatever bias remains (see the earlier question “Are the AI risk answers inflated much from concerned people taking the survey more?”). But we could do even more to estimate or eliminate response bias, e.g. paying some respondents much more than $50 to complete the survey and estimating the effect of doing so.
No. We published the near-identical 2016 survey in the Journal of AI Research, so the methodology had essentially been peer reviewed.20 Publication is costly and slow, and AI survey results are much more interesting sooner than later.
The 2023 paper has now also been published, but you had the results more than a year earlier!
A random subset of respondents is also asked additional open response questions after the questions shown, and which respondents receive each of these is correlated.
The three variants of the extinction question (the second asks specifically about failure to control AI; the third adds a 100-year horizon):
What probability do you put on future AI advances causing human extinction or similarly permanent and severe disempowerment of the human species?
What probability do you put on human inability to control future advanced AI systems causing human extinction or similarly permanent and severe disempowerment of the human species?
What probability do you put on future AI advances causing human extinction or similarly permanent and severe disempowerment of the human species within the next 100 years?
See here for all the survey questions.
If some people say a risk is ≥5% ever, some say it is ≥5% within a hundred years, and some say it is ≥5% from a more specific version of the problem, then you can infer that the whole group puts the chance ever, from all versions of the problem, at 5% or more.
See the figure above for the flow of respondents through the survey, or Appendix D in the paper for more related details.
I’m combining the three variants in the second question set for simplicity.
15% entered any responses, 14% got to the last question.
Across three survey iterations and up to four questions: 2016 [5%], 2022 [5%, 5%, 10%], 2023 [5%, 5%, 10%, 5%]; see p14 of the 2023 paper. Reading some of the write-in comments, we noticed a number of respondents mention outcomes in the ‘similarly bad or disempowering’ category.
See invitations here.
Four mentioned my name, ‘Katja’, in the write-in responses in 2024, and two of those mentioned there that they were familiar with me. I usually recognize a (very) small fraction of the names, and friends mention taking it.
In 2022 I sent the survey under different available affiliations and logos (combinations of Oxford, the Future of Humanity Institute, the Machine Intelligence Research Institute, AI Impacts, and nothing), and these didn’t seem to make any systematic difference to response rates. The combinations of logos we tried all got similar response rates (8-9%, lower than the ~17% we get after sending multiple reminders). Regarding affiliations, some combinations got higher or lower response rates, but not in a way that made sense except as noise (Oxford + FHI was especially low, Oxford was especially high). This was not a careful scientific experiment: I was trying to increase the response rate, so also varying other elements of the invitation, and focusing more on variants that seemed promising so far (sending out tiny numbers of surveys sequentially then adjusting). That complicates saying anything precise, but if MIRI or AI Impacts logos notably encouraged participation, I think I would have noticed.
I’m not sure how famous either is now, but respondents gave fairly consistent answers about the risk of very bad outcomes across the three surveys starting in 2016—when I think MIRI was substantially less famous, and AI Impacts extremely non-famous.
See Appendix A.3 of our 2023 paper.
2023 links: the 2016 abstract doesn’t mention it, focusing entirely on timelines to AI performance milestones, and the 2022 wiki page is not (I think) a particularly compelling read and doesn’t get to it for a while. 2022 link: the 2016 survey Google Scholar page doesn’t mention it.
In 2024 we included a link for non-respondents to quickly tell us why they didn’t want to take the survey. This isn’t straightforward to interpret (e.g. “don’t have time” might still represent non-response bias, if the person would have had time had they been more concerned), and only a handful of people responded out of tens of thousands, but 2 of 12 cited, among multiple motives, wanting to prevent consequences they expect from such research (advocacy for slowing AI progress, and ‘long-term’ risks getting attention at the expense of ‘systemic problems’).
Most machine learning research is published in conferences. NeurIPS, ICML, and ICLR are widely regarded as the top-tier machine learning conferences; AAAI and IJCAI are often considered “tier 1.5” venues, and also include a wider range of AI topics; JMLR is considered the top machine learning journal.
To my knowledge the largest at the time, but I’m less confident there.
Those respondents were given some examples of (non-AI) low-probability events, such as that there is a 1-in-300,000 chance of being killed by lightning, and were then asked for probabilities in the form ‘1-in-X’.
It wouldn’t surprise me if in fact a lot of the 0% and 1% entries would be better represented by tiny fractions of a percent, but this is irrelevant to the median and nearly irrelevant to the mean.
Differences include the addition of several questions, minor changes to questions that time had rendered inaccurate, and variations in email wording.