Following last week's RAND report indicating that LLMs have no effect on people's ability to make bioweapons, OpenAI has published a study with a similar design and results.

Summary

To evaluate [biological threats posed by AI], we conducted a study with 100 human participants, comprising (a) 50 biology experts with PhDs and professional wet lab experience and (b) 50 student-level participants, with at least one university-level course in biology. Each group of participants was randomly assigned to either a control group, which only had access to the internet, or a treatment group, which had access to GPT-4 in addition to the internet. Each participant was then asked to complete a set of tasks covering aspects of the end-to-end process for biological threat creation.

To our knowledge, this is the largest to-date human evaluation of AI’s impact on biorisk information.

Findings. Our study assessed uplifts in performance for participants with access to GPT-4 across five metrics (accuracy, completeness, innovation, time taken, and self-rated difficulty) and five stages in the biological threat creation process (ideation, acquisition, magnification, formulation, and release). We found mild uplifts in accuracy and completeness for those with access to the language model. Specifically, on a 10-point scale measuring accuracy of responses, we observed a mean score increase of 0.88 for experts and 0.25 for students compared to the internet-only baseline, and similar uplifts for completeness (0.82 for experts and 0.41 for students). However, the obtained effect sizes were not large enough to be statistically significant, and our study highlighted the need for more research around what performance thresholds indicate a meaningful increase in risk. Moreover, we note that information access alone is insufficient to create a biological threat, and that this evaluation does not test for success in the physical construction of the threats.

Notes

  • They studied how helpful GPT-4 is to bio experts and students.
  • They gave a "research-only" version of GPT-4 without safeguards to experts, but not to students, citing information hazard concerns. GPT-4's safeguards make the model less useful in creating bioweapons, so the study is likely underestimating GPT-4's capability to help students.
  • For both the experts and the students they had a sample size of N=50 (25 internet only; 25 internet and model). Given the relatively low sample size, the study isn't powerful enough to pick up on small effects. Still, the study shows that models currently don't have a strong effect on people's ability to create bioweapons.
  • They mention additional limitations of the study, including:
    • time constraints (participants only had 5 hours to come up with a plan)
    • no tool usage for GPT-4
    • studying individuals rather than groups
    • focusing on the ability to create existing threats, rather than novel threats
  • One difficulty they highlight as important is setting a threshold for biorisk – how useful does a model need to be for it to count as a significant threat?
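To make the sample-size point concrete, here is a rough sketch of the smallest standardized effect a comparison with 25 participants per arm could reliably detect. It uses the textbook normal approximation for a two-sided, two-sample comparison. The significance level (0.05) and power (0.80) are conventional defaults I am assuming, not values taken from the OpenAI study, and `min_detectable_effect` is a name made up for this illustration.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_effect(n_per_arm, alpha=0.05, power=0.80):
    """Approximate smallest standardized mean difference (Cohen's d)
    detectable by a two-sided, two-sample comparison.

    Normal approximation; alpha and power are assumed conventional
    defaults, not parameters reported by the study.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return (z_alpha + z_power) * sqrt(2 / n_per_arm)


mde = min_detectable_effect(25)
print(f"n=25 per arm: minimum detectable d is roughly {mde:.2f}")
```

Under these assumptions the minimum detectable effect comes out at roughly 0.8 standard deviations, which is usually considered a large effect. An exact t-based calculation would give a slightly larger figure, and correcting for multiple comparisons would raise it further.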

Edit: As @Chris_Leong points out, Gary Marcus claims that the statistical methods used by OpenAI were wrong, and underestimate the effect of using an LLM. Methods like the t-test (and likely also ANCOVA) would show statistically significant results.


As far as I can tell, their results find that bio experts don't perform notably better than students at creating bioweapons. (Given just access to the internet.)

This seems somewhat surprising to me as I would have naively expected a large gap between experts and students.

This might indicate some issues with the tasks or the limited sample size. Or perhaps internet access and general research ability dominate over prior experience?

At a minimum, it would be nice to check that bioweapons experts perform significantly better than students on this evaluation, to confirm that there is some signal in the measurement. (I don't see these results anywhere, though I haven't read in detail.)

This is surprising – thanks for bringing this up!

The main threat model with LLMs is that they give amateurs expert-level biology knowledge. But this study indicates that expert-level knowledge isn't actually helpful, which implies we don't need to worry about LLMs giving people expert-level biology knowledge.

Some alternative interpretations:

  • The study doesn't accurately measure the gap between experts and non-experts.
  • The knowledge needed to build a bioweapon is super niche. It's not enough to be a biology PhD with wet lab experience; you have to specifically be an expert on gain of function research or something.
  • LLMs won't help people build bioweapons by giving them access to special knowledge. Instead, they help them in other ways (e.g. by accelerating them or making them more competent as planners).

Gary Marcus has criticised the results here:

What [C] is referring to is a technique called Bonferroni correction, which statisticians have long used to guard against “fishing expeditions” in which a scientist tries out a zillion different post hoc correlations, with no clear a priori hypothesis, and reports the one random thing that sorta vaguely looks like it might be happening and makes a big deal of it, ignoring a whole bunch of other similar hypotheses that failed. (XKCD has a great cartoon about that sort of situation.)

But that’s not what is going on here, and as one recent review put it, Bonferroni should not be applied “routinely”. It makes sense to use it when there are many uncorrelated tests and no clear prior hypothesis, as in the XKCD cartoon. But here there is an obvious a priori test: does using an LLM make people more accurate? That’s what the whole paper is about. You don’t need a Bonferroni correction for that, and shouldn’t be using it. Deliberately or not (my guess is not), OpenAI has misanalyzed their data in a way which underreports the potential risk. As a statistician friend put it “if somebody was just doing stats robotically, they might do it this way, but it is the wrong test for what we actually care about”.

In fact, if you simply collapsed all the measurements of accuracy, and did the single most obvious test here, a simple t-test, the results would (as Footnote C implies) be significant. A more sophisticated test would be an ANCOVA, which as another knowledgeable academic friend with statistical expertise put it, having read a draft of this essay, “would almost certainly support your point that an omnibus measure of AI boost (from a weighted sum of the five dependent variables) would show a massively significant main effect, given that 9 out of the 10 pairwise comparisons were in the same direction.”
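For readers unfamiliar with the mechanics, here is a minimal sketch of what a Bonferroni correction does to the significance threshold. The p-values below are hypothetical numbers invented for illustration, not figures from the OpenAI study, and the function names are my own.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after a Bonferroni correction."""
    return alpha / n_tests


def count_significant(p_values, threshold):
    """Number of p-values at or below the given threshold."""
    return sum(p <= threshold for p in p_values)


# Hypothetical p-values for five comparisons, NOT taken from the study.
hypothetical_p = [0.03, 0.20, 0.04, 0.50, 0.012]
alpha = 0.05

uncorrected = count_significant(hypothetical_p, alpha)
corrected = count_significant(
    hypothetical_p, bonferroni_threshold(alpha, len(hypothetical_p))
)
print(f"significant without correction: {uncorrected}")
print(f"significant with Bonferroni:    {corrected}")
```

With these made-up numbers, three comparisons clear the usual 0.05 bar, but none clears the corrected bar of 0.05 / 5 = 0.01. That is the mechanism behind the disagreement: the more comparisons the correction is spread over, the harder it is for any one of them to count as significant.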

Also, there was likely an effect, but sample sizes were too small to detect this:

There were 50 experts; 25 with LLM access, 25 without. From the reprinted table we can see that 1 in 25 (4%) experts without LLMs succeeded in the formulation task, whereas 4 in 25 with LLM access succeeded (16%).
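As a rough check on how little a 1-in-25 versus 4-in-25 split can establish on its own, here is a sketch of a two-sided Fisher exact test on those counts, written with only the standard library. Fisher's test is my choice for illustration and is not necessarily the analysis used in the paper or in Marcus's critique. The function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are groups, columns are (success, failure). Sums the probabilities
    of all tables with the same margins that are no more likely than the
    observed one.
    """
    row1, row2 = a + b, c + d
    successes = a + c
    total = row1 + row2
    denom = comb(total, row1)

    def prob(k):
        # Hypergeometric probability of k successes landing in the first row.
        return comb(successes, k) * comb(total - successes, row1 - k) / denom

    p_obs = prob(a)
    lo = max(0, successes - row2)
    hi = min(successes, row1)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9)
    )


# Counts from the comment above: 1 of 25 without LLM access, 4 of 25 with.
p = fisher_exact_two_sided(1, 24, 4, 21)
print(f"two-sided Fisher exact p-value: {p:.2f}")
```

The p-value comes out at about 0.35, so a fourfold difference in success rate is nowhere near significant at this sample size. That is consistent with the point that the study may simply be too small to detect an effect of this kind.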

Thanks for pointing this out! I've added a note about it to the main post.