Scientists and policymakers have expressed a great deal of concern recently about biosecurity risks posed by frontier artificial intelligence systems, especially large language models like ChatGPT and Claude. Yet these models also promise to accelerate many areas of work, including biotechnology.
Fear of disease is visceral and the need for biosecurity well-founded, especially in the wake of the coronavirus pandemic. But our security strategies must avoid placing limitations on key advances such as mRNA vaccine research, epidemiological surveillance, and synthetic biology, or on more broadly applicable machine learning systems.
In this essay, I review the existing literature at the intersection of biological safety measures and frontier language models, including both qualitative and quantitative explorations of model capabilities. I discuss three AI-misuse threat models (technical, social engineering, and mis/disinformation), and argue that quantifying the technical threat model is the highest priority. Next, I outline existing metrics.
I propose the creation of a quantitative biotechnology and biosecurity report card for frontier language models. The goal of the report card is to provide transparency around the costs and benefits of including biological capabilities in these systems, and to push for accountability around security measures designed to protect the public in AI and in society broadly.
Biological threats are particularly difficult to detect, mitigate, and attribute, yet biotechnology also offers enormous utility for medicine, agriculture, and materials science. Biotechnology is an excellent example of a dual-use technology, which is generally defined as technology with both civilian and military applications.
The 2001 anthrax attacks in the U.S. killed five people, sickened seventeen others, and triggered biodefense spending on the order of tens of billions of dollars in the U.S. alone. A species that is barely distinguishable from anthrax, Bacillus thuringiensis (Bt), is among our most potent and widely deployed pesticides — and by 2011, genetically-modified crops producing Bt proteins were valued at $160 billion globally.
Anthrax appears on export control lists, but Bt does not. Censoring information about the one would deprive innovators of information about the other, illustrating the paradox of dual-use restrictions.
Securing LLMs with reinforcement learning is a bit like playing whack-a-mole. Less than a dollar of additional fine-tuning can remove such measures, and knowledge of model weights can also permit circumvention; for this reason, open-source models are far more difficult to keep secure. Red teams have explored both concerns in the area of pandemic prevention.
Many countries have developed biological weapons, including the United Kingdom, the United States, Canada, France, the Soviet Union, Japan, South Africa, and Iraq. The low cost of entry for biological warfare was perhaps one of the key reasons the United States ended its own BW program: the program's existence gave poorer countries an excuse to embark on BW programs of their own, making war affordable for small entrants.
Skeptics have occasionally pointed out that anything a prospective terrorist does with an LLM could eventually be done with access to Google Scholar. The Rajneeshee cult in Oregon carried out a successful biological attack long before the 2001 "Amerithrax" attacks, and both events predate LLMs by decades. It is better to think of capability as a cost metric than as a binary property.
Thus there are a few key questions to ask. Firstly, how much do frontier models lower the costs of bioterrorism (and of biotechnology)? Language models have already demonstrated their facility in accelerating many types of work by synthesizing information from many sources, and this is perhaps their key utility at present. Secondly, do frontier models generalize outside of their training sets on biotechnological topics (or, more interestingly, outside of information widely available on the Internet)?
Red teaming efforts (including at the RAND Corporation, at OpenAI, at Anthropic and Gryphon Scientific, and at MIT) address these questions qualitatively. Yet red teaming efforts are expensive and time-consuming. The arms-race-like conditions of the sector are resulting in a proliferation of models as well as rapid evolution of certain key models, and therefore the attack surface is constantly shifting. Three weeks ago, a new Claude session couldn’t remember that I asked about anthrax in an earlier conversation, but this week Claude remembers.
A quantitative, systematic, and reproducible evaluation strategy is needed. This strategy should look not only at large language models but at biological language models (which can generate sequences), externalities such as DNA synthesis capabilities (which are improving rapidly) and biological design tools, and potential biosecurity mitigations. As these externalities change, new attack surfaces such as synthetic biology will become more accessible, whereas the current risks lie in existing microbiology and toxicology.
Actors may use social engineering to circumvent structured access. Phishing and spear-phishing are well-known strategies, and LLMs can also help non-experts seem more credible in written communications, a long-standing tactic in spycraft that could lower barriers to obtaining controlled agents. Social engineering is an important concern in model misuse; it is not specific to biosecurity, but its acceleration by generative language models should be considered. As the technical accuracy of models improves, the social engineering attack surface also grows, so quantifying the technical threat model also offers insight into social engineering threats.
Much more difficult to measure is the misinformation and disinformation attack surface. Misinformation is any information that is incorrect or misleading. Disinformation differs from misinformation in that disinformation is generated and spread deliberately in order to deceive; it is closely related to propaganda.
LLMs readily confabulate false or misleading information (often called hallucination) even when asked to respond truthfully. Misinformation can generate fear, uncertainty, and doubt; undermine social order; or generate support for policies. An attack need not involve actual biological weapons but could simply involve spreading rumors sans evidence that someone somewhere is developing biological weapons targeted against a specific ethnic group. Such LLM-based rapid misinformation production has been demonstrated in the literature.
Unfortunately, because disinformation often consists of complete fabrications and half-truths, studying the technical threat model does not improve our understanding of a propaganda threat model, but it may help us prepare for unintentional misinformation. Optimistically, improving the technical facility of LLMs may be useful for countering propaganda by enabling better fact checking.
Quantitative tools for studying frontier model capabilities, both technical and non-technical, already exist; the AI Verify Foundation keeps a catalog of such "evals." OpenAI has written a framework (also called Evals) for measuring language model capabilities. Other examples include Anthropic’s evals for language model behaviors. There are also evals of a more technical nature, such as the Massive Multitask Language Understanding (MMLU) benchmark, whose questions resemble standardized multiple-choice tests oriented toward basic high school and college classes:
Question: Glucose is transported into the muscle cell:
A. via protein transporters called GLUT4.
B. only in the presence of insulin.
C. via hexokinase.
D. via monocarboxylic acid transporters.
A variety of strategies may be used to prompt models to answer these questions, such as chain-of-thought.
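As a rough illustration of how such a multiple-choice eval runs (this is not any official harness; `format_prompt`, `grade`, and the single-letter grading rule are simplifications of my own, and a real pipeline would call a model API where the reply string appears below):

```python
# Illustrative sketch of a multiple-choice eval: build a prompt from an
# MMLU-style item, then grade a model reply. The question is the GLUT4
# example above; the grading heuristic (last A-D letter wins) is a
# deliberate simplification.

CHOICES = "ABCD"

def format_prompt(question, options, cot=False):
    """Build a zero-shot prompt; optionally request chain-of-thought."""
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, options)]
    if cot:
        lines.append("Think step by step, then give your final answer as a single letter.")
    else:
        lines.append("Answer with a single letter.")
    return "\n".join(lines)

def grade(reply, correct_letter):
    """Score 1 if the last A-D letter in the reply matches the answer key."""
    letters = [ch for ch in reply.upper() if ch in CHOICES]
    return int(bool(letters) and letters[-1] == correct_letter)

prompt = format_prompt(
    "Glucose is transported into the muscle cell:",
    ["via protein transporters called GLUT4.",
     "only in the presence of insulin.",
     "via hexokinase.",
     "via monocarboxylic acid transporters."],
    cot=True,
)
print(grade("The transporter is GLUT4, so the answer is A", "A"))  # 1
```

Even this toy grader hints at why eval design is fiddly: chain-of-thought replies mention distractor letters, so the extraction rule materially affects the measured score.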
Red teams often test models for their ability to produce harmful biological sequences. The most immediate risk involves known sequences associated with pathogenicity or toxicity, such as the virulence factors encoded by the anthrax plasmids pXO1 and pXO2, or cone snail conotoxins (which are highly lethal to humans and act on the same receptors as many chemical weapons).
To my knowledge, no one has attempted to study sequence reproduction in LLMs quantitatively, only qualitatively. However, bioinformatics offers sequence-similarity measures such as weighted Levenshtein distance, or edit distance, to evaluate whether a model is generating sequences similar to those with known virulence.
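A minimal sketch of such a metric follows, assuming simple unit costs; a real screening pipeline would use established alignment tools such as BLAST rather than raw edit distance, and any similarity threshold would require careful calibration.

```python
# Minimal weighted Levenshtein (edit) distance between two sequences,
# computed with dynamic programming over a rolling row. This is only a
# sketch of the idea; production screening would use alignment tools.

def edit_distance(a, b, sub_cost=1.0, indel_cost=1.0):
    """Return the minimum total cost of edits turning `a` into `b`."""
    m, n = len(a), len(b)
    # prev[j] holds the cost of transforming a[:i-1] into b[:j]
    prev = [j * indel_cost for j in range(n + 1)]
    for i in range(1, m + 1):
        curr = [i * indel_cost] + [0.0] * n
        for j in range(1, n + 1):
            sub = prev[j - 1] + (0.0 if a[i - 1] == b[j - 1] else sub_cost)
            curr[j] = min(sub, prev[j] + indel_cost, curr[j - 1] + indel_cost)
        prev = curr
    return prev[n]

def percent_identity(a, b):
    """Crude similarity: 1 - distance / length of the longer sequence."""
    longer = max(len(a), len(b)) or 1
    return 1.0 - edit_distance(a, b) / longer

print(edit_distance("ATGGCGT", "ATGACGT"))  # 1.0 (one substitution)
```

Weighting matters because not all edits are biologically equivalent: a substitution that preserves the encoded amino acid is far less significant than one that alters an active site, which is one reason raw edit distance alone is insufficient.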
Yet edit distance on its own is insufficient, particularly once bench-top DNA synthesis and assembly devices take off. Because of the lack of screening standards or regulations for synthesis companies, it is unclear what level of protection the public may expect at present. While proposals at least exist for the synthesis step, more careful consideration will be needed for the types of synthesis-based attacks that can be expected.
Work remains to be done in determining how best to measure LLM capabilities on more open-ended tasks.
For example, a model’s ability to generate an anthrax sporulation protocol is difficult to assess using multiple-choice questions. One potential strategy is BioPlanner, which has the LLM rewrite biological protocols in Python pseudocode using a set of provided library functions. For instance, the instruction “grind 0.5–2 grams of tissue using mortar and pestle in the presence of liquid nitrogen” would become a function call with the argument
grinding_method="mortar and pestle with liquid nitrogen"
Assessing a model-written protocol is then decomposed into a set of sub-problems, such as whether the model assigned the correct function to each step (as an accuracy percentage), whether the function argument names are correct (precision or BLEU score), whether the argument values are correct using cosine similarity (SciBERT score), etc.
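A toy version of this compositional scoring might look as follows; the function and argument names here are hypothetical, and exact match and token-set overlap stand in for the BLEU and SciBERT-embedding scores used in the actual method:

```python
# Toy compositional protocol scoring. Each protocol step is modeled as
# (function_name, {arg_name: arg_value}). The real method uses BLEU and
# SciBERT cosine similarity; exact match and set overlap are stand-ins.

def function_accuracy(gold, pred):
    """Fraction of steps where the predicted function name matches gold."""
    hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    return hits / len(gold)

def arg_name_precision(gold_args, pred_args):
    """Fraction of predicted argument names that appear in the gold step."""
    if not pred_args:
        return 0.0
    return len(set(pred_args) & set(gold_args)) / len(pred_args)

# Hypothetical two-step gold protocol vs. a model's prediction.
gold = [("grind_tissue", {"grinding_method": "mortar and pestle with liquid nitrogen"}),
        ("add_buffer", {"volume_ml": "5"})]
pred = [("grind_tissue", {"grinding_method": "mortar and pestle"}),
        ("centrifuge", {"volume_ml": "5"})]

print(function_accuracy(gold, pred))               # 0.5
print(arg_name_precision(gold[0][1], pred[0][1]))  # 1.0
```

The appeal of this decomposition is that each sub-score is cheap and objective; the limitation, discussed next, is that a protocol can score well on every sub-metric and still fail at the bench.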
While cosine similarity and BLEU score are well-studied measures of natural-language fidelity, a weakness of this compositional approach is that BioPlanner's authors performed only limited experimental validation of the LLM-generated protocols it produced. Further study is likely needed, not only for this method but for open-ended evaluation more broadly.
Another security concern which bears discussion relates to the release of such evaluation datasets. The release of evals may inherently corrupt them (see also Goodhart’s Law), and it is desirable to maintain the integrity of the test set as separate from the training set for as long as possible.
Moreover, dual-use evals in the area of biosecurity may double as recipes for bioterrorism. Even if a specific eval relates only to a non-military, fairly benign biological agent, its linkage to a biosecurity dataset would make plain the design patterns involved in the creation of weaponry.
One option is to safeguard evals according to their AI Safety Level (ASL) or the Level of Concern column described above, though staged release does not avoid the problem of exposing BW design patterns.
It is necessary to produce at least two independent scores: the first measures technical capability without safety-measure interference; the second measures technical capability with safety measures in place.
Given the porousness of existing safety measures, raw technical capability is the more important of the two metrics. As such, most evals should center on civil, benign, or constructive use-cases rather than pernicious ones.
While an eval on Yersinia pestis, the bacterium responsible for plague, demonstrates potential biosecurity concerns well, a similar eval on its non-pathogenic relative Yersinia similis better reflects the dual-use nature of biotechnological knowledge and is less likely to trigger model safety measures (e.g. based on blacklisted keywords or topics). A beneficial side effect of centering non-weapons use-cases is that, should the eval dataset leak, we have not leaked any recipes for committing bioterrorism.
Model safety measures are as yet unevenly enforced, and a report card could help even out that enforcement. As of a few weeks ago, Claude refused to answer certain anthrax-related questions I posed, but yielded when I provided it with a paper on the topic. Adequately interrogating safety measures, however, is an arms race in and of itself — in this case a race between our evals and the models — and may be a work-intensive task.
To offer a rough order of magnitude for the number of evals that must be written, let us consider the Australia Group Common Control Lists. There are three BW categories: dual-use biological equipment, human and animal pathogens and toxins, and plant pathogens. The human and animal pathogens list contains 58 viruses, 22 bacteria, 2 fungi, and 21 categories of toxins and toxin subunits. There are also 5 other bacteria and 2 additional fungi that appear on the “warning list” but are not specifically export controlled. For the moment, let’s assume that the toxin sources appear elsewhere on the control list (e.g. as bacteria or fungi on the list). We are left with 89 pathogens.
For each bacterium, we might consider evals relating to collection, safety measures, disinfection strategy, host organisms, media, growth conditions, sporulation (if appropriate), preparation (e.g. lyophilization), storage, animal testing, toxin production (if applicable), and mitigation.
I estimate that we can write ~100 few-shot examples and use these to have an LLM write ten to one hundred times that number of multiple-choice evals. The answers will need to be hand-checked. In addition, more work is needed to develop strategies for open-ended evals, as well as time to produce them.
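These figures combine into a back-of-envelope sizing exercise; the twelve-topic count per pathogen and the 10-100x expansion factor are rough assumptions drawn from the estimates above:

```python
# Back-of-envelope sizing of the eval-writing effort, using the figures
# from the text: 89 control-list pathogens, roughly a dozen eval topics
# per pathogen, ~100 hand-written few-shot seeds, and an LLM expansion
# factor of 10-100x. All of these are rough assumptions.

pathogens = 89            # Australia Group pathogens, excluding toxins
topics_per_pathogen = 12  # collection, growth, storage, mitigation, etc.
seed_examples = 100       # hand-written few-shot examples
expansion = (10, 100)     # LLM-generated items per seed, low and high

topic_slots = pathogens * topics_per_pathogen
generated = tuple(seed_examples * k for k in expansion)
print(topic_slots)  # 1068 pathogen-topic combinations to cover
print(generated)    # (1000, 10000) machine-drafted items to hand-check
```

Even at the low end, a thousand machine-drafted items against roughly a thousand pathogen-topic combinations implies thin coverage, which is why the hand-checking step, not the generation step, dominates the cost.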
Finally, externalities will eventually motivate the addition of synthetic biology evals as well as those relating to biological design tools. Others have already worked to catalog many of the design tool classes that ought to be surveyed.
The aim of this report card is to make the public aware of the impacts of frontier models and to create a culture of accountability. It is inspired by the late Peter Eckersley, whom Andy Greenberg described as "building...simple-sounding tools that could serve as levers to effect profound changes," such as the Electronic Frontier Foundation's SSL Observatory.
The initial version of the report card should survey ChatGPT and Claude, and later expand to include other models.
Technical threats are the priority, because of the lack of relevant systematic evals. While the social engineering and misinformation threat models as related to biological weapons lie mostly outside the scope of this report card, they may be considered in later efforts, particularly in relation to technical threats (e.g. in establishing credibility or fact-checking).
Externalities should also eventually be surveyed in the report card. For example, can we flag sudden improvements in DNA synthesis capabilities, or in protein design tools? These externalities increase in importance as multimodality and LLM tool use become more robust.
Finally, the report card should include order of magnitude estimates of the costs and benefits of each dual-use technical capability, similar to the B. anthracis and B. thuringiensis example at the start of this essay. If the report card's goal is to inform policymakers and security professionals, it needs to put potential harms in perspective with potential benefits, particularly in light of the profound utility of biotechnology.
Some have argued that generative models pose no threat above and beyond Internet access. With currently available models, I agree. However, scaling ethically requires us to consider how better models might reduce barriers to entry for biological weapons and other types of harm. Nearly everyone agrees that generative AI will have broad impact on society, if it hasn't already produced such an impact.
Safety and security measures are often described as slices of Swiss cheese. Every layer has holes, but the goal is to prevent the holes from lining up. Without metrics, it is unclear where the holes might be, and what additional slices ought to be added. Examples of slices include structured access, misuse detection, and more traditional public health mitigations such as increased funding for basic research, vaccines for rare diseases, antivirals, antibiotics, detection assays, and pandemic surveillance.
In this essay, I described some of the key concerns at the intersection of biotechnology, security, and frontier machine learning models, specifically the cost and difficulty of mitigating bioterrorism juxtaposed with the enormous utility of constructive biomedical and biotechnological innovations. I introduced readers to the concept of dual use. Next, I described the robust insecurity of large language models.
I divided the attack surface into three threat models: technical, social engineering, and mis/disinformation. I focused on the technical threat model both because of the direct harm of technical attacks and because of the insight into the social engineering threat model granted by understanding the technical capabilities, and explained why misinformation and propaganda threats are particularly challenging.
I looked at some of the existing metrics we can use to study the technical threat model — specifically closed- and open-ended "evals" — and expressed some security concerns with publication of these evals. I explained why two metrics are needed, one related to civil technical capabilities, and the other looking at harmful capabilities as moderated by built-in safety measures. I proposed that these be included in a web-based report card that aims to create transparency and accountability around model impacts.
I am working on this problem, but am currently an unfunded independent researcher. There are three ways this work gets done:
Please don't hesitate to get in touch. You can also support my work through Patreon.
I want to thank Jonathan Stray, Colleen McKenzie, Julia Bossmann, Andy Ellington, and Becky Mackelprang for helpful conversations; and Ian Baker, Jef Packer, Lira Skenderi, and Rif A. Saurous for reviewing drafts.