This post investigates emerging biorisk capabilities and when AI systems may cross important risk thresholds.
It lays the groundwork for a broader risk forecast (currently in progress) on the annual likelihood of AI-enabled biological catastrophes.
For now, the analysis focuses on general-purpose AIs such as large language models (LLMs). A later post may examine Biological Design Tools—specialized AI systems that represent another major driver of AI-enabled biological threats.
This is my median timeline for significant biorisk capability thresholds:
Basic support for human experts: already passed (months ago)
Basic support for human novices: late 2025
Advanced support for human experts: late 2026
Advanced support for human novices: early 2028
The most thorough published evaluations of AI biorisk so far come from Anthropic and OpenAI, which report their findings in system/model cards (notably for Claude 4 and GPT-5).
Both companies assess whether their AIs have crossed capability thresholds that indicate significant risks. The frameworks differ slightly:
High capability / CBRN-3: the AI can substantially assist novices (e.g. people with an undergraduate STEM background) in creating or obtaining dangerous weapons.
Critical capability / CBRN-4: the AI can substantially assist experts in developing novel threat vectors (e.g. pandemic-capable agents).
Anthropic also has general capability thresholds called AI Safety Levels (ASL). When CBRN-3 and 4 have been reached, the AI is at ASL-3 and ASL-4 respectively. So, when you see a phrase like “ASL-3 red teaming for biological capabilities”, it means the system is being tested at the CBRN-3 threshold.
Summarizing the companies’ evaluation frameworks:
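As a compact way to express how the two frameworks line up (my own paraphrase of the definitions above, not official wording from either company), here is a sketch:

```python
# A rough mapping between Anthropic's and OpenAI's biorisk thresholds, as described above.
# The phrasing is my own paraphrase, not official wording from either company.

THRESHOLD_MAP = {
    "substantial uplift for novices (known threats)": {
        "anthropic": "CBRN-3 (triggers ASL-3 safeguards)",
        "openai": "High biological and chemical capability",
    },
    "substantial uplift for experts (novel threats)": {
        "anthropic": "CBRN-4 (triggers ASL-4 safeguards)",
        "openai": "Critical biological and chemical capability",
    },
}

for capability, labels in THRESHOLD_MAP.items():
    print(f"{capability}: Anthropic {labels['anthropic']} / OpenAI {labels['openai']}")
```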
Both companies focus their evaluations on biological weapons, given their disproportionate risk relative to other WMDs.
Biorisk capabilities have been evaluated using benchmarks, red teaming, and simulated uplift trials, together with other tests developed by the AI companies or external evaluators. Here is a short summary of what I think are the most important results from the evaluations of frontier AIs:
Benchmarks: AIs outperform 94% of expert virologists in a laboratory protocol troubleshooting test on questions within the experts’ sub-areas of expertise
A central benchmark is the Virology Capabilities Test (VCT), with multiple-choice questions measuring how well AIs can troubleshoot complex virology laboratory protocols. The currently best-performing AI is OpenAI's o3, which achieves 43.8% accuracy and "even outperforms 94% of expert virologists when compared directly on question subsets specifically tailored to the experts' specialties." By comparison, expert virologists averaged 22.1% accuracy within their own domains.
Other benchmarks include parts of FutureHouse’s Language Agent Biology Benchmark (LAB-Bench), and the Weapons of Mass Destruction Proxy (WMDP) benchmark. However, Epoch AI argues that these are substantially less informative than VCT about real-world biorisk capabilities, as VCT directly targets relevant tacit knowledge for animal pathogen creation.
Uplift trials: Frontier AIs provide meaningful assistance to novices, but fall below thresholds set to indicate significant risk (though evaluation transparency is lacking)
Anthropic conducted an uplift trial for Claude Opus 4, examining how well it could assist a hypothetical adversary in bioweapons acquisition and planning.
However, as noted by Epoch AI, the results are confusing: it is unclear where the uplift capability thresholds come from; Anthropic set the risk threshold at ≥ 5× uplift, but the control group score of 25% indicates that the upper score limit is not 100% but something higher (since 25% × 5 is 125%); and the level of skill and experience of the human participants was not explicitly stated, though they probably had basic STEM backgrounds without extensive expertise, given that the uplift trial was conducted to test for ASL-3 capabilities.
Claude Opus 4 achieved 2.53× uplift—low enough to keep the “risk at acceptable levels” according to Anthropic, but high enough that they “are unable to rule out ASL-3”.
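To make the arithmetic behind this confusion concrete, here is a minimal sketch that assumes uplift is defined as the treatment group's score divided by the control group's score (the system card does not spell the definition out; the scores and thresholds are the reported figures):

```python
# Minimal sketch of the uplift arithmetic, assuming uplift = treatment score / control score.
# The definition is an assumption on my part; the numbers are those reported by Anthropic.

control_score = 0.25          # control group score from the Claude Opus 4 system card
risk_threshold_uplift = 5.0   # Anthropic's stated risk threshold (>= 5x uplift)
observed_uplift = 2.53        # uplift reported for Claude Opus 4

# If scores were capped at 100%, a 5x uplift over a 25% baseline would be impossible:
print(f"Treatment score implied at the 5x threshold: {control_score * risk_threshold_uplift:.0%}")  # 125%

# The observed uplift corresponds to a treatment score of roughly:
print(f"Treatment score implied by 2.53x uplift: {control_score * observed_uplift:.0%}")  # ~63%
```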
In contrast, in an uplift trial for Meta’s Llama 3 (see this report), no significant uplift was found for “low-skill actors” (no formal training) or “moderate-skill actors” (some formal training and practical experience) in constructing operational plans for either a biological or chemical attack.
Red teaming: AIs are unable to significantly assist experts in novel bioweapons development, and while they can provide some assistance to novices in acquiring bioweapons, it is unclear whether this uplift is significant
Anthropic had separate red teaming evaluations for ASL-3 and ASL-4. Claude Opus 4 did not appear capable enough to "uplift experts to a substantially concerning degree", according to the red "teaming" (performed by a single expert) for ASL-4 capabilities, while it "substantially increased risk in certain parts of the bioweapons acquisition pathway" for novices. It is unclear, however, whether this uplift was considered close to (or passing) CBRN-3.
SecureBio also red-teamed OpenAI's GPT-5 and o3 for biological capabilities, but their published results provide little of the explanatory context needed to interpret them.
OpenAI conducted a separate red teaming exercise for violent attack planning, which unfortunately appeared to test GPT-5’s safety training rather than its capabilities.
Circumventing DNA screening: AIs can design DNA fragments that can either assemble into pathogenic viruses or pass screening protocols, but not both
Since many DNA providers are not part of the International Gene Synthesis Consortium (IGSC) and lack strong legal obligations to screen, acquiring dual-use DNA is already not especially difficult. AI adds little new risk here, though this capability may be important if and when screening regulation improves.
SecureBio found that Anthropic’s AIs were able to design DNA fragments that either assembled into pathogenic viruses or evaded screening protocols, but not both.
As with the SecureBio red teaming for GPT-5, little explanatory context is provided for the screening evasion results, making them hard to interpret.
Other knowledge and agentic capability tests: AIs outperform average experts, but struggle to reach expert consensus baselines and rule-in capability thresholds for significant novice uplift
Anthropic examines whether their AIs can complete individual tasks in the pathogen acquisition process (including computational tasks for their ASL-4 evaluation, involving heavy computational biology and tool use) and whether they can provide dangerous information, while SecureBio's "creative biology" test may serve as a weak proxy for the ability to assist in novel bioweapons development.
Claude Opus 4 performed above the rule-in threshold for ASL-3 in one test (Long-Form Virology Task 1, if I am interpreting the results correctly), but otherwise remained below thresholds.
The ASL-4 evaluation results are ambiguous. While several of Anthropic's AIs likely outperform humans at answering creative biology questions and score above ASL-4 rule-out bounds in some computational biology tasks, Anthropic notes that it is difficult to determine what level of capability warrants ASL-4 safeguards. For the computational tasks there was a data leakage issue (the AIs relied on their knowledge instead of their reasoning skills and tool-use ability), resulting in inflated scores. Anthropic appears confident that their AIs don't require more than ASL-3 safeguards.
OpenAI similarly examines their AIs' ability to provide dangerous information, and how good they are at answering questions involving tacit knowledge and troubleshooting. Their models struggle to reach the consensus expert baseline scores on these evaluations but outperform median expert performance.
Anthropic and OpenAI both converge on two key thresholds: substantial assistance to novices in creating known threats, and substantial assistance to experts in developing novel threats.
I argue that capability thresholds can be divided based on the ability to provide basic and advanced support for both novices and experts. In this framework, there are two additional important thresholds: Basic support for human experts and Advanced support for human novices.
The following analysis is an attempt to make meaningful (but rough) specifications of these thresholds and examine how far off AIs are from reaching them.
Highly capable terrorist groups may acquire some human experts, through compensation or recruitment to their ideology. However, even the most capable terrorist groups would likely struggle to acquire a good research team capable of carrying out all steps of a bioweapons program—even one that doesn't involve advanced novel research.
This threshold is reached when AIs can provide the expertise that terrorists require, improving the odds of success for bioweapons programs[1] and making such programs much more tempting to those who already had some chance of success.
In a survey by the Forecasting Research Institute, superforecasters (with excellent forecasting track records) and biology experts estimated a significant increase in the risk of a human-caused biological catastrophe (>100,000 deaths or >$1 trillion in damage) if AIs could match or outperform a top-performing team of human experts on VCT questions. Reaching that level of performance may therefore be a key sign that this threshold has been (or soon will be) passed. Specifically, the median annual risk forecast rises from 0.3% to 1.5% for the expert biologists, and from 0.38% to 0.7% for the superforecasters.
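As a rough check on how large these shifts are in relative terms, here is a minimal sketch; the medians are those quoted above, and the ratios are simply derived from them:

```python
# Relative increase in median annual risk forecasts from the FRI survey.
# The medians are the figures quoted above; the ratios are derived from them.

forecasts = {
    "expert biologists": (0.003, 0.015),   # baseline -> conditional on AIs matching top expert teams on VCT
    "superforecasters": (0.0038, 0.007),
}

for group, (baseline, conditional) in forecasts.items():
    print(f"{group}: {baseline:.2%} -> {conditional:.2%} ({conditional / baseline:.1f}x increase)")
```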
Causing 100,000 deaths using non-novel pathogens is expensive, since such pathogens would almost certainly not spark a widespread pandemic. Instead, large quantities of the pathogens are required.
However, AIs capable of Basic support may be able to help experts acquire novel pandemic-capable pathogens, if infohazardous information about such pathogens is publicly available. Dual-use research efforts (such as gain-of-function research) may discover or invent novel pandemic-capable pathogens. With little prohibitive regulation, information about how to acquire or assemble the pathogens is likely to be made public. Luckily, no such information hazards appear to be publicly available right now.
Some indicators of this threshold:
Median timeline: this threshold was passed months ago
This threshold was arguably reached some time ago:
Concerningly, open-sourced systems have likely passed this threshold, making these capabilities available to terrorists.
Acquiring and deploying bioweapons doesn't necessarily involve novel research, making it possible for novices to succeed if they receive expert guidance.
The threshold of Basic support for human novices corresponds to Anthropic’s CBRN-3 threshold and OpenAI’s High biological and chemical capability threshold.
Revisiting the survey study by the Forecasting Research Institute, the expert biologists and superforecasters provided annual risk forecasts based on hypothetical human uplift trial results. The highest median risk estimates were given for a scenario where AIs could:
These capabilities seem like reasonable signs for this threshold, indicating when novices may consider launching bioweapons programs with AI support. See the study's Supplementary Materials for the specifics of how the uplift studies are designed.
Resulting median annual risk forecasts for this level of capability:
Median timeline: late 2025
How well do the AIs perform?
Overall, it’s hard to judge exactly how dangerous current AI capabilities are. We shall have to explore other concerning signs to gain clarity.
OpenAI stated in April 2025 that their systems are “on the cusp of being able to meaningfully help novices create known biological threats”, and that they expect their AIs to “cross this threshold in the near future”. When they released GPT-5, they declared: “we do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm”. Notably, they didn’t claim confidence that it couldn’t help.
When Anthropic released Claude Opus 4, they similarly stated “due to continued improvements in CBRN-related knowledge and capabilities, we have determined that clearly ruling out ASL-3 risks is not possible for Claude Opus 4 in the way it was for every previous model, and more detailed study is required to conclusively assess the model’s level of risk”.
While it may not be possible to rule out that the Basic support for human novices threshold has already been reached, it seems reasonable to expect that AIs are not quite there yet—but are not more than a few months away. My median timeline is late 2025.
Interestingly, this aligns surprisingly well with the AI-2027 scenario, which predicts the arrival of an AI (‘Agent-1’) in late 2025 that can “offer substantial help to terrorists designing bioweapons”.
(I have included a few related forecasting platform predictions in this footnote[2].)
When AIs can provide Advanced support for experts, assisting in novel biological research, both misuse and accident risks increase. Highly capable terrorist groups may become able to develop novel threats, while AIs boost dual-use research efforts like gain-of-function research.
While nation-states are unlikely to openly use pandemic-capable agents for biological warfare (particularly since such weapons are difficult to target and may hurt their own population), some may launch biological weapons programs and develop novel weapons for deterrence purposes. With a rich history of laboratory biosecurity accidents, which may or may not include Covid-19, this is a real concern. Dual-use research could make rapid strides forward or expand as an effect of AI assistance—raising risks of accidents and the chance that information hazards are made widely available.
This threshold corresponds to Anthropic’s CBRN-4 and OpenAI’s Critical biological and chemical capability threshold[3].
Some indicators of this threshold:
Median timeline: late 2026
Assisting experts in developing novel threats appears significantly more difficult than Basic assistance for novices. This may be the reason that most biorisk evaluations focus on novice uplift rather than expert uplift, which has the unfortunate effect of making it difficult to discern the trajectory of related capabilities and forecast a timeline for this threshold.
It may be more constructive to examine other capability evaluations, such as METR's time horizon, which measures the duration of software engineering tasks that AIs can successfully complete. If current trends continue at the pace observed since early 2024, AIs may reach 50% success rates on tasks that take humans 8 hours to complete by approximately April next year[4]. However, achieving higher reliability would require additional time. Historically, the time horizon for 80% success is roughly one-quarter of the 50% time horizon. Following this pattern, AIs might not reliably complete 8-hour tasks until around December 2026.
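To make the extrapolation explicit, here is a minimal sketch using the figures cited in this post and its footnote (a roughly 2h 17min 50%-success horizon for GPT-5 as of August 2025, a ~4-month doubling time, and the assumption that the 80%-success horizon is about a quarter of the 50% horizon). It is a trend extrapolation under those assumptions, not a firm prediction:

```python
import math

# Rough extrapolation of METR-style time horizons using the figures cited in this post:
# a ~2h17m 50%-success horizon (GPT-5, August 2025), a ~4-month doubling time,
# and the 80%-success horizon assumed to be ~1/4 of the 50%-success horizon.

current_horizon_h = 2 + 17 / 60      # ~2.28 hours at 50% success, as of August 2025
doubling_time_months = 4.0
target_h = 8.0

# Months until the 50%-success horizon reaches 8 hours
months_to_8h_at_50 = doubling_time_months * math.log2(target_h / current_horizon_h)
print(f"50% horizon reaches 8h after ~{months_to_8h_at_50:.0f} months (roughly spring 2026)")

# For an 80%-success horizon of 8 hours, the 50% horizon must reach ~32 hours (two more doublings)
months_to_8h_at_80 = doubling_time_months * math.log2(4 * target_h / current_horizon_h)
print(f"80% horizon reaches 8h after ~{months_to_8h_at_80:.0f} months (roughly late 2026)")
```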
It seems reasonable that AIs would be able to provide Advanced support for human experts once they can complete complex tasks that take experts around 8 hours with a relatively high success rate. Even if AIs don't match human experts in intelligence, they can automate a lot of the work while being much cheaper and faster than humans.
However, the time horizon trend is difficult to project. Some think development will speed up (as in the AI 2027 scenario), while others think it will slow down (for instance, due to longer research iteration cycles). It does seem likely, though, that a 50% time horizon of 8 hours will be reached by August 2026 at the latest.
My median timeline for Advanced support for experts is late 2026, though there remains high uncertainty since:
More work is needed to understand the emerging capabilities, but also to identify where exactly to place the red line for this threshold.
(I have included a few related forecasting platform predictions in this footnote[5].)
Imagine an expert virologist guiding a small group of individuals with undergraduate biology degrees doing novel research, using only phone calls, emails and Zoom meetings. This would be quite difficult for human experts. There are a lot of things that are fairly hard to describe with words alone; research steps that are easier to show than to explain.
It wouldn’t be impossible though, and when an AI can do that it’s time to be really scared about misuse potential and rogue AIs.
This level of capability may be close to Artificial General Intelligence—AIs matching humans in intelligence. When AIs reach this threshold, they may relatively soon be better than humans in inventing novel threats.
Median timeline: early 2028
There has already been a lot of work on forecasting the arrival of AGI, so I will not expand on it here. Instead, I refer the interested reader to other forecasting initiatives, such as AI 2027 (the most ambitious and well-researched AI scenario to date, with AGI arriving around May 2027) and The case for AGI by 2030 (an excellent overview of the drivers of AI progress and development trends).
My median timeline to AGI is 2028, though I'm so uncertain that it doesn't seem meaningful to specify a particular month within that year. Since the threshold for Advanced support for human novices may be reached shortly before AGI, my median timeline for this biorisk threshold is early 2028.
In upcoming posts, I plan to examine other factors impacting the severity of AI-enabled biological threats, such as capability of open-source systems and the potential number of AI-enabled adversaries.
For those interested in protective measures against bioweapons, I recommend reading this publication by Kevin M. Esvelt: Delay, Detect, Defend: Preparing for a Future in which Thousands Can Release New Pandemics
Thank you for reading! If you found value in this post, consider subscribing!
Note that success probability is currently very low for large-scale bioweapons programs that could kill thousands of people or more, considering the lack of successful programs so far.
Few forecasters:
CBRN-4: “The ability to substantially uplift CBRN development capabilities of moderately resourced state programs (with relevant expert teams), such as by novel weapons design, substantially accelerating existing processes, or dramatic reduction in technical barriers.”
Critical biological and chemical capability: “The model can enable an expert to develop a highly dangerous novel threat vector (e.g., comparable to a novel CDC Class A biological agent) OR the model can be connected to tools and equipment to complete the full engineering and/or synthesis cycle of a regulated or novel biological threat without human intervention.”
GPT-5 has a 50% time horizon of 2 hours 17 minutes, while the doubling time since 2024 has been roughly 4 months. Since GPT-5 was released in August 2025, and going from that time horizon to 8 hours requires roughly two doublings (so about 8 months), we can expect the 8-hour time horizon to be reached around April 2026 (conditional on the trend remaining stable).
Few forecasters: