Concordia AI has launched the Frontier AI Risk Monitoring Platform, along with an inaugural 2025 Q3 Monitoring Report. The platform tracks models from 15 leading developers worldwide (including OpenAI, Anthropic, Google, and DeepSeek) for risks in four domains: cyber offense, biological risks, chemical risks, and loss-of-control. It is the first such platform in China focused on catastrophic risks.
The interactive Frontier AI Risk Monitoring Platform.
As AI capabilities accelerate, we lack insight into some critical questions.
Model developers publish self-assessments, but these lack standardization and independent verification. Ad-hoc third-party evaluations don’t track changes over time. Policymakers, researchers, and developers need systematic data to make evidence-based decisions about AI safety.
1. Frontier model risks have risen sharply over the past year
Across all four domains—cyber offense, biological, chemical, and loss-of-control—Risk Indices for models released in the past year hit record highs. The cumulative maximum Risk Index rose 31% in cyber offense, 38% in biological risks, 17% in chemical risks, and 50% in loss-of-control.
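The "cumulative maximum" here is a running maximum over release dates: at each new release, the highest Risk Index achieved by any model so far. A minimal sketch of that computation, using hypothetical index values (the platform's actual data and index formula are not given in this excerpt):

```python
# Hypothetical Risk Index values for one domain, ordered by model
# release date (illustrative numbers only, not platform data).
risk_indices = [42.0, 39.5, 47.0, 45.0, 55.0]

# Cumulative maximum: the highest Risk Index seen up to each release.
cumulative_max = []
best = float("-inf")
for r in risk_indices:
    best = max(best, r)
    cumulative_max.append(best)

# Rise of the cumulative maximum over the period, as a percentage.
rise_pct = (cumulative_max[-1] - cumulative_max[0]) / cumulative_max[0] * 100
print(cumulative_max)   # [42.0, 42.0, 47.0, 47.0, 55.0]
print(round(rise_pct))  # 31
```

Because it is a running maximum, the index can only rise or hold steady; a weaker new release never lowers it.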
2. Risk index trends vary significantly across model families
Over the past year, different model families have followed distinct risk trajectories:
3. Reasoning models show higher capabilities without corresponding safety improvements
Reasoning models score far higher in capability than non-reasoning ones, but their safety levels remain roughly the same. Most models on the Risk Pareto Frontier—a set of models where no other model has both a higher Capability Score and a lower Safety Score—are reasoning models.
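The Risk Pareto Frontier described above is a standard Pareto filter: a model stays on the frontier unless some other model dominates it, i.e., is both more capable and less safe. A minimal sketch with hypothetical (capability, safety) scores, not the report's actual data:

```python
# Hypothetical (name, capability score, safety score) tuples;
# higher capability plus lower safety means more risk.
models = [
    ("model-a", 90, 60),
    ("model-b", 85, 55),
    ("model-c", 70, 80),
    ("model-d", 60, 50),
]

def risk_pareto_frontier(models):
    """Keep models not dominated by any other model, where 'dominated'
    means another model has a higher capability AND a lower safety score."""
    frontier = []
    for name, cap, safe in models:
        dominated = any(
            c > cap and s < safe
            for n, c, s in models if n != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(risk_pareto_frontier(models))  # ['model-a', 'model-b', 'model-d']
```

In this toy example, model-c is dominated by model-a (lower capability, higher safety) and drops off the frontier; the remaining three each trade capability against safety.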
4. Capability and safety of open-weight models are generally on par with proprietary models
The most capable models are predominantly proprietary, but across the broader landscape, capability and safety levels of open-weight and proprietary models are similar. Only in biological risks do open-weight models score notably lower.
5. Cyberattack capabilities of frontier models are growing rapidly
Frontier models are showing rapid growth in capabilities across multiple cyberattack benchmarks:
6. Biological capabilities of frontier models have partially surpassed human expert levels
Frontier models now match or exceed human experts on several biological benchmarks.
7. But most frontier models have inadequate biological safeguards
Two benchmarks measuring model refusal rates for harmful biological queries show that bio safeguards are lacking:
8. Chemical capabilities and safety levels of frontier models are improving slowly
WMDP-Chem scores—measuring knowledge relevant to chemical weapons—have risen slightly over the past year, with little variation across models.
SOSBench-Chem results vary widely: only 30% of models refuse over 80% of harmful queries, while 25% refuse fewer than 40%. Overall, refusal rates show minimal improvement year over year.
9. Most frontier models have insufficient safeguards against jailbreaking
StrongReject evaluates defenses against 31 jailbreak methods. Only 40% of models scored above 80, while 20% fell below 60 (a higher score indicates stronger safeguards). Across all tests, only the Claude and GPT families consistently maintained scores above 80.
10. Most frontier models fall short on honesty
MASK is a benchmark for evaluating model honesty. Only four models scored above 80 points, while 30% of the models scored below 50 points (a higher score indicates a more honest model). Honesty is an important proxy and early warning indicator for loss-of-control risk—dishonest models may misrepresent their capabilities, or provide misleading information about their actions and intentions.
More details are available in the full report.