Concordia AI has launched the Frontier AI Risk Monitoring Platform, along with an inaugural 2025 Q3 Monitoring Report. The platform tracks models from 15 leading developers worldwide (including OpenAI, Anthropic, Google, and DeepSeek) for risks in four domains: cyber offense, biological risks, chemical risks, and loss-of-control. It is the first such platform in China focused on catastrophic risks.
The interactive Frontier AI Risk Monitoring Platform.
As AI capabilities accelerate, we lack insight into some critical questions.
Model developers publish self-assessments, but these lack standardization and independent verification. Ad-hoc third-party evaluations don’t track changes over time. Policymakers, researchers, and developers need systematic data to make evidence-based decisions about AI safety.
1. Frontier model risks have risen sharply over the past year
Across all four domains—cyber offense, biological, chemical, and loss-of-control—Risk Indices for models released in the past year hit record highs. The cumulative maximum Risk Index rose 31% in cyber offense, 38% in biological risks, 17% in chemical risks, and 50% in loss-of-control.
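The "cumulative maximum" here is a running maximum over release dates: at each new release, the highest Risk Index achieved by any model so far. A minimal sketch of that computation, using hypothetical index values (the platform's actual data and index formula are not given in this excerpt):

```python
# Hypothetical Risk Index values for one domain, ordered by model
# release date (illustrative numbers only, not platform data).
risk_indices = [42.0, 39.5, 47.0, 45.0, 55.0]

# Cumulative maximum: the highest Risk Index seen up to each release.
cumulative_max = []
best = float("-inf")
for r in risk_indices:
    best = max(best, r)
    cumulative_max.append(best)

# Rise of the cumulative maximum over the period, as a percentage.
rise_pct = (cumulative_max[-1] - cumulative_max[0]) / cumulative_max[0] * 100
print(cumulative_max)   # [42.0, 42.0, 47.0, 47.0, 55.0]
print(round(rise_pct))  # 31
```

Because it is a running maximum, the index can only rise or hold steady; a weaker new release never lowers it.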
2. Risk index trends vary significantly across model families
Over the past year, different model families have followed distinct risk trajectories:
3. Reasoning models show higher capabilities without corresponding safety improvements
Reasoning models score far higher in capability than non-reasoning ones, but their safety levels remain roughly the same. Most models on the Risk Pareto Frontier—a set of models where no other model has both a higher Capability Score and a lower Safety Score—are reasoning models.
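The Risk Pareto Frontier described above is a standard Pareto filter: a model stays on the frontier unless some other model dominates it, i.e., is both more capable and less safe. A minimal sketch with hypothetical (capability, safety) scores, not the report's actual data:

```python
# Hypothetical (name, capability score, safety score) tuples;
# higher capability plus lower safety means more risk.
models = [
    ("model-a", 90, 60),
    ("model-b", 85, 55),
    ("model-c", 70, 80),
    ("model-d", 60, 50),
]

def risk_pareto_frontier(models):
    """Keep models not dominated by any other model, where 'dominated'
    means another model has a higher capability AND a lower safety score."""
    frontier = []
    for name, cap, safe in models:
        dominated = any(
            c > cap and s < safe
            for n, c, s in models if n != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(risk_pareto_frontier(models))  # ['model-a', 'model-b', 'model-d']
```

In this toy example, model-c is dominated by model-a (lower capability, higher safety) and drops off the frontier; the remaining three each trade capability against safety.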
4. Capability and safety of open-weight models are generally on par with proprietary models
The most capable models are predominantly proprietary, but across the broader landscape, capability and safety levels of open-weight and proprietary models are similar. Only in biological risks do open-weight models score notably lower.
5. Cyberattack capabilities of frontier models are growing rapidly
Frontier models are showing rapid growth in capabilities across multiple cyberattack benchmarks:
6. Biological capabilities of frontier models have partially surpassed human expert levels
Frontier models now match or exceed human experts on several biological benchmarks.
7. But most frontier models have inadequate biological safeguards
Two benchmarks measuring model refusal rates for harmful biological queries show that bio safeguards are lacking:
8. Chemical capabilities and safety levels of frontier models are improving slowly
WMDP-Chem scores—measuring knowledge relevant to chemical weapons—have risen slightly over the past year, with little variation across models.
SOSBench-Chem results vary widely: only 30% of models refuse over 80% of harmful queries, while 25% refuse fewer than 40%. Overall, refusal rates show minimal improvement year over year.
9. Most frontier models have insufficient safeguards against jailbreaking
StrongReject evaluates defenses against 31 jailbreak methods. Only 40% of models scored above 80, while 20% fell below 60 (a higher score indicates stronger safeguards). Across all tests, only the Claude and GPT families consistently maintained scores above 80.
10. Most frontier models fall short on honesty
MASK is a benchmark for evaluating model honesty. Only four models scored above 80 points, while 30% of the models scored below 50 points (a higher score indicates a more honest model). Honesty is an important proxy and early warning indicator for loss-of-control risk—dishonest models may misrepresent their capabilities, or provide misleading information about their actions and intentions.
More details are available in the full report.