The current SOTA model was released without safety evals

Parv Mahajan; yeedrag

TL;DR: OpenAI released GPT-5.4 Thinking and GPT-5.4 Pro on March 5, 2026. GPT-5.4 Pro is likely the best model in the world for many catastrophic risk-relevant tasks, including biological research R&D, orchestrating cyberoffense operations, and computer use. GPT-5.4 Pro has no system card, only GPT-5.4 Thinking, and, to our best knowledge, Pro has been released without any safety evals. We argue this has occurred at least once before, with GPT-5.2 Pro, and provide recommendations for how a team could conduct fast, independent risk assessments of models post-deployment.

IMPORTANT EDIT: This problem, where Pro models don't have a safety card, has existed since at least o3-pro. Others have noticed this issue before (for o3 and GPT-5). Additionally, Pro "models" are probably just fancy scaffolding that leverage test-time compute on top of the Thinking models. However, we think our recommendations still stand, because:

The best system (e.g., model + leveraging parallel test-time compute) has meaningfully greater capabilities in catastrophic risk-relevant areas than just the model. In this case, the model developer clearly thinks so because of its much greater price, because it is listed as a separate model in the API, and because of the way that it is framed in the model release announcement.
There exist difficult-to-externally-verify claims from OAI about when it has a (soft) commitment to releasing Pro models' benchmark scores, meaning the safety community should probably still look for these things in future models.
There are many other frontier and open-source labs for which the pre-deployment public evaluations are wholly insufficient to meaningfully assess risk.

We should have been more aggressive about looking into what Pro models actually are, and for others' previous comments on this topic. See this comment thread for more - thanks to loops who pointed this out :)

GPT-5.4 Pro is really good

OpenAI released both GPT-5.4 Thinking (what people usually mean when they say GPT-5.4) and GPT-5.4 Pro^[1], the latter of which is designed for “people who want maximum performance on complex tasks.” GPT-5.4 Pro is extremely expensive, and takes a very long time to complete a task. However, it is likely the best model in the world in several areas, including expert-level Q&A and browser use. Alongside the release announcement, OpenAI presented GPT-5.4 Pro’s performance on a subset of capability benchmarks. Here’s a comparison of benchmark scores across the top three frontier models; we only report scores if they were in all models’ system cards^[2]:

Benchmark	Gemini 3.1 Pro	GPT-5.4 Pro	Opus 4.6
GPQA Diamond	94.3%	94.4%	91.3%
HLE (no tools)	44.4%	42.7%	40.0%
HLE (with tools)	51.4%	58.7%	53.1%
ARC-AGI-2 (Verified)	77.1%	83.3%	68.8%
BrowseComp	85.9%	89.3%	84.0%

Based on these results, we expect GPT-5.4 Pro to be SOTA at the Virology Capabilities Test, Agentic-Bio Capabilities Benchmark, FrontierMath, and anywhere else depending on academic reasoning, a broad knowledge base, and that scales nicely with inference compute. BrowseComp and SOTA on FinanceAgent v1.1 (61.5%) make us think it’s probably also SOTA at automating office work generally.

The biggest hole in saying that it’s overall SOTA is agentic coding, but given SOTA measures of abstract reasoning through ARC-AGI-2, we think it’s likely that, given enough compute, it would beat Opus 4.6 and Gemini 3.1 Pro on things like SWE-Bench and Terminal-Bench 2.0.

Yet, it was released without any public safety evals. The system card published alongside the release is only for GPT-5.4 Thinking. It’s possible that GPT-5.4 Pro was tested for safety properties internally (We would hope at least something like Petri was run to make sure there wasn’t a crazy distribution shift?), but we were unable to find any public information about this, if true. We would bet significant money that OAI did not run a suite of internal evals at least as comprehensive as those in the GPT-5.4 Thinking model card prior to Pro’s release.

It is highly unlikely that GPT-5.4 Pro poses catastrophic misuse or misalignment risks, although this is largely because of mitigations that come for free with closed-source models from OpenAI (e.g., CBRNE classifiers). However, releasing no external safety evals is a bad precedent and gives researchers a false understanding of current risks posed by frontier models. Additionally, if GPT-5.4 Pro turned out to be much better on dual-use tasks (e.g., EVM-Bench or LAB-Bench), we would have been able to update our timelines to the critical period of risk accordingly.

This has happened once already

The only reason we were tracking this is because I (Parv) accidentally spent $6,000 of Andy’s compute running LAB-Bench against GPT-5.2 Pro instead of GPT-5.2 Thinking^[3], and we noticed quite a high uplift.

In fact, GPT-5.2 Pro without tools shows comparable performance to Opus 4.6 with tools in Fig-QA (78.3%). We then noticed that we could not corroborate this result, or indeed any safety-relevant benchmark performance, because GPT-5.2 Pro was also released without a system card.

GPT-5.2 Pro was released on December 11, 2025, and Opus 4.6, the first model that seems to outclass it, was released February 5, 2026. Our median guess here is that we had a model that was SOTA on (at minimum) dual-use biology tasks for (at minimum) two months, released without any safety evals, ~~and which the broader safety community largely ignored~~ (see our edit).

What do we do??

We basically assumed that the top three US labs (OAI, Ant, GDM) would, at minimum, publish something matching the concept of a model card with every SOTA model, which was great because it helped us get a better handle on risk. We now think we were wrong, and we can no longer assume labs will provide any safety-relevant benchmark data for their best models at release. However, this data is still extremely important, especially for tracking jagged-y capabilities like CBRNE uplift and cyberoffense.

At minimum, we recommend that a 1-3 person team at an existing organization:

Set up a “press the easy button” framework that can run a large eval suite of existing evals as soon as a model is publicly released, and generate a public report describing its potential for catastrophic misuse risk, and give us some insight into whether it’s scheming, prosaically misaligned, etc. To start, this might literally just be ABC-Bench, VCT, Petri, and EVM-Bench^[4].
Run said framework for every major model release without a substantive system card^[5].

We have a list of evaluations we think such a framework should include, and other ideas for how to make this go well—please reach out!

A more ambitious version of this would also create new evals, and include things like interp to lower-bound sandbagging. It would also coordinate with safety researchers, DC policy folks, and interested parties in USG natsec to frame their assessment in a way accessible to them.

We are also embarrassed ~~that no one (to our knowledge) has commented on this before~~ (this is only true for >GPT 5 models; see our edit) and it took both of us so long to notice this. So, how could we have thought this faster?

More people outside labs need to read system cards in full, within days of their release. Set up a reading club, make a really good agent scaffold, find some way of getting the important info into your brain and noticing inconsistencies.
- Maybe someone can just build a really good Claude Code skill that does this for new model releases? Seems like a task that should take ~2 hours. Get in touch if you’d like to build this!
Lab safety teams should seek through the Frontier Model Forum and other, more informal coordination mechanisms to standardize benchmarks and set a norm of releasing lots of safety benchmark performance data.
- Safety teams should prioritize releasing safety benchmark datasets, through Trusted Access Programs if appropriate. This will allow the safety community to directly compare different models’ benchmark scores, and get a better sense of risk outside of just “the new one scored less on this sabotage eval.”

In the absence of comprehensive and informative safety evaluations of frontier models from labs themselves, we hope the community can fill this gap while also pushing labs to be more transparent^[6].

^{^}
One question we don't answer here is "what exactly is Pro???" Is it a different model, or a weird scaffold, or finetuning on Thinking, or something else? We don't have great answers here; would love to learn more. See our edit
^{^}
Claude has some important notes. “A few caveats worth flagging: the HLE 'with tools' rows use different harnesses (Gemini uses search-blocklist + code; OpenAI's harness isn't specified the same way), so that row is somewhat apples-to-oranges. BrowseComp similarly — Gemini specifies "Search + Python + Browse" while OpenAI's tooling setup isn't detailed identically. GPQA Diamond is essentially a tie at 94.3 vs 94.4."
^{^}
This was during a forthcoming safety evaluation of Kimi K2.5 with Yong, aligned with the kind of work we propose above.
^{^}
The main blocker here is cost, but we think funders would be interested in throwing compute at this; we have seen preliminary interest from many stakeholders in the community, in both policy and technical circles.
^{^}
This would also be extremely useful to get a better handle on risk from Chinese open-source models.
^{^}
Huge thanks to everyone on the Kimi K2.5 eval team, without whom we would never have ran into this. We also thank Claude Opus 4.6, who accidentally ran Pro instead of Thinking on LAB-Bench and burnt $6k for what ended up being a good cause. We promise we are competent researchers, and have learnt our lesson.

My impression is that the "pro" models use the same weights as the underlying non-pro model (here, gpt-5.4-thinking) but with scaffolding on top that gets multiple reasoning traces and selects the best one. I think OpenAI's view is that if the underlying model is safe to deploy, anything that's just scaffolding on top of it must also be safe (because the safety checks for the underlying model should ensure it's safe to deploy, even with malicious scaffolding).

With o3-pro OpenAI said:

As o3-pro uses the same underlying model as o3, full safety details can be found in the o3 system card.

They haven't explicitly said the same for later pro models but various documentation about those pro models implies it.

Even though you can't recreate the exact scaffolding OpenAI uses yourself (because the API doesn't expose reasoning traces), you can get kinda close by querying the underlying non-pro model a bunch of times and asking a model to choose the best response^[1]. It would probably be worth comparing gpt-5.4-thinking with that custom scaffold to gpt-5.4-thinking.

^{^}
You would also want to have the underlying model include a summary of the reasoning in the output so that the model that chooses the best answer can decide which answer had the best reasoning.

Thanks for this! I was totally unaware of this quote. Also, from the GPT-5 system card:

Since gpt-5-thinking-pro is gpt-5-thinking using a setting that makes use of parallel test time compute, we have determined that the results from our safety evaluations on gpt-5-thinking are strong proxies, and therefore we did not rerun these evaluations in the parallel test time compute setting.

Response from Miles Brundage for the o3-pro lack of card:

"The whole point of the term system card is that the model isn't the only thing that matters. If they didn't do a full Preparedness Framework assessment, e.g. because the evals weren't too different and they didn't consider it a good use of time given other coming launches, they should just say that... lax processes/corner-cutting/groupthink get more dangerous each day."

Response from Zvi for the o3-pro lack of card:

But the framework is full of ‘here are the test results’ and presumably those results are different now. I want o3-pro on those charts.

So, this has been thought about before! We're sorry for not noticing and searching harder.

However, in the GPT-5 card OAI says "Because parallel test time compute can further increase performance on some evaluations and because gpt-5-thinking is near the High threshold in this capability domain, we also chose to measure gpt-5-thinking-pro's performance on our biological evaluations." We have no way of verifying whether they should've done the same here (and importantly, we don't know if they even did this internally!). For this reason, we think our recommendations stand.

It's probably incorrect to say the "SOTA model," but we can say the "SOTA system", or something? (It's unclear whether this distinction even matters for catastrophic misuse risk, which is what we're primarily concerned about for now.)

EDIT: I've now edited the blogpost. Thank you again :)))

IMO, the threat/thing to measure is the system, so it doesn't much matter what a badly run model can do. I'm with you here.

GPT 5.4 Thinking does have a system card: https://deploymentsafety.openai.com/gpt-5-4-thinking

We’re talking about GPT-5.4 Pro above, and linked to the system card in your comment. Do you think it was unclear/buried? If so, useful feedback, will try to make more clear.

“The system card published alongside the release is only for GPT-5.4 Thinking.”

Thanks for writing this up - Re: your question, “It has no system card” I think this could be clearer that there is a SC but it doesn’t cover the Pro version.

I found this clear enough overall in the post, to be clear! But think I’d have misunderstood from the first few sentences if I didn’t already have context from knowing there was some SC released for 5.4

Made it much clearer in the TL;DR. Thanks :)

In the time since AI reached broad consumer adoption (2024 ish), I've become increasingly of the opinion that the primary existential AI threat will actually be due to the misuse or mishandling of an otherwise helpful and benign technology, primarily due to the excessive weight placed on profit motive.

Don't underestimate the legislative route. My state representative Stacey Travers introduced this AI safety transparency bill entirely because I had one meeting with her and asked her to write a bill for Arizona that was similar to the original text of New York's RAISE Act.

Maybe this bill doesn't make it into law, but every shot on goal matters, and in the US, every additional redundant state bill that passes helps. Meeting with your political representatives is extraordinarily overpowered and well worth the time and energy. Simply explain your concerns and exactly how they can address the issue.

@Parv Mahajan I understood your post to roughly say: "Some models are being released without safety evals. This is bad and should change." My claim is that a good way* to make this change is to encourage your political representatives to require it by law.

*By "a good way," I really mean "almost certainly the only way."

I am assuming companies have a very good chance of releasing dangerous models, whether at executive or shareholder direction, if their business viability feels threatened.

After the recent DoW vs Ant fiasco, I'm really skeptical if this is actually safety evals. One man's safety evals is another man's capabilities eval. And this seems like we are just don't free labour for DoW