My impression is that the "pro" models use the same weights as the underlying non-pro model (here, gpt-5.4-thinking), but with scaffolding on top that generates multiple reasoning traces and selects the best one. I think OpenAI's view is that if the underlying model is safe to deploy, then anything that is merely scaffolding on top of it must also be safe, because the safety checks for the underlying model should ensure it's safe to deploy even with malicious scaffolding.
With o3-pro OpenAI said:
As o3-pro uses the same underlying model as o3, full safety details can be found in the o3 system card.
They haven't explicitly said the same for later pro models but various documentation about those pro models implies it.
Even though you can't recreate the exact scaffolding OpenAI uses yourself (because the API doesn't expose reasoning traces), you can get reasonably close by querying the underlying non-pro model several times and asking a model to choose the best response[1]. It would probably be worth comparing gpt-5.4-thinking with that custom scaffold against gpt-5.4-pro.
You would also want to have the underlying model include a summary of the reasoning in the output so that the model that chooses the best answer can decide which answer had the best reasoning.
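The best-of-n scaffold described above can be sketched as follows. This is a toy illustration, not OpenAI's actual scaffolding: `generate` and `judge` are stand-ins for calls to the underlying non-pro model, and all names here are ours.

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    judge: Callable[[str, List[str]], int],
    n: int = 4,
) -> str:
    """Query the underlying model n times, then let a judge pick a winner.

    Each candidate should include a summary of its reasoning so the
    judge can compare reasoning quality, not just final answers.
    """
    candidates = [generate(prompt) for _ in range(n)]
    best_index = judge(prompt, candidates)
    return candidates[best_index]

if __name__ == "__main__":
    # Toy stand-ins: a real scaffold would call the model API here.
    def generate(prompt: str) -> str:
        return f"Reasoning summary + answer for: {prompt}"

    def judge(prompt: str, candidates: List[str]) -> int:
        # A real judge would be another model call scoring the reasoning;
        # here we just pick the longest candidate.
        return max(range(len(candidates)), key=lambda i: len(candidates[i]))

    print(best_of_n("What is 2 + 2?", generate, judge))
```

The real version would parallelize the n queries and prompt the judge model with all candidates (including their reasoning summaries) at once.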
Thanks for this! I was totally unaware of this quote. Also, from the GPT-5 system card:
Since gpt-5-thinking-pro is gpt-5-thinking using a setting that makes use of parallel test time compute, we have determined that the results from our safety evaluations on gpt-5-thinking are strong proxies, and therefore we did not rerun these evaluations in the parallel test time compute setting.
Response from Miles Brundage for the o3-pro lack of card:
"The whole point of the term system card is that the model isn't the only thing that matters. If they didn't do a full Preparedness Framework assessment, e.g. because the evals weren't too different and they didn't consider it a good use of time given other coming launches, they should just say that... lax processes/corner-cutting/groupthink get more dangerous each day."
Response from Zvi for the o3-pro lack of card:
But the framework is full of ‘here are the test results’ and presumably those results are different now. I want o3-pro on those charts.
So, this has been thought about before! We're sorry for not noticing this and not searching harder.
However, in the GPT-5 card OAI says "Because parallel test time compute can further increase performance on some evaluations and because gpt-5-thinking is near the High threshold in this capability domain, we also chose to measure gpt-5-thinking-pro's performance on our biological evaluations." We have no way of verifying whether they should've done the same here (and importantly, we don't know if they even did this internally!). For this reason, we think our recommendations stand.
It's probably incorrect to say the "SOTA model," but we can say the "SOTA system", or something? (It's unclear whether this distinction even matters for catastrophic misuse risk, which is what we're primarily concerned about for now.)
EDIT: I've now edited the blogpost. Thank you again :)))
I am assuming companies have a very good chance of releasing dangerous models, whether at executive or shareholder direction, if their business viability feels threatened.
TL;DR: OpenAI released GPT-5.4 Thinking and GPT-5.4 Pro on March 5, 2026. GPT-5.4 Pro is likely the best model in the world for many catastrophic risk-relevant tasks, including biological R&D, orchestrating cyberoffense operations, and computer use. It has no system card and, to the best of our knowledge, has been released without any safety evals. We argue this has occurred at least once before, with GPT-5.2 Pro, and provide recommendations for how a team could conduct fast, independent risk assessments of models post-deployment.
IMPORTANT EDIT: This problem, where Pro models don't have a safety card, has existed since at least o3-pro. Others have noticed this issue before (for o3 and GPT-5). Additionally, Pro "models" are probably just fancy scaffolding that leverages test-time compute on top of the Thinking models. However, we think our recommendations still stand, because:
We should have been more aggressive about looking into what Pro models actually are, and for others' previous comments on this topic. See this comment thread for more - thanks to loops who pointed this out :)
GPT-5.4 Pro is really good
OpenAI released both GPT-5.4 Thinking (what people usually mean when they say GPT-5.4) and GPT-5.4 Pro[1], the latter of which is designed for “people who want maximum performance on complex tasks.” GPT-5.4 Pro is extremely expensive, and takes a very long time to complete a task. However, it is likely the best model in the world in several areas, including expert-level Q&A and browser use. Alongside the release announcement, OpenAI presented GPT-5.4 Pro’s performance on a subset of capability benchmarks. Here’s a comparison of benchmark scores across the top three frontier models; we only report scores if they were in all models’ system cards[2]:
Benchmark | Gemini 3.1 Pro | GPT-5.4 Pro | Opus 4.6
Based on these results, we expect GPT-5.4 Pro to be SOTA at the Virology Capabilities Test, the Agentic-Bio Capabilities Benchmark, FrontierMath, and anything else that depends on academic reasoning and a broad knowledge base and scales nicely with inference compute. Its BrowseComp performance and SOTA result on FinanceAgent v1.1 (61.5%) make us think it's probably also SOTA at automating office work generally.
The biggest hole in the case that it's overall SOTA is agentic coding, but given its SOTA abstract-reasoning performance on ARC-AGI-2, we think it's likely that, with enough compute, it would beat Opus 4.6 and Gemini 3.1 Pro on benchmarks like SWE-Bench and Terminal-Bench 2.0.
Yet it was released without any public safety evals. The system card published alongside the release covers only GPT-5.4 Thinking. It's possible that GPT-5.4 Pro was tested internally for safety properties (we would hope at least something like Petri was run to make sure there wasn't a crazy distribution shift), but we were unable to find any public information suggesting this. We would bet significant money that OAI did not run a suite of internal evals at least as comprehensive as those in the GPT-5.4 Thinking system card prior to Pro's release.
It is highly unlikely that GPT-5.4 Pro poses catastrophic misuse or misalignment risks, although this is largely because of mitigations that come for free with closed-source models from OpenAI (e.g., CBRNE classifiers). However, releasing no external safety evals sets a bad precedent and gives researchers a false picture of the risks currently posed by frontier models. Additionally, if published evals had shown GPT-5.4 Pro to be much better on dual-use tasks (e.g., EVM-Bench or LAB-Bench), we would have been able to update our timelines to the critical period of risk accordingly.
This has happened once already
The only reason we were tracking this is because I (Parv) accidentally spent $6,000 of Andy’s compute running LAB-Bench against GPT-5.2 Pro instead of GPT-5.2 Thinking[3], and we noticed quite a high uplift.
In fact, GPT-5.2 Pro without tools shows comparable performance to Opus 4.6 with tools in Fig-QA (78.3%). We then noticed that we could not corroborate this result, or indeed any safety-relevant benchmark performance, because GPT-5.2 Pro was also released without a system card.
GPT-5.2 Pro was released on December 11, 2025, and Opus 4.6, the first model that seems to outclass it, was released February 5, 2026. Our median guess here is that we had a model that was SOTA on (at minimum) dual-use biology tasks for (at minimum) two months, released without any safety evals,
and which the broader safety community largely ignored (see our edit).

What do we do??
We had basically assumed that the top three US labs (OAI, Ant, GDM) would, at minimum, publish something matching the concept of a model card with every SOTA model; this assumption was convenient because it helped us get a better handle on risk. We now think we were wrong, and we can no longer assume labs will provide any safety-relevant benchmark data for their best models at release. However, this data is still extremely important, especially for tracking jagged capabilities like CBRNE uplift and cyberoffense.
At minimum, we recommend that a 1-3 person team at an existing organization:
We have a list of evaluations we think such a framework should include, and other ideas for how to make this go well—please reach out!
A more ambitious version of this would also create new evals, and include things like interp to lower-bound sandbagging. It would also coordinate with safety researchers, DC policy folks, and interested parties in USG natsec to frame their assessment in a way accessible to them.
We are also embarrassed
that no one (to our knowledge) has commented on this before (this is only true for models after GPT-5; see our edit) and that it took both of us so long to notice. So, how could we have thought this faster?

In the absence of comprehensive and informative safety evaluations of frontier models from labs themselves, we hope the community can fill this gap while also pushing labs to be more transparent[6].
One question we don't answer here is "what exactly is Pro???" Is it a different model, a weird scaffold, finetuning on Thinking, or something else? We don't have great answers here; we would love to learn more. See our edit.

Claude has some important notes: "A few caveats worth flagging: the HLE 'with tools' rows use different harnesses (Gemini uses search-blocklist + code; OpenAI's harness isn't specified the same way), so that row is somewhat apples-to-oranges. BrowseComp similarly — Gemini specifies 'Search + Python + Browse' while OpenAI's tooling setup isn't detailed identically. GPQA Diamond is essentially a tie at 94.3 vs 94.4."
This was during a forthcoming safety evaluation of Kimi K2.5 with Yong, aligned with the kind of work we propose above.
The main blocker here is cost, but we think funders would be interested in throwing compute at this; we have seen preliminary interest from many stakeholders in the community, in both policy and technical circles.
This would also be extremely useful to get a better handle on risk from Chinese open-source models.
Huge thanks to everyone on the Kimi K2.5 eval team, without whom we would never have run into this. We also thank Claude Opus 4.6, who accidentally ran Pro instead of Thinking on LAB-Bench and burnt $6k for what ended up being a good cause. We promise we are competent researchers, and have learnt our lesson.