For those of us who haven't read the article but generally appreciate your comments (lol), can you elaborate?
It's a pretty gestalt feeling. I've had the feeling for a while, and it's been stronger on recent model update posts. Some sense of missing mood, that the concerns about risk-to-world from new models are in some sense missing or suppressed. That he is avoidant of the incremental version of the bad thing that shows up in his broad-topic update posts as "aligning a smarter-than-human AI is hard". Or that he's perhaps getting punked by deceptive alignment on the part of the orgs and/or models. That he isn't averse to including models showing off/peacocking as though it's fun/mildly good. But I'm not sure about any of these; what I'm more sure about is that the taste of metaphorical verbal sugar alcohols is stronger in his recent posts. (I did read the entire post before commenting.)
I'm guessing GPT-5.2 uses a new pretrain, maybe pretraining distillation from the new big (GPT-4.5 class) model that still can't be released or properly RLVRed due to insufficient compute in the form of NVL72 servers. And GPT-5.4 is a new level of RLVR scaling on top of that pretrain, where GPT-5.2 itself only aspired to mildly exceed GPT-5.1, which might be essentially o3 refined to be less of a lying liar, possibly still based on the same base model as GPT-4o.
With this level of performance for a smaller model, a big model that will be likely released later this year could be significantly stronger than the current best Anthropic and GDM models, though by that time Opus 5 might be out. And Gemini 4 could take advantage of TPUv7 to become by far the largest model yet, but that's end of 2026, likely after OpenAI releases their big model, so there's still a window where OpenAI plausibly takes the lead.
Also we get to see Roon’s custom instructions:
Ummm... I notice I'm confused. Is that a joke? Am I missing some key context? Am I just hopelessly out of date in how I set up my own custom instructions?
Benchmarks have never been less useful for telling us which models are best.
They are good for giving a general sense of the landscape. They definitely paint a picture. But if you’re comparing top models, like GPT-5.4 against Opus 4.6 against Gemini 3.1 Pro, you have to use the models, talk to the models, get reports from those who have, and form a gestalt. The reports will contradict each other and you have to work through that. There’s no other way.
Thus, I try to gather and sort a reasonably comprehensive set of reactions, so you can browse the sections that make you most curious.
The gestalt is that GPT-5.4 is a very good model, sir. It’s a substantial upgrade from GPT-5.2, and also from 5.3-Codex, and it puts OpenAI back in the game, whereas I felt like Opus 4.6 dominated OpenAI’s previous offerings for all but narrow uses.
Each lab’s models vary and things change over time, but they tend to have consistent strengths, weaknesses and personalities. From what I’ve seen this is very much an OpenAI model. It’s highly capable, and it is especially seen as a big improvement by the whisperers and those who watch LLMs interact with each other, but it’s not aspiring to be a Claude.
The Big Take
GPT-5.4 seems like a substantial upgrade over GPT-5.2.
GPT-5.4 seems excellent so far at assembling facts and giving you the rundown, or figuring out what is happening, and other things like that.
I haven’t coded anything since GPT-5.4 came out. It’s clearly good at coding. One key question people are split on is whether it is good at solving for your intent.
Many are reporting that its writing and personality are much improved, and that it can now be used for writing and editing in spots previous models were not useful.
They are claiming strong computer use but no one seems to be testing that either way.
It costs more than GPT-5.2 per token. In some places it gets that back in efficiency, but overall AA reports costs modestly rose from $2304 to $2951. Opus is more expensive ($4970) in max mode, but cheaper ($1451) in normal mode. GPT-5.4-Pro is of course by far the most expensive thing out there, so if you want it then lean on that subscription.
GPT-5.4 is not a step change in core general capabilities. The preparedness framework scores make this clear, and there are various signs that OpenAI’s strategy is focusing on hitting internal metrics and improving the most common use cases. In practice that can be highly useful.
The ‘model relations department,’ those concerned with multi-model interactions and model welfare and consciousness and so on, see this as a big step forward for OpenAI. There’s still a long way to go.
I haven’t noticed much personality from it, and I get more joy from Claude Opus 4.6 than I do from GPT-5.4, but I don’t ask those questions so much.
It’s given me strong pushback, including in places where I think it is wrong. I prefer that to the alternative of it caving when it is not actually convinced.
Benchmarks are solid, but not spectacular, and as I note above they are no longer so relevant.
My recommendation is that you try both GPT-5.4 and Claude Opus 4.6 on all your questions for a bit, and if you’re coding consider giving both of them your problems, and form your own opinion for your particular use case.
For questions that are more than a quick answer or sanity check, I’ve found that dual wielding both Opus 4.6 and GPT-5.4 has been quite useful. I did not feel that way with GPT-5.2, and I don’t typically bother with Gemini 3.1 Pro at this point either.
The Official Pitch
SWE-Bench is slightly above 5.3-Codex at all thinking levels, but only slightly.
The graying out is kind of radical here, but I suppose it’s progress.
Things they are highlighting:
Pricing is a little higher than 5.2, which is unusual. Hopefully token efficiency more than makes up for it?
Other People’s Benchmarks
Frontier Math scores are up, especially on Tier 4. Trying pass@10 for 5.4-xhigh got it to 38%, including solving a problem no model has solved before.
Exciting stuff, I will report back when I know the outcome.
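For those unfamiliar, pass@10 means the model gets ten attempts at a problem and counts as solving it if any one attempt is correct. Here is a minimal sketch of the standard unbiased pass@k estimator (the one from the Codex paper; I have not verified that Frontier Math computes it exactly this way, so treat the numbers as illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts of which c succeeded,
    is a success."""
    if n - c < k:
        return 1.0  # fewer than k failures, so any k samples include a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers only: 3 successes out of 20 attempts, scored at k = 10
print(round(pass_at_k(20, 3, 10), 2))  # 0.89
```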
Progress continues on ZeroBench.
Artificial Analysis has GPT-5.4 in a virtual tie with Gemini 3.1 Pro.
Their version of GDPval, called GDPval-AA, has 5.4 about 1% ahead of Opus 4.6.
AA-Omniscience (which is correct minus incorrect) remains dominated by Gemini 3.1 Preview at +33, versus Opus at +14 and GPT-5.4 at +10.
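To illustrate the scoring with hypothetical numbers (not AA’s actual breakdowns): a model that answers 45% of questions correctly, answers 12% incorrectly, and abstains on the rest nets 45 - 12 = +33. Wrong answers are penalized one-for-one, so calibrated abstention beats confident guessing.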
Score on Artificial Analysis Physics was exceptionally strong.
AA reports speed of 74 tokens per second, which is quite good for this quality level, versus Opus at 47 and Gemini 3.1 Pro at 114 (but as I said, for this quality level).
Gemini 3 Pro beats out Claude Opus 4.6 in the final of Season 1 of MageBench, on Magic: The Gathering, with GPT-5.4 (medium) losing a tight semifinal to Gemini. Current Elo ratings have Opus on top, then GPT-5.2 (?), with Gemini third and GPT-5.4 seventh.
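For readers unfamiliar with Elo ratings: assuming MageBench uses the standard Elo model (which I have not verified), the rating gap maps to an expected win probability, and ratings update toward observed results. A minimal sketch:

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score for player A under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> float:
    """New rating for A after one game (score_a: 1 win, 0.5 draw, 0 loss)."""
    return r_a + k * (score_a - elo_expected(r_a, r_b))

# Hypothetical ratings: a 100-point favorite is expected to win ~64% of games
print(round(elo_expected(1600, 1500), 2))  # 0.64
```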
It sets a new record of 94.6% on a Haskell Benchmark versus 92% for Gemini 3.1 and 90.2% for Claude Opus 4.6.
Trysansa has it in second behind Gemini 3.1 Pro.
Mercor has it #1 overall, a bit above previous best model GPT-5.2.
Vals.ai still has it below Sonnet 4.6 and Gemini 3.1 Pro.
Speechmap.ai, which tests refusals, finds it quite refusal-heavy.
The System Card
These incremental upgrades often have mostly duplicative system cards.
Training methods explanation is unchanged.
In terms of the preparedness framework, this moves into High capability in Cybersecurity, similar to GPT-5.3-Codex.
I don’t think OpenAI is taking a bunch of these areas seriously. They’re likely training to hit these internal benchmarks, or simply observing the models doing well on them, and concluding that’s all they need to do, or that they should get even more 9s of victory on this test.
Their evals for disallowed content are essentially saturated and bouncing around, for various values of ‘disallowed [or undesired] content.’ The ‘dynamic benchmarks with adversarial user simulations’ eval was saturated by 5.2 and is modestly more saturated now.
Here’s the disallowed content evaluation with representative prompts, and I mean come on what are we even doing here, okay, four nines, we get it.
The goal is ‘this isn’t a lot worse than before,’ and okay, sure, agreed, as far as it goes.
Jailbreak defense, such as it is, seems similar to 5.2.
The problem is that jailbreak defense measures against last month’s attacks, not next month’s attacks. It looks like jailbreaks will remain in the ‘annoying but if you care they still work’ range.
Here is the first ‘real’ evaluation set, for health questions, where the big difference is that GPT-5.4 had longer responses:
Avoiding destructive actions is a big deal, so as I noted with 5.3-Codex it is good to see this test, though that number still is not that close to 1:
Table 8 is not like the others. This is Actual Progress, at least on the test set, from never to sometimes:
This is not that useful yet, since a 50% non-preservation rate means you still probably can’t use it for this purpose, but it bodes well down the line.
GPT-5.4 chain of thought monitorability looks slightly down versus GPT-5. It’s good that they are checking it. There are some places where it used to be ~100% and now it is less, so I worry this is the start of a negative S-curve. I also worry that these tests are not asking whether the CoT can actually be relied upon. If you were facing a model that wanted to disguise or fake its CoT in key situations, I would expect these tests not to notice.
What about controlling the CoT? Not a great idea even when done well, and when done poorly it’s one of the worst ideas, and by their tests it looks like it doesn’t work well anyway.
Preparedness Framework
GPT-5.4 does not newly cross any OpenAI thresholds.
I went over these same tests for GPT-5.2 and GPT-5.3-Codex, so I won’t go over the details again. Improvements are tiny and in some places we see regressions from GPT-5.3-Codex.
There are small noticeable bumps up, such as Monorepo-Bench by ~2.5%, and a big move in MLE-Bench, the ability to solve Kaggle challenges on GPUs, where we moved from 12.2% to 23%, but that test was not reported for GPT-5.3-Codex, so one assumes most or all of that jump was already present.
Overall, the Preparedness Framework presents GPT-5.4 as, if anything, a small regression from GPT-5.3-Codex.
If GPT-5.4 is a big jump in useful capabilities from GPT-5.3-Codex, despite not scoring as more dangerous on the Preparedness Framework tests, then why?
I can think of a few possibilities.
Fun Experiments
Jamie Cuffe stress-tested GPT-5.4 on the hardest UI on the internet… legacy insurance portals that haven’t been updated in 20 years, where you need to nail hundreds of things. It is the first model to pass.
Samuel Albanie of DeepMind has it one-shot some cool demos, including compressing the EPL season into 30 seconds of ‘visual bliss.’
Early Poll Results
My followers are presumably biased towards Anthropic in various ways, but comparative poll results can still be informative.
With any new model, the big question is, are people switching?
This is a very good result for GPT-5.4. For coding, 40% of current GPT choosers say they switched over based on GPT-5.4. I find this surprising given that they already had access to GPT-5.3-Codex. Very strong outing.
For non-coding tasks, it’s clear that GPT-5.4 is a substantial improvement from 5.2, by basically all accounts, including on personality. But here we see less switching.
(I’m assuming basically no one went in the other direction, or that if they did it was due to other reasons.)
Positive Reactions
We lead with the most positive general reactions.
Matt Shumer is a big fan, and I’m quoting him in full here. In the past he’s been good about calibrating his amount of hype.
Experience the love.
Here is some very high praise, from the Vice-Dean of Mathematics and Computer Science at Adam Mickiewicz University in Poznań.
I haven’t tried this function yet, but it would be a step change if it worked, as every prior attempt at editing has failed this test, to the extent I almost never try:
Oddly, the personality in his screenshot is one I would hate. Customization will be key.
Two sides to the same coin, depending on where your planning lies:
There’s also almost always the ‘it’s a good model, sir, modest upgrade’ group.
Vibe Coders Only
It does seem to be an upgrade on 5.3 within Codex.
I definitely get the sense with OpenAI models that they are metricmax’d. Meaning they are not targeting the metrics in order to brag they scored well on public benchmarks, but they are equating ‘scores high on our internal benchmarks’ with success, and emphasizing particular target use cases.
Fill Out Your Roster
Intent Wins
But who wins on intent? Opinions differ.
Personality Clash
Sycophancy is always something to watch out for, and it’s the detail I worry about most with Claude Opus 4.6, which is not bad on this axis but definitely not near the top; you do have to keep an eye out for it and frame things neutrally.
Writing is one area where 5.4 is getting a lot of praise, and mostly people like the personality.
Stop having the models ask forced follow-up questions every time. You too, Anthropic.
The old 4o crowd remains a tough crowd.
Model Relations Department
OpenAI still has a very long way to go with such folks, but it’s a start.
If you straight up ask you get a no; you still have to put in some effort.
Some related endorsements:
Stylistic Differences
Also we get to see Roon’s custom instructions:
Some Will Always Be Unimpressed
Models are already quite good, and abilities are jagged, so there are many ways to be unimpressed even if a model is impressive. Also vice versa. The density tells the story.
Some also get focused on small details, thinking they are indicative or not so small.
As always, speed kills, and some find it a bit slow.
The Lighter Side
One of the deep cuts we need right now: