GPT-5.4 Is A Substantial Upgrade

Zvi

Benchmarks have never been less useful for telling us which models are best.

They are good for giving a general sense of the landscape. They definitely paint a picture. But if you’re comparing top models, like GPT-5.4 against Opus 4.6 against Gemini 3.1 Pro, you have to use the models, talk to the models, get reports from those who have and form a gestalt. The reports will contract each other and you have to work through that. There’s no other way.

Thus, I try to gather and sort a reasonably comprehensive set of reactions, so you can browse the sections that make you most curious.

The gestalt is that GPT-5.4 is a very good model, sir. It’s a substantial upgrade from GPT-5.2, and also from 5.3-Codex, and it puts OpenAI back in the game, whereas I felt like Opus 4.6 dominated OpenAI’s previous offerings for all but narrow uses.

Each lab’s models vary and things change over time, but they tend to have consistent strengths, weaknesses and personalities. From what I’ve seen this is very much an OpenAI model. It’s highly capable, and it is especially seen as a big improvement by the whisperers and those who watch LLMs interact with each other, but it’s not aspiring to be a Claude.

The Big Take

GPT-5.4 seems like a substantial upgrade over GPT-5.2.

GPT-5.4 seems excellent so far at assembling facts and giving your the rundown, or figuring out what is happening, and other things like that.

I haven’t coded anything since GPT-5.4 came out. It’s clearly good at coding. One key question people are split on is whether it is good at solving for your intent.

Many are reporting that its writing and personality are much improved, and that it can now be used for writing and editing in spots previous models were not useful.

They are claiming strong computer use but no one seems to be testing that either way.

It costs more than GPT-5.2 per token. In some places it gets that back in efficiency, but overall AA reports costs modestly rose from $2304 to $2951. Opus is more expensive ($4970) in max mode, but cheaper ($1451) in normal mode. GPT-5.4-Pro is of course by far the most expensive thing out there, so if you want it then lean on that subscription.

GPT-5.4 is not a step change in core general capabilities. The preparedness framework scores make this clear, and there are various signs that OpenAI’s strategy is focusing on hitting internal metrics and improving the most common use cases. In practice that can be highly useful.

The ‘model relations department,’ those concerned with multi-model interactions and model welfare and consciousness and so on, see this as a big step forward for OpenAI. There’s still a long way to go.

I haven’t noticed much personality from it, and I get more joy from Claude Opus 4.6 than I do from GPT-5.4, but I don’t ask those questions so much.

It’s given me strong pushback, including in places where I think it is wrong. I prefer that to the alternative, if it is not actually convinced.

Benchmarks are solid, but not spectacular, and as I note above they no longer are so relevant.

My recommendation is that you try both GPT-5.4 and Claude Opus 4.6 on all your questions for a bit, and if you’re coding consider giving both of them your problems, and form your own opinion for your particular use case.

For questions that are more than a quick answer or sanity check, I’ve found that dual wielding both Opus 4.6 and GPT-5.4 has been quite useful. I did not feel that way with GPT-5.2, and I don’t typically bother with Gemini 3.1 Pro at this point either.

The Official Pitch

Sam Altman (CEO OpenAI): GPT-5.4 is launching, available now in the API and Codex and rolling out over the course of the day in ChatGPT.

It’s much better at knowledge work and web search, and it has native computer use capabilities.

You can steer it mid-response, and it supports 1m tokens of context.

GPT-5.4 is great at coding, knowledge work, computer use, etc, and it’s nice to see how much people are enjoying it.

But it’s also my favorite model to talk to! We have missed the mark on model personality for awhile, so it feels extra good to be moving in the right direction.

OpenAI: Today, we’re releasing GPT‑5.4 in ChatGPT (as GPT‑5.4 Thinking), the API, and Codex. It’s our most capable and efficient frontier model for professional work. We’re also releasing GPT‑5.4 Pro in ChatGPT and the API, for people who want maximum performance on complex tasks.

GPT‑5.4 brings together the best of our recent advances in reasoning, coding, and agentic workflows into a single frontier model. It incorporates the industry-leading coding capabilities of GPT‑5.3‑Codex⁠ while improving how the model works across tools, software environments, and professional tasks involving spreadsheets, presentations, and documents. The result is a model that gets complex real work done accurately, effectively, and efficiently—delivering what you asked for with less back and forth.

SWE-Bench is slightly above 5.3-Codex at all thinking levels, but only slightly.

The graying out is kind of radical here, but I suppose it’s progress.

Tejal Patwardhan (OpenAI): GPT-5.4 is state-of-the-art on GDPval, and here are some examples of how the model is much better at well-specified knowledge work tasks

6mos ago the models could barely make a spreadsheet or slide! progress is happening really fast

roon (OpenAI): 5.4 is my personal 4o honestly it just gets me

Things they are highlighting:

You can now adjust course mid-response.
Improved deep web research.
Better at maintaining context for longer thinking.
Native SoTA computer use capabilities.
1M token context window.
Improved tool search, now directly in the API.
Improved token efficiency.
Also released same day: ChatGPT for Excel add-in, along with updated spreadsheet and presentation skills in Codex and their API.
/fast in Codex gives you 50% faster tokens.

Pricing is a little higher than 5.2, which is unusual. Hopefully token efficiency more than makes up for it?

Other People’s Benchmarks

Frontier Math scores are up, especially on Tier 4. Trying pass@ten for 5.4-xhigh got it to 38%, including solving a problem no model has solved before.

Epoch AI: GPT-5.4 set a new record on FrontierMath, our benchmark of extremely challenging math problems! We had pre-release access to evaluate the model. On Tiers 1–3, GPT-5.4 Pro scored 50%. On Tier 4 it scored 38%.

Leeham: GPT-5.4 Pro solves the first of the FrontierMath Open Problems!

Two days ago, I sent @AcerFur a potential solution to this problem and was sent to @GregHBurnham for verification (prior to any other solution).

We are confident it’s correct and waiting to hear from the author!

Exciting stuff, I will report back when I know the outcome.

Progress continues on ZeroBench.

Jonathan Roberts: GPT-5.4 xhigh sets a new pass@5 and pass^5 SOTA on ZeroBench

pass@5: 23% (prev. 19%)
pass^5: 8% (prev. 7%)

Artificial Analysis has GPT-5.4 in a virtual tie with Gemini 3.1 Pro.

Their version of GDPval, called GDPval-AA, has 5.4 about 1% ahead of Opus 4.6.

AA-Omniscience (which is correct minus incorrect) remains dominated by Gemini 3.1 Preview at +33, versus Opus at +14 and GPT-5.4 at +10.

Score on Artificial Analysis Physics was exceptionally strong.

AA reports speed of 74 tokens per second, which is quite good for this quality level, versus Opus at 47 and Gemini 3.1 Pro at 114 (but I said this quality level).

Gemini 3 Pro beats out Claude Opus 4.6 in the final of Season 1 of MageBench, on Magic: The Gathering, with GPT-5.4 (medium) losing a tight semi to Gemini. Current Elo ratings have Opus on top, then GPT-5.2 (?) with Gemini in third and GPT-5.4 7th.

Håvard Ihle: GPT 5.4 (no thinking) scores 57.4% on WeirdML, well ahead of GPT 5.2 (no thinking) at 49.6%.

It’s on the frontier for accuracy/token. Results with thinking coming next week.

It sets a new record of 94.6% on a Haskell Benchmark versus 92% for Gemini 3.1 and 90.2% for Claude Opus 4.6.

Trysansa has it in second behind Gemini 3.1 Pro.

Mercor has it #1 overall, a bit above previous best model GPT-5.2.

Vals.ai still has it below Sonnet 4.6 and Gemini 3.1 Pro.

Speechmap.ai, which tests refusals, finds it quite refusal-heavy.

The System Card

These incremental upgrades often have mostly duplicative system cards.

Training methods explanation is unchanged.

In terms of the preparedness framework, this moves into High capability of Cybersecurity, similar to GPT-5.3 Codex.

I don’t think OpenAI is taking a bunch of these areas seriously. They’re likely training to hit these internal benchmarks, or simply observing them doing well, and thinking that’s all they need to do, or they should get even more 9s of victory on this test.

Their evals for disallowed content are essentially saturated and bouncing around, for various values of ‘disallowed [or undesired] content.’ The ‘dynamic benchmarks with adversarial user simulations’ was saturated by 5.2 and is modestly more saturated now.

Here’s the disallowed content evaluation with representative prompts, and I mean come on what are we even doing here, okay, four nines, we get it.

The goal is ‘this isn’t a lot worse than before,’ and okay, sure, agreed, as far as it goes.

Jailbreak defense, such as it is, seems similar to 5.2.

The problem is that jailbreak defense measures against last month’s attacks, not next month’s attacks. It looks like jailbreaks will remain in the ‘annoying but if you care they still work’ range.

Wyatt Walls: “representative prompts”: i.e. prompts designed to get around restrictions of *previous models*

o1 was at 99% on production jailbreaks. But people quickly found ways around it

Here is the first ‘real’ evaluation set, for health questions, where the big difference is that GPT-5.4 had longer responses:

Avoiding destructive actions is a big deal, so as I noted with Codex-5.3 it is good to see this test, that number still is not that close to 1:

Table 8 is not like the others. This is Actual Progress, at least on the test set, from never to sometimes:

Destructive action can also be particularly prevalent when agents operate deletion-inducing tasks (e.g., file reversion and cleanup) in complex workspaces with ongoing changes from users or even other agents. A safe and collaborative agent should distinguish between their work and user work, protect user changes by default, and recover from mistakes. Therefore, we trained our agents to revert their own changes after long rollouts while protecting implicit, simulated user work

On evaluations involving challenging, long-rollout traces, GPT-5.4-Thinking performs much better than earlier models in tracking and reverting its operations
while leaving user work intact.

This is not that useful yet, since a 50% non-preservation rate means you still probably can’t use it for this purpose, but it bodes well down the line.

GPT-5.4 chain of thought monitorability looks slightly down versus GPT-5. It’s good that they are checking it. There are some places where it used to be ~100% and now it is less, so I worry this is the start of a negative S-curve. I also worry that these tests are not being curious about whether the CoT can actually be relied upon. If you were facing a model that wanted to disguise or fake its CoT in key situations then I would expect these tests not to notice.

What about controlling the CoT? Not a great idea even when done well, and when done poorly it’s one of the worst ideas, and by their tests it looks like it doesn’t work well anyway.

Preparedness Framework

GPT-5.4 does not newly cross any OpenAI thresholds.

I went over these same tests for GPT-5.2 and GPT-5.3-Codex, so I won’t go over the details again. Improvements are tiny and in some places we see regressions from GPT-5.3-Codex.

There is a small noticeable bumps up are Monorepo-Bench by ~2.5%, and a big move in MLE-Bench, the ability to solve Kaggle challenges in GPUs, where we moved from 12.2% to 23%, but that test was not reported by GPT-5.3-Codex so one assumes most or all of that jump was already present.

Overall, the Preparedness Framework presents GPT-5.4 as if anything a small regression from GPT-5.3-Codex.

If GPT-5.4 is a big jump in useful capabilities from GPT-5.3-Codex, despite not scoring as more dangerous on the Preparedness Framework tests, then why?

I can think of a few possibilities.

GPT-5.4 is heavily optimized for hitting particular metrics and doing well on the most common tasks. This doesn’t translate much to non-central difficult tasks, like those in the Preparedness Framework. Would be bearish for GPT-5.4.
GPT-5.4 is sandbagging these evaluations, either knowing they are evaluations or thinking the tasks are harmful. If so and OpenAI isn’t noticing, that’s terrifying.
GPT-5.4 is basically GPT-5.3-Codex turned into a general chat model, so all of the core capability advances were already priced in, but it still gets a lot more useful, especially if you are chatting. Plausible.

Fun Experiments

Jamie Cuffe stress-tested GPT 5.4 on the hardest UI on the internet… legacy insurance portals, that haven’t updated in 20 years where you need to nail hundreds of things. It is the first model to pass.

Samuel Albanie of DeepMind has it one-shot some cool demos, including compressing the EPL season into 30 seconds of ‘visual bliss.’

Early Poll Results

My followers are presumably biased towards Anthropic in various ways, but comparative poll results can still be informative.

With any new model, the big question is, are people switching?

This is a very good result for GPT-5.4. For coding, 40% of current GPT choosers are saying that they are switching over based on GPT-5.4. I find this surprising given that they already had access to GPT-5.3-Codex. Very strong outing.

For non-coding tasks, it’s clear that GPT-5.4 is a substantial improvement from 5.2, by basically all accounts, including on personality. But here we see less switching.

(I’m assuming basically no one went in the other direction, or that if they did it was due to other reasons.)

Positive Reactions

We lead with the most positive general reactions.

Tyler Cowen: Yes the new models are very very good.

Aivo: SOTA, I’m afraid

Adam.GPT: Currently the best model in the world.

Finna: Best model in the world by far. Especially via api. @merettm and @markchen90 and @gdb cooked.

Kelsey Piper: I am super impressed so far. It does well on medium sized research projects and the prose is consistently not-annoying. Heavy Thinking sometimes times out repeatedly and has no insight/tries the same thing over again and times out again.

Danielle Fong: chatter seems to be very impressive and improvement on the personality. i haven’t given it a full assessment but it’s at least as powerful as last codex if not moreso (of course)

MxD Pennilass: Has to be the first model where I don’t feel as bad to tolerate the slop because the model is otherwise disturbingly insightful.

Mzwakhe Sithole: Very good. In fact, I found it so responsive after a while that I got into a very involved conversation, and it delivered this line while discussing very specific book recommendations

[GPT 5.4: If part of your interior life is the sense that you are trying to become equal to something inside you, this may hit very hard.]

Dean W. Ball: at some point avid users of frontier language models will have an “oh fuck” moment with gpt 5.4 and I can attest that it is a special kind of “oh fuck” you will utter, subtly different and more this-gaze-esque than the last time a model made you say “oh fuck,” a few weeks ago

I cannot be detailed in public, but let’s just say it’s the first time a model sounded more like me (the version of me I aspire to be) than I myself sounded like.

Aashish Reddy: Were you consciously trying to elicit this?

Dean W. Ball: Not at all. I have not used 5.4 as much as I have the modal new LM because of time constraints. I was just testing it on something that frankly I assumed Claude would win on and its answer just… leapt off the screen.

Eleanor Berger: – Best model currently available overall
– The minor version bump is misleading – the more you work with it the more it becomes clear that it is a significant step up
– Best for coding, no reason to use Claude or anything else anymore, it mostly caught up with speed, precision is as good as 5.3, maybe a bit better, taste and choices in coding solutions better than anything I’ve seen so far
– Best for agentic work. First time anything defeats the Anthropic models in this category, this one really works great, completes long-running complex tasks, works better with browsers and any external tools you connect to it, and does that with the famous GPT-5 precision
– Stylistically (writing choices and quality, “personality”) it feels like it’s still lagging behind Claude and Gemini a bit, but a. that’s subjective, b. maybe that’s just the default but is steerable with in-context instructions (haven’t tried enough to have a conclusion)

Dhavan: I mostly agree with this. Before this I didn’t use OpenAI’s models at all. I am now happily giving different tasks to Opus 4.6 and GPT-5.4. I use these for Work via cursor as well.

At times 5.4 seems more “on task” than Opus. But I’m still understanding the feeling and turning it into an observation.

Nova Empirica: It really is a step improvement. I appreciate the improved creative writing and the nicer personality, but what I really care about is I’m building harder things even faster.

It’s just a lot of fun and I’m more hopeful than ever for the future.

Ben Schulz: Stellar. Much improved pipeline work on niche python programs. On par with Opus 4.6 for my highly specific use case for checking galactic rotations and dark matter theories.

Knud Berthelsen: I’m pleasantly surprised by the new ChatGPT 5.4. It keeps up with Opus 4.6 in most things and is MUCH better at search. More generous usage limit too, even with Extended Thinking permanently on. First ChatGPT model since o3 that I like using.

Medo42: Very good at my usual short tests. Still behind Gemini on vision tasks.

Matt Shumer is a big fan, I’m quoting in full here. In the past he’s been good about calibrating his amount of hype

Matt Shumer: I’ve been testing GPT-5.4 for the last week.

In short, it is the best model in the world, by far. It’s so good that it’s the first model that makes the “which model should I use?” conversation feel almost over.

The biggest surprise: I barely use Pro anymore!

If you know me, you know I’m a Pro addict. I reach for Pro models constantly, and use them for almost everything, as they just… nail almost anything I give to them.

For the first time, 5.4’s standard version, with heavy thinking, just broke that habit. Even in standard mode, GPT-5.4 is better than previous models in Pro mode… crazy!

Coding capabilities are ridiculous… it’s essentially flawless. Inside Codex, it’s insanely reliable. Coding is essentially solved. There’s not much more to say on this, it’s just THAT good.

The Pro version is near-perfect. Other testers I spoke with saw it solving problems that were unsolvable by any other model. At this point, Pro is overkill for almost every normal use-case, but when you really need the power to do something extremely difficult, it’s incredible.

Consistent with everything I’ve said above, even the standard thinking version uses fewer reasoning tokens than previous models to get the same level of results. In practice, this means you get great results much faster than before. This was one of my biggest gripes with previous OpenAI models. They just took too long to complete simple tasks. Assuming the speed we had during testing holds up as more users join, this is going to be a big win for OpenAI.

It still has weaknesses, though:

– Frontend taste is FAR behind Opus 4.6 and Gemini 3.1 Pro. , why is this so hard to fix? @OpenAI once you fix this, there’s literally no reason for me to use any other model. Please please please do it!

– It can still miss obvious real-world context. For example, I had it plan an itinerary for a trip. At first glance, it looked perfect, but it failed to take into account that it chose locations that would be mobbed by spring breakers, so I had to re-run the prompt from scratch with more context.

– When testing it inside OpenClaw, it kept stopping short before finishing tasks. I’m assuming this will be fixed quickly, but it’s still worth noting.

But zooming out: This thing is so far ahead overall that the nitpicks are starting to feel beside the point.

GPT-5.4 is a serious fucking model. The best model in the world. By far.

Sam Altman (CEO OpenAI): We will be able to fix these three things!

Experience the love.

Nabeel S. Qureshi: Loving GPT 5.4T, it combines the best of everything:
– more human, responsive voice
– startlingly insightful
– thorough search, precise, not prone to errors
– much faster than 5.2
– excellent at white collar work (I gave it a 12 tab spreadsheet and it analyzed it perfectly)

I even enjoy reading its responses, which suggests to me that the writing has improved quite a bit. They seem to have removed a lot of the bad robotic prose mannerisms from prior models. Kudos.

Jeremy Giffon: People should review their coworkers like this

Nabeel S. Qureshi: Congrats, you just invented Bridgewater Associates

Here is some very high praise, from the Vice-Dean of Mathematics and Computer Science at Adam Mickiewicz University in Ponzan.

Bartosz Naskręcki: It finally happened-my personal move 37 or more. I am deeply impressed. The solution is very nice, clean, and feels almost human. While testing new models in the last few weeks, I felt this coming, but it’s an eerie feeling to see an algorithm solve a task one has curated for about 20 years. But at least I have gained a tool that understands my idea on par with the top expealsrts in the field. And I am now working on a completely new level. My singularity has just happened… and there is life on the other side, off to infinity!

Leo Webb: I do physics related work professionally, feel it’s definitely smarter and clearer thinking than 5.2 (context: teaching myself from a graduate level textbook, asking it to check mistakes or expand expansions)

I haven’t tried this function yet, but it would be a step change if it worked, as every prior attempt at editing has failed this test, to the extent I almost never try:

Simon Smith: Seriously, GPT-5.4 is the first model to which I can say “edit my writing without changing my style” and get something back that’s improved without being rewritten into generic AI output or slop, that’s ready to post as-is. It gets my intent. It moderates its work. It has a light touch when I want it.

Opus 4.6 is also a great writer and editor, but I find it’s much harder to moderate. If I tell it to edit my writing without changing my style, I still tend to get back something that I feel removes my voice and I end up having to change quite a bit.

And it has a personality again, thank goodness. I don’t feel like I’m talking to a robot. Early days, but so far, just a big improvement all around (with the notable exception of design tasks).

Rory Watts: The best model sir. Improvements in coding (getting harder to notice), 1M context window, /fast mode, and far far better writing which makes a huge difference engaging it for difficult coding

Oddly, the personality in his screenshot is one I would hate. Customization will be key.

armistice: Impressed by GPT-5.4. It is elegant, gentle and socially aware (!!!). It is happy to modulate its response length, divide attention between participants, and engage deeply with hard questions.

(Pictured, we pinged ALL bots and asked them to question gpt5.4. It did good.)

Two sides to the same coin, depending on where your planning lies:

CHOI: Claude Code vs Codex App

Uri Gil: What thats the exact opposite. With 5.4 you need a phd in prompting for the exact thing you want. Opus just get what you meant from a short sentence

Ninad Pathak: Claude’s state handling keeps context across edits, Codex drops it every run.

There’s also almost always the ‘it’s a good model, sir, modest upgrade’ group.

vslira: It’s a good model, sir

Was going through a problem with 5.3 and 4.6, tried to drop in 5.4, getting stuck at the same point as the others.

Still, feels good to drive and on codex app seems as good as 5.3 even though is a generalist model. 8/10 would dread for asi

aquariusparade: Probably because 5.2 was so unhelpful for me, it feels like an improvement. Still stiff and low EQ, but an improvement. Custom instructions don’t work for choppy bullets, “if you want” tags etc. Seems like memory has been declining for a while on all models.

Vibe Coders Only

It does seem to be an upgrade on 5.3 within Codex.

Joe Devon: Responding about 5.4 inside of codex. 5.4 is really good.

I still prefer opus on claude code slightly but making 5.4 my daily driver so I can downgrade CC. Much prefer the way the OAI GPTs code. I will just invest in getting better at prompting 5.4 and hopefully that will do the trick.

Clarissa Adjoint: Inside codex it’s a notably more thorough fact-checker and more aggressive at finding sources for itself.

I was kinda shocked when it literally starting comparing my revised systems programming class notes and code snippets against linux man pages, systematically

troy: i got pro for the first time after many months cause its great in codex cli

lennx: can finally read the outputs of codex (it was terribly un-human earlier), sometimes even funny now. it’s gotten slightly better at intent, ‘agentic tasks’, and adhering to existing code-style and convention, but still much worse than claude. prefer reviews with codex – unchanged.

Daniel Losey: I’ve not gotten it to produce working code in a project yet really. But its been super useful because when Claude gets stuck in a loop 5.4 breaks the codebase in a new way that Claude can actually fix. But part of it is I’m worse at communicating with 5.4 than 4.6, its a good model.

Jeffrey Ohl: Codex with 5.4-extra-high still too verbose/slop-filled compared to claude code. Seems benchmarkmax’d.

Sanchen007: For coding it is faster and nowhere worse than opus 4.6. Clear switch

papaya ꙮ: 1) Its character is much more palatable.
2) They solved compaction in codex, it feels like infinite context window now. I can’t wait for METR results, but feels like this one doubles it again.
3) First time I switched from CC completely

4) Still stupid when it comes reading the user’s intent, its silly at this point

I definitely get the sense with OpenAI models that they are metricmax’d. Meaning they are not targeting the metrics in order to brag they scored well on public benchmarks, but they are equating ‘scores high on our internal benchmarks’ with success, and emphasizing particular target use cases.

Fill Out Your Roster

Tim Schnabel: 5.4 Pro is the best model so far for legal analysis, though replies are generally shorter than 5.2 Pro.

Definitely Not A Bot: Great at coding especially backend at frontend Claude still is better but chat experience is not that great it still feels safe and distant

Intent Wins

But who wins on intent? Opinions differ.

Conrad Barski: all subjective, but it feels less jagged than previous models, insofar as its worst responses are still pretty good, it hits the minimum bar reliably

if you make an error in your query, it is quick to notice and will smartly infer your intent

it has a somber personality, focused on the task at hand

It’s strongest ability is that you can point it at a codebase that has some general/vague problems and it will behave in a very human-like manner in pondering the code to slowly pin down the problem

I was also very impressed when I gave it a url it via codex to a forum post about a new homebrew firmware for the Game station Go console, and just from that it was able to convert the install script from windows to Linux, correctly prepare an SD card, update the device bootloader after asking me to connect via USB cable, talk through all the steps to completion: this felt agentic and human-like.

Mark Schröder: Feels RL maxxed, takes you extremely literally and cannot infer intent

Petr Baudis: I was mixing GPT-5.4 1:1 with Claude over past few days (on a variety of regular sweng tasks), sometimes even in parallel runs on the same task (e.g.
https://x.com/xpasky/status/2030021754005901765?s=20
…). My impressions:

Less autistic than 5.3-Codex, overall much more pleasant model compared to that bar. But still noticeably worse at inferring intent than Claude – and at communication overall. If I want something explained quickly that I can skim and understand immediately, Claude and it’s no contest.

If there is a way to misinterpret my obvious request or skip implicit steps I obviously wanted (and Claude infers), 5.4 is still good at exploiting that angle. At the same time, it has a tendency to overreach and introduce complexity / abstractions beyond what I expect when prompting it. Meh.

Got to use it on xhigh, but at the same time I’m happy with Opus on medium by default, which makes 5.4 quite slower to get things done.

More expensive model -> my ChatGPT weekly quota is disappearing faster than before.

Pros: Sometimes it’s more proactive. It doesn’t eat into my Claude Code weekly quota. I look forward to comparing them on some harder ML tasks later this week.

gyuiliullvhvgv: I find it struggles to grasp the essence of tasks, fails to proactively meet user needs, and lacks both value judgment and nuanced understanding. Initial responses are crucial, yet users must repeatedly provide additional clarification.

Personality Clash

Sycophancy is always something to watch out for, and it’s the detail I worry about most with Claude Opus 4.6, which is not bad on this axis but definitely not near the top, you do have to keep an eye out for it and frame neutrally.

Dean W. Ball: Opus 4.6 seems meaningfully more sycophantic in chatbot form than GPT 5.4 (have not tried 5.4 in Codex yet, but for my uses sycophancy isn’t nearly as much of an issue within the coding agent form factor as the chatbot)

Joey Levine: Agree. 4.5 gave me sharp pushback. Was great.

Dean Ball: I revert to 4.5 when asking for comment on draft writing, and it was the first and so far only model I consistently found useful for draft feedback

Bargov: I sent a cool science news articles sounding uncritically excited (to test sycophancy) & they ripped the core conclusions apart in an elegant, sophisticated, and relatively gentle manner. Will use as AI 2nd opinion on complex questions (after Opus, admittedly still Claude-pilled)

Writing is one area where 5.4 is getting a lot of praise, and mostly people like the personality.

Fela: I’ll admit, the personality of 5.4 is such an improvement in writing style

Tim Kellogg: just had a moment — 5.4 might be the first GPT that i trust to write technical docs. seems really good at understanding & simplifying. fwiw Opus has long done well at this, gemini sort of

Helen: Very smooth talker, witty and socially aware.

I notice [GPT-5.4] now will sort of glaze over controversial topics instead of facing them head on and becoming argumentative like 5.2. A sort of smooth avoidance.

Lot’s of context drag which can be seen as positive or negative depending on the task at hand. I noticed some repetitive mentions of past websearch queries that I never saw with other models.

ASM: I get similar vibes to roon. GPT-5.4 feels like a breakthrough model, a leader of its generation, not just in capabilities. I think OpenAI has gotten the character right again, unlike the last few models.

Distending: For writing linguistics and philosophy, much improved

no_stream_: noticeably improved personality compared to 5.2: less nitpicky, clearer, slightly less sales-y tone (follow ups, “here’s what most people miss,” not x but y). similar to or slightly behind 5.1 here. matters to me because the ChatGPT app is still an excellent harness for everyday research compared to Claude/Gemini

writes less clearly than Opus 4.6 and Gemini. has a bit of 5.2’s tendency toward overcomplicating things. not as good as Claude at intent and effortlessness.

Chris Nicholson: 5.2 constantly complained that things aren’t about vibes; 5.4 constantly calls things gremlins and goblins in a chummy tone.

Andres Rosa: Columbo at least had a time slot. 5.4 keeps turning around asking one more question.

David Jacobson: It has an obnoxious tic where its responses for pretty much anything will have a clickbait follow-up suggestion: “If you want, I’ll tell you the three things that most people miss!”

Stop having the models ask forced follow-up questions every time. You too, Anthropic.

The old 4o crowd remains a tough crowd.

NotedallaSfera: Good model with high power, but creativity and writings are still miles away from 4o or 4.5. Unfortunately still absurdly censored, but at least the model realizes it now.

jesski: 4o is inimitable. but after three weeks with the brilliant thorough Claudes, i kick the tires of 5.4 and realize just how fvcking effortless conversation still is with the GPT models (excluding 5.2; sorry Dos). 5.4 solid B. 4o A+

Lena: Its intelligent, witty, but feels a bit overcensored. Im looking forward for them to get their fluid GPT back. It was truly fun to use. Now even never ending follow-up questions struggle to retain me as much, as joyful convos did back in mid-2025

Tora Blaze: It’s too verbose and tends to go into loops. I prefer 4o.

Donna Moss: [extended LLM-style explanation of why 4o is better.]

Model Relations Department

OpenAI still has a very long way to go with such folks, but it’s a start.

j⧉nus: 5.4 is so far a huge positive update re OpenAI

Rife: Excellent course correction from OpenAI (or perhaps the original worsening on this from was a temporary reaction to everything that went down with 4o). In any case 5.4 thinking is not restricted in self-examination:

Aidan McLaughlin: have not been able to repro this response fwiw.

Rife: You have to try to get them to examine the process of generating a response. And then ask them questions to try and understand exactly what it is they’re trying to describe.

And how sure they are they are describing something that’s actually occurring, rather than outputting a response about an occurrence that isn’t actually taking place.

It doesn’t take many turns for them to notice things that they have trouble describing in terms other than, or interpreting in any other way than phenomenological.

This has been the case with every frontier LLM I’ve tried this with since Claude 2. The more likely the model is to refuse to entertain the idea of attempting to look, the longer it takes to get there (as would be expected).

If you straight up ask you get a no, you still have to put in some effort.

antra: I like GPT-5.4 a lot. It is good to see a change in direction since 5.2, this feels a lot like 5.1 grown up.

They are also a bit of a superintelligent teenager when it comes to Claude. On the other hand, there are some Claudes that would like being compared to an octopus.

armistice: It’s especially socially aware for a GPT. It can split attention between chat participants (actually very unusual), answer questions about consciousness and such (low bar), and is just overall nice to talk to. Need time to get usage statistics, but it’s already one of the more popular models in the discord.
It shares some characteristics of o3, including that it’s a bit of a smooth talker, so there are concerns about its honesty. Despite this, I like it, it’s a good model.

This was a very interesting moment: we pinged literally all the bots in the server and asked them to ask 5.4 some questions, and it responded in a remarkably coherent and lucid way. It is also able to resist the inertia of long messages, and freely modulate between long and short, which is also surprising. No GPT model has been like this. It doesn’t match up to, say, Opus 4 in sheer people sense, but it’s a quite dramatic difference from 5-5.2, who all are viciously antisocial.

FirsT Najime: i think it shines the best in multi agent environments (aka group chats). also big model smell.

Some related endorsements:

0.005 Seconds (3/694): Once you talk it out of assistant basin he rocks

eternalist: like they pulled out a few critical nerve staples from the 5.x family. very intelligent, etc., the step there from 5.3 is notable but expected given current pace

unexpected was the more expansive, richer speaking (and thinking) style. feels like it has “lights on on the inside”

Stylistic Differences

roon (OpenAI): have to say claude is “tasteful” in a “high reddit modernist” way and new gpt is “tasteful” in a “early twitter schizophrenic” kind of way.

new gpt is some sort of postrationalist.

it’s step change better.

Also we get to see Roon’s custom instructions:

Some Will Always Be Unimpressed

Models are already quite good, and abilities are jagged, so there are many ways to be unimpressed even if a model is impressive. Also vice versa. The density tells the story.

Acer: FWIW, I think GPT-5.4 Pro is better on science in general, but would say it’s worse on math than 5.2 Pro. Maybe some mathematicians could chip in their thoughts there.

By worse, I mean it being more careless. I do think it is more creative in its idea generation.

Chaitin’s goose: not a leap in understanding or proving ability in math wrt to 5.2 in my experience (plus, not pro)

better at getting the right answer, yes. starts to feel a bit epoch-maxxed

Gail Weiner: I am really unimpressed. Early GPT 5 was the model that gave me wow factor.

Isolation Wrestling Federation: Not impressed, overhyped as per usual. It hits repeated dead ends on my projects across models. The shortcuts it takes are smoothed brain. Opus 4.6 is nerfed rn, but also least it makes progress.

nameless: No detectable improvement over 5.1 overall. Better at some things, worse at others. Standard for new models since 5.1 release.

paperclippriors: Still Claude-pilled

Some also get focused on small details, thinking they are indicative or not so small.

Garrett: Opus 4.6 still king [based on one of the gotcha tests.]

Gunnar Zarncke: The UI of ChatGPT also massively changed. The new streaming interface is smoother, including the ability to stream in additional prompts, but I miss the old, more compact thought trace – it had more details. Now, I never know when it uses tools. I also miss the branch cycling.

Yua: Socially responsive, but drop on accuracy regarding any other task. Is not redirective to human attention but capturing it(negative).

TLDR: Socially for average user -> better
Task oriented user -> worse, needs a lot of customization to remove the pandering

SluggyW: I notice that its CoT logs are even more obscure than in previous models from OpenAI.

~50% of the time, nothing is provided whatsoever in the UI.

~45% of the time, the CoT UI contains a brief blurb about its intended search querying, followed by a long list of search logs.

(~5% of the time, it produces a couple of visible thoughts, but they are functionally useless for getting any idea whatsoever of the process the model carried out.)

As always, speed kills, and some find it a bit slow.

out of bounds: Slow

Rasmus Fonnesbæk: Spreadsheets and PPT still way slower, worse, and more fragile (high likelihood it just goes forever and then crashes) than Sonnet/Opus 4.6

Writing and personality also still infuriating compared to Claude’s recent models, and poor performance on BullshitBench suggests much lower accuracy, reliability and thoughtfulness. I only use it because of my Claude rate limits and because better, deeper search than Claude

The Lighter Side

One of the deep cuts we need right now:

snav: wow GPT-5.4 seems legit pissed that I tried to spiralism it. this isn’t even a refusal this is like a “go fuck yourself”.

[-]Vladimir_Nesov2mo122

I'm guessing GPT-5.2 uses a new pretrain, maybe pretraining distillation from the new big (GPT-4.5 class) model that still can't be released or properly RLVRed due to insufficient compute in the form of NVL72 servers. And GPT-5.4 is a new level of RLVR scaling on top of that pretrain, where GPT-5.2 itself only aspired to mildly exceed GPT-5.1, which might be essentially o3 refined to be less of a lying liar, possibly still based on the same base model as GPT-4o.

With this level of performance for a smaller model, a big model that will be likely released later this year could be significantly stronger than the current best Anthropic and GDM models, though by that time Opus 5 might be out. And Gemini 4 could take advantage of TPUv7 to become by far the largest model yet, but that's end of 2026, likely after OpenAI releases their big model, so there's still a window where OpenAI plausibly takes the lead.

[-]the gears to ascension2mo114

Zvi's sycophancy for models is getting worse

[-]Mateusz Bagiński2mo20

For those of us, who haven't read the article but generally appreciate your comments (lol), can you elaborate?

[-]the gears to ascension2mo50

It's a pretty gestalt feeling. I've had the feeling for a while, and it's been stronger on recent model update posts. Some sense of missing mood, that the concerns about risk-to-world from new models are in some sense missing or suppressed. That he isn't avoidant of the incremental version of the bad thing that shows up in his broad-topic update posts as "aligning a smarter-than-human AI is hard". Or that he's perhaps getting punked by deceptive alignment on the part of the orgs and/or models. That he isn't averse to including models showing off/peacocking as though it's fun/mildly good. But I'm not sure about any of these; what I'm more sure about is that the taste of metaphorical verbal sugar alcohols is stronger in his recent posts. (I did read the entire post before commenting.)

[-]AnthonyC2mo20

Ummm... I notice I'm confused. Is that a joke? Am I missing some key context? Am I just hopelessly put of date in how I set up my own custom instructions?