I was a big fan of Horizon Alpha, a stealth model available on OpenRouter, later revealed to be an early checkpoint/variant of GPT-5. Unfortunately, the released version isn't quite as good on my usual creative writing vibes benchmark, whether as 5 or 5-Thinking.
(4.1 was really good at the same task, and surprisingly so for a model marketed for coding. I missed it when it was gone, and I'm glad to have it back.)
I was initially quite negative about GPT-5T, but I've warmed to it. It seems as smart or smarter than o3, and is a well-rounded and capable model. The claims of a drastically reduced hallucination rate seem borne out in extensive use.
What do I ultimately make of all the new versions of GPT-5?
The practical offerings, and how they interact, continue to change by the day. I expect more to come. It will take a while for things to settle down.
I’ll start with the central takeaways and how I select models right now, then go through the hype and various questions in detail.
Table of Contents
Central Takeaways
My central takes, up front, first the practical:
On the bigger picture:
Choose Your Fighter
This is a distillation of consensus thinking on the new practical equilibrium:
This seems mostly right. I prefer to use Opus if Opus offers enough thinking for the job, but OpenAI currently scales up time and compute better than Anthropic does.
So, what do we do going forward to get the most out of AI on a given question?
Here’s how I think about it: There are four ‘speed tiers’:
What about coding?
Here I don’t know, because I’ve been too busy to code anything since before Opus 4, and I haven’t tried out Claude Code.
Also, the situation continues to change rapidly. OpenAI claims that they’ve doubled speed for GPT-5 inside Cursor as of last night via superior caching and latency, whereas many of the complaints about GPT-5 in Cursor were previously that it was too slow. You’ll need to try out various options and see what works better for you (and you might also think about who you want to support, if it is close).
We can then contrast that with the Official Hype.
Official Hype
That’s not automatically a knock. Hypers gotta hype. It’s still worth noting which hype they chose.
Here was Sam Altman live-tweeting the livestream, a much better alternative to actually watching the livestream. I converted his tweets to bullet points and reordered them a bit for logical coherence, but otherwise preserved them to give a sense of his vibe. Hype!
I would summarize the meaningful parts of the pitch as:
Altman is careful not to mention the competition, focusing on things being good. He also doesn’t mention the lack of sycophancy, plausibly because ‘regular’ customers don’t understand why sycophancy is bad, actually, and also he doesn’t want to draw attention to that having been a problem.
I have noticed the same pattern that Siemon does here. When a release is impressive relative to expectations, Altman tends to downplay it. When a release is unimpressive, that’s when he tends to bring the hype.
From their Reddit Q&A that mostly didn’t tell us anything:
Again note what isn’t listed here.
Here’s more widely viewed hype that knows what to emphasize:
Notice the focus on trustworthy, accurate and unified. Yes, she talks about it setting new highs across the board, but you can tell that’s an afterthought. This is about refining the experience.
Here’s some more hype along similar lines that feels helpful:
That last line worries me a bit.
That’s the pitch.
GPT-5 isn’t a lot smarter. GPT-5 helps you do the dumb things you gotta do.
Still huge, as they say, if true.
Here’s hype that is targeted at the Anthropic customers out there:
I notice the ‘way less sycophantic’ does not answer the goose’s question ‘than what?’
This is a direct pitch to the coders, saying that GPT-5 is better than Opus or Sonnet, and you should switch. Unlike the other claims, them’s fighting words.
The words do not seem to be true.
There are a lot of ways to quibble on details but this is a resounding victory for Opus.
There’s no way to reconcile that with ‘way better than claude 4.1 opus at swe.’
We also have celebratory posts, which is a great tradition.
I view GPT-5 as both evidence that synthetic data can work in some ways (such as the lower hallucination rates) and also evidence that synthetic data is falling short on general intelligence.
Roon is different. His hype is from the heart, and attempts to create clarity.
One particular piece of hype from the livestream is worth noting: they are continuing to talk about approaching ‘a recursive self-improvement loop.’
I mean, at sufficient strength this is yikes, indeed the maximum yikes thing.
Then again, stated this way, it is likely something much weaker. More hype?
Here is Bloomberg’s coverage from Rachel Metz, essentially a puff piece reporting moderated versions of OpenAI’s hype.
Chart Crime
I mean, wow, just wow. This was from the livestream.
And we also have this:
This is not a chart crime, but it is still another presentation error.
And another:
I think L is overreacting here, but I do think that when details get messed up, that does tell you a lot.
One recalls the famous Van Halen brown M&Ms contract clause: “There will be no brown M&M’s in the backstage area, upon pain of forfeiture of the show, with full compensation.” If the venue didn’t successfully execute on sorting out the brown M&Ms, the band knew the venue hadn’t read the contract carefully, had probably messed up other things too, and probably wasn’t safe for their equipment.
Then there was a rather serious actual error:
OpenAI definitely did not sort out their brown M&Ms on this one.
I would feel better about OpenAI if they made a lot fewer of these types of mistakes. It does not bode well for when they have to manage the development and release of AGI or superintelligence.
Model Crime
Many people are saying:
The supposed motive is to clear up confusion. One model, GPT-5, that most users query all the time. Don’t confuse people with different options, and it is cheaper not to have to support them. Besides, GPT-5 is strictly better, right?
Under heavy protest, Altman agreed to give Plus users back GPT-4o if they want it, for the time being.
Future Plans For OpenAI’s Compute
I find it strange to prioritize allocating compute to the free ChatGPT tier if there are customers who want to pay to use that compute in the API?
I notice that one could indefinitely improve the free tier of ChatGPT, so the question is how much one intends to improve it.
The other thing that is missing here is using compute to advance capabilities. Sounds great to me, if it correctly indicates that they don’t know how to get much out of scaling up compute use in their research at this time. Of course, they could also simply not be talking about that, pretending that part of their compute isn’t fungible, in order to make this sound better.
There are various ways OpenAI could go. Ben Thompson continues to take the ultimate cartoon supervillain approach to what OpenAI should prioritize: the best business is the advertising platform business, so they should stop supporting this silly API entirely, pivot to consumer tech, and focus on creating what he is totally not calling our new dystopian chat overlord.
This of course is also based on Ben maximally not feeling any of the AGI, and treating future AI as essentially current AI with some UI updates and a trenchcoat, so all that matters is profit maximization and extracting the wallets and souls of the low end of the market the way Meta does.
Which is also why he’s strongly against all the anti-enshittification changes OpenAI is making to let us pick the right tool for the job, instead wishing that the interface and options be kept maximally simple, with OpenAI silently deciding which model to serve you behind the scenes. Better, he says, to make the decisions for the user, at least in most cases, and screw the few power users for whom that isn’t true. Give people what they ‘need,’ not what they say they want, and within the $20 tier focus on the naive users.
Rate Limitations
One reason some people have been angry was the temporary downgrade in the amount of reasoning mode you get out of a $20 subscription, which users were not, at the time, reassured was temporary.
OpenAI started at 200 Thinking messages a week on Plus, then doubled rate limits once the rollout was complete, then went to 3,000 Thinking queries per week, which is far more than I have ever used in a week. Now there is also the fallback to Thinking-mini after that.
So this generated a bunch of initial hostility (that I won’t reproduce as it is now moot), but at 3,000 I think it is fine. If you are using more than that, it’s time to upgrade, and soon you’ll also (they say) get unlimited GPT-5-mini.
It’s kind of amazing that only 7% of Plus users used a reasoning model daily. Two very different worlds indeed.
The Routing Options Expand
I don’t know that Thompson is wrong about what it should look like as a default. I am increasingly a fan of hiding complex options within settings. If you want the legacy models, you have to ask for them.
It perhaps makes sense to also put the additional GPT-5 options behind a setting? That does indeed seem to be the new situation as of last night, with ‘show additional models’ as the setting option instead of ‘show legacy models’ to keep things simple.
There is a real risk of the Paradox of Choice here, where you feel forced to ensure you are using the right model, but now there are too many options again, you’re not sure which one it is, and you throw up your hands.
As of this morning, your options look like this; we now have a ‘Thinking mini’ option:
o3 Pro is gone. This makes me abstractly sad, especially because it means you can’t compare o3 Pro to GPT-5 Pro, but I doubt anyone will miss it. o4-mini-high is also gone; again, I doubt we will miss it.
For the Plus plan, GPT-4.5 is missing, since it uses quite a lot of compute.
I also notice the descriptions of the legacy models are gone, presumably on the theory that if you should be using the legacies then you already know what they are for.
Thinking-mini might be good for fitting the #2 slot on the speed curve, where previously GPT-5 did not give us a good option. We’ll have to experiment to know.
System Prompt
Pliny is here to provide it.
I hadn’t looked at a ChatGPT system prompt in a while so I read it over. Things that stood out to me that I hadn’t noticed or remembered:
Wyatt Walls then offers us a different prompt for thinking mode.
On Writing
If you are using GPT-5 for writing, definitely at least use GPT-5-Thinking, and still probably throw in at least a ‘think harder.’
My verdict on the Seinfeld episode is that it was indeed better than previous attempts I’ve seen, with some actually solid lines. It’s not good, but then neither was the latest Seinfeld performance I went to, which I’m not sure was better. Age comes for us all.
One thing it is not good at is ‘just do a joke’; you want it to Do Writing instead.
We do have the Short Story Creative Writing benchmark, but I don’t trust it. The holistic report is something I do trust, though:
Writing is notoriously hard to evaluate, and I essentially never ask LLMs for writing, so I don’t have much of a comparison point. It does seem like if you use thinking mode, GPT-5 can give you at least a strong version of what GPT-4.5 had here.
The other problem with writing is you need to decide what to have it write. Even when Roon highlights writing, we get assignments like ‘If Napoléon wrote a personal and intimate letter to Sydney Sweeney’ or ‘You are Dostoevsky, but you are also a Snapchat fuckboi. Write to me.’
Or you could try this prompt?
I did indeed try it. And yes, this seems better than previous attempts. I still didn’t successfully force myself to finish reading the story.
Leading The Witness
Yes, you still have to be careful with the way you prompt to avoid leading the witness. Sycophancy might not be at absurd levels but it definitely is never at zero.
You’re right to question it:
Hallucinations Are Down
My guess is that the improved hallucination rate from o3 (and also GPT-4o) to GPT-5 and GPT-5-thinking is the bulk of the effective improvement from GPT-5.
The decrease in hallucinations is presumably a big driver in things like the METR 50% completion rate and success on various benchmarks. Given the modest improvements, it could plausibly account for more than all of the improvement.
I’m not knocking this. I agree with Gallabytes that ‘o3 the Lying Liar, except it stops lying to you’ is a great pitch. That would be enough to shift me over to o3, or now GPT-5-Thinking, for many longer queries, and then there’s Pro, although I’d still prefer to converse with Opus if I don’t need o3’s level of thinking.
For now, I’ll be running anything important through both ChatGPT and Claude, although I’ll rarely feel the need to add a third model on top of that.
Best Of All Possible Worlds?
This was a great ‘we disagree on important things but are still seeking truth together’:
I too of course would very much like a cure for cancer and other neat stuff like that. There are big upsides to creating minds smarter than ourselves. I simply think we are not yet prepared to handle doing that at this time.
It seems plausible GPT-5 could hit the perfect sweet spot if it does its job of uplifting the everyday use cases:
Watch out, though. As Nick says, this definitely isn’t over.
Then again, what if it is the worst of all possible worlds, instead?
Timelines
I often draw the distinction between mundane utility and underlying capability.
When we allow the same underlying capability to capture more mundane utility, the world gets better.
When we advance underlying capability, we get more mundane utility, and we also move closer to AI being powerful enough that it transforms our world, and potentially takes effective control or kills everyone.
(Often this is referred to as Artificial Superintelligence or ASI, or Artificial General Intelligence or AGI, and by many definitions AGI likely leads quickly to ASI.)
Timelines means how long it takes for AGI, ASI or such a transformation to occur.
Thus, when we see GPT-5 (mostly as expected at this point) focus on giving us mundane utility and Just Doing Things, without much advance in underlying capability, that is excellent news for those who want timelines to not be quick.
So people were at least half expecting not to have jobs in 2030, but then thinking ‘permanent underclass’ rather than half expecting to be dead in 2040. The focus on They Took Our Jobs, to me, reflects an inability to actually think about the implications of the futures they are imagining.
There were some worlds in which GPT-5 was a lot more impressive, and showed signs that we can ‘get there’ relatively soon with current techniques. That didn’t happen. So this is strong evidence against very rapid scenarios in particular, and weak evidence for things being slower in general.
Those seem like wise things to be confused about.
It is however ‘right on trend’ on the METR chart, and we should keep in mind that these releases are happening every few months so we shouldn’t expect the level of jump we used to get every few years.
My instinct would be that GPT-4 → GPT-5 is more like GPT-3.5 → GPT-4, especially if you’re basing this on GPT-5 rather than starting with Thinking or Pro? If you look at GPT-5-Thinking outputs only and ignore speed, I can see an argument this is 5-level-worthy. But it’s been long enough that maybe that’s not being fair.
But also think about how it got that much further along on the chart, on several levels, all of which points towards future progress likely being slower, especially by making the extreme left tail of ‘very fast’ less likely.
It is indeed very tempting to compare GPT-5 to what existed right before its release, including o3, and compare that to the GPT-3.5 to GPT-4 gap. That’s not apples to apples.
GPT-5 isn’t a giant update, but you do have to do Conservation of Expected Evidence, including on OpenAI choosing to have GPT-5 be this kind of refinement.
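As a refresher on what that law requires (a standard identity, stated here in my own notation rather than anything from the post): your current credence must equal the probability-weighted average of your possible posteriors, so you cannot expect in advance to update in any particular direction:

$$P(H) = P(H \mid E)\,P(E) + P(H \mid \neg E)\,P(\neg E)$$

Here H might be ‘short timelines’ and E ‘GPT-5 turns out to be a refinement rather than a leap.’ However much observing E pushes you toward longer timelines, observing ¬E would have had to push you toward shorter ones, weighted by how likely each outcome was.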
It also depends on what evidence you already used for your updates. If you already knew GPT-5 was going to be an incremental model that was more useful rather than it being OpenAI scaling up more, as they already mostly told us, then your update should probably be small. If you didn’t already take that into account, then larger.
It’s about how this impacts your underlying model of what is going on:
As I noted yesterday, you also have to be cautious that they might be holding back.
On the question of economic prospects if and when They Took Our Jobs and how much to worry about this, I remind everyone that my position is unchanged: I do not think one should worry much about being in a ‘permanent underclass’ or anything like that, as this requires a highly narrow set of things to happen – the AI is good enough to take the jobs, and the humans stay in charge and alive, but those humans do you dirty – and even if it did happen the resulting underclass probably does damn well compared to today.
You should worry more about not surviving or humanity not remaining in control, or your place in the social and economic order if transformational AI does not arrive soon, and less about your place relative to other humans in positive post-AI worlds.
Sycophancy Will Continue Because It Improves Morale
GPT-5 is less sycophantic than GPT-4o.
In particular, it has a much less warm and encouraging tone, which is a lot of what caused such negative initial reactions from the Reddit crowd.
GPT-5 is still rather sycophantic in its non-thinking mode, where it is most annoying to me and probably to you: when it is actually evaluating something.
The good news is that if it matters that the model not be sycophantic, and you are using ChatGPT, you should be using GPT-5-Thinking if not Pro.
He turns this into CrankTest (not CrankBench, not yet) and this seems very well calibrated to my intuitions. Remember that lower is better:
Gaslighting Will Continue
As usual, there is the issue that if within a context an LLM gets too attached to a wrong answer (for example, here, the number of r’s in ‘boysenberry’), this creates pressure for it to keep doubling down on that answer and gaslight the user. I also suppose fighting sycophancy makes this more likely as a side effect, although they didn’t fight sycophancy all that hard.
I wouldn’t agree with Jonathan Mannhart that this means ‘it is seriously misaligned,’ but it does mean that this particular issue has not been fixed. I notice that Jonathan here is pattern matching, in vibes, to someone who is often wrong, which presumably isn’t helping.
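For the record, the ground truth in that example is trivially checkable; a minimal sketch in Python, with the word and letter taken from the example above:

```python
# Count the letter "r" in "boysenberry" - the answer the model should
# converge on, rather than double down against, is 2 (boysenbe-r-r-y).
word = "boysenberry"
print(word.count("r"))  # prints 2
```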
Going Pro
How often are they suggesting you should wait for Pro, if you have it available? How much should you consider paying for it (hint: $200/month)?
If my own experience with o3-pro was any indication, the instinct to not want to wait is strong, and you need to redesign your workflow to use it more. A lot of that was that when I tried to use o3-pro it frequently timed out, which at that pace is super frustrating. Hopefully GPT-5-Pro won’t have that issue.
When you care, though? You really care, as with the experiences of Wes Roth and David Shapiro here. The thing is both that, yes, the model picker is back for the Pro tier, including o3-pro, and also that you have GPT-5-Pro.
How is GPT-5-Pro compared to o3-Pro?
That’s hard to evaluate, since queries take a long time and are pretty unique. So far I’d say the consensus is that GPT-5-Pro is better, but not a ton better?
Going Forward
OpenAI seems to be rolling out changes on a daily basis. They are iterating quickly.
Anthropic promised us larger updates than Opus 4.1 within the coming weeks.
Google continues to produce a stream of offerings, most of which we don’t notice.
This was not OpenAI’s attempt to blow us away or to substantially raise the level of underlying capabilities and intelligence. That will come another time.
Yes, as a sudden move to ‘GPT-5’ this was disappointing. Many, including in the secondhand reports from social media, are not initially happy, usually because their initial reactions are based on things like personality. The improvements will still continue, even if people don’t realize it.
What about the march to superintelligence or the loss of our jobs? Is it all on indefinite hold now because this release was disappointing? No. We can reduce how much we are worried about these things in the short term, meaning the next several years, and push back the median somewhat. But if you see anyone proclaiming with confidence that it’s over, rest assured chances are very good we will soon be so back.