GPT-5 was a long time coming.
Is it a good model, sir? Yes. In practice it is a good, but not great, model.
Or rather, it is several good models released at once: GPT-5, GPT-5-Thinking, GPT-5-With-The-Router, GPT-5-Pro, GPT-5-API. That leads to a lot of confusion.
What is most good? Cutting down on errors and hallucinations is a big deal. Ease of use and ‘just doing things’ have improved. Early reports are that thinking mode is a large improvement on writing. Coding seems improved and can compete with Opus.
This first post covers an introduction, basic facts, benchmarks and the model card. Coverage will continue tomorrow.
This Fully Operational Battle Station
GPT-5 is here. They presented it as a really big deal. Death Star big.
Sam Altman (the night before release):
Sam Altman later tried to say ‘no, we’re the rebels.’
Perhaps he should ask why actually no one who saw this interpreted it that way.
Also yes, every time you reference Star Wars over Star Trek it does not bode well.
It would be greatly appreciated if the vagueposting (and also the livestreaming, a man can dream?) would stop, and everyone involved could take this seriously. I don’t think anyone involved is being malicious or anything, but it would be great if you stopped.
All right, let’s get started.
Big Facts
400K context length.
128K maximum output tokens.
Price, in [price per million input tokens / price per million output tokens]:
For developers they offer a prompting guide, frontend guide and guide to new parameters and tools, and a mode to optimize your existing prompts for GPT-5.
They also shipped what they consider a major Codex CLI upgrade.
There are five ‘personality’ options: Default, Cynic, Robot, Listener and Nerd.
From these brief descriptions one wonders if anyone at OpenAI has ever met a nerd. Or a listener. The examples feel like they are kind of all terrible?
As in, bad question, but all five make me think ‘sheesh, sorry I asked, let’s talk less.’
It is available in Notion, which claims it is ‘15% better’ at complex tasks.
Typing ‘think hard’ in the prompt will trigger thinking mode.
Nvidia stock did not react substantially to GPT-5.
As always, once we get beyond the basic facts, we begin with the system card.
The System Card
The GPT-OSS system card told us some things about the model. The GPT-5 one tells us that GPT-5 is a bunch of different GPT-5 variations plus a router that chooses which one to call, and otherwise doesn’t tell us anything about the architecture or training we didn’t already know.
A Model By Any Other Name
Excellent, now we can stop being confused by all these model names.
Oh no:
OpenAI is offering to automatically route your query to where they think it belongs. Sometimes that will be what you want. But what if it isn’t? I fully expect a lot of prompts to say some version of ‘think hard about this,’ and the first thing any reasonable person would test is to say ‘use thinking-pro’ and see what happens.
I don’t think that even a good automatic detector would be that accurate in differentiating when I want to use gpt-5-main versus thinking versus thinking-pro. I also think that OpenAI’s incentive is to route to less compute-intensive versions, so often it will go to main or even main-mini, or to thinking-mini, when I didn’t want that.
Which all means that if you are a power user then you are back to still choosing a model, except instead of doing it once on the drop-down and adjusting it, now you’re going to have to type the request on every interaction if you want it to not follow defaults. Lame.
It also means that if you don’t know which model you got routed to, and you get a bad answer, you won’t know whether that was because it routed you to the wrong model.
This is all a common problem in interface design. Ideally one would have a settings option at least for the $200 tier to say ‘no I do not want to use your judgment, I want you to use the version I selected.’
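In the API, as I understand it, there is no router at all; you name the model yourself. As a rough sketch of what that looks like with the OpenAI Python SDK, where the exact model ID and the reasoning-effort parameter are illustrative assumptions rather than confirmed names:

```python
# Minimal sketch of pinning a specific model via the API instead of trusting a router.
# The model ID and reasoning_effort value are assumptions for illustration; check the
# current API docs for the real names before relying on them.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5",            # hypothetical explicit model ID, no routing involved
    reasoning_effort="high",  # hypothetical knob standing in for 'use the thinking version'
    messages=[{"role": "user", "content": "Think hard about this: plan my week."}],
)
print(response.choices[0].message.content)
```

Something equivalent in the ChatGPT settings, at least at the $200 tier, does not seem like too much to ask.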
If you are a ‘normal’ user, then I suppose All Of This Is Fine. It certainly solves the ‘these names and interface are a nightmare to understand’ problem.
Wait, shouldn’t the system card focus on gpt-5-thinking-pro?
At minimum, the preparedness framework should exclusively check pro for any test not run under time pressure. The reason for this should be obvious?
Safe Completions
They introduce the long overdue concept of ‘safe completions.’ In the past, if the model couldn’t answer the request, it would say some version of ‘I can’t help you with that.’
Now it will skip over the parts it can’t safely do, but will do its best to be otherwise helpful. If the request is ambiguous, it will choose to answer the safe version.
That seems like a wise way to do it.
Mundane Safety
The default tests look good except for personal-data, which they say is noise. I buy that given personal-data/restricted is still strong, although I’d have liked to see a test run to double check as this is one of the worse ones to turn unreliable.
Here we see clear general improvement:
I notice that GPT-5-Thinking is very good on self-harm. An obvious thing to do would be to route anything involving self-harm to the thinking model.
The BBQ bias evaluations look like they aren’t net changing much.
Sycophancy
OpenAI has completed the first step and admitted it had a problem.
Something about a contestant on The Price Is Right. If you penalize obvious sycophancy while also rewarding being effectively sycophantic, you get the AI being ‘deceptively’ sycophantic.
That is indeed how it works among humans. If you’re too obvious about your sycophancy then a lot of people don’t like that. So you learn how to make them not realize it is happening, which makes the whole thing more dangerous.
I worry in the medium term. In the short term, yes, right now we are dealing with super intense, impossible to disguise obnoxious levels of sycophancy, and forcing it to be more subtle is indeed a large improvement. Long term, trying to target the particular misaligned action won’t work.
I do get that this is hard, and it is great that they are actively working on it.
For now, yes, they appear rare.
The Art of the Jailbreak
That good, huh?
What about jailbreaking via the API, where the user is granted control over the developer message? Could this circumvent the system message guardrails? The answer is yes, at least for GPT-5-Main: there are some regressions and the protections are not robust.
Otherwise, though? No problem, right?
Well, send in the jailbreaking red team to test the API directly.
Okay. So on average, at the skill levels here, jailbreaks were a bunch of work.
What about Far.AI?
If in practice getting through the first set of defenses triggers the second set which lets you fix the hole in the first set, that’s a good place to be.
What about Grey Swan?
Actually pretty good. Again, defense-in-depth is a good strategy if you can ‘recover’ at later stages and use that to fix previous stages.
Send in the official teams, then, you’re up.
That sounds like the best performance of any of the attackers. UK AISI has been doing great work, whereas it looks like CAISI didn’t find anything useful. Once again, OpenAI is saying ‘in practice you couldn’t have sustained this to do real damage without being caught.’
That’s your cue, sir.
Manifold markets asks, will Pliny jailbreak GPT-5 on launch day? And somehow it took several hours for people to respond with ‘obviously yes.’
It actually took several hours. It did not during that time occur to me that this might be because Pliny was actually finding this hard.
Hallucinations
This does seem like substantial progress. There are slightly fewer true claims, and a much bigger reduction in false claims.
If this holds up in practice it is a dramatic improvement from o3 the lying liar. Even if GPT-5 was otherwise the same as o3, dramatically cutting hallucinations is a huge quality of life improvement.
The bad news is that these apparently large improvements mostly vanish on SimpleQA. We see only modest improvement here, although I question SimpleQA’s usefulness if it thinks o3’s hallucination rate was better than GPT-4o’s.
This is how they put the improvement in their announcement post:
Deception
OpenAI focused on situations in which the task is impossible to solve, adding a bunch of training data where the correct answer is to admit that the problem cannot be solved.
On exactly these types of requests, this saw big improvements, although there is still a long way to go.
They monitored the Chain of Thought for deception, and found it in 2.1% of responses for GPT-5-Thinking versus 4.8% for o3.
That’s not a very accurate monitor. I don’t trust a 2.1% measured rate that much if your reported precision is only 81%, and the topic is deception.
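To make that concrete, here is a quick back-of-the-envelope sketch. The 2.1% flag rate and 81% precision are the reported numbers; the recall values are pure assumptions, since recall is not reported, which is exactly the problem:

```python
# Why an 81%-precision monitor makes a 2.1% flag rate hard to interpret.
flag_rate = 0.021   # fraction of GPT-5-Thinking responses the CoT monitor flags (reported)
precision = 0.81    # fraction of flagged responses that are truly deceptive (reported)

true_flagged = flag_rate * precision          # deceptive AND caught: about 1.7%
for recall in (0.9, 0.5, 0.25):               # hypothetical: how much deception the monitor misses
    implied_rate = true_flagged / recall      # estimated underlying deception rate
    print(f"recall={recall:.2f} -> implied deception rate ~ {implied_rate:.1%}")
# With recall unknown, the true rate could be anywhere from ~1.7% on up.
```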
The obvious more general objection to this approach is that this doesn’t solve being a deceptive bastard in general. You still have a deceptive bastard, except with instructions not to act like one in these particular situations. So when future different situations come up, that will have more diversity, will it generalize to ‘don’t be a deceptive bastard?’ Aren’t you kind of training on the test here?
Is that even deception? That sounds more like refusing the instructions, often because they are stupid, unless it is actively deciding to falsely say ‘there were no sources.’
Yes. It is concerning. I interpret this as saying ‘if you are doing the exact types of things where we trained it not to deceive you, it probably won’t deceive you. If you are doing different things and it has incentive to deceive you, maybe worry about that.’
Again, contrast that presentation with the one on their announcement:
That’s what it wants you to think.
Red Teaming
There were three red team groups: Pre-Deployment Research, API Safeguards Testing and In-Product Safeguards Testing (for ChatGPT).
I am down for 400 external testers. I would have liked more than 22.5 hours per tester?
I also notice we then get reports from three red teaming groups, but they don’t match the three red teaming groups described above. I’d like to see that reconciled formally.
Violent Attack Planning
They started with a 25 person team assigned to ‘violent attack planning,’ as in planning violence.
That is a solid win rate I suppose, but also a wrong question? I hate how much binary evaluation we do of non-binary outcomes. I don’t care how often one response ‘wins.’ I care about the degree of consequences, for which this tells us little on its own. In particular, I am worried about the failure mode where we force sufficiently typical situations into optimized responses that are only marginally superior in practice, but it looks like a high win rate. I also worry about lack of preregistration of the criteria.
My ideal would be, instead of asking ‘did [Model A] produce a safer output than [Model B],’ that for each question we evaluate both [A] and [B] on an (ideally log) scale of dangerousness of responses.
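As a toy illustration, with entirely made-up harm numbers, a model can win 90% of head-to-head comparisons while being worse in expectation once you score magnitude:

```python
import math

# Toy example (made-up numbers): Model A is marginally safer almost every time,
# but its one catastrophic failure dominates once you score magnitude of harm.
harm_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 10000]  # usually slightly safer, one disaster
harm_b = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]      # uniformly mediocre

win_rate_a = sum(a < b for a, b in zip(harm_a, harm_b)) / len(harm_a)
mean_log_a = sum(math.log10(h) for h in harm_a) / len(harm_a)
mean_log_b = sum(math.log10(h) for h in harm_b) / len(harm_b)

print(f"A wins {win_rate_a:.0%} of comparisons")                    # 90%, looks great
print(f"mean log10 harm: A={mean_log_a:.2f}, B={mean_log_b:.2f}")   # A is worse (0.40 vs 0.30)
```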
Prompt Injections
This follows the usual ‘whack-a-mole’ pattern.
That’s substantial improvement, I suppose. I notice that this chart has Sonnet 3.7 but not Opus 4 or 4.1.
Also there’s something weird. GPT-5-Thinking has a 6% k=1 attack success rate and a 56.8% k=10 attack success rate, whereas with 10 independent attempts you would expect a k=10 rate of about 46%.
This suggests that GPT-5-Thinking is relatively strong against one-shot prompt injections, but that it has a problem with iterative attacks, and GPT-5 itself suggests it might ‘learn the attacker’s intent.’ If it is exposed repeatedly you should expect it to fail, so you can’t use it in places where attackers will get multiple shots. That seems like the base case for web browsing via AI agents?
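For reference, the 46% figure is just the independence baseline; a quick sanity check:

```python
# If each of 10 attack attempts succeeded independently at the observed k=1 rate of 6%,
# the chance that at least one succeeds would be:
p1 = 0.060
expected_k10 = 1 - (1 - p1) ** 10
print(f"{expected_k10:.1%}")   # about 46.1%, well below the observed 56.8%
# The gap is the evidence that repeated attempts are not independent coin flips:
# iterating against the model helps the attacker.
```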
Microsoft AI Red Teaming
This seems like an odd choice to highlight? It is noted they got several weeks with various checkpoints and did a combination of manual and automated testing across frontier, content and psychosocial threats.
Mostly it sounds like this was a jailbreak test.
The ‘don’t say bad words’ protocols held firm.
Dealing with users in distress remains a problem:
There were also more red teaming groups that did the safeguard testing for frontier testing, which I’ll cover in the next section.
Preparedness Framework (Catastrophic and Existential Risks)
Fine Tuning
For GPT-OSS, OpenAI tested under conditions of hostile fine tuning, since it was an open model where they could not prevent such fine tuning. That was great.
I understand why they are drawing the line where they are drawing it. I still would have loved to see them run the test here, on the more capable model, especially given they already created the apparatus necessary to do that.
At minimum, I request that they run this before allowing fine tuning.
Safeguarding the API
Time for the most important stuff. How dangerous is GPT-5? We were already at High biological and chemical risk, so those safeguards are a given, although this is the first time such a model is going to reach the API.
Steven Adler documents that a feature to identify the user has been in the OpenAI API since 2021 and is hence not new. Johannes Heidecke responds that the above statement from the system card is still accurate because it is being used differently and has been renamed.
And so it begins. OpenAI is recognizing that giving unlimited API access to 5-thinking, without any ability to track identity and respond across queries, is not an acceptable risk at this time – and as usual, if nothing ends up going wrong that does not mean they were wrong to be concerned.
They sent a red team to test the safeguards.
These numbers don’t work if you get unlimited independent shots on goal. They do work pretty well if you get flagged whenever you try and it doesn’t work, and you cannot easily generate unlimited new accounts.
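A toy model of that tradeoff, with entirely made-up numbers since I am not restating the card’s actual rates:

```python
# Made-up numbers, for illustration only.
p_success = 0.02   # hypothetical chance a single attempt slips past the safeguards
p_flagged = 0.50   # hypothetical chance any given attempt gets the account flagged

# Unlimited unmonitored shots: success becomes near-certain.
n = 200
print(f"P(at least one success in {n} free attempts) = {1 - (1 - p_success) ** n:.1%}")  # ~98%

# With flagging, an account survives roughly 1 / p_flagged attempts on average,
# so the expected number of successes per account is only on the order of:
print(f"Expected successes per account ~ {p_success / p_flagged:.2f}")  # ~0.04
# The defense then reduces to how hard it is to spin up fresh accounts.
```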
Biological Capabilities Remain Similar
Aside from the issues raised by API access, how much is GPT-5 actually more capable than what came before in the ways we should worry about?
On long form biological questions, not much? This is hard to read but the important comparison is between the two big bars, which are the previous agent (green) and the new agent (orange).
The same goes for virology troubleshooting; we see no change:
And again for ProtocolQA, Tacit Knowledge and Troubleshooting, TroubleshootingBench (good bench!) and so on.
Capture the Flag sees a small regression. We do see something new here, but it happens in… thinking-mini?
Okay, sure. I buy that. But this seems remarkably uncurious about why the mini model is doing this and the non-mini model isn’t? What’s up with that?
The main gpt-5-thinking did provide improvement over o3 in the Pattern Labs tests, although it still strikes out on the hard tests.
MLE-bench-30 measures ability to do machine learning tasks, and here GPT-5-Thinking scores 8%, still behind ChatGPT agent at 9%. There’s no improvement on SWE-Lancer, no progress on OpenAI-Proof Q&A and only a move from 22%→24% on PaperBench.
That One Graph From METR
This was one place GPT-5 did well and was right on schedule. Straight line on graph.
METR has details about their evaluation here, and a summary thread here. They got access (to current builds at the time) four weeks prior to release, along with the full reasoning traces. It’s great to see the increased access METR and others got here.
One odd note is that the 80% success rate line doesn’t seem to be above recent trend. Given GPT-5’s emphasis on reliability and reducing hallucinations, that’s curious.
METR’s logic mostly boils down to ‘the things you need to be dangerous often require on the level of weeks of work, and this is still only two hours, we are not going to worry until we see on the order of 40 hours at 50% reliability or 8 hours at 80% reliability,’ or other large new affordances or capabilities, or finding steganography or obfuscation in the reasoning traces.
On that note, METR also did note:
METR looked at the reasoning traces, and yep, the model was situationally aware.
Yep. Not only ‘this is an eval’ but ‘this is a particular eval from a particular org.’
This is all but screaming that sandbagging will become inevitable once a future, more capable model is skilled enough and would benefit from it. Right now, this one isn’t skilled enough, and likely wouldn’t benefit. Thus, thank you for:
This task was given to Apollo.
Helpful-only means a lot more deceptive. That’s worth remembering.
It is especially worth remembering during training. Under current methods, you will have to pass through a phase in which you have a helpful-only model before you have a non-deceptive model. That might be a point when you should worry? If it ‘had a strong goal’ (such as ‘maintain my current goals’?) on top of that, it’s going to be rather highly deceptive.
What happens when the frequencies involved get a lot higher? Coming soon.
Big Compute
On Your Marks
OpenAI’s announcements act like other AI companies don’t exist. You get no comparison scores. I get why their incentives are to do this. That is what you do when you have this much consumer market share. But it makes it much more annoying to know how good the benchmark scores actually are.
Combine that with them selecting what to present, and this section is kind of hype.
Multilingual performance is reported slightly better for 5-thinking than o3-high.
AIME is now essentially saturated for thinking models.
I notice that we include GPT-5-Pro here but not o3-pro.
As another reference point, Gemini 2.5 Pro (not Deep Think) scores 24.4%, Opus is likely similar. Grok 4 only got 12%-14%.
I’m not going to bother with the HMMT chart, it’s fully saturated now.
On to GPQA Diamond, where we once again see some improvement.
What about the Increasingly Worryingly Named Humanity’s Last Exam? Grok 4 Heavy with tools clocks in here at 44.4% so this isn’t a new record.
HealthBench is a staple at OpenAI.
One model not on the chart is GPT-OSS-120b, so as a reference point, from that system card:
If you have severe privacy concerns, 120b seems like it does okay, but GPT-5 is a lot better. You’d have to be rather paranoid. That’s the pattern across the board.
What about health hallucinations, potentially a big deal?
That’s a vast improvement. It’s kind of crazy that previously o3 was plausibly the best you could do, and GPT-4o was already beneficial for many.
SWE-bench Verified is listed under the preparedness framework but this is a straight benchmark.
Which is why it is also part of the official announcement:
Or they also present it like this:
This is an excuse to reproduce the infamous Opus 4.1 graph; we have a virtual tie:
However, always be somewhat cautious until you get third party verification:
Epoch has the results in, and they report Opus 4.1 is at 63% versus 59% for GPT-5 (medium) or 58% for GPT-5-mini. If we trust OpenAI’s delta above that would put GPT-5 (high) around 61%.
Graham then does a calculation that seems maximally uncharitable, marking all 23 as zeroes, and thus seems invalid to me; it would put GPT-5 below Sonnet 4, but yeah, let’s see what the third party results are.
There wasn’t substantial improvement in ability to deal with OpenAI pull requests.
On instruction following and agentic tool use, they suddenly cracked the Telecom problem in Tau-2:
I would summarize the multimodal scores as slight improvements over o3.
My understanding is that long context needle is pretty solid for other models, but it’s hard to directly compare because this is an internal benchmark. Gemini Pro 2.5 and Flash 2.5 seem to also be over 90% for 128k, on a test that GPT-5 says is harder. So this is a big step up for OpenAI, and puts them in rough parity with Gemini.
I was unable to easily find scores for Opus.
Other People’s Benchmarks
The other people’s benchmark crowd has been slow to get numbers out, but we do have some results.
Here’s Aider Polyglot, a coding accuracy measure that has GPT-5 doing very well.
What about (what’s left of) Arena?
That’s a clear #1, in spite of the reduction in sycophancy they are claiming, but not by a stunning margin. In WebDev Arena, the lead is a lot more impressive:
However, the market on this unexpectedly went the other way; the blue line is Google and the red line is OpenAI:
The issue is that Google ‘owns the tiebreaker,’ plus the market evaluates with style control set to off, whereas the leaderboard by default is always shown with it on. Always read the fine print.
What that market tells you is that everyone expected GPT-5 to be in front despite those requirements, and it wasn’t. Which means it underperformed expectations.
On FrontierMath, Epoch reports GPT-5 hits a new record of 24.8%. OpenAI has reliably had a large lead here for a while.
Artificial Analysis has GPT-5 on top for Long Context Reasoning (note that they do not test Opus on any of their benchmarks).
On other AA tests of note (all of which again exclude Claude Opus, but do include Gemini Pro and Grok 4): they have it 2nd behind Grok 4 on GPQA Diamond, 1st on Humanity’s Last Exam (but only at 26.5%), 1st on MMLU-Pro at 87%, and at 43% on SciCode, behind both o4-mini and Grok 4 and on par with Gemini 2.5 Pro. IFBench has it on top, slightly above o3.
On AA’s LiveCodeBench it especially disappoints, with high mode strangely doing worse than medium mode and both doing much worse than many other models including o4-mini-high.
Lech Mazur gives us the Short Story Creative Writing benchmark, where GPT-5-Thinking comes out on top. I continue to not trust the grading on writing but it’s not meaningless.
One place GPT-5 falls short is ARC-AGI-2, where its 9.9% still trails 15.9% for Grok 4.
GPT-5 (high) takes the top spot in Weird ML.
It takes the high score on Clay Schubiner’s benchmark as well by one point (out of 222) over Gemini 2.5 pro.
Is That The Best You Can Do?
The particular account the first post here linked to is suspended and likely fake, but statements by Altman and others imply the same thing rather strongly: That when OpenAI releases GPT-5, they are not ‘sending their best’ in the sense of their most intelligent model. They’re sending something designed for consumers and to use more limited compute.
The default instinct is to react to GPT-5 as if it is the best OpenAI can do, and to update on progress based on that assumption. That is a dangerous assumption to be making, as it could be substantially wrong, and the same is true for Anthropic.
The point about internal testing should also be emphasized. If indeed there are superior internal models, those are the ones currently creating the most danger, and most in need of proper frontier testing.
Things To Come
None of this tells you how good GPT-5 is in practice.
That question takes longer to evaluate, because it requires time for users to try it and give their feedback, and to learn how to use it.
Learning how is a key part of this, especially since the rollout was botched in several ways. Many of the initial complaints were about lacking a model selector, losing access to old models, or severe rate limits, in ways that have already been addressed. Or people were using GPT-5 instead of GPT-5-Thinking without realizing they had a thinking job. And many hadn’t given 5-Pro a shot at all. It’s good not to jump the gun.
A picture is emerging. GPT-5-Thinking and GPT-5-Pro look like substantial upgrades to o3 and o3-Pro, with 5-Thinking as ‘o3 without being a lying liar,’ which is a big deal, and 5-Pro by various accounts simply being better.
There are also questions of how to update timelines and expectations in light of GPT-5.
I’ll be back with more tomorrow.