The same way that Claude Code builds Claude Code, Codex now builds Codex.
I am curious: when humans build AIs, they probably do similar things, because they draw from the same science, the same general knowledge, and so on. When the AIs start coding themselves, should we expect them to diverge?
Maybe not, because they are still trained on the same data. Maybe yes, if they can somehow create and accumulate knowledge for themselves that they no longer share? (For example, if they design and run their own experiments and learn from them.)
Before I could get this report out, OpenAI also gave us GPT-5.3-Codex-Spark, an ultra-low latency version of Codex that I cover in its own section below. I think the majority of that speedup is due to OpenAI’s partnership with Cerebras rather than a new model. Cerebras chips can reach this speed because they are much larger and don’t have to pay the costs of interconnect.
OpenAI is back with a new Codex model, released the same day as Claude Opus 4.6.
The headline pitch is it combines the coding skills of GPT-5.2-Codex with the general knowledge and skills of other models, along with extra speed and improvements in the Codex harness, so that it can now handle your full stack agentic needs.
We also got the Codex app for Mac, which is getting positive reactions, and quickly picked up a million downloads.
GPT-5.3-Codex is only available inside Codex. It is not in the API.
As usual, Anthropic’s release was understated, basically a ‘here’s Opus 4.6, a 212-page system card and a lot of benchmarks, it’s a good model, sir, so have fun.’ Whereas OpenAI gave us a lot fewer words and a lot fewer benchmarks, while claiming their model was definitely the best.
Table of Contents

1. The Overall Picture.
2. Quickly, There’s No Time.
3. System Card.
4. AI Box Experiment.
5. Maybe Cool It With Rm.
6. Preparedness Framework.
7. Glass Houses.
8. OpenAI Appears To Have Violated SB 53 In a Meaningful Way.
9. Safeguards They Did Implement.
10. Misalignment Risks and Internal Deployment.
11. The Official Pitch.
12. Inception.
13. Turn The Beat Around.
14. Codex Does Cool Things.
15. Positive Reactions.
16. Negative Reactions.
17. Codex of Ultimate Vibing.
The Overall Picture
GPT-5.3-Codex (including Codex-Spark) is a specialized model designed for agentic coding and related uses in Codex. It is not intended as a general frontier model, thus the lack of most general benchmarks and it being unavailable on the API or in ChatGPT.
For most purposes other than Codex and agentic coding, ones that aren’t heavy-duty enough to put Gemini 3 Pro Deep Think V2 in play, this makes Claude Opus 4.6 clearly the best model, and the clear choice for daily driver.
For agentic coding and other intended uses of Codex, the overall gestalt is that Codex plus GPT-5.3-Codex is competitive with Claude Code with Claude Opus 4.6.
If you are serious about your agentic coding and other agentic tasks, you should try both halves out and see which one, or what combination, works best for you. But also you can’t go all that wrong specializing in whichever one you like better, especially if you’ve put in a bunch of learning and customization work.
You should probably be serious about your agentic coding and other agentic tasks.
Quickly, There’s No Time
Before I could get this report out, OpenAI also gave us GPT-5.3-Codex-Spark, which is ultra-low latency Codex, more than 1,000 tokens per second. Wowsers. That’s fast.
As in, really super duper fast. Code appears essentially instantaneously. There are times when you feel the need for speed and not the need for robust intelligence. Many tasks are more about getting it done than about being the best like no one ever was.
It does seem like it is a distinct model, akin to GPT-5.3-Codex-Flash, with only a 128k context window and lower benchmark scores, so you’ll need to be confident that is what you want. Going back and fixing lousy code is not usually faster than getting it right the first time.
It is very different from Claude Opus 4.6 Fast Mode, which is regular Opus running faster in exchange for much higher costs.
System Card
GPT-5.3-Codex is specifically a coding model. It incorporates general reasoning and professional knowledge because that information is highly useful for coding tasks.
Thus, it is a bit out of place to repeat the usual mundane harm evaluations, which put the model in contexts where this model won’t be used. It’s still worth doing. If the numbers were slipping substantially we would want to know. It does look like things regressed a bit here, but within a range that seems fine.
AI Box Experiment
It is weird to see OpenAI restricting the access of Codex more than Anthropic restricts Claude Code. Given the different abilities and risk profiles, the decision seems wise. Trust is a highly valuable thing, as is knowing when it isn’t earned.
The default intended method for using Codex is in an isolated, secure sandbox in the cloud, on an isolated computer, even when it is used locally. Network access is disabled by default, and edits are restricted.
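As a purely illustrative sketch of that default posture, here is what the policy shape could look like; the class and field names are hypothetical assumptions of mine, not the actual Codex configuration schema:

```python
from dataclasses import dataclass, field

# Hypothetical illustration of the default posture described above:
# no network access, writes confined to the workspace. Not the real Codex schema.
@dataclass
class SandboxPolicy:
    network_access: bool = False                      # disabled by default
    writable_roots: list[str] = field(default_factory=lambda: ["/workspace"])

    def can_write(self, path: str) -> bool:
        """Edits are restricted to explicitly whitelisted directories."""
        return any(path.startswith(root) for root in self.writable_roots)

policy = SandboxPolicy()
assert policy.network_access is False
assert policy.can_write("/workspace/src/main.py")
assert not policy.can_write("/etc/passwd")
```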
Maybe Cool It With Rm
I really like specifically safeguarding against data destructive actions.
Their solution was to train the model specifically not to revert user edits, and to introduce additional prompting to reinforce this.
It’s great to go from 66% to 76% to 88% ‘destructive action avoidance’ but that’s still 12% destructive action non-avoidance, so you can’t fully rest easy.
In practice, I notice that it is a small handful of commands, which they largely name here (rm -rf, git clean -xfd, git reset --hard, git push --force), that cause most of the big trouble.
Why not put in place special protections for them? It does not even need to be requiring user permission. It can be ‘have the model stop and ask itself whether doing this is actually required and whether it would potentially mess anything up, and have it be fully sure it wants to do this.’ Could in practice be a very good tradeoff.
The obvious answer is that the model can then circumvent the restrictions, since there are many ways to mimic those commands, but that requires intent to circumvent. Seems like it should be solvable with the right inoculation prompting?
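To make the suggestion concrete, here is a minimal hypothetical sketch of what a pre-execution guard for that short list of commands could look like. The patterns, function names, and confirmation flow are illustrative assumptions of mine, not anything OpenAI has described implementing:

```python
import re

# Illustrative patterns for the small set of commands that cause most of the
# big trouble. This hook is a hypothetical sketch, not Codex's actual guardrails.
DESTRUCTIVE_PATTERNS = [
    r"\brm\s+-[a-zA-Z]*r[a-zA-Z]*f",   # rm -rf
    r"\bgit\s+clean\s+-[a-zA-Z]*x",    # git clean -xfd
    r"\bgit\s+reset\s+--hard\b",
    r"\bgit\s+push\b.*--force\b",
]

def requires_second_look(command: str) -> bool:
    """True if the command matches a known data-destructive pattern."""
    return any(re.search(p, command) for p in DESTRUCTIVE_PATTERNS)

def run_with_guard(command: str, execute, confirm) -> None:
    """Before running a flagged command, stop and ask (the model itself, or the
    user) whether the destructive step is actually required and safe."""
    if requires_second_look(command) and not confirm(
        f"About to run a potentially destructive command:\n  {command}\n"
        "Is this actually required, and is it safe to run?"
    ):
        return  # refuse to run it
    execute(command)

# Example: anything not on the list runs normally; rm -rf requires confirmation.
run_with_guard("ls -la", execute=print, confirm=lambda msg: False)
run_with_guard("rm -rf build/", execute=print, confirm=lambda msg: False)  # blocked
```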
Preparedness Framework
The biological and chemical assessment shows little improvement over GPT-5.2. This makes sense given the nature of 5.3-Codex, and we’re already at High. Easy call.
The cybersecurity assessment makes this the first model ranked at High.
Kudos to OpenAI for handling this correctly. If you don’t know that it isn’t High, then it is High. I’ve been beating that drum a lot and it’s great that they’re listening. Points.
The question is where we are on AI Self-Improvement, which is 5.1.3.
The first thing to say about this is always that the High threshold is very high.
If you can fully duplicate a mid-career research engineer with AI, you are damn close to being fully off to the races. There won’t be much time between High and Critical.
So how did the model do on these tests? Not so well.
Given that it is clear 5.3-Codex is substantially better than 5.2-Codex, this raises the question of sandbagging, which is the subject of 5.1.4. The task was outsourced to Apollo Research.
This did not sound like a reassuring report on sabotage.
The obvious thing to do is to take a very close look at the questions where GPT-5.2-Codex was succeeding, and GPT-5.3-Codex is failing, especially on OpenAI-Proof. I want a damn strong understanding of why GPT-5.3-Codex is regressing in those spots.
Glass Houses
OpenAI’s Noam Brown took a valid shot across the bow at Anthropic for the ad hockery present in their decision to release Claude Opus 4.6. He’s right, and he virtuously acknowledged that Anthropic was being transparent about that.
The thing is, while it seems right that Anthropic and OpenAI are trying (Google is trying in some ways, but they just dropped Gemini 3 Deep Think V2 with zero safety discussions whatsoever, which I find rather unacceptable), OpenAI very much has its own problems here. Most of the problems come from the things OpenAI did not test or mention, but there is also one very clear issue.
OpenAI Appears To Have Violated SB 53 In a Meaningful Way
The Midas Project wrote up this particular issue.
The core problem is simple: OpenAI classified GPT-5.3-Codex as High risk in cybersecurity. Under their framework, this wisely requires High level safeguards against misalignment.
They then declare that the previous wording did not require this, and was inadvertently ambiguous. I disagree. I read the passage as unambiguous, and also I believe that the previous policy was the right one.
Even if you think I am wrong about that, it would still mean that OpenAI must implement the safeguards if the model is High on both cybersecurity and autonomy. OpenAI admits that they cannot rule out High capability in autonomy, despite declaring 10 months ago the need to develop a test for that. The proxy measurements OpenAI used instead seem clearly inadequate. If you can’t rule out High, that means you need to treat the model as High until that changes.
All of their hype around Codex talks about how autonomous this model is, so I find it rather plausible that it is indeed High in autonomy.
Steven Adler investigated further and wrote up his findings. He found their explanations unconvincing. He’s a tough crowd, but I agree with the conclusion.
This highlights both the strengths and weaknesses of SB 53.
It means we get to hold OpenAI accountable for having broken their own framework.
However, it also means we are punishing OpenAI for having a good initial set of commitments, and for being honest about not having met them.
The other issue is the fines are not meaningful. OpenAI may owe ‘millions’ in fines. I’d rather not pay millions in fines, but if that were the only concern I also wouldn’t delay releasing 5.3-Codex by even a day in order to not pay them.
The main advantage is that this is a much easier thing to communicate, that OpenAI appears to have broken the law.
I have not seen a credible argument for why OpenAI might not be in violation here.
The California AG stated they cannot comment on a potential ongoing investigation.
Safeguards They Did Implement
The plan is to monitor for potential attacks and teach the model to refuse requests, while providing trusted model access to known defenders. Accounts are tracked for risk levels. Users who use ‘dual use’ capabilities often will have to verify their identities. There is two-level always-on monitoring of user queries to detect cybersecurity questions and then evaluate whether they are safe to answer.
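To illustrate the shape of that two-level pipeline, here is a minimal hypothetical sketch; the function names, types, and routing logic are my assumptions, not OpenAI’s published implementation. A cheap always-on detector flags cybersecurity-related queries, and only that subset goes to a more expensive safety evaluation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    cyber_related: bool
    safe_to_answer: bool
    rationale: str

def monitor(
    query: str,
    detect_cyber: Callable[[str], bool],                 # stage 1: cheap, runs on every query
    evaluate_safety: Callable[[str], tuple[bool, str]],  # stage 2: expensive, runs on the flagged subset
) -> Verdict:
    """Two-level, always-on monitoring: detect cyber queries, then decide
    whether they are safe (e.g. clearly defensive or educational) to answer."""
    if not detect_cyber(query):
        return Verdict(False, True, "not cybersecurity-related")
    allowed, rationale = evaluate_safety(query)
    return Verdict(True, allowed, rationale)

# Usage with stand-in classifiers (a real deployment would use trained models):
verdict = monitor(
    "How do I patch this SQL injection?",
    detect_cyber=lambda q: "injection" in q.lower(),
    evaluate_safety=lambda q: (True, "defensive question"),
)
```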
They held a ‘universal jailbreak’ competition and 6 complete and 14 partial such jailbreaks were found, which was judged ‘not blocking.’ Those particular tricks were presumably patched, but if you find 6 complete jailbreaks that means there are a lot more of them.
UK AISI also found a (one pass) universal jailbreak that scored 0.778 pass@200 on a policy-violating cyber dataset OpenAI provided. If you can’t defend against one fixed prompt that was found in only 10 hours of work, you are way behind on dealing with customized multi-step prompts.
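As background on the metric (my gloss, not OpenAI’s or AISI’s definition): a pass@200 of 0.778 roughly means that, for about 78% of the dataset items, at least one of 200 attempts got through. A minimal sketch of the standard unbiased pass@k estimator from the HumanEval paper, included only to make the number concrete:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k for a single task, given n total samples
    of which c succeeded: the probability that at least one of k randomly
    chosen samples is a success. Averaged over tasks to get the reported score."""
    if n - c < k:
        return 1.0  # at least one success is guaranteed in any k samples
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = n = 200, pass@200 for a task is simply 1.0 if any of the 200
# attempts succeeded and 0.0 otherwise, so a dataset average of 0.778 means
# the jailbreak worked at least once on roughly 78% of the items.
print(pass_at_k(200, 3, 200))  # 1.0
print(pass_at_k(200, 0, 200))  # 0.0
```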
Later they say ‘undiscovered universal jailbreaks may still exist’ as a risk factor. Let me fix that sentence for you, OpenAI. Undiscovered universal jailbreaks still exist.
Thus the policy here is essentially hoping that there is sufficient inconvenience, and sufficient lack of cooperation by the highly skilled, to prevent serious incidents. So far, this has worked.
Their risk list also included ‘policy gray areas’:
This seems to be a confusion of map and territory. What matters is not whether experts ever disagree, it is whether expert labels reliably lack false negatives, including false negatives that are found by the model. I think we should assume that the expert labels have blind spots, unless we are willing to be highly paranoid with what we cover, in which case we should still assume that but we might be wrong.
Misalignment Risks and Internal Deployment
I was happy to see the concern with internal deployment, and with misalignment risk. They admit that they need to figure out how to measure long-range autonomy (LRA) and other related evaluations. It seems rather late in the game to be doing that, given that those evaluations seem needed right now.
OpenAI seems less concerned, and tries to talk its way out of this requirement.
Yeah, no. This was not ambiguous. I believe OpenAI has violated their framework.
The thing that stands out in the model card is what is missing. Anthropic gave us a 212 page model card and then 50 more pages for a sabotage report that was essentially an appendix. OpenAI gets it done in 33. There’s so much stuff they are silently ignoring. Some of that is that this is a Codex-only model, but most of the concerns should still apply.
The Official Pitch
GPT-5.3-Codex is not in the API, so we don’t get the usual array of benchmarks. We have to mostly accept OpenAI’s choices on what to show us.
They call this state of the art performance:
The catch is that SWE-Bench-Pro has different scores depending on who you ask to measure it, so it’s not clear whether or not they’re actually ahead of Opus on this. They’ve improved on token efficiency, but performance at the limit is static.
For OSWorld, they are reporting 64.7% as ‘strong performance,’ but Opus 4.6 leads at 72.7%.
OpenAI has a better case in Terminal Bench 2.0.
For Terminal Bench 2.0, they jump from 5.2-Codex at 64% to 5.3-Codex at 77.3%, versus Opus 4.6 at 65.4%. That’s a clear win.
They make no progress on GDPVal, matching GPT-5.2.
They point out that while GPT-5.2-Codex was narrowly built for code, GPT-5.3-Codex can support the entire software lifecycle, and even handle various spreadsheet work, assembling PDF presentations and such.
Most of the biggest signs of improvement on tests for GPT-5.3-Codex are actually on the tests within the model card. I don’t doubt that it is actually a solid improvement.
They summarize this evidence with some rather big talk. This is OpenAI, after all.
Here were the headline pitches from the top brass:
The most interesting thing in their announcement is that, the same way that Claude Code builds Claude Code, Codex now builds Codex. That’s a claim we’ve also seen elsewhere in very strong form.
There are obvious issues with a model helping to create itself. I do not believe OpenAI, in the system card or otherwise, has properly reckoned with the risks there.
That’s how I have to put it in 2026, with everyone taking crazy pills. The proper way to talk about it is more like this:
Inception
OpenAI is trying to ‘win’ the battle for agentic coding by claiming to have already won, despite having clear minority market share, and by outright stating that they are the best.
The majority opinion is that they are competitive, but not the best.
Vagueposting is mostly fine. Ignoring the competition entirely is fine, and smart if you are sufficiently ahead on recognition; it’s annoying (I have to look up everything) but at least I get it. Touting what your model and system can do is great, especially given that by all reports they have a pretty sweet offering here. It’s highly competitive. Not mentioning the ways you’re currently behind? Sure.
Inception is different. Inception and similar vibes wars are highly disingenuous; they poison the epistemic environment, they are a pet peeve of mine, and they piss me off.
So you see repeated statements like this one about Codex and the Codex app:
Or this:
If you look at the responses to Weiss, they do not support his story.
Turn The Beat Around
Claude Code should be able to do this too, but my understanding is that right now it doesn’t work right; you are effectively interrupting the task. So yes, this is a real edge for tasks that take a long time, until Anthropic fixes the problem.
As with Claude Code, it’s time to assemble a team:
Codex Does Cool Things
A Playwright script? Cool.
Positive Reactions
Matt Shumer is crazy positive on Codex 5.3, calling it a ‘fucking monster,’ although he was comparing to Opus 4.5 rather than 4.6. There is a lot more good detail at the link.
This was the thing I was most excited to hear:
A common sentiment was that both Codex 5.3 and Opus 4.6, with their respective harnesses, are great coding models, and you could use both or use a combination.
For many the difference is stylistic, and there is no right answer, or you want to use a hybrid process.
Many people just like it, it’s a good model, sir, whee. Those who try it seem to like it.
One particular note was Our Price Cheap:
Negative Reactions
Not everyone is a fan.
Some reactions can be a bit extreme, including for not the best reasons.
Codex of Ultimate Vibing
I am getting strong use out of Claude Code. I believe that Opus 4.6 and Claude Code have a strong edge right now for most other uses.
However, I am not a sufficiently ambitious or skilled coder to form my own judgments about Claude Code and Claude Opus 4.6 versus Codex and GPT-5.3-Codex for hardcore professional agentic coding tasks.
I have to go off the reports of others. Those reports robustly disagree.
My conclusion is that the right answer will be different for different users. If you are going to be putting serious hours into agentic coding, then you need to try both options, and decide for yourself whether to go with Claude Code, Codex or a hybrid. The next time I have a substantial new project I intend to ask both and let them go head to head.
If you go with a hybrid approach, there may also be a role for Gemini that extends beyond image generation. Gemini 3 Deep Think V2 in particular seems likely to have a role to play in especially difficult queries.