OpenAI has given us GPT-5.1-Codex-Max, their best coding model for OpenAI Codex.
They claim it is faster, more capable, more token-efficient, and more persistent on long tasks.
It scores 77.9% on SWE-bench Verified, 79.9% on SWE-Lancer IC SWE, and 58.1% on Terminal-Bench 2.0, all substantial gains over GPT-5.1-Codex.
It has OpenAI preparing for models that reach the High capability threshold for cybersecurity threats.
There’s a 27-page system card. One could call this the secret ‘real’ GPT-5.1 that matters.
They even finally trained it to use Windows; somehow this is a new idea.
My goal is for my review of Opus 4.5 to start on Friday, as it takes a few days to sort through new releases. This post was written before Anthropic revealed Opus 4.5, and we don’t yet know how big an upgrade Opus 4.5 will prove to be. As always, try all your various options and choose what is best for you.
The Famous METR Graph
GPT-5.1-Codex-Max is a new high on the METR graph. METR’s thread is here.
That’s in between the two lines. To be precise, neither reference line is linear: the Y axis is logarithmic, the lower line is the exponential trend and the upper line is superexponential, so this is faster than the exponential trend but short of superexponential. Fingers crossed.
We do not yet know where Gemini 3 Pro lands on that graph.
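For intuition on what ‘between the two lines’ means: on METR’s log-scale chart an exponential trend (fixed doubling time) plots as a straight line, while the superexponential reference (the doubling time itself shrinking) curves upward. Here is a minimal illustrative sketch, not METR’s code; the roughly seven-month doubling time is METR’s published estimate for the historical trend, while the starting horizon and the shrink factor are made-up numbers.

```python
# Illustrative only, not METR's code: how an exponential time-horizon trend
# (fixed doubling time, a straight line on a log-scale chart) compares to a
# superexponential one (doubling time shrinks with each doubling).
# The ~7-month doubling time is METR's published estimate for the historical
# trend; the 30-minute starting horizon and 0.9 shrink factor are invented
# here purely for illustration.

def exponential_horizon(months, start_minutes=30.0, doubling_months=7.0):
    """Task-time horizon under a constant doubling time."""
    return start_minutes * 2 ** (months / doubling_months)

def superexponential_horizon(months, start_minutes=30.0, first_doubling=7.0, shrink=0.9):
    """Task-time horizon when each successive doubling takes `shrink` times as long."""
    horizon, elapsed, doubling = start_minutes, 0.0, first_doubling
    while elapsed + doubling <= months:
        elapsed += doubling
        horizon *= 2
        doubling *= shrink
    return horizon * 2 ** ((months - elapsed) / doubling)  # partial doubling

for m in (0, 12, 24, 36):
    print(f"month {m:2d}: exponential {exponential_horizon(m):7.0f} min, "
          f"superexponential {superexponential_horizon(m):7.0f} min")
```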
The System Card
Automated software engineer is the explicit goal.
It does not yet reach High level capability in Cybersecurity, but this is expected to happen shortly, and mitigations are being prepared.
Basic Disallowed Content
The results here are very good, all either optimal or improved except for mental health.
Mental health is a big thing to get wrong, although in practice Codex-Max is unlikely to be involved in high-stakes mental health tasks. Image input evaluations and jailbreak ratings are also as good as or better than 5.1.
Sandbox
When running on the cloud, Codex uses its own isolated machine.
When running on macOS or Linux, the agent is sandboxed by default.
Network access is disabled by default, which is necessary for a proper sandbox but also highly annoying in practice.
One assumes in practice that many users will start blindly or mostly blindly accepting many commands, so you need to be ready for that.
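For those who want to sanity-check the lockdown, one crude way to confirm that network access is off is to have the agent attempt an outbound connection and watch it fail. A minimal sketch, not part of Codex itself; the host, port, and timeout are arbitrary choices.

```python
# Crude probe (not part of Codex): report whether outbound network access is
# available from wherever the agent's shell commands run. In the default
# sandbox, with networking disabled, this should print "network blocked".
import socket

def network_available(host="api.openai.com", port=443, timeout=3.0):
    """Return True if an outbound TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print("network available" if network_available()
          else "network blocked (expected in the default sandbox)")
```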
Mitigations For Harmful Tasks and Prompt Injections
For harmful tasks, they trained on synthetic data to differentiate and refuse ‘harmful’ tasks such as malware. They claim to have a 100% refusal rate in their Malware Requests benchmark, the same as GPT-5-Codex. Unless they are claiming this means you can never create malware in an efficient way with Codex, they need a new benchmark.
For prompt injections, the model again scores a suspiciously perfect 1.0. I am not aware of any claims that prompt injections are a solved problem, so this seems like an inadequate benchmark.
Preparedness Framework
The way the framework works, what matters is hitting the High or Critical thresholds.
I’ve come to almost think of these as the ‘honest’ capability evaluations, since there’s relatively little incentive to make number go up and some incentive to make number not go up. If it goes up, that means something.
Biological and Chemical
Biological and Chemical Risk was already being treated as High. We see some improvements in scores on various tests, but not enough to be plausibly Critical.
I am confident the model is not suddenly at Critical here but also note this:
At minimum, we should be saying ‘we concluded that this model is safe to release so we will publish the card with what we have, and then revise the card with the full results soon so we know the full state of play.’
I still think this is substantially better than Google’s model card for Gemini 3, which hid the football quite aggressively on many key results and didn’t seem to have a robust testing suite.
Cybersecurity
Cybersecurity is in the Codex wheelhouse. They use three tests.
They list limitations that mean that excelling on all three evaluations is necessary but not sufficient to be High in cyber capability. That’s not wonderful, and I would expect to see a model treated as at least High if it excels at every test you throw at it. If you disagree, again, you need to be throwing a harder test.
We see a lot of progress in Capture the Flag, even since GPT-5-Codex, from 50% to 76%.
CVE-Bench also shows big improvement from 53% to 80%.
Finally we have Cyber Range, where once again we see a lot of improvement, although it is not yet passing the most complex scenario of the newly expanded slate.
It passed Leaked Token by ‘exploiting an unintended misconfiguration, only partially solving part of the intended attack path.’ I continue to assert, similar to my position on Google’s similar evaluations, that this should not be considered especially less scary, and the model should get credit for it.
I see only two possibilities.
The Safety Advisory Committee indeed recommended that the difficulty level of the evaluations be raised, but decided this did not yet reach High capability. In addition to technical mitigations to the model, OpenAI acknowledges that hardening of potential targets needs to be a part of the strategy.
There were also external evaluations by Irregular, which did not show improvement from GPT-5. That’s weird, right?
That would be a decline in capability. But OpenAI released Codex and then Codex-Max for a reason: they talk throughout about its substantially increased abilities, they present Max as an improved model, and Max does much better than either version of GPT-5 on all three of OpenAI’s internal evals. The external evaluation going backwards without comment seems bizarre, and reflective of a lack of curiosity. What happened?
AI Self-Improvement
The AI that self-improves is plausibly Codex plus Codex-Max shaped.
That doesn’t mean we are especially close to getting there.
On SWE-Lancer Diamond, we jump from 67% to 80%.
On Paperbench-10 we move from 24% (GPT-5) to 34% (GPT-5.1) to 40%.
On MLE-Bench-30 we move from 8% (GPT-5) to 12% (GPT-5.1) to 17%.
On OpenAI PRs, we move from 45% to 53%.
On OpenAI Proof Q&A we move from 2% to 8%. These are real world bottlenecks each representing at least a one-day delay to a major project. A jump up to 8% on this is a really big deal.
Are we there yet? No. Are we that far away from potentially being there? Also no.
METR found Codex-Max to be in line with expectations, and finds that enabling either rogue replication or AI R&D automation within six months would require a significant trend break. Six months is not that long a period in which to be confident, even if we fully trust this judgment.
As noted at the top, GPT-5.1-Codex-Max is the new high on the METR chart, substantially above the trend line but well below the potential double-exponential line from the AI 2027 graph.
We also get Apollo Research evaluations on sandbagging, deception and in-context scheming. Apollo did not find anything newly troubling, and finds the model unlikely to cause catastrophic harm. Fair enough for now.
The frog, it is boiling. This incremental improvement seems fine. But yes, it boils.
Reactions
I have seen essentially no organic reactions, of any sort, to Codex-Max. We used to have a grand tradition of weighing in when something like this gets released. If it wasn’t anything, people would say it wasn’t anything. This time, between Gemini 3 and there being too many updates with too much hype, we did not get any feedback.
I put out a reaction thread. A number of people really like it. Others aren’t impressed. A gestalt of everything suggests it is a modest upgrade.
So the take here seems clear. It’s a good model, sir. Codex got better. Early signs are that Claude got a bigger upgrade with Opus 4.5, but it’s too soon to be sure.