You can now have o3 throw vastly more compute at a given problem. That’s o3-pro.
Should you have o3 throw vastly more compute at a given problem, if you are paying the $200/month subscription price for ChatGPT Pro? Should you pay the $200, or the order of magnitude markup over o3 to use o3-pro in the API?
That’s trickier. Sometimes yes. Sometimes no. My experience so far is that waiting a long time is annoying, sufficiently annoying that you often won’t want to wait. Whenever I ask o3-pro something, I often also ask o3 and Opus.
Using the API at scale seems prohibitively expensive for what you get, and you can (and should) instead run parallel queries using the chat interface.
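To put rough numbers on that markup, here is a minimal sketch using the post-cut o3 API prices and the o3-pro launch prices as I understand them ($2/$8 versus $20/$80 per million input/output tokens); treat both price schedules as assumptions.

```python
# Rough markup of o3-pro over o3 in the API (all prices assumed, see above).
o3_prices     = {"input": 2.00,  "output": 8.00}    # USD per 1M tokens
o3_pro_prices = {"input": 20.00, "output": 80.00}   # USD per 1M tokens

for kind in ("input", "output"):
    markup = o3_pro_prices[kind] / o3_prices[kind]
    print(f"{kind}: {markup:.0f}x")   # 10x both ways: the order of magnitude
```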
The o3-pro answers have so far definitely been better than o3, but the wait is usually enough to break my workflow and human context window in meaningful ways – fifteen minutes plus variance is past the key breakpoint, such that it would not have been substantially more painful to fully wait for Deep Research.
Indeed, the baseline workflow feels similar to Deep Research, in that you fire off a query and then eventually you context shift back and look at it. But if you are paying the subscription price already it’s often worth queuing up a question and then having it ready later if it is useful.
In many ways o3-pro still feels like o3, only modestly better in exchange for being slower. Otherwise, same niche. If you were already thinking ‘I want to use Opus rather than o3’ chances are you want Opus rather than, or in addition to, o3-pro.
Perhaps the most interesting claim, from some including Tyler Cowen, was that o3-pro is perhaps not a lying liar, and hallucinates far less than o3. If this is true, in many situations it would be worth using for that reason alone, provided the timing allows it. The bad news is that it didn’t improve on the Confabulations benchmark.
My poll (n=19) was roughly evenly split on this question.
My hunch, based on my use so far, is that o3-pro is hallucinating modestly less because:
But for now I think you still have to have a lot of the o3 skepticism.
And as always, the next thing will be here soon: Gemini 2.5 Pro Deep Think is coming.
Pliny of course jailbroke it, for those wondering. Pliny also offers us the tools and channels information.
o3-Pro Versus o3
My poll strongly suggested o3-pro is slightly stronger than o3.
Arena has gotten quite silly if treated as a comprehensive measure (as in, Gemini 2.5 Flash is rated above o3), but as a quick heuristic, if we take a 64% win rate seriously, the math would put o3-pro ~100 points above o3, at 1509 on Arena, crushing Gemini-2.5-Pro for the #1 spot. I would assume most pairwise comparisons would show a less impressive jump, since o3-pro is essentially offering the same product as o3, only somewhat better, which means the result will be a lot less noisy than if it were up against Gemini.
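For those who want the conversion spelled out, here is a minimal sketch of the standard logistic Elo arithmetic; the 64% win rate is the reported figure, while the ~1409 baseline for o3 is simply backed out of the 1509 figure above rather than an official number.

```python
import math

def elo_gap(win_rate: float) -> float:
    """Elo point difference implied by a head-to-head win rate,
    under the standard logistic Elo model."""
    return -400 * math.log10(1 / win_rate - 1)

gap = elo_gap(0.64)        # ~99.9 points
print(round(gap))          # 100
print(1409 + round(gap))   # 1509, the figure quoted above
```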
So this both is a very impressive statistic and also doesn’t mean much of anything.
Give Me a Minute
The problem with o3-pro is that it is slow.
Today I had my first instance of this: I noticed I’d gotten a text (during the night, in this case), and the sender got a response 20 minutes later than they would have otherwise, because I waited for o3-pro to give its answer to the question I’d been asked.
Thus, even with access to o3-pro at zero marginal compute cost, almost half of people reported they rarely use it for a given query, and only about a quarter said they usually use it.
It is also super frustrating to run into errors when you are waiting 15+ minutes for a response, and reports of such errors were common, which matches my experience.
On Your Marks
I certainly would try o3-pro in cases where o3 was failing, if I’d already also tried Opus and Gemini first. I wonder if that agentic coding score drop actually represents an issue here: because the point is to reason for longer and they don’t want it endlessly web searching, perhaps o3-pro is not properly inclined to exploit tools?
o3-pro gets 8.5/10 on BaldurBench, which is about creating detailed build guides for rapidly changing video games. Somewhat subjective but should still work.
Lech Mazur gives us four of his benchmarks: a small improvement over o3 on the Creative Writing Benchmark, a substantial boost from 79.5% (o3) or 82.5% (o1-pro) to 87.3% on Word Connections, no improvement on Thematic Generalization, and very little improvement on Confabulations (avoiding hallucinations). The last one seems the most important to note.
Reactions
Tyler Cowen was very positive; he seems like the perfect customer for o3-pro? By which I mean he can context shift easily, so he doesn’t mind waiting, and he also often uses queries where these models get a lot of value out of going at problems super hard, and relatively less value out of the advantages of other models (doesn’t want the personality, doesn’t want to code, and so on).
Contrast that with the score on Confabulations not changing. I am guessing there is a modest improvement, for reasons described earlier.
There are a number of people pointing out places where o3-pro solves something o3 doesn’t, such as here, where it solved the gimbal UAP mystery in 18 minutes.
McKay Wrigley, eternal optimist, agrees on many fronts.
Here’s a different kind of vibe coding, perhaps?
The time cost and dollar costs are very real.
Part of ‘I am very intelligent’ is knowing when to think for longer and when not to. In that sense, o3-pro is not so smart; you have to take care of that question yourself. I do understand why this decision was made: let the user control that.
I agree with Lukes that most people do not ‘need’ o3-pro and they will be fine not paying for it, and for now they are better off with their expensive subscription (if any) being Claude Max. But even if you don’t need it, the queries you benefit from can still be highly useful.
Does it make sense to default to using Opus and o3-pro (and for quick stuff Sonnet)? o3-pro is too slow to be a good ‘default’ model, especially for coding. I don’t want to have to reload my state in 15 minute intervals. It may or may not be good for the ‘call in the big guns’ role in coding, where you have a problem that Opus and Gemini (and perhaps regular o3) have failed to solve, but which you think o3-pro might get.
Here’s one that both seems centrally wrong and also makes an important point:
The obvious evals where it does better are Codeforces, and also ‘user preferences.’ Tyler Cowen’s statement suggests hallucination rate, which is huge if true (and it better be true, I’m not waiting 20 minutes that often to get an o3-level lying liar.) Tyler also reports there are questions where o3 fails and o3-pro succeeds, which is definitive if the gap is only one way. And of course if all else fails you can always have them do things like play board games against each other, as one answer suggests.
Nor do I think either o3 or o3-pro is the AGI you are looking for.
However, it is true that for a large percentage of tasks, o3 is ‘good enough.’ That’s even true in a strict sense for Claude Sonnet or even Gemini Flash. Most of the time one has a query, the amount of actually needed intelligence is small.
You can tell 3300 from 3600 in chess, but only because you can tell who won. If almost any human looked at individual moves, you’d have very little idea.
I always appreciate people thinking at the limit rather than only on the margin. This is a central case of that.
Here’s one report that it’s doing well on the fully informal FictionBench:
This contrasts with:
Other positive reactions include Matt Wigdahl being impressed on a hairy RDP-related problem, a66mike99 getting interesting output and pushback on the request (in general I like this, although if you’re thinking for 20 minutes this could be a lot more frustrating?), niplav being impressed by results on a second attempt after Claude crafted a better prompt (this seems like an excellent workflow!), and Sithis3 saying o3-pro solves many problems o3 struggles on.

niplav: This one is relevant for me because this algorithm is probably the most cognitively demanding novel thinking I've done in my life, the most shape-rotator I've been. The fact that o3-pro manages to implement this is the final nail in the coffin for me in terms of "my g is worth something". I'd seen this coming for more than three (four?) years, of course, but it still feels a bit strange.
The obvious counterpoint is some people didn’t get good responses, and saw it repeating the flaws in o3.
This thread from Xeophon features reactions that are mixed but mostly meh.
Or to some it simply doesn’t feel like much of a change at all.
Here’s a fun reminder that details can matter a lot:
I presume that even at $200/month, the compute costs of letting o3-pro have 150k input tokens would add up fast, if people actually used it a lot.
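As a back-of-the-envelope check, using the $20 per million input tokens o3-pro API price as a proxy for the underlying compute cost (an assumption on my part, and one that ignores the pricier output and reasoning tokens), a single full-context query is real money:

```python
# Input-side cost of one 150k-token o3-pro query (price assumed, see above).
PRICE_PER_M_INPUT = 20.00        # USD per 1M input tokens, assumed
tokens = 150_000

cost = tokens / 1_000_000 * PRICE_PER_M_INPUT
print(f"${cost:.2f} per query")                                   # $3.00
print(f"{200 / cost:.0f} queries ≈ one $200/month subscription")  # 67
```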
Brevity Is The Soul Of Wit
This is one of the things I’ve loved the most so far about o3-pro.
o3-pro will make you wait, but its answer will not waste your time. This is a sharp contrast to Deep Research queries, which will take forever to generate and then include a ton of slop.
Where Art Thou O System Card
It is not the main point, but I must note the absence of a system card update. When you are releasing what is likely the most powerful model out there, o3-pro, was everything you needed to say truly already addressed by the system card for o3?
The link they provide goes to ‘Model Release Notes,’ which is not quite nothing, but it isn’t much and does not include a Preparedness Framework evaluation.
I agree with Miles that if you don’t want to provide a system card for o3-pro that This Is Fine, but you need to state your case for why you don’t need one. This can be any of:
As far as I can tell we’re basically in scenario #2, and they see no serious issues here. Which again is fine if true, and if they actually tell us that this is the case. But the framework is full of ‘here are the test results’ and presumably those results are different now. I want o3-pro on those charts.
What about alignment otherwise? Hard to say. I did notice this (but did not attempt to make heads or tails of the linked thread), seems like what you would naively expect:
The Price Drop on o3
For many purposes, the 80% price cut in o3 seems more impactful than o3-pro. That’s a huge price cut, whereas o3-pro is still largely a ‘special cases only’ model.
I would say that not only might the cost of inference drop an order of magnitude in a year or two, but if you hold quality of outputs constant, it is all but certain to happen at least one more time. Where you ‘take your profits’ in quality versus quantity is up to you.