GPT-5 is being evaluated as if it scaled up compute in a way that it didn’t. In various ways people are assuming it ‘cost’ far more than it did.
Even if it's a "small" model (as the balance of evidence suggests), it doesn't follow that it didn't cost a lot. Suppose gpt-5-thinking is a 1-2T total param, 250B active param model, a shape that would've been compute optimal for some 2023 training systems, but it's 10x overtrained using 2024 compute, and then it was RLVRed for the same amount of GPU-time as pretraining. Then it could well cost about $1bn (at $2-3 per H100-hour). It would take an unlikely 300T tokens, but then there's already gpt-oss-120b that apparently needed 100T-200T tokens, and this is still within the forgiving 5x repetition of a plausible amount of natural data.
I'm assuming 120 tokens/param compute optimal, anchoring to Llama 3 405B's dense 40 tokens/param, increased 3x to account for 1:8 sparsity. At 5e25 FLOPs (2023 compute) this asks for 260B active params and 2T total, trained for 31T tokens. Overtrained 10x, this would need 5e26 FLOPs and 310T tokens, without changing model shape. At 40% compute utilization, this is about 175e6 H100-hours (in FP8), or 2.3 months on a 100K-H100 training system. If the same amount of time was used for RLVR, this is another 175 million H100-hours (with fewer useful FLOPs), for a total of 350M H100-hours.
At $2-3 per H100-hour, this is $700M to $1bn, in the same sense that DeepSeek-V3/R1 is $5-7M. That is, various surrounding activities probably cost notably more than the final training runs that construct the models, though for the $1bn model it might just be comparable, while for the $6M model it would be much more.
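For concreteness, here is a minimal Python sketch of that back-of-envelope arithmetic. Every input (120 tokens/param, 1:8 sparsity, 10x overtraining, 40% utilization, roughly 2e15 FP8 FLOP/s per H100, $2-3 per H100-hour) is one of the hypotheticals stated above, not a known OpenAI figure.

```python
# Back-of-envelope reproduction of the hypothetical training cost estimate above.
# All inputs are assumptions from the text, not known OpenAI figures.

H100_FP8_FLOPS = 2e15         # assumed ~2e15 peak dense FP8 FLOP/s per H100
UTILIZATION = 0.40            # assumed compute utilization
TOKENS_PER_PARAM = 120        # assumed compute-optimal ratio (40 dense, x3 for 1:8 sparsity)
OVERTRAIN = 10                # assumed 10x overtraining past compute optimal
C_2023 = 5e25                 # reference 2023-scale pretraining budget, FLOPs
PRICE_LOW, PRICE_HIGH = 2, 3  # assumed $ per H100-hour

# Compute-optimal shape: solve 6 * N * D = C with D = TOKENS_PER_PARAM * N.
active_params = (C_2023 / (6 * TOKENS_PER_PARAM)) ** 0.5
tokens = TOKENS_PER_PARAM * active_params * OVERTRAIN   # overtrained token count
pretrain_flops = 6 * active_params * tokens             # ~5e26 FLOPs

# GPU-hours at assumed utilization; doubled for an equal-length RLVR phase.
pretrain_hours = pretrain_flops / (H100_FP8_FLOPS * UTILIZATION) / 3600
total_hours = 2 * pretrain_hours

print(f"active params ~{active_params / 1e9:.0f}B, "
      f"total (1:8 sparse) ~{active_params * 8 / 1e12:.1f}T")
print(f"tokens ~{tokens / 1e12:.0f}T, pretraining FLOPs ~{pretrain_flops:.1e}")
print(f"H100-hours: pretrain ~{pretrain_hours / 1e6:.0f}M, total ~{total_hours / 1e6:.0f}M")
print(f"cost at ${PRICE_LOW}-{PRICE_HIGH}/hour: "
      f"${PRICE_LOW * total_hours / 1e9:.2f}B to ${PRICE_HIGH * total_hours / 1e9:.2f}B")
```

Running this reproduces the numbers in the estimate: roughly 260B active params, ~2T total, ~310T tokens, ~350M H100-hours all told, and $0.7bn to $1bn at $2-3 per H100-hour.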
From a different angle, they spent something like 8 billion dollars on training compute while training GPT-5, so if GPT-5 was cheap to train, where did the billions go?
I notice that I am confused about the state of internal-only models at places like OpenAI. I wonder if people are trying to aggregate the informal reports and rumors on that.
In particular, I usually assume that internal models are usually ~6 months ahead of what’s released, but I don’t know if that’s a good estimate.
To make my confusion more concrete: I don’t quite understand how publicly available Claude Code can be useful for internal OpenAI developments if internal models from 6 months in the future are available. (Especially when taking into account that using those models potentially gives information to a competitor.) Internal models might be expensive to use, but with only a few thousand employees this should not matter much.
(I can see how Claude Code might be useful for personal projects by OpenAI employees, precisely because they might want to keep those projects private from their employer.)
Anyway, I wonder if there are some “interest groups” where people talk about rumors related to internal-only models. (The events like IMO gold do give us a bit of a window into what’s available in that sense.)
In a race for clout, they could at any time grab six months from thin air in benchmark graphs by closing the gap between internal and external releases. No idea if they have made this one-time play.
That’s probably too expensive and too risky.
An unsafe model, not well tested, and exposing too many of the latest tricks too early to their competitors.
I would not expect them to do that (they don’t have enough compute to serve slow huge models to a large number of users anyway; that’s, in part, why GPT-5 is very different from GPT-4/4.5 in terms of price/capability trade-off).
Yes, that’s certainly true. (Although, with the original GPT-4, it is thought that the delay was mostly dedicated to safety improvements and, perhaps, better instruction following, with shrinkage mostly occurring after the initial release.)
In any case, they could have boosted capabilities even without relying on the future models, but just by offering less-shrunk versions of GPT-5 in addition to the ones they did offer, and they have chosen not to do that.
Some part of this is that capabilities are not linear, and from what I gather the newer internal models may be less polished (if more capable) than the ones they make public. Especially now that more of the value-add is in post-training, I suspect using the work-in-progress models only feels good closer to release.
Yes, and, perhaps, one would usually want to shrink before post-training, both to make post-training more affordable per iteration, and because I am not sure if post-training-acquired capabilities survive shrinkage as well as pre-training-acquired capabilities (I wonder what is known about that; I want to understand that aspect better; is it insane to postpone shrinkage till after post-training, or is it something to try?).
Internal models aren't 6 months ahead in general.
Sometimes internal models are several months ahead in key benchmarks or capabilities. For example, an internal OpenAI model won gold on IMO but it might be a while before a public OpenAI model does as well at IMO or other math competitions. But you wouldn't want to use this model, and I don't think OpenAI uses the model a lot internally.
Also Anthropic is probably a few months ahead of OpenAI in coding.
Everyone agrees that the release of GPT-5 was botched. Everyone can also agree that the direct jump from GPT-4o and o3 to GPT-5 was not of similar size to the jump from GPT-3 to GPT-4, that it was not the direct quantum leap we were hoping for, and that the release was overhyped quite a bit.
GPT-5 still represented the release of at least three distinct models: GPT-5-Fast, GPT-5-Thinking and GPT-5-Pro, at least two and likely all three of which are SoTA (state of the art) within their class, along with GPT-5-Auto.
The problem is that the release was so botched that OpenAI is now experiencing a Reverse DeepSeek Moment – all the forces that caused us to overreact to DeepSeek’s r1 are now working against OpenAI in reverse.
This threatens to give Washington DC and its key decision makers a very false impression of a lack of AI progress, especially progress towards AGI, that could lead to some very poor decisions, and it could do the same for corporations and individuals.
I spent last week covering the release of GPT-5. This puts GPT-5 in perspective.
GPT-5: The Reverse DeepSeek Moment
In January DeepSeek released r1, and we had a ‘DeepSeek moment’ when everyone panicked about how China had ‘caught up.’ As the link explains in more detail, r1 was a good model, sir, but only an ordinary good model, substantially behind the frontier.
We had the DeepSeek Moment because a confluence of factors misled people:
GPT-5 is now having a Reverse DeepSeek Moment, including many direct parallels.
Unlike r1 at the time of its release, GPT-5-Thinking and GPT-5-Pro are clearly the current SoTA models in their classes, and GPT-5-Auto is probably SoTA at its level of compute usage, modulo complaints about personality that OpenAI will doubtless ‘fix’ soon.
OpenAI’s model usage was way up after GPT-5’s release, not down.
The release was botched, but this is very obviously a good set of models.
Washington DC, however, is somehow rapidly deciding that GPT-5 is a failure, and that AI capabilities won’t improve much and AGI is no longer a worry. This is presumably in large part due to the ‘race to market share’ faction pushing this narrative rather hardcore, and this narrative being super convenient for them.
What is even scarier is, what happens if DeepSeek drops r2, and it’s not as good as GPT-5-Thinking, but it is ‘pretty good’?
So let us be clear: (American) AI is making rapid progress, including at OpenAI.
Did You Know AI Is Making Rapid Progress?
How much progress have we been making?
This is only one measure among many, from Artificial Analysis (there is much it doesn’t take into account, which is why Gemini 2.5 Pro looks so good). Yes, GPT-5 is a relatively small advance despite being called GPT-5, but that is because o1 and o3 already covered a lot of the ground; it’s not like the full GPT-4 → GPT-5 jump isn’t very big.
AI is making rapid progress. It keeps getting better. We seem headed for AGI.
Yet people continuously try to deny all of that. And because this could impact key policy, investment and life decisions, each time we must respond.
No We Are Not Yet Hitting A Wall
As in, the Financial Times asks the eternal question we somehow have to ask every few months: Is AI ‘hitting a wall’?
(For fun, here is GPT-5-Pro listing many previous times AI supposedly ‘hit a wall.’)
If you would like links, here are some links for all that.
The justification for this supposed hitting of a wall is even stupider than usual.
Yes, users wanted GPT-4o’s sycophancy back, and they even got it. What does that have to do with a wall? They do then present the actual argument.
True. We didn’t get something totally new. But, again, that was OpenAI:
They hit the classic notes.
We have Gary Marcus talking about this being a ‘central icon of the entire scaling approach to get to AGI, and it didn’t work,’ so if this particular scaling effort wasn’t impressive we’re done, no more useful scaling ever.
We have the harkening back to the 1980s ‘AI bubble’ that ‘burst.’
My lord, somehow they are still quoting Yann LeCun.
We have warnings that we have run out of capacity with which to scale. We haven’t.
Their best point is this Altman quote I hadn’t seen:
I believe he meant that in the ‘for ordinary casual chat purposes there isn’t much room for improvement left’ sense, and that this is contrasting mass consumer chatbots with other AI applications, including coding and agents and reasoning models, as evidenced by the other half of the quote:
That is the part that matters for AGI.
That doesn’t mean we will get to AGI and then ASI soon, where soon is something like ‘within 2-10 years.’ It is possible things will stall out before that point, perhaps even indefinitely. But ‘we know we won’t get AGI any time soon’ is crazy. And ‘last month I thought we might well get AGI anytime soon but now we know we won’t’ is even crazier.
Alas, a variety of people are reacting to GPT-5 being underwhelming on the margin, the rapid set of incremental AI improvements, and the general fact that we haven’t gotten AGI yet, and reached the conclusion that Nothing Ever Changes applies and we can assume that AGI will never come. That would be a very serious mistake.
Miles Brundage, partly to try and counter and make up for the FT article and his inadvertent role in it, does a six-minute rant explaining one reason for different perceptions of AI progress. The key insight here is that AI at any given speed and cost and level of public availability continues to make steady progress, but rates of that progress look very different depending on what you are comparing. Progress looks progressively faster if you are looking at Thinking-style models, or Pro-style models, or internal-only even more expensive models.
Models Making Money And Being Useful Does Not Mean Less Progress
Progress in the rapid models like GPT-5-Fast also looks slower than it is because for the particular purposes of many users at current margins, it is true that intelligence is no longer an important limiting factor. Simple questions and interactions often have ‘correct’ answers if you only think about the local myopic goals, so all you can do is asymptotically approach that answer while optimizing on compute and speed. Intelligence still helps but in ways that are less common, more subtle and harder to notice.
One reason people update against AGI soon is that they treat OpenAI’s recent decisions as reflecting AGI not coming soon. It’s easy to see why one would think that.
I am not going to say they did a ‘great job with that.’ They botched the rollout, and I find GPT-5-Auto (the model in question) not especially exciting for my purposes, but it does seem to clearly be on the cost-benefit frontier, as are 5-Thinking and 5-Pro? And when people say things like this:
They are talking about GPT-5-Auto, the version targeted at the common user. So of course that is what they created for that.
OpenAI rightfully thinks of itself as essentially multiple companies. They are an AI frontier research lab, and also a consumer product company, and a corporate or professional product company, and also looking to be a hardware company.
Most of those customers want to pay $0, at least until you make yourself indispensable. Most of the rest are willing to pay $20/month and not interested in paying more. You want to keep control over this consumer market at Kleenex or Google levels of dominance, and you want to turn a profit.
So of course, yes, you are largely prioritizing for what you can serve your customers.
What are you supposed to do, not better serve your customers at lower cost?
That doesn’t mean you are not also creating more expensive and smarter models. Thinking and Pro exist, and they are both available and quite good. Other internal models exist and by all reports are better if you disregard cost and don’t mind them being rough around the edges.
There is an ordinary battle for revenue and market share and so on that looks like every other battle for revenue and market share. And yes, of course when you have a product with high demand you are going to build out a bunch of infrastructure.
That has nothing to do with the more impactful ‘race’ to AGI. The word ‘race’ has simply been repurposed and conflated by such folks in order to push their agenda and rhetoric in which the business of America is to be that of ordinary private business.
Initially FT used only the first sentence from Miles and not the second one, which is very much within Bounded Distrust rules but very clearly misleading. To their credit, FT did then fix it to add the full quote, although most clicks will have seen the misleading version.
It is crazy to cite ‘companies are Doing Business’ as an argument for why they are no longer building or racing to AGI, or why that means what matters is the ordinary Doing of Business. Yes, of course companies are buying up inference compute to sell at a profit. Yes, of course they are building marketing departments and helping customers with deployment and so on. Why shouldn’t they? Why would one consider this an either-or? Why would you think AI being profitable to sell makes it less likely that AGI is coming soon, rather than more likely?
That is, as stated, exactly correct from Wolf. There is tons of cool stuff to build that is not AGI or ASI. Indeed I would love it if we built all that other cool stuff and mysteriously failed to build AGI or ASI. But that cool stuff doesn’t make it less likely we get AGI, nor does refusing to look at the top labs racing to AGI, with this as their stated goal, make that part of the situation go away.
As a reminder, OpenAI several times during their GPT-5 presentation talked about how they were making progress towards AGI or superintelligence, and how this remained the company’s primary goal.
Mark Zuckerberg once said about Facebook, ‘we don’t make better services to make money. We make money to make better services.’ Mark simply has a very strange opinion on what constitutes better services. Consider that the same applies here.
Also note that we are now at the point where if you created a truly exceptional coding and research model, and you are already able to raise capital on great terms, it is not at all obvious you should be in a rush to release your coding and research model. Why would you hand that tool to your competitors?
As in, not only does it help them via distillation and reverse engineering, it can also directly be put to work. Anthropic putting out Claude Code gave them a ton more revenue and market share and valuation, and thus vital capital and mindshare, and helps them recruit, but there was a nontrivial price to pay in that their rivals get to use the product.
We Could Massively Screw This All Up
One huge problem with this false perception that GPT-5 failed, or that AI capabilities aren’t going to improve, and that AGI can now be ignored as a possibility, is that this could actually fool the government into ignoring that possibility.
Not only would that mean we wouldn’t prepare for what is coming, the resulting decisions would make things vastly worse. As in, after quoting David Sacks saying the same thing he’s been saying ever since he joined the administration, and noting recent disastrous decisions on the H20 chip, we see this:
Even if we disregard the turn of phrase here – ‘AI chips and models rule the world’ is exactly the scenario some of us are warning about and trying to prevent, and those chips and models having been created by Americans does not mean Americans or humans have a say in what happens next, instead we would probably all die – pursuing chip market share uber alles with a side of model market share was already this administration’s claimed priority months ago.
We didn’t strike the UAE deal because GPT-5 disappointed. We didn’t have Sacks talking endlessly about an ‘AI race’ purely in terms of market share – mostly that of Nvidia – because GPT-5 disappointed. Causation doesn’t run backwards in time. These are people who were already determined to go down this path. GPT-5 and its botched rollout is the latest talking point, but it changes nothing.
In brief, I once again notice that the best way to run Chinese AI models, or to train Chinese AI models is to use American AI chips. Why haven’t we seen DeepSeek release v4 or r2 yet? Because the CCP made them use Huawei Ascend chips and it didn’t work. What matters is who owns and uses the compute, not who manufactures the compute.
But that is an argument for another day. What matters here is that we not fool ourselves into a Reverse DeepSeek Moment, in three ways: