GPT-5: The Reverse DeepSeek Moment

[-]Vladimir_Nesov3mo162

GPT-5 is evaluated as if it was scaling up compute in a way that it doesn’t. In various ways people are assuming it ‘cost’ far more than it did.

Even if it's a "small" model (as the balance of evidence suggests), it doesn't follow that it didn't cost a lot. Suppose gpt-5-thinking is a 1-2T total param, 250B active param model, a shape that would've been compute optimal for some 2023 training systems, but it's 10x overtrained using 2024 compute, and then it was RLVRed for the same amount of GPU-time as pretraining. Then it could well cost about $1bn (at $2-3 per H100-hour). It would take an unlikely 300T tokens, but then there's already gpt-oss-120b that apparently needed 100T-200T tokens, and this is still within the forgiving 5x repetition of a plausible amount of natural data.

I'm assuming 120 tokens/param compute optimal, anchoring to Llama 3 405B's dense 40 tokens/param, increased 3x to account for 1:8 sparsity. At 5e25 FLOPs (2023 compute) this asks for 260B active params and 2T total, trained for 31T tokens. Overtrained 10x, this would need 5e26 FLOPs and 310T tokens, without changing model shape. At 40% compute utilization, this is about 175e6 H100-hours (in FP8), or 2.3 months on a 100K H100s training system. If the same amount of time was used for RLVR, this is another 175 million H100-hours (with fewer useful FLOPs), for the total of 350M H100-hours.

At $2-3 per H100-hour, this is $700M to $1bn, in the same sense that DeepSeek-V3/R1 is $5-7M. That is, various surrounding activities probably cost notably more than the final training runs that construct the models, though for the $1bn model it might just be comparable, while for the $6M model it would be much more.

[-]Hastings3mo51

From a different angle, they spent something like 8 billion dollars on training compute while training GPT-5, so if GPT-5 was cheap to train, where did the billions go?

[-]mishka3mo80

I notice that I am confused about the state of internal-only models at places like OpenAI. I wonder if people are trying to aggregate the informal reports and rumors on that.

In particular, I usually assume that internal models are usually ~6 months ahead of what’s released, but I don’t know if that’s a good estimate.

To make my confusion more concrete: I don’t quite understand how publicly available Claude Code can be useful for internal OpenAI developments if internal models from 6 months in the future are available. (Especially when taking into account that using those models potentially gives information to a competitor.) Internal models might be expensive to use, but with only a few thousand employees this should not matter much.

(I can see how Claude Code might be useful for personal projects by OpenAI employees, precisely because they might want to keep those projects private from their employer.)

Anyway, I wonder if there are some “interest groups” where people talk about rumors related to internal-only models. (The events like IMO gold do give us a bit of a window into what’s available in that sense.)

[-]Hastings3mo30

In a race for clout, they could at any time grab six months from thin air in benchmark graphs by closing the internal-release-external-release gap. No idea if they have made this one time play.

[-]mishka3mo30

That’s probably too expensive and too risky.

An unsafe model, not well tested, and exposing too many of latest tricks too early to their competitors.

I would not expect them to do that (they don’t have enough compute to serve slow huge models to large number of users anyway; that’s, in part, why GPT-5 is very different from GPT-4/4.5 in terms of price/capability trade-off).

[-]Hastings3mo20

It’s a matter of degree. There’s already shrinkage- gpt 4 took nearly a year to release

[-]mishka3mo40

Yes, that’s certainly true. (Although, with the original GPT-4 it is thought that the delay have been mostly dedicated to safety improvements and, perhaps, better instruction following, with shrinkage mostly occurring post initial release.)

In any case, they could have boosted capabilities even without relying on the future models, but just by offering less shrinked versions of GPT-5 in addition to the ones they did offer, and they have chosen not to do that.

[-]phdead3mo43

Some part of this is that capabilities are not linear, and from what I gather the newer internal models may be less polished (if more capable) than the ones they make public. Especially now what more value add is in post training, I suspect using the work in progress models only feels good closer to release.

[-]mishka3mo20

Yes, and, perhaps, one would usually want to shrink before post-training, both to make post-training more affordable per iteration, and because I am not sure if post-training-acquired capabilities survive shrinkage as well as pre-training-acquired capabilities (I wonder what is known about that; I want to understand that aspect better; is it insane to postpone shrinkage till after post-training, or is it something to try?).

[-]Josh You3mo11

Internal models aren't 6 months ahead in general.

Sometimes internal models are several months ahead in key benchmarks or capabilities. For example, an internal OpenAI model won gold on IMO but it might be a while before a public OpenAI model does as well at IMO or other math competitions. But you wouldn't want to use this model, and I don't think OpenAI uses the model a lot internally.

Also Anthropic is probably a few months ahead of OpenAI in coding.