The disconnect between revenue and R&D spend is caused by the rapid scaling of compute. With a 50% gross margin, you can use 50% of compute to serve inference, and that will be enough to pay for the other 50% of compute that can be used for R&D (including training of the next year's model). This is probably what happens once the amount of compute per AI company stops rapidly increasing (absent AGI).
But if you'll need 3x more compute next year, and you want to use as much compute for R&D this year as you'll use to serve models next year, then you are out of luck (you'd need 1.5x as much compute as you actually have in total this year just for R&D, which is more than you actually have). So instead you serve models via clouds at a worse margin, and use more than 50% of your own dedicated compute for R&D (unless there's not enough compute even at the clouds).
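A toy restatement of that arithmetic (a sketch with illustrative units, not figures about any particular lab):

```python
# Toy version of the compute-scaling argument: why next year's serving needs can
# exceed this year's entire compute budget. Units are arbitrary, numbers illustrative.
compute_this_year = 1.0      # total compute available this year
growth = 3.0                 # compute roughly triples year over year
serving_share = 0.5          # at a 50% gross margin, ~half of compute serves inference

serving_next_year = growth * compute_this_year * serving_share   # 1.5 units
rd_wanted_this_year = serving_next_year    # want this year's R&D compute to match it

print(rd_wanted_this_year)                        # 1.5
print(rd_wanted_this_year > compute_this_year)    # True: more than everything you own this year
```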
DeepSeek-V4-Pro ... leading labs charge significantly more for their flagship models, ranging from $2-5 per million input tokens, to $12-25 per million output tokens ... Given the minor differences in quality, either the leading labs are bad at inference optimisation, or they're raking it in.
DeepSeek-V4-Pro has 50B active params, 1.6T total params, and a strikingly low 5 KB of KV-cache per token (which is their most obvious innovation). Frontier models run on GB200 NVL72 with 14 TB of HBM per server, and using up to 2-4 servers with pipeline parallelism is reasonable. With 4-bit weights (in FFNs) using 25% of HBM (with the rest spent on KV-cache), that's already 7-30T total params. If pretraining makes use of 300 MW of compute (1e27 FLOPs in 3 months at 40% utilization), a compute-optimal number of active params (at 120 tokens/param) is about 1T. The rumors and Musk's claimed model sizes place total params of current frontier models at 5-10T, which makes sense as the first step in scaling towards what the new rack-sized scale-up systems enable (the 15-30T total param models are probably coming next year).
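A back-of-envelope check of those figures, using only the numbers stated above (the HBM size, weight precision, and tokens/param ratio are the assumptions from this paragraph):

```python
# Rough check of the model-size figures above; all inputs are the stated assumptions.
HBM_PER_SERVER = 14e12     # bytes of HBM per GB200 NVL72
BYTES_PER_PARAM = 0.5      # 4-bit (FP4) weights
WEIGHT_FRACTION = 0.25     # share of HBM holding weights, the rest goes to KV-cache

params_per_server = WEIGHT_FRACTION * HBM_PER_SERVER / BYTES_PER_PARAM
print(params_per_server / 1e12)        # ~7T total params on one server
print(4 * params_per_server / 1e12)    # ~28T across 4 pipeline-parallel servers

# Compute-optimal active params for a 1e27 FLOP run, with C ~ 6 * N * D and D = 120 * N:
C = 1e27
N = (C / (6 * 120)) ** 0.5
print(N / 1e12)                        # ~1.2T active params
```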
There are two sides to the cost of more active params: input tokens need more compute, and KV-cache per token wants to be bigger. With 300K token contexts on average, maybe 50% of HBM (on a single GB200 NVL72) spent on KV-cache of the currently live requests, and 2.5K requests (leaving 300 requests per expert with 1:8 sparsity, to be able to feed the compute with enough data), that's only 9 KB per token, which is very little. So perhaps the batch size for the frontier models has to be smaller than it should be for effective serving, smaller than it can be when serving smaller models such as DeepSeek-V4-Pro, which would directly translate to a higher cost of output tokens. Or if they're served on 2-4 servers with pipeline parallelism, this can go to 40 KB per token. KV-cache per token probably scales with model dimension, which maybe scales with the square root of active params. If frontier models managed to compress KV-cache as well as DeepSeek-V4-Pro, but use 10-20x more active params, that suggests 15-22 KB per token of KV-cache. But it's probably more: too many KV-cache compression schemes never find traction, likely because of subtle quality degradation costs. So it's the right order of magnitude, but could end up making batch sizes 2-5x smaller than they should be (to be compute-bound), and decoding becomes solidly HBM-bound.
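Putting those numbers together (same serving setup as above; the request count and average context length are the rough figures from the paragraph):

```python
# KV-cache budget per token implied by the serving setup described above.
HBM_PER_SERVER = 14e12     # bytes, one GB200 NVL72
KV_FRACTION = 0.5          # half of HBM holds KV-cache of the live requests
REQUESTS = 2500            # concurrent requests (~300 per expert at 1:8 sparsity)
AVG_CONTEXT = 300_000      # tokens per request

kv_per_token = KV_FRACTION * HBM_PER_SERVER / (REQUESTS * AVG_CONTEXT)
print(kv_per_token / 1e3)          # ~9 KB/token on a single server
print(4 * kv_per_token / 1e3)      # ~37 KB/token spread over 4 pipeline-parallel servers

# If KV-cache scales with sqrt(active params), going from 50B to 10-20x more active params:
for ratio in (10, 20):
    print(5 * ratio ** 0.5)        # ~16-22 KB/token, scaling up DeepSeek's 5 KB
```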
This suggests a 20-100x higher output token cost for the biggest frontier models, compared to the 50B active param open weights models. DeepSeek-V4-Pro is served for $0.87 per 1M output tokens by DeepSeek (and for $2.8-3.5 per 1M output tokens by others, who possibly don't have enough users to reliably get big batches for decode), 30x cheaper than Opus 4.7 or GPT-5.5.
For the input tokens, the cost is just proportional to the active params, as it's much more straightforward to make prefill compute-bound. At 60% utilization, and $15bn per year per 1 GW of GB200 NVL72 (400K chips, $4.3 per hour per chip), with 10e15 FP4 FLOP/s per chip, and 2 FP4 FLOPs per active param per token, that's a cost of $0.4 per 1M input tokens for a model with 1T active params. So the cost for a 50B active param model (with FP4 FFNs) could be $0.02 per 1M input tokens when using GB200 NVL72. With H100s, computing in FP8 at 2e15 FLOP/s, but at $2.5 per hour, this gets 3x more expensive, $0.06 per 1M input tokens. DeepSeek offers 1M input tokens for $0.43. And for a 1T active param FP8 model with a 50% gross margin, the price should be about $2 per 1M input tokens. For Opus 4.7, the price is $5 per 1M input tokens, and SemiAnalysis estimates more than 70% gross margin.
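The same estimate as a short calculation, with every input taken from the figures above (the per-chip FLOP/s, prices, and utilization are the stated assumptions, not measured values):

```python
# Input-token cost for compute-bound prefill; all inputs are the figures quoted above.
CHIPS_PER_GW = 400_000
DOLLARS_PER_GW_YEAR = 15e9
UTILIZATION = 0.6

gb200_dollars_per_hour = DOLLARS_PER_GW_YEAR / (CHIPS_PER_GW * 8760)   # ~ $4.3/hour/chip

def input_cost_per_1m_tokens(active_params, flops_per_chip, dollars_per_hour):
    flops_per_token = 2 * active_params                    # 2 FLOPs per active param per token
    tokens_per_second = flops_per_chip * UTILIZATION / flops_per_token
    return (dollars_per_hour / 3600) / tokens_per_second * 1e6

print(input_cost_per_1m_tokens(1e12, 10e15, gb200_dollars_per_hour))   # ~$0.40: 1T active, GB200 FP4
print(input_cost_per_1m_tokens(50e9, 10e15, gb200_dollars_per_hour))   # ~$0.02: 50B active, GB200 FP4
print(input_cost_per_1m_tokens(50e9, 2e15, 2.5))                       # ~$0.06: 50B active, H100 FP8
```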
Now anchoring the cost of output tokens to the cost of input tokens, with 2-5x smaller batches than what is necessary to make decode compute-bound (this applies only to the biggest frontier models, not to the 50B active param models) and maybe 3x inefficiency of decode compared to prefill, output token cost should be 6-15x higher than input token cost. But since most tokens that are served via API are input tokens, perhaps the gross margin for output tokens matters less, so with the input token price already at least 2x higher than the input token cost, the output token price might end up only 3-7x higher than the input token price. With a cost of $1 per 1M input tokens (for FP8 FFNs), that's a price of $3-7 per 1M output tokens (for 300K token contexts). If we want to avoid serving via API at a loss for 1M token contexts with memory-bound decode, the price might need to go to $10-25 per 1M output tokens (here the gross margin goes to zero at contexts that are 1M tokens long, but is higher for shorter contexts). And $25 is just the price for 1M output tokens of Opus 4.7.
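For clarity, the multipliers stack like this (costs only; the pricing step depends on how much gross margin output tokens need to carry):

```python
# Stacking the decode-side multipliers described above (cost, not price).
input_token_cost = 1.0     # USD per 1M input tokens, ~1T active params with FP8 FFNs
decode_overhead = 3        # decode assumed ~3x less efficient than prefill
for batch_penalty in (2, 5):   # batches 2-5x smaller than needed to be compute-bound
    print(input_token_cost * batch_penalty * decode_overhead)   # $6-15 per 1M output tokens
```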
“DeepSeek-V4-Pro is served for $0.87 per 1M output tokens by DeepSeek”
They're discounting it by 75% until the end of May, when the price goes to $3.48.
https://api-docs.deepseek.com/quick_start/pricing
With these levels of discounting, the range for the true price is a bit wider.
(Subscription revenue is totally different: Codex usage limits at the $100 and $200 tiers are 10x and 25x the Plus tier (per 5-hour window), versus the usual 5x and 20x, which means a lot of competition heating up.)
AI companies are going to run out of money, the cost of using AI will shoot up, demand will collapse, and the AI bubble will be over.
I've only ever seen claims like the above in anti-AI / AI-skeptical communities.
I think a more practical steelman for this forum would be,
AI companies are going to hit a cash crunch, the pace of investment in AI infrastructure will collapse, financial institutions will panic, and AI financing will become much harder for a long time. This is bad because [geopolitics | domestic revolt | time value of AI progress]
Though I no longer expect this to happen, for some time I took it seriously.
Scaling laws imply that we need exponentially more compute to achieve linear AI performance improvements
In the usual scaling law definitions of performance, it's polynomially more compute, not exponentially more.
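In one standard power-law parameterization (Chinchilla-style, written here just to illustrate the point; the constants themselves don't matter):

```latex
% reducible loss decays as a power of compute
L(C) = E + A\,C^{-\alpha}
\qquad\Longrightarrow\qquad
C = \left(\frac{A}{L - E}\right)^{1/\alpha}
```

Halving the reducible loss L - E multiplies the required compute by a fixed factor 2^(1/alpha), i.e. compute grows polynomially in the targeted improvement; an exponential compute requirement would instead correspond to loss falling only logarithmically in C.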
I've occasionally heard people suggest that at some point AI companies are going to run out of money, the cost of using AI will shoot up, demand will collapse, and the AI bubble will be over.
At first glance this risk seems real. OpenAI spent $25 billion in the first half of 2025, on revenue of just $4 billion. Whilst data is sorely lacking for other top AI labs, our best guess is that they're burning through cash at similar rates. Scaling laws imply that we need exponentially more compute to achieve linear AI performance improvements, so we should only expect this situation to worsen in the future. A few more doublings, and OpenAI could be spending hundreds of billions on training runs - something likely unsustainable even for the largest tech companies.
However, most of these expenses are infrastructure costs: building out the data centres needed for further training runs and for serving future customers. If we look at the actual cost of serving, AI labs are already profitable, and have been for a long time.
In other words, the marginal cost of responding to an AI API call is significantly lower than the price of that call. We can see this by comparing open source models, which are often served by infra providers (who are unlikely to be running loss leaders) at prices significantly lower than those of the leading labs.
For example, DeepSeek-V4-Pro is an MoE model with 1.6 trillion parameters, and usually costs around $1.74 per 1 million input tokens and $3.48 per 1 million output tokens with neutral vendors. In benchmarks it performs worse than the leading labs' models, but not by much, and it is a newer model. However, leading labs charge significantly more for their flagship models, ranging from $2-5 per million input tokens, to $12-25 per million output tokens. Given the minor differences in quality, either the leading labs are bad at inference optimisation, or they're raking it in.
This means that even if all funding for AI dried up tomorrow, AI would still be a profitable business, and existing models would continue to be served.
Some people have responded that AI companies can't just serve the existing models and make a steady profit - they have to train new models, and models just don't stay at the forefront long enough to recover high up-front training costs. If they don't, other AI companies will eat their lunch. It takes all the running they can do just to stay in the same place.
This is definitely a concern for individual AI companies, but it can't possibly be one for the industry as a whole. If investment is insufficient for any company to train the next-gen models, current-gen models will continue to be served indefinitely (and eventually make enough profit to incrementally improve the state of the art, at a sedate enough pace that the newer models can recoup their initial investment).
The question of how long current AI scaling rates can be kept up is an important one for predicting the future of AI (and humanity) but it's irrelevant to whether AI will collapse. The genie is out of the bottle and cannot be put back.