In any case, the paper says the curtailments would last about two hours each:
"The average duration of load curtailment (i.e., the length of time the new load is curtailed during curtailment events) would be relatively short, at 1.7 hours when average annual load curtailment is limited to 0.25%, 2.1 hours at a 0.5% limit, and 2.5 hours at a 1.0% limit."
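To get a sense of the totals, here's a quick back-of-envelope calculation (my arithmetic, not the paper's), assuming events fully curtail the new load; partial curtailment would spread the same energy over more hours:

```python
# Convert the paper's annual energy limits into implied hours and events
# per year, assuming the load is fully shut off during each event.
HOURS_PER_YEAR = 8760

for annual_limit, avg_event_hours in [(0.0025, 1.7), (0.005, 2.1), (0.01, 2.5)]:
    curtailed_hours = annual_limit * HOURS_PER_YEAR   # hours of downtime per year
    events = curtailed_hours / avg_event_hours        # implied number of events
    print(f"{annual_limit:.2%} limit: ~{curtailed_hours:.0f} h/yr, ~{events:.0f} events")
```

So even the 1.0% limit works out to roughly 88 hours a year, spread over a few dozen short events.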
Demand response can be done with or without battery backup covering the data center's load. And demand response and batteries can stack: if the grid is really stressed, a data center can both turn off and discharge its battery into the grid.
Economically, it makes sense to accept some true downtime to avoid months-long delays in data center construction. This is clearly true for training workloads, which are very important but don't have live demand. But some downtime is acceptable even for inference clusters: you can reduce compute demand by temporarily slowing down token generation or by imposing dynamic rate limits. And any curtailment would almost certainly be isolated to one region, so inference data centers in other places would still be operational.
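As a concrete illustration of the rate-limiting idea, here's a minimal sketch (all names are hypothetical, and it assumes GPU power draw scales roughly with generation throughput, which is only approximately true):

```python
# Hypothetical demand-response hook for an inference cluster: on a grid
# stress signal, lower a global cap on token throughput instead of
# dropping requests, so users see slower responses rather than errors.
class TokenBudget:
    def __init__(self, max_tokens_per_s: float):
        self.max_tokens_per_s = max_tokens_per_s
        self.scale = 1.0  # 1.0 = normal operation

    def on_grid_signal(self, power_headroom: float) -> None:
        # power_headroom in [0, 1]: fraction of normal power we may draw
        self.scale = max(0.0, min(1.0, power_headroom))

    def tokens_allowed(self, window_s: float) -> int:
        # The serving stack would enforce this cap across all requests.
        return int(self.max_tokens_per_s * self.scale * window_s)

budget = TokenBudget(max_tokens_per_s=500_000)
budget.on_grid_signal(power_headroom=0.4)   # curtailment event
print(budget.tokens_allowed(window_s=1.0))  # 200000: slower, not down
```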
Internal models aren't 6 months ahead in general.
Sometimes internal models are several months ahead on key benchmarks or capabilities. For example, an internal OpenAI model won gold at the IMO, but it might be a while before a public OpenAI model does as well at the IMO or other math competitions. But you wouldn't want to use that model for ordinary tasks, and I don't think OpenAI uses it much internally.
Also, Anthropic is probably a few months ahead of OpenAI in coding.
GPT-5's time horizon curve, for reference. It actually looks a bit smoother than the curves for previous models.
I believe the router is only in ChatGPT, not the API. And it switches between two models of the same size and cost (GPT-5 with thinking and GPT-5 without thinking).
o4-mini performing best isn't surprising: it leads or ties o3 and other larger models on most STEM-focused benchmarks. Similarly, o1-mini was better at math benchmarks than o1-preview, and Grok-3 mini generally gets better scores than Grok-3. The general explanation is that RL training and experiments are cheaper and faster on mini models.
Why not generate them with a more generic art style?
One possibility I've wondered about is whether AI can automate this learning work: start from a transcript of someone trying to do things with AI, including mistakes and subsequent feedback, and then curate data from it that works well for RL fine-tuning. Or even distill it into examples for in-context learning (which probably works somewhat well, sometimes, today).
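Here's a rough sketch of what that curation step could look like (the transcript schema and the judge are stand-ins I made up; a real pipeline would presumably use a model as the judge):

```python
# Mine a human-AI transcript for (mistake, feedback, corrected attempt)
# triples and package them as examples for RL fine-tuning or for
# in-context learning.
def judge_fixed(before: str, feedback: str, after: str) -> bool:
    """Stand-in judge: did the retry actually incorporate the feedback?"""
    return feedback.lower() in after.lower()  # toy heuristic, not a real check

def curate(transcript: list[dict]) -> list[dict]:
    examples = []
    for prev, fb, retry in zip(transcript, transcript[1:], transcript[2:]):
        if (prev["role"] == "assistant" and fb["role"] == "user"
                and retry["role"] == "assistant"
                and judge_fixed(prev["text"], fb["text"], retry["text"])):
            # Reward the corrected attempt, or use the (feedback -> fix)
            # pair as an in-context demonstration.
            examples.append({"prompt": prev["text"],
                             "feedback": fb["text"],
                             "chosen": retry["text"]})
    return examples
```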
What do you think about the success of AI 2027? It was very widely read, including by the vice president. That's partly because it's a striking narrative and partly, I presume, due to a decent amount of comms and press work. But it's also backed by a lot of research, which took up the majority of the effort, and I think that was instrumental to its success.
More generally, good research builds credibility, especially among other experts who themselves have a lot of credibility and can help amplify your message. Someone like Yoshua Bengio has a lot more influence on AI safety than a large number of dedicated AI safety advocates. And high-quality research could persuade the next Bengio.
I've had similar thoughts: I think there's still low-hanging fruit in RL, in scaffolding, and in further scaling of inference compute. But my general take is that the recent faster trend of time horizons doubling every ~4 months is already the result of picking the low-hanging RL fruit for coding and SWE, plus fast inference scaling. So this kind of thing will probably lead to a continuation of the fast trend, not another acceleration.
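To make "continuation, not acceleration" concrete, here's the arithmetic under both doubling times (the ~2 hour starting horizon is an illustrative placeholder, and the ~7-month figure is METR's original, slower trend):

```python
# Project the 50% time horizon two years out under each doubling time.
H0_HOURS = 2.0  # assumed current 50% time horizon (placeholder)

for label, doubling_months in [("~7-month doubling", 7), ("~4-month doubling", 4)]:
    after_two_years = H0_HOURS * 2 ** (24 / doubling_months)
    print(f"{label}: ~{after_two_years:,.0f} hours after two years")
```

The fast trend alone already takes you from ~2 hours to ~128 hours (weeks of human work) in two years; the question is whether anything pushes the doubling time below ~4 months.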
Another source of shorter timelines, depending on which timeline you mean, is the uncertainty in translating time horizon into real-world AI research productivity. Maybe models with an 80% time horizon of 1 month or less are already enough for a huge acceleration of AI R&D, given the right scaffold/unhobbling/bureaucracy that can take advantage of lots of parallel small experiments or other work, or good complementarities between AI and human labor.