It looks like AI 2027 was posted on April 3rd, 2025?
In that case, August was about four months away, which means late September is 20-25% slower than projected, and we are still a few percentage points short. It seems reasonable to expect the scores they predicted sometime in October or November, but that would still be something like 40-50% over their predicted timeline.
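For concreteness, here is the date arithmetic as a minimal sketch; note that the result is quite sensitive to how you read "August":

```python
from datetime import date

# Dates are my own reading of the thread, not anything the authors published.
release = date(2025, 4, 3)    # AI 2027 publication
observed = date(2025, 9, 29)  # Sonnet 4.5 release, which nearly hit the targets

for label, target in [("mid-August", date(2025, 8, 15)),
                      ("end of August", date(2025, 8, 31))]:
    predicted_days = (target - release).days
    observed_days = (observed - release).days
    print(f"{label}: {observed_days / predicted_days - 1:.0%} longer than projected")
# mid-August reading: ~34% longer; end-of-August reading: ~19% longer
```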
The authors have emphasized repeatedly that AI 2027 was and is faster than their mode (EDIT: median) scenario, which makes doing this kind of evaluation annoying, but I would have to say that things look significantly behind the specific story in that piece. I am saying this because it is a bit of an overstatement to praise their predictive accuracy on mid-2025 predictions that they made in early-to-mid 2025, when that accuracy is off on the scale of a month or two, and when, as far as I remember, their predictions for 2025 were not viewed as particularly radical or unexpected at the time. It seems to me that even a hardcore skeptic of AI 2027 would have been unlikely to predict a much larger error.
(I believe I did myself leave a comment along the lines of "I expect this to start not happening right away", but in follow-up conversation I specified that I was not talking about 2025.)
Still, I appreciate that you are checking in on the accuracy of their story with real numbers.
The authors have emphasized repeatedly that AI 2027 was and is faster than their mode scenario, which makes doing this kind of evaluation annoying,
We've said that it was faster than our median, not our mode. I think it was close to most of our modes at the time of publication; mostly we were at around 2027-2028.
But the evaluation itself seems useful either way, in terms of checking in on how things are going relative to the trajectory that was our best guess conditional on the AGI timelines depicted.
I agree we're behind the AI-2027 scenario and unlikely to see those really really fast timelines. But I'd push back on calling it 'significantly behind.'
Here's my reasoning: we nearly hit the August benchmarks in late September, roughly 5 months after AI-2027's release instead of 4 months. That's about 25% slower. If that rate difference holds constant, the "really crazy stuff" that AI-2027 places around January 2027 (~21 months out) would instead happen around June 2027 (~26 months out). To me, a 5-month delay on exponential timelines isn't drastically different. Even if you assume we are going, say, 33% slower, we are still looking at August 2027 (~28 months out) for some really weird stuff.
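To make that extrapolation explicit, here is the toy arithmetic, a sketch under my own "constant slowdown factor" assumption rather than anything from the scenario itself:

```python
# The 21-month figure is AI-2027's "really crazy stuff" around January 2027,
# counted from the April 2025 release; the slowdown factors are illustrative.
milestone_months = 21
for slowdown in (1.25, 1.33):
    print(f"{slowdown:.2f}x slower -> ~{milestone_months * slowdown:.0f} months out")
# 1.25x -> ~26 months (~June 2027); 1.33x -> ~28 months (~August 2027)
```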
That said, I'm uncertain whether this is the right way to think about it. If progress acceleration depends heavily on hitting specific capability thresholds at specific times (like AI research assistance enabling recursive improvement), then even small delays might compound or cause us to miss windows entirely. I'd be interested to hear if you think threshold effects like that are likely to matter here.
Personally, I am not sure these effects will matter very much, given that the scenario did not call for large-scale speedups to AI research in 2025; those only arrive in early 2026, where it projects a fairly modest 1.5x speedup. But perhaps you have a different view?
Sonnet 4.5 was released nearly the final day of September, which seems like 1.5 months out from generically "August", and a 3-percentage-point score difference is not necessarily insignificant (perhaps there are diminishing returns at >80%; see the sketch below). I agree that we are quibbling over a thing that does not in itself matter much, but it IS important for assessing their predictive accuracy, and if their predictive accuracy is poor, it does not necessarily mean all of their predictions will be slow by the same constant factor. To be clear, all of these signals are very weak. I am only (modestly) disagreeing with the positive claim of the OP.
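To gesture at why the last few points might matter: one common way to account for ceiling effects (my framing, not anything from AI 2027) is to compare scores in log-odds rather than raw percentage points:

```python
from math import log

# Log-odds: equal steps in this space get harder to achieve near the ceiling.
def logit(p):
    return log(p / (1 - p))

# The same 3-point raw gain, mid-scale vs. near the ceiling:
print(f"50% -> 53%: {logit(0.53) - logit(0.50):.2f} log-odds")
print(f"82% -> 85%: {logit(0.85) - logit(0.82):.2f} log-odds")
# 0.12 vs 0.22: in these units, the gap Sonnet 4.5 still has to close is
# nearly twice as large as the same raw gap would be mid-scale.
```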
The signal that I am waiting for to assess very short timelines is primarily METR task lengths.
Sonnet 4.5 was released nearly the final day of September, which seems like 1.5 months out from generically "August"
I interpret August as "by the end of August". Probably worth figuring out which interpretation is correct, maybe the authors can clarify.
it IS important for assessing their predictive accuracy, and if their predictive accuracy is poor, it does not necessarily mean all of their predictions will be slow by the same constant factor.
Yeah, I agree with this. I do think there is pretty good evidence of predictive accuracy for the various authors, but obviously people have conflicting views on this topic.
To be clear, all of these signals are very weak. I am only (modestly) disagreeing with the positive claim of the OP.
This is a place where somebody writing out a much slower timeline, through 2028 or so, would be really helpful. It would be easier to assess how good this prediction is if we could compare it against other people's timelines for hitting these metrics (65% OSWorld, 85% SWEBench-Verified). I am not aware of anybody else's predictions about these metrics from a similar time, but those would probably be the way to resolve this.
I appreciate the constructive responses!
It seems to me that even a hardcore skeptic of AI 2027 would have been unlikely to predict a much larger error.
Speaking as someone who could perhaps be so termed: my expectations regarding the technical side of things only start to significantly diverge at the start of 2027. (I'm not certain of Agent-1 1.5x'ing AI research speed, but I can see that.[1] The rest seems more or less priced in.) And indeed, the end of 2026 is the point where, the forecast itself admits, its uncertainty increases and its predictions get less grounded.
Specifically, the point where I get off the ride is this one:
OpenBrain doubles down on this strategy with Agent-2. It is qualitatively almost as good as the top human experts at research engineering (designing and implementing experiments), and as good as the 25th percentile OpenBrain scientist at “research taste” (deciding what to study next, what experiments to run, or having inklings of potential new paradigms). While the latest Agent-1 could double the pace of OpenBrain’s algorithmic progress, Agent-2 can now triple it, and will improve further with time.
My understanding is that Agent-2 essentially "closes the loop" on automated AI R&D: human input is still useful, because the model's research taste is worse, but it is no longer required. That's the part that seems to me like a "jump" rather than a common-sense extrapolation, and which I mostly expect not to happen.
[1] Because I am really confused about how much AI is accelerating research/programming right now, and so have no idea what number to extrapolate. Maybe it gets so good at fooling people into thinking they're being incredibly productive by managing 50 agents at once that it slows research down by 50% instead?
TLDR: AI-2027's specific predictions for August 2025 appear to have happened in September of 2025. The predictions were accurate, if a tad late, but they are late by weeks, not months.
Reading AI-2027 was the first thing that viscerally conveyed to me how urgent and dangerous advances in AI technology might be over the next few years. Six months after AI-2027's release, I decided to check in and see how the predictions are holding up so far: what seems to be happening faster than expected, and what slower. I'll go through the specific claims that seem evaluable, in order.
The world sees its first glimpse of AI agents.
Advertisements for computer-using agents emphasize the term “personal assistant”: you can prompt them with tasks like “order me a burrito on DoorDash” or “open my budget spreadsheet and sum this month’s expenses.” They will check in with you as needed: for example, to ask you to confirm purchases. Though more advanced than previous iterations like Operator, they struggle to get widespread usage.
This prediction is panning out. With GPT-5 and Claude Sonnet 4.5, we now have agentic coders (Claude Code, GPT-5 Codex) and personal agents that can make purchases, not yet on DoorDash, but on platforms like Shopify and Etsy. Widespread adoption definitely doesn't seem to be here yet, but AI-2027 expected that. Arguably they undersold the degree to which these agents would already be used in software work, but they didn't make any specific claims about that.
There are a couple more testable claims made in footnotes to this paragraph.
Specifically, we forecast that they score 65% on the OSWorld benchmark of basic computer tasks (compared to 38% for Operator and 70% for a typical skilled non-expert human).
Claude Sonnet 4.5 scored 62% on this benchmark as of September 29th, 2025. The target was August; the score was nearly achieved in late September. AI-2027 got agentic capabilities essentially right: one month late and three percentage points short is remarkably accurate.
The other benchmark with a specific projection for August 2025 was SWEBench-Verified.
For example, we think coding agents will move towards functioning like Devin. We forecast that mid-2025 agents will score 85% on SWEBench-Verified.
Claude Sonnet 4.5 scored 82% on this benchmark as of September 29th, 2025. That is three percentage points below the 85% target, achieved one month late: again, remarkably close, particularly given that Opus 4.1 was already scoring 80% on this benchmark in August.
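Putting the two checkable forecasts side by side, just a tabulation of the numbers cited above:

```python
# AI-2027's two benchmark forecasts for August 2025, against Claude
# Sonnet 4.5's September 29 scores as cited in this post.
forecasts = {
    "OSWorld": (65, 62),
    "SWEBench-Verified": (85, 82),
}
for name, (predicted, actual) in forecasts.items():
    print(f"{name}: {actual}% vs {predicted}% forecast "
          f"({predicted - actual} points short, about one month late)")
```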
The August predictions are the only ones we can fully evaluate, but we can make preliminary assessments of the December 2025 predictions.
GPT-4 required 2⋅10^25 FLOP of compute to train. OpenBrain’s latest public model—Agent-0—was trained with 10^27 FLOP. Once the new datacenters are up and running, they’ll be able to train a model with 10^28 FLOP—a thousand times more than GPT-4. Other companies pour money into their own giant datacenters, hoping to keep pace.
The Agent-0 scenario looks increasingly plausible. We now know that GPT-5 was trained with less compute than GPT-4.5. While training compute increased by a reasonable amount from GPT-4 to GPT-5, evidence suggests OpenAI has an even more capable model in development. Some version of that is presumably due for release eventually, especially given the pressure that the very impressive Sonnet 4.5 release has put on them.
The evidence: OpenAI entered an 'experimental reasoning model' into the ICPC, a prestigious college-level coding contest, where it performed better than all human contestants, achieving a perfect 12/12 score. GPT-5 solved 11 problems on the first attempt; the experimental reasoning model solved the hardest problem after nine submissions.
The capabilities that this model demonstrated may not be Agent-0 level, and it is possible that it used less than 10^27 FLOP of training compute. But we should watch for the next OpenAI release, which could come as soon as Monday, October 6, at DevDay. This is speculation, but it is grounded in recent announcements: Sam Altman indicated less than two weeks ago that several compute-intensive products would release over the coming weeks, and we've already seen two such releases in under two weeks.

There's Pulse, OpenAI's proactive daily briefing feature, which launched on September 25 but hasn't generated much discussion yet; I'm curious what people think of it. And there's Sora 2, a significant leap forward for OpenAI in video generation, impressive enough to have generated substantial attention: the Sora app reached #3 on the App Store within 48 hours of its September 30 release.

I suspect something bigger is planned for DevDay, though there are no guarantees, especially given Altman's track record of generating hype. It's also worth noting that last year's DevDay announcements were more practical than transformative, with o1's release coming a couple of weeks before the actual event. Nonetheless, it is difficult to rule out a near-term release of this improved reasoning model.
AI-2027's predictions for mid-2025 have been substantially vindicated. Progress is running roughly one month behind the scenario: weeks, not months. Every prediction timed for August 2025 had essentially been realized by the end of September 2025. While I remain uncertain about fast timelines, dismissing scenarios like AI-2027 seems unwarranted given how well these early predictions have held up. These were the easiest predictions to verify, but they set a high bar, and reality met it.