Checking in on AI-2027
TLDR: AI-2027's specific predictions for August 2025 appear to have come true in September 2025. The predictions were accurate, if a tad late, but they are late by weeks, not months.

Edit 1: Thanks to Aaron Staley for pointing out that AI-2027 referred to the original OSWorld benchmark, while I used its successor, OSWorld-Verified, for the comparison in this post. I have updated the post to acknowledge this, though I believe that, for all intents and purposes, these benchmarks can be seen as interchangeable (which I will also lay out in the post).

Edit 2: I discovered a new high score on OSWorld-Verified, achieved by agent s3 w/ GPT-5 bBoN (N=10), an agent framework powered by GPT-5 and created by Simular. This score is higher than AI-2027 projected for August, though it was achieved on October 3rd, 2025. This performance is essentially at parity with skilled human computer use.

Reading AI-2027 was the first thing that viscerally conveyed to me how urgent and dangerous advances in AI technology might be over the next few years. Six months after AI-2027's release, I decided to check in and see how the predictions are holding up so far, what seems to be happening faster than expected, and what seems to be happening slower than expected.

I'll go through the specific claims that seem evaluable, in order.

> The world sees its first glimpse of AI agents.
>
> Advertisements for computer-using agents emphasize the term “personal assistant”: you can prompt them with tasks like “order me a burrito on DoorDash” or “open my budget spreadsheet and sum this month’s expenses.” They will check in with you as needed: for example, to ask you to confirm purchases. Though more advanced than previous iterations like Operator, they struggle to get widespread usage.

This prediction is panning out. With GPT-5 and Claude Sonnet 4.5, we now have agentic coders (Claude Code, GPT-5 Codex) and personal agents that can make purchases, though not yet on DoorDash,
What do you mean by "a jump on the metr graph"? Do you just mean better than GPT-5.1? Do you mean something more than that?