A major plausible class of worlds in which we don't get superhuman coders by the end of 2026 is worlds where the METR trend continues at roughly the same slope it had in 2025, or only a slightly steeper one. Right?
Yes, but 2025 saw two trends with different doubling times: Claude 3.5 Sonnet to o3, and o3 to GPT-5.1-Codex-Max. IIRC the earlier trend would have superhuman coders appear by 2028, while the later trend (which was arguably invalidated by Claude 4.5 Opus and its ~5h time horizon; see, however, two comments pointing out that the METR benchmark is no longer as trustworthy as it once was, and my potential explanation of the abnormally high 50%/80% time horizon ratio) had superhuman coders arrive in 2030, or outright hit a wall[1] before becoming superhuman.
As for the OP's idea that coding agents are used to improve coding agents until reaching the SC: this could be unlikely because they don't improve the underlying LLM. I remember the now-obsolete benchmarks-and-gaps model, which required the SCs not just to saturate RE-bench but to actually learn to do long tasks and handle complex codebases, which in turn requires either a long attention span in the LLM itself or careful summarisation of each method's specification, formatting, other methods' names, etc.
P.S. The latter scenario would be particularly difficult to predict, as it might involve the time horizon in the METR sense following a curve that grows ~exponentially until the very last couple of doublings.
Or it could become neuralese, with consequences as disastrous as the lack of a Safer-1 to test alignment on.
When looking for trend breaks in time series, it's unwise to rely on eyeballing when the Quandt likelihood ratio test, aka the sup-Wald test, has existed for 65 years (google it or ask an LLM to explain it in layman's terms).
I pulled the METR data and asked Gemini 3 Flash to vibecode the test, and there is a statistically significant break at Claude 3.5 Sonnet (peak F-statistic = 7.79, corresponding to a p-value of about 0.03) from an ~8-month to a ~5-month doubling time, but not after it.
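For reference, a minimal sketch of what such a sup-Wald (QLR) break test on the METR data could look like, assuming the data comes as (release date, 50% time horizon) pairs; the variable names are my assumptions, not the vibecoded script's, and p-values for the peak F-statistic come from Andrews-style critical values or simulation rather than a plain F table:

```python
import numpy as np

def chow_f(x, y, split):
    """Chow F-statistic for a candidate break at index `split`
    in the regression log2(horizon) ~ a + b * date."""
    def rss(xs, ys):
        X = np.column_stack([np.ones_like(xs), xs])
        beta, *_ = np.linalg.lstsq(X, ys, rcond=None)
        resid = ys - X @ beta
        return resid @ resid

    n, k = len(x), 2  # k = parameters per regime (intercept + slope)
    rss_pooled = rss(x, y)
    rss_split = rss(x[:split], y[:split]) + rss(x[split:], y[split:])
    return ((rss_pooled - rss_split) / k) / (rss_split / (n - 2 * k))

def qlr(x, y, trim=0.15):
    """Sup-Wald / QLR statistic: maximise the Chow F over all candidate
    break points, trimming the first and last 15% of observations."""
    n = len(x)
    candidates = range(max(int(n * trim), 2), min(int(n * (1 - trim)), n - 2))
    return max((chow_f(x, y, s), s) for s in candidates)  # (peak F, break index)

# Hypothetical usage: dates as fractional years, horizons in minutes.
# dates, horizons = np.array([...]), np.array([...])
# f_peak, break_idx = qlr(dates, np.log2(horizons))
```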
Using @ryan_greenblatt's updated 5-month doubling time: we reach the 1-month horizon from AI 2027 in ~5 doublings (Jan 2028) at 50% reliability, and in ~8 doublings (Apr 2029) at 80% reliability. If I understand correctly, your model uses 80% reliability while also requiring the AI to be 30x cheaper and faster than humans. It does seem like, if the trend holds, the models wouldn't be much more expensive or slower by mid-2029. But I agree that if a lab tried to demonstrate "superhuman coder" on METR by the end of next year using expensive scaffolding / test-time compute (similar to o1 on ARC-AGI last year), it would probably exceed 30x human cost, even if it were already 30x faster.
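For what it's worth, here's the back-of-the-envelope arithmetic behind those dates, taking the current 50% horizon as ~5h (from the thread above) and assuming a current 80% horizon of ~1h, which is my guess rather than a METR-published figure:

```python
import math

MONTHS_PER_DOUBLING = 5      # Ryan Greenblatt's updated estimate, per the comment
TARGET_HOURS = 167           # ~1 work-month, the AI-2027 superhuman-coder horizon

start = {"50%": 5.0,         # ~5 h 50%-reliability horizon (Opus 4.5, per the thread)
         "80%": 1.0}         # ~1 h 80%-reliability horizon (my assumption)

for label, h0 in start.items():
    doublings = math.log2(TARGET_HOURS / h0)
    months = doublings * MONTHS_PER_DOUBLING
    print(f"{label}: {doublings:.1f} doublings ≈ {months:.0f} months from late 2025")

# 50%: ~5.1 doublings ≈ 25 months -> around Jan 2028
# 80%: ~7.4 doublings ≈ 37 months -> same ballpark as the ~8 doublings / Apr 2029
#      above; the exact date depends on the assumed current 80% horizon
```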
The thing METR is measuring seems slightly different from "superhuman coder". My understanding is that they're dropping an AI into an unfamiliar codebase and telling it to do something with no context or design help, so this is partly software architecture and partly coding. On pure coding tasks, Claude Code is clearly superhuman already.
I spent a few hours over the last few days collaborating with Claude on design docs and some general instructions, then having it go through massive todo lists fully autonomously[1]. This is weeks of coding and it did it in a few hours (mostly slowed down by me getting around to giving it more work).
This is the first time I've had it do tasks of this scale so I'm not doing anything special, just having it propose a design, telling it which parts I want done differently, then having it make a todo list and execute it.
Example prompt:
Can you go through @TODO.md, delegating each task to opus subagents and ensuring that they understand all of the necessary context and implement the task, check it off, and commit it, then move onto the next task until the list is done?
The coding capabilities of LLMs are extremely impressive, but 2 core capabilities I talked about in this post are still missing for "superhuman" coders to arrive.
In short, they are:
As of now, humans are required to supply context and glue things together. That won't change if 2026 produces nothing but iterative improvements on general coding ability.
If an OpenAI co-founder and the creator of Claude Code are resorting to Claude 4.5 Opus instead of coding themselves, it seems quite clear that we already have superhuman coders.
Isn't getting working/production-ready code done faster the definition of being better than you at coding? It's possible the creator of Claude Code is incorrect about this and he would be more productive long-term writing this code himself, or the code is actually unacceptable in ways that he hasn't noticed yet, but if he's correct that it's more productive to have Claude write it, then Claude is better at coding than him.
If I use my IDE's LSP functions to do a large automated refactor, is the IDE better than me at coding?
There is much more to "coding" than "writing code", notably software design. As a software engineer I use Claude Code daily (I write maybe 1% of my total LOC by hand these days), but I still have to steer it. I have to tell it which architecture to use, which abstractions, correct it when it tries to use a shortcut instead of solving a problem at the root, etc.
When it can produce PRs which would pass code review on a competent software engineering team without that steering, we will have a superhuman coder.
I was thinking of "coder" as specifically the job of writing code, which I assume is what the Claude Code guy meant too. AI is clearly not reliable at system design yet.
System design is one part of designing software, but isn't so much what I'm trying to point at here.
Claude Opus 4.5 still can't produce or follow a simple plan to implement a feature on a mid-sized codebase independently.
As an example: earlier today I was implementing the feature of resuming a session when a client reconnects to a server after losing connection. One small part of this task is re-syncing the state once the current (server-side) task has finished.
Claude Code was not capable of designing a functioning solution to this problem in its planning mode (it kept trying to sync the state immediately upon connecting, leading to the client missing the result of the in-progress task).
The solution I chose for this specific instance of the problem was to add a state sync command to the server's command queue for that session when a client reconnects. Claude Code updated the plan to show the exact code changes required (correctly).
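For concreteness, here is a minimal sketch of that design, with all names hypothetical (this is my illustration, not the author's actual code): a per-session command queue, with the state sync enqueued on reconnect so it only runs after the in-progress task completes.

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class Session:
    queue: asyncio.Queue = field(default_factory=asyncio.Queue)
    state: dict = field(default_factory=dict)

async def worker(session: Session, send):
    """Drains the session's command queue one command at a time."""
    while True:
        command = await session.queue.get()
        if command["type"] == "run_task":
            session.state["last_result"] = await run_task(command["payload"])
        elif command["type"] == "sync_state":
            await send(session.state)  # runs only after earlier commands have finished
        session.queue.task_done()

async def on_reconnect(session: Session):
    # Don't sync immediately (the client would miss the in-progress task's result);
    # enqueue the sync behind whatever is already running or queued.
    await session.queue.put({"type": "sync_state"})

async def run_task(payload):
    ...  # placeholder for the actual server-side task
```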
However, when implementing the plan, it forgot to actually make the relevant change to add the command to the queue. End-to-end tests caught this, and Claude's solution was to automatically do a state sync after every task. It did not implement what was written in the plan. I gave it a nudge to re-read the plan, which was enough to make it see the mistake and correct it.
Compared to asking a human co-worker to make the same change, the difference is stark. We are still a way off from superhuman coders.
Not necessarily. You can have a superhuman centaur (AI+Human) system without the AI alone being superhuman.
I do think we are in a coding overhang.
Current harnesses seem far from the ceiling and could be improved a lot. One example: you can significantly boost output quality with simple tricks, like telling Claude Code to implement something and then explicitly asking it to self-review. You could get slightly more creative: say, ask it to implement something security-sensitive, then ask it to break it.
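To illustrate (my wording, not a quote from anyone in the thread), the self-review trick is just a two-step prompt sequence: first, "Implement the password-reset endpoint described in @DESIGN.md"; then, in a follow-up message, "Review the change you just made as a hostile security reviewer, list concrete ways to break it, and fix them."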
And I feel the same way Andrej does. Even though I consider myself relatively adept at using AI agents, I feel like I'm doing quite a horrible job. In principle, these tools could be used much more effectively.
But perhaps an even larger bottleneck is at the organizational level. Existing standard workflows weren't designed for a world with AI agents, and it takes a long time to invent and implement new ones.
Also, Claude 3.7 Sonnet was released just 10 months ago and was barely usable; the difference between 3.7 Sonnet and 4.5 Opus feels comparable (even if less dramatic) to the difference between GPT-3 and GPT-4.
Andrej Karpathy posted 12 hours ago (emphasis mine):
This seems to be a big update since his Dwarkesh episode published on Oct 17 (though I know these things can take a while to get edited, so the gap could be even bigger), where he said:
This is just me guessing, but Claude Opus 4.5 was released just one month ago, and Opus 4.5 + Claude Code seems like the big shift for a lot of people.
In fact, Boris Cherny, creator of Claude Code, commented on Karpathy's post saying (emphasis mine):
To be clear, a lot of these PRs might be "quite small, a few lines and bug fixes" (cf. this comment by another Anthropic employee). Boris had just asked users for feedback, then closed 19 PRs the next morning. Still, 200 PRs in a month without opening an IDE is something [1].
Update: Boris just made another post giving more details (emphasis mine):
AI Accelerating AI
It seems like we might be entering something like a self-improving feedback loop for the system "humans + AI": employees at the labs are developing AI coding agents using these same AI coding agents, with the horizon length of these models increasing on a faster exponential than we thought (cf. Opus 4.5)[2], and potentially not even an exponential.
This isn't AI autonomously improving itself, but the feedback loop between training better AI models and having these models accelerate the automation of AI R&D seems to be tightening [3] [4].
The "Coding Overhang"
In July 2020, after GPT-3, Andy Jones asked if we were in an AI Overhang, because (at the time) it felt like companies could just keep scaling models like GPT-3 by many more orders of magnitude and get much more "intelligence".
With coding agents and reasoning / test-time compute, it seems to me that what Karpathy (& Boris) are describing is some sort of "Coding Overhang", where people at the cutting edge, and especially members of technical staff, are trying to catch up with ~10x improvements that are purely user-dependent skill issues.
In what worlds do we not get Superhuman Coders by the end of 2026?
Note: As the creator of Claude Code, Boris is obviously incentivized to promote it.
Update: after seeing Daniel Kokotajlo's answer and thinking about it more, I should clarify that if things stay roughly on trend, even with a 5-month doubling time, we'd only get to 1-month tasks with 80% reliability by Apr 2029. See full comment here.
See this older 2021 post I wrote about self-improving {humans + AI} systems, or this video explaining Tom Davidson's full takeoff model for more intuitions.
Boris just wrote on X: "[...] In the last thirty days, 100% of my contributions to Claude Code were written by Claude Code"