Thank you @elifland for reviewing this post. He and AI Futures are planning to publish updates to the AI 2027 Timeline Forecast soon.
AI 2027 (also launched in this LW post) forecasts an R&D-based AI takeoff starting with the development of Superhuman Coders[1] within a frontier lab.
FutureSearch co-authored the AI 2027 Timeline Forecast. We thought the other authors’ forecasts were excellently done, and as the piece says:
All model-based forecasts have 2027 as one of the most likely years for SC [Superhuman Coders] to be developed, which is when an SC arrives in the AI 2027 scenario.
Indeed. But overall, FutureSearch (two full-time forecasters, two contract forecasters, and this author, Dan Schwarz) think superhuman coding will arrive later — median 2033 — than the other authors (hereon "AI Futures"): median 2028 from Nikola Jurkovic, and median 2030 from Eli Lifland[2].
Here, we briefly explain how our views diverge on what it would take for OpenAI/DeepMind/Anthropic/xAI to develop superhuman coding.
This post can also serve as a much shorter piece for those who didn’t read the whole (24-page!) piece on the AI 2027 site. (See also Scott Alexander's shorter piece on the AI Futures blog on modeling coding time horizons.)
As @Max Harms wrote in another AI 2027 LW reaction piece,
we can hallucinate that FutureSearch has approximately the market timelines, while Eli Lifland (and perhaps other authors) has a faster, but not radically faster timeline.
I think this is broadly correct, even though the “market” timeline here is hard to interpret because as Max says, Superhuman Coding ≠ AGI.
Our published view in AI 2027 (the blue line below) is fairly in line with similar crowd forecasts, while being much slower than AI Futures':
We produced, but did not publish on AI 2027, an all-things-considered forecast with a median in Nov 2033, vs the "within-model" forecast of Jan 2032 above:
So while 2027 is conceivable, why do we give a median more than 8 years in the future?
AI Futures produced, and FutureSearch then adopted, a clever methodology for this forecast that we endorse and used ourselves. It’s written up in the piece as “Method 2”, and it has two steps:

1. Forecast when an AI system will saturate RE-Bench[3], a benchmark of realistic AI R&D coding tasks.
2. Forecast how long it will take to cross the remaining gaps between RE-Bench saturation and Superhuman Coders.
We like model-based forecasts a lot more than vibe-based forecasts. This methodology fits the bill, as (a) it's grounded in data, using the now-popular METR “scaling law” of time-horizon growth, and (b) it can be broken down into components that can be forecast independently and summed up.
For (1), time to RE-Bench saturation, we roughly agreed with AI Futures, noting that a score of 1.5 is their estimate of when a model will be as good as the best human at these tasks.
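To make step (1) concrete, here is a minimal sketch of the kind of time-horizon extrapolation involved, assuming a simple exponential trend. The starting horizon, doubling period, and target below are placeholder assumptions, not the parameters used in the forecast.

```python
# Minimal sketch (not the forecast's actual model): project a METR-style
# time-horizon doubling trend forward to a target task length.
# All numbers below are placeholder assumptions.
import math
from datetime import date, timedelta

current_horizon_minutes = 60.0     # assumed 50%-success time horizon today
doubling_time_days = 7 * 30        # assumed ~7-month doubling period
target_horizon_minutes = 8 * 60    # ~8-hour RE-Bench-style tasks

doublings_needed = math.log2(target_horizon_minutes / current_horizon_minutes)
days_needed = doublings_needed * doubling_time_days
print(f"~{doublings_needed:.1f} doublings, reached around "
      f"{date.today() + timedelta(days=days_needed)}")
```

The forecast's real model puts distributions over these parameters rather than point estimates, but the basic shape is the same.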
Much has been said about time-horizon scaling. One could argue, as Peter Wildeford and Scott Alexander do, that this timeline should be accelerated, in part due to OpenAI's April launch of o3, in part because labs are trying harder. The trend could also overstate progress towards "superhuman" coding, as HCAST tasks are baselined against "competent" humans, not the best human working at a frontier lab.
Here, though, we take this initial forecast as a rough point of agreement between all AI 2027 forecasters, and move on.
For the second half of the forecast, we assume we have an AI system that can score highly on realistic R&D coding tasks, in a controlled environment, that correspond to writing ~500 lines of code and would take a skilled human ~8 hours to complete.
What is left before such a system, at reasonable cost, can do all coding tasks involved in AI research, at 30x the speed of a frontier lab’s best engineer? We agreed with AI Futures that it would involve crossing these gaps:

- Handling much larger engineering projects: modifying >20,000 lines of code across codebases totaling >500,000 lines.
- Working from vague, high-level project descriptions without provided unit tests or other ground-truth feedback.
- Running parallel projects across codebases, specializing in frontier AI development, and doing all of this at the required cost and speed.
You’ll notice that each of these gaps also involves dealing with longer time horizons[4]. Here, we explain a few forecasting cruxes where FutureSearch thought development could take significantly longer than AI Futures.
For each section below, we give how much longer FutureSearch thought the stage would take vs. AI Futures.
+8 months (11 months vs. 3 months) [5]
AI Futures thought that the time to make 20k line changes to a 500k line codebase could be modeled as another type of time horizon, e.g. that there’s a doubling period of how many lines of code a system can handle.
We thought that such changes are qualitatively different from smaller ones, and are too far out-of-distribution to place much confidence in an extrapolation.
If they do require new training paradigms, they might require synthetic data of what a good 20k-line change to a 500k-line codebase looks like. And we factored in a significant slowdown for getting synthetic data like this working.
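To make the disagreement concrete, here is a minimal sketch of the doubling-style extrapolation that the AI Futures framing implies, with placeholder numbers. Our worry is that a straight-line extrapolation like this may not hold once changes are this far out-of-distribution.

```python
# Minimal sketch of treating "lines of code a system can change" as another
# doubling trend (the AI Futures-style framing). Placeholder numbers only.
import math

current_loc = 500          # ~RE-Bench-scale changes at saturation
target_loc = 20_000        # change size required by the milestone
doubling_time_months = 2.0 # assumed doubling period for LOC handled

doublings = math.log2(target_loc / current_loc)   # ~5.3 doublings
months = doublings * doubling_time_months
print(f"~{doublings:.1f} doublings -> ~{months:.0f} months under this extrapolation")
```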
+14 months (18.3 months vs. 4.5 months) [6]
This was our biggest crux. Systems trained today get RL feedback from human raters, from models emulating human raters, or from automated checks like passing unit tests.
AI Futures admits:
I and others have consistently been surprised by progress on easy-to-evaluate, nicely factorable benchmark tasks, while seeing some corresponding real-world impact but less than I would have expected.
We agree. But we disagree that approaches like best-of-k will help significantly, and about the degree to which progress on longer time horizons subsumes this problem.
As a rough intuition, if a software task takes 6 months for a human to do, it’s hard to imagine that the R&D process to build an AI to do it can be done much faster than 6 months, if that’s how long it would take the human researchers to synthesize such data or verify outputs.
+7 months (13.5 months vs. 6.5 months) [7]
Recall the defined goal is that 5% of the frontier lab’s compute budget could pay for 30x as many of these coders as the company has human coders, each of which works 30x faster than the company’s best human coder.
AI Futures anchor to Epoch’s ~50x reduction in token costs / year, and estimate a ~1000x reduction in the tokens needed each year.
We thought these trends would not continue. We expect closer to a ~10x reduction in token costs / year. And while that’s still fast, it does add many months to the expected time.
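The disagreement compounds over time. As a rough sketch, here is how long it takes to reach some overall cost-efficiency improvement under the two annual rates; the required improvement factor below is a placeholder, not a number from either forecast.

```python
# Rough sketch of how the assumed annual cost-decline rate changes the timeline.
# "required_gain" is a placeholder for the total cost-efficiency improvement
# needed to run SC-level agents within 5% of a lab's compute budget.
import math

required_gain = 10_000  # placeholder overall improvement factor

for label, annual_factor in [("~50x/year (Epoch-style anchor)", 50),
                             ("~10x/year (our expectation)", 10)]:
    years = math.log(required_gain) / math.log(annual_factor)
    print(f"{label}: {years:.1f} years (~{12 * years:.0f} months)")
```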
+10 months (14.7 months vs. 4.3 months)
Our top contender for unforeseen difficulties is agent communication. Real-world software engineering requires a lot of intelligence in project prioritization and communication among teammates.
Yes, the definition of Superhuman Coder leaves the prioritization of research directions to humans. But even for a given project ("Optimize this kernel"), there is a lot to figure out: which approach to start with, who does which part across a team of agents, and how to put the final result back together.
A final crucial consideration is that forecasting within a framework or model, as opposed to traditional “vibe”-based forecasts, makes it hard to consider every factor.
AI Futures, like FutureSearch, put a significant amount of probability on some other factor greatly slowing down the arrival of Superhuman Coders, as is reported in the piece:
|  | Eli’s SC forecast (median, 80% CI) | Nikola’s SC forecast (median, 80% CI) | FutureSearch aggregate (median, 80% CI, n=3) |
| --- | --- | --- | --- |
| Benchmarks-and-gaps model | 2028 (2025 to >2050) | 2027 (2025 to 2044) | 2032 (2026 to >2050) |
| All-things-considered forecast, adjusting for factors outside these models | 2030 (2026 to >2050) | 2028 (2026 to 2040) | 2033 (2027 to >2050) |
("Benchmarks-and-gaps model" is the model in this post, and has the dates from the headline graphic in both this post and in the AI 2027 timeline forecast.)
For us, the all-things-considered view was pushed out ~1 year primarily because:
Perhaps the best feature of model-based forecasts is that you can adjust them, both to capture individual views, and also to update them over time.
In AI 2027, Eli and Nikola helpfully link to a repo that lets you input your own numbers and generate distributions, which FutureSearch used to test different assumptions.
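As a flavor of what that kind of model looks like (a sketch, not the repo's actual code), the snippet below samples a duration for each gap from a lognormal, sums them as if the gaps were crossed sequentially, and reads off a median and an 80% interval. The medians are the FutureSearch per-gap numbers quoted above; the spreads are illustrative placeholders.

```python
# Minimal sketch of a benchmarks-and-gaps style Monte Carlo: sample how long
# each gap takes, sum across gaps, and summarize the resulting distribution.
# Sigmas below are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

gaps = {                      # (median_months, lognormal_sigma)
    "complex_codebases": (11.0, 0.6),
    "feedback_loops":    (18.3, 0.6),
    "cost_and_speed":    (13.5, 0.5),
    "other_gaps":        (14.7, 0.7),
}

total_months = np.zeros(n)
for median, sigma in gaps.values():
    total_months += rng.lognormal(mean=np.log(median), sigma=sigma, size=n)

print("median months after RE-Bench saturation:", round(float(np.median(total_months)), 1))
print("80% interval (months):", np.percentile(total_months, [10, 90]).round(1))
```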
We may also write another piece giving our views on the AI 2027 Takeoff Forecast, which reckons with how fast R&D will speed up in the post-SC world, and also has models that can easily be updated as AI advances.
Superhuman Coder, as defined in AI 2027: An AI system for which the company could run with 5% of their compute budget 30x as many agents as they have human research engineers, each of which is on average accomplishing coding tasks involved in AI research (e.g. experiment implementation but not ideation/prioritization) at 30x the speed (i.e. the tasks take them 30x less time, not necessarily that they write or “think” at 30x the speed of humans) of the company’s best engineer. This includes being able to accomplish tasks that are in any human researchers’ area of expertise.
These forecasts are "all-things-considered". The headline result in AI 2027, giving much more probability of Superhuman Coding coming in 2027, was a "within model" forecast that assumes away certain concerns. See the final section for what those concerns might be.
RE-Bench is a near-perfect benchmark, as it tracks challenging yet realistic AI R&D coding tasks. Example tasks are "Optimize a Kernel", "Fix Embedding", or "Finetune GPT-2 for QA".
There are also other considerations: compute scaling may change, algorithmic progress may slow down, and external vs. internal deployment may differ. And these gaps can be crossed sequentially or in parallel. Please refer back to the methodology section in the Timeline Forecast for more.
Milestone definition: Ability to develop a wide variety of software projects involved in the AI R&D process which involve modifying >20,000 lines of code across files totaling up to >500,000 lines. Clear instructions, unit tests, and other forms of ground-truth feedback are provided. Do this for tasks that take humans about 1 month (as controlled by the “initial time horizon” parameter) with 80% reliability, and at the same cost and speed as humans.
Milestone definition: Same as above, but without provided unit tests and only a vague high-level description of what the project should deliver.
Milestone definition: A system that does both of the above, runs parallel projects across codebases, is specialized in frontier AI development, and operates at a cost and speed such that there are substantially more superhuman AI agents than human engineers (specifically, 30x more agents than there are humans, each one accomplishing tasks 30x faster).