Cross-posted from my Substack.
AI 2027’s recent publication made waves—if you’re reading this, you probably saw the whole saga unfold: cheers, jeers, NYT writeup, the works. To briefly recap: the authors predict that by March 2027, a superhuman coder is created, speeding up AI R&D fivefold. By the end of the year, full-blown artificial superintelligence arrives. It sparks unprecedented scientific and economic progress—alongside mass unemployment, an arms race with China, and, worst case, human extinction via bioweapon.
There’s much to say about the assumptions built into their timelines, but here, I want to home in on one key piece: the long-form data bottleneck.
I argue that the AI 2027 authors overlook the availability (or lack thereof) of the suitable long-form data necessary to train a model capable of reasoning reliably over months-long projects—an ability the authors explicitly say their "superhuman coder" needs. This might seem like a minor technical detail, but this data bottleneck could significantly delay AGI timelines, perhaps by years or even decades (Monte Carlo simulation here & corresponding post). Ironically, such delays would be great news for those concerned with AI safety, slowing timelines and providing a rare opportunity for tractable governance.
There are two parts to the AI 2027 forecast. The first part, “Timelines,” estimates the arrival of a superhuman coder. The second part, “Takeoff,” predicts a superhuman AI researcher and, soon after, artificial superintelligence—an explosive jump powered by automated AI R&D.
Perhaps counterintuitively, the Timelines portion actually makes up most of the gap between now and transformative AI. Why? Well, the development of a superhuman coder relies on slow, human-driven R&D. But once it’s created, the research process speeds up 5x, meaning that further improvements quickly follow.
Now, this claim of a 5x speed-up seems dubious, but there’s already another AI 2027 review that covers this concern. For my part, I’ll be digging into the details of the Timelines analysis, uncovering how the data bottleneck might hamper the creation of a superhuman coder.
First, let’s define what a superhuman coder is. It’s not a well-defined technical term—here, it just refers to an AI that can perform expert-level AI R&D coding tasks, but significantly faster and cheaper than a human.[1]
To approximate this level of capability, the authors employ the concept of a model’s time horizon: the length of a task, measured by how long it would take a human, that the AI can complete independently. For example, if a model reliably succeeds at coding tasks that would take a human software developer one hour, it is considered to have a one-hour time horizon.
They draw from a METR report that estimates models’ time horizons using METR’s HCAST benchmark, which comprises ~100 agentic tasks spanning machine learning, software engineering, cybersecurity, and general reasoning. Testing frontier models released over the past several years, the report finds that models’ time horizons have doubled, on average, every seven months, with the doubling time shrinking as progress continues.
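Concretely, METR estimates a model’s horizon by fitting a logistic curve relating task length (in human completion time) to the model’s success rate, then reading off the length at which predicted success hits 50%. Here is a minimal sketch of that procedure with invented task results (the numbers are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical benchmark results: human completion time (minutes) per task,
# and whether the model succeeded. Values are illustrative only.
task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
succeeded    = np.array([1, 1, 1, 1, 1,  1,  0,  1,   0,   0])

# Fit success probability as a logistic function of log2(task length),
# mirroring the shape of METR's time-horizon estimation.
X = np.log2(task_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, succeeded)

# The 50% time horizon is the task length where predicted success = 0.5,
# i.e. where the logistic's linear term crosses zero.
log2_horizon = -clf.intercept_[0] / clf.coef_[0][0]
print(f"Estimated 50% time horizon: {2**log2_horizon:.0f} minutes")
```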
For the endpoint, AI 2027’s authors estimate that a superhuman coder will require, on average, a time horizon of ten years on HCAST, which maps onto a 6-month time horizon on the distribution of real-world tasks.
Both forecast methods—one of which also references RE‑Bench—hinge on this same time horizon trend. The authors extrapolate their forecasts from this trend, putting a 40-45% probability on the time horizon growing superexponentially (each doubling happens 10% faster), and a 10% probability that it grows subexponentially (each happens 10% slower).
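To see what the superexponential assumption buys, here is a rough sketch using my own illustrative parameters (a one-hour current horizon, a seven-month doubling time, and a ten-year target; these do not exactly match the authors’ model):

```python
import math

# Illustrative assumptions (mine, not AI 2027's exact parameters):
start_hours  = 1                     # current ~1-hour time horizon
target_hours = 10 * 365 * 24         # ten-year time horizon
doublings    = math.log2(target_hours / start_hours)  # ~16.4 doublings

# Constant (exponential) growth: every doubling takes 7 months.
exp_months = 7 * doublings

# Superexponential growth: each doubling takes 10% less time than the last.
super_months = sum(7 * 0.9**k for k in range(math.ceil(doublings)))

print(f"doublings needed: {doublings:.1f}")
print(f"exponential:      {exp_months / 12:.1f} years")
print(f"superexponential: {super_months / 12:.1f} years")
```

Under these toy numbers, the same endpoint arrives in roughly five years instead of roughly ten, which is why the superexponential assumption does so much work in the forecast.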
However, there are a few issues with this method.
First, “Where’s my ten minute AGI?” by Anson Ho offers some convincing objections. For one, time horizons are domain-specific: had we run the same analysis on chess-playing AI, we would have predicted century-long time horizons by now. Moreover, task reliability strongly influences time horizons. METR uses a 50% success rate to define its time horizons, which yields the roughly one-hour figure reported for today’s frontier models. But at 80% reliability, the time horizon shrinks to 15 minutes; push it to 99%, and it plummets below a minute.
Second, I argue that it doesn’t make sense to view these data points as a continuous trend from which one can naively extrapolate, exponential or otherwise. Once we decompose this apparent trendline, we’ll see why data is important, and why it’ll be a bottleneck.
To start, let’s look at another slightly more famous technological trendline: Moore’s Law. Standard visualizations of Moore’s Law show a straightforward exponential increase driven by a single factor. But in reality, experts have observed that the “straight line” of Moore’s Law is composed of several overlapping logistic (“S”) curves, each denoting the rise and saturation of distinct hardware paradigms.
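As a toy numerical illustration (all parameters invented), summing a handful of S-curves, each arriving a decade after the last and contributing roughly a hundredfold more than its predecessor, produces something close to a straight line on a log scale:

```python
import numpy as np

def logistic(t, midpoint, scale=2.0):
    # A generic S-curve: rises from ~0 to ~1 around `midpoint`.
    return 1.0 / (1.0 + np.exp(-(t - midpoint) / scale))

# Toy model: six successive "paradigms", each arriving ten years after the
# last and contributing ~100x more than the one before (numbers invented).
def transistor_proxy(t):
    return sum(100.0**k * logistic(t, 1975 + 10 * k) for k in range(6))

for year in range(1970, 2021, 10):
    print(year, f"log10 total = {np.log10(transistor_proxy(year)):.2f}")
```

The summed curve climbs about two orders of magnitude per decade: a near-straight line built out of saturating pieces rather than a single smooth exponential.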
In the same way, the time horizon trend can be attributed to different paradigms.
For instance, the jump from GPT-2 to GPT-4 largely resulted from scaling pre-training resources (data and compute), which are either being exhausted or displaying diminishing returns.
Meanwhile, the gap from GPT-4 to o1 was bridged by post-training techniques applied to the base model, such as reinforcement learning and supervised fine-tuning. These techniques require less time and data than pre-training, but it’s doubtful that the post-training paradigm alone will be sufficient to yield a transformative technology. Last week, I outlined a paper claiming that RL with verifiable rewards doesn’t elicit new capabilities; in fact, it narrows the diversity of model responses, leading RL-trained models to underperform their base models when many samples are drawn (e.g., pass@k at large k).
So, yes, doubling time has fallen. But with only two paradigms so far, it seems premature to assign a significant probability of superexponential growth, as the AI 2027 authors do, and the “new Moore’s Law” claims just seem absurd.
Moreover, the y-axis of the time horizon graph is similarly misleading.
For example, the time horizon jump from GPT-2 to GPT-4 reflects improved reasoning capabilities: the model got better at solving harder problems, even when those problems looked simple at first glance. “The sky is ____” can be answered with basic pattern-matching, but “implement binary search” requires actual logical reasoning, even though both prompts are equally concise.
Meanwhile, the improvement from GPT-4 to o1 reflects gains in both raw capability and reasoning processes. o1 builds on GPT-4’s base, but benefits from post-training techniques like reinforcement learning and fine-tuned reasoning strategies, such as Chain of Thought. These techniques made it particularly adept at solving coding and math problems, as well as breaking down larger requests into manageable pieces.
But as we approach models with months-long time horizons, scaling these improvements will be insufficient. At this level, the core challenge shifts: it’s not just about what a model can reason about, but how much input it can reason over.
Even if a superhuman coder is solely focused on project implementation (rather than “taste” tasks), if the project stretches over several months, it will still need to process a huge volume of upstream information: codebase history, experimental results, error reports, human feedback, organizational goals, ML papers, etc. Input length thus becomes a central issue.
Assuming that a superhuman coder requires a six-month time horizon, a conservative estimate suggests that it must be able to reason over at least one million tokens[2]. Some models, like GPT-4.1, technically already support this context length, but the performance of so-called “long-context” models degrades sharply with longer inputs—GPT-4.1’s accuracy plummets to 60% at a mere 128k tokens on simple recall tasks.
Simply widening the context window fails because of a mathematical constraint in self‑attention.
As you might know, the transformer’s defining breakthrough was the self-attention mechanism, which has proven groundbreaking for sophisticated language comprehension and generation. However, it also comes with a limit: attention dilution.
Mathematically, self-attention is represented by the equations below.
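In the standard scaled dot-product form, with q_i, k_j, and v_j the query, key, and value vectors and d_k the key dimension:

$$s_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}}, \qquad w_{ij} = \frac{\exp(s_{ij})}{\sum_{j'=1}^{N} \exp(s_{ij'})}, \qquad \text{output}_i = \sum_{j=1}^{N} w_{ij}\, v_j$$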
The “raw attention score” that query token i assigns to key token j is s_ij. The softmax normalizes these scores into probability weights w_ij. By construction, the weights for any given query token must sum to 1, which means the average weight any key token can receive is 1/N. As the input length N grows, the attention available to even a highly salient token gets squeezed, because a fixed budget must be shared across ever more competitors. Attention dilution over long contexts is thus an unavoidable consequence of the softmax normalization used in self-attention.
Intuitively, you can think of each token as having a fixed amount of attention to give during self-attention, regardless of how many other tokens there are. With long inputs containing many interdependencies, this limited amount of attention is spread thinly across countless tokens. Since many pieces of information are now similarly attended to, the updated token representations lose the signal in the noise.
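To put a rough number on this: suppose one key token’s raw score beats every other token’s by a fixed margin. Under softmax, the weight that salient token receives still collapses as the context grows (a toy calculation; the margin of 5 is arbitrary):

```python
import math

def salient_weight(n_tokens, score_gap=5.0):
    """Softmax weight on one 'salient' key whose raw score exceeds every
    other key's score by `score_gap`, with n_tokens - 1 background keys."""
    return 1.0 / (1.0 + (n_tokens - 1) * math.exp(-score_gap))

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9,} tokens -> weight on salient token: {salient_weight(n):.5f}")

# Even with a decisive score advantage, the salient token's share of attention
# falls roughly as 1/N once N * exp(-score_gap) dominates the denominator.
```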
Although attention dilution precludes simply expanding the context window, alternative algorithms can be used to enable selective attention within long contexts. Some notable examples include retrieval augmented generation (RAG), sparse attention, and compressive memory.
RAG splits the input into smaller entries and stores them in a database. Entries are then retrieved and appended to the context window when relevant to the currently processed tokens, as determined by cosine similarity between embeddings (a simple measure of semantic similarity).
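A minimal sketch of the retrieval step (the database of pre-computed embeddings stands in for whatever embedding model a real system would use):

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_embedding, database, top_k=3):
    """Return the top_k stored chunks most similar to the current query.
    `database` holds (embedding, text_chunk) pairs produced by splitting the
    long input into smaller entries and embedding each one."""
    scored = [(cosine_similarity(query_embedding, emb), chunk)
              for emb, chunk in database]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

# Retrieved chunks are appended to the context window alongside the tokens
# currently being processed, standing in for the full multi-month history.
```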
Sparse attention modifies self-attention so that each token attends only to a subset of other tokens rather than all of them. Which subset depends on the specific algorithm, but proximity-based local windows are common, often mixed with a handful of global and random connections.
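As one illustration, a simplified BigBird-style mask mixing local, global, and random connectivity might be built like this (real implementations operate on blocks of tokens, and the parameter choices here are arbitrary):

```python
import numpy as np

def sparse_attention_mask(n_tokens, window=2, n_global=2, n_random=2, seed=0):
    """Boolean mask where mask[i, j] = True means token i may attend to token j.
    Mixes a local proximity window, a few 'global' tokens connected to
    everything, and a few random links per token."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n_tokens, n_tokens), dtype=bool)
    for i in range(n_tokens):
        lo, hi = max(0, i - window), min(n_tokens, i + window + 1)
        mask[i, lo:hi] = True                                 # local window
        mask[i, rng.choice(n_tokens, size=n_random)] = True   # random links
    mask[:n_global, :] = True   # global tokens attend everywhere
    mask[:, :n_global] = True   # and everything attends to them
    return mask

print(sparse_attention_mask(8).astype(int))  # 8x8 toy example
```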
Compressive memory replaces original information with a summarized version as the context window fills up. The information that is summarized is often the oldest information, but this varies depending on the algorithm.
While these algorithms have yielded marginal improvements for some models, they haven’t solved long-context performance degradation. The reason is that they rely on rigid, crude heuristics—cosine similarity, oldest-first compression, and so on—that don’t allow nearly as nuanced reasoning as self-attention does.
However, given that humans have an extremely small context window (i.e. working memory) and seem to rely on processes similar to retrieval, compression, and sparse attention to reason over long horizons, I’m willing to grant that these algorithms are adequate in theory. But even assuming that we don’t need to go beyond the transformer+, we still need to train these algorithms to operate dynamically, rather than rigidly, in order to replicate self-attention’s efficacy. Critically, training requires the right training data.
First, note that according to Chinchilla scaling laws, we’re already approaching a data bottleneck.
The DeepMind Chinchilla paper shows that for any given increase in compute, the model size (number of parameters) and amount of training data (number of training tokens) should be scaled proportionally to achieve optimal performance.
This trade-off is expressed in the paper’s scaling law for loss L, which is a function of model size N and training data set size D.
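With the constants fitted in the paper, the law is approximately:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad E \approx 1.69,\; A \approx 406.4,\; B \approx 410.7,\; \alpha \approx 0.34,\; \beta \approx 0.28$$

The A/N^α term is the “finite model” penalty and B/D^β the “finite data” penalty discussed below.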
LessWrong post “chinchilla’s wild implications” by nostalgebraist lays out the, well, wild implications of this scaling law. By plugging in the parameters and training tokens of the models examined in the paper, the author shows that the “finite model” term is tiny compared to the “finite data” term. Thus, scaling model size, even by orders of magnitude, produces minimal performance gains compared to dataset expansion.
If you plug in figures for GPT-3 vs. GPT-4, a similar dynamic emerges: the majority of the loss decrease between the two models is accounted for by the increase in training tokens, not model size. Moreover, exhausting human-generated public text data (10^15 tokens is the effective stock, according to Epoch AI) shrinks the finite-data loss term by only about an order of magnitude; beyond that, models can get arbitrarily large without seeing meaningful performance improvements.
GPT-3, GPT-4, and a full data use model, respectively.
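For concreteness, here is one way to run those numbers. GPT-3’s parameter and token counts are public; the GPT-4 figures below are unofficial estimates that circulate online, and the “full data” model size is an arbitrary large placeholder, so treat the exact outputs as illustrative:

```python
# Chinchilla loss decomposition, L(N, D) = E + A/N**alpha + B/D**beta,
# using the fitted constants from Hoffmann et al. (2022).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss_terms(n_params, n_tokens):
    return A / n_params**alpha, B / n_tokens**beta

scenarios = {
    "GPT-3 (175B params, ~300B tokens)":          (175e9, 3e11),
    "GPT-4 (rumored ~1.8T params, ~13T tokens)":  (1.8e12, 1.3e13),  # unofficial estimates
    "Full data stock (~1e15 tokens, huge model)": (1e14, 1e15),      # hypothetical
}

for name, (n, d) in scenarios.items():
    finite_model, finite_data = loss_terms(n, d)
    print(f"{name}: finite-model term = {finite_model:.3f}, "
          f"finite-data term = {finite_data:.3f}")
```

Under these assumptions, the finite-data term dominates the GPT-3 to GPT-4 improvement, and even the full public data stock only pulls it down by roughly another factor of a few.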
Of course, the paper’s parameters, which I plugged in above, were not fit on data from these frontier models. I’m happy to hear out technical challenges on this front, but for now, I’ll assume that this observation is still a decent heuristic to proceed from.
So, what have we learned? Basically, D, the amount of training data, matters a lot, especially when D is small relative to N.
However, while the paper interprets D as simply the number of training tokens, this doesn’t fully describe what’s important.
First, D needs to be reasonably relevant to L. If you trained an LLM solely on literature-related data, it would obviously perform terribly on coding benchmarks, regardless of scale.
Second, D also relates to the number of samples, as opposed to solely the number of tokens. This is pretty intuitive—there’s a big difference between feeding a model a few thousand gargantuan data points vs. feeding it 100 billion singular words, even if these datasets are similarly sized, token-wise.
Therefore, D, understood as the number of relevant training samples, is what matters.
The point on relevance has significant implications. If a model requires a time horizon of six months to qualify as a superhuman coder (or superhuman anything, really), then it’s highly plausible that the relevant data is extremely scarce among available data, implying a major bottleneck. In fact, I’d argue that it’s not just long-form data that’s required (which is rare enough), but long-form workflow data, which is all but nonexistent. To clarify, a workflow is more than just a single output; it's the complete sequence of inputs, iterations, data, and feedback that ultimately produces that output.
Sure, relevancy isn’t well-defined, and some might point out the possibility of generalization from short-form data. Here, allow me to offer a few points in favor of my argument.
First, consider: if we were to train a model on disjoint three-token phrases (e.g. “the red ball”, “that funny girl”), we wouldn’t expect it to learn the grammar needed to coherently process and respond to 300-word paragraphs. If LLM training samples average a few thousand tokens, there’s similarly no reason to think that models would bridge the magnitude gap between those data points and reliable reasoning over a million-token context.
Second, it seems intuitive that workflow data (as opposed to long-form data alone) would provide unique signals essential to learning how to reason over long contexts.
As a simple example, imagine two different datasets. Dataset A comprises final research papers only, while Dataset B appends each research paper with the earlier papers that the author drew from. Training on Dataset A lets a model notice, for instance, that a paper’s literature review shapes the experiment that follows. Training on Dataset B teaches this, as well as another novel pattern: earlier experiments themselves inspire new ones. When asked to draft new research, a model trained on Dataset A overemphasizes literature reviews and underutilizes prior experimental designs, while a model trained on Dataset B integrates both—producing a far better output.
Example - Dataset A vs. Dataset B
Third, the dimensionality issues that robotics models face could apply here, albeit at a lower level. Basically, since robots operate in 3D space, sometimes with many degrees of freedom, even massive datasets leave much of the state space uncovered, stunting the utility of ML optimization algorithms. As such, adequate data is the main bottleneck with robotics AI.
LLMs have fared much better in part because language is 1D, and while they don’t face robotics’ disastrous exponential explosion, pairwise interaction between tokens still scales quadratically with context length, creating the potential for a data deficit as context increases. The fact that long-context models lose accuracy particularly when the answer is buried in the middle of a long prompt (rather than the start or end) supports this.
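In concrete terms (simple counting, nothing model-specific), the number of query-key pairs attention must score grows quadratically with context length:

```python
# Self-attention scores every (query, key) pair, so the number of pairwise
# interactions a model must learn to handle grows quadratically with context.
for context_len in (4_000, 128_000, 1_000_000):
    print(f"{context_len:>9,} tokens -> {context_len**2:.1e} query-key pairs")
```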
If it’s the case that long-form workflow data is required to train superhuman models, then it’s likely that manual data collection will be required.
As mentioned, among the already minimal public long-form data, there are virtually zero workflows. AI 2027’s predictions rely on synthetic data, but little evidence or reasoning is offered for why this would be an adequate solution. Intuitively, since models cannot independently produce high-quality long-form work (that is, in fact, what we are trying to train them to do), they would require human guidance to even attempt it. But to maintain the efficiency of automated synthesis, that guidance must be uniformly applied across the synthesized data, which will ultimately fail to represent the dynamic permutations of real human memory and attention patterns. Any attempt to use synthetic generation will only produce counterproductive rigidity and uniformity. Empirically, recent work shows that even inserting 1% synthetic data into a long-context fine-tuning dataset causes measurable performance degradation.
Moreover, attempts to artificially lengthen training data by concatenating similar, non-redundant documents yielded minimal performance improvements, even with small-scale, open-source models. Intuitively, going back to the 3-token example, concatenating these phrases into pseudo-sentences based on semantic similarity wouldn’t permit the model to learn grammar either. Literary structure is particular; ham-fisted solutions don’t work. This fact applies here too—think about the complexity of refactoring thousands of lines of code, or presenting a legal case based on intricate relationships between a multitude of statutes.
So far, I’ve established all the pieces contributing to the data bottleneck. Nonetheless, the actual severity of the bottleneck may vary significantly based on a plethora of factors, as laid out in the questions below.
First, can data be developed from existing private stores? Are records from previous long projects well-organized enough that they can be easily appended together to form workflows? Is it necessary for these workflows to be roughly chronologically ordered or annotated, and if so, how much more difficult would it be to do those retroactively? Basically: is it enough to purchase and manipulate existing data, or is paying for new workflows (perhaps structured in a particular way) necessary and/or more efficient?
Second, how much data is required for adequate time horizon performance? Specifically, how many samples are required? Is it closer to the amount required during pre-training or post-training? And to what extent are algorithmic improvements expected to increase sample efficiency?[3] AI 2027 assumes that labs will pay 20,000 employees to record themselves performing long-horizon tasks. Depending on the answers to the questions posed, this could be more than enough, or not even close. Note that at a certain threshold, the bottleneck becomes willing and qualified employees, as opposed to money.
Third, how much does data quality matter? Would workflow data taken from third-rate companies be adequate? Or is there a certain level of objective success required to avoid degrading model performance? Unlike text, where it’s relatively easier to filter for gibberish, it might be much harder to evaluate long-form workflow quality. A fifth of new businesses fail within the first year, and the majority fail within a decade—unwittingly training on these processes seems undesirable, especially if you’re interested in creating a highly capable agent.
Fourth, how well will this data work for training longer time horizons? If it is necessary for the next level of agent, a superhuman AI researcher, to reason over 1-2 OOMs more tokens, would new, even longer workflows be required to train it? Are the cross-dependencies and signals learned over a career’s worth of information significantly different from those learned over a year-long project? Is synthetic data plausibly more useful and easier to produce at this scale? Or will this data bottleneck function as a continuously expanding barrier to capability development, perhaps even precluding fast take-off?
Fifth, how well will this data work for training over diverse domains? Frontier labs can more easily collect data on coding projects, but will this enable the resulting models to handle white-collar workflows? What about for science R&D, whose data collection seems substantially bottlenecked by slow experimental processes?
…And there are probably more that I’m missing. Basically, it seems like this bottleneck could add either a few years or several decades to the forecast, depending on the above factors and their interaction.
I built a Monte Carlo simulation tool that attempts to quantify the potential delay length through rough estimate answers to these questions. Link here & corresponding post (explaining the simulation variables, defaults, etc.) here.
Nonetheless, if you’re concerned about safety like I am, this is great news.
First, the bottleneck will slow down timelines, potentially significantly. Slower timelines mean more time to act. While manual data collection is burdensome, it is not impossible—and labs have every financial reason to push through. Frontier companies won’t halt their efforts; they’ll just have to proceed through (at least one) slower, more costly data-gathering phase.
If that adds a few years of delay, policy makers get a precious window in which to prepare, perhaps narrowly avoiding the political disaster outlined in AI 2027. If the delay is actually a few decades, then the odds of preemptively establishing adequate catastrophic risk and economic protection increase substantially.
Second, the data bottleneck itself provides a rare opportunity for meaningful governance. Data collection activities are concrete and observable—they serve as a visible friction point. If a frontier lab begins contracting to collect coding workflows, that’s a strong signal it’s aiming to automate AI research. If it starts licensing white-collar enterprise logs, this suggests employee replacement is on the list.
There exist routine regulatory justifications, like privacy or antitrust, that could be employed to target data collection activities. For example, California’s AB-2013 (effective January 2026) will require AI developers to publicly disclose the source and structure of their training data. Ideally, laws like this could be expanded to mandate transparency well before model deployment. Such disclosures would give the government a clearer picture of AI companies’ intentions and capabilities—potentially averting the kind of unilateral, destabilizing action described in AI 2027. Given this existing precedent, and the fact that the majority of frontier labs are headquartered in California, this governance approach seems particularly promising.
However, this bottleneck also introduces a new strategic concern. If models flop in the absence of massive investments in expensive, time-consuming data collection, then investors could get cold feet and pull out, potentially leading to a bubble burst and subsequent AI winter. In this case, we might be concerned about China taking the lead.
Unlike U.S. private investors, its state-owned financiers can commit to long-term investments that are costly in the short-term. The CCP surveillance state could collect and feed endless amounts of diverse data to its top labs without any of the contracting costs or privacy fights American companies might face—a major concern given that officials are already calling for broad integration of DeepSeek’s models within companies and government alike.
Race dynamics are broadly harmful irrespective of one’s “side”, but long-term Chinese AI supremacy is still something worth thinking about.
The Official AI 2027 definition – Superhuman coder (SC): An AI system for which the company could run with 5% of their compute budget 30x as many agents as they have human research engineers, each of which is on average accomplishing coding tasks involved in AI research (e.g. experiment implementation but not ideation/prioritization) at 30x the speed (i.e. the tasks take them 30x less time, not necessarily that they write or “think” at 30x the speed of humans) of the company’s best engineer. This includes being able to accomplish tasks that are in any human researchers’ area of expertise. Nikola and Eli estimate that the first SC will have at least 50th percentile frontier AI researcher “research taste” as well, but that isn’t required in the definition.
Fermi estimate (assuming 1 word ≈ 1.5 tokens), covering four input streams: literature, code, experiments, and human feedback.
- Literature: 100 papers × 30 pages/paper × 500 words/page × 20% actually read = 3.0 × 10⁵ words ≈ 4.5 × 10⁵ tokens
- Codebase in view: 20k-LOC window × 10 words/line = 2.0 × 10⁵ tokens
- Experimental logs: 5 small runs/day × 180 days × 1k words, plus 1 medium run every 3 days over 180 days × 5k words = 8.0 × 10⁵ words ≈ 1.2 × 10⁶ tokens
- Human feedback: same volumes as the experimental logs, at 0.5k and 1k words = 3.4 × 10⁵ words ≈ 5.1 × 10⁵ tokens
Total raw context ≈ 2.36 × 10⁶ tokens. Assuming half can be compressed or summarized, the required context window is ~1M tokens.
I thought humans were much more sample efficient than ML models, but maybe not? Interesting comment from Jose Miguel Cruz y Celis from the Chinchilla LW post:
I did some calculations with a bunch of assumptions and simplifications but here's a high estimate, back of the envelope calculation for the data and "tokens" a 30 year old human would have "trained" on: