Thanks for studying this!
I'm confused about figure 13. The red line does not look like a best fit to the blue data-points. I tried to eyeball the x/y location of the datapoints and fit a line and got a slope of ~0.3 rather than -0.03. What am I missing?
Thanks, that's a good point I should've made clearer. To clarify: the blue data points represent the average result per model (so each shown data point represents many participant scores). There are then two reasons why a line of best fit through these points may differ from the drawn line, which is the linear regression slope: (I) the models didn't all have exactly the same number of data points, so the regression weights the points differently, and (II) the regression slope also controls for the fact that some models randomly received a larger share of easier/harder task difficulties.
Thanks, that makes sense.
One way to help clarify the effect from (I) would be to add error bars to the individual data points. Presumably models with fewer data points would have wider error bars, and then it would make sense that they pull less hard on the regression.
Error bars would also be helpful more generally, to understand how much weight to give the results. In cases where you get a low p-value, I have some sense of this. But in cases like figure 13 where it's a null result, it's hard to tell whether that's strong evidence of an absence of effect, or whether there's just too little data and not much evidence either way.
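To make (I) concrete, here's a toy simulation (all numbers invented, not the paper's data) showing how a regression over all participant-level observations can differ from a line eyeballed through the model-level means when one model contributes far more data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration: five models with unequal participant counts;
# the true slope of log time on release year is -0.03.
release_year = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
n_per_model = np.array([5, 5, 5, 5, 100])  # one model has far more data

x = np.repeat(release_year, n_per_model)
y = -0.03 * x + rng.normal(0, 0.5, size=x.size)

# Slope fit on all participant-level points (what the regression uses):
slope_all = np.polyfit(x, y, 1)[0]

# Slope fit on the five model-level means (what eyeballing the dots gives):
means = np.array([y[x == v].mean() for v in release_year])
slope_means = np.polyfit(release_year, means, 1)[0]

print(slope_all, slope_means)  # the two slopes can differ noticeably
```

The participant-level fit effectively weights the heavy model 20x harder than the others, which is exactly the effect (I) describes.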
I skimmed the paper last week but I lost interest when I couldn't find out which models were used.
I think you should have titled or subtitled this "How Fast is my Centaur?" :-)
Interesting, exciting, and very valuable work.
I wish you'd labeled the points to help look for model family effects.
The point spreads are quite noisy — your conclusion that quality caps out looks very sensitive to a balance between two outlier points, one very high and the other very low: remove either of those and the story might change. Obviously there is enormous economic value in ensuring that low capability humans don't drag high capability model outputs down, whether that requires retraining the model or training the human.
Is that 20% prediction a total increase in GDP over the period, or an annualized rate of increase?
The method of estimating the hardware vs. software share seems biased toward exaggerating the hardware share: training compute and software efficiency both tend to increase over time, so a univariate regression on training compute will pick up some of the effect of software improvements. Subtracting that coefficient from the total will therefore underestimate the effect of software improvements / “algorithmic progress”.
From the paper:
Comparing these two estimates allows us to decompose the total gain into hardware and software components. By subtracting the compute effect (0.048) from the total effect (0.083), we isolate a residual of 0.035. This residual represents algorithmic progress, an economic catch-all for improvements in model architecture, software optimization, and user learning—effectively the Solow residual of AI production. In percentage terms, this decomposition suggests that compute scaling drives approximately 56% of the total reduction in time, while algorithmic advancements account for the remaining 44%.
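The suspected bias is easy to demonstrate in a toy simulation (all coefficients invented for illustration): when compute and software quality both trend upward over time, a univariate regression on compute absorbs part of the software effect.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: compute and software both trend up over calendar time.
n = 500
t = np.linspace(0, 5, n)                       # calendar years
log_compute = 0.8 * t + rng.normal(0, 0.1, n)  # compute grows over time
software = 0.5 * t + rng.normal(0, 0.1, n)     # software improves over time

# True model: both channels reduce log task time.
log_time = -0.05 * log_compute - 0.04 * software + rng.normal(0, 0.02, n)

# Univariate regression on compute alone overstates the compute effect,
# because compute is correlated with the omitted software variable.
b_uni = np.polyfit(log_compute, log_time, 1)[0]

# A multivariate regression separates the two channels.
X = np.column_stack([log_compute, software, np.ones(n)])
b_multi, *_ = np.linalg.lstsq(X, log_time, rcond=None)

print(b_uni, b_multi[0])  # b_uni is more negative than the true -0.05
```

This is standard omitted-variable bias; whether it materially changes the 56/44 split depends on how collinear compute and algorithmic progress were over the sample period.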
Scaling laws tell us that the cross-entropy loss of a model improves predictably with more compute. However, the way this relates to real-world economic outcomes that people directly care about is non-obvious. Scaling Laws for Economic Impacts aims to bridge this gap by running human-uplift experiments on professionals where model training compute is randomized between participants.
The headline findings: each year of frontier model progress reduces professional task completion time by roughly 8%. Decomposing this, about 56% comes from increased training compute and 44% from algorithmic improvements. Incorporating these results into the famously pessimistic macroeconomic framework of Acemoglu (2024) raises the implied productivity growth from AI over the next decade from 0.5% to 20%, even under strict assumptions. But there's a puzzle—while raw model output quality scales with compute, human-AI collaborative output quality stays flat. Users seem to cap the realized gains from better models, regressing outputs toward their own baseline regardless of how capable the tool is.
Experiment Setup:
Concretely, over 500 consultants, data analysts, and managers were given professional tasks to complete with models ranging from Llama-2 to GPT-5 (or no AI assistance at all) in a pre-registered experiment. Participants were recruited through Prolific, but eligibility required at least one year of professional experience, salaries above $40,000, and passing a rigorous screening survey that filtered out roughly 90% of applicants. The final sample averaged over four years of experience.
Each participant completed a subset of nine tasks designed to simulate real workflows: revising financial expansion reports, conducting A/B test analyses, writing crisis management memos, evaluating vendor contracts, creating project Gantt charts, and more (full task descriptions in the appendix of the linked paper). Incentives were high-powered—$15 base pay per task, with an additional $15 bonus for submissions rated 5+ out of 7 by expert human graders, meaning quality directly doubled earnings.
Results:
First, the basic question: does AI help at all? Pooling across all models, workers with AI access earned 81% more per minute than control ($1.24 vs $0.69, p = 0.001) and produced higher quality output (+0.34 standard deviations, p < 0.001). Combining speed and quality—since bonuses doubled earnings for high-quality work—total earnings per minute increased by 146%. Roughly half of this came from working faster, half from hitting the quality bonus threshold more often.
But does this improve with better models? Using model release date as a proxy for overall capability, each year of frontier progress reduces task completion time by approximately 8% (p < 0.05). In dollar terms, this translates to roughly $14/hour in additional base earnings per year of model progress—or $26/hour when including quality bonuses. The relationship is log-linear: plot log time against release date and you get a (slightly noisy) downward slope.
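As a sanity check on the log-linear framing: an 8% annual reduction corresponds to a slope of ln(1 - 0.08) ≈ -0.083 when regressing log task time on release year. The numbers below are illustrative, not the paper's data.

```python
import numpy as np

# Illustrative check: task time falling 8% per year of model progress.
# The baseline time is a made-up anchor.
years = np.arange(6, dtype=float)
baseline_minutes = 60.0
task_time = baseline_minutes * (1 - 0.08) ** years  # 8% faster each year

# Regressing log time on release year recovers the log-linear slope.
slope = np.polyfit(years, np.log(task_time), 1)[0]
print(slope)  # ≈ ln(0.92) ≈ -0.0834
```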
What's driving these gains—raw compute or algorithmic progress? To decompose this, I estimate scaling laws against both calendar time (which captures everything) and log training compute (which isolates pure scale). A 10x increase in compute alone is associated with a 5.9% reduction in task time. With compute growing roughly 6x per year during the sample period, this accounts for about 56% of the total 8% annual improvement. The remaining 44% comes from algorithmic progress during this period.
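The decomposition can be checked back-of-the-envelope from the rounded headline numbers (so the share comes out near, not exactly at, the quoted ~56%):

```python
import math

# Rounded headline numbers from the post.
reduction_per_10x = 0.059      # 10x compute -> ~5.9% less task time
compute_growth_per_year = 6.0  # compute grew ~6x per year in the sample
total_annual_reduction = 0.08  # ~8% less task time per year overall

# Annual effect of compute alone: 5.9% per order of magnitude of compute,
# times log10(6) orders of magnitude of compute per calendar year.
compute_effect = reduction_per_10x * math.log10(compute_growth_per_year)
compute_share = compute_effect / total_annual_reduction

print(compute_effect, compute_share)  # ~0.046 per year, ~57% share
```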
This is actually the second time I've derived economic scaling laws experimentally. In earlier work, I tested 300 professional translators across 1,800 tasks using the same design—randomized model assignment, high-powered incentives, expert grading. The results were strikingly similar: a 10x increase in compute reduced task time by 12.3% and increased earnings per minute by 16.1%. The relationship was again log-linear. That paper also found gains were heavily skewed toward lower-skilled translators, who saw 4x larger time reductions than high-skilled ones.
For non-agentic tasks, I collected both human-AI collaborative outputs and AI-only outputs (the model's zero-shot response to the same prompt, graded by the same human experts). This lets me compare three conditions: human alone, human + AI, and AI alone.
AI-only quality scales cleanly with compute: a 10x increase in training compute corresponds to a 0.51-point increase in grade (p < 0.01). The best models score above 6 out of 7—well beyond the unassisted human average of 3.5.
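For intuition, the reported coefficient implies a grade curve like the following. The baseline compute and baseline grade anchors here are hypothetical, not numbers from the paper.

```python
import math

# Illustrative grade curve: +0.51 grade points per 10x of training compute
# (coefficient from the post). The 1e23 FLOP anchor and 4.5 baseline grade
# are assumptions for illustration only.
def ai_grade(compute_flops, base_compute=1e23, base_grade=4.5):
    return base_grade + 0.51 * math.log10(compute_flops / base_compute)

print(ai_grade(1e25))  # 100x more compute adds ~1.02 grade points
```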
Human-AI quality does not scale at all. The coefficient is essentially zero (p ≈ 0.85). Whether participants used an early model or a frontier one, final output quality flatlined at around 4.3 out of 7.
What's happening? For weaker models, humans add value—they refine rough drafts and push quality up to ~4.3. For stronger models, humans subtract value—they take outputs that would have scored 5 or 6 and drag them back down to ~4.3. It looks like regression to the mean: humans can only improve outputs that are somewhat better than their own ability, and actively degrade outputs that exceed it. The exact mechanism is unknown, though; this experiment wasn't designed to provide much evidence on that question.
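One way to formalize this anchoring story (purely an illustrative assumption, not the paper's estimated mechanism) is a model where the final output closes most of the gap between the AI draft and the human's own baseline, in both directions:

```python
import numpy as np

# Toy anchoring model: final quality is pulled toward the human baseline
# regardless of draft quality. Both parameters below are assumptions.
human_baseline = 4.3  # roughly where collaborative quality flatlined
pull = 0.9            # hypothetical fraction of the gap closed

ai_draft_quality = np.array([3.0, 4.0, 5.0, 6.0, 6.5])
final_quality = ai_draft_quality + pull * (human_baseline - ai_draft_quality)

print(final_quality)  # weak drafts improve toward ~4.3; strong drafts degrade
```

Under this toy model, collaborative quality is nearly flat in draft quality, which would reproduce the near-zero scaling coefficient.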
The Simple Macroeconomics of A(G)I
How do these results extrapolate to the broader economy? Acemoglu (2024) famously used a very pessimistic framework (e.g. ignoring general equilibrium effects, R&D acceleration, and changes in the economy’s task composition) to arrive at a famously conservative estimate of ~0.5% GDP growth from AI over the next decade. Here we show that even adopting the same framework, updating the productivity estimates Acemoglu uses (based on experiments using GPT-3.5 or GPT-4) to incorporate model scaling effects yields highly economically significant productivity effects (20% productivity growth over the next decade).
The method applies Hulten's theorem: multiply the share of tasks exposed to AI (19.9%, from Eloundou et al.) by the average productivity boost on those tasks and by the labor share of costs (57%). Acemoglu then used productivity estimates from early ChatGPT/GPT-4 experiments and treated capabilities as fixed. Using these experimental estimates instead—and allowing productivity to compound at 8% annually as models improve—yields roughly 20% productivity growth over the same period. A framework that plausibly ignores many of the main channels through which AI can increase economic growth now produces a dramatic number, just by incorporating scaling.
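A minimal sketch of the arithmetic (the per-task boost below is a placeholder, not a number from the paper):

```python
# Hulten-style aggregation from the post; 'boost' is a hypothetical
# stand-in for the average productivity gain on exposed tasks.
task_share = 0.199   # share of tasks exposed to AI (Eloundou et al.)
labor_share = 0.57   # labor share of costs
boost = 0.25         # hypothetical per-task productivity gain

# Static effect, with capabilities held fixed (Acemoglu's approach).
static_effect = task_share * labor_share * boost

# Ten years of 8% annual model-driven improvement compounds to ~2.16x,
# which is what turns a modest static number into a much larger one.
compound = 1.08 ** 10

print(static_effect, compound)
```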
There are of course many caveats to such an extrapolation, and indeed to the main experimental results themselves. These include (but are not limited to):
Next Steps:
There are two extension projects currently being conducted (with help from funding by Coefficient Giving). First, I'm building ECONbench—a standing panel of consultants, data analysts, lawyers, and financial analysts who will complete standardized tasks each time a major model is released. The goal is near real-time tracking of economic productivity gains, rather than one-off experiments that are outdated by publication. Second, I plan to run these same scaling law experiments on longer-horizon tasks of specific interest—specifically, multi-day ML R&D and scientific discovery workflows. If you have comments or suggestions on either project, please DM me or email ali.merali@yale.edu.