Scaling laws tell us that the cross-entropy loss of a model improves predictably with more compute. However, the way this relates to real-world economic outcomes that people directly care about is non-obvious. Scaling Laws for Economic Impacts aims to bridge this gap by running human-uplift experiments on professionals where model training compute is randomized between participants.
The headline findings: each year of frontier model progress reduces professional task completion time by roughly 8%. Decomposing this, about 56% comes from increased training compute and 44% from algorithmic improvements. Incorporating these results into the famously pessimistic macroeconomic framework of Acemoglu (2024) implies that productivity growth from AI over the next decade rises from 0.5% to 20%, even under strict assumptions. But there's a puzzle—while raw model output quality scales with compute, human-AI collaborative output quality stays flat. Users seem to cap the realized gains from better models, regressing outputs toward their own baseline regardless of how capable the tool is.
Experiment Setup:
Concretely, over 500 consultants, data analysts, and managers were given professional tasks to complete with models ranging from Llama-2 to GPT-5 (or no AI assistance at all) in a pre-registered experiment. Participants were recruited through Prolific, but eligibility required at least one year of professional experience, salaries above $40,000, and passing a rigorous screening survey that filtered out roughly 90% of applicants. The final sample averaged over four years of experience.
Each participant completed a subset of nine tasks designed to simulate real workflows: revising financial expansion reports, conducting A/B test analyses, writing crisis management memos, evaluating vendor contracts, creating project Gantt charts, and more (full task descriptions in the appendix of the linked paper). Incentives were high-powered—$15 base pay per task, with an additional $15 bonus for submissions rated 5+ out of 7 by expert human graders, meaning quality directly doubled earnings.
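To make the incentive structure concrete, here is a minimal sketch of the payout scheme as described above (the function and variable names are mine, not the paper's):

```python
# Illustrative payout scheme: $15 base per task, $15 bonus for a grade of 5+/7.
def payout(grade: int) -> float:
    base, bonus = 15.0, 15.0
    return base + (bonus if grade >= 5 else 0.0)

def earnings_per_minute(grade: int, minutes: float) -> float:
    # The headline outcome in the results below combines speed and quality.
    return payout(grade) / minutes

print(payout(5) / payout(4))         # 2.0: hitting the bonus doubles pay
print(earnings_per_minute(5, 20.0))  # 1.5 dollars per minute
```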
Results:
First, the basic question: does AI help at all? Pooling across all models, workers with AI access earned 81% more per minute than control ($1.24 vs $0.69, p = 0.001) and produced higher quality output (+0.34 standard deviations, p < 0.001). Combining speed and quality—since bonuses doubled earnings for high-quality work—total earnings per minute increased by 146%. Roughly half of this came from working faster, half from hitting the quality bonus threshold more often.
But does this improve with better models? Using model release date as a proxy for overall capability, each year of frontier progress reduces task completion time by approximately 8% (p < 0.05). In dollar terms, this translates to roughly $14/hour in additional base earnings per year of model progress—or $26/hour when including quality bonuses. The relationship is log-linear: plot log time against release date and you get a (slightly noisy) downward slope.
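For readers who want the regression in concrete terms, here is a minimal sketch of that log-linear fit, assuming task-level data with illustrative column names (not the paper's actual variables):

```python
# Hypothetical sketch of the log-linear scaling regression (column names are mine).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("task_level_data.csv")  # placeholder path
# minutes: task completion time
# release_year: fractional year the assigned model was released
fit = smf.ols("np.log(minutes) ~ release_year", data=df).fit()
# A coefficient of about -0.08 on release_year corresponds to roughly
# an 8% reduction in completion time per year of frontier progress.
print(fit.params["release_year"])
```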
What's driving these gains—raw compute or algorithmic progress? To decompose this, I estimate scaling laws against both calendar time (which captures everything) and log training compute (which isolates pure scale). A 10x increase in compute alone is associated with a 5.9% reduction in task time. With compute growing roughly 6x per year during the sample period, this accounts for about 56% of the total 8% annual improvement. The remaining 44% comes from algorithmic progress over the same period.
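The decomposition itself is simple arithmetic from the two point estimates; a quick check:

```python
import math

time_reduction_per_10x = 0.059    # 5.9% per 10x of training compute
compute_growth_per_year = 6.0     # ~6x per year over the sample period
total_annual_reduction = 0.08     # 8% per calendar year

# Compute grows log10(6) ~ 0.78 orders of magnitude per year, so compute
# alone explains about 0.78 * 5.9% ~ 4.6 percentage points of the 8%.
compute_part = math.log10(compute_growth_per_year) * time_reduction_per_10x
print(compute_part / total_annual_reduction)  # ~0.57, i.e. ~56% from compute
```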
This is actually the second time I've derived economic scaling laws experimentally. In earlier work, I tested 300 professional translators across 1,800 tasks using the same design—randomized model assignment, high-powered incentives, expert grading. The results were strikingly similar: a 10x increase in compute reduced task time by 12.3% and increased earnings per minute by 16.1%. The relationship was again log-linear. That paper also found gains were heavily skewed toward lower-skilled translators, who saw 4x larger time reductions than high-skilled ones.
For non-agentic tasks, I collected both human-AI collaborative outputs and AI-only outputs (the model's zero-shot response to the same prompt, graded by the same human experts). This lets me compare three conditions: human alone, human + AI, and AI alone.
AI-only quality scales cleanly with compute: a 10x increase in training compute corresponds to a 0.51-point increase in grade (p < 0.01). The best models score above 6 out of 7—well beyond the unassisted human average of 3.5.
Human-AI quality does not scale at all. The coefficient is essentially zero (p ≈ 0.85). Whether participants used an early model or a frontier one, final output quality flatlined at around 4.3 out of 7.
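Under the same illustrative data layout as the sketch above, the contrast between conditions amounts to two regressions with very different slopes:

```python
# Hypothetical comparison of the two quality regressions (column names are mine).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("task_level_data.csv")  # placeholder path
# log10_compute: log10 of the assigned model's training FLOP
# grade_ai_only, grade_human_ai: expert grades (1-7) in each condition
ai_fit = smf.ols("grade_ai_only ~ log10_compute", data=df).fit()
hai_fit = smf.ols("grade_human_ai ~ log10_compute", data=df).fit()

# Per the results above: ~0.51 grade points per 10x compute for AI alone,
# versus a slope statistically indistinguishable from zero for human + AI.
print(ai_fit.params["log10_compute"], hai_fit.params["log10_compute"])
```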
What's happening? For weaker models, humans add value—they refine rough drafts and push quality up to ~4.3. For stronger models, humans subtract value—they take outputs that would have scored 5 or 6 and drag them back down to ~4.3. It looks like regression to the mean: humans can only improve outputs that are somewhat better than their own ability, and actively degrade outputs that exceed it. The exact mechanism is unknown, though, and this experiment wasn't designed to give much evidence on that question.
The Simple Macroeconomics of A(G)I
How do these results extrapolate to the broader economy? Acemoglu (2024) used a deliberately pessimistic framework (e.g., ignoring general equilibrium effects, R&D acceleration, and changes in the economy's task composition) to arrive at a famously conservative estimate of ~0.5% GDP growth from AI over the next decade. Here I show that even adopting the same framework, simply updating the productivity estimates Acemoglu uses (based on experiments with GPT-3.5 and GPT-4) to incorporate model scaling implies highly economically significant effects: roughly 20% productivity growth over the next decade.
The method applies Hulten's theorem: multiply the share of tasks exposed to AI (19.9%, from Eloundou et al.) by the average productivity boost on those tasks and by the labor share of costs (57%). Acemoglu used productivity estimates from early ChatGPT/GPT-4 experiments and treated capabilities as fixed. Using these experimental estimates instead, and allowing productivity to compound at 8% annually as models improve, yields roughly 20% productivity growth over the same period. A framework that by design ignores many of the main channels through which AI can raise economic growth now produces a dramatic number, just by incorporating scaling.
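As a rough illustration of the structure of that calculation (the parameterization here is mine and deliberately simplified, so it approximates rather than reproduces the paper's figures):

```python
# Hulten-style aggregation sketch; my assumptions, not the paper's exact method.
exposure_share = 0.199          # share of tasks exposed to AI (Eloundou et al.)
labor_share = 0.57              # labor share of costs
annual_time_reduction = 0.08    # per year of frontier model progress

# After a decade of compounding, exposed tasks take 0.92**10 ~ 43% of the
# original time, i.e. roughly a 130% productivity boost on those tasks.
time_factor = (1 - annual_time_reduction) ** 10
task_boost = 1 / time_factor - 1

aggregate_growth = exposure_share * labor_share * task_boost
print(aggregate_growth)  # ~0.15; same order as the paper's ~20%, which
                         # also incorporates quality gains
```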
There are of course many caveats to such an extrapolation, and indeed to the main experimental results themselves. These include (but are not limited to):
Tasks were short—30 minutes on average—and may not reflect the dynamics of multi-day projects where AI limitations compound differently.
Participants were recruited through Prolific; even with rigorous screening, they may not be representative of top professionals in these fields.
Models were accessed through standard chat interfaces with text-based I/O only—no tool use, no code execution, no agentic capabilities. This likely underestimates current AI performance on agentic tasks specifically, and possibly understates gains across the board.
The scaling laws also only capture training compute; they don't account for test-time compute scaling (chain-of-thought, search, etc.), which is increasingly where capability gains are coming from.
Next Steps:
Two extension projects are currently underway (with funding support from Coefficient Giving). First, I'm building ECONbench—a standing panel of consultants, data analysts, lawyers, and financial analysts who will complete standardized tasks each time a major model is released. The goal is near real-time tracking of economic productivity gains, rather than one-off experiments that are outdated by publication. Second, I plan to run these same scaling law experiments on longer-horizon tasks—specifically, multi-day ML R&D and scientific discovery workflows. If anyone has comments or suggestions on either project, please DM me or email me at ali.merali@yale.edu.