we run out of data in 2028-2031
Is running out of data still a problem? It sounds like we've already moved to a paradigm of creating new higher-quality data and not relying on random data from the internet. In some sense we ran out of data a while ago and progress hasn't stopped because we're just making more of it now.
I think I agree with this? "Most algo progress is data progress" "Yep. Still counts though."
In a way, this is like a large-scale reprise of the expert systems era: instead of paying experts to directly program their thinking as code, they provide numerous examples of their reasoning and process, formalized and tracked, which we then distill into models through behavioural cloning. This has updated me slightly towards longer AI timelines: the fact that we need this much effort to design extremely high-quality human trajectories and environments for frontier systems implies that they still lack the critical core of learning that an actual AGI must possess. Simply grinding to AGI by getting experts to exhaustively cover every possible bit of human knowledge and skill, and hand-coding (albeit with AI assistance) every single possible task into an RL gym, seems likely to be inordinately expensive and take a very long time, and seems unlikely to suddenly bootstrap to superintelligence.
I think this is a reasonable take. Here's the opposite hypothesis:
"What are you talking about? These companies are giant juggernauts that are building a huge pipeline that does the following: (1) Identify economically valuable skills that AIs are missing (2) Collect/construct training environments / data to train those skills (3) Include those environments in the next big training runs, so that future AIs are no longer missing those skills. Already this seems like the sort of economic engine that could just keep churning until basically the whole world has been transformed. Is it AGI? No, it's still massively less efficient than the human brain. But it might nevertheless automate most jobs within a decade or so, and then continue churning along, automating new jobs as they come up. AND that's not taking into account three additional important factors: (4) The AIs are already generalizing to unseen skills/tasks to some extent, e.g. Claude is getting better at Pokemon despite not having been trained on Pokemon. Thus there might be a sort of 'escape velocity' effect where, after the models get big enough and have been trained on enough diverse important tasks, they become able to do additional new tasks with less and less additional training, and eventually can just few-shot-learn them like humans. If this happens then they really are AGI in the relevant sense, while still being less data-efficient than humans in some sense. (5) The AIs are already accelerating coding to some extent. The aforementioned pipeline that does steps 1-3 repeatedly to gradually automate the economy? That pipeline itself is in the process of getting automated as we speak. If you like you can think of the resulting giant automated corporation as itself an AGI that learns pretty fast, perhaps even faster than humans (albeit still less efficiently than humans in some sense). (Faster than humans? Well yeah; consider how fast AIs have improved at math over the last two years as companies turned their efforts towards training math skills; then consider what's happening to agentic coding; compare to individual human mathmeticians and programmers, who take several times as long to cross the same skill range during school.) (6) Even if those previous two claims are wrong and the current paradigm just won't count as AGI, period, if AI R&D gets accelerated significantly then the new paradigms that are necessary should be a few years away rather than decades away. And it seems that the current paradigm might suffice to accelerate R&D significantly, even if it can't automate it completely.
Which of these two competing hypotheses is less wrong? I don't know, but I still have substantial weight on the second.
I wish there were some quantitative analysis attempting to distinguish the two. Questions I'd love to see quantitative answers to: How much would it cost to give every major job in the economy the treatment math and coding are currently getting? How much will that cost go down as AIs partially automate the pipeline? How much are AIs generalizing already? (This one is hard to answer because the companies are quiet about their training data.) Is the generalization radius increasing as models get smarter and are trained on more diverse stuff, or does it seem to be plateauing, or is it entirely a function of e.g. pretraining loss?
...
Huh, I wonder if this helps explain some of the failures of the agents in the AI Village. Maybe a bunch of these custom RL environments are buggy, or at least more buggy than the actual environments they are replicating, and so maybe the agents have learned to have a high prior that if you try to click on something and it doesn't work, it's a bug rather than user error. (Probably not though. Just an idea.)
But it might nevertheless automate most jobs within a decade or so, and then continue churning along, automating new jobs as they come up.
I think this is less likely than I did a year ago, and a lot of this is informed by Steve Newman's blog post on a project not being a bundle of tasks.
My median expectation is that by 2030 we get roughly a 1-3 year time horizon at 50% task success, and a 1-3 month horizon at 80% task success, which under this view is not enough to automate away managers and, depending on how much benchmarks diverge from reality, may not even be enough to automate away most regular workers. My biggest likely divergence from you is that I don't expect superexponential progress to come soon enough to bend these curves upward, because I put much less weight than you do on trend breaks producing superexponential progress in the next 5 years.
Here's the link for "a project is not a bundle of tasks."
I have nothing to say on the rest of your comment.
The paper you link is pretty interesting, but I don't think it supports the idea that general capabilities improvement comes more from data than from algorithms. Instead, what they show good evidence for is that a small part of algorithmic progress has consisted of finding algorithms that give a flat bump to performance, while most of it has come from finding algorithms that scale better with the available compute.
They run an ablation on a really tiny transformer (3M parameters) and show a 3.something-times improvement from adding modern algorithmic improvements. Meanwhile, the nanoGPT project has added basically the same improvements to GPT-2 (124M parameters) and gotten a 10.something-times improvement.
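As a back-of-envelope illustration of that scale-dependence (my own sketch, using the rough ~3× and ~10× figures above and an assumed power-law form, neither of which comes from the paper):

```python
import math

# Rough figures from the comparison above ("3.something" and "10.something" are approximate).
gain_small, params_small = 3.0, 3e6      # tiny-transformer ablation
gain_large, params_large = 10.0, 124e6   # nanoGPT-style run on GPT-2 (124M)

# If the gain from the same bag of improvements scaled as gain ~ N**alpha in
# parameter count N, these two points would imply:
alpha = math.log(gain_large / gain_small) / math.log(params_large / params_small)
print(f"implied exponent alpha ~ {alpha:.2f}")  # ~0.32: same tricks, bigger payoff at larger scale
```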
EDIT: That said, the "eyeball test" estimation of AI smarts might well be improving more because of better data and data-use than because of slightly lower loss on the Pile, I agree with that.
I don't think it helps support the idea that it's data and not algorithms
Agreed. Gundlach et al. are able to find and categorize specific algorithmic advances (non-data) that they claim explain 6,930× of gains, out of a total amount of gains estimated ("naively extrapolating") by Ho et al. of 22,000×. That is, they explain all but another factor of 3 with algorithms. Quoting from the paper:
Though our experiments do not claim to be exhaustive, we compare our findings with estimates from the literature. Namely, between 2012 to 2023, Ho et al. [2024] found a doubling time of 8 months, or 2.83× per year, for a total efficiency gain of 22,000×. In contrast, the growth rate of our CEG multiplier is approximately 2.23× annually, for a total of 6,930×, of which 2,700× (89%) is due to scale-dependent changes. This leaves a gap of 3.18× from our estimates, which could be from data selection, tokenizer advancements, or a long tail of innovations not captured in our analysis.
First off, making up for all but 3× is very good (frankly, I think too good and should be taken with a grain of salt). Second, naively reading this would imply data has contributed at most a factor of 3 over 11 years.
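Unpacking the quoted numbers (my own arithmetic; note the 89% figure appears to be a share of the gains measured in log space, not a ratio of multipliers):

```python
import math

total_ho = 22_000        # Ho et al.'s total efficiency gain, 2012-2023
total_ceg = 6_930        # gains Gundlach et al. attribute to specific algorithmic advances
scale_dependent = 2_700  # the portion of those gains that is scale-dependent

# Residual factor not covered by the catalogued advances (the quoted "3.18x gap"):
print(total_ho / total_ceg)                             # ~3.17

# The quoted "89%" share only works out if taken in log space:
print(math.log(scale_dependent) / math.log(total_ceg))  # ~0.89
```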
But I think the experiments in this paper use validation loss on a pretraining dataset, whereas performance on downstream tasks seems especially likely to be affected by better data (i.e., the 22,000× they are trying to account for might not even be influenced much by better data, as it too is based on loss).
(This comment is not meant to take a stand on the overall question of how much data vs. non-data algorithmic innovation has contributed, just the bearing of Gundlach et al. on this question.)
Coauthor of the paper here. It's helpful to hear how people have interpreted our work, though I want to try to clear up a few things quickly.
Firstly, we don't claim that data must make up the remaining gap, and we certainly are not claiming it explains capabilities improvements more broadly. We try to be pretty precise about the definition of algorithmic progress under the CEG framework, which is specifically computed using loss rather than downstream performance metrics (I agree, the relationship between these is not straightforward or well understood). We say explicitly that the CEG framework cannot capture a lot of important innovations for performance and efficiency (instruction fine-tuning, constitutional AI, parallelism, etc.). We also show highly counterintuitive, undesirable properties of the CEG framework which make interpretation quite difficult.
Speaking for myself and not my coauthors here: It's flattering you think our results are too good, though I'm not sure that was the intent of your comment :) I think the results would be interesting even if we over/undershot existing estimates by 10x. We find two important results, which are true regardless of our estimates' alignment with the literature: scale seems to be necessary for much of the perceived algorithmic gains, and scale-dependence makes the framework for measuring these gains behave strangely. Those two facts are worth considering in their own right, and are true for even modest differences in scaling exponents across architectures. It is reaffirming that we recover so much of the other estimates, but strictly speaking, we could have found way more or way less, and our results would still raise important points.
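As a toy illustration of that scale-dependence (a simplified sketch of the general compute-equivalent-gain idea, not the actual estimation procedure or coefficients from our paper): if two training recipes follow loss-versus-compute power laws with different exponents, the CEG you measure depends on the compute scale at which you compare them.

```python
def ceg(C, a_old, b_old, a_new, b_new):
    """Compute-equivalent gain of a 'new' recipe over an 'old' one at compute C,
    assuming each recipe's loss follows a power law L(C) = a * C**(-b).
    CEG is the multiplier m with a_old * (m*C)**(-b_old) == a_new * C**(-b_new)."""
    return ((a_old / a_new) * C ** (b_new - b_old)) ** (1.0 / b_old)

# Made-up coefficients purely for illustration: the new recipe has a slightly
# steeper exponent, so the measured gain grows with the scale of comparison.
for C in [1e18, 1e20, 1e22]:
    print(f"C = {C:.0e}  ->  CEG ~ {ceg(C, a_old=2.0, b_old=0.050, a_new=2.0, b_new=0.055):.0f}x")
```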
Lastly, I think our results don't point to "better data has made all the difference." They really point to "a lot of new, specific things lead to a lot of new, specific differences in capabilities, and it's hard to count those up using the same units." I think that's a really important (and difficult) direction for further research, which may shed light on how important data has been!
So this post, brought to you by Beren today, is about how a lot of claims about within-paradigm algorithmic progress are actually mostly about just getting better data, leading to a Flynn effect. The reason I'm mentioning this is that once we have to actually build new fabs and we run out of data in 2028-2031, progress will be slower than people expect (assuming we haven't reached AGI by then).
Gundlach et al. paper
Edit: I incorrectly interpreted what the paper's results actually were for the question of data vs non-data inputs to algorithmic progress.