Summary: We collected a database of notable ML models and their training dataset sizes. We use this database to find historical growth trends in dataset size for different domains, particularly language and vision.
(OOM) (95% CI)
Table 1: Summary of trends for each domain. Scale is the maximum and minimum observed dataset size, and yearly growth is the slope of the best exponential fit (and 95% CI).
Data is one of the main ingredients of Machine Learning (ML) models. To understand the progress of ML over time, it is necessary to understand the evolution of training datasets.
In this document, we characterise the historical trends in training dataset size for different domains, using a custom dataset of ML models. In the Methods section, we explain our method for quantifying training dataset size, as well as the inclusion criteria for our database. In the Dataset size trends section, we present the trends in dataset size for different domains.
There is no canonical way to quantify the amount of data in a given dataset. Below are some ways of quantifying dataset size that we have considered, each with its own problems:
Ultimately, we decided to go with the third approach. Our definition of training dataset size of a model is the number of unique data points seen during training, where a data point is a concrete unit specific for each domain. Additional details about our decision are available in our document Measuring dataset size in Machine Learning.
A very important consequence of our definition is that training dataset sizes are not directly comparable across domains, since each one has a different unit. Table 2 shows each of the domains and their corresponding data points.
Table 2: Definition of a data point for each domain
We have compiled a database of over 200 ML models and their training dataset sizes. Given the overwhelming amount of papers in ML, we had to narrow down our search using a series of inclusion criteria. To be included, a model had to satisfy one of these:
We categorise each model into a domain depending on the task it was designed to solve. Table 3 shows summary information about the different domains. Once a system is categorised, we take the dataset size from the paper and transform it to the appropriate units. This often entails approximations, since there might not be a fixed conversion factor between our units and those of the paper.
Table 3: Dataset domains
The evolution of vision datasets has been greatly influenced by MNIST and ImageNet from 1998 to 2016. Around a third of all the models over this period were trained on those datasets. There was a growth of 0.11 OOMs/year (orders of magnitude per year), and the largest dataset was ImageNet with ~1M images. However, starting from around 2016, a few very large datasets have appeared, the largest of them being JFT-3B with a size of 3B images. It seems too early to tell whether this will change the rate of growth in future years.
The gap between 1e6 and 1e8 images is somewhat puzzling, but we expect it to be an artefact of small sample size.
Figure 2: Evolution of vision datasets. A significant number of models is concentrated near 6e4 and 1e6, which are the sizes of MNIST and ImageNet, respectively.
Training datasets for language models have grown by 0.23 OOMs/year since 1990, for a total growth of 7 OOMs between 1990 and 2022. The data suggests that this growth has been faster since 2014, but given the small sample size we don’t feel confident in this. The biggest language dataset observed so far was the FLAN dataset at 1.87e12 words.
It’s remarkable that up until 2014 there were notable models trained on very little data, such as Deep Belief Networks (trained on less than 2e5 words), but after 2016 virtually all notable models have been trained on more than 1e8 words. This might reflect the adoption of more efficient architectures such as Transformers that allow training on much more data.
Figure 3: Evolution of language datasets
We have very little data on the rest of the domains, so our conclusions are extremely uncertain. The four other domains are shown in Figure 4.
Both Games and Speech models show a pattern similar to Language and Vision, with all models after 2014-2016 using large amounts of data.
Most of our recommendation models are concentrated on a single dataset and time period. This is because they were competitors in the Netflix Prize competition.
Figure 4: Trends for Recommendation (top left), Speech (top right), Drawing (bottom left) and Games (bottom right)
We collected a database of ML models and their training dataset sizes and estimated their growth rates in several domains. While all domains show increasing size over time, our sample size is too small to provide any meaningful conclusion for all domains except vision and language.
In the domains of language, vision and speech, there we see evidence of a transition after 2014-2015. In this period, we see the appearance of datasets that are orders of magnitude larger than what had been commonplace over the previous decade.
Together with the historical trends in compute usage and model size, these trends might be useful for understanding the past progress in Machine Learning, and thus make better predictions about the field going forward.
The supply of vision data is nearly unbounded, but the supply of quality human text data is much more like a finite resource - and one for which we have probably passed the point of peak supply. The nominal supply of text may continue to expand, but it's increasingly auto-generated.
IL's comment has a BOTEC arguing that video data isn't that unbounded either (I think the 1% usefulness assumption is way too low but even bumping it up to 100% doesn't really change the conclusion that much).