Summary

In this post, we empirically test Epoch AI’s theoretical model of an upper bound on AI training run lengths. The model estimates an upper bound on training run time by assuming that the length of a training run is chosen to maximize FLOP/$ subject to a fixed deadline, and then calculating how long such a run would last. We first analyze the model’s assumptions, noting that two of them in particular are false in ways that suggest the model may be flawed. Empirically, we find that the observed lengths of past training runs are consistent with all of the upper bounds from the model except the shortest one. However, the data does not support the underlying model of how researchers optimize their training runs: training runs frequently use out-of-date hardware and do not seem to be optimized for FLOP/$. Correspondence with people associated with EleutherAI also indicates that they and other researchers do not generally optimize their runs for FLOP/$ and do not think much about delaying training to take advantage of better hardware. While we therefore conclude that the original model is flawed, its upper bounds remain plausible and agree with a crude upper bound that we derive from the data alone.
 

Acknowledgements: This post was written as part of an AI strategy research mentorship program which occurred during the spring of 2023 between researchers at Epoch AI and students from the Harvard AI Safety Team. Thanks to the Epoch team, and in particular to Pablo Villalobos for mentoring me as well as Jaime Sevilla, Matthew Barnett, Carl Guo, Kuhan Jeyapragasan, Robi Rahman, Fiona Pollack, and Eleni Shor for feedback and assistance.

Epistemic Status: Based on ~50 hours of work, >95% confidence that the model does not accurately describe past decisions about the lengths of training runs, and 85% confidence that this will remain the case going forward.

 

Introduction

In light of recent political attention to the issues posed by advanced AI systems, proposals for how we might monitor and regulate such systems have become more concrete. For example, in “What Does It Take To Catch A Chinchilla?”, Yonadav Shavit proposes a method for detecting large training runs so that further regulations can be enforced on them. Similarly, technical AI safety researchers, such as those at ARC Evals, have been developing techniques to monitor such models.

An important question for how well methods like these will work is how long future training runs will be. Some alignment and oversight techniques, such as using interpretability tools to monitor a model’s internals or exposing it to “honeypots” during training, are likely to work better if training takes longer: all else equal, longer training runs give researchers more clock time to use these tools on the model. Training run lengths matter even more for the effectiveness of political regulation of powerful AI systems. While researchers can pause even a short training run to deploy their alignment tools, short training runs may pose a larger problem for certain kinds of regulatory oversight. For example, if a regulatory regime involved inspectors periodically checking in person on datacenters, they might struggle to “catch” a sufficiently fast training run that violates the regulation. Thus, estimating how long future training runs will be could help us craft better proposals for technical and political oversight of AI.

While estimating how long future training runs will be in general is difficult, deriving an upper bound for their length may be more tractable. In particular, in August 2022, Epoch AI published a theoretical model of machine learning training run lengths to estimate an upper bound for their length. The model assumes that researchers training an AI system must deploy the model by a fixed deadline T, and that they begin training at the time S* which maximizes the FLOP/$ of the training run. Given a fixed budget, this is equivalent in the model to maximizing the compute used in training. While in practice researchers also optimize for other objectives, such as training the model as fast as possible, the assumption that they optimize only for FLOP/$ can plausibly be used to derive an upper bound on training time: if researchers were to put more weight on getting the model out faster, they would choose a strictly shorter training time than under this model. The model also assumes that the quality (i.e. FLOP/$/s) of the hardware available for researchers to buy at time t increases according to an exponential function of the form H(t) = H0e^(gt), where H0 is the FLOP/$/s of the best hardware available at time t = 0 and g is a parameter describing the rate of hardware improvement. Overall, if the researchers purchase the most efficient hardware available when they start the training run at time S, the training run will have a FLOP/$ of F = (T - S) * H(S).

As a result, the researchers face a central tradeoff. Starting the training run later lets them buy hardware which is more cost-efficient, but reduces the amount of time they can use that hardware to train their model before the fixed deadline. Analytically solving this optimization problem by setting the derivative of F with respect to S to 0 leads to the conclusion that the length of the training run, T - S*, should equal 1/g, making S* = T - 1/g.
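For completeness, here is the first-order condition behind that conclusion, written out in the notation above (the algebra is ours, but the result matches the original post):

$$\frac{dF}{dS} = \frac{d}{dS}\Big[(T - S)\,H_0 e^{gS}\Big] = H_0 e^{gS}\big[g(T - S) - 1\big] = 0 \;\Longrightarrow\; T - S^* = \frac{1}{g}.$$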

In some of their other work, researchers from Epoch estimated the rate of progress in hardware efficiency; plugging in that estimate yields g = 0.281 per year. In turn, plugging this value into the most basic form of the model predicts that no training run should be longer than 42.6 months.

However, researchers may consider other factors in optimizing their training runs. In particular, if researchers’ budget is fixed, then training the model with the highest FLOP/$ is equivalent to training the biggest possible model. But if investment grows over time and researchers want to train the biggest model possible by the deadline, they should (all else equal) delay beginning training in order to increase their budget. Similarly, if the algorithmic efficiency of training techniques improves over time, researchers should likewise delay training to take advantage of the most efficient algorithms. Modeling algorithmic efficiency and budget as smoothly growing exponential functions of time in the same way as hardware quality, Epoch extends the model in various ways. The results are summarized in the table below, with the estimates including algorithmic improvements updated with Epoch’s most recent estimate of algorithmic progress.

| Scenario | Upper Bound |
| --- | --- |
| Hardware improvements | 42.6 months |
| Hardware + algorithmic improvements | 9.96 months |
| Hardware improvements + rising investments | 9.12 months |
| Hardware improvements + rising investments + algorithmic improvements | 5.4 months |
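As a rough sanity check on these numbers, the sketch below reproduces them numerically under the assumption (ours, but consistent with the table) that each extension multiplies another exponential into the objective, so that the optimal training length is 1 divided by the sum of the growth rates. The algorithmic-progress and investment growth rates below are back-inferred from the table rather than taken directly from Epoch's estimates, and small discrepancies with the table reflect rounding.

```python
import numpy as np

T = 10.0         # arbitrary deadline in years; the optimal *length* does not depend on T
g_hw = 0.281     # hardware FLOP/$/s growth rate (per year), from Epoch's estimate
g_algo = 0.92    # illustrative algorithmic-progress rate back-inferred from the table
g_inv = 1.03     # illustrative investment growth rate back-inferred from the table

def optimal_length_months(growth_rates, T=T, n=200_000):
    """Numerically maximize F(S) = (T - S) * exp(sum(g) * S) over start times S."""
    g = sum(growth_rates)
    S = np.linspace(0.0, T, n)
    F = (T - S) * np.exp(g * S)
    return 12 * (T - S[np.argmax(F)])  # optimal training length, in months

for label, gs in [("hardware only", [g_hw]),
                  ("hardware + algorithmic", [g_hw, g_algo]),
                  ("hardware + investment", [g_hw, g_inv]),
                  ("all three", [g_hw, g_algo, g_inv])]:
    print(f"{label}: {optimal_length_months(gs):.1f} months (analytic 1/g: {12 / sum(gs):.1f})")
```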


Analysis of Key Assumptions

Before discussing our empirical tests, we will analyze how plausible and important the model's assumptions are. Several of them are false. This is of course true of all models, and some of these false assumptions seem unlikely to matter much for accurately predicting the upper bound. However, at least two of them may pose substantial problems for the model. While these problems are briefly addressed in the original post, we will nevertheless discuss them here.

Firstly, hardware progress is not smooth and continuous; it advances in discontinuous jumps when new hardware is released. Depending on exactly how long training runs are and how often hardware is released, this might not matter. However, most training runs are not long enough to overlap with the release of new chips, which happens approximately every 1-2 years. In other words, most training runs are short enough that even if researchers delayed the start of the run while keeping the deadline the same, they would not be able to use better chips. Thus, in most cases delaying the start of training yields no benefit in hardware quality, so most training runs do not face the tradeoff between length and hardware quality that the model implies. We will discuss this issue more in the empirical section.

Secondly, when companies buy GPUs and use them for a training run, they can continue to use or rent them out after the run finishes. Sometimes they may also rent the GPUs rather than buying them in the first place, though this is less common. Either way, this means that buying the hardware used in a training run very close to the training deadline is much less costly than the model implies, since the hardware remains valuable after training that model finishes. In contrast, the model only considers how the GPUs are used while training the particular model in question, ignoring how they can be repurposed afterwards. Indeed, if we take this into account and assume researchers can rent hardware in arbitrary quantities for arbitrary periods of time, then the model suggests that it will always be optimal for them to wait until the last possible moment to begin training and then spend their entire budget renting a very large amount of the best available hardware. This is of course practically impossible, but it illustrates how much this assumption matters for the model.
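A back-of-the-envelope way to see why (our restatement, under the added assumption that the per-second rental price of a GPU is roughly proportional to its purchase price): renting N GPUs for a duration D yields

$$\frac{\text{FLOP}}{\$} \;\propto\; \frac{N \cdot D \cdot \text{(FLOP/s per GPU)}}{N \cdot D \cdot \text{(rental cost per GPU-second)}} \;\propto\; H(S),$$

which does not depend on D at all and is strictly increasing in S, so the nominal optimum is to start as late as possible and rent as much of the newest hardware as the budget allows.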

While these two assumptions seem to pose the greatest problems for the model, it makes a few other assumptions that are also false, though probably less important. Firstly, training is not infinitely parallelizable, so there is a limit to how much FLOP/s can be effectively utilized during training. This means that decreases in training length cannot always be made up for by buying more hardware, and is part of the reason why the aforementioned scheme of waiting until the last possible moment to rent a large amount of hardware would not work in reality. Secondly, as previously discussed, researchers are not simply trying to maximize the FLOP/$ of a training run given a fixed deadline. Instead, they have to trade off a variety of objectives, including producing the model as quickly as possible, producing it as cheaply as possible, and producing a model that is as capable as possible. While considering only FLOP/$ can plausibly yield a reasonable upper bound for the reasons discussed previously, it may make the upper bound too conservative or otherwise distort the model. Overall, our prior going into examining the data was that the model was likely to be inaccurate due to these considerations.

 

Empirical Testing

In order to test the model, we synthesized several of Epoch’s pre-existing datasets to compile a unified dataset of training runs. The dataset initially contained 746 training runs, but after filtering for entries with the necessary information and editing and augmenting the data, only 95 entries remained for use in most of the following graphs. To briefly summarize, most entries had to be removed because they did not record the length of the training run. For some of the remaining entries there were issues with how the training FLOP and the cost of that compute were estimated, but we were able to salvage most of them by recomputing the relevant figures from publicly available data about the price and performance of the GPUs used. More details about this process are in the Appendix.

Direct Upper Bound

Using the dataset, we tested the model in several ways. The first question we asked was which of the upper bounds derived from the theoretical model actually held up. We found that the longest training run in our dataset clocked in at 4320 hours, or roughly 6 months. Thus, while the most aggressive upper bound of 5.4 months (from taking into account hardware progress, algorithmic progress, and investment growth) was falsified, the data was consistent with the other upper bounds.

Testing the Underlying Model

However, we also wanted to test whether the data supported the model's assumptions about how researchers optimize the length of their training runs. Firstly, the model assumes that researchers generally or at least often try to use the most cost-efficient, up-to-date hardware. If they didn’t, then even ignoring the fact that hardware improvements aren’t smooth, delaying training would not necessarily result in better hardware efficiency. But when we plot the relationship between the year that a training run took place and the year that the GPU used for that training run was released, we find that the correlation is not strong, as many training runs use out-of-date GPUs. Indeed, we find that the average training run used a GPU that was 2.54 years out of date, incurring a ~2x FLOP/$/s penalty due to not using the best available hardware. Similarly, the relationship between the FLOP/$/s of the hardware used in a training run and the time at which it was started is positive but relatively noisy, suggesting that researchers use neither the latest nor the best hardware available to them. 
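(For reference, the hardware model above implies a penalty of about $e^{0.281 \times 2.54} \approx 2.04$, i.e. roughly 2x, for hardware that is 2.54 years out of date.)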

In addition to predicting that training runs should generally use up-to-date and high-performing GPUs, the model implicitly makes several other predictions about the data. Firstly, if training run duration is often picked to maximize the FLOP/$ of the training run, then there should not be a very strong relationship between training run length and FLOP/$, as such a relationship would suggest that researchers could in general increase or decrease FLOP/$ by picking different training run lengths. However, we observe a strong correlation between training length and FLOP/$. This is not really surprising given that the FLOP/$ of a training run is the FLOP/$/s of the hardware times the length, and the lengths of training runs in the dataset span several orders of magnitude.

Additionally, according to the model, researchers centrally face a tradeoff between training run length and hardware quality. If the model is correct, we should therefore observe a negative relationship between training run length and FLOP/$/s, at least within each year. However, there is no relationship between these two variables in the full dataset, nor in any individual year except 2016, for which there are only 4 data points. Thus, it does not seem like researchers in fact face a substantial tradeoff between the length of a training run and the FLOP/$/s of the associated hardware. This is consistent with the fact that training runs do not generally use the most up-to-date hardware: if researchers are consistently using old hardware, then the limiting factor on hardware quality is not the time at which they begin training.

Finally, we hypothesized that while many training runs are not optimized for FLOP/$, the model might better fit more expensive and/or longer runs. Yet we observe essentially the same patterns in the subsets of the data containing only the training runs lasting longer than 100 hours and only the training runs using more than 10^20 FLOP. In particular, the GPUs used in these runs are on average 2.53 and 3 years out of date respectively, and the correlations between training run length and FLOP/$ remain very high (R^2 = 0.85 and 0.862 respectively). Thus, even when the data is restricted to the largest and longest runs, these very high correlations suggest that researchers could still consistently achieve higher FLOP/$ by training their models for even longer, implying that they are not optimizing very much for FLOP/$.
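For concreteness, a sketch of this subset analysis is below. It assumes a dataframe with hypothetical column names ("length_hours", "flop", "cost_usd", "run_year", "gpu_release_year") and a hypothetical file name; the actual dataset differs, and the exact specification of the reported R^2 values (e.g. whether they are computed on log scales) may differ as well.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("training_runs.csv")  # hypothetical file and column names

for label, subset in [("runs > 100 hours", df[df["length_hours"] > 100]),
                      ("runs > 1e20 FLOP", df[df["flop"] > 1e20])]:
    gpu_age = (subset["run_year"] - subset["gpu_release_year"]).mean()
    # correlation between log training length and log FLOP/$ within the subset
    x = np.log10(subset["length_hours"])
    y = np.log10(subset["flop"] / subset["cost_usd"])
    r2 = np.corrcoef(x, y)[0, 1] ** 2
    print(f"{label}: mean GPU age = {gpu_age:.2f} years, "
          f"R^2(log length, log FLOP/$) = {r2:.2f}")
```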

Evidence From EleutherAI Discord

In addition to examining the Epoch dataset, we also emailed many AI researchers to ask about their practices with regard to training run length optimization. Many did not respond, and many who did said that they could not reveal the relevant information. However, we did receive more information from people on the EleutherAI Discord. According to them, hardware and software progress are sufficiently hard to predict that researchers do not optimize around them in the way the model implies. They note that “large hardware jumps” make it hard to forecast hardware improvements, suggesting that the model’s assumption of smooth hardware progress makes it deviate from the decisions of actual researchers. In particular, they claim that GPT-4’s training time was lengthened by OpenAI’s decision to change the hardware used in training midway through the process, suggesting that OpenAI was not able to factor future hardware improvements into its decisions about when to train GPT-4. The EleutherAI folks also wrote that unpredicted algorithmic improvements such as RLHF and Constitutional AI can similarly lengthen training run times, as researchers may decide to switch over to the new methods during training.

Naive Empirical Estimate

While it provides some evidence against the original model, the dataset does enable us to come up with more direct, empirical estimates of an upper bound on training run lengths. One way of doing so is simply to take the longest run in our dataset (6 months) and hypothesize that it is a soft upper bound. While this method is crude, the bound seems plausible: that training run is substantially longer than any other run in the dataset and has not been surpassed since it occurred in 2019, suggesting that longer training runs are at least quite rare.

Another way to empirically derive an upper bound on training run length is to extrapolate the trend in the length of the longest training run in any given year into the future. While the original model implies that the upper bound should be constant over time, this seems plausibly false. On average, there is no relationship between the year a training run was conducted and its length. However, if we instead look at the relationship between the year and the length of the longest training run conducted in that year, we see a clear though noisy positive relationship. This may be because the variance in the length of training runs in our dataset generally increases over time, making very long runs more likely in later years. It may also be an artifact of the dataset: we have many more datapoints from 2019-2022 than from 2013-2018, which could make the observed maximum length of training runs in later years larger than in earlier years even if the underlying distribution has not changed over time. Naively extrapolating this trend suggests that the longest training run in 2030 will be 4758 hours, or ~6.6 months, though this number should be taken with a generous sprinkling of salt. While these empirical methods are highly imprecise, they both converge on a weak upper bound of ~6 months for training runs conducted before 2030.
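A sketch of this extrapolation is below, again with hypothetical file and column names; the exact fit behind the 4758-hour figure may differ (for example, in which years are included).

```python
import numpy as np
import pandas as pd

df = pd.read_csv("training_runs.csv")                   # hypothetical file and column names
longest = df.groupby("run_year")["length_hours"].max()  # longest run observed in each year

# simple linear trend in (year, longest run length), extrapolated to 2030
slope, intercept = np.polyfit(longest.index, longest.values, 1)
print(f"Extrapolated longest run in 2030: {slope * 2030 + intercept:.0f} hours")
```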



Conclusion

In this post, we have described Epoch AI’s theoretical model of an upper bound on training run lengths and have presented three main sources of evidence which point to the original model being flawed: some of its key assumptions seem importantly incorrect, it implicitly fails to predict several empirical observations (e.g. that training runs, including large and expensive ones, tend to use fairly out-of-date GPUs), and people associated with EleutherAI claim that it does not accurately describe their or other researchers’ processes for optimizing their training runs. While the model’s final predictions are mostly consistent with the data, this is true of a range of plausible models and thus constitutes relatively weak evidence for the model. Weighing all of the evidence together, we conclude with high confidence (>95%) that the model does not accurately describe the way in which researchers have historically chosen the length of training runs, and with somewhat lower but still high confidence (85%) that it will remain a significantly inaccurate model of their decisions in the future.

It remains valuable to estimate an upper bound on training run lengths for the reasons discussed in the introduction to this post, and we have provided a crude but empirically grounded alternative estimate of ~6 months for runs before 2030. However, many of the same facts which posed problems for the original model may make it difficult to develop an alternative model grounded in microeconomic theory. In particular, the discontinuous and somewhat inconsistent nature of both hardware and algorithmic progress makes it difficult for researchers to optimize their training run lengths with them in mind, especially in a way that could be described by a relatively simple mathematical model. Additionally, even if researchers could accurately model these phenomena and optimize around them, doing so would only yield a Pareto frontier for the length and cost of training runs; to actually arrive at an upper bound, we would also need to factor in researchers’ preferences with respect to that frontier. Thus, this post also provides some evidence that developing accurate theoretical models of training run lengths is somewhat intractable.

 

Appendix (data)

After synthesizing the pre-existing datasets, we first discarded the datapoints for which information about the length of the associated training run was unavailable. Then, in order to use a consistent methodology for estimating training run costs and to include data about the release year of the GPUs used in training runs, we augmented this pared-down dataset with information about which GPU was used for each training run. Datapoints for which this could not be determined were discarded, but we were able to find the GPU associated with almost all of the training runs. However, many entries still lacked estimates for the cost of the compute used during training. Moreover, the estimates that were in the dataset were calculated by multiplying the cost of buying the GPUs used in the training run by the fraction of a two-year period which elapsed during training (i.e., amortizing the purchase price over an assumed two-year useful life), under the assumption that the GPUs could be rented at market rate or used elsewhere during the remaining period. As we discussed, this is a reasonable assumption. However, the Epoch model defines the cost of the hardware used in a training run as the full cost of purchasing the GPUs used to train the model, with no discount for the prospect of renting them out or using them elsewhere after training. Thus, these estimates would not have worked for the purpose of testing the model.

The cost of the compute used in the training runs was recalculated in the following manner. Firstly, we used Google to find the price of the GPU model used to train the system. We used this source for many of these estimates (particularly for the NVIDIA A100 and V100 GPUs, which made up the majority of the GPUs in the dataset), otherwise generally using the MSRP of the model as an estimate for its price. We then multiplied the price of each GPU by the number of GPUs used in training in order to determine the total compute cost. This method prevented us from estimating the price of runs which used TPUs, as TPUs are rented via the cloud rather than bought. Thus, while TPU entries are included in statistics/graphs that only reference training run lengths, they are not included in those that involve FLOP/$ or FLOP/$/s.

While most entries originally lacked estimates for the cost of the compute used in training, most of them did have information about the amount of FLOP used. Regardless, once we added the information about which GPU was used in each run, it was relatively straightforward to recalculate the FLOP used in training. In order to standardize these estimates, we exclusively used Epoch’s GPU-time method to calculate the training FLOP (i.e. multiplying the training time by the peak FLOP/s of the GPUs and by a constant utilization factor of 0.3), using estimates for the peak FLOP/s of the GPUs identical to those in the calculator on the Epoch website (except for NVIDIA V100 GPUs, which appear to attain 125 teraFLOP/s when used for deep learning but are listed as attaining 15 teraFLOP/s on the calculator). When we estimated the FLOP in this manner, some of our new estimates disagreed with the old ones; in such cases, we largely ended up using the new estimates, though the disagreements were generally not very large (within an order of magnitude).
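A sketch of this recomputation with purely illustrative numbers is below; the GPU price, cluster size, and run length are assumptions for the sake of the example, not values from the dataset.

```python
SECONDS_PER_HOUR = 3600
UTILIZATION = 0.3              # constant utilization factor from the GPU-time method

gpu_price_usd = 10_000         # assumed purchase price of one GPU (illustrative)
peak_flops = 125e12            # V100 deep-learning peak of 125 teraFLOP/s, as noted above
n_gpus = 512                   # hypothetical cluster size
training_hours = 720           # hypothetical one-month run

compute_cost = gpu_price_usd * n_gpus                 # cost = GPU price x number of GPUs
training_flop = (training_hours * SECONDS_PER_HOUR    # FLOP = time x peak FLOP/s
                 * peak_flops * n_gpus * UTILIZATION) #        x number of GPUs x utilization
print(f"cost ≈ ${compute_cost:,}, FLOP ≈ {training_flop:.2e}, "
      f"FLOP/$ ≈ {training_flop / compute_cost:.2e}")
```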


 

Citations

Yonadav Shavit, “What Does It Take To Catch A Chinchilla?” https://arxiv.org/abs/2303.11341

Epoch AI, “The Longest Training Run,” https://epochai.org/blog/the-longest-training-run

Epoch AI, “Estimating Training Compute,” https://epochai.org/blog/estimating-training-compute

Epoch AI, “Trends in GPU Price Performance,” https://epochai.org/blog/trends-in-gpu-price-performance


 
