The longest training run

Tamay; Owen D; anson.ho

Name suggestion: "The Craig Venter Principle". Back in '98, the Human Genome Project was scheduled to finish sequencing the first full human genome in another 5 years (having started in 1990). Venter started a company to do it in two years with more modern tech (specifically shotgun sequencing). That basically forced the HGP to also switch to shotgun sequencing in order to avoid public embarrassment, and the two projects finished a first draft sequence at basically the same time.

NunoSempere (3y)

Why publish this publicly? Seems like it would improve optimality of training runs?

Tamay (3y)

Good question. Some thoughts on why do this:

1. Our results suggest we won't be caught off-guard by highly capable models that were trained for years in secret, which seems strategically relevant for those concerned with risks.
2. We looked at whether there was any 'alpha' in these results by investigating the training durations of ML training runs, and found that models are typically trained for durations that aren't far off from what our analysis suggests might be optimal (see a snapshot of the data here).
3. It independently seems highly likely that large training runs would already be optimized in this dimension, which further suggests that this has little to no action-relevance for advancing the frontier.

Yitz (3y)

Thanks for thinking it over, and I agree with your assessment that this is better as public knowledge than private :)

NunoSempere (3y)

Thanks Tamay!

Jonathan_Graehl (3y)

This is good thinking. Breaking out of your framework: trainings are routinely checkpointed periodically to disk (in case of crash) and can be resumed - even across algorithmic improvements in the learning method. So some trainings will effectively be maintained through upgrades. I'd say trainings are short mostly because we haven't converged on the best model architectures and because of publication incentives. IMO benefitting from previous trainings of an evolving architecture will feature in published work over the next decade.
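A minimal sketch of the checkpoint-and-resume pattern described above, using only the Python standard library. The toy objective, the JSON state file, the `train` function, and its `crash_at` parameter are all hypothetical stand-ins for illustration, not anyone's actual training setup.

```python
import json
import os
import tempfile


def train(total_steps, ckpt_path, crash_at=None):
    """Toy 'training' loop: minimise (w - 3)^2 by gradient descent.

    State is written to ckpt_path after every step, so a later call
    with the same path picks up where an interrupted call stopped.
    """
    # Resume from the checkpoint if one exists, otherwise start fresh.
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "w": 0.0}

    lr = 0.1  # hypothetical learning rate
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated node failure")
        grad = 2.0 * (state["w"] - 3.0)
        state["w"] -= lr * grad
        state["step"] += 1
        # Persist progress so a crash loses at most the current step.
        with open(ckpt_path, "w") as f:
            json.dump(state, f)
    return state
```

In this toy case a run that crashes at step 5 and is then restarted ends in the same state as an uninterrupted run. Real runs that change hardware or the optimizer mid-way do not get that guarantee.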

jacob_cannell (3y)

Was going to make nearly the same comment, so I'll just add to yours: an existing training run can benefit from hardware/software upgrades nearly as much as new training runs. Big changes to hardware and software are slow relative to these timescales. (Nvidia releases new GPU architectures on a two-year cadence, but they are mostly incremental.)

New training runs benefit most from major architectural changes and especially training/data/curriculum changes.

Douglas_Knight (3y)

Are you assuming that electricity is free? My understanding is that the cost of silicon is small compared to the cost of electricity, if you run the chip all the time, as in this article. For example, this GPU costs $60 and consumes 300 watts ≈ 2700 kWh/year ≈ $270/year, at $0.10/kWh. This one costs 10x and consumes 3x, so its price is not negligible, but still less than a year of operation. Plus I think the data center rule of thumb is that you should multiply electricity by 2 to account for cooling costs.
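The arithmetic in this comment can be sketched as follows. The prices and wattages are the ones quoted above. The function names and the `overhead` parameter (for the "multiply by 2" cooling rule of thumb) are hypothetical, and the sketch assumes the chip runs at its rated power around the clock.

```python
HOURS_PER_YEAR = 24 * 365  # 8760


def annual_energy_kwh(watts):
    """Energy used in a year of continuous operation at the given power draw."""
    return watts * HOURS_PER_YEAR / 1000.0


def annual_electricity_cost(watts, usd_per_kwh=0.10, overhead=1.0):
    """Yearly electricity bill; overhead=2.0 applies the cooling rule of thumb."""
    return annual_energy_kwh(watts) * usd_per_kwh * overhead


def years_to_match_purchase_price(price_usd, watts, usd_per_kwh=0.10, overhead=1.0):
    """How long the chip must run before electricity spend equals its price."""
    return price_usd / annual_electricity_cost(watts, usd_per_kwh, overhead)


# First GPU from the comment: $60, 300 W.
cheap_gpu_years = years_to_match_purchase_price(60, 300)
# Second GPU: 10x the price, 3x the power.
big_gpu_years = years_to_match_purchase_price(600, 900)
```

At exactly 8760 hours per year the 300 W card uses 2628 kWh, which the comment rounds to about 2700. Either way, both cards cost less than a year of their own electricity.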

This will have a very large effect on the total compute bought, numbers which only appear in the graph. The headline numbers (the optimal times) depend mainly on the exponential form of the improvement in efficiency. If the time for the cost of silicon to be cut in half is the same as the time for the amount of electricity needed to be cut in half (Moore's law vs Koomey's law), then you should get roughly the same answer. Koomey's law used to be faster, but after the breakdown in Dennard scaling, it seems to be slower.
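One way to make this point concrete is a rough sketch under a stated assumption: the total cost per FLOP is a silicon term plus an electricity term, each decaying exponentially at its own yearly rate. That two-term model, the function name, and the example numbers are illustrative assumptions, not something taken from the post.

```python
import math


def effective_decline_rate(t, silicon_cost, power_cost, g_silicon, g_power):
    """Instantaneous yearly rate at which total cost per FLOP falls at time t.

    Total cost is modelled as
        silicon_cost * exp(-g_silicon * t) + power_cost * exp(-g_power * t)
    and the function returns -d/dt ln(total cost).
    """
    s = silicon_cost * math.exp(-g_silicon * t)
    p = power_cost * math.exp(-g_power * t)
    # A cost-weighted average of the two decline rates.
    return (g_silicon * s + g_power * p) / (s + p)


# Same halving time for both components: the mix does not matter.
same = effective_decline_rate(0.0, 1.0, 5.0, 0.28, 0.28)
# Electricity improving more slowly and dominating cost: the rate drops.
slower = effective_decline_rate(0.0, 1.0, 5.0, 0.28, 0.10)
```

Since the framework's optimal run length scales as one over the growth rate, equal rates leave the headline numbers unchanged, while a slower Koomey-style rate on a dominant electricity bill would lengthen them.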

If you want a GPU-specific version of Koomey's law, I don't know. Does that data set of GPUs have watt ratings?

joshc (3y)

This is because longer runs will be outcompeted by runs that start later and therefore use better hardware and better algorithms.

Wouldn't companies port their partially-trained models to new hardware? I guess the assumption here is that when more compute is available, actors will want to train larger models. I don't think this is obviously true because:
1. Data may be the bigger bottleneck. There was some discussion of this here. Making models larger doesn't help very much after a certain point compared with training them with more data.
2. If training runs are happening over months, there will be strong incentives to make use of previously trained models -- especially in a world where people are racing to build AGI. This could look like anything from slapping on more layers to developing algorithms that expand the model in all relevant dimensions as it is being trained. Here's a paper about progressive learning for vision transformers. I didn't find anything for NLP, but I also haven't looked very hard.

gwern (3y)

Not necessarily larger, but different. Presumably new hardware will have different performance characteristics than the old hardware (otherwise what's the point?); it seems unlikely that future GPUs will simply be exactly like the old GPU but using half the electricity, say. (Even in that scenario, since electricity is such a major cost, why wouldn't you then add more GPUs to your cluster to use up the new headroom?)

When we look at past changes like V100 to A100, or A100 to H100, they typically change the performance profile quite a bit: VRAM doubles or more, high-precision ops increase much less than low-precision, new numerical formats get native speed support, new specialized hardware like 'tensor cores' get added encouraging sparsity or reduced-precision, interconnects speed up (but never enough)... All of these are going to change your ideal width vs depth scaling ratios, Transformer head sizes or MoE expert sizes (trying to keep on-GPU) or the size of your model components in general, your other hyperparameters like total batch size, and so on.

Changes like precision can require architecture-level changes like more aggressive normalization or regularization (maybe your model will Just Work when you switch to mixed-precision for the performance boost - or maybe it will keep exploding until you throw in more layer normalization to keep all the numbers small), or may just not work at all at present.

You may be able to checkpoint your model and restart if a node crashes or if a minibatch diverges, but that's no panacea, DL is non-convex and different runs will end up in different places, and the seeds of decay & self-sabotage may be planted too deep in a model to be fixed: in the BigGAN paper, mooch tried extensively rolling back BigGANs that diverged, but even resetting back thousands of iterations didn't halt eventual divergence (we verified this the hard way, as hope is a cruel mistress); in the PaLM work, they found some minibatches just spike the loss, and it's not due to the individual datapoints (they had bit-for-bit reproducibility - very impressive! - and could swap out the data to check), it just sorta happens. (Why? Dunno.) There is also the learning that happens during the course of training: most recently, people were very amused/depressed to read through the Facebook OPT training logs about all the bugs, similar stories are told by anyone working on GPT-J or GPT-Neo-20b or HyperCLOVA, Anthropic and OA likewise; one particularly dramatic example I like is OA's "Rerun" DoTA2 OA5 agent - they were editing the arch & hyperparameters (in addition to keeping up with game patches) the entire time, and so at the end, they 'reran' the training process from scratch rather than upgrading progressively the same agent: Rerun required only 20% of the training for a

98% win-rate against the final version of OpenAI Five...The ideal option would be to run Rerun-like training from the very start, but this is impossible---the OpenAI Five curve represents lessons learned that led to the final codebase, environment, etc., without which it would not be possible to train Rerun.

Quite a difference.

So, yes, you can 'just copy over' your big old model onto your shiny new cluster, but you are going to pay a price. The size of this price compared to starting from scratch will depend on how extensive the hardware+software changes are, how much coevolution is going on, how path-dependent large models turn out to be (very unclear because few people train more than one), etc, but the price will be nonzero. Your utilization will be lower because now you underuse each node's VRAM, or you could have packed in larger model-shards into each node, or you need to economize more on inter-node bandwidth, and perhaps this is a small price worth paying; or perhaps your upgrade of the optimizer mid-way permanently hobbled convergence in a way you haven't noticed such that no amount of training will help you match the from-scratch version, and every FLOP is wasted as you will never achieve the target goals.

How bad will the price be? I guess that will depend on how much software and hardware innovation, and what sort, you expect. If you are looking forward to things like binary-weight nets (which are ultra-fast because they are now just bit operations like xor or popcount), you should expect to have to throw away all your prior models: they really do not like that sort of major change, whatever sort of approach makes them work is not going to play nice with Ye Olde FP32 GPT-3 models, and even when you can convert successfully, you probably can't train them much more. Trying to save compute by transferring old models is then just throwing good compute after bad. Whereas if you are looking forward to innovations focusing on datasets and expect GPUs to remain pretty much as they are now with lots of FP16 multiplication (but nothing crazy like ternary weights or pervasive sparsity or wacky approaches like HyperNEAT-style evolved topologies or Cerebras chips or spiking neural-network hardware), then probably you can plan to just continually train and upgrade a single Chinchilla model indefinitely, and the savings from better hyperparameter tuning etc will be unimportant constant-factors like a third, let's say, which is not enough to justify retraining from scratch until you have some better reason to do so like a new arch.

Since the critical decision is to throw out the old model/arch/run, a big enough change on either hardware or software can trigger a new-run decision, in which case you then pick up the gains from the other one as well. (That is, if some new hardware comes out and your old model is not well-suited to it, then when you start a fresh model, you'll probably also roll in all the software improvements which have happened since eons ago, a year or two or so.) So there's something of a double overhang: regular progress on both streams will lead to smoother capability gains as people regularly start new models and eat up the gains on both, but if one stagnates, then that will tend to lock-in that generation of models and one will want to delay a new model as long as possible, until the marginal return from software+hardware upgrades is so large it can pay for the fully-loaded training cost in one fell swoop. The average trend might be identical, as everyone continues to optimize on the margin, but the latter scenario seems like it would be much more jagged.

(A concrete example of this might be that Stable Diffusion is having such a moment right now in part because it benefits from high-end consumer GPUs, and those GPUs very abruptly became available recently at much closer to MSRP than they have been in years, so people who have been running old image generation models on old GPUs like 1080tis are suddenly running Stable Diffusion on 3090s. I'm sure the FID/IS improvement curves aggregated across research papers are exactly as smooth as AI Impacts or Paul Christiano would assert they are, but from the perspective of, say, artists suddenly being smacked across the face with SD images everywhere almost literally overnight when the SD model leaked a week or two ago, it sure doesn't feel smooth.)

Maxime Riché (2y)

We could also combine this with the rate of growth of investments. In that case we would end up with a total rate of growth of effective compute equal to $g_{H} + g_{I} + g_{S}$. This results in an optimal training run length of $L = 1 / (g_{H} + g_{I} + g_{S}) \approx 0.21$ years, i.e. $2.52$ months.

Why is g_I here 3.84, while above it is 1.03?

Maxime Riché (2y)

This is actually corrected on the Epoch website but not here (https://epochai.org/blog/the-longest-training-run)
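The discrepancy raised here can be checked with a few lines, as a rough sketch. $g_H = 0.281$ comes from the post's footnote and $g_I = 1.03$ is the value the question refers to. $g_S$ is not stated in this thread, so it is back-derived below from the 1.22-year "hardware + software" scenario; treat that as an assumption.

```python
def optimal_run_length_years(*growth_rates):
    """L = 1 / (sum of yearly growth rates), as in the quoted formula."""
    return 1.0 / sum(growth_rates)


g_H = 0.281             # hardware, from the post's footnote
g_I = 1.03              # investment, the value the question says appears "above"
g_S = 1.0 / 1.22 - g_H  # ASSUMPTION: back-derived from the 1.22-year scenario

# Run length if g_I = 1.03 is used in the three-factor formula.
L_with_1_03 = optimal_run_length_years(g_H, g_I, g_S)

# Value g_I would need to take for the quoted L = 0.21 years to come out.
implied_g_I = 1.0 / 0.21 - g_H - g_S
```

Under these assumptions, $g_I = 1.03$ gives roughly 0.54 years (about 6.5 months) rather than 0.21 years, and the quoted 0.21 years needs $g_I$ of roughly 3.9, close to the 3.84 the question cites.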

We assume that hardware price performance increases smoothly over time, rather than with discontinuous jumps corresponding to the release of new GPU designs or lithography techniques. We expect that on a more realistic step-function process, the key conclusions of our framework would still roughly follow (modulo optimal training durations occasionally changing a few months to accommodate discrete generations of hardware).

They find a doubling time for hardware efficiency of 2.46 years. This corresponds to a yearly growth rate of $g_{H} \approx \frac{\ln 2}{2.46} \approx 0.281$.

We found a 6.3 month doubling time for compute invested in large training runs. This is a yearly growth rate of $g_{C} \approx \frac{\ln 2}{6.3 \text{ months} \cdot \frac{1 \text{ year}}{12 \text{ months}}} \approx 1.31$.
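The conversion used in these two footnotes, from a doubling time to a continuous yearly growth rate, can be sketched as below. The function name is a hypothetical helper. The last line applies the run-length rule of one over the growth rate, which is inferred from the formula quoted in the comments.

```python
import math


def growth_rate_from_doubling_time(doubling_time_years):
    """Continuous yearly growth rate g such that exp(g * T) = 2."""
    return math.log(2) / doubling_time_years


g_H = growth_rate_from_doubling_time(2.46)      # hardware efficiency
g_C = growth_rate_from_doubling_time(6.3 / 12)  # compute in large runs, months -> years

# With hardware improvements alone, the longest run is 1 / g_H years.
hardware_only_years = 1.0 / g_H
```

The computed rates agree with the footnoted 0.281 and 1.31 to within rounding, and `hardware_only_years` lands on the 3.55 years in the scenario table.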

In theory, we should also account for the rise in training lengths. In practice, when we looked at a few data points, training lengths appeared to be increasing linearly over time, so we believe the effect is quite small.

Meta's OPT logbook illustrates this well: they report being unable to continuously train their models for more than 1-2 days on a cluster of 128 nodes due to the many failures requiring manual detection and remediation.

Scenario	Longest training run
Hardware improvements	3.55 years
Hardware improvements + Software improvements	1.22 years
Hardware improvements + Rising investments	9.12 months
Hardware improvements + Rising investments + Software improvements	2.52 months
