I've heard several people say that a 10 Trillion parameter GPT-3-like model, trained with DeepMind's new scaling laws in mind, would be pretty terrifying. I'm curious if anyone could give me a Fermi estimate of the cost of such a thing - if indeed it is feasible at all right now even with an enormous budget. 


4 Answers

Chinchilla has 70 billion parameters. 10 trillion / 70 billion ≈ 143

According to the paper released with Chinchilla ("Training Compute-Optimal Large Language Models" by Hoffmann et al.): "For every doubling of model size the number of training tokens should also be doubled."

So if this were followed, 143 times more data would also be needed, resulting in a 143 × 143 = 20,449× increase in compute.
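A quick sanity check of the scaling arithmetic (a sketch using the parameter counts quoted above):

```python
# Sanity-check the Chinchilla scaling arithmetic quoted above.
chinchilla_params = 70e9  # Chinchilla: 70B parameters
target_params = 10e12     # hypothetical 10T-parameter model

scale = round(target_params / chinchilla_params)  # ~143x more parameters
data_scale = scale                                # tokens scale 1:1 with params
compute_scale = scale * data_scale                # 143 * 143 = 20,449x compute

print(f"parameter scale: {scale}x, compute scale: {compute_scale}x")
```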

Chinchilla probably cost around $1-5 million USD in compute to train, so a 10 trillion parameter version would cost around $20 billion to $102 billion USD.

However, this is not a viable approach, because there is not enough text data available.

A more realistic scenario is to perhaps double the data Chinchilla was trained on (which might not even be easy to do) and then scale the size 143×, for a cost of about $286 million to $1.43 billion USD.
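The cost arithmetic for both scenarios, as a quick sketch (the $1M-$5M Chinchilla training cost is the guess from above; note that 143² × $5M comes out to about $102B):

```python
# Guessed Chinchilla training cost range in USD (from the estimate above).
chinchilla_cost = (1e6, 5e6)

def scaled_cost(compute_multiple):
    """Cost range if compute scales by the given multiple."""
    return tuple(compute_multiple * c for c in chinchilla_cost)

# Full Chinchilla scaling: 143x params and 143x data -> 143^2 x compute.
full = scaled_cost(143 ** 2)     # ~($20.4B, $102B)

# Doubled data only, 143x params -> 2 * 143 = 286x compute.
doubled = scaled_cost(143 * 2)   # ~($286M, $1.43B)
```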

Keep in mind that GPT-3 was trained on about a quarter of the data that Chinchilla was trained on. So a 10 trillion parameter GPT-3-style model would need about 143/4 ≈ 36× Chinchilla's compute, or roughly $36 million to $180 million USD.

I predict that by the time something with the anticipated capabilities of a 10T Chinchilla is developed, it will cost at most a small multiple of GPT-3. There are not just falling hardware costs, but algorithmic improvements and completely new methods.

It won’t cost less, because more compute always improves things, so the cost is set by the budget.

I believe that, given a few years, a company wanting to make a 10 trillion parameter GPT-3 could probably do it for less than these estimates: that isn't very much money in compute terms, and for that amount, specialized hardware produced in bulk could be used to bring costs down.

[Chinchilla 10T would have a 143× increase in parameters and] 143 times more data would also be needed, resulting in a 143 × 143 = 20,449× increase in compute.

Would anybody be able to explain this calculation a bit? It implies that compute requirements scale linearly with the number of parameters. Is that true for transformers?

My understanding would be that making the transformer deeper would increase compute linearly with parameters, but a wider model would require more than linear compute because it increases the number of connections between nodes at each layer. 

The formula is assuming a linear compute cost in number of parameters, not in network width. Fully-connected layers have a number of parameters quadratic in network width, one for each connection between neuron pairs (and this is true for non-transformers as much as transformers).
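A tiny illustration of that point, with hypothetical layer widths:

```python
# A fully connected layer mapping width -> width neurons has width^2
# weights: one per connection between neuron pairs.
def layer_params(width):
    return width * width  # ignoring biases

# Forward-pass FLOPs are roughly 2 per weight (a multiply and an add),
# i.e. linear in the parameter count even though quadratic in width.
def layer_flops(width):
    return 2 * layer_params(width)

# Doubling the width quadruples both params and FLOPs,
# so FLOPs per parameter stays constant at ~2.
assert layer_params(2048) == 4 * layer_params(1024)
assert layer_flops(2048) / layer_params(2048) == 2.0
```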
Ah right. Thank you!

I recently estimated the training cost of PaLM to be around $9M to $17M.
Please note all the caveats: this is only estimating the final training run cost, using commercial cloud computing (Google's TPUv3).

As already said, a 10T parameter model using Chinchilla scaling laws would take around 1.3e28 FLOPs. That's about 5200x more compute than PaLM (~2.5e24 FLOPs).

Therefore, 5200 × ($9M to $17M) ≈ $47B to $88B.

So a conservative estimate is around $47 to $88 billion.

It took Google 64 days to train PaLM using more than 6,000 TPU chips. Using the same setup (which is probably one of the most interconnected and capable ML training systems out there), it would take 912 years.
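These numbers can be checked with the common C ≈ 6·N·D approximation for dense transformer training FLOPs (a sketch; PaLM's 540B parameters and 780B training tokens, and the roughly 216T-token budget for a Chinchilla-optimal 10T model, are the figures reported in the respective papers):

```python
def train_flops(params, tokens):
    # Common rule of thumb: ~6 FLOPs per parameter per training token.
    return 6 * params * tokens

palm = train_flops(540e9, 780e9)        # ~2.5e24 FLOPs
model_10t = train_flops(10e12, 216e12)  # ~1.3e28 FLOPs

ratio = model_10t / palm                # ~5100x, close to the ~5200x above
cost = (9e6 * ratio, 17e6 * ratio)      # ~($46B, $87B)
years = 64 * ratio / 365                # ~900 years on PaLM's setup
```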

The Chinchilla paper states that a 10T parameter model would require 1.30e+28 FLOPs, or 150 million petaFLOP-days. A state-of-the-art Nvidia DGX H100 draws 10 kW and theoretically produces 8 petaFLOPS FP16. Assuming 50% training efficiency and a training time of 100 days, it would require 375,000 DGX H100 systems to train such a model, for a total power draw of 3.7 gigawatts. That's a factor of 100x larger than any supercomputer in production today. Also, orchestrating 3 million GPUs seems well beyond our engineering capabilities.
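The same back-of-the-envelope, step by step (a sketch using the figures quoted above):

```python
total_flops = 1.30e28  # Chinchilla-optimal 10T model (paper figure)

# One petaFLOP-day = 1e15 FLOP/s sustained for 86,400 seconds.
pflop_day = 1e15 * 86_400
pflop_days = total_flops / pflop_day      # ~150 million petaFLOP-days

# DGX H100: ~8 PFLOPS FP16 peak, ~10 kW, 8 GPUs per system (as above).
useful_per_system = 8 * 0.5 * 100         # 50% efficiency, 100 days -> 400
systems = pflop_days / useful_per_system  # ~375,000 systems

power_gw = systems * 10 / 1e6             # 10 kW each -> ~3.7 GW
gpus = systems * 8                        # ~3 million GPUs
```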

It seems unlikely we will see 10T models trained using the scaling laws of the Chinchilla paper any time in the next 10 to 15 years.

I moved this into my drafts a couple of months ago, but I found myself wanting to read the comments again, so I figured I should publish it again as it might plausibly be useful to others.

2 comments

I was super excited to read some detailed answers on how to selectively breed a chinchilla to weigh 10 tonnes and the costs associated with doing so. I have rarely been more disappointed.

I've heard several people say that a 10 Trillion parameter GPT-3-like model, trained with DeepMind's new scaling laws in mind, would be pretty terrifying

Would you mind linking to some examples of people talking along these lines? I’m curious what people’s specific predictions are.
