lennart

Personal Website | Twitter | You can find my email on my website or just drop me a PM.

Sequences

Transformative AI and Compute

Wiki Contributions

Comments

Agree that this discussion is surprisingly often confusing and that people use the terms interchangeably. Unfortunately, readers have often interpreted our training compute measurement as a measure of performance, rather than as a quantity of executed operations. However, I don't think this is solely due to the abbreviations; it also comes from a lack of understanding of what is actually being measured. Besides making the distinction clearer through the terms themselves, one should probably also explain it explicitly and use words such as quantity and performance.

For my research, I've been trying to be consistent: FLOPs (lowercase s) refers to the quantity, while FLOPS or FLOP/s refers to the performance, i.e., operations per second. (FWIW, during my time in computer engineering, it was the norm to use FLOPs for quantity and FLOPS for performance.)

The term petaflop/s-day also helps: it states which performance (petaflop/s) was sustained for how many days, and therefore measures a quantity of operations.
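To make the distinction concrete, here's a minimal sketch of the conversion (just the definition spelled out in code):

```python
# Converting a performance (petaflop/s) sustained over time
# into a quantity of operations (FLOPs).
PFLOP_PER_S = 1e15           # 1 petaflop/s = 1e15 floating-point operations per second
SECONDS_PER_DAY = 24 * 60 * 60

def pfs_days_to_flops(pfs_days: float) -> float:
    """Quantity of operations in a given number of petaflop/s-days."""
    return pfs_days * PFLOP_PER_S * SECONDS_PER_DAY

print(f"{pfs_days_to_flops(1):.3e}")  # 1 petaflop/s-day ≈ 8.640e+19 FLOPs
```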

Note that it gets even more complicated once we take the number representation (32-bit floating point, 16-bit, or even bfloat16) into consideration. Therefore, I'm also in favor of eventually switching to OPs and OP/s, and of documenting the number representation used in technical documentation (such as when reporting the compute of ML models).

They trained it on TPUv3s; however, the robot inference was run on a GeForce RTX 3090 (see Section G).

TPUs are mostly designed for data centers and are not really usable for on-device inference.

I'd be curious to hear more thoughts on how much we could already scale it right now. It looks like data might be a bottleneck?

Some thoughts on compute:

Gato estimate: 256 TPUv3 chips for 4 days at 24 hours/day = 24'576 TPUv3-hours (on-demand cost is $2 per TPUv3-hour) = $49'152

In comparison, PaLM used 8'404'992 TPUv4-hours, and I estimated that it'd cost $11M+. If we assume that someone would be willing to spend the same compute budget on it, we could make the model 106x bigger (assuming Chinchilla scaling laws). I also tweeted about this here.
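For transparency, a rough sketch of the arithmetic behind these numbers (my own back-of-the-envelope; it compares raw accelerator-hours and ignores the performance difference between TPUv3 and TPUv4 chips):

```python
# Back-of-the-envelope for the Gato cost and the comparison with PaLM.
gato_chips, gato_days = 256, 4
gato_tpu_hours = gato_chips * gato_days * 24      # 24'576 TPUv3-hours
gato_cost = gato_tpu_hours * 2.0                  # $2/hour on-demand -> ~$49'152

palm_tpu_hours = 8_404_992                        # TPUv4-hours, from the PaLM paper
hour_ratio = palm_tpu_hours / gato_tpu_hours      # ~342x more accelerator-hours

print(f"Gato: {gato_tpu_hours:,} TPUv3-hours, ~${gato_cost:,.0f}")
print(f"PaLM used ~{hour_ratio:.0f}x more TPU-hours than Gato")
```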

The size of the model was only(?) limited by the latency requirements for the robotics part.

It took Google 64 days to train PaLM using more than 6'000 TPU chips. Using the same setup (which is probably one of the most interconnected and capable ML training systems out there), it'd take 912 years to train such a 10T parameter model.

I recently estimated the training cost of PaLM to be around $9M to $17M.
Please note all the caveats; this only estimates the final training run cost using commercial cloud compute (Google's TPUv3).

As already said, a 10T parameter model trained according to Chinchilla scaling laws would require around 1.3e28 FLOPs. That's 5200x more compute than PaLM (2.5e24 FLOPs).

Therefore, we can scale the estimated PaLM training cost of $9M to $17M by 5200x.

So a conservative estimate is around $47 to $88 billion.
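Spelling out the arithmetic (a quick sketch that takes the ~5200x compute ratio as given; the exact factor depends on the Chinchilla extrapolation used):

```python
# Scaling PaLM's compute, training time, and cost up to a 10T parameter model.
ratio = 5200                  # 10T model needs ~5200x PaLM's training compute
palm_compute = 2.5e24         # PaLM training compute [FLOPs]
palm_days = 64                # PaLM wall-clock training time on ~6'000 TPUv4 chips
palm_cost_low, palm_cost_high = 9e6, 17e6   # my PaLM training cost estimate [$]

print(f"10T-model compute: {ratio * palm_compute:.1e} FLOPs")          # ~1.3e+28
print(f"Time on PaLM's setup: {ratio * palm_days / 365:.0f} years")    # ~912
print(f"Cost: ${ratio * palm_cost_low / 1e9:.0f}B to ${ratio * palm_cost_high / 1e9:.0f}B")
```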

Thanks for the thoughtful response, Connor.

I'm glad to hear that you will develop a policy and won't be publishing models by default.

Glad to see a new Alignment research lab in Europe. Good luck with the start and the hiring!

I'm wondering about this part. You're saying:

That being said, our publication model is non-disclosure-by-default, and every shared work will go through an internal review process out of concern for infohazards.

That's different from Eleuther's position[1]. Is this a change of mind or a different practice due to the different research direction? Will you continue open-sourcing your ML models?

  1. ^

    "A grassroots collective of researchers working to open source AI research."

From their paper:

We trained PaLM-540B on 6144 TPU v4 chips for 1200 hours and 3072 TPU v4 chips for 336 hours including some downtime and repeated steps.

That's 64 days.
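A quick sanity check on those figures, just multiplying out what the paper reports (assuming the two phases ran sequentially):

```python
# Totals implied by the quoted PaLM training setup.
phase1 = 6144 * 1200      # TPUv4-hours on 6144 chips
phase2 = 3072 * 336       # TPUv4-hours on 3072 chips
total_tpu_hours = phase1 + phase2
total_days = (1200 + 336) / 24   # assumes the phases ran back to back

print(f"{total_tpu_hours:,} TPUv4-hours over {total_days:.0f} days")
# -> 8,404,992 TPUv4-hours over 64 days
```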

It's roughly an order of magnitude more compute than GPT-3.

| ML Model | Compute [FLOPs] | x GPT-3 |
|---|---|---|
| GPT-3 (2020) | 3.1e23 | 1 |
| Gopher (2021-12) | 6.3e23 | ≈2x |
| Chinchilla (2022-04) | 5.8e23 | ≈2x |
| PaLM (2022-04) | 2.5e24 | ≈10x |