by Jaime Sevilla, Lennart Heim, Marius Hobbhahn, Tamay Besiroglu, and Anson Ho
You can find the complete article here. We provide a short summary below.
In short: to estimate the compute used to train a Deep Learning model, we can either 1) directly count the number of operations needed or 2) estimate it from GPU time.
Method 1: Counting operations in the model
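As a rough illustration of Method 1, the sketch below multiplies the operations in one forward pass by the number of training examples and epochs, then scales the result to account for the backward pass. It assumes the backward pass costs about twice the forward pass, as argued in the article. The function name, parameter names, and example numbers are hypothetical and chosen only for illustration.

```python
def training_compute_from_ops(forward_ops_per_example, n_examples, n_epochs,
                              backward_to_forward_ratio=2.0):
    """Estimate total training operations by counting operations in the model.

    Assumption: a backward pass costs `backward_to_forward_ratio` times a
    forward pass (about 2:1), so one training step on one example costs
    (1 + ratio) times the forward-pass operations.
    """
    ops_per_example = forward_ops_per_example * (1.0 + backward_to_forward_ratio)
    return ops_per_example * n_examples * n_epochs


# Hypothetical model: 1e9 operations per forward pass,
# 1e6 training examples, 10 epochs.
estimate = training_compute_from_ops(1e9, 1e6, 10)
print(f"{estimate:.2e} operations")
```

The forward-pass operation count is left as an input because it depends on the architecture. For recurrent models the formula changes, as the article discusses.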
Method 2: GPU time
We are uncertain about what utilization rate is best, but our recommendation is 30% for Large Language Models and 40% for other models.
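A minimal sketch of Method 2, assuming the estimate is the product of training time, number of GPUs, peak performance per GPU, and utilization rate. The default utilization rates are the recommendations above. The function name, constants, and hardware figures are hypothetical placeholders, not measurements.

```python
SECONDS_PER_DAY = 86_400

# Utilization rates recommended in the summary above.
LLM_UTILIZATION = 0.30
OTHER_UTILIZATION = 0.40


def training_compute_from_gpu_time(training_days, n_gpus, peak_flops_per_gpu,
                                   utilization):
    """Estimate total training FLOP from GPU time.

    peak_flops_per_gpu is the peak FLOP/s of one GPU for the number format
    used in training; utilization is the fraction of that peak achieved.
    """
    training_seconds = training_days * SECONDS_PER_DAY
    return training_seconds * n_gpus * peak_flops_per_gpu * utilization


# Hypothetical run: a language model trained for 10 days on 8 GPUs,
# each with an assumed peak of 1e14 FLOP/s.
estimate = training_compute_from_gpu_time(10, 8, 1e14, LLM_UTILIZATION)
print(f"{estimate:.2e} FLOP")
```

Because the utilization rate is the most uncertain input, it can be useful to run the estimate at both 30% and 40% and report the range.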
Other points of interest in the article include:
- We argue that the ratio of operations in the backward pass to the forward pass of neural networks is often close to 2:1.
- We discuss how the formula of method 1 changes for recurrent models.
- We argue that dropout does not affect the number of operations per forward and backward pass.
- We have compiled a table of parameter and operation counts for common neural network layers.
- We give a detailed example of method 1.
- We discuss commonly used number representation formats in ML.
- We share an estimate of the average performance of GPU cards each year.
- We share some GPU utilization rates reported in real experiments.
- We give a detailed example of method 2.
- We compare both methods and conclude that they result in similar estimates.
- We discuss the use of profilers to measure compute.