toph · 2mo

Late to the party, but thanks for writing this up! I'm confused about two points in the calculation in the Theory section:

  • The FLOP needed to compute the term "δ3@A2R" (and similar)
    • I understand this to be the outer product of two vectors: δ3, with length #output, and A2R, with length #hidden2
    • If that's the case, shouldn't this require only #output*#hidden2*#batch FLOP (without the factor of two in the table), since it's just one multiplication per pair of entries?
  • Do the parameter updates need to be accumulated for each example in the batch?
    • If so, would that mean one additional FLOP (an addition) per parameter per example in the batch?

I think these two points cancel out: dropping the factor of two from the outer product removes one FLOP per parameter per example, and the accumulation adds one back, so the total is unchanged and the 2:1 ratio still holds, as expected. I think this is also consistent with the explanation here: https://medium.com/@dzmitrybahdanau/the-flops-calculus-of-language-model-training-3b19c1f025e4
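To make the count concrete, here is a minimal sketch in plain Python of how I'm tallying it. The function name `batched_outer_product` and the list-of-lists layout are my own illustrative choices, not anything from the post, and it counts the first accumulation into a zero-initialised gradient as an addition.

```python
def batched_outer_product(deltas, activations):
    """Accumulate sum_b outer(deltas[b], activations[b]) and count FLOP.

    deltas:      batch of vectors, each of length n_out    (stands in for δ3)
    activations: batch of vectors, each of length n_hidden (stands in for A2R)
    Returns (grad, mults, adds).
    """
    n_out = len(deltas[0])
    n_hidden = len(activations[0])
    grad = [[0.0] * n_hidden for _ in range(n_out)]
    mults = 0
    adds = 0
    for delta, act in zip(deltas, activations):
        for i in range(n_out):
            for j in range(n_hidden):
                # One multiplication per pair of entries (the outer product) ...
                product = delta[i] * act[j]
                mults += 1
                # ... plus one addition to accumulate it over the batch.
                grad[i][j] += product
                adds += 1
    return grad, mults, adds


# Hypothetical sizes: #batch = 3, #output = 2, #hidden2 = 3.
deltas = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
activations = [[1.0, 0.0, 2.0]] * 3
grad, mults, adds = batched_outer_product(deltas, activations)
total_flop = mults + adds
```

Under these assumptions `mults` and `adds` each equal #output*#hidden2*#batch, so the total matches the 2*#output*#hidden2*#batch figure in the table, just split differently between the two steps.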