Alexander Mathiasen

TLDR: I'm scared Figure 3 is wrong (the one with training loss/parameters).

WHY: From page 2: "... we perform our analysis on the smoothed training loss which is an unbiased estimate of the test loss "

This claim is true. However, it is estimating average loss during training. For a fixed compute budget, larger models take less gradient steps and thus exhibit larger loss for a larger fraction of training time. If they estimate training loss in this way for Figure 3, I would expect them to underestimate the training loss of the larger models.

EXPERIMENT: If anyone has access to training loss .csv files, we can reproduce Figure 3 using loss from the last 100 iterations. All my concerns go away if we get the same plot.

New Scaling Laws for Large Language Models

Alexander Mathiasen4y40

GPT-3 + GAN

Answer by Alexander MathiasenSep 22, 202110

This would require you to sample from GPT during training. If you want a sentence with 500 words you need to evaluate GPT 500 times. As a result, it would slow down training 500 times. The clever thing with GPT (and other autoregressive models) is that they circumvent sampling during training!

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments