TLDR: I'm scared Figure 3 is wrong (the one with training loss/parameters).

WHY: From page 2: "... we perform our analysis on the smoothed training loss which is an unbiased estimate of the test loss"

This claim is true. However, it estimates the average loss during training. For a fixed compute budget, larger models take fewer gradient steps and thus exhibit larger loss for a larger fraction of training time. If they estimate training loss in this way for Figure 3, I would expect them to underestimate the training loss of the larger models.

EXPERIMENT: If an...

This would require you to sample from GPT during training. If you want a sentence with 500 words, you need to evaluate GPT 500 times, which would slow training down by roughly a factor of 500. The clever thing about GPT (and other autoregressive models) is that they avoid sampling during training!
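To make the 500x point concrete, here is a minimal toy sketch (not GPT itself). `ToyModel`, `teacher_forced_loss`, and `sample` are hypothetical names, and the uniform next-token distribution is an assumption made only so the example runs without any dependencies. It counts forward passes: the training loss on a known sequence needs one pass because every position is conditioned on the ground-truth prefix, while sampling needs one pass per generated token.

```python
import math
import random


class ToyModel:
    """Hypothetical stand-in for an autoregressive model; it only counts forward passes."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.forward_passes = 0

    def forward(self, tokens):
        """One pass over a sequence: returns a next-token distribution for every position."""
        self.forward_passes += 1
        # Assumption for illustration: the model predicts a uniform distribution everywhere.
        uniform = [1.0 / self.vocab_size] * self.vocab_size
        return [uniform for _ in tokens]


def teacher_forced_loss(model, tokens):
    """Average next-token negative log-likelihood, computed with a single forward pass."""
    inputs, targets = tokens[:-1], tokens[1:]
    probs = model.forward(inputs)  # all positions are scored in this one call
    nll = [-math.log(p[t]) for p, t in zip(probs, targets)]
    return sum(nll) / len(nll)


def sample(model, prompt, num_new_tokens, rng):
    """Generate tokens one at a time; each new token costs one more forward pass."""
    seq = list(prompt)
    for _ in range(num_new_tokens):
        next_probs = model.forward(seq)[-1]  # only the last position is needed
        next_token = rng.choices(range(model.vocab_size), weights=next_probs)[0]
        seq.append(next_token)
    return seq


if __name__ == "__main__":
    train_model = ToyModel(vocab_size=10)
    teacher_forced_loss(train_model, list(range(10)) * 50 + [0])  # 500 predictions
    print("forward passes for the training loss:", train_model.forward_passes)

    sample_model = ToyModel(vocab_size=10)
    sample(sample_model, [0], 500, random.Random(0))
    print("forward passes for sampling 500 tokens:", sample_model.forward_passes)
```

In a real transformer the single-pass property comes from causal masking, which this toy does not implement; the sketch only illustrates the difference in the number of model evaluations.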