Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Updates on scaling laws for foundation models from 'Transcending Scaling Laws with 0.1% Extra Compute'

7LawrenceC

1Sheikh Abdur Raheem Ali


*tl;dr: The shape of the curve probably doesn't change, but compute-optimal LM training will use less data than the Chinchilla scaling law suggests.*

One of the takeaways from the last two years of LM progress is that GPT-3/Chinchilla's next-token-prediction objective is not the most efficient way to use data.* Instead, objectives that require the model to infill missing tokens in the middle of a text string, like the T5 objective or the UL2 objective, are much more efficient per unit of data.

Figure 2 of the Tay et al. UL2R paper shows that UL2 finetuning is equivalent to either a multiplicative or a constant increase in training FLOPs. Assuming that the improvement holds across the board, this means that UL2 finetuning makes models ~1.5-3x more data efficient. So if the optimal trade-off for X FLOPs was previously Y params times Z tokens, then with a better objective (or better finetuning with such an objective) we might see 1.5 Y params and 0.66 Z tokens.
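To make the arithmetic concrete, here is a minimal sketch of that reallocation. It assumes training compute scales with params times tokens (the common C ≈ 6·N·D rule of thumb, which is my assumption and not something the paper or this answer states), and the function names `reallocate` and `train_flops` are made up for illustration. The baseline uses Chinchilla's roughly 70B params and 1.4T tokens purely as an example.

```python
import math


def train_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token.

    This approximation is an assumption of the sketch, not a claim from the post.
    """
    return 6.0 * params * tokens


def reallocate(params: float, tokens: float, param_multiplier: float) -> tuple[float, float]:
    """Shift a fixed compute budget toward parameters.

    Multiplies the param count by `param_multiplier` and divides the token
    count by the same factor, so params * tokens (and hence the estimated
    FLOPs) stays the same.
    """
    if param_multiplier <= 0:
        raise ValueError("param_multiplier must be positive")
    return params * param_multiplier, tokens / param_multiplier


# Illustrative baseline (Y params, Z tokens): roughly Chinchilla-sized.
y_params, z_tokens = 70e9, 1.4e12

# The hypothetical shift from the text: 1.5 Y params, ~0.66 Z tokens.
new_params, new_tokens = reallocate(y_params, z_tokens, 1.5)

# Sanity check: the estimated compute budget is unchanged.
assert math.isclose(train_flops(new_params, new_tokens),
                    train_flops(y_params, z_tokens))

print(f"params: {y_params:.3g} -> {new_params:.3g}")
print(f"tokens: {z_tokens:.3g} -> {new_tokens:.3g}")
print(f"tokens per param: {z_tokens / y_params:.3g} -> {new_tokens / new_params:.3g}")
```

The 1.5 multiplier is only the low end of the ~1.5-3x range above; plugging in other values shows how quickly the tokens-per-param ratio falls as the objective gets more data efficient.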

It's worth noting that this still implies a linear relationship between the optimal param count and token count; it's just that with a better objective it's better to use more params and fewer tokens than the Chinchilla scaling laws, which are based on next-token log loss, would predict.
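As a rough sketch of why the relationship stays linear: suppose, as in the Chinchilla fits, that optimal params and tokens both grow approximately as the square root of compute $C$, and suppose the better objective shifts the allocation by a factor $r$ that does not depend on $C$. The constants $a$, $b$, and $r$ are symbols I am introducing for illustration, and the constant-$r$ assumption is the "holds across the board" case from above.

$$
N_{\text{opt}}(C) = a\,C^{1/2}, \qquad D_{\text{opt}}(C) = b\,C^{1/2} \quad\Longrightarrow\quad D_{\text{opt}} = \frac{b}{a}\,N_{\text{opt}}
$$

$$
N' = r\,N_{\text{opt}}, \qquad D' = \frac{D_{\text{opt}}}{r} \quad\Longrightarrow\quad D' = \frac{b}{a\,r^{2}}\,N'
$$

So under these assumptions the optimal token count is still proportional to the optimal param count; only the slope shrinks, by a factor of $r^{2}$ (about 2.25 for $r = 1.5$).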

* Arguably, we knew this from BERT, where you'd get better finetuned performance on downstream tasks if you pretrained with bidirectional objectives, but I think the result that the next-token prediction objective is worse for *text generation* tasks is new.

I mostly made my updates already when “Efficient Training of Language Models to Fill in the Middle” (https://arxiv.org/abs/2207.14255) came out.

I am not sure if this paper is flying under the radar for many people, but has anyone read "Transcending Scaling Laws with 0.1% Extra Compute"? If so, how do you think it compares to the scaling laws presented in DeepMind's "An empirical analysis of compute-optimal large language model training"? Does it make you rethink the importance of dataset size (again)?