[ Question ]

Updates on scaling laws for foundation models from 'Transcending Scaling Laws with 0.1% Extra Compute'

by Nick_Greig
18th Nov 2022
AI Alignment Forum
1 min read
Scaling Laws | AI | Frontpage

I am not sure if this paper is flying under the radar for many people, but has anyone read Transcending Scaling Laws with 0.1% Extra Compute? If so, how do you think it compares to the scaling laws presented in DeepMind's An empirical analysis of compute-optimal large language model training? Does it make you rethink the importance of dataset size (again)?

2 Answers, sorted by top scoring

LawrenceC

Nov 21, 2022

tl;dr: The shape of the curve probably doesn't change, but compute-optimal LM training will use less data than the Chinchilla scaling law suggests.

One of the takeaways from the last two years of LM progress is that GPT-3/Chinchilla's next-token-prediction objective is not the most efficient way to use data.* Instead, objectives that require the model to infill missing tokens in the middle of a text string, like the T5 objective or the UL2 objective, are much more efficient per unit of data.
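
For concreteness, here is a minimal sketch (illustrative Python, not the papers' actual preprocessing code; the sentinel-token names follow T5's convention, and UL2's mixture of denoisers is not modelled) of how a single training example differs between next-token prediction and a span-corruption/infilling objective:

```python
import random

# Toy example: the same token sequence prepared under two objectives.
tokens = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

def next_token_example(tokens):
    """Plain causal LM: predict each token from the prefix before it."""
    inputs = tokens[:-1]
    targets = tokens[1:]
    return inputs, targets

def span_corruption_example(tokens, span_len=2, seed=0):
    """T5-style infilling: mask a contiguous span, predict it given both sides.
    The real T5/UL2 recipes sample span lengths and corruption rates from
    specific distributions; this just masks one fixed-length span."""
    rng = random.Random(seed)
    start = rng.randrange(len(tokens) - span_len)
    corrupted = tokens[:start] + ["<extra_id_0>"] + tokens[start + span_len:]
    targets = ["<extra_id_0>"] + tokens[start:start + span_len] + ["<extra_id_1>"]
    return corrupted, targets

print(next_token_example(tokens))
print(span_corruption_example(tokens))
```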

Figure 2 of the Tay et al. UL2R paper shows that UL2R finetuning is equivalent to either a multiplicative or a constant increase in training FLOPs. Assuming the improvement holds across the board, this means that UL2 finetuning makes models ~1.5-3x more data efficient. So if the optimal trade-off for X FLOPs used to be Y params and Z tokens, then with a better objective (or finetuning with a better objective, as UL2R does) we might instead see ~1.5Y params and ~0.66Z tokens.
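
As a quick sanity check on that arithmetic, here is a rough sketch using the standard C ≈ 6·N·D approximation for training FLOPs; the parameter/token counts are illustrative (roughly Chinchilla's own 70B/1.4T configuration), and the 1.5x/0.66x factors are the illustrative ones from above, not numbers from either paper:

```python
# Scaling params up 1.5x and tokens down to ~0.66x leaves training compute
# approximately unchanged under C ≈ 6 * N * D.

def train_flops(params, tokens):
    return 6 * params * tokens          # standard rough approximation

Y = 70e9        # illustrative compute-optimal param count for some budget X
Z = 1.4e12      # illustrative compute-optimal token count for the same budget

baseline = train_flops(Y, Z)
adjusted = train_flops(1.5 * Y, 0.66 * Z)

print(f"baseline FLOPs: {baseline:.3e}")
print(f"adjusted FLOPs: {adjusted:.3e}")   # ~0.99x of baseline: same budget,
                                           # spent on more params, fewer tokens
```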

It's worth noting that this still implies a linear relationship between the optimal param count and token count; it's just that, if you use a better objective, it's better to use more params and fewer tokens than the next-token log-loss-based Chinchilla scaling laws would predict.
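
For readers who want the functional form: under the Chinchilla fits, both the optimal parameter count N and the optimal token count D scale with compute C with exponents of roughly 0.5, which is where the linearity comes from; a more data-efficient objective shifts the constant of proportionality (the tokens-per-parameter ratio), not the shape. A rough sketch, with the exponents and the ~20 tokens/param ratio being approximate values from the Chinchilla paper:

```latex
% Chinchilla-style compute-optimal scaling: both exponents are close to 0.5,
% so the optimal token count grows linearly with the optimal parameter count.
N_{\mathrm{opt}}(C) \propto C^{a}, \qquad
D_{\mathrm{opt}}(C) \propto C^{b}, \qquad
a \approx b \approx 0.5
\;\;\Longrightarrow\;\;
D_{\mathrm{opt}} \propto N_{\mathrm{opt}}
\quad\text{(roughly 20 tokens per parameter under the next-token-loss fit).}
```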

* Arguably, we knew this from BERT, where you'd get better finetuned performance on downstream tasks if you pretrained with bidirectional objectives, but I think the result that the next-token prediction objective is worse for text generation tasks is new. 

Sheikh Abdur Raheem Ali

Nov 19, 2022

Mostly already did my updates when “Efficient Training of Language Models to Fill in the Middle” https://arxiv.org/abs/2207.14255 came out.
