[ Question ]

Updates on scaling laws for foundation models from 'Transcending Scaling Laws with 0.1% Extra Compute'

by Nick_Greig
18th Nov 2022
AI Alignment Forum
1 min read
Scaling Laws | AI | Frontpage

I am not sure if this paper is flying under the radar for many people, but has anyone read Transcending Scaling Laws with 0.1% Extra Compute? If so, how do you think it compares to the scaling laws presented in DeepMind's An empirical analysis of compute-optimal large language model training? Does it make you rethink the importance of dataset size (again)?

2 Answers, sorted by top scoring

LawrenceC

Nov 21, 2022

tl;dr: The shape of the curve probably doesn't change, but compute-optimal LM training will use less data than the Chinchilla scaling law suggests.

One of the takeaways from the last two years of LM progress is that GPT-3/Chinchilla's next-token-prediction objective is not the most efficient way to use data.* Instead, objectives that require the model to infill missing tokens in the middle of a text string, like the T5 objective or the UL2 objective, are much more efficient per unit of data.
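
For concreteness, here is a minimal sketch (illustrative Python, not the papers' actual preprocessing code; the sentinel-token names follow T5's convention, and UL2's mixture of denoisers is not modelled) of how a single training example differs between next-token prediction and a span-corruption/infilling objective:

```python
import random

# Toy example: the same token sequence prepared under two objectives.
tokens = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

def next_token_example(tokens):
    """Plain causal LM: predict each token from the prefix before it."""
    inputs = tokens[:-1]
    targets = tokens[1:]
    return inputs, targets

def span_corruption_example(tokens, span_len=2, seed=0):
    """T5-style infilling: mask a contiguous span, predict it given both sides.
    The real T5/UL2 recipes sample span lengths and corruption rates from
    specific distributions; this just masks one fixed-length span."""
    rng = random.Random(seed)
    start = rng.randrange(len(tokens) - span_len)
    corrupted = tokens[:start] + ["<extra_id_0>"] + tokens[start + span_len:]
    targets = ["<extra_id_0>"] + tokens[start:start + span_len] + ["<extra_id_1>"]
    return corrupted, targets

print(next_token_example(tokens))
print(span_corruption_example(tokens))
```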

Figure 2 of the Tay et al. UL2R paper shows that UL2R finetuning is equivalent to either a multiplicative or a constant increase in training FLOPs. Assuming the improvement holds across the board, this means that UL2 finetuning makes models ~1.5-3x more data efficient. So if the optimal trade-off for X FLOPs used to be Y params and Z tokens, then with a better objective (or finetuning with a better objective, as UL2R does) we might instead see ~1.5Y params and ~0.66Z tokens.
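
As a quick sanity check on that arithmetic, here is a rough sketch using the standard C ≈ 6·N·D approximation for training FLOPs; the parameter/token counts are illustrative (roughly Chinchilla's own 70B/1.4T configuration), and the 1.5x/0.66x factors are the illustrative ones from above, not numbers from either paper:

```python
# Scaling params up 1.5x and tokens down to ~0.66x leaves training compute
# approximately unchanged under C ≈ 6 * N * D.

def train_flops(params, tokens):
    return 6 * params * tokens          # standard rough approximation

Y = 70e9        # illustrative compute-optimal param count for some budget X
Z = 1.4e12      # illustrative compute-optimal token count for the same budget

baseline = train_flops(Y, Z)
adjusted = train_flops(1.5 * Y, 0.66 * Z)

print(f"baseline FLOPs: {baseline:.3e}")
print(f"adjusted FLOPs: {adjusted:.3e}")   # ~0.99x of baseline: same budget,
                                           # spent on more params, fewer tokens
```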

It's worth noting that this still implies a linear relationship between the optimal param count and token count; it's just that, if you use a better objective, it's better to use more params and fewer tokens than the next-token log-loss-based Chinchilla scaling laws would predict.
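
For readers who want the functional form: under the Chinchilla fits, both the optimal parameter count N and the optimal token count D scale with compute C with exponents of roughly 0.5, which is where the linearity comes from; a more data-efficient objective shifts the constant of proportionality (the tokens-per-parameter ratio), not the shape. A rough sketch, with the exponents and the ~20 tokens/param ratio being approximate values from the Chinchilla paper:

```latex
% Chinchilla-style compute-optimal scaling: both exponents are close to 0.5,
% so the optimal token count grows linearly with the optimal parameter count.
N_{\mathrm{opt}}(C) \propto C^{a}, \qquad
D_{\mathrm{opt}}(C) \propto C^{b}, \qquad
a \approx b \approx 0.5
\;\;\Longrightarrow\;\;
D_{\mathrm{opt}} \propto N_{\mathrm{opt}}
\quad\text{(roughly 20 tokens per parameter under the next-token-loss fit).}
```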

* Arguably, we knew this from BERT, where you'd get better finetuned performance on downstream tasks if you pretrained with bidirectional objectives, but I think the result that the next-token prediction objective is worse for text generation tasks is new. 

Sheikh Abdur Raheem Ali

Nov 19, 2022

Mostly already did my updates when “Efficient Training of Language Models to Fill in the Middle” https://arxiv.org/abs/2207.14255 came out.
