When and why did 'training' become 'pretraining'?

Mar 08, 2024

> From my perspective this term appeared around 2021 and became basically ubiquitous by 2022

I don't think this is correct. To add to Steven's answer, in the "GPT-1" paper from 2018 the abstract discusses

> ...generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task

and the assumption at the time was that the finetuning step was necessary for the models to be good at a given task. This assumption persisted for a long time, with academics finetuning BERT on tasks where GPT-3 would eventually significantly outperform them. You can tell this from how cautious the GPT-1 authors are about claiming the base model could do anything; they sound very quaint:

> We’d like to better understand why language model pre-training of transformers is effective. A hypothesis is that the underlying generative model learns to perform many of the tasks we evaluate on in order to improve its language modeling capability

Steven Byrnes

Mar 08, 2024

Other people know more than me, but my impression was that the heritage of LLMs was things like ULMFiT (2018), where the goal was not to generate text but rather to do non-generative NLP tasks like sentiment-classification, spam-detection, and so on. Then you (1) do self-supervised “pretraining”, (2) edit/replace the output layer(s) to convert it from “a model that can output token predictions” to “a model that can output text-classifier labels / scores”, (3) fine-tune this new model (especially the newly-added parts) on human-supervised (text, label) pairs. Or something like that.
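The three-step recipe above can be sketched in a few lines of plain Python. This is only a toy illustration, not ULMFiT's actual procedure: the `ToyModel` class, the averaged-embedding "body", the token ids, and the labels are all hypothetical stand-ins, and the fine-tuning here updates the whole model rather than concentrating on the new head.

```python
import math
import random

random.seed(0)
VOCAB, HIDDEN, N_LABELS = 5, 3, 2  # toy sizes, chosen only for illustration


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


class ToyModel:
    """A shared body (token embeddings) plus a replaceable output head."""

    def __init__(self, vocab, hidden):
        self.body = rand_matrix(vocab, hidden)  # one embedding row per token
        self.head = rand_matrix(hidden, vocab)  # LM head: hidden -> token logits

    def features(self, tokens):
        # Mean of token embeddings, a stand-in for a real encoder.
        hidden = len(self.body[0])
        return [sum(self.body[t][j] for t in tokens) / len(tokens) for j in range(hidden)]

    def forward(self, tokens):
        h = self.features(tokens)
        return [sum(h[i] * self.head[i][k] for i in range(len(h)))
                for k in range(len(self.head[0]))]


def sgd_step(model, tokens, target, lr=0.1):
    """One softmax cross-entropy SGD step on both the head and the body."""
    h = model.features(tokens)
    probs = softmax(model.forward(tokens))
    g = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    # Gradient w.r.t. the features, computed before the head is modified.
    grad_h = [sum(model.head[i][k] * g[k] for k in range(len(g))) for i in range(len(h))]
    for i in range(len(h)):
        for k in range(len(g)):
            model.head[i][k] -= lr * h[i] * g[k]
    for t in tokens:
        for j in range(len(h)):
            model.body[t][j] -= lr * grad_h[j] / len(tokens)


# (1) Self-supervised "pretraining": predict the last token from the earlier ones.
corpus = [[0, 1, 2], [0, 1, 2], [3, 4, 2]]  # hypothetical unlabeled token ids
model = ToyModel(VOCAB, HIDDEN)
for _ in range(50):
    for seq in corpus:
        sgd_step(model, seq[:-1], seq[-1])

# (2) Replace the output layer: token logits become classifier logits.
model.head = rand_matrix(HIDDEN, N_LABELS)

# (3) Fine-tune on human-labeled (text, label) pairs.
labeled = [([0, 1], 0), ([3, 4], 1)]  # hypothetical sentiment labels
for _ in range(200):
    for tokens, label in labeled:
        sgd_step(model, tokens, label)

print(softmax(model.forward([0, 1])))  # class probabilities, not token probabilities
```

The point of the sketch is step (2): the body learned during step (1) is kept, while the output head is thrown away and replaced, so the model that comes out of step (1) never performs the eventual task on its own.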

The word “pretraining” makes more sense than “training” in that context because “training” would incorrectly imply “training the model to do text classification”, i.e. the eventual goal. …And then I guess the term “pretraining” stuck around after it stopped making so much sense.

mishka (2y ago)

Yes, the ULMFiT paper is one of the first papers using the notion of "pretraining" (it might be the one which actually introduces this terminology).

Then it appears in other famous 2018 papers:

Improving Language Understanding by Generative Pre-Training (Radford et al., June 2018)

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., October 2018)

beren (2y ago)

Thanks for these points! I think I understand the history of what has happened here better now -- and the reasons for my misapprehension. Essentially, what I think happened is

a.) LLM/NLP research has (always?) used 'pretraining', going back at least to the 2017 era, for the general training of a model not specialised for a particular NLP task (such as NER, syntax parsing, etc.)

b.) the rest of ML mostly used 'training' because they by and large didn't do massive unsupervised training on unrelated tasks -- i.e. CV just had ImageNet or whatever

c.) In 2020-2022 peri...

quetzal_rainbow

Mar 08, 2024

I think the reason is that LLMs are not (pre)trained to do any particular practical task; they are "just trained to predict text". "Pre" signifies that an LLM is a "raw product", not suitable for consumers.

Victor Ashioya

Mar 09, 2024

While it may seem like a "linguistic quirk", the term "pretraining" emerged to distinguish the initial phase of training a language model on a vast corpus of unlabeled text from the subsequent fine-tuning phase, where the pretrained model is adapted to a specific task using labeled data. This distinction became crucial because the pretraining step often required significant computational resources and time, while fine-tuning could be comparatively efficient and task-specific.

One of the earliest mentions of this terminology can be found in the 2018 BERT paper:

"There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning." (Devlin et al., 2018)

The rise of large language models like GPT (Generative Pre-trained Transformer) from OpenAI and their impressive performance on various NLP tasks further solidified the importance of this pretraining paradigm. As these models grew larger and more complex, the pretraining phase became even more resource-intensive and critical to the overall performance of the models.

It's worth noting that the term was not coined by the large LLM companies, but rather emerged from the broader research community working on transfer learning and self-supervised pretraining techniques. However, the prominence of these companies and their large-scale language models likely contributed to the widespread adoption of the term "pretraining" in the ML and NLP communities.

Regarding the rationale behind using "pretraining" instead of "training," it seems to stem from the distinction between the initial, resource-intensive phase of capturing general linguistic knowledge and the subsequent task-specific fine-tuning phase. The term "pretraining" emphasizes the preparatory nature of this initial phase, which is followed by fine-tuning or other task-specific training steps.

So yes, I believe the emergence of the term "pretraining" can be attributed to the paradigm shift in NLP towards transfer learning and self-supervised pretraining techniques, which necessitated a clear distinction between the initial, resource-intensive phase of capturing general linguistic knowledge and the subsequent task-specific fine-tuning phase.
