> From my perspective this term appeared around 2021 and became basically ubiquitous by 2022
I don't think this is correct. To add to Steven's answer, in the "GPT-1" paper from 2018 the abstract discusses
> ...generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task
and the assumption at the time was that the finetuning step was necessary for the models to be good at a given task. This assumption persisted for a long time, with academics finetuning BERT on tasks where GPT-3 would eventually significantly outperform them. You can tell this from how cautious the GPT-1 authors are about claiming the base model could do anything, and they sound very quaint:
> We’d like to better understand why language model pre-training of transformers is effective. A hypothesis is that the underlying generative model learns to perform many of the tasks we evaluate on in order to improve its language modeling capability
Other people know more than me, but my impression was that the heritage of LLMs was things like ULMFiT (2018), where the goal was not to generate text but rather to do non-generative NLP tasks like sentiment-classification, spam-detection, and so on. Then you (1) do self-supervised “pretraining”, (2) edit/replace the output layer(s) to convert it from “a model that can output token predictions” to “a model that can output text-classifier labels / scores”, (3) fine-tune this new model (especially the newly-added parts) on human-supervised (text, label) pairs. Or something like that.
The word “pretraining” makes more sense than “training” in that context because “training” would incorrectly imply “training the model to do text classification”, i.e. the eventual goal. …And then I guess the term “pretraining” stuck around after it stopped making so much sense.
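As a rough illustration of the three steps above, here is a toy sketch in plain Python. Everything in it is an assumption made for illustration: the tiny corpus, the bigram next-token model, the names `embed`, `lm_head`, `clf_w` and `features`, and the choice to update only the new head in step (3). It is not ULMFiT's actual architecture or training recipe, just the same shape of pipeline at miniature scale.

```python
import math
import random

random.seed(0)
DIM = 8  # embedding width (arbitrary choice for the toy)

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# (1) Self-supervised "pretraining": next-token prediction on unlabeled text.
unlabeled = "the film was great . the film was awful . the plot was great .".split()
vocab = sorted(set(unlabeled))
idx = {tok: i for i, tok in enumerate(vocab)}
V = len(vocab)

# Shared body (token embeddings) and the token-prediction output head.
embed = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(V)]
lm_head = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(V)]

def lm_step(cur, nxt, lr=0.1):
    """One SGD step on -log p(nxt | cur); returns the loss before the update."""
    h = embed[cur][:]  # copy, so the gradients below use the pre-update values
    probs = softmax([sum(w * x for w, x in zip(row, h)) for row in lm_head])
    grad_h = [0.0] * DIM
    for t in range(V):
        g = probs[t] - (1.0 if t == nxt else 0.0)  # d loss / d logit_t
        for k in range(DIM):
            grad_h[k] += g * lm_head[t][k]
            lm_head[t][k] -= lr * g * h[k]
    for k in range(DIM):
        embed[cur][k] -= lr * grad_h[k]
    return -math.log(probs[nxt])

pairs = [(idx[a], idx[b]) for a, b in zip(unlabeled, unlabeled[1:])]
pretrain_losses = []
for _ in range(100):
    pretrain_losses.append(sum(lm_step(c, n) for c, n in pairs) / len(pairs))

# (2) Swap the output layer: set aside the V-way token head and add a
#     fresh single-logit classifier head on top of the pretrained body.
clf_w = [0.0] * DIM
clf_b = 0.0

def features(text):
    """Mean pretrained embedding of the in-vocabulary tokens (assumes at least one)."""
    rows = [embed[idx[t]] for t in text.split() if t in idx]
    return [sum(col) / len(rows) for col in zip(*rows)]

def predict(text):
    z = sum(w * x for w, x in zip(clf_w, features(text))) + clf_b
    return 1.0 / (1.0 + math.exp(-z))  # probability of the positive label

# (3) Supervised fine-tuning on human-labeled (text, label) pairs.
#     Simplification: only the newly added head is updated; the body stays frozen.
labeled = [("the film was great", 1), ("the plot was great", 1),
           ("the film was awful", 0), ("the plot was awful", 0)]
for _ in range(200):
    for text, label in labeled:
        x = features(text)
        err = predict(text) - label  # gradient of the log loss w.r.t. z
        for k in range(DIM):
            clf_w[k] -= 0.5 * err * x[k]
        clf_b -= 0.5 * err
```

The point of the sketch is the naming: step (1) never sees a label and is not "training the classifier" in any sense, which is why calling it "pretraining" relative to step (3) reads naturally in this setting.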
Yes, the ULMFiT paper is one of the first papers using the notion of "pretraining" (it might be the one which actually introduces this terminology).
Then it appears in other famous 2018 papers:
Improving Language Understanding by Generative Pre-Training (Radford et al., June 2018)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., October 2018)
Thanks for these points! I think I understand the history of what has happened here better now -- and the reasons for my misapprehension. Essentially, what I think happened is
a.) LLM/NLP research has always (?) used 'pretraining', going back at least to the 2017 era, for the general training of a model not specialised for a particular NLP task (such as NER, syntax parsing, etc.)
b.) the rest of ML mostly used 'training' because they by and large didn't do massive unsupervised training on unrelated tasks -- i.e. CV just had ImageNet or whatever
c.) In 2020-2022 peri...
I think the reason is that LLMs are not (pre)trained to do any particular practical task; they are "just trained to predict text". "Pre" signifies that the LLM is a "raw product", not suitable for consumers.
While it may seem like a "linguistic quirk", the term "pretraining" emerged to distinguish this initial phase of training the language model on a vast corpus of unlabeled text from the subsequent fine-tuning phase, where the pretrained model is adapted to a specific task using labeled data. This distinction became crucial as the pretraining step often required significant computational resources and time, while fine-tuning could be relatively more efficient and task-specific.
One of the earliest mentions of this terminology can be found in the 2018 BERT paper:
"There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning." (Devlin et al., 2018)
The rise of large language models like GPT (Generative Pre-trained Transformer) from OpenAI and their impressive performance on various NLP tasks further solidified the importance of this pretraining paradigm. As these models grew larger and more complex, the pretraining phase became even more resource-intensive and critical to the overall performance of the models.
It's worth noting that the term was not exclusively coined by large LLM companies, but rather emerged from the broader research community working on transfer learning and self-supervised pretraining techniques. However, the prominence of these companies and their large-scale language models likely contributed to the widespread adoption of the term "pretraining" in the ML and NLP communities.
Regarding the rationale behind using "pretraining" instead of "training," it seems to stem from the distinction between the initial, resource-intensive phase of capturing general linguistic knowledge and the subsequent task-specific fine-tuning phase. The term "pretraining" emphasizes the preparatory nature of this initial phase, which is followed by fine-tuning or other task-specific training steps.
So yes, I believe the emergence of the term "pretraining" can be attributed to the paradigm shift in NLP towards transfer learning and self-supervised pretraining techniques, which made that distinction necessary.
Just an ML linguistic quirk I have wondered about for a while. When I started learning ML (in the 2016-2017 period), everybody referred to the period of training models as just 'training', which could then (optionally) be followed by finetuning. This usage makes sense to me and, as far as I know, was the standard ML terminology basically as long as people have been training neural networks.
Nowadays, we appear to call what used to be training 'pretraining'. From my perspective this term appeared around 2021 and became basically ubiquitous by 2022. Where did this come from? What is the difference between 'pretraining' now and 'training' from before?
My feeling is that this usage started at big LLM companies. However, what are these companies doing such that 'pretraining' should be a sensible term? As far as I know (especially around 2022 when it really took off) LLM training followed the standard 'pretraining' -> 'finetuning' -> 'alignment by RLHF' pipeline. Why do we need the special term 'pretraining' to handle this when 'training' still seems perfectly fine? Is it because we now regularly have 'post-training' (i.e. finetuning) phases? But then why 'pretraining' and 'post-training' -- but no 'training'?
Does anybody here know a good rationale or history of 'pretraining'? Or is this just some inexplicable linguistic quirk?