My takeaway series follows a Q&A format to explain AI concepts at three levels:
- for anyone with general knowledge;
- for anyone who wants to dive into the code implementation details of the concept;
- for anyone who wants to understand the mathematics behind the technique.
Pre-training and fine-tuning is a machine learning paradigm. In traditional machine learning, a model is usually trained from scratch for each specific task. In contrast, the pre-training and fine-tuning paradigm involves two stages:
- Pre-training: A model is trained on a large dataset to learn general representations.
- Fine-tuning: The pre-trained model is further trained on a smaller, task-specific dataset to adapt it to a specific task.
The two stages can be performed separately: a model can be pre-trained and saved, then loaded and fine-tuned later.
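The two stages above can be sketched in code. The example below is a minimal, illustrative toy (plain SGD on a one-parameter linear model, not any real library's API): it "pre-trains" on a large generic dataset, saves the learned weight, then fine-tunes that weight on a small task-specific dataset.

```python
import random

def train(w, data, lr, epochs):
    """Plain SGD on squared error for the toy model y = w * x."""
    for _ in range(epochs):
        for x, y in data:
            pred = w * x
            grad = 2 * (pred - y) * x  # d/dw of (w*x - y)^2
            w -= lr * grad
    return w

random.seed(0)

# Stage 1: pre-train on a large, generic dataset (true slope = 2.0).
pretrain_data = [(x, 2.0 * x) for x in (random.uniform(-1, 1) for _ in range(1000))]
w = train(0.0, pretrain_data, lr=0.1, epochs=3)

# The stages are separable: "save" the pre-trained weight here,
# and reload it later for fine-tuning.
saved_w = w

# Stage 2: fine-tune on a small task-specific dataset whose
# distribution differs slightly from pre-training (slope = 2.1).
finetune_data = [(x, 2.1 * x) for x in (random.uniform(-1, 1) for _ in range(20))]
w_finetuned = train(saved_w, finetune_data, lr=0.05, epochs=5)

print(saved_w, w_finetuned)
```

Starting fine-tuning from `saved_w` (near 2.0) rather than from scratch means only a small correction is needed to fit the new task, which is the core benefit of the paradigm.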
The pre-training and fine-tuning paradigm became prominent with the advent of large-scale language models, particularly with the introduction of models like BERT (Bidirectional Encoder Representations from Transformers) by Google in 2018 (Devlin et al. 2018). BERT demonstrated that pre-training on a large corpus of text followed by fine-tuning on specific tasks could achieve state-of-the-art results across a variety of natural language processing (NLP) benchmarks.
However, the concept of pre-training itself has been around for a while. Hinton and Salakhutdinov introduced the idea of pre-training neural networks using unsupervised learning in 2006 (Hinton and Salakhutdinov 2006), which helped initialize deep networks before fine-tuning them with supervised learning. In 2012, the success of AlexNet (Krizhevsky, Sutskever, and Hinton 2012) in image classification also highlighted the benefits of pre-training on large datasets like ImageNet. In NLP, word embeddings like Word2Vec (Mikolov et al. 2013) and GloVe (Pennington, Socher, and Manning 2014) also laid the groundwork for pre-trained representations.