The article discusses Transformer distillation in large language models (LLMs) and focuses on TinyBERT, a compressed version of BERT. Distillation teaches the student model to imitate both the output and the inner behavior of the teacher model, taking into account components such as the embedding layer, the Transformer (attention) layers, and the prediction layer. The article also describes the two-stage training process and the use of data augmentation. Despite being significantly smaller, TinyBERT achieves performance comparable to BERT base.
Large Language Models: TinyBERT – Distilling BERT for NLP
Unlocking the Power of Transformer Distillation in Large Language Models
In recent years, large language models (LLMs) like BERT have become increasingly complex, making it more difficult to train and use them effectively. To address this issue, researchers have developed a method called transformer distillation for compressing LLMs. In this article, we will focus on a smaller version of BERT called TinyBERT and understand how it works.
Main Idea
TinyBERT uses a modified loss function to make the student model imitate the teacher model. The loss compares the embedding outputs, hidden states, attention matrices, and prediction logits of the two models. The goal is not only to imitate the teacher's output but also its inner behavior, such as the attention weights learned by BERT, which capture useful linguistic structure.
Transformer Distillation Losses
The loss function in TinyBERT consists of three components:
- The output of the embedding layer
- The hidden states and attention matrices produced by the Transformer layers
- The logits output by the prediction layer
Matching these components lets the student learn not only the teacher's predictions but also its intermediate representations and the language knowledge encoded in them, resulting in a more robust and knowledgeable distilled model.
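To make this concrete, here is a minimal PyTorch-style sketch of the three loss terms, assuming the teacher and student each expose their embedding output, per-layer hidden states, attention matrices, and logits as a dictionary. The dictionary keys, the projection modules proj_emb and proj_hid (needed because the student's hidden size is smaller than the teacher's), and the temperature are illustrative assumptions, not the authors' exact implementation.

```python
import torch.nn.functional as F

def tinybert_distillation_loss(student_out, teacher_out, proj_emb, proj_hid,
                               layer_map, temperature=1.0):
    """Sketch of the three TinyBERT loss components: MSE on embeddings,
    MSE on hidden states and attention matrices, and soft cross-entropy
    on the prediction logits."""
    # 1) Embedding-layer loss: project the student's smaller embedding into
    #    the teacher's dimension, then compare with MSE.
    loss = F.mse_loss(proj_emb(student_out["embeddings"]), teacher_out["embeddings"])

    # 2) Transformer-layer loss: compare hidden states and attention matrices
    #    of student layer m with teacher layer n = g(m) (see "Layer Mapping").
    for m, n in enumerate(layer_map, start=1):
        s_hid = student_out["hidden_states"][m - 1]
        t_hid = teacher_out["hidden_states"][n - 1]
        loss = loss + F.mse_loss(proj_hid(s_hid), t_hid)

        s_attn = student_out["attentions"][m - 1]
        t_attn = teacher_out["attentions"][n - 1]
        loss = loss + F.mse_loss(s_attn, t_attn)

    # 3) Prediction-layer loss: soft cross-entropy between teacher and student
    #    logits, using the teacher's softened probabilities as targets.
    soft_targets = F.softmax(teacher_out["logits"] / temperature, dim=-1)
    log_probs = F.log_softmax(student_out["logits"] / temperature, dim=-1)
    loss = loss + (-(soft_targets * log_probs).sum(dim=-1).mean())

    return loss
```

In practice each term can be weighted separately; the layer_map argument corresponds to the mapping function g(m) described in the next section.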
Layer Mapping
TinyBERT has fewer encoder layers than BERT. To compute the distillation loss, a mapping function g(m) is introduced that assigns to each TinyBERT layer m the BERT layer n = g(m) it should imitate. By convention, the embedding layer of TinyBERT is mapped to the embedding layer of BERT and the prediction layer of TinyBERT to the prediction layer of BERT; every other TinyBERT layer m is matched with the BERT layer given by g(m).
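As a small illustration, a uniform mapping for a 4-layer TinyBERT distilled from a 12-layer BERT base could be written as follows; the index convention (0 for the embedding layer, the last index for the prediction layer) follows the description above, and the specific layer counts are an assumption based on the TinyBERT4 configuration.

```python
def g(m: int, student_layers: int = 4, teacher_layers: int = 12) -> int:
    """Map a TinyBERT (student) layer index m to the BERT (teacher) layer it imitates."""
    if m == 0:                        # embedding layer -> embedding layer
        return 0
    if m == student_layers + 1:       # prediction layer -> prediction layer
        return teacher_layers + 1
    # Uniform strategy: every third teacher layer, i.e. g(m) = 3m for 12 -> 4 layers.
    return m * (teacher_layers // student_layers)

print([g(m) for m in range(6)])  # [0, 3, 6, 9, 12, 13]
```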
Training
The training process of TinyBERT consists of two stages: general distillation and task-specific distillation. In the general distillation stage, TinyBERT acquires general knowledge from pre-trained BERT that has not been fine-tuned. In the task-specific distillation stage, a fine-tuned BERT acts as the teacher, and data augmentation is applied to the task dataset to improve performance. Through this two-stage process, TinyBERT achieves performance comparable to BERT on specific downstream tasks.
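The two stages could be sketched as follows; distill, the data loaders, and the augment step are placeholder names for illustration rather than the authors' actual training code, but the structure mirrors the description above.

```python
import torch

def distill(student, teacher, data_loader, optimizer, loss_fn, epochs=1):
    """One distillation stage: train the student to match a frozen teacher."""
    teacher.eval()
    student.train()
    for _ in range(epochs):
        for batch in data_loader:
            with torch.no_grad():
                teacher_out = teacher(**batch)        # frozen teacher outputs
            student_out = student(**batch)
            loss = loss_fn(student_out, teacher_out)  # e.g. the losses sketched above
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Stage 1: general distillation -- the teacher is pre-trained BERT (not fine-tuned)
# and the data is a large general-domain corpus.
#   distill(tinybert, pretrained_bert, general_corpus_loader, optimizer, loss_fn)

# Stage 2: task-specific distillation -- the teacher is BERT fine-tuned on the task,
# and the task dataset is enlarged by data augmentation before distillation.
#   augmented_loader = make_loader(augment(task_dataset))
#   distill(tinybert, finetuned_bert, augmented_loader, optimizer, loss_fn)
```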
Model Settings
TinyBERT has about 7.5x fewer parameters than BERT base, making it significantly smaller. The layer mapping strategy pairs each TinyBERT layer with every third BERT layer, so the transferred knowledge spans the full depth of the teacher. Despite its reduced size, TinyBERT demonstrates comparable performance, achieving an average score of 77.0 on the GLUE benchmark.
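As a rough sanity check on the 7.5x figure, the following sketch estimates parameter counts from the published configurations (BERT base: 12 layers, hidden size 768, feed-forward size 3072; TinyBERT4: 4 layers, hidden size 312, feed-forward size 1200); biases and layer norms are ignored, so the numbers are approximate.

```python
def approx_params(layers, hidden, ffn, vocab=30522, max_pos=512, segments=2):
    """Rough Transformer-encoder parameter count (weight matrices only)."""
    embeddings = (vocab + max_pos + segments) * hidden   # token + position + segment embeddings
    attention = 4 * hidden * hidden                      # Q, K, V and output projections
    feed_forward = 2 * hidden * ffn                      # up- and down-projection of the FFN
    return embeddings + layers * (attention + feed_forward)

bert_base = approx_params(layers=12, hidden=768, ffn=3072)   # ~109M parameters
tinybert4 = approx_params(layers=4, hidden=312, ffn=1200)    # ~14M parameters
print(f"compression ratio ≈ {bert_base / tinybert4:.1f}x")   # roughly 7-8x
```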
Conclusion
Transformer distillation is a powerful technique for compressing large language models like BERT. TinyBERT, a compressed version of BERT, achieves comparable performance while significantly reducing the model size. By leveraging AI solutions like TinyBERT, companies can redefine their work processes and stay competitive in the age of AI.
Discover how AI can redefine your company. Connect with us at hello@itinai.com and explore our AI Sales Bot at itinai.com/aisalesbot to automate customer engagement and enhance your sales processes.