Large Language Models: TinyBERT — Distilling BERT for NLP

The article discusses the concept of Transformer distillation in large language models (LLMs) and focuses on TinyBERT, a compressed version of BERT. The distillation process teaches the student model to imitate both the output and the internal behavior of the teacher model, covering the embedding layer, the Transformer layers (hidden states and attention), and the prediction layer. The article also describes the two-stage training process and the use of data augmentation. Despite being significantly smaller, TinyBERT achieves performance comparable to BERT base.

Unlocking the Power of Transformer Distillation in Large Language Models

In recent years, large language models (LLMs) like BERT have grown increasingly large and complex, making them expensive to train and deploy. To address this issue, researchers have developed a compression method called Transformer distillation. In this article, we focus on TinyBERT, a smaller version of BERT, and explain how it works.

Main Idea

TinyBERT uses a modified loss function that makes the student model imitate the teacher model. The loss compares the embeddings, hidden states, attention matrices, and prediction logits of the two models. The goal is not only to imitate the teacher's output but also its internal behavior: in particular, the attention weights learned by BERT, which capture useful language structure.

Transformer Distillation Losses

The loss function in TinyBERT consists of three components, one for each part of the network:

  1. An embedding-layer loss on the output of the embedding layer
  2. A Transformer-layer loss on the hidden states and attention matrices produced by the Transformer layers
  3. A prediction-layer loss on the logits output by the prediction layer

These components allow the student to match the teacher's hidden representations and absorb the language knowledge encoded in its layers, resulting in a more robust and knowledgeable distilled model.
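
As a rough illustration, here is a minimal PyTorch sketch of these three loss components. The function names and tensor arguments are hypothetical, and the learnable projections W_e and W_h (which map the student's smaller dimension to the teacher's) are simplified; the actual TinyBERT implementation differs in detail.

```python
import torch.nn.functional as F

def embedding_loss(student_emb, teacher_emb, W_e):
    # Embedding-layer distillation: MSE between the student embeddings
    # (projected into the teacher's dimension by the learnable W_e)
    # and the teacher embeddings.
    return F.mse_loss(student_emb @ W_e, teacher_emb)

def transformer_layer_loss(student_hidden, teacher_hidden,
                           student_attn, teacher_attn, W_h):
    # Transformer-layer distillation: hidden states are compared after a
    # learnable projection W_h; attention matrices are compared directly.
    hidden_loss = F.mse_loss(student_hidden @ W_h, teacher_hidden)
    attention_loss = F.mse_loss(student_attn, teacher_attn)
    return hidden_loss + attention_loss

def prediction_loss(student_logits, teacher_logits, temperature=1.0):
    # Prediction-layer distillation: soft cross-entropy between the
    # temperature-scaled logits (KL divergence gives the same gradients
    # with respect to the student, so it is used here for convenience).
    return F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    )
```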

Layer Mapping

TinyBERT has fewer encoder layers than BERT, so a mapping function g(m) is introduced to pair each TinyBERT layer m with a BERT layer when computing the distillation loss. The embedding layer of TinyBERT is mapped directly to the embedding layer of BERT, the prediction layer to the prediction layer, and each remaining TinyBERT encoder layer to the BERT layer given by g(m).
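
A minimal sketch of such a mapping, assuming the uniform strategy reported for TinyBERT4 (4 student layers distilled from the 12 layers of BERT base, with index 0 for the embedding layer and the last index for the prediction layer):

```python
def g(m: int, M: int = 4, N: int = 12) -> int:
    """Map student layer index m (0..M+1) to a teacher layer index (0..N+1).

    M is the number of student Transformer layers, N the number of
    teacher layers. Index 0 is the embedding layer; index M+1 on the
    student side corresponds to index N+1 (the prediction layer).
    """
    if m == 0:            # embedding layer -> embedding layer
        return 0
    if m == M + 1:        # prediction layer -> prediction layer
        return N + 1
    return m * (N // M)   # uniform mapping: every (N/M)-th teacher layer

# TinyBERT4 from BERT base: student layers 1..4 map to teacher layers 3, 6, 9, 12
print([g(m) for m in range(0, 6)])   # [0, 3, 6, 9, 12, 13]
```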

Training

The training process of TinyBERT consists of two stages: general distillation and task-specific distillation. In the general distillation stage, TinyBERT acquires general language knowledge from the pre-trained (not yet fine-tuned) BERT. In the task-specific distillation stage, a fine-tuned BERT acts as the teacher, and data augmentation is applied to the task dataset to improve performance. Through this two-stage process, TinyBERT reaches performance comparable to BERT on specific downstream tasks.
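
The following toy, self-contained sketch shows how the two stages can differ in which losses are applied. Real training uses BERT encoders, attention losses, the layer mapping g(m), and the corpora described above; the module, sizes, and names here are made up purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    # Stand-in for a BERT encoder: returns a hidden state and logits.
    def __init__(self, hidden, n_classes=2):
        super().__init__()
        self.encoder = nn.Linear(16, hidden)
        self.classifier = nn.Linear(hidden, n_classes)
    def forward(self, x):
        h = torch.tanh(self.encoder(x))
        return h, self.classifier(h)

teacher, student = ToyEncoder(hidden=32), ToyEncoder(hidden=8)
W_h = nn.Parameter(torch.randn(8, 32) * 0.1)   # projects student -> teacher size
opt = torch.optim.Adam(list(student.parameters()) + [W_h], lr=1e-3)

def distillation_step(x, use_prediction_loss):
    with torch.no_grad():                      # teacher is frozen
        t_hidden, t_logits = teacher(x)
    s_hidden, s_logits = student(x)
    # Intermediate-layer loss (hidden states; attention omitted in this toy).
    loss = F.mse_loss(s_hidden @ W_h, t_hidden)
    if use_prediction_loss:
        # Prediction-layer loss, added only when distilling the classifier.
        loss = loss + F.kl_div(F.log_softmax(s_logits, dim=-1),
                               F.softmax(t_logits, dim=-1),
                               reduction="batchmean")
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

x = torch.randn(4, 16)
# Stage 1 (general distillation): intermediate layers only, pre-trained teacher.
distillation_step(x, use_prediction_loss=False)
# Stage 2 (task-specific distillation): fine-tuned teacher, augmented task data,
# intermediate layers first and then the prediction layer as well.
distillation_step(x, use_prediction_loss=True)
```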

Model Settings

TinyBERT has roughly 7.5x fewer parameters than BERT base, making it significantly smaller. The layer mapping strategy pairs each TinyBERT layer with every third BERT layer, so the transferred knowledge spans the full depth of the teacher. Despite its reduced size, TinyBERT demonstrates comparable performance, achieving an average score of 77.0 on the GLUE benchmark.
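
As a quick sanity check of the size claim, the sketch below instantiates both configurations with the Hugging Face transformers library, using the TinyBERT4 hyperparameters reported in the paper (4 layers, hidden size 312, feed-forward size 1200, 12 heads). Exact counts vary slightly with the implementation.

```python
from transformers import BertConfig, BertModel

def param_count(config):
    # Build a randomly initialized model just to count its parameters.
    model = BertModel(config)
    return sum(p.numel() for p in model.parameters())

bert_base = BertConfig()  # defaults: 12 layers, hidden 768, FFN 3072, 12 heads
tinybert4 = BertConfig(num_hidden_layers=4, hidden_size=312,
                       intermediate_size=1200, num_attention_heads=12)

n_teacher, n_student = param_count(bert_base), param_count(tinybert4)
print(f"BERT base: {n_teacher / 1e6:.1f}M parameters")
print(f"TinyBERT4: {n_student / 1e6:.1f}M parameters")
print(f"ratio: {n_teacher / n_student:.1f}x")   # roughly 7.5x
```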

Conclusion

Transformer distillation is a powerful technique for compressing large language models like BERT. TinyBERT, a compressed version of BERT, achieves comparable performance while significantly reducing the model size. By leveraging AI solutions like TinyBERT, companies can redefine their work processes and stay competitive in the age of AI.
