1. Multi-teacher distillation BERT model in NLU tasks
- Authors
Jialai SHI and Weibin GUO
- Subjects
deep pre-training model, BERT, multi-teacher distillation, natural language understanding, Electronic computers. Computer science, QA75.5-76.95
- Abstract
Knowledge distillation is a model compression scheme commonly used to address the large scale and slow inference of BERT and other deep pre-trained models. "Multi-teacher distillation" can further improve the performance of the student model, but the traditional "one-to-one" mapping strategy, which forcibly assigns each student layer to a single intermediate layer of the teacher model, discards most of the intermediate features. A "one-to-many" mapping method is proposed to solve the problem that intermediate layers cannot be aligned during knowledge distillation, helping the student model acquire the syntactic, coreference, and other knowledge encoded in the teacher models' intermediate layers. Experiments on several GLUE datasets show that the student model retains 93.9% of the teachers' average inference accuracy while using only 41.5% of their average parameter size.
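To make the "one-to-many" idea concrete, below is a minimal sketch of an intermediate-layer distillation loss in which each student layer is matched against a weighted combination of all teacher layers rather than a single assigned one. The function name, the softmax weighting scheme, and the toy shapes are illustrative assumptions, not the paper's exact formulation (which may also require projections when teacher and student hidden sizes differ).

```python
# Sketch: "one-to-many" intermediate-layer distillation loss.
# Assumes teacher and student share the same hidden size for simplicity.
import torch
import torch.nn.functional as F


def one_to_many_hidden_loss(student_hiddens, teacher_hiddens, mapping_weights):
    """MSE between each student hidden state and a weighted mix of teacher layers.

    student_hiddens: list of S tensors, each (batch, seq_len, hidden)
    teacher_hiddens: list of T tensors, each (batch, seq_len, hidden)
    mapping_weights: (S, T) tensor; row s holds the logits that distribute
                     all T teacher layers over student layer s.
    """
    teacher_stack = torch.stack(teacher_hiddens, dim=0)      # (T, B, L, H)
    loss = 0.0
    for s, h_s in enumerate(student_hiddens):
        w = F.softmax(mapping_weights[s], dim=-1)            # (T,)
        # Every teacher layer contributes, so no intermediate feature is discarded.
        mixed_teacher = (w.view(-1, 1, 1, 1) * teacher_stack).sum(dim=0)
        loss = loss + F.mse_loss(h_s, mixed_teacher)
    return loss / len(student_hiddens)


if __name__ == "__main__":
    # Toy example: 12 teacher layers (e.g. pooled from multiple teachers), 4-layer student.
    B, L, H, S, T = 2, 8, 16, 4, 12
    student = [torch.randn(B, L, H) for _ in range(S)]
    teacher = [torch.randn(B, L, H) for _ in range(T)]
    weights = torch.randn(S, T)                              # learnable in practice
    print(one_to_many_hidden_loss(student, teacher, weights).item())
```

In a full training setup this term would typically be combined with the usual soft-label (logit) distillation loss and the task loss; the weighting matrix can be learned so that each student layer discovers which teacher layers it should align with.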
- Published
2024