
Multi-teacher distillation of BERT models in NLU tasks

Authors:
Jialai SHI
Weibin GUO
Source:
大数据 (Big Data), Vol. 10, pp. 119-132 (2024)
Publication Year:
2024
Publisher:
China InfoCom Media Group, 2024.

Abstract

Knowledge distillation is a model compression technique commonly used to address the large scale and slow inference of deep pre-trained models such as BERT. "Multi-teacher distillation" can further improve the performance of the student model, but the forced assignment strategy of the traditional "one-to-one" mapping between teacher and student intermediate layers discards most of the teachers' intermediate features. A "one-to-many" mapping method is proposed to solve the problem that intermediate layers cannot be aligned during knowledge distillation, helping the student model acquire the syntactic, coreference, and other knowledge encoded in the teachers' intermediate layers. Experiments on several GLUE datasets show that the student model retains 93.9% of the teachers' average inference accuracy while containing only 41.5% of their average parameter count.
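
The "one-to-many" layer mapping described in the abstract can be illustrated with a short sketch. The code below is not the authors' implementation; it is a minimal, hypothetical PyTorch example in which each student layer is matched against a contiguous block of intermediate layers from every teacher, so that no teacher features are left unused. The layer counts, the MSE matching loss, and the contiguous-block assignment are assumptions made for illustration only.

# Hypothetical sketch of a "one-to-many" intermediate-layer loss for
# multi-teacher distillation: each student layer is matched against a block
# of several teacher layers instead of a single forcibly assigned one.
import torch
import torch.nn.functional as F


def one_to_many_layer_loss(student_hidden, teachers_hidden):
    """student_hidden: list of S tensors of shape [batch, seq, dim].
    teachers_hidden: one list of T (>= S) such tensors per teacher.
    Assumes teacher and student hidden sizes already match; otherwise a
    learned linear projection on the student side would be needed."""
    total, count = 0.0, 0
    s_layers = len(student_hidden)
    for teacher_hidden in teachers_hidden:
        block = len(teacher_hidden) // s_layers  # teacher layers per student layer
        for s_idx, s_h in enumerate(student_hidden):
            # match this student layer to its whole block of teacher layers
            for t_h in teacher_hidden[s_idx * block:(s_idx + 1) * block]:
                total = total + F.mse_loss(s_h, t_h)
                count += 1
    return total / count


# Toy usage: a 4-layer student distilled from two 12-layer teachers.
student = [torch.randn(2, 16, 768) for _ in range(4)]
teachers = [[torch.randn(2, 16, 768) for _ in range(12)] for _ in range(2)]
print(one_to_many_layer_loss(student, teachers).item())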

Details

Language:
Chinese
ISSN:
2096-0271
Volume:
10
Database:
Directory of Open Access Journals
Journal:
大数据
Publication Type:
Academic Journal
Accession number:
edsdoj.43cc24d2d92a4a53823ee584bac02df4
Document Type:
article
Full Text:
https://doi.org/10.11959/j.issn.2096-0271.2023039