Multi-teacher distillation BERT model for NLU tasks
- Source :
- 大数据 (Big Data), Vol. 10, pp. 119-132 (2024)
- Publication Year :
- 2024
- Publisher :
- China InfoCom Media Group, 2024.
Abstract
- Knowledge distillation is a model compression scheme commonly used to address the large size and slow inference of deep pre-trained models such as BERT. Multi-teacher distillation can further improve the performance of the student model, but the forced assignment strategy of the traditional "one-to-one" mapping for the teacher models' intermediate layers discards most of the intermediate features. A "one-to-many" mapping method is proposed to solve the problem that intermediate layers cannot be aligned during knowledge distillation, and to help the student model acquire the syntactic, coreference and other knowledge encoded in the teacher models' intermediate layers. Experiments on several GLUE datasets show that the student model retains 93.9% of the teacher models' average inference accuracy while having only 41.5% of their average parameter size.
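
The abstract's "one-to-many" idea, as described, maps each student intermediate layer to a group of teacher intermediate layers rather than to a single fixed one. The sketch below illustrates that idea only; the loss form (MSE), the learned linear projection to match hidden sizes, the uniform weighting over mapped teacher layers, and all names and dimensions are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of a one-to-many intermediate-layer distillation loss.
# Assumptions (not from the paper): MSE alignment loss, a learned linear
# projection to reconcile hidden sizes, uniform weighting over mapped layers.
import torch
import torch.nn as nn


class OneToManyLayerLoss(nn.Module):
    """Align each student layer with a *group* of teacher layers so that
    intermediate teacher features are not discarded."""

    def __init__(self, student_hidden: int, teacher_hidden: int):
        super().__init__()
        # Hypothetical projection making student and teacher states comparable.
        self.proj = nn.Linear(student_hidden, teacher_hidden)
        self.mse = nn.MSELoss()

    def forward(self, student_layers, teacher_layers, mapping):
        """
        student_layers: list of [batch, seq, student_hidden] tensors
        teacher_layers: list of [batch, seq, teacher_hidden] tensors
        mapping: dict {student_layer_idx: [teacher_layer_idx, ...]}
        """
        loss = 0.0
        for s_idx, t_indices in mapping.items():
            s_hidden = self.proj(student_layers[s_idx])
            # Average the alignment loss over every teacher layer mapped here.
            loss += sum(self.mse(s_hidden, teacher_layers[t])
                        for t in t_indices) / len(t_indices)
        return loss / len(mapping)


if __name__ == "__main__":
    # Toy usage: random hidden states stand in for real BERT layer outputs.
    batch, seq = 2, 8
    student_layers = [torch.randn(batch, seq, 312) for _ in range(4)]   # 4-layer student
    teacher_layers = [torch.randn(batch, seq, 768) for _ in range(12)]  # 12-layer teacher
    # Each student layer absorbs a block of three consecutive teacher layers.
    mapping = {i: [3 * i, 3 * i + 1, 3 * i + 2] for i in range(4)}
    criterion = OneToManyLayerLoss(student_hidden=312, teacher_hidden=768)
    print(criterion(student_layers, teacher_layers, mapping).item())
```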
Details
- Language :
- Chinese
- ISSN :
- 2096-0271
- Volume :
- 10
- Database :
- Directory of Open Access Journals
- Journal :
- 大数据 (Big Data)
- Publication Type :
- Academic Journal
- Accession number :
- edsdoj.43cc24d2d92a4a53823ee584bac02df4
- Document Type :
- article
- Full Text :
- https://doi.org/10.11959/j.issn.2096-0271.2023039