Multi-pretraining for Large-scale Text Classification
- Source :
- EMNLP (Findings)
- Publication Year :
- 2020
- Publisher :
- Association for Computational Linguistics, 2020.
Abstract
- Deep neural network-based pretraining methods have achieved impressive results in many natural language processing tasks, including text classification. However, their applicability to large-scale text classification with numerous categories (e.g., several thousand) is yet to be well studied, since the training data in such settings is insufficient and skewed across categories. In addition, existing pretraining methods usually incur excessive computation and memory overheads. In this paper, we develop a novel multi-pretraining framework for large-scale text classification. This framework combines self-supervised pretraining with weakly supervised pretraining. As the self-supervised component, we introduce a new out-of-context word detection task on unlabeled data. It captures the topic consistency of words used in sentences, which proves useful for text classification. In addition, we propose a weakly supervised pretraining step, where labels for text classification are obtained automatically from an existing approach. Experimental results clearly show that both pretraining approaches are effective for large-scale text classification. The proposed scheme yields improvements of as much as 3.8% in macro-averaged F1-score over strong pretraining baselines, while remaining computationally efficient.
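The abstract does not spell out how training examples for the out-of-context word detection task are built. A minimal sketch of one plausible construction, assuming token-level binary labels and a corruption step that swaps in a word from a different topic's vocabulary (the function name and vocabularies are illustrative, not from the paper):

```python
import random

def make_ooc_example(tokens, foreign_vocab, rng):
    """Corrupt one token with a word drawn from another topic's vocabulary
    and return per-token 0/1 labels (1 = the out-of-context word)."""
    idx = rng.randrange(len(tokens))
    corrupted = list(tokens)
    corrupted[idx] = rng.choice(foreign_vocab)
    labels = [0] * len(tokens)
    labels[idx] = 1
    return corrupted, labels

# Illustrative usage: a finance-topic sentence corrupted with a sports word.
rng = random.Random(0)
sentence = ["the", "bank", "raised", "interest", "rates"]
sports_vocab = ["goalkeeper", "innings", "dribble"]
corrupted, labels = make_ooc_example(sentence, sports_vocab, rng)
```

A model pretrained on such examples must judge whether each word fits the topical context of its sentence, which is the topic-consistency signal the abstract describes.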
- Subjects :
- Training set
Artificial neural network
Computer science
Machine learning
Artificial intelligence
Details
- Database :
- OpenAIRE
- Journal :
- Findings of the Association for Computational Linguistics: EMNLP 2020
- Accession number :
- edsair.doi...........748d6fac6aa69971322e99bcca5db51d