Robust Learning for Text Classification with Multi-source Noise Simulation and Hard Example Mining

Authors :
Zhongqin Wu
Zitao Liu
Wenbiao Ding
Weiping Fu
Guowei Xu
Source :
Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track ISBN: 9783030865160, ECML/PKDD (5)
Publication Year :
2021
Publisher :
Springer International Publishing, 2021.

Abstract

Many real-world applications use Optical Character Recognition (OCR) engines to transform handwritten images into transcripts, to which downstream Natural Language Processing (NLP) models are then applied. In this process, OCR engines may introduce errors, so the inputs to downstream NLP models become noisy. Although pre-trained models achieve state-of-the-art performance on many NLP benchmarks, we show that they are not robust to noisy texts generated by real OCR engines. This greatly limits the application of NLP models in real-world scenarios. To improve model performance on noisy OCR transcripts, it is natural to train the NLP model on labelled noisy texts. However, in most cases only labelled clean texts are available. Since there are no handwritten images corresponding to these texts, the recognition model cannot be used directly to obtain noisy labelled data. People can be hired to copy the texts by hand and photograph them, but this is extremely expensive given the amount of data needed for model training. Consequently, we are interested in making NLP models intrinsically robust to OCR errors in a low-resource manner. We propose a novel robust training framework which 1) employs simple but effective methods to directly simulate natural OCR noise from clean texts, 2) iteratively mines hard examples from a large number of simulated samples for optimal performance, and 3) employs a stability loss so that the model learns noise-invariant representations. Experiments on three real-world datasets show that the proposed framework boosts the robustness of pre-trained models by a large margin. We believe that this work can greatly promote the application of NLP models in real-world scenarios, although the algorithm we use is simple and straightforward. We make our code and the three datasets publicly available (https://github.com/tal-ai/Robust-learning-MSSHEM).
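The three components named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the confusion table, the function names, the toy classifier, and the choice of a squared distance between prediction vectors as the stability term are all assumptions made for illustration, and only a single mining round is shown where the abstract describes an iterative procedure.

```python
import math
import random

# Hypothetical confusion table: characters an OCR engine might plausibly swap.
CONFUSIONS = {"o": "0", "l": "1", "s": "5", "e": "c", "m": "rn"}


def simulate_noise(text, rate, rng):
    """Replace each confusable character with probability `rate`."""
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < rate:
            out.append(CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true label."""
    return -math.log(max(probs[label], 1e-12))


def mine_hard_examples(text, label, predict, n_candidates, k, rate, rng):
    """Simulate many noisy variants and keep the k with the highest loss."""
    candidates = [simulate_noise(text, rate, rng) for _ in range(n_candidates)]
    candidates.sort(key=lambda t: cross_entropy(predict(t), label), reverse=True)
    return candidates[:k]


def stability_loss(clean_probs, noisy_probs):
    """Squared L2 distance between predictions on clean and noisy input
    (an assumed form; the abstract does not specify the loss)."""
    return sum((c - n) ** 2 for c, n in zip(clean_probs, noisy_probs))


def toy_predict(text):
    """Stand-in classifier: confidence in class 0 drops as digits appear."""
    digits = sum(ch.isdigit() for ch in text)
    p0 = 1.0 / (1.0 + digits)
    return [p0, 1.0 - p0]


if __name__ == "__main__":
    rng = random.Random(0)
    clean = "solve all problems"
    hard = mine_hard_examples(clean, 0, toy_predict, n_candidates=50, k=3,
                              rate=0.3, rng=rng)
    for noisy in hard:
        print(noisy, stability_loss(toy_predict(clean), toy_predict(noisy)))
```

In a real training loop, the mined examples would be added to the training batch and the stability term added to the classification loss; the weighting between the two is another choice that the abstract leaves open.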

Details

ISBN :
978-3-030-86516-0
ISBNs :
9783030865160
Database :
OpenAIRE
Journal :
Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track ISBN: 9783030865160, ECML/PKDD (5)
Accession number :
edsair.doi...........a2372f1712cb2a7c85e8e8c5395664f6
Full Text :
https://doi.org/10.1007/978-3-030-86517-7_18