Robust Learning for Text Classification with Multi-source Noise Simulation and Hard Example Mining

Authors :: Zhongqin Wu
Zitao Liu
Wenbiao Ding
Weiping Fu
Guowei Xu
Source :: Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track ISBN: 9783030865160, ECML/PKDD (5)
Publication Year :: 2021
Publisher :: Springer International Publishing, 2021.
Abstract: Many real-world applications use Optical Character Recognition (OCR) engines to transform handwritten images into transcripts to which downstream Natural Language Processing (NLP) models are applied. In this process, OCR engines may introduce errors, so the inputs to downstream NLP models become noisy. Although pre-trained models achieve state-of-the-art performance on many NLP benchmarks, we show that they are not robust to noisy texts generated by real OCR engines. This greatly limits the application of NLP models in real-world scenarios. To improve model performance on noisy OCR transcripts, it is natural to train the NLP model on labelled noisy texts. However, in most cases only labelled clean texts are available. Since there are no handwritten images corresponding to these texts, it is impossible to directly use a recognition model to obtain noisy labelled data. People can be employed to copy texts by hand and photograph them, but this is extremely expensive given the amount of data needed for model training. Consequently, we are interested in making NLP models intrinsically robust to OCR errors in a low-resource manner. We propose a novel robust training framework that 1) employs simple but effective methods to directly simulate natural OCR noise from clean texts, 2) iteratively mines hard examples from a large number of simulated samples for optimal performance, and 3) employs a stability loss so that the model learns noise-invariant representations. Experiments on three real-world datasets show that the proposed framework boosts the robustness of pre-trained models by a large margin. We believe that this work can greatly promote the application of NLP models in real-world scenarios, although the algorithm we use is simple and straightforward. We make our code and three datasets publicly available (https://github.com/tal-ai/Robust-learning-MSSHEM).
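
Illustrative sketch (not from the paper or its repository): the abstract's first two steps, simulating OCR-like noise from clean text and keeping the hardest simulated samples, might look roughly like the following. The confusion table, the edit operations, the noise rate, and the function names simulate_ocr_noise and mine_hard_examples are all assumptions made for illustration; the stability loss is not shown.

```python
import random

# Hypothetical table of visually similar characters an OCR engine might confuse.
CONFUSIONS = {"o": "0", "l": "1", "i": "l", "e": "c", "s": "5", "a": "o"}


def simulate_ocr_noise(text, rate=0.1, rng=None):
    """Return a noisy copy of `text` (assumed rule-based noise, for illustration).

    Each character is left alone with probability 1 - rate; otherwise it is
    substituted, deleted, or followed by a random inserted letter.
    """
    rng = rng or random.Random()
    out = []
    for ch in text:
        if rng.random() >= rate:
            out.append(ch)  # keep the character unchanged
            continue
        op = rng.choice(["substitute", "delete", "insert"])
        if op == "substitute":
            # Fall back to the original character if no confusion is listed.
            out.append(CONFUSIONS.get(ch, ch))
        elif op == "insert":
            out.append(ch)
            out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))
        # "delete": append nothing
    return "".join(out)


def mine_hard_examples(candidates, loss_fn, k):
    """Keep the k candidates with the highest loss under the current model."""
    return sorted(candidates, key=loss_fn, reverse=True)[:k]
```

In a training loop one would presumably generate many noisy variants per clean labelled text, score them with the current classifier's loss via loss_fn, and add only the top-k to the next training round; that loop is an assumption about usage, not a description of the authors' implementation.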

Subjects :: Process (engineering)
Computer science
business.industry
Optical character recognition
Machine learning
computer.software_genre
ComputingMethodologies_PATTERNRECOGNITION
Robust learning
Margin (machine learning)
Robustness (computer science)
Simple (abstract algebra)
ComputingMethodologies_DOCUMENTANDTEXTPROCESSING
Noise (video)
Artificial intelligence
business
computer
Multi-source

Details

ISBN :: 978-3-030-86516-0
ISBNs :: 9783030865160
Database :: OpenAIRE
Journal :: Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track ISBN: 9783030865160, ECML/PKDD (5)
Accession number :: edsair.doi...........a2372f1712cb2a7c85e8e8c5395664f6
Full Text :: https://doi.org/10.1007/978-3-030-86517-7_18
