Automatical sampling with heterogeneous corpora for grammatical error correction

Authors :: Shichang Zhu
Jianjian Liu
Ying Li
Zhengtao Yu
Source :: Complex & Intelligent Systems, Vol 11, Iss 1, Pp 1-11 (2024)
Publication Year :: 2024
Publisher :: Springer, 2024.
Abstract: Thanks to the strong representation capability of the pre-trained language models, supervised grammatical error correction has achieved promising performance. However, traditional model training depends significantly on the large scale of similar distributed samples. The model performance decreases sharply once the distributions of training and testing data are inconsistent. To address this issue, we propose an automatic sampling approach to effectively select high-quality samples from different corpora and filter out irrelevant or harmful ones. Concretely, we first provide a detailed analysis of error type and sentence length distributions on all datasets. Second, our corpus weighting approach is exploited to yield different weights for each sample automatically based on analysis results, thus emphasizing beneficial samples and ignoring the noisy ones. Finally, we enhance typical Seq2Seq and Seq2Edit grammatical error correction models with pre-trained language models and design a model ensemble algorithm for integrating the advantages of heterogeneous models and weighted samples. Experiments on the benchmark datasets demonstrate that the proper utilization of different corpora is extremely helpful in enhancing the accuracy of grammatical error correction. The detailed analysis gains more insights into the effect of different corpus weighting strategies.

Subjects :: Grammatical error correction
Automatical sampling
Corpus weighting
Heterogeneous model ensemble
Pre-trained language models
Electronic computers. Computer science
QA75.5-76.95
Information technology
T58.5-58.64

Details

Language :: English
ISSN :: 2199-4536 and 2198-6053
Volume :: 11
Issue :: 1
Database :: Directory of Open Access Journals
Journal :: Complex & Intelligent Systems
Publication Type :: Academic Journal
Accession number :: edsdoj.317d7bcdd2a47529935c35ada7e8d56
Document Type :: article
Full Text :: https://doi.org/10.1007/s40747-024-01653-3
