
Evaluating the Impact of Text Data Augmentation on Text Classification Tasks using DistilBERT.

Authors :
Nair, Aarathi Rajagopalan
Singh, Rimjhim Padam
Gupta, Deepa
Kumar, Priyanka
Source :
Procedia Computer Science; 2024, Vol. 235, p102-111, 10p
Publication Year :
2024

Abstract

Data augmentation entails artificially expanding a dataset's size by applying various transformations to the existing raw data. Enhancing the quality and quantity of datasets of varying sizes through varied data augmentation techniques is of great importance in the field of Natural Language Processing. Several notable applications, for instance text classification, sentiment analysis, and text summarization, have been shown to benefit substantially from text augmentation techniques. Hence, the paper focuses on efficient text classification using datasets of different sizes: small (500 instances), medium (5564 instances), and large (43934 instances). The work considers the standard DistilBERT model, a popular transformer-based language model, and presents the impact on the model's performance after employing different text augmentation techniques. The study specifically focuses on three augmentation methods: (a) synonym augmentation, which involves replacing words with their synonyms to enhance vocabulary diversity and generalization; (b) contextual word embeddings, which enrich semantic understanding by leveraging pre-trained language models; and (c) back translation, which entails translating the text into a different language and then translating it back, introducing variations in the data and capturing different linguistic patterns. Additionally, the work discusses the combined effect of employing all three augmentation techniques simultaneously. Moreover, the study compares the relationship between dataset size and the performance of the augmentation techniques. The study considers three standard datasets and presents a comprehensive analysis using accuracy and F1 score as evaluation metrics. The results highlight the efficacy of each technique across small, medium, and large datasets, enabling a nuanced understanding of their benefits in different data scenarios.
The findings indicate varying degrees of improvement across the augmentation techniques. The enhancement achieved by applying text augmentation ranged from around 2% on large datasets to 20% on smaller datasets. [ABSTRACT FROM AUTHOR]
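
As an illustration of the synonym augmentation idea described in the abstract, the following is a minimal Python sketch. It is not the authors' implementation, which the abstract does not describe. The synonym table `SYNONYMS`, the functions `synonym_augment` and `augment_dataset`, and the `prob` and `copies` parameters are all hypothetical names introduced here; a real pipeline would more likely draw synonyms from a lexical resource rather than a hand-written dictionary.

```python
import random

# Hypothetical toy synonym table (an assumption for illustration only);
# a real pipeline would typically use a lexical resource instead.
SYNONYMS = {
    "good": ["great", "fine"],
    "movie": ["film"],
    "quick": ["fast", "rapid"],
}


def synonym_augment(text, prob=0.5, seed=None):
    """Return a copy of `text` in which each word that has an entry in
    SYNONYMS is replaced by a randomly chosen synonym with probability `prob`."""
    rng = random.Random(seed)
    out = []
    for token in text.split():
        candidates = SYNONYMS.get(token.lower())
        if candidates and rng.random() < prob:
            out.append(rng.choice(candidates))
        else:
            out.append(token)
    return " ".join(out)


def augment_dataset(texts, labels, copies=1, prob=0.5, seed=0):
    """Append `copies` augmented variants of every example to the dataset.
    Each augmented text keeps the label of the example it was derived from."""
    aug_texts, aug_labels = list(texts), list(labels)
    for i in range(copies):
        for j, (text, label) in enumerate(zip(texts, labels)):
            # Use a distinct seed per generated example so runs are reproducible.
            example_seed = seed + i * len(texts) + j
            aug_texts.append(synonym_augment(text, prob, seed=example_seed))
            aug_labels.append(label)
    return aug_texts, aug_labels
```

Calling `augment_dataset(texts, labels, copies=1)` on a 500-instance dataset would yield 1000 training examples under these assumptions. The augmented set would then be used to fine-tune the classifier in place of the original data; augmentation is normally applied to the training split only, so that evaluation metrics reflect unmodified text.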

Details

Language :
English
ISSN :
18770509
Volume :
235
Database :
Supplemental Index
Journal :
Procedia Computer Science
Publication Type :
Academic Journal
Accession number :
177603593
Full Text :
https://doi.org/10.1016/j.procs.2024.04.013