
STUDY OF THE REPRESENTATIVENESS OF KAZAKH LANGUAGE CORPORA BY WORD STEMS FOR THE SUMMARIZATION.

Authors :
Жабаев, Т. Р.
Тукеев, У. А.
Source :
Vestnik KazUTB. 2024, Vol. 2 Issue 23, p104-111. 8p.
Publication Year :
2024

Abstract

The aim of this work is to show that the representativeness of a corpus for training a neural model can be determined before conducting resource-intensive experiments. We investigated how the quality of a summarization model depends on the number of word stems in its training dataset. The work was carried out on a synthetic summarization dataset for the Kazakh language. Taking the number of word stems as the representativeness metric, we analyzed the quality of three summarization models trained on datasets that differ in the number of rows. To obtain these datasets, we split the training dataset into three parts of different sizes. BLEU scores were obtained for each model at inference time on the test files. The highest BLEU scores were obtained by the model trained on the largest amount of data. When the training dataset was reduced by 50 percent, the decrease in score ranged from 4 to 25. On the smallest dataset, the decrease ranged from 25 to 31. The experiments showed that the model whose training dataset contains the largest number of stems achieves the highest BLEU score. The scientific contribution of the work is experimental evidence that the representativeness of a training corpus can be assessed by its number of stems before the neural model is trained. [ABSTRACT FROM AUTHOR]
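The stem-count metric described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the stemmer `toy_stem` and its short ending list `TOY_ENDINGS` are hypothetical stand-ins for a real Kazakh stemmer, the subset fractions (100, 50 and 25 percent) are chosen arbitrarily because the abstract does not state the size of the smallest part, and taking each subset as a prefix of the row list is only one possible way to split the data.

```python
from typing import Callable, Dict, Sequence

# Hypothetical, deliberately tiny list of Kazakh endings (illustration only).
TOY_ENDINGS = ("лар", "лер", "дар", "дер", "тар", "тер", "да", "де", "та", "те")


def toy_stem(word: str) -> str:
    """Strip the longest matching toy ending, keeping at least two characters."""
    word = word.lower()
    for ending in sorted(TOY_ENDINGS, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= 2:
            return word[: -len(ending)]
    return word


def count_unique_stems(rows: Sequence[str],
                       stem: Callable[[str], str] = toy_stem) -> int:
    """Count distinct stems over all whitespace-separated tokens in the rows."""
    stems = set()
    for row in rows:
        for token in row.split():
            token = token.strip(".,!?;:\"'()«»")
            if token:
                stems.add(stem(token))
    return len(stems)


def prefix_subsets(rows: Sequence[str],
                   fractions: Sequence[float] = (1.0, 0.5, 0.25)
                   ) -> Dict[float, Sequence[str]]:
    """Return prefix subsets of the rows; the fractions are assumed values."""
    return {f: rows[: max(1, int(len(rows) * f))] for f in fractions}


if __name__ == "__main__":
    corpus = [
        "Кітаптар үстелде жатыр",
        "Балалар мектепке барды",
        "Кітап қызық",
        "Мектеп үлкен",
    ]
    for fraction, subset in prefix_subsets(corpus).items():
        print(fraction, len(subset), count_unique_stems(subset))
```

Comparing the stem counts of the candidate training sets before any training is the kind of low-cost check the abstract argues for. In practice the toy stemmer would be replaced by a proper morphological segmenter for Kazakh.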

Details

Language :
English
ISSN :
2708-4132
Volume :
2
Issue :
23
Database :
Academic Search Index
Journal :
Vestnik KazUTB
Publication Type :
Academic Journal
Accession number :
178611556
Full Text :
https://doi.org/10.58805/kazutb.v.2.23-366