STUDY OF THE REPRESENTATIVENESS OF KAZAKH LANGUAGE CORPORA BY WORD STEMS FOR THE SUMMARIZATION.

Authors :: Жабаев, Т. Р.
Тукеев, У. А.
Source :: Vestnik KazUTB. 2024, Vol. 2 Issue 23, p104-111. 8p.
Publication Year :: 2024
Abstract: The aim of the work is to show that the representativeness of a corpus for training a neural model can be determined before conducting resource-intensive experiments. We investigated how the quality of a summarization model depends on the number of word stems in its training data. The work was carried out on a synthetic summarization dataset for the Kazakh language. Taking the number of word stems as the representativeness metric, we analyzed the quality of three summarization models as a function of the number of word stems in their training datasets. These training datasets differ in the number of rows; to obtain them, we split the training dataset into three parts of different sizes. BLEU scores were obtained for each model during inference on the test files. The highest BLEU scores are obtained for the model trained on the largest amount of data. When the train dataset was reduced by 50 percent, the score decreased from 4 to 25. On the smallest dataset, the score dropped from 25 to 31. The experimental part of the work showed that the model with the largest number of stems achieves the highest BLEU score. The scientific contribution of the work is experimental evidence that the representativeness of a training corpus can be assessed by its number of stems before training the neural model. [ABSTRACT FROM AUTHOR]
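
The stem-count metric described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the suffix list, the `toy_stem` function, and the subset fractions (1.0, 0.5, 0.25) are hypothetical placeholders, and a real study would use a proper Kazakh morphological segmenter. The sketch only shows how the number of distinct stems might be compared across training subsets of different sizes before any model is trained.

```python
from typing import Callable, Dict, Sequence

# Hypothetical toy suffix list for illustration only; NOT a real Kazakh stemmer.
TOY_SUFFIXES = ("лар", "лер", "дар", "дер", "тар", "тер", "ның", "нің", "да", "де")


def toy_stem(word: str) -> str:
    """Strip at most one suffix from the toy list, keeping at least 2 characters."""
    for suffix in sorted(TOY_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def count_stems(rows: Sequence[str], stem: Callable[[str], str] = toy_stem) -> int:
    """Return the number of distinct stems found in the given text rows."""
    stems = set()
    for row in rows:
        for token in row.lower().split():
            token = token.strip(".,!?;:\"'()")
            if token:
                stems.add(stem(token))
    return len(stems)


def stem_counts_by_size(
    rows: Sequence[str], fractions: Sequence[float] = (1.0, 0.5, 0.25)
) -> Dict[float, int]:
    """Count distinct stems in prefixes of the corpus (fractions are assumed)."""
    return {
        f: count_stems(rows[: max(1, int(len(rows) * f))]) for f in fractions
    }


# Tiny made-up corpus standing in for the rows of a training dataset.
rows = [
    "кітаптар үстелде",
    "кітап жақсы",
    "балалар мектепте",
    "бала кітап оқыды",
]
counts = stem_counts_by_size(rows)
```

Because each smaller subset here is a prefix of the larger one, the stem count can only stay the same or fall as the subset shrinks, which is the quantity the abstract relates to BLEU.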

Subjects :: *AUTOMATIC summarization
*CORPORA
*TEXT summarization
*LANGUAGE & languages

Details

Language :: English
ISSN :: 2708-4132
Volume :: 2
Issue :: 23
Database :: Academic Search Index
Journal :: Vestnik KazUTB
Publication Type :: Academic Journal
Accession number :: 178611556
Full Text :: https://doi.org/10.58805/kazutb.v.2.23-366

