Abstractive text summarization and new large-scale datasets for agglutinative languages Turkish and Hungarian.

Authors :
Baykara, Batuhan
Güngör, Tunga
Source :
Language Resources & Evaluation. Sep 2022, Vol. 56, Issue 3, pp. 973-1007. 35 pp.
Publication Year :
2022

Abstract

Due to the exponential growth in the number of documents on the Web, accessing the salient information relevant to a user's need is gaining importance, which increases the popularity of text summarization. Recent progress in deep learning has shifted research in text summarization from extractive methods towards more abstractive approaches. The research and the available resources remain mostly limited to the English language, which prevents progress in other languages. There is a need in low-resource languages to gather large-scale resources suitable for such tasks. In this study, we release two large-scale datasets (TR-News and HU-News) that can serve as benchmarks in the abstractive summarization task for Turkish and Hungarian. The datasets are primarily compiled for text summarization, but are also suitable for other tasks such as topic classification, title generation, and key phrase extraction. Morphology is important for these agglutinative languages since meaning is carried mostly within the morphemes of the words. We utilize these morphological properties for tokenization to retain the semantic information and reduce the vocabulary sparsity introduced by the agglutinative nature of these languages. Using the compiled datasets, we propose linguistically oriented tokenization methods (SeparateSuffix and CombinedSuffix) and evaluate them on state-of-the-art abstractive summarization models. The SeparateSuffix method achieves the highest ROUGE-1 score on the TR-News dataset and provides promising results on the HU-News dataset. In another experiment, we show that the multilingual cased BERT model outperforms monolingual BERT models for both languages and reaches the highest ROUGE-1 score on the HU-News dataset. Lastly, we provide a qualitative analysis of the generated summaries on the TR-News dataset. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
1574-020X
Volume :
56
Issue :
3
Database :
Academic Search Index
Journal :
Language Resources & Evaluation
Publication Type :
Academic Journal
Accession number :
158609437
Full Text :
https://doi.org/10.1007/s10579-021-09568-y