ElmNet: a benchmark dataset for generating headlines from Persian papers

Authors :: Mohammad E. Shenassa; Behrouz Minaei-Bidgoli
Source :: Multimedia Tools and Applications. 81:1853-1866
Publication Year :: 2021
Publisher :: Springer Science and Business Media LLC, 2021.
Abstract: Headline generation is a challenging subtask of abstractive text summarization whose output should be a summary shorter than one sentence. A dataset for evaluating abstractive summarization methods on this task in the Persian language would be valuable. Several datasets for headline generation in Persian exist, but most are not large enough for more sophisticated text summarization methods, such as deep learning models. Moreover, all of these datasets focus on daily news, and there is no dataset for summarizing scientific papers in Persian. In this article, we present “ElmNet,” a headline generation dataset of about 400,000 abstract/headline pairs from scientific papers, gathered from six major publishers of scientific articles in Persian. We also evaluate the performance of the most important deep learning-based headline generation methods on the proposed dataset. The results show that the performance of the state-of-the-art methods on this task is comparable to their results on the existing English datasets.