1. ElmNet: a benchmark dataset for generating headlines from Persian papers
- Author
Mohammad E. Shenassa and Behrouz Minaei-Bidgoli
- Subjects
Computer Networks and Communications, business.industry, Computer science, Deep learning, Comparability, Headline, computer.software_genre, Automatic summarization, language.human_language, Task (project management), Hardware and Architecture, Media Technology, language, Benchmark (computing), Artificial intelligence, business, computer, Software, Natural language processing, Sentence, Persian
- Abstract
Headline generation is a challenging subtask of abstractive text summarization whose output should be a summary shorter than one sentence. A dataset for evaluating abstractive summarization methods on this task in the Persian language would be valuable. Several datasets for headline generation in Persian exist, but most are not large enough to be used by more sophisticated text summarization methods, such as deep learning models. Moreover, all of these datasets focus on daily news, and there is no dataset for summarizing scientific papers in Persian. In this article, we present “ElmNet,” a headline generation dataset of about 400,000 abstract/headline pairs from scientific papers, gathered from six major publishers of scientific articles in Persian. We also evaluate the performance of the most important deep learning-based headline generation methods on the proposed dataset. The results show that the performance of the state-of-the-art methods on this task is comparable to their results on the existing English datasets.
- Published
2021