A novel ensemble statistical topic extraction method for scientific publications based on optimization clustering.

Authors :
Abasi, Ammar Kamal
Khader, Ahamad Tajudin
Al-Betar, Mohammed Azmi
Naim, Syibrah
Makhadmeh, Sharif Naser
Alyasseri, Zaid Abdi Alkareem
Source :
Multimedia Tools & Applications; 2021, Vol. 80 Issue 1, p37-82, 46p
Publication Year :
2021

Abstract

Automatic topic extraction (TE) from scientific publications provides a compact summary of the contents of document clusters, which helps in locating information and in defining the boundaries of scientific fields. Text document clustering (TDC) is generally the first step of topic identification: it groups documents that address related subject matter. Metaheuristics are typically used as efficient approaches for TDC. The multi-verse optimizer (MVO) is a stochastic population-based algorithm that was recently proposed and has been successfully applied to many hard optimization problems. Each statistical TE method focuses on different aspects of the language feature space. The aim of this paper is to design a novel ensemble method for automatic TE from a collection of scientific publications, using MVO as the clustering algorithm. The automatic TE methods used in our approach are term frequency-inverse document frequency (TF-IDF), most-frequent-based keyword extraction (TF), co-occurrence statistical information-based keyword extraction (CSI), TextRank (TR), and mutual information (MI). Each automatic TE method provides a group of candidate topics to the proposed ensemble method. The ensemble approach then prunes the candidate topic set by applying a specific filtering heuristic, recalculates the candidates' scores based on prescribed metrics, and applies dynamic threshold functions to select a set of topics for the given scientific publications. The findings emphasize the efficiency and effectiveness of the refined candidate set. The results also show that the new topics improve the system's quality. The proposed method achieved better precision and recall than state-of-the-art TE methods on a similar dataset. [ABSTRACT FROM AUTHOR]
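The ensemble pipeline described in the abstract (per-method candidate topics, a filtering step, rescoring, and a dynamic threshold) can be illustrated with a rough sketch. This is not the authors' implementation: the abstract does not specify the filtering heuristic, the rescoring metrics, or the threshold functions, so the choices below (a minimum token length filter, averaging of max-normalized scores, and a mean-score threshold) are placeholder assumptions. Only two of the five TE methods (TF and TF-IDF) are shown, the smoothed IDF formula is an assumption, the documents are assumed to be already clustered, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplistic tokenizer (assumption): lowercase alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def tf_scores(cluster_docs):
    # Raw term frequency over all documents in one cluster.
    return dict(Counter(tok for doc in cluster_docs for tok in tokenize(doc)))


def tfidf_scores(cluster_docs, all_docs):
    # Cluster term frequency weighted by a smoothed IDF over the collection.
    n = len(all_docs)
    df = Counter()
    for doc in all_docs:
        df.update(set(tokenize(doc)))
    tf = tf_scores(cluster_docs)
    return {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}


def normalize(scores):
    # Scale scores to [0, 1] so different methods are comparable.
    top = max(scores.values(), default=0.0)
    return {t: s / top for t, s in scores.items()} if top > 0 else {}


def ensemble_topics(score_sets, min_len=3, top_k=10):
    combined = {}
    for scores in score_sets:
        norm = normalize(scores)
        # Each method contributes its top_k candidates.
        for term in sorted(norm, key=norm.get, reverse=True)[:top_k]:
            if len(term) < min_len:
                continue  # placeholder filtering heuristic (assumption)
            # Placeholder rescoring (assumption): mean of normalized scores.
            combined[term] = combined.get(term, 0.0) + norm[term] / len(score_sets)
    if not combined:
        return []
    # Placeholder dynamic threshold (assumption): mean combined score.
    threshold = sum(combined.values()) / len(combined)
    kept = [t for t, s in combined.items() if s >= threshold]
    return sorted(kept, key=combined.get, reverse=True)


cluster = [
    "clustering of text documents with clustering algorithms",
    "text clustering using optimization",
]
others = [
    "image segmentation with neural networks",
    "neural networks for image classification",
]
topics = ensemble_topics(
    [tf_scores(cluster), tfidf_scores(cluster, cluster + others)]
)
print(topics)
```

Because the threshold is derived from the candidates' own scores, the number of topics kept varies from cluster to cluster instead of being fixed in advance, which is one plausible reading of "dynamic" in the abstract.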

Details

Language :
English
ISSN :
13807501
Volume :
80
Issue :
1
Database :
Complementary Index
Journal :
Multimedia Tools & Applications
Publication Type :
Academic Journal
Accession number :
148024980
Full Text :
https://doi.org/10.1007/s11042-020-09504-2