
Selection of the most relevant terms based on a max-min ratio metric for text classification.

Authors :
Rehman, Abdur
Javed, Kashif
Babri, Haroon A.
Asim, Muhammad Nabeel
Source :
Expert Systems with Applications. Dec2018, Vol. 114, p78-96. 19p.
Publication Year :
2018

Abstract

Highlights
• We illustrated weaknesses of the balanced accuracy and normalized difference measures.
• We proposed a new feature ranking metric called max-min ratio (MMR).
• MMR better estimates the true worth of a term under high class skews.
• We tested MMR against 8 well-known metrics on 6 datasets with 2 classifiers.
• MMR statistically outperforms the other metrics in 76% of macro F1 cases and 74% of micro F1 cases.

Abstract: Text classification automatically assigns text documents to one or more predefined categories based on their content. In text classification, data are characterized by a large number of highly sparse terms and highly skewed categories. Working with all the terms in the data has an adverse impact on the accuracy and efficiency of text classification tasks. A feature selection algorithm helps in selecting the most relevant terms. In this paper, we propose a new feature ranking metric called max-min ratio (MMR). It is the product of the max-min ratio of the true positives and false positives and their difference, which allows MMR to select smaller subsets of more relevant terms even in the presence of highly skewed classes. This results in performing text classification with higher accuracy and greater efficiency. To investigate the effectiveness of our newly proposed metric, we compare its performance against eight metrics (balanced accuracy measure, information gain, chi-squared, Poisson ratio, Gini index, odds ratio, distinguishing feature selector, and normalized difference measure) on six data sets, namely WebACE (WAP, K1a, K1b), Reuters (RE0, RE1), and 20 Newsgroups, using the multinomial naive Bayes (MNB) and support vector machines (SVM) classifiers. The statistical significance of MMR has been estimated on 5 different splits of training and test data sets using the one-way analysis of variance (ANOVA) method and a multiple comparisons test based on the Tukey–Kramer method.
We found that the performance of MMR is statistically significantly better than that of the other 8 metrics in 76.2% of cases in terms of the macro F1 measure and in 74.4% of cases in terms of the micro F1 measure. [ABSTRACT FROM AUTHOR]
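The abstract describes MMR as the product of the max-min ratio of a term's true positives and false positives and their difference. A minimal sketch of such a score, assuming document-frequency rates as inputs and an added smoothing constant to avoid division by zero (the exact formulation, including any smoothing, is defined in the paper itself, not here):

```python
# Hypothetical sketch of a max-min-ratio-style feature ranking score,
# reconstructed only from the abstract's verbal description.

def mmr_score(tp, fp, eps=1e-6):
    """Score a term by (max/min ratio of tp and fp) * |tp - fp|.

    tp:  fraction of positive-class documents containing the term (assumed input)
    fp:  fraction of negative-class documents containing the term (assumed input)
    eps: smoothing constant to avoid division by zero (an assumption)
    """
    ratio = max(tp, fp) / (min(tp, fp) + eps)
    return ratio * abs(tp - fp)

# A term concentrated in one class outranks a term spread evenly across both,
# which is the behavior the abstract attributes to MMR under class skew.
discriminative = mmr_score(0.80, 0.05)  # appears mostly in one class
uninformative = mmr_score(0.60, 0.55)   # appears similarly in both classes
```

Ranking terms by such a score and keeping the top-k would then yield the smaller subsets of relevant terms the abstract refers to.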

Details

Language :
English
ISSN :
09574174
Volume :
114
Database :
Academic Search Index
Journal :
Expert Systems with Applications
Publication Type :
Academic Journal
Accession number :
131885055
Full Text :
https://doi.org/10.1016/j.eswa.2018.07.028