Combining Multiple Feature-Ranking Techniques and Clustering of Variables for Feature Selection

Authors :
Anwar Ul Haq
Defu Zhang
He Peng
Sami Ur Rahman
Source :
IEEE Access, Vol 7, Pp 151482-151492 (2019)
Publication Year :
2019
Publisher :
IEEE, 2019.

Abstract

Feature selection aims to eliminate redundant or irrelevant variables from input data in order to reduce computational cost, provide a better understanding of the data, and improve prediction accuracy. The majority of existing filter methods rely on a single feature-ranking technique, which may overlook important assumptions about the underlying regression function linking the input variables with the output. In this paper, we propose a novel feature selection framework that combines clustering of variables with multiple feature-ranking techniques to select an optimal feature subset. Different feature-ranking methods typically select different subsets, as each method makes its own assumption about the regression function linking the input variables with the output. We therefore employ multiple feature-ranking methods with disjoint assumptions about the regression function. The proposed approach has a feature-ranking module to identify relevant features and a clustering module to eliminate redundant features. First, input variables are ranked using the regression coefficients obtained by training L1-regularized Logistic Regression, Support Vector Machine, and Random Forest models. Features ranked below a certain threshold are filtered out. The remaining features are grouped into clusters using an exemplar-based clustering algorithm, which identifies data points that best exemplify the data and associates each data point with an exemplar. We use both linear correlation coefficients and information gain to measure the association between a data point and its corresponding exemplar. From each cluster, the highest-ranked feature is selected as a delegate, and the delegates from the three ranked lists are combined into the final feature set with a union operation. Empirical results on a number of real-world data sets confirm the hypothesis that combining features selected by multiple heterogeneous methods yields a more robust feature set and improves prediction accuracy. Compared with the other feature selection approaches evaluated, the features selected by the linear correlation-based multi-filter method achieved the best classification accuracy: 98.7%, 100%, 92.3%, and 100% on the Ionosphere, Wisconsin Breast Cancer, Sonar, and Wine data sets, respectively.
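
The pipeline described in the abstract can be illustrated with a short sketch. The Python code below is not the authors' implementation; it assumes scikit-learn estimators for the three ranking models, uses AffinityPropagation as a stand-in for the exemplar-based clustering step, uses absolute Pearson correlation as the association measure, and applies an arbitrary top-50% rank threshold.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import AffinityPropagation

def rank_features(model, X, y):
    # Rank features by the magnitude of the fitted model's coefficients or importances.
    model.fit(X, y)
    if hasattr(model, "feature_importances_"):
        return model.feature_importances_
    return np.abs(model.coef_).sum(axis=0)  # aggregate per-class coefficients

def select_delegates(X, scores, keep_ratio=0.5):
    # Keep the top-ranked features, cluster them, and pick one delegate per cluster.
    keep = np.argsort(scores)[::-1][: max(1, int(keep_ratio * X.shape[1]))]
    # Exemplar-based clustering on a similarity matrix of absolute correlations
    # between the surviving features (illustrative choice of similarity measure).
    sim = np.abs(np.corrcoef(X[:, keep], rowvar=False))
    labels = AffinityPropagation(affinity="precomputed", random_state=0).fit_predict(sim)
    delegates = set()
    for c in np.unique(labels):
        members = keep[labels == c]
        delegates.add(int(members[np.argmax(scores[members])]))  # highest-ranked member
    return delegates

def multi_filter_select(X, y):
    # Union of the delegates produced by three heterogeneous ranking models.
    models = [
        LogisticRegression(penalty="l1", solver="liblinear"),
        LinearSVC(penalty="l1", dual=False),
        RandomForestClassifier(n_estimators=200, random_state=0),
    ]
    selected = set()
    for m in models:
        selected |= select_delegates(X, rank_features(m, X, y))
    return sorted(selected)

On a data set such as Wisconsin Breast Cancer (load_breast_cancer in scikit-learn), multi_filter_select(X, y) returns the indices of the selected features, which can then be passed to any downstream classifier.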

Details

Language :
English
ISSN :
21693536
Volume :
7
Database :
Directory of Open Access Journals
Journal :
IEEE Access
Publication Type :
Academic Journal
Accession number :
edsdoj.6909dac638d64fa6919fbdc646e5dc28
Document Type :
article
Full Text :
https://doi.org/10.1109/ACCESS.2019.2947701