Empirical evaluation of the performance of data sampling and feature selection techniques for software fault prediction.

Authors :
Rathi, Sonika Chandrakant
Misra, Sanjay
Colomo-Palacios, Ricardo
Adarsh, R.
Neti, Lalita Bhanu Murthy
Kumar, Lov
Source :
Expert Systems with Applications. Aug2023, Vol. 223, pN.PAG-N.PAG. 1p.
Publication Year :
2023

Abstract

The application of Software Fault Prediction (SFP) in the software development life cycle to predict faulty classes at an early stage has attracted the interest of many researchers. In the SFP domain, very little work has addressed the class imbalance and feature redundancy problems jointly to improve the performance and prediction accuracy of SFP models. The literature survey revealed a dearth of studies offering a comprehensive comparative analysis of different sampling and feature selection strategies together. This research presents an extensive assessment of distinct combinations of feature selection and sampling approaches to overcome the problems of class overlap, class imbalance, and feature redundancy. The objective is to determine the combination that yields the most accurate and effective SFP model. To this end, the study applied 8 sampling techniques along with 10 feature selection algorithms to 56 open-source projects. The comparative analysis is performed on 5346 variants of the input datasets by applying 8 classifiers to predict faulty classes. In addition, the paper presents an intensive assessment of the performance of these techniques individually on all the input projects. We considered accuracy and the Area Under the Receiver Operating Characteristic Curve (AUC) as performance metrics to compare the models developed using the classification algorithms. For each project, we evaluated a total of 792 combinations produced from 10 feature selection methods plus 1 all-metrics dataset, 8 sampling methods plus 1 original, unsampled dataset, and 8 classifiers.
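The per-project count of 792 follows from a full cross product of the three design factors. The sketch below enumerates that grid as an illustration only: the abstract gives counts rather than method names, so the labels `FS1`..`FS10`, `SAMPLER1`..`SAMPLER8`, and `CLF1`..`CLF8` are hypothetical placeholders. It also shows that 99 dataset variants per project over 54 projects (the project count cited in the results) gives 5346, which would be consistent with the stated number of dataset variants.

```python
from itertools import product

# Placeholder labels (assumptions): the abstract states only how many methods were used.
feature_sets = ["ALL_METRICS"] + [f"FS{i}" for i in range(1, 11)]    # 1 + 10 = 11
sampling_sets = ["ORIGINAL"] + [f"SAMPLER{i}" for i in range(1, 9)]  # 1 + 8 = 9
classifiers = [f"CLF{i}" for i in range(1, 9)]                       # 8

# One dataset variant per (feature set, sampling) pair.
dataset_variants = list(product(feature_sets, sampling_sets))

# One trained model per (feature set, sampling, classifier) triple.
model_configs = list(product(feature_sets, sampling_sets, classifiers))

print(len(dataset_variants))       # 99
print(len(model_configs))          # 792
print(len(dataset_variants) * 54)  # 5346
```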
The empirical results indicate that, for 21 of 54 projects (38.89 %), the combination of Synthetic Minority Over Sampling Technique Edited (SMOTEE) with correlation-based feature selection (FS2) achieved the highest AUC value. Additionally, 24.07 % of projects attained their highest AUC values using the SMOTEE, FS2, and RF combination. The results of the statistical analysis test reveal that 93.42 % of the pairs of sampling and feature selection combinations differ significantly in performance. The empirical results also indicate that the performance of the SFP model is adversely impacted by class imbalance and irrelevance. For more than 75 % of projects, the performance of trained models improved after applying sampling and feature selection strategies, with AUC values ranging from 0.805 to 0.99, compared with models built without feature selection and sampling techniques. [ABSTRACT FROM AUTHOR]
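To illustrate why AUC is reported alongside accuracy when classes are imbalanced, the following is a minimal sketch of both metrics in plain Python. It is not the authors' evaluation code: the function names, the 0.5 threshold, and the toy labels and scores are assumptions made up for the example. AUC is computed as the fraction of (faulty, non-faulty) pairs in which the faulty class receives the higher score, counting ties as half.

```python
def auc_score(labels, scores):
    """AUC via pairwise comparison of positive (faulty) and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def accuracy_score(labels, scores, threshold=0.5):
    """Fraction of examples classified correctly at the given threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Toy, made-up data: 2 faulty classes out of 8 (imbalanced).
labels = [1, 1, 0, 0, 0, 0, 0, 0]
model_scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1, 0.35, 0.05]
trivial_scores = [0.0] * len(labels)  # always predicts "not faulty"

print(accuracy_score(labels, model_scores))        # 0.75
print(accuracy_score(labels, trivial_scores))      # 0.75
print(round(auc_score(labels, model_scores), 3))   # 0.917
print(auc_score(labels, trivial_scores))           # 0.5
```

On this toy data both predictors reach the same accuracy, while AUC separates the model that ranks faulty classes higher from the one that never flags a fault.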

Details

Language :
English
ISSN :
09574174
Volume :
223
Database :
Academic Search Index
Journal :
Expert Systems with Applications
Publication Type :
Academic Journal
Accession number :
163147489
Full Text :
https://doi.org/10.1016/j.eswa.2023.119806