Back to Search Start Over

Improvement of Malicious Software Detection Accuracy through Genetic Programming Symbolic Classifier with Application of Dataset Oversampling Techniques

Authors :
Nikola Anđelić
Sandi Baressi Šegota
Zlatan Car
Source :
Computers, Vol 12, Iss 12, p 242 (2023)
Publication Year :
2023
Publisher :
MDPI AG, 2023.

Abstract

Malware detection using hybrid features, combining binary and hexadecimal analysis with DLL calls, is crucial for leveraging the strengths of both static and dynamic analysis methods. Artificial intelligence (AI) enhances this process by enabling automated pattern recognition, anomaly detection, and continuous learning, allowing security systems to adapt to evolving threats and identify complex, polymorphic malware that may exhibit varied behaviors. This synergy of hybrid features with AI empowers malware detection systems to efficiently and proactively identify and respond to sophisticated cyber threats in real time. In this paper, the genetic programming symbolic classifier (GPSC) algorithm was applied to the publicly available dataset to obtain symbolic expressions (SEs) that could detect the malware software with high classification performance. The initial problem with the dataset was a high imbalance between class samples, so various oversampling techniques were utilized to obtain balanced dataset variations on which GPSC was applied. To find the optimal combination of GPSC hyperparameter values, the random hyperparameter value search method (RHVS) was developed and applied to obtain SEs with high classification accuracy. The GPSC was trained with five-fold cross-validation (5FCV) to obtain a robust set of SEs on each dataset variation. To choose the best SEs, several evaluation metrics were used, i.e., the length and depth of SEs, accuracy score (ACC), area under receiver operating characteristic curve (AUC), precision, recall, f1-score, and confusion matrix. The best-obtained SEs are applied on the original imbalanced dataset to see if the classification performance is the same as it was on balanced dataset variations. The results of the investigation showed that the proposed method generated SEs with high classification accuracy (0.9962) in malware software detection.

Details

Language :
English
ISSN :
2073431X
Volume :
12
Issue :
12
Database :
Directory of Open Access Journals
Journal :
Computers
Publication Type :
Academic Journal
Accession number :
edsdoj.893f7380be646378312f6c25203cb90
Document Type :
article
Full Text :
https://doi.org/10.3390/computers12120242