1. Exploring low-level statistical features of n-grams in phishing URLs: a comparative analysis with high-level features.
- Author
Tashtoush, Yahya; Alajlouni, Moayyad; Albalas, Firas; and Darwish, Omar
- Subjects
MACHINE learning, PATTERN recognition systems, CONVOLUTIONAL neural networks, FEATURE selection, STATISTICAL learning, UNIFORM Resource Locators, DEEP learning
- Abstract
Phishing attacks are among the biggest cybersecurity threats in the digital world. Attackers impersonate authentic websites to obtain sensitive information such as passwords and bank statements. One common technique is the malicious URL, which mimics a legitimate URL and misleads users into interacting with a malicious website. This practice, URL phishing, poses a serious threat to internet security and emphasizes the need for advanced detection methods. In this paper, we present a method for detecting malicious URLs with machine learning and deep learning models, using a set of low-level statistical features extracted from n-grams. These n-grams are extracted from the hexadecimal representation of URLs. We conducted four experiments. The first three used machine learning with the statistical features extracted from these n-grams, and the fourth fed the n-grams directly to deep learning models to evaluate their effectiveness. We also used Explainable AI (XAI) to explore the extracted features and evaluate their importance and role in phishing detection. A key advantage of our method is that, after applying XAI techniques, it requires fewer features and less training time. This stands in contrast to the previous study, which relies on high-level URL features, needs pre-processing, and uses a large number of features (87 high-level URL-based features). Our technique uses only the n-grams themselves and statistical features extracted from them, without any high-level features. We evaluated the method across different n-gram lengths (2, 4, 6, and 8), aiming to optimize detection accuracy.
In the first experiment, we extracted 12 common statistical features such as the mean and median; the XGBoost model achieved the highest accuracy, 82.41%, using 8-gram features. In the second experiment, we extracted an additional 13 features, for a total of 25, and XGBoost again achieved the highest accuracy, 86.40%. In the third experiment, we extracted an additional 16 features (character count features), which increased XGBoost accuracy to 88.15%. In the fourth experiment, we fed n-gram representations directly into deep learning models; the Convolutional Neural Network (CNN) model achieved the highest accuracy, 94.09%. We also applied two XAI techniques, SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME). The explanations they provided let us determine the most important features in our feature set and so reduce the feature count. Using fewer features (4, 7, 10, 13, 15), we obtained good accuracy compared to the 41 features used in experiment three while reducing the models' training times and complexity. Our findings show the importance of using minimal statistical features to identify malicious URLs. Notably, the CNN, using n-grams of URLs, reached an accuracy of 94.09%, surpassing traditional machine learning models.
This achievement not only validates the efficacy of deep learning models in complex pattern recognition tasks but also highlights the efficiency of our feature selection approach, which relies on fewer features and is less complex than existing high-level feature-based studies. The research outcomes demonstrate a promising pathway toward developing more robust, efficient, and scalable phishing detection systems. [ABSTRACT FROM AUTHOR]
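As a rough illustration of the low-level features the abstract describes, the Python sketch below hex-encodes a URL, slides a window of n hex characters over it, and summarizes the resulting values. It is only a sketch under stated assumptions, not the authors' implementation: the abstract does not say how n-grams are windowed or converted to numbers, and it names only the mean and median among its 12 statistical features. The function names `hex_ngrams` and `ngram_stats`, the `step` parameter, and the extra statistics (standard deviation, minimum, maximum) are hypothetical choices made for illustration.

```python
import statistics


def hex_ngrams(url: str, n: int, step: int = 1) -> list[str]:
    """Return windows of n hex characters over the hex encoding of url.

    Assumption: a sliding window with a configurable step; the abstract
    does not specify the windowing scheme.
    """
    hex_text = url.encode("utf-8").hex()
    return [hex_text[i:i + n] for i in range(0, len(hex_text) - n + 1, step)]


def ngram_stats(url: str, n: int) -> dict[str, float]:
    """Summarize the integer values of a URL's hex n-grams.

    Only mean and median are named in the abstract; the other entries
    are illustrative guesses at "common statistical features".
    """
    values = [int(gram, 16) for gram in hex_ngrams(url, n)]
    if not values:
        # URL too short to yield any n-gram of this length.
        return {"mean": 0.0, "median": 0.0, "stdev": 0.0, "min": 0.0, "max": 0.0}
    return {
        "mean": statistics.fmean(values),
        "median": float(statistics.median(values)),
        "stdev": statistics.pstdev(values),
        "min": float(min(values)),
        "max": float(max(values)),
    }


if __name__ == "__main__":
    # The n-gram lengths evaluated in the abstract.
    for n in (2, 4, 6, 8):
        print(n, ngram_stats("http://example.com/login", n))
```

A feature vector built this way could then be passed to a classifier such as XGBoost, while the fourth experiment's deep learning models would consume the n-gram sequence itself instead of its summary statistics.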
- Published
2024