1,903 results on '"SMOTE"'
Search Results
2. PDEDX: A Comprehensive Expert System for Early Detection of Parkinson’s Disease
- Author
-
Sanyal, Saptarsi, Shanmugarathinam, Watson, Naveen Vijayakumar, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Fortino, Giancarlo, editor, Kumar, Akshi, editor, Swaroop, Abhishek, editor, and Shukla, Pancham, editor
- Published
- 2025
3. Time Sensitive and Oversampling Learning for Systemic Crisis Forecasting
- Author
-
De Nicolò, Francesco, La Rocca, Marianna, Marrone, Antonio, Monaco, Alfonso, Tangaro, Sabina, Amoroso, Nicola, Bellotti, Roberto, Faggini, Marisa, Series Editor, Gallegati, Mauro, Series Editor, Kirman, Alan P., Series Editor, Lux, Thomas, Series Editor, Arecchi, Fortunato Tito, Editorial Board Member, Barile, Sergio, Editorial Board Member, Chakrabarti, Bikas K., Editorial Board Member, Chatterjee, Arnab, Editorial Board Member, Colander, David, Editorial Board Member, Day, Richard H., Editorial Board Member, Keen, Steve, Editorial Board Member, Lines, Marji, Editorial Board Member, Medio, Alfredo, Editorial Board Member, Ormerod, Paul, Editorial Board Member, Rosser, J. Barkley, Editorial Board Member, Solomon, Sorin, Editorial Board Member, Velupillai, Kumaraswamy, Editorial Board Member, Vriend, Nicolas, Editorial Board Member, and Pacelli, Vincenzo, editor
- Published
- 2025
4. Accurate Hepatitis C Prediction Through Rigorous Experimental Analysis Employing Ensemble Machine Learning Methods
- Author
-
Abdulla Hil Kafi, Md., Basak, Pritom, Sarower, Afjal H., Liza, Subarna Akter, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Mahmud, Mufti, editor, Kaiser, M. Shamim, editor, Bandyopadhyay, Anirban, editor, Ray, Kanad, editor, and Al Mamun, Shamim, editor
- Published
- 2025
5. Performance Analysis of Machine Learning Algorithms on Imbalanced Datasets Using SMOTE Technique
- Author
-
Santhosh Kumar, Bala, Praveen Yadav, Pasupula, Penchala Prasad, P., Angrisani, Leopoldo, Series Editor, Arteaga, Marco, Series Editor, Chakraborty, Samarjit, Series Editor, Chen, Shanben, Series Editor, Chen, Tan Kay, Series Editor, Dillmann, Rüdiger, Series Editor, Duan, Haibin, Series Editor, Ferrari, Gianluigi, Series Editor, Ferre, Manuel, Series Editor, Jabbari, Faryar, Series Editor, Jia, Limin, Series Editor, Kacprzyk, Janusz, Series Editor, Khamis, Alaa, Series Editor, Kroeger, Torsten, Series Editor, Li, Yong, Series Editor, Liang, Qilian, Series Editor, Martín, Ferran, Series Editor, Ming, Tan Cher, Series Editor, Minker, Wolfgang, Series Editor, Misra, Pradeep, Series Editor, Mukhopadhyay, Subhas, Series Editor, Ning, Cun-Zheng, Series Editor, Nishida, Toyoaki, Series Editor, Oneto, Luca, Series Editor, Panigrahi, Bijaya Ketan, Series Editor, Pascucci, Federica, Series Editor, Qin, Yong, Series Editor, Seng, Gan Woon, Series Editor, Speidel, Joachim, Series Editor, Veiga, Germano, Series Editor, Wu, Haitao, Series Editor, Zamboni, Walter, Series Editor, Tan, Kay Chen, Series Editor, Kumar, Amit, editor, Gunjan, Vinit Kumar, editor, Senatore, Sabrina, editor, and Hu, Yu-Chen, editor
- Published
- 2025
6. An Enhanced Artificial Neural Network Mode for Type 2 Diabetes Classification Using SMOTE and SMOTE-Tomek with Effective Feature Selection Methods
- Author
-
Sabitha, E., Durgadevi, M., Ghosh, Ashish, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Geetha, R., editor, Dao, Nhu-Ngoc, editor, and Khalid, Saeed, editor
- Published
- 2025
7. Fraud Detection in Online Payments Using Deep Learning Models for Sustainable Development
- Author
-
Pushkin, Sahil, Dixit, Shubhra, Bhan, Anupama, Ghosh, Ashish, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Whig, Pawan, editor, Silva, Nuno, editor, Elngar, Ahmad A., editor, Aneja, Nagender, editor, and Sharma, Pavika, editor
- Published
- 2025
8. An innovative machine learning optimization-based data fusion strategy for distributed wireless sensor networks.
- Author
-
Sollapure, Naganna Shankar and Govindaswamy, Poornima
- Subjects
MULTISENSOR data fusion, DISTRIBUTED sensors, SUPPORT vector machines, TIME complexity, FEATURE selection, WIRELESS sensor networks
- Abstract
Self-sufficient sensors scattered over different regions of the world comprise distributed wireless sensor networks (DWSNs), which track a range of environmental and physical factors such as pressure, temperature, vibration, sound, motion, and pollution. Data fusion becomes essential for combining information from various sensors and improving system performance. In this study, we propose a multi-class support vector machine (SDF-MCSVM) with synthetic minority over-sampling technique (SMOTE) data fusion to improve wireless sensor network (WSN) performance. The dataset includes 1,334 instances of hourly averaged responses for 12 variables from an air quality chemical multisensor device. To create a balanced dataset, the unbalanced data was first pre-processed using SMOTE. The grey wolf optimization (GWO) approach is then used to reduce features in an effort to improve the efficacy and efficiency of feature selection procedures. This method is applied to classify the fused feature vectors into multiple categories at once to improve classification performance in WSNs and address unbalanced datasets. The results show that the proposed method reaches high precision, accuracy, F1-score, recall, and specificity. The proposed method also reduced computational complexity and processing time. The successful integration of data fusion technologies shows great potential for accurate and timely data fusion in dispersed WSNs. [ABSTRACT FROM AUTHOR]
- Published
- 2024
9. A combinatorial deep learning method for Alzheimer’s disease classification-based merging pretrained networks.
- Author
-
Slimi, Houmem, Balti, Ala, Abid, Sabeur, and Sayadi, Mounir
- Abstract
Introduction: Alzheimer’s disease (AD) is a progressive neurodegenerative disorder characterized by cognitive decline, memory loss, and impaired daily functioning. Despite significant research, AD remains incurable, highlighting the critical need for early diagnosis and intervention to improve patient outcomes. Timely detection plays a crucial role in managing the disease more effectively. Pretrained convolutional neural networks (CNNs) trained on large-scale datasets, such as ImageNet, have been employed for AD classification, providing a head start for developing more accurate models. Methods: This paper proposes a novel hybrid deep learning approach that combines the strengths of two specific pretrained architectures. The proposed model enhances the representation of AD-related patterns by leveraging the feature extraction capabilities of both networks. We validated this model using a large dataset of MRI images from AD patients. Performance was evaluated in terms of classification accuracy and robustness against noise, and the results were compared to several commonly used models in AD detection. Results: The proposed hybrid model demonstrated significant performance improvements over individual models, achieving a classification accuracy of 99.85%. Comparative analysis with other models further revealed the superiority of the new architecture, particularly in terms of classification rate and resistance to noise interference. Discussion: The high accuracy and robustness of the proposed hybrid model suggest its potential utility in early AD detection. By improving feature representation through the combination of two pretrained networks, this model could provide clinicians with a more reliable tool for early diagnosis and monitoring of AD progression. This approach holds promise for aiding in timely diagnoses and treatment decisions, contributing to better management of Alzheimer’s disease. [ABSTRACT FROM AUTHOR]
- Published
- 2024
10. A proposed technique for predicting heart disease using machine learning algorithms and an explainable AI method.
- Author
-
El-Sofany, Hosam, Bouallegue, Belgacem, and El-Latif, Yasser M. Abd
- Abstract
One of the critical issues in medical data analysis is accurately predicting a patient’s risk of heart disease, which is vital for early intervention and reducing mortality rates. Early detection allows for timely treatment and continuous monitoring by healthcare providers, which is essential but often limited by the inability of medical professionals to provide constant patient supervision. Early detection of cardiac problems and continuous patient monitoring by physicians can help reduce death rates. Doctors cannot constantly have contact with patients, and heart disease detection is not always accurate. By offering a more solid foundation for prediction and decision-making based on data provided by healthcare sectors worldwide, machine learning (ML) could help physicians with the prediction and detection of heart disease. This study aims to use different feature selection strategies to produce an accurate ML algorithm for early heart disease prediction. We have chosen features using chi-square, ANOVA, and mutual information methods. The three feature groups chosen were SF-1, SF-2, and SF-3. The study employed ten machine learning algorithms to determine the most accurate technique and feature subset fit. The classification algorithms used include support vector machines (SVM), XGBoost, bagging, decision trees (DT), and random forests (RF). We evaluated the proposed heart disease prediction technique using a private dataset, a public dataset, and different cross-validation methods. We used the Synthetic Minority Oversampling Technique (SMOTE) to eliminate inconsistent data and discover the machine learning algorithm that achieves the most accurate heart disease predictions. Healthcare providers might identify early-stage heart disease quickly and cheaply with the proposed method. We have used the most effective ML algorithm to create a mobile app that instantly predicts heart disease based on the input symptoms.
The experimental results demonstrated that the XGBoost algorithm performed optimally when applied to the combined datasets and the SF-2 feature subset. It had 97.57% accuracy, 96.61% sensitivity, 90.48% specificity, 95.00% precision, a 92.68% F1 score, and a 98% AUC. We have developed an explainable AI method based on SHAP approaches to understand how the system makes its final predictions. [ABSTRACT FROM AUTHOR]
- Published
- 2024
11. Cryptocurrency price prediction using GPR and SMOTE.
- Author
-
GÖKÇEN, Tuğçe and ODABAŞ, Alper
- Subjects
*KRIGING, *PRICES, *U.S. dollar, *COMPUTER software, *TAX administration & procedure, *CRYPTOCURRENCIES
- Abstract
Cryptography is used by cryptocurrencies to shift money without the intervention of centralized financial institutions. They are decentralized digital assets. On rapidly changing exchanges like those for cryptocurrencies, it is a tremendously taxing procedure for people to keep track of many simultaneous instantaneous price changes. As a solution to this, computer software that can make fast and objective decisions by constantly observing can replace humans. In this study, the closing price of Bitcoin (BTC), which has the highest volume in the cryptocurrency market, is analyzed. The study used the Gaussian Process Regression (GPR) model and the SMOTE method, with BTC data for the period between 25/07/2010 and 05/06/2022 as the data set. Opening price, highest-lowest price, volume, dollar index and some indicators used in technical analysis were used as input parameters. The k-fold method was followed in the separation of training and test data. The data is divided into 5 subsets with k-fold. The mean MAPE value was found to be 1887, and the mean R² value was found to be 0.99977 in the models with SMOTE. In addition, the GPR model and the GPR model functions that were applied to the SMOTE method were compared by excluding the opening price, which was the price that was highest-lowest, from the data. This comparison was carried out to determine which model performed better. [ABSTRACT FROM AUTHOR]
- Published
- 2024
12. SMOTE-Based Automated PCOS Prediction Using Lightweight Deep Learning Models.
- Author
-
Ahmad, Rumman, Maghrabi, Lamees A., Khaja, Ishfaq Ahmad, Maghrabi, Louai A., and Ahmad, Musheer
- Subjects
*POLYCYSTIC ovary syndrome, *DEEP learning, *OVARIES, *FEATURE extraction, *ASIANS
- Abstract
Background: The reproductive age of women is particularly vulnerable to the effects of polycystic ovarian syndrome (PCOS). High levels of testosterone and other male hormones are frequent contributors to PCOS. It is believed that miscarriages and ovulation problems are majorly caused by PCOS. A recent study found that 31.3% of Asian women have been afflicted with PCOS. Healing women with life-threatening disorders associated with PCOS requires more research. In prior research, methods have autonomously classified PCOS using a number of different machine learning techniques. ML-based approaches involve hand-crafted feature extraction and suffer from low performance issues, which cannot be ignored for the accurate prediction and identification of PCOS. Objective: Hence, predicting PCOS using cutting-edge deep learning methods for automated feature engineering with better performance is the prime focus of this study. Methods: The proposed method suggests three lightweight (LSTM-based, CNN-based, and CNN-LSTM-based) deep learning models, incorporating SMOTE for dataset balancing to obtain a valid performance. Results: The proposed three models tend to offer an accuracy of 92.04%, 96.59%, and 94.31%, an ROC-AUC of 92.0%, 96.6%, and 94.3%, the number of parameters of 6689, 297, and 13285, and a training time of 67.27 s, 10.02 s, and 18.51 s, respectively. In addition, the DeLong test is also performed to compare AUCs to assess the statistical significance of all three models. Among all three models, the SMOTE + CNN model performs better than the others in terms of accuracy, precision, recall, AUC, number of parameters, training time, and DeLong's p-value. Conclusions: Moreover, a performance comparison is also carried out with other state-of-the-art PCOS detection studies and methods, which validates the better performance of the proposed model.
Thus, the proposed model provides the greatest performance, which can lead to a reduction in the number of failed pregnancies and help in finding PCOS in the early stages. [ABSTRACT FROM AUTHOR]
- Published
- 2024
13. Early and accurate prediction of diabetics based on FCBF feature selection and SMOTE.
- Author
-
Kishor, Amit and Chakraborty, Chinmay
- Abstract
Diabetes is a chronic hyperglycemic disorder. Every year hundreds of millions of people around the world have diabetes. The presence of irrelevant features and an imbalanced dataset are significant issues in training the model. The availability of patient medical records quantifies symptoms, body characteristics, and clinical laboratory test values that can be used in the study of biostatistics aimed at identifying patterns or characteristics that cannot be detected by current practice. This work proposes a machine learning-based healthcare model for accurate and early detection of diabetes. Five machine learning classifiers are used: logistic regression, K-nearest neighbor, Naïve Bayes, random forest, and support vector machine. Fast correlation-based filter feature selection is used to remove the irrelevant features. The synthetic minority over-sampling technique is used to balance the imbalanced dataset. The model is evaluated with four performance metrics: accuracy, sensitivity, specificity, and area under the curve (AUC). The experimental outcome shows that few relevant features are needed to enhance the accuracy of the developed model. The RF classifier achieves the highest accuracy, sensitivity, specificity, and AUC of 97.81%, 99.32%, 98.86%, and 99.35%. [ABSTRACT FROM AUTHOR]
- Published
- 2024
14. AR-ADASYN: angle radius-adaptive synthetic data generation approach for imbalanced learning.
- Author
-
Park, Hyejoon and Kim, Hyunjoong
- Abstract
Imbalanced data often leads to biased models favoring the majority class, limiting the representation of the training data and complicating generalization to minority classes. Previous studies have addressed this problem by rebalancing the dataset using resampling methods, such as SMOTE. However, these methods often encounter challenges such as overfitting and information loss. To overcome these limitations, we propose a novel approach called Angle Radius-Adaptive Synthetic Sampling (AR-ADASYN). The proposed method aims to maintain the distribution of minority classes by taking into account both the distance and the angle between existing samples of the minority class. We conducted experiments on 14 datasets composed of real data to compare the classification performance of the proposed method with other oversampling techniques. The results revealed that the proposed method demonstrated superior classification performance compared to the other oversampling techniques. As a result, AR-ADASYN showed potential as a valuable tool for improving the robustness and generalization of machine learning models trained on imbalanced datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2024
15. Conditional adversarial segmentation and deep learning approach for skin lesion sub-typing from dermoscopic images.
- Author
-
Mirunalini, P., Desingu, Karthik, Aswatha, S., Deepika, R., Deepika, V., and Jaisakthi, S. M.
- Subjects
*DEEP learning, *DERMOSCOPY, *SKIN cancer, *IMAGE processing, *SAMPLE size (Statistics), *HUMAN skin color
- Abstract
Automatic skin lesion subtyping is a crucial step for diagnosing and treating skin cancer and acts as a first level diagnostic aid for medical experts. Although, in general, deep learning is very effective in image processing tasks, there are notable areas of the processing pipeline in the dermoscopic image regime that can benefit from refinement. Our work identifies two such areas for improvement. First, most benchmark dermoscopic datasets for skin cancers and lesions are highly imbalanced due to the relative rarity and commonality in the occurrence of specific lesion types. Deep learning methods tend to exhibit biased performance in favor of the majority classes with such datasets, leading to poor generalization. Second, dermoscopic images can be associated with irrelevant information in the form of skin color, hair, veins, etc.; hence, limiting the information available to a neural network by retaining only relevant portions of an input image has been successful in prompting the network towards learning task-relevant features and thereby improving its performance. Hence, this research work augments the skin lesion characterization pipeline in the following ways. First, it balances the dataset to overcome sample size biases. Two balancing methods, Synthetic Minority Oversampling Technique (SMOTE) and reweighting, are applied, compared, and analyzed. Second, a lesion segmentation stage is introduced before classification, in addition to a preprocessing stage, to retain only the region of interest. A baseline segmentation approach based on Bi-Directional ConvLSTM U-Net is improved using conditional adversarial training for enhanced segmentation performance. Finally, the classification stage is implemented using EfficientNets, where the B2 variant is used to benchmark and choose between the balancing and segmentation techniques, and the architecture is then scaled through to B7 to analyze the performance boost in lesion classification.
From these experiments, we find that the pipeline that balances using SMOTE and uses the adversarially trained segmentation network achieves the best baseline performance of 91% classification accuracy with EfficientNet B2. Based on the scaling experiments, we find that optimal performance is reached with the B6 architecture that classifies with a 97% accuracy. Furthermore, the proposed pipeline for lesion characterization outperforms the state of the art performance on the ISIC dataset. [ABSTRACT FROM AUTHOR]
- Published
- 2024
16. SAMME.C2 algorithm for imbalanced multi-class classification.
- Author
-
So, Banghee and Valdez, Emiliano A.
- Subjects
*MACHINE learning, *SCIENCE education, *PREDICTION models, *ALGORITHMS, *EMPIRICAL research, *BOOSTING algorithms
- Abstract
Classification predictive modeling involves the accurate assignment of observations in a dataset to target classes or categories. Real-world classification problems with severely imbalanced class distributions have increased substantially in recent years. In such cases, significantly fewer observations are available for minority classes to learn from than for majority classes. Despite this sparsity, the minority class is often considered as the more interesting class, yet the development of a scientific learning algorithm that is suitable for these observations presents numerous challenges. In this study, we further explore the merits of an effective multi-class classification algorithm known as SAMME.C2 that is specialized for handling severely imbalanced classes. This innovative method blends the flexible mechanics of the boosting techniques from the SAMME algorithm, which is a multi-class classifier, and the Ada.C2 algorithm, which is a cost-sensitive binary classifier that is designed to address highly imbalanced classes. We establish a scientific and statistical formulation of the SAMME.C2 algorithm, together with providing and explaining the resulting procedure. We demonstrate the consistently superior performance of this algorithm through numerical experiments as well as empirical studies. [ABSTRACT FROM AUTHOR]
- Published
- 2024
17. A Hybrid Synthetic Minority Oversampling Technique and Deep Neural Network Framework for Improving Rice Yield Estimation in an Open Environment.
- Author
-
Yuan, Jianghao, Zheng, Zuojun, Chu, Changming, Wang, Wensheng, and Guo, Leifeng
- Subjects
*ARTIFICIAL neural networks, *PARTIAL least squares regression, *CROP management, *PEARSON correlation (Statistics), *FIELD crops, *MULTISPECTRAL imaging, *RICE quality
- Abstract
Quick and accurate prediction of crop yields is beneficial for guiding crop field management and genetic breeding. This paper utilizes the fast and non-destructive advantages of an unmanned aerial vehicle equipped with a multispectral camera to acquire spatial characteristics of rice and conducts research on yield estimation in an open environment. The study proposes a yield estimation framework that combines the synthetic minority oversampling technique (SMOTE) and a deep neural network (DNN). Firstly, the framework used the Pearson correlation coefficient to select 10 key vegetation indices and determine the optimal feature combination. Secondly, it created a dataset for data augmentation through SMOTE, addressing the challenge of long data collection cycles and small sample sizes caused by long growth cycles. Then, based on this dataset, a yield estimation model was trained using DNN and compared with partial least squares regression (PLSR), support vector regression (SVR), and random forest (RF). The experimental results indicate that the hybrid framework proposed in this study performs the best (R² = 0.810, RMSE = 0.69 t/ha), significantly improving the accuracy of yield estimation compared to other methods, with an R² improvement of at least 0.191. It demonstrates that the framework proposed in this study can be used for rice yield estimation. Additionally, it provides a new approach for future yield estimation with small sample sizes for other crops or for predicting numerical crop indicators. [ABSTRACT FROM AUTHOR]
- Published
- 2024
18. STUDYING THE IMPACT OF DATASET BALANCING ON MACHINE LEARNING-BASED INTRUSION DETECTION SYSTEMS FOR IOT.
- Author
-
Abdelhamid, Salma, Hegazy, Islam, Aref, Mostafa, and Roushdy, Mohamed
- Subjects
MACHINE learning, SUPPORT vector machines, K-nearest neighbor classification, INTERNET of things, MACHINE performance
- Abstract
Internet of Things (IoT) networks are integral to modern life due to their pervasive connectivity and automation capabilities. Intrusion Detection Systems (IDS) are crucial in IoT ecosystems to counter attacks that can compromise devices and disrupt essential services. Their role is vital in maintaining the integrity, confidentiality, and availability of data within these networks. The effectiveness of these security systems is fundamentally dependent on the robustness of learning algorithms and the quality of the datasets utilized. Class imbalance is a common challenge in real-world datasets, where certain classes are represented by significantly fewer instances compared to others. This paper studies the impact of balancing the BoT-IoT dataset on the performance of Machine Learning (ML) based IDSs using three algorithms: K-Nearest Neighbors (KNN), Gradient Boosting (GB), and Support Vector Machine (SVM). We apply two resampling techniques: random upsampling and Synthetic Minority Oversampling Technique (SMOTE). The results show that dataset balancing improves F1-scores across all the algorithms. Minority-class F1-scores increase in KNN, GB, and SVM from 0.77 to 1, 0 to 0.989, and 0 to 0.999, respectively. Our findings indicate that balanced datasets lead to more dependable and robust IDSs that are capable of handling real-world data with varied class distributions. [ABSTRACT FROM AUTHOR]
- Published
- 2024
19. Enhanced Android Ransomware Detection Through Hybrid Simultaneous Swarm-Based Optimization.
- Author
-
Alazab, Moutaz, Khurma, Ruba Abu, Camacho, David, and Martín, Alejandro
- Abstract
Ransomware is a significant security threat that poses a serious risk to the security of smartphones, and its impact on portable devices has been extensively discussed in a number of research papers. In recent times, this threat has witnessed a significant increase, causing substantial losses for both individuals and organizations. The emergence and widespread occurrence of diverse forms of ransomware present a significant impediment to the pursuit of reliable security measures that can effectively combat them. This constitutes a formidable challenge due to the dynamic nature of ransomware, which renders traditional security protocols inadequate, as they might have a high false alarm rate and exert significant processing demands on mobile devices that are restricted by limited battery life, CPU, and memory. This paper proposes a novel intelligent method for detecting ransomware that is based on a hybrid multi-solution binary JAYA algorithm with a single-solution simulated annealing (SA). The primary objective is to leverage the exploitation power of SA in supporting the exploration power of the binary JAYA algorithm. This approach results in a better balance between global and local search milestones. The empirical results of our research demonstrate the superiority of the proposed SMO-BJAYA-SA-SVM method over other algorithms based on the evaluation measures used. The proposed method achieved an accuracy rate of 98.7%, a precision of 98.6%, a recall of 98.7%, and an F1 score of 98.6%. Therefore, we believe that our approach is an effective method for detecting ransomware on portable devices. It has the potential to provide a more reliable and efficient solution to this growing security threat. [ABSTRACT FROM AUTHOR]
- Published
- 2024
20. Intuitionistic fuzzy rough set model based on k-means and its application to enhance prediction of aptamer–protein interacting pairs.
- Author
-
Jain, Pankhuri, Tiwari, Anoop, and Som, Tanmoy
- Abstract
Aptamers are peptide or oligonucleic acid molecules of considerable interest. They are used to bind particular target molecules. Aptamers play vital roles in various practical applications and physiological functions. Consequently, several diseases can be treated using therapies based on aptamer proteins, and designing the binding of aptamers to specific proteins is essential to advance understanding of aptamer-protein interaction processes. Despite the wide applications of aptamers, identification of aptamer-protein interactions remains inadequate and challenging. Therefore, it is necessary to develop a computational approach for achieving good predictions of aptamer-protein interactions. In the present study, a novel method is proposed for enhancing the prediction of interacting aptamer-target pairs based on sequence features obtained from both aptamers and their target proteins by employing a novel k-mean based intuitionistic fuzzy rough feature selection method. Firstly, an intuitionistic fuzzy rough set model based on k nearest neighbour concept is proposed. Then, a novel feature selection technique is introduced by using this model. Furthermore, non-redundant and relevant features are selected from training as well as testing datasets by using the proposed feature selection technique. Secondly, SMOTE (Synthetic Minority Oversampling Technique) is applied to obtain the optimal balanced training and testing datasets. Thirdly, we apply various machine learning algorithms on optimally balanced reduced training and testing datasets to evaluate their performances. Experimental results show that the best prediction performance is obtained by the boosted random forest learning algorithm.
Using a 10-fold cross-validation test, the proposed method performs well, with sensitivity of 91.3 and 86.4, specificity of 91.9 and 84.8, overall accuracy of 91.60% and 85.60%, Matthews correlation coefficient of 0.832 and 0.713, AUC (area under curve) of 0.969 and 0.908, and g-means of 91.5 and 85.5 on the optimally balanced, reduced training and testing datasets, respectively, consisting of aptamer-protein interacting pairs. Finally, a comparative study of the best obtained results with the existing best results is presented, which clearly indicates that our proposed approach is the best performing approach to date. [ABSTRACT FROM AUTHOR]
- Published
- 2024
21. DBSCAN SMOTE LSTM: Effective Strategies for Distributed Denial of Service Detection in Imbalanced Network Environments.
- Author
-
Efendi, Rissal, Wahyono, Teguh, and Widiasari, Indrastanti Ratna
- Subjects
DENIAL of service attacks, DEEP learning, SUPPLY & demand
- Abstract
In detecting Distributed Denial of Service (DDoS), deep learning faces challenges and difficulties such as high computational demands, long training times, and complex model interpretation. This research focuses on overcoming these challenges by proposing an effective strategy for detecting DDoS attacks in imbalanced network environments. This research employed DBSCAN and SMOTE to increase the class distribution of the dataset by allowing models using LSTM to learn time anomalies effectively when DDoS attacks occur. The experiments carried out revealed significant improvement in the performance of the LSTM model when integrated with DBSCAN and SMOTE. These include validation loss results of 0.048 for LSTM DBSCAN and SMOTE and 0.1943 for LSTM without DBSCAN and SMOTE, with accuracy of 99.50 and 97.50. Apart from that, there was an increase in the F1 score from 93.4% to 98.3%. This research proved that DBSCAN and SMOTE can be used as an effective strategy to improve model performance in detecting DDoS attacks on heterogeneous networks, as well as increasing model robustness and reliability. [ABSTRACT FROM AUTHOR]
- Published
- 2024
22. Comparative evaluation of data imbalance addressing techniques for CNN-based insider threat detection
- Author
-
Taher Al-Shehari, Mohammed Kadrie, Mohammed Nasser Al-Mhiqani, Taha Alfakih, Hussain Alsalman, Mueen Uddin, Syed Sajid Ullah, and Abdulhalim Dandoush
- Subjects
Deep learning ,CNN ,SMOTE ,ADASYN ,Data imbalance addressing ,Insider threat detection ,Medicine ,Science - Abstract
Insider threats pose a significant challenge in cybersecurity, demanding advanced detection methods for effective risk mitigation. This paper presents a comparative evaluation of data imbalance addressing techniques for CNN-based insider threat detection. Specifically, we integrate Convolutional Neural Networks (CNN) with three popular data imbalance addressing techniques: Synthetic Minority Over-sampling Technique (SMOTE), Borderline-SMOTE, and Adaptive Synthetic Sampling (ADASYN). The objective is to enhance insider threat detection accuracy and robustness in imbalanced datasets common to cybersecurity domains. Our study addresses the lack of consensus in the literature regarding the superiority of data imbalance addressing techniques in this field. We analyze a human behavior-based dataset (i.e., CERT) that reports users’ Information Technology (IT) activities with a substantial number of samples to provide a clear conclusion on the effectiveness of these balancing techniques when coupled with CNN. Experimental results demonstrate that ADASYN, in conjunction with CNN, achieves a ROC curve of 96%, surpassing SMOTE and Borderline-SMOTE in enhancing detection accuracy in imbalanced datasets. We compare the results of these three hybrid models (CNN + imbalance addressing techniques) with selected state-of-the-art studies, focusing on ROC, recall, and accuracy measures. Our findings contribute to the advancement of insider threat detection methodologies.
- Published
- 2024
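Several entries in this list compare SMOTE with variants such as Borderline-SMOTE and ADASYN. For orientation, the following is a minimal sketch of the basic SMOTE interpolation step only: pick a minority sample, pick one of its k nearest minority neighbours, and place a synthetic point at a random position on the segment between them. It is a simplified illustration using the standard library, not the implementation used in any listed paper; the function name `smote_sketch` and the sample data are hypothetical.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding the base itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical two-feature minority class
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
synthetic = smote_sketch(minority, n_new=6, k=3)
for point in synthetic:
    print([round(v, 3) for v in point])
```

Because every synthetic point lies on a segment between two real minority samples, plain SMOTE never leaves the region spanned by the minority class. The variants named in these abstracts mainly change which base samples are chosen and how many synthetic points each one receives.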
23. A proposed technique for predicting heart disease using machine learning algorithms and an explainable AI method
- Author
-
Hosam El-Sofany, Belgacem Bouallegue, and Yasser M. Abd El-Latif
- Subjects
Machine learning ,Heart diseases ,ML algorithms ,SMOTE ,SHAP ,Medicine ,Science - Abstract
One of the critical issues in medical data analysis is accurately predicting a patient’s risk of heart disease, which is vital for early intervention and reducing mortality rates. Early detection allows for timely treatment and continuous monitoring by healthcare providers, which is essential but often limited by the inability of medical professionals to provide constant patient supervision. Early detection of cardiac problems and continuous patient monitoring by physicians can help reduce death rates. Doctors cannot constantly have contact with patients, and heart disease detection is not always accurate. By offering a more solid foundation for prediction and decision-making based on data provided by healthcare sectors worldwide, machine learning (ML) could help physicians with the prediction and detection of heart disease (HD). This study aims to use different feature selection strategies to produce an accurate ML algorithm for early heart disease prediction. We have chosen features using chi-square, ANOVA, and mutual information methods. The three feature groups chosen were SF-1, SF-2, and SF-3. The study employed ten machine learning algorithms to determine the most accurate technique and feature subset fit. The classification algorithms used include support vector machines (SVM), XGBoost, bagging, decision trees (DT), and random forests (RF). We evaluated the proposed heart disease prediction technique using a private dataset, a public dataset, and different cross-validation methods. We used the Synthetic Minority Oversampling Technique (SMOTE) to eliminate inconsistent data and discover the machine learning algorithm that achieves the most accurate heart disease predictions. Healthcare providers might identify early-stage heart disease quickly and cheaply with the proposed method. We have used the most effective ML algorithm to create a mobile app that instantly predicts heart disease based on the input symptoms.
The experimental results demonstrated that the XGBoost algorithm performed optimally when applied to the combined datasets and the SF-2 feature subset. It had 97.57% accuracy, 96.61% sensitivity, 90.48% specificity, 95.00% precision, a 92.68% F1 score, and a 98% AUC. We have developed an explainable AI method based on SHAP approaches to understand how the system makes its final predictions.
- Published
- 2024
24. Classifying Legendary Pokémon with SF-Random Forest Algorithm
- Author
-
Aji Prayoga, Yisti Vita Via, and I Gede Susrama Mas Diyasa
- Subjects
pokémon legendary ,random forest ,smote ,classification ,Mathematics ,QA1-939 ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
Accurate classification of legendary Pokémon is essential due to their distinct characteristics compared to regular Pokémon, impacting various domains such as research, gaming, and strategy development. This study employs the SF-Random Forest algorithm, an advanced variant of Random Forest, designed to effectively handle data heterogeneity and complexity. The dataset comprises 800 Pokémon samples, including attributes like type, base stats (HP, Attack, Defense, etc.), and other relevant features. To address the inherent imbalance between legendary and non-legendary Pokémon, the data preprocessing phase includes outlier removal, handling of missing values, normalization through Min-Max Scaling, and class balancing using the SMOTE (Synthetic Minority Over-sampling Technique) method. The preprocessed data is then used to train the SF-Random Forest model, with performance evaluated using metrics such as accuracy, precision, recall, and F1-score. The results reveal that SF-Random Forest achieves perfect scores across all metrics, demonstrating 100% accuracy, precision, recall, and F1-score. This highlights the algorithm's superior ability to identify key features and manage data imbalance compared to traditional classification methods. The study underscores the efficiency and robustness of SF-Random Forest as a classification tool, paving the way for the development of more advanced classification systems applicable to various fields requiring complex pattern recognition.
- Published
- 2024
25. IMPROVING PERFORMANCE FOR IMBALANCED DATA CLASSIFICATION USING OVERSAMPLING AND CHARACTERISTICS OF EACH CLUSTER
- Author
-
Phan Anh Phong, Le Van Thanh
- Subjects
data classification ,imbalanced data ,oversampling ,k-means ,smote ,Technology ,Social sciences (General) ,H1-99 - Abstract
This paper proposes a method to enhance the effectiveness of classifying imbalanced data. The main contribution of the method is integrating the K-means clustering algorithm and the minority oversampling technique VCIR to generate synthetic samples that closely represent the actual data characteristics. Experimental results have shown that the proposed method performs better on several metrics than current popular methods for handling imbalanced data, such as SMOTE, Borderline-SMOTE, Kmeans-SMOTE, and SVM-SMOTE.
- Published
- 2024
26. Novel stacking models based on SMOTE for the prediction of rockburst grades at four deep gold mines
- Author
-
Peng Xiao, Zida Liu, Guoyan Zhao, and Pengzhi Pan
- Subjects
Rockburst prediction ,Gold mine ,Stacking model ,SMOTE ,Engineering geology. Rock mechanics. Soil mechanics. Underground construction ,TA703-712 - Abstract
Rockburst is a frequently encountered hazard during the production of deep gold mines. Accurate prediction of rockburst is an important measure to prevent rockburst in gold mines. This study considers seven indicators to evaluate rockburst at four deep gold mines. Field research and rock tests were performed at two gold mines in China to collect these seven indicators and rockburst cases. The collected database was oversampled by the synthetic minority oversampling technique (SMOTE) to balance the categories of rockburst datasets. Stacking models combining tree-based models and logistic regression (LR) were built on the balanced database. Rockburst datasets from another two deep gold mines were used to verify the applicability of the predictive models. The stacking model combining extremely randomized trees and LR based on SMOTE (SMOTE-ERT-LR) was the best model, and it obtained a training accuracy of 100% and an evaluation accuracy of 100%. Moreover, model evaluation suggested that SMOTE can enhance the prediction performance for weak rockburst, thereby improving the overall performance. Finally, sensitivity analysis was performed for SMOTE-ERT-LR. The results indicated that the SMOTE-ERT-LR model can achieve satisfactory performance when only depth, maximum tangential stress index, and linear elastic energy index are available.
- Published
- 2024
27. Enhancing emotion prediction using deep learning and distributed federated systems with SMOTE oversampling technique
- Author
-
V.V. Narasimha Raju, R. Saravanakumar, Nadia Yusuf, Rahul Pradhan, Hedi Hamdi, K. Aanandha Saravanan, Vuda Sreenivasa Rao, and Majid A. Askar
- Subjects
Audio-visual ,Convolutional neural network ,Deep learning ,Recognising emotions ,Federated system ,SMOTE ,Engineering (General). Civil engineering (General) ,TA1-2040 - Abstract
Facial Expression Recognition (FER) categorizes various human emotions by analyzing the features of the face, so it plays a vital role in recognizing emotions. Prior studies have focused on recognizing emotions through voice or speech. To address the limitations of existing methods, this approach aims to detect emotions from voices and three-dimensional images using appropriate datasets and novel deep-learning techniques. In this research, the Audio-Visual datasets Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), Acted Facial Expression in the Wild (AFEW), and eNTERFACE’05 are chosen for analysis. The RAVDESS dataset contains audio, while AFEW and eNTERFACE’05 contain three-dimensional (3D) images of humans. The SMOTE technique is applied to balance the dataset and address overfitting through oversampling and under-sampling. The research employs the Federated 3D-CNN technique to predict human emotions accurately. The 3D Convolutional Neural Network (3DCNN) predicts accurate information about a person at any angle in image processing. Mel Frequency Cepstrum Coefficients (MFCC) are used to extract and fine-tune voice features. A significant contribution is that Federated Learning with the 3D Convolutional Neural Network is executed for multiple clients at a time through global and local updates of weights. The proposed framework achieves a prediction accuracy of 95.72% in comparison with existing methods. This approach helps in many applications, such as emotion analysis and healthcare.
- Published
- 2024
28. Improved KD-tree based imbalanced big data classification and oversampling for MapReduce platforms.
- Author
-
Sleeman IV, William C., Roseberry, Martha, Ghosh, Preetam, Cano, Alberto, and Krawczyk, Bartosz
- Subjects
MACHINE learning ,WEBSITES ,CLASSIFICATION algorithms ,SKEWNESS (Probability theory) ,WEB services ,BIG data - Abstract
In the era of big data, it is necessary to provide novel and efficient platforms for training machine learning models over large volumes of data. The MapReduce approach and its Apache Spark implementation are among the most popular methods that provide high-performance computing for classification algorithms. However, they require dedicated implementations that will take advantage of such architectures. Additionally, many real-world big data problems are plagued by class imbalance, posing challenges to the classifier training step. Existing solutions for alleviating skewed distributions do not work well in the MapReduce environment. In this paper, we propose a novel KD-tree based classifier, together with a variation of the SMOTE algorithm dedicated to the Spark platform. Our algorithms offer excellent predictive power and can work simultaneously with binary and multi-class imbalanced data. Exhaustive experiments conducted using the Amazon Web Service platform showcase the high efficiency and flexibility of our proposed algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2024
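29. SVM Optimization Model with PSO-GA and SMOTE for Handling High-Dimensional and Imbalanced Flood Data [Model Optimasi SVM Dengan PSO-GA dan SMOTE Dalam Menangani High Dimensional dan Imbalance Data Banjir]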
- Author
-
Raenald Syaputra, Taghfirul Azhima Yoga Siswa, and Wawan Joko Pranoto
- Subjects
flood classification ,svm ,smote ,ga ,pso ,Information technology ,T58.5-58.64 ,Computer software ,QA76.75-76.765 - Abstract
Floods are among the most frequent natural disasters in Indonesia, including in the city of Samarinda, where 18 to 33 village locations were affected between 2018 and 2021. Using machine learning to classify flood events is important for predicting future occurrences. Several studies on flood data classification have been carried out over the past three years. However, these studies raise problems related to high-dimensional datasets, which can degrade classification performance and cause overfitting. Another problem is data imbalance, which biases the model toward the majority class and yields inaccurate representations. High dimensionality and data imbalance are therefore specific challenges that must be addressed when classifying Samarinda flood data. This study aims to identify which features obtained through Genetic Algorithm (GA) feature selection influence the accuracy of Samarinda flood data classification using the Support Vector Machine (SVM) algorithm, and to improve that accuracy by implementing SVM combined with the Synthetic Minority Oversampling Technique (SMOTE) for oversampling, feature selection with GA, and optimization using Particle Swarm Optimization (PSO). Validation uses 10-fold cross-validation, and performance is evaluated with a confusion matrix. The data come from BPBD (Badan Penanggulangan Bencana Daerah, the Regional Disaster Management Agency) and BMKG (Badan Meteorologi, Klimatologi, dan Geofisika, the Meteorology, Climatology, and Geophysics Agency) of Samarinda for 2021-2023 and comprise 11 features and 1,095 records. The results show that the important features selected by GA are maximum temperature, maximum wind speed, direction of maximum wind, most frequent wind direction, duration of sunshine, and average wind speed.
With the combination of SVM, SMOTE, GA, and PSO, flood data classification accuracy reaches 82.28%. However, the study also faces challenges, such as results that contradict other studies on the use of SMOTE and variation in results caused by differing dataset characteristics and data-splitting methods. The results can be used by the local government and the regional disaster management agency of Samarinda to predict flood events more accurately and to enable more effective preventive action. Applying these results can improve the effectiveness of flood disaster mitigation in Samarinda.
- Published
- 2024
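The study above combines oversampling with 10-fold cross-validation, and its abstract does not say at which stage the oversampling is applied. A commonly recommended safeguard is to oversample only the training portion of each fold, so that no synthetic or duplicated sample derived from a test record leaks into training. The sketch below illustrates that ordering under stated assumptions: it uses plain random oversampling as a stand-in for SMOTE, a non-stratified split, and hypothetical labels and helper names (`kfold_indices`, `oversample_train`).

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def oversample_train(train_idx, labels, seed=0):
    """Random oversampling (stand-in for SMOTE): repeat minority indices
    until every class present in the training fold has the same count."""
    rng = random.Random(seed)
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    return balanced

# Hypothetical imbalanced labels: 90 majority (0) and 10 minority (1)
labels = [0] * 90 + [1] * 10
folds = kfold_indices(len(labels), k=10)

splits = []
for f, test_idx in enumerate(folds):
    # Training indices are all folds except the held-out one
    train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
    # Balance the training fold only; the test fold keeps its natural distribution
    train_balanced = oversample_train(train_idx, labels, seed=f)
    splits.append((train_balanced, test_idx))

print(len(splits), "splits;", len(splits[0][0]), "training rows in the first split")
```

Balancing the whole dataset before splitting would place copies (or, with SMOTE, interpolations) of test records in the training set, which tends to inflate reported accuracy.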
30. Improving Random Forest Accuracy with ANOVA and SMOTE for Stunting Data Classification [Perbaikan Akurasi Random Forest Dengan ANOVA Dan SMOTE Pada Klasifikasi Data Stunting]
- Author
-
Ari Ahmad Dhani, Taghfirul Azhima Yoga Siswa, and Wawan Joko Pranoto
- Subjects
classification ,random forest ,anova ,smote ,high dimensional ,Information technology ,T58.5-58.64 ,Computer software ,QA76.75-76.765 - Abstract
Stunting remains a critical public health issue in Indonesia, particularly in the city of Samarinda, which recorded a prevalence of 25.3% in 2022, the second highest in East Kalimantan Province. Amid national research priorities for 2020-2024, data mining for stunting classification shows significant potential but still faces challenges in handling high-dimensional data and class imbalance. This study aims to improve stunting classification accuracy using Random Forest (RF) integrated with ANOVA feature selection and the SMOTE technique for class balancing. The data come from the Samarinda City Health Office (Dinas Kesehatan Kota Samarinda) and cover 26 community health centers (Puskesmas), with 21 attributes and a total of 150,466 records. Validation uses cross-validation with k = 10. The results show accuracy rising from 98.83% to 99.77%, an increase of 0.94 percentage points, after applying ANOVA feature selection. The features ZS TB/U, ZS BB/U, and BB/U were identified as the most influential. This improvement demonstrates the effectiveness of the integrated methods in addressing stunting classification on a complex, imbalanced dataset and is expected to support further health policy and interventions in the region.
- Published
- 2024
31. Hematoma expansion prediction based on SMOTE and XGBoost algorithm
- Author
-
Yan Li, Chaonan Du, Sikai Ge, Ruonan Zhang, Yiming Shao, Keyu Chen, Zhepeng Li, and Fei Ma
- Subjects
Hematoma expansion ,XGBoost ,SMOTE ,Machine learning prediction ,Unbalanced dataset ,Computer applications to medicine. Medical informatics ,R858-859.7 - Abstract
Hematoma expansion (HE) is a high-risk symptom with a high rate of occurrence in patients who have undergone spontaneous intracerebral hemorrhage (ICH) after a major accident or illness. Correctly predicting the occurrence of HE in advance is critical to help doctors determine the next step of medical treatment. Most existing studies focus only on the occurrence of HE within 6 h after the occurrence of ICH, while in reality a considerable number of patients have HE after the first 6 h but within 24 h. In this study, based on medical doctors' recommendations, we focus on predicting the occurrence of HE within 24 h, as well as the occurrence of HE in each 6 h interval within 24 h. Based on demographics and information extracted from computed tomography (CT) images, we used the XGBoost method to predict the occurrence of HE within 24 h. To address the highly imbalanced data set, which is common in medical data analysis, we used the SMOTE algorithm for data augmentation. To evaluate our method, we used a data set consisting of 582 patient records and compared the results of the proposed method with those of several other machine learning methods. Our experiments show that XGBoost achieved the best prediction performance on the balanced dataset processed by the SMOTE algorithm, with an accuracy of 0.82 and an F1-score of 0.82. Moreover, our proposed method predicts the occurrence of HE within 6, 12, 18 and 24 h with accuracies of 0.89, 0.82, 0.87 and 0.94, indicating that HE occurrence within 24 h can be predicted accurately by the proposed method.
- Published
- 2024
32. Data oversampling and imbalanced datasets: an investigation of performance for machine learning and feature engineering
- Author
-
Muhammad Mujahid, EROL Kına, Furqan Rustam, Monica Gracia Villar, Eduardo Silva Alvarado, Isabel De La Torre Diez, and Imran Ashraf
- Subjects
Machine learning ,Bag of words ,Oversampling techniques ,SMOTE ,K-Means SMOTE ,Computer engineering. Computer hardware ,TK7885-7895 ,Information technology ,T58.5-58.64 ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
The classification of imbalanced datasets is a prominent task in text mining and machine learning. The number of samples in each class is not uniformly distributed; one class contains a large number of samples while the other has a small number. Overfitting of the model occurs as a result of imbalanced datasets, resulting in poor performance. In this study, we compare different oversampling techniques like synthetic minority oversampling technique (SMOTE), support vector machine SMOTE (SVM-SMOTE), Border-line SMOTE, K-means SMOTE, and adaptive synthetic (ADASYN) oversampling to address the issue of imbalanced datasets and enhance the performance of machine learning models. Preprocessing significantly enhances the quality of input data by reducing noise, redundant data, and unnecessary data. This enables the machines to identify crucial patterns that facilitate the extraction of significant and pertinent information from the preprocessed data. This study preprocesses the data using various top-level preprocessing steps. Furthermore, two imbalanced Twitter datasets are used to compare the performance of oversampling techniques with six machine learning models including random forest (RF), SVM, K-nearest neighbor (KNN), AdaBoost (ADA), logistic regression (LR), and decision tree (DT). In addition, the bag of words (BoW) and term frequency-inverse document frequency (TF-IDF) feature extraction approaches are used to extract features from the tweets. The experiments indicate that SMOTE and ADASYN perform much better than other techniques thus providing higher accuracy. Additionally, overall results show that SVM with ’linear’ kernel tends to attain the highest accuracy and recall score of 99.67% and 1.00% on ADASYN oversampled datasets and 99.57% accuracy on SMOTE oversampled dataset with TF-IDF features. The SVM model using 10-fold cross-validation experiments achieved 97.40 mean accuracy with a 0.008 standard deviation.
Our approach achieved 2.62% greater accuracy as compared to other current methods.
- Published
- 2024
33. Detection of Parkinson disease using multiclass machine learning approach
- Author
-
Saravanan Srinivasan, Parthasarathy Ramadass, Sandeep Kumar Mathivanan, Karthikeyan Panneer Selvam, Basu Dev Shivahare, and Mohd Asif Shah
- Subjects
Machine learning ,Feed-forward neural network ,RandomizedSearchCV ,SMOTE ,Voice signal feature ,Medicine ,Science - Abstract
Parkinson’s Disease (PD) is a prevalent neurological condition characterized by motor and cognitive impairments, typically manifesting around the age of 50 and presenting symptoms such as gait difficulties and speech impairments. Although a cure remains elusive, symptom management through medication is possible. Timely detection is pivotal for effective disease management. In this study, we leverage Machine Learning (ML) and Deep Learning (DL) techniques, specifically K-Nearest Neighbor (KNN) and Feed-forward Neural Network (FNN) models, to differentiate between individuals with PD and healthy individuals based on voice signal characteristics. Our dataset, sourced from the University of California at Irvine (UCI), comprises 195 voice recordings collected from 31 patients. To optimize model performance, we employ various strategies including Synthetic Minority Over-sampling Technique (SMOTE) for addressing class imbalance, Feature Selection to identify the most relevant features, and hyperparameter tuning using RandomizedSearchCV. Our experimentation reveals that the FNN and KSVM models, trained on an 80–20 split of the dataset for training and testing respectively, yield the most promising results. The FNN model achieves an impressive overall accuracy of 99.11%, with 98.78% recall, 99.96% precision, and a 99.23% f1-score. Similarly, the KSVM model demonstrates strong performance with an overall accuracy of 95.89%, recall of 96.88%, precision of 98.71%, and an f1-score of 97.62%. Overall, our study showcases the efficacy of ML and DL techniques in accurately identifying PD from voice signals, underscoring the potential for these approaches to contribute significantly to early diagnosis and intervention strategies for Parkinson’s Disease.
- Published
- 2024
34. Feature group partitioning: an approach for depression severity prediction with class balancing using machine learning algorithms
- Author
-
Tumpa Rani Shaha, Momotaz Begum, Jia Uddin, Vanessa Yélamos Torres, Josep Alemany Iturriaga, Imran Ashraf, and Md. Abdus Samad
- Subjects
Machine learning ,Depression prediction ,Class balancing ,Oversampling ,SMOTE ,ADASYN ,Medicine (General) ,R5-920 - Abstract
In contemporary society, depression has emerged as a prominent mental disorder that exhibits exponential growth and exerts a substantial influence on premature mortality. Although numerous studies have applied machine learning methods to forecast signs of depression, only a limited number have taken into account the severity level as a multiclass variable. Besides, equal data distribution among all classes rarely occurs in practice. The inevitable class imbalance across multiple classes is therefore considered a substantial challenge in this domain. Furthermore, this research emphasizes the significance of addressing class imbalance issues in the context of multiple classes. We introduce a new approach, Feature group partitioning (FGP), in the data preprocessing phase, which effectively reduces the dimensionality of features to a minimum. This study utilized synthetic oversampling techniques, specifically Synthetic Minority Over-sampling Technique (SMOTE) and Adaptive Synthetic (ADASYN), for class balancing. The dataset used in this research was collected from university students by administering the Burns Depression Checklist (BDC). For methodological modifications, we implemented heterogeneous ensemble learning stacking, homogeneous ensemble bagging, and five distinct supervised machine learning algorithms. The issue of overfitting was mitigated by evaluating the accuracy of the training, validation, and testing datasets. To justify the effectiveness of the prediction models, balanced accuracy, sensitivity, specificity, precision, and f1-score indices are used. Overall, comprehensive analysis demonstrates the discrimination between the Conventional Depression Screening (CDS) and FGP approach. In summary, the results show that the stacking classifier for FGP with SMOTE approach yields the highest balanced accuracy, with a rate of 92.81%.
The empirical evidence has demonstrated that the FGP approach, when combined with SMOTE, is able to produce better performance in predicting the severity of depression. Most importantly, the optimization of the training time of the FGP approach for all of the classifiers is a significant achievement of this research.
- Published
- 2024
35. Enhancing intrusion detection in IIoT: optimized CNN model with multi-class SMOTE balancing.
- Author
-
Eid, Abdulrahman Mahmoud, Soudan, Bassel, Nassif, Ali Bou, and Injadat, MohammadNoor
- Subjects
CONVOLUTIONAL neural networks ,COMPUTER network security ,INTERNET of things ,INTRUSION detection systems (Computer security) ,GENERALIZATION ,DEFAULT (Finance) - Abstract
This work introduces an intrusion detection system (IDS) tailored for industrial internet of things (IIoT) environments based on an optimized convolutional neural network (CNN) model. The model is trained on a dataset that was balanced using a novel multi-class implementation of synthetic minority over-sampling technique (SMOTE) that ensures equal representation of all classes. Additionally, systematic optimization will be used to fine tune the hyperparameters of the CNN model and mitigate the effects of the increased size of the training dataset. Evaluation results will demonstrate substantial improvement in performance when the optimized CNN model is trained on the balanced dataset. The proposed IDS will be evaluated using the IIoT-specific WUSTL-IIOT-2021 dataset, and then its generalization capability will be verified using the non-domain specific UNSW_NB15 dataset. The model's performance will be evaluated using accuracy, precision, recall, and F1-score metrics. The results will demonstrate that the proposed IDS is highly effective with performance exceeding 99.9% on all performance metrics. The IDS is also highly effective in detecting intrusion for generic IT networks achieving improvements in excess of 30% compared to the default baseline model. The results emphasize the versatility and effectiveness of the proposed IDS model, making it a reliable and adaptable solution for enhancing network security across diverse network environments. [ABSTRACT FROM AUTHOR]
- Published
- 2024
36. Efficient Sleep Stage Identification Using Piecewise Linear EEG Signal Reduction: A Novel Algorithm for Sleep Disorder Diagnosis.
- Author
-
Paul, Yash, Singh, Rajesh, Sharma, Surbhi, Singh, Saurabh, and Ra, In-Ho
- Subjects
SLEEP ,SLEEP stages ,DATABASES ,EUCLIDEAN distance ,SLEEP disorders ,ELECTROENCEPHALOGRAPHY - Abstract
Sleep is a vital physiological process for human health, and accurately detecting various sleep states is crucial for diagnosing sleep disorders. This study presents a novel algorithm for identifying sleep stages using EEG signals, which is more efficient and accurate than the state-of-the-art methods. The key innovation lies in employing a piecewise linear data reduction technique called the Halfwave method in the time domain. This method simplifies EEG signals into a piecewise linear form with reduced complexity while preserving sleep stage characteristics. Then, a features vector with six statistical features is built using parameters obtained from the reduced piecewise linear function. We used the MIT-BIH Polysomnographic Database to test our proposed method, which includes more than 80 h of long data from different biomedical signals with six main sleep classes. We used different classifiers and found that the K-Nearest Neighbor classifier performs better in our proposed method. According to experimental findings, the average sensitivity, specificity, and accuracy of the proposed algorithm on the Polysomnographic Database considering eight records is estimated as 94.82%, 96.65%, and 95.73%, respectively. Furthermore, the algorithm shows promise in its computational efficiency, making it suitable for real-time applications such as sleep monitoring devices. Its robust performance across various sleep classes suggests its potential for widespread clinical adoption, making significant advances in the knowledge, detection, and management of sleep problems. [ABSTRACT FROM AUTHOR]
- Published
- 2024
37. Enhancing Firewall Packet Classification through Artificial Neural Networks and Synthetic Minority Over-Sampling Technique: An Innovative Approach with Evaluative Comparison.
- Author
-
Korkmaz, Adem, Bulut, Selma, Talan, Tarık, Kosunalp, Selahattin, and Iliev, Teodor
- Subjects
MACHINE learning ,ARTIFICIAL neural networks ,COMPUTER network security ,COMMUNICATION infrastructure ,INFRASTRUCTURE (Economics) ,FIREWALLS (Computer security) - Abstract
Firewall packet classification is a critical component of network security, demanding precise and reliable methods to ensure optimal functionality. This study introduces an advanced approach that combines Artificial Neural Networks (ANNs) with various data balancing techniques, including the Synthetic Minority Over-sampling Technique (SMOTE), ADASYN, and BorderlineSMOTE, to enhance the classification of firewall packets into four distinct classes: 'allow', 'deny', 'drop', and 'reset-both'. Initial experiments without data balancing revealed that while the ANN model achieved perfect precision, recall, and F1-Scores for the 'allow', 'deny', and 'drop' classes, it struggled to accurately classify the 'reset-both' class. To address this, we applied SMOTE, ADASYN, and BorderlineSMOTE to mitigate class imbalance, which led to significant improvements in overall classification performance. Among the techniques, the ANN combined with BorderlineSMOTE demonstrated superior efficacy, achieving a 97% overall accuracy and consistently high performance across all classes, particularly in the accurate classification of minority classes. In contrast, while SMOTE and ADASYN also improved the model's performance, the results with BorderlineSMOTE were notably more balanced and reliable. This study provides a comparative analysis with existing machine learning models, highlighting the effectiveness of the proposed approach in firewall packet classification. The synthesized results validate the potential of integrating ANNs with advanced data balancing techniques to enhance the robustness and reliability of network security systems. The findings underscore the importance of addressing class imbalance in machine learning models, particularly in security-critical applications, and offer valuable insights for the design and improvement of future network security infrastructures. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
38. DDSC-SMOTE: an imbalanced data oversampling algorithm based on data distribution and spectral clustering.
- Author
-
Li, Xinqi and Liu, Qicheng
- Subjects
*DATA distribution, *CLASSIFICATION algorithms, *K-nearest neighbor classification, *ALGORITHMS, *MACHINE learning, *OUTLIER detection, *CLUSTER sampling
- Abstract
Imbalanced data poses a significant challenge in machine learning, as conventional classification algorithms often prioritize majority class samples, while accurately classifying minority class samples is more crucial. The synthetic minority oversampling technique (SMOTE) represents one of the most renowned methods for handling imbalanced data. However, both SMOTE and its variants have limitations due to their insufficient consideration of data distribution, leading to the generation of incorrect and unnecessary samples. This paper, therefore, introduces a novel oversampling algorithm called data distribution and spectral clustering-based SMOTE (DDSC-SMOTE). This algorithm addresses the shortcomings of SMOTE by introducing three innovative data distribution-based improvement strategies: adaptive allocation of synthetic sample quantities strategy, seed sample adaptive selection strategy, and synthetic sample improvement strategy. First, we use the k-nearest neighbor sample labels and the local outlier factor algorithm to remove noisy and outlier samples. Next, we leverage spectral clustering to identify clusters within the minority class and propose a dual-weight factor that considers inter-cluster and intra-cluster distances to allocate the number of synthetic samples effectively, addressing interclass and intraclass imbalances. Furthermore, we introduce a relative position weight coefficient to determine the probability of selecting seed samples within the subcluster, ensuring that important minority samples have higher chances of being sampled. Finally, we improve the SMOTE sample synthesis formula for safer generation. Extensive comparisons on real datasets from the UCI repository demonstrate that DDSC-SMOTE outperforms seven state-of-the-art oversampling algorithms significantly in terms of G-mean and F1-score, presenting a data distribution-focused solution for addressing imbalanced data challenges. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
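The abstract of result 38 reports G-mean and F1-score, two metrics commonly used for imbalanced classification. As a minimal sketch (not the authors' evaluation code), they can be computed from binary confusion-matrix counts as below. The function name `g_mean_and_f1` and the counts are hypothetical and chosen only for illustration.

```python
import math


def g_mean_and_f1(tp, fn, fp, tn):
    """Return (G-mean, F1) for the positive (minority) class."""
    sensitivity = tp / (tp + fn)  # recall on the minority class
    specificity = tn / (tn + fp)  # recall on the majority class
    precision = tp / (tp + fp)
    # G-mean: geometric mean of the two per-class recalls.
    g_mean = math.sqrt(sensitivity * specificity)
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return g_mean, f1


# Hypothetical counts: 10 minority samples and 90 majority samples.
g, f1 = g_mean_and_f1(tp=8, fn=2, fp=9, tn=81)
```

With these illustrative counts, plain accuracy is 89/100, while F1 is noticeably lower because it ignores the many true negatives. That is one reason such metrics are preferred when classes are imbalanced.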
39. Psychosocial Factors Associated With Long-Term Cognitive Impairment Among COVID-19 Survivors: A Cross-Sectional Study.
- Author
-
Wen Dang, Wenjing Li, Haotian Liu, Chunyang Li, Tingxi Zhu, Lin Bai, Runnan Yang, Jingyi Wang, Xiao Liao, Bo Liu, Simai Zhang, Minlan Yuan, and Wei Zhang
- Published
- 2024
- Full Text
- View/download PDF
40. A cyber defense system against phishing attacks with deep learning game theory and LSTM-CNN with African vulture optimization algorithm (AVOA).
- Author
-
Elberri, Mustafa Ahmed, Tokeşer, Ümit, Rahebi, Javad, and Lopez-Guede, Jose Manuel
- Subjects
*OPTIMIZATION algorithms, *PHISHING, *GAME theory, *SWARM intelligence, *EDUCATIONAL games, *PHISHING prevention, *DEEP learning
- Abstract
Phishing attacks pose a significant threat to online security, utilizing fake websites to steal sensitive user information. Deep learning techniques, particularly convolutional neural networks (CNNs), have emerged as promising tools for detecting phishing attacks. However, traditional CNN-based image classification methods face limitations in effectively identifying fake pages. To address this challenge, we propose an image-based coding approach for detecting phishing attacks using a CNN-LSTM hybrid model. This approach combines SMOTE, an enhanced GAN based on the Autoencoder network, and swarm intelligence algorithms to balance the dataset, select informative features, and generate grayscale images. Experiments on three benchmark datasets demonstrate that the proposed method achieves superior accuracy, precision, and sensitivity compared to other techniques, effectively identifying phishing attacks and enhancing online security. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
41. A framework of Polar CanisFel optimization-based deep ensemble classifier with graph embedding for imbalanced data classification.
- Author
-
Bhowate, Vikas Gajananrao and Reddy, T. Hanumantha
- Subjects
*CONVOLUTIONAL neural networks, *RECURRENT neural networks, *DATA mining, *HEART failure, *BIG data
- Abstract
Imbalanced data classification (IDC) presents a significant challenge in data mining (DM), as it frequently occurs in various real-world areas with profound implications for highly skewed databases. IDC revolves around the task of learning from data characterized by a substantial imbalance in the number of samples across its different classes. Hence the Polar-CanisFel (PCF) Optimization-deep ensemble model is designed to address imbalanced big data issues, incorporating the SMOTE technique for rebalancing the dataset. This ensemble classifier leverages a deep convolutional neural network (DCNN), Long Short-Term Memory (LSTM), and Gated Recurrent Neural Network (GRNN) architectures for effective data classification. For the Heart Failure Prediction Dataset, the model reaches an accuracy of 96.35%, sensitivity of 94.54%, and specificity of 96.11%. Further, the accuracy of 95.91%, sensitivity of 95.87%, and specificity of 94.79% are obtained concerning the Stroke Prediction dataset. Finally, when applied to the Hepatitis-C prediction dataset, the model attains an accuracy of 92.79%, sensitivity of 92.90%, and specificity of 92.63% during 90% of training. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
42. An IoT attack detection method based on ensemble learning [基于集成学习的物联网攻击检测方法].
- Author
-
窦佳恩, 张瑛瑛, and 陈 玮
- Abstract
Copyright of Ordnance Industry Automation is the property of Editorial Board for Ordnance Industry Automation and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2024
- Full Text
- View/download PDF
43. For Robust DDoS Attack Detection by IDS: Smart Feature Selection and Data Imbalance Management Strategies.
- Author
-
Berbiche, Naoual and El Alami, Jamila
- Subjects
MACHINE learning, COMPUTER network security, DENIAL of service attacks, FEATURE selection, COMPUTER network traffic
- Abstract
Computer network security represents a major challenge in the digital age, where intrusions threaten data confidentiality, accuracy and accessibility. To safeguard data and online services, Intrusion Detection Systems (IDS) monitor network traffic for any signs of malicious activity. The integration of artificial intelligence into IDSs offers new perspectives, but poses challenges, particularly in terms of feature selection and data imbalance management. Our research focused on identifying DDoS attacks, a major threat to the accessibility of online services. We evaluated the effectiveness of IDS against these attacks by testing the RF, XGB, SGD, LGB and MLP machine learning models on the CICIDS2018 DDoS attacks dataset. To optimize data quality, we adopted a strategic feature selection approach based on correlation matrix, mutual information and feature importance, reducing data dimensionality and improving model performance. Then, by balancing our dataset using oversampling techniques such as SMOTE, BorderlineSMOTE and ADASYN, we achieved better model generalization and reduced false positives. Our results showed that the ADASYN+SMOTE+XGB configuration was the best for DDoS attack detection in terms of effectiveness, false positives and execution duration. Our approach, combining judicious feature selection and resampling, has enabled us to create better-performing intrusion detection systems, strengthening network security against increasingly sophisticated threats. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
44. Evaluation of Resampling Techniques in CNN-Based Heartbeat Classification.
- Author
-
Subhiyakto, Egia Rosi, Rakasiwi, Sindhu, Zeniarja, Junta, Paramita, Cinantya, Shidik, Guruh Fajar, Hasibuan, Zainal Arifin, and Kesić, Marijana Geets
- Subjects
TRANSFORMER models, CONVOLUTIONAL neural networks, DATABASES, ELECTROCARDIOGRAPHY, MEDICAL records
- Abstract
This study investigates the efficacy of resampling techniques in ECG classification, addressing the challenge of data imbalance in heartbeat classification. Utilizing the PTB Diagnostic ECG database, the research focuses on the application of various Synthetic Minority Over-sampling Technique (SMOTE) variations, including SMOTE Borderline, ADASYN, Tomek, and ENN, alongside three algorithms: CNN, Transformer, and LSTM. The dataset, encompassing 549 patient records from 290 subjects, was bifurcated into training and testing segments, classifying heartbeats into normal and abnormal categories. The novelty of this work lies in its combined deep-structured learning model that integrates CNN, Transformer, and LSTM, further enhanced by an ensemble of these algorithms with original SMOTE and its variants for dataset balancing. The research revealed that the proposed method significantly ameliorates the classification of heartbeats, effectively addressing the class imbalance issue prevalent in ECG data. The results demonstrated that the transformer network, in particular, excelled in recognizing temporal continuities and extracting deep-seated features from ECG signals, thereby enhancing the model's performance beyond the capabilities of basic models. Key results indicate that CNN+SMOTE Borderline achieves the highest testing accuracy at 99.36%, while CNN+SMOTE Tomek leads in precision with 99.89%. Transformers excel in recall with a perfect score of 100%. The research concludes that CNNs effectively distinguish normal from abnormal heartbeats, with the highest accuracy using CNN+SMOTE at 99.06%. However, the study also acknowledges limitations, such as the dataset's restricted scope, and suggests further research with a more diverse dataset. Overall, the study demonstrates the effectiveness of CNN in ECG arrhythmia classification, offering a foundation for more advanced automatic diagnostic systems in cardiology. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
45. Explainable lung cancer classification with ensemble transfer learning of VGG16, Resnet50 and InceptionV3 using grad-cam.
- Author
-
Kumaran S, Yogesh, Jeya, J. Jospin, R, Mahesh T, Khan, Surbhi Bhatia, Alzahrani, Saeed, and Alojail, Mohammed
- Subjects
TUMOR classification, LUNG cancer, INTEGRATED learning systems, DEEP learning, DIAGNOSTIC imaging, SIGNAL convolution
- Abstract
Medical imaging stands as a critical component in diagnosing various diseases, where traditional methods often rely on manual interpretation and conventional machine learning techniques. These approaches, while effective, come with inherent limitations such as subjectivity in interpretation and constraints in handling complex image features. This research paper proposes an integrated deep learning approach utilizing pre-trained models—VGG16, ResNet50, and InceptionV3—combined within a unified framework to improve diagnostic accuracy in medical imaging. The method focuses on lung cancer detection using images resized and converted to a uniform format to optimize performance and ensure consistency across datasets. Our proposed model leverages the strengths of each pre-trained network, achieving a high degree of feature extraction and robustness by freezing the early convolutional layers and fine-tuning the deeper layers. Additionally, techniques like SMOTE and Gaussian Blur are applied to address class imbalance, enhancing model training on underrepresented classes. The model's performance was validated on the IQ-OTH/NCCD lung cancer dataset, which was collected from the Iraq-Oncology Teaching Hospital/National Center for Cancer Diseases over a period of three months in fall 2019. The proposed model achieved an accuracy of 98.18%, with precision and recall rates notably high across all classes. This improvement highlights the potential of integrated deep learning systems in medical diagnostics, providing a more accurate, reliable, and efficient means of disease detection. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
46. Smart Multistage Privacy-Preserving Framework for Intrusion Detection in Multi-Domain SDN.
- Author
-
Padmanabhan, Jayashree, Prabu, Saranya, Balakrishnan, Saikrishna, and Vijay, Vinayaka Murthy
- Subjects
*DENIAL of service attacks, *DECISION trees, *ELECTRONIC paper, *TRAFFIC monitoring, *COMPUTATIONAL complexity, *INTRUSION detection systems (Computer security)
- Abstract
SDN architectures are frequently used by organizations for the management of their networks and the detection of anomalous traffic in a single domain. However, in the real world, anomalous traffic might result in attacks like distributed denial of service (DDoS) that affect numerous domains. During intrusion detection, each SDN domain has to send a large volume of real traffic data to the multi-domain controller, exposing its sensitive information. This paper proposes a smart multistage framework for detecting attacks and ensuring privacy at no additional cost. This work utilized the recent unbalanced InSDN dataset for experimentation. It also uses an oversampling technique that reduces the imbalance rate for each attack type and selects the smallest possible training size and feature set size for an increase in detection accuracy and a reduction in computational complexity. Then, a multi-class classifier method for intrusion detection that does not require regularization or hyperparameter tuning, called ensemble-learning-based shallow decision tree (ELSDT), is proposed. Furthermore, the performance of the proposed classifier on the InSDN dataset is assessed on an SDN testbed. Experimental results show the ability of the proposed smart multistage privacy-preserving framework to significantly reduce the training sample size and feature set size to 87% and 76%, respectively. It also outperforms recent works in the literature, improving accuracy by 5.67%. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
47. Enhancing Machine Learning Performance in Estimating CDOM Absorption Coefficient via Data Resampling.
- Author
-
Kim, Jinuk, Kim, Jin Hwi, Jang, Wonjin, Pyo, JongCheol, Lee, Hyuk, Byeon, Seohyun, Lee, Hankyu, Park, Yongeun, and Kim, Seongjoon
- Subjects
*ABSORPTION coefficients, *RESAMPLING (Statistics), *MACHINE performance, *MACHINE learning, *BOOSTING algorithms, *STANDARD deviations, *RANDOM forest algorithms, *DISSOLVED organic matter
- Abstract
Chromophoric dissolved organic matter (CDOM) is a mixture of various types of organic matter and a useful parameter for monitoring complex inland surface waters. Remote sensing has been widely utilized to detect CDOM in various studies; however, in many cases, the dataset is relatively imbalanced in a single region. To address these concerns, data were acquired from hyperspectral images, field reflection spectra, and field monitoring data, and the imbalance problem was solved using a synthetic minority oversampling technique (SMOTE). Using the on-site reflectance ratio of the hyperspectral images, the input variables Rrs (452/497), Rrs (497/580), Rrs (497/618), and Rrs (684/618), which had the highest correlation with the CDOM absorption coefficient aCDOM (355), were extracted. Random forest and light gradient boosting machine algorithms were applied to create a CDOM prediction algorithm via machine learning, and to apply SMOTE, low-concentration and high-concentration datasets of CDOM were separated at a threshold of 5 m⁻¹. The training and testing datasets were split at a 75%:25% ratio at low and high concentrations, and SMOTE was applied to generate synthetic data based on the training dataset, which is a sub-dataset of the original dataset. Datasets using SMOTE resulted in an overall improvement in the algorithmic accuracy of the training and test steps. The random forest model was selected as the optimal model for CDOM prediction. In the best-case scenario of the random forest model, the SMOTE algorithm showed superior performance, with testing R², mean absolute error (MAE), and root mean square error (RMSE) values of 0.838, 0.566, and 0.777 m⁻¹, respectively, compared to the original algorithm's test values of 0.722, 0.493, and 0.802 m⁻¹. This study is anticipated to resolve imbalance problems using SMOTE when predicting remote sensing-based CDOM. It is expected to produce and implement a machine learning model with improved and more reliable performance. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
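Result 47 compares models by testing R², MAE, and RMSE. The sketch below shows how these three regression metrics are conventionally defined. It is an illustration under assumed toy values, not the study's code or data, and the helper name `regression_metrics` is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (R^2, MAE, RMSE) for paired observed and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual and total sums of squares.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(ss_res / n)
    return r2, mae, rmse


# Toy values for illustration only (not taken from the study).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
r2, mae, rmse = regression_metrics(y_true, y_pred)
```

Because RMSE squares the residuals, it penalizes large errors more heavily than MAE, so the two metrics can rank models differently.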
48. Sentiment analysis with machine learning for drug reviews.
- Author
-
Bozkurt, Muhammed Oğuzhan, Yaman, Yağız, and Horasan, Fahrettin
- Subjects
MACHINE learning, SUPPORT vector machines, CLASSIFICATION algorithms, HEALTH status indicators, HEALTH behavior
- Abstract
When individuals take drugs on their own, without consulting a doctor, their health can deteriorate. This article aims to conduct a sentiment analysis of individuals' comments about drugs used without consultation. Within the scope of this study, patients' comments about drugs were vectorized using the bag-of-words (BoW) and TF-IDF methods, sentiment analysis was performed, and the predicted sentiments were evaluated using precision, recall, F1-score, accuracy, and AUC. The best result was obtained with TF-IDF: the linear support vector classifier reached an accuracy of 93%. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
49. A theoretical distribution analysis of synthetic minority oversampling technique (SMOTE) for imbalanced learning.
- Author
-
Elreedy, Dina, Atiya, Amir F., and Kamalov, Firuz
- Subjects
DISTRIBUTION (Probability theory), K-nearest neighbor classification, MINORITIES
- Abstract
Class imbalance occurs when the class distribution is not equal. Namely, one class is under-represented (minority class), and the other class has significantly more samples in the data (majority class). The class imbalance problem is prevalent in many real world applications. Generally, the under-represented minority class is the class of interest. The synthetic minority over-sampling technique (SMOTE) method is considered the most prominent method for handling unbalanced data. The SMOTE method generates new synthetic data patterns by performing linear interpolation between minority class samples and their K nearest neighbors. However, the SMOTE generated patterns do not necessarily conform to the original minority class distribution. This paper develops a novel theoretical analysis of the SMOTE method by deriving the probability distribution of the SMOTE generated samples. To the best of our knowledge, this is the first work deriving a mathematical formulation for the SMOTE patterns' probability distribution. This allows us to compare the density of the generated samples with the true underlying class-conditional density, in order to assess how representative the generated samples are. The derived formula is verified by computing it on a number of densities versus densities computed and estimated empirically. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
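The abstract of result 49 describes SMOTE as linear interpolation between minority class samples and their K nearest neighbors. That step can be sketched as follows, using only the Python standard library. The function name `smote_sketch`, its parameters, and the sample points are assumptions made for illustration. This is a simplified sketch of the interpolation idea, not a reference implementation of SMOTE or of the paper's analysis.

```python
import math
import random


def smote_sketch(minority, k=2, n_new=4, seed=0):
    """Generate n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a seed sample at random from the minority class.
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the other minority samples by Euclidean distance to x.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(x, p))
        neighbour = rng.choice(others[:k])
        # Interpolate: x_new = x + lam * (neighbour - x), with lam in [0, 1).
        lam = rng.random()
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, neighbour)))
    return synthetic


# Hypothetical two-dimensional minority samples.
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.0), (1.2, 1.8)]
new_points = smote_sketch(minority)
```

Every synthetic point lies on a line segment joining two existing minority samples. This is why, as the abstract notes, the generated samples need not follow the original minority class distribution.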
50. Strip Steel Defect Prediction Based on Improved Immune Particle Swarm Optimisation–Improved Synthetic Minority Oversampling Technique–Stacking.
- Author
-
Fang, Zhi, Zhang, Fan, Yu, Su, and Wang, Bintao
- Subjects
STEEL strip, PARTICLE swarm optimization, PREDICTION models, ELECTRIC arc, FORECASTING
- Abstract
A model framework for the prediction of defects in strip steel is proposed with the objective of enhancing the accuracy of defect detection. Initially, the data are balanced through the utilisation of the Improved Synthetic Minority Oversampling Technique (ISmote), which is based on clustering techniques. Subsequently, further enhancements are made to the inertia weights and learning factors of the immune particle swarm optimisation (IPSO), with additional optimisations in speed updates and population diversity. These enhancements are designed to address the issue of premature convergence at the early stages of the process and local optima at the later stages. Finally, a prediction model is constructed based on stacking, with its hyperparameters optimised through the improved immune particle swarm optimisation (IIPSO). The results of the experimental trials demonstrate that the IIPSO-ISmote-Stacking model framework exhibits superior prediction performance when compared to other models. The Macro_Precision, Macro_Recall, and Macro_F1 values for this framework are 93.3%, 93.6%, and 92.2%, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
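Result 50 reports Macro_Precision, Macro_Recall, and Macro_F1. As a hedged illustration, the sketch below assumes the common convention in which each macro score is the unweighted mean of the per-class scores. The abstract does not say which convention the authors used, and the function name `macro_scores` and the labels are hypothetical.

```python
def macro_scores(y_true, y_pred):
    """Return (macro precision, macro recall, macro F1) as unweighted
    means of the per-class scores."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # Guard against empty denominators for classes never predicted or seen.
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical labels for a three-class problem.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred)
```

Under this convention, macro F1 is the mean of the per-class F1 scores rather than the harmonic mean of macro precision and macro recall, so it can fall below both of them, as it does in this toy example.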