996 results on '"Unbalanced data"'
Search Results
2. Landslide susceptibility assessment using deep learning considering unbalanced samples distribution
- Author
-
Mwakapesa, Deborah Simon, Lan, Xiaoji, and Mao, Yimin
- Published
- 2024
- Full Text
- View/download PDF
3. Models for Insurance Fraud Detection: Dealing with Unbalanced Data
- Author
-
Carracedo, Patricia, Hervás, David, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Juan, Angel A., editor, Faulin, Javier, editor, and Lopez-Lopez, David, editor
- Published
- 2025
- Full Text
- View/download PDF
4. Blending-Based Ensemble Learning Low-Voltage Station Area Theft Detection.
- Author
-
Chen, Dunchu, Li, Wenwu, and Fang, Jie
- Subjects
-
*ENSEMBLE learning, *STANDARD deviations, *POWER resources, *POTENTIAL energy, *CLASSROOM learning centers
- Abstract
To improve the efficiency of electricity theft detection by better integrating the detection of suspect station areas and users, we propose a Blending ensemble learning electricity theft detection model based on the Base Learner Selection Strategy (BLSS). Firstly, the adaptive synthetic (ADASYN) sampling method is used to process the unbalanced power consumption data so that the sample distribution of the training data is balanced. Secondly, the BLSS selection method is used to screen the optimal base learner combination and construct the Blending ensemble learning model. Then, based on historical data, the model makes a short-term prediction of the next day's power consumption for each station area and focuses verification on suspected energy-stealing station areas where the Root Mean Square Percentage Error (RSPE) exceeds the threshold, so as to lock in potential energy-stealing users. Finally, in a comparison and verification on real examples, the search scope for electricity theft inspections was reduced by 79.17%, greatly improving the detection efficiency of the power supply company, while the model's electricity theft detection and recognition accuracy reached as high as 97.50%. The Blending ensemble learning electricity theft detection model based on the BLSS base learner selection method therefore has strong electricity theft detection and recognition ability. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
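The screening step in the abstract above (flag a station area when its percentage error exceeds a threshold) can be sketched as follows. This is only an illustration under stated assumptions: the usual definition of root mean square percentage error is used, and the station-area names, readings, and the 5% threshold are hypothetical, not values from the paper.

```python
import math


def rmspe(actual, predicted):
    """Root mean square percentage error, in percent.

    Every value in `actual` must be non-zero, since each error is
    divided by the corresponding actual value.
    """
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")
    squared = [((a - p) / a) ** 2 for a, p in zip(actual, predicted)]
    return 100.0 * math.sqrt(sum(squared) / len(squared))


def flag_station_areas(readings, threshold_pct):
    """Return the names of station areas whose RMSPE exceeds the threshold.

    `readings` maps a (hypothetical) station-area name to a pair
    (actual consumption, predicted consumption).
    """
    return sorted(
        name
        for name, (actual, predicted) in readings.items()
        if rmspe(actual, predicted) > threshold_pct
    )


# Hypothetical readings: area_A deviates by 10% on both days, area_B barely at all.
readings = {
    "area_A": ([100.0, 200.0], [110.0, 180.0]),
    "area_B": ([100.0, 200.0], [100.0, 198.0]),
}
suspects = flag_station_areas(readings, threshold_pct=5.0)
```

With these made-up readings only `area_A` is flagged, which mirrors how a threshold narrows the inspection scope.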
5. Adaptive oversampling algorithm based on boundary information (基于边界信息的自适应过采样算法).
- Author
-
杜睿山, 靳明洋, 孟令东, and 宋健辉
- Abstract
Copyright of Journal of Zhengzhou University (Natural Science Edition) is the property of Journal of Zhengzhou University (Natural Science Edition) Editorial Office and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2025
- Full Text
- View/download PDF
6. Discrimination of the Specific Gravity of Urine Using Spectrophotometry by the Parallel Connection of Two Modified Feature Selection Methods.
- Author
-
Yang, Chengbo, Cai, Zhilong, Li, Qingzhi, Tang, Feng, Wu, Jingjun, Yang, Jia, Zhang, Yurong, Li, Bo, Yang, Ping, Ye, Xin, and Yang, Liming
- Subjects
-
*STANDARD deviations, *SPECIFIC gravity, *FEATURE selection, *NONDESTRUCTIVE testing, *RANDOM forest algorithms
- Abstract
Spectroscopy has become prominent in medical surveillance due to its low cost, speed, and nondestructive testing. However, class unbalance in medical data causes existing algorithms to favor the majority classes, leading to their malfunction. This study proposes a parallel method based on two modified feature selection methods to achieve visible spectral discrimination of unbalanced urine specific gravity (USG) data. Firstly, the root mean square error (RMSE) of the successive projections algorithm (SPA) and competitive adaptive reweighted sampling (CARS) was modified by increasing weight coefficients. Then, SPA, CARS, modified SPA (mSPA), modified CARS (mCARS), tandem connection of SPA and CARS (CARS-SPA), tandem connection of mSPA and CARS (mCARS-mSPA), parallel connection of SPA and CARS (CARS + SPA), and parallel connection of mSPA and CARS (mCARS + mSPA) were used to select characteristic wavelengths from the full spectra. Finally, based on the variable subsets extracted by each method, random forest (RF) models were established to verify the performance of the parallel strategy and the modification method. The results showed that the RF model of mCARS + mSPA achieved effective discrimination of USG with high accuracy (92.81%), high sensitivity (0.9270), and high resolution (0.9280). This means that a parallel hybrid based on two modified feature selection methods can effectively select feature wavelengths beneficial for minority class recognition, achieving the mining of spectral features of unbalanced data. At the same time, this study also provides a novel example of the strategy of parallel feature selection methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
7. Handling imbalanced samples in landslide susceptibility evaluation
- Author
-
You TIAN, Bo GAO, Hong YIN, Yuanling LI, Jiajia ZHANG, Long CHEN, and Hongliang LI
- Subjects
landslide susceptibility, smote, evaluation model, changdu, unbalanced data, Geology, QE1-996.5
- Abstract
In landslide susceptibility assessment, different approaches to handling sample imbalance can introduce significant uncertainty in evaluation outcomes. To address this issue, this study focused on the Changdu area of eastern Tibet and constructed a landslide susceptibility evaluation model using a dataset with imbalanced landslide and non-landslide samples. Three treatment schemes were applied: no treatment, downsampling, and SMOTE oversampling. The logistic regression method was used to construct the landslide susceptibility evaluation model. Based on the ROC curve, accuracy, precision, recall, missed detection rate, and other evaluation indicators, the comprehensive evaluation index F1′ score was used to verify the accuracy of model classification. The results show that the modeling effect obtained after balancing the data (downsampling or oversampling) is greatly improved compared with that obtained from unprocessed data. Specifically, the F1′ score of the comprehensive index increased by 53.17%. Of the two balancing schemes, the oversampling method increased the F1′ score by 16.30% compared with the downsampling method, indicating that oversampling is the more effective way of handling unbalanced data. This study can provide basic information for processing data sets before landslide prediction and geological disaster prediction, and provide theoretical and technical support for further improving regional disaster prevention and mitigation.
- Published
- 2024
- Full Text
- View/download PDF
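The two balancing schemes compared in the abstract above can be sketched in a few lines. This is a simplified illustration, not the authors' pipeline: the SMOTE step follows the commonly described idea of interpolating between a minority sample and one of its nearest minority neighbours, and the standard F1 score is computed because the abstract does not define its F1′ variant. All function names and sample points are hypothetical.

```python
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    randomly chosen minority point and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[: min(k, len(others))])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


def random_downsample(majority, n_keep, seed=0):
    """Keep a random subset of n_keep majority samples."""
    return random.Random(seed).sample(majority, n_keep)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Standard precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical 2-D samples: three landslide points, six non-landslide points.
landslide = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
non_landslide = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (6.0, 6.0), (7.0, 5.0), (5.0, 7.0)]
oversampled = landslide + smote_oversample(landslide, n_new=3, k=2)
downsampled = random_downsample(non_landslide, n_keep=len(landslide))
```

Oversampling keeps every original sample and adds interpolated ones, while downsampling discards majority samples; that difference is one plausible reason the two schemes score differently.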
8. Assessing temporal variability in durum wheat performance and stability through multi-trait mean performance selection in Mediterranean climate.
- Author
-
Sellami, Mohamed Houssemeddine, Di Mola, Ida, Ottaiano, Lucia, Cozzolino, Eugenio, De Vita, Pasquale, and Mori, Mauro
- Subjects
GENOTYPE-environment interaction, WHEAT, WHEAT breeding, GRAIN yields, MEDITERRANEAN climate, DURUM wheat
- Abstract
Durum wheat, a staple crop in Italy, faces substantial challenges due to increasing droughts and rising temperatures. This study examines the grain yield, agronomic traits, and quality of 41 durum wheat varieties over ten growing seasons in Southern Italy, utilizing a randomized complete block design. Notably, most varieties were not repeated between trials and 45% of the data was missing. The results indicate that the interaction between genotype and environment (GEI) significantly impacted all traits. High temperatures, elevated vapor pressure deficit (VPD), and water deficits severely affected yield and quality during warm years, while cooler years with favorable water availability promoted better growth and higher yields. Broad-sense heritability (H²) was generally low, suggesting that environmental factors played a major role in the observed traits. However, some traits, such as grain yield, ears per square meter, plant height, bleached wheat, thousand-grain weight, and hectoliter weight exhibited moderate to high heritability of the mean genotype (h²mg), indicating their potential for effective selection in breeding programs. Correlation analyses revealed strong connections between certain traits, such as protein content, and gluten index as well as between grain yield, and spike per square meter. Using the Multi-Trait Mean Performance Selection (MTMPS) index, the study identified six top-performing varieties. Among these, Antalis (G4) and Core (G18) consistently demonstrated strong adaptability and stability across different environments, particularly in hotter, drier conditions. Furio Camillo (G31) also exhibited valuable traits. This study highlights the challenges and complexities of breeding durum wheat for improved yield and quality in the face of climate change. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
9. Research on handling sample imbalance in landslide susceptibility evaluation (滑坡易发性评价中样本不均衡问题处理研究).
- Author
-
田 尤, 高 波, 殷 红, 李元灵, 张佳佳, 陈 龙, and 李洪梁
- Subjects
LANDSLIDE hazard analysis, EMERGENCY management, LANDSLIDE prediction, ELECTRONIC data processing, INFORMATION processing
- Abstract
Copyright of Hydrogeology & Engineering Geology / Shuiwendizhi Gongchengdizhi is the property of Hydrogeology & Engineering Geology Editorial Office.
- Published
- 2024
- Full Text
- View/download PDF
10. Exploration of machine learning methods for maritime risk predictions.
- Author
-
Knapp, Sabine and van de Velden, Michel
- Subjects
-
*RANDOM forest algorithms, *INDUSTRIAL management, *MACHINE learning, *RISK exposure, *STRATEGIC planning
- Abstract
Maritime applications such as targeting ships for inspections, improved domain awareness, and dynamic risk exposure assessments for strategic planning all benefit from ship-specific incident probabilities. Using a unique and comprehensive global data set, of 1.2 million observations over the period from 2014 to 2020, this study explores the effectiveness and suitability of 144 model variants from the field of machine learning for eight incident endpoints of interest and evaluating over 580 covariates. Furthermore, the importance of covariates is examined and visualized. The results differ for each endpoint of interest but confirm that random forest methods can improve prediction capabilities. Based on out-of-sample evaluations for the year 2020, targeting the top 10% most risky vessels would improve predictions by a factor of 2.7 to 4.9 compared to random selection and based on the top decile lift. Balanced random forests and random forests with balanced training variants outperform regular random forests, for which the selected variants also depend on aggregation types. The most important covariate groups for predicting incident probabilities relate to beneficial ownership, the safety management company, and the size and age of the vessel, while the relevance of these factors remains similar across the different endpoints of interest. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
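The "top decile lift" mentioned in the abstract above compares the incident rate among the 10% highest-scored vessels with the incident rate in the whole sample. A minimal sketch, assuming binary incident outcomes and hypothetical scores (the function name and data are illustrative, not taken from the study):

```python
def top_fraction_lift(scores, outcomes, fraction=0.10):
    """Incident rate in the top `fraction` of scores divided by the overall rate.

    `outcomes` holds 1 for an incident and 0 otherwise.
    """
    n = len(scores)
    if n == 0 or n != len(outcomes):
        raise ValueError("need two non-empty sequences of equal length")
    base_rate = sum(outcomes) / n
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no incidents")
    n_top = max(1, int(n * fraction))
    # Rank vessels from highest to lowest predicted risk.
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(outcome for _, outcome in ranked[:n_top]) / n_top
    return top_rate / base_rate


# Hypothetical data: 20 vessels, 2 incidents, one of them among the top 2 scores.
scores = list(range(20, 0, -1))
outcomes = [1, 0, 1, 0] + [0] * 16
lift = top_fraction_lift(scores, outcomes)
```

A lift of 1 would mean the ranking is no better than random selection; the abstract reports factors of 2.7 to 4.9 for its own models.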
11. ConvAtt Network: A Low Parameter Approach For Sign Language Recognition.
- Author
-
Rios, Gastón, Bianco, Pedro Dal, Ronchetti, Franco, Quiroga, Facundo, Ahon, Santiago Ponte, Stanchi, Oscar, and Hasperué, Waldo
- Subjects
LANGUAGE models, SIGN language, DATA augmentation, DEEP learning, FRENCH language
- Abstract
Copyright of Journal of Computer Science & Technology (JCS&T) is the property of Journal of Computer Science & Technology.
- Published
- 2024
- Full Text
- View/download PDF
12. Personalized federated learning framework for multi-source data (面向多源数据的个性化联邦学习框架).
- Author
-
裴浪涛, 陈学斌, 任志强, and 翟 冉
- Subjects
DATA privacy, FEDERATED learning, BUDGET, PRIVACY, NOISE
- Abstract
Copyright of Journal of Computer Engineering & Applications is the property of Beijing Journal of Computer Engineering & Applications Journal Co Ltd.
- Published
- 2024
- Full Text
- View/download PDF
13. Data imbalance in cardiac health diagnostics using CECG-GAN
- Author
-
Yang Yang, Tianyu Lan, Yang Wang, Fengtian Li, Liyan Liu, Xupeng Huang, Fei Gao, Shuhua Jiang, Zhijun Zhang, and Xing Chen
- Subjects
Heart disease, Generative adversarial networks, Unbalanced data, Multi-class classification, Electrocardiogram, Medicine, Science
- Abstract
Heart disease is the world’s leading cause of death. Diagnostic models based on electrocardiograms (ECGs) are often limited by the scarcity of high-quality data and issues of data imbalance. To address these challenges, we propose a conditional generative adversarial network (CECG-GAN). This strategy enables the generation of samples that closely approximate the distribution of ECG data. Additionally, CECG-GAN addresses waveform jitter, slow processing speeds, and dataset imbalance issues through the integration of a transformer architecture. We evaluated this approach using two datasets: MIT-BIH and CSPC2020. The experimental results demonstrate that CECG-GAN achieves outstanding performance metrics. Notably, the percentage root mean square difference (PRD) reached 55.048, indicating a high degree of similarity between generated and actual ECG waveforms. Additionally, the Fréchet distance (FD) was approximately 1.139, the root mean square error (RMSE) registered at 0.232, and the mean absolute error (MAE) was recorded at 0.166.
- Published
- 2024
- Full Text
- View/download PDF
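Three of the waveform-similarity metrics named in the abstract above have short closed forms. The sketch below uses their commonly cited definitions (PRD as 100 times the square root of the squared-error energy over the reference-signal energy); whether the paper normalises PRD in exactly this way is an assumption, and the sample values are made up.

```python
import math


def prd(reference, generated):
    """Percentage root mean square difference between two equal-length signals."""
    num = sum((r - g) ** 2 for r, g in zip(reference, generated))
    den = sum(r ** 2 for r in reference)
    return 100.0 * math.sqrt(num / den)


def rmse(reference, generated):
    """Root mean square error."""
    n = len(reference)
    return math.sqrt(sum((r - g) ** 2 for r, g in zip(reference, generated)) / n)


def mae(reference, generated):
    """Mean absolute error."""
    n = len(reference)
    return sum(abs(r - g) for r, g in zip(reference, generated)) / n


# Hypothetical two-sample "waveforms" chosen so the arithmetic is easy to follow.
real_ecg = [3.0, 4.0]
fake_ecg = [3.0, 1.0]
metrics = (prd(real_ecg, fake_ecg), rmse(real_ecg, fake_ecg), mae(real_ecg, fake_ecg))
```

For all three metrics, lower values indicate that the generated waveform is closer to the reference.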
14. Landslide susceptibility mapping model based on a coupled model of SMOTE-Tomek and CNN and its application: A case study in the Zigui-Badong section of the Three Gorges Reservoir area
- Author
-
Xianyu YU and Li TANG
- Subjects
landslide, landslide susceptibility assessment, smote-tomek, convolutional neural network, unbalanced data, Geology, QE1-996.5
- Abstract
China is a nation severely impacted by landslide disasters, which pose a great threat to the lives and property of people in the disaster-affected areas. Landslide susceptibility assessment, as an important tool for landslide risk prediction, is of great significance for disaster mitigation and prevention. However, traditional landslide susceptibility assessment faces the issue of imbalanced data between landslide and non-landslide samples, leading to the inherent undersampling of non-landslide data in the training set. This results in the loss of important information features related to landslide events, thereby affecting the reliability of landslide susceptibility assessment. In this study, using the Zigui-Badong section of the Three Gorges Reservoir Area as an example, 14 evaluation factors, such as elevation and slope, were chosen as landslide susceptibility assessment factors, and the data were divided into an original training set and a validation set. The synthetic minority oversampling technique - Tomek Links (SMOTE-Tomek) method was employed to process the original training dataset and construct the input training set. A convolutional neural network (CNN) was then trained using this input data, resulting in the SMOTE-Tomek-CNN coupled model. In addition, the SMOTE-Tomek method and an undersampling method (random undersampling, RUS) were each coupled with the CNN model and a support vector machine (SVM) model to form three further coupled models: SMOTE-Tomek-SVM, RUS-CNN, and RUS-SVM. These were compared with the SMOTE-Tomek-CNN coupled model. The results indicate that, among the four coupled models, the SMOTE-Tomek-CNN coupled model has the highest specific class accuracy and area under the ROC curve, with values of 73.60% and 0.965, respectively. This indicates that this method's predictive ability is superior to that of traditional methods, making it a reliable resource for landslide prediction in the studied area.
- Published
- 2024
- Full Text
- View/download PDF
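The cleaning half of SMOTE-Tomek, mentioned in the abstract above, relies on Tomek links: pairs of samples from opposite classes that are each other's nearest neighbour. A minimal sketch of detecting and removing such pairs follows. Dropping both members of a link is one possible design choice and an assumption here; the abstract does not say which members the authors remove, and the sample points are hypothetical.

```python
def _dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different class labels."""
    n = len(points)
    nearest = [
        min((j for j in range(n) if j != i), key=lambda j: _dist2(points[i], points[j]))
        for i in range(n)
    ]
    return [
        (i, j)
        for i, j in enumerate(nearest)
        if i < j and nearest[j] == i and labels[i] != labels[j]
    ]


def drop_links(points, labels, links):
    """Remove both members of every Tomek link (an assumed cleaning policy)."""
    doomed = {index for pair in links for index in pair}
    kept = [(p, l) for i, (p, l) in enumerate(zip(points, labels)) if i not in doomed]
    return [p for p, _ in kept], [l for _, l in kept]


# Hypothetical samples: indices 0 and 1 overlap across the class boundary.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0), (9.0, 9.0)]
labels = [0, 1, 0, 0, 1]
links = tomek_links(points, labels)
clean_points, clean_labels = drop_links(points, labels, links)
```

Removing linked pairs after oversampling is intended to sharpen the class boundary that interpolation may have blurred.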
15. Tool State Recognition Based on POGNN-GRU under Unbalanced Data.
- Author
-
Tong, Weiming, Shen, Jiaqi, Li, Zhongwei, Chu, Xu, Jiang, Wenqi, and Tan, Liguo
- Subjects
-
*GRAPH neural networks, *FEATURE extraction, *KERNEL functions, *PROBLEM solving, *ACQUISITION of data
- Abstract
Accurate recognition of tool state is important for maximizing tool life. However, the tool sensor data collected in real-life scenarios has unbalanced characteristics. Additionally, although graph neural networks (GNNs) show excellent performance in feature extraction in the spatial dimension of data, it is difficult to extract features in the temporal dimension efficiently. Therefore, we propose a tool state recognition method based on the Pruned Optimized Graph Neural Network-Gated Recurrent Unit (POGNN-GRU) under unbalanced data. Firstly, design the Improved-Majority Weighted Minority Oversampling Technique (IMWMOTE) by introducing an adaptive noise removal strategy and improving the MWMOTE to alleviate the unbalanced problem of data. Subsequently, propose a POG graph data construction method based on a multi-scale multi-metric basis and a Gaussian kernel weight function to solve the problem of one-sided description of graph data under a single metric basis. Then, construct the POGNN-GRU model to deeply mine the spatial and temporal features of the data to better identify the state of the tool. Finally, validation and ablation experiments on the PHM 2010 and HMoTP datasets show that the proposed method outperforms the other models in terms of identification, and the highest accuracy improves by 1.62% and 1.86% compared with the corresponding optimal baseline model. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
16. Multiple comparisons of treatment against control under unequal variances using parametric bootstrap.
- Author
-
Alver, Sarah and Zhang, Guoyi
- Subjects
-
*MULTIPLE comparisons (Statistics), *ONE-way analysis of variance, *FALSE positive error, *HETEROSCEDASTICITY, *SAMPLE size (Statistics), *VARIANCES, *CONTROL groups
- Abstract
In one-way analysis of variance models, performing simultaneous multiple comparisons of treatment groups with a control group may be of interest. Dunnett's test is used to test such differences and assumes equal variances of the response variable for each group. This assumption is not always met even after transformation. A parametric bootstrap (PB) method is developed here for comparing multiple treatment group means against the control group with unequal variances and unbalanced data. In simulation studies, the proposed method outperformed Dunnett's test in controlling the type I error under various settings, particularly when data have heteroscedastic variance and unbalanced design. Simulations show that power is often lower for the PB method than for Dunnett's test under equal variance, balanced data, or smaller sample size, but similar to or higher than for Dunnett's test with unequal variance, unbalanced data and larger sample size. The method is applied to a dataset concerning isotope levels found in elephant tusks from various geographical areas. These data have very unbalanced group sizes and unequal variances. This example illustrates that the PB method is easy to implement and avoids the need for transforming data to meet the equal variance assumption, simplifying interpretation of results. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
17. Unbalanced Data-Based Fault Diagnosis Method of Bearing Utilizing Time-Frequency DCGAN Processing.
- Author
-
Sheng-Wei FEI and Ying-Zhe LIU
- Subjects
-
*FAULT diagnosis, *K-nearest neighbor classification, *FEATURE extraction, *DIAGNOSIS methods, *ALGORITHMS
- Abstract
Aiming at the unbalanced datasets of fault samples of bearing, a fault diagnosis method of bearing based on time-frequency DCGAN processing is proposed in this paper. Firstly, through STFT, the vibration signals are converted into the time-frequency images, and then the time-frequency images are input into DCGAN to expand the fault samples. Secondly, the expanded fault samples are evaluated for image quality through the comprehensive method of PSNR and SSIM. Thirdly, the Canny edge detection algorithm is used to extract features from the time-frequency image, and the obtained binary image is used as the feature. Finally, k-nearest neighbor algorithm is used for classification to testify the superiority of time-frequency DCGAN processing. The experimental results show that the expanded samples can effectively improve the unbalance of the samples and improve the accuracy of fault diagnosis of bearing. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
18. Population Pharmacokinetics
- Author
-
Weber, Willi, Rüppel, Diether, Pugsley, Michael K., Section editor, Hock, Franz J., Section editor, Hock, Franz J., editor, and Pugsley, Michael K., editor
- Published
- 2024
- Full Text
- View/download PDF
19. VAE-CNN for Coronary Artery Disease Prediction
- Author
-
Louridi, Nabaouia, El Ouahidi, Amine, Benic, Clément, Douzi, Samira, El Ouahidi, Bouabid, Rocha, Álvaro, Series Editor, Hameurlain, Abdelkader, Editorial Board Member, Idri, Ali, Editorial Board Member, Vaseashta, Ashok, Editorial Board Member, Dubey, Ashwani Kumar, Editorial Board Member, Montenegro, Carlos, Editorial Board Member, Laporte, Claude, Editorial Board Member, Moreira, Fernando, Editorial Board Member, Peñalvo, Francisco, Editorial Board Member, Dzemyda, Gintautas, Editorial Board Member, Mejia-Miranda, Jezreel, Editorial Board Member, Hall, Jon, Editorial Board Member, Piattini, Mário, Editorial Board Member, Holanda, Maristela, Editorial Board Member, Tang, Mincong, Editorial Board Member, Ivanovíc, Mirjana, Editorial Board Member, Muñoz, Mirna, Editorial Board Member, Kanth, Rajeev, Editorial Board Member, Anwar, Sajid, Editorial Board Member, Herawan, Tutut, Editorial Board Member, Colla, Valentina, Editorial Board Member, Devedzic, Vladan, Editorial Board Member, and Farhaoui, Yousef, editor
- Published
- 2024
- Full Text
- View/download PDF
20. A Deep Learning Approach to Diabetes Diagnosis
- Author
-
Zhang, Zeyu, Ahmed, Khandaker Asif, Hasan, Md Rakibul, Gedeon, Tom, Hossain, Md Zakir, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Nguyen, Ngoc Thanh, editor, Chbeir, Richard, editor, Manolopoulos, Yannis, editor, Fujita, Hamido, editor, Hong, Tzung-Pei, editor, Nguyen, Le Minh, editor, and Wojtkiewicz, Krystian, editor
- Published
- 2024
- Full Text
- View/download PDF
21. Users’ Scenario-Base for Analysing Insider Threat Detection Based on User’s Downloads Activity Logs
- Author
-
Padiet, Peter, Islam, Rafiqul, Khan, M. Arif, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, and Arai, Kohei, editor
- Published
- 2024
- Full Text
- View/download PDF
22. Detection of Common Risk Factors Leading to the Cardiovascular Illness Using Machine Learning
- Author
-
Louridi, Nabaouia, Douzi, Samira, El Ouahidi, Bouabid, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Gherabi, Noredine, editor, Awad, Ali Ismail, editor, Nayyar, Anand, editor, and Bahaj, Mohamed, editor
- Published
- 2024
- Full Text
- View/download PDF
23. Propheter: Prophetic Teacher Guided Long-Tailed Distribution Learning
- Author
-
Xu, Wenxiang, Jing, Yongcheng, Zhou, Linyun, Huang, Wenqi, Cheng, Lechao, Feng, Zunlei, Song, Mingli, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Luo, Biao, editor, Cheng, Long, editor, Wu, Zheng-Guang, editor, Li, Hongyi, editor, and Li, Chaojie, editor
- Published
- 2024
- Full Text
- View/download PDF
24. Assessing temporal variability in durum wheat performance and stability through multi-trait mean performance selection in Mediterranean climate
- Author
-
Mohamed Houssemeddine Sellami, Ida Di Mola, Lucia Ottaiano, Eugenio Cozzolino, Pasquale De Vita, and Mauro Mori
- Subjects
durum wheat, genotype by environment interaction, mixed model, MTMPS, unbalanced data, Agriculture, Plant culture, SB1-1110
- Abstract
Durum wheat, a staple crop in Italy, faces substantial challenges due to increasing droughts and rising temperatures. This study examines the grain yield, agronomic traits, and quality of 41 durum wheat varieties over ten growing seasons in Southern Italy, utilizing a randomized complete block design. Notably, most varieties were not repeated between trials and 45% of the data was missing. The results indicate that the interaction between genotype and environment (GEI) significantly impacted all traits. High temperatures, elevated vapor pressure deficit (VPD), and water deficits severely affected yield and quality during warm years, while cooler years with favorable water availability promoted better growth and higher yields. Broad-sense heritability (H²) was generally low, suggesting that environmental factors played a major role in the observed traits. However, some traits, such as grain yield, ears per square meter, plant height, bleached wheat, thousand-grain weight, and hectoliter weight exhibited moderate to high heritability of the mean genotype (h²mg), indicating their potential for effective selection in breeding programs. Correlation analyses revealed strong connections between certain traits, such as protein content, and gluten index as well as between grain yield, and spike per square meter. Using the Multi-Trait Mean Performance Selection (MTMPS) index, the study identified six top-performing varieties. Among these, Antalis (G4) and Core (G18) consistently demonstrated strong adaptability and stability across different environments, particularly in hotter, drier conditions. Furio Camillo (G31) also exhibited valuable traits. This study highlights the challenges and complexities of breeding durum wheat for improved yield and quality in the face of climate change.
- Published
- 2024
- Full Text
- View/download PDF
25. ConvAtt Network: A Low Parameter Approach For Sign Language Recognition
- Author
-
Gaston Gustavo Rios, Pedro Dal Bianco, Franco Ronchetti, Facundo Quiroga, Santiago Ponte Ahón, Oscar Stanchi, and Waldo Hasperué
- Subjects
Deep Learning, Sequence Classification, Sign Language Recognition, Unbalanced Data, Computer engineering. Computer hardware, TK7885-7895, Electronic computers. Computer science, QA75.5-76.95
- Abstract
Despite recent advances in Large Language Models in text processing, Sign Language Recognition (SLR) remains an unresolved task. This is, in part, due to limitations in the available data. In this paper, we investigate combining 1D convolutions with transformer layers to capture local features and global interactions in a low-parameter SLR model. We experimented using multiple data augmentation and regularization techniques to categorize signs of the French Belgian Sign Language. We achieved a top-1 accuracy of 42.7% and a top-10 accuracy of 81.9% in 600 different signs. This model is competitive with the current state of the art while using a significantly lower number of parameters.
- Published
- 2024
- Full Text
- View/download PDF
26. Survey of Research on SMOTE Type Algorithms
- Author
-
WANG Xiaoxia, LI Leixiao, LIN Hao
- Subjects
unbalanced data, synthetic minority oversampling technique (smote), oversampling, supervised learning, Electronic computers. Computer science, QA75.5-76.95
- Abstract
The synthetic minority oversampling technique (SMOTE) has become one of the mainstream methods for dealing with unbalanced data due to its ability to effectively handle minority samples, and many improved SMOTE algorithms have been proposed, but very little existing research considers popular algorithmic-level improvement methods. Therefore, a more comprehensive analysis of existing SMOTE-type algorithms is provided. Firstly, the basic principles of the SMOTE method are elaborated in detail; the SMOTE-type algorithms are then systematically analyzed at two levels, the data level and the algorithmic level, and new ideas for hybrid improvement across both levels are introduced. Data-level improvement balances the data distribution by deleting or adding data through different operations during preprocessing; algorithmic-level improvement does not change the data distribution and mainly strengthens the focus on minority samples by modifying or creating algorithms. Comparison between these two kinds of methods shows that data-level methods are less restricted in their application, while algorithmic-level improvements generally have higher algorithmic robustness. To provide more comprehensive basic research material on SMOTE-type algorithms, this paper finally lists the commonly used datasets and evaluation metrics, and gives ideas for future research to better cope with the unbalanced data problem.
- Published
- 2024
- Full Text
- View/download PDF
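The abstract above contrasts data-level methods, which resample, with algorithmic-level methods, which leave the data untouched and instead make the learner pay more attention to the minority class. One simple algorithmic-level device is class weighting, sketched below with the common inverse-frequency ("balanced") heuristic. It is an illustrative example chosen for this note, not a method singled out by the survey, and the labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def weighted_error_rate(y_true, y_pred, weights):
    """Share of total sample weight that falls on misclassified samples."""
    total = sum(weights[t] for t in y_true)
    wrong = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return wrong / total


# Hypothetical labels: 8 majority samples (0) and 2 minority samples (1).
y_true = [0] * 8 + [1] * 2
weights = balanced_class_weights(y_true)
always_majority = [0] * 10
plain_error = sum(t != p for t, p in zip(y_true, always_majority)) / len(y_true)
weighted_error = weighted_error_rate(y_true, always_majority, weights)
```

A classifier that always predicts the majority class looks good on the plain error rate (0.2) but poor on the weighted one (0.5), which is the shift in focus the abstract describes, achieved without altering the data distribution.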
27. Fault diagnosis of transformer oil-paper bushings in PSO-BPNN algorithm based on ADASYN data balancing
- Author
-
YANG Hao, HU Wenxiu, ZHANG Lu, CHEN Jinpeng, ZHOU Sijia, and ZHAO Sirui
- Subjects
transformer bushing, fault diagnosis, dissolved gas in oil, back propagation neural network (bpnn), unbalanced data, adaptive synthetic sampling (adasyn), Applications of electric power, TK4001-4102
- Abstract
The insulation performance of transformer bushings is a crucial aspect that directly affects the safe operation of equipment. To diagnose the insulation status of transformer bushings and mitigate the impact of small-sample imbalanced data on diagnostic results, particle swarm optimization combined with a back propagation neural network (PSO-BPNN) and the adaptive synthetic sampling (ADASYN) method are employed for fault diagnosis of transformer bushings. Initially, historical fault data of transformer bushings are gathered, and a sample set of dissolved gases in transformer oil with distinct fault categories is established. The ADASYN algorithm is used to synthesize minority class samples from the original data, yielding balanced fault data. The balanced dissolved-gas data serve as the model input, and the fault status is used as the label output to diagnose the transformer bushings using the PSO-BPNN model. For comparison, the back propagation neural network (BPNN), genetic algorithm combined with back propagation neural network (G-BPNN), cuckoo search combined with back propagation neural network (CS-BPNN), and PSO-BPNN models are used to diagnose the bushings under the original sample set. The results reveal that the PSO-BPNN model based on ADASYN-balanced data exhibits the highest accuracy among the various models for diagnosing the insulation status of transformer bushings. This approach effectively mitigates the impact of small-sample imbalanced data on diagnostic results and provides an effective method for assessing the insulation performance of transformer bushings.
- Published
- 2024
- Full Text
- View/download PDF
28. Stratifying risk of disease in haematuria patients using machine learning techniques to improve diagnostics.
- Author
-
Drożdż, Anna, Duggan, Brian, Ruddock, Mark W., Reid, Cherith N., Jo Kurth, Mary, Watt, Joanne, Irvine, Allister, Lamont, John, Fitzgerald, Peter, O'Rourke, Declan, Curry, David, Evans, Mark, Boyd, Ruth, and Sousa, Jose
- Subjects
MACHINE learning ,HEMATURIA ,RANDOM forest algorithms ,DECISION trees ,CYSTATIN C - Abstract
Background: Detailed and invasive clinical investigations are required to identify the causes of haematuria. Highly unbalanced patient population (predominantly male) and a wide range of potential causes make the ability to correctly classify patients and identify patient-specific biomarkers a major challenge. Studies have shown that it is possible to improve the diagnosis using multi-marker analysis, even in unbalanced datasets, by applying advanced analytical methods. Here, we applied several machine learning algorithms to classify patients from the haematuria patient cohort (HaBio) by analysing multiple biomarkers and to identify the most relevant ones. Materials and Methods: We applied several classification and feature selection methods (k-means clustering, decision trees, random forest with LIME explainer and CACTUS algorithm) to stratify patients into two groups: healthy (with no clear cause of haematuria) or sick (with an identified cause of haematuria e.g., bladder cancer, or infection). The classification performance of the models was compared. Biomarkers identified as important by the algorithms were also analysed in relation to their involvement in the pathological processes. Results: Results showed that a high unbalance in the datasets significantly affected the classification by random forest and decision trees, leading to the overestimation of the sick class and low model performance. CACTUS algorithm was more robust to the unbalance in the dataset. CACTUS obtained a balanced accuracy of 0.747 for both genders, 0.718 for females and 0.803 for males. The analysis showed that in the classification process for the whole dataset: microalbumin, male gender, and tPSA emerged as the most informative biomarkers. For males: age, microalbumin, tPSA, cystatin C, BTA, HAD and S100A4 were the most significant biomarkers while for females microalbumin, IL-8, pERK, and CXCL16. 
Conclusions: CACTUS algorithm demonstrated improved performance compared with other methods such as decision trees and random forest. Additionally, we identified the most relevant biomarkers for the specific patient group, which could be considered in the future as novel biomarkers for diagnosis. Our results have the potential to inform future research and provide new personalised diagnostic approaches tailored directly to the needs of the individuals. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
29. PSO-BPNN fault diagnosis of transformer bushings based on ADASYN data balancing.
- Author
-
杨昊, 胡文秀, 张璐, 陈晋鹏, 周思佳, and 赵思瑞
- Abstract
Copyright of Electric Power Engineering Technology is the property of Editorial Department of Electric Power Engineering Technology and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2024
- Full Text
- View/download PDF
30. Interpretable optimal hypersphere support vector machine integrating data-mining knowledge.
- Author
-
陆思洁, 范頔, 渐令, and 郜传厚
- Abstract
Copyright of Control Theory & Applications / Kongzhi Lilun Yu Yinyong is the property of Editorial Department of Control Theory & Applications and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2024
- Full Text
- View/download PDF
31. Unsupervised Feature-Preserving CycleGAN for Fault Diagnosis of Rolling Bearings Using Unbalanced Infrared Thermal Imaging Sample
- Author
-
Lujiale Guo, Joon Huang Chuah, Wong Jee Keen Raymond, Xiaohui Gu, Jie Yao, and Xiangqian Chang
- Subjects
Fault diagnosis ,rolling bearing ,infrared thermal imaging ,unbalanced data ,generative adversarial networks ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
The fault diagnosis of rolling bearings is of great significance in industrial safety. Infrared thermal imaging combined with a neural network can diagnose rolling bearing faults in a non-contact manner; however, its data in different scenes are often unbalanced and difficult to obtain. Generative adversarial networks can solve this problem by generating data with the required features. In this paper, an unsupervised learning framework named Feature-Preserving Cycle-Consistent Generative Adversarial Networks (FP-CycleGAN) is designed for defect detection in unbalanced rolling bearing infrared thermography samples. The classical Cycle-Consistent Generative Adversarial Network (CycleGAN) must often balance the weights of the generation, discrimination, and consistency losses when converting features from the source domain to the target domain, and this process often results in mode collapse or feature loss. To avoid this problem, a new discriminator is designed to identify whether the generated images A and B belong to two different classes, and a new class loss is proposed. In order to better extract fault features and perform feature migration, the new generator is reconstructed based on the U-Network structure, and the transposed-convolution (ConvTranspose) method of the up-sampling network is replaced by bicubic interpolation to effectively avoid the checkerboard effect in the generated images. Defect detection on the expanded dataset was performed using a Residual Network and compared with the pre-expansion data to demonstrate the usability of the generated data and the superiority of the proposed FP-CycleGAN method for rolling bearing defect detection with small samples of infrared thermal images.
- Published
- 2024
- Full Text
- View/download PDF
32. Novel Adversarial Unsupervised Subdomain Adaption Multi-Channel Deep Convolutional Network for Cross-Operating Fault Diagnosis of Rolling Bearings
- Author
-
Bo Zhang, Tianlong Huo, Zheng Liu, Baoquan Hu, Heyue Huang, Zehai Ren, and Jianbo Ji
- Subjects
Intelligent cross-domain fault diagnosis ,unbalanced data ,adversarial domain adaptation ,subdomain adaptation ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
Rolling bearings in production practice usually operate in a healthy state. Labels for some fault states are scarce or entirely absent, resulting in unbalanced data categories. Meanwhile, frequent switching of working conditions results in significant differences in data distribution among working conditions, and labeled data in some working states cannot be fully utilized. To deal with the challenge of low fault identification accuracy caused by these practical factors, this paper proposes a novel adversarial unsupervised subdomain adaption multi-channel deep convolutional network (ASMDCN). Firstly, a parallel three-channel deep feature extraction module is built, and multi-scale convolution kernels are used to fully extract the rich features of vibration signals under various working conditions. Secondly, a novel loss function is designed to adequately consider the classification difficulty of samples and the degree of class imbalance. Finally, the adversarial training strategy is used to force the feature extractor to extract domain-invariant features, and the Local Maximum Mean Discrepancy (LMMD) is used to align the global and related subdomains of the source and target domains. The experimental results show that the designed feature extractor can fully extract the domain-invariant features of the rolling bearings under different working conditions. Under the proposed objective function optimization, the network model can fully align the features of multi-source and single-target domains under unbalanced data and has strong generalization performance.
- Published
- 2024
- Full Text
- View/download PDF
33. FADA-SMOTE-Ms: Fuzzy Adaptative Smote-Based Methods
- Author
-
Roudani Mohammed and El Moutaouakil Karim
- Subjects
Classification ,oversampling ,SMOTE ,unbalanced data ,big data ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
The Synthetic Minority Over-Sampling Technique (SMOTE) is one of the most well-known methods to solve the unequal class distribution problem in imbalanced datasets. However, it has three shortcomings: (1) it may cause the over-generalization problem due to oversampling of noisy samples, (2) over-sampling of uninformative samples, and (3) increasing the overlaps between different classes around the class boundaries. Different SMOTE-based approaches have been proposed to handle these problems, but most of them implement hyperparameters and tend to generate noise because the synthetic sample is generated, randomly, in the area delimited by current random minority data. In this research, an improved SMOTE-based method, namely Fuzzy-ADAptative-SMOTE-Based-Methods (FADA-SMOTE-Ms), which targets all three problems at the same time, is introduced. In this regard, the $\alpha$-SMOTE is chosen in such a way that the synthetic data is as far as possible from the two closest majority data. More precisely, this method proceeds in six steps: (a) clustering the minority class into k groups, (b) selecting a safe region, (c) randomly selecting two minority data points, (d) finding the M closest majority data to these minority data using original membership functions based on Fuzzy mean and flirting results, (e) finding the $\alpha$-SMOTE producing a synthetic data point as close as possible to the minority class and as far as possible from the M majority data by solving a very simple multi-objective mathematical optimization model, and (f) using SMOTE to generate synthetic samples using the optimal $\alpha$-SMOTE. FADA-SMOTE-Ms is evaluated using 5 classifiers and 21 unbalanced datasets, and it is compared to 8 oversampling methods using three performance measures. FADA-SMOTE-Ms consistently outperforms other popular oversampling methods.
- Published
- 2024
- Full Text
- View/download PDF
34. Wheat Lodging Types Detection Based on UAV Image Using Improved EfficientNetV2
- Author
-
LONG Jianing, ZHANG Zhao, LIU Xiaohang, LI Yunxia, RUI Zhaoyu, YU Jiangfan, ZHANG Man, FLORES Paulo, HAN Zhexiong, HU Can, and WANG Xufeng
- Subjects
wheat lodging types ,image processing ,deep learning ,unbalanced data ,machine learning ,uav ,Agriculture (General) ,S1-972 ,Technology (General) ,T1-995 - Abstract
Objective: Wheat, as one of the major global food crops, plays a key role in food production and food supply. Different influencing factors can lead to different types of wheat lodging; e.g., root lodging may be due to improper use of fertilizers, while stem lodging is mostly due to harsh environments. Different types of wheat lodging can have different impacts on yield and quality. The aim of this study was to categorize the types of wheat lodging by unmanned aerial vehicle (UAV) image detection and to investigate the effect of UAV flight altitude on the classification performance. Methods: Three UAV flight altitudes (15, 45, and 91 m) were set to acquire images of wheat test fields. The main research methods contained three parts: an automatic segmentation algorithm, wheat classification model selection, and an improved classification model based on EfficientNetV2-C. In the first part, the automatic segmentation algorithm was used to segment the wheat test field images acquired by the UAV at three different heights and to build the training dataset needed for the classification model. The main steps were first to preprocess the original wheat test field images acquired by the UAV through scaling, skew correction, and other methods to save computation time and improve segmentation accuracy. Subsequently, the pre-processed image information was analyzed, and the green part of the image was extracted using the super green algorithm, which was binarized and combined with the edge contour extraction algorithm to remove the redundant part of the image and extract the region of interest, so that the image was segmented for the first time. 
Finally, pixel accumulation was used to detect abrupt changes in value and thereby find the segmentation coordinates of the two different sizes of wheat test field in the image, and the region of interest of the wheat test field was segmented a second time into long rectangular and short rectangular test fields, so as to obtain the structural parameters of the different sizes of wheat test field and then generate the datasets for the different heights. In the second part, four machine learning classification models, support vector machine (SVM), K nearest neighbor (KNN), decision tree (DT), and naive Bayes (NB), and two deep learning classification models (ResNet101 and EfficientNetV2) were selected. Without any improvement, the six classification models were used to classify the images collected at the three UAV flight altitudes, respectively, and the optimal classification model was selected for improvement. In the third part, an improved model, EfficientNetV2-C, with EfficientNetV2 as the base model, was proposed to classify and recognize the lodging type of wheat in test field images. The main improvements were to the attention mechanism and the loss function. The attention improvement replaced the original model's squeeze-and-excitation (SE) module with coordinate attention (CA), which embeds position information into the channel attention, aggregates features along the width and height directions separately during feature extraction, and captures long-distance correlation in the width direction while retaining long-distance correlation and accurate location information in the length direction, enhancing the spatial feature extraction capability of the network. 
The loss function was replaced by class-balanced focal loss (CB-Focal Loss), which assigns different loss weights according to the number of valid samples in each class when targeting unbalanced datasets, effectively mitigating the impact of data imbalance on the classification accuracy of the model. Results and Discussions: For the four machine learning models, the average classification accuracy was 81.95% for SVM, 79.56% for DT, 59.32% for KNN, and 59.48% for NB. The average classification accuracies of the two deep learning models, ResNet101 and EfficientNetV2, were 78.04% and 81.61%, respectively. Comparing the above six classification models, the EfficientNetV2 classification model performed optimally at all heights. The improved EfficientNetV2-C had an average accuracy of 90.59%, which was 8.98 percentage points higher than the average accuracy of EfficientNetV2. The SVM classification accuracies at the three UAV flight altitudes of 15, 45, and 91 m were 81.33%, 83.57%, and 81.00%, respectively, with the highest accuracy at 45 m. The SVM results were similar across altitudes, which indicated that the imbalance of the input data categories did not affect the model's classification effect, and that the SVM classification model was able to handle the high dimensionality of the data efficiently and performed well on small and medium-sized datasets. For the deep learning classification models, however, as the flight altitude increased from 15 to 91 m, the classification performance decreased due to the loss of image feature information. 
Among them, the classification accuracy of ResNet101 decreased from 81.57% to 78.04%, the classification accuracy of EfficientNetV2 decreased from 84.40% to 81.61%, and the classification accuracy of EfficientNetV2-C decreased from 97.65% to 90.59%. For EfficientNetV2-C at each of the three altitudes, the differences among the precision, recall, and F1-score values were small, which indicated that the improved model in this study could effectively solve the problems of unbalanced model classification results and poor classification effect caused by data imbalance. Conclusions: The improved EfficientNetV2-C achieved high accuracy in wheat lodging type detection, which provides a new solution for wheat lodging early warning and crop management and is of great significance for improving wheat production efficiency and sustainable agricultural development.
- Published
- 2023
- Full Text
- View/download PDF
35. Intrusion detection method fusing multi-scale convolution and a dual attention mechanism.
- Author
-
陈 虹, 李泓绪, and 金海波
- Abstract
In order to improve the accuracy of internet intrusion detection methods, an intrusion detection method combining a convolutional neural network and an attention mechanism is proposed. The Borderline-SMOTE oversampling algorithm and MinMax normalization are used to preprocess the data, which effectively alleviates the problem of large differences in the amount of intrusion data across classes and improves detection performance on unbalanced data. The Inception structure of the convolutional neural network is used for multi-scale feature extraction, and the attention mechanism is used for dimension update to improve the accuracy of feature expression when the model processes massive data. The experiment shows that the average accuracy of the intrusion detection method is 99.57%. Compared with SVM, CNN, RNN, and BLS-GMM, the accuracy increases by 4.48%, 1.35%, 1.62% and 0.04% respectively, and the recall increases by 4.48%, 1.36%, 1.62% and 0.14% respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
36. Providing an Approach for Early Prediction of Fall in Human Activities Based on Wearable Sensor Data and the Use of Deep Learning Algorithms.
- Author
-
Hatkeposhti, Rahman Keramati, Yadollahzadeh-Tabari, Meisam, and Golsorkhtabariamiri, Mehdi
- Subjects
- *
MACHINE learning , *AUTUMN , *DEEP learning , *WEARABLE technology , *SAMPLING methods , *HUMAN activity recognition - Abstract
Falling is one of the major health concerns, and its early detection is very important. The goal of this study is an early prediction of impending falls using wearable sensors data. The SisFall data set has been used along with two deep learning models (CNN and a combination model named Conv_Lstm). Also, a dynamic sampling method is offered to improve the accuracy of the models by increasing the equilibrium rate between the samples of the majority and minority classes. To fulfill the main idea of this paper, we present a future prediction strategy. Then, by defining a time variable 'T', the system replaces and labels the state of the next T s instead of considering the current state only. This leads to predicting falling states at the beginning moments of balance disturbance. The results of the experiments show that the Conv_Lstm model was able to predict the fall in 78% of cases and an average of 340 ms before the accident. Also, for the Sensitivity criterion, a value of 95.18% has been obtained. A post-processing module based on the median filter was implemented, which could increase the accuracy of predictions to 95%. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
37. A quality detection method of the unbalanced data based on the non‐parameter Log–Log prediction model with the feature extraction.
- Author
-
Wang, Shuying, Zhao, Bo, Wang, Chunjie, and Chen, Jia
- Abstract
In quality detection, it is important to classify and predict unbalanced data sets in which the proportions of qualified and unqualified products differ greatly. Some machine learning methods are already available. However, these existing methods assume that the samples are evenly distributed among the different classes and ignore the unbalanced characteristics of the data. In addition, existing methods cannot be directly applied to high‐dimensional data and cannot accurately express the relationship between data features and the quality of industrial engineering products. In this paper, we propose a new quality detection method for unbalanced data by establishing a non‐parameter Log–Log classification model. Principal component analysis (PCA) is used to extract the features and reduce the dimension of the original data sets. We develop a sieve maximum likelihood algorithm to obtain the non‐parameter function classifier. The proposed method is applied to product quality detection in industrial semiconductor manufacturing. The results show that the proposed method has high detection performance and classification ability. Compared with traditional machine learning methods, the proposed method has a higher classification accuracy, can better describe the relationship between product characteristics and product quality, and has a strong generalization ability for different data sets. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
38. EVALUATION OF PERFORMANCE MEASURES FOR RELIABLE AND SECURE PHISHING DETECTION SYSTEM.
- Author
-
Barot, Pratikkumar A., Patel, Sunil A., and Jethva, H. B.
- Subjects
- *
PHISHING , *INTERNET security - Abstract
Phishing is an illegal act and security breach which acquires a user's confidential information without consent. Anti-phishing techniques are used to detect and prevent such malicious acts and to provide data safety to the end user. Researchers have proposed anti-phishing solutions based on techniques such as blacklist records, heuristic functions, visual similarity, and machine learning algorithms. In recent times many researchers have proposed machine learning techniques for phishing detection and achieved more than 90% accuracy. However, there is a reliability issue in the accuracy measures used by the researchers. In real life, the phishing dataset is unbalanced. Most researchers ignore this data quality issue in their research design. In the case of unbalanced data, the traditional accuracy measure does not give a proper performance evaluation; it shows biased performance evaluations. In this paper, we experimented with an unbalanced dataset for phishing detection and performed a detailed result analysis to highlight the reliability issue of traditional performance evaluation measures for unbalanced data classification. We experimented with four classification algorithms and found that more than 90% accuracy does not qualify any classifier as secure and safe if the dataset is unbalanced. Our work highlights the data factors and algorithmic limitations that compromise system security and data safety. [ABSTRACT FROM AUTHOR]
- Published
- 2023
39. Variable Selection in Binary Logistic Regression for Modelling Bankruptcy Risk
- Author
-
Pierri, Francesca, Kitsos, Christos P., editor, Oliveira, Teresa A., editor, Pierri, Francesca, editor, and Restaino, Marialuisa, editor
- Published
- 2023
- Full Text
- View/download PDF
40. IGFClust: Clustering Unbalanced and Complex Single-Cell Expression Data by Iteration and Integrating Gini Index and Fano Factor
- Author
-
Li, Han, Zeng, Feng, Yang, Fan, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Prates, Raquel Oliveira, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Hong, Wenxing, editor, and Weng, Yang, editor
- Published
- 2023
- Full Text
- View/download PDF
41. Privacy-Enhanced ZKP-Inspired Framework for Balanced Federated Learning
- Author
-
Marzo, Stefano, Pinto, Royston, McKenna, Lucy, Brennan, Rob, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Prates, Raquel Oliveira, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Longo, Luca, editor, and O’Reilly, Ruairi, editor
- Published
- 2023
- Full Text
- View/download PDF
42. Leveraging augmentation techniques for tasks with unbalancedness within the financial domain: a two-level ensemble approach
- Author
-
Golshid Ranjbaran, Diego Reforgiato Recupero, Gianfranco Lombardo, and Sergio Consoli
- Subjects
Augmentation techniques ,Ensemble method ,Financial sector ,Machine learning ,Unbalanced data ,Computer applications to medicine. Medical informatics ,R858-859.7 - Abstract
Modern financial markets produce massive datasets that need to be analysed using new modelling techniques like those from (deep) Machine Learning and Artificial Intelligence. The common goal of these techniques is to forecast the behaviour of the market, which can be translated into various classification tasks, such as, for instance, predicting the likelihood of companies’ bankruptcy or in fraud detection systems. However, it is often the case that real-world financial data are unbalanced, meaning that the classes are not equally represented in such datasets. This is the main issue, since any Machine Learning model is trained mainly on the majority class, leading to inaccurate predictions. In this paper, we explore different data augmentation techniques to deal with very unbalanced financial data. We consider a number of publicly available datasets, then apply state-of-the-art augmentation strategies to them, and finally evaluate the results for several Machine Learning models trained on the sampled data. The performance of the various approaches is evaluated according to their accuracy, micro, and macro F1 score, and finally by analyzing the precision and recall over the minority class. We show that a consistent and accurate improvement is achieved when data augmentation is employed. The obtained classification results look promising and indicate the efficiency of augmentation strategies on financial tasks. On the basis of these results, we present an approach focused on classification tasks within the financial domain that takes a dataset as input, identifies what kind of augmentation technique to use, and then applies an ensemble of all the augmentation techniques of the identified type to the input dataset along with an ensemble of different methods to tackle the underlying classification.
- Published
- 2023
- Full Text
- View/download PDF
43. High Precision Traffic Identification Method based on GAN and XGBoost Fusion
- Author
-
GUAN Qi-feng, ZHAO Su, and ZHU Xiao-rong
- Subjects
application identification ,XGBoost ,GAN ,unbalanced data ,feature reduction ,Applied optics. Photonics ,TA1501-1820 - Abstract
With the continuous development of Internet technology and the continuous expansion of network scale, new network services emerge in an endless stream. In order to ensure the quality of user service, accurate and rapid classification of application traffic is the focus of current research. The traditional service identification method is based on protocol or specific service classification, which suffers from low applicability. Combining traffic characteristics and machine learning methods, this paper proposes a traffic identification method based on the fusion of Generative Adversarial Network (GAN) and Extreme Gradient Boosting (XGBoost). Firstly, the traffic characteristics representing service resource requirements are extracted. Then the GAN algorithm is improved to expand the minority-class samples, solving the problem of low model accuracy caused by the unbalanced distribution of data sets in the process of application identification. Finally, the random forest algorithm is used to select features, and the XGBoost algorithm is used to complete the model training. The results show that the accuracy of this method is 97.32%.
- Published
- 2023
- Full Text
- View/download PDF
44. Optimising the design of financial data processing models in accounting information systems based on artificial intelligence techniques
- Author
-
Song Yanhua
- Subjects
unbalanced data ,random forest ,support vector machine ,plain bayes ,financial data processing. ,68t05 ,Mathematics ,QA1-939 - Abstract
Financial assessment and early warning analysis can help enterprises find potential financial problems earlier, make timely plans, and take necessary measures to avoid risks. This paper uses a Bagging algorithm to integrate the Random Forest, Support Vector Machine, and Naive Bayes methods to process and classify imbalanced enterprise financial data. The entropy weight method is used to select and weight financial indicators to construct an accounting and financial data assessment model based on artificial intelligence technology. The model is applied to a consumer electronics enterprise, Company W, to analyze its financial situation and operating level. It is found that the composite scores from 2019 to 2022 are 60.29, 70.80, 73.11, and 76.52, and that the operating condition has gradually improved since 2019. Debt service capacity, profitability, operating capacity, and growth capacity also show a positive trend. This is consistent with the actual development of Company W. Accordingly, it is recommended that Company W, while maintaining its R&D advantages, focus more on the long-term operating ability of the enterprise, compress the operating cycle, reduce the risk of repayment and inventory pressure, and continue to enhance the competitiveness of the enterprise. This paper presents new ideas and methods for the innovation of enterprise management and the intelligence of accounting information systems.
- Published
- 2024
- Full Text
- View/download PDF
45. Comparative analysis of classification techniques for topic-based biomedical literature categorisation.
- Author
-
Stepanov, Ihor, Ivasiuk, Arsentii, Yavorskyi, Oleksandr, and Frolova, Alina
- Subjects
TRANSFORMER models ,DRUG registration ,COMPARATIVE studies ,CLASSIFICATION ,INFORMATION resources ,HOPFIELD networks ,INFORMATION theory ,IDENTIFICATION - Abstract
Introduction: Scientific articles serve as vital sources of biomedical information, but with the yearly growth in publication volume, processing such vast amounts of information has become increasingly challenging. This difficulty is particularly pronounced when it requires the expertise of highly qualified professionals. Our research focused on the classification of domain-specific articles to determine whether they contain information about drug-induced liver injury (DILI). DILI is a clinically significant condition and one of the reasons for drug registration failures. The rapid and accurate identification of drugs that may cause such conditions can prevent side effects in millions of patients. Methods: Developing a text classification method can help regulators, such as the FDA, identify evidence of potential DILI for specific drugs much faster and at a massive scale. In our study, we compared several text classification methodologies, including transformers, LSTMs, information theory, and statistics-based methods. We devised a simple and interpretable text classification method that is as fast as Naïve Bayes while delivering superior performance for topic-oriented text categorisation. Moreover, we revisited techniques and methodologies to handle the imbalance of the data. Results: Transformers achieve the best results when the distribution of classes and the semantics of the test data match the training set. But in cases of imbalanced data, simple statistical and information theory-based models can surpass complex transformers, bringing more interpretable results that are so important for the biomedical domain. As our results show, neural networks can achieve better results if they are pre-trained on domain-specific data and the loss function is designed to reflect the class distribution. 
Discussion: Overall, transformers are powerful architecture, however, in certain cases, such as topic classification, its usage can be redundant and simple statistical approaches can achieve compatible results while being much faster and explainable. However, we see potential in combining results from both worlds. Development of new neural network architectures, loss functions and training procedures that bring stability to unbalanced data is a promising topic of development. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
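The abstract of result 45 mentions a loss function designed to reflect the class distribution but does not say which one. As a rough illustration only, the sketch below assumes one common choice: inverse-frequency class weights applied to cross-entropy. The function names and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed scheme: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean negative log-likelihood.

    probs[i] maps each class to the predicted probability for sample i.
    """
    total = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)


# Toy data: nine majority samples, one minority sample.
labels = [0] * 9 + [1]
weights = inverse_frequency_weights(labels)

# A model that always leans towards the majority class.
probs = [{0: 0.9, 1: 0.1}] * 10

plain = weighted_cross_entropy(probs, labels, {0: 1.0, 1: 1.0})
balanced = weighted_cross_entropy(probs, labels, weights)
print(weights, plain, balanced)
```

With the weights applied, the single misclassified minority sample contributes as much to the loss as all majority samples combined, so the weighted loss is larger than the unweighted one for this majority-biased model.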
46. Seismic multi-attribute geological structure identification method based on improved XGBoost.
- Author
-
杨楚龙, 王怀秀, and 刘最亮
- Abstract
Seismic attributes can be used to interpret and predict geological structures, and therefore are widely used in the identification of coal mine geological structures. However, in general, the distribution of regions without structures and regions with structures in the exploration area is unbalanced, with many more regions without structures than with structures. In machine learning, traditional classifiers tend to be biased towards the majority class, making it difficult to effectively identify structures. To solve this problem, an improved extreme gradient boosting (XGBoost) structure recognition method for imbalanced datasets was proposed. Firstly, twelve seismic attributes extracted from a three-dimensional seismic exploration dataset were used as dataset features and actual disclosed geological structures as dataset labels to construct a multi-attribute dataset. Then, redundant features were filtered based on the correlation between features and labels. Next, the boundary sample classification (BSC) algorithm was combined with the synthetic minority oversampling technique (SMOTE) to form the BSC-SMOTE algorithm. The original dataset was balanced using the BSC-SMOTE algorithm, and the balanced dataset was then used to train the XGBoost classifier. The classifier was further optimized using Bayesian optimization (BO) to search for hyperparameters. Finally, the classifier was used to predict structures. Taking the Dongyi mining area of Shanxi Xinyuan Coal Mine Co., Ltd. as the research area, the experimental results show that the prediction accuracy of the improved XGBoost algorithm model is 0.95, which is 0.16 higher than the original XGBoost algorithm, and more than 0.15 higher than the traditional algorithms such as KNN, random forest and SVM. The prediction results of the improved XGBoost model are basically consistent with the actual exposed structure after visualization, which shows that the model can effectively identify geological structures. [ABSTRACT FROM AUTHOR]
- Published
- 2023
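The abstract of result 46 balances the dataset with BSC-SMOTE before training. The boundary sample classification step is not described there, so the sketch below shows only the basic SMOTE interpolation idea, under stated assumptions: base samples are drawn at random rather than visited in turn, distances are Euclidean, and the function name and toy points are hypothetical.

```python
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=6)
print(new_points)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already occupied by the minority class.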
47. Research on a Classification Method for Strip Steel Surface Defects Based on Knowledge Distillation and a Self-Adaptive Residual Shrinkage Network.
- Author
-
Huang, Xinbo, Song, Zhiwei, Ji, Chao, Zhang, Ye, and Yang, Luya
- Subjects
STEEL strip, SURFACE defects, CLASSIFICATION algorithms, IMAGE processing, RESEARCH methodology, MULTISPECTRAL imaging
- Abstract
Different types of surface defects will occur during the production of strip steel. To ensure production quality, it is essential to classify these defects. Our research indicates that two main problems exist in the existing strip steel surface defect classification methods: (1) they cannot handle the unbalanced, few-shot data encountered in practice, and (2) they cannot meet the requirement of online real-time classification. To solve the aforementioned problems, a relational knowledge distillation self-adaptive residual shrinkage network (RKD-SARSN) is presented in this work. First, the data enhancement strategy of Cycle GAN defective sample migration is designed. Second, the self-adaptive residual shrinkage network (SARSN) is designed as the backbone network for feature extraction. An adaptive loss function based on accuracy and geometric mean (Gmean) is proposed to solve the problem of unbalanced samples. Finally, a relational knowledge distillation model (RKD) is proposed, and the functions of GUI operation interface encapsulation are designed by combining image processing technology. SARSN is used as a teacher model, its generalization performance is transferred to the lightweight network ResNet34, and it is conveniently deployed as a student model. The results show that the proposed method can improve the deployment efficiency of the model and ensure the real-time performance of the classification algorithms. It is superior to other mainstream algorithms for classifying fine-grained images with unbalanced data. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
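The abstract of result 47 builds its loss on accuracy and the geometric mean (Gmean). The loss itself is not given there, so the sketch below covers only the metric, assuming the usual definition of G-mean as the geometric mean of per-class recalls. The helper names and the toy labels are hypothetical.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of samples whose prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def g_mean(y_true, y_pred):
    """Geometric mean of the recall of every class present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return math.prod(recalls) ** (1 / len(recalls))


# Toy case: 95 normal samples, 5 defects, and a classifier that
# predicts "normal" for everything.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
gm = g_mean(y_true, y_pred)
print(acc, gm)
```

Accuracy looks high here even though no defect is found, while G-mean drops to zero as soon as any class has zero recall, which is why it is often paired with accuracy on unbalanced data.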
48. Multiresponse surface methodology for hyperparameter tuning to optimize multiple performance measures of statistical and machine learning algorithms.
- Author
-
Lin, Chang‐Yun
- Subjects
MACHINE learning, STATISTICAL learning, RANDOM forest algorithms
- Abstract
Hyperparameter tuning is an important task in machine learning for controlling model complexity and improving prediction performance. Most methods in the literature can only be used to tune hyperparameters to optimize a single performance measure. In practice, participants may want to optimize multiple measures of the model; however, optimizing one measure may worsen another. Therefore, a hyperparameter tuning method is proposed using the multiresponse surface methodology to solve the tradeoff problem between multiple measures. A search algorithm that requires fewer tuning runs is developed to find the optimal hyperparameter settings systematically based on the preferences of participants for multiple measures of the model. An example using the random forest algorithm is provided to demonstrate the application of the proposed method to improve the prediction performance of the model on unbalanced data. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
49. Methodologic Issues Specific to Prediction Model Development and Evaluation.
- Author
-
Jin, Yuxuan and Kattan, Michael W.
- Subjects
PREDICTION models, STATISTICAL models, RECEIVER operating characteristic curves
- Abstract
Developing and evaluating statistical prediction models is challenging, and many pitfalls can arise. This article identifies what the authors believe are some common methodologic concerns that may be encountered. We describe each problem and make suggestions regarding how to address them. The hope is that this article will result in higher-quality publications of statistical prediction models. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
50. A dual-view network for fault diagnosis in rotating machinery using unbalanced data.
- Author
-
Chen, Zixu, Yu, Wennian, Kong, Chengcheng, Zeng, Qiang, Wang, Liming, and Shao, Yimin
- Subjects
FAULT diagnosis, ROTATING machinery, FEATURE extraction, INFORMATION modeling, GAUSSIAN distribution
- Abstract
Data-driven intelligent methods have demonstrated their effectiveness in the area of fault diagnosis. However, most existing studies are based on the assumption that the distributions of normal and faulty samples are balanced during the diagnostic process. This assumption significantly narrows the application range of a diagnostic model, as the samples in most real-world scenarios are highly unbalanced. To cope with the limitations caused by unbalanced data, this paper proposes an original dual-view network (DVN). Firstly, an interactive graph modeling strategy is introduced for relationship information modeling of multi-sensor data. Meanwhile, the graph convolution operation is used as the baseline for feature extraction of the constructed interactive graph to mine for fault representations. Secondly, an original dual-view classifier consisting of a binary classifier and a multi-class classifier is proposed, which divides fault diagnosis into two stages. Specifically, in the first stage, the binary classifier performs the binary inference from the view of fault detection. In the second stage, the multi-class classifier performs the full-state inference from the view of fine-grained fault classification. Then, based on the dual-view classifier, a weight activation module is designed to alleviate training bias toward majority classes by sample-level re-weighting. Finally, the diagnosis results can be obtained according to the output of the multi-class classifier. Fault diagnosis experiments using two different datasets with varying data unbalance ratios were conducted to validate the effectiveness of the proposed method. The superiority of the proposed DVN is verified through comparisons with state-of-the-art methods. The effectiveness of the DVN is further validated through ablation studies. The DVN code is available at: https://github.com/CQU-ZixuChen/DualViewNetwork. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF