1,957 results on '"Decision Trees"'
Search Results
2. Construction data mining methods in the prediction of death in hemodialysis patients using support vector machine, neural network, logistic regression and decision tree.
- Author
-
Khazaei S, Najafi-Ghobadi S, and Ramezani-Doroh V
- Subjects
- Humans, Logistic Models, Regression Analysis, Data Mining, Decision Trees, Neural Networks, Computer, Renal Dialysis mortality, Renal Insufficiency, Chronic mortality, Support Vector Machine
- Abstract
Objectives: Chronic kidney disease (CKD) is one of the main causes of morbidity and mortality worldwide. Detecting modifiable survival factors could help prioritize clinical care and support treatment decision-making for hemodialysis patients. The aim of this study was to develop the best predictive model to explain the predictors of death in hemodialysis patients by data mining techniques., Methods: In this study, we used a dataset that included records of 857 dialysis patients. Thirty-one potential risk factors that might be associated with death in dialysis patients were selected. The performances of four classifiers (support vector machine, neural network, logistic regression and decision tree) were compared in terms of sensitivity, specificity, total accuracy, positive likelihood ratio and negative likelihood ratio., Results: The average total accuracy of all methods was over 61%; the greatest total accuracy belonged to logistic regression (0.71). Also, logistic regression produced the greatest specificity (0.72), sensitivity (0.69), positive likelihood ratio (2.48) and the lowest negative likelihood ratio (0.43)., Conclusions: Logistic regression had the best performance in comparison to other methods for predicting death among hemodialysis patients. According to this model, female gender, increasing age at diagnosis, addiction, low iron level, positive C-reactive protein and low urea reduction ratio (URR) were the main predictors of death in these patients., Competing Interests: The authors declare no conflict of interest., (©2021 Pacini Editore SRL, Pisa, Italy.)
- Published
- 2021
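The likelihood ratios reported in the abstract above can be related to sensitivity and specificity. The following is a minimal sketch, assuming the standard definitions LR+ = sensitivity / (1 - specificity) and LR- = (1 - sensitivity) / specificity; the function name is hypothetical and this is not the authors' code. Because the abstract gives rounded inputs, the recomputed LR+ comes out slightly below the published 2.48.

```python
def likelihood_ratios(sensitivity: float, specificity: float) -> tuple[float, float]:
    """Return (LR+, LR-) using the standard textbook definitions (assumed here)."""
    if not (0.0 <= sensitivity <= 1.0 and 0.0 < specificity < 1.0):
        raise ValueError("sensitivity must be in [0, 1] and specificity in (0, 1)")
    lr_pos = sensitivity / (1.0 - specificity)   # how much a positive result raises the odds
    lr_neg = (1.0 - sensitivity) / specificity   # how much a negative result lowers the odds
    return lr_pos, lr_neg


# Rounded logistic regression figures quoted in the abstract.
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.69, specificity=0.72)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```

Working from the unrounded sensitivity and specificity, which the abstract does not give, would be needed to reproduce the published ratios exactly.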
3. Correlates of physical activity behavior in adults: a data mining approach.
- Author
-
Farrahi V, Niemelä M, Kärmeniemi M, Puhakka S, Kangas M, Korpelainen R, and Jämsä T
- Subjects
- Accelerometry, Adipose Tissue physiology, Algorithms, Cross-Sectional Studies, Female, Finland epidemiology, Follow-Up Studies, Heart Rate, Humans, Male, Middle Aged, Sitting Position, Surveys and Questionnaires, Data Mining methods, Decision Trees, Exercise, Sedentary Behavior
- Abstract
Purpose: A data mining approach was applied to establish a multilevel hierarchy predicting physical activity (PA) behavior, and to methodologically identify the correlates of PA behavior., Methods: Cross-sectional data from the population-based Northern Finland Birth Cohort 1966 study, collected in the most recent follow-up at age 46, were used to create a hierarchy using the chi-square automatic interaction detection (CHAID) decision tree technique for predicting PA behavior. PA behavior is defined as active or inactive based on machine-learned activity profiles, which were previously created through a multidimensional (clustering) approach on continuous accelerometer-measured activity intensities in one week. The input variables (predictors) used for decision tree fitting consisted of individual, demographical, psychological, behavioral, environmental, and physical factors. Using generalized linear mixed models, we also analyzed how factors emerging from the model were associated with three PA metrics, including daily time (minutes per day) in sedentary (SED), light PA (LPA), and moderate-to-vigorous PA (MVPA), to assure the relative importance of methodologically identified factors., Results: Of the 4582 participants with valid accelerometer data at the latest follow-up, 2701 and 1881 had active and inactive profiles, respectively. We used a total of 168 factors as input variables to classify these two PA behaviors. Out of these 168 factors, the decision tree selected 36 factors of different domains from which 54 subgroups of participants were formed. 
The emerging factors from the model explained minutes per day in SED, LPA, and/or MVPA, including body fat percentage (SED: B = 26.5, LPA: B = - 16.1, and MVPA: B = - 11.7), normalized heart rate recovery 60 s after exercise (SED: B = -16.1, LPA: B = 9.9, and MVPA: B = 9.6), average weekday total sitting time (SED: B = 34.1, LPA: B = -25.3, and MVPA: B = -5.8), and extravagance score (SED: B = 6.3 and LPA: B = - 3.7)., Conclusions: Using data mining, we established a data-driven model composed of 36 different factors of relative importance from empirical data. This model may be used to identify subgroups for multilevel intervention allocation and design. Additionally, this study methodologically discovered an extensive set of factors that can be a basis for additional hypothesis testing in PA correlates research.
- Published
- 2020
4. Developing a Prototype Knowledge-Based System for Diagnosis and Treatment of Diabetes Using Data Mining Techniques.
- Author
-
Eyasu K, Jimma W, and Tadesse T
- Subjects
- Africa, Data Accuracy, Humans, Proof of Concept Study, Data Mining methods, Decision Trees, Diabetes Mellitus diagnosis, Diabetes Mellitus therapy, Knowledge Bases
- Abstract
Background: Diabetes is a disease that affects the body's ability to produce or use insulin. A total of 425 million people are suffering from diabetes in the world. Of these, more than 16 million people live in the Africa Region, a figure estimated to reach around 41 million by 2045. The main objective of this study was to design and develop a prototype knowledge-based system using data mining techniques for diagnosis and treatment of diabetes., Methods: For this study, experimental research design was employed, and the researchers used domain expert knowledge as a supplement of data mining techniques whereby three classification algorithms in WEKA, namely J48, PART and JRip, were used, and finally the researchers decided to use the results of J48 classification algorithm. Ultimate Visual basic studio 2013 (Vb.net) was used to store knowledge and as front side of prototype. Common lisp prolog (Clisp) was used for obtained knowledge back end coding., Results: Using a decision tree algorithm, namely J48, 2512 (95.1515%) of the instances were classified correctly, and 128 (4.8485%) were classified incorrectly. The second most performing model was generated by the JRip classifier. This model scored 94.7348% accuracy on the general data to classify the status of diabetic patient datasets. It classified 2501 instances of the records correctly., Conclusion: The J48 model was the best performing model with the best accuracy of results., (© 2020 Kedir Eyasu, et al.)
- Published
- 2020
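The accuracy percentages quoted in the abstract above can be checked with simple arithmetic. This is a minimal sketch, assuming that JRip was evaluated on the same 2640 instances as J48 (2512 correct plus 128 incorrect); the helper name is hypothetical.

```python
def accuracy_pct(correct: int, total: int) -> float:
    """Percentage of correctly classified instances."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


total = 2512 + 128                 # J48: correct + incorrect instances
j48 = accuracy_pct(2512, total)    # J48 accuracy
jrip = accuracy_pct(2501, total)   # JRip accuracy, assuming the same instance count
print(f"J48: {j48:.4f}%  JRip: {jrip:.4f}%")
```

Under that assumption both figures match the abstract to four decimal places, which suggests the two classifiers were scored on the same set of records.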
5. Decision tree-based classifier in providing telehealth service.
- Author
-
Chern CC, Chen YJ, and Hsiao B
- Subjects
- Humans, Taiwan, Classification, Data Interpretation, Statistical, Data Mining, Decision Trees, Models, Theoretical, Telemedicine
- Abstract
Background: Although previous research showed that telehealth services can reduce the misuse of resources and urban-rural disparities, most healthcare insurers do not include telehealth services in their health insurance schemes. Therefore, no target variable exists for the classification approaches to learn from or train with. The problem of identifying the potential recipients of telehealth services when introducing telehealth services into health welfare or health insurance schemes becomes an unsupervised classification problem without a target variable., Methods: We propose a HDTTCA approach, which is a systematic approach (the main process of HDTTCA involves (1) data set preprocessing, (2) decision tree model building, and (3) predicting and explaining of the most important attributes in the data set for patients who qualify for telehealth service) to identify those who are eligible for telehealth services., Results: This work uses data from the NHIRD provided by the NHIA in Taiwan in 2012 as our research scope, which consist of 55,389 distinct hospitals and 653,209 distinct patients with 15,882,153 outpatient and 135,775 inpatient records. After HDTTCA produces the final version of the decision tree, the rules can be used to assign the values of the target variables in the entire NHIRD. Our data indicate that 3.56% (23,262 out of 653,209) of the patients are eligible for telehealth services in 2012. This study verifies the efficiency and validity of HDTTCA by using a large data set from the NHI of Taiwan., Conclusion: This study conducts a series of experiments 30 times to compare the HDTTCA results with the logistic regression findings by measuring their average performance and determining which model addresses the telehealth patient classification problem better. Four important metrics are used to compare the results. In terms of sensitivity, the decision trees generated by HDTTCA and the logistic regression model are on equal grounds. 
In terms of accuracy, specificity, and precision, the decision tree generated by HDTTCA provides a better performance than that of the logistic regression model. When HDTTCA is applied, the decision tree model generates a competitive performance and provides clear, easily understandable rules. Therefore, HDTTCA is a suitable choice in solving telehealth service classification problems.
- Published
- 2019
6. Accurate and rapid screening model for potential diabetes mellitus.
- Author
-
Pei D, Gong Y, Kang H, Zhang C, and Guo Q
- Subjects
- Adult, China, Female, Humans, Male, Middle Aged, Clinical Decision-Making, Data Mining, Decision Support Techniques, Decision Trees, Diabetes Mellitus diagnosis, Early Diagnosis
- Abstract
Background: Prediction or early diagnosis of diabetes is crucial for populations with high risk of diabetes., Methods: In this study, we assessed the ability of five popular classifiers (J48, AdaboostM1, SMO, Bayes Net, and Naïve Bayes) to identify individuals with diabetes based on nine non-invasive and easily obtained clinical features, including age, gender, body mass index (BMI), hypertension, history of cardiovascular disease or stroke, family history of diabetes, physical activity, work stress, and salty food preference. A total of 4205 data entries were obtained from annual physical examination reports for adults in the Shengjing Hospital of China Medical University during January-April 2017. Weka data mining software was used to identify the best algorithm for diabetes classification., Results: The results indicate that decision tree classifier J48 has the best performance (accuracy = 0.9503, precision = 0.950, recall = 0.950, F-measure = 0.948, and AUC = 0.964). The decision tree structure shows that age is the most significant feature, followed by family history of diabetes, work stress, BMI, salty food preference, physical activity, hypertension, gender, and history of cardiovascular disease or stroke., Conclusions: Our study shows that decision tree analyses can be applied to screen individuals for early diabetes risk without the need for invasive tests. This procedure will be particularly useful in developing regions with high epidemiological risk and poor socioeconomic status, and enable clinical practitioners to rapidly screen patients for increased risk of diabetes. The key features in the tree structure could further facilitate diabetes prevention through targeted community interventions, which can potentially improve early diabetes diagnosis and reduce burdens on the healthcare system.
- Published
- 2019
7. The analysis of the effects of acute rheumatic fever in childhood on cardiac disease with data mining.
- Author
-
Emre İE, Erol N, Ayhan Yİ, Özkan Y, and Erol Ç
- Subjects
- Adolescent, Bayes Theorem, Child, Child, Preschool, Female, Humans, Male, Rheumatic Fever epidemiology, Algorithms, Data Mining methods, Decision Trees, Heart Diseases physiopathology, Rheumatic Fever diagnosis
- Abstract
Background: Acute rheumatic fever (ARF) is an important disease that is frequently seen in Turkey, and it is necessary to develop solutions to cure the disease. It is believed that new data analysis methods may be applied to this disease, and this may be useful to discover previously unrecognized patterns. Data mining of existing records and data repositories may improve knowledge on the diagnosis and management of ARF. In this regard, we planned to make a contribution to the development of new solutions by approaching the problem from a different standpoint., Objectives: The aim of this study is to analyse the effects of ARF undergone during childhood on the basis of cardiac diseases by using data mining methods., Materials and Methods: Classification methods of data mining were used, and experiments were conducted on five algorithms. The records of the patients diagnosed with ARF were analysed by setting models with naive Bayes classifier, decision trees (CART, C4.5, C5.0, C5.0 boosted) and random forest algorithms. The performances of the algorithms that were derived were then compared. Among model performance evaluation techniques, the hold-out, cross-validation and bootstrap methods were tested in diverse ways in an applied manner. Within the scope of the research, the dataset comprising records of 297 patients was utilised in cooperation with İstanbul Medeniyet University Göztepe Training and Research Hospital's Pediatric Cardiology Clinic (İstanbul Medeniyet Üniversitesi Göztepe Eğitim ve Araştırma Hastanesi Çocuk Kardiyolojisi Kliniği). Data analysis was carried out with the data of the remaining 201 patients following pre-processing., Results: The results that were obtained from different algorithms were compared based on the model performance evaluation criteria. The best result was shown under the CART model by using the hold-out technique (80% training, 20% testing). 
According to this model, the importance values of the predictive attributes were listed, and it was found that the "teleNormal" and "cardiomegaly" attributes were not required for ARF diagnosis and treatment. In compliance with this result, it was thought that it should not be necessary for patients to have a chest x-ray, which is needed for diagnosis of "teleNormal" and "cardiomegaly". This will help reduce costs and thus contribute to the health economy while preventing patients from having unnecessary x-rays., Discussion and Conclusion: The results of this study showed that data mining techniques may be used to analyse diseases such as ARF. The important attributes that affect the disease were obtained in accordance with the results. The results of the best model (CART) may be broadened in numerous ways and provide information for both experienced and inexperienced physicians. This study is considered to be significant as it helps data mining methods become more prevalently used for data analysis in fields of medicine and healthcare., (Copyright © 2019 Elsevier B.V. All rights reserved.)
- Published
- 2019
8. Automated data extraction and ensemble methods for predictive modeling of breast cancer outcomes after radiation therapy.
- Author
-
Lindsay WD, Ahern CA, Tobias JS, Berlind CG, Chinniah C, Gabriel PE, Gee JC, and Simone CB 2nd
- Subjects
- Female, Humans, Middle Aged, Predictive Value of Tests, Radiotherapy Dosage, Treatment Outcome, Breast Neoplasms pathology, Breast Neoplasms radiotherapy, Data Mining methods, Decision Trees, Electronic Health Records, Machine Learning
- Abstract
Purpose: The purpose of this study was to compare the effectiveness of ensemble methods (e.g., random forests) and single-model methods (e.g., logistic regression and decision trees) in predictive modeling of post-RT treatment failure and adverse events (AEs) for breast cancer patients using automatically extracted EMR data., Methods: Data from 1967 consecutive breast radiotherapy (RT) courses at one institution between 2008 and 2015 were automatically extracted from EMRs and oncology information systems using extraction software. Over 230 variables were extracted spanning the following variable segments: patient demographics, medical/surgical history, tumor characteristics, RT treatment history, and AEs tracked using CTCAEv4.0. Treatment failure was extracted algorithmically by searching posttreatment encounters for evidence of local, nodal, or distant failure. Individual models were trained using decision trees, logistic regression, random forests, and boosted decision trees to predict treatment failures and AEs. Models were fit on 75% of the data and evaluated for probability calibration and area under the ROC curve (AUC) on the remaining test set. The impact of each variable segment was assessed by retraining without the segment and measuring change in AUC (ΔAUC)., Results: All AUC values were statistically significant (P < 0.05). Ensemble methods outperformed single-model methods across all outcomes. The best ensemble method outperformed decision trees and logistic regression by an average AUC of 0.053 and 0.034, respectively. Model probabilities were well calibrated as evidenced by calibration curves. 
Excluding the patient medical history variable segment led to the largest AUC reduction in all models (Average ΔAUC = -0.025), followed by RT treatment history (-0.021) and tumor information (-0.015)., Conclusion: In this largest such study in breast cancer performed to date, automatically extracted EMR data provided a basis for reliable outcome predictions across multiple statistical methods. Ensemble methods provided substantial advantages over single-model methods. Patient medical history contributed the most to prediction quality., (© 2018 American Association of Physicists in Medicine.)
- Published
- 2019
9. Trip purpose prediction using travel survey data with POI information via gradient boosting decision trees
- Author
-
De Zhao, Wei Zhou, Wei Wang, and Xuedong Hua
- Subjects
- behavioural sciences, data mining, decision trees, demand forecasting, traveller information, Transportation engineering, TA1001-1280, Electronic computers. Computer science, QA75.5-76.95
- Abstract
At present, data obtained from the Global Positioning System (GPS) is significantly valuable in mobility research. However, GPS‐based data lacks trip purpose information. Consequently, many researchers have endeavoured to predict or impute these missing attributes. Existing studies have focused on constructing more features to improve prediction accuracy, but paid less attention to the model's applicability and transferability. In this study, five trip purposes are extracted, including education, recreation, personal, shopping, and transportation, from Chengdu Household Travel Survey (HTS) data. The individual and trip characteristics that are common and can be easily derived from GPS data are carefully selected and extracted. Point of Interest (POI) data of the trip destination are also collected to enhance input characteristics. To obtain more accurate results, an ensemble learning model, Gradient Boosting Decision Trees (GBDT), is employed to predict trip purposes. Grid search and cross‐validation techniques are used to optimize the hyper‐parameters. Empirical results show that the proposed model achieves 0.788 accuracy, which is 22.17%, 14.53%, 10.36%, and 6.77% higher than Multinomial Logit (MNL), Artificial Neural Network (ANN), Random Forest (RF), and Deep Belief Network (DBN), respectively. It is also found that although increasing trip features improves the model's accuracy, it simultaneously impairs the model's transferability and generalizability.
- Published
- 2024
10. Mining features for biomedical data using clustering tree ensembles.
- Author
-
Pliakos K and Vens C
- Subjects
- Algorithms, Computational Biology, Databases, Factual statistics & numerical data, Escherichia coli genetics, Escherichia coli metabolism, Gene Regulatory Networks, Humans, Metabolic Networks and Pathways, Protein Interaction Maps, Saccharomyces cerevisiae genetics, Saccharomyces cerevisiae metabolism, Cluster Analysis, Data Mining methods, Decision Trees, Machine Learning
- Abstract
The volume of biomedical data available to the machine learning community grows very rapidly. A rational question is how informative these data really are or how discriminant the features describing the data instances are. Several biomedical datasets suffer from lack of variance in the instance representation, or even worse, contain instances with identical features and different class labels. Indisputably, this directly affects the performance of machine learning algorithms, as well as the ability to interpret their results. In this article, we emphasize the aforementioned problem and propose a target-informed feature induction method based on tree ensemble learning. The method brings more variance into the data representation, thereby potentially increasing predictive performance of a learner applied to the induced features. The contribution of this article is twofold. Firstly, a problem affecting the quality of biomedical data is highlighted, and secondly, a method to handle that problem is proposed. The efficiency of the presented approach is validated on multi-target prediction tasks. The obtained results indicate that the proposed approach is able to boost the discrimination between the data instances and increase the predictive performance., (Copyright © 2018 Elsevier Inc. All rights reserved.)
- Published
- 2018
11. A comprehensive evaluation of ensemble learning methods and decision trees for predicting trauma patient discharge status using real-world data
- Author
-
Zahra Kohzadi, Ali Mohammad Nickfarjam, Leila Shokrizadeh Arani, Zeinab Kohzadi, and Mehrdad Mahdian
- Subjects
- data mining, ensemble learning, trauma, decision trees, Surgery, RD1-811
- Abstract
Background: Trauma registries collect and document data about the acute injury care in hospitals. The goal of trauma care systems is to reduce injury occurrence and enhance trauma patient survival rates. Objectives: In this article, the Kashan trauma registry was used to predict trauma patient discharge status using machine learning. Methods: This study employed 3930 Kashan Trauma Centre Registry entries after preprocessing. The study experimented with decision trees of varying complexity, using three separate metrics - information gain, Gini index, and gain ratio - to build and evaluate the trees. Finally, bagging, boosting and stacking ensemble learning techniques were implemented to evaluate their predictive performance. Ensemble learning models were developed based on decision trees of varying depths that utilized different learning measures/metrics. The predictive performance of the algorithms was evaluated using metrics such as accuracy, precision, recall, and the area under the receiver operating characteristic curve (AUC). This study aimed to compare ensemble-learning techniques like bagging, boosting and stacking to decision trees configured with various parameter settings, to assess their ability to predict trauma patients' discharge status outcomes. Results: The stacking technique, which used decision tree algorithms (depth=5) that integrated parameters like information gain, gain ratio and Gini index at the base level along with KNN (k=12) using Euclidean distance, and then incorporated logistic regression as the meta-classifier, demonstrated superior predictive performance compared to using individual decision trees, bagging or boosting approaches alone. Conclusion: However, while decision trees are straightforward algorithms and ensemble methods are more time-consuming and computationally complex, this study indicates that stacking learning is superior to single decision tree methods with a variety of parameters, bagging, and boosting.
- Published
- 2023
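The abstract above names three split measures for building decision trees: information gain, Gini index, and gain ratio. The following is a minimal sketch of how they can be computed for one candidate split, assuming the standard textbook definitions; the function names and the toy labels are hypothetical and do not come from the Kashan registry or the authors' implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_scores(parent, children):
    """Score a split of `parent` into non-empty `children` under all three measures."""
    n = len(parent)
    weights = [len(child) / n for child in children]
    info_gain = entropy(parent) - sum(w * entropy(ch) for w, ch in zip(weights, children))
    gini_decrease = gini(parent) - sum(w * gini(ch) for w, ch in zip(weights, children))
    # Split information penalizes splits that produce many small branches.
    split_info = -sum(w * log2(w) for w in weights)
    gain_ratio = info_gain / split_info if split_info > 0 else 0.0
    return {
        "information_gain": info_gain,
        "gini_decrease": gini_decrease,
        "gain_ratio": gain_ratio,
    }


# Toy example: 10 records, two hypothetical outcome classes "A" and "B".
left = ["A"] * 4 + ["B"] * 1
right = ["A"] * 1 + ["B"] * 4
scores = split_scores(left + right, [left, right])
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

A tree builder would evaluate such scores for every candidate split and keep the best one; for this balanced two-way split the split information is 1 bit, so gain ratio and information gain coincide.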
12. hs-CRP is strongly associated with coronary heart disease (CHD): A data mining approach using decision tree algorithm.
- Author
-
Tayefi M, Tajfard M, Saffar S, Hanachi P, Amirabadizadeh AR, Esmaeily H, Taghipour A, Ferns GA, Moohebati M, and Ghayour-Mobarhan M
- Subjects
- Adult, Case-Control Studies, Coronary Disease metabolism, Female, Humans, Male, Middle Aged, Sensitivity and Specificity, Algorithms, C-Reactive Protein metabolism, Coronary Disease epidemiology, Data Mining, Decision Trees
- Abstract
Background and Aims: Coronary heart disease (CHD) is an important public health problem globally. Algorithms incorporating the assessment of clinical biomarkers together with several established traditional risk factors can help clinicians to predict CHD and support clinical decision making with respect to interventions. Decision tree (DT) is a data mining model for extracting hidden knowledge from large databases. We aimed to establish a predictive model for coronary heart disease using a decision tree algorithm., Methods: Here we used a dataset of 2346 individuals including 1159 healthy participants and 1187 participants who had undergone coronary angiography (405 participants with negative angiography and 782 participants with positive angiography). We entered 10 of a total of 12 variables into the DT algorithm (including age, sex, FBG, TG, hs-CRP, TC, HDL, LDL, SBP and DBP)., Results: Our model could identify the associated risk factors of CHD with sensitivity, specificity, and accuracy of 96%, 87%, and 94%, respectively. Serum hs-CRP level was at the top of the tree in our model, followed by FBG, gender and age., Conclusion: Our model appears to be an accurate, specific and sensitive model for identifying the presence of CHD, but will require validation in prospective studies., (Copyright © 2017 Elsevier B.V. All rights reserved.)
- Published
- 2017
13. Classification-based data mining for identification of risk patterns associated with hypertension in Middle Eastern population: A 12-year longitudinal study.
- Author
-
Ramezankhani A, Kabir A, Pournik O, Azizi F, and Hadaegh F
- Subjects
- Adult, Age Factors, Algorithms, Blood Glucose metabolism, Blood Pressure, Diastole, Female, Humans, Hypertension etiology, Incidence, Iran epidemiology, Longitudinal Studies, Male, Middle Aged, Models, Theoretical, Risk Factors, Sex Factors, Systole, Waist Circumference, Young Adult, Data Mining, Decision Trees, Hypertension epidemiology
- Abstract
Hypertension is a critical public health concern worldwide. Identification of risk factors using traditional multivariable models has been a field of active research. The present study was undertaken to identify risk patterns associated with hypertension incidence using data mining methods in a cohort of Iranian adult population. Data on 6205 participants (44% men) age > 20 years, free from hypertension at baseline with no history of cardiovascular disease, were used to develop a series of prediction models by 3 types of decision tree (DT) algorithms. The performances of all classifiers were evaluated on the testing data set. The Quick Unbiased Efficient Statistical Tree algorithm among men and women and Classification and Regression Tree among the total population had the best performance. The C-statistic and sensitivity for the prediction models were (0.70 and 71%) in men, (0.79 and 71%) in women, and (0.78 and 72%) in total population, respectively. In DT models, systolic blood pressure (SBP), diastolic blood pressure, age, and waist circumference significantly contributed to the risk of incident hypertension in both genders and total population; wrist circumference and 2-h postchallenge plasma glucose also contributed among women, and fasting plasma glucose among men. In men, the highest hypertension risk was seen in those with SBP > 115 mm Hg and age > 30 years. In women, those with SBP > 114 mm Hg and age > 33 years had the highest risk for hypertension. For the total population, higher risk was observed in those with SBP > 114 mm Hg and age > 38 years. Our study emphasizes the utility of DTs for prediction of hypertension and exploring interaction between predictors. DT models used the easily available variables to identify homogeneous subgroups with different risk pattern for the hypertension., Competing Interests: The authors have no conflicts of interest to disclose.
- Published
- 2016
14. The Reliability of Classification of Terminal Nodes in GUIDE Decision Tree to Predict the Nonalcoholic Fatty Liver Disease.
- Author
-
Birjandi M, Ayatollahi SM, and Pourahmad S
- Subjects
- Algorithms, Calibration, Computational Biology methods, Cross-Sectional Studies, Decision Support Techniques, Humans, Iran, Liver diagnostic imaging, Logistic Models, Models, Statistical, Portal Vein diagnostic imaging, Probability, Reproducibility of Results, Data Mining, Decision Trees, Diagnosis, Computer-Assisted methods, Non-alcoholic Fatty Liver Disease diagnostic imaging, Ultrasonography
- Abstract
Tree structured modeling is a data mining technique used to recursively partition a dataset into relatively homogeneous subgroups in order to make more accurate predictions on generated classes. One of the classification tree induction algorithms, GUIDE, is a nonparametric method with suitable accuracy and low bias selection, which is used for predicting binary classes based on many predictors. In this tree, evaluating the accuracy of predicted classes (terminal nodes) is clinically of special importance. For this purpose, we used GUIDE classification tree in two statuses of equal and unequal misclassification cost in order to predict nonalcoholic fatty liver disease (NAFLD), considering 30 predictors. Then, to evaluate the accuracy of predicted classes by using bootstrap method, first the classification reliability in which individuals are assigned to a unique class and next the prediction probability reliability as support for that are considered., Competing Interests: The authors declare that they have no competing interests.
- Published
- 2016
15. A data mining approach to optimize pellets manufacturing process based on a decision tree algorithm.
- Author
-
Ronowicz J, Thommes M, Kleinebudde P, and Krysiński J
- Subjects
- Chemistry, Pharmaceutical, Drug Compounding statistics & numerical data, Excipients, Particle Size, Algorithms, Data Mining methods, Decision Trees, Drug Compounding methods
- Abstract
The present study is focused on the thorough analysis of cause-effect relationships between pellet formulation characteristics (pellet composition as well as process parameters) and the selected quality attribute of the final product. Pellet quality was expressed by shape, using the aspect ratio value. A data matrix for chemometric analysis consisted of 224 pellet formulations performed by means of eight different active pharmaceutical ingredients and several various excipients, using different extrusion/spheronization process conditions. The data set contained 14 input variables (both formulation and process variables) and one output variable (pellet aspect ratio). A tree regression algorithm consistent with the Quality by Design concept was applied to obtain deeper understanding and knowledge of formulation and process parameters affecting the final pellet sphericity. A clear, interpretable set of decision rules was generated. The spheronization speed, spheronization time, number of holes and water content of extrudate have been recognized as the key factors influencing pellet aspect ratio. The most spherical pellets were achieved by using a large number of holes during extrusion, a high spheronizer speed and longer time of spheronization. The described data mining approach enhances knowledge about pelletization process and simultaneously facilitates searching for the optimal process conditions which are necessary to achieve ideal spherical pellets, resulting in good flow characteristics. This data mining approach can be taken into consideration by industrial formulation scientists to support rational decision making in the field of pellets technology., (Copyright © 2015 Elsevier B.V. All rights reserved.)
- Published
- 2015
- Full Text
- View/download PDF
16. DECISION TREES DO NOT LIE: CURIOSITIES IN PREFERENCES OF CROATIAN ONLINE CONSUMERS
- Author
-
Ana Marija Filipas, Nenad Vretenar, and Ivan Prudky
- Subjects
- decision-making, consumers’ preferences, data mining, decision trees, shopping behaviour indicator, Economic theory. Demography, HB1-3840
- Abstract
Understanding consumers’ preferences has always been important for economic theory and for business practitioners in operations management, supply chain management, marketing, etc. While preferences are often considered stable in simplified theoretical modelling, this is not the case in real-world decision-making. Therefore, it is crucial to understand consumers’ preferences when a market disruption occurs. This research aims to recognise consumers’ preferences with respect to online shopping after the COVID-19 outbreak hit markets. To this purpose, we conducted an empirical study among Croatian consumers with prior experience in online shopping using an online questionnaire. The questionnaire was completed by 350 respondents who met the criteria. We selected decision-tree models using the J48 algorithm to determine the influences of the found shopping factors and demographic characteristics on a consumer’s preference indicator. The main components of our indicators that influence consumer behaviour are the stimulators and destimulators of online shopping and the importance of social incidence. Our results show significant differences between men and women, with men tending to use fewer variables to make decisions. In addition, the analysis revealed that four product groups and a range of shopping mode-specific influencing factors are required to evaluate consumers’ purchase points when constructing the consumers’ preference indicator.
- Published
- 2023
- Full Text
- View/download PDF
17. Type 2 Diabetes Mellitus Screening and Risk Factors Using Decision Tree: Results of Data Mining.
- Author
-
Habibi S, Ahmadi M, and Alizadeh S
- Subjects
- Decision Support Techniques, Early Diagnosis, Female, Humans, Iran, Male, Mass Screening, Models, Theoretical, ROC Curve, Risk Factors, Data Mining, Decision Trees, Diabetes Mellitus, Type 2 diagnosis
- Abstract
Objectives: The aim of this study was to examine a predictive model using features related to type 2 diabetes risk factors. Methods: The data were obtained from a database in a diabetes control system in Tabriz, Iran. The data included all people referred for diabetes screening between 2009 and 2011. The features considered as "Inputs" were: age, sex, systolic and diastolic blood pressure, family history of diabetes, and body mass index (BMI). Moreover, we used diagnosis as "Class". We applied the "Decision Tree" technique and "J48" algorithm in the WEKA (3.6.10 version) software to develop the model. Results: After data preprocessing and preparation, we used 22,398 records for data mining. The model precision to identify patients was 0.717. The age factor was placed in the root node of the tree as a result of higher information gain. The ROC curve indicates the model function in identification of patients and those individuals who are healthy. The curve indicates high capability of the model, especially in identification of the healthy persons. Conclusions: We developed a model using the decision tree for screening T2DM which did not require laboratory tests for T2DM diagnosis.
- Published
- 2015
- Full Text
- View/download PDF
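Note on entry 17: the abstract says age landed in the root node because it had the highest information gain. The sketch below shows how that selection works in principle. It is a minimal illustration, not the study's model: the four records and the feature names `age_group` and `family_history` are invented, and it computes plain information gain, whereas C4.5-style learners such as J48 normally rank splits by gain ratio, a normalized variant.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature, target):
    """Entropy of the target minus the weighted entropy after splitting on feature."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Invented toy records (assumption): NOT data from the study.
rows = [
    {"age_group": "under_45", "family_history": "no", "diagnosis": "healthy"},
    {"age_group": "under_45", "family_history": "yes", "diagnosis": "healthy"},
    {"age_group": "45_plus", "family_history": "no", "diagnosis": "diabetic"},
    {"age_group": "45_plus", "family_history": "yes", "diagnosis": "diabetic"},
]

gains = {f: information_gain(rows, f, "diagnosis") for f in ("age_group", "family_history")}
root = max(gains, key=gains.get)  # the feature with the largest gain becomes the root
print(gains, root)
```

In this toy data the age split separates the classes perfectly, so it wins the root position; the family-history split leaves both branches evenly mixed and gains nothing.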
18. Comparison of two data mining techniques in labeling diagnosis to Iranian pharmacy claim dataset: artificial neural network (ANN) versus decision tree model.
- Author
-
Rezaei-Darzi E, Farzadfar F, Hashemi-Meshkini A, Navidi I, Mahmoudi M, Varmaghani M, Mehdipour P, Soudi Alamdari M, Tayefi B, Naderimagham S, Soleymani F, Mesdaghinia A, Delavari A, and Mohammad K
- Subjects
- Epidemiologic Research Design, Gastrointestinal Diseases diagnosis, Gastrointestinal Diseases drug therapy, Humans, Iran epidemiology, Models, Statistical, Data Mining methods, Databases, Factual, Decision Trees, Gastrointestinal Diseases epidemiology, Insurance, Pharmaceutical Services, Neural Networks, Computer
- Abstract
Background: This study aimed to evaluate and compare the prediction accuracy of two data mining techniques, decision tree and neural network models, in labeling diagnosis to gastrointestinal prescriptions in Iran. Methods: This study was conducted in three phases: data preparation, training phase, and testing phase. A sample from a database consisting of 23 million pharmacy insurance claim records, from 2004 to 2011 was used, in which a total of 330 prescriptions were assessed and used to train and test the models simultaneously. In the training phase, the selected prescriptions were assessed by both a physician and a pharmacist separately and assigned a diagnosis. To test the performance of each model, a k-fold stratified cross validation was conducted in addition to measuring their sensitivity and specificity. Results: Generally, the two methods had very similar accuracies. Considering the weighted average of true positive rate (sensitivity) and true negative rate (specificity), the decision tree had slightly higher accuracy in its ability for correct classification (83.3% and 96% versus 80.3% and 95.1%, respectively). However, when the weighted average of ROC area (AUC between each class and all other classes) was measured, the ANN displayed higher accuracies in predicting the diagnosis (93.8% compared with 90.6%). Conclusion: According to the results of this study, the artificial neural network and decision tree models show similar accuracy in labeling diagnosis to GI prescriptions.
- Published
- 2014
- Full Text
- View/download PDF
19. Fostering Sustainable Aquaculture: Mitigating Fish Mortality Risks Using Decision Trees Classifiers
- Author
-
Dimitris C. Gkikas, Marios C. Gkikas, and John A. Theodorou
- Subjects
- data mining, decision trees, farmed fish, fish death rates, fish mortality prediction, sustainable aquaculture, Technology, Engineering (General). Civil engineering (General), TA1-2040, Biology (General), QH301-705.5, Physics, QC1-999, Chemistry, QD1-999
- Abstract
A proposal has been put forward advocating a data-driven strategy that employs classifiers from data mining to foresee and categorize instances of fish mortality. This addresses the increasing concerns regarding the death rates in caged fish environments because of the unsustainable fish farming techniques employed and environmental variables involved. The aim of this research is to enhance the competitiveness of Greek fish farming through the development of an intelligent system that is able to diagnose fish diseases in farms. This system concurrently addresses medication and dosage issues. To achieve this, a comprehensive dataset derived from various aquaculture sources was used, including various factors such as the geographic locations, farming techniques, and indicative parameters such as the water quality, climatic conditions, and fish biological characteristics. The main objective of the research was to categorize fish mortality cases through predictive models. Advanced data mining classification methods, specifically decision trees (DTs), were used for the comparison, aiming to recognize the most appropriate method with high precision and recall rates in predicting fish death rates. To ensure the reliability of the results, a methodical evaluation process was adopted, including cross-validation and a classification performance assessment. In addition, a statistical analysis was performed to gain insights into the factors that identify the correlations between the various factors affecting fish mortality. This analysis contributes to the development of targeted conservation and restoration action strategies. The research results have important implications for sustainable management actions, enabling stakeholders to proactively address issues and monitor aquaculture practices. This proactive approach ensures the protection of farmed fish quantities while meeting global seafood requirements. 
The data mining using a classification approach coincides with the general context of the UN sustainability goals, reducing the losses in seafood management and production when dealing with the consequences of climate change.
- Published
- 2024
- Full Text
- View/download PDF
20. Decision Rules Induced From Sets of Decision Trees.
- Author
-
Zielosko, Beata, Moshkov, Mikhail, Glid, Anna, and Tetteh, Evans Teiko
- Subjects
- DECISION trees, DISTRIBUTED databases, KNOWLEDGE representation (Information theory), HAWTHORNS, DATA mining
- Abstract
Decision rules belong to known forms of knowledge representation. Among popular measures of their quality, length and support can be distinguished. Shorter rules are easier to understand and interpret. Support makes it possible to present patterns hidden in the data. Nowadays, data mining tasks are oriented toward extracting knowledge from data in both distributed and centralized forms. Learning decision rules from a decision tree is a relatively simple task. However, the challenge arises when decision rules are induced from a set of decision trees. Moreover, in the case of distributed data, the decision trees may be constructed independently on different sources, and merging them into a unified set requires resolving conflicts and inconsistencies. In this paper, decision rules are constructed from distributed data based on decision trees induced using randomly chosen attributes as the splitting criterion. The aim of the study is to compare the quality of two algorithms for constructing rules which are true for a maximum number of trees. The comparison was made based on three factors: the number of trees for which the rule is true, their length and support. Based on the performed experiments, it was possible to see that the number of true rules for the maximum number of decision trees from the set is greater for algorithm A than for heuristic H. This algorithm allows the induction of shorter rules with greater support compared to heuristic H. However, it should also be noted that the rules induced by heuristic H are often true for a larger number of trees than the rules constructed by algorithm A. Thus, both algorithms can be applied to distributed data. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
21. Applying decision tree for identification of a low risk population for type 2 diabetes. Tehran Lipid and Glucose Study.
- Author
-
Ramezankhani A, Pournik O, Shahrabi J, Khalili D, Azizi F, and Hadaegh F
- Subjects
- Adult, Aged, Arterial Pressure, Blood Glucose analysis, Body Mass Index, Body Weights and Measures, Computational Biology, Decision Support Techniques, Diabetes Mellitus, Type 2 diagnosis, Educational Status, Employment, Female, Humans, Incidence, Iran epidemiology, Longitudinal Studies, Male, Marital Status, Middle Aged, Risk Factors, Sensitivity and Specificity, Smoking, Triglycerides blood, Data Mining, Decision Trees, Diabetes Mellitus, Type 2 epidemiology
- Abstract
Aims: The aim of this study was to create a prediction model using a data mining approach to identify low risk individuals for incidence of type 2 diabetes, using the Tehran Lipid and Glucose Study (TLGS) database. Methods: For a population of 6647 people without diabetes, aged ≥20 years, followed for 12 years, a prediction model was developed using classification by the decision tree technique. Seven hundred and twenty-nine (11%) diabetes cases occurred during the follow-up. Predictor variables were selected from demographic characteristics, smoking status, medical and drug history and laboratory measures. Results: We developed the predictive models by decision tree using 60 input variables and one output variable. The overall classification accuracy was 90.5%, with 31.1% sensitivity, 97.9% specificity; and for the subjects without diabetes, precision and f-measure were 92% and 0.95, respectively. The identified variables included fasting plasma glucose, body mass index, triglycerides, mean arterial blood pressure, family history of diabetes, educational level and job status. Conclusions: In conclusion, decision tree analysis, using routine demographic, clinical, anthropometric and laboratory measurements, created a simple tool to predict individuals at low risk for type 2 diabetes. (Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.)
- Published
- 2014
- Full Text
- View/download PDF
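Note on entry 21: the reported accuracy, sensitivity and specificity can be cross-checked against the case count, because overall accuracy is the prevalence-weighted mix of sensitivity and specificity. The sketch below does that arithmetic. It assumes the evaluation set had the same case proportion as the whole cohort (729 of 6647), which the abstract does not state; the function name `overall_accuracy` is illustrative.

```python
def overall_accuracy(sensitivity, specificity, prevalence):
    """Accuracy = sensitivity * P(case) + specificity * P(non-case)."""
    return sensitivity * prevalence + specificity * (1 - prevalence)


cases, cohort = 729, 6647           # figures quoted in the abstract
prevalence = cases / cohort         # assumption: same proportion in the evaluation data
implied = overall_accuracy(0.311, 0.979, prevalence)

print(f"prevalence = {prevalence:.3f}, implied accuracy = {implied:.3f}")
```

Under that assumption the implied accuracy sits within a fraction of a percentage point of the reported 90.5%. It also shows why accuracy is high despite low sensitivity: about 89% of the cohort are non-cases, so specificity dominates the total.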
22. Decision trees for predicting dropout in Engineering Course students in Brazil.
- Author
-
Mariano, Ari Melo, Ferreira, Arthur Bandeira de Magalhães Lelis, Santos, Maíra Rocha, Castilho, Mara Lucia, and Bastos, Anna Carla Freire Luna Campêlo
- Subjects
- DECISION trees, ENGINEERING students, SCHOOL dropouts, FIELD research, RATING of students, MENTAL health
- Abstract
The dropout of Brazilian students from higher education is a well-explored subject, and high dropout rates have been reported. However, despite the vast literature, the problems arising from student dropout still have no solution, since dropout itself is an unsolved problem. This research aims to present a classification via decision trees to predict the dropout of Engineering course students in Brazil. To reach this objective, exploratory field research was conducted, where data was collected by means of surveys directed to the students, enabling the elaboration of a classificatory decision tree with the C4.5 algorithm. The survey sample consisted of 91 valid answers. The results were analyzed with the RapidMiner tool and presented a decision tree with 86.81% accuracy. Among the main factors preventing dropout are interaction with professors, the course curriculum, and issues related to mental well-being. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
23. Optimization of decision trees using modified African buffalo algorithm.
- Author
-
Panhalkar, Archana R. and Doye, Dharmpal D.
- Subjects
- DECISION trees, ALGORITHMS, TREE size, MATHEMATICAL optimization, SWARM intelligence, PARTICLE swarm optimization
- Abstract
Decision tree induction is a simple yet powerful learning and classification tool to discover knowledge from a database. The volume of data in databases is growing to quite large sizes, both in the number of attributes and instances. Some important limitations of decision trees on such extensive data are instability, local decisions, and overfitting. The simple, effective and non-convergence nature of the African Buffalo Optimization (ABO) algorithm makes it suitable to solve complex optimization problems. In this paper, we propose the African Buffalo Optimized Decision Tree (ABODT) algorithm to create globally optimized decision trees using the intelligent and collective behaviour of African Buffalos. The modified African Buffalo optimization algorithm is used to create efficient and optimal decision trees. To evaluate the efficiency of the proposed African Buffalo Optimized Decision Tree algorithm, experiments are performed on 15 standard UCI learning repository datasets that are of various sizes and domains. Results show that the African Buffalo Optimized Decision Tree algorithm globally optimizes decision trees, increases accuracy and reduces the size of a decision tree. These optimized trees are more stable and efficient than conventional decision trees. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
24. Assessment of the risk factors for type II diabetes using an improved combination of particle swarm optimization and decision trees by evaluation with Fisher’s linear discriminant analysis
- Author
-
Sheik Abdullah, A. and Selvakumar, S.
- Published
- 2019
- Full Text
- View/download PDF
25. Fostering Sustainable Aquaculture: Mitigating Fish Mortality Risks Using Decision Trees Classifiers.
- Author
-
Gkikas, Dimitris C., Gkikas, Marios C., and Theodorou, John A.
- Subjects
- SUSTAINABLE aquaculture, FISH mortality, DECISION trees, AGRICULTURE, FISH diseases, AQUACULTURE, FISH farming
- Abstract
Featured Application: The specific application of this work involves the development of an intelligent system for diagnosing and treating fish diseases in Greek fish farming. The project aims to enhance the competitiveness of Greek fish farming by addressing the increasing mortality rates attributed to unsustainable farming methods and environmental factors. The application of data mining classifiers, particularly decision trees (DTs), in predicting and categorizing fish mortality instances contributes to the development of an intelligent system for disease diagnosis and treatment. The proactive approach, supported by rigorous evaluation processes and a feature importance analysis, holds implications for sustainable aquaculture management and aligns with global sustainability initiatives. A proposal has been put forward advocating a data-driven strategy that employs classifiers from data mining to foresee and categorize instances of fish mortality. This addresses the increasing concerns regarding the death rates in caged fish environments because of the unsustainable fish farming techniques employed and environmental variables involved. The aim of this research is to enhance the competitiveness of Greek fish farming through the development of an intelligent system that is able to diagnose fish diseases in farms. This system concurrently addresses medication and dosage issues. To achieve this, a comprehensive dataset derived from various aquaculture sources was used, including various factors such as the geographic locations, farming techniques, and indicative parameters such as the water quality, climatic conditions, and fish biological characteristics. The main objective of the research was to categorize fish mortality cases through predictive models. 
Advanced data mining classification methods, specifically decision trees (DTs), were used for the comparison, aiming to recognize the most appropriate method with high precision and recall rates in predicting fish death rates. To ensure the reliability of the results, a methodical evaluation process was adopted, including cross-validation and a classification performance assessment. In addition, a statistical analysis was performed to gain insights into the factors that identify the correlations between the various factors affecting fish mortality. This analysis contributes to the development of targeted conservation and restoration action strategies. The research results have important implications for sustainable management actions, enabling stakeholders to proactively address issues and monitor aquaculture practices. This proactive approach ensures the protection of farmed fish quantities while meeting global seafood requirements. The data mining using a classification approach coincides with the general context of the UN sustainability goals, reducing the losses in seafood management and production when dealing with the consequences of climate change. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
26. Missing data imputation using decision trees and fuzzy clustering with iterative learning
- Author
-
Nikfalazar, Sanaz, Yeh, Chung-Hsing, Bedingfield, Susan, and Khorshidi, Hadi A.
- Published
- 2020
- Full Text
- View/download PDF
27. Trip purpose prediction using travel survey data with POI information via gradient boosting decision trees.
- Author
-
Zhao, De, Zhou, Wei, Wang, Wei, and Hua, Xuedong
- Subjects
- DECISION trees, GLOBAL Positioning System, RANDOM forest algorithms, MULTIPLE imputation (Statistics), RESEARCH personnel, DEMAND forecasting, MISSING data (Statistics)
- Abstract
At present, data obtained from the Global Positioning System (GPS) is significantly valuable in mobility research. However, GPS-based data lacks trip purpose information. Consequently, many researchers have endeavoured to predict or impute these missing attributes. Existing studies have focused on constructing more features to improve prediction accuracy, but paid less attention to the model's applicability and transferability. In this study, five trip purposes are extracted, including education, recreation, personal, shopping, and transportation, from Chengdu Household Travel Survey (HTS) data. The individual and trip characteristics that are common and can be easily derived from GPS data are carefully selected and extracted. Point of Interest (POI) data of the trip destination are also collected to enhance input characteristics. To obtain more accurate results, an ensemble learning model, Gradient Boosting Decision Trees (GBDT), is employed to predict trip purposes. Grid search and cross-validation techniques are used to optimize the hyper-parameters. Empirical results show that the proposed model achieves 0.788 accuracy, which is 22.17%, 14.53%, 10.36%, and 6.77% higher than Multinomial Logit (MNL), Artificial Neural Network (ANN), Random Forest (RF), and Deep Belief Network (DBN), respectively. It is also found that although increasing the number of trip features improves the model's accuracy, it simultaneously impairs the model's transferability and generalizability. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
28. A comprehensive evaluation of ensemble learning methods and decision trees for predicting trauma patient discharge status using real-world data.
- Author
-
Kohzadi, Zahra, Nickfarjam, Ali Mohammad, Arani, Leila Shokrizadeh, Kohzadi, Zeinab, and Mahdian, Mehrdad
- Subjects
- PREDICTIVE tests, STATISTICAL models, PATIENTS, RESEARCH funding, RECEIVER operating characteristic curves, LOGISTIC regression analysis, EMERGENCY medical services, DISCHARGE planning, RETROSPECTIVE studies, MEDICAL records, ACQUISITION of data, MEMORY, LEARNING strategies, DECISION trees, MACHINE learning, TRAUMA registries, ACCURACY, ALGORITHMS
- Abstract
Background: Trauma registries collect and document data about the acute injury care in hospitals. The goal of trauma care systems is to reduce injury occurrence and enhance trauma patient survival rates. Objectives: In this article, the Kashan trauma registry was used to predict trauma patient discharge status using machine learning. Methods: This study employed 3930 Kashan Trauma Centre Registry entries after preprocessing. The study experimented with decision trees of varying complexity, using three separate metrics - information gain, Gini index, and gain ratio - to build and evaluate the trees. Finally, bagging, boosting and stacking ensemble learning techniques were implemented to evaluate their predictive performance. Ensemble learning models were developed based on decision trees of varying depths that utilized different learning measures/metrics. The predictive performance of the algorithms was evaluated using metrics such as accuracy, precision, recall, and the area under the receiver operating characteristic curve (AUC). This study aimed to compare ensemble-learning techniques like bagging, boosting and stacking to decision trees configured with various parameter settings, to assess their ability to predict trauma patients' discharge status outcomes. Results: The stacking technique, which used decision tree algorithms (depth=5) that integrated parameters like information gain, gain ratio and Gini index at the base level along with KNN (k=12) using Euclidean distance, and then incorporated logistic regression as the meta-classifier, demonstrated superior predictive performance compared to using individual decision trees, bagging or boosting approaches alone. Conclusion: However, while decision trees are straightforward algorithms and ensemble methods are more time-consuming and computationally complex, this study indicates that stacking learning is superior to single decision tree methods with a variety of parameters, bagging, and boosting. 
[ABSTRACT FROM AUTHOR]
- Published
- 2023
29. An ensemble of random decision trees with local differential privacy in edge computing.
- Author
-
Wu, Xiaotong, Qi, Lianyong, Gao, Jiaquan, Ji, Genlin, and Xu, Xiaolong
- Subjects
- *EDGE computing, *DECISION trees, *RANDOM forest algorithms, *PRIVACY, *DATA mining, *INTERNET of things, *DATABASES
- Abstract
Edge computing is an emerging computing paradigm, which offers a great opportunity to implement data mining-based services and applications for a large number of devices and sensors in Internet of Things. However, the new paradigm is faced with security and privacy challenges due to the diversity and the limited capability of edge components. In particular, data privacy is one of the most concerned problems for all the participants. In this paper, we propose a framework of privacy-preserving data mining based on private random decision trees in edge computing, which not only gives the strong privacy guarantee, but also provides a certain amount of data utility. Firstly, we design a preservation framework to implement private random decision trees satisfying local differential privacy. Secondly, we present the concrete implementations of algorithms and the corresponding task that each participant needs to undertake. Thirdly, we analyze the key factors to influence privacy and utility, including the allocation of data and privacy budget. Fourthly, we give the improved algorithms to further increase the utility with strong privacy preservation. Finally, extensive experiments demonstrate the good performance of our designed framework. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
30. Decision trees do not lie: Curiosities in preferences of Croatian online consumers.
- Author
-
Filipas, Ana Marija, Vretenar, Nenad, and Prudky, Ivan
- Published
- 2023
- Full Text
- View/download PDF
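31. Sensibilidad de las calificaciones crediticias a elasticidades de las razones financieras respecto a variables macroeconómicas: un modelo de árboles de decisión clasificadores para las empresas mexicanas [Sensitivity of credit ratings to elasticities of financial ratios with respect to macroeconomic variables: a classifying decision tree model for Mexican companies].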
- Author
-
Parada Rojas, Ana Cecilia, Razo De Anda, Jorge Omar, and Cruz Aké, Salvador
- Subjects
- CREDIT ratings, CREDIT risk, FINANCIAL ratios, DATA mining, DECISION trees
- Abstract
Copyright of Contaduría y Administración is the property of Facultad de Contaduria y Administracion-Universidad Nacional Autonoma de Mexico and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2021
- Full Text
- View/download PDF
32. Analysing tie-break performance of professional tennis players at Grand Slam matches.
- Author
-
Wang, Qiushi, Zhou, Yunjing, Jiahengnuer, Jialin, Xie, Yixun, Ding, Lan, Bao, Dapeng, and Cui, Yixiong
- Subjects
- *DECISION making, *DECISION trees, *TENNIS players, *RACKETS (Sporting goods), *TENNIS
- Abstract
A tiebreak in tennis is one of the critical moments where players are expected to excel under mental pressure and maintain a high level of performance. Despite the importance of tiebreak points, research exploring the performance of male and female players during this match phase remains limited. This study aimed i) to investigate the overall tiebreak performance of male and female players in relation to the outcome, and ii) to examine their point-level performance by considering different contextual variables. A total of 535 tiebreaks comprising 6380 points from the 2016–2021 US Open men’s and women’s singles matches were collected. The difference in match performance between winning and losing players within the entire tiebreak game was explored. A subsequent decision tree analysis was then used to analyse the effect of the contextual and performance variables on tiebreak point-by-point outcome. The results showed that male and female winning players outperformed the losing players in 1st Serve, Serve Width and Net approach performance. The analysis of point-level performance showed that Net point, Score scene, and Point server substantially impacted tennis players’ tiebreak outcome. These findings provide valuable insight for coaches and players, informing the tailoring of tiebreak tactics and training to different match situations. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
33. Learning decision trees through Monte Carlo tree search: An empirical evaluation.
- Author
-
Nunes, Cecília, De Craene, Mathieu, Langet, Hélène, Camara, Oscar, and Jonsson, Anders
- Subjects
- *MONTE Carlo method, *DECISION trees, *DATA mining software, *PRUNING, *FORECASTING, *TREE branches, *TREES
- Abstract
Decision trees (DTs) are a widely used prediction tool, owing to their interpretability. Standard learning methods follow a locally optimal approach that trades off prediction performance for computational efficiency. Such methods can however be far from optimal, and it may pay off to spend more computational resources to increase performance. Monte Carlo tree search (MCTS) is an approach to approximate optimal choices in exponentially large search spaces. We propose a DT learning approach based on the Upper Confidence Bound applied to tree (UCT) algorithm, including procedures to expand and explore the space of DTs. To mitigate the computational cost of our method, we employ search pruning strategies that discard some branches of the search tree. The experiments show that the proposed approach outperformed the C4.5 algorithm in 20 out of 31 datasets, with statistically significant improvements in the trade-off between prediction performance and DT complexity. The approach improved locally optimal search for datasets with more than 1,000 instances, or for smaller datasets likely arising from complex distributions. This article is categorized under: Algorithmic Development > Hierarchies and Trees; Application Areas > Data Mining Software Tools; Fundamental Concepts of Data and Knowledge > Data Concepts. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
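Note on entry 33: the abstract builds on the UCT algorithm, which balances trying promising candidates again against trying rarely visited ones. The sketch below shows the generic UCB1-style selection rule that UCT applies at each node of the search tree. It is not the paper's implementation: the exploration constant `c`, the reward values and the candidate names are all assumptions chosen for illustration.

```python
import math


def uct_score(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """Mean reward plus an exploration bonus that shrinks as a child is visited more."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits):
    """Pick the child with the highest UCT score."""
    return max(children, key=lambda ch: uct_score(ch["reward"], ch["visits"], parent_visits))


# Hypothetical candidate expansions of a partial decision tree (invented numbers).
children = [
    {"name": "split_on_x1", "reward": 8.0, "visits": 10},
    {"name": "split_on_x2", "reward": 3.5, "visits": 4},
    {"name": "split_on_x3", "reward": 0.0, "visits": 0},
]

best = select_child(children, parent_visits=14)
print(best["name"])
```

Among the visited candidates, the less-explored `split_on_x2` outranks `split_on_x1` even though their mean rewards are close, because its exploration bonus is larger.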
34. Fitness evaluation reuse for accelerating GPU-based evolutionary induction of decision trees.
- Author
-
Jurczuk, Krzysztof, Czajkowski, Marcin, and Kretowski, Marek
- Subjects
- *DECISION trees, *DATA mining, *GRAPHICS processing units, *BIOLOGICAL evolution, *BIG data, *TECHNOLOGICAL progress
- Abstract
Decision trees (DTs) are one of the most popular white-box machine-learning techniques. Traditionally, DTs are induced using a top-down greedy search that may lead to sub-optimal solutions. One of the emerging alternatives is an evolutionary induction inspired by the biological evolution. It searches for the tree structure and tests simultaneously, which results in less complex DTs with at least comparable prediction performance. However, the evolutionary search is computationally expensive, and its effective application to big data mining needs algorithmic and technological progress. In this paper, noting that many trees or their parts reappear during the evolution, we propose a reuse strategy. A fixed number of recently processed individuals (DTs) is stored in a so-called repository. A part of the repository entry (related to fitness calculations) is maintained on a CPU side to limit CPU/GPU memory transactions. The rest of the repository entry (tree structures) is located on a GPU side to speed up searching for similar DTs. As the most time-demanding task of the induction is the DTs' evaluation, the GPU first searches similar DTs in the repository for reuse. If it fails, the GPU has to evaluate DT from the ground up. Large artificial and real-life datasets and various repository strategies are tested. Results show that the concept of reusing information from previous generations can accelerate the original GPU-based solution further. It is especially visible for large-scale data. To give an idea of the overall acceleration scale, the proposed solution can process even billions of objects in a few hours on a single GPU workstation. [ABSTRACT FROM AUTHOR]
- Published
- 2021
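The reuse idea in the abstract above (keep a fixed number of recently processed trees and consult that store before evaluating a tree from scratch) can be sketched as follows. This is a minimal CPU-only illustration, not the authors' GPU implementation: it assumes exact-match lookup on a hashable tree key rather than a search for similar trees, and it assumes least-recently-used eviction. The names `FitnessRepository` and `evaluate_with_reuse` are hypothetical.

```python
from collections import OrderedDict


class FitnessRepository:
    """Fixed-size store of recently evaluated trees (assumed LRU eviction)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # tree key -> cached fitness

    def lookup(self, tree_key):
        """Return the cached fitness, or None on a miss; a hit refreshes the entry."""
        if tree_key in self._entries:
            self._entries.move_to_end(tree_key)
            return self._entries[tree_key]
        return None

    def store(self, tree_key, fitness):
        """Insert or update an entry, evicting the oldest one when over capacity."""
        self._entries[tree_key] = fitness
        self._entries.move_to_end(tree_key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)


def evaluate_with_reuse(tree_key, repository, evaluate_from_scratch):
    """Reuse a stored fitness when the same tree reappears, else evaluate fully."""
    cached = repository.lookup(tree_key)
    if cached is not None:
        return cached
    fitness = evaluate_from_scratch(tree_key)
    repository.store(tree_key, fitness)
    return fitness
```

In this sketch the expensive call (`evaluate_from_scratch`) runs only on a repository miss, which is the source of the speed-up the abstract describes when trees reappear across generations.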
35. Regression Analyses or Decision Trees? (Regresyon Analizleri mi Karar Ağaçları mı?)
- Author
-
Gacar, Burcu Kocarık and Kocakoç, İpek Deveci
- Subjects
- *DECISION trees, *REGRESSION trees, *LOGISTIC regression analysis, *HOME prices, *REGRESSION analysis, *DATA mining
- Abstract
The decision tree algorithm is an important classification method in data mining. A decision tree builds classification and regression models in the form of a tree with a root node, branches, and leaf nodes. Logistic regression, an alternative to regression analysis when the dependent variable is dichotomous, is another technique used for classification. In this research, logistic regression, linear regression, a classification tree, and a regression tree were applied to the same data set to identify the most important variables determining house price. The models' performance and predictive power were compared, and the best model was determined. The comparison used 414 real estate records with 5 independent variables and house price as the dependent variable. The findings showed that the classification tree model performs better than standard approaches on real estate valuation data. [ABSTRACT FROM AUTHOR]
- Published
- 2020
36. Using K-Means Cluster Analysis and Decision Trees to Highlight Significant Factors Leading to Homelessness
- Author
-
Andrea Yoder Clark, Nicole Blumenfeld, Eric Lal, Shikar Darbari, Shiyang Northwood, and Ashkan Wadpey
- Subjects
- data science, machine learning, data mining, k-means, cluster analysis, decision trees, Mathematics, QA1-939
- Abstract
Homelessness has been a persistent social concern in the United States. A combination of political and economic events since the 1960s has driven increases in poverty that, by 1991, had surpassed 1928 depression era levels in some accounts. This paper explores how the emerging field of behavioral economics can use machine learning and data science methods to explore preventative responses to homelessness. In this study, machine learning data mining strategies, specifically K-means cluster analysis and later, decision trees, were used to understand how environmental factors and resultant behaviors can contribute to the experience of homelessness. Prevention of the first homeless event is especially important as studies show that if a person has experienced homelessness once, they are 2.6 times more likely to have another homeless episode. Study findings demonstrate that when someone is at risk for not being able to pay utility bills at the same time as they experience challenges with two or more of the other social determinants of health, the individual is statistically significantly more likely to have their first homeless event. Additionally, for men over 50 who are not in the workforce, have a health hardship, and experience two or more other social determinants of health hardships at the same time, the individual has a high statistically significant probability of experiencing homelessness for the first time.
- Published
- 2021
37. Application of decision trees in the identification of patterns of fatal injuries by external cause in the municipality of Pasto, Colombia
- Author
-
Ricardo Timaran-Pereira, Andrés Calderón-Romero, and Arsenio Hidalgo-Troya
- Subjects
- Pattern recognition, automated, data mining, decision trees, classification, Medicine (General), R5-920, Social history and conditions. Social problems. Social reform, HN1-995
- Abstract
Introduction: The Pan American Health Organization (PAHO) and the World Health Organization (WHO) recognized, in 1993 and 1996 respectively, that violence is a public health problem. This is corroborated by the report on violence and health, in which Latin America had a homicide rate of 18 per 100,000 people and was considered one of the most violent regions in the world. Objective: To detect criminal patterns with data mining techniques in the Crime Observatory of the municipality of Pasto (Colombia). Materials and methods: The Cross Industry Standard Process for Data Mining (CRISP-DM), one of the methodologies used in data mining projects in academic and industrial environments, was applied. The source of information was the Crime Observatory of the municipality of Pasto, which stores the cleaned and transformed historical records of injuries from external causes (fatal and nonfatal) over 11 years. Results: A decision tree-based classification model was built that allowed the discovery of patterns of deaths from external causes. Homicides occurred mostly in commune 5 of Pasto under the following circumstances: on weekends, in the early morning, in the second half of the year, and on public thoroughfares. The victims were adult men of various professions, the homicides were caused by quarrels, and they were committed with firearms. Conclusion: The generated knowledge will help government and security agencies make effective decisions regarding the implementation of crime prevention and citizen security plans.
- Published
- 2017
38. Implementation of data mining to predict student graduation using C4.5 algorithm method.
- Author
-
Anggraeni, Dewi, Rizaldi, Nasution, Akmal, and Kholiq, Abdul
- Subjects
- *DATA mining, *DECISION trees, *STUDENT interests, *UNIVERSITIES & colleges, *GRADUATION (Education), *GRADUATE students
- Abstract
On-time graduation is an essential indicator for a higher education institution in supporting campus accreditation. Several factors influence whether students graduate on time, namely the student's previous school and student interest. This study aims to predict student graduation based on the student's previous school and student interest, so that higher education institutions have a basis for future decisions. The C4.5 algorithm, a decision tree classifier, is the data mining method used to analyze the student data and the supporting criteria for predicting graduation. [ABSTRACT FROM AUTHOR]
- Published
- 2024
39. Annual workers' income prediction using data mining techniques.
- Author
-
Yahaya, Muhammad Syawalludin, Hasbullah, Mohammad Hafshah, Jamil, Siti Afiqah Muhamad, Ul–Saufie, Ahmad Zia, and Ibrahim, Nurain
- Subjects
- ARTIFICIAL neural networks, INCOME, DECISION trees, WEIGHT gain, DATA mining
- Abstract
Predicting workers' annual income requires a close look at several factors. The factors most often discussed are age, gender, education, and occupation, but other factors that may affect annual income have yet to be examined. The traditional way of predicting workers' annual income is multiple linear regression. This parametric approach requires assumptions to be fulfilled, and checking them is time-consuming. A data mining approach to predicting workers' income is important for understanding how the economy and compensation work in the United States. Machine learning can cover all aspects without needing to fulfil the assumptions of the traditional method. Hence, the best way to predict workers' income in the United States is to use machine learning, which also addresses SDG 8: Decent Work & Economic Growth. The dataset used in this study was acquired from the Kaggle website. First, feature weights from filter methods (Weight by Information Gain, Weight by Information Gain Ratio, and Weight by Chi-Squared Statistic) were used to identify the factors that influence workers' annual income. Three methods were employed to predict worker income: logistic regression, decision trees, and artificial neural networks. The second goal is to compare the effectiveness of income prediction under undersampling and oversampling. The results show that, with the exception of the decision tree, oversampling gives the prediction model better performance than undersampling. Because undersampling randomly deletes observations that may be significant to the data and affect the prediction model, oversampling performs better. The third goal is to identify the most effective classification model for predicting workers' income. For logistic regression, the best model uses oversampling with backward selection. For decision trees, the optimal model uses backward selection with undersampling. For artificial neural networks, the best model uses oversampling with backward selection. [ABSTRACT FROM AUTHOR]
- Published
- 2024
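The two resampling strategies compared in the abstract above can be sketched as follows. The abstract states that undersampling randomly deletes observations; the oversampling variant shown here assumes plain random duplication of minority rows, which may differ from the technique the authors used. Function names are hypothetical.

```python
import random
from collections import defaultdict


def _group_by_label(rows, labels):
    """Collect rows into one list per class label."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return groups


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows until every class matches the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_label(rows, labels)
    target = min(len(group) for group in groups.values())
    out = []
    for label, group in groups.items():
        out.extend((row, label) for row in rng.sample(group, target))
    return out


def random_oversample(rows, labels, seed=0):
    """Randomly duplicate rows until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_label(rows, labels)
    target = max(len(group) for group in groups.values())
    out = []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out.extend((row, label) for row in group + extra)
    return out
```

Undersampling discards majority-class rows, so information can be lost; oversampling keeps every original row and only adds copies, which is consistent with the explanation the abstract gives for its results.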
40. Predicting Second-Hand Car Sales Price Using Decision Trees and Genetic Algorithms
- Author
-
Mehmet Özçalıcı
- Subjects
- data mining, decision trees, genetic algorithm, predicting, second-hand cars, Industrial engineering. Management engineering, T55.4-60.8, Business, HF5001-6182
- Abstract
It is important to predict the sales price of second-hand cars for both individuals and institutions operating in the second-hand market. The sales price of a car is affected by many factors, which makes prediction difficult. In particular, there is no readily available method to determine which factors affect the sales price most. The purpose of this study is to predict the sales price of second-hand cars with decision trees. A genetic algorithm is used to select the most relevant features. For this purpose, 252,645 advertisements were scanned for this study. For each advertisement, 139 features are available. Different models are examined using genetic algorithms that select 5, 10, 15, and 20 features. The best predictive performance in the out-of-sample experiment is 65.67%. The proposed model can be used as a decision support system for those operating in the second-hand car market.
- Published
- 2017
41. Finding Good Attribute Subsets for Improved Decision Trees Using a Genetic Algorithm Wrapper; a Supervised Learning Application in the Food Business Sector for Wine Type Classification.
- Author
-
Gkikas, Dimitris C., Theodoridis, Prokopis K., Theodoridis, Theodoros, and Gkikas, Marios C.
- Subjects
- SUPERVISED learning, DECISION trees, FOOD industry, GENETIC algorithms, PRIVATE sector
- Abstract
This study aims to provide a method that will assist decision makers in managing large datasets, eliminating the decision risk and highlighting significant subsets of data with certain weight. Thus, binary decision tree (BDT) and genetic algorithm (GA) methods are combined using a wrapping technique. The BDT algorithm is used to classify data in a tree structure, while the GA is used to identify the best attribute combinations from a set of possible combinations, referred to as generations. The study seeks to address the problem of overfitting that may occur when classifying large datasets by reducing the number of attributes used in classification. Using the GA, the number of selected attributes is minimized, reducing the risk of overfitting. The algorithm produces many attribute sets that are classified using the BDT algorithm and are assigned a fitness number based on their accuracy. The fittest set of attributes, or chromosomes, as well as the BDTs, are then selected for further analysis. The training process uses the data of a chemical analysis of wines grown in the same region but derived from three different cultivars. The results demonstrate the effectiveness of this innovative approach in defining certain ingredients and weights of wine's origin. [ABSTRACT FROM AUTHOR]
- Published
- 2023
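The wrapper idea in the abstract above (a genetic algorithm proposes attribute subsets, and each subset is scored by a classifier's accuracy) can be sketched as follows. This is a generic illustration under stated assumptions, not the authors' algorithm: tournament selection, one-point crossover, and bit-flip mutation are assumed operators, and `toy_fitness` is a hypothetical stand-in for the accuracy of a decision tree trained on the selected attributes.

```python
import random


def ga_select_attributes(n_attributes, fitness, generations=30, pop_size=20,
                         mutation_rate=0.05, seed=0):
    """Search for a boolean attribute mask (chromosome) with high fitness.

    Requires n_attributes >= 2 so that a crossover point exists.
    """
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_attributes)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    def pick():
        # Tournament selection: the fitter of two random chromosomes.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            mother, father = pick(), pick()
            cut = rng.randrange(1, n_attributes)  # one-point crossover
            child = mother[:cut] + father[cut:]
            # Bit-flip mutation: toggle each attribute with small probability.
            child = [(not bit) if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
        best = max(population + [best], key=fitness)  # keep the best seen so far
    return best


def toy_fitness(mask):
    """Hypothetical score: rewards attributes 0 and 3, penalizes subset size."""
    return 2 * (mask[0] + mask[3]) - 0.5 * sum(mask)
```

Calling `ga_select_attributes(6, toy_fitness)` returns a list of six booleans marking the selected attributes. In a real wrapper, the fitness function would train and validate a decision tree on the masked data, which is where most of the running time goes.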
42. Stacking-based ensemble learning of decision trees for interpretable prostate cancer detection.
- Author
-
Wang, Yuyan, Wang, Dujuan, Geng, Na, Wang, Yanzhang, Yin, Yunqiang, and Jin, Yaochu
- Subjects
- RANDOM forest algorithms, DECISION trees, PROSTATE cancer, EARLY detection of cancer, DATA mining, DIAGNOSIS methods
- Abstract
Prostate cancer is a highly incident malignant cancer among men. Early detection of prostate cancer is necessary for deciding whether a patient should receive costly and invasive biopsy with possible serious complications. However, existing cancer diagnosis methods based on data mining only focus on diagnostic accuracy, while neglecting the interpretability of the diagnosis model that is necessary for helping doctors make clinical decisions. To take both accuracy and interpretability into consideration, we propose a stacking-based ensemble learning method that simultaneously constructs the diagnostic model and extracts interpretable diagnostic rules. For this purpose, a multi-objective optimization algorithm is devised to maximize the classification accuracy and minimize the ensemble complexity for model selection. As for model combination, a random forest classifier-based stacking technique is explored for the integration of base learners, i.e., decision trees. Empirical results on real-world data from the General Hospital of PLA demonstrate that the proposed method outperforms several state-of-the-art methods in terms of classification accuracy, sensitivity and specificity. Moreover, the results reveal that several diagnostic rules extracted from the constructed ensemble learning model are accurate and interpretable.
Highlights
• We propose a stacking-based interpretable selective ensemble learning method.
• We select ensemble models with accuracy and complexity under consideration.
• We combine selected effective models by random forest-based stacking.
• The proposed method is more accurate and interpretable in prostate cancer detection.
• We extract a few effective diagnostic rules for clinical decision support.
[ABSTRACT FROM AUTHOR]
- Published
- 2019
43. Data science in chemistry : artificial intelligence, big data, chemometrics, and quantum computing with Jupyter.
- Author
-
Gressling, Thorsten
- Subjects
- Data Mining, Decision Trees, Chemistry -- Data processing
- Abstract
Summary: Decision trees have become one of the most powerful and popular approaches in knowledge discovery and data mining; it is the science of exploring large and complex bodies of data in order to discover useful patterns. Decision tree learning continues to evolve over time. Existing methods are constantly being improved and new methods introduced. This 2nd Edition is dedicated entirely to the field of decision trees in data mining; to cover all aspects of this important technique, as well as improved or new methods and techniques developed after the publication of our first edition. In this new edition, all chapters have been revised and new topics brought in. New topics include Cost-Sensitive Active Learning, Learning with Uncertain and Imbalanced Data, Using Decision Trees beyond Classification Tasks, Privacy Preserving Decision Tree Learning, Lessons Learned from Comparative Studies, and Learning Decision Trees for Big Data.
- Published
- 2021
44. Customer's class transformation for profit maximization in multi-class setting of Telecom industry using probability estimation decision trees.
- Author
-
Muneiah, Janapati Naga and Subba Rao, Ch D. V.
- Subjects
- *DECISION trees, *PROFIT maximization, *TELECOMMUNICATION, *DATA mining, *MANUAL labor, *QUEUING theory
- Abstract
The telecom sector loses profits to varying degrees because of various undesired classes of customers. Churners, customers who shift to competitors, are the most undesired class and the predominant reason for the losses. There are also other classes of customers who stay with the enterprise but are inactive in using its services, leading to uncertainty and insignificant profits. When data mining techniques are applied to such applications, they produce customer models in the form of decision trees, etc., and provide only the customer's class label, such as churner/non-churner. Furthermore, they focus only on improving the technical interestingness measures of prediction models. Thus, very limited research has been carried out on turning prediction results into useful decision-making actions. Consequently, a domain expert has to postprocess the model manually to obtain actionable knowledge for moving a customer from an undesired class to a desired one. Some existing works do suggest actions to convert a customer from one category to another, but they do not generalize to more than two classes. In this paper, a novel algorithm that fits the multi-class setting of the telecom sector is presented; it suggests actions to change a customer from an undesired class to a desirable one with maximum net profit. We explain the proposed method with a case study of the telecom sector. Empirical tests on the case study problem and on UCI benchmark data show that the method is effective and scalable. Through comparison with state-of-the-art methods and substantial experiments, we demonstrate the efficiency of the proposed method. [ABSTRACT FROM AUTHOR]
- Published
- 2019
45. Secure and Efficient Federated Gradient Boosting Decision Trees.
- Author
-
Zhao, Xue, Li, Xiaohui, Sun, Shuang, and Jia, Xu
- Subjects
- DECISION trees, DATA privacy, BOOSTING algorithms, DATA mining, COMMUNICATION models
- Abstract
In recent years, federated GBDTs have gradually replaced traditional GBDTs and become a focus of academic research. They are used for structured data mining tasks. To address the problems of information leakage, insufficient model accuracy, and high communication cost in existing horizontal federated GBDT schemes, this paper proposes a gradient boosting decision tree algorithm based on horizontal federated learning: secure and efficient FL for GBDTs (SeFB). The algorithm uses locality sensitive hashing (LSH) to build a tree by collecting similarity information about instances without exposing the participants' original data. In the tree-updating stage, the algorithm aggregates the local gradients of all data participants and calculates the global leaf weights, which improves the accuracy of the model and reduces the communication cost. Experimental analysis shows that the algorithm protects the privacy of the original data at low communication cost. Performance on unbalanced binary data sets is also evaluated. The results show that, compared with existing horizontal federated GBDT schemes, SeFB improves accuracy by 2.53% on average. [ABSTRACT FROM AUTHOR]
- Published
- 2023
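The tree-updating step described in the abstract above (aggregate each participant's local gradients, then compute a global leaf weight) can be sketched as follows. The abstract does not give the formula, so this assumes the common second-order GBDT leaf weight w = -G / (H + lambda), where G and H are the summed gradients and Hessians of the instances in the leaf. No secure aggregation or hashing is shown, and all names are hypothetical.

```python
def local_sums_squared_error(predictions, targets):
    """One participant's (gradient_sum, hessian_sum) for a leaf under squared error.

    For loss 0.5 * (prediction - target) ** 2 the gradient is
    prediction - target and the Hessian is 1 per instance.
    """
    grads = [p - y for p, y in zip(predictions, targets)]
    return sum(grads), float(len(grads))


def global_leaf_weight(local_sums, reg_lambda=1.0):
    """Combine per-participant (gradient_sum, hessian_sum) pairs for one leaf.

    Assumed formula: w = -G / (H + lambda), with reg_lambda as an
    L2 regularization term on the leaf weight.
    """
    total_g = sum(g for g, _ in local_sums)
    total_h = sum(h for _, h in local_sums)
    return -total_g / (total_h + reg_lambda)
```

Each participant sends only two numbers per leaf rather than its raw instances, which illustrates why aggregating gradient statistics keeps communication cost low.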
46. Decision Trees and Gender Stereotypes in University Academic Desertion
- Author
-
Andrade-Zurita, Sylvia, Armas-Arias, Sonia, Núñez-López, Rocío, Arévalo-Peralta, Josué, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Chen, Joy Iong-Zong, editor, Tavares, João Manuel R. S., editor, and Shi, Fuqian, editor
- Published
- 2022
47. Blending Shapley values for feature ranking in machine learning: an analysis on educational data
- Author
-
Guleria, Pratiyush
- Published
- 2024
48. Correlation Analysis of Railway Track Alignment and Ballast Stiffness: Comparing Frequency-Based and Machine Learning Algorithms.
- Author
-
Mohammadzadeh, Saeed, Heydari, Hamidreza, Karimi, Mahdi, and Mosleh, Araliya
- Subjects
- RANDOM forest algorithms, MACHINE learning, STANDARD deviations, DECISION trees, DATA mining
- Abstract
One of the primary challenges in the railway industry revolves around achieving a comprehensive and insightful understanding of track conditions. The geometric parameters and stiffness of railway tracks play a crucial role in condition monitoring as well as maintenance work. Hence, this study investigated the relationship between vertical ballast stiffness and the track longitudinal level. Initially, the ballast stiffness and track longitudinal level data were acquired through a series of experimental measurements conducted on a reference test track along the Tehran–Mashhad railway line, utilizing recording cars for geometric track and stiffness recordings. Subsequently, the correlation between the track longitudinal level and ballast stiffness was surveyed using both frequency-based techniques and machine learning (ML) algorithms. The power spectrum density (PSD) as a frequency-based technique was employed, alongside ML algorithms, including linear regression, decision trees, and random forests, for correlation mining analyses. The results showed a robust and statistically significant relationship between the vertical ballast stiffness and longitudinal levels of railway tracks. Specifically, the PSD data exhibited a considerable correlation, especially within the 1–4 rad/m wave number range. Furthermore, the data analyses conducted using ML methods indicated that the values of the root mean square error (RMSE) were about 0.05, 0.07, and 0.06 for the linear regression, decision tree, and random forest algorithms, respectively, demonstrating the adequate accuracy of ML-based approaches. [ABSTRACT FROM AUTHOR]
- Published
- 2024
49. Statistical analysis of various splitting criteria for decision trees.
- Author
-
Aaboub, Fadwa, Chamlal, Hasna, and Ouaderhman, Tayeb
- Subjects
- DECISION trees, STATISTICS, REGRESSION trees, INFORMATION theory, PEARSON correlation (Statistics), DATA mining
- Abstract
Decision trees are frequently used to solve classification problems in data mining and machine learning, owing to their clear and simple architecture, excellent quality, and resilience. Various decision tree algorithms have been developed using a variety of attribute selection criteria, following the top-down partitioning strategy. However, their effectiveness is influenced by the choice of the splitting method. Therefore, in this work, six decision tree algorithms based on six different attribute evaluation metrics are gathered in order to compare their performance. The algorithms are chosen from four categories of splitting criteria: criteria based on information theory, criteria based on distance, statistical-based criteria, and other splitting criteria. The approaches are iterative dichotomizer 3 (first category), C4.5 (first category), classification and regression trees (second category), Pearson's correlation coefficient based decision tree (third category), dispersion ratio (third category), and the feature weight based decision tree algorithm (last category). The six procedures are assessed on eleven data sets in terms of classification accuracy, tree depth, leaf nodes, and tree construction time. Furthermore, the Friedman and post hoc Nemenyi tests are used to examine the results. These two tests indicate that the iterative dichotomizer 3 and classification and regression trees methods perform better than the other decision tree methodologies. [ABSTRACT FROM AUTHOR]
- Published
- 2023
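Three of the splitting criteria associated with the algorithms named in the abstract above can be sketched as follows: information gain (used by ID3), gain ratio (used by C4.5), and Gini impurity reduction (used by CART). This is a minimal illustration for a single categorical attribute in which each distinct value forms one branch. CART itself uses binary splits, so the multiway Gini computation here is a simplification, and the other criteria in the study are not shown.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def partition(values, labels):
    """Group class labels by the attribute value of each instance."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    return list(groups.values())


def information_gain(values, labels):
    """Entropy reduction from splitting on the attribute (ID3 criterion)."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in partition(values, labels))
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain normalized by the split information (C4.5 criterion)."""
    split_info = entropy(values)  # entropy of the attribute's own value distribution
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0


def gini_gain(values, labels):
    """Gini impurity reduction from splitting on the attribute."""
    n = len(labels)
    remainder = sum(len(g) / n * gini(g) for g in partition(values, labels))
    return gini(labels) - remainder
```

A top-down tree builder would evaluate one of these scores for every candidate attribute at a node and split on the highest-scoring one, which is why the choice of criterion can change the resulting tree.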
50. Multi-objective evolution of oblique decision trees for imbalanced data binary classification.
- Author
-
Chabbouh, Marwa, Bechikh, Slim, Hung, Chih-Cheng, and Ben Said, Lamjed
- Subjects
- DECISION trees, CLASSIFICATION algorithms, DATA mining, GREEDY algorithms, CLASSIFICATION, BIOLOGICAL evolution
- Abstract
Imbalanced data classification is one of the most challenging problems in data mining. In this kind of problem, there are two types of classes: the majority class and the minority one. The former has a relatively high number of instances, while the latter contains far fewer. As most traditional classifiers assume that data is evenly distributed across all classes, they may fail considerably in recognizing instances of the minority class. Several interesting approaches have been proposed in the literature to handle the class imbalance issue, and the Oblique Decision Tree (ODT) is one of them. Nevertheless, most standard ODT construction algorithms use a greedy search process, while only very few works have addressed this induction problem using an evolutionary approach, and those do not really consider the class imbalance issue. To cope with this limitation, we propose in this paper a multi-objective evolutionary approach to find optimized ODTs for imbalanced binary classification. Our approach, called ODT-Θ-NSGA-III (ODT-based-Θ-Nondominated Sorting Genetic Algorithm-III), is motivated by its abilities: (a) to escape local optima in the ODT search space and (b) to maximize both Precision and Recall simultaneously. Thanks to these two features, ODT-Θ-NSGA-III provides competitive and better results when compared to many state-of-the-art classification algorithms on commonly used imbalanced benchmark data sets. [ABSTRACT FROM AUTHOR]
- Published
- 2019
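The two objectives named in the abstract above, Precision and Recall, and the notion of a non-dominated solution can be sketched as follows. This shows only plain Pareto filtering of candidate classifiers by their (precision, recall) scores; the Θ-dominance relation and the reference-point machinery of NSGA-III are not reproduced, and the function names are hypothetical.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive (minority) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Keep the score tuples that no other tuple dominates."""
    return [s for s in scores if not any(dominates(o, s) for o in scores)]
```

Because precision and recall usually trade off against each other, a multi-objective search keeps a set of non-dominated trees rather than a single winner, and a final tree is chosen from that set.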