20 results for "gradient boosted machine"
Search Results
2. Development of prognostic models for advanced multiple hepatocellular carcinoma based on Cox regression, deep learning and machine learning algorithms
- Author
-
Jie Shen, Yu Zhou, Junpeng Pei, Dashuai Yang, Kailiang Zhao, and Youming Ding
- Subjects
advanced multiple hepatocellular carcinoma, prognosis, machine learning, deep learning, gradient boosted machine, Medicine (General), R5-920
- Abstract
Background: Most patients with multiple hepatocellular carcinoma (MHCC) are at an advanced stage once diagnosed, so clinical treatment and decision-making are quite tricky. Because the AJCC-TNM system cannot accurately determine prognosis, our study aimed to identify prognostic factors for MHCC and to develop a prognostic model to quantify the risk and survival probability of patients. Methods: Eligible patients with HCC were obtained from the Surveillance, Epidemiology, and End Results (SEER) database, and prognostic models were then built using Cox regression, machine learning (ML), and deep learning (DL) algorithms. Model performance was evaluated using the C-index, receiver operating characteristic curve, Brier score, and decision curve analysis, and the best model was interpreted using the SHapley additive explanations (SHAP) interpretability technique. Results: A total of eight variables were included in the follow-up study, and our analysis identified the gradient boosted machine (GBM) model as the best prognostic model for advanced MHCC. In particular, the GBM model in the training cohort had a C-index of 0.73 and a Brier score of 0.124, with area under the curve (AUC) values above 0.78 at the first, third, and fifth year. Importantly, the model also performed well in the test cohort. Kaplan–Meier (K–M) survival analysis demonstrated that the newly developed risk stratification system could differentiate patient prognosis well. Conclusion: Of the ML models, the GBM model predicted the prognosis of advanced MHCC patients most accurately.
- Published
- 2024
- Full Text
- View/download PDF
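The entry above reports its model quality as a C-index, the fraction of comparable patient pairs where the model assigns the higher risk to the patient with the shorter survival. As a rough illustration only (the authors' SEER pipeline is not shown here), the following sketch fits a scikit-learn GBM as a risk model on synthetic data and computes Harrell's C-index from scratch; all variable names and the toy data are assumptions, not the paper's code.

```python
# Sketch: Harrell's C-index for a GBM risk score on synthetic survival data.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))          # eight predictors, as in the study
true_risk = X[:, 0] + 0.5 * X[:, 1]    # latent hazard (toy)
time = rng.exponential(scale=np.exp(-true_risk))  # survival times
event = np.ones_like(time, dtype=bool)  # no censoring in this toy example

# Train a GBM to predict risk, using negative log survival time as a proxy target.
model = GradientBoostingRegressor(random_state=0).fit(X, -np.log(time))
risk = model.predict(X)

def c_index(time, event, risk):
    """Fraction of comparable pairs where the shorter survival time
    received the higher predicted risk (ties count half)."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(i + 1, n):
            if time[i] == time[j]:
                continue
            lo, hi = (i, j) if time[i] < time[j] else (j, i)
            if not event[lo]:
                continue  # pair not comparable under censoring
            comparable += 1
            if risk[lo] > risk[hi]:
                concordant += 1
            elif risk[lo] == risk[hi]:
                concordant += 0.5
    return concordant / comparable

print(round(c_index(time, event, risk), 3))
```

A C-index of 0.5 is chance level; the paper's 0.73 means roughly three of four comparable patient pairs are ranked correctly.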
3. Modeling and Mapping of Forest Fire Occurrence in the Lower Silesian Voivodeship of Poland Based on Machine Learning Methods.
- Author
-
Milanović, Slobodan, Kaczmarowski, Jan, Ciesielski, Mariusz, Trailović, Zoran, Mielcarek, Miłosz, Szczygieł, Ryszard, Kwiatkowski, Mirosław, Bałazy, Radomir, Zasada, Michał, and Milanović, Sladjan D.
- Subjects
FOREST fires, FOREST mapping, FOREST fire prevention & control, WILDFIRE prevention, MACHINE learning, RECEIVER operating characteristic curves, CONIFEROUS forests
- Abstract
In recent years, forest fires have become an important issue in Central Europe. To model the probability of the occurrence of forest fires in the Lower Silesian Voivodeship of Poland, historical fire data and several types of predictors were collected or generated, including topographic, vegetation, climatic, and anthropogenic features. The main objectives of this study were to determine the importance of the predictors of forest fire occurrence and to map the probability of forest fire occurrence. The H2O driverless artificial intelligence (DAI) cloud platform was used to model forest fire probability. The gradient boosted machine (GBM) and random forest (RF) methods were applied to assess the probability of forest fire occurrence. Evaluation of the importance of the variables was performed using the H2O platform's permutation method. The most important variables were the presence of coniferous forest and the distance to agricultural land according to the GBM and RF methods, respectively. Model validation was conducted using receiver operating characteristic (ROC) analysis. The areas under the curve (AUCs) of the ROC plots from the GBM and RF models were 83.3% and 81.3%, respectively. Based on the results obtained, the GBM model can be recommended for the mapping of forest fire occurrence in the study area. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
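The fire-occurrence study's core comparison, GBM versus RF ranked by ROC AUC, can be sketched in a few lines. The paper used the H2O platform; scikit-learn stands in here, and the synthetic "fire / no fire" data are an assumption for illustration.

```python
# Sketch: comparing GBM and RF by ROC AUC on held-out data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

aucs = {}
for name, clf in [("GBM", GradientBoostingClassifier(random_state=42)),
                  ("RF", RandomForestClassifier(random_state=42))]:
    clf.fit(X_tr, y_tr)
    # AUC is computed on predicted probabilities, not hard labels.
    aucs[name] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print({k: round(v, 3) for k, v in aucs.items()})
```

Reporting AUC on a held-out split, as above, matches the paper's ROC-based validation; on its real data the GBM edged out RF (83.3% vs 81.3%).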
4. Development of prognostic models for advanced multiple hepatocellular carcinoma based on Cox regression, deep learning and machine learning algorithms.
- Author
-
Shen J, Zhou Y, Pei J, Yang D, Zhao K, and Ding Y
- Abstract
Background: Most patients with multiple hepatocellular carcinoma (MHCC) are at an advanced stage once diagnosed, so clinical treatment and decision-making are quite tricky. Because the AJCC-TNM system cannot accurately determine prognosis, our study aimed to identify prognostic factors for MHCC and to develop a prognostic model to quantify the risk and survival probability of patients. Methods: Eligible patients with HCC were obtained from the Surveillance, Epidemiology, and End Results (SEER) database, and prognostic models were then built using Cox regression, machine learning (ML), and deep learning (DL) algorithms. Model performance was evaluated using the C-index, receiver operating characteristic curve, Brier score, and decision curve analysis, and the best model was interpreted using the SHapley additive explanations (SHAP) interpretability technique. Results: A total of eight variables were included in the follow-up study, and our analysis identified the gradient boosted machine (GBM) model as the best prognostic model for advanced MHCC. In particular, the GBM model in the training cohort had a C-index of 0.73 and a Brier score of 0.124, with area under the curve (AUC) values above 0.78 at the first, third, and fifth year. Importantly, the model also performed well in the test cohort. Kaplan-Meier (K-M) survival analysis demonstrated that the newly developed risk stratification system could differentiate patient prognosis well. Conclusion: Of the ML models, the GBM model predicted the prognosis of advanced MHCC patients most accurately. Competing Interests: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. (Copyright © 2024 Shen, Zhou, Pei, Yang, Zhao and Ding.)
- Published
- 2024
- Full Text
- View/download PDF
5. Prediction of probable backorder scenarios in the supply chain using Distributed Random Forest and Gradient Boosting Machine learning techniques
- Author
-
Samiul Islam and Saman Hassanzadeh Amin
- Subjects
Inventory management, Product backorder, Machine learning, Gradient boosted machine, Supply chain management, Big data, Computer engineering. Computer hardware, TK7885-7895, Information technology, T58.5-58.64, Electronic computers. Computer science, QA75.5-76.95
- Abstract
Prediction using machine learning algorithms has not been widely adopted in many parts of business decision processes due to a lack of clarity and flexibility. Erroneous input data may produce inaccurate predictions. We aim to apply machine learning models to the business decision process by predicting product backorders while providing flexibility to the decision authority, better clarity of the process, and higher accuracy. A ranged method is used for specifying different levels of predicting features to cope with the diverse characteristics of real-time data, which may be affected by machine or human errors. The range is tunable, which gives decision managers flexibility. Tree-based machine learning is chosen for better model explainability. The backorders of products are predicted in this study using Distributed Random Forest (DRF) and Gradient Boosting Machine (GBM). We observed that the performance of the machine learning models improved by 20% using this ranged approach when the dataset is highly biased with random error. We utilized a five-level metric to indicate the inventory level, sales level, and forecasted sales level, and a four-level metric for the lead time. A decision tree from one of the constructed models is analyzed to understand the effects of the ranged approach. As a part of this analysis, we list major probable backorder scenarios to facilitate business decisions. We show how this model can be used to predict the probable backorder products before actual sales take place. The mentioned methods in this research can be utilized in other supply chain cases to forecast backorders.
- Published
- 2020
- Full Text
- View/download PDF
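The "ranged" idea above, discretizing noisy continuous features into a few tunable levels before fitting the tree model, can be sketched as follows. The bin edges, feature names, and synthetic backorder data are all assumptions for illustration; the paper's actual thresholds and dataset are not reproduced here.

```python
# Sketch: binning continuous supply-chain features into ordinal levels,
# then fitting a GBM classifier on the levels instead of the raw values.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
n = 2000
inventory = rng.gamma(2.0, 50.0, n)            # units in stock
forecast = rng.gamma(2.0, 40.0, n)             # forecasted sales
lead_time = rng.integers(1, 30, n)             # days
# Toy label: backorder is more likely when forecast outstrips inventory.
backorder = (forecast - inventory + rng.normal(0, 30, n) > 0).astype(int)

def to_levels(x, edges):
    """Map a continuous feature to ordinal levels; noisy or out-of-range
    values collapse into the nearest bin, which is the tunable 'range'."""
    return np.digitize(x, edges)

X = np.column_stack([
    to_levels(inventory, [25, 75, 150, 300]),   # five inventory levels
    to_levels(forecast, [20, 60, 120, 240]),    # five forecast levels
    to_levels(lead_time, [7, 14, 21]),          # four lead-time levels
])
model = GradientBoostingClassifier(random_state=1).fit(X, backorder)
print(round(model.score(X, backorder), 3))
```

Widening or narrowing the bin edges is what gives decision managers the flexibility the abstract describes: coarser bins absorb more input noise at the cost of resolution.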
6. Classifying and Recommending Using Gradient Boosted Machines and Vector Space Models
- Author
-
Sheil, Humphrey, Rana, Omer, Kacprzyk, Janusz, Series editor, Pal, Nikhil R., Advisory editor, Bello Perez, Rafael, Advisory editor, Corchado, Emilio S., Advisory editor, Hagras, Hani, Advisory editor, Kóczy, László T., Advisory editor, Kreinovich, Vladik, Advisory editor, Lin, Chin-Teng, Advisory editor, Lu, Jie, Advisory editor, Melin, Patricia, Advisory editor, Nedjah, Nadia, Advisory editor, Nguyen, Ngoc Thanh, Advisory editor, Wang, Jun, Advisory editor, Chao, Fei, editor, Schockaert, Steven, editor, and Zhang, Qingfu, editor
- Published
- 2018
- Full Text
- View/download PDF
7. Modeling the productivity of mechanized CTL harvesting with statistical machine learning methods.
- Author
-
Liski, Eero, Jounela, Pekka, Korpunen, Heikki, Sosa, Amanda, Lindroos, Ola, and Jylhä, Paula
- Subjects
MACHINE learning, SUPPORT vector machines, HARVESTING machinery
- Abstract
Modern forest harvesters automatically collect large amounts of standardized work-related data. Statistical machine learning methods enable detailed analyses of large databases from wood harvesting operations. In the present study, gradient boosted machine (GBM), support vector machine (SVM) and ordinary least square (OLS) regression were implemented and compared in predicting the productivity of cut-to-length (CTL) harvesting based on operational monitoring files generated by the harvesters' on-board computers. The data consisted of 1,381 observations from 27 operators and 19 single-grip harvesters. Each tested method detected the mean stem volume as the most significant factor affecting productivity. Depending on the modeling approach, 33–59% of variation was due to the operators. The best GBM model was able to predict the productivity with an R2 of 90.2%, whereas OLS and SVM reached R2 values of 89.3% and 87%, respectively. OLS regression still proved to be an effective method for predicting the productivity of CTL harvesting with a limited number of observations and variables, but the more powerful GBM and SVM show great potential as the amount of data increases along with the development of various big data applications. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
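The three-way benchmark in this entry (GBM vs SVM vs OLS, ranked by R2 on held-out data) has a compact generic shape. The sketch below uses synthetic data where one dominant nonlinear feature stands in for mean stem volume; the data, split, and default hyperparameters are assumptions, not the study's setup.

```python
# Sketch: benchmarking GBM, SVR, and OLS by held-out R^2.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 4))
# Diminishing-returns response: mostly driven by a sqrt of one feature.
y = 3 * np.sqrt(X[:, 0]) + X[:, 1] + rng.normal(0, 0.2, 1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scores = {}
for name, reg in [("GBM", GradientBoostingRegressor(random_state=0)),
                  ("SVM", SVR()),
                  ("OLS", LinearRegression())]:
    # .score() returns the coefficient of determination R^2 on the test set.
    scores[name] = reg.fit(X_tr, y_tr).score(X_te, y_te)
print({k: round(v, 3) for k, v in scores.items()})
```

As the abstract notes, OLS can stay competitive when the response is close to linear and data are limited; the ensemble methods pull ahead as data volume and nonlinearity grow.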
8. Forecasting heating and cooling loads of buildings: a comparative performance analysis.
- Author
-
Roy, Sanjiban Sekhar, Samui, Pijush, Nagtode, Ishan, Jain, Hemant, Shivaramakrishnan, Vishal, and Mohammadi-ivatloo, Behnam
- Abstract
Heating load and cooling load forecasting are crucial for estimating energy consumption and improvement of energy performance during the design phase of buildings. Since the capacity of the cooling, ventilation and air-conditioning system of the building contributes to the operation cost, it is ideal to develop accurate models for heating and cooling load forecasting of buildings. This paper proposes a machine-learning technique for prediction of heating load and cooling load of residential buildings. The proposed model is a deep neural network (DNN), which presents a category of learning algorithms that adopt nonlinear extraction of information in several steps within a hierarchical framework, primarily applied for learning and pattern classification. The output of DNN has been compared with other proposed methods such as gradient boosted machine (GBM), Gaussian process regression (GPR) and minimax probability machine regression (MPMR). To develop the DNN model, the energy data set has been divided into training (70%) and testing (30%) sets. The performance of the proposed model was benchmarked by statistical performance metrics such as variance accounted for (VAF), relative average absolute error (RAAE), root mean absolute error (RMAE), coefficient of determination (R2), standard deviation ratio (RSR), mean absolute percentage error (MAPE), Nash–Sutcliffe coefficient (NS), root mean squared error (RMSE), and weighted mean absolute percent error (WMAPE). DNN and GPR have produced the best predicted VAF for cooling load and heating load of 99.76% and 99.84%, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
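The headline metric in this entry, VAF (variance accounted for), is less common than R2 and worth pinning down: VAF = (1 - Var(y - y_hat) / Var(y)) * 100. A minimal from-scratch computation, with toy load values that are purely illustrative:

```python
# Sketch: the VAF (variance accounted for) metric, computed from scratch.
import numpy as np

def vaf(y_true, y_pred):
    """VAF = (1 - Var(residuals) / Var(y_true)) * 100; 100 is a perfect fit."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return (1.0 - np.var(y_true - y_pred) / np.var(y_true)) * 100.0

y = np.array([10.0, 12.0, 15.0, 20.0, 18.0])      # e.g. heating loads (kW)
perfect = vaf(y, y)                                # exact predictions -> 100.0
noisy = vaf(y, y + np.array([0.5, -0.5, 0.5, -0.5, 0.0]))
print(round(perfect, 1), round(noisy, 1))
```

Unlike RMSE, VAF is scale-free, which is why the paper can quote values like 99.76% across loads of different magnitudes.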
9. Forecasting short-term peak concentrations from a network of air quality instruments measuring PM2.5 using boosted gradient machine models.
- Author
-
Miskell, Georgia, Pattinson, Woodrow, Weissert, Lena, and Williams, David
- Subjects
AIR quality, MEASURING instruments, METEOROLOGICAL stations, AREA measurement, ATMOSPHERIC pressure, LOAD forecasting (Electric power systems)
- Abstract
Machine learning algorithms are used successfully in this paper to reliably forecast upcoming short-term high-concentration episodes, or peaks (<60-min), of fine particulate air pollution (PM 2.5) 1 h in advance. Results are from a network around Christchurch, New Zealand, with the objective of forecasting the occurrence of short-term peaks using a gradient boosted machine with a binary classifier as the response (1 = peak, 0 = no peak). Results are successful, with 80–90% accurate forecasting of whether a peak in PM 2.5 would occur within the next 60-min period. Elevated and variable nitrogen monoxide, nitrogen dioxide, and lower temperatures and wind gusts are found to be important precursors to the occurrence of PM 2.5 peaks. The use of meteorological data from a network of personal weather stations across the monitored area and from the measurement instruments was able to identify local-scale peak differences in the network. Boosted models using hourly-averaged and daily-averaged peaks as the response are developed separately to showcase differences in precursors between short-term and long-term peaks, with recent wind gusts and nitrogen oxides linked to hourly-averaged peaks and aloft air temperatures and atmospheric pressure linked to daily-averaged peaks. Results could prove useful in exposure mitigation strategies (e.g. as a short-term warning system). • High short-term (<60 min) PM2.5 concentrations were forecast, 1 h in advance. • Significant precursors were nitrogen monoxide, nitrogen dioxide, temperature and wind. • Personal weather stations helped identify local-scale processes affecting PM2.5 • Short-term results were different to daily-averaged forecasts. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
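The framing in this entry, a GBM with a binary peak/no-peak response built from recent precursor measurements, can be sketched generically. The synthetic pollutant series and thresholds below are assumptions; in the toy data the peak indicator is a deterministic function of the current precursors, so the classifier simply has to recover the thresholds, whereas the real study forecasts one hour ahead from lagged network data.

```python
# Sketch: GBM binary classification of pollution peaks from precursor features.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(7)
n = 1500
no = rng.gamma(2.0, 10.0, n)        # nitrogen monoxide (toy precursor)
temp = rng.normal(10, 5, n)         # air temperature
# Toy label: a peak occurs after elevated NO combined with low temperature.
peak = ((no > 25) & (temp < 10)).astype(int)

# Features: current hour plus the previous hour of each precursor.
X = np.column_stack([no[1:], no[:-1], temp[1:], temp[:-1]])
y = peak[1:]

split = int(0.7 * len(y))           # chronological split, no shuffling
clf = GradientBoostingClassifier(random_state=7).fit(X[:split], y[:split])
acc = clf.score(X[split:], y[split:])
print(round(acc, 3))
```

The chronological train/test split matters for this kind of forecasting problem: shuffling would leak future information into training, inflating the 80–90% accuracy figure the paper reports.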
10. An in-depth experimental study of anomaly detection using gradient boosted machine.
- Author
-
Tama, Bayu Adhi and Rhee, Kyung-Hyune
- Subjects
AIRBORNE lasers, INTRUSION detection systems (Computer security), ANOMALY detection (Computer security), RECEIVER operating characteristic curves, SUPPORT vector machines, REGRESSION trees
- Abstract
This paper proposes an improved detection performance of an anomaly-based intrusion detection system (IDS) using gradient boosted machine (GBM). The best parameters of GBM are obtained by performing grid search. The performance of GBM is then compared with four renowned classifiers, i.e. random forest, deep neural network, support vector machine, and classification and regression tree, in terms of five performance measures, i.e. accuracy, specificity, sensitivity, false positive rate and area under the receiver operating characteristic curve (AUC). From the experimental results, it can be seen that GBM significantly outperforms the most recent IDS techniques, i.e. fuzzy classifier, two-tier classifier, GAR-forest, and tree-based classifier ensemble. These results are the highest so far obtained on the complete feature sets of three different datasets, i.e. the NSL-KDD, UNSW-NB15, and GPRS datasets, using either tenfold cross-validation or the hold-out method. Moreover, we support our results by conducting two statistical significance tests, which have yet to be applied in existing IDS research. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
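The tuning step this entry describes, a grid search over GBM hyperparameters before the classifier comparison, looks like the following in scikit-learn. The small grid and synthetic data are assumptions for illustration; the paper tuned on the full NSL-KDD, UNSW-NB15, and GPRS feature sets.

```python
# Sketch: cross-validated grid search over GBM hyperparameters, scored by AUC.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=600, n_features=12, random_state=3)
grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=3),
                      grid, cv=3, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Scoring the search by ROC AUC rather than accuracy is the natural choice for intrusion detection, where attack and normal traffic are usually imbalanced.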
11. An extensive experimental survey of regression methods.
- Author
-
Fernández-Delgado, M., Sirsat, M.S., Cernadas, E., Alawadi, S., Barro, S., and Febrero-Bande, M.
- Subjects
REGRESSION analysis, MACHINE learning, LINEAR systems, LEAST squares, DEEP learning
- Abstract
Regression is a very relevant problem in machine learning, with many different available approaches. The current work presents a comparison of a large collection of 77 popular regression models belonging to 19 families: linear and generalized linear models, generalized additive models, least squares, projection methods, LASSO and ridge regression, Bayesian models, Gaussian processes, quantile regression, nearest neighbors, regression trees and rules, random forests, bagging and boosting, neural networks, deep learning and support vector regression. These methods are evaluated using all the regression datasets of the UCI machine learning repository (83 datasets), with some exceptions due to technical reasons. The experimental work identifies several outstanding regression models: the M5 rule-based model with corrections based on nearest neighbors (cubist), the gradient boosted machine (gbm), the boosting ensemble of regression trees (bstTree) and the M5 regression tree. Cubist achieves the best squared correlation (R2) in 15.7% of datasets and is very near to it elsewhere, with a difference below 0.2 for 89.1% of datasets; the median of these differences over the dataset collection is very low (0.0192), compared e.g. to classical linear regression (0.150). However, cubist is slow and fails on several large datasets, while other similar regression models such as M5 never fail, and its difference to the best R2 is below 0.2 for 92.8% of datasets. Other well-performing regression models are the committee of neural networks (avNNet), extremely randomized regression trees (extraTrees, which achieves the best R2 in 33.7% of datasets), random forest (rf) and ε-support vector regression (svr), but they are slower and fail on several datasets. The fastest regression model is least angle regression (lars), which is 70 and 2,115 times faster than M5 and cubist, respectively. The model which requires the least memory is non-negative least squares (nnls), about 2 GB, similar to cubist, while M5 requires about 8 GB. For 97.6% of datasets there is a regression model among the 10 best which is very near (difference below 0.1) to the best R2, increasing to 100% when allowing differences of 0.2. Therefore, provided that our dataset and model collection are representative enough, the main conclusion of this study is that, for a new regression problem, some model in our top 10 should achieve an R2 near the best attainable for that problem. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
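The survey's bookkeeping, per dataset, each model's gap to the best R2, then per-model medians and "share of datasets where this model is best", is simple to reproduce. The R2 scores below are invented to show the mechanics, not the paper's measured results.

```python
# Sketch: the survey's comparison logic over a (datasets x models) R^2 table.
import numpy as np

models = ["cubist", "gbm", "rf", "lars"]        # illustrative subset
r2 = np.array([                                  # rows = datasets (invented)
    [0.91, 0.89, 0.88, 0.70],
    [0.85, 0.86, 0.84, 0.80],
    [0.60, 0.58, 0.62, 0.55],
])
gaps = r2.max(axis=1, keepdims=True) - r2        # distance to the best model
median_gap = {m: float(np.median(gaps[:, j])) for j, m in enumerate(models)}
best_share = {m: float(np.mean(r2.argmax(axis=1) == j))
              for j, m in enumerate(models)}
print(median_gap)
print(best_share)
```

The median gap is the statistic behind the survey's claim that cubist's typical distance to the best R2 is only 0.0192, versus 0.150 for plain linear regression.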
12. Modeling and Mapping of Forest Fire Occurrence in the Lower Silesian Voivodeship of Poland Based on Machine Learning Methods
- Author
-
Slobodan Milanović, Jan Kaczmarowski, Mariusz Ciesielski, Zoran Trailović, Miłosz Mielcarek, Ryszard Szczygieł, Mirosław Kwiatkowski, Radomir Bałazy, Michał Zasada, and Sladjan D. Milanović
- Subjects
gradient boosted machine, ignition probability, forest fire, random forest, Forestry
- Abstract
In recent years, forest fires have become an important issue in Central Europe. To model the probability of the occurrence of forest fires in the Lower Silesian Voivodeship of Poland, historical fire data and several types of predictors were collected or generated, including topographic, vegetation, climatic, and anthropogenic features. The main objectives of this study were to determine the importance of the predictors of forest fire occurrence and to map the probability of forest fire occurrence. The H2O driverless artificial intelligence (DAI) cloud platform was used to model forest fire probability. The gradient boosted machine (GBM) and random forest (RF) methods were applied to assess the probability of forest fire occurrence. Evaluation of the importance of the variables was performed using the H2O platform's permutation method. The most important variables were the presence of coniferous forest and the distance to agricultural land according to the GBM and RF methods, respectively. Model validation was conducted using receiver operating characteristic (ROC) analysis. The areas under the curve (AUCs) of the ROC plots from the GBM and RF models were 83.3% and 81.3%, respectively. Based on the results obtained, the GBM model can be recommended for the mapping of forest fire occurrence in the study area.
- Published
- 2023
13. A Decision Tree Approach for Spatially Interpolating Missing Land Cover Data and Classifying Satellite Images
- Author
-
Jacinta Holloway, Kate J. Helmstedt, Kerrie Mengersen, and Michael Schmidt
- Subjects
random forest, gradient boosted machine, decision trees, inverse distance weighted interpolation, machine learning, satellite image, land cover classification, Sustainable Development Goals, spatial data, missing data, pixel level analysis, Science
- Abstract
Sustainable Development Goals (SDGs) are a set of priorities the United Nations and World Bank have set for countries to reach in order to improve quality of life and environment globally by 2030. Free satellite images have been identified as a key resource that can be used to produce official statistics and analysis to measure progress towards SDGs, especially those that are concerned with the physical environment, such as forest, water, and crops. Satellite images can often be unusable due to missing data from cloud cover, particularly in tropical areas where the deforestation rates are high. There are existing methods for filling in image gaps; however, these are often computationally expensive in image classification or not effective at pixel scale. To address this, we use two machine learning methods—gradient boosted machine and random forest algorithms—to classify the observed and simulated ‘missing’ pixels in satellite images as either grassland or woodland. We also predict a continuous biophysical variable, Foliage Projective Cover (FPC), which was derived from satellite images, and perform accurate binary classification and prediction using only the latitude and longitude of the pixels. We compare the performance of these methods against each other and inverse distance weighted interpolation, which is a well-established spatial interpolation method. We find both of the machine learning methods, particularly random forest, perform fast and accurate classifications of both observed and missing pixels, with up to 0.90 accuracy for the binary classification of pixels as grassland or woodland. The results show that the random forest method is more accurate than inverse distance weighted interpolation and gradient boosted machine for prediction of FPC for observed and missing data. 
Based on the case study results from a sub-tropical site in Australia, we show that our approach provides an efficient alternative for interpolating images and performing land cover classifications.
- Published
- 2019
- Full Text
- View/download PDF
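The striking claim in this entry is that pixels can be classified as grassland or woodland from latitude and longitude alone, because nearby pixels share a class. A minimal sketch of that setup, with a synthetic smooth spatial pattern standing in for the Landsat-derived labels (the pattern, coordinates, and accuracy are assumptions, not the study's data):

```python
# Sketch: classifying pixels from coordinates only, GBM vs RF.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
lat = rng.uniform(-28.0, -26.0, 3000)
lon = rng.uniform(151.0, 153.0, 3000)
# Woodland occupies a smooth spatial band; spatial autocorrelation is
# exactly why coordinates alone carry signal.
woodland = (np.sin(3 * lat) + np.cos(3 * lon) > 0).astype(int)

X = np.column_stack([lat, lon])
X_tr, X_te, y_tr, y_te = train_test_split(X, woodland, random_state=5)
acc = {}
for name, clf in [("RF", RandomForestClassifier(random_state=5)),
                  ("GBM", GradientBoostingClassifier(random_state=5))]:
    acc[name] = clf.fit(X_tr, y_tr).score(X_te, y_te)
print({k: round(v, 3) for k, v in acc.items()})
```

Tree ensembles partition the coordinate plane into axis-aligned boxes, which is why they act here as a fast, learned spatial interpolator competitive with inverse distance weighting.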
14. Using machine learning models to predict the effects of seasonal fluxes on Plesiomonas shigelloides population density.
- Author
-
Ekundayo, Temitope C., Ijabadeniyi, Oluwatosin A., Igbinosa, Etinosa O., and Okoh, Anthony I.
- Subjects
TOTAL suspended solids, POPULATION density, ARTIFICIAL intelligence, REGRESSION trees, WATERSHED management, BOOSTING algorithms, MACHINE learning
- Abstract
Seasonal variations (SVs) affect the population density (PD), fate, and fitness of pathogens in environmental water resources and their public health impacts. Therefore, this study aimed to apply machine learning intelligence (MLI) to predict the impacts of SVs on P. shigelloides population density (PDP) in the aquatic milieu. Physicochemical events (PEs) and PDP from three rivers, acquired via standard microbiological and instrumental techniques across seasons, were fitted to MLI algorithms (linear regression (LR), multiple linear regression (MR), random forest (RF), gradient boosted machine (GBM), neural network (NN), K-nearest neighbour (KNN), boosted regression tree (BRT), extreme gradient boosting (XGB) regression, support vector regression (SVR), decision tree regression (DTR), M5 pruned regression (M5P), artificial neural network (ANN) regression (with one 10-node hidden layer (ANN10), two 6- and 4-node hidden layers (ANN64), and two 5- and 5-node hidden layers (ANN55)), and elastic net regression (ENR)) to assess the implications of the SVs of PEs on aquatic PDP. The results showed that SVs significantly influenced PDP and PEs in the water (p < 0.0001), exhibiting a site-specific pattern. While MLI algorithms predicted PDP with differing absolute flux magnitudes for the contributing variables, DTR predicted the highest PDP value of 1.707 log units, followed by XGB (1.637 log units), but XGB (mean squared error (MSE) = 0.0025; root mean squared error (RMSE) = 0.0501; R2 = 0.998; median absolute deviation (MAD) = 0.0275) outperformed the other models in terms of regression metrics. Temperature and total suspended solids (TSS) ranked first and second as significant factors in predicting PDP in 53.3% (8/15) and 40% (6/15) of the models, respectively, based on the RMSE loss after permutations. Additionally, season ranked third among the 7 models, and turbidity (TBS) ranked fourth at 26.7% (4/15), as the primary significant factor for predicting PDP in the aquatic milieu. The results of this investigation demonstrated that MLI predictive modelling techniques can promisingly be exploited to complement the repetitive laboratory-based monitoring of PDP and other pathogens, especially in low-resource settings, in response to seasonal fluxes and can provide insights into the potential public health risks of emerging pathogens and TSS pollution (e.g., nanoparticles and micro- and nanoplastics) in the aquatic milieu. The model outputs provide low-cost and effective early warning information to assist watershed managers and fish farmers in making appropriate decisions about water resource protection, aquaculture management, and sustainable public health protection. [Display omitted] • Machine learning (ML) models were built for predicting Plesiomonas density (PDP). • ML regression models predicted PDP with different abilities. • The XGB & RF models displayed good performance/regression metrics in PDP forecasting. • Temperature, season, and total suspended solids (TSS) had a great influence on PDP. • ML models are promising for watershed and aquaculture management. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
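The variable ranking in this entry is based on "RMSE loss after permutations": shuffle one feature, refit nothing, and measure how much the model's error grows. A scikit-learn sketch on synthetic physicochemical data, where temperature is built to dominate as it does in the paper's ranking (the data, coefficients, and feature names are assumptions):

```python
# Sketch: ranking predictors by permutation importance under an RMSE scorer.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(11)
n = 800
temperature = rng.normal(22, 4, n)
tss = rng.gamma(2.0, 15.0, n)          # total suspended solids
turbidity = rng.gamma(2.0, 5.0, n)     # uninformative in this toy setup
density = 0.08 * temperature + 0.01 * tss + rng.normal(0, 0.2, n)

X = np.column_stack([temperature, tss, turbidity])
model = GradientBoostingRegressor(random_state=11).fit(X, density)
imp = permutation_importance(model, X, density, n_repeats=10,
                             scoring="neg_root_mean_squared_error",
                             random_state=11)
order = np.argsort(imp.importances_mean)[::-1]   # largest RMSE loss first
print([["temperature", "tss", "turbidity"][i] for i in order])
```

Because the score is only recomputed after shuffling, permutation importance is model-agnostic and cheap, which suits the 15-model comparison the study runs.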
15. An extensive experimental survey of regression methods
- Author
-
E. Cernadas, Sadi Alawadi, Manuel Febrero-Bande, Manisha Sanjay Sirsat, Senén Barro, Manuel Fernández-Delgado, Universidade de Santiago de Compostela. Centro de Investigación en Tecnoloxías da Información, Universidade de Santiago de Compostela. Departamento de Electrónica e Computación, and Universidade de Santiago de Compostela. Departamento de Estatística, Análise Matemática e Optimización
- Subjects
Generalized linear model, Cognitive Neuroscience, Gradient boosted machine, Cubist, Machine Learning, M5, Artificial Intelligence, Surveys and Questionnaires, Statistics, Linear regression, Humans, Extremely randomized regression tree, Mathematics, Least-angle regression, Linear model, Bayes Theorem, Regression analysis, UCI machine learning repository, Regression, Quantile regression, Random forest, Support vector machine, Linear Models, Neural Networks, Computer
- Abstract
Regression is a very relevant problem in machine learning, with many different available approaches. The current work presents a comparison of a large collection of 77 popular regression models belonging to 19 families: linear and generalized linear models, generalized additive models, least squares, projection methods, LASSO and ridge regression, Bayesian models, Gaussian processes, quantile regression, nearest neighbors, regression trees and rules, random forests, bagging and boosting, neural networks, deep learning and support vector regression. These methods are evaluated using all the regression datasets of the UCI machine learning repository (83 datasets), with some exceptions due to technical reasons. The experimental work identifies several outstanding regression models: the M5 rule-based model with corrections based on nearest neighbors (cubist), the gradient boosted machine (gbm), the boosting ensemble of regression trees (bstTree) and the M5 regression tree. Cubist achieves the best squared correlation (R2) in 15.7% of datasets and is very near to it elsewhere, with a difference below 0.2 for 89.1% of datasets; the median of these differences over the dataset collection is very low (0.0192), compared e.g. to classical linear regression (0.150). However, cubist is slow and fails on several large datasets, while other similar regression models such as M5 never fail, and its difference to the best R2 is below 0.2 for 92.8% of datasets. Other well-performing regression models are the committee of neural networks (avNNet), extremely randomized regression trees (extraTrees, which achieves the best R2 in 33.7% of datasets), random forest (rf) and ε-support vector regression (svr), but they are slower and fail on several datasets. The fastest regression model is least angle regression (lars), which is 70 and 2,115 times faster than M5 and cubist, respectively. The model which requires the least memory is non-negative least squares (nnls), about 2 GB, similar to cubist, while M5 requires about 8 GB. For 97.6% of datasets there is a regression model among the 10 best which is very near (difference below 0.1) to the best R2, increasing to 100% when allowing differences of 0.2. Therefore, provided that our dataset and model collection are representative enough, the main conclusion of this study is that, for a new regression problem, some model in our top 10 should achieve an R2 near the best attainable for that problem. This work has received financial support from the Erasmus Mundus Euphrates programme [project number 2013-2540/001-001-EMA2], from the Xunta de Galicia (Centro singular de investigación de Galicia, accreditation 2016–2019) and the European Union (European Regional Development Fund, ERDF), and from Project MTM2016–76969–P (Spanish State Research Agency, AEI), co-funded by the ERDF and the IAP network of the Belgian Science Policy.
- Published
- 2019
16. Prediction of probable backorder scenarios in the supply chain using Distributed Random Forest and Gradient Boosting Machine learning techniques
- Author
-
Islam, Samiul and Amin, Saman Hassanzadeh
- Published
- 2020
- Full Text
- View/download PDF
17. A decision tree approach for spatially interpolating missing land cover data and classifying satellite images
- Author
-
Holloway, Jacinta, Helmstedt, Kate, Mengersen, Kerrie, Schmidt, Michael, Holloway, Jacinta, Helmstedt, Kate, Mengersen, Kerrie, and Schmidt, Michael
- Abstract
Sustainable Development Goals (SDGs) are a set of priorities the United Nations and World Bank have set for countries to reach in order to improve quality of life and environment globally by 2030. Free satellite images have been identified as a key resource that can be used to produce official statistics and analysis to measure progress towards SDGs, especially those that are concerned with the physical environment, such as forest, water, and crops. Satellite images can often be unusable due to missing data from cloud cover, particularly in tropical areas where the deforestation rates are high. There are existing methods for filling in image gaps; however, these are often computationally expensive in image classification or not effective at pixel scale. To address this, we use two machine learning methods—gradient boosted machine and random forest algorithms—to classify the observed and simulated ‘missing’ pixels in satellite images as either grassland or woodland. We also predict a continuous biophysical variable, Foliage Projective Cover (FPC), which was derived from satellite images, and perform accurate binary classification and prediction using only the latitude and longitude of the pixels. We compare the performance of these methods against each other and inverse distance weighted interpolation, which is a well-established spatial interpolation method. We find both of the machine learning methods, particularly random forest, perform fast and accurate classifications of both observed and missing pixels, with up to 0.90 accuracy for the binary classification of pixels as grassland or woodland. The results show that the random forest method is more accurate than inverse distance weighted interpolation and gradient boosted machine for prediction of FPC for observed and missing data. Based on the case study results from a sub-tropical site in Australia, we show that our approach provides an efficient alternative for interpolating images and performing land cover classifications.
- Published
- 2019
18. A Decision Tree Approach for Spatially Interpolating Missing Land Cover Data and Classifying Satellite Images
- Author
-
Kate J. Helmstedt, Michael Schmidt, Jacinta Holloway, and Kerrie Mengersen
- Subjects
random forest, gradient boosted machine, decision trees, inverse distance weighted interpolation, machine learning, satellite image, land cover classification, Sustainable Development Goals, spatial data, missing data, pixel level analysis, Computer science, Cloud cover, Science, Land cover, Grassland, Multivariate interpolation, Deforestation, Spatial analysis, Pixel, Contextual image classification, Pattern recognition, Binary classification, General Earth and Planetary Sciences, Artificial intelligence, Interpolation - Abstract
Sustainable Development Goals (SDGs) are a set of priorities the United Nations and World Bank have set for countries to reach in order to improve quality of life and environment globally by 2030. Free satellite images have been identified as a key resource that can be used to produce official statistics and analysis to measure progress towards SDGs, especially those that are concerned with the physical environment, such as forest, water, and crops. Satellite images can often be unusable due to missing data from cloud cover, particularly in tropical areas where the deforestation rates are high. There are existing methods for filling in image gaps; however, these are often computationally expensive in image classification or not effective at pixel scale. To address this, we use two machine learning methods—gradient boosted machine and random forest algorithms—to classify the observed and simulated ‘missing’ pixels in satellite images as either grassland or woodland. We also predict a continuous biophysical variable, Foliage Projective Cover (FPC), which was derived from satellite images, and perform accurate binary classification and prediction using only the latitude and longitude of the pixels. We compare the performance of these methods against each other and inverse distance weighted interpolation, which is a well-established spatial interpolation method. We find both of the machine learning methods, particularly random forest, perform fast and accurate classifications of both observed and missing pixels, with up to 0.90 accuracy for the binary classification of pixels as grassland or woodland. The results show that the random forest method is more accurate than inverse distance weighted interpolation and gradient boosted machine for prediction of FPC for observed and missing data. Based on the case study results from a sub-tropical site in Australia, we show that our approach provides an efficient alternative for interpolating images and performing land cover classifications.
- Published
- 2019
- Full Text
- View/download PDF
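The inverse distance weighted interpolation baseline mentioned in the abstract above estimates a missing pixel from nearby observed pixels, weighting each by the inverse of its distance raised to a power. A minimal, self-contained sketch (pure Python; the coordinates and FPC values are made-up illustrations, not the authors' data or implementation):

```python
import math

def idw_interpolate(known, x, y, power=2):
    """Inverse distance weighted estimate of a missing pixel value.

    known: list of (xi, yi, value) tuples for observed pixels
    (x, y): coordinates of the missing pixel
    power: distance-decay exponent (2 is the classic choice)
    """
    num, den = 0.0, 0.0
    for xi, yi, v in known:
        d = math.hypot(x - xi, y - yi)
        if d == 0:
            return v  # the pixel is actually observed: reuse its value
        w = 1.0 / d ** power
        num += w * v
        den += w
    return num / den

# Fill a simulated 'missing' FPC pixel from four observed neighbours.
observed = [(0, 0, 40.0), (0, 1, 42.0), (1, 0, 38.0), (1, 1, 44.0)]
print(idw_interpolate(observed, 0.5, 0.5))  # equidistant neighbours -> 41.0
```

Because the query point is equidistant from all four neighbours, the weights cancel and the result is the plain average; closer neighbours would dominate otherwise.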
19. Random forest, gradient boosted machines and deep neural network for stock price forecasting: a comparative analysis on South Korean companies
- Author
-
Rohan Chopra, Behnam Mohammadi-ivatloo, Sanjiban Sekhar Roy, Concetto Spampinato, and Kun Chang Lee
- Subjects
Artificial neural network ,Computer Networks and Communications ,Computer science ,business.industry ,Financial markets ,Deep learning ,Gradient boosted machine ,Feature extraction ,Financial market ,Statistical model ,Deep neural network ,KOSPI ,GBM ,Random forest ,Hardware and Architecture ,Korea Composite Stock Price Index ,Econometrics ,Stock market ,Artificial intelligence ,business ,DNN ,Software - Abstract
Predicting the final closing price of a stock is a challenging task, and even modest improvements in predictive outcome can be very profitable. Many computer-aided techniques based on either machine learning or statistical models have been adopted to estimate price changes in the stock market. One of the major challenges with traditional machine learning models is the feature extraction process. Indeed, extracting relevant features from data and identifying hidden nonlinear relationships without relying on econometric assumptions and human expertise is extremely complex, which makes deep learning particularly attractive. In this paper, we propose a deep neural network-based approach to predict whether the stock price will increase by 25% in the same quarter of the following year. We also compare our deep learning method against two 'shallow' approaches: random forest and gradient boosted machines. To test the proposed methods, the KIS-VALUE database, consisting of the Korea Composite Stock Price Index (KOSPI) of companies for the period 2007 to 2015, was considered. All the methods yielded satisfactory performance, with the deep neural network achieving an AUC of 0.806.
- Published
- 2020
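The AUC reported in the abstract above has a rank-based interpretation: it is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. A minimal sketch of that computation (the labels and scores below are illustrative, not the paper's data):

```python
def auc_score(labels, scores):
    """Area under the ROC curve via the Mann-Whitney identity:
    fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.5, 0.6, 0.2, 0.8, 0.4]
print(auc_score(labels, scores))  # 8 of 9 pairs ranked correctly -> 0.888...
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why a value such as 0.806 indicates useful but imperfect discrimination.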
20. Classification of Video Traffic : An Evaluation of Video Traffic Classification using Random Forests and Gradient Boosted Trees
- Author
-
Andersson, Ricky and Andersson, Ricky
- Abstract
Traffic classification is important for Internet providers and other organizations to solve some critical network management problems. The most common methods for traffic classification are Deep Packet Inspection (DPI) and port-based classification. These methods are becoming obsolete as more and more traffic is encrypted and applications are starting to use dynamic ports and the ports of other popular applications. An alternative method for traffic classification uses machine learning (ML). This ML method uses statistical features of network traffic flows, which solves the fundamental problems of DPI and port-based classification for encrypted flows. The data used in this study is divided into video and non-video traffic flows, and the goal of the study is to create a model which can classify video flows accurately in real time. Previous studies found tree-based algorithms to work well in classifying network traffic. In this study, random forest and gradient boosted trees are examined and compared, as they are two of the best-performing tree-based classification models. Random forest was found to work best, as its classification speed was significantly faster than that of gradient boosted trees. Over 93% of flows were correctly classified while keeping the random forest model small enough to maintain fast classification speeds.
- Published
- 2017
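Gradient boosted trees, compared against random forest in the thesis above, build their ensemble stage-wise: each new tree is fitted to the residuals of the current prediction, and its output is added with a small learning rate. A toy sketch of that loop using one-feature regression stumps under squared loss (illustrative only; real traffic classifiers use many flow features and deeper trees, and this is not the thesis's implementation):

```python
def fit_stump(xs, residuals):
    """Best single-threshold regression stump on one feature (squared loss)."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue  # a split must leave samples on both sides
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def gbm_fit(xs, ys, rounds=20, lr=0.3):
    """Gradient boosting: each round fits a stump to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    pred = [base] * len(ys)
    for _ in range(rounds):
        resid = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, resid)
        stumps.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: base + lr * sum(s(x) for s in stumps)

xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]  # a step function one stump can fit exactly
model = gbm_fit(xs, ys)
print(round(model(0.5)), round(model(4.5)))  # -> 0 1
```

With squared loss the residual shrinks by a factor of (1 - lr) per round, so 20 rounds at lr=0.3 leave the predictions within a fraction of a percent of the targets; a random forest would instead average many independently grown trees rather than chain them on residuals.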