3,838 results for "High dimensional"
Search Results
2. Bayesian adaptive design for covariate-adaptive historical control information borrowing.
- Author
-
Scheffler, Aaron, Kim, Mi-ok, Jiang, Fei, and Jin, Huaqing
- Subjects
Bayesian ,covariate-adaptive ,high dimensional ,historical sample ,kernel ,Female ,Humans ,Bayes Theorem ,Computer Simulation ,Prospective Studies ,Research Design ,Sample Size ,Clinical Trials as Topic - Abstract
Interest in incorporating historical data in the clinical trial has increased with the rising cost of conducting clinical trials. The intervention arm for the current trial often requires prospective data to assess a novel treatment, and thus borrowing historical control data commensurate in distribution to current control data is motivated in order to increase the allocation ratio to the current intervention arm. Existing historical control borrowing adaptive designs adjust allocation ratios based on the commensurability assessed through study-level summary statistics of the response agnostic of the distributions of the trial subject characteristics in the current and historical trials. This can lead to distributional imbalance of the current trial subject characteristics across the treatment arms as well as between current control data and borrowed historical control data. Such covariate imbalance may threaten the internal validity of the current trial by introducing confounding factors that affect study endpoints. In this article, we propose a Bayesian design which borrows and updates the treatment allocation ratios both covariate-adaptively and commensurate to covariate dependently assessed similarity between the current and historical control data. We employ covariate-dependent discrepancy parameters which are allowed to grow with the sample size and propose a regularized local regression procedure for the estimation of the parameters. The proposed design also permits the current and the historical controls to be similar to varying degree, depending on the subject level characteristics. We evaluate the proposed design extensively under the settings derived from two placebo-controlled randomized trials on vertebral fracture risk in post-menopausal women.
- Published
- 2023
3. Perbaikan Akurasi Random Forest Dengan ANOVA Dan SMOTE Pada Klasifikasi Data Stunting (Improving Random Forest Accuracy with ANOVA and SMOTE for Stunting Data Classification)
- Author
-
Ari Ahmad Dhani, Taghfirul Azhima Yoga Siswa, and Wawan Joko Pranoto
- Subjects
classification, random forest, anova, smote, high dimensional, Information technology, T58.5-58.64, Computer software, QA76.75-76.765 - Abstract
Stunting remains a critical public health issue in Indonesia, particularly in the city of Samarinda, which recorded a prevalence of 25.3% in 2022, the second highest in East Kalimantan Province. Amid the national research priorities for 2020-2024, the use of data mining for stunting classification shows significant potential but still faces challenges in handling high-dimensional data and class imbalance. This study aims to improve the accuracy of stunting classification using the Random Forest (RF) method integrated with ANOVA feature selection and the SMOTE technique for class balancing. The data used in this study were obtained from the Samarinda City Health Office, covering 26 community health centers (Puskesmas) with 21 attributes and a total of 150,466 records. The validation technique used was k=10 cross-validation. The results show an increase in accuracy from 98.83% to 99.77%, a gain of 0.94%, after applying ANOVA feature selection. The features ZS TB/U, ZS BB/U, and BB/U were identified as the most influential. This improvement demonstrates the effectiveness of the integrated methods in handling stunting classification on a complex and imbalanced dataset and is expected to support further health policies and interventions in the region.
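As a rough illustration of the pipeline this abstract describes (ANOVA feature filtering, SMOTE oversampling, and a Random Forest evaluated with 10-fold cross-validation), the sketch below uses scikit-learn and imbalanced-learn on synthetic data; the dataset, the number of retained features, and all hyperparameters are assumptions for illustration, not the study's configuration.

```python
# Sketch of an ANOVA-filtered, SMOTE-balanced Random Forest pipeline,
# evaluated with 10-fold cross-validation as described in the abstract.
# Dataset, feature count (k), and hyperparameters are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # keeps SMOTE inside each CV fold

X, y = make_classification(n_samples=2000, n_features=21, weights=[0.9, 0.1],
                           random_state=0)  # stand-in for the 21-attribute data

pipe = Pipeline([
    ("anova", SelectKBest(score_func=f_classif, k=10)),  # ANOVA F-test filter
    ("smote", SMOTE(random_state=0)),                    # oversample minority class
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print(f"10-fold CV accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Placing SMOTE inside the imbalanced-learn pipeline matters: oversampling is then redone on each training fold only, so the cross-validated accuracy is not inflated by synthetic test samples.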
- Published
- 2024
- Full Text
- View/download PDF
4. Optimized multi correlation-based feature selection in software defect prediction.
- Author
-
Muyassar Rahman, Muhammad Nabil, Nugroho, Radityo Adi, Faisal, Mohammad Reza, Abadi, Friska, and Herteno, Rudy
- Subjects
- *
FEATURE selection , *PARTICLE swarm optimization , *K-nearest neighbor classification , *COMPUTER software - Abstract
In software defect prediction, noisy attributes and high-dimensional data remain a critical challenge. This paper introduces a novel approach known as multi correlation-based feature selection (MCFS), which seeks to address these challenges. MCFS integrates two feature selection techniques, namely correlation-based feature selection (CFS) and correlation matrix-based feature selection (CMFS), intending to reduce data dimensionality and eliminate noisy attributes. To accomplish this, CFS and CMFS are applied independently to filter the datasets, and a weighted average of their outcomes is computed to determine the optimal feature selection. This approach not only reduces data dimensionality but also mitigates the impact of noisy attributes. To further enhance predictive performance, this paper leverages the particle swarm optimization (PSO) algorithm as a feature selection mechanism, specifically targeting improvements in the area under the curve (AUC). The evaluation of the proposed method is conducted on 12 benchmark datasets sourced from the NASA metrics data program (MDP) corpus, renowned for their noisy attributes, high dimensionality, and imbalanced class records. The research findings demonstrate that MCFS outperforms CFS and CMFS, yielding an average AUC value of 0.891, thereby emphasizing its efficacy in advancing classification performance in the context of software defect prediction using k-nearest neighbors (KNN) classification. [ABSTRACT FROM AUTHOR]
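The weighted-averaging step can be illustrated with a simplified sketch: two per-feature correlation scores are blended and the top-ranked features are kept before a KNN/AUC evaluation. The scoring functions below are plain Pearson-based stand-ins for CFS and CMFS, and the blending weight w is an assumed tuning parameter, not the paper's exact formulation.

```python
# Simplified sketch of combining two correlation-based feature scores by a
# weighted average, then evaluating a KNN classifier by AUC. The per-feature
# scores are Pearson-based stand-ins, not the exact CFS/CMFS merit functions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=40, n_informative=8,
                           random_state=0)

def corr_with_target(X, y):
    # |Pearson correlation| between each feature and the class label
    return np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

def corr_matrix_score(X, y):
    # penalize features that are strongly correlated with other features
    relevance = corr_with_target(X, y)
    redundancy = np.abs(np.corrcoef(X, rowvar=False)).mean(axis=0)
    return relevance / (redundancy + 1e-12)

w = 0.5                                   # assumed blending weight
score = w * corr_with_target(X, y) + (1 - w) * corr_matrix_score(X, y)
top = np.argsort(score)[::-1][:10]        # keep the 10 best-ranked features

auc = cross_val_score(KNeighborsClassifier(n_neighbors=5), X[:, top], y,
                      cv=10, scoring="roc_auc").mean()
print(f"Mean AUC with selected features: {auc:.3f}")
```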
- Published
- 2024
- Full Text
- View/download PDF
5. Sparse Clustering Algorithm Based on Multi-Domain Dimensionality Reduction Autoencoder.
- Author
-
Kang, Yu, Liu, Erwei, Zou, Kaichi, Wang, Xiuyun, and Zhang, Huaqing
- Subjects
- *
ALGORITHMS , *HIGH-dimensional model representation , *DATA analysis - Abstract
The key to high-dimensional clustering lies in discovering the intrinsic structures and patterns in data to provide valuable information. However, high-dimensional clustering faces enormous challenges such as dimensionality disaster, increased data sparsity, and reduced reliability of the clustering results. In order to address these issues, we propose a sparse clustering algorithm based on a multi-domain dimensionality reduction model. This method achieves high-dimensional clustering by integrating the sparse reconstruction process and sparse L1 regularization into a deep autoencoder model. A sparse reconstruction module is designed based on the L1 sparse reconstruction of features under different domains to reconstruct the data. The proposed method mainly contributes in two aspects. Firstly, the spatial and frequency domains are combined by taking into account the spatial distribution and frequency characteristics of the data to provide multiple perspectives and choices for data analysis and processing. Then, a neural network-based clustering model with sparsity is conducted by projecting data points onto multi-domains and implementing adaptive regularization penalty terms to the weight matrix. The experimental results demonstrate superior performance of the proposed method in handling clustering problems on high-dimensional datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
6. Low and high dimensional wavelet thresholds for matrix-variate normal distribution.
- Author
-
Karamikabir, H., Sanati, A., and Hamedani, G. G.
- Abstract
The matrix-variate normal distribution is a probability distribution that generalizes the multivariate normal distribution to matrix-valued random variables. In this paper, we introduce a wavelet shrinkage estimator based on Stein’s unbiased risk estimate (SURE) threshold for the matrix-variate normal distribution. We find a new SURE threshold for the soft-thresholding wavelet shrinkage estimator under the reflected normal balanced loss function in low and high dimensional cases. Also, we obtain the restricted wavelet shrinkage estimator based on a non-negative submatrix of the mean matrix. Finally, we present a simulation study to test the validity of the wavelet shrinkage estimator and two real examples for low and high dimensional data sets. [ABSTRACT FROM AUTHOR]
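For context, the soft-thresholding rule and the classical SURE criterion it builds on (for unit-variance Gaussian noise) can be sketched as follows. This is the standard Donoho-Johnstone SureShrink construction, not the reflected-normal balanced-loss threshold derived in the paper.

```python
# Soft thresholding and the classical SURE criterion for a vector of
# approximately N(mu_i, 1) observations (the Donoho-Johnstone construction);
# the paper's matrix-variate / balanced-loss threshold refines this idea.
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sure(x, t):
    # SURE(t) = n - 2*#{|x_i| <= t} + sum(min(|x_i|, t)^2), unit noise variance
    n = x.size
    return n - 2 * np.sum(np.abs(x) <= t) + np.sum(np.minimum(np.abs(x), t) ** 2)

rng = np.random.default_rng(0)
mu = np.concatenate([rng.normal(0, 3, 20), np.zeros(180)])  # sparse mean vector
x = mu + rng.normal(size=mu.size)                           # noisy observations

grid = np.linspace(0.0, np.sqrt(2 * np.log(x.size)), 200)
t_star = grid[np.argmin([sure(x, t) for t in grid])]        # SURE-optimal threshold
mu_hat = soft_threshold(x, t_star)
print(f"chosen threshold: {t_star:.3f}, "
      f"risk proxy: {np.mean((mu_hat - mu) ** 2):.3f}")
```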
- Published
- 2024
- Full Text
- View/download PDF
7. High-Dimensional Ensemble Learning Classification: An Ensemble Learning Classification Algorithm Based on High-Dimensional Feature Space Reconstruction.
- Author
-
Zhao, Miao and Ye, Ning
- Subjects
MACHINE learning ,CLASSIFICATION algorithms ,FEATURE selection ,NAIVE Bayes classification ,HIGH-dimensional model representation ,CLASSIFICATION ,ALGORITHMS ,PROBLEM solving - Abstract
When performing classification tasks on high-dimensional data, traditional machine learning algorithms often fail to filter out valid information in the features adequately, leading to low levels of classification accuracy. Therefore, this paper explores high-dimensional data from both the data feature dimension and the model ensemble dimension. We propose a high-dimensional ensemble learning classification algorithm focusing on feature space reconstruction and classifier ensemble, called the HDELC algorithm. First, the algorithm considers feature space reconstruction and then generates a feature space reconstruction matrix. It effectively achieves feature selection and reconstruction for high-dimensional data. An optimal feature space is generated for the subsequent ensemble of the classifier, which enhances the representativeness of the feature space. Second, we recursively determine the number of classifiers and the number of feature subspaces in the ensemble model. Different classifiers in the ensemble system are assigned mutually exclusive non-intersecting feature subspaces for model training. The experimental results show that the HDELC algorithm has advantages on most high-dimensional datasets compared with other methods, owing to its more efficient feature space ensemble capability and relatively reliable ensemble operation performance. The HDELC algorithm makes it possible to solve the classification problem for high-dimensional data effectively and has vital research and application value. [ABSTRACT FROM AUTHOR]
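The second stage, assigning classifiers mutually exclusive feature subspaces and combining their votes, can be illustrated with a small sketch. The random, equal-sized partition and the decision-tree base learner are simplifying assumptions, not HDELC's feature-space reconstruction procedure.

```python
# Sketch of an ensemble in which each base classifier is trained on a
# mutually exclusive (non-overlapping) feature subspace and predictions are
# combined by majority vote. The random equal-sized partition stands in for
# HDELC's feature-space reconstruction step.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=60, n_informative=15,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

n_subspaces = 5
rng = np.random.default_rng(0)
perm = rng.permutation(X.shape[1])
subspaces = np.array_split(perm, n_subspaces)   # disjoint feature blocks

models = []
for cols in subspaces:
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr[:, cols], y_tr)
    models.append((cols, clf))

# majority vote across the subspace-specific classifiers
votes = np.stack([clf.predict(X_te[:, cols]) for cols, clf in models])
y_hat = (votes.mean(axis=0) >= 0.5).astype(int)
print(f"ensemble accuracy: {(y_hat == y_te).mean():.3f}")
```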
- Published
- 2024
- Full Text
- View/download PDF
8. A Performance Analysis of Prediction Techniques in Handling High-Dimensional Uncertain Data for the Application of Skyline Query Over Data Stream
- Author
-
Mudathir Ahmed Mohamud, Hamidah Ibrahim, Fatimah Sidi, Siti Nurulain Mohd Rum, Zarina Binti Dzolkhifli, and Zhang Xiaowei
- Subjects
Prediction techniques ,uncertain data ,high dimensional ,skyline query ,data stream ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
The proliferation of high-dimensional data in many advanced database applications is a result of today’s technological advancements. These data points, which correspond to objects, often lack a precise description, which makes their representation uncertain. While the concept of data streaming is not new, its practical uses are only recently emerging. This research focuses on continuous range data—a type of uncertain data common in database applications—that do not have explicit representations of their exact values. Furthermore, the identification of skyline objects—one of the popular database applications—becomes more challenging when skylines are to be identified from a collection of continuously generated input data streams where objects might have imprecise values. This makes it imperative to determine which approach has the optimal accuracy for estimating or predicting the uncertain values while being able to handle massive streams of data that are continuously generated and analyze them almost instantly to provide accurate and timely responses. Given this, the following techniques are selected: Linear Regression (LR), k-Nearest Neighbour (k-NN), Random Forest (RF), Decision Trees (DT), and the Centre and Range Method (CRM). Their effectiveness is evaluated in terms of execution time, precision, recall, F1-score, and root mean square error (RMSE). Additionally, in order to verify the accuracy of each prediction technique, the predicted data derived from its model is used to derive skyline objects, which are subsequently compared to the actual skyline results. An inaccurate prediction of a continuous range value would result in an incorrect set of skyline objects.
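Independently of the prediction step, the skyline itself is just the set of non-dominated points. A brief sketch of that dominance test is given below; it assumes that smaller values are preferred in every dimension, which is a convention chosen for illustration only.

```python
# Minimal skyline (Pareto-dominance) computation: an object is in the skyline
# if no other object is at least as good in every dimension and strictly
# better in at least one. Here "smaller is better" is assumed for all dimensions.
import numpy as np

def skyline(points):
    pts = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(pts):
        dominated = np.any(
            np.all(pts <= p, axis=1) & np.any(pts < p, axis=1)
        )
        if not dominated:
            keep.append(i)
    return keep

rng = np.random.default_rng(0)
data = rng.uniform(size=(200, 3))          # e.g. predicted centres of range data
print("skyline indices:", skyline(data))
```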
- Published
- 2024
- Full Text
- View/download PDF
9. A new binary object-oriented programming optimization algorithm for solving high-dimensional feature selection problem
- Author
-
Asmaa M. Khalid, Wael Said, Mahmoud Elmezain, and Khalid M. Hosny
- Subjects
OOPOA ,Feature selection ,High dimensional ,Exploration ,Convergence ,Classifier ,Engineering (General). Civil engineering (General) ,TA1-2040 - Abstract
Feature selection (FS) is a crucial task in machine learning applications, which aims to select the most appropriate feature subset while maintaining high classification accuracy with the minimum number of selected features. Despite the widespread usage of metaheuristics as wrapper-based FS techniques, they show reduced effectiveness and increased computational cost when applied to high-dimensional datasets. This paper presents a novel Binary Object-Oriented Programming Optimization Algorithm (BOOPOA) for FS of high dimensional datasets, where the Object-Oriented Programming Optimization Algorithm (OOPOA) is a novel optimization technique inspired by the inheritance concept of Object-Oriented programming (OOP) languages. The effectiveness of this method in solving high dimensional FS problems is validated by using 26 datasets, most of which are of high dimension (large number of features). Seven existing FS algorithms are compared with the proposed OOPOA using various metrics, including best fitness, average fitness (AVG), selection size, and computational time. The results prove the superiority of the proposed algorithm over the other FS algorithms, having an average performance of 92.5%, 0.078, 0.084, 38.9%, and 8.6 min for classification accuracy, best fitness, average fitness, size reduction ratio, and computational time. The outcomes demonstrate the proposed FS approach's superiority over currently used methods.
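The wrapper objective that such binary metaheuristics optimize is usually a weighted combination of classification error and selected-feature ratio. The sketch below shows only that evaluation step, with a KNN classifier and an assumed weight alpha; it does not reproduce BOOPOA's OOP-inspired search operators.

```python
# Sketch of the fitness function a binary wrapper feature-selection
# metaheuristic typically minimizes: a weighted sum of classification error
# and the fraction of features selected. alpha is an assumed trade-off weight.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           random_state=0)

def fitness(mask, X, y, alpha=0.99):
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():                      # empty subsets are invalid
        return 1.0
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=5),
                          X[:, mask], y, cv=5).mean()
    size_ratio = mask.sum() / mask.size
    return alpha * (1.0 - acc) + (1.0 - alpha) * size_ratio

rng = np.random.default_rng(0)
candidate = rng.random(X.shape[1]) < 0.3    # one random binary solution
print(f"fitness of candidate subset: {fitness(candidate, X, y):.4f}")
```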
- Published
- 2023
- Full Text
- View/download PDF
10. Two-sample testing with local community depth
- Author
-
Evans, Ciaran and Berenhaut, Kenneth S.
- Published
- 2024
- Full Text
- View/download PDF
11. Bayesian adaptive design for covariate‐adaptive historical control information borrowing.
- Author
-
Jin, Huaqing, Kim, Mi‐Ok, Scheffler, Aaron, and Jiang, Fei
- Subjects
- *
INFORMATION resources management , *VERTEBRAL fractures , *CURRENT distribution , *POSTMENOPAUSE , *DATA distribution - Abstract
Interest in incorporating historical data in the clinical trial has increased with the rising cost of conducting clinical trials. The intervention arm for the current trial often requires prospective data to assess a novel treatment, and thus borrowing historical control data commensurate in distribution to current control data is motivated in order to increase the allocation ratio to the current intervention arm. Existing historical control borrowing adaptive designs adjust allocation ratios based on the commensurability assessed through study‐level summary statistics of the response agnostic of the distributions of the trial subject characteristics in the current and historical trials. This can lead to distributional imbalance of the current trial subject characteristics across the treatment arms as well as between current control data and borrowed historical control data. Such covariate imbalance may threaten the internal validity of the current trial by introducing confounding factors that affect study endpoints. In this article, we propose a Bayesian design which borrows and updates the treatment allocation ratios both covariate‐adaptively and commensurate to covariate dependently assessed similarity between the current and historical control data. We employ covariate‐dependent discrepancy parameters which are allowed to grow with the sample size and propose a regularized local regression procedure for the estimation of the parameters. The proposed design also permits the current and the historical controls to be similar to varying degree, depending on the subject level characteristics. We evaluate the proposed design extensively under the settings derived from two placebo‐controlled randomized trials on vertebral fracture risk in post‐menopausal women. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
12. A new binary object-oriented programming optimization algorithm for solving high-dimensional feature selection problem.
- Author
-
Khalid, Asmaa M., Said, Wael, Elmezain, Mahmoud, and Hosny, Khalid M.
- Subjects
OPTIMIZATION algorithms ,OBJECT-oriented programming ,FEATURE selection ,MATHEMATICAL optimization ,METAHEURISTIC algorithms ,MACHINE learning - Abstract
Feature selection (FS) is a crucial task in machine learning applications, which aims to select the most appropriate feature subset while maintaining high classification accuracy with the minimum number of selected features. Despite the widespread usage of metaheuristics as wrapper-based FS techniques, they show reduced effectiveness and increased computational cost when applied to high-dimensional datasets. This paper presents a novel Binary Object-Oriented Programming Optimization Algorithm (BOOPOA) for FS of high dimensional datasets, where the Object-Oriented Programming Optimization Algorithm (OOPOA) is a novel optimization technique inspired by the inheritance concept of Object-Oriented programming (OOP) languages. The effectiveness of this method in solving high dimensional FS problems is validated by using 26 datasets, most of which are of high dimension (large number of features). Seven existing FS algorithms are compared with the proposed OOPOA using various metrics, including best fitness, average fitness (AVG), selection size, and computational time. The results prove the superiority of the proposed algorithm over the other FS algorithms, having an average performance of 92.5%, 0.078, 0.084, 38.9%, and 8.6 min for classification accuracy, best fitness, average fitness, size reduction ratio, and computational time. The outcomes demonstrate the proposed FS approach's superiority over currently used methods. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
13. Adaptive Conditional Distribution Estimation with Bayesian Decision Tree Ensembles.
- Author
-
Li, Yinpu, Linero, Antonio R., and Murray, Jared
- Subjects
- *
BODY mass index , *REGRESSION trees , *DATA augmentation , *DECISION trees , *RANDOM forest algorithms - Abstract
We present a Bayesian nonparametric model for conditional distribution estimation using Bayesian additive regression trees (BART). The generative model we use is based on rejection sampling from a base model. Like other BART models, our model is flexible, has a default prior specification, and is computationally convenient. To address the distinguished role of the response in our BART model, we introduce an approach to targeted smoothing of BART models which is of independent interest. We study the proposed model theoretically and provide sufficient conditions for the posterior distribution to concentrate at close to the minimax optimal rate adaptively over smoothness classes in the high-dimensional regime in which many predictors are irrelevant. To fit our model, we propose a data augmentation algorithm which allows for existing BART samplers to be extended with minimal effort. We illustrate the performance of our methodology on simulated data and use it to study the relationship between education and body mass index using data from the medical expenditure panel survey (MEPS). for this article are available online. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
14. Sparse Clustering Algorithm Based on Multi-Domain Dimensionality Reduction Autoencoder
- Author
-
Yu Kang, Erwei Liu, Kaichi Zou, Xiuyun Wang, and Huaqing Zhang
- Subjects
high dimensional ,clustering ,multi-domain ,sparsity ,regularization ,Mathematics ,QA1-939 - Abstract
The key to high-dimensional clustering lies in discovering the intrinsic structures and patterns in data to provide valuable information. However, high-dimensional clustering faces enormous challenges such as dimensionality disaster, increased data sparsity, and reduced reliability of the clustering results. In order to address these issues, we propose a sparse clustering algorithm based on a multi-domain dimensionality reduction model. This method achieves high-dimensional clustering by integrating the sparse reconstruction process and sparse L1 regularization into a deep autoencoder model. A sparse reconstruction module is designed based on the L1 sparse reconstruction of features under different domains to reconstruct the data. The proposed method mainly contributes in two aspects. Firstly, the spatial and frequency domains are combined by taking into account the spatial distribution and frequency characteristics of the data to provide multiple perspectives and choices for data analysis and processing. Then, a neural network-based clustering model with sparsity is conducted by projecting data points onto multi-domains and implementing adaptive regularization penalty terms to the weight matrix. The experimental results demonstrate superior performance of the proposed method in handling clustering problems on high-dimensional datasets.
- Published
- 2024
- Full Text
- View/download PDF
15. 基于多元竞争淘汰的自然计算方法 (A Natural Computation Method Based on Multiple Competitive Elimination).
- Author
-
胡建暄, 马宁, 付伟, 季伟东, 刁衣非, 刘聪, and 黄鑫宇
- Subjects
- *
PARTICLE swarm optimization , *OPTIMIZATION algorithms , *SWARM intelligence , *TIME complexity , *GENETIC algorithms , *PROBLEM solving - Abstract
In natural computation methods, solving optimization problems over high-dimensional data requires increasing the population size to obtain higher accuracy, which in turn raises the time complexity; if the population size is reduced instead, the algorithm falls into local optima because population diversity is lost. To address the difficulty of balancing population size, slow convergence, and the tendency to get trapped in local optima, a natural computation method based on a Multiple Competitive Elimination (MCE) strategy is proposed. It is applicable to a wide range of optimization algorithms, does not depend on the specific steps of an algorithm's evolution, and is therefore general. First, the original solution space is divided into two types of large spaces with a competitive relationship, and each large space is decomposed into N-dimensional small spaces. Then, two different elimination methods, opposition-based (reverse) learning and mixed mutation, are carried out in the two types of large spaces to eliminate poor individuals. Finally, some of the better individuals in the N-dimensional small spaces are selected for competitive exchange across the two types of large spaces to maintain the diversity of the whole population, thereby improving the convergence speed and accuracy of the algorithm. The proposed strategy is applied to particle swarm optimization and a genetic algorithm, and it is compared with standard particle swarm optimization, the genetic algorithm, and current advanced improved swarm intelligence optimization algorithms; its performance is verified on high-dimensional classical test functions. The experimental results show that the algorithms improved with multiple competitive elimination have better optimization ability than the comparison algorithms and that the strategy is broadly applicable. [ABSTRACT FROM AUTHOR]
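Two of the elimination operators mentioned in the abstract, opposition (reverse) learning and mutation of poor individuals, have simple canonical forms. The sketch below shows them for a bounded real-valued population with a toy objective, independent of any particular base algorithm (PSO or GA) and without the paper's space-partitioning scheme.

```python
# Canonical forms of two elimination operators: opposition-based (reverse)
# learning and random mutation of the worst individuals in a bounded,
# real-valued population. The sphere function is a toy minimization objective.
import numpy as np

rng = np.random.default_rng(0)
lb, ub, dim, pop_size = -5.0, 5.0, 10, 20
pop = rng.uniform(lb, ub, size=(pop_size, dim))

def sphere(x):                      # simple test objective (minimization)
    return np.sum(x ** 2, axis=-1)

def opposition(pop, lb, ub):
    # reverse-learning candidate of each individual
    return lb + ub - pop

def mutate_worst(pop, fitness, frac=0.3, scale=0.5):
    # re-draw a perturbation for the worst-performing fraction of individuals
    k = max(1, int(frac * len(pop)))
    worst = np.argsort(fitness)[-k:]
    pop = pop.copy()
    pop[worst] += rng.normal(scale=scale, size=(k, pop.shape[1]))
    return np.clip(pop, lb, ub)

fit = sphere(pop)
opp = opposition(pop, lb, ub)
# keep whichever of (individual, its opposite) is better, then mutate the worst
better = np.where(sphere(opp)[:, None] < fit[:, None], opp, pop)
pop = mutate_worst(better, sphere(better))
print(f"best objective after one elimination step: {sphere(pop).min():.4f}")
```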
- Published
- 2023
- Full Text
- View/download PDF
16. Particle Filtering and Gaussian Mixtures - On a Localized Mixture Coefficients Particle Filter (LMCPF) for Global NWP.
- Author
-
ROJAHN, Anne, SCHENK, Nora, VAN LEEUWEN, Peter Jan, and POTTHAST, Roland
- Subjects
- *
FILTERS & filtration , *NUMERICAL weather forecasting , *ADAPTIVE filters , *MIXTURES - Abstract
In a global numerical weather prediction (NWP) modeling framework we study the implementation of Gaussian uncertainty of individual particles into the assimilation step of a localized adaptive particle filter (LAPF). We obtain a local representation of the prior distribution as a mixture of basis functions. In the assimilation step, the filter calculates the individual weight coefficients and new particle locations. It can be viewed as a combination of the LAPF and a localized version of a Gaussian mixture filter, i.e., a "Localized Mixture Coefficients Particle Filter (LMCPF)". Here, we investigate the feasibility of the LMCPF within a global operational framework and evaluate the relationship between prior and posterior distributions and observations. Our simulations are carried out in a standard pre-operational experimental set-up with the full global observing system, 52 km global resolution and 10⁶ model variables. Statistics of particle movement in the assimilation step are calculated. The mixture approach is able to deal with the discrepancy between prior distributions and observation location in a real-world framework and to pull the particles towards the observations in a much better way than the pure LAPF. This shows that using Gaussian uncertainty can be an important tool to improve the analysis and forecast quality in a particle filter framework. [ABSTRACT FROM AUTHOR]
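For readers unfamiliar with the particle-filter baseline that the LMCPF extends, a generic (non-localized) bootstrap assimilation step, weighting particles by a Gaussian observation likelihood and resampling, looks roughly like the sketch below. The localization and Gaussian-mixture machinery of the LMCPF are not represented; all numbers are illustrative.

```python
# Generic bootstrap particle-filter assimilation step: weight each particle
# by a Gaussian observation likelihood and resample. This is the baseline the
# LMCPF refines with localization and Gaussian mixtures; none of that is shown.
import numpy as np

rng = np.random.default_rng(0)
n_particles, obs_err = 50, 0.5

particles = rng.normal(loc=1.0, scale=1.5, size=n_particles)   # prior ensemble
y_obs = 2.0                                                     # one observation

# Gaussian likelihood weights (log-space for numerical stability)
log_w = -0.5 * ((y_obs - particles) / obs_err) ** 2
weights = np.exp(log_w - log_w.max())
weights /= weights.sum()

# systematic resampling pulls particles toward the observation
positions = (rng.random() + np.arange(n_particles)) / n_particles
idx = np.minimum(np.searchsorted(np.cumsum(weights), positions),
                 n_particles - 1)
posterior = particles[idx]
print(f"prior mean {particles.mean():.2f} -> posterior mean {posterior.mean():.2f}")
```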
- Published
- 2023
- Full Text
- View/download PDF
17. High-Dimensional Ensemble Learning Classification: An Ensemble Learning Classification Algorithm Based on High-Dimensional Feature Space Reconstruction
- Author
-
Miao Zhao and Ning Ye
- Subjects
classification ensemble ,feature selection ,high dimensional ,space reconstruction ,ensemble learning ,Technology ,Engineering (General). Civil engineering (General) ,TA1-2040 ,Biology (General) ,QH301-705.5 ,Physics ,QC1-999 ,Chemistry ,QD1-999 - Abstract
When performing classification tasks on high-dimensional data, traditional machine learning algorithms often fail to filter out valid information in the features adequately, leading to low levels of classification accuracy. Therefore, this paper explores high-dimensional data from both the data feature dimension and the model ensemble dimension. We propose a high-dimensional ensemble learning classification algorithm focusing on feature space reconstruction and classifier ensemble, called the HDELC algorithm. First, the algorithm considers feature space reconstruction and then generates a feature space reconstruction matrix. It effectively achieves feature selection and reconstruction for high-dimensional data. An optimal feature space is generated for the subsequent ensemble of the classifier, which enhances the representativeness of the feature space. Second, we recursively determine the number of classifiers and the number of feature subspaces in the ensemble model. Different classifiers in the ensemble system are assigned mutually exclusive non-intersecting feature subspaces for model training. The experimental results show that the HDELC algorithm has advantages on most high-dimensional datasets compared with other methods, owing to its more efficient feature space ensemble capability and relatively reliable ensemble operation performance. The HDELC algorithm makes it possible to solve the classification problem for high-dimensional data effectively and has vital research and application value.
- Published
- 2024
- Full Text
- View/download PDF
18. Large-scale gene expression data clustering through incremental ensemble approach
- Author
-
Imran Khan, Abdul Khalique Shaikh, and Naresh Adhikari
- Subjects
ensemble clustering ,gene expression ,high dimensional ,IECG ,Computer engineering. Computer hardware ,TK7885-7895 ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
DNA microarray technology monitors gene activity in real-time in living organisms. It creates a large amount of data that helps scientists learn about how genes work. Clustering this data helps understand gene interactions and uncover important biological processes. However, the traditional clustering techniques have difficulties due to the enormous dimensionality of gene expression data and the intricacy of biological networks. Although ensemble clustering is a viable strategy, such high-dimensional data may not lend itself well to traditional approaches. This study introduces a novel technique for gene expression data clustering called incremental ensemble clustering for gene expression data (IECG). There are two steps in the IECG. A technique for grouping gene expression data into windows is presented in the first step, producing a tree of clusters. This procedure is carried out again for succeeding windows that have distinct feature sets. The base clusterings of two consecutive windows are ensembled using a new goal function to form a new clustering solution. By repeating this step-by-step method for further windows, reliable patterns that are beneficial for medical applications can be extracted. The results from both biological and non-biological data demonstrate that the proposed algorithm outperformed the state-of-the-art algorithms. Additionally, the running time of the proposed algorithm has been examined.
- Published
- 2024
- Full Text
- View/download PDF
19. High‐dimensional feature selection in competing risks modeling: A stable approach using a split‐and‐merge ensemble algorithm.
- Author
-
Sun, Han and Wang, Xiaofeng
- Abstract
Variable selection is critical in competing risks regression with high‐dimensional data. Although penalized variable selection methods and other machine learning‐based approaches have been developed, many of these methods often suffer from instability in practice. This paper proposes a novel method named Random Approximate Elastic Net (RAEN). Under the proportional subdistribution hazards model, RAEN provides a stable and generalizable solution to the large‐p‐small‐n variable selection problem for competing risks data. Our general framework allows the proposed algorithm to be applicable to other time‐to‐event regression models, including competing risks quantile regression and accelerated failure time models. We show that variable selection and parameter estimation improved markedly using the new computationally intensive algorithm through extensive simulations. A user‐friendly R package RAEN is developed for public use. We also apply our method to a cancer study to identify influential genes associated with the death or progression from bladder cancer. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
20. Escaping The Curse of Dimensionality in Bayesian Model-Based Clustering.
- Author
-
Chandra, Noirrit Kiran, Canale, Antonio, and Dunson, David B.
- Subjects
- *
LATENT variables , *SAMPLE size (Statistics) , *PARTITION functions , *PERFORMANCE theory , *PROBABILITY theory - Abstract
Bayesian mixture models are widely used for clustering of high-dimensional data with appropriate uncertainty quantification. However, as the dimension of the observations increases, posterior inference often tends to favor too many or too few clusters. This article explains this behavior by studying the random partition posterior in a non-standard setting with a fixed sample size and increasing data dimensionality. We provide conditions under which the finite sample posterior tends to either assign every observation to a different cluster or all observations to the same cluster as the dimension grows. Interestingly, the conditions do not depend on the choice of clustering prior, as long as all possible partitions of observations into clusters have positive prior probabilities, and hold irrespective of the true data-generating model. We then propose a class of latent mixtures for Bayesian clustering (Lamb) on a set of low-dimensional latent variables inducing a partition on the observed data. The model is amenable to scalable posterior inference and we show that it can avoid the pitfalls of high-dimensionality under mild assumptions. The proposed approach is shown to have good performance in simulation studies and an application to inferring cell types based on scRNAseq. [ABSTRACT FROM AUTHOR]
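A rough, non-Bayesian analogue of the idea, clustering on a low-dimensional latent representation rather than on the raw high-dimensional data, can be sketched with a PCA projection followed by a Gaussian mixture. This is only a loose illustration of the motivation; the Lamb model instead places a full Bayesian mixture on latent variables and performs scalable posterior inference.

```python
# Loose, non-Bayesian analogue of clustering through a low-dimensional latent
# space: project high-dimensional observations with PCA, then fit a Gaussian
# mixture on the latent scores. The Lamb model replaces this two-step recipe
# with a Bayesian mixture on latent variables.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# 3 true clusters living near a low-dimensional subspace of a 500-dim space
Z, labels = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X = Z @ rng.normal(size=(5, 500)) + 0.5 * rng.normal(size=(300, 500))

latent = PCA(n_components=5, random_state=0).fit_transform(X)
pred = GaussianMixture(n_components=3, random_state=0).fit_predict(latent)
print(f"adjusted Rand index vs. truth: {adjusted_rand_score(labels, pred):.3f}")
```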
- Published
- 2023
21. Feature selection in high‐dimensional microarray cancer datasets using an improved equilibrium optimization approach.
- Author
-
Balakrishnan, Kulanthaivel and Dhanalakshmi, Ramasamy
- Subjects
FEATURE selection ,EARLY diagnosis ,MACHINE learning ,EQUILIBRIUM ,MATHEMATICAL optimization ,SOURCE code - Abstract
Summary: Optimal feature selection for high-dimensional micro-array datasets has gained significant importance in medical applications for early detection and prevention of disease. Traditional optimal feature selection combines a population-based meta-heuristic optimization technique, a machine learning classifier, and a wrapper method for transforming the original feature set into a better feature set. These techniques require a large number of iterations for random solutions to converge to the global optimum and suffer from high-dimensionality issues such as over-fitting, memory constraints, high computational cost, and low accuracy. In this article, an efficient equilibrium optimization technique is proposed for optimized feature selection; it increases the diversity of the population in the search space through Random Opposition based learning and classifies the best features using a 10-fold cross-validation-based wrapper method. The proposed method is tested on six standard micro-array datasets and compared with conventional algorithms such as the Marine Predators Algorithm, Harris Hawks Optimization, the Whale Optimization Algorithm, and conventional Equilibrium Optimization. From the statistical results using standard metrics, it is interpreted that the proposed method converges to the global minimum in a few iterations through optimized feature selection, fitness value, and higher classification accuracy. This proves its efficacy in exploring and finding a better solution as compared to the counterpart algorithms. In addition to complexity analysis, these results indicate a global optimum solution, an effective representation with the least amount of data (high dimensionality reduction), and an avoidance of over-fitting problems. The source code is available at https://github.com/balasv/ROBL‐EOA/blob/main/ROBL_EOA.ipynb [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
22. Bayesian Adaptive Selection Under Prior Ignorance
- Author
-
Basu, Tathagata, Troffaes, Matthias C. M., Einbeck, Jochen, Vasile, Massimiliano, editor, and Quagliarella, Domenico, editor
- Published
- 2021
- Full Text
- View/download PDF
23. Modeling credit risk with a multi‐stage hybrid model: An alternative statistical approach.
- Author
-
Uddin, Mohammad Shamsu, Chi, Guotai, Al Janabi, Mazin A. M., Habib, Tabassum, and Yuan, Kunpeng
- Subjects
CREDIT risk ,STATISTICAL models ,AGRICULTURAL credit ,SMALL business ,BLENDED learning ,MULTISCALE modeling ,NAIVE Bayes classification - Abstract
This paper examines the impact of hybridizations on the classification performances of sophisticated machine learning classifiers such as gradient boosting (GB, TreeNet®) and random forest (RF) using multi‐stage hybrid models. The empirical findings confirm that, overall, hybrid model GB (X*Di; ŶDi, LR), which consists of TreeNet® combined with logistic regression along with a new dependent variable, offers significantly superior accuracy compared to the baselines and other hybrid classifiers. However, the performances of hybrid classifiers are not consistent across all types of datasets. For low‐dimensional data, the constructed models consistently outperform the base classifiers; however, on high dimensional data, the classification outcomes provide little evidence of improvement and in certain cases, they underperform the baseline models. These findings have relevance for the analysis of high‐ and low‐dimensional credit risk, small and medium enterprises, agricultural credits, and so on. Furthermore, the example credit risk scenario and its outcomes provide an alternative path for hybrid and machine learning approaches to be applied to more general applications in accounting and finance fields. [ABSTRACT FROM AUTHOR]
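One common way to build such a two-stage hybrid is to feed a boosted model's out-of-fold score into a logistic regression alongside the original covariates. The sketch below shows that generic stacking pattern with scikit-learn on synthetic data; it only approximates the spirit of the GB (X*Di; ŶDi, LR) construction described above, not its exact definition.

```python
# Generic two-stage hybrid: a gradient-boosting score is produced out-of-fold
# and then used as an extra covariate in a logistic regression. This mimics
# the spirit of the GB + LR hybrid described above, not its exact construction.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_classification(n_samples=3000, n_features=20, weights=[0.85, 0.15],
                           random_state=0)  # stand-in for a credit-risk dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(random_state=0)
# out-of-fold GB score avoids leaking training labels into stage two
gb_score_tr = cross_val_predict(gb, X_tr, y_tr, cv=5,
                                method="predict_proba")[:, 1]
gb.fit(X_tr, y_tr)
gb_score_te = gb.predict_proba(X_te)[:, 1]

lr = LogisticRegression(max_iter=1000)
lr.fit(np.column_stack([X_tr, gb_score_tr]), y_tr)
pred = lr.predict_proba(np.column_stack([X_te, gb_score_te]))[:, 1]
print(f"hybrid AUC: {roc_auc_score(y_te, pred):.3f}")
```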
- Published
- 2022
- Full Text
- View/download PDF
24. Novel PE and APC tandems: Additional near‐infrared fluorochromes for use in spectral flow cytometry.
- Author
-
Seong, Yekyung, Nguyen, Denny X, Wu, Yian, Thakur, Archana, Harding, Fiona, and Nguyen, Tuan Andrew
- Abstract
Recent advances in flow cytometry instrumentation and fluorochrome chemistries have greatly increased fluorescent conjugated antibody combinations that can be used reliably and easily in routine experiments. The Cytek Aurora flow cytometer was first released with three excitation lasers (405, 488, and 640 nm) and incorporated the latest Avalanche Photodiode (APD) technology, demonstrating significant improvement in sensitivity for fluorescent emission signals longer than 800 nm. However, there are limited commercially available fluorochromes capable of excitation with peak emission signals beyond 800 nm. To address this gap, we engineered six new fluorochromes: PE‐750, PE‐800, PE‐830 for the 488 nm laser and APC‐750, APC‐800, APC‐830 for the 640 nm laser. Utilizing the principal of fluorescence resonance energy transfer (FRET), these novel structures were created by covalently linking a protein donor dye with an organic small molecule acceptor dye. Additionally, each of these fluorochrome conjugates were shown to be compatible with fixation/permeabilization buffer reagents, and demonstrated acceptable brightness and stability when conjugated to antigen‐specific monoclonal antibodies. These six novel fluorochrome reagents can increase the numbers of fluorochromes that can be used on a spectral flow cytometer. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
25. Causal effect estimation in survival analysis with high dimensional confounders.
- Author
-
Jiang F, Zhao G, Rodriguez-Monguio R, and Ma Y
- Subjects
- Humans, Survival Analysis, Confounding Factors, Epidemiologic, Models, Statistical, Biometry methods, Computer Simulation, Data Interpretation, Statistical, Causality, Propensity Score, Lymphoma, Large B-Cell, Diffuse mortality
- Abstract
With the continual advancement of modern technologies, it has become increasingly common that the number of collected confounders exceeds the number of subjects in a data set. However, matching-based methods for estimating causal treatment effects in their original forms are not capable of handling high-dimensional confounders, and their various modified versions lack statistical support and valid inference tools. In this article, we propose a new approach for estimating the causal treatment effect, defined as the difference of the restricted mean survival time (RMST) under different treatments, in the high-dimensional setting for survival data. We combine the factor model and sufficient dimension reduction techniques to construct a propensity score and a prognostic score. Based on these scores, we develop a kernel-based doubly robust estimator of the RMST difference. We demonstrate its link to matching and establish the consistency and asymptotic normality of the estimator. We illustrate our method by analyzing a dataset from a study aimed at comparing the effects of two alternative treatments on the RMST of patients with diffuse large B cell lymphoma. (© The Author(s) 2024. Published by Oxford University Press on behalf of The International Biometric Society.)
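The estimand itself, the restricted mean survival time, is simply the area under the survival curve up to a horizon tau. The sketch below computes an unadjusted RMST difference between two arms from Kaplan-Meier fits with lifelines on simulated data; the paper's doubly robust, high-dimensional confounder-adjusted estimator is considerably more involved.

```python
# Unadjusted restricted mean survival time (RMST): the area under the
# Kaplan-Meier curve up to a horizon tau, compared between two arms.
# Only the estimand's definition is shown, not the paper's doubly robust,
# confounder-adjusted estimator.
import numpy as np
from lifelines import KaplanMeierFitter

def rmst(kmf, tau):
    sf = kmf.survival_function_.iloc[:, 0]
    times = np.asarray(sf.index, dtype=float)
    surv = np.asarray(sf.values, dtype=float)
    if times[0] > 0:                      # make sure the curve starts at S(0)=1
        times, surv = np.r_[0.0, times], np.r_[1.0, surv]
    keep = times <= tau
    t = np.r_[times[keep], tau]
    return float(np.sum(np.diff(t) * surv[keep]))   # step-function integral

rng = np.random.default_rng(0)
n, tau = 300, 5.0
arm = rng.integers(0, 2, n)
event_time = rng.exponential(scale=np.where(arm == 1, 4.0, 3.0))
censor_time = rng.exponential(scale=6.0, size=n)
time = np.minimum(event_time, censor_time)
event = (event_time <= censor_time).astype(int)

fits = [KaplanMeierFitter().fit(time[arm == g], event[arm == g]) for g in (0, 1)]
print(f"RMST difference (arm 1 - arm 0) up to tau={tau}: "
      f"{rmst(fits[1], tau) - rmst(fits[0], tau):.3f}")
```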
- Published
- 2024
- Full Text
- View/download PDF
26. Evaluating the Impact of Sampling-Based Nonlinear Manifold Detection Model on Software Defect Prediction Problem
- Author
-
Ghosh, Soumi, Rana, Ajay, Kansal, Vineet, Howlett, Robert J., Series Editor, Jain, Lakhmi C., Series Editor, Satapathy, Suresh Chandra, editor, Bhateja, Vikrant, editor, Mohanty, J. R., editor, and Udgata, Siba K., editor
- Published
- 2020
- Full Text
- View/download PDF
27. A graph-based gene selection method for medical diagnosis problems using a many-objective PSO algorithm
- Author
-
Saeid Azadifar and Ali Ahmadi
- Subjects
Gene selection ,Dimension reduction ,Many-objective PSO ,Gene clustering ,High dimensional ,Repair operator ,Computer applications to medicine. Medical informatics ,R858-859.7 - Abstract
Background: Gene expression data play an important role in bioinformatics applications. Although there may be a large number of features in such data, they mainly tend to contain only a few samples. This can negatively impact the performance of data mining and machine learning algorithms. One of the most effective approaches to alleviate this problem is to use gene selection methods. The aim of gene selection is to reduce the dimensions (features) of gene expression data, leading to the elimination of irrelevant and redundant genes. Methods: This paper presents a hybrid gene selection method based on graph theory and a many-objective particle swarm optimization (PSO) algorithm. To this end, a filter method is first utilized to reduce the initial space of the genes. Then, the gene space is represented as a graph to apply a graph clustering method to group the genes into several clusters. Moreover, the many-objective PSO algorithm is utilized to search for an optimal subset of genes according to several criteria, which include classification error, node centrality, specificity, edge centrality, and the number of selected genes. A repair operator is proposed to cover the whole space of the genes and ensure that at least one gene is selected from each cluster. This leads to an increase in the diversity of the selected genes. Results: To evaluate the performance of the proposed method, extensive experiments are conducted based on seven datasets and two evaluation measures. In addition, three classifiers—Decision Tree (DT), Support Vector Machine (SVM), and K-Nearest Neighbors (KNN)—are utilized to compare the effectiveness of the proposed gene selection method with other state-of-the-art methods. The results of these experiments demonstrate that our proposed method not only achieves more accurate classification, but also selects fewer genes than other methods. Conclusion: This study shows that the proposed multi-objective PSO algorithm simultaneously removes irrelevant and redundant features using several different criteria. Also, the use of the clustering algorithm and the repair operator has improved the performance of the proposed method by covering the whole space of the problem.
- Published
- 2021
- Full Text
- View/download PDF
28. A robust high dimensional estimation of a finite mixture of the generalized linear model.
- Author
-
Sabbaghi, Azam and Eskandari, Farzad
- Subjects
- *
FINITE, The , *MIXTURES , *SAMPLE size (Statistics) - Abstract
Robust high dimensional estimation is one of the most important problems in statistics. In a high dimensional structure with a small number of non-zero observations, the dimension of the parameters is larger than the sample size. For modeling the sparsity of the outlier response vector, we randomly select a small number of observations and corrupt them arbitrarily. There are two distinct ways to overcome sparsity in the generalized linear model (GLM): in the parameter space or in the output space. According to several studies on corrupted observation modeling, there is a relationship between robustness and sparsity. In this paper, to obtain a robust high dimensional estimation, we propose a finite mixture of generalized linear models (FMGLMs). Using simulation with the expectation-maximization (EM) algorithm, we show improved modeling performance. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
29. AN EFFICIENT HYBRID APPROACH FOR DIAGNOSIS HIGH DIMENSIONAL DATA FOR ALZHEIMER'S DISEASES USING MACHINE LEARNING ALGORITHMS.
- Author
-
Zawawi, Nour, Saber, Heba Gamal, Hashem, Mohamed, and Gharib, Tarek F.
- Subjects
ALZHEIMER'S disease ,MACHINE learning ,DEMENTIA ,MEMORY loss ,RANDOM forest algorithms - Abstract
Alzheimer's disease (AD) is the most familiar type of dementia, a well-known term for memory loss and other cognitive disabilities. The disease is dangerous enough to interfere with ordinary life. Identifying AD in the early stages remains an extremely challenging task, since its progression begins several years before any symptoms are observed. The fundamental issue addressed during diagnosis is the high dimensionality of the data. However, not all features are relevant for solving the problem, and sometimes including irrelevant ones may deteriorate the learning performance. Therefore, it is essential to reduce the feature set by selecting the most relevant features. In this work, a hybrid approach, Random Forest Particle Swarm Optimization (RF-PSO), for high-dimensional feature selection is proposed. The fundamental reason behind this work is to support geriatricians in diagnosing AD by creating a clinically translatable machine learning approach. The dataset created by the Alzheimer's Disease Neuroimaging Initiative (ADNI) was used for this purpose. The ADNI dataset contains 900 patients whose diagnostic follow-up is available for at least three years after the baseline assessment. These algorithms were chosen for their strength in solving large-scale optimization problems with high data dimensionality. The experiments show that RF-PSO outperforms most of the approaches found in the literature, achieving high performance compared to them. The accuracy of this approach reached 95% for all AD stages, compared with 86% for Random Forest alone and 89% for Particle Swarm Optimization alone. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
30. CONDITIONAL MARGINAL TEST FOR HIGH DIMENSIONAL QUANTILE REGRESSION.
- Author
-
Yanlin Tang, Yinfeng Wang, Huixia Judy Wang, and Qing Pan
- Subjects
QUANTILE regression ,GENOME-wide association studies ,SINGLE nucleotide polymorphisms ,TYPE 1 diabetes ,DISTRIBUTION (Probability theory) ,GLOMERULAR filtration rate - Abstract
Analyzing the tail quantiles of a response distribution is sometimes more important than analyzing the mean in biomarker studies. Inferences in a quantile regression are complicated when there exist a large number of candidate markers, together with some prespecified controlled covariates. In this study, we develop a new and simple testing procedure to detect the effects of biomarkers in a high-dimensional quantile regression in the presence of protected covariates. The test is based on the maximum-score-type statistic obtained from a conditional marginal regression. We establish the asymptotic properties of the proposed test statistic under both null and alternative hypotheses and propose an alternative multiplier bootstrap method, with theoretical justifications. We use numerical studies to show that the proposed method provides adequate controls of the family-wise error rate with competitive power, and that it can also be used as a stopping rule in a forward regression. The proposed method is applied to a motivating genome-wide association study to detect single nucleotide polymorphisms associated with low glomerular filtration rates in type 1 diabetes patients. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
31. Two Parallelized Filter Methods for Feature Selection Based on Spark
- Author
-
Marone, Reine Marie Ndéla, Camara, Fodé, Ndiaye, Samba, Kande, Demba, Akan, Ozgur, Series Editor, Bellavista, Paolo, Series Editor, Cao, Jiannong, Series Editor, Coulson, Geoffrey, Series Editor, Dressler, Falko, Series Editor, Ferrari, Domenico, Series Editor, Gerla, Mario, Series Editor, Kobayashi, Hisashi, Series Editor, Palazzo, Sergio, Series Editor, Sahni, Sartaj, Series Editor, Shen, Xuemin (Sherman), Series Editor, Stan, Mircea, Series Editor, Xiaohua, Jia, Series Editor, Zomaya, Albert Y., Series Editor, Zitouni, Rafik, editor, and Agueh, Max, editor
- Published
- 2019
- Full Text
- View/download PDF
32. A Surrogate-Assisted Cooperative Co-evolutionary Algorithm for Solving High Dimensional, Expensive and Black Box Optimization Problems
- Author
-
Blanchard, Julien, Beauthier, Charlotte, Carletti, Timoteo, Rodrigues, H.C., editor, Herskovits, J., editor, Mota Soares, C.M., editor, Araújo, A.L., editor, Guedes, J.M., editor, Folgado, J.O., editor, Moleiro, F., editor, and Madeira, J. F. A., editor
- Published
- 2019
- Full Text
- View/download PDF
33. Variable selection in the Box–Cox power transformation model.
- Author
-
Chen, Baojiang, Qin, Jing, and Yuan, Ao
- Subjects
- *
GENE expression , *REGRESSION analysis , *GENOMICS - Abstract
High dimensional data are frequently collected across research fields such as genomics, health sciences, economics, and social sciences. Recently, variable selection in the high dimensional setting has drawn great attention, with many effective methods developed to reduce the dimensionality of the data. However, most of these methods apply only to normally or near normally distributed outcomes in a linear regression model, while few studies focus on variable selection for skewed data. Simulation studies show that ignoring an appropriate transformation for the outcome can lead to biased inferences (e.g., missing important covariates). In this paper, we develop a variable selection procedure for the Box–Cox power transformation model by developing a penalized maximum likelihood estimate and deriving the consistency, oracle property, and asymptotic distribution of this estimate. Simulation studies demonstrate that the proposed method can yield higher sensitivity, while the naive method without transformation can lead to lower sensitivity. We apply the proposed method to a gene expression study. • Variable selection has drawn great attention across research fields. • Few studies focus on variable selection for skewed data. • We develop a variable selection procedure for the Box–Cox power transformation model for handling skewed data. • The proposed method can yield higher sensitivity than naive methods. [ABSTRACT FROM AUTHOR]
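A naive two-step analogue of the idea, estimating a Box-Cox transformation of the response and then running a penalized regression on the transformed outcome, can be sketched as below. The paper's method estimates the transformation parameter and the penalized coefficients jointly rather than in two separate steps, so this is an illustration of the motivation only.

```python
# Naive two-step analogue of penalized variable selection under a Box-Cox
# transformed response: (1) estimate the Box-Cox lambda for the positive,
# skewed outcome, (2) run a lasso on the transformed outcome. The paper
# estimates the transformation and the coefficients jointly instead.
import numpy as np
from scipy import stats
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 1.0
y = np.exp(X @ beta + 0.5 * rng.normal(size=n))      # positive, right-skewed outcome

y_bc, lam = stats.boxcox(y)                          # step 1: Box-Cox transform
lasso = LassoCV(cv=10, random_state=0).fit(X, y_bc)  # step 2: penalized selection

selected = np.flatnonzero(lasso.coef_ != 0)
print(f"estimated lambda: {lam:.2f}, selected features: {selected.tolist()}")
```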
- Published
- 2022
- Full Text
- View/download PDF
34. Estimating player value in American football using plus–minus models.
- Author
-
Sabin, R. Paul
- Subjects
FOOTBALL ,BASKETBALL players ,FOOTBALL players ,SOCCER players ,SPORTS forecasting ,RECREATIONAL mathematics - Abstract
Calculating the value of a football player's on-field performance has been limited to scouting methods, while data-driven methods are mostly limited to quarterbacks. Popular methods to calculate player value in other sports are Adjusted Plus–Minus (APM) and Regularized Adjusted Plus–Minus (RAPM) models. These models have been used in other sports, most notably basketball (Rosenbaum, D. T. 2004. Measuring How NBA Players Help Their Teams Win. http://www.82games.com/comm30.htm#%5fftn1; Kubatko, J., D. Oliver, K. Pelton, and D. T. Rosenbaum. 2007. "A Starting Point for Analyzing Basketball Statistics." Journal of Quantitative Analysis in Sports 3 (3); Winston, W. 2009. Player and Lineup Analysis in the NBA. Cambridge, Massachusetts; Sill, J. 2010. "Improved NBA Adjusted +/− Using Regularization and Out-Of-Sample Testing." In Proceedings of the 2010 MIT Sloan Sports Analytics Conference) to estimate each player's value by accounting for those in the game at the same time. Football is less amenable to APM models due to its few scoring events, few lineup changes, restrictive positioning, and small quantity of games relative to the number of teams. More recent methods have found ways to incorporate plus–minus models in other sports such as hockey (Macdonald, B. 2011. "A Regression-Based Adjusted Plus-Minus Statistic for NHL players." Journal of Quantitative Analysis in Sports 7 (3)) and soccer (Schultze, S. R., and C.-M. Wellbrock. 2018. "A Weighted Plus/Minus Metric for Individual Soccer Player Performance." Journal of Sports Analytics 4 (2): 121–31 and Matano, F., L. F. Richardson, T. Pospisil, C. Eubanks, and J. Qin (2018). Augmenting Adjusted Plus-Minus in Soccer with Fifa Ratings. arXiv preprint arXiv:1810.08032). These models are useful for producing results-oriented estimates of each player's value. In American football, many positions such as offensive linemen have no recorded statistics, which hinders the ability to estimate a player's value. I provide a fully hierarchical Bayesian plus–minus (HBPM) model framework that extends RAPM to include position-specific penalization, solving many of the shortcomings of APM and RAPM models in American football. Cross-validated results show the HBPM to be more predictive out of sample than RAPM or APM models. Results for the HBPM models are provided for both collegiate and NFL football players, as well as deeper insights into positional value and position-specific age curves. [ABSTRACT FROM AUTHOR]
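The core of an (R)APM model is a ridge regression of a per-play outcome on signed player-indicator columns; a minimal sketch of that design-matrix-plus-ridge step on simulated plays follows. The hierarchical, position-specific priors that distinguish the HBPM are not reproduced, and the ridge penalty and roster sizes are arbitrary illustrative choices.

```python
# Minimal RAPM-style sketch: regress a play outcome on signed player indicator
# columns (+1 if on offense, -1 if on defense) with a ridge penalty.
# Position-specific, hierarchical shrinkage as in the HBPM is not included.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_players, n_plays = 40, 5000
true_value = rng.normal(scale=1.0, size=n_players)    # latent player values

X = np.zeros((n_plays, n_players))
for i in range(n_plays):
    offense = rng.choice(n_players, size=11, replace=False)
    defense = rng.choice(np.setdiff1d(np.arange(n_players), offense),
                         size=11, replace=False)
    X[i, offense], X[i, defense] = 1.0, -1.0

# outcome per play, e.g. expected points added, driven by who is on the field
y = X @ true_value + rng.normal(scale=5.0, size=n_plays)

apm = Ridge(alpha=100.0).fit(X, y)                     # regularized plus-minus
top = np.argsort(apm.coef_)[::-1][:5]
print("top-5 estimated players:", top.tolist())
```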
- Published
- 2021
- Full Text
- View/download PDF
35. A graph-based gene selection method for medical diagnosis problems using a many-objective PSO algorithm.
- Author
-
Azadifar, Saeid and Ahmadi, Ali
- Subjects
- *
DIAGNOSIS , *K-nearest neighbor classification , *PARTICLE swarm optimization , *SUPPORT vector machines , *DIAGNOSIS methods , *ALGORITHMS - Abstract
Background: Gene expression data play an important role in bioinformatics applications. Although there may be a large number of features in such data, they mainly tend to contain only a few samples. This can negatively impact the performance of data mining and machine learning algorithms. One of the most effective approaches to alleviate this problem is to use gene selection methods. The aim of gene selection is to reduce the dimensions (features) of gene expression data, leading to the elimination of irrelevant and redundant genes. Methods: This paper presents a hybrid gene selection method based on graph theory and a many-objective particle swarm optimization (PSO) algorithm. To this end, a filter method is first utilized to reduce the initial space of the genes. Then, the gene space is represented as a graph to apply a graph clustering method to group the genes into several clusters. Moreover, the many-objective PSO algorithm is utilized to search for an optimal subset of genes according to several criteria, which include classification error, node centrality, specificity, edge centrality, and the number of selected genes. A repair operator is proposed to cover the whole space of the genes and ensure that at least one gene is selected from each cluster. This leads to an increase in the diversity of the selected genes. Results: To evaluate the performance of the proposed method, extensive experiments are conducted based on seven datasets and two evaluation measures. In addition, three classifiers, Decision Tree (DT), Support Vector Machine (SVM), and K-Nearest Neighbors (KNN), are utilized to compare the effectiveness of the proposed gene selection method with other state-of-the-art methods. The results of these experiments demonstrate that our proposed method not only achieves more accurate classification, but also selects fewer genes than other methods. Conclusion: This study shows that the proposed multi-objective PSO algorithm simultaneously removes irrelevant and redundant features using several different criteria. Also, the use of the clustering algorithm and the repair operator has improved the performance of the proposed method by covering the whole space of the problem. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
36. Extracting information from textual descriptions for actuarial applications.
- Author
-
Manski, Scott, Yang, Kaixu, Lee, Gee Y., and Maiti, Tapabrata
- Subjects
VECTORS (Calculus) ,MATRICES (Mathematics) ,INSURANCE companies - Abstract
Initial insurance losses are often reported with a textual description of the claim. The claims manager must determine the adequate case reserve for each known claim. In this paper, we present a framework for predicting the amount of loss given a textual description of the claim using a large number of words found in the descriptions. Prior work has focused on classifying insurance claims based on keywords selected by a human expert, whereas in this paper the focus is on loss amount prediction with automatic word selection. In order to transform words into numeric vectors, we use word cosine similarities and word embedding matrices. When we consider all unique words found in the training dataset and impose a generalised additive model to the resulting explanatory variables, the resulting design matrix is high dimensional. For this reason, we use a group lasso penalty to reduce the number of coefficients in the model. The scalable, analytical framework proposed provides for a parsimonious and interpretable model. Finally, we discuss the implications of the analysis, including how the framework may be used by an insurance company and how the interpretation of the covariates can lead to significant policy change. The code can be found in the TAGAM R package (github.com/scottmanski/TAGAM). [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
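A minimal sketch of the overall idea in the entry above: turn claim descriptions into numeric features and fit a sparse regression of the (log) loss amount on the words. The authors' TAGAM package combines word embeddings with a group lasso inside a generalised additive model; here a plain TF-IDF representation and an ordinary lasso stand in for those pieces, and the tiny claim texts and loss amounts are invented.

```python
# Illustrative sketch: predict a loss amount from a claim's textual description.
# TF-IDF + lasso stand in for the paper's word embeddings + group-lasso GAM.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline

descriptions = [                                    # hypothetical claim descriptions
    "water damage in kitchen from burst pipe",
    "rear end collision minor bumper damage",
    "kitchen fire smoke damage to ceiling",
    "hail damage to roof shingles",
]
log_losses = np.log([12000.0, 1800.0, 25000.0, 6000.0])   # invented loss amounts

model = make_pipeline(TfidfVectorizer(), Lasso(alpha=0.01))
model.fit(descriptions, log_losses)

# The sparse coefficients indicate which words drive the predicted case reserve.
vec, lasso = model.named_steps["tfidfvectorizer"], model.named_steps["lasso"]
for word, coef in zip(vec.get_feature_names_out(), lasso.coef_):
    if abs(coef) > 1e-8:
        print(f"{word:>12s}  {coef:+.3f}")
print("predicted log loss:", model.predict(["burst pipe water damage"]))
```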
37. Globaltest confidence regions and their application to ridge regression.
- Author
-
Xu, Ningning, Solari, Aldo, and Goeman, Jelle
- Abstract
We construct confidence regions in high dimensions by inverting the globaltest statistics, and use them to choose the tuning parameter for penalized regression. The selected model corresponds to the point in the confidence region of the parameters that minimizes the penalty, making it the least complex model that still has acceptable fit according to the test that defines the confidence region. As the globaltest is particularly powerful in the presence of many weak predictors, it connects well to ridge regression, and we thus focus on ridge penalties in this paper. The confidence region method is quick to calculate, intuitive, and has good predictive performance. As a tuning parameter selection method, it may even outperform classical methods such as cross-validation in terms of mean squared error of prediction, especially when the signal is weak. We illustrate the method for linear models in a simulation study and for Cox models on real gene expression data from breast cancer samples. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
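A rough sketch of the selection principle described above: among a grid of ridge penalties, keep the most heavily penalized (least complex) fit whose residuals still look acceptable. The actual globaltest-based confidence region is not reproduced; the chi-square check below is a crude stand-in that assumes the noise level is known, and the data are simulated.

```python
# Illustrative "least complex model with acceptable fit" rule for choosing a ridge penalty.
# The acceptance rule is a crude stand-in for the paper's globaltest confidence region.
import numpy as np
from scipy.stats import chi2
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n, p, sigma = 100, 200, 1.0
X = rng.standard_normal((n, p))
beta = 0.05 * rng.standard_normal(p)                 # many weak predictors
y = X @ beta + sigma * rng.standard_normal(n)

accepted = []
for lam in np.logspace(-2, 4, 50):
    rss = np.sum((y - Ridge(alpha=lam).fit(X, y).predict(X)) ** 2)
    # accept the fit if the residual sum of squares is plausible for noise level sigma
    if rss / sigma**2 <= chi2.ppf(0.95, df=n):
        accepted.append(lam)

# the largest accepted penalty gives the least complex model that still fits acceptably
print("chosen ridge penalty:", max(accepted) if accepted else "none accepted")
```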
38. Forecasting Baden‐Württemberg's GDP growth: MIDAS regressions versus dynamic mixed‐frequency factor models.
- Author
-
Kuck, Konstantin and Schweikert, Karsten
- Subjects
FORECASTING ,GROSS domestic product ,INTERNATIONAL trade ,TIME series analysis ,ECONOMIC forecasting ,ECONOMIC expansion - Abstract
Germany's economic composition is heterogeneous across regions, which makes regional economic projections based on German gross domestic product (GDP) growth unreliable. In this paper, we develop forecasting models for Baden-Württemberg's economic growth, a regional economy that is dominated by small- and medium-sized enterprises with a strong focus on foreign trade. For this purpose, we evaluate the backcasting and nowcasting performance of mixed data sampling (MIDAS) regressions with forecast combinations against an approximate dynamic mixed-frequency factor model. Considering a wide range of regional, national, and global predictors, we find that our high-dimensional models outperform benchmark time series models. Surprisingly, we also find that combined forecasts based on simple single-predictor MIDAS regressions are able to outperform forecasts from more sophisticated dynamic factor models. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
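A small sketch of the forecast-combination idea the abstract above highlights: fit one simple regression per predictor, average the resulting forecasts, and compare against a naive benchmark. The mixed-frequency (MIDAS) weighting and the dynamic factor model are not reproduced; the quarterly growth series and indicators below are simulated.

```python
# Illustrative forecast combination: average the forecasts of single-predictor regressions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
T, k = 80, 6                                        # 80 quarters, 6 hypothetical indicators
indicators = rng.standard_normal((T, k))
growth = 0.4 * indicators[:, 0] - 0.3 * indicators[:, 1] + 0.2 * rng.standard_normal(T)

train, test = slice(0, 60), slice(60, T)
single_forecasts = []
for j in range(k):                                  # one simple regression per indicator
    m = LinearRegression().fit(indicators[train, j:j + 1], growth[train])
    single_forecasts.append(m.predict(indicators[test, j:j + 1]))

combined = np.mean(single_forecasts, axis=0)        # equal-weight forecast combination
benchmark = np.full(T - 60, growth[train].mean())   # historical-mean benchmark

mse = lambda f: np.mean((growth[test] - f) ** 2)
print(f"combined MSE: {mse(combined):.3f}   benchmark MSE: {mse(benchmark):.3f}")
```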
39. S3LR: Novel feature selection approach for Microarray-Based breast cancer recurrence prediction.
- Author
-
Erekat, Asala N. and Khasawneh, Mohammad T.
- Subjects
- *CANCER relapse , *FEATURE selection , *BREAST cancer , *PARTICLE swarm optimization , *METAHEURISTIC algorithms , *SECURE Sockets Layer (Computer network protocol) , *FORECASTING - Abstract
Enhancing the accuracy of breast cancer recurrence prediction is crucial, particularly when dealing with genomics data, which present challenges such as high dimensionality, noise, non-linearity, and limited sample sizes. This paper introduces Semi-Supervised Survival Laplacian Regression (S3LR), a novel feature selection algorithm designed to improve breast cancer recurrence prediction by effectively handling censored, event, and unlabeled data. S3LR modifies the Laplacian Score (LS) by incorporating a distance matrix to calculate the weight matrix, and it integrates heuristic and metaheuristic optimization algorithms to optimize that weight matrix. These enhancements refine feature selection and overall performance. In our evaluations using three datasets and comparisons with state-of-the-art techniques, S3LR combined with Particle Swarm Optimization (PSO) demonstrates significant improvements in C-index and mean absolute error (MAE). Average C-index values reach 68.80%, 59.49%, and 67.66%, with average MAE values of 15.98, 7.87, and 8.65 months, respectively. These results showcase S3LR's effectiveness in predicting recurrence, even with censored data, for more precise and reliable outcomes. Furthermore, the framework's versatility extends beyond breast cancer and can readily be applied to other survival and recurrence problems. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
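The classical Laplacian Score that S3LR builds on can be sketched in a few lines: build a nearest-neighbour similarity graph over the samples and score each feature by how smoothly it varies over that graph (smaller is better). The survival-aware distance matrix and the heuristic/metaheuristic weighting that define S3LR itself are not included; the data and the neighbourhood size are illustrative.

```python
# Illustrative Laplacian Score: rank features by locality preservation on a kNN sample graph.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import kneighbors_graph

X, _ = make_classification(n_samples=100, n_features=30, n_informative=5, random_state=0)

# symmetrized 0/1 kNN adjacency over the samples
W = kneighbors_graph(X, n_neighbors=5, mode="connectivity").toarray()
W = np.maximum(W, W.T)
D = np.diag(W.sum(axis=1))
L = D - W                                           # graph Laplacian
ones = np.ones(X.shape[0])

scores = []
for r in range(X.shape[1]):
    f = X[:, r]
    f_tilde = f - (f @ D @ ones) / (ones @ D @ ones) * ones   # remove the graph-weighted mean
    scores.append((f_tilde @ L @ f_tilde) / (f_tilde @ D @ f_tilde))

ranking = np.argsort(scores)                        # smaller score = better locality preservation
print("top 5 features by Laplacian Score:", ranking[:5])
```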
40. A Hashing-Based Framework for Enhancing Cluster Delineation of High-Dimensional Single-Cell Profiles
- Author
-
Liu, Xiao, Zhang, Ting, Tan, Ziyang, Warden, Antony R., Li, Shanhe, Cheung, Edwin, and Ding, Xianting
- Published
- 2022
- Full Text
- View/download PDF
41. Variable Selection via SCAD-Penalized Quantile Regression for High-Dimensional Count Data
- Author
-
Dost Muhammad Khan, Anum Yaqoob, Nadeem Iqbal, Abdul Wahid, Umair Khalil, Mukhtaj Khan, Mohd Amiruddin Abd Rahman, Mohd Shafie Mustafa, and Zardad Khan
- Subjects
Count data ,high dimensional ,jittering ,quantile regression ,variable selection ,zero-inflated ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971
This article introduces a penalized quantile regression technique for variable selection and estimation of conditional quantiles of counts in sparse high-dimensional models. Direct estimation and variable selection for quantile regression are not feasible because of the discreteness of the count data and the non-differentiability of the objective function; therefore, some smoothness must be artificially imposed on the problem. To achieve the necessary smoothness, we use a jittering process that adds uniformly distributed noise to the response count variable. The proposed method is compared with existing penalized regression methods in terms of prediction accuracy and variable selection. We also evaluate the proposed approach in zero-inflated count data regression models and in the presence of outliers. The performance and implementation of the proposed method are illustrated by detailed simulation studies and real data applications.
- Published
- 2019
- Full Text
- View/download PDF
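A small sketch of the jittering device the abstract above describes: add Uniform(0, 1) noise to the counts so the quantile objective becomes smooth, then fit a penalized quantile regression. scikit-learn's L1-penalized QuantileRegressor is used here in place of the SCAD penalty studied in the paper, and the simulated Poisson data are purely illustrative.

```python
# Illustrative jittering + penalized quantile regression for count responses.
# An L1 penalty stands in for the paper's SCAD penalty.
import numpy as np
from sklearn.linear_model import QuantileRegressor

rng = np.random.default_rng(3)
n, p = 200, 50
X = rng.standard_normal((n, p))
rate = np.exp(0.8 * X[:, 0] - 0.5 * X[:, 1])        # only two truly relevant covariates
y = rng.poisson(rate)

z = y + rng.uniform(size=n)                         # jittering: smooth the discrete response

model = QuantileRegressor(quantile=0.5, alpha=0.01, solver="highs")
model.fit(X, z)

nonzero = np.flatnonzero(np.abs(model.coef_) > 1e-6)
print("selected covariates:", nonzero)              # ideally close to {0, 1}
```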
42. Attribute Portfolio Distance: A Dynamic Time Warping-Based Approach to Comparing and Detecting Common Spatiotemporal Patterns Among Multiattribute Data Portfolios
- Author
-
Piburn, Jesse, Stewart, Robert, Morton, April, Balram, Shivanand, Series editor, Dragicevic, Suzana, Series editor, Griffith, Daniel A., editor, Chun, Yongwan, editor, and Dean, Denis J., editor
- Published
- 2017
- Full Text
- View/download PDF
43. Finite volume approximation with ADI scheme and low-rank solver for high dimensional spatial distributed-order fractional diffusion equations.
- Author
-
Chou, Lot-Kei and Lei, Siu-Long
- Subjects
- *HEAT equation , *CRANK-nicolson method , *FINITE volume method , *FINITE, The , *TRANSPORT equation - Abstract
A high-dimensional conservative spatial distributed-order fractional diffusion equation is discretized by the midpoint quadrature rule, the Crank–Nicolson method, and a finite volume approximation, combined with an alternating direction implicit scheme. The resulting scheme is shown to be consistent and unconditionally stable, hence convergent with order 3 − α, where α is the maximum of the fractional orders involved. Moreover, if the initial condition and source term possess the Tensor-Train format (TT-format) with low TT-ranks, the scheme can be solved in TT-format, so that higher-dimensional cases can be considered. Perturbation analysis ensures that the accumulated errors due to data recompression do not affect the overall convergence order. Numerical examples with low-TT-rank initial conditions and source terms, and with dimensions up to 20, are tested. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
44. Random projections: Data perturbation for classification problems.
- Author
-
Cannings, Timothy I.
- Subjects
- *CLASSIFICATION , *DIMENSIONAL analysis , *STATISTICAL accuracy , *STATISTICAL learning , *DATA science , *GRAPHICAL modeling (Statistics) - Abstract
Random projections offer an appealing and flexible approach to a wide range of large-scale statistical problems. They are particularly useful in high-dimensional settings, where we have many covariates recorded for each observation. In classification problems, there are two general techniques using random projections. The first involves many projections in an ensemble: the idea is to aggregate the results after applying different random projections, with the aim of achieving superior statistical accuracy. The second class of methods includes hashing and sketching techniques, which are straightforward ways to reduce the complexity of a problem, often with a huge computational saving, while approximately preserving statistical efficiency. This article is categorized under: Statistical Learning and Exploratory Methods of the Data Sciences > Clustering and Classification; Statistical and Graphical Methods of Data Analysis > Analysis of High Dimensional Data; Statistical Models > Classification Models [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
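A compact sketch of the ensemble technique described in the entry above: apply many independent random projections, fit a simple base classifier in each projected space, and aggregate the predictions by majority vote. The base classifier, projection dimension, and data are illustrative choices rather than those analysed in the article.

```python
# Illustrative random-projection ensemble classifier: project, fit, and vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.random_projection import GaussianRandomProjection

X, y = make_classification(n_samples=400, n_features=500, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

votes = []
for seed in range(50):                              # 50 independent random projections
    proj = GaussianRandomProjection(n_components=5, random_state=seed)
    Z_tr, Z_te = proj.fit_transform(X_tr), proj.transform(X_te)
    clf = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)
    votes.append(clf.predict(Z_te))

ensemble_pred = (np.mean(votes, axis=0) > 0.5).astype(int)   # majority vote over projections
print("ensemble test accuracy:", np.mean(ensemble_pred == y_te))
```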
45. Multiple Classifiers-Assisted Evolutionary Algorithm Based on Decomposition for High-Dimensional Multiobjective Problems
- Author
-
Takumi Sonoda and Masaya Nakata
- Subjects
Computer science ,business.industry ,Evolutionary algorithm ,Function (mathematics) ,High dimensional ,Machine learning ,computer.software_genre ,Multi-objective optimization ,Field (computer science) ,Theoretical Computer Science ,Support vector machine ,Computational Theory and Mathematics ,Classifier (linguistics) ,Decomposition (computer science) ,Artificial intelligence ,business ,computer ,Software - Abstract
Surrogate-assisted multi-objective evolutionary algorithms have advanced the field of computationally expensive optimization, but their progress is often restricted to low-dimensional problems. This manuscript presents a multiple classifiers-assisted evolutionary algorithm based on decomposition, adapted to high-dimensional expensive problems on the basis of two insights. First, compared with approximation-based surrogates, the accuracy of classification-based surrogates is robust when only a few high-dimensional training samples are available. Second, multiple local classifiers can hedge the risk of over-fitting. Accordingly, the proposed algorithm builds multiple classifiers with support vector machines on a decomposition-based multi-objective algorithm, wherein each local classifier is trained for a corresponding scalarization function. Experimental results statistically confirm that the proposed algorithm is competitive with state-of-the-art algorithms and is computationally efficient as well.
- Published
- 2022
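A minimal sketch of the surrogate pre-screening idea behind the entry above: label already-evaluated solutions as good or bad, train an SVM classifier on them, and spend expensive evaluations only on offspring the classifier predicts to be good. The decomposition into scalarization functions and the multiple local classifiers of the actual algorithm are not reproduced; the cheap quadratic objective is a stand-in for an expensive one.

```python
# Illustrative classifier-assisted pre-screening for expensive optimization.
# One global SVM stands in for the paper's multiple per-scalarization classifiers.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
dim = 30
expensive_f = lambda x: float(np.sum(x**2))         # stand-in for an expensive objective

# archive of already-evaluated solutions, labelled good (1) or bad (0)
archive_x = rng.uniform(-5, 5, size=(100, dim))
archive_f = np.array([expensive_f(x) for x in archive_x])
labels = (archive_f < np.median(archive_f)).astype(int)

surrogate = SVC(kernel="rbf").fit(archive_x, labels)

# generate candidate offspring and only evaluate those the classifier predicts to be good
offspring = archive_x[rng.integers(0, 100, size=200)] + rng.normal(0, 0.5, size=(200, dim))
promising = offspring[surrogate.predict(offspring) == 1]
if len(promising):
    evaluated = np.array([expensive_f(x) for x in promising])
    print(f"evaluated {len(promising)} of 200 offspring, best value {evaluated.min():.3f}")
```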
46. Differentially Private Release of Datasets using Gaussian Copula
- Author
-
Hassan Jameel Asghar, Ming Ding, Thierry Rakotoarivelo, Sirine Mrabet, and Dali Kaafar
- Subjects
differential privacy ,synthetic data ,high dimensional ,copula ,Technology ,Social Sciences - Abstract
We propose a generic mechanism to efficiently release differentially private synthetic versions of high-dimensional datasets with high utility. The core technique in our mechanism is the use of copulas, which are functions representing dependencies among random variables with a multivariate distribution. Specifically, we use the Gaussian copula to define dependencies of attributes in the input dataset, whose rows are modelled as samples from an unknown multivariate distribution, and then sample synthetic records through this copula. Despite the inherently numerical nature of Gaussian correlations, we construct a method that is applicable to numerical and categorical attributes alike. Our mechanism is efficient in that it only takes time proportional to the square of the number of attributes in the dataset. We propose a differentially private way of constructing the Gaussian copula without compromising computational efficiency. Through experiments on three real-world datasets, we show that we can obtain highly accurate answers to the set of all one-way marginal, and two- and three-way positive conjunction queries, with 99% of the query answers having absolute (fractional) error rates between 0.01 and 3%. Furthermore, for a majority of two-way and three-way queries, we outperform independent noise addition through the well-known Laplace mechanism. In terms of computational time, we demonstrate that our mechanism can output synthetic datasets in around 6 minutes 47 seconds on average for an input dataset of about 200 binary attributes and more than 32,000 rows, and in about 2 hours 30 minutes for a much larger dataset of about 700 binary attributes and more than 5 million rows. To further demonstrate scalability, we ran the mechanism on larger (artificial) datasets with 1,000 and 2,000 binary attributes (and 5 million rows), obtaining synthetic outputs in approximately 6 and 19 hours, respectively. These are highly feasible times for synthetic datasets, which are one-off releases.
- Published
- 2020
- Full Text
- View/download PDF
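A bare-bones sketch of the copula sampling step in the abstract above: estimate a Gaussian copula correlation from the ranks of the data, draw correlated normal scores, and map them back through the empirical marginals. The differential-privacy part (perturbing marginals and correlations under a privacy budget) is deliberately omitted, so this sketch is not private, and the two columns are simulated.

```python
# Illustrative (non-private) Gaussian copula synthesis: fit dependence via ranks, then resample.
# The paper additionally perturbs marginals and correlations to satisfy differential privacy.
import numpy as np
from scipy.stats import norm, rankdata

rng = np.random.default_rng(5)
n = 1000
age = rng.normal(40, 10, n)
income = np.exp(0.03 * age + rng.normal(0, 0.3, n))  # two correlated, mixed-scale columns
data = np.column_stack([age, income])

# 1) map each column to normal scores through its empirical CDF
u = np.apply_along_axis(rankdata, 0, data) / (n + 1)
z = norm.ppf(u)
corr = np.corrcoef(z.T)                               # Gaussian copula correlation

# 2) sample new normal scores and push them back through the empirical quantiles
z_new = rng.multivariate_normal(np.zeros(2), corr, size=n)
u_new = norm.cdf(z_new)
synthetic = np.column_stack([np.quantile(data[:, j], u_new[:, j]) for j in range(2)])

print("original corr:", round(np.corrcoef(data.T)[0, 1], 3),
      "synthetic corr:", round(np.corrcoef(synthetic.T)[0, 1], 3))
```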
47. A Tree-Based Semi-Varying Coefficient Model for the COM-Poisson Distribution.
- Author
-
Chatla, Suneel Babu and Shmueli, Galit
- Subjects
- *POISSON distribution , *RECURSIVE partitioning , *FIX-point estimation , *CHANGE-point problems , *ALGORITHMS , *SMOOTHNESS of functions , *BICYCLES - Abstract
We propose a tree-based semi-varying coefficient model for the Conway–Maxwell–Poisson (CMP or COM-Poisson) distribution, which is a two-parameter generalization of the Poisson distribution and is flexible enough to capture both under-dispersion and over-dispersion in count data. The advantage of tree-based methods is their scalability to high-dimensional data. We develop CMPMOB, an estimation procedure for a semi-varying coefficient model, using model-based recursive partitioning (MOB). The proposed framework is broader than the existing MOB framework as it allows node-invariant effects to be included in the model. To reduce the computational burden of the exhaustive search employed in the original MOB algorithm, a new split point estimation procedure is proposed by borrowing tools from change point estimation methodology. The proposed method uses only the estimated score functions without fitting models for each split point and, therefore, is computationally simpler. Since tree-based methods only provide a piece-wise constant approximation to the underlying smooth function, we further propose the CMPBoost semi-varying coefficient model, which uses a gradient boosting procedure for estimation. The usefulness of the proposed methods is illustrated using simulation studies and a real example from a bike sharing system in Washington, DC. Supplementary files for this article are available online. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
48. Adaptive Constrained Independent Vector Analysis: An Effective Solution for Analysis of Large-Scale Medical Imaging Data.
- Author
-
Bhinge, Suchita, Long, Qunfang, Calhoun, Vince D., and Adali, Tulay
- Abstract
There is a growing need for flexible methods for the analysis of large-scale functional magnetic resonance imaging (fMRI) data for the estimation of global signatures that summarize the population while preserving individual-specific traits. Independent vector analysis (IVA) is a data-driven method that jointly estimates global spatio-temporal patterns from multi-subject fMRI data, and effectively preserves subject variability. However, as we show, IVA performance is negatively affected when the number of datasets and components increases, especially when there is low component correlation across the datasets. In this article, we study the problem and its relationship to the correlation across datasets, and propose an effective method for addressing the issue by incorporating reference information about the estimation patterns into the formulation, as guidance in high-dimensional scenarios. Constrained IVA (cIVA) provides an efficient framework for incorporating references; however, its performance depends on a user-defined constraint parameter, which enforces the association between the reference signals and estimation patterns at a fixed level. We propose adaptive cIVA (acIVA), which tunes the constraint parameter to allow flexible associations between the references and estimation patterns and enables incorporating multiple reference signals without enforcing inaccurate conditions. Our results indicate that acIVA can reliably estimate high-dimensional multivariate sources from large-scale simulated datasets, when compared with standard IVA. It also successfully extracts meaningful functional networks from a large-scale fMRI dataset for which standard IVA did not converge. The method also efficiently captures subject-specific information, which is demonstrated through observed gender differences in spectral power: higher spectral power in males at low frequencies and in females at high frequencies within the motor, attention, visual, and default mode networks. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
49. Using Bayesian Latent Gaussian Graphical Models to Infer Symptom Associations in Verbal Autopsies.
- Author
-
Zehang Richard Li, McCormick, Tyler H., and Clark, Samuel J.
- Subjects
AUTOPSY ,BAYESIAN analysis ,SYMPTOMS ,QUESTIONNAIRES ,INFORMATION retrieval - Abstract
Learning dependence relationships among variables of mixed types provides insights in a variety of scientific settings and is a well-studied problem in statistics. Existing methods, however, typically rely on copious, high quality data to accurately learn associations. In this paper, we develop a method for scientific settings where learning dependence structure is essential, but data are sparse and have a high fraction of missing values. Specifically, our work is motivated by survey-based cause of death assessments known as verbal autopsies (VAs). We propose a Bayesian approach to characterize dependence relationships using a latent Gaussian graphical model that incorporates informative priors on the marginal distributions of the variables. We demonstrate such information can improve estimation of the dependence structure, especially in settings with little training data. We show that our method can be integrated into existing probabilistic cause-of-death assignment algorithms and improves model performance while recovering dependence patterns between symptoms that can inform efficient questionnaire design in future data collection. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
50. HT-AWGM: a hierarchical Tucker–adaptive wavelet Galerkin method for high-dimensional elliptic problems.
- Author
-
Ali, Mazen and Urban, Karsten
- Abstract
This paper is concerned with the construction, analysis, and realization of a numerical method to approximate the solution of high-dimensional elliptic partial differential equations. We propose a new combination of an adaptive wavelet Galerkin method (AWGM) and the well-known hierarchical tensor (HT) format. The arising HT-AWGM is adaptive both in the wavelet representation of the low-dimensional factors and in the tensor rank of the HT representation. The point of departure is an adaptive wavelet method for the HT format using approximate Richardson iterations and an AWGM for elliptic problems. HT-AWGM performs a sequence of Galerkin solves based upon a truncated preconditioned conjugate gradient (PCG) algorithm in combination with a tensor-based preconditioner. Our analysis starts by showing convergence of the truncated conjugate gradient method. The next step is to add routines realizing the adaptive refinement. The resulting HT-AWGM is analyzed concerning convergence and complexity. We show that the performance of the scheme asymptotically depends only on the desired tolerance with convergence rates depending on the Besov regularity of low-dimensional quantities and the low-rank tensor structure of the solution. The complexity in the ranks is algebraic with powers of four stemming from the complexity of the tensor truncation. Numerical experiments show the quantitative performance. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF