1. Integrated logistic ridge regression and random forest for phenotype-genotype association analysis in categorical genomic data containing non-ignorable missing values.
- Author
- Wang, Siru; Qian, Guoqi; and Hopper, John
- Subjects
- *RANDOM forest algorithms, *STATISTICAL hypothesis testing, *MISSING data (Statistics), *LOGISTIC regression analysis, *GENOMICS, *GENOME-wide association studies, *GENOTYPES
- Abstract
• Innovative phenotype-genotype association analysis for categorical genomic data having non-ignorable missing values.
• Statistical learning method of random forest is used for variable selection in the analysis.
• Weighted logistic ridge regression with EM algorithm is used for missing data imputation in the analysis.
• Linear statistical hypothesis testing is used for determining the missingness mechanism in the analysis.
• An application to analyzing real data from an Australian breast cancer genome-wide association study (GWAS).

Genomic data arising from a genome-wide association study (GWAS) are often not only large-scale but also incomplete. A specific form of their incompleteness is missing values with a non-ignorable missingness mechanism. The intrinsic complications of genomic data present significant challenges in developing an effective and efficient procedure for phenotype-genotype association analysis by a statistical variable selection approach. In this paper we develop a coherent procedure for categorical phenotype-genotype association analysis in the presence of missing values with a non-ignorable missingness mechanism in genomic data. It is developed by integrating the statistical learning method of random forest for variable selection, joint weighted logistic ridge regression with the EM algorithm for missing data imputation, and linear statistical hypothesis testing for determining the missingness mechanism. Two simulated genomic datasets are used to undertake the phenotype-genotype association analysis by the proposed procedure, with the performance validated. The proposed procedure is then applied to analyze a real data set from a breast cancer GWAS. [ABSTRACT FROM AUTHOR]
- Published
- 2023