Machine learning workflows to estimate class probabilities for precision cancer diagnostics on DNA methylation microarray data
- Publication Year :
- 2020
Abstract
- DNA methylation data-based precision cancer diagnostics is emerging as the state of the art for molecular tumor classification. Standards for choosing statistical methods with regard to well-calibrated probability estimates for these typically highly multiclass classification tasks are still lacking. To support this choice, we evaluated well-established machine learning (ML) classifiers including random forests (RFs), elastic net (ELNET), support vector machines (SVMs) and boosted trees in combination with post-processing algorithms and developed ML workflows that allow for unbiased class probability (CP) estimation. Calibrators included ridge-penalized multinomial logistic regression (MR) and Platt scaling by fitting logistic regression (LR) and Firth’s penalized LR. We compared these workflows on a recently published brain tumor 450k DNA methylation cohort of 2,801 samples with 91 diagnostic categories using a 5 × 5-fold nested cross-validation scheme and demonstrated their generalizability on external data from The Cancer Genome Atlas. ELNET was the top stand-alone classifier with the best calibration profiles. The best overall two-stage workflow was MR-calibrated SVM with linear kernels closely followed by ridge-calibrated tuned RF. For calibration, MR was the most effective regardless of the primary classifier. The protocols developed as a result of these comparisons provide valuable guidance on choosing ML workflows and their tuning to generate well-calibrated CP estimates for precision diagnostics using DNA methylation data. Computation times vary depending on the ML algorithm from
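The calibration step named in the abstract can be illustrated with a minimal sketch of Platt scaling for a single binary (one-vs-rest) score. This is not the authors' implementation: it fits a plain, unpenalized logistic curve by gradient descent in pure Python, on synthetic scores that are deliberately overconfident. The function names (`fit_platt`, `mean_log_loss`, `simulate`), the learning rate, the iteration count, and the simulated data are all assumptions made for illustration. The ridge-penalized multinomial and Firth-penalized variants the abstract describes are not shown.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=0.1, n_iter=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the mean log loss.

    Hypothetical helper: a simplified Platt scaling without target smoothing
    or any penalty term.
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def mean_log_loss(probs, labels, eps=1e-12):
    """Average negative log-likelihood of binary labels under the given probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def simulate(n, rng):
    """Synthetic data: the 'classifier' reports scores four times too extreme."""
    scores, labels = [], []
    for _ in range(n):
        true_logit = rng.uniform(-2.0, 2.0)
        labels.append(1 if rng.random() < sigmoid(true_logit) else 0)
        scores.append(4.0 * true_logit)
    return scores, labels


rng = random.Random(0)
# The calibrator is fitted on scores the primary classifier did not train on,
# then evaluated on a separate test set.
cal_scores, cal_labels = simulate(400, rng)
test_scores, test_labels = simulate(400, rng)

a, b = fit_platt(cal_scores, cal_labels)
raw_probs = [sigmoid(s) for s in test_scores]
cal_probs = [sigmoid(a * s + b) for s in test_scores]
raw_loss = mean_log_loss(raw_probs, test_labels)
cal_loss = mean_log_loss(cal_probs, test_labels)
print(f"a={a:.3f} b={b:.3f} raw log loss={raw_loss:.3f} calibrated log loss={cal_loss:.3f}")
```

In a nested cross-validation setting such as the 5 × 5-fold scheme mentioned in the abstract, the scores passed to the calibrator would come from inner-fold predictions, so that the calibrator never sees scores produced on the classifier's own training samples.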
- Subjects :
- Platt scaling
Elastic net regularization
0303 health sciences
Computer science
Machine learning
General Biochemistry, Genetics and Molecular Biology
Random forest
Support vector machine
Multiclass classification
Bioconductor
03 medical and health sciences
0302 clinical medicine
Artificial intelligence
Classifier (UML)
030217 neurology & neurosurgery
030304 developmental biology
Multinomial logistic regression
Details
- ISSN :
- 1754-2189
- Database :
- OpenAIRE
- Accession number :
- edsair.doi.dedup.....25eb62111fba2d4bda1dfca6f30183fb