1. A machine learning-based method for feature reduction of methylation data for the classification of cancer tissue origin.
- Author
- De Velasco MA, Sakai K, Mitani S, Kura Y, Minamoto S, Haeno T, Hayashi H, and Nishio K
- Abstract
Background: Genome DNA methylation profiling is a promising yet costly method for cancer classification, involving substantial data. We developed an ensemble learning model to identify cancer types using methylation profiles from a limited number of CpG sites. Methods: Analyzing methylation data from 890 samples across 10 cancer types from the TCGA database, we utilized ANOVA and Gain Ratio to select the most significant CpG sites, then employed Gradient Boosting to reduce these to just 100 sites. Results: This approach maintained high accuracy across multiple machine learning models, with classification accuracy rates between 87.7% and 93.5% for methods including Extreme Gradient Boosting, CatBoost, and Random Forest. This method effectively minimizes the number of features needed without losing performance, helping to classify primary organs and uncover subgroups within specific cancers like breast and lung. Conclusions: Using a gradient boosting feature selector shows potential for streamlining methylation-based cancer classification. (© 2024. The Author(s).)
- Published
- 2024
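
The two-stage feature reduction described in the abstract (a univariate filter followed by a gradient-boosting selector) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses scikit-learn, runs on synthetic data rather than TCGA methylation profiles, omits the Gain Ratio filter (scikit-learn has no built-in equivalent), and uses much smaller feature counts than the paper's 100 sites. The helper name `select_sites` and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split


def select_sites(X, y, n_filter=50, n_final=10, seed=0):
    """Return sorted column indices of n_final features chosen in two stages.

    Hypothetical helper: the stage sizes are illustrative, not the paper's.
    """
    # Stage 1: the ANOVA F-test keeps the n_filter highest-scoring columns.
    anova = SelectKBest(f_classif, k=n_filter).fit(X, y)
    kept = anova.get_support(indices=True)
    # Stage 2: rank the surviving columns by gradient-boosting importance.
    gb = GradientBoostingClassifier(n_estimators=30, random_state=seed)
    gb.fit(X[:, kept], y)
    top = np.argsort(gb.feature_importances_)[::-1][:n_final]
    return np.sort(kept[top])


# Synthetic stand-in for a (samples x CpG sites) matrix; not real methylation data.
X, y = make_classification(
    n_samples=300, n_features=200, n_informative=20,
    n_redundant=0, n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Select features on the training split only, to avoid leaking test labels.
sites = select_sites(X_train, y_train)

# Evaluate a downstream classifier restricted to the selected columns.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train[:, sites], y_train)
accuracy = clf.score(X_test[:, sites], y_test)
print(len(sites), round(accuracy, 3))
```

Selecting features on the training split alone keeps the held-out accuracy an honest estimate; any classifier could replace the random forest in the final step.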