1. LLM-PBC: Logic Learning Machine-Based Explainable Rules Accurately Stratify the Genetic Risk of Primary Biliary Cholangitis
- Author
- Gerussi, Alessio, Verda, Damiano, Cappadona, Claudio, Cristoferi, Laura, Bernasconi, Davide Paolo, Bottaro, Sandro, Carbone, Marco, Muselli, Marco, Invernizzi, Pietro, and Asselta, Rosanna
- Abstract
Background: The application of Machine Learning (ML) to genetic individual-level data represents a foreseeable advancement for the field, which is still in its infancy. Here, we aimed to evaluate the feasibility and accuracy of an ML-based model for disease risk prediction applied to Primary Biliary Cholangitis (PBC).

Methods: Genome-wide significant variants identified in subjects of European ancestry in the recently released second international meta-analysis of GWAS in PBC were used as input data. Quality-checked, individual genomic data from two Italian cohorts were used. The ML included the following steps: import of genotype and phenotype data, genetic variant selection, supervised classification of PBC by genotype, generation of “if-then” rules for disease prediction by logic learning machine (LLM), and model validation in a different cohort.

Results: The training cohort included 1345 individuals: 444 were PBC cases and 901 were healthy controls. After pre-processing, 41,899 variants entered the analysis. Several configurations of parameters related to feature selection were simulated. The best LLM model reached an Accuracy of 71.7%, a Matthews correlation coefficient of 0.29, a Youden’s value of 0.21, a Sensitivity of 0.28, a Specificity of 0.93, a Positive Predictive Value of 0.66, and a Negative Predictive Value of 0.72. Thirty-eight rules were generated. The rule with the highest covering (19.14) included the following genes: RIN3, KANSL1, TIMMDC1, TNPO3. The validation cohort included 834 individuals: 255 cases and 579 controls. By applying the ruleset derived in the training cohort, the Area under the Curve of the model was 0.73.

Conclusions: This study represents the first illustration of an ML model applied to common variants associated with PBC. Our approach is computationally feasible, leverages individual-level data to generate intelligible rules, and can be used for disease prediction in at-risk individuals.
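
The performance measures named in the abstract all derive from a binary confusion matrix. The Python sketch below shows the standard definitions. The function name `classification_metrics` and the counts in the usage example are hypothetical, chosen only for illustration. The abstract does not report the study's confusion matrix, so this is not the authors' code or data.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return standard binary-classification metrics from confusion-matrix counts.

    tp/fn: cases predicted positive/negative; tn/fp: controls predicted
    negative/positive. Assumes every denominator is non-zero.
    """
    sensitivity = tp / (tp + fn)           # true positive rate
    specificity = tn / (tn + fp)           # true negative rate
    ppv = tp / (tp + fp)                   # positive predictive value
    npv = tn / (tn + fn)                   # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    youden = sensitivity + specificity - 1  # Youden's J statistic
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )                                      # Matthews correlation coefficient
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "accuracy": accuracy,
        "youden": youden,
        "mcc": mcc,
    }


# Hypothetical toy counts (NOT taken from the study).
metrics = classification_metrics(tp=30, fp=10, tn=90, fn=70)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

As a consistency check on the reported figures, Youden's value equals Sensitivity + Specificity - 1, and 0.28 + 0.93 - 1 = 0.21, which matches the value given in the abstract.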
- Published
- 2022