LYRUS: a machine learning model for predicting the pathogenicity of missense variants
- Source :
- Bioinformatics Advances
- Publication Year :
- 2021
- Publisher :
- Oxford University Press (OUP), 2021.
Abstract
- Single amino acid variations (SAVs) are a primary contributor to variation in the human genome. Identifying pathogenic SAVs can aid in the diagnosis and understanding of the genetic architecture of complex diseases, such as cancer. Most approaches for predicting the functional effects or pathogenicity of SAVs rely on either sequence or structural information. However, previous analyses have shown that methods that depend on sequence or structural information alone may have limited accuracy. Recently, researchers have attempted to increase prediction accuracy by incorporating protein dynamics into pathogenicity predictions. This study presents Lai Yang Rubenstein Uzun Sarkar (LYRUS), a machine learning method that uses an XGBoost classifier selected by TPOT to predict the pathogenicity of SAVs. LYRUS incorporates five sequence-based features, six structure-based features, and four dynamics-based features. Uniquely, LYRUS includes a newly proposed sequence co-evolution feature called variation number. LYRUS's performance was evaluated using a dataset that contains 4,363 protein structures corresponding to 20,307 SAVs, based on human genetic variant data from the ClinVar database. On this dataset, the LYRUS classifier achieves higher accuracy, specificity, F-measure, and Matthews correlation coefficient (MCC) than alternative methods including PolyPhen2, PROVEAN, SIFT, Rhapsody, EVMutation, MutationAssessor, SuSPect, FATHMM, and MVP. The variation numbers used within LYRUS differ greatly between pathogenic and neutral SAVs and have a high feature weight in the XGBoost classifier. Applications of the method to PTEN and TP53 further corroborate LYRUS's strong performance. LYRUS is freely available, and the source code can be found at https://github.com/jiaying2508/LYRUS.
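The abstract compares classifiers by accuracy, specificity, F-measure, and MCC. As a reading aid, the sketch below shows how these four metrics are conventionally computed from binary predictions. It is not taken from the LYRUS source code: the function name `binary_metrics`, the label encoding (1 = pathogenic, 0 = neutral), and the toy labels are all assumptions made for illustration.

```python
import math


def binary_metrics(y_true, y_pred):
    """Standard binary metrics; labels are assumed to be 1 (pathogenic) or 0 (neutral)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "specificity": specificity,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Toy labels for illustration only; these are not data from the study.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when the pathogenic and neutral classes are imbalanced.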
- Subjects :
- Original Paper
Source code
AcademicSubjects/SCI01060
Computer science
Scale-invariant feature transform
General Medicine
Matthews correlation coefficient
Machine learning
Genetic architecture
Feature (machine learning)
Human genome
Artificial intelligence
Classifier (UML)
Sequence (medicine)
Details
- ISSN :
- 26350041
- Volume :
- 2
- Database :
- OpenAIRE
- Journal :
- Bioinformatics Advances
- Accession number :
- edsair.doi.dedup.....6bd40388ea7a297e9908b57220a19957