1. XGBoost-enhanced ensemble model using discriminative hybrid features for the prediction of sumoylation sites
- Author
Salman Khan, Sumaiya Noor, Tahir Javed, Afshan Naseem, Fahad Aslam, Salman A. AlQahtani, and Nijad Ahmad
- Subjects
Pseudo position-specific score matrix, Sumoylation, Post-translation modification, XGBoost, SHAP, Computer applications to medicine. Medical informatics, R858-859.7, Analysis, QA299.6-433
- Abstract
Posttranslational modifications (PTMs) are essential for regulating protein localization and stability, and they significantly affect gene expression, biological functions, and genome replication. Among these, sumoylation, a PTM that attaches a chemical group to protein sequences, plays a critical role in protein function. Identifying sumoylation sites is particularly important because of their links to Parkinson’s and Alzheimer’s diseases. This study introduces XGBoost-Sumo, a model that predicts sumoylation sites by integrating protein structure and sequence data. The model uses a transformer-based attention mechanism to encode peptides and extracts evolutionary features through the PsePSSM-DWT approach. After fusing the word embeddings with the evolutionary descriptors, it applies the SHapley Additive exPlanations (SHAP) algorithm for feature selection and eXtreme Gradient Boosting (XGBoost) for classification. XGBoost-Sumo achieved an accuracy of 99.68% on benchmark datasets using 10-fold cross-validation and 96.08% on independent samples, outperforming existing models by 10.31% on training data and 2.74% on independent tests. The model’s reliability and high performance make it a valuable resource for researchers, with strong potential for applications in pharmaceutical development.
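The abstract names PsePSSM-DWT as its evolutionary descriptor but gives no formulas. As a rough illustration only, the sketch below follows the commonly used PsePSSM definition (column means plus lag-based mean squared differences) and a single-level Haar wavelet step. The function names `pse_pssm` and `haar_dwt`, the `max_lag` parameter, and the toy two-column matrix are assumptions made for this example. They are not taken from the paper, and its exact variant may differ.

```python
from math import sqrt


def pse_pssm(pssm, max_lag=2):
    """Sketch of PsePSSM: column means plus lag-correlation terms.

    `pssm` is a list of rows (one per residue); a real PSSM has 20 columns.
    Returns n_cols * (1 + max_lag) values. Hypothetical helper, not the
    paper's code.
    """
    length = len(pssm)
    n_cols = len(pssm[0])
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")
    # Composition part: average score of each column over all positions.
    features = [sum(row[j] for row in pssm) / length for j in range(n_cols)]
    # Sequence-order part: mean squared difference between rows `lag` apart.
    for lag in range(1, max_lag + 1):
        for j in range(n_cols):
            total = sum(
                (pssm[i][j] - pssm[i + lag][j]) ** 2
                for i in range(length - lag)
            )
            features.append(total / (length - lag))
    return features


def haar_dwt(signal):
    """One level of the Haar discrete wavelet transform (assumed wavelet).

    Returns (approximation, detail) coefficients for an even-length signal.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) / sqrt(2) for a, b in pairs]
    detail = [(a - b) / sqrt(2) for a, b in pairs]
    return approx, detail


# Toy matrix with 4 residues and 2 columns, purely for illustration.
toy_pssm = [[1, 0], [3, 2], [5, 0], [7, 2]]
descriptor = pse_pssm(toy_pssm, max_lag=2)
approx, detail = haar_dwt([row[0] for row in toy_pssm])
```

In a pipeline like the one the abstract describes, a fixed-length vector of this kind would be concatenated with the peptide embeddings before feature selection and classification. How the wavelet step is combined with PsePSSM in the paper is not stated in this record.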
- Published
- 2025