iR5hmcSC: Identifying RNA 5-hydroxymethylcytosine with multiple features based on stacking learning.

Authors :
Zhang, Shengli
Shi, Hongyan
Source :
Computational Biology & Chemistry. Dec2021, Vol. 95, pN.PAG-N.PAG. 1p.
Publication Year :
2021

Abstract

RNA 5-hydroxymethylcytosine (5hmC) modification is the basis of the translation of genetic information and of biological evolution. Studying its distribution in the transcriptome is crucial for revealing the biological significance of 5hmC. Biochemical experiments can use a variety of sequencing-based technologies to achieve high-throughput identification of 5hmC; however, they are labor-intensive, time-consuming, and expensive. It is therefore urgent to develop more effective and feasible computational methods. In this paper, a novel model called iR5hmcSC is designed for identifying 5hmC. First, features are extracted by K-mer, Pseudo Structure Status Composition, and One-Hot encoding. Next, a combination of the chi-square test and logistic regression is used as the feature selection method to select the optimal feature sets. Then stacking learning, an ensemble learning method that includes random forest (RF), extra trees (EX), AdaBoost (Ada), gradient boosting decision tree (GBDT), and support vector machine (SVM), is used to distinguish 5hmC from non-5hmC sites. Finally, a 10-fold cross-validation test is performed to evaluate the model. The accuracy reaches 85.27% on the benchmark dataset and 79.92% on the independent dataset. This result is better than that of the state-of-the-art methods, which indicates that the model is a feasible tool for identifying 5hmC. The datasets and source code are freely available at https://github.com/HongyanShi026/iR5hmcSC.

• A new model named iR5hmcSC is proposed to predict RNA 5hmC sites.
• K-mer, Pseudo Structure Status Composition, and One-Hot encoding are applied to extract features from the dataset.
• A new method combining the chi-square test and logistic regression is used to reduce the dimensionality of the data.
• Stacking learning is adopted for classification.

[ABSTRACT FROM AUTHOR]
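
As an illustration of two of the feature encodings named in the abstract, the following is a minimal sketch of K-mer frequency and One-Hot encoding for an RNA sequence. It is not taken from the authors' repository: the function names (`kmer_frequencies`, `one_hot`, `encode`), the choice of k, the normalization by window count, and the handling of non-ACGU symbols are all assumptions made for this example. Pseudo Structure Status Composition, feature selection, and the stacking classifier are not covered here.

```python
from itertools import product

ALPHABET = "ACGU"  # assumed RNA alphabet; the paper's exact preprocessing is not stated here


def kmer_frequencies(seq, k=2):
    """Return the relative frequency of each of the 4**k k-mers, in lexicographic order."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1  # number of sliding windows of length k
    for i in range(total):
        window = seq[i:i + k]
        if window in counts:  # windows with non-ACGU symbols are skipped (assumption)
            counts[window] += 1
    denom = max(total, 1)  # avoid division by zero for sequences shorter than k
    return [counts[m] / denom for m in kmers]


def one_hot(seq):
    """Encode each nucleotide as a 4-dimensional indicator vector and concatenate them."""
    table = {
        "A": [1, 0, 0, 0],
        "C": [0, 1, 0, 0],
        "G": [0, 0, 1, 0],
        "U": [0, 0, 0, 1],
    }
    vec = []
    for base in seq:
        vec.extend(table.get(base, [0, 0, 0, 0]))  # unknown symbols map to all zeros (assumption)
    return vec


def encode(seq, k=2):
    """Concatenate K-mer frequencies and One-Hot features for one sequence."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as a convenience
    return kmer_frequencies(seq, k) + one_hot(seq)
```

For a sequence of length n, `encode` yields 4**k + 4n features, so fixed-length input windows are needed for the resulting vectors to be comparable across samples.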

Details

Language :
English
ISSN :
14769271
Volume :
95
Database :
Academic Search Index
Journal :
Computational Biology & Chemistry
Publication Type :
Academic Journal
Accession number :
153903346
Full Text :
https://doi.org/10.1016/j.compbiolchem.2021.107583