
Using a low correlation high orthogonality feature set and machine learning methods to identify plant pentatricopeptide repeat coding gene/protein.

Authors :
Feng, Changli
Zou, Quan
Wang, Donghua
Source :
Neurocomputing. Feb 2021, Vol. 424, p. 246-254. 9p.
Publication Year :
2021

Abstract

Identifying whether a pentatricopeptide repeat (PPR) exists in an amino acid sequence is a significant task in the field of bioinformatics. To address this problem, an identification method that combines an optimal feature set selection framework and machine learning algorithms is proposed to recognise the PPR coding genes and proteins in amino acid sequences. The original 188-dimensional (D) feature set is obtained using a feature extraction method and is successively optimised through covariance analysis, max-relevance-max-distance (MRMD) processing, and principal component analysis to yield an optimal feature set with fewer but more expressive features. Four machine learning methods then serve as classifiers for the identification task. The final number of feature dimensions is reduced from 188 to only 10, and according to the experimental results with support vector machines, the losses in the AUC and F1 values are only 3.26% and 10.1%, respectively. Moreover, after applying the J48, random forest, and naïve Bayes methods as classifiers, the optimal 10-dimensional feature set was also found to have almost equivalent performance in a 10-fold cross-validation test. [ABSTRACT FROM AUTHOR]
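The pipeline the abstract describes (high-dimensional features filtered, then compressed to 10 dimensions, then classified with 10-fold cross-validation) can be sketched as follows. This is a minimal illustration, not the authors' code: the data are synthetic stand-ins for the 188-D sequence features, and a mutual-information filter is used as an assumed stand-in for the MRMD step, which is not available in scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 188-D features extracted from protein sequences.
X, y = make_classification(n_samples=300, n_features=188,
                           n_informative=20, random_state=0)

pipe = make_pipeline(
    StandardScaler(),
    # Filter step: mutual-information ranking as a stand-in for MRMD.
    SelectKBest(mutual_info_classif, k=50),
    # Compression step: PCA down to the 10-D set reported in the paper.
    PCA(n_components=10),
    # One of the four classifiers evaluated (SVM).
    SVC(kernel="rbf"),
)

# 10-fold cross-validation with the two metrics the abstract reports.
auc = cross_val_score(pipe, X, y, cv=10, scoring="roc_auc").mean()
f1 = cross_val_score(pipe, X, y, cv=10, scoring="f1").mean()
print(f"10-fold AUC={auc:.3f}, F1={f1:.3f}")
```

Swapping `SVC` for `RandomForestClassifier`, `GaussianNB`, or a C4.5-style decision tree (J48's algorithm) reproduces the comparison across the four classifiers.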

Details

Language :
English
ISSN :
09252312
Volume :
424
Database :
Academic Search Index
Journal :
Neurocomputing
Publication Type :
Academic Journal
Accession number :
148202669
Full Text :
https://doi.org/10.1016/j.neucom.2020.02.079