Back to Search Start Over

Prediction of diabetic protein markers based on an ensemble method

Authors :
Kaiyang Qu
Hua Shi
Quan Zou
Source :
Frontiers in bioscience (Landmark edition). 26(7)
Publication Year :
2021

Abstract

Introduction: A diabetic protein marker is a type of protein that is closely related to diabetes. This kind of protein plays an important role in the prevention and diagnosis of diabetes. Therefore, it is necessary to identify an effective method for predicting diabetic protein markers. In this study, we propose using ensemble methods to predict diabetic protein markers. Methodological issues: The ensemble method consists of two aspects. First, we combine a feature extraction method to obtain mixed features. Next, we classify the protein using ensemble classifiers. We use three feature extraction methods in the ensemble method, including composition and physicochemical features (abbreviated as 188D), adaptive skip gram features (abbreviated as 400D) and g-gap (abbreviated as 670D). There are six traditional classifiers in this study: decision tree, Naive Bayes, logistic regression, part, k-nearest neighbor, and kernel logistic regression. The ensemble classifiers are random forest and vote. First, we used feature extraction methods and traditional classifiers to classify protein sequences. Then, we compared the combined feature extraction methods with single methods. Next, we compared ensemble classifiers to traditional classifiers. Finally, we used ensemble classifiers and combined feature extraction methods to predict samples. Results: The results indicated that ensemble methods outperform single methods with respect to either ensemble classifiers or combined feature extraction methods. When the classifier is a random forest and the feature extraction method is 588D (combined 188D and 400D), the performance is best among all methods. The second best ensemble feature extraction method is 1285D (combining the three methods) with random forest. The best single feature extraction method is 188D, and the worst one is g-gap. Conclusion: According to the results, the ensemble method, either the combined feature extraction method or the ensemble classifier, was better than the single method. We anticipate that ensemble methods will be a useful tool for identifying diabetic protein markers in a cost-effective manner.

Details

ISSN :
27686698
Volume :
26
Issue :
7
Database :
OpenAIRE
Journal :
Frontiers in bioscience (Landmark edition)
Accession number :
edsair.doi.dedup.....1998901feefcd14275c95604f3c55906