Identification of protein-coding genes in the genome of Vibrio cholerae with more than 98% accuracy using occurrence frequencies of single nucleotides.

Authors :
Wang, Ju
Zhang, Chun-Ting
Source :
European Journal of Biochemistry; Aug2001, Vol. 268 Issue 15, p4261-4268, 8p, 5 Diagrams, 5 Charts
Publication Year :
2001

Abstract

The published sequence of the Vibrio cholerae genome indicates that, in addition to the genes that encode proteins of known and unknown function, there are 1577 ORFs identified as conserved hypothetical or hypothetical gene candidates. Because the annotation is not 100% accurate, it is not known which of the 1577 ORFs are true protein-coding genes. In this paper, an algorithm based on the Z curve method, with sensitivity, specificity and accuracy greater than 98%, is used to solve this problem. Twenty-fold cross-validation tests show that the accuracy of the algorithm is 98.8%. A detailed discussion of the mechanism of the algorithm is also presented. It was found that 172 of the 1577 ORFs are unlikely to be protein-coding genes. The number of protein-coding genes in the V. cholerae genome was re-estimated and found to be ≈ 3716. This result should be of use in microarray analysis of gene expression in the genome, because the cost of preparing chips may be somewhat decreased. A computer program was written to calculate a coding score called VCZ for gene identification in the genome. Coding/noncoding is simply determined by VCZ > 0/VCZ < 0. The program is freely available on request for academic use. [ABSTRACT FROM AUTHOR]
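The abstract names the Z curve method and the VCZ > 0 decision rule but gives no formulas. The sketch below illustrates one plausible form of the computation, assuming the standard Z curve transform applied to single-nucleotide frequencies at each of the three codon positions (nine features in total) and a linear score. The function names, the weights, and the threshold are hypothetical placeholders; the fitted coefficients of the actual VCZ program are not given in this record.

```python
from collections import Counter


def z_curve_features(orf: str) -> list[float]:
    """Return 9 Z curve components: (x, y, z) for each codon position.

    Assumption: features are built from single-nucleotide frequencies
    at codon positions 1, 2 and 3, as the article title suggests.
    """
    seq = orf.upper()
    features: list[float] = []
    for phase in range(3):
        counts = Counter(seq[phase::3])          # bases at this codon position
        total = sum(counts[b] for b in "ACGT")   # ignore ambiguous symbols
        if total == 0:
            features.extend([0.0, 0.0, 0.0])
            continue
        a, c, g, t = (counts[b] / total for b in "ACGT")
        features.extend([
            (a + g) - (c + t),  # x: purine vs pyrimidine
            (a + c) - (g + t),  # y: amino vs keto
            (a + t) - (g + c),  # z: weak vs strong hydrogen bonding
        ])
    return features


def coding_score(features: list[float], weights: list[float], threshold: float) -> float:
    """Hypothetical linear score; weights and threshold are placeholders."""
    return sum(w * f for w, f in zip(weights, features)) - threshold


def is_coding(score: float) -> bool:
    """Decision rule stated in the abstract: coding if the score is > 0."""
    return score > 0


if __name__ == "__main__":
    feats = z_curve_features("ATGATGATG")
    # Placeholder weights, not the published coefficients.
    score = coding_score(feats, weights=[1.0] * 9, threshold=0.0)
    print(feats, score, is_coding(score))
```

In practice the weights would be fitted on ORFs of known coding status, for example with a linear discriminant, and the threshold chosen so that the coding/noncoding boundary falls at zero.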

Details

Language :
English
ISSN :
00142956
Volume :
268
Issue :
15
Database :
Complementary Index
Journal :
European Journal of Biochemistry
Publication Type :
Academic Journal
Accession number :
4937708
Full Text :
https://doi.org/10.1046/j.1432-1327.2001.02341.x