51. GENCODE: Creating a Validated Manually Annotated Geneset for the Whole Human Genome
- Author
Jen Harrow, Felix Kokocinski, Jane E. Loveland, James G. R. Gilbert, Claire Davidson, E. Hart, Adam Frankish, Michael L. Tress, Bronwen Aken, Rachel A. Harte, M. Kay, Michael F. Lin, Alexandra Bignell, Denise Carvalho-Silva, Mark Diekhans, J. Van Baren, Manolis Kellis, Toby Hunt, If H. A. Barnes, Jessica Vamathevan, Catherine E. Snow, Mark Gerstein, R. Kinsella, J. E. Mudge, S. Donaldson, Tim Hubbard, Laurens G. Wilming, David Lloyd, S. Searle, Roderic Guigó, and Michael R. Brent
- Subjects
Annotation, GENCODE, Computer science, Pseudogene, Chromosome, Coding region, General Materials Science, Human genome, Gene Annotation, Computational biology, ENCODE
- Abstract
The Human and Vertebrate Analysis and Annotation (HAVANA) group at the Wellcome Trust Sanger Institute produced the manually annotated geneset for the Encyclopedia of DNA Elements (ENCODE) pilot project and, as part of the Gencode subgroup, is reprising this role in the scale-up to cover the whole human genome. Our manual annotation is checked computationally and validated experimentally. Loci and transcripts predicted to be absent from the initial annotation are identified by comparison with a number of state-of-the-art algorithms for identifying exons, splice sites, transcripts and pseudogenes. Where novel features are confirmed, the annotation is updated. Annotated coding transcripts are analysed to assess their coding potential by investigating patterns of conservation within the coding sequence (CDS) and comparing predicted secondary structures of annotated CDSs to similar proteins with solved structures. Annotated coding transcripts are also checked against the current set of human Consensus CDSs (CCDS) to check agreement with other participating centres (EBI, NCBI and UCSC). An initial round of annotation and analysis of chromosomes 21 and 22 has shown that while HAVANA annotation is both comprehensive and robust, it has benefitted from computational review. Thirteen novel non-coding loci, 27 novel splice variants and 6 extensions to existing variants were identified, many of which were found using supporting EST/mRNA sequences that were not present at the time of initial annotation. Fewer than 10 annotated CDSs required reclassification, no CCDS sequences required updating and 26 novel pseudogenes were added. The annotation of human chromosome 2 is complete and we are currently annotating chromosomes 3 and 7. Data from all members of Gencode are distributed via DAS and are now visible in our Zmap annotation interface, allowing assessment of computational predictions contemporaneous with first-pass gene annotation.
- Published
- 2009