Jennifer Rosenberg, Philippe Lamesch, Ning Li, Stephanie Bosak, Tong Hao, David E. Hill, Reynaldo Sequerra, Marc Vidal, Jean Vandenhaute, Lynn Doucette-Stamm, and Stuart Milstein
The Caenorhabditis elegans genome sequence, released in December 1998, was nearly complete and highly accurate, with an error rate estimated at 1/30,000 (The C. elegans Sequencing Consortium 1998). The finished sequence was eventually released in November 2002, comprising 100,258,171 bp in six contiguous segments corresponding to the six C. elegans chromosomes (J. Sulston, pers. comm.; http://elegans.swmed.edu/Announcements/genome_complete.html). Although the technology required for rapid and accurate whole-genome sequencing is mature, the gene prediction tools currently available to identify protein-encoding open reading frames (ORFs) and to define their exon/intron structures still need improvement. For exon prediction in mammalian genomes, these tools have an overall sensitivity and specificity of only 60% (Burset and Guigo 1996), and ∼40% for the 5′ and 3′ gene boundaries specifically (Korf et al. 2001). Predicted genes can be truncated, extended, split, or merged (see Reboul et al. 2001) relative to their actual “observed” exon/intron structure. Using GeneFinder, a gene prediction tool developed for C. elegans (http://ftp.genome.washington.edu/cgi-bin/genefinder_req.pl), a total of 19,477 ORFs were annotated in Wormbase release WS9 (August 1999; http://www.Wormbase.org; Stein et al. 2001). Approximately 50% of these ORFs were predicted ab initio, without experimental support. The C. elegans ORFeome project was launched to test the accuracy of these gene predictions, while simultaneously creating a resource of cloned full-length predicted ORFs to be used in various functional genomics and reverse proteomics studies (Reboul et al. 2001, 2003). ORFs were PCR-amplified between their 5′- and 3′-ends, and cloned using the Gateway recombinational cloning system (Hartley et al. 2000; Walhout et al. 2000a,b). PCR amplification was performed on a highly representative cDNA library using gene-specific primer pairs for each of the 19,477 ORFs based on WS9 predictions.
Gateway tails attached to all primers allowed the cloning of the ORFs into the pDONR201 vector, resulting in a total of 11,984 (61.5% of the ORFs) Entry clones in the first version of the ORFeome (version v1.1; Supplemental Table 1). The C. elegans ORFeome version 1.1a (v1.1a) represents a consolidated set of 10,623 ORFs cloned in-frame; the remaining 11.4% (1361 out of 11,984) of all cloned ORFs in version 1 were cloned out-of-frame because of mispredicted gene boundaries (v1.1b). This first version of the worm ORFeome contributed significantly to the reannotation of C. elegans gene structure. The alignment of OSTs (ORF Sequence Tags) to the corresponding predicted gene sequences allowed the improvement of C. elegans annotations by correcting the internal gene structure of 20% of v1.1a cloned ORFs. In addition, OSTs provided experimental verification for 45% of the set of “untouched” ORFs, that is, ORFs not yet detected by any mRNA or EST. For each gene, ORFeome v1.1a contains cloned pools that result from mixing ∼50 to ∼1000 Escherichia coli transformants for each Entry clone. Thus, such Entry pools might contain multiple splice variants and alleles corresponding to PCR misincorporations. We are in the process of generating a new resource, ORFeome v2 (Reboul et al. 2003), in which we isolate individual wild-type clones for all detected splice variants of ORFs cloned in v1.1a. We will shortly initiate similar attempts for the ORFs cloned in the ORFeome version 3 described below. The difficulties inherent in identifying ORFs within metazoan genomes and predicting their correct structure are not specific to C. elegans. Genome annotation initiatives in the model organisms Arabidopsis thaliana (Yamada et al. 2003) and Drosophila melanogaster (Hild et al. 2003) have also shown limited accuracy. The accuracy of current gene prediction algorithms is also a major issue for the human genome.
High numbers of splice variants and lower signal-to-noise ratios caused by longer introns and intergenic regions render human genome annotations even more difficult than for the model systems experimentally validated so far. Hence, both in model organisms and in human, functional genomic and reverse proteomics studies, which require the use of large sets of full-length ORFs, are hampered by inaccuracies in gene prediction, limiting the usefulness of sequenced genomes. Since the release of Wormbase WS9 in 1999, continuous efforts have been made to reannotate the C. elegans genome. Reannotations are mainly based on new experimental data, such as mRNAs and ESTs (the EMBL nucleotide sequence database [http://www.ebi.ac.uk/embl/] and the Y. Kohara DNA databank [DDBJ, http://www.ddbj.nig.ac.jp/]), as well as splice-leader sequences (Blumenthal et al. 2002). Furthermore, more refined ab initio approaches have allowed the reprediction of genes for which no confirmatory experimental data are yet available. To experimentally validate these new predictions, improve gene annotation, and generate a more complete C. elegans ORFeome resource, we attempted to clone the 4232 ORFs that were originally missed in v1.1a and that have been either repredicted or newly predicted between the release of WS9 and that of WS100 (May 2003).
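The ORFeome v1.1 clone counts reported above are related by simple arithmetic, and the following minimal Python sketch re-derives the quoted percentages from them. The variable names and the `percent` helper are illustrative assumptions, not part of any ORFeome software; only the three input counts come from the text.

```python
# Counts taken from the text; variable names are illustrative assumptions.
predicted_orfs_ws9 = 19_477   # ORFs annotated in WS9 and targeted by PCR
entry_clones_v1_1 = 11_984    # Entry clones obtained in ORFeome v1.1
out_of_frame_v1_1b = 1_361    # clones out-of-frame (v1.1b)

# In-frame set (v1.1a) is what remains after removing out-of-frame clones.
in_frame_v1_1a = entry_clones_v1_1 - out_of_frame_v1_1b


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


cloning_success = percent(entry_clones_v1_1, predicted_orfs_ws9)
out_of_frame_rate = percent(out_of_frame_v1_1b, entry_clones_v1_1)

print(f"v1.1a in-frame ORFs: {in_frame_v1_1a}")
print(f"Cloning success: {cloning_success}% of WS9 predictions")
print(f"Out-of-frame: {out_of_frame_rate}% of cloned ORFs")
```

Under these inputs the sketch reproduces the figures given in the text: 10,623 in-frame ORFs, a 61.5% cloning success rate, and 11.4% of clones out-of-frame.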