1. An Integrative Proteogenomics Strategy to Identify the Entire Protein Coding Potential in Prokaryotic Genomes
- Author
Ravikumar Varadarajan, Adithi; Wollscheid, Bernd; Ahrens, Christian; Aebersold, Rudolf; Claassen, Manfred
- Subjects
genome annotation; Genomics/Proteomics; Escherichia coli (E. coli); PROTEOMICS (PROTEINS AND PEPTIDES); small ORFs; Data processing, computer science; ddc:570; De novo assembly; Transcriptomics; Proteogenomics; BIOLOGICAL INFORMATICS AND COMPUTER APPLICATIONS IN BIOLOGY; Comparative genomics; Genomics; Bradyrhizobium diazoefficiens; Listeria monocytogenes; Life sciences; BARTONELLA (MICROBIOLOGY); SYSTEMS BIOLOGY; Pseudomonas aeruginosa; ddc:004; TnSeq
- Abstract
Systems biology is based on the view that the cell as a biological system is greater than the sum of its molecular parts. By integrating quantitative information on the cell’s molecular elements, such as genes, transcripts, proteins, metabolites and their molecular interactions, with molecular modelling, systems biologists aim to model and predict the behaviour of a complex system such as a cell. A prerequisite for such predictions, and for understanding life at the molecular level, is detailed knowledge of the parts involved and their relationships. Bacteria are valuable model organisms for systems biology research because they are less complex than human cells and are therefore a good starting point for the development of new tools and strategies. They are also among the most abundant organisms on Earth. Although they play both beneficial and harmful roles in human life, our knowledge about them remains surprisingly limited. A systems-wide understanding of their phenotype (appearance), and the ability to predict it from their genotype (genetic blueprint), would enable us to better engineer them for a wide range of applications in agriculture, food, energy production and medicine, or to develop new therapeutics to combat bacterial diseases. In the post-genomic era, revolutionary advances in sequencing technologies, decreasing sequencing costs and improvements in computational algorithms have led to an exponential increase in the number of completely sequenced genomes, in particular those of prokaryotes (Bacteria and Archaea). Predicting the phenotype from the genotype in prokaryotes could eventually be achieved through the integration of additional molecular data, such as information about proteins as the translated gene products.
Therefore, measurement of and quantitative information about the proteotype, defined as the acute state of proteins in a given condition and at a given time point, could fill the gap in understanding the phenotype from the genotype alone. Proteogenomics is the emerging research field at the interface of genomics (genotype data) and proteomics (proteotype data). Through the integration of proteotype and genotype data, proteogenomics researchers aim to define all protein-coding genes in a genome, a prerequisite for obtaining a systems-wide understanding of a (prokaryotic) organism. This thesis focuses on the development, application and validation of a proteogenomics strategy enabling the identification of the entire protein-coding potential of selected prokaryotic genomes. An accurate and comprehensive prediction of gene boundaries and their protein-coding potential is essential to match the proteotype data and eventually link the genotype to the proteotype. However, genome annotations currently have two serious limitations: large discrepancies in the number of coding sequences annotated by different reference genome annotation sources, and unannotated small ORFs (smORFs) that can carry out important biological functions. In the first part of my research, described in chapter 2 of this thesis, a proteogenomics strategy was devised that can be used to integrate and consolidate annotations from different reference databases, ab initio gene predictions and in silico annotations into a minimally redundant and maximally informative integrated proteogenomics search database (iPtgxDB) (Omasits and Varadarajan et al. 2017). By generating easy-to-use output file formats containing informative protein identifiers, we ensured a swift integration of iPtgxDBs with proteotype data and simplified the downstream identification and visualization of peptide evidence.
A public web server (https://iptgxdb.expasy.org/) was developed as a new public resource that makes it possible to create such an iPtgxDB for any prokaryotic organism, both for existing reference genomes and for newly sequenced strains. Moreover, pre-computed downloadable iPtgxDBs for a set of prokaryotic organisms are also available on the iPtgxDB web server. The iPtgxDB web server allows researchers to use the proteogenomics database to search proteotype data, thereby aiding them in the identification of currently unannotated and therefore presumably novel proteins. The iPtgxDB-based proteogenomics strategy was validated on selected prokaryotic strains with existing reference genome annotations, including Bartonella henselae Houston-1 and Escherichia coli BW25113. In each case, searching proteotype data sets against the iPtgxDB followed by stringent false discovery rate filtering led to the identification of novel short proteins, alternative start sites and expressed pseudogenes, thereby highlighting the value of the integrated proteogenomics approach. Furthermore, for B. henselae CHDE101, comparison of the de novo assembled complete genome with that of strain Houston-1 allowed us to detect peptide evidence for single amino acid variants, insertions or deletions, and proteins in unique genomic regions. These observations highlight the importance of having a complete genome sequence of the strain under study and the ability of our proteogenomics approach to provide strain- and sample-specific analysis, which will have implications in clinical proteotype analysis and beyond. Taken together, the integrated proteogenomics approach provides an advanced basis for the identification of the entire protein-coding potential in prokaryotic genomes in order to fully capitalize on the genome information and decode its functional relevance.
In the subsequent chapters of my thesis, I further developed these proteogenomics concepts in the context of two prokaryotic model organisms, along with additional orthogonal data sets, in order to obtain a systems-level understanding of their respective genomes. Chapter 3 of my thesis presents a next-generation proteogenomics strategy that combines de novo genome assembly and comparative genomics analysis with next-generation proteotype analysis for the foodborne pathogen Listeria monocytogenes, strains EGD-e and ScottA. I performed an integrative bioinformatic analysis using the complete genome sequences, core and strain-specific genes, and extensive proteotype data, which enabled us to investigate the Listeria proteotype in states mimicking the upper gastrointestinal tract. This comprehensive resource and toolbox (Varadarajan et al. 2019) will enable the Listeria community to reuse this information and analyze the genotype-proteotype-phenotype relationships of the two strains. In chapter 4, a systems biology approach was applied to Bradyrhizobium diazoefficiens 110spc4 to uncover cellular functions required by the bacteria to adapt to micro-oxic conditions within plant root nodules (Fernández and Cabrera and Varadarajan et al. 2019). The de novo assembled complete genome sequence, together with proteotype data produced in the analysis, enabled us to identify proteins specifically expressed in micro-oxic conditions. Further, a comparative genotype analysis of the strain against the Bradyrhizobium diazoefficiens USDA110 reference genome provided links between the genotypes and the different phenotypes of the two strains. In conclusion, the iPtgxDB-based proteogenomics strategies developed in this thesis enable life science researchers to obtain an accurate and comprehensive genome annotation of prokaryotic organisms.
The established public web server not only provides the community with an open-source platform to create a proteogenomics database for any prokaryotic organism, but also provides access to pre-computed downloadable databases. Together, the innovative proteogenomics strategies, the bioinformatics analyses (including de novo assembly, annotation and comparative genomics), and the tools and resources presented in this thesis will aid the research community in obtaining a systems-wide quantitative understanding of prokaryotic genomes and their functions. The applications of proteogenomics are far-reaching, ranging from the identification of novel drug targets for human diseases to engineering bacteria for improved food or bioenergy production and environmental remediation.
- Published
- 2019