Author: "Aaron L. Halpern" / Search Limiters: Available in Library Collection - Searchworks@Jio Institute Digital Library Search Results

50 results on '"Aaron L. Halpern"'

1. CYP2C8, CYP2C9, and CYP2C19 Characterization Using Next-Generation Sequencing and Haplotype Analysis

Author: Andrea Gaedigk, Erin C. Boone, Steven E. Scherer, Seung-been Lee, Ibrahim Numanagić, Cenk Sahinalp, Joshua D. Smith, Sean McGee, Aparna Radhakrishnan, Xiang Qin, Wendy Y. Wang, Emily G. Farrow, Nina Gonzaludo, Aaron L. Halpern, Deborah A. Nickerson, Neil A. Miller, Victoria M. Pratt, and Lisa V. Kalman
Subjects: Molecular Medicine, Pathology and Forensic Medicine
Published: 2022

2. Spinal muscular atrophy diagnosis and carrier screening from genome sequencing data

Author: Zoya Kingsbury, Andrew J. Connell, Ryan J. Taft, Alba Sanchis-Juan, Courtney E. French, Matthew E.R. Butchbach, David R. Bentley, Aditi Chawla, Xiao Chen, Isabelle Delon, Michael A. Eberle, Aaron L. Halpern, F Lucy Raymond, and Nihr BioResource
Subjects: 0301 basic medicine, spinal muscular atrophy (SMA), medicine.medical_specialty, Copy number analysis, Genomics, carrier screening, SMN1, Computational biology, 030105 genetics & heredity, Biology, Genome, DNA sequencing, Article, Muscular Atrophy, Spinal, 03 medical and health sciences, 0302 clinical medicine, medicine, Humans, Child, Gene, Genetics (clinical), 030304 developmental biology, 0303 health sciences, Base Sequence, genome sequencing (GS), Spinal muscular atrophy, bioinformatics, SMA, medicine.disease, Survival of Motor Neuron 1 Protein, nervous system diseases, 030104 developmental biology, copy-number analysis, Child, Preschool, Medical genetics, 030217 neurology & neurosurgery, Reference genome
Abstract: Purpose: Spinal muscular atrophy (SMA), caused by loss of the SMN1 gene, is a leading cause of early childhood death. Due to the near-identical sequences of SMN1 and SMN2, analysis of this region is challenging. Population-wide SMA screening to quantify the SMN1 copy number (CN) is recommended by the American College of Medical Genetics. Methods: We developed a method that accurately identifies the CN of SMN1 and SMN2 using genome sequencing (GS) data by analyzing read depth and eight informative reference genome differences between SMN1/2. Results: We characterized SMN1/2 in 12,747 genomes, identified 1,568 samples with SMN1 gains or losses and 6,615 samples with SMN2 gains or losses and calculated a pan-ethnic carrier frequency of 2%, consistent with previous studies. Additionally, 99.8% of our SMN1 and 99.7% of SMN2 CN calls agreed with orthogonal methods, with a recall of 100% for SMA and 97.8% for carriers, and a precision of 100% for both SMA and carriers. Conclusion: This SMN copy number caller can be used to identify both carrier and affected status of SMA, enabling SMA testing to be offered as a comprehensive test in neonatal care and an accurate carrier screening tool in GS sequencing projects.
Published: 2020

3. Nanoliter reactors improve multiple displacement amplification of genomes from single cells.

Author: Yann Marcy, Thomas Ishoey, Roger S Lasken, Timothy B Stockwell, Brian P Walenz, Aaron L Halpern, Karen Y Beeson, Susanne M D Goldberg, and Stephen R Quake
Subjects: Genetics, QH426-470
Abstract: Since only a small fraction of environmental bacteria are amenable to laboratory culture, there is great interest in genomic sequencing directly from single cells. Sufficient DNA for sequencing can be obtained from one cell by the Multiple Displacement Amplification (MDA) method, thereby eliminating the need to develop culture methods. Here we used a microfluidic device to isolate individual Escherichia coli and amplify genomic DNA by MDA in 60-nl reactions. Our results confirm a report that reduced MDA reaction volume lowers nonspecific synthesis that can result from contaminant DNA templates and unfavourable interaction between primers. The quality of the genome amplification was assessed by qPCR and compared favourably to single-cell amplifications performed in standard 50-µl volumes. Amplification bias was greatly reduced in nanoliter volumes, thereby providing a more even representation of all sequences. Single-cell amplicons from both microliter and nanoliter volumes provided high-quality sequence data by high-throughput pyrosequencing, thereby demonstrating a straightforward route to sequencing genomes from single cells.
Published: 2007

4. The Sorcerer II Global Ocean Sampling Expedition: metagenomic characterization of viruses within aquatic microbial samples.

Author: Shannon J Williamson, Douglas B Rusch, Shibu Yooseph, Aaron L Halpern, Karla B Heidelberg, John I Glass, Cynthia Andrews-Pfannkoch, Douglas Fadrosh, Christopher S Miller, Granger Sutton, Marvin Frazier, and J Craig Venter
Subjects: Medicine, Science
Abstract: Viruses are the most abundant biological entities on our planet. Interactions between viruses and their hosts impact several important biological processes in the world's oceans such as horizontal gene transfer, microbial diversity and biogeochemical cycling. Interrogation of microbial metagenomic sequence data collected as part of the Sorcerer II Global Ocean Expedition (GOS) revealed a high abundance of viral sequences, representing approximately 3% of the total predicted proteins. Cluster analyses of the viral sequences revealed hundreds to thousands of viral genes encoding various metabolic and cellular functions. Quantitative analyses of viral genes of host origin performed on the viral fraction of aquatic samples confirmed the viral nature of these sequences and suggested that significant portions of aquatic viral communities behave as reservoirs of such genetic material. Distributional and phylogenetic analyses of these host-derived viral sequences also suggested that viral acquisition of environmentally relevant genes of host origin is a more abundant and widespread phenomenon than previously appreciated. The predominant viral sequences identified within microbial fractions originated from tailed bacteriophages and exhibited varying global distributions according to viral family. Recruitment of GOS viral sequence fragments against 27 complete aquatic viral genomes revealed that only one reference bacteriophage genome was highly abundant and was closely related, but not identical, to the cyanomyovirus P-SSM4. The co-distribution across all sampling sites of P-SSM4-like sequences with the dominant ecotype of its host, Prochlorococcus supports the classification of the viral sequences as P-SSM4-like and suggests that this virus may influence the abundance, distribution and diversity of one of the most dominant components of picophytoplankton in oligotrophic oceans. 
In summary, the abundance and broad geographical distribution of viral sequences within microbial fractions, the prevalence of genes among viral sequences that encode microbial physiological function and their distinct phylogenetic distribution lend strong support to the notion that viral-mediated gene acquisition is a common and ongoing mechanism for generating microbial diversity in the marine environment.
Published: 2008

5. The diploid genome sequence of an individual human.

Author: Samuel Levy, Granger Sutton, Pauline C Ng, Lars Feuk, Aaron L Halpern, Brian P Walenz, Nelson Axelrod, Jiaqi Huang, Ewen F Kirkness, Gennady Denisov, Yuan Lin, Jeffrey R MacDonald, Andy Wing Chun Pang, Mary Shago, Timothy B Stockwell, Alexia Tsiamouri, Vineet Bafna, Vikas Bansal, Saul A Kravitz, Dana A Busam, Karen Y Beeson, Tina C McIntosh, Karin A Remington, Josep F Abril, John Gill, Jon Borman, Yu-Hui Rogers, Marvin E Frazier, Stephen W Scherer, Robert L Strausberg, and J Craig Venter
Subjects: Biology (General), QH301-705.5
Abstract: Presented here is a genome sequence of an individual human. It was produced from approximately 32 million random DNA fragments, sequenced by Sanger dideoxy technology and assembled into 4,528 scaffolds, comprising 2,810 million bases (Mb) of contiguous sequence with approximately 7.5-fold coverage for any given region. We developed a modified version of the Celera assembler to facilitate the identification and comparison of alternate alleles within this individual diploid genome. Comparison of this genome and the National Center for Biotechnology Information human reference assembly revealed more than 4.1 million DNA variants, encompassing 12.3 Mb. These variants (of which 1,288,319 were novel) included 3,213,401 single nucleotide polymorphisms (SNPs), 53,823 block substitutions (2-206 bp), 292,102 heterozygous insertion/deletion events (indels) (1-571 bp), 559,473 homozygous indels (1-82,711 bp), 90 inversions, as well as numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events identified in the donor; however, these events involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure. Moreover, 44% of genes were heterozygous for one or more variants. Using a novel haplotype assembly strategy, we were able to span 1.5 Gb of genome sequence in segments >200 kb, providing further precision to the diploid nature of the genome. These data depict a definitive molecular portrait of a diploid human genome that provides a starting point for future genome comparisons and enables an era of individualized genomic information.
Published: 2007
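
Worked check (not part of the catalog record): the "22% of all events" figure in the abstract above can be reproduced from the itemized counts it reports. The sketch below assumes that only the five enumerated categories are counted as events; the segmental duplications and copy number variation regions, for which the abstract gives no counts, are left out. The variable names are illustrative.

```python
# Variant counts exactly as listed in the abstract above.
counts = {
    "SNPs": 3_213_401,
    "block substitutions": 53_823,
    "heterozygous indels": 292_102,
    "homozygous indels": 559_473,
    "inversions": 90,
}

# Assumption: "all events" means the sum of the five enumerated categories.
total_events = sum(counts.values())
non_snp_events = total_events - counts["SNPs"]
non_snp_share = non_snp_events / total_events

print(f"total events:   {total_events:,}")
print(f"non-SNP events: {non_snp_events:,}")
print(f"non-SNP share:  {non_snp_share:.1%}")
```

Under that assumption the total comes to 4,118,889 events, consistent with the abstract's "more than 4.1 million DNA variants", and the non-SNP share rounds to 22%.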

6. Survey sequencing and comparative analysis of the elephant shark (Callorhinchus milii) genome.

Author: Byrappa Venkatesh, Ewen F Kirkness, Yong-Hwee Loh, Aaron L Halpern, Alison P Lee, Justin Johnson, Nidhi Dandona, Lakshmi D Viswanathan, Alice Tay, J Craig Venter, Robert L Strausberg, and Sydney Brenner
Subjects: Biology (General), QH301-705.5
Abstract: Owing to their phylogenetic position, cartilaginous fishes (sharks, rays, skates, and chimaeras) provide a critical reference for our understanding of vertebrate genome evolution. The relatively small genome of the elephant shark, Callorhinchus milii, a chimaera, makes it an attractive model cartilaginous fish genome for whole-genome sequencing and comparative analysis. Here, the authors describe survey sequencing (1.4x coverage) and comparative analysis of the elephant shark genome, one of the first cartilaginous fish genomes to be sequenced to this depth. Repetitive sequences, represented mainly by a novel family of short interspersed element-like and long interspersed element-like sequences, account for about 28% of the elephant shark genome. Fragments of approximately 15,000 elephant shark genes reveal specific examples of genes that have been lost differentially during the evolution of tetrapod and teleost fish lineages. Interestingly, the degree of conserved synteny and conserved sequences between the human and elephant shark genomes are higher than that between human and teleost fish genomes. Elephant shark contains putative four Hox clusters indicating that, unlike teleost fish genomes, the elephant shark genome has not experienced an additional whole-genome duplication. These findings underscore the importance of the elephant shark as a critical reference vertebrate genome for comparative analysis of the human and other vertebrate genomes. This study also demonstrates that a survey-sequencing approach can be applied productively for comparative analysis of distantly related vertebrate genomes.
Published: 2007

7. The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families.

Author: Shibu Yooseph, Granger Sutton, Douglas B Rusch, Aaron L Halpern, Shannon J Williamson, Karin Remington, Jonathan A Eisen, Karla B Heidelberg, Gerard Manning, Weizhong Li, Lukasz Jaroszewski, Piotr Cieplak, Christopher S Miller, Huiying Li, Susan T Mashiyama, Marcin P Joachimiak, Christopher van Belle, John-Marc Chandonia, David A Soergel, Yufeng Zhai, Kannan Natarajan, Shaun Lee, Benjamin J Raphael, Vineet Bafna, Robert Friedman, Steven E Brenner, Adam Godzik, David Eisenberg, Jack E Dixon, Susan S Taylor, Robert L Strausberg, Marvin Frazier, and J Craig Venter
Subjects: Biology (General), QH301-705.5
Abstract: Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.
Published: 2007

8. The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific.

Author: Douglas B Rusch, Aaron L Halpern, Granger Sutton, Karla B Heidelberg, Shannon Williamson, Shibu Yooseph, Dongying Wu, Jonathan A Eisen, Jeff M Hoffman, Karin Remington, Karen Beeson, Bao Tran, Hamilton Smith, Holly Baden-Tillson, Clare Stewart, Joyce Thorpe, Jason Freeman, Cynthia Andrews-Pfannkoch, Joseph E Venter, Kelvin Li, Saul Kravitz, John F Heidelberg, Terry Utterback, Yu-Hui Rogers, Luisa I Falcón, Valeria Souza, Germán Bonilla-Rosso, Luis E Eguiarte, David M Karl, Shubha Sathyendranath, Trevor Platt, Eldredge Bermingham, Victor Gallardo, Giselle Tamayo-Castillo, Michael R Ferrari, Robert L Strausberg, Kenneth Nealson, Robert Friedman, Marvin Frazier, and J Craig Venter
Subjects: Biology (General), QH301-705.5
Abstract: The world's oceans contain a complex mixture of micro-organisms that are for the most part, uncharacterized both genetically and biochemically. We report here a metagenomic study of the marine planktonic microbiota in which surface (mostly marine) water samples were analyzed as part of the Sorcerer II Global Ocean Sampling expedition. These samples, collected across a several-thousand km transect from the North Atlantic through the Panama Canal and ending in the South Pacific yielded an extensive dataset consisting of 7.7 million sequencing reads (6.3 billion bp). Though a few major microbial clades dominate the planktonic marine niche, the dataset contains great diversity with 85% of the assembled sequence and 57% of the unassembled data being unique at a 98% sequence identity cutoff. Using the metadata associated with each sample and sequencing library, we developed new comparative genomic and assembly methods. One comparative genomic method, termed "fragment recruitment," addressed questions of genome structure, evolution, and taxonomic or phylogenetic diversity, as well as the biochemical diversity of genes and gene families. A second method, termed "extreme assembly," made possible the assembly and reconstruction of large segments of abundant but clearly nonclonal organisms. Within all abundant populations analyzed, we found extensive intra-ribotype diversity in several forms: (1) extensive sequence variation within orthologous regions throughout a given genome; despite coverage of individual ribotypes approaching 500-fold, most individual sequencing reads are unique; (2) numerous changes in gene content some with direct adaptive implications; and (3) hypervariable genomic islands that are too variable to assemble. The intra-ribotype diversity is organized into genetically isolated populations that have overlapping but independent distributions, implying distinct environmental preference. 
We present novel methods for measuring the genomic similarity between metagenomic samples and show how they may be grouped into several community types. Specific functional adaptations can be identified both within individual ribotypes and across the entire community, including proteorhodopsin spectral tuning and the presence or absence of the phosphate-binding gene PstS.
Published: 2007

9. Strelka2: fast and accurate calling of germline and somatic variants

Author: Peter Krusche, Doruk Beyter, Christopher T. Saunders, Konrad Scheffler, Aaron L. Halpern, Xiaoyu Chen, Morten Källberg, Sangtae Kim, Mitchell A. Bekritsky, Eunho Noh, and Yeonbin Kim
Subjects: 0301 basic medicine, Computer science, Somatic cell, Computational biology, Biochemistry, Germline, 03 medical and health sciences, Germline mutation, INDEL Mutation, Neoplasms, Databases, Genetic, Humans, Molecular Biology, Germ-Line Mutation, Liquid Tumor, Models, Genetic, Whole Genome Sequencing, Genetic Variation, High-Throughput Nucleotide Sequencing, Cell Biology, 030104 developmental biology, Haplotypes, Genome informatics, Sample contamination, Software, Biotechnology
Abstract: We describe Strelka2 (https://github.com/Illumina/strelka), an open-source small-variant-calling method for research and clinical germline and somatic sequencing applications. Strelka2 introduces a novel mixture-model-based estimation of insertion/deletion error parameters from each sample, an efficient tiered haplotype-modeling strategy, and a normal sample contamination model to improve liquid tumor analysis. For both germline and somatic calling, Strelka2 substantially outperformed the current leading tools in terms of both variant-calling accuracy and computing cost.
Published: 2017

10. Strelka2: Fast and accurate variant calling for clinical sequencing applications

Author: Eunho Noh, Peter Krusche, Konrad Scheffler, Christopher T. Saunders, Doruk Beyter, Sangtae Kim, Aaron L. Halpern, Morten Källberg, Mitchell A. Bekritsky, and Xiaoyu Chen
Subjects: Genetics, Liquid Tumor, InformationSystems_GENERAL, ComputingMethodologies_PATTERNRECOGNITION, GeneralLiterature_INTRODUCTORYANDSURVEY, ComputingMilieux_COMPUTERSANDEDUCATION, Computational biology, Biology, Indel, Sample contamination, Germline
Abstract: We describe Strelka2 (https://github.com/Illumina/strelka), an open-source small variant calling method for clinical germline and somatic sequencing applications. Strelka2 introduces a novel mixture-model based estimation of indel error parameters from each sample, an efficient tiered haplotype modeling strategy and a normal sample contamination model to improve liquid tumor analysis. For both germline and somatic calling, Strelka2 substantially outperforms current leading tools on both variant calling accuracy and compute cost.
Published: 2017

11. Haplotype phasing of whole human genomes using bead-based barcode partitioning in a single tube

Author: Niall Anthony Gormley, Maria C Rogert, Jerushah Thomas, Lena Christiansen, Ana Granat, Ros Jackson, Frank J. Steemers, Jay Shendure, Yannan Zhao, Mostafa Ronaghi, Aaron L. Halpern, Melissa M. Wiley, Dmitry K. Pokholok, Steven Norberg, Fan Zhang, Emily Welch, Erich Jaeger, Kevin L. Gunderson, and Natalie Morrell
Subjects: 0301 basic medicine, Population, Biomedical Engineering, Bioengineering, Hybrid genome assembly, Computational biology, Biology, Barcode, Applied Microbiology and Biotechnology, Genome, Deep sequencing, DNA sequencing, law.invention, 03 medical and health sciences, law, DNA Barcoding, Taxonomic, Humans, education, Genetics, education.field_of_study, Massive parallel sequencing, Genome, Human, High-Throughput Nucleotide Sequencing, Genomics, 030104 developmental biology, Haplotypes, Molecular Medicine, Human genome, Biotechnology
Abstract: Haplotype-resolved genome sequencing promises to unlock a wealth of information in population and medical genetics. However, for the vast majority of genomes sequenced to date, haplotypes have not been determined because of cumbersome haplotyping workflows that require fractions of the genome to be sequenced in a large number of compartments. Here we demonstrate barcode partitioning of long DNA molecules in a single compartment using "on-bead" barcoded tagmentation. The key to the method that we call "contiguity preserving transposition" sequencing on beads (CPTv2-seq) is transposon-mediated transfer of homogenous populations of barcodes from beads to individual long DNA molecules that get fragmented at the same time (tagmentation). These are then processed to sequencing libraries wherein all sequencing reads originating from each long DNA molecule share a common barcode. Single-tube, bulk processing of long DNA molecules with ∼150,000 different barcoded bead types provides a barcode-linked read structure that reveals long-range molecular contiguity. This technology provides a simple, rapid, plate-scalable and automatable route to accurate, haplotype-resolved sequencing, and phasing of structural variants of the genome.
Published: 2016

12. A reference dataset of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree

Author: Benjamin L. Moore, Gil McVean, Elliott H. Margulies, Zamin Iqbal, Epameinondas Fritzilas, Michael A. Eberle, Han-Yu Chuang, Aaron L. Halpern, Sean Humphray, David R. Bentley, Peter Krusche, Morten Källberg, Semyon Kruglyak, and Mitchell A. Bekritsky
Subjects: 0301 basic medicine, Resource, Genetic inheritance, Genotype, Sequence analysis, Genomics, Computational biology, Biology, Polymorphism, Single Nucleotide, Genome, 03 medical and health sciences, 0302 clinical medicine, Data sequences, INDEL Mutation, Databases, Genetic, Genetics, Humans, Exome, Indel, Genetics (clinical), Genome, Human, Haplotype, Inheritance (genetic algorithm), High-Throughput Nucleotide Sequencing, Sequence Analysis, DNA, Human genetics, Pedigree, 030104 developmental biology, 030217 neurology & neurosurgery, Algorithms, Software, Reference dataset
Abstract: Improvement of variant calling in next-generation sequence data requires a comprehensive, genome-wide catalogue of high-confidence variants called in a set of genomes for use as a benchmark. We generated deep, whole-genome sequence data of seventeen individuals in a three-generation pedigree and called variants in each genome using a range of currently available algorithms. We used haplotype transmission information to create a phased “platinum” variant catalogue of 4.7 million single nucleotide variants (SNVs) plus 0.7 million small (1-50bp) insertions and deletions (indels) that are consistent with the pattern of inheritance in the parents and eleven children of this pedigree. Platinum genotypes are highly concordant with the current catalogue of the National Institute of Standards and Technology for both SNVs (>99.99%) and indels (99.92%), and add a validated truth catalogue that has 26% more SNVs and 45% more indels. Analysis of 334,652 SNVs that were consistent between informatics pipelines yet inconsistent with haplotype transmission (“non-platinum”) revealed that the majority of these variants are de novo and cell-line mutations or reside within previously unidentified duplications and deletions. The reference materials from this study are a resource for objective assessment of the accuracy of variant calls throughout genomes.
Published: 2016

13. Genomic insights to SAR86, an abundant and uncultivated marine bacterial lineage

Author: Robert Friedman, Jeremy D. Selengut, R. Alexander Richter, Daniel H. Haft, Aaron L. Halpern, Roger S. Lasken, J. Craig Venter, Mary-Jane Lombardo, Mark Novotny, Douglas B. Rusch, Christopher L. Dupont, Shibu Yooseph, Ruben E. Valas, Joyclyn Yee-Greenbaum, and Kenneth H. Nealson
Subjects: Rhodopsin, proteorhodopsin, tonB receptors, Oceans and Seas, Lineage (evolution), SAR86, Biology, Microbiology, Genome, Phylogenetics, RNA, Ribosomal, 16S, Rhodopsins, Microbial, Seawater, Genomic library, single cell genomics, Phylogeny, Ecology, Evolution, Behavior and Systematics, Genomic Library, Proteorhodopsin, Ecology, SAR11, Computational Biology, Ribosomal RNA, Plankton, Evolutionary biology, Metagenomics, biology.protein, Original Article, metagenomic assembly, Bacterial outer membrane, Gammaproteobacteria, Genome, Bacterial
Abstract: Bacteria in the 16S rRNA clade SAR86 are among the most abundant uncultivated constituents of microbial assemblages in the surface ocean for which little genomic information is currently available. Bioinformatic techniques were used to assemble two nearly complete genomes from marine metagenomes and single-cell sequencing provided two more partial genomes. Recruitment of metagenomic data shows that these SAR86 genomes substantially increase our knowledge of non-photosynthetic bacteria in the surface ocean. Phylogenomic analyses establish SAR86 as a basal and divergent lineage of γ-proteobacteria, and the individual genomes display a temperature-dependent distribution. Modestly sized at 1.25-1.7 Mbp, the SAR86 genomes lack several pathways for amino-acid and vitamin synthesis as well as sulfate reduction, trends commonly observed in other abundant marine microbes. SAR86 appears to be an aerobic chemoheterotroph with the potential for proteorhodopsin-based ATP generation, though the apparent lack of a retinal biosynthesis pathway may require it to scavenge exogenously-derived pigments to utilize proteorhodopsin. The genomes contain an expanded capacity for the degradation of lipids and carbohydrates acquired using a wealth of tonB-dependent outer membrane receptors. Like the abundant planktonic marine bacterial clade SAR11, SAR86 exhibits metabolic streamlining, but also a distinct carbon compound specialization, possibly avoiding competition.
Published: 2011

14. Characterization of Prochlorococcus clades from iron-depleted oceanic regions

Author: J. Craig Venter, Adam C. Martiny, Douglas B. Rusch, Christopher L. Dupont, and Aaron L. Halpern
Subjects: Multidisciplinary, Ecotype, Ecology, Iron, Oceans and Seas, fungi, Iron fertilization, Biodiversity, Biological Sciences, Biology, biology.organism_classification, Phylogenetics, Phytoplankton, Upwelling, Marine ecosystem, Prochlorococcus, Genome, Bacterial, Phylogeny
Abstract: Prochlorococcus describes a diverse and abundant genus of marine photosynthetic microbes. It is primarily found in oligotrophic waters across the globe and plays a crucial role in energy and nutrient cycling in the ocean ecosystem. The abundance, global distribution, and availability of isolates make Prochlorococcus a model system for understanding marine microbial diversity and biogeochemical cycling. Analysis of 73 metagenomic samples from the Global Ocean Sampling expedition acquired in the Atlantic, Pacific, and Indian Oceans revealed the presence of two uncharacterized Prochlorococcus clades. A phylogenetic analysis using six different genetic markers places the clades close to known lineages adapted to high-light environments. The two uncharacterized clades consistently cooccur and dominate the surface waters of high-temperature, macronutrient-replete, and low-iron regions of the Eastern Equatorial Pacific upwelling and the tropical Indian Ocean. They are genetically distinct from each other and other high-light Prochlorococcus isolates and likely define a previously unrecognized ecotype. Our detailed genomic analysis indicates that these clades comprise organisms that are adapted to iron-depleted environments by reducing their iron quota through the loss of several iron-containing proteins that likely function as electron sinks in the photosynthetic pathway in other Prochlorococcus clades from high-light environments. The presence and inferred physiology of these clades may explain why Prochlorococcus populations from iron-depleted regions do not respond to iron fertilization experiments and further expand our understanding of how phytoplankton adapt to variations in nutrient availability in the ocean.
Published: 2010

15. Human Genome Sequencing Using Unchained Base Reads on Self-Assembling DNA Nanoarrays

Author: Robert Hartlage, Brock A. Peters, Igor Nazarenko, Jonathan Baccash, Calvin Kong, Vitali Karpinchyk, Andres Fernandez, Abraham M. Rosenbaum, Ryan J. Cedeno, Paolo Carnevali, Celeste E. McBride, Norman L. Burns, Shaunak Roy, Karen W. Shannon, George M. Church, Snezana Drmanac, Daniel F. Chernikoff, Radoje Drmanac, Geoffrey B. Nilsen, Claudia Richter, Coleen R. Hacker, Jay Shafto, William C. Banyai, Kaliprasad Pothuraju, Helena Perazich, Bruce L. Martin, Dennis G. Ballinger, Benjamin Curson, Linsu Chen, Brian Hauser, Steve Huang, Alexander Wait Zaranek, Anushka Brownley, Dylan Vu, Matt Morenzoni, Andrew B. Sparks, Matthew J. Callow, Alex Cheung, Clifford Reid, Adam P. Borcherding, George Yeung, Xiaodi Wu, Catherine Le, Tom Landers, Aaron L. Halpern, Bahram G. Kermani, Kimberly Perry, Arnold R. Oliphant, Mark Koenig, Charit L. Pethiyagoda, Michel Sun, Joseph V. Thakuria, Conrad G. Sheppy, Anne Tran, Robert E. Morey, Fredrik A. Dahl, Krishna Pant, Karl Mutch, Bryan Staker, Joe Peterson, Jessica Ebert, Yuan Jiang, Jia Liu, Razvan Chirita, and Uladzislau Sharanhovich
Subjects: Male, Genotype, Sequence analysis, Biology, Polymorphism, Single Nucleotide, Genome, DNA sequencing, chemistry.chemical_compound, Sequencing by hybridization, Human Genome Project, Humans, Nanotechnology, Genomic library, Genetics, Genomic Library, Multidisciplinary, Base Sequence, Genome, Human, Computational Biology, DNA, Sequence Analysis, DNA, Nucleic acid amplification technique, Microarray Analysis, Nanostructures, Haplotypes, chemistry, Costs and Cost Analysis, Human genome, Databases, Nucleic Acid, Nucleic Acid Amplification Techniques, Software
Abstract: Toward $1000 Genomes: The ability to generate human genome sequence data that is complete, accurate, and inexpensive is a necessary prerequisite to perform genome-wide disease association studies. Drmanac et al. (p. 78, published online 5 November) present a technique advancing toward this goal. The method uses Type IIS endonucleases to incorporate short oligonucleotides within a set of randomly sheared circularized DNA. DNA polymerase then generates concatenated copies of the circular oligonucleotides leading to formation of compact but very long oligonucleotides which are then sequenced by ligation. The relatively low cost of this technology, which shows a low error rate, advances sequencing closer to the goal of the $1000 genome.
Published: 2010
Full Text: View/download PDF

16. An MCMC algorithm for haplotype assembly from whole-genome sequence data

Author: Aaron L. Halpern, Nelson Axelrod, Vikas Bansal, and Vineet Bafna
Subjects: Hash function, Population, Genomics, Computational biology, Biology, symbols.namesake, Methods, Genetics, Humans, Computer Simulation, International HapMap Project, education, Genetics (clinical), education.field_of_study, Markov chain, Genome, Human, Haplotype, Markov chain Monte Carlo, Markov Chains, Haplotypes, symbols, Monte Carlo Method, Algorithms, Reference genome
Abstract: In comparison to genotypes, knowledge about haplotypes (the combination of alleles present on a single chromosome) is much more useful for whole-genome association studies and for making inferences about human evolutionary history. Haplotypes are typically inferred from population genotype data using computational methods. Whole-genome sequence data represent a promising resource for constructing haplotypes spanning hundreds of kilobases for an individual. In this article, we propose a Markov chain Monte Carlo (MCMC) algorithm, HASH (haplotype assembly for single human), for assembling haplotypes from sequenced DNA fragments that have been mapped to a reference genome assembly. The transitions of the Markov chain are generated using min-cut computations on graphs derived from the sequenced fragments. We have applied our method to infer haplotypes using whole-genome shotgun sequence data from a recently sequenced human individual. The high sequence coverage and presence of mate pairs result in fairly long haplotypes (N50 length ∼ 350 kb). Based on comparison of the sequenced fragments against the individual haplotypes, we demonstrate that the haplotypes for this individual inferred using HASH are significantly more accurate than the haplotypes estimated using a previously proposed greedy heuristic and a simple MCMC method. Using haplotypes from the HapMap project, we estimate the switch error rate of the haplotypes inferred using HASH to be quite low, ∼1.1%. Our Markov chain Monte Carlo algorithm represents a general framework for haplotype assembly that can be applied to sequence data generated by other sequencing technologies. The code implementing the methods and the phased individual haplotypes can be downloaded from http://www.cse.ucsd.edu/users/vibansal/HASH/.
Published: 2008
Full Text: View/download PDF

17. Evolutionary and Biomedical Insights from the Rhesus Macaque Genome

Author: Carolin Kosiol, Belinda Giardine, Janet A. Hopkins, Andrew G. Clark, Ryan D. Hernandez, Peng Wang, Peter D. Stenson, Yu-Hui Rogers, Aaron L. Halpern, Andrew D. Kern, Webb Miller, Kymberlie H. Pepin, Melissa J. Hubisz, Kimberly D. Delehaunty, Robert E. Palermo, Matthew W. Hahn, Erica Sodergren, Brian P. Walenz, Scott M. Smith, Sandra L. Lee, Xiang Qin, Yucheng Feng, Ewen F. Kirkness, Vandita Joshi, Xiaoqiu Huang, Amanda F. Svatek, Fan Yang, Young Ho Kim, Laura Clarke, John E. Karro, Courtney Sherell White, Jessica Kolb, David Glenn Smith, Clay Davis, Jian Ma, Shobha Patil, Todd Wylie, Arian F.A. Smit, Shalini N. Jhangiani, Michael G. Katze, Edward V. Ball, Jennifer Godfrey, Heather A. Lawson, Brian J. Raney, Michael Holder, Ross C. Hardison, Christian J. Buhay, Zhangwan Li, Alicia Hawes, Eric J. Vallender, David A. Wheeler, James C. Wallace, Galt P. Barber, Jinchuan Xing, Yufeng Shen, Kayla E. Smith, Marvin Diep Dao, Jeffrey Rogers, Evan E. Eichler, Cynthia Pfannkoch, Jireh Santibanez, Kateryna D. Makova, Kashif Hirani, Robert M. Kuhn, Yanru Ren, David Neil Cooper, David Haussler, Carlos Bustamante, Adam Siepel, Mimi N. Chandrabose, Xiaoming Liu, George M. Weinstock, Teresa Utterback, Jarret Glasscock, Tomas Vinar, R. Alan Harris, Anis Karimpour-Fard, San Juana Ruiz, Lucinda Fulton, Asif T. Chinwalla, Aniko Sabo, Xinwei She, Charles Addo-Quaye, David L. Nelson, Lora Lewis, Hui Ke, Eli Venter, Donna M. Muzny, Alison Marklein, Bruce T. Lahn, Grace Pai, Brian W. Schneider, Shannon Dugan-Rocha, Henry Xing-Zhi Song, Jeremiah D. Degenhardt, Kyudong Han, Huaiyang Jiang, Stephanie M. Moore, Ian Schenck, Dinh Ngoc Ngo, Michael J. Cox, Heidie A. Paul, Ann S. Zwieg, Kim C. Worley, Craig Pohl, Rui Chen, Robert L. Strausberg, Ling-Ling Pu, Donna Karolchik, Jonathan R. Pollack, Geoffrey Okwuonu, Jennifer Hume, Elaine R. Mardis, David N. Messina, W. James Kent, William E. O'Brien, Fan Hsu, Andrew R. Jackson, Huyen Dinh, Hui Wang, LaDeana W. Hillier, Richard A. 
Gibbs, Alexandra Denby, Wesley C. Warren, Brygg Ullmer, Laura J. Dumas, Yih-shin Liu, Tony Attaway, Richard K. Wilson, Patrick Minx, James M. Sikela, Lan Zhang, Sandra Hines, Steven J. M. Jones, Amit Indap, Ze Cheng, Karin A. Remington, Stephanie Bell, Jungnam Lee, Kelly E. Bernard, Sang-Gook Han, Mariano Rocchi, Judith Hernandez, Betsy Ferguson, Hildegard Kehrer-Sawatzki, Ziad Khan, Aleksandar Milosavljevic, Joanne O. Nelson, Jeffery P. Demuth, Richard Burhans, David A. Parker, Lynne V. Nazareth, Roger E. Bumgarner, Marco A. Marra, Robert Baertsch, Andrew Cree, Paul Havlak, J. Craig Venter, Kay Prüfer, Rasmus Nielsen, Ewan Birney, Miriam K. Konkel, Mark A. Batzer, Arthur M. Lesk, Jacqueline E. Schein, Granger G. Sutton, Yan Ding, Yue Liu, Andy Peng Xiang, Miklós Csürös, Selina Vattathil, John W. Wallis, R. Gerald Fowler, Shiaw-Pyng Yang, Ramatu Ayiesha Gabisi, and Toni T. Garner
Subjects: Male, Biomedical Research, Pan troglodytes, Macaque, Human accelerated regions, Genome, Evolution, Molecular, Species Specificity, Gene Duplication, biology.animal, Animals, Humans, Primate, Gene Rearrangement, Genetics, Whole genome sequencing, Multidisciplinary, biology, Genetic Diseases, Inborn, Genetic Variation, Sequence Analysis, DNA, Gene rearrangement, biology.organism_classification, Macaca mulatta, Rhesus macaque, Homo sapiens, Evolutionary biology, Multigene Family, Mutation, Female
Abstract: The rhesus macaque (Macaca mulatta) is an abundant primate species that diverged from the ancestors of Homo sapiens about 25 million years ago. Because they are genetically and physiologically similar to humans, rhesus monkeys are the most widely used nonhuman primate in basic and applied biomedical research. We determined the genome sequence of an Indian-origin Macaca mulatta female and compared the data with chimpanzees and humans to reveal the structure of ancestral primate genomes and to identify evidence for positive selection and lineage-specific expansions and contractions of gene families. A comparison of sequences from individual animals was used to investigate their underlying genetic diversity. The complete description of the macaque genome blueprint enhances the utility of this animal model for biomedical research and improves our understanding of the basic biology of the species.
Published: 2007
Full Text: View/download PDF

18. Shotgun sequence assembly and recent segmental duplications within the human genome

Author: Ze Cheng, Ge Liu, Eray Tüzün, Evan E. Eichler, Deanna M. Church, Zhaoshi Jiang, Xinwei She, Granger G. Sutton, Royden A. Clark, and Aaron L. Halpern
Subjects: Genetics, Multidisciplinary, Genome, Human, Sequence analysis, Computational Biology, Sequence assembly, Sequence alignment, Genomics, Sequence Analysis, DNA, Computational biology, Biology, Physical Chromosome Mapping, Sensitivity and Specificity, Genome, Mice, Genes, Duplicate, Gene Duplication, Animals, Chromosomes, Human, Humans, Human genome, Shotgun Sequence Assembly, Repeated sequence, Sequence Alignment, Segmental duplication
Abstract: Complex eukaryotic genomes are now being sequenced at an accelerated pace primarily using whole-genome shotgun (WGS) sequence assembly approaches. WGS assembly was initially criticized because of its perceived inability to resolve repeat structures within genomes. Here, we quantify the effect of WGS sequence assembly on large, highly similar repeats by comparison of the segmental duplication content of two different human genome assemblies. Our analysis shows that large (> 15 kilobases) and highly identical (> 97%) duplications are not adequately resolved by WGS assembly. This leads to significant reduction in genome length and the loss of genes embedded within duplications. Comparable analyses of mouse genome assemblies confirm that strict WGS sequence assembly will oversimplify our understanding of mammalian genome structure and evolution; a hybrid strategy using a targeted clone-by-clone approach to resolve duplications is proposed.
Published: 2004
Full Text: View/download PDF

19. Environmental Genome Shotgun Sequencing of the Sargasso Sea

Author: Owen White, Doug Rusch, Derrick E. Fouts, Kenneth H. Nealson, Jonathan A. Eisen, Anthony H. Knap, John F. Heidelberg, Hamilton O. Smith, J. Craig Venter, Karin A. Remington, Holly Baden-Tillson, Samuel Levy, Jeremy Peterson, Michael W. Lomas, Cynthia Pfannkoch, William C. Nelson, Karen E. Nelson, Dongying Wu, Jeff Hoffman, Ian T. Paulsen, Rachel Parsons, Aaron L. Halpern, and Yu-Hui Rogers
Subjects: Rhodopsin, Molecular Sequence Data, Nitrosopumilus, Biodiversity, Genomics, Biology, Cyanobacteria, Genome, Genes, Archaeal, Genome, Archaeal, Phylogenetics, Rhodopsins, Microbial, Bacteriophages, Seawater, Photosynthesis, Atlantic Ocean, Relative species abundance, Ecosystem, Phylogeny, Multidisciplinary, Bacteria, Shotgun sequencing, Ecology, Computational Biology, Genes, rRNA, Sequence Analysis, DNA, biology.organism_classification, Archaea, Eukaryotic Cells, Genes, Bacterial, Water Microbiology, Genome, Bacterial, Plasmids
Abstract: We have applied “whole-genome shotgun sequencing” to microbial populations collected en masse on tangential flow and impact filters from seawater samples collected from the Sargasso Sea near Bermuda. A total of 1.045 billion base pairs of nonredundant sequence was generated, annotated, and analyzed to elucidate the gene content, diversity, and relative abundance of the organisms within these environmental samples. These data are estimated to derive from at least 1800 genomic species based on sequence relatedness, including 148 previously unknown bacterial phylotypes. We have identified over 1.2 million previously unknown genes represented in these samples, including more than 782 new rhodopsin-like photoreceptors. Variation in species present and stoichiometry suggests substantial oceanic microbial diversity. Microorganisms are responsible for most of the biogeochemical cycles that shape the environment of Earth and its oceans. Yet, these organisms are the least well understood on Earth, as the ability to study and understand the metabolic potential of microorganisms has been hampered by the inability to generate pure cultures. Recent studies have begun to explore environ
Published: 2004
Full Text: View/download PDF

20. The Dog Genome: Survey Sequencing and Comparative Analysis

Author: Karin A. Remington, Arthur L. Delcher, Wei Wang, Samuel Levy, Mihai Pop, Aaron L. Halpern, Vineet Bafna, Ewen F. Kirkness, J. Craig Venter, Douglas B. Rusch, and Claire M. Fraser
Subjects: Male, Mutation rate, Molecular Sequence Data, Computational biology, Biology, Polymorphism, Single Nucleotide, Synteny, Genome, Contig Mapping, Mice, Dogs, Phylogenetics, Animals, Humans, RNA, Messenger, Conserved Sequence, Phylogeny, Repetitive Sequences, Nucleic Acid, Short Interspersed Nucleotide Elements, Genomic organization, Sequence (medicine), Genetics, Whole genome sequencing, Multidisciplinary, Genome, Human, Nucleic acid sequence, Computational Biology, Genetic Variation, Genomics, Sequence Analysis, DNA, Physical Chromosome Mapping, Chromosomes, Mammalian, Long Interspersed Nucleotide Elements, Mutation, DNA, Intergenic, Human genome, Sequence Alignment
Abstract: A survey of the dog genome sequence (6.22 million sequence reads; 1.5× coverage) demonstrates the power of sample sequencing for comparative analysis of mammalian genomes and the generation of species-specific resources. More than 650 million base pairs (>25%) of dog sequence align uniquely to the human genome, including fragments of putative orthologs for 18,473 of 24,567 annotated human genes. Mutation rates, conserved synteny, repeat content, and phylogeny can be compared among human, mouse, and dog. A variety of polymorphic elements are identified that will be valuable for mapping the genetic basis of diseases and traits in the dog.
Published: 2003
Full Text: View/download PDF

21. The Genome Sequence of the Malaria Mosquito Anopheles gambiae

Author: Mei Wang, Frank H. Collins, Yong Liang, José M. C. Ribeiro, Zhijian Tu, Jason R. Miller, Mark Yandell, Pantelis Topalis, Hongguang Shao, Qi Zhao, Hamilton O. Smith, Ali N Dana, Zhaoxi Ke, J. Craig Venter, Deborah R. Nusskern, Christos Louis, Ivica Letunic, Brian P. Walenz, Granger G. Sutton, Patrick Wincker, Anastasios C. Koutsos, Paul T. Brey, Ewan Birney, Jean Weissenbach, Fotis C. Kafatos, Cheryl A. Evans, Kerry J. Woodford, Dana Thomasova, Eugene W. Myers, Stephen L. Hoffman, Kokoza Eb, Josep F. Abril, Randall Bolanos, Megan A. Regier, Holly Baden, George K. Christophides, Véronique de Berardinis, Jingtao Sun, James R. Hogan, Kabir Chatuverdi, Ron Wides, Emmanuel Mongin, Igor F. Zhimulev, Steven L. Salzberg, Danita Baldwin, Richard J. Mural, Shiaoping C. Zhu, Anibal Cravchik, Jhy-Jhu Lin, G. Mani Subramanian, Young S. Hong, Shuang Cai, Francis Kalush, Rosane Charlab, Martin Wu, Claudia Blass, Mark Raymond Adams, Robert A. Holt, Clark M. Mobarry, Douglas B. Rusch, Michael Flanigan, Jim Biedler, Susanne L. Hladun, Ping Guan, Cynthia Sitter, Joel A. Malek, Mario Coluzzi, Cynthia Pfannkoch, Arthur L. Delcher, Alessandra della Torre, Maria F. Unger, Evgeny M. Zdobnov, Stephan Meister, Karin A. Remington, Peter W. Atkinson, Malcolm J. Gardner, Vladimir Benes, Ian M. Dew, Maria V. Sharakhova, X. Wang, Hongyu Zhang, Jian Wang, Jeffrey Hoover, Cheryl L. Kraft, Charles Roth, Andrew G. Clark, Shaying Zhao, Jyoti Shetty, Tina C. McIntosh, Aihui Wang, Zhiping Gu, Aaron L. Halpern, Anne Grundschober-Freimoser, David A. O'Brochta, Peter Arensburger, Brendan J. Loftus, Lucas Q. Ton, Véronique Anthouard, Mary Barnstead, John Lopez, Peer Bork, Didier Boscus, Michele Clamp, Jennifer R. Wortman, Claire M. Fraser, Lisa Friedli, William H. Majoros, Thomas J. Smith, Olivier Jaillon, Val Curwen, Samuel Broder, Sean D. Murphy, Roderic Guigó, Neil F. Lobo, Mathew A. Chrystal, Alison Yao, Alex Levitsky, Renee Strong, Maureen E. Hillenmeyer, Zhongwu Lai, Chinnappa D. 
Kodira, and Rong Qi
Subjects: Chromosomes, Artificial, Bacterial, Drosophila melanogaster/genetics, Mosquito Control, Proteome, Enzymes/chemistry/genetics/metabolism, Anopheles gambiae, Genes, Insect, Genome, Plasmodium falciparum/growth & development, Malaria, Falciparum, Expressed Sequence Tags, Genetics, Expressed sequence tag, Multidisciplinary, Physical Chromosome Mapping, Biological Evolution, Enzymes, Blood, Drosophila melanogaster, Insect Proteins, Digestion, Sequence analysis, Molecular Sequence Data, Plasmodium falciparum, Biology, Polymorphism, Single Nucleotide, Species Specificity, Anopheles, Genetic variation, Transcription Factors/chemistry/genetics/physiology, Animals, Humans, Insect Proteins/chemistry/genetics/physiology, Malaria, Falciparum/transmission, Gene, Anopheles/classification/genetics/parasitology/physiology, Whole genome sequencing, Haplotype, Computational Biology, Genetic Variation, Feeding Behavior, Sequence Analysis, DNA, biology.organism_classification, Insect Vectors, Gene Expression Regulation, Haplotypes, Chromosome Inversion, DNA Transposable Elements, Insect Vectors/genetics/parasitology/physiology, Transcription Factors
Abstract: Anopheles gambiae is the principal vector of malaria, a disease that afflicts more than 500 million people and causes more than 1 million deaths each year. Tenfold shotgun sequence coverage was obtained from the PEST strain of A. gambiae and assembled into scaffolds that span 278 million base pairs. A total of 91% of the genome was organized in 303 scaffolds; the largest scaffold was 23.1 million base pairs. There was substantial genetic variation within this strain, and the apparent existence of two haplotypes of approximately equal frequency (“dual haplotypes”) in a substantial fraction of the genome likely reflects the outbred nature of the PEST strain. The sequence produced a conservative inference of more than 400,000 single-nucleotide polymorphisms that showed a markedly bimodal density distribution. Analysis of the genome sequence revealed strong evidence for about 14,000 protein-encoding transcripts. Prominent expansions in specific families of proteins likely involved in cell adhesion and immunity were noted. An expressed sequence tag analysis of genes regulated by blood feeding provided insights into the physiological adaptations of a hematophagous insect.
Published: 2002
Full Text: View/download PDF

22. Efficiently detecting polymorphisms during the fragment assembly process

Author: Daniel Fasulo, Clark M. Mobarry, Ian M. Dew, and Aaron L. Halpern
Subjects: Statistics and Probability, Molecular Sequence Data, Population, Sequence assembly, DNA Fragmentation, Computational biology, Biology, Biochemistry, Genome, Set (abstract data type), Consensus Sequence, Indel, education, Molecular Biology, Sequence (medicine), Genetics, education.field_of_study, Polymorphism, Genetic, Base Sequence, Shotgun sequencing, Gene Expression Profiling, Genetic Variation, Sequence Analysis, DNA, Computer Science Applications, Computational Mathematics, Computational Theory and Mathematics, Graph (abstract data type), Sequence Alignment, Algorithms, Polymorphism, Restriction Fragment Length
Abstract: Motivation: Current genomic sequence assemblers assume that the input data is derived from a single, homogeneous source. However, recent whole-genome shotgun sequencing projects have violated this assumption, resulting in input fragments covering the same region of the genome whose sequences differ due to polymorphic variation in the population. While single-nucleotide polymorphisms (SNPs) do not pose a significant problem to state-of-the-art assembly methods, these methods do not handle insertion/deletion (indel) polymorphisms of more than a few bases. Results: This paper describes an efficient method for detecting sequence discrepancies due to polymorphism that avoids resorting to global use of more costly, less stringent affine sequence alignments. Instead, the algorithm uses graph-based methods to determine the small set of fragments involved in each polymorphism and performs more sophisticated alignments only among fragments in that set. Results from the incorporation of this method into the Celera Assembler are reported for the D. melanogaster, H. sapiens, and M. musculus genomes. Availability: The method described herein does not constitute a stand-alone software application, but is laid out in sufficient detail to be implemented as a component of any genomic sequence assembler. Contact: daniel.fasulo@celera.com Keywords: whole-genome assembly; shotgun sequencing; polymorphism.
Published: 2002
Full Text: View/download PDF

23. Computational techniques for human genome resequencing using mated gapped reads

Author: Jessica Ebert, Vitali Karpinchyk, Igor Nazarenko, Dennis G. Ballinger, Jonathan M. Baccash, Anushka Brownley, Matt Morenzoni, Krishna Pant, Geoffrey B. Nilsen, Radoje Drmanac, Aaron L. Halpern, Bruce K. Martin, and Paolo Carnevali
Subjects: Sequence assembly, Genomics, Biology, Contig Mapping, Genetics, Humans, Computer Simulation, Molecular Biology, Alleles, Whole genome sequencing, Sequence, Base Sequence, Models, Genetic, Genome, Human, Chromosome Mapping, Statistical model, Bayes Theorem, Sequence Analysis, DNA, Base (topology), Bayesian statistics, Computational Mathematics, Computational Theory and Mathematics, Modeling and Simulation, Data Interpretation, Statistical, Human genome, Algorithm, Algorithms
Abstract: Unchained base reads on self-assembling DNA nanoarrays have recently emerged as a promising approach to low-cost, high-quality resequencing of human genomes. Because of unique characteristics of these mated pair reads, existing computational methods for resequencing assembly, such as those based on map-consensus calling, are not adequate for accurate variant calling. We describe novel computational methods developed for accurate calling of SNPs and short substitutions and indels (100 bp); the same methods apply to evaluation of hypothesized larger, structural variations. We use an optimization process that iteratively adjusts the genome sequence to maximize its a posteriori probability given the observed reads. For each candidate sequence, this probability is computed using Bayesian statistics with a simple read generation model and simplifying assumptions that make the problem computationally tractable. The optimization process iteratively applies one-base substitutions, insertions, and deletions until convergence is achieved to an optimum diploid sequence. A local de novo assembly procedure that generalizes approaches based on De Bruijn graphs is used to seed the optimization process in order to reduce the chance of converging to local optima. Finally, a correlation-based filter is applied to reduce the false positive rate caused by the presence of repetitive regions in the reference genome.
Published: 2011

24. Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in marker gene phylogenetic trees

Author: J. Craig Venter, Martin Wu, Marvin Frazier, Douglas B. Rusch, Shibu Yooseph, Jonathan A. Eisen, Aaron L. Halpern, Dongying Wu, and Fleischer, Robert
Subjects: Genome, Computer Applications, Databases, Genetic, Genome Evolution, Phylogeny, Genetics, Plant Growth and Development, 0303 health sciences, Multidisciplinary, Phylogenetic tree, Ecology, Archaeal Evolution, Genomics, Phylogenetics, Multigene Family, Medicine, Algorithms, Biotechnology, Research Article, Archaeans, Sequence analysis, Evolution, General Science & Technology, Oceans and Seas, Science, Sequence alignment, Biology, Microbiology, DNA sequencing, Viral Evolution, Evolution, Molecular, 03 medical and health sciences, Databases, Genetic, Bacterial Proteins, Virology, Evolutionary Systematics, 14. Life underwater, 030304 developmental biology, Ribosomal, Evolutionary Biology, Bacterial Evolution, Base Sequence, 030306 microbiology, Molecular, Computational Biology, Genomic Evolution, Bacteriology, Comparative Genomics, rpoB, Organismal Evolution, Rec A Recombinases, Evolutionary biology, Metagenomics, RNA, Ribosomal, Evolutionary Ecology, Microbial Evolution, Computer Science, RNA, Environmental Protection, Developmental Biology
Abstract: Background: Most of our knowledge about the ancient evolutionary history of organisms has been derived from data associated with specific known organisms (i.e., organisms that we can study directly such as plants, metazoans, and culturable microbes). Recently, however, a new source of data for such studies has arrived: DNA sequence data generated directly from environmental samples. Such metagenomic data has enormous potential in a variety of areas including, as we argue here, in studies of very early events in the evolution of gene families and of species. Methodology/principal findings: We designed and implemented new methods for analyzing metagenomic data and used them to search the Global Ocean Sampling (GOS) expedition data set for novel lineages in three gene families commonly used in phylogenetic studies of known and unknown organisms: small subunit rRNA and the recA and rpoB superfamilies. Though the methods available could not accurately identify very deeply branched ss-rRNAs (largely due to difficulties in making robust sequence alignments for novel rRNA fragments), our analysis revealed the existence of multiple novel branches in the recA and rpoB gene families. Analysis of available sequence data likely from the same genomes as these novel recA and rpoB homologs was then used to further characterize the possible organismal source of the novel sequences. Conclusions/significance: Of the novel recA and rpoB homologs identified in the metagenomic data, some likely come from uncharacterized viruses while others may represent ancient paralogs not yet seen in any cultured organism. A third possibility is that some come from novel cellular lineages that are only distantly related to any organisms for which sequence data is currently available. If there exist any major, but so-far-undiscovered, deeply branching lineages in the tree of life, we suggest that methods such as those described herein currently offer the best way to search for them.
Published: 2011

25. Functional genomic signatures of sponge bacteria reveal unique and shared features of symbiosis

Author: Suhelen Egan, Pui Yi Yung, Doug Rusch, Karla B. Heidelberg, Staffan Kjelleberg, Torsten Thomas, Matthew Z. DeMaere, Matthew R. Lewis, Peter D. Steinberg, and Aaron L. Halpern
Subjects: DNA, Bacterial, Microbial metabolism, Biology, Microbiology, Symbiosis, RNA, Ribosomal, 16S, Animals, Seawater, Ecology, Evolution, Behavior and Systematics, Ecosystem, Phylogeny, Comparative Genomic Hybridization, Bacteria, Sequence Analysis, DNA, biology.organism_classification, Biological Evolution, Porifera, Tetratricopeptide, Sponge, Phylogenetic diversity, Evolutionary biology, Metagenomics, DNA Transposable Elements, Metagenome, Genome, Bacterial, Symbiotic bacteria
Abstract: Sponges form close relationships with bacteria, and a remarkable phylogenetic diversity of yet-uncultured bacteria has been identified from sponges using molecular methods. In this study, we use a comparative metagenomic analysis of the bacterial community in the model sponge Cymbastela concentrica and in the surrounding seawater to identify previously unrecognized genomic signatures and functions for sponge bacteria. We observed a surprisingly large number of transposable insertion elements, a feature also observed in other symbiotic bacteria, as well as a set of predicted mechanisms that may defend the sponge community against the introduction of foreign DNA and hence contribute to its genetic resilience. Moreover, several shared metabolic interactions between bacteria and host include vitamin production, nutrient transport and utilization, and redox sensing and response. Finally, an abundance of protein–protein interactions mediated through ankyrin and tetratricopeptide repeat proteins could represent a mechanism for the sponge to discriminate between food and resident bacteria. These data provide new insight into the evolution of symbiotic diversity, microbial metabolism and host–microbe interactions in sponges.
Published: 2010

26. Consensus generation and variant detection by Celera Assembler

Author: Nelson Axelrod, Jason R. Miller, Gennady Denisov, Samuel Levy, Granger G. Sutton, Brian P. Walenz, and Aaron L. Halpern
Subjects: Statistics and Probability, Sequence analysis, DNA Mutational Analysis, Molecular Sequence Data, Locus (genetics), Single-nucleotide polymorphism, Computational biology, Biology, Haploidy, Biochemistry, Genome, Gene Frequency, Consensus Sequence, Consensus sequence, Humans, Molecular Biology, Genetics, Base Sequence, Shotgun sequencing, Genome, Human, Chromosome Mapping, Genetic Variation, Sequence Analysis, DNA, Computer Science Applications, Computational Mathematics, Computational Theory and Mathematics, Human genome, Ploidy, Algorithms, Software
Abstract: Motivation: We present an algorithm to identify allelic variation given a Whole Genome Shotgun (WGS) assembly of haploid sequences, and to produce a set of haploid consensus sequences rather than a single consensus sequence. Existing WGS assemblers take a column-by-column approach to consensus generation, and produce a single consensus sequence which can be inconsistent with the underlying haploid alleles, and inconsistent with any of the aligned sequence reads. Our new algorithm uses a dynamic windowing approach. It detects alleles by simultaneously processing the portions of aligned reads spanning a region of sequence variation, assigns reads to their respective alleles, phases adjacent variant alleles and generates a consensus sequence corresponding to each confirmed allele. This algorithm was used to produce the first diploid genome sequence of an individual human. It can also be applied to assemblies of multiple diploid individuals and hybrid assemblies of multiple haploid organisms. Results: Applied to the individual human genome assembly, the new algorithm detects exactly two confirmed alleles and reports two consensus sequences in 98.98% of the 2 033 311 detected regions of sequence variation. In 33 269 out of 460 373 detected regions of size >1 bp, it fixes the constructed errors of a mosaic haploid representation of a diploid locus as produced by the original Celera Assembler consensus algorithm. Using an optimized procedure calibrated against 1 506 344 known SNPs, it detects 438 814 new heterozygous SNPs with a false positive rate of 12%. Availability: The open source code is available at: http://wgs-assembler.cvs.sourceforge.net/wgs-assembler/ Contact: gdenisov@jcvi.org
Published: 2008

27. The Sorcerer II Global Ocean Sampling Expedition: metagenomic characterization of viruses within aquatic microbial samples

Author: J. Craig Venter, Shibu Yooseph, John I. Glass, Shannon J. Williamson, Aaron L. Halpern, Christopher S. Miller, Marvin Frazier, Douglas Fadrosh, Cynthia Andrews-Pfannkoch, Douglas B. Rusch, Granger G. Sutton, and Karla B. Heidelberg
Subjects: Genome evolution, Sequence analysis, Genetic Linkage, Oceans and Seas, viruses, Science, Genome, Viral, Genome, Ecology/Marine and Freshwater Ecology, Phylogenetics, Virology, Microbiology/Environmental Microbiology, Molecular Biology, Phylogeny, Genetics, Multidisciplinary, Phylogenetic tree, biology, biology.organism_classification, Computational Biology/Metagenomics, Metagenomics, Horizontal gene transfer, Medicine, Prochlorococcus, Water Microbiology, Research Article
Abstract: Viruses are the most abundant biological entities on our planet. Interactions between viruses and their hosts impact several important biological processes in the world's oceans such as horizontal gene transfer, microbial diversity and biogeochemical cycling. Interrogation of microbial metagenomic sequence data collected as part of the Sorcerer II Global Ocean Expedition (GOS) revealed a high abundance of viral sequences, representing approximately 3% of the total predicted proteins. Cluster analyses of the viral sequences revealed hundreds to thousands of viral genes encoding various metabolic and cellular functions. Quantitative analyses of viral genes of host origin performed on the viral fraction of aquatic samples confirmed the viral nature of these sequences and suggested that significant portions of aquatic viral communities behave as reservoirs of such genetic material. Distributional and phylogenetic analyses of these host-derived viral sequences also suggested that viral acquisition of environmentally relevant genes of host origin is a more abundant and widespread phenomenon than previously appreciated. The predominant viral sequences identified within microbial fractions originated from tailed bacteriophages and exhibited varying global distributions according to viral family. Recruitment of GOS viral sequence fragments against 27 complete aquatic viral genomes revealed that only one reference bacteriophage genome was highly abundant and was closely related, but not identical, to the cyanomyovirus P-SSM4. The co-distribution across all sampling sites of P-SSM4-like sequences with the dominant ecotype of its host, Prochlorococcus, supports the classification of the viral sequences as P-SSM4-like and suggests that this virus may influence the abundance, distribution and diversity of one of the most dominant components of picophytoplankton in oligotrophic oceans. In summary, the abundance and broad geographical distribution of viral sequences within microbial fractions, the prevalence of genes among viral sequences that encode microbial physiological function and their distinct phylogenetic distribution lend strong support to the notion that viral-mediated gene acquisition is a common and ongoing mechanism for generating microbial diversity in the marine environment.
Published: 2008

28. Evolution of genes and genomes on the Drosophila phylogeny

Author: Adam M. Phillippy, Edward Grandbois, Pen MacDonald, Iain MacCallum, Laura K. Reed, Wojciech Makalowski, Tracey Honan, Tania Tassinari Rieger, Melissa J. Hubisz, Josep M. Comeron, Douglas Smith, Jennifer Godfrey, Sebastian Strempel, Amr Abdouelleil, Brenton Gravely, Harindra Arachi, Albert J. Vilella, Marc Azer, Sarah A. Teichmann, Roger A. Hoskins, Corbin D. Jones, Keenan Ross, Derek Wilson, Stuart J. Newfeld, John Stalker, Thomas D. Watts, Dennis C. Friedrich, Therese A. Markow, Michael U. Mollenhauer, Tina Goode, Geneva Young, Terry Shea, Krista Lance, Karin A. Remington, Kevin A. Edwards, Lynne Aftuck, Cecil Rise, Sheridon Channer, Matthew D. Rasmussen, Nicole Stange-Thomann, Annie Lui, Robert A. Reenan, Todd Sparrow, Dave Begun, Tamrat Negash, Laura K. Sirot, Adrianne Brand, Adam Brown, Daisuke Yamamoto, Pema Phunkhang, Justin Abreu, Russell Schwartz, Ana Llopart, Abderrahim Farina, Kebede Maru, Chung-I Wu, Allen Alexander, Scott Anderson, So Jeong Lee, Jason Blye, Gary H. Karpen, Wilfried Haerty, Daniel A. Barbash, Peter Rogov, Barry O'Neill, Rachel Mittelman, Jakob Skou Pedersen, Leanne Hughes, Robert K. Bradley, Graziano Pesole, Wyatt W. Anderson, Anthony J. Greenberg, Alejandro Sánchez-Gracia, Julio Rozas, Stephen W. Schaeffer, Yama Thoulutsang, Roger K. Butlin, David H. Ardell, Stuart DeGray, Chris P. Ponting, Deborah E. Stage, Corrado Caggese, Montserrat Aguadé, Casey M. Bergman, Diallo Ferguson, Peili Zhang, Jeffrey R. Powell, Hajime Sato, Xiaohong Liu, Marta Sabariego Puig, Michael Parisi, Passang Dorje, Yoshihiko Tomimura, Adal Abebe, Carlo G. Artieri, Brian Hurhula, Filip Rege, Peter D. Keightley, Andrew Barry, Pablo Alvarez, Tsamla Tsamla, Marvin Wasserman, Santosh Jagadeeshan, Daniel L. Halligan, Chelsea D. Foley, Kim D. Delehaunty, Manfred Grabherr, Sourav Chatterji, Angela N. Brooks, James C. Costello, Mieke Citroen, James A. Yorke, Hsiao Pei Yang, Charles Chapple, Jian Lu, Carlos A. Machado, Norbu Dhargay, Tsering Wangchuk, Anat Caspi, Patrick Cahill, Tashi Bayul, Lisa Levesque, Otero L. Oyono, Atanas Mihalev, Dawa Thoulutsang, Dawn N. Abt, Sujaa Raghuraman, Manyuan Long, Maria Mendez-Lago, Charles Matthews, Kimberly Dooley, Alex Wong, Melanie A. Huntley, William R. Jeck, Ira Topping, Ben Kanga, José P. Abad, Ana Cristina Lauer Garcia, Brikti Abera, Kunsang Gyaltsen, Jonathan Butler, Alicia Franke, Michael C. Schatz, Cheewhye Chin, Charles F. Aquadro, Justin Johnson, Bryant F. McAllister, Georgia Giannoukos, M. Erii Husby, Rod A. Wing, Shangtao Liu, Jean L. Chang, Jennifer Daub, Eiko Kataoka, Leopold Parts, Rakela Lubonja, Margaret Priest, Yoshiko N. Tobari, Teena Mehta, Evgeny M. Zdobnov, Yeshi Lokyitsang, Richard Elong, Matthew J. Parisi, Louis Meneus, Eric S. Lander, Alan Filipski, Gary Gearin, Nabil Hafez, Nicholas Sisneros, David B. Jaffe, Ian Holmes, Marina Sirota, Leonid Boguslavskiy, Lisa Chuda, LaDeana W. Hillier, Meizhong Luo, Phil Batterham, Michael Kleber, Richard K. Wilson, Yama Cheshatsang, Qing Yu, Rebecca Reyes, Matthew W. Hahn, Andreas Heger, Mar Marzo, Patrick Minx, Kerstin Lindblad-Toh, Vera L. S. Valente, Adam Wilson, William C. Jordan, Mohamed A. F. Noor, Chiao-Feng Lin, Asha Kamat, Heather Ebling, Mihai Pop, Frances Letendre, Mariana F. Wolfner, Don Gilbert, Ngawang Sherpa, Riza M. Daza, Oana Mihai, Gabriel C. Wu, Aaron M. Berlin, Ewen F. Kirkness, Monika D. Huard, Robert S. Fulton, Randall H. Brown, Danni Zhong, Sharon Stavropoulos, Venky N. Iyer, Xu Mu, Christina R. Gearin, David M. Rand, Jerry A. Coyne, Dan Hultmark, Jill Falk, Christopher Patti, Montserrat Papaceit, James Meldrim, Valentine Mlenga, Muneo Matsuda, Sven Findeiß, Todd A. Schlenke, Kevin McKernan, Brian P. Walenz, Timothy B. Sackton, Leonardo Koerich, Peter An, Robert Nicol, Chuong B. Do, Dmitry Khazanovich, Carmen Segarra, Maura Costello, St Christophe Acer, Claudia Rohde, Serafim Batzoglou, Hadi Quesneville, Evan Mauceli, Andy Vo, Luciano M. Matzkin, Susan E. Celniker, Patrick M. O’Grady, William M. Gelbart, Lloyd Low, Jamal Abdulkadir, Jessica Spaulding, Brian R. Calvi, Charlotte Henson, Robert David, Jennifer L. Hall, Andrew G. Clark, Anastasia Gardiner, Susan M. Russo, Birhane Hagos, Kerri Topham, Amy Denise Reily, Eli Venter, Jerome Naylor, Sandra W. Clifton, Valer Gotea, Samuel R. Gross, Manolis Kellis, Claude Bonnet, Christopher Strader, Tashi Lokyitsang, Nyima Norbu, Jennifer Baldwin, Stephen M. Mount, Robert L. Strausberg, Shailendra Yadav, Kristipati Ravi Ram, Steven L. Salzberg, Erik Gustafson, David A. Garfield, Eva Freyhult, Arthur L. Delcher, Enrico Blanco, Granger G. Sutton, Jason M. Tsolas, Charles Robin, Angie S. Hinrichs, Christopher D. Smith, Jane Wilkinson, Brendan McKernan, Fritz Pierre, William McCusker, Brian Oliver, Barry E. Garvin, Sudhir Kumar, Peter Kisner, Kunsang Dorjee, A. Bernardo Carvalho, Anna Montmayeur, Andrew Zimmer, Diana Shih, Wei Tao, Shiaw Pyng Yang, Sante Gnerre, Sampath Settipalli, Thu Nguyen, Paolo Barsanti, Brian P. Lazzaro, Sonja J. Prohaska, J. Craig Venter, Senait Tesfaye, Susan McDonough, Kim D. Pruitt, Alexander Stark, Sergio Castrezana, Lucinda Fulton, Richard T. Lapoint, Greg Gibson, John Spieth, Boris Adryan, Georgius De Haan, Sheila Fisher, Daniel A. Pollard, Seva Kashin, Rob J. Kulathinal, Michael B. Eisen, Nathaniel Novod, Christina Demaso, Alan Dupes, Amanda M. Larracuente, Toby Bloom, Alfredo Villasante, Charles H. Langley, Rama S. Singh, Niall J. Lennon, Kristi L. Montooth, Daniel Barker, Wolfgang Stephan, David Sturgill, Ruiqiang Li, Andrew Hollinger, Boris Boukhgalter, Talene Thomson, Patrick Cooke, Zac Zwirko, Nadia D. Singh, Michael Weiand, Lior Pachter, Roderic Guigó, Yu Zhang, Jay D. Evans, Stephanie Bosak, Rosie Levine, Lu Shi, Kiyohito Yoshida, Carolyn S. McBride, Pouya Kheradpour, William Brockman, Alberto Civetta, Hiroshi Akashi, Marcia Lara, Susan Faro, Sam Griffiths-Jones, Michael R. Brent, Thomas H. Eickbush, Gane Ka-Shu Wong, Elizabeth P. Ryan, Erica Anderson, Roberta Kwok, Asif T. Chinwalla, Sahal Osman, Nga Nguyen, Damiano Porcelli, Missole Doricent, Saverio Vicario, Marc Rubenfield, Bárbara Negre, Gillian M. Halter, Erin E. Dooley, Elena R. Lozovsky, William Lee, Alville Collymore, Catherine Stone, Tanya Mihova, Jun Wang, Karsten Kristiansen, Imane Bourzgui, Michael F. Lin, Katie D'Aco, Filipe G. Vieira, Choe Norbu, Yu-Hui Rogers, Aaron L. Halpern, Eugene W. Myers, Sharleen Grewal, Robert T. Good, Alfredo Ruiz, Dave Kudrna, Joseph Graham, Alex Lipovsky, Leonidas Mulrain, Tsering Wangdi, Roman Arguello, Mira V. Han, Arjun Bhutkar, Rasmus Nielsen, David J. Saranga, Aleksey V. Zimin, Vasilia Magnisalis, Helen Vassiliev, Thomas C. Kaufman, Eva Markiewicz, Temple F. Smith, Jinlei Liu, Loryn Gadbois, Michael G. Ritchie, Lisa Zembek, Daniel Bessette, Pasang Bachantsang, Adam Navidi, Department of Molecular Biology and Genetics, Cornell University [New York], Lawrence Berkeley National Laboratory [Berkeley] (LBNL), University of California [Berkeley], University of California, Agencourt Bioscience Corporation, Partenaires INRAE, Faculty of Life Science, University of Manchester [Manchester], Laboratory of Cellular and Developmental Biology (LCDB), NIDDK, NIH, Department of Ecology and Evolutionary Biology, University of Arizona, Department of Biology, Indiana University [Bloomington], Indiana University System-Indiana University System, Massachusetts Institute of Technology (MIT), Harvard University [Cambridge], Centro de Biología Molecular Severo Ochoa [Madrid] (CBMSO), Universidad Autonoma de Madrid (UAM)-Consejo Superior de Investigaciones Científicas [Madrid] (CSIC), Brown University, Laboratory of Molecular Biology, Medical Research Council, Departament de Genetica, Universitat de Barcelona (UB), Pennsylvania State University (Penn State), Penn State System-Penn State System, Department of Genetics, University of Georgia [USA], Uppsala University, Department of Ecology and Evolution [Lausanne], Université de Lausanne (UNIL), McMaster University, School of Biology, IE University, Università degli Studi di Bari Aldo Moro, University of Melbourne, Stanford University, University of California [Davis] (UC Davis), Boston University [Boston] (BU), Centro de Regulación Genómica (CRG), Universitat Pompeu Fabra [Barcelona] (UPF), Washington University in Saint Louis (WUSTL), University of Sheffield, Syracuse University, Universidade Federal Rural do Rio de Janeiro (UFRRJ), Department of Bioengineering, Beihang University (BUAA), Tucson Stock Center, Genome Center, University of California-University of California, Genome Sequencing Center, University of Washington School of Medicine, University of Winnipeg, Iowa State University (ISU), Indiana University System, The Wellcome Trust Sanger Institute [Cambridge], Center for Bioinformatics and Computational Biology, University of Delaware [Newark], Illinois State University, University of Rochester [USA], United States Department of Agriculture (USDA), Arizona State University [Tempe] (ASU), Leipzig University, Universidade Federal do Rio Grande do Sul (UFRGS), Duke University, North Carolina State University [Raleigh] (NC State), University of North Carolina System (UNC)-University of North Carolina System (UNC), University of Connecticut (UCONN), Computer Science Département, Université Saint-Esprit de Kaslik (USEK), Mc Master University, Indiana University, Institute of Evolutionary Biology, University of Edinburgh, J. Craig Venter Institute [La Jolla, USA] (JCVI), University of Oxford [Oxford], Center for Biomolecular Science and Engineering, Unité de Recherche Génomique Info (URGI), Institut National de la Recherche Agronomique (INRA), and Zdobnov, Evgeny
Subjects: melanogaster genome, 0106 biological sciences, RNA, Untranslated, [SDV]Life Sciences [q-bio], Genome, Insect, RNA, Untranslated/genetics, Genes, Insect, 01 natural sciences, Genome, Genome, Insect/ genetics, Gene Order, Genome, Mitochondrial/genetics, Drosophila Proteins, Phylogeny, ddc:616, Genetics, 0303 health sciences, Multidisciplinary, biology, Reproduction, Genomics, Multigene Family/genetics, Reproduction/genetics, DNA Transposable Elements/genetics, Genes, Insect/ genetics, Multigene Family, dosage compensation, Drosophila, amino-acid substitution, Drosophila Protein, Drosophila Proteins/genetics, Synteny/genetics, fruit-fly, 010603 evolutionary biology, Synteny, Drosophila sechellia, Evolution, Molecular, 03 medical and health sciences, Phylogenetics, Molecular evolution, Codon/genetics, [SDV.BV]Life Sciences [q-bio]/Vegetal Biology, Animals, adaptive protein evolution, Codon, 030304 developmental biology, Gene Order/genetics, molecular evolution, fungi, Immunity, synonymous codon usage, Sequence Analysis, DNA, Immunity/genetics, biology.organism_classification, Drosophila mojavensis, Evolutionary biology, Genome, Mitochondrial, DNA Transposable Elements, maximum-likelihood, noncoding dna, Drosophila/ classification/ genetics/immunology/metabolism, Sequence Alignment, natural-selection, Drosophila yakuba
Abstract: (Author affiliations: see page 216 of the article.) Comparative analysis of multiple genomes in a phylogenetic framework dramatically improves the precision and sensitivity of evolutionary inference, producing more robust results than single-genome analyses can provide. The genomes of 12 Drosophila species, ten of which are presented here for the first time (sechellia, simulans, yakuba, erecta, ananassae, persimilis, willistoni, mojavensis, virilis and grimshawi), illustrate how rates and patterns of sequence divergence across taxa can illuminate evolutionary processes on a genomic scale. These genome sequences augment the formidable genetic tools that have made Drosophila melanogaster a pre-eminent model for animal genetics, and will further catalyse fundamental research on mechanisms of development, cell biology, genetics, disease, neurobiology, behaviour, physiology and evolution. Despite remarkable similarities among these Drosophila species, we identified many putatively non-neutral changes in protein-coding genes, non-coding RNA genes, and cis-regulatory regions. These may prove to underlie differences in the ecology and behaviour of these diverse species.
Published: 2007
Full Text: View/download PDF

29. A New Human Genome Sequence Paves the Way for Individualized Genomics

Author: Jiaqi Huang, Marvin Frazier, Vineet Bafna, Brian P. Walenz, Jon Borman, Samuel Levy, Josep F. Abril, Yu-Hui Rogers, Aaron L. Halpern, Vikas Bansal, J. Craig Venter, Ewen F. Kirkness, Timothy B. Stockwell, Jeffrey R. MacDonald, Granger G. Sutton, Pauline C. Ng, John Gill, Karen Beeson, Karin A. Remington, Alexia Tsiamouri, Robert L. Strausberg, Nelson Axelrod, Lars Feuk, Yuan Lin, Mary Shago, Andy Wing Chun Pang, Dana A. Busam, Gennady Denisov, Saul A. Kravitz, Tina C McIntosh, Stephen W. Scherer, and Universitat de Barcelona
Subjects: Male, ADN, Gene Dosage, Genoma humà, Genome, 0302 clinical medicine, INDEL Mutation, Homo (Human), Human Genome Project, Chromosomes, Human, Biology (General), In Situ Hybridization, Fluorescence, Genetics, Mammals, 0303 health sciences, General Neuroscience, Chromosome Mapping, Genome project, Genomics, Middle Aged, Pedigree, Phenotype, Synopsis, General Agricultural and Biological Sciences, Research Article, Human, Primates, Genome evolution, Genotype, Bioinformatics, QH301-705.5, Molecular Sequence Data, Biology, Polymorphism, Single Nucleotide, General Biochemistry, Genetics and Molecular Biology, 03 medical and health sciences, Bioinformàtica, Gene density, Humans, Genome size, 030304 developmental biology, Chromosomes, Human, Y, General Immunology and Microbiology, Human genome, Base Sequence, Genome, Human, Reproducibility of Results, Genetics and Genomics, DNA, Sequence Analysis, DNA, Microarray Analysis, Diploidy, Genòmica, Haplotypes, 030217 neurology & neurosurgery, Reference genome
Abstract: Presented here is a genome sequence of an individual human. It was produced from ∼32 million random DNA fragments, sequenced by Sanger dideoxy technology and assembled into 4,528 scaffolds, comprising 2,810 million bases (Mb) of contiguous sequence with approximately 7.5-fold coverage for any given region. We developed a modified version of the Celera assembler to facilitate the identification and comparison of alternate alleles within this individual diploid genome. Comparison of this genome and the National Center for Biotechnology Information human reference assembly revealed more than 4.1 million DNA variants, encompassing 12.3 Mb. These variants (of which 1,288,319 were novel) included 3,213,401 single nucleotide polymorphisms (SNPs), 53,823 block substitutions (2–206 bp), 292,102 heterozygous insertion/deletion events (indels) (1–571 bp), 559,473 homozygous indels (1–82,711 bp), 90 inversions, as well as numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events identified in the donor; however, these events involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure. Moreover, 44% of genes were heterozygous for one or more variants. Using a novel haplotype assembly strategy, we were able to span 1.5 Gb of genome sequence in segments >200 kb, providing further precision to the diploid nature of the genome. These data depict a definitive molecular portrait of a diploid human genome that provides a starting point for future genome comparisons and enables an era of individualized genomic information. Author Summary: We have generated an independently assembled diploid human genomic DNA sequence from both chromosomes of a single individual (J. Craig Venter). Our approach, based on whole-genome shotgun sequencing and using enhanced genome assembly strategies and software, generated an assembled genome over half of which is represented in large diploid segments (>200 kilobases), enabling study of the diploid genome. Comparison with previous reference human genome sequences, which were composites comprising multiple humans, revealed that the majority of genomic alterations are the well-studied class of variants based on single nucleotides (SNPs). However, the results also reveal that lesser-studied genomic variants, insertions and deletions, while comprising a minority (22%) of genomic variation events, actually account for almost 74% of variant nucleotides. Inclusion of insertion and deletion genetic variation into our estimates of interchromosomal difference reveals that only 99.5% similarity exists between the two chromosomal copies of an individual and that genetic variation between two individuals is as much as five times higher than previously estimated. The existence of a well-characterized diploid human genome sequence provides a starting point for future individual genome comparisons and enables the emerging era of individualized genomic information. Comparison of the DNA sequence of an individual human with the reference sequence reveals a surprising amount of difference.
Published: 2007

30. The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific

Author: Jonathan A. Eisen, Luisa I. Falcón, Jeff Hoffman, Dongying Wu, Joseph E. Venter, Valeria Souza, Shannon J. Williamson, Karen Beeson, Kenneth H. Nealson, Yu-Hui Rogers, Aaron L. Halpern, Kelvin Li, Germán Bonilla-Rosso, Robert Friedman, Jason Freeman, Karin A. Remington, Clare Stewart, Michael Ferrari, Douglas B. Rusch, Holly Baden-Tillson, Shubha Sathyendranath, Marvin Frazier, Granger G. Sutton, T. Utterback, Eldredge Bermingham, Robert L. Strausberg, Karla B. Heidelberg, Luis E. Eguiarte, J. Craig Venter, Cynthia Andrews-Pfannkoch, Victor A. Gallardo, David M. Karl, Trevor Platt, Saul A. Kravitz, John F. Heidelberg, Giselle Tamayo-Castillo, Bao Duc Tran, Hamilton O. Smith, Joyce Thorpe, Shibu Yooseph, and Moran, Nancy A
Subjects: Pelagibacter ubique, Genome evolution, Food Chain, QH301-705.5, Oceans and Seas, Genomics, Medical and Health Sciences, General Biochemistry, Genetics and Molecular Biology, Species Specificity, Phylogenetics, Genetics, Biology (General), Clade, Comparative genomics, General Immunology and Microbiology, biology, Agricultural and Veterinary Sciences, Ecology, General Neuroscience, Human Genome, Computational Biology, Biological Sciences, biology.organism_classification, Plankton, Phylogenetic diversity, Metagenomics, General Agricultural and Biological Sciences, Water Microbiology, Biotechnology, Developmental Biology
Abstract: The world's oceans contain a complex mixture of micro-organisms that are, for the most part, uncharacterized both genetically and biochemically. We report here a metagenomic study of the marine planktonic microbiota in which surface (mostly marine) water samples were analyzed as part of the Sorcerer II Global Ocean Sampling expedition. These samples, collected across a several-thousand km transect from the North Atlantic through the Panama Canal and ending in the South Pacific, yielded an extensive dataset consisting of 7.7 million sequencing reads (6.3 billion bp). Though a few major microbial clades dominate the planktonic marine niche, the dataset contains great diversity with 85% of the assembled sequence and 57% of the unassembled data being unique at a 98% sequence identity cutoff. Using the metadata associated with each sample and sequencing library, we developed new comparative genomic and assembly methods. One comparative genomic method, termed "fragment recruitment," addressed questions of genome structure, evolution, and taxonomic or phylogenetic diversity, as well as the biochemical diversity of genes and gene families. A second method, termed "extreme assembly," made possible the assembly and reconstruction of large segments of abundant but clearly nonclonal organisms. Within all abundant populations analyzed, we found extensive intra-ribotype diversity in several forms: (1) extensive sequence variation within orthologous regions throughout a given genome; despite coverage of individual ribotypes approaching 500-fold, most individual sequencing reads are unique; (2) numerous changes in gene content, some with direct adaptive implications; and (3) hypervariable genomic islands that are too variable to assemble. The intra-ribotype diversity is organized into genetically isolated populations that have overlapping but independent distributions, implying distinct environmental preference. We present novel methods for measuring the genomic similarity between metagenomic samples and show how they may be grouped into several community types. Specific functional adaptations can be identified both within individual ribotypes and across the entire community, including proteorhodopsin spectral tuning and the presence or absence of the phosphate-binding gene PstS.
Published: 2007

31. The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families

Author: Aaron L. Halpern, Shannon J. Williamson, Kannan Natarajan, Susan S. Taylor, Granger G. Sutton, Marcin P. Joachimiak, David Eisenberg, Robert Friedman, Christopher S. Miller, Gerard Manning, Shibu Yooseph, David A W Soergel, Christopher van Belle, Douglas B. Rusch, Karin A. Remington, Steven E. Brenner, Jack E. Dixon, Yufeng Zhai, Robert L. Strausberg, Karla B. Heidelberg, Vineet Bafna, Jonathan A. Eisen, Shaun W. Lee, Weizhong Li, Piotr Cieplak, John-Marc Chandonia, Huiying Li, Benjamin J. Raphael, J. Craig Venter, Susan T. Mashiyama, Lukasz Jaroszewski, Marvin Frazier, and Adam Godzik
Subjects: Protein family, QH301-705.5, Gene prediction, Protein domain, Sequence alignment, Genomics, Computational biology, Biology, Medical and Health Sciences, General Biochemistry, Genetics and Molecular Biology, Structural genomics, 03 medical and health sciences, 14. Life underwater, Biology (General), 030304 developmental biology, Genetics, 0303 health sciences, General Immunology and Microbiology, Agricultural and Veterinary Sciences, 030306 microbiology, Shotgun sequencing, General Neuroscience, Biological Sciences, Metagenomics, General Agricultural and Biological Sciences, Developmental Biology
Abstract: Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.
Published: 2007

32. Ancient noncoding elements conserved in the human genome

Author: J. Craig Venter, Robert L. Strausberg, Nidhi Dandona, Byrappa Venkatesh, Alison P. Lee, Alice Tay, Aaron L. Halpern, Yong-Hwee E. Loh, Lakshmi D. Viswanathan, Justin Johnson, Ewen F. Kirkness, and Sydney Brenner
Subjects: Molecular Sequence Data, Biology, Regulatory Sequences, Nucleic Acid, Genome, Conserved sequence, Evolution, Molecular, Intergenic region, Molecular evolution, biology.animal, Animals, Humans, Conserved Sequence, Zebrafish, Whole genome sequencing, Genetics, Multidisciplinary, Base Sequence, Genome, Human, Vertebrate, Takifugu, Enhancer Elements, Genetic, Regulatory sequence, Sharks, Human genome, DNA, Intergenic
Abstract: Cartilaginous fishes represent the living group of jawed vertebrates that diverged from the common ancestor of human and teleost fish lineages about 530 million years ago. We generated ~1.4× genome sequence coverage for a cartilaginous fish, the elephant shark (Callorhinchus milii), and compared this genome with the human genome to identify conserved noncoding elements (CNEs). The elephant shark sequence revealed twice as many CNEs as were identified by whole-genome comparisons between teleost fishes and human. The ancient vertebrate-specific CNEs in the elephant shark and human genomes are likely to play key regulatory roles in vertebrate gene expression.
Published: 2006

33. Correction for Goldberg et al., A Sanger/pyrosequencing hybrid approach for the generation of high-quality draft assemblies of marine microbial genomes

Author: Granger G. Sutton, Yu-Hui Rogers, Aaron L. Halpern, Susanne M. D. Goldberg, Justin Johnson, Federico M. Lauro, Robert L. Strausberg, Saul A. Kravitz, Eli Venter, Luke J. Tallon, Dana A. Busam, Steve Ferriera, Torsten Thomas, Hoda Khouri, Marvin Frazier, Kelvin Li, Robert Friedman, Tamara Feldblyum, and J. Craig Venter
Subjects: Multidisciplinary, Microbial Genomes, media_common.quotation_subject, Pyrosequencing, Correction, Quality (business), Computational biology, Biology, Hybrid approach, media_common
Published: 2006

34. A Sanger/pyrosequencing hybrid approach for the generation of high-quality draft assemblies of marine microbial genomes

Author: J. Craig Venter, Justin Johnson, Luke J. Tallon, Dana A. Busam, Marvin Frazier, Kelvin Li, Torsten Thomas, Steve Ferriera, Aaron L. Halpern, Yu-Hui Rogers, Robert L. Strausberg, Tamara Feldblyum, Hoda Khouri, Robert M. Friedman, Granger G. Sutton, Susanne M. D. Goldberg, Saul A. Kravitz, Eli Venter, and Federico M. Lauro
Subjects: Cancer genome sequencing, Sanger sequencing, Genetics, Multidisciplinary, Massive parallel sequencing, Shotgun sequencing, Sequence assembly, Computational Biology, Genomics, Computational biology, Sequence Analysis, DNA, Biology, Biological Sciences, symbols.namesake, Contig Mapping, Metagenomics, Genes, Bacterial, symbols, ABI Solid Sequencing, Genome, Bacterial, Biotechnology
Abstract: Since its introduction a decade ago, whole-genome shotgun sequencing (WGS) has been the main approach for producing cost-effective and high-quality genome sequence data. Until now, the Sanger sequencing technology that has served as a platform for WGS has not been truly challenged by emerging technologies. The recent introduction of the pyrosequencing-based 454 sequencing platform (454 Life Sciences, Branford, CT) offers a very promising sequencing technology alternative for incorporation in WGS. In this study, we evaluated the utility and cost-effectiveness of a hybrid sequencing approach using 3730xl Sanger data and 454 data to generate higher-quality, lower-cost assemblies of microbial genomes compared to current Sanger sequencing strategies alone.
Published: 2006

35. Whole-genome shotgun assembly and comparison of human genome assemblies

Author: Clark M. Mobarry, Shibu Yooseph, Jason R. Miller, Brian P. Walenz, Xiangqun Holly Zheng, Randall Bolanos, Karin A. Remington, Liliana Florea, Sorin Istrail, Ian Dew, Michael J. Flanigan, Deborah R. Nusskern, Daniel H. Huson, Aaron L. Halpern, Sridhar Hannenhalli, Laurent Mouchard, Fu Lu, Granger G. Sutton, Eugene W. Myers, Hagit Shatkay, Evan E. Eichler, Ross A. Lippert, Russell Turner, Knut Reinert, Andrew G. Clark, Arthur L. Delcher, Bjarni V. Halldorsson, Nathan Edwards, Saul A. Kravitz, J. Craig Venter, Michael S. Waterman, Bixiong Chris Shue, Daniel Fasulo, Michael W. Hunkapiller, Mark Raymond Adams, and Fei Zhong
Subjects: Genetics, Cancer genome sequencing, Multidisciplinary, Shotgun sequencing, Genome, Human, Sequence assembly, Computational Biology, Hybrid genome assembly, Computational biology, Genome project, Biology, Biological Sciences, Genome, Contig Mapping, Human Genome Project, Humans, Human genome, RNA, Messenger, Software, Reference genome
Abstract: We report a whole-genome shotgun assembly (called WGSA) of the human genome generated at Celera in 2001. The Celera-generated shotgun data set consisted of 27 million sequencing reads organized in pairs by virtue of end-sequencing 2-kbp, 10-kbp, and 50-kbp inserts from shotgun clone libraries. The quality-trimmed reads covered the genome 5.3 times, and the inserts from which pairs of reads were obtained covered the genome 39 times. With the nearly complete human DNA sequence [National Center for Biotechnology Information (NCBI) Build 34] now available, it is possible to directly assess the quality, accuracy, and completeness of WGSA and of the first reconstructions of the human genome reported in two landmark papers in February 2001 [Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., et al. (2001) Science 291, 1304–1351; International Human Genome Sequencing Consortium (2001) Nature 409, 860–921]. The analysis of WGSA shows 97% order and orientation agreement with NCBI Build 34, where most of the 3% of sequence out of order is due to scaffold placement problems as opposed to assembly errors within the scaffolds themselves. In addition, WGSA fills some of the remaining gaps in NCBI Build 34. The early genome sequences all covered about the same amount of the genome, but they did so in different ways. The Celera results provide more order and orientation, and the consortium sequence provides better coverage of exact and nearly exact repeats.
Published: 2004

36. Comparative genome and proteome analysis of Anopheles gambiae and Drosophila melanogaster

Author: Dana Thomasova, Suzanna E. Lewis, Hans Michael Mueller, George K. Christophides, José M. C. Ribeiro, Michael A. Wells, Ron Wides, Patrick Wincker, Rosane Charlab, Peer Bork, Steven L. Salzberg, Gerald M. Rubin, G. Mani Subramanian, Kokoza Eb, Cheryl L. Kraft, Robert A. Holt, Aaron L. Halpern, Ewan Birney, David Torrents, Frank H. Collins, Ivica Letunic, Richard R. Copley, Pantelis Topalis, Granger G. Sutton, Zhongwu Lai, Fotis C. Kafatos, Carolina Barillas-Mury, Evgeny M. Zdobnov, Mikita Suyama, William M. Gelbart, George Dimopoulos, Christos Louis, Mark Yandell, Deborah Nusskern, John H. Law, and Christian von Mering
Subjects: Anopheles/chemistry/genetics/physiology, Proteome, Anopheles gambiae, Genes, Insect, Genome, Synteny, Chromosomes, Homology (biology), Species Specificity, Drosophilidae, Dosage Compensation, Genetic, Drosophila Proteins/chemistry/genetics/physiology, Sequence Homology, Nucleic Acid, Anopheles, Gene Order, Drosophila Proteins, Animals, Cluster Analysis, Insect Proteins/chemistry/genetics/physiology, Gene, Genetics, Multidisciplinary, biology, Chromosomes/genetics, Intron, myr, Exons, biology.organism_classification, Physical Chromosome Mapping, Biological Evolution, Introns, Protein Structure, Tertiary, Drosophila melanogaster, Chromosome Inversion, Insect Proteins, Drosophila melanogaster/chemistry/genetics/physiology, Pseudogenes
Abstract: Comparison of the genomes and proteomes of the two diptera Anopheles gambiae and Drosophila melanogaster, which diverged about 250 million years ago, reveals considerable similarities. However, numerous differences are also observed; some of these must reflect the selection and subsequent adaptation associated with different ecologies and life strategies. Almost half of the genes in both genomes are interpreted as orthologs and show an average sequence identity of about 56%, which is slightly lower than that observed between the orthologs of the pufferfish and human (diverged about 450 million years ago). This indicates that these two insects diverged considerably faster than vertebrates. Aligned sequences reveal that orthologous genes have retained only half of their intron/exon structure, indicating that intron gains or losses have occurred at a rate of about one per gene per 125 million years. Chromosomal arms exhibit significant remnants of homology between the two species, although only 34% of the genes colocalize in small “microsyntenic” clusters, and major interarm transfers as well as intra-arm shuffling of gene order are detected.
Published: 2002

37. Design of a compartmentalized shotgun assembler for the human genome

Author: Clark M. Mobarry, Granger G. Sutton, Eugene W. Myers, Ian M. Dew, Arthur L. Delcher, Michael Flanigan, Aaron L. Halpern, Zhongwu Lai, Knut Reinert, Saul A. Kravitz, Daniel H. Huson, and Karin A. Remington
Subjects: Statistics and Probability, Genetics, Chromosomes, Artificial, Bacterial, Shotgun sequencing, Genome, Human, Computational Biology, Genomics, Shotgun, Computational biology, Sequence Analysis, DNA, Biology, Biochemistry, Computer Science Applications, Computational Mathematics, Computational Theory and Mathematics, Humans, Human genome, Cloning, Molecular, Databases, Nucleic Acid, Molecular Biology, Software
Abstract: Two different strategies for determining the human genome are currently being pursued: one is the “clone-by-clone” approach, employed by the publicly funded project, and the other is the “whole genome shotgun assembler” approach, favored by researchers at Celera Genomics. An interim strategy employed at Celera, called compartmentalized shotgun assembly, makes use of preliminary data produced by both approaches. In this paper we describe the design, implementation and operation of the “compartmentalized shotgun assembler”. Contact: Knut.Reinert@celera.com
Published: 2001

38. Minimally selected p and other tests for a single abrupt changepoint in a binary sequence

Author: Aaron L. Halpern
Subjects: Statistics and Probability, Anderson–Darling test, Biometry, Kolmogorov–Smirnov test, Pseudorandom binary sequence, General Biochemistry, Genetics and Molecular Biology, symbols.namesake, Statistics, Humans, Statistic, Fisher's exact test, Mathematics, Recombination, Genetic, Likelihood Functions, Chi-Square Distribution, Models, Statistical, Statistics::Applications, General Immunology and Microbiology, Base Sequence, Applied Mathematics, HIV, General Medicine, Sequence Analysis, DNA, Exact test, Counting problem, Likelihood-ratio test, symbols, General Agricultural and Biological Sciences, Sequence Alignment
Abstract: Summary. A novel changepoint statistic based on the minimum value, over possible changepoint locations, of Fisher's Exact Test, is introduced. Specific points in the exact distribution of the minimally selected Fisher's value may be rapidly calculated as a lattice-path counting problem via known recurrence methods. The test is compared to the Kolmogorov-Smirnov two-sample test, the maximally selected chi-square, and a likelihood ratio test. The tests are applied to assessing recombination in genetic sequences of HIV.
Published: 2001
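
Illustrative sketch (not part of the catalog record): the statistic described in the abstract above, the minimum Fisher's exact test p-value over all candidate changepoint locations, can be computed directly for a short binary sequence. The function names `fisher_exact_two_sided` and `min_selected_fisher` are hypothetical, and the two-sided p-value convention used here (summing all tables no more probable than the observed one) is an assumption. The sketch returns only the raw statistic; it does not implement the lattice-path calculation of the statistic's exact null distribution that the abstract mentions.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Assumed convention: sum the hypergeometric probabilities of every table
    with the same margins that is no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))  # tolerance for float ties
    return min(1.0, total)


def min_selected_fisher(seq):
    """Return (smallest p-value, split index k) over all splits seq[:k] | seq[k:].

    seq is a list of 0/1 values. k is None if no split gives p below 1.
    """
    n, total = len(seq), sum(seq)
    best_p, best_k = 1.0, None
    left_ones = 0
    for k in range(1, n):
        left_ones += seq[k - 1]
        right_ones = total - left_ones
        p = fisher_exact_two_sided(left_ones, k - left_ones,
                                   right_ones, (n - k) - right_ones)
        if p < best_p:
            best_p, best_k = p, k
    return best_p, best_k


if __name__ == "__main__":
    print(min_selected_fisher([0, 0, 0, 0, 0, 1, 1, 1, 1, 1]))
```

Because the minimum is taken over many candidate locations, the returned value is not itself a valid p-value; it has to be referred to the null distribution of the minimum, which is the subject of the paper.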

39. Multiple-changepoint testing for an alternating segments model of a binary sequence

Author: Aaron L. Halpern
Subjects: Statistics and Probability, Monte Carlo method, Molecular Sequence Data, Pseudorandom binary sequence, Genes, env, General Biochemistry, Genetics and Molecular Biology, Combinatorics, Computational statistics, Alternation (formal language theory), Humans, Segmentation, Mathematics, Recombination, Genetic, Sequence, Models, Statistical, General Immunology and Microbiology, Base Sequence, Models, Genetic, Applied Mathematics, General Medicine, Genes, gag, Constraint (information theory), Dynamic programming, HIV-1, General Agricultural and Biological Sciences, Algorithm
Abstract: Summary. A binary sequence may give the appearance of being composed of alternating segments with relatively high and relatively low probability of success. Determining whether such an alternating pattern is significant is a multiple-changepoint problem where the number of segments and their success probabilities are unknown, with the added constraint of segment alternation. A dynamic programming method for determining the optimal segmentation into a given number of segments is provided. Given this, a variation on the simulation method of Venter and Steel (1996, Computational Statistics and Data Analysis 22, 481–504) may be employed to test the null hypothesis of a homogeneous sequence as well as to estimate the number and location of changepoints. A sample application, the assessment of the possibility of genetic recombination in HIV sequences, is presented.
Published: 2000
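
Illustrative sketch (not part of the catalog record): the abstract above mentions a dynamic programming method for finding the optimal segmentation of a binary sequence into a given number of segments. The code below is a generic version of that idea under stated assumptions: each segment is scored by its maximized Bernoulli log-likelihood, and the paper's alternation constraint (adjacent segments alternating between high and low success probability) is deliberately omitted. The names `seg_loglik` and `best_segmentation` are hypothetical.

```python
from math import log


def seg_loglik(ones, length):
    """Maximized Bernoulli log-likelihood of a segment with `ones` successes."""
    zeros = length - ones
    ll = 0.0
    if ones:
        ll += ones * log(ones / length)
    if zeros:
        ll += zeros * log(zeros / length)
    return ll


def best_segmentation(seq, k):
    """Split seq (list of 0/1) into exactly k contiguous non-empty segments.

    Returns (total log-likelihood, list of the k-1 changepoint indices).
    Runs in O(k * n^2) time. No alternation constraint is enforced.
    """
    n = len(seq)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(seq)")
    prefix = [0]
    for x in seq:
        prefix.append(prefix[-1] + x)

    neg_inf = float("-inf")
    # best[j][i]: best score for splitting seq[:i] into j segments.
    best = [[neg_inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(j, n + 1):
            for s in range(j - 1, i):  # last segment is seq[s:i]
                if best[j - 1][s] == neg_inf:
                    continue
                cand = best[j - 1][s] + seg_loglik(prefix[i] - prefix[s], i - s)
                if cand > best[j][i]:
                    best[j][i], back[j][i] = cand, s

    # Walk the back-pointers from the end to recover segment start indices.
    starts, i = [], n
    for j in range(k, 0, -1):
        i = back[j][i]
        starts.append(i)
    starts.reverse()
    return best[k][n], starts[1:]  # drop the leading 0


if __name__ == "__main__":
    print(best_segmentation([0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0], 3))
```

Enforcing alternation would require tracking whether each segment is a "high" or "low" segment in the DP state; that extension is left out here to keep the sketch short.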

40. Large-scale comparison of fungal sequence information: mechanisms of innovation in Neurospora crassa and gene loss in Saccharomyces cerevisiae

Author: Aaron L. Halpern, Mary Anne Nelson, Edward L. Braun, and Donald O. Natvig
Subjects: Databases, Factual, Saccharomyces cerevisiae, Genes, Fungal, Genome, Aspergillus nidulans, Neurospora crassa, Evolution, Molecular, Fungal Proteins, Molecular evolution, Sequence Homology, Nucleic Acid, Genetics, Gene family, DNA, Fungal, Gene, Genetics (clinical), Expressed Sequence Tags, biology, Crassa, Genetic Variation, biology.organism_classification, Horizontal gene transfer, Genome, Fungal, Gene Deletion
Abstract: We report a large-scale comparison of sequence data from the filamentous fungus Neurospora crassa with the complete genome sequence of Saccharomyces cerevisiae. N. crassa is considerably more morphologically and developmentally complex than S. cerevisiae. We found that N. crassa has a much higher proportion of “orphan” genes than S. cerevisiae, suggesting that its morphological complexity reflects the acquisition or maintenance of novel genes, consistent with its larger genome. Our results also indicate the loss of specific genes from S. cerevisiae. Surprisingly, some of the genes lost from S. cerevisiae are involved in basic cellular processes, including translation and ion (especially calcium) homeostasis. Horizontal gene transfer from prokaryotes appears to have played a relatively modest role in the evolution of the N. crassa genome. Differences in the overall rate of molecular evolution between N. crassa and S. cerevisiae were not detected. Our results indicate that the current public sequence databases have fairly complete samples of gene families with ancient conserved regions, suggesting that further sequencing will not substantially change the proportion of genes with homologs among distantly related groups. Models of the evolution of fungal genomes compatible with these results, and their functional implications, are discussed.
Published: 2000

41. A whole-genome assembly of Drosophila

Author: Ming Zhan, Deborah R. Nusskern, Eugene W. Myers, Xiangqun Zheng, Michael Flanigan, Randall Bolanos, Lin Chen, Ellen M. Beasley, Qing Zhang, Gerald M. Rubin, Mark Raymond Adams, Arthur L. Delcher, Clark M. Mobarry, Granger G. Sutton, Hui-Hsien Chou, Yong Liang, Dan P. Fasulo, J. Craig Venter, Rhonda C. Brandon, Eric L. Anson, Stefano Lonardi, Patrick J. Dunn, Saul A. Kravitz, Zhongwu Lai, Karin A. Remington, Catherine Jordan, Knut Reinert, Ian M. Dew, and Aaron L. Halpern
Subjects: Euchromatin, Molecular Sequence Data, Sequence assembly, Hybrid genome assembly, Genes, Insect, Computational biology, Biology, Sequence-tagged site, Contig Mapping, Heterochromatin, Animals, Phrap, Repetitive Sequences, Nucleic Acid, Sequence Tagged Sites, Genetics, Multidisciplinary, Genome, Contig, Assembly software, DNA sequencing theory, Computational Biology, Sequence Analysis, DNA, Physical Chromosome Mapping, Chromatin, Drosophila melanogaster, Algorithms
Abstract: We report on the quality of a whole-genome assembly of Drosophila melanogaster and the nature of the computer algorithms that accomplished it. Three independent external data sources essentially agree with and support the assembly's sequence and ordering of contigs across the euchromatic portion of the genome. In addition, there are isolated contigs that we believe represent nonrepetitive pockets within the heterochromatin of the centromeres. Comparison with a previously sequenced 2.9-megabase region indicates that sequencing accuracy within nonrepetitive segments is greater than 99.99% without manual curation. As such, this initial reconstruction of the Drosophila sequence should be of substantial value to the scientific community.
Published: 2000

42. Weighted neighbor joining: a likelihood-based approach to distance-based phylogeny reconstruction

Author: Aaron L. Halpern, Nicholas D. Socci, and William J. Bruno
Subjects: Sequence, Gaussian, Statistical model, Biology, Models, Theoretical, Term (time), Evolution, Molecular, symbols.namesake, Additive function, Genetics, symbols, Animals, Humans, Computer Simulation, Likelihood function, Molecular Biology, Algorithm, Neighbor joining, Random variable, Ecology, Evolution, Behavior and Systematics, Phylogeny
Abstract: We introduce a distance-based phylogeny reconstruction method called ‘‘weighted neighbor joining,’’ or ‘‘Weighbor’’ for short. As in neighbor joining, two taxa are joined in each iteration; however, the Weighbor criterion for choosing a pair of taxa to join takes into account that errors in distance estimates are exponentially larger for longer distances. The criterion embodies a likelihood function on the distances, which are modeled as correlated Gaussian random variables with different means and variances, computed under a probabilistic model for sequence evolution. The Weighbor criterion consists of two terms, an additivity term and a positivity term, that quantify the implications of joining the pair. The first term evaluates deviations from additivity of the implied external branches, while the second term evaluates confidence that the implied internal branch has a positive branch length. Compared with maximum-likelihood phylogeny reconstruction, Weighbor is much faster, while building trees that are qualitatively and quantitatively similar. Weighbor appears to be relatively immune to the ‘‘long branches attract’’ and ‘‘long branch distracts’’ drawbacks observed with neighbor joining, BIONJ, and parsimony.
Published: 2000

43. A computer program designed to screen rapidly for HIV type 1 intersubtype recombinant sequences

Author: Aaron L. Halpern, Bette T. Korber, Adam Siepel, and Catherine A. Macken
Subjects: Recombination, Genetic, Base Sequence, Immunology, Molecular Sequence Data, Human immunodeficiency virus (HIV), Biology, medicine.disease_cause, Virology, Genes, env, Genes, gag, law.invention, Infectious Diseases, law, Computer software, DNA, Viral, Recombinant DNA, medicine, HIV-1, Humans, Phylogeny, Software
Abstract: n/a
Published: 1995

44. Abstract 4821: Somatic variation scoring and validation for large-scale cancer genome sequencing

Author: Dennis G. Ballinger, Steve Lincoln, Igor Nazarenko, Jonathan M. Baccash, Aaron L. Halpern, Krishna P. Pant, and Paolo Carnevali
Subjects: Genetics, Cancer genome sequencing, Cancer Research, Somatic cell, Cancer, Genomics, Biology, medicine.disease, Genome, DNA sequencing, Loss of heterozygosity, Oncology, medicine, Copy-number variation
Abstract: Complete Genomics has sequenced the genomes of over 100 tumor-normal sample pairs from a variety of cancers (e.g. Lee, et al., Nature 465: 473-477, 2010) using a unique high throughput sequencing platform (Drmanac, et al., Science 327:78-81, 2010). We have assembled a tool set to investigate somatic mutations in these samples, including small variations (SNVs, deletions, insertions and substitutions), as well as larger structural variations, copy number variations and regions of LOH (Loss of Heterozygosity). Included in this sample set are 12 cancer cell lines and matched normal cell lines obtained from the American Type Tissue Culture collection (ATCC), comprising 10 breast cancers and 2 lung cancers. We have developed and characterized somatic variation scoring methods for each variant class that can be tuned for particular applications and sample types. Comparative analyses with previously published data show high specificity and sensitivity of somatic variation detection. We will discuss the application of these methods for somatic variation analysis in larger cancer genome studies. Citation Format: {Authors}. {Abstract title} [abstract]. In: Proceedings of the 102nd Annual Meeting of the American Association for Cancer Research; 2011 Apr 2-6; Orlando, FL. Philadelphia (PA): AACR; Cancer Res 2011;71(8 Suppl):Abstract nr 4821. doi:10.1158/1538-7445.AM2011-4821
Published: 2011
Full Text: View/download PDF

45. Nanoliter Reactors Improve Multiple Displacement Amplification of Genomes from Single Cells

Author: Karen Beeson, Aaron L. Halpern, Timothy B. Stockwell, Roger S. Lasken, Brian P. Walenz, Yann Marcy, Stephen R. Quake, Susanne M. D. Goldberg, and Thomas Ishoey
Subjects: Cancer Research, lcsh:QH426-470, Microfluidics, Biophysics, Computational biology, Biology, Biochemistry, Genome, DNA sequencing, law.invention, chemistry.chemical_compound, law, Genetics, Nanotechnology, Molecular Biology, In Situ Hybridization, Fluorescence, Genetics (clinical), Ecology, Evolution, Behavior and Systematics, Polymerase chain reaction, Gene Amplification, Multiple displacement amplification, Genetics and Genomics, RNA Probes, Amplicon, Molecular biology, In Vitro, lcsh:Genetics, Eubacteria, genomic DNA, chemistry, Pyrosequencing, DNA, Research Article, Biotechnology
Abstract: Since only a small fraction of environmental bacteria are amenable to laboratory culture, there is great interest in genomic sequencing directly from single cells. Sufficient DNA for sequencing can be obtained from one cell by the Multiple Displacement Amplification (MDA) method, thereby eliminating the need to develop culture methods. Here we used a microfluidic device to isolate individual Escherichia coli and amplify genomic DNA by MDA in 60-nl reactions. Our results confirm a report that reduced MDA reaction volume lowers nonspecific synthesis that can result from contaminant DNA templates and unfavourable interaction between primers. The quality of the genome amplification was assessed by qPCR and compared favourably to single-cell amplifications performed in standard 50-μl volumes. Amplification bias was greatly reduced in nanoliter volumes, thereby providing a more even representation of all sequences. Single-cell amplicons from both microliter and nanoliter volumes provided high-quality sequence data by high-throughput pyrosequencing, thereby demonstrating a straightforward route to sequencing genomes from single cells. Author Summary: It is often challenging to manipulate or analyze the genetic material or genome of an individual cell. Biochemical DNA amplification technologies can be used to make many copies of the genome from a single cell, and in this paper we investigated how well such amplification works as a function of the reaction volume. We found that single-cell genome amplification in nanoliter volumes is much more effective than in microliter volumes, providing better representation of the starting genome with less bias in the product. It should therefore be possible to obtain high-quality genome sequences from single cells. This is useful because very few microbes can be obtained in pure culture, and are therefore only amenable to single-cell analysis.
Published: 2007
Full Text: View/download PDF

46. Survey Sequencing and Comparative Analysis of the Elephant Shark (Callorhinchus milii) Genome

Author: Alice Tay, Aaron L. Halpern, Justin Johnson, Alison P. Lee, Lakshmi D. Viswanathan, Ewen F. Kirkness, Nidhi Dandona, Byrappa Venkatesh, J. Craig Venter, Sydney Brenner, Yong-Hwee E. Loh, and Robert L. Strausberg
Subjects: QH301-705.5, Eukaryotes, Teleost Fishes, Molecular Sequence Data, chemical and pharmacologic phenomena, Genome, General Biochemistry, Genetics and Molecular Biology, Conserved sequence, biology.animal, Animals, Humans, Amino Acid Sequence, Biology (General), Gene, Zebrafish, Phylogeny, Repetitive Sequences, Nucleic Acid, Comparative genomics, Genetics, Evolutionary Biology, Base Sequence, General Immunology and Microbiology, biology, Phylogenetic tree, General Neuroscience, Vertebrate, Genetics and Genomics, DNA, biology.organism_classification, Chondrichthyes, Evolutionary biology, Vertebrates, Sharks, General Agricultural and Biological Sciences, human activities, Research Article
Abstract: Owing to their phylogenetic position, cartilaginous fishes (sharks, rays, skates, and chimaeras) provide a critical reference for our understanding of vertebrate genome evolution. The relatively small genome of the elephant shark, Callorhinchus milii, a chimaera, makes it an attractive model cartilaginous fish genome for whole-genome sequencing and comparative analysis. Here, the authors describe survey sequencing (1.4× coverage) and comparative analysis of the elephant shark genome, one of the first cartilaginous fish genomes to be sequenced to this depth. Repetitive sequences, represented mainly by a novel family of short interspersed element–like and long interspersed element–like sequences, account for about 28% of the elephant shark genome. Fragments of approximately 15,000 elephant shark genes reveal specific examples of genes that have been lost differentially during the evolution of tetrapod and teleost fish lineages. Interestingly, the degree of conserved synteny and conserved sequences between the human and elephant shark genomes are higher than that between human and teleost fish genomes. Elephant shark contains putative four Hox clusters indicating that, unlike teleost fish genomes, the elephant shark genome has not experienced an additional whole-genome duplication. These findings underscore the importance of the elephant shark as a critical reference vertebrate genome for comparative analysis of the human and other vertebrate genomes. This study also demonstrates that a survey-sequencing approach can be applied productively for comparative analysis of distantly related vertebrate genomes. Author Summary: Cartilaginous fishes (sharks, rays, skates, and chimaeras) are the phylogenetically oldest group of living jawed vertebrates. They are also an important outgroup for understanding the evolution of bony vertebrates such as human and teleost fishes. We performed survey sequencing (1.4× coverage) of a chimaera, the elephant shark (Callorhinchus milii). The elephant shark genome, estimated to be about 910 Mb long, comprises about 28% repetitive elements. Comparative analysis of approximately 15,000 elephant shark gene fragments revealed examples of several ancient genes that have been lost differentially during the evolution of human and teleost fish lineages. Interestingly, the human and elephant shark genomes exhibit a higher degree of synteny and sequence conservation than human and teleost fish (zebrafish and fugu) genomes, even though humans are more closely related to teleost fishes than to the elephant shark. Unlike teleost fish genomes, the elephant shark genome does not seem to have experienced an additional round of whole-genome duplication. These findings underscore the importance of the elephant shark as a useful “model” cartilaginous fish genome for understanding vertebrate genome evolution. Synopsis: The cartilaginous elephant shark has a basal phylogenetic position useful for understanding jawed vertebrate evolution. Survey sequencing of its genome identified four Hox clusters, suggesting that, unlike for teleost fishes, no additional whole-genome duplication has occurred.
Published: 2007
Full Text: View/download PDF

47. [Untitled]

Author: Cameron Kennedy, Gary H. Karpen, Christopher D. Smith, Roger A. Hoskins, Barbara T. Wakimoto, Beth A. Sullivan, A. Bernardo Carvalho, Susan E. Celniker, Granger G. Sutton, Jiro C. Yasuhara, Joshua S. Kaminker, Aaron L. Halpern, Joseph W. Carlson, Gerald M. Rubin, Christopher J. Mungall, and Eugene W. Myers
Subjects: Genetics, 0303 health sciences, Euchromatin, Shotgun sequencing, Sequence analysis, Heterochromatin, government.form_of_government, Computational biology, Biology, Genome, 03 medical and health sciences, 0302 clinical medicine, government, Shotgun Sequence Assembly, 030217 neurology & neurosurgery, 030304 developmental biology, Centric heterochromatin, Sequence (medicine)
Abstract: Background: Most eukaryotic genomes include a substantial repeat-rich fraction termed heterochromatin, which is concentrated in centric and telomeric regions. The repetitive nature of heterochromatic sequence makes it difficult to assemble and analyze. To better understand the heterochromatic component of the Drosophila melanogaster genome, we characterized and annotated portions of a whole-genome shotgun sequence assembly. Results: WGS3, an improved whole-genome shotgun assembly, includes 20.7 Mb of draft-quality sequence not represented in the Release 3 sequence spanning the euchromatin. We annotated this sequence using the methods employed in the re-annotation of the Release 3 euchromatic sequence. This analysis predicted 297 protein-coding genes and six non-protein-coding genes, including known heterochromatic genes, and regions of similarity to known transposable elements. Bacterial artificial chromosome (BAC)-based fluorescence in situ hybridization analysis was used to correlate the genomic sequence with the cytogenetic map in order to refine the genomic definition of the centric heterochromatin; on the basis of our cytological definition, the annotated Release 3 euchromatic sequence extends into the centric heterochromatin on each chromosome arm. Conclusions: Whole-genome shotgun assembly produced a reliable draft-quality sequence of a significant part of the Drosophila heterochromatin. Annotation of this sequence defined the intron-exon structures of 30 known protein-coding genes and 267 protein-coding gene models. The cytogenetic mapping suggests that an additional 150 predicted genes are located in heterochromatin at the base of the Release 3 euchromatic sequence. Our analysis suggests strategies for improving the sequence and annotation of the heterochromatic portions of the Drosophila and other complex genomes.
Published: 2002
Full Text: View/download PDF

48. [Untitled]

Author: Soo Park, Kenneth H. Wan, Roger A. Hoskins, Erica Sodergren, Catherine R Nelson, Joseph W. Carlson, Reed A. George, Eugene W. Myers, Stephen Richards, Barret D. Pfeiffer, Steven E. Scherer, Gerald M. Rubin, Aaron L. Halpern, Mark Stapleton, Sandeep Patel, Susan E. Celniker, Joanne Pacleb, Erwin Frise, Brent Kronmiller, Donna M. Muzny, Shannon Dugan, Mark Raymond Adams, Paul E. Tabor, Richard A. Gibbs, David A. Wheeler, Granger G. Sutton, Ann Hodgson, Todd R. Laverty, Craig Venter, Robert Svirskas, George M. Weinstock, and Mark Champe
Subjects: Genetics, Whole genome sequencing, 0303 health sciences, biology, Shotgun sequencing, Sequence analysis, Sequence assembly, Genomics, Computational biology, biology.organism_classification, Genome, 03 medical and health sciences, 0302 clinical medicine, Drosophila melanogaster, 030217 neurology & neurosurgery, 030304 developmental biology, Sequence (medicine)
Abstract: Background: The Drosophila melanogaster genome was the first metazoan genome to have been sequenced by the whole-genome shotgun (WGS) method. Two issues relating to this achievement were widely debated in the genomics community: how correct is the sequence with respect to base-pair (bp) accuracy and frequency of assembly errors? And, how difficult is it to bring a WGS sequence to the accepted standard for finished sequence? We are now in a position to answer these questions.
Published: 2002
Full Text: View/download PDF

49. Consensus generation and variant detection by Celera Assembler.

Author: Gennady Denisov, Brian Walenz, Aaron L. Halpern, Jason Miller, Nelson Axelrod, Samuel Levy, and Granger Sutton
Subjects: GENETICS, GENOMES, HUMAN chromosomes, HUMAN gene mapping
Abstract: Motivation: We present an algorithm to identify allelic variation given a Whole Genome Shotgun (WGS) assembly of haploid sequences, and to produce a set of haploid consensus sequences rather than a single consensus sequence. Existing WGS assemblers take a column-by-column approach to consensus generation, and produce a single consensus sequence which can be inconsistent with the underlying haploid alleles, and inconsistent with any of the aligned sequence reads. Our new algorithm uses a dynamic windowing approach. It detects alleles by simultaneously processing the portions of aligned reads spanning a region of sequence variation, assigns reads to their respective alleles, phases adjacent variant alleles and generates a consensus sequence corresponding to each confirmed allele. This algorithm was used to produce the first diploid genome sequence of an individual human. It can also be applied to assemblies of multiple diploid individuals and hybrid assemblies of multiple haploid organisms. Results: Being applied to the individual human genome assembly, the new algorithm detects exactly two confirmed alleles and reports two consensus sequences in 98.98% of the total number 2 033 311 detected regions of sequence variation. In 33 269 out of 460 373 detected regions of size >1 bp, it fixes the constructed errors of a mosaic haploid representation of a diploid locus as produced by the original Celera Assembler consensus algorithm. Using an optimized procedure calibrated against 1 506 344 known SNPs, it detects 438 814 new heterozygous SNPs with false positive rate 12%. Availability: The open source code is available at: http://wgs-assembler.cvs.sourceforge.net/wgs-assembler/ Contact: gdenisov@jcvi.org
Published: 2008
Full Text: View/download PDF

50. External Arguments in Basque

Author: Demirdache, Hamida, Lisa, Cheng, Laboratoire de Linguistique de Nantes (LLING), Centre National de la Recherche Scientifique (CNRS)-Université de Nantes - UFR Lettres et Langages (UFRLL), Université de Nantes (UN)-Université de Nantes (UN), Leiden University Center for Linguistics, Universiteit Leiden [Leiden], and Aaron L. Halpern
Subjects: 060201 languages & linguistics, 030507 speech-language pathology & audiology, 03 medical and health sciences, 0602 languages and literature, 06 humanities and the arts, [SHS.LANGUE]Humanities and Social Sciences/Linguistics, 0305 other medical science, ComputingMilieux_MISCELLANEOUS
Abstract: International audience
Published: 1993
Full Text: View/download PDF
