NovoGraph: Genome graph construction from multiple long-read de novo assemblies [version 1; referees: 1 approved, 1 approved with reservations]

Authors :
Evan Biederstedt
Jeffrey C. Oliver
Nancy F. Hansen
Aarti Jajoo
Nathan Dunn
Andrew Olson
Ben Busby
Alexander T. Dilthey
Author Affiliations :
1. Weill Cornell Medicine, New York, NY, 10065, USA
2. New York Genome Center, New York, NY, 10013, USA
3. Office of Digital Innovation and Stewardship, University Libraries, University of Arizona, Tucson, AZ, 85721, USA
4. National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, 20817, USA
5. Baylor College of Medicine, Houston, TX, 77030, USA
6. Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
7. Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA
8. National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD, 20817, USA
9. Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, 40225, Germany
Source :
F1000Research. 7:1391
Publication Year :
2018
Publisher :
London, UK: F1000 Research Limited, 2018.

Abstract

Genome graphs are emerging as an important novel approach to the analysis of high-throughput sequencing data. By explicitly representing genetic variants and alternative haplotypes in a mappable data structure, they can enable the improved analysis of structurally variable and hyperpolymorphic regions of the genome. In most existing approaches, graphs are constructed from variant call sets derived from short-read sequencing. As long-read sequencing becomes more cost-effective and enables de novo assembly for increasing numbers of whole genomes, a method for the direct construction of a genome graph from sets of assembled human genomes would be desirable. Such assembly-based genome graphs would encompass the wide spectrum of genetic variation accessible to long-read-based de novo assembly, including large structural variants and divergent haplotypes. Here we present NovoGraph, a method for the construction of a genome graph directly from a set of de novo assemblies. NovoGraph constructs a genome-wide multiple sequence alignment of all input contigs and uses a simple criterion of homologous-identical recombination to convert the multiple sequence alignment into a graph. NovoGraph outputs resulting graphs in VCF format that can be loaded into third-party genome graph toolkits. To demonstrate NovoGraph, we construct a genome graph with 23,478,835 variant sites and 30,582,795 variant alleles from de novo assemblies of seven ethnically diverse human genomes (AK1, CHM1, CHM13, HG003, HG004, HX1, NA19240). Initial evaluations show that mapping against the constructed graph reduces the average mismatch rate of reads from sample NA12878 by approximately 0.2%, albeit at a slightly increased rate of reads that remain unmapped.
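
The general idea of turning a multiple sequence alignment into variant sites can be illustrated with a small sketch. This is not NovoGraph's algorithm or code: it is a toy simplification that treats each maximal run of non-identical alignment columns as one variant site and lists the distinct gap-stripped alleles observed there. The function name `msa_to_variant_sites`, the toy alignment, and the tuple output (rather than real VCF records) are all assumptions made for illustration.

```python
def msa_to_variant_sites(msa):
    """Toy sketch (hypothetical helper, not part of NovoGraph).

    Takes an MSA as a list of equal-length strings with '-' for gaps and
    returns (start_column, sorted_distinct_alleles) for every maximal run
    of columns in which the rows are not all identical.
    """
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all MSA rows must have the same length")

    n_cols = len(msa[0])
    sites = []
    run_start = None

    # Iterate one column past the end so a trailing variable run is closed.
    for col in range(n_cols + 1):
        variable = col < n_cols and len({row[col] for row in msa}) > 1
        if variable and run_start is None:
            run_start = col  # a new variable region opens here
        elif not variable and run_start is not None:
            # Close the region: collect distinct alleles with gaps removed.
            alleles = sorted({row[run_start:col].replace("-", "") for row in msa})
            sites.append((run_start, alleles))
            run_start = None
    return sites


# Toy alignment of three sequences (assumed data, for illustration only).
toy_msa = [
    "ACGTTAGC",
    "ACGATAGC",
    "ACG--AGC",
]
sites = msa_to_variant_sites(toy_msa)
print(sites)
```

In this toy alignment, columns 3 and 4 differ between rows, so they form a single site with three alleles: a deletion (the empty string), `AT`, and `TT`. A real VCF record would additionally need a reference coordinate and an anchoring base for the deletion allele, which this sketch omits.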

Details

ISSN :
20461402
Volume :
7
Database :
F1000Research
Journal :
F1000Research
Notes :
[version 1; referees: 1 approved, 1 approved with reservations]
Publication Type :
Academic Journal
Accession number :
edsfor.10.12688.f1000research.15895.1
Document Type :
software-tool
Full Text :
https://doi.org/10.12688/f1000research.15895.1