Dynamic Alignment-Free and Reference-Free Read Compression.

Authors :
Stoye, Jens
Holley, Guillaume
Wittler, Roland
Hach, Faraz
Source :
Journal of Computational Biology. Jul 2018, Vol. 25 Issue 7, p825-836. 12p.
Publication Year :
2018

Abstract

The advent of high throughput sequencing (HTS) technologies raises a major concern about storage and transmission of data produced by these technologies. In particular, large-scale sequencing projects generate an unprecedented volume of genomic sequences ranging from tens to several thousands of genomes per species. These collections contain highly similar and redundant sequences, also known as pangenomes. The ideal way to represent and transfer pangenomes is through compression. A number of HTS-specific compression tools have been developed to reduce the storage and communication costs of HTS data, yet none of them is designed to process a pangenome. In this article, we present dynamic alignment-free and reference-free read compression (DARRC), a new alignment-free and reference-free compression method. It addresses the problem of pangenome compression by encoding the sequences of a pangenome as a guided de Bruijn graph. The novelty of this method is its ability to incrementally update DARRC archives with new genome sequences without full decompression of the archive. DARRC can compress both single-end and paired-end read sequences of any length using all symbols of the IUPAC nucleotide code. On a large Pseudomonas aeruginosa data set, our method outperforms all other tested tools. It provides a 30% compression ratio improvement in single-end mode compared with the best performing state-of-the-art HTS-specific compression method in our experiments. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
10665277
Volume :
25
Issue :
7
Database :
Academic Search Index
Journal :
Journal of Computational Biology
Publication Type :
Academic Journal
Accession number :
130740871
Full Text :
https://doi.org/10.1089/cmb.2018.0068