Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis
- Source :
- Bioinformatics (Oxford, England). 34(4)
- Publication Year :
- 2017
Abstract
- Motivation: New Generation Sequencing (NGS) technologies for genome sequencing produce large amounts of short genomic reads per experiment, which are highly redundant and compressible. However, general-purpose compressors are unable to exploit this redundancy because of the special structure present in the data.
- Results: We present a new algorithm for compressing reads both with and without preserving the read order. In both cases, it achieves a 1.4×–2× compression gain over state-of-the-art read compression tools for datasets containing as many as 3 billion Illumina reads. Our tool is based on the idea of approximately reordering the reads according to their position in the genome using hashed substring indices. We also present a systematic analysis of the read compression problem and compute bounds on the fundamental limits of read compression. This analysis sheds light on the dynamics of the proposed algorithm (and of read compression algorithms in general) and helps explain its performance in practice. The algorithm compresses only the read sequence, works with unaligned FASTQ files, and does not require a reference.
- Supplementary information: Supplementary material is available at Bioinformatics online. The proposed algorithm is available for download at https://github.com/shubhamchandak94/HARC.
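The reordering idea in the abstract can be illustrated with a toy sketch. This is not the HARC implementation: it assumes equal-length, error-tolerant reads from a single strand, indexes only each read's first `k` bases, and greedily extends a chain forward. The function names (`build_index`, `find_next`, `reorder`) and the parameters `k` and `max_mismatch` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_index(reads, k):
    """Hash each read's first k bases to the ids of reads that start with them."""
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        index[read[:k]].append(rid)
    return index


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_next(cur, reads, index, used, k, max_mismatch):
    """Return (read_id, shift) for an unused read overlapping `cur`, or None.

    Shifts are tried from smallest to largest, so the closest neighbour
    (in approximate genome position) is preferred.
    """
    n = len(cur)
    for shift in range(n - k + 1):
        # Candidates are reads whose prefix equals cur[shift:shift + k].
        for rid in index.get(cur[shift:shift + k], ()):
            if used[rid]:
                continue
            overlap = n - shift
            # Verify the whole overlap, since the hash only covers k bases.
            if hamming(cur[shift:], reads[rid][:overlap]) <= max_mismatch:
                return rid, shift
    return None


def reorder(reads, k=4, max_mismatch=0):
    """Greedy reordering; returns a list of (read_id, shift_from_previous).

    A shift of None marks a read that starts a new chain.
    """
    index = build_index(reads, k)
    used = [False] * len(reads)
    order = []
    for start in range(len(reads)):
        if used[start]:
            continue
        used[start] = True
        order.append((start, None))
        cur = reads[start]
        while True:
            hit = find_next(cur, reads, index, used, k, max_mismatch)
            if hit is None:
                break
            rid, shift = hit
            used[rid] = True
            order.append((rid, shift))
            cur = reads[rid]
    return order


if __name__ == "__main__":
    genome = "ACGTTGCATGCCAGTA"
    # Length-8 reads taken at positions 0, 6, 2, 8, 4 (i.e. out of order).
    reads = [genome[p:p + 8] for p in (0, 6, 2, 8, 4)]
    print(reorder(reads))
```

Once reads are placed next to their overlapping neighbours, each read can be described by a small shift relative to the previous one plus any mismatches, which is the kind of redundancy a downstream compressor can exploit. Because this sketch only extends chains forward from whichever read it starts with, an unlucky starting read splits the output into several chains.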
- Subjects :
- 0301 basic medicine
Statistics and Probability
FASTQ format
Computer science
Hash function
Biochemistry
DNA sequencing
03 medical and health sciences
Redundancy (information theory)
Humans
Molecular Biology
Genome
Bacteria
Eukaryota
High-Throughput Nucleotide Sequencing
Genomics
Sequence Analysis, DNA
Data Compression
Original Papers
Substring
Computer Science Applications
Computational Mathematics
030104 developmental biology
Computational Theory and Mathematics
Algorithm
Algorithms
Software
Data compression
Details
- ISSN :
- 13674811
- Volume :
- 34
- Issue :
- 4
- Database :
- OpenAIRE
- Journal :
- Bioinformatics (Oxford, England)
- Accession number :
- edsair.doi.dedup.....c852e708449cb1538c68a9790b5e558b