Managing Variant Calling Files the Big Data Way: Using HDFS and Apache Parquet
- Source :
- BDCAT '17 Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies. ACM
- Publication Year :
- 2017
- Abstract :
- Big Data has been seen as a remedy for the efficient management of the ever-increasing genomic data. In this paper, we investigate the use of Apache Spark to store and process Variant Calling Files (VCF) on a Hadoop cluster. We demonstrate Tomatula, a software tool for converting VCF files to Apache Parquet storage format, and an application to query variant calling datasets. We evaluate how the wall time (i.e. time until the query answer is returned to the user) scales out on a Hadoop cluster storing VCF files, either in the original flat-file format, or using the Apache Parquet columnar storage format. Apache Parquet can compress the VCF data by around a factor of 10, and supports easier querying of VCF files as it exposes the field structure. We discuss advantages and disadvantages in terms of storage capacity and querying performance with both flat VCF files and Apache Parquet using an open plant breeding dataset. We conclude that Apache Parquet offers benefits for reducing storage size and wall time, and scales out with larger datasets.
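The abstract's point that a columnar layout "exposes the field structure" can be illustrated with a minimal sketch. This is not Tomatula and does not use Spark or Parquet: it is a standard-library Python illustration, and the helper names `parse_vcf` and `query_region` and the sample records are hypothetical. It assumes tab-separated VCF data lines with the eight fixed columns of the VCF format.

```python
# Hypothetical illustration only: a column-oriented view of VCF records.
# Assumes tab-separated data lines with the eight fixed VCF columns.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf(lines):
    """Turn VCF text lines into a dict mapping column name -> list of values."""
    columns = {name: [] for name in VCF_COLUMNS}
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines, meta-information (##...) and the header line (#CHROM...).
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        for name, value in zip(VCF_COLUMNS, fields):
            # POS is stored as an integer so range queries compare numerically.
            columns[name].append(int(value) if name == "POS" else value)
    return columns


def query_region(columns, chrom, start, end):
    """Return (POS, REF, ALT) for variants on `chrom` with start <= POS <= end."""
    return [
        (pos, ref, alt)
        for c, pos, ref, alt in zip(
            columns["CHROM"], columns["POS"], columns["REF"], columns["ALT"]
        )
        if c == chrom and start <= pos <= end
    ]


# Made-up sample records, not taken from the paper's dataset.
sample = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "ch01\t100\t.\tA\tG\t50\tPASS\tDP=10",
    "ch01\t250\t.\tC\tT\t60\tPASS\tDP=12",
    "ch02\t100\t.\tG\tA\t40\tPASS\tDP=8",
]
cols = parse_vcf(sample)
hits = query_region(cols, "ch01", 50, 200)
```

The query touches only the columns it needs (`CHROM`, `POS`, `REF`, `ALT`), which is the access pattern a columnar format such as Apache Parquet is designed to serve; a flat VCF file requires splitting every line before any field can be tested.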
- Subjects :
- 0301 basic medicine
PBR Non host and insect resistance
Big Data
Computer science
0206 medical engineering
WASS
02 engineering and technology
computer.software_genre
Field (computer science)
Apache Parquet
PBR Quantitative aspects of Plant Breeding
03 medical and health sciences
Factor (programming language)
Spark (mathematics)
Data_FILES
computer.programming_language
PBR Kwantitatieve aspecten
HDFS
Database
Apache Spark
business.industry
variant calling
Toegepaste Informatiekunde
bioinformatics
030104 developmental biology
Hadoop
business
Information Technology
computer
020602 bioinformatics
PBR Non host en Insectenresistentie
Details
- Language :
- English
- ISBN :
- 978-1-4503-5549-0
- ISBNs :
- 9781450355490
- Database :
- OpenAIRE
- Journal :
- BDCAT '17 Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies. ACM
- Accession number :
- edsair.doi.dedup.....17dae3c7396bbc3559d72282a6c613c8