PyGMQL: scalable data extraction and analysis for heterogeneous genomic datasets
- Source :
- BMC Bioinformatics, Vol 20, Iss 1, Pp 1-11 (2019)
- Publication Year :
- 2019
- Publisher :
- Springer Science and Business Media LLC, 2019.
Abstract
- Background: With the growth of available sequenced datasets, analysis of heterogeneous processed data can answer increasingly relevant biological and clinical questions. Scientists face challenges in building efficient and reproducible data extraction and analysis pipelines over heterogeneously processed datasets. Available software packages are suitable for analyzing experimental files from such datasets one by one, but do not scale to thousands of experiments. Moreover, they lack proper support for metadata manipulation.
- Results: We present PyGMQL, a novel software package for the manipulation of region-based genomic files and their associated metadata, built on top of the GMQL genomic big data management system. PyGMQL provides a set of expressive functions for the manipulation of region data and their metadata that can scale to arbitrary clusters and implicitly apply to thousands of files, producing millions of regions. PyGMQL provides data interoperability, distribution transparency and query outsourcing. The PyGMQL package integrates scalable data extraction over the Apache Spark engine underlying the GMQL implementation with native Python support for interactive data analysis and visualization. It supports data interoperability, solving the impedance mismatch between executing set-oriented queries and programming in Python. PyGMQL provides distribution transparency (the ability to address a remote dataset) and query outsourcing (the ability to assign processing to a remote service) in an orthogonal way. Outsourced processing can address cloud-based installations of the GMQL engine.
- Conclusions: PyGMQL is an effective and innovative tool for supporting tertiary data extraction and analysis pipelines. We demonstrate the expressiveness and performance of PyGMQL through a sequence of biological data analysis scenarios of increasing complexity, which highlight reproducibility, expressive power and scalability.
- Subjects :
- Data Analysis
Computer science
Cloud computing
02 engineering and technology
computer.software_genre
Biochemistry
Data scalability
Software
Structural Biology
Databases, Genetic
lcsh:QH301-705.5
computer.programming_language
Distribution transparency
0303 health sciences
Biological data
Genome
Applied Mathematics
Genomics
Computer Science Applications
Tertiary data analysis
Enhancer Elements, Genetic
Data extraction
Scalability
lcsh:R858-859.7
Data mining
0206 medical engineering
lcsh:Computer applications to medicine. Medical informatics
03 medical and health sciences
Genomic data
Object-relational impedance mismatch
Humans
Python
Genome-Wide Association Study
Reproducibility of Results
Transcription Factors
Molecular Biology
030304 developmental biology
business.industry
Python (programming language)
Visualization
Metadata
lcsh:Biology (General)
020602 bioinformatics
Details
- ISSN :
- 1471-2105
- Volume :
- 20
- Database :
- OpenAIRE
- Journal :
- BMC Bioinformatics
- Accession number :
- edsair.doi.dedup.....94f40ed5c1f032a2561f53cbd22db31c