1. OPERA-gSAM: Big Data Processing Framework for UMI Sequencing at High Scalability and Efficiency
- Author
- Simmhan, Y, Altintas, I, Varbanescu, AL, Balaji, P, Prasad, AS, Carnevale, L, Caderno, PV, Awaysheh, FM, Colino-Sanguino, Y, Fuente, LRDL, Valdes-Mora, F, Cabaleiro, JC, Pena, TF, and Gallego-Ortega, D
- Abstract
The rapidly increasing demand for next-generation biotechnologies has enabled the development of DNA and RNA big data (BD) oriented pipelines. The preprocessing stage requires sequencing and alignment tools that provide barcoding for error correction and increase accuracy during sequencing. Unique Molecular Identifiers (UMIs) promise a highly accurate bioinformatic identification of PCR duplication before the amplification stage. However, using alignment coordinates alone is data-intensive and challenging due to the increased demand for computational throughput, affecting the performance of the underlying resources. This paper proposes a highly scalable data scheduling and resource allocation framework called OPERA-gSAM for the genome Sequence Alignment Map (SAM). OPERA-gSAM, an OPportunistic and Elastic Resource Allocation, is an enabling big data platform (i.e., Apache Spark) for next-generation massively parallel sequencing applications. We validate OPERA-gSAM scalability and efficiency using Genomics single-cell RNA sequencing. Our experiments demonstrate the usability and high efficiency of the proposed framework. Results show that OPERA-gSAM is up to 2.4x faster while consuming 50% fewer resources than the conventional pipeline using SAM and UMI tools.
- Published
- 2023