1. beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types
- Author
Lun, Aaron T. L. [0000-0002-3564-4813]; Pagès, Hervé; Smith, Mike L. [0000-0002-7800-3848]; Apollo - University of Cambridge Repository
- Subjects
Computer and Information Sciences; Molecular biology; QH301-705.5; Research and Analysis Methods; Mathematical and Statistical Techniques; Sequencing techniques; Databases, Genetic; Genetics; Humans; Statistical Methods; Biology (General); Statistical Data; Principal Component Analysis; Data Processing; Software Tools; Sequence Analysis, RNA; Applied Mathematics; Simulation and Modeling; Biology and Life Sciences; Software Engineering; Computational Biology; High-Throughput Nucleotide Sequencing; RNA sequencing; Genomics; Genome Analysis; Molecular biology techniques; Physical Sciences; Multivariate Analysis; Engineering and Technology; Programming Languages; Information Technology; Transcriptome Analysis; Mathematics; Statistics (Mathematics); Algorithms; Software; Research Article
- Abstract
Biological experiments involving genomics or other high-throughput assays typically yield a data matrix that can be explored and analyzed using the R programming language with packages from the Bioconductor project. Improvements in the throughput of these assays have resulted in an explosion of data even from routine experiments, which poses a challenge to the existing computational infrastructure for statistical data analysis. For example, single-cell RNA sequencing (scRNA-seq) experiments frequently generate large matrices containing expression values for each gene in each cell, requiring sparse or file-backed representations for memory-efficient manipulation in R. These alternative representations are not easily compatible with high-performance C++ code used for computationally intensive tasks in existing R/Bioconductor packages. Here, we describe a C++ interface named beachmat, which enables agnostic data access from various matrix representations. This allows package developers to write efficient C++ code that is interoperable with dense, sparse and file-backed matrices, amongst others. We evaluated the performance of beachmat for accessing data from each matrix representation using both simulated and real scRNA-seq data, and defined a clear memory/speed trade-off to motivate the choice of an appropriate representation. We also demonstrate how beachmat can be incorporated into the code of other packages to drive analyses of a very large scRNA-seq data set.
- Published
2018