Learning sparse log-ratios for high-throughput sequencing data
- Source :
- Bioinformatics
- Publication Year :
- 2021
- Publisher :
- Oxford University Press, 2021.
Abstract
- Motivation: The automatic discovery of sparse biomarkers that are associated with an outcome of interest is a central goal of bioinformatics. In the context of high-throughput sequencing (HTS) data, and compositional data (CoDa) more generally, an important class of biomarkers are the log-ratios between the input variables. However, identifying predictive log-ratio biomarkers from HTS data is a combinatorial optimization problem, which is computationally challenging. Existing methods are slow to run and scale poorly with the dimension of the input, which has limited their application to low- and moderate-dimensional metagenomic datasets.
- Results: Building on recent advances from the field of deep learning, we present CoDaCoRe, a novel learning algorithm that identifies sparse, interpretable and predictive log-ratio biomarkers. Our algorithm exploits a continuous relaxation to approximate the underlying combinatorial optimization problem. This relaxation can then be optimized efficiently using the modern ML toolbox, in particular, gradient descent. As a result, CoDaCoRe runs several orders of magnitude faster than competing methods, all while achieving state-of-the-art performance in terms of predictive accuracy and sparsity. We verify the outperformance of CoDaCoRe across a wide range of microbiome, metabolite and microRNA benchmark datasets, as well as a particularly high-dimensional dataset that is outright computationally intractable for existing sparse log-ratio selection methods.
- Availability and implementation: The CoDaCoRe package is available at https://github.com/egr95/R-codacore. Code and instructions for reproducing our results are available at https://github.com/cunningham-lab/codacore.
- Supplementary information: Supplementary data are available at Bioinformatics online.
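The idea of replacing a discrete choice of numerator and denominator variables with a continuous, differentiable surrogate can be illustrated with a small sketch. This is not the CoDaCoRe package API and not necessarily the relaxation used in the paper: the function names (`hard_log_ratio`, `relaxed_log_ratio`, `discretize`), the tanh parameterization, and the 0.5 threshold are all assumptions made for illustration. The sketch only shows the forward computation; in practice the soft weights would be fitted by gradient descent on a predictive loss.

```python
import numpy as np

def hard_log_ratio(X, num_idx, den_idx):
    """Log-ratio between the geometric means of two disjoint column groups."""
    logX = np.log(X)
    return logX[:, num_idx].mean(axis=1) - logX[:, den_idx].mean(axis=1)

def relaxed_log_ratio(X, theta, eps=1e-12):
    """Differentiable surrogate (illustrative assumption, not the paper's exact form).

    Each variable j gets a soft membership w_j = tanh(theta_j) in (-1, 1):
    positive values lean towards the numerator, negative towards the denominator.
    """
    logX = np.log(X)
    w = np.tanh(theta)
    w_num = np.maximum(w, 0.0)   # soft numerator weights
    w_den = np.maximum(-w, 0.0)  # soft denominator weights
    num = logX @ w_num / (w_num.sum() + eps)
    den = logX @ w_den / (w_den.sum() + eps)
    return num - den

def discretize(theta, threshold=0.5):
    """Turn soft memberships into index sets (hypothetical thresholding rule)."""
    w = np.tanh(theta)
    return np.flatnonzero(w > threshold), np.flatnonzero(w < -threshold)

# Toy compositional data: 5 samples, 6 strictly positive parts summing to 1.
rng = np.random.default_rng(0)
X = rng.gamma(shape=2.0, scale=1.0, size=(5, 6))
X = X / X.sum(axis=1, keepdims=True)

# Saturated parameters: parts 0 and 1 in the numerator, part 4 in the denominator.
theta = np.array([20.0, 20.0, 0.0, 0.0, -20.0, 0.0])
num_idx, den_idx = discretize(theta)
soft = relaxed_log_ratio(X, theta)
hard = hard_log_ratio(X, num_idx, den_idx)
```

When the soft memberships saturate at -1, 0 or +1, the relaxed quantity coincides with the discrete log-ratio, which is what makes thresholding a reasonable way to recover a sparse biomarker after optimization.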
- Subjects :
- Statistics and Probability
AcademicSubjects/SCI01060
Computer science
Deep learning
Relaxation (iterative method)
Gene Expression
Context (language use)
Machine learning
Biochemistry
Original Papers
Field (computer science)
Computer Science Applications
Computational Mathematics
Computational Theory and Mathematics
Dimension (vector space)
Benchmark (computing)
Code (cryptography)
Artificial intelligence
Gradient descent
Molecular Biology
Selection (genetic algorithm)
Details
- Language :
- English
- ISSN :
- 1367-4811 and 1367-4803
- Volume :
- 38
- Issue :
- 1
- Database :
- OpenAIRE
- Journal :
- Bioinformatics
- Accession number :
- edsair.doi.dedup.....4e644a2045e82b2b775e93ccbc35ccb4