1. Metabolic pathway inference using multi-label classification with rich pathway features
- Author
- Basher, Abdur Rahman M. A., McLaughlin, Ryan J., and Hallam, Steven J.
- Subjects
0301 basic medicine; Computer science; ved/biology.organism_classification_rank.species; Inference; Genome; Biochemistry; Machine Learning; 0302 clinical medicine; Software; Mathematical and Statistical Techniques; Metabolic potential; Databases, Genetic; Biology (General); 0303 health sciences; education.field_of_study; Ecology; Rule sets; Applied Mathematics; Simulation and Modeling; Statistics; Genomics; Enzymes; Computational Theory and Mathematics; Modeling and Simulation; Physical Sciences; Benchmark (computing); Metabolic Pathways; Algorithms; Metabolic Networks and Pathways; Research Article; Computer and Information Sciences; Cell Physiology; QH301-705.5; Population; Computational biology; Research and Analysis Methods; Genome Complexity; Biosynthesis; Low complexity; 03 medical and health sciences; Cellular and Molecular Neuroscience; Machine Learning Algorithms; Artificial Intelligence; Proteobacteria; Genetics; Statistical Methods; Model organism; education; Molecular Biology; Ecology, Evolution, Behavior and Systematics; 030304 developmental biology; Multi-label classification; ved/biology; business.industry; Biology and Life Sciences; Proteins; Computational Biology; Cell Biology; Cell Metabolism; Metabolic pathway; 030104 developmental biology; ComputingMethodologies_PATTERNRECOGNITION; Metabolism; Logistic Models; Enzymology; business; 030217 neurology & neurosurgery; Mathematics; Forecasting
- Abstract
Metabolic inference from genomic sequence information is a necessary step in determining the capacity of cells to make a living in the world at different levels of biological organization. A common method for determining the metabolic potential encoded in genomes is to map conceptually translated open reading frames onto a database containing known product descriptions. Such gene-centric methods are limited in their capacity to predict pathway presence or absence and do not support standardized rule sets for automated and reproducible research. Pathway-centric methods based on defined rule sets or machine learning algorithms provide an adjunct or alternative inference method that supports hypothesis generation and testing of metabolic relationships within and between cells. Here, we present mlLGPR, multi-label based on logistic regression for pathway prediction, a software package that uses supervised multi-label classification and rich pathway features to infer metabolic networks in organismal and multi-organismal datasets. We evaluated mlLGPR performance using a corpus of 12 experimental datasets manifesting diverse multi-label properties, including manually curated organismal genomes, synthetic microbial communities, and low-complexity microbial communities. The resulting performance metrics equaled or exceeded previous reports for organismal genomes and identified specific challenges associated with feature engineering and training data for community-level metabolic inference.
Author summary
Predicting the complex series of metabolic interactions (e.g., pathways) within and between cells from genomic sequence information is an integral problem in biology, linking genotype to phenotype. This is a prerequisite both to understanding fundamental life processes and, ultimately, to engineering these processes for specific biotechnological applications.
A pathway prediction problem exists because we have limited knowledge of the reactions and pathways operating in cells, even in model organisms like Escherichia coli, where the majority of protein functions have been determined. To improve pathway prediction outcomes for genomes at different levels of complexity and completion, we have developed mlLGPR, multi-label based on logistic regression for pathway prediction, a scalable open source software package that uses supervised multi-label classification and rich pathway features to infer metabolic networks. We benchmark mlLGPR performance against other inference methods, providing a code base and metrics for continued application of machine learning methods to the pathway prediction problem.
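
The general idea of multi-label classification with logistic regression, as named in the abstract, can be illustrated with a minimal sketch. This is not the mlLGPR implementation: it assumes a simple "binary relevance" scheme (one independent logistic regression per pathway), uses only enzyme presence/absence as features rather than the rich pathway features the abstract mentions, and runs on a made-up toy dataset. All function names, hyperparameters, enzymes, and pathways below are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_binary(X, y, lr=0.5, epochs=1000, l2=0.01):
    """Fit one L2-regularized logistic regression by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Average the gradient over samples; the bias is not regularized.
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def train_multilabel(X, Y):
    """Binary relevance: train one independent classifier per label column."""
    n_labels = len(Y[0])
    return [train_binary(X, [row[k] for row in Y]) for k in range(n_labels)]


def predict_multilabel(models, x, threshold=0.5):
    """Return a 0/1 vector with one entry per pathway."""
    return [
        int(sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= threshold)
        for w, b in models
    ]


# Toy data (assumption): rows are genomes, columns are the presence (1) or
# absence (0) of four hypothetical enzymes.
X = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
# Labels (assumption): columns are two hypothetical pathways, A and B.
# A genome may carry neither, one, or both, which makes this multi-label.
Y = [
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
    [1, 1],
    [0, 0],
]

models = train_multilabel(X, Y)
predictions = [predict_multilabel(models, x) for x in X]
```

Here `predictions` holds one 0/1 vector per genome, with one entry per pathway. Treating each pathway as its own binary problem is the simplest multi-label strategy; it ignores correlations between pathways, which is one reason richer features and more careful training data matter for community-level inference.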
- Published
- 2020