Journal: plos computational biology / Publication Year Range: Last 50 years / Publisher: public library of science / Search Limiters: Full Text / Topic: 5 selected - Searchworks@Jio Institute Digital Library Search Results

Showing total 79 results

Topics selected: biology and life sciences, computational biology, computer and information sciences, genetics, mathematics

1. LOTUS: A single- and multitask machine learning algorithm for the prediction of cancer driver genes.

Author: Collier, Olivier, Stoven, Véronique, and Vert, Jean-Philippe
Subjects: CANCER genes, MACHINE learning, LEARNING strategies, P53 antioncogene, PROTEIN-protein interactions, COMPUTATIONAL biology, TUMOR suppressor genes
Abstract: Cancer driver genes, i.e., oncogenes and tumor suppressor genes, are involved in the acquisition of important functions in tumors, providing a selective growth advantage, allowing uncontrolled proliferation and avoiding apoptosis. It is therefore important to identify these driver genes, both for the fundamental understanding of cancer and to help find new therapeutic targets or biomarkers. Although the most frequently mutated driver genes have been identified, it is believed that many more remain to be discovered, particularly for driver genes specific to some cancer types. In this paper, we propose a new computational method called LOTUS to predict new driver genes. LOTUS is a machine-learning-based approach that integrates various types of data in a versatile manner, including information about gene mutations and protein-protein interactions. In addition, LOTUS can predict cancer driver genes in a pan-cancer setting as well as for specific cancer types, using a multitask learning strategy to share information across cancer types. We empirically show that LOTUS outperforms five other state-of-the-art driver gene prediction methods, both in terms of intrinsic consistency and prediction accuracy, and provide predictions of new cancer genes across many cancer types. [ABSTRACT FROM AUTHOR]
Published: 2019

2. Transient crosslinking kinetics optimize gene cluster interactions.

Author: Walker, Benjamin, Taylor, Dane, Lawrimore, Josh, Hult, Caitlin, Adalsteinsson, David, Bloom, Kerry, and Forest, M. Gregory
Subjects: GENE clusters, CHROMOSOME structure, COMPUTATIONAL biology, RIBOSOMAL DNA
Abstract: Our understanding of how chromosomes structurally organize and dynamically interact has been revolutionized through the lens of long-chain polymer physics. Major protein contributors to chromosome structure and dynamics are condensin and cohesin that stochastically generate loops within and between chains, and entrap proximal strands of sister chromatids. In this paper, we explore the ability of transient, protein-mediated, gene-gene crosslinks to induce clusters of genes, and thereby dynamic architecture, within the highly repeated ribosomal DNA that comprises the nucleolus of budding yeast. We implement three approaches: live cell microscopy; computational modeling of the full genome during G1 in budding yeast, exploring four decades of timescales for transient crosslinks between 5kbp domains (genes) in the nucleolus on Chromosome XII; and, temporal network models with automated community (cluster) detection algorithms applied to the full range of 4D modeling datasets. The data analysis tools detect and track gene clusters, their size, number, persistence time, and their plasticity (deformation). Of biological significance, our analysis reveals an optimal mean crosslink lifetime that promotes pairwise and cluster gene interactions through “flexible” clustering. In this state, large gene clusters self-assemble yet frequently interact (merge and separate), marked by gene exchanges between clusters, which in turn maximizes global gene interactions in the nucleolus. This regime stands between two limiting cases each with far fewer global gene interactions: with shorter crosslink lifetimes, “rigid” clustering emerges with clusters that interact infrequently; with longer crosslink lifetimes, there is a dissolution of clusters. These observations are compared with imaging experiments on a normal yeast strain and two condensin-modified mutant cell strains. We apply the same image analysis pipeline to the experimental and simulated datasets, providing support for the modeling predictions. [ABSTRACT FROM AUTHOR]
Published: 2019

3. A data-driven interactome of synergistic genes improves network-based cancer outcome prediction.

Author: Allahyar, Amin, Ubels, Joske, and de Ridder, Jeroen
Subjects: CANCER patients, GENE expression, CANCER treatment, HEALTH outcome assessment, MOLECULAR genetics
Abstract: Robustly predicting outcome for cancer patients from gene expression is an important challenge on the road to better personalized treatment. Network-based outcome predictors (NOPs), which consider the cellular wiring diagram in the classification, hold much promise to improve performance, stability and interpretability of identified marker genes. Problematically, reports on the efficacy of NOPs are conflicting and for instance suggest that utilizing random networks performs on par with networks that describe biologically relevant interactions. In this paper we turn the prediction problem around: instead of using a given biological network in the NOP, we aim to identify the network of genes that truly improves outcome prediction. To this end, we propose SyNet, a gene network constructed ab initio from synergistic gene pairs derived from survival-labelled gene expression data. To obtain SyNet, we evaluate synergy for all 69 million pairwise combinations of genes, resulting in a network that is specific to the dataset and phenotype under study and can be used in a NOP model. We evaluated SyNet and 11 other networks on a compendium dataset of >4000 survival-labelled breast cancer samples. For this purpose, we used cross-study validation, which more closely emulates real world application of these outcome predictors. We find that SyNet is the only network that truly improves performance, stability and interpretability in several existing NOPs. We show that SyNet overlaps significantly with existing gene networks, and can be confidently predicted (~85% AUC) from graph-topological descriptions of these networks, in particular the breast tissue-specific network. Due to its data-driven nature, SyNet is not biased to well-studied genes and thus facilitates post-hoc interpretation. We find that SyNet is highly enriched for known breast cancer genes and genes related to e.g. histological grade and tamoxifen resistance, suggestive of a role in determining breast cancer outcome. [ABSTRACT FROM AUTHOR]
Published: 2019

4. SFPEL-LPI: Sequence-based feature projection ensemble learning for predicting LncRNA-protein interactions.

Author: Zhang, Wen, Tang, Guifeng, Huang, Feng, Zhang, Xining, Yue, Xiang, and Wu, Wenjian
Subjects: RNA-protein interactions, GENETIC regulation, RNA interference, RNA splicing, ADENYLATION (Biochemistry)
Abstract: LncRNA-protein interactions play important roles in post-transcriptional gene regulation, poly-adenylation, splicing and translation. Identification of lncRNA-protein interactions helps to understand lncRNA-related activities. Existing computational methods utilize multiple lncRNA features or multiple protein features to predict lncRNA-protein interactions, but features are not available for all lncRNAs or proteins; most existing methods are not capable of predicting interacting proteins (or lncRNAs) for new lncRNAs (or proteins), which don’t have known interactions. In this paper, we propose the sequence-based feature projection ensemble learning method, “SFPEL-LPI”, to predict lncRNA-protein interactions. First, SFPEL-LPI extracts lncRNA sequence-based features and protein sequence-based features. Second, SFPEL-LPI calculates multiple lncRNA-lncRNA similarities and protein-protein similarities by using lncRNA sequences, protein sequences and known lncRNA-protein interactions. Then, SFPEL-LPI combines multiple similarities and multiple features with a feature projection ensemble learning framework. In computational experiments, SFPEL-LPI accurately predicts lncRNA-protein associations and outperforms other state-of-the-art methods. More importantly, SFPEL-LPI can be applied to new lncRNAs (or proteins). The case studies demonstrate that our method can find novel lncRNA-protein interactions, which are confirmed by the literature. Finally, we construct a user-friendly web server, available at . [ABSTRACT FROM AUTHOR]
Published: 2018

5. Efficient pedigree recording for fast population genetics simulation.

Author: Kelleher, Jerome, Thornton, Kevin R., Ashander, Jaime, and Ralph, Peter L.
Subjects: POPULATION genetics, EUKARYOTES, PHYLOGENY, GENOTYPES, ALGORITHMS
Abstract: In this paper we describe how to efficiently record the entire genetic history of a population in forwards-time, individual-based population genetics simulations with arbitrary breeding models, population structure and demography. This approach dramatically reduces the computational burden of tracking individual genomes by allowing us to simulate only those loci that may affect reproduction (those having non-neutral variants). The genetic history of the population is recorded as a succinct tree sequence as introduced in the software package msprime, on which neutral mutations can be quickly placed afterwards. Recording the results of each breeding event requires storage that grows linearly with time, but there is a great deal of redundancy in this information. We solve this storage problem by providing an algorithm to quickly ‘simplify’ a tree sequence by removing this irrelevant history for a given set of genomes. By periodically simplifying the history with respect to the extant population, we show that the total storage space required is modest and overall large efficiency gains can be made over classical forward-time simulations. We implement a general-purpose framework for recording and simplifying genealogical data, which can be used to make simulations of any population model more efficient. We modify two popular forwards-time simulation frameworks to use this new approach and observe efficiency gains in large, whole-genome simulations of one to two orders of magnitude. In addition to speed, our method for recording pedigrees has several advantages: (1) All marginal genealogies of the simulated individuals are recorded, rather than just genotypes. (2) A population of N individuals with M polymorphic sites can be stored in O(N log N + M) space, making it feasible to store a simulation’s entire final generation as well as its history. (3) A simulation can easily be initialized with a more efficient coalescent simulation of deep history. The software for recording and processing tree sequences is named tskit. [ABSTRACT FROM AUTHOR]
Published: 2018

6. A Scalable Computational Framework for Establishing Long-Term Behavior of Stochastic Reaction Networks.

Author: Gupta, Ankit, Briat, Corentin, and Khammash, Mustafa
Subjects: COMPUTATIONAL biology, STOCHASTIC processes, RANDOM variables, MATHEMATICAL optimization, INFORMATION networks
Abstract: Reaction networks are systems in which the populations of a finite number of species evolve through predefined interactions. Such networks are found as modeling tools in many biological disciplines such as biochemistry, ecology, epidemiology, immunology, systems biology and synthetic biology. It is now well-established that, for small population sizes, stochastic models for biochemical reaction networks are necessary to capture randomness in the interactions. The tools for analyzing such models, however, still lag far behind their deterministic counterparts. In this paper, we bridge this gap by developing a constructive framework for examining the long-term behavior and stability properties of the reaction dynamics in a stochastic setting. In particular, we address the problems of determining ergodicity of the reaction dynamics, which is analogous to having a globally attracting fixed point for deterministic dynamics. We also examine when the statistical moments of the underlying process remain bounded with time and when they converge to their steady state values. The framework we develop relies on a blend of ideas from probability theory, linear algebra and optimization theory. We demonstrate that the stability properties of a wide class of biological networks can be assessed from our sufficient theoretical conditions that can be recast as efficient and scalable linear programs, well-known for their tractability. It is notably shown that the computational complexity is often linear in the number of species. We illustrate the validity, the efficiency and the wide applicability of our results on several reaction networks arising in biochemistry, systems biology, epidemiology and ecology. The biological implications of the results as well as an example of a non-ergodic biological network are also discussed. [ABSTRACT FROM AUTHOR]
Published: 2014

7. A phylogenetic method to perform genome-wide association studies in microbes that accounts for population structure and recombination.

Author: Collins, Caitlin and Didelot, Xavier
Subjects: PHYLOGENY, MICROORGANISMS, NEISSERIA meningitidis, PENICILLIN, DRUG resistance in bacteria
Abstract: Genome-Wide Association Studies (GWAS) in microbial organisms have the potential to vastly improve the way we understand, manage, and treat infectious diseases. Yet, microbial GWAS methods established thus far remain insufficiently able to capitalise on the growing wealth of bacterial and viral genetic sequence data. Facing clonal population structure and homologous recombination, existing GWAS methods struggle to achieve both the precision necessary to reject spurious findings and the power required to detect associations in microbes. In this paper, we introduce a novel phylogenetic approach that has been tailor-made for microbial GWAS, which is applicable to organisms ranging from purely clonal to frequently recombining, and to both binary and continuous phenotypes. Our approach is robust to the confounding effects of both population structure and recombination, while maintaining high statistical power to detect associations. Thorough testing via application to simulated data provides strong support for the power and specificity of our approach and demonstrates the advantages offered over alternative cluster-based and dimension-reduction methods. Two applications to Neisseria meningitidis illustrate the versatility and potential of our method, confirming previously-identified penicillin resistance loci and resulting in the identification of both well-characterised and novel drivers of invasive disease. Our method is implemented as an open-source R package called treeWAS which is freely available at . [ABSTRACT FROM AUTHOR]
Published: 2018

8. Bayesian inference of phylogenetic networks from bi-allelic genetic markers.

Author: Zhu, Jiafan, Wen, Dingqiao, Yu, Yun, Meudt, Heidi M., and Nakhleh, Luay
Subjects: BAYESIAN analysis, PHYLOGENY, INFERENTIAL statistics, GENETIC markers in plants, PLANTAGINACEAE
Abstract: Phylogenetic networks are rooted, directed, acyclic graphs that model reticulate evolutionary histories. Recently, statistical methods were devised for inferring such networks from either gene tree estimates or the sequence alignments of multiple unlinked loci. Bi-allelic markers, most notably single nucleotide polymorphisms (SNPs) and amplified fragment length polymorphisms (AFLPs), provide a powerful source of genome-wide data. In a recent paper, a method called SNAPP was introduced for statistical inference of species trees from unlinked bi-allelic markers. The generative process assumed by the method combined both a model of evolution for the bi-allelic markers, as well as the multispecies coalescent. A novel component of the method was a polynomial-time algorithm for exact computation of the likelihood of a fixed species tree via integration over all possible gene trees for a given marker. Here we report on a method for Bayesian inference of phylogenetic networks from bi-allelic markers. Our method significantly extends the algorithm for exact computation of phylogenetic network likelihood via integration over all possible gene trees. Unlike the case of species trees, the algorithm is no longer polynomial-time on all instances of phylogenetic networks. Furthermore, the method utilizes a reversible-jump MCMC technique to sample the posterior of phylogenetic networks given bi-allelic marker data. Our method has a very good performance in terms of accuracy and robustness as we demonstrate on simulated data, as well as a data set of multiple New Zealand species of the plant genus Ourisia (Plantaginaceae). We implemented the method in the publicly available, open-source PhyloNet software package. [ABSTRACT FROM AUTHOR]
Published: 2018

9. Scaling up data curation using deep learning: An application to literature triage in genomic variation resources.

Author: Lee, Kyubum, Famiglietti, Maria Livia, McMahon, Aoife, Wei, Chih-Hsuan, MacArthur, Jacqueline Ann Langdon, Poux, Sylvain, Breuza, Lionel, Bridge, Alan, Cunningham, Fiona, Xenarios, Ioannis, and Lu, Zhiyong
Subjects: ARTIFICIAL neural networks, GENOMES, ALGORITHMS, GENOMICS
Abstract: Manually curating biomedical knowledge from publications is necessary to build a knowledge-based service that provides highly precise and organized information to users. The process of retrieving relevant publications for curation, which is also known as document triage, is usually carried out by querying and reading articles in PubMed. However, this query-based method often obtains unsatisfactory precision and recall on the retrieved results, and it is difficult to manually generate optimal queries. To address this, we propose a machine-learning assisted triage method. We collect previously curated publications from two databases, UniProtKB/Swiss-Prot and the NHGRI-EBI GWAS Catalog, and use them as a gold-standard dataset for training deep learning models based on convolutional neural networks. We then use the trained models to classify and rank new publications for curation. For evaluation, we apply our method to the real-world manual curation process of UniProtKB/Swiss-Prot and the GWAS Catalog. We demonstrate that our machine-assisted triage method outperforms the current query-based triage methods, improves efficiency, and enriches curated content. Our method achieves a precision 1.81 and 2.99 times higher than that obtained by the current query-based triage methods of UniProtKB/Swiss-Prot and the GWAS Catalog, respectively, without compromising recall. In fact, our method retrieves many additional relevant publications that the query-based method of UniProtKB/Swiss-Prot could not find. As these results show, our machine learning-based method can make the triage process more efficient and is being implemented in production so that human curators can focus on more challenging tasks to improve the quality of knowledge bases. [ABSTRACT FROM AUTHOR]
Published: 2018

10. Generation of Binary Tree-Child phylogenetic networks

Author: Gabriel Cardona, Joan Carles Pons, and Celine Scornavacca
Affiliations: University of the Balearic Islands (UIB); Institut des Sciences de l'Evolution de Montpellier (UMR ISEM); École pratique des hautes études (EPHE); Université Paris sciences et lettres (PSL); Université de Montpellier (UM); Centre de Coopération Internationale en Recherche Agronomique pour le Développement (Cirad); Centre National de la Recherche Scientifique (CNRS); Institut de recherche pour le développement [IRD] : UR226
Funding: Research of GC and JCP has been partially supported by the Spanish Ministry of Science, Innovation and Universities (http://www.ciencia.gob.es/), Spanish State Research Agency (http://www.ciencia.gob.es/portal/site/MICINN/aei) and European Regional Development Fund (https://ec.europa.eu/regional_policy/es/funding/erdf/) projects DPI2015-67082-P and PGC2018-096956-B-C43. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Subjects: Leaves, Theoretical computer science, Computer science, Speciation, Binary number, Plant Science, Upper and lower bounds, Database and Informatics Methods, Biology (General), Phylogeny, Data Management, Sequence, Binary tree, Ecology, Phylogenetic tree, Directed Graphs, Plant Anatomy, Applied Mathematics, Simulation and Modeling, Phylogenetic Analysis, [SDV.BIBS]Life Sciences [q-bio]/Quantitative Methods [q-bio.QM], Reticulate evolution, Phylogenetics, Computational Theory and Mathematics, Modeling and Simulation, Physical Sciences, Sequence Analysis, Network Analysis, Algorithms, Research Article, Computer and Information Sciences, Evolutionary Processes, Bioinformatics, Research and Analysis Methods, Set (abstract data type), Cellular and Molecular Neuroscience, Genetics, Computer Simulation, Evolutionary Systematics, Molecular Biology, Ecology, Evolution, Behavior and Systematics, Taxonomy, Evolutionary Biology, Computational Biology, Biology and Life Sciences, Graph Theory, Mathematics
Abstract: Phylogenetic networks generalize phylogenetic trees by allowing the modeling of events of reticulate evolution. Among the different kinds of phylogenetic networks that have been proposed in the literature, the subclass of binary tree-child networks is one of the most studied ones. However, very little is known about the combinatorial structure of these networks. In this paper we address the problem of generating all possible binary tree-child (BTC) networks with a given number of leaves in an efficient way via reduction/augmentation operations that extend and generalize analogous operations for phylogenetic trees, and are biologically relevant. Since our solution is recursive, this also provides us with a recurrence relation giving an upper bound on the number of such networks. We also show how the operations introduced in this paper can be employed to extend the evolutionary history of a set of sequences, represented by a BTC network, to include a new sequence. An implementation in Python of the algorithms described in this paper, along with some computational experiments, can be downloaded from https://github.com/bielcardona/TCGenerators.
Published: 2019

11. Bayesian inference of phylogenetic networks from bi-allelic genetic markers

Author: Heidi M. Meudt, Yun Yu, Jiafan Zhu, Luay Nakhleh, and Dingqiao Wen
Subjects: Computer science, Gene Identification and Analysis, Genetic Networks, Coalescent theory, Bayes' theorem, Computational phylogenetics, Statistical inference, Genome Evolution, Phylogeny, Data Management, Genetics, Recombination, Genetic, Likelihood Functions, Ecology, Phylogenetic tree, Applied Mathematics, Simulation and Modeling, Nucleic Acid Hybridization, Phylogenetic Analysis, Plantaginaceae, Phylogenetic network, Genomics, Phylogenetics, Computational Theory and Mathematics, Modeling and Simulation, Physical Sciences, Network Analysis, Algorithms, Research Article, Genetic Markers, Computer and Information Sciences, Computational biology, Biology, Bayesian inference, Research and Analysis Methods, Genes, Plant, Polymorphism, Single Nucleotide, Molecular Evolution, Cellular and Molecular Neuroscience, Evolutionary Systematics, Computer Simulation, Molecular Biology, Ecology, Evolution, Behavior and Systematics, Alleles, Taxonomy, Probability, Evolutionary Biology, Models, Genetic, Biology and Life Sciences, Markov chain Monte Carlo, Tree (graph theory), Biology (General), Genetic Loci, Mathematics, Software, New Zealand
Abstract: Phylogenetic networks are rooted, directed, acyclic graphs that model reticulate evolutionary histories. Recently, statistical methods were devised for inferring such networks from either gene tree estimates or the sequence alignments of multiple unlinked loci. Bi-allelic markers, most notably single nucleotide polymorphisms (SNPs) and amplified fragment length polymorphisms (AFLPs), provide a powerful source of genome-wide data. In a recent paper, a method called SNAPP was introduced for statistical inference of species trees from unlinked bi-allelic markers. The generative process assumed by the method combined both a model of evolution for the bi-allelic markers, as well as the multispecies coalescent. A novel component of the method was a polynomial-time algorithm for exact computation of the likelihood of a fixed species tree via integration over all possible gene trees for a given marker. Here we report on a method for Bayesian inference of phylogenetic networks from bi-allelic markers. Our method significantly extends the algorithm for exact computation of phylogenetic network likelihood via integration over all possible gene trees. Unlike the case of species trees, the algorithm is no longer polynomial-time on all instances of phylogenetic networks. Furthermore, the method utilizes a reversible-jump MCMC technique to sample the posterior of phylogenetic networks given bi-allelic marker data. Our method has a very good performance in terms of accuracy and robustness as we demonstrate on simulated data, as well as a data set of multiple New Zealand species of the plant genus Ourisia (Plantaginaceae). We implemented the method in the publicly available, open-source PhyloNet software package.
Author summary: The availability of genomic data has revolutionized the study of evolutionary histories and phylogeny inference. Inferring evolutionary histories from genomic data requires, in most cases, accounting for the fact that different genomic regions could have evolutionary histories that differ from each other as well as from that of the species from which the genomes were sampled. In this paper, we introduce a method for inferring evolutionary histories while accounting for two processes that could give rise to such differences across the genomes, namely incomplete lineage sorting and hybridization. We introduce a novel algorithm for computing the likelihood of phylogenetic networks from bi-allelic genetic markers and use it in a Bayesian inference method. Analyses of synthetic and empirical data sets show a very good performance of the method in terms of the estimates it obtains.
Published: 2018

12. Rearrangement moves on rooted phylogenetic networks

Author: Gambette, Philippe, van Iersel, Leo, Jones, Mark, Lafond, Manuel, Pardi, Fabio, and Scornavacca, Celine
Affiliations: Laboratoire d'Informatique Gaspard-Monge (LIGM); Centre National de la Recherche Scientifique (CNRS); Fédération de Recherche Bézout; ESIEE Paris; École des Ponts ParisTech (ENPC); Université Paris-Est Marne-la-Vallée (UPEM); Delft Institute of Applied Mathematics (TWA); Faculty of Electrical Engineering, Mathematics and Computer Science [Delft] (EEMCS); Delft University of Technology (TU Delft); Department of Mathematics and Statistics [Ottawa]; University of Ottawa [Ottawa] (uOttawa); Institut de Biologie Computationnelle (IBC); Université de Montpellier (UM); Institut National de la Recherche Agronomique (INRA); Institut National de Recherche en Informatique et en Automatique (Inria); Méthodes et Algorithmes pour la Bioinformatique (MAB); Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier (LIRMM); Institut des Sciences de l'Evolution de Montpellier (UMR ISEM); École pratique des hautes études (EPHE); Université Paris sciences et lettres (PSL); Centre de Coopération Internationale en Recherche Agronomique pour le Développement (Cirad); Institut de recherche pour le développement [IRD] : UR226
Funding: CNRS PICS 230310 (CoCoAlSeq); NWO Vidi 639.072.602; NSERC PDF Grant; ANR-10-BINF-0001, ANCESTROME, Approche de phylogénie intégrative pour la reconstruction de génomes ancestraux (2010); European Project 634650, H2020, H2020-PHC-2014-two-stage, VIROGENESIS (2015)
Subjects: Optimization, Evolutionary Genetics, Computer and Information Sciences, Evolutionary Processes, nearest-neighbor interchange, Gene Transfer, [INFO.INFO-DM]Computer Science [cs]/Discrete Mathematics [cs.DM], Microbiology, phylogenetic networks, Genetics, Animals, Humans, Evolutionary Systematics, Phylogeny, Taxonomy, Data Management, Horizontal Gene Transfer, Gene Rearrangement, Evolutionary Biology, Models, Genetic, subtree pruning and regrafting, Biology and Life Sciences, Computational Biology, Phylogenetic Analysis, Hominidae, [SDV.BIBS]Life Sciences [q-bio]/Quantitative Methods [q-bio.QM], Organismal Evolution, Phylogenetics, Biology (General), Physical Sciences, Microbial Evolution, rearrangement moves, [INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM], Mathematics, Network Analysis, Research Article
Abstract: Phylogenetic tree reconstruction is usually done by local search heuristics that explore the space of the possible tree topologies via simple rearrangements of their structure. Tree rearrangement heuristics have been used in combination with practically all optimization criteria in use, from maximum likelihood and parsimony to distance-based principles, and in a Bayesian context. Their basic components are rearrangement moves that specify all possible ways of generating alternative phylogenies from a given one, and whose fundamental property is to be able to transform, by repeated application, any phylogeny into any other phylogeny. Despite their long tradition in tree-based phylogenetics, very little research has gone into studying similar rearrangement operations for phylogenetic networks, that is, phylogenies explicitly representing scenarios that include reticulate events such as hybridization, horizontal gene transfer, population admixture, and recombination. To fill this gap, we propose “horizontal” moves that ensure that every network of a certain complexity can be reached from any other network of the same complexity, and “vertical” moves that ensure reachability between networks of different complexities. When applied to phylogenetic trees, our horizontal moves (named rNNI and rSPR) reduce to the best-known moves on rooted phylogenetic trees, nearest-neighbor interchange and rooted subtree pruning and regrafting. Besides a number of reachability results (separating the contributions of horizontal and vertical moves), we prove that rNNI moves are local versions of rSPR moves, and provide bounds on the sizes of the rNNI neighborhoods. The paper focuses on the most biologically meaningful versions of phylogenetic networks, where edges are oriented and reticulation events clearly identified. Moreover, our rearrangement moves are robust to the fact that networks with higher complexity usually allow a better fit with the data. 
Our goal is to provide a solid basis for practical phylogenetic network reconstruction. Author summary: Phylogenetic networks are used to represent reticulate evolution, that is, cases in which the tree-of-life metaphor for evolution breaks down, because some of its branches have merged at one or several points in the past. This may occur, for example, when some organisms in the phylogeny are hybrids. In this paper, we deal with an elementary question for the reconstruction of phylogenetic networks: how to explore the space of all possible networks. The fundamental component for this is the set of operations that should be employed to generate alternative hypotheses for what happened in the past, which serve as basic blocks for optimization techniques such as hill-climbing. Although these approaches have a long tradition in classic tree-based phylogenetics, their application to networks that explicitly represent reticulate evolution is relatively unexplored. This paper provides the fundamental definitions and theoretical results for subsequent work in practical methods for phylogenetic network reconstruction: we subdivide networks into layers, according to a generally-accepted measure of their complexity, and provide operations that allow both to fully explore each layer, and to move across different layers. These operations constitute natural generalizations of well-known operations for the exploration of the space of phylogenetic trees, the lowest layer in the hierarchy described above.
Published: 2017

13. Executable pathway analysis using ensemble discrete-state modeling for large-scale data.

Author: Palli, Rohith, Palshikar, Mukta G., and Thakar, Juilee
Subjects: REGULATOR genes, DATA modeling, GENETIC algorithms, COMPUTATIONAL biology, PHYSICAL sciences, BIOLOGICAL networks
Abstract: Pathway analysis is widely used to gain mechanistic insights from high-throughput omics data. However, most existing methods do not consider signal integration represented by pathway topology, resulting in enrichment of convergent pathways when downstream genes are modulated. Incorporation of signal flow and integration in pathway analysis could rank the pathways based on modulation in key regulatory genes. This implementation can be facilitated for large-scale data by discrete state network modeling due to simplicity in parameterization. Here, we model cellular heterogeneity using discrete state dynamics and measure pathway activities in cross-sectional data. We introduce a new algorithm, Boolean Omics Network Invariant-Time Analysis (BONITA), for signal propagation, signal integration, and pathway analysis. Our signal propagation approach models heterogeneity in transcriptomic data as arising from intercellular heterogeneity rather than intracellular stochasticity, and propagates binary signals repeatedly across networks. Logic rules defining signal integration are inferred by genetic algorithm and are refined by local search. The rules determine the impact of each node in a pathway, which is used to score the probability of the pathway’s modulation by chance. We have comprehensively tested BONITA for application to transcriptomics data from translational studies. Comparison with state-of-the-art pathway analysis methods shows that BONITA has higher sensitivity at lower levels of source node modulation and similar sensitivity at higher levels of source node modulation. Application of BONITA pathway analysis to previously validated RNA-sequencing studies identifies additional relevant pathways in in-vitro human cell line experiments and in-vivo infant studies. Additionally, BONITA successfully detected modulation of disease specific pathways when comparing relevant RNA-sequencing data with healthy controls. 
Most interestingly, the two highest impact score nodes identified by BONITA included known drug targets. Thus, BONITA is a powerful approach to prioritize not only pathways but also specific mechanistic role of genes compared to existing methods. BONITA is available at: . [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

14. Benchmarking network propagation methods for disease gene identification.

Author: Picart-Armada, Sergio, Barrett, Steven J., Willé, David R., Perera-Lluna, Alexandre, Gutteridge, Alex, and Dessailly, Benoit H.
Subjects: BIOLOGICAL networks, GENE regulatory networks, SEED treatment, PROTEIN-protein interactions, GENES, MACHINE learning
Abstract: In-silico identification of potential target genes for disease is an essential aspect of drug target discovery. Recent studies suggest that successful targets can be found by leveraging genetic, genomic and protein interaction information. Here, we systematically tested the ability of 12 varied algorithms, based on network propagation, to identify genes that have been targeted by any drug, on gene-disease data from 22 common non-cancerous diseases in OpenTargets. We considered two biological networks, six performance metrics and compared two types of input gene-disease association scores. The impact of the design factors in performance was quantified through additive explanatory models. Standard cross-validation led to over-optimistic performance estimates due to the presence of protein complexes. In order to obtain realistic estimates, we introduced two novel protein complex-aware cross-validation schemes. When seeding biological networks with known drug targets, machine learning and diffusion-based methods found around 2-4 true targets within the top 20 suggestions. Seeding the networks with genes associated to disease by genetics decreased performance below 1 true hit on average. The use of a larger network, although noisier, improved overall performance. We conclude that diffusion-based prioritisers and machine learning applied to diffusion-based features are suited for drug discovery in practice and improve over simpler neighbour-voting methods. We also demonstrate the large impact of choosing an adequate validation strategy and the definition of seed disease genes. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF
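
Note on entry 14: the abstract attributes over-optimistic estimates to protein complexes being split across training and test folds. The sketch below is a generic grouped fold assignment in Python, not a reconstruction of either of the two schemes the authors introduce. The function name `complex_aware_folds` and the `complex_of` annotation mapping are hypothetical, chosen only for illustration.

```python
import random
from collections import defaultdict


def complex_aware_folds(genes, complex_of, k, seed=0):
    """Assign genes to k folds so that genes sharing a complex share a fold.

    genes: list of gene identifiers.
    complex_of: dict gene -> complex id (hypothetical annotation); genes
        missing from the dict are treated as their own singleton group.
    """
    groups = defaultdict(list)
    for gene in genes:
        groups[complex_of.get(gene, ("singleton", gene))].append(gene)

    # Shuffle reproducibly, then place the largest groups first.
    group_list = list(groups.values())
    random.Random(seed).shuffle(group_list)
    group_list.sort(key=len, reverse=True)

    folds = [[] for _ in range(k)]
    for members in group_list:
        # Greedy balancing: each group goes to the currently smallest fold.
        min(folds, key=len).extend(members)
    return folds
```

Because a whole complex is placed at once, no complex contributes members to both the training and the held-out side of any split; fold sizes are only approximately balanced.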

15. Machine learning-based microarray analyses indicate low-expression genes might collectively influence PAH disease.

Author: Cui, Song, Wu, Qiang, West, James, and Bai, Jiangping
Subjects: FEATURE selection, MACHINE learning, GENES, ARTIFICIAL neural networks, PHYSICAL sciences
Abstract: Accurately predicting and testing the type of pulmonary arterial hypertension (PAH) in each patient using cost-effective microarray-based expression data and machine learning algorithms could greatly help in either identifying the most targeted medicine or adopting other therapeutic measures that could correct/restore defective genetic signaling at the early stage. Furthermore, the prediction model construction processes can also help identify highly informative genes controlling PAH, leading to enhanced understanding of the disease etiology and molecular pathways. In this study, we used several different gene filtering methods based on microarray expression data obtained from a high-quality patient PAH dataset. Following that, we proposed a novel feature selection and refinement algorithm in conjunction with well-known machine learning methods to identify a small set of highly informative genes. Results indicated that clusters of small-expression genes could be extremely informative at predicting and differentiating different forms of PAH. Additionally, our proposed novel feature refinement algorithm could lead to significant enhancement in model performance. To summarize, integrated with state-of-the-art machine learning and novel feature refining algorithms, the most accurate models could provide near-perfect classification accuracies using very few (close to ten) low-expression genes. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

16. Disease gene prediction for molecularly uncharacterized diseases.

Author: Cáceres, Juan J. and Paccanaro, Alberto
Subjects: DISEASES, RESEARCH methodology, COMPUTATIONAL biology, MOLECULAR biology
Abstract: Network medicine approaches have been largely successful at increasing our knowledge of molecularly characterized diseases. Given a set of disease genes associated with a disease, neighbourhood-based methods and random walkers exploit the interactome allowing the prediction of further genes for that disease. In general, however, diseases with no known molecular basis constitute a challenge. Here we present a novel network approach to prioritize gene-disease associations that is able to also predict genes for diseases with no known molecular basis. Our method, which we have called Cardigan (ChARting DIsease Gene AssociatioNs), uses semi-supervised learning and exploits a measure of similarity between disease phenotypes. We evaluated its performance at predicting genes for both molecularly characterized and uncharacterized diseases in OMIM, using both weighted and binary interactomes, and compared it with state-of-the-art methods. Our tests, which use datasets collected at different points in time to replicate the dynamics of the disease gene discovery process, prove that Cardigan is able to accurately predict disease genes for molecularly uncharacterized diseases. Additionally, standard leave-one-out cross validation tests show how our approach outperforms state-of-the-art methods at predicting genes for molecularly characterized diseases by 14%-65%. Cardigan can also be used for disease module prediction, where it outperforms state-of-the-art methods by 87%-299%. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

17. Efficient algorithms to discover alterations with complementary functional association in cancer.

Author: Sarto Basso, Rebecca, Hochbaum, Dorit S., and Vandin, Fabio
Subjects: CANCER genetics, PERTURBATION theory, PHENOTYPES, ALGORITHMS, COMPUTATIONAL biology
Abstract: Recent large cancer studies have measured somatic alterations in an unprecedented number of tumours. These large datasets allow the identification of cancer-related sets of genetic alterations by identifying relevant combinatorial patterns. Among such patterns, mutual exclusivity has been employed by several recent methods that have shown its effectiveness in characterizing gene sets associated to cancer. Mutual exclusivity arises because of the complementarity, at the functional level, of alterations in genes which are part of a group (e.g., a pathway) performing a given function. The availability of quantitative target profiles, from genetic perturbations or from clinical phenotypes, provides additional information that can be leveraged to improve the identification of cancer related gene sets by discovering groups with complementary functional associations with such targets. In this work we study the problem of finding groups of mutually exclusive alterations associated with a quantitative (functional) target. We propose a combinatorial formulation for the problem, and prove that the associated computational problem is computationally hard. We design two algorithms to solve the problem and implement them in our tool UNCOVER. We provide analytic evidence of the effectiveness of UNCOVER in finding high-quality solutions and show experimentally that UNCOVER finds sets of alterations significantly associated with functional targets in a variety of scenarios. In particular, we show that our algorithms find sets which are better than the ones obtained by the state-of-the-art method, even when sets are evaluated using the statistical score employed by the latter. In addition, our algorithms are much faster than the state-of-the-art, allowing the analysis of large datasets of thousands of target profiles from cancer cell lines. 
We show that on two such datasets, one from project Achilles and one from the Genomics of Drug Sensitivity in Cancer project, UNCOVER identifies several significant gene sets with complementary functional associations with targets. Software available at: . [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

18. Uncovering functional signature in neural systems via random matrix theory.

Author: Almog, Assaf, Buijink, M. Renate, Roethler, Ori, Michel, Stephan, Meijer, Johanna H., Rohling, Jos H. T., and Garlaschelli, Diego
Subjects: NEURONS, RANDOM matrices, BRAIN, GENE expression, PHOTOPERIODISM
Abstract: Neural systems are organized in a modular way, serving multiple functionalities. This multiplicity requires that both positive (e.g. excitatory, phase-coherent) and negative (e.g. inhibitory, phase-opposing) interactions take place across brain modules. Unfortunately, most methods to detect modules from time series either neglect or convert to positive, any measured negative correlation. This may leave a significant part of the sign-dependent functional structure undetected. Here we present a novel method, based on random matrix theory, for the identification of sign-dependent modules in the brain. Our method filters out both local (unit-specific) noise and global (system-wide) dependencies that typically obfuscate the presence of such structure. The method is guaranteed to identify an optimally contrasted functional ‘signature’, i.e. a partition into modules that are positively correlated internally and negatively correlated across. The method is purely data-driven, does not use any arbitrary threshold or network projection, and outputs only statistically significant structure. In measurements of neuronal gene expression in the biological clock of mice, the method systematically uncovers two otherwise undetectable, negatively correlated modules whose relative size and mutual interaction strength are found to depend on photoperiod. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

19. A computational framework to assess genome-wide distribution of polymorphic human endogenous retrovirus-K in human populations.

Author: Li, Weiling, Lin, Lin, Malhotra, Raunaq, Yang, Lei, Acharya, Raj, and Poss, Mary
Subjects: HUMAN endogenous retroviruses, COMPUTATIONAL statistics, GENETIC polymorphisms, HUMAN genome, RETROVIRUS genetics, DISEASE risk factors, HUMAN population genetics
Abstract: Human Endogenous Retrovirus type K (HERV-K) is the only HERV known to be insertionally polymorphic; not all individuals have a retrovirus at a specific genomic location. It is possible that HERV-Ks contribute to human disease because people differ in both number and genomic location of these retroviruses. Indeed viral transcripts, proteins, and antibody against HERV-K are detected in cancers, auto-immune, and neurodegenerative diseases. However, attempts to link a polymorphic HERV-K with any disease have been frustrated in part because population prevalence of HERV-K provirus at each polymorphic site is lacking and it is challenging to identify closely related elements such as HERV-K from short read sequence data. We present an integrated and computationally robust approach that uses whole genome short read data to determine the occupation status at all sites reported to contain a HERV-K provirus. Our method estimates the proportion of fixed length genomic sequence (k-mers) from whole genome sequence data matching a reference set of k-mers unique to each HERV-K locus and applies mixture model-based clustering of these values to account for low depth sequence data. Our analysis of 1000 Genomes Project Data (KGP) reveals numerous differences among the five KGP super-populations in the prevalence of individual and co-occurring HERV-K proviruses; we provide a visualization tool to easily depict the proportion of the KGP populations with any combination of polymorphic HERV-K provirus. Further, because HERV-K is insertionally polymorphic, the genome burden of known polymorphic HERV-K is variable in humans; this burden is lowest in East Asian (EAS) individuals. Our study identifies population-specific sequence variation for HERV-K proviruses at several loci. We expect these resources will advance research on HERV-K contributions to human diseases. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF
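
Note on entry 19: the abstract describes estimating, per locus, the proportion of reference k-mers unique to that locus that are matched in whole genome read data. A minimal Python sketch of that proportion follows, under simplifying assumptions: exact string matching, no reverse complements, no sequencing-error handling, and no mixture-model clustering step. The names `kmers` and `locus_kmer_proportion` are hypothetical and are not taken from the authors' software.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def locus_kmer_proportion(reads, locus_kmers, k):
    """Fraction of a locus's reference k-mers seen in at least one read."""
    if not locus_kmers:
        raise ValueError("locus_kmers must not be empty")
    observed = set()
    for read in reads:
        # Keep only the k-mers of this read that belong to the locus set.
        observed |= kmers(read, k) & locus_kmers
    return len(observed) / len(locus_kmers)
```

A value near 1 would suggest the provirus is present at that locus and a value near 0 that it is absent; the abstract indicates that intermediate values from low-depth data are resolved by clustering, which this sketch leaves out.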

20. OptRAM: In-silico strain design via integrative regulatory-metabolic network modeling.

Author: Shen, Fangzhou, Sun, Renliang, Yao, Jie, Li, Jian, Liu, Qian, Price, Nathan D., Liu, Chenguang, and Wang, Zhuo
Subjects: METABOLIC models, GENE regulatory networks, GENETIC overexpression, COMBINATORIAL optimization, CELL growth
Abstract: The ultimate goal of metabolic engineering is to produce desired compounds on an industrial scale in a cost-effective manner. To address challenges in metabolic engineering, computational strain optimization algorithms based on genome-scale metabolic models have increasingly been used to aid in overproducing products of interest. However, most of these strain optimization algorithms utilize a metabolic network alone, with few approaches providing strategies that also include transcriptional regulation. Moreover, previous integrated approaches generally require a pre-existing regulatory network. In this study, we developed a novel strain design algorithm, named OptRAM (Optimization of Regulatory And Metabolic Networks), which can identify combinatorial optimization strategies including overexpression, knockdown or knockout of both metabolic genes and transcription factors. OptRAM is based on our previous IDREAM integrated network framework, which makes it able to deduce a regulatory network from data. OptRAM uses simulated annealing with a novel objective function, which can ensure a favorable coupling between desired chemical and cell growth. The other advance we propose is a systematic evaluation metric of multiple solutions, by considering the essential genes, flux variation, and engineering manipulation cost. We applied OptRAM to generate strain designs for succinate, 2,3-butanediol, and ethanol overproduction in yeast, which predicted high minimum predicted target production rate compared with other methods and previous literature values. Moreover, most of the genes and TFs proposed to be altered by OptRAM in these scenarios have been validated by modification of the exact genes or the target genes regulated by the TFs, for overproduction of these desired compounds by in vivo experiments cataloged in the LASER database. Particularly, we successfully validated the predicted strain optimization strategy for ethanol production by fermentation experiment. 
In conclusion, OptRAM can provide a useful approach that leverages an integrated transcriptional regulatory network and metabolic network to guide metabolic engineering applications. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

21. PAIRUP-MS: Pathway analysis and imputation to relate unknowns in profiles from mass spectrometry-based metabolite data.

Author: Hsu, Yu-Han H., Churchhouse, Claire, Pers, Tune H., Mercader, Josep M., Metspalu, Andres, Fischer, Krista, Fortney, Kristen, Morgen, Eric K., Gonzalez, Clicerio, Gonzalez, Maria E., Esko, Tonu, and Hirschhorn, Joel N.
Subjects: METABOLITES, MASS spectrometry, GENES, BIOMOLECULES, NUCLEAR spectroscopy
Abstract: Metabolomics is a powerful approach for discovering biomarkers and for characterizing the biochemical consequences of genetic variation. While untargeted metabolite profiling can measure thousands of signals in a single experiment, many biologically meaningful signals cannot be readily identified as known metabolites nor compared across datasets, making it difficult to infer biology and to conduct well-powered meta-analyses across studies. To overcome these challenges, we developed a suite of computational methods, PAIRUP-MS, to match metabolite signals across mass spectrometry-based profiling datasets and to generate metabolic pathway annotations for these signals. To pair up signals measured in different datasets, where retention times (RT) are often not comparable or even available, we implemented an imputation-based approach that only requires mass-to-charge ratios (m/z). As validation, we treated each shared known metabolite as an unmatched signal and showed that PAIRUP-MS correctly matched 70–88% of these metabolites from among thousands of signals, equaling or outperforming a standard m/z- and RT-based approach. We performed further validation using genetic data: the most stringent set of matched signals and shared knowns showed comparable consistency of genetic associations across datasets. Next, we developed a pathway reconstitution method to annotate unknown signals using curated metabolic pathways containing known metabolites. We performed genetic validation for the generated annotations, showing that annotated signals associated with gene variants were more likely to be enriched for pathways functionally related to the genes compared to random expectation. 
Finally, we applied PAIRUP-MS to study associations between metabolites and genetic variants or body mass index (BMI) across multiple datasets, identifying up to ~6 times more significant signals and many more BMI-associated pathways compared to the standard practice of only analyzing known metabolites. These results demonstrate that PAIRUP-MS enables analysis of unknown signals in a robust, biologically meaningful manner and provides a path to more comprehensive, well-powered studies of untargeted metabolomics data. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

22. Local epigenomic state cannot discriminate interacting and non-interacting enhancer–promoter pairs with high accuracy.

Author: Xi, Wang and Beer, Michael A.
Subjects: MACHINE learning, ACCURACY, TRAINING, TECHNOLOGICAL innovations, GENERALIZATION
Abstract: We report an experimental design issue in recent machine learning formulations of the enhancer-promoter interaction problem arising from the fact that many enhancer-promoter pairs share features. Cross-fold validation schemes which do not correctly separate these feature sharing enhancer-promoter pairs into one test set report high accuracy, which is actually arising from high training set accuracy and a failure to properly evaluate generalization performance. Cross-fold validation schemes which properly segregate pairs with shared features show markedly reduced ability to predict enhancer-promoter interactions from epigenomic state. Parameter scans with multiple models indicate that local epigenomic features of individual pairs of enhancers and promoters cannot distinguish those pairs that interact from those which do not with high accuracy, suggesting that additional information is required to predict enhancer-promoter interactions. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

23. Identifying (un)controllable dynamical behavior in complex networks.

Author: Rozum, Jordan C. and Albert, Réka
Subjects: ORDINARY differential equations, ROBUST control, MOLECULAR recognition, DRUG delivery systems, DROSOPHILA melanogaster
Abstract: We present a technique applicable in any dynamical framework to identify control-robust subsets of an interacting system. These robust subsystems, which we call stable modules, are characterized by constraints on the variables that make up the subsystem. They are robust in the sense that if the defining constraints are satisfied at a given time, they remain satisfied for all later times, regardless of what happens in the rest of the system, and can only be broken if the constrained variables are externally manipulated. We identify stable modules as graph structures in an expanded network, which represents causal links between variable constraints. A stable module represents a system “decision point”, or trap subspace. Using the expanded network, small stable modules can be composed sequentially to form larger stable modules that describe dynamics on the system level. Collections of large, mutually exclusive stable modules describe the system’s repertoire of long-term behaviors. We implement this technique in a broad class of dynamical systems and illustrate its practical utility via examples and algorithmic analysis of two published biological network models. In the segment polarity gene network of Drosophila melanogaster, we obtain a state-space visualization that reproduces by novel means the four possible cell fates and predicts the outcome of cell transplant experiments. In the T-cell signaling network, we identify six signaling elements that determine the high-signal response and show that control of an element connected to them cannot disrupt this response. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF
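
Note on entry 23: the abstract defines a stable module by constraints that, once satisfied, stay satisfied regardless of the rest of the system. For the special case of a small Boolean network, that property can be checked by brute force, as sketched below in Python. This is an illustration only: the authors identify stable modules as graph structures in an expanded network, which is not what this code does, and the function name `is_trap_subspace` and the rule format are hypothetical.

```python
from itertools import product


def is_trap_subspace(update_rules, constraints):
    """Check that the fixed values in `constraints` persist under updates.

    update_rules: dict name -> function(state dict) -> bool (next value).
    constraints: dict name -> bool, the candidate set of fixed variables.
    """
    free = [v for v in update_rules if v not in constraints]
    # Enumerate every assignment of the unconstrained variables.
    for values in product([False, True], repeat=len(free)):
        state = dict(constraints)
        state.update(zip(free, values))
        for var, fixed in constraints.items():
            # A constrained variable must map back to its fixed value.
            if update_rules[var](state) != fixed:
                return False
    return True
```

For a toy network with rules A' = B, B' = A, C' = A and not C, the constraint set {A: True, B: True} passes the check, while {A: True} alone does not, because B is then free to switch A off. The enumeration is exponential in the number of free variables, so it is only practical for very small examples.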

24. Deepbinner: Demultiplexing barcoded Oxford Nanopore reads with deep convolutional neural networks.

Author: Wick, Ryan R., Judd, Louise M., and Holt, Kathryn E.
Subjects: DEMULTIPLEXING, GENOMES, GENETIC barcoding, NANOPORES, ARTIFICIAL neural networks, DNA
Abstract: Multiplexing, the simultaneous sequencing of multiple barcoded DNA samples on a single flow cell, has made Oxford Nanopore sequencing cost-effective for small genomes. However, it depends on the ability to sort the resulting sequencing reads by barcode, and current demultiplexing tools fail to classify many reads. Here we present Deepbinner, a tool for Oxford Nanopore demultiplexing that uses a deep neural network to classify reads based on the raw electrical read signal. This ‘signal-space’ approach allows for greater accuracy than existing ‘base-space’ tools (Albacore and Porechop) for which signals must first be converted to DNA base calls, itself a complex problem that can introduce noise into the barcode sequence. To assess Deepbinner and existing tools, we performed multiplex sequencing on 12 amplicons chosen for their distinguishability. This allowed us to establish a ground truth classification for each read based on internal sequence alone. Deepbinner had the lowest rate of unclassified reads (7.8%) and the highest demultiplexing precision (98.5% of classified reads were correctly assigned). It can be used alone (to maximise the number of classified reads) or in conjunction with other demultiplexers (to maximise precision and minimise false positive classifications). We also found cross-sample chimeric reads (0.3%) and evidence of barcode switching (0.3%) in our dataset, which likely arise during library preparation and may be detrimental for quantitative studies that use multiplexing. Deepbinner is open source (GPLv3) and available at . [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

25. Systematically benchmarking peptide-MHC binding predictors: From synthetic to naturally processed epitopes.

Author: Zhao, Weilong and Sher, Xinwei
Subjects: MACHINE learning, EPITOPES, T cells, CLINICAL immunology, HLA histocompatibility antigens
Abstract: A number of machine learning-based predictors have been developed for identifying immunogenic T-cell epitopes based on major histocompatibility complex (MHC) class I and II binding affinities. Rationally selecting the most appropriate tool has been complicated by the evolving training data and machine learning methods. Despite the recent advances made in generating high-quality MHC-eluted, naturally processed ligandome, the reliability of new predictors on these epitopes has yet to be evaluated. This study reports the latest benchmarking on an extensive set of MHC-binding predictors by using newly available, untested data of both synthetic and naturally processed epitopes. 32 human leukocyte antigen (HLA) class I and 24 HLA class II alleles are included in the blind test set. Artificial neural network (ANN)-based approaches demonstrated better performance than regression-based machine learning and structural modeling. Among the 18 predictors benchmarked, ANN-based mhcflurry and nn_align perform the best for MHC class I 9-mer and class II 15-mer predictions, respectively, on binding/non-binding classification (Area Under Curves = 0.911). NetMHCpan4 also demonstrated comparable predictive power. Our customization of mhcflurry to a pan-HLA predictor has achieved similar accuracy to NetMHCpan. The overall accuracy of these methods is comparable between 9-mer and 10-mer testing data. However, the top methods deliver low correlations between the predicted versus the experimental affinities for strong MHC binders. When used on naturally processed MHC-ligands, tools that have been trained on elution data (NetMHCpan4 and MixMHCpred) show better accuracy than pure binding affinity predictors. The variability of false prediction rate is considerable among HLA types and datasets. Finally, the structure-based predictor Rosetta FlexPepDock is less optimal compared to the machine learning approaches. 
With our benchmarking of MHC-binding and MHC-elution predictors using comprehensive metrics, an unbiased view for establishing best practice of T-cell epitope predictions is presented, facilitating future development of methods in immunogenomics. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

26. Maintaining maximal metabolic flux by gene expression control.

Author: Planqué, Robert, Hulshof, Josephus, Teusink, Bas, Hendriks, Johannes C., and Bruggeman, Frank J.
Subjects: ENZYMOLOGY, CARBOHYDRATES, BIOCHEMISTRY, GALACTOSE, MONOSACCHARIDES
Abstract: One of the marvels of biology is the phenotypic plasticity of microorganisms. It allows them to maintain high growth rates across conditions. Studies suggest that cells can express metabolic enzymes at tuned concentrations through adjustment of gene expression. The associated transcription factors are often regulated by intracellular metabolites. Here we study metabolite-mediated regulation of metabolic-gene expression that maximises metabolic fluxes across conditions. We developed an adaptive control theory, qORAC (for ‘Specific Flux (q) Optimization by Robust Adaptive Control’), and illustrate it with several examples of metabolic pathways. The key feature of the theory is that it does not require knowledge of the regulatory network, only of the metabolic part. We derive that maximal metabolic flux can be maintained in the face of varying N environmental parameters only if the number of transcription-factor binding metabolites is at least equal to N. The controlling circuits appear to require simple biochemical kinetics. We conclude that microorganisms likely can achieve maximal rates in metabolic pathways, in the face of environmental changes. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

27. SIG-DB: Leveraging homomorphic encryption to securely interrogate privately held genomic databases.

Author: Titus, Alexander J., Flower, Audrey, Hagerty, Patrick, Gamble, Paul, Lewis, Charlie, Stavish, Todd, O’Connell, Kevin P., Shipley, Greg, and Rogers, Stephanie M.
Subjects: GENOMES, BIOLOGICAL databases, CRYPTOGRAPHY, BIOINFORMATICS, BIOMATHEMATICS, GENETICS
Abstract: Genomic data are becoming increasingly valuable as we develop methods to utilize the information at scale and gain a greater understanding of how genetic information relates to biological function. Advances in synthetic biology and the decreased cost of sequencing are increasing the amount of privately held genomic data. As the quantity and value of private genomic data grow, so does the incentive to acquire and protect such data, which creates a need to store and process these data securely. We present an algorithm for the Secure Interrogation of Genomic DataBases (SIG-DB). The SIG-DB algorithm enables databases of genomic sequences to be searched with an encrypted query sequence without revealing the query sequence to the Database Owner or any of the database sequences to the Querier. SIG-DB is the first application of its kind to take advantage of locality-sensitive hashing and homomorphic encryption to allow generalized sequence-to-sequence comparisons of genomic data. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

28. SILGGM: An extensive R package for efficient statistical inference in large-scale gene networks.

Author: Zhang, Rong, Ren, Zhao, and Chen, Wei
Subjects: GENE expression, IMMUNOLOGY, T cells, RNA sequencing, GENETICS
Abstract: Gene co-expression network analysis is extremely useful in interpreting a complex biological process. The recent droplet-based single-cell technology is able to generate much larger gene expression data routinely with thousands of samples and tens of thousands of genes. To analyze such a large-scale gene-gene network, remarkable progress has been made in rigorous statistical inference of high-dimensional Gaussian graphical model (GGM). These approaches provide a formal confidence interval or a p-value rather than only a single point estimator for conditional dependence of a gene pair and are more desirable for identifying reliable gene networks. To promote their widespread use, we herein introduce an extensive and efficient R package named SILGGM (Statistical Inference of Large-scale Gaussian Graphical Model) that includes four main approaches in statistical inference of high-dimensional GGM. Unlike the existing tools, SILGGM provides statistically efficient inference on both individual gene pair and whole-scale gene pairs. It has a novel and consistent false discovery rate (FDR) procedure in all four methodologies. Based on the user-friendly design, it provides outputs compatible with multiple platforms for interactive network visualization. Furthermore, comparisons in simulation illustrate that SILGGM can accelerate the existing MATLAB implementation by several orders of magnitude and further improve the speed of the already very efficient R package FastGGM. Testing results from the simulated data confirm the validity of all the approaches in SILGGM even in a very large-scale setting with the number of variables or genes to a ten thousand level. We have also applied our package to a novel single-cell RNA-seq data set with pan T cells. The results show that the approaches in SILGGM significantly outperform the conventional ones in a biological sense. The package is freely available via CRAN at . [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

29. Rare-event sampling of epigenetic landscapes and phenotype transitions.

Author: Tse, Margaret J., Chu, Brian K., Gallivan, Cameron P., and Read, Elizabeth L.
Subjects: EPIGENETICS, PHENOTYPES, GENETIC regulation, GENE expression, EMBRYONIC stem cells
Abstract: Stochastic simulation has been a powerful tool for studying the dynamics of gene regulatory networks, particularly in terms of understanding how cell-phenotype stability and fate-transitions are impacted by noisy gene expression. However, gene networks often have dynamics characterized by multiple attractors. Stochastic simulation is often inefficient for such systems, because most of the simulation time is spent waiting for rare, barrier-crossing events to occur. We present a rare-event simulation-based method for computing epigenetic landscapes and phenotype-transitions in metastable gene networks. Our computational pipeline was inspired by studies of metastability and barrier-crossing in protein folding, and provides an automated means of computing and visualizing essential stationary and dynamic information that is generally inaccessible to conventional simulation. Applied to a network model of pluripotency in Embryonic Stem Cells, our simulations revealed rare phenotypes and approximately Markovian transitions among phenotype-states, occurring with a broad range of timescales. The relative probabilities of phenotypes and the transition paths linking pluripotency and differentiation are sensitive to global kinetic parameters governing transcription factor-DNA binding kinetics. Our approach significantly expands the capability of stochastic simulation to investigate gene regulatory network dynamics, which may help guide rational cell reprogramming strategies. Our approach is also generalizable to other types of molecular networks and stochastic dynamics frameworks. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

30. miRAW: A deep learning-based approach to predict microRNA targets by analyzing whole microRNA transcripts.

Author: Pla, Albert, Zhong, Xiangfu, and Rayner, Simon
Subjects: ARTIFICIAL intelligence in medicine, MICRORNA, COMPUTERS in medicine, DEEP learning, GENE expression, ARTIFICIAL neural networks
Abstract: MicroRNAs (miRNAs) are small non-coding RNAs that regulate gene expression by binding to partially complementary regions within the 3’UTR of their target genes. Computational methods play an important role in target prediction and assume that the miRNA “seed region” (nt 2 to 8) is required for functional targeting, but typically only identify ∼80% of known bindings. Recent studies have highlighted a role for the entire miRNA, suggesting that a more flexible methodology is needed. We present a novel approach for miRNA target prediction based on Deep Learning (DL) which, rather than incorporating any knowledge (such as seed regions), investigates the entire miRNA and 3’UTR mRNA nucleotides to learn an uninhibited set of feature descriptors related to the targeting process. We collected more than 150,000 experimentally validated homo sapiens miRNA:gene targets and cross referenced them with different CLIP-Seq, CLASH and iPAR-CLIP datasets to obtain ∼20,000 validated miRNA:gene exact target sites. Using this data, we implemented and trained a deep neural network—composed of autoencoders and a feed-forward network—able to automatically learn features describing miRNA-mRNA interactions and assess functionality. Predictions were then refined using information such as site location or site accessibility energy. In a comparison using independent datasets, our DL approach consistently outperformed existing prediction methods, recognizing the seed region as a common feature in the targeting process, but also identifying the role of pairings outside this region. Thermodynamic analysis also suggests that site accessibility plays a role in targeting but that it cannot be used as a sole indicator for functionality. Data and source code available at: . [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

31. A loop-counting method for covariate-corrected low-rank biclustering of gene-expression and genome-wide association study data.

Author: Rangan, Aaditya V., McGrouther, Caroline C., Kelsoe, John, Schork, Nicholas, Stahl, Eli, Zhu, Qian, Krishnan, Arjun, Yao, Vicky, Troyanskaya, Olga, Bilaloglu, Seda, Raghavan, Preeti, Bergen, Sarah, Jureus, Anders, and Landen, Mikael
Subjects: HUMAN genome, GENE expression, MOLECULAR biology, SINGLE nucleotide polymorphisms, BIOLOGICAL evolution
Abstract: A common goal in data-analysis is to sift through a large data-matrix and detect any significant submatrices (i.e., biclusters) that have a low numerical rank. We present a simple algorithm for tackling this biclustering problem. Our algorithm accumulates information about 2-by-2 submatrices (i.e., ‘loops’) within the data-matrix, and focuses on rows and columns of the data-matrix that participate in an abundance of low-rank loops. We demonstrate, through analysis and numerical-experiments, that this loop-counting method performs well in a variety of scenarios, outperforming simple spectral methods in many situations of interest. Another important feature of our method is that it can easily be modified to account for aspects of experimental design which commonly arise in practice. For example, our algorithm can be modified to correct for controls, categorical- and continuous-covariates, as well as sparsity within the data. We demonstrate these practical features with two examples; the first drawn from gene-expression analysis and the second drawn from a much larger genome-wide-association-study (GWAS). [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

32. beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types.

Author: Lun, Aaron T. L., Pagès, Hervé, and Smith, Mike L.
Subjects: RNA sequencing, C++, GENOMICS, COMPUTATIONAL biology
Abstract: Biological experiments involving genomics or other high-throughput assays typically yield a data matrix that can be explored and analyzed using the R programming language with packages from the Bioconductor project. Improvements in the throughput of these assays have resulted in an explosion of data even from routine experiments, which poses a challenge to the existing computational infrastructure for statistical data analysis. For example, single-cell RNA sequencing (scRNA-seq) experiments frequently generate large matrices containing expression values for each gene in each cell, requiring sparse or file-backed representations for memory-efficient manipulation in R. These alternative representations are not easily compatible with high-performance C++ code used for computationally intensive tasks in existing R/Bioconductor packages. Here, we describe a C++ interface named beachmat, which enables agnostic data access from various matrix representations. This allows package developers to write efficient C++ code that is interoperable with dense, sparse and file-backed matrices, amongst others. We evaluated the performance of beachmat for accessing data from each matrix representation using both simulated and real scRNA-seq data, and defined a clear memory/speed trade-off to motivate the choice of an appropriate representation. We also demonstrate how beachmat can be incorporated into the code of other packages to drive analyses of a very large scRNA-seq data set. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

33. On the role of extrinsic noise in microRNA-mediated bimodal gene expression.

Author: Del Giudice, Marco, Bo, Stefano, Grigolon, Silvia, and Bosia, Carla
Subjects: CELL differentiation, CELL physiology, GENE expression, MICRORNA, GENETIC transcription
Abstract: Several studies highlighted the relevance of extrinsic noise in shaping cell decision making and differentiation in molecular networks. Bimodal distributions of gene expression levels provide experimental evidence of phenotypic differentiation, where the modes of the distribution often correspond to different physiological states of the system. We theoretically address the presence of bimodal phenotypes in the context of microRNA (miRNA)-mediated regulation. MiRNAs are small noncoding RNA molecules that downregulate the expression of their target mRNAs. The nature of this interaction is titrative and induces a threshold effect: below a given target transcription rate almost no mRNAs are free and available for translation. We investigate the effect of extrinsic noise on the system by introducing a fluctuating miRNA-transcription rate. We find that the presence of extrinsic noise favours the presence of bimodal target distributions which can be observed for a wider range of parameters compared to the case with intrinsic noise only and for lower miRNA-target interaction strength. Our results suggest that combining threshold-inducing interactions with extrinsic noise provides a simple and robust mechanism for obtaining bimodal populations without requiring fine tuning. Furthermore, we characterise the protein distribution’s dependence on protein half-life. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

34. Using pseudoalignment and base quality to accurately quantify microbial community composition.

Author: Reppell, Mark and Novembre, John
Subjects: DNA analysis, MICROBIAL diversity, COMPUTATIONAL biology, MOLECULAR biology, COMPUTER simulation
Abstract: Pooled DNA from multiple unknown organisms arises in a variety of contexts, for example microbial samples from ecological or human health research. Determining the composition of pooled samples can be difficult, especially at the scale of modern sequencing data and reference databases. Here we propose a novel method for taxonomic profiling in pooled DNA that combines the speed and low-memory requirements of k-mer based pseudoalignment with a likelihood framework that uses base quality information to better resolve multiply mapped reads. We apply the method to the problem of classifying 16S rRNA reads using a reference database of known organisms, a common challenge in microbiome research. Using simulations, we show the method is accurate across a variety of read lengths, with different length reference sequences, at different sample depths, and when samples contain reads originating from organisms absent from the reference. We also assess performance in real 16S data, where we reanalyze previous genetic association data to show our method discovers a larger number of quantitative trait associations than other widely used methods. We implement our method in the software Karp, for k-mer based analysis of read pools, to provide a novel combination of speed and accuracy that is uniquely suited for enhancing discoveries in microbial studies. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

35. Cox-nnet: An artificial neural network method for prognosis prediction of high-throughput omics data.

Author: Ching, Travers, Zhu, Xun, and Garmire, Lana X.
Subjects: ARTIFICIAL neural networks, COMPUTING platforms, INTERNET in medicine, PROPORTIONAL hazards models, COMPUTER architecture
Abstract: Artificial neural networks (ANN) are computing architectures with many interconnections of simple neural-inspired computing elements, and have been applied to biomedical fields such as imaging analysis and diagnosis. We have developed a new ANN framework called Cox-nnet to predict patient prognosis from high throughput transcriptomics data. In 10 TCGA RNA-Seq data sets, Cox-nnet achieves the same or better predictive accuracy compared to other methods, including Cox-proportional hazards regression (with LASSO, ridge, and minimax concave penalty), Random Forests Survival and CoxBoost. Cox-nnet also reveals richer biological information, at both the pathway and gene levels. The outputs from the hidden layer node provide an alternative approach for survival-sensitive dimension reduction. In summary, we have developed a new method for accurate and efficient prognosis prediction on high throughput data, with functional biological insights. The source code is freely available at . [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

36. Memory functions reveal structural properties of gene regulatory networks.

Author: Herrera-Delgado, Edgar, Perez-Carrasco, Ruben, Briscoe, James, and Sollich, Peter
Subjects: GENE regulatory networks, HOMEOSTASIS, THERMODYNAMICS, TRANSCRIPTION factors, NEURAL tube
Abstract: Gene regulatory networks (GRNs) control cellular function and decision making during tissue development and homeostasis. Mathematical tools based on dynamical systems theory are often used to model these networks, but the size and complexity of these models mean that their behaviour is not always intuitive and the underlying mechanisms can be difficult to decipher. For this reason, methods that simplify and aid exploration of complex networks are necessary. To this end we develop a broadly applicable form of the Zwanzig-Mori projection. By first converting a thermodynamic state ensemble model of gene regulation into mass action reactions we derive a general method that produces a set of time evolution equations for a subset of components of a network. The influence of the rest of the network, the bulk, is captured by memory functions that describe how the subnetwork reacts to its own past state via components in the bulk. These memory functions provide probes of near-steady state dynamics, revealing information not easily accessible otherwise. We illustrate the method on a simple cross-repressive transcriptional motif to show that memory functions not only simplify the analysis of the subnetwork but also have a natural interpretation. We then apply the approach to a GRN from the vertebrate neural tube, a well characterised developmental transcriptional network composed of four interacting transcription factors. The memory functions reveal the function of specific links within the neural tube network and identify features of the regulatory structure that specifically increase the robustness of the network to initial conditions. Taken together, the study provides evidence that Zwanzig-Mori projections offer powerful and effective tools for simplifying and exploring the behaviour of GRNs. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

37. The development and application of bioinformatics core competencies to improve bioinformatics training and education.

Author: Mulder, Nicola, Schwartz, Russell, Brazas, Michelle D., Brooksbank, Cath, Gaeta, Bruno, Morgan, Sarah L., Pauley, Mark A., Rosenwald, Anne, Rustici, Gabriella, Sierk, Michael, Warnow, Tandy, and Welch, Lonnie
Subjects: BIOINFORMATICS, SYSTEMS biology, COMPUTATIONAL biology, CORE competencies, OCCUPATIONAL training
Abstract: Bioinformatics is recognized as part of the essential knowledge base of numerous career paths in biomedical research and healthcare. However, there is little agreement in the field over what that knowledge entails or how best to provide it. These disagreements are compounded by the wide range of populations in need of bioinformatics training, with divergent prior backgrounds and intended application areas. The Curriculum Task Force of the International Society of Computational Biology (ISCB) Education Committee has sought to provide a framework for training needs and curricula in terms of a set of bioinformatics core competencies that cut across many user personas and training programs. The initial competencies developed based on surveys of employers and training programs have since been refined through a multiyear process of community engagement. This report describes the current status of the competencies and presents a series of use cases illustrating how they are being applied in diverse training contexts. These use cases are intended to demonstrate how others can make use of the competencies and engage in the process of their continuing refinement and application. The report concludes with a consideration of remaining challenges and future plans. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

38. mixOmics: An R package for ‘omics feature selection and multiple data integration.

Author: Rohart, Florian, Gautier, Benoît, Singh, Amrit, and Lê Cao, Kim-Anh
Subjects: COMPUTATIONAL biology, DATA integration, BIOLOGICAL research, DATA mining, SYSTEMS biology
Abstract: The advent of high throughput technologies has led to a wealth of publicly available ‘omics data coming from different sources, such as transcriptomics, proteomics, metabolomics. Combining such large-scale biological data sets can lead to the discovery of important biological insights, provided that relevant information can be extracted in a holistic manner. Current statistical approaches have been focusing on identifying small subsets of molecules (a ‘molecular signature’) to explain or predict biological conditions, but mainly for a single type of ‘omics. In addition, commonly used methods are univariate and consider each biological feature independently. We introduce mixOmics, an R package dedicated to the multivariate analysis of biological data sets with a specific focus on data exploration, dimension reduction and visualisation. By adopting a systems biology approach, the toolkit provides a wide range of methods that statistically integrate several data sets at once to probe relationships between heterogeneous ‘omics data sets. Our recent methods extend Projection to Latent Structure (PLS) models for discriminant analysis, for data integration across multiple ‘omics data or across independent studies, and for the identification of molecular signatures. We illustrate our latest integrative frameworks for the multivariate analyses of ‘omics data available from the package. [ABSTRACT FROM AUTHOR]
Published: 2017
Full Text: View/download PDF

39. A quadratically regularized functional canonical correlation analysis for identifying the global structure of pleiotropy with NGS data.

Author: Lin, Nan, Zhu, Yun, Fan, Ruzong, and Xiong, Momiao
Subjects: GENETIC pleiotropy, GENETIC disorders, PHENOTYPES, COMPUTER algorithms, STATISTICAL correlation
Abstract: Investigating the pleiotropic effects of genetic variants can increase statistical power, provide important information to achieve deep understanding of the complex genetic structures of disease, and offer powerful tools for designing effective treatments with fewer side effects. However, the current multiple phenotype association analysis paradigm lacks breadth (number of phenotypes and genetic variants jointly analyzed at the same time) and depth (hierarchical structure of phenotype and genotypes). A key issue for high dimensional pleiotropic analysis is to effectively extract informative internal representation and features from high dimensional genotype and phenotype data. To explore correlation information of genetic variants, effectively reduce data dimensions, and overcome critical barriers in advancing the development of novel statistical methods and computational algorithms for genetic pleiotropic analysis, we proposed a new statistical method referred to as a quadratically regularized functional CCA (QRFCCA) for association analysis which combines three approaches: (1) quadratically regularized matrix factorization, (2) functional data analysis and (3) canonical correlation analysis (CCA). Large-scale simulations show that the QRFCCA has a much higher power than that of the ten competing statistics while retaining the appropriate type 1 errors. To further evaluate performance, the QRFCCA and ten other statistics are applied to the whole genome sequencing dataset from the TwinsUK study. We identify a total of 79 genes with rare variants and 67 genes with common variants significantly associated with the 46 traits using QRFCCA. The results show that the QRFCCA substantially outperforms the ten other statistics. [ABSTRACT FROM AUTHOR]
Published: 2017
Full Text: View/download PDF

40. A machine learning approach for predicting CRISPR-Cas9 cleavage efficiencies and patterns underlying its mechanism of action.

Author: Abadi, Shiran, Yan, Winston X., Amar, David, and Mayrose, Itay
Subjects: MACHINE learning, CRISPRS, GENOME editing, RNA, OLIGONUCLEOTIDES
Abstract: The adaptation of the CRISPR-Cas9 system as a genome editing technique has generated much excitement in recent years owing to its ability to manipulate targeted genes and genomic regions that are complementary to a programmed single guide RNA (sgRNA). However, the efficacy of a specific sgRNA is not uniquely defined by exact sequence homology to the target site, thus unintended off-targets might additionally be cleaved. Current methods for sgRNA design are mainly concerned with predicting off-targets for a given sgRNA using basic sequence features and employ elementary rules for ranking possible sgRNAs. Here, we introduce CRISTA (CRISPR Target Assessment), a novel algorithm within the machine learning framework that determines the propensity of a genomic site to be cleaved by a given sgRNA. We show that the predictions made with CRISTA are more accurate than other available methodologies. We further demonstrate that the occurrence of bulges is not a rare phenomenon and should be accounted for in the prediction process. Beyond predicting cleavage efficiencies, the learning process provides inferences regarding patterns that underlie the mechanism of action of the CRISPR-Cas9 system. We discover that attributes that describe the spatial structure and rigidity of the entire genomic site as well as those surrounding the PAM region are a major component of the prediction capabilities. [ABSTRACT FROM AUTHOR]
Published: 2017
Full Text: View/download PDF

41. Identifying parameter regions for multistationarity.

Author: Conradi, Carsten, Feliu, Elisenda, Mincheva, Maya, and Wiuf, Carsten
Subjects: BIOLOGICAL mathematical modeling, ORDINARY differential equations, POLYNOMIALS, GENE expression, CELLULAR signal transduction, COMPUTATIONAL biology
Abstract: Mathematical modelling has become an established tool for studying the dynamics of biological systems. Current applications range from building models that reproduce quantitative data to identifying systems with predefined qualitative features, such as switching behaviour, bistability or oscillations. Mathematically, the latter question amounts to identifying parameter values associated with a given qualitative feature. We introduce a procedure to partition the parameter space of a parameterized system of ordinary differential equations into regions for which the system has a unique or multiple equilibria. The procedure is based on the computation of the Brouwer degree, and it creates a multivariate polynomial with parameter depending coefficients. The signs of the coefficients determine parameter regions with and without multistationarity. A particular strength of the procedure is the avoidance of numerical analysis and parameter sampling. The procedure consists of a number of steps. Each of these steps might be addressed algorithmically using various computer programs and available software, or manually. We demonstrate our procedure on several models of gene transcription and cell signalling, and show that in many cases we obtain a complete partitioning of the parameter space with respect to multistationarity. [ABSTRACT FROM AUTHOR]
Published: 2017
Full Text: View/download PDF

42. Reduction of multiscale stochastic biochemical reaction networks using exact moment derivation.

Author: Kim, Jae Kyoung and Sontag, Eduardo D.
Subjects: BIOCHEMICAL genetics, DNA-binding proteins, GENE expression, BIOCHEMISTRY, COMPUTATIONAL biology
Abstract: Biochemical reaction networks (BRNs) in a cell frequently consist of reactions with disparate timescales. The stochastic simulations of such multiscale BRNs are prohibitively slow due to high computational cost for the simulations of fast reactions. One way to resolve this problem uses the fact that fast species regulated by fast reactions quickly equilibrate to their stationary distribution while slow species are unlikely to be changed. Thus, on a slow timescale, fast species can be replaced by their quasi-steady state (QSS): their stationary conditional expectation values for given slow species. As the QSS are determined solely by the state of slow species, such replacement leads to a reduced model, where fast species are eliminated. However, it is challenging to derive the QSS in the presence of nonlinear reactions. While various approximation schemes for the QSS have been developed, they often lead to considerable errors. Here, we propose two classes of multiscale BRNs which can be reduced by deriving an exact QSS rather than approximations. Specifically, if fast species constitute either a feedforward network or a complex balanced network, the reduced model based on the exact QSS can be derived. Such BRNs are frequently observed in a cell as the feedforward network is one of the fundamental motifs of gene or protein regulatory networks. Furthermore, complex balanced networks also include various types of fast reversible bindings such as bindings between transcriptional factors and gene regulatory sites. The reduced models based on exact QSS, which can be calculated by the computational packages provided in this work, accurately approximate the slow scale dynamics of the original full model with much lower computational cost. [ABSTRACT FROM AUTHOR]
Published: 2017
Full Text: View/download PDF

43. Computation and measurement of cell decision making errors using single cell data.

Author: Habibi, Iman, Cheong, Raymond, Lipniacki, Tomasz, Levchenko, Andre, Emamian, Effat S., and Abdi, Ali
Subjects: COMPUTATIONAL biology, TUMOR necrosis factors, TRANSCRIPTION factors, CELLULAR control mechanisms, CYTOKINES, CELLULAR signal transduction
Abstract: In this study a new computational method is developed to quantify decision making errors in cells, caused by noise and signaling failures. Analysis of the tumor necrosis factor (TNF) signaling pathway, which regulates the transcription factor Nuclear Factor κB (NF-κB), using this method identifies two types of incorrect cell decisions called false alarm and miss. These two events represent, respectively, declaring a signal which is not present and missing a signal that does exist. Using single cell experimental data and the developed method, we compute false alarm and miss error probabilities in wild-type cells and provide a formulation which shows how they depend on the signal transduction noise level. We also show that in the presence of abnormalities in a cell, decision making processes can be significantly affected, compared to a wild-type cell, and the method is able to model and measure such effects. In the TNF—NF-κB pathway, the method computes and reveals changes in false alarm and miss probabilities in A20-deficient cells, caused by the cell’s inability to inhibit the TNF-induced NF-κB response. In biological terms, a higher false alarm metric in this abnormal TNF signaling system indicates perceiving more cytokine signals which in fact do not exist at the system input, whereas a higher miss metric indicates that it is highly likely to miss signals that actually exist. Overall, this study demonstrates the ability of the developed method for modeling cell decision making errors under normal and abnormal conditions, and in the presence of transduction noise uncertainty. Compared to the previously reported pathway capacity metric, our results suggest that the introduced decision error metrics characterize signaling failures more accurately. This is mainly because while capacity is a useful metric to study information transmission in signaling pathways, it does not capture the overlap between TNF-induced noisy response curves. [ABSTRACT FROM AUTHOR]
Published: 2017
Full Text: View/download PDF

44. Stochastic Simulation Service: Bridging the Gap between the Computational Expert and the Biologist.

Author: Drawert, Brian, Hellander, Andreas, Bales, Ben, Banerjee, Debjani, Bellesia, Giovanni, Daigle, Bernie J., Jr., Douglas, Geoffrey, Gu, Mengyuan, Gupta, Anand, Hellander, Stefan, Horuk, Chris, Nath, Dibyendu, Takkar, Aviral, Wu, Sheng, Lötstedt, Per, Krintz, Chandra, and Petzold, Linda R.
Subjects: STOCHASTIC systems, BIOCHEMICAL models, SIMULATION methods & models, DISCRETE systems, BIOLOGISTS
Abstract: We present StochSS: Stochastic Simulation as a Service, an integrated development environment for modeling and simulation of both deterministic and discrete stochastic biochemical systems in up to three dimensions. An easy to use graphical user interface enables researchers to quickly develop and simulate a biological model on a desktop or laptop, which can then be expanded to incorporate increasing levels of complexity. StochSS features state-of-the-art simulation engines. As the demand for computational power increases, StochSS can seamlessly scale computing resources in the cloud. In addition, StochSS can be deployed as a multi-user software environment where collaborators share computational resources and exchange models via a public model repository. We demonstrate the capabilities and ease of use of StochSS with an example of model development and simulation at increasing levels of complexity. [ABSTRACT FROM AUTHOR]
Published: 2016
Full Text: View/download PDF

45. WORMHOLE: Novel Least Diverged Ortholog Prediction through Machine Learning.

Author: Sutphin, George L., Mahoney, J. Matthew, Sheppard, Keith, Walton, David O., and Korstanje, Ron
Subjects: MACHINE learning, DATA mining, COMPARATIVE biology, BIODIVERSITY, AMINO acid sequence, ZEBRA danio, ALGORITHMS
Abstract: The rapid advancement of technology in genomics and targeted genetic manipulation has made comparative biology an increasingly prominent strategy to model human disease processes. Predicting orthology relationships between species is a vital component of comparative biology. Dozens of strategies for predicting orthologs have been developed using combinations of gene and protein sequence, phylogenetic history, and functional interaction with progressively increasing accuracy. A relatively new class of orthology prediction strategies combines aspects of multiple methods into meta-tools, resulting in improved prediction performance. Here we present WORMHOLE, a novel ortholog prediction meta-tool that applies machine learning to integrate 17 distinct ortholog prediction algorithms to identify novel least diverged orthologs (LDOs) between 6 eukaryotic species—humans, mice, zebrafish, fruit flies, nematodes, and budding yeast. Machine learning allows WORMHOLE to intelligently incorporate predictions from a wide-spectrum of strategies in order to form aggregate predictions of LDOs with high confidence. In this study we demonstrate the performance of WORMHOLE across each combination of query and target species. We show that WORMHOLE is particularly adept at improving LDO prediction performance between distantly related species, expanding the pool of LDOs while maintaining low evolutionary distance and a high level of functional relatedness between genes in LDO pairs. We present extensive validation, including cross-validated prediction of PANTHER LDOs and evaluation of evolutionary divergence and functional similarity, and discuss future applications of machine learning in ortholog prediction. A WORMHOLE web tool has been developed and is available at . [ABSTRACT FROM AUTHOR]
Published: 2016
Full Text: View/download PDF

46. Slow manifolds within network dynamics encode working memory efficiently and robustly

Author: ShiNung Ching and Elham Ghazizadeh
Subjects: Theoretical computer science, Computer science, Action Potentials, Systems Science, Nervous System, Cognition, Learning and Memory, Animal Cells, Attractor, Neural Pathways, Medicine and Health Sciences, State space, Biology (General), Neurons, Ecology, Artificial neural network, Brain, Dynamical Systems, Memory, Short-Term, Computational Theory and Mathematics, Modeling and Simulation, Physical Sciences, Cellular Types, Anatomy, Network Analysis, Research Article, Computer and Information Sciences, Dynamical systems theory, Neural Networks, QH301-705.5, Cognitive Neuroscience, Models, Neurological, Cellular and Molecular Neuroscience, Memory, Genetics, Biological neural network, Humans, Learning, Computer Simulation, Working Memory, Molecular Biology, Ecology, Evolution, Behavior and Systematics, Working memory, Biology and Life Sciences, Computational Biology, Eigenvalues, Cell Biology, Network dynamics, Neuroanatomy, Algebra, Linear Algebra, Nonlinear Dynamics, Cellular Neuroscience, Cognitive Science, Neural Networks, Computer, Nerve Net, Mathematics, Neuroscience
Abstract: Working memory is a cognitive function involving the storage and manipulation of latent information over brief intervals of time, thus making it crucial for context-dependent computation. Here, we use a top-down modeling approach to examine network-level mechanisms of working memory, an enigmatic issue and central topic of study in neuroscience. We optimize thousands of recurrent rate-based neural networks on a working memory task and then perform dynamical systems analysis on the ensuing optimized networks, wherein we find that four distinct dynamical mechanisms can emerge. In particular, we show the prevalence of a mechanism in which memories are encoded along slow stable manifolds in the network state space, leading to a phasic neuronal activation profile during memory periods. In contrast to mechanisms in which memories are directly encoded at stable attractors, these networks naturally forget stimuli over time. Despite this seeming functional disadvantage, they are more efficient in terms of how they leverage their attractor landscape and paradoxically, are considerably more robust to noise. Our results provide new hypotheses regarding how working memory function may be encoded within the dynamics of neural circuits. Author summary: The ability to remember information for brief periods of time before using it is a key human ability. For example, retaining a phone number for a few moments prior to entering it into a keypad. Such ability, known as working memory, enables many more complex functions such as planning and reasoning. In this paper, we use theory and computational modeling approaches to try and better understand how circuits and networks in the brain might be achieving working memory. Specifically, we construct hypothetical network models to perform a task that embodies essential aspects of working memory. We then dissect our model to reveal how its components, simulated neural units, interact with each other to represent and maintain information. It turns out that some models can achieve memory by maintaining a fixed representation over time, i.e., the units remain ‘still’ during memory. However, other models fluctuate their activity during memory in a particular way that is seemingly quite efficient and less susceptible to distraction. In total, our computational study provides new theory for how this form of memory might be implemented in brain networks.
Published: 2021

47. Bipartite Community Structure of eQTLs.

Author: Platig, John, Castaldi, Peter J., DeMeo, Dawn, and Quackenbush, John
Subjects: PHENOTYPES, DISEASES, BIPARTITE graphs, PERSONALITY, COMMUNITIES, NODES of Ranvier, POWER law (Mathematics)
Abstract: Genome Wide Association Studies (GWAS) and expression quantitative trait locus (eQTL) analyses have identified genetic associations with a wide range of human phenotypes. However, many of these variants have weak effects and understanding their combined effect remains a challenge. One hypothesis is that multiple SNPs interact in complex networks to influence functional processes that ultimately lead to complex phenotypes, including disease states. Here we present CONDOR, a method that represents both cis- and trans-acting SNPs and the genes with which they are associated as a bipartite graph and then uses the modular structure of that graph to place SNPs into a functional context. In applying CONDOR to eQTLs in chronic obstructive pulmonary disease (COPD), we found the global network “hub” SNPs were devoid of disease associations through GWAS. However, the network was organized into 52 communities of SNPs and genes, many of which were enriched for genes in specific functional classes. We identified local hubs within each community (“core SNPs”) and these were enriched for GWAS SNPs for COPD and many other diseases. These results speak to our intuition: rather than single SNPs influencing single genes, we see groups of SNPs associated with the expression of families of functionally related genes and that disease SNPs are associated with the perturbation of those functions. These methods are not limited in their application to COPD and can be used in the analysis of a wide variety of disease processes and other phenotypic traits. [ABSTRACT FROM AUTHOR]
Published: 2016
Full Text: View/download PDF

48. Inference of Ancestral Recombination Graphs through Topological Data Analysis.

Author: Cámara, Pablo G., Levine, Arnold J., and Rabadán, Raúl
Subjects: GENETIC recombination, DATA analysis, TOPOLOGICAL graph theory, PHYLOGENY, INTEGRATED software
Abstract: The recent explosion of genomic data has underscored the need for interpretable and comprehensive analyses that can capture complex phylogenetic relationships within and across species. Recombination, reassortment and horizontal gene transfer constitute examples of pervasive biological phenomena that cannot be captured by tree-like representations. Starting from hundreds of genomes, we are interested in the reconstruction of potential evolutionary histories leading to the observed data. Ancestral recombination graphs represent potential histories that explicitly accommodate recombination and mutation events across orthologous genomes. However, they are computationally costly to reconstruct, usually being infeasible for more than a few tens of genomes. Recently, Topological Data Analysis (TDA) methods have been proposed as robust and scalable methods that can capture the genetic scale and frequency of recombination. We build upon previous TDA developments for detecting and quantifying recombination, and present a novel framework that can be applied to hundreds of genomes and can be interpreted in terms of minimal histories of mutation and recombination events, quantifying the scales and identifying the genomic locations of recombinations. We implement this framework in a software package, called TARGet, and apply it to several examples, including small migration between different populations, human recombination, and horizontal evolution in finches inhabiting the Galápagos Islands. [ABSTRACT FROM AUTHOR]
Published: 2016
Full Text: View/download PDF

49. Improved Metabolic Models for E. coli and Mycoplasma genitalium from GlobalFit, an Algorithm That Simultaneously Matches Growth and Non-Growth Data Sets.

Author: Hartleb, Daniel, Jarre, Florian, and Lercher, Martin J.
Subjects: ESCHERICHIA coli growth, CELL metabolism, MYCOPLASMA diseases, METABOLIC models, CELLULAR evolution, BIOTECHNOLOGICAL microorganisms, NONSTOICHIOMETRIC compounds, PATHOGENIC bacteria, THERAPEUTICS
Abstract: Constraint-based metabolic modeling methods such as Flux Balance Analysis (FBA) are routinely used to predict the effects of genetic changes and to design strains with desired metabolic properties. The major bottleneck in modeling genome-scale metabolic systems is the establishment and manual curation of reliable stoichiometric models. Initial reconstructions are typically refined through comparisons to experimental growth data from gene knockouts or nutrient environments. Existing methods iteratively correct one erroneous model prediction at a time, resulting in accumulating network changes that are often not globally optimal. We present GF, a bi-level optimization method that finds a globally optimal network, by identifying the minimal set of network changes needed to correctly predict all experimentally observed growth and non-growth cases simultaneously. When applied to the genome-scale metabolic model of Mycoplasma genitalium, GF decreases unexplained gene knockout phenotypes by 79%, increasing accuracy from 87.3% (according to the current state-of-the-art) to 97.3%. While currently available computers do not allow a global optimization of the much larger metabolic network of E. coli, the main strengths of GF are already played out when considering only one growth and one non-growth case simultaneously. Application of a corresponding strategy halves the number of unexplained cases for the already highly curated E. coli model, increasing accuracy from 90.8% to 95.4%. [ABSTRACT FROM AUTHOR]
Published: 2016
Full Text: View/download PDF

50. Inference of Gene Regulatory Network Based on Local Bayesian Networks.

Author: Liu, Fei, Zhang, Shao-Wu, Guo, Wei-Feng, Wei, Ze-Gang, and Chen, Luonan
Subjects: GENE regulatory networks, BAYESIAN analysis, GENE expression, INFORMATION theory, ALGORITHMS
Abstract: The inference of gene regulatory networks (GRNs) from expression data can mine the direct regulations among genes and gain deep insights into biological processes at a network level. During past decades, numerous computational approaches have been introduced for inferring the GRNs. However, many of them still suffer from various problems, e.g., Bayesian network (BN) methods cannot handle large-scale networks due to their high computational complexity, while information theory-based methods cannot identify the directions of regulatory interactions and also suffer from false positive/negative problems. To overcome the limitations, in this work we present a novel algorithm, namely local Bayesian network (LBN), to infer GRNs from gene expression data by using the network decomposition strategy and false-positive edge elimination scheme. Specifically, the LBN algorithm first uses conditional mutual information (CMI) to construct an initial network or GRN, which is decomposed into a number of local networks or GRNs. Then, the BN method is employed to generate a series of local BNs by selecting the k-nearest neighbors of each gene as its candidate regulatory genes, which significantly reduces the exponential search space from all possible GRN structures. Integrating these local BNs forms a tentative network or GRN by performing CMI, which reduces redundant regulations in the GRN and thus alleviates the false positive problem. The final network or GRN can be obtained by iteratively performing CMI and local BN on the tentative network. In the iterative process, the false or redundant regulations are gradually removed. When tested on the benchmark GRN datasets from DREAM challenge as well as the SOS DNA repair network in E. coli, our results suggest that LBN outperforms other state-of-the-art methods (ARACNE, GENIE3 and NARROMI) significantly, with more accurate and robust performance. In particular, the decomposition strategy with local Bayesian networks not only effectively reduces the computational cost of BN due to much smaller sizes of local GRNs, but also identifies the directions of the regulations. [ABSTRACT FROM AUTHOR]
Published: 2016
Full Text: View/download PDF
