101 results
Search Results
2. bigPint: A Bioconductor visualization package that makes big data pint-sized.
- Author
-
Rutter, Lindsay and Cook, Dianne
- Subjects
LIFE sciences, VISUALIZATION, DATA modeling, SOURCE code, COMPUTATIONAL biology, BIOLOGICAL databases, NEXT generation networks, BIG data - Abstract
Interactive data visualization is imperative in the biological sciences. The development of independent layers of interactivity has been an ongoing pursuit in the visualization community. We developed bigPint, a data visualization package available on Bioconductor under the GPL-3 license (https://bioconductor.org/packages/release/bioc/html/bigPint.html). Our software introduces new visualization technology that enables independent layers of interactivity using Plotly in R, which aids in the exploration of large biological datasets. The bigPint package presents modernized versions of scatterplot matrices, volcano plots, and litre plots through the implementation of layered interactivity. These graphics have detected normalization issues, differential expression designation problems, and common analysis errors in public RNA-sequencing datasets. Researchers can apply bigPint graphics to their data by following recommended pipelines written in reproducible code in the user manual. In this paper, we explain how we achieved the independent layers of interactivity that are behind bigPint graphics. Pseudocode and source code are provided. Computational scientists can leverage our open-source code to expand upon our layered interactive technology and/or apply it in new ways toward other computational biology tasks. Author summary: Biological disciplines face the challenge of increasingly large and complex data. One necessary approach toward eliciting information is data visualization. Newer visualization tools incorporate interactive capabilities that allow scientists to extract information more efficiently than static counterparts. In this paper, we introduce technology that allows multiple independent layers of interactive visualization written in open-source code. This technology can be repurposed across various biological problems. Here, we apply this technology to RNA-sequencing data, a popular next-generation sequencing approach that provides snapshots of RNA quantity in biological samples at given moments in time. It can be used to investigate cellular differences between health and disease, cellular changes in response to external stimuli, and additional biological inquiries. RNA-sequencing data is large, noisy, and biased. It requires sophisticated normalization. The most popular open-source RNA-sequencing data analysis software focuses on models, with little emphasis on integrating effective visualization tools. This is despite sound evidence that RNA-sequencing data is most effectively explored using graphical and numerical approaches in a complementary fashion. The software we introduce can make it easier for researchers to use models and visuals in an integrated fashion during RNA-sequencing data analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
3. LOTUS: A single- and multitask machine learning algorithm for the prediction of cancer driver genes.
- Author
-
Collier, Olivier, Stoven, Véronique, and Vert, Jean-Philippe
- Subjects
CANCER genes, MACHINE learning, LEARNING strategies, P53 antioncogene, PROTEIN-protein interactions, COMPUTATIONAL biology, TUMOR suppressor genes - Abstract
Cancer driver genes, i.e., oncogenes and tumor suppressor genes, are involved in the acquisition of important functions in tumors, providing a selective growth advantage, allowing uncontrolled proliferation and avoiding apoptosis. It is therefore important to identify these driver genes, both for the fundamental understanding of cancer and to help find new therapeutic targets or biomarkers. Although the most frequently mutated driver genes have been identified, it is believed that many more remain to be discovered, particularly for driver genes specific to some cancer types. In this paper, we propose a new computational method called LOTUS to predict new driver genes. LOTUS is a machine-learning-based approach that can integrate various types of data in a versatile manner, including information about gene mutations and protein-protein interactions. In addition, LOTUS can predict cancer driver genes in a pan-cancer setting as well as for specific cancer types, using a multitask learning strategy to share information across cancer types. We empirically show that LOTUS outperforms five other state-of-the-art driver gene prediction methods, both in terms of intrinsic consistency and prediction accuracy, and provide predictions of new cancer genes across many cancer types. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
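The LOTUS entry above rests on a multitask strategy that shares information across cancer types. As a rough, hedged illustration of one standard way to obtain such sharing (Daumé-style feature augmentation with ridge regression, not LOTUS's actual one-class kernel method), the sketch below fits a shared signal plus per-task corrections on synthetic gene scores; every name and number in it is invented.

```python
# Hedged sketch: multitask learning via feature augmentation, a simplified
# stand-in for LOTUS's multitask strategy. All data are synthetic; real gene
# features might be mutation counts, PPI degrees, etc.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_tasks, n_genes, n_feat = 3, 200, 10            # e.g. 3 cancer types
X = rng.normal(size=(n_tasks, n_genes, n_feat))
w_shared = rng.normal(size=n_feat)               # signal shared across cancers
scores = X @ w_shared + 0.1 * rng.normal(size=(n_tasks, n_genes))

# Augmented layout: [shared copy | task-0 copy | task-1 copy | task-2 copy]
rows, ys = [], []
for t in range(n_tasks):
    block = np.zeros((n_genes, n_feat * (n_tasks + 1)))
    block[:, :n_feat] = X[t]                              # shared slot
    block[:, n_feat * (t + 1): n_feat * (t + 2)] = X[t]   # task-specific slot
    rows.append(block)
    ys.append(scores[t])

model = Ridge(alpha=1.0).fit(np.vstack(rows), np.concatenate(ys))
print("recovered shared weights ~", model.coef_[:n_feat].round(2))
```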
4. Per-sample immunoglobulin germline inference from B cell receptor deep sequencing data.
- Author
-
Ralph, Duncan K. and Matsen IV, Frederick A.
- Subjects
B cell receptors, IMMUNOGLOBULIN genes, B cells, ALLELES - Abstract
The collection of immunoglobulin genes in an individual’s germline, which gives rise to B cell receptors via recombination, is known to vary significantly across individuals. In humans, for example, each individual has only a fraction of the several hundred known V alleles. Furthermore, the currently-accepted set of known V alleles is both incomplete (particularly for non-European samples), and contains a significant number of spurious alleles. The resulting uncertainty as to which immunoglobulin alleles are present in any given sample results in inaccurate B cell receptor sequence annotations, and in particular inaccurate inferred naive ancestors. In this paper we first show that the currently widespread practice of aligning each sequence to its closest match in the full set of IMGT alleles results in a very large number of spurious alleles that are not in the sample’s true set of germline V alleles. We then describe a new method for inferring each individual’s germline gene set from deep sequencing data, and show that it improves upon existing methods by making a detailed comparison on a variety of simulated and real data samples. This new method has been integrated into the partis annotation and clonal family inference package, available at , and is run by default without affecting overall run time. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
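The entry above argues that aligning each read to its closest match in a fixed allele database produces spurious allele calls. The toy below illustrates that failure mode on invented sequences (not real IMGT V alleles): reads from one novel allele absent from the database get smeared across several nearby known alleles.

```python
# Hedged toy of why closest-match germline assignment can mislead.
import random
random.seed(1)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def mutate(seq, n):
    seq = list(seq)
    for pos in random.sample(range(len(seq)), n):
        seq[pos] = random.choice([b for b in "ACGT" if b != seq[pos]])
    return "".join(seq)

base = "".join(random.choice("ACGT") for _ in range(60))
known = {f"V-allele-{i}": mutate(base, 3) for i in range(4)}   # database
novel = mutate(base, 2)                                        # subject-only

counts = {}
for _ in range(100):                     # reads = novel allele + SHM noise
    read = mutate(novel, 2)
    best = min(known, key=lambda k: hamming(read, known[k]))
    counts[best] = counts.get(best, 0) + 1
print(counts)   # usage smeared over several "known" alleles -> spurious calls
```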
5. Ten quick tips for sharing open genomic data.
- Author
-
Brown, Anne V., Campbell, Jacqueline D., Assefa, Teshale, Grant, David, Nelson, Rex T., Weeks, Nathan T., and Cannon, Steven B.
- Subjects
GENOMICS, BIOLOGICAL databases, NUCLEOTIDE sequencing, DATA curation, DNA data banks - Abstract
As sequencing prices drop, genomic data accumulates—seemingly at a steadily increasing pace. Most genomic data potentially have value beyond the initial purpose—but only if shared with the scientific community. This, of course, is often easier said than done. Some of the challenges in sharing genomic data include data volume (raw file sizes and number of files), complexities, formats, nomenclatures, metadata descriptions, and the choice of a repository. In this paper, we describe 10 quick tips for sharing open genomic data. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
6. SFPEL-LPI: Sequence-based feature projection ensemble learning for predicting LncRNA-protein interactions.
- Author
-
Zhang, Wen, Tang, Guifeng, Huang, Feng, Zhang, Xining, Yue, Xiang, and Wu, Wenjian
- Subjects
RNA-protein interactions, GENETIC regulation, RNA interference, RNA splicing, ADENYLATION (Biochemistry) - Abstract
LncRNA-protein interactions play important roles in post-transcriptional gene regulation, poly-adenylation, splicing and translation. Identification of lncRNA-protein interactions helps to understand lncRNA-related activities. Existing computational methods utilize multiple lncRNA features or multiple protein features to predict lncRNA-protein interactions, but such features are not available for all lncRNAs or proteins; moreover, most existing methods cannot predict interacting proteins (or lncRNAs) for new lncRNAs (or proteins) that have no known interactions. In this paper, we propose the sequence-based feature projection ensemble learning method, "SFPEL-LPI", to predict lncRNA-protein interactions. First, SFPEL-LPI extracts lncRNA sequence-based features and protein sequence-based features. Second, SFPEL-LPI calculates multiple lncRNA-lncRNA similarities and protein-protein similarities by using lncRNA sequences, protein sequences and known lncRNA-protein interactions. Then, SFPEL-LPI combines multiple similarities and multiple features with a feature projection ensemble learning framework. In computational experiments, SFPEL-LPI accurately predicts lncRNA-protein associations and outperforms other state-of-the-art methods. More importantly, SFPEL-LPI can be applied to new lncRNAs (or proteins). The case studies demonstrate that our method can identify novel lncRNA-protein interactions that are confirmed in the literature. Finally, we construct a user-friendly web server, available at . [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
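The SFPEL-LPI entry above combines multiple sequence-derived similarity matrices before scoring candidate pairs. As a much-simplified stand-in for its feature-projection ensemble (the real method differs in detail), the sketch below fuses two toy lncRNA similarity matrices and propagates known interaction labels over the fused graph; all matrices and weights are invented.

```python
# Hedged sketch: similarity fusion + label propagation on tiny synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n_lnc, n_prot = 6, 4
Y = (rng.random((n_lnc, n_prot)) < 0.3).astype(float)  # known interactions

def norm(S):                     # symmetric normalization D^-1/2 S D^-1/2
    d = S.sum(1)
    D = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    return D @ S @ D

S1 = rng.random((n_lnc, n_lnc)); S1 = (S1 + S1.T) / 2   # k-mer similarity
S2 = rng.random((n_lnc, n_lnc)); S2 = (S2 + S2.T) / 2   # profile similarity
S = norm(0.6 * S1 + 0.4 * S2)    # fused similarity; weights are arbitrary here

F, alpha = Y.copy(), 0.8
for _ in range(50):              # converges toward (1-a)(I - a S)^-1 Y
    F = alpha * S @ F + (1 - alpha) * Y
print(F.round(2))                # scores for unobserved lncRNA-protein pairs
```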
7. Discrete modeling for integration and analysis of large-scale signaling networks.
- Author
-
Vignet, Pierre, Coquet, Jean, Auber, Sébastien, Boudet, Matéo, Siegel, Anne, and Théret, Nathalie
- Subjects
LINKED data (Semantic Web), BIOLOGICAL systems, BIOLOGICAL databases, EPITHELIAL-mesenchymal transition, BIOTIC communities, BIOMOLECULES, POLYMER networks - Abstract
Most biological processes are orchestrated by large-scale molecular networks which are described in large-scale model repositories and whose dynamics are extremely complex. An observed phenotype is a state of this system that results from control mechanisms whose identification is key to its understanding. The Biological Pathway Exchange (BioPAX) format is widely used to standardize the biological information relative to regulatory processes. However, few of the modeling approaches developed so far enable computation of the events that control a phenotype in large-scale networks. Here we developed an integrated approach to build large-scale dynamic networks from BioPAX knowledge databases in order to analyse trajectories and to identify sets of biological entities that control a phenotype. The Cadbiom approach relies on the guarded transitions formalism, a discrete modeling approach which models a system's dynamics by taking into account competition and cooperation events in chains of reactions. The method can be applied to every BioPAX (large-scale) model thanks to a specific package which automatically generates Cadbiom models from BioPAX files. The Cadbiom framework was applied to the BioPAX version of two resources (PID, KEGG) of the Pathway Commons database and to the Atlas of Cancer Signalling Network (ACSN). As a case study, it was used to characterize sets of biological entities implicated in the epithelial-mesenchymal transition. Our results highlight the similarities between the PID and ACSN resources in terms of biological content, and underline the heterogeneous usage of the BioPAX semantics, which limits the fusion of models and requires curation. Causality analyses demonstrate the complementarity of the databases in terms of the combinatorics of controllers that explain a phenotype. From a biological perspective, our results show the specificity of controllers for epithelial and mesenchymal phenotypes that are consistent with the literature and identify a novel signature for intermediate states. Author summary: The computation of sets of biological entities implicated in phenotypes is hampered by the complex nature of controllers acting in competitive or cooperative combinations. These biological mechanisms are underpinned by chains of reactions involving interactions between biomolecules (DNA, RNA, proteins, lipids, complexes, etc.), all of which form complex networks. Hence, the identification of controllers relies on computational methods for dynamical systems, which require the biological information about the interactions to be translated into a formal language. The BioPAX standard is a reference ontology associated with a description language to describe biological mechanisms, which satisfies the Linked Open Data initiative recommendations for data interoperability. Although it has been widely adopted by the community to describe biological pathways, no computational method has been able to study the dynamics of the networks described in the BioPAX large-scale resources. To solve this issue, our Cadbiom framework was designed to automatically transcribe the biological systems knowledge of large-scale BioPAX networks into discrete models. The framework then identifies the trajectories that explain a biological phenotype (e.g., all the biomolecules that are activated to induce the expression of a gene). Here, we created Cadbiom models from three biological pathway databases (KEGG, PID and ACSN). The comparative analysis of these models highlighted the diversity of molecules in sets of biological entities that can explain the same phenotype. The application of our framework to the search for biomolecules regulating the epithelial-mesenchymal transition not only confirmed known pathways in the control of epithelial or mesenchymal cell markers but also highlighted new pathways for transient states. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
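The Cadbiom entry above is built on the guarded-transitions formalism: a discrete transition fires only when its source entity is active and an extra guard condition on the rest of the state holds. The minimal Python sketch below conveys the flavor on a three-step invented toy (the names TGFB, SMAD, SNAI1, CDH1 and the rules are illustrative, not Cadbiom output).

```python
# Hedged minimal sketch of the guarded-transition idea behind Cadbiom.
state = {"TGFB": True, "SMAD": False, "SNAI1": False, "CDH1": True}

# (source, target, guard) -- the guard is a condition on the current state
transitions = [
    ("TGFB",  "SMAD",  lambda s: True),
    ("SMAD",  "SNAI1", lambda s: s["TGFB"]),          # cooperation
    ("CDH1",  "CDH1",  lambda s: not s["SNAI1"]),     # maintained unless repressed
]

def step(s):
    new = dict(s)
    for src, dst, guard in transitions:
        if s[src] and guard(s):
            new[dst] = True
    if s["SNAI1"]:                 # SNAI1 represses CDH1, modeled explicitly
        new["CDH1"] = False
    return new

trajectory = [state]
for _ in range(3):
    trajectory.append(step(trajectory[-1]))
for t, s in enumerate(trajectory):
    print(t, s)    # CDH1 (epithelial marker) switches off once SNAI1 is on
```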
8. GPRuler: Metabolic gene-protein-reaction rules automatic reconstruction.
- Author
-
Di Filippo, Marzia, Damiani, Chiara, and Pescini, Dario
- Subjects
METABOLIC models, GENE regulatory networks, GENE expression profiling, BIOLOGICAL databases, DELETION mutation, DATA mining - Abstract
Metabolic network models are increasingly being used in health care and industry. As a consequence, many tools have been released to automate their reconstruction process de novo. In order to enable gene deletion simulations and integration of gene expression data, these networks must include gene-protein-reaction (GPR) rules, which describe, in Boolean logic, the relationships between the gene products (e.g., enzyme isoforms or subunits) associated with the catalysis of a given reaction. Nevertheless, the reconstruction of GPRs still remains a largely manual and time-consuming process. Aiming at fully automating the reconstruction process of GPRs for any organism, we propose the open-source Python-based framework GPRuler. By mining text and data from 9 different biological databases, GPRuler can reconstruct GPRs starting either from just the name of the target organism or from an existing metabolic model. The performance of the developed tool is evaluated at small-scale level for a manually curated metabolic model, and at genome-scale level for three metabolic models related to Homo sapiens and Saccharomyces cerevisiae organisms. By exploiting these models as benchmarks, the proposed tool showed its ability to reproduce the original GPR rules with a high level of accuracy. In all the tested scenarios, after a manual investigation of the mismatches between the rules proposed by GPRuler and the original ones, the proposed approach proved in many cases to be more accurate than the original models. By complementing existing tools for metabolic network reconstruction with the ability to reconstruct GPRs quickly and with few resources, GPRuler paves the way to the study of context-specific metabolic networks, representing the active portion of the complete network in given conditions, for organisms of industrial or biomedical interest that have not been characterized metabolically yet. Author summary: Over the years, several methodologies have been proposed to integrate omics data into metabolic models in order to derive context-specific networks that represent the active portion of the network under specific conditions. In this way, biologically meaningful phenotypic predictions can be derived as a function of the expression profiles of genes encoding subunits or isoforms of the involved enzymes. Regardless of the approach used to integrate omics data, the reliability of the formulated hypotheses strongly depends on the quality of the gene-protein-reaction (GPR) rules included in the models, which describe how gene products concur to catalyze the associated reactions. To date, the reconstruction of GPR rules for their integration within metabolic networks still remains a largely manual and time-consuming process. Therefore, we propose the open-source framework GPRuler to automate the reconstruction process of GPR rules for any living organism. Applying the developed tool to four case studies, we verified the ability of GPRuler to reproduce the original GPR rules with a very high level of accuracy. Moreover, in all the tested scenarios, the proposed approach proved in many cases to be more accurate than the original models. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
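The GPRuler entry above centers on GPR rules: Boolean expressions over genes that determine whether a reaction can be catalyzed, which is what makes gene-deletion simulation possible. The sketch below shows how such a rule is evaluated; the rule string and gene names are invented examples, not GPRuler output.

```python
# Hedged sketch: evaluating a gene-protein-reaction (GPR) rule for gene-
# deletion simulation, using Python's ast module for safe Boolean parsing.
import ast

def reaction_active(rule, active_genes):
    """Evaluate a Boolean GPR rule like '(g1 and g2) or g3'."""
    tree = ast.parse(rule, mode="eval")

    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.BoolOp):
            vals = [ev(v) for v in node.values]
            return all(vals) if isinstance(node.op, ast.And) else any(vals)
        if isinstance(node, ast.Name):          # a gene: still active?
            return node.id in active_genes
        raise ValueError("unexpected token in GPR rule")

    return ev(tree)

gpr = "(subA and subB) or isoC"        # enzyme complex OR an isozyme
genes = {"subA", "subB", "isoC"}
for deleted in [set(), {"isoC"}, {"subA", "isoC"}]:
    print(deleted or "none", "->", reaction_active(gpr, genes - deleted))
```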
9. PCSF: An R-package for network-based interpretation of high-throughput data.
- Author
-
Akhmedov, Murodzhon, Kedaigle, Amanda, Escalante-Chong, Renan, Montemanni, Roberto, Bertoni, Francesco, Fraenkel, Ernest, and Kwee, Ivo
- Subjects
BIOINFORMATICS software, DATA analysis software, MATHEMATICAL optimization, COMPUTATIONAL biology, PROTEIN-protein interactions - Abstract
With recent technological developments, a vast amount of high-throughput data has been profiled to understand the mechanism of complex diseases. The current bioinformatics challenge is to interpret the data and underlying biology, where efficient algorithms for analyzing heterogeneous high-throughput data using biological networks are becoming increasingly valuable. In this paper, we propose a software package based on the Prize-collecting Steiner Forest graph optimization approach. The PCSF package performs fast and user-friendly network analysis of high-throughput data by mapping the data onto biological networks such as protein-protein interaction, gene-gene interaction, or other correlation- or coexpression-based networks. Using the interaction networks as a template, it determines high-confidence subnetworks relevant to the data, which potentially leads to predictions of functional units. It also interactively visualizes the resulting subnetwork with functional enrichment analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
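The PCSF entry above maps data onto an interaction network and extracts a connecting subnetwork. As a deliberately simplified stand-in for the prize-collecting Steiner forest (which also weighs node prizes and exclusion penalties), the sketch below uses networkx's Steiner tree approximation to link a set of "hit" proteins through an invented toy PPI graph.

```python
# Hedged sketch: connect high-scoring hits through a PPI network. This uses a
# Steiner *tree* approximation and ignores prizes, so it is only a rough
# analogue of PCSF; the network and weights below are invented.
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

G = nx.Graph()
edges = [("TP53", "MDM2", 0.2), ("MDM2", "AKT1", 0.5), ("TP53", "ATM", 0.3),
         ("ATM", "CHEK2", 0.4), ("AKT1", "PTEN", 0.3), ("CHEK2", "BRCA1", 0.6),
         ("PTEN", "BRCA1", 0.9), ("MDM2", "CHEK2", 0.8)]
G.add_weighted_edges_from(edges)          # weight ~ interaction "cost"

hits = ["TP53", "BRCA1", "PTEN"]          # e.g. proteins scored by the data
sub = steiner_tree(G, hits, weight="weight")
print(sorted(sub.edges()))                # subnetwork linking the hits,
                                          # possibly via intermediate nodes
```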
10. ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time.
- Author
-
Cai, Yunpeng, Zheng, Wei, Yao, Jin, Yang, Yujie, Mai, Volker, Mao, Qi, and Sun, Yijun
- Subjects
GENOMICS, QUADRATIC programming, HUMAN microbiota, RIBOSOMAL RNA, BIOACCUMULATION - Abstract
The rapid development of sequencing technology has led to an explosive accumulation of genomic sequence data. Clustering is often the first step to perform in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, it is currently computationally expensive to perform hierarchical clustering of extremely large sequence datasets due to its quadratic time and space complexities. In this paper we developed a new algorithm called ESPRIT-Forest for parallel hierarchical clustering of sequences. The algorithm achieves subquadratic time and space complexity and maintains a high clustering accuracy comparable to the standard method. The basic idea is to organize sequences into a pseudo-metric based partitioning tree for sub-linear time searching of nearest neighbors, and then use a new multiple-pair merging criterion to construct clusters in parallel using multiple threads. The new algorithm was tested on the human microbiome project (HMP) dataset, currently one of the largest published microbial 16S rRNA sequence datasets. Our experiment demonstrated that with the power of parallel computing it is now computationally feasible to perform hierarchical clustering analysis of tens of millions of sequences. The software is available at . [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
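The ESPRIT-Forest entry above hinges on a multiple-pair merging criterion: instead of one merge per round, every pair of clusters that are mutual nearest neighbors is merged simultaneously. The toy below illustrates just that criterion on integers standing in for sequences (real distances would come from alignments, and the paper's tree-based neighbor search is not reproduced here).

```python
# Hedged toy of parallel mutual-nearest-neighbor merging.
def one_round(clusters):
    def dist(a, b):  # stand-in for an alignment-based sequence distance
        return abs(min(a) - min(b))
    nn = {}
    for i, c in enumerate(clusters):
        others = [j for j in range(len(clusters)) if j != i]
        nn[i] = min(others, key=lambda j: dist(c, clusters[j]))
    used, out = set(), []
    for i in range(len(clusters)):
        j = nn[i]
        if nn[j] == i and i not in used and j not in used:
            out.append(clusters[i] | clusters[j])   # mutual NN pair: merge
            used |= {i, j}
    out += [clusters[i] for i in range(len(clusters)) if i not in used]
    return out

clusters = [{1}, {2}, {10}, {11}, {50}]
while len(clusters) > 2:
    clusters = one_round(clusters)   # several merges can happen per round
    print(clusters)
```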
11. Genome composition and phylogeny of microbes predict their co-occurrence in the environment.
- Author
-
Kamneva, Olga K.
- Subjects
GENOMES, MICROBIAL genetics, MICROORGANISM phylogeny, ECOLOGICAL research, COMPUTATIONAL biology - Abstract
The genomic information of microbes is a major determinant of their phenotypic properties, yet it is largely unknown to what extent ecological associations between different species can be explained by their genome composition. To bridge this gap, this study introduces two new genome-wide pairwise measures of microbe-microbe interaction. The first (genome content similarity index) quantifies similarity in genome composition between two microbes, while the second (microbe-microbe functional association index) summarizes the topology of a protein functional association network built for a given pair of microbes and quantifies the fraction of network edges crossing organismal boundaries. These new indices are then used to predict co-occurrence between reference genomes from two 16S-based ecological datasets, accounting for phylogenetic relatedness of the taxa. Phylogenetic relatedness was found to be a strong predictor of ecological associations between microbes, explaining about 10% of the variance in co-occurrence data; genome composition was found to be a strong predictor as well, explaining up to 4% of the variance in co-occurrence when all genomic indices were used in combination, even after accounting for evolutionary relationships between the species. On their own, the metrics proposed here explain a larger proportion of variance than previously reported, more complex methods that rely on metabolic network comparisons. In summary, the results of this study indicate that microbial genomes do indeed contain a detectable signal of organismal ecology, and the methods described in the paper can be used to improve mechanistic understanding of microbe-microbe interactions. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
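The two indices named in the entry above have simple cores: a set-overlap measure of genome content and a count of association-network edges that cross organisms. The sketch below computes both on invented toy data; the real study's construction (gene-family inference, network building) is far more involved.

```python
# Hedged sketch of the two pairwise indices, on toy data.
genome_a = {"famA", "famB", "famC", "famD"}   # gene families in microbe A
genome_b = {"famB", "famC", "famE"}           # gene families in microbe B

# (1) genome content similarity as Jaccard overlap of gene-family sets
jaccard = len(genome_a & genome_b) / len(genome_a | genome_b)

# (2) functional-association edges, each node tagged with its organism
edges = [(("A", "famA"), ("A", "famB")),
         (("A", "famB"), ("B", "famE")),
         (("B", "famC"), ("B", "famE")),
         (("A", "famD"), ("B", "famC"))]
crossing = sum(1 for (ga, _), (gb, _) in edges if ga != gb)

print(f"genome content similarity (Jaccard): {jaccard:.2f}")
print(f"cross-organism edge fraction:        {crossing / len(edges):.2f}")
```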
12. Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model.
- Author
-
Wang, Sheng, Sun, Siqi, Li, Zhen, Zhang, Renyu, and Xu, Jinbo
- Subjects
PROTEIN structure, ARTIFICIAL neural networks, PROTEIN folding, PAIRED comparisons (Mathematics), AMINO acid sequence - Abstract
Motivation: Protein contacts contain key information for the understanding of protein structure and function and thus, contact prediction from sequence is an important problem. Recently exciting progress has been made on this problem, but the predicted contacts for proteins without many sequence homologs are still of low quality and not very useful for de novo structure prediction. Method: This paper presents a new deep learning method that predicts contacts by integrating both evolutionary coupling (EC) and sequence conservation information through an ultra-deep neural network formed by two deep residual neural networks. The first residual network conducts a series of 1-dimensional convolutional transformations of sequential features; the second residual network conducts a series of 2-dimensional convolutional transformations of pairwise information including output of the first residual network, EC information and pairwise potential. By using very deep residual networks, we can accurately model contact occurrence patterns and the complex sequence-structure relationship and thus, obtain high-quality contact prediction regardless of how many sequence homologs are available for the proteins in question. Results: Our method greatly outperforms existing methods and leads to much more accurate contact-assisted folding. Tested on 105 CASP11 targets, 76 past CAMEO hard targets, and 398 membrane proteins, the average top L long-range prediction accuracy obtained by our method, one representative EC method CCMpred and the CASP11 winner MetaPSICOV is 0.47, 0.21 and 0.30, respectively; the average top L/10 long-range accuracy of our method, CCMpred and MetaPSICOV is 0.77, 0.47 and 0.59, respectively. Ab initio folding using our predicted contacts as restraints but without any force fields can yield correct folds (i.e., TMscore>0.6) for 203 of the 579 test proteins, while that using MetaPSICOV- and CCMpred-predicted contacts can do so for only 79 and 62 of them, respectively. Our contact-assisted models also have much better quality than template-based models especially for membrane proteins. The 3D models built from our contact prediction have TMscore>0.5 for 208 of the 398 membrane proteins, while those from homology modeling have TMscore>0.5 for only 10 of them. Further, even if trained mostly by soluble proteins, our deep learning method works very well on membrane proteins. In the recent blind CAMEO benchmark, our fully-automated web server implementing this method successfully folded 6 targets with a new fold and only 0.3L-2.3L effective sequence homologs, including one β protein of 182 residues, one α+β protein of 125 residues, one α protein of 140 residues, one α protein of 217 residues, one α/β protein of 260 residues and one α protein of 462 residues. Our method also achieved the highest F1 score on free-modeling targets in the latest CASP (Critical Assessment of Structure Prediction), although it was not fully implemented back then. Availability: [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
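The entry above describes a 1D-residual-then-2D-residual architecture: per-residue features are transformed by 1D convolutions, lifted to pairwise features for every residue pair (i, j), and then refined by 2D convolutions into a contact map. The PyTorch miniature below shows only the shapes and wiring of that idea; real models are far deeper, add coevolution features, and are trained, whereas these weights are random.

```python
# Hedged miniature of the 1D-then-2D residual wiring; untrained, toy-sized.
import torch
import torch.nn as nn

L, d1, d2 = 30, 16, 8                     # sequence length, channel widths

class Res1D(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv = nn.Conv1d(c, c, 3, padding=1)
    def forward(self, x):
        return torch.relu(x + self.conv(x))

class Res2D(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv = nn.Conv2d(c, c, 3, padding=1)
    def forward(self, x):
        return torch.relu(x + self.conv(x))

seq_feats = torch.randn(1, d1, L)          # per-residue input features
h = Res1D(d1)(seq_feats)                   # 1D residual stage

# lift to pairwise: concatenate features of residues i and j for every (i, j)
hi = h.unsqueeze(3).expand(1, d1, L, L)    # varies over i
hj = h.unsqueeze(2).expand(1, d1, L, L)    # varies over j
pair = torch.cat([hi, hj], dim=1)          # (1, 2*d1, L, L)

trunk = nn.Sequential(nn.Conv2d(2 * d1, d2, 1), Res2D(d2), nn.Conv2d(d2, 1, 1))
contact_logits = trunk(pair).squeeze(1)    # (1, L, L) contact map scores
print(contact_logits.shape)
```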
13. A Graph-Centric Approach for Metagenome-Guided Peptide and Protein Identification in Metaproteomics.
- Author
-
Tang, Haixu, Li, Sujun, and Ye, Yuzhen
- Subjects
PROTEIN expression, PEPTIDES, GENES, MICROBIOLOGICAL chemistry, METAGENOMICS - Abstract
Metaproteomic studies adopt the common bottom-up proteomics approach to investigate the protein composition and the dynamics of protein expression in microbial communities. When matched metagenomic and/or metatranscriptomic data of the microbial communities are available, metaproteomic data analyses often employ a metagenome-guided approach, in which complete or fragmentary protein-coding genes are first directly predicted from metagenomic (and/or metatranscriptomic) sequences or from their assemblies, and the resulting protein sequences are then used as the reference database for peptide/protein identification from MS/MS spectra. This approach is often limited because protein-coding genes predicted from metagenomes are incomplete and fragmentary. In this paper, we present a graph-centric approach to improving metagenome-guided peptide and protein identification in metaproteomics. Our method exploits the de Bruijn graph structure reported by metagenome assembly algorithms to generate a comprehensive database of protein sequences encoded in the community. We tested our method using several public metaproteomic datasets with matched metagenomic and metatranscriptomic sequencing data acquired from complex microbial communities in a biological wastewater treatment plant. The results showed that many more peptides and proteins can be identified when assembly graphs were utilized, improving the characterization of the proteins expressed in the microbial communities. The additional proteins we identified contribute to the characterization of important pathways such as those involved in degradation of chemical hazards. Our tools are released as open-source software on GitHub at . [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
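The graph-centric idea in the entry above is that walking paths through the assembly graph, rather than calling genes only on linear contigs, recovers protein sequences that span ambiguous branch points. The toy below enumerates paths in a tiny invented four-node overlap graph and translates each spelled sequence with Biopython; it is an unrealistically small illustration, not the paper's pipeline.

```python
# Hedged toy: path enumeration over an assembly-style overlap graph.
from Bio.Seq import Seq

k = 4                                     # node overlap = k-1 bases
nodes = {
    "n1": "ATGGCT",                       # unitigs from an assembler
    "n2": "GCTAAA",
    "n3": "GCTGGG",
    "n4": "AAATAA",
}
edges = {"n1": ["n2", "n3"], "n2": ["n4"], "n3": [], "n4": []}

def paths(node, acc):
    acc = acc + [node]
    if not edges[node]:                   # reached a tip: emit the path
        yield acc
    for nxt in edges[node]:
        yield from paths(nxt, acc)

def spell(path):                          # merge consecutive (k-1)-overlaps
    seq = nodes[path[0]]
    for n in path[1:]:
        seq += nodes[n][k - 1:]
    return seq

for p in paths("n1", []):
    dna = spell(p)
    print(p, dna, Seq(dna).translate())   # candidate proteins per graph path
```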
14. Text Mining Genotype-Phenotype Relationships from Biomedical Literature for Database Curation and Precision Medicine.
- Author
-
Singhal, Ayush, Simmons, Michael, and Lu, Zhiyong
- Subjects
INDIVIDUALIZED medicine, MACHINE learning, DATA extraction, MEDICAL literature, TEXT mining - Abstract
The practice of precision medicine will ultimately require databases of genes and mutations for healthcare providers to reference in order to understand the clinical implications of each patient's genetic makeup. Although the highest quality databases require manual curation, text mining tools can facilitate the curation process, increasing accuracy, coverage, and productivity. However, to date there are no available text mining tools that offer high-accuracy performance for extracting disease-gene-variant triplets from biomedical literature. In this paper we propose a high-performance machine learning approach to automate the extraction of disease-gene-variant triplets from biomedical literature. Our approach is unique because we identify the genes and protein products associated with each mutation from not just the local text content, but from a global context as well (from the Internet and from all literature in PubMed). Our approach also incorporates protein sequence validation and disease association using a novel text-mining-based machine learning approach. We extract disease-gene-variant triplets from all abstracts in PubMed related to a set of ten important diseases (breast cancer, prostate cancer, pancreatic cancer, lung cancer, acute myeloid leukemia, Alzheimer's disease, hemochromatosis, age-related macular degeneration (AMD), diabetes mellitus, and cystic fibrosis). We then evaluate our approach in two ways: (1) a direct comparison with the state of the art using benchmark datasets; (2) a validation study comparing the results of our approach with entries in a popular human-curated database (UniProt) for each of the previously mentioned diseases. In the benchmark comparison, our full approach achieves a 28% improvement in F1-measure (from 0.62 to 0.79) over the state-of-the-art results. For the validation study with the UniProt Knowledgebase (KB), we present a thorough analysis of the results and errors. Across all diseases, our approach returned 272 triplets (disease-gene-variant) that overlapped with entries in UniProt and 5,384 triplets without overlap in UniProt. Analysis of the overlapping triplets and of a stratified sample of the non-overlapping triplets revealed accuracies of 93% and 80% for the respective categories (cumulative accuracy, 77%). We conclude that our process represents an important and broadly applicable improvement to the state of the art for curation of disease-gene-variant relationships. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
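The entry above targets disease-gene-variant triplet extraction. The paper's machine-learning extractor is far more sophisticated, but a rule-based co-mention pass conveys the shape of the task; the sketch below uses invented sentences, a toy gene and disease vocabulary, and a crude variant regex.

```python
# Hedged rule-based stand-in for triplet extraction, on invented sentences.
import re

GENES = {"BRCA1", "HFE", "CFTR"}
DISEASES = {"breast cancer", "hemochromatosis", "cystic fibrosis"}
VARIANT = re.compile(r"\b(?:p\.)?[A-Z]\d{2,4}[A-Z]\b|\brs\d+\b")

sentences = [
    "The HFE C282Y mutation is the main cause of hereditary hemochromatosis.",
    "We observed BRCA1 overexpression in several cell lines.",
    "The CFTR variant p.F508C was reported in cystic fibrosis patients.",
]

for s in sentences:
    genes = [g for g in GENES if g in s]
    dis = [d for d in DISEASES if d in s.lower()]
    variants = VARIANT.findall(s)
    if genes and dis and variants:
        print((dis[0], genes[0], variants[0]))   # a candidate triplet
```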
15. A model of dopamine and serotonin-kynurenine metabolism in cortisolemia: Implications for depression.
- Author
-
Dalvi-Garcia, Felipe, Fonseca, Luis L., Vasconcelos, Ana Tereza R., Hedin-Pereira, Cecilia, and Voit, Eberhard O.
- Subjects
DOPAMINE, SEROTONIN antagonists, NONLINEAR differential equations, ORDINARY differential equations, BIOLOGICAL databases, ALZHEIMER'S disease, NEURAL transmission - Abstract
A major factor contributing to the etiology of depression is a neurochemical imbalance of the dopaminergic and serotonergic systems, which is caused by persistently high levels of circulating stress hormones. Here, a computational model is proposed to investigate the interplay between dopaminergic and serotonergic-kynurenine metabolism under cortisolemia and its consequences for the onset of depression. The model was formulated as a set of nonlinear ordinary differential equations represented with power-law functions. Parameter values were obtained from experimental data reported in the literature, biological databases, and other general information, and subsequently fine-tuned through optimization. Model simulations predict that changes in the kynurenine pathway, caused by elevated levels of cortisol, can increase the risk of neurotoxicity and lead to increased levels of 3,4-dihydroxyphenylacetaldehyde (DOPAL) and 5-hydroxyindoleacetaldehyde (5-HIAL). These aldehydes contribute to alpha-synuclein aggregation and may cause mitochondrial fragmentation. Further model analysis demonstrated that the inhibition of both serotonin transport and kynurenine-3-monooxygenase decreased the levels of DOPAL and 5-HIAL and the neurotoxic risk often associated with depression. The mathematical model was also able to predict a novel role of the dopamine and serotonin metabolites DOPAL and 5-HIAL in the etiology of depression, which is facilitated through increased cortisol levels. Finally, the model analysis suggests treatment with a combination of inhibitors of serotonin transport and kynurenine-3-monooxygenase as a potentially effective pharmacological strategy to reverse the slowdown in monoamine neurotransmission that is often triggered by inflammation. Author summary: According to the World Health Organization, major depressive disorder (MDD) was in 2014 the fourth leading cause of disability in people between the ages of 15 and 44 years. MDD is responsible for about 1 million suicides per year and associated with a number of other medical conditions such as coronary disease, diabetes, and Alzheimer's disease. While MDD has been studied for a long time, molecular details of its pathophysiology are still scarce. Computational models offer a powerful opportunity to assist neuropsychiatric disorder research, as they permit the representation of large sets of physiological, cellular and biochemical phenomena through mathematical equations that can be simulated in a very efficient fashion and have the capacity to provide testable hypotheses. Here, we introduce a computational model of relevant biochemical pathways associated with high levels of circulating stress hormone, as they are observed in MDD. The model captures known observations well and demonstrates how increased levels of internally produced toxic agents, such as various kynurenines, DOPAL and 5-HIAL, can lead to dysregulation of key enzymes. These insights suggest new hypotheses for model-driven experiments, as well as novel potential targets for pharmacological intervention. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
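The entry above names its formalism explicitly: nonlinear ODEs with power-law (S-system-style) rate terms. The sketch below integrates a two-variable toy of that kind, in which a "cortisol" parameter diverts flux from serotonin toward the kynurenine branch; the equations and coefficients are invented for illustration, not the paper's model.

```python
# Hedged illustration of a power-law ODE toy, not the published model.
from scipy.integrate import solve_ivp

def rhs(t, y, cortisol):
    trp, ser = y                      # tryptophan, serotonin (arbitrary units)
    v_in  = 1.0                                # constant tryptophan supply
    v_kyn = 0.3 * trp**0.8 * cortisol**0.5     # kynurenine branch: power law
    v_ser = 0.5 * trp**0.6                     # serotonin synthesis
    v_deg = 0.4 * ser**1.0                     # serotonin degradation
    return [v_in - v_kyn - v_ser, v_ser - v_deg]

for cortisol in (1.0, 4.0):
    sol = solve_ivp(rhs, (0, 50), [1.0, 1.0], args=(cortisol,))
    print(f"cortisol={cortisol}: late-time serotonin ~ {sol.y[1, -1]:.2f}")
```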
16. Scaling up data curation using deep learning: An application to literature triage in genomic variation resources.
- Author
-
Lee, Kyubum, Famiglietti, Maria Livia, McMahon, Aoife, Wei, Chih-Hsuan, MacArthur, Jacqueline Ann Langdon, Poux, Sylvain, Breuza, Lionel, Bridge, Alan, Cunningham, Fiona, Xenarios, Ioannis, and Lu, Zhiyong
- Subjects
ARTIFICIAL neural networks, GENOMES, ALGORITHMS, GENOMICS - Abstract
Manually curating biomedical knowledge from publications is necessary to build a knowledge-based service that provides highly precise and organized information to users. The process of retrieving relevant publications for curation, which is also known as document triage, is usually carried out by querying and reading articles in PubMed. However, this query-based method often obtains unsatisfactory precision and recall on the retrieved results, and it is difficult to manually generate optimal queries. To address this, we propose a machine-learning assisted triage method. We collected previously curated publications from two databases, UniProtKB/Swiss-Prot and the NHGRI-EBI GWAS Catalog, and used them as a gold-standard dataset for training deep learning models based on convolutional neural networks. We then use the trained models to classify and rank new publications for curation. For evaluation, we apply our method to the real-world manual curation process of UniProtKB/Swiss-Prot and the GWAS Catalog. We demonstrate that our machine-assisted triage method outperforms the current query-based triage methods, improves efficiency, and enriches curated content. Our method achieves a precision 1.81 and 2.99 times higher than that obtained by the current query-based triage methods of UniProtKB/Swiss-Prot and the GWAS Catalog, respectively, without compromising recall. In fact, our method retrieves many additional relevant publications that the query-based method of UniProtKB/Swiss-Prot could not find. As these results show, our machine learning-based method can make the triage process more efficient and is being implemented in production so that human curators can focus on more challenging tasks to improve the quality of knowledge bases. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
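The workflow in the entry above is: train on previously curated publications, then score and rank new ones for curators. The paper uses convolutional neural networks; the sketch below substitutes a plainly labeled TF-IDF + logistic regression baseline just to show the train-then-rank loop, with invented toy abstracts.

```python
# Hedged stand-in for the triage step (the paper trains CNNs, not this).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "missense variant in BRCA1 alters protein function",     # curated: 1
    "GWAS identifies loci associated with type 2 diabetes",   # curated: 1
    "review of conference highlights and opinions",           # irrelevant: 0
    "editorial on science funding policy",                    # irrelevant: 0
]
labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, labels)

new = ["novel variant associated with disease risk",
       "meeting report and community announcements"]
for text, p in zip(new, clf.predict_proba(new)[:, 1]):
    print(f"{p:.2f}  {text}")     # rank highest-probability papers first
```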
17. CODON—Software to manual curation of prokaryotic genomes.
- Author
-
Merlin, Bruno, Castro Alves, Jorianne Thyeska, de Sá, Pablo Henrique Caracciolo Gomes, de Oliveira, Mônica Silva, Dias, Larissa Maranhão, da Silva Moia, Gislenne, Cardoso dos Santos, Victória, and Veras, Adonney Allan de Oliveira
- Subjects
PROKARYOTIC genomes, FINITE state machines, BIOLOGICAL databases, GENES, COMPUTER software - Abstract
Genome annotation conceptually consists of inferring and assigning biological information to gene products. Over the years, numerous pipelines and computational tools have been developed aiming to automate this task and assist researchers in gaining knowledge about target genes of study. However, even with these technological advances, manual annotation or manual curation is necessary, where the information attributed to the gene products is verified and enriched. Despite being called the gold standard process for depositing data in a biological database, the task of manual curation requires significant time and effort from researchers who sometimes have to parse through numerous products in various public databases. To assist with this problem, we present CODON, a tool for manual curation of genomic data, capable of performing the prediction and annotation process. This software makes use of a finite state machine in the prediction process and automatically annotates products based on information obtained from the Uniprot database. CODON is equipped with a simple and intuitive graphic interface that assists with manual curation, enabling the user to make decisions about the analysis based on information such as identity, length of the alignment, and name of the organism in which the product obtained a match. Further, visual analysis of all matches found in the database is possible, which significantly aids the curation task, since the user has at their disposal all the information available for a given product. An analysis performed on eleven organisms was used to test the efficiency of this tool by comparing the results of prediction and annotation through CODON to ones from the NCBI and RAST platforms. Author summary: The accuracy of genome annotation is directly impacted by the manual curation step, since complementary information is added to gene products. However, this process takes time and requires specialized labor, since there is a need to consult external databases to check the information of the products inferred in the automated process. We present the CODON software, which makes this process dynamic and significantly reduces the total work: the user responsible for manual curation can check the annotation of each product directly on screen, without needing to search for the similarity of each ORF in external databases, and can change the ORF annotation based on the information displayed, for example, percentage of identity and alignment length. The results show that CODON is an efficient tool for manual curation, beyond prediction and annotation, providing the user with access to highly accurate database information and producing results with more gene acronyms, metabolic pathway information, and Gene Ontology terms. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
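The entry above mentions that CODON drives its prediction step with a finite state machine. CODON's actual machine and annotation logic are richer, but the sketch below shows the FSM flavor of ORF prediction: a per-frame scanner that switches between "searching for a start codon" and "inside an ORF" states.

```python
# Hedged sketch: a two-state FSM for ORF finding, per reading frame.
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_len=6):
    orfs = []
    for frame in range(3):
        state, start = "SEARCH", None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if state == "SEARCH" and codon == "ATG":
                state, start = "IN_ORF", i        # start codon: enter ORF
            elif state == "IN_ORF" and codon in STOPS:
                if i + 3 - start >= min_len:      # stop codon: emit ORF
                    orfs.append((start, i + 3, dna[start:i + 3]))
                state = "SEARCH"
    return orfs

seq = "CCATGGCTAAATAACCCATGTGA"
for start, end, orf in find_orfs(seq):
    print(start, end, orf)
```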
18. Ten simple rules to increase computational skills among biologists with Code Clubs.
- Author
-
Hagan, Ada K., Lesniak, Nicholas A., Balunas, Marcy J., Bishop, Lucas, Close, William L., Doherty, Matthew D., Elmore, Amanda G., Flynn, Kaitlin J., Hannigan, Geoffrey D., Koumpouras, Charlie C., Jenior, Matthew L., Kozik, Ariangela J., McBride, Kathryn, Rifkin, Samara B., Stough, Joshua M. A., Sovacool, Kelly L., Sze, Marc A., Tomkovich, Sarah, Topcuoglu, Begum D., and Schloss, Patrick D.
- Subjects
BIOLOGICAL databases, BIOLOGISTS, CONCEPT mapping - Published
- 2020
- Full Text
- View/download PDF
19. CAncer bioMarker Prediction Pipeline (CAMPP)—A standardized framework for the analysis of quantitative biological data.
- Author
-
Terkelsen, Thilde, Krogh, Anders, and Papaleo, Elena
- Subjects
FORECASTING, PIPELINES, QUANTITATIVE research, K-means clustering, BIOLOGICAL databases, NUCLEOTIDE sequencing, MISSING data (Statistics), PIPELINE inspection - Abstract
With the improvement of -omics and next-generation sequencing (NGS) methodologies, along with the lowered cost of generating these types of data, the analysis of high-throughput biological data has become standard both for forming and testing biomedical hypotheses. Our knowledge of how to normalize datasets to remove latent undesirable variances has grown extensively, making for standardized data that are easily compared between studies. Here we present the CAncer bioMarker Prediction Pipeline (CAMPP), an open-source R-based wrapper (https://github.com/ELELAB/CAncer-bioMarker-Prediction-Pipeline-CAMPP) intended to aid bioinformatic software users with data analyses. CAMPP is called from a terminal command line and is supported by a user-friendly manual. The pipeline may be run on a local computer and requires little or no knowledge of programming. To avoid issues relating to R-package updates, a renv.lock file is provided to ensure R-package stability. Data management includes missing value imputation, data normalization, and distributional checks. CAMPP performs (I) k-means clustering, (II) differential expression/abundance analysis, (III) elastic-net regression, (IV) correlation and co-expression network analyses, (V) survival analysis, and (VI) protein-protein/miRNA-gene interaction networks. The pipeline returns tabular files and graphical representations of the results. We hope that CAMPP will assist in streamlining bioinformatic analysis of quantitative biological data, whilst ensuring an appropriate biostatistical framework. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
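Among the stages the CAMPP entry lists, elastic-net regression is the biomarker-nomination step. CAMPP itself is an R pipeline; the sketch below shows only the core idea on simulated data, using scikit-learn's cross-validated elastic net, with every number invented.

```python
# Hedged sketch of elastic-net biomarker selection on simulated expression.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(42)
n_samples, n_features = 80, 200
X = rng.normal(size=(n_samples, n_features))       # e.g. gene expression
true_coef = np.zeros(n_features)
true_coef[:5] = [2.0, -1.5, 1.0, 0.8, -0.6]        # 5 genuine markers
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

model = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5).fit(X, y)
selected = np.flatnonzero(model.coef_)             # nonzero = nominated
print("selected feature indices:", selected[:10])
print("true markers recovered  :", sorted(set(selected) & set(range(5))))
```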
20. Bioinformatics Meets User-Centred Design: A Perspective.
- Author
-
Pavelin, Katrina, Cham, Jennifer A., de Matos, Paula, Brooksbank, Cath, Cameron, Graham, and Steinbeck, Christoph
- Subjects
BIOINFORMATICS, BIOLOGICAL databases, COMPUTER interfaces, RESEARCH grants, COMPUTERS in biology, SOFTWARE architecture - Abstract
Designers have a saying that "the joy of an early release lasts but a short time. The bitterness of an unusable system lasts for years." It is indeed disappointing to discover that your data resources are not being used to their full potential. Not only have you invested your time, effort, and research grant on the project, but you may face costly redesigns if you want to improve the system later. This scenario would be less likely if the product was designed to provide users with exactly what they need, so that it is fit for purpose before its launch. We work at the EMBL-European Bioinformatics Institute (EMBL-EBI), and we consult extensively with life science researchers to find out what they need from biological data resources. We have found that although users believe that the bioinformatics community is providing accurate and valuable data, they often find the interfaces to these resources tricky to use and navigate. We believe that if you can find out what your users want even before you create the first mock-up of a system, the final product will provide a better user experience. This would encourage more people to use the resource and they would have greater access to the data, which could ultimately lead to more scientific discoveries. In this paper, we explore the need for a user-centred design (UCD) strategy when designing bioinformatics resources and illustrate this with examples from our work at EMBL-EBI. Our aim is to introduce the reader to how selected UCD techniques may be successfully applied to software design for bioinformatics. INSET: Box 1. User Experience Design as a Profession. [ABSTRACT FROM AUTHOR]
- Published
- 2012
- Full Text
- View/download PDF
21. The impact of DNA methylation on the cancer proteome.
- Author
-
Magzoub, Majed Mohamed, Prunello, Marcos, Brennan, Kevin, and Gevaert, Olivier
- Subjects
DNA methylation, GENE expression, CANCER genes, TUMOR markers, CANCER, CYTOLOGY - Abstract
Aberrant DNA methylation disrupts normal gene expression in cancer and broadly contributes to oncogenesis. We previously developed MethylMix, a model-based algorithmic approach to identify epigenetically regulated driver genes. MethylMix identifies genes where methylation likely executes a functional role by using transcriptomic data to select only methylation events that can be linked to changes in gene expression. However, given that proteins more closely link genotype to phenotype, recent high-throughput proteomic data provide an opportunity to more accurately identify functionally relevant abnormal methylation events. Here we present a MethylMix analysis that refines nominations for epigenetic driver genes by leveraging quantitative high-throughput proteomic data to select only genes where DNA methylation is predictive of protein abundance. Applying our algorithm across three cancer cohorts, we find that using protein abundance data narrows candidate nominations, where the effect of DNA methylation is often buffered at the protein level. Next, we find that MethylMix genes predictive of protein abundance are enriched for biological processes involved in cancer, including functions involved in the epithelial-mesenchymal transition. Moreover, our results are enriched for tumor markers predictive of clinical features such as tumor stage, and we find that clustering using MethylMix genes predictive of protein abundance captures cancer subtypes. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
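The selection criterion in the entry above is: keep a gene only if its methylation actually predicts protein abundance. The sketch below reduces that to a per-gene linear fit with a p-value cutoff on simulated data; MethylMix itself fits beta-mixture models on matched cohorts, so this is only the shape of the filter, and the gene names and effect sizes are invented.

```python
# Hedged sketch: per-gene methylation -> protein-abundance filter.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_samples = 60
genes = {
    "geneA": True,    # methylation truly represses protein level
    "geneB": False,   # effect buffered at the protein level
}
for gene, real_effect in genes.items():
    meth = rng.uniform(0, 1, n_samples)                 # beta-value-like
    noise = rng.normal(scale=0.5, size=n_samples)
    prot = (-2.0 * meth if real_effect else 0.0) + noise
    res = stats.linregress(meth, prot)
    keep = res.pvalue < 0.01                            # predictive? keep gene
    print(f"{gene}: slope={res.slope:+.2f} p={res.pvalue:.1e} keep={keep}")
```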
22. Where did you come from, where did you go: Refining metagenomic analysis tools for horizontal gene transfer characterisation.
- Author
-
Seiler, Enrico, Trappe, Kathrin, and Renard, Bernhard Y.
- Subjects
HORIZONTAL gene transfer, METHICILLIN-resistant staphylococcus aureus, TEXTURE mapping, COMPUTATIONAL biology, SHOTGUN sequencing - Abstract
Horizontal gene transfer (HGT) has changed the way we regard evolution. Instead of waiting for the next generation to establish new traits, bacteria in particular are able to take a shortcut via HGT, which enables them to pass on genes from one individual to another, even across species boundaries. The tool Daisy offers the first HGT detection approach based on read mapping, which provides complementary evidence compared to existing methods. However, Daisy relies on the acceptor and donor organism involved in the HGT being known. We introduce DaisyGPS, a mapping-based pipeline that is able to identify acceptor and donor reference candidates of an HGT event based on sequencing reads. Acceptor and donor identification is akin to species identification in metagenomic samples based on sequencing reads, a problem addressed by metagenomic profiling tools. However, acceptor and donor references have certain properties such that these methods cannot be directly applied. DaisyGPS uses MicrobeGPS, a metagenomic profiling tool tailored towards estimating the genomic distance between organisms in the sample and the reference database. We enhance the underlying scoring system of MicrobeGPS to account for the sequence patterns in terms of mapping coverage of an acceptor and donor involved in an HGT event, and report a ranked list of reference candidates. These candidates can then be further evaluated by tools like Daisy to establish HGT regions. We successfully validated our approach on both simulated and real data, and show its benefits in an investigation of outbreak data involving methicillin-resistant Staphylococcus aureus. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
23. Pathogenicity and functional impact of non-frameshifting insertion/deletion variation in the human genome.
- Author
-
Pagel, Kymberleigh A., Antaki, Danny, Lian, AoJie, Mort, Matthew, Cooper, David N., Sebat, Jonathan, Iakoucheva, Lilia M., Mooney, Sean D., and Radivojac, Predrag
- Subjects
HUMAN genome, AUTISM spectrum disorders, MICROBIAL virulence, RECURRENT neural networks, POST-translational modification, PHYSICAL sciences - Abstract
Differentiation between phenotypically neutral and disease-causing genetic variation remains an open and relevant problem. Among different types of variation, non-frameshifting insertions and deletions (indels) represent an understudied group with widespread phenotypic consequences. To address this challenge, we present a machine learning method, MutPred-Indel, that predicts pathogenicity and identifies types of functional residues impacted by non-frameshifting insertion/deletion variation. The model shows good predictive performance as well as the ability to identify impacted structural and functional residues including secondary structure, intrinsic disorder, metal and macromolecular binding, post-translational modifications, allosteric sites, and catalytic residues. We identify structural and functional mechanisms impacted preferentially by germline variation from the Human Gene Mutation Database, recurrent somatic variation from COSMIC in the context of different cancers, as well as de novo variants from families with autism spectrum disorder. Further, the distributions of pathogenicity prediction scores generated by MutPred-Indel are shown to differentiate highly recurrent from non-recurrent somatic variation. Collectively, we present a framework to facilitate the interrogation of both pathogenicity and the functional effects of non-frameshifting insertion/deletion variants. The MutPred-Indel webserver is available at . [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
24. Noise-precision tradeoff in predicting combinations of mutations and drugs.
- Author
-
Tendler, Avichai, Zimmer, Anat, Mayo, Avi, and Alon, Uri
- Subjects
PARETO analysis, NOISE, DRUGS, PERTURBATION theory, GENETIC mutation - Abstract
Many biological problems involve the response to multiple perturbations. Examples include response to combinations of many drugs, and the effects of combinations of many mutations. Such problems have an exponentially large space of combinations, which makes it infeasible to cover the entire space experimentally. To overcome this problem, several formulae that predict the effect of drug combinations or fitness landscape values have been proposed. These formulae use the effects of single perturbations and pairs of perturbations to predict triplets and higher order combinations. Interestingly, different formulae perform best on different datasets. Here we use Pareto optimality theory to quantitatively explain why no formula is optimal for all datasets, due to an inherent bias-variance (noise-precision) tradeoff. We calculate the Pareto front of log-linear formulae and find that the optimal formula depends on properties of the dataset: the typical interaction strength and the experimental noise. This study provides an approach to choose a suitable prediction formula for a given dataset, in order to best overcome the combinatorial explosion problem. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
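The entry above concerns log-linear formulae that predict higher-order combination effects from singles and pairs. The sketch below evaluates one formula of that family (an Isserlis-like "pairs" model) against a singles-only product on invented survival fractions; whether it is the best choice depends, as the paper argues, on interaction strength and noise in the dataset.

```python
# Hedged numeric illustration; all survival fractions below are invented.
g = {  # relative fitness/survival under each drug combination
    ("a",): 0.7, ("b",): 0.6, ("c",): 0.8,
    ("a", "b"): 0.35, ("a", "c"): 0.5, ("b", "c"): 0.45,
}

# pairs model: g_abc = (g_ab * g_ac * g_bc) / (g_a * g_b * g_c)
pred_pairs = (g[("a", "b")] * g[("a", "c")] * g[("b", "c")]) / (
    g[("a",)] * g[("b",)] * g[("c",)])

# singles-only product (Bliss-like): more bias, less sensitivity to noise
pred_bliss = g[("a",)] * g[("b",)] * g[("c",)]

print(f"pairs-model prediction:     {pred_pairs:.3f}")
print(f"singles-product prediction: {pred_bliss:.3f}")
```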
25. Ten quick tips for biocuration.
- Author
-
Tang, Y. Amy, Pichler, Klemens, Füllgrabe, Anja, Lomax, Jane, Malone, James, Munoz-Torres, Monica C., Vasant, Drashtti V., Williams, Eleanor, and Haendel, Melissa
- Subjects
BIOLOGICAL databases, MEDICAL research, BIG data, METADATA, BIOINFORMATICS - Abstract
The article provides information on biocuration for interdisciplinary biomedical research: the process of identifying, organizing, and enriching biological data, experimental metadata, and bioinformatics resources. Topics include curating RNA-sequencing data, biological ontologies, and data workflows.
- Published
- 2019
- Full Text
- View/download PDF
26. Finding driver mutations in cancer: Elucidating the role of background mutational processes.
- Author
-
Brown, Anna-Leigh, Li, Minghui, Goncearenco, Alexander, and Panchenko, Anna R.
- Subjects
DNA replication, GENETIC mutation, MUTAGENESIS, GENETICS, MUTANT proteins - Abstract
Identifying driver mutations in cancer is notoriously difficult. To date, recurrence of a mutation in patients remains one of the most reliable markers of mutation driver status. However, some mutations are more likely to occur than others due to differences in background mutation rates arising from various forms of infidelity of DNA replication and repair machinery, endogenous, and exogenous mutagens. We calculated nucleotide and codon mutability to study the contribution of background processes in shaping the observed mutational spectrum in cancer. We developed and tested probabilistic pan-cancer and cancer-specific models that adjust the number of mutation recurrences in patients by background mutability in order to find mutations which may be under selection in cancer. We showed that mutations with higher mutability values had higher observed recurrence frequency, especially in tumor suppressor genes. This trend was prominent for nonsense and silent mutations or mutations with neutral functional impact. In oncogenes, however, highly recurring mutations were characterized by relatively low mutability, resulting in an inverted U-shaped trend. Mutations not yet observed in any tumor had relatively low mutability values, indicating that background mutability might limit mutation occurrence. We compiled a dataset of missense mutations from 58 genes with experimentally validated functional and transforming impacts from various studies. We found that the mutability of driver mutations was lower than that of passengers and, consequently, adjusting mutation recurrence frequency by mutability significantly improved the ranking of mutations and driver mutation prediction. Even though no training on existing data was involved, our approach performed similarly to or better than state-of-the-art methods. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
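The core move in the entry above is adjusting observed recurrence by background mutability. One simple way to express that adjustment (a hedged sketch, not the paper's exact probabilistic model) is a binomial tail test: how surprising is seeing a mutation k times among N patients if each patient acquires it with probability proportional to its mutability? All numbers below are invented.

```python
# Hedged sketch: recurrence adjusted by background mutability.
from scipy.stats import binom

N = 10_000                      # patients sequenced for this cancer type
candidates = [
    # (label, observed recurrence k, per-patient background mutability)
    ("high-mutability CpG site", 12, 1e-3),
    ("low-mutability site",      12, 1e-5),
]
for label, k, mu in candidates:
    p = binom.sf(k - 1, N, mu)  # P(X >= k) under background alone
    print(f"{label}: k={k}, P(>=k | background)={p:.2e}")
# identical recurrence, very different evidence for selection (driver status)
```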
27. Genotype-phenotype relations of the von Hippel-Lindau tumor suppressor inferred from a large-scale analysis of disease mutations and interactors.
- Author
-
Minervini, Giovanni, Quaglia, Federica, Tabaro, Francesco, and Tosatto, Silvio C. E.
- Subjects
VON Hippel-Lindau disease, GENOTYPES, PHENOTYPES, TUMOR suppressor genes, GENETIC mutation - Abstract
Familial cancers represent a privileged point of view for studying the complex cellular events inducing tumor transformation. Von Hippel-Lindau syndrome, a familial predisposition to develop cancer, is a clear example. Here, we present our efforts to decipher the role of the von Hippel-Lindau tumor suppressor protein (pVHL) in cancer emergence. We collected high-quality information about both pVHL mutations and interactors to investigate the association between patient phenotypes, mutated protein surface and impaired interactions. Our data suggest that different phenotypes correlate with localized perturbations of the pVHL structure, with specific cell functions associated with different protein surfaces. We propose five different pVHL interfaces to be selectively involved in modulating proteins regulating gene expression and protein homeostasis, as well as in addressing extracellular matrix (ECM)- and ciliogenesis-associated functions. These data were used to drive molecular docking of pVHL with its interactors and guide Petri net simulations of the most promising alterations. We predict that disruption of pVHL association with certain interactors can trigger tumor transformation, inducing metabolism imbalance and ECM remodeling. Taken collectively, our findings provide novel insights into VHL-associated tumorigenesis. This highly integrated in silico approach may help elucidate novel treatment paradigms for VHL disease. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
28. Prediction of VRC01 neutralization sensitivity by HIV-1 gp160 sequence features.
- Author
-
Magaret, Craig A., Benkeser, David C., Williamson, Brian D., Borate, Bhavesh R., Carpp, Lindsay N., Georgiev, Ivelin S., Setliff, Ian, Dingens, Adam S., Simon, Noah, Carone, Marco, Simpkins, Christopher, Montefiori, David, Alter, Galit, Yu, Wen-Han, Juraska, Michal, Edlefsen, Paul T., Karuna, Shelly, Mgodi, Nyaradzo M., Edugupanti, Srilatha, and Gilbert, Peter B.
- Subjects
BIOLOGICAL databases ,HIV-positive persons ,AMINO acid analysis ,GLYCOSYLATION ,TITERS - Abstract
The broadly neutralizing antibody (bnAb) VRC01 is being evaluated for its efficacy to prevent HIV-1 infection in the Antibody Mediated Prevention (AMP) trials. A secondary objective of AMP utilizes sieve analysis to investigate how VRC01 prevention efficacy (PE) varies with HIV-1 envelope (Env) amino acid (AA) sequence features. An exhaustive analysis that tests how PE depends on every AA feature with sufficient variation would have low statistical power. To design an adequately powered primary sieve analysis for AMP, we modeled VRC01 neutralization as a function of Env AA sequence features of 611 HIV-1 gp160 pseudoviruses from the CATNAP database, with two objectives: (1) to develop models that best predict the neutralization readouts; and (2) to rank AA features by their predictive importance with classification and regression methods. The dataset was split in half, and machine learning algorithms were applied to each half, analyzed separately using cross-validation and hold-out validation. We selected Super Learner, a nonparametric ensemble-based cross-validated learning method, for advancement to the primary sieve analysis. This method predicted the dichotomous resistance outcome of whether the IC50 neutralization titer of VRC01 for a given Env pseudovirus is right-censored (indicating resistance) with an average validated AUC of 0.868 across the two hold-out datasets. Quantitative log IC50 was predicted with an average validated R² of 0.355. Features predicting neutralization sensitivity or resistance included 26 surface-accessible residues in the VRC01 and CD4 binding footprints, the length of gp120, the length of Env, the number of cysteines in gp120, the number of cysteines in Env, and 4 potential N-linked glycosylation sites; the top features will be advanced to the primary sieve analysis. This modeling framework may also inform the study of VRC01 in the treatment of HIV-infected persons. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
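The Super Learner mentioned in the record above is a cross-validated stacking ensemble. The sketch below illustrates the general idea with scikit-learn's StackingClassifier and a cross-validated AUC; it is a stand-in for, not a reproduction of, the authors' Super Learner pipeline, and the feature matrix and resistance labels are synthetic placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder for Env AA sequence features and right-censoring labels.
X, y = make_classification(n_samples=611, n_features=50, random_state=0)

ensemble = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(),  # meta-learner fit on CV predictions
    cv=5,
)

auc = cross_val_score(ensemble, X, y, cv=5, scoring="roc_auc")
print(f"cross-validated AUC: {auc.mean():.3f}")
```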
29. A computational framework to assess genome-wide distribution of polymorphic human endogenous retrovirus-K in human populations.
- Author
-
Li, Weiling, Lin, Lin, Malhotra, Raunaq, Yang, Lei, Acharya, Raj, and Poss, Mary
- Subjects
HUMAN endogenous retroviruses ,COMPUTATIONAL statistics ,GENETIC polymorphisms ,HUMAN genome ,RETROVIRUS genetics ,DISEASE risk factors ,HUMAN population genetics - Abstract
Human Endogenous Retrovirus type K (HERV-K) is the only HERV known to be insertionally polymorphic; not all individuals have a retrovirus at a specific genomic location. It is possible that HERV-Ks contribute to human disease because people differ in both the number and genomic location of these retroviruses. Indeed, viral transcripts, proteins, and antibodies against HERV-K are detected in cancers, autoimmune diseases, and neurodegenerative diseases. However, attempts to link a polymorphic HERV-K with any disease have been frustrated, in part because the population prevalence of the HERV-K provirus at each polymorphic site is lacking and because it is challenging to identify closely related elements such as HERV-K from short-read sequence data. We present an integrated and computationally robust approach that uses whole-genome short-read data to determine the occupation status at all sites reported to contain a HERV-K provirus. Our method estimates the proportion of fixed-length genomic sequences (k-mers) from whole-genome sequence data matching a reference set of k-mers unique to each HERV-K locus, and applies mixture-model-based clustering of these values to account for low-depth sequence data. Our analysis of 1000 Genomes Project data (KGP) reveals numerous differences among the five KGP super-populations in the prevalence of individual and co-occurring HERV-K proviruses; we provide a visualization tool to easily depict the proportion of the KGP populations with any combination of polymorphic HERV-K proviruses. Further, because HERV-K is insertionally polymorphic, the genome burden of known polymorphic HERV-K is variable in humans; this burden is lowest in East Asian (EAS) individuals. Our study identifies population-specific sequence variation for HERV-K proviruses at several loci. We expect these resources will advance research on HERV-K contributions to human diseases. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
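The occupancy calls in the record above rest on two steps that are easy to caricature: score each individual by the proportion of locus-specific reference k-mers recovered from their reads, then cluster those proportions with a mixture model so that low-depth samples are handled probabilistically. A toy sketch under those assumptions (synthetic scores; the published pipeline computes them from whole-genome read data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def locus_score(read_kmers: set, locus_kmers: set) -> float:
    """Proportion of a locus's unique reference k-mers observed in the reads."""
    return len(read_kmers & locus_kmers) / len(locus_kmers)

# Synthetic per-individual scores at one HERV-K locus: absent vs present.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.beta(1, 20, 400),    # provirus absent
                         rng.beta(20, 2, 100)])   # provirus present
gmm = GaussianMixture(n_components=2, random_state=0).fit(scores.reshape(-1, 1))
labels = gmm.predict(scores.reshape(-1, 1))
occupied = labels == gmm.means_.argmax()  # higher-mean component = occupied
print(occupied.sum(), "of", len(scores), "individuals called occupied")
```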
30. Searching algorithm for Type IV effector proteins (S4TE) 2.0: Improved tools for Type IV effector prediction, analysis and comparison in proteobacteria.
- Author
-
Noroy, Christophe, Lefrançois, Thierry, and Meyer, Damien F.
- Subjects
PROTEOBACTERIA ,SEARCH algorithms ,PATHOGENIC microorganisms ,DRUG development ,PREDICTION models ,EUKARYOTIC cells - Abstract
Bacterial pathogens have evolved numerous strategies to corrupt, hijack or mimic cellular processes in order to survive and proliferate. Among those strategies, Type IV effectors (T4Es) are proteins secreted by pathogenic bacteria to manipulate host cell processes during infection. They are delivered into eukaryotic cells in an ATP-dependent manner via the type IV secretion system, a specialized multiprotein complex. T4Es contain a wide spectrum of features including eukaryotic-like domains, localization signals and a C-terminal translocation signal. A combination of these features enables prediction of T4Es in a given bacterial genome. In this study, we developed a web-based comprehensive suite of tools with a user-friendly graphical interface. This version 2.0 of S4TE (Searching Algorithm for Type IV Effector Proteins; ) enables accurate prediction and comparison of T4Es. Search parameters and thresholds can be customized by the user to work with any genome sequence, whether publicly available or not. Applications range from characterizing effector features and identifying potential T4Es to analyzing the effectors based on genome G+C composition and local gene density. S4TE 2.0 allows the comparison of putative T4E repertoires of up to four bacterial strains at the same time. The software identifies T4E orthologs among strains and provides a Venn diagram and lists of genes for each intersection. New interactive features improve visualization of the location of candidate T4Es and provide hyperlinks to the NCBI and Pfam databases. S4TE 2.0 is designed to evolve rapidly with the publication of new experimentally validated T4Es, which will reinforce the predictive power of the algorithm. The computational methodology can be used to identify a wide spectrum of candidate bacterial effectors that lack sequence conservation but have similar amino acid characteristics. This approach will provide very valuable information about bacterial host specificity and virulence factors and help identify host targets for the development of new anti-bacterial molecules. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
31. ChIPulate: A comprehensive ChIP-seq simulation pipeline.
- Author
-
Datta, Vishaka, Hannenhalli, Sridhar, and Siddharthan, Rahul
- Subjects
CHROMATIN ,IMMUNOPRECIPITATION ,DNA-binding proteins ,GENOMICS ,TRANSCRIPTION factors - Abstract
ChIP-seq (Chromatin Immunoprecipitation followed by sequencing) is a high-throughput technique to identify genomic regions that are bound in vivo by a particular protein, e.g., a transcription factor (TF). Biological factors, such as chromatin state, indirect and cooperative binding, as well as experimental factors, such as antibody quality, cross-linking, and PCR biases, are known to affect the outcome of ChIP-seq experiments. However, the relative impact of these factors on inferences made from ChIP-seq data is not entirely clear. Here, via a detailed ChIP-seq simulation pipeline, ChIPulate, we assess the impact of various biological and experimental sources of variation on several outcomes of a ChIP-seq experiment, viz., the recoverability of the TF binding motif, the accuracy of TF-DNA binding detection, the sensitivity of inferred TF-DNA binding strength, and the number of replicates needed to confidently infer binding strength. We find that the TF motif can be recovered despite poor and non-uniform extraction and PCR amplification efficiencies. The recovery of the motif is, however, affected to a larger extent by the fraction of sites that are either cooperatively or indirectly bound. Importantly, our simulations reveal that the number of ChIP-seq replicates needed to accurately measure in vivo occupancy at high-affinity sites is larger than the recommended community standards. Our results establish statistical limits on the accuracy of inferences of protein-DNA binding from ChIP-seq and suggest that increasing the mean extraction efficiency, rather than amplification efficiency, would better improve sensitivity. The source code and instructions for running ChIPulate can be found at . [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
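Two experimental factors the simulation in the record above varies, extraction efficiency and PCR amplification efficiency, can be caricatured in a few lines: extraction is a binomial thinning of the bound fragments, and each PCR cycle duplicates each molecule with some probability. A toy forward model under those assumptions (this is not the ChIPulate code, which models many more factors such as cross-linking and indirect binding):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_counts(bound_fragments, p_extract=0.3, p_amplify=0.8, cycles=12):
    """Toy ChIP read counts per site: binomial extraction followed by
    noisy exponential PCR amplification (binomial branching per cycle)."""
    extracted = rng.binomial(bound_fragments, p_extract)
    amplified = extracted
    for _ in range(cycles):
        # each molecule duplicates with probability p_amplify this cycle
        amplified = amplified + rng.binomial(amplified, p_amplify)
    return amplified

sites = np.array([50, 200, 1000])  # true bound fragments at three sites
print([simulate_counts(n) for n in sites])
```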
32. 16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses.
- Author
-
Woloszynek, Stephen, Zhao, Zhengqiao, Chen, Jian, and Rosen, Gail L.
- Subjects
RIBOSOMAL RNA ,NUCLEOTIDE sequence ,HUMAN microbiota ,MACHINE learning ,FEATURE extraction ,TAXONOMY - Abstract
Advances in high-throughput sequencing have increased the availability of microbiome sequencing data that can be exploited to characterize microbiome community structure in situ. We explore using word and sentence embedding approaches for nucleotide sequences since they may be a suitable numerical representation for downstream machine learning applications (especially deep learning). This work involves first encoding (“embedding”) each sequence into a dense, low-dimensional, numeric vector space. Here, we use Skip-Gram word2vec to embed k-mers, obtained from 16S rRNA amplicon surveys, and then leverage an existing sentence embedding technique to embed all sequences belonging to specific body sites or samples. We demonstrate that these representations are meaningful, and hence the embedding space can be exploited as a form of feature extraction for exploratory analysis. We show that sequence embeddings preserve relevant information about the sequencing data such as k-mer context, sequence taxonomy, and sample class. Specifically, the sequence embedding space resolved differences among phyla, as well as differences among genera within the same family. Distances between sequence embeddings had similar qualities to distances between alignment identities, and embedding multiple sequences can be thought of as generating a consensus sequence. In addition, embeddings are versatile features that can be used for many downstream tasks, such as taxonomic and sample classification. Using sample embeddings for body site classification resulted in negligible performance loss compared to using OTU abundance data, and clustering embeddings yielded high-fidelity species clusters. Lastly, the k-mer embedding space captured distinct k-mer profiles that mapped to specific regions of the 16S rRNA gene and corresponded with particular body sites. Together, our results show that embedding sequences results in meaningful representations that can be used for exploratory analyses or for downstream machine learning applications that require numeric data. Moreover, because the embeddings are trained in an unsupervised manner, unlabeled data can be embedded and used to bolster supervised machine learning tasks. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
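The embedding step in the record above amounts to treating overlapping k-mers as "words" and each read as a "sentence", then training Skip-Gram word2vec. A minimal sketch with gensim (assuming gensim ≥ 4; the value of k, the vector dimensions, and the toy reads are arbitrary illustrative choices, not the paper's settings):

```python
from gensim.models import Word2Vec

def tokenize(seq, k=6):
    """Overlapping k-mers play the role of words in a sentence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

reads = ["ACGTACGTGGCT", "ACGTACGTGACT", "TTGACCGTAGCA"]  # toy 16S fragments
model = Word2Vec([tokenize(r) for r in reads],
                 vector_size=32, window=5, sg=1, min_count=1, seed=0)

kmer_vec = model.wv["ACGTAC"]  # one k-mer embedding
tokens = tokenize(reads[0])
read_vec = sum(model.wv[w] for w in tokens) / len(tokens)
# Averaging k-mer vectors is a crude stand-in for the paper's sentence embedding.
```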
33. Phylogenies from dynamic networks.
- Author
-
Metzig, Cornelia, Ratmann, Oliver, Bezemer, Daniela, and Colijn, Caroline
- Subjects
PHYLOGENY ,PATHOGENIC microorganisms ,MULTIPLE correspondence analysis (Statistics) ,SUPERVISED learning ,EPIDEMIOLOGY - Abstract
The relationship between the underlying contact network over which a pathogen spreads and the pathogen phylogenetic trees that are obtained presents an opportunity to use sequence data to learn about contact networks that are difficult to study empirically. However, this relationship is not explicitly known and is usually studied in simulations, often with the simplifying assumption that the contact network is static in time, though human contact networks are dynamic. We simulate pathogen phylogenetic trees on dynamic Erdős-Rényi random networks and on two dynamic networks with skewed degree distributions, of which one is additionally clustered. We use tree shape features to explore how adding dynamics changes the relationships between the overall network structure and phylogenies. Our tree features include the number of small substructures (cherries, pitchforks) in the trees, measures of tree imbalance (Sackin index, Colless index), features derived from network science (diameter, closeness), as well as features using the internal branch lengths from the tips to the root. Using principal component analysis we find that the network dynamics influence the shapes of phylogenies, as does the network type. We also compare dynamic and time-integrated static networks. We find, in particular, that static network models like the widely used Barabási-Albert model can be poor approximations for dynamic networks. We explore the effects of mis-specifying the network on the performance of classifiers trained to identify the transmission rate (using supervised learning methods). We find that both mis-specification of the underlying network and of its parameters (mean degree, turnover rate) have a strong adverse effect on the ability to estimate the transmission parameter. We illustrate these results by classifying HIV trees with a classifier that we trained on simulated trees from different networks, infection rates and turnover rates. Our results point to the importance of correctly estimating and modelling contact networks with dynamics when using phylodynamic tools to estimate epidemiological parameters. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
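Two of the tree shape features listed in the record above are simple to compute from scratch: cherries (internal nodes whose two children are both tips) and the Sackin index (the sum of tip depths). A self-contained sketch on a toy binary tree encoded as nested tuples (a hypothetical encoding for illustration, not the authors' code):

```python
def cherries(tree):
    """Count internal nodes whose two children are both tips (leaves)."""
    if isinstance(tree, str):          # a tip is just its name
        return 0
    left, right = tree
    if isinstance(left, str) and isinstance(right, str):
        return 1
    return cherries(left) + cherries(right)

def sackin(tree, depth=0):
    """Sackin index: total depth summed over all tips; higher = more imbalanced."""
    if isinstance(tree, str):
        return depth
    left, right = tree
    return sackin(left, depth + 1) + sackin(right, depth + 1)

# ((A,B),(C,(D,E))) as nested tuples
tree = (("A", "B"), ("C", ("D", "E")))
print(cherries(tree), sackin(tree))   # 2 cherries; Sackin index 12
```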
34. Identifying individual risk rare variants using protein structure guided local tests (POINT).
- Author
-
Marceau West, Rachel, Lu, Wenbin, Rotroff, Daniel M., Kuenemann, Melaine A., Chang, Sheng-Mao, Wu, Michael C., Wagner, Michael J., Buse, John B., Motsinger-Reif, Alison A., Fourches, Denis, and Tzeng, Jung-Ying
- Subjects
PROTEIN structure ,PROTEOMICS ,OPERATOR theory ,QUANTITATIVE research ,KERNEL functions ,BIOLOGICAL databases - Abstract
Rare variants are of increasing interest to genetic association studies because of their etiological contributions to human complex diseases. Due to the rarity of the mutant events, rare variants are routinely analyzed on an aggregate level. While aggregation analyses improve the detection of global-level signal, they are not able to pinpoint causal variants within a variant set. To perform inference on a localized level, additional information, e.g., biological annotation, is often needed to boost the information content of a rare variant. Following the observation that important variants are likely to cluster together on functional domains, we propose a protein structure guided local test (POINT) to provide variant-specific association information using structure-guided aggregation of signal. Constructed under a kernel machine framework, POINT performs local association testing by borrowing information from neighboring variants in the 3-dimensional protein space in a data-adaptive fashion. Besides merely providing a list of promising variants, POINT assigns each variant a p-value to permit variant ranking and prioritization. We assess the selection performance of POINT using simulations and illustrate how it can be used to prioritize individual rare variants in PCSK9, ANGPTL4 and CETP in the Action to Control Cardiovascular Risk in Diabetes (ACCORD) clinical trial data. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
35. Apollo: Democratizing genome annotation.
- Author
-
Dunn, Nathan A., Unni, Deepak R., Diesh, Colin, Munoz-Torres, Monica, Harris, Nomi L., Yao, Eric, Rasche, Helena, Holmes, Ian H., Elsik, Christine G., and Lewis, Suzanna E.
- Subjects
TRANSCRIPTOMES ,GENOMICS ,OPEN source software ,GENOMES ,MOLECULAR genetics - Abstract
Genome annotation is the process of identifying the location and function of a genome's encoded features. Improving the biological accuracy of annotation is a complex and iterative process requiring researchers to review and incorporate multiple sources of information such as transcriptome alignments, predictive models based on sequence profiles, and comparisons to features found in related organisms. Because rapidly decreasing costs are enabling an ever-growing number of scientists to incorporate sequencing as a routine laboratory technique, there is widespread demand for tools that can assist in the deliberative analytical review of genomic information. To this end, we present Apollo, an open source software package that enables researchers to efficiently inspect and refine the precise structure and role of genomic features in a graphical browser-based platform. Some of Apollo’s newer user interface features include support for real-time collaboration, allowing distributed users to simultaneously edit the same encoded features while also instantly seeing the updates made by other researchers on the same region in a manner similar to Google Docs. Its technical architecture enables Apollo to be integrated into multiple existing genomic analysis pipelines and heterogeneous laboratory workflow platforms. Finally, we consider the implications that Apollo and related applications may have on how the results of genome research are published and made accessible. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
36. Full-Length Envelope Analyzer (FLEA): A tool for longitudinal analysis of viral amplicons.
- Author
-
Eren, Kemal, Weaver, Steven, Kosakovsky Pond, Sergei L., Ketteringham, Robert, Valentyn, Morné, Laird Smith, Melissa, Kumar, Venkatesh, Mohan, Sanjay, and Murrell, Ben
- Subjects
NUCLEOTIDE sequencing ,DRUG resistance ,AIDS vaccines ,PHYLOGENY ,TIME series analysis - Abstract
Next generation sequencing of viral populations has advanced our understanding of viral population dynamics, the development of drug resistance, and escape from host immune responses. Many applications require complete gene sequences, which can be impossible to reconstruct from short reads. HIV env, the protein of interest for HIV vaccine studies, is exceptionally challenging for long-read sequencing and analysis due to its length, high substitution rate, and extensive indel variation. While long-read sequencing is attractive in this setting, the analysis of such data is not well handled by existing methods. To address this, we introduce FLEA (Full-Length Envelope Analyzer), which performs end-to-end analysis and visualization of long-read sequencing data. FLEA consists of both a pipeline (optionally run on a high-performance cluster) and a client-side web application that provides interactive results. The pipeline transforms FASTQ reads into high-quality consensus sequences (HQCSs) and uses them to build a codon-aware multiple sequence alignment. The resulting alignment is then used to infer phylogenies, selection pressure, and evolutionary dynamics. The web application provides publication-quality plots and interactive visualizations, including an annotated viral alignment browser, time series plots of evolutionary dynamics, visualizations of gene-wide selective pressures (such as dN/dS) across time and across protein structure, and a phylogenetic tree browser. We demonstrate how FLEA may be used to process Pacific Biosciences HIV env data and describe recent examples of its use. Simulations show how FLEA dramatically reduces the error rate of this sequencing platform, providing an accurate portrait of complex and variable HIV env populations. A public instance of FLEA is hosted at . The Python source code for the pipeline can be found at . The client-side application is available at . A live demo of the P018 results can be found at . [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
37. Co-evolution networks of HIV/HCV are modular with direct association to structure and function.
- Author
-
Quadeer, Ahmed Abdul, Morales-Jimenez, David, and McKay, Matthew R.
- Subjects
HEPATITIS C virus ,DIAGNOSIS of HIV infections ,VIRAL evolution ,MULTIPLE correspondence analysis (Statistics) ,FLAVIVIRUSES - Abstract
Mutational correlation patterns found in population-level sequence data for the Human Immunodeficiency Virus (HIV) and the Hepatitis C Virus (HCV) have been demonstrated to be informative of viral fitness. Such patterns can be seen as footprints of the intrinsic functional constraints placed on viral evolution under diverse selective pressures. Here, considering multiple HIV and HCV proteins, we demonstrate that these mutational correlations encode a modular co-evolutionary structure that is tightly linked to the structural and functional properties of the respective proteins. Specifically, by introducing a robust statistical method based on sparse principal component analysis, we identify near-disjoint sets of collectively correlated residues (sectors) having mostly a one-to-one association with largely distinct structural or functional domains. This suggests that the distinct phenotypic properties of HIV/HCV proteins often give rise to quasi-independent modes of evolution, with each mode involving a sparse and localized network of mutational interactions. Moreover, individual inferred sectors of HIV are shown to carry immunological significance, providing insight for guiding targeted vaccine strategies. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
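The sector inference in the record above is built on sparse principal component analysis, which the authors extend with a robust statistical procedure. As a bare-bones illustration of the sparse-PCA step only, scikit-learn's SparsePCA can be run on a binary mutation matrix, reading off each component's nonzero loadings as a candidate sector (random data here, so the "sectors" it prints are meaningless; this is not the paper's estimator):

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
# Toy binary alignment: 300 sequences x 40 residues (1 = mutation vs consensus).
X = rng.integers(0, 2, size=(300, 40)).astype(float)

spca = SparsePCA(n_components=3, alpha=1.0, random_state=0).fit(X)
for i, comp in enumerate(spca.components_):
    sector = np.flatnonzero(comp)  # residues with nonzero loadings
    print(f"sector {i}: residues {sector}")
```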
38. SIG-DB: Leveraging homomorphic encryption to securely interrogate privately held genomic databases.
- Author
-
Titus, Alexander J., Flower, Audrey, Hagerty, Patrick, Gamble, Paul, Lewis, Charlie, Stavish, Todd, O’Connell, Kevin P., Shipley, Greg, and Rogers, Stephanie M.
- Subjects
GENOMES ,BIOLOGICAL databases ,CRYPTOGRAPHY ,BIOINFORMATICS ,BIOMATHEMATICS ,GENETICS - Abstract
Genomic data are becoming increasingly valuable as we develop methods to utilize the information at scale and gain a greater understanding of how genetic information relates to biological function. Advances in synthetic biology and the decreased cost of sequencing are increasing the amount of privately held genomic data. As the quantity and value of private genomic data grows, so does the incentive to acquire and protect such data, which creates a need to store and process these data securely. We present an algorithm for the Secure Interrogation of Genomic DataBases (SIG-DB). The SIG-DB algorithm enables databases of genomic sequences to be searched with an encrypted query sequence without revealing the query sequence to the Database Owner or any of the database sequences to the Querier. SIG-DB is the first application of its kind to take advantage of locality-sensitive hashing and homomorphic encryption to allow generalized sequence-to-sequence comparisons of genomic data. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
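Of the two building blocks named in the record above, the locality-sensitive-hashing half can be sketched with a tiny MinHash over sequence k-mers: fixed-length signatures are comparable without exchanging the underlying sequences, and the fraction of matching minima estimates Jaccard similarity. The homomorphic-encryption layer that actually protects the query in SIG-DB is omitted; this shows only the hashing intuition:

```python
import hashlib

def kmers(seq, k=8):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash(items, num_hashes=64):
    """For each of num_hashes salted hash functions, keep the minimum value."""
    return [min(int(hashlib.sha1(f"{salt}:{it}".encode()).hexdigest(), 16)
                for it in items)
            for salt in range(num_hashes)]

def similarity(sig_a, sig_b):
    """Fraction of matching minima estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

query = minhash(kmers("ACGTACGTGGCTTACGGATT"))
db_seq = minhash(kmers("ACGTACGTGACTTACGGATT"))
print(f"estimated Jaccard: {similarity(query, db_seq):.2f}")
```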
39. Cancerin: A computational pipeline to infer cancer-associated ceRNA interaction networks.
- Author
-
Do, Duc and Bozdag, Serdar
- Subjects
CANCER genetics ,MICRORNA ,GENE expression ,CARCINOGENESIS ,GENETIC regulation ,DNA methylation - Abstract
MicroRNAs (miRNAs) inhibit expression of target genes by binding to their RNA transcripts. It has recently been shown that RNA transcripts targeted by the same miRNA can “compete” for the miRNA molecules and thereby indirectly regulate each other. Experimental evidence has suggested that the aberration of such miRNA-mediated interaction between RNAs, called competing endogenous RNA (ceRNA) interaction, can play important roles in tumorigenesis. Given the difficulty of deciphering context-specific miRNA binding, and the existence of various gene regulatory factors such as DNA methylation and copy number alteration, accurately inferring context-specific ceRNA interactions is a computationally challenging task. Here we propose a computational method called Cancerin to identify cancer-associated ceRNA interactions. Cancerin incorporates DNA methylation, copy number alteration, and gene and miRNA expression datasets to construct cancer-specific ceRNA networks. We applied Cancerin to three cancer datasets from the Cancer Genome Atlas (TCGA) project. Our results indicated that ceRNAs were enriched with cancer-related genes, and ceRNA modules in the inferred ceRNA networks were involved in cancer-associated biological processes. Using a LINCS-L1000 shRNA-mediated gene knockdown experiment in a breast cancer cell line to assess performance, Cancerin was able to predict the expression outcome of ceRNA genes with high accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
40. Latent environment allocation of microbial community data.
- Author
-
Higashi, Koichi, Suzuki, Shinya, Kurosawa, Shin, Mori, Hiroshi, and Kurokawa, Ken
- Subjects
BACTERIAL communities ,MICROBIAL communities ,HIERARCHICAL Bayes model ,HUMAN microbiota ,COMPUTATIONAL biology - Abstract
As data on microbial community structures found in various environments have accumulated, studies have examined the relationship between the environmental labels given to retrieved microbial samples and their community structures. However, because environments change continuously over time and space, the mixed states of environments and their effects on community formation should be considered, instead of evaluating the effects of discrete environmental categories. Here we applied a hierarchical Bayesian model to paired datasets containing more than 30,000 samples of microbial community structures and sample description documents. From the training results, we extracted latent environmental topics that associate co-occurring microbes with co-occurring word sets among samples. Topics are the core elements of environmental mixtures, and the visualization of topic-based samples clarifies the connections among various environments. Based on the model training results, we developed a web application, LEA (Latent Environment Allocation), which provides a way to evaluate the typicality and heterogeneity of microbial communities in newly obtained samples without confining the environmental categories to be compared. Because topics link words and microbes, LEA also enables semantic search for samples related to a query among the 30,000 microbiome samples. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
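The latent-topic idea in the record above is in the same family as latent Dirichlet allocation: samples are mixtures of topics, and each topic is a distribution over microbes (and, in the paper, over words from the sample descriptions). A minimal single-modality sketch with scikit-learn's LatentDirichletAllocation on a toy taxon-count matrix; the published model is a joint text-plus-taxa hierarchical Bayesian model, which this does not reproduce:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# Toy counts: 100 samples x 25 taxa.
counts = rng.poisson(2.0, size=(100, 25))

lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(counts)
mixtures = lda.transform(counts)                    # per-sample topic proportions
top_taxa = lda.components_.argsort(axis=1)[:, -3:]  # 3 highest-weight taxa per topic
print(mixtures[0].round(2), top_taxa[0])
```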
41. Submit a Topic Page to PLOS Computational Biology and Wikipedia.
- Author
-
Mietchen, Daniel, Wodak, Shoshana, Wasik, Szymon, Szostak, Natalia, and Dessimoz, Christophe
- Subjects
COMPUTATIONAL biology ,AUTHORS ,READERSHIP ,ORIGIN of life - Abstract
The article offers information on the periodical's 'Topic Pages' project as a way to help fill important gaps in Wikipedia's coverage of computational biology content and to credit authors for their contributions. It mentions that hypercycle theory is now more accessible not only for advanced readers, but also for ordinary people who seek knowledge on the computational aspects of the origins of life.
- Published
- 2018
- Full Text
- View/download PDF
42. Traceability, reproducibility and wiki-exploration for “à-la-carte” reconstructions of genome-scale metabolic models.
- Author
-
Aite, Méziane, Chevallier, Marie, Frioux, Clémence, Trottier, Camille, Got, Jeanne, Cortés, María Paz, Mendoza, Sebastián N., Carrier, Grégory, Dameron, Olivier, Guillaudeux, Nicolas, Latorre, Mauricio, Loira, Nicolás, Markov, Gabriel V., Maass, Alejandro, and Siegel, Anne
- Subjects
METABOLISM ,REPRODUCIBLE research ,BIOINFORMATICS ,EUKARYOTES ,AUTOMATION - Abstract
Genome-scale metabolic models have become the tool of choice for the global analysis of microorganism metabolism, and their reconstruction has attained high standards of quality and reliability. Improvements in this area have been accompanied by the development of some major platforms and databases, and an explosion of individual bioinformatics methods. Consequently, many recent models result from “à la carte” pipelines, combining the use of platforms, individual tools and biological expertise to enhance the quality of the reconstruction. Although very useful, introducing heterogeneous tools that hardly interact with each other causes a loss of traceability and reproducibility in the reconstruction process. This represents a real obstacle, especially when considering less-studied species whose metabolic reconstruction can greatly benefit from comparison to good-quality models of related organisms. This work proposes an adaptable workspace, AuReMe, for sustainable reconstruction or improvement of genome-scale metabolic models involving personalized pipelines. At each step, relevant information about the modifications brought to the model by a method is stored, ensuring that the process is reproducible and documented regardless of the combination of tools used. Additionally, the workspace establishes a way to browse metabolic models and their metadata through the automatic generation of ad hoc local wikis dedicated to monitoring and facilitating the reconstruction process. AuReMe supports exploration and semantic queries based on RDF databases. We illustrate how this workspace allowed handling, in an integrated way, the metabolic reconstructions of non-model organisms such as an extremophile bacterium or eukaryotic algae. Among relevant applications, the latter reconstruction led to putative evolutionary insights into a metabolic pathway. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
43. RosettaAntibodyDesign (RAbD): A general framework for computational antibody design.
- Author
-
Adolf-Bryfogle, Jared, Kalyuzhniy, Oleks, Kubitz, Michael, Weitzner, Brian D., Hu, Xiaozhen, Adachi, Yumiko, Schief, William R., and Dunbrack, Roland L. Jr.
- Subjects
IMMUNOGLOBULINS ,EPITOPES ,ANTIGENS ,AMINO acids ,MONOCLONAL antibodies - Abstract
A structural-bioinformatics-based computational methodology and framework have been developed for the design of antibodies to targets of interest. RosettaAntibodyDesign (RAbD) samples the diverse sequence, structure, and binding space of an antibody to an antigen in highly customizable protocols for the design of antibodies in a broad range of applications. The program samples antibody sequences and structures by grafting structures from a widely accepted set of the canonical clusters of CDRs (North et al., J. Mol. Biol., 406:228–256, 2011). It then performs sequence design according to amino acid sequence profiles of each cluster, and samples CDR backbones using a flexible-backbone design protocol incorporating cluster-based CDR constraints. Starting from an existing experimental or computationally modeled antigen-antibody structure, RAbD can be used to redesign a single CDR or multiple CDRs with loops of different length, conformation, and sequence. We rigorously benchmarked RAbD on a set of 60 diverse antibody–antigen complexes, using two design strategies: optimizing total Rosetta energy and optimizing interface energy alone. We utilized two novel metrics for measuring success in computational protein design. The design risk ratio (DRR) is equal to the frequency of recovery of native CDR lengths and clusters divided by the frequency of sampling of those features during the Monte Carlo design procedure. Ratios greater than 1.0 indicate that the design process is picking out the native features more frequently than expected from their sampling rate. We achieved DRRs for the non-H3 CDRs of between 2.4 and 4.0. The antigen risk ratio (ARR) is the ratio of frequencies of the native amino acid types, CDR lengths, and clusters in the output decoys for simulations performed in the presence and absence of the antigen. For CDRs, we achieved cluster ARRs as high as 2.5 for L1 and 1.5 for H2. For sequence design simulations without CDR grafting, the overall recovery of the native amino acid types for residues that contact the antigen in the native structures was 72% in simulations performed in the presence of the antigen and 48% in simulations performed without the antigen, for an ARR of 1.5. For the non-contacting residues, the ARR was 1.08. This shows that the sequence profiles are able to maintain the amino acid types of these conserved, buried sites, while recovery of the exposed, contacting residues requires the presence of the antigen-antibody interface. We tested RAbD experimentally on both a lambda and a kappa antibody–antigen complex, successfully improving their affinities 10- to 50-fold by replacing individual CDRs of the native antibody with new CDR lengths and clusters. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
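The two benchmark metrics defined in the record above reduce to simple frequency ratios; spelled out (our notation, not the paper's), the "greater than 1.0" reading is immediate:

```latex
\[
\mathrm{DRR} = \frac{f_{\mathrm{recovered}}(\text{native CDR length, cluster})}
                    {f_{\mathrm{sampled}}(\text{native CDR length, cluster})},
\qquad
\mathrm{ARR} = \frac{f_{\mathrm{output}}(\text{native feature} \mid \text{antigen present})}
                    {f_{\mathrm{output}}(\text{native feature} \mid \text{antigen absent})}.
\]
```

A DRR above 1.0 means the scoring function enriches native-like designs beyond their Monte Carlo sampling rate, and an ARR above 1.0 attributes part of that selection to the antigen interface itself.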
44. A machine learning based framework to identify and classify long terminal repeat retrotransposons.
- Author
-
Schietgat, Leander, Vens, Celine, Cerri, Ricardo, Fischer, Carlos N., Costa, Eduardo, Ramon, Jan, Carareto, Claudia M. A., and Blockeel, Hendrik
- Subjects
RETROTRANSPOSONS ,MACHINE learning ,GENOMES ,DROSOPHILA melanogaster ,ARABIDOPSIS thaliana - Abstract
Transposable elements (TEs) are repetitive nucleotide sequences that make up a large portion of eukaryotic genomes. They can move and duplicate within a genome, increasing genome size and contributing to genetic diversity within and across species. Accurate identification and classification of the TEs present in a genome is an important step towards understanding their effects on genes and their role in genome evolution. We introduce TE-L, a framework based on machine learning that automatically identifies TEs in a given genome and assigns a classification to them. We present an implementation of our framework for LTR retrotransposons, a particular type of TE characterized by long terminal repeats (LTRs) at their boundaries. We evaluate the predictive performance of our framework on the well-annotated genomes of Drosophila melanogaster and Arabidopsis thaliana, and we compare our results for three LTR retrotransposon superfamilies with the results of three widely used methods for TE identification or classification: RM, C and LD. In contrast to these methods, TE-L is the first to incorporate machine learning techniques, outperforming them in terms of predictive performance while learning models and making predictions efficiently. Moreover, we show that our method was able to identify TEs that none of the above methods could find, and we investigated TE-L's predictions that did not correspond to an official annotation. It turns out that many of these predictions are in fact strongly homologous to a known TE. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
45. DIVERSITY in binding, regulation, and evolution revealed from high-throughput ChIP.
- Author
-
Mitra, Sneha, Biswas, Anushua, and Narlikar, Leelavati
- Subjects
DNA-protein interactions ,CHROMATIN ,IMMUNOPRECIPITATION ,DNA-binding proteins ,EPIGENETICS - Abstract
Genome-wide in vivo protein-DNA interactions are routinely mapped using high-throughput chromatin immunoprecipitation (ChIP). ChIP-reported regions are typically investigated for enriched sequence motifs, which are likely to model the DNA-binding specificity of the profiled protein and/or of co-occurring proteins. However, simple enrichment analyses can miss insights into the binding activity of the protein. Note that ChIP reports regions making direct contact with the protein as well as those binding through intermediaries. For example, consider a ChIP experiment targeting protein X, which binds DNA at its cognate sites, but simultaneously interacts with four other proteins. Each of these proteins also binds to its own specific cognate sites along distant parts of the genome, a scenario consistent with the current view of transcriptional hubs and chromatin loops. Since ChIP will pull down all X-associated regions, the final reported data will be a union of five distinct sets of regions, each containing binding sites of one of the five proteins, respectively. Characterizing all five different motifs and the corresponding sets is important for interpreting the ChIP experiment and, ultimately, the role of X in regulation. We present DIVERSITY, which attempts exactly this: it partitions the data so that each partition can be characterized with its own de novo motif. DIVERSITY uses a Bayesian approach to identify the optimal number of motifs and the associated partitions, which together explain the entire dataset. This is in contrast to standard motif finders, which report motifs individually enriched in the data, but do not necessarily explain all reported regions. We show that the different motifs and associated regions identified by DIVERSITY give insights into the various complexes that may be forming along the chromatin, something that has so far not been attempted from ChIP data. Webserver at ; standalone (Mac OS X/Linux) from . [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
46. Using pseudoalignment and base quality to accurately quantify microbial community composition.
- Author
-
Reppell, Mark and Novembre, John
- Subjects
DNA analysis ,MICROBIAL diversity ,COMPUTATIONAL biology ,MOLECULAR biology ,COMPUTER simulation - Abstract
Pooled DNA from multiple unknown organisms arises in a variety of contexts, for example microbial samples from ecological or human health research. Determining the composition of pooled samples can be difficult, especially at the scale of modern sequencing data and reference databases. Here we propose a novel method for taxonomic profiling in pooled DNA that combines the speed and low-memory requirements of k-mer based pseudoalignment with a likelihood framework that uses base quality information to better resolve multiply mapped reads. We apply the method to the problem of classifying 16S rRNA reads using a reference database of known organisms, a common challenge in microbiome research. Using simulations, we show the method is accurate across a variety of read lengths, with different length reference sequences, at different sample depths, and when samples contain reads originating from organisms absent from the reference. We also assess performance in real 16S data, where we reanalyze previous genetic association data to show our method discovers a larger number of quantitative trait associations than other widely used methods. We implement our method in the software Karp, for k-mer based analysis of read pools, to provide a novel combination of speed and accuracy that is uniquely suited for enhancing discoveries in microbial studies. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
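The likelihood framework in the record above scores a read against each candidate reference using per-base quality: a base with Phred score Q is wrong with probability e = 10^(-Q/10). A minimal sketch of that read-versus-reference likelihood (a hypothetical helper, not Karp's implementation, which couples this with k-mer pseudoalignment and resolves multiply mapped reads across the whole pool):

```python
import math

def read_likelihood(read, quals, ref):
    """P(read | ref) from Phred scores: match with prob 1 - e,
    mismatch with prob e/3 (three possible wrong bases)."""
    log_p = 0.0
    for base, q, ref_base in zip(read, quals, ref):
        e = 10 ** (-q / 10)
        log_p += math.log(1 - e) if base == ref_base else math.log(e / 3)
    return math.exp(log_p)

read, quals = "ACGT", [30, 30, 10, 30]
for ref in ("ACGT", "ACTT"):
    print(ref, read_likelihood(read, quals, ref))
# The mismatch falls on a low-quality base, so the second reference
# is penalized, but far less than a high-quality mismatch would be.
```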
47. New computational approaches to understanding molecular protein function.
- Author
-
Fetrow, Jacquelyn S. and Babbitt, Patricia C.
- Subjects
COMPUTATIONAL biology ,PHYSIOLOGICAL effects of proteins ,PROTEIN genetics ,MOLECULAR biology ,ONTOLOGIES (Information retrieval) - Abstract
The authors discuss new computational approaches to understanding molecular protein function. They note that the Gene Ontology (GO) system of classifying function recognizes distinct ways of defining function, including cellular component and molecular function, and they discuss approaches to understanding molecular function that involve motifs, such as the sequence motifs exemplified by PRINTS and PROSITE. They also note that contemporary protein superfamilies are the result of numerous genetic events.
- Published
- 2018
- Full Text
- View/download PDF
48. A FAIR guide for data providers to maximise sharing of human genomic data.
- Author
-
Corpas, Manuel, Kovalevskaya, Nadezda V., McMurray, Amanda, and Nielsen, Fiona G. G.
- Subjects
HUMAN genome ,INFORMATION resources ,DATABASE management ,DATABASE administration ,SOCIAL contract - Abstract
It is generally acknowledged that data sharing is critical for the reproducibility and progress of human genomic research. Every sharing transaction is a data exchange between a data consumer and a data provider. Providers of human genomic data (e.g., publicly or privately funded repositories and data archives) fulfil their social contract with data donors when their shareable data conform to FAIR (findable, accessible, interoperable, reusable) principles. Based on our experience with Repositive (), a leading discovery platform cataloguing all shared human genomic datasets, we propose guidelines for data providers wishing to maximise the FAIRness of their shared data. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
49. iDREM: Interactive visualization of dynamic regulatory networks.
- Author
-
Ding, Jun, Hagood, James S., Ambalavanan, Namasivayam, Kaminski, Naftali, and Bar-Joseph, Ziv
- Subjects
GENETIC software ,PROTEINS ,GENE expression ,MICRORNA ,PROTEOMICS - Abstract
The Dynamic Regulatory Events Miner (DREM) software reconstructs dynamic regulatory networks by integrating static protein-DNA interaction data with time series gene expression data. In recent years, several additional types of high-throughput time series data have been profiled when studying biological processes including time series miRNA expression, proteomics, epigenomics and single cell RNA-Seq. Combining all available time series and static datasets in a unified model remains an important challenge and goal. To address this challenge we have developed a new version of DREM termed interactive DREM (iDREM). iDREM provides support for all data types mentioned above and combines them with existing interaction data to reconstruct networks that can lead to novel hypotheses on the function and timing of regulators. Users can interactively visualize and query the resulting model. We showcase the functionality of the new tool by applying it to microglia developmental data from multiple labs. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
50. LAILAPS-QSM: A RESTful API and JAVA library for semantic query suggestions.
- Author
-
Chen, Jinbo, Scholz, Uwe, Zhou, Ruonan, and Lange, Matthias
- Subjects
JAVA programming language ,APPLICATION program interfaces ,DATABASES ,LIFE sciences ,ONTOLOGIES (Information retrieval) - Abstract
In order to access and filter the content of life-science databases, full-text search is a widely applied query interface. But its high flexibility and intuitiveness are paid for with potentially imprecise and incomplete query results. To reduce this drawback, query assistance systems suggest those combinations of keywords with the highest potential to match most of the relevant data records. Widespread approaches are syntactic query corrections that avoid misspelling and support expansion of words by suffixes and prefixes. Synonym expansion approaches apply thesauri, ontologies, and query logs. All need laborious curation and maintenance. Furthermore, access to query logs is in general restricted. Approaches that infer related queries from a user's profile (research field, geographic location, co-authorship, affiliation, etc.) require user registration and public accessibility of that profile, which contradicts privacy concerns. To overcome these drawbacks, we implemented LAILAPS-QSM, a machine learning approach that reconstructs possible linguistic contexts of a given keyword query. The context is inferred from the text records stored in the databases to be queried or, for general-purpose query suggestion, from PubMed abstracts and UniProt data. The supplied tool suite enables pre-processing of these text records and the computation of customized distributed word vectors, which are used to suggest alternative keyword queries. The quality of the query suggestions was evaluated for plant-science use cases. Locally available experts enabled a cost-efficient quality assessment in the categories trait, biological entity, taxonomy, affiliation, and metabolic function, performed using ontology term similarities. The mean information content similarity of LAILAPS-QSM suggestions for 15 representative queries is 0.70, and 34% have a score above 0.80. In comparison, the information content similarity for query suggestions made by human experts is 0.90. The software is available either as a tool set to build and train dedicated query suggestion services or as an already-trained general-purpose RESTful web service. The service uses open interfaces to be seamlessly embeddable into database frontends. The JAVA implementation uses highly optimized data structures and streamlined code to provide fast and scalable responses to web service calls. The source code of LAILAPS-QSM is available under GNU General Public License version 2 in a Bitbucket GIT repository: [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
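The suggestion step in the record above reduces to nearest-neighbour search in the word-vector space: embed the query keyword, then propose the closest vocabulary terms by cosine similarity. A dependency-light sketch over a toy vocabulary of pre-computed vectors (random here purely for illustration; LAILAPS-QSM trains its vectors on database text records, PubMed abstracts, and UniProt data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for trained distributed word vectors (term -> 32-d vector).
vocab = {t: rng.normal(size=32) for t in
         ["drought", "drought tolerance", "yield", "root", "leaf area"]}

def suggest(query, top_n=3):
    """Rank vocabulary terms by cosine similarity to the query's vector."""
    q = vocab[query]
    scores = {t: float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q)))
              for t, v in vocab.items() if t != query}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]

print(suggest("drought"))
```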