117 results
Search Results
2. Per-sample immunoglobulin germline inference from B cell receptor deep sequencing data.
- Author
-
Ralph, Duncan K. and Matsen IV, Frederick A.
- Subjects
B cell receptors ,IMMUNOGLOBULIN genes ,B cells ,ALLELES
- Abstract
The collection of immunoglobulin genes in an individual’s germline, which gives rise to B cell receptors via recombination, is known to vary significantly across individuals. In humans, for example, each individual has only a fraction of the several hundred known V alleles. Furthermore, the currently-accepted set of known V alleles is both incomplete (particularly for non-European samples), and contains a significant number of spurious alleles. The resulting uncertainty as to which immunoglobulin alleles are present in any given sample results in inaccurate B cell receptor sequence annotations, and in particular inaccurate inferred naive ancestors. In this paper we first show that the currently widespread practice of aligning each sequence to its closest match in the full set of IMGT alleles results in a very large number of spurious alleles that are not in the sample’s true set of germline V alleles. We then describe a new method for inferring each individual’s germline gene set from deep sequencing data, and show that it improves upon existing methods by making a detailed comparison on a variety of simulated and real data samples. This new method has been integrated into the partis annotation and clonal family inference package, available at https://github.com/psathyrella/partis, and is run by default without affecting overall run time. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
3. LMTRDA: Using logistic model tree to predict MiRNA-disease associations by fusing multi-source information of sequences and similarities.
- Author
-
Wang, Lei, You, Zhu-Hong, Chen, Xing, Li, Yang-Ming, Dong, Ya-Nan, Li, Li-Ping, and Zheng, Kai
- Subjects
LOGISTIC model (Demography) ,MICRORNA ,MEDICAL genetics ,RNA sequencing ,PREDICTION models ,BREAST tumors ,NATURAL language processing ,LYMPHOMA diagnosis
- Abstract
Emerging evidence has shown microRNAs (miRNAs) play an important role in human disease research. Identifying potential association among them is significant for the development of pathology, diagnose and therapy. However, only a tiny portion of all miRNA-disease pairs in the current datasets are experimentally validated. This prompts the development of high-precision computational methods to predict real interaction pairs. In this paper, we propose a new model of Logistic Model Tree for predicting miRNA-Disease Association (LMTRDA) by fusing multi-source information including miRNA sequences, miRNA functional similarity, disease semantic similarity, and known miRNA-disease associations. In particular, we introduce miRNA sequence information and extract its features using natural language processing technique for the first time in the miRNA-disease prediction model. In the cross-validation experiment, LMTRDA obtained 90.51% prediction accuracy with 92.55% sensitivity at the AUC of 90.54% on the HMDD V3.0 dataset. To further evaluate the performance of LMTRDA, we compared it with different classifier and feature descriptor models. In addition, we also validate the predictive ability of LMTRDA in human diseases including Breast Neoplasms, Breast Neoplasms and Lymphoma. As a result, 28, 27 and 26 out of the top 30 miRNAs associated with these diseases were verified by experiments in different kinds of case studies. These experimental results demonstrate that LMTRDA is a reliable model for predicting the association among miRNAs and diseases. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
4. Predicting the mechanism and rate of H-NS binding to AT-rich DNA.
- Author
-
Riccardi, Enrico, van Mastbergen, Eva C., Navarre, William Wiley, and Vreede, Jocelyne
- Subjects
BACTERIA ,ARGININE ,DNA ,BIOCHEMISTRY ,PROTEINS
- Abstract
Bacteria contain several nucleoid-associated proteins that organize their genomic DNA into the nucleoid by bending, wrapping or bridging DNA. The Histone-like Nucleoid Structuring protein H-NS found in many Gram-negative bacteria is a DNA bridging protein and can structure DNA by binding to two separate DNA duplexes or to adjacent sites on the same duplex, depending on external conditions. Several nucleotide sequences have been identified to which H-NS binds with high affinity, indicating H-NS prefers AT-rich DNA. To date, highly detailed structural information of the H-NS DNA complex remains elusive. Molecular simulation can complement experiments by modelling structures and their time evolution in atomistic detail. In this paper we report an exploration of the different binding modes of H-NS to a high affinity nucleotide sequence and an estimate of the associated rate constant. By means of molecular dynamics simulations, we identified three types of binding for H-NS to AT-rich DNA. To further sample the transitions between these binding modes, we performed Replica Exchange Transition Interface Sampling, providing predictions of the mechanism and rate constant of H-NS binding to DNA. H-NS interacts with the DNA through a conserved QGR motif, aided by a conserved arginine at position 93. The QGR motif interacts first with phosphate groups, followed by the formation of hydrogen bonds between acceptors in the DNA minor groove and the sidechains of either Q112 or R114. After R114 inserts into the minor groove, the rest of the QGR motif follows. Full insertion of the QGR motif in the minor groove is stable over several tens of nanoseconds, and involves hydrogen bonds between the bases and both backbone and sidechains of the QGR motif. The rate constant for the process of H-NS binding to AT-rich DNA resulting in full insertion of the QGR motif is in the order of 10⁶ M⁻¹ s⁻¹, which is rate limiting compared to the non-specific association of H-NS to the DNA backbone at a rate of 10⁸ M⁻¹ s⁻¹. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
5. Global analysis of N6-methyladenosine functions and its disease association using deep learning and network-based methods.
- Author
-
Zhang, Song-yao, Zhang, Shao-wu, Fan, Xiao-nan, Meng, Jia, Chen, Yidong, Gao, Shou-Jiang, and Huang, Yufei
- Subjects
PHYSIOLOGICAL effects of adenosine ,DEEP learning ,MESSENGER RNA ,PROTEIN-protein interactions ,CELL proliferation
- Abstract
N6-methyladenosine (m6A) is the most abundant methylation, existing in >25% of human mRNAs. Exciting recent discoveries indicate the close involvement of m6A in regulating many different aspects of mRNA metabolism and diseases like cancer. However, our current knowledge about how m6A levels are controlled and whether and how regulation of m6A levels of a specific gene can play a role in cancer and other diseases is mostly elusive. We propose in this paper a computational scheme for predicting m6A-regulated genes and m6A-associated disease, which includes Deep-m6A, the first model for detecting condition-specific m6A sites from MeRIP-Seq data with a single base resolution using deep learning and Hot-m6A, a new network-based pipeline that prioritizes functional significant m6A genes and its associated diseases using the Protein-Protein Interaction (PPI) and gene-disease heterogeneous networks. We applied Deep-m6A and this pipeline to 75 MeRIP-seq human samples, which produced a compact set of 709 functionally significant m6A-regulated genes and nine functionally enriched subnetworks. The functional enrichment analysis of these genes and networks reveal that m6A targets key genes of many critical biological processes including transcription, cell organization and transport, and cell proliferation and cancer-related pathways such as Wnt pathway. The m6A-associated disease analysis prioritized five significantly associated diseases including leukemia and renal cell carcinoma. These results demonstrate the power of our proposed computational scheme and provide new leads for understanding m6A regulatory functions and its roles in diseases. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
6. SFPEL-LPI: Sequence-based feature projection ensemble learning for predicting LncRNA-protein interactions.
- Author
-
Zhang, Wen, Tang, Guifeng, Huang, Feng, Zhang, Xining, Yue, Xiang, and Wu, Wenjian
- Subjects
RNA-protein interactions ,GENETIC regulation ,RNA interference ,RNA splicing ,ADENYLATION (Biochemistry)
- Abstract
LncRNA-protein interactions play important roles in post-transcriptional gene regulation, poly-adenylation, splicing and translation. Identification of lncRNA-protein interactions helps to understand lncRNA-related activities. Existing computational methods utilize multiple lncRNA features or multiple protein features to predict lncRNA-protein interactions, but features are not available for all lncRNAs or proteins; most of existing methods are not capable of predicting interacting proteins (or lncRNAs) for new lncRNAs (or proteins), which don’t have known interactions. In this paper, we propose the sequence-based feature projection ensemble learning method, “SFPEL-LPI”, to predict lncRNA-protein interactions. First, SFPEL-LPI extracts lncRNA sequence-based features and protein sequence-based features. Second, SFPEL-LPI calculates multiple lncRNA-lncRNA similarities and protein-protein similarities by using lncRNA sequences, protein sequences and known lncRNA-protein interactions. Then, SFPEL-LPI combines multiple similarities and multiple features with a feature projection ensemble learning frame. In computational experiments, SFPEL-LPI accurately predicts lncRNA-protein associations and outperforms other state-of-the-art methods. More importantly, SFPEL-LPI can be applied to new lncRNAs (or proteins). The case studies demonstrate that our method can find out novel lncRNA-protein interactions, which are confirmed by literature. Finally, we construct a user-friendly web server, available at . [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
7. Predicting B cell receptor substitution profiles using public repertoire data.
- Author
-
Dhar, Amrit, Davidsen, Kristian, Matsen IV, Frederick A., and Minin, Vladimir N.
- Subjects
B cell receptors ,AMINO acids ,GENETIC mutation ,CLONING ,GERMINAL centers ,IMMUNOTECHNOLOGY
- Abstract
B cells develop high affinity receptors during the course of affinity maturation, a cyclic process of mutation and selection. At the end of affinity maturation, a number of cells sharing the same ancestor (i.e. in the same “clonal family”) are released from the germinal center; their amino acid frequency profile reflects the allowed and disallowed substitutions at each position. These clonal-family-specific frequency profiles, called “substitution profiles”, are useful for studying the course of affinity maturation as well as for antibody engineering purposes. However, most often only a single sequence is recovered from each clonal family in a sequencing experiment, making it impossible to construct a clonal-family-specific substitution profile. Given the public release of many high-quality large B cell receptor datasets, one may ask whether it is possible to use such data in a prediction model for clonal-family-specific substitution profiles. In this paper, we present the method “Substitution Profiles Using Related Families” (SPURF), a penalized tensor regression framework that integrates information from a rich assemblage of datasets to predict the clonal-family-specific substitution profile for any single input sequence. Using this framework, we show that substitution profiles from similar clonal families can be leveraged together with simulated substitution profiles and germline gene sequence information to improve prediction. We fit this model on a large public dataset and validate the robustness of our approach on two external datasets. Furthermore, we provide a command-line tool in an open-source software package () implementing these ideas and providing easy prediction using our pre-fit models. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
8. SARNAclust: Semi-automatic detection of RNA protein binding motifs from immunoprecipitation data.
- Author
-
Dotu, Ivan, Adamson, Scott I., Coleman, Benjamin, Fournier, Cyril, Ricart-Altimiras, Emma, Eyras, Eduardo, and Chuang, Jeffrey H.
- Subjects
IMMUNOPRECIPITATION ,RNA-binding proteins ,PROTEIN-protein interactions ,NUCLEOTIDE sequence ,RNA splicing
- Abstract
RNA-protein binding is critical to gene regulation, controlling fundamental processes including splicing, translation, localization and stability, and aberrant RNA-protein interactions are known to play a role in a wide variety of diseases. However, molecular understanding of RNA-protein interactions remains limited; in particular, identification of RNA motifs that bind proteins has long been challenging, especially when such motifs depend on both sequence and structure. Moreover, although RNA binding proteins (RBPs) often contain more than one binding domain, algorithms capable of identifying more than one binding motif simultaneously have not been developed. In this paper we present a novel pipeline to determine binding peaks in crosslinking immunoprecipitation (CLIP) data, to discover multiple possible RNA sequence/structure motifs among them, and to experimentally validate such motifs. At the core is a new semi-automatic algorithm SARNAclust, the first unsupervised method to identify and deconvolve multiple sequence/structure motifs simultaneously. SARNAclust computes similarity between sequence/structure objects using a graph kernel, providing the ability to isolate the impact of specific features through the bulge graph formalism. Application of SARNAclust to synthetic data shows its capability of clustering 5 motifs at once with a V-measure value of over 0.95, while GraphClust achieves only a V-measure of 0.083 and RNAcontext cannot detect any of the motifs. When applied to existing eCLIP sets, SARNAclust finds known motifs for SLBP and HNRNPC and novel motifs for several other RBPs such as AGGF1, AKAP8L and ILF3. We demonstrate an experimental validation protocol, a targeted Bind-n-Seq-like high-throughput sequencing approach that relies on RNA inverse folding for oligo pool design, that can validate the components within the SLBP motif. Finally, we use this protocol to experimentally interrogate the SARNAclust motif predictions for protein ILF3. 
Our results support a newly identified partially double-stranded UUUUUGAGA motif similar to that known for the splicing factor HNRNPC. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
9. PCSF: An R-package for network-based interpretation of high-throughput data.
- Author
-
Akhmedov, Murodzhon, Kedaigle, Amanda, Chong, Renan Escalante, Montemanni, Roberto, Bertoni, Francesco, Fraenkel, Ernest, and Kwee, Ivo
- Subjects
BIOINFORMATICS software ,DATA analysis software ,MATHEMATICAL optimization ,COMPUTATIONAL biology ,PROTEIN-protein interactions - Abstract
With the recent technological developments a vast amount of high-throughput data has been profiled to understand the mechanism of complex diseases. The current bioinformatics challenge is to interpret the data and underlying biology, where efficient algorithms for analyzing heterogeneous high-throughput data using biological networks are becoming increasingly valuable. In this paper, we propose a software package based on the Prize-collecting Steiner Forest graph optimization approach. The PCSF package performs fast and user-friendly network analysis of high-throughput data by mapping the data onto a biological networks such as protein-protein interaction, gene-gene interaction or any other correlation or coexpression based networks. Using the interaction networks as a template, it determines high-confidence subnetworks relevant to the data, which potentially leads to predictions of functional units. It also interactively visualizes the resulting subnetwork with functional enrichment analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
10. ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time.
- Author
-
Cai, Yunpeng, Zheng, Wei, Yao, Jin, Yang, Yujie, Mai, Volker, Mao, Qi, and Sun, Yijun
- Subjects
GENOMICS ,QUADRATIC programming ,HUMAN microbiota ,RIBOSOMAL RNA ,BIOACCUMULATION - Abstract
The rapid development of sequencing technology has led to an explosive accumulation of genomic sequence data. Clustering is often the first step to perform in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, it is currently computationally expensive to perform hierarchical clustering of extremely large sequence datasets due to its quadratic time and space complexities. In this paper we developed a new algorithm called ESPRIT-Forest for parallel hierarchical clustering of sequences. The algorithm achieves subquadratic time and space complexity and maintains a high clustering accuracy comparable to the standard method. The basic idea is to organize sequences into a pseudo-metric based partitioning tree for sub-linear time searching of nearest neighbors, and then use a new multiple-pair merging criterion to construct clusters in parallel using multiple threads. The new algorithm was tested on the human microbiome project (HMP) dataset, currently one of the largest published microbial 16S rRNA sequence dataset. Our experiment demonstrated that with the power of parallel computing it is now computationally feasible to perform hierarchical clustering analysis of tens of millions of sequences. The software is available at . [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
11. Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model.
- Author
-
Wang, Sheng, Sun, Siqi, Li, Zhen, Zhang, Renyu, and Xu, Jinbo
- Subjects
PROTEIN structure ,ARTIFICIAL neural networks ,PROTEIN folding ,PAIRED comparisons (Mathematics) ,AMINO acid sequence - Abstract
Motivation: Protein contacts contain key information for the understanding of protein structure and function and thus, contact prediction from sequence is an important problem. Recently exciting progress has been made on this problem, but the predicted contacts for proteins without many sequence homologs is still of low quality and not very useful for de novo structure prediction. Method: This paper presents a new deep learning method that predicts contacts by integrating both evolutionary coupling (EC) and sequence conservation information through an ultra-deep neural network formed by two deep residual neural networks. The first residual network conducts a series of 1-dimensional convolutional transformation of sequential features; the second residual network conducts a series of 2-dimensional convolutional transformation of pairwise information including output of the first residual network, EC information and pairwise potential. By using very deep residual networks, we can accurately model contact occurrence patterns and complex sequence-structure relationship and thus, obtain high-quality contact prediction regardless of how many sequence homologs are available for proteins in question. Results: Our method greatly outperforms existing methods and leads to much more accurate contact-assisted folding. Tested on 105 CASP11 targets, 76 past CAMEO hard targets, and 398 membrane proteins, the average top L long-range prediction accuracy obtained by our method, one representative EC method CCMpred and the CASP11 winner MetaPSICOV is 0.47, 0.21 and 0.30, respectively; the average top L/10 long-range accuracy of our method, CCMpred and MetaPSICOV is 0.77, 0.47 and 0.59, respectively. Ab initio folding using our predicted contacts as restraints but without any force fields can yield correct folds (i.e., TMscore>0.6) for 203 of the 579 test proteins, while that using MetaPSICOV- and CCMpred-predicted contacts can do so for only 79 and 62 of them, respectively. 
Our contact-assisted models also have much better quality than template-based models especially for membrane proteins. The 3D models built from our contact prediction have TMscore>0.5 for 208 of the 398 membrane proteins, while those from homology modeling have TMscore>0.5 for only 10 of them. Further, even if trained mostly by soluble proteins, our deep learning method works very well on membrane proteins. In the recent blind CAMEO benchmark, our fully-automated web server implementing this method successfully folded 6 targets with a new fold and only 0.3L-2.3L effective sequence homologs, including one β protein of 182 residues, one α+β protein of 125 residues, one α protein of 140 residues, one α protein of 217 residues, one α/β of 260 residues and one α protein of 462 residues. Our method also achieved the highest F1 score on free-modeling targets in the latest CASP (Critical Assessment of Structure Prediction), although it was not fully implemented back then. Availability: [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
12. A Graph-Centric Approach for Metagenome-Guided Peptide and Protein Identification in Metaproteomics.
- Author
-
Tang, Haixu, Li, Sujun, and Ye, Yuzhen
- Subjects
PROTEIN expression ,PEPTIDES ,GENES ,MICROBIOLOGICAL chemistry ,METAGENOMICS - Abstract
Metaproteomic studies adopt the common bottom-up proteomics approach to investigate the protein composition and the dynamics of protein expression in microbial communities. When matched metagenomic and/or metatranscriptomic data of the microbial communities are available, metaproteomic data analyses often employ a metagenome-guided approach, in which complete or fragmental protein-coding genes are first directly predicted from metagenomic (and/or metatranscriptomic) sequences or from their assemblies, and the resulting protein sequences are then used as the reference database for peptide/protein identification from MS/MS spectra. This approach is often limited because protein coding genes predicted from metagenomes are incomplete and fragmental. In this paper, we present a graph-centric approach to improving metagenome-guided peptide and protein identification in metaproteomics. Our method exploits the de Bruijn graph structure reported by metagenome assembly algorithms to generate a comprehensive database of protein sequences encoded in the community. We tested our method using several public metaproteomic datasets with matched metagenomic and metatranscriptomic sequencing data acquired from complex microbial communities in a biological wastewater treatment plant. The results showed that many more peptides and proteins can be identified when assembly graphs were utilized, improving the characterization of the proteins expressed in the microbial communities. The additional proteins we identified contribute to the characterization of important pathways such as those involved in degradation of chemical hazards. Our tools are released as open-source software on github at https://github.com/COL-IU/Graph2Pro. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
13. Per-sample immunoglobulin germline inference from B cell receptor deep sequencing data
- Author
-
Duncan Ralph and Frederick A. Matsen
- Subjects
0301 basic medicine ,Physiology ,Inference ,Biochemistry ,Germline ,Database and Informatics Methods ,0302 clinical medicine ,Immune Physiology ,Databases, Genetic ,Medicine and Health Sciences ,Biology (General) ,Data Management ,Genetics ,0303 health sciences ,Immune System Proteins ,Ecology ,Genes, Immunoglobulin ,High-Throughput Nucleotide Sequencing ,Phylogenetic Analysis ,3. Good health ,Phylogenetics ,Computational Theory and Mathematics ,Modeling and Simulation ,Mutation (genetic algorithm) ,Sequence Analysis ,Research Article ,Computer and Information Sciences ,QH301-705.5 ,Bioinformatics ,B-cell receptor ,Immunology ,Sequence Databases ,Receptors, Antigen, B-Cell ,Sequence alignment ,Computational biology ,Biology ,Research and Analysis Methods ,Deep sequencing ,Antibodies ,Set (abstract data type) ,03 medical and health sciences ,Cellular and Molecular Neuroscience ,Sequence Motif Analysis ,Point Mutation ,Humans ,Evolutionary Systematics ,Computer Simulation ,Allele ,Quantitative Biology - Populations and Evolution ,Molecular Biology ,Gene ,Ecology, Evolution, Behavior and Systematics ,Alleles ,030304 developmental biology ,Sequence (medicine) ,Taxonomy ,Evolutionary Biology ,Models, Genetic ,Populations and Evolution (q-bio.PE) ,Models, Immunological ,Biology and Life Sciences ,Proteins ,Computational Biology ,030104 developmental biology ,Biological Databases ,Germ Cells ,Genetic Loci ,FOS: Biological sciences ,Mutation ,Sequence Alignment ,030217 neurology & neurosurgery ,Software ,030215 immunology
- Abstract
The collection of immunoglobulin genes in an individual’s germline, which gives rise to B cell receptors via recombination, is known to vary significantly across individuals. In humans, for example, each individual has only a fraction of the several hundred known V alleles. Furthermore, the currently-accepted set of known V alleles is both incomplete (particularly for non-European samples), and contains a significant number of spurious alleles. The resulting uncertainty as to which immunoglobulin alleles are present in any given sample results in inaccurate B cell receptor sequence annotations, and in particular inaccurate inferred naive ancestors. In this paper we first show that the currently widespread practice of aligning each sequence to its closest match in the full set of IMGT alleles results in a very large number of spurious alleles that are not in the sample’s true set of germline V alleles. We then describe a new method for inferring each individual’s germline gene set from deep sequencing data, and show that it improves upon existing methods by making a detailed comparison on a variety of simulated and real data samples. This new method has been integrated into the partis annotation and clonal family inference package, available at https://github.com/psathyrella/partis, and is run by default without affecting overall run time.
Author summary
Antibodies are an important component of the adaptive immune system, which itself determines our response to both pathogens and vaccines. They are produced by B cells through somatic recombination of germline DNA, which results in a vast diversity of antigen binding affinities across the B cell repertoire. We typically learn about the development of this repertoire, and its history of interaction with antigens, by sequencing large numbers of the DNA sequences from which antibodies are derived. In order to understand such data, it is necessary to determine the combination of germline V, D, and J genes that was rearranged to form each such B cell receptor sequence. This is difficult, however, because the immunoglobulin locus exhibits an extraordinary level of diversity across individuals—encompassing both allelic variation and gene duplication, deletion, and conversion—and because the locus’s large size and repetitive structure make germline sequencing very difficult. In this paper we describe a new computational method that avoids this difficulty by inferring each individual’s set of immunoglobulin germline genes directly from expressed B cell receptor sequence data.
- Published
- 2019
14. A Graph-Centric Approach for Metagenome-Guided Peptide and Protein Identification in Metaproteomics
- Author
-
Haixu Tang, Sujun Li, and Yuzhen Ye
- Subjects
0301 basic medicine ,Proteomics ,Peptide ,Plant Science ,Biochemistry ,De Bruijn graph ,Database and Informatics Methods ,Tandem Mass Spectrometry ,Database Searching ,Photosynthesis ,lcsh:QH301-705.5 ,chemistry.chemical_classification ,Ecology ,Plant Biochemistry ,Microbiota ,Genomics ,6. Clean water ,Computational Theory and Mathematics ,Modeling and Simulation ,symbols ,Sequence Analysis ,Algorithms ,Research Article ,Gene prediction ,Sequence Databases ,Computational biology ,Biology ,Research and Analysis Methods ,03 medical and health sciences ,Cellular and Molecular Neuroscience ,symbols.namesake ,Genetics ,Ribulose-1,5-Bisphosphate Carboxylase Oxygenase ,Humans ,Molecular Biology Techniques ,Sequencing Techniques ,Sequence Similarity Searching ,Gene Prediction ,Gene ,Molecular Biology ,Ecology, Evolution, Behavior and Systematics ,Sequence Assembly Tools ,Biology and Life Sciences ,Computational Biology ,Proteins ,Genome Analysis ,030104 developmental biology ,Biological Databases ,lcsh:Biology (General) ,chemistry ,Metagenomics ,Metaproteomics ,Protein identification ,Peptides
- Abstract
Metaproteomic studies adopt the common bottom-up proteomics approach to investigate the protein composition and the dynamics of protein expression in microbial communities. When matched metagenomic and/or metatranscriptomic data of the microbial communities are available, metaproteomic data analyses often employ a metagenome-guided approach, in which complete or fragmental protein-coding genes are first directly predicted from metagenomic (and/or metatranscriptomic) sequences or from their assemblies, and the resulting protein sequences are then used as the reference database for peptide/protein identification from MS/MS spectra. This approach is often limited because protein coding genes predicted from metagenomes are incomplete and fragmental. In this paper, we present a graph-centric approach to improving metagenome-guided peptide and protein identification in metaproteomics. Our method exploits the de Bruijn graph structure reported by metagenome assembly algorithms to generate a comprehensive database of protein sequences encoded in the community. We tested our method using several public metaproteomic datasets with matched metagenomic and metatranscriptomic sequencing data acquired from complex microbial communities in a biological wastewater treatment plant. The results showed that many more peptides and proteins can be identified when assembly graphs were utilized, improving the characterization of the proteins expressed in the microbial communities. The additional proteins we identified contribute to the characterization of important pathways such as those involved in degradation of chemical hazards. Our tools are released as open-source software on github at https://github.com/COL-IU/Graph2Pro.
Author Summary
In recent years, meta-omic (including metatranscriptomic and metaproteomic) techniques have been adopted as complementary approaches to metagenomic sequencing to study functional characteristics and dynamics of microbial communities, aiming at a holistic understanding of a community to respond to the changes in the environment. Currently, metaproteomic data are largely analyzed using the bioinformatics tools originally designed in bottom-up proteomics. In particular, recent metaproteomic studies employed a metagenome-guided approach, in which complete or fragmental protein-coding genes were first predicted from metagenomic sequences (i.e., contigs or scaffolds), acquired from the matched community samples, and predicted protein sequences were then used in peptide identification. A key challenge of this approach is that the protein coding genes predicted from assembled metagenomic contigs can be incomplete and fragmented due to the complexity of metagenomic samples and the short reads length in metagenomic sequencing. To address this issue, in this paper, we present a graph-centric approach that exploits the de bruijn graph structure reported by metagenome assembly algorithms to improve metagenome-guided peptide and protein identification in metaproteomics. We show that our method can identify much more peptides and proteins, improving the characterization of the proteins expressed in the microbial communities.
- Published
- 2016
15. A scale-free analysis of the HIV-1 genome demonstrates multiple conserved regions of structural and functional importance.
- Author
-
Skittrall, Jordan P., Ingemarsdotter, Carin K., Gog, Julia R., and Lever, Andrew M. L.
- Subjects
GENOMES ,NUCLEIC acids ,DNA synthesis ,SEQUENCE alignment ,COMPUTATIONAL biology ,NUCLEOTIDE sequence ,GENETIC code - Abstract
HIV-1 replicates via a low-fidelity polymerase with a high mutation rate; strong conservation of individual nucleotides is highly indicative of the presence of critical structural or functional properties. Identifying such conservation can reveal novel insights into viral behaviour. We analysed 3651 publicly available sequences for the presence of nucleic acid conservation beyond that required by amino acid constraints, using a novel scale-free method that identifies regions of outlying score together with a codon scoring algorithm. Sequences with outlying score were further analysed using an algorithm for producing local RNA folds whilst accounting for alignment properties. 11 different conserved regions were identified, some corresponding to well-known cis-acting functions of the HIV-1 genome but also others whose conservation has not previously been noted. We identify rational causes for many of these, including cis functions, possible additional reading frame usage, a plausible mechanism by which the central polypurine tract primes second-strand DNA synthesis and a conformational stabilising function of a region at the 5′ end of env. [ABSTRACT FROM AUTHOR]
- Published
- 2019
16. The impact of DNA methylation on the cancer proteome.
- Author
-
Magzoub, Majed Mohamed, Prunello, Marcos, Brennan, Kevin, and Gevaert, Olivier
- Subjects
DNA methylation ,GENE expression ,CANCER genes ,TUMOR markers ,CANCER ,CYTOLOGY - Abstract
Aberrant DNA methylation disrupts normal gene expression in cancer and broadly contributes to oncogenesis. We previously developed MethylMix, a model-based algorithmic approach to identify epigenetically regulated driver genes. MethylMix identifies genes where methylation likely executes a functional role by using transcriptomic data to select only methylation events that can be linked to changes in gene expression. However, given that proteins more closely link genotype to phenotype, recent high-throughput proteomic data provides an opportunity to more accurately identify functionally relevant abnormal methylation events. Here we present a MethylMix analysis that refines nominations for epigenetic driver genes by leveraging quantitative high-throughput proteomic data to select only genes where DNA methylation is predictive of protein abundance. Applying our algorithm across three cancer cohorts, we find that using protein abundance data narrows candidate nominations, where the effect of DNA methylation is often buffered at the protein level. Next, we find that MethylMix genes predictive of protein abundance are enriched for biological processes involved in cancer, including functions involved in epithelial and mesenchymal transition. Moreover, our results are also enriched for tumor markers that are predictive of clinical features like tumor stage, and we find that clustering using MethylMix genes predictive of protein abundance captures cancer subtypes. [ABSTRACT FROM AUTHOR]
- Published
- 2019
17. DART-ID increases single-cell proteome coverage.
- Author
-
Chen, Albert Tian, Franks, Alexander, and Slavov, Nikolai
- Subjects
TANDEM mass spectrometry ,MONOCYTES ,RF values (Chromatography) ,LEUCOCYTES ,STATISTICAL power analysis ,LIQUID chromatography - Abstract
Analysis by liquid chromatography and tandem mass spectrometry (LC-MS/MS) can identify and quantify thousands of proteins in microgram-level samples, such as those comprised of thousands of cells. This process, however, remains challenging for smaller samples, such as the proteomes of single mammalian cells, because reduced protein levels reduce the number of confidently sequenced peptides. To alleviate this reduction, we developed Data-driven Alignment of Retention Times for IDentification (DART-ID). DART-ID implements principled Bayesian frameworks for global retention time (RT) alignment and for incorporating RT estimates towards improved confidence estimates of peptide-spectrum-matches. When applied to bulk or to single-cell samples, DART-ID increased the number of data points by 30–50% at 1% FDR, and thus decreased missing data. Benchmarks indicate excellent quantification of peptides upgraded by DART-ID and support their utility for quantitative analysis, such as identifying cell types and cell-type specific proteins. The additional datapoints provided by DART-ID boost the statistical power and double the number of proteins identified as differentially abundant in monocytes and T-cells. DART-ID can be applied to diverse experimental designs and is freely available at . [ABSTRACT FROM AUTHOR]
- Published
- 2019
18. Pathogenicity and functional impact of non-frameshifting insertion/deletion variation in the human genome.
- Author
-
Pagel, Kymberleigh A., Antaki, Danny, Lian, AoJie, Mort, Matthew, Cooper, David N., Sebat, Jonathan, Iakoucheva, Lilia M., Mooney, Sean D., and Radivojac, Predrag
- Subjects
HUMAN genome ,AUTISM spectrum disorders ,MICROBIAL virulence ,RECURRENT neural networks ,POST-translational modification ,PHYSICAL sciences - Abstract
Differentiation between phenotypically neutral and disease-causing genetic variation remains an open and relevant problem. Among different types of variation, non-frameshifting insertions and deletions (indels) represent an understudied group with widespread phenotypic consequences. To address this challenge, we present a machine learning method, MutPred-Indel, that predicts pathogenicity and identifies types of functional residues impacted by non-frameshifting insertion/deletion variation. The model shows good predictive performance as well as the ability to identify impacted structural and functional residues including secondary structure, intrinsic disorder, metal and macromolecular binding, post-translational modifications, allosteric sites, and catalytic residues. We identify structural and functional mechanisms impacted preferentially by germline variation from the Human Gene Mutation Database, recurrent somatic variation from COSMIC in the context of different cancers, as well as de novo variants from families with autism spectrum disorder. Further, the distributions of pathogenicity prediction scores generated by MutPred-Indel are shown to differentiate highly recurrent from non-recurrent somatic variation. Collectively, we present a framework to facilitate the interrogation of both pathogenicity and the functional effects of non-frameshifting insertion/deletion variants. The MutPred-Indel webserver is available at . [ABSTRACT FROM AUTHOR]
- Published
- 2019
19. DeepConv-DTI: Prediction of drug-target interactions via deep learning with convolution on protein sequences.
- Author
-
Lee, Ingoo, Keum, Jongsoo, and Nam, Hojung
- Subjects
AMINO acid sequence ,DEEP learning ,MATHEMATICAL convolutions ,CARRIER proteins ,BINDING sites ,PROTEIN models - Abstract
Identification of drug-target interactions (DTIs) plays a key role in drug discovery. The high cost and labor-intensive nature of in vitro and in vivo experiments have highlighted the importance of in silico-based DTI prediction approaches. In several computational models, conventional protein descriptors have been shown to not be sufficiently informative to predict accurate DTIs. Thus, in this study, we propose a deep learning-based DTI prediction model capturing local residue patterns of proteins participating in DTIs. When we employ a convolutional neural network (CNN) on raw protein sequences, we perform convolution on various lengths of amino acid subsequences to capture local residue patterns of generalized protein classes. We train our model with large-scale DTI information and demonstrate the performance of the proposed model using an independent dataset that is not seen during the training phase. As a result, our model performs better than previous protein descriptor-based models. Also, our model performs better than the recently developed deep learning models for massive prediction of DTIs. By examining pooled convolution results, we confirmed that our model can detect binding sites of proteins for DTIs. In conclusion, our prediction model for detecting local residue patterns of target proteins successfully enriches the protein features of a raw protein sequence, yielding better prediction results than previous approaches. Our code is available at . [ABSTRACT FROM AUTHOR]
- Published
- 2019
20. Conformational coupling by trans-phosphorylation in calcium calmodulin dependent kinase II.
- Author
-
Pandini, Alessandro, Schulman, Howard, and Khan, Shahid
- Subjects
PHOSPHORYLATION ,PROTEIN kinases ,CALCIUM ,CALMODULIN ,MOLECULAR dynamics ,GENETIC mutation - Abstract
The calcium calmodulin-dependent protein kinase II (CaMKII) is a dodecameric holoenzyme important for encoding memory. Its activation, triggered by binding of calcium-calmodulin, persists autonomously after calmodulin dissociation. One (receiver) kinase captures and subsequently phosphorylates the regulatory domain peptide of a donor kinase, forming a chained dimer as the first stage of autonomous activation. Protein dynamics simulations examined the conformational changes triggered by dimer formation and phosphorylation, aiming to provide a molecular rationale for human mutations that result in learning disabilities. Ensembles generated from X-ray crystal structures were characterized by network centrality and community analysis. Mutual information related collective motions to local fragment dynamics encoded with a structural alphabet. Implicit solvent tCONCOORD conformational ensembles revealed that the dynamic architecture of inactive kinase domains was co-opted in the activated dimer, but the network hub shifted from the nucleotide binding cleft to the captured peptide. Explicit solvent molecular dynamics (MD) showed that nucleotide and substrate binding determinants formed coupled nodes in long-range signal relays between regulatory peptides in the dimer. Strain in the extended captured peptide was balanced by reduced flexibility of the receiver kinase C-lobe core. The relays were organized around a hydrophobic patch between the captured peptide and a key binding helix. The human mutations aligned along the relays. Thus, these mutations could disrupt the allosteric network as an alternative to, or in addition to, altering binding affinities. Non-binding protein sectors distant from the binding sites mediated the allosteric signalling, providing possible targets for inhibitor design. Phosphorylation of the peptide modulated the dielectric of its binding pocket to strengthen the patch, non-binding sectors, domain interface and temporal correlations between parallel relays. 
These results provide the molecular details underlying the reported positive kinase cooperativity to enrich the discussion on how autonomous activation by phosphorylation leads to long-term behavioural effects. [ABSTRACT FROM AUTHOR]
- Published
- 2019
21. Comment on "A comprehensive overview and evaluation of circular RNA detection tools".
- Author
-
Chen, Chia-Ying and Chuang, Trees-Juen
- Subjects
CIRCULAR RNA ,NON-coding RNA ,COMPUTATIONAL biology - Abstract
A review of the article "A comprehensive overview and evaluation of circular RNA detection tools" which appeared in a previous issue of the periodical "PLOS Computational Biology" is presented.
- Published
- 2019
22. The intrinsic dimension of protein sequence evolution.
- Author
-
Facco, Elena, Pagnani, Andrea, Russo, Elena Tea, and Laio, Alessandro
- Subjects
AMINO acid sequence ,PROTEIN structure ,GENETIC mutation ,STATISTICAL correlation ,MAXIMUM entropy method ,MOLECULAR phylogeny - Abstract
It is well known that, in order to preserve its structure and function, a protein cannot change its sequence at random, but only by mutations occurring preferentially at specific locations. We here investigate quantitatively the amount of variability that is allowed in protein sequence evolution, by computing the intrinsic dimension (ID) of the sequences belonging to a selection of protein families. The ID is a measure of the number of independent directions that evolution can take starting from a given sequence. We find that the ID is practically constant for sequences belonging to the same family, and moreover it is very similar in different families, with values ranging between 6 and 12. These values are significantly smaller than the raw number of amino acids, confirming the importance of correlations between mutations in different sites. However, we demonstrate that correlations are not sufficient to explain the small value of the ID we observe in protein families. Indeed, we show that the ID of a set of protein sequences generated by maximum entropy models, an approach in which correlations are accounted for, is typically significantly larger than the value observed in natural protein families. We further prove that a critical factor to reproduce the natural ID is to take into consideration the phylogeny of sequences. [ABSTRACT FROM AUTHOR]
- Published
- 2019
23. Genotype-phenotype relations of the von Hippel-Lindau tumor suppressor inferred from a large-scale analysis of disease mutations and interactors.
- Author
-
Minervini, Giovanni, Quaglia, Federica, Tabaro, Francesco, and Tosatto, Silvio C. E.
- Subjects
VON Hippel-Lindau disease ,GENOTYPES ,PHENOTYPES ,TUMOR suppressor genes ,GENETIC mutation - Abstract
Familial cancers represent a privileged point of view for studying the complex cellular events inducing tumor transformation. Von Hippel-Lindau syndrome, a familial predisposition to develop cancer, is a clear example. Here, we present our efforts to decipher the role of von Hippel-Lindau tumor suppressor protein (pVHL) in cancer onset. We collected high quality information about both pVHL mutations and interactors to investigate the association between patient phenotypes, mutated protein surface and impaired interactions. Our data suggest that different phenotypes correlate with localized perturbations of the pVHL structure, with specific cell functions associated with different protein surfaces. We propose five different pVHL interfaces to be selectively involved in modulating proteins regulating gene expression, protein homeostasis as well as to address extracellular matrix (ECM) and ciliogenesis associated functions. These data were used to drive molecular docking of pVHL with its interactors and guide Petri net simulations of the most promising alterations. We predict that disruption of pVHL association with certain interactors can trigger tumor transformation, inducing metabolism imbalance and ECM remodeling. Taken together, our findings provide novel insights into VHL-associated tumorigenesis. This highly integrated in silico approach may help elucidate novel treatment paradigms for VHL disease. [ABSTRACT FROM AUTHOR]
- Published
- 2019
24. Prediction of VRC01 neutralization sensitivity by HIV-1 gp160 sequence features.
- Author
-
Magaret, Craig A., Benkeser, David C., Williamson, Brian D., Borate, Bhavesh R., Carpp, Lindsay N., Georgiev, Ivelin S., Setliff, Ian, Dingens, Adam S., Simon, Noah, Carone, Marco, Simpkins, Christopher, Montefiori, David, Alter, Galit, Yu, Wen-Han, Juraska, Michal, Edlefsen, Paul T., Karuna, Shelly, Mgodi, Nyaradzo M., Edugupanti, Srilatha, and Gilbert, Peter B.
- Subjects
BIOLOGICAL databases ,HIV-positive persons ,AMINO acid analysis ,GLYCOSYLATION ,TITERS - Abstract
The broadly neutralizing antibody (bnAb) VRC01 is being evaluated for its efficacy to prevent HIV-1 infection in the Antibody Mediated Prevention (AMP) trials. A secondary objective of AMP utilizes sieve analysis to investigate how VRC01 prevention efficacy (PE) varies with HIV-1 envelope (Env) amino acid (AA) sequence features. An exhaustive analysis that tests how PE depends on every AA feature with sufficient variation would have low statistical power. To design an adequately powered primary sieve analysis for AMP, we modeled VRC01 neutralization as a function of Env AA sequence features of 611 HIV-1 gp160 pseudoviruses from the CATNAP database, with objectives: (1) to develop models that best predict the neutralization readouts; and (2) to rank AA features by their predictive importance with classification and regression methods. The dataset was split in half, and machine learning algorithms were applied to each half, each analyzed separately using cross-validation and hold-out validation. We selected Super Learner, a nonparametric ensemble-based cross-validated learning method, for advancement to the primary sieve analysis. This method predicted the dichotomous resistance outcome of whether the IC50 neutralization titer of VRC01 for a given Env pseudovirus is right-censored (indicating resistance) with an average validated AUC of 0.868 across the two hold-out datasets. Quantitative log IC50 was predicted with an average validated R² of 0.355. Features predicting neutralization sensitivity or resistance included 26 surface-accessible residues in the VRC01 and CD4 binding footprints, the length of gp120, the length of Env, the number of cysteines in gp120, the number of cysteines in Env, and 4 potential N-linked glycosylation sites; the top features will be advanced to the primary sieve analysis. This modeling framework may also inform the study of VRC01 in the treatment of HIV-infected persons. [ABSTRACT FROM AUTHOR]
- Published
- 2019
25. ChIPulate: A comprehensive ChIP-seq simulation pipeline.
- Author
-
Datta, Vishaka, Hannenhalli, Sridhar, and Siddharthan, Rahul
- Subjects
CHROMATIN ,IMMUNOPRECIPITATION ,DNA-binding proteins ,GENOMICS ,TRANSCRIPTION factors - Abstract
ChIP-seq (Chromatin Immunoprecipitation followed by sequencing) is a high-throughput technique to identify genomic regions that are bound in vivo by a particular protein, e.g., a transcription factor (TF). Biological factors, such as chromatin state, indirect and cooperative binding, as well as experimental factors, such as antibody quality, cross-linking, and PCR biases, are known to affect the outcome of ChIP-seq experiments. However, the relative impact of these factors on inferences made from ChIP-seq data is not entirely clear. Here, via a detailed ChIP-seq simulation pipeline, ChIPulate, we assess the impact of various biological and experimental sources of variation on several outcomes of a ChIP-seq experiment, viz., the recoverability of the TF binding motif, accuracy of TF-DNA binding detection, the sensitivity of inferred TF-DNA binding strength, and number of replicates needed to confidently infer binding strength. We find that the TF motif can be recovered despite poor and non-uniform extraction and PCR amplification efficiencies. The recovery of the motif is, however, affected to a larger extent by the fraction of sites that are either cooperatively or indirectly bound. Importantly, our simulations reveal that the number of ChIP-seq replicates needed to accurately measure in vivo occupancy at high-affinity sites is larger than the recommended community standards. Our results establish statistical limits on the accuracy of inferences of protein-DNA binding from ChIP-seq and suggest that increasing the mean extraction efficiency, rather than amplification efficiency, would better improve sensitivity. The source code and instructions for running ChIPulate can be found at . [ABSTRACT FROM AUTHOR]
- Published
- 2019
26. Genesis of the αβ T-cell receptor.
- Author
-
Dupic, Thomas, Marcou, Quentin, Walczak, Aleksandra M., and Mora, Thierry
- Subjects
T cell receptors ,CELLULAR signal transduction ,IMMUNE response ,SEQUENCE analysis ,QUANTITATIVE research ,MAJOR histocompatibility complex - Abstract
The T-cell receptor (TCR) repertoire relies on the diversity of receptors composed of two chains, called α and β, to recognize pathogens. Using results of high throughput sequencing and computational chain-pairing experiments of human TCR repertoires, we quantitatively characterize the αβ generation process. We estimate the probabilities of a rescue recombination of the β chain on the second chromosome upon failure or success on the first chromosome. Unlike β chains, α chains recombine simultaneously on both chromosomes, resulting in correlated statistics of the two genes which we predict using a mechanistic model. We find that ∼35% of cells express both α chains. Altogether, our statistical analysis gives a complete quantitative mechanistic picture that results in the observed correlations in the generative process. We learn that the probability to generate any TCRαβ is lower than 10^−12 and estimate the generation diversity and sharing properties of the αβ TCR repertoire. [ABSTRACT FROM AUTHOR]
- Published
- 2019
27. CoPhosK: A method for comprehensive kinase substrate annotation using co-phosphorylation analysis.
- Author
-
Ayati, Marzieh, Wiredja, Danica, Schlatzer, Daniela, Maxwell, Sean, Li, Ming, Koyutürk, Mehmet, and Chance, Mark R.
- Subjects
KINASES ,MASS spectrometry ,PHOSPHORYLATION ,CANCER ,PHOSPHOPEPTIDES - Abstract
We present CoPhosK to predict kinase-substrate associations for phosphopeptide substrates detected by mass spectrometry (MS). The tool utilizes a Naïve Bayes framework with priors of known kinase-substrate associations (KSAs) to generate its predictions. Through the mining of MS data for the collective dynamic signatures of the kinases’ substrates revealed by correlation analysis of phosphopeptide intensity data, the tool infers KSAs in the data for the considerable body of substrates lacking such annotations. We benchmarked the tool against existing approaches for predicting KSAs that rely on static information (e.g. sequences, structures and interactions) using publicly available MS data, including breast, colon, and ovarian cancer models. The benchmarking reveals that co-phosphorylation analysis can significantly improve prediction performance when static information is available (about 35% of sites) while providing reliable predictions for the remainder, thus tripling the KSAs available from the experimental MS data and providing a comprehensive and reliable characterization of the landscape of kinase-substrate interactions well beyond current limitations. [ABSTRACT FROM AUTHOR]
- Published
- 2019
28. 16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses.
- Author
-
Woloszynek, Stephen, Zhao, Zhengqiao, Chen, Jian, and Rosen, Gail L.
- Subjects
RIBOSOMAL RNA ,NUCLEOTIDE sequence ,HUMAN microbiota ,MACHINE learning ,FEATURE extraction ,TAXONOMY - Abstract
Advances in high-throughput sequencing have increased the availability of microbiome sequencing data that can be exploited to characterize microbiome community structure in situ. We explore using word and sentence embedding approaches for nucleotide sequences since they may be a suitable numerical representation for downstream machine learning applications (especially deep learning). This work involves first encoding (“embedding”) each sequence into a dense, low-dimensional, numeric vector space. Here, we use Skip-Gram word2vec to embed k-mers, obtained from 16S rRNA amplicon surveys, and then leverage an existing sentence embedding technique to embed all sequences belonging to specific body sites or samples. We demonstrate that these representations are meaningful, and hence the embedding space can be exploited as a form of feature extraction for exploratory analysis. We show that sequence embeddings preserve relevant information about the sequencing data such as k-mer context, sequence taxonomy, and sample class. Specifically, the sequence embedding space resolved differences among phyla, as well as differences among genera within the same family. Distances between sequence embeddings had similar qualities to distances between alignment identities, and embedding multiple sequences can be thought of as generating a consensus sequence. In addition, embeddings are versatile features that can be used for many downstream tasks, such as taxonomic and sample classification. Using sample embeddings for body site classification resulted in negligible performance loss compared to using OTU abundance data, and clustering embeddings yielded high fidelity species clusters. Lastly, the k-mer embedding space captured distinct k-mer profiles that mapped to specific regions of the 16S rRNA gene and corresponded with particular body sites. 
Together, our results show that embedding sequences results in meaningful representations that can be used for exploratory analyses or for downstream machine learning applications that require numeric data. Moreover, because the embeddings are trained in an unsupervised manner, unlabeled data can be embedded and used to bolster supervised machine learning tasks. [ABSTRACT FROM AUTHOR]
- Published
- 2019
29. Identifying individual risk rare variants using protein structure guided local tests (POINT).
- Author
-
Marceau West, Rachel, Lu, Wenbin, Rotroff, Daniel M., Kuenemann, Melaine A., Chang, Sheng-Mao, Wu, Michael C., Wagner, Michael J., Buse, John B., Motsinger-Reif, Alison A., Fourches, Denis, and Tzeng, Jung-Ying
- Subjects
PROTEIN structure ,PROTEOMICS ,OPERATOR theory ,QUANTITATIVE research ,KERNEL functions ,BIOLOGICAL databases - Abstract
Rare variants are of increasing interest to genetic association studies because of their etiological contributions to human complex diseases. Due to the rarity of the mutant events, rare variants are routinely analyzed on an aggregate level. While aggregation analyses improve the detection of global-level signal, they are not able to pinpoint causal variants within a variant set. To perform inference on a localized level, additional information, e.g., biological annotation, is often needed to boost the information content of a rare variant. Following the observation that important variants are likely to cluster together on functional domains, we propose a protein structure guided local test (POINT) to provide variant-specific association information using structure-guided aggregation of signal. Constructed under a kernel machine framework, POINT performs local association testing by borrowing information from neighboring variants in the 3-dimensional protein space in a data-adaptive fashion. Besides merely providing a list of promising variants, POINT assigns each variant a p-value to permit variant ranking and prioritization. We assess the selection performance of POINT using simulations and illustrate how it can be used to prioritize individual rare variants in PCSK9, ANGPTL4 and CETP in the Action to Control Cardiovascular Risk in Diabetes (ACCORD) clinical trial data. [ABSTRACT FROM AUTHOR]
- Published
- 2019
30. Graph Peak Caller: Calling ChIP-seq peaks on graph-based reference genomes.
- Author
-
Grytten, Ivar, Rand, Knut D., Nederbragt, Alexander J., Storvik, Geir O., Glad, Ingrid K., and Sandve, Geir K.
- Subjects
REPRESENTATIONS of graphs ,GENOMES ,PLANT genomes ,PLANT biotechnology ,COMPUTATIONAL biology ,SEQUENCE analysis - Abstract
Graph-based representations are considered to be the future for reference genomes, as they allow integrated representation of the steadily increasing data on individual variation. Currently available tools allow de novo assembly of graph-based reference genomes, alignment of new read sets to the graph representation as well as certain analyses like variant calling and haplotyping. We here present a first method for calling ChIP-Seq peaks on read data aligned to a graph-based reference genome. The method is a graph generalization of the peak caller MACS2, and is implemented in an open source tool, Graph Peak Caller. By using the existing tool vg to build a pan-genome of Arabidopsis thaliana, we validate our approach by showing that Graph Peak Caller with a pan-genome reference graph can trace variants within peaks that are not part of the linear reference genome, and find peaks that in general are more motif-enriched than those found by MACS2. [ABSTRACT FROM AUTHOR]
- Published
- 2019
31. Statistical investigations of protein residue direct couplings.
- Author
-
Neuwald, Andrew F. and Altschul, Stephen F.
- Subjects
PROTEINS ,SEQUENCE alignment ,STATISTICAL significance ,GUANOSINE triphosphatase ,BIOCHEMISTRY - Abstract
Protein Direct Coupling Analysis (DCA), which predicts residue-residue contacts based on covarying positions within a multiple sequence alignment, has been remarkably effective. This suggests that there is more to learn from sequence correlations than is generally assumed, and calls for deeper investigations into DCA and perhaps into other types of correlations. Here we describe an approach that enables such investigations by measuring, as an estimated p-value, the statistical significance of the association between residue-residue covariance and structural interactions, either internal or homodimeric. Its application to thirty protein superfamilies confirms that direct coupling (DC) scores correlate with 3D pairwise contacts with very high significance. This method also permits quantitative assessment of the relative performance of alternative DCA methods, and of the degree to which they detect direct versus indirect couplings. We illustrate its use to assess, for a given protein, the biological relevance of alternative conformational states, to investigate the possible mechanistic implications of differences between these states, and to characterize subtle aspects of direct couplings. Our analysis indicates that direct pairwise correlations may be largely distinct from correlated patterns associated with functional specialization, and that the joint analysis of both types of correlations can yield greater power. Data, programs, and source code are freely available at . [ABSTRACT FROM AUTHOR]
- Published
- 2018
32. Coevolving residues inform protein dynamics profiles and disease susceptibility of nSNVs.
- Author
-
Butler, Brandon M., Kazan, I. Can, Kumar, Avishek, and Ozkan, S. Banu
- Subjects
GENETICS of disease susceptibility ,PROTEIN conformation ,GENETIC mutation ,PROTEIN structure ,HUMAN genetic variation - Abstract
The conformational dynamics of proteins are rarely used in methods for predicting the impact of genetic mutations, due to the paucity of three-dimensional protein structures as compared to the vast number of available sequences. Until now a three-dimensional (3D) structure has been required to predict the conformational dynamics of a protein. We introduce an approach that estimates the conformational dynamics of a protein without relying on structural information. This de novo approach utilizes coevolving residues identified from a multiple sequence alignment (MSA) using Potts models. These coevolving residues are used as contacts in a Gaussian network model (GNM) to obtain protein dynamics. B-factors calculated using sequence-based GNM (Seq-GNM) are in agreement with crystallographic B-factors as well as theoretical B-factors from the original GNM that utilizes the 3D structure. Moreover, we demonstrate the ability of the calculated B-factors from the Seq-GNM approach to discriminate genomic variants according to their phenotypes for a wide range of proteins. These results suggest that protein dynamics can be approximated based on sequence information alone, making it possible to assess the phenotypes of nSNVs in cases where a 3D structure is unknown. We hope this work will promote the use of dynamics information in genetic disease prediction at scale by circumventing the need for 3D structures. [ABSTRACT FROM AUTHOR]
- Published
- 2018
33. Deepbinner: Demultiplexing barcoded Oxford Nanopore reads with deep convolutional neural networks.
- Author
-
Wick, Ryan R., Judd, Louise M., and Holt, Kathryn E.
- Subjects
DEMULTIPLEXING, GENOMES, GENETIC barcoding, NANOPORES, ARTIFICIAL neural networks, DNA
- Abstract
Multiplexing, the simultaneous sequencing of multiple barcoded DNA samples on a single flow cell, has made Oxford Nanopore sequencing cost-effective for small genomes. However, it depends on the ability to sort the resulting sequencing reads by barcode, and current demultiplexing tools fail to classify many reads. Here we present Deepbinner, a tool for Oxford Nanopore demultiplexing that uses a deep neural network to classify reads based on the raw electrical read signal. This ‘signal-space’ approach allows for greater accuracy than existing ‘base-space’ tools (Albacore and Porechop) for which signals must first be converted to DNA base calls, itself a complex problem that can introduce noise into the barcode sequence. To assess Deepbinner and existing tools, we performed multiplex sequencing on 12 amplicons chosen for their distinguishability. This allowed us to establish a ground truth classification for each read based on internal sequence alone. Deepbinner had the lowest rate of unclassified reads (7.8%) and the highest demultiplexing precision (98.5% of classified reads were correctly assigned). It can be used alone (to maximise the number of classified reads) or in conjunction with other demultiplexers (to maximise precision and minimise false positive classifications). We also found cross-sample chimeric reads (0.3%) and evidence of barcode switching (0.3%) in our dataset, which likely arise during library preparation and may be detrimental for quantitative studies that use multiplexing. Deepbinner is open source (GPLv3) and available at . [ABSTRACT FROM AUTHOR]
- Published
- 2018
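The two headline figures in the Deepbinner abstract (rate of unclassified reads, and precision among classified reads) can be illustrated with a minimal sketch. This is not Deepbinner's code: the function name, the use of `None` for an unclassified read, and the toy barcode labels are all assumptions made for illustration.

```python
def demultiplexing_summary(truth, predicted):
    """Return (unclassified_rate, precision) for a set of reads.

    truth     -- ground-truth barcode label for each read
    predicted -- label assigned by a demultiplexer, or None if unclassified
    (Hypothetical helper; the None convention is an assumption.)
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    total = len(truth)
    # Keep only reads that the demultiplexer actually assigned to a barcode.
    classified = [(t, p) for t, p in zip(truth, predicted) if p is not None]
    correct = sum(1 for t, p in classified if t == p)
    unclassified_rate = (total - len(classified)) / total if total else 0.0
    # Precision: share of classified reads assigned to the right barcode.
    precision = correct / len(classified) if classified else 0.0
    return unclassified_rate, precision


# Toy data: four reads, one left unclassified, one misassigned.
toy_truth = ["BC01", "BC02", "BC03", "BC01"]
toy_predicted = ["BC01", "BC02", None, "BC02"]
toy_unclassified, toy_precision = demultiplexing_summary(toy_truth, toy_predicted)
```

Separating the two numbers matters because a tool can raise its precision simply by leaving more reads unclassified, which is the trade-off the abstract describes when combining demultiplexers.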
34. Inferring interaction partners from protein sequences using mutual information.
- Author
-
Bitbol, Anne-Florence
- Subjects
PROTEIN-protein interactions, CELLULAR signal transduction, COEVOLUTION, FAMILIES, AMINO acid sequence
- Abstract
Functional protein-protein interactions are crucial in most cellular processes. They enable multi-protein complexes to assemble and to remain stable, and they allow signal transduction in various pathways. Functional interactions between proteins result in coevolution between the interacting partners, and thus in correlations between their sequences. Pairwise maximum-entropy based models have enabled successful inference of pairs of amino-acid residues that are in contact in the three-dimensional structure of multi-protein complexes, starting from the correlations in the sequence data of known interaction partners. Recently, algorithms inspired by these methods have been developed to identify which proteins are functional interaction partners among the paralogous proteins of two families, starting from sequence data alone. Here, we demonstrate that a slightly higher performance for partner identification can be reached by an approximate maximization of the mutual information between the sequence alignments of the two protein families. Our mutual information-based method also provides signatures of the existence of interactions between protein families. These results stand in contrast with structure prediction of proteins and of multi-protein complexes from sequence data, where pairwise maximum-entropy based global statistical models substantially improve performance compared to mutual information. Our findings entail that the statistical dependences allowing interaction partner prediction from sequence data are not restricted to the residue pairs that are in direct contact at the interface between the partner proteins. [ABSTRACT FROM AUTHOR]
- Published
- 2018
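The quantity at the centre of the abstract above, mutual information between alignment positions, can be sketched as follows. This is only the textbook estimator for two paired columns, not the author's partner-matching procedure; the function name and the toy columns are assumptions, and no pseudocounts or sequence reweighting are applied.

```python
from collections import Counter
from math import log2


def mutual_information(column_a, column_b):
    """Mutual information (in bits) between two paired alignment columns.

    column_a[i] and column_b[i] are the residues found at one position of
    family A and one position of family B in the i-th paired sequence.
    (Hypothetical helper using plain empirical frequencies.)
    """
    if len(column_a) != len(column_b):
        raise ValueError("columns must contain the same number of sequences")
    n = len(column_a)
    if n == 0:
        return 0.0
    joint = Counter(zip(column_a, column_b))
    marginal_a = Counter(column_a)
    marginal_b = Counter(column_b)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # Compare the observed joint frequency with the product of marginals.
        mi += p_ab * log2(p_ab / ((marginal_a[a] / n) * (marginal_b[b] / n)))
    return mi


# Toy columns: the first pair covaries perfectly, the second is independent.
coupled = mutual_information("AAKK", "DDEE")
independent = mutual_information("AAKK", "DEDE")
```

A correct pairing of partners should, on average, give higher mutual information between the two families than a shuffled pairing, which is the signal the abstract says is approximately maximized.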
35. Systematically benchmarking peptide-MHC binding predictors: From synthetic to naturally processed epitopes.
- Author
-
Zhao, Weilong and Sher, Xinwei
- Subjects
MACHINE learning, EPITOPES, T cells, CLINICAL immunology, HLA histocompatibility antigens
- Abstract
A number of machine learning-based predictors have been developed for identifying immunogenic T-cell epitopes based on major histocompatibility complex (MHC) class I and II binding affinities. Rationally selecting the most appropriate tool has been complicated by the evolving training data and machine learning methods. Despite the recent advances made in generating high-quality MHC-eluted, naturally processed ligandome, the reliability of new predictors on these epitopes has yet to be evaluated. This study reports the latest benchmarking on an extensive set of MHC-binding predictors by using newly available, untested data of both synthetic and naturally processed epitopes. 32 human leukocyte antigen (HLA) class I and 24 HLA class II alleles are included in the blind test set. Artificial neural network (ANN)-based approaches demonstrated better performance than regression-based machine learning and structural modeling. Among the 18 predictors benchmarked, ANN-based mhcflurry and nn_align perform the best for MHC class I 9-mer and class II 15-mer predictions, respectively, on binding/non-binding classification (Area Under Curves = 0.911). NetMHCpan4 also demonstrated comparable predictive power. Our customization of mhcflurry to a pan-HLA predictor has achieved similar accuracy to NetMHCpan. The overall accuracy of these methods is comparable between 9-mer and 10-mer testing data. However, the top methods deliver low correlations between the predicted and the experimental affinities for strong MHC binders. When used on naturally processed MHC-ligands, tools that have been trained on elution data (NetMHCpan4 and MixMHCpred) show better accuracy than pure binding affinity predictors. The variability of the false prediction rate is considerable among HLA types and datasets. Finally, the structure-based predictor Rosetta FlexPepDock is less optimal than the machine learning approaches.
With our benchmarking of MHC-binding and MHC-elution predictors using comprehensive metrics, an unbiased view for establishing best practice of T-cell epitope predictions is presented, facilitating future development of methods in immunogenomics. [ABSTRACT FROM AUTHOR]
- Published
- 2018
36. Motif-Aware PRALINE: Improving the alignment of motif regions.
- Author
-
Dijkstra, Maurits, Bawono, Punto, Abeln, Sanne, Feenstra, K. Anton, Fokkink, Wan, and Heringa, Jaap
- Subjects
SEQUENCE alignment, DYNAMIC programming, HEURISTIC, IMMUNODEFICIENCY, GLYCOMICS, AMINO acids, BIOCHEMISTRY
- Abstract
Protein or DNA motifs are sequence regions which possess biological importance. These regions are often highly conserved among homologous sequences. The generation of multiple sequence alignments (MSAs) with a correct alignment of the conserved sequence motifs is still difficult to achieve, due to the fact that the contribution of these typically short fragments is overshadowed by the rest of the sequence. Here we extended the PRALINE multiple sequence alignment program with a novel motif-aware MSA algorithm in order to address this shortcoming. This method can incorporate explicit information about the presence of externally provided sequence motifs, which is then used in the dynamic programming step by boosting the amino acid substitution matrix towards the motif. The strength of the boost is controlled by a parameter, α. Using a benchmark set of alignments we confirm that a good compromise can be found that improves the matching of motif regions while not significantly reducing the overall alignment quality. By estimating α on an unrelated set of reference alignments we find there is indeed a strong conservation signal for motifs. A number of typical but difficult MSA use cases are explored to exemplify the problems in correctly aligning functional sequence motifs and how the motif-aware alignment method can be employed to alleviate these problems. [ABSTRACT FROM AUTHOR]
- Published
- 2018
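The abstract above says only that the substitution score is boosted towards the motif, with strength controlled by a parameter α. One way such a boost could enter a pairwise score is sketched below. The additive form, the function name, and the toy substitution values are assumptions for illustration and are not taken from PRALINE.

```python
# Toy substitution scores for a few residue pairs (illustrative values only).
TOY_SUBSTITUTION = {
    ("A", "A"): 4.0,
    ("A", "G"): 0.0,
    ("G", "A"): 0.0,
    ("G", "G"): 6.0,
}


def motif_aware_score(residue_a, residue_b, in_motif_a, in_motif_b, alpha):
    """Score for aligning two positions, boosted when both lie in a motif.

    in_motif_a / in_motif_b -- whether each position falls inside an
    externally annotated motif occurrence.
    alpha -- boost strength; alpha = 0 recovers the plain substitution score.
    (Hypothetical additive scheme, assumed for this sketch.)
    """
    base = TOY_SUBSTITUTION[(residue_a, residue_b)]
    if in_motif_a and in_motif_b:
        return base + alpha
    return base


# With alpha > 0, aligning two motif positions is rewarded over the same
# residue pair outside a motif, steering dynamic programming to match motifs.
inside = motif_aware_score("G", "G", True, True, alpha=2.0)
outside = motif_aware_score("G", "G", True, False, alpha=2.0)
```

In a scheme like this, a larger α pulls motif regions together more strongly at the cost of distorting the rest of the alignment, which matches the compromise the abstract describes.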
37. Co-evolution networks of HIV/HCV are modular with direct association to structure and function.
- Author
-
Quadeer, Ahmed Abdul, Morales-Jimenez, David, and McKay, Matthew R.
- Subjects
HEPATITIS C virus, DIAGNOSIS of HIV infections, VIRAL evolution, MULTIPLE correspondence analysis (Statistics), FLAVIVIRUSES
- Abstract
Mutational correlation patterns found in population-level sequence data for the Human Immunodeficiency Virus (HIV) and the Hepatitis C Virus (HCV) have been demonstrated to be informative of viral fitness. Such patterns can be seen as footprints of the intrinsic functional constraints placed on viral evolution under diverse selective pressures. Here, considering multiple HIV and HCV proteins, we demonstrate that these mutational correlations encode a modular co-evolutionary structure that is tightly linked to the structural and functional properties of the respective proteins. Specifically, by introducing a robust statistical method based on sparse principal component analysis, we identify near-disjoint sets of collectively-correlated residues (sectors) having mostly a one-to-one association to largely distinct structural or functional domains. This suggests that the distinct phenotypic properties of HIV/HCV proteins often give rise to quasi-independent modes of evolution, with each mode involving a sparse and localized network of mutational interactions. Moreover, individual inferred sectors of HIV are shown to carry immunological significance, providing insight for guiding targeted vaccine strategies. [ABSTRACT FROM AUTHOR]
- Published
- 2018
38. riboWaltz: Optimization of ribosome P-site positioning in ribosome profiling data.
- Author
-
Lauria, Fabio, Tebaldi, Toma, Bernabò, Paola, Groen, Ewout J. N., Gillingwater, Thomas H., and Viero, Gabriella
- Subjects
RIBOSOMES, RNA, BIOCHEMISTRY, EUKARYOTES, YEAST
- Abstract
Ribosome profiling is a powerful technique used to study translation at the genome-wide level, generating unique information concerning ribosome positions along RNAs. Optimal localization of ribosomes requires the proper identification of the ribosome P-site in each ribosome protected fragment, a crucial step to determine the trinucleotide periodicity of translating ribosomes, and draw correct conclusions concerning where ribosomes are located. To determine the P-site within ribosome footprints at nucleotide resolution, the precise estimation of its offset with respect to the protected fragment is necessary. Here we present riboWaltz, an R package for calculation of optimal P-site offsets, diagnostic analysis and visual inspection of ribosome profiling data. Compared to existing tools, riboWaltz shows improved accuracies for P-site estimation and neat ribosome positioning in multiple case studies. riboWaltz was implemented in R and is available as an R package at . [ABSTRACT FROM AUTHOR]
- Published
- 2018
39. Cancerin: A computational pipeline to infer cancer-associated ceRNA interaction networks.
- Author
-
Do, Duc and Bozdag, Serdar
- Subjects
CANCER genetics, MICRORNA, GENE expression, CARCINOGENESIS, GENETIC regulation, DNA methylation
- Abstract
MicroRNAs (miRNAs) inhibit expression of target genes by binding to their RNA transcripts. It has been recently shown that RNA transcripts targeted by the same miRNA could “compete” for the miRNA molecules and thereby indirectly regulate each other. Experimental evidence has suggested that the aberration of such miRNA-mediated interaction between RNAs—called competing endogenous RNA (ceRNA) interaction—can play important roles in tumorigenesis. Given the difficulty of deciphering context-specific miRNA binding, and the existence of various gene regulatory factors such as DNA methylation and copy number alteration, inferring context-specific ceRNA interactions accurately is a computationally challenging task. Here we propose a computational method called Cancerin to identify cancer-associated ceRNA interactions. Cancerin incorporates DNA methylation, copy number alteration, gene and miRNA expression datasets to construct cancer-specific ceRNA networks. We applied Cancerin to three cancer datasets from the Cancer Genome Atlas (TCGA) project. Our results indicated that ceRNAs were enriched with cancer-related genes, and ceRNA modules in the inferred ceRNA networks were involved in cancer-associated biological processes. Using LINCS-L1000 shRNA-mediated gene knockdown experiment in breast cancer cell line to assess accuracy, Cancerin was able to predict expression outcome of ceRNA genes with high accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2018
40. Solving the RNA design problem with reinforcement learning.
- Author
-
Eastman, Peter, Shi, Jade, Ramsundar, Bharath, and Pande, Vijay S.
- Subjects
REINFORCEMENT learning, NUCLEOTIDE sequence, BIOENGINEERING, COMPUTER science, COMPUTATIONAL biology
- Abstract
We use reinforcement learning to train an agent for computational RNA design: given a target secondary structure, design a sequence that folds to that structure in silico. Our agent uses a novel graph convolutional architecture allowing a single model to be applied to arbitrary target structures of any length. After training it on randomly generated targets, we test it on the Eterna100 benchmark and find it outperforms all previous algorithms. Analysis of its solutions shows it has successfully learned some advanced strategies identified by players of the game Eterna, allowing it to solve some very difficult structures. On the other hand, it has failed to learn other strategies, possibly because they were not required for the targets in the training set. This suggests the possibility that future improvements to the training protocol may yield further gains in performance. [ABSTRACT FROM AUTHOR]
- Published
- 2018
41. Identifying functional groups among the diverse, recombining antigenic var genes of the malaria parasite Plasmodium falciparum from a local community in Ghana.
- Author
-
Rorick, Mary M., Baskerville, Edward B., Rask, Thomas S., Day, Karen P., and Pascual, Mercedes
- Subjects
PLASMODIUM falciparum, FUNCTIONAL groups, GENOMICS, PHYLOGENY, BIODIVERSITY, NUCLEOTIDE sequencing
- Abstract
A challenge in studying diverse multi-copy gene families is deciphering distinct functional types within immense sequence variation. Functional changes can in some cases be tracked through the evolutionary history of a gene family; however phylogenetic approaches are not possible in cases where gene families diversify primarily by recombination. We take a network theoretical approach to functionally classify the highly recombining var antigenic gene family of the malaria parasite Plasmodium falciparum. We sample var DBLα sequence types from a local population in Ghana, and classify 9,276 of these variants into just 48 functional types. Our approach is to first decompose each sequence type into its constituent, recombining parts; we then use a stochastic block model to identify functional groups among the parts; finally, we classify the sequence types based on which functional groups they contain. This method for functional classification does not rely on an inferred phylogenetic history, nor does it rely on inferring function based on conserved sequence features. Instead, it infers functional similarity among recombining parts based on the sharing of similar co-occurrence interactions with other parts. This method can therefore group sequences that have undetectable sequence homology or even distinct origination. Describing these 48 var functional types allows us to simplify the antigenic diversity within our dataset by over two orders of magnitude. We consider how the var functional types are distributed in isolates, and find a nonrandom pattern reflecting that common var functional types are non-randomly distinct from one another in terms of their functional composition. The coarse-graining of var gene diversity into biologically meaningful functional groups has important implications for understanding the disease ecology and evolution of this system, as well as for designing effective epidemiological monitoring and intervention. [ABSTRACT FROM AUTHOR]
- Published
- 2018
42. Traceability, reproducibility and wiki-exploration for “à-la-carte” reconstructions of genome-scale metabolic models.
- Author
-
Aite, Méziane, Chevallier, Marie, Frioux, Clémence, Trottier, Camille, Got, Jeanne, Cortés, María Paz, Mendoza, Sebastián N., Carrier, Grégory, Dameron, Olivier, Guillaudeux, Nicolas, Latorre, Mauricio, Loira, Nicolás, Markov, Gabriel V., Maass, Alejandro, and Siegel, Anne
- Subjects
METABOLISM, REPRODUCIBLE research, BIOINFORMATICS, EUKARYOTES, AUTOMATION
- Abstract
Genome-scale metabolic models have become the tool of choice for the global analysis of microorganism metabolism, and their reconstruction has attained high standards of quality and reliability. Improvements in this area have been accompanied by the development of some major platforms and databases, and an explosion of individual bioinformatics methods. Consequently, many recent models result from “à la carte” pipelines, combining the use of platforms, individual tools and biological expertise to enhance the quality of the reconstruction. Although very useful, introducing heterogeneous tools, that hardly interact with each other, causes loss of traceability and reproducibility in the reconstruction process. This represents a real obstacle, especially when considering less studied species whose metabolic reconstruction can greatly benefit from the comparison to good quality models of related organisms. This work proposes an adaptable workspace, AuReMe, for sustainable reconstructions or improvements of genome-scale metabolic models involving personalized pipelines. At each step, relevant information related to the modifications brought to the model by a method is stored. This ensures that the process is reproducible and documented regardless of the combination of tools used. Additionally, the workspace establishes a way to browse metabolic models and their metadata through the automatic generation of ad-hoc local wikis dedicated to monitoring and facilitating the process of reconstruction. AuReMe supports exploration and semantic query based on RDF databases. We illustrate how this workspace allowed handling, in an integrated way, the metabolic reconstructions of non-model organisms such as an extremophile bacterium or eukaryote algae. Among relevant applications, the latter reconstruction led to putative evolutionary insights of a metabolic pathway. [ABSTRACT FROM AUTHOR]
- Published
- 2018
43. RosettaAntibodyDesign (RAbD): A general framework for computational antibody design.
- Author
-
Adolf-Bryfogle, Jared, Kalyuzhniy, Oleks, Kubitz, Michael, Weitzner, Brian D., Hu, Xiaozhen, Adachi, Yumiko, Schief, William R., and Dunbrack, Roland L., Jr.
- Subjects
IMMUNOGLOBULINS, EPITOPES, ANTIGENS, AMINO acids, MONOCLONAL antibodies
- Abstract
A structural-bioinformatics-based computational methodology and framework have been developed for the design of antibodies to targets of interest. RosettaAntibodyDesign (RAbD) samples the diverse sequence, structure, and binding space of an antibody to an antigen in highly customizable protocols for the design of antibodies in a broad range of applications. The program samples antibody sequences and structures by grafting structures from a widely accepted set of the canonical clusters of CDRs (North et al., J. Mol. Biol., 406:228–256, 2011). It then performs sequence design according to amino acid sequence profiles of each cluster, and samples CDR backbones using a flexible-backbone design protocol incorporating cluster-based CDR constraints. Starting from an existing experimental or computationally modeled antigen-antibody structure, RAbD can be used to redesign a single CDR or multiple CDRs with loops of different length, conformation, and sequence. We rigorously benchmarked RAbD on a set of 60 diverse antibody–antigen complexes, using two design strategies—optimizing total Rosetta energy and optimizing interface energy alone. We utilized two novel metrics for measuring success in computational protein design. The design risk ratio (DRR) is equal to the frequency of recovery of native CDR lengths and clusters divided by the frequency of sampling of those features during the Monte Carlo design procedure. Ratios greater than 1.0 indicate that the design process is picking out the native more frequently than expected from their sampled rate. We achieved DRRs for the non-H3 CDRs of between 2.4 and 4.0. The antigen risk ratio (ARR) is the ratio of frequencies of the native amino acid types, CDR lengths, and clusters in the output decoys for simulations performed in the presence and absence of the antigen. For CDRs, we achieved cluster ARRs as high as 2.5 for L1 and 1.5 for H2. 
For sequence design simulations without CDR grafting, the overall recovery for the native amino acid types for residues that contact the antigen in the native structures was 72% in simulations performed in the presence of the antigen and 48% in simulations performed without the antigen, for an ARR of 1.5. For the non-contacting residues, the ARR was 1.08. This shows that the sequence profiles are able to maintain the amino acid types of these conserved, buried sites, while recovery of the exposed, contacting residues requires the presence of the antigen-antibody interface. We tested RAbD experimentally on both a lambda and kappa antibody–antigen complex, successfully improving their affinities 10 to 50 fold by replacing individual CDRs of the native antibody with new CDR lengths and clusters. [ABSTRACT FROM AUTHOR]
- Published
- 2018
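Both metrics defined in the RAbD abstract are ratios of two frequencies, and the reported antigen risk ratio of 1.5 for contacting residues follows directly from the stated recoveries (72% with antigen, 48% without). A minimal sketch of that arithmetic follows. The function name is an assumption, and the DRR inputs are hypothetical numbers chosen only to illustrate the definition.

```python
def risk_ratio(frequency_observed, frequency_reference):
    """Ratio of two frequencies, as used for both DRR and ARR.

    DRR: recovery frequency of native features / sampling frequency.
    ARR: frequency with antigen present / frequency with antigen absent.
    (Hypothetical helper; not part of RosettaAntibodyDesign.)
    """
    if frequency_reference <= 0:
        raise ValueError("reference frequency must be positive")
    return frequency_observed / frequency_reference


# ARR for antigen-contacting residues, using the figures quoted in the abstract.
arr_contacting = risk_ratio(0.72, 0.48)  # approximately 1.5

# DRR with made-up inputs: a native cluster recovered in 30% of designs
# but sampled in only 10% of Monte Carlo moves.
drr_example = risk_ratio(0.30, 0.10)  # approximately 3.0
```

A ratio above 1.0 means the native feature appears more often than the reference rate would predict, which is how the abstract interprets both metrics.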
44. DIVERSITY in binding, regulation, and evolution revealed from high-throughput ChIP.
- Author
-
Mitra, Sneha, Biswas, Anushua, and Narlikar, Leelavati
- Subjects
DNA-protein interactions, CHROMATIN, IMMUNOPRECIPITATION, DNA-binding proteins, EPIGENETICS
- Abstract
Genome-wide in vivo protein-DNA interactions are routinely mapped using high-throughput chromatin immunoprecipitation (ChIP). ChIP-reported regions are typically investigated for enriched sequence-motifs, which are likely to model the DNA-binding specificity of the profiled protein and/or of co-occurring proteins. However, simple enrichment analyses can miss insights into the binding-activity of the protein. Note that ChIP reports regions making direct contact with the protein as well as those binding through intermediaries. For example, consider a ChIP experiment targeting protein X, which binds DNA at its cognate sites, but simultaneously interacts with four other proteins. Each of these proteins also binds to its own specific cognate sites along distant parts of the genome, a scenario consistent with the current view of transcriptional hubs and chromatin loops. Since ChIP will pull down all X-associated regions, the final reported data will be a union of five distinct sets of regions, each containing binding sites of one of the five proteins, respectively. Characterizing all five different motifs and the corresponding sets is important to interpret the ChIP experiment and ultimately, the role of X in regulation. We present DIVERSITY, which attempts exactly this: it partitions the data so that each partition can be characterized with its own de novo motif. DIVERSITY uses a Bayesian approach to identify the optimal number of motifs and the associated partitions, which together explain the entire dataset. This is in contrast to standard motif finders, which report motifs individually enriched in the data, but do not necessarily explain all reported regions. We show that the different motifs and associated regions identified by DIVERSITY give insights into the various complexes that may be forming along the chromatin, something that has so far not been attempted from ChIP data. Webserver at ; standalone (Mac OS X/Linux) from . [ABSTRACT FROM AUTHOR]
- Published
- 2018
45. A machine learning based framework to identify and classify long terminal repeat retrotransposons.
- Author
-
Schietgat, Leander, Vens, Celine, Cerri, Ricardo, Fischer, Carlos N., Costa, Eduardo, Ramon, Jan, Carareto, Claudia M. A., and Blockeel, Hendrik
- Subjects
RETROTRANSPOSONS, MACHINE learning, GENOMES, DROSOPHILA melanogaster, ARABIDOPSIS thaliana
- Abstract
Transposable elements (TEs) are repetitive nucleotide sequences that make up a large portion of eukaryotic genomes. They can move and duplicate within a genome, increasing genome size and contributing to genetic diversity within and across species. Accurate identification and classification of TEs present in a genome is an important step towards understanding their effects on genes and their role in genome evolution. We introduce TE-Learner, a framework based on machine learning that automatically identifies TEs in a given genome and assigns a classification to them. We present an implementation of our framework towards LTR retrotransposons, a particular type of TEs characterized by having long terminal repeats (LTRs) at their boundaries. We evaluate the predictive performance of our framework on the well-annotated genomes of Drosophila melanogaster and Arabidopsis thaliana and we compare our results for three LTR retrotransposon superfamilies with the results of three widely used methods for TE identification or classification: RepeatMasker, Censor and LtrDigest. In contrast to these methods, TE-Learner is the first to incorporate machine learning techniques, outperforming these methods in terms of predictive performance, while able to learn models and make predictions efficiently. Moreover, we show that our method was able to identify TEs that none of the above methods could find, and we investigated TE-Learner’s predictions which did not correspond to an official annotation. It turns out that many of these predictions are in fact strongly homologous to a known TE. [ABSTRACT FROM AUTHOR]
- Published
- 2018
46. Backbone Brackets and Arginine Tweezers delineate Class I and Class II aminoacyl tRNA synthetases.
- Author
-
Kaiser, Florian, Bittrich, Sebastian, Salentin, Sebastian, Leberecht, Christoph, Haupt, V. Joachim, Krautwurst, Sarah, Schroeder, Michael, and Labudde, Dirk
- Subjects
PROTEIN synthesis, AMINOACYL-tRNA synthetases, AMINO acids, ARGININE, BIOMARKERS
- Abstract
The origin of the machinery that realizes protein biosynthesis in all organisms is still unclear. One key component of this machinery are aminoacyl tRNA synthetases (aaRS), which ligate tRNAs to amino acids while consuming ATP. Sequence analyses revealed that these enzymes can be divided into two complementary classes. Both classes differ significantly on a sequence and structural level, feature different reaction mechanisms, and occur in diverse oligomerization states. The one unifying aspect of both classes is their function of binding ATP. We identified Backbone Brackets and Arginine Tweezers as most compact ATP binding motifs characteristic for each Class. Geometric analysis shows a structural rearrangement of the Backbone Brackets upon ATP binding, indicating a general mechanism of all Class I structures. Regarding the origin of aaRS, the Rodin-Ohno hypothesis states that the peculiar nature of the two aaRS classes is the result of their primordial forms, called Protozymes, being encoded on opposite strands of the same gene. Backbone Brackets and Arginine Tweezers were traced back to the proposed Protozymes and their more efficient successors, the Urzymes. Both structural motifs can be observed as pairs of residues in contemporary structures and it seems that the time of their addition, indicated by their placement in the ancient aaRS, coincides with the evolutionary trace of Proto- and Urzymes. [ABSTRACT FROM AUTHOR]
- Published
- 2018
47. New computational approaches to understanding molecular protein function.
- Author
-
Fetrow, Jacquelyn S. and Babbitt, Patricia C.
- Subjects
COMPUTATIONAL biology, PHYSIOLOGICAL effects of proteins, PROTEIN genetics, MOLECULAR biology, ONTOLOGIES (Information retrieval)
- Abstract
The authors discuss new computational approaches to understanding molecular protein function. They note that the Gene Ontology (GO) system classifies function along distinct aspects, including cellular component and molecular function, and they point to the need to understand molecular function through motifs, such as the sequence motifs exemplified by PRINTS and PROSITE. They also note that contemporary protein superfamilies are the result of numerous genetic events.
- Published
- 2018
48. Meet-U: Educating through research immersion.
- Author
-
Abdollahi, Nika, Albani, Alexandre, Anthony, Eric, Baud, Agnes, Cardon, Mélissa, Clerc, Robert, Czernecki, Dariusz, Conte, Romain, David, Laurent, Delaune, Agathe, Djerroud, Samia, Fourgoux, Pauline, Guiglielmoni, Nadège, Laurentie, Jeanne, Lehmann, Nathalie, Lochard, Camille, Montagne, Rémi, Myrodia, Vasiliki, Opuu, Vaitea, and Parey, Elise
- Subjects
COMPUTATIONAL biology, CLOUD computing, RESEARCH methodology, BIOLOGY students, SCIENTISTS
- Abstract
We present a new educational initiative called Meet-U that aims to train students for collaborative work in computational biology and to bridge the gap between education and research. Meet-U mimics the setup of collaborative research projects and takes advantage of the most popular tools for collaborative work and of cloud computing. Students are grouped in teams of 4–5 people and have to realize a project from A to Z that answers a challenging question in biology. Meet-U promotes "coopetition," as the students collaborate within and across the teams and are also in competition with each other to develop the best final product. Meet-U fosters interactions between different actors of education and research through the organization of a meeting day, open to everyone, where the students present their work to a jury of researchers and jury members give research seminars. This very unique combination of education and research is strongly motivating for the students and provides a formidable opportunity for a scientific community to unite and increase its visibility. We report on our experience with Meet-U in two French universities with master’s students in bioinformatics and modeling, with protein–protein docking as the subject of the course. Meet-U is easy to implement and can be straightforwardly transferred to other fields and/or universities. All the information and data are available at . [ABSTRACT FROM AUTHOR]
- Published
- 2018
49. iDREM: Interactive visualization of dynamic regulatory networks.
- Author
-
Ding, Jun, Hagood, James S., Ambalavanan, Namasivayam, Kaminski, Naftali, and Bar-Joseph, Ziv
- Subjects
GENETIC software, PROTEINS, GENE expression, MICRORNA, PROTEOMICS
- Abstract
The Dynamic Regulatory Events Miner (DREM) software reconstructs dynamic regulatory networks by integrating static protein-DNA interaction data with time series gene expression data. In recent years, several additional types of high-throughput time series data have been profiled when studying biological processes including time series miRNA expression, proteomics, epigenomics and single cell RNA-Seq. Combining all available time series and static datasets in a unified model remains an important challenge and goal. To address this challenge we have developed a new version of DREM termed interactive DREM (iDREM). iDREM provides support for all data types mentioned above and combines them with existing interaction data to reconstruct networks that can lead to novel hypotheses on the function and timing of regulators. Users can interactively visualize and query the resulting model. We showcase the functionality of the new tool by applying it to microglia developmental data from multiple labs. [ABSTRACT FROM AUTHOR]
- Published
- 2018
50. Integrating linear optimization with structural modeling to increase HIV neutralization breadth.
- Author
-
Sevy, Alexander M., Panda, Swetasudha, Crowe, James E., Jr., Meiler, Jens, and Vorobeychik, Yevgeniy
- Subjects
PROTEINS ,AMINO acid sequence ,MACHINE learning ,IMMUNOGLOBULINS ,HIV - Abstract
Computational protein design has been successful in modeling fixed backbone proteins in a single conformation. However, when modeling large ensembles of flexible proteins, current methods in protein design have been insufficient. Large barriers in the energy landscape are difficult to traverse while redesigning a protein sequence, and as a result current design methods only sample a fraction of available sequence space. We propose a new computational approach that combines traditional structure-based modeling using the software suite with machine learning and integer linear programming to overcome limitations in the sampling methods. We demonstrate the effectiveness of this method, which we call BROAD, by benchmarking the performance on increasing predicted breadth of anti-HIV antibodies. We use this novel method to increase predicted breadth of naturally-occurring antibody VRC23 against a panel of 180 divergent HIV viral strains and achieve 100% predicted binding against the panel. In addition, we compare the performance of this method to state-of-the-art multistate design in and show that we can outperform the existing method significantly. We further demonstrate that sequences recovered by this method recover known binding motifs of broadly neutralizing anti-HIV antibodies. Finally, our approach is general and can be extended easily to other protein systems. Although our modeled antibodies were not tested in vitro, we predict that these variants would have greatly increased breadth compared to the wild-type antibody. [ABSTRACT FROM AUTHOR]
- Published
- 2018