39 results
Search Results
2. Per-sample immunoglobulin germline inference from B cell receptor deep sequencing data.
- Author
- Ralph, Duncan K. and Matsen IV, Frederick A.
- Subjects
B cell receptors; IMMUNOGLOBULIN genes; B cells; ALLELES
- Abstract
The collection of immunoglobulin genes in an individual’s germline, which gives rise to B cell receptors via recombination, is known to vary significantly across individuals. In humans, for example, each individual has only a fraction of the several hundred known V alleles. Furthermore, the currently-accepted set of known V alleles is both incomplete (particularly for non-European samples), and contains a significant number of spurious alleles. The resulting uncertainty as to which immunoglobulin alleles are present in any given sample results in inaccurate B cell receptor sequence annotations, and in particular inaccurate inferred naive ancestors. In this paper we first show that the currently widespread practice of aligning each sequence to its closest match in the full set of IMGT alleles results in a very large number of spurious alleles that are not in the sample’s true set of germline V alleles. We then describe a new method for inferring each individual’s germline gene set from deep sequencing data, and show that it improves upon existing methods by making a detailed comparison on a variety of simulated and real data samples. This new method has been integrated into the partis annotation and clonal family inference package, available at , and is run by default without affecting overall run time. [ABSTRACT FROM AUTHOR]
- Published
- 2019
3. SFPEL-LPI: Sequence-based feature projection ensemble learning for predicting LncRNA-protein interactions.
- Author
- Zhang, Wen, Tang, Guifeng, Huang, Feng, Zhang, Xining, Yue, Xiang, and Wu, Wenjian
- Subjects
RNA-protein interactions; GENETIC regulation; RNA interference; RNA splicing; ADENYLATION (Biochemistry)
- Abstract
LncRNA-protein interactions play important roles in post-transcriptional gene regulation, poly-adenylation, splicing and translation. Identification of lncRNA-protein interactions helps to understand lncRNA-related activities. Existing computational methods utilize multiple lncRNA features or multiple protein features to predict lncRNA-protein interactions, but features are not available for all lncRNAs or proteins; most of existing methods are not capable of predicting interacting proteins (or lncRNAs) for new lncRNAs (or proteins), which don’t have known interactions. In this paper, we propose the sequence-based feature projection ensemble learning method, “SFPEL-LPI”, to predict lncRNA-protein interactions. First, SFPEL-LPI extracts lncRNA sequence-based features and protein sequence-based features. Second, SFPEL-LPI calculates multiple lncRNA-lncRNA similarities and protein-protein similarities by using lncRNA sequences, protein sequences and known lncRNA-protein interactions. Then, SFPEL-LPI combines multiple similarities and multiple features with a feature projection ensemble learning frame. In computational experiments, SFPEL-LPI accurately predicts lncRNA-protein associations and outperforms other state-of-the-art methods. More importantly, SFPEL-LPI can be applied to new lncRNAs (or proteins). The case studies demonstrate that our method can find out novel lncRNA-protein interactions, which are confirmed by literature. Finally, we construct a user-friendly web server, available at . [ABSTRACT FROM AUTHOR]
- Published
- 2018
4. PCSF: An R-package for network-based interpretation of high-throughput data.
- Author
- Akhmedov, Murodzhon, Kedaigle, Amanda, Chong, Renan Escalante, Montemanni, Roberto, Bertoni, Francesco, Fraenkel, Ernest, and Kwee, Ivo
- Subjects
BIOINFORMATICS software; DATA analysis software; MATHEMATICAL optimization; COMPUTATIONAL biology; PROTEIN-protein interactions
- Abstract
With the recent technological developments a vast amount of high-throughput data has been profiled to understand the mechanism of complex diseases. The current bioinformatics challenge is to interpret the data and underlying biology, where efficient algorithms for analyzing heterogeneous high-throughput data using biological networks are becoming increasingly valuable. In this paper, we propose a software package based on the Prize-collecting Steiner Forest graph optimization approach. The PCSF package performs fast and user-friendly network analysis of high-throughput data by mapping the data onto a biological networks such as protein-protein interaction, gene-gene interaction or any other correlation or coexpression based networks. Using the interaction networks as a template, it determines high-confidence subnetworks relevant to the data, which potentially leads to predictions of functional units. It also interactively visualizes the resulting subnetwork with functional enrichment analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2017
5. Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model.
- Author
- Wang, Sheng, Sun, Siqi, Li, Zhen, Zhang, Renyu, and Xu, Jinbo
- Subjects
PROTEIN structure; ARTIFICIAL neural networks; PROTEIN folding; PAIRED comparisons (Mathematics); AMINO acid sequence
- Abstract
Motivation: Protein contacts contain key information for the understanding of protein structure and function and thus, contact prediction from sequence is an important problem. Recently exciting progress has been made on this problem, but the predicted contacts for proteins without many sequence homologs is still of low quality and not very useful for de novo structure prediction. Method: This paper presents a new deep learning method that predicts contacts by integrating both evolutionary coupling (EC) and sequence conservation information through an ultra-deep neural network formed by two deep residual neural networks. The first residual network conducts a series of 1-dimensional convolutional transformation of sequential features; the second residual network conducts a series of 2-dimensional convolutional transformation of pairwise information including output of the first residual network, EC information and pairwise potential. By using very deep residual networks, we can accurately model contact occurrence patterns and complex sequence-structure relationship and thus, obtain high-quality contact prediction regardless of how many sequence homologs are available for proteins in question. Results: Our method greatly outperforms existing methods and leads to much more accurate contact-assisted folding. Tested on 105 CASP11 targets, 76 past CAMEO hard targets, and 398 membrane proteins, the average top L long-range prediction accuracy obtained by our method, one representative EC method CCMpred and the CASP11 winner MetaPSICOV is 0.47, 0.21 and 0.30, respectively; the average top L/10 long-range accuracy of our method, CCMpred and MetaPSICOV is 0.77, 0.47 and 0.59, respectively. Ab initio folding using our predicted contacts as restraints but without any force fields can yield correct folds (i.e., TMscore>0.6) for 203 of the 579 test proteins, while that using MetaPSICOV- and CCMpred-predicted contacts can do so for only 79 and 62 of them, respectively. 
Our contact-assisted models also have much better quality than template-based models especially for membrane proteins. The 3D models built from our contact prediction have TMscore>0.5 for 208 of the 398 membrane proteins, while those from homology modeling have TMscore>0.5 for only 10 of them. Further, even if trained mostly by soluble proteins, our deep learning method works very well on membrane proteins. In the recent blind CAMEO benchmark, our fully-automated web server implementing this method successfully folded 6 targets with a new fold and only 0.3L-2.3L effective sequence homologs, including one β protein of 182 residues, one α+β protein of 125 residues, one α protein of 140 residues, one α protein of 217 residues, one α/β of 260 residues and one α protein of 462 residues. Our method also achieved the highest F1 score on free-modeling targets in the latest CASP (Critical Assessment of Structure Prediction), although it was not fully implemented back then. Availability: [ABSTRACT FROM AUTHOR]
- Published
- 2017
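The "top L" and "top L/10" long-range accuracies quoted in the abstract above can be illustrated with a minimal sketch. It assumes the common CASP convention that a long-range pair is separated by at least 24 residues in sequence; the function name, the dictionary input format, and the toy data are hypothetical and are not taken from the paper or its web server.

```python
def top_k_long_range_accuracy(scores, native, seq_len, divisor=1, min_sep=24):
    """Fraction of the top L/divisor long-range predictions that are true contacts.

    scores:  dict mapping residue pairs (i, j), i < j, to a predicted probability
    native:  set of residue pairs (i, j) that are contacts in the native structure
    seq_len: protein length L
    min_sep: minimum sequence separation for "long-range" (assumed to be 24)
    """
    k = max(1, seq_len // divisor)
    # Keep only long-range pairs, then rank them by predicted probability.
    long_range = [(p, pair) for pair, p in scores.items()
                  if pair[1] - pair[0] >= min_sep]
    long_range.sort(key=lambda item: item[0], reverse=True)
    top = [pair for _, pair in long_range[:k]]
    if not top:
        return 0.0
    return sum(pair in native for pair in top) / len(top)


# Hypothetical toy data for a 40-residue protein (not from the paper).
toy_scores = {
    (0, 30): 0.90, (1, 35): 0.80, (2, 30): 0.70, (5, 39): 0.60,
    (0, 5): 0.95,   # short-range pair, ignored by the long-range filter
    (3, 28): 0.10,
}
toy_native = {(0, 30), (2, 30), (3, 28)}

# Top L/10 = top 4 long-range predictions; 2 of them are native contacts.
print(top_k_long_range_accuracy(toy_scores, toy_native, seq_len=40, divisor=10))  # 0.5
```

Under this convention, the reported 0.77 for top L/10 would mean that, averaged over targets, about 77% of the highest-ranked L/10 long-range pairs are true contacts.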
6. Per-sample immunoglobulin germline inference from B cell receptor deep sequencing data
- Author
- Duncan Ralph and Frederick A. Matsen
- Subjects
0301 basic medicine; Physiology; Inference; Biochemistry; Germline; Database and Informatics Methods; 0302 clinical medicine; Immune Physiology; Databases, Genetic; Medicine and Health Sciences; Biology (General); Data Management; Genetics; 0303 health sciences; Immune System Proteins; Ecology; Genes, Immunoglobulin; High-Throughput Nucleotide Sequencing; Phylogenetic Analysis; 3. Good health; Phylogenetics; Computational Theory and Mathematics; Modeling and Simulation; Mutation (genetic algorithm); Sequence Analysis; Research Article; Computer and Information Sciences; QH301-705.5; Bioinformatics; B-cell receptor; Immunology; Sequence Databases; Receptors, Antigen, B-Cell; Sequence alignment; Computational biology; Biology; Research and Analysis Methods; Deep sequencing; Antibodies; Set (abstract data type); 03 medical and health sciences; Cellular and Molecular Neuroscience; Sequence Motif Analysis; Point Mutation; Humans; Evolutionary Systematics; Computer Simulation; Allele; Quantitative Biology - Populations and Evolution; Molecular Biology; Gene; Ecology, Evolution, Behavior and Systematics; Alleles; 030304 developmental biology; Sequence (medicine); Taxonomy; Evolutionary Biology; Models, Genetic; Populations and Evolution (q-bio.PE); Models, Immunological; Biology and Life Sciences; Proteins; Computational Biology; 030104 developmental biology; Biological Databases; Germ Cells; Genetic Loci; FOS: Biological sciences; Mutation; Sequence Alignment; 030217 neurology & neurosurgery; Software; 030215 immunology
- Abstract
The collection of immunoglobulin genes in an individual’s germline, which gives rise to B cell receptors via recombination, is known to vary significantly across individuals. In humans, for example, each individual has only a fraction of the several hundred known V alleles. Furthermore, the currently-accepted set of known V alleles is both incomplete (particularly for non-European samples), and contains a significant number of spurious alleles. The resulting uncertainty as to which immunoglobulin alleles are present in any given sample results in inaccurate B cell receptor sequence annotations, and in particular inaccurate inferred naive ancestors. In this paper we first show that the currently widespread practice of aligning each sequence to its closest match in the full set of IMGT alleles results in a very large number of spurious alleles that are not in the sample’s true set of germline V alleles. We then describe a new method for inferring each individual’s germline gene set from deep sequencing data, and show that it improves upon existing methods by making a detailed comparison on a variety of simulated and real data samples. This new method has been integrated into the partis annotation and clonal family inference package, available at https://github.com/psathyrella/partis, and is run by default without affecting overall run time.
Author summary
Antibodies are an important component of the adaptive immune system, which itself determines our response to both pathogens and vaccines. They are produced by B cells through somatic recombination of germline DNA, which results in a vast diversity of antigen binding affinities across the B cell repertoire. We typically learn about the development of this repertoire, and its history of interaction with antigens, by sequencing large numbers of the DNA sequences from which antibodies are derived.
In order to understand such data, it is necessary to determine the combination of germline V, D, and J genes that was rearranged to form each such B cell receptor sequence. This is difficult, however, because the immunoglobulin locus exhibits an extraordinary level of diversity across individuals—encompassing both allelic variation and gene duplication, deletion, and conversion—and because the locus’s large size and repetitive structure make germline sequencing very difficult. In this paper we describe a new computational method that avoids this difficulty by inferring each individual’s set of immunoglobulin germline genes directly from expressed B cell receptor sequence data.
- Published
- 2019
7. A Graph-Centric Approach for Metagenome-Guided Peptide and Protein Identification in Metaproteomics
- Author
- Haixu Tang, Sujun Li, and Yuzhen Ye
- Subjects
0301 basic medicine; Proteomics; Peptide; Plant Science; Biochemistry; De Bruijn graph; Database and Informatics Methods; Tandem Mass Spectrometry; Database Searching; Photosynthesis; lcsh:QH301-705.5; chemistry.chemical_classification; Ecology; Plant Biochemistry; Microbiota; Genomics; 6. Clean water; Computational Theory and Mathematics; Modeling and Simulation; symbols; Sequence Analysis; Algorithms; Research Article; Gene prediction; Sequence Databases; Computational biology; Biology; Research and Analysis Methods; 03 medical and health sciences; Cellular and Molecular Neuroscience; symbols.namesake; Genetics; Ribulose-1,5-Bisphosphate Carboxylase Oxygenase; Humans; Molecular Biology Techniques; Sequencing Techniques; Sequence Similarity Searching; Gene Prediction; Gene; Molecular Biology; Ecology, Evolution, Behavior and Systematics; Sequence Assembly Tools; Biology and Life Sciences; Computational Biology; Proteins; Genome Analysis; 030104 developmental biology; Biological Databases; lcsh:Biology (General); chemistry; Metagenomics; Metaproteomics; Protein identification; Peptides
- Abstract
Metaproteomic studies adopt the common bottom-up proteomics approach to investigate the protein composition and the dynamics of protein expression in microbial communities. When matched metagenomic and/or metatranscriptomic data of the microbial communities are available, metaproteomic data analyses often employ a metagenome-guided approach, in which complete or fragmental protein-coding genes are first directly predicted from metagenomic (and/or metatranscriptomic) sequences or from their assemblies, and the resulting protein sequences are then used as the reference database for peptide/protein identification from MS/MS spectra. This approach is often limited because protein coding genes predicted from metagenomes are incomplete and fragmental. In this paper, we present a graph-centric approach to improving metagenome-guided peptide and protein identification in metaproteomics. Our method exploits the de Bruijn graph structure reported by metagenome assembly algorithms to generate a comprehensive database of protein sequences encoded in the community. We tested our method using several public metaproteomic datasets with matched metagenomic and metatranscriptomic sequencing data acquired from complex microbial communities in a biological wastewater treatment plant. The results showed that many more peptides and proteins can be identified when assembly graphs were utilized, improving the characterization of the proteins expressed in the microbial communities. The additional proteins we identified contribute to the characterization of important pathways such as those involved in degradation of chemical hazards. 
Our tools are released as open-source software on GitHub at https://github.com/COL-IU/Graph2Pro.
Author Summary
In recent years, meta-omic (including metatranscriptomic and metaproteomic) techniques have been adopted as complementary approaches to metagenomic sequencing to study functional characteristics and dynamics of microbial communities, aiming at a holistic understanding of how a community responds to changes in the environment. Currently, metaproteomic data are largely analyzed using the bioinformatics tools originally designed for bottom-up proteomics. In particular, recent metaproteomic studies employed a metagenome-guided approach, in which complete or fragmental protein-coding genes were first predicted from metagenomic sequences (i.e., contigs or scaffolds), acquired from the matched community samples, and predicted protein sequences were then used in peptide identification. A key challenge of this approach is that the protein-coding genes predicted from assembled metagenomic contigs can be incomplete and fragmented due to the complexity of metagenomic samples and the short read length in metagenomic sequencing. To address this issue, in this paper, we present a graph-centric approach that exploits the de Bruijn graph structure reported by metagenome assembly algorithms to improve metagenome-guided peptide and protein identification in metaproteomics. We show that our method can identify many more peptides and proteins, improving the characterization of the proteins expressed in the microbial communities.
- Published
- 2016
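The abstract above relies on the de Bruijn graph structure produced by metagenome assemblers. As a minimal sketch of that data structure only, the snippet below builds a de Bruijn graph from short reads, with (k-1)-mers as nodes and k-mers as edges. The function name and the toy reads are hypothetical; this does not reproduce Graph2Pro, which works on graphs emitted by an assembler rather than building its own.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: each k-mer links its (k-1)-prefix to its (k-1)-suffix."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k across the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Hypothetical overlapping reads; together they spell the path AC -> CG -> GT -> TA.
toy_graph = de_bruijn_graph(["ACGT", "CGTA"], k=3)
for node in sorted(toy_graph):
    print(node, "->", sorted(toy_graph[node]))
```

Because overlapping reads share nodes, a gene that is split across several contigs can still appear as a connected path in the graph, which is the property a graph-centric database search can exploit.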
8. Pathogenicity and functional impact of non-frameshifting insertion/deletion variation in the human genome.
- Author
- Pagel, Kymberleigh A., Antaki, Danny, Lian, AoJie, Mort, Matthew, Cooper, David N., Sebat, Jonathan, Iakoucheva, Lilia M., Mooney, Sean D., and Radivojac, Predrag
- Subjects
HUMAN genome; AUTISM spectrum disorders; MICROBIAL virulence; RECURRENT neural networks; POST-translational modification; PHYSICAL sciences
- Abstract
Differentiation between phenotypically neutral and disease-causing genetic variation remains an open and relevant problem. Among different types of variation, non-frameshifting insertions and deletions (indels) represent an understudied group with widespread phenotypic consequences. To address this challenge, we present a machine learning method, MutPred-Indel, that predicts pathogenicity and identifies types of functional residues impacted by non-frameshifting insertion/deletion variation. The model shows good predictive performance as well as the ability to identify impacted structural and functional residues including secondary structure, intrinsic disorder, metal and macromolecular binding, post-translational modifications, allosteric sites, and catalytic residues. We identify structural and functional mechanisms impacted preferentially by germline variation from the Human Gene Mutation Database, recurrent somatic variation from COSMIC in the context of different cancers, as well as de novo variants from families with autism spectrum disorder. Further, the distributions of pathogenicity prediction scores generated by MutPred-Indel are shown to differentiate highly recurrent from non-recurrent somatic variation. Collectively, we present a framework to facilitate the interrogation of both pathogenicity and the functional effects of non-frameshifting insertion/deletion variants. The MutPred-Indel webserver is available at . [ABSTRACT FROM AUTHOR]
- Published
- 2019
9. Genotype-phenotype relations of the von Hippel-Lindau tumor suppressor inferred from a large-scale analysis of disease mutations and interactors.
- Author
- Minervini, Giovanni, Quaglia, Federica, Tabaro, Francesco, and Tosatto, Silvio C. E.
- Subjects
VON Hippel-Lindau disease; GENOTYPES; PHENOTYPES; TUMOR suppressor genes; GENETIC mutation
- Abstract
Familiar cancers represent a privileged point of view for studying the complex cellular events inducing tumor transformation. Von Hippel-Lindau syndrome, a familiar predisposition to develop cancer is a clear example. Here, we present our efforts to decipher the role of von Hippel-Lindau tumor suppressor protein (pVHL) in cancer insurgence. We collected high quality information about both pVHL mutations and interactors to investigate the association between patient phenotypes, mutated protein surface and impaired interactions. Our data suggest that different phenotypes correlate with localized perturbations of the pVHL structure, with specific cell functions associated to different protein surfaces. We propose five different pVHL interfaces to be selectively involved in modulating proteins regulating gene expression, protein homeostasis as well as to address extracellular matrix (ECM) and ciliogenesis associated functions. These data were used to drive molecular docking of pVHL with its interactors and guide Petri net simulations of the most promising alterations. We predict that disruption of pVHL association with certain interactors can trigger tumor transformation, inducing metabolism imbalance and ECM remodeling. Collectively taken, our findings provide novel insights into VHL-associated tumorigenesis. This highly integrated in silico approach may help elucidate novel treatment paradigms for VHL disease. [ABSTRACT FROM AUTHOR]
- Published
- 2019
10. Prediction of VRC01 neutralization sensitivity by HIV-1 gp160 sequence features.
- Author
- Magaret, Craig A., Benkeser, David C., Williamson, Brian D., Borate, Bhavesh R., Carpp, Lindsay N., Georgiev, Ivelin S., Setliff, Ian, Dingens, Adam S., Simon, Noah, Carone, Marco, Simpkins, Christopher, Montefiori, David, Alter, Galit, Yu, Wen-Han, Juraska, Michal, Edlefsen, Paul T., Karuna, Shelly, Mgodi, Nyaradzo M., Edugupanti, Srilatha, and Gilbert, Peter B.
- Subjects
BIOLOGICAL databases; HIV-positive persons; AMINO acid analysis; GLYCOSYLATION; TITERS
- Abstract
The broadly neutralizing antibody (bnAb) VRC01 is being evaluated for its efficacy to prevent HIV-1 infection in the Antibody Mediated Prevention (AMP) trials. A secondary objective of AMP utilizes sieve analysis to investigate how VRC01 prevention efficacy (PE) varies with HIV-1 envelope (Env) amino acid (AA) sequence features. An exhaustive analysis that tests how PE depends on every AA feature with sufficient variation would have low statistical power. To design an adequately powered primary sieve analysis for AMP, we modeled VRC01 neutralization as a function of Env AA sequence features of 611 HIV-1 gp160 pseudoviruses from the CATNAP database, with objectives: (1) to develop models that best predict the neutralization readouts; and (2) to rank AA features by their predictive importance with classification and regression methods. The dataset was split in half, and machine learning algorithms were applied to each half, each analyzed separately using cross-validation and hold-out validation. We selected Super Learner, a nonparametric ensemble-based cross-validated learning method, for advancement to the primary sieve analysis. This method predicted the dichotomous resistance outcome of whether the IC50 neutralization titer of VRC01 for a given Env pseudovirus is right-censored (indicating resistance) with an average validated AUC of 0.868 across the two hold-out datasets. Quantitative log IC50 was predicted with an average validated R² of 0.355. Features predicting neutralization sensitivity or resistance included 26 surface-accessible residues in the VRC01 and CD4 binding footprints, the length of gp120, the length of Env, the number of cysteines in gp120, the number of cysteines in Env, and 4 potential N-linked glycosylation sites; the top features will be advanced to the primary sieve analysis. This modeling framework may also inform the study of VRC01 in the treatment of HIV-infected persons. [ABSTRACT FROM AUTHOR]
- Published
- 2019
11. Identifying individual risk rare variants using protein structure guided local tests (POINT).
- Author
- Marceau West, Rachel, Lu, Wenbin, Rotroff, Daniel M., Kuenemann, Melaine A., Chang, Sheng-Mao, Wu, Michael C., Wagner, Michael J., Buse, John B., Motsinger-Reif, Alison A., Fourches, Denis, and Tzeng, Jung-Ying
- Subjects
PROTEIN structure; PROTEOMICS; OPERATOR theory; QUANTITATIVE research; KERNEL functions; BIOLOGICAL databases
- Abstract
Rare variants are of increasing interest to genetic association studies because of their etiological contributions to human complex diseases. Due to the rarity of the mutant events, rare variants are routinely analyzed on an aggregate level. While aggregation analyses improve the detection of global-level signal, they are not able to pinpoint causal variants within a variant set. To perform inference on a localized level, additional information, e.g., biological annotation, is often needed to boost the information content of a rare variant. Following the observation that important variants are likely to cluster together on functional domains, we propose a protein structure guided local test (POINT) to provide variant-specific association information using structure-guided aggregation of signal. Constructed under a kernel machine framework, POINT performs local association testing by borrowing information from neighboring variants in the 3-dimensional protein space in a data-adaptive fashion. Besides merely providing a list of promising variants, POINT assigns each variant a p-value to permit variant ranking and prioritization. We assess the selection performance of POINT using simulations and illustrate how it can be used to prioritize individual rare variants in PCSK9, ANGPTL4 and CETP in the Action to Control Cardiovascular Risk in Diabetes (ACCORD) clinical trial data. [ABSTRACT FROM AUTHOR]
- Published
- 2019
12. Co-evolution networks of HIV/HCV are modular with direct association to structure and function.
- Author
- Quadeer, Ahmed Abdul, Morales-Jimenez, David, and McKay, Matthew R.
- Subjects
HEPATITIS C virus; DIAGNOSIS of HIV infections; VIRAL evolution; MULTIPLE correspondence analysis (Statistics); FLAVIVIRUSES
- Abstract
Mutational correlation patterns found in population-level sequence data for the Human Immunodeficiency Virus (HIV) and the Hepatitis C Virus (HCV) have been demonstrated to be informative of viral fitness. Such patterns can be seen as footprints of the intrinsic functional constraints placed on viral evolution under diverse selective pressures. Here, considering multiple HIV and HCV proteins, we demonstrate that these mutational correlations encode a modular co-evolutionary structure that is tightly linked to the structural and functional properties of the respective proteins. Specifically, by introducing a robust statistical method based on sparse principal component analysis, we identify near-disjoint sets of collectively-correlated residues (sectors) having mostly a one-to-one association to largely distinct structural or functional domains. This suggests that the distinct phenotypic properties of HIV/HCV proteins often give rise to quasi-independent modes of evolution, with each mode involving a sparse and localized network of mutational interactions. Moreover, individual inferred sectors of HIV are shown to carry immunological significance, providing insight for guiding targeted vaccine strategies. [ABSTRACT FROM AUTHOR]
- Published
- 2018
13. RosettaAntibodyDesign (RAbD): A general framework for computational antibody design.
- Author
- Adolf-Bryfogle, Jared, Kalyuzhniy, Oleks, Kubitz, Michael, Weitzner, Brian D., Hu, Xiaozhen, Adachi, Yumiko, Schief, William R., and Dunbrack Jr., Roland L.
- Subjects
IMMUNOGLOBULINS; EPITOPES; ANTIGENS; AMINO acids; MONOCLONAL antibodies
- Abstract
A structural-bioinformatics-based computational methodology and framework have been developed for the design of antibodies to targets of interest. RosettaAntibodyDesign (RAbD) samples the diverse sequence, structure, and binding space of an antibody to an antigen in highly customizable protocols for the design of antibodies in a broad range of applications. The program samples antibody sequences and structures by grafting structures from a widely accepted set of the canonical clusters of CDRs (North et al., J. Mol. Biol., 406:228–256, 2011). It then performs sequence design according to amino acid sequence profiles of each cluster, and samples CDR backbones using a flexible-backbone design protocol incorporating cluster-based CDR constraints. Starting from an existing experimental or computationally modeled antigen-antibody structure, RAbD can be used to redesign a single CDR or multiple CDRs with loops of different length, conformation, and sequence. We rigorously benchmarked RAbD on a set of 60 diverse antibody–antigen complexes, using two design strategies—optimizing total Rosetta energy and optimizing interface energy alone. We utilized two novel metrics for measuring success in computational protein design. The design risk ratio (DRR) is equal to the frequency of recovery of native CDR lengths and clusters divided by the frequency of sampling of those features during the Monte Carlo design procedure. Ratios greater than 1.0 indicate that the design process is picking out the native more frequently than expected from their sampled rate. We achieved DRRs for the non-H3 CDRs of between 2.4 and 4.0. The antigen risk ratio (ARR) is the ratio of frequencies of the native amino acid types, CDR lengths, and clusters in the output decoys for simulations performed in the presence and absence of the antigen. For CDRs, we achieved cluster ARRs as high as 2.5 for L1 and 1.5 for H2. 
For sequence design simulations without CDR grafting, the overall recovery for the native amino acid types for residues that contact the antigen in the native structures was 72% in simulations performed in the presence of the antigen and 48% in simulations performed without the antigen, for an ARR of 1.5. For the non-contacting residues, the ARR was 1.08. This shows that the sequence profiles are able to maintain the amino acid types of these conserved, buried sites, while recovery of the exposed, contacting residues requires the presence of the antigen-antibody interface. We tested RAbD experimentally on both a lambda and kappa antibody–antigen complex, successfully improving their affinities 10 to 50 fold by replacing individual CDRs of the native antibody with new CDR lengths and clusters. [ABSTRACT FROM AUTHOR]
- Published
- 2018
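The two ratios defined in the abstract above are simple quotients of frequencies, which a short worked example makes concrete. The sketch below recomputes the antigen risk ratio (ARR) from the 72% and 48% recovery figures quoted in the abstract; the design risk ratio (DRR) inputs are hypothetical numbers chosen for illustration, and the function name is an assumption rather than part of RAbD.

```python
def risk_ratio(numerator_freq, denominator_freq):
    """Ratio of two frequencies, as used for both DRR and ARR in the abstract."""
    if denominator_freq <= 0:
        raise ValueError("denominator frequency must be positive")
    return numerator_freq / denominator_freq


# ARR for antigen-contacting residues, using the figures quoted in the abstract:
# 72% native recovery with the antigen present, 48% without it.
arr_contact = risk_ratio(0.72, 0.48)
print(round(arr_contact, 2))  # 1.5

# DRR with hypothetical inputs (not from the paper): a native CDR cluster that is
# recovered in 30% of final designs but sampled only 10% of the time.
drr_example = risk_ratio(0.30, 0.10)
print(round(drr_example, 2))  # 3.0
```

A ratio above 1.0 means the feature is recovered more often than its baseline frequency would predict, which is how the abstract interprets DRRs between 2.4 and 4.0.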
14. DIVERSITY in binding, regulation, and evolution revealed from high-throughput ChIP.
- Author
- Mitra, Sneha, Biswas, Anushua, and Narlikar, Leelavati
- Subjects
DNA-protein interactions; CHROMATIN; IMMUNOPRECIPITATION; DNA-binding proteins; EPIGENETICS
- Abstract
Genome-wide in vivo protein-DNA interactions are routinely mapped using high-throughput chromatin immunoprecipitation (ChIP). ChIP-reported regions are typically investigated for enriched sequence-motifs, which are likely to model the DNA-binding specificity of the profiled protein and/or of co-occurring proteins. However, simple enrichment analyses can miss insights into the binding-activity of the protein. Note that ChIP reports regions making direct contact with the protein as well as those binding through intermediaries. For example, consider a ChIP experiment targeting protein X, which binds DNA at its cognate sites, but simultaneously interacts with four other proteins. Each of these proteins also binds to its own specific cognate sites along distant parts of the genome, a scenario consistent with the current view of transcriptional hubs and chromatin loops. Since ChIP will pull down all X-associated regions, the final reported data will be a union of five distinct sets of regions, each containing binding sites of one of the five proteins, respectively. Characterizing all five different motifs and the corresponding sets is important to interpret the ChIP experiment and ultimately, the role of X in regulation. We present DIVERSITY, which attempts exactly this: it partitions the data so that each partition can be characterized with its own de novo motif. DIVERSITY uses a Bayesian approach to identify the optimal number of motifs and the associated partitions, which together explain the entire dataset. This is in contrast to standard motif finders, which report motifs individually enriched in the data, but do not necessarily explain all reported regions. We show that the different motifs and associated regions identified by DIVERSITY give insights into the various complexes that may be forming along the chromatin, something that has so far not been attempted from ChIP data. Webserver at ; standalone (Mac OS X/Linux) from . [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
15. A machine learning based framework to identify and classify long terminal repeat retrotransposons.
- Author
-
Schietgat, Leander, Vens, Celine, Cerri, Ricardo, Fischer, Carlos N., Costa, Eduardo, Ramon, Jan, Carareto, Claudia M. A., and Blockeel, Hendrik
- Subjects
RETROTRANSPOSONS ,MACHINE learning ,GENOMES ,DROSOPHILA melanogaster ,ARABIDOPSIS thaliana - Abstract
Transposable elements (TEs) are repetitive nucleotide sequences that make up a large portion of eukaryotic genomes. They can move and duplicate within a genome, increasing genome size and contributing to genetic diversity within and across species. Accurate identification and classification of TEs present in a genome is an important step towards understanding their effects on genes and their role in genome evolution. We introduce TE-L, a framework based on machine learning that automatically identifies TEs in a given genome and assigns a classification to them. We present an implementation of our framework towards LTR retrotransposons, a particular type of TEs characterized by having long terminal repeats (LTRs) at their boundaries. We evaluate the predictive performance of our framework on the well-annotated genomes of Drosophila melanogaster and Arabidopsis thaliana and we compare our results for three LTR retrotransposon superfamilies with the results of three widely used methods for TE identification or classification: RM, C and LD. In contrast to these methods, TE-L is the first to incorporate machine learning techniques, outperforming these methods in terms of predictive performance, while able to learn models and make predictions efficiently. Moreover, we show that our method was able to identify TEs that none of the above methods could find, and we investigated TE-L’s predictions which did not correspond to an official annotation. It turns out that many of these predictions are in fact strongly homologous to a known TE. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
16. New computational approaches to understanding molecular protein function.
- Author
-
Fetrow, Jacquelyn S. and Babbitt, Patricia C.
- Subjects
COMPUTATIONAL biology ,PHYSIOLOGICAL effects of proteins ,PROTEIN genetics ,MOLECULAR biology ,ONTOLOGIES (Information retrieval) - Abstract
The authors discuss new computational approaches to understanding molecular protein function. They state that the Gene Ontology (GO) system of classifying function recognizes distinct ways of defining function, such as cellular component and molecular function, and mention the need to understand molecular function involving motifs, such as the sequence motifs exemplified by PRINTS and PROSITE. They note that contemporary protein superfamilies are the result of numerous genetic events.
- Published
- 2018
- Full Text
- View/download PDF
17. iDREM: Interactive visualization of dynamic regulatory networks.
- Author
-
Ding, Jun, Hagood, James S., Ambalavanan, Namasivayam, Kaminski, Naftali, and Bar-Joseph, Ziv
- Subjects
GENETIC software ,PROTEINS ,GENE expression ,MICRORNA ,PROTEOMICS - Abstract
The Dynamic Regulatory Events Miner (DREM) software reconstructs dynamic regulatory networks by integrating static protein-DNA interaction data with time series gene expression data. In recent years, several additional types of high-throughput time series data have been profiled when studying biological processes including time series miRNA expression, proteomics, epigenomics and single cell RNA-Seq. Combining all available time series and static datasets in a unified model remains an important challenge and goal. To address this challenge we have developed a new version of DREM termed interactive DREM (iDREM). iDREM provides support for all data types mentioned above and combines them with existing interaction data to reconstruct networks that can lead to novel hypotheses on the function and timing of regulators. Users can interactively visualize and query the resulting model. We showcase the functionality of the new tool by applying it to microglia developmental data from multiple labs. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
18. Improving pairwise comparison of protein sequences with domain co-occurrence.
- Author
-
Menichelli, Christophe, Gascuel, Olivier, and Bréhélin, Laurent
- Subjects
AMINO acid sequence ,BIOINFORMATICS ,PHYLOGENY ,PLASMODIUM falciparum ,PROTEIN analysis - Abstract
Comparing and aligning protein sequences is an essential task in bioinformatics. More specifically, local alignment tools like BLAST are widely used for identifying conserved protein sub-sequences, which likely correspond to protein domains or functional motifs. However, to limit the number of false positives, these tools are used with stringent sequence-similarity thresholds and hence can miss several hits, especially for species that are phylogenetically distant from reference organisms. A solution to this problem is then to integrate additional contextual information into the procedure. Here, we propose to use domain co-occurrence to increase the sensitivity of pairwise sequence comparisons. Domain co-occurrence is a strong feature of proteins, since most protein domains tend to appear with a limited number of other domains on the same protein. We propose a method to take this information into account in a typical BLAST analysis and to construct new domain families on the basis of these results. We used Plasmodium falciparum as a case study to evaluate our method. The experimental findings showed an increase of 14% in the number of significant BLAST hits and an increase of 25% in the proteome area that can be covered with a domain. Our method identified 2240 new domains for which, in most cases, no model of the Pfam database could be linked. Moreover, our study of the quality of the new domains in terms of alignment and physicochemical properties shows that they are close to those of standard Pfam domains. Source code of the proposed approach and supplementary data are available at: [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
19. Single-molecule protein identification by sub-nanopore sensors.
- Author
-
Kolmogorov, Mikhail, Kennedy, Eamonn, Dong, Zhuxin, Timp, Gregory, and Pevzner, Pavel A.
- Subjects
PROTEOMICS ,MASS spectrometry ,P-value (Statistics) ,MOLECULAR biology ,NUCLEOTIDE sequencing - Abstract
Recent advances in top-down mass spectrometry enabled identification of intact proteins, but this technology still faces challenges. For example, top-down mass spectrometry suffers from a lack of sensitivity since the ion counts for a single fragmentation event are often low. In contrast, nanopore technology is exquisitely sensitive to single intact molecules, but so far it has been successfully applied only to DNA sequencing. Here, we explore the potential of sub-nanopores for single-molecule protein identification (SMPI) and describe an algorithm for identification of the electrical current blockade signal (nanospectrum) resulting from the translocation of a denatured, linearly charged protein through a sub-nanopore. The analysis of identification p-values suggests that the current technology is already sufficient for matching nanospectra against small protein databases, e.g., protein identification in bacterial proteomes. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
20. Oncodomains: A protein domain-centric framework for analyzing rare variants in tumor samples.
- Author
-
Peterson, Thomas A., Gauran, Iris Ivy M., Park, Junyong, Park, DoHwan, and Kann, Maricel G.
- Subjects
CANCER prevention ,SOMATIC mutation ,TUMOR prevention ,GENETIC mutation ,GENE families ,PROTEIN structure - Abstract
The fight against cancer is hindered by its highly heterogeneous nature. Genome-wide sequencing studies have shown that individual malignancies contain many mutations that range from those commonly found in tumor genomes to rare somatic variants present only in a small fraction of lesions. Such rare somatic variants dominate the landscape of genomic mutations in cancer, yet efforts to correlate somatic mutations found in one or few individuals with functional roles have been largely unsuccessful. Traditional methods for identifying somatic variants that drive cancer are ‘gene-centric’ in that they consider only somatic variants within a particular gene and make no comparison to other similar genes in the same family that may play a similar role in cancer. In this work, we present oncodomain hotspots, a new ‘domain-centric’ method for identifying clusters of somatic mutations across entire gene families using protein domain models. Our analysis confirms that our approach creates a framework for leveraging structural and functional information encapsulated by protein domains into the analysis of somatic variants in cancer, enabling the assessment of even rare somatic variants by comparison to similar genes. Our results reveal a vast landscape of somatic variants that act at the level of domain families altering pathways known to be involved with cancer such as protein phosphorylation, signaling, gene regulation, and cell metabolism. Due to oncodomain hotspots’ unique ability to assess rare variants, we expect our method to become an important tool for the analysis of sequenced tumor genomes, complementing existing methods. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
21. Exhaustive search of linear information encoding protein-peptide recognition.
- Author
-
Kelil, Abdellali, Dubreuil, Benjamin, Levy, Emmanuel D., and Michnick, Stephen W.
- Subjects
MOLECULAR recognition ,PEPTIDES ,PROTEINS ,CARRIER proteins ,SACCHAROMYCES cerevisiae - Abstract
High-throughput in vitro methods have been extensively applied to identify linear information that encodes peptide recognition. However, these methods are limited in number of peptides, sequence variation, and length of peptides that can be explored, and often produce solutions that are not found in the cell. Despite the large number of methods developed to attempt addressing these issues, the exhaustive search of linear information encoding protein-peptide recognition has been so far physically unfeasible. Here, we describe a strategy, called DALEL, for the exhaustive search of linear sequence information encoded in proteins that bind to a common partner. We applied DALEL to explore binding specificity of SH3 domains in Saccharomyces cerevisiae. Using only the polypeptide sequences of SH3 domain binding proteins, we succeeded in identifying the majority of known SH3 binding sites previously discovered either in vitro or in vivo. Moreover, we discovered a number of sites with both non-canonical sequences and distinct properties that may serve ancillary roles in peptide recognition. We compared DALEL to a variety of state-of-the-art algorithms in the blind identification of known binding sites of the human Grb2 SH3 domain. We also benchmarked DALEL on curated biological motifs derived from the ELM database to evaluate the effect of increasing/decreasing the enrichment of the motifs. Our strategy can be applied in conjunction with experimental data of proteins interacting with a common partner to identify binding sites among them. Yet, our strategy can also be applied to any group of proteins of interest to identify enriched linear motifs or to exhaustively explore the space of linear information. Finally, we have developed a webserver located at , offering user-friendly interface and providing different scenarios utilizing DALEL. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
22. An Atlas of Peroxiredoxins Created Using an Active Site Profile-Based Approach to Functionally Relevant Clustering of Proteins.
- Author
-
Harper, Angela F., Leuthaeuser, Janelle B., Babbitt, Patricia C., Morris, John H., Ferrin, Thomas E., Poole, Leslie B., and Fetrow, Jacquelyn S.
- Subjects
PEROXIREDOXINS ,PROTEINS ,ANTIOXIDANTS ,ENZYMES ,CELLULAR signal transduction - Abstract
Peroxiredoxins (Prxs or Prdxs) are a large protein superfamily of antioxidant enzymes that rapidly detoxify damaging peroxides and/or affect signal transduction and, thus, have roles in proliferation, differentiation, and apoptosis. Prx superfamily members are widespread across phylogeny and multiple methods have been developed to classify them. Here we present an updated atlas of the Prx superfamily identified using a novel method called MISST (Multi-level Iterative Sequence Searching Technique). MISST is an iterative search process developed to be both agglomerative, to add sequences containing similar functional site features, and divisive, to split groups when functional site features suggest distinct functionally-relevant clusters. Superfamily members need not be identified initially—MISST begins with a minimal representative set of known structures and searches GenBank iteratively. Further, the method’s novelty lies in the manner in which isofunctional groups are selected; rather than use a single or shifting threshold to identify clusters, the groups are deemed isofunctional when they pass a self-identification criterion, such that the group identifies itself and nothing else in a search of GenBank. The method was preliminarily validated on the Prxs, as the Prxs presented challenges of both agglomeration and division. For example, previous sequence analysis clustered the Prx functional families Prx1 and Prx6 into one group. Subsequent expert analysis clearly identified Prx6 as a distinct functionally relevant group. The MISST process distinguishes these two closely related, though functionally distinct, families. Through MISST search iterations, over 38,000 Prx sequences were identified, which the method divided into six isofunctional clusters, consistent with previous expert analysis. The results represent the most complete computational functional analysis of proteins comprising the Prx superfamily. 
The feasibility of this novel method is demonstrated by the Prx superfamily results, laying the foundation for potential functionally relevant clustering of the universe of protein sequences. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
23. Improvement in Protein Domain Identification Is Reached by Breaking Consensus, with the Agreement of Many Profiles and Domain Co-occurrence.
- Author
-
Bernardes, Juliana, Zaverucha, Gerson, Vaquero, Catherine, and Carbone, Alessandra
- Subjects
PROTEIN domains ,PROTEINS ,AMINO acid sequence ,MEDICAL decision making ,ALGORITHMS - Abstract
Traditional protein annotation methods describe known domains with probabilistic models representing consensus among homologous domain sequences. However, when relevant signals become too weak to be identified by a global consensus, attempts for annotation fail. Here we address the fundamental question of domain identification for highly divergent proteins. By using high performance computing, we demonstrate that the limits of state-of-the-art annotation methods can be bypassed. We design a new strategy based on the observation that many structural and functional protein constraints are not globally conserved through all species but might be locally conserved in separate clades. We propose a novel exploitation of the large amount of data available: 1. for each known protein domain, several probabilistic clade-centered models are constructed from a large and differentiated panel of homologous sequences, 2. a decision-making protocol combines outcomes obtained from multiple models, 3. a multi-criteria optimization algorithm finds the most likely protein architecture. The method is evaluated for domain and architecture prediction over several datasets and statistical testing hypotheses. Its performance is compared against HMMScan and HHblits, two widely used search methods based on sequence-profile and profile-profile comparison. Due to their closeness to actual protein sequences, clade-centered models are shown to be more specific and functionally predictive than the broadly used consensus models. Based on them, we improved annotation of Plasmodium falciparum protein sequences on a scale not previously possible. We successfully predict at least one domain for 72% of P. falciparum proteins against 63% achieved previously, corresponding to 30% of improvement over the total number of Pfam domain predictions on the whole genome. 
The method is applicable to any genome and opens new avenues to tackle evolutionary questions such as the reconstruction of ancient domain duplications, the reconstruction of the history of protein architectures, and the estimation of protein domain age. Website and software: . [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
24. A Multi-scale Computational Platform to Mechanistically Assess the Effect of Genetic Variation on Drug Responses in Human Erythrocyte Metabolism.
- Author
-
Mih, Nathan, Brunk, Elizabeth, Bordbar, Aarash, and Palsson, Bernhard O.
- Subjects
METABOLOMICS ,DRUG therapy ,METHYLTRANSFERASES ,DEHYDROGENASES ,PHARMACOLOGY ,COMPUTATIONAL biology - Abstract
Progress in systems medicine brings promise to addressing patient heterogeneity and individualized therapies. Recently, genome-scale models of metabolism have been shown to provide insight into the mechanistic link between drug therapies and systems-level off-target effects while being expanded to explicitly include the three-dimensional structure of proteins. The integration of these molecular-level details, such as the physical, structural, and dynamical properties of proteins, notably expands the computational description of biochemical network-level properties and the possibility of understanding and predicting whole cell phenotypes. In this study, we present a multi-scale modeling framework that describes biological processes which range in scale from atomistic details to an entire metabolic network. Using this approach, we can understand how genetic variation, which impacts the structure and reactivity of a protein, influences both native and drug-induced metabolic states. As a proof-of-concept, we study three enzymes (catechol-O-methyltransferase, glucose-6-phosphate dehydrogenase, and glyceraldehyde-3-phosphate dehydrogenase) and their respective genetic variants which have clinically relevant associations. Using all-atom molecular dynamics simulations enables the sampling of long timescale conformational dynamics of the proteins (and their mutant variants) in complex with their respective native metabolites or drug molecules. We find that changes in a protein’s structure due to a mutation influence protein binding affinity to metabolites and/or drug molecules, and inflict large-scale changes in metabolism. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
25. Metagenome and Metatranscriptome Analyses Using Protein Family Profiles.
- Author
-
Zhong, Cuncong, Edlund, Anna, Yang, Youngik, McLean, Jeffrey S., and Yooseph, Shibu
- Subjects
METABOLIC profile tests ,GENE expression ,MARKOV processes ,ANTI-infective agents ,NUCLEOTIDE sequence ,ESTIMATION theory - Abstract
Analyses of metagenome data (MG) and metatranscriptome data (MT) are often challenged by a paucity of complete reference genome sequences and the uneven/low sequencing depth of the constituent organisms in the microbial community, which respectively limit the power of reference-based alignment and de novo sequence assembly. These limitations make accurate protein family classification and abundance estimation challenging, which in turn hamper downstream analyses such as abundance profiling of metabolic pathways, identification of differentially encoded/expressed genes, and de novo reconstruction of complete gene and protein sequences from the protein family of interest. The profile hidden Markov model (HMM) framework enables the construction of very useful probabilistic models for protein families that allow for accurate modeling of position specific matches, insertions, and deletions. We present a novel homology detection algorithm that integrates banded Viterbi algorithm for profile HMM parsing with an iterative simultaneous alignment and assembly computational framework. The algorithm searches a given profile HMM of a protein family against a database of fragmentary MG/MT sequencing data and simultaneously assembles complete or near-complete gene and protein sequences of the protein family. The resulting program, HMM-GRASPx, demonstrates superior performance in aligning and assembling homologs when benchmarked on both simulated marine MG and real human saliva MG datasets. On real supragingival plaque and stool MG datasets that were generated from healthy individuals, HMM-GRASPx accurately estimates the abundances of the antimicrobial resistance (AMR) gene families and enables accurate characterization of the resistome profiles of these microbial communities. For real human oral microbiome MT datasets, using the HMM-GRASPx estimated transcript abundances significantly improves detection of differentially expressed (DE) genes. 
Finally, HMM-GRASPx was used to reconstruct comprehensive sets of complete or near-complete protein and nucleotide sequences for the query protein families. HMM-GRASPx is freely available online from . [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
26. Quantification and Classification of E. coli Proteome Utilization and Unused Protein Costs across Environments.
- Author
-
O’Brien, Edward J., Utrilla, Jose, and Palsson, Bernhard O.
- Subjects
PROTEOMICS ,ESCHERICHIA coli ,GROWTH rate ,MICRODIALYSIS ,CELL growth - Abstract
The costs and benefits of protein expression are balanced through evolution. Expression of un-utilized proteins (those that have no benefit in the current environment) incurs a quantifiable fitness cost on cellular growth rates; however, the magnitude and variability of un-utilized protein expression in natural settings is unknown, largely due to the challenge in determining environment-specific proteome utilization. We address this challenge using absolute and global proteomics data combined with a recently developed genome-scale model of Escherichia coli that computes the environment-specific cost and utility of the proteome on a per gene basis. We show that nearly half of the proteome mass is unused in certain environments and accounting for the cost of this unused protein expression explains >95% of the variance in growth rates of Escherichia coli across 16 distinct environments. Furthermore, reduction in unused protein expression is shown to be a common mechanism to increase cellular growth rates in adaptive evolution experiments. Classification of the unused protein reveals that the unused protein encodes several nutrient- and stress-preparedness functions, which may convey fitness benefits in varying environments. Thus, unused protein expression is the source of large and pervasive fitness costs that may provide the benefit of hedging against environmental change. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
27. Evolution-Based Functional Decomposition of Proteins.
- Author
-
Rivoire, Olivier, Reynolds, Kimberly A., and Ranganathan, Rama
- Subjects
BIODEGRADATION ,AMINO acids ,PROTEINS ,COEVOLUTION ,INTEGRATED software ,PYTHON programming language - Abstract
The essential biological properties of proteins—folding, biochemical activities, and the capacity to adapt—arise from the global pattern of interactions between amino acid residues. The statistical coupling analysis (SCA) is an approach to defining this pattern that involves the study of amino acid coevolution in an ensemble of sequences comprising a protein family. This approach indicates a functional architecture within proteins in which the basic units are coupled networks of amino acids termed sectors. This evolution-based decomposition has potential for new understandings of the structural basis for protein function. To facilitate its usage, we present here the principles and practice of the SCA and introduce new methods for sector analysis in a python-based software package (pySCA). We show that the pattern of amino acid interactions within sectors is linked to the divergence of functional lineages in a multiple sequence alignment—a model for how sector properties might be differentially tuned in members of a protein family. This work provides new tools for studying proteins and for generally testing the concept of sectors as the principal units of function and adaptive variation. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
28. From Binding-Induced Dynamic Effects in SH3 Structures to Evolutionary Conserved Sectors.
- Author
-
Zafra Ruano, Ana, Cilia, Elisa, Couceiro, José R., Ruiz Sanz, Javier, Schymkowitz, Joost, Rousseau, Frederic, Luque, Irene, and Lenaerts, Tom
- Subjects
HOMOLOGY (Biology) ,LIGAND binding (Biochemistry) ,CELLULAR signal transduction ,AMINO acids ,HYDROGEN bonding - Abstract
Src Homology 3 domains are ubiquitous small interaction modules known to act as docking sites and regulatory elements in a wide range of proteins. Prior experimental NMR work on the SH3 domain of Src showed that ligand binding induces long-range dynamic changes consistent with an induced fit mechanism. The identification of the residues that participate in this mechanism produces a chart that allows for the exploration of the regulatory role of such domains in the activity of the encompassing protein. Here we show that a computational approach focusing on the changes in side chain dynamics through ligand binding identifies equivalent long-range effects in the Src SH3 domain. Mutation of a subset of the predicted residues elicits long-range effects on the binding energetics, emphasizing the relevance of these positions in the definition of intramolecular cooperative networks of signal transduction in this domain. We find further support for this mechanism through the analysis of seven other publicly available SH3 domain structures of which the sequences represent diverse SH3 classes. By comparing the eight predictions, we find that, in addition to a dynamic pathway that is relatively conserved throughout all SH3 domains, there are dynamic aspects specific to each domain and homologous subgroups. Our work shows, for the first time from a structural perspective, which transduction mechanisms are common between a subset of closely related and distal SH3 domains, while at the same time highlighting the differences in signal transduction that make each family member unique. These results resolve the missing link between structural predictions of dynamic changes and the domain sectors recently identified for SH3 domains through sequence analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
29. Computational Identification of Genomic Features That Influence 3D Chromatin Domain Formation.
- Author
-
Mourad, Raphaël and Cuvier, Olivier
- Subjects
GENE expression ,PROTEINS ,DROSOPHILA ,COHESINS ,POLYCOMB group proteins - Abstract
Recent advances in long-range Hi-C contact mapping have revealed the importance of the 3D structure of chromosomes in gene expression. A current challenge is to identify the key molecular drivers of this 3D structure. Several genomic features, such as architectural proteins and functional elements, were shown to be enriched at topological domain borders using classical enrichment tests. Here we propose multiple logistic regression to identify those genomic features that positively or negatively influence domain border establishment or maintenance. The model is flexible, and can account for statistical interactions among multiple genomic features. Using both simulated and real data, we show that our model outperforms enrichment test and non-parametric models, such as random forests, for the identification of genomic features that influence domain borders. Using Drosophila Hi-C data at a very high resolution of 1 kb, our model suggests that, among architectural proteins, BEAF-32 and CP190 are the main positive drivers of 3D domain borders. In humans, our model identifies well-known architectural proteins CTCF and cohesin, as well as ZNF143 and Polycomb group proteins as positive drivers of domain borders. The model also reveals the existence of several negative drivers that counteract the presence of domain borders including P300, RXRA, BCL11A and ELK1. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
30. Effective Design of Multifunctional Peptides by Combining Compatible Functions.
- Author
-
Diener, Christian, Garza Ramos Martínez, Georgina, Moreno Blas, Daniel, Castillo González, David A., Corzo, Gerardo, Castro-Obregon, Susana, and Del Rio, Gabriel
- Subjects
PEPTIDE antibiotics ,DNA-binding proteins ,CELL-penetrating peptides ,MACHINE learning ,PHEROMONES - Abstract
Multifunctionality is a common trait of many natural proteins and peptides, yet the rules to generate such multifunctionality remain unclear. We propose that the rules defining some protein/peptide functions are compatible. To explore this hypothesis, we trained a computational method to predict cell-penetrating peptides at the sequence level and learned that antimicrobial peptides and DNA-binding proteins are compatible with the rules of our predictor. Based on this finding, we expected that designing peptides for CPP activity may render AMP and DNA-binding activities. To test this prediction, we designed peptides that embedded two independent functional domains (nuclear localization and yeast pheromone activity), linked by optimizing their composition to fit the rules characterizing cell-penetrating peptides. These peptides presented effective cell penetration, DNA-binding, pheromone and antimicrobial activities, thus confirming the effectiveness of our computational approach to design multifunctional peptides with potential therapeutic uses. Our computational implementation is available at . [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
31. APP Is a Context-Sensitive Regulator of the Hippocampal Presynaptic Active Zone.
- Author
-
Laßek, Melanie, Weingarten, Jens, Wegner, Martin, Mueller, Benjamin F., Rohmer, Marion, Baeumlisberger, Dominic, Arrey, Tabiwang N., Hick, Meike, Ackermann, Jörg, Acker-Palmer, Amparo, Koch, Ina, Müller, Ulrike, Karas, Michael, and Volknandt, Walter
- Subjects
AMYLOID beta-protein precursor ,HIPPOCAMPUS physiology ,ALZHEIMER'S disease risk factors ,LABORATORY mice ,PROTEIN-protein interactions - Abstract
The hallmarks of Alzheimer’s disease (AD) are characterized by cognitive decline and behavioral changes. The most prominent brain region affected by the progression of AD is the hippocampal formation. The pathogenesis involves a successive loss of hippocampal neurons accompanied by a decline in learning and memory consolidation mainly attributed to an accumulation of senile plaques. The amyloid precursor protein (APP) has been identified as precursor of Aβ-peptides, the main constituents of senile plaques. Until now, little is known about the physiological function of APP within the central nervous system. The allocation of APP to the proteome of the highly dynamic presynaptic active zone (PAZ) highlights APP as a yet unknown player in neuronal communication and signaling. In this study, we analyze the impact of APP deletion on the hippocampal PAZ proteome. The native hippocampal PAZ derived from APP mouse mutants (APP-KOs and NexCreAPP/APLP2-cDKOs) was isolated by subcellular fractionation and immunopurification. Subsequently, an isobaric labeling was performed using TMT6 for protein identification and quantification by high-resolution mass spectrometry. We combine bioinformatics tools and biochemical approaches to address the proteomics dataset and to understand the role of individual proteins. The impact of APP deletion on the hippocampal PAZ proteome was visualized by creating protein-protein interaction (PPI) networks that incorporated APP into the synaptic vesicle cycle, cytoskeletal organization, and calcium-homeostasis. The combination of subcellular fractionation, immunopurification, proteomic analysis, and bioinformatics allowed us to identify APP as structural and functional regulator in a context-sensitive manner within the hippocampal active zone network. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
32. Cache Domains That are Homologous to, but Different from PAS Domains Comprise the Largest Superfamily of Extracellular Sensors in Prokaryotes.
- Author
-
Upadhyay, Amit A., Fleetwood, Aaron D., Adebali, Ogun, Finn, Robert D., and Zhulin, Igor B.
- Subjects
HOMOLOGOUS chromosomes ,CELLULAR signal transduction ,PROKARYOTES ,PHOSPHODIESTERASES ,ADENYLATE cyclase - Abstract
Cellular receptors usually contain a designated sensory domain that recognizes the signal. Per/Arnt/Sim (PAS) domains are ubiquitous sensors in thousands of species ranging from bacteria to humans. Although PAS domains were described as intracellular sensors, recent structural studies revealed PAS-like domains in extracytoplasmic regions in several transmembrane receptors. However, these structurally defined extracellular PAS-like domains do not match sequence-derived PAS domain models, and thus their distribution across the genomic landscape remains largely unknown. Here we show that structurally defined extracellular PAS-like domains belong to the Cache superfamily, which is homologous to, but distinct from the PAS superfamily. Our newly built computational models enabled identification of Cache domains in tens of thousands of signal transduction proteins including those from important pathogens and model organisms. Furthermore, we show that Cache domains comprise the dominant mode of extracellular sensing in prokaryotes. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
33. Evolutionary Conserved Positions Define Protein Conformational Diversity.
- Author
-
Saldaño, Tadeo E., Monzon, Alexander M., Parisi, Gustavo, and Fernandez-Alberti, Sebastian
- Subjects
CONFORMERS (Chemistry) ,LIGAND binding (Biochemistry) ,DNA-ligand interactions ,MOLECULAR recognition ,CONFORMATIONAL analysis ,STATISTICAL correlation - Abstract
Conformational diversity of the native state plays a central role in modulating protein function. The selection paradigm sustains that different ligands shift the conformational equilibrium through their binding to highest-affinity conformers. Intramolecular vibrational dynamics associated to each conformation should guarantee conformational transitions, which due to its importance, could possibly be associated with evolutionary conserved traits. Normal mode analysis, based on a coarse-grained model of the protein, can provide the required information to explore these features. Herein, we present a novel procedure to identify key positions sustaining the conformational diversity associated to ligand binding. The method is applied to an adequate refined dataset of 188 paired protein structures in their bound and unbound forms. Firstly, normal modes most involved in the conformational change are selected according to their corresponding overlap with structural distortions introduced by ligand binding. The subspace defined by these modes is used to analyze the effect of simulated point mutations on preserving the conformational diversity of the protein. We find a negative correlation between the effects of mutations on these normal mode subspaces associated to ligand-binding and position-specific evolutionary conservations obtained from multiple sequence-structure alignments. Positions whose mutations are found to alter the most these subspaces are defined as key positions, that is, dynamically important residues that mediate the ligand-binding conformational change. These positions are shown to be evolutionary conserved, mostly buried aliphatic residues localized in regular structural regions of the protein like β-sheets and α-helix. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
34. A Multi-Method Approach for Proteomic Network Inference in 11 Human Cancers.
- Author
-
Şenbabaoğlu, Yasin, Sümer, Selçuk Onur, Sánchez-Vega, Francisco, Bemis, Debra, Ciriello, Giovanni, Schultz, Nikolaus, and Sander, Chris
- Subjects
PROTEOMICS ,PROTEIN expression ,PROTEIN microarrays ,PROTEIN-protein interactions ,CANCER research - Abstract
Protein expression and post-translational modification levels are tightly regulated in neoplastic cells to maintain cellular processes known as ‘cancer hallmarks’. The first Pan-Cancer initiative of The Cancer Genome Atlas (TCGA) Research Network has aggregated protein expression profiles for 3,467 patient samples from 11 tumor types using the antibody based reverse phase protein array (RPPA) technology. The resultant proteomic data can be utilized to computationally infer protein-protein interaction (PPI) networks and to study the commonalities and differences across tumor types. In this study, we compare the performance of 13 established network inference methods in their capacity to retrieve the curated Pathway Commons interactions from RPPA data. We observe that no single method has the best performance in all tumor types, but a group of six methods, including diverse techniques such as correlation, mutual information, and regression, consistently rank highly among the tested methods. We utilize the high performing methods to obtain a consensus network; and identify four robust and densely connected modules that reveal biological processes as well as suggest antibody–related technical biases. Mapping the consensus network interactions to Reactome gene lists confirms the pan-cancer importance of signal transduction pathways, innate and adaptive immune signaling, cell cycle, metabolism, and DNA repair; and also suggests several biological processes that may be specific to a subset of tumor types. Our results illustrate the utility of the RPPA platform as a tool to study proteomic networks in cancer. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
35. Evolutionary Genomics Suggests That CheV Is an Additional Adaptor for Accommodating Specific Chemoreceptors within the Chemotaxis Signaling Complex.
- Author
-
Ortega, Davi R. and Zhulin, Igor B.
- Subjects
GENOMICS ,MOLECULAR genetics ,GENOMES ,CHEMORECEPTORS ,CHEMOTAXIS - Abstract
Escherichia coli and Salmonella enterica are models for many experiments in molecular biology including chemotaxis, and most of the results obtained with one organism have been generalized to another. While most components of the chemotaxis pathway are strongly conserved between the two species, Salmonella genomes contain some chemoreceptors and an additional protein, CheV, that are not found in E. coli. The role of CheV was examined in distantly related species Bacillus subtilis and Helicobacter pylori, but its role in bacterial chemotaxis is still not well understood. We tested a hypothesis that in enterobacteria CheV functions as an additional adaptor linking the CheA kinase to certain types of chemoreceptors that cannot be effectively accommodated by the universal adaptor CheW. Phylogenetic profiling, genomic context and comparative protein sequence analyses suggested that CheV interacts with specific domains of CheA and chemoreceptors from an orthologous group exemplified by the Salmonella McpC protein. Structural consideration of the conservation patterns suggests that CheV and CheW share the same binding spot on the chemoreceptor structure, but have some affinity bias towards chemoreceptors from different orthologous groups. Finally, published experimental results and data newly obtained via comparative genomics support the idea that CheV functions as a “phosphate sink” possibly to off-set the over-stimulation of the kinase by certain types of chemoreceptors. Overall, our results strongly suggest that CheV is an additional adaptor for accommodating specific chemoreceptors within the chemotaxis signaling complex. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
36. Systems-Wide Prediction of Enzyme Promiscuity Reveals a New Underground Alternative Route for Pyridoxal 5’-Phosphate Production in E. coli.
- Author
-
Oberhardt, Matthew A., Zarecki, Raphy, Reshef, Leah, Xia, Fangfang, Duran-Frigola, Miquel, Schreiber, Rachel, Henry, Christopher S., Ben-Tal, Nir, Dwyer, Daniel J., Gophna, Uri, and Ruppin, Eytan
- Subjects
ENZYME promiscuity ,ENZYME kinetics ,VITAMIN B6 ,ESCHERICHIA coli ,COMPUTATIONAL biology - Abstract
Recent insights suggest that non-specific and/or promiscuous enzymes are common and active across life. Understanding the role of such enzymes is an important open question in biology. Here we develop a genome-wide method, PROPER, that uses a permissive PSI-BLAST approach to predict promiscuous activities of metabolic genes. Enzyme promiscuity is typically studied experimentally using multicopy suppression, in which over-expression of a promiscuous ‘replacer’ gene rescues lethality caused by inactivation of a ‘target’ gene. We use PROPER to predict multicopy suppression in Escherichia coli, achieving highly significant overlap with published cases (hypergeometric p = 4.4e-13). We then validate three novel predicted target-replacer gene pairs in new multicopy suppression experiments. We next go beyond PROPER and develop a network-based approach, GEM-PROPER, that integrates PROPER with genome-scale metabolic modeling to predict promiscuous replacements via alternative metabolic pathways. GEM-PROPER predicts a new indirect replacer (thiG) for an essential enzyme (pdxB) in production of pyridoxal 5’-phosphate (the active form of Vitamin B6), which we validate experimentally via multicopy suppression. We perform a structural analysis of thiG to determine its potential promiscuous active site, which we validate experimentally by inactivating the pertaining residues and showing a loss of replacer activity. Thus, this study is a successful example where a computational investigation leads to a network-based identification of an indirect promiscuous replacement of a key metabolic enzyme, which would have been extremely difficult to identify directly. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
37. Binding Site Identification and Flexible Docking of Single Stranded RNA to Proteins Using a Fragment-Based Approach.
- Author
-
Chauvot de Beauchene, Isaure, de Vries, Sjoerd J., and Zacharias, Martin
- Subjects
RNA-protein interactions ,PROTEIN binding ,BINDING sites ,NUCLEOTIDES ,COMPUTATIONAL biology - Abstract
Protein-RNA docking is hampered by the high flexibility of RNA, and particularly single-stranded RNA (ssRNA). Yet, ssRNA regions typically carry the specificity of protein recognition. The lack of methodology for modeling such regions limits the accuracy of current protein-RNA docking methods. We developed a fragment-based approach to model protein-bound ssRNA, based on the structure of the protein and the sequence of the RNA, without any prior knowledge of the RNA binding site or the RNA structure. The conformational diversity of each fragment is sampled by an exhaustive RNA fragment library that was created from all the existing experimental structures of protein-ssRNA complexes. A systematic and detailed analysis of fragment-based ssRNA docking was performed which constitutes a proof-of-principle for the fragment-based approach. The method was tested on two 8-homo-nucleotide ssRNA-protein complexes and was able to identify the binding site on the protein within 10 Å. Moreover, a structure of each bound ssRNA could be generated in close agreement with the crystal structure with a mean deviation of ~1.5 Å except for a terminal nucleotide. This is the first time a bound ssRNA could be modeled from sequence with high precision. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
38. Evolutionary and Functional Relationships in the Truncated Hemoglobin Family.
- Author
-
Bustamante, Juan P., Radusky, Leandro, Boechi, Leonardo, Estrin, Darío A., ten Have, Arjen, and Martí, Marcelo A.
- Subjects
HEMOGLOBINS ,AMINO acid sequence ,BIOLOGICAL research ,BIOLOGICAL evolution ,PROTEIN structure - Abstract
Predicting function from sequence is an important goal in current biological research, and although, broad functional assignment is possible when a protein is assigned to a family, predicting functional specificity with accuracy is not straightforward. If function is provided by key structural properties and the relevant properties can be computed using the sequence as the starting point, it should in principle be possible to predict function in detail. The truncated hemoglobin family presents an interesting benchmark study due to their ubiquity, sequence diversity in the context of a conserved fold and the number of characterized members. Their functions are tightly related to O2 affinity and reactivity, as determined by the association and dissociation rate constants, both of which can be predicted and analyzed using in-silico based tools. In the present work we have applied a strategy, which combines homology modeling with molecular based energy calculations, to predict and analyze function of all known truncated hemoglobins in an evolutionary context. Our results show that truncated hemoglobins present conserved family features, but that its structure is flexible enough to allow the switch from high to low affinity in a few evolutionary steps. Most proteins display moderate to high oxygen affinities and multiple ligand migration paths, which, besides some minor trends, show heterogeneous distributions throughout the phylogenetic tree, again suggesting fast functional adaptation. Our data not only deepens our comprehension of the structural basis governing ligand affinity, but they also highlight some interesting functional evolutionary trends. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
39. Identifying individual risk rare variants using protein structure guided local tests (POINT)
- Author
-
Sheng Mao Chang, Jung-Ying Tzeng, Denis Fourches, Daniel M. Rotroff, Mélaine A. Kuenemann, John B. Buse, Wenbin Lu, Michael J. Wagner, Michael C. Wu, Alison A. Motsinger-Reif, and Rachel Marceau West
- Subjects
0301 basic medicine ,Proteomics ,Heredity ,Computer science ,Kernel Functions ,Test Statistics ,Inference ,Biochemistry ,Database and Informatics Methods ,Protein Structure Databases ,0302 clinical medicine ,Mathematical and Statistical Techniques ,Risk Factors ,Macromolecular Structure Analysis ,Biochemical Simulations ,Biology (General) ,Operator Theory ,Ecology ,Proteomic Databases ,Statistics ,Genetic Mapping ,Kernel method ,Computational Theory and Mathematics ,Modeling and Simulation ,Physical Sciences ,Structural Proteins ,Proprotein Convertase 9 ,Research Article ,Protein Structure ,QH301-705.5 ,Association (object-oriented programming) ,Variant Genotypes ,Computational biology ,Research and Analysis Methods ,Ranking (information retrieval) ,03 medical and health sciences ,Cellular and Molecular Neuroscience ,Genetics ,Angiopoietin-Like Protein 4 ,Humans ,Computer Simulation ,Genetic Predisposition to Disease ,Statistical Methods ,Set (psychology) ,Molecular Biology ,Ecology, Evolution, Behavior and Systematics ,Selection (genetic algorithm) ,Genetic Association Studies ,Genetic association ,Statistical hypothesis testing ,Models, Genetic ,Biology and Life Sciences ,Proteins ,Computational Biology ,Genetic Variation ,Sequence Analysis, DNA ,Cholesterol Ester Transfer Proteins ,Protein Structure, Tertiary ,030104 developmental biology ,Biological Databases ,030217 neurology & neurosurgery ,Mathematics - Abstract
Rare variants are of increasing interest to genetic association studies because of their etiological contributions to human complex diseases. Due to the rarity of the mutant events, rare variants are routinely analyzed on an aggregate level. While aggregation analyses improve the detection of global-level signal, they are not able to pinpoint causal variants within a variant set. To perform inference on a localized level, additional information, e.g., biological annotation, is often needed to boost the information content of a rare variant. Following the observation that important variants are likely to cluster together on functional domains, we propose a protein structure guided local test (POINT) to provide variant-specific association information using structure-guided aggregation of signal. Constructed under a kernel machine framework, POINT performs local association testing by borrowing information from neighboring variants in the 3-dimensional protein space in a data-adaptive fashion. Besides merely providing a list of promising variants, POINT assigns each variant a p-value to permit variant ranking and prioritization. We assess the selection performance of POINT using simulations and illustrate how it can be used to prioritize individual rare variants in PCSK9, ANGPTL4 and CETP in the Action to Control Cardiovascular Risk in Diabetes (ACCORD) clinical trial data.
Author summary: While it is known that rare variants play an important role in understanding associations between genotype and complex diseases, pinpointing individual rare variants likely to be responsible for association is still a daunting task. Due to their low frequency in the population and reduced signal, localizing causal rare variants often requires additional information, such as type of DNA change or location of variant along the sequence, to be incorporated in a biologically meaningful fashion that does not overpower the genotype data.
In this paper, we use the observation that important variants tend to cluster together on functional domains to propose a new approach for prioritizing rare variants: the protein structure guided local test (POINT). POINT uses a gene’s 3-dimensional protein folding structure to guide aggregation of information from neighboring variants in the protein in a robust manner. We show how POINT improves selection performance over existing methods. We further illustrate how it can be used to prioritize individual rare variants using the Action to Control Cardiovascular Risk in Diabetes (ACCORD) clinical trial data, finding promising variants within genes in association with lipoprotein-related outcomes.
- Published
- 2019
Discovery Service for Jio Institute Digital Library