Descriptor: "Protein similarity" / Language: undetermined - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Protein similarity"' showing total 95 results

1. Parallelization of large-scale drug–protein binding experiments

Author: Antonios Makris, Dimitrios Michail, Mark Sawyer, and Iraklis Varlamis
Subjects: Computer Networks and Communications, business.industry, Computer science, Drug discovery, Pipeline (computing), Process (computing), 020206 networking & telecommunications, 02 engineering and technology, Parallel computing, Task (project management), Software, Memory management, Protein similarity, Hardware and Architecture, 0202 electrical engineering, electronic engineering, information engineering, Leverage (statistics), 020201 artificial intelligence & image processing, business, Pharmaceutical industry
Abstract: The pharmaceutical industry invests billions of dollars each year in new drug research. Part of this research focuses on repositioning established drugs to new disease indications and is based on “drug promiscuity”, that is, the ability of certain drugs to bind multiple proteins. The increased cost of wet-lab experiments makes in-silico alternatives a promising solution. To find similar protein targets for an existing drug, it is necessary to analyse the protein and drug structures and find potential similarities, a task that is highly demanding in computational resources. However, algorithmic advances combined with increased computational resources can make this task tractable and increase the success rate of drug discovery at significantly lower cost. The current work proposes several algorithms that implement the protein similarity task in a parallel high-performance computing environment, solve several load imbalance and memory management issues, and take maximum advantage of the available resources. The proposed optimizations achieve better memory and CPU balancing and faster execution times. Several parts of the previously linear processing pipeline, which used different software packages, have been re-engineered to improve process parallelization. Experimental results on a high-performance computing environment with up to 1024 cores and 2048 GB of memory demonstrate the effectiveness of our approach, which scales well to large numbers of protein pairs.
Published: 2019

2. Scalable remote homology detection and fold recognition in massive protein networks

Author: Wei Zhang, Molly A. Srour, Yousef Saad, Rui Kuang, Zhuliu Li, and Raphael Petegrosso
Subjects: Computer science, Computation, Cloud computing, Computational biology, Biochemistry, Homology (biology), 03 medical and health sciences, Protein similarity, Sequence Analysis, Protein, Structural Biology, Humans, CASP, Molecular Biology, 030304 developmental biology, 0303 health sciences, business.industry, 030302 biochemistry & molecular biology, Computational Biology, Proteins, ComputingMethodologies_PATTERNRECOGNITION, Scalability, Pairwise comparison, business, Protein network, Algorithms, Software
Abstract: The global connectivities in very large protein similarity networks contain traces of evolution among the proteins that can be used to detect remote evolutionary relations or structural similarities. In investigating how well a protein network captures this evolutionary information, a key limitation is the intensive computation of pairwise sequence similarities needed to construct very large protein networks. In this article, we introduce label propagation on low-rank kernel approximation (LP-LOKA) for searching massively large protein networks. LP-LOKA propagates initial protein similarities in a low-rank graph by Nyström approximation without computing all pairwise similarities. With scalable parallel implementations based on distributed memory using the message-passing interface and on Apache Hadoop/Spark in the cloud, LP-LOKA can search protein networks with one million proteins or more. In the experiments on Swiss-Prot/ADDA/CASP data, LP-LOKA significantly improved protein ranking over the widely used HMM-HMM or profile-sequence alignment methods by utilizing large protein networks. It was observed that the larger the protein similarity network, the better the performance, especially on relatively small protein superfamilies and folds. The results suggest that computing massively large protein networks is necessary to meet the growing need to annotate proteins from newly sequenced species, and that LP-LOKA is both scalable and accurate for searching such networks.
Published: 2019

3. ADEPT: a domain independent sequence alignment strategy for gpu architectures

Author: Muaaz Gul Awan, Steven Hofmeyr, Jack Deslippe, Katherine Yelick, Aydin Buluc, Oguz Selvitopi, and Leonid Oliker
Subjects: Computer science, GPU, Sequence assembly, Parallel computing, Biochemistry, Genome, Mathematical Sciences, chemistry.chemical_compound, 0302 clinical medicine, Protein similarity, Structural Biology, Nucleotide, SIMD, lcsh:QH301-705.5, chemistry.chemical_classification, 0303 health sciences, Applied Mathematics, Adept, Biological Sciences, Computer Science Applications, Amino acid, Dynamic programming, Networking and Information Technology R&D, Graph (abstract data type), lcsh:R858-859.7, DNA microarray, Algorithms, Biotechnology, Bioinformatics, Sequence alignment, lcsh:Computer applications to medicine. Medical informatics, 03 medical and health sciences, Information and Computing Sciences, Genetics, Humans, Cluster analysis, Molecular Biology, 030304 developmental biology, Alignment, Smith–Waterman algorithm, Protein, Human Genome, Computational Biology, DNA, chemistry, lcsh:Biology (General), Metagenomics, Generic health relevance, Sequence Alignment, 030217 neurology & neurosurgery, Software
Abstract: Background: Bioinformatic workflows frequently make use of automated genome assembly and protein clustering tools. At the core of most of these tools, a significant portion of execution time is spent determining the optimal local alignment between two sequences. This task is performed with the Smith-Waterman algorithm, a dynamic-programming-based method. With the advent of modern sequencing technologies and the increasing size of both genome and protein databases, a need for faster Smith-Waterman implementations has emerged. Multiple SIMD strategies for the Smith-Waterman algorithm are available for CPUs. However, with the move of HPC facilities towards accelerator-based architectures, a need for an efficient GPU-accelerated strategy has emerged. Existing GPU-based strategies have been optimized either for a specific type of character (nucleotides or amino acids) or for only a handful of application use cases. Results: In this paper, we present ADEPT, a new sequence alignment strategy for GPU architectures that is domain independent, supporting alignment of sequences from both genomes and proteins. Our proposed strategy uses GPU-specific optimizations that do not rely on the nature of the sequence. We demonstrate the feasibility of this strategy by implementing the Smith-Waterman algorithm and comparing it to similar CPU strategies as well as the fastest known GPU methods for each domain. ADEPT’s driver enables it to scale across multiple GPUs and allows easy integration into software pipelines that utilize large-scale computational systems. We have shown that the ADEPT-based Smith-Waterman algorithm demonstrates a peak performance of 360 GCUPS and 497 GCUPS for protein-based and DNA-based datasets respectively on a single GPU node (8 GPUs) of the Cori supercomputer. Overall, ADEPT shows 10x faster performance in a node-to-node comparison against a corresponding SIMD CPU implementation. Conclusions: ADEPT demonstrates performance that is comparable to or better than existing GPU strategies. We demonstrated the efficacy of ADEPT in supporting existing bioinformatics software pipelines by integrating ADEPT into MetaHipMer, a high-performance de novo metagenome assembler, and PASTIS, a high-performance protein similarity graph construction pipeline. Our results show performance boosts of 10% and 30% in MetaHipMer and PASTIS respectively.
Published: 2020
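
Note on entry 3: the abstract refers to Smith-Waterman as a dynamic-programming method for optimal local alignment. The recurrence can be sketched as below. This is a minimal CPU illustration only, not ADEPT's GPU implementation; the function name and the scoring values (match, mismatch, linear gap penalty) are assumptions chosen for the example rather than anything taken from the paper.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of sequences a and b.

    Assumed scoring scheme (illustrative only): fixed match/mismatch
    scores and a linear gap penalty.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally with a match or mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamp at zero so an alignment may start anywhere (local).
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Because the recurrence only compares characters for equality, the same code applies to nucleotide and amino-acid strings; a realistic protein scorer would normally replace the fixed match/mismatch values with a substitution matrix.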

4. FoldRec-C2C: protein fold recognition by combining cluster-to-cluster model and protein similarity network

Author: Ke Yan, Bin Liu, and Jiangyi Shao
Subjects: Models, Molecular, Protein Folding, Source code, AcademicSubjects/SCI01060, Coronavirus disease 2019 (COVID-19), Computer science, media_common.quotation_subject, Datasets as Topic, 03 medical and health sciences, Protein structure, Protein similarity, Cluster Analysis, Molecular Biology, 030304 developmental biology, media_common, 0303 health sciences, business.industry, 030302 biochemistry & molecular biology, Computational Biology, Proteins, Pattern recognition, cluster-to-cluster model, protein fold recognition, Problem Solving Protocol, seq-to-cluster model, Learning to rank, Protein folding, Artificial intelligence, Performance improvement, seq-to-seq model, business, Information Systems
Abstract: As a key to studying protein structures, protein fold recognition plays an important role in predicting the protein structures associated with COVID-19 and other important structures. However, the existing computational predictors focus only on pairwise protein similarity or on the similarity between two groups of proteins from 2-folds, whereas the homology relationship among proteins is hierarchical. A global protein similarity network can therefore contribute to performance improvement. In this study, we propose a predictor called FoldRec-C2C that globally incorporates the interactions among proteins into the prediction. In FoldRec-C2C, the protein fold recognition problem is treated as an information retrieval task in natural language processing. The initial ranking results are generated by the supervised ranking algorithm Learning to Rank, and then three re-ranking algorithms are applied to the ranking lists to adjust the results globally based on the protein similarity network: the seq-to-seq model, the seq-to-cluster model and the cluster-to-cluster model (C2C). When tested on the widely used and rigorous LINDAHL benchmark dataset, FoldRec-C2C outperforms 34 other state-of-the-art methods in this field. The source code and data of FoldRec-C2C can be downloaded from http://bliulab.net/FoldRec-C2C/download.
Published: 2020

5. A Refined 3-in-1 Fused Protein Similarity Measure: Application in Threshold-Free Hub Detection

Author: Yi Pan, Sudipta Acharya, and Laizhong Cui
Subjects: Proteomics, Proximity measure, Gene ontology, Computer science, Applied Mathematics, 0206 medical engineering, Computational Biology, Proteins, 02 engineering and technology, computer.software_genre, Measure (mathematics), Protein sequencing, Gene Ontology, Genetic similarity, Protein similarity, Genetics, Cluster Analysis, Data mining, Literature survey, Cluster analysis, computer, 020602 bioinformatics, Algorithms, Biotechnology
Abstract: An exhaustive literature survey shows that finding protein/gene similarity is an important step towards solving widespread bioinformatics problems, such as predicting protein-protein interactions, analyzing Protein-Protein Interaction Networks (PPINs), gene prioritization, and disease gene/protein detection. In this article, we propose an improved 3-in-1 fused protein similarity measure called FuSim-II. It is built by combining, as a weighted average, the biological knowledge extracted from three genomic/proteomic resources: Gene Ontology (GO), PPIN, and protein sequence. Furthermore, we show the application of the proposed measure in detecting potential hub proteins from a given PPIN. To that end, we propose a multi-objective clustering-based protein hub detection framework with FuSim-II as the underlying proximity measure. The PPINs of H. sapiens and M. musculus are chosen for experimental purposes. Unlike most existing hub-detection methods, the proposed technique does not require any protein degree cut-off or threshold to define hubs. A thorough comparison of the proposed measure against eight existing protein similarity measures, along with eight single/multi-objective clustering methods, has been carried out. Internal cluster validity indices such as Silhouette and Davies-Bouldin (DB) are used for the analytical study. In addition, a comparative performance analysis between the proposed and five existing hub-protein detection algorithms is conducted through an essentiality enrichment study. The reported results show the improved performance of FuSim-II over existing protein similarity measures in identifying functionally related proteins as well as relevant hub proteins. Supplementary material is available at http://csse.szu.edu.cn/staff/cuilz/eng/index.html.
Published: 2020

6. CHARACTERIZATION OF SOME INDIAN SESAME (Sesamum indicum L.) CULTIVARS THROUGH SOLUBLE SEED STORAGE PROTEIN MARKERS

Author: A Das, S K Pandey, T Dasgupta, and P Bhattacharya
Subjects: chemistry.chemical_classification, Coat, General Veterinary, biology, Phenology, food and beverages, 04 agricultural and veterinary sciences, biology.organism_classification, 040401 food science, General Biochemistry, Genetics and Molecular Biology, Crop, Horticulture, 0404 agricultural biotechnology, chemistry, Protein similarity, Storage protein, Sesamum, Cultivar, General Agricultural and Biological Sciences, Polyacrylamide gel electrophoresis
Abstract: Seed storage protein markers, being less sensitive to environmental fluctuation than phenological traits, have been successfully employed in assessing divergence in many crop plants. The present study aimed to find the correlation of seed storage protein markers in twenty-eight Indian sesame cultivars with their agro-ecological zone of adoption and their seed coat colour. Sodium Dodecyl Sulphate Polyacrylamide Gel Electrophoresis (SDS-PAGE) revealed twenty-two protein bands in total, of which thirteen were polymorphic with varied molecular weights. Specific bands relating to specific agro-ecologies were found. Moreover, bands of 93.40 kDa and 68.05 kDa were found to be associated with the production of darker shades of seed coat colour. The clustering pattern based on protein similarity values offered no definite grouping, either by agro-ecological zone of adoption or by seed coat colour. It is concluded that individual protein banding patterns can be linked to agro-ecological adoption zone and seed coat colour, which is helpful in divergence and phylogenetic studies in sesame.
Published: 2018

7. Real time structural search of the Protein Data Bank

Author: Jose M. Duarte, Stephen K. Burley, and Dmytro Guzenko
Subjects: 0301 basic medicine, Models, Molecular, Protein Structure Comparison, Computer science, Protein Conformation, Cell, Protein Data Bank (RCSB PDB), Normal Distribution, Polypeptide chain, Oligomer, Biochemistry, Polynomials, chemistry.chemical_compound, Structural bioinformatics, Database and Informatics Methods, 0302 clinical medicine, Protein structure, Protein similarity, Macromolecular Structure Analysis, Search problem, Biology (General), Databases, Protein, Ecology, Physics, A protein, computer.file_format, Condensed Matter Physics, Stoichiometry, Chemistry, medicine.anatomical_structure, Computational Theory and Mathematics, Modeling and Simulation, Physical Sciences, Algorithm, Sequence Analysis, Algorithms, Research Article, Normalization (statistics), Protein Structure, Multiple Alignment Calculation, QH301-705.5, Bioinformatics, Protein subunit, Sequence alignment, Research and Analysis Methods, 03 medical and health sciences, Cellular and Molecular Neuroscience, Similarity (network science), Computational Techniques, medicine, Genetics, Electron Density, Molecular Biology, Ecology, Evolution, Behavior and Systematics, Internet, Models, Statistical, Computational Biology, Proteins, Biology and Life Sciences, Polypeptides, Atomic coordinates, Protein Data Bank, Split-Decomposition Method, 030104 developmental biology, Algebra, chemistry, Peptides, computer, Sequence Alignment, 030217 neurology & neurosurgery, Software, Mathematics
Abstract: Detection of protein structure similarity is a central challenge in structural bioinformatics. Comparisons are usually performed at the polypeptide chain level; however, the functional form of a protein within the cell is often an oligomer. This fact, together with recent growth of oligomeric structures in the Protein Data Bank (PDB), demands more efficient approaches to oligomeric assembly alignment/retrieval. Traditional methods use atom-level information, which can be complicated by the presence of topological permutations within a polypeptide chain and/or subunit rearrangements. These challenges can be overcome by comparing electron density volumes directly. However, brute-force alignment of 3D data is a compute-intensive search problem. We developed a 3D Zernike moment normalization procedure to orient electron density volumes and assess similarity with unprecedented speed. Similarity searching with this approach enables real-time retrieval of proteins/protein assemblies resembling a target, from PDB or user input, together with resulting alignments (http://shape.rcsb.org). Author summary: Protein structures possess wildly varied shapes, but patterns at different levels are frequently reused by nature. Finding and classifying these similarities is fundamental to understanding evolution. Given the continued growth in the number of known protein structures in the Protein Data Bank, the task of comparing them to find the common patterns is becoming increasingly complicated. This is especially true when considering complete protein assemblies with several polypeptide chains, where the large sizes further complicate the issue. Here we present a novel method that can detect similarity between protein shapes and that works equally fast for any size of proteins or assemblies. The method looks at proteins as volumes of density distribution, departing from what is more usual in the field: similarity assessment based on atomic coordinates and chain connectivity. A volumetric function can be decomposed with a mathematical tool known as 3D Zernike polynomials, resulting in a compact description as vectors of Zernike moments. The tool was introduced in the 1990s, when it was suggested that the moments could be normalized to be invariant to rotations without losing information. Here we demonstrate that this normalization is in fact possible and that it offers a much more accurate method for assessing similarity between shapes than previous attempts.
Published: 2019

8. MADOKA: an ultra-fast approach for large-scale protein structure similarity searching

Author: Lei Deng, Hui Liu, Guolun Zhong, Chenzhe Liu, and Judong Luo
Subjects: Computer science, Protein structure alignment, Structural alignment, Parallel programming, Protein Data Bank (RCSB PDB), Structural neighbor searching, lcsh:Computer applications to medicine. Medical informatics, computer.software_genre, Biochemistry, 03 medical and health sciences, Structural bioinformatics, 0302 clinical medicine, Protein structure, Fragment (logic), Protein similarity, Similarity (network science), Structural Biology, lcsh:QH301-705.5, Molecular Biology, 030304 developmental biology, 0303 health sciences, Applied Mathematics, Methodology, Computational Biology, Proteins, Computer Science Applications, Visualization, lcsh:Biology (General), lcsh:R858-859.7, Data mining, DNA microarray, computer, 030217 neurology & neurosurgery, Algorithms, Software
Abstract: Background: Protein comparative analysis and similarity searches play essential roles in structural bioinformatics. A couple of algorithms for protein structure alignments have been developed in recent years. However, facing the rapid growth of protein structure data, improving overall comparison performance and running efficiency with massive sequences is still challenging. Results: Here, we propose MADOKA, an ultra-fast approach for massive structural neighbor searching using a novel two-phase algorithm. Initially, we apply a fast alignment between pairwise structures. Then, we employ a score to select pairs with more similarity to carry out a more accurate fragment-based residue-level alignment. MADOKA performs about 6–100 times faster than existing methods, including TM-align and SAL, in massive alignments. Moreover, the quality of structural alignment of MADOKA is better than the existing algorithms in terms of TM-score and number of aligned residues. We also develop a web server to search structural neighbors in the PDB database (about 360,000 protein chains in total), as well as additional features such as 3D structure alignment visualization. The MADOKA web server is freely available at: http://madoka.denglab.org/ Conclusions: MADOKA is an efficient approach to search for protein structure similarity. In addition, we provide a parallel implementation of MADOKA which exploits the massive power of multi-core CPUs.
Published: 2019

9. Complete Genome Sequence of Shelby, a Siphophage Infecting Carbapenemase-Producing Klebsiella pneumoniae

Author: Jason J. Gill, Heather Newkirk, Mei Liu, Robert Saldana, and Jolene Ramsey
Subjects: Genetics, Whole genome sequencing, 0303 health sciences, Klebsiella pneumoniae, Genome Sequences, Human pathogen, Carbapenemase producing, Biology, biology.organism_classification, Genome, 3. Good health, 03 medical and health sciences, 0302 clinical medicine, Immunology and Microbiology (miscellaneous), Protein similarity, 030212 general & internal medicine, Molecular Biology, Genome size, Gene, 030304 developmental biology
Abstract: Carbapenem-resistant Klebsiella pneumoniae, a bacterium of the family Enterobacteriaceae, is a high-priority antibiotic-resistant pathogen that causes nosocomial infections. Here, we describe the isolation and annotation of the K. pneumoniae siphophage Shelby, a T1-like siphophage encoding 78 proteins, of which 34 have a predicted function.
Published: 2019

10. HDInsight4PSi: Boosting performance of 3D protein structure similarity searching with HDInsight clusters in Microsoft Azure cloud

Author: Bożena Małysiak-Mrozek, Paweł Daniłowicz, and Dariusz Mrozek
Subjects: 0301 basic medicine, Information Systems and Management, Boosting (machine learning), Computer science, Big data, Cloud computing, 02 engineering and technology, computer.software_genre, Theoretical Computer Science, 03 medical and health sciences, Structural bioinformatics, Protein structure, Protein similarity, Artificial Intelligence, 0202 electrical engineering, electronic engineering, information engineering, Biological data, business.industry, computer.file_format, Protein Data Bank, Computer Science Applications, 030104 developmental biology, Control and Systems Engineering, 020201 artificial intelligence & image processing, Data mining, business, computer, Software, Macromolecule
Abstract: 3D protein structure similarity searching is one of the important processes performed in structural bioinformatics, since it allows for protein function identification and reconstruction of phylogeny for weakly related organisms. Due to the complexity of 3D protein structures and the exponential growth of protein structures in public repositories such as the Protein Data Bank, the process is time-consuming and requires increased computational resources. This makes it necessary to prepare computer systems that can deal with such huge volumes of macromolecular data. In this paper, we show how 3D protein structure similarity searching can be performed in parallel by distributing MapReduce jobs on the HDInsight cluster in the Microsoft Azure commercial cloud. Our solution combines two important computing paradigms that have gained popularity in recent years: Hadoop/MapReduce and cloud computing. Our experiments, performed on the whole repository of protein structures from the Protein Data Bank, confirm that such a technological fusion is very beneficial and can be successfully applied when performing time-consuming computations over biological data. Moreover, appropriate preparation of data reduces the time needed for computations and significantly accelerates the similarity searching.
Published: 2016

11. Visualizing and Clustering Protein Similarity Networks: Sequences, Structures, and Functions

Author: Te-Lun Mai, Chi-Ming Chen, and Geng Ming Hu
Subjects: 0301 basic medicine, Protein structure database, 030102 biochemistry & molecular biology, Protein Conformation, Protein database, General Chemistry, Biology, computer.software_genre, Biochemistry, Enzymes, Visualization, Evolution, Molecular, Structure-Activity Relationship, 03 medical and health sciences, 030104 developmental biology, Protein similarity, Molecular evolution, Cluster Analysis, Amino Acid Sequence, Protein Interaction Maps, Data mining, Cluster analysis, computer, Protein network
Abstract: Research in the recent decade has demonstrated the usefulness of protein network knowledge in furthering the study of molecular evolution of proteins, understanding the robustness of cells to perturbation, and annotating new protein functions. In this study, we aimed to provide a general clustering approach to visualize the sequence-structure-function relationship of protein networks, and to investigate possible causes of inconsistency in the protein classifications based on sequences, structures, and functions. Such visualization of protein networks could facilitate our understanding of the overall relationship among proteins and help researchers comprehend various protein databases. As a demonstration, we clustered 1437 enzymes by their sequences and structures using the minimum span clustering (MSC) method. The general structure of this protein network was delineated at two clustering resolutions, and the second-level MSC clustering was found to be highly similar to existing enzyme classifications. The clusterings of these enzymes based on sequence, structure, and function information are consistent with one another. For proteases, the Jaccard similarity coefficient is 0.86 between sequence and function classifications, 0.82 between sequence and structure classifications, and 0.78 between structure and function classifications. From our clustering results, we discuss possible examples of divergent and convergent evolution of enzymes. Our clustering approach provides a panoramic view of the sequence-structure-function network of proteins, helps visualize the relation between related proteins intuitively, and is useful in predicting the structure and function of newly determined protein sequences.
Published: 2016
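
Note on entry 11: the abstract reports Jaccard similarity coefficients between classifications but does not say how they were computed. One common way to compare two classifications of the same items is a pair-counting Jaccard index, sketched below. The function names are hypothetical, and the pair-counting formulation is an assumption rather than the authors' stated method.

```python
from itertools import combinations


def co_clustered_pairs(labels):
    """Set of index pairs (i, j), i < j, that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def jaccard_between_clusterings(labels_a, labels_b):
    """Pair-counting Jaccard index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    pairs_a = co_clustered_pairs(labels_a)
    pairs_b = co_clustered_pairs(labels_b)
    union = pairs_a | pairs_b
    if not union:
        # Neither labeling groups any two items together: treat as identical.
        return 1.0
    return len(pairs_a & pairs_b) / len(union)
```

For example, `jaccard_between_clusterings([0, 0, 1, 1], [0, 0, 1, 2])` gives 0.5: the first labeling groups two pairs of items, the second only one of them.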

12. SOV_refine: A further refined definition of segment overlap score and its significance for protein structure similarity

Author: Tong Liu and Zheng Wang
Subjects: 0301 basic medicine, Segment overlap score, Information Systems and Management, Source code, Computer science, media_common.quotation_subject, Assessment of protein secondary structure predictions, Health Informatics, lcsh:Computer applications to medicine. Medical informatics, Similarity of segmented biological sequences, Correlation, 03 medical and health sciences, 0302 clinical medicine, Protein similarity, SOV score, Protein secondary structure, media_common, Sequence, Quality assessment, Comparing different definitions of topologically associating domains, Protein structure similarity, Methodology, Protein tertiary structure, Protein secondary structure prediction, Computer Science Applications, 030104 developmental biology, Protein model, lcsh:R858-859.7, Algorithm, 030217 neurology & neurosurgery, Information Systems
Abstract: The segment overlap score (SOV) has been used to evaluate predicted protein secondary structures, a sequence composed of helix (H), strand (E), and coil (C), by comparing it with the native or reference secondary structures, another sequence of H, E, and C. SOV’s advantage is that it can consider the size of continuous overlapping segments and assign extra allowance to longer continuous overlapping segments, instead of judging only from the percentage of overlapping individual positions as the Q3 score does. However, we have found a drawback in its previous definition: it cannot ensure increasing allowance assignment when more residues in a segment are predicted accurately. A new way of assigning allowance has been designed, which keeps all the advantages of the previous SOV score definitions and ensures that the amount of allowance assigned is incremental when more elements in a segment are predicted accurately. Furthermore, our improved SOV has achieved a higher correlation with the quality of protein models measured by GDT-TS score and TM-score, indicating its better ability to evaluate tertiary structure quality at the secondary structure level. We analyzed the statistical significance of SOV scores and found the threshold values for distinguishing two protein structures (SOV_refine > 0.19) and indicating whether two proteins are under the same CATH fold (SOV_refine > 0.94 and > 0.90 for three- and eight-state secondary structures respectively). We provided two further example applications: use as a machine learning feature for protein model quality assessment, and comparison of different definitions of topologically associating domains. We proved that our newly defined SOV score resulted in better performance. The SOV score can be widely used in bioinformatics research and other fields that need to compare two sequences of letters in which continuous segments have important meanings. We also generalized the previous SOV definitions so that they can work for sequences composed of more than three states (e.g., the eight-state definition of protein secondary structures). A standalone software package has been implemented in Perl with source code released. The software can be downloaded from http://dna.cs.miami.edu/SOV/ .
Published: 2018
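
Note on entry 12: the abstract contrasts SOV with the Q3 score, which counts only the percentage of individually matching positions. The sketch below computes a Q3-style per-position agreement (it is not SOV or SOV_refine, whose definitions the abstract does not give) and shows why a per-position score cannot see segment structure. The function name and example strings are hypothetical.

```python
def q3_score(predicted, reference):
    """Fraction of positions whose predicted state equals the reference state."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    if not reference:
        raise ValueError("sequences must be non-empty")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


reference = "HHHHHHCCCC"
shortened_helix = "HHHHHCCCCC"  # helix ends one residue early
broken_helix = "HHCHHHCCCC"     # helix split into two segments

# Both predictions have one wrong position, so per-position agreement is equal.
print(q3_score(shortened_helix, reference))  # 0.9
print(q3_score(broken_helix, reference))     # 0.9
```

The two predictions score the same even though the second splits one helix into two segments, which is the kind of difference a segment-based score is meant to capture.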

13. Predicting drug-target interaction based on sequence and structure information

Author: Wei Lan, Jianxin Wang, Fang-Xiang Wu, Min Li, and Yi Pan
Subjects: Drug discovery, business.industry, Drug target, Biology, computer.software_genre, Machine learning, Support vector machine, Protein sequencing, Protein similarity, Control and Systems Engineering, Target protein, Artificial intelligence, Data mining, Drug structure, business, Classifier (UML), computer
Abstract: It is well known that discovering a new drug is a cumbersome, time-consuming and expensive process. Computational approaches for identifying interactions between drug compounds and target proteins have become important in drug discovery, since they help to reduce these obstacles. The difficulties of drug-target interaction identification include the lack of known drug-target associations and the absence of experimentally verified negative examples. In this study, we present a method, called PUDT, to predict drug-target interactions. Instead of treating unknown interactions as negative examples, we consider unknown interactions as unlabeled examples. The unlabeled examples are divided into two parts, reliable negative examples and likely negative examples, based on protein structure similarity. Then, a weighted support vector machine is used to build a classifier to predict drug-target interactions based on protein sequence and drug structure information. Four data sets (enzymes, ion channels, GPCRs and nuclear receptors) are used to evaluate the performance of the proposed method. The experimental results demonstrate that PUDT outperforms recent state-of-the-art approaches.
Published: 2015
Full Text: View/download PDF

14. Efficient inference of homologs in large eukaryotic pan-proteomes

Author: Siavash Sheikhizadeh Anari, Dick de Ridder, M. Eric Schranz, and Sandra Smit
Subjects: 0301 basic medicine, Proteome, Computer science, Biochemistry, Homology (biology), Protein similarity, Structural Biology, Phylogenomics, Cluster Analysis, Databases, Protein, lcsh:QH301-705.5, Genome, Applied Mathematics, Methodology Article, Homologous genes, Pan-genome, Eukaryota, Genomics, Biosystematiek, Computer Science Applications, lcsh:R858-859.7, DNA microarray, Functional genomics, Algorithms, Singular homology, Bioinformatics, Computational biology, lcsh:Computer applications to medicine. Medical informatics, Genes, Plant, 03 medical and health sciences, Orthology, Bioinformatica, Homologous chromosome, Humans, Cluster analysis, Molecular Biology, Comparative genomics, Sequence Homology, Amino Acid, k-mer, 030104 developmental biology, lcsh:Biology (General), Brassicaceae, Biosystematics, EPS, Software
Abstract: Background: Identification of homologous genes is fundamental to comparative genomics, functional genomics and phylogenomics. Extensive public homology databases are of great value for investigating homology but need to be continually updated to incorporate new sequences. As new sequences are rapidly being generated, there is a need for efficient standalone tools to detect homologs in novel data. Results: To address this, we present a fast method for detecting homology groups across a large number of individuals and/or species. We adopted a k-mer based approach which considerably reduces the number of pairwise protein alignments without sacrificing sensitivity. We demonstrate accuracy, scalability, efficiency and applicability of the presented method for detecting homology in large proteomes of bacteria, fungi, plants and Metazoa. Conclusions: We clearly observed the trade-off between recall and precision in our homology inference. Favoring recall or precision strongly depends on the application. The clustering behavior of our program can be optimized for particular applications by altering a few key parameters. The program is available for public use at https://github.com/sheikhizadeh/pantools as an extension to our pan-genomic analysis tool, PanTools. Electronic supplementary material: The online version of this article (10.1186/s12859-018-2362-4) contains supplementary material, which is available to authorized users.
Published: 2017
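
The k-mer idea in the abstract above can be illustrated with a minimal sketch. This is not the PanTools algorithm: the function names, the toy sequences, the value k = 3 and the threshold of two shared k-mers are all assumptions chosen for illustration. The sketch only shows how a cheap k-mer comparison can decide which protein pairs are worth a costly pairwise alignment.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_pairs(proteins, k=3, min_shared=2):
    """Yield pairs of protein ids that share at least `min_shared` k-mers.

    Only these pairs would be handed to a pairwise aligner; all other
    pairs are skipped. `k` and `min_shared` are hypothetical parameters.
    """
    profiles = {name: kmers(seq, k) for name, seq in proteins.items()}
    for a, b in combinations(sorted(profiles), 2):
        if len(profiles[a] & profiles[b]) >= min_shared:
            yield a, b


# Toy input: p1 and p2 differ by one residue, p3 is unrelated.
proteins = {
    "p1": "MKTAYIAKQR",
    "p2": "MKTAYLAKQR",
    "p3": "GGSWWPLHHC",
}
pairs = list(candidate_pairs(proteins))
```

With this toy input only the pair (p1, p2) survives the filter, so one alignment would be run instead of three. This sketch still compares every pair of k-mer sets; a real tool would more likely use an index from k-mers to proteins to avoid that.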

15. Parallelization of Large-Scale Drug-Protein Binding Experiments

Author: Antonios Makris, George Tsatsaronis, Joachim Haupt, Mark Sawyer, Iraklis Varlamis, Konstantinos Tserpes, Dimitrios Michail, and Chronis Dimitropoulos
Subjects: 0301 basic medicine, Computer science, business.industry, Scale (chemistry), 0206 medical engineering, Process (computing), 02 engineering and technology, Parallel computing, Plasma protein binding, 03 medical and health sciences, Task (computing), 030104 developmental biology, Software, Memory management, Protein structure, Protein similarity, business, 020602 bioinformatics
Abstract: Drug polypharmacology or “drug promiscuity” refers to the ability of a drug to bind multiple proteins. Such studies have a huge impact on the pharmaceutical industry, but at the same time require large investments in wet-lab experiments. The respective in-silico experiments have a significantly smaller cost and minimize the expenses for the subsequent lab experiments. However, finding similar protein targets for an existing drug relies on protein structural similarity and is a computationally demanding task. In this work, we propose several algorithms that port the protein similarity task to a parallel high-performance computing environment. The differences in size and complexity of the examined protein structures raise several issues in a naive parallelization process that significantly affect the overall time and required memory. We describe several optimizations for better memory and CPU balancing which achieve faster execution times. Experimental results, on a high-performance computing environment with 512 cores and 2048GB of memory, demonstrate the effectiveness of our approach, which scales well to large amounts of protein pairs.
Published: 2017
Full Text: View/download PDF

16. Determining protein similarity by comparing hydrophobic core structure

Author: Barbara Kalinowska, Mateusz Banach, Małgorzata Gadzała, Irena Roterman, and Leszek Konieczny
Subjects: 0301 basic medicine, Structural similarity, Bioinformatics, computer.software_genre, Article, Correlation, 03 medical and health sciences, Protein similarity, lcsh:Social sciences (General), CASP, lcsh:Science (General), Biological sciences, Multidisciplinary, Chemistry, biological sciences, bioinformatics, Protein structure prediction, 030104 developmental biology, Similarity criterion, Protein body, lcsh:H1-99, Data mining, Biological system, computer, lcsh:Q1-390
Abstract: Formal assessment of structural similarity is − next to protein structure prediction − arguably the most important unsolved problem in proteomics. In this paper we propose a similarity criterion based on commonalities between the proteins’ hydrophobic cores. The hydrophobic core emerges as a result of conformational changes through which each residue reaches its intended position in the protein body. A quantitative criterion based on this phenomenon has been proposed in the framework of the CASP challenge. The structure of the hydrophobic core − including the placement and scope of any deviations from the idealized model − may indirectly point to areas of importance from the point of view of the protein’s biological function. Our analysis focuses on an arbitrarily selected target from the CASP11 challenge. The proposed measure, while compliant with CASP criteria (70–80% correlation), involves certain adjustments which acknowledge the presence of factors other than simple spatial arrangement of solids.
Published: 2017
Full Text: View/download PDF

17. Functional Aggregation for Protein Function Prediction

Author: Jingyu Hou
Subjects: business.industry, Function (mathematics), computer.software_genre, Machine learning, Fuzzy logic, Semantic similarity, Similarity (network science), Choquet integral, Protein similarity, Protein function prediction, Data mining, Artificial intelligence, business, computer, Mathematics
Abstract: This chapter introduces a novel method that incorporates functional aggregation of proteins into protein function prediction via the Choquet–Integral technique in fuzzy theory. A new semantic protein similarity that is based on a new function similarity is also presented accordingly, which makes this new prediction approach work properly. Some possible research topics based on this new approach are presented as well.
Published: 2017
Full Text: View/download PDF

18. Searching for Domains for Protein Function Prediction

Author: Jingyu Hou
Subjects: Protein similarity, Computer science, business.industry, Protein function prediction, Function (mathematics), Artificial intelligence, Data mining, Machine learning, computer.software_genre, business, computer, Probability model
Abstract: This chapter introduces a new method of protein function prediction with an innovative algorithm of dynamically searching for suitable prediction domains during the prediction processes. The corresponding probability model, as well as the new function and protein similarity definitions, is presented in detail. Some possible research topics based on this new prediction method are also discussed.
Published: 2017
Full Text: View/download PDF

19. Simple Ligand–Receptor Interaction Descriptor (SILIRID) for alignment-free binding site comparison

Author: Gilles Marcou, Helena Gaspar, Alexandre Varnek, Vladimir Chupakhin, Chimie de la matière complexe (CMC), and Université de Strasbourg (UNISTRA)-Institut de Chimie du CNRS (INC)-Centre National de la Recherche Scientifique (CNRS)
Subjects: Stereochemistry, lcsh:Biotechnology, Nearest neighbor search, Biophysics, Druggability, Protein similarity, Plasma protein binding, Biology, computer.software_genre, Biochemistry, Article, Generative Topographic Mapping, Structural Biology, lcsh:TP248.13-248.65, Genetics, Binding site, chemistry.chemical_classification, computer.file_format, Protein Data Bank, Ligand (biochemistry), Computer Science Applications, Amino acid, Protein classification, chemistry, Chemogenomics, Interaction fingerprints, Data mining, computer, [CHIM.CHEM]Chemical Sciences/Cheminformatics, Protein–ligand interactions, Biotechnology, Integer (computer science)
Abstract: We describe SILIRID (Simple Ligand–Receptor Interaction Descriptor), a novel fixed size descriptor characterizing protein–ligand interactions. SILIRID can be obtained from the binary interaction fingerprints (IFPs) by summing up the bits corresponding to identical amino acids. This results in a vector of 168 integer numbers corresponding to the product of the number of entries (20 amino acids and one cofactor) and 8 interaction types per amino acid (hydrophobic, aromatic face to face, aromatic edge to face, H-bond donated by the protein, H-bond donated by the ligand, ionic bond with protein cation and protein anion, and interaction with metal ion). Efficiency of SILIRID to distinguish different protein binding sites has been examined in similarity search in sc-PDB database, a druggable portion of the Protein Data Bank, using various protein–ligand complexes as queries. The performance of retrieval of structurally and evolutionary related classes of proteins was comparable to that of state-of-the-art approaches (ROC AUC≈0.91). SILIRID can efficiently be used to visualize chemogenomic space covered by sc-PDB using Generative Topographic Mapping (GTM): sc-PDB SILIRID data form clusters corresponding to different protein types.
Published: 2014
Full Text: View/download PDF
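
The construction of the 168-element SILIRID vector described above (21 entries × 8 interaction types) can be sketched as follows. The ordering of the amino acids, the ordering of the eight interaction bits, the `COF` label for the cofactor entry and the input layout (one residue name plus eight bits per binding-site residue) are assumptions made for illustration; the abstract does not specify them.

```python
# 20 amino acids plus one cofactor entry (ordering is an assumption).
ENTRIES = [
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
    "COF",
]
N_TYPES = 8  # interaction types per entry, as listed in the abstract


def silirid(ifp):
    """Sum per-residue interaction bits over identical amino acids.

    `ifp` is a list of (residue_name, bits) tuples, where `bits` holds
    eight 0/1 values, one per interaction type (hypothetical layout).
    Returns a fixed-size list of 21 * 8 = 168 integers.
    """
    vector = [0] * (len(ENTRIES) * N_TYPES)
    for residue, bits in ifp:
        if len(bits) != N_TYPES:
            raise ValueError("expected 8 interaction bits per residue")
        base = ENTRIES.index(residue) * N_TYPES
        for offset, bit in enumerate(bits):
            vector[base + offset] += bit
    return vector


# Toy fingerprint: two aspartates and one phenylalanine in the site.
example_ifp = [
    ("ASP", [0, 0, 0, 0, 1, 0, 1, 0]),
    ("PHE", [1, 1, 0, 0, 0, 0, 0, 0]),
    ("ASP", [0, 0, 0, 0, 1, 0, 0, 0]),
]
descriptor = silirid(example_ifp)
```

Because the result has the same length for every complex regardless of how many residues line the binding site, two descriptors can be compared directly without aligning the sites.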

20. A topological similarity measure for proteins

Author: Dieter W. Heermann, Nicolas Wenzel, Gabriell Máté, and Andreas Hofmann
Subjects: Geometric similarity, Jaccard index, Computer science, Biophysics, Proteins, Zinc Fingers, Protein similarity, Cell Biology, Similarity measure, Topology, Measure (mathematics), Persistent intervals, Biochemistry, Similarity (network science), Simple (abstract algebra), Protein structure, Viral Membrane Proteins, Protein flexibility, Algorithms
Abstract: We introduce a new measure for assessing similarity among chemical structures, based on well-established computational-topology algorithms. We argue that although the method considers geometry, it is more than a mere geometric similarity measure, as it takes into account, on different geometric scales, the important topological features of the compared structures. We prove that our measure is rigorous and complies with the proper mathematical requirements. We validate the method through comparing different configurations of simple zinc finger proteins and present an application on ligands binding to membrane proteins extracted from the Directory of Useful Decoys: Enhanced database and corresponding decoys. This article is part of a Special Issue entitled: Viral membrane proteins - Channels for cellular networking.
Published: 2014
Full Text: View/download PDF

21. The Candida Genome Database: The new homology information page highlights protein similarity and phylogeny

Author: Marek S. Skrzypek, Gail Binkley, Farrell Wymore, Martha B. Arnaud, Stuart R. Miyasato, Jonathan Binkley, Matt Simison, Diane O. Inglis, Prachi Shah, and Gavin Sherlock
Subjects: Genetics, Internet, 0303 health sciences, Fungal protein, Sequence Homology, Amino Acid, 030306 microbiology, Genome database, Locus (genetics), Computational biology, Biology, Genome, Homology (biology), Fungal Proteins, 03 medical and health sciences, ComputingMethodologies_PATTERNRECOGNITION, Protein similarity, Phylogenetics, Databases, Genetic, Genome, Fungal, Gene, Phylogeny, IV. Viruses, bacteria, protozoa and fungi, Candida, 030304 developmental biology
Abstract: The Candida Genome Database (CGD, http://www.candidagenome.org/) is a freely available online resource that provides gene, protein and sequence information for multiple Candida species, along with web-based tools for accessing, analyzing and exploring these data. The goal of CGD is to facilitate and accelerate research into Candida pathogenesis and biology. The CGD Web site is organized around Locus pages, which display information collected about individual genes. Locus pages have multiple tabs for accessing different types of information; the default Summary tab provides an overview of the gene name, aliases, phenotype and Gene Ontology curation, whereas other tabs display more in-depth information, including protein product details for coding genes, notes on changes to the sequence or structure of the gene and a comprehensive reference list. Here, in this update to previous NAR Database articles featuring CGD, we describe a new tab that we have added to the Locus page, entitled the Homology Information tab, which displays phylogeny and gene similarity information for each locus.
Published: 2013
Full Text: View/download PDF

22. Going over the three dimensional protein structure similarity problem

Author: Nantia D. Iakovidou, Konstantinos Tsichlas, Eleftherios Tiakas, and Yannis Manolopoulos
Subjects: Scheme (programming language), Linguistics and Language, Similarity (network science), Protein similarity, Artificial Intelligence, Computer science, Process (engineering), Search engine indexing, Data mining, computer.software_genre, computer, Language and Linguistics, computer.programming_language
Abstract: This article presents in detail our novel methodology for detecting similarity between or among three-dimensional protein structures. The innovation of our algorithm lies in its ability, during the similarity process, to combine many attributes and fulfill many preconditions, which are extensively discussed throughout the paper. Our concept is also supported by an efficient and effective indexing scheme that provides convincing results compared with other known methods.
Published: 2013
Full Text: View/download PDF

23. LDS vs FPT Method for Cluster Deletion

Author: Wady Naanaa and Amel Mhamdi
Subjects: 0301 basic medicine, Random graph, Simple graph, Optimization problem, 0102 computer and information sciences, 01 natural sciences, Combinatorics, 03 medical and health sciences, 030104 developmental biology, Protein similarity, 010201 computation theory & mathematics, Partition (number theory), Cluster analysis, Constraint satisfaction problem, MathematicsofComputing_DISCRETEMATHEMATICS, Mathematics
Abstract: We consider the following vertex-partition problem on graphs: given a simple graph G = (V, E), we want to partition G into a disjoint union of cliques by only removing a minimum number of edges. This NP-hard optimization problem is referred to as the Cluster Deletion (CD). In this paper, we propose an encoding of CD in terms of a Weighted Constraint Satisfaction Problem (WCSP), a framework which has been widely used in solving hard combinatorial problems. We compare our approach with a fixed-parameter tractability algorithm, one of the most used algorithms for solving cluster deletion. Then, we experimentally show that significant results are obtained using the WCSP encoding. We compare both solution quality and running times of these algorithms on random graphs and protein similarity graphs derived from the COG dataset.
Published: 2016
Full Text: View/download PDF
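
To make the Cluster Deletion problem statement above concrete, here is a brute-force sketch that enumerates every vertex partition of a small graph and keeps the one whose blocks are all cliques with the fewest deleted edges. It is neither the WCSP encoding nor the fixed-parameter algorithm compared in the paper, and it is exponential in the number of vertices, so it is only usable on toy inputs. All function names and the example graph are illustrative assumptions.

```python
from itertools import combinations


def partitions(items):
    """Yield every partition of a list into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into each existing block in turn...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ...or open a new block for it.
        yield [[first]] + part


def cluster_deletion(vertices, edges):
    """Return (deleted_edge_count, partition) for an optimal solution."""
    edge_set = {frozenset(e) for e in edges}
    best = None
    for part in partitions(list(vertices)):
        # Every block must already be a clique in the input graph.
        is_valid = all(
            frozenset((u, v)) in edge_set
            for block in part
            for u, v in combinations(block, 2)
        )
        if not is_valid:
            continue
        kept = sum(len(b) * (len(b) - 1) // 2 for b in part)
        deleted = len(edge_set) - kept
        if best is None or deleted < best[0]:
            best = (deleted, part)
    return best


# Toy graph: a triangle {1, 2, 3} attached to the path 3-4-5.
deleted, clusters = cluster_deletion(
    [1, 2, 3, 4, 5],
    [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)],
)
```

For this graph the optimum removes the single edge (3, 4), leaving the cliques {1, 2, 3} and {4, 5}.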

24. Methods for a Rapid and Automated Description of Proteins

Author: Dieter Cremer and Zhanyong Guo
Subjects: 0301 basic medicine, 03 medical and health sciences, 030104 developmental biology, 0302 clinical medicine, Protein similarity, Computational chemistry, 030220 oncology & carcinogenesis, Protein folding, Computational biology, Granularity, Mathematics
Published: 2016
Full Text: View/download PDF

25. Three-dimensional protein model similarity analysis based on salient shape index

Author: Zhong Li, Minhong Chen, Meng Ding, and Bo Yao
Subjects: Models, Molecular, 0301 basic medicine, Computer science, Shape index, 02 engineering and technology, Machine learning, computer.software_genre, Biochemistry, 03 medical and health sciences, Protein similarity, Structural Biology, Similarity analysis, Protein model, Salient geometric feature, 0202 electrical engineering, electronic engineering, information engineering, Molecular Biology, business.industry, Applied Mathematics, Computational Biology, Proteins, Reproducibility of Results, A protein, 020207 software engineering, Pattern recognition, Shape analysis, Computer Science Applications, 030104 developmental biology, Structural Homology, Protein, Salient, Artificial intelligence, DNA microarray, business, computer, Reeb graph, Research Article
Abstract: Background Proteins play a special role in bioinformatics. The surface shape of a protein, which is an important characteristic of the protein, defines a geometric and biochemical domain where the protein interacts with other proteins. The similarity analysis among protein models has become an important topic of protein analysis, by which it can reveal the structure and the function of proteins. Results In this paper, a new protein similarity analysis method based on three-dimensional protein models is proposed. It constructs a feature matrix descriptor for each protein model combined by calculating the shape index (SI) and the related salient geometric feature (SGF), and then analyzes the protein model similarity by using this feature matrix and the extended grey relation analysis. Conclusions We compare our method to the Multi-resolution Reeb Graph (MRG) skeleton method, the L1-medial skeleton method and the local-diameter descriptor method. Experimental results show that our protein similarity analysis method is accurate and reliable while keeping the high computational efficiency.
Published: 2016
Full Text: View/download PDF

26. Accelerating 3D Protein Structure Similarity Searching on Microsoft Azure Cloud with Local Replicas of Macromolecular Data

Author: Bożena Małysiak-Mrozek, Tomasz Kutyła, and Dariusz Mrozek
Subjects: 0301 basic medicine, Database, business.industry, Process (engineering), Computer science, Cloud computing, computer.file_format, 010402 general chemistry, computer.software_genre, Protein Data Bank, 01 natural sciences, 0104 chemical sciences, Computational science, 03 medical and health sciences, Structural bioinformatics, 030104 developmental biology, Protein structure, Protein similarity, Exponential growth, Scalability, business, computer
Abstract: Searching similarities among 3D protein structures deposited in macromolecular data repositories, like Protein Data Bank, is one of the time-consuming processes performed in structural bioinformatics. When performed in one-to-many or many-to-many model, the process requires increased computational resources. Moreover, exponential growth of protein structures in the Protein Data Bank causes the necessity to prepare computer systems to be able to deal with such huge volumes of data. Cloud computing provides both, theoretically infinite computational resources and a great possibility of scaling systems out and up. In this paper, we show how 3D protein structure similarity searching can be scaled out on Microsoft Azure cloud and performed by a loosely coupled, many-task computing system with local replicas of macromolecular data.
Published: 2016
Full Text: View/download PDF

27. ProBiS-2012: web server and web services for detection of structurally similar binding sites in proteins

Author: Dušanka Janežič and Janez Konc
Subjects: Models, Molecular, Internet, Web server, Binding Sites, Protein Conformation, business.industry, Protein Data Bank (RCSB PDB), Proteins, A protein, Articles, Biology, computer.software_genre, Bioinformatics, World Wide Web, Protein similarity, Genetics, Pairwise alignment, The Internet, Enzyme Inhibitors, Web service, User interface, business, computer, Algorithms, Software
Abstract: The ProBiS web server is a web server for detection of structurally similar binding sites in the PDB and for local pairwise alignment of protein structures. In this article, we present a new version of the ProBiS web server that is 10 times faster than earlier versions, due to the efficient parallelization of the ProBiS algorithm, which now allows significantly faster comparison of a protein query against the PDB and reduces the calculation time for scanning the entire PDB from hours to minutes. It also features new web services, and an improved user interface. In addition, the new web server is united with the ProBiS-Database and thus provides instant access to pre-calculated protein similarity profiles for over 29 000 non-redundant protein structures. The ProBiS web server is particularly adept at detection of secondary binding sites in proteins. It is freely available at http://probis.cmm.ki.si/old-version, and the new ProBiS web server is at http://probis.cmm.ki.si.
Published: 2012
Full Text: View/download PDF

28. Surface-based protein binding pocket similarity

Author: Ann E. Cleves, Russell Spitzer, and Ajay N. Jain
Subjects: Surface (mathematics), Plasma protein binding, Computational biology, Biology, computer.software_genre, Multiple species, Biochemistry, Small molecule, Similarity (network science), Protein similarity, Structural Biology, Data mining, Binding site, Molecular Biology, computer, Sequence (medicine)
Abstract: Protein similarity comparisons may be made on a local or global basis and may consider sequence information or differing levels of structural information. We present a local three-dimensional method that compares protein binding site surfaces in full atomic detail. The approach is based on the morphological similarity method which has been widely applied for global comparison of small molecules. We apply the method to all-by-all comparisons of two sets of human protein kinases, a very diverse set of ATP-bound proteins from multiple species, and three heterogeneous benchmark protein binding site data sets. Cases of disagreement between sequence-based similarity and binding site similarity yield informative examples. Where sequence similarity is very low, high pocket similarity can reliably identify important binding motifs. Where sequence similarity is very high, significant differences in pocket similarity are related to ligand binding specificity and similarity. Local protein binding pocket similarity provides qualitatively complementary information to other approaches, and it can yield quantitative information in support of functional annotation. Proteins 2011; © 2011 Wiley-Liss, Inc.
Published: 2011
Full Text: View/download PDF

29. Searching Protein 3-D Structures for Optimal Structure Alignment Using Intelligent Algorithms and Data Structures

Author: Tomáš Novosád, Ajith Abraham, Vaclav Snasel, and Jack Y. Yang
Subjects: Protein structure database, Computer science, Suffix tree, Structural alignment, Protein Data Bank (RCSB PDB), Similarity measure, computer.software_genre, Proteomics, law.invention, Structural bioinformatics, Protein similarity, Similarity (network science), Artificial Intelligence, law, Data Mining, Humans, Electrical and Electronic Engineering, Databases, Protein, Cluster analysis, chemistry.chemical_classification, business.industry, Cosine similarity, Computational Biology, Proteins, Reproducibility of Results, Pattern recognition, General Medicine, computer.file_format, Structural Classification of Proteins database, Protein Data Bank, Protein tertiary structure, Protein Structure, Tertiary, Computer Science Applications, Amino acid, chemistry, Structural Homology, Protein, Data mining, Artificial intelligence, business, computer, Algorithms, Biotechnology
Abstract: In this paper, we present a novel algorithm for measuring protein similarity based on their 3-D structure (protein tertiary structure). The algorithm uses a suffix tree for discovering common parts of main chains of all proteins appearing in the current Research Collaboratory for Structural Bioinformatics Protein Data Bank (PDB). By identifying these common parts, we build a vector model and use some classical information retrieval (IR) algorithms based on the vector model to measure the similarity between proteins (all-to-all protein similarity). For the calculation of protein similarity, we use the term frequency × inverse document frequency (tf × idf) term weighting scheme and the cosine similarity measure. The goal of this paper is to introduce a new protein similarity metric based on suffix trees and IR methods. The whole current PDB database was used to demonstrate the very good time complexity of the algorithm as well as its high precision. We have chosen the Structural Classification of Proteins (SCOP) database for verification of the precision of our algorithm because it is maintained primarily by humans. The next success of this paper would be the ability to determine SCOP categories of proteins not included in the latest version of the SCOP database (v. 1.75) with nearly 100% precision.
Published: 2010
Full Text: View/download PDF
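
The vector-model step of the abstract above (tf × idf weighting followed by cosine similarity) can be sketched as below. The suffix-tree stage is not shown: each protein is assumed to be already reduced to a list of hypothetical fragment identifiers such as "f1". The particular tf × idf variant used here (term count divided by document length, times the natural log of N over document frequency) is one common choice and an assumption, not necessarily the one used in the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each protein name to a sparse {term: tf*idf weight} dict.

    `docs` maps a protein name to the list of fragment identifiers
    ("terms") found in it; the identifiers are hypothetical.
    """
    n_docs = len(docs)
    doc_freq = Counter(t for terms in docs.values() for t in set(terms))
    vectors = {}
    for name, terms in docs.items():
        counts = Counter(terms)
        vectors[name] = {
            t: (c / len(terms)) * math.log(n_docs / doc_freq[t])
            for t, c in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy collection: A and B share fragments, C shares none with them.
docs = {
    "A": ["f1", "f2", "f2"],
    "B": ["f1", "f2"],
    "C": ["f3", "f4"],
}
vecs = tfidf_vectors(docs)
sim_ab = cosine(vecs["A"], vecs["B"])
sim_ac = cosine(vecs["A"], vecs["C"])
```

Proteins with overlapping fragments (A and B) score close to 1, while proteins with no fragment in common (A and C) score 0. Note that a term occurring in every protein receives weight 0 under this idf definition.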

30. Predicting protein pKa by environment similarity

Author: Loriano Storchi, Gabriele Cruciani, and Francesca Milletti
Subjects: Models, Molecular, Binding Sites, Training set, Chromatography, Protein Conformation, Chemistry, Computational Biology, Proteins, A protein, Hydrogen-Ion Concentration, Biochemistry, Kinetics, Protein structure, Protein similarity, Structural Biology, Protein pKa calculations, Binding site, Biological system, Hydrophobic and Hydrophilic Interactions, Molecular Biology, Algorithms
Abstract: A statistical method to predict protein pK(a) has been developed by using the 3D structure of a protein and a database of 434 experimental protein pK(a) values. Each pK(a) in the database is associated with a fingerprint that describes the chemical environment around an ionizable residue. A computational tool, MoKaBio, has been developed to automatically identify ionizable residues in a protein, generate fingerprints that describe the chemical environment around such residues, and predict pK(a) from the experimental pK(a) values in the database by using a similarity metric. The method, which retrieved the pK(a) of 429 of the 434 ionizable sites in the database correctly, was cross-validated by leave-one-out and yielded root mean square error (RMSE) = 0.95, a result that is superior to that obtained by using the Null Model (RMSE 1.07) and other well-established protein pK(a) prediction tools. This novel approach is suitable to rationalize protein pK(a) by comparing the region around the ionizable site with similar regions whose ionizable site pK(a) is known. The pK(a) of residues that have a unique environment not represented in the training set cannot be predicted accurately; however, the method offers the advantage of being trainable to increase its predictive power.
Published: 2009
Full Text: View/download PDF

31. Applying Fuzzy Technologies to Equivalence Learning in Protein Classification

Author: József Dombi and Attila Kertész-Farkas
Subjects: Normalization (statistics), business.industry, Computational Biology, Proteins, Pattern recognition, Fuzzy logic, Computational Mathematics, Protein sequencing, Fuzzy Logic, Computational Theory and Mathematics, Protein similarity, Binary classification, Artificial Intelligence, Modeling and Simulation, Genetics, Artificial intelligence, Databases, Protein, business, Protein Kinases, Molecular Biology, Algorithms, Similarity learning, Mathematics
Abstract: When sequencing a new genome, its function and structure are important concerns, and inferring methods are based on protein sequence similarity methods. However, sequence groups differ in their parameters such as the number of group members and intra- and inter-class variability. A method that performs well on one group may not perform well on another group. Thus, learning similarity in a supervised manner could provide a general framework to set a similarity function to a specific sequence class. Here we describe a novel method that learns a similarity function between proteins by using a binary classifier and pairs of equivalent sequences (belonging to the same class) as positive samples, and non-equivalent sequences (belonging to different classes) as negative training samples. For sequence pair representation, we propose to use advanced techniques from fuzzy theory, including a sigmoid-type function for normalization and the class of Dombi operators that provide a more robust method. Using some additional constraints, the learned function turns out to be a valid kernel or metric function, and we present a new way of learning it, along with a new parameter-weighting technique. Using a dataset of archaeal, bacterial, and eukaryotic 3-phosphoglycerate-kinase sequences (3PGK) and clusters from COG, we evaluate this equivalence learning method from a protein classification point of view. A receiver operator characteristic (ROC) analysis shows that we get a much more robust and accurate methodology for protein classification when these techniques are applied together. (See online Supplementary Material at www.liebertonline.com).
Published: 2009
Full Text: View/download PDF

32. Fuse: Multiple Network Alignment via Data Fusion

Author: Noël Malod-Dognin, Vladimir Gligorijević, Nataša Pržulj, and Commission of the European Communities
Subjects: 0301 basic medicine, Statistics and Probability, Matching (graph theory), Computer science, Bioinformatics, Molecular Networks (q-bio.MN), Systems biology, 0206 medical engineering, Sequence alignment, 02 engineering and technology, computer.software_genre, Biochemistry, Homology (biology), 03 medical and health sciences, Similarity (network science), Protein similarity, Protein Interaction Mapping, Quantitative Biology - Molecular Networks, Molecular Biology, Gene, 01 Mathematical Sciences, 08 Information And Computing Sciences, Computational Biology, Proteins, Approximation algorithm, q-bio.MN, 06 Biological Sciences, Sensor fusion, Computer Science Applications, Computational Mathematics, ComputingMethodologies_PATTERNRECOGNITION, 030104 developmental biology, Computational Theory and Mathematics, FOS: Biological sciences, Data mining, Sequence Alignment, computer, Algorithms, Software, 020602 bioinformatics, Biological network
Abstract: Discovering patterns in networks of protein-protein interactions (PPIs) is a central problem in systems biology. Alignments between these networks aid functional understanding as they uncover important information, such as evolutionary conserved pathways, protein complexes and functional orthologs. The objective of a multiple network alignment is to create clusters of nodes that are evolutionarily conserved and functionally consistent across all networks. Unfortunately, the alignment methods proposed thus far do not fully meet this objective, as they are guided by pairwise scores that do not utilize the entire functional and topological information across all networks. To overcome this weakness, we propose FUSE, a multiple network aligner that utilizes all functional and topological information in all PPI networks. It works in two steps. First, it computes novel similarity scores of proteins across the PPI networks by fusing from all aligned networks both the protein wiring patterns and their sequence similarities. It does this by using Non-negative Matrix Tri-Factorization (NMTF). When we apply NMTF on the five largest and most complete PPI networks from BioGRID, we show that NMTF finds a larger number of protein pairs across the PPI networks that are functionally conserved than can be found by using protein sequence similarities alone. This demonstrates complementarity of protein sequence and their wiring patterns in the PPI networks. In the second step, FUSE uses a novel maximum weight k-partite matching approximation algorithm to find an alignment between multiple networks. We compare FUSE with the state-of-the-art multiple network aligners and show that it produces the largest number of functionally consistent clusters that cover all aligned PPI networks. Also, FUSE is more computationally efficient than other multiple network aligners.
Published: 2015

33. Do specialized distributed frameworks for bioinformatics applications obtain better performance over generic ones?

Author: Kanak Mahadik, Folker Meyer, Wei Tang, and Saurabh Bagchi
Subjects: Protein similarity, Computer science, Genomic data, Scalability, Bioinformatics
Abstract: The most popular approach to tackling the data deluge from high-throughput sequencing instruments is to parallelize applications and distribute the large datasets across clusters of computers to achieve scalability and performance. Hadoop is a generic platform for developing such applications, whereas Shock-AWE is a platform customized for genomic data. In this work we compare and contrast the performance of a protein similarity search application built on each of these platforms.
Published: 2015
Full Text: View/download PDF

34. SW#db: GPU-Accelerated Exact Sequence Similarity Database Search

Author: Martin Sosic, Mile Šikić, Matija Korpar, and Dino Blazeka
Subjects: Sequence alignment, Database searching, sequence similarity, Computer science, Nearest neighbor search, lcsh:Medicine, Parallel computing, Web Browser, Bioinformatics, CUDA, 03 medical and health sciences, 0302 clinical medicine, Protein similarity, Similarity (network science), Search algorithm, Database search engine, Sensitivity (control systems), lcsh:Science, Time complexity, 030304 developmental biology, Smith–Waterman algorithm, Physics, Sequence, 0303 health sciences, Exact sequence, Multidisciplinary, lcsh:R, Computational Biology, Dynamic programming, lcsh:Q, Databases, Nucleic Acid, Algorithm, Sequence Alignment, 030217 neurology & neurosurgery, Algorithms, Software, Research Article
Abstract: In recent years we have witnessed a growth in sequencing yield, the number of samples sequenced, and, as a result, the growth of publicly maintained sequence databases. This increase in data has placed high demands on protein similarity search algorithms, which have two ever-opposing goals: keeping running times acceptable while maintaining a high enough level of sensitivity. The most time-consuming step of similarity search is the local alignment of query and database sequences. This step is usually performed using exact local alignment algorithms such as Smith-Waterman. Because of its quadratic time complexity, aligning a query to the whole database is usually too slow. Therefore, most protein similarity search methods apply heuristics before the exact local alignment to reduce the number of candidate sequences in the database. However, there is still a need for the alignment of a query sequence to a reduced database. In this paper we present the SW#db tool and a library for fast exact similarity search. Although its running times, as a standalone tool, are comparable to the running times of BLAST, it is primarily intended to be used for the exact local alignment phase, in which the database of sequences has already been reduced. It uses both GPU and CPU parallelization and was 4-5 times faster than SSEARCH, 6-25 times faster than CUDASW++ and more than 20 times faster than SSW at the time of writing, using multiple queries on the Swiss-Prot and UniRef90 databases.
Published: 2015
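
The exact local alignment step named in the abstract above can be illustrated with a minimal Smith-Waterman scoring sketch. This is not the SW#db implementation: the function name, the simple match/mismatch scores, and the linear gap penalty are assumptions chosen for brevity, whereas real protein search tools typically use substitution matrices and affine gap penalties. The nested loops make the quadratic time complexity visible.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of sequences a and b.

    Illustrative scoring only (assumed values): +2 match, -1 mismatch,
    -2 per gap position (linear gap penalty).
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0: a local alignment may start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


score = smith_waterman_score("XXACGTXX", "ACGT")
```

Only two rows of the table are kept, which is enough when the score alone is needed; recovering the alignment itself would require the full table or a traceback structure.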

35. Protein Structure Similarity Clustering: Dynamic Treatment of PDB Structures Facilitates Clustering

Author: Stefan Wetzel, David B. Berkowitz, Herbert Waldmann, Bradley D. Charette, and Richard G. MacDonald
Subjects: Models, Molecular, Protein Conformation, Chemistry, Protein Data Bank (RCSB PDB), Proteins, General Chemistry, Computational biology, General Medicine, Ligands, Catalysis, Molecular dynamics, Protein structure, Protein similarity, Cluster Analysis, Cluster analysis, Algorithms
Published: 2006
Full Text: View/download PDF

36. Charting biologically relevant chemical space: A structural classification of natural products (SCONP)

Author: Stefan Wetzel, Michael Scheck, Marcus A. Koch, Ansgar Schuffenhauer, Marco Casaulta, Herbert Waldmann, Peter Ertl, and Alex Odermatt
Subjects: Biological Products, Informatics, Multidisciplinary, Databases, Factual, Protein Conformation, Chemical biology, Computational biology, Structural classification, Biology, Small molecule, Chemical space, Natural (archaeology), Chemistry, CHEMISTRY METHODS, Biochemistry, Protein similarity, Drug Design, Physical Sciences, 11-beta-Hydroxysteroid Dehydrogenase Type 1, Identification (biology), Software
Abstract: The identification of small molecules that fall within the biologically relevant subfraction of vast chemical space is of utmost importance to chemical biology and medicinal chemistry research. The prerequirement of biological relevance to be met by such molecules is fulfilled by natural product-derived compound collections. We report a structural classification of natural products (SCONP) as organizing principle for charting the known chemical space explored by nature. SCONP arranges the scaffolds of the natural products in a tree-like fashion and provides a viable analysis- and hypothesis-generating tool for the design of natural product-derived compound collections. The validity of the approach is demonstrated in the development of a previously undescribed class of selective and potent inhibitors of 11β-hydroxysteroid dehydrogenase type 1 with activity in cells guided by SCONP and protein structure similarity clustering. 11β-hydroxysteroid dehydrogenase type 1 is a target in the development of new therapies for the treatment of diabetes, the metabolic syndrome, and obesity.
Published: 2005
Full Text: View/download PDF

37. FLAP: 4‐Point Pharmacophore Fingerprints from GRID

Author: Simone Sciabola, Francesca Perruccio, Jonathan S. Mason, and Massimo Baroni
Subjects: Protein similarity, Computer science, Docking (molecular), Molecular interaction fields, Computational biology, Pharmacophore, Grid, Combinatorial chemistry
Published: 2005
Full Text: View/download PDF

38. Identifying remote protein homologs by network propagation

Author: Jason Weston, Christina S. Leslie, William Stafford Noble, and Rui Kuang
Subjects: Protein structure database, Sequence, Protein database, Cell Biology, Biology, computer.software_genre, Bioinformatics, ENCODE, Biochemistry, ComputingMethodologies_PATTERNRECOGNITION, Protein similarity, Search algorithm, Pairwise comparison, Data mining, Global structure, Molecular Biology, computer
Abstract: Perhaps the most widely used applications of bioinformatics are tools such as psi-blast for searching sequence databases. We describe a recently developed protein database search algorithm called rankprop. rankprop relies upon a precomputed network of pairwise protein similarities. The algorithm performs a diffusion operation from a specified query protein across the protein similarity network. The resulting activation scores, assigned to each database protein, encode information about the global structure of the protein similarity network. This type of algorithm has a rich history in associationist psychology, artificial intelligence and web search. We describe the rankprop algorithm and its relatives, and we provide evidence that the algorithm successfully improves upon the rankings produced by psi-blast.
Published: 2005
Full Text: View/download PDF
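
The diffusion idea described in the abstract above can be sketched as a generic score propagation over a similarity network. This is a hedged illustration, not the published rankprop update rule: the function name, the damping factor `alpha`, the fixed iteration count, and the toy network are all assumptions.

```python
def propagate(weights, query, alpha=0.5, iterations=50):
    """Diffuse activation from a query node across a similarity network.

    weights[j][i] is the (assumed pre-normalized) similarity of protein j
    to protein i. Each round, a node keeps its direct similarity to the
    query and adds a damped share of its neighbours' current scores.
    """
    n = len(weights)
    initial = list(weights[query])  # direct similarity to the query
    scores = initial[:]
    for _ in range(iterations):
        scores = [
            initial[i] + alpha * sum(weights[j][i] * scores[j] for j in range(n))
            for i in range(n)
        ]
    return scores


# Toy network: 0 (query) is similar to 1, 1 is similar to 2, 3 is isolated.
network = [
    [0.0, 0.5, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
scores = propagate(network, query=0)
```

In this toy case protein 2 has no direct similarity to the query yet receives a positive score through protein 1, which is the kind of global network information the abstract refers to.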

39. Natural Product‐Derived Compound Libraries and Protein Structure Similarity As Guiding Principles for the Discovery of Drug Candidates

Author: Herbert Waldmann and Marcus A. Koch
Subjects: chemistry.chemical_compound, Protein function, Natural product, Guiding Principles, Protein similarity, chemistry, Chemogenomics, Computational biology, Combinatorial chemistry
Published: 2004
Full Text: View/download PDF

40. Evaluation of structural similarity based on reduced dimensionality representations of protein structure

Author: Guy H. Grant, W. Graham Richards, and Birgit Albrecht
Subjects: Models, Molecular, Magnetic Resonance Spectroscopy, Structural similarity, Gaussian, Monte Carlo method, Bioengineering, Bioinformatics, Biochemistry, Quantitative Biology::Subcellular Processes, symbols.namesake, Protein structure, Protein similarity, Molecular Biology, Mathematics, Quantitative Biology::Biomolecules, business.industry, Dimensionality reduction, Computational Biology, Proteins, Pattern recognition, Protein Structure, Tertiary, Structural Homology, Protein, Data Interpretation, Statistical, symbols, Protein model, Artificial intelligence, business, Monte Carlo Method, Biotechnology, Curse of dimensionality
Abstract: Protein similarity estimations can be achieved using reduced dimensional representations and we describe a new application for the generation of two-dimensional maps from the three-dimensional structure. The code for the dimensionality reduction is based on the concept of pseudo-random generation of two-dimensional coordinates and Monte Carlo-like acceptance criteria for the generated coordinates. A new method for calculating protein similarity is developed by introducing a distance-dependent similarity field. Similarity of two proteins is derived from similarity field indices between amino acids based on various criteria such as hydrophobicity, residue replacement factors and conformational similarity, each showing a one factor Gaussian dependence. Results on comparisons of misfolded protein models with data sets of correctly folded structures show that discrimination between correctly folded and misfolded structures is possible. Tests were carried out on five different proteins, comparing a misfolded protein structure with members of the same topology, architecture, family and domain according to the CATH classification.
Published: 2004
Full Text: View/download PDF

41. YHR150w and YDR479c encode peroxisomal integral membrane proteins involved in the regulation of peroxisome number, size, and distribution in Saccharomyces cerevisiae

Author: Juan C. Torres-Guzman, Richard A. Rachubinski, John D. Aitchison, Franco J. Vizeacoumar, and Yuen Yi C. Tam
Subjects: Saccharomyces cerevisiae Proteins, Molecular Sequence Data, Saccharomyces cerevisiae, biogenesis, peroxin, protein similarity, open reading frame, membrane fission, Vesicular Transport Proteins, Peroxin, Article, 03 medical and health sciences, Membrane fission, GTP-Binding Proteins, Gene Expression Regulation, Fungal, Sequence Homology, Nucleic Acid, Peroxisomes, Gene, Integral membrane protein, Cells, Cultured, 030304 developmental biology, 0303 health sciences, Sequence Homology, Amino Acid, biology, 030302 biochemistry & molecular biology, Membrane Proteins, Yarrowia, Cell Biology, Peroxisome, biology.organism_classification, Microscopy, Electron, Membrane protein, Biochemistry, Mutation, Gene Deletion, Oleic Acid
Abstract: The peroxin Pex24p of the yeast Yarrowia lipolytica exhibits high sequence similarity to two hypothetical proteins, Yhr150p and Ydr479p, encoded by the Saccharomyces cerevisiae genome. Like YlPex24p, both Yhr150p and Ydr479p have been shown to be integral to the peroxisomal membrane, but unlike YlPex24p, their levels of synthesis are not increased upon a shift of cells from glucose- to oleic acid–containing medium. Peroxisomes of cells deleted for either or both of the YHR150w and YDR479c genes are increased in number, exhibit extensive clustering, are smaller in area than peroxisomes of wild-type cells, and often exhibit membrane thickening between adjacent peroxisomes in a cluster. Peroxisomes isolated from cells deleted for both genes have a decreased buoyant density compared with peroxisomes isolated from wild-type cells and still exhibit clustering and peroxisomal membrane thickening. Overexpression of the genes PEX25 or VPS1, but not the gene PEX11, restored the wild-type phenotype to cells deleted for one or both of the YHR150w and YDR479c genes. Together, our data suggest a role for Yhr150p and Ydr479p, together with Pex25p and Vps1p, in regulating peroxisome number, size, and distribution in S. cerevisiae. Because of their role in peroxisome dynamics, YHR150w and YDR479c have been designated as PEX28 and PEX29, respectively, and their encoded peroxins as Pex28p and Pex29p.
Published: 2003
Full Text: View/download PDF

42. MINRMS: an efficient algorithm for determining protein structure similarity using root-mean-squared-distance

Author: Conrad C. Huang, Andrew I. Jewett, and Thomas E. Ferrin
Subjects: Quality Control, Statistics and Probability, Theoretical computer science, Protein Conformation, Computation, Molecular Sequence Data, Structural alignment, Sequence alignment, Biochemistry, Root mean square, Protein structure, Protein similarity, Sequence Analysis, Protein, Amino Acid Sequence, Molecular Biology, Mathematics, Multiple sequence alignment, Efficient algorithm, Proteins, Computer Science Applications, Computational Mathematics, Computational Theory and Mathematics, Linear Models, Muramidase, Sequence Alignment, Algorithm, Algorithms
Abstract: Motivation: Existing algorithms for automated protein structure alignment generate contradictory results and are difficult to interpret. An algorithm which can provide a context for interpreting the alignment and uses a simple method to characterize protein structure similarity is needed. Results: We describe a heuristic for limiting the search space for structure alignment comparisons between two proteins, and an algorithm for finding minimal root-mean-squared-distance (RMSD) alignments as a function of the number of matching residue pairs within this limited search space. Our alignment algorithm uses coordinates of alpha-carbon atoms to represent each amino acid residue and requires a total computation time of O(m³n²), where m and n denote the lengths of the protein sequences. This makes our method fast enough for comparisons of moderate-size proteins (fewer than ∼800 residues) on current workstation-class computers and therefore addresses the need for a systematic analysis of multiple plausible shape similarities between two proteins using a widely accepted comparison metric. Availability: See http://www.cgl.ucsf.edu/Research/minrms Contact: tef@cgl.ucsf.edu
Published: 2003
Full Text: View/download PDF
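
The comparison metric named in the abstract above, RMSD over matched alpha-carbon pairs, can be written down in a few lines. This sketch only evaluates the metric for a given set of matched pairs and assumes the two structures are already superposed; it does not perform the alignment search or the optimal superposition that MINRMS carries out, and the function name is hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-squared distance between matched 3D points.

    coords_a[k] and coords_b[k] are (x, y, z) tuples for the k-th matched
    residue pair, e.g. alpha-carbon positions. The structures are assumed
    to be superposed already.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


distance = rmsd([(0.0, 0.0, 0.0)], [(3.0, 4.0, 0.0)])
```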

43. The Sequential Introduction of HIV-1 Subtype B and CRF01_AE in Singapore by Sexual Transmission: Accelerated V3 Region Evolution in a Subpopulation of Asian CRF01 Viruses

Author: Chee-Leok Goh, Davis Lupo, Iris Verghese, Yee Sin Leo, Boon Huan Tan, Phillip Gerrish, Bette T. Korber, Satish K. Pillai, Roy Chan, Marcia L. Kalish, Ae M. Saekhou, Kenneth E. Robbins, and Teresa M. Brown
Subjects: Adult, Male, CRF01AE, Asia, Glycosylation, Sexual transmission, Sexual Behavior, Molecular Sequence Data, Population, Human immunodeficiency virus (HIV), HIV Envelope Protein gp120, Biology, V3 loop, medicine.disease_cause, diversity, law.invention, chemistry.chemical_compound, glycosylation sites, Protein similarity, law, Virology, evolution, medicine, Humans, Amino Acid Sequence, education, Phylogeny, Aged, Genetics, Acquired Immunodeficiency Syndrome, Singapore, education.field_of_study, risk group, Genetic Variation, Middle Aged, Genes, gag, Immune surveillance, chemistry, HIV-1, Recombinant DNA, Female, envelope sequences
Abstract: The rapid spread of the human immunodeficiency virus type 1 (HIV-1) circulating recombinant form (CRF) 01_AE throughout Asia demonstrates the dynamic nature of emerging epidemics. To further characterize the dissemination of these strains regionally, we sequenced 58 strains from Singapore and found that subtype B and CRF01 were introduced separately, by homosexual and heterosexual transmission, respectively. Protein similarity scores of the Singapore CRF01, as well as all Asian strains, demonstrated a complex distribution of scores in the V3 loop: some strains had very similar V3 loop sequences, while others were highly divergent. Furthermore, we found a strong correlation between the loss of a V3 glycosylation site and the divergent strains. This suggests that loss of this glycosylation site may make the V3 loop more susceptible to immune surveillance. The identification of a rapidly evolving population of CRF01_AE variants should be considered when designing new candidate vaccines and when evaluating breakthrough strains from current vaccine trials.
Published: 2002
Full Text: View/download PDF

44. Assessment of the CASP4 fold recognition category

Author: Antonina Andreeva, Rainer Malik, Andreas Prlić, Manfred J. Sippl, Francisco S. Domingues, Markus Wiederstein, and Peter Lackner
Subjects: Models, Molecular, Protein Folding, Computer science, business.industry, Protein structure prediction, CAFASP, Machine learning, computer.software_genre, Biochemistry, Protein Structure, Tertiary, Critical discussion, Protein similarity, Fully automated, Sequence Analysis, Protein, Structural Biology, Server, Statistics, Quality Score, Computer Simulation, Artificial intelligence, Threading (protein sequence), business, Molecular Biology, computer
Abstract: We present the assessment of the CASP4 fold recognition category. The tasks we had to execute include the splitting of multidomain targets into single domains, the classification of target domains in terms of prediction categories, the numerical evaluation of predictions, the mapping of numerical scores to quality indices, the ranking of predictors, the selection of top-performing groups, and the analysis and critical discussion of the state of the art in this field. The 125 fold recognition groups were assessed by a total score that summarizes their performance over all targets and a quality score reflecting the average quality of the submitted models. Most of the top-performing groups achieved respectable results on both scores simultaneously. Several groups submitted models that were much closer to the respective target structures than any of the known folds in the Protein Data Bank. The CASP4 assessment included the automated servers of the parallel CAFASP experiment. For the total score, the highest rank achieved by a fully automated server is 12. Two thirds of the predictors have rather low scores.
Published: 2001
Full Text: View/download PDF

45. Gaussian-based Alignment of Protein Structures: Deriving a Consensus Superposition when Alternative Solutions Exist

Author: Jordi Mestres
Subjects: Quantitative Biology::Biomolecules, Sequence, Computer science, Gaussian, Organic Chemistry, Catalysis, Computer Science Applications, Inorganic Chemistry, symbols.namesake, Superposition principle, Protein structure, Computational Theory and Mathematics, Protein similarity, symbols, Statistical physics, Physical and Theoretical Chemistry, Representation (mathematics), Protein structure comparison
Abstract: The use of a Gaussian-based representation of protein structures for evaluating protein-structure similarities and deriving three-dimensional superpositions is presented. The approach, as implemented in the program GAPS, is applied to three pairs of proteins with different topological characteristics (rich α-helix, mixed α-helix/β-strand, and rich β-strand), low sequence identities (10–30%), and recognized difficulties in defining a unique optimum alignment. Validation of the GAPS superpositions is done by comparison with superpositions obtained by the TOP, GA_FIT, and ALIGN programs and those directly extracted from the FSSP database. Results suggest that a Gaussian-based methodology offers an objective means to, depending on the Gaussian-based representation, derive a consensus three-dimensional superposition when alternative superposition solutions exist.
Published: 2000
Full Text: View/download PDF

46. PFDB: A Protein Families DataBase for Macintosh Computers. The Effectiveness of Its Organization in Searching for Protein Similarity

Author: P. Petrilli and Nyerhovwo J. Tonukari
Subjects: Plants, Medicinal, Databases, Factual, Database, Sequence database, Computer science, Molecular Sequence Data, Proteins, Sequence Homology, A protein, Fabaceae, computer.software_genre, Biochemistry, Peptide Fragments, Dipeptide composition, Sequence homology, Microcomputers, Protein similarity, Albumins, Amino Acid Sequence, Linear correlation, Peptide sequence, computer, Software, Plant Proteins
Abstract: A protein sequence database (PFDB) containing about 11,000 entries is available for Macintosh computers. The PFDB can be easily updated by importing sequences from the PIR collection through the internet. The most important feature of the database is its organization in families of closely related sequences, each family being characterized by its average dipeptide composition [Petrilli (1993), Comput. Appl. Biosci. 2, 89-93]. This allows one to perform a rapid and sensitive protein similarity search by comparing the precalculated family dipeptide composition with that of the query sequence by a linear correlation coefficient. An example of an application in which a new protein was classified by using a sequence of a fragment just 19 residues long is reported.
Published: 1997
Full Text: View/download PDF
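
The search scheme described in the abstract above (a precalculated average dipeptide composition per family, compared with the query's composition by a linear correlation coefficient) can be sketched as follows. The function names and the example sequences are hypothetical, and the normalization by the number of dipeptides is an assumption; the record does not give PFDB's exact formulas.

```python
import math
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(seq):
    """Frequency of each of the 400 dipeptides among overlapping pairs in seq."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(len(seq) - 1, 1)
    return [counts[d] / total for d in DIPEPTIDES]


def family_profile(sequences):
    """Average dipeptide composition over the members of one family."""
    comps = [dipeptide_composition(s) for s in sequences]
    return [sum(column) / len(comps) for column in zip(*comps)]


def pearson(x, y):
    """Linear (Pearson) correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / math.sqrt(var_x * var_y)


# Made-up family and queries, for illustration only.
profile = family_profile(["ACACACAC", "ACACAC"])
close = pearson(profile, dipeptide_composition("CACACA"))
far = pearson(profile, dipeptide_composition("WYWYWY"))
```

Because the family profile is computed once and stored, a query costs one 400-element correlation per family rather than one alignment per sequence, which is presumably what makes the search rapid.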

47. Predicting substrate specificity of adenylation domains of nonribosomal peptide synthetases and other protein properties by latent semantic indexing

Author: Daslav Hranueli, John Cullum, Janko Diminic, Antonio Starcevic, Damir Baranasic, Paul F. Long, Jurica Zucko, and Ranko Gacesa
Subjects: chemistry.chemical_classification, Protein family, Sequence analysis, Document-term matrix, Bioengineering, Sequence alignment, Computational biology, Biology, ENCODE, Applied Microbiology and Biotechnology, Substrate Specificity, chemistry, Biochemistry, Nonribosomal peptide, Sequence Analysis, Protein, Catalytic Domain, Amino Acids, Peptide Synthases, Peptide sequence, Adenylylation, Sequence Alignment, Adenylation domains, nonribosomal peptide synthetases, latent semantic indexing, term-document matrix, protein similarity, support vector machine, Biotechnology
Abstract: Successful genome mining is dependent on accurate prediction of protein function from sequence. This often involves dividing protein families into functional subtypes (e.g., with different substrates). In many cases, there are only a small number of known functional subtypes, but in the case of the adenylation domains of nonribosomal peptide synthetases (NRPS), there are >500 known substrates. Latent semantic indexing (LSI) was originally developed for text processing but has also been used to assign proteins to families. Proteins are treated as “documents” and it is necessary to encode properties of the amino acid sequence as “terms” in order to construct a term-document matrix, which counts the terms in each document. This matrix is then processed to produce a document-concept matrix, where each protein is represented as a row vector. A standard measure of the closeness of vectors to each other (cosines of the angle between them) provides a measure of protein similarity. Previous work encoded proteins as oligopeptide terms, i.e. counted oligopeptides, but used no information regarding location of oligopeptides in the proteins. A novel tokenization method was developed to analyze information from multiple alignments. LSI successfully distinguished between two functional subtypes in five well-characterized families. Visualization of different “concept” dimensions allows exploration of the structure of protein families. LSI was also used to predict the amino acid substrate of adenylation domains of NRPS. Better results were obtained when selected residues from multiple alignments were used rather than the total sequence of the adenylation domains. Using ten residues from the substrate binding pocket performed better than using 34 residues within 8 Å of the active site. Prediction efficiency was somewhat better than that of the best published method using a support vector machine.
Published: 2013
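
The oligopeptide-counting encoding and the cosine measure mentioned in the abstract above can be illustrated with a short sketch. It covers only term counting and cosine similarity on raw count vectors; the matrix decomposition that turns the term-document matrix into a document-concept matrix is deliberately omitted, and the alignment-based tokenization proposed in the paper is not reproduced. The function names and the oligopeptide length `k=3` are assumptions.

```python
import math
from collections import Counter


def term_counts(seq, k=3):
    """Count overlapping oligopeptides of length k (the 'terms' of a protein)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(u, v):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_u * norm_v)


# Made-up sequences, for illustration only.
similar = cosine(term_counts("MKVLAAGMKVLA"), term_counts("MKVLAAG"))
unrelated = cosine(term_counts("MKVLAAG"), term_counts("GGGGGGG"))
```

As the abstract notes, counting oligopeptides this way discards where in the protein each oligopeptide occurs.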

48. A Promising method for fast evaluating protein structure similarity

Author: Ying Liu
Subjects: Protein similarity, Drug discovery, Computer science, Molecular biophysics, Drug target, A protein, Protein folding, Computational biology, Binding site, Bioinformatics, Ligand (biochemistry)
Abstract: Screening a protein structure database against a ligand binding site may provide lead information that accelerates drug discovery research. In this study we propose a novel approach, MIIF, based on the protein folding shape code (PFSC) method, and apply it to discover binding site regions of a protein that are similar to the known drug target region of a ligand.
Published: 2012
Full Text: View/download PDF

49. GSQCT: A solution to screening gene sequences for phylogenetics analysis

Author: Jianhui Li, Jing Zhao, Zhen Meng, Shouzhou Zhang, Xiao Xiao, Hui Dong, Yunchun Zhou, and Cao Wei
Subjects: Multiple sequence alignment, Phylogenetic tree, Computer science, Pseudogene, Computational biology, Gene Annotation, computer.software_genre, Stop codon, Homology (biology), DNA sequencing, Protein similarity, Phylogenetics, Data mining, computer, Gene, Alignment-free sequence analysis
Abstract: Screening data for phylogenetic analysis is a known Gordian knot. In this paper we present GSQCT (Gene Sequence Quality Control Tool), a solution for screening gene sequence data. It first extracts initial datasets using gene annotation information. Then, for each sequence in turn, it calculates the content of uncertain characters to assess sequencing quality, detects stop codons to avoid pseudogenes, detects custom strings to remove contaminating sequence fragments, and calculates protein similarity against a template protein of the target gene to detect homology; finally, it decides whether to select the sequence according to predetermined threshold ranges. A report of the screening result is produced, and a multiple sequence alignment can be performed to verify homology with previously verified sequences. This solution overcomes the problems of existing gene data filtering, namely erroneous or ambiguous annotations and uneven sequencing accuracy, which lead to the construction of incorrect phylogenetic trees. An evaluation of the solution is presented and shows good accuracy and effectiveness. A parallel implementation with Hadoop (Map/Reduce) is available for download: http://www.darwintree.cn/tools.htm
Published: 2012
Full Text: View/download PDF

50. Predicting protein functions from PPI networks using functional aggregation

Author: Xiaoxiao Chi and Jingyu Hou
Subjects: Statistics and Probability, Protein function, General Immunology and Microbiology, Applied Mathematics, Proteins, General Medicine, computer.software_genre, Fuzzy logic, General Biochemistry, Genetics and Molecular Biology, ComputingMethodologies_PATTERNRECOGNITION, Protein similarity, Choquet integral, Fuzzy Logic, Modeling and Simulation, Prediction methods, Protein Interaction Mapping, Data mining, General Agricultural and Biological Sciences, computer, Algorithms, Mathematics
Abstract: Predicting protein functions computationally from massive protein–protein interaction (PPI) data generated by high-throughput technology is one of the challenges and fundamental problems in the post-genomic era. Although there have been many approaches developed for computationally predicting protein functions, the mutual correlations among proteins in terms of protein functions have not been thoroughly investigated and incorporated into existing prediction methods, especially in voting based prediction methods. In this paper, we propose an innovative method to predict protein functions from PPI data by aggregating the functional correlations among relevant proteins using the Choquet-Integral in fuzzy theory. This functional aggregation measures the real impact of each relevant protein function on the final prediction results, and reduces the impact of repeated functional information on the prediction. Accordingly, a new protein similarity and a new iterative prediction algorithm are proposed in this paper. The experimental evaluations on real PPI datasets demonstrate the effectiveness of our method.
Published: 2012
