Descriptor: "Protein similarity" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Protein similarity"' showing total 162 results

1. BASE: a web service for providing compound-protein binding affinity prediction datasets with reduced similarity bias.

Author: Son, Hyojin, Lee, Sechan, Kim, Jaeuk, Park, Haangik, Hwang, Myeong-Ha, and Yi, Gwan-Su
Subjects: *DRUG discovery, *PROTEIN structure, *PREDICTION models, *BINDING sites, *RESEARCH personnel, *DEEP learning
Abstract: Background: Deep learning-based drug-target affinity (DTA) prediction methods have shown impressive performance, despite a high number of training parameters relative to the available data. Previous studies have highlighted the presence of dataset bias by suggesting that models trained solely on protein or ligand structures may perform similarly to those trained on complex structures. However, these studies did not propose solutions and focused solely on analyzing complex structure-based models. Even when ligands are excluded, protein-only models trained on complex structures still incorporate some ligand information at the binding sites. Therefore, it is unclear whether binding affinity can be accurately predicted using only compound or protein features due to potential dataset bias. In this study, we expanded our analysis to comprehensive databases and investigated dataset bias through compound and protein feature-based methods using multilayer perceptron models. We assessed the impact of this bias on current prediction models and proposed the binding affinity similarity explorer (BASE) web service, which provides bias-reduced datasets. Results: By analyzing eight binding affinity databases using multilayer perceptron models, we confirmed a bias where the compound-protein binding affinity can be accurately predicted using compound features alone. This bias arises because most compounds show consistent binding affinities due to high sequence or functional similarity among their target proteins. Our Uniform Manifold Approximation and Projection analysis based on compound fingerprints further revealed that low and high variation compounds do not exhibit significant structural differences. This suggests that the primary factor driving the consistent binding affinities is protein similarity rather than compound structure. 
We addressed this bias by creating datasets with progressively reduced protein similarity between the training and test sets, observing significant changes in model performance. We developed the BASE web service to allow researchers to download and utilize these datasets. Feature importance analysis revealed that previous models heavily relied on protein features. However, using bias-reduced datasets increased the importance of compound and interaction features, enabling a more balanced extraction of key features. Conclusions: We propose the BASE web service, providing both the affinity prediction results of existing models and bias-reduced datasets. These resources contribute to the development of generalized and robust predictive models, enhancing the accuracy and reliability of DTA predictions in the drug discovery process. BASE is freely available online at https://synbi2024.kaist.ac.kr/base. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF
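
The abstract above describes building datasets with progressively reduced protein similarity between the training and test sets. A minimal sketch of that idea follows. It is not the BASE implementation: the function names, the toy position-wise identity measure, and the threshold values are all assumptions made for illustration, and a real pipeline would likely use an alignment-based similarity.

```python
def identity(a: str, b: str) -> float:
    """Toy similarity (assumption): matching positions over the longer length."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def reduce_similarity(train, test, threshold, sim=identity):
    """Keep only training proteins whose similarity to every test protein
    is below `threshold` (hypothetical helper, not part of BASE)."""
    return {
        name: seq
        for name, seq in train.items()
        if all(sim(seq, t) < threshold for t in test.values())
    }


if __name__ == "__main__":
    train = {"a": "MKTAYIAK", "b": "GGGGSGGG"}
    test = {"t": "MKTAYLAK"}
    # Lowering the threshold removes more test-like proteins from training.
    for thr in (0.9, 0.5):
        print(thr, sorted(reduce_similarity(train, test, thr)))
```

Sweeping the threshold downward yields the "progressively reduced" series of training sets; model performance can then be compared across them.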

2. BASE: a web service for providing compound-protein binding affinity prediction datasets with reduced similarity bias

Author: Hyojin Son, Sechan Lee, Jaeuk Kim, Haangik Park, Myeong-Ha Hwang, and Gwan-Su Yi
Subjects: Drug-target affinity prediction, Drug discovery, Deep learning, Dataset bias, Protein similarity, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
Abstract: Background: Deep learning-based drug-target affinity (DTA) prediction methods have shown impressive performance, despite a high number of training parameters relative to the available data. Previous studies have highlighted the presence of dataset bias by suggesting that models trained solely on protein or ligand structures may perform similarly to those trained on complex structures. However, these studies did not propose solutions and focused solely on analyzing complex structure-based models. Even when ligands are excluded, protein-only models trained on complex structures still incorporate some ligand information at the binding sites. Therefore, it is unclear whether binding affinity can be accurately predicted using only compound or protein features due to potential dataset bias. In this study, we expanded our analysis to comprehensive databases and investigated dataset bias through compound and protein feature-based methods using multilayer perceptron models. We assessed the impact of this bias on current prediction models and proposed the binding affinity similarity explorer (BASE) web service, which provides bias-reduced datasets. Results: By analyzing eight binding affinity databases using multilayer perceptron models, we confirmed a bias where the compound-protein binding affinity can be accurately predicted using compound features alone. This bias arises because most compounds show consistent binding affinities due to high sequence or functional similarity among their target proteins. Our Uniform Manifold Approximation and Projection analysis based on compound fingerprints further revealed that low and high variation compounds do not exhibit significant structural differences. This suggests that the primary factor driving the consistent binding affinities is protein similarity rather than compound structure.
We addressed this bias by creating datasets with progressively reduced protein similarity between the training and test sets, observing significant changes in model performance. We developed the BASE web service to allow researchers to download and utilize these datasets. Feature importance analysis revealed that previous models heavily relied on protein features. However, using bias-reduced datasets increased the importance of compound and interaction features, enabling a more balanced extraction of key features. Conclusions: We propose the BASE web service, providing both the affinity prediction results of existing models and bias-reduced datasets. These resources contribute to the development of generalized and robust predictive models, enhancing the accuracy and reliability of DTA predictions in the drug discovery process. BASE is freely available online at https://synbi2024.kaist.ac.kr/base.
Published: 2024
Full Text: View/download PDF

3. Holistic similarity-based prediction of phosphorylation sites for understudied kinases.

Author: Ma, Renfei, Li, Shangfu, Parisi, Luca, Li, Wenshuo, Huang, Hsien-Da, and Lee, Tzong-Yi
Subjects: *PHOSPHORYLATION, *KINASES, *PROTEIN-protein interactions, *PREDICTION models, *FORECASTING
Abstract: Phosphorylation is an essential mechanism for regulating protein activities. Determining kinase-specific phosphorylation sites by experiments involves time-consuming and expensive analyses. Although several studies proposed computational methods to model kinase-specific phosphorylation sites, they typically required abundant experimentally verified phosphorylation sites to yield reliable predictions. Nevertheless, the number of experimentally verified phosphorylation sites for most kinases is relatively small, and the target phosphorylation sites are still unidentified for some kinases. In fact, there is little research related to these understudied kinases in the literature. Thus, this study aims to create predictive models for these understudied kinases. A kinase–kinase similarity network was generated by merging the sequence-, functional-, protein-domain- and 'STRING'-related similarities. Thus, besides sequence data, protein–protein interactions and functional pathways were also considered to aid predictive modelling. This similarity network was then integrated with a classification of kinase groups to identify kinases highly similar to a specific understudied kinase. Their experimentally verified phosphorylation sites were leveraged as positive sites to train predictive models. The experimentally verified phosphorylation sites of the understudied kinase were used for validation. Results demonstrate that 82 out of 116 understudied kinases were predicted with adequate performance via the proposed modelling strategy, achieving a balanced accuracy of 0.81, 0.78, 0.84, 0.84, 0.85, 0.82, 0.90, 0.82 and 0.85, for the 'TK', 'Other', 'STE', 'CAMK', 'TKL', 'CMGC', 'AGC', 'CK1' and 'Atypical' groups, respectively. Therefore, this study demonstrates that web-like predictive networks can reliably capture the underlying patterns in such understudied kinases by harnessing relevant sources of similarities to predict their specific phosphorylation sites. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

4. The specific applications of the TSR-based method in identifying Zn2+ binding sites of proteases and ACE/ACE2

Author: Titli Sarkar, Camille R. Reaux, Jianxiong Li, Vijay V. Raghavan, and Wu Xu
Subjects: TSR, Metal ion binding site, Protein similarity, 3D structure, Alignment-free, Structural motif, Computer applications to medicine. Medical informatics, R858-859.7, Science (General), Q1-390
Abstract: We have developed an alignment-free TSR (Triangular Spatial Relationship)-based computational method for protein structural comparison and motif identification and discovery. To demonstrate the potential applications of the method, we have generated two datasets. One dataset contains five classes: Actin/Hsp70, serine protease (chymotrypsin/trypsin/elastase), ArsC/Prdx2, PKA/PKB/PKC, and AChE/BChE at hierarchical level 1 and twelve groups at level 2. The other dataset includes representative proteases and ACE/ACE2. The x, y, z coordinates of the structures were obtained from PDB. We calculated the keys (or features) that represent each structure using the TSR-based method. The dataset and data presented here include additional information that helps readers become aware of specific applications of the TSR-based method in protein clustering and in the identification and discovery of metal ion binding sites, and to understand the effect of amino acid grouping on protein 3D structural relationships at both global and local levels.
Published: 2022
Full Text: View/download PDF

5. Link prediction in protein–protein interaction network: A similarity multiplied similarity algorithm with paths of length three.

Author: Cai, Wangmin, Liu, Peiqiang, Wang, Zunfang, Jiang, Hong, Liu, Chang, Fei, Zhaojie, and Yang, Zhuang
Subjects: *ALGORITHMS, *FORECASTING, *PROTEIN-protein interactions
Abstract: Protein–protein interactions (PPIs) are crucial for various biological processes, and predicting PPIs is a major challenge. To solve this issue, the most common method is link prediction. Currently, the link prediction methods based on network Paths of Length Three (L3) have been proven to be highly effective. In this paper, we propose a novel link prediction algorithm, named SMS, which is based on L3 and protein similarities. We first design a mixed similarity that combines the topological structure and attribute features of nodes. Then, we compute the predicted value by summing the product of all similarities along the L3. Furthermore, we propose the Max Similarity Multiplied Similarity (maxSMS) algorithm from the perspective of maximum impact. Our computational prediction results show that on six datasets, including S. cerevisiae, H. sapiens, and others, the maxSMS algorithm improves the precision of the top 500, area under the precision–recall curve, and normalized discounted cumulative gain by an average of 26.99%, 53.67%, and 6.7%, respectively, compared to other optimal methods. • A mixed similarity combining sequence and topological structure is introduced. • A link prediction algorithm based on complementarity and similarity is proposed. • The proposed method performs better than existing L3, Sim and other methods. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF
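
The scoring step in the abstract above ("summing the product of all similarities along the L3") can be sketched as follows. This is an illustrative reading only: the function name `sms_score`, the adjacency-dictionary representation, and the externally supplied `sim` function are assumptions, and the paper's mixed similarity and any normalization it applies are not reproduced here.

```python
def sms_score(adj, sim, u, v):
    """Sum, over all simple paths u - a - b - v of length three, of
    sim(u, a) * sim(a, b) * sim(b, v).

    adj: dict mapping each node to the set of its neighbours (undirected).
    sim: callable returning a similarity for an edge (assumed given).
    """
    total = 0.0
    for a in adj[u]:
        if a == v:
            continue  # a must be an intermediate node, not the target
        for b in adj[a]:
            if b == u or b == v:
                continue  # keep the path simple
            if v in adj[b]:
                total += sim(u, a) * sim(a, b) * sim(b, v)
    return total


if __name__ == "__main__":
    # Two length-three paths between u and v: u-a-b-v and u-c-d-v.
    adj = {
        "u": {"a", "c"}, "a": {"u", "b"}, "b": {"a", "v"},
        "c": {"u", "d"}, "d": {"c", "v"}, "v": {"b", "d"},
    }
    print(sms_score(adj, lambda x, y: 0.5, "u", "v"))
```

Ranking all unconnected pairs by this score gives a candidate list of predicted interactions; a maxSMS-style variant would presumably take a maximum rather than a sum, but the abstract does not give enough detail to sketch it.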

6. A study of a hierarchical structure of proteins and ligand binding sites of receptors using the triangular spatial relationship‐based structure comparison method and development of a size‐filtering feature designed for comparing different sizes of protein structures

Author: Kondra, Sarika, Chen, Feng, Chen, Yixin, Chen, Yuwu, Collette, Caleb J., and Xu, Wu
Abstract: The presence of receptors and the specific binding of the ligands determine nearly all cellular responses. Binding of a ligand to its receptor causes conformational changes of the receptor that triggers the subsequent signaling cascade. Therefore, systematically studying structures of receptors will provide insight into their functions. We have developed the triangular spatial relationship (TSR)‐based method where all possible triangles are constructed with Cα atoms of a protein as vertices. Every triangle is represented by an integer denoted as a "key" computed through the TSR algorithm. A structure is thereby represented by a vector of integers. In this study, we have first defined substructures using different types of keys. Second, using different types of keys represents a new way to interpret structure hierarchical relations and differences between structures and sequences. Third, we demonstrate the effects of sequence similarity as well as sample size on the structure‐based classifications. Fourth, we show identification of structure motifs, and the motifs containing multiple triangles connected by either an edge or a vertex are mapped to the ligand binding sites of the receptors. The structure motifs are valuable resources for the researchers in the field of signal transduction. Next, we propose amino‐acid scoring matrices that capture "evolutionary closeness" information based on BLOSUM62 matrix, and present the development of a new visualization method where keys are organized according to evolutionary closeness and shown in a 2D image. This new visualization opens a window for developing tools with the aim of identification of specific and common substructures by scanning pixels and neighboring pixels. Finally, we report a new algorithm called as size filtering that is designed to improve structure comparison of large proteins with small proteins. 
Collectively, we provide an in‐depth interpretation of structure relations through the detailed analyses of different types of keys and their associated key occurrence frequencies, geometries, and labels. In summary, we consider this study a new computational platform in which keys serve as a bridge to connect sequence and structure as well as structure and function for a deep understanding of sequence, structure, and function relationships of the protein family. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

7. Comparative Analysis of Unsupervised Protein Similarity Prediction Based on Graph Embedding

Author: Yuanyuan Zhang, Ziqi Wang, Shudong Wang, and Junliang Shang
Subjects: protein similarity, graph embedding, gene ontology, link prediction, DTW algorithm, Genetics, QH426-470
Abstract: The study of protein–protein interaction and the determination of protein functions are important parts of proteomics. Computational methods are used to study the similarity between proteins based on Gene Ontology (GO) to explore their functions and possible interactions. GO is a series of standardized terms that describe gene products from molecular functions, biological processes, and cell components. Previous studies on assessing the similarity of GO terms were primarily based on Information Content (IC) between GO terms to measure the similarity of proteins. However, these methods tend to ignore the structural information between GO terms. Therefore, considering the structural information of GO terms, we systematically analyze the performance of the GO graph and GO Annotation (GOA) graph in calculating the similarity of proteins using different graph embedding methods. When applied to the actual Human and Yeast datasets, the feature vectors of GO terms and proteins are learned based on different graph embedding methods. To measure the similarity of the proteins annotated by different GO numbers, we used Dynamic Time Warping (DTW) and cosine to calculate protein similarity in GO graph and GOA graph, respectively. Link prediction experiments were then performed to evaluate the reliability of protein similarity networks constructed by different methods. It is shown that graph embedding methods have obvious advantages over the traditional IC-based methods. We found that random walk graph embedding methods, in particular, showed excellent performance in calculating the similarity of proteins. By comparing link prediction experiment results from GO(DTW) and GOA(cosine) methods, it is shown that GO(DTW) features provide highly effective information for analyzing the similarity among proteins.
Published: 2021
Full Text: View/download PDF

8. Predicting lncRNA–Protein Interaction With Weighted Graph-Regularized Matrix Factorization

Author: Xibo Sun, Leiming Cheng, Jinyang Liu, Cuinan Xie, Jiasheng Yang, and Fu Li
Subjects: lncRNA–protein interaction, weighted graph-regularized matrix factorization, lncRNA similarity, protein similarity, SFPQ, SNHG3, Genetics, QH426-470
Abstract: Long non-coding RNAs (lncRNAs) have attracted wide attention because of their close associations with many key biological activities. Though precise functions of most lncRNAs are unknown, research shows that lncRNAs usually exert biological function by interacting with the corresponding proteins. The experimental validation of interactions between lncRNAs and proteins is costly and time-consuming. In this study, we developed a weighted graph-regularized matrix factorization (LPI-WGRMF) method to find unobserved lncRNA–protein interactions (LPIs) based on lncRNA similarity matrix, protein similarity matrix, and known LPIs. We compared our proposed LPI-WGRMF method with five classical LPI prediction methods, that is, LPBNI, LPI-IBNRA, LPIHN, RWR, and collaborative filtering (CF). The results demonstrate that the LPI-WGRMF method can produce high-accuracy performance, obtaining an AUC score of 0.9012 and AUPR of 0.7324. The case study showed that SFPQ, SNHG3, and PRPF31 may associate with Q9NUL5, Q9NUL5, and Q9UKV8 with the highest linking probabilities and need further experimental validation.
Published: 2021
Full Text: View/download PDF

9. Comparative Analysis of Unsupervised Protein Similarity Prediction Based on Graph Embedding.

Author: Zhang, Yuanyuan, Wang, Ziqi, Wang, Shudong, and Shang, Junliang
Subjects: PROTEIN analysis, COMPARATIVE studies, RANDOM walks, RANDOM graphs, PROTEIN-protein interactions
Abstract: The study of protein–protein interaction and the determination of protein functions are important parts of proteomics. Computational methods are used to study the similarity between proteins based on Gene Ontology (GO) to explore their functions and possible interactions. GO is a series of standardized terms that describe gene products from molecular functions, biological processes, and cell components. Previous studies on assessing the similarity of GO terms were primarily based on Information Content (IC) between GO terms to measure the similarity of proteins. However, these methods tend to ignore the structural information between GO terms. Therefore, considering the structural information of GO terms, we systematically analyze the performance of the GO graph and GO Annotation (GOA) graph in calculating the similarity of proteins using different graph embedding methods. When applied to the actual Human and Yeast datasets, the feature vectors of GO terms and proteins are learned based on different graph embedding methods. To measure the similarity of the proteins annotated by different GO numbers, we used Dynamic Time Warping (DTW) and cosine to calculate protein similarity in GO graph and GOA graph, respectively. Link prediction experiments were then performed to evaluate the reliability of protein similarity networks constructed by different methods. It is shown that graph embedding methods have obvious advantages over the traditional IC-based methods. We found that random walk graph embedding methods, in particular, showed excellent performance in calculating the similarity of proteins. By comparing link prediction experiment results from GO(DTW) and GOA(cosine) methods, it is shown that GO(DTW) features provide highly effective information for analyzing the similarity among proteins. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

10. Predicting lncRNA–Protein Interaction With Weighted Graph-Regularized Matrix Factorization.

Author: Sun, Xibo, Cheng, Leiming, Liu, Jinyang, Xie, Cuinan, Yang, Jiasheng, and Li, Fu
Subjects: MATRIX decomposition, LINCRNA, EXTRACELLULAR matrix proteins
Abstract: Long non-coding RNAs (lncRNAs) have attracted wide attention because of their close associations with many key biological activities. Though precise functions of most lncRNAs are unknown, research shows that lncRNAs usually exert biological function by interacting with the corresponding proteins. The experimental validation of interactions between lncRNAs and proteins is costly and time-consuming. In this study, we developed a weighted graph-regularized matrix factorization (LPI-WGRMF) method to find unobserved lncRNA–protein interactions (LPIs) based on lncRNA similarity matrix, protein similarity matrix, and known LPIs. We compared our proposed LPI-WGRMF method with five classical LPI prediction methods, that is, LPBNI, LPI-IBNRA, LPIHN, RWR, and collaborative filtering (CF). The results demonstrate that the LPI-WGRMF method can produce high-accuracy performance, obtaining an AUC score of 0.9012 and AUPR of 0.7324. The case study showed that SFPQ, SNHG3, and PRPF31 may associate with Q9NUL5, Q9NUL5, and Q9UKV8 with the highest linking probabilities and need further experimental validation. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

11. Efficient inference of homologs in large eukaryotic pan-proteomes

Author: Siavash Sheikhizadeh Anari, Dick de Ridder, M. Eric Schranz, and Sandra Smit
Subjects: Pan-genome, Protein similarity, Homologous genes, Orthology, k-mer, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
Abstract: Background: Identification of homologous genes is fundamental to comparative genomics, functional genomics and phylogenomics. Extensive public homology databases are of great value for investigating homology but need to be continually updated to incorporate new sequences. As new sequences are rapidly being generated, there is a need for efficient standalone tools to detect homologs in novel data. Results: To address this, we present a fast method for detecting homology groups across a large number of individuals and/or species. We adopted a k-mer based approach which considerably reduces the number of pairwise protein alignments without sacrificing sensitivity. We demonstrate accuracy, scalability, efficiency and applicability of the presented method for detecting homology in large proteomes of bacteria, fungi, plants and Metazoa. Conclusions: We clearly observed the trade-off between recall and precision in our homology inference. Favoring recall or precision strongly depends on the application. The clustering behavior of our program can be optimized for particular applications by altering a few key parameters. The program is available for public use at https://github.com/sheikhizadeh/pantools as an extension to our pan-genomic analysis tool, PanTools.
Published: 2018
Full Text: View/download PDF
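
The abstract above mentions a k-mer based approach that reduces the number of pairwise protein alignments. One generic way such a prefilter can work is sketched below; it is not the PanTools algorithm. The function names, the default `k`, and the `min_shared` cutoff are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq: str, k: int) -> set:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_pairs(proteins: dict, k: int = 3, min_shared: int = 2) -> set:
    """Return protein pairs sharing at least `min_shared` distinct k-mers.

    Only these pairs would be passed on to a (more expensive) alignment step.
    """
    index = defaultdict(set)  # k-mer -> names of proteins containing it
    for name, seq in proteins.items():
        for km in kmers(seq, k):
            index[km].add(name)

    shared = defaultdict(int)  # (name_a, name_b) -> number of shared k-mers
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1

    return {pair for pair, n in shared.items() if n >= min_shared}


if __name__ == "__main__":
    proteins = {"p1": "MKTAYIAK", "p2": "MKTAYLAK", "p3": "GGGGSGGG"}
    print(candidate_pairs(proteins))
```

Raising `min_shared` trades recall for precision, which loosely mirrors the recall/precision trade-off the abstract reports for its own parameters.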

12. Efficient inference of homologs in large eukaryotic pan-proteomes.

Author: Sheikhizadeh Anari, Siavash, de Ridder, Dick, Schranz, M. Eric, and Smit, Sandra
Subjects: *EUKARYOTIC genomes, *GENOMICS, *GENES, *PROTEOMICS, *HOMOLOGY (Biochemistry), *MATHEMATICAL models
Abstract: Background: Identification of homologous genes is fundamental to comparative genomics, functional genomics and phylogenomics. Extensive public homology databases are of great value for investigating homology but need to be continually updated to incorporate new sequences. As new sequences are rapidly being generated, there is a need for efficient standalone tools to detect homologs in novel data. Results: To address this, we present a fast method for detecting homology groups across a large number of individuals and/or species. We adopted a k-mer based approach which considerably reduces the number of pairwise protein alignments without sacrificing sensitivity. We demonstrate accuracy, scalability, efficiency and applicability of the presented method for detecting homology in large proteomes of bacteria, fungi, plants and Metazoa. Conclusions: We clearly observed the trade-off between recall and precision in our homology inference. Favoring recall or precision strongly depends on the application. The clustering behavior of our program can be optimized for particular applications by altering a few key parameters. The program is available for public use at https://github.com/sheikhizadeh/pantools as an extension to our pan-genomic analysis tool, PanTools. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

13. Insights from Ion Binding Site Network Analysis into Evolution and Functions of Proteins.

Author: Škrlj, Blaž, Kunej, Tanja, and Konc, Janez
Subjects: BINDING sites, PROTEIN binding
Abstract: Abstract: Many biological phenomena can be represented as complex networks. Using a protein binding site comparison approach, we generated a network of ion binding sites on the scale of all known protein structures from the Protein Data Bank. We found that this ion binding site similarity network is scale‐free, indicating a network in which a few ion binding site scaffolds are the network hubs, and these are connected to hundreds of nodes, whereas the vast majority of nodes have only a few neighbors. Enrichment and statistical analysis of the network components and communities yielded insights into underlying processes from the functional and the structural perspective. Largest components and communities were observed to be closely related to basic metabolic processes and some of the most common structural folds, which, from the evolutionary point of view, indicates that they may be the oldest ones. Further, we derived the first comprehensive map of ion interchangeability, based on binding site similarity. Several highly interchangeable protein‐ion binding site pairs emerged (e.g., Ca2+ and Mg2+), as well as structurally distinct ones. The constructed network of ion binding site similarities will aid in understanding the general principles of protein‐ion binding sites structure, function and evolution. We demonstrate potential uses of the network on proteins involved in cancer development and immune response, where individual ions play prominent roles in disease development. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

14. Simple Ligand–Receptor Interaction Descriptor (SILIRID) for alignment-free binding site comparison

Author: Vladimir Chupakhin, Gilles Marcou, Helena Gaspar, and Alexandre Varnek
Subjects: Protein–ligand interactions, Interaction fingerprints, Protein similarity, Protein classification, Chemogenomics, Generative Topographic Mapping, Biotechnology, TP248.13-248.65
Abstract: We describe SILIRID (Simple Ligand–Receptor Interaction Descriptor), a novel fixed size descriptor characterizing protein–ligand interactions. SILIRID can be obtained from the binary interaction fingerprints (IFPs) by summing up the bits corresponding to identical amino acids. This results in a vector of 168 integer numbers corresponding to the product of the number of entries (20 amino acids and one cofactor) and 8 interaction types per amino acid (hydrophobic, aromatic face to face, aromatic edge to face, H-bond donated by the protein, H-bond donated by the ligand, ionic bond with protein cation and protein anion, and interaction with metal ion). Efficiency of SILIRID to distinguish different protein binding sites has been examined in similarity search in sc-PDB database, a druggable portion of the Protein Data Bank, using various protein–ligand complexes as queries. The performance of retrieval of structurally and evolutionary related classes of proteins was comparable to that of state-of-the-art approaches (ROC AUC ≈ 0.91). SILIRID can efficiently be used to visualize chemogenomic space covered by sc-PDB using Generative Topographic Mapping (GTM): sc-PDB SILIRID data form clusters corresponding to different protein types.
Published: 2014
Full Text: View/download PDF
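
The construction described in the abstract above (summing interaction-fingerprint bits per amino acid type into 21 × 8 = 168 integers) can be sketched as follows. The input format, the ordering of the 21 entries, the `"COF"` label for the cofactor, and the function name are assumptions for illustration; only the dimensions and the list of eight interaction types come from the abstract.

```python
# 20 standard amino acids plus one cofactor entry (ordering is an assumption).
ENTRIES = [
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
    "COF",
]
# Eight interaction types per entry, in the order listed in the abstract.
N_TYPES = 8


def silirid(ifp):
    """Collapse a per-residue binary interaction fingerprint into a
    fixed-size vector of 168 integers.

    ifp: iterable of (residue_name, bits) where bits has N_TYPES 0/1 values.
    """
    vec = [0] * (len(ENTRIES) * N_TYPES)
    for name, bits in ifp:
        if len(bits) != N_TYPES:
            raise ValueError("each residue needs exactly 8 interaction bits")
        base = ENTRIES.index(name) * N_TYPES
        for k, bit in enumerate(bits):
            vec[base + k] += bit  # residues of the same type are summed
    return vec


if __name__ == "__main__":
    ifp = [
        ("ALA", [1, 0, 0, 0, 0, 0, 0, 0]),
        ("ALA", [1, 0, 0, 1, 0, 0, 0, 0]),
        ("COF", [0, 0, 0, 0, 0, 0, 0, 1]),
    ]
    v = silirid(ifp)
    print(len(v), sum(v))
```

Because the vector length no longer depends on the number of binding-site residues, two sites can be compared directly with any vector distance, without aligning them.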

15. Some Notes on the Complexity of Protein Similarity Search under mRNA Structure Constraints

Author: Bongartz, Dirk, Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, Van Emde Boas, Peter, editor, Pokorný, Jaroslav, editor, Bieliková, Mária, editor, and Štuller, Július, editor
Published: 2004
Full Text: View/download PDF

16. Applications of Protein Secondary Structure Algorithms in SARS-CoV-2 Research

Author: Alibek Kruglikov, Yulong Wei, Mohan Rakesh, and Xuhua Xia
Subjects: basic medicine, Models, Molecular, Proteomics, Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), Reviews, Sequence alignment, Genome, Viral, Biology, spike protein, Biochemistry, Genome, Protein Structure, Secondary, medical and health sciences, Protein structure, protein similarity, Animals, Humans, Protein Interaction Domains and Motifs, Peptide sequence, Protein secondary structure, Pandemics, Sequence (medicine), biochemistry & molecular biology, Host Microbial Interactions, SARS-CoV-2, COVID-19, secondary structure, General Chemistry, Amino acid, developmental biology, chemistry, Spike Glycoprotein, Coronavirus, Receptors, Virus, Angiotensin-Converting Enzyme 2, Algorithm, Sequence Alignment, Algorithms
Abstract: Since the outset of COVID-19, the pandemic has prompted immediate global efforts to sequence SARS-CoV-2, and over 450 000 complete genomes have been publicly deposited over the course of 12 months. Despite this, comparative nucleotide and amino acid sequence analyses often fall short in answering key questions in vaccine design. For example, the binding affinity between different ACE2 receptors and SARS-CoV-2 spike protein cannot be fully explained by amino acid similarity at ACE2 contact sites because protein structure similarities are not fully reflected by amino acid sequence similarities. To comprehensively compare protein homology, secondary structure (SS) analysis is required. While protein structure is slow and difficult to obtain, SS predictions can be made rapidly, and a well-predicted SS structure may serve as a viable proxy to gain biological insight. Here we review algorithms and information used in predicting protein SS to highlight its potential application in pandemic research. We also show examples of how SS predictions can be used to compare ACE2 proteins and to evaluate the zoonotic origins of viruses. As computational tools are much faster than wet-lab experiments, these applications can be important for research, especially in times when quickly obtained biological insights can help speed up the response to pandemics.
Published: 2021

17. Identificação de possíveis ESTs de Eucalyptus relacionados a genes responsáveis pelo alongamento celular (Nota Científica). Identification of possible ESTs of Eucalyptus related to genes responsible for elongation (Scientific Note)

Author: Léo ZIMBACK, Edson Seizo MORI, Mario Luiz Teixeira de MORAES, Daniel Dias ROSA, Edson Luiz FURTADO, Celso Luiz MARINO, Ivan de Godoy MAIA, Carlos Frederico WILCKEN, Edivaldo Domingues VELINI, Iraê Amaral GUERRINI, and Hideyo AOKI
Subjects: cDNA, mRNA, protein similarity, sequence alignments, Forestry, SD1-669.5
Abstract: The objective of this study was to mine in silico the ESTs of the Eucalyptus Genome database (FORESTs, 2001) that correlate with cell elongation genes involved in growth in general. Using cDNA libraries, we identified EST clusters similar to proteins that control cell elongation and are registered at the National Center for Biotechnology Information (NCBI, 2002). The search showed similarity for the following EST clusters: EGEZLV2207D07.g with the enzyme sterol delta reductase, EGEQRT4200A01.g with the LKB brassinosteroid biosynthetic protein, EGJMLV2220C06.g with the EMBL mitochondrial half-ABC transporter, EGACST6260F07.g and EGMCFB1086B12.g with steroid 22-α hydroxylase (CYP90) (P450), and EGEQSL1051D09.g with EGEZSL5201E08.g for the copine protein.
Published: 2012

18. Comparative Analysis of Unsupervised Protein Similarity Prediction Based on Graph Embedding

Author: Shudong Wang, Junliang Shang, Ziqi Wang, and Yuanyuan Zhang
Subjects: graph embedding, Dynamic time warping, DTW algorithm, Computer science, business.industry, Graph embedding, Feature vector, Pattern recognition, Link (geometry), QH426-470, Measure (mathematics), ComputingMethodologies_PATTERNRECOGNITION, Similarity (network science), protein similarity, Methods, Genetics, Molecular Medicine, Trigonometric functions, Graph (abstract data type), gene ontology, Artificial intelligence, business, Genetics (clinical), link prediction
Abstract: The study of protein–protein interaction and the determination of protein functions are important parts of proteomics. Computational methods are used to study the similarity between proteins based on Gene Ontology (GO) to explore their functions and possible interactions. GO is a series of standardized terms that describe gene products from molecular functions, biological processes, and cell components. Previous studies on assessing the similarity of GO terms were primarily based on Information Content (IC) between GO terms to measure the similarity of proteins. However, these methods tend to ignore the structural information between GO terms. Therefore, considering the structural information of GO terms, we systematically analyze the performance of the GO graph and GO Annotation (GOA) graph in calculating the similarity of proteins using different graph embedding methods. When applied to the actual Human and Yeast datasets, the feature vectors of GO terms and proteins are learned based on different graph embedding methods. To measure the similarity of the proteins annotated by different GO numbers, we used Dynamic Time Warping (DTW) and cosine to calculate protein similarity in GO graph and GOA graph, respectively. Link prediction experiments were then performed to evaluate the reliability of protein similarity networks constructed by different methods. It is shown that graph embedding methods have obvious advantages over the traditional IC-based methods. We found that random walk graph embedding methods, in particular, showed excellent performance in calculating the similarity of proteins. By comparing link prediction experiment results from GO(DTW) and GOA(cosine) methods, it is shown that GO(DTW) features provide highly effective information for analyzing the similarity among proteins.
Published: 2021
Full Text: View/download PDF
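
Illustrative note on record 18: the abstract names two generic similarity measures, Dynamic Time Warping (DTW) and cosine. The sketch below shows textbook versions of both in plain Python. It is not the authors' implementation; the function names, the choice of cosine distance as the per-item cost inside DTW, and the toy vectors are assumptions made only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors (assumption)
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """1 - cosine similarity, used below as a per-item cost."""
    return 1.0 - cosine_similarity(u, v)


def dtw_distance(seq_a, seq_b, dist):
    """Classic DTW cost between two sequences under a pairwise cost `dist`.

    The sequences may have different lengths, which is why DTW suits
    proteins annotated with different numbers of terms.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical toy data: each protein is a list of term embedding vectors.
protein_x = [[1.0, 0.0], [0.0, 1.0]]
protein_y = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(dtw_distance(protein_x, protein_y, cosine_distance))
```

A lower DTW cost means the two annotation sequences align more closely; turning that cost into a similarity score is a separate design choice that this sketch leaves open.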

19. Parallelization of large-scale drug–protein binding experiments

Author: Antonios Makris, Dimitrios Michail, Mark Sawyer, and Iraklis Varlamis
Subjects: Computer Networks and Communications, business.industry, Computer science, Drug discovery, Pipeline (computing), Process (computing), 020206 networking & telecommunications, 02 engineering and technology, Parallel computing, Task (project management), Software, Memory management, Protein similarity, Hardware and Architecture, 0202 electrical engineering, electronic engineering, information engineering, Leverage (statistics), 020201 artificial intelligence & image processing, business, Pharmaceutical industry
Abstract: The pharmaceutical industry invests billions of dollars every year in new drug research. Part of this research focuses on repositioning established drugs to new disease indications and is based on “drug promiscuity”, that is, the ability of certain drugs to bind multiple proteins. The increased cost of wet-lab experiments makes in-silico alternatives a promising solution. In order to find similar protein targets for an existing drug, it is necessary to analyse the protein and drug structures and find potential similarities. This task is highly demanding in computational resources. However, algorithmic advances in conjunction with increased computational resources can make it tractable and increase the success rate of drug discovery at significantly lower cost. The current work proposes several algorithms that implement the protein similarity task in a parallel high-performance computing environment, solve several load imbalance and memory management issues, and take maximum advantage of the available resources. The proposed optimizations achieve better memory and CPU balancing and faster execution times. Several parts of the previously linear processing pipeline, which used different software packages, have been re-engineered in order to improve process parallelization. Experimental results on a high-performance computing environment with up to 1024 cores and 2048 GB of memory demonstrate the effectiveness of our approach, which scales well to large numbers of protein pairs.
Published: 2019
Full Text: View/download PDF

20. Scalable remote homology detection and fold recognition in massive protein networks

Author: Wei Zhang, Molly A. Srour, Yousef Saad, Rui Kuang, Zhuliu Li, and Raphael Petegrosso
Subjects: Computer science, Computation, Cloud computing, Computational biology, Biochemistry, Homology (biology), 03 medical and health sciences, Protein similarity, Sequence Analysis, Protein, Structural Biology, Humans, CASP, Molecular Biology, 030304 developmental biology, 0303 health sciences, business.industry, 030302 biochemistry & molecular biology, Computational Biology, Proteins, ComputingMethodologies_PATTERNRECOGNITION, Scalability, Pairwise comparison, business, Protein network, Algorithms, Software
Abstract: The global connectivities in very large protein similarity networks contain traces of evolution among the proteins that can be used to detect remote evolutionary relations or structural similarities. A key limitation in investigating how well a protein network captures this evolutionary information is the intensive computation of pairwise sequence similarities needed to construct very large protein networks. In this article, we introduce label propagation on low-rank kernel approximation (LP-LOKA) for searching massively large protein networks. LP-LOKA propagates initial protein similarities in a low-rank graph by Nyström approximation without computing all pairwise similarities. With scalable parallel implementations based on distributed memory using the message-passing interface and on Apache Hadoop/Spark in the cloud, LP-LOKA can search protein networks with one million proteins or more. In the experiments on Swiss-Prot/ADDA/CASP data, LP-LOKA significantly improved protein ranking over the widely used HMM-HMM or profile-sequence alignment methods by utilizing large protein networks. It was observed that the larger the protein similarity network, the better the performance, especially on relatively small protein superfamilies and folds. The results suggest that computing massively large protein networks is necessary to meet the growing need to annotate proteins from newly sequenced species, and that LP-LOKA is both scalable and accurate for searching massively large protein networks.
Published: 2019
Full Text: View/download PDF

21. ProtDec-LTR3.0: Protein Remote Homology Detection by Incorporating Profile-Based Features Into Learning to Rank

Author: Yulin Zhu and Bin Liu
Subjects: General Computer Science, Computer science, 0206 medical engineering, Protein sequence analysis, Theoretical research, 02 engineering and technology, Machine learning, computer.software_genre, feature mapping strategy, Homology (biology), law.invention, 03 medical and health sciences, Protein similarity, PageRank, law, Protein remote homology detection, profile-based features, General Materials Science, hyperlink-induced topic search, 030304 developmental biology, learning to rank, 0303 health sciences, business.industry, General Engineering, Protein homology, Learning to rank, Artificial intelligence, lcsh:Electrical engineering. Electronics. Nuclear engineering, business, pagerank, computer, lcsh:TK1-9971, 020602 bioinformatics
Abstract: Protein remote homology detection is one of the most challenging problems in the field of protein sequence analysis, and it is an important step for both theoretical research (such as understanding the structures and functions of proteins) and drug design. Previous studies have shown that combining different ranking methods via the learning to rank algorithm is an effective strategy for remote protein homology detection, and that the performance can be further improved by protein similarity networks. In this paper, we improved the ProtDec-LTR1.0 and ProtDec-LTR2.0 predictors by incorporating three profile-based features (Top-1-gram, Top-2-gram, and ACC) into the framework of learning to rank via feature mapping strategies. The predictive performance was further refined by the PageRank (PR) algorithm and the hyperlink-induced topic search (HITS) algorithm. Finally, a predictor called ProtDec-LTR3.0 was proposed. Rigorous tests on two widely used benchmark datasets showed that the ProtDec-LTR3.0 predictor outperformed both ProtDec-LTR1.0 and ProtDec-LTR2.0, as well as nine other existing state-of-the-art predictors, indicating that ProtDec-LTR3.0 is an efficient method for protein remote homology detection and will become a useful tool for protein sequence analysis. A user-friendly web server for the ProtDec-LTR3.0 predictor was established for the convenience of users and can be accessed at http://bliulab.net/ProtDec-LTR3.0/.
Published: 2019
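
Illustrative note on record 21: the abstract mentions refining rankings with the PageRank algorithm on a protein similarity network. The sketch below is a generic weighted PageRank by power iteration, not the ProtDec-LTR3.0 code. The graph encoding (a dict of dicts of non-negative edge weights), the damping factor of 0.85, and the toy protein names are all assumptions.

```python
def pagerank(weights, damping=0.85, iterations=100, tol=1e-10):
    """Weighted PageRank by power iteration.

    `weights[u][v]` is the non-negative weight of the edge u -> v
    (for example, a similarity score). Returns a dict node -> rank,
    with ranks summing to 1.
    """
    nodes = sorted(set(weights) | {v for nbrs in weights.values() for v in nbrs})
    n = len(nodes)
    if n == 0:
        return {}
    rank = {u: 1.0 / n for u in nodes}
    out_sum = {u: sum(weights.get(u, {}).values()) for u in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing weight is spread uniformly.
        dangling = sum(rank[u] for u in nodes if out_sum[u] == 0)
        new = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            if out_sum[u] == 0:
                continue
            for v, w in weights[u].items():
                new[v] += damping * rank[u] * w / out_sum[u]
        delta = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical toy network: protein "a" is similar to both "b" and "c".
toy_network = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0},
    "c": {"a": 1.0},
}
ranks = pagerank(toy_network)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In a re-ranking setting, scores of this kind could be combined with an initial ranking so that proteins well connected in the similarity network move up; how the published predictor does this is not stated in the abstract.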

22. FEGS: a novel feature extraction model for protein sequences and its applications

Author: Xiaoping Liu, Juntao Liu, Ting Yu, Zengchao Mu, Hongyu Zheng, and Leyi Wei
Subjects: Research areas, Computer science, QH301-705.5, Physicochemical properties of amino acids, Feature extraction, Computer applications to medicine. Medical informatics, R858-859.7, 010402 general chemistry, 01 natural sciences, Biochemistry, Statistical features, 03 medical and health sciences, Protein sequencing, Protein similarity, Structural Biology, Sequence Analysis, Protein, Amino Acid Sequence, Biology (General), Amino Acids, Representation (mathematics), Molecular Biology, Phylogeny, 030304 developmental biology, 0303 health sciences, Sequence, Phylogenetic tree, Graphical representation, business.industry, Applied Mathematics, Research, Proteins, Pattern recognition, 0104 chemical sciences, Computer Science Applications, Artificial intelligence, DNA microarray, business, Protein similarity analysis, Algorithms
Abstract: Background: Feature extraction of protein sequences is widely used in various research areas related to protein analysis, such as protein similarity analysis and prediction of protein functions or interactions. Results: In this study, we introduce FEGS (Feature Extraction based on Graphical and Statistical features), a novel feature extraction model of protein sequences, by developing a new technique for graphical representation of protein sequences based on the physicochemical properties of amino acids and effectively employing the statistical features of protein sequences. By fusing the graphical and statistical features, FEGS transforms a protein sequence into a 578-dimensional numerical vector. When FEGS is applied to phylogenetic analysis on five protein sequence data sets, its performance is notably better than all of the other compared methods. Conclusion: The FEGS method is carefully designed and is powerful in practice for extracting features of protein sequences. The current version of FEGS is developed to be user-friendly and is expected to play a crucial role in related studies of protein sequence analysis.
Published: 2021

23. FUNCTIONAL CENTRALITY: DETECTING LETHALITY OF PROTEINS IN PROTEIN INTERACTION NETWORKS.

Author: Kar Leong Tew, Xiao-Li Li, and Soon-Heng Tan
Subjects: PROTEINS, PROTEIN-protein interactions, SACCHAROMYCES cerevisiae, GENE knockout, SUBGRAPHS
Published: 2007

24. SOME EXPERIENCES WITH SOLVING SEMIDEFINITE PROGRAMMING RELAXATIONS OF BINARY QUADRATIC OPTIMIZATION MODELS IN COMPUTATIONAL BIOLOGY.

Author: ENGAU, ALEXANDER
Subjects: SEMIDEFINITE programming, QUADRATIC programming, COMPUTATIONAL biology, INTEGER programming, COMBINATORIAL optimization, PROTEIN folding
Abstract: We present two recent integer programming models in molecular biology and study practical reformulations to compute solutions to some of these problems. In extension of previously tested linearization techniques, we formulate corresponding semidefinite relaxations and discuss practical rounding strategies to find good feasible approximate solutions. Our computational results highlight the possible advantages and remaining challenges of this approach especially on large-scale problems. [ABSTRACT FROM AUTHOR]
Published: 2014
Full Text: View/download PDF

25. A topological similarity measure for proteins.

Author: Máté, Gabriell, Hofmann, Andreas, Wenzel, Nicolas, and Heermann, Dieter W.
Subjects: *PROTEIN structure, *TOPOLOGY, *COMPUTATIONAL biology, *GEOMETRIC analysis, *LIGAND binding (Biochemistry), *CHEMICAL structure
Abstract: We introduce a new measure for assessing similarity among chemical structures, based on well-established computational-topology algorithms. We argue that although the method considers geometry, it is more than a mere geometric similarity measure, as it takes into account, on different geometric scales, the important topological features of the compared structures. We prove that our measure is rigorous and complies with the proper mathematical requirements. We validate the method through comparing different configurations of simple zinc finger proteins and present an application on ligands binding to membrane proteins extracted from the Directory of Useful Decoys: Enhanced database and corresponding decoys. This article is part of a Special Issue entitled: Viral membrane proteins - Channels for cellular networking. [Copyright Elsevier]
Published: 2014
Full Text: View/download PDF

26. ADEPT: a domain independent sequence alignment strategy for gpu architectures

Author: Muaaz Gul Awan, Steven Hofmeyr, Jack Deslippe, Katherine Yelick, Aydin Buluc, Oguz Selvitopi, and Leonid Oliker
Subjects: Computer science, GPU, Sequence assembly, Parallel computing, Biochemistry, Genome, Mathematical Sciences, chemistry.chemical_compound, 0302 clinical medicine, Protein similarity, Structural Biology, Nucleotide, SIMD, lcsh:QH301-705.5, chemistry.chemical_classification, 0303 health sciences, Applied Mathematics, Adept, Biological Sciences, Computer Science Applications, Amino acid, Dynamic programming, Networking and Information Technology R&D, Graph (abstract data type), lcsh:R858-859.7, DNA microarray, Algorithms, Biotechnology, Bioinformatics, Sequence alignment, lcsh:Computer applications to medicine. Medical informatics, 03 medical and health sciences, Information and Computing Sciences, Genetics, Humans, Cluster analysis, Molecular Biology, 030304 developmental biology, Alignment, Smith–Waterman algorithm, Protein, Human Genome, Computational Biology, DNA, chemistry, lcsh:Biology (General), Metagenomics, Generic health relevance, Sequence Alignment, 030217 neurology & neurosurgery, Software
Abstract: Background: Bioinformatic workflows frequently make use of automated genome assembly and protein clustering tools. At the core of most of these tools, a significant portion of execution time is spent in determining the optimal local alignment between two sequences. This task is performed with the Smith-Waterman algorithm, which is a dynamic programming based method. With the advent of modern sequencing technologies and the increasing size of both genome and protein databases, a need for faster Smith-Waterman implementations has emerged. Multiple SIMD strategies for the Smith-Waterman algorithm are available for CPUs. However, with the move of HPC facilities towards accelerator based architectures, a need for an efficient GPU accelerated strategy has emerged. Existing GPU based strategies have either been optimized for a specific type of characters (nucleotides or amino acids) or for only a handful of application use-cases. Results: In this paper, we present ADEPT, a new sequence alignment strategy for GPU architectures that is domain independent, supporting alignment of sequences from both genomes and proteins. Our proposed strategy uses GPU specific optimizations that do not rely on the nature of the sequence. We demonstrate the feasibility of this strategy by implementing the Smith-Waterman algorithm and comparing it to similar CPU strategies as well as the fastest known GPU methods for each domain. ADEPT’s driver enables it to scale across multiple GPUs and allows easy integration into software pipelines which utilize large scale computational systems. We have shown that the ADEPT based Smith-Waterman algorithm demonstrates a peak performance of 360 GCUPS and 497 GCUPS for protein based and DNA based datasets respectively on a single GPU node (8 GPUs) of the Cori Supercomputer. Overall, ADEPT shows 10x faster performance in a node-to-node comparison against a corresponding SIMD CPU implementation. Conclusions: ADEPT demonstrates a performance that is either comparable to or better than existing GPU strategies. We demonstrated the efficacy of ADEPT in supporting existing bioinformatics software pipelines by integrating ADEPT in MetaHipMer, a high-performance de novo metagenome assembler, and PASTIS, a high-performance protein similarity graph construction pipeline. Our results show 10% and 30% boosts in performance in MetaHipMer and PASTIS respectively.
Published: 2020
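
Illustrative note on record 26: the abstract centres on the Smith-Waterman algorithm, a dynamic programming method for optimal local alignment. The sketch below is a minimal single-threaded reference version that returns only the best local score. It is not ADEPT's GPU implementation; the linear gap penalty and the match, mismatch, and gap values are assumptions chosen for readability, whereas real protein alignments typically use a substitution matrix.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of strings `a` and `b`.

    Uses a linear gap penalty and keeps only two rows of the dynamic
    programming matrix, so memory is O(len(b)).
    """
    cols = len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            # One "cell update": the unit counted by the GCUPS metric.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap
            left = curr[j - 1] + gap
            # Clamping at 0 lets an alignment restart anywhere (local alignment).
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTT", "GGACGGG"))  # shared "ACG" scores 3 * 2 = 6
```

Each cell depends only on its upper, left, and upper-left neighbours, so cells on the same anti-diagonal are independent of one another; this is the property that SIMD and GPU implementations generally exploit.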

27. FoldRec-C2C: protein fold recognition by combining cluster-to-cluster model and protein similarity network

Author: Ke Yan, Bin Liu, and Jiangyi Shao
Subjects: Models, Molecular, Protein Folding, Source code, AcademicSubjects/SCI01060, Coronavirus disease 2019 (COVID-19), Computer science, media_common.quotation_subject, Datasets as Topic, 03 medical and health sciences, Protein structure, Protein similarity, Cluster Analysis, Molecular Biology, 030304 developmental biology, media_common, 0303 health sciences, business.industry, 030302 biochemistry & molecular biology, Computational Biology, Proteins, Pattern recognition, cluster-to-cluster model, protein fold recognition, Problem Solving Protocol, seq-to-cluster model, Learning to rank, Protein folding, Artificial intelligence, Performance improvement, seq-to-seq model, business, Information Systems
Abstract: As a key to studying protein structures, protein fold recognition plays an important role in predicting the protein structures associated with COVID-19 and other important structures. However, existing computational predictors focus only on pairwise protein similarity or on the similarity between two groups of proteins from two folds, whereas the homology relationships among proteins form a hierarchical structure. The global protein similarity network will contribute to the performance improvement. In this study, we proposed a predictor called FoldRec-C2C to globally incorporate the interactions among proteins into the prediction. In the FoldRec-C2C predictor, the protein fold recognition problem is treated as an information retrieval task in natural language processing. The initial ranking results were generated by a supervised ranking algorithm, Learning to Rank, and then three re-ranking algorithms were applied to the ranking lists to adjust the results globally based on the protein similarity network: the seq-to-seq model, the seq-to-cluster model, and the cluster-to-cluster model (C2C). When tested on the widely used and rigorous LINDAHL benchmark dataset, FoldRec-C2C outperforms 34 other state-of-the-art methods in this field. The source code and data of FoldRec-C2C can be downloaded from http://bliulab.net/FoldRec-C2C/download.
Published: 2020
Full Text: View/download PDF

28. A Refined 3-in-1 Fused Protein Similarity Measure: Application in Threshold-Free Hub Detection

Author: Yi Pan, Sudipta Acharya, and Laizhong Cui
Subjects: Proteomics, Proximity measure, Gene ontology, Computer science, Applied Mathematics, 0206 medical engineering, Computational Biology, Proteins, 02 engineering and technology, computer.software_genre, Measure (mathematics), Protein sequencing, Gene Ontology, Genetic similarity, Protein similarity, Genetics, Cluster Analysis, Data mining, Literature survey, Cluster analysis, computer, 020602 bioinformatics, Algorithms, Biotechnology
Abstract: An exhaustive literature survey shows that finding protein/gene similarity is an important step towards solving widespread bioinformatics problems, such as predicting protein-protein interactions, analyzing Protein-Protein Interaction Networks (PPINs), gene prioritization, and disease gene/protein detection. In this article, we propose an improved 3-in-1 fused protein similarity measure called FuSim-II. It is built by combining, as a weighted average, biological knowledge extracted from three genomic/proteomic resources: Gene Ontology (GO), PPIN, and protein sequence. Furthermore, we show the application of the proposed measure in detecting potential hub-proteins from a given PPIN. To that end, we propose a multi-objective clustering-based protein hub detection framework with FuSim-II working as the underlying proximity measure. The PPINs of H. sapiens and M. musculus are chosen for experimental purposes. Unlike most existing hub-detection methods, the proposed technique does not require any protein degree cut-off or threshold to define hubs. A thorough assessment of efficiency between the proposed measure and eight existing protein similarity measures, along with eight single/multi-objective clustering methods, has been carried out. Internal cluster validity indices such as Silhouette and Davies-Bouldin (DB) are used for the analytical study. Also, a comparative performance analysis between the proposed and five existing hub-protein detection algorithms is conducted through an essentiality enrichment study. The reported results show the improved performance of FuSim-II over existing protein similarity measures in terms of identifying functionally related proteins as well as relevant hub-proteins. Supplementary material is available at http://csse.szu.edu.cn/staff/cuilz/eng/index.html.
Published: 2020

29. CHARACTERIZATION OF SOME INDIAN SESAME (Sesamum indicum L.) CULTIVARS THROUGH SOLUBLE SEED STORAGE PROTEIN MARKERS

Author: A Das, S K Pandey, T Dasgupta, and P Bhattacharya
Subjects: chemistry.chemical_classification, Coat, General Veterinary, biology, Phenology, food and beverages, 04 agricultural and veterinary sciences, biology.organism_classification, 040401 food science, General Biochemistry, Genetics and Molecular Biology, Crop, Horticulture, 0404 agricultural biotechnology, chemistry, Protein similarity, Storage protein, Sesamum, Cultivar, General Agricultural and Biological Sciences, Polyacrylamide gel electrophoresis
Abstract: Seed storage protein markers, being less sensitive to environmental fluctuation than phenological traits, have been successfully employed in assessing divergence in many crop plants. The present study aimed to find correlations between seed storage protein markers in twenty-eight Indian sesame cultivars and their agro-ecological zone of adoption and seed coat colour. Sodium Dodecyl Sulphate Polyacrylamide Gel Electrophoresis (SDS-PAGE) revealed twenty-two protein bands in total, of which thirteen were polymorphic with varied molecular weights. Specific bands relating to specific agro-ecologies were found. Moreover, bands of 93.40 kDa and 68.05 kDa were found to be associated with the production of darker shades of seed coat colour. The clustering pattern based on protein similarity values offered no definite grouping, either by agro-ecological zone of adoption or by seed coat colour. It is concluded that individual protein banding patterns can be linked to agro-ecological adoption zone and seed coat colour, which is helpful in divergence and phylogenetic studies in sesame.
Published: 2018
Full Text: View/download PDF

30. The specific applications of the TSR-based method in identifying Zn2+ binding sites of proteases and ACE/ACE2.

Author: Sarkar T, Reaux CR, Li J, Raghavan VV, and Xu W
Abstract: We have developed an alignment-free TSR (Triangular Spatial Relationship)-based computational method for protein structural comparison and motif identification and discovery. To demonstrate the potential applications of the method, we have generated two datasets. One dataset contains five classes: Actin/Hsp70, serine protease (chymotrypsin/trypsin/elastase), ArsC/Prdx2, PKA/PKB/PKC, and AChE/BChE at the hierarchical level 1 and twelve groups at the level 2. The other dataset includes representative proteases and ACE/ACE2. The x, y, z coordinates of the structures were obtained from PDB. We calculated the keys (or features) that represent each structure using the TSR-based method. The dataset and data presented here include additional information that helps readers become aware of specific applications of the TSR-based method in protein clustering and in the identification and discovery of metal ion binding sites, as well as understand the effect of amino acid grouping on protein 3D structural relationships at both global and local levels. Competing Interests: The authors declare that they have no known competing financial interests or personal relationships which have or could be perceived to have influenced the work reported in this article. (Published by Elsevier Inc.)
Published: 2022
Full Text: View/download PDF

31. Mapping SNP-anchored genes using high-resolution melting analysis in almond.

Author: Shu-Biao Wu, Tavassolian, Iraj, Rabiei, Gholamreza, Hunt, Peter, Wirthensohn, Michelle, Gibson, John P., Ford, Christopher M., and Sedgley, Margaret
Subjects: *NUCLEOTIDES, *GENETIC polymorphisms, *ALMOND, *PRUNUS, *ROSACEAE
Abstract: Peach and almond have been considered as model species for the family Rosaceae and other woody plants. Consequently, mapping and characterisation of genes in these species has important implications. High-resolution melting (HRM) analysis is a recent development in the detection of SNPs and other markers, and proved to be an efficient and cost-effective approach. In this study, we aimed to map genes corresponding to known proteins in other species using the HRM approach. Prunus unigenes were searched and compared with known proteins in the public databases. We developed single-nucleotide polymorphism (SNP) markers, polymorphic in a mapping population produced from a cross between the cloned cultivars Nonpareil and Lauranne. A total of 12 SNP-anchored putative genes were genotyped in the population using HRM, and mapped to an existing linkage map. These genes were mapped on six linkage groups, and the predicted proteins were compared to putative orthologs in other species. Amongst those genes, four were abiotic stress-responsive genes, which can provide a starting point for construction of an abiotic resistance map. Two allergy and detoxification related genes, respectively, were also mapped and analysed. Most of the investigated genes had high similarities to sequences from closely related species such as apricot, apple and other eudicots, and these are putatively orthologous. In addition, it was shown that HRM can be an effective means of genotyping populations for the purpose of constructing a linkage map. Our work provides basic genomic information for the 12 genes, which can be used for further genetic and functional studies. [ABSTRACT FROM AUTHOR]
Published: 2009
Full Text: View/download PDF

32. Fixed-parameter algorithms for protein similarity search under mRNA structure constraints.

Author: Blin, Guillaume, Fertin, Guillaume, Hermelin, Danny, and Vialette, Stéphane
Subjects: PROTEIN engineering, MESSENGER RNA, MATHEMATICAL optimization, COMPUTER science, AMINO acids, BIOCHEMICAL engineering
Abstract: In the context of protein engineering, we consider the problem of computing an mRNA sequence of maximal codon-wise similarity to a given mRNA (and consequently, to a given protein) that additionally satisfies some secondary structure constraints, the so-called mRNA Structure Optimization (MRSO) problem. Since MRSO is known to be APX-hard, Bongartz [D. Bongartz, Some notes on the complexity of protein similarity search under mRNA structure constraints, in: Proc. of the 30th Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM), 2004, pp. 174–183] suggested attacking the problem using the approach of parameterized complexity. In this paper we propose three fixed-parameter algorithms that apply to several interesting parameters of MRSO. We believe these algorithms to be relevant for practical applications today, as well as for possible future applications. Furthermore, our results extend the known tractability borderline of MRSO, and provide new research horizons for further improvements of this sort. [Copyright Elsevier]
Published: 2008
Full Text: View/download PDF

33. Classification of proteins based on similarity of two-dimensional protein maps

Author: Albrecht, Birgit, Grant, Guy H., Sisu, Cristina, and Richards, W. Graham
Subjects: *PROTEIN kinases, *PHOSPHOTRANSFERASES, *PROTEINS, *MOLECULAR biology
Abstract: Data reduction techniques are now a vital part of numerical analysis, and principal component analysis is often used to identify important molecular features from a set of descriptors. We now take a different approach and apply data reduction techniques directly to protein structure. With this we can reduce the three-dimensional structural data to two dimensions while preserving the correct relationships. With two-dimensional representations, structural comparisons between proteins are accelerated significantly. This means that protein–protein similarity comparisons are now feasible on a large scale. We show how the approach can help to predict the function of kinase structures according to the Hanks' classification based on their structural similarity to different kinase classes. [Copyright Elsevier]
Published: 2008
Full Text: View/download PDF

34. Serpins: structure, function and molecular evolution

Author: van Gent, Diana, Sharp, Paul, Morgan, Kevin, and Kalsheker, Noor
Subjects: *SERINE proteinase inhibitors, *BLOOD coagulation
Abstract: The superfamily of serine proteinase inhibitors (serpins) are involved in a number of fundamental biological processes such as blood coagulation, complement activation, fibrinolysis, angiogenesis, inflammation and tumor suppression and are expressed in a cell-specific manner. The average protein size of a serpin family member is 350–400 amino acids, but gene structure varies in terms of number and size of exons and introns. Previous studies of all known serpins identified 16 clades and 10 orphan sequences. Vertebrate serpins can be conveniently classified into six sub-groups. We provide additional data that updates the phylogenetic analysis in the context of structural and functional properties of the proteins. From these, we can conclude that the functional classification of serpins relies on their protein structure and not on sequence similarity. [Copyright Elsevier]
Published: 2003
Full Text: View/download PDF

35. A new topological method to measure protein structure similarity

Author: Bostick, David and Vaisman, Iosif I.
Subjects: *PROTEINS, *TOPOLOGY
Abstract: A method for the quantitative evaluation of structural similarity between protein pairs is developed that makes use of a Delaunay-based topological mapping. The result of the mapping is a three-dimensional array which is representative of the global structural topology and whose elements can be used to construe an integral scoring scheme. This scoring scheme was tested for its dependence on the protein length difference in a pairwise comparison, its ability to provide a reasonable means for structural similarity comparison within a family of structural neighbors of similar length, and its sensitivity to the differences in protein conformation. It is shown that such a topological evaluation of similarity is capable of providing insight into these points of interest. Protein structure comparison using the method is computationally efficient and the topological scores, although providing different information about protein similarity, correlate well with the distance root-mean-square deviation values calculated by rigid-body structural alignment. [Copyright Elsevier]
Published: 2003
Full Text: View/download PDF

36. Gaussian-based Alignment of Protein Structures: Deriving a Consensus Superposition when Alternative Solutions Exist.

Author: Mestres, Jordi
Abstract: The use of a Gaussian-based representation of protein structures for evaluating protein-structure similarities and deriving three-dimensional superpositions is presented. The approach, as implemented in the program GAPS, is applied to three pairs of proteins with different topological characteristics (rich α-helix, mixed α-helix/β-strand, and rich β-strand), low sequence identities (10–30%), and recognized difficulties to define a unique optimum alignment. Validation of the GAPS superpositions is done by comparison with superpositions obtained by the TOP, GA_FIT, and ALIGN programs and those directly extracted from the FSSP database. Results suggest that a Gaussian-based methodology offers an objective means to, depending on the Gaussian-based representation, derive a consensus three-dimensional superposition when alternative superposition solutions exist. [ABSTRACT FROM AUTHOR]
Published: 2000
Full Text: View/download PDF

37. Real time structural search of the Protein Data Bank

Author: Jose M. Duarte, Stephen K. Burley, and Dmytro Guzenko
Subjects: 0301 basic medicine, Models, Molecular, Protein Structure Comparison, Computer science, Protein Conformation, Cell, Protein Data Bank (RCSB PDB), Normal Distribution, Polypeptide chain, Oligomer, Biochemistry, Polynomials, chemistry.chemical_compound, Structural bioinformatics, Database and Informatics Methods, 0302 clinical medicine, Protein structure, Protein similarity, Macromolecular Structure Analysis, Search problem, Biology (General), Databases, Protein, Ecology, Physics, A protein, computer.file_format, Condensed Matter Physics, Stoichiometry, Chemistry, medicine.anatomical_structure, Computational Theory and Mathematics, Modeling and Simulation, Physical Sciences, Algorithm, Sequence Analysis, Algorithms, Research Article, Normalization (statistics), Protein Structure, Multiple Alignment Calculation, QH301-705.5, Bioinformatics, Protein subunit, Sequence alignment, Research and Analysis Methods, 03 medical and health sciences, Cellular and Molecular Neuroscience, Similarity (network science), Computational Techniques, medicine, Genetics, Electron Density, Molecular Biology, Ecology, Evolution, Behavior and Systematics, Internet, Models, Statistical, Computational Biology, Proteins, Biology and Life Sciences, Polypeptides, Atomic coordinates, Protein Data Bank, Split-Decomposition Method, 030104 developmental biology, Algebra, chemistry, Peptides, computer, Sequence Alignment, 030217 neurology & neurosurgery, Software, Mathematics
Abstract: Detection of protein structure similarity is a central challenge in structural bioinformatics. Comparisons are usually performed at the polypeptide chain level, however the functional form of a protein within the cell is often an oligomer. This fact, together with recent growth of oligomeric structures in the Protein Data Bank (PDB), demands more efficient approaches to oligomeric assembly alignment/retrieval. Traditional methods use atom level information, which can be complicated by the presence of topological permutations within a polypeptide chain and/or subunit rearrangements. These challenges can be overcome by comparing electron density volumes directly. But, brute force alignment of 3D data is a compute intensive search problem. We developed a 3D Zernike moment normalization procedure to orient electron density volumes and assess similarity with unprecedented speed. Similarity searching with this approach enables real-time retrieval of proteins/protein assemblies resembling a target, from PDB or user input, together with resulting alignments (http://shape.rcsb.org).
Author summary: Protein structures possess wildly varied shapes, but patterns at different levels are frequently reused by nature. Finding and classifying these similarities is fundamental to understand evolution. Given the continued growth in the number of known protein structures in the Protein Data Bank, the task of comparing them to find the common patterns is becoming increasingly complicated. This is especially true when considering complete protein assemblies with several polypeptide chains, where the large sizes further complicate the issue. Here we present a novel method that can detect similarity between protein shapes and that works equally fast for any size of proteins or assemblies. The method looks at proteins as volumes of density distribution, departing from what is more usual in the field: similarity assessment based on atomic coordinates and chain connectivity.
A volumetric function is amenable to be decomposed with a mathematical tool known as 3D Zernike polynomials, resulting in a compact description as vectors of Zernike moments. The tool was introduced in the 1990s, when it was suggested that the moments could be normalized to be invariant to rotations without losing information. Here we demonstrate that in fact this normalization is possible and that it offers a much more accurate method for assessing similarity between shapes, when compared to previous attempts.
Published: 2019

38. MADOKA: an ultra-fast approach for large-scale protein structure similarity searching

Author: Lei Deng, Hui Liu, Guolun Zhong, Chenzhe Liu, and Judong Luo
Subjects: Computer science, Protein structure alignment, Structural alignment, Parallel programming, Protein Data Bank (RCSB PDB), Structural neighbor searching, lcsh:Computer applications to medicine. Medical informatics, computer.software_genre, Biochemistry, 03 medical and health sciences, Structural bioinformatics, 0302 clinical medicine, Protein structure, Fragment (logic), Protein similarity, Similarity (network science), Structural Biology, lcsh:QH301-705.5, Molecular Biology, 030304 developmental biology, 0303 health sciences, Applied Mathematics, Methodology, Computational Biology, Proteins, Computer Science Applications, Visualization, lcsh:Biology (General), lcsh:R858-859.7, Data mining, DNA microarray, computer, 030217 neurology & neurosurgery, Algorithms, Software
Abstract: Background: Protein comparative analysis and similarity searches play essential roles in structural bioinformatics. A couple of algorithms for protein structure alignments have been developed in recent years. However, facing the rapid growth of protein structure data, improving overall comparison performance and running efficiency with massive sequences is still challenging. Results: Here, we propose MADOKA, an ultra-fast approach for massive structural neighbor searching using a novel two-phase algorithm. Initially, we apply a fast alignment between pairwise structures. Then, we employ a score to select pairs with more similarity to carry out a more accurate fragment-based residue-level alignment. MADOKA performs about 6–100 times faster than existing methods, including TM-align and SAL, in massive alignments. Moreover, the quality of structural alignment of MADOKA is better than the existing algorithms in terms of TM-score and number of aligned residues. We also develop a web server to search structural neighbors in PDB database (about 360,000 protein chains in total), as well as additional features such as 3D structure alignment visualization. The MADOKA web server is freely available at: http://madoka.denglab.org/ Conclusions: MADOKA is an efficient approach to search for protein structure similarity. In addition, we provide a parallel implementation of MADOKA which exploits massive power of multi-core CPUs.
Published: 2019

39. Complete Genome Sequence of Shelby, a Siphophage Infecting Carbapenemase-Producing Klebsiella pneumoniae

Author: Jason J. Gill, Heather Newkirk, Mei Liu, Robert Saldana, and Jolene Ramsey
Subjects: Genetics, Whole genome sequencing, 0303 health sciences, Klebsiella pneumoniae, Genome Sequences, Human pathogen, Carbapenemase producing, Biology, biology.organism_classification, Genome, 3. Good health, 03 medical and health sciences, 0302 clinical medicine, Immunology and Microbiology (miscellaneous), Protein similarity, 030212 general & internal medicine, Molecular Biology, Genome size, Gene, 030304 developmental biology
Abstract: Carbapenem-resistant Klebsiella pneumoniae, a bacterium of the family Enterobacteriaceae, is a high-priority antibiotic-resistant pathogen that causes nosocomial infections. Here, we describe the isolation and annotation of the K. pneumoniae siphophage Shelby, a T1-like siphophage encoding 78 proteins, of which 34 have a predicted function.
Published: 2019
Full Text: View/download PDF

40. HDInsight4PSi: Boosting performance of 3D protein structure similarity searching with HDInsight clusters in Microsoft Azure cloud

Author: Bożena Małysiak-Mrozek, Paweł Daniłowicz, and Dariusz Mrozek
Subjects: 0301 basic medicine, Information Systems and Management, Boosting (machine learning), Computer science, Big data, Cloud computing, 02 engineering and technology, computer.software_genre, Theoretical Computer Science, 03 medical and health sciences, Structural bioinformatics, Protein structure, Protein similarity, Artificial Intelligence, 0202 electrical engineering, electronic engineering, information engineering, Biological data, business.industry, computer.file_format, Protein Data Bank, Computer Science Applications, 030104 developmental biology, Control and Systems Engineering, 020201 artificial intelligence & image processing, Data mining, business, computer, Software, Macromolecule
Abstract: 3D protein structure similarity searching is one of the important processes performed in structural bioinformatics, since it allows for protein function identification and reconstruction of phylogeny for weakly related organisms. Due to the complexity of 3D protein structures and exponential growth of protein structures in public repositories, like the Protein Data Bank, the process is time-consuming and requires increased computational resources. This causes the necessity to prepare computer systems to be able to deal with such huge volumes of macromolecular data. In this paper, we show how 3D protein structure similarity searching can be performed in parallel by distributing MapReduce jobs on the HDInsight cluster in Microsoft Azure commercial cloud. Our solution combines the use of two important computing paradigms that gain popularity in recent years—Hadoop/MapReduce and Cloud computing. Our experiments performed with the use of the whole repository of protein structures from Protein Data Bank confirm that such a technological fusion is very beneficial and can be successfully applied when performing time-consuming computations over biological data. Moreover, appropriate preparation of data allows to reduce the time needed for computations and significantly accelerates the similarity searching.
Published: 2016
Full Text: View/download PDF

41. Visualizing and Clustering Protein Similarity Networks: Sequences, Structures, and Functions

Author: Te-Lun Mai, Chi-Ming Chen, and Geng Ming Hu
Subjects: 0301 basic medicine, Protein structure database, 030102 biochemistry & molecular biology, Protein Conformation, Protein database, General Chemistry, Biology, computer.software_genre, Biochemistry, Enzymes, Visualization, Evolution, Molecular, Structure-Activity Relationship, 03 medical and health sciences, 030104 developmental biology, Protein similarity, Molecular evolution, Cluster Analysis, Amino Acid Sequence, Protein Interaction Maps, Data mining, Cluster analysis, computer, Protein network
Abstract: Research in the recent decade has demonstrated the usefulness of protein network knowledge in furthering the study of molecular evolution of proteins, understanding the robustness of cells to perturbation, and annotating new protein functions. In this study, we aimed to provide a general clustering approach to visualize the sequence-structure-function relationship of protein networks, and investigate possible causes for inconsistency in the protein classifications based on sequences, structures, and functions. Such visualization of protein networks could facilitate our understanding of the overall relationship among proteins and help researchers comprehend various protein databases. As a demonstration, we clustered 1437 enzymes by their sequences and structures using the minimum span clustering (MSC) method. The general structure of this protein network was delineated at two clustering resolutions, and the second level MSC clustering was found to be highly similar to existing enzyme classifications. The clustering of these enzymes based on sequence, structure, and function information is consistent with each other. For proteases, the Jaccard's similarity coefficient is 0.86 between sequence and function classifications, 0.82 between sequence and structure classifications, and 0.78 between structure and function classifications. From our clustering results, we discussed possible examples of divergent evolution and convergent evolution of enzymes. Our clustering approach provides a panoramic view of the sequence-structure-function network of proteins, helps visualize the relation between related proteins intuitively, and is useful in predicting the structure and function of newly determined protein sequences.
Published: 2016
Full Text: View/download PDF
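
Note: The abstract above reports Jaccard similarity coefficients between classifications but does not say how they were computed. The sketch below shows one common way to compare two clusterings, assuming the coefficient is taken over pairs of items that share a cluster; that choice, the function names, and the toy protein labels are all illustrative assumptions, not the authors' procedure.

```python
from itertools import combinations


def co_clustered_pairs(labels):
    """Return the set of item pairs that share a cluster label."""
    return {
        (a, b)
        for a, b in combinations(sorted(labels), 2)
        if labels[a] == labels[b]
    }


def jaccard_between_clusterings(labels_a, labels_b):
    """Jaccard coefficient |A & B| / |A | B| over co-clustered pairs.

    Both arguments map the same item names to cluster labels.
    """
    pairs_a = co_clustered_pairs(labels_a)
    pairs_b = co_clustered_pairs(labels_b)
    union = pairs_a | pairs_b
    if not union:
        # Neither clustering groups any two items together.
        return 1.0
    return len(pairs_a & pairs_b) / len(union)


# Hypothetical toy data: four proteins grouped two different ways.
by_sequence = {"P1": "s1", "P2": "s1", "P3": "s2", "P4": "s2"}
by_function = {"P1": "f1", "P2": "f1", "P3": "f1", "P4": "f2"}

score = jaccard_between_clusterings(by_sequence, by_function)
print(score)
```

A value of 1 means both classifications group exactly the same pairs together, and a value near 0 means they share almost no co-clustered pairs.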

42. Final Report

Author: Ferrin, Thomas
Published: 2001

43. SOV_refine: A further refined definition of segment overlap score and its significance for protein structure similarity

Author: Tong Liu and Zheng Wang
Subjects: 0301 basic medicine, Segment overlap score, Information Systems and Management, Source code, Computer science, media_common.quotation_subject, Assessment of protein secondary structure predictions, Health Informatics, lcsh:Computer applications to medicine. Medical informatics, Similarity of segmented biological sequences, Correlation, 03 medical and health sciences, 0302 clinical medicine, Protein similarity, SOV score, Protein secondary structure, media_common, Sequence, Quality assessment, Comparing different definitions of topologically associating domains, Protein structure similarity, Methodology, Protein tertiary structure, Protein secondary structure prediction, Computer Science Applications, 030104 developmental biology, Protein model, lcsh:R858-859.7, Algorithm, 030217 neurology & neurosurgery, Information Systems
Abstract: The segment overlap score (SOV) has been used to evaluate the predicted protein secondary structures, a sequence composed of helix (H), strand (E), and coil (C), by comparing it with the native or reference secondary structures, another sequence of H, E, and C. SOV’s advantage is that it can consider the size of continuous overlapping segments and assign extra allowance to longer continuous overlapping segments instead of only judging from the percentage of overlapping individual positions as Q3 score does. However, we have found a drawback from its previous definition, that is, it cannot ensure increasing allowance assignment when more residues in a segment are further predicted accurately. A new way of assigning allowance has been designed, which keeps all the advantages of the previous SOV score definitions and ensures that the amount of allowance assigned is incremental when more elements in a segment are predicted accurately. Furthermore, our improved SOV has achieved a higher correlation with the quality of protein models measured by GDT-TS score and TM-score, indicating its better abilities to evaluate tertiary structure quality at the secondary structure level. We analyzed the statistical significance of SOV scores and found the threshold values for distinguishing two protein structures (SOV_refine > 0.19) and indicating whether two proteins are under the same CATH fold (SOV_refine > 0.94 and > 0.90 for three- and eight-state secondary structures respectively). We provided another two example applications, which are when used as a machine learning feature for protein model quality assessment and comparing different definitions of topologically associating domains. We proved that our newly defined SOV score resulted in better performance. The SOV score can be widely used in bioinformatics research and other fields that need to compare two sequences of letters in which continuous segments have important meanings. 
We also generalized the previous SOV definitions so that it can work for sequences composed of more than three states (e.g., it can work for the eight-state definition of protein secondary structures). A standalone software package has been implemented in Perl with source code released. The software can be downloaded from http://dna.cs.miami.edu/SOV/ .
Published: 2018
Full Text: View/download PDF
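
Note: The abstract above contrasts SOV with the Q3 score, which judges only the percentage of individual positions that match. As a minimal sketch of that baseline (not of SOV or SOV_refine, whose definitions are not given in the abstract), Q3 can be computed as the fraction of positions where the predicted state equals the reference state. The function name and the example strings are hypothetical.

```python
def q3_score(predicted, reference):
    """Fraction of positions whose predicted state (H, E or C) matches the reference."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    if not reference:
        raise ValueError("sequences must not be empty")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Hypothetical ten-residue example: two positions are mispredicted.
reference = "HHHHCCEEEE"
predicted = "HHHCCCEEEC"

q3 = q3_score(predicted, reference)
print(q3)
```

Because Q3 counts positions independently, it scores a prediction the same whether its errors break up one long segment or sit at segment ends, which is the limitation that segment-based scores such as SOV are meant to address.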

44. Exploring the effectiveness of the TSR-based protein 3-D structural comparison method for protein clustering, and structural motif identification and discovery of protein kinases, hydrolases, and SARS-CoV-2's protein via the application of amino acid grouping

Author: Sarkar, Titli, Raghavan, Vijay V., Chen, Feng, Riley, Andrew, Zhou, Sophia, and Xu, Wu
Subjects: *PROTEIN kinases, *CYTOSKELETAL proteins, *AMINO acids, *PROTEOMICS, *SARS-CoV-2
Highlights:
• Exploring the effectiveness of the TSR-based method for protein clustering and structural motif identification via amino acid grouping.
• We have classified the keys into the different categories for better understanding of protein structure relations.
• Applying amino acid grouping to the TSR-based method modestly improves the accuracy of protein clustering in certain cases.
• Applying amino acid grouping facilitates the process of identification or discovery of conserved structural motifs.
• TSR-based protein 3-D structural comparison method has its uniqueness in identification and discovery of structural motifs/binding sites.
• The substructures we defined for coronaviruses' nsp16 will help future antiviral drug design for improving therapeutic outcome.
• The substructures we have defined will also help to understand how nsp10 interacts with nsp16 to regulate function of nsp16.
Abstract: Development of protein 3-D structural comparison methods is essential for understanding protein functions. Some amino acids share structural similarities while others vary considerably. These structures determine the chemical and physical properties of amino acids. Grouping amino acids with similar structures potentially improves the ability to identify structurally conserved regions and increases the global structural similarity between proteins. We systematically studied the effects of amino acid grouping on the numbers of Specific/specific, Common/common, and statistically different keys to achieve a better understanding of protein structure relations. Common keys represent substructures found in all types of proteins and Specific keys represent substructures exclusively belonging to a certain type of proteins in a data set. Our results show that applying amino acid grouping to the Triangular Spatial Relationship (TSR)-based method, while computing structural similarity among proteins, improves the accuracy of protein clustering in certain cases.
In addition, applying amino acid grouping facilitates the process of identification or discovery of conserved structural motifs. The results from the principal component analysis (PCA) demonstrate that applying amino acid grouping captures slightly more structural variation than when amino acid grouping is not used, indicating that amino acid grouping reduces structure diversity as predicted. The TSR-based method uniquely identifies and discovers binding sites for drugs or interacting proteins. The binding sites of nsp16 of SARS-CoV-2, SARS-CoV and MERS-CoV that we have defined will aid future antiviral drug design for improving therapeutic outcome. This approach for incorporating the amino acid grouping feature into our structural comparison method is promising and provides a deeper insight into understanding of structural relations of proteins. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

45. Predicting drug-target interaction based on sequence and structure information

Author: Wei Lan, Jianxin Wang, Fang-Xiang Wu, Min Li, and Yi Pan
Subjects: Drug discovery, business.industry, Drug target, Biology, computer.software_genre, Machine learning, Support vector machine, Protein sequencing, Protein similarity, Control and Systems Engineering, Target protein, Artificial intelligence, Data mining, Drug structure, business, Classifier (UML), computer
Abstract: It is well known that discovering a new drug is a cumbersome, time-consuming and expensive process. Computational approaches for identifying interactions between drug compounds and target proteins have become important in drug discovery which is helpful to reduce these obstacles. The difficulties of drug-target interaction identification include the lack of known drug-target associations and no experimentally verified negative examples. In this study, we present a method, called PUDT, to predict drug-target interactions. Instead of treating unknown interactions as negative examples, we consider unknown interactions as unlabeled examples. The unlabeled examples are divided into two parts: reliable negative examples and likely negative examples based on protein structure similarity. Then, a weighted support vector machine is used to build a classifier to predict drug-target interactions based on protein sequence and drug structure information. Four data sets (enzymes, ion channels, GPCRs and nuclear receptors) are used to evaluate the performance of the proposed method PUDT. The experimental results demonstrate that our method PUDT outperforms recent state-of-the-art approaches.
Published: 2015
Full Text: View/download PDF

46. Efficient inference of homologs in large eukaryotic pan-proteomes

Author: Siavash Sheikhizadeh Anari, Dick de Ridder, M. Eric Schranz, and Sandra Smit
Subjects: 0301 basic medicine, Proteome, Computer science, Biochemistry, Homology (biology), Protein similarity, Structural Biology, Phylogenomics, Cluster Analysis, Databases, Protein, lcsh:QH301-705.5, Genome, Applied Mathematics, Methodology Article, Homologous genes, Pan-genome, Eukaryota, Genomics, Biosystematiek, Computer Science Applications, lcsh:R858-859.7, DNA microarray, Functional genomics, Algorithms, Singular homology, Bioinformatics, Computational biology, lcsh:Computer applications to medicine. Medical informatics, Genes, Plant, 03 medical and health sciences, Orthology, Bioinformatica, Homologous chromosome, Humans, Cluster analysis, Molecular Biology, Comparative genomics, Sequence Homology, Amino Acid, k-mer, 030104 developmental biology, lcsh:Biology (General), Brassicaceae, Biosystematics, EPS, Software
Abstract: Background: Identification of homologous genes is fundamental to comparative genomics, functional genomics and phylogenomics. Extensive public homology databases are of great value for investigating homology but need to be continually updated to incorporate new sequences. As new sequences are rapidly being generated, there is a need for efficient standalone tools to detect homologs in novel data. Results: To address this, we present a fast method for detecting homology groups across a large number of individuals and/or species. We adopted a k-mer based approach which considerably reduces the number of pairwise protein alignments without sacrificing sensitivity. We demonstrate accuracy, scalability, efficiency and applicability of the presented method for detecting homology in large proteomes of bacteria, fungi, plants and Metazoa. Conclusions: We clearly observed the trade-off between recall and precision in our homology inference. Favoring recall or precision strongly depends on the application. The clustering behavior of our program can be optimized for particular applications by altering a few key parameters. The program is available for public use at https://github.com/sheikhizadeh/pantools as an extension to our pan-genomic analysis tool, PanTools. Electronic supplementary material: The online version of this article (10.1186/s12859-018-2362-4) contains supplementary material, which is available to authorized users.
Published: 2017

47. Parallelization of Large-Scale Drug-Protein Binding Experiments

Author: Antonios Makris, George Tsatsaronis, Joachim Haupt, Mark Sawyer, Iraklis Varlamis, Konstantinos Tserpes, Dimitrios Michail, and Chronis Dimitropoulos
Subjects: 0301 basic medicine, Computer science, business.industry, Scale (chemistry), 0206 medical engineering, Process (computing), 02 engineering and technology, Parallel computing, Plasma protein binding, 03 medical and health sciences, Task (computing), 030104 developmental biology, Software, Memory management, Protein structure, Protein similarity, business, 020602 bioinformatics
Abstract: Drug polypharmacology or “drug promiscuity” refers to the ability of a drug to bind multiple proteins. Such studies have huge impact to the pharmaceutical industry, but in the same time require large investments on wet-lab experiments. The respective in-silico experiments have a significantly smaller cost and minimize the expenses for the subsequent lab experiments. However, the process of finding similar protein targets for an existing drug, passes through protein structural similarity and is a highly demanding in computational resources task. In this work, we propose several algorithms that port the protein similarity task to a parallel high-performance computing environment. The differences in size and complexity of the examined protein structures raise several issues in a naive parallelization process that significantly affect the overall time and required memory. We describe several optimizations for better memory and CPU balancing which achieve faster execution times. Experimental results, on a high-performance computing environment with 512 cores and 2048GB of memory, demonstrate the effectiveness of our approach which scales well to large amounts of protein pairs.
Published: 2017
Full Text: View/download PDF

48. Determining protein similarity by comparing hydrophobic core structure

Author: Barbara Kalinowska, Mateusz Banach, Małgorzata Gadzała, Irena Roterman, and Leszek Konieczny
Subjects: 0301 basic medicine, Structural similarity, Bioinformatics, computer.software_genre, Article, Correlation, 03 medical and health sciences, Protein similarity, lcsh:Social sciences (General), CASP, lcsh:Science (General), Biological sciences, Multidisciplinary, Chemistry, biological sciences, bioinformatics, Protein structure prediction, 030104 developmental biology, Similarity criterion, Protein body, lcsh:H1-99, Data mining, Biological system, computer, lcsh:Q1-390
Abstract: Formal assessment of structural similarity is − next to protein structure prediction − arguably the most important unsolved problem in proteomics. In this paper we propose a similarity criterion based on commonalities between the proteins’ hydrophobic cores. The hydrophobic core emerges as a result of conformational changes through which each residue reaches its intended position in the protein body. A quantitative criterion based on this phenomenon has been proposed in the framework of the CASP challenge. The structure of the hydrophobic core − including the placement and scope of any deviations from the idealized model − may indirectly point to areas of importance from the point of view of the protein’s biological function. Our analysis focuses on an arbitrarily selected target from the CASP11 challenge. The proposed measure, while compliant with CASP criteria (70–80% correlation), involves certain adjustments which acknowledge the presence of factors other than simple spatial arrangement of solids.
Published: 2017
Full Text: View/download PDF

49. Functional Aggregation for Protein Function Prediction

Author: Jingyu Hou
Subjects: business.industry, Function (mathematics), computer.software_genre, Machine learning, Fuzzy logic, Semantic similarity, Similarity (network science), Choquet integral, Protein similarity, Protein function prediction, Data mining, Artificial intelligence, business, computer, Mathematics
Abstract: This chapter introduces a novel method that incorporates functional aggregation of proteins into protein function prediction via the Choquet–Integral technique in fuzzy theory. A new semantic protein similarity that is based on a new function similarity is also presented accordingly, which makes this new prediction approach work properly. Some possible research topics based on this new approach are presented as well.
Published: 2017
Full Text: View/download PDF

50. Searching for Domains for Protein Function Prediction

Author: Jingyu Hou
Subjects: Protein similarity, Computer science, business.industry, Protein function prediction, Function (mathematics), Artificial intelligence, Data mining, Machine learning, computer.software_genre, business, computer, Probability model
Abstract: This chapter introduces a new method of protein function prediction with an innovative algorithm of dynamically searching for suitable prediction domains during the prediction processes. The corresponding probability model, as well as the new function and protein similarity definitions, is presented in detail. Some possible research topics based on this new prediction method are also discussed.
Published: 2017
Full Text: View/download PDF
