369 results on '"Protein classification"'
Search Results
2. Comparison of complex-valued and real-valued neural networks for protein sequence classification.
- Author
- Yakupoğlu, Abdullah and Bilgin, Ömer Cevdet
- Subjects
- *AMINO acid sequence, *NERVE tissue proteins, *DEEP learning, *BIOLOGICAL networks, *COMPLEX numbers
- Abstract
In recent years, tremendous progress has been made in the field of real-valued deep learning. Despite successful applications using amplitude and phase features, complex-valued deep learning methods remain an actively researched area with significant potential. This study investigates the potential of complex-valued networks in biological sequence analysis. In this context, the sequences encoded by a novel approach proposed for encoding protein sequences into complex numbers are classified by complex networks and compared with a real method available in the literature. This comparative study is carried out separately for three different sequence forms of protein sequences: DNA, codon and amino acid. Both real and complex networks achieved very high test accuracies of 90% and above. In statistical analyses using tenfold cross-validation, the complex-valued method yielded average accuracies of 88% (± 6), 84% (± 8) and 87% (± 8) for DNA, codon and amino acid sequences, respectively. The real-valued method gave mean accuracies of 91% (± 8), 88% (± 6) and 88% (± 7), respectively. According to the comparative t-test, there was no statistically significant difference between the two methods at the p = 0.05 level, but the findings highlight the potential for achieving high success in biological sequence analysis of complex networks despite their current limitations. [ABSTRACT FROM AUTHOR]
- Published
- 2024
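As a rough plausibility check on the "no statistically significant difference" statement in result 2, the reported means and spreads can be plugged into Welch's t statistic. This is only a sketch under stated assumptions: it treats the ± values as standard deviations over the ten folds and the two methods as independent samples. The abstract does not say which t-test variant the authors ran, and the function name `welch_t` is illustrative.

```python
import math

def welch_t(mean_a, sd_a, mean_b, sd_b, n=10):
    """Welch's t statistic from summary statistics (n folds per method, assumed)."""
    standard_error = math.sqrt(sd_a**2 / n + sd_b**2 / n)
    return (mean_a - mean_b) / standard_error

# (real-valued mean, SD, complex-valued mean, SD), as quoted in the abstract
reported = {
    "DNA": (91, 8, 88, 6),
    "codon": (88, 6, 84, 8),
    "amino acid": (88, 7, 87, 8),
}

t_values = {name: welch_t(*stats) for name, stats in reported.items()}
for name, t in t_values.items():
    print(f"{name}: t = {t:.2f}")
```

Under these assumptions all three statistics stay well below roughly 2.1, the approximate two-sided 5% critical value for about 18 degrees of freedom, which agrees with the abstract's conclusion.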
3. HaloClass: Salt-Tolerant Protein Classification with Protein Language Models.
- Author
- Narang, Kush, Nath, Abhigyan, Hemstrom, William, and Chu, Simon K. S.
- Subjects
- *LANGUAGE models, *CYTOSKELETAL proteins, *PROTEIN models, *PROTEIN stability, *MANUFACTURING processes
- Abstract
Salt-tolerant proteins, also known as halophilic proteins, have unique adaptations to function in high-salinity environments. These proteins have naturally evolved in extremophilic organisms, and more recently, are being increasingly applied as enzymes in industrial processes. Due to an abundance of salt-tolerant sequences and a simultaneous lack of experimental structures, most computational methods to predict stability are sequence-based only. These approaches, however, are hindered by a lack of structural understanding of these proteins. Here, we present HaloClass, an SVM classifier that leverages ESM-2 protein language model embeddings to accurately identify salt-tolerant proteins. On a newer and larger test dataset, HaloClass outperforms existing approaches when predicting the stability of never-before-seen proteins that are distal to its training set. Finally, on a mutation study that evaluated changes in salt tolerance based on single- and multiple-point mutants, HaloClass outperforms existing approaches, suggesting applications in the guided design of salt-tolerant enzymes. [ABSTRACT FROM AUTHOR]
- Published
- 2024
4. SBSM-Pro: support bio-sequence machine for proteins.
- Author
- Wang, Yizheng, Zhai, Yixiao, Ding, Yijie, and Zou, Quan
- Abstract
Proteins play a pivotal role in biological systems. The use of machine learning algorithms for protein classification can assist and even guide biological experiments, offering crucial insights for biotechnological applications. We introduce the support bio-sequence machine for proteins (SBSM-Pro), a model purpose-built for the classification of biological sequences. This model starts with raw sequences and groups amino acids based on their physicochemical properties. It incorporates sequence alignment to measure the similarities between proteins and uses a novel multiple kernel learning (MKL) approach to integrate various types of information, utilizing support vector machines for classification prediction. The results indicate that our model demonstrates commendable performance across ten datasets in terms of the identification of protein function and post translational modification. This research not only exemplifies state-of-the-art work in protein classification but also paves avenues for new directions in this domain, representing a beneficial endeavor in the development of platforms tailored for the classification of biological sequences. SBSM-Pro is available for access at . [ABSTRACT FROM AUTHOR]
- Published
- 2024
5. Testing the Capability of Embedding-Based Alignments on the GST Superfamily Classification: The Role of Protein Length.
- Author
- Vazzana, Gabriele, Savojardo, Castrense, Martelli, Pier Luigi, and Casadio, Rita
- Subjects
- *LANGUAGE models, *PROTEIN models, *XENOBIOTICS, *GLUTATHIONE, *DATA analysis
- Abstract
In order to shed light on the usage of protein language model-based alignment procedures, we attempted the classification of Glutathione S-transferases (GST; EC 2.5.1.18) and compared our results with the ARBA/UNI rule-based annotation in UniProt. GST is a protein superfamily involved in cellular detoxification from harmful xenobiotics and endobiotics, widely distributed in prokaryotes and eukaryotes. What is particularly interesting is that the superfamily is characterized by different classes, comprising proteins from different taxa that can act in different cell locations (cytosolic, mitochondrial and microsomal compartments) with different folds and different levels of sequence identity with remote homologs. For this reason, GST functional annotation in a specific class is problematic: unless a structure is released, the protein can be classified only on the basis of sequence similarity, which excludes the annotation of remote homologs. Here, we adopt an embedding-based alignment to classify 15,061 GST proteins automatically annotated by the UniProt-ARBA/UNI rules. Embedding is based on the Meta ESM2-15b protein language. The embedding-based alignment reaches more than a 99% rate of perfect matching with the UniProt automatic procedure. Data analysis indicates that 46% of the UniProt automatically classified proteins do not conserve the typical length of canonical GSTs, whose structure is known. Therefore, 46% of the classified proteins do not conserve the template/s structure required for their family classification. Our approach finds that 41% of 64,207 GST UniProt proteins not yet assigned to any class can be classified consistently with the structural template length. [ABSTRACT FROM AUTHOR]
- Published
- 2024
6. An efficient and accurate approach to identify similarities between biological sequences using pair amino acid composition and physicochemical properties.
- Author
- Hooshyar, L., Hernández-Jiménez, M. B., Khastan, A., and Vasighi, M.
- Subjects
- *AMINO acid sequence, *SEQUENCE analysis, *RESEARCH personnel, *EXPERTISE, *AMBIGUITY
- Abstract
Our study presents a novel method for analyzing biological sequences, utilizing Pairwise Amino Acid Composition and Amino Acid physicochemical properties to construct a feature vector. This step is pivotal, as by utilizing pairwise analysis, we consider the order of amino acids, thereby capturing subtle nuances in sequence structure. Simultaneously, by incorporating physicochemical properties, we ensure that the hidden information encoded within amino acids is not overlooked. Furthermore, by considering both the frequency and order of amino acid pairs, our method mitigates the risk of erroneously clustering different sequences as similar, a common pitfall in older methods. Our approach generates a concise 48-member vector, accommodating sequences of arbitrary lengths efficiently. This compact representation retains essential amino acid-specific features, enhancing the accuracy of sequence analysis. Unlike traditional approaches, our algorithm avoids the introduction of sparse vectors, ensuring the retention of important information. Additionally, we introduce fuzzy equivalence relationships to address uncertainty in the clustering process, enabling a more nuanced and flexible clustering approach that captures the inherent ambiguity in biological data. Despite these advancements, our algorithm is presented in a straightforward manner, ensuring accessibility to researchers with varying levels of computational expertise. This enhancement improves the robustness and interpretability of our method, providing researchers with a comprehensive and user-friendly tool for biological sequence analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2024
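The pairwise composition idea in result 6 can be illustrated with a plain dipeptide-composition vector. This is a minimal sketch, not the authors' method: it produces the generic 400-element vector of ordered adjacent pairs rather than their 48-member vector (whose construction the abstract does not describe), and it omits the physicochemical properties and fuzzy clustering. The name `pair_composition` and the example sequence are made up.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs

def pair_composition(sequence):
    """Return the relative frequency of each ordered adjacent residue pair."""
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    for first, second in zip(sequence, sequence[1:]):
        pair = first + second
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    return [counts[p] / total if total else 0.0 for p in PAIRS]

# Made-up sequence for illustration only.
vector = pair_composition("MKVLAAGIVK")
```

Because the pairs are ordered, "AG" and "GA" land in different positions, which is how a pairwise representation keeps some of the residue order that single-residue composition discards. The vector length is fixed regardless of sequence length.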
7. Geometric Features and GAT Neural Network for Protein Surface Classification
- Author
- Ferroudj, Wissam, Faci, Noura, Kheddouci, Hamamache, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Strauss, Christine, editor, Amagasa, Toshiyuki, editor, Manco, Giuseppe, editor, Kotsis, Gabriele, editor, Tjoa, A Min, editor, and Khalil, Ismail, editor
- Published
- 2024
8. Plant Protein Classification Using K-mer Encoding
- Author
- Veningston, K., Venkateswara Rao, P. V., Pravallika Devi, M., Pranitha Reddy, S., Ronalda, M., Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Prates, Raquel Oliveira, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Muthalagu, Raja, editor, P S, Tamizharasan, editor, Pawar, Pranav M., editor, R, Elakkiya, editor, Prasad, Neeli Rashmi, editor, and Fiorentino, Michele, editor
- Published
- 2024
9. An Efficient Deep Learning Approach for DNA-Binding Proteins Classification from Primary Sequences
- Author
- Nosiba Yousif Ahmed, Wafa Alameen Alsanousi, Eman Mohammed Hamid, Murtada K. Elbashir, Khadija Mohammed Al-Aidarous, Mogtaba Mohammed, and Mohamed Elhafiz M. Musa
- Subjects
- Deep learning, Convolutional neural network, Gated recurrent unit, Long-short term memory, DNA-binding protein, Protein classification, Electronic computers. Computer science, QA75.5-76.95
- Abstract
As the number of identified proteins has expanded, the accurate identification of proteins has become a significant challenge in the field of biology. Various computational methods, such as Support Vector Machine (SVM), K-nearest neighbors (KNN), and convolutional neural network (CNN), have been proposed to recognize deoxyribonucleic acid (DNA)-binding proteins solely based on amino acid sequences. However, these methods do not consider the contextual information within amino acid sequences, limiting their ability to adequately capture sequence features. In this study, we propose a novel approach to identify DNA-binding proteins by integrating a CNN with bidirectional long-short-term memory (LSTM) and gated recurrent unit (GRU) as (CNN-BiLG). The CNN-BiLG model can explore the potential contextual relationships of amino acid sequences and obtain more features than traditional models. Our experimental results demonstrate a validation set prediction accuracy of 94% for the proposed CNN-BiLG, surpassing the accuracy of machine learning models and deep learning models. Furthermore, our model is both effective and efficient, exhibiting commendable classification accuracy based on comparative analysis.
- Published
- 2024
10. A rule-based protein classification approach using normalized distance-based encoding method
- Author
- Saha, Suprativ, Bhattacharyya, Rupak, and Bhattacharya, Tanmay
- Published
- 2024
11. An Efficient Deep Learning Approach for DNA-Binding Proteins Classification from Primary Sequences
- Author
- Ahmed, Nosiba Yousif, Alsanousi, Wafa Alameen, Hamid, Eman Mohammed, Elbashir, Murtada K., Al-Aidarous, Khadija Mohammed, Mohammed, Mogtaba, and Musa, Mohamed Elhafiz M.
- Published
- 2024
12. Testing the Capability of Embedding-Based Alignments on the GST Superfamily Classification: The Role of Protein Length
- Author
- Gabriele Vazzana, Castrense Savojardo, Pier Luigi Martelli, and Rita Casadio
- Subjects
- Glutathione S-transferases, protein language models, protein classification, functional annotation, embedding-based alignment, Organic chemistry, QD241-441
- Abstract
In order to shed light on the usage of protein language model-based alignment procedures, we attempted the classification of Glutathione S-transferases (GST; EC 2.5.1.18) and compared our results with the ARBA/UNI rule-based annotation in UniProt. GST is a protein superfamily involved in cellular detoxification from harmful xenobiotics and endobiotics, widely distributed in prokaryotes and eukaryotes. What is particularly interesting is that the superfamily is characterized by different classes, comprising proteins from different taxa that can act in different cell locations (cytosolic, mitochondrial and microsomal compartments) with different folds and different levels of sequence identity with remote homologs. For this reason, GST functional annotation in a specific class is problematic: unless a structure is released, the protein can be classified only on the basis of sequence similarity, which excludes the annotation of remote homologs. Here, we adopt an embedding-based alignment to classify 15,061 GST proteins automatically annotated by the UniProt-ARBA/UNI rules. Embedding is based on the Meta ESM2-15b protein language. The embedding-based alignment reaches more than a 99% rate of perfect matching with the UniProt automatic procedure. Data analysis indicates that 46% of the UniProt automatically classified proteins do not conserve the typical length of canonical GSTs, whose structure is known. Therefore, 46% of the classified proteins do not conserve the template/s structure required for their family classification. Our approach finds that 41% of 64,207 GST UniProt proteins not yet assigned to any class can be classified consistently with the structural template length.
- Published
- 2024
13. Computational discovery and modelling of tandem domain repeats in proteins
- Author
- Lafita Masip, Aleix and Bateman, Alex
- Subjects
- 572, protein modelling, protein classification, protein domains, protein misfolding, multidomain proteins, tandem domain repeats
- Abstract
Domains are functional and evolutionary units of proteins that typically fold into stable globular structures. A small subset of natural multidomain proteins contain large arrays of nearly identical domains repeated in tandem, challenging some of our assumptions about protein folding and evolution. In this study, I aim to discover new tandem domain repeats, characterise their sequence and structural properties and understand their roles in the function of proteins. I start by using computational sequence analysis tools across large datasets of proteins and bacterial genomes to survey the prevalence and distribution of tandem domain repeats across organisms and domain families. Next, I computationally analyse and compare structures of domains found as tandem repeats, several of which have been experimentally determined by our collaborators in the course of this study. I finally develop two computational methods to systematically model the structure and misfolding energetics of tandem domain repeats. Nearly identical tandem domain repeats are rare in natural proteins (below 0.1%) and their sequences are highly biased in amino acid composition. Many of them have structural roles in bacterial surface proteins implicated in biofilm formation and host colonisation; new examples of such proteins, named "Periscope proteins", show rapid domain repeat number variation, a molecular mechanism used to modulate bacterial phenotype. Tandem domain repeat structures reveal unusual structural malleability, with numerous cases of domain atrophy (loss of core secondary structures) and elaboration. They are also predicted to be more resistant to misfolding via tandem domain swapping, with potential misfolding-resistant mechanisms such as the domain topology and length.
This study improves our understanding of the prevalence, type and function of tandem domain repeats in proteins, in particular their role as structural elements in bacterial surface proteins, and suggests new protein and domain targets for further experimental characterisation. It also has important implications for protein misfolding and for the design and engineering of multidomain proteins.
- Published
- 2021
14. A hybrid deep learning model for classification of plant transcription factor proteins.
- Author
- Öncül, Ali Burak and Çelik, Yüksel
- Abstract
Studies on the amino acid sequences, protein structure, and the relationships of amino acids are still a large and challenging problem in biology. Although bioinformatics studies have progressed in solving these problems, the relationship between amino acids and determining the type of protein formed by amino acids are still a problem that has not been fully solved. This problem is why the use of some of the available protein sequences is also limited. This study proposes a hybrid deep learning model to classify amino acid sequences of unknown species using the amino acid sequences in the plant transcription factor database. The model achieved 98.23% success rate in the tests performed. With the hybrid model created, transcription factor proteins in the plant kingdom can be easily classified. The fact that the model is hybrid has made its layers lighter. The training period has decreased, and the success has increased. When tested with a bidirectional LSTM produced with a similar dataset to our dataset and a ResNet-based ProtCNN model, a CNN model, the proposed model was more successful. In addition, we found that the hybrid model we designed by creating vectors with Word2Vec is more successful than other LSTM or CNN-based models. With the model we have prepared, other proteins, especially transcription factor proteins, will be classified, thus enabling species identification to be carried out efficiently and successfully. The use of such a triplet hybrid structure in classifying plant transcription factors stands out as an innovation brought to the literature. [ABSTRACT FROM AUTHOR]
- Published
- 2023
15. ENTAIL: yEt aNoTher amyloid fIbrils cLassifier
- Author
- Alessia Auriemma Citarella, Luigi Di Biasi, Fabiola De Marco, and Genoveffa Tortora
- Subjects
- Protein classification, Amyloidoses, Fibrils machine learning, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
- Abstract
Background: This research aims to increase our knowledge of amyloidoses. These disorders cause incorrect protein folding, affecting protein functionality (on structure). Fibrillar deposits are the basis of some well-known diseases, such as Alzheimer, Creutzfeldt–Jakob diseases and type II diabetes. For many of these amyloid proteins, the relative precursors are known. Discovering new protein precursors involved in forming amyloid fibril deposits would improve understanding the pathological processes of amyloidoses. Results: A new classifier, called ENTAIL, was developed using more than 4000 molecular descriptors. ENTAIL was based on the Naive Bayes Classifier with Unbounded Support and Gaussian Kernel Type, with an accuracy on the test set of 81.80%, SN of 100%, SP of 63.63% and an MCC of 0.683 on a balanced dataset. Conclusions: The analysis carried out has demonstrated how, despite the various configurations of the tests, performances are superior in terms of performance on a balanced dataset.
- Published
- 2022
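The four figures quoted for ENTAIL in result 15 (accuracy, SN, SP, MCC) are all derived from one confusion matrix, so they can be cross-checked against each other. The sketch below assumes a hypothetical balanced test set of 11 positives and 11 negatives. The abstract does not give the test-set size, and any balanced set with the same proportions yields the same values.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return accuracy, sensitivity, specificity, mcc

# Hypothetical counts: all 11 positives recovered, 7 of 11 negatives recovered.
accuracy, sensitivity, specificity, mcc = binary_metrics(tp=11, fn=0, tn=7, fp=4)
print(f"ACC={accuracy:.4f} SN={sensitivity:.4f} SP={specificity:.4f} MCC={mcc:.3f}")
```

With these assumed counts the values agree with the abstract to within rounding (accuracy about 81.8%, SN 100%, SP about 63.6%, MCC about 0.683), which suggests the reported metrics are mutually consistent.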
16. A lightweight classification of adaptor proteins using transformer networks
- Author
- Sylwan Rahardja, Mou Wang, Binh P. Nguyen, Pasi Fränti, and Susanto Rahardja
- Subjects
- Adaptor protein, Protein classification, Deep learning, Transformer, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
- Abstract
Background: Adaptor proteins play a key role in intercellular signal transduction, and dysfunctional adaptor proteins result in diseases. Understanding its structure is the first step to tackling the associated conditions, spurring ongoing interest in research into adaptor proteins with bioinformatics and computational biology. Our study aims to introduce a small, new, and superior model for protein classification, pushing the boundaries with new machine learning algorithms. Results: We propose a novel transformer based model which includes convolutional block and fully connected layer. We input protein sequences from a database, extract PSSM features, then process it via our deep learning model. The proposed model is efficient and highly compact, achieving state-of-the-art performance in terms of area under the receiver operating characteristic curve, Matthew’s Correlation Coefficient and Receiver Operating Characteristics curve. Despite merely 20 hidden nodes translating to approximately 1% of the complexity of previous best known methods, the proposed model is still superior in results and computational efficiency. Conclusions: The proposed model is the first transformer model used for recognizing adaptor protein, and outperforms all existing methods, having PSSM profiles as inputs that comprises convolutional blocks, transformer and fully connected layers for the use of classifying adaptor proteins.
- Published
- 2022
17. Calibrating the classifier for protein family prediction with protein sequence using machine learning techniques: An empirical investigation.
- Author
- Idhaya, T., Suruliandi, A., Calitoiu, Dragos, and Raja, S. P.
- Subjects
- *AMINO acid sequence, *MACHINE learning, *AMINO acid residues, *DNA, *FEATURE selection, *COMPUTATIONAL neuroscience
- Abstract
A gene is a basic unit of congenital traits and a sequence of nucleotides in deoxyribonucleic acid that encrypts protein synthesis. Proteins are made up of amino acid residues and are classified for use in protein-related research, which includes identifying changes in genes, finding associations with diseases and phenotypes, and identifying potential drug targets. To this end, proteins are studied and classified, based on the family. For family prediction, however, a computational rather than an experimental approach is introduced, owing to the time involved in the latter process. Computational approaches to protein family prediction involve two important processes, feature selection and classification. Existing approaches to protein family prediction are alignment-based and alignment-free. The drawback of the former is that it searches for protein signatures by aligning every available sequence. Consequently, the latter alignment-free approach is taken for study, given that it only needs sequence-based features to predict the protein family and is far more efficient than the former. Nevertheless, the sequence-based characteristics taken for study have additional features to offer. There is, thus, a need to select the best features of all. When it comes to classification, there is still no perfect method for classifying proteins. So, a comparison of different approaches is done to find the best feature selection technique and classification technique for protein family prediction. From the study, the feature subset selected provides the best classification accuracy of 96% for filter-based feature selection technique and the random forest classifier. [ABSTRACT FROM AUTHOR]
- Published
- 2023
18. SNARER: new molecular descriptors for SNARE proteins classification
- Author
- Alessia Auriemma Citarella, Luigi Di Biasi, Michele Risi, and Genoveffa Tortora
- Subjects
- SNARE, Protein classification, Machine learning, Random forest, AdaBoost, KNN, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
- Abstract
Background: SNARE proteins play an important role in different biological functions. This study aims to investigate the contribution of a new class of molecular descriptors (called SNARER) related to the chemical-physical properties of proteins in order to evaluate the performance of binary classifiers for SNARE proteins. Results: We constructed a SNARE proteins balanced dataset, D128, and an unbalanced one, DUNI, on which we tested and compared the performance of the new descriptors presented here in combination with the feature sets (GAAC, CTDT, CKSAAP and 188D) already present in the literature. The machine learning algorithms used were Random Forest, k-Nearest Neighbors and AdaBoost and oversampling and subsampling techniques were applied to the unbalanced dataset. The addition of the SNARER descriptors increases the precision for all considered ML algorithms. In particular, on the unbalanced DUNI dataset the accuracy increases in parallel with the increase in sensitivity while on the balanced dataset D128 the accuracy increases compared to the counterpart without the addition of SNARER descriptors, with a strong improvement in specificity. Our best result is the combination of our descriptors SNARER with CKSAAP feature on the dataset D128 with 92.3% of accuracy, 90.1% for sensitivity and 95% for specificity with the RF algorithm. Conclusions: The performed analysis has shown how the introduction of molecular descriptors linked to the chemical-physical and structural characteristics of the proteins can improve the classification performance. Additionally, it was pointed out that performance can change based on using a balanced or unbalanced dataset. The balanced nature of training can significantly improve forecast accuracy.
- Published
- 2022
19. Bridging the Gap between Sequence and Structure Classifications of Proteins with AlphaFold Models.
- Author
- Pei, Jimin, Andreeva, Antonina, Chuguransky, Sara, Lázaro Pinto, Beatriz, Paysan-Lafosse, Typhaine, Dustin Schaeffer, R., Bateman, Alex, Cong, Qian, and Grishin, Nick V.
- Subjects
- *PROTEIN structure prediction, *BINDING sites, *PROTEIN domains, *PROTEIN structure, *PROTEIN models
- Abstract
• Protein domain classification is important for studying protein evolution and function.
• Highly accurate AlphaFold models offer structural insights into Pfam domains.
• DPAM was able to parse and assign many Pfam domains into ECOD classification.
• Manual inspection of unassigned domains uncovers new folds and remote evolutionary relationships.
• A combined approach to domain classification leads to a better understanding of the protein universe.
Classification of protein domains based on homology and structural similarity serves as a fundamental tool to gain biological insights into protein function. Recent advancements in protein structure prediction, exemplified by AlphaFold, have revolutionized the availability of protein structural data. We focus on classifying about 9000 Pfam families into ECOD (Evolutionary Classification of Domains) by using predicted AlphaFold models and the DPAM (Domain Parser for AlphaFold Models) tool. Our results offer insights into their homologous relationships and domain boundaries. More than half of these Pfam families contain DPAM domains that can be confidently assigned to the ECOD hierarchy. Most assigned domains belong to highly populated folds such as Immunoglobulin-like (IgL), Armadillo (ARM), helix-turn-helix (HTH), and Src homology 3 (SH3). A large fraction of DPAM domains, however, cannot be confidently assigned to ECOD homologous groups. These unassigned domains exhibit statistically different characteristics, including shorter average length, fewer secondary structure elements, and more abundant transmembrane segments. They could potentially define novel families remotely related to domains with known structures or novel superfamilies and folds. Manual scrutiny of a subset of these domains revealed an abundance of internal duplications and recurring structural motifs.
Exploring sequence and structural features such as disulfide bond patterns, metal-binding sites, and enzyme active sites helped uncover novel structural folds as well as remote evolutionary relationships. By bridging the gap between sequence-based Pfam and structure-based ECOD domain classifications, our study contributes to a more comprehensive understanding of the protein universe by providing structural and functional insights into previously uncharacterized proteins. [ABSTRACT FROM AUTHOR]
- Published
- 2024
20. Understanding structural variability in proteins using protein structural networks
- Author
- Vasam Manjveekar Prabantu, Vasundhara Gadiyaram, Saraswathi Vishveshwara, and Narayanaswamy Srinivasan
- Subjects
- Structural variability, Protein structural networks, Protein structure comparison, Protein classification, Edge-weight variance, Biology (General), QH301-705.5
- Abstract
Proteins perform their function by accessing a suitable conformer from the ensemble of available conformations. The conformational diversity of a chosen protein structure can be obtained by experimental methods under different conditions. A key issue is the accurate comparison of different conformations. A gold standard used for such a comparison is the root mean square deviation (RMSD) between the two structures. While extensive refinements of RMSD evaluation at the backbone level are available, a comprehensive framework including the side chain interaction is not well understood. Here we employ protein structure network (PSN) formalism, with the non-covalent interactions of side chain, explicitly treated. The PSNs thus constructed are compared through graph spectral method, which provides a comparison at the local and at the global structural level. In this work, PSNs of multiple crystal conformers of single-chain, single-domain proteins, are subject to pair-wise analysis to examine the dissimilarity in their network topologies and in order to determine the conformational diversity of their native structures. This information is utilized to classify the structural domains of proteins into different categories. It is observed that proteins typically tend to retain structure and interactions at the backbone level. However, some of them also depict variability in either their overall structure or only in their inter-residue connectivity at the sidechain level, or both. Variability of sub-networks based on solvent accessibility and secondary structure is studied. The types of specific interactions are found to contribute differently to structure variability. An ensemble analysis by computing the mathematical variance of edge-weights across multiple conformers provided information on the contribution to overall variability from each edge of the PSN. 
Interactions that are highly variable are identified and their impact on structure variability has been discussed with the help of a case study. The classification based on the present side-chain network-based studies provides a framework to correlate the structure-function relationships in protein structures.
- Published
- 2022
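The RMSD that result 20 describes as the gold standard for comparing conformations can be sketched in a few lines. This is a simplified illustration, not the paper's network-based method: it assumes the two conformers are already superimposed and that atoms are paired by index, so the optimal-superposition step normally done beforehand is left out. The coordinates are made up.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally sized lists of (x, y, z) coordinates.

    Assumes the structures are already superimposed and atoms are paired by index.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

# Made-up coordinates for illustration only.
conformer_1 = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
conformer_2 = [(0.0, 0.0, 0.0), (1.5, 1.0, 0.0), (3.0, 0.0, 2.0)]
value = rmsd(conformer_1, conformer_2)
```

A single number like this averages over all atoms, which is one reason an approach that examines individual side-chain interactions, as the abstract proposes, can reveal variability that a backbone RMSD hides.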
21. Research progress of reduced amino acid alphabets in protein analysis and prediction
- Author
- Yuchao Liang, Siqi Yang, Lei Zheng, Hao Wang, Jian Zhou, Shenghui Huang, Lei Yang, and Yongchun Zuo
- Subjects
- Reduced amino acid alphabets, Machine learning, Sequence alignment, Protein classification, Structure analysis, Biotechnology, TP248.13-248.65
- Abstract
Proteins are the executors of cellular physiological activities, and accurate elucidation of their structure and function is crucial for the refined mapping of proteins. As a feature engineering method, the reduction of the amino acid alphabet is not only an important tool for protein structure and function analysis, but also opens a broad horizon for machine learning. Representing sequences with fewer amino acid types greatly reduces the dimensionality and noise of traditional feature engineering, and provides more interpretable predictive models in which machine learning can capture key features. In this paper, we systematically review the strategies and methods used to build reduced amino acid (RAA) alphabets, and summarize their main applications in protein sequence alignment, functional classification, and prediction of structural properties. Finally, we give a comprehensive analysis of 672 RAA alphabets from 74 reduction methods.
- Published
- 2022
- Full Text
- View/download PDF
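To make the idea of a reduced amino acid alphabet in the abstract above concrete, here is a minimal sketch of recoding a sequence with fewer letters. The six-group scheme below is purely illustrative and is an assumption of this example; it is not claimed to be one of the 672 alphabets analyzed in the review. The names `GROUPS` and `reduce_sequence` are hypothetical.

```python
# Illustrative grouping (assumption): representative letter -> member residues.
GROUPS = {
    "A": "AVLIMC",  # aliphatic / sulfur-containing
    "F": "FWYH",    # aromatic
    "S": "STNQ",    # polar, uncharged
    "K": "KR",      # positively charged
    "D": "DE",      # negatively charged
    "G": "GP",      # glycine / proline
}

# Invert the grouping into a per-residue lookup table.
RESIDUE_TO_GROUP = {
    residue: representative
    for representative, members in GROUPS.items()
    for residue in members
}


def reduce_sequence(sequence: str) -> str:
    """Recode a protein sequence with the reduced alphabet.

    Residues outside the 20 standard amino acids are mapped to 'X'.
    """
    return "".join(RESIDUE_TO_GROUP.get(res, "X") for res in sequence.upper())


reduced = reduce_sequence("MKTAYIAKQR")
```

With six letters instead of twenty, the number of possible tripeptides drops from 8000 to 216, which is the kind of dimensionality reduction the abstract refers to.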
22. ENTAIL: yEt aNoTher amyloid fIbrils cLassifier.
- Author
-
Auriemma Citarella, Alessia, Di Biasi, Luigi, De Marco, Fabiola, and Tortora, Genoveffa
- Subjects
- *
NAIVE Bayes classification , *PROTEIN precursors , *AMYLOID beta-protein , *AMYLOID , *TYPE 2 diabetes , *CREUTZFELDT-Jakob disease - Abstract
Background: This research aims to increase our knowledge of amyloidoses. These disorders cause incorrect protein folding, affecting protein structure and functionality. Fibrillar deposits are the basis of some well-known diseases, such as Alzheimer's disease, Creutzfeldt–Jakob disease and type II diabetes. For many of these amyloid proteins, the relative precursors are known. Discovering new protein precursors involved in forming amyloid fibril deposits would improve understanding of the pathological processes of amyloidoses. Results: A new classifier, called ENTAIL, was developed using more than 4000 molecular descriptors. ENTAIL is based on the Naive Bayes classifier with unbounded support and a Gaussian kernel type, and achieves a test-set accuracy of 81.80%, sensitivity (SN) of 100%, specificity (SP) of 63.63% and a Matthews correlation coefficient (MCC) of 0.683 on a balanced dataset. Conclusions: The analysis shows that, across the various test configurations, performance is best on a balanced dataset. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
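The four metrics quoted in the ENTAIL abstract above (accuracy, SN, SP, MCC) all follow from a 2x2 confusion matrix. The sketch below shows the standard formulas. The counts used in the example (11 true positives, 0 false negatives, 7 true negatives, 4 false positives) are hypothetical: they are not reported in the abstract and were chosen only because they give values close to the reported ones on a balanced set.

```python
from math import sqrt
from typing import Dict


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),      # SN: true positive rate
        "specificity": tn / (tn + fp),      # SP: true negative rate
        # MCC is defined as 0 when any marginal total is zero.
        "mcc": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }


# Hypothetical counts, NOT taken from the paper.
metrics = binary_metrics(tp=11, fn=0, tn=7, fp=4)
```

The example illustrates why MCC is informative here: a sensitivity of 100% combined with a much lower specificity yields a moderate MCC even though every positive is recovered.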
23. A lightweight classification of adaptor proteins using transformer networks.
- Author
-
Rahardja, Sylwan, Wang, Mou, Nguyen, Binh P., Fränti, Pasi, and Rahardja, Susanto
- Subjects
- *
RECEIVER operating characteristic curves , *DEEP learning , *ADAPTOR proteins , *MACHINE learning , *CELLULAR signal transduction , *COMPUTATIONAL biology - Abstract
Background: Adaptor proteins play a key role in intercellular signal transduction, and dysfunctional adaptor proteins result in diseases. Understanding their structure is the first step to tackling the associated conditions, spurring ongoing interest in research into adaptor proteins with bioinformatics and computational biology. Our study aims to introduce a small, new, and superior model for protein classification, pushing the boundaries with new machine learning algorithms. Results: We propose a novel transformer-based model that includes a convolutional block and a fully connected layer. We take protein sequences from a database, extract PSSM features, and then process them with our deep learning model. The proposed model is efficient and highly compact, achieving state-of-the-art performance in terms of area under the receiver operating characteristic curve, Matthews correlation coefficient and receiver operating characteristic curve. Despite having merely 20 hidden nodes, which translates to approximately 1% of the complexity of the previous best-known methods, the proposed model is still superior in results and computational efficiency. Conclusions: The proposed model is the first transformer model used for recognizing adaptor proteins. It takes PSSM profiles as inputs, comprises convolutional blocks, a transformer and fully connected layers, and outperforms all existing methods. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
24. GRACE: Generative Redesign in Artificial Computational Enzymology.
- Author
-
Hu RE, Yu CH, and Ng IS
- Abstract
Designing de novo enzymes is complex and challenging, especially when activity must be maintained. This research focused on motif design to identify the crucial domain in the enzyme and uncovered the protein structure by molecular docking. We therefore developed, for the first time, an automated workflow for the reformation and creation of de novo enzymes, called Generative Redesign in Artificial Computational Enzymology (GRACE). GRACE integrates RFdiffusion for structure generation, ProteinMPNN for sequence interpretation, and CLEAN for enzyme classification, followed by solubility analysis and molecular dynamics simulation. As a result, we selected two gene sequences associated with carbonic anhydrase from among 10,000 protein candidates. Experimental validation confirmed that these two novel enzymes, i.e., dCA12_2 and dCA23_1, exhibited favorable solubility, promising substrate-active site interactions, and achieved activity of 400 WAU/mL. This workflow has the potential to greatly streamline experimental efforts in enzyme engineering and unlock new avenues for rational protein design.
- Published
- 2024
- Full Text
- View/download PDF
25. Supervised Techniques in Proteomics
- Author
-
Kiranmai, Vasireddy Prabha, Siddesh, G. M., Manisekhar, S. R., Bansal, Jagdish Chand, Series Editor, Deep, Kusum, Series Editor, Nagar, Atulya K., Series Editor, Srinivasa, K. G., editor, Siddesh, G. M., editor, and Manisekhar, S. R., editor
- Published
- 2020
- Full Text
- View/download PDF
26. A Novel Improved Algorithm for Protein Classification Through a Graph Similarity Approach
- Author
-
Chou, Hsin-Hung, Hsu, Ching-Tien, Wang, Hao-Ching, Hsieh, Sun-Yuan, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Huang, De-Shuang, editor, and Jo, Kang-Hyun, editor
- Published
- 2020
- Full Text
- View/download PDF
27. Ensemble Learning-Based Feature Selection for Phage Protein Prediction.
- Author
-
Songbo Liu, Chengmin Cui, Huipeng Chen, and Tong Liu
- Subjects
FEATURE selection ,PROTEOMICS ,PROTEINS ,ANTI-infective agents - Abstract
Phage has high specificity for its host recognition. As a natural enemy of bacteria, it has been used to treat super bacteria many times. Identifying phage proteins from the original sequence is very important for understanding the relationship between phage and host bacteria and developing new antimicrobial agents. However, traditional experimental methods are both expensive and time-consuming. In this study, an ensemble learning-based feature selection method is proposed to find important features for phage protein identification. The method uses four types of protein sequence-derived features, quantifies the importance of each feature by perturbing it and measuring the effect on the results, and finally splices together the important features from the four feature types. In addition, we analyzed the selected features and their biological significance. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
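The abstract above scores each feature by perturbing it and observing how the results change. One common way to do this is permutation importance, sketched below. This is a swapped-in, generic technique: the abstract does not say which perturbation scheme the authors use, so the shuffling strategy, the accuracy metric, and the names `accuracy` and `permutation_importance` are all assumptions of this example.

```python
import random
from typing import Callable, List, Sequence


def accuracy(predict: Callable[[Sequence[float]], int],
             X: List[List[float]], y: List[int]) -> float:
    """Fraction of rows for which the model predicts the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict: Callable[[Sequence[float]], int],
                           X: List[List[float]], y: List[int],
                           n_repeats: int = 20, seed: int = 0) -> List[float]:
    """Mean drop in accuracy when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # perturb feature j only
            X_perturbed = [row[:j] + [value] + row[j + 1:]
                           for row, value in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perturbed, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy model that only looks at feature 0; feature 1 is irrelevant noise.
def toy_predict(row: Sequence[float]) -> int:
    return int(row[0] > 0.5)


X = [[0.0, 3.0], [0.0, 1.0], [0.0, 2.0], [0.0, 5.0],
     [1.0, 4.0], [1.0, 0.0], [1.0, 6.0], [1.0, 7.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
scores = permutation_importance(toy_predict, X, y)
```

Features whose score stays near zero can be dropped, and the remaining ones from each feature type can then be concatenated.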
28. CRCF: A Method of Identifying Secretory Proteins of Malaria Parasites.
- Author
-
Feng, Changli, Wu, Jin, Wei, Haiyan, Xu, Lei, and Zou, Quan
- Abstract
Malaria is a mosquito-borne disease that results in millions of cases and deaths annually. The development of a fast computational method that identifies secretory proteins of the malaria parasite is important for research on antimalarial drugs and vaccines. Thus, a method was developed to identify the secretory proteins of malaria parasites. In this method, a reduced alphabet was selected to recode the original protein sequence. A feature synthesis method was used to synthesise three different types of feature information. Finally, the random forest method was used as a classifier to identify the secretory proteins. In addition, a web server was developed to share the proposed algorithm. Experiments using the benchmark dataset demonstrated that the overall accuracy achieved by the proposed method was greater than 97.8 percent using the 10-fold cross-validation method. Furthermore, the reduced schemes and characteristic performance analyses are discussed. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
29. Prediction of presynaptic and postsynaptic neurotoxins based on feature extraction
- Author
-
Wen Zhu, Yuxin Guo, and Quan Zou
- Subjects
neurotoxin ,monomonokgap ,logistic regression ,protein classification ,neu_lr ,Biotechnology ,TP248.13-248.65 ,Mathematics ,QA1-939 - Abstract
A neurotoxin is essentially a protein that mainly acts on the nervous system; it has a selective toxic effect on the central nervous system and neuromuscular nodes, can cause muscle paralysis and respiratory paralysis, and has strong lethality. According to their principle of action, neurotoxins are divided into presynaptic neurotoxins and postsynaptic neurotoxins. Correctly identifying presynaptic and postsynaptic nerve toxins provides important clues for future drug development and the discovery of drug targets. Therefore, a predictive model, Neu_LR, was constructed in this paper. The monoMonokGap method was used to extract the frequency characteristics of presynaptic and postsynaptic neurotoxin sequences and to carry out feature selection. Then, based on the important features obtained after dimensionality reduction, the prediction model Neu_LR was constructed using a logistic regression algorithm and evaluated with ten-fold cross-validation and independent test set validation. The final accuracy rates were 99.6078% and 94.1176%, respectively, which proved that the Neu_LR model had good predictive performance and robustness, and could meet the prediction requirements of presynaptic and postsynaptic neurotoxins. The data and source code of the model can be freely downloaded from https://github.com/gyx123681/.
- Published
- 2021
- Full Text
- View/download PDF
30. SNARER: new molecular descriptors for SNARE proteins classification.
- Author
-
Auriemma Citarella, Alessia, Di Biasi, Luigi, Risi, Michele, and Tortora, Genoveffa
- Subjects
- *
SNARE proteins , *CYTOSKELETAL proteins , *RANDOM forest algorithms , *K-nearest neighbor classification , *MACHINE learning - Abstract
Background: SNARE proteins play an important role in different biological functions. This study aims to investigate the contribution of a new class of molecular descriptors (called SNARER) related to the chemical-physical properties of proteins in order to evaluate the performance of binary classifiers for SNARE proteins. Results: We constructed a balanced dataset of SNARE proteins, D128, and an unbalanced one, DUNI, on which we tested and compared the performance of the new descriptors presented here in combination with the feature sets (GAAC, CTDT, CKSAAP and 188D) already present in the literature. The machine learning algorithms used were Random Forest, k-Nearest Neighbors and AdaBoost, and oversampling and subsampling techniques were applied to the unbalanced dataset. The addition of the SNARER descriptors increases the precision for all considered ML algorithms. In particular, on the unbalanced DUNI dataset the accuracy increases in parallel with the increase in sensitivity, while on the balanced dataset D128 the accuracy increases compared to the counterpart without the SNARER descriptors, with a strong improvement in specificity. Our best result is obtained by combining the SNARER descriptors with the CKSAAP features on the D128 dataset, with 92.3% accuracy, 90.1% sensitivity and 95% specificity using the RF algorithm. Conclusions: The analysis has shown that introducing molecular descriptors linked to the chemical-physical and structural characteristics of the proteins can improve classification performance. Additionally, performance can change depending on whether a balanced or unbalanced dataset is used. Balanced training data can significantly improve prediction accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
31. PredMHC: An Effective Predictor of Major Histocompatibility Complex Using Mixed Features.
- Author
-
Chen, Dong and Li, Yanjuan
- Subjects
MAJOR histocompatibility complex ,AMINO acid sequence ,IMMUNE system ,RANDOM forest algorithms - Abstract
The major histocompatibility complex (MHC) is a large locus on vertebrate DNA that contains a tightly linked set of polymorphic genes encoding cell surface proteins essential for the adaptive immune system. The groups of proteins encoded in the MHC play an important role in the adaptive immune system. Therefore, the accurate identification of the MHC is necessary to understand its role in the adaptive immune system. An effective predictor called PredMHC is established in this study to identify the MHC from protein sequences. Firstly, PredMHC encoded a protein sequence with mixed features including 188D, APAAC, KSCTriad, CKSAAGP, and PAAC. Secondly, three classifiers including SGD, SMO, and random forest were trained on the mixed features of the protein sequence. Finally, the prediction result was obtained by the voting of the three classifiers. The experimental results of the 10-fold cross-validation test in the training dataset showed that PredMHC can obtain 91.69% accuracy. Experimental results on comparison with other features, classifiers, and existing methods showed the effectiveness of PredMHC in predicting the MHC. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
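The final step described in the PredMHC abstract above, combining three classifiers by voting, can be sketched as plain majority voting. Whether the authors use hard majority voting or a weighted or probabilistic variant is not stated in the abstract, so hard voting is an assumption here, and the function name `majority_vote` is hypothetical. The three prediction lists stand in for the outputs of already trained SGD, SMO and random forest models.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-classifier label lists into one label per sample.

    predictions[c][i] is the label classifier c assigns to sample i.
    With an odd number of binary classifiers there are no ties.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Stand-in outputs of three classifiers on three samples (1 = MHC, 0 = not).
sgd_labels = [1, 0, 1]
smo_labels = [1, 1, 0]
forest_labels = [0, 0, 1]
ensemble_labels = majority_vote([sgd_labels, smo_labels, forest_labels])
```

Each sample receives the label that at least two of the three classifiers agree on.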
32. Multiple Profile Models Extract Features from Protein Sequence Data and Resolve Functional Diversity of Very Different Protein Families.
- Author
-
Vicedomini, R., Bouly, J.P., Laine, E., Falciatore, A., and Carbone, A.
- Subjects
AMINO acid sequence ,BOLTZMANN machine ,SMALL molecules ,PROTEINS ,NUCLEIC acids ,FUNCTIONAL groups - Abstract
Functional classification of proteins from sequences alone has become a critical bottleneck in understanding the myriad of protein sequences that accumulate in our databases. The great diversity of homologous sequences hides, in many cases, a variety of functional activities that cannot be anticipated. Their identification appears critical for a fundamental understanding of the evolution of living organisms and for biotechnological applications. ProfileView is a sequence-based computational method, designed to functionally classify sets of homologous sequences. It relies on two main ideas: the use of multiple profile models whose construction explores evolutionary information in available databases, and a novel definition of a representation space in which to analyze sequences with multiple profile models combined together. ProfileView classifies protein families by enriching known functional groups with new sequences and discovering new groups and subgroups. We validate ProfileView on seven classes of widespread proteins involved in the interaction with nucleic acids, amino acids and small molecules, and in a large variety of functions and enzymatic reactions. ProfileView agrees with the large set of functional data collected for these proteins from the literature regarding the organization into functional subgroups and residues that characterize the functions. In addition, ProfileView resolves undefined functional classifications and extracts the molecular determinants underlying protein functional diversity, showing its potential to select sequences towards accurate experimental design and discovery of novel biological functions. On protein families with complex domain architecture, ProfileView functional classification reconciles domain combinations, unlike phylogenetic reconstruction. 
ProfileView proves to outperform the functional classification approach PANTHER, the two k-mer-based methods CUPP and eCAMI and a neural network approach based on Restricted Boltzmann Machines. It overcomes time complexity limitations of the latter. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
33. Decision Tree Classifier for Classification of Proteins Using the Protein Data Bank
- Author
-
Satpute, Babasaheb S., Yadav, Raghav, Kacprzyk, Janusz, Series Editor, Krishna, A.N., editor, Srikantaiah, K.C., editor, and Naveena, C, editor
- Published
- 2019
- Full Text
- View/download PDF
34. Classification of Proteins Using Naïve Bayes Classifier and Surface-Invariant Coordinates
- Author
-
Satpute, Babasaheb S., Yadav, Raghav, Kacprzyk, Janusz, Series Editor, Krishna, A.N., editor, Srikantaiah, K.C., editor, and Naveena, C, editor
- Published
- 2019
- Full Text
- View/download PDF
35. Protein Sequence in Classifying Dengue Serotypes
- Author
-
Pandiyarajan, Pandiselvam, Thangairulappan, Kathirvalavakumar, Kacprzyk, Janusz, Series Editor, Pal, Nikhil R., Advisory Editor, Bello Perez, Rafael, Advisory Editor, Corchado, Emilio S., Advisory Editor, Hagras, Hani, Advisory Editor, Kóczy, László T., Advisory Editor, Kreinovich, Vladik, Advisory Editor, Lin, Chin-Teng, Advisory Editor, Lu, Jie, Advisory Editor, Melin, Patricia, Advisory Editor, Nedjah, Nadia, Advisory Editor, Nguyen, Ngoc Thanh, Advisory Editor, Wang, Jun, Advisory Editor, Pati, Bibudhendu, editor, Panigrahi, Chhabi Rani, editor, Misra, Sudip, editor, Pujari, Arun K., editor, and Bakshi, Sambit, editor
- Published
- 2019
- Full Text
- View/download PDF
36. Classifying Mixed Patterns of Proteins in High-Throughput Microscopy Images Using Deep Neural Networks
- Author
-
Zhang, Enze, Zhang, Boheng, Hu, Shaohan, Zhang, Fa, Wan, Xiaohua, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Huang, De-Shuang, editor, Bevilacqua, Vitoantonio, editor, and Premaratne, Prashan, editor
- Published
- 2019
- Full Text
- View/download PDF
37. PredMHC: An Effective Predictor of Major Histocompatibility Complex Using Mixed Features
- Author
-
Dong Chen and Yanjuan Li
- Subjects
protein classification ,major histocompatibility complex ,machine learning ,feature extraction ,identification ,Genetics ,QH426-470 - Abstract
The major histocompatibility complex (MHC) is a large locus on vertebrate DNA that contains a tightly linked set of polymorphic genes encoding cell surface proteins essential for the adaptive immune system. The groups of proteins encoded in the MHC play an important role in the adaptive immune system. Therefore, the accurate identification of the MHC is necessary to understand its role in the adaptive immune system. An effective predictor called PredMHC is established in this study to identify the MHC from protein sequences. Firstly, PredMHC encoded a protein sequence with mixed features including 188D, APAAC, KSCTriad, CKSAAGP, and PAAC. Secondly, three classifiers including SGD, SMO, and random forest were trained on the mixed features of the protein sequence. Finally, the prediction result was obtained by the voting of the three classifiers. The experimental results of the 10-fold cross-validation test in the training dataset showed that PredMHC can obtain 91.69% accuracy. Experimental results on comparison with other features, classifiers, and existing methods showed the effectiveness of PredMHC in predicting the MHC.
- Published
- 2022
- Full Text
- View/download PDF
38. On Applications of Cellular Automata Memristor Networks for Reservoir Computing: Classifying Protein Toxicity.
- Author
-
DEL AMO, IGNACIO and KONKOLI, Z.
- Subjects
CELLULAR automata ,AMINO acid sequence ,COMPUTER systems ,PROTEINS ,INFORMATION networks ,TOXINS - Abstract
We explore the computing capacity of large memristor networks in a reservoir computing setup. Memristor networks are modelled as random cellular automata networks. It is inevitable that the cellular automata model does not describe properly certain aspects of memristor dynamics. However, owing to the simplicity of the model one can explore extremely large memristor networks and specifically focus on issues related to the network topology. To investigate the computing capacity of such systems we studied a challenging classification problem. The goal was to distinguish whether a given protein sequence resembles a toxin or not. Different network topologies have been investigated with the overarching goal of understanding how the topology of the memristor network relates to its information processing capacity, and ultimately affects the accuracy of the prediction. We demonstrate the existence of “sweet spots” in the space of network topologies. There are network structures that can generalize very well and are robust with regard to the change in the data set. [ABSTRACT FROM AUTHOR]
- Published
- 2022
39. A novel numerical representation for proteins: Three-dimensional Chaos Game Representation and its Extended Natural Vector
- Author
-
Zeju Sun, Shaojun Pei, Rong Lucy He, and Stephen S.-T. Yau
- Subjects
Chaos Game Representation ,Three-dimensional CGR ,Extended Natural Vector ,Protein classification ,Biotechnology ,TP248.13-248.65 - Abstract
Chaos Game Representation (CGR) was first proposed as an image representation method for DNA and has been extended to other biological macromolecules. Compared with the CGR images of DNA, where DNA sequences are converted into a series of points in the unit square, the existing CGR images of proteins are not as elegant in geometry, and the implications of the distribution of points in the CGR image are not as obvious. In this study, by naturally distributing the twenty amino acids on the vertices of a regular dodecahedron, we introduce a novel three-dimensional image representation of protein sequences with the CGR method. We also associate each CGR image with a vector in high-dimensional Euclidean space, called the extended natural vector (ENV), in order to analyze the information contained in the CGR images. Based on the results of protein classification and phylogenetic analysis, our method could serve as a precise method to discover biological relationships between proteins.
- Published
- 2020
- Full Text
- View/download PDF
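The three-dimensional CGR described in the abstract above places the twenty amino acids on the vertices of a regular dodecahedron. The sketch below shows the general chaos-game iteration on such a solid. Several details are assumptions of this example and are not taken from the paper: the alphabetical assignment of residues to vertices, the starting point at the origin, and the contraction ratio of 1/2 (each new point is the midpoint between the previous point and the vertex of the current residue). The extended natural vector is not covered.

```python
from math import sqrt
from typing import List, Tuple

Point = Tuple[float, float, float]

PHI = (1 + sqrt(5)) / 2                 # golden ratio
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"    # 20 standard residues


def dodecahedron_vertices() -> List[Point]:
    """The 20 vertices of a regular dodecahedron centred at the origin."""
    vertices = [(x, y, z) for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)]
    for a in (-1 / PHI, 1 / PHI):
        for b in (-PHI, PHI):
            # Cyclic permutations of (0, +-1/phi, +-phi).
            vertices += [(0, a, b), (a, b, 0), (b, 0, a)]
    return vertices


# Assumption: residues are assigned to vertices in alphabetical order.
VERTEX_OF = dict(zip(AMINO_ACIDS, dodecahedron_vertices()))


def cgr_points(sequence: str) -> List[Point]:
    """Chaos-game trajectory: move halfway to the vertex of each residue."""
    point: Point = (0.0, 0.0, 0.0)
    trajectory = []
    for residue in sequence.upper():
        vertex = VERTEX_OF[residue]     # KeyError for non-standard residues
        point = tuple((p + v) / 2 for p, v in zip(point, vertex))
        trajectory.append(point)
    return trajectory


points = cgr_points("MKT")
```

Every sequence maps to a point cloud inside the dodecahedron, from which numerical descriptors can then be computed.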
40. Predicting Sub-Golgi Apparatus Resident Protein With Primary Sequence Hybrid Features
- Author
-
Chunyu Wang, Jialin Li, Xiaoyan Liu, and Maozu Guo
- Subjects
Golgi apparatus ,feature extraction ,hybrid sequence features ,protein classification ,SVM ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
The Golgi apparatus is a significant membrane-bound organelle of eukaryotic cells that is made up of a series of flattened, stacked pouches (called cisternae). The Golgi apparatus packages proteins into membrane-bound vesicles, and so it is responsible for transporting, modifying, and packaging proteins and lipids into vesicles for delivery to targeted destinations. It belongs to the central organelle mediating system of eukaryotic cells. Functional defects of the Golgi apparatus are associated with many kinds of neurodegenerative diseases, such as Parkinson's and Alzheimer's diseases. Golgi-resident proteins play an important role in the Golgi apparatus' processing, which includes storing, packaging, and dispatching proteins. Identifying sub-Golgi protein types can help researchers to develop more effective therapies and drugs for diseases that result from disorders of Golgi-resident proteins. In this paper, we propose a computational model to discriminate cis-Golgi proteins from trans-Golgi proteins using a machine learning method. First, we use PseKNC, K-separated Bigrams, and PsePSSM as feature extraction techniques, and then we select the optimal features among those identified by PseKNC with the AdaBoost classifier. To create a balanced dataset out of the imbalanced set of Golgi proteins, we used the Random-SMOTE oversampling approach. Finally, we employed the SVM algorithm to distinguish cis-Golgi proteins from trans-Golgi proteins. The proposed method achieves promising performance, with accuracy of 96.5%, 96.5%, and 96.9% in the experiments with jackknife cross-validation, independent testing, and 10-fold cross-validation, respectively, which exceeds the performance of previous related work.
- Published
- 2020
- Full Text
- View/download PDF
41. SMOPredT4SE: An Effective Prediction of Bacterial Type IV Secreted Effectors Using SVM Training With SMO
- Author
-
Zihao Yan, Dong Chen, Zhixia Teng, Donghua Wang, and Yanjuan Li
- Subjects
Machine learning ,protein classification ,sequential minimal optimization ,type IV secreted effector ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
Various bacterial pathogens can deliver their secreted effectors to host cells via the type IV secretion system (T4SS) and cause host diseases. Since T4SS secreted effectors (T4SEs) play important roles in the interaction between pathogens and host, identifying T4SEs is crucial to understanding the pathogenic mechanism of T4SS. We established an effective predictor called SMOPredT4SE to identify T4SEs from protein sequences. SMOPredT4SE employs combined features of series correlation pseudo amino acid composition and position-specific scoring matrix to represent protein sequences, and trains the prediction model using support vector machines (SVM) with the sequential minimal optimization (SMO) algorithm (to distinguish it from the traditional SVM, we abbreviate it as SMO hereafter). In the 5-fold cross-validation test, SMOPredT4SE's overall accuracy was 95.6%. Comparison experiments with other features, classifiers, and existing methods were conducted, and the results show the effectiveness of SMOPredT4SE in predicting T4SEs.
- Published
- 2020
- Full Text
- View/download PDF
42. A Pearson Based Feature Compressing Model for SNARE Protein Classification
- Author
-
Guilin Li
- Subjects
Feature representation ,feature selection ,protein classification ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
SNARE proteins are a group of proteins that drive the biological fusion of two membranes. It is important to identify them accurately, because malfunction of the SNARE proteins can lead to many diseases. In this paper, a Pearson-based feature compressing model is proposed to identify the SNARE proteins accurately and efficiently. First, the 188D, CKSAAP, CTDD and CTRIAD feature extraction methods are used to extract features from the SNARE and non-SNARE proteins. Because the number of features extracted by the four methods is very large, many redundant features are included, so it is necessary to filter the original feature set. The Chi-Square, Information Gain and Pearson Correlation Coefficient feature selection methods are used to evaluate the value of each feature in the feature set. The selected features are used to train a random forest classifier, and the performance of the selected features is evaluated by cross-validation. The experimental results showed that the CTDD-based model with the first 70% of features selected by the Pearson feature selection method achieves the best performance among all the models.
- Published
- 2020
- Full Text
- View/download PDF
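The Pearson-based filtering step in the abstract above can be sketched as follows. The abstract does not specify how the Pearson score is computed or ranked, so this example assumes that each feature is scored by the absolute Pearson correlation between its column and the class label, and that the top 70% of features by that score are kept. The names `pearson` and `select_top_features` are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return covariance / (spread_x * spread_y)


def select_top_features(X: List[List[float]], y: List[float],
                        fraction: float = 0.7) -> List[int]:
    """Indices of the top `fraction` of features by |Pearson r| with the label."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    n_keep = max(1, int(n_features * fraction))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n_keep])


# Toy data: feature 0 equals the label, feature 1 is constant,
# feature 2 is partially correlated with the label.
X = [[0, 5, 0], [0, 5, 1], [1, 5, 1], [1, 5, 1]]
y = [0, 0, 1, 1]
kept = select_top_features(X, y)
```

The retained columns would then be passed to the classifier, a random forest in the abstract, and evaluated by cross-validation.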
43. Prediction of Hormone-Binding Proteins Based on K-mer Feature Representation and Naive Bayes.
- Author
-
Guo, Yuxin, Hou, Liping, Zhu, Wen, and Wang, Peng
- Subjects
CARRIER proteins ,FEATURE selection ,FEATURE extraction ,SENSITIVITY & specificity (Statistics) ,ALGORITHMS ,PROTEINS - Abstract
Hormone binding protein (HBP) is a soluble carrier protein that interacts selectively with different types of hormones and has various effects on the body's life activities. HBPs play an important role in the growth process of organisms, but their specific role is still unclear. Therefore, correctly identifying HBPs is the first step towards understanding and studying their biological function. However, due to their high cost and long experimental period, it is difficult for traditional biochemical experiments to correctly identify HBPs from an increasing number of proteins, so the real characterization of HBPs has become a challenging task for researchers. To measure the effectiveness of HBPs, an accurate and reliable prediction model for their identification is desirable. In this paper, we construct the prediction model HBP_NB. First, HBPs data were collected from the UniProt database, and a dataset was established. Then, based on the established high-quality dataset, the k-mer (K = 3) feature representation method was used to extract features. Second, the feature selection algorithm was used to reduce the dimensionality of the extracted features and select the appropriate optimal feature set. Finally, the selected features are input into Naive Bayes to construct the prediction model, and the model is evaluated by using 10-fold cross-validation. The final results were 95.45% accuracy, 94.17% sensitivity and 96.73% specificity. These results indicate that our model is feasible and effective. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
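The k-mer (K = 3) feature representation mentioned in the abstract above can be sketched as a normalized tripeptide frequency vector. Normalizing counts to frequencies, skipping windows that contain non-standard residues, and ordering the vector lexicographically are assumptions of this example. The function name `kmer_frequencies` is hypothetical, and the feature selection and Naive Bayes steps are not shown.

```python
from collections import Counter
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues


def kmer_frequencies(sequence: str, k: int = 3) -> List[float]:
    """Frequency of every possible k-mer, in lexicographic order.

    For k = 3 the vector has 20**3 = 8000 entries.
    """
    sequence = sequence.upper()
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    # Keep only windows made entirely of standard residues.
    valid = [w for w in windows if all(res in AMINO_ACIDS for res in w)]
    counts = Counter(valid)
    total = len(valid)
    vocabulary = ("".join(p) for p in product(AMINO_ACIDS, repeat=k))
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]


features = kmer_frequencies("MKTAYIAKQR")
```

Because most of the 8000 entries are zero for any single protein, a dimensionality reduction step such as the feature selection described in the abstract is a natural follow-up.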
44. SHREC 2021: Retrieval and classification of protein surfaces equipped with physical and chemical properties.
- Author
-
Raffo, Andrea, Fugacci, Ulderico, Biasotti, Silvia, Rocchia, Walter, Liu, Yonghuai, Otu, Ekpo, Zwiggelaar, Reyer, Hunter, David, Zacharaki, Evangelia I., Psatha, Eleftheria, Laskos, Dimitrios, Arvanitis, Gerasimos, Moustakas, Konstantinos, Aderinwale, Tunde, Christoffer, Charles, Shin, Woong-Hee, Kihara, Daisuke, Giachetti, Andrea, Nguyen, Huu-Nghia, and Nguyen, Tuan-Duy
- Subjects
- *
CHEMICAL properties , *GEOMETRIC surfaces , *SURFACE geometry , *ELECTRIC potential , *PROTEINS - Abstract
• A new benchmark of protein surfaces equipped with three physicochemical properties. • Protein surfaces are retrieved and classified with respect to different ground truths. • Analysis of the retrieval and classification performances of the methods that participated in SHREC 2021. • A thorough investigation of how different physicochemical properties impact the performance of retrieval methods. This paper presents the methods that have participated in the SHREC 2021 contest on retrieval and classification of protein surfaces on the basis of their geometry and physicochemical properties. The goal of the contest is to assess the capability of different computational approaches to identify different conformations of the same protein, or the presence of common sub-parts, starting from a set of molecular surfaces. We addressed two problems: defining the similarity solely based on the surface geometry or with the inclusion of physicochemical information, such as electrostatic potential, amino acid hydrophobicity, and the presence of hydrogen bond donors and acceptors. Retrieval and classification performances, with respect to the single protein or the existence of common sub-sequences, are analysed according to a number of information retrieval indicators. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
45. A topological approach for protein classification
- Author
-
Wei, Guo [The Ohio State Univ., Columbus, OH (United States)]
- Published
- 2015
- Full Text
- View/download PDF
46. Prediction of Hormone-Binding Proteins Based on K-mer Feature Representation and Naive Bayes
- Author
-
Yuxin Guo, Liping Hou, Wen Zhu, and Peng Wang
- Subjects
hormone binding protein, feature selection, protein classification, k-mer, naive Bayes model, Genetics, QH426-470 - Abstract
Hormone binding protein (HBP) is a soluble carrier protein that interacts selectively with different types of hormones and has various effects on the body’s life activities. HBPs play an important role in the growth process of organisms, but their specific role is still unclear. Therefore, correctly identifying HBPs is the first step towards understanding and studying their biological function. However, due to their high cost and long experimental period, it is difficult for traditional biochemical experiments to correctly identify HBPs from an increasing number of proteins, so the real characterization of HBPs has become a challenging task for researchers. To measure the effectiveness of HBPs, an accurate and reliable prediction model for their identification is desirable. In this paper, we construct the prediction model HBP_NB. First, HBPs data were collected from the UniProt database, and a dataset was established. Then, based on the established high-quality dataset, the k-mer (K = 3) feature representation method was used to extract features. Second, the feature selection algorithm was used to reduce the dimensionality of the extracted features and select the appropriate optimal feature set. Finally, the selected features are input into Naive Bayes to construct the prediction model, and the model is evaluated by using 10-fold cross-validation. The final results were 95.45% accuracy, 94.17% sensitivity and 96.73% specificity. These results indicate that our model is feasible and effective.
- Published
- 2021
- Full Text
- View/download PDF
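The k-mer (K = 3) feature representation mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' HBP_NB code: the function names, the use of relative frequencies as feature values, and the restriction to the 20 standard amino acids are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product

# Assumption: only the 20 standard amino acids define the feature space.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# All possible 3-mers over the 20-letter alphabet: 20**3 = 8000 features.
KMER_VOCABULARY = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]


def kmer_features(sequence, k=3):
    """Return the relative frequency of each overlapping k-mer in a sequence."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    if total == 0:
        # Sequence shorter than k: no k-mers can be extracted.
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def to_vector(features, vocabulary=KMER_VOCABULARY):
    """Map a k-mer frequency dict onto a fixed-length vector.

    k-mers that are not in the vocabulary (for example, those containing
    a non-standard residue) are silently dropped.
    """
    return [features.get(kmer, 0.0) for kmer in vocabulary]


if __name__ == "__main__":
    feats = kmer_features("ACDACD", k=3)
    print(feats)
    print(len(to_vector(feats)))
```

A fixed-length vector of this kind is what a feature selection step and a classifier such as Naive Bayes would typically consume; the 8000-dimensional space also suggests why the abstract describes reducing dimensionality before classification.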
47. A Novel Feature for Recognition of Protein Family Using ANN and Machine Learning
- Author
-
Satpute, Babasaheb S., Yadav, Raghav, Singh, Satendra, Barbosa, Simone Diniz Junqueira, Series Editor, Filipe, Joaquim, Series Editor, Kotenko, Igor, Series Editor, Sivalingam, Krishna M., Series Editor, Washio, Takashi, Series Editor, Yuan, Junsong, Series Editor, Zhou, Lizhu, Series Editor, Deshpande, A.V., editor, Unal, Aynur, editor, Passi, Kalpdrum, editor, Singh, Dharm, editor, Nayak, Malaya, editor, Patel, Bharat, editor, and Pathan, Shafi, editor
- Published
- 2018
- Full Text
- View/download PDF
48. Recognition of Protein Family Using a Novel Classification System
- Author
-
Satpute, Babasaheb S., Yadav, Raghav, Barbosa, Simone Diniz Junqueira, Series Editor, Filipe, Joaquim, Series Editor, Kotenko, Igor, Series Editor, Sivalingam, Krishna M., Series Editor, Washio, Takashi, Series Editor, Yuan, Junsong, Series Editor, Zhou, Lizhu, Series Editor, Deshpande, A.V., editor, Unal, Aynur, editor, Passi, Kalpdrum, editor, Singh, Dharm, editor, Nayak, Malaya, editor, Patel, Bharat, editor, and Pathan, Shafi, editor
- Published
- 2018
- Full Text
- View/download PDF
49. Development of a TSR-Based Method for Protein 3-D Structural Comparison With Its Applications to Protein Classification and Motif Discovery
- Author
-
Sarika Kondra, Titli Sarkar, Vijay Raghavan, and Wu Xu
- Subjects
protein structure comparison, triangular spatial relationship, structure motifs, protein classification, protein structure and function relation, protein secondary structure, Chemistry, QD1-999 - Abstract
Development of protein 3-D structural comparison methods is important in understanding protein functions. At the same time, developing such a method is very challenging. In the last 40 years, ever since the development of the first automated structural method, ~200 papers were published using different representations of structures. The existing methods can be divided into five categories: sequence-, distance-, secondary structure-, geometry-based, and network-based structural comparisons. Each has its uniqueness, but also limitations. We have developed a novel method where the 3-D structure of a protein is modeled using the concept of Triangular Spatial Relationship (TSR), where triangles are constructed with the Cα atoms of a protein as vertices. Every triangle is represented using an integer, which we denote as a “key.” A key is computed using the length, angle, and vertex labels based on a rule-based formula, which ensures assignment of the same key to identical TSRs across proteins. A structure is thereby represented by a vector of integers. Our method is able to accurately quantify similarity of structure or substructure by matching numbers of identical keys between two proteins. The uniqueness of our method includes: (i) a unique way to represent structures to avoid performing structural superimposition; (ii) use of triangles to represent substructures as it is the simplest primitive to capture shape; (iii) complex structure comparison is achieved by matching integers corresponding to multiple TSRs. Every substructure of one protein is compared to every other substructure in a different protein. The method is used in the studies of proteases and kinases because they play essential roles in cell signaling, and a majority of these constitute drug targets. The new motifs or substructures we identified specifically for proteases and kinases provide a deeper insight into their structural relations.
Furthermore, the method provides a unique way to study protein conformational changes. In addition, the results from CATH and SCOP data sets clearly demonstrate that our method can distinguish alpha helices from beta pleated sheets and vice versa. Our method has the potential to be developed into a powerful tool for efficient structure-BLAST search and comparison, just as BLAST is for sequence search and alignment.
- Published
- 2021
- Full Text
- View/download PDF
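The abstract above describes representing each protein as a vector of integer keys and quantifying similarity by matching identical keys between two proteins. A minimal sketch of that matching step follows. It does not reproduce the authors' rule-based key formula, which the abstract does not give; the key values below are made up, and scoring with a generalized Jaccard ratio over key multisets is an assumption chosen for illustration.

```python
from collections import Counter


def shared_key_count(keys_a, keys_b):
    """Count keys common to both proteins, respecting multiplicity."""
    counts_a, counts_b = Counter(keys_a), Counter(keys_b)
    # Counter '&' keeps the minimum count of each key (multiset intersection).
    return sum((counts_a & counts_b).values())


def key_similarity(keys_a, keys_b):
    """Generalized Jaccard similarity between two multisets of integer keys.

    Assumption: this scoring rule is illustrative and is not taken
    from the paper.
    """
    counts_a, counts_b = Counter(keys_a), Counter(keys_b)
    # Counter '|' keeps the maximum count of each key (multiset union).
    union = sum((counts_a | counts_b).values())
    if union == 0:
        return 0.0
    return sum((counts_a & counts_b).values()) / union


if __name__ == "__main__":
    # Hypothetical key vectors; real keys would come from the TSR formula.
    protein_a = [101, 101, 205, 309]
    protein_b = [101, 205, 205, 412]
    print(shared_key_count(protein_a, protein_b))
    print(key_similarity(protein_a, protein_b))
```

Because the comparison reduces to counting integers, no structural superimposition is needed, which is the property the abstract highlights in point (i).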
50. A new method for protein characterization and classification using geometrical features for 3D face analysis: An example of tubulin structures.
- Author
-
Di Grazia, Luca, Aminpour, Maral, Vezzetti, Enrico, Rezania, Vahid, Marcolin, Federica, and Tuszynski, Jack Adam
- Abstract
This article reports on the results of research aimed at translating biometric 3D face recognition concepts and algorithms into the field of protein biophysics in order to precisely and rapidly classify morphological features of protein surfaces. Both human faces and protein surfaces are free-forms, and some descriptors used in differential geometry can be used to describe them by applying the principles of feature extraction developed for computer vision and pattern recognition. The first part of this study focused on building the protein dataset using a simulation tool and performing feature extraction using novel geometrical descriptors. The second part tested the method on two examples: the first involved a classification of tubulin isotypes and the second compared tubulin with the FtsZ protein, which is its bacterial analog. An additional test involved several unrelated proteins. Different classification methodologies have been used: a classic approach with a support vector machine (SVM) classifier and unsupervised learning with a k-means approach. The best result was obtained with SVM and the radial basis function kernel. The results are significant and competitive with the state-of-the-art protein classification methods. This leads to a new methodological direction in protein structure analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF