147 results on '"Language Modeling"'
Search Results
2. GATSol, an enhanced predictor of protein solubility through the synergy of 3D structure graph and large language modeling
- Author
-
Bin Li and Dengming Ming
- Subjects
Protein solubility, AlphaFold, ESM-1b, Graph neural network, Attention, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
- Abstract
Background Protein solubility is a critically important physicochemical property closely related to protein expression. For example, it is one of the main factors considered in the design and production of antibody drugs and a prerequisite for realizing various protein functions. Although several solubility prediction models have emerged in recent years, many are limited to capturing information embedded in one-dimensional amino acid sequences, resulting in unsatisfactory predictive performance. Results In this study, we introduce a novel Graph Attention network-based protein Solubility model, GATSol, which represents the 3D structure of a protein as a protein graph. In addition to the node features of amino acids extracted by a state-of-the-art protein large language model, GATSol utilizes amino acid distance maps generated using the latest AlphaFold technology. Rigorous testing on the independent eSOL and Saccharomyces cerevisiae test datasets has shown that GATSol outperforms most recently introduced models, especially with respect to the coefficient of determination R2, which reaches 0.517 and 0.424, respectively. It outperforms the current state-of-the-art GraphSol by 18.4% on the S. cerevisiae test set. Conclusions GATSol captures 3D features of proteins by building protein graphs, which significantly improves the accuracy of protein solubility prediction. Recent advances in protein structure modeling allow our method to incorporate spatial structure features extracted from predicted structures while requiring only protein sequences as input, which simplifies the graph neural network prediction process and makes it more user-friendly and efficient. As a result, GATSol may help prioritize highly soluble proteins, ultimately reducing the cost and effort of experimental work. The source code and data of the GATSol model are freely available at https://github.com/binbinbinv/GATSol.
- Published
- 2024
- Full Text
- View/download PDF
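To make the graph-attention idea in the GATSol abstract concrete, here is a minimal sketch (my own illustration, not the authors' released code) of a single graph-attention layer of the kind such models build on: nodes are residues carrying language-model embeddings, and the adjacency mask stands in for an AlphaFold-derived contact map. All dimensions and the contact threshold are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared projection
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # attention scorer

    def forward(self, x, adj):
        # x: (n_residues, in_dim) node features; adj: (n, n) contact mask
        h = self.W(x)
        n = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1))       # pairwise scores
        e = e.masked_fill(adj == 0, float("-inf"))        # attend to contacts only
        return F.elu(torch.softmax(e, dim=-1) @ h)

# toy usage: 5 residues, 1280-dim embeddings (ESM-1b size), random contacts
x = torch.randn(5, 1280)
adj = (torch.rand(5, 5) < 0.5).float().fill_diagonal_(1.0)  # keep self-loops
print(GraphAttentionLayer(1280, 64)(x, adj).shape)          # torch.Size([5, 64])
```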
3. Empirical evaluation of language modeling to ascertain cancer outcomes from clinical text reports
- Author
-
Elmarakeby, Haitham A., Trukhanov, Pavel S., Arroyo, Vidal M., Riaz, Irbaz Bin, Schrag, Deborah, Van Allen, Eliezer M., and Kehl, Kenneth L.
- Published
- 2023
- Full Text
- View/download PDF
4. Empirical evaluation of language modeling to ascertain cancer outcomes from clinical text reports
- Author
-
Haitham A. Elmarakeby, Pavel S. Trukhanov, Vidal M. Arroyo, Irbaz Bin Riaz, Deborah Schrag, Eliezer M. Van Allen, and Kenneth L. Kehl
- Subjects
Cancer, Natural language processing, Clinical outcomes, Information extraction, Transformer-based language models, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
- Abstract
Background Longitudinal data on key cancer outcomes for clinical research, such as response to treatment and disease progression, are not captured in standard cancer registry reporting. Manual extraction of such outcomes from unstructured electronic health records is a slow, resource-intensive process. Natural language processing (NLP) methods can accelerate outcome annotation, but they require substantial labeled data. Transfer learning based on language modeling, particularly using the Transformer architecture, has achieved improvements in NLP performance. However, there has been no systematic evaluation of NLP model training strategies for the extraction of cancer outcomes from unstructured text. Results We evaluated the performance of nine NLP models on the two tasks of identifying cancer response and cancer progression within imaging reports at a single academic center among patients with non-small cell lung cancer. We trained the classification models under different conditions, including training sample size, classification architecture, and language model pre-training. The training involved a labeled dataset of 14,218 imaging reports for 1112 patients with lung cancer. A subset of models was based on a pre-trained language model, DFCI-ImagingBERT, created by further pre-training a BERT-based model on an unlabeled dataset of 662,579 reports from 27,483 patients with cancer from our center. A classifier based on our DFCI-ImagingBERT, trained on more than 200 patients, achieved the best results in most experiments; however, these results were only marginally better than those of simpler "bag of words" or convolutional neural network models. Conclusion When developing AI models to extract outcomes from imaging reports for clinical cancer research, if computational resources are plentiful but labeled training data are limited, large language models can be used for zero- or few-shot learning to achieve reasonable performance. When computational resources are more limited but labeled training data are readily available, even simple machine learning architectures can achieve good performance for such tasks.
- Published
- 2023
- Full Text
- View/download PDF
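The abstract above notes that a simple "bag of words" classifier was nearly as strong as DFCI-ImagingBERT on this task. A minimal sketch of that kind of baseline, with hypothetical reports and a hypothetical label coding:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reports = ["Interval decrease in the dominant lung mass.",      # hypothetical
           "New hepatic lesions concerning for progression."]   # hypothetical
labels = [0, 1]   # 0 = response, 1 = progression (hypothetical coding)

# unigram + bigram counts feeding a linear classifier
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(reports, labels)
print(clf.predict(["Increasing size of hepatic lesions."]))     # likely [1]
```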
5. Collectively encoding protein properties enriches protein language models
- Author
-
Jingmin An and Xiaogang Weng
- Subjects
Protein language modeling, Multi-task learning, Transfer learning, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
- Abstract
Natural language processing models pre-trained on a large natural language corpus can transfer learned knowledge to the protein domain by fine-tuning on specific in-domain tasks. However, few studies have focused on enriching such protein language models by jointly learning protein properties from strongly correlated protein tasks. Here we designed a multi-task learning (MTL) architecture, aiming to decipher implicit structural and evolutionary information from three sequence-level classification tasks for protein family, superfamily and fold. Considering the co-existing contextual relevance between human words and protein language, we employed BERT, pre-trained on a large natural language corpus, as our backbone to handle protein sequences. More importantly, the encoded knowledge obtained in the MTL stage can be well transferred to more fine-grained downstream tasks of TAPE. Experiments on structure- or evolution-related applications demonstrate that our approach outperforms many state-of-the-art Transformer-based protein models, especially in remote homology detection.
- Published
- 2022
- Full Text
- View/download PDF
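The multi-task setup described in this entry (a shared encoder with separate heads for family, superfamily, and fold) can be sketched generically as follows; the mean pooling, head sizes, and equal-weight loss are my assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    def __init__(self, hidden, n_family, n_superfamily, n_fold):
        super().__init__()
        self.heads = nn.ModuleDict({
            "family": nn.Linear(hidden, n_family),
            "superfamily": nn.Linear(hidden, n_superfamily),
            "fold": nn.Linear(hidden, n_fold),
        })

    def forward(self, seq_repr):              # (batch, seq_len, hidden) from a BERT-style encoder
        pooled = seq_repr.mean(dim=1)         # simple mean pooling over residues
        return {task: head(pooled) for task, head in self.heads.items()}

def mtl_loss(logits, targets):
    # equal-weight sum of per-task cross-entropies (one common MTL choice)
    ce = nn.CrossEntropyLoss()
    return sum(ce(logits[t], targets[t]) for t in logits)

heads = MultiTaskHeads(hidden=768, n_family=100, n_superfamily=50, n_fold=20)
logits = heads(torch.randn(4, 128, 768))      # stand-in encoder output
targets = {"family": torch.randint(0, 100, (4,)),
           "superfamily": torch.randint(0, 50, (4,)),
           "fold": torch.randint(0, 20, (4,))}
print(mtl_loss(logits, targets))
```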
6. Collectively encoding protein properties enriches protein language models
- Author
-
An, Jingmin and Weng, Xiaogang
- Published
- 2022
- Full Text
- View/download PDF
7. USMPep: universal sequence models for major histocompatibility complex binding affinity prediction
- Author
-
Johanna Vielhaben, Markus Wenzel, Wojciech Samek, and Nils Strodthoff
- Subjects
Major histocompatibility complex, Binding affinity prediction, Peptide data, Recurrent neural networks, Language modeling, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
- Abstract
Background Immunotherapy is a promising route towards personalized cancer treatment. A key algorithmic challenge in this process is to decide whether a given peptide (neoepitope) binds with the major histocompatibility complex (MHC). This is an active area of research, and many MHC binding prediction algorithms can predict the MHC binding affinity for a given peptide to a high degree of accuracy. However, most of the state-of-the-art approaches make use of complicated training and model selection procedures, are restricted to peptides of a certain length, and/or rely on heuristics. Results We put forward USMPep, a simple recurrent neural network that matches state-of-the-art approaches on MHC class I binding prediction with a single, generic architecture and even a single set of hyperparameters, both on IEDB benchmark datasets and on the very recent HPV dataset. Moreover, even a single model trained from scratch is competitive, while ensembling multiple regressors and language model pretraining can still slightly improve performance. The direct application of the approach to MHC class II binding prediction shows solid performance despite limited training data. Conclusions We demonstrate that competitive performance in MHC binding affinity prediction can be reached with a standard architecture and training procedure without relying on any heuristics.
- Published
- 2020
- Full Text
- View/download PDF
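At its core, a USMPep-style predictor is a recurrent regressor over peptides. Below is a minimal sketch under my own assumptions (layer sizes, vocabulary handling); it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class PeptideAffinityRNN(nn.Module):
    def __init__(self, vocab=21, emb=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)        # 20 amino acids + padding
        self.rnn = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)              # scalar affinity head

    def forward(self, peptides):                     # (batch, length) integer codes
        _, (h, _) = self.rnn(self.embed(peptides))
        return self.out(h[-1]).squeeze(-1)           # one affinity per peptide

model = PeptideAffinityRNN()
toy = torch.randint(1, 21, (4, 9))   # four 9-mers, a typical MHC class I length
print(model(toy).shape)              # torch.Size([4])
```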
8. Modeling aspects of the language of life through transfer-learning protein sequences
- Author
-
Michael Heinzinger, Ahmed Elnaggar, Yu Wang, Christian Dallago, Dmitrii Nechaev, Florian Matthes, and Burkhard Rost
- Subjects
Machine Learning, Language Modeling, Sequence Embedding, Secondary structure prediction, Localization prediction, Transfer Learning, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
- Abstract
Background Predicting protein function and structure from sequence is one important challenge for computational biology. For 26 years, most state-of-the-art approaches have combined machine learning and evolutionary information. However, for some applications retrieving related proteins is becoming too time-consuming. Additionally, evolutionary information is less powerful for small families, e.g. for proteins from the Dark Proteome. Both of these problems are addressed by the new methodology introduced here. Results We introduced a novel way to represent protein sequences as continuous vectors (embeddings) by using the language model ELMo taken from natural language processing. By modeling protein sequences, ELMo effectively captured the biophysical properties of the language of life from unlabeled big data (UniRef50). We refer to these new embeddings as SeqVec (Sequence-to-Vector) and demonstrate their effectiveness by training simple neural networks for two different tasks. At the per-residue level, secondary structure (Q3 = 79% ± 1, Q8 = 68% ± 1) and regions with intrinsic disorder (MCC = 0.59 ± 0.03) were predicted significantly better than through one-hot encoding or through Word2vec-like approaches. At the per-protein level, subcellular localization was predicted in ten classes (Q10 = 68% ± 1), and membrane-bound proteins were distinguished from water-soluble ones (Q2 = 87% ± 1). Although SeqVec embeddings generated the best predictions from single sequences, no solution improved over the best existing method using evolutionary information. Nevertheless, our approach improved over some popular methods using evolutionary information, and for some proteins it even beat the best. Thus, the embeddings prove to condense the underlying principles of protein sequences. Overall, the important novelty is speed: where the lightning-fast HHblits needed on average about two minutes to generate the evolutionary information for a target protein, SeqVec created embeddings in 0.03 s on average. As this speed-up is independent of the size of growing sequence databases, SeqVec provides a highly scalable approach for the analysis of big data in proteomics, e.g. microbiome or metaproteome analysis. Conclusion Transfer learning succeeded in extracting information from unlabeled sequence databases relevant for various protein prediction tasks. SeqVec modeled the language of life, namely the principles underlying protein sequences, better than any features suggested by textbooks and prediction methods. The exception is evolutionary information; however, that information is not available at the level of a single sequence.
- Published
- 2019
- Full Text
- View/download PDF
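The "simple neural networks" trained on SeqVec embeddings can be illustrated as a small per-residue head mapping each 1024-dimensional embedding to three secondary-structure states; the random tensors below stand in for real ELMo/SeqVec outputs, and the head size is an assumption.

```python
import torch
import torch.nn as nn

per_residue = torch.randn(1, 120, 1024)   # stand-in SeqVec embeddings (batch, length, dim)
head = nn.Sequential(nn.Linear(1024, 32), nn.ReLU(), nn.Linear(32, 3))
q3_logits = head(per_residue)             # (1, 120, 3): helix / strand / coil
print(q3_logits.argmax(-1)[0, :10])       # predicted states for the first 10 residues
```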
9. USMPep: universal sequence models for major histocompatibility complex binding affinity prediction
- Author
-
Vielhaben, Johanna, Wenzel, Markus, Samek, Wojciech, and Strodthoff, Nils
- Published
- 2020
- Full Text
- View/download PDF
10. Modeling aspects of the language of life through transfer-learning protein sequences
- Author
-
Heinzinger, Michael, Elnaggar, Ahmed, Wang, Yu, Dallago, Christian, Nechaev, Dmitrii, Matthes, Florian, and Rost, Burkhard
- Published
- 2019
- Full Text
- View/download PDF
11. NanoBERTa-ASP: predicting nanobody paratope based on a pretrained RoBERTa model.
- Author
-
Li, Shangru, Meng, Xiangpeng, Li, Rui, Huang, Bingding, and Wang, Xin
- Abstract
Background: Nanobodies, also known as VHH or single-domain antibodies, are unique antibody fragments derived solely from heavy chains. They offer the advantages of both small molecules and conventional antibodies, making them promising therapeutics. The paratope is the specific region on an antibody that binds to an antigen. Paratope prediction involves the identification and characterization of the antigen-binding site on an antibody. This process is crucial for understanding the specificity and affinity of antibody-antigen interactions. Various computational methods and experimental approaches have been developed to predict and analyze paratopes, contributing to advancements in antibody engineering, drug development, and immunotherapy. However, existing predictive models trained on traditional antibodies may not be suitable for nanobodies. Additionally, the limited availability of nanobody datasets poses challenges in constructing accurate models. Methods: To address these challenges, we have developed a novel nanobody prediction model, named NanoBERTa-ASP (Antibody Specificity Prediction), which is specifically designed for predicting nanobody-antigen binding sites. The model adopts a training strategy more suitable for nanobodies, based on an advanced natural language processing (NLP) model called BERT (Bidirectional Encoder Representations from Transformers). More specifically, the model utilizes a masked language modeling approach named RoBERTa (Robustly Optimized BERT Pretraining Approach) to learn the contextual information of the nanobody sequence and predict its binding site. Results: NanoBERTa-ASP achieved exceptional performance in predicting nanobody binding sites, outperforming existing methods, indicating its proficiency in capturing sequence information specific to nanobodies and accurately identifying their binding sites. Furthermore, NanoBERTa-ASP provides insights into the interaction mechanisms between nanobodies and antigens, contributing to a better understanding of nanobodies and facilitating the design and development of nanobodies with therapeutic potential. Conclusion: NanoBERTa-ASP represents a significant advancement in nanobody paratope prediction. Its superior performance highlights the potential of deep learning approaches in nanobody research. By leveraging the increasing volume of nanobody data, NanoBERTa-ASP can further refine its predictions, enhance its performance, and contribute to the development of novel nanobody-based therapeutics. GitHub repository: https://github.com/WangLabforComputationalBiology/NanoBERTa-ASP [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
12. A multitask transfer learning framework for the prediction of virus-human protein–protein interactions.
- Author
-
Dong, Thi Ngan, Brogden, Graham, Gerold, Gisa, and Khosla, Megha
- Subjects
PROTEIN-protein interactions, SARS-CoV-2, AMINO acid sequence, VIRUS diseases, VIRAL proteins, DNA-binding proteins
- Abstract
Background: Viral infections are causing significant morbidity and mortality worldwide. Understanding the interaction patterns between a particular virus and human proteins plays a crucial role in unveiling the underlying mechanism of viral infection and pathogenesis. This could further help in the prevention and treatment of virus-related diseases. However, the task of predicting protein–protein interactions between a new virus and human cells is extremely challenging due to scarce data on virus-human interactions and the fast mutation rates of most viruses. Results: We developed a multitask transfer learning approach that exploits the information of around 24 million protein sequences and the interaction patterns from the human interactome to counter the problem of small training datasets. Instead of using hand-crafted protein features, we utilize statistically rich protein representations learned by a deep language modeling approach from a massive source of protein sequences. Additionally, we employ a second objective that aims to maximize the probability of observing human protein–protein interactions. This additional task objective acts as a regularizer and also allows us to incorporate domain knowledge to inform the virus-human protein–protein interaction prediction model. Conclusions: Our approach achieved competitive results on 13 benchmark datasets and the case study for the SARS-CoV-2 virus receptor. Experimental results show that our proposed model works effectively for both virus-human and bacteria-human protein–protein interaction prediction tasks. We share our code for reproducibility and future research at https://git.l3s.uni-hannover.de/dong/multitask-transfer. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
13. DeepBP: Ensemble deep learning strategy for bioactive peptide prediction.
- Author
-
Zhang, Ming, Zhou, Jianren, Wang, Xiaohua, Wang, Xun, and Ge, Fang
- Subjects
GENERATIVE adversarial networks, LANGUAGE models, DRUG discovery, CONVOLUTIONAL neural networks, FOOD science, DEEP learning, ANGIOTENSIN converting enzyme
- Abstract
Background: Bioactive peptides are important bioactive molecules composed of short-chain amino acids that play various crucial roles in the body, such as regulating physiological processes and promoting immune responses and antibacterial effects. Due to their significance, bioactive peptides have broad application potential in drug development, food science, and biotechnology. In particular, understanding their biological mechanisms will contribute new ideas for drug discovery and disease treatment. Results: This study employs generative adversarial capsule networks (CapsuleGAN), gated recurrent units (GRU), and convolutional neural networks (CNN) as base classifiers and combines them through voting-based ensemble learning, achieving high-precision predictions and effective model performance on both the angiotensin-converting enzyme (ACE) inhibitory peptide dataset and the anticancer peptide (ACP) dataset. For this method, we first utilized the protein language model—evolutionary scale modeling (ESM-2)—to extract relevant features for the ACE inhibitory peptide and ACP datasets. Following feature extraction, we trained the three deep learning models—CapsuleGAN, GRU, and CNN—while continuously adjusting the model parameters throughout the training process. Finally, during the voting stage, different weights were assigned to the models based on their prediction accuracy, allowing full use of each model's strengths. Experimental results show that on the ACE inhibitory peptide dataset, the balanced accuracy is 0.926, the Matthews correlation coefficient (MCC) is 0.831, and the area under the curve is 0.966; on the ACP dataset, the accuracy (ACC) is 0.779 and the MCC is 0.558. The results on both datasets are superior to existing methods, demonstrating the effectiveness of the experimental approach. Conclusion: In this study, CapsuleGAN, GRU, and CNN were successfully employed as base classifiers to implement ensemble learning, which not only achieved good results on both datasets but also surpassed existing methods. The ability to predict peptides with strong ACE inhibitory activity and ACPs more accurately and quickly is significant, and this work provides valuable insights for predicting other functional peptides. The source code and dataset for this experiment are publicly available at https://github.com/Zhou-Jianren/bioactive-peptides. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
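The weighted soft-voting step described in the DeepBP abstract reduces to combining each base model's predicted class probabilities with accuracy-derived weights. A toy sketch; the probability vectors and weights below are hypothetical, not values from the paper:

```python
import numpy as np

probs = [np.array([0.9, 0.1]),            # hypothetical CapsuleGAN probabilities
         np.array([0.6, 0.4]),            # hypothetical GRU probabilities
         np.array([0.3, 0.7])]            # hypothetical CNN probabilities
weights = np.array([0.92, 0.85, 0.80])    # e.g. each model's validation accuracy
weights = weights / weights.sum()         # normalize to a convex combination
ensemble = sum(w * p for w, p in zip(weights, probs))
print("weighted probabilities:", ensemble, "-> class", int(ensemble.argmax()))
```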
14. N-gram analysis of 970 microbial organisms reveals presence of biological language models.
- Author
-
Osmanbeyoglu, Hatice Ulku and Ganapathiraju, Madhavi K.
- Subjects
BIOLOGICAL models, GENOMES, AMINO acids, ESCHERICHIA coli, PHYLOGENY
- Abstract
Background: It has been suggested previously that genome and proteome sequences show characteristics typical of natural-language texts, such as "signature-style" word usage indicative of authors or topics, and that algorithms originally developed for natural language processing may therefore be applied to genome sequences to draw biologically relevant conclusions. Following this approach of 'biological language modeling', statistical n-gram analysis has been applied for comparative analysis of whole proteome sequences of 44 organisms. It has been shown that a few particular amino acid n-grams are found in abundance in one organism but occur very rarely in other organisms, thereby serving as genome signatures. At that time, proteomes of only 44 organisms were available, limiting the generalization of this hypothesis. Today nearly 1,000 genome sequences and corresponding translated sequences are available, making it feasible to test the existence of biological language models across the evolutionary tree. Results: We studied whole proteome sequences of 970 microbial organisms using n-gram frequencies and cross-perplexity, employing the Biological Language Modeling Toolkit and the Patternix Revelio toolkit. Genus-specific signatures were observed even in a simple unigram distribution. By taking the statistical n-gram model of one organism as a reference and computing the cross-perplexity of all other microbial proteomes against it, cross-perplexity was found to be predictive of branch distance in the phylogenetic tree. For example, a 4-gram model from the proteome of Shigella flexneri 2a, which belongs to the Gammaproteobacteria class, showed a self-perplexity of 15.34, while the cross-perplexity of other organisms was in the range of 15.59 to 29.5 and was proportional to their branching distance from S. flexneri in the evolutionary tree. The organisms of this genus, which happen to be pathotypes of E. coli, also have the closest perplexity values to E. coli. Conclusion: Whole proteome sequences of microbial organisms have been shown to contain particular n-gram sequences in abundance in one organism but occurring very rarely in other organisms, thereby serving as proteome signatures. Further, it has been shown that perplexity, a statistical measure of similarity of n-gram composition, can be used to predict evolutionary distance within a genus in the phylogenetic tree. [ABSTRACT FROM AUTHOR]
- Published
- 2011
- Full Text
- View/download PDF
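The study's core statistic, the cross-perplexity of one proteome under another proteome's n-gram model, can be sketched for the unigram case as follows. This is my own minimal version with add-one smoothing, not the Biological Language Modeling Toolkit; the toy sequences are illustrative.

```python
import math
from collections import Counter

def unigram_model(proteome, alpha=1.0):
    counts = Counter("".join(proteome))
    total, v = sum(counts.values()), len(counts)
    # add-one smoothing over the observed alphabet
    return lambda aa: (counts[aa] + alpha) / (total + alpha * v)

def cross_perplexity(model, proteome):
    seq = "".join(proteome)
    log_prob = sum(math.log2(model(aa)) for aa in seq)
    return 2 ** (-log_prob / len(seq))     # lower = more similar composition

ref = ["MKTAYIAKQR", "GAVLILLFAG"]     # toy "reference organism" proteome
qry = ["MSTNPKPQRK", "TVVGGGYYAA"]     # toy "query organism" proteome
print(cross_perplexity(unigram_model(ref), qry))
```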
15. Can large language models understand molecules?
- Author
-
Sadeghi, Shaghayegh, Bui, Alan, Forooghi, Ali, Lu, Jianguo, and Ngom, Alioune
- Subjects
LANGUAGE models, GENERATIVE pre-trained transformers, DRUG interactions
- Abstract
Purpose: Large Language Models (LLMs) like Generative Pre-trained Transformer (GPT) from OpenAI and LLaMA (Large Language Model Meta AI) from Meta AI are increasingly recognized for their potential in the field of cheminformatics, particularly in understanding the Simplified Molecular Input Line Entry System (SMILES), a standard method for representing chemical structures. These LLMs can also decode SMILES strings into vector representations. Method: We investigate how GPT and LLaMA compare with models pre-trained on SMILES at embedding SMILES strings for downstream tasks, focusing on two key applications: molecular property prediction and drug–drug interaction (DDI) prediction. Results: We find that SMILES embeddings generated using LLaMA outperform those from GPT in both molecular property and DDI prediction tasks. Notably, LLaMA-based SMILES embeddings show results comparable to models pre-trained on SMILES in molecular property prediction tasks and outperform the pre-trained models for DDI prediction tasks. Conclusion: The performance of LLMs in generating SMILES embeddings shows great potential for further investigation of these models for molecular embedding. We hope our study bridges the gap between LLMs and molecular embedding, motivating additional research into the potential of LLMs in the molecular representation field. GitHub: https://github.com/sshaghayeghs/LLaMA-VS-GPT. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
16. NanoBERTa-ASP: predicting nanobody paratope based on a pretrained RoBERTa model
- Author
-
Shangru Li, Xiangpeng Meng, Rui Li, Bingding Huang, and Xin Wang
- Subjects
Nanobodies, RoBERTa, Pretrained model, Prediction of binding sites, Antibody engineering, Transfer learning, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
- Abstract
Background Nanobodies, also known as VHH or single-domain antibodies, are unique antibody fragments derived solely from heavy chains. They offer the advantages of both small molecules and conventional antibodies, making them promising therapeutics. The paratope is the specific region on an antibody that binds to an antigen. Paratope prediction involves the identification and characterization of the antigen-binding site on an antibody. This process is crucial for understanding the specificity and affinity of antibody-antigen interactions. Various computational methods and experimental approaches have been developed to predict and analyze paratopes, contributing to advancements in antibody engineering, drug development, and immunotherapy. However, existing predictive models trained on traditional antibodies may not be suitable for nanobodies. Additionally, the limited availability of nanobody datasets poses challenges in constructing accurate models. Methods To address these challenges, we have developed a novel nanobody prediction model, named NanoBERTa-ASP (Antibody Specificity Prediction), which is specifically designed for predicting nanobody-antigen binding sites. The model adopts a training strategy more suitable for nanobodies, based on an advanced natural language processing (NLP) model called BERT (Bidirectional Encoder Representations from Transformers). More specifically, the model utilizes a masked language modeling approach named RoBERTa (Robustly Optimized BERT Pretraining Approach) to learn the contextual information of the nanobody sequence and predict its binding site. Results NanoBERTa-ASP achieved exceptional performance in predicting nanobody binding sites, outperforming existing methods, indicating its proficiency in capturing sequence information specific to nanobodies and accurately identifying their binding sites. Furthermore, NanoBERTa-ASP provides insights into the interaction mechanisms between nanobodies and antigens, contributing to a better understanding of nanobodies and facilitating the design and development of nanobodies with therapeutic potential. Conclusion NanoBERTa-ASP represents a significant advancement in nanobody paratope prediction. Its superior performance highlights the potential of deep learning approaches in nanobody research. By leveraging the increasing volume of nanobody data, NanoBERTa-ASP can further refine its predictions, enhance its performance, and contribute to the development of novel nanobody-based therapeutics. GitHub repository: https://github.com/WangLabforComputationalBiology/NanoBERTa-ASP
- Published
- 2024
- Full Text
- View/download PDF
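Downstream of the RoBERTa-style encoder, paratope prediction as described above amounts to per-residue binary classification. A minimal sketch under my own assumptions (hidden size, decision threshold), with random tensors standing in for real encoder states:

```python
import torch
import torch.nn as nn

hidden_states = torch.randn(1, 120, 768)  # stand-in encoder output, one state per residue
paratope_head = nn.Linear(768, 2)         # binding vs. non-binding residue
probs = torch.softmax(paratope_head(hidden_states), dim=-1)[..., 1]
print((probs > 0.5).sum().item(), "residues predicted in the paratope")
```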
17. Using protein language models for protein interaction hot spot prediction with limited data.
- Author
-
Sargsyan, Karen and Lim, Carmay
- Subjects
LANGUAGE models, PROTEIN models, PROTEIN-protein interactions, AMINO acid sequence, PROTEIN structure, ANTIBIOTIC residues, MACHINE learning
- Abstract
Background: Protein language models, inspired by the success of large language models in deciphering human language, have emerged as powerful tools for unraveling the intricate code of life inscribed within protein sequences. They have gained significant attention for their promising applications across various areas, including the sequence-based prediction of secondary and tertiary protein structure, the discovery of new functional protein sequences/folds, and the assessment of mutational impact on protein fitness. However, their utility in learning to predict protein residue properties from scant datasets, such as protein–protein interaction (PPI)-hotspots whose mutations significantly impair PPIs, remained unclear. Here, we explore the feasibility of using protein language-learned representations as features for machine learning to predict PPI-hotspots using a dataset containing 414 experimentally confirmed PPI-hotspots and 504 PPI-nonhotspots. Results: Our findings showcase the capacity of unsupervised learning with protein language models to capture critical functional attributes of protein residues derived from the evolutionary information encoded within amino acid sequences. We show that methods relying on protein language models can compete with methods employing sequence- and structure-based features to predict PPI-hotspots from the free protein structure. We observed an optimal number of features for model precision, suggesting a balance between information and overfitting. Conclusions: This study underscores the potential of transformer-based protein language models to extract critical knowledge from sparse datasets, exemplified here by the challenging realm of predicting PPI-hotspots. These models offer a cost-effective and time-efficient alternative to traditional experimental methods for predicting certain residue properties. However, the challenge of explaining why specific features are important for determining certain residue properties remains. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
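The workflow described above, language-model embeddings as features for a conventional classifier, can be sketched as follows. Only the dataset sizes (414 hotspots, 504 non-hotspots) come from the abstract; the embedding dimension, classifier choice, and random feature matrix are my assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(918, 1280))      # stand-in embeddings for 414 + 504 residues
y = np.array([1] * 414 + [0] * 504)   # 1 = PPI-hotspot, 0 = non-hotspot
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xtr, ytr)
print("held-out accuracy:", clf.score(Xte, yte))   # ~0.5 on random stand-ins
```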
18. GPDminer: a tool for extracting named entities and analyzing relations in biological literature.
- Author
-
Park, Yeon-Ji, Yang, Geun-Je, Sohn, Chae-Bong, and Park, Soo Jun
- Subjects
KNOWLEDGE acquisition (Expert systems), DATA mining, TEXT mining, DATABASES, INFORMATION retrieval, RESEARCH personnel
- Abstract
Purpose: The expansion of research across various disciplines has led to a substantial increase in published papers and journals, highlighting the necessity for reliable text mining platforms for database construction and knowledge acquisition. This abstract introduces GPDMiner (Gene, Protein, and Disease Miner), a platform designed for the biomedical domain, addressing the challenges posed by the growing volume of academic papers. Methods: GPDMiner is a text mining platform that utilizes advanced information retrieval techniques. It operates by searching PubMed for specific queries, extracting and analyzing information relevant to the biomedical field. The system is designed to discern and illustrate relationships between biomedical entities obtained through automated information extraction. Results: The implementation of GPDMiner demonstrates its efficacy in navigating the extensive corpus of biomedical literature. It efficiently retrieves, extracts, and analyzes information, highlighting significant connections between genes, proteins, and diseases. The platform also allows users to save their analytical outcomes in various formats, including Excel and images. Conclusion: GPDMiner offers notable additional functionality among the array of text mining tools available for the biomedical field. This tool presents an effective solution for researchers to navigate and extract relevant information from the vast unstructured texts found in biomedical literature, thereby providing distinctive capabilities that set it apart from existing methodologies. Its application is expected to greatly benefit researchers in this domain, enhancing their capacity for knowledge discovery and data management. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
19. Protein embedding based alignment.
- Author
-
Iovino, Benjamin Giovanni and Ye, Yuzhen
- Abstract
Purpose: Despite much progress in alignment algorithms, aligning divergent protein sequences with less than 20–35% pairwise identity (the so-called "twilight zone") remains a difficult problem. Many alignment algorithms have used substitution matrices since their creation in the 1970s to generate alignments; however, these matrices do not work well for scoring alignments within the twilight zone. We developed Protein Embedding based Alignments, or PEbA, to better align sequences with low pairwise identity. Similar to the traditional Smith-Waterman algorithm, PEbA uses dynamic programming, but the match score for a pair of amino acids is the similarity of their embeddings from a protein language model. Methods: We tested PEbA on over twelve thousand benchmark pairwise alignments from BAliBASE, each extracted from one of its multiple sequence alignments. Five different BAliBASE references were used, each with different sequence identities, motifs, and lengths, allowing PEbA to showcase how well it aligns under different circumstances. Results: PEbA greatly outperformed BLOSUM substitution matrix-based pairwise alignments, with the degree of improvement in alignment quality varying with sequence similarity (over four times better for pairs of sequences with <10% identity). We also compared PEbA with embeddings generated by different protein language models (ProtT5 and ESM-2) and found that ProtT5-XL-U50 produced the most useful embeddings for aligning protein sequences. PEbA also outperformed DEDAL and vcMSA, two recently developed protein language model embedding-based alignment methods. Conclusion: Our results suggest that general-purpose protein language models provide useful contextual information for generating more accurate protein alignments than typically used methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
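The central idea of PEbA, swapping a substitution matrix for embedding similarity inside classical dynamic programming, can be sketched as a Smith-Waterman variant whose match score is the cosine similarity of per-residue embeddings. Gap handling here is simplified to a single linear penalty and no traceback is done, so this is an illustration of the technique, not the published algorithm.

```python
import numpy as np

def embedding_smith_waterman(emb_a, emb_b, gap=-1.0):
    # emb_a: (m, d), emb_b: (n, d) per-residue embeddings
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T                          # cosine similarity as the match score
    m, n = sim.shape
    H = np.zeros((m + 1, n + 1))
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            H[i, j] = max(0.0,
                          H[i - 1, j - 1] + sim[i - 1, j - 1],  # (mis)match
                          H[i - 1, j] + gap,                    # gap in b
                          H[i, j - 1] + gap)                    # gap in a
    return H.max()                         # best local alignment score

print(embedding_smith_waterman(np.random.rand(8, 16), np.random.rand(6, 16)))
```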
20. Proformer: a hybrid macaron transformer model predicts expression values from promoter sequences.
- Author
-
Kwak, Il-Youp, Kim, Byeong-Chan, Lee, Juhyun, Kang, Taein, Garry, Daniel J., Zhang, Jianyi, and Gong, Wuming
- Subjects
TRANSFORMER models, DNA sequencing
- Abstract
The breakthrough high-throughput measurement of the cis-regulatory activity of millions of randomly generated promoters provides an unprecedented opportunity to systematically decode the cis-regulatory logic that determines expression values. We developed an end-to-end transformer encoder architecture named Proformer to predict expression values from DNA sequences. Proformer used a Macaron-like Transformer encoder architecture, where two half-step feed-forward (FFN) layers were placed at the beginning and the end of each encoder block, and a separable 1D convolution layer was inserted after the first FFN layer and in front of the multi-head attention layer. The sliding k-mers from one-hot encoded sequences were mapped onto a continuous embedding, combined with the learned positional embedding and a strand embedding (forward strand vs. reverse-complemented strand) as the sequence input. Moreover, Proformer introduced multiple expression heads with mask filling to prevent the transformer models from collapsing when training on relatively small amounts of data. We empirically determined that this design performed significantly better than conventional designs such as using a global pooling layer as the output layer for the regression task. These analyses support the notion that Proformer provides a novel method of learning and enhances our understanding of how cis-regulatory sequences determine expression values. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
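The Macaron-style block described in the abstract, half-step FFNs bracketing attention with a separable 1D convolution in between, can be sketched as follows. Normalization placement, layer sizes, and head count are my assumptions rather than the released architecture.

```python
import torch
import torch.nn as nn

class MacaronBlock(nn.Module):
    def __init__(self, d=256, heads=4):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        # depthwise + pointwise = separable 1D convolution along the sequence
        self.conv = nn.Sequential(nn.Conv1d(d, d, 3, padding=1, groups=d),
                                  nn.Conv1d(d, d, 1))
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norms = nn.ModuleList(nn.LayerNorm(d) for _ in range(4))
        self.ffn2 = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

    def forward(self, x):                                  # x: (batch, seq_len, d)
        x = x + 0.5 * self.ffn1(self.norms[0](x))          # first half-step FFN
        h = self.norms[1](x).transpose(1, 2)               # (batch, d, seq_len)
        x = x + self.conv(h).transpose(1, 2)               # separable convolution
        q = self.norms[2](x)
        x = x + self.attn(q, q, q, need_weights=False)[0]  # multi-head attention
        x = x + 0.5 * self.ffn2(self.norms[3](x))          # second half-step FFN
        return x

print(MacaronBlock()(torch.randn(2, 50, 256)).shape)       # torch.Size([2, 50, 256])
```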
21. Learning self-supervised molecular representations for drug–drug interaction prediction.
- Author
-
Kpanou, Rogia, Dallaire, Patrick, Rousseau, Elsa, and Corbeil, Jacques
- Subjects
DRUG interactions, MACHINE learning, ARTIFICIAL neural networks, SUPERVISED learning, COMPUTER vision
- Abstract
Drug–drug interactions (DDI) are a critical concern in healthcare due to their potential to cause adverse effects and compromise patient safety. Supervised machine learning models for DDI prediction need to be optimized to learn abstract, transferable features and generalize to larger chemical spaces, primarily due to the scarcity of high-quality labeled DDI data. Inspired by recent advances in computer vision, we present SMR–DDI, a self-supervised framework that leverages contrastive learning to embed drugs into a scaffold-based feature space. Molecular scaffolds represent the core structural motifs that drive pharmacological activities, making them valuable for learning informative representations. Specifically, we pre-trained SMR–DDI on a large-scale unlabeled molecular dataset. We generated augmented views for each molecule via SMILES enumeration and optimized the embedding process through contrastive loss minimization between views. This enables the model to capture relevant and robust molecular features while reducing noise. We then transfer the learned representations to the downstream prediction of DDIs. Experiments show that the new feature space has comparable expressivity to state-of-the-art molecular representations and achieves competitive DDI prediction results while training on less data. Additional investigations also revealed that pre-training on more extensive and diverse unlabeled molecular datasets improved the model's capability to embed molecules. Our results highlight contrastive learning as a promising approach for DDI prediction that can identify potentially hazardous drug combinations using only structural information. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
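The contrastive pre-training described above is typically an InfoNCE/NT-Xent-style loss between two augmented views of each molecule (here, two SMILES enumerations after encoding). A generic sketch of such a loss, not the authors' exact formulation:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.5):
    # z1, z2: (batch, dim) embeddings of two SMILES-enumerated views per molecule
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature      # similarity of every view pair
    targets = torch.arange(z1.size(0))    # view i of molecule i matches view i
    return F.cross_entropy(logits, targets)

# stand-in encoder outputs for a batch of 8 molecules
print(info_nce(torch.randn(8, 128), torch.randn(8, 128)))
```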
22. Tpgen: a language model for stable protein design with a specific topology structure.
- Author
-
Min, Xiaoping, Yang, Chongzhou, Xie, Jun, Huang, Yang, Liu, Nan, Jin, Xiaocheng, Wang, Tianshu, Kong, Zhibo, Lu, Xiaoli, Ge, Shengxiang, Zhang, Jun, and Xia, Ningshao
- Subjects
LANGUAGE models, PROTEIN engineering, SYNTHETIC proteins, PROTEIN models, REINFORCEMENT learning, AMINO acid sequence
- Abstract
Background: Natural proteins occupy a small portion of the protein sequence space, whereas artificial proteins can explore a wider range of possibilities within the sequence space. However, specific requirements may not be met when generating sequences blindly. Research indicates that small proteins have notable advantages, including high stability, accurate resolution prediction, and facile specificity modification. Results: This study involves the construction of a neural network model named TopoProGenerator (TPGen) using a transformer decoder. The model is trained with sequences consisting of a maximum of 65 amino acids. The training process of TopoProGenerator incorporates reinforcement learning and adversarial learning for fine-tuning. Additionally, it encompasses a stability prediction model trained on a dataset comprising over 200,000 sequences. The results demonstrate that TopoProGenerator is capable of designing stable small protein sequences with specified topology structures. Conclusion: TPGen has the ability to generate protein sequences that fold into the specified topology, and the pretraining and fine-tuning methods proposed in this study can serve as a framework for designing various types of proteins. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
23. Advancing drug–target interaction prediction: a comprehensive graph-based approach integrating knowledge graph embedding and ProtBert pretraining.
- Author
-
Djeddi, Warith Eddine, Hermi, Khalil, Ben Yahia, Sadok, and Diallo, Gayo
- Subjects
KNOWLEDGE graphs, LANGUAGE models, COSINE function, EUCLIDEAN distance, DRUG discovery, AMINO acid sequence
- Abstract
Background: The pharmaceutical field faces a significant challenge in validating drug–target interactions (DTIs) due to the time and cost involved, leading to only a fraction being experimentally verified. To expedite drug discovery, accurate computational methods are essential for predicting potential interactions. Recently, machine learning techniques, particularly graph-based methods, have gained prominence. These methods utilize networks of drugs and targets, employing knowledge graph embedding (KGE) to represent structured information from knowledge graphs in a continuous vector space. This trend highlights the growing use of graph topologies to improve the precision of DTI prediction, addressing the pressing need for effective computational methodologies in drug discovery. Results: This study presents a novel approach called DTIOG for the prediction of DTIs. The method combines a KGE strategy with contextual information from protein sequences, obtained using Protein Bidirectional Encoder Representations from Transformers (ProtBERT). DTIOG utilizes a two-step process to compute embedding vectors using KGE techniques and employs ProtBERT to determine target–target similarity. Different similarity measures, such as cosine similarity or Euclidean distance, are utilized in the prediction procedure. In addition to the contextual embedding, the approach incorporates local representations obtained from the Simplified Molecular Input Line Entry Specification (SMILES) of drugs and the amino acid sequences of protein targets. Conclusions: The effectiveness of the proposed approach was assessed through extensive experimentation on datasets pertaining to enzymes, ion channels, and G-protein-coupled receptors. The efficacy of DTIOG was showcased using diverse similarity measures to calculate the similarities between drugs and targets. The combination of these factors, along with the incorporation of various classifiers, enabled the model to outperform existing algorithms in predicting DTIs. The consistent observation of this advantage across all datasets underlines the robustness and accuracy of DTIOG. Additionally, our case study suggests that DTIOG can serve as a valuable tool for discovering new DTIs. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
24. Machine learning-based approaches for ubiquitination site prediction in human proteins.
- Author
-
Pourmirzaei, Mahdi, Ramazi, Shahin, Esmaili, Farzaneh, Shojaeilangari, Seyedehsamaneh, and Allahvardi, Abdollah
- Subjects
UBIQUITINATION, POST-translational modification, AMINO acid sequence, PROTEINS, DEEP learning
- Abstract
Protein ubiquitination is a critical post-translational modification (PTM) involved in numerous cellular processes. Identifying ubiquitination sites (Ubi-sites) on proteins offers valuable insights into their function and regulatory mechanisms. Due to the cost- and time-consuming nature of traditional approaches for Ubi-site detection, there has been growing interest in leveraging artificial intelligence for computer-aided Ubi-site prediction. In this study, we collected experimentally verified Ubi-sites of human proteins from the dbPTM database, then benchmarked comprehensive state-of-the-art computational methods along with standard evaluation metrics and a proper validation strategy for Ubi-site prediction. We demonstrated the effectiveness of our framework by comparing ten machine learning (ML)-based approaches in three different categories: feature-based conventional ML methods, end-to-end sequence-based deep learning (DL) techniques, and hybrid feature-based DL models. Our results revealed that DL approaches outperformed the classical ML methods; the best performance, a 0.902 F1-score, 0.8198 accuracy, 0.8786 precision, and 0.9147 recall, was achieved by a DL model using both raw amino acid sequences and hand-crafted features. Interestingly, our experimental results disclosed that the performance of DL methods was positively correlated with the length of the amino acid fragments, suggesting that utilizing the entire sequence can lead to more accurate predictions in future research endeavors. Additionally, we developed a meticulously curated benchmark for Ubi-site prediction in human proteins. This benchmark serves as a valuable resource for future studies, enabling fair and accurate comparisons between different methods. Overall, our work highlights the potential of ML, particularly DL techniques, in predicting Ubi-sites and furthering our knowledge of protein regulation through ubiquitination in cells. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
25. AptaTrans: a deep neural network for predicting aptamer-protein interaction using pretrained encoders.
- Author
-
Shin, Incheol, Kang, Keumseok, Kim, Juseong, Sel, Sanghun, Choi, Jeonghoon, Lee, Jae-Wook, Kang, Ho Young, and Song, Giltae
- Subjects
APTAMERS, DRUG discovery, TERTIARY structure, SINGLE-stranded DNA, DEEP learning, TRANSFORMER models
- Abstract
Background: Aptamers, which are biomaterials composed of single-stranded DNA/RNA that form tertiary structures, have significant potential as next-generation materials, particularly for drug discovery. The systematic evolution of ligands by exponential enrichment (SELEX) method is a critical in vitro technique employed to identify aptamers that bind specifically to target proteins. While advanced SELEX-based methods such as Cell- and HT-SELEX are available, they often encounter issues such as extended time consumption and suboptimal accuracy. Several in silico aptamer discovery methods have been proposed to address these challenges. These methods are specifically designed to predict aptamer-protein interaction (API) using benchmark datasets. However, they often fail to consider the physicochemical interactions between aptamers and proteins within tertiary structures. Results: In this study, we propose AptaTrans, a pipeline for predicting API using deep learning techniques. AptaTrans uses transformer-based encoders to handle aptamer and protein sequences at the monomer level. Furthermore, pretrained encoders are utilized for the structural representation. After validation with a benchmark dataset, AptaTrans has been integrated into a comprehensive toolset. This pipeline combines synergistically with Apta-MCTS, a generative algorithm for recommending aptamer candidates. Conclusion: The results show that AptaTrans outperforms existing models for predicting API, and the efficacy of the AptaTrans pipeline has been confirmed through various experimental tools. We expect AptaTrans will enhance the cost-effectiveness and efficiency of SELEX in drug discovery. The source code and benchmark dataset for AptaTrans are available at https://github.com/pnumlb/AptaTrans. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
26. Searching for protein variants with desired properties using deep generative models.
- Author
-
Li, Yan, Yao, Yinying, Xia, Yu, and Tang, Mingjing
- Subjects
DEEP learning, AMINO acid sequence, PROTEIN engineering, DECODERS & decoding, LEARNING ability, PROTEINS
- Abstract
Background: Protein engineering aims to improve the functional properties of existing proteins to meet people's needs. Current deep learning-based models have captured evolutionary, functional, and biochemical features contained in amino acid sequences. However, existing generative models need to be improved when capturing the relationship between amino acid sites on longer sequences. At the same time, the distribution of protein sequences within a homologous family has a specific positional relationship in the latent space. We want to use this relationship to search for new variants directly in the vicinity of better-performing variants. Results: To improve the representation learning ability of the model for longer sequences and the similarity between the generated sequences and the original sequences, we propose a temporal variational autoencoder (T-VAE) model. T-VAE consists of an encoder and a decoder. The encoder expands the receptive field of neurons in the network structure by dilated causal convolution, thereby improving the encoding representation ability for longer sequences. The decoder decodes the sampled data into variants closely resembling the original sequence. Conclusion: Compared to other models, the Pearson correlation coefficient between the protein fitness values predicted by T-VAE and the true values was higher, and the mean absolute deviation was lower. In addition, the T-VAE model has a better representation learning ability for longer sequences when comparing the encoding of protein sequences of different lengths. These results show that our model has more advantages in representation learning for longer sequences. To verify the model's generative effect, we also calculated the sequence identity between the generated data and the input data. The sequence identity obtained by T-VAE improved by 12.9% compared to the baseline model. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
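The encoder ingredient named in the T-VAE abstract, dilated causal convolution, widens the receptive field over long sequences while keeping each output position dependent only on earlier positions. A minimal sketch; the channel count and dilation are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConv(nn.Module):
    def __init__(self, channels, kernel=3, dilation=4):
        super().__init__()
        self.pad = (kernel - 1) * dilation                 # left-pad => causal
        self.conv = nn.Conv1d(channels, channels, kernel, dilation=dilation)

    def forward(self, x):                  # x: (batch, channels, seq_len)
        return self.conv(F.pad(x, (self.pad, 0)))

layer = DilatedCausalConv(16)
print(layer(torch.randn(1, 16, 100)).shape)   # length preserved: (1, 16, 100)
```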
27. A prefix and attention map discrimination fusion guided attention for biomedical named entity recognition.
- Author
-
Guan, Zhengyi and Zhou, Xiaobing
- Subjects
SUFFIXES & prefixes (Grammar), TEXT mining, MACHINE learning, KNOWLEDGE base
- Abstract
Background: The biomedical literature is growing rapidly, and it is increasingly important to extract meaningful information from this vast amount of literature. Biomedical named entity recognition (BioNER) is one of the key and fundamental tasks in biomedical text mining. It also acts as a primitive step for many downstream applications such as relation extraction and knowledge base completion. Therefore, the accurate identification of entities in biomedical literature has certain research value. However, this task is challenging due to the insufficiency of sequence labeling and the lack of large-scale labeled training data and domain knowledge. Results: In this paper, we use a novel word-pair classification method, design a simple attention mechanism, and propose a novel architecture to solve the research difficulties of BioNER more efficiently without leveraging any external knowledge. Specifically, we overcome the limitations of sequence labeling-based approaches by predicting the relationship between word pairs. Based on this, we enhance the pre-trained model BioBERT through the proposed prefix and attention map discrimination fusion guided attention, and propose E-BioBERT. Our proposed attention differentiates the distribution of different heads in different layers of BioBERT, which enriches the diversity of self-attention. Our model outperforms state-of-the-art compared models on five available datasets: BC4CHEMD, BC2GM, BC5CDR-Disease, BC5CDR-Chem, and NCBI-Disease, achieving F1-scores of 92.55%, 85.45%, 87.53%, 94.16%, and 90.55%, respectively. Conclusion: Compared with many previous models, our method does not require additional training datasets, external knowledge, or a complex training process. The experimental results on five BioNER benchmark datasets demonstrate that our model is better at mining semantic information, alleviates the problem of label inconsistency, and has higher entity recognition ability. More importantly, we analyze and demonstrate the effectiveness of our proposed attention. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
28. Negation detection in Dutch clinical texts: an evaluation of rule-based and machine learning methods.
- Author
-
van Es, Bram, Reteig, Leon C., Tan, Sander C., Schraagen, Marijn, Hemker, Myrthe M., Arends, Sebastiaan R. S., Rios, Miguel A. R., and Haitjema, Saskia
- Subjects
DECISION support systems, ELECTRONIC health records, INFORMATION retrieval, NATURAL language processing, INFORMATION modeling
- Abstract
When developing models for clinical information retrieval and decision support systems, the discrete outcomes required for training are often missing. These labels need to be extracted from free text in electronic health records. For this extraction process, one of the most important contextual properties in clinical text is negation, which indicates the absence of findings. We aimed to improve large-scale extraction of labels by comparing three methods for negation detection in Dutch clinical notes. We used the Erasmus Medical Center Dutch Clinical Corpus to compare a rule-based method based on ContextD, a biLSTM model using MedCAT, and (fine-tuned) RoBERTa-based models. We found that both the biLSTM and RoBERTa models consistently outperform the rule-based model in terms of F1 score, precision, and recall. In addition, we systematically categorized the classification errors for each model, which can be used to further improve model performance in particular applications. Combining the three models naively was not beneficial in terms of performance. We conclude that the biLSTM and RoBERTa-based models in particular are highly accurate in detecting clinical negations, but that ultimately all three approaches can be viable depending on the use case at hand. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
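For intuition, the rule-based family of methods compared in this entry can be reduced to trigger terms that negate findings within a short preceding window. Real systems such as ContextD use far richer rule sets; the Dutch triggers and window size below are illustrative only.

```python
import re

# Dutch negation triggers (illustrative, far from complete): geen/niet/zonder
NEG_TRIGGERS = re.compile(r"\b(geen|niet|zonder)\b", re.IGNORECASE)

def is_negated(sentence, finding, window=40):
    idx = sentence.lower().find(finding.lower())
    if idx < 0:
        return False
    # look for a trigger in a short window preceding the finding
    return bool(NEG_TRIGGERS.search(sentence[max(0, idx - window):idx]))

print(is_negated("Er is geen aanwijzing voor pneumonie.", "pneumonie"))  # True
print(is_negated("Beeld past bij pneumonie.", "pneumonie"))              # False
```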
29. MR-KPA: medication recommendation by combining knowledge-enhanced pre-training with a deep adversarial network.
- Author
-
Lin, Shaofu, Wang, Mengzhen, Shi, Chengyu, Xu, Zhe, Chen, Lihong, Gao, Qingcai, and Chen, Jianhui
- Subjects
ELECTRONIC health records, DRUGS, HEALTH facilities
- Abstract
Background: Medication recommendation based on electronic medical records (EMRs) is a research hotspot in smart healthcare. For developing computational medication recommendation methods based on EMRs, an important challenge is the lack of large amounts of longitudinal EMR data with time correlation. Faced with this challenge, this paper proposes a new EMR-based medication recommendation model called MR-KPA, which combines knowledge-enhanced pre-training with a deep adversarial network to improve medication recommendation from both feature representation and the fine-tuning process. Firstly, a knowledge-enhanced pre-training visit model is proposed to realize domain knowledge-based external feature fusion and pre-training-based internal feature mining for improving the feature representation. Secondly, a medication recommendation model based on the deep adversarial network is developed to optimize the fine-tuning process of the pre-training visit model and alleviate over-fitting caused by the task gap between pre-training and recommendation. Result: The experimental results on EMRs from medical and health institutions in Hainan Province, China, show that the proposed MR-KPA model can effectively improve the accuracy of medication recommendation on small-scale longitudinal EMR data compared with existing representative methods. Conclusion: The advantages of the proposed MR-KPA are mainly attributed to knowledge enhancement based on ontology embedding, the pre-training visit model, and adversarial training. Each of these three optimizations is very effective for improving the capability of medication recommendation on small-scale longitudinal EMR data, and the pre-training visit model has the most significant improvement effect. These three optimizations are also complementary, and their integration makes the proposed MR-KPA model achieve the best recommendation effect. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
30. BioByGANS: biomedical named entity recognition by fusing contextual and syntactic features through graph attention network in node classification framework.
- Author
-
Zheng, Xiangwen, Du, Haijian, Luo, Xiaowei, Tong, Fan, Song, Wei, and Zhao, Dongsheng
- Subjects
ARTIFICIAL neural networks, GRAPH algorithms, TEXT mining, MACHINE learning, PROBLEM solving
- Abstract
Background: Automatic and accurate recognition of various biomedical named entities from literature is an important task of biomedical text mining, which is the foundation of extracting biomedical knowledge from unstructured texts into structured formats. Using the sequence labeling framework and deep neural networks to implement biomedical named entity recognition (BioNER) is a common method at present. However, this method often underutilizes syntactic features such as the dependencies and topology of sentences. Therefore, integrating semantic and syntactic features into the BioNER model is an urgent problem to be solved. Results: In this paper, we propose a novel biomedical named entity recognition model, named BioByGANS (BioBERT/SpaCy-Graph Attention Network-Softmax), which uses a graph to model the dependencies and topology of a sentence and formulates the BioNER task as a node classification problem. This formulation can introduce more topological features of language and is no longer concerned only with the distance between words in the sequence. First, we use periods to segment sentences and spaces and symbols to segment words. Second, contextual features are encoded by BioBERT, and syntactic features such as parts of speech, dependencies, and topology are preprocessed by SpaCy. A graph attention network is then used to generate a fused representation considering both the contextual and syntactic features. Last, a softmax function is used to calculate the probabilities and produce the results. We conduct experiments on 8 benchmark datasets, and our proposed model outperforms existing BioNER state-of-the-art methods on the BC2GM, JNLPBA, BC4CHEMD, BC5CDR-chem, BC5CDR-disease, NCBI-disease, Species-800, and LINNAEUS datasets, achieving F1-scores of 85.15%, 78.16%, 92.97%, 94.74%, 87.74%, 91.57%, 75.01%, and 90.99%, respectively. Conclusion: The experimental results on 8 biomedical benchmark datasets demonstrate the effectiveness of our model and indicate that formulating the BioNER task as a node classification problem and combining syntactic features into the graph attention networks can significantly improve model performance. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
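As a concrete illustration of the node-classification formulation, here is a minimal sketch using PyTorch Geometric's GATConv, assuming each token node carries a BioBERT embedding and edges come from SpaCy dependency arcs. The dimensions and tag set are assumptions, not the published BioByGANS configuration.

```python
# Minimal sketch of NER as node classification with a graph attention
# network, assuming PyTorch Geometric is installed. Feature sizes and
# the tag set are illustrative; the real BioByGANS details may differ.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class TokenGAT(torch.nn.Module):
    def __init__(self, in_dim=768, hidden=128, n_tags=3, heads=4):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden, heads=heads)
        self.gat2 = GATConv(hidden * heads, n_tags, heads=1)

    def forward(self, x, edge_index):
        # x: [num_tokens, 768] BioBERT embeddings for one sentence
        # edge_index: [2, num_edges] dependency arcs from SpaCy (both directions)
        h = F.elu(self.gat1(x, edge_index))
        return F.log_softmax(self.gat2(h, edge_index), dim=-1)  # per-token BIO tags
```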
31. MLM-based typographical error correction of unstructured medical texts for named entity recognition.
- Author
-
Lee, Eun Byul, Heo, Go Eun, Choi, Chang Min, and Song, Min
- Subjects
SURGICAL pathology ,ELECTRONIC health records ,DATA mining ,MEDICAL errors ,PROBLEM solving ,NATURAL language processing ,DATA extraction - Abstract
Background: Unstructured text in medical records, such as Electronic Health Records, contains an enormous amount of valuable information for research; however, extracting and structuring important information is difficult because of frequent typographical errors, so improving the quality of such error-laden data is an essential task for text analysis. To date, few studies have addressed this problem. Here, we propose a new methodology for extracting important information from unstructured medical texts by overcoming the typographical problems in surgical pathology records related to lung cancer. Methods: We propose a typo correction model, based on the Masked Language Model, that considers context to solve the problem of typographical errors in real-world medical data. In addition, a word dictionary built from PubMed abstracts was used by the typo correction model. After refining the data through typo correction, a pre-trained BERT model was fine-tuned, and deep learning-based Named Entity Recognition (NER) was performed. By solving the quality problems of medical data, we sought to improve the accuracy of information extraction from unstructured text. Results: We compared the performance of the proposed context-aware typo correction model with the existing SymSpell model and confirmed that our model outperformed it on the typographical correction task. The F1-score of the model improved by approximately 5% and 9% over the model without contextual information on the NCBI-disease and surgical pathology record datasets, respectively. In addition, the F1-score of NER after typo correction increased by 2% on the NCBI-disease dataset, and there was a substantial performance difference of approximately 25% before and after typo correction on the surgical pathology record dataset, confirming that typos hamper information extraction from unstructured text. Conclusion: We verified that typographical errors in unstructured text negatively affect the performance of natural language processing tasks, and the proposed typo correction model outperformed the existing SymSpell model. This study shows that the proposed model is robust and applicable in real-world environments by focusing on the typos that make unstructured medical text hard to analyze. [ABSTRACT FROM AUTHOR] (A hedged code sketch of MLM-based candidate ranking follows this record.)
- Published
- 2022
- Full Text
- View/download PDF
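The core ranking idea can be sketched with the Hugging Face fill-mask pipeline: mask the suspect token and let the masked language model score dictionary candidates (for instance, words within a small edit distance of the typo). The checkpoint and candidate list below are placeholders, not the authors' PubMed-based setup.

```python
# Sketch of context-aware typo correction with a masked language model.
# The checkpoint and candidate dictionary are placeholders, not the
# authors' PubMed-tuned setup.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

def correct(sentence, typo, candidates):
    """Mask the suspect token and let the MLM rank dictionary
    candidates (e.g., words within small edit distance of the typo)."""
    masked = sentence.replace(typo, unmasker.tokenizer.mask_token, 1)
    scored = unmasker(masked, targets=candidates)
    return scored[0]["token_str"]  # highest-probability candidate

print(correct("The tumor was resected from the left lunge.",
              "lunge", ["lung", "lounge", "lunge"]))
```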
32. A multitask transfer learning framework for the prediction of virus-human protein–protein interactions
- Author
-
Thi Ngan Dong, Graham Brogden, Gisa Gerold, and Megha Khosla
- Subjects
Protein–protein interaction ,Human PPI ,Virus-human PPI ,Multitask ,Transfer learning ,Protein embedding ,Computer applications to medicine. Medical informatics ,R858-859.7 ,Biology (General) ,QH301-705.5 - Abstract
Abstract Background Viral infections cause significant morbidity and mortality worldwide. Understanding the interaction patterns between a particular virus and human proteins plays a crucial role in unveiling the underlying mechanisms of viral infection and pathogenesis, and could further help in the prevention and treatment of virus-related diseases. However, predicting protein–protein interactions between a new virus and human cells is extremely challenging due to scarce data on virus-human interactions and the fast mutation rates of most viruses. Results We developed a multitask transfer learning approach that exploits the information of around 24 million protein sequences and the interaction patterns from the human interactome to counter the problem of small training datasets. Instead of using hand-crafted protein features, we utilize statistically rich protein representations learned by a deep language modeling approach from a massive corpus of protein sequences. We further employ an auxiliary objective that aims to maximize the probability of observing human protein–protein interactions; this auxiliary task acts as a regularizer and also allows us to incorporate domain knowledge into the virus-human protein–protein interaction prediction model. Conclusions Our approach achieved competitive results on 13 benchmark datasets and a case study of the SARS-CoV-2 virus receptor. Experimental results show that our proposed model works effectively for both virus-human and bacteria-human protein–protein interaction prediction tasks. We share our code for reproducibility and future research at https://git.l3s.uni-hannover.de/dong/multitask-transfer . (A hedged code sketch of the shared-encoder multitask setup follows this record.)
- Published
- 2021
- Full Text
- View/download PDF
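A minimal sketch of the multitask setup described above: a shared pair encoder with a virus-human head and an auxiliary human-human head whose loss acts as a regularizer. The embedding dimension, loss weight, and random data are assumptions for illustration only.

```python
# Illustrative multitask setup: a shared pair-encoder with one head for
# virus-human PPI and one for human-human PPI, the latter acting as a
# regularizing auxiliary task. Dimensions and data are assumptions.
import torch
import torch.nn as nn

class MultitaskPPI(nn.Module):
    def __init__(self, emb_dim=1024, hidden=256):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(2 * emb_dim, hidden), nn.ReLU())
        self.vh_head = nn.Linear(hidden, 1)   # virus-human interaction
        self.hh_head = nn.Linear(hidden, 1)   # human-human auxiliary task

    def forward(self, p1, p2, task):
        # p1, p2: pre-computed protein language model embeddings
        h = self.shared(torch.cat([p1, p2], dim=-1))
        return self.vh_head(h) if task == "vh" else self.hh_head(h)

model = MultitaskPPI()
bce = nn.BCEWithLogitsLoss()
# One combined step: main loss plus a weighted auxiliary human-PPI loss
# (random tensors stand in for real embedding/label batches).
p1, p2 = torch.randn(8, 1024), torch.randn(8, 1024)
y = torch.randint(0, 2, (8, 1)).float()
loss = bce(model(p1, p2, "vh"), y) + 0.5 * bce(model(p1, p2, "hh"), y)
loss.backward()
```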
33. AMPDeep: hemolytic activity prediction of antimicrobial peptides using transfer learning.
- Author
-
Salem, Milad, Keshavarzi Arshadi, Arash, and Yuan, Jiann Shiun
- Subjects
ANTIMICROBIAL peptides ,DEEP learning ,PEPTIDES ,ANTI-infective agents ,AMINO acids ,AMINO acid sequence - Abstract
Background: Deep learning's automatic feature extraction has proven to give superior performance in many sequence classification tasks. However, deep learning models generally require massive amounts of training data, which for hemolytic activity prediction of antimicrobial peptides poses a challenge because little data is available. Results: Three different datasets for hemolysis activity prediction of therapeutic and antimicrobial peptides are gathered and the AMPDeep pipeline is implemented for each. The results demonstrate that AMPDeep outperforms previous work on all three datasets, including works that represent peptides with physicochemical features and those that rely solely on the sequence and use deep learning to learn peptide representations. Moreover, a combined dataset is introduced for hemolytic activity prediction to address the problem of sequence similarity in this domain. AMPDeep fine-tunes a large transformer-based model on a small number of peptides and successfully leverages the patterns learned from other protein and peptide databases to assist hemolysis activity prediction modeling. Conclusions: In this work, transfer learning is leveraged to overcome the challenge of small data, and a deep learning-based model is successfully adopted for hemolysis activity classification of antimicrobial peptides. The model is first initialized as a protein language model pre-trained on masked amino acid prediction over many unlabeled protein sequences in a self-supervised manner. It is then fine-tuned on an aggregated dataset of labeled peptides in a supervised manner to predict secretion. Through transfer learning, hyper-parameter optimization and selective fine-tuning, AMPDeep achieves state-of-the-art performance on three hemolysis datasets using only the peptide sequence. This work supports the practical adoption of large sequence-based models for peptide classification and modeling tasks. [ABSTRACT FROM AUTHOR] (A hedged code sketch of selective fine-tuning follows this record.)
- Published
- 2022
- Full Text
- View/download PDF
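Selective fine-tuning of a pre-trained protein language model can be sketched as follows; the checkpoint name, layer index, and the choice to train only the head plus the last encoder layer are illustrative assumptions, not AMPDeep's exact recipe.

```python
# Sketch of selective fine-tuning on a pre-trained protein language model.
# "Rostlab/prot_bert" is one public checkpoint used purely as an example;
# AMPDeep's actual backbone and layer choices may differ.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "Rostlab/prot_bert", num_labels=2)  # hemolytic vs non-hemolytic

for name, param in model.named_parameters():
    # Train only the classification head and the last encoder layer
    # (layer 29 for this 30-layer checkpoint); everything else keeps its
    # pre-trained (or intermediate-task-tuned) weights.
    param.requires_grad = (name.startswith("classifier")
                           or "encoder.layer.29" in name)
```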
34. GeneralizedDTA: combining pre-training and multi-task learning to predict drug-target binding affinity for unknown drug discovery.
- Author
-
Lin, Shaofu, Shi, Chengyu, and Chen, Jianhui
- Subjects
DRUG discovery ,ARTIFICIAL neural networks ,AMINO acid sequence ,MOLECULAR graphs ,CYTOSKELETAL proteins - Abstract
Background: Accurately predicting drug-target binding affinity (DTA) in silico plays an important role in drug discovery. Most computational methods for DTA prediction use machine learning models, especially deep neural networks, and depend on large-scale labelled data. However, it is difficult to learn sufficient feature representations for tens of millions of compounds and hundreds of thousands of proteins from relatively limited labelled drug-target data alone, and a large number of unknown drugs never appear in the labelled drug-target data. This is a kind of out-of-distribution problem in biomedicine. Some recent studies adopted self-supervised pre-training tasks to learn structural information of amino acid sequences for enhancing protein feature representations. However, the task gap between pre-training and DTA prediction causes catastrophic forgetting, which hinders the full use of the learned feature representations in DTA prediction and seriously limits the generalization capability of models for unknown drug discovery. Results: To address these problems, we propose GeneralizedDTA, a new DTA prediction model oriented to unknown drug discovery that combines pre-training and multi-task learning. We introduce self-supervised protein and drug pre-training tasks to learn richer structural information from the amino acid sequences of proteins and the molecular graphs of drug compounds, in order to alleviate the high variance of deep neural network encoders and accelerate the convergence of the prediction model on small-scale labelled data. We also develop a multi-task learning framework with a dual adaptation mechanism to narrow the task gap between pre-training and prediction, preventing overfitting and improving the generalization capability of the DTA prediction model for unknown drug discovery. To validate the effectiveness of our model, we construct an unknown drug dataset to simulate the scenario of unknown drug discovery. Experimental results show that our model has higher generalization capability than existing DTA prediction models in the DTA prediction of unknown drugs. Conclusions: The advantages of our model are mainly attributed to the two kinds of pre-training tasks and the multi-task learning framework, which learn richer structural information about proteins and drugs from large-scale unlabeled data and effectively integrate it into the downstream prediction task to obtain high-quality DTA predictions in unknown drug discovery. [ABSTRACT FROM AUTHOR] (A hedged sketch of one generic way to weight multiple task losses follows this record.)
- Published
- 2022
- Full Text
- View/download PDF
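The paper's dual adaptation mechanism is not detailed in the abstract; as a generic stand-in, one widely used way to balance a DTA regression loss against auxiliary pre-training-style losses is learnable log-variance weighting (after Kendall et al., 2018), sketched below.

```python
# Hedged sketch: balancing the DTA regression loss against auxiliary
# self-supervised losses with learnable log-variance weights
# (Kendall et al., 2018). This is a generic stand-in, not the paper's
# dual adaptation mechanism.
import torch
import torch.nn as nn

class WeightedMultitaskLoss(nn.Module):
    def __init__(self, n_tasks=2):
        super().__init__()
        # One learnable log-variance per task; higher variance downweights
        # the corresponding loss.
        self.log_vars = nn.Parameter(torch.zeros(n_tasks))

    def forward(self, losses):
        total = 0.0
        for i, loss in enumerate(losses):
            precision = torch.exp(-self.log_vars[i])
            total = total + precision * loss + self.log_vars[i]
        return total

criterion = WeightedMultitaskLoss(n_tasks=2)
dta_loss, aux_loss = torch.tensor(0.8), torch.tensor(1.5)  # toy values
print(criterion([dta_loss, aux_loss]))
```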
35. IMSE: interaction information attention and molecular structure based drug drug interaction extraction.
- Author
-
Duan, Biao, Peng, Jing, and Zhang, Yi
- Subjects
DRUG interactions ,DRUG side effects ,CHI-square distribution ,FUNCTIONAL groups - Abstract
Background: Extracting drug-drug interactions from biomedical literature and other textual data is an important component of drug-safety monitoring and has attracted the attention of many researchers in healthcare. Existing work centers on relation extraction using bidirectional long short-term memory networks (BiLSTM) and BERT models, which do not attain the best feature representations. Results: Our proposed DDI (drug-drug interaction) prediction model provides multiple advantages: (1) a newly proposed attention vector better handles the problem of overlapping relations, (2) molecular structure information of drugs is integrated into the model to better express their functional group structure, (3) text features combining the T-distribution and chi-square distribution make the model focus more on drug entities, and (4) it achieves similar or better prediction performance (F-scores up to 85.16%) compared to state-of-the-art DDI models when tested on benchmark datasets. Conclusions: Our model, which leverages a state-of-the-art transformer architecture in conjunction with multiple features, can bolster the performance of drug-drug interaction extraction tasks in the biomedical domain. In particular, we believe our research will be helpful in the identification of potential adverse drug reactions. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
36. Exploring deep learning methods for recognizing rare diseases and their clinical manifestations from texts.
- Author
-
Segura-Bedmar, Isabel, Camino-Perdones, David, and Guerrero-Aspizua, Sara
- Subjects
NATURAL language processing ,DEEP learning ,RARE diseases ,SYMPTOMS ,LONG-term memory ,DELAYED diagnosis ,SCIENTIFIC knowledge - Abstract
Background and objective: Although rare diseases are characterized by low prevalence, approximately 400 million people are affected by a rare disease. The early and accurate diagnosis of these conditions is a major challenge for general practitioners, who often lack the knowledge to identify them. In addition, rare diseases usually show a wide variety of manifestations, which can make diagnosis even more difficult, and a delayed diagnosis can negatively affect the patient's life. There is therefore an urgent need to increase scientific and medical knowledge about rare diseases. Natural Language Processing (NLP) and Deep Learning can help extract relevant information about rare diseases to facilitate their diagnosis and treatment. Methods: The paper explores several deep learning techniques, such as Bidirectional Long Short Term Memory (BiLSTM) networks and deep contextualized word representations based on Bidirectional Encoder Representations from Transformers (BERT), to recognize rare diseases and their clinical manifestations (signs and symptoms). Results: BioBERT, a domain-specific language representation based on BERT and trained on biomedical corpora, obtains the best results, with an F1 of 85.2% for rare diseases. Since many signs are described by complex noun phrases involving overlapped, nested and discontinuous entities, the model achieves a lower F1 of 57.2% for clinical manifestations. Conclusions: While our results are promising, there is still much room for improvement, especially with respect to the identification of clinical manifestations (signs and symptoms). [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
37. CoQUAD: a COVID-19 question answering dataset system, facilitating research, benchmarking, and practice.
- Author
-
Raza, Shaina, Schwartz, Brian, and Rosella, Laura C.
- Subjects
QUESTION answering systems ,COVID-19 ,NATURAL language processing ,COVID-19 pandemic - Abstract
Background: Due to the growing amount of COVID-19 research literature, medical experts, clinical scientists, and researchers frequently struggle to stay up to date on the most recent findings. There is a pressing need to help researchers and practitioners mine and answer COVID-19-related questions in a timely manner. Methods: This paper introduces CoQUAD, a question-answering system that can extract answers to COVID-19 questions efficiently. Two datasets are provided in this work: a reference-standard dataset built using the CORD-19 and LitCOVID initiatives, and a gold-standard dataset prepared by experts from the public health domain. CoQUAD has a Retriever component, based on the BM25 algorithm, that searches the reference-standard dataset for documents relevant to a COVID-19-related question, and a Reader component, a Transformer-based model (MPNet), that reads the retrieved paragraphs and extracts the answers. In comparison to previous works, the proposed CoQUAD system can answer questions on early, mid, and post-COVID-19 topics. Results: Extensive experiments on the CoQUAD Retriever and Reader modules show that CoQUAD provides effective and relevant answers to COVID-19-related questions posed in natural language with a high level of accuracy. Compared to state-of-the-art baselines, CoQUAD outperforms previous models, achieving an exact match score of 77.50% and an F1 score of 77.10%. Conclusion: CoQUAD is a question-answering system that mines the COVID-19 literature using natural language processing techniques to help the research community find the most recent findings and answer related questions. [ABSTRACT FROM AUTHOR] (A hedged retriever-reader sketch follows this record.)
- Published
- 2022
- Full Text
- View/download PDF
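The retriever-reader pipeline can be sketched in a few lines: BM25 ranks candidate documents and an extractive QA model reads the top hits. The toy corpus and the reader checkpoint below are placeholders; CoQUAD uses its own reference-standard collection and an MPNet reader.

```python
# Minimal retriever-reader sketch: BM25 over a toy corpus plus an
# extractive QA reader. The reader checkpoint is a generic public model,
# not CoQUAD's MPNet reader.
from rank_bm25 import BM25Okapi
from transformers import pipeline

corpus = [
    "COVID-19 vaccines reduce the risk of severe disease and hospitalization.",
    "Long COVID symptoms can include fatigue and shortness of breath.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def answer(question, top_k=1):
    # Retrieve the most relevant documents, then extract an answer span
    hits = bm25.get_top_n(question.lower().split(), corpus, n=top_k)
    reader = pipeline("question-answering",
                      model="distilbert-base-cased-distilled-squad")
    return reader(question=question, context=" ".join(hits))

print(answer("What are symptoms of long COVID?"))
```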
38. Interpretable prediction of necrotizing enterocolitis from machine learning analysis of premature infant stool microbiota.
- Author
-
Lin, Yun Chao, Salleb-Aouissi, Ansaf, and Hooven, Thomas A.
- Subjects
PREMATURE infants ,MACHINE learning ,ENTEROCOLITIS ,HUMAN microbiota ,DISEASE risk factors ,DEFECATION - Abstract
Background: Necrotizing enterocolitis (NEC) is a common, potentially catastrophic intestinal disease among very low birthweight premature infants. Affecting up to 15% of neonates born weighing less than 1500 g, NEC causes sudden-onset, progressive intestinal inflammation and necrosis, which can lead to significant bowel loss, multi-organ injury, or death. No unifying cause of NEC has been identified, nor is there any reliable biomarker that indicates an individual patient's risk of the disease. Without a way to predict NEC in advance, the current medical strategy involves close clinical monitoring in an effort to treat babies with NEC as quickly as possible before irrecoverable intestinal damage occurs. In this report, we describe a novel machine learning application for generating dynamic, individualized NEC risk scores based on intestinal microbiota data, which can be obtained by sequencing bacterial DNA from otherwise discarded infant stool. A central insight that differentiates our work from past efforts was the recognition that disease prediction from stool microbiota represents a specific subtype of machine learning problem known as multiple instance learning (MIL). Results: We used a neural network-based MIL architecture, which we tested on independent datasets from two cohorts encompassing 3595 stool samples from 261 at-risk infants. Our report also introduces a new concept called the "growing bag" analysis, which applies MIL over time, allowing incorporation of past data into each new risk calculation. This approach allowed early, accurate NEC prediction, with a mean sensitivity of 86% and specificity of 90%. True-positive NEC predictions occurred an average of 8 days before disease onset. We also demonstrate that an attention-gated mechanism incorporated into our MIL algorithm permits interpretation of NEC risk, identifying several bacterial taxa that past work has associated with NEC, and potentially pointing the way toward new hypotheses about NEC pathogenesis. Our system is flexible, accepting microbiota data generated from targeted 16S or "shotgun" whole-genome DNA sequencing. It performs well in the setting of common, potentially confounding preterm neonatal clinical events such as perinatal cardiopulmonary depression, antibiotic administration, feeding disruptions, or transitions between breast feeding and formula. Conclusions: We have developed and validated a robust MIL-based system for NEC prediction from harmlessly collected premature infant stool. While this system was developed for NEC prediction, our MIL approach may also be applicable to other diseases characterized by changes in the human microbiota. [ABSTRACT FROM AUTHOR] (A hedged sketch of attention-gated MIL pooling follows this record.)
- Published
- 2022
- Full Text
- View/download PDF
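Gated-attention MIL pooling (after Ilse et al., 2018) is the standard building block behind the interpretability described above; a compact sketch follows, with illustrative sizes rather than the authors' exact architecture. Each row of the bag is one stool sample, and the returned attention weights indicate which samples drive the risk score.

```python
# Sketch of gated-attention MIL pooling (after Ilse et al., 2018): each
# stool sample is an instance, a patient's samples form a bag, and the
# attention weights show which samples drive the NEC risk score.
# Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, in_dim=64, attn_dim=32):
        super().__init__()
        self.V = nn.Linear(in_dim, attn_dim)
        self.U = nn.Linear(in_dim, attn_dim)
        self.w = nn.Linear(attn_dim, 1)
        self.clf = nn.Linear(in_dim, 1)

    def forward(self, bag):                       # bag: [n_samples, in_dim]
        # Gated attention: tanh branch modulated by a sigmoid gate
        a = self.w(torch.tanh(self.V(bag)) * torch.sigmoid(self.U(bag)))
        a = torch.softmax(a, dim=0)               # attention per stool sample
        z = (a * bag).sum(dim=0)                  # bag-level representation
        return torch.sigmoid(self.clf(z)), a.squeeze(-1)

risk, weights = AttentionMIL()(torch.randn(5, 64))  # 5 samples, one infant
```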
39. A-Prot: protein structure modeling using MSA transformer.
- Author
-
Hong, Yiyu, Lee, Juyong, and Ko, Junsu
- Subjects
PROTEIN structure ,PROTEIN models ,PROTEIN structure prediction ,SEQUENCE alignment ,CYTOSKELETAL proteins ,DIHEDRAL angles - Abstract
Background: The accuracy of protein 3D structure prediction has been dramatically improved by advances in deep learning. In the recent CASP14, DeepMind demonstrated that their new version of AlphaFold (AF) produces highly accurate 3D models, close to experimental structures. The success of AF shows that the multiple sequence alignment of a sequence contains rich evolutionary information, leading to accurate 3D models. Despite this success, only the prediction code is open, and training a similar model requires vast computational resources; developing a lighter prediction model therefore remains necessary. Results: In this study, we propose a new protein 3D structure modeling method, A-Prot, using MSA Transformer, one of the state-of-the-art protein language models. An MSA feature tensor and row attention maps are extracted and converted into 2D residue-residue distance and dihedral angle predictions for a given MSA. We demonstrate that A-Prot predicts long-range contacts better than existing methods. Additionally, we modeled the 3D structures of the free modeling and hard template-based modeling targets of CASP14. The assessment shows that the A-Prot models are more accurate than those of most top server groups of CASP14. Conclusion: These results imply that A-Prot accurately captures the evolutionary and structural information of proteins at relatively low computational cost. A-Prot can thus provide a clue for the development of other protein property prediction methods. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
40. EnsembleFam: towards more accurate protein family prediction in the twilight zone.
- Author
-
Kabir, Mohammad Neamul and Wong, Limsoon
- Subjects
DEEP learning ,G protein coupled receptors ,SUPPORT vector machines ,PROTEINS ,MARKOV processes ,PROTEIN models - Abstract
Background: Current protein family modeling methods such as profile Hidden Markov Models (pHMM), k-mer based methods, and deep learning-based methods do not provide very accurate protein function predictions for proteins in the twilight zone, owing to their low sequence similarity to reference proteins with known functions. Results: We present a novel method, EnsembleFam, aiming at better function prediction for proteins in the twilight zone. EnsembleFam extracts the core characteristics of a protein family using similarity and dissimilarity features calculated from sequence homology relations. It trains three separate Support Vector Machine (SVM) classifiers for each family using these features and makes an ensemble prediction to classify novel proteins into these families. Extensive experiments are conducted using the Clusters of Orthologous Groups (COG) dataset and the G Protein-Coupled Receptor (GPCR) dataset. EnsembleFam not only outperforms state-of-the-art methods on the overall datasets but also provides much more accurate predictions for twilight zone proteins. Conclusions: EnsembleFam, a machine learning method for modeling protein families, can better identify members with very low sequence homology. Using EnsembleFam, protein functions can be predicted from sequence information alone with better accuracy than state-of-the-art methods. [ABSTRACT FROM AUTHOR] (A hedged voting-ensemble sketch follows this record.)
- Published
- 2022
- Full Text
- View/download PDF
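A minimal illustration of per-family ensembling with three SVMs and majority voting, using scikit-learn. The random features stand in for the similarity/dissimilarity features described above, and in EnsembleFam each member SVM would see a different feature view rather than a different regularization constant.

```python
# Illustrative ensemble of three SVMs with majority voting, in the spirit
# of one classifier triplet per family. Features are random stand-ins for
# the similarity/dissimilarity features described in the abstract.
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 10)), rng.integers(0, 2, size=200)

ensemble = VotingClassifier(
    estimators=[(f"svm{i}", SVC(C=c)) for i, c in enumerate([0.1, 1.0, 10.0])],
    voting="hard")  # majority vote over the three member SVMs
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))  # in/out-of-family calls for 5 proteins
```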
41. ProtPlat: an efficient pre-training platform for protein classification based on FastText.
- Author
-
Jin, Yuan and Yang, Yang
- Subjects
SUPERVISED learning ,AMINO acid sequence ,COMPUTER vision ,SIGNAL peptides ,NATURAL language processing ,PROTEINS - Abstract
Background: Over the past decades, benefiting from the rapid growth of protein sequence data in public databases, many machine learning methods have been developed to predict the physicochemical properties or functions of proteins from amino acid sequence features. However, prediction performance often suffers from a lack of labeled data. In recent years, pre-training methods have been widely studied to address the small-sample issue in computer vision and natural language processing, while specific pre-training techniques for protein sequences remain few. Results: In this paper, we propose a pre-training platform for representing protein sequences, called ProtPlat, which uses the Pfam database to train a three-layer neural network and then uses task-specific training data from downstream tasks to fine-tune the model. ProtPlat learns good representations for amino acids while achieving efficient classification. We conduct experiments on three protein classification tasks: the identification of type III secreted effectors, the prediction of subcellular localization, and the recognition of signal peptides. The experimental results show that pre-training effectively enhances model performance and that ProtPlat is competitive with state-of-the-art predictors, especially on small datasets. We implement ProtPlat as a web service (https://compbio.sjtu.edu.cn/protplat) that is accessible to the public. Conclusions: To enhance the feature representation of protein amino acid sequences and improve the performance of sequence-based classification tasks, we developed ProtPlat, a general platform for the pre-training of protein sequences, featuring large-scale supervised training on the Pfam database and an efficient learning model, FastText. The experimental results on three downstream classification tasks demonstrate the efficacy of ProtPlat. [ABSTRACT FROM AUTHOR] (A hedged FastText-style sketch follows this record.)
- Published
- 2022
- Full Text
- View/download PDF
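A FastText-style supervised setup can be sketched by turning each sequence into overlapping k-mer "words" in fastText's __label__ format; the file name, labels, and sequences below are placeholders, not ProtPlat's Pfam training data.

```python
# Sketch of FastText-style supervised protein classification: sequences
# become overlapping 3-gram "words" in fastText's __label__ format.
# File name, labels, and sequences are placeholders.
import fasttext

def to_kmers(seq, k=3):
    # Overlapping k-mers act as the "words" of the protein "sentence"
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

with open("train.txt", "w") as f:
    f.write(f"__label__signal_peptide {to_kmers('MKKTAIAIAVALAGFATVAQA')}\n")
    f.write(f"__label__other {to_kmers('MSTNPKPQRKTKRNTNRRPQDVK')}\n")

model = fasttext.train_supervised("train.txt", epoch=25, wordNgrams=2)
print(model.predict(to_kmers("MKKLLPTAAAGLLLLAAQPAMA")))
```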
42. Lerna: transformer architectures for configuring error correction tools for short- and long-read genome sequencing.
- Author
-
Sharma, Atul, Jain, Pranjal, Mahgoub, Ashraf, Zhou, Zihan, Mahadik, Kanak, and Chaterji, Somali
- Subjects
GRAPHICS processing units ,NUCLEOTIDE sequencing ,NATURAL language processing - Abstract
Background: Sequencing technologies are prone to errors, making error correction (EC) necessary for downstream applications. EC tools must be manually configured for optimal performance, and we find that the optimal parameters (e.g., k-mer size) are both tool- and dataset-dependent. Moreover, evaluating the performance (i.e., alignment rate or gain) of a given tool usually relies on a reference genome, but high-quality reference genomes are not always available. We introduce Lerna for the automated configuration of k-mer-based EC tools. Lerna first builds a language model (LM) of the uncorrected genomic reads, and then, using this LM, computes a perplexity metric to evaluate the corrected reads for different parameter choices. Next, it selects the choice that produces the highest alignment rate, without using a reference genome. The fundamental intuition of our approach is that the perplexity metric is inversely correlated with the quality of the assembly after error correction; Lerna therefore leverages perplexity to tune k-mer sizes automatically without needing a reference genome. Results: First, we show that the best k-mer value can vary across datasets, even for the same EC tool, motivating our design that automates k-mer size selection without a reference genome. Second, we show the gains of our LM using its component attention-based transformers, reporting the model's estimate of the perplexity metric before and after error correction: the lower the perplexity after correction, the better the k-mer size. We also show that the alignment rate and assembly quality computed for the corrected reads are strongly negatively correlated with the perplexity, enabling automated selection of k-mer values for better error correction and hence improved assembly quality. We validate our approach on both short and long reads. Additionally, our attention-based models yield significant runtime improvements for the entire pipeline, running 18× faster than previous works thanks to a parallelized attention mechanism and JIT compilation for GPU inference. Conclusion: Lerna improves de novo genome assembly by optimizing EC tools. Our code is available in a public repository at: https://github.com/icanforce/lerna-genomics. [ABSTRACT FROM AUTHOR] (A toy perplexity-selection sketch follows this record.)
- Published
- 2022
- Full Text
- View/download PDF
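The selection principle reduces to: score the reads corrected under each candidate k-mer size with a language model, and keep the k with the lowest perplexity. The toy below uses a character n-gram model with add-one smoothing as a stand-in for Lerna's transformer LM.

```python
# Toy illustration of the core idea: score reads corrected with different
# k-mer sizes by their perplexity under a character n-gram model, and keep
# the k with the lowest perplexity. Add-one smoothing and all data are
# stand-ins for Lerna's transformer LM.
import math
from collections import Counter

def perplexity(reads, n=4):
    counts, ctx = Counter(), Counter()
    for r in reads:
        for i in range(len(r) - n + 1):
            counts[r[i:i + n]] += 1
            ctx[r[i:i + n - 1]] += 1
    log_sum, total = 0.0, 0
    for r in reads:
        for i in range(len(r) - n + 1):
            # Add-one smoothed conditional probability over 4 bases
            p = (counts[r[i:i + n]] + 1) / (ctx[r[i:i + n - 1]] + 4)
            log_sum += -math.log(p)
            total += 1
    return math.exp(log_sum / total)

corrected = {17: ["ACGTACGTGGA", "ACGTACGTCCA"],   # reads after EC with k=17
             23: ["ACGTACGTGGA", "ACGAACGTCCA"]}   # reads after EC with k=23
best_k = min(corrected, key=lambda k: perplexity(corrected[k]))
print(best_k)
```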
43. Improving deep learning method for biomedical named entity recognition by using entity definition information.
- Author
-
Xiong, Ying, Chen, Shuai, Tang, Buzhou, Chen, Qingcai, Wang, Xiaolong, Yan, Jun, and Zhou, Yi
- Subjects
DEEP learning ,KNOWLEDGE graphs ,TEXT mining ,READING comprehension ,DEFINITIONS ,INFORMATION modeling - Abstract
Background: Biomedical named entity recognition (NER) is a fundamental task of biomedical text mining that finds the boundaries of entity mentions in biomedical text and determines their entity types. To accelerate the development of biomedical NER techniques in Spanish, the PharmaCoNER organizers launched a competition to recognize pharmacological substances, compounds, and proteins. Biomedical NER is usually treated as a sequence labeling task, and almost all state-of-the-art sequence labeling methods ignore the meaning of different entity types. In this paper, we investigate methods to introduce the meaning of entity types, represented by their definition information, into deep learning methods for biomedical NER and apply them to the PharmaCoNER 2019 challenge. Materials and methods: We investigate entity definition information in two kinds of methods: (1) SQuAD-style machine reading comprehension (MRC) methods that treat entity definition information as the query and biomedical text as the context, predicting answer spans as entities, and (2) span-level one-pass (SOne) methods that predict entity spans one type at a time, with each type's meaning represented by its definition information. All models are trained and tested on the PharmaCoNER 2019 corpus, and their performance is evaluated by strict micro-averaged precision, recall, and F1-score. Results: Entity definition information improves both the SQuAD-style MRC and SOne methods by about 0.003 in micro-averaged F1-score. The SQuAD-style MRC model using entity definition information as the query achieves the best performance, with a micro-averaged precision of 0.9225, recall of 0.9050, and F1-score of 0.9137. It outperforms the best model of the PharmaCoNER 2019 challenge by 0.0032 in F1-score, and obtains a significant 1% F1-score improvement over the state-of-the-art model that uses no manually-crafted features. These results indicate that entity definition information is useful for deep learning methods in biomedical NER. Conclusion: Our entity-definition-enhanced models achieve a state-of-the-art micro-averaged F1-score of 0.9137, which implies that entity definition information has a positive impact on biomedical NER. In the future, we will explore more entity definition information from knowledge graphs. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
44. Analyzing transfer learning impact in biomedical cross-lingual named entity recognition and normalization.
- Author
-
Rivera-Zavala, Renzo M. and Martínez, Paloma
- Subjects
DEEP learning ,NATURAL language processing ,MEDICAL research ,ACADEMIC discourse - Abstract
Background: The volume of biomedical literature and clinical data is growing at an exponential rate. Therefore, efficient access to the data described in unstructured biomedical texts is a crucial task for the biomedical industry and research. Named Entity Recognition (NER) is the first step for information and knowledge acquisition from unstructured texts. Recent NER approaches use contextualized word representations as input for a downstream classification task. However, distributed word vectors (embeddings) are very limited in Spanish, and even more so for the biomedical domain. Methods: In this work, we develop several biomedical Spanish word representations and introduce two deep learning approaches for recognizing pharmaceutical, chemical, and other biomedical entities in Spanish clinical case texts and biomedical texts: one based on a Bi-LSTM-CRF model and the other on a BERT-based architecture. Results: Several Spanish biomedical embeddings together with the two deep learning models were evaluated on the PharmaCoNER and CORD-19 datasets. The PharmaCoNER dataset comprises a set of Spanish clinical cases annotated with drugs, chemical compounds and pharmacological substances; our extended Bi-LSTM-CRF model obtains an F-score of 85.24% on entity identification and classification, and the BERT model obtains an F-score of 88.80%. For the entity normalization task, the extended Bi-LSTM-CRF model achieves an F-score of 72.85% and the BERT model 79.97%. The CORD-19 dataset consists of scholarly articles written in English annotated with biomedical concepts such as disorder, species, chemical or drug, gene and protein, enzyme, and anatomy. The Bi-LSTM-CRF and BERT models obtain F-measures of 78.23% and 78.86%, respectively, on entity identification and classification on the CORD-19 dataset. Conclusion: These results show that deep learning models with in-domain knowledge learned from large-scale datasets substantially improve named entity recognition performance, and that contextualized representations help capture the complexity and ambiguity inherent to biomedical texts. Word-, concept-, and sense-based embeddings for languages other than English are needed to improve NER in those languages. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
45. Concept recognition as a machine translation problem.
- Author
-
Boguslav, Mayla R., Hailu, Negacy D., Bada, Michael, Baumgartner Jr., William A., and Hunter, Lawrence E.
- Subjects
MACHINE translating ,ARTIFICIAL neural networks ,CONCEPT mapping ,MACHINE learning ,TEXT mining ,NATURAL language processing - Abstract
Background: Automated assignment of specific ontology concepts to mentions in text is a critical task in biomedical natural language processing, and the subject of many open shared tasks. Although the current state of the art involves the use of neural network language models as a post-processing step, the very large number of ontology classes to be recognized and the limited amount of gold-standard training data have impeded the creation of end-to-end systems based entirely on machine learning. Recently, Hailu et al. recast the concept recognition problem as a type of machine translation and demonstrated that sequence-to-sequence machine learning models have the potential to outperform multi-class classification approaches. Methods: We systematically characterize the factors that contribute to the accuracy and efficiency of several approaches to sequence-to-sequence machine learning through extensive studies of alternative methods and hyperparameter selections. We not only identify the best-performing systems and parameters across a wide variety of ontologies but also provide insights into the widely varying resource requirements and hyperparameter robustness of alternative approaches. Analysis of the strengths and weaknesses of such systems suggests promising avenues for future improvement as well as design choices that can increase computational efficiency at a small cost in performance. Results: Bidirectional encoder representations from transformers for biomedical text mining (BioBERT) for span detection, along with the open-source toolkit for neural machine translation (OpenNMT) for concept normalization, achieve state-of-the-art performance for most ontologies annotated in the CRAFT Corpus. This approach uses substantially fewer computational resources, including hardware, memory, and time, than several alternative approaches. Conclusions: Machine translation is a promising avenue for fully machine-learning-based concept recognition that achieves state-of-the-art results on the CRAFT Corpus, evaluated via direct comparison to previous results from the 2019 CRAFT shared task. Experiments illuminating the reasons for the surprisingly good performance of sequence-to-sequence methods targeting ontology identifiers suggest that further progress may be possible by mapping to alternative target concept representations. All code and models can be found at: https://github.com/UCDenver-ccp/Concept-Recognition-as-Translation. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
46. A sui generis QA approach using RoBERTa for adverse drug event identification.
- Author
-
Jain, Harshit, Raj, Nishant, and Mishra, Suyash
- Subjects
DRUG side effects ,LONG-term memory ,SHORT-term memory - Abstract
Background: Extraction of adverse drug events from biomedical literature and other textual data is an important component of drug-safety monitoring and has attracted the attention of many researchers in healthcare. Existing work centers on entity-relation extraction using bidirectional long short term memory networks (Bi-LSTM), which do not attain the best feature representations. Results: In this paper, we introduce a question answering framework that exploits the robustness, masking, and dynamic attention capabilities of RoBERTa through a technique of domain adaptation, and attempt to overcome the aforementioned limitations. With an end-to-end pipeline formulation, our model outperforms prior work by 9.53% F1-score. Conclusion: An end-to-end pipeline that leverages a state-of-the-art transformer architecture in conjunction with a QA approach can bolster the performance of entity-relation extraction tasks in the biomedical domain. In particular, we believe our research will be helpful in identifying potential adverse drug reactions in textual data for mono- as well as combination therapy. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
47. N-gram analysis of 970 microbial organisms reveals presence of biological language models
- Author
-
Ganapathiraju Madhavi K and Osmanbeyoglu Hatice
- Subjects
Computer applications to medicine. Medical informatics ,R858-859.7 ,Biology (General) ,QH301-705.5 - Abstract
Abstract Background It has been suggested previously that genome and proteome sequences show characteristics typical of natural-language texts, such as "signature-style" word usage indicative of authors or topics, and that algorithms originally developed for natural language processing may therefore be applied to genome sequences to draw biologically relevant conclusions. Following this approach of 'biological language modeling', statistical n-gram analysis has been applied to the comparative analysis of whole proteome sequences of 44 organisms. It was shown that a few particular amino acid n-grams occur in abundance in one organism but very rarely in others, thereby serving as genome signatures. At that time proteomes of only 44 organisms were available, limiting the generalization of this hypothesis. Today nearly 1,000 genome sequences and corresponding translated sequences are available, making it feasible to test the existence of biological language models across the evolutionary tree. Results We studied the whole proteome sequences of 970 microbial organisms using n-gram frequencies and cross-perplexity, employing the Biological Language Modeling Toolkit and the Patternix Revelio toolkit. Genus-specific signatures were observed even in a simple unigram distribution. By taking the statistical n-gram model of one organism as reference and computing the cross-perplexity of all other microbial proteomes against it, cross-perplexity was found to be predictive of branch distance in the phylogenetic tree. For example, a 4-gram model of the proteome of Shigella flexneri 2a, which belongs to the Gammaproteobacteria class, showed a self-perplexity of 15.34, while the cross-perplexity of the other organisms ranged from 15.59 to 29.5 and was proportional to their branching distance from S. flexneri in the evolutionary tree. The organisms of this genus, which happen to be pathotypes of E. coli, also have the closest perplexity values to E. coli. Conclusion Whole proteome sequences of microbial organisms contain particular n-gram sequences that are abundant in one organism but occur very rarely in others, thereby serving as proteome signatures. Further, perplexity, a statistical measure of similarity of n-gram composition, can be used to predict evolutionary distance within a genus in the phylogenetic tree. (A toy cross-perplexity sketch follows this record.)
- Published
- 2011
- Full Text
- View/download PDF
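A toy unigram version of the cross-perplexity computation: build a smoothed amino acid distribution from a reference proteome and evaluate another proteome against it, where higher values suggest greater divergence. Real analyses would use longer n-grams over whole proteomes; the sequences below are stand-ins.

```python
# Toy unigram version of cross-perplexity: a smoothed amino acid
# distribution from a reference proteome is used to score another
# proteome. Higher cross-perplexity suggests greater divergence.
import math
from collections import Counter

def unigram_model(proteome, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    counts = Counter(proteome)
    total = len(proteome) + len(alphabet)          # add-one smoothing
    return {aa: (counts[aa] + 1) / total for aa in alphabet}

def cross_perplexity(model, proteome):
    # exp of the average negative log-probability under the reference model
    h = -sum(math.log(model[aa]) for aa in proteome) / len(proteome)
    return math.exp(h)

ref = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"    # stand-in reference proteome
other = "MSTNPKPQRKTKRNTNRRPQDVKFPGGGQIVGG"  # stand-in query proteome
print(cross_perplexity(unigram_model(ref), other))
```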
48. PlncRNA-HDeep: plant long noncoding RNA prediction using hybrid deep learning based on two encoding styles.
- Author
-
Meng, Jun, Kang, Qiang, Chang, Zheng, and Luan, Yushi
- Subjects
DEEP learning ,LINCRNA ,NUCLEOTIDE sequence ,CONVOLUTIONAL neural networks ,BIOLOGICAL laboratories ,CORN ,SUPPORT vector machines - Abstract
Background: Long noncoding RNAs (lncRNAs) play an important role in regulating biological activities, and their prediction is significant for exploring biological processes. Long short-term memory (LSTM) and convolutional neural networks (CNN) can automatically extract and learn abstract information from encoded RNA sequences, avoiding complex feature engineering, and an ensemble model that learns information from multiple perspectives shows better performance than a single model. It is feasible and interesting to treat an RNA sequence as both a sentence and an image, train an LSTM and a CNN respectively, and then hybridize the trained models to predict lncRNAs. Although various predictors for lncRNAs exist, few are designed for plants, so a reliable and powerful predictor for plant lncRNAs is necessary. Results: To boost the performance of lncRNA prediction, this paper proposes a hybrid deep learning model based on two encoding styles (PlncRNA-HDeep), which requires no prior knowledge and uses only RNA sequences to train the models for predicting plant lncRNAs. It not only learns diversified information from RNA sequences encoded by p-nucleotide and one-hot encodings, but also takes advantage of lncRNA-LSTM, proposed in our previous study, and CNN. The parameters are tuned and three hybrid strategies are tested to maximize its performance. Experimental results show that PlncRNA-HDeep is more effective than lncRNA-LSTM and CNN alone, obtaining 97.9% sensitivity, 95.1% precision, 96.5% accuracy and a 96.5% F1 score on the Zea mays dataset, better than several shallow machine learning methods (support vector machine, random forest, k-nearest neighbor, decision tree, naive Bayes and logistic regression) and some existing tools (CNCI, PLEK, CPC2, LncADeep and lncRNAnet). Conclusions: PlncRNA-HDeep is feasible and obtains credible predictive results. It may also provide valuable references for other related research. [ABSTRACT FROM AUTHOR] (A hedged sketch of the two encoding styles follows this record.)
- Published
- 2021
- Full Text
- View/download PDF
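The two encoding styles can be sketched directly: an integer "sentence" encoding for the LSTM branch and a one-hot "image" encoding for the CNN branch. The paper's p-nucleotide encoding is more elaborate; the version below is a simplified stand-in.

```python
# The two encoding styles, sketched: an integer ("sentence") encoding for
# the LSTM branch and a one-hot ("image") encoding for the CNN branch.
# The paper's p-nucleotide encoding is more elaborate; this is a stand-in.
import numpy as np

BASES = "ACGU"

def as_sentence(seq):
    return [BASES.index(b) + 1 for b in seq]           # 1-based token ids

def as_image(seq):
    one_hot = np.zeros((len(seq), 4), dtype=np.float32)
    for i, b in enumerate(seq):
        one_hot[i, BASES.index(b)] = 1.0
    return one_hot                                      # [length, 4] "image"

print(as_sentence("AUGGC"), as_image("AUGGC").shape)
```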
49. Assistant diagnosis with Chinese electronic medical records based on CNN and BiLSTM with phrase-level and word-level attentions.
- Author
-
Wang, Tong, Xuan, Ping, Liu, Zonglin, and Zhang, Tiangang
- Subjects
ELECTRONIC health records ,CONVOLUTIONAL neural networks ,DEEP learning ,FORECASTING ,DATA entry ,DATA fusion (Statistics) - Abstract
Background: Inferring the diseases related to a patient's electronic medical records (EMRs) is of great significance for assisting doctors in diagnosis. Several recent prediction methods have shown that deep learning-based methods can learn the deep and complex information contained in EMRs; however, they do not consider the discriminative contributions of different phrases and words, and the local and contextual information of EMRs should be more deeply integrated. Results: A new method based on the fusion of a convolutional neural network (CNN) and a bidirectional long short-term memory network (BiLSTM) with attention mechanisms, referred to as FCNBLA, is proposed for predicting the disease related to a given EMR. FCNBLA deeply integrates local information, contextual information of the word sequence, and the more informative phrases and words. A novel deep learning framework is developed to learn the local representation, the context representation and their combination. The left side of the framework is a CNN that learns the local representation of adjacent words; the right side is a BiLSTM that learns the context representation of the word sequence. Since not all phrases and words contribute equally to the meaning of an EMR, we establish attention mechanisms at both the phrase and word levels, and the middle module of the framework learns the combined representation of the enhanced phrases and words. The macro-averaged F-score and accuracy of FCNBLA reached 91.29% and 92.78%, respectively. Conclusion: The experimental results indicate that FCNBLA yields superior performance compared with several state-of-the-art methods. The attention mechanisms and combined representations are also confirmed to be helpful for improving FCNBLA's prediction performance. Our method is helpful for assisting doctors in diagnosing patients' diseases. [ABSTRACT FROM AUTHOR] (A condensed sketch of the CNN/BiLSTM fusion follows this record.)
- Published
- 2020
- Full Text
- View/download PDF
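A condensed sketch of the fusion idea: a CNN branch for local features, a BiLSTM branch for context, and word-level attention over the BiLSTM states, concatenated for classification. The sizes and the single attention level are simplifications of the paper's phrase- and word-level design.

```python
# Condensed sketch of the fusion idea: a CNN branch for local features,
# a BiLSTM branch for context, and word-level attention over the BiLSTM
# states; the branches are concatenated for classification. Sizes and the
# single attention level are simplifications of the paper's design.
import torch
import torch.nn as nn

class CNNBiLSTMAttn(nn.Module):
    def __init__(self, vocab=5000, emb=128, n_classes=10):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.conv = nn.Conv1d(emb, 64, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(emb, 64, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(128, 1)
        self.out = nn.Linear(64 + 128, n_classes)

    def forward(self, ids):                        # ids: [batch, seq_len]
        x = self.embed(ids)
        # CNN branch: max-pooled local n-gram features
        local = torch.relu(self.conv(x.transpose(1, 2))).max(dim=2).values
        # BiLSTM branch with word-level attention over hidden states
        ctx, _ = self.lstm(x)                      # [batch, seq_len, 128]
        a = torch.softmax(self.attn(ctx), dim=1)   # word-level attention
        context = (a * ctx).sum(dim=1)
        return self.out(torch.cat([local, context], dim=-1))

logits = CNNBiLSTMAttn()(torch.randint(0, 5000, (2, 40)))
```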
50. Evolving knowledge graph similarity for supervised learning in complex biomedical domains.
- Author
-
Sousa, Rita T., Silva, Sara, and Pesquita, Catia
- Subjects
SUPERVISED learning ,ONTOLOGIES (Information retrieval) ,GENETIC programming ,PROTEIN-protein interactions ,RESEMBLANCE (Philosophy) ,MORPHOLOGY - Abstract
Background: In recent years, biomedical ontologies have become important for describing existing biological knowledge in the form of knowledge graphs. Data mining approaches that work with knowledge graphs have been proposed, but they are based on vector representations that do not capture the full underlying semantics. An alternative is to use machine learning approaches that explore semantic similarity. However, since ontologies can model multiple perspectives, semantic similarity computations for a given learning task need to be fine-tuned to account for this. Obtaining the best combination of semantic similarity aspects for each learning task is not trivial and typically depends on expert knowledge. Results: We have developed a novel approach, evoKGsim, that applies Genetic Programming over a set of semantic similarity features, each based on a semantic aspect of the data, to obtain the best combination for a given supervised learning task. The approach was evaluated on several benchmark datasets for protein-protein interaction prediction, using the Gene Ontology as the knowledge graph to support semantic similarity, and it outperformed competing strategies, including manually selected combinations of semantic aspects emulating expert knowledge. evoKGsim was also able to learn species-agnostic models with different combinations of species for training and testing, effectively addressing the limitations of predicting protein-protein interactions for species with fewer known interactions. Conclusions: evoKGsim can overcome one of the limitations of knowledge graph-based semantic similarity applications: the need to expertly select which aspects should be taken into account for a given application. Applying this methodology to protein-protein interaction prediction proved successful, paving the way to broader applications. [ABSTRACT FROM AUTHOR] (A hedged genetic-programming sketch follows this record.)
- Published
- 2020
- Full Text
- View/download PDF
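The evolve-a-combination idea can be sketched with gplearn's genetic programming classifier, where the feature columns stand in for per-aspect Gene Ontology similarities of a protein pair; gplearn is one possible toolkit here, not necessarily the authors'.

```python
# Toy sketch of evolving a combination of semantic-similarity aspects
# with genetic programming via gplearn. The three feature columns stand
# in for per-aspect Gene Ontology similarities (BP, MF, CC) of a protein
# pair; data and toolkit choice are assumptions.
import numpy as np
from gplearn.genetic import SymbolicClassifier

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))          # sim_BP, sim_MF, sim_CC per pair
y = (0.6 * X[:, 0] + 0.3 * X[:, 1] + 0.1 * X[:, 2] > 0.5).astype(int)

gp = SymbolicClassifier(population_size=500, generations=10,
                        function_set=("add", "mul", "max", "min"),
                        random_state=0)
gp.fit(X, y)
print(gp._program)                      # the evolved combination formula
```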