16,327 results
Search Results
2. A pipeline for the retrieval and extraction of domain-specific information with application to COVID-19 immune signatures.
- Author
- Newton, Adam J. H., Chartash, David, Kleinstein, Steven H., and McDougal, Robert A.
- Subjects
- DATA mining, COVID-19, GENE expression, SARS-CoV-2
- Abstract
Background: The accelerating pace of biomedical publication has made it impractical to manually and systematically identify papers containing specific information and extract that information. This is especially challenging when the information resides beyond titles or abstracts. For emerging science, with a limited set of known papers of interest and an incomplete information model, this is of pressing concern. A timely example in retrospect is the identification of immune signatures (coherent sets of biomarkers) driving differential SARS-CoV-2 infection outcomes. Implementation: We built a classifier to identify papers containing domain-specific information from the document embeddings of the title and abstract. To train this classifier with limited data, we developed an iterative process leveraging pre-trained SPECTER document embeddings, SVM classifiers, and web-enabled expert review to iteratively augment the training set. This training set was then used to create a classifier to identify papers containing domain-specific information. Finally, information was extracted from these papers through a semi-automated system that directly solicited the paper authors to respond via a web-based form. Results: We demonstrate a classifier that retrieves papers with human COVID-19 immune signatures with a positive predictive value of 86%. The type of immune signature (e.g., gene expression vs. other types of profiling) was also identified with a positive predictive value of 74%. Semi-automated queries to the corresponding authors of these publications requesting signature information achieved a 31% response rate. Conclusions: Our results demonstrate the efficacy of using an SVM classifier on document embeddings of the title and abstract to retrieve papers with domain-specific information, even when that information is rarely present in the abstract.
Targeted author engagement based on classifier predictions offers a promising pathway to build a semi-structured representation of such information. Through this approach, partially automated literature mining can help rapidly create semi-structured knowledge repositories for automatic analysis of emerging health threats. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
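The retrieval step in this entry (an SVM over SPECTER title-and-abstract embeddings) can be sketched in a few lines. This is a minimal illustration only: the 8-dimensional synthetic vectors and scikit-learn's `LinearSVC` stand in for real SPECTER embeddings and whatever SVM configuration the authors actually used.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Stand-ins for SPECTER document embeddings: papers containing the
# target information cluster in one region of embedding space,
# irrelevant papers in another.
relevant = rng.normal(loc=1.0, scale=0.3, size=(40, 8))
irrelevant = rng.normal(loc=-1.0, scale=0.3, size=(40, 8))

X = np.vstack([relevant, irrelevant])
y = np.array([1] * 40 + [0] * 40)  # 1 = contains an immune signature

clf = LinearSVC(C=1.0).fit(X, y)

# Score a new, unseen "paper embedding".
new_paper = rng.normal(loc=1.0, scale=0.3, size=(1, 8))
prediction = int(clf.predict(new_paper)[0])
```

In the paper's iterative loop, low-confidence predictions from such a classifier would be routed to expert review and fed back into the training set.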
3. Selected papers from the 15th and 16th international conference on Computational Intelligence Methods for Bioinformatics and Biostatistics.
- Author
- Cazzaniga P, Raposo M, Besozzi D, Merelli I, Staiano A, Ciaramella A, Rizzo R, and Manzoni L
- Published
- 2021
- Full Text
- View/download PDF
4. Biomedical event extraction from abstracts and full papers using search-based structured prediction.
- Author
- Vlachos, Andreas and Craven, Mark
- Subjects
- DATA mining, BIOMEDICAL engineering, MOLECULAR biology, CONFERENCE papers, LEARNING curve
- Abstract
Background: Biomedical event extraction has attracted substantial attention as it can assist researchers in understanding the plethora of interactions among genes that are described in publications in molecular biology. While most recent work has focused on abstracts, the BioNLP 2011 shared task evaluated the submitted systems on both abstracts and full papers. In this article, we describe our submission to the shared task which decomposes event extraction into a set of classification tasks that can be learned either independently or jointly using the search-based structured prediction framework. Our intention is to explore how these two learning paradigms compare in the context of the shared task. Results: We report that models learned using search-based structured prediction exceed the accuracy of independently learned classifiers by 8.3 points in F-score, with the gains being more pronounced on the more complex Regulation events (13.23 points). Furthermore, we show how the trade-off between recall and precision can be adjusted in both learning paradigms and that search-based structured prediction achieves better recall at all precision points. Finally, we report on experiments with a simple domain-adaptation method, resulting in the second-best performance achieved by a single system. Conclusions: We demonstrate that joint inference using the search-based structured prediction framework can achieve better performance than independently learned classifiers, thus demonstrating the potential of this learning paradigm for event extraction and other similarly complex information-extraction tasks. [ABSTRACT FROM AUTHOR]
- Published
- 2012
- Full Text
- View/download PDF
5. The data paper: a mechanism to incentivize data publishing in biodiversity science.
- Author
- Chavan, Vishwas and Penev, Lyubomir
- Subjects
- BIODIVERSITY, DECISION making, PUBLISHING, CONSERVATION biology, ENVIRONMENTAL engineering
- Abstract
Background: Free and open access to primary biodiversity data is essential for informed decision-making to achieve conservation of biodiversity and sustainable development. However, primary biodiversity data are neither easily accessible nor discoverable. Among several impediments, one is a lack of incentives to data publishers for publishing their data resources. One such mechanism currently lacking is recognition through conventional scholarly publication of enriched metadata, which should ensure rapid discovery of 'fit-for-use' biodiversity data resources. Discussion: We review the state of the art of data discovery options and the mechanisms in place for incentivizing data publishers' efforts towards easy, efficient and enhanced publishing, dissemination, sharing and re-use of biodiversity data. We propose the establishment of the 'biodiversity data paper' as one possible mechanism to offer scholarly recognition for efforts and investment by data publishers in authoring rich metadata and publishing them as citable academic papers. While detailing the benefits to data publishers, we describe the objectives, workflow and outcomes of the pilot project commissioned by the Global Biodiversity Information Facility in collaboration with scholarly publishers and pioneered by Pensoft Publishers through its journals Zookeys, PhytoKeys, MycoKeys, BioRisk, NeoBiota, Nature Conservation and the forthcoming Biodiversity Data Journal. We then debate further enhancements of the data paper beyond the pilot project and attempt to forecast the future uptake of data papers as an incentivization mechanism by the stakeholder communities. Conclusions: We believe that in addition to recognition for those involved in the data publishing enterprise, data papers will also expedite publishing of fit-for-use biodiversity data resources. 
However, uptake and establishment of the data paper as a potential mechanism of scholarly recognition requires a high degree of commitment and investment by the cross-sectional stakeholder communities. [ABSTRACT FROM AUTHOR]
- Published
- 2011
- Full Text
- View/download PDF
6. MimoSA: a system for minimotif annotation.
- Author
- Vyas, Jay, Nowling, Ronald J., Meusburger, Thomas, Sargeant, David, Kadaveru, Krishna, Gryk, Michael R., Kundeti, Vamsi, Rajasekaran, Sanguthevar, and Schiller, Martin R.
- Subjects
- PEPTIDES, AMINO acid sequence, DATABASE searching, MACHINE learning, MACHINE theory
- Abstract
Background: Minimotifs are short peptide sequences within one protein, which are recognized by other proteins or molecules. While there are now several minimotif databases, they are incomplete. There are reports of many minimotifs in the primary literature, which have yet to be annotated, while entirely novel minimotifs continue to be published on a weekly basis. Our recently proposed function and sequence syntax for minimotifs enables us to build a general tool that will facilitate structured annotation and management of minimotif data from the biomedical literature. Results: We have built the MimoSA application for minimotif annotation. The application supports management of the Minimotif Miner database, literature tracking, and annotation of new minimotifs. MimoSA enables visualization, organization, selection, and editing of minimotifs and their attributes in the MnM database. For the literature components, MimoSA provides paper-status tracking and scoring of papers for annotation through a freely available machine learning approach, which is based on word correlation. The paper scoring algorithm is also available as a separate program, TextMine. Form-driven annotation of minimotif attributes enables entry of new minimotifs into the MnM database. Several supporting features increase the efficiency of annotation. The layered architecture of MimoSA allows for extensibility by separating the functions of paper scoring, minimotif visualization, and database management. MimoSA is readily adaptable to other annotation efforts that manually curate literature into a MySQL database. Conclusions: MimoSA is an extensible application that facilitates minimotif annotation and integrates with the Minimotif Miner database. We have built MimoSA as an application that integrates dynamic abstract scoring with a high performance relational model of minimotif syntax. 
MimoSA's TextMine, an efficient paper-scoring algorithm, can be used to dynamically rank papers with respect to context. [ABSTRACT FROM AUTHOR]
- Published
- 2010
- Full Text
- View/download PDF
7. A multi-task graph deep learning model to predict drugs combination of synergy and sensitivity scores.
- Author
- Monem, Samar, Hassanien, Aboul Ella, and Abdel-Hamid, Alaa H.
- Subjects
- DRUG synergism, DEEP learning, CROSS-stitch, CELL lines, CANCER cells
- Abstract
Background: Drug combination treatments have proven to be a realistic technique for treating challenging diseases such as cancer by enhancing efficacy and mitigating side effects. To achieve the therapeutic goals of these combinations, it is essential to employ multi-targeted drug combinations, which maximize effectiveness and synergistic effects. Results: This paper proposes 'MultiComb', a multi-task deep learning (MTDL) model designed to simultaneously predict the synergy and sensitivity of drug combinations. The model utilizes a graph convolution network to represent the Simplified Molecular-Input Line-Entry (SMILES) of two drugs, generating their respective features. Also, three fully connected subnetworks extract features of the cancer cell line. These drug and cell line features are then concatenated and processed through an attention mechanism, which outputs two optimized feature representations for the target tasks. The cross-stitch model learns the relationship between these tasks. At last, each learned task feature is fed into fully connected subnetworks to predict the synergy and sensitivity scores. The proposed model is validated using the O'Neil benchmark dataset, which includes 38 unique drugs combined to form 17,901 drug combination pairs and tested across 37 unique cancer cells. The model's performance is evaluated using mean square error (MSE), mean absolute error (MAE), the coefficient of determination (R^2), and Spearman and Pearson correlation scores. The mean synergy scores of the proposed model are 232.37, 9.59, 0.57, 0.76, and 0.73 for these metrics, respectively. Also, the values for mean sensitivity scores are 15.59, 2.74, 0.90, 0.95, and 0.95, respectively. Conclusion: This paper proposes an MTDL model to predict synergy and sensitivity scores for drug combinations targeting specific cancer cell lines. The MTDL model demonstrates superior performance compared to existing approaches. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
8. SurvConvMixer: robust and interpretable cancer survival prediction based on ConvMixer using pathway-level gene expression images.
- Author
- Wang, Shuo, Liu, Yuanning, Zhang, Hao, and Liu, Zhen
- Subjects
- GENE expression, SQUAMOUS cell carcinoma, SURVIVAL analysis (Biometry), FORECASTING, OVERALL survival
- Abstract
Cancer is one of the leading causes of death worldwide. Survival analysis and prediction for cancer patients are of great significance for precision medicine. The robustness and interpretability of survival prediction models are important, where robustness indicates whether a model has genuinely learned the knowledge, and interpretability means a model can show humans what it has learned. In this paper, we propose a robust and interpretable model, SurvConvMixer, which uses pathway-customized gene expression images and ConvMixer for cancer short-term, mid-term and long-term overall survival prediction. With ConvMixer, the representation of each pathway can be learned respectively. We show the robustness of our model by testing the trained model on entirely unseen external datasets. The interpretability of SurvConvMixer depends on gradient-weighted class activation mapping (Grad-CAM), by which we can obtain the pathway-level activation heat map. Then Wilcoxon rank-sum tests are conducted to obtain the statistically significant pathways, thereby revealing which pathways the model focuses on more. SurvConvMixer achieves remarkable performance on the short-term, mid-term and long-term overall survival of lung adenocarcinoma, lung squamous cell carcinoma and skin cutaneous melanoma, and the external validation tests show that SurvConvMixer can generalize to external datasets, so it is robust. Finally, we investigate the activation maps generated by Grad-CAM; after Wilcoxon rank-sum tests and Kaplan–Meier estimation, we find that some survival-related pathways play an important role in SurvConvMixer. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
9. GPDminer: a tool for extracting named entities and analyzing relations in biological literature.
- Author
- Park, Yeon-Ji, Yang, Geun-Je, Sohn, Chae-Bong, and Park, Soo Jun
- Subjects
- KNOWLEDGE acquisition (Expert systems), DATA mining, TEXT mining, DATABASES, INFORMATION retrieval, RESEARCH personnel
- Abstract
Purpose: The expansion of research across various disciplines has led to a substantial increase in published papers and journals, highlighting the necessity for reliable text mining platforms for database construction and knowledge acquisition. This abstract introduces GPDMiner (Gene, Protein, and Disease Miner), a platform designed for the biomedical domain, addressing the challenges posed by the growing volume of academic papers. Methods: GPDMiner is a text mining platform that utilizes advanced information retrieval techniques. It operates by searching PubMed for specific queries, extracting and analyzing information relevant to the biomedical field. This system is designed to discern and illustrate relationships between biomedical entities obtained from automated information extraction. Results: The implementation of GPDMiner demonstrates its efficacy in navigating the extensive corpus of biomedical literature. It efficiently retrieves, extracts, and analyzes information, highlighting significant connections between genes, proteins, and diseases. The platform also allows users to save their analytical outcomes in various formats, including Excel and images. Conclusion: GPDMiner adds notable functionality to the array of text mining tools available for the biomedical field. This tool presents an effective solution for researchers to navigate and extract relevant information from the vast unstructured texts found in biomedical literature, thereby providing distinctive capabilities that set it apart from existing methodologies. Its application is expected to greatly benefit researchers in this domain, enhancing their capacity for knowledge discovery and data management. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
10. Proceedings of the 2010 AMIA Summit on Translational Bioinformatics, March 10-12, 2010, San Francisco, CA, USA.
- Subjects
- CONFERENCE papers, MEDICAL informatics, BIOINFORMATICS, GENOMICS, BIOMARKERS, DRUG side effects
- Abstract
Papers from the 2010 American Medical Informatics Association (AMIA) Summit on Translational Bioinformatics held from March 10-12, 2010 in San Francisco, California are presented including "Mapping Transcription Mechanisms from Multimodal Genomic Data," "Using Gene Co-Expression Network Analysis to Predict Biomarkers for Chronic Lymphocytic Leukemia," and "Mining Multi-Item Drug Adverse Effect Associations in Spontaneous Reporting Systems."
- Published
- 2010
11. Proceedings of the 9th International Conference on Bioinformatics, September 26-28, 2010, Tokyo, Japan.
- Subjects
- CONFERENCE papers, BIOINFORMATICS, RNA, PROTEIN-protein interactions, TAXONOMY
- Abstract
Papers from the 9th International Conference on Bioinformatics, held September 26-28, 2010 in Tokyo, Japan are presented including "Robust and Accurate Prediction of Noncoding RNAs from Aligned Sequences," "Integrating Diverse Biological and Computational sources for Reliable Protein-Protein Interactions," and "DiScRIBinATE: A Rapid Method for Accurate Taxonomic Classification of Metagenomic Sequences."
- Published
- 2010
12. Proceedings of the 2010 MidSouth Computational Biology and Bioinformatics Society (MCBIOS) Conference, February 19-20, 2010, Jonesboro, AR, USA.
- Subjects
- CONFERENCE papers, COMPUTATIONAL biology, BIOINFORMATICS, BOVINE viral diarrhea virus, XYLANASES, GENOMICS
- Abstract
Papers from the 2010 MidSouth Computational Biology and Bioinformatics Society (MCBIOS) Conference, held February 19-20, 2010 in Jonesboro, Arkansas are presented including "Analysis of Bovine Viral Diarrhea Viruses-Infected Monocytes: Identification of Cytopathic and Noncytopathic Biotype Differences," "Enzyme Structure Dynamics of Xylanase I from Trichoderma Longibrachiatum," and "Comparative Genome Analysis of PHB Gene Family Reveals Deep Evolutionary Origins and Diverse Gene Function."
- Published
- 2010
- Full Text
- View/download PDF
13. Prediction of lung cancer using gene expression and deep learning with KL divergence gene selection.
- Author
- Liu, Suli and Yao, Wu
- Subjects
- DEEP learning, CANCER genes, GENE expression, LUNG cancer, ARTIFICIAL neural networks, GENES
- Abstract
Background: Lung cancer is one of the cancers with the highest mortality rate in China. With the rapid development of high-throughput sequencing technology and the application of deep learning methods in recent years, deep neural networks based on gene expression have become a hot research direction in lung cancer diagnosis, providing an effective route to early diagnosis. Building such a model is therefore of great significance for the early diagnosis of lung cancer. However, the main challenges in mining gene expression datasets are the curse of dimensionality and imbalanced data. Existing methods cannot address these problems because of the overwhelming number of measured variables (genes) relative to the small number of samples, which results in poor performance in early diagnosis of lung cancer. Method: Given these disadvantages of gene expression datasets — small sample sizes, high dimensionality and imbalanced data — this paper proposes a gene selection method based on KL divergence, which selects the genes with the highest KL divergence as model features. We then build a deep neural network using focal loss as the loss function and use five-fold cross-validation to select the best model. Result: The deep learning model based on KL divergence gene selection achieves an AUC of 0.99 on the validation set, indicating high generalization performance. Conclusion: The proposed deep neural network model based on KL divergence gene selection proves to be an accurate and effective method for lung cancer prediction. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
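The gene selection step this entry describes can be sketched compactly: score each gene by the KL divergence between its per-class expression histograms and keep the highest-scoring genes. The binning, smoothing, and function names below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """Discrete KL divergence with additive smoothing to avoid log(0)."""
    p = p + eps
    q = q + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def rank_genes_by_kl(expr, labels, bins=10):
    """Score each gene by the KL divergence between its expression
    histograms in the two classes; higher = more discriminative."""
    scores = []
    for g in range(expr.shape[1]):
        col = expr[:, g]
        span = (col.min(), col.max())
        pos, _ = np.histogram(col[labels == 1], bins=bins, range=span)
        neg, _ = np.histogram(col[labels == 0], bins=bins, range=span)
        scores.append(kl_divergence(pos.astype(float), neg.astype(float)))
    return np.argsort(scores)[::-1]  # gene indices, most divergent first

# Toy expression matrix: gene 0 separates the classes, gene 1 does not.
labels = np.array([0] * 50 + [1] * 50)
gene0 = labels.astype(float)
gene1 = np.array([i % 10 for i in range(100)], dtype=float)
expr = np.column_stack([gene0, gene1])
order = rank_genes_by_kl(expr, labels)
```

The top-ranked indices would then feed a downstream classifier (in the paper, a neural network trained with focal loss).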
14. DL-PPI: a method on prediction of sequenced protein–protein interaction based on deep learning.
- Author
- Wu, Jiahui, Liu, Bo, Zhang, Jidong, Wang, Zhihan, and Li, Jianqiang
- Subjects
- DEEP learning, PROTEIN-protein interactions, FEATURE extraction, AMINO acid sequence, SCIENTIFIC community, FORECASTING
- Abstract
Purpose: Sequenced Protein–Protein Interaction (PPI) prediction represents a pivotal area of study in biology, playing a crucial role in elucidating the mechanistic underpinnings of diseases and facilitating the design of novel therapeutic interventions. Conventional methods for extracting features through experimental processes have proven to be both costly and exceedingly complex. In light of these challenges, the scientific community has turned to computational approaches, particularly those grounded in deep learning methodologies. Despite the progress achieved by current deep learning technologies, their effectiveness diminishes when applied to larger, unfamiliar datasets. Results: In this study, the paper introduces a novel deep learning framework, termed DL-PPI, for predicting PPIs based on sequence data. The proposed framework comprises two key components aimed at improving the accuracy of feature extraction from individual protein sequences and capturing relationships between proteins in unfamiliar datasets. 1. Protein Node Feature Extraction Module: To enhance the accuracy of feature extraction from individual protein sequences and facilitate the understanding of relationships between proteins in unknown datasets, the paper devised a novel protein node feature extraction module utilizing the Inception method. This module efficiently captures relevant patterns and representations within protein sequences, enabling more informative feature extraction. 2. Feature-Relational Reasoning Network (FRN): In the Global Feature Extraction module of our model, the paper developed a novel FRN that leveraged Graph Neural Networks to determine interactions between pairs of input proteins. The FRN effectively captures the underlying relational information between proteins, contributing to improved PPI predictions. DL-PPI framework demonstrates state-of-the-art performance in the realm of sequence-based PPI prediction. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
15. ForestSubtype: a cancer subtype identifying approach based on high-dimensional genomic data and a parallel random forest.
- Author
- Luo, Junwei, Feng, Yading, Wu, Xuyang, Li, Ruimin, Shi, Jiawei, Chang, Wenjing, and Wang, Junfeng
- Published
- 2023
- Full Text
- View/download PDF
16. QNetDiff: a quantitative measurement of network rewiring.
- Author
- Nose, Shota, Shiroma, Hirotsugu, Yamada, Takuji, and Uno, Yushi
- Subjects
- LARGE intestine, COLORECTAL cancer, HUMAN body, CANCER patients
- Abstract
Bacteria in the human body, particularly in the large intestine, are known to be associated with various diseases. To identify disease-associated bacteria (markers), a typical method is to statistically compare the relative abundance of bacteria between healthy subjects and diseased patients. However, since bacteria do not necessarily cause diseases in isolation, it is also important to focus on the interactions and relationships among bacteria when examining their association with diseases. In fact, although there are common approaches to represent and analyze bacterial interaction relationships as networks, there are limited methods to find bacteria associated with diseases through network-driven analysis. In this paper, we focus on rewiring of the bacterial network and propose a new method for quantifying the rewiring. We then apply the proposed method to a group of colorectal cancer patients. We show that it can detect bacteria that conventional methods such as abundance comparison cannot. Furthermore, the proposed method is implemented as a general-purpose tool and made available to the general public. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
17. Robust double machine learning model with application to omics data.
- Author
- Wang, Xuqing, Liu, Yahang, Qin, Guoyou, and Yu, Yongfu
- Abstract
Background: Recently, there has been a growing interest in combining causal inference with machine learning algorithms. The double machine learning (DML) model, as an implementation of this combination, has received widespread attention for its effectiveness in estimating causal effects within high-dimensional complex data. However, the DML model is sensitive to the presence of outliers and heavy-tailed noise in the outcome variable. In this paper, we propose the robust double machine learning (RDML) model to achieve a robust estimation of causal effects when the distribution of the outcome is contaminated by outliers or exhibits symmetrically heavy-tailed characteristics. Results: In building the RDML model, we employed median machine learning algorithms to achieve robust predictions for the treatment and outcome variables. Subsequently, we established a median regression model for the prediction residuals. These two steps ensure robust causal effect estimation. Simulation studies show that the RDML model is comparable to the existing DML model when the data follow a normal distribution, while the RDML model has obvious superiority when the data follow a mixed normal distribution or a t-distribution, as reflected in a smaller RMSE. Meanwhile, we also apply the RDML model to the deoxyribonucleic acid methylation dataset from the Alzheimer's disease (AD) neuroimaging initiative database with the aim of investigating the impact of Cerebrospinal Fluid Amyloid-β 42 (CSF Aβ42) on AD severity. Conclusion: These findings illustrate that the RDML model is capable of robustly estimating causal effects, even when the outcome distribution is affected by outliers or displays symmetrically heavy-tailed properties. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
18. REDalign: accurate RNA structural alignment using residual encoder-decoder network.
- Author
- Chen, Chun-Chi, Chan, Yi-Ming, and Jeong, Hyundoo
- Subjects
- RNA analysis, GENOMICS, COMPUTATIONAL complexity, RNA, SUPPLY & demand, DEEP learning
- Abstract
Background: RNA secondary structural alignment serves as a foundational procedure in identifying conserved structural motifs among RNA sequences, crucially advancing our understanding of novel RNAs via comparative genomic analysis. While various computational strategies for RNA structural alignment exist, they often come with high computational complexity. Specifically, when addressing a set of RNAs with unknown structures, the task of simultaneously predicting their consensus secondary structure and determining the optimal sequence alignment requires an overwhelming computational effort of O(L^6) for each RNA pair. Such an extremely high computational complexity makes these methods impractical for large-scale analysis despite their accurate alignment capabilities. Results: In this paper, we introduce REDalign, an innovative approach based on deep learning for RNA secondary structural alignment. By utilizing a residual encoder-decoder network, REDalign can efficiently capture consensus structures and optimize structural alignments. In this learning model, the encoder network leverages a hierarchical pyramid to assimilate high-level structural features. Concurrently, the decoder network, enhanced with residual skip connections, integrates multi-level encoded features to learn detailed feature hierarchies with fewer parameter sets. REDalign significantly reduces computational complexity compared to Sankoff-style algorithms and effectively handles non-nested structures, including pseudoknots, which are challenging for traditional alignment methods. Extensive evaluations demonstrate that REDalign provides superior accuracy and substantial computational efficiency. Conclusion: REDalign presents a significant advancement in RNA secondary structural alignment, balancing high alignment accuracy with lower computational demands. 
Its ability to handle complex RNA structures, including pseudoknots, makes it an effective tool for large-scale RNA analysis, with potential implications for accelerating discoveries in RNA research and comparative genomics. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
19. PangeBlocks: customized construction of pangenome graphs via maximal blocks.
- Author
- Avila Cartes, Jorge, Bonizzoni, Paola, Ciccolella, Simone, Della Vedova, Gianluca, and Denti, Luca
- Subjects
- PAN-genome, REPRESENTATIONS of graphs, LINEAR programming, INTEGER programming, SEQUENCE alignment
- Abstract
Background: The construction of a pangenome graph is a fundamental task in pangenomics. A natural theoretical question is how to formalize the computational problem of building an optimal pangenome graph, making explicit the underlying optimization criterion and the set of feasible solutions. Current approaches build a pangenome graph heuristically, without an explicit optimization criterion. Thus it is unclear how a specific optimization criterion affects the graph topology and downstream analysis, like read mapping and variant calling. Results: In this paper, by leveraging the notion of maximal block in a Multiple Sequence Alignment (MSA), we reframe the pangenome graph construction problem as an exact cover problem on blocks called Minimum Weighted Block Cover (MWBC). Then we propose an Integer Linear Programming (ILP) formulation for the MWBC problem that allows us to study the most natural objective functions for building a graph. We provide an implementation of the ILP approach for solving the MWBC and we evaluate it on SARS-CoV-2 complete genomes, showing how different objective functions lead to pangenome graphs that have different properties, hinting that the specific downstream task can drive the graph construction phase. Conclusion: We show that a customized construction of a pangenome graph based on selecting objective functions has a direct impact on the resulting graphs. In particular, our formalization of the MWBC problem, based on finding an optimal subset of blocks covering an MSA, paves the way to novel practical approaches to graph representations of an MSA where the user can guide the construction. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
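The Minimum Weighted Block Cover described in this entry is an exact-cover problem: pick a subset of blocks that partitions the MSA cells while minimizing total weight. The paper solves it with an ILP; the brute-force toy below (integers standing in for MSA cells, weights assumed for illustration) only shows the objective, not a practical solver.

```python
from itertools import combinations

def min_weight_exact_cover(universe, blocks):
    """Brute-force Minimum Weighted Block Cover on toy data: choose a
    subset of (cells, weight) blocks that partitions `universe` exactly,
    minimizing total weight."""
    best_weight, best_combo = None, None
    for r in range(1, len(blocks) + 1):
        for combo in combinations(blocks, r):
            cells = [c for block, _ in combo for c in block]
            # Exact cover: no cell used twice, every cell of the universe hit.
            if len(cells) == len(set(cells)) and set(cells) == universe:
                weight = sum(w for _, w in combo)
                if best_weight is None or weight < best_weight:
                    best_weight, best_combo = weight, combo
    return best_weight, best_combo

# Toy instance: two disjoint blocks (weight 3 + 3) beat the single
# all-covering block of weight 7; the weight-1 block overlaps both.
universe = {1, 2, 3, 4}
blocks = [({1, 2}, 3), ({3, 4}, 3), ({1, 2, 3, 4}, 7), ({2, 3}, 1)]
best_w, best_cover = min_weight_exact_cover(universe, blocks)
```

Swapping the weight function here corresponds to the paper's point that different objective functions yield different graph topologies.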
20. Biomedical relation extraction method based on ensemble learning and attention mechanism.
- Author
- Jia, Yaxun, Wang, Haoyang, Yuan, Zhu, Zhu, Lian, and Xiang, Zuo-lin
- Subjects
- DEEP learning, MEDICAL research
- Abstract
Background: Relation extraction (RE) plays a crucial role in biomedical research as it is essential for uncovering complex semantic relationships between entities in textual data. Given the significance of RE in biomedical informatics and the increasing volume of literature, there is an urgent need for advanced computational models capable of accurately and efficiently extracting these relationships on a large scale. Results: This paper proposes a novel approach, SARE, combining ensemble learning Stacking and attention mechanisms to enhance the performance of biomedical relation extraction. By leveraging multiple pre-trained models, SARE demonstrates improved adaptability and robustness across diverse domains. The attention mechanisms enable the model to capture and utilize key information in the text more accurately. SARE achieved performance improvements of 4.8, 8.7, and 0.8 percentage points on the PPI, DDI, and ChemProt datasets, respectively, compared to the original BERT variant and the domain-specific PubMedBERT model. Conclusions: SARE offers a promising solution for improving the accuracy and efficiency of relation extraction tasks in biomedical research, facilitating advancements in biomedical informatics. The results suggest that combining ensemble learning with attention mechanisms is effective for extracting complex relationships from biomedical texts. Our code and data are publicly available at: https://github.com/GS233/Biomedical. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
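The stacking idea named above (base models' outputs feeding a meta-learner) can be sketched minimally. The toy data, the two hand-built base scorers, and the ridge-regularised meta-learner are all assumptions for illustration; they are not SARE's pre-trained models or attention mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary-labelled data; the true signal is x0 + 0.5 * x1
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

# Two "pre-trained" base scorers (stand-ins for SARE's base models)
base_scores = np.stack([
    X[:, 0],             # base model 1: uses feature 0 only
    X[:, 0] + X[:, 1],   # base model 2: uses features 0 and 1
], axis=1)

# Stacking: fit a meta-learner (ridge-regularised least squares) on
# the base models' outputs instead of on the raw features.
A = np.hstack([base_scores, np.ones((len(y), 1))])
w = np.linalg.solve(A.T @ A + 1e-3 * np.eye(3), A.T @ y)
meta_pred = (A @ w > 0.5).astype(float)
print("stacked accuracy:", (meta_pred == y).mean())
```

Neither base scorer matches the true signal alone, but a weighted combination of the two can reconstruct it, which is exactly what the meta-learner recovers.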
21. LDAGM: predicting lncRNA-disease associations by graph convolutional auto-encoder and multilayer perceptron based on multi-view heterogeneous networks.
- Author
-
Zhang, Bing, Wang, Haoyu, Ma, Chao, Huang, Hai, Fang, Zhou, and Qu, Jiaxing
- Subjects
FEATURE extraction ,LINCRNA ,ASSOCIATION rule mining ,MICRORNA ,INFORMATION resources management - Abstract
Background: Long non-coding RNAs (lncRNAs) are involved in the prevention, diagnosis, and treatment of a variety of complex human diseases, so it is crucial to establish a method to efficiently predict lncRNA-disease associations. Results: In this paper, we propose a prediction method for lncRNA-disease associations, named LDAGM, based on a Graph Convolutional Autoencoder and a Multilayer Perceptron model. The method first extracts the functional similarity and Gaussian interaction profile kernel similarity of lncRNAs and miRNAs, as well as the semantic similarity and Gaussian interaction profile kernel similarity of diseases. It then constructs six homogeneous networks and deeply fuses them using a deep topology feature extraction method. The fused networks facilitate feature complementation and deep mining of the original association relationships, capturing the deep connections between nodes. Next, by combining the obtained deep topological features with the similarity networks of lncRNA, disease, and miRNA interactions, we construct a multi-view heterogeneous network model. The Graph Convolutional Autoencoder is employed for nonlinear feature extraction. Finally, the extracted nonlinear features are combined with the deep topological features of the multi-view heterogeneous network to obtain the final feature representation of the lncRNA-disease pair, and association prediction is performed with a Multilayer Perceptron. To enhance the performance and stability of the Multilayer Perceptron, we introduce a hidden layer called the aggregation layer, which uses a gate mechanism to control the flow of information between hidden layers, aiming to achieve optimal feature extraction from each layer. 
Conclusions: Parameter analysis, ablation studies, and comparison experiments verified the effectiveness of this method, and case studies verified the accuracy of this method in predicting lncRNA-disease association relationships. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
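The Gaussian interaction profile (GIP) kernel similarity used in the abstract above has a standard closed form: K(i, j) = exp(-γ ‖IP(i) - IP(j)‖²), with γ normalised by the mean squared profile norm. A sketch with a toy association matrix; the bandwidth normalisation follows common GIP usage and may differ in detail from LDAGM's.

```python
import numpy as np

def gip_kernel(assoc):
    """Gaussian interaction profile (GIP) kernel similarity.
    `assoc` is a binary association matrix; each row is a node's
    interaction profile. gamma is normalised by the mean squared
    profile norm, as is standard for GIP kernels."""
    norms = (assoc ** 2).sum(axis=1)
    gamma = 1.0 / norms.mean()
    # squared Euclidean distance between all pairs of profiles
    sq_dist = norms[:, None] + norms[None, :] - 2 * assoc @ assoc.T
    return np.exp(-gamma * sq_dist)

A = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
K = gip_kernel(A)
print(K.round(3))  # identical profiles (rows 0 and 1) get similarity 1.0
```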
22. C-ziptf: stable tensor factorization for zero-inflated multi-dimensional genomics data.
- Author
-
Chafamo, Daniel, Shanmugam, Vignesh, and Tokcan, Neriman
- Subjects
MAXIMUM likelihood statistics ,MATRIX decomposition ,GENE expression ,FACTOR analysis ,BAYESIAN field theory - Abstract
In the past two decades, genomics has advanced significantly, with single-cell RNA-sequencing (scRNA-seq) marking a pivotal milestone. ScRNA-seq provides unparalleled insights into cellular diversity and has spurred diverse studies across multiple conditions and samples, resulting in an influx of complex multidimensional genomics data. This highlights the need for robust methodologies capable of handling the complexity and multidimensionality of such genomics data. Furthermore, single-cell data grapples with sparsity due to issues like low capture efficiency and dropout effects. Tensor factorizations (TF) have emerged as powerful tools to unravel the complex patterns from multi-dimensional genomics data. Classic TF methods, based on maximum likelihood estimation, struggle with zero-inflated count data, while the inherent stochasticity in TFs further complicates result interpretation and reproducibility. Our paper introduces Zero Inflated Poisson Tensor Factorization (ZIPTF), a novel method for high-dimensional zero-inflated count data factorization. We also present Consensus-ZIPTF (C-ZIPTF), merging ZIPTF with a consensus-based approach to address stochasticity. We evaluate our proposed methods on synthetic zero-inflated count data, simulated scRNA-seq data, and real multi-sample multi-condition scRNA-seq datasets. ZIPTF consistently outperforms baseline matrix and tensor factorization methods, displaying enhanced reconstruction accuracy for zero-inflated data. When dealing with high probabilities of excess zeros, ZIPTF achieves up to 2.4× better accuracy. Moreover, C-ZIPTF notably enhances the factorization's consistency. When tested on synthetic and real scRNA-seq data, ZIPTF and C-ZIPTF consistently uncover known and biologically meaningful gene expression programs. Access our data and code at: https://github.com/klarman-cell-observatory/scBTF and https://github.com/klarman-cell-observatory/scbtf_experiments. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
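The zero-inflated Poisson model underlying ZIPTF mixes a structural-zero probability π with an ordinary Poisson(λ) count: P(x=0) = π + (1-π)e^{-λ} and P(x=k) = (1-π)·Pois(k; λ) for k > 0. A sketch of its log-likelihood with invented counts and parameters, showing why excess zeros favour the zero-inflated model over a plain Poisson.

```python
import math

def zip_loglik(x, pi, lam):
    """Log-likelihood of count x under a zero-inflated Poisson:
    with probability pi the count is a structural zero, otherwise
    it is drawn from Poisson(lam)."""
    if x == 0:
        return math.log(pi + (1 - pi) * math.exp(-lam))
    return math.log(1 - pi) + x * math.log(lam) - lam - math.lgamma(x + 1)

counts = [0, 0, 0, 3, 1, 0, 2, 0]        # invented zero-heavy counts
ll_zip  = sum(zip_loglik(c, pi=0.4, lam=2.0) for c in counts)
ll_pois = sum(zip_loglik(c, pi=0.0, lam=2.0) for c in counts)  # plain Poisson
print(ll_zip > ll_pois)  # excess zeros favour the zero-inflated model
```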
23. Mild cognitive impairment prediction based on multi-stream convolutional neural networks.
- Author
-
Lee, Chien-Cheng, Chau, Hong-Han, Wang, Hsiao-Lun, Chuang, Yi-Fang, and Chau, Yawgeng
- Subjects
CONVOLUTIONAL neural networks ,MILD cognitive impairment ,DEEP learning ,COGNITIVE testing ,COGNITION disorders - Abstract
Background: Mild cognitive impairment (MCI) is the transition stage between the cognitive decline expected in normal aging and more severe cognitive decline such as dementia. The early diagnosis of MCI plays an important role in human healthcare. Current methods of MCI detection include cognitive tests to screen for executive function impairments, possibly followed by neuroimaging tests. However, these methods are expensive and time-consuming. Several studies have demonstrated that MCI and dementia can be detected by machine learning technologies from different modality data. This study proposes a multi-stream convolutional neural network (MCNN) model to predict MCI from face videos. Results: The effective data consist of 48 facial videos from 45 participants, including 35 videos from normal cognitive participants and 13 videos from MCI participants. The videos are divided into several segments. Then, the MCNN captures the latent facial spatial features and facial dynamic features of each segment and classifies the segment as MCI or normal. Finally, the aggregation stage produces the final detection results of the input video. We evaluate 27 MCNN model combinations including three ResNet architectures, three optimizers, and three activation functions. The experimental results showed that the ResNet-50 backbone with Swish activation function and Ranger optimizer produces the best results with an F1-score of 89% at the segment level. However, the ResNet-18 backbone with Swish and Ranger achieves an F1-score of 100% at the participant level. Conclusions: This study presents an efficient new method for predicting MCI from facial videos. Studies have shown that MCI can be detected from facial videos, and facial data can be used as a biomarker for MCI. This approach is very promising for developing accurate models for screening MCI through facial data. 
It demonstrates that automated, non-invasive, and inexpensive MCI screening methods are feasible and do not require highly subjective paper-and-pencil questionnaires. Evaluation of 27 model combinations also found that ResNet-50 with Swish is more stable for different optimizers. Such results provide directions for hyperparameter tuning to further improve MCI predictions. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
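The abstract describes an aggregation stage that turns per-segment predictions into one participant-level decision. One plausible rule (the listing does not specify the paper's exact rule) is a majority vote over segment labels:

```python
from collections import Counter

def aggregate(segment_labels):
    """Aggregate per-segment MCI/normal predictions into a single
    participant-level decision by majority vote. This is a plausible
    sketch of the aggregation stage, not necessarily the paper's rule."""
    counts = Counter(segment_labels)
    return counts.most_common(1)[0][0]

print(aggregate(["MCI", "normal", "MCI", "MCI"]))  # MCI
```

This also explains how participant-level scores can exceed segment-level scores: a participant is classified correctly as long as most of their segments are.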
24. VCF observer: a user-friendly software tool for preliminary VCF file analysis and comparison.
- Author
-
Emül, Abdullah Asım, Ergün, Mehmet Arif, Ertürk, Rumeysa Aslıhan, Çinal, Ömer, and Baysan, Mehmet
- Subjects
SCIENTIFIC method ,WEB-based user interfaces ,SOFTWARE development tools ,RESEARCH personnel ,GENETIC variation - Abstract
Background: Advancements over the past decade in DNA sequencing technology and computing power have created the potential to revolutionize medicine. There has been a marked increase in genetic data available, allowing for the advancement of areas such as personalized medicine. A crucial type of data in this context is genetic variant data which is stored in variant call format (VCF) files. However, the rapid growth in genomics has presented challenges in analyzing and comparing VCF files. Results: In response to the limitations of existing tools, this paper introduces a novel web application that provides a user-friendly solution for VCF file analyses and comparisons. The software tool enables researchers and clinicians to perform high-level analysis with ease and enhances productivity. The application's interface allows users to conveniently upload, analyze, and visualize their VCF files using simple drag-and-drop and point-and-click operations. Essential visualizations such as Venn diagrams, clustergrams, and precision–recall plots are provided to users. A key feature of the application is its support for metadata-based file grouping, accomplished through flexible data matrix uploads, streamlining organization and analysis of user-defined categories. Additionally, the application facilitates standardized benchmarking of VCF files by integrating user-provided ground truth regions and variant lists. Conclusions: By providing a user-friendly interface and supporting essential visualizations, this software enhances the accessibility of VCF file analysis and assists researchers and clinicians in their scientific inquiries. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
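The Venn-style comparisons above reduce to set operations on variant keys. A minimal sketch with invented records; real VCF handling also needs multi-allelic splitting and variant normalization, which this toy parser omits.

```python
def variant_keys(vcf_lines):
    """Reduce VCF records to hashable (CHROM, POS, REF, ALT) keys,
    skipping header lines, so call sets can be compared with plain
    set operations (the basis of Venn-diagram comparisons)."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header / metadata line
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        keys.add((chrom, int(pos), ref, alt))
    return keys

a = variant_keys(["##fileformat=VCFv4.2",
                  "1\t100\t.\tA\tG",
                  "1\t200\t.\tC\tT"])
b = variant_keys(["1\t100\t.\tA\tG"])
print(len(a & b), len(a - b))  # 1 shared variant, 1 private to a
```

Precision and recall against a truth set follow directly: true positives are `len(calls & truth)`, so precision is `len(calls & truth) / len(calls)`.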
25. Tensor product algorithms for inference of contact network from epidemiological data.
- Author
-
Dolgov, Sergey and Savostyanov, Dmitry
- Subjects
MARKOV chain Monte Carlo ,BAYESIAN analysis ,MARKOV processes ,BAYESIAN field theory ,TENSOR products - Abstract
We consider a problem of inferring contact network from nodal states observed during an epidemiological process. In a black-box Bayesian optimisation framework this problem reduces to a discrete likelihood optimisation over the set of possible networks. The cardinality of this set grows combinatorially with the number of network nodes, which makes this optimisation computationally challenging. For each network, its likelihood is the probability for the observed data to appear during the evolution of the epidemiological process on this network. This probability can be very small, particularly if the network is significantly different from the ground truth network, from which the observed data actually appear. A commonly used stochastic simulation algorithm struggles to recover rare events and hence to estimate small probabilities and likelihoods. In this paper we replace the stochastic simulation with solving the chemical master equation for the probabilities of all network states. Since this equation also suffers from the curse of dimensionality, we apply tensor train approximations to overcome it and enable fast and accurate computations. Numerical simulations demonstrate efficient black-box Bayesian inference of the network. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
26. Automatic categorization of diverse experimental information in the bioscience literature.
- Subjects
CURATORSHIP ,BIOLOGICAL literature ,DATABASES ,CLASSIFICATION ,GENE expression - Abstract
The article offers information on the curation of information from the bioscience literature into biological knowledge databases, which is a crucial way of capturing experimental information in a computable form. It is mentioned that the first step is to identify, from all published literature, the papers that contain results for the specific data type the curator is interested in annotating.
- Published
- 2012
- Full Text
- View/download PDF
27. ANINet: a deep neural network for skull ancestry estimation.
- Author
-
Pengyue, Lin, Siyuan, Xia, Yi, Jiang, Wen, Yang, Xiaoning, Liu, Guohua, Geng, and Shixiong, Wang
- Subjects
GENEALOGY ,FORENSIC sciences ,SKULL - Abstract
Background: Ancestry estimation from skulls has a wide range of applications in forensic science, anthropology, and facial reconstruction. This study aims to avoid the defects of traditional skull ancestry estimation methods, such as time-consuming and labor-intensive manual calibration of feature points and subjective results. Results: This paper uses the skull depth image as input and, building on AlexNet, introduces the Wide module and SE-block to improve the network, proposing ANINet for ancestry classification. This unified model architecture avoids the subjectivity of manually calibrated feature points, improving both accuracy and efficiency. We use depth projection to obtain local and global depth images of the skull, and experiment with global, local, and local + global methods on datasets of 95 Han skulls and 110 Uyghur skulls, with cross-validation. The experimental results show that the accuracies of the three methods for skull ancestry estimation reached 98.21%, 98.04%, and 99.03%, respectively. Compared with the classic networks AlexNet, Vgg-16, GoogLenet, ResNet-50, DenseNet-121, and SqueezeNet, the proposed network has the advantages of high accuracy and a small number of parameters; compared with state-of-the-art methods, it learns faster and estimates ancestry more accurately. Conclusions: In summary, skull depth images perform well for estimation, and ANINet is an effective approach for skull ancestry estimation. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
28. Deep evolutionary fusion neural network: a new prediction standard for infectious disease incidence rates.
- Author
-
Yao, Tianhua, Chen, Xicheng, Wang, Haojia, Gao, Chengcheng, Chen, Jia, Yi, Dali, Wei, Zeliang, Yao, Ning, Li, Yang, Yi, Dong, and Wu, Yazhou
- Abstract
Background: Numerous methods have been used to predict the incidence trends of infectious diseases, with varying degrees of success. However, there is a lack of prediction benchmarks that integrate linear and nonlinear methods and effectively use internet data. The aim of this paper is to develop a prediction model for infectious disease incidence rates that integrates multiple methods and multisource data. Results: The infectious disease dataset is from an official release and includes four national and three regional datasets. The Baidu index platform provides internet data. We choose single models (seasonal autoregressive integrated moving average (SARIMA), nonlinear autoregressive neural network (NAR), and long short-term memory (LSTM)) and a deep evolutionary fusion neural network (DEFNN). The DEFNN is built using the idea of neural evolution and fusion, and the DEFNN+ is built using multisource data. We compare model accuracy on the reference group data and validate model generalizability on external data. (1) The loss of SA-LSTM on the reference group dataset is 0.4919, which is significantly better than that of the other single models. (2) The loss values of SA-LSTM on the national and regional external datasets are 0.9666, 1.2437, 0.2472, 0.7239, 1.4026, and 0.6868. (3) When multisource indices are added to the national dataset, the losses of the DEFNN+ are 0.4212, 0.8218, 1.0331, and 0.8575. Conclusions: We propose an SA-LSTM optimization model with good accuracy and generalizability based on the concept of multi-method and multi-data fusion. DEFNN enriches and supplements infectious disease prediction methodologies, can serve as a new benchmark for future infectious disease predictions, and provides a reference for predicting the incidence rates of various infectious diseases. 
[ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
29. Popularity and performance of bioinformatics software: the case of gene set analysis.
- Author
-
Xie, Chengshu, Jauhari, Shaurya, and Mora, Antonio
- Subjects
BIOINFORMATICS software ,BIBLIOGRAPHIC databases ,GENES ,POPULARITY ,ONLINE databases ,DISCUSSION in education - Abstract
Background: Gene Set Analysis (GSA) is arguably the method of choice for the functional interpretation of omics results. The following paper explores the popularity and the performance of all the GSA methodologies and software published during the 20 years since its inception. "Popularity" is estimated according to each paper's citation counts, while "performance" is based on a comprehensive evaluation of the validation strategies used by papers in the field, as well as the consolidated results from the existing benchmark studies. Results: Regarding popularity, data is collected into an online open database ("GSARefDB") which allows browsing bibliographic and method-descriptive information from 503 GSA paper references; regarding performance, we introduce a repository of Jupyter workflows and Shiny apps for automated benchmarking of GSA methods ("GSA-BenchmarKING"). After comparing popularity versus performance, results show discrepancies between the most popular and the best performing GSA methods. Conclusions: The above results draw attention to the tool selection procedures followed by researchers and raise doubts regarding the quality of the functional interpretation of biological datasets in current biomedical studies. Suggestions for the future of the functional interpretation field are made, including strategies for education and discussion of GSA tools, better validation and benchmarking practices, reproducibility, and functional re-analysis of previously reported data. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
30. Optimizing biomedical information retrieval with a keyword frequency-driven prompt enhancement strategy.
- Author
-
Aftab, Wasim, Apostolou, Zivkos, Bouazoune, Karim, and Straub, Tobias
- Subjects
LANGUAGE models ,NATURAL language processing ,GENERATIVE pre-trained transformers ,NATURAL languages ,INFORMATION retrieval ,CHATBOTS - Abstract
Background: Mining the vast pool of biomedical literature to extract accurate responses and relevant references is challenging due to the domain's interdisciplinary nature, specialized jargon, and continuous evolution. Early natural language processing (NLP) approaches often led to incorrect answers as they failed to comprehend the nuances of natural language. However, transformer models have significantly advanced the field by enabling the creation of large language models (LLMs), enhancing question-answering (QA) tasks. Despite these advances, current LLM-based solutions for specialized domains like biology and biomedicine still struggle to generate up-to-date responses while avoiding "hallucination" or generating plausible but factually incorrect responses. Results: Our work focuses on enhancing prompts using a retrieval-augmented architecture to guide LLMs in generating meaningful responses for biomedical QA tasks. We evaluated two approaches: one relying on text embedding and vector similarity in a high-dimensional space, and our proposed method, which uses explicit signals in user queries to extract meaningful contexts. For robust evaluation, we tested these methods on 50 specific and challenging questions from diverse biomedical topics, comparing their performance against a baseline model, BM25. Retrieval performance of our method was significantly better than others, achieving a median Precision@10 of 0.95, which indicates the fraction of the top 10 retrieved chunks that are relevant. We used GPT-4, OpenAI's most advanced LLM, to maximize answer quality, and manually assessed the LLM-generated responses. Our method achieved a median answer quality score of 2.5, surpassing both the baseline model and the text embedding-based approach. 
We developed a QA bot, WeiseEule (https://github.com/wasimaftab/WeiseEule-LocalHost), which utilizes these methods for comparative analysis and also offers advanced features for review writing and identifying relevant articles for citation. Conclusions: Our findings highlight the importance of prompt enhancement methods that utilize explicit signals in user queries over traditional text embedding-based approaches to improve LLM-generated responses for specialized queries in specialized domains such as biology and biomedicine. By providing users complete control over the information fed into the LLM, our approach addresses some of the major drawbacks of existing web-based chatbots and LLM-based QA systems, including hallucinations and the generation of irrelevant or outdated responses. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
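The Precision@10 metric reported above is simply the fraction of the top 10 retrieved chunks that are relevant. A sketch with invented chunk IDs:

```python
def precision_at_k(retrieved, relevant, k=10):
    """Fraction of the top-k retrieved items that are relevant.
    `retrieved` is ranked best-first; `relevant` is the gold set."""
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / len(top)

retrieved = ["c1", "c7", "c3", "c9", "c2"]   # hypothetical ranked chunks
relevant = {"c1", "c3", "c2", "c8"}          # hypothetical gold set
print(precision_at_k(retrieved, relevant, k=5))  # 0.6
```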
31. MSH-DTI: multi-graph convolution with self-supervised embedding and heterogeneous aggregation for drug-target interaction prediction.
- Author
-
Zhang, Beiyi, Niu, Dongjiang, Zhang, Lianwei, Zhang, Qiang, and Li, Zhen
- Subjects
DRUG target ,PREDICTION models ,DRUG interactions ,MULTIGRAPH ,PHARMACOLOGY - Abstract
Background: The rise of network pharmacology has led to the widespread use of network-based computational methods in predicting drug target interaction (DTI). However, existing DTI prediction models typically rely on a limited amount of data to extract drug and target features, potentially affecting the comprehensiveness and robustness of features. In addition, although multiple networks are used for DTI prediction, the integration of heterogeneous information often involves simplistic aggregation and attention mechanisms, which may impose certain limitations. Results: MSH-DTI, a deep learning model for predicting drug-target interactions, is proposed in this paper. The model uses self-supervised learning methods to obtain drug and target structure features. A Heterogeneous Interaction-enhanced Feature Fusion Module is designed for multi-graph construction, and the graph convolutional networks are used to extract node features. With the help of an attention mechanism, the model focuses on the important parts of different features for prediction. Experimental results show that the AUROC and AUPR of MSH-DTI are 0.9620 and 0.9605 respectively, outperforming other models on the DTINet dataset. Conclusion: The proposed MSH-DTI is a helpful tool to discover drug-target interactions, which is also validated through case studies in predicting new DTIs. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
32. A comparative analysis of mutual information methods for pairwise relationship detection in metagenomic data.
- Author
-
Francis, Dallace and Sun, Fengzhu
- Subjects
BIOLOGICAL systems ,ECOSYSTEMS ,STATISTICAL correlation ,INFORMATION networks ,METAGENOMICS ,RANK correlation (Statistics) ,PEARSON correlation (Statistics) - Abstract
Background: Construction of co-occurrence networks in metagenomic data often employs correlation to infer pairwise relationships between microbes. However, biological systems are complex and often display qualities non-linear in nature. Therefore, the reliance on correlation alone may overlook important relationships and fail to capture the full breadth of intricacies presented in underlying interaction networks. It is of interest to incorporate metrics that are not only robust in detecting linear relationships, but non-linear ones as well. Results: In this paper, we explore the use of various mutual information (MI) estimation approaches for quantifying pairwise relationships in biological data and compare their performances against two traditional measures–Pearson's correlation coefficient, r, and Spearman's rank correlation coefficient, ρ. Metrics are tested on both simulated data designed to mimic pairwise relationships that may be found in ecological systems and real data from a previous study on C. diff infection. The results demonstrate that, in the case of asymmetric relationships, mutual information estimators can provide better detection ability than Pearson's or Spearman's correlation coefficients. Specifically, we find that these estimators have elevated performances in the detection of exploitative relationships, demonstrating the potential benefit of including them in future metagenomic studies. Conclusions: Mutual information (MI) can uncover complex pairwise relationships in biological data that may be missed by traditional measures of association. The inclusion of such relationships when constructing co-occurrence networks can result in a more comprehensive analysis than the use of correlation alone. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
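A simple binned mutual information (MI) estimator illustrates why MI can detect non-linear links that Pearson's r misses. The histogram estimator and the quadratic toy relationship below are illustrative assumptions; the paper compares several more refined MI estimators.

```python
import numpy as np

def mutual_info(x, y, bins=10):
    """Histogram (binned) estimate of mutual information in nats:
    MI = sum p(x,y) * log( p(x,y) / (p(x) p(y)) ) over occupied bins."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0  # avoid log(0) on empty bins
    return float((pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])).sum())

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 5000)
y = x ** 2 + rng.normal(0, 0.05, 5000)   # strong but symmetric, non-linear link
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.2f}, MI = {mutual_info(x, y):.2f}")
```

For this symmetric quadratic relationship Pearson's r is near zero while the MI estimate is clearly positive, which is the motivation for including MI when building co-occurrence networks.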
33. BEROLECMI: a novel prediction method to infer circRNA-miRNA interaction from the role definition of molecular attributes and biological networks.
- Author
-
Wang, Xin-Fei, Yu, Chang-Qing, You, Zhu-Hong, Wang, Yan, Huang, Lan, Qiao, Yan, Wang, Lei, and Li, Zheng-Wei
- Subjects
COMPETITIVE endogenous RNA ,BIOLOGICAL networks ,MORPHOLOGY ,NON-coding RNA ,PREDICTION models ,CIRCULAR RNA - Abstract
Circular RNA (circRNA)–microRNA (miRNA) interaction (CMI) is an important model for the regulation of biological processes by non-coding RNA (ncRNA), providing a new perspective for the study of complex human diseases. However, existing CMI prediction models mainly rely on the nearest-neighbor structure in the biological network and ignore the molecular network topology, making it difficult to improve prediction performance. In this paper, we propose a new CMI prediction method, BEROLECMI, which uses molecular sequence attributes, molecular self-similarity, and biological network topology to define role-specific feature representations for molecules and infer new CMIs. BEROLECMI effectively makes up for the lack of network topology in existing CMI prediction models and achieves the highest prediction performance on three commonly used datasets. In the case study, 14 of the 15 pairs of unknown CMIs were correctly predicted. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
34. Drug repositioning based on residual attention network and free multiscale adversarial training.
- Author
-
Li, Guanghui, Li, Shuwen, Liang, Cheng, Xiao, Qiu, and Luo, Jiawei
- Subjects
DRUG repositioning ,BIPARTITE graphs ,DRUG development ,THERAPEUTICS ,FORECASTING - Abstract
Background: Conducting traditional wet experiments to guide drug development is an expensive, time-consuming and risky process. Analyzing drug function and repositioning plays a key role in identifying new therapeutic potential of approved drugs and discovering therapeutic approaches for untreated diseases. Exploring drug-disease associations has far-reaching implications for identifying disease pathogenesis and treatment. However, reliable detection of drug-disease relationships via traditional methods is costly and slow. Therefore, investigations into computational methods for predicting drug-disease associations are currently needed. Results: This paper presents a novel drug-disease association prediction method, RAFGAE. First, RAFGAE integrates known associations between diseases and drugs into a bipartite network. Second, RAFGAE designs the Re_GAT framework, which includes multilayer graph attention networks (GATs) and two residual networks. The multilayer GATs are utilized for learning the node embeddings, which is achieved by aggregating information from multihop neighbors. The two residual networks are used to alleviate the deep network oversmoothing problem, and an attention mechanism is introduced to combine the node embeddings from different attention layers. Third, two graph autoencoders (GAEs) with collaborative training are constructed to simulate label propagation to predict potential associations. On this basis, free multiscale adversarial training (FMAT) is introduced. FMAT enhances node feature quality through small gradient adversarial perturbation iterations, improving the prediction performance. Finally, tenfold cross-validations on two benchmark datasets show that RAFGAE outperforms current methods. In addition, case studies have confirmed that RAFGAE can detect novel drug-disease associations. Conclusions: The comprehensive experimental results validate the utility and accuracy of RAFGAE. 
We believe that this method may serve as an excellent predictor for identifying unobserved disease-drug associations. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
35. Maptcha: an efficient parallel workflow for hybrid genome scaffolding.
- Author
-
Bhowmik, Oieswarya, Rahman, Tazin, and Kalyanaraman, Ananth
- Subjects
WORKFLOW ,HEURISTIC - Abstract
Background: Genome assembly, which involves reconstructing a target genome, relies on scaffolding methods to organize and link partially assembled fragments. The rapid evolution of long read sequencing technologies toward more accurate long reads, coupled with the continued use of short read technologies, has created a unique need for hybrid assembly workflows. The construction of accurate genomic scaffolds in hybrid workflows is complicated due to scale, sequencing technology diversity (e.g., short vs. long reads, contigs or partial assemblies), and repetitive regions within a target genome. Results: In this paper, we present a new parallel workflow for hybrid genome scaffolding that would allow combining pre-constructed partial assemblies with newly sequenced long reads toward an improved assembly. More specifically, the workflow, called Maptcha, is aimed at generating long scaffolds of a target genome, from two sets of input sequences: an already constructed partial assembly of contigs, and a set of newly sequenced long reads. Our scaffolding approach internally uses an alignment-free mapping step to build a ⟨contig, contig⟩ graph using long reads as linking information. Subsequently, this graph is used to generate scaffolds. We present and evaluate a graph-theoretic "wiring" heuristic to perform this scaffolding step. To enable efficient workload management in a parallel setting, we use a batching technique that partitions the scaffolding tasks so that the more expensive alignment-based assembly step at the end can be efficiently parallelized. This step also allows the use of any standalone assembler for generating the final scaffolds. Conclusions: Our experiments with Maptcha on a variety of input genomes, and comparison against two state-of-the-art hybrid scaffolders demonstrate that Maptcha is able to generate longer and more accurate scaffolds substantially faster. 
In almost all cases, the scaffolds produced by Maptcha are at least an order of magnitude longer (in some cases two orders) than the scaffolds produced by state-of-the-art tools. Maptcha runs significantly faster too, reducing time-to-solution from hours to minutes for most input cases. We also performed a coverage experiment by varying the sequencing coverage depth for long reads, which demonstrated the potential of Maptcha to generate significantly longer scaffolds in low coverage settings (1×–10×). [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
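The ⟨contig, contig⟩ linking graph described above can be sketched by counting long reads that map to more than one contig. The mapping inputs below are invented, and Maptcha's actual mapping step is alignment-free and parallel; this only illustrates the linking idea.

```python
from collections import defaultdict

def contig_graph(read_to_contigs):
    """Build a <contig, contig> graph: two contigs get an edge whenever
    the same long read maps to both, weighted by the number of
    supporting reads. Edge keys are sorted contig-name pairs."""
    edges = defaultdict(int)
    for contigs in read_to_contigs.values():
        uniq = sorted(set(contigs))
        for i in range(len(uniq)):
            for j in range(i + 1, len(uniq)):
                edges[(uniq[i], uniq[j])] += 1
    return dict(edges)

# Hypothetical long-read-to-contig mappings
reads = {"read1": ["c1", "c2"], "read2": ["c2", "c3"], "read3": ["c1", "c2"]}
print(contig_graph(reads))
```

Paths through this weighted graph (the paper's "wiring" heuristic) then order and orient contigs into scaffolds.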
36. Occlusion enhanced pan-cancer classification via deep learning.
- Author
-
Zhao, Xing, Chen, Zigui, Wang, Huating, and Sun, Hao
- Subjects
ARTIFICIAL neural networks ,GENE expression ,SHORT-term memory ,LONG-term memory ,DEEP learning ,CANCER diagnosis - Abstract
Quantitative measurement of RNA expression levels through RNA-Seq is an ideal replacement for conventional cancer diagnosis via microscope examination. Currently, cancer-related RNA-Seq studies focus on two aspects: classifying the status and tissue of origin of a sample and discovering marker genes. Existing studies typically identify marker genes by statistically comparing healthy and cancer samples. However, this approach overlooks marker genes with low expression level differences and may be influenced by experimental results. This paper introduces "GENESO," a novel framework for pan-cancer classification and marker gene discovery using the occlusion method in conjunction with deep learning. We first trained a baseline deep LSTM neural network capable of distinguishing the origins and statuses of samples utilizing RNA-Seq data. Then, we propose a novel marker gene discovery method called "Symmetrical Occlusion (SO)". It collaborates with the baseline LSTM network, mimicking the "gain of function" and "loss of function" of genes to evaluate their importance in pan-cancer classification quantitatively. By identifying the genes of utmost importance, we then isolate them to train new neural networks, resulting in higher-performance LSTM models that utilize only a reduced set of highly relevant genes. The baseline neural network achieves an impressive validation accuracy of 96.59% in pan-cancer classification. With the help of SO, the accuracy of the second network reaches 98.30%, while using 67% fewer genes. Notably, our method excels in identifying marker genes that are not differentially expressed. Moreover, we assessed the feasibility of our method using single-cell RNA-Seq data, employing known marker genes as a validation test. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
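The "Symmetrical Occlusion" idea described above, perturbing each gene's expression up (gain of function) and down (loss of function) and measuring the shift in the classifier's output, can be sketched as follows. This is a minimal illustration with a toy linear predictor; the function name and the perturbation size `delta` are our assumptions, not the paper's.

```python
import numpy as np

def symmetrical_occlusion_importance(predict, X, delta=1.0):
    """Score each gene by perturbing its expression up (+delta, 'gain of
    function') and down (-delta, 'loss of function') and averaging the
    absolute shift in the model's output."""
    base = predict(X)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        up, down = X.copy(), X.copy()
        up[:, j] += delta
        down[:, j] -= delta
        scores[j] = 0.5 * (np.abs(predict(up) - base).mean()
                           + np.abs(predict(down) - base).mean())
    return scores

# Toy predictor: only gene 0 influences the output.
w = np.array([2.0, 0.0, 0.0])
predict = lambda X: X @ w
X = np.random.default_rng(0).normal(size=(50, 3))
scores = symmetrical_occlusion_importance(predict, X)
```

On this toy model only gene 0 drives the output, so it receives the highest importance score.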
37. Effective type label-based synergistic representation learning for biomedical event trigger detection.
- Author
-
Hao, Anran, Yuan, Haohan, Hui, Siu Cheung, and Su, Jian
- Abstract
Background: Detecting event triggers in biomedical texts, which contain domain knowledge and context-dependent terms, is more challenging than in general-domain texts. Most state-of-the-art models rely mainly on external resources such as linguistic tools and knowledge bases to improve system performance. However, they lack effective mechanisms to obtain semantic clues from label specification and sentence context. Given its success in image classification, label representation learning is a promising approach to enhancing biomedical event trigger detection models by leveraging the rich semantics of pre-defined event type labels. Results: In this paper, we propose the Biomedical Label-based Synergistic representation Learning (BioLSL) model, which effectively utilizes event type labels by learning their correlation with trigger words and enriches the representation contextually. The BioLSL model consists of three modules. Firstly, the Domain-specific Joint Encoding module employs a transformer-based, domain-specific pre-trained architecture to jointly encode input sentences and pre-defined event type labels. Secondly, the Label-based Synergistic Representation Learning module learns the semantic relationships between input texts and event type labels, and generates a Label-Trigger Aware Representation (LTAR) and a Label-Context Aware Representation (LCAR) for enhanced semantic representations. Finally, the Trigger Classification module makes structured predictions, where each label is predicted with respect to its neighbours. We conduct experiments on three benchmark BioNLP datasets, namely MLEE, GE09, and GE11, to evaluate our proposed BioLSL model. Results show that BioLSL has achieved state-of-the-art performance, outperforming the baseline models. Conclusions: The proposed BioLSL model demonstrates good performance for biomedical event trigger detection without using any external resources. 
This suggests that label representation learning and context-aware enhancement are promising directions for improving the task. The key enhancement is that BioLSL effectively learns to construct semantic linkages between the event mentions and type labels, which provide the latent information of label-trigger and label-context relationships in biomedical texts. Moreover, additional experiments on BioLSL show that it performs exceptionally well with limited training data under the data-scarce scenarios. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
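The label-trigger alignment at the heart of BioLSL, scoring each token of a sentence against pre-defined event-type label embeddings, can be reduced to a cosine-similarity sketch. The embeddings here are toy vectors; in the real model they come from a jointly encoded, domain-specific transformer.

```python
import numpy as np

def label_aware_scores(token_emb, label_emb):
    """Cosine similarity between each token embedding (rows) and each
    event-type label embedding; the per-row argmax gives the best-aligned
    label for a candidate trigger word."""
    t = token_emb / np.linalg.norm(token_emb, axis=1, keepdims=True)
    l = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
    return t @ l.T

tokens = np.array([[1.0, 0.0], [0.0, 1.0]])   # two token embeddings
labels = np.array([[0.9, 0.1], [0.1, 0.9]])   # two event-type embeddings
S = label_aware_scores(tokens, labels)
```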
38. DGCPPISP: a PPI site prediction model based on dynamic graph convolutional network and two-stage transfer learning.
- Author
-
Feng, Zijian, Huang, Weihong, Li, Haohao, Zhu, Hancan, Kang, Yanlei, and Li, Zhong
- Abstract
Background: Proteins play a pivotal role in a diverse array of biological processes, making the precise prediction of protein–protein interaction (PPI) sites critical to numerous disciplines including biology, medicine and pharmacy. While deep learning methods have progressively been implemented for the prediction of PPI sites within proteins, the task of enhancing their predictive performance remains an arduous challenge. Results: In this paper, we propose a novel PPI site prediction model (DGCPPISP) based on a dynamic graph convolutional neural network and a two-stage transfer learning strategy. Initially, we implement transfer learning from dual perspectives, namely feature input and model training, to supply efficacious prior knowledge for our model. Subsequently, we construct a network for the second stage of training, built on the foundation of dynamic graph convolution. Conclusions: To evaluate its effectiveness, the performance of the DGCPPISP model is scrutinized using two benchmark datasets. The results demonstrate that DGCPPISP outshines competing methods in terms of performance. Specifically, DGCPPISP surpasses the second-best method, EGRET, by margins of 5.9%, 10.1%, and 13.3% for the F1-measure, AUPRC, and MCC metrics respectively on Dset_186_72_PDB164. Similarly, on Dset_331, it eclipses the performance of the runner-up method, HN-PPISP, by 14.5%, 19.8%, and 29.9% respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
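A dynamic graph convolution, as opposed to a static one, rebuilds the neighbourhood graph from the current node features before aggregating. A minimal sketch in plain NumPy (the actual DGCPPISP layer is richer):

```python
import numpy as np

def knn_adjacency(feats, k):
    """Rebuild a k-nearest-neighbour adjacency from the current node
    features; a dynamic graph convolution does this before every layer
    instead of reusing a fixed input graph."""
    d = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # no self-neighbours
    nbrs = np.argsort(d, axis=1)[:, :k]
    A = np.zeros((len(feats), len(feats)))
    for i, js in enumerate(nbrs):
        A[i, js] = 1.0
    return A

def dynamic_conv_layer(feats, W, k=2):
    """Mean-aggregate each node's k nearest neighbours, project, ReLU."""
    A = knn_adjacency(feats, k)
    h = (A @ feats) / A.sum(axis=1, keepdims=True)
    return np.maximum(h @ W, 0.0)

feats = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
out = dynamic_conv_layer(feats, np.eye(2), k=2)
```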
39. Multioviz: an interactive platform for in silico perturbation and interrogation of gene regulatory networks.
- Author
-
Xie, Helen, Crawford, Lorin, and Conard, Ashley Mae
- Abstract
In this paper, we aim to build a platform that helps bridge the gap between high-dimensional computation and wet-lab experimentation by allowing users to interrogate genomic signatures at multiple molecular levels and identify the best next actionable steps for downstream decision making. We introduce Multioviz: a publicly accessible R package and web application platform to easily perform in silico hypothesis testing of generated gene regulatory networks. We demonstrate the utility of Multioviz by conducting an end-to-end analysis in a statistical genetics application focused on measuring the effect of in silico perturbations of complex trait architecture. By using a real dataset from the Wellcome Trust Centre for Human Genetics, we both recapitulate previous findings and propose hypotheses about the genes involved in the percentage of immune CD8+ cells found in heterogeneous stocks of mice. Source code for the Multioviz R package is available at https://github.com/lcrawlab/multio-viz and an interactive version of the platform is available at https://multioviz.ccv.brown.edu/. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
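An in silico perturbation of a gene regulatory network can be as simple as removing a node and its edges and re-running the downstream analysis. A toy version of such a knockout (the dict-of-lists representation is our choice, not Multioviz's API):

```python
def perturb_node(grn, node):
    """In silico 'knockout': drop a gene and all edges touching it from a
    gene regulatory network stored as an adjacency dict mapping each gene
    to the genes it regulates."""
    return {g: [t for t in targets if t != node]
            for g, targets in grn.items() if g != node}

grn = {"TF1": ["geneA", "geneB"], "TF2": ["geneA"], "geneA": []}
knocked = perturb_node(grn, "geneA")
```

Comparing a trait association before and after such a perturbation is the kind of hypothesis test the platform exposes interactively.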
40. AFITbin: a metagenomic contig binning method using aggregate l-mer frequency based on initial and terminal nucleotides.
- Author
-
Darabi, Amin, Sobhani, Sayeh, Aghdam, Rosa, and Eslahchi, Changiz
- Subjects
BIOTIC communities ,METAGENOMICS ,NUCLEOTIDES ,MATRIX decomposition ,MICROBIAL communities ,MICROBIAL ecology - Abstract
Background: Using next-generation sequencing technologies, scientists can sequence complex microbial communities directly from the environment. Significant insights into the structure, diversity, and ecology of microbial communities have resulted from the study of metagenomics. The assembly of reads into longer contigs, which are then binned into groups of contigs that correspond to different species in the metagenomic sample, is a crucial step in the analysis of metagenomics. It is necessary to organize these contigs into operational taxonomic units (OTUs) for further taxonomic profiling and functional analysis. For binning, which is synonymous with the clustering of OTUs, the tetra-nucleotide frequency (TNF) is typically utilized as a compositional feature for each OTU. Results: In this paper, we present AFIT, a new l-mer statistic vector for each contig, and AFITBin, a novel method for metagenomic binning based on AFIT and a matrix factorization method. To evaluate the performance of the AFIT vector, the t-SNE algorithm is used to compare species clustering based on AFIT and TNF information. In addition, the efficacy of AFITBin is demonstrated on both simulated and real datasets in comparison to state-of-the-art binning methods such as MetaBAT 2, MaxBin 2.0, CONCOCT, MetaCon, SolidBin, BusyBee Web, and MetaBinner. To further analyze the performance of the proposed AFIT vector, we compare the barcodes of the AFIT vector and the TNF vector. Conclusion: The results demonstrate that AFITBin shows superior performance in taxonomic identification compared to existing methods, leveraging the AFIT vector for improved results in metagenomic binning. This approach holds promise for advancing the analysis of metagenomic data, providing more reliable insights into microbial community composition and function. Availability: A python package is available at: https://github.com/SayehSobhani/AFITBin. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
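Judging from the title, AFIT aggregates l-mer frequencies by their initial and terminal nucleotides, which collapses the 4^l possible l-mers to a fixed 16-dimensional vector per contig. A sketch of that statistic (the paper's exact definition may differ):

```python
from collections import Counter

def aggregate_lmer_freq(contig, l=4):
    """Count l-mers in a contig, then aggregate the counts by each l-mer's
    initial and terminal nucleotide, yielding a normalized 16-dimensional
    compositional vector (AFIT-style sketch)."""
    counts = Counter(contig[i:i + l] for i in range(len(contig) - l + 1))
    bases = "ACGT"
    vec = {(a, b): 0 for a in bases for b in bases}
    total = sum(counts.values())
    for kmer, c in counts.items():
        vec[(kmer[0], kmer[-1])] += c
    return [vec[(a, b)] / total for a in bases for b in bases]

v = aggregate_lmer_freq("ACGTACGTACGT")
```

Unlike TNF's 4^4 = 256 raw tetramer dimensions, this aggregate stays 16-dimensional for any l.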
41. Enhancing SNV identification in whole-genome sequencing data through the incorporation of known genetic variants into the minimap2 index.
- Author
-
Guguchkin, Egor, Kasianov, Artem, Belenikin, Maksim, Zobkova, Gaukhar, Kosova, Ekaterina, Makeev, Vsevolod, and Karpulevich, Evgeny
- Subjects
WHOLE genome sequencing ,GENETIC variation ,NUCLEOTIDE sequencing ,GENETIC databases ,GENOME-wide association studies ,SINGLE nucleotide polymorphisms - Abstract
Motivation: Alignment of reads to a reference genome sequence is one of the key steps in the analysis of human whole-genome sequencing data obtained through Next-generation sequencing (NGS) technologies. The quality of the subsequent steps of the analysis, such as the results of clinical interpretation of genetic variants or the results of a genome-wide association study, depends on the correct identification of the position of the read as a result of its alignment. The amount of human NGS whole-genome sequencing data is constantly growing. There are a number of human genome sequencing projects worldwide that have resulted in the creation of large-scale databases of genetic variants of sequenced human genomes. Such information about known genetic variants can be used to improve the quality of alignment at the read alignment stage when analysing sequencing data obtained for a new individual, for example, by creating a genomic graph. While existing methods for aligning reads to a linear reference genome have high alignment speed, methods for aligning reads to a genomic graph have greater accuracy in variable regions of the genome. The development of a read alignment method that takes into account known genetic variants in the linear reference sequence index allows combining the advantages of both sets of methods. Results: In this paper, we present the minimap2_index_modifier tool, which enables the construction of a modified index of a reference genome using known single nucleotide variants and insertions/deletions (indels) specific to a given human population. The use of the modified minimap2 index improves variant calling quality without modifying the bioinformatics pipeline and without significant additional computational overhead. 
Using the PrecisionFDA Truth Challenge V2 benchmark data (for HG002 short-read data aligned to the GRCh38 linear reference (GCA_000001405.15) with parameters k = 27 and w = 14) it was demonstrated that the number of false negative genetic variants decreased by more than 9500, and the number of false positives decreased by more than 7000 when modifying the index with genetic variants from the Human Pangenome Reference Consortium. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
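One simple way to make read alignment aware of known variation is to substitute known population SNVs into the reference before indexing, so reads carrying the alternate allele still map cleanly. A toy sketch of that substitution step (a simplification of the idea; the tool itself modifies the minimap2 minimizer index using SNVs and indels):

```python
def apply_snvs(reference, snvs):
    """Produce an alternate haplotype of a linear reference by substituting
    known SNVs, given as a dict of 0-based position -> alternate base."""
    seq = list(reference)
    for pos, alt in snvs.items():
        seq[pos] = alt
    return "".join(seq)

# Two known SNVs injected into a tiny reference.
alt = apply_snvs("ACGTACGT", {1: "T", 6: "A"})
```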
42. METASEED: a novel approach to full-length 16S rRNA gene reconstruction from short read data.
- Author
-
Philip, Melcy, Rudi, Knut, Ormaasen, Ida, Angell, Inga Leena, Pettersen, Ragnhild, Keeley, Nigel B., and Snipen, Lars-Gustav
- Subjects
RIBOSOMAL RNA ,SHOTGUN sequencing ,DATABASES ,GENES ,DATA analysis - Abstract
Background: With the emergence of Oxford Nanopore technology, on-site sequencing of 16S rRNA from environmental samples is now possible. Due to the level and structure of sequencing errors, the analysis of such data requires a database of reference sequences. However, many taxa from complex and diverse environments have poor representation in publicly available databases. In this paper, we propose the METASEED pipeline for the reconstruction of full-length 16S sequences from such environments, in order to improve the reference database for subsequent on-site sequencing. Results: We show that combining high-precision short-read sequencing of both 16S and the full metagenome from the same samples allows us to reconstruct high-quality 16S sequences from the more abundant taxa. A significant novelty is the carefully designed collection of metagenome reads that match the 16S amplicons, based on a combination of uniqueness and abundance. Compared to alternative approaches, this produces superior results. Conclusion: Our pipeline will facilitate numerous studies of various unknown microorganisms, aiding the comprehension of diverse environments. The pipeline is a potential tool for generating a full-length 16S rRNA gene database for any environment. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
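The core of METASEED's read collection, matching full-metagenome reads to 16S amplicons, can be sketched as a shared-k-mer filter. The thresholds below are our assumptions; the pipeline's actual collection additionally weighs k-mer uniqueness and abundance.

```python
def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def matching_reads(amplicon, reads, k=8, min_shared=3):
    """Collect metagenome reads sharing at least `min_shared` k-mers with
    a 16S amplicon."""
    target = kmers(amplicon, k)
    return [r for r in reads if len(kmers(r, k) & target) >= min_shared]

reads = ["ACGTACGTAC", "TTTTTTTTTT"]
hits = matching_reads("ACGTACGTACGTACGT", reads)
```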
43. Dyport: dynamic importance-based biomedical hypothesis generation benchmarking technique.
- Author
-
Tyagin, Ilya and Safro, Ilya
- Subjects
KNOWLEDGE graphs ,BENCHMARKING (Management) ,NATURAL language processing ,HYPOTHESIS ,SCIENTIFIC discoveries ,SEMANTICS - Abstract
Background: Automated hypothesis generation (HG) focuses on uncovering hidden connections within the extensive information that is publicly available. This domain has become increasingly popular, thanks to modern machine learning algorithms. However, the automated evaluation of HG systems is still an open problem, especially on a larger scale. Results: This paper presents Dyport, a novel benchmarking framework for evaluating biomedical hypothesis generation systems. Utilizing curated datasets, our approach tests these systems under realistic conditions, enhancing the relevance of our evaluations. We integrate knowledge from the curated databases into a dynamic graph, accompanied by a method to quantify discovery importance. This assesses not only the accuracy of hypotheses but also their potential impact in biomedical research, which significantly extends traditional link prediction benchmarks. The applicability of our benchmarking process is demonstrated on several link prediction systems applied to biomedical semantic knowledge graphs. Being flexible, our benchmarking system is designed for broad application in hypothesis generation quality verification, aiming to expand the scope of scientific discovery within the biomedical research community. Conclusions: Dyport is an open-source benchmarking framework for evaluating biomedical hypothesis generation systems, which takes into account knowledge dynamics, semantics and impact. All code and datasets are available at: https://github.com/IlyaTyagin/Dyport. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
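The essence of dynamics-aware benchmarking is to check whether generated hypotheses materialise in a later snapshot of the knowledge graph. A minimal sketch (Dyport additionally weights each discovery by a quantified importance):

```python
def time_sliced_precision(predicted_pairs, future_edges):
    """Fraction of predicted connections that appear as edges in a later
    snapshot of the knowledge graph; edges are treated as undirected."""
    future = {frozenset(e) for e in future_edges}
    hits = sum(frozenset(p) in future for p in predicted_pairs)
    return hits / len(predicted_pairs)

# One of two hypothesized links is confirmed by the later snapshot.
prec = time_sliced_precision([("aspirin", "geneX"), ("aspirin", "geneY")],
                             [("geneX", "aspirin")])
```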
44. PyMulSim: a method for computing node similarities between multilayer networks via graph isomorphism networks.
- Author
-
Cinaglia, Pietro
- Subjects
BIOLOGICAL networks ,BIOLOGICAL systems ,TEST validity ,STATISTICAL significance ,STATISTICS - Abstract
Background: In bioinformatics, interactions are modelled as networks, based on graph models. Generally, these support a single-layer structure which incorporates a specific entity (i.e., node) and only one type of link (i.e., edge). However, real-world biological systems consist of biological objects belonging to heterogeneous entities, which operate in and influence each other across multiple contexts simultaneously. Usually, node similarities are investigated to assess the relatedness between biological objects in a network of interest, and node embeddings are widely used for studying novel interactions from a topological point of view. The state of the art offers several methods for evaluating node similarity within a given network, but methodologies for evaluating similarities between pairs of nodes belonging to different networks are missing. The latter are crucial for studies that relate different biological networks, e.g., for Network Alignment or to evaluate the possible evolution of the interactions of a little-known network on the basis of a well-known one. Existing methods are ineffective in evaluating nodes outside their own structure, even more so in the context of multilayer networks, in which the field still relies on approaches adapted from static networks. In this paper, we present pyMulSim, a novel method for computing the pairwise similarities between nodes belonging to different multilayer networks. It uses a Graph Isomorphism Network (GIN) for representation learning of node features; the resulting embeddings are then used to compute the similarities between pairs of nodes from different multilayer networks. Results: Our experimentation investigated the performance of our method. Results show that our method effectively evaluates the similarities between the biological objects of a source multilayer network and those of a target one, based on the analysis of the node embeddings. Results were also assessed at different noise levels, with statistical significance analyses performed for this purpose. Conclusions: PyMulSim is a novel method for computing the pairwise similarities between nodes belonging to different multilayer networks, using a GIN to learn node embeddings. It has been evaluated in terms of both performance and validity, reporting a high degree of reliability. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
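A Graph Isomorphism Network layer, the building block pyMulSim uses for node embeddings, can be sketched in a few lines of NumPy (the real model stacks such layers with a learned MLP and handles multilayer structure):

```python
import numpy as np

def gin_layer(A, H, W, eps=0.0):
    """One GIN layer: sum neighbours' features (A @ H), add (1 + eps)
    times the node's own features, then apply a ReLU projection. The
    resulting rows are node embeddings whose pairwise similarities can be
    compared across networks."""
    return np.maximum(((1 + eps) * H + A @ H) @ W, 0.0)

A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)  # triangle graph
emb = gin_layer(A, np.eye(3), np.eye(3))
```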
45. Integrating transformers and many-objective optimization for drug design.
- Author
-
Aksamit, Nicholas, Hou, Jinqiang, Li, Yifeng, and Ombuki-Berman, Beatrice
- Subjects
DRUG design ,PARTICLE swarm optimization ,EVOLUTIONARY algorithms ,ARTIFICIAL intelligence ,LYSOPHOSPHOLIPIDS ,COMPUTATIONAL intelligence - Abstract
Background: Drug design is a challenging and important task that requires the generation of novel and effective molecules that can bind to specific protein targets. Artificial intelligence algorithms have recently shown promising potential to expedite the drug design process. However, existing methods adopt multi-objective approaches, which limit the number of objectives that can be considered. Results: In this paper, we expand this thread of research from the many-objective perspective by proposing a novel framework that integrates a latent Transformer-based model for molecular generation with a drug design system that incorporates absorption, distribution, metabolism, excretion, and toxicity prediction, molecular docking, and many-objective metaheuristics. We compare the performance of two latent Transformer models (ReLSO and FragNet) on a molecular generation task and show that ReLSO outperforms FragNet in terms of reconstruction and latent space organization. We then explore six different many-objective metaheuristics based on evolutionary algorithms and particle swarm optimization on a drug design task involving potential drug candidates to human lysophosphatidic acid receptor 1, a cancer-related protein target. Conclusion: We show that the multi-objective evolutionary algorithm based on dominance and decomposition performs best in terms of finding molecules that satisfy many objectives, such as high binding affinity, low toxicity, and high drug-likeness. Our framework demonstrates the potential of combining Transformers and many-objective computational intelligence for drug design. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
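Many-objective metaheuristics of the kind compared above all rest on Pareto dominance. A minimal implementation, assuming every objective is minimised:

```python
def dominates(a, b):
    """True if objective vector `a` Pareto-dominates `b`: no worse in
    every objective and strictly better in at least one (minimization)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def non_dominated(front):
    """Keep the solutions that no other solution in `front` dominates."""
    return [a for a in front
            if not any(dominates(b, a) for b in front if b is not a)]
```

For example, with objective vectors like (binding energy, toxicity), (1, 2) dominates (2, 2), but neither (1, 2) nor (2, 1) dominates the other.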
46. Predicting anticancer synergistic drug combinations based on multi-task learning.
- Author
-
Chen, Danyi, Wang, Xiaowen, Zhu, Hongming, Jiang, Yizhi, Li, Yulong, Liu, Qi, and Liu, Qin
- Subjects
ANTINEOPLASTIC combined chemotherapy protocols ,ARTIFICIAL neural networks ,DRUG synergism ,DRUG delivery systems ,PEARSON correlation (Statistics) ,DEEP learning ,DRUG carriers - Abstract
Background: The discovery of anticancer drug combinations is a crucial part of anticancer treatment. In recent years, pre-screening drug combinations with synergistic effects in a large-scale search space using computational methods, especially deep learning methods, has become increasingly popular with researchers. Although achievements have been made in predicting anticancer synergistic drug combinations based on deep learning, the application of multi-task learning in this field is relatively rare. The successful practice of multi-task learning in various fields shows that it can effectively learn multiple tasks jointly and improve the performance of all the tasks. Methods: In this paper, we propose MTLSynergy, which is based on multi-task learning and deep neural networks, to predict synergistic anticancer drug combinations. It simultaneously learns two crucial prediction tasks in anticancer treatment: synergy prediction for drug combinations and sensitivity prediction for monotherapy. MTLSynergy also integrates the classification and regression of the prediction tasks into the same model. Moreover, autoencoders are employed to reduce the dimensions of input features. Results: Compared with the previous methods listed in this paper, MTLSynergy achieves the lowest mean square error of 216.47 and the highest Pearson correlation coefficient of 0.76 on the drug synergy prediction task. On the corresponding classification task, the area under the receiver operator characteristics curve and the area under the precision–recall curve are 0.90 and 0.62, respectively, which are equivalent to the comparison methods. Through the ablation study, we verify that multi-task learning and the autoencoder both have a positive effect on prediction performance. In addition, the prediction results of MTLSynergy in many cases are also consistent with previous studies.
Conclusion: Our study suggests that multi-task learning is significantly beneficial for both drug synergy prediction and monotherapy sensitivity prediction when combining these two tasks into one model. The ability of MTLSynergy to discover new anticancer synergistic drug combinations notably outperforms other state-of-the-art methods. MTLSynergy promises to be a powerful tool to pre-screen anticancer synergistic drug combinations. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
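MTLSynergy's joint learning of synergy and monotherapy sensitivity follows the hard-parameter-sharing pattern: a shared encoder feeding one head per task. A forward-pass sketch (layer sizes and activations are illustrative, not the paper's architecture):

```python
import numpy as np

def mtl_forward(x, W_shared, W_synergy, W_sensitivity):
    """Hard parameter sharing: one encoder produces a representation that
    feeds two task heads (drug-pair synergy and monotherapy sensitivity).
    A joint loss over both heads trains the shared weights."""
    h = np.maximum(x @ W_shared, 0.0)          # shared representation
    return h @ W_synergy, h @ W_sensitivity    # one output per task

x = np.ones((5, 4))                            # 5 samples, 4 input features
syn, sens = mtl_forward(x, np.ones((4, 3)), np.ones((3, 1)), np.ones((3, 1)))
```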
47. Optimizing diabetes classification with a machine learning-based framework.
- Author
-
Feng, Xin, Cai, Yihuai, and Xin, Ruihao
- Subjects
ARTIFICIAL pancreases ,GENERATIVE adversarial networks ,DIABETES ,MACHINE learning ,DATABASES ,METABOLIC disorders ,BLOOD sugar ,HUNGER - Abstract
Background: Diabetes is a metabolic disorder usually caused by insufficient secretion of insulin from the pancreas or insensitivity of cells to insulin, resulting in long-term elevated blood sugar levels in patients. Patients usually present with frequent urination, thirst, and hunger. If left untreated, it can lead to various complications that can affect essential organs and even endanger life. Therefore, developing an intelligent diagnosis framework for diabetes is necessary. Result: This paper proposes a machine learning-based diabetes classification framework, machine learning optimized GAN. The framework encompasses several methodological approaches to address the diverse challenges encountered during the analysis: the mean and median joint filling method for handling missing values, the cap method for outlier processing, and SMOTEENN to mitigate sample imbalance. Additionally, the framework incorporates the proposed Diabetes Classification Model based on a Generative Adversarial Network and employs logistic regression for detailed feature analysis. The effectiveness of the framework is evaluated using both the PIMA dataset and the diabetes dataset obtained from the GEO database. The experimental findings show that our model achieved exceptional results, including a binary classification accuracy of 96.27%, a tertiary classification accuracy of 99.31%, a precision and F1 score of 0.9698, a recall of 0.9698, and an AUC of 0.9702. Conclusion: The experimental results show that the framework proposed in this paper can accurately classify diabetes and provide new ideas for intelligent diagnosis of diabetes. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
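Two of the preprocessing steps named above, median filling of missing values and the "cap" method for outliers, can be sketched together. The 5%/95% capping quantiles here are our assumption, not the paper's:

```python
import numpy as np

def median_fill_and_cap(x, lo_q=0.05, hi_q=0.95):
    """Fill missing values (NaN) with the median of the observed values,
    then cap outliers at the chosen lower/upper quantiles."""
    x = np.asarray(x, dtype=float).copy()
    x[np.isnan(x)] = np.nanmedian(x)
    lo, hi = np.quantile(x, [lo_q, hi_q])
    return np.clip(x, lo, hi)

y = median_fill_and_cap([1.0, 2.0, 3.0, np.nan, 100.0])
```

The NaN is replaced by the median 2.5, and the outlier 100 is pulled down to the 95% quantile.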
48. Raman spectroscopy-based prediction of ofloxacin concentration in solution using a novel loss function and an improved GA-CNN model.
- Author
-
Ma, Chenyu, Shi, Yuanbo, Huang, Yueyang, and Dai, Gongwei
- Subjects
DEEP learning ,RECURRENT neural networks ,STANDARD deviations ,GAUSSIAN function ,KERNEL functions - Abstract
Background: Raman spectroscopy can quickly and accurately measure the concentration of ofloxacin in solution, with advantages in accuracy and rapidity over traditional detection methods. However, manual analysis methods for the collected Raman spectral data often ignore the nonlinear characteristics of the data and cannot accurately predict the concentration of the target sample. Methods: To address this drawback, this paper proposes a novel kernel-Huber loss function that combines the Huber loss function with the Gaussian kernel function. This function is used with an improved genetic algorithm-convolutional neural network (GA-CNN) to model and predict the Raman spectral data of different concentrations of ofloxacin in solution. In addition, the paper introduces recurrent neural network (RNN), long short-term memory (LSTM), bidirectional long short-term memory (BiLSTM) and gated recurrent unit (GRU) models for comparison, using root mean square error (RMSE) and residual predictive deviation (RPD) as evaluation metrics. Results: The proposed method achieved an R² of 0.9989 on the test set data, an improvement of 3% over the traditional CNN. In the experiments with the RNN, LSTM, BiLSTM, and GRU models, evaluated using RMSE, RPD, and other metrics, the proposed method consistently outperformed these models. Conclusions: This paper demonstrates the effectiveness of the proposed method for predicting the concentration of ofloxacin in solution based on Raman spectral data, discusses the advantages and limitations of the proposed method, and offers a deep learning solution to Raman-spectrum-based concentration prediction. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
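The abstract does not give the exact formula, but one plausible reading of a "kernel-Huber" loss is to weight each residual's Huber value by a Gaussian kernel of the residual. A sketch under that assumption:

```python
import numpy as np

def huber(r, delta=1.0):
    """Classic Huber loss: quadratic for |r| <= delta, linear beyond."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

def kernel_huber_loss(y_true, y_pred, delta=1.0, sigma=1.0):
    """Gaussian-kernel-weighted Huber loss: large residuals are both
    linearized (Huber) and down-weighted (kernel). This is our reading of
    the combination; the paper's exact formulation may differ."""
    r = y_true - y_pred
    w = np.exp(-r**2 / (2 * sigma**2))
    return float(np.mean(w * huber(r, delta)))

loss = kernel_huber_loss(np.array([1.0, 2.0]), np.array([1.2, 1.7]))
```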
49. LncRNA–protein interaction prediction with reweighted feature selection.
- Author
-
Lv, Guohao, Xia, Yingchun, Qi, Zhao, Zhao, Zihao, Tang, Lianggui, Chen, Cheng, Yang, Shuai, Wang, Qingyong, and Gu, Lichuan
- Subjects
FEATURE selection ,OBSERVATIONAL learning ,FORECASTING ,PROTEIN-protein interactions ,AMINO acid sequence - Abstract
LncRNA–protein interactions are ubiquitous in organisms and play a crucial role in a variety of biological processes and complex diseases. Because the experimental techniques to detect lncRNA–protein interactions are laborious and time-consuming, many computational methods have been reported for lncRNA–protein interaction prediction. To address this challenge, this paper proposes a reweighting boosting feature selection (RBFS) method to select key features. Specifically, a reweighting approach adjusts the contribution of each observed sample to model fitting; samples with higher weights influence the fit more than those with lower weights. Feature selection with boosting can efficiently and iteratively rank important features to obtain the optimal feature subset. In the experiments, the RBFS method is applied to the prediction of lncRNA–protein interactions. The experimental results demonstrate that our method achieves higher accuracy and less redundancy with fewer features. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
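A boosting-flavoured sketch of reweighted feature selection: greedily pick the feature with the largest sample-weighted association with the label, then upweight the samples a one-feature fit explains worst. This illustrates the RBFS idea; the paper's exact weighting and ranking rules may differ.

```python
import numpy as np

def reweighted_feature_ranking(X, y, n_select):
    """Greedy, sample-weighted feature ranking. Each round: score every
    feature by its weighted covariance with the label, select the best,
    fit a one-feature predictor, and reweight samples by their residuals
    so poorly explained samples drive the next selection."""
    n, p = X.shape
    w = np.full(n, 1.0 / n)
    selected = []
    for _ in range(n_select):
        scores = np.abs((w * (X - X.mean(0)).T * (y - y.mean())).sum(axis=1))
        scores[selected] = -1.0                 # never re-pick a feature
        j = int(scores.argmax())
        selected.append(j)
        beta = (w * X[:, j] * y).sum() / ((w * X[:, j] ** 2).sum() + 1e-12)
        w = np.abs(y - beta * X[:, j]) + 1e-12  # upweight poorly fit samples
        w /= w.sum()
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 1]                                     # only feature 1 drives y
order = reweighted_feature_ranking(X, y, n_select=2)
```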
50. Extracting cancer concepts from clinical notes using natural language processing: a systematic review.
- Author
-
Gholipour, Maryam, Khajouei, Reza, Amiri, Parastoo, Hajesmaeel Gohari, Sadrieh, and Ahmadian, Leila
- Subjects
NATURAL language processing ,MEDICAL language ,RESEARCH personnel ,LUNG cancer ,ENGLISH language - Abstract
Background: Extracting information from free texts using natural language processing (NLP) can save time and reduce the hassle of manually extracting large quantities of data from the highly complex clinical notes of cancer patients. This study aimed to systematically review studies that used NLP methods to automatically identify cancer concepts from clinical notes. Methods: PubMed, Scopus, Web of Science, and Embase were searched for English-language papers using a combination of the terms concerning "Cancer", "NLP", "Coding", and "Registries" until June 29, 2021. Two reviewers independently assessed the eligibility of papers for inclusion in the review. Results: Most of the software programs reported for concept extraction were developed by the researchers themselves (n = 7). Rule-based algorithms were the most frequently used algorithms for developing these programs. In most articles, the criteria of accuracy (n = 14) and sensitivity (n = 12) were used to evaluate the algorithms. In addition, Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT) and the Unified Medical Language System (UMLS) were the most commonly used terminologies to identify concepts. Most studies focused on breast cancer (n = 4, 19%) and lung cancer (n = 4, 19%). Conclusion: The use of NLP for extracting cancer concepts and symptoms has increased in recent years. Rule-based algorithms are popular among developers. Given these algorithms' high accuracy and sensitivity in identifying and extracting cancer concepts, we suggest that future studies use them to extract the concepts of other diseases as well. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
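The rule-based concept extractors the review found most common boil down to pattern matching followed by mapping to a terminology such as SNOMED-CT or UMLS. A minimal illustration (the patterns and concept names here are ours, not from any reviewed system):

```python
import re

# Toy rule set: each concept maps to a regular-expression pattern.
CANCER_PATTERNS = {
    "breast cancer": re.compile(r"\bbreast\s+(cancer|carcinoma)\b", re.I),
    "lung cancer": re.compile(r"\blung\s+(cancer|carcinoma)\b", re.I),
}

def extract_concepts(note):
    """Return the cancer concepts whose patterns match a clinical note;
    real systems would map each match to a SNOMED-CT/UMLS code."""
    return [c for c, pat in CANCER_PATTERNS.items() if pat.search(note)]

found = extract_concepts("History of breast carcinoma; no lung lesions.")
```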