1,632 results
Search Results
2. The data paper: a mechanism to incentivize data publishing in biodiversity science
- Author
- Lyubomir Penev and Vishwas Chavan
- Subjects
- 0106 biological sciences; Computer science; Biodiversity; Pilot Projects; Data publishing; lcsh:Computer applications to medicine. Medical informatics; 010603 evolutionary biology; 01 natural sciences; Biochemistry; Workflow; 12. Responsible consumption; Access to Information; 03 medical and health sciences; Structural Biology; Molecular Biology; lcsh:QH301-705.5; 030304 developmental biology; Publishing; Sustainable development; 0303 health sciences; business.industry; Research; Applied Mathematics; Stakeholder; Data discovery; 15. Life on land; Data science; Intellectual Property; Computer Science Applications; Metadata; lcsh:Biology (General); lcsh:R858-859.7; Periodicals as Topic; business; Global biodiversity
- Abstract
Background Free and open access to primary biodiversity data is essential for informed decision-making to achieve conservation of biodiversity and sustainable development. However, primary biodiversity data are neither easily accessible nor discoverable. Among several impediments, one is a lack of incentives to data publishers for publishing of their data resources. One such mechanism currently lacking is recognition through conventional scholarly publication of enriched metadata, which should ensure rapid discovery of 'fit-for-use' biodiversity data resources. Discussion We review the state of the art of data discovery options and the mechanisms in place for incentivizing data publishers efforts towards easy, efficient and enhanced publishing, dissemination, sharing and re-use of biodiversity data. We propose the establishment of the 'biodiversity data paper' as one possible mechanism to offer scholarly recognition for efforts and investment by data publishers in authoring rich metadata and publishing them as citable academic papers. While detailing the benefits to data publishers, we describe the objectives, work flow and outcomes of the pilot project commissioned by the Global Biodiversity Information Facility in collaboration with scholarly publishers and pioneered by Pensoft Publishers through its journals Zookeys, PhytoKeys, MycoKeys, BioRisk, NeoBiota, Nature Conservation and the forthcoming Biodiversity Data Journal. We then debate further enhancements of the data paper beyond the pilot project and attempt to forecast the future uptake of data papers as an incentivization mechanism by the stakeholder communities. Conclusions We believe that in addition to recognition for those involved in the data publishing enterprise, data papers will also expedite publishing of fit-for-use biodiversity data resources. 
However, uptake and establishment of the data paper as a potential mechanism of scholarly recognition requires a high degree of commitment and investment by the cross-sectional stakeholder communities.
- Published
- 2011
3. EZH2 as a prognostic-related biomarker in lung adenocarcinoma correlating with cell cycle and immune infiltrates
- Author
- Kui Fan, Bo-hui Zhang, Deng Han, and Yun-chuan Sun
- Subjects
- Structural Biology; Applied Mathematics; Molecular Biology; Biochemistry; Computer Science Applications
- Abstract
Backgrounds It has been observed that high levels of enhancer of zeste homolog 2 (EZH2) expression are associated with unsatisfactory prognoses and can be found in a wide range of malignancies. However, the effects of EZH2 on Lung Adenocarcinoma (LUAD) remain elusive. Through the integration of bioinformatic analyses, the present paper sought to ascertain the effects of EZH2 in LUAD. Methods The TIMER and UALCAN databases were applied to analyze mRNA and protein expression data for EZH2 in LUAD. The result of immunohistochemistry was obtained from the HPA database, and the survival curve was drawn according to the library provided by the HPA database. The LinkedOmics database was utilized to investigate the co-expressed genes and signal transduction pathways with EZH2. Up- and down-regulated genes from The Linked Omics database were introduced to the CMap database to predict potential drug targets for LUAD using the CMap database. The association between EZH2 and cancer-infiltrating immunocytes was studied through TIMER and TISIDB. In addition, this paper explores the relationship between EZH2 mRNA expression and NSCLC OS using the Kaplan–Meier plotter database to further validate and complement the research. Furthermore, the correlation between EZH2 expression and EGFR genes, KRAS genes, BRAF genes, and smoking from the Cancer Genome Atlas (TCGA) database is analyzed. Results In contrast to paracancer specimens, the mRNA and protein levels of EZH2 were higher in LUAD tissues. Significantly, high levels of EZH2 were associated with unsatisfactory prognoses in LUAD patients. Additionally, the coexpressed genes of EZH2 were predominantly associated with numerous cell growth-associated pathways, including the cell cycle, DNA replication, RNA transport, and the p53 signaling pathway, according to Gene Ontology and Kyoto Encyclopedia of Genes and Genomes pathways. 
The results of TCGA database revealed that the expression of EZH2 was lower in normal tissues than in lung cancer tissues (p p r = 0.3129, p Conclusions Highly expressed EZH2 is a predictor of a suboptimal prognosis in LUAD and may serve as a prognostic marker and target gene for LUAD. The underlying cause may be associated with the synergistic effect of KRAS, immune cell infiltration, and metabolic processes.
- Published
- 2023
4. Avoiding background knowledge: literature based discovery from important information
- Author
- Judita Preiss
- Subjects
- Structural Biology; Applied Mathematics; Molecular Biology; Biochemistry; Computer Science Applications
- Abstract
Background Automatic literature based discovery attempts to uncover new knowledge by connecting existing facts: information extracted from existing publications in the form of $A \rightarrow B$ and $B \rightarrow C$ relations can be simply connected to deduce $A \rightarrow C$. However, using this approach, the quantity of proposed connections is often too vast to be useful. It can be reduced by using subject $\rightarrow$ (predicate) $\rightarrow$ object triples as the $A \rightarrow B$ relations, but too many proposed connections remain for manual verification. Results Based on the hypothesis that only a small number of subject–predicate–object triples extracted from a publication represent the paper’s novel contribution(s), we explore using BERT embeddings to identify these before literature based discovery is performed utilizing only these important triples. While the method exploits the availability of full texts of publications in the CORD-19 dataset (making use of the fact that a novel contribution is likely to be mentioned in both an abstract and the body of a paper) to build a training set, the resulting tool can be applied to papers with only abstracts available. Candidate hidden knowledge pairs generated from unfiltered triples and those built from important triples only are compared using a variety of timeslicing gold standards. Conclusions The quantity of proposed knowledge pairs is reduced by a factor of $10^3$, and we show that when the gold standard is designed to avoid rewarding background knowledge, the precision obtained increases up to a factor of 10. We argue that the gold standard needs to be carefully considered, and release as yet undiscovered candidate knowledge pairs based on important triples alongside this work.
- Published
- 2023
5. GENTLE: a novel bioinformatics tool for generating features and building classifiers from T cell repertoire cancer data
- Author
- Dhiego Souto Andrade, Patrick Terrematte, César Rennó-Costa, Alona Zilberberg, and Sol Efroni
- Subjects
- Structural Biology; Applied Mathematics; Molecular Biology; Biochemistry; Computer Science Applications
- Abstract
Background In the global effort to discover biomarkers for cancer prognosis, prediction tools have become essential resources. TCR (T cell receptor) repertoires contain important features that differentiate healthy controls from cancer patients or differentiate outcomes for patients being treated with different drugs. Consequently, tools that can easily and quickly generate and identify important features out of TCR repertoire data and build accurate classifiers to predict future outcomes are essential. Results This paper introduces GENTLE (GENerator of T cell receptor repertoire features for machine LEarning): an open-source, user-friendly web-application tool that allows TCR repertoire researchers to discover important features; to create classifier models and evaluate them with metrics; and to quickly generate visualizations for data interpretations. We performed a case study with repertoires of TRegs (regulatory T cells) and TConvs (conventional T cells) from healthy controls versus patients with breast cancer. We showed that diversity features were able to distinguish between the groups. Moreover, the classifiers built with these features could correctly classify samples (‘Healthy’ or ‘Breast Cancer’) from the TRegs repertoire when trained with the TConvs repertoire, and from the TConvs repertoire when trained with the TRegs repertoire. Conclusion The paper walks through installing and using GENTLE and presents a case study and results to demonstrate the application’s utility. GENTLE is geared towards any researcher working with TCR repertoire data and aims to discover predictive features from these data and build accurate classifiers. GENTLE is available on https://github.com/dhiego22/gentle and https://share.streamlit.io/dhiego22/gentle/main/gentle.py.
- Published
- 2023
6. Structure-based kernels for the prediction of catalytic residues and their involvement in human inherited disease
- Author
- David Neil Cooper, Steven Myers, Yong Fuga Li, Fuxiao Xin, Sean D. Mooney, and Predrag Radivojac
- Subjects
- Statistics and Probability; Computer science; Computational biology; Biology; Biochemistry; Catalysis; Conserved sequence; Enzyme catalysis; Protein structure; Protein sequencing; Artificial Intelligence; Structural Biology; Catalytic Domain; Humans; Amino Acid Sequence; Peptide sequence; Molecular Biology; Genetics; Applied Mathematics; Genetic Diseases, Inborn; Computational Biology; Proteins; Original Papers; Enzymes; Computer Science Applications; Computational Mathematics; Kernel method; Computational Theory and Mathematics; Mutation; Structure based; Inherited disease; DNA microarray; Algorithms; Software
- Abstract
Motivation: Enzyme catalysis is involved in numerous biological processes and the disruption of enzymatic activity has been implicated in human disease. Despite this, various aspects of catalytic reactions are not completely understood, such as the mechanics of reaction chemistry and the geometry of catalytic residues within active sites. As a result, the computational prediction of catalytic residues has the potential to identify novel catalytic pockets, aid in the design of more efficient enzymes and also predict the molecular basis of disease. Results: We propose a new kernel-based algorithm for the prediction of catalytic residues based on protein sequence, structure and evolutionary information. The method relies upon explicit modeling of similarity between residue-centered neighborhoods in protein structures. We present evidence that this algorithm evaluates favorably against established approaches, and also provides insights into the relative importance of the geometry, physicochemical properties and evolutionary conservation of catalytic residue activity. The new algorithm was used to identify known mutations associated with inherited disease whose molecular mechanism might be predicted to operate specifically though the loss or gain of catalytic residues. It should, therefore, provide a viable approach to identifying the molecular basis of disease in which the loss or gain of function is not caused solely by the disruption of protein stability. Our analysis suggests that both mechanisms are actively involved in human inherited disease. Availability and Implementation: Source code for the structural kernel is available at www.informatics.indiana.edu/predrag/ Contact: predrag@indiana.edu Supplementary information: Supplementary data are available at Bioinformatics online.
- Published
- 2010
7. RSAP-Net: joint optic disc and cup segmentation with a residual spatial attention path module and MSRCR-PT pre-processing algorithm
- Author
- Yun, Jiang; Zeqi, Ma; Chao, Wu; Zequn, Zhang; Wei, Yan
- Subjects
- Structural Biology; Applied Mathematics; Molecular Biology; Biochemistry; Computer Science Applications
- Abstract
Background Glaucoma can cause irreversible blindness to people’s eyesight. Since there are no symptoms in its early stage, it is particularly important to accurately segment the optic disc (OD) and optic cup (OC) from fundus medical images for the screening and prevention of glaucoma. In recent years, the mainstream method of OD and OC segmentation is convolution neural network (CNN). However, most existing CNN methods segment OD and OC separately and ignore the a priori information that OC is always contained inside the OD region, which makes the segmentation accuracy of most methods not high enough. Methods This paper proposes a new encoder–decoder segmentation structure, called RSAP-Net, for joint segmentation of OD and OC. We first designed an efficient U-shaped segmentation network as the backbone. Considering the spatial overlap relationship between OD and OC, a new Residual spatial attention path is proposed to connect the encoder–decoder to retain more characteristic information. In order to further improve the segmentation performance, a pre-processing method called MSRCR-PT (Multi-Scale Retinex Colour Recovery and Polar Transformation) has been devised. It incorporates a multi-scale Retinex colour recovery algorithm and a polar coordinate transformation, which can help RSAP-Net to produce more refined boundaries of the optic disc and the optic cup. Results The experimental results show that our method achieves excellent segmentation performance on the Drishti-GS1 standard dataset. In the OD and OC segmentation effects, the F1 scores are 0.9752 and 0.9012, respectively. The BLE are 6.33 pixels and 11.97 pixels, respectively. Conclusions This paper presents a new framework for the joint segmentation of optic discs and optic cups, called RSAP-Net. The framework mainly consists of a U-shaped segmentation skeleton and a residual space attention path module. 
The design of a pre-processing method called MSRCR-PT for the OD/OC segmentation task can improve segmentation performance. The method was evaluated on the publicly available Drishti-GS1 standard dataset and proved to be effective.
- Published
- 2022
8. Reproducibility of mass spectrometry based metabolomics data
- Author
- Daisy Philtron, Debashis Ghosh, Tusharkanti Ghosh, Weiming Zhang, and Katerina Kechris
- Subjects
- False discovery rate; QH301-705.5; Computer applications to medicine. Medical informatics; R858-859.7; Biochemistry; Bioconductor; Metabolomics; Structural Biology; Consistency (statistics); Biology (General); Molecular Biology; Parametric statistics; Mathematics; Reproducibility; Mass spectrometry; business.industry; Research; Applied Mathematics; Nonparametric statistics; Reproducibility of Results; Pattern recognition; Replicate; Computer Science Applications; Artificial intelligence; business
- Abstract
Background Assessing the reproducibility of measurements is an important first step for improving the reliability of downstream analyses of high-throughput metabolomics experiments. We define a metabolite to be reproducible when it demonstrates consistency across replicate experiments. Similarly, metabolites which are not consistent across replicates can be labeled as irreproducible. In this work, we introduce and evaluate the use of (Ma)ximum (R)ank (R)eproducibility (MaRR) to examine reproducibility in mass spectrometry-based metabolomics experiments. We examine reproducibility across technical or biological samples in three different mass spectrometry metabolomics (MS-Metabolomics) data sets. Results We apply MaRR, a nonparametric approach that detects the change from reproducible to irreproducible signals using a maximal rank statistic. The advantage of using MaRR over model-based methods is that it does not make parametric assumptions on the underlying distributions or dependence structures of reproducible metabolites. Using three MS-Metabolomics data sets generated in the multi-center Genetic Epidemiology of Chronic Obstructive Pulmonary Disease (COPD) study, we applied the MaRR procedure after data processing to explore reproducibility across technical or biological samples. Under realistic settings of MS-Metabolomics data, the MaRR procedure effectively controls the False Discovery Rate (FDR) when there is a gradual reduction in correlation between replicate pairs for less highly ranked signals. Simulation studies also show that the MaRR procedure tends to have high power for detecting reproducible metabolites in most situations, except when the proportion of reproducible metabolites is small. Bias values for the simulations (i.e., the difference between the estimated and the true value of reproducible signal proportions) are also close to zero. 
The results reported from the real data show a higher level of reproducibility for technical replicates compared to biological replicates across all the three different datasets. In summary, we demonstrate that the MaRR procedure application can be adapted to various experimental designs, and that the nonparametric approach performs consistently well. Conclusions This research was motivated by reproducibility, which has proven to be a major obstacle in the use of genomic findings to advance clinical practice. In this paper, we developed a data-driven approach to assess the reproducibility of MS-Metabolomics data sets. The methods described in this paper are implemented in the open-source R package marr, which is freely available from Bioconductor at http://bioconductor.org/packages/marr.
- Published
- 2021
9. Protein structure prediction based on particle swarm optimization and tabu search strategy
- Author
- Yu, Shuchun; Li, Xianxiang; Tian, Xue; Pang, Ming
- Subjects
- Protein Stability; Structural Biology; Applied Mathematics; Proteins; Amino Acid Sequence; Molecular Biology; Biochemistry; Algorithms; Computer Science Applications
- Abstract
Background The stability of protein sequence structure plays an important role in the prevention and treatment of diseases. Results In this paper, particle swarm optimization and tabu search are combined to propose a new method for protein structure prediction. The experimental results show that: for four groups of artificial protein sequences with different lengths, this method obtains the lowest potential energy value and stable structure prediction results, and the effect is obviously better than the other two comparison methods. Taking the first group of protein sequences as an example, our method improves the prediction of minimum potential energy by 127% and 7% respectively. Conclusions Therefore, the method proposed in this paper is more suitable for the prediction of protein structural stability.
- Published
- 2022
10. CMIC: an efficient quality score compressor with random access functionality
- Author
- Hansen Chen, Jianhua Chen, Zhiwen Lu, and Rongshu Wang
- Subjects
- Structural Biology; Applied Mathematics; High-Throughput Nucleotide Sequencing; Data Compression; Molecular Biology; Biochemistry; Algorithms; Software; Computer Science Applications
- Abstract
Background Over the past few decades, the emergence and maturation of new technologies have substantially reduced the cost of genome sequencing. As a result, the amount of genomic data that needs to be stored and transmitted has grown exponentially. For the standard sequencing data format, FASTQ, compression of the quality score is a key and difficult aspect of FASTQ file compression. Throughout the literature, we found that the majority of the current quality score compression methods do not support random access. Based on the above consideration, it is reasonable to investigate a lossless quality score compressor with a high compression rate, a fast compression and decompression speed, and support for random access. Results In this paper, we propose CMIC, an adaptive and random access supported compressor for lossless compression of quality score sequences. CMIC is an acronym of the four steps (classification, mapping, indexing and compression) in the paper. Its framework consists of the following four parts: classification, mapping, indexing, and compression. The experimental results show that our compressor has good performance in terms of compression rates on all the tested datasets. The file sizes are reduced by up to 21.91% when compared with LCQS. In terms of compression speed, CMIC is better than all other compressors on most of the tested cases. In terms of random access speed, the CMIC is faster than the LCQS, which provides a random access function for compressed quality scores. Conclusions CMIC is a compressor that is especially designed for quality score sequences, which has good performance in terms of compression rate, compression speed, decompression speed, and random access speed. The CMIC can be obtained in the following way: https://github.com/Humonex/Cmic.
- Published
- 2022
11. Topology-enhanced molecular graph representation for anti-breast cancer drug selection
- Author
- Yue Gao, Songling Chen, Junyi Tong, and Xiangling Fu
- Subjects
- Structural Biology; Applied Mathematics; Estrogen Receptor alpha; Humans; Antineoplastic Agents; Breast Neoplasms; Female; Breast; Neural Networks, Computer; Molecular Biology; Biochemistry; Computer Science Applications
- Abstract
Background Breast cancer is currently one of the cancers with a higher mortality rate in the world. The biological research on anti-breast cancer drugs focuses on the activity of estrogen receptors alpha (ER$\alpha$), the pharmacokinetic properties and the safety of the compounds, which, however, is an expensive and time-consuming process. Developments in deep learning bring the potential to efficiently facilitate candidate drug selection against breast cancer. Methods In this paper, we propose an Anti-Breast Cancer Drug selection method utilizing Gated Graph Neural Networks (ABCD-GGNN) to topologically enhance the molecular representation of candidate drugs. By constructing atom-level graphs through atomic descriptors for each distinct compound, ABCD-GGNN can topologically learn both the implicit structure and substructure characteristics of a candidate drug and then integrate the representation with explicit discrete molecular descriptors to generate a molecule-level representation. As a result, the representation of ABCD-GGNN can inductively predict the ER$\alpha$, the pharmacokinetic properties and the safety of each candidate drug. Finally, we design a ranking operator whose inputs are the predicted properties so as to statistically select the appropriate drugs against breast cancer. Results Extensive experiments conducted on our collected anti-breast cancer candidate drug dataset demonstrate that our proposed method outperforms all the other representative methods in the tasks of predicting ER$\alpha$, and the pharmacokinetic properties and safety of the compounds. Extended result analysis demonstrates the efficiency and biological rationality of the operator we design to calculate the candidate drug ranking from the predicted properties. Conclusion In this paper, we propose the ABCD-GGNN representation method to efficiently integrate the topological structure and substructure features of the molecules with the discrete molecular descriptors. 
With a ranking operator applied, the predicted properties efficiently facilitate the candidate drug selection against breast cancer.
- Published
- 2022
12. Automatic classification of nerve discharge rhythms based on sparse auto-encoder and time series feature
- Author
- Zhongting Jiang, Dong Wang, and Yuehui Chen
- Subjects
- Time Factors; Structural Biology; Applied Mathematics; Neural Networks, Computer; Molecular Biology; Biochemistry; Computer Science Applications
- Abstract
Background Nerve discharge is the carrier of information transmission, which can reveal the basic rules of various nerve activities. Recognition of the nerve discharge rhythm is the key to correctly understanding the dynamic behavior of the nervous system. Previous methods for nerve discharge recognition depended almost entirely on traditional statistical features and the nonlinear dynamical features of the discharge activity. Manual extraction and empirical judgment of the features were required for the recognition. Thus, these methods suffered from subjective factors and were not conducive to the identification of a large number of discharge rhythms. Results The ability to extract features automatically has greatly improved with the development of neural networks. In this paper, an effective discharge rhythm classification model based on a sparse auto-encoder is proposed. The sparse auto-encoder was used to construct the feature learning network. The simulated discharge data from the Chay model and its variants were taken as the input of the network, and the fused features, including the network learning features, covariance and approximate entropy of nerve discharge, were classified by Softmax. The results showed that the accuracy of the classification on the testing data was 87.5%. Compared with other methods for the identification of nerve discharge types, this method could extract the characteristics of nerve discharge rhythm automatically without manual design, and showed a higher accuracy. Conclusions Neither the sparse auto-encoder nor neural networks in general had previously been used to classify basic nerve discharge from either biological experiment data or model simulation data. 
The automatic classification method of nerve discharge rhythm based on the sparse auto-encoder in this paper reduced the subjectivity and misjudgment of the artificial feature extraction, saved the time for the comparison with the traditional method, and improved the intelligence of the classification of discharge types. It could further help us to recognize and identify the nerve discharge activities in a new way.
- Published
- 2022
13. Spark-based parallel calculation of 3D fourier shell correlation for macromolecule structure local resolution estimation
- Author
- Xinhui Tian, Xiangrui Zeng, Xiaohui Zheng, Xin Gao, Wang Hui, Liu Xiaodong, Shi Xiao, Zhao Xiaofang, Min Xu, and Yongchun Lü
- Subjects
- Macromolecular Substances; Fourier shell correlation; Computer science; Single particle analysis; 3D array partition; lcsh:Computer applications to medicine. Medical informatics; Biochemistry; 03 medical and health sciences; Imaging, Three-Dimensional; 0302 clinical medicine; Structural Biology; Computer cluster; Microscopy; Spark (mathematics); Image Processing, Computer-Assisted; Molecular Biology; lcsh:QH301-705.5; 3D local resolution map; 030304 developmental biology; 3D local Fourier shell correlation; Spark; 0303 health sciences; Key-value data; Applied Mathematics; Cryoelectron Microscopy; Resolution (electron density); Methodology; Partition (database); Computer Science Applications; lcsh:Biology (General); lcsh:R858-859.7; Algorithm; Algorithms; 030217 neurology & neurosurgery
- Abstract
Background Resolution estimation is the main evaluation criteria for the reconstruction of macromolecular 3D structure in the field of cryoelectron microscopy (cryo-EM). At present, there are many methods to evaluate the 3D resolution for reconstructed macromolecular structures from Single Particle Analysis (SPA) in cryo-EM and subtomogram averaging (SA) in electron cryotomography (cryo-ET). As global methods, they measure the resolution of the structure as a whole, but they are inaccurate in detecting subtle local changes of reconstruction. In order to detect the subtle changes of reconstruction of SPA and SA, a few local resolution methods are proposed. The mainstream local resolution evaluation methods are based on local Fourier shell correlation (FSC), which is computationally intensive. However, the existing resolution evaluation methods are based on multi-threading implementation on a single computer with very poor scalability. Results This paper proposes a new fine-grained 3D array partition method by key-value format in Spark. Our method first converts 3D images to key-value data (K-V). Then the K-V data is used for 3D array partitioning and data exchange in parallel. So Spark-based distributed parallel computing framework can solve the above scalability problem. In this distributed computing framework, all 3D local FSC tasks are simultaneously calculated across multiple nodes in a computer cluster. Through the calculation of experimental data, 3D local resolution evaluation algorithm based on Spark fine-grained 3D array partition has a magnitude change in computing speed compared with the mainstream FSC algorithm under the condition that the accuracy remains unchanged, and has better fault tolerance and scalability. Conclusions In this paper, we proposed a K-V format based fine-grained 3D array partition method in Spark to parallel calculating 3D FSC for getting a 3D local resolution density map. 
3D local resolution density map evaluates the three-dimensional density maps reconstructed from single particle analysis and subtomogram averaging. Our proposed method can significantly increase the speed of the 3D local resolution evaluation, which is important for the efficient detection of subtle variations among reconstructed macromolecular structures.
- Published
- 2020
14. Predicting synthetic lethal interactions in human cancers using graph regularized self-representative matrix factorization
- Author
- Jiang Huang, Fan Lu, Zexuan Zhu, Min Wu, and Le Ou-Yang
- Subjects
- Synthetic lethality; Computer science; Antineoplastic Agents; Computational biology; lcsh:Computer applications to medicine. Medical informatics; Biochemistry; Matrix decomposition; 03 medical and health sciences; 0302 clinical medicine; Structural Biology; Neoplasms; Humans; Molecular Biology; Gene; lcsh:QH301-705.5; 030304 developmental biology; 0303 health sciences; Research; Applied Mathematics; Matrix factorization; Anticancer drug; Computer Science Applications; lcsh:Biology (General); 030220 oncology & carcinogenesis; Graph regularization; Graph (abstract data type); lcsh:R858-859.7; DNA microarray; Algorithms
- Abstract
Background Synthetic lethality has attracted a lot of attentions in cancer therapeutics due to its utility in identifying new anticancer drug targets. Identifying synthetic lethal (SL) interactions is the key step towards the exploration of synthetic lethality in cancer treatment. However, biological experiments are faced with many challenges when identifying synthetic lethal interactions. Thus, it is necessary to develop computational methods which could serve as useful complements to biological experiments. Results In this paper, we propose a novel graph regularized self-representative matrix factorization (GRSMF) algorithm for synthetic lethal interaction prediction. GRSMF first learns the self-representations from the known SL interactions and further integrates the functional similarities among genes derived from Gene Ontology (GO). It can then effectively predict potential SL interactions by leveraging the information provided by known SL interactions and functional annotations of genes. Extensive experiments on the synthetic lethal interaction data downloaded from SynLethDB database demonstrate the superiority of our GRSMF in predicting potential synthetic lethal interactions, compared with other competing methods. Moreover, case studies of novel interactions are conducted in this paper for further evaluating the effectiveness of GRSMF in synthetic lethal interaction prediction. Conclusions In this paper, we demonstrate that by adaptively exploiting the self-representation of original SL interaction data, and utilizing functional similarities among genes to enhance the learning of self-representation matrix, our GRSMF could predict potential SL interactions more accurately than other state-of-the-art SL interaction prediction methods.
- Published
- 2019
15. MiRNA therapeutics based on logic circuits of biological pathways
- Author
-
Massimo La Rosa, Antonino Fiannaca, Alfonso Urso, Valeria Boscaino, Laura La Paglia, and Riccardo Rizzo
- Subjects
Lung Neoplasms ,Logic ,Computer science ,In silico ,Cancer pathway ,Boolean network ,Logic circuit ,Computational biology ,lcsh:Computer applications to medicine. Medical informatics ,Biochemistry ,Biological pathway ,03 medical and health sciences ,chemistry.chemical_compound ,0302 clinical medicine ,Structural Biology ,In vivo ,Carcinoma, Non-Small-Cell Lung ,microRNA ,medicine ,Humans ,Molecule ,Computer Simulation ,Antagomir ,KEGG ,Molecular Biology ,Gene ,lcsh:QH301-705.5 ,030304 developmental biology ,0303 health sciences ,Drug discovery ,Research ,Applied Mathematics ,Cancer ,RNA ,Cancer Pathway ,miRNA therapeutics ,medicine.disease ,In vitro ,Computer Science Applications ,Gene Expression Regulation, Neoplastic ,MicroRNAs ,chemistry ,lcsh:Biology (General) ,030220 oncology & carcinogenesis ,Mutation ,lcsh:R858-859.7 ,DNA microarray ,Signal Transduction - Abstract
Background In silico experiments, with the aid of computer simulation, speed up the process of in vitro or in vivo experiments. Cancer therapy design is often based on signalling pathways. MicroRNAs (miRNAs) are small non-coding RNA molecules. In several kinds of diseases, including cancer, hepatitis and cardiovascular diseases, they are often deregulated, acting as oncogenes or tumor suppressors. miRNA therapeutics is based on two main kinds of molecule injection: miRNA mimics, which consists of injecting molecules that mimic the targeted miRNA, and antagomiR, which consists of injecting molecules that inhibit the targeted miRNA. Nowadays, research is focused on miRNA therapeutics. This paper addresses cancer-related signalling pathways to investigate miRNA therapeutics. Results In order to prove our approach, we present two different case studies: non-small cell lung cancer and melanoma. KEGG signalling pathways are modelled by a digital circuit. A logic value of 1 is linked to the expression of the corresponding gene. A logic value of 0 is linked to its absence (gene not expressed). All possible relationships provided by a signalling pathway are modelled by logic gates. Mutations, derived according to the literature, are introduced and modelled as well. The modelling approach and analysis are widely discussed within the paper. MiRNA therapeutics is investigated by the digital circuit analysis. The most effective miRNA and combination of miRNAs, in terms of reduction of pathogenic conditions, are obtained. A discussion of obtained results in comparison with literature data is provided. Results are confirmed by existing data. Conclusions The proposed study is based on drug discovery and miRNA therapeutics and uses a digital circuit simulation of a cancer pathway. Using this simulation, the most effective combination of drugs and miRNAs for mutated cancer therapy design is obtained, and these results were validated by the literature.
The proposed modelling and analysis approach can be applied to any human disease, starting from the corresponding signalling pathway.
- Published
- 2019
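The logic-circuit modelling described in the abstract of record 15 (logic 1 for an expressed gene, logic 0 for a gene that is not expressed, pathway relationships as logic gates) can be illustrated with a minimal sketch. The three-input pathway, the gene roles, and the effect of the miRNA mimic below are hypothetical and are not taken from the paper or from KEGG.

```python
from itertools import product


def pathway_output(receptor: bool, suppressor: bool, mirna_mimic: bool) -> bool:
    """Toy circuit: True means the hypothetical proliferation signal is on."""
    # Hypothetical assumption: a miRNA mimic silences the receptor gene,
    # modelled as a NOT gate feeding an AND gate.
    receptor_active = receptor and not mirna_mimic
    # Hypothetical assumption: the signal propagates only when the
    # suppressor gene is not expressed.
    return receptor_active and not suppressor


def truth_table() -> dict:
    """Evaluate the circuit for every combination of the three inputs."""
    return {
        inputs: pathway_output(*inputs)
        for inputs in product([False, True], repeat=3)
    }
```

Enumerating the truth table is only feasible for small circuits; it is used here because it makes the effect of each input easy to inspect.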
16. ANINet: a deep neural network for skull ancestry estimation
- Author
-
Jiang Yi, Geng Guo-hua, Liu Xiaoning, Lin Pengyue, Xia Siyuan, Wang Shixiong, and Yang Wen
- Subjects
Ancestry classification ,QH301-705.5 ,Computer science ,Computer applications to medicine. Medical informatics ,R858-859.7 ,Biochemistry ,Cross-validation ,Image (mathematics) ,Structural Biology ,Depth projection ,Image Processing, Computer-Assisted ,Calibration ,medicine ,Biology (General) ,Projection (set theory) ,Molecular Biology ,Artificial neural network ,business.industry ,Research ,3D skull models ,ANINet ,Applied Mathematics ,Skull ,Pattern recognition ,Computer Science Applications ,Range (mathematics) ,medicine.anatomical_structure ,Feature (computer vision) ,Neural Networks, Computer ,Artificial intelligence ,business - Abstract
Background Ancestry estimation of skulls has a wide range of applications in forensic science, anthropology, and facial reconstruction. This study aims to avoid defects in traditional skull ancestry estimation methods, such as time-consuming and labor-intensive manual calibration of feature points, and subjective results. Results This paper takes the skull depth image as input and, building on AlexNet, introduces the Wide module and SE-block to improve the network, yielding ANINet, which performs the ancestry classification. The unified model architecture of ANINet overcomes the subjectivity of manually calibrating feature points and improves accuracy and efficiency. We use depth projection to obtain the local depth image and the global depth image of the skull, take the skull depth image as the object, use global, local, and local + global methods respectively to experiment on data sets of 95 Han skulls and 110 Uyghur skulls, and perform cross-validation. The experimental results show that the accuracies of the three methods for skull ancestry estimation reached 98.21%, 98.04% and 99.03%, respectively. Compared with the classic networks AlexNet, Vgg-16, GoogLenet, ResNet-50, DenseNet-121, and SqueezeNet, the network proposed in this paper has the advantages of high accuracy and few parameters; compared with state-of-the-art methods, the method in this paper has a higher learning rate and better ability to estimate. Conclusions In summary, skull depth images have an excellent performance in estimation, and ANINet is an effective approach for skull ancestry estimation.
- Published
- 2021
17. Research on RNA secondary structure predicting via bidirectional recurrent neural network
- Author
-
Yu Zhang, Qiming Fu, Haiou Li, Weizhong Lu, Hongjie Wu, Yan Cao, Zhengwei Song, and Yijie Ding
- Subjects
Truncation ,Computer science ,QH301-705.5 ,Computer applications to medicine. Medical informatics ,R858-859.7 ,Recurrent neural network ,Biochemistry ,Protein Structure, Secondary ,Nucleic acid secondary structure ,Local optimum ,Structural Biology ,RNA secondary structure prediction ,Weight ,Biology (General) ,Molecular Biology ,Sequence ,Pseudoknots ,Applied Mathematics ,Research ,Matthews correlation coefficient ,Base (topology) ,Computer Science Applications ,Nucleic Acid Conformation ,RNA ,Neural Networks, Computer ,Algorithm ,Algorithms - Abstract
Background RNA secondary structure prediction is an important research topic in bioinformatics. Predicting RNA secondary structure with pseudoknots has been proved to be an NP-hard problem. Traditional machine learning methods cannot effectively apply protein sequence information with different sequence lengths to the prediction process due to the constraint of the self model when predicting the RNA secondary structure. In addition, there is a large difference between the number of paired bases and the number of unpaired bases in the RNA sequences, which means the imbalance between positive and negative samples can easily make the model fall into a local optimum. To solve the above problems, this paper proposes a variable-length dynamic bidirectional Gated Recurrent Unit (VLDB GRU) model. The model can accept sequences with different lengths through the introduction of a flag vector. The model can also make full use of the base information before and after the predicted base and can avoid losing part of the information due to truncation. Introducing a weight vector that dynamically adjusts the loss function of each base when training on the RNA training set solves the problem of sample imbalance. Results The algorithm proposed in this paper is compared with the existing algorithms on five representative subsets of the data set RNA STRAND. The experimental results show that the accuracy and Matthews correlation coefficient of the method are improved by 4.7% and 11.4%, respectively. Conclusions The flag vector introduced allows the model to effectively use the information before and after the protein sequence; the introduced weight vector solves the problem of sample imbalance. Compared with other algorithms, the VLDB GRU algorithm proposed in this paper has the best detection results.
- Published
- 2021
18. Prediction of lung cancer using gene expression and deep learning with KL divergence gene selection
- Author
-
Suli Liu and Wu Yao
- Subjects
China ,Deep Learning ,Lung Neoplasms ,Structural Biology ,Applied Mathematics ,Gene Expression ,Humans ,Neural Networks, Computer ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
Background Lung cancer is one of the cancers with the highest mortality rate in China. With the rapid development of high-throughput sequencing technology and the research and application of deep learning methods in recent years, deep neural networks based on gene expression have become a hot research direction in lung cancer diagnosis, providing an effective way of early diagnosis for lung cancer. Thus, building a deep neural network model is of great significance for the early diagnosis of lung cancer. However, the main challenges in mining gene expression datasets are the curse of dimensionality and imbalanced data. The existing methods proposed by some researchers can’t address the problems of high-dimensionality and imbalanced data, because of the overwhelming number of variables measured (genes) versus the small number of samples, which results in poor performance in early diagnosis for lung cancer. Method Given the disadvantages of gene expression data sets (small sample size, high dimensionality and imbalanced data), this paper proposes a gene selection method based on KL divergence, which selects some genes with higher KL divergence as model features. We then build a deep neural network model using Focal Loss as the loss function. At the same time, we use k-fold cross-validation to verify and select the best model, with k set to five in this paper. Result The deep learning model based on KL divergence gene selection proposed in this paper has an AUC of 0.99 on the validation set. The generalization performance of the model is high. Conclusion The deep neural network model based on KL divergence gene selection proposed in this paper is proved to be an accurate and effective method for lung cancer prediction.
- Published
- 2021
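The two ingredients named in the abstract of record 18, ranking genes by KL divergence and training with Focal Loss, can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the histogram-based density estimate, the bin count, the function names, and the default `gamma` and `alpha` values are all assumptions made for the example.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def histogram(values, bins, lo, hi):
    """Normalised histogram of values over [lo, hi] (requires hi > lo)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def rank_genes(expr_case, expr_control, bins=5, top_k=1):
    """Rank genes by the KL divergence between case and control histograms.

    expr_case / expr_control map a gene name to a list of expression values.
    """
    scores = {}
    for gene in expr_case:
        a, b = expr_case[gene], expr_control[gene]
        lo, hi = min(a + b), max(a + b)
        if hi == lo:  # constant gene carries no information
            scores[gene] = 0.0
            continue
        scores[gene] = kl_divergence(
            histogram(a, bins, lo, hi), histogram(b, bins, lo, hi)
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one predicted probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = max(p_t, 1e-12)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

The `(1 - p_t) ** gamma` factor shrinks the loss of well-classified samples, which is why focal loss is often chosen for imbalanced data such as the datasets the abstract describes.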
19. Quantitative prediction model for affinity of drug-target interactions based on molecular vibrations and overall system of ligand-receptor
- Author
-
Yun Wang, Ting Ting Cao, Xian Rui Wang, Xue Mei Tian, and Cong Min Jia
- Subjects
Quantitative structure–activity relationship ,QH301-705.5 ,Computer science ,Computer applications to medicine. Medical informatics ,Drug target ,R858-859.7 ,Chemical composition ,Quantitative Structure-Activity Relationship ,Drug–target interactions ,Ligands ,Biochemistry ,Vibration ,Whole systems ,Structural Biology ,Molecular descriptor ,Drug–target affinity ,Biology (General) ,Molecular Biology ,Applied Mathematics ,Molecular vibrations ,Ligand (biochemistry) ,Computer Science Applications ,Random forest ,Molecular Docking Simulation ,Pharmaceutical Preparations ,Molecular vibration ,Biological system ,Research Article - Abstract
Background: The study of drug-target interaction (DTI) affinity plays an important role in safety assessment and pharmacology. Currently, quantitative structure-activity relationship (QSAR) and molecular docking (MD) are the most common methods in research on DTI affinity. However, they are often built for a specific target or several targets, and most QSAR and MD models are based either only on the structure of drug molecules or on the structure of targets, with low accuracy and a small scope of application. How to construct quantitative prediction models with high accuracy and wide applicability remains a challenge. To this end, this paper screened molecular descriptors based on molecular vibrations and took the molecule-target pair as a whole system to construct prediction models with high accuracy and wide applicability based on Kd and EC50, and to provide a reference for quantifying the affinity of DTIs. Methods: Through parametric characterization based on molecular vibrations and protein sequences, taking the molecule-target pair as a whole system, and feature selection of drug molecule-target pairs, we constructed feature datasets of DTIs quantified by Kd and EC50, respectively. Then, prediction models were constructed using the above datasets and SVM, RF and ANN. In addition, optimal models were selected for application evaluation and comprehensive comparison. Results: Under ten-fold cross-validation, evaluation parameters based on RF for the EC50 dataset are as follows: R2 (RF) of training and test sets are 0.9611, 0.9641; MSE (RF) of training and test sets are 0.0891, 0.0817. Evaluation parameters based on RF for the Kd dataset are as follows: R2 (RF) of training and test sets are 0.9425, 0.9485; MSE (RF) of training and test sets are 0.1208, 0.1191. After comprehensive comparison, the results showed that the RF model in this paper is the optimal model.
In application evaluation of the RF model, the errors of most prediction results were in the range of 1.5-2.0. Conclusion: Through screening molecular descriptors based on molecular vibrations and taking the molecule-target pair as a whole system, we obtained an optimal RF-based model that is more accurate and more widely applicable, which indicates that selecting molecular descriptors associated with molecular vibrations and treating the molecule-target pair as a whole system are reliable methods for improving model performance. It can provide a reference for quantifying the affinity of DTIs.
- Published
- 2021
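The abstract of record 19 reports its results as R2 and MSE. For readers unfamiliar with these two metrics, the sketch below computes them from their standard definitions; it does not reproduce the paper's random forest model or its data, and the function names are chosen for this example.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A model that always predicts the mean of the observed values scores R2 = 0, and a perfect model scores R2 = 1 with MSE = 0, which gives a scale for reading the values in the abstract.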
20. Adverse drug reaction detection via a multihop self-attention mechanism
- Author
-
Yijia Zhang, Hongfei Lin, Liang Yang, Yuqi Ren, Jian Wang, Zhihao Yang, Bo Xu, and Tongxuan Zhang
- Subjects
Complex semantic information ,Drug-Related Side Effects and Adverse Reactions ,Generalization ,Computer science ,Adverse drug reactions ,02 engineering and technology ,computer.software_genre ,lcsh:Computer applications to medicine. Medical informatics ,Biochemistry ,Task (project management) ,03 medical and health sciences ,Structural Biology ,0202 electrical engineering, electronic engineering, information engineering ,medicine ,Humans ,Attention ,Molecular Biology ,lcsh:QH301-705.5 ,030304 developmental biology ,0303 health sciences ,Artificial neural network ,business.industry ,Mechanism (biology) ,Applied Mathematics ,Unstructured data ,medicine.disease ,Neural network ,Semantics ,Computer Science Applications ,Focus (linguistics) ,lcsh:Biology (General) ,Multihop self-attention mechanism ,lcsh:R858-859.7 ,020201 artificial intelligence & image processing ,Neural Networks, Computer ,Artificial intelligence ,business ,computer ,Natural language processing ,Sentence ,Adverse drug reaction ,Research Article - Abstract
Background The adverse reactions that are caused by drugs are potentially life-threatening problems. Comprehensive knowledge of adverse drug reactions (ADRs) can reduce their detrimental impacts on patients. Detecting ADRs through clinical trials takes a large number of experiments and a long period of time. With the growing amount of unstructured textual data, such as biomedical literature and electronic records, detecting ADRs in the available unstructured data has important implications for ADR research. Most of the neural network-based methods typically focus on the simple semantic information of sentence sequences; however, the relationship of the two entities depends on more complex semantic information. Methods In this paper, we propose a multihop self-attention mechanism (MSAM) model that aims to learn the multi-aspect semantic information for the ADR detection task. First, the contextual information of the sentence is captured by using the bidirectional long short-term memory (Bi-LSTM) model. Then, via applying the multiple steps of an attention mechanism, multiple semantic representations of a sentence are generated. Each attention step obtains a different attention distribution focusing on the different segments of the sentence. Meanwhile, our model locates and enhances various keywords from the multiple representations of a sentence. Results Our model was evaluated by using two ADR corpora. It is shown that the method has a stable generalization ability. Via extensive experiments, our model achieved F-measures of 0.853, 0.799 and 0.851 for ADR detection for TwiMed-PubMed, TwiMed-Twitter, and ADE, respectively. The experimental results showed that our model significantly outperforms other compared models for ADR detection. Conclusions In this paper, we propose a modification of the multihop self-attention mechanism (MSAM) model for an ADR detection task. The proposed method significantly improved the learning of the complex semantic information of sentences.
- Published
- 2019
21. Enhancing ontology-driven diagnostic reasoning with a symptom-dependency-aware Naïve Bayes classifier
- Author
-
Yaliang Li, Min Yang, Buzhou Tang, Ying Shen, and Hai-Tao Zheng
- Subjects
Dependency (UML) ,Computer science ,Knowledge Bases ,Diagnostic reasoning ,Disease ,Ontology (information science) ,Machine learning ,computer.software_genre ,Uncertainty reasoning ,lcsh:Computer applications to medicine. Medical informatics ,Biochemistry ,03 medical and health sciences ,Naive Bayes classifier ,0302 clinical medicine ,Structural Biology ,Electronic Health Records ,Humans ,naïve Bayes classifier ,Molecular Biology ,lcsh:QH301-705.5 ,Diagnostic Techniques and Procedures ,030304 developmental biology ,Probability ,0303 health sciences ,business.industry ,Ontology ,Applied Mathematics ,Medical record ,Conditional probability ,Bayes Theorem ,Computer Science Applications ,ROC Curve ,lcsh:Biology (General) ,030220 oncology & carcinogenesis ,Area Under Curve ,lcsh:R858-859.7 ,Artificial intelligence ,business ,computer ,Algorithms ,Research Article - Abstract
Background Ontology has attracted substantial attention from both academia and industry. Handling uncertainty reasoning is important in researching ontology. For example, when a patient is suffering from cirrhosis, the appearance of abdominal vein varices is four times more likely than the presence of bitter taste. Such medical knowledge is crucial for decision-making in various medical applications but is missing from existing medical ontologies. In this paper, we aim to discover medical knowledge probabilities from electronic medical record (EMR) texts to enrich ontologies. First, we build an ontology by identifying meaningful entity mentions from EMRs. Then, we propose a symptom-dependency-aware naïve Bayes classifier (SDNB) that is based on the assumption that there is a level of dependency among symptoms. To ensure the accuracy of the diagnostic classification, we incorporate the probability of a disease into the ontology via innovative approaches. Results We conduct a series of experiments to evaluate whether the proposed method can discover meaningful and accurate probabilities for medical knowledge. Based on over 30,000 deidentified medical records, we explore 336 abdominal diseases and 81 related symptoms. Among these 336 gastrointestinal diseases, the probabilities of 31 diseases are obtained via our method. These 31 probabilities of diseases and 189 conditional probabilities between diseases and the symptoms are added into the generated ontology. Conclusion In this paper, we propose a medical knowledge probability discovery method that is based on the analysis and extraction of EMR text data for enriching a medical ontology with probability information. The experimental results demonstrate that the proposed method can effectively identify accurate medical knowledge probability information from EMR data. 
In addition, the proposed method can efficiently and accurately calculate the probability of a patient suffering from a specified disease, thereby demonstrating the advantage of combining an ontology and a symptom-dependency-aware naïve Bayes classifier.
- Published
- 2019
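The abstract of record 21 describes adding disease probabilities and disease-symptom conditional probabilities to an ontology and using them for diagnosis. The sketch below shows how such probabilities combine under a plain naïve Bayes classifier. It deliberately omits the symptom-dependency modelling that distinguishes the paper's SDNB method, and all probability values, the fallback floor, and the function name are hypothetical.

```python
def posterior(priors, likelihoods, symptoms, floor=1e-6):
    """Posterior P(disease | symptoms) under the naive independence assumption.

    priors:      disease -> P(disease)
    likelihoods: disease -> {symptom: P(symptom | disease)}
    floor:       hypothetical fallback for unseen disease-symptom pairs
    """
    unnormalised = {}
    for disease, prior in priors.items():
        score = prior
        for symptom in symptoms:
            score *= likelihoods[disease].get(symptom, floor)
        unnormalised[disease] = score
    total = sum(unnormalised.values())
    return {disease: score / total for disease, score in unnormalised.items()}
```

With equal priors for two diseases and a symptom that is four times more likely under the first, the first disease receives a posterior of 0.8.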
22. Knowledge-guided convolutional networks for chemical-disease relation extraction
- Author
-
Lei Du, Zhuang Liu, Yingyu Lin, Chengkun Lang, Huiwei Zhou, and Shixian Ning
- Subjects
FOS: Computer and information sciences ,Relation (database) ,Drug-Related Side Effects and Adverse Reactions ,Computer science ,Knowledge Bases ,Attention mechanism ,Context (language use) ,Machine learning ,computer.software_genre ,lcsh:Computer applications to medicine. Medical informatics ,Biochemistry ,03 medical and health sciences ,0302 clinical medicine ,Structural Biology ,Leverage (statistics) ,Data Mining ,Humans ,Disease ,Molecular Biology ,lcsh:QH301-705.5 ,030304 developmental biology ,0303 health sciences ,Computer Science - Computation and Language ,business.industry ,Applied Mathematics ,Context features ,CDR extraction ,Relationship extraction ,Computer Science Applications ,Knowledge representations ,lcsh:Biology (General) ,030220 oncology & carcinogenesis ,lcsh:R858-859.7 ,Artificial intelligence ,Gating units ,business ,Computation and Language (cs.CL) ,computer ,Research Article - Abstract
Background: Automatic extraction of chemical-disease relations (CDR) from unstructured text is of essential importance for disease treatment and drug development. Meanwhile, biomedical experts have built many highly-structured knowledge bases (KBs), which contain prior knowledge about chemicals and diseases. Prior knowledge provides strong support for CDR extraction. How to make full use of it is worth studying. Results: This paper proposes a novel model called "Knowledge-guided Convolutional Networks (KCN)" to leverage prior knowledge for CDR extraction. The proposed model first learns knowledge representations including entity embeddings and relation embeddings from KBs. Then, entity embeddings are used to control the propagation of context features towards a chemical-disease pair with gated convolutions. After that, relation embeddings are employed to further capture the weighted context features by a shared attention pooling. Finally, the weighted context features containing additional knowledge information are used for CDR extraction. Experiments on the BioCreative V CDR dataset show that the proposed KCN achieves 71.28% F1-score, which outperforms most of the state-of-the-art systems. Conclusions: This paper proposes a novel CDR extraction model KCN to make full use of prior knowledge. Experimental results demonstrate that KCN could effectively integrate prior knowledge and contexts for the performance improvement. Published in BMC Bioinformatics; 16 pages, 5 figures.
- Published
- 2019
23. Linking entities through an ontology using word embeddings and syntactic re-ranking
- Author
-
İlknur Karadeniz and Arzucan Özgür
- Subjects
Normalization (statistics) ,Drug-Related Side Effects and Adverse Reactions ,Text mining ,Computer science ,Adverse drug reactions ,Bacteria biotopes ,computer.software_genre ,lcsh:Computer applications to medicine. Medical informatics ,Biochemistry ,03 medical and health sciences ,Entity linking ,0302 clinical medicine ,Named-entity recognition ,Named entity normalization ,Structural Biology ,Data Mining ,Molecular Biology ,lcsh:QH301-705.5 ,030304 developmental biology ,0303 health sciences ,Parsing ,Bacteria ,business.industry ,Entity categorization ,Applied Mathematics ,Natural language processing ,Reference Standards ,Syntax ,Biomedical text mining ,Computer Science Applications ,Semantics ,Named entity ,lcsh:Biology (General) ,030220 oncology & carcinogenesis ,Word embeddings ,Ontology ,lcsh:R858-859.7 ,Artificial intelligence ,business ,computer ,Algorithms ,Software ,Research Article - Abstract
Background Although there is an enormous number of textual resources in the biomedical domain, currently, manually curated resources cover only a small part of the existing knowledge. The vast majority of this information is in unstructured form, which contains nonstandard naming conventions. The task of named entity recognition, which is the identification of entity names from text, is not adequate without a standardization step. Linking each identified entity mention in text to an ontology/dictionary concept is an essential task to make sense of the identified entities. This paper presents an unsupervised approach for the linking of named entities to concepts in an ontology/dictionary. We propose an approach for the normalization of biomedical entities through an ontology/dictionary by using word embeddings to represent semantic spaces, and a syntactic parser to give higher weight to the most informative word in the named entity mentions. Results We applied the proposed method to two different normalization tasks: the normalization of bacteria biotope entities through the Onto-Biotope ontology and the normalization of adverse drug reaction entities through the Medical Dictionary for Regulatory Activities (MedDRA). The proposed method achieved a precision score of 65.9%, which is 2.9 percentage points above the state-of-the-art result on the BioNLP Shared Task 2016 Bacteria Biotope test data and a macro-averaged precision score of 68.7% on the Text Analysis Conference 2017 Adverse Drug Reaction test data. Conclusions The core contribution of this paper is a syntax-based way of combining the individual word vectors to form vectors for the named entity mentions and ontology concepts, which can then be used to measure the similarity between them. The proposed approach is unsupervised and does not require labeled data, making it easily applicable to different domains.
- Published
- 2019
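The abstract of record 23 describes combining word vectors into mention and concept vectors, with a higher weight for the most informative word, and then comparing them by similarity. A minimal sketch of that idea follows. It assumes the head word is already known (the paper obtains it from a syntactic parser, which is not reproduced here); the weight of 2.0, the toy two-dimensional embeddings, the concept identifiers, and the function names are hypothetical.

```python
import math


def mention_vector(words, embeddings, head, head_weight=2.0):
    """Weighted average of word vectors, with extra weight on the head word."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word in words:
        if word not in embeddings:  # skip out-of-vocabulary words
            continue
        weight = head_weight if word == head else 1.0
        total = [t + weight * x for t, x in zip(total, embeddings[word])]
        weight_sum += weight
    if weight_sum == 0.0:
        return total
    return [t / weight_sum for t in total]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def link(mention_words, head, concepts, embeddings):
    """Return the id of the concept most similar to the mention.

    concepts maps a concept id to a (list of words, head word) pair.
    """
    mention = mention_vector(mention_words, embeddings, head)
    return max(
        concepts,
        key=lambda cid: cosine(
            mention, mention_vector(concepts[cid][0], embeddings, concepts[cid][1])
        ),
    )
```

Because no labelled training pairs are needed, only embeddings and a list of concept names, the same function can be pointed at a different ontology without retraining, which matches the unsupervised character the abstract emphasises.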
24. Proceedings of the 2018 MidSouth Computational Biology and Bioinformatics Society (MCBIOS) conference
- Author
-
Jonathan D. Wren, Robert J. Doerksen, Prashanti Manda, Inimary T. Toby, Shraddha Thakkar, Bindu Nanduri, and Ramin Homayouni
- Subjects
Biomedical knowledge ,MCBIOS ,Bioinformatics ,media_common.quotation_subject ,Computational biology ,lcsh:Computer applications to medicine. Medical informatics ,History, 21st Century ,Biochemistry ,03 medical and health sciences ,0302 clinical medicine ,Cyberinfrastructure ,Structural Biology ,Excellence ,Humans ,Molecular Biology ,lcsh:QH301-705.5 ,030304 developmental biology ,media_common ,Panel discussion ,Introduction ,0303 health sciences ,Systems Biology ,Applied Mathematics ,ISCB ,Conferences ,Computational Biology ,Plenary session ,Assistant professor ,3. Good health ,Computer Science Applications ,R package ,lcsh:Biology (General) ,030220 oncology & carcinogenesis ,lcsh:R858-859.7 ,Career development - Abstract
The XVth Annual MidSouth Computational Biology and Bioinformatics Society (MCBIOS XV) conference was held in Starkville, MS from March 29–31, 2018 at Mississippi State University (MSU) within the Mill Conference Center. MSU had previously hosted the conference (MCBIOS VI) in 2009. The theme of MCBIOS XV was “Genomics and Big Data”. The co-chairs and conference hosts were Drs. Bindu Nanduri, Andy Perkins, and Daniel G. Peterson from MSU. The program was co-chaired by Dr. Shraddha Thakkar from the National Center for Toxicological Research (NCTR) within the US Food and Drug Administration (FDA), Dr. Mary Yang from University of Arkansas at Little Rock (UALR) and Dr. Prashanti Manda from University of North Carolina at Greensboro. The conference was attended by 183 registered participants. Of these, 73 were in the professional category, 13 were postdoctoral fellows and 97 were students. A total of 157 abstracts were submitted for MCBIOS XV, including 65 oral presentations and 92 poster presentations at the meeting. There were nine breakout sessions conducted during the meeting. Each breakout session included a presentation by a featured speaker, a renowned scientist in the topic of that session, followed by four additional presentations in that area. Dr. Cesar M. Compadre from University of Arkansas for Medical Sciences served as the finance coordinator for the conference. Dr. Ping Gong, at the US Army Engineer Research and Development Center, Vicksburg, served as the coordinator of the Young Scientist Research Excellence Award. Dr. George Popescu from the Institute of Genomics, Biocomputing and Biotechnology (IGBB) of MSU served as the poster session coordinator, and Dr. William S. Sanders from The Jackson Laboratory served as the workshop coordinator. For 2019–20, Dr. Weida Tong, Director of the Division of Bioinformatics and Biostatistics from NCTR/FDA, was chosen as the President-Elect and Dr.
Ramin Homayouni from University of Memphis as the President. Keynote speakers for MCBIOS 2018 Keynote Session I: “Next-Gen Data Science”, Russ Wolfinger, Ph.D., Director of Scientific Discovery and Genomics, JMP Life Sciences, SAS Institute, Cary, NC Keynote Session II: “Real World Data and Precision Medicine: Treatment Selection and Dose Optimization Strategies”, Lawrence J. Lesko Ph.D., F.C.P., University of Florida, Orlando, FL Keynote Session III: “No-Boundary Thinking: Defining Problems So Their Solutions Matter”, Steve Jennings, Ph.D., UALR Keynote Session IV: “Informatics Tools for Big Biologicals and Small Drug Molecules”, William J Welsh, Ph.D., Norman H. Edelman Professor in Bioinformatics, Rutgers University, New Brunswick, NJ Keynote Session V: “A decade of MAQC effort and its contribution to our understanding of high-throughput genomics technologies”, Weida Tong, Ph.D., Director, Division of Bioinformatics and Biostatistics, NCTR/FDA The conference program included three workshops: Workshop I: “Advanced Data Analytics using JMP Genomics”, Wenjun Bao, Ph.D., JMP Life Sciences, SAS Institute, Cary, NC. Workshop II: “Career Development Workshop for Young Scientist”, Inimary Toby, Ph.D., University of Dallas, Dallas, TX. Workshop III: “MCBIOS and No-Boundary Thinking Joint Bioinformatics Research workshop”. Session chair - Steve Jennings, Ph.D., UALR. Presentations: “Encoding biomedical knowledge using hetnets”, Daniel Himmelstein, Ph.D., University of Pennsylvania, Philadelphia, PA. “Microbial interactions and microbe-host interactions”, Hongmei Jiang Ph.D., Northwestern University, Evanston, IL. “Evolution as a metaphor for No Boundary Thinking”, Scott M. Williams, Ph.D., Case Western Reserve University, Cleveland, Ohio. Panel Discussion: Joan Peckham, Ph.D., University of Rhode Island Xiuzhen Huang, Ph.D., Arkansas State University Scott M.
Williams, Ph.D., Case Western Reserve University Hongmei Jiang, Ph.D., Northwestern University Daniel Himmelstein, Ph.D., University of Pennsylvania In addition to the workshops, MCBIOS provided assistance to students in preparing their resume. A one-on-one resume clinic was conducted by Gladys Awosemo, HRP, Baylor Scott and White Health, TX. Breakout Session I: Plant Omics I. Session Chair – Sorina C. Popescu, Ph.D., Assistant Professor, MSU. Featured speaker: Marilyn Warburton, Ph.D., United States Department of Agriculture, Agriculture Research Services, MS, “A pathway-based method to interpret GWAS results”. Breakout Session II: Next generation tools for environment and health research. Session Chair and featured speaker: Natalia Reyero, Ph.D., US Army Engineer Research and Development Center, Vicksburg MS, “Next Generation Tools for Environmental Research”. Breakout Session III: Drug Discovery and Precision Medicine. Session Chair and featured speaker: Robert J. Doerksen, Ph.D., University of Mississippi (UM), Oxford, MS, “Protein structure-based virtual screening: deep learning for precision medicine”. Breakout Session IV: Plant Omics II. Session Chair – Sorina C. Popescu, Ph.D., MSU. Featured Speaker: Tessa Burch-Smith, Ph.D., University of Tennessee, Knoxville, TN, “Focused ion beam-scanning electron microscopy for three-dimensional modelling of cellular ultrastructure”. Breakout Session V: Transcriptomics and Genome Sequencing. Session Chair and featured speaker: Brian Counterman, Ph.D., MSU, “Patternize: an R package for color pattern variation”. Breakout Session VI: Big Data and Risk Assessment. Session Chair – Minjun Chen, Ph.D., NCTR/FDA. Featured Speaker: William Mattes, Ph.D., NCTR/FDA, “Systems Biology and Big Data: Little Mitochondria as a Big Example”. Breakout Session VII: Genomics and Proteomics application. Session Chair - Zhichao Liu, Ph.D., NCTR/FDA.
Featured Speaker: Rakesh Kaundal, Ph.D., Utah State University, Logan, UT, “Complete genome sequence of Pythium brassicum P1, an oomycete root pathogen: insights into its host specificity to Brassicaceae”. Breakout Session VIII: Genomics and Infectious Disease. Session Chair and Featured speaker: Stephen Pruett, Ph.D., MSU, “Machine Learning Analysis of the Relationship between Changes in Immunological Parameters and Changes in Resistance to Listeria monocytogenes: A New Approach for Risk Assessment and Systems Immunology”. Breakout Session IX: MCBIOS Group Projects. Shraddha Thakkar, Ph.D., NCTR/FDA. William Sanders, Ph.D., IT Research Cyberinfrastructure, The Jackson Laboratory, Bar Harbor, ME. Best Paper Award, MCBIOS 2018: Phillip Berg et al., “Evaluation of Linear Models and Missing Value Imputation for the Analysis of Peptide-Centric Proteomics” [1]. Best Paper Runner-up, MCBIOS 2018: Bohu Pan et al., “Similarities and differences between variants called with human reference genome HG19 or HG38” [2]. This was the second year of the “MCBIOS Young Scientist Excellence Award”, which recognizes students and postdoctoral fellows who exhibit scientific excellence in the field of bioinformatics. Students and postdoctoral fellows went through a rigorous award application process with both internal and external judges. The top five candidates were selected to present during the opening session on March 29th. To compete, applicants submitted an extended abstract with a description of the innovation of their research and their specific contribution to the work presented, from which the quality and impact of the research were judged. Initiative in expanding their skills and bringing multidisciplinary talent to their project was an important consideration for selection for an oral presentation, and the quality of the presentation during the plenary session was the primary consideration for the award. 
Additional evaluation criteria included creativity, dedication and multidisciplinary contribution. This award was supported by the FDA grant to MCBIOS (5R13FD005931–03) and JMP Life Sciences. MCBIOS young scientist excellence award 2018 Post-doctoral winners First Place: Sundar Thangapandian, Ph.D., University of Illinois Urbana-Champaign, Urbana, IL. “Quantitative Target-specific Toxicity Prediction Model (QTTPM): A Novel Computational Toxicology Approach Integrating Molecular Dynamics Simulation and Machine Learning”. Second Place: Brian Walker, Ph.D., UALR. “Synthesis of xanthine derivatives for the inhibition of PARG”. Third Place: Darshan Mehta, Ph.D., NCTR/FDA. “Mining pharmacogenomic information from drug labeling using FDALabel database for advancing precision medicine”.
- Published
- 2019
25. Coding Prony’s method in MATLAB and applying it to biomedical signal filtering
- Author
-
A. Fernández Rodríguez, J. M. Rodríguez Ascariz, L. de Santiago Rodrigo, J. M. Miguel Jiménez, E. López Guillén, Luciano Boquete, and Universidad de Alcalá. Departamento de Electrónica
- Subjects
Male ,Polynomial ,Total least squares ,Computer science ,Prony’s method ,02 engineering and technology ,Biochemistry ,Least squares ,0302 clinical medicine ,Structural Biology ,0202 electrical engineering, electronic engineering, information engineering ,Linear combination ,MATLAB ,lcsh:QH301-705.5 ,computer.programming_language ,Fourier Analysis ,Applied Mathematics ,Computer Science Applications ,Exponential function ,Function approximation ,Matrix pencil ,lcsh:R858-859.7 ,Female ,Electrónica ,Algorithm ,Algorithms ,Adult ,lcsh:Computer applications to medicine. Medical informatics ,Discrete Fourier transform ,Multiple sclerosis ,Young Adult ,03 medical and health sciences ,Humans ,Least-Squares Analysis ,Molecular Biology ,020206 networking & telecommunications ,Filter (signal processing) ,lcsh:Biology (General) ,Evoked Potentials, Visual ,Programming Languages ,Electronics ,Multifocal evoked visual potentials ,computer ,Software ,030217 neurology & neurosurgery - Abstract
Background: The response of many biomedical systems can be modelled using a linear combination of damped exponential functions. The approximation parameters, based on equally spaced samples, can be obtained using Prony's method and its variants (e.g. the matrix pencil method). This paper provides a tutorial on the main polynomial Prony and matrix pencil methods and their implementation in MATLAB and analyses how they perform with synthetic and multifocal visual-evoked potential (mfVEP) signals. This paper briefly describes the theoretical basis of four polynomial Prony approximation methods: classic, least squares (LS), total least squares (TLS) and matrix pencil method (MPM). In each of these cases, implementation uses general MATLAB functions. The features of the various options are tested by approximating a set of synthetic mathematical functions and evaluating filtering performance in the Prony domain when applied to mfVEP signals to improve diagnosis of patients with multiple sclerosis (MS). Results: The code implemented does not achieve 100%-correct signal approximation and, of the methods tested, LS and MPM perform best. When filtering mfVEP records in the Prony domain, the value of the area under the receiver-operating-characteristic (ROC) curve is 0.7055 compared with 0.6538 obtained with the usual filtering method used for this type of signal (discrete Fourier transform low-pass filter with a cut-off frequency of 35 Hz). Conclusions: This paper reviews Prony's method in relation to signal filtering and approximation, provides the MATLAB code needed to implement the classic, LS, TLS and MPM methods, and tests their performance in biomedical signal filtering and function approximation. It emphasizes the importance of improving the computational methods used to implement the various methods described above. Universidad de Alcalá, Agencia Estatal de Investigación
- Published
- 2018
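The least-squares variant of Prony's method named in the abstract of record 25 can be illustrated with a minimal sketch. It is written in Python rather than the paper's MATLAB, is fixed to model order 2 so that it needs only the standard library, and is not the authors' code; the function names `solve2` and `prony2` are hypothetical. It fits x[n] ≈ h1·z1^n + h2·z2^n in three steps: a linear-prediction fit, the roots of the prediction polynomial, and an amplitude fit.

```python
import cmath


def solve2(a11, a12, a21, a22, b1, b2):
    """Solve a 2x2 linear system by Cramer's rule (real or complex entries)."""
    det = a11 * a22 - a12 * a21
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det


def prony2(x):
    """Order-2 least-squares Prony fit of equally spaced real samples x.

    Returns ((z1, h1), (z2, h2)) such that x[n] ~ h1*z1**n + h2*z2**n.
    """
    n_samples = len(x)
    rows = range(2, n_samples)

    # Step 1: linear prediction x[n] + a1*x[n-1] + a2*x[n-2] = 0,
    # solved in the least-squares sense through the normal equations.
    s11 = sum(x[n - 1] * x[n - 1] for n in rows)
    s12 = sum(x[n - 1] * x[n - 2] for n in rows)
    s22 = sum(x[n - 2] * x[n - 2] for n in rows)
    r1 = -sum(x[n] * x[n - 1] for n in rows)
    r2 = -sum(x[n] * x[n - 2] for n in rows)
    a1, a2 = solve2(s11, s12, s12, s22, r1, r2)

    # Step 2: the poles are the roots of z**2 + a1*z + a2.
    disc = cmath.sqrt(a1 * a1 - 4 * a2)
    z1, z2 = (-a1 + disc) / 2, (-a1 - disc) / 2

    # Step 3: amplitudes from a least-squares fit of x[n] = h1*z1**n + h2*z2**n
    # (normal equations of the 2-column Vandermonde system).
    v1 = [z1 ** n for n in range(n_samples)]
    v2 = [z2 ** n for n in range(n_samples)]
    g11 = sum(abs(v) ** 2 for v in v1)
    g22 = sum(abs(v) ** 2 for v in v2)
    g12 = sum(u.conjugate() * v for u, v in zip(v1, v2))
    b1 = sum(u.conjugate() * xn for u, xn in zip(v1, x))
    b2 = sum(u.conjugate() * xn for u, xn in zip(v2, x))
    h1, h2 = solve2(g11, g12, g12.conjugate(), g22, b1, b2)
    return (z1, h1), (z2, h2)


# Synthetic, noise-free test signal: two real decaying exponentials.
signal = [2 * 0.9 ** n + 0.5 ** n for n in range(20)]
(z1, h1), (z2, h2) = prony2(signal)
```

For a sampling period T, each pole relates to a damping factor and frequency through z = exp((α + j2πf)T). Higher model orders need a general least-squares solver and polynomial root finder, which this sketch deliberately omits.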
26. Juxtapose: a gene-embedding approach for comparing co-expression networks
- Author
-
B. Frank Eames, Farhad Maleki, Katie Ovens, and Ian McQuillan
- Subjects
Word embedding ,Evolution ,Computer science ,lcsh:Computer applications to medicine. Medical informatics ,computer.software_genre ,Gene co-expression networks ,Biochemistry ,Set (abstract data type) ,03 medical and health sciences ,0302 clinical medicine ,Similarity (network science) ,Structural Biology ,Machine learning ,Gene Regulatory Networks ,Word2vec ,Transcriptomics ,lcsh:QH301-705.5 ,Molecular Biology ,Blossom algorithm ,030304 developmental biology ,0303 health sciences ,Methodology Article ,Applied Mathematics ,Computational Biology ,Expression (mathematics) ,Computer Science Applications ,lcsh:Biology (General) ,lcsh:R858-859.7 ,Embedding ,Data mining ,DNA microarray ,computer ,Algorithms ,Software ,030217 neurology & neurosurgery - Abstract
Background Gene co-expression networks (GCNs) are not easily comparable due to their complex structure. In this paper, we propose a tool, Juxtapose, together with similarity measures that can be utilized for comparative transcriptomics between a set of organisms. While we focus on its application to comparing co-expression networks across species in evolutionary studies, Juxtapose is also generalizable to co-expression network comparisons across tissues or conditions within the same species. Methods A word embedding strategy commonly used in natural language processing was utilized in order to generate gene embeddings based on walks made throughout the GCNs. Juxtapose was evaluated based on its ability to embed the nodes of synthetic structures in the networks consistently while also generating biologically informative results. Evaluation of the techniques proposed in this research utilized RNA-seq datasets from GTEx, a multi-species experiment of prefrontal cortex samples from the Gene Expression Omnibus, as well as synthesized datasets. Biological evaluation was performed using gene set enrichment analysis and known gene relationships in literature. Results We show that Juxtapose is capable of globally aligning synthesized networks as well as identifying areas that are conserved in real gene co-expression networks without reliance on external biological information. Furthermore, output from a matching algorithm that uses cosine distance between GCN embeddings is shown to be an informative measure of similarity that reflects the amount of topological similarity between networks. Conclusions Juxtapose can be used to align GCNs without relying on known biological similarities and enables post-hoc analyses using biological parameters, such as orthology of genes, or conserved or variable pathways. Availability A development version of the software used in this paper is available at https://github.com/klovens/juxtapose
- Published
- 2021
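The abstract of record 26 describes a matching algorithm driven by cosine distance between gene embeddings. A minimal sketch of that idea follows, assuming the two networks' embeddings already live in a shared vector space. The greedy one-to-one pairing used here is a simpler stand-in chosen for brevity, not the matching algorithm of the paper, and the gene names and vectors are invented for illustration.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def greedy_match(emb_a, emb_b):
    """Pair genes across two networks, closest pairs first (greedy stand-in)."""
    candidates = sorted(
        (cosine_distance(u, v), gene_a, gene_b)
        for gene_a, u in emb_a.items()
        for gene_b, v in emb_b.items()
    )
    used_a, used_b, matches = set(), set(), []
    for dist, gene_a, gene_b in candidates:
        if gene_a not in used_a and gene_b not in used_b:
            matches.append((gene_a, gene_b, dist))
            used_a.add(gene_a)
            used_b.add(gene_b)
    return matches


def mean_match_distance(matches):
    """Average distance over matched pairs; lower suggests more similar networks."""
    return sum(dist for _, _, dist in matches) / len(matches)


# Hypothetical 2-D embeddings for two tiny networks.
network_a = {"g1": [1.0, 0.0], "g2": [0.0, 1.0]}
network_b = {"h1": [0.9, 0.1], "h2": [0.1, 0.9]}
matches = greedy_match(network_a, network_b)
score = mean_match_distance(matches)
```

A greedy pairing can be suboptimal compared with an optimal matching, so the resulting score should be read only as a rough similarity indicator.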
27. PartSeg: a tool for quantitative feature extraction from 3D microscopy images for dummies
- Author
-
Dariusz Plewczynski, Grzegorz Bokota, Nirmal Das, Pawel Trzaskoma, Agnieszka Grabowska, Adriana Magalska, Yana Yushkevich, Jacek Sroka, and Subhadip Basu
- Subjects
Interface (Java) ,Computer science ,Feature extraction ,Batch processing ,lcsh:Computer applications to medicine. Medical informatics ,computer.software_genre ,Biochemistry ,Nucleus ,03 medical and health sciences ,Imaging, Three-Dimensional ,Segmentation ,0302 clinical medicine ,Structural Biology ,Image Processing, Computer-Assisted ,Electron microscopy ,3D reconstruction ,Super-resolution microscopy ,lcsh:QH301-705.5 ,Molecular Biology ,030304 developmental biology ,Cell Nucleus ,Microscopy ,0303 health sciences ,3D FISH ,business.industry ,Applied Mathematics ,Pattern recognition ,Bioimaging ,Chromatin ,Computer Science Applications ,Visualization ,lcsh:Biology (General) ,lcsh:R858-859.7 ,Artificial intelligence ,Data mining ,Focus (optics) ,business ,computer ,Algorithms ,Software ,030217 neurology & neurosurgery - Abstract
Background Bioimaging techniques offer a robust tool for studying molecular pathways and morphological phenotypes of cell populations subjected to various conditions. As modern high resolution 3D microscopy provides access to an ever-increasing amount of high quality images, there arises a need for their analysis in an automated, unbiased and simple way. Segmentation of structures within cell nucleus, which is the focus of this paper, presents a new layer of complexity in the form of dense packing and significant signal overlap. At the same time the available segmentation tools provide a steep learning curve for new users with limited technical background. This is especially apparent in bulk processing of image sets, which requires the use of some form of programming notation. Results In this paper, we present PartSeg, a tool for segmentation and reconstruction of 3D microscopy images, optimised for the study of cell nucleus. PartSeg integrates refined versions of several state-of-the-art algorithms, including a new multi-scale approach for segmentation and quantitative analysis of 3D microscopy images. The features and user-friendly interface of PartSeg were carefully planned with biologists in mind, based on analysis of multiple use cases and difficulties encountered with other tools, to offer ergonomic interface with a minimal entry barrier. Bulk processing in an ad-hoc manner is possible without the need for programmer support. As the size of datasets of interest grows, such bulk processing solutions become essential for proper statistical analysis of results. Advanced users can use PartSeg components as a library within Python data processing and visualisation pipelines, for example within Jupyter notebooks. 
The tool is extensible so that new functionality and algorithms can be added by the use of plugins. For biologists the utility of PartSeg is presented in several scenarios, showing the quantitative analysis of nuclear structures. Conclusions In this paper, we have presented PartSeg which is a tool for precise and verifiable segmentation and reconstruction of 3D microscopy images. PartSeg is optimised for cell nucleus analysis and offers multiscale segmentation algorithms best-suited for this task. PartSeg can also be used for bulk processing of multiple images and its components can be reused in other systems or computational experiments. Contact g.bokota@cent.uw.edu.pl, a.magalska@nencki.edu.pl, d.plewczynski@cent.uw.edu.pl
- Published
- 2021
28. Application of artificial intelligence ensemble learning model in early prediction of atrial fibrillation
- Author
-
Jian Huang, Yen-Ming J. Chen, Yiu-Jen Chang, Maxwell Hwang, Wen-Hsien Ho, Tian-Hsiang Huang, Kao-Shing Hwang, Cai Wu, and Tsung-Han Ho
- Subjects
Artificial intelligence ,QH301-705.5 ,Computer science ,Computer applications to medicine. Medical informatics ,Feature extraction ,R858-859.7 ,Biochemistry ,Machine Learning ,symbols.namesake ,Electrocardiography ,Structural Biology ,Ensemble learning ,Atrial Fibrillation ,Feature (machine learning) ,Gaussian function ,Humans ,Sensitivity (control systems) ,AdaBoost ,Biology (General) ,Molecular Biology ,Receiver operating characteristic ,business.industry ,Applied Mathematics ,Research ,Computer Science Applications ,Electrocardiogram ,ROC Curve ,symbols ,F1 score ,business ,Algorithms - Abstract
Background Atrial fibrillation is a paroxysmal heart disease without any obvious symptoms for most people during the onset. The electrocardiogram (ECG) at the time other than the onset of this disease is not significantly different from that of normal people, which makes it difficult to detect and diagnose. However, if atrial fibrillation is not detected and treated early, it tends to worsen the condition and increase the possibility of stroke. In this paper, P-wave morphology parameters and heart rate variability feature parameters were simultaneously extracted from the ECG. A total of 31 parameters were used as input variables to perform the modeling of artificial intelligence ensemble learning model. Results This paper applied three artificial intelligence ensemble learning methods, namely Bagging ensemble learning method, AdaBoost ensemble learning method, and Stacking ensemble learning method. The prediction results of these three artificial intelligence ensemble learning methods were compared. As a result of the comparison, the Stacking ensemble learning method combined with various models finally obtained the best prediction effect with the accuracy of 92%, sensitivity of 88%, specificity of 96%, positive predictive value of 95.7%, negative predictive value of 88.9%, F1 score of 0.9231 and area under receiver operating characteristic curve value of 0.911. Conclusion In feature extraction, this paper combined P-wave morphology parameters and heart rate variability parameters as input parameters for model training, and validated the value of the proposed parameters combination for the improvement of the model’s predicting effect. In the calculation of the P-wave morphology parameters, the hybrid Taguchi-genetic algorithm was used to obtain more accurate Gaussian function fitting parameters. 
The prediction model was trained using the Stacking ensemble learning method, so that the model accuracy had better results, which can further improve the early prediction of atrial fibrillation.
- Published
- 2021
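The evaluation measures listed in the abstract of record 28 all derive from a binary confusion matrix. The sketch below spells out the standard definitions. The counts are hypothetical, chosen only because they reproduce the reported accuracy, sensitivity, specificity, positive predictive value and negative predictive value; they are not taken from the paper.

```python
def binary_metrics(tp, fn, tn, fp):
    """Standard binary-classification measures from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),       # recall of the positive class
        "specificity": tn / (tn + fp),       # recall of the negative class
        "ppv": tp / (tp + fp),               # positive predictive value
        "npv": tn / (tn + fn),               # negative predictive value
        "f1": 2 * tp / (2 * tp + fp + fn),   # F1 of the positive class
    }


# Hypothetical counts for a balanced test set of 100 recordings.
metrics = binary_metrics(tp=44, fn=6, tn=48, fp=2)
```

With these hypothetical counts the positive-class F1 is 88/96 ≈ 0.917 rather than the reported 0.9231; the same counts give 96/104 ≈ 0.9231 when the negative class is scored instead, so the reported figure may follow a different class convention or test-set composition that the abstract does not specify.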
29. Identification of essential proteins based on edge features and the fusion of multiple-source biological information
- Author
-
Peiqiang Liu, Chang Liu, Yanyan Mao, Junhong Guo, Fanshu Liu, Wangmin Cai, and Feng Zhao
- Subjects
Structural Biology ,Applied Mathematics ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
Background A major current focus in the analysis of protein–protein interaction (PPI) data is how to identify essential proteins. As massive PPI data are available, this warrants the design of efficient computing methods for identifying essential proteins. Previous studies have achieved considerable performance. However, as a consequence of the features of high noise and structural complexity in PPIs, it is still a challenge to further upgrade the performance of the identification methods. Methods This paper proposes an identification method, named CTF, which identifies essential proteins based on edge features including h-quasi-cliques and uv-triangle graphs and the fusion of multiple-source information. We first design an edge-weight function, named EWCT, for computing the topological scores of proteins based on quasi-cliques and triangle graphs. Then, we generate an edge-weighted PPI network using EWCT and dynamic PPI data. Finally, we compute the essentiality of proteins by the fusion of topological scores and three scores of biological information. Results We evaluated the performance of the CTF method by comparison with 16 other methods, such as MON, PeC, TEGS, and LBCC. The experimental results on three datasets of Saccharomyces cerevisiae show that CTF outperforms the state-of-the-art methods. Moreover, our method indicates that the fusion of other biological information is beneficial to improve the accuracy of identification.
- Published
- 2023
30. A systematic review of biologically-informed deep learning models for cancer: fundamental trends for encoding and interpreting oncology data
- Author
-
Magdalena Wysocka, Oskar Wysocki, Marie Zufferey, Donal Landers, and Andre Freitas
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Artificial Intelligence (cs.AI) ,Computer Science - Artificial Intelligence ,Structural Biology ,FOS: Biological sciences ,Applied Mathematics ,Quantitative Biology - Quantitative Methods ,Molecular Biology ,Biochemistry ,Quantitative Methods (q-bio.QM) ,Machine Learning (cs.LG) ,Computer Science Applications - Abstract
There is an increasing interest in the use of Deep Learning (DL) based methods as a supporting analytical framework in oncology. However, most direct applications of DL will deliver models with limited transparency and explainability, which constrain their deployment in biomedical settings. This systematic review discusses DL models used to support inference in cancer biology with a particular emphasis on multi-omics analysis. It focuses on how existing models address the need for better dialogue with prior knowledge, biological plausibility and interpretability, fundamental properties in the biomedical domain. For this, we retrieved and analyzed 42 studies focusing on emerging architectural and methodological advances, the encoding of biological domain knowledge and the integration of explainability methods. We discuss the recent evolutionary arc of DL models in the direction of integrating prior biological relational and network knowledge to support better generalisation (e.g. pathways or Protein-Protein-Interaction networks) and interpretability. This represents a fundamental functional shift towards models which can integrate mechanistic and statistical inference aspects. We introduce a concept of bio-centric interpretability and according to its taxonomy, we discuss representational methodologies for the integration of domain prior knowledge in such models. The paper provides a critical outlook into contemporary methods for explainability and interpretability used in DL for cancer. The analysis points in the direction of a convergence between encoding prior knowledge and improved interpretability. We introduce bio-centric interpretability which is an important step towards formalisation of biological interpretability of DL models and developing methods that are less problem- or application-specific.
- Published
- 2023
31. HLA-Clus: HLA class I clustering based on 3D structure
- Author
-
Yue Shen, Jerry M. Parks, and Jeremy C. Smith
- Subjects
Structural Biology ,Applied Mathematics ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
Background In a previous paper, we classified populated HLA class I alleles into supertypes and subtypes based on the similarity of the 3D landscape of peptide binding grooves, using a newly defined structure distance metric and a hierarchical clustering approach. Compared to other approaches, our method achieves higher correlation with peptide binding specificity, intra-cluster similarity (cohesion), and robustness. Here we introduce HLA-Clus, a Python package for clustering HLA Class I alleles using the method we developed recently and describe additional features including a new nearest neighbor clustering method that facilitates clustering based on user-defined criteria. Results The HLA-Clus pipeline includes three stages: First, HLA Class I structural models are coarse grained and transformed into clouds of labeled points. Second, similarities between alleles are determined using a newly defined structure distance metric that accounts for spatial and physicochemical similarities. Finally, alleles are clustered via hierarchical or nearest-neighbor approaches. We also interfaced HLA-Clus with the peptide:HLA affinity predictor MHCnuggets. By using the nearest neighbor clustering method to select optimal allele-specific deep learning models in MHCnuggets, the average accuracy of peptide binding prediction of rare alleles was improved. Conclusions The HLA-Clus package offers a solution for characterizing the peptide binding specificities of a large number of HLA alleles. This method can be applied in HLA functional studies, such as the development of peptide affinity predictors, disease association studies, and HLA matching for grafting. HLA-Clus is freely available at our GitHub repository (https://github.com/yshen25/HLA-Clus).
- Published
- 2023
32. Automated recognition and analysis of body bending behavior in C. elegans
- Author
-
Hui Zhang and Weiyang Chen
- Subjects
Structural Biology ,Applied Mathematics ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
Background Locomotion behaviors of Caenorhabditis elegans play an important role in drug activity screening, anti-aging research, and toxicological assessment. Previous studies have provided important insights into drug activity screening, anti-aging, and toxicological research by manually counting the number of body bends. However, manual counting is often low-throughput and takes a lot of time and manpower. It is also prone to human bias and counting errors. Results In this paper, an algorithm is proposed for automatic counting and analysis of the body bending behavior of nematodes. First, a numerical coordinate regression method with a convolutional neural network is used to obtain the head and tail coordinates. Next, a curvature-based feature point extraction algorithm is used to calculate the feature points of the nematode centerline. Then the maximum distance between the peak point and the straight line between the pharynx and the tail is calculated. The number of body bends is counted according to the change in the maximum distance per frame. Conclusion Experiments are performed to prove the effectiveness of the proposed algorithm. The accuracy of head coordinate prediction is 0.993, and the accuracy of tail coordinate prediction is 0.990. The Pearson correlation coefficient between the results of the automatic count and manual count of the number of body bends is 0.998 and the mean absolute error is 1.931. Different strains of nematodes are selected to analyze differences in body bending behavior, demonstrating a relationship between nematode vitality and lifespan. The code is freely available at https://github.com/hthana/Body-Bend-Count.
- Published
- 2023
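The geometric step in the abstract of record 32, the distance from centerline points to the straight line joining the two ends of the body, can be sketched as follows. The counting rule used here (one bend each time the signed peak deviation changes side) is an assumption made for illustration; the abstract only says that bends are counted from the per-frame change in the maximum distance. All function names and the `min_amplitude` parameter are hypothetical.

```python
import math


def signed_distance(point, start, end):
    """Signed perpendicular distance from point to the line through start and end."""
    (px, py), (ax, ay), (bx, by) = point, start, end
    dx, dy = bx - ax, by - ay
    return (dx * (py - ay) - dy * (px - ax)) / math.hypot(dx, dy)


def peak_deviation(centerline):
    """Signed distance of the centerline point farthest from the end-to-end chord."""
    start, end = centerline[0], centerline[-1]
    return max(
        (signed_distance(p, start, end) for p in centerline[1:-1]),
        key=abs,
    )


def count_bends(frames, min_amplitude=0.0):
    """Count side changes of the peak deviation across frames (assumed rule)."""
    bends, previous_side = 0, 0
    for centerline in frames:
        deviation = peak_deviation(centerline)
        if abs(deviation) <= min_amplitude:
            continue  # ignore frames where the body is nearly straight
        side = 1 if deviation > 0 else -1
        if previous_side and side != previous_side:
            bends += 1
        previous_side = side
    return bends


# Three hypothetical frames: bent one way, the other way, then back again.
frames = [
    [(0, 0), (1, 1), (2, 0)],
    [(0, 0), (1, -1), (2, 0)],
    [(0, 0), (1, 1), (2, 0)],
]
bend_count = count_bends(frames)
```

The `min_amplitude` threshold is one simple way to suppress jitter in near-straight postures; a suitable value would depend on image scale and is not given in the source.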
33. meth-SemiCancer: a cancer subtype classification framework via semi-supervised learning utilizing DNA methylation profiles
- Author
-
Choi, Joung M., Park, Chaelin, and Chae, Heejoon
- Subjects
Structural Biology ,Applied Mathematics ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
Background Identification of the cancer subtype plays a crucial role in providing an accurate diagnosis and proper treatment to improve the clinical outcomes of patients. Recent studies have shown that DNA methylation is one of the key factors for tumorigenesis and tumor growth, where the DNA methylation signatures have the potential to be utilized as cancer subtype-specific markers. However, due to the high dimensionality and the low number of DNA methylome cancer samples with the subtype information, no cancer subtype classification method utilizing DNA methylome datasets has been proposed to date. Results In this paper, we present meth-SemiCancer, a semi-supervised cancer subtype classification framework based on DNA methylation profiles. The proposed model was first pre-trained based on the methylation datasets with the cancer subtype labels. After that, meth-SemiCancer generated the pseudo-subtypes for the cancer datasets without subtype information based on the model’s prediction. Finally, fine-tuning was performed utilizing both the labeled and unlabeled datasets. Conclusions From the performance comparison with the standard machine learning-based classifiers, meth-SemiCancer achieved the highest average F1-score and Matthews correlation coefficient, outperforming other methods. Fine-tuning the model with the unlabeled patient samples by providing the proper pseudo-subtypes encouraged meth-SemiCancer to generalize better than the supervised neural network-based subtype classification method. meth-SemiCancer is publicly available at https://github.com/cbi-bioinfo/meth-SemiCancer.
- Published
- 2023
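The three-step workflow in the abstract of record 33 (pre-train on labeled samples, assign pseudo-subtypes to unlabeled samples, fine-tune on both) is a form of pseudo-labeling. The sketch below shows only that control flow. A nearest-centroid classifier is swapped in for the neural network used by meth-SemiCancer so that the example stays within the standard library; the data and function names are hypothetical.

```python
def fit_centroids(samples, labels):
    """Stand-in 'model': the per-class mean of the feature vectors."""
    sums, counts = {}, {}
    for features, label in zip(samples, labels):
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: [value / counts[label] for value in total]
        for label, total in sums.items()
    }


def predict(model, features):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(
        model,
        key=lambda label: sum((c - f) ** 2 for c, f in zip(model[label], features)),
    )


def pseudo_label_training(labeled, labels, unlabeled):
    """Pre-train, pseudo-label the unlabeled samples, then refit on everything."""
    model = fit_centroids(labeled, labels)                      # pre-training
    pseudo = [predict(model, features) for features in unlabeled]  # pseudo-subtypes
    model = fit_centroids(labeled + unlabeled, labels + pseudo)    # fine-tuning
    return model, pseudo


# Hypothetical methylation-like features (values in [0, 1]) for two subtypes.
labeled = [[0.1, 0.2], [0.9, 0.8]]
labels = ["A", "B"]
unlabeled = [[0.2, 0.1], [0.8, 0.9]]
model, pseudo = pseudo_label_training(labeled, labels, unlabeled)
```

In practice pseudo-labels are often filtered by prediction confidence before fine-tuning; whether and how the paper does this is not stated in the abstract.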
34. Predicting disease genes based on multi-head attention fusion
- Author
-
Linlin Zhang, Dianrong Lu, Xuehua Bi, Kai Zhao, Guanglei Yu, and Na Quan
- Subjects
Structural Biology ,Applied Mathematics ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
Background The identification of disease-related genes is of great significance for the diagnosis and treatment of human disease. Most studies have focused on developing efficient and accurate computational methods to predict disease-causing genes. Due to the sparsity and complexity of biomedical data, it is still a challenge to develop an effective multi-feature fusion model to identify disease genes. Results This paper proposes an approach to predict pathogenic genes based on multi-head attention fusion (MHAGP). Firstly, the heterogeneous biological information networks of disease genes are constructed by integrating multiple biomedical knowledge databases. Secondly, two graph representation learning algorithms are used to capture the feature vectors of gene-disease pairs from the network, and the features are fused by introducing multi-head attention. Finally, a multi-layer perceptron model is used to predict the gene-disease association. Conclusions The MHAGP model outperforms all other methods in comparative experiments. Case studies also show that MHAGP is able to predict genes potentially associated with diseases. In the future, more biological entity association data, such as gene-drug and disease phenotype-gene ontology associations, can be added to expand the information in heterogeneous biological networks and achieve more accurate predictions. In addition, the highly extensible MHAGP can be used for potential tasks such as gene-drug association and drug-disease association prediction.
- Published
- 2023
35. Automatic block-wise genotype-phenotype association detection based on hidden Markov model
- Author
-
Jin Du, Chaojie Wang, Lijun Wang, Shanjun Mao, Bencong Zhu, Zheng Li, and Xiaodan Fan
- Subjects
Structural Biology ,Applied Mathematics ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
Background For detecting genotype-phenotype association from case–control single nucleotide polymorphism (SNP) data, one class of methods relies on testing each genomic variant site individually. However, this approach ignores the tendency for associated variant sites to be spatially clustered instead of uniformly distributed along the genome. Therefore, a more recent class of methods looks for blocks of influential variant sites. Unfortunately, existing methods of this kind either assume prior knowledge of the blocks, or rely on ad hoc moving windows. A principled method is needed to automatically detect genomic variant blocks which are associated with the phenotype. Results In this paper, we introduce an automatic block-wise Genome-Wide Association Study (GWAS) method based on a hidden Markov model. Using case–control SNP data as input, our method detects the number of blocks associated with the phenotype and the locations of the blocks. Correspondingly, the minor allele of each variant site will be classified as having negative influence, no influence or positive influence on the phenotype. We evaluated our method using both datasets simulated from our model and datasets from a block model different from ours, and compared the performance with other methods. These included both simple methods based on Fisher’s exact test, applied site-by-site, as well as more complex methods built into the recent Zoom-Focus Algorithm. Across all simulations, our method consistently outperformed the comparisons. Conclusions With its demonstrated better performance, we expect our algorithm for detecting influential variant sites may help find more accurate signals across a wide range of case–control GWAS.
- Published
- 2023
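The site-by-site baseline mentioned in the abstract of record 35 applies Fisher's exact test to one 2×2 table per variant site. A minimal standard-library sketch of the two-sided test is given below. The table layout (rows: cases and controls; columns: minor and major allele counts) and the example counts are assumptions for illustration, not data from the paper.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d

    def pmf(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(total, col1)

    observed = pmf(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one;
    # the small relative tolerance guards against floating-point ties.
    p_value = sum(
        pmf(k) for k in range(low, high + 1) if pmf(k) <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


def scan_sites(tables):
    """Apply the test independently to each site's (a, b, c, d) table."""
    return [fisher_exact_two_sided(*table) for table in tables]


# Hypothetical allele counts at two sites:
# (case minor, case major, control minor, control major).
p_values = scan_sites([(3, 1, 1, 3), (2, 2, 2, 2)])
```

Because each site is tested on its own, this baseline cannot exploit the spatial clustering of associated sites that the block-wise method is designed to capture, and a real scan would also need a multiple-testing correction.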
36. clusterMaker2: a major update to clusterMaker, a multi-algorithm clustering app for Cytoscape
- Author
-
Maija Utriainen and John H. Morris
- Subjects
Structural Biology ,Applied Mathematics ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
Background Since the initial publication of clusterMaker, the need for tools to analyze large biological datasets has only increased. New datasets are significantly larger than a decade ago, and new experimental techniques such as single-cell transcriptomics continue to drive the need for clustering or classification techniques to focus on portions of datasets of interest. While many libraries and packages exist that implement various algorithms, there remains the need for clustering packages that are easy to use, integrated with visualization of the results, and integrated with other commonly used tools for biological data analysis. clusterMaker2 has added several new algorithms, including two entirely new categories of analyses: node ranking and dimensionality reduction. Furthermore, many of the new algorithms have been implemented using the Cytoscape jobs API, which provides a mechanism for executing remote jobs from within Cytoscape. Together, these advances facilitate meaningful analyses of modern biological datasets despite their ever-increasing size and complexity. Results The use of clusterMaker2 is exemplified by reanalyzing the yeast heat shock expression experiment that was included in our original paper; however, here we explored this dataset in significantly more detail. Combining this dataset with the yeast protein–protein interaction network from STRING, we were able to perform a variety of analyses and visualizations from within clusterMaker2, including Leiden clustering to break the entire network into smaller clusters, hierarchical clustering to look at the overall expression dataset, dimensionality reduction using UMAP to find correlations between our hierarchical visualization and the UMAP plot, fuzzy clustering, and cluster ranking. Using these techniques, we were able to explore the highest-ranking cluster and determine that it represents a strong contender for proteins working together in response to heat shock. 
We found a series of clusters that, when re-explored as fuzzy clusters, provide a better presentation of mitochondrial processes. Conclusions clusterMaker2 represents a significant advance over the previously published version, and most importantly, provides an easy-to-use tool to perform clustering and to visualize clusters within the Cytoscape network context. The new algorithms should be welcome to the large population of Cytoscape users, particularly the new dimensionality reduction and fuzzy clustering techniques.
- Published
- 2023
37. A two-stage hybrid biomarker selection method based on ensemble filter and binary differential evolution incorporating binary African vultures optimization
- Author
-
Wei Li, Yuhuan Chi, Kun Yu, and Weidong Xie
- Subjects
Structural Biology ,Applied Mathematics ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
Background In the field of genomics and personalized medicine, it is a key issue to find biomarkers directly related to the diagnosis of specific diseases from high-throughput gene microarray data. Feature selection technology can discover biomarkers with disease classification information. Results We use support vector machines as classifiers and use the five-fold cross-validation average classification accuracy, recall, precision and F1 score as evaluation metrics to evaluate the identified biomarkers. Experimental results show classification accuracy above 0.93, recall above 0.92, precision above 0.91, and F1 score above 0.94 on eight microarray datasets. Method This paper proposes a two-stage hybrid biomarker selection method based on ensemble filter and binary differential evolution incorporating binary African vultures optimization (EF-BDBA), which can effectively reduce the dimension of microarray data and obtain optimal biomarkers. In the first stage, we propose an ensemble filter feature selection method. The method combines an improved fast correlation-based filter algorithm with Fisher score. Obviously redundant and irrelevant features can be filtered out to initially reduce the dimensionality of the microarray data. In the second stage, the optimal feature subset is selected using an improved binary differential evolution incorporating an improved binary African vultures optimization algorithm. The African vultures optimization algorithm has excellent global optimization ability. It has not been systematically applied to feature selection problems, especially for gene microarray data. We combine it with a differential evolution algorithm to improve population diversity. Conclusion Compared with traditional feature selection methods and advanced hybrid methods, the proposed method achieves higher classification accuracy and identifies excellent biomarkers while retaining fewer features. 
The experimental results demonstrate the effectiveness and advancement of our proposed algorithmic model.
- Published
- 2023
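The Fisher score named in the EF-BDBA abstract above can be illustrated with a minimal sketch. It uses the textbook definition (between-class scatter divided by within-class scatter, per feature) and is not the authors' improved ensemble filter; the function names `fisher_score` and `rank_features` and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def fisher_score(column, labels):
    """Textbook Fisher score of one feature: between-class / within-class scatter."""
    groups = defaultdict(list)
    for value, label in zip(column, labels):
        groups[label].append(value)

    overall_mean = sum(column) / len(column)
    between = 0.0
    within = 0.0
    for values in groups.values():
        n = len(values)
        mean = sum(values) / n
        variance = sum((v - mean) ** 2 for v in values) / n
        between += n * (mean - overall_mean) ** 2
        within += n * variance

    if within == 0.0:
        # Feature is constant inside every class: uninformative if the class
        # means also coincide, perfectly separating otherwise.
        return 0.0 if between == 0.0 else float("inf")
    return between / within


def rank_features(rows, labels):
    """Return feature indices sorted from highest to lowest Fisher score."""
    n_features = len(rows[0])
    scores = [fisher_score([row[j] for row in rows], labels) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


# Hypothetical toy data: feature 0 separates the two classes, feature 1 does not.
rows = [[1.0, 5.0], [1.2, 3.0], [3.0, 4.9], [3.2, 3.1]]
labels = [0, 0, 1, 1]
ranking = rank_features(rows, labels)
```

A filter stage of this kind would keep only the top-ranked features before any wrapper search; how many to keep is a tuning choice the abstract does not specify.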
38. Analysis of associations between emotions and activities of drug users and their addiction recovery tendencies from social media posts using structural equation modeling
- Author
-
Rahul Singh and Deeptanshu Jha
- Subjects
Drug ,Text mining ,Substance-Related Disorders ,Reddit ,media_common.quotation_subject ,Emotions ,Word count ,030508 substance abuse ,Context (language use) ,lcsh:Computer applications to medicine. Medical informatics ,Substance misuse disorder ,Biochemistry ,Structural equation modeling ,Developmental psychology ,Social media ,Addiction recovery ,Opioid epidemic ,03 medical and health sciences ,0302 clinical medicine ,Online communities ,Recurrence ,Structural Biology ,Humans ,030212 general & internal medicine ,lcsh:QH301-705.5 ,Molecular Biology ,Addiction relapse ,media_common ,Research ,Applied Mathematics ,Addiction ,Personalized interventions ,Computer Science Applications ,lcsh:Biology (General) ,Latent Class Analysis ,Research Design ,Life expectancy ,lcsh:R858-859.7 ,0305 other medical science - Abstract
Background Addiction to drugs and alcohol constitutes one of the significant factors underlying the decline in life expectancy in the US. Several context-specific reasons influence drug use and recovery. In particular, emotional distress, physical pain, relationships, and self-development efforts are known to be some of the factors associated with addiction recovery. Unfortunately, many of these factors are not directly observable, and quantifying and assessing their impact can be difficult. Based on social media posts of users engaged in substance use and recovery on the forum Reddit, we employed two psycholinguistic tools, Linguistic Inquiry and Word Count and Empath, together with the activities of substance users on various Reddit sub-forums, to analyze behavior underlying addiction recovery and relapse. We then employed a statistical analysis technique called structural equation modeling to assess the effects of these latent factors on recovery and relapse. Results We found that both emotional distress and physical pain significantly influence addiction recovery behavior. Self-development activities and social relationships of the substance users were also found to enable recovery. Furthermore, within the context of self-development activities, those that were related to influencing the mental and physical well-being of substance users were found to be positively associated with addiction recovery. We also determined that lack of social activities and physical exercise can enable a relapse. Moreover, geography, especially life in rural areas, appears to have a greater correlation with addiction relapse. Conclusions The paper describes how observable variables can be extracted from social media and then be used to model important latent constructs that impact addiction recovery and relapse. We also report factors that impact self-induced addiction recovery and relapse. 
To the best of our knowledge, this paper represents the first use of structural equation modeling of social media data with the goal of analyzing factors influencing addiction recovery.
- Published
- 2020
39. DSCMF: prediction of LncRNA-disease associations based on dual sparse collaborative matrix factorization
- Author
-
Jin-Xing Liu, Feng Li, Zhen Cui, Ying-Lian Gao, and Ming-Ming Gao
- Subjects
Male ,Computer science ,QH301-705.5 ,Gaussian ,Computer applications to medicine. Medical informatics ,R858-859.7 ,Value (computer science) ,Breast Neoplasms ,Disease ,Machine learning ,computer.software_genre ,Biochemistry ,Matrix decomposition ,Gaussian interaction profile kernel ,03 medical and health sciences ,symbols.namesake ,0302 clinical medicine ,Similarity (network science) ,Structural Biology ,Code (cryptography) ,Humans ,Computer Simulation ,Biology (General) ,Molecular Biology ,030304 developmental biology ,0303 health sciences ,business.industry ,Applied Mathematics ,Research ,Prostatic Neoplasms ,Computer Science Applications ,Dual (category theory) ,Kernel (image processing) ,LncRNA-disease associations ,030220 oncology & carcinogenesis ,symbols ,RNA, Long Noncoding ,Artificial intelligence ,Collaborative matrix factorization ,business ,computer ,Algorithms - Abstract
Background With the development of science and technology, there is increasing evidence of associations between lncRNAs and human diseases. Therefore, finding these associations will have a huge impact on our treatment and prevention of some diseases. However, the process of finding these associations is very difficult and requires a lot of time and effort. Therefore, it is particularly important to find good methods for predicting lncRNA-disease associations (LDAs). Results In this paper, we propose a method based on dual sparse collaborative matrix factorization (DSCMF) to predict LDAs. The DSCMF method improves on the traditional collaborative matrix factorization method. To increase the sparsity, the L2,1-norm is added in our method. At the same time, the Gaussian interaction profile kernel is added to our method, which increases the network similarity between lncRNA and disease. Finally, the AUC value obtained by the experiment is used to evaluate the quality of our method, and the AUC value is obtained by the ten-fold cross-validation method. Conclusions The AUC value obtained by the DSCMF method is 0.8523. At the end of the paper, a simulation experiment is carried out, and the experimental results of prostate cancer, breast cancer, ovarian cancer and colorectal cancer are analyzed in detail. The DSCMF method is expected to bring some help to lncRNA-disease associations research. The code is available at https://github.com/Ming-0113/DSCMF.
- Published
- 2020
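The Gaussian interaction profile kernel mentioned in the DSCMF abstract above can be sketched as follows. This uses the commonly cited definition, in which the similarity of two entities is a Gaussian of the distance between their binary association profiles and the bandwidth is normalized by the mean squared profile norm. The function name `gip_kernel`, the `gamma_prime` default of 1.0, and the toy profiles are assumptions, not values taken from the paper.

```python
import math


def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel over rows of a 0/1 association matrix.

    K[i][j] = exp(-gamma * ||p_i - p_j||^2), with
    gamma = gamma_prime / (mean of ||p_i||^2 over all rows).
    """
    n = len(profiles)
    mean_sq_norm = sum(sum(x * x for x in p) for p in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = gamma_prime / mean_sq_norm

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        [math.exp(-gamma * squared_distance(profiles[i], profiles[j])) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical lncRNA-by-disease association rows (1 = known association).
lncrna_profiles = [
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
]
similarity = gip_kernel(lncrna_profiles)
```

Rows with identical association profiles receive similarity 1, and the similarity decays as the profiles diverge; the same function applied to the transposed matrix would give a disease-by-disease kernel.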
40. Optimized permutation testing for information theoretic measures of multi-gene interactions
- Author
-
David J. Galas, James Kunert-Graf, and Nikita A. Sakhanenko
- Subjects
Information theory ,Genotype ,Computer science ,Pipeline (computing) ,Computation ,0206 medical engineering ,02 engineering and technology ,lcsh:Computer applications to medicine. Medical informatics ,Biochemistry ,Polymorphism, Single Nucleotide ,Bottleneck ,03 medical and health sciences ,Structural Biology ,Resampling ,Code (cryptography) ,Multivariable interactions ,Humans ,Molecular Biology ,lcsh:QH301-705.5 ,030304 developmental biology ,0303 health sciences ,Applied Mathematics ,Multivariable calculus ,Methodology Article ,Epistasis, Genetic ,Computer Science Applications ,Permutation testing ,Exact test ,ComputingMethodologies_PATTERNRECOGNITION ,Phenotype ,lcsh:Biology (General) ,Multi-locus GWAS ,lcsh:R858-859.7 ,Algorithm ,020602 bioinformatics ,Genome-Wide Association Study - Abstract
Background Permutation testing is often considered the “gold standard” for multi-test significance analysis, as it is an exact test requiring few assumptions about the distribution being computed. However, it can be computationally very expensive, particularly in its naive form in which the full analysis pipeline is re-run after permuting the phenotype labels. This can become intractable in multi-locus genome-wide association studies (GWAS), in which the number of potential interactions to be tested is combinatorially large. Results In this paper, we develop an approach for permutation testing in multi-locus GWAS, specifically focusing on SNP–SNP-phenotype interactions using multivariable measures that can be computed from frequency count tables, such as those based in Information Theory. We find that the computational bottleneck in this process is the construction of the count tables themselves, and that this step can be eliminated at each iteration of the permutation testing by transforming the count tables directly. This leads to a speed-up by a factor of over 10^3 for a typical permutation test compared to the naive approach. Additionally, this approach is insensitive to the number of samples, making it suitable for datasets with a large number of samples. Conclusions The proliferation of large-scale datasets with genotype data for hundreds of thousands of individuals enables new and more powerful approaches for the detection of multi-locus genotype-phenotype interactions. Our approach significantly improves the computational tractability of permutation testing for these studies. Moreover, our approach is insensitive to the large number of samples in these modern datasets. The code for performing these computations and replicating the figures in this paper is freely available at https://github.com/kunert/permute-counts.
- Published
- 2020
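For orientation, the naive permutation test that the abstract above uses as its baseline can be sketched as follows: the phenotype labels are shuffled, the count table is rebuilt from scratch, and the statistic is recomputed on every iteration. This is not the authors' optimized method (their code is at the repository cited in the abstract). The statistic here is single-SNP mutual information for brevity, and the function names, `n_perm`, and `seed` are assumptions.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information in bits, computed from a frequency count table."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # the count table rebuilt on every call
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def permutation_p_value(genotypes, phenotypes, n_perm=1000, seed=0):
    """Naive permutation test: reshuffle labels and recount each iteration."""
    rng = random.Random(seed)
    observed = mutual_information(genotypes, phenotypes)
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mutual_information(genotypes, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: genotype perfectly predicts phenotype.
genotypes = [0] * 20 + [1] * 20
phenotypes = [0] * 20 + [1] * 20
p_value = permutation_p_value(genotypes, phenotypes)
```

The cost of each iteration is dominated by rebuilding `joint`, which scales with the number of samples; that recounting step is the bottleneck the abstract says can be avoided by transforming the count tables directly.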
41. Sparse clusterability: testing for cluster structure in high dimensions
- Author
-
Jose Laborde, Paul A. Stewart, Zhihua Chen, Yian A. Chen, and Naomi C. Brownstein
- Subjects
Structural Biology ,Applied Mathematics ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
Background Cluster analysis is utilized frequently in scientific theory and applications to separate data into groups. A key assumption in many clustering algorithms is that the data was generated from a population consisting of multiple distinct clusters. Clusterability testing allows users to question the inherent assumption of latent cluster structure, a theoretical requirement for meaningful results in cluster analysis. Results This paper proposes methods for clusterability testing designed for high-dimensional data by utilizing sparse principal component analysis. Type I error and power of the clusterability tests are evaluated using simulated data with different types of cluster structure in high dimensions. Empirical performance of the new methods is evaluated and compared with prior methods on gene expression, microarray, and shotgun proteomics data. Our methods had reasonably low Type I error and maintained power for many datasets with a variety of structures and dimensions. Cluster structure was not detectable in other datasets with spatially close clusters. Conclusion This is the first analysis of clusterability testing on both simulated and real-world high-dimensional data.
- Published
- 2023
42. CellTrackVis: interactive browser-based visualization for analyzing cell trajectories and lineages
- Author
-
Changbeom Shim, Wooil Kim, Tran Thien Dat Nguyen, Du Yong Kim, Yu Suk Choi, and Yon Dohn Chung
- Subjects
Structural Biology ,Applied Mathematics ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
Background Automatic cell tracking methods enable practitioners to analyze cell behaviors efficiently. Notwithstanding the continuous development of relevant software, user-friendly visualization tools have room for further improvements. Typical visualization mostly comes with main cell tracking tools as a simple plug-in, or relies on specific software/platforms. Although some tools are standalone, limited visual interactivity is provided, or otherwise cell tracking outputs are partially visualized. Results This paper proposes a self-reliant visualization system, CellTrackVis, to support quick and easy analysis of cell behaviors. Interconnected views help users discover meaningful patterns of cell motions and divisions in common web browsers. Specifically, cell trajectory, lineage, and quantified information are respectively visualized in a coordinated interface. In particular, immediate interactions among modules enable the study of cell tracking outputs to be more effective, and also each component is highly customizable for various biological tasks. Conclusions CellTrackVis is a standalone browser-based visualization tool. Source codes and data sets are freely available at http://github.com/scbeom/celltrackvis with the tutorial at http://scbeom.github.io/ctv_tutorial.
- Published
- 2023
43. cnnLSV: detecting structural variants by encoding long-read alignment information and convolutional neural network
- Author
-
Huidong Ma, Cheng Zhong, Danyang Chen, Haofa He, and Feng Yang
- Subjects
Structural Biology ,Applied Mathematics ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
Background Genomic structural variant detection is a significant and challenging issue in genome analysis. The existing long-read based structural variant detection methods still have space for improvement in detecting multi-type structural variants. Results In this paper, we propose a method called cnnLSV to obtain detection results with higher quality by eliminating false positives in the detection results merged from the callsets of existing methods. We design an encoding strategy for four types of structural variants to represent long-read alignment information around structural variants into images, input the images into a constructed convolutional neural network to train a filter model, and load the trained model to remove the false positives to improve the detection performance. We also eliminate mislabeled training samples in the training model phase by using principal component analysis algorithm and unsupervised clustering algorithm k-means. Experimental results on both simulated and real datasets show that our proposed method outperforms existing methods overall in detecting insertions, deletions, inversions, and duplications. The program of cnnLSV is available at https://github.com/mhuidong/cnnLSV. Conclusions The proposed cnnLSV can detect structural variants by using long-read alignment information and convolutional neural network to achieve overall higher performance, and effectively eliminate incorrectly labeled samples by using the principal component analysis and k-means algorithms in training model stage.
- Published
- 2023
44. REDfold: accurate RNA secondary structure prediction using residual encoder-decoder network
- Author
-
Chun-Chi Chen and Yi-Ming Chan
- Subjects
Structural Biology ,Applied Mathematics ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
Background As the RNA secondary structure is highly related to its stability and functions, the structure prediction is of great value to biological research. The traditional computational approach to RNA secondary structure prediction is mainly based on the thermodynamic model with dynamic programming to find the optimal structure. However, the prediction performance based on the traditional approach is unsatisfactory for further research. Besides, the computational complexity of the structure prediction using dynamic programming is $$O(N^3)$$; it becomes $$O(N^6)$$ for RNA structure with pseudoknots, which is computationally impractical for large-scale analysis. Results In this paper, we propose REDfold, a novel deep learning-based method for RNA secondary structure prediction. REDfold utilizes an encoder-decoder network based on CNN to learn the short and long range dependencies among the RNA sequence, and the network is further integrated with symmetric skip connections to efficiently propagate activation information across layers. Moreover, the network output is post-processed with constrained optimization to yield favorable predictions even for RNAs with pseudoknots. Experimental results based on the ncRNA database demonstrate that REDfold achieves better performance in terms of efficiency and accuracy, outperforming the contemporary state-of-the-art methods.
- Published
- 2023
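The cubic cost of dynamic programming mentioned in the REDfold abstract above can be seen in a minimal Nussinov-style sketch. This maximizes the number of nested base pairs and is a simplification: it is not the thermodynamic energy model the abstract refers to, and it is not REDfold. The function name `max_base_pairs` and the `min_loop` default of 3 are assumptions.

```python
def max_base_pairs(seq, min_loop=3):
    """Nussinov-style DP: maximum number of nested (pseudoknot-free) base pairs.

    Three nested loops over (span, i, k) give the O(N^3) running time.
    """
    allowed = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                  # case 1: position j is unpaired
            for k in range(i, j - min_loop):     # case 2: j pairs with some k
                if (seq[k], seq[j]) in allowed:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


# Hypothetical hairpin: three stacked pairs closing an AAA loop.
pairs = max_base_pairs("GGGAAAUCC")
```

Because every pair splits the sequence into an inside and an outside that are solved independently, crossing pairs (pseudoknots) cannot be represented, which is why handling them requires a much more expensive recursion.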
45. EnsInfer: a simple ensemble approach to network inference outperforms any single method
- Author
-
Bingran Shen, Gloria Coruzzi, and Dennis Shasha
- Subjects
Structural Biology ,Applied Mathematics ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
This study evaluates both a variety of existing base causal inference methods and a variety of ensemble methods. We show that: (i) base network inference methods vary in their performance across different datasets, so a method that works poorly on one dataset may work well on another; (ii) a non-homogeneous ensemble method in the form of a Naive Bayes classifier leads overall to as good or better results than using the best single base method or any other ensemble method; (iii) for the best results, the ensemble method should integrate all methods that satisfy a statistical test of normality on training data. The resulting ensemble model EnsInfer easily integrates all kinds of RNA-seq data as well as new and existing inference methods. The paper categorizes and reviews state-of-the-art underlying methods, describes the EnsInfer ensemble approach in detail, and presents experimental results. The source code and data used will be made available to the community upon publication.
- Published
- 2023
46. CNN-Siam: multimodal siamese CNN-based deep learning approach for drug‒drug interaction prediction
- Author
-
Zihao Yang, Kuiyuan Tong, Shiyu Jin, Shiyan Wang, Chao Yang, and Feng Jiang
- Subjects
Structural Biology ,Applied Mathematics ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
Background Drug‒drug interactions (DDIs) are reactions between two or more drugs, i.e., possible situations that occur when two or more drugs are used simultaneously. DDIs act as an important link in both drug development and clinical treatment. Since it is not possible to study the interactions of such a large number of drugs using experimental means, a computer-based deep learning solution is always worth investigating. We propose a deep learning-based model that uses twin convolutional neural networks to learn representations from multimodal drug data and to make predictions about the possible types of drug effects. Results In this paper, we propose a novel convolutional neural network algorithm using a Siamese network architecture called CNN-Siam. CNN-Siam uses a convolutional neural network (CNN) as a backbone network in the form of a twin network architecture to learn the feature representation of drug pairs from multimodal data of drugs (including chemical substructures, targets and enzymes). Moreover, this network is used to predict the types of drug interactions with the best optimization algorithms available (RAdam and LookAhead). The experimental data show that the CNN-Siam achieves an area under the precision-recall (AUPR) curve score of 0.96 on the benchmark dataset and a correct rate of 92%. These results are significant improvements compared to the state-of-the-art method (from 86 to 92%) and demonstrate the robustness of the CNN-Siam and the superiority of the new optimization algorithm through ablation experiments. Conclusion The experimental results show that our multimodal siamese convolutional neural network can accurately predict DDIs, and the Siamese network architecture is able to learn the feature representation of drug pairs better than individual networks. CNN-Siam outperforms other state-of-the-art algorithms with the combination of data enhancement and better optimizers. 
At the same time, CNN-Siam has some drawbacks: longer training time, generalization that needs to be improved, and poorer classification results on some classes.
- Published
- 2023
47. Predicting miRNA-disease associations based on PPMI and attention network
- Author
-
Xuping Xie, Yan Wang, Kai He, and Nan Sheng
- Subjects
Structural Biology ,Applied Mathematics ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
Background With the development of biotechnology and the accumulation of theories, many studies have found that microRNAs (miRNAs) play an important role in various diseases. Uncovering the potential associations between miRNAs and diseases is helpful to better understand the pathogenesis of complex diseases. However, traditional biological experiments are expensive and time-consuming. Therefore, it is necessary to develop more efficient computational methods for exploring underlying disease-related miRNAs. Results In this paper, we present a new computational method based on positive point-wise mutual information (PPMI) and attention network to predict miRNA-disease associations (MDAs), called PATMDA. Firstly, we construct the heterogeneous MDA network and multiple similarity networks of miRNAs and diseases. Secondly, we respectively perform random walk with restart and PPMI on different similarity network views to get multi-order proximity features and then obtain high-order proximity representations of miRNAs and diseases by applying the convolutional neural network to fuse the learned proximity features. Then, we design an attention network with neural aggregation to integrate the representations of a node and its heterogeneous neighbor nodes according to the MDA network. Finally, an inner product decoder is adopted to calculate the relationship scores between miRNAs and diseases. Conclusions PATMDA achieves superior performance over the six state-of-the-art methods with the area under the receiver operating characteristic curve of 0.933 and 0.946 on the HMDD v2.0 and HMDD v3.2 datasets, respectively. The case studies further demonstrate the validity of PATMDA for discovering novel disease-associated miRNAs.
- Published
- 2023
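The positive point-wise mutual information (PPMI) step named in the PATMDA abstract above can be sketched with the standard definition applied to a matrix of non-negative co-occurrence weights. In the paper the input comes from random walk with restart on similarity networks; here the input matrix and the function name `ppmi` are hypothetical.

```python
import math


def ppmi(counts):
    """Positive PMI: max(0, log(c_ij * total / (row_i * col_j))) per cell."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    result = []
    for i, row in enumerate(counts):
        out_row = []
        for j, c in enumerate(row):
            if c > 0:
                pmi = math.log(c * total / (row_sums[i] * col_sums[j]))
                out_row.append(max(0.0, pmi))
            else:
                # log(0) is undefined; zero co-occurrence maps to zero PPMI.
                out_row.append(0.0)
        result.append(out_row)
    return result


# Hypothetical co-occurrence weights between two nodes and two contexts.
matrix = ppmi([[2, 0], [0, 2]])
```

Clipping negative values to zero keeps only associations that occur more often than expected under independence, which yields a sparse, non-negative feature matrix.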
48. Study of the error correction capability of multiple sequence alignment algorithm (MAFFT) in DNA storage
- Author
-
Ranze Xie, Xiangzhen Zan, Ling Chu, Yanqing Su, Peng Xu, and Wenbin Liu
- Subjects
Structural Biology ,Applied Mathematics ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
Synchronization (insertions–deletions) errors are still a major challenge for reliable information retrieval in DNA storage. Unlike traditional error correction codes (ECC) that add redundancy in the stored information, multiple sequence alignment (MSA) solves this problem by searching the conserved subsequences. In this paper, we conduct a comprehensive simulation study on the error correction capability of a typical MSA algorithm, MAFFT. Our results reveal that its capability exhibits a phase transition when there are around 20% errors. Below this critical value, increasing sequencing depth can eventually allow it to approach complete recovery. Otherwise, its performance plateaus at some poor levels. Given a reasonable sequencing depth (≤ 70), MSA could achieve complete recovery in the low error regime, and effectively correct 90% of the errors in the medium error regime. In addition, MSA is robust to imperfect clustering. It could also be combined with other means such as ECC, repeated markers, or any other code constraints. Furthermore, by selecting an appropriate sequencing depth, this strategy could achieve an optimal trade-off between cost and reading speed. MSA could be a competitive alternative for future DNA storage.
- Published
- 2023
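Once reads have been aligned, one simple way to recover a stored sequence is a column-wise majority vote. The sketch below assumes the alignment has already been produced (for example by MAFFT, which is not invoked here) and that gaps are written as `-`; the function name `consensus` and the toy reads are illustrative, and the abstract does not state that this exact voting rule was used.

```python
from collections import Counter


def consensus(aligned_reads, gap="-"):
    """Column-wise majority vote over equal-length aligned reads.

    Columns whose most common symbol is the gap are dropped, which removes
    insertions that appear in only a minority of reads.
    """
    if len({len(read) for read in aligned_reads}) != 1:
        raise ValueError("aligned reads must be non-empty and of equal length")
    out = []
    for column in zip(*aligned_reads):
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != gap:
            out.append(symbol)
    return "".join(out)


# Hypothetical alignment of three noisy copies of "ACGT":
# the second read lost its C, the third gained an extra A.
reads = ["ACG-T", "A-G-T", "ACGAT"]
recovered = consensus(reads)
```

With more reads per cluster, each column's vote becomes more reliable, which is consistent with the abstract's observation that increasing sequencing depth improves recovery below the critical error rate.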
49. EG-TransUNet: a transformer-based U-Net with enhanced and guided models for biomedical image segmentation
- Author
-
Shaoming Pan, Xin Liu, Ningdi Xie, and Yanwen Chong
- Subjects
Structural Biology ,Applied Mathematics ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
Although various methods based on convolutional neural networks have improved the performance of biomedical image segmentation to meet the precision requirements of medical imaging segmentation task, medical image segmentation methods based on deep learning still need to solve the following problems: (1) Difficulty in extracting the discriminative feature of the lesion region in medical images during the encoding process due to variable sizes and shapes; (2) difficulty in fusing spatial and semantic information of the lesion region effectively during the decoding process due to redundant information and the semantic gap. In this paper, we used the attention-based Transformer during the encoder and decoder stages to improve feature discrimination at the level of spatial detail and semantic location by its multihead-based self-attention. In conclusion, we propose an architecture called EG-TransUNet, including three modules improved by a transformer: progressive enhancement module, channel spatial attention, and semantic guidance attention. The proposed EG-TransUNet architecture allowed us to capture object variabilities with improved results on different biomedical datasets. EG-TransUNet outperformed other methods on two popular colonoscopy datasets (Kvasir-SEG and CVC-ClinicDB) by achieving 93.44% and 95.26% on mDice. Extensive experiments and visualization results demonstrate that our method advances the performance on five medical segmentation datasets with better generalization ability.
- Published
- 2023
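The mDice figures quoted in the EG-TransUNet abstract above are averages of the Dice coefficient over images. A minimal sketch of that metric on flattened binary masks follows; the function names and the convention of scoring two empty masks as 1.0 are assumptions, not details from the paper.

```python
def dice(pred, truth):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flattened binary masks."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if size == 0 else 2.0 * intersection / size


def mean_dice(pairs):
    """Average Dice over (prediction, ground truth) mask pairs."""
    return sum(dice(pred, truth) for pred, truth in pairs) / len(pairs)


# Hypothetical 4-pixel masks: one true positive, one false positive.
score = dice([1, 1, 0, 0], [1, 0, 0, 0])
```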
50. INSnet: a method for detecting insertions based on deep learning network
- Author
-
Runtian Gao, Junwei Luo, Hongyu Ding, and Haixia Zhai
- Subjects
Structural Biology ,Applied Mathematics ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
Background Many studies have shown that structural variations (SVs) strongly impact human disease. As a common type of SV, insertions are usually associated with genetic diseases. Therefore, accurately detecting insertions is of great significance. Although many methods for detecting insertions have been proposed, these methods often generate some errors and miss some variants. Hence, accurately detecting insertions remains a challenging task. Results In this paper, we propose a method named INSnet to detect insertions using a deep learning network. First, INSnet divides the reference genome into continuous sub-regions and takes five features for each locus through alignments between long reads and the reference genome. Next, INSnet uses a depthwise separable convolutional network. The convolution operation extracts informative features through spatial information and channel information. INSnet uses two attention mechanisms, the convolutional block attention module (CBAM) and efficient channel attention (ECA) to extract key alignment features in each sub-region. In order to capture the relationship between adjacent subregions, INSnet uses a gated recurrent unit (GRU) network to further extract more important SV signatures. After predicting whether a sub-region contains an insertion through the previous steps, INSnet determines the precise site and length of the insertion. The source code is available from GitHub at https://github.com/eioyuou/INSnet. Conclusion Experimental results show that INSnet can achieve better performance than other methods in terms of F1 score on real datasets.
- Published
- 2023