1,622 results
Search Results
2. The data paper: a mechanism to incentivize data publishing in biodiversity science
- Author
-
Lyubomir Penev and Vishwas Chavan
- Subjects
Computer science, Biodiversity, Pilot Projects, Data publishing, Computer applications to medicine (medical informatics), Evolutionary biology, Biochemistry, Workflow, Access to Information, Structural Biology, Molecular Biology, Publishing, Sustainable development, Research, Applied Mathematics, Stakeholder, Data discovery, Data science, Intellectual Property, Computer Science Applications, Metadata, Biology (General), Periodicals as Topic, Global biodiversity - Abstract
Background Free and open access to primary biodiversity data is essential for informed decision-making to achieve conservation of biodiversity and sustainable development. However, primary biodiversity data are neither easily accessible nor discoverable. Among several impediments, one is a lack of incentives to data publishers for publishing their data resources. One such mechanism currently lacking is recognition through conventional scholarly publication of enriched metadata, which should ensure rapid discovery of 'fit-for-use' biodiversity data resources. Discussion We review the state of the art of data discovery options and the mechanisms in place for incentivizing data publishers' efforts towards easy, efficient and enhanced publishing, dissemination, sharing and re-use of biodiversity data. We propose the establishment of the 'biodiversity data paper' as one possible mechanism to offer scholarly recognition for efforts and investment by data publishers in authoring rich metadata and publishing them as citable academic papers. While detailing the benefits to data publishers, we describe the objectives, workflow and outcomes of the pilot project commissioned by the Global Biodiversity Information Facility in collaboration with scholarly publishers and pioneered by Pensoft Publishers through its journals ZooKeys, PhytoKeys, MycoKeys, BioRisk, NeoBiota, Nature Conservation and the forthcoming Biodiversity Data Journal. We then debate further enhancements of the data paper beyond the pilot project and attempt to forecast the future uptake of data papers as an incentivization mechanism by the stakeholder communities. Conclusions We believe that in addition to recognition for those involved in the data publishing enterprise, data papers will also expedite publishing of fit-for-use biodiversity data resources.
However, uptake and establishment of the data paper as a potential mechanism of scholarly recognition requires a high degree of commitment and investment by the cross-sectional stakeholder communities.
- Published
- 2011
3. Structure-based kernels for the prediction of catalytic residues and their involvement in human inherited disease
- Author
-
David Neil Cooper, Steven Myers, Yong Fuga Li, Fuxiao Xin, Sean D. Mooney, and Predrag Radivojac
- Subjects
Statistics and Probability ,Computer science ,Computational biology ,Biology ,Biochemistry ,Catalysis ,Conserved sequence ,Enzyme catalysis ,Protein structure ,Protein sequencing ,Artificial Intelligence ,Structural Biology ,Catalytic Domain ,Humans ,Amino Acid Sequence ,Peptide sequence ,Molecular Biology ,Genetics ,Applied Mathematics ,Genetic Diseases, Inborn ,Computational Biology ,Proteins ,Original Papers ,Enzymes ,Computer Science Applications ,Computational Mathematics ,Kernel method ,Computational Theory and Mathematics ,Mutation ,Structure based ,Inherited disease ,DNA microarray ,Algorithms ,Software - Abstract
Motivation: Enzyme catalysis is involved in numerous biological processes and the disruption of enzymatic activity has been implicated in human disease. Despite this, various aspects of catalytic reactions are not completely understood, such as the mechanics of reaction chemistry and the geometry of catalytic residues within active sites. As a result, the computational prediction of catalytic residues has the potential to identify novel catalytic pockets, aid in the design of more efficient enzymes and also predict the molecular basis of disease. Results: We propose a new kernel-based algorithm for the prediction of catalytic residues based on protein sequence, structure and evolutionary information. The method relies upon explicit modeling of similarity between residue-centered neighborhoods in protein structures. We present evidence that this algorithm evaluates favorably against established approaches, and also provides insights into the relative importance of the geometry, physicochemical properties and evolutionary conservation of catalytic residue activity. The new algorithm was used to identify known mutations associated with inherited disease whose molecular mechanism might be predicted to operate specifically through the loss or gain of catalytic residues. It should, therefore, provide a viable approach to identifying the molecular basis of disease in which the loss or gain of function is not caused solely by the disruption of protein stability. Our analysis suggests that both mechanisms are actively involved in human inherited disease. Availability and Implementation: Source code for the structural kernel is available at www.informatics.indiana.edu/predrag/ Contact: predrag@indiana.edu Supplementary information: Supplementary data are available at Bioinformatics online.
- Published
- 2010
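The neighborhood-comparison idea at the heart of this abstract can be illustrated with a minimal set kernel over residue-centered neighborhoods. This is a hypothetical simplification: the toy feature values, the RBF form, and the plain averaging scheme are illustrative, and the paper's actual kernel also folds in geometry and evolutionary conservation.

```python
import math

def rbf(u, v, gamma=1.0):
    """Gaussian similarity between two per-neighbor feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def neighborhood_kernel(nbhd_a, nbhd_b, gamma=1.0):
    """Set kernel between two residue-centered structural neighborhoods:
    average pairwise RBF similarity between their members."""
    total = sum(rbf(fa, fb, gamma) for fa in nbhd_a for fb in nbhd_b)
    return total / (len(nbhd_a) * len(nbhd_b))

# Toy neighborhoods: each tuple is one neighbor's (conservation, hydrophobicity).
na = [(0.1, 0.2), (0.3, 0.1), (0.0, 0.5)]
nb = [(0.1, 0.2), (0.3, 0.1), (0.0, 0.5)]
nc = [(5.0, 5.0), (6.0, 6.0), (7.0, 7.0)]

# Identical neighborhoods score higher than dissimilar ones.
assert neighborhood_kernel(na, nb) > neighborhood_kernel(na, nc)
```

A kernel of this shape can be plugged directly into any kernel machine (e.g. an SVM) to classify residues as catalytic or not.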
4. Eliciting candidate anatomical routes for protein interactions: a scenario from endocrine physiology.
- Author
-
Grenon, Pierre and De Bono, Bernard
- Subjects
PROTEIN-protein interactions ,ENDOCRINE glands ,ENDOCRINE system ,BIOCHEMISTRY ,BIOINFORMATICS - Abstract
Background: In this paper, we use: i) formalised anatomical knowledge of connectivity between body structures and ii) a formal theory of physiological transport between fluid compartments in order to define and make explicit the routes followed by proteins to a site of interaction. The underlying processes are the objects of mathematical models of physiology and, therefore, the motivation for the approach can be understood as using knowledge representation and reasoning methods to propose concrete candidate routes corresponding to correlations between variables in mathematical models of physiology. In so doing, the approach projects physiology models onto a representation of the anatomical and physiological reality which underpins them. Results: The paper presents a method based on knowledge representation and reasoning for eliciting physiological communication routes. In doing so, the paper presents the core knowledge representation and algorithms using it in the application of the method. These are illustrated through the description of a prototype implementation and the treatment of a simple endocrine scenario whereby a candidate route of communication between ANP and its receptors on the external membrane of smooth muscle cells in renal arterioles is elicited. The potential of further development of the approach is illustrated through the informal discussion of a more complex scenario. Conclusions: The work presented in this paper supports research in intercellular communication by enabling knowledge-based inference on physiologically-related biomedical data and models. [ABSTRACT FROM AUTHOR]
- Published
- 2013
- Full Text
- View/download PDF
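The route-elicitation step described above amounts, at its simplest, to path enumeration over a connectivity graph of fluid compartments. The compartment names and the breadth-first strategy below are hypothetical stand-ins; the paper's knowledge base (formalised anatomy plus a theory of physiological transport) is considerably richer.

```python
from collections import deque

# Hypothetical toy connectivity between fluid compartments, loosely echoing
# the ANP scenario (heart -> blood -> renal arteriole -> receptor membrane).
CONNECTIVITY = {
    "cardiac_atrium": ["arterial_blood"],
    "arterial_blood": ["renal_arteriole_plasma"],
    "renal_arteriole_plasma": ["smooth_muscle_membrane"],
}

def candidate_routes(source, target):
    """Breadth-first enumeration of acyclic candidate transport routes."""
    queue = deque([[source]])
    routes = []
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            routes.append(path)
            continue
        for nxt in CONNECTIVITY.get(path[-1], []):
            if nxt not in path:  # avoid cycles
                queue.append(path + [nxt])
    return routes

routes = candidate_routes("cardiac_atrium", "smooth_muscle_membrane")
assert len(routes) == 1 and routes[0][-1] == "smooth_muscle_membrane"
```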
5. Thermodynamically consistent Bayesian analysis of closed biochemical reaction systems.
- Author
-
Jenkinson, Garrett, Zhong, Xiaogang, and Goutsias, John
- Subjects
BIOMOLECULES ,GENETICS ,BAYESIAN analysis ,STOICHIOMETRY ,GENETIC disorders ,BIOCHEMISTRY - Abstract
Background: Estimating the rate constants of a biochemical reaction system with known stoichiometry from noisy time series measurements of molecular concentrations is an important step for building predictive models of cellular function. Inference techniques currently available in the literature may produce rate constant values that defy necessary constraints imposed by the fundamental laws of thermodynamics. As a result, these techniques may lead to biochemical reaction systems whose concentration dynamics could not possibly occur in nature. Therefore, development of a thermodynamically consistent approach for estimating the rate constants of a biochemical reaction system is highly desirable. Results: We introduce a Bayesian analysis approach for computing thermodynamically consistent estimates of the rate constants of a closed biochemical reaction system with known stoichiometry given experimental data. Our method employs an appropriately designed prior probability density function that effectively integrates fundamental biophysical and thermodynamic knowledge into the inference problem. Moreover, it takes into account experimental strategies for collecting informative observations of molecular concentrations through perturbations. The proposed method employs a maximization-expectation-maximization algorithm that provides thermodynamically feasible estimates of the rate constant values and computes appropriate measures of estimation accuracy. We demonstrate various aspects of the proposed method on synthetic data obtained by simulating a subset of a well-known model of the EGF/ERK signaling pathway, and examine its robustness under conditions that violate key assumptions. Software, coded in MATLAB®, which implements all Bayesian analysis techniques discussed in this paper, is available free of charge at http://www.cis.jhu.edu/~goutsias/CSS%20lab/software.html.
Conclusions: Our approach provides an attractive statistical methodology for estimating thermodynamically feasible values for the rate constants of a biochemical reaction system from noisy time series observations of molecular concentrations obtained through perturbations. The proposed technique is theoretically sound and computationally feasible, but restricted to quantitative data obtained from closed biochemical reaction systems. This necessitates development of similar techniques for estimating the rate constants of open biochemical reaction systems, which are more realistic models of cellular function. [ABSTRACT FROM AUTHOR]
- Published
- 2010
- Full Text
- View/download PDF
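The thermodynamic constraint such a prior must respect can be illustrated with the classical cycle (Wegscheider) condition: around any closed reaction loop, the product of forward rate constants must equal the product of reverse rate constants. A minimal sketch of the consistency check, not the paper's Bayesian machinery:

```python
import math

def wegscheider_consistent(k_fwd, k_rev, tol=1e-9):
    """Check the cycle condition prod(k_fwd) / prod(k_rev) == 1 for one
    closed reaction loop; a thermodynamically consistent estimator must
    only ever propose rate constants satisfying this."""
    ratio = math.prod(k_fwd) / math.prod(k_rev)
    return abs(math.log(ratio)) < tol

# A consistent cycle: forward and reverse products match (6 == 6).
assert wegscheider_consistent([2.0, 3.0, 1.0], [1.0, 2.0, 3.0])
# Inconsistent rate constants violate the condition (6 != 1).
assert not wegscheider_consistent([2.0, 3.0, 1.0], [1.0, 1.0, 1.0])
```

In a sampling-based inference scheme, proposals failing this check would simply receive zero prior probability.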
6. Sequence-based prediction of physicochemical interactions at protein functional sites using a function-and-interaction-annotated domain profile database.
- Author
-
Han, Min, Song, Yifan, Qian, Jiaqiang, and Ming, Dengming
- Subjects
PROTEINS ,BIOCHEMISTRY ,DRUG design ,HIDDEN Markov models ,PROTEIN domains - Abstract
Background: Identifying protein functional sites (PFSs) and, particularly, the physicochemical interactions at these sites is critical to understanding protein functions and the biochemical reactions involved. Several knowledge-based methods have been developed for the prediction of PFSs; however, accurate methods for predicting the physicochemical interactions associated with PFSs are still lacking. Results: In this paper, we present a sequence-based method for the prediction of physicochemical interactions at PFSs. The method is based on a functional site and physicochemical interaction-annotated domain profile database, called fiDPD, which was built using protein domains found in the Protein Data Bank. This method was applied to 13 target proteins from the very recent Critical Assessment of Structure Prediction (CASP10/11), and our calculations gave a Matthews correlation coefficient (MCC) value of 0.66 for PFS prediction and an 80% recall in the prediction of the associated physicochemical interactions. Conclusions: Our results show that, in addition to the PFSs, the physical interactions at these sites are also conserved in the evolution of proteins. This work provides a valuable sequence-based tool for rational drug design and side-effect assessment. The method is freely available and can be accessed at http://202.119.249.49. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
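The MCC figure quoted above is a standard confusion-matrix statistic; the generic formula (not code from the paper) is:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.
    Ranges from -1 (total disagreement) through 0 (chance) to +1 (perfect)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Perfect prediction gives MCC = 1; chance-level prediction gives 0.
assert mcc(50, 50, 0, 0) == 1.0
assert abs(mcc(25, 25, 25, 25)) < 1e-12
```

Unlike raw accuracy, MCC stays informative on the heavily imbalanced datasets typical of functional-site prediction, which is why it is the headline metric here.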
7. Utilizing knowledge base of amino acids structural neighborhoods to predict protein-protein interaction sites.
- Author
-
Jelínek, Jan, Škoda, Petr, and Hoksza, David
- Subjects
PROTEIN-protein interactions ,AMINO acids ,SILICON ,GENETICS ,BIOCHEMISTRY - Abstract
Background: Protein-protein interactions (PPI) play a key role in many biochemical processes, and their identification is thus of great importance. Although computational prediction of which amino acids take part in a PPI has been an active field of research for some time, the quality of in-silico methods is still far from perfect. Results: We have developed a novel prediction method called INSPiRE which benefits from a knowledge base built from data available in Protein Data Bank. All proteins involved in PPIs were converted into labeled graphs with nodes corresponding to amino acids and edges to pairs of neighboring amino acids. A structural neighborhood of each node was then encoded into a bit string and stored in the knowledge base. When predicting PPIs, INSPiRE labels amino acids of unknown proteins as interface or non-interface based on how often their structural neighborhood appears as interface or non-interface in the knowledge base. We evaluated INSPiRE's behavior with respect to different types and sizes of the structural neighborhood. Furthermore, we examined the suitability of several different features for labeling the nodes. Our evaluations showed that INSPiRE clearly outperforms existing methods with respect to Matthews correlation coefficient. Conclusion: In this paper we introduce a new knowledge-based method for identification of protein-protein interaction sites called INSPiRE. Its knowledge base utilizes structural patterns of known interaction sites in the Protein Data Bank which are then used for PPI prediction. Extensive experiments on several well-established datasets show that INSPiRE significantly surpasses existing PPI approaches. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
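The core loop described above, encoding a residue's structural neighborhood as a bit string and voting with interface/non-interface counts from a knowledge base, can be sketched as below. The composition-only encoding and the toy three-letter "neighborhoods" are hypothetical simplifications of INSPiRE's real graph-derived features.

```python
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def encode(neighborhood):
    """Encode a structural neighborhood as a bit string: one bit per
    amino-acid type, set if that type occurs among the neighbors."""
    return "".join("1" if aa in neighborhood else "0" for aa in ALPHABET)

# code -> [interface count, non-interface count]
knowledge_base = defaultdict(lambda: [0, 0])

def train(neighborhood, is_interface):
    knowledge_base[encode(neighborhood)][0 if is_interface else 1] += 1

def predict(neighborhood):
    """Label by majority vote of matching neighborhoods in the knowledge base."""
    inter, non = knowledge_base[encode(neighborhood)]
    return inter >= non

train("RKD", True)
train("RKD", True)
train("LLV", False)
assert predict("KDR") is True   # same composition -> same bit string
assert predict("LVL") is False
```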
8. A scalable assembly-free variable selection algorithm for biomarker discovery from metagenomes.
- Author
-
Gkanogiannis, Anestis, Gazut, Stéphane, Salanoubat, Marcel, Kanj, Sawsan, and Brüls, Thomas
- Subjects
BIOMARKERS ,BIOCHEMISTRY ,METAGENOMICS ,MICROBIAL genomics ,BIOINFORMATICS - Abstract
Background: Metagenomics holds great promises for deepening our knowledge of key bacterial driven processes, but metagenome assembly remains problematic, typically resulting in representation biases and discarding significant amounts of non-redundant sequence information. In order to alleviate constraints assembly can impose on downstream analyses, and/or to increase the fraction of raw reads assembled via targeted assemblies relying on pre-assembly binning steps, we developed a set of binning modules and evaluated their combination in a new "assembly-free" binning protocol. Results: We describe a scalable multi-tiered binning algorithm that combines frequency and compositional features to cluster unassembled reads, and demonstrate i) significant runtime performance gains of the developed modules against state of the art software, obtained through parallelization and the efficient use of large lock-free concurrent hash maps, ii) its relevance for clustering unassembled reads from high complexity (e.g., harboring 700 distinct genomes) samples, iii) its relevance to experimental setups involving multiple samples, through a use case consisting in the "de novo" identification of sequences from a target genome (e.g., a pathogenic strain) segregating at low levels in a cohort of 50 complex microbiomes (harboring 100 distinct genomes each), in the background of closely related strains and the absence of reference genomes, iv) its ability to correctly identify clusters of sequences from the E. coli O104:H4 genome as the most strongly correlated to the infection status in 53 microbiomes sampled from the 2011 STEC outbreak in Germany, and to accurately cluster contigs of this pathogenic strain from a cross-assembly of these 53 microbiomes. Conclusions: We present a set of sequence clustering ("binning") modules and their application to biomarker (e.g., genomes of pathogenic organisms) discovery from large synthetic and real metagenomics datasets. 
Initially designed for the "assembly-free" analysis of individual metagenomic samples, we demonstrate their extension to setups involving multiple samples via the usage of the "alignment-free" d2S statistic to relate clusters across samples, and illustrate how the clustering modules can otherwise be leveraged for de novo "pre-assembly" tasks by segregating sequences into biologically meaningful partitions. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
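The compositional features used for assembly-free binning can be illustrated with a toy k-mer frequency vector: reads from genomes with different nucleotide composition separate in this space without any assembly. Dinucleotides stand in here for the longer k-mers (and the d2S statistic) used in practice.

```python
from itertools import product

KMERS = ["".join(p) for p in product("ACGT", repeat=2)]  # 16 dinucleotides

def composition(read):
    """Normalized dinucleotide frequency vector of a read."""
    counts = [0] * len(KMERS)
    for i in range(len(read) - 1):
        counts[KMERS.index(read[i:i + 2])] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]

def dist(u, v):
    """Euclidean distance between two composition vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

# Reads from an AT-rich "genome" sit closer to each other than to a
# GC-rich read: the signal that composition-based binning exploits.
r1, r2, r3 = "ATATATATAT", "TATATATATA", "GCGCGCGCGC"
assert dist(composition(r1), composition(r2)) < dist(composition(r1), composition(r3))
```

A binning module then only needs to run any clustering algorithm over these vectors.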
9. Biochemical systems identification by a random drift particle swarm optimization approach.
- Author
-
Sun J, Palade V, Cai Y, Fang W, and Wu X
- Subjects
- Computer Simulation, Models, Theoretical, Nonlinear Dynamics, Algorithms, Biochemistry, Computational Biology methods
- Abstract
Background: Finding an efficient method to solve the parameter estimation problem (inverse problem) for nonlinear biochemical dynamical systems could help promote the functional understanding at the system level for signalling pathways. The problem is stated as a data-driven nonlinear regression problem, which is converted into a nonlinear programming problem with many nonlinear differential and algebraic constraints. Due to the typical ill conditioning and multimodal nature of the problem, it is in general difficult for gradient-based local optimization methods to obtain satisfactory solutions. To surmount this limitation, many stochastic optimization methods have been employed to find the global solution of the problem. Results: This paper presents an effective search strategy for a particle swarm optimization (PSO) algorithm that enhances the ability of the algorithm for estimating the parameters of complex dynamic biochemical pathways. The proposed algorithm is a new variant of random drift particle swarm optimization (RDPSO), which is used to solve the above-mentioned inverse problem and compared with other well known stochastic optimization methods. Two case studies on estimating the parameters of two nonlinear biochemical dynamic models have been taken as benchmarks, under both the noise-free and noisy simulation data scenarios. Conclusions: The experimental results show that the novel variant of RDPSO algorithm is able to successfully solve the problem and obtain solutions of better quality than other global optimization methods used for finding the solution to the inverse problems in this study.
- Published
- 2014
- Full Text
- View/download PDF
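A single RDPSO-style update can be sketched as follows. The exact update rule and the parameter values (alpha, beta) here are illustrative of the random-drift family of PSO variants, not the paper's precise formulation:

```python
import random

def rdpso_step(positions, pbest, gbest, alpha=0.7, beta=1.45):
    """One random-drift PSO iteration (simplified sketch). Each coordinate
    combines a random velocity term, scaled by the spread around the swarm's
    mean-best position, with a drift toward a stochastic attractor lying
    between the particle's personal best and the global best."""
    dim = len(gbest)
    mbest = [sum(p[d] for p in pbest) / len(pbest) for d in range(dim)]
    new_positions = []
    for x, pb in zip(positions, pbest):
        nx = []
        for d in range(dim):
            phi = random.random()
            attractor = phi * pb[d] + (1 - phi) * gbest[d]
            v_random = alpha * abs(mbest[d] - x[d]) * random.gauss(0, 1)
            v_drift = beta * (attractor - x[d])
            nx.append(x[d] + v_random + v_drift)
        new_positions.append(nx)
    return new_positions

random.seed(0)
swarm = [[5.0, -5.0], [4.0, -4.0]]
new = rdpso_step(swarm, pbest=swarm, gbest=[0.0, 0.0])
assert len(new) == 2 and all(len(x) == 2 for x in new)
```

For parameter estimation, each particle's position would hold a candidate vector of rate constants and fitness would be the model-to-data residual after simulating the ODEs.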
10. EZH2 as a prognostic-related biomarker in lung adenocarcinoma correlating with cell cycle and immune infiltrates
- Author
-
Kui Fan, Bo-hui Zhang, Deng Han, and Yun-chuan Sun
- Subjects
Structural Biology ,Applied Mathematics ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
Background It has been observed that high levels of enhancer of zeste homolog 2 (EZH2) expression are associated with unsatisfactory prognoses and can be found in a wide range of malignancies. However, the effects of EZH2 on lung adenocarcinoma (LUAD) remain elusive. Through the integration of bioinformatic analyses, the present paper sought to ascertain the effects of EZH2 in LUAD. Methods The TIMER and UALCAN databases were applied to analyze mRNA and protein expression data for EZH2 in LUAD. The result of immunohistochemistry was obtained from the HPA database, and the survival curve was drawn according to the library provided by the HPA database. The LinkedOmics database was utilized to investigate the genes and signal transduction pathways co-expressed with EZH2. Up- and down-regulated genes from the LinkedOmics database were introduced to the CMap database to predict potential drug targets for LUAD. The association between EZH2 and cancer-infiltrating immunocytes was studied through TIMER and TISIDB. In addition, this paper explores the relationship between EZH2 mRNA expression and NSCLC OS using the Kaplan–Meier plotter database to further validate and complement the research. Furthermore, the correlation between EZH2 expression and the EGFR, KRAS and BRAF genes and smoking status from The Cancer Genome Atlas (TCGA) database is analyzed. Results In contrast to paracancer specimens, the mRNA and protein levels of EZH2 were higher in LUAD tissues. Significantly, high levels of EZH2 were associated with unsatisfactory prognoses in LUAD patients. Additionally, the co-expressed genes of EZH2 were predominantly associated with numerous cell growth-associated pathways, including the cell cycle, DNA replication, RNA transport, and the p53 signaling pathway, according to Gene Ontology and Kyoto Encyclopedia of Genes and Genomes pathways.
The results of the TCGA database revealed that the expression of EZH2 was lower in normal tissues than in lung cancer tissues, with a reported correlation of r = 0.3129. Conclusions Highly expressed EZH2 is a predictor of a suboptimal prognosis in LUAD and may serve as a prognostic marker and target gene for LUAD. The underlying cause may be associated with the synergistic effect of KRAS, immune cell infiltration, and metabolic processes.
- Published
- 2023
11. Avoiding background knowledge: literature based discovery from important information
- Author
-
Judita Preiss
- Subjects
Structural Biology ,Applied Mathematics ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
Background Automatic literature based discovery attempts to uncover new knowledge by connecting existing facts: information extracted from existing publications in the form of A → B and B → C relations can be simply connected to deduce A → C. However, using this approach, the quantity of proposed connections is often too vast to be useful. It can be reduced by using subject → (predicate) → object triples as the A → B relations, but too many proposed connections remain for manual verification. Results Based on the hypothesis that only a small number of subject–predicate–object triples extracted from a publication represent the paper's novel contribution(s), we explore using BERT embeddings to identify these before literature based discovery is performed utilizing only these important triples. While the method exploits the availability of full texts of publications in the CORD-19 dataset, making use of the fact that a novel contribution is likely to be mentioned in both an abstract and the body of a paper, to build a training set, the resulting tool can be applied to papers with only abstracts available. Candidate hidden knowledge pairs generated from unfiltered triples and those built from important triples only are compared using a variety of timeslicing gold standards. Conclusions The quantity of proposed knowledge pairs is reduced by a factor of 10^3, and we show that when the gold standard is designed to avoid rewarding background knowledge, the precision obtained increases by up to a factor of 10. We argue that the gold standard needs to be carefully considered, and release as yet undiscovered candidate knowledge pairs based on important triples alongside this work.
- Published
- 2023
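The A → C deduction underlying literature based discovery is easy to sketch with subject-predicate-object triples. The example triples below use Swanson's classic fish-oil/Raynaud's illustration rather than data from this paper, and the BERT-based filtering to important triples is omitted:

```python
# Toy subject-predicate-object triples; joining A->B with B->C yields
# candidate A->C "hidden knowledge" pairs.
triples = [
    ("fish oil", "reduces", "blood viscosity"),
    ("blood viscosity", "associated_with", "Raynaud's disease"),
    ("aspirin", "inhibits", "platelet aggregation"),
]

def candidate_pairs(triples):
    """All (A, C) pairs such that A->B and B->C both appear as triples."""
    by_subject = {}
    for s, _, o in triples:
        by_subject.setdefault(s, set()).add(o)
    pairs = set()
    for s, _, o in triples:
        for c in by_subject.get(o, ()):
            pairs.add((s, c))
    return pairs

assert candidate_pairs(triples) == {("fish oil", "Raynaud's disease")}
```

On real corpora this join explodes combinatorially, which is exactly why the paper restricts the input to each publication's few "important" triples before connecting them.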
12. GENTLE: a novel bioinformatics tool for generating features and building classifiers from T cell repertoire cancer data
- Author
-
Dhiego Souto Andrade, Patrick Terrematte, César Rennó-Costa, Alona Zilberberg, and Sol Efroni
- Subjects
Structural Biology ,Applied Mathematics ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
Background In the global effort to discover biomarkers for cancer prognosis, prediction tools have become essential resources. TCR (T cell receptor) repertoires contain important features that differentiate healthy controls from cancer patients or differentiate outcomes for patients being treated with different drugs. Accordingly, tools that can easily and quickly generate and identify important features from TCR repertoire data and build accurate classifiers to predict future outcomes are essential. Results This paper introduces GENTLE (GENerator of T cell receptor repertoire features for machine LEarning): an open-source, user-friendly web-application tool that allows TCR repertoire researchers to discover important features; to create classifier models and evaluate them with metrics; and to quickly generate visualizations for data interpretation. We performed a case study with repertoires of TRegs (regulatory T cells) and TConvs (conventional T cells) from healthy controls versus patients with breast cancer. We showed that diversity features were able to distinguish between the groups. Moreover, the classifiers built with these features could correctly classify samples (‘Healthy’ or ‘Breast Cancer’) from the TRegs repertoire when trained with the TConvs repertoire, and from the TConvs repertoire when trained with the TRegs repertoire. Conclusion The paper walks through installing and using GENTLE and presents a case study and results to demonstrate the application's utility. GENTLE is geared towards any researcher working with TCR repertoire data and aims to discover predictive features from these data and build accurate classifiers. GENTLE is available at https://github.com/dhiego22/gentle and https://share.streamlit.io/dhiego22/gentle/main/gentle.py.
- Published
- 2023
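One of the diversity features such a tool can generate is Shannon entropy over clonotype frequencies: an even repertoire scores higher than a clonally expanded one. Illustrative only, not GENTLE's actual code (the toy CDR3 sequences are made up):

```python
import math
from collections import Counter

def shannon_diversity(clonotypes):
    """Shannon entropy of a repertoire's clonotype frequency distribution."""
    counts = Counter(clonotypes)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# An even repertoire is more diverse than a clonally expanded one.
even = ["CASSLG", "CASSPG", "CASSQG", "CASSRG"]
expanded = ["CASSLG"] * 7 + ["CASSPG"]
assert shannon_diversity(even) > shannon_diversity(expanded)
```

Feature vectors of such diversity indices, one value per sample, are what the downstream classifiers are trained on.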
13. Efficient algorithms for biological stems search.
- Author
-
Mi, Tian and Rajasekaran, Sanguthevar
- Subjects
NUCLEOTIDE sequence ,PROTEIN binding ,TRANSCRIPTION factors ,BIOCHEMISTRY ,BIOINFORMATICS - Abstract
Background: Motifs are significant patterns in DNA, RNA, and protein sequences, which play an important role in biological processes and functions, like identification of open reading frames, RNA transcription, protein binding, etc. Several versions of the motif search problem have been studied in the literature. One such version is called the Planted Motif Search (PMS) or (l, d)-motif Search. PMS is known to be NP-complete. The time complexities of most of the planted motif search algorithms depend exponentially on the alphabet size. Recently a new version of the motif search problem has been introduced by Kuksa and Pavlovic. We call this version the Motif Stems Search (MSS) problem. A motif stem is an l-mer (for some relevant value of l) with some wildcard characters and hence corresponds to a set of l-mers (without wildcards), some of which are (l, d)-motifs. Kuksa and Pavlovic have presented an efficient algorithm to find motif stems for inputs from large alphabets. Ideally, the number of stems output should be as small as possible since the stems form a superset of the motifs. Results: In this paper we propose an efficient algorithm for MSS and evaluate it on both synthetic and real data. This evaluation reveals that our algorithm is much faster than Kuksa and Pavlovic's algorithm. Conclusions: Our MSS algorithm outperforms the algorithm of Kuksa and Pavlovic in terms of the run time as well as the number of stems output. Specifically, the stems output by our algorithm form a proper (and much smaller) subset of the stems output by Kuksa and Pavlovic's algorithm. [ABSTRACT FROM AUTHOR]
- Published
- 2013
- Full Text
- View/download PDF
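What a motif stem is can be made concrete by enumerating the stems of a single l-mer: every way of masking up to d positions with a wildcard. This sketch only generates the stems of one l-mer; the algorithms discussed in the abstract search for stems consistent with every input sequence:

```python
from itertools import combinations

def stems(lmer, d):
    """All motif stems of an l-mer: the l-mer itself plus every variant
    with up to d positions replaced by the wildcard '*'. Each stem stands
    for the whole set of l-mers matching it."""
    out = {lmer}
    for k in range(1, d + 1):
        for idxs in combinations(range(len(lmer)), k):
            s = list(lmer)
            for i in idxs:
                s[i] = "*"
            out.add("".join(s))
    return out

assert stems("ACG", 1) == {"ACG", "*CG", "A*G", "AC*"}
```

The count grows as sum over k of C(l, k), which is why keeping the output set of stems small is the quality measure the paper optimizes.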
14. Biomimicry of quorum sensing using bacterial lifecycle model.
- Author
-
Ben Niu, Hong Wang, Qiqi Duan, and Li Li
- Subjects
BIOMIMICRY ,QUORUM sensing ,BIOMIMETIC chemicals ,CELL communication ,MICROBIAL genetics ,BIOCHEMISTRY - Abstract
Background: Recent microbiologic studies have shown that quorum sensing mechanisms, which serve as one of the fundamental requirements for bacterial survival, exist widely in bacterial intra- and inter-species cell-cell communication. Many simulation models, inspired by the social behavior of natural organisms, have been presented to provide new approaches for solving realistic optimization problems. Most of these simulation models follow population-based modelling approaches, where all the individuals are updated according to the same rules, making it difficult to maintain the diversity of the population. Results: In this paper, we present a computational model termed LCM-QS, which simulates the bacterial quorum sensing (QS) mechanism using an individual-based modelling approach under the framework of the Agent-Environment-Rule (AER) scheme, i.e. the bacterial lifecycle model (LCM). The LCM-QS model comprises three main sub-models: a chemotaxis with QS sub-model, a reproduction and elimination sub-model, and a migration sub-model. The proposed model is used not only to imitate the bacterial evolution process at the single-cell level, but also to study bacterial macroscopic behaviour. Comparative experiments under four different scenarios have been conducted in an artificial 3-D environment with nutrient and noxious substance distributions. Bacterial chemotactic processes with and without quorum sensing are compared in detail. By using quorum sensing mechanisms, artificial bacteria working together can find the nutrient concentration (or global optimum) quickly in the artificial environment. Conclusions: Biomimicry of quorum sensing mechanisms using the lifecycle model endows the artificial bacteria with communication abilities, which are essential to obtain more valuable information to guide their search cooperatively towards the preferred nutrient concentrations. 
It can also provide inspiration for designing new swarm intelligence optimization algorithms for solving real-world problems. [ABSTRACT FROM AUTHOR]
- Published
- 2013
- Full Text
- View/download PDF
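The quorum-sensing rule inside such a chemotaxis sub-model can be sketched as a threshold switch on the local autoinducer signal: once a quorum is reached, bacteria take larger cooperative steps up the nutrient gradient. The threshold, step sizes, and update rule below are hypothetical, not the LCM-QS rules themselves:

```python
def quorum_reached(signal_concentrations, threshold=0.5):
    """Quorum rule: behaviour switches once the summed local autoinducer
    signal (deposited by neighbouring bacteria) exceeds a threshold."""
    return sum(signal_concentrations) >= threshold

def chemotaxis_step(position, gradient, signals, step=0.1):
    """Move up the nutrient gradient; with quorum reached, take a larger,
    cooperative step toward the sensed optimum."""
    scale = step * (2.0 if quorum_reached(signals) else 1.0)
    return [x + scale * g for x, g in zip(position, gradient)]

lone = chemotaxis_step([0.0, 0.0], [1.0, 0.0], signals=[0.1])
crowd = chemotaxis_step([0.0, 0.0], [1.0, 0.0], signals=[0.3, 0.3])
assert crowd[0] > lone[0]  # quorum accelerates movement toward nutrients
```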
15. MALDI imaging mass spectrometry: statistical data analysis and current computational challenges.
- Author
-
Alexandrov, Theodore
- Subjects
MASS spectrometry ,DATA analysis ,LASERS ,IONIZATION (Atomic physics) ,DESORPTION ,ANALYTICAL chemistry ,BIOCHEMISTRY - Abstract
Matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) imaging mass spectrometry, also called MALDI-imaging, is a label-free bioanalytical technique used for spatially-resolved chemical analysis of a sample. Usually, MALDI-imaging is exploited for analysis of a specially prepared tissue section thaw-mounted onto a glass slide. A tremendous development of the MALDI-imaging technique has been observed during the last decade. Currently, it is one of the most promising innovative measurement techniques in biochemistry and a powerful and versatile tool for spatially-resolved chemical analysis of diverse sample types ranging from biological and plant tissues to bio- and polymer thin films. In this paper, we outline computational methods for analyzing MALDI-imaging data with the emphasis on multivariate statistical methods, discuss their pros and cons, and give recommendations on their application. The methods of unsupervised data mining as well as supervised classification methods for biomarker discovery are elucidated. We also present a high-throughput computational pipeline for interpretation of MALDI-imaging data using spatial segmentation. Finally, we discuss current challenges associated with the statistical analysis of MALDI-imaging data. [ABSTRACT FROM AUTHOR]
- Published
- 2012
- Full Text
- View/download PDF
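Spatial segmentation of MALDI-imaging data boils down to clustering per-pixel spectra so that pixels with similar chemical profiles share a label. A toy k-means over four two-channel "pixel spectra" illustrates the idea; real pipelines add peak picking, denoising, and spatially-aware clustering:

```python
def segment(spectra, k=2, iters=10):
    """Minimal k-means on per-pixel spectra, assigning each pixel to a
    segment. Centers are seeded from the first k spectra for determinism."""
    centers = [list(spectra[i]) for i in range(k)]
    labels = [0] * len(spectra)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, s in enumerate(spectra):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(s, centers[c])))
        # Update step: recompute each center as the mean of its members.
        for c in range(k):
            members = [spectra[i] for i in range(len(spectra)) if labels[i] == c]
            if members:
                centers[c] = [sum(vals) / len(members) for vals in zip(*members)]
    return labels

# Two clearly different spectral populations end up in separate segments.
spectra = [[10, 0], [9, 1], [0, 10], [1, 9]]
labels = segment(spectra)
assert labels[0] == labels[1] and labels[2] == labels[3] and labels[0] != labels[2]
```

Rendering the labels back at each pixel's spatial coordinates yields the segmentation map used for interpretation.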
16. Metabolic network alignment in large scale by network compression.
- Author
-
Ay, Ferhat, Dang, Michael, and Kahveci, Tamer
- Subjects
BIOTRANSFORMATION (Metabolism) ,COMPARATIVE studies ,METABOLISM ,COMPUTATIONAL biology ,ORGANISMS ,ALGORITHMS ,BIOCHEMISTRY - Abstract
Metabolic network alignment is a system-scale comparative analysis that discovers important similarities and differences across different metabolisms and organisms. Although the problem of aligning metabolic networks has been considered in the past, the computational complexity of the existing solutions has so far limited their use to moderately sized networks. In this paper, we address the problem of aligning two metabolic networks, particularly when both are too large to be handled by existing methods. We develop a generic framework that can significantly improve the scale of the networks that can be aligned in practical time. Our framework has three major phases: the compression phase, the alignment phase and the refinement phase. In the first phase, we develop an algorithm that transforms the given networks into a compressed domain where they are summarized using fewer nodes, termed supernodes, and interactions. In the second phase, we carry out the alignment in the compressed domain using an existing network alignment method as our base algorithm. This alignment results in supernode mappings in the compressed domain, each of which is a smaller instance of the network alignment problem. In the third phase, we solve each of these instances using the base alignment algorithm to refine the alignment results. We provide a user-defined parameter to control the number of compression levels, which generally determines the tradeoff between alignment quality and running time. Our experiments on networks from the KEGG pathway database demonstrate that the proposed compression method reduces the sizes of metabolic networks by almost half at each compression level, which provides an expected speedup of more than an order of magnitude. We also observe that the alignments obtained with only one level of compression capture the original alignment results with high accuracy. 
Together, these results suggest that our framework produces alignments comparable to those of existing algorithms, with practical resource utilization for large-scale networks that existing algorithms could not handle. As an example of our method's performance in practice, the alignment of the organism-wide metabolic networks of human (1615 reactions) and mouse (1600 reactions) completed in under three minutes using only a single level of compression. [ABSTRACT FROM AUTHOR]
- Published
- 2012
- Full Text
- View/download PDF
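The compression phase described in the abstract can be sketched as one level of greedy supernode merging: pair each node with an unmerged neighbour, then lift the edges to the supernodes. This is an illustrative simplification, not the authors' algorithm:

```python
def compress(nodes, edges):
    """One compression level: greedily pair each node with an unmerged
    neighbour to form a supernode, then lift edges to supernodes."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    super_of, supernodes = {}, []
    for n in nodes:  # deterministic order
        if n in super_of:
            continue
        # First unmerged neighbour becomes this node's partner (may be None)
        partner = next((m for m in sorted(adj[n]) if m not in super_of), None)
        group = (n,) if partner is None else (n, partner)
        sid = len(supernodes)
        supernodes.append(group)
        for g in group:
            super_of[g] = sid
    # An edge survives only if its endpoints land in different supernodes
    super_edges = {(min(a, b), max(a, b))
                   for u, v in edges
                   for a, b in [(super_of[u], super_of[v])] if a != b}
    return supernodes, sorted(super_edges)

nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "D")]
supers, sedges = compress(nodes, edges)
print(supers)  # → [('A', 'B'), ('C', 'D')]
print(sedges)  # → [(0, 1)]
```

As the abstract notes, each such level roughly halves the node count; the base aligner then runs on the much smaller supernode graph.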
17. A text-mining system for extracting metabolic reactions from full-text articles.
- Author
-
Czarnecki, Jan, Nobeli, Irene, Smith, Adrian M., and Shepherd, Adrian J.
- Subjects
METABOLISM ,TEXT mining ,DATA mining ,INFORMATION retrieval ,BIOCHEMISTRY - Abstract
Background: Increasingly, biological text-mining research is focusing on the extraction of complex relationships relevant to the construction and curation of biological networks and pathways. However, one important category of pathway, metabolic pathways, has been largely neglected. Here we present a relatively simple method for extracting metabolic reaction information from free text that scores different permutations of assigned entities (enzymes and metabolites) within a given sentence based on the presence and location of stemmed keywords. This method extends an approach that has proved effective in the context of extracting protein-protein interactions. Results: When evaluated on a set of manually curated metabolic pathways using standard performance criteria, our method performs surprisingly well. Precision and recall rates are comparable to those previously achieved for the well-known protein-protein interaction extraction task. Conclusions: We conclude that automated metabolic pathway construction is more tractable than has often been assumed, and that (as in the case of protein-protein interaction extraction) relatively simple text-mining approaches can prove surprisingly effective. It is hoped that these results will provide an impetus for further research and act as a useful benchmark for judging the performance of more sophisticated methods yet to be developed. [ABSTRACT FROM AUTHOR]
- Published
- 2012
- Full Text
- View/download PDF
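The scoring idea, permuting entity role assignments and rewarding nearby stemmed keywords, can be sketched as follows. The keyword list and weights here are invented for illustration; the paper's actual lexicon and scoring scheme differ:

```python
from itertools import permutations

# Stemmed keywords that signal a reaction (illustrative, not the paper's list)
KEYWORDS = ("convert", "produc", "yield", "catalys")

def score_assignment(tokens, substrate, product):
    """Score one (substrate, product) role assignment for a tokenised
    sentence: +2 if a keyword stem appears just before or between the two
    metabolite mentions, +1 if the substrate is mentioned first."""
    i, j = tokens.index(substrate), tokens.index(product)
    lo, hi = min(i, j), max(i, j)
    score = 2 * any(tok.startswith(KEYWORDS) for tok in tokens[max(0, lo - 1):hi])
    score += 1 if i < j else 0
    return score

def best_reaction(tokens, metabolites):
    """Try every permutation of metabolite roles; keep the best-scoring one."""
    return max(permutations(metabolites, 2),
               key=lambda sp: score_assignment(tokens, *sp))

tokens = "hexokinase converts glucose to glucose-6-phosphate".split()
print(best_reaction(tokens, ["glucose", "glucose-6-phosphate"]))
# → ('glucose', 'glucose-6-phosphate')
```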
18. New insights into protein-protein interaction data lead to increased estimates of the S. cerevisiae interactome size.
- Author
-
Sambourg, Laure and Thierry-Mieg, Nicolas
- Subjects
PROTEIN-protein interactions ,MOLECULAR association ,SACCHAROMYCES cerevisiae ,CHEMICAL bonds ,BIOCHEMISTRY - Abstract
Background: As protein interactions mediate most cellular mechanisms, protein-protein interaction networks are essential in the study of cellular processes. Consequently, several large-scale interactome mapping projects have been undertaken, and protein-protein interactions are being distilled into databases through literature curation; yet protein-protein interaction data are still far from comprehensive, even in the model organism Saccharomyces cerevisiae. Estimating the interactome size is important for evaluating the completeness of current datasets and for measuring the remaining effort required. Results: We examined the yeast interactome from a new perspective, by taking into account how thoroughly proteins have been studied. We discovered that the set of literature-curated protein-protein interactions is qualitatively different when restricted to proteins that have received extensive attention from the scientific community. In particular, these interactions are less often supported by yeast two-hybrid, and more often by more complex experiments such as biochemical activity assays. Our analysis showed that high-throughput and literature-curated interactome datasets are more correlated than commonly assumed, but that this bias can be corrected for by focusing on well-studied proteins. We thus propose a simple and reliable method to estimate the size of an interactome, combining literature-curated data involving well-studied proteins with high-throughput data. It yields an estimate of at least 37,600 direct physical protein-protein interactions in S. cerevisiae. Conclusions: Our method leads to higher and more accurate estimates of the interactome size, as it accounts for interactions that are genuine yet difficult to detect with commonly-used experimental assays. This shows that we are even further from completing the yeast interactome map than previously expected. [ABSTRACT FROM AUTHOR]
- Published
- 2010
- Full Text
- View/download PDF
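The abstract's idea of combining two partially overlapping interaction datasets to estimate total interactome size is in the spirit of classic capture-recapture estimation. A Lincoln-Petersen sketch conveys the principle; the authors' actual estimator additionally corrects for study bias and is not this formula:

```python
def capture_recapture(set1, set2):
    """Lincoln-Petersen estimate of total population size from two
    independent 'captures': N ≈ |S1| * |S2| / |S1 ∩ S2|."""
    overlap = len(set1 & set2)
    if overlap == 0:
        raise ValueError("no overlap: estimate undefined")
    return len(set1) * len(set2) // overlap

# Toy example: two interaction datasets sharing 2 of their pairs
curated = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")}
high_throughput = {("A", "B"), ("C", "D"), ("D", "E"),
                   ("E", "F"), ("A", "F"), ("B", "F")}
print(capture_recapture(curated, high_throughput))  # → 12 (= 4 * 6 / 2)
```

The smaller the overlap relative to the dataset sizes, the larger the implied hidden population, which is exactly why correlated datasets bias the estimate downward.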
19. Statistical assessment of discriminative features for protein-coding and non coding cross-species conserved sequence elements.
- Author
-
Creanza, Teresa M., Horner, David S., D'Addabbo, Annarita, Maglietta, Rosalia, Mignone, Flavio, Ancona, Nicola, and Pesole, Graziano
- Subjects
CODE generators ,MOLECULAR biology ,CROSS-species amplification ,BIOCHEMISTRY ,BIOINFORMATICS - Abstract
Background: The identification of protein-coding elements in sets of mammalian conserved elements is one of the major challenges in current molecular biology research. Many features have been proposed for automatically distinguishing coding and non-coding conserved sequences, making a systematic statistical assessment of their differences necessary. A comprehensive study should comprise an association study, i.e. a comparison of the distributions of the features in the two classes, and a prediction study in which the prediction accuracies of classifiers trained on single features and groups of features are analyzed, conditional on the compared species and the sequence lengths. Results: In this paper we compared the distributions of a set of comparative and non-comparative features and evaluated the prediction accuracy of classifiers trained to discriminate sequence elements conserved among the human, mouse and rat species. The association study showed that the analyzed features are statistically different in the two classes. In order to study the influence of sequence length on feature performance, a predictive study was performed on data sets composed of equal numbers of equally long coding and non-coding alignments, with ascending average length. We found that the most discriminant feature was a comparative measure indicating the proportion of synonymous nucleotide substitutions per synonymous site. Moreover, linear discriminant classifiers trained using comparative features generally outperformed classifiers based on intrinsic ones. Finally, the prediction accuracy of classifiers trained on comparative features increased significantly when intrinsic features were added to the set of input variables, independently of sequence length (Kolmogorov-Smirnov P-value ≤ 0.05). 
Conclusion: We observed distinct and consistent patterns for individual and combined use of comparative and intrinsic classifiers, both with respect to different lengths of sequences/alignments and with respect to error rates in the classification of coding and non-coding elements. In particular, we noted that comparative features tend to be more accurate in the classification of coding sequences; this is likely related to the fact that such features capture deviations from the strictly neutral evolution expected as a consequence of the characteristics of the genetic code. [ABSTRACT FROM AUTHOR]
- Published
- 2009
- Full Text
- View/download PDF
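The most discriminant feature above relates to synonymous substitutions. A raw sketch (per substituted codon rather than the per-synonymous-site normalization the feature actually uses) can be written with the standard genetic code:

```python
# Standard genetic code packed as a 64-character string: codon order is
# TTT, TTC, TTA, TTG, TCT, ... with bases cycling in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
               for i, b1 in enumerate(BASES)
               for j, b2 in enumerate(BASES)
               for k, b3 in enumerate(BASES)}

def synonymous_proportion(seq1, seq2):
    """Proportion of codon substitutions that are synonymous (same amino
    acid) between two aligned, gap-free coding sequences."""
    syn = total = 0
    for i in range(0, len(seq1) - 2, 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        if c1 != c2:
            total += 1
            syn += CODON_TABLE[c1] == CODON_TABLE[c2]
    return syn / total if total else 0.0

# TTT→TTC is synonymous (both Phe); GCT→GAT is not (Ala vs Asp)
print(synonymous_proportion("TTTGCT", "TTCGAT"))  # → 0.5
```

In coding sequence, this proportion is pushed up by purifying selection, which is why it separates coding from non-coding conserved elements so well.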
20. An effective docking strategy for virtual screening based on multi-objective optimization algorithm.
- Author
-
Honglin Li, Hailei Zhang, Mingyue Zheng, Jie Luo, Ling Kang, Xiaofeng Liu, Xicheng Wang, and Hualiang Jiang
- Subjects
DIAGNOSIS ,BIOCHEMISTRY ,DATABASES ,DRUG development ,PHARMACOLOGY ,BIOINFORMATICS ,COMPUTERS in medicine - Abstract
Background: The development of a fast and accurate scoring function for virtual screening remains a hot issue in current computer-aided drug research. Different scoring functions focus on diverse aspects of ligand binding, and no single scoring function can satisfy the peculiarities of each target system. Therefore, the idea of a consensus scoring strategy was put forward. Integrating several scoring functions, consensus scoring re-assesses the conformations docked by a primary scoring function. However, it is not really robust and efficient from the perspective of optimization. Furthermore, to date, the majority of available methods are still based on single-objective optimization design. Results: In this paper, two multi-objective optimization methods, collectively called MOSFOM, were developed for virtual screening; they simultaneously consider both the energy score and the contact score. Results suggest that MOSFOM can effectively enhance enrichment and performance compared with a single score. For three different kinds of binding sites, MOSFOM displays an excellent ability to differentiate active compounds through energy and shape complementarity. EFMOGA performed particularly well in the top 2% of the database for all three cases, whereas MOEA_Nrg and MOEA_Cnt performed better than the corresponding individual scoring functions if the appropriate type of binding site was selected. Conclusion: The multi-objective optimization method was successfully applied to virtual screening with two different scoring functions; it can yield reasonable binding poses and can, furthermore, rank the potentially compromised conformations of each compound, abandoning those conformations that cannot satisfy the overall objective functions. [ABSTRACT FROM AUTHOR]
- Published
- 2009
- Full Text
- View/download PDF
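Treating the energy score and contact score as two objectives means selecting poses by Pareto dominance rather than a single combined number. A minimal sketch (toy scores, not MOSFOM's actual optimizer) of the non-dominated filter:

```python
def dominates(a, b):
    """a dominates b if it is no worse in every objective and strictly
    better in at least one (lower scores are better for both objectives)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(poses):
    """Keep the docked poses whose (energy_score, contact_score) pairs are
    not dominated by any other pose."""
    return [p for p in poses
            if not any(dominates(q, p) for q in poses if q != p)]

# (energy score, contact score) for candidate poses; lower is better
poses = [(-9.1, -0.60), (-8.5, -0.80), (-7.0, -0.50), (-9.1, -0.55)]
print(pareto_front(poses))  # → [(-9.1, -0.6), (-8.5, -0.8)]
```

Neither surviving pose beats the other on both objectives, which is exactly the trade-off set a multi-objective method exposes.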
21. RSAP-Net: joint optic disc and cup segmentation with a residual spatial attention path module and MSRCR-PT pre-processing algorithm
- Author
-
Yun, Jiang, Zeqi, Ma, Chao, Wu, Zequn, Zhang, and Wei, Yan
- Subjects
Structural Biology ,Applied Mathematics ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
Background Glaucoma can cause irreversible blindness. Since it has no symptoms in its early stage, accurately segmenting the optic disc (OD) and optic cup (OC) from fundus medical images is particularly important for the screening and prevention of glaucoma. In recent years, the mainstream method for OD and OC segmentation has been the convolutional neural network (CNN). However, most existing CNN methods segment the OD and OC separately and ignore the a priori information that the OC is always contained inside the OD region, which limits their segmentation accuracy. Methods This paper proposes a new encoder–decoder segmentation structure, called RSAP-Net, for joint segmentation of the OD and OC. We first designed an efficient U-shaped segmentation network as the backbone. Considering the spatial overlap between the OD and OC, a new residual spatial attention path is proposed to connect the encoder and decoder and retain more characteristic information. To further improve segmentation performance, a pre-processing method called MSRCR-PT (Multi-Scale Retinex Colour Recovery and Polar Transformation) has been devised. It combines a multi-scale Retinex colour recovery algorithm with a polar coordinate transformation, helping RSAP-Net produce more refined boundaries of the optic disc and the optic cup. Results The experimental results show that our method achieves excellent segmentation performance on the Drishti-GS1 standard dataset. For OD and OC segmentation, the F1 scores are 0.9752 and 0.9012, respectively, and the BLE values are 6.33 and 11.97 pixels, respectively. Conclusions This paper presents a new framework for the joint segmentation of optic discs and optic cups, called RSAP-Net. The framework mainly consists of a U-shaped segmentation skeleton and a residual spatial attention path module. 
The design of a pre-processing method called MSRCR-PT for the OD/OC segmentation task can improve segmentation performance. The method was evaluated on the publicly available Drishti-GS1 standard dataset and proved to be effective.
- Published
- 2022
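The polar transformation half of MSRCR-PT can be sketched as a nearest-neighbour resampling of an image onto a (radius, angle) grid centred on the image centre. This is a toy sketch on a tiny array; the Retinex colour-recovery step and the paper's exact sampling are omitted:

```python
import math

def to_polar(img, n_radii, n_angles):
    """Resample a square image onto a polar grid centred on the image
    centre, using nearest-neighbour sampling."""
    h = len(img)
    cy = cx = (h - 1) / 2.0
    rmax = (h - 1) / 2.0
    polar = []
    for ri in range(n_radii):
        r = rmax * ri / max(n_radii - 1, 1)
        row = []
        for ai in range(n_angles):
            theta = 2 * math.pi * ai / n_angles
            y = int(round(cy + r * math.sin(theta)))
            x = int(round(cx + r * math.cos(theta)))
            row.append(img[y][x])
        polar.append(row)
    return polar

# A 5x5 image with a bright centre: every angle at radius 0 samples it
img = [[0] * 5 for _ in range(5)]
img[2][2] = 9
polar = to_polar(img, n_radii=3, n_angles=4)
print(polar[0])  # → [9, 9, 9, 9]
```

The transform turns the roughly circular OD/OC boundaries into roughly horizontal lines, which is what makes the subsequent segmentation easier.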
22. Reproducibility of mass spectrometry based metabolomics data
- Author
-
Daisy Philtron, Debashis Ghosh, Tusharkanti Ghosh, Weiming Zhang, and Katerina Kechris
- Subjects
False discovery rate ,QH301-705.5 ,Computer applications to medicine. Medical informatics ,R858-859.7 ,Biochemistry ,Bioconductor ,Metabolomics ,Structural Biology ,Consistency (statistics) ,Biology (General) ,Molecular Biology ,Parametric statistics ,Mathematics ,Reproducibility ,Mass spectrometry ,business.industry ,Research ,Applied Mathematics ,Nonparametric statistics ,Reproducibility of Results ,Pattern recognition ,Replicate ,Computer Science Applications ,Artificial intelligence ,business - Abstract
Background Assessing the reproducibility of measurements is an important first step towards improving the reliability of downstream analyses of high-throughput metabolomics experiments. We define a metabolite to be reproducible when it demonstrates consistency across replicate experiments; metabolites that are not consistent across replicates are labeled irreproducible. In this work, we introduce and evaluate the use of (Ma)ximum (R)ank (R)eproducibility (MaRR) to examine reproducibility in mass spectrometry-based metabolomics (MS-Metabolomics) experiments. We examine reproducibility across technical or biological samples in three different MS-Metabolomics data sets. Results We apply MaRR, a nonparametric approach that detects the change from reproducible to irreproducible signals using a maximal rank statistic. The advantage of MaRR over model-based methods is that it does not make parametric assumptions about the underlying distributions or dependence structures of reproducible metabolites. Using three MS-Metabolomics data sets generated in the multi-center Genetic Epidemiology of Chronic Obstructive Pulmonary Disease (COPD) study, we applied the MaRR procedure after data processing to explore reproducibility across technical or biological samples. Under realistic settings of MS-Metabolomics data, the MaRR procedure effectively controls the False Discovery Rate (FDR) when there is a gradual reduction in correlation between replicate pairs for less highly ranked signals. Simulation studies also show that the MaRR procedure tends to have high power for detecting reproducible metabolites in most situations, except when the proportion of reproducible metabolites is small. Bias values (i.e., the difference between the estimated and true proportions of reproducible signals) in the simulations are also close to zero. 
The results from the real data show a higher level of reproducibility for technical replicates than for biological replicates across all three datasets. In summary, we demonstrate that the MaRR procedure can be adapted to various experimental designs and that the nonparametric approach performs consistently well. Conclusions This research was motivated by reproducibility, which has proven to be a major obstacle in using genomic findings to advance clinical practice. In this paper, we developed a data-driven approach to assess the reproducibility of MS-Metabolomics data sets. The methods described in this paper are implemented in the open-source R package marr, which is freely available from Bioconductor at http://bioconductor.org/packages/marr.
- Published
- 2021
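MaRR looks for the rank depth at which signals stop being reproducible across replicates. A simplified stand-in (top-k overlap against a threshold, not the actual maximal rank statistic) conveys the idea:

```python
def concordance_depth(ranks1, ranks2, threshold=0.8):
    """Deepest k such that the top-k metabolites of the two replicates
    overlap by at least `threshold` at every depth up to k."""
    order1 = sorted(ranks1, key=ranks1.get)  # metabolites, best rank first
    order2 = sorted(ranks2, key=ranks2.get)
    depth = 0
    for k in range(1, len(order1) + 1):
        overlap = len(set(order1[:k]) & set(order2[:k])) / k
        if overlap >= threshold:
            depth = k
        else:
            break
    return depth

# Two replicates agree on their top 3 metabolites, then diverge
rep1 = {"m1": 1, "m2": 2, "m3": 3, "m4": 4, "m5": 5}
rep2 = {"m1": 1, "m2": 2, "m3": 3, "m6": 4, "m7": 5}
print(concordance_depth(rep1, rep2))  # → 3
```

Everything above the returned depth would be called reproducible, everything below irreproducible; MaRR chooses that change point with a formal rank statistic and FDR control rather than a fixed threshold.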
23. Protein structure prediction based on particle swarm optimization and tabu search strategy
- Author
-
Yu, Shuchun, Li, Xianxiang, Tian, Xue, and Pang, Ming
- Subjects
Protein Stability ,Structural Biology ,Applied Mathematics ,Proteins ,Amino Acid Sequence ,Molecular Biology ,Biochemistry ,Algorithms ,Computer Science Applications - Abstract
Background The stability of protein sequence structure plays an important role in the prevention and treatment of diseases. Results In this paper, particle swarm optimization and tabu search are combined to propose a new method for protein structure prediction. The experimental results show that, for four groups of artificial protein sequences with different lengths, this method obtains the lowest potential energy values and stable structure prediction results, clearly outperforming the two comparison methods. Taking the first group of protein sequences as an example, our method improves the minimum potential energy obtained over the two comparison methods by 127% and 7%, respectively. Conclusions Therefore, the method proposed in this paper is more suitable for predicting protein structural stability.
- Published
- 2022
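The abstract does not give the update rules, but a generic combination of particle swarm optimization with a tabu list can be sketched on a toy potential. All parameters, the sphere-function "energy", and the tabu discretisation are illustrative assumptions, not the paper's method:

```python
import random

def energy(x):
    """Toy potential energy surface (a sphere function stands in for the
    real protein potential; minimum 0 at the origin)."""
    return sum(v * v for v in x)

def pso_tabu(dim=3, particles=10, iters=50, seed=1):
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    pbest = [list(p) for p in pos]
    gbest = list(min(pbest, key=energy))
    tabu = set()  # rounded positions the search refuses to revisit
    for _ in range(iters):
        for i in range(particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * r1 * (pbest[i][d] - pos[i][d])
                             + 1.5 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            key = tuple(round(v, 1) for v in pos[i])
            if key in tabu:  # tabu move: re-randomise the particle instead
                pos[i] = [rng.uniform(-5, 5) for _ in range(dim)]
                continue
            tabu.add(key)
            if energy(pos[i]) < energy(pbest[i]):
                pbest[i] = list(pos[i])
                if energy(pbest[i]) < energy(gbest):
                    gbest = list(pbest[i])
    return gbest, energy(gbest)

best, e = pso_tabu()
print(best, e)  # typically converges close to the global minimum at the origin
```

The tabu list is what keeps the swarm from cycling through already-visited conformations, which is the usual motivation for hybridising the two heuristics.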
24. CMIC: an efficient quality score compressor with random access functionality
- Author
-
Hansen Chen, Jianhua Chen, Zhiwen Lu, and Rongshu Wang
- Subjects
Structural Biology ,Applied Mathematics ,High-Throughput Nucleotide Sequencing ,Data Compression ,Molecular Biology ,Biochemistry ,Algorithms ,Software ,Computer Science Applications - Abstract
Background Over the past few decades, the emergence and maturation of new technologies have substantially reduced the cost of genome sequencing. As a result, the amount of genomic data that needs to be stored and transmitted has grown exponentially. For the standard sequencing data format, FASTQ, compression of the quality scores is a key and difficult aspect of file compression. Throughout the literature, we found that the majority of current quality score compression methods do not support random access. It is therefore worthwhile to investigate a lossless quality score compressor with a high compression rate, fast compression and decompression, and support for random access. Results In this paper, we propose CMIC, an adaptive compressor with random access support for lossless compression of quality score sequences. CMIC is an acronym for the four steps of the method, which also form its framework: classification, mapping, indexing, and compression. The experimental results show that our compressor performs well in terms of compression rate on all tested datasets, reducing file sizes by up to 21.91% compared with LCQS. In terms of compression speed, CMIC is better than all other compressors in most tested cases. In terms of random access speed, CMIC is faster than LCQS, which also provides a random access function for compressed quality scores. Conclusions CMIC is a compressor especially designed for quality score sequences, with good performance in terms of compression rate, compression speed, decompression speed, and random access speed. CMIC is available at https://github.com/Humonex/Cmic.
- Published
- 2022
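Block-wise compression with an offset index is the standard way to get random access into compressed data. A toy sketch with zlib shows the mechanism; real quality-score compressors such as CMIC use far larger blocks and specialised context models, so this is only the random-access scaffolding:

```python
import zlib

def compress_blocks(quality, block_size=4):
    """Compress a quality-score string in fixed-size blocks, keeping an
    index of compressed-block offsets so single scores can be fetched
    without decompressing everything."""
    blocks, index, offset = [], [], 0
    for i in range(0, len(quality), block_size):
        blob = zlib.compress(quality[i:i + block_size].encode())
        index.append(offset)
        offset += len(blob)
        blocks.append(blob)
    return b"".join(blocks), index + [offset]  # sentinel end offset

def random_access(data, index, pos, block_size=4):
    """Return the quality score at `pos`, decompressing only one block."""
    b = pos // block_size
    block = zlib.decompress(data[index[b]:index[b + 1]]).decode()
    return block[pos % block_size]

data, index = compress_blocks("IIIIFFFF####AAAA")
print(random_access(data, index, 5))   # → 'F'
print(random_access(data, index, 12))  # → 'A'
```

Note the trade-off the abstract alludes to: smaller blocks mean faster random access but a worse compression rate (here the 4-byte blocks actually expand under zlib).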
25. Topology-enhanced molecular graph representation for anti-breast cancer drug selection
- Author
-
Yue Gao, Songling Chen, Junyi Tong, and Xiangling Fu
- Subjects
Structural Biology ,Applied Mathematics ,Estrogen Receptor alpha ,Humans ,Antineoplastic Agents ,Breast Neoplasms ,Female ,Breast ,Neural Networks, Computer ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
Background Breast cancer is currently one of the cancers with a higher mortality rate in the world. Biological research on anti-breast cancer drugs focuses on the activity of estrogen receptor alpha (ERα), the pharmacokinetic properties and the safety of the compounds, which, however, is an expensive and time-consuming process. Developments in deep learning offer the potential to efficiently facilitate candidate drug selection against breast cancer. Methods In this paper, we propose an Anti-Breast Cancer Drug selection method utilizing Gated Graph Neural Networks (ABCD-GGNN) to topologically enhance the molecular representation of candidate drugs. By constructing atom-level graphs through atomic descriptors for each distinct compound, ABCD-GGNN can topologically learn both the implicit structure and substructure characteristics of a candidate drug and then integrate the representation with explicit discrete molecular descriptors to generate a molecule-level representation. As a result, ABCD-GGNN can inductively predict the ERα activity, the pharmacokinetic properties and the safety of each candidate drug. Finally, we design a ranking operator whose inputs are the predicted properties, so as to statistically select appropriate drugs against breast cancer. Results Extensive experiments conducted on our collected anti-breast cancer candidate drug dataset demonstrate that our proposed method outperforms all the other representative methods in predicting the ERα activity, pharmacokinetic properties and safety of the compounds. Extended result analysis demonstrates the efficiency and biological rationality of the operator we designed to calculate the candidate drug ranking from the predicted properties. Conclusion In this paper, we propose the ABCD-GGNN representation method to efficiently integrate the topological structure and substructure features of molecules with discrete molecular descriptors. 
With the ranking operator applied, the predicted properties efficiently facilitate candidate drug selection against breast cancer.
- Published
- 2022
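A ranking operator over predicted properties can be as simple as a weighted score in which the sign flips for properties where lower is better. The property names, weights, and values below are hypothetical; the paper's operator is statistical rather than this plain weighted sum:

```python
def rank_candidates(candidates, weights):
    """Rank candidate drugs by a weighted sum of predicted properties.
    Each property is given as (value, higher_is_better)."""
    def score(props):
        return sum(w * (v if hib else -v)
                   for w, (v, hib) in zip(weights, props))
    return sorted(candidates, key=lambda c: score(candidates[c]), reverse=True)

# (predicted ERα activity, pharmacokinetic score, toxicity) per candidate;
# toxicity is the one property where lower is better
candidates = {
    "drug_a": [(0.9, True), (0.6, True), (0.2, False)],
    "drug_b": [(0.7, True), (0.9, True), (0.1, False)],
    "drug_c": [(0.5, True), (0.4, True), (0.8, False)],
}
print(rank_candidates(candidates, weights=[1.0, 1.0, 1.0]))
# → ['drug_b', 'drug_a', 'drug_c']
```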
26. Automatic classification of nerve discharge rhythms based on sparse auto-encoder and time series feature
- Author
-
Zhongting Jiang, Dong Wang, and Yuehui Chen
- Subjects
Time Factors ,Structural Biology ,Applied Mathematics ,Neural Networks, Computer ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
Background Nerve discharge is the carrier of information transmission and can reveal the basic rules of various nerve activities. Recognition of the nerve discharge rhythm is the key to correctly understanding the dynamic behavior of the nervous system. Previous methods for nerve discharge recognition depended almost entirely on traditional statistical features and nonlinear dynamical features of the discharge activity, requiring artificial feature extraction and empirical judgment. Thus, these methods suffered from subjective factors and were not conducive to identifying large numbers of discharge rhythms. Results The ability to extract features automatically has greatly improved with the development of neural networks. In this paper, an effective discharge rhythm classification model based on a sparse auto-encoder is proposed. The sparse auto-encoder is used to construct the feature learning network. Simulated discharge data from the Chay model and its variants are taken as the input of the network, and the fused features, including the network-learned features, the covariance and the approximate entropy of the nerve discharge, are classified by Softmax. The results showed a classification accuracy of 87.5% on the testing data. Compared with other methods for identifying nerve discharge types, this method extracts the characteristics of the nerve discharge rhythm automatically, without artificial design, and shows higher accuracy. Conclusions Neither sparse auto-encoders nor neural networks in general had previously been used to classify basic nerve discharges from biological experimental data or model simulation data. 
The automatic classification method for nerve discharge rhythms based on the sparse auto-encoder reduces the subjectivity and misjudgment of artificial feature extraction, saves time compared with traditional methods, and improves the intelligence of discharge-type classification. It can further help us recognize and identify nerve discharge activities in a new way.
- Published
- 2022
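One of the fused features named above, approximate entropy, is a self-contained algorithm: it measures how often length-m patterns that match within tolerance r keep matching when extended to length m+1. A direct sketch of the standard definition:

```python
import math

def approx_entropy(series, m=2, r=0.2):
    """Approximate entropy ApEn(m, r): near 0 for perfectly regular
    series, larger for irregular ones."""
    def phi(m):
        n = len(series) - m + 1
        vecs = [series[i:i + m] for i in range(n)]
        counts = []
        for v in vecs:
            # Fraction of templates within Chebyshev distance r of v
            c = sum(1 for w in vecs
                    if max(abs(a - b) for a, b in zip(v, w)) <= r)
            counts.append(c / n)
        return sum(math.log(c) for c in counts) / n
    return phi(m) - phi(m + 1)

constant = [5.0] * 30
alternating = [0.0, 1.0] * 15
print(approx_entropy(constant))            # → 0.0 (perfectly regular)
print(approx_entropy(alternating) < 0.2)   # regular pattern → low ApEn
```

This brute-force version is O(n²) per call; for long discharge recordings the template comparisons are usually vectorised, but the statistic is the same.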
27. Spark-based parallel calculation of 3D fourier shell correlation for macromolecule structure local resolution estimation
- Author
-
Xinhui Tian, Xiangrui Zeng, Xiaohui Zheng, Xin Gao, Wang Hui, Liu Xiaodong, Shi Xiao, Zhao Xiaofang, Min Xu, and Yongchun Lü
- Subjects
Macromolecular Substances ,Fourier shell correlation ,Computer science ,Single particle analysis ,3D array partition ,lcsh:Computer applications to medicine. Medical informatics ,Biochemistry ,03 medical and health sciences ,Imaging, Three-Dimensional ,0302 clinical medicine ,Structural Biology ,Computer cluster ,Microscopy ,Spark (mathematics) ,Image Processing, Computer-Assisted ,Molecular Biology ,lcsh:QH301-705.5 ,3D local resolution map ,030304 developmental biology ,3D local Fourier shell correlation ,Spark ,0303 health sciences ,Key-value data ,Applied Mathematics ,Cryoelectron Microscopy ,Resolution (electron density) ,Methodology ,Partition (database) ,Computer Science Applications ,lcsh:Biology (General) ,lcsh:R858-859.7 ,Algorithm ,Algorithms ,030217 neurology & neurosurgery - Abstract
Background Resolution estimation is the main evaluation criterion for the reconstruction of macromolecular 3D structures in the field of cryo-electron microscopy (cryo-EM). At present, there are many methods to evaluate the 3D resolution of macromolecular structures reconstructed by Single Particle Analysis (SPA) in cryo-EM and by subtomogram averaging (SA) in electron cryotomography (cryo-ET). As global methods, they measure the resolution of the structure as a whole but are inaccurate in detecting subtle local changes in the reconstruction. To detect such subtle changes in SPA and SA reconstructions, a few local resolution methods have been proposed. The mainstream local resolution evaluation methods are based on the local Fourier shell correlation (FSC), which is computationally intensive. Moreover, the existing local resolution evaluation methods are multi-threaded implementations on a single computer and scale very poorly. Results This paper proposes a new fine-grained 3D array partition method using the key-value format in Spark. Our method first converts 3D images to key-value (K-V) data, which is then used for 3D array partitioning and parallel data exchange, so that a Spark-based distributed parallel computing framework can solve the above scalability problem. In this distributed framework, all 3D local FSC tasks are calculated simultaneously across multiple nodes of a computer cluster. On experimental data, the 3D local resolution evaluation algorithm based on the Spark fine-grained 3D array partition achieves an order-of-magnitude speedup over the mainstream local FSC algorithm with unchanged accuracy, and has better fault tolerance and scalability. Conclusions In this paper, we proposed a K-V-format fine-grained 3D array partition method in Spark for calculating the 3D local FSC in parallel to obtain a 3D local resolution density map. 
The 3D local resolution density map evaluates the three-dimensional density maps reconstructed by single particle analysis and subtomogram averaging. Our proposed method significantly increases the speed of 3D local resolution evaluation, which is important for the efficient detection of subtle variations among reconstructed macromolecular structures.
- Published
- 2020
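The per-shell computation at the heart of the local FSC can be sketched serially (without the Spark partitioning) as a normalised cross-correlation over radial shells of Fourier coefficients. The sparse dict-of-coefficients representation is an illustrative simplification of the dense 3D FFT volumes used in practice:

```python
import math

def fsc(coeffs1, coeffs2, n_shells=3, max_freq=1.0):
    """Fourier shell correlation: for each radial shell, the normalised
    cross-correlation of two sets of Fourier coefficients. `coeffs` maps a
    frequency vector (kx, ky, kz) to a complex coefficient."""
    num = [0j] * n_shells   # cross terms F1 * conj(F2)
    p1 = [0.0] * n_shells   # power of volume 1 per shell
    p2 = [0.0] * n_shells   # power of volume 2 per shell
    for k, f1 in coeffs1.items():
        f2 = coeffs2.get(k, 0j)
        radius = math.sqrt(sum(c * c for c in k))
        shell = min(int(radius / max_freq * n_shells), n_shells - 1)
        num[shell] += f1 * f2.conjugate()
        p1[shell] += abs(f1) ** 2
        p2[shell] += abs(f2) ** 2
    return [abs(n) / math.sqrt(a * b) if a and b else 0.0
            for n, a, b in zip(num, p1, p2)]

# Identical volumes correlate perfectly: FSC = 1 in every populated shell
coeffs = {(0.1, 0.0, 0.0): 2 + 1j,
          (0.5, 0.1, 0.0): 1 - 3j,
          (0.0, 0.8, 0.3): 0.5 + 0j}
print([round(v, 6) for v in fsc(coeffs, coeffs)])  # → [1.0, 1.0, 1.0]
```

A *local* FSC repeats this for a small windowed region around every voxel, which is why the workload is large enough to justify distributing the shells and windows across a cluster.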
28. Predicting synthetic lethal interactions in human cancers using graph regularized self-representative matrix factorization
- Author
-
Jiang Huang, Fan Lu, Zexuan Zhu, Min Wu, and Le Ou-Yang
- Subjects
Synthetic lethality ,Computer science ,Antineoplastic Agents ,Computational biology ,lcsh:Computer applications to medicine. Medical informatics ,Biochemistry ,Matrix decomposition ,03 medical and health sciences ,0302 clinical medicine ,Structural Biology ,Neoplasms ,Humans ,Molecular Biology ,Gene ,lcsh:QH301-705.5 ,030304 developmental biology ,0303 health sciences ,Research ,Applied Mathematics ,Matrix factorization ,Anticancer drug ,Computer Science Applications ,lcsh:Biology (General) ,030220 oncology & carcinogenesis ,Graph regularization ,Graph (abstract data type) ,lcsh:R858-859.7 ,DNA microarray ,Algorithms - Abstract
Background Synthetic lethality has attracted considerable attention in cancer therapeutics due to its utility in identifying new anticancer drug targets. Identifying synthetic lethal (SL) interactions is the key step towards exploring synthetic lethality in cancer treatment. However, biological experiments face many challenges when identifying synthetic lethal interactions. Thus, it is necessary to develop computational methods that can serve as useful complements to biological experiments. Results In this paper, we propose a novel graph regularized self-representative matrix factorization (GRSMF) algorithm for synthetic lethal interaction prediction. GRSMF first learns self-representations from the known SL interactions and further integrates functional similarities among genes derived from the Gene Ontology (GO). It can then effectively predict potential SL interactions by leveraging the information provided by known SL interactions and the functional annotations of genes. Extensive experiments on synthetic lethal interaction data downloaded from the SynLethDB database demonstrate the superiority of our GRSMF in predicting potential synthetic lethal interactions, compared with other competing methods. Moreover, case studies of novel interactions are conducted in this paper to further evaluate the effectiveness of GRSMF in synthetic lethal interaction prediction. Conclusions In this paper, we demonstrate that by adaptively exploiting the self-representation of the original SL interaction data, and utilizing functional similarities among genes to enhance the learning of the self-representation matrix, GRSMF can predict potential SL interactions more accurately than other state-of-the-art SL interaction prediction methods.
- Published
- 2019
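A plausible reading of a graph-regularised self-representative objective is a self-representation loss plus a graph-Laplacian penalty, ||X - XW||_F² + λ·tr(WᵀLW). The exact objective and solver are in the paper; this sketch only evaluates that assumed form on tiny nested-list matrices:

```python
def matmul(A, B):
    """Dense matrix product on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def grsmf_objective(X, W, L, lam):
    """Assumed objective: self-representation reconstruction error
    ||X - XW||_F^2 plus lam * tr(W^T L W), where L is the Laplacian of the
    gene functional-similarity graph."""
    XW = matmul(X, W)
    recon = sum((x - y) ** 2 for xr, yr in zip(X, XW) for x, y in zip(xr, yr))
    Wt = [list(r) for r in zip(*W)]          # transpose of W
    WLW = matmul(matmul(Wt, L), W)           # W^T L W
    reg = sum(WLW[i][i] for i in range(len(WLW)))  # trace
    return recon + lam * reg

# With W = identity, the reconstruction term vanishes and only the graph
# penalty tr(L) remains.
X = [[1.0, 2.0], [3.0, 4.0]]
I = [[1.0, 0.0], [0.0, 1.0]]
L = [[1.0, -1.0], [-1.0, 1.0]]  # Laplacian of one edge between two genes
print(grsmf_objective(X, I, L, lam=0.5))  # → 1.0 (= 0 + 0.5 * tr(L))
```

The Laplacian term rewards solutions in which functionally similar genes (connected in the graph) receive similar representation columns, which is the "graph regularization" in the name.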
29. MiRNA therapeutics based on logic circuits of biological pathways
- Author
-
Massimo La Rosa, Antonino Fiannaca, Alfonso Urso, Valeria Boscaino, Laura La Paglia, and Riccardo Rizzo
- Subjects
Lung Neoplasms ,Logic ,Computer science ,In silico ,Cancer pathway ,Boolean network ,Logic circuit ,Computational biology ,lcsh:Computer applications to medicine. Medical informatics ,Biochemistry ,Biological pathway ,03 medical and health sciences ,chemistry.chemical_compound ,0302 clinical medicine ,Structural Biology ,In vivo ,Carcinoma, Non-Small-Cell Lung ,microRNA ,medicine ,Humans ,Molecule ,Computer Simulation ,Antagomir ,KEGG ,Molecular Biology ,Gene ,lcsh:QH301-705.5 ,030304 developmental biology ,0303 health sciences ,Drug discovery ,Research ,Applied Mathematics ,Cancer ,RNA ,Cancer Pathway ,miRNA therapeutics ,medicine.disease ,In vitro ,Computer Science Applications ,Gene Expression Regulation, Neoplastic ,MicroRNAs ,chemistry ,lcsh:Biology (General) ,030220 oncology & carcinogenesis ,Mutation ,lcsh:R858-859.7 ,DNA microarray ,Signal Transduction - Abstract
Background In silico experiments, with the aid of computer simulation, speed up the process of in vitro or in vivo experiments. Cancer therapy design is often based on signalling pathways. MicroRNAs (miRNAs) are small non-coding RNA molecules. In several kinds of diseases, including cancer, hepatitis and cardiovascular diseases, they are often deregulated, acting as oncogenes or tumor suppressors. miRNA therapeutics is based on the injection of two main kinds of molecules: miRNA mimics, molecules that mimic the targeted miRNA, and antagomiRs, molecules that inhibit the targeted miRNA. Nowadays, research is focused on miRNA therapeutics. This paper addresses cancer-related signalling pathways to investigate miRNA therapeutics. Results In order to prove our approach, we present two different case studies: non-small cell lung cancer and melanoma. KEGG signalling pathways are modelled as digital circuits. A logic value of 1 is linked to the expression of the corresponding gene; a logic value of 0 is linked to a gene that is absent (not expressed). All possible relationships provided by a signalling pathway are modelled by logic gates. Mutations, derived according to the literature, are introduced and modelled as well. The modelling approach and analysis are widely discussed within the paper. miRNA therapeutics is investigated through analysis of the digital circuit. The most effective miRNA and combination of miRNAs, in terms of reduction of pathogenic conditions, are obtained. A discussion of the obtained results in comparison with literature data is provided; the results are confirmed by existing data. Conclusions The proposed study is based on drug discovery and miRNA therapeutics and uses a digital circuit simulation of a cancer pathway. Using this simulation, the most effective combination of drugs and miRNAs for mutated cancer therapy design is obtained, and these results were validated by the literature. 
The proposed modelling and analysis approach can be applied to any human disease, starting from the corresponding signalling pathway.
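The gene-as-logic-value encoding can be illustrated with a toy circuit: expressed genes carry a 1, mutations force a constitutive 1, and an antagomiR-style intervention forces its target back to 0. The gene names (EGFR, KRAS, ERK) and the single activation chain below are illustrative stand-ins, not the KEGG-derived circuits the paper builds.

```python
# Toy Boolean-circuit view of a signalling pathway. 1 = gene expressed, 0 = not.
# A linear activation edge behaves like a buffer gate; convergent edges would
# become AND/OR gates in a fuller model.
def pathway(egfr, kras_mutated, mir_inhibits_kras=False):
    kras = 1 if kras_mutated else egfr        # mutation: constitutively active node
    if mir_inhibits_kras:                     # antagomiR-style inhibition of KRAS
        kras = 0
    erk = kras                                # activation edge = buffer gate
    proliferation = erk                       # pathogenic pathway output
    return proliferation

# Mutation drives the pathogenic output even without upstream signal...
assert pathway(egfr=0, kras_mutated=True) == 1
# ...and the simulated miRNA therapy restores the healthy output.
assert pathway(egfr=0, kras_mutated=True, mir_inhibits_kras=True) == 0
```

Enumerating all input combinations of such a circuit is how one would rank which (combinations of) inhibitions most reduce the pathogenic output.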
- Published
- 2019
30. ANINet: a deep neural network for skull ancestry estimation
- Author
-
Jiang Yi, Geng Guo-hua, Liu Xiaoning, Lin Pengyue, Xia Siyuan, Wang Shixiong, and Yang Wen
- Subjects
Ancestry classification ,QH301-705.5 ,Computer science ,Computer applications to medicine. Medical informatics ,R858-859.7 ,Biochemistry ,Cross-validation ,Image (mathematics) ,Structural Biology ,Depth projection ,Image Processing, Computer-Assisted ,Calibration ,medicine ,Biology (General) ,Projection (set theory) ,Molecular Biology ,Artificial neural network ,business.industry ,Research ,3D skull models ,ANINet ,Applied Mathematics ,Skull ,Pattern recognition ,Computer Science Applications ,Range (mathematics) ,medicine.anatomical_structure ,Feature (computer vision) ,Neural Networks, Computer ,Artificial intelligence ,business - Abstract
Background Ancestry estimation from skulls has a wide range of applications in forensic science, anthropology, and facial reconstruction. This study aims to avoid the defects of traditional skull ancestry estimation methods, such as the time-consuming and labor-intensive manual calibration of feature points and the subjectivity of the results. Results This paper uses the skull depth image as input and, based on AlexNet, introduces the Wide module and SE-block to improve the network, designing and proposing ANINet to realize ancestry classification. The unified model architecture of ANINet overcomes the subjectivity of manually calibrating feature points, improving both accuracy and efficiency. We use depth projection to obtain the local and global depth images of the skull, take the skull depth image as the object, experiment with global, local, and local + global methods on data sets of 95 Han skulls and 110 Uyghur skulls, and perform cross-validation. The experimental results show that the accuracies of the three methods for skull ancestry estimation reached 98.21%, 98.04% and 99.03%, respectively. Compared with the classic networks AlexNet, Vgg-16, GoogLenet, ResNet-50, DenseNet-121, and SqueezeNet, the network proposed in this paper has the advantages of high accuracy and a small number of parameters; compared with state-of-the-art methods, the method in this paper learns faster and estimates more accurately. Conclusions In summary, skull depth images perform excellently in ancestry estimation, and ANINet is an effective approach for skull ancestry estimation.
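The depth-projection step (turning a 3D skull model into a 2D depth image for the CNN) can be approximated by a simple orthographic projection of a point cloud. This is a sketch under the assumption of an axis-aligned projection onto a fixed grid; the paper's specific local/global projections and the ANINet architecture are not reproduced here.

```python
import numpy as np

def depth_image(points, res=64):
    """Orthographic depth projection of a 3D point cloud onto the XY plane:
    each pixel stores the largest z (nearest surface point) falling into it."""
    pts = np.asarray(points, dtype=float)
    xy = pts[:, :2]
    mn, mx = xy.min(axis=0), xy.max(axis=0)
    ij = ((xy - mn) / (mx - mn + 1e-9) * (res - 1)).astype(int)  # grid coords
    img = np.zeros((res, res))
    for (i, j), z in zip(ij, pts[:, 2]):
        img[j, i] = max(img[j, i], z)       # keep the frontmost surface point
    return img

cloud = np.random.default_rng(1).random((500, 3))   # stand-in for a skull mesh
img = depth_image(cloud)
print(img.shape)  # (64, 64)
```

The resulting single-channel image is then suitable input for any image classifier, which is what makes the "manual feature-point calibration" step unnecessary.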
- Published
- 2021
31. Research on RNA secondary structure predicting via bidirectional recurrent neural network
- Author
-
Yu Zhang, Qiming Fu, Haiou Li, Weizhong Lu, Hongjie Wu, Yan Cao, Zhengwei Song, and Yijie Ding
- Subjects
Truncation ,Computer science ,QH301-705.5 ,Computer applications to medicine. Medical informatics ,R858-859.7 ,Recurrent neural network ,Biochemistry ,Protein Structure, Secondary ,Nucleic acid secondary structure ,Local optimum ,Structural Biology ,RNA secondary structure prediction ,Weight ,Biology (General) ,Molecular Biology ,Sequence ,Pseudoknots ,Applied Mathematics ,Research ,Matthews correlation coefficient ,Base (topology) ,Computer Science Applications ,Nucleic Acid Conformation ,RNA ,Neural Networks, Computer ,Algorithm ,Algorithms - Abstract
Background RNA secondary structure prediction is an important research topic in the field of biological information. Predicting RNA secondary structure with pseudoknots has been proved to be an NP-hard problem. Traditional machine learning methods cannot effectively use sequence information of different lengths in the prediction process because of the constraints of the models themselves. In addition, there is a large difference between the number of paired bases and the number of unpaired bases in RNA sequences, and this imbalance between positive and negative samples easily makes the model fall into a local optimum. To solve the above problems, this paper proposes a variable-length dynamic bidirectional Gated Recurrent Unit (VLDB GRU) model. The model can accept sequences of different lengths through the introduction of a flag vector. The model can also make full use of the base information before and after the predicted base, avoiding the loss of information due to truncation. Introducing a weight vector that dynamically adjusts the loss function of each base during training solves the sample-imbalance problem. Results The algorithm proposed in this paper is compared with existing algorithms on five representative subsets of the RNA STRAND data set. The experimental results show that the accuracy and Matthews correlation coefficient of the method are improved by 4.7% and 11.4%, respectively. Conclusions The flag vector allows the model to effectively use the information before and after each base in the sequence; the introduced weight vector solves the sample-imbalance problem. Compared with other algorithms, the VLDB GRU algorithm proposed in this paper achieves the best detection results.
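One plausible realisation of the weight vector for the paired/unpaired imbalance is inverse-frequency class weighting of a per-base cross-entropy. This is an assumption for illustration; the abstract does not specify the paper's exact dynamic adjustment scheme.

```python
import numpy as np

def weighted_base_loss(probs, labels):
    """Per-base binary cross-entropy with class weights inversely
    proportional to class frequency (a simple stand-in for the paper's
    weight vector). labels: 1 = paired base, 0 = unpaired base."""
    labels = np.asarray(labels, dtype=float)
    probs = np.clip(np.asarray(probs, dtype=float), 1e-9, 1 - 1e-9)
    n = len(labels)
    n_pos = max(labels.sum(), 1.0)
    w_pos = n / (2.0 * n_pos)                 # up-weights the rarer paired class
    w_neg = n / (2.0 * max(n - n_pos, 1.0))   # down-weights the common class
    w = np.where(labels == 1, w_pos, w_neg)
    return -np.mean(w * (labels * np.log(probs)
                         + (1 - labels) * np.log(1 - probs)))

probs = np.array([0.9, 0.2, 0.8, 0.1, 0.1])   # predicted pairing probabilities
labels = np.array([1, 0, 1, 0, 0])            # only 2 of 5 bases are paired
loss = weighted_base_loss(probs, labels)
print(loss > 0)
```

Errors on the rarer paired class cost more than errors on unpaired bases, which is what keeps a model from collapsing to the majority class.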
- Published
- 2021
32. Prediction of lung cancer using gene expression and deep learning with KL divergence gene selection
- Author
-
Suli Liu and Wu Yao
- Subjects
China ,Deep Learning ,Lung Neoplasms ,Structural Biology ,Applied Mathematics ,Gene Expression ,Humans ,Neural Networks, Computer ,Molecular Biology ,Biochemistry ,Computer Science Applications - Abstract
Background Lung cancer is one of the cancers with the highest mortality rate in China. With the rapid development of high-throughput sequencing technology and the research and application of deep learning methods in recent years, deep neural networks based on gene expression have become a hot research direction in lung cancer diagnosis, providing an effective way of early diagnosis for lung cancer. Thus, building a deep neural network model is of great significance for the early diagnosis of lung cancer. However, the main challenges in mining gene expression datasets are the curse of dimensionality and imbalanced data: because of the overwhelming number of variables measured (genes) versus the small number of samples, existing methods cannot address the problems of high dimensionality and imbalanced data, which results in poor performance in the early diagnosis of lung cancer. Method Given the disadvantages of gene expression data sets, namely small sample sizes, high dimensionality and imbalanced data, this paper proposes a gene selection method based on KL divergence, which selects genes with higher KL divergence as model features. We then build a deep neural network model using focal loss as the loss function and use k-fold cross-validation (with k set to five in this paper) to validate and select the best model. Result The deep learning model based on KL divergence gene selection proposed in this paper has an AUC of 0.99 on the validation set, and the generalization performance of the model is high. Conclusion The deep neural network model based on KL divergence gene selection proposed in this paper is proved to be an accurate and effective method for lung cancer prediction.
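The KL-divergence gene-selection step can be sketched by scoring each gene with the divergence between its class-conditional expression histograms and keeping the top scorers. The bin count and Laplace smoothing below are assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np

def kl_gene_scores(X, y, bins=10):
    """Score each gene (column of X) by KL(P_case || P_control), where both
    are smoothed histograms of that gene's expression in the two classes."""
    scores = []
    for g in range(X.shape[1]):
        lo, hi = X[:, g].min(), X[:, g].max()
        p, _ = np.histogram(X[y == 1, g], bins=bins, range=(lo, hi))
        q, _ = np.histogram(X[y == 0, g], bins=bins, range=(lo, hi))
        p = (p + 1) / (p.sum() + bins)        # Laplace smoothing -> distribution
        q = (q + 1) / (q.sum() + bins)
        scores.append(np.sum(p * np.log(p / q)))
    return np.array(scores)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # 100 samples, 5 "genes"
y = rng.integers(0, 2, 100)
X[y == 1, 0] += 3.0                           # gene 0 separates the classes
print(kl_gene_scores(X, y).argmax())          # 0
```

Genes whose distribution barely differs between classes score near zero, so ranking by this score and keeping the top k directly reduces dimensionality before the network is trained.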
- Published
- 2021
33. Quantitative prediction model for affinity of drug-target interactions based on molecular vibrations and overall system of ligand-receptor
- Author
-
Yun Wang, Ting Ting Cao, Xian Rui Wang, Xue Mei Tian, and Cong Min Jia
- Subjects
Quantitative structure–activity relationship ,QH301-705.5 ,Computer science ,Computer applications to medicine. Medical informatics ,Drug target ,R858-859.7 ,Chemical composition ,Quantitative Structure-Activity Relationship ,Drug–target interactions ,Ligands ,Biochemistry ,Vibration ,Whole systems ,Structural Biology ,Molecular descriptor ,Drug–target affinity ,Biology (General) ,Molecular Biology ,Applied Mathematics ,Molecular vibrations ,Ligand (biochemistry) ,Computer Science Applications ,Random forest ,Molecular Docking Simulation ,Pharmaceutical Preparations ,Molecular vibration ,Biological system ,Research Article - Abstract
Background: The study of drug-target interaction (DTI) affinity plays an important role in safety assessment and pharmacology. Currently, quantitative structure-activity relationship (QSAR) and molecular docking (MD) are the most common methods in DTI affinity research. However, they are often built for one specific target or several targets, and most QSAR and MD models are based only on the structure of the drug molecules or only on the structure of the targets, with low accuracy and a small scope of application. How to construct quantitative prediction models with both high accuracy and wide applicability remains a challenge. To this end, this paper screened molecular descriptors based on molecular vibrations and took the molecule-target pair as a whole system to construct prediction models based on Kd and EC50, providing a reference for quantifying DTI affinity. Methods: Through parametric characterization based on molecular vibrations and protein sequences, taking the molecule-target pair as a whole system, and performing feature selection on the drug molecule-target features, we constructed feature datasets of DTIs quantified by Kd and EC50, respectively. Prediction models were then constructed using these datasets with SVM, RF and ANN, and the optimal models were selected for application evaluation and comprehensive comparison. Results: Under ten-fold cross-validation, the evaluation parameters of the RF model for the EC50 dataset are as follows: R2 of the training and test sets are 0.9611 and 0.9641; MSE of the training and test sets are 0.0891 and 0.0817. For the Kd dataset: R2 of the training and test sets are 0.9425 and 0.9485; MSE of the training and test sets are 0.1208 and 0.1191. After comprehensive comparison, the RF model proved to be the optimal model. In the application evaluation of the RF model, the errors of most prediction results were in the range of 1.5-2.0. Conclusion: By screening molecular descriptors based on molecular vibrations and taking the molecule-target pair as a whole system, we obtained an optimal RF-based model that is more accurate and more widely applicable, which indicates that selecting molecular descriptors associated with molecular vibrations and using the molecule-target pair as a whole system are reliable ways to improve model performance. The model can provide a reference for quantifying DTI affinity.
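The ten-fold cross-validation protocol with per-fold R2 and MSE can be sketched as follows. A linear least-squares model stands in for the paper's random forest, and the synthetic features/targets are illustrative assumptions, so the numbers have no relation to the reported ones.

```python
import numpy as np

def ten_fold_cv(X, y, k=10, seed=0):
    """k-fold CV returning mean R^2 and mean MSE over folds.
    Model: ordinary least squares (stand-in for a random forest)."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    r2s, mses = [], []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)  # fit on k-1 folds
        pred = X[test] @ w                                       # score on held-out fold
        mse = np.mean((y[test] - pred) ** 2)
        r2s.append(1 - mse / np.var(y[test]))
        mses.append(mse)
    return np.mean(r2s), np.mean(mses)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))                       # toy descriptor matrix
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=200)
r2, mse = ten_fold_cv(X, y)
print(r2 > 0.9)
```

Reporting the mean over folds, as done here, is what the abstract's training/test R2 and MSE figures summarize.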
- Published
- 2021
34. Adverse drug reaction detection via a multihop self-attention mechanism
- Author
-
Yijia Zhang, Hongfei Lin, Liang Yang, Yuqi Ren, Jian Wang, Zhihao Yang, Bo Xu, and Tongxuan Zhang
- Subjects
Complex semantic information ,Drug-Related Side Effects and Adverse Reactions ,Generalization ,Computer science ,Adverse drug reactions ,02 engineering and technology ,computer.software_genre ,lcsh:Computer applications to medicine. Medical informatics ,Biochemistry ,Task (project management) ,03 medical and health sciences ,Structural Biology ,0202 electrical engineering, electronic engineering, information engineering ,medicine ,Humans ,Attention ,Molecular Biology ,lcsh:QH301-705.5 ,030304 developmental biology ,0303 health sciences ,Artificial neural network ,business.industry ,Mechanism (biology) ,Applied Mathematics ,Unstructured data ,medicine.disease ,Neural network ,Semantics ,Computer Science Applications ,Focus (linguistics) ,lcsh:Biology (General) ,Multihop self-attention mechanism ,lcsh:R858-859.7 ,020201 artificial intelligence & image processing ,Neural Networks, Computer ,Artificial intelligence ,business ,computer ,Natural language processing ,Sentence ,Adverse drug reaction ,Research Article - Abstract
Background The adverse reactions that are caused by drugs are potentially life-threatening problems. Comprehensive knowledge of adverse drug reactions (ADRs) can reduce their detrimental impacts on patients. Detecting ADRs through clinical trials takes a large number of experiments and a long period of time. With the growing amount of unstructured textual data, such as biomedical literature and electronic records, detecting ADRs in the available unstructured data has important implications for ADR research. Most neural network-based methods typically focus on the simple semantic information of sentence sequences; however, the relationship between the two entities depends on more complex semantic information. Methods In this paper, we propose a multihop self-attention mechanism (MSAM) model that aims to learn the multi-aspect semantic information for the ADR detection task. First, the contextual information of the sentence is captured by using the bidirectional long short-term memory (Bi-LSTM) model. Then, by applying multiple steps of an attention mechanism, multiple semantic representations of a sentence are generated. Each attention step obtains a different attention distribution focusing on different segments of the sentence. Meanwhile, our model locates and enhances various keywords from the multiple representations of a sentence. Results Our model was evaluated using two ADR corpora and shows a stable generalization ability. Via extensive experiments, our model achieved F-measures of 0.853, 0.799 and 0.851 for ADR detection on TwiMed-PubMed, TwiMed-Twitter, and ADE, respectively. The experimental results showed that our model significantly outperforms other compared models for ADR detection. Conclusions In this paper, we propose a multihop self-attention mechanism (MSAM) model for the ADR detection task. The proposed method significantly improved the learning of the complex semantic information of sentences.
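The multiple-attention-step idea can be sketched as one attention distribution per hop over the encoder's token states, each hop producing its own sentence representation. The Bi-LSTM encoder and learned query parameters are replaced by random stand-ins here, so this only illustrates the mechanism, not the trained model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multihop_attention(H, n_hops=3, seed=0):
    """H: (T, d) token states from an encoder. Each hop uses its own query
    vector, yielding a distinct attention distribution over the T positions
    and a distinct weighted sentence representation."""
    T, d = H.shape
    rng = np.random.default_rng(seed)
    reps = []
    for _ in range(n_hops):
        q = rng.normal(size=d)            # per-hop query (learned in practice)
        alpha = softmax(H @ q)            # attention over the T token positions
        reps.append(alpha @ H)            # hop-specific sentence vector
    return np.stack(reps)                 # (n_hops, d)

H = np.random.default_rng(2).normal(size=(7, 16))   # 7 tokens, 16-dim states
M = multihop_attention(H)
print(M.shape)   # (3, 16)
```

A classifier would then consume the concatenated hop representations, letting different hops focus on, e.g., the drug mention versus the reaction mention.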
- Published
- 2019
35. Enhancing ontology-driven diagnostic reasoning with a symptom-dependency-aware Naïve Bayes classifier
- Author
-
Yaliang Li, Min Yang, Buzhou Tang, Ying Shen, and Hai-Tao Zheng
- Subjects
Dependency (UML) ,Computer science ,Knowledge Bases ,Diagnostic reasoning ,Disease ,Ontology (information science) ,Machine learning ,computer.software_genre ,Uncertainty reasoning ,lcsh:Computer applications to medicine. Medical informatics ,Biochemistry ,03 medical and health sciences ,Naive Bayes classifier ,0302 clinical medicine ,Structural Biology ,Electronic Health Records ,Humans ,naïve Bayes classifier ,Molecular Biology ,lcsh:QH301-705.5 ,Diagnostic Techniques and Procedures ,030304 developmental biology ,Probability ,0303 health sciences ,business.industry ,Ontology ,Applied Mathematics ,Medical record ,Conditional probability ,Bayes Theorem ,Computer Science Applications ,ROC Curve ,lcsh:Biology (General) ,030220 oncology & carcinogenesis ,Area Under Curve ,lcsh:R858-859.7 ,Artificial intelligence ,business ,computer ,Algorithms ,Research Article - Abstract
Background Ontology has attracted substantial attention from both academia and industry. Handling uncertainty in reasoning is important in ontology research. For example, when a patient is suffering from cirrhosis, the appearance of abdominal vein varices is four times more likely than the presence of a bitter taste. Such medical knowledge is crucial for decision-making in various medical applications but is missing from existing medical ontologies. In this paper, we aim to discover medical knowledge probabilities from electronic medical record (EMR) texts to enrich ontologies. First, we build an ontology by identifying meaningful entity mentions from EMRs. Then, we propose a symptom-dependency-aware naïve Bayes classifier (SDNB) that is based on the assumption that there is a level of dependency among symptoms. To ensure the accuracy of the diagnostic classification, we incorporate the probability of a disease into the ontology via innovative approaches. Results We conduct a series of experiments to evaluate whether the proposed method can discover meaningful and accurate probabilities for medical knowledge. Based on over 30,000 deidentified medical records, we explore 336 abdominal diseases and 81 related symptoms. Among these 336 gastrointestinal diseases, the probabilities of 31 diseases are obtained via our method. These 31 disease probabilities and 189 conditional probabilities between diseases and symptoms are added into the generated ontology. Conclusion In this paper, we propose a medical knowledge probability discovery method that is based on the analysis and extraction of EMR text data for enriching a medical ontology with probability information. The experimental results demonstrate that the proposed method can effectively identify accurate medical knowledge probability information from EMR data. 
In addition, the proposed method can efficiently and accurately calculate the probability of a patient suffering from a specified disease, thereby demonstrating the advantage of combining an ontology and a symptom-dependency-aware naïve Bayes classifier.
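The disease priors and disease-symptom conditionals mined from EMRs plug into a Bayes-style posterior as sketched below. The numbers are toy values loosely echoing the cirrhosis example above, and the paper's symptom-dependency terms are omitted, so this is the plain naïve Bayes baseline rather than SDNB itself.

```python
def diagnose(symptoms, prior, cond):
    """Naive-Bayes posterior over diseases given observed symptoms:
    P(d | s1..sk) proportional to P(d) * prod_i P(si | d)."""
    post = {}
    for d in prior:
        p = prior[d]
        for s in symptoms:
            p *= cond[d].get(s, 1e-4)     # unseen symptom -> small probability
        post[d] = p
    z = sum(post.values())                # normalize over the candidate diseases
    return {d: p / z for d, p in post.items()}

# Toy probabilities of the kind the paper extracts from EMR text.
prior = {"cirrhosis": 0.02, "gastritis": 0.10}
cond = {"cirrhosis": {"abdominal vein varices": 0.4, "bitter taste": 0.1},
        "gastritis": {"abdominal vein varices": 0.01, "bitter taste": 0.3}}
post = diagnose(["abdominal vein varices"], prior, cond)
print(post["cirrhosis"] > post["gastritis"])   # True
```

The dependency-aware variant would replace the independent product with terms conditioned on co-occurring symptoms.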
- Published
- 2019
36. Knowledge-guided convolutional networks for chemical-disease relation extraction
- Author
-
Lei Du, Zhuang Liu, Yingyu Lin, Chengkun Lang, Huiwei Zhou, and Shixian Ning
- Subjects
FOS: Computer and information sciences ,Relation (database) ,Drug-Related Side Effects and Adverse Reactions ,Computer science ,Knowledge Bases ,Attention mechanism ,Context (language use) ,Machine learning ,computer.software_genre ,lcsh:Computer applications to medicine. Medical informatics ,Biochemistry ,03 medical and health sciences ,0302 clinical medicine ,Structural Biology ,Leverage (statistics) ,Data Mining ,Humans ,Disease ,Molecular Biology ,lcsh:QH301-705.5 ,030304 developmental biology ,0303 health sciences ,Computer Science - Computation and Language ,business.industry ,Applied Mathematics ,Context features ,CDR extraction ,Relationship extraction ,Computer Science Applications ,Knowledge representations ,lcsh:Biology (General) ,030220 oncology & carcinogenesis ,lcsh:R858-859.7 ,Artificial intelligence ,Gating units ,business ,Computation and Language (cs.CL) ,computer ,Research Article - Abstract
Background: Automatic extraction of chemical-disease relations (CDR) from unstructured text is of essential importance for disease treatment and drug development. Meanwhile, biomedical experts have built many highly-structured knowledge bases (KBs), which contain prior knowledge about chemicals and diseases. Prior knowledge provides strong support for CDR extraction. How to make full use of it is worth studying. Results: This paper proposes a novel model called "Knowledge-guided Convolutional Networks (KCN)" to leverage prior knowledge for CDR extraction. The proposed model first learns knowledge representations including entity embeddings and relation embeddings from KBs. Then, entity embeddings are used to control the propagation of context features towards a chemical-disease pair with gated convolutions. After that, relation embeddings are employed to further capture the weighted context features by a shared attention pooling. Finally, the weighted context features containing additional knowledge information are used for CDR extraction. Experiments on the BioCreative V CDR dataset show that the proposed KCN achieves 71.28% F1-score, which outperforms most of the state-of-the-art systems. Conclusions: This paper proposes a novel CDR extraction model KCN to make full use of prior knowledge. Experimental results demonstrate that KCN could effectively integrate prior knowledge and contexts for the performance improvement., Published on BMC Bioinformatics, 16 pages, 5 figures
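The entity-embedding gating of context features can be sketched as a sigmoid gate, computed from the chemical-disease pair embedding, that scales each convolutional feature channel. The projection matrix here is a random stand-in for a learned parameter, and the relation-embedding attention pooling of the full KCN is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def knowledge_gated_features(C, e, seed=0):
    """C: (T, d) context feature maps from a convolution over the sentence.
    e: embedding of the chemical-disease pair from a knowledge base.
    A gate g in (0,1)^d, derived from e, controls feature propagation."""
    T, d = C.shape
    rng = np.random.default_rng(seed)
    Wg = rng.normal(size=(e.shape[0], d)) * 0.1   # stand-in learned projection
    g = sigmoid(e @ Wg)                            # knowledge-derived gate
    return C * g                                   # gated context features

C = np.random.default_rng(3).normal(size=(9, 8))  # 9 positions, 8 channels
e = np.random.default_rng(4).normal(size=16)      # chemical + disease embedding
F = knowledge_gated_features(C, e)
print(F.shape)   # (9, 8)
```

Channels the knowledge base deems irrelevant for this entity pair are damped toward zero before pooling and classification.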
- Published
- 2019
37. Linking entities through an ontology using word embeddings and syntactic re-ranking
- Author
-
İlknur Karadeniz and Arzucan Özgür
- Subjects
Normalization (statistics) ,Drug-Related Side Effects and Adverse Reactions ,Text mining ,Computer science ,Adverse drug reactions ,Bacteria biotopes ,computer.software_genre ,lcsh:Computer applications to medicine. Medical informatics ,Biochemistry ,03 medical and health sciences ,Entity linking ,0302 clinical medicine ,Named-entity recognition ,Named entity normalization ,Structural Biology ,Data Mining ,Molecular Biology ,lcsh:QH301-705.5 ,030304 developmental biology ,0303 health sciences ,Parsing ,Bacteria ,business.industry ,Entity categorization ,Applied Mathematics ,Natural language processing ,Reference Standards ,Syntax ,Biomedical text mining ,Computer Science Applications ,Semantics ,Named entity ,lcsh:Biology (General) ,030220 oncology & carcinogenesis ,Word embeddings ,Ontology ,lcsh:R858-859.7 ,Artificial intelligence ,business ,computer ,Algorithms ,Software ,Research Article - Abstract
Background Although there is an enormous number of textual resources in the biomedical domain, manually curated resources currently cover only a small part of the existing knowledge. The vast majority of this information is in unstructured form and contains nonstandard naming conventions. The task of named entity recognition, which is the identification of entity names from text, is not adequate without a standardization step. Linking each identified entity mention in text to an ontology/dictionary concept is an essential task for making sense of the identified entities. This paper presents an unsupervised approach for linking named entities to concepts in an ontology/dictionary. We propose an approach for the normalization of biomedical entities through an ontology/dictionary that uses word embeddings to represent semantic spaces and a syntactic parser to give higher weight to the most informative word in the named entity mentions. Results We applied the proposed method to two different normalization tasks: the normalization of bacteria biotope entities through the OntoBiotope ontology and the normalization of adverse drug reaction entities through the Medical Dictionary for Regulatory Activities (MedDRA). The proposed method achieved a precision score of 65.9%, which is 2.9 percentage points above the state-of-the-art result on the BioNLP Shared Task 2016 Bacteria Biotope test data, and a macro-averaged precision score of 68.7% on the Text Analysis Conference 2017 Adverse Drug Reaction test data. Conclusions The core contribution of this paper is a syntax-based way of combining the individual word vectors to form vectors for the named entity mentions and ontology concepts, which can then be used to measure the similarity between them. The proposed approach is unsupervised and does not require labeled data, making it easily applicable to different domains.
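The syntax-weighted mention vectors and cosine-similarity linking can be sketched as follows. The 3-dimensional "embeddings", the tiny vocabulary, and the head weight of 2.0 are invented for illustration; in the paper the head word comes from a syntactic parser and the vectors from trained word embeddings.

```python
import numpy as np

def mention_vector(words, vecs, head, head_weight=2.0):
    """Weighted average of a mention's word vectors, up-weighting the
    syntactic head word (head and weight are fixed by hand in this sketch)."""
    ws = np.array([head_weight if w == head else 1.0 for w in words])
    M = np.stack([vecs[w] for w in words])
    return (ws[:, None] * M).sum(axis=0) / ws.sum()

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-d "embeddings"; linking picks the concept with highest cosine.
vecs = {"human": np.array([1.0, 0.1, 0.0]), "gut": np.array([0.0, 1.0, 0.2]),
        "intestine": np.array([0.1, 0.9, 0.3]), "skin": np.array([0.0, 0.1, 1.0])}
m = mention_vector(["human", "gut"], vecs, head="gut")
concepts = {"intestine": vecs["intestine"], "skin": vecs["skin"]}
best = max(concepts, key=lambda c: cosine(m, concepts[c]))
print(best)   # intestine
```

Up-weighting the head ("gut") keeps the modifier ("human") from pulling the mention vector toward unrelated concepts, which is the intuition behind the syntax-based combination.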
- Published
- 2019
38. Proceedings of the 2018 MidSouth Computational Biology and Bioinformatics Society (MCBIOS) conference
- Author
-
Jonathan D. Wren, Robert J. Doerkson, Prashanti Manda, Inimary T. Toby, Shraddha Thakkar, Bindu Nanduri, and Ramin Homayouni
- Subjects
Biomedical knowledge ,MCBIOS ,Bioinformatics ,media_common.quotation_subject ,Computational biology ,lcsh:Computer applications to medicine. Medical informatics ,History, 21st Century ,Biochemistry ,03 medical and health sciences ,0302 clinical medicine ,Cyberinfrastructure ,Structural Biology ,Excellence ,Humans ,Molecular Biology ,lcsh:QH301-705.5 ,030304 developmental biology ,media_common ,Panel discussion ,Introduction ,0303 health sciences ,Systems Biology ,Applied Mathematics ,ISCB ,Conferences ,Computational Biology ,Plenary session ,Assistant professor ,3. Good health ,Computer Science Applications ,R package ,lcsh:Biology (General) ,030220 oncology & carcinogenesis ,lcsh:R858-859.7 ,Career development - Abstract
The XVth Annual MidSouth Computational Biology and Bioinformatics Society (MCBIOS XV) conference was held in Starkville, MS from March 29–31, 2018 at Mississippi State University (MSU) within the Mill Conference Center. MSU had previously hosted the conference (MCBIOS VI) in 2009. The theme of MCBIOS XV was “Genomics and Big Data”. The co-chairs and conference hosts were Drs. Bindu Nanduri, Andy Perkins, and Daniel G. Peterson from MSU. The program was co-chaired by Dr. Shraddha Thakkar from the National Center for Toxicological Research (NCTR) within the US Food and Drug Administration (FDA), Dr. Mary Yang, from University of Arkansas at Little Rock (UALR) and Dr. Prashanti Manda from University of North Carolina at Greensboro. The conference was attended by 183 registered participants, of these, 73 registered participants were in the professional category, while 13 were postdoctoral fellows and 97 were student participants. A total of 157 abstracts were submitted for MCBIOS XV, including 65 oral presentations and 92 poster presentations at the meeting. There were nine breakout sessions conducted during the meeting. Each breakout session included a presentation by a featured speaker, a renowned scientist in the topic of that session, followed by four additional presentations in that area. Dr. Cesar M. Compadre, from University of Arkansas for Medical Sciences served as the finance coordinator for the conference. Dr. Ping Gong, at the US Army Engineer Research and Development Center, Vicksburg served as the coordinator of Young Scientist Research Excellence Award. Dr. George Popescu from the Institute of Genomics, Biocomputing and Biotechnology (IGBB) of MSU served as the poster session coordinator, and Dr. William S. Sanders from The Jackson Laboratory served as the workshop coordinator. For 2019–20, Dr. Weida Tong, Director of Division of Bioinformatics and Biostatics from NCTR/FDA was chosen as the President-Elect and Dr. 
Ramin Homayouni from University of Memphis as the President. Keynote speakers for MCBIOS 2018 Keynote Session I: “Next-Gen Data Science”, Russ Wolfinger, Ph.D., Director of Scientific Discovery and Genomics, JMP Life Sciences, SAS Institute, Cary, NC Keynote Session II: “Real World Data and Precision Medicine: Treatment Selection and Dose Optimization Strategies”, Lawrence J. Lesko Ph.D., F.C.P., University of Florida, Orlando, FL Keynote Session III: “No-Boundary Thinking: Defining Problems So Their Solutions Matter”, Steve Jennings, Ph.D., UALR Keynote Session IV: “Informatics Tools for Big Biologicals and Small Drug Molecules”, William J Welsh, Ph.D., Norman H. Edelman Professor in Bioinformatics, Rutgers University, New Brunswick, NJ Keynote Session V: “A decade of MAQC effort and its contribution to our understanding of high-throughput genomics technologies”, Weida Tong, Ph.D., Director, Division of Bioinformatics and Biostatistics, NCTR/FDA The conference program included three workshops: Workshop I: “Advanced Data Analytics using JMP Genomics”, Wenjun Bao, Ph.D., JMP Life Sciences, SAS Institute, Cary, NC. Workshop II: “Career Development Workshop for Young Scientist”, Inimary Toby, Ph.D., University of Dallas, Dallas, TX. Workshop III: “MCBIOS and No-Boundary Thinking Joint Bioinformatics Research workshop”. Session chair - Steve Jennings, Ph.D., UALR. Presentations: “Encoding biomedical knowledge using hetnets”, Daniel Himmelstein, Ph.D., University of Pennsylvania, Philadelphia, PN. “Microbial interactions and microbe-host interactions”, Hongmei Jiang Ph.D., Northwestern University, Evanston, IL. “Evolution as a metaphor for No Boundary Thinking”, Scott M. Williams, Ph.D., Case Western Reserve University, Cleveland, Ohio. Panel Discussion: Joan Peckham, Ph.D., University of Rhode Island Xiuzhen Huang, Ph.D., Arkansas State University Scott M. 
Williams, Ph.D., Case Western Reserve University Hongmei Jiang, Ph.D., Northwestern University Daniel Himmelstein, Ph.D., University of Pennsylvania In addition to the workshops, MCBIOS provided assistance to students in preparing their resume. A one-on-one resume clinic was conducted by Gladys Awosemo, HRP, Baylor Scott and White Health, TX. Breakout Session I: Plant Omics I. Session Chair – Sorina C. Popescu, Ph.D., Assistant Professor, MSU. Featured speaker: Marilyn Warburton, Ph.D., United States Department of Agriculture, Agriculture Research Services, MS, “A pathway-based method to interpret GWAS results”. Breakout Session II: Next generation tools for environment and health research. Session Chair and featured speaker: Natalia Reyero, Ph.D., US Army Engineer Research and Development Center, Vicksburg MS, “Next Generation Tools for Environmental Research”. Breakout Session III: Drug Discovery and Precision Medicine. Session Chair and featured speaker: Robert J. Doerksen, Ph.D., University of Mississippi (UM), Oxford, MS, “Protein structure-based virtual screening: deep learning for precision medicine”. Breakout Session IV: Breakout Session IV: Plant Omics II. Session Chair – Sorina C. Popescu, Ph.D., MSU. Featured Speaker: Tessa Burch-Smith, Ph.D., University of Tennessee, Knoxville, TN, “Focused ion beam-scanning electron microscopy for three-dimensional modelling of cellular ultrastructure”. Breakout Session V: Transcriptomics and Genome Sequencing. Session Chair and featured speaker: Brian Counterman, Ph.D., MSU, “Patternize: an R package for color pattern variation”. Breakout Session VI: Big Data and Risk Assessment. Session Chair – Minjun Chen, Ph.D., NCTR/FDA. Featured Speaker: William Mattes, Ph.D., NCTR/FDA, “Systems Biology and Big Data: Little Mitochondria as a Big Example”. Breakout Session VII: Genomics and Proteomics application. Session Chair - Zhichao Liu, Ph.D., NCTR/FDA. 
Featured speaker: Rakesh Kaundal, Ph.D., Utah State University, Logan, UT, “Complete genome sequence of Pythium brassicum P1, an oomycete root pathogen: insights into its host specificity to Brassicaceae”. Breakout Session VIII: Genomics and Infectious Disease. Session Chair and featured speaker: Stephen Pruett, Ph.D., MSU, “Machine Learning Analysis of the Relationship between Changes in Immunological Parameters and Changes in Resistance to Listeria monocytogenes: A New Approach for Risk Assessment and Systems Immunology”. Breakout Session IX: MCBIOS Group Projects. Shraddha Thakkar, Ph.D., NCTR/FDA; William Sanders, Ph.D., IT Research Cyberinfrastructure, The Jackson Laboratory, Bar Harbor, ME. Best Paper Award, MCBIOS 2018: Phillip Berg et al., “Evaluation of Linear Models and Missing Value Imputation for the Analysis of Peptide-Centric Proteomics” [1]. Best Paper Runner-up, MCBIOS 2018: Bohu Pan et al., “Similarities and differences between variants called with human reference genome HG19 or HG38” [2]. This was the second year of the “MCBIOS Young Scientist Excellence Award”, which recognizes students and postdoctoral fellows who exhibit scientific excellence in the field of bioinformatics. Students and postdoctoral fellows went through a rigorous award application process with both internal and external judges. The top five candidates were selected to present during the opening session on March 29th. To compete, applicants submitted an extended abstract with a description of the innovation of their research and their specific contribution to the work presented, from which the quality and impact of the research were judged. Initiative in expanding their skills and bringing multidisciplinary talent to their project was an important consideration for selection for an oral presentation, and the quality of the presentation during the plenary session was the primary consideration for the award.
Additional evaluation criteria included creativity, dedication and multidisciplinary contribution. This award was supported by the FDA grant to MCBIOS (5R13FD005931–03) and by JMP Life Sciences. MCBIOS Young Scientist Excellence Award 2018, post-doctoral winners: First Place: Sundar Thangapandian, Ph.D., University of Illinois Urbana-Champaign, Urbana, IL, “Quantitative Target-specific Toxicity Prediction Model (QTTPM): A Novel Computational Toxicology Approach Integrating Molecular Dynamics Simulation and Machine Learning”. Second Place: Brian Walker, Ph.D., UALR, “Synthesis of xanthine derivatives for the inhibition of PARG”. Third Place: Darshan Mehta, Ph.D., NCTR/FDA, “Mining pharmacogenomic information from drug labeling using FDALabel database for advancing precision medicine”.
- Published
- 2019
39. Coding Prony’s method in MATLAB and applying it to biomedical signal filtering
- Author
-
A. Fernández Rodríguez, J. M. Rodríguez Ascariz, L. de Santiago Rodrigo, J. M. Miguel Jiménez, E. López Guillén, Luciano Boquete, and Universidad de Alcalá. Departamento de Electrónica
- Subjects
Male ,Polynomial ,Total least squares ,Computer science ,Prony's method ,Prony’s method ,02 engineering and technology ,Biochemistry ,Least squares ,0302 clinical medicine ,Structural Biology ,0202 electrical engineering, electronic engineering, information engineering ,Linear combination ,MATLAB ,lcsh:QH301-705.5 ,computer.programming_language ,Fourier Analysis ,Applied Mathematics ,Computer Science Applications ,Exponential function ,Function approximation ,Matrix pencil ,lcsh:R858-859.7 ,Female ,Electrónica ,Algorithm ,Algorithms ,Adult ,lcsh:Computer applications to medicine. Medical informatics ,Discrete Fourier transform ,Multiple sclerosis ,Young Adult ,03 medical and health sciences ,Humans ,Least-Squares Analysis ,Molecular Biology ,020206 networking & telecommunications ,Filter (signal processing) ,lcsh:Biology (General) ,Prony's method ,Evoked Potentials, Visual ,Programming Languages ,Electronics ,Multifocal evoked visual potentials ,computer ,Software ,030217 neurology & neurosurgery
Background: The response of many biomedical systems can be modelled using a linear combination of damped exponential functions. The approximation parameters, based on equally spaced samples, can be obtained using Prony's method and its variants (e.g. the matrix pencil method). This paper provides a tutorial on the main polynomial Prony and matrix pencil methods and their implementation in MATLAB, and analyses how they perform with synthetic and multifocal visual-evoked potential (mfVEP) signals. The paper briefly describes the theoretical basis of four polynomial Prony approximation methods: classic, least squares (LS), total least squares (TLS) and the matrix pencil method (MPM). In each case, the implementation uses general MATLAB functions. The features of the various options are tested by approximating a set of synthetic mathematical functions and by evaluating filtering performance in the Prony domain when applied to mfVEP signals to improve diagnosis of patients with multiple sclerosis (MS). Results: The code implemented does not achieve 100%-correct signal approximation and, of the methods tested, LS and MPM perform best. When filtering mfVEP records in the Prony domain, the area under the receiver-operating-characteristic (ROC) curve is 0.7055, compared with 0.6538 obtained with the usual filtering method for this type of signal (a discrete Fourier transform low-pass filter with a cut-off frequency of 35 Hz). Conclusions: This paper reviews Prony's method in relation to signal filtering and approximation, provides the MATLAB code needed to implement the classic, LS, TLS and MPM methods, and tests their performance in biomedical signal filtering and function approximation. It emphasizes the importance of improving the computational methods used to implement the methods described above. Funding: Universidad de Alcalá; Agencia Estatal de Investigación.
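The classic polynomial variant described in the abstract reduces to three linear-algebra steps: a linear-prediction solve, rooting of the characteristic polynomial, and a Vandermonde least-squares solve for the amplitudes. A minimal NumPy sketch of that pipeline (not the authors' MATLAB code; the `prony` helper and its interface are illustrative):

```python
import numpy as np

def prony(x, p):
    """Classic polynomial Prony fit of p damped exponentials to samples x.

    Returns poles z and amplitudes h such that x[n] ~ sum_k h[k] * z[k]**n.
    """
    x = np.asarray(x, dtype=complex)
    N = len(x)
    # 1) Linear prediction: x[n] = -a1*x[n-1] - ... - ap*x[n-p] for n >= p.
    T = np.column_stack([x[p - 1 - k:N - 1 - k] for k in range(p)])
    a = np.linalg.lstsq(T, -x[p:], rcond=None)[0]
    # 2) Poles are the roots of the characteristic polynomial.
    z = np.roots(np.concatenate(([1.0], a)))
    # 3) Amplitudes from the (N x p) Vandermonde system V h = x.
    V = np.vander(z, N, increasing=True).T
    h = np.linalg.lstsq(V, x, rcond=None)[0]
    return z, h
```

The LS, TLS and MPM variants the paper compares differ mainly in how step 1 is solved (ordinary vs. total least squares, or a generalized-eigenvalue matrix pencil instead of polynomial rooting).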
- Published
- 2018
40. Juxtapose: a gene-embedding approach for comparing co-expression networks
- Author
-
B. Frank Eames, Farhad Maleki, Katie Ovens, and Ian McQuillan
- Subjects
Word embedding ,Evolution ,Computer science ,lcsh:Computer applications to medicine. Medical informatics ,computer.software_genre ,Gene co-expression networks ,Biochemistry ,Set (abstract data type) ,03 medical and health sciences ,0302 clinical medicine ,Similarity (network science) ,Structural Biology ,Machine learning ,Gene Regulatory Networks ,Word2vec ,Transcriptomics ,lcsh:QH301-705.5 ,Molecular Biology ,Blossom algorithm ,030304 developmental biology ,0303 health sciences ,Methodology Article ,Applied Mathematics ,Computational Biology ,Expression (mathematics) ,Computer Science Applications ,lcsh:Biology (General) ,lcsh:R858-859.7 ,Embedding ,Data mining ,DNA microarray ,computer ,Algorithms ,Software ,030217 neurology & neurosurgery - Abstract
Background Gene co-expression networks (GCNs) are not easily comparable due to their complex structure. In this paper, we propose a tool, Juxtapose, together with similarity measures that can be utilized for comparative transcriptomics between a set of organisms. While we focus on its application to comparing co-expression networks across species in evolutionary studies, Juxtapose is also generalizable to co-expression network comparisons across tissues or conditions within the same species. Methods A word embedding strategy commonly used in natural language processing was utilized in order to generate gene embeddings based on walks made throughout the GCNs. Juxtapose was evaluated based on its ability to embed the nodes of synthetic structures in the networks consistently while also generating biologically informative results. Evaluation of the techniques proposed in this research utilized RNA-seq datasets from GTEx, a multi-species experiment of prefrontal cortex samples from the Gene Expression Omnibus, as well as synthesized datasets. Biological evaluation was performed using gene set enrichment analysis and known gene relationships in literature. Results We show that Juxtapose is capable of globally aligning synthesized networks as well as identifying areas that are conserved in real gene co-expression networks without reliance on external biological information. Furthermore, output from a matching algorithm that uses cosine distance between GCN embeddings is shown to be an informative measure of similarity that reflects the amount of topological similarity between networks. Conclusions Juxtapose can be used to align GCNs without relying on known biological similarities and enables post-hoc analyses using biological parameters, such as orthology of genes, or conserved or variable pathways. Availability A development version of the software used in this paper is available at https://github.com/klovens/juxtapose
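The walk-then-embed idea can be illustrated without word2vec itself. The sketch below substitutes an SVD of the walk co-occurrence matrix for word2vec training (a common stand-in for skip-gram embeddings); the toy graph, walk parameters and the `random_walks`/`embed`/`cosine` helpers are illustrative, not Juxtapose's API:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_walks(adj, walk_len=8, walks_per_node=30):
    """Uniform random walks over an adjacency-list graph (dict: node -> neighbours)."""
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk, node = [start], start
            for _ in range(walk_len - 1):
                node = rng.choice(adj[node])
                walk.append(node)
            walks.append(walk)
    return walks

def embed(walks, nodes, dim=2, window=2):
    """SVD of the windowed walk co-occurrence matrix: a stand-in for word2vec."""
    idx = {g: i for i, g in enumerate(nodes)}
    C = np.zeros((len(nodes), len(nodes)))
    for walk in walks:
        for i, g in enumerate(walk):
            for j in range(max(0, i - window), min(len(walk), i + window + 1)):
                if i != j:
                    C[idx[g], idx[walk[j]]] += 1
    U, s, _ = np.linalg.svd(np.log1p(C))
    return {g: U[idx[g], :dim] * s[:dim] for g in nodes}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

With two densely connected clusters joined by a bridge edge, genes in the same cluster end up with higher cosine similarity than genes in different clusters, which is the property the paper's matching step exploits.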
- Published
- 2021
41. PartSeg: a tool for quantitative feature extraction from 3D microscopy images for dummies
- Author
-
Dariusz Plewczynski, Grzegorz Bokota, Nirmal Das, Pawel Trzaskoma, Agnieszka Grabowska, Adriana Magalska, Yana Yushkevich, Jacek Sroka, and Subhadip Basu
- Subjects
Interface (Java) ,Computer science ,Feature extraction ,Batch processing ,lcsh:Computer applications to medicine. Medical informatics ,computer.software_genre ,Biochemistry ,Nucleus ,03 medical and health sciences ,Imaging, Three-Dimensional ,Segmentation ,0302 clinical medicine ,Structural Biology ,Image Processing, Computer-Assisted ,Electron microscopy ,3D reconstruction ,Super-resolution microscopy ,lcsh:QH301-705.5 ,Molecular Biology ,030304 developmental biology ,Cell Nucleus ,Microscopy ,0303 health sciences ,3D FISH ,business.industry ,Applied Mathematics ,Pattern recognition ,Bioimaging ,Chromatin ,Computer Science Applications ,Visualization ,lcsh:Biology (General) ,lcsh:R858-859.7 ,Artificial intelligence ,Data mining ,Focus (optics) ,business ,computer ,Algorithms ,Software ,030217 neurology & neurosurgery - Abstract
Background Bioimaging techniques offer a robust tool for studying molecular pathways and morphological phenotypes of cell populations subjected to various conditions. As modern high-resolution 3D microscopy provides access to an ever-increasing amount of high-quality images, there arises a need to analyse them in an automated, unbiased and simple way. Segmentation of structures within the cell nucleus, which is the focus of this paper, presents a new layer of complexity in the form of dense packing and significant signal overlap. At the same time, the available segmentation tools present a steep learning curve for new users with limited technical background. This is especially apparent in bulk processing of image sets, which requires the use of some form of programming notation. Results In this paper, we present PartSeg, a tool for segmentation and reconstruction of 3D microscopy images, optimised for the study of the cell nucleus. PartSeg integrates refined versions of several state-of-the-art algorithms, including a new multi-scale approach for segmentation and quantitative analysis of 3D microscopy images. The features and user-friendly interface of PartSeg were carefully planned with biologists in mind, based on analysis of multiple use cases and difficulties encountered with other tools, to offer an ergonomic interface with a minimal entry barrier. Bulk processing in an ad-hoc manner is possible without the need for programmer support. As the size of datasets of interest grows, such bulk processing solutions become essential for proper statistical analysis of results. Advanced users can use PartSeg components as a library within Python data processing and visualisation pipelines, for example within Jupyter notebooks. The tool is extensible, so that new functionality and algorithms can be added by the use of plugins. For biologists, the utility of PartSeg is presented in several scenarios showing the quantitative analysis of nuclear structures. Conclusions In this paper, we have presented PartSeg, a tool for precise and verifiable segmentation and reconstruction of 3D microscopy images. PartSeg is optimised for cell nucleus analysis and offers multiscale segmentation algorithms best suited for this task. PartSeg can also be used for bulk processing of multiple images, and its components can be reused in other systems or computational experiments. Contact: g.bokota@cent.uw.edu.pl, a.magalska@nencki.edu.pl, d.plewczynski@cent.uw.edu.pl
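The kind of scripted bulk pipeline the abstract alludes to can be sketched in a few lines; the smooth-threshold-label chain below is a deliberately simple stand-in for PartSeg's refined multi-scale algorithms, and the `segment_nuclei` helper and its parameters are illustrative only:

```python
import numpy as np
from scipy import ndimage

def segment_nuclei(volume, sigma=1.0, threshold=0.5):
    """Toy 3D segmentation: Gaussian smoothing, relative thresholding,
    connected-component labelling, then per-object voxel counts."""
    smoothed = ndimage.gaussian_filter(volume.astype(float), sigma=sigma)
    binary = smoothed > threshold * smoothed.max()
    labels, n = ndimage.label(binary)               # one integer label per object
    sizes = ndimage.sum(binary, labels, index=range(1, n + 1))
    return labels, n, sizes
```

Running such a function over a directory of stacks, and collecting `sizes` into a table, is the sort of bulk processing PartSeg exposes without requiring users to write the code themselves.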
- Published
- 2021
42. Application of artificial intelligence ensemble learning model in early prediction of atrial fibrillation
- Author
-
Jian Huang, Yen-Ming J. Chen, Yiu-Jen Chang, Maxwell Hwang, Wen-Hsien Ho, Tian-Hsiang Huang, Kao-Shing Hwang, Cai Wu, and Tsung-Han Ho
- Subjects
Artificial intelligence ,QH301-705.5 ,Computer science ,Computer applications to medicine. Medical informatics ,Feature extraction ,R858-859.7 ,Biochemistry ,Machine Learning ,symbols.namesake ,Electrocardiography ,Structural Biology ,Ensemble learning ,Atrial Fibrillation ,Feature (machine learning) ,Gaussian function ,Humans ,Sensitivity (control systems) ,AdaBoost ,Biology (General) ,Molecular Biology ,Receiver operating characteristic ,business.industry ,Applied Mathematics ,Research ,Computer Science Applications ,Electrocardiogram ,ROC Curve ,symbols ,F1 score ,business ,Algorithms - Abstract
Background Atrial fibrillation is a paroxysmal heart disease with no obvious symptoms for most people during onset. The electrocardiogram (ECG) recorded at times other than onset is not significantly different from that of healthy people, which makes the disease difficult to detect and diagnose. However, if atrial fibrillation is not detected and treated early, the condition tends to worsen and the risk of stroke increases. In this paper, P-wave morphology parameters and heart rate variability feature parameters were extracted simultaneously from the ECG. A total of 31 parameters were used as input variables to build artificial intelligence ensemble learning models. Results This paper applied three ensemble learning methods, namely bagging, AdaBoost and stacking, and compared their prediction results. The stacking ensemble learning method, which combines various models, achieved the best prediction performance, with an accuracy of 92%, sensitivity of 88%, specificity of 96%, positive predictive value of 95.7%, negative predictive value of 88.9%, F1 score of 0.9231 and an area under the receiver operating characteristic curve of 0.911. Conclusion In feature extraction, this paper combined P-wave morphology parameters and heart rate variability parameters as input parameters for model training, and validated the value of the proposed parameter combination for improving the model's predictive performance. In the calculation of the P-wave morphology parameters, the hybrid Taguchi-genetic algorithm was used to obtain more accurate Gaussian function fitting parameters. The prediction model was trained using the stacking ensemble learning method, yielding better accuracy and further improving the early prediction of atrial fibrillation.
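The reported accuracy, sensitivity, specificity, PPV, NPV and F1 score are all functions of the four confusion-matrix counts. A minimal sketch of those definitions (the counts in the usage below are hypothetical, chosen only for illustration; they are not the study's confusion matrix):

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard diagnostic metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # recall on positive (AF) cases
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)                    # positive predictive value (precision)
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return dict(accuracy=accuracy, sensitivity=sensitivity,
                specificity=specificity, ppv=ppv, npv=npv, f1=f1)
```

For example, `binary_metrics(tp=44, fp=2, tn=48, fn=6)` gives an accuracy of 0.92, sensitivity of 0.88 and specificity of 0.96, matching the scale of the figures quoted above.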
- Published
- 2021
43. Analysis of associations between emotions and activities of drug users and their addiction recovery tendencies from social media posts using structural equation modeling
- Author
-
Rahul Singh and Deeptanshu Jha
- Subjects
Drug ,Text mining ,Substance-Related Disorders ,Reddit ,media_common.quotation_subject ,Emotions ,Word count ,030508 substance abuse ,Context (language use) ,lcsh:Computer applications to medicine. Medical informatics ,Substance misuse disorder ,Biochemistry ,Structural equation modeling ,Developmental psychology ,Social media ,Addiction recovery ,Opioid epidemic ,03 medical and health sciences ,0302 clinical medicine ,Online communities ,Recurrence ,Structural Biology ,Humans ,030212 general & internal medicine ,lcsh:QH301-705.5 ,Molecular Biology ,Addiction relapse ,media_common ,Research ,Applied Mathematics ,Addiction ,Personalized interventions ,Computer Science Applications ,lcsh:Biology (General) ,Latent Class Analysis ,Research Design ,Life expectancy ,lcsh:R858-859.7 ,0305 other medical science - Abstract
Background Addiction to drugs and alcohol constitutes one of the significant factors underlying the decline in life expectancy in the US. Several context-specific reasons influence drug use and recovery. In particular, emotional distress, physical pain, relationships, and self-development efforts are known to be some of the factors associated with addiction recovery. Unfortunately, many of these factors are not directly observable, and quantifying and assessing their impact can be difficult. Based on social media posts of users engaged in substance use and recovery on the forum Reddit, we employed two psycholinguistic tools, Linguistic Inquiry and Word Count and Empath, together with the activities of substance users on various Reddit sub-forums, to analyze behavior underlying addiction recovery and relapse. We then employed a statistical analysis technique called structural equation modeling to assess the effects of these latent factors on recovery and relapse. Results We found that both emotional distress and physical pain significantly influence addiction recovery behavior. Self-development activities and the social relationships of substance users were also found to enable recovery. Furthermore, within the context of self-development activities, those related to the mental and physical well-being of substance users were found to be positively associated with addiction recovery. We also determined that a lack of social activities and physical exercise can enable a relapse. Moreover, geography, especially life in rural areas, appears to have a greater correlation with addiction relapse. Conclusions The paper describes how observable variables can be extracted from social media and then used to model important latent constructs that impact addiction recovery and relapse. We also report factors that impact self-induced addiction recovery and relapse.
To the best of our knowledge, this paper represents the first use of structural equation modeling of social media data with the goal of analyzing factors influencing addiction recovery.
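The core idea, observed indicators loading on a latent factor that in turn predicts an outcome, can be caricatured in two steps: a measurement step (estimate the factor) and a structural step (regress the outcome on it). The sketch below uses PCA plus least squares on synthetic data purely to illustrate that idea; it is not the authors' SEM, and all variable names and effect sizes are invented:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: three observed language markers load on one hypothetical
# latent "emotional distress" factor, which drives a recovery score.
n = 500
distress = rng.normal(size=n)                        # latent; unobserved in practice
markers = np.column_stack([distress * w + rng.normal(scale=0.3, size=n)
                           for w in (0.9, 0.8, 0.7)])
recovery = -0.6 * distress + rng.normal(scale=0.5, size=n)

# Measurement step: estimate the latent factor as the first principal component.
X = markers - markers.mean(axis=0)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
factor = X @ Vt[0]
factor /= factor.std()

# Structural step: regress the outcome on the estimated factor.
beta = np.polyfit(factor, recovery, 1)[0]
```

A full SEM jointly estimates both steps with fit indices and standard errors; the two-stage version here only conveys why latent constructs can be recovered from observable social media variables at all. Note that the sign of a principal component is arbitrary, so only the magnitude of `beta` is meaningful here.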
- Published
- 2020
44. DSCMF: prediction of LncRNA-disease associations based on dual sparse collaborative matrix factorization
- Author
-
Jin-Xing Liu, Feng Li, Zhen Cui, Ying-Lian Gao, and Ming-Ming Gao
- Subjects
Male ,Computer science ,QH301-705.5 ,Gaussian ,Computer applications to medicine. Medical informatics ,R858-859.7 ,Value (computer science) ,Breast Neoplasms ,Disease ,Machine learning ,computer.software_genre ,Biochemistry ,Matrix decomposition ,Gaussian interaction profile kernel ,03 medical and health sciences ,symbols.namesake ,0302 clinical medicine ,Similarity (network science) ,Structural Biology ,Code (cryptography) ,Humans ,Computer Simulation ,Biology (General) ,Molecular Biology ,030304 developmental biology ,0303 health sciences ,business.industry ,Applied Mathematics ,Research ,Prostatic Neoplasms ,Computer Science Applications ,Dual (category theory) ,Kernel (image processing) ,LncRNA-disease associations ,030220 oncology & carcinogenesis ,symbols ,RNA, Long Noncoding ,Artificial intelligence ,Collaborative matrix factorization ,business ,computer ,Algorithms - Abstract
Background With the development of science and technology, there is increasing evidence of associations between lncRNAs and human diseases. Finding these associations will have a huge impact on the treatment and prevention of some diseases. However, the process of finding them is difficult and requires a lot of time and effort, so good methods for predicting lncRNA-disease associations (LDAs) are particularly important. Results In this paper, we propose a method based on dual sparse collaborative matrix factorization (DSCMF) to predict LDAs. The DSCMF method improves on the traditional collaborative matrix factorization method. To increase sparsity, the L2,1-norm is added to our method. At the same time, a Gaussian interaction profile kernel is added, which increases the network similarity between lncRNAs and diseases. Finally, the quality of our method is evaluated by the AUC value obtained under ten-fold cross-validation. Conclusions The AUC value obtained by the DSCMF method is 0.8523. At the end of the paper, a simulation experiment is carried out, and the experimental results for prostate cancer, breast cancer, ovarian cancer and colorectal cancer are analyzed in detail. The DSCMF method is expected to bring some help to lncRNA-disease association research. The code is available at https://github.com/Ming-0113/DSCMF.
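The Gaussian interaction profile (GIP) kernel mentioned above measures similarity between entities through their rows in the binary association matrix. A minimal NumPy sketch of the standard formulation, K(i, j) = exp(-γ ||Y_i - Y_j||²) with γ normalized by the mean profile norm (the `gip_kernel` helper and its default bandwidth are illustrative, not the DSCMF code):

```python
import numpy as np

def gip_kernel(Y, gamma_prime=1.0):
    """Gaussian interaction profile kernel over the rows of a binary
    association matrix Y (e.g. lncRNA x disease):
        K[i, j] = exp(-gamma * ||Y[i] - Y[j]||^2)
        gamma   = gamma_prime / mean_i ||Y[i]||^2
    """
    Y = np.asarray(Y, dtype=float)
    norms = (Y ** 2).sum(axis=1)
    gamma = gamma_prime / norms.mean()
    # Squared pairwise distances via the expansion ||a-b||^2 = ||a||^2 + ||b||^2 - 2ab.
    sq = norms[:, None] + norms[None, :] - 2 * Y @ Y.T
    return np.exp(-gamma * sq)
```

Identical association profiles get similarity 1, and similarity decays with the number of diseases on which two lncRNAs disagree, which is the extra network-similarity signal DSCMF folds into the factorization.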
- Published
- 2020
45. Optimized permutation testing for information theoretic measures of multi-gene interactions
- Author
-
David J. Galas, James Kunert-Graf, and Nikita A. Sakhanenko
- Subjects
Information theory ,Genotype ,Computer science ,Pipeline (computing) ,Computation ,0206 medical engineering ,02 engineering and technology ,lcsh:Computer applications to medicine. Medical informatics ,Biochemistry ,Polymorphism, Single Nucleotide ,Bottleneck ,03 medical and health sciences ,Structural Biology ,Resampling ,Code (cryptography) ,Multivariable interactions ,Humans ,Molecular Biology ,lcsh:QH301-705.5 ,030304 developmental biology ,0303 health sciences ,Applied Mathematics ,Multivariable calculus ,Methodology Article ,Epistasis, Genetic ,Computer Science Applications ,Permutation testing ,Exact test ,ComputingMethodologies_PATTERNRECOGNITION ,Phenotype ,lcsh:Biology (General) ,Multi-locus GWAS ,lcsh:R858-859.7 ,Algorithm ,020602 bioinformatics ,Genome-Wide Association Study - Abstract
Background Permutation testing is often considered the “gold standard” for multi-test significance analysis, as it is an exact test requiring few assumptions about the distribution being computed. However, it can be computationally very expensive, particularly in its naive form, in which the full analysis pipeline is re-run after permuting the phenotype labels. This can become intractable in multi-locus genome-wide association studies (GWAS), in which the number of potential interactions to be tested is combinatorially large. Results In this paper, we develop an approach for permutation testing in multi-locus GWAS, specifically focusing on SNP-SNP-phenotype interactions using multivariable measures that can be computed from frequency count tables, such as those based in Information Theory. We find that the computational bottleneck in this process is the construction of the count tables themselves, and that this step can be eliminated at each iteration of the permutation testing by transforming the count tables directly. This leads to a speed-up by a factor of over 10³ for a typical permutation test compared to the naive approach. Additionally, the approach is insensitive to the number of samples, making it suitable for datasets with large numbers of samples. Conclusions The proliferation of large-scale datasets with genotype data for hundreds of thousands of individuals enables new and more powerful approaches for the detection of multi-locus genotype-phenotype interactions. Our approach significantly improves the computational tractability of permutation testing for these studies, and it is insensitive to the large numbers of samples in these modern datasets. The code for performing these computations and replicating the figures in this paper is freely available at https://github.com/kunert/permute-counts.
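For orientation, here is the naive scheme the paper accelerates: shuffle the phenotype labels, rebuild the genotype-phenotype count table from scratch, and recompute the information-theoretic statistic each time. The sketch below does exactly that for a single-locus mutual-information test (the helper names are illustrative; the paper's contribution is replacing the per-permutation table rebuild with a direct transform of the table):

```python
import numpy as np

rng = np.random.default_rng(0)

def mutual_info(table):
    """Mutual information (in nats) from a joint count table."""
    p = table / table.sum()
    px, py = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

def permutation_pvalue(geno, pheno, n_perm=500):
    """Naive permutation test: re-bin genotype x phenotype counts
    after every phenotype shuffle (the bottleneck step)."""
    k, m = geno.max() + 1, pheno.max() + 1
    def table(ph):
        t = np.zeros((k, m))
        np.add.at(t, (geno, ph), 1)
        return t
    observed = mutual_info(table(pheno))
    null = [mutual_info(table(rng.permutation(pheno))) for _ in range(n_perm)]
    return observed, (1 + sum(mi >= observed for mi in null)) / (n_perm + 1)
```

Each permutation here costs O(number of samples) to rebuild the table; operating on the count tables directly removes that dependence, which is the source of both the ~10³ speed-up and the insensitivity to sample count.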
- Published
- 2020
46. BITS2019: the sixteenth annual meeting of the Italian society of bioinformatics
- Author
-
Antonino Fiannaca, Giosuè Lo Bosco, Laura La Paglia, Riccardo Rizzo, Alfonso Urso, Massimo La Rosa, Urso A., Fiannaca A., La Rosa M., La Paglia L., Lo Bosco G., and Rizzo R.
- Subjects
Introduction ,History ,Scope (project management) ,Settore INF/01 - Informatica ,Bioinformatics ,Applied Mathematics ,media_common.quotation_subject ,MEDLINE ,Computational Biology ,lcsh:Computer applications to medicine. Medical informatics ,Biochemistry ,Computer Science Applications ,BITS2019 ,Presentation ,lcsh:Biology (General) ,Italy ,Structural Biology ,lcsh:R858-859.7 ,Humans ,lcsh:QH301-705.5 ,Molecular Biology ,BITS ,media_common - Abstract
The 16th Annual Meeting of the Bioinformatics Italian Society was held in Palermo, Italy, on June 26-28, 2019. More than 80 scientific contributions were presented, including 4 keynote lectures, 31 oral communications and 49 posters. In addition, three workshops were organised before and during the meeting. Full papers from some of the works presented in Palermo were submitted for this Supplement of BMC Bioinformatics. Here, we provide an overview of the meeting's aims and scope, and briefly introduce the selected papers that have been accepted for publication in this Supplement, for a complete presentation of the outcomes of the meeting.
- Published
- 2020
47. Prediction of heart disease and classifiers’ sensitivity analysis
- Author
-
Khaled Mohamad Almustafa
- Subjects
Chest Pain ,Support Vector Machine ,K-nearest neighbor ,Heart Diseases ,Computer science ,Decision tree ,Feature selection ,02 engineering and technology ,lcsh:Computer applications to medicine. Medical informatics ,Biochemistry ,k-nearest neighbors algorithm ,Naive Bayes classifier ,C4.5 algorithm ,Heart disease (HD) ,Structural Biology ,0202 electrical engineering, electronic engineering, information engineering ,Humans ,AdaBoost ,Molecular Biology ,lcsh:QH301-705.5 ,business.industry ,Applied Mathematics ,Methodology Article ,Pattern recognition ,Bayes Theorem ,021001 nanoscience & nanotechnology ,Classification ,Computer Science Applications ,Support vector machine ,Statistical classification ,lcsh:Biology (General) ,Databases as Topic ,ROC Curve ,Decision tree J48 ,lcsh:R858-859.7 ,020201 artificial intelligence & image processing ,Artificial intelligence ,Support vector machine (SVM) ,0210 nano-technology ,Decision table ,business ,Prediction ,Sensitivity analysis ,Algorithms - Abstract
Background Heart disease (HD) is one of the most common diseases nowadays, and early diagnosis is a crucial task for many health care providers, both to protect their patients and to save lives. In this paper, a comparative analysis of different classifiers was performed for the classification of the Heart Disease dataset, in order to correctly classify and predict HD cases with minimal attributes. The dataset contains 76 attributes, including the class attribute, for 1025 patients collected from Cleveland, Hungary, Switzerland, and Long Beach, but in this paper only a subset of 14 attributes is used, and each attribute has a given set of values. The algorithms used were the K-Nearest Neighbor (K-NN), Naive Bayes, Decision tree J48, JRip, SVM, AdaBoost, Stochastic Gradient Descent (SGD) and Decision Table (DT) classifiers, whose performance was compared to best classify, and predict, HD cases. Results It was shown that using different classification algorithms for the classification of the HD dataset gives very promising results in terms of classification accuracy, with the K-NN (K = 1), Decision tree J48 and JRip classifiers reaching accuracies of 99.7073%, 98.0488% and 97.2683%, respectively. A feature selection step was performed using the Classifier Subset Evaluator on the HD dataset, and the results show enhanced classification accuracy for the K-NN (K = 1) and Decision Table classifiers, reaching 100% and 93.8537% respectively, by applying a combination of at most 4 attributes instead of 13 for the prediction of HD cases. Conclusion Different classifiers were used and compared to classify the HD dataset, and we conclude that a reliable feature selection method benefits HD prediction by using a minimal number of attributes instead of all available ones.
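The best-scoring classifier above, K-NN with K = 1, is also the simplest to state: assign each test case the label of its single nearest training case. A minimal NumPy sketch of that rule (illustrative only; the paper's experiments used Weka-style implementations with feature selection, which this omits):

```python
import numpy as np

def knn1_predict(X_train, y_train, X_test):
    """1-nearest-neighbour prediction (K-NN with K = 1, Euclidean distance)."""
    # Squared pairwise distances, shape (n_test, n_train), via broadcasting.
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d2.argmin(axis=1)]
```

With only 14 attributes and ~1000 patients, the brute-force distance matrix is tiny, which is part of why K-NN is a natural baseline on this dataset.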
- Published
- 2020
48. Review of medical image recognition technologies to detect melanomas using neural networks
- Author
-
Alexander Ignatev, Mila Efimenko, and Konstantin Koshechkin
- Subjects
Skin Neoplasms ,Computer science ,Convolutional neural network ,Review ,lcsh:Computer applications to medicine. Medical informatics ,Machine learning ,computer.software_genre ,Biochemistry ,Sensitivity and Specificity ,World health ,03 medical and health sciences ,0302 clinical medicine ,Deep Learning ,Structural Biology ,Image Interpretation, Computer-Assisted ,medicine ,Skin cancer ,Humans ,Deep learning neural network ,lcsh:QH301-705.5 ,Molecular Biology ,Melanoma ,Early Detection of Cancer ,030304 developmental biology ,0303 health sciences ,Dermatoscopy ,Artificial neural network ,medicine.diagnostic_test ,business.industry ,Applied Mathematics ,Melanoma classification ,Cancer ,medicine.disease ,Computer Science Applications ,Data Accuracy ,Systematic review ,lcsh:Biology (General) ,030220 oncology & carcinogenesis ,lcsh:R858-859.7 ,Fuzzy clustering algorithm ,Artificial intelligence ,business ,computer - Abstract
Background Melanoma is one of the most aggressive types of cancer and has become a world-class problem. According to World Health Organization estimates, 132,000 cases of the disease and 66,000 deaths from malignant melanoma and other forms of skin cancer are reported annually worldwide (https://apps.who.int/gho/data/?theme=main), and those numbers continue to grow. In our opinion, due to the increasing incidence of the disease, it is necessary to find new, easy-to-use and sensitive methods for the early diagnosis of melanoma in a large number of people around the world. Over the last decade, neural networks have shown highly sensitive, specific and accurate results. Objective This study presents a review of PubMed papers retrieved with the queries «melanoma neural network» and «melanoma neural network dermatoscopy». We review recent research and discuss its potential for clinical practice. Methods We searched the PubMed database for systematic reviews and original research papers matching the queries «melanoma neural network» and «melanoma neural network dermatoscopy» published in English. Only papers that reported results, progress and outcomes are included in this review. Results We found 11 papers matching our queries that applied convolutional and deep-learning neural networks, combined with fuzzy clustering or World Cup Optimization algorithms, to the analysis of dermatoscopic images. All of them rely on the ABCD (asymmetry, border, color, and differential structures) algorithm or its derivatives, alone or in combination with it. They also require a large dataset of dermatoscopic images and optimized estimation parameters to provide high specificity, accuracy and sensitivity. Conclusions According to the analyzed papers, neural networks show higher specificity, accuracy and sensitivity than dermatologists, and are able to evaluate features that might be unavailable to the naked human eye. Nevertheless, more datasets are needed to confirm these statements. Machine learning is becoming a helpful tool for the early diagnosis of skin diseases, especially melanoma.
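To make the ABCD rule concrete, the "A" (asymmetry) criterion can be caricatured as the fraction of a segmented lesion that fails to overlap its own mirror image. The toy score below is for illustration only; it is not any of the reviewed papers' feature extractors, and real systems compare the lesion against its principal axes rather than the raw image frame:

```python
import numpy as np

def asymmetry_score(mask):
    """Toy asymmetry score (the 'A' of the ABCD rule): fraction of lesion
    area that does not overlap its own left-right mirror image.
    Assumes the binary lesion mask is already centred in the frame.
    Returns 0.0 for a perfectly symmetric lesion, up to 1.0 for no overlap."""
    mask = mask.astype(bool)
    mirrored = mask[:, ::-1]
    return (mask ^ mirrored).sum() / (2 * mask.sum())
```

Border irregularity, colour variegation and differential structures are scored by analogous hand-crafted features; the reviewed deep-learning systems either consume such features or learn them implicitly from the dermatoscopic images.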
- Published
- 2020
49. Venn-diaNet : venn diagram based network propagation analysis framework for comparing multiple biological experiments
- Author
-
Ji Hwan Moon, Sangseon Lee, Sun Kim, Benjamin Hur, Dong-Won Kang, and Gung Lee
- Subjects
Computer science ,Interface (Java) ,lcsh:Computer applications to medicine. Medical informatics ,computer.software_genre ,Biochemistry ,law.invention ,User-Computer Interface ,Structural Biology ,Interaction network ,law ,Animals ,Gene Regulatory Networks ,Protein Interaction Maps ,lcsh:QH301-705.5 ,Molecular Biology ,Mice, Knockout ,Internet ,Intersection (set theory) ,Applied Mathematics ,Research ,Gene Expression Profiling ,Rank (computer programming) ,Network propagation ,Expression (mathematics) ,Computer Science Applications ,Venn diagram ,Gene Ontology ,lcsh:Biology (General) ,Differentially expressed genes ,Gene prioritization ,lcsh:R858-859.7 ,Data mining ,Transcriptome ,computer ,Software - Abstract
Background The main research topic of this paper is how to compare multiple biological experiments using transcriptome data, where each experiment is measured and designed to compare control and treated samples. Comparison of multiple biological experiments is usually performed in terms of the number of DEGs in an arbitrary combination of the experiments. This process is usually facilitated with a Venn diagram, but several issues arise when Venn diagrams are used to compare and analyze multiple experiments in terms of DEGs. First, current Venn diagram tools do not provide systematic analysis to prioritize genes: because they generally do not focus on gene prioritization, genes located in the segments of the Venn diagram (especially the intersections) are difficult to rank. Second, elucidating phenotypic differences only from lists of DEGs and expression values is challenging when the experimental design involves combinations of treatments; synergistic effects of combined treatments are very difficult to detect without an informative analysis system. Results We introduce Venn-diaNet, a Venn diagram based analysis framework that uses network propagation over a protein-protein interaction network to prioritize genes from experiments that produce multiple DEG lists. We suggest that both issues can be effectively handled by ranking or prioritizing the genes within segments of a Venn diagram. The user can easily compare multiple DEG lists with gene rankings, which are easy to understand and can be coupled with additional analyses for their purposes. Our system provides a web-based interface to select seed genes in any of the areas of a Venn diagram and then performs network propagation analysis to measure the influence of the selected seed genes in terms of a ranked list of DEGs. 
Conclusions We suggest that our system can logically guide seed-gene selection without additional prior knowledge, freeing users from the seed-selection issues of network propagation. We showed that Venn-diaNet can reproduce the research findings reported in the original papers for studies comparing two, three and eight experiments. Venn-diaNet is freely available at: http://biohealth.snu.ac.kr/software/venndianet
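Venn-diaNet's exact propagation scheme is described in the paper itself; as a general illustration, network propagation over a PPI network is commonly implemented as a random walk with restart from the seed genes, where the converged visiting probabilities serve as the gene ranking. A minimal sketch, assuming an unweighted symmetric adjacency matrix:

```python
import numpy as np

def network_propagation(adj, seeds, alpha=0.3, tol=1e-8, max_iter=1000):
    """Random walk with restart over a PPI adjacency matrix.

    adj:   symmetric (n, n) array, adj[i, j] = 1 if proteins i and j interact
    seeds: indices of the seed genes (e.g. DEGs from one Venn segment)
    alpha: restart probability back to the seed distribution
    Returns a length-n score vector; rank genes by descending score.
    """
    n = adj.shape[0]
    # Column-normalize the adjacency into a transition matrix
    col_sums = adj.sum(axis=0).astype(float)
    col_sums[col_sums == 0] = 1.0  # avoid division by zero for isolated nodes
    W = adj / col_sums
    p0 = np.zeros(n)
    p0[list(seeds)] = 1.0 / len(seeds)  # restart uniformly over the seeds
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - alpha) * (W @ p) + alpha * p0
        if np.abs(p_next - p).sum() < tol:  # L1 convergence check
            return p_next
        p = p_next
    return p
```

On a small path graph A-B-C with seed A, the converged scores decrease with distance from the seed, which is the behavior a segment-based gene ranking relies on.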
- Published
- 2019
50. Multimodal deep representation learning for protein interaction identification and protein family classification
- Author
-
Da Zhang and Mansur R. Kabuka
- Subjects
Protein family ,Computer science ,0206 medical engineering ,Saccharomyces cerevisiae ,02 engineering and technology ,lcsh:Computer applications to medicine. Medical informatics ,Machine learning ,computer.software_genre ,Biochemistry ,Multimodal deep neural network ,Knowledge graph representation learning ,03 medical and health sciences ,Deep Learning ,Protein sequencing ,Structural Biology ,Protein Interaction Mapping ,Animals ,Humans ,Amino Acid Sequence ,Databases, Protein ,lcsh:QH301-705.5 ,Molecular Biology ,030304 developmental biology ,Structure (mathematical logic) ,0303 health sciences ,Protein-protein interaction network ,business.industry ,Research ,Applied Mathematics ,Node (networking) ,Proteins ,Reproducibility of Results ,Construct (python library) ,Computer Science Applications ,Identification (information) ,ROC Curve ,lcsh:Biology (General) ,Area Under Curve ,lcsh:R858-859.7 ,Graph (abstract data type) ,Neural Networks, Computer ,Artificial intelligence ,DNA microarray ,business ,computer ,Feature learning ,Algorithms ,020602 bioinformatics ,Protein Binding - Abstract
Background Protein-protein interactions (PPIs) are constantly engaged in dynamic biological and pathological processes. It is therefore crucial to understand PPIs thoroughly so that we can illuminate disease occurrence, achieve optimal drug-target therapeutic effects and describe protein complex structures. However, compared to the protein sequences obtainable from various species and organisms, the number of experimentally revealed protein-protein interactions is relatively limited. To address this dilemma, considerable research effort has been invested in facilitating the discovery of novel PPIs. Among these methods, PPI prediction techniques that rely solely on protein sequence data are more widespread than methods requiring extensive biological domain knowledge. Results In this paper, we propose a multimodal deep representation learning architecture that integrates protein physicochemical features with graph topological features from the PPI networks. Specifically, our method takes into account not only protein sequence information but also the topological representation of each protein node in the PPI network. We construct a stacked auto-encoder architecture together with a continuous bag-of-words (CBOW) model trained on generated metapaths to study PPI prediction. We then use supervised deep neural networks to identify PPIs and classify protein families. The PPI prediction accuracy for eight species ranged from 96.76% to 99.77%, signifying that our multimodal deep representation learning framework achieves superior performance compared to other computational methods. Conclusion To the best of our knowledge, this is the first multimodal deep representation learning framework for examining PPI networks.
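The CBOW-on-metapaths step described above treats paths through the PPI graph as "sentences" whose "words" are protein nodes, so that a word2vec-style model can learn topological embeddings. The paper's own metapath scheme is not reproduced here; as a sketch under that assumption, uniform random walks over a toy PPI graph can be generated like this:

```python
import random

def generate_metapath_walks(adj, walk_length=10, walks_per_node=5, seed=0):
    """Generate uniform random walks over a PPI graph.

    adj: dict mapping each protein id to a list of its interaction partners.
    Each returned walk is a list of protein ids, usable as one training
    'sentence' for a CBOW-style embedding model over protein nodes.
    """
    rng = random.Random(seed)
    walks = []
    nodes = list(adj)
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # vary starting order between passes
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks
```

The resulting walks would then be fed to a CBOW trainer (e.g. a word2vec implementation), and the learned node vectors concatenated with the sequence-derived physicochemical features before the supervised classification stage.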
- Published
- 2019