45 results on '"Hu, Xiaohua"'
Search Results
2. Subtask-Aware Representation Learning for Predicting Antibiotic Resistance Gene Properties via Gating-Controlled Mechanism.
- Author
- Zhao W, Wu J, Luo S, Jiang X, He T, and Hu X
- Subjects
- Humans, Drug Resistance, Microbial genetics, Machine Learning, Algorithms, Anti-Bacterial Agents pharmacology, Computational Biology methods
- Abstract
The crisis of antibiotic resistance has become a significant global threat to human health. Understanding properties of antibiotic resistance genes (ARGs) is the first step to mitigate this issue. Although many methods have been proposed for predicting properties of ARGs, most of these methods focus only on predicting antibiotic classes, while ignoring other properties of ARGs, such as resistance mechanisms and transferability. However, acquiring all of these properties of ARGs can help researchers gain a more comprehensive understanding of the essence of antibiotic resistance, which will facilitate the development of antibiotics. In this paper, the task of predicting properties of ARGs is modeled as a multi-task learning problem, and an effective subtask-aware representation learning-based framework is proposed accordingly. More specifically, property-specific expert networks and shared expert networks are utilized respectively to learn subtask-specific features for each subtask and shared features among different subtasks. In addition, a gating-controlled mechanism is employed to dynamically allocate weights to subtask-specific semantics and shared semantics obtained respectively from property-specific expert networks and shared expert networks, thus adjusting distinctive contributions of subtask-specific features and shared features to achieve optimal performance for each subtask simultaneously. Extensive experiments are conducted on publicly available data, and experimental results demonstrate the effectiveness of the proposed framework on the task of ARGs properties prediction.
- Published
- 2024
- Full Text
- View/download PDF
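To make the gating-controlled mechanism in the abstract above concrete, here is a minimal sketch, not the authors' implementation. It assumes each expert network has already produced a feature vector and that the gate supplies one raw score per expert for a given subtask. The names `softmax` and `gated_mix` are hypothetical.

```python
import math

def softmax(scores):
    """Convert raw gate scores into non-negative weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated_mix(expert_outputs, gate_scores):
    """Weighted sum of expert feature vectors for one subtask.

    expert_outputs: equal-length feature vectors, e.g. one from a
        property-specific expert and one from a shared expert (assumed layout).
    gate_scores: one raw score per expert for this subtask.
    """
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]

# Toy example: one property-specific expert and one shared expert.
specific = [1.0, 0.0]
shared = [0.0, 1.0]
mixed = gated_mix([specific, shared], gate_scores=[0.0, 0.0])  # equal scores
```

With equal gate scores both experts contribute equally. In a trained model the scores would come from a learned gate, so each subtask could weight shared and specific features differently.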
3. Network-based approaches in bioinformatics and biomedicine.
- Author
- Cho YR and Hu X
- Subjects
- Computational Biology
- Published
- 2022
- Full Text
- View/download PDF
4. Protein2Vec: Aligning Multiple PPI Networks with Representation Learning.
- Author
- Gao J, Tian L, Lv T, Wang J, Song B, and Hu X
- Subjects
- Algorithms, Animals, Humans, Computational Biology methods, Machine Learning, Protein Interaction Mapping methods, Protein Interaction Maps genetics, Sequence Alignment methods
- Abstract
Research on Protein-Protein Interaction (PPI) network alignment plays an important role in uncovering underlying biological knowledge, such as functionally homologous proteins and conserved evolutionary pathways across different species. Existing methods of PPI network alignment often try to improve the coverage ratio of the alignment result by aligning all proteins from different species. However, there is a fundamental biological premise that needs to be considered carefully: not every protein in a species can, nor should, find its homologous proteins in other species. In this work, we propose a novel alignment method that maps only those proteins with the highest similarity throughout the PPI networks of multiple species. For the similarity features of the proteins in the networks, we integrate topological features with biological characteristics to provide enhanced support for the alignment procedure. For topological features, we apply a representation learning method to the networks and generate, for each protein, a low-dimensional vector embedding that captures its surrounding structural features. The topological similarity of proteins from different PPI networks can thus be expressed as the similarity of their corresponding vector representations, which provides a new way to comprehensively quantify the topological similarities between proteins. We also propose a new measure for the topological evaluation of alignment results that better uncovers the structural quality of the alignment across multiple networks. Both biological and topological evaluations of the alignment results on real datasets demonstrate that our approach is promising and compares favorably with previous multiple alignment methods.
- Published
- 2021
- Full Text
- View/download PDF
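The abstract above reduces topological similarity to the similarity of embedding vectors. The sketch below illustrates that idea under two stated assumptions that are not taken from the paper: cosine similarity is used as the vector similarity, and a hypothetical `threshold` decides which cross-species pairs are similar enough to keep. The function names are illustrative only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def candidate_pairs(emb_a, emb_b, threshold):
    """Return (protein_a, protein_b, similarity) for pairs at or above threshold.

    emb_a, emb_b: dicts mapping protein IDs to embedding vectors for two
    species. Pairs below the threshold are left unaligned, reflecting the
    premise that not every protein should find a counterpart.
    """
    pairs = []
    for pa, va in emb_a.items():
        for pb, vb in emb_b.items():
            s = cosine_similarity(va, vb)
            if s >= threshold:
                pairs.append((pa, pb, s))
    return sorted(pairs, key=lambda t: t[2], reverse=True)

# Toy embeddings with made-up protein IDs.
species_a = {"a1": [1.0, 0.0], "a2": [0.0, 1.0]}
species_b = {"b1": [1.0, 0.1], "b2": [-1.0, 0.0]}
kept = candidate_pairs(species_a, species_b, threshold=0.9)
```

Only the pair `a1`/`b1` survives the threshold here; the remaining proteins stay unmatched rather than being forced into a low-similarity alignment.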
5. Differential Network Analysis via Weighted Fused Conditional Gaussian Graphical Model.
- Author
- Ou-Yang L, Zhang XF, Hu X, and Yan H
- Subjects
- Brain Neoplasms genetics, Brain Neoplasms metabolism, Brain Neoplasms pathology, Computer Simulation, Glioblastoma genetics, Glioblastoma metabolism, Glioblastoma pathology, Humans, Normal Distribution, Computational Biology methods, Gene Expression Profiling methods, Gene Regulatory Networks genetics, Transcriptome genetics
- Abstract
The development and prognosis of complex diseases usually involves changes in regulatory relationships among biomolecules. Understanding how the regulatory relationships change with genetic alterations can help to reveal the underlying biological mechanisms for complex diseases. Although several models have been proposed to estimate the differential network between two different states, they are not suitable to deal with situations where the molecules of interest are affected by other covariates. Nor can they make use of prior information that provides insights about the structures of biomolecular networks. In this study, we introduce a novel weighted fused conditional Gaussian graphical model to jointly estimate two state-specific biomolecular regulatory networks and their difference between two different states. Unlike previous differential network estimation methods, our model can take into account the related covariates and the prior network information when inferring differential networks. The effectiveness of our proposed model is first evaluated based on simulation studies. Experiment results demonstrate that our model outperforms other state-of-the-art differential networks estimation models in all cases. We then apply our model to identify the differential gene network between two subtypes of glioblastoma based on gene expression and miRNA expression data. Our model is able to discover known mechanisms of glioblastoma and provide interesting predictions.
- Published
- 2020
- Full Text
- View/download PDF
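As a rough illustration of how a fused penalty couples two state-specific networks, the sketch below evaluates a generic weighted fused-lasso-style penalty on two precision matrices and reads off the differential edges. The exact objective of the paper is not given in the abstract, so the penalty form, the optional `weights` matrix standing in for prior network information, and all function names are assumptions.

```python
def fused_lasso_penalty(theta1, theta2, lam_sparse, lam_fuse, weights=None):
    """Sparsity plus fusion penalty over off-diagonal entries.

    theta1, theta2: square precision matrices (lists of lists), one per state.
    lam_sparse: strength of the sparsity term on each network.
    lam_fuse: strength of the term that pulls the two networks together.
    weights: optional per-entry weights (assumed encoding of prior knowledge).
    """
    p = len(theta1)
    total = 0.0
    for i in range(p):
        for j in range(p):
            if i == j:
                continue  # diagonal entries are left unpenalized
            w = 1.0 if weights is None else weights[i][j]
            total += lam_sparse * w * (abs(theta1[i][j]) + abs(theta2[i][j]))
            total += lam_fuse * w * abs(theta1[i][j] - theta2[i][j])
    return total

def differential_edges(theta1, theta2, tol=1e-8):
    """Edges (i, j), i < j, whose partial dependence differs between states."""
    p = len(theta1)
    return [
        (i, j)
        for i in range(p)
        for j in range(i + 1, p)
        if abs(theta1[i][j] - theta2[i][j]) > tol
    ]

# Toy example: the edge between variables 0 and 1 exists only in state 1.
state1 = [[1.0, 0.5], [0.5, 1.0]]
state2 = [[1.0, 0.0], [0.0, 1.0]]
penalty = fused_lasso_penalty(state1, state2, lam_sparse=1.0, lam_fuse=1.0)
edges = differential_edges(state1, state2)
```

The fusion term is zero wherever the two networks agree, which is what encourages a sparse differential network.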
6. A novel graph attention adversarial network for predicting disease-related associations.
- Author
- Zhang J, Jiang Z, Hu X, and Song B
- Subjects
- Databases, Genetic, Datasets as Topic, Gene Expression Regulation, Gene Regulatory Networks, Genetic Predisposition to Disease, MicroRNAs metabolism, Predictive Value of Tests, RNA, Long Noncoding metabolism, Computational Biology methods, Deep Learning, Genetic Association Studies methods
- Abstract
Identifying complex human diseases at the molecular level is very helpful, especially for disease diagnosis, therapy, prognosis and monitoring. Accumulating evidence has demonstrated that RNAs play important roles in identifying various complex human diseases. However, the number of verified disease-related RNAs is still small, and many of the biological experiments needed to verify them are very time-consuming and labor-intensive. Therefore, researchers have instead been seeking to develop effective computational algorithms to predict associations between diseases and RNAs. In this paper, we propose a novel model called Graph Attention Adversarial Network (GAAN) for potential disease-RNA association prediction. To the best of our knowledge, we are among the first to successfully integrate both state-of-the-art graph convolutional networks (GCNs) and the attention mechanism in a model for the prediction of disease-RNA associations. Compared to other disease-RNA association prediction methods, GAAN is novel in conducting the computations from the aspect of the global structure of the disease-RNA network with graph embedding while integrating features of local neighborhoods with the attention mechanism. Moreover, GAAN uses adversarial regularization to further discover the feature representation distribution of the latent nodes in disease-RNA networks. GAAN also benefits from the efficiency of deep models for computation on large association networks. To evaluate the performance of GAAN, we conduct experiments on networks of diseases associated with two different types of RNA: microRNAs (miRNAs) and long non-coding RNAs (lncRNAs). Comparisons of GAAN with several popular baseline methods on disease-RNA networks show that our model outperforms the others by a wide margin in predicting potential disease-RNA associations. (Copyright © 2020. Published by Elsevier Inc.)
- Published
- 2020
- Full Text
- View/download PDF
7. Similar Disease Prediction With Heterogeneous Disease Information Networks.
- Author
- Gao J, Tian L, Wang J, Chen Y, Song B, and Hu X
- Subjects
- Algorithms, Databases, Factual, Humans, Models, Statistical, Semantics, Computational Biology methods, Disease classification, Machine Learning
- Abstract
Studying the similarity of diseases can help us explore the pathological characteristics of complex diseases and can provide reliable reference information for inferring the relationships between new and known diseases, so as to develop effective treatment plans. To obtain disease similarity, most previous methods either use a single similarity metric, such as a semantic score or a functional score from a single data source, or use weighting coefficients to simply combine multiple metrics with different dimensions. In this paper, we propose a method to predict the similarity of diseases by node representation learning. We first integrate the semantic score and topological score between diseases by combining multiple data sources. Then, for each disease, its integrated scores with all other diseases are used to map it into a vector of the same spatial dimension, and the vectors are used to measure and comprehensively analyze the similarity between diseases. Lastly, we conduct comparative experiments on a benchmark set and on other disease nodes outside the benchmark set. Evaluating multiple methods with statistics such as the average, variance, and coefficient of variation on the benchmark set demonstrates the effectiveness of our approach in the prediction of similar diseases.
- Published
- 2020
- Full Text
- View/download PDF
8. Clustering and Integrating of Heterogeneous Microbiome Data by Joint Symmetric Nonnegative Matrix Factorization with Laplacian Regularization.
- Author
- Ma Y, Hu X, He T, and Jiang X
- Subjects
- Databases, Genetic, Humans, Phylogeny, Statistics as Topic, Cluster Analysis, Computational Biology methods, Microbiota genetics
- Abstract
Many real-world datasets comprise different representations or views that provide complementary information to each other. To integrate information from multiple views, data integration approaches such as nonnegative matrix factorization (NMF) have been developed to combine multiple heterogeneous data sources simultaneously and obtain a comprehensive representation. In this paper, we propose a novel variant of symmetric nonnegative matrix factorization (SNMF), called Laplacian regularization based joint symmetric nonnegative matrix factorization (LJ-SNMF), for clustering multi-view data. We conduct extensive experiments on several realistic datasets, including Human Microbiome Project data. The experimental results show that the proposed method outperforms other variants of NMF, which suggests the potential application of LJ-SNMF in clustering multi-view datasets. Additionally, we demonstrate the capability of LJ-SNMF in community finding.
- Published
- 2020
- Full Text
- View/download PDF
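Laplacian regularization, named in the abstract above, typically adds a term of the form tr(HᵀLH) with L = D − W, which penalizes similar samples for receiving different low-dimensional representations. The sketch below builds the graph Laplacian from a symmetric similarity matrix and checks the standard identity tr(HᵀLH) = ½ Σ w_ij ‖h_i − h_j‖². It illustrates the regularizer only, not the LJ-SNMF algorithm itself; the function names are hypothetical.

```python
def graph_laplacian(W):
    """Unnormalized graph Laplacian L = D - W for a symmetric similarity matrix W."""
    n = len(W)
    L = [[-W[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        L[i][i] = sum(W[i]) - W[i][i]  # degree minus any self-similarity
    return L

def trace_penalty(H, L):
    """tr(H^T L H), computed column by column of H."""
    n, k = len(H), len(H[0])
    return sum(
        H[i][c] * L[i][j] * H[j][c]
        for c in range(k)
        for i in range(n)
        for j in range(n)
    )

def pairwise_penalty(H, W):
    """0.5 * sum_ij W_ij * ||h_i - h_j||^2 over rows h_i of H."""
    n = len(H)
    total = 0.0
    for i in range(n):
        for j in range(n):
            dist2 = sum((a - b) ** 2 for a, b in zip(H[i], H[j]))
            total += W[i][j] * dist2
    return 0.5 * total

# Three samples: 0 and 1 are weakly similar, 1 and 2 are strongly similar.
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 2.0],
     [0.0, 2.0, 0.0]]
H = [[1.0, 0.0],
     [0.5, 0.5],
     [0.0, 1.0]]
via_trace = trace_penalty(H, graph_laplacian(W))
via_pairs = pairwise_penalty(H, W)
```

Both expressions give the same value, which is why minimizing the trace term keeps the representations of strongly connected samples close together.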
9. Identifying Gene Network Rewiring Using Robust Differential Graphical Model with Multivariate t-Distribution.
- Author
- Yuan R, Ou-Yang L, Hu X, and Zhang XF
- Subjects
- Algorithms, Brain Neoplasms genetics, Breast Neoplasms genetics, Female, Glioblastoma genetics, Humans, Transcriptome genetics, Computational Biology methods, Gene Regulatory Networks genetics, Multivariate Analysis
- Abstract
Identifying gene network rewiring under different biological conditions is important for understanding the mechanisms underlying complex diseases. Gaussian graphical models, which assume the data follow the multivariate normal distribution, are widely used to identify gene network rewiring. However, the normality assumption often fails in practice, since the data are generally contaminated by extreme outliers. In this study, we propose a new robust differential graphical model to identify gene network rewiring between two conditions based on the multivariate t-distribution. The multivariate t-distribution is more robust to outliers than the normal distribution since it has heavy tails and allows values far from the mean. A fused lasso penalty is used to borrow information across conditions to improve the results. We develop an expectation maximization algorithm to solve the optimization model. Experimental results on simulated data show that our method outperforms the state-of-the-art methods. Our method is also applied to identify gene network rewiring between luminal A and basal-like subtypes of breast cancer, and gene network rewiring between the proneural and mesenchymal subtypes of glioblastoma. Several key genes that drive gene network rewiring are discovered.
- Published
- 2020
- Full Text
- View/download PDF
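The robustness of the t-distribution mentioned in the abstract above can be seen in the weights that an expectation maximization algorithm assigns to each observation. The sketch below is a deliberately simplified univariate version with the scale held fixed, not the paper's multivariate model with a fused lasso penalty. In the E-step each point gets the weight (ν + 1) / (ν + d²), where d is its standardized distance from the current location estimate; the function names and toy data are illustrative assumptions.

```python
def t_weights(xs, mu, scale, nu):
    """E-step weights for a univariate t-distribution with nu degrees of freedom.

    Points far from mu (large standardized distance) receive small weights.
    """
    return [(nu + 1.0) / (nu + ((x - mu) / scale) ** 2) for x in xs]

def robust_location(xs, scale=1.0, nu=3.0, iters=50):
    """Iteratively reweighted mean; the scale is held fixed for simplicity."""
    mu = sum(xs) / len(xs)  # start from the ordinary mean
    for _ in range(iters):
        w = t_weights(xs, mu, scale, nu)
        mu = sum(wi * x for wi, x in zip(w, xs)) / sum(w)
    return mu

data = [0.1, -0.2, 0.0, 0.3, -0.1, 50.0]  # one extreme outlier
plain_mean = sum(data) / len(data)        # dragged toward the outlier
robust_mean = robust_location(data)       # stays near the bulk of the data
```

The outlier ends up with a weight close to zero, so it barely influences the estimate, whereas under a normal model every point would carry equal weight.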
10. Multiscale network-based approaches in bioinformatics and biomedicine.
- Author
- Zheng HJ and Hu XT
- Subjects
- Algorithms, Humans, Biomedical Research methods, Computational Biology methods
- Published
- 2020
- Full Text
- View/download PDF
11. Recognition of bacteria named entity using conditional random fields in Spark.
- Author
- Wang X, Li Y, He T, Jiang X, and Hu X
- Subjects
- Algorithms, Data Mining, Bacteria isolation & purification, Computational Biology methods
- Abstract
Background: Microbes play a crucial role in the functional mechanisms of an ecosystem. Identifying the interactions among microbes is an important step towards understanding the structure and function of microbial communities, as well as the impact of microbes on human health and disease. Despite its importance, there is currently no gold-standard dataset of microbial interactions. Traditional approaches such as growth and co-culture analysis need to be performed in the laboratory, which is time-consuming and costly. By providing predicted candidate interactions for experimental verification, computational methods are able to alleviate this problem. Mining microbial interactions from large volumes of medical text is one type of computational method. Identifying named entities of bacteria and related entities in the text is the basis for microbial relation extraction. In previous work, a system for bacteria named entity recognition based on a dictionary and conditional random fields was proposed. However, it is inefficient when dealing with large-scale text. Results: We implemented bacteria named entity recognition on the Spark platform and designed comparison experiments to verify the correctness and validity of the proposed system. The experimental results show that it achieves a higher F-measure in the correctness comparison. Moreover, prediction is much faster than with the previous version on large-scale biomedical datasets, and computational efficiency is improved by about 3.1 to 6.7 times. Conclusions: The system for bacteria named entity recognition solves the inefficiency of the previously proposed system on large-scale datasets. The proposed system performs well in both accuracy and scalability.
- Published
- 2018
- Full Text
- View/download PDF
12. Identifying Gene Network Rewiring by Integrating Gene Expression and Gene Network Data.
- Author
- Xu T, Ou-Yang L, Hu X, and Zhang XF
- Subjects
- Algorithms, Databases, Genetic, Female, Humans, Normal Distribution, Computational Biology methods, Gene Expression genetics, Gene Regulatory Networks genetics
- Abstract
Exploring the rewiring pattern of gene regulatory networks between different pathological states is an important task in bioinformatics. Although a number of computational approaches have been developed to infer differential networks from high-throughput data, most of them only focus on gene expression data. The valuable static gene regulatory network data accumulated in recent biomedical researches are neglected. In this study, we propose a new Gaussian graphical model-based method to infer differential networks by integrating gene expression and static gene regulatory network data. We first evaluate the empirical performance of our method by comparing with the state-of-the-art methods using simulation data. We also apply our method to The Cancer Genome Atlas data to identify gene network rewiring between ovarian cancers with different platinum responses, and rewiring between breast cancers of luminal A subtype and basal-like subtype. Hub genes in the estimated differential networks rediscover known genes associated with platinum resistance in ovarian cancer and signatures of the breast cancer intrinsic subtypes.
- Published
- 2018
- Full Text
- View/download PDF
13. ConnectedAlign: a PPI network alignment method for identifying conserved protein complexes across multiple species.
- Author
- Gao J, Song B, Hu X, Yan F, and Wang J
- Subjects
- Animals, Caenorhabditis elegans metabolism, Drosophila metabolism, Humans, Proteins chemistry, Saccharomyces cerevisiae metabolism, Species Specificity, Algorithms, Computational Biology methods, Protein Interaction Mapping methods, Proteins metabolism
- Abstract
Background: In bioinformatics, network alignment algorithms have been applied to protein-protein interaction (PPI) networks to discover evolutionarily conserved substructures at the system level. However, most previous methods aim to maximize the similarity of aligned proteins in pairwise networks, while paying little attention to the connectivity of these substructures, such as protein complexes. Results: In this paper, we identify the problem of finding conserved protein complexes, which requires the aligned proteins in a PPI network to form a connected subnetwork. By taking connectivity into consideration, we propose ConnectedAlign, an efficient method to find conserved protein complexes from multiple PPI networks. The proposed method improves the coverage significantly without compromising the consistency of the aligned results. In this way, the knowledge of protein complexes in well-studied species can be extended to poorly studied species. Conclusions: We conducted extensive experiments on real PPI networks of four species: human, yeast, fruit fly and worm. The experimental results demonstrate substantial benefits of the proposed method in finding protein complexes across multiple species.
- Published
- 2018
- Full Text
- View/download PDF
14. DiffGraph: an R package for identifying gene network rewiring using differential graphical models.
- Author
- Zhang XF, Ou-Yang L, Yang S, Hu X, and Yan H
- Subjects
- Computational Biology methods, Data Visualization, Gene Regulatory Networks, Models, Genetic, Software
- Abstract
Summary: We develop DiffGraph, an R package that integrates four influential differential graphical models for identifying gene network rewiring under two different conditions from gene expression data. The input and output of different models are packaged in the same format, making it convenient for users to compare different models using a wide range of datasets and carry out follow-up analysis. Furthermore, the inferred differential networks can be visualized both non-interactively and interactively. The package is useful for identifying gene network rewiring from input datasets, comparing the predictions of different methods and visualizing the results. Availability and Implementation: The package is available at https://github.com/Zhangxf-ccnu/DiffGraph. Contact: leouyang@szu.edu.cn. Supplementary Information: Supplementary data are available at Bioinformatics online.
- Published
- 2018
- Full Text
- View/download PDF
15. Reverse-engineering of gene networks for regulating early blood development from single-cell measurements.
- Author
- Wei J, Hu X, Zou X, and Tian T
- Subjects
- Genetic Markers genetics, Models, Genetic, Transcription Factors genetics, Blood Cells metabolism, Computational Biology methods, Gene Regulatory Networks, Single-Cell Analysis
- Abstract
Background: Recent advances in omics technologies have raised great opportunities to study large-scale regulatory networks inside the cell. In addition, single-cell experiments have measured gene and protein activities in a large number of cells under the same experimental conditions. However, a significant challenge in computational biology and bioinformatics is how to derive quantitative information from single-cell observations and how to develop sophisticated mathematical models that use this information to describe the dynamic properties of regulatory networks. Methods: This work designs an integrated approach to reverse-engineer gene networks regulating early blood development based on single-cell experimental observations. The Wanderlust algorithm is initially used to develop the pseudo-trajectory for the activities of a number of genes. Since the gene expression data in the developed pseudo-trajectory show large fluctuations, we then use Gaussian process regression methods to smooth the gene expression data in order to obtain pseudo-trajectories with far fewer fluctuations. The proposed integrated framework consists of both bioinformatics algorithms to reconstruct the regulatory network and mathematical models using differential equations to describe the dynamics of gene expression. Results: The developed approach is applied to study the network regulating early blood cell development. A graphical model is constructed for a regulatory network with forty genes, and a dynamic model using differential equations is developed for a network of nine genes. Numerical results suggest that the proposed model is able to match experimental data very well. We also examine networks with more regulatory relations, and numerical results show that more regulations may exist. We test the possibility of auto-regulation, but numerical simulations do not support positive auto-regulation. In addition, robustness is used as an important additional criterion to select candidate networks. Conclusion: The results of this work show that the developed approach is an efficient and effective method to reverse-engineer gene networks using single-cell experimental observations.
- Published
- 2017
- Full Text
- View/download PDF
16. Prediction of essential proteins based on subcellular localization and gene expression correlation.
- Author
- Fan Y, Tang X, Hu X, Wu W, and Ping Q
- Subjects
- Saccharomyces cerevisiae genetics, Saccharomyces cerevisiae Proteins genetics, Subcellular Fractions, Algorithms, Computational Biology methods, Gene Expression Regulation, Fungal, Genes, Essential, Protein Interaction Maps, Saccharomyces cerevisiae metabolism, Saccharomyces cerevisiae Proteins metabolism
- Abstract
Background: Essential proteins are indispensable to the survival and development of living organisms. To understand the functional mechanisms of essential proteins, which can be applied to the analysis of disease and the design of drugs, it is important to first identify essential proteins from a set of proteins. As traditional experimental methods for identifying essential proteins are usually expensive and laborious, computational methods, which utilize biological and topological features of proteins, have attracted more attention in recent years. Protein-protein interaction networks, together with other biological data, have been explored to improve the performance of essential protein prediction. Results: The proposed method SCP is evaluated on Saccharomyces cerevisiae datasets and compared with five other methods. The results show that SCP outperforms the other five methods in terms of accuracy of essential protein prediction. Conclusions: In this paper, we propose a novel algorithm named SCP, which combines the ranking by a modified PageRank algorithm based on subcellular compartment information with the ranking by the Pearson correlation coefficient (PCC) calculated from gene expression data. Experiments show that subcellular localization information is promising for boosting essential protein prediction.
- Published
- 2017
- Full Text
- View/download PDF
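The two ingredients named in the abstract above, PageRank on an interaction network and the Pearson correlation coefficient between expression profiles, can be sketched as follows. This shows the textbook versions only: the paper's subcellular-compartment modification of PageRank and the way SCP combines the two rankings are not described in the abstract and are not reproduced here. The function names and the toy network are assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        return 0.0  # a constant profile has no defined correlation
    return cov / math.sqrt(vx * vy)

def pagerank(adj, damping=0.85, iters=100):
    """Plain power-iteration PageRank.

    adj: dict mapping every node to a list of its neighbours.
    Returns a dict of scores that sum to 1.
    """
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            nbrs = adj[v]
            if nbrs:
                share = damping * rank[v] / len(nbrs)
                for u in nbrs:
                    new[u] += share
            else:
                # isolated node: spread its score uniformly
                for u in nodes:
                    new[u] += damping * rank[v] / n
        rank = new
    return rank

# Toy star-shaped interaction network with made-up protein names.
network = {"hub": ["p1", "p2", "p3"], "p1": ["hub"], "p2": ["hub"], "p3": ["hub"]}
scores = pagerank(network)
corr = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

In this toy network the hub receives the highest PageRank score, and the two perfectly co-varying expression profiles have a correlation of 1.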
17. BalanceAli: Multiple PPI Network Alignment With Balanced High Coverage and Consistency.
- Author
- Gao J, Song B, Ke W, and Hu X
- Subjects
- Algorithms, Animals, Databases, Protein, Humans, Mice, Yeasts, Computational Biology methods, Protein Interaction Mapping methods, Protein Interaction Maps, Proteins chemistry, Proteins metabolism, Sequence Alignment methods
- Abstract
Coverage and consistency are two most considered metrics to evaluate the effectiveness of network alignment. But they are a pair of contradictory evaluation metrics in protein-protein interaction (PPI) network alignment. It is difficult, if not impossible, to achieve high coverage and consistency simultaneously. Furthermore, existing methods of multiple PPI network alignment mostly ignore k-coverage or k-consistency, where k indicates the number of aligned species. In this paper, we propose BalanceAli, a novel approach for global alignment of multiple PPI networks that achieves high k-coverage and k-consistency simultaneously. With six data sets consisting of various numbers of PPI networks from five species, we evaluate the experimental results using different k values. The performance evaluations of our approach against other three state-of-the-art methods demonstrate the preferable comprehensive strength of our approach.
- Published
- 2017
- Full Text
- View/download PDF
18. Microbiome Data Representation by Joint Nonnegative Matrix Factorization with Laplacian Regularization.
- Author
- Jiang X, Hu X, and Xu W
- Subjects
- Algorithms, Cluster Analysis, Gene Expression Profiling, Humans, Models, Theoretical, Phylogeny, Computational Biology methods, Databases, Factual, Microbiota genetics, Microbiota physiology
- Abstract
Microbiome datasets often comprise different representations or views, such as metabolic pathways, taxonomic assignments, and gene families, which provide complementary information for understanding microbial communities. Data integration methods, including approaches based on nonnegative matrix factorization (NMF), combine multi-view data to create a comprehensive view of a given microbiome study. In this paper, we propose a novel variant of NMF, called Laplacian regularized joint non-negative matrix factorization (LJ-NMF), for integrating functional and phylogenetic profiles from HMP. We compare the performance of this method to other variants of NMF. The experimental results indicate that the proposed method offers an efficient framework for microbiome data analysis.
- Published
- 2017
- Full Text
- View/download PDF
19. Multi-View Clustering of Microbiome Samples by Robust Similarity Network Fusion and Spectral Clustering.
- Author
- Zhang Y, Hu X, and Jiang X
- Subjects
- Databases, Protein, Humans, Algorithms, Cluster Analysis, Computational Biology methods, Microbiota genetics
- Abstract
Microbiome datasets often comprise different representations or views that provide complementary information, such as genes, functions, and taxonomic assignments. Integrating multi-view information for clustering microbiome samples could create a comprehensive view of a given microbiome study. Similarity network fusion (SNF) can efficiently integrate similarities built from each view of the data into a single network that represents the full spectrum of the underlying data. Based on this method, we develop a Robust Similarity Network Fusion (RSNF) approach, which combines the strength of random forests with the advantage of SNF in data aggregation. The experimental results indicate the strength of the proposed strategy. The method substantially improves clustering performance compared with several state-of-the-art methods on several datasets.
- Published
- 2017
- Full Text
- View/download PDF
20. An integrated approach to infer dynamic protein-gene interactions - A case study of the human P53 protein.
- Author
- Wang J, Wu Q, Hu XT, and Tian T
- Subjects
- Algorithms, DNA Repair genetics, Humans, Models, Statistical, Signal Transduction genetics, Computational Biology methods, Gene Expression Profiling methods, Gene Regulatory Networks genetics, Tumor Suppressor Protein p53 genetics
- Abstract
Investigating the dynamics of genetic regulatory networks through high-throughput experimental data, such as microarray gene expression profiles, is a very important but challenging task. One of the major hindrances in building detailed mathematical models for genetic regulation is the large number of unknown model parameters. To tackle this challenge, a new integrated method is proposed by combining a top-down approach and a bottom-up approach. First, the top-down approach uses probabilistic graphical models to predict the network structure of the DNA repair pathway that is regulated by the p53 protein. Two networks are predicted, namely a network of eight genes with eight inferred interactions and an extended network of 21 genes with 17 interactions. Then, the bottom-up approach using differential equation models is developed to study the detailed genetic regulations based on either a fully connected regulatory network or a gene network obtained by the top-down approach. Model simulation error, parameter identifiability and robustness property are used as criteria to select the optimal network. Simulation results together with permutation tests of input gene network structures indicate that the prediction accuracy and robustness property of the two predicted networks using the top-down approach are better than those of the corresponding fully connected networks. In particular, the proposed approach reduces computational cost significantly for inferring model parameters. Overall, the new integrated method is a promising approach for investigating the dynamics of genetic regulation. (Copyright © 2016 Elsevier Inc. All rights reserved.)
- Published
- 2016
- Full Text
- View/download PDF
21. Neighbor affinity based algorithm for discovering temporal protein complex from dynamic PPI network.
- Author
- Shen X, Yi L, Jiang X, Zhao Y, Hu X, He T, and Yang J
- Subjects
- Algorithms, Cluster Analysis, Multiprotein Complexes genetics, Computational Biology methods, Protein Interaction Mapping methods, Protein Interaction Maps genetics
- Abstract
Detection of temporal protein complexes would be a great aid in furthering our knowledge of the dynamic features and molecular mechanisms of cell life activities. Most existing clustering algorithms for discovering protein complexes are based on static protein interaction networks in which the inherent dynamics are often overlooked. We propose a novel algorithm, DPC-NADPIN (Discovering Protein Complexes based on Neighbor Affinity and Dynamic Protein Interaction Network), to identify temporal protein complexes from time-course protein interaction networks. Inspired by the idea that the more tightly a protein's neighbors inside a module are connected, the greater the possibility that the protein belongs to the module, the DPC-NADPIN algorithm first consolidates each protein with a high clustering coefficient and its neighbors into an initial cluster. The initial cluster then becomes a protein complex by appending neighbor proteins according to the relationship between the affinity among neighbors inside the cluster and that outside the cluster. In our experiments, the DPC-NADPIN algorithm is shown to be reasonable, and it performs better at discovering protein complexes than the following state-of-the-art algorithms: Hunter, MCODE, CFinder, SPICI, and ClusterONE. Meanwhile, it obtains many protein complexes with strong biological significance, which provide helpful biological knowledge to researchers in the field. Moreover, we find that proteins are assembled coordinately to form protein complexes with characteristics of temporality and spatiality, thereby performing specific biological functions. (Copyright © 2016 Elsevier Inc. All rights reserved.)
- Published
- 2016
- Full Text
- View/download PDF
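The seeding step described in the abstract of entry 21 can be illustrated with a minimal Python sketch. It covers only the selection of high clustering coefficient proteins and their neighbors as initial clusters; the neighbor-affinity expansion step is not shown. The function names, the `threshold` parameter, and the toy network are assumptions made for illustration and are not taken from the paper.

```python
from itertools import combinations


def clustering_coefficient(graph, node):
    """Fraction of pairs of `node`'s neighbors that are themselves connected."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in graph[u])
    return 2.0 * links / (k * (k - 1))


def initial_clusters(graph, threshold):
    """Seed one cluster (a node plus its neighbors) for every node whose
    clustering coefficient meets the (hypothetical) threshold."""
    return {
        node: {node} | graph[node]
        for node in graph
        if clustering_coefficient(graph, node) >= threshold
    }


# Toy undirected network stored as adjacency sets (hypothetical data).
toy_graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
seeds = initial_clusters(toy_graph, threshold=0.5)
```

In this toy network only B and C qualify as seeds, because their two neighbors are connected to each other, while A has a coefficient of 1/3 and D has a single neighbor.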
22. Editorial.
- Author
-
Liao L and Hu X
- Subjects
- DNA-Binding Proteins metabolism, Humans, Models, Biological, Multiprotein Complexes metabolism, Protein Interaction Domains and Motifs, Sequence Analysis, Protein methods, Computational Biology methods, Protein Interaction Mapping methods
- Published
- 2016
- Full Text
- View/download PDF
23. Identifying protein complexes based on brainstorming strategy.
- Author
-
Shen X, Zhou J, Yi L, Hu X, He T, and Yang J
- Subjects
- Algorithms, Cluster Analysis, Humans, Saccharomyces cerevisiae genetics, Computational Biology methods, Protein Interaction Mapping methods, Protein Interaction Maps genetics, Proteomics methods
- Abstract
Protein complexes comprising interacting proteins in a protein-protein interaction network (PPI network) play a central role in driving biological processes within cells. Recently, more and more swarm intelligence based algorithms for detecting protein complexes have been emerging and have become a research hotspot in the proteomics field. In this paper, we propose a novel algorithm for identifying protein complexes based on a brainstorming strategy (IPC-BSS), which integrates the main idea of swarm intelligence optimization with the improved K-means algorithm. The distance between nodes in the PPI network is defined by combining the network topology and gene ontology (GO) information. Inspired by the human brainstorming process, the IPC-BSS algorithm first selects the clustering center nodes, which are then separately consolidated with other nodes at short distance to form initial clusters. Finally, we put forward two ways of updating the initial clusters to search for optimal results. Experimental results show that our IPC-BSS algorithm outperforms other classic algorithms on yeast and human PPI networks, and it obtains many predicted protein complexes with biological significance. (Copyright © 2016 Elsevier Inc. All rights reserved.)
- Published
- 2016
- Full Text
- View/download PDF
24. Predicting diabetes mellitus genes via protein-protein interaction and protein subcellular localization information.
- Author
-
Tang X, Hu X, Yang X, Fan Y, Li Y, Hu W, Liao Y, Zheng MC, Peng W, and Gao L
- Subjects
- Algorithms, Humans, Proteins genetics, Proteins metabolism, Software, Computational Biology methods, Diabetes Mellitus, Type 2 genetics, Protein Interaction Mapping methods, Protein Interaction Maps genetics
- Abstract
Background: Diabetes mellitus, characterized by hyperglycemia as a result of insufficient production of or reduced sensitivity to insulin, poses a growing threat to human health. It is a heterogeneous disorder with multiple etiologies, comprising type 1 diabetes, type 2 diabetes, gestational diabetes and so on. Diabetes-associated protein/gene prediction is a key step in understanding the cellular mechanisms related to diabetes mellitus. Compared with experimental methods, computational prediction of candidate proteins/genes is cheaper and requires less effort. Protein-protein interaction (PPI) data produced by high-throughput technology have been used to prioritize candidate disease genes/proteins. However, false interactions in the PPI data seriously hurt the performance of computational methods. To address this problem, new methods are being developed to identify candidate disease genes/proteins by integrating biological data from other sources. Results: In this study, a new framework called PDMG is proposed to predict candidate disease genes/proteins. First, weighted networks are built by combining subcellular localization information and PPI data. To form the weighted networks, the importance of each compartment is evaluated based on the number of interacting proteins in that compartment, because different compartments play very different roles in cell activities and some compartments are more important than others. Based on the evaluated compartments, the interactions between proteins are scored and the weighted PPI networks are constructed. Second, the known disease genes are extracted from the OMIM database as seed genes to expand disease-specific networks based on the weighted networks. Third, the weights between a protein and its neighbors in the disease-related networks are added together, and the sum is taken as the score of the protein. Finally, the proteins are ranked in descending order of their scores. The candidate proteins at the top are considered to be associated with the disease and are potential disease-related proteins. Various types of data, such as type 2 diabetes-associated genes, subcellular localizations and protein interactions, are used to test the PDMG method. Conclusions: The results show that the proteins/genes functionally exerting a direct influence over diabetes are consistently placed at the head of the queue. PDMG expands and ranks 445 candidate proteins from the seed set, including the original 27 type 2 diabetes proteins. Out of the top 27 proteins, 14 are real type 2 diabetes proteins. Literature extracted from the PubMed database shows that, out of the 13 novel proteins, 8 are associated with diabetes.
- Published
- 2016
- Full Text
- View/download PDF
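The third and final steps in the abstract of entry 24 (sum the weights between a protein and its neighbors, then rank in descending order) can be sketched as follows. This is a minimal illustration, not the PDMG implementation: the edge weights are hypothetical placeholders, and the compartment-based weighting that would produce them is not shown.

```python
def score_proteins(weighted_edges):
    """Score each protein as the sum of the weights on its incident edges."""
    scores = {}
    for (u, v), weight in weighted_edges.items():
        scores[u] = scores.get(u, 0.0) + weight
        scores[v] = scores.get(v, 0.0) + weight
    return scores


def rank_proteins(scores):
    """Rank proteins by descending score, breaking ties by name."""
    return sorted(scores, key=lambda protein: (-scores[protein], protein))


# Hypothetical weighted disease-related network (undirected edges).
toy_edges = {
    ("P1", "P2"): 0.9,
    ("P1", "P3"): 0.4,
    ("P2", "P3"): 0.2,
}
toy_scores = score_proteins(toy_edges)
ranking = rank_proteins(toy_scores)
```

Here P1 accumulates the largest total weight and is ranked first, followed by P2 and P3.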
25. Mining Temporal Protein Complex Based on the Dynamic PIN Weighted with Connected Affinity and Gene Co-Expression.
- Author
-
Shen X, Yi L, Jiang X, He T, Hu X, and Yang J
- Subjects
- Humans, Proteins genetics, Time Factors, Algorithms, Computational Biology methods, Gene Expression Profiling, Protein Interaction Mapping methods, Protein Interaction Maps, Proteins metabolism
- Abstract
The identification of temporal protein complexes would make a great contribution to our knowledge of the dynamic organization characteristics of protein interaction networks (PINs). Recent studies have focused on integrating gene expression data into a static PIN to construct a dynamic PIN, which reveals the dynamic evolutionary procedure of protein interactions, but in practice they fail to recognize the active time points of proteins with low or high expression levels. We construct a Time-Evolving PIN (TEPIN) with a novel method called Deviation Degree, which is designed to identify the active time points of proteins based on the deviation degree of their own expression values. Moreover, owing to the differences between protein interactions, we weight TEPIN with connected affinity and gene co-expression to quantify the degree of these interactions. To validate the efficiency of our methods, the ClusterONE, CAMSE and MCL algorithms are applied to the TEPIN, DPIN (a dynamic PIN constructed with the state-of-the-art three-sigma method) and SPIN (the original static PIN) to detect temporal protein complexes. Each algorithm on our TEPIN outperforms the same algorithm on the other networks in terms of match degree, sensitivity, specificity, F-measure, function enrichment, etc. In conclusion, our Deviation Degree method successfully eliminates the disadvantages of the previous state-of-the-art dynamic PIN construction methods. Moreover, the biological nature of protein interactions can be well described in our weighted network. Weighted TEPIN is a useful approach for detecting temporal protein complexes and revealing the dynamic protein assembly process for cellular organization.
- Published
- 2016
- Full Text
- View/download PDF
26. NDRC: A Disease-Causing Genes Prioritized Method Based on Network Diffusion and Rank Concordance.
- Author
-
Fang M, Hu X, Wang Y, Zhao J, Shen X, and He T
- Subjects
- Algorithms, Humans, Computational Biology methods, Disease genetics, Protein Interaction Maps genetics, Proteins genetics
- Abstract
Prioritization of disease-causing genes is very important for understanding disease mechanisms and for biomedical applications such as drug design. Previous studies have shown that promising candidate genes are mostly ranked according to their relatedness to known disease genes or closely related disease genes. Therefore, a dangling gene (isolated gene) with no edges in the network cannot be effectively prioritized. These approaches tend to prioritize genes that are highly connected in the PPI network, while performing poorly when applied to loosely connected disease genes. To address these problems, we propose a new disease-causing gene prioritization method based on network diffusion and rank concordance (NDRC). The method is evaluated by leave-one-out cross validation on 1931 diseases in which at least one gene is known to be involved, and it is able to rank the true causal gene first in 849 of all 2542 cases. The experimental results suggest that NDRC significantly outperforms other existing methods such as RWR, VAVIEN, DADA and PRINCE in identifying loosely connected disease genes, and that it successfully proposes dangling genes as potential candidate disease genes. Furthermore, we apply the NDRC method to study three representative diseases: Meckel syndrome 1, Protein C deficiency and Peroxisome biogenesis disorder 1A (Zellweger). Our study has also found that certain complex disease-causing genes can be divided into several modules that are closely associated with different disease phenotypes.
- Published
- 2015
- Full Text
- View/download PDF
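The dangling-gene problem raised in the abstract of entry 26 can be seen with a small sketch of random walk with restart (RWR), one of the existing methods the abstract names as a baseline. This is not NDRC itself, whose details the abstract does not give; the function name, the `restart` and `iterations` defaults, and the toy network are assumptions made for illustration.

```python
def random_walk_with_restart(graph, seeds, restart=0.3, iterations=100):
    """Diffuse score from seed genes over an undirected network.

    `graph` maps each gene to the set of its neighbors. At every step a
    fraction `restart` of the probability mass returns to the seeds and
    the rest is spread evenly over each node's neighbors.
    """
    nodes = sorted(graph)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(start)
    for _ in range(iterations):
        updated = {n: restart * start[n] for n in nodes}
        for u in nodes:
            if not graph[u]:
                continue  # a dangling gene has no edges to pass score along
            share = (1.0 - restart) * scores[u] / len(graph[u])
            for v in graph[u]:
                updated[v] += share
        scores = updated
    return scores


# Hypothetical network: G4 is a dangling gene with no interactions.
toy_network = {
    "G1": {"G2"},
    "G2": {"G1", "G3"},
    "G3": {"G2"},
    "G4": set(),
}
rwr_scores = random_walk_with_restart(toy_network, seeds={"G1"})
```

Because G4 has no edges, no score can ever reach it from the seed, so purely connectivity-based diffusion leaves it at zero regardless of its biological relevance.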
27. Predicting Microbial Interactions Using Vector Autoregressive Model with Graph Regularization.
- Author
-
Jiang X, Hu X, Xu W, and Park EK
- Subjects
- Databases, Factual, Humans, Microbiota, Models, Statistical, Computational Biology methods, Microbial Interactions, Models, Biological
- Abstract
Microbial interactions play important roles in the structure and function of complex microbial communities. With the rapid accumulation of high-throughput metagenomic or 16S rRNA sequencing data, it is possible to infer complex microbial interactions. Co-occurrence patterns of microbial species among multiple samples are often utilized to infer interactions. Few methods consider the temporal interaction patterns among microbial species. In this paper, we present a Graph-regularized Vector Autoregressive (GVAR) model to infer causal relationships among microbial entities. The new model has advantages over the original vector autoregressive (VAR) model. Specifically, GVAR can incorporate similarity information for microbial interaction inference: GVAR assumes that if two species are similar in the previous stage, they tend to have similar influence on the other species in the next stage. We apply the model to a time series dataset of the human gut microbiome which was treated with repeated antibiotics. The experimental results indicate that the new approach performs better than several other VAR-based models and demonstrate its capability of extracting relevant microbial interactions.
- Published
- 2015
- Full Text
- View/download PDF
28. Inferring microbial interaction networks based on consensus similarity network fusion.
- Author
-
Jiang X and Hu X
- Subjects
- Algorithms, Cluster Analysis, Computer Simulation, Databases, Genetic, Gene Regulatory Networks, Genomics, Humans, Models, Biological, Oxygen chemistry, Computational Biology methods, Genes, Bacterial, Microbiota physiology
- Abstract
With the rapid accumulation of high-throughput metagenomic sequencing data, it is possible to infer microbial species relations in a microbial community systematically. In recent years, some approaches have been proposed for identifying microbial interaction networks. These methods often focus on one dataset without considering the advantage of data integration. In this study, we propose to use a similarity network fusion (SNF) method to infer microbial relations. The SNF efficiently integrates the similarities of species derived from different datasets by a cross-network diffusion process. We also introduce a consensus k-nearest neighborhood (Ck-NN) method in place of the k-NN in the original SNF (we call the approach CSNF). The final network represents the augmented species relationships with aggregated evidence from various datasets, taking advantage of complementarity in the data. We apply the method to genus profiles derived from three microbiome datasets and find that CSNF can discover the modular structure of the microbial interaction network, which cannot be identified by analyzing a single dataset.
- Published
- 2014
- Full Text
- View/download PDF
29. Special section on the 2013 IEEE Conference on Bioinformatics and Biomedicine (BIBM).
- Author
-
Hu B, Hu X, and Huang DS
- Subjects
- Biomedical Research, Genomics, Computational Biology, Medical Informatics
- Published
- 2014
- Full Text
- View/download PDF
30. Visualization of genetic disease-phenotype similarities by multiple maps t-SNE with Laplacian regularization.
- Author
-
Xu W, Jiang X, Hu X, and Li G
- Subjects
- Artificial Intelligence, Cluster Analysis, Databases, Factual, Humans, Algorithms, Computational Biology methods, Computer Graphics, Disease, Phenotype
- Abstract
Background: From a phenotypic standpoint, certain types of diseases may prove difficult to diagnose accurately, due to specific combinations of confounding symptoms. Referred to as phenotypic overlap, these sets of disease-related symptoms suggest shared pathophysiological mechanisms. Few attempts have been made to visualize the phenotypic relationships between different human diseases from a machine learning perspective. The proposed research, it is anticipated, will visually assist researchers in quickly disambiguating symptoms which can confound the timely and accurate diagnosis of a disease. Methods: Our method is primarily based on multiple maps t-SNE (mm-tSNE), which is a probabilistic method for visualizing data points in multiple low dimensional spaces. We improved mm-tSNE by adding a Laplacian regularization term and subsequently provide an algorithm for optimizing the new objective function. The advantage of Laplacian regularization is that it adopts clustering structures of variables and provides more sparsity to the estimated parameters. Results: In order to further assess our modified mm-tSNE algorithm from a comparative standpoint, we reexamined two social network datasets used by the previous authors. Subsequently, we applied our method to a phenotype dataset. In all these cases, our proposed method demonstrated better performance than the original version of mm-tSNE, as measured by the neighbourhood preservation ratio. Conclusions: Phenotype grouping reflects the nature of human disease genetics. Thus, phenotype visualization may be a complementary way to investigate candidate genes for diseases as well as functional relations between genes and proteins. These relationships can be modelled by the modified mm-tSNE method. The modified mm-tSNE can be applied directly in other domains, including social and biological datasets.
- Published
- 2014
- Full Text
- View/download PDF
31. Multilabel learning for protein subcellular location prediction.
- Author
-
Li GZ, Wang X, Hu X, Liu JM, and Zhao RW
- Subjects
- Algorithms, Databases, Protein, Humans, Artificial Intelligence, Computational Biology methods, Intracellular Space chemistry, Models, Biological, Models, Statistical, Proteins chemistry
- Abstract
Protein subcellular localization aims at predicting the location of a protein within a cell using computational methods. Knowledge of subcellular localization of proteins indicates protein functions and helps in identifying drug targets. Prediction of protein subcellular localization is an important but challenging problem, particularly when proteins may simultaneously exist at, or move between, two or more different subcellular location sites. Most of the existing protein subcellular localization methods deal only with single-location proteins. To better reflect the characteristics of multiplex proteins, we formulate prediction of subcellular localization of multiplex proteins as a multilabel learning problem. We present and compare two multilabel learning approaches, which exploit correlations between labels and leverage label-specific features, respectively, to induce a high quality prediction model. Experimental results on six protein data sets covering various organisms show that our described methods achieve significantly higher performance than any of the existing methods. Among the different multilabel learning methods, we find that methods exploiting label correlations perform better than those leveraging label-specific features.
- Published
- 2012
- Full Text
- View/download PDF
32. Exploiting the functional and taxonomic structure of genomic data by probabilistic topic modeling.
- Author
-
Chen X, Hu X, Lim TY, Shen X, Park EK, and Rosen GL
- Subjects
- Bacteria genetics, Gastrointestinal Tract microbiology, Genome, Bacterial, Humans, Inflammatory Bowel Diseases microbiology, Models, Genetic, Models, Statistical, Computational Biology methods, Data Mining methods, Databases, Genetic, Metagenomics methods
- Abstract
In this paper, we present a method that enables both the homology-based approach and the composition-based approach to further study the functional core (i.e., microbial core and gene core, correspondingly). In the proposed method, the identification of major functionality groups is achieved by generative topic modeling, which is able to extract useful information from unlabeled data. We first show that a generative topic model can be used to model the taxon abundance information obtained by the homology-based approach and to study the microbial core. The model considers each sample as a “document,” which has a mixture of functional groups, while each functional group (also known as a “latent topic”) is a weighted mixture of species. Therefore, estimating the generative topic model for taxon abundance data will uncover the distribution over latent functions (latent topics) in each sample. Second, we show that a generative topic model can also be used to study the genome-level composition of “N-mer” features (DNA subreads obtained by composition-based approaches). The model considers each genome as a mixture of latent genetic patterns (latent topics), while each functional pattern is a weighted mixture of the “N-mer” features; thus the existence of core genomes can be indicated by a set of common N-mer features. After studying the mutual information between latent topics and gene regions, we provide an explanation of the functional roles of the uncovered latent genetic patterns. The experimental results demonstrate the effectiveness of the proposed method.
- Published
- 2012
- Full Text
- View/download PDF
33. Modeling and mining term association for improving biomedical information retrieval performance.
- Author
-
Hu Q, Huang JX, and Hu X
- Subjects
- Genomics methods, Semantics, Algorithms, Computational Biology methods, Data Mining, Information Storage and Retrieval methods
- Abstract
Background: The growth of biomedical information requires most information retrieval systems to provide short and specific answers in response to complex user queries. Semantic information in the form of free text is structured in a way that makes it straightforward for humans to read but more difficult for computers to interpret automatically and search efficiently. One of the reasons is that most traditional information retrieval models assume terms are conditionally independent given a document/passage. Therefore, we are motivated to consider term associations within different contexts to help the models understand semantic information and use it for improving biomedical information retrieval performance. Results: We propose a term association approach to discover term associations among the keywords from a query. The experiments are conducted on the TREC 2004-2007 Genomics data sets and the TREC 2004 HARD data set. The proposed approach is promising and achieves superiority over the baselines and the GSP results. The parameter settings and different indices are investigated: the sentence-based index produces the best results at the document level, the word-based index at the passage level, and the paragraph-based index at the passage2 level. Furthermore, the best term association results always come from the best baseline. The tuning number k in the proposed recursive re-ranking algorithm is discussed and locally optimized to be 10. Conclusions: First, modelling term association for improving biomedical information retrieval using factor analysis is one of the major contributions of our work. Second, the experiments confirm that term association considering co-occurrence and dependency among the keywords can produce better results than the baselines, which treat the keywords independently. Third, the baselines are re-ranked according to the importance and reliance of latent factors behind term associations. These latent factors are decided by the proposed model and their term appearances in the first round retrieved passages.
- Published
- 2012
- Full Text
- View/download PDF
34. Dynamic biclustering of microarray data by multi-objective immune optimization.
- Author
-
Liu J, Li Z, Hu X, Chen Y, and Park EK
- Subjects
- B-Lymphocytes immunology, Cell Cycle, Cluster Analysis, Databases, Genetic, Gene Expression Profiling, Genome, Human, Humans, Oligonucleotide Array Sequence Analysis, Saccharomyces cerevisiae cytology, Time Factors, Algorithms, B-Lymphocytes cytology, Computational Biology methods, Data Mining methods, Genome, Fungal, Saccharomyces cerevisiae genetics
- Abstract
Background: New microarray technologies yield large-scale datasets. The microarray datasets are usually presented in 2D matrices, where rows represent genes and columns represent experimental conditions. Systematic analysis of those datasets provides an increasing amount of information, which is urgently needed in the post-genomic era. Biclustering, which is a technique developed to allow simultaneous clustering of rows and columns of a dataset, might be useful to extract more accurate information from those datasets. Biclustering requires the optimization of two conflicting objectives (residue and volume), for which a multi-objective artificial immune system capable of performing a multi-population search is well suited. As a heuristic search technique, artificial immune systems (AISs) can be considered a new computational paradigm inspired by the immunological system of vertebrates and designed to solve a wide range of optimization problems. During biclustering, several objectives in conflict with each other have to be optimized simultaneously, so a multi-objective optimization model is suitable for solving the biclustering problem. Results: Based on a dynamic population, this paper proposes a novel dynamic multi-objective immune optimization biclustering (DMOIOB) algorithm to mine coherent patterns from microarray data. Experimental results on two common and public datasets of gene expression profiles show that our approach can effectively find significant localized structures related to sets of genes that show consistent expression patterns across subsets of experimental conditions. The mined patterns present a significant biological relevance in terms of related biological processes, components and molecular functions in a species-independent manner. Conclusions: The proposed DMOIOB algorithm is an efficient tool to analyze large microarray datasets. It achieves good diversity and rapid convergence.
- Published
- 2011
- Full Text
- View/download PDF
35. Predicting gene function using few positive examples and unlabeled ones.
- Author
-
Chen Y, Li Z, Wang X, Feng J, and Hu X
- Subjects
- Algorithms, Databases, Genetic, Gene Expression Profiling, Protein Interaction Mapping, Saccharomyces cerevisiae genetics, Artificial Intelligence, Computational Biology methods, Genomics methods
- Abstract
Background: A large amount of functional genomic data has provided enough knowledge for predicting gene function computationally, using known functional annotations and the relationships between unknown genes and known ones to map unknown genes to GO functional terms. The prediction procedure is usually formulated as a binary classification problem. Training a binary classifier needs both positive and negative examples in almost equal numbers. However, from the various annotation databases, we can obtain only a few positive gene annotations for most functional terms; that is, there are only a few positive examples for training the classifier, which makes directly predicting gene function infeasible. Results: We propose a novel approach, SPE_RNE, to train a classifier for each functional term. Firstly, the positive example set is enlarged by creating synthetic positive examples. Secondly, representative negative examples are selected by training an SVM (support vector machine) iteratively to move the classification hyperplane to an appropriate place. Lastly, an optimal SVM classifier is trained using a grid search technique. On a combined kernel of yeast protein sequence, microarray expression, protein-protein interaction and GO functional annotation data, we compare SPE_RNE with three other typical methods on three classical performance measures, recall R, precision P and their combination F: twoclass considers all unlabeled genes as negative examples, twoclassbal randomly selects the same number of negative examples from the unlabeled genes, and PSoL selects a set of negative examples that are far from the positive examples and far from each other. Conclusions: On test data and unknown gene data, we compute the average and variance of measure F. The experiments show that our approach has better generalization performance and practical prediction capacity. In addition, our method can also be used for other organisms such as human.
- Published
- 2010
- Full Text
- View/download PDF
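The three performance measures named in the abstract of entry 35 (recall R, precision P and their combination F) can be computed as in the following minimal sketch. It assumes that F is the standard harmonic mean of precision and recall, which the abstract does not state explicitly; the label vectors are hypothetical.

```python
def precision_recall_f(true_labels, predicted_labels):
    """Return (precision, recall, F) for binary labels, where 1 is positive."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # Assumed definition: F as the harmonic mean of precision and recall.
    f_measure = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical annotations for one functional term (1 = gene has the term).
truth = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0]
p, r, f = precision_recall_f(truth, predicted)
```

With two true positives, one false positive and one false negative, precision, recall and F all equal 2/3 in this example.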
36. Exploratory analysis of protein translation regulatory networks using hierarchical random graphs.
- Author
-
Wu DD, Hu X, Park EK, Wang X, Feng J, and Wu X
- Subjects
- Algorithms, Databases, Protein, Fungal Proteins genetics, Fungal Proteins metabolism, Markov Chains, Monte Carlo Method, Protein Biosynthesis, Protein Interaction Domains and Motifs, RNA, Messenger chemistry, RNA, Messenger genetics, RNA, Messenger metabolism, Reproducibility of Results, Yeasts genetics, Yeasts metabolism, Cluster Analysis, Computational Biology methods, Fungal Proteins chemistry, Protein Interaction Mapping methods
- Abstract
Background: Protein translation is a vital cellular process for any living organism. The availability of interaction databases provides an opportunity for researchers to exploit the immense amount of data in silico, for example by studying biological networks. There has been an extensive effort to decipher transcriptional regulatory networks using computational methods. However, research on translation regulatory networks has received little attention in the bioinformatics and computational biology community. Results: In this paper, we present an exploratory analysis of yeast protein translation regulatory networks using hierarchical random graphs. We derive a protein translation regulatory network from a protein-protein interaction dataset. Using a hierarchical random graph model, we show that the network exhibits well organized hierarchical structure. In addition, we apply this technique to predict missing links in the network. Conclusions: The hierarchical random graph model can be a potentially useful technique for inferring hierarchical structure from network data and predicting missing links in partly known networks. The results from the reconstructed protein translation regulatory networks have potential implications for better understanding mechanisms of translational control from a systems perspective.
- Published
- 2010
- Full Text
- View/download PDF
37. Learning an enriched representation from unlabeled data for protein-protein interaction extraction.
- Author
-
Li Y, Hu X, Lin H, and Yang Z
- Subjects
- Algorithms, Chi-Square Distribution, Artificial Intelligence, Computational Biology methods, Data Mining methods, Protein Interaction Mapping
- Abstract
Background: Extracting protein-protein interactions from biomedical literature is an important task in biomedical text mining. Supervised machine learning methods have been used with great success in this task, but they tend to suffer from data sparseness because they are restricted to obtaining knowledge from a limited amount of labelled data. In this work, we study the use of unlabeled biomedical texts to enhance the performance of supervised learning for this task. We use feature coupling generalization (FCG), a recently proposed semi-supervised learning strategy, to learn an enriched representation of local contexts in sentences from 47 million unlabeled examples and investigate the performance of the new features on the AIMED corpus. Results: The new features generated by FCG achieve a 60.1 F-score and produce significant improvement over supervised baselines. The experimental analysis shows that FCG can make good use of the sparse features which have little effect in supervised learning. The new features perform better in non-linear classifiers than in linear ones. We combine the new features with local lexical features, obtaining an F-score of 63.5 on the AIMED corpus, which is comparable with the current state-of-the-art results. We also find that simple Boolean lexical features derived only from local contexts are able to achieve competitive results against most syntactic feature/kernel based methods. Conclusions: FCG creates many opportunities for designing new features, since many sparse features ignored by supervised learning can be put to good use. Interestingly, our results also demonstrate that state-of-the-art performance can be achieved without using any syntactic information in this task.
- Published
- 2010
- Full Text
- View/download PDF
38. Recursive fuzzy granulation for gene subsets extraction and cancer classification.
- Author
-
Tang Y, Zhang YQ, Huang Z, Hu X, and Zhao Y
- Subjects
- Algorithms, Artificial Intelligence, Databases, Genetic, Gene Expression Profiling methods, Humans, Male, Prostatic Neoplasms classification, Prostatic Neoplasms genetics, Computational Biology methods, Fuzzy Logic, Neoplasms classification, Neoplasms genetics, Oligonucleotide Array Sequence Analysis methods
- Abstract
A typical microarray gene expression dataset is usually both extremely sparse and imbalanced. To select multiple highly informative gene subsets for cancer classification and diagnosis, a new Fuzzy Granular Support Vector Machine-Recursive Feature Elimination algorithm (FGSVM-RFE) is designed in this paper. As a hybrid algorithm of statistical learning, fuzzy clustering, and granular computing, the FGSVM-RFE separately eliminates irrelevant, redundant, or noisy genes in different granules at different stages and selects highly informative genes with potentially different biological functions in balance. Empirical studies on three public datasets demonstrate that the FGSVM-RFE outperforms state-of-the-art approaches. Moreover, the FGSVM-RFE can extract multiple gene subsets on each of which a classifier can be modeled with 100% accuracy. Specifically, the independent testing accuracy for the prostate cancer dataset is significantly improved. The previous best result is 86% with 16 genes and our best result is 100% with only eight genes. The identified genes are annotated by Onto-Express to be biologically meaningful.
- Published
- 2008
- Full Text
- View/download PDF
39. Biomedical ontology improves biomedical literature clustering performance: a comparison study.
- Author
-
Yoo I, Hu X, and Song IY
- Subjects
- Cluster Analysis, MEDLINE, Medical Subject Headings, Computational Biology, Information Storage and Retrieval, Publications
- Abstract
Document clustering has been used for better document retrieval and text mining. In this paper, we investigate whether a biomedical ontology improves biomedical literature clustering performance in terms of effectiveness and scalability. For this investigation, we perform a comprehensive comparison study of various document clustering approaches such as hierarchical clustering methods, Bisecting K-means, K-means and Suffix Tree Clustering (STC). According to our experimental results, a biomedical ontology significantly enhances clustering quality on biomedical documents. In addition, our results show that decent document clustering approaches, such as Bisecting K-means, K-means and STC, gain some benefit from the ontology, while hierarchical algorithms, which show the poorest clustering quality, do not reap the benefit of the biomedical ontology.
- Published
- 2007
- Full Text
- View/download PDF
40. GE-Miner: integration of cluster ensemble and text mining for comprehensive gene expression analysis.
- Author
-
Hu X
- Subjects
- Algorithms, Cluster Analysis, Computers, Database Management Systems, Databases, Protein, Gene Expression, Information Storage and Retrieval, Natural Language Processing, Oligonucleotide Array Sequence Analysis, Pattern Recognition, Automated, Software, Computational Biology methods, Gene Expression Profiling, Multigene Family
- Abstract
Generating high quality gene clusters and identifying the underlying biological mechanism of the gene clusters are the important goals of clustering gene expression analysis. Based on this consideration, we design and develop a unified system Gene Expression Miner (GE-Miner) by integrating cluster ensemble, text clustering and multidocument summarisation and provide an environment for comprehensive gene expression data analysis. Experimental results demonstrate that our systems can obtain high quality clusters and provide concise and informative textual summary for the gene clusters.
- Published
- 2006
- Full Text
- View/download PDF
41. Mining and analysing scale-free protein-protein interaction network.
- Author
-
Hu X
- Subjects
- Algorithms, Artificial Intelligence, Cluster Analysis, Information Storage and Retrieval, Models, Statistical, Models, Theoretical, Natural Language Processing, Research, Chromatin chemistry, Computational Biology methods, Protein Interaction Mapping methods
- Abstract
Protein-protein interaction network is essential to understand the fundamental processes that govern cell biology. In this paper, we integrate information extraction and data mining techniques to extract and mine the scale-free protein-protein interaction network from biomedical literature. The experiments on around 1,600 chromatin proteins indicate that our system is very promising for mining and analysing protein-protein interaction network.
- Published
- 2005
- Full Text
- View/download PDF
42. A Review of Artificial Intelligence Applications in Bacterial Genomics
- Author
-
Xie, Jianghang, Zhang, Le, Xiao, Ming, Park, Taesung, Cho, Young-Rae, Hu, Xiaohua Tony, Yoo, Illhoi, Woo, Hyun Goo, Wang, Jianxin, Facelli, Julio, Nam, Seungyoon, Kang, Mingon, Amsterdam Gastroenterology Endocrinology Metabolism, Graduate School, Center of Experimental and Molecular Medicine, and AII - Infectious diseases
- Subjects
- 0303 health sciences, Mechanism (biology), Computer science, Bacterial genomics, Gene prediction, 0206 medical engineering, Genomics, 02 engineering and technology, Computational biology, Bacterial genome size, Genome, 03 medical and health sciences, Applications of artificial intelligence, Gene, 020602 bioinformatics, 030304 developmental biology
- Abstract
Because of the different genetic structures and functional gene diversity, bacterial genome data have high complexity and dimensions. Therefore, it is difficult to reveal the sequence patterns and biological mechanism of a genome with classical analysis methods. Since artificial intelligence (AI) applications are capable of mining key biological information from massive multidimensional data, they are broadly employed to analyze bacterial genomes. However, to our knowledge, there are few systematic reviews that illustrate these AI applications in bacterial genomics research. Therefore, we first introduce the characteristics of bacterial genomics, and then briefly summarize AI applications in bacterial genomics research from three aspects: gene finding, gene function prediction and gene expression network construction. Finally, we discuss the challenges and future AI applications in bacterial genomics research.
- Published
- 2020
43. A Framework for Semisupervised Feature Generation and Its Applications in Biomedical Literature Mining.
- Author
-
Li, Yanpeng, Hu, Xiaohua, Lin, Hongfei, and Yang, Zhihao
- Abstract
Feature representation is essential to machine learning and text mining. In this paper, we present a feature coupling generalization (FCG) framework for generating new features from unlabeled data. It selects two special types of features, i.e., example-distinguishing features (EDFs) and class-distinguishing features (CDFs), from the original feature set, and then generalizes EDFs into higher-level features based on their coupling degrees with CDFs in unlabeled data. The advantage is that EDFs with extreme sparsity in labeled data can be enriched by their co-occurrences with CDFs in unlabeled data, so that the performance of these low-frequency features can be greatly boosted and new information from unlabeled data can be incorporated. We apply this approach to three tasks in biomedical literature mining: gene named entity recognition (NER), protein-protein interaction extraction (PPIE), and text classification (TC) for gene ontology (GO) annotation. New features are generated from over 20 GB of unlabeled PubMed abstracts. The experimental results on BioCreative 2, AIMED corpus, and TREC 2005 Genomics Track show that 1) FCG can utilize well the sparse features ignored by supervised learning; 2) it improves the performance of supervised baselines by 7.8 percent, 5.0 percent, and 5.8 percent, respectively, in the three tasks; and 3) our methods achieve 89.1, 64.5 F-score, and 60.1 normalized utility on the three benchmark data sets. [ABSTRACT FROM PUBLISHER]
- Published
- 2011
- Full Text
- View/download PDF
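The coupling idea described in the abstract of record 43 can be illustrated with a minimal sketch. This is not the authors' implementation: the coupling measure used here (the conditional frequency of a CDF given an EDF in unlabeled examples), the averaging step, and the names `coupling_degrees` and `generalize` are all assumptions made for illustration, and the published framework may define coupling degrees differently.

```python
from collections import Counter
from itertools import product


def coupling_degrees(unlabeled, edfs, cdfs):
    """Estimate P(cdf present | edf present) from unlabeled examples.

    Each example is a set of feature names. The conditional-frequency
    measure is an illustrative assumption, not the paper's definition.
    """
    edf_counts = Counter()
    pair_counts = Counter()
    for example in unlabeled:
        present_edfs = edfs & example
        present_cdfs = cdfs & example
        edf_counts.update(present_edfs)
        pair_counts.update(product(present_edfs, present_cdfs))
    return {
        (e, c): pair_counts[(e, c)] / edf_counts[e]
        for e in edfs if edf_counts[e] > 0
        for c in cdfs
    }


def generalize(example, coupling, cdfs):
    """Replace sparse EDFs by one dense value per CDF.

    The value is the mean coupling degree of the example's EDFs with that
    CDF (0.0 if none of its EDFs was seen in the unlabeled data).
    """
    scores = {}
    for c in cdfs:
        values = [v for (e, cc), v in coupling.items() if cc == c and e in example]
        scores[c] = sum(values) / len(values) if values else 0.0
    return scores


# Toy data: feature names are hypothetical.
unlabeled = [
    {"kinase", "binds"},
    {"kinase", "binds"},
    {"kinase", "weather"},
    {"phosphatase", "binds"},
]
edfs = {"kinase", "phosphatase"}
cdfs = {"binds"}
coupling = coupling_degrees(unlabeled, edfs, cdfs)
new_features = generalize({"kinase"}, coupling, cdfs)
```

In this toy setting a rare word such as "phosphatase" receives a usable dense score from its co-occurrence with "binds" in unlabeled text, which is the enrichment effect the abstract describes.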
44. Weighted gene co-expression network analysis of microarray mRNA expression profiling in response to electroacupuncture
- Author
-
Afsaneh Mohammadnejad, Weilong Li, Jesper Lund, Shuxia Li, Qihua Tan, Jan Baumbach, Hongmei Duan, Schmidt, Harald, Griol, David, Wang, Haiying, Baumbach, Jan, Zheng, Huiru, Callejas, Zoraida, Hu, Xiaohua, Dickerson, Julie, and Zhang, Le
- Subjects
- 0301 basic medicine, wGCNA, Microarray, hub genes, Computational biology, Biology, Gene expression profiling, Biological pathway, 03 medical and health sciences, 030104 developmental biology, Electroacupuncture, Interaction network, Gene expression, Gene chip analysis, Gene co-expression network, Analgesia, Gene
- Abstract
Electroacupuncture (EA) has been extensively considered as a tool for treating diseases and relieving various pains. However, understanding the molecular mechanisms underlying its effect is of high importance. In this study, we performed a weighted gene co-expression network analysis (WGCNA) on data collected from a microarray experiment to investigate the relationship underlying EA within three factors, time, frequency and tissue regions (periaqueductal gray (PAG) and spinal dorsal horn (DH)), as well as the biological implication of gene expression changes. Gene expression in rats in PAG-DH regions induced by EA with 2 Hz and 100 Hz at 1 h and 24 h was measured using microarray technology. The WGCNA was performed to identify distinct network modules related to EA effects. To find the biological function of genes and pathways, the gene ontology (GO) Consortium was applied and the gene-gene interaction network of top genes in important modules was visualized. We identified one network module (466 genes) which is significantly associated with time, another module (402 genes) significantly related to frequency, and three modules each consisting of 1144, 402 and 3148 genes that are significantly associated with tissue regions. Furthermore, meaningful biological pathways were enriched in association with each of the experimental factors during EA stimulation. Our analysis showed the robustness of WGCNA and revealed important genes within specific modules and pathways which might be activated in response to EA analgesia. The findings may help to clarify the underlying mechanisms of EA and provide references for future verification of this study.
- Published
- 2018
45. Cis-regulatory module detection using constraint programming
- Author
-
Guns, Tias, Sun, Hong, Marchal, Kathleen, Nijssen, Siegfried, 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), UCL - SST/ICTM/INGI - Pôle en ingénierie informatique, Park, Taesung, Chen, Luonan, Wong, Limsoon, Tsui, S, Ng, M, Hu, Xiaohua, Park, T, Chen, L, Wong, L, Hu, X, Data Analytics Laboratory, Business technology and Operations, and Electromobility research centre
- Subjects
- constraint programming, Computer science, Health Informatics, Genomics, Computational biology, itemset mining, Set (abstract data type), biomedical engineering, Redundancy (engineering), molecular biophysics, genetics, Binding site, Gene, Transcription factor, Cis-regulatory module, SISTA, Biology and Life Sciences, bioinformatics, data mining, constraint handling
- Abstract
We propose a method for finding cis-regulatory modules (CRMs) in a set of co-regulated genes. Each CRM consists of a set of binding sites of transcription factors. We wish to find CRMs involving the same transcription factors in multiple sequences. Finding such a combination of transcription factors is inherently a combinatorial problem. We solve this problem by combining the principles of itemset mining and constraint programming. The constraints involve the putative binding sites of transcription factors, the number of sequences in which they co-occur and the proximity of the binding sites. Genomic background sequences are used to assess the significance of the modules. We experimentally validate our approach and compare it with state-of-the-art techniques.
Acceptance rate: 17.2%. Published in: Proc. of IEEE International Conference on Bioinformatics & Biomedicine, pages 363-368. Location: Hong Kong. Date: 18 Dec - 21 Dec 2010. Status: published.
- Published
- 2010
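The three constraints named in the abstract of record 45 (which transcription factors are involved, how many sequences they co-occur in, and how close their binding sites are) can be sketched as a brute-force frequent-itemset search. This is an illustration only: it enumerates candidate sets exhaustively instead of using a constraint programming solver, it omits the background-sequence significance test, and the function names, the `max_span` window and the toy data are assumptions rather than anything taken from the paper.

```python
from itertools import combinations


def sequences_supporting(tf_set, sequences, max_span):
    """Count sequences with one site of every TF in tf_set within max_span bases."""
    wanted = set(tf_set)
    count = 0
    for sites in sequences:  # sites: list of (tf_name, position)
        relevant = sorted((pos, tf) for tf, pos in sites if tf in wanted)
        # Try each site as the left edge of a window of width max_span.
        for i, (start, _) in enumerate(relevant):
            covered = {tf for pos, tf in relevant[i:] if pos - start <= max_span}
            if covered == wanted:
                count += 1
                break
    return count


def frequent_modules(sequences, min_support, max_span, max_size=3):
    """Return {tf_set: support} for TF sets meeting both constraints."""
    tfs = sorted({tf for sites in sequences for tf, _ in sites})
    result = {}
    for size in range(2, max_size + 1):
        for tf_set in combinations(tfs, size):
            support = sequences_supporting(tf_set, sequences, max_span)
            if support >= min_support:
                result[tf_set] = support
    return result


# Toy putative binding sites per sequence: (hypothetical TF name, position).
sequences = [
    [("A", 10), ("B", 50), ("C", 400)],
    [("A", 100), ("B", 120)],
    [("A", 5), ("B", 900)],
]
modules = frequent_modules(sequences, min_support=2, max_span=100)
```

Exhaustive enumeration grows combinatorially with the number of transcription factors, which is the reason the abstract gives for treating the task as a combinatorial problem and handing it to a constraint solver.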