32 results on '"Guan, Jihong"'
Search Results
2. Identifying essential proteins from protein–protein interaction networks based on influence maximization
- Author
-
Xu, Weixia, Dong, Yunfeng, Guan, Jihong, and Zhou, Shuigeng
- Published
- 2022
- Full Text
- View/download PDF
3. Boosting scRNA-seq data clustering by cluster-aware feature weighting
- Author
-
Li, Rui-Yi, Guan, Jihong, and Zhou, Shuigeng
- Published
- 2021
- Full Text
- View/download PDF
4. Protein–protein interaction prediction based on ordinal regression and recurrent convolutional neural networks
- Author
-
Xu, Weixia, Gao, Yangyun, Wang, Yang, and Guan, Jihong
- Published
- 2021
- Full Text
- View/download PDF
5. Computationally identifying hot spots in protein-DNA binding interfaces using an ensemble approach
- Author
-
Pan, Yuliang, Zhou, Shuigeng, and Guan, Jihong
- Published
- 2020
- Full Text
- View/download PDF
6. DEEPSEN: a convolutional neural network based method for super-enhancer prediction
- Author
-
Bu, Hongda, Hao, Jiaqi, Gan, Yanglan, Zhou, Shuigeng, and Guan, Jihong
- Published
- 2019
- Full Text
- View/download PDF
7. A new and effective two-step clustering approach for single cell RNA sequencing data.
- Author
-
Li, Ruiyi, Guan, Jihong, Wang, Zhiye, and Zhou, Shuigeng
- Subjects
- *RNA sequencing, *HIERARCHICAL clustering (Cluster analysis), *NATURAL immunity, *DRUG resistance, *CLUSTER analysis (Statistics)
- Abstract
Background: The rapid development of single cell RNA sequencing (scRNA-seq) technology leads to huge amounts of scRNA-seq data, which greatly advance the research of many biomedical fields involving tissue heterogeneity, pathogenesis of disease and drug resistance etc. One major task in scRNA-seq data analysis is to cluster cells in terms of their expression characteristics. Up to now, a number of methods have been proposed to infer cell clusters, yet there is still much room to improve their performance. Results: In this paper, we develop a new two-step clustering approach to effectively cluster scRNA-seq data, which is called TSC (Two-Step Clustering). Specifically, we divide all cells into two types: core cells (those possibly lying around the centers of clusters) and non-core cells (those located in the boundary areas of clusters). We first cluster the core cells by hierarchical clustering (the first step) and then assign the non-core cells to the corresponding nearest clusters (the second step). Extensive experiments on 12 real scRNA-seq datasets show that TSC outperforms the state-of-the-art methods. Conclusion: TSC is an effective clustering method due to its two-step clustering strategy, and it is a useful tool for scRNA-seq data analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
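As a rough illustration of the two-step idea described in the TSC abstract (entry 7), the Python sketch below clusters core cells hierarchically and then assigns the remaining cells to the nearest cluster. It is not the authors' implementation. The abstract does not say how core cells are identified, so the density rule (`radius`, `min_neighbors`), the single-linkage merge and the centroid-based assignment are all assumptions made for this example.

```python
import math


def two_step_cluster(points, k, radius, min_neighbors):
    """Toy two-step clustering: cluster core points, then attach the rest.

    Assumption: a point is "core" when at least `min_neighbors` other points
    lie within `radius` of it. The TSC abstract does not define this rule.
    """
    n = len(points)
    core = [
        i for i in range(n)
        if sum(1 for j in range(n)
               if j != i and math.dist(points[i], points[j]) <= radius) >= min_neighbors
    ]
    if not core:
        raise ValueError("no core points found; relax radius or min_neighbors")

    # Step 1: agglomerative (single-linkage) clustering of the core points,
    # merging the two closest clusters until k clusters remain.
    clusters = [[i] for i in core]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))

    # Step 2: assign every non-core point to the cluster with the nearest centroid.
    dims = len(points[0])
    centroids = [
        tuple(sum(points[i][d] for i in c) / len(c) for d in range(dims))
        for c in clusters
    ]
    labels = [None] * n
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    for i in range(n):
        if labels[i] is None:
            labels[i] = min(range(len(clusters)),
                            key=lambda c: math.dist(points[i], centroids[c]))
    return labels, core
```

On real scRNA-seq data the points would be cells in a reduced expression space; the quadratic loops here are only suitable for small toy inputs.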
8. Genome-wide analysis of epigenetic dynamics across human developmental stages and tissues
- Author
-
Zhang, Xia, Gan, Yanglan, Zou, Guobing, Guan, Jihong, and Zhou, Shuigeng
- Published
- 2019
- Full Text
- View/download PDF
9. Identification of cancer subtypes from single-cell RNA-seq data using a consensus clustering method
- Author
-
Gan, Yanglan, Li, Ning, Zou, Guobing, Xin, Yongchang, and Guan, Jihong
- Published
- 2018
- Full Text
- View/download PDF
10. Classifying early and late mild cognitive impairment stages of Alzheimer’s disease by fusing default mode networks extracted with multiple seeds
- Author
-
Pei, Shengbing, Guan, Jihong, and Zhou, Shuigeng
- Published
- 2018
- Full Text
- View/download PDF
11. CPredictor3.0: detecting protein complexes from PPI networks with expression data and functional annotations.
- Author
-
Xu Y, Zhou J, Zhou S, and Guan J
- Subjects
- Cluster Analysis, Supervised Machine Learning, Gene Expression Profiling, Molecular Sequence Annotation, Protein Interaction Mapping, Proteins genetics, Proteins metabolism
- Abstract
Background: Effectively predicting protein complexes not only helps to understand the structures and functions of proteins and their complexes, but also is useful for diagnosing disease and developing new drugs. Up to now, many methods have been developed to detect complexes by mining dense subgraphs from static protein-protein interaction (PPI) networks, while ignoring the value of other biological information and the dynamic properties of cellular systems. Results: In this paper, based on our previous works CPredictor and CPredictor2.0, we present a new method for predicting complexes from PPI networks with both gene expression data and protein functional annotations, which is called CPredictor3.0. This new method follows the viewpoint that proteins in the same complex should roughly have similar functions and are active at the same time and place in cellular systems. We first detect active proteins by using gene expression data of different time points and cluster proteins by using gene ontology (GO) functional annotations, respectively. Then, for each time point, we compute set intersections, with one set corresponding to active proteins generated from expression data and the other set corresponding to a protein cluster generated from functional annotations. Each resulting unique set indicates a cluster of proteins that have similar function(s) and are active at that time point. Following that, we map each cluster of active proteins of similar function onto a static PPI network, and get a series of induced connected subgraphs. We treat these subgraphs as candidate complexes. Finally, by expanding and merging these candidate complexes, the predicted complexes are obtained. We evaluate CPredictor3.0 and compare it with a number of existing methods on several PPI networks and benchmarking complex datasets. The experimental results show that CPredictor3.0 achieves the highest F1-measure, which indicates that CPredictor3.0 outperforms these existing methods overall. Conclusion: CPredictor3.0 can serve as a promising tool for protein complex prediction.
- Published
- 2017
- Full Text
- View/download PDF
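The intersection and subgraph steps described in the CPredictor3.0 abstract (entry 11) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input names (`active_by_time`, `functional_clusters`, `ppi_edges`) and the `min_size` filter are hypothetical, and the final expanding and merging step is left out because the abstract does not specify it.

```python
from collections import deque


def connected_components(nodes, edges):
    """Connected components of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    adj = {n: set() for n in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:
            adj[u].add(v)
            adj[v].add(u)
    seen, comps = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        seen.add(start)
        while queue:  # breadth-first search from `start`
            u = queue.popleft()
            comp.add(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        comps.append(frozenset(comp))
    return comps


def candidate_complexes(active_by_time, functional_clusters, ppi_edges, min_size=2):
    """Intersect active proteins with functional clusters per time point,
    then keep the connected pieces of each intersection in the PPI network.

    `min_size` is an assumption of this sketch, not taken from the abstract.
    """
    candidates = set()
    for active in active_by_time.values():
        for cluster in functional_clusters:
            shared = set(active) & set(cluster)
            for comp in connected_components(shared, ppi_edges):
                if len(comp) >= min_size:
                    candidates.add(comp)  # the set removes duplicate candidates
    return candidates
```

Storing candidates as frozensets in a set mirrors the abstract's "each resulting unique set": the same protein group found at two time points is kept once.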
12. Fusing multiple protein-protein similarity networks to effectively predict lncRNA-protein interactions.
- Author
-
Zheng X, Wang Y, Tian K, Zhou J, Guan J, Luo L, and Zhou S
- Subjects
- Area Under Curve, Humans, RNA, Long Noncoding genetics, ROC Curve, Proteins metabolism, RNA, Long Noncoding metabolism, Sequence Homology, Amino Acid
- Abstract
Background: Long non-coding RNA (lncRNA) plays important roles in many biological and pathological processes, including transcriptional regulation and gene regulation. As lncRNA interacts with multiple proteins, predicting lncRNA-protein interactions (lncRPIs) is an important way to study the functions of lncRNA. Up to now, there have been a few works that exploit protein-protein interactions (PPIs) to help the prediction of new lncRPIs. Results: In this paper, we propose to boost the prediction of lncRPIs by fusing multiple protein-protein similarity networks (PPSNs). Concretely, we first construct four PPSNs based on protein sequences, protein domains, protein GO terms and the STRING database respectively, then build a more informative PPSN by fusing these four constructed PPSNs. Finally, we predict new lncRPIs by a random walk method with the fused PPSN and known lncRPIs. Our experimental results show that the new approach outperforms the existing methods. Conclusion: Fusing multiple protein-protein similarity networks can effectively boost the performance of predicting lncRPIs.
- Published
- 2017
- Full Text
- View/download PDF
13. A new method for enhancer prediction based on deep belief network.
- Author
-
Bu H, Gan Y, Wang Y, Zhou S, and Guan J
- Subjects
- Databases, Genetic, Humans, ROC Curve, Algorithms, Computational Biology methods, Enhancer Elements, Genetic
- Abstract
Background: Studies have shown that enhancers are significant regulatory elements that play crucial roles in gene expression regulation. Since enhancers act independently of their orientation and distance relative to their target genes, it is a challenging mission for scholars and researchers to accurately predict distal enhancers. In the past years, with the development of high-throughput ChIP-seq technologies, several computational techniques have emerged to predict enhancers using epigenetic or genomic features. Nevertheless, the inconsistency of computational models across different cell-lines and the unsatisfactory prediction performance call for further research in this area. Results: Here, we propose a new Deep Belief Network (DBN) based computational method for enhancer prediction, which is called EnhancerDBN. This method combines diverse features, composed of DNA sequence compositional features, DNA methylation and histone modifications. Our computational results indicate that 1) EnhancerDBN outperforms 13 existing methods in prediction, and 2) GC content and DNA methylation can serve as relevant features for enhancer prediction. Conclusion: Deep learning is effective in boosting the performance of enhancer prediction.
- Published
- 2017
- Full Text
- View/download PDF
14. An effective approach to detecting both small and large complexes from protein-protein interaction networks.
- Author
-
Xu B, Wang Y, Wang Z, Zhou J, Zhou S, and Guan J
- Subjects
- Algorithms, Cluster Analysis, Databases, Protein, Multiprotein Complexes metabolism, Protein Interaction Mapping methods, Protein Interaction Maps, Saccharomyces cerevisiae metabolism, Saccharomyces cerevisiae Proteins metabolism
- Abstract
Background: Predicting protein complexes from protein-protein interaction (PPI) networks has been studied for a decade. Various methods have been proposed to address some challenging issues of this problem, including overlapping clusters, high false positive/negative rates of PPI data and diverse complex structures. It is well known that most current methods can detect effectively only complexes of size ≥3, which account for only about half of the total existing complexes. Recently, a method was proposed specifically for finding small complexes (size = 2 and 3) from PPI networks. However, up to now there is no effective approach that can predict both small (size ≤ 3) and large (size >3) complexes from PPI networks. Results: In this paper, we propose a novel method, called CPredictor2.0, that can detect both small and large complexes under a unified framework. Concretely, we first group proteins of similar functions. Then, the Markov clustering algorithm is employed to discover clusters in each group. Finally, we merge all discovered clusters that overlap with each other to a certain degree, and the merged clusters as well as the remaining clusters constitute the set of detected complexes. Extensive experiments have shown that the new method can more effectively predict both small and large complexes, in comparison with the state-of-the-art methods. Conclusions: The proposed method, CPredictor2.0, can be applied to accurately predict both small and large protein complexes.
- Published
- 2017
- Full Text
- View/download PDF
15. Selecting high-quality negative samples for effectively predicting protein-RNA interactions.
- Author
-
Cheng Z, Huang K, Wang Y, Liu H, Guan J, and Zhou S
- Subjects
- Algorithms, Protein Binding, Computational Biology methods, RNA metabolism, RNA-Binding Proteins metabolism
- Abstract
Background: The identification of Protein-RNA Interactions (PRIs) is important to understanding cell activities. Recently, several machine learning-based methods have been developed for identifying PRIs. However, the performance of these methods is unsatisfactory. One major reason is that they usually use unreliable negative samples in the training process. Methods: For boosting the performance of PRI prediction, we propose a novel method to generate reliable negative samples. Concretely, we first collect the known PRIs as positive samples for generating positive sets. For each positive set, we construct two corresponding negative sets, one by our method and the other by a random method. Each positive set is combined with a negative set to form a dataset for model training and performance evaluation. Consequently, we get 18 datasets of different species and different ratios of negative samples to positive samples. Secondly, sequence-based features are extracted to represent each of the PRIs and protein-RNA pairs in the datasets. A filter-based method is employed to cut down the dimensionality of feature vectors for reducing computational cost. Finally, the performance of support vector machine (SVM), random forest (RF) and naive Bayes (NB) is evaluated on the generated 18 datasets. Results: Extensive experiments show that, compared with using randomly-generated negative samples, all classifiers achieve substantial performance improvement by using negative samples selected by our method. The improvements on accuracy and geometric mean for the SVM classifier, the RF classifier and the NB classifier are as high as 204.5 and 68.7%, 174.5 and 53.9%, 80.9 and 54.3%, respectively. Conclusion: Our method is useful to the identification of PRIs.
- Published
- 2017
- Full Text
- View/download PDF
16. iHMS: a database integrating human histone modification data across developmental stages and tissues.
- Author
-
Gan Y, Tao H, Guan J, and Zhou S
- Subjects
- Chromatin metabolism, CpG Islands, Humans, Internet, Protein Processing, Post-Translational, Databases, Genetic, Histones metabolism, User-Computer Interface
- Abstract
Background: Differences in chromatin states are critical to the multiplicity of cell states. Recently, genome-wide histone modification maps of diverse human developmental stages and tissues have been charted. Description: To facilitate the investigation of epigenetic dynamics and regulatory mechanisms in cellular differentiation processes, we developed iHMS, an integrated human histone modification database that incorporates massive histone modification maps spanning different developmental stages, lineages and tissues ( http://www.tongjidmb.com/human/index.html ). It also includes genome-wide expression data of different conditions, reference gene annotations, GC content and CpG island information. By providing an intuitive and user-friendly query interface, iHMS enables comprehensive query and comparative analysis based on gene names, genomic region locations, histone modification marks and cell types. Moreover, it offers an efficient browser that allows users to visualize and compare multiple genome-wide histone modification maps and related expression profiles across different developmental stages and tissues. Conclusion: iHMS is very helpful for understanding how global histone modification state transitions impact cellular phenotypes across different developmental stages and tissues in the human genome. This extensive catalog of histone modification states thus presents an important resource for epigenetic and developmental studies.
- Published
- 2017
- Full Text
- View/download PDF
17. Screening lifespan-extending drugs in Caenorhabditis elegans via label propagation on drug-protein networks.
- Author
-
Liu H, Guo M, Xue T, Guan J, Luo L, and Zhuang Z
- Subjects
- Aging drug effects, Algorithms, Animals, Caenorhabditis elegans metabolism, Drug Evaluation, Preclinical, Internet, Protein Binding, Support Vector Machine, Caenorhabditis elegans drug effects, Caenorhabditis elegans physiology, Caenorhabditis elegans Proteins metabolism, Computational Biology methods, Longevity drug effects
- Abstract
Background: One of the most challenging tasks in the exploration of anti-aging is to discover drugs that can promote longevity and delay the incidence of age-associated diseases in humans. To date, a number of drugs, including some antioxidants, metabolites and synthetic compounds, have been found to effectively delay the aging of nematodes and insects. Results: We proposed a label propagation algorithm on a drug-protein network to infer drugs that can extend the lifespan of C. elegans. We collected a set of drugs whose effects on lifespan extension of C. elegans have been reliably determined, and then built a large-scale drug-protein network by collecting a set of high-confidence drug-protein interactions. A label propagation algorithm was run on the drug-protein bipartite network to predict new drugs with lifespan-extending effect on C. elegans. We calibrated the performance of the proposed method by conducting performance comparison with two classical models, kNN and SVM. We also showed that the screened drugs significantly mediate in the aging-related pathways, and have higher chemical similarities to the effective drugs than ineffective drugs in promoting longevity of C. elegans. Moreover, we carried out wet-lab experiments to verify a screened drug, 2-Bromo-4'-nitroacetophenone, and found that it can effectively extend the lifespan of C. elegans. These results showed that our method is effective in screening lifespan-extending drugs in C. elegans. Conclusions: In this paper, we proposed a semi-supervised algorithm to predict drugs with lifespan-extending effects on C. elegans. In silico empirical evaluations and in vivo experiments in C. elegans have demonstrated that our method can effectively narrow down the scope of candidate drugs needed to be verified by wet lab experiments.
- Published
- 2016
- Full Text
- View/download PDF
18. Inferring new indications for approved drugs via random walk on drug-disease heterogenous networks.
- Author
-
Liu H, Song Y, Guan J, Luo L, and Zhuang Z
- Subjects
- Humans, Metabolic Networks and Pathways, Algorithms, Computational Biology methods, Drug Repositioning methods, Precision Medicine
- Abstract
Background: Since traditional drug research and development is often time-consuming and high-risk, there is an increasing interest in establishing new medical indications for approved drugs, referred to as drug repositioning, which provides a relatively low-cost and high-efficiency approach for drug discovery. With the explosive growth of large-scale biochemical and phenotypic data, drug repositioning holds great potential for precision medicine in the post-genomic era. It is urgent to develop rational and systematic approaches to predict new indications for approved drugs on a large scale. Results: In this paper, we propose the two-pass random walks with restart on a heterogenous network, TP-NRWRH for short, to predict new indications for approved drugs. Rather than random walk on a bipartite network, we integrated the drug-drug similarity network, disease-disease similarity network and known drug-disease association network into one heterogenous network, on which the two-pass random walks with restart is implemented. We have conducted performance evaluation on two datasets of drug-disease associations, and the results show that our method has higher performance than six existing methods. A case study on Alzheimer's disease showed that nine of the top 10 predicted drugs have been approved or are investigational for neurodegenerative diseases. The experimental results show that our method achieves state-of-the-art performance in predicting new indications for approved drugs. Conclusions: We proposed a two-pass random walk with restart on the drug-disease heterogeneous network, referred to as TP-NRWRH, to predict new indications for approved drugs. Performance evaluation on two independent datasets showed that TP-NRWRH achieved higher performance than six existing methods on 10-fold cross validations. The case study on Alzheimer's disease showed that nine of the top 10 predicted drugs have been approved or are investigational for neurodegenerative diseases. The results show that our method achieves state-of-the-art performance in predicting new indications for approved drugs.
- Published
- 2016
- Full Text
- View/download PDF
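The abstract of entry 18 builds on random walk with restart (RWR) over a network that mixes drug-drug similarity, disease-disease similarity and drug-disease association edges. The Python sketch below shows plain RWR on such a weighted graph, iterating p ← (1 − r)·W·p + r·p0 until the scores stop changing. It does not reproduce the two-pass scheme of TP-NRWRH, whose details are not given in the abstract; the graph encoding, the `restart` value and the handling of isolated nodes are assumptions of this example.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Plain random walk with restart on a weighted graph.

    adj   : dict mapping node -> dict of neighbour -> edge weight
            (every neighbour must also appear as a key of `adj`)
    seeds : set of nodes the walker restarts from
    Returns a dict of steady-state visiting probabilities.
    """
    nodes = sorted(adj)
    out_weight = {u: sum(adj[u].values()) for u in nodes}
    p0 = {u: (1.0 / len(seeds) if u in seeds else 0.0) for u in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {u: restart * p0[u] for u in nodes}  # restart mass
        for u in nodes:
            if out_weight[u] == 0:
                # Assumption: a node with no edges sends its mass back to the seeds.
                for s in seeds:
                    nxt[s] += (1 - restart) * p[u] / len(seeds)
                continue
            for v, w in adj[u].items():
                # Move from u to v in proportion to the edge weight.
                nxt[v] += (1 - restart) * p[u] * w / out_weight[u]
        delta = sum(abs(nxt[u] - p[u]) for u in nodes)
        p = nxt
        if delta < tol:
            break
    return p
```

For a hypothetical toy network, seeding the walk at a disease node and sorting the drug nodes by score gives a ranked list of candidate drugs:

```python
graph = {
    "drug1": {"drug2": 0.8, "disease1": 1.0},
    "drug2": {"drug1": 0.8},
    "disease1": {"drug1": 1.0, "disease2": 0.5},
    "disease2": {"disease1": 0.5},
}
scores = random_walk_with_restart(graph, {"disease2"})
ranked_drugs = sorted(("drug1", "drug2"), key=scores.get, reverse=True)
```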
19. Dynamic epigenetic mode analysis using spatial temporal clustering.
- Author
-
Gan Y, Tao H, Zou G, Yan C, and Guan J
- Subjects
- DNA Methylation, Embryonic Stem Cells metabolism, Histones metabolism, Humans, Cell Differentiation genetics, Cluster Analysis, Embryonic Stem Cells physiology, Epigenesis, Genetic, Gene Expression Regulation, Developmental
- Abstract
Background: Differentiation of human embryonic stem cells requires precise control of gene expression that depends on specific spatial and temporal epigenetic regulation. Recently available temporal epigenomic data derived from cellular differentiation processes provide an unprecedented opportunity for characterizing fundamental properties of epigenomic dynamics and revealing regulatory roles of epigenetic modifications. Results: This paper presents a spatial temporal clustering approach, named STCluster, which exploits the temporal variation information of epigenomes to characterize dynamic epigenetic modes during cellular differentiation. This approach identifies significant spatial temporal patterns of epigenetic modifications along human embryonic stem cell differentiation and clusters regulatory sequences by their spatial temporal epigenetic patterns. Conclusions: The results show that this approach is effective in capturing epigenetic modification patterns associated with specific cell types. In addition, STCluster allows straightforward identification of coherent epigenetic modes in multiple cell types, indicating its ability to establish the most conserved epigenetic signatures during the cellular differentiation process.
- Published
- 2016
- Full Text
- View/download PDF
20. Exploiting topic modeling to boost metagenomic reads binning.
- Author
-
Zhang R, Cheng Z, Guan J, and Zhou S
- Subjects
- Cluster Analysis, Genome, Bacterial, High-Throughput Nucleotide Sequencing, Phylogeny, Software, Algorithms, DNA Barcoding, Taxonomic methods, Metagenome, Metagenomics methods, Microbiota genetics, Molecular Sequence Annotation methods, Sequence Analysis, DNA methods
- Abstract
Background: With the rapid development of high-throughput technologies, researchers can sequence the whole metagenome of a microbial community sampled directly from the environment. The assignment of these metagenomic reads into different species or taxonomical classes is a vital step for metagenomic analysis, which is referred to as binning of metagenomic data. Results: In this paper, we propose a new method, TM-MCluster, for binning metagenomic reads. First, we represent each metagenomic read as a set of "k-mers" with their frequencies occurring in the read. Then, we apply a probabilistic topic model, the Latent Dirichlet Allocation (LDA) model, to the reads, which generates a number of hidden "topics" such that each read can be represented by a distribution vector of the generated topics. Finally, as in the MCluster method, we apply SKWIC, a variant of the classical K-means algorithm with an automatic feature weighting mechanism, to cluster these reads represented by topic distributions. Conclusions: Experiments show that the new method TM-MCluster outperforms major existing methods, including AbundanceBin, MetaCluster 3.0/5.0 and MCluster. This result indicates that the exploitation of topic modeling can effectively improve the binning performance of metagenomic reads.
- Published
- 2015
- Full Text
- View/download PDF
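The first step in the TM-MCluster abstract (entry 20), representing each read by its k-mer frequencies, can be illustrated with a short Python sketch. Only that step is shown; the LDA and SKWIC stages are not covered. The choice of `k=4` and the rule of skipping windows that contain non-ACGT characters are assumptions of this example, not details from the abstract.

```python
from collections import Counter

DNA_ALPHABET = set("ACGT")


def kmer_counts(read, k=4):
    """Count overlapping k-mers in a read, skipping windows with ambiguous bases."""
    read = read.upper()
    windows = (read[i:i + k] for i in range(len(read) - k + 1))
    return Counter(w for w in windows if set(w) <= DNA_ALPHABET)


def to_vector(counts, vocabulary):
    """Turn k-mer counts into a fixed-order count vector (a bag-of-words row)."""
    return [counts.get(kmer, 0) for kmer in vocabulary]
```

Treating each read as a "document" and each k-mer as a "word" is what lets a topic model be applied to the resulting count vectors.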
21. Similarity evaluation of DNA sequences based on frequent patterns and entropy.
- Author
-
Xie X, Guan J, and Zhou S
- Subjects
- Algorithms, Animals, Computational Biology, Humans, beta-Globins genetics, Entropy, Sequence Alignment methods, Sequence Analysis, DNA methods
- Abstract
Background: DNA sequence analysis is an important research topic in bioinformatics. Evaluating the similarity between sequences, which is crucial for sequence analysis, has attracted much research effort in the last two decades, and a dozen algorithms and tools have been developed. These methods are based on alignment, word frequency and geometric representation respectively, each of which has its advantages and disadvantages. Results: In this paper, for effectively computing the similarity between DNA sequences, we introduce a novel method based on frequent patterns and entropy to construct representative vectors of DNA sequences. Experiments are conducted to evaluate the proposed method, which is compared with two recently-developed alignment-free methods and the BLASTN tool. When testing on the β-globin genes of 11 species and using the results from MEGA as the baseline, our method achieves higher correlation coefficients than the two alignment-free methods and the BLASTN tool. Conclusions: Our method is not only able to capture fine-granularity information (location and ordering) of DNA sequences via sequence blocking, but is also insensitive to noise and sequence rearrangement because it considers only the maximal frequent patterns. It outperforms major existing methods or tools.
- Published
- 2015
- Full Text
- View/download PDF
22. Histone modifications involved in cassette exon inclusions: a quantitative and interpretable analysis.
- Author
-
Liu H, Jin T, Guan J, and Zhou S
- Subjects
- Area Under Curve, Cell Line, Chromatin genetics, Chromatin metabolism, Gene Expression Regulation, Humans, Models, Statistical, Nucleosomes genetics, Nucleosomes metabolism, RNA Precursors genetics, RNA Precursors metabolism, Alternative Splicing, Epigenesis, Genetic, Exons, Histones metabolism
- Abstract
Background: Chromatin structure and epigenetic modifications have been shown to be involved in the co-transcriptional splicing of RNA precursors. In particular, some studies have suggested that some types of histone modifications (HMs) may participate in alternative splicing and function as exon marks. However, most existing studies pay attention to the qualitative relationship between epigenetic modifications and exon inclusion. A quantitative analysis that reveals to what extent each type of epigenetic modification is responsible for exon inclusion is very helpful for understanding the splicing process. Results: In this paper, we focus on the quantitative analysis of HMs' influence on the inclusion of cassette exons (CEs) into mature RNAs. With the high-throughput ChIP-seq and RNA-seq data obtained from the ENCODE website, we modeled the association of HMs with CE inclusions by logistic regression, whose coefficients are meaningful and interpretable and reveal the effect of each type of HM. Three types of HMs, H3K36me3, H3K9me3 and H4K20me1, were found to play a major role in CE inclusions. HMs' effect on CE inclusions is conserved across cell types, and does not depend on the expression levels of the genes hosting CEs. HMs located in the flanking regions of CEs were also taken into account in our analysis, and HMs within bounded flanking regions were shown to moderately affect CE inclusions. Moreover, we also found that HMs on CEs whose length is close to nucleosomal-DNA length greatly affect CE inclusion. Conclusions: We suggested that a few types of HMs correlate closely with alternative splicing and perhaps function jointly with splicing machinery to regulate the inclusion level of exons. Our findings are helpful to understand HMs' effect on exon definition, as well as the mechanism of co-transcriptional splicing.
- Published
- 2014
- Full Text
- View/download PDF
23. Protein function prediction by collective classification with explicit and implicit edges in protein-protein interaction networks.
- Author
-
Xiong W, Liu H, Guan J, and Zhou S
- Subjects
- Genomics, Hepatitis C drug therapy, Hepatitis C genetics, Humans, Molecular Sequence Annotation, Pneumonia, Ventilator-Associated genetics, Proteins genetics, Proteins metabolism, Transcription, Genetic, Transcriptome, Algorithms, Protein Interaction Maps
- Abstract
Background: Protein function prediction is an important problem in the post-genomic era. Recent advances in experimental biology have enabled the production of vast amounts of protein-protein interaction (PPI) data. Thus, using PPI data to functionally annotate proteins has been extensively studied. However, most existing network-based approaches do not work well when annotation and interaction information is inadequate in the networks. Results: In this paper, we proposed a new method that combines PPI information and protein sequence information to boost the prediction performance based on collective classification. Our method divides function prediction into two phases: First, the original PPI network is enriched by adding a number of edges that are inferred from protein sequence information. We call the added edges implicit edges, and the existing ones explicit edges correspondingly. Second, a collective classification algorithm is employed on the new network to predict protein function. Conclusions: We conducted extensive experiments on two real, publicly available PPI datasets. Compared to four existing protein function prediction approaches, our method performs better in many situations, which shows that adding implicit edges can indeed improve the prediction performance. Furthermore, the experimental results also indicate that our method is significantly better than the compared approaches in sparsely-labeled networks, and it is robust to the change of the proportion of annotated proteins.
- Published
- 2013
- Full Text
- View/download PDF
24. Genome-wide search for miRNA-target interactions in Arabidopsis thaliana with an integrated approach.
- Author
-
Ding J, Li D, Ohler U, Guan J, and Zhou S
- Subjects
- Algorithms, MicroRNAs metabolism, RNA, Messenger genetics, RNA, Messenger metabolism, RNA, Plant metabolism, Reproducibility of Results, Software, Arabidopsis genetics, Computational Biology methods, Genome, Plant genetics, MicroRNAs genetics, RNA, Plant genetics
- Abstract
Background: MiRNAs are small noncoding RNAs, about 22 nt long, that post-transcriptionally regulate gene expression in animals, plants and protozoa. Confident identification of MiRNA-Target Interactions (MTI) is vital to understand their function. Currently, several integrated computational programs and databases are available for animal miRNAs, the mechanisms of which are significantly different from plant miRNAs. Methods: Here we present an integrated MTI prediction and analysis toolkit (imiRTP) for Arabidopsis thaliana. It features two important functions: (i) combination of several effective plant miRNA target prediction methods provides a sufficiently large MTI candidate set, and (ii) different filters allow for an efficient selection of potential targets. The modularity of imiRTP enables the prediction of high quality targets on genome-wide scale. Moreover, predicted MTIs can be presented in various ways, which allows for browsing through the putative target sites as well as conducting simple and advanced analyses. Results: Results show that imiRTP could always find high quality candidates compared with a single method when appropriate filters and parameters are chosen. We also reveal that a portion of plant miRNAs could bind target genes outside the coding region. Based on our results, imiRTP could facilitate the further study of Arabidopsis miRNAs in real use. All materials of imiRTP are freely available under a GNU license at http://admis.fudan.edu.cn/projects/imiRTP.htm.
- Published
- 2012
- Full Text
- View/download PDF
25. miRFANs: an integrated database for Arabidopsis thaliana microRNA function annotations.
- Author
-
Liu H, Jin T, Liao R, Wan L, Xu B, Zhou S, and Guan J
- Subjects
- Base Sequence, Computational Biology methods, Data Mining methods, Gene Regulatory Networks, Internet, RNA, Plant genetics, Sequence Alignment methods, Transcription Factors genetics, Transcriptome, User-Computer Interface, Arabidopsis genetics, Databases, Nucleic Acid, MicroRNAs genetics, Molecular Sequence Annotation methods
- Abstract
Background: Plant microRNAs (miRNAs) have been revealed to play important roles in developmental control, hormone secretion, cell differentiation and proliferation, and response to environmental stresses. However, our knowledge about the regulatory mechanisms and functions of miRNAs remains very limited. The main difficulties lie in two aspects. On one hand, the number of experimentally validated miRNA targets is very limited and the predicted targets often include many false positives, which limits our ability to reveal the functions of miRNAs. On the other hand, the regulation of miRNAs is known to be spatio-temporally specific, which makes it more difficult to understand the regulatory mechanisms of miRNAs. Description: In this paper we present miRFANs, an online database for Arabidopsis thaliana miRNA function annotations. We integrated various types of datasets, including miRNA-target interactions, transcription factors (TFs) and their targets, expression profiles, genomic annotations and pathways, into a comprehensive database, and developed various statistical and mining tools, together with a user-friendly web interface. For each miRNA target predicted by psRNATarget, TargetAlign and UEA target-finder, or recorded in TarBase and miRTarBase, the effect of its up-regulated or down-regulated miRNA on the expression level of the target gene is evaluated by carrying out differential expression analysis of both miRNA and target expression profiles acquired under the same (or similar) experimental condition and in the same tissue. Moreover, each miRNA target is associated with gene ontology and pathway terms, together with the target site information and regulating miRNAs predicted by different computational methods. These associated terms may provide valuable insight into the functions of each miRNA. Conclusion: First, a comprehensive collection of miRNA targets for Arabidopsis thaliana provides valuable information about the functions of plant miRNAs. Second, a highly informative miRNA-mediated genetic regulatory network is extracted from our integrative database. Third, a set of statistical and mining tools is provided for analyzing and mining the database. Fourth, a user-friendly web interface is developed to facilitate the browsing and analysis of the collected data.
- Published
- 2012
- Full Text
- View/download PDF
26. Structural features based genome-wide characterization and prediction of nucleosome organization.
- Author
-
Gan Y, Guan J, Zhou S, and Zhang W
- Subjects
- Centromere, Chromatin metabolism, DNA, Z-Form metabolism, Genome, Fungal, Genome-Wide Association Study, Markov Chains, Promoter Regions, Genetic, Saccharomyces cerevisiae metabolism, Gene Expression Regulation, Fungal, Nucleosomes metabolism, Saccharomyces cerevisiae genetics
- Abstract
Background: Nucleosome distribution along chromatin dictates genomic DNA accessibility and thus profoundly influences gene expression. However, the underlying mechanism of nucleosome formation remains elusive. Here, taking a structural perspective, we systematically explored the nucleosome formation potential of genomic sequences and its effect on chromatin organization and gene expression in S. cerevisiae. Results: We analyzed twelve structural features related to the flexibility, curvature and energy of DNA sequences. The results showed that some structural features, such as DNA denaturation, DNA-bending stiffness, stacking energy, Z-DNA, propeller twist and free energy, were highly correlated with in vitro and in vivo nucleosome occupancy. Specifically, they can be classified into two classes, one positively and the other negatively correlated with nucleosome occupancy. These two kinds of structural features facilitated nucleosome binding in centromere regions and repressed nucleosome formation in the promoter regions of protein-coding genes to mediate transcriptional regulation. Based on these analyses, we integrated all twelve structural features in a model that predicts nucleosome occupancy in vivo more accurately than the existing methods, which mainly depend on sequence compositional features. Furthermore, we developed a novel approach, named DLaNe, that located nucleosomes by detecting peaks of structural profiles, and built a meta predictor to integrate information from different structural features. As a comparison, we also constructed a hidden Markov model (HMM) to locate nucleosomes based on the profiles of these structural features. The results showed that the meta DLaNe and HMM-based methods performed better than the existing methods, demonstrating the power of these structural features in predicting nucleosome positions. Conclusions: Our analysis revealed that DNA structures significantly contribute to nucleosome organization and influence chromatin structure and gene expression regulation. The results indicated that our proposed methods are effective in predicting nucleosome occupancy and positions and that these structural features are highly predictive of nucleosome organization. The implementation of our DLaNe method based on structural features is available online.
- Published
- 2012
- Full Text
- View/download PDF
27. A comparison study on feature selection of DNA structural properties for promoter prediction.
- Author
-
Gan Y, Guan J, and Zhou S
- Subjects
- Animals, Base Sequence, DNA-Directed RNA Polymerases metabolism, Humans, Sequence Analysis, DNA, Software, Support Vector Machine, DNA chemistry, Genome, Human, Promoter Regions, Genetic
- Abstract
Background: Promoter prediction is an integral step in understanding gene regulation and annotating genomes. Traditional promoter analysis is mainly based on sequence compositional features. Recently, many kinds of structural features have been employed in promoter prediction. However, considering the high-dimensionality and overfitting problems, it is infeasible to utilize all available features for promoter prediction. Thus it is necessary to choose some appropriate features for the prediction task. Results: This paper conducts an extensive comparison study on feature selection of DNA structural properties for promoter prediction. Firstly, to examine whether promoters possess some special structures, we carry out a systematic comparison among the profiles of thirteen structural features on promoter and non-promoter sequences. Secondly, we investigate the correlations between these structural features and promoter sequences. Thirdly, both filter and wrapper methods are utilized to select appropriate feature subsets from thirteen different kinds of structural features for promoter prediction, and the predictive power of the selected feature subsets is evaluated. Finally, we compare the prediction performance of the feature subsets selected in this paper with nine existing promoter prediction approaches. Conclusions: Experimental results show that the structural features are differentially correlated with promoters. Specifically, DNA-bending stiffness, DNA denaturation and energy-related features are highly correlated with promoters. The predictive power for promoter sequences differs greatly among different structural features. Selecting the relevant features can significantly improve the accuracy of promoter prediction.
- Published
- 2012
- Full Text
- View/download PDF
28. Automatically clustering large-scale miRNA sequences: methods and experiments.
- Author
-
Wan L, Ding J, Jin T, Guan J, and Zhou S
- Subjects
- Animals, Automation, Cluster Analysis, Databases, Genetic, Internet, MicroRNAs chemistry, MicroRNAs classification, Plants genetics, User-Computer Interface, Viruses genetics, Algorithms, MicroRNAs metabolism
- Abstract
Background: Since the initial annotation of microRNAs (miRNAs) in 2001, many studies have sought to identify additional miRNAs experimentally or computationally in various species. MiRNAs act with the Argonaute family of proteins to regulate target messenger RNAs (mRNAs) post-transcriptionally. Currently, research mainly focuses on the functions of individual miRNAs. Considering that members of the same miRNA family might participate in the same pathway or regulate the same target(s) and thus share similar biological functions, useful knowledge can be gained from a high-quality miRNA family architecture. Results: In this article, we developed an unsupervised clustering-based method, miRCluster, to automatically group miRNAs. In order to evaluate this method, several data sets were constructed from the online database miRBase. Results showed that miRCluster can efficiently arrange miRNAs (e.g., it identifies 354 families in miRBase 16 with an accuracy of 92.08%, and recognizes 9 of the 10 newly added families in miRBase 17). So far, ~30% of the mature miRNAs registered in miRBase are unclassified. With miRCluster, over 85% of the unclassified miRNAs can be assigned to certain families, while ~44% of these miRNAs are distributed among ~300 novel families. Conclusions: In short, miRCluster is an automatic and efficient miRNA family identification method, which does not require any prior knowledge. It can be helpful in real use, especially when exploring the functions of novel miRNAs. All relevant materials can be freely accessed online (http://admis.fudan.edu.cn/projects/miRCluster).
- Published
- 2012
- Full Text
- View/download PDF
29. A semi-supervised boosting SVM for predicting hot spots at protein-protein interfaces.
- Author
-
Xu B, Wei X, Deng L, Guan J, and Zhou S
- Subjects
- Databases, Protein, Erythropoietin chemistry, Erythropoietin metabolism, Models, Molecular, Protein Binding, Protein Conformation, Proteins chemistry, Reproducibility of Results, Computational Biology methods, Proteins metabolism, Support Vector Machine
- Abstract
Background: Hot spots are residues that contribute most of the binding free energy yet account for only a small portion of a protein interface. Experimental approaches to identifying hot spots, such as alanine scanning mutagenesis, are expensive and time-consuming, while computational methods are emerging as effective alternatives to experimental approaches. Results: In this study, we propose a semi-supervised boosting SVM, which is called sbSVM, to computationally predict hot spots at protein-protein interfaces by combining protein sequence and structure features. Here, feature selection is performed using random forests to avoid over-fitting. Due to the deficiency of positive samples, our approach samples useful unlabeled data iteratively to boost the performance of hot spot prediction. The performance evaluation of our method is carried out on a dataset generated from the ASEdb database for cross-validation and a dataset from the BID database for independent testing. Furthermore, a balanced dataset with similar numbers of hot spots and non-hot spots (65 and 66 respectively) derived from the first training dataset is used to further validate our method. All results show that our method yields good sensitivity, accuracy and F1 score compared with the existing methods. Conclusion: Our method boosts the prediction performance for hot spots by using unlabeled data to overcome the deficiency of available training data. Experimental results show that our approach is more effective than the traditional supervised algorithms and major existing hot spot prediction methods.
- Published
- 2012
- Full Text
- View/download PDF
30. miRFam: an effective automatic miRNA classification method based on n-grams and a multiclass SVM.
- Author
-
Ding J, Zhou S, and Guan J
- Subjects
- Algorithms, Animals, Base Sequence, MicroRNAs chemistry, RNA, Helminth chemistry, RNA, Helminth genetics, Schistosoma genetics, Sequence Alignment, Artificial Intelligence, MicroRNAs genetics, Software
- Abstract
Background: MicroRNAs (miRNAs) are ~22 nt long integral elements responsible for post-transcriptional control of gene expression. After the identification of thousands of miRNAs, the challenge is now to explore their specific biological functions. To this end, it will be greatly helpful to construct a reasonable organization of these miRNAs according to their homologous relationships. Given an established miRNA family system (e.g. the miRBase family organization), this paper addresses the problem of automatically and accurately classifying newly found miRNAs into their corresponding families by supervised learning techniques. Concretely, we propose an effective method, miRFam, which uses only primary information of pre-miRNAs or mature miRNAs and a multiclass SVM, to automatically classify miRNA genes. Results: An existing miRNA family system prepared by miRBase was downloaded online. We first employed n-grams to extract features from known precursor sequences, and then trained a multiclass SVM classifier to classify new miRNAs (i.e. those whose families are unknown). Compared with miRBase's sequence alignment and manual modification, our study shows that the application of machine learning techniques to miRNA family classification is a general and more effective approach. When the testing dataset contains more than 300 families (each of which holds no fewer than 5 members), the classification accuracy is around 98%. Even with the entire miRBase 15 (1056 families, more than 650 of which hold fewer than 5 samples), the accuracy surprisingly reaches 90%. Conclusions: Based on experimental results, we argue that miRFam is suitable for application as an automated method of family classification, and it is an important supplementary tool to the existing alignment-based small non-coding RNA (sncRNA) classification methods, since it only requires primary sequence information. Availability: The source code of miRFam, written in C++, is freely and publicly available at: http://admis.fudan.edu.cn/projects/miRFam.htm.
- Published
- 2011
- Full Text
- View/download PDF
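The n-gram feature extraction mentioned in the miRFam abstract above can be illustrated with a minimal Python sketch. The abstract does not state the value of n, the normalization, or the SVM configuration, so the choices here (n = 2, relative frequencies over the fixed alphabet A/C/G/U, and the function name `ngram_features`) are assumptions for illustration only, and the classifier step is left out.

```python
from itertools import product

ALPHABET = "ACGU"  # assumed RNA alphabet; DNA-style T is mapped to U below


def ngram_features(seq, n=2):
    """Return the relative frequency of every possible n-gram in seq.

    The vector has len(ALPHABET) ** n entries in a fixed order, so vectors
    from different sequences are directly comparable. n-grams containing
    characters outside ALPHABET are ignored.
    """
    seq = seq.upper().replace("T", "U")
    vocabulary = ["".join(p) for p in product(ALPHABET, repeat=n)]
    counts = dict.fromkeys(vocabulary, 0)
    for i in range(len(seq) - n + 1):
        gram = seq[i:i + n]
        if gram in counts:
            counts[gram] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[g] / total for g in vocabulary]


if __name__ == "__main__":
    vector = ngram_features("UGAGGUAGUAGGUUGUAUAGUU", n=2)
    print(len(vector))
```

A fixed-length vector like this could then be passed to any multiclass classifier; which SVM implementation and kernel the authors used is not given in the abstract.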
31. MiRenSVM: towards better prediction of microRNA precursors using an ensemble SVM classifier with multi-loop features.
- Author
-
Ding J, Zhou S, and Guan J
- Subjects
- Animals, Anopheles genetics, Classification methods, Computational Biology methods, Humans, MicroRNAs chemistry, MicroRNAs classification, RNA Precursors chemistry, Artificial Intelligence, MicroRNAs genetics, RNA Precursors genetics
- Abstract
Background: MicroRNAs (simply miRNAs) are derived from larger hairpin RNA precursors and play essential regulatory roles in both animals and plants. A number of computational methods for miRNA gene finding have been proposed in the past decade, yet the problem is far from solved, especially when considering the imbalance between known miRNAs and unidentified miRNAs, and pre-miRNAs with multi-loops or higher minimum free energy (MFE). This paper presents a new computational approach, miRenSVM, for finding miRNA genes. Aiming at better prediction performance, an ensemble support vector machine (SVM) classifier is established to deal with the imbalance issue, and multi-loop features are included for identifying pre-miRNAs with multi-loops. Results: We collected a representative dataset, which contains 697 real miRNA precursors identified by experimental procedures and other computational methods, and 5428 pseudo ones from several datasets. Experiments showed that miRenSVM achieved a specificity of 96.5% and a sensitivity of 93.05% on the dataset. Compared with the state-of-the-art approaches, miRenSVM obtained better prediction results. We also applied our method to predict 14 Homo sapiens pre-miRNAs and 13 Anopheles gambiae pre-miRNAs that first appeared in miRBase 13.0; miRenSVM achieved a 100% prediction rate. Furthermore, performance evaluation was conducted over 27 additional species in miRBase 13.0, and 92.84% (4863/5238) of animal pre-miRNAs were correctly identified by miRenSVM. Conclusion: MiRenSVM is an ensemble support vector machine (SVM) classification system for better detecting miRNA genes, especially those with multi-loop secondary structure.
- Published
- 2010
- Full Text
- View/download PDF
32. Prediction of protein-protein interaction sites using an ensemble method.
- Author
-
Deng L, Guan J, Dong Q, and Zhou S
- Subjects
- Binding Sites, Databases, Protein, Protein Interaction Mapping, Sequence Analysis, Protein, Software, Viral Nonstructural Proteins chemistry, Viral Nonstructural Proteins metabolism, Computational Biology methods, Proteins chemistry
- Abstract
Background: Prediction of protein-protein interaction sites is one of the most challenging and intriguing problems in the field of computational biology. Although much progress has been achieved by using various machine learning methods and a variety of available features, the problem is still far from being solved. Results: In this paper, an ensemble method is proposed, which combines a bootstrap resampling technique, SVM-based fusion classifiers and a weighted voting strategy, to overcome the imbalance problem and effectively utilize a wide variety of features. We evaluate the ensemble classifier using a dataset extracted from 99 polypeptide chains with 10-fold cross validation, and obtain an AUC score of 0.86, with a sensitivity of 0.76 and a specificity of 0.78, which are better than those of the existing methods. To improve the usefulness of the proposed method, two special ensemble classifiers are designed to handle the cases of missing homologues and structural information respectively, and the performance is still encouraging. The robustness of the ensemble method is also evaluated by effectively classifying interaction sites from surface residues as well as from all residues in proteins. Moreover, we demonstrate the applicability of the proposed method to identify interaction sites from the non-structural proteins (NS) of the influenza A virus, which may be utilized as potential drug target sites. Conclusion: Our experimental results show that the ensemble classifiers are quite effective in predicting protein interaction sites. The Sub-EnClassifiers with the resampling technique can alleviate the imbalance problem, and the combination of Sub-EnClassifiers with a wide variety of feature groups can significantly improve prediction performance.
- Published
- 2009
- Full Text
- View/download PDF
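The abstract of entry 32 names two generic building blocks, bootstrap resampling to counter class imbalance and weighted voting over several classifiers. The Python sketch below shows one common form of each. It is not the authors' implementation: the function names, the equal-size sampling scheme and the 0.5 decision threshold are assumptions, and the SVM training step is omitted because the abstract gives no details of it.

```python
import random


def balanced_bootstrap(positives, negatives, rng):
    """Draw, with replacement, equally many positive and negative samples.

    The sample size per class equals the number of positives (assumed to be
    the minority class), so each base classifier sees balanced data.
    """
    k = len(positives)
    pos_sample = [rng.choice(positives) for _ in range(k)]
    neg_sample = [rng.choice(negatives) for _ in range(k)]
    return pos_sample, neg_sample


def weighted_vote(predictions, weights):
    """Combine 0/1 predictions; return 1 if the weighted share of 1-votes
    is at least half of the total weight, otherwise 0."""
    total = sum(weights)
    score = sum(w for p, w in zip(predictions, weights) if p == 1)
    return 1 if score >= 0.5 * total else 0


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    pos, neg = balanced_bootstrap(["p1", "p2", "p3"], list(range(100)), rng)
    print(len(pos), len(neg))
    print(weighted_vote([1, 0, 0], [0.9, 0.3, 0.3]))
```

In a full pipeline, one base classifier would be trained per bootstrap sample, and each weight might reflect that classifier's validation performance; how the paper sets the weights is not stated in the abstract.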