Author: "Zou, Quan" / Search Limiters: Peer Reviewed / Topic: machine learning - Searchworks@Jio Institute Digital Library Search Results

1. Prediction of protein solubility based on sequence physicochemical patterns and distributed representation information with DeepSoluE

Author: Wang, Chao and Zou, Quan
Published: 2023
Full Text: View/download PDF

2. Multi-kernel Learning Fusion Algorithm Based on RNN and GRU for ASD Diagnosis and Pathogenic Brain Region Extraction.

Author: Chen, Jie, Zhang, Huilian, Zou, Quan, Liao, Bo, and Bi, Xia-an
Subjects: MACHINE learning, AUTISM spectrum disorders, RECURRENT neural networks, DATABASES, ELECTRONIC data processing
Abstract: Autism spectrum disorder (ASD) is a complex, severe disorder related to brain development. It impairs patient language communication and social behaviors. In recent years, ASD research has focused on single-modal neuroimaging data, neglecting the complementarity between multi-modal data. This omission may lead to poor classification. Therefore, it is important to study multi-modal data of ASD for revealing its pathogenesis. Furthermore, recurrent neural network (RNN) and gated recurrent unit (GRU) are effective for sequence data processing. In this paper, we introduce a novel framework for a Multi-Kernel Learning Fusion algorithm based on RNN and GRU (MKLF-RAG). The framework utilizes RNN and GRU to provide feature selection for data of different modalities. Then these features are fused by the MKLF algorithm to detect the pathological mechanisms of ASD and extract the Regions of Interest (ROIs) most relevant to the disease. The MKLF-RAG proposed in this paper has been tested in a variety of experiments with the Autism Brain Imaging Data Exchange (ABIDE) database. Experimental findings indicate that our framework notably enhances the classification accuracy for ASD. Compared with other methods, MKLF-RAG demonstrates superior efficacy across multiple evaluation metrics and could provide valuable insights into the early diagnosis of ASD. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

3. Application and Comparison of Machine Learning and Database-Based Methods in Taxonomic Classification of High-Throughput Sequencing Data.

Author: Tian, Qinzhong, Zhang, Pinglu, Zhai, Yixiao, Wang, Yansu, and Zou, Quan
Subjects: NUCLEOTIDE sequencing, TECHNOLOGICAL innovations, CLASSIFICATION, DEVELOPMENTAL biology, DATABASES, SYNTHETIC biology
Abstract: The advent of high-throughput sequencing technologies has not only revolutionized the field of bioinformatics but has also heightened the demand for efficient taxonomic classification. Despite technological advancements, efficiently processing and analyzing the deluge of sequencing data for precise taxonomic classification remains a formidable challenge. Existing classification approaches primarily fall into two categories, database-based methods and machine learning methods, each presenting its own set of challenges and advantages. On this basis, the aim of our study was to conduct a comparative analysis between these two methods while also investigating the merits of integrating multiple database-based methods. Through an in-depth comparative study, we evaluated the performance of both methodological categories in taxonomic classification by utilizing simulated data sets. Our analysis revealed that database-based methods excel in classification accuracy when backed by a rich and comprehensive reference database. Conversely, while machine learning methods show superior performance in scenarios where reference sequences are sparse or lacking, they generally show inferior performance compared with database methods under most conditions. Moreover, our study confirms that integrating multiple database-based methods does, in fact, enhance classification accuracy. These findings shed new light on the taxonomic classification of high-throughput sequencing data and bear substantial implications for the future development of computational biology. For those interested in further exploring our methods, the source code of this study is publicly available on https://github.com/LoadStar822/Genome-Classifier-Performance-Evaluator. Additionally, a dedicated webpage showcasing our collected database, data sets, and various classification software can be found at http://lab.malab.cn/~tqz/project/taxonomic/. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF
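
The abstract of record 3 reports that integrating several database-based classifiers improves accuracy but does not say how the outputs are combined. As one plausible illustration only, the sketch below assumes a simple majority vote over per-read taxon labels. The function name, the "unclassified" label and the tie rule are all hypothetical and are not taken from the study.

```python
from collections import Counter

def majority_vote(predictions, unclassified="unclassified"):
    """Combine per-tool taxon labels for one read by majority vote.

    Tools that returned `unclassified` abstain. A tie between the two
    most frequent labels is reported as `unclassified` (an assumed rule).
    """
    votes = Counter(p for p in predictions if p != unclassified)
    if not votes:
        return unclassified
    (top_label, top_count), *rest = votes.most_common()
    if rest and rest[0][1] == top_count:
        return unclassified  # no clear winner
    return top_label

# Three hypothetical tools label the same read.
consensus = majority_vote(["Escherichia coli", "Escherichia coli", "Bacillus subtilis"])
```

A vote of this kind only helps when the tools make partly independent errors, which is one reason the combination rule matters when comparing integrated and single-tool results.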

4. Deepstacked-AVPs: predicting antiviral peptides using tri-segment evolutionary profile and word embedding based multi-perspective features with deep stacking model.

Author: Akbar, Shahid, Raza, Ali, and Zou, Quan
Subjects: PEPTIDES, ANTIMICROBIAL peptides, VIRUS diseases, MACHINE learning, ANTIVIRAL agents
Abstract: Background: Viral infections have been the main health issue in the last decade. Antiviral peptides (AVPs) are a subclass of antimicrobial peptides (AMPs) with substantial potential to protect the human body against various viral diseases. However, there has been significant production of antiviral vaccines and medications. Recently, the development of AVPs as an antiviral agent suggests an effective way to treat virus-affected cells. Recently, the involvement of intelligent machine learning techniques for developing peptide-based therapeutic agents is becoming an increasing interest due to its significant outcomes. The existing wet-laboratory-based drugs are expensive, time-consuming, and cannot effectively perform in screening and predicting the targeted motif of antiviral peptides. Methods: In this paper, we proposed a novel computational model called Deepstacked-AVPs to discriminate AVPs accurately. The training sequences are numerically encoded using a novel Tri-segmentation-based position-specific scoring matrix (PSSM-TS) and word2vec-based semantic features. Composition/Transition/Distribution-Transition (CTDT) is also employed to represent the physiochemical properties based on structural features. Apart from these, the fused vector is formed using PSSM-TS features, semantic information, and CTDT descriptors to compensate for the limitations of single encoding methods. Information gain (IG) is applied to choose the optimal feature set. The selected features are trained using a stacked-ensemble classifier. Results: The proposed Deepstacked-AVPs model achieved a predictive accuracy of 96.60%, an area under the curve (AUC) of 0.98, and a precision-recall (PR) value of 0.97 using training samples. In the case of the independent samples, our model obtained an accuracy of 95.15%, an AUC of 0.97, and a PR value of 0.97. Conclusion: Our Deepstacked-AVPs model outperformed existing models with a ~4% and ~2% higher accuracy using training and independent samples, respectively. The reliability and efficacy of the proposed Deepstacked-AVPs model make it a valuable tool for scientists and may perform a beneficial role in pharmaceutical design and research academia. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF
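
Record 4 mentions information gain (IG) as the feature-selection criterion. The sketch below shows the standard entropy-based definition of IG for a discrete feature. It assumes features have already been discretized, which the abstract does not state, and the toy labels and feature columns are invented for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG = H(labels) - H(labels | feature) for one discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

labels = [1, 1, 0, 0]            # hypothetical AVP / non-AVP labels
informative = [1, 1, 0, 0]       # perfectly separates the classes
uninformative = [1, 0, 1, 0]     # independent of the classes

ig_informative = information_gain(informative, labels)
ig_uninformative = information_gain(uninformative, labels)
```

Ranking features by this score and keeping the top ones is the usual way such a criterion is turned into a feature set; the cutoff used in the paper is not given in the abstract.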

5. FEOpti-ACVP: identification of novel anti-coronavirus peptide sequences based on feature engineering and optimization.

Author: Jiang, Jici, Pei, Hongdi, Li, Jiayu, Li, Mingxin, Zou, Quan, and Lv, Zhibin
Subjects: AMINO acid sequence, DEEP learning, MACHINE learning, ENGINEERING, FEATURE extraction, IDENTIFICATION
Abstract: Anti-coronavirus peptides (ACVPs) represent a relatively novel approach of inhibiting the adsorption and fusion of the virus with human cells. Several peptide-based inhibitors showed promise as potential therapeutic drug candidates. However, identifying such peptides in laboratory experiments is both costly and time consuming. Therefore, there is growing interest in using computational methods to predict ACVPs. Here, we describe a model for the prediction of ACVPs that is based on the combination of feature engineering (FE) optimization and deep representation learning. FEOpti-ACVP was pre-trained using two feature extraction frameworks. At the next step, several machine learning approaches were tested to construct the final algorithm. The final version of FEOpti-ACVP outperformed existing methods used for ACVP prediction and it has the potential to become a valuable tool in ACVP drug design. A user-friendly webserver of FEOpti-ACVP can be accessed at http://servers.aibiochem.net/soft/FEOpti-ACVP/. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

6. GSCtool: A Novel Descriptor that Characterizes the Genome for Applying Machine Learning in Genomics.

Author: Shen, Zijie, Shen, Enhui, Zhu, Qian-Hao, Fan, Longjiang, Zou, Quan, and Ye, Chu-Yu
Subjects: GENOMICS, MACHINE learning, GENOMES, SINGLE nucleotide polymorphisms, GENOTYPES, CULTIVARS
Abstract: Machine learning (ML) is one of the core driving forces for the next breeding stage, Breeding 4.0. A genotype matrix based on single‐nucleotide polymorphisms (SNPs) is often used in ML for genome‐to‐phenotype prediction. The genotype matrix has an inherent defect, as the feature spaces it generates across different individuals or groups are inconsistent, and this hinders the application of ML. To overcome the challenge, a genome descriptor, Genic SNPs Composition Tool (GSCtool), is developed, which counts the number of SNPs in each gene of the genome so the dimension of the feature vectors equals the number of annotated genes in a species. Compared to using the genotype matrix, using GSCtool significantly decreases the model training time and has a higher accuracy of phenotype prediction. GSCtool also achieves good performance in variety identification, which is useful in crop variety protection. In general, GSCtool will help facilitate the application and study of genomic ML. The source code and test data of GSCtool are freely available at https://github.com/SZJhacker/GSCtool and https://gitee.com/shenzijie/GSCtool. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF
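
The descriptor in record 6 counts SNPs per annotated gene, so every individual gets a feature vector of the same length. The sketch below illustrates that idea only. It is not the GSCtool code, and the gene coordinates, SNP positions and function name are hypothetical.

```python
from bisect import bisect_left, bisect_right

def genic_snp_counts(genes, snp_positions):
    """Count the SNPs that fall inside each gene interval.

    genes: list of (gene_id, start, end) with inclusive coordinates on a
           single chromosome (a made-up annotation for this example).
    snp_positions: SNP coordinates observed in one individual.
    Returns counts in the same order as `genes`, so the vector length
    always equals the number of annotated genes.
    """
    positions = sorted(snp_positions)
    counts = []
    for _gene_id, start, end in genes:
        first = bisect_left(positions, start)   # first SNP at or after start
        last = bisect_right(positions, end)     # one past the last SNP at or before end
        counts.append(last - first)
    return counts

genes = [("geneA", 100, 200), ("geneB", 300, 450), ("geneC", 500, 520)]
individual_1 = [105, 150, 310, 999]        # 999 is intergenic and ignored
individual_2 = [200, 301, 302, 449, 510]

vec_1 = genic_snp_counts(genes, individual_1)
vec_2 = genic_snp_counts(genes, individual_2)
```

Both individuals carry different SNP sets, yet `vec_1` and `vec_2` share one feature space, which is the property the abstract contrasts with the genotype matrix.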

7. Adaptive learning embedding features to improve the predictive performance of SARS-CoV-2 phosphorylation sites.

Author: Jiao, Shihu, Ye, Xiucai, Ao, Chunyan, Sakurai, Tetsuya, Zou, Quan, and Xu, Lei
Subjects: SARS-CoV-2, DEEP learning, MACHINE learning
Abstract: Motivation: The rapid and extensive transmission of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has led to an unprecedented global health emergency, affecting millions of people and causing an immense socioeconomic impact. The identification of SARS-CoV-2 phosphorylation sites plays an important role in unraveling the complex molecular mechanisms behind infection and the resulting alterations in host cell pathways. However, currently available prediction tools for identifying these sites lack accuracy and efficiency. Results: In this study, we presented a comprehensive biological function analysis of SARS-CoV-2 infection in a clonal human lung epithelial A549 cell, revealing dramatic changes in protein phosphorylation pathways in host cells. Moreover, a novel deep learning predictor called PSPred-ALE is specifically designed to identify phosphorylation sites in human host cells that are infected with SARS-CoV-2. The key idea of PSPred-ALE lies in the use of a self-adaptive learning embedding algorithm, which enables the automatic extraction of context sequential features from protein sequences. In addition, the tool uses a multihead attention module that enables the capturing of global information, further improving the accuracy of predictions. Comparative analysis of features demonstrated that the self-adaptive learning embedding features are superior to hand-crafted statistical features in capturing discriminative sequence information. Benchmarking comparison shows that PSPred-ALE outperforms the state-of-the-art prediction tools and achieves robust performance. Therefore, the proposed model can effectively identify phosphorylation sites, assisting biomedical scientists in understanding the mechanism of phosphorylation in SARS-CoV-2 infection. Availability and implementation: PSPred-ALE is available at https://github.com/jiaoshihu/PSPred-ALE and Zenodo (https://doi.org/10.5281/zenodo.8330277). [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

8. m5U-SVM: identification of RNA 5-methyluridine modification sites based on multi-view features of physicochemical features and distributed representation.

Author: Ao, Chunyan, Ye, Xiucai, Sakurai, Tetsuya, Zou, Quan, and Yu, Liang
Subjects: RNA modification & restriction, MACHINE learning, URIDINE, SUPPORT vector machines
Abstract: Background: RNA 5-methyluridine (m5U) modifications are obtained by methylation at the C5 position of uridine catalyzed by pyrimidine methylation transferase, which is related to the development of human diseases. Accurate identification of m5U modification sites from RNA sequences can contribute to the understanding of their biological functions and the pathogenesis of related diseases. Compared to traditional experimental methods, computational methods developed based on machine learning with ease of use can identify modification sites from RNA sequences in an efficient and time-saving manner. Despite the good performance of these computational methods, there are some drawbacks and limitations. Results: In this study, we have developed a novel predictor, m5U-SVM, based on multi-view features and machine learning algorithms to construct predictive models for identifying m5U modification sites from RNA sequences. In this method, we used four traditional physicochemical features and distributed representation features. The optimized multi-view features were obtained from the four fused traditional physicochemical features by using the two-step LightGBM and IFS methods, and then the distributed representation features were fused with the optimized physicochemical features to obtain the new multi-view features. The best performing classifier, support vector machine, was identified by screening different machine learning algorithms. Compared with the results, the performance of the proposed model is better than that of the existing state-of-the-art tool. Conclusions: m5U-SVM provides an effective tool that successfully captures sequence-related attributes of modifications and can accurately predict m5U modification sites from RNA sequences. The identification of m5U modification sites helps to understand and delve into the related biological processes and functions. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

9. A Machine Learning Method to Identify Umami Peptide Sequences by Using Multiplicative LSTM Embedded Features.

Author: Jiang, Jici, Li, Jiayu, Li, Junxian, Pei, Hongdi, Li, Mingxin, Zou, Quan, and Lv, Zhibin
Subjects: AMINO acid sequence, MACHINE learning, DEEP learning, UMAMI (Taste), FEATURE extraction, TASTE testing of food
Abstract: Umami peptides enhance the umami taste of food and have good food processing properties, nutritional value, and numerous potential applications. Wet testing for the identification of umami peptides is a time-consuming and expensive process. Here, we report the iUmami-DRLF that uses a logistic regression (LR) method solely based on the deep learning pre-trained neural network feature extraction method, unified representation (UniRep based on multiplicative LSTM), for feature extraction from the peptide sequences. The findings demonstrate that deep learning representation learning significantly enhanced the capability of models in identifying umami peptides and predictive precision solely based on peptide sequence information. The newly validated taste sequences were also used to test the iUmami-DRLF and other predictors, and the result indicates that the iUmami-DRLF has better robustness and accuracy and remains valid at higher probability thresholds. The iUmami-DRLF method can aid further studies on enhancing the umami flavor of food for satisfying the need for an umami-flavored diet. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

10. SkipCPP-Pred: an improved and promising sequence-based predictor for predicting cell-penetrating peptides

Author: Wei, Leyi, Tang, Jijun, and Zou, Quan
Published: 2017
Full Text: View/download PDF

11. Identification of Thermophilic Proteins Based on Sequence-Based Bidirectional Representations from Transformer-Embedding Features.

Author: Pei, Hongdi, Li, Jiayu, Ma, Shuhan, Jiang, Jici, Li, Mingxin, Zou, Quan, and Lv, Zhibin
Subjects: LANGUAGE models, MACHINE learning, PROTEOMICS, METHODS engineering, FEATURE extraction
Abstract: Thermophilic proteins have great potential to be utilized as biocatalysts in biotechnology. Machine learning algorithms are gaining increasing use in identifying such enzymes, reducing or even eliminating the need for experimental studies. While most previously used machine learning methods were based on manually designed features, we developed BertThermo, a model using Bidirectional Encoder Representations from Transformers (BERT), as an automatic feature extraction tool. This method combines a variety of machine learning algorithms and feature engineering methods, while relying on single-feature encoding based on the protein sequence alone for model input. BertThermo achieved an accuracy of 96.97% and 97.51% in 5-fold cross-validation and in independent testing, respectively, identifying thermophilic proteins more reliably than any previously described predictive algorithm. Additionally, BertThermo was tested on a balanced dataset, an imbalanced dataset and a dataset with homologous sequences, and the results show that BertThermo had the best robustness compared with state-of-the-art methods. The source code of BertThermo is available. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

12. Prediction of the effects of process informatics parameters on platinum, palladium, and gold-loaded tin oxide sensors with an artificial neural network.

Author: Zou, Quan, Itoh, Toshio, Choi, Pil Gyu, Masuda, Yoshitake, and Shin, Woosuck
Subjects: *ARTIFICIAL neural networks, *TIN oxides, *STANNIC oxide, *PLATINUM, *PALLADIUM, *DETECTORS
Abstract: In this study, we investigated the effect of the preparation process, i.e., the process informatics (PI) parameters of sensor elements, on the responses of Pt-, Pd-, and Au-loaded SnO2 sensors. These responses were predicted by an artificial neural network (ANN) using a dataset comprising 441 data points that had been fabricated and evaluated under many parameters in our previous studies. We reported an optimal data preprocessing method based on the relational expression between the sensor response of a semiconductor sensor and the concentration of a target gas, and the effect of each PI parameter based on predicted sensor responses under untested conditions. • Prediction of the effects of process informatics on SnO2-type sensors. • Using artificial neural networks for the prediction. • Preparing a dataset comprising 441 data points from our previous studies. • Optimization of data preprocessing based on the theory of semiconductor sensors. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

13. CRCF: A Method of Identifying Secretory Proteins of Malaria Parasites.

Author: Feng, Changli, Wu, Jin, Wei, Haiyan, Xu, Lei, and Zou, Quan
Abstract: Malaria is a mosquito-borne disease that results in millions of cases and deaths annually. The development of a fast computational method that identifies secretory proteins of the malaria parasite is important for research on antimalarial drugs and vaccines. Thus, a method was developed to identify the secretory proteins of malaria parasites. In this method, a reduced alphabet was selected to recode the original protein sequence. A feature synthesis method was used to synthesise three different types of feature information. Finally, the random forest method was used as a classifier to identify the secretory proteins. In addition, a web server was developed to share the proposed algorithm. Experiments using the benchmark dataset demonstrated that the overall accuracy achieved by the proposed method was greater than 97.8 percent using the 10-fold cross-validation method. Furthermore, the reduced schemes and characteristic performance analyses are discussed. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF
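
Record 13 recodes each protein sequence with a reduced amino-acid alphabet before extracting features. The sketch below shows what such a recoding looks like. The six-group scheme is an illustrative assumption and is not the alphabet selected in the paper, and the composition feature is just one simple descriptor that could be computed afterwards.

```python
# Hypothetical six-group reduction of the 20 standard amino acids.
# Key = representative letter, value = residues mapped onto it.
GROUPS = {
    "A": "AVLIMC",  # aliphatic / sulfur-containing
    "F": "FWYH",    # aromatic
    "S": "STNQ",    # polar, uncharged
    "K": "KR",      # positively charged
    "D": "DE",      # negatively charged
    "G": "GP",      # glycine / proline
}
REDUCED = {aa: rep for rep, members in GROUPS.items() for aa in members}

def recode(sequence):
    """Replace each residue by its group letter; unknown residues become 'X'."""
    return "".join(REDUCED.get(aa, "X") for aa in sequence.upper())

def composition(recoded):
    """Fraction of each group letter in a recoded sequence."""
    n = len(recoded)
    if n == 0:
        return {rep: 0.0 for rep in GROUPS}
    return {rep: recoded.count(rep) / n for rep in GROUPS}

recoded = recode("MKVLDEGFST")
features = composition(recoded)
```

Reducing 20 letters to a handful shrinks the feature space of any k-mer based descriptor, which is the usual motivation for this step.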

14. Prediction of Cell-Penetrating Peptides Using a Novel HSIC-Based Multiview TSK Fuzzy System.

Author: Liu, Peng, Zhao, Shulin, Zou, Quan, and Ding, Yijie
Subjects: CELL-penetrating peptides, MACHINE learning, INDEPENDENT sets, FUZZY systems, FORECASTING, PEPTIDES
Abstract: Cell-penetrating peptides (CPPs) are short peptides that can carry cargo into cells. CPPs are widely utilized due to their powerful loading capacity and transduction efficiency. Identifying CPPs is the basis for studying their functions and mechanisms; however, experimental methods to identify CPPs are expensive and time-consuming. Recently, CPP predictors based on machine learning methods have become a research hotspot. Although considerable progress has been made, some challenges remain unresolved. First, most predictors employ a variety of feature descriptors to transform an original sequence into multiview data; however, extant methods ignore the relationships between different views, limiting further performance improvement. Second, most machine learning models are actually black boxes and cannot offer insightful advice. In this paper, a novel Hilbert–Schmidt independence criterion (HSIC)-based multiview TSK fuzzy system is proposed. Compared with other machine learning methods, TSK fuzzy systems have better interpretability, and the introduction of multiview mechanisms provides comprehensive insight into the intrinsic laws of the data. HSIC is utilized here to measure the independence and enhance the complementarity between different views. Notably, the proposed method attained prediction accuracy results of 92.2% and 96.2% for the training and independent test sets, respectively. The empirical results show that our promising approach features greater recognition performance than the state-of-the-art method. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

15. Structured Sparse Regularized TSK Fuzzy System for predicting therapeutic peptides.

Author: Guo, Xiaoyi, Jiang, Yizhang, and Zou, Quan
Subjects: PEPTIDES, DIGESTIVE organs, MACHINE learning, AMINO acid sequence
Abstract: Therapeutic peptides act on the skeletal system, digestive system and blood system, have antibacterial properties and help relieve inflammation. In order to reduce the resource consumption of wet experiments for the identification of therapeutic peptides, many computational methods have been developed for their identification. Because traditional machine learning methods deal poorly with feature noise, we propose a novel therapeutic peptide identification method called Structured Sparse Regularized Takagi–Sugeno–Kang Fuzzy System on Within-Class Scatter (SSR-TSK-FS-WCS). Our method achieves good performance on multiple therapeutic peptides and UCI datasets. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

16. Machine Learning and Its Applications for Protozoal Pathogens and Protozoal Infectious Diseases.

Author: Hu, Rui-Si, Hesham, Abd El-Latif, and Zou, Quan
Subjects: PROTOZOAN diseases, COMMUNICABLE diseases, MACHINE learning, FEATURE selection, SUPPORT vector machines, PUBLIC health surveillance
Abstract: In recent years, massive attention has been attracted to the development and application of machine learning (ML) in the field of infectious diseases, not only serving as a catalyst for academic studies but also as a key means of detecting pathogenic microorganisms, implementing public health surveillance, exploring host-pathogen interactions, discovering drug and vaccine candidates, and so forth. These applications also include the management of infectious diseases caused by protozoal pathogens, such as Plasmodium, Trypanosoma, Toxoplasma, Cryptosporidium, and Giardia, a class of fatal or life-threatening causative agents capable of infecting humans and a wide range of animals. With the reduction of computational cost, availability of effective ML algorithms, popularization of ML tools, and accumulation of high-throughput data, it is possible to implement the integration of ML applications into increasing scientific research related to protozoal infection. Here, we will present a brief overview of important concepts in ML serving as background knowledge, with a focus on basic workflows, popular algorithms (e.g., support vector machine, random forest, and neural networks), feature extraction and selection, and model evaluation metrics. We will then review current ML applications and major advances concerning protozoal pathogens and protozoal infectious diseases through combination with correlative biology expertise and provide forward-looking insights for perspectives and opportunities in future advances in ML techniques in this field. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

17. Identification of drug–target interactions via multiple kernel-based triple collaborative matrix factorization.

Author: Ding, Yijie, Tang, Jijun, Guo, Fei, and Zou, Quan
Subjects: MATRIX decomposition, KERNEL operating systems, DRUG use testing, MACHINE learning, BIPARTITE graphs, TREATMENT effectiveness
Abstract: Targeted drugs have been applied to the treatment of cancer on a large scale, and some patients have certain therapeutic effects. It is a time-consuming task to detect drug–target interactions (DTIs) through biochemical experiments. At present, machine learning (ML) has been widely applied in large-scale drug screening. However, there are few methods for multiple information fusion. We propose a multiple kernel-based triple collaborative matrix factorization (MK-TCMF) method to predict DTIs. The multiple kernel matrices (contain chemical, biological and clinical information) are integrated via multi-kernel learning (MKL) algorithm. And the original adjacency matrix of DTIs could be decomposed into three matrices, including the latent feature matrix of the drug space, latent feature matrix of the target space and the bi-projection matrix (used to join the two feature spaces). To obtain better prediction performance, MKL algorithm can regulate the weight of each kernel matrix according to the prediction error. The weights of drug side-effects and target sequence are the highest. Compared with other computational methods, our model has better performance on four test data sets. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF
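
The three-factor decomposition described in record 17 can be written compactly. The notation below is an assumption made for illustration: the abstract names the three matrices but gives no symbols, dimensions or constraints.

```latex
% Y      : n x m drug-target adjacency matrix (n drugs, m targets)
% A      : n x k_1 latent feature matrix of the drug space
% B      : m x k_2 latent feature matrix of the target space
% \Theta : k_1 x k_2 bi-projection matrix joining the two feature spaces
Y \approx A \, \Theta \, B^{\top}

% Multi-kernel learning combines the kernels of each side with weights w_i;
% the non-negativity and sum-to-one constraints are assumed here.
K = \sum_{i} w_i K^{(i)}, \qquad w_i \ge 0, \quad \sum_{i} w_i = 1
```

Under this reading, the predicted interaction score for drug $p$ and target $q$ is the entry $(A \Theta B^{\top})_{pq}$, and a larger weight $w_i$ means the corresponding information source contributes more to the combined kernel.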

18. Critical assessment of computational tools for prokaryotic and eukaryotic promoter prediction.

Author: Zhang, Meng, Jia, Cangzhi, Li, Fuyi, Li, Chen, Zhu, Yan, Akutsu, Tatsuya, Webb, Geoffrey I, Zou, Quan, Coin, Lachlan J M, and Song, Jiangning
Subjects: DEEP learning, MACHINE learning, DROSOPHILA melanogaster, MICE, CORN, GENETIC regulation
Abstract: Promoters are crucial regulatory DNA regions for gene transcriptional activation. Rapid advances in next-generation sequencing technologies have accelerated the accumulation of genome sequences, providing increased training data to inform computational approaches for both prokaryotic and eukaryotic promoter prediction. However, it remains a significant challenge to accurately identify species-specific promoter sequences using computational approaches. To advance computational support for promoter prediction, in this study, we curated 58 comprehensive, up-to-date, benchmark datasets for 7 different species (i.e. Escherichia coli, Bacillus subtilis, Homo sapiens, Mus musculus, Arabidopsis thaliana, Zea mays and Drosophila melanogaster) to assist the research community to assess the relative functionality of alternative approaches and support future research on both prokaryotic and eukaryotic promoters. We revisited 106 predictors published since 2000 for promoter identification (40 for prokaryotic promoter, 61 for eukaryotic promoter, and 5 for both). We systematically evaluated their training datasets, computational methodologies, calculated features, performance and software usability. On the basis of these benchmark datasets, we benchmarked 19 predictors with functioning webservers/local tools and assessed their prediction performance. We found that deep learning and traditional machine learning–based approaches generally outperformed scoring function–based approaches. Taken together, the curated benchmark dataset repository and the benchmarking analysis in this study serve to inform the design and implementation of computational approaches for promoter prediction and facilitate more rigorous comparison of new techniques in the future. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

19. NmRF: identification of multispecies RNA 2'-O-methylation modification sites from RNA sequences.

Author: Ao, Chunyan, Zou, Quan, and Yu, Liang
Subjects: *RNA modification & restriction, *TRANSFER RNA, *FEATURE selection, *RANDOM forest algorithms, *METHYL groups, *MACHINE learning, *BOOSTING algorithms
Abstract: 2'-O-methylation (Nm) is a post-transcriptional modification of RNA that is catalyzed by 2'-O-methyltransferase and involves replacing the H on the 2'-hydroxyl group with a methyl group. The 2'-O-methylation modification site is detected in a variety of RNA types (miRNA, tRNA, mRNA, etc.), plays an important role in biological processes and is associated with different diseases. There are few functional mechanisms developed at present, and traditional high-throughput experiments are time-consuming and expensive to explore functional mechanisms. For a deeper understanding of relevant biological mechanisms, it is necessary to develop efficient and accurate recognition tools based on machine learning. Based on this, we constructed a predictor called NmRF based on optimal mixed features and random forest classifier to identify 2'-O-methylation modification sites. The predictor can identify modification sites of multiple species at the same time. To obtain a better prediction model, a two-step strategy is adopted; that is, the optimal hybrid feature set is obtained by combining the light gradient boosting algorithm and incremental feature selection strategy. In 10-fold cross-validation, the accuracies of Homo sapiens and Saccharomyces cerevisiae were 89.069% and 93.885%, and the AUC were 0.9498 and 0.9832, respectively. The rigorous 10-fold cross-validation and independent tests confirm that the proposed method is significantly better than existing tools. A user-friendly web server is accessible at http://lab.malab.cn/~acy/NmRF. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

20. Characterizing viral circRNAs and their application in identifying circRNAs in viruses.

Author: Niu, Mengting, Ju, Ying, Lin, Chen, and Zou, Quan
Subjects: CIRCULAR RNA, NUCLEIC acids, NON-coding RNA, SEQUENCE alignment, MACHINE learning, VIRUS diseases, NERVOUS system, RNA splicing
Abstract: Circular RNAs (circRNAs) are non-coding RNAs with a special circular structure formed by the reverse splicing mechanism, which play an important role in a variety of biological activities. Viruses can encode circRNA, and viral circRNAs have been found in multiple single-stranded and double-stranded viruses. However, the characteristics and functions of viral circRNAs remain unknown. Sequence alignment showed that viral circRNAs are less conserved than circRNAs in animals, indicating that the viral circRNAs may evolve rapidly. Through the analysis of the sequence characteristics of viral circRNAs and circRNAs in animals, it was found that viral circRNAs and animal circRNAs are similar in nucleic acid composition, but have obvious differences in secondary structure and autocorrelation characteristics. Based on these characteristics of viral circRNAs, machine learning algorithms were employed to construct a prediction model to identify viral circRNA. Additionally, analysis of the interaction between viral circRNA and miRNAs showed that viral circRNA is expected to interact with 518 human miRNAs, and preliminary analysis of the role of viral circRNA. It was also found that viral circRNAs may be involved in many KEGG pathways related to the nervous system and cancer. We curated an online server, and the data and code are available: http://server.malab.cn/viral-CircRNA/. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

21. Accurate prediction and characterization of cancerlectin by a combined machine learning and GO analysis.

Author: Tang, Furong, Zhang, Lichao, Xu, Lei, Zou, Quan, and Feng, Hailin
Subjects: PLANT lectins, MACHINE learning, ENZYME stability, PROTEIN domains, CANCER invasiveness, LECTINS, BINDING sites
Abstract: Cancerlectins, lectins linked to tumor progression, have become the focus of cancer therapy research for their carbohydrate-binding specificity. However, the specific characterization for cancerlectins involved in tumor progression is still unclear. By taking advantage of the g-gap tripeptide and tetrapeptide composition feature descriptors, we increased the accuracy of the classification model of cancerlectin and lectin to 98.54% and 95.38%, respectively. About 36 cancerlectin and 135 lectin features were selected for functional characterization by P/N feature ranking method, which particularly selects the features in positive samples. The specific protein domains of cancerlectins are found to be p-GalNAc-T, crystal and annexin by comparing with lectins through the exclusion method. Moreover, the combined GO analysis showed that the conserved cation binding sites of cancerlectin specific domains are covered by selected feature peptides, suggesting that the capability of cation binding, critical for enzyme activity and stability, could be the key characteristic of cancerlectins in tumor progression. These results will help to identify potential cancerlectin and provide clues for mechanism study of cancerlectin in tumor progression. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

22. MMFGRN: a multi-source multi-model fusion method for gene regulatory network reconstruction.

Author: He, Wenying, Tang, Jijun, Zou, Quan, and Guo, Fei
Subjects: GENE regulatory networks, GENE fusion, RECEIVER operating characteristic curves, BIOLOGICAL networks, GENE expression, CELL differentiation
Abstract: Many biological processes, such as the growth and differentiation of cells and the occurrence and development of diseases, are controlled by gene regulatory networks (GRNs). Therefore, sustained research on GRNs is important. Determining gene–gene relationships from gene expression data is a complex issue. Because it is difficult to efficiently obtain the regularity behind the gene–gene relationship by relying only on biochemical experimental methods, various computational methods have been used to construct GRNs, and some achievements have been made. In this paper, we propose a novel method, MMFGRN (for "Multi-source Multi-model Fusion for Gene Regulatory Network reconstruction"), to reconstruct the GRN. In order to make full use of the limited datasets and explore the potential regulatory relationships contained in different data types, we construct the MMFGRN model from three perspectives: a single time series data model, a single steady-data model, and a time series and steady-data joint model. We then use a weighted fusion strategy to obtain the final global regulatory link ranking. The MMFGRN model yields the best performance on the DREAM4 InSilico_Size10 data, outperforming other popular inference algorithms, with an overall area under the receiver operating characteristic curve score of 0.909 and an area under the precision-recall curve (AUPR) score of 0.770 on the 10-gene network. Additionally, as the network scale increases, our method retains certain advantages, with an overall AUPR score of 0.335 on the DREAM4 InSilico_Size100 data. These results demonstrate the good robustness of MMFGRN on different scales of networks. At the same time, the integration strategy proposed in this paper provides a new idea for the reconstruction of biological network models without prior knowledge, which can help researchers to decipher the elusive mechanism of life. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

23. Current status and future prospects of drug–target interaction prediction.

Author: Ru, Xiaoqing, Ye, Xiucai, Sakurai, Tetsuya, Zou, Quan, Xu, Lei, and Lin, Chen
Subjects: DRUG repositioning, DRUG development, MOLECULAR docking, COST control, FORECASTING
Abstract: Drug–target interaction prediction is important for drug development and drug repurposing. Many computational methods have been proposed for drug–target interaction prediction because of their potential to reduce time and cost. In this review, we introduce the molecular docking and machine learning-based methods, which have been widely applied to drug–target interaction prediction. In particular, machine learning-based methods are divided into different types according to the data processing form and task type. For each type of method, we provide a specific description and propose some solutions to improve its capability. The knowledge of heterogeneous networks and learning to rank is also summarized in this review. As far as we know, this is the first comprehensive review that summarizes the knowledge of heterogeneous networks and learning to rank in drug–target interaction prediction. Moreover, we propose three aspects that can be explored in depth in future research. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

24. A comprehensive review of the imbalance classification of protein post-translational modifications.

Author: Dou, Lijun, Yang, Fenglong, Xu, Lei, and Zou, Quan
Subjects: NUCLEOTIDE sequencing, POST-translational modification, PROTEIN structure, THERAPEUTICS, DRUG design, CLASSIFICATION
Abstract: Post-translational modifications (PTMs) play significant roles in regulating protein structure, activity and function, and they are closely involved in various pathologies. Therefore, the identification of associated PTMs is the foundation of in-depth research on related biological mechanisms, disease treatments and drug design. Due to the high cost and time consumption of high-throughput sequencing techniques, developing machine learning-based predictors has been considered an effective approach to rapidly recognize potential modified sites. However, the imbalanced distribution of true and false PTM sites, namely, the data imbalance problem, largely affects the reliability and application of prediction tools. In this article, we conduct a systematic survey of the research progress in imbalanced PTM classification. First, we describe the modeling process in detail and outline useful data imbalance solutions. Then, we summarize the recently proposed bioinformatics tools based on imbalanced PTM data and build a convenient website, ImClassi_PTMs (available at lab.malab.cn/∼dlj/ImbClassi_PTMs/), where researchers can browse them. Moreover, we analyze the challenges of current computational predictors and propose some suggestions to improve the efficiency of imbalance learning. We hope that this work will provide comprehensive knowledge of imbalanced PTM recognition and contribute to advanced predictors in the future. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

25. Machine learning for phytopathology: from the molecular scale towards the network scale.

Author: Wang, Yansu, Zhou, Murong, Zou, Quan, and Xu, Lei
Subjects: MACHINE learning, PLANT-pathogen relationships, NUCLEOTIDE sequencing, SUPPORT vector machines, PHYTOPATHOGENIC microorganisms, DISEASE resistance of plants, PLANT diseases
Abstract: With the increasing volume of high-throughput sequencing data from a variety of omics techniques in the field of plant–pathogen interactions, sorting, retrieving, processing and visualizing biological information have become a great challenge. Within the explosion of data, machine learning offers powerful tools to process these complex omics data by various algorithms, such as Bayesian reasoning, support vector machine and random forest. Here, we introduce the basic frameworks of machine learning in dissecting plant–pathogen interactions and discuss the applications and advances of machine learning in plant–pathogen interactions from molecular to network biology, including the prediction of pathogen effectors, plant disease resistance protein monitoring and the discovery of protein–protein networks. The aim of this review is to provide a summary of advances in plant defense and pathogen infection and to indicate the important developments of machine learning in phytopathology. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

26. A comprehensive overview and critical evaluation of gene regulatory network inference technologies.

Author: Zhao, Mengyuan, He, Wenying, Tang, Jijun, Zou, Quan, and Guo, Fei
Subjects: GENE regulatory networks, DRUG target, MEDICAL scientists, KEY performance indicators (Management)
Abstract: Gene regulatory network (GRN) is the important mechanism of maintaining life process, controlling biochemical reaction and regulating compound level, which plays an important role in various organisms and systems. Reconstructing GRN can help us to understand the molecular mechanism of organisms and to reveal the essential rules of a large number of biological processes and reactions in organisms. Various outstanding network reconstruction algorithms use specific assumptions that affect prediction accuracy, in order to deal with the uncertainty of processing. In order to study why a certain method is more suitable for specific research problem or experimental data, we conduct research from model-based, information-based and machine learning-based method classifications. There are obviously different types of computational tools that can be generated to distinguish GRNs. Furthermore, we discuss several classical, representative and latest methods in each category to analyze core ideas, general steps, characteristics, etc. We compare the performance of state-of-the-art GRN reconstruction technologies on simulated networks and real networks under different scaling conditions. Through standardized performance metrics and common benchmarks, we quantitatively evaluate the stability of various methods and the sensitivity of the same algorithm applying to different scaling networks. The aim of this study is to explore the most appropriate method for a specific GRN, which helps biologists and medical scientists in discovering potential drug targets and identifying cancer biomarkers. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

27. SubLocEP: a novel ensemble predictor of subcellular localization of eukaryotic mRNA based on machine learning.

Author: Li, Jing, Zhang, Lichao, He, Shida, Guo, Fei, and Zou, Quan
Subjects: MACHINE learning, GENETIC translation, PREDICTION models, ALGORITHMS, MESSENGER RNA
Abstract: Motivation: mRNA location corresponds to the location of protein translation and contributes to precise spatial and temporal management of the protein function. However, current assignment of subcellular localization of eukaryotic mRNA reveals important limitations: (1) turning multiple classifications into multiple dichotomies makes the training process tedious; (2) the majority of the models trained by classical algorithms are based on the extraction of single sequence information; (3) the existing state-of-the-art models have not reached an ideal level in terms of prediction and generalization ability. To achieve better assignment of subcellular localization of eukaryotic mRNA, a better and more comprehensive model must be developed. Results: In this paper, SubLocEP is proposed as a two-layer integrated prediction model for accurate prediction of the location of sequence samples. Unlike the existing models based on limited features, SubLocEP comprehensively considers additional feature attributes and is combined with LightGBM to generate single feature classifiers. The initial integration model (single-layer model) is generated according to the categories of a feature. Subsequently, two single-layer integration models are weighted (sequence-based: physicochemical properties = 3:2) to produce the final two-layer model. The performance of SubLocEP on independent datasets is sufficient to indicate that SubLocEP is an accurate and stable prediction model with strong generalization ability. Additionally, an online tool has been developed that contains experimental data and can maximize user convenience for estimating the subcellular localization of eukaryotic mRNA. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

28. rBPDL: Predicting RNA-Binding Proteins Using Deep Learning.

Author: Niu, Mengting, Wu, Jin, Zou, Quan, Liu, Zhendong, and Xu, Lei
Subjects: RNA-binding proteins, CONVOLUTIONAL neural networks, DEEP learning, CARRIER proteins, MACHINE learning, STATISTICS
Abstract: RNA-binding protein (RBP) is a powerful and wide-ranging regulator that plays an important role in cell development, differentiation, metabolism, health and disease. The prediction of RBPs provides valuable guidance for biologists. Although experimental methods have made great progress in predicting RBP, they are time-consuming and not flexible. Therefore, we developed a network model, rBPDL, by combining a convolutional neural network and long short-term memory for multilabel classification of RBPs. Moreover, to achieve better prediction results, we used a voting algorithm for ensemble learning of the model. We compared rBPDL with state-of-the-art methods and found that rBPDL significantly improved identification performance for the RBP68 dataset, with a macro-Area Under Curve (AUC), micro-AUC, and weighted AUC of 0.936, 0.962, and 0.946, respectively. Furthermore, through AUC statistical analysis of the RBP domain, we analyzed the performance of rBPDL and found that the RBP identification performance in the same domain was similar. In addition, we analyzed the performance preferences and physicochemical properties of the binding protein amino acids and explored the characteristics that affect the binding by using the RBP86 dataset. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

29. ITP-Pred: an interpretable method for predicting therapeutic peptides with fused features low-dimension representation.

Author: Cai, Lijun, Wang, Li, Fu, Xiangzheng, Xia, Chenxing, Zeng, Xiangxiang, and Zou, Quan
Subjects: PEPTIDES, DNA-binding proteins, AMINO acids, MACHINE learning, BIOTECHNOLOGY industries, CHEMICAL properties
Abstract: The peptide therapeutics market is providing new opportunities for the biotechnology and pharmaceutical industries. Therefore, identifying therapeutic peptides and exploring their properties are important. Although several studies have proposed different machine learning methods to predict peptides as being therapeutic peptides, most do not explain the decision factors of model in detail. In this work, an Interpretable Therapeutic Peptide Prediction (ITP-Pred) model based on efficient feature fusion was developed. First, we proposed three kinds of feature descriptors based on sequence and physicochemical property encoded, namely amino acid composition (AAC), group AAC and coding autocorrelation, and concatenated them to obtain the feature representation of therapeutic peptide. Then, we input it into the CNN-Bi-directional Long Short-Term Memory (BiLSTM) model to automatically learn recognition of therapeutic peptides. The cross-validation and independent verification experiments results indicated that ITP-Pred has a higher prediction performance on the benchmark dataset than other comparison methods. Finally, we analyzed the output of the model from two aspects: sequence order and physical and chemical properties, mining important features as guidance for the design of better models that can complement existing methods. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

30. Revisiting genome-wide association studies from statistical modelling to machine learning.

Author: Sun, Shanwen, Dong, Benzhi, and Zou, Quan
Subjects: GENOME-wide association studies, MACHINE learning, STATISTICAL learning, GENETIC variation, STATISTICAL models, LINKAGE disequilibrium
Abstract: Over the last decade, genome-wide association studies (GWAS) have discovered thousands of genetic variants underlying complex human diseases and agriculturally important traits. These findings have been utilized to dissect the biological basis of diseases, to develop new drugs, to advance precision medicine and to boost breeding. However, the potential of GWAS is still underexploited due to methodological limitations. Many challenges have emerged, including detecting epistasis and single-nucleotide polymorphisms (SNPs) with small effects and distinguishing causal variants from other SNPs associated through linkage disequilibrium. These issues have motivated advancements in GWAS analyses in two contrasting cultures—statistical modelling and machine learning. In this review, we systematically present the basic concepts and the benefits and limitations in both methods. We further discuss recent efforts to mitigate their weaknesses. Additionally, we summarize the state-of-the-art tools for detecting the missed signals, ultrarare mutations and gene–gene interactions and for prioritizing SNPs. Our work can offer both theoretical and practical guidelines for performing GWAS analyses and for developing further new robust methods to fully exploit the potential of GWAS. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

31. Using a low correlation high orthogonality feature set and machine learning methods to identify plant pentatricopeptide repeat coding gene/protein.

Author: Feng, Changli, Zou, Quan, and Wang, Donghua
Subjects: *PENTATRICOPEPTIDE repeat genes, *MACHINE learning, *NAIVE Bayes classification, *AMINO acid sequence, *FEATURE selection, *PRINCIPAL components analysis
Abstract: Identifying whether a pentatricopeptide repeat (PPR) exists in an amino acid is a significant task in the field of bioinformatics. To address this problem, an identification method that combines an optimal feature set selection framework and machine learning algorithms is proposed to recognize the PPR coding genes and proteins in the sequence of amino acid. The original 188-dimensional (D) features are obtained using a feature extraction method, which is successively optimised through a covariance analysis, max-relevant-max-distance processing, and principal component analysis to reduce it to an optimal feature set that has fewer but more expressive features. Four machine learning methods are then used to serve as the classifiers for the identification task. The final number of feature data dimensions is reduced from 188 to only 10, and according to the experimental results from support vector machine methods, the loss of the AUC and the F1 values are only 3.26% and 10.1%, respectively. Moreover, after applying the J48, random forest, and naïve Bayes methods as classifiers, it was also found that the optimal feature set with 10 dimensions has an almost equivalent performance for a 10-fold validation test. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

32. Prediction of bio-sequence modifications and the associations with diseases.

Author: Ao, Chunyan, Yu, Liang, and Zou, Quan
Subjects: RNA modification & restriction, INTERNET servers, THERAPEUTICS, FORECASTING, PREVENTIVE medicine
Abstract: Modifications of protein, RNA and DNA play an important role in many biological processes and are related to some diseases. Therefore, accurate identification and comprehensive understanding of protein, RNA and DNA modification sites can promote research on disease treatment and prevention. With the development of sequencing technology, the number of known sequences has continued to increase. In the past decade, many computational tools that can be used to predict protein, RNA and DNA modification sites have been developed. In this review, we comprehensively summarize the modification site predictors for three different biological sequences and their associations with diseases. The relevant web server is accessible at http://lab.malab.cn/∼acy/PTM%5fdata/ and some sample data on protein, RNA and DNA modification can be downloaded from that website. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

33. Clustering and classification methods for single-cell RNA-sequencing data.

Author: Qi, Ren, Ma, Anjun, Ma, Qin, and Zou, Quan
Subjects: DOWNLOADING, CLASSIFICATION, ELECTRONIC data processing
Abstract: Appropriate ways to measure the similarity between single-cell RNA-sequencing (scRNA-seq) data are ubiquitous in bioinformatics, but using single clustering or classification methods to process scRNA-seq data is generally difficult. This has led to the emergence of integrated methods and tools that aim to automatically process specific problems associated with scRNA-seq data. These approaches have attracted a lot of interest in bioinformatics and related fields. In this paper, we systematically review the integrated methods and tools, highlighting the pros and cons of each approach. We not only pay particular attention to clustering and classification methods but also discuss methods that have emerged recently as powerful alternatives, including nonlinear and linear methods and descending dimension methods. Finally, we focus on clustering and classification methods for scRNA-seq data, in particular, integrated methods, and provide a comprehensive description of scRNA-seq data and download URLs. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

34. mAML: an automated machine learning pipeline with a microbiome repository for human disease classification.

Author: Yang, Fenglong and Zou, Quan
Subjects: *NOSOLOGY, *MACHINE learning, *HUMAN microbiota, *GUT microbiome, *METAGENOMICS
Abstract: Due to the concerted efforts to utilize microbial features to improve disease prediction capabilities, automated machine learning (AutoML) systems that remove the tedium of manually performing ML tasks are in great demand. Here we developed mAML, an ML model-building pipeline, which can automatically and rapidly generate optimized and interpretable models for personalized microbiome-based classification tasks in a reproducible way. The pipeline is deployed on a web-based platform, and the server is user-friendly, flexible and designed to be scalable according to specific requirements. This pipeline exhibits high performance for 13 benchmark datasets including both binary and multi-class classification tasks. In addition, to facilitate the application of mAML and expand the human disease-related microbiome learning repository, we developed the GMrepo ML repository (GMrepo Microbiome Learning repository) from the GMrepo database. The repository involves 120 microbiome-based classification tasks for 85 human-disease phenotypes referring to 12 429 metagenomic samples and 38 643 amplicon samples. The mAML pipeline and the GMrepo ML repository are expected to be important resources for research in microbiology and algorithm development. Database URL: http://lab.malab.cn/soft/mAML [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

35. Machine learning and its applications in plant molecular studies.

Author: Sun, Shanwen, Wang, Chunyu, Ding, Hui, and Zou, Quan
Subjects: MACHINE learning, BOTANISTS, ABIOTIC stress, SUPERVISED learning, CHEMICAL plants, DATA analysis
Abstract: The advent of high-throughput genomic technologies has resulted in the accumulation of massive amounts of genomic information. However, biologists are challenged with how to effectively analyze these data. Machine learning can provide tools for better and more efficient data analysis. Unfortunately, because many plant biologists are unfamiliar with machine learning, its application in plant molecular studies has been restricted to a few species and a limited set of algorithms. Thus, in this study, we provide the basic steps for developing machine learning frameworks and present a comprehensive overview of machine learning algorithms and various evaluation metrics. Furthermore, we introduce sources of important curated plant genomic data and R packages to enable plant biologists to easily and quickly apply appropriate machine learning algorithms in their research. Finally, we discuss current applications of machine learning algorithms for identifying various genes related to resistance to biotic and abiotic stress. Broad application of machine learning and the accumulation of plant sequencing data will advance plant molecular studies. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

36. Comparative analysis and prediction of quorum-sensing peptides using feature representation learning and machine learning algorithms.

Author: Wei, Leyi, Hu, Jie, Li, Fuyi, Song, Jiangning, Su, Ran, and Zou, Quan
Subjects: MACHINE learning, FORECASTING, GENETIC regulation, DESCRIPTOR systems, COMPARATIVE studies, GRAM-positive bacteria
Abstract: Quorum-sensing peptides (QSPs) are the signal molecules that are closely associated with diverse cellular processes, such as cell–cell communication, and gene expression regulation in Gram-positive bacteria. It is therefore of great importance to identify QSPs for better understanding and in-depth revealing of their functional mechanisms in physiological processes. Machine learning algorithms have been developed for this purpose, showing the great potential for the reliable prediction of QSPs. In this study, several sequence-based feature descriptors for peptide representation and machine learning algorithms are comprehensively reviewed, evaluated and compared. To effectively use existing feature descriptors, we used a feature representation learning strategy that automatically learns the most discriminative features from existing feature descriptors in a supervised way. Our results demonstrate that this strategy is capable of effectively capturing the sequence determinants to represent the characteristics of QSPs, thereby contributing to the improved predictive performance. Furthermore, wrapping this feature representation learning strategy, we developed a powerful predictor named QSPred-FL for the detection of QSPs in large-scale proteomic data. Benchmarking results with 10-fold cross validation showed that QSPred-FL is able to achieve better performance as compared to the state-of-the-art predictors. In addition, we have established a user-friendly webserver that implements QSPred-FL, which is currently available at http://server.malab.cn/QSPred-FL. We expect that this tool will be useful for the high-throughput prediction of QSPs and the discovery of important functional mechanisms of QSPs. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

37. Iterative feature representations improve N4-methylcytosine site prediction.

Author: Wei, Leyi, Su, Ran, Luan, Shasha, Liao, Zhijun, Manavalan, Balachandran, Zou, Quan, and Shi, Xiaolong
Subjects: INTERNET servers, MACHINE learning
Abstract: Motivation: Accurate genome-wide identification of N4-methylcytosine (4mC) modifications can provide insights into their biological functions and mechanisms. Machine learning methods have recently become effective approaches for computational identification of 4mC sites in genomes. Unfortunately, existing methods cannot achieve satisfactory performance, owing to the lack of effective DNA feature representations that are capable of capturing the characteristics of 4mC modifications. Results: In this work, we developed a new predictor named 4mcPred-IFL, aiming to identify 4mC sites. To represent and capture discriminative features, we proposed an iterative feature representation algorithm that learns informative features from several sequential models in a supervised iterative mode. Our analysis results showed that the feature representations learnt by our algorithm can capture the discriminative distribution characteristics between 4mC sites and non-4mC sites, enlarging the decision margin between the positives and negatives in feature space. Additionally, by evaluating and comparing our predictor with the state-of-the-art predictors on benchmark datasets, we demonstrate that our predictor can identify 4mC sites more accurately. Availability and implementation: The user-friendly webserver that implements the proposed 4mcPred-IFL is well established, and is freely accessible at http://server.malab.cn/4mcPred-IFL. Supplementary information: Supplementary data are available at Bioinformatics online. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

38. Research progress in protein posttranslational modification site prediction.

Author: He, Wenying, Wei, Leyi, and Zou, Quan
Subjects: PROTEOMICS, POST-translational modification, UBIQUITINATION, PROTEIN folding, THERAPEUTICS
Abstract: Posttranslational modifications (PTMs) play an important role in regulating protein folding, activity and function and are involved in almost all cellular processes. Identification of PTMs of proteins is the basis for elucidating the mechanisms of cell biology and disease treatments. Compared with the laboriousness of equivalent experimental work, PTM prediction using various machine-learning methods can provide accurate, simple and rapid research solutions and generate valuable information for further laboratory studies. In this review, we manually curate most of the bioinformatics tools published since 2008. We also summarize the approaches for predicting ubiquitination sites and glycosylation sites. Moreover, we discuss the challenges of current PTM bioinformatics tools and look forward to future research possibilities. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

39. Application of Machine Learning in Microbiology.

Author: Qu, Kaiyang, Guo, Fei, Liu, Xiangrong, Lin, Yuan, and Zou, Quan
Subjects: MACHINE learning, MICROBIOLOGY, LITERATURE reviews, NINETEENTH century, FOOD microbiology
Abstract: Microorganisms are ubiquitous and closely related to people's daily lives. Since they were first discovered in the 19th century, researchers have shown great interest in microorganisms. People studied microorganisms through cultivation, but this method is expensive and time consuming. Moreover, the cultivation method cannot keep pace with the development of high-throughput sequencing technology. To deal with this problem, machine learning (ML) methods have been widely applied to the field of microbiology. Literature reviews have shown that ML can be used in many aspects of microbiology research, especially classification problems, and for exploring the interaction between microorganisms and the surrounding environment. In this study, we summarize the application of ML in microbiology. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

40. Exploring sequence-based features for the improved prediction of DNA N4-methylcytosine sites in multiple species.

Author: Wei, Leyi, Luan, Shasha, Nagai, Luis Augusto Eijy, Su, Ran, and Zou, Quan
Subjects: METHYLCYTOSINE, MACHINE learning, SUPPORT vector machines, SPECIES, BIOLOGICAL errors
Abstract: Motivation: As one of the important epigenetic modifications, DNA N4-methylcytosine (4mC) was recently shown to play crucial roles in restriction–modification systems. For a better understanding of its functional mechanisms, it is fundamentally important to identify 4mC modifications. Machine learning methods have recently emerged as an effective and efficient approach for the high-throughput identification of 4mC sites, although high predictive error rates are still challenging for existing methods. Therefore, it is highly desirable to develop a computational method to more accurately identify 4mC sites. Results: In this study, we propose a machine learning based predictor, namely 4mcPred-SVM, for the genome-wide detection of DNA 4mC sites. In this predictor, we present a new feature representation algorithm that sufficiently exploits sequence-based information. To improve the feature representation ability, we use a two-step feature optimization strategy, thereby obtaining the most representative features. Using the resulting features and Support Vector Machine (SVM), we adaptively train the optimal models for different species. Comparative results on benchmark datasets from six species indicate that our predictor is able to achieve generally better performance in predicting 4mC sites as compared to the state-of-the-art predictors. Importantly, the sequence-based features can reliably and robustly predict 4mC sites, facilitating the discovery of potentially important sequence characteristics for the prediction of 4mC sites. Availability and implementation: The user-friendly webserver that implements the proposed 4mcPred-SVM is well established, and is freely accessible at http://server.malab.cn/4mcPred-SVM. Supplementary information: Supplementary data are available at Bioinformatics online. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

41. 4mCPred: machine learning methods for DNA N4-methylcytosine sites prediction.

Author: He, Wenying, Zou, Quan, and Jia, Cangzhi
Subjects: *MACHINE learning, *METHYLCYTOSINE, *EPIGENETICS, *DNA repair, *ELECTRON-ion collisions
Abstract: Motivation: N4-methylcytosine (4mC), an important epigenetic modification formed by the action of specific methyltransferases, plays an essential role in DNA repair, expression and replication. The accurate identification of 4mC sites aids in-depth research into biological functions and mechanisms. Because experimental identification of 4mC sites is time-consuming and costly, especially given the rapid accumulation of gene sequences, supplementation with efficient computational methods is urgently needed. Results: In this study, we developed a new tool, 4mCPred, for predicting 4mC sites in Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Escherichia coli, Geoalkalibacter subterraneus and Geobacter pickeringii. 4mCPred consists of two independent models, 4mCPred_I and 4mCPred_II, for each species. The predictive results of independent and cross-species tests demonstrated that 4mCPred_I is a useful tool. To identify position-specific trinucleotide propensity (PSTNP) and electron-ion interaction potential features, we used the F-score method to construct predictive models and to compare their PSTNP features. Compared with other existing predictors, 4mCPred achieved much higher accuracies in rigorous jackknife and independent tests. We also analyzed the importance of different features in detail. Availability and implementation: The web-server 4mCPred is accessible at http://server.malab.cn/4mCPred/index.jsp. Supplementary information: Supplementary data are available at Bioinformatics online. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

42. Deep learning in omics: a survey and guideline.

Author: Zhang, Zhiqiang, Zhao, Yi, Liao, Xiangke, Shi, Wenqiang, Li, Kenli, Zou, Quan, and Peng, Shaoliang
Subjects: MACHINE learning, DEEP learning, ARTIFICIAL intelligence, GENE expression, DATA analysis
Abstract: Omics, such as genomics, transcriptome and proteomics, has been affected by the era of big data. A huge amount of high dimensional and complex structured data has made it no longer applicable for conventional machine learning algorithms. Fortunately, deep learning technology can contribute toward resolving these challenges. There is evidence that deep learning can handle omics data well and resolve omics problems. This survey aims to provide an entry-level guideline for researchers, to understand and use deep learning in order to solve omics problems. We first introduce several deep learning models and then discuss several research areas which have combined omics and deep learning in recent years. In addition, we summarize the general steps involved in using deep learning which have not yet been systematically discussed in the existent literature on this topic. Finally, we compare the features and performance of current mainstream open source deep learning frameworks and present the opportunities and challenges involved in deep learning. This survey will be a good starting point and guideline for omics researchers to understand deep learning. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

43. Predicting Diabetes Mellitus With Machine Learning Techniques.

Author: Zou, Quan, Qu, Kaiyang, Luo, Yamei, Yin, Dehui, Ju, Ying, and Tang, Hua
Abstract: Diabetes mellitus is a chronic disease characterized by hyperglycemia. It may cause many complications. According to the growing morbidity in recent years, in 2040, the world's diabetic patients will reach 642 million, which means that one in ten adults will be suffering from diabetes. There is no doubt that this alarming figure needs great attention. With its rapid development, machine learning has been applied to many aspects of medical health. In this study, we used decision tree, random forest and neural network to predict diabetes mellitus. The dataset is the hospital physical examination data in Luzhou, China. It contains 14 attributes. In this study, five-fold cross validation was used to examine the models. In order to verify the universal applicability of the methods, we chose some methods that have the better performance to conduct independent test experiments. We randomly selected 68994 healthy people and diabetic patients' data, respectively, as the training set. Because the data are unbalanced, we randomly extracted data five times, and the result is the average of these five experiments. In this study, we used principal component analysis (PCA) and minimum redundancy maximum relevance (mRMR) to reduce the dimensionality. The results showed that prediction with random forest could reach the highest accuracy (ACC = 0.8084) when all the attributes were used. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

44. Special Protein Molecules Computational Identification.

Author: Zou, Quan and He, Wenying
Subjects: *PROTEIN expression, *PROTEIN genetics, *COMPUTATIONAL biology, *MOLECULAR genetics, *GENETIC regulation
Abstract: Computational identification of special protein molecules is a key issue in understanding protein function. It can guide molecular experiments and help to save costs. I assessed 18 papers published in the special issue of Int. J. Mol. Sci., and also discussed the related works. The computational methods employed in this special issue focused on machine learning, network analysis, and molecular docking. New methods and new topics were also proposed. There were in addition several wet experiments, with proven results showing promise. I hope our special issue will help in protein molecules identification researches. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

45. The memory degradation based online sequential extreme learning machine.

Author: Zou, Quan-Yi, Wang, Xiao-Jun, Zhou, Chang-Jun, and Zhang, Qiang
Subjects: *DISTANCE education, *MACHINE learning, *DATA analysis, *ALGORITHMS, *ARTIFICIAL neural networks
Abstract: In online learning, the contribution of old samples to a model decreases as time passes, and old samples gradually become invalid. Although the Online Sequential Extreme Learning Machine (OS-ELM) can avoid the repetitive training of old samples, invalid samples are still used, which works against improving the accuracy of an OS-ELM model. The Online Sequential Extreme Learning Machine with Forgetting Mechanism (FOS-ELM) discards invalid samples in a timely manner, but it does not consider the differences among valid samples, which limits its gains in accuracy and generalization. To solve this issue, the Memory Degradation Based OS-ELM (MDOS-ELM) is proposed in this paper. The MDOS-ELM adjusts the weights of the old and new samples in real time by a self-adaptive memory factor, and simultaneously discards invalid samples. The self-adaptive memory factor is determined by two elements. One is the similarity between the new and old samples, and the other is the prediction errors of the current training samples on the previous model. The performance of the proposed MDOS-ELM is validated on both regression and classification datasets, which include an artificial dataset and twenty-two real-world datasets. The results demonstrate that the MDOS-ELM model outperforms the OS-ELM and the FOS-ELM models in accuracy and generalization. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

46. Identification of drug-side effect association via correntropy-loss based matrix factorization with neural tangent kernel.

Author: Ding, Yijie, Zhou, Hongmei, Zou, Quan, and Yuan, Lei
Subjects: *MATRIX decomposition, *DRUG side effects, *DRUG monitoring, *KERNEL operating systems, *MEDICATION safety, *MACHINE learning
Abstract: • Neural tangent kernel is used to construct the similarity matrices. • Correntropy-loss function is introduced into matrix factorization. • An efficient iterative algorithm is employed to optimize the model. Adverse drug reactions include side effects, allergic reactions, and secondary infections. Severe adverse reactions can cause cancer, deformity, or mutation. The monitoring of drug side effects is an important support for post marketing safety supervision of drugs, and an important basis for revising drug instructions. Its purpose is to timely detect and control drug safety risks. Traditional methods are time-consuming. To accelerate the discovery of side effects, we propose a machine learning based method, called correntropy-loss based matrix factorization with neural tangent kernel (CLMF-NTK), to solve the prediction of drug side effects. Our method and other computational methods are tested on three benchmark datasets, and the results show that our method achieves the best predictive performance. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

47. PhosPred-RF: A Novel Sequence-Based Predictor for Phosphorylation Sites Using Sequential Information Only.

Author: Wei, Leyi, Xing, Pengwei, Tang, Jijun, and Zou, Quan
Abstract: Many recent efforts have been made for the development of machine learning-based methods for fast and accurate phosphorylation site prediction. Currently, a majority of well-performing methods are based on hybrid information to build prediction models, such as evolutionary information, disorder information, and so on. Unfortunately, this type of method suffers from two major limitations: one is that it would not be of much help for protein phosphorylation site prediction when no obvious homology is detected; the other is that computing such complicated information is time-consuming, which probably limits the usage of predictors in practical applications. In this paper, we present a simple, fast, and powerful feature representation algorithm, which sufficiently explores the sequential information from multiple perspectives based only on primary sequences, and successfully captures the differences between true phosphorylation sites and non-phosphorylation sites. Using the proposed features, we propose a random forest-based predictor named PhosPred-RF for the prediction of protein phosphorylation sites. We evaluate and compare the proposed predictor with the state-of-the-art predictors on some benchmark data sets. The experimental results show that PhosPred-RF outperforms other existing predictors, demonstrating its potential to be a useful tool for protein phosphorylation site prediction. Currently, the proposed PhosPred-RF is freely accessible to the public through the user-friendly webserver http://server.malab.cn/PhosPred-RF. [ABSTRACT FROM PUBLISHER]
Published: 2017
Full Text: View/download PDF

48. Improving tRNAscan-SE Annotation Results via Ensemble Classifiers.

Author: Zou, Quan, Guo, Jiasheng, Ju, Ying, Wu, Meihong, Zeng, Xiangxiang, and Hong, Zhiling
Subjects: TRANSFER RNA, DATA mining, MACHINE learning
Abstract: tRNAscan-SE is a tRNA detection program that is widely used for tRNA annotation; however, the false positive rate of tRNAscan-SE is unacceptable for large sequences. Here, we used a machine learning method to try to improve the tRNAscan-SE results. A new predictor, tRNA-Predict, was designed. We obtained real and pseudo-tRNA sequences as training data sets using tRNAscan-SE and constructed three different tRNA feature sets. We then set up an ensemble classifier, LibMutil, to predict tRNAs from the training data. The positive data set of 623 tRNA sequences was obtained from tRNAdb 2009 and the negative data set was the false positive tRNAs predicted by tRNAscan-SE. Our in silico experiments revealed a prediction accuracy rate of 95.1% for tRNA-Predict using 10-fold cross-validation. tRNA-Predict was developed to distinguish functional tRNAs from pseudo-tRNAs rather than to predict tRNAs from a genome-wide scan. However, tRNA-Predict can work with the output of tRNAscan-SE, which is a genome-wide scanning method, to improve the tRNAscan-SE annotation results. The tRNA-Predict web server is accessible at http://datamining.xmu.edu.cn/~gjs/tRNA-Predict. [ABSTRACT FROM AUTHOR]
Published: 2015
Full Text: View/download PDF

49. Advanced Machine Learning Techniques for Bioinformatics.

Author: Zou, Quan and Liu, Qi
Abstract: The papers in this special section focus on the machine learning methods, and applications of these methods to computational biology. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

50. Deep learning based method for predicting DNA N6-methyladenosine sites.

Author: Han, Ke, Wang, Jianchun, Chu, Ying, Liao, Qian, Ding, Yijie, Zheng, Dequan, Wan, Jie, Guo, Xiaoyi, and Zou, Quan
Subjects: *CONVOLUTIONAL neural networks, *DATABASES, *MACHINE learning, *MULTISCALE modeling, *ADENOSINES, *DEEP learning
Abstract: • The use of multi-scale convolutional layers can effectively help to identify hidden dependencies between multiple sequences, capture local patterns in the input sequences more flexibly, and extract location-specific features at different levels. • As global response normalization can achieve global feature aggregation, it can help extract more accurate features in the model and fully express the key information of the 6mA site. • The prediction results are better than other models, and a vector of contribution scores is created that clearly explains the prediction mechanism. DNA N6 methyladenine (6mA) plays an important role in many biological processes, and accurately identifying its sites helps one to understand its biological effects more comprehensively. Previous traditional experimental methods are very labor-intensive and traditional machine learning methods also seem to be somewhat insufficient as the database of 6mA methylation groups becomes progressively larger, so we propose a deep learning-based method called multi-scale convolutional model based on global response normalization (CG6mA) to solve the prediction problem of 6mA site. This method is tested with other methods on three different kinds of benchmark datasets, and the results show that our model can get more excellent prediction results. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

101 results on '"Zou, Quan"'
