24 results on '"Zou, Quan"'
Search Results
2. Deepstacked-AVPs: predicting antiviral peptides using tri-segment evolutionary profile and word embedding based multi-perspective features with deep stacking model
- Author
-
Akbar, Shahid, Raza, Ali, and Zou, Quan
- Published
- 2024
- Full Text
- View/download PDF
3. scMNMF: a novel method for single-cell multi-omics clustering based on matrix factorization.
- Author
-
Qiu, Yushan, Guo, Dong, Zhao, Pu, and Zou, Quan
- Subjects
- MATRIX decomposition, MULTIOMICS, METABOLOMICS, NONNEGATIVE matrices, CONSTRAINED optimization, FEATURE selection, TRANSCRIPTOMES
- Abstract
Motivation: The technology for analyzing single-cell multi-omics data has advanced rapidly and has provided comprehensive and accurate cellular information by exploring cell heterogeneity in genomics, transcriptomics, epigenomics, metabolomics and proteomics data. However, because of the high-dimensional and sparse characteristics of single-cell multi-omics data, as well as the limitations of various analysis algorithms, the clustering performance is generally poor. Matrix factorization is an unsupervised, dimensionality reduction-based method that can cluster individuals and discover related omics variables from different blocks. Here, we present a novel algorithm that performs joint dimensionality reduction learning and cell clustering analysis on single-cell multi-omics data using non-negative matrix factorization that we named scMNMF. We formulate the objective function of joint learning as a constrained optimization problem and derive the corresponding iterative formulas through alternating iterative algorithms. The major advantage of the scMNMF algorithm remains its capability to explore hidden related features among omics data. Additionally, the feature selection for dimensionality reduction and cell clustering mutually influence each other iteratively, leading to a more effective discovery of cell types. We validated the performance of the scMNMF algorithm using two simulated and five real datasets. The results show that scMNMF outperformed seven other state-of-the-art algorithms in various measurements. Availability and implementation: scMNMF code can be found at https://github.com/yushanqiu/scMNMF. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
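The scMNMF abstract above describes joint non-negative matrix factorization across omics blocks with alternating iterative updates. The following is a minimal sketch of that general idea only, not the authors' objective function or code (their implementation is at the GitHub URL in the abstract). It assumes a plain summed Frobenius loss, one shared cell factor `H`, and standard multiplicative updates; the names `joint_nmf`, `blocks`, `rna`, and `atac` are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def total_error(blocks, ws, h):
    """Sum of squared reconstruction errors over all omics blocks."""
    err = 0.0
    for x, w in zip(blocks, ws):
        approx = matmul(w, h)
        err += sum((xv - av) ** 2
                   for x_row, a_row in zip(x, approx)
                   for xv, av in zip(x_row, a_row))
    return err

def joint_nmf(blocks, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor each block X_i (features x cells) as W_i @ H with one shared H."""
    rng = random.Random(seed)
    n_cells = len(blocks[0][0])
    ws = [[[rng.random() + 0.1 for _ in range(rank)] for _ in x] for x in blocks]
    h = [[rng.random() + 0.1 for _ in range(n_cells)] for _ in range(rank)]
    errors = [total_error(blocks, ws, h)]
    for _ in range(n_iter):
        # Update every block-specific feature factor W_i with H held fixed.
        ht = transpose(h)
        hht = matmul(h, ht)
        for i, x in enumerate(blocks):
            num, den = matmul(x, ht), matmul(ws[i], hht)
            ws[i] = [[w * n / (d + eps) for w, n, d in zip(wr, nr, dr)]
                     for wr, nr, dr in zip(ws[i], num, den)]
        # Update the shared cell factor H using contributions from all blocks.
        num = [[0.0] * n_cells for _ in range(rank)]
        den = [[0.0] * n_cells for _ in range(rank)]
        for x, w in zip(blocks, ws):
            wt = transpose(w)
            n_i, d_i = matmul(wt, x), matmul(matmul(wt, w), h)
            num = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(num, n_i)]
            den = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(den, d_i)]
        h = [[v * n / (d + eps) for v, n, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(h, num, den)]
        errors.append(total_error(blocks, ws, h))
    # Assign each cell to the latent component with the largest loading.
    labels = [max(range(rank), key=lambda r: h[r][c]) for c in range(n_cells)]
    return ws, h, labels, errors

# Toy data (made up for illustration): two omics blocks measured on the same six cells.
rna = [[5, 5, 5, 0, 0, 0], [4, 4, 4, 0, 0, 0], [0, 0, 0, 3, 3, 3]]
atac = [[0, 0, 0, 2, 2, 2], [6, 6, 6, 0, 0, 0]]
ws, h, labels, errors = joint_nmf([rna, atac], rank=2)
```

Sharing `H` across blocks is what ties the omics layers together: each cell gets one low-dimensional representation that has to explain all of its measurements at once.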
4. DeepAVP-TPPred: identification of antiviral peptides using transformed image-based localized descriptors and binary tree growth algorithm.
- Author
-
Ullah, Matee, Akbar, Shahid, Raza, Ali, and Zou, Quan
- Subjects
- ARTIFICIAL neural networks, TREE growth, PEPTIDES, LIFE cycles (Biology), FEATURE selection, FEATURE extraction, IDENTIFICATION
- Abstract
Motivation: Despite the extensive manufacturing of antiviral drugs and vaccination, viral infections continue to be a major human ailment. Antiviral peptides (AVPs) have emerged as potential candidates in the pursuit of novel antiviral drugs. These peptides show vigorous antiviral activity against a diverse range of viruses by targeting different phases of the viral life cycle. Therefore, the accurate prediction of AVPs is an essential yet challenging task. Lately, many machine learning-based approaches have been developed for this purpose; however, their limited capabilities in terms of feature engineering, accuracy, and generalization make these methods restricted. Results: In the present study, we aim to develop an efficient machine learning-based approach for the identification of AVPs, referred to as DeepAVP-TPPred, to address the aforementioned problems. First, we extract two new transformed feature sets using our designed image-based feature extraction algorithms and integrate them with an evolutionary information-based feature. Next, these feature sets were optimized using a novel feature selection approach called the binary tree growth algorithm. Finally, the optimal feature space from the training dataset was fed to the deep neural network to build the final classification model. The proposed model DeepAVP-TPPred was tested using stringent 5-fold cross-validation and two independent dataset testing methods, which achieved the maximum performance and showed enhanced efficiency over existing predictors in terms of both accuracy and generalization capabilities. Availability and implementation: https://github.com/MateeullahKhan/DeepAVP-TPPred. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
5. iTTCA-RF: a random forest predictor for tumor T cell antigens
- Author
-
Jiao, Shihu, Zou, Quan, Guo, Huannan, and Shi, Lei
- Published
- 2021
- Full Text
- View/download PDF
6. RFhy-m2G: Identification of RNA N2-methylguanosine modification sites based on random forest and hybrid features.
- Author
-
Ao, Chunyan, Zou, Quan, and Yu, Liang
- Subjects
- *RANDOM forest algorithms, *RNA modification & restriction, *TRANSFER RNA, *FEATURE selection, *PREDICTION models
- Abstract
• A novel method was proposed to identify RNA m2G sites using hybrid features.
• The over-sampling method SMOTE was adopted to deal with the problem of data imbalance.
• After using MRMD to select features, the performance of the model is improved.
• RFhy-m2G is superior to other methods and can effectively identify m2G sites.
N2-methylguanosine is a post-transcriptional modification of RNA that is found in eukaryotes and archaea. The biological function of m2G modification discovered so far is to control and stabilize the three-dimensional structure of tRNA and the dynamic barrier of reverse transcription. To discover additional biological functions of m2G, it is necessary to develop time-saving and labor-saving calculation tools to identify m2G. In this paper, based on hybrid features and a random forest, a novel predictor, RFhy-m2G, was developed to identify the m2G modification sites for three species. The hybrid feature used by the predictor fuses three features: ENAC, PseDNC, and NPPS. These three features include primary sequence derivation properties, physicochemical properties, and position-specific properties. Since there are redundant features in hybrid features, MRMD2.0 is used for optimal feature selection. Through feature analysis, it is found that the optimal hybrid features obtained still contain three kinds of properties, and the hybrid features can more accurately identify m2G modification sites and improve prediction performance. Based on five-fold cross-validation and independent testing to evaluate the prediction model, the accuracies obtained were 0.9982 and 0.9417, respectively. The robustness of the predictor is demonstrated by comparisons with other predictors. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
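The RFhy-m2G record above mentions SMOTE over-sampling to handle class imbalance. As a rough illustration of the core SMOTE step only (interpolating between a minority sample and one of its nearest minority neighbours), here is a stdlib-only sketch. It is not the authors' pipeline; the function name `smote_sketch`, the parameters, and the toy points are assumptions made for this example.

```python
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by neighbour interpolation.

    minority: list of feature vectors (needs at least two samples).
    k: number of nearest minority neighbours to choose from.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minority class (made-up values).
minority_points = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority_points, n_new=5, k=2)
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region the minority class already occupies. Over-sampling should be applied to the training folds only, so that synthetic points never leak into the evaluation data.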
7. Machine Learning and Its Applications for Protozoal Pathogens and Protozoal Infectious Diseases.
- Author
-
Hu, Rui-Si, Hesham, Abd El-Latif, and Zou, Quan
- Subjects
- PROTOZOAN diseases, COMMUNICABLE diseases, MACHINE learning, FEATURE selection, SUPPORT vector machines, PUBLIC health surveillance
- Abstract
In recent years, the development and application of machine learning (ML) in the field of infectious diseases has attracted massive attention, not only serving as a catalyst for academic studies but also as a key means of detecting pathogenic microorganisms, implementing public health surveillance, exploring host-pathogen interactions, discovering drug and vaccine candidates, and so forth. These applications also include the management of infectious diseases caused by protozoal pathogens, such as Plasmodium, Trypanosoma, Toxoplasma, Cryptosporidium, and Giardia, a class of fatal or life-threatening causative agents capable of infecting humans and a wide range of animals. With the reduction of computational cost, availability of effective ML algorithms, popularization of ML tools, and accumulation of high-throughput data, it is possible to implement the integration of ML applications into increasing scientific research related to protozoal infection. Here, we will present a brief overview of important concepts in ML serving as background knowledge, with a focus on basic workflows, popular algorithms (e.g., support vector machine, random forest, and neural networks), feature extraction and selection, and model evaluation metrics. We will then review current ML applications and major advances concerning protozoal pathogens and protozoal infectious diseases through combination with correlative biology expertise and provide forward-looking insights for perspectives and opportunities in future advances in ML techniques in this field. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
8. iAFPs-Mv-BiTCN: Predicting antifungal peptides using self-attention transformer embedding and transform evolutionary based multi-view features with bidirectional temporal convolutional networks.
- Author
-
Akbar, Shahid, Zou, Quan, Raza, Ali, and Alarfaj, Fawaz Khaled
- Abstract
Globally, fungal infections have become a major health concern in humans. Fungal diseases generally occur due to the invading fungus appearing on a specific portion of the body and becoming hard for the human immune system to resist. The recent emergence of COVID-19 has intensely increased different nosocomial fungal infections. The existing wet-laboratory-based medications are expensive, time-consuming, and may have adverse side effects on normal cells. In the last decade, peptide therapeutics have gained significant attention due to their high specificity in targeting affected cells without affecting healthy cells. Motivated by the significance of peptide-based therapies, we developed a highly discriminative prediction scheme called iAFPs-Mv-BiTCN to predict antifungal peptides correctly. The training peptides are encoded using word embedding methods such as skip-gram and attention mechanism-based bidirectional encoder representation using transformer. Additionally, transform-based evolutionary features are generated using the pseudo position-specific scoring matrix with discrete wavelet transform (PsePSSM-DWT). The fused vector of word embedding and evolutionary descriptors is formed to compensate for the limitations of single encoding methods. A Shapley Additive exPlanations (SHAP) based global interpolation approach is applied to reduce training costs by choosing the optimal feature set. The selected feature set is trained using a bi-directional temporal convolutional network (BiTCN). The proposed iAFPs-Mv-BiTCN model achieved a predictive accuracy of 98.15 % and an AUC of 0.99 using training samples. In the case of the independent samples, our model obtained an accuracy of 94.11 % and an AUC of 0.98. Our iAFPs-Mv-BiTCN model outperformed existing models with a ~4 % and ~5 % higher accuracy using training and independent samples, respectively. The reliability and efficacy of the proposed iAFPs-Mv-BiTCN model make it a valuable tool for scientists and may perform a beneficial role in pharmaceutical design and research academia.
• A bidirectional temporal convolutional network-based computational model is developed for the prediction of antifungal peptides.
• A transform evolutionary matrix, a self-attention based transformer, and fastText-based word embedding are employed to numerically represent the peptide samples.
• SHAP interpolation-based feature selection is applied to select optimal features from the hybrid vector.
• The proposed iAFPs-Mv-BiTCN model achieved higher predictive results on the training and independent datasets than existing computational models.
[ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
9. NmRF: identification of multispecies RNA 2'-O-methylation modification sites from RNA sequences.
- Author
-
Ao, Chunyan, Zou, Quan, and Yu, Liang
- Subjects
- *RNA modification & restriction, *TRANSFER RNA, *FEATURE selection, *RANDOM forest algorithms, *METHYL groups, *MACHINE learning, *BOOSTING algorithms
- Abstract
2'-O-methylation (Nm) is a post-transcriptional modification of RNA that is catalyzed by 2'-O-methyltransferase and involves replacing the H on the 2′-hydroxyl group with a methyl group. The 2'-O-methylation modification site is detected in a variety of RNA types (miRNA, tRNA, mRNA, etc.), plays an important role in biological processes and is associated with different diseases. There are few functional mechanisms developed at present, and traditional high-throughput experiments are time-consuming and expensive to explore functional mechanisms. For a deeper understanding of relevant biological mechanisms, it is necessary to develop efficient and accurate recognition tools based on machine learning. Based on this, we constructed a predictor called NmRF based on optimal mixed features and a random forest classifier to identify 2'-O-methylation modification sites. The predictor can identify modification sites of multiple species at the same time. To obtain a better prediction model, a two-step strategy is adopted; that is, the optimal hybrid feature set is obtained by combining the light gradient boosting algorithm and an incremental feature selection strategy. In 10-fold cross-validation, the accuracies for Homo sapiens and Saccharomyces cerevisiae were 89.069% and 93.885%, and the AUC values were 0.9498 and 0.9832, respectively. The rigorous 10-fold cross-validation and independent tests confirm that the proposed method is significantly better than existing tools. A user-friendly web server is accessible at http://lab.malab.cn/∼acy/NmRF. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
10. HSM6AP: a high-precision predictor for the Homo sapiens N6-methyladenosine (m6A) based on multiple weights and feature stitching.
- Author
-
Li, Jing, He, Shida, Guo, Fei, and Zou, Quan
- Subjects
- BOOSTING algorithms, RNA methylation, HUMAN beings, DECISION trees, FEATURE selection, WEIGHT training, RNA modification & restriction
- Abstract
Recent studies have shown that RNA methylation modification can affect RNA transcription, metabolism, splicing and stability. In addition, RNA methylation modification has been associated with cancer, obesity and other diseases. Based on information about the human genome and machine learning, this paper discusses the effect of the fusion sequence and gene-level feature extraction on the accuracy of methylation site recognition. The discovery of new features exposed significant limitations of existing computing tools: (1) most prediction models are based solely on sequence features and use SVM or random forest as classification methods; (2) limited by the number of samples, the model may not achieve good performance. In order to establish a better prediction model for methylation sites, we must set specific weighting strategies for training samples and find more powerful and informative feature matrices to establish a comprehensive model. In this paper, we present HSM6AP, a high-precision predictor for the Homo sapiens N6-methyladenosine (m6A) based on multiple weights and feature stitching. In contrast to existing methods, HSM6AP weights its training samples and explores a wide range of features. Max-Relevance-Max-Distance (MRMD) is employed for feature selection, and the feature matrix is generated by fusing a single feature. Extreme gradient boosting (XGBoost), an integrated machine learning algorithm based on decision trees, is used for model training and improves model performance through parameter adjustment. Two rigorous independent data sets demonstrated the superiority of HSM6AP in identifying methylation sites. HSM6AP is an advanced predictor that can be directly employed by users (especially non-professional users) to predict methylation sites. Users can access our related tools and data sets at the following website: The codes of our tool can be publicly accessible at [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
11. DisBalance: a platform to automatically build balance-based disease prediction models and discover microbial biomarkers from microbiome data.
- Author
-
Yang, Fenglong and Zou, Quan
- Subjects
- *PREDICTION models, *MEDICAL model, *BIOMARKERS, *ULCERATIVE colitis, *FEATURE selection
- Abstract
How best to utilize the microbial taxonomic abundances in regard to the prediction and explanation of human diseases remains appealing and challenging, and the relative nature of microbiome data necessitates a proper feature selection method to resolve the compositional problem. In this study, we developed an all-in-one platform to address a series of issues in microbiome-based human disease prediction and taxonomic biomarker discovery. We prioritize the interpretation, runtime and classification accuracy of the distal discriminative balances analysis (DBA-distal) method in selecting a set of distal discriminative balances, and develop DisBalance, a comprehensive platform, to integrate and streamline the workflows of disease model building, disease risk prediction and disease-related biomarker discovery for microbiome-based binary classifications. DisBalance allows de novo model building and disease risk prediction in a very fast and convenient way. To facilitate model-driven and knowledge-driven discoveries, DisBalance dedicates multiple strategies to the mining of microbial biomarkers. The independent validation of the models constructed by the DisBalance pipeline is performed on seven microbiome datasets from the original article of DBA-distal. The implementation of the DisBalance platform is demonstrated by a complete analysis of a shotgun metagenomic dataset of Ulcerative Colitis (UC). As a free and open-source platform, DisBalance can be accessed at http://lab.malab.cn/soft/DisBalance. The source code and demo data for DisBalance are available at https://github.com/yangfenglong/DisBalance. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
12. Anticancer peptides prediction with deep representation learning features.
- Author
-
Lv, Zhibin, Cui, Feifei, Zou, Quan, Zhang, Lichao, and Xu, Lei
- Subjects
- DEEP learning, ARTIFICIAL neural networks, BOOSTING algorithms, PEPTIDES, FEATURE selection
- Abstract
Anticancer peptides constitute one of the most promising therapeutic agents for combating common human cancers. Using wet experiments to verify whether a peptide displays anticancer characteristics is time-consuming and costly. Hence, in this study, we proposed a computational method named identify anticancer peptides via deep representation learning features (iACP-DRLF) using the light gradient boosting machine algorithm and deep representation learning features. Two kinds of sequence embedding technologies were used, namely soft symmetric alignment embedding and unified representation (UniRep) embedding, both of which involved deep neural network models based on long short-term memory networks and their derived networks. The results showed that the use of deep representation learning features greatly improved the capability of the models to discriminate anticancer peptides from other peptides. Also, UMAP (uniform manifold approximation and projection for dimension reduction) and SHAP (Shapley additive explanations) analysis proved that UniRep features have an advantage over other features for anticancer peptide identification. The Python script and pretrained models can be downloaded from https://github.com/zhibinlv/iACP-DRLF or from http://public.aibiochem.net/iACP-DRLF/. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
13. Goals and approaches for each processing step for single-cell RNA sequencing data.
- Author
-
Zhang, Zilong, Cui, Feifei, Wang, Chunyu, Zhao, Lingling, and Zou, Quan
- Subjects
- RNA sequencing, FEATURE selection, GENE expression, QUALITY control, DATA reduction, PIPELINE inspection
- Abstract
Single-cell RNA sequencing (scRNA-seq) has enabled researchers to study gene expression at the cellular level. However, due to the extremely low levels of transcripts in a single cell and technical losses during reverse transcription, gene expression at a single-cell resolution is usually noisy and highly dimensional; thus, statistical analyses of single-cell data are a challenge. Although many scRNA-seq data analysis tools are currently available, a gold standard pipeline is not available for all datasets. Therefore, a general understanding of bioinformatics and associated computational issues would facilitate the selection of appropriate tools for a given set of data. In this review, we provide an overview of the goals and most popular computational analysis tools for the quality control, normalization, imputation, feature selection and dimension reduction of scRNA-seq data. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
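The scRNA-seq review above lists normalization and feature selection among the standard processing steps. The sketch below illustrates one common variant of each, under stated assumptions: library-size scaling followed by a log transform, and ranking genes by variance. The review itself does not prescribe these particular choices, and the function names and toy counts are hypothetical.

```python
import math

def log_normalize(counts, scale=10_000):
    """Scale each cell to a common depth, then apply log(1 + x).

    counts: list of cells, each a list of raw gene counts.
    """
    normalized = []
    for cell in counts:
        depth = sum(cell)
        normalized.append(
            [math.log1p(c / depth * scale) if depth else 0.0 for c in cell]
        )
    return normalized

def top_variable_genes(matrix, n_top):
    """Return indices of the n_top genes with the highest variance across cells."""
    n_genes = len(matrix[0])
    variances = []
    for g in range(n_genes):
        column = [cell[g] for cell in matrix]
        mean = sum(column) / len(column)
        variances.append(sum((v - mean) ** 2 for v in column) / len(column))
    return sorted(range(n_genes), key=lambda g: variances[g], reverse=True)[:n_top]

# Toy count matrix (made up): 3 cells x 3 genes.
raw = [[10, 0, 90], [20, 0, 180], [50, 50, 0]]
norm = log_normalize(raw)
selected = top_variable_genes(norm, n_top=1)
```

The first two toy cells differ only in sequencing depth, so after normalization their profiles coincide; this is the effect library-size scaling is meant to achieve before downstream dimension reduction.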
14. DeepATT: a hybrid category attention neural network for identifying functional effects of DNA sequences.
- Author
-
Li, Jiawei, Pu, Yuqian, Tang, Jijun, Zou, Quan, and Guo, Fei
- Subjects
- DNA sequencing, NON-coding DNA, CONVOLUTIONAL neural networks, FEATURE selection, FEATURE extraction
- Abstract
Quantifying DNA properties is a challenging task in the broad field of human genomics. Since the vast majority of non-coding DNA is still poorly understood in terms of function, this task is particularly important to have enormous benefit for biology research. Various DNA sequences should have a great variety of representations, and specific functions may focus on corresponding features in the front part of learning model. Currently, however, for multi-class prediction of non-coding DNA regulatory functions, most powerful predictive models do not have appropriate feature extraction and selection approaches for specific functional effects, so that it is difficult to gain a better insight into their internal correlations. Hence, we design a category attention layer and category dense layer in order to select efficient features and distinguish different DNA functions. In this study, we propose a hybrid deep neural network method, called DeepATT, for identifying 919 regulatory functions on nearly 5 million DNA sequences. Our model has four built-in neural network constructions: convolution layer captures regulatory motifs, recurrent layer captures a regulatory grammar, category attention layer selects corresponding valid features for different functions and category dense layer classifies predictive labels with selected features of regulatory functions. Importantly, we compare our novel method, DeepATT, with existing outstanding prediction tools, DeepSEA and DanQ. DeepATT performs significantly better than other existing tools for identifying DNA functions, at least increasing 1.6% area under precision recall. Furthermore, we can mine the important correlation among different DNA functions according to the category attention module. Moreover, our novel model can greatly reduce the number of parameters by the mechanism of attention and locally connected, on the basis of ensuring accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
15. Using a low correlation high orthogonality feature set and machine learning methods to identify plant pentatricopeptide repeat coding gene/protein.
- Author
-
Feng, Changli, Zou, Quan, and Wang, Donghua
- Subjects
- *PENTATRICOPEPTIDE repeat genes, *MACHINE learning, *NAIVE Bayes classification, *AMINO acid sequence, *FEATURE selection, *PRINCIPAL components analysis
- Abstract
Identifying whether a pentatricopeptide repeat (PPR) exists in an amino acid sequence is a significant task in the field of bioinformatics. To address this problem, an identification method that combines an optimal feature set selection framework and machine learning algorithms is proposed to recognise the PPR coding genes and proteins in an amino acid sequence. The original 188-dimensional (D) features are obtained using a feature extraction method, which is successively optimised through a covariance analysis, max-relevant-max-distance processing, and a principal component analysis to reduce it to an optimal feature set that has fewer but more expressive features. Four machine learning methods are then used to serve as the classifiers for the identification task. The final number of feature data dimensions is reduced from 188 to only 10, and according to the experimental results from support vector machine methods, the losses in the AUC and F1 values are only 3.26% and 10.1%, respectively. Moreover, after applying the J48, random forest, and naïve Bayes methods as classifiers, it was also found that the optimal feature set with 10 dimensions has an almost equivalent performance for a 10-fold validation test. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
16. 2lpiRNApred: a two-layered integrated algorithm for identifying piRNAs and their functions based on LFE-GM feature selection.
- Author
-
Zuo, Yun, Zou, Quan, Lin, Jianyuan, Jiang, Min, and Liu, Xiangrong
- Abstract
Piwi-interacting RNAs (piRNAs) are indispensable in transposon silencing, including in germ cell formation, germline stem cell maintenance, spermatogenesis, and oogenesis. piRNA pathways are amongst the major genome defence mechanisms, which maintain genome integrity. They also have important functions in tumorigenesis, as indicated by aberrantly expressed piRNAs being recently shown to play roles in the process of cancer development. A number of computational methods for piRNA identification have recently been proposed, but they still have not yielded satisfactory predictive performance. Moreover, only one computational method that identifies whether piRNAs function in inducing target mRNA deadenylation has been reported in the literature. In this study, we developed a two-layered integrated classifier algorithm, 2lpiRNApred. It identifies piRNAs in the first layer and determines whether they function in inducing target mRNA deadenylation in the second layer. A new feature selection algorithm, which was based on Luca fuzzy entropy and Gaussian membership function (LFE-GM), was proposed to reduce the dimensionality of the features. Five feature extraction strategies, namely, Kmer, General parallel correlation pseudo-dinucleotide composition, General series correlation pseudo-dinucleotide composition, Normalized Moreau–Broto autocorrelation, and Geary autocorrelation, and two types of classifier, Sparse Representation Classifier (SRC) and support vector machine with Mahalanobis distance-based radial basis function (SVMMDRBF), were used to construct 2lpiRNApred. The results indicate that 2lpiRNApred performs significantly better than six other existing prediction tools. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
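Kmer is the first of the five feature extraction strategies named in the 2lpiRNApred abstract above. A minimal sketch of k-mer frequency encoding follows. It shows the general technique only; the RNA alphabet, the handling of ambiguous bases, and the function name `kmer_frequencies` are assumptions for illustration and may differ from the authors' implementation.

```python
from itertools import product

def kmer_frequencies(seq, k=2, alphabet="ACGU"):
    """Encode a sequence as the normalized frequencies of all possible k-mers.

    Windows containing characters outside the alphabet are skipped.
    The output has len(alphabet) ** k entries, in lexicographic k-mer order.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]

# Toy sequence (made up): 2-mer profile has 4 ** 2 = 16 dimensions.
features = kmer_frequencies("UGAGGUAGUAGGUUGUAUAGUU", k=2)
```

Sequences stored with the DNA alphabet can be encoded by passing `alphabet="ACGT"`. The feature dimension grows as 4^k, which is one reason a feature selection step usually follows.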
17. Combining Sparse Group Lasso and Linear Mixed Model Improves Power to Detect Genetic Variants Underlying Quantitative Traits.
- Author
-
Guo, Yingjie, Wu, Chenxi, Guo, Maozu, Zou, Quan, Liu, Xiaoyan, and Keinan, Alon
- Subjects
- SINGLE nucleotide polymorphisms, RANDOM effects model, QUANTITATIVE genetics, FEATURE selection
- Abstract
Genome-wide association studies (GWAS), based on testing one single nucleotide polymorphism (SNP) at a time, have revolutionized our understanding of the genetics of complex traits. In GWAS, there is a need to consider confounding effects, such as those due to population structure, and to take groups of SNPs into account simultaneously because of the "polygenic" attribute of complex quantitative traits. In this paper, we propose a new approach, SGL-LMM, that combines sparse group lasso (SGL) and linear mixed model (LMM) for multivariate associations of quantitative traits. LMM, as has been often used in GWAS, controls for confounders, while SGL maintains sparsity of the underlying multivariate regression model. SGL-LMM first sets a fixed zero effect to learn the parameters of random effects using LMM, and then estimates fixed effects using SGL regularization. We present efficient algorithms for hyperparameter tuning and feature selection using stability selection. While controlling for confounders and constraining for sparse solutions, SGL-LMM also provides a natural framework for incorporating prior biological information into the group structure underlying the model. Results based on both simulated and real data show SGL-LMM outperforms previous approaches in terms of power to detect associations and accuracy of quantitative trait prediction. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
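The SGL-LMM abstract above relies on sparse group lasso regularization, which combines an elementwise L1 penalty with a per-group L2 penalty. One standard way to apply such a penalty inside a proximal-gradient solver is the proximal operator sketched below: elementwise soft-thresholding followed by group-wise shrinkage. This is an illustration of the penalty only; the abstract does not say which solver the authors use, group-size weights are omitted for brevity, and all names and numbers here are hypothetical.

```python
import math

def soft_threshold(x, t):
    """Shrink x toward zero by t, returning zero if |x| <= t."""
    return math.copysign(max(abs(x) - t, 0.0), x)

def prox_sparse_group_lasso(beta, groups, lam_l1, lam_group, step=1.0):
    """Proximal operator of step * (lam_l1 * ||b||_1 + lam_group * sum_g ||b_g||_2).

    beta: list of coefficients (e.g. one per SNP).
    groups: list of index lists that partition range(len(beta)).
    """
    out = [0.0] * len(beta)
    for idx in groups:
        # Step 1: elementwise soft-thresholding gives within-group sparsity.
        shrunk = [soft_threshold(beta[j], step * lam_l1) for j in idx]
        # Step 2: shrink the whole group; weak groups are zeroed out entirely.
        norm = math.sqrt(sum(v * v for v in shrunk))
        scale = max(0.0, 1.0 - step * lam_group / norm) if norm > 0 else 0.0
        for j, v in zip(idx, shrunk):
            out[j] = scale * v
    return out

# Toy coefficients (made up): two groups of two SNPs each.
coefs = [3.0, -0.5, 0.2, 0.1]
snp_groups = [[0, 1], [2, 3]]
sparse_coefs = prox_sparse_group_lasso(coefs, snp_groups, lam_l1=0.5, lam_group=1.0)
```

In this toy case the second group is removed as a whole while the first group keeps only its strongest coefficient, which is the two-level sparsity that makes the penalty attractive when SNPs are grouped by prior biological information such as genes or pathways.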
18. Special Protein Molecules Computational Identification.
- Author
-
Zou, Quan and He, Wenying
- Subjects
- *PROTEIN expression, *PROTEIN genetics, *COMPUTATIONAL biology, *MOLECULAR genetics, *GENETIC regulation
- Abstract
Computational identification of special protein molecules is a key issue in understanding protein function. It can guide molecular experiments and help to save costs. I assessed 18 papers published in the special issue of Int. J. Mol. Sci., and also discussed the related works. The computational methods employed in this special issue focused on machine learning, network analysis, and molecular docking. New methods and new topics were also proposed. There were in addition several wet experiments, with proven results showing promise. I hope our special issue will help research on the identification of protein molecules. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
19. Identifying and Classifying Enhancers by Dinucleotide-Based Auto-Cross Covariance and Attention-Based Bi-LSTM.
- Author
-
Zhao, Shulin, Pan, Qingfeng, Zou, Quan, Ju, Ying, Shi, Lei, and Su, Xi
- Subjects
- *NON-coding DNA, *FEATURE selection, *GENE enhancers, *RANDOM forest algorithms, *DINUCLEOTIDES, *PREDICTION models
- Abstract
Enhancers are a class of noncoding DNA elements located near structural genes. In recent years, their identification and classification have been the focus of research in the field of bioinformatics. However, due to their high free scattering and position variability, although the performance of the prediction model has been continuously improved, there is still a lot of room for progress. In this paper, density-based spatial clustering of applications with noise (DBSCAN) was used to screen the physicochemical properties of dinucleotides to extract dinucleotide-based auto-cross covariance (DACC) features; the features were then reduced with the Python feature selection toolkit MRMD 2.0. The reduced features were input into a random forest to identify enhancers. The enhancer classification model was built with word2vec and attention-based Bi-LSTM. Finally, the accuracies of our enhancer identification and classification models were 77.25% and 73.50%, respectively, and the Matthews' correlation coefficients (MCCs) were 0.5470 and 0.4881, respectively, which were better than the performance of most predictors. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
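The auto-cross covariance features named in the abstract of entry 19 can be illustrated with a minimal Python sketch. It follows one common formulation of lagged covariance over dinucleotide property values and is not the authors' implementation. The function names and the toy property table are hypothetical placeholders, and the DBSCAN screening and MRMD 2.0 steps are not shown.

```python
def dinucleotides(seq):
    """Split a DNA sequence into its overlapping dinucleotides."""
    return [seq[i:i + 2] for i in range(len(seq) - 1)]


def cross_covariance(seq, prop_a, prop_b, lag):
    """Lagged covariance between two dinucleotide property tables.

    prop_a and prop_b map each dinucleotide to a numeric value.
    Passing the same table twice gives the auto covariance.
    """
    pairs = dinucleotides(seq)
    n = len(pairs) - lag
    if lag < 1 or n <= 0:
        raise ValueError("lag must be >= 1 and shorter than the sequence")
    # Property means are taken over all dinucleotides in the sequence.
    mean_a = sum(prop_a[d] for d in pairs) / len(pairs)
    mean_b = sum(prop_b[d] for d in pairs) / len(pairs)
    total = sum(
        (prop_a[pairs[i]] - mean_a) * (prop_b[pairs[i + lag]] - mean_b)
        for i in range(n)
    )
    return total / n


# Toy values for illustration only; these are NOT real physicochemical data.
TOY_PROPERTY = {"AA": 1.0, "AC": 2.0, "CC": 3.0}

auto_cov_lag2 = cross_covariance("AACC", TOY_PROPERTY, TOY_PROPERTY, lag=2)
```

A feature vector would then concatenate these values over the chosen properties (and property pairs) and lags.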
20. DP-AOP: A novel SVM-based antioxidant proteins identifier.
- Author
-
Meng, Chaolu, Pei, Yue, Zou, Quan, and Yuan, Lei
- Subjects
*INTERNET servers, *MACHINE learning, *FEATURE selection, *CLASSIFICATION algorithms, *PROTEOMICS, *PROTEINS, *FEATURE extraction - Abstract
The identification of antioxidant proteins is a challenging yet meaningful task, as they can protect against the damage caused by some free radicals. In addition to time-consuming, laborious, and expensive experimental identification methods, efficient identification of antioxidant proteins through machine learning algorithms has become increasingly common. In recent years, researchers have proposed models for identifying antioxidant proteins; unfortunately, although the accuracy of these models is already high, their sensitivity is too low, indicating the possibility of overfitting. Therefore, we developed a new model called DP-AOP for the recognition of antioxidant proteins. We used the SMOTE algorithm to balance the dataset, selected Wei's proposed feature extraction algorithm to obtain 473-dimensional feature vectors, and, based on the sorting function in MRMD, scored and ranked each feature to obtain a feature set with contribution values ranging from high to low. To effectively reduce the feature dimension, we combined the dynamic programming idea to make the local eight features the optimal subset. After obtaining the 36-dimensional feature vectors, we finally selected 17 features through experimental analysis. The SVM classification algorithm was used to implement the model through the libsvm tool. The model achieved satisfactory performance, with an accuracy rate of 91.076 %, SN of 96.4 %, SP of 85.8 %, MCC of 82.6 %, and F1 score of 91.5 %. Furthermore, we built a free web server to facilitate researchers' subsequent studies of antioxidant protein recognition. The website is http://112.124.26.17:8003/#/. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
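The abstract of entry 20 reports accuracy, SN (sensitivity), SP (specificity), MCC, and F1. As a minimal sketch, these measures are conventionally computed from the counts of a binary confusion matrix as shown below. The function name and the example counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute common binary classification measures from confusion counts."""
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # SN, true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # SP, true negative rate
    accuracy = (tp + tn) / total if total else 0.0
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0
    # Matthews correlation coefficient, defined as 0 when a margin is empty.
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
        "f1": f1,
    }


# Made-up counts for illustration only.
example = binary_metrics(tp=90, fp=20, tn=80, fn=10)
```

Note that MCC ranges from -1 to 1, so a value written as a percentage corresponds to the coefficient multiplied by 100.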
21. CWLy-SVM: A support vector machine-based tool for identifying cell wall lytic enzymes.
- Author
-
Meng, Chaolu, Guo, Fei, and Zou, Quan
- Subjects
*LYSINS, *INTERNET servers, *FEATURE selection, *FEATURE extraction, *DRUG development, *BACTERIAL cell walls - Abstract
• We identified cell wall lytic enzymes in a bioinformatic way to overcome the inefficiency of in vitro experiments and provide a website tool that wraps the proposed model.
• Our proposed model outperforms the state-of-the-art method in the jackknife cross-validation test.
• We comprehensively analyzed the optimal feature set of the proposed model from the perspective of data and biological meaning.
Cell wall lytic enzymes, as an important biotechnical tool in drug development, agriculture and the food industry, have attracted increasing research attention. Accurate identification of cell wall lytic enzymes is one of the key and fundamental tasks in this research. In this study, in order to eliminate the inefficiency of in vitro experiments, a support vector machine-based cell wall lytic enzyme identification model was constructed using bioinformatics. This machine learning process includes feature extraction, feature selection, model training and optimization. According to the jackknife cross-validation test, this model obtained a sensitivity of 0.853, a specificity of 0.977, an MCC of 0.845 and an AUC of 0.915. These benchmark results demonstrate that the proposed model outperforms the state-of-the-art method and that it has powerful cell wall lytic enzyme identification ability. Furthermore, we comprehensively analyzed the selected optimal features and used the proposed model to construct a user-friendly web server called CWLy-SVM to identify cell wall lytic enzymes, which is available at http://server.malab.cn/CWLy-SVM/index.jsp. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
22. Multi-correntropy fusion based fuzzy system for predicting DNA N4-methylcytosine sites.
- Author
-
Ding, Yijie, Tiwari, Prayag, Guo, Fei, and Zou, Quan
- Subjects
*FUZZY systems, *DEEP learning, *STATISTICAL learning, *FUZZY logic, *FEATURE selection, *DNA - Abstract
The identification of DNA N4-methylcytosine (4mC) sites is an important field of bioinformatics. Statistical learning methods and deep learning have been applied in this direction. Previous methods focused on feature representation and feature selection, and did not take into account the deviation of noise samples for recognition. Moreover, these models were not established from the perspective of prediction error distribution. To solve the problem of complex error distribution, we propose a maximum multi-correntropy criterion based kernelized higher-order fuzzy inference system (MMC-KHFIS), which is constructed with multi-correntropy fusion. Six 4mC and eight UCI data sets are employed to evaluate our model. The MMC-KHFIS achieves better performance in the experiment.
• For complex error distribution, a multi-correntropy based fuzzy system is built.
• A fuzzy kernel is built to solve feature space projection in each fuzzy subset.
• An efficient iterative algorithm is employed to optimize the fuzzy system.
[ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
23. Review and comparative analysis of machine learning-based phage virion protein identification methods.
- Author
-
Meng, Chaolu, Zhang, Jun, Ye, Xiucai, Guo, Fei, and Zou, Quan
- Subjects
*PROTEOMICS, *VIRION, *SUPPORT vector machines, *BACTERIOPHAGES, *FEATURE selection, *IMAGE recognition (Computer vision), *FEATURE extraction - Abstract
Phage virion protein (PVP) identification plays a key role in elucidating relationships between phages and hosts. Moreover, PVP identification can facilitate the design of related biochemical entities. Recently, several machine learning approaches have emerged for this purpose and have shown their potential. In this study, the proposed PVP identifiers are systematically reviewed, and the related algorithms and tools are comprehensively analyzed. We summarized the common framework of these PVP identifiers and constructed our own novel identifiers based upon the framework. Furthermore, we focus on a performance comparison of all PVP identifiers by using a training dataset and an independent dataset. Highlighting the pros and cons of these identifiers demonstrates that g-gap DPC (dipeptide composition) features are capable of representing characteristics of PVPs. Moreover, SVM (support vector machine) is proven to be the more effective classifier to distinguish PVPs and non-PVPs.
• We give a detailed overview of the feature extraction, feature selection and classifiers used in the PVP recognition process.
• The general framework of the PVP recognition machine learning model is summarized.
• The performance of PVP identifiers is compared and their limits and strengths are given.
[ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
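The g-gap DPC features highlighted in the abstract of entry 23 can be sketched in a few lines of Python. This is a minimal illustration of the usual definition (the frequency of residue pairs separated by g intervening positions), not the code used in the review. The function name is hypothetical, and skipping pairs that contain non-standard residues is an assumption made here for simplicity.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def g_gap_dpc(sequence, g=0):
    """Return the 400-dimensional g-gap dipeptide composition as a dict.

    With g=0 this is the ordinary (adjacent) dipeptide composition.
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    for i in range(len(sequence) - g - 1):
        pair = sequence[i] + sequence[i + g + 1]
        if pair in counts:  # assumption: ignore pairs with non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return {p: 0.0 for p in pairs}
    # Normalize counts so the features sum to 1.
    return {p: c / total for p, c in counts.items()}


features = g_gap_dpc("ACAC", g=1)
```

The resulting 400 values can be passed to any classifier, such as the SVM discussed in the abstract, after converting the dict to a fixed-order vector.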
24. Taxonomy dimension reduction for colorectal cancer prediction.
- Author
-
Qu, Kaiyang, Gao, Feng, Guo, Fei, and Zou, Quan
- Subjects
*COLORECTAL cancer, *FEATURE selection, *TAXONOMY, *MACHINE learning, *DECISION trees, *FEATURE extraction, *DIMENSION reduction (Statistics) - Abstract
• Using the species and number of microorganisms to predict CRC, experimental data can be easily obtained.
• Taxonomy files were used to predict CRC, and the influential microorganisms were found.
• A variety of feature extraction methods are used to improve the prediction accuracy.
• Machine learning methods are used to improve prediction efficiency.
• Ensemble feature selection methods were proposed, which can use fewer features and get better results.
A growing number of people suffer from colorectal cancer, which is one of the most common cancers. It is essential to diagnose and treat the cancer as early as possible. The disease may change the microorganism communities in the gut, and employing gut microorganisms could be an efficient way to predict colorectal cancer. In this study, we selected operational taxonomic units that include several kinds of microorganisms to predict colorectal cancer. To find the most important microorganisms and obtain the best prediction performance, we explore effective feature selection methods. We employ three main steps. First, we use a single method to reduce features. Next, to further reduce the number of features, we integrate the dimension reduction methods correlation-based feature selection and maximum relevance–maximum distance (MRMD 1.0 and MRMD 2.0). Then, we select the important features according to the taxonomy files. In this study, we created training and test sets to obtain a more objective evaluation. Random forest, naïve Bayes, and decision tree classifiers were evaluated. The results show that the methods proposed in this study are better than hierarchical feature engineering. The proposed method, which combines correlation-based feature selection with MRMD 2.0, performed the best on the CRC2 dataset. The dataset and methods can be found at http://lab.malab.cn/data/microdata/data.html. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF