338 results on '"rna-seq data"'
Search Results
2. Integration of genomic and transcriptomic layers in RNA‐Seq data leads to protein interaction modules with improved Alzheimer's disease associations.
- Author
- Düz, Elif, İlgün, Atılay, Bozkurt, Fatma Betül, and Çakır, Tunahan
- Subjects
- *HUMAN gene mapping, *GENE expression, *ALZHEIMER'S disease, *NEURODEGENERATION, *GENE regulatory networks
- Abstract
Alzheimer's disease (AD) is the most common neurodegenerative disease, and it is currently untreatable. RNA sequencing (RNA‐Seq) is commonly used in the literature to identify AD‐associated molecular mechanisms by analysing changes in gene expression. RNA‐Seq data can also be used to detect genomic variants, enabling the identification of the genes with a higher load of deleterious variants in patients compared with controls. Here, we analysed AD RNA‐Seq datasets to obtain differentially expressed genes and genes with a higher load of pathogenic variants in AD, and we combined them in a single list. We mapped these genes on a human protein–protein interaction network to discover subnetworks perturbed by AD. Our results show that utilizing gene pathogenicity information from RNA‐Seq data positively contributes to the disclosure of AD‐related mechanisms. Moreover, dividing the discovered subnetworks into highly connected modules reveals a clearer picture of altered molecular pathways that, otherwise, would not be captured. Repeating the whole pipeline with human metabolic network genes led to results confirming the positive contribution of gene pathogenicity information and enabled a more detailed identification of altered metabolic pathways in AD. [ABSTRACT FROM AUTHOR]
- Published
- 2024
3. Codes between Poles: Linking Transcriptomic Insights into the Neurobiology of Bipolar Disorder.
- Author
- Garcia, Jon Patrick T. and Tayo, Lemmuel L.
- Subjects
- *GENETIC variation, *SINGLE nucleotide polymorphisms, *BIPOLAR disorder, *GENETICS, *CINGULATE cortex
- Abstract
Simple Summary: Bipolar disorder is a psychiatric condition in which one prominent symptom is the regular occurrence of mood swings. Its root cause remains unclear; thus, this study was intended to provide new insights to better understand the genetics underlying such a disorder. Using samples obtained from three regions in the brain, it was found that some molecular and cellular mechanisms are involved in the disruption of neurobiological processes. The dysregulated expression of certain genes eventually leads to neurogenesis and neurotransmission impairment events in humans. Furthermore, these genes significant in the onset of bipolar disorder were identified and evaluated for the presence of variants, which may be targeted to engineer better curative treatment strategies for the disorder. Bipolar disorder (BPD) is a serious psychiatric condition that is characterized by the frequent shifting of mood patterns, ranging from manic to depressive episodes. Although there are already treatment strategies that aim at regulating the manifestations of this disorder, its etiology remains unclear and continues to be a question of interest within the scientific community. The development of RNA sequencing techniques has provided newer and better approaches to studying disorders at the transcriptomic level. Hence, using RNA-seq data, we employed intramodular connectivity analysis and network pharmacology assessment of disease-associated variants to elucidate the biological pathways underlying the complex nature of BPD. This study was intended to characterize the expression profiles obtained from three regions in the brain, which are the nucleus accumbens (nAcc), the anterior cingulate cortex (AnCg), and the dorsolateral prefrontal cortex (DLPFC), provide insights into the specific roles of these regions in the onset of the disorder, and present potential targets for drug design and development. 
The nAcc was found to be highly associated with genes responsible for the deregulated transcription of neurotransmitters, while the DLPFC was greatly correlated with genes involved in the impairment of components crucial in neurotransmission. The AnCg did show association with some of the expressions, but the relationship was not as strong as the other two regions. Furthermore, disease-associated variants or single nucleotide polymorphisms (SNPs) were identified among the significant genes in BPD, which suggests the genetic interrelatedness of such a disorder and other mental illnesses. DRD2, GFRA2, and DCBLD1 were the genes with disease-associated variants expressed in the nAcc; ST8SIA2 and ADAMTS16 were the genes with disease-associated variants expressed in the AnCg; and FOXO3, ITGA9, CUBN, PLCB4, and RORB were the genes with disease-associated variants expressed in the DLPFC. Aside from unraveling the molecular and cellular mechanisms behind the expression of BPD, this investigation was envisioned to propose a new research pipeline in studying the transcriptome of psychiatric disorders to support and improve existing studies. [ABSTRACT FROM AUTHOR]
- Published
- 2024
4. Identification of potential targetable genes in papillary, follicular, and anaplastic thyroid carcinoma using bioinformatics analysis.
- Author
- Agarwal, Shipra, Gupta, Shikha, and Raj, Rishav
- Abstract
Purpose: To perform an extensive exploratory analysis to build a deeper insight into clinically relevant molecular biomarkers in Papillary, Follicular, and Anaplastic thyroid carcinomas (PTC, FTC, ATC). Methods: Thirteen Thyroid Cancer (THCA) datasets incorporating PTC, FTC, and ATC were derived from the Gene Expression Omnibus. Genes differentially expressed (DEGs) between THCA and normal were identified and subjected to GO and KEGG analyses. Multiple topological properties were harnessed and protein-protein interaction (PPI) networks were constructed to identify the hub genes followed by survival analysis and validation. Results: There were 70, 87, and 377 DEGs, and 23, 27, and 53 hub genes for PTC, FTC, and ATC samples, respectively. Survival analysis detected 39 overall and 49 relapse-free survival-relevant hub genes. Six hub genes, BCL2, FN1, ITPR1, LYVE1, NTRK2, TBC1D4, were found common to more than one THCA type. The most significant hub genes found in the study were: BCL2, CD44, DCN, FN1, IRS1, ITPR1, MFAP4, MKI67, NTRK2, PCLO, TGFA. The most enriched and significant GO terms were Melanocyte differentiation for PTC, Extracellular region for FTC, and Extracellular exosome for ATC. Prostate cancer for PTC was the most significantly enriched KEGG pathway. The results were validated using TCGA data. Conclusions: The findings unravel potential biomarkers and therapeutic targets of thyroid carcinomas. [ABSTRACT FROM AUTHOR]
- Published
- 2024
5. A comprehensive workflow for optimizing RNA-seq data analysis
- Author
- Gao Jiang, Juan-Yu Zheng, Shu-Ning Ren, Weilun Yin, Xinli Xia, Yun Li, and Hou-Ling Wang
- Subjects
- RNA-seq data, Differential gene analysis, Software comparison, Biotechnology, TP248.13-248.65, Genetics, QH426-470
- Abstract
Background: Current RNA-seq analysis software for RNA-seq data tends to use similar parameters across different species without considering species-specific differences. However, the suitability and accuracy of these tools may vary when analyzing data from different species, such as humans, animals, plants, fungi, and bacteria. For most laboratory researchers lacking a background in information science, determining how to construct an analysis workflow that meets their specific needs from the array of complex analytical tools available poses a significant challenge. Results: By utilizing RNA-seq data from plants, animals, and fungi, it was observed that different analytical tools demonstrate some variations in performance when applied to different species. A comprehensive experiment was conducted specifically for analyzing plant pathogenic fungal data, focusing on differential gene analysis as the ultimate goal. In this study, 288 pipelines using different tools were applied to analyze five fungal RNA-seq datasets, and the performance of their results was evaluated based on simulation. This led to the establishment of a relatively universal and superior fungal RNA-seq analysis pipeline that can serve as a reference, and certain standards for selecting analysis tools were derived for reference. Additionally, we compared various tools for alternative splicing analysis. The results based on simulated data indicated that rMATS remained the optimal choice, although consideration could be given to supplementing with tools such as SpliceWiz. Conclusion: The experimental results demonstrate that, in comparison to the default software parameter configurations, the analysis combination results after tuning can provide more accurate biological insights. It is beneficial to carefully select suitable analysis software based on the data, rather than indiscriminately choosing tools, in order to achieve high-quality analysis results more efficiently.
- Published
- 2024
6. Genome-Wide Identification and Characterization of RdHSP Genes Related to High Temperature in Rhododendron delavayi.
- Author
- Wang, Cheng, Wang, Xiaojing, Zhou, Ping, and Li, Changchun
- Subjects
- GENE expression, CHROMOSOME analysis, HEAT shock proteins, PLANT genes, GENE families
- Abstract
Heat shock proteins (HSPs) are molecular chaperones that play essential roles in plant development and in response to various environmental stresses. Understanding R. delavayi HSP genes is of great importance since R. delavayi is severely affected by heat stress. In the present study, a total of 76 RdHSP genes were identified in the R. delavayi genome, which were divided into five subfamilies based on molecular weight and domain composition. Analyses of the chromosome distribution, gene structure, and conserved motif of the RdHSP family genes were conducted using bioinformatics analysis methods. Gene duplication analysis showed that 15 and 8 RdHSP genes were obtained and retained from the WGD/segmental duplication and tandem duplication, respectively. Cis-element analysis revealed the importance of RdHSP genes in plant adaptations to the environment. Moreover, the expression patterns of RdHSP family genes were investigated in R. delavayi treated with high temperature based on our RNA-seq data, which were further verified by qRT-PCR. Further analysis revealed that nine candidate genes, including six RdHSP20 subfamily genes (RdHSP20.4, RdHSP20.8, RdHSP20.6, RdHSP20.3, RdHSP20.10, and RdHSP20.15) and three RdHSP70 subfamily genes (RdHSP70.15, RdHSP70.21, and RdHSP70.16), might be involved in enhancing the heat stress tolerance. The subcellular localization of two candidate RdHSP genes (RdHSP20.8 and RdHSP20.6) showed that two candidate RdHSPs were expressed and function in the chloroplast and nucleus, respectively. These results provide a basis for the functional characterization of HSP genes and investigations on the molecular mechanisms of heat stress response in R. delavayi. [ABSTRACT FROM AUTHOR]
- Published
- 2024
7. A comprehensive workflow for optimizing RNA-seq data analysis.
- Author
- Jiang, Gao, Zheng, Juan-Yu, Ren, Shu-Ning, Yin, Weilun, Xia, Xinli, Li, Yun, and Wang, Hou-Ling
- Subjects
- RNA sequencing, ALTERNATIVE RNA splicing, WORKFLOW management systems, DATA analysis, WORKFLOW, MICROBIAL inoculants, RESEARCH personnel
- Abstract
Background: Current RNA-seq analysis software for RNA-seq data tends to use similar parameters across different species without considering species-specific differences. However, the suitability and accuracy of these tools may vary when analyzing data from different species, such as humans, animals, plants, fungi, and bacteria. For most laboratory researchers lacking a background in information science, determining how to construct an analysis workflow that meets their specific needs from the array of complex analytical tools available poses a significant challenge. Results: By utilizing RNA-seq data from plants, animals, and fungi, it was observed that different analytical tools demonstrate some variations in performance when applied to different species. A comprehensive experiment was conducted specifically for analyzing plant pathogenic fungal data, focusing on differential gene analysis as the ultimate goal. In this study, 288 pipelines using different tools were applied to analyze five fungal RNA-seq datasets, and the performance of their results was evaluated based on simulation. This led to the establishment of a relatively universal and superior fungal RNA-seq analysis pipeline that can serve as a reference, and certain standards for selecting analysis tools were derived for reference. Additionally, we compared various tools for alternative splicing analysis. The results based on simulated data indicated that rMATS remained the optimal choice, although consideration could be given to supplementing with tools such as SpliceWiz. Conclusion: The experimental results demonstrate that, in comparison to the default software parameter configurations, the analysis combination results after tuning can provide more accurate biological insights. It is beneficial to carefully select suitable analysis software based on the data, rather than indiscriminately choosing tools, in order to achieve high-quality analysis results more efficiently. [ABSTRACT FROM AUTHOR]
- Published
- 2024
8. Analysis of gene expression dynamics and differential expression in viral infections using generalized linear models and quasi-likelihood methods.
- Author
- Rezapour, Mostafa, Walker, Stephen J., Ornelles, David A., McNutt, Patrick M., Atala, Anthony, and Gurcan, Metin Nafi
- Subjects
- GENE expression, VIRUS diseases, ORGANS (Anatomy), PARAINFLUENZA viruses, GENETIC regulation, HOST-virus relationships
- Abstract
Introduction: Our study undertakes a detailed exploration of gene expression dynamics within human lung organ tissue equivalents (OTEs) in response to Influenza A virus (IAV), Human metapneumovirus (MPV), and Parainfluenza virus type 3 (PIV3) infections. Through the analysis of RNA-Seq data from 19,671 genes, we aim to identify differentially expressed genes under various infection conditions, elucidating the complexities of virus-host interactions. Methods: We employ Generalized Linear Models (GLMs) with Quasi-Likelihood (QL) F-tests (GLMQL) and introduce the novel Magnitude-Altitude Score (MAS) and Relaxed Magnitude-Altitude Score (RMAS) algorithms to navigate the intricate landscape of RNA-Seq data. This approach facilitates the precise identification of potential biomarkers, highlighting the host's reliance on innate immune mechanisms. Our comprehensive methodological framework includes RNA extraction, library preparation, sequencing, and Gene Ontology (GO) enrichment analysis to interpret the biological significance of our findings. Results: The differential expression analysis unveils significant changes in gene expression triggered by IAV, MPV, and PIV3 infections. The MAS and RMAS algorithms enable focused identification of biomarkers, revealing a consistent activation of interferon-stimulated genes (e.g., IFIT1, IFIT2, IFIT3, OAS1) across all viruses. Our GO analysis provides deep insights into the host's defense mechanisms and viral strategies exploiting host cellular functions. Notably, changes in cellular structures, such as cilium assembly and mitochondrial ribosome assembly, indicate a strategic shift in cellular priorities. The precision of our methodology is validated by a 92% mean accuracy in classifying respiratory virus infections using multinomial logistic regression, demonstrating the superior efficacy of our approach over traditional methods. 
Discussion: This study highlights the intricate interplay between viral infections and host gene expression, underscoring the need for targeted therapeutic interventions. The stability and reliability of the MAS/RMAS ranking method, even under stringent statistical corrections, and the critical importance of adequate sample size for biomarker reliability are significant findings. Our comprehensive analysis not only advances our understanding of the host's response to viral infections but also sets a new benchmark for the identification of biomarkers, paving the way for the development of effective diagnostic and therapeutic strategies. [ABSTRACT FROM AUTHOR]
- Published
- 2024
9. Multi-System-Level Analysis Reveals Differential Expression of Stress Response-Associated Genes in Inflammatory Solar Lentigo.
- Author
- Jeong, Jisu, Lee, Wonmin, Kim, Ye-Ah, Lee, Yun-Ji, Kim, Sohyun, Shin, Jaeyeon, Choi, Yueun, Kim, Jihan, Lee, Yoonsung, Kim, Man S., and Kwon, Soon-Hyo
- Subjects
- *DEVIATORIC stress (Engineering), *NF-kappa B, *LENTIGO, *REGULATOR genes, *CELLULAR aging, *ULTRAVIOLET radiation
- Abstract
Although the pathogenesis of solar lentigo (SL) involves chronic ultraviolet (UV) exposure, cellular senescence, and upregulated melanogenesis, underlying molecular-level mechanisms associated with SL remain unclear. The aim of this study was to investigate the gene regulatory mechanisms intimately linked to inflammation in SL. Skin samples from patients with SL with or without histological inflammatory features were obtained. RNA-seq data from the samples were analyzed via multiple analysis approaches, including exploration of core inflammatory gene alterations, identifying functional pathways at both transcription and protein levels, comparison of inflammatory module (gene clusters) activation levels, and analyzing correlations between modules. These analyses disclosed specific core genes implicated in oxidative stress, especially the upregulation of nuclear factor kappa B in the inflammatory SLs, while genes associated with protective mechanisms, such as SLC6A9, were highly expressed in the non-inflammatory SLs. For inflammatory modules, Extracellular Immunity and Mitochondrial Innate Immunity were exclusively upregulated in the inflammatory SL. Analysis of protein–protein interactions revealed the significance of CXCR3 upregulation in the pathogenesis of inflammatory SL. In conclusion, the upregulation of stress response-associated genes and inflammatory pathways in response to UV-induced oxidative stress implies their involvement in the pathogenesis of inflammatory SL. [ABSTRACT FROM AUTHOR]
- Published
- 2024
10. PRANA: an R package for differential co-expression network analysis with the presence of additional covariates
- Author
- Seungjun Ahn and Somnath Datta
- Subjects
- Differential network analysis, Pseudo-value regression, RNA-Seq data, Covariate adjustment, Biotechnology, TP248.13-248.65, Genetics, QH426-470
- Abstract
Background: Advances in sequencing technology and cost reduction have enabled an emergence of various statistical methods used in RNA-sequencing data, including the differential co-expression network analysis (or differential network analysis). A key benefit of this method is that it takes into consideration the interactions between or among genes and do not require an established knowledge in biological pathways. As of now, none of existing softwares can incorporate covariates that should be adjusted if they are confounding factors while performing the differential network analysis. Results: We develop an R package PRANA which a user can easily include multiple covariates. The main R function in this package leverages a novel pseudo-value regression approach for a differential network analysis in RNA-sequencing data. This software is also enclosed with complementary R functions for extracting adjusted p-values and coefficient estimates of all or specific variable for each gene, as well as for identifying the names of genes that are differentially connected (DC, hereafter) between subjects under biologically different conditions from the output. Conclusion: Herewith, we demonstrate the application of this package in a real data on chronic obstructive pulmonary disease. PRANA is available through the CRAN repositories under the GPL-3 license: https://cran.r-project.org/web/packages/PRANA/index.html.
- Published
- 2023
11. Analysis of gene expression dynamics and differential expression in viral infections using generalized linear models and quasi-likelihood methods
- Author
- Mostafa Rezapour, Stephen J. Walker, David A. Ornelles, Patrick M. McNutt, Anthony Atala, and Metin Nafi Gurcan
- Subjects
- 3D airway organ tissue equivalent (OTEs), RNA-seq data, generalized linear models, quasi-likelihood F-test, differentially expressed genes, Microbiology, QR1-502
- Abstract
Introduction: Our study undertakes a detailed exploration of gene expression dynamics within human lung organ tissue equivalents (OTEs) in response to Influenza A virus (IAV), Human metapneumovirus (MPV), and Parainfluenza virus type 3 (PIV3) infections. Through the analysis of RNA-Seq data from 19,671 genes, we aim to identify differentially expressed genes under various infection conditions, elucidating the complexities of virus-host interactions. Methods: We employ Generalized Linear Models (GLMs) with Quasi-Likelihood (QL) F-tests (GLMQL) and introduce the novel Magnitude-Altitude Score (MAS) and Relaxed Magnitude-Altitude Score (RMAS) algorithms to navigate the intricate landscape of RNA-Seq data. This approach facilitates the precise identification of potential biomarkers, highlighting the host’s reliance on innate immune mechanisms. Our comprehensive methodological framework includes RNA extraction, library preparation, sequencing, and Gene Ontology (GO) enrichment analysis to interpret the biological significance of our findings. Results: The differential expression analysis unveils significant changes in gene expression triggered by IAV, MPV, and PIV3 infections. The MAS and RMAS algorithms enable focused identification of biomarkers, revealing a consistent activation of interferon-stimulated genes (e.g., IFIT1, IFIT2, IFIT3, OAS1) across all viruses. Our GO analysis provides deep insights into the host’s defense mechanisms and viral strategies exploiting host cellular functions. Notably, changes in cellular structures, such as cilium assembly and mitochondrial ribosome assembly, indicate a strategic shift in cellular priorities. 
The precision of our methodology is validated by a 92% mean accuracy in classifying respiratory virus infections using multinomial logistic regression, demonstrating the superior efficacy of our approach over traditional methods. Discussion: This study highlights the intricate interplay between viral infections and host gene expression, underscoring the need for targeted therapeutic interventions. The stability and reliability of the MAS/RMAS ranking method, even under stringent statistical corrections, and the critical importance of adequate sample size for biomarker reliability are significant findings. Our comprehensive analysis not only advances our understanding of the host’s response to viral infections but also sets a new benchmark for the identification of biomarkers, paving the way for the development of effective diagnostic and therapeutic strategies.
- Published
- 2024
12. PRANA: an R package for differential co-expression network analysis with the presence of additional covariates.
- Author
- Ahn, Seungjun and Datta, Somnath
- Subjects
- CHRONIC obstructive pulmonary disease, COST control
- Abstract
Background: Advances in sequencing technology and cost reduction have enabled an emergence of various statistical methods used in RNA-sequencing data, including the differential co-expression network analysis (or differential network analysis). A key benefit of this method is that it takes into consideration the interactions between or among genes and do not require an established knowledge in biological pathways. As of now, none of existing softwares can incorporate covariates that should be adjusted if they are confounding factors while performing the differential network analysis. Results: We develop an R package PRANA which a user can easily include multiple covariates. The main R function in this package leverages a novel pseudo-value regression approach for a differential network analysis in RNA-sequencing data. This software is also enclosed with complementary R functions for extracting adjusted p-values and coefficient estimates of all or specific variable for each gene, as well as for identifying the names of genes that are differentially connected (DC, hereafter) between subjects under biologically different conditions from the output. Conclusion: Herewith, we demonstrate the application of this package in a real data on chronic obstructive pulmonary disease. PRANA is available through the CRAN repositories under the GPL-3 license: https://cran.r-project.org/web/packages/PRANA/index.html. [ABSTRACT FROM AUTHOR]
- Published
- 2023
13. Clustering Matrix Variate Longitudinal Count Data
- Author
- Sanjeena Subedi
- Subjects
- cluster analysis, RNA-seq data, matrix variate discrete data, longitudinal data, mixture models, Electronic computers. Computer science, QA75.5-76.95, Probabilities. Mathematical statistics, QA273-280
- Abstract
Matrix variate longitudinal discrete data can arise in transcriptomics studies when the data are collected for N genes at r conditions over t time points, and thus, each observation Yn for n=1,…,N can be written as an r×t matrix. When dealing with such data, the number of parameters in the model can be greatly reduced by considering the matrix variate structure. The components of the covariance matrix then also provide a meaningful interpretation. In this work, a mixture of matrix variate Poisson-log normal distributions is introduced for clustering longitudinal read counts from RNA-seq studies. To account for the longitudinal nature of the data, a modified Cholesky-decomposition is utilized for a component of the covariance structure. Furthermore, a parsimonious family of models is developed by imposing constraints on elements of these decompositions. The models are applied to both real and simulated data, and it is demonstrated that the proposed approach can recover the underlying cluster structure.
- Published
- 2023
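The abstract of entry 13 mentions a modified Cholesky decomposition for a component of the covariance structure. As a rough illustration only, the sketch below shows the commonly used form of that decomposition, T Σ Tᵀ = D with T unit lower triangular and D diagonal. That the paper uses exactly this form is an assumption; the function name `modified_cholesky` and the toy covariance matrix are hypothetical and not taken from the paper.

```python
import numpy as np

def modified_cholesky(sigma):
    """Return (T, D) with T unit lower triangular, D diagonal, and T @ sigma @ T.T == D.

    Assumes sigma is a symmetric positive-definite covariance matrix.
    """
    sigma = np.asarray(sigma, dtype=float)
    chol = np.linalg.cholesky(sigma)       # sigma = chol @ chol.T, chol lower triangular
    scale = np.diag(chol)                  # positive diagonal entries of chol
    unit_lower = chol / scale              # divide column j by chol[j, j] -> unit diagonal
    t_matrix = np.linalg.inv(unit_lower)   # inverse of a unit lower triangular matrix
    d_matrix = np.diag(scale ** 2)         # diagonal matrix D
    return t_matrix, d_matrix

# Hypothetical covariance across three time points (illustrative values only).
sigma_example = np.array([
    [1.00, 0.50, 0.25],
    [0.50, 1.00, 0.50],
    [0.25, 0.50, 1.00],
])
t_example, d_example = modified_cholesky(sigma_example)
```

In longitudinal settings the below-diagonal entries of T are often read as (negated) autoregressive coefficients and the diagonal of D as innovation variances, which is one reason constraints on these elements can yield a parsimonious family of models.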
14. Genome-Wide Identification and Characterization of RdHSP Genes Related to High Temperature in Rhododendron delavayi
- Author
- Cheng Wang, Xiaojing Wang, Ping Zhou, and Changchun Li
- Subjects
- RdHSP gene family, expression pattern, high-temperature stress, RNA-seq data, subcellular localization, Botany, QK1-989
- Abstract
Heat shock proteins (HSPs) are molecular chaperones that play essential roles in plant development and in response to various environmental stresses. Understanding R. delavayi HSP genes is of great importance since R. delavayi is severely affected by heat stress. In the present study, a total of 76 RdHSP genes were identified in the R. delavayi genome, which were divided into five subfamilies based on molecular weight and domain composition. Analyses of the chromosome distribution, gene structure, and conserved motif of the RdHSP family genes were conducted using bioinformatics analysis methods. Gene duplication analysis showed that 15 and 8 RdHSP genes were obtained and retained from the WGD/segmental duplication and tandem duplication, respectively. Cis-element analysis revealed the importance of RdHSP genes in plant adaptations to the environment. Moreover, the expression patterns of RdHSP family genes were investigated in R. delavayi treated with high temperature based on our RNA-seq data, which were further verified by qRT-PCR. Further analysis revealed that nine candidate genes, including six RdHSP20 subfamily genes (RdHSP20.4, RdHSP20.8, RdHSP20.6, RdHSP20.3, RdHSP20.10, and RdHSP20.15) and three RdHSP70 subfamily genes (RdHSP70.15, RdHSP70.21, and RdHSP70.16), might be involved in enhancing the heat stress tolerance. The subcellular localization of two candidate RdHSP genes (RdHSP20.8 and RdHSP20.6) showed that two candidate RdHSPs were expressed and function in the chloroplast and nucleus, respectively. These results provide a basis for the functional characterization of HSP genes and investigations on the molecular mechanisms of heat stress response in R. delavayi.
- Published
- 2024
15. clrDV: a differential variability test for RNA-Seq data based on the skew-normal distribution.
- Author
- Hongxiang Li and Tsung Fei Khang
- Subjects
- RNA sequencing, SKEWNESS (Probability theory), ALZHEIMER'S disease, GENE expression, FALSE discovery rate, NEURODEGENERATION, FALSE positive error
- Abstract
Background. Pathological conditions may result in certain genes having expression variance that differs markedly from that of the control. Finding such genes from gene expression data can provide invaluable candidates for therapeutic intervention. Under the dominant paradigm for modeling RNA-Seq gene counts using the negative binomial model, tests of differential variability are challenging to develop, owing to dependence of the variance on the mean. Methods. Here, we describe clrDV, a statistical method for detecting genes that show differential variability between two populations. We present the skew-normal distribution for modeling gene-wise null distribution of centered log-ratio transformation of compositional RNA-seq data. Results. Simulation results show that clrDV has false discovery rate and probability of Type II error that are on par with or superior to existing methodologies. In addition, its run time is faster than its closest competitors, and remains relatively constant for increasing sample size per group. Analysis of a large neurodegenerative disease RNASeq dataset using clrDV successfully recovers multiple gene candidates that have been reported to be associated with Alzheimer's disease. [ABSTRACT FROM AUTHOR]
- Published
- 2023
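The abstract of entry 15 refers to the centered log-ratio (CLR) transformation of compositional RNA-seq data. A minimal sketch of that transformation follows, for illustration only. It is not the clrDV implementation: the function name `clr_transform`, the pseudocount of 0.5, and the toy matrix are assumptions introduced here.

```python
import numpy as np

def clr_transform(counts, pseudocount=0.5):
    """Centered log-ratio transform of a genes x samples count matrix.

    Each sample (column) is treated as a composition: take logs, then
    subtract the column's mean log value (the log of its geometric mean).
    """
    counts = np.asarray(counts, dtype=float)
    # The pseudocount avoids log(0) for unobserved genes; its value is an assumption.
    log_counts = np.log(counts + pseudocount)
    return log_counts - log_counts.mean(axis=0, keepdims=True)

# Toy matrix: 3 genes (rows) x 2 samples (columns), illustrative values only.
toy_counts = np.array([[10, 200], [0, 50], [35, 5]])
clr_values = clr_transform(toy_counts)
```

By construction, the CLR values within each sample sum to zero, so they describe expression relative to the sample's geometric mean rather than absolute abundance.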
16. PlaASDB: a comprehensive database of plant alternative splicing events in response to stress
- Author
- Xiaokun Guo, Tianpeng Wang, Linyang Jiang, Huan Qi, and Ziding Zhang
- Subjects
- Plant, Alternative splicing, RNA-Seq data, Stress response, Arabidopsis, Rice, Botany, QK1-989
- Abstract
Background: Alternative splicing (AS) is a co-transcriptional regulatory mechanism of plants in response to environmental stress. However, the role of AS in biotic and abiotic stress responses remains largely unknown. To speed up our understanding of plant AS patterns under different stress responses, development of informative and comprehensive plant AS databases is highly demanded. Description: In this study, we first collected 3,255 RNA-seq data under biotic and abiotic stresses from two important model plants (Arabidopsis and rice). Then, we conducted AS event detection and gene expression analysis, and established a user-friendly plant AS database termed PlaASDB. By using representative samples from this highly integrated database resource, we compared AS patterns between Arabidopsis and rice under abiotic and biotic stresses, and further investigated the corresponding difference between AS and gene expression. Specifically, we found that differentially spliced genes (DSGs) and differentially expressed genes (DEG) share very limited overlapping under all kinds of stresses, suggesting that gene expression regulation and AS seemed to play independent roles in response to stresses. Compared with gene expression, Arabidopsis and rice were more inclined to have conserved AS patterns under stress conditions. Conclusion: PlaASDB is a comprehensive plant-specific AS database that mainly integrates the AS and gene expression data of Arabidopsis and rice in stress response. Through large-scale comparative analyses, the global landscape of AS events in Arabidopsis and rice was observed. We believe that PlaASDB could help researchers understand the regulatory mechanisms of AS in plants under stresses more conveniently. PlaASDB is freely accessible at http://zzdlab.com/PlaASDB/ASDB/index.html.
- Published
- 2023
17. A pseudo-value regression approach for differential network analysis of co-expression data
- Author
-
Seungjun Ahn, Tyler Grimes, and Somnath Datta
- Subjects
Pseudo-value ,Differential network analysis ,Regression method ,Gene regulatory network ,RNA-seq data ,Computer applications to medicine. Medical informatics ,R858-859.7 ,Biology (General) ,QH301-705.5 - Abstract
Abstract Background The differential network (DN) analysis identifies changes in measures of association among genes under two or more experimental conditions. In this article, we introduce a pseudo-value regression approach for network analysis (PRANA). This is a novel method of differential network analysis that also adjusts for additional clinical covariates. We start from mutual information criteria, followed by pseudo-value calculations, which are then entered into a robust regression model. Results This article assesses the model performances of PRANA in a multivariable setting, followed by a comparison to dnapath and DINGO in both univariable and multivariable settings through a variety of simulations. Performance in terms of precision, recall, and F1 score of differentially connected (DC) genes is assessed. By and large, PRANA outperformed dnapath and DINGO, neither of which is equipped to adjust for available covariates such as patient age. Lastly, we employ PRANA in a real data application from the Gene Expression Omnibus database to identify DC genes that are associated with chronic obstructive pulmonary disease to demonstrate its utility. Conclusion To the best of our knowledge, this is the first attempt to use regression modeling for DN analysis by collective gene expression levels between two or more groups with the inclusion of additional clinical covariates. By and large, adjusting for available covariates improves the accuracy of a DN analysis.
- Published
- 2023
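The abstract above scores methods by the precision, recall, and F1 of the differentially connected (DC) genes they flag. A minimal sketch of those three measures follows, assuming a predicted gene set is compared against a known reference set; the function name and the gene labels are hypothetical and are not taken from PRANA.

```python
def precision_recall_f1(predicted, truth):
    """Score a predicted gene set against a reference set of truly DC genes."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)  # genes flagged and truly DC
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical gene labels, for illustration only.
p, r, f1 = precision_recall_f1({"g1", "g2", "g3", "g4"}, {"g2", "g3", "g5"})
```

Here two of the four flagged genes are in the reference set, so precision is lower than recall; F1 is the harmonic mean of the two.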
18. Clustering Matrix Variate Longitudinal Count Data.
- Author
-
Subedi, Sanjeena
- Subjects
RNA sequencing ,CLUSTER analysis (Statistics) ,GAUSSIAN distribution ,DATA analysis ,ANALYSIS of covariance - Abstract
Matrix variate longitudinal discrete data can arise in transcriptomics studies when the data are collected for N genes at r conditions over t time points, and thus, each observation Y_n, for n = 1, ..., N, can be written as an r × t matrix. When dealing with such data, the number of parameters in the model can be greatly reduced by considering the matrix variate structure. The components of the covariance matrix then also provide a meaningful interpretation. In this work, a mixture of matrix variate Poisson-log normal distributions is introduced for clustering longitudinal read counts from RNA-seq studies. To account for the longitudinal nature of the data, a modified Cholesky-decomposition is utilized for a component of the covariance structure. Furthermore, a parsimonious family of models is developed by imposing constraints on elements of these decompositions. The models are applied to both real and simulated data, and it is demonstrated that the proposed approach can recover the underlying cluster structure. [ABSTRACT FROM AUTHOR]
- Published
- 2023
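The abstract above notes that the matrix variate structure greatly reduces the number of parameters. A rough sketch of that count follows, assuming the usual matrix variate normal setup in which the covariance of the vectorised r × t observation factorises as a Kronecker product of an r × r and a t × t covariance matrix. The abstract does not give the exact parametrisation, so the function name and the counting convention are assumptions.

```python
def covariance_parameter_counts(r, t):
    """Free covariance parameters for an r x t observation.

    unstructured: one symmetric (r*t) x (r*t) covariance matrix.
    kronecker:    one symmetric r x r matrix plus one symmetric t x t matrix
                  (ignoring the single scale constraint needed for identifiability).
    """
    size = r * t
    unstructured = size * (size + 1) // 2
    kronecker = r * (r + 1) // 2 + t * (t + 1) // 2
    return unstructured, kronecker


# Example: 3 conditions observed at 4 time points.
unstructured, kronecker = covariance_parameter_counts(3, 4)
```

Under these assumptions the unstructured count grows quadratically in r × t, while the structured count grows quadratically in r and t separately.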
19. A pseudo-value regression approach for differential network analysis of co-expression data.
- Author
-
Ahn, Seungjun, Grimes, Tyler, and Datta, Somnath
- Subjects
CHRONIC obstructive pulmonary disease ,DATA analysis ,GENE expression ,GENE regulatory networks ,DINGO - Abstract
Background: The differential network (DN) analysis identifies changes in measures of association among genes under two or more experimental conditions. In this article, we introduce a pseudo-value regression approach for network analysis (PRANA). This is a novel method of differential network analysis that also adjusts for additional clinical covariates. We start from mutual information criteria, followed by pseudo-value calculations, which are then entered into a robust regression model. Results: This article assesses the model performances of PRANA in a multivariable setting, followed by a comparison to dnapath and DINGO in both univariable and multivariable settings through a variety of simulations. Performance in terms of precision, recall, and F1 score of differentially connected (DC) genes is assessed. By and large, PRANA outperformed dnapath and DINGO, neither of which is equipped to adjust for available covariates such as patient age. Lastly, we employ PRANA in a real data application from the Gene Expression Omnibus database to identify DC genes that are associated with chronic obstructive pulmonary disease to demonstrate its utility. Conclusion: To the best of our knowledge, this is the first attempt to use regression modeling for DN analysis by collective gene expression levels between two or more groups with the inclusion of additional clinical covariates. By and large, adjusting for available covariates improves the accuracy of a DN analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2023
20. GeneSelectML: a comprehensive way of gene selection for RNA-Seq data via machine learning algorithms.
- Author
-
Dag, Osman, Kasikci, Merve, Ilk, Ozlem, and Yesiltepe, Metin
- Subjects
- *
MACHINE learning , *RNA sequencing , *FEATURE selection , *GENE ontology , *DATA analysis - Abstract
Selection of differentially expressed genes (DEGs) is a vital process to discover the causes of diseases. It has been shown that modelling of genomics data by considering relations among genes increases the predictive performance of methods compared to univariate analysis. However, there exist serious differences among most studies analyzing the same dataset, for reasons arising from the methods. Therefore, there is a strong need for an easily accessible, user-friendly, and interactive tool to perform gene selection for RNA-seq data via several machine learning algorithms simultaneously, so as not to miss DEGs. We develop an open-source and freely available web-based tool for gene selection via machine learning algorithms that can deal with high performance computation. This tool includes six machine learning algorithms having different aspects. Moreover, the tool involves classical pre-processing steps: filtering, normalization, transformation, and univariate analysis. It also offers well-arranged graphical approaches: network plot, heatmap, Venn diagram, and box-and-whisker plot. Gene ontology analysis is provided for both mRNA and miRNA DEGs. The implementation is carried out on Alzheimer RNA-seq data to demonstrate the use of this web-based tool. Eleven genes are suggested by at least two out of six methods. One of these genes, hsa-miR-148a-3p, might be considered as a new biomarker for Alzheimer's disease diagnosis. Kidney Chromophobe dataset is also analyzed to demonstrate the validity of GeneSelectML web tool on a different dataset. GeneSelectML is distinguished in that it simultaneously uses different machine learning algorithms for gene selection and can perform pre-processing, graphical representation, and gene ontology analyses on the same tool. This tool is freely available at www.softmed.hacettepe.edu.tr/GeneSelectML. [ABSTRACT FROM AUTHOR]
- Published
- 2023
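The abstract above reports genes "suggested by at least two out of six methods". A minimal sketch of that kind of vote counting follows, assuming each method returns a plain list of selected gene identifiers. The function, method names, and gene labels are hypothetical and do not reflect the GeneSelectML implementation.

```python
from collections import Counter


def genes_selected_by_at_least(selections, min_votes=2):
    """Return genes chosen by at least `min_votes` of the given methods."""
    # Deduplicate within each method so that one method gives one vote per gene.
    votes = Counter(gene for genes in selections.values() for gene in set(genes))
    return sorted(gene for gene, count in votes.items() if count >= min_votes)


# Hypothetical outputs of three selection methods.
selections = {
    "method_a": ["geneA", "geneB"],
    "method_b": ["geneB", "geneC"],
    "method_c": ["geneC", "geneB", "geneD"],
}
consensus = genes_selected_by_at_least(selections)
```

Raising `min_votes` trades sensitivity for agreement: a stricter threshold keeps only the genes that most methods support.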
21. Protocol for transcriptome assembly by the TransBorrow algorithm.
- Author
-
Zhao, Dengyi, Liu, Juntao, and Yu, Ting
- Subjects
- *
BIOLOGICAL research , *TRANSCRIPTOMES , *RNA sequencing , *COMPUTATIONAL biology , *BIOINFORMATICS - Abstract
High-throughput RNA-seq enables comprehensive analysis of the transcriptome for various purposes. However, this technology generally generates massive amounts of sequencing reads with a shorter read length. Consequently, fast, accurate, and flexible tools are needed for assembling raw RNA-seq data into full-length transcripts and quantifying their expression levels. In this protocol, we report TransBorrow, a novel transcriptome assembly software specifically designed for short RNA-seq reads. TransBorrow is employed in conjunction with a splice-aware alignment tool (e.g. Hisat2 and Star) and some other transcriptome assembly tools (e.g. StringTie, Cufflinks, and Scallop). The protocol encompasses all necessary steps, starting from downloading and processing raw sequencing data to assembling the full-length transcripts and quantifying their expressed abundances. The execution time of the protocol may vary depending on the sizes of processed datasets and computational platforms. [ABSTRACT FROM AUTHOR]
- Published
- 2023
22. bestDEG: a web-based application automatically combines various tools to precisely predict differentially expressed genes (DEGs) from RNA-Seq data.
- Author
-
Sangket, Unitsa, Yodsawat, Prasert, Nuanpirom, Jiratchaya, and Sathapondecha, Ponsit
- Subjects
GENE expression ,WEB-based user interfaces ,RNA sequencing ,FALSE discovery rate ,GENES ,DNA microarrays - Abstract
Background. Differential gene expression analysis using RNA sequencing technology (RNA-Seq) has become the most popular technique in transcriptome research. Although many R packages have been developed to analyze differentially expressed genes (DEGs), several evaluations have shown that no single DEG analysis method outperforms all others. The validity of DEG identification could be increased by using multiple methods and producing the consensus results. However, DEG analysis methods are complex and most of them require prior knowledge of a programming language or command-line shell. Users who do not have this knowledge need to invest time and effort to acquire it. Methods. We developed a novel web application called "bestDEG" to automatically analyze DEGs with different tools and compare the results. A differential expression (DE) analysis pipeline was created combining the edgeR, DESeq2, NOISeq, and EBSeq packages; selected because they use different statistical methods to identify DEGs. bestDEG was evaluated on human datasets from the MicroArray Quality Control (MAQC) project. Results. The performance of the bestDEG web application with the human datasets showed excellent results, and the consensus method outperformed the other DE analysis methods in terms of precision (94.71%) and specificity (97.01%). bestDEG is a rapid and efficient tool to analyze DEGs. With bestDEG, users can select DE analysis methods and parameters in the user-friendly web interface. bestDEG also provides a Venn diagram and a table of results. Moreover, the consensus method of this tool can maximize the precision or minimize the false discovery rate (FDR), which reduces the cost of gene expression validation by minimizing wet-lab experiments. [ABSTRACT FROM AUTHOR]
- Published
- 2022
23. NetSeekR: a network analysis pipeline for RNA-Seq time series data
- Author
-
Himangi Srivastava, Drew Ferrell, and George V. Popescu
- Subjects
RNA-Seq data ,Differential gene expression analysis ,Correlation gene expression analysis ,Regulatory network analysis ,Complex network analysis ,Bioinformatics pipeline ,Computer applications to medicine. Medical informatics ,R858-859.7 ,Biology (General) ,QH301-705.5 - Abstract
Abstract Background Recent development of bioinformatics tools for Next Generation Sequencing data has facilitated complex analyses and prompted large scale experimental designs for comparative genomics. When combined with the advances in network inference tools, this can lead to powerful methodologies for mining genomics data, allowing development of pipelines that stretch from sequence reads mapping to network inference. However, integrating various methods and tools available over different platforms requires a programmatic framework to fully exploit their analytic capabilities. Integrating multiple genomic analysis tools faces challenges from standardization of input and output formats, normalization of results for performing comparative analyses, to developing intuitive and easy to control scripts and interfaces for the genomic analysis pipeline. Results We describe here NetSeekR, a network analysis R package that includes the capacity to analyze time series of RNA-Seq data, to perform correlation and regulatory network inferences and to use network analysis methods to summarize the results of a comparative genomics study. The software pipeline includes alignment of reads, differential gene expression analysis, correlation network analysis, regulatory network analysis, gene ontology enrichment analysis and network visualization of differentially expressed genes. The implementation provides support for multiple RNA-Seq read mapping methods and allows comparative analysis of the results obtained by different bioinformatics methods. Conclusion Our methodology increases the level of integration of genomics data analysis tools to network inference, facilitating hypothesis building, functional analysis and genomics discovery from large scale NGS data. When combined with network analysis and simulation tools, the pipeline allows for developing systems biology methods using large scale genomics data.
- Published
- 2022
24. Regulatory modules of human thermogenic adipocytes: functional genomics of large cohort and Meta-analysis derived marker-genes
- Author
-
Beáta B. Tóth, Zoltán Barta, Ákos Barnabás Barta, and László Fésüs
- Subjects
Adipocytes, browning and thermogenesis ,Protein interaction networks, gene expression regulation ,RNA-seq data ,Transcriptional factors, HIF1A ,UCP1 promoter ,AdipoNET ,Biotechnology ,TP248.13-248.65 ,Genetics ,QH426-470 - Abstract
Abstract Background Recently, ProFAT and BATLAS studies identified brown and white adipocytes marker genes based on analysis of large databases. They offered scores to determine the thermogenic status of adipocytes using the gene-expression data of these markers. In this work, we investigated the functional context of these genes. Results Gene Set Enrichment Analyses (KEGG, Reactome) of the BATLAS and ProFAT marker-genes identified pathways deterministic in the formation of brown and white adipocytes. The collection of the annotated proteins of the defined pathways resulted in expanded white and brown characteristic protein-sets, which theoretically contain all functional proteins that could be involved in the formation of adipocytes. Based on our previously obtained RNA-seq data, we visualized the expression profile of these proteins coding genes and found patterns consistent with the two adipocyte phenotypes. The trajectory of the regulatory processes could be outlined by the transcriptional profile of progenitor and differentiated adipocytes, highlighting the importance of suppression processes in browning. Protein interaction network-based functional genomics by STRING, Cytoscape and R-Igraph platforms revealed that different biological processes shape the brown and white adipocytes and highlighted key regulatory elements and modules including GAPDH-CS, DECR1, SOD2, IL6, HRAS, MTOR, INS-AKT, ERBB2 and 4-NFKB, and SLIT-ROBO-MAPK. To assess the potential role of a particular protein in shaping adipocytes, we assigned interaction network location-based scores (betweenness centrality, number of bridges) to them and created a freely accessible platform, the AdipoNET ( https://adiponet.com ), to conveniently use these data. The Eukaryote Promoter Database predicted the response elements in the UCP1 promoter for the identified, potentially important transcription factors (HIF1A, MYC, REL, PPARG, TP53, AR, RUNX, and FoxO1).
Conclusion Our integrative approach-based results allowed us to investigate potential regulatory elements of thermogenesis in adipose tissue. The analyses revealed that some unique biological processes form the brown and white adipocyte phenotypes, which presumes the existence of the transitional states. The data also suggests that the two phenotypes are not mutually exclusive, and differentiation of thermogenic adipocyte requires induction of browning as well as repressions of whitening. The recognition of these simultaneous actions and the identified regulatory modules can open new direction in obesity research.
- Published
- 2021
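The abstract above assigns network location-based scores such as betweenness centrality to proteins. A minimal sketch of unnormalised betweenness on an undirected, unweighted graph follows, using Brandes' algorithm. The adjacency-dictionary input format and the node labels are assumptions for illustration, not the AdipoNET implementation or real interaction data.

```python
from collections import deque


def betweenness_centrality(adj):
    """Unnormalised betweenness for an undirected, unweighted graph.

    `adj` maps every node to a list of its neighbours (symmetric edges).
    """
    score = {v: 0.0 for v in adj}
    for source in adj:
        # Breadth-first search: count shortest paths and record predecessors.
        dist = {source: 0}
        n_paths = {v: 0 for v in adj}
        n_paths[source] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Accumulate pair dependencies from the farthest nodes back to the source.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += n_paths[v] / n_paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair of endpoints was visited from both ends.
    return {v: value / 2 for v, value in score.items()}


# Hypothetical four-node chain P1 - P2 - P3 - P4.
chain = {"P1": ["P2"], "P2": ["P1", "P3"], "P3": ["P2", "P4"], "P4": ["P3"]}
scores = betweenness_centrality(chain)
```

In the chain, the two inner nodes each lie on two shortest paths between other nodes, while the end nodes lie on none.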
25. bestDEG: a web-based application automatically combines various tools to precisely predict differentially expressed genes (DEGs) from RNA-Seq data
- Author
-
Unitsa Sangket, Prasert Yodsawat, Jiratchaya Nuanpirom, and Ponsit Sathapondecha
- Subjects
Differentially expressed genes ,DEGs ,RNA-Seq data ,EdgeR ,DESeq2 ,NOISeq ,Medicine ,Biology (General) ,QH301-705.5 - Abstract
Background Differential gene expression analysis using RNA sequencing technology (RNA-Seq) has become the most popular technique in transcriptome research. Although many R packages have been developed to analyze differentially expressed genes (DEGs), several evaluations have shown that no single DEG analysis method outperforms all others. The validity of DEG identification could be increased by using multiple methods and producing the consensus results. However, DEG analysis methods are complex and most of them require prior knowledge of a programming language or command-line shell. Users who do not have this knowledge need to invest time and effort to acquire it. Methods We developed a novel web application called “bestDEG” to automatically analyze DEGs with different tools and compare the results. A differential expression (DE) analysis pipeline was created combining the edgeR, DESeq2, NOISeq, and EBSeq packages; selected because they use different statistical methods to identify DEGs. bestDEG was evaluated on human datasets from the MicroArray Quality Control (MAQC) project. Results The performance of the bestDEG web application with the human datasets showed excellent results, and the consensus method outperformed the other DE analysis methods in terms of precision (94.71%) and specificity (97.01%). bestDEG is a rapid and efficient tool to analyze DEGs. With bestDEG, users can select DE analysis methods and parameters in the user-friendly web interface. bestDEG also provides a Venn diagram and a table of results. Moreover, the consensus method of this tool can maximize the precision or minimize the false discovery rate (FDR), which reduces the cost of gene expression validation by minimizing wet-lab experiments.
- Published
- 2022
26. Recent Advances on Penalized Regression Models for Biological Data.
- Author
-
Wang, Pei, Chen, Shunjie, and Yang, Sijia
- Subjects
- *
REGRESSION analysis , *BIOLOGICAL models , *DATA modeling , *MATHEMATICAL optimization , *BREAST cancer , *FEATURE selection - Abstract
Increasing amounts of biological data promote the development of various penalized regression models. This review discusses the recent advances in both linear and logistic regression models with penalization terms. This review is mainly focused on various penalized regression models, some of the corresponding optimization algorithms, and their applications in biological data. The pros and cons of different models in terms of response prediction, sample classification, network construction and feature selection are also reviewed. The performances of different models in a real-world RNA-seq dataset for breast cancer are explored. Finally, some future directions are discussed. [ABSTRACT FROM AUTHOR]
- Published
- 2022
27. Projection of Expression Profiles to Transcription Factor Activity Space Provides Added Information.
- Author
-
Bornshten, Rut, Danilenko, Michael, and Rubin, Eitan
- Subjects
- *
TRANSCRIPTION factors , *ACUTE myeloid leukemia , *GENE regulatory networks , *MACHINE learning , *STEM cells - Abstract
Acute myeloid leukemia (AML) is an aggressive type of leukemia, characterized by the accumulation of highly proliferative blasts with a disrupted myeloid differentiation program. Current treatments are ineffective for most patients, partly due to the genetic heterogeneity of AML. This is driven by genetically distinct leukemia stem cells, resulting in relapse even after most of the tumor cells are destroyed. Thus, personalized treatment approaches addressing cellular heterogeneity are urgently required. Reconstruction of Transcriptional regulatory Networks (RTN) is a tool for inferring transcriptional activity in patients with various diseases. In this study, we applied this method to transcriptome profiles of AML patients to test if it provided additional information for the interpretation of transcriptome data. We showed that when RTN results were added to RNA-seq results, superior clusters were formed, which were more homogenous and allowed the better separation of patients with low and high survival rates. We concluded that the external knowledge used for RTN analysis improved the ability of unsupervised machine learning to find meaningful patterns in the data. [ABSTRACT FROM AUTHOR]
- Published
- 2022
28. Establish immune-related gene prognostic index for esophageal cancer.
- Author
-
Caiyu Guo, Fanye Zeng, Hui Liu, Jianlin Wang, Xue Huang, and Judong Luo
- Subjects
ESOPHAGEAL cancer ,CANCER prognosis ,PROGNOSTIC models ,OVERALL survival ,REGRESSION analysis ,P16 gene - Abstract
Background: Esophageal cancer is a tumor type with high invasiveness and poor prognosis. As immunotherapy has been shown to improve the prognosis of esophageal cancer patients, we were interested in the establishment of an immune-associated gene prognostic index to effectively predict the prognosis of patients. Methods: To establish the immune-related gene prognostic index of esophageal cancer (EC), we screened 363 upregulated and 83 downregulated immune-related genes that were differentially expressed in EC compared to normal tissues. By multivariate Cox regression and weighted gene coexpression network analysis (WGCNA), we built a prognostic model based on eight immune-related genes (IRGs). We confirmed the prognostic model in both TCGA and GEO cohorts and found that the low-risk group had better overall survival than the high-risk group. Results: In this study, we identified 363 upregulated IRGs and 83 downregulated IRGs. Next, we found a prognostic model that was constructed with eight IRGs (OSM, CEACAM8, HSPA6, HSP90AB1, PCSK2, PLXNA1, TRIB2, and HMGB3) by multivariate Cox regression analysis and WGCNA. According to the Kaplan–Meier survival analysis results, the model we constructed can predict the prognosis of patients with esophageal cancer. This result can be verified by the Gene Expression Omnibus (GEO). Patients were divided into two groups with different outcomes. IRGPI-low patients had better overall survival than IRGPI-high patients. Conclusion: Our findings indicated the potential value of the IRGPI risk model for predicting the prognosis of EC patients. [ABSTRACT FROM AUTHOR]
- Published
- 2022
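The abstract above builds a gene-based prognostic index from a multivariate Cox model and splits patients into low and high groups. A minimal sketch of that idea follows, assuming the index is the Cox linear predictor (a coefficient-weighted sum of expression values) and that the cutoff is the cohort median. Neither the cutoff rule nor any coefficient is stated in the abstract, so the gene names, coefficients, and expression values below are made-up placeholders.

```python
from statistics import median


def risk_scores(expression, coefficients):
    """Linear predictor per patient: sum of coefficient * expression."""
    return {
        patient: sum(coefficients[gene] * values[gene] for gene in coefficients)
        for patient, values in expression.items()
    }


def split_by_median(scores):
    """Label patients above the median score as 'high', the rest as 'low'."""
    cutoff = median(scores.values())
    return {p: ("high" if s > cutoff else "low") for p, s in scores.items()}


# Placeholder coefficients and expression values, for illustration only.
coefficients = {"gene1": 0.5, "gene2": -0.25}
expression = {
    "patient_a": {"gene1": 2.0, "gene2": 4.0},
    "patient_b": {"gene1": 4.0, "gene2": 0.0},
    "patient_c": {"gene1": 1.0, "gene2": 2.0},
    "patient_d": {"gene1": 6.0, "gene2": 4.0},
}
scores = risk_scores(expression, coefficients)
groups = split_by_median(scores)
```

A positive coefficient raises the index as expression increases and a negative one lowers it; the two groups would then be compared with a survival analysis such as Kaplan–Meier.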
29. Single Nucleotide Polymorphism (SNP) Discovery on Transcriptomes of American Holstein and Pakistanian Cholistani Cows
- Author
-
Mojgan Ghasmi siab, Sheida Varkoohi, and Mohammad Hossein Banabazi
- Subjects
chromosome length ,nucleotide replacement ,rna-seq data ,Animal culture ,SF1-1100 - Abstract
Introduction Single nucleotide polymorphisms (SNPs) are single-base variations, caused by transitions (C/T or G/A) or transversions (C/G, C/A, T/A, or T/G), at the same position between individual genomic DNA sequences. SNPs have been applied as important molecular markers in genetics and breeding studies. About 40% of the SNPs in genes cause a change in an amino acid. The rapid advance of next-generation sequencing provides a high-throughput means of SNP discovery. Transcriptome studies can fill the gap between genotype and phenotype and help in understanding the mechanisms from sequence to function. RNA sequencing (RNA-Seq) is a next-generation sequencing-based technology for studying the whole transcriptome and gene expression. It simultaneously enables the study of transcript sequences and very accurate quantitative gene expression (digital expression). Hence, these data are well suited to high-throughput study of the expression levels of all transcribed genes and their SNPs. Recently, RNA-Seq has also been used as an efficient and cost-effective method to systematically identify SNPs in transcribed regions in different species. A transcriptomics-based sequencing approach offers a cheaper alternative for identifying a large number of polymorphisms and possibly discovering causative variants. Materials and Methods In this study, RNA-Seq data were used for SNP discovery in American Holstein (Bos taurus) and Pakistani Cholistani (Bos indicus) cows. RNA-Seq data comprising 21,078,477 and 20,940,063 paired-end reads of 75 bp, resulting from pooled whole-blood samples of 40 Holstein cows at the University of Wisconsin Dairy Cattle Center, USA, and 45 Cholistani cows at Gujait Peer Farm, Bahawalpur, Punjab, Pakistan, respectively, were obtained from the SRA database at NCBI for Holstein cows (http://www.ncbi.nlm.nih.gov/sra/SRX317197) and Cholistani cows (http://www.ncbi.nlm.nih.gov/sra/SRS454433).
mRNA sequencing was run on an Illumina Genome Analyzer IIx (Illumina Inc., San Diego, CA). Data were converted from SRA format to FASTQ format with the fastq-dump command from the Ubuntu Linux version of Sratoolkit 2.5.4-1. Data quality was checked with FastQC (v0.11.3), and reads were trimmed for linked adaptors and low-quality reads with Trimmomatic 0.33. Adaptors were set according to the sequencing instrument as default (TruSeq2-PE.fa), and the minimum read length was set at 50 bp. Trimmed reads were aligned to the UMD3.1 reference genome (release 81), based on annotation data, with TopHat2, which uses Bowtie2 as the aligner. The transcriptome was assembled with TopHat2 for the two cow populations by aligning and mapping the RNA-Seq reads to the bovine reference genome. SNPs were discovered with Samtools. Results and Discussion After data editing, the numbers of removed and low-quality reads in both breeds were almost equal and relatively low. The length of the whole assembled transcriptome, for example 52,798,651 bases in Holstein, indicates that around 2% of the whole genome (around 2.6 Gbp) is expressed as mRNA. In Cholistani cows, the read mapping rates for forward and reverse reads were 81.3% and 79.9%, respectively, and the multiple-alignment rate was about 9.4%. Overall read mapping was 80.6% and concordant pair alignment was 70.1%. In Holstein cows, the read mapping rates for forward and reverse reads were 66.3% and 55.4%, respectively, and the multiple-alignment rate was about 7.2%. Overall read mapping was 60.8% and concordant pair alignment was 51.3%. In total, 50,183 and 137,954 SNPs were discovered in the assembled transcriptomes of the Holstein and Cholistani samples, respectively, and 15,308 SNPs were common to both breeds. No direct relationship was found between the number of discovered SNPs and chromosome length. Twelve SNP types were identified, comprising 4 transitions and 8 transversions.
The most commonly discovered SNPs were transitions, which accounted for 70.6% of SNPs in Cholistani and 69.6% in Holstein cows. The transition-to-transversion ratio (Ts/Tv) was 2.4 and 2.3 in Cholistani and Holstein cows, respectively. The number of SNPs discovered in Cholistani cows was approximately three times higher than in Holstein cows, because reads from both subspecies were aligned to the same reference genome, which is of Hereford origin. Conclusion The expression difference between two alleles at a single-nucleotide position causes phenotypic diversity and probably explains a large part of the variance between these two bovine subspecies, especially in diversity, susceptibility to disease and parasites, and tolerance of environmental stress, such as biotic and abiotic stresses, in different environmental conditions. In contrast, differential gene expression analysis, or even allele-specific expression at the gene level, may not be able to explain phenotypic diversity.
- Published
- 2021
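The abstract above classifies SNPs as transitions or transversions and reports Ts/Tv ratios. A minimal sketch of that classification follows, assuming each SNP is given as a (reference base, alternative base) pair with two different bases; the function names are hypothetical. It also shows that the reported transition shares (70.6% and 69.6%) agree with the reported ratios (2.4 and 2.3).

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True when both bases are purines or both are pyrimidines."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def ts_tv_ratio(snps):
    """Ratio of transitions to transversions in a list of (ref, alt) pairs."""
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    return transitions / transversions if transversions else float("inf")


def ratio_from_share(transition_share):
    """Convert a transition fraction (0..1) into a Ts/Tv ratio."""
    return transition_share / (1 - transition_share)


# All 12 directed substitution types between the four bases.
substitution_types = [(a, b) for a in "ACGT" for b in "ACGT" if a != b]
n_transition_types = sum(is_transition(a, b) for a, b in substitution_types)

cholistani_ratio = round(ratio_from_share(0.706), 1)
holstein_ratio = round(ratio_from_share(0.696), 1)
```

Enumerating the directed substitutions reproduces the 4 transition and 8 transversion types mentioned in the abstract.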
30. A scaling-free minimum enclosing ball method to detect differentially expressed genes for RNA-seq data
- Author
-
Yan Zhou, Bin Yang, Junhui Wang, Jiadi Zhu, and Guoliang Tian
- Subjects
Minimum enclosing ball ,Differentially expressed genes ,RNA-seq data ,Biotechnology ,TP248.13-248.65 ,Genetics ,QH426-470 - Abstract
Abstract Background Identifying differentially expressed genes between the same or different species is an urgent demand for biological and medical research. For RNA-seq data, systematic technical effects and different sequencing depths are usually encountered when conducting experiments. Normalization is regarded as an essential step in the discovery of biologically important changes in expression. The present methods usually involve normalization of the data with a scaling factor, followed by detection of significant genes. However, more than one scaling factor may exist because of the complexity of real data. Consequently, methods that normalize data by a single scaling factor may deliver suboptimal performance or may not even work. The development of modern machine learning techniques has provided a new perspective regarding discrimination between differentially expressed (DE) and non-DE genes. However, in reality, the non-DE genes comprise only a small set and may contain housekeeping genes (in same species) or conserved orthologous genes (in different species). Therefore, the process of detecting DE genes can be formulated as a one-class classification problem, where only non-DE genes are observed, while DE genes are completely absent from the training data. Results In this study, we transform the problem to an outlier detection problem by treating DE genes as outliers, and we propose a scaling-free minimum enclosing ball (SFMEB) method to construct a smallest possible ball to contain the known non-DE genes in a feature space. The genes outside the minimum enclosing ball can then be naturally considered to be DE genes. Compared with the existing methods, the proposed SFMEB method does not require data normalization, which is particularly attractive when the RNA-seq data include more than one scaling factor. Furthermore, the SFMEB method could be easily extended to different species without normalization.
Conclusions Simulation studies demonstrate that the SFMEB method works well in a wide range of settings, especially when the data are heterogeneous or biological replicates. Analysis of the real data also supports the conclusion that the SFMEB method outperforms other existing competitors. The R package of the proposed method is available at https://bioconductor.org/packages/MEB .
- Published
- 2021
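The abstract above treats DE genes as outliers that fall outside a ball enclosing the known non-DE genes. A minimal sketch of that geometric idea follows, assuming plain Euclidean space and the Badoiu-Clarkson iteration for an approximate minimum enclosing ball. This is a swapped-in textbook technique, not the SFMEB method, which works in a feature space and is distributed as the R package named in the abstract. All point coordinates are hypothetical.

```python
import math


def approx_enclosing_ball(points, iterations=1000):
    """Approximate minimum enclosing ball (Badoiu-Clarkson iteration).

    Returns a centre and the radius needed to cover every input point.
    """
    center = list(points[0])
    for i in range(1, iterations + 1):
        farthest = max(points, key=lambda p: math.dist(p, center))
        # Move towards the farthest point with a step that shrinks as 1/(i+1).
        center = [c + (f - c) / (i + 1) for c, f in zip(center, farthest)]
    radius = max(math.dist(p, center) for p in points)
    return center, radius


def outside_ball(candidates, center, radius, tolerance=1e-9):
    """Candidates lying strictly outside the ball are flagged as outliers."""
    return [p for p in candidates if math.dist(p, center) > radius + tolerance]


# Hypothetical 2-D features of reference (non-DE) genes and two candidates.
reference = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
center, radius = approx_enclosing_ball(reference)
flagged = outside_ball([(5.0, 5.0), (1.0, 1.0)], center, radius)
```

Because the radius is taken as the largest distance from the final centre, the ball always covers the reference points; more iterations bring it closer to the smallest such ball.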
31. Ess-NEXG: Predict Essential Proteins by Constructing a Weighted Protein Interaction Network Based on Node Embedding and XGBoost
- Author
-
Wang, Nian, Zeng, Min, Zhang, Jiashuai, Li, Yiming, Li, Min, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Cai, Zhipeng, editor, Mandoiu, Ion, editor, Narasimhan, Giri, editor, Skums, Pavel, editor, and Guo, Xuan, editor
- Published
- 2020
32. SmGDB: genome database of Salvia miltiorrhiza, an important TCM Plant.
- Author
-
Zhou, Changhao, Lin, Caicai, Xing, Piyi, Li, Xingfeng, and Song, Zhenqiao
- Abstract
Background: Salvia miltiorrhiza is an important traditional Chinese medicinal (TCM) plant and a model plant in the genetic study of TCM. A series of omics related to Danshen have been published. Integrating, managing, storing, and sharing data has become an urgent problem to be solved in S. miltiorrhiza genetic studies. Objectives: The genome database is the link for the exchange, acquisition, and use of different omics data between data producers and users, maximizing value and utilization of data. Methods: The genome database included DSS3 genome and five RNA-Seq data. The back-end performs data search and retrieval through the LAMP (Linux, Apache, MySQL, PHP) framework. Results: Here, we present SmGDB (S. miltiorrhiza genome database; http://8.140.162.85/), which houses the latest version of genome sequence and annotation data for S. miltiorrhiza, combining three unpublished RNA-Seq data from our group and two released RNA-Seq data. We also identified a novel gene cluster including seven CYP71D genes involved in the tanshinone synthesis pathway based on genome sequences and expression data. Besides, SmGDB provides user-friendly web interfaces for querying and browsing gene annotation, structure, location, and expression profiles for concerned genes. Popular bioinformatics tools such as 'BLAST', 'Search', 'Heatmap', 'JBrowse', etc., were also provided in SmGDB. Conclusions: SmGDB will provide utility for characterizing the structure of the S. miltiorrhiza genome and better understanding gene functions and biological processes underlying complex secondary metabolism in Danshen. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
33. Deciphering the stem variations in ginseng plant using RNA-Seq
- Author
-
Lu ZHAO, Yan-Shuang YU, Xin-Fang ZHOU, Huxitaer REHEMAN, Fu-Hui WEI, Da-Pu ZHO, Ping FANG, Jin-Zhuang GONG, and Yong-Hua XU
- Subjects
ginseng ,plant hormone ,plant morphology ,RNA-Seq Data ,terpenoids ,Forestry ,SD1-669.5 ,Agriculture (General) ,S1-972 - Abstract
Ginseng is an important herb widely grown in East Asia that has medicinal and nutritional uses. Multi-stem ginseng plants undergo rapid growth, are of good quality, and have a high main-root yield. The multi-stem trait is important in ginseng breeding. To understand the molecular mechanisms responsible for the multi-stem formation, the physiological changes before and after overwintering bud formation, we analysed the transcriptomes of multi- and single-stem ginseng plants. RNA sequencing of overwintering buds from multi- and single-stem ginseng plants was performed using high-throughput second-generation sequencing. We obtained 47.66 million high quality reads at a sequencing efficiency of greater than 99% from the multi- and single-stem transcriptome. An analysis of significantly enriched gene ontology functions and comparisons with Kyoto Encyclopedia of Genes and Genomes pathways revealed expression level changes in genes associated with plant hormones, photosynthesis, steroids biosynthesis, and sugar metabolism. Plant hormones are involved in multi-stem formation in ginseng. Auxin, cytokinin, brassinolide, and strigolactone have positive effects on multi-stem formation, but further research is needed to elucidate their mechanisms. Our results have important implications in ginseng cultivation and breeding.
- Published
- 2022
- Full Text
- View/download PDF
34. ASTool: An Easy-to-Use Tool to Accurately Identify Alternative Splicing Events from Plant RNA-Seq Data.
- Author
-
Qi, Huan, Guo, Xiaokun, Wang, Tianpeng, and Zhang, Ziding
- Subjects
- *
ALTERNATIVE RNA splicing , *RNA sequencing , *PLANT performance - Abstract
Alternative splicing (AS) is an essential co-transcriptional regulatory mechanism in eukaryotes. The accumulation of plant RNA-Seq data provides an unprecedented opportunity to investigate the global landscape of plant AS events. However, most existing AS identification tools were originally designed for animals, and their performance in plants was not rigorously benchmarked. In this work, we developed a simple and easy-to-use bioinformatics tool named ASTool for detecting AS events from plant RNA-Seq data. As an exon-based method, ASTool can detect 4 major AS types, including intron retention (IR), exon skipping (ES), alternative 5′ splice sites (A5SS), and alternative 3′ splice sites (A3SS). Compared with existing tools, ASTool revealed a favorable performance when tested in simulated RNA-Seq data, with both recall and precision values exceeding 95% in most cases. Moreover, ASTool also showed a competitive computational speed and consistent detection results with existing tools when tested in simulated or real plant RNA-Seq data. Considering that IR is the most predominant AS type in plants, ASTool allowed the detection and visualization of novel IR events based on known splice sites. To fully present the functionality of ASTool, we also provided an application example of ASTool in processing real RNA-Seq data of Arabidopsis in response to heat stress. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
35. Machine learning model for predicting Major Depressive Disorder using RNA-Seq data: optimization of classification approach.
- Author
-
Verma, Pragya and Shakya, Madhvi
- Abstract
Among human brain disorders, Major Depressive Disorder (MDD) is regarded as a lethal disease in which a person may go to the extent of suicidal behavior. Physical detection of MDD patients is less precise, but machine learning can aid in improved classification of the disease. The present research included three RNA-seq data classes to classify DEGs and then train on key gene data using a random forest (RF) machine learning method. The three classes in the sample are 29 CON (sudden-death healthy controls), 21 MDD-S (Major Depressive Disorder suicides), and 9 MDD (non-suicide MDD). With PCA analysis, 99 key genes were obtained; these 99 genes account for 47.1% of the data variability. Model training on the 99 genes indicated improved classification. The RF classification model has an accuracy of 61.11% on test data and 97.56% on training data. The RF method also offered greater accuracy than the KNN method. The 99 genes were annotated using the DAVID and ClueGo packages. Some of the important pathways and functions observed in the study were glutamatergic synapse, GABA receptor activation, long-term synaptic depression, and morphine addiction. Out of the 99 genes, four genes, namely DLGAP1, GNG2, GRIA1, and GRIA4, were found to be predominantly involved in the glutamatergic synapse pathway. Another substantial link was observed in GABA receptor activation, involving two genes, GABBR2 and GNG2. The genes found responsible for long-term synaptic depression were GRIA1, MAPT, and PTEN. Morphine addiction involved three genes, namely GABBR2, GNG2, and PDE4D. For massive datasets, this approach will act as the gold standard. The cases of CON, MDD, and MDD-S are physically distinct. There was dysregulation in the expression level of 12 genes. The 12 genes act as possible biomarkers for Major Depressive Disorder and open up a new path for depressed subjects to explore further. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
36. Identification of candidate target genes of oral squamous cell carcinoma using high-throughput RNA-Seq data and in silico studies of their interaction with naturally occurring bioactive compounds.
- Author
-
Soni U, Singh A, Soni R, Samanta SK, Varadwaj PK, and Misra K
- Subjects
- Humans, RNA-Seq, Molecular Docking Simulation, Ubiquitin-Conjugating Enzymes genetics, Ubiquitin-Conjugating Enzymes metabolism, Gene Expression Profiling methods, Biological Products pharmacology, Biological Products chemistry, Curcumin pharmacology, Curcumin chemistry, Antineoplastic Agents pharmacology, Antineoplastic Agents chemistry, Computational Biology methods, Mouth Neoplasms genetics, Mouth Neoplasms drug therapy, Carcinoma, Squamous Cell genetics, Carcinoma, Squamous Cell drug therapy, Gene Expression Regulation, Neoplastic drug effects, Computer Simulation
- Abstract
Oral Squamous Cell Carcinoma (OSCC) accounts for more than 90% of all neoplasms that develop in the oral cavity. It is a malignancy with high morbidity and recurrence rates, but data on the disease's target genes and biomarkers are still insufficient. In this study, in silico studies were performed to find novel target genes and their potential therapeutic inhibitors for the effective and efficient treatment of OSCC. The DESeq2 R package, run in RStudio, was used in the current investigation to screen and identify differentially expressed genes for OSCC. As a result of gene expression analysis, the top 10 novel genes were identified using the Cytohubba plugin of Cytoscape, and among them, the ubiquitin-conjugating enzyme (UBE2D1) was found to be upregulated and to play a significant role in the progression of human oral cancers. Following this, naturally occurring compounds were virtually evaluated and simulated against the discovered novel target as prospective drugs utilizing the Maestro, Schrodinger, and Gromacs software. In a simulated screening of naturally occurring potential inhibitors against the novel target UBE2D1, Epigallocatechin 3-gallate, Quercetin, Luteolin, Curcumin, and Baicalein were identified as potent inhibitors. The newly identified gene UBE2D1 has a significant role in the proliferation of human cancers through suppression of the 'guardian of the genome' p53 via a ubiquitination-dependent pathway. Therefore, the treatment of OSCC may benefit significantly from targeting this gene with its discovered naturally occurring inhibitors. Communicated by Ramaswamy H. Sarma.
- Published
- 2024
- Full Text
- View/download PDF
37. NetSeekR: a network analysis pipeline for RNA-Seq time series data.
- Author
-
Srivastava, Himangi, Ferrell, Drew, and Popescu, George V.
- Subjects
TIME series analysis ,SYSTEMS biology ,GENOMICS ,RNA sequencing ,FUNCTIONAL genomics ,COMPARATIVE genomics ,GENE regulatory networks ,FUNCTIONAL analysis - Abstract
Background: Recent development of bioinformatics tools for Next Generation Sequencing data has facilitated complex analyses and prompted large scale experimental designs for comparative genomics. When combined with the advances in network inference tools, this can lead to powerful methodologies for mining genomics data, allowing development of pipelines that stretch from sequence reads mapping to network inference. However, integrating various methods and tools available over different platforms requires a programmatic framework to fully exploit their analytic capabilities. Integrating multiple genomic analysis tools faces challenges from standardization of input and output formats, normalization of results for performing comparative analyses, to developing intuitive and easy to control scripts and interfaces for the genomic analysis pipeline. Results: We describe here NetSeekR, a network analysis R package that includes the capacity to analyze time series of RNA-Seq data, to perform correlation and regulatory network inferences and to use network analysis methods to summarize the results of a comparative genomics study. The software pipeline includes alignment of reads, differential gene expression analysis, correlation network analysis, regulatory network analysis, gene ontology enrichment analysis and network visualization of differentially expressed genes. The implementation provides support for multiple RNA-Seq read mapping methods and allows comparative analysis of the results obtained by different bioinformatics methods. Conclusion: Our methodology increases the level of integration of genomics data analysis tools to network inference, facilitating hypothesis building, functional analysis and genomics discovery from large scale NGS data. When combined with network analysis and simulation tools, the pipeline allows for developing systems biology methods using large scale genomics data. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
38. A systematic comparison of normalization methods for eQTL analysis.
- Author
-
Yang, Jiajun, Wang, Dongyang, Yang, Yanbo, Yang, Wenqian, Jin, Weiwei, Niu, Xiaohui, and Gong, Jing
- Subjects
- *
LOCUS (Genetics) , *GENETIC variation , *RECEIVER operating characteristic curves , *RNA sequencing - Abstract
Expression quantitative trait loci (eQTL) analysis has been widely used in interpreting disease-associated loci through correlating genetic variant loci with the expression of specific genes. RNA-sequencing (RNA-Seq), which can quantify gene expression at the genome-wide level, is often used in eQTL identification. Since different normalization methods of gene expression have substantial impacts on RNA-seq downstream analysis, it is of great necessity to systematically compare the effects of these methods on eQTL identification. Here, by using RNA-seq and genotype data of four different cancers in The Cancer Genome Atlas (TCGA) database, we comprehensively evaluated the effect of eight commonly used normalization methods on eQTL identification. Our results showed that the application of different methods could cause 20–30% differences in the final results of eQTL identification. Among these methods, COUNT, Median of Ratio (MED) and Trimmed Mean of M-values (TMM) generated similar results for identifying eQTLs, while Fragments Per Kilobase Million (FPKM) or RANK produced more differential results compared with other methods. Based on the accuracy and receiver operating characteristic (ROC) curve, the TMM method was found to be the optimal method for normalizing gene expression data in eQTLs analysis. In addition, we also evaluated the performance of different pairwise combinations of these methods. As a result, compared with single normalization methods, the combination of methods can not only identify more cis-eQTLs, but also improve the performance of the ROC curve. Overall, this study provides a comprehensive comparison of normalization methods for identifying eQTLs from RNA-seq data, and proposes some practical recommendations for diverse scenarios. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
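For orientation, the FPKM normalization named in the abstract of result 38 can be sketched as below. This is a minimal illustration of the standard FPKM definition (count scaled by gene length in kilobases and by library size in millions), not the authors' pipeline; the function name `fpkm` and the toy counts and lengths are hypothetical.

```python
def fpkm(counts, lengths_bp):
    """Fragments Per Kilobase of transcript per Million mapped fragments.

    counts:     dict mapping gene -> fragment count in one sample
    lengths_bp: dict mapping gene -> gene (or transcript) length in base pairs
    """
    total = sum(counts.values())  # library size: all mapped fragments
    if total == 0:
        raise ValueError("sample has no mapped fragments")
    # count / (length / 1e3) / (total / 1e6)  ==  count * 1e9 / (length * total)
    return {
        gene: count * 1e9 / (lengths_bp[gene] * total)
        for gene, count in counts.items()
    }


# Toy, made-up data: two genes with equal counts but different lengths.
toy_counts = {"geneA": 500, "geneB": 500}
toy_lengths = {"geneA": 1000, "geneB": 2000}
values = fpkm(toy_counts, toy_lengths)
```

With equal counts, the gene that is twice as long receives half the FPKM value, which is the length correction that distinguishes FPKM from library-size-only methods such as TMM.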
39. Data of RNA-seq transcriptomes in the brain associated with aggression in males of the fish Betta splendens
- Author
-
Trieu-Duc Vu, Yuki Iwasaki, Kenshiro Oshima, Ming-Tzu Chiu, Masato Nikaido, and Norihiro Okada
- Subjects
RNA-seq data ,Betta splendens ,Aggression ,DEG ,Computer applications to medicine. Medical informatics ,R858-859.7 ,Science (General) ,Q1-390 - Abstract
Siamese fighting fish Betta splendens are notorious for their aggressiveness, and males of this fish have been widely used to study aggression. However, the brain transcriptome signature associated with aggression in the context of male-male interaction in this fish remains to be understood. Herein, RNA-Seq transcriptome data from 37 brain samples collected at different fighting stages are described. These brain samples were collected before fighting (B), during fighting (D20 and D60), and after fighting (A0 and A30). The raw data were analyzed for differential gene expression using the edgeR package in R. A criterion of FDR cut-off ≤ 0.05 and an absolute fold change (FC) of 0 or greater were used to identify top upregulated and downregulated genes in fighting groups (D20, D60, A0, and A30) relative to the non-fighting group (B). The data presented hereafter enable fundamental studies on genes and molecular events mediating aggressive behavior in this fish and will lay a valuable foundation for future research on the aggression of vertebrates.
- Published
- 2021
- Full Text
- View/download PDF
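The FDR cut-off mentioned in the abstract of result 39 can be illustrated with a short sketch. It assumes the Benjamini-Hochberg procedure, a common way to control FDR; the abstract does not state which adjustment was used, and the p-values below are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values from a differential expression test.
toy_p = [0.001, 0.008, 0.039, 0.041, 0.6]
adj = benjamini_hochberg(toy_p)
significant = [a <= 0.05 for a in adj]  # the FDR <= 0.05 criterion
```

In this toy example only the first two genes pass the 0.05 threshold, even though four raw p-values are below 0.05.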
40. The Analysis of Gene Expression Data Incorporating Tumor Purity Information
- Author
-
Seungjun Ahn, Tyler Grimes, and Somnath Datta
- Subjects
tumor purity ,RNA-seq data ,differential network analysis ,differential gene expression analysis ,gene expression data ,confounding effects ,Genetics ,QH426-470 - Abstract
The tumor microenvironment is composed of tumor cells, stroma cells, immune cells, blood vessels, and other associated non-cancerous cells. Gene expression measurements on tumor samples are an average over cells in the microenvironment. However, research questions often seek answers about tumor cells rather than the surrounding non-tumor tissue. Previous studies have suggested that the tumor purity (TP), the proportion of tumor cells in a solid tumor sample, has a confounding effect on differential expression (DE) analysis of high vs. low survival groups. We investigate three ways of incorporating the TP information in the two statistical methods used for analyzing gene expression data, namely, differential network (DN) analysis and DE analysis. Analysis 1 ignores the TP information completely, Analysis 2 uses a truncated sample by removing the low TP samples, and Analysis 3 uses TP as a covariate in the underlying statistical models. We use three gene expression data sets related to three different cancers from the Cancer Genome Atlas (TCGA) for our investigation. The networks from Analysis 2 have a greater amount of differential connectivity in the two networks than those from Analysis 1 in all three cancer datasets. Similarly, Analysis 1 identified more differentially expressed genes than Analysis 2. Results of DN and DE analyses using Analysis 3 were mostly consistent with those of Analysis 1 across three cancers. However, Analysis 3 identified additional cancer-related genes in both DN and DE analyses. Our findings suggest that using TP as a covariate in a linear model is appropriate for DE analysis, but a more robust model is needed for DN analysis. However, because true DN or DE patterns are not known for the empirical datasets, simulated datasets can be used to study the statistical properties of these methods in future studies.
- Published
- 2021
- Full Text
- View/download PDF
41. Identification of fungal circular RNAs and research prospects.
- Author
-
胡雪嫣, 张赟, 杨恩策, and 杜明昊
- Subjects
CIRCULAR RNA ,GENETIC regulation ,FUNGAL metabolism ,SCAFFOLD proteins ,TRANSLATING & interpreting ,RNA - Abstract
Copyright of Mycosystema is the property of Mycosystema Editorial Board and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2021
- Full Text
- View/download PDF
42. The Analysis of Gene Expression Data Incorporating Tumor Purity Information.
- Author
-
Ahn, Seungjun, Grimes, Tyler, and Datta, Somnath
- Subjects
GENE expression ,TUMOR microenvironment ,BLOOD vessels ,STATISTICAL models ,TUMORS - Abstract
The tumor microenvironment is composed of tumor cells, stroma cells, immune cells, blood vessels, and other associated non-cancerous cells. Gene expression measurements on tumor samples are an average over cells in the microenvironment. However, research questions often seek answers about tumor cells rather than the surrounding non-tumor tissue. Previous studies have suggested that the tumor purity (TP), the proportion of tumor cells in a solid tumor sample, has a confounding effect on differential expression (DE) analysis of high vs. low survival groups. We investigate three ways of incorporating the TP information in the two statistical methods used for analyzing gene expression data, namely, differential network (DN) analysis and DE analysis. Analysis 1 ignores the TP information completely, Analysis 2 uses a truncated sample by removing the low TP samples, and Analysis 3 uses TP as a covariate in the underlying statistical models. We use three gene expression data sets related to three different cancers from the Cancer Genome Atlas (TCGA) for our investigation. The networks from Analysis 2 have a greater amount of differential connectivity in the two networks than those from Analysis 1 in all three cancer datasets. Similarly, Analysis 1 identified more differentially expressed genes than Analysis 2. Results of DN and DE analyses using Analysis 3 were mostly consistent with those of Analysis 1 across three cancers. However, Analysis 3 identified additional cancer-related genes in both DN and DE analyses. Our findings suggest that using TP as a covariate in a linear model is appropriate for DE analysis, but a more robust model is needed for DN analysis. However, because true DN or DE patterns are not known for the empirical datasets, simulated datasets can be used to study the statistical properties of these methods in future studies. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
43. A Generative Adversarial Network Model for Disease Gene Prediction With RNA-seq Data
- Author
-
Xue Jiang, Jingjing Zhao, Wei Qian, Weichen Song, and Guan Ning Lin
- Subjects
Denoising auto-encoder ,multilayer perceptron ,generative adversarial network ,RNA-seq data ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
Deep learning models often need large amounts of training samples (thousands of training samples) to effectively extract hidden patterns in the data, thus achieving better results. However, in the field of brain-related disease, the omics data obtained by using advanced sequencing technology typically have much fewer patient samples (tens to hundreds of samples). Due to the small sample problem, statistical methods and intelligent machine learning methods have been unable to obtain a convergent gene set when prioritizing biomarkers. Furthermore, mathematical models designed for prioritizing biomarkers perform differently on different datasets. However, the architecture of the generative adversarial network (GAN) can address this bottleneck problem. Through the game between the generator and the discriminator, samples with distributions similar to that of samples in the training set can be generated by the generator, and the prediction accuracy and robustness of the discriminator can be significantly improved. Therefore, in this study, we designed a new generative adversarial network model with a denoising auto-encoder (DAE) as the generator and a multilayer perceptron (MLP) as the discriminator. The prediction residual error was backpropagated to the decoder part of the DAE, modifying the captured probability distribution. Based on this model, we further designed a framework to predict disease genes with RNA-seq data. The deep learning model improves the identification accuracy of disease genes over state-of-the-art approaches. An analysis of the experimental results has uncovered new disease-related genes and disease-associated pathways in the brain, which in turn have provided insight into the molecular mechanisms underlying disease phenotypes.
- Published
- 2020
- Full Text
- View/download PDF
44. TransLiG: a de novo transcriptome assembler that uses line graph iteration
- Author
-
Juntao Liu, Ting Yu, Zengchao Mu, and Guojun Li
- Subjects
RNA-seq data ,Transcriptome assembly ,Splicing graph ,Line graph ,Algorithm ,Biology (General) ,QH301-705.5 ,Genetics ,QH426-470 - Abstract
Abstract We present TransLiG, a new de novo transcriptome assembler, which is able to integrate the sequence depth and pair-end information into the assembling procedure by phasing paths and iteratively constructing line graphs starting from splicing graphs. TransLiG is shown to be significantly superior to all the salient de novo assemblers in both accuracy and computing resources when tested on artificial and real RNA-seq data. TransLiG is freely available at https://sourceforge.net/projects/transcriptomeassembly/files/.
- Published
- 2019
- Full Text
- View/download PDF
45. MI_DenseNetCAM: A Novel Pan-Cancer Classification and Prediction Method Based on Mutual Information and Deep Learning Model
- Author
-
Jianlin Wang, Xuebing Dai, Huimin Luo, Chaokun Yan, Ge Zhang, and Junwei Luo
- Subjects
pan-cancer ,cancer classification ,DenseNet ,guided grad-CAM algorithm ,RNA-seq data ,Genetics ,QH426-470 - Abstract
The Pan-Cancer Atlas, which consists of original sequencing data from various sources, provides the opportunity to perform systematic studies on the commonalities and differences between diverse cancers. The analysis of the pan-cancer dataset could help researchers to identify the key factors that could trigger cancer. In this paper, we present a novel pan-cancer classification method, referred to as MI_DenseNetCAM, to identify a set of genes that can differentiate all tumor types accurately. First, the Mutual Information (MI) was utilized to eliminate noise and redundancy from the pan-cancer datasets. Then, the gene data were further converted to 2D images. Next, the DenseNet model was adopted as a classifier and the Guided Grad-CAM algorithm was applied to identify the key genes. Extensive experimental results on the public RNA-seq data sets with 33 different tumor types show that our method outperforms the other state-of-the-art classification methods. Moreover, gene analysis further demonstrated that the genes selected by our method were related to the corresponding tumor types.
- Published
- 2021
- Full Text
- View/download PDF
46. Machine Learning-Based State-of-the-Art Methods for the Classification of RNA-Seq Data
- Author
-
Jabeen, Almas, Ahmad, Nadeem, Raza, Khalid, Tavares, João Manuel R.S., Series editor, Jorge, Renato Natal, Series editor, Dey, Nilanjan, editor, Ashour, Amira S., editor, and Borra, Surekha, editor
- Published
- 2018
- Full Text
- View/download PDF
47. A scaling-free minimum enclosing ball method to detect differentially expressed genes for RNA-seq data.
- Author
-
Zhou, Yan, Yang, Bin, Wang, Junhui, Zhu, Jiadi, and Tian, Guoliang
- Subjects
RNA sequencing ,GENES ,MACHINE learning ,MEDICAL research ,HOUSEKEEPING - Abstract
Background: Identifying differentially expressed genes between the same or different species is an urgent demand for biological and medical research. For RNA-seq data, systematic technical effects and different sequencing depths are usually encountered when conducting experiments. Normalization is regarded as an essential step in the discovery of biologically important changes in expression. The present methods usually involve normalization of the data with a scaling factor, followed by detection of significant genes. However, more than one scaling factor may exist because of the complexity of real data. Consequently, methods that normalize data by a single scaling factor may deliver suboptimal performance or may not even work. The development of modern machine learning techniques has provided a new perspective regarding discrimination between differentially expressed (DE) and non-DE genes. However, in reality, the non-DE genes comprise only a small set and may contain housekeeping genes (in the same species) or conserved orthologous genes (in different species). Therefore, the process of detecting DE genes can be formulated as a one-class classification problem, where only non-DE genes are observed, while DE genes are completely absent from the training data. Results: In this study, we transform the problem to an outlier detection problem by treating DE genes as outliers, and we propose a scaling-free minimum enclosing ball (SFMEB) method to construct the smallest possible ball to contain the known non-DE genes in a feature space. The genes outside the minimum enclosing ball can then be naturally considered to be DE genes. Compared with the existing methods, the proposed SFMEB method does not require data normalization, which is particularly attractive when the RNA-seq data include more than one scaling factor. Furthermore, the SFMEB method could be easily extended to different species without normalization. 
Conclusions: Simulation studies demonstrate that the SFMEB method works well in a wide range of settings, especially when the data are heterogeneous or biological replicates. Analysis of the real data also supports the conclusion that the SFMEB method outperforms other existing competitors. The R package of the proposed method is available at https://bioconductor.org/packages/MEB. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
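The outlier-detection idea in the abstract of result 47 (enclose the known non-DE genes in a small ball, then flag genes outside it) can be sketched as follows. This is not the authors' SFMEB method or the MEB R package: it swaps in a simple Badoiu-Clarkson style iteration for an approximate enclosing ball in the raw input space rather than a kernel feature space. All function names and the toy points are hypothetical.

```python
import math


def approx_enclosing_ball(points, iterations=1000):
    """Approximate a small ball enclosing all points.

    Repeatedly moves the center a shrinking step toward the current
    farthest point (a Badoiu-Clarkson style iteration).
    """
    center = list(points[0])
    for k in range(1, iterations + 1):
        farthest = max(points, key=lambda p: math.dist(p, center))
        step = 1.0 / (k + 1)
        center = [c + step * (f - c) for c, f in zip(center, farthest)]
    # Use the largest distance as radius so every training point is enclosed.
    radius = max(math.dist(p, center) for p in points)
    return center, radius


def is_outlier(point, center, radius):
    """A point outside the ball is treated as an outlier (candidate DE gene)."""
    return math.dist(point, center) > radius


# Made-up 2D feature vectors standing in for known non-DE genes.
non_de = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
center, radius = approx_enclosing_ball(non_de)
flagged = is_outlier((10.0, 10.0), center, radius)
```

The exact enclosing ball for the four toy points has center (1, 1) and radius sqrt(2); the iteration only approximates it, so the returned radius is slightly larger.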
48. MI_DenseNetCAM: A Novel Pan-Cancer Classification and Prediction Method Based on Mutual Information and Deep Learning Model.
- Author
-
Wang, Jianlin, Dai, Xuebing, Luo, Huimin, Yan, Chaokun, Zhang, Ge, and Luo, Junwei
- Subjects
DEEP learning ,CLASSIFICATION ,RNA sequencing ,TUMOR classification - Abstract
The Pan-Cancer Atlas, which consists of original sequencing data from various sources, provides the opportunity to perform systematic studies on the commonalities and differences between diverse cancers. The analysis of the pan-cancer dataset could help researchers to identify the key factors that could trigger cancer. In this paper, we present a novel pan-cancer classification method, referred to as MI_DenseNetCAM, to identify a set of genes that can differentiate all tumor types accurately. First, the Mutual Information (MI) was utilized to eliminate noise and redundancy from the pan-cancer datasets. Then, the gene data were further converted to 2D images. Next, the DenseNet model was adopted as a classifier and the Guided Grad-CAM algorithm was applied to identify the key genes. Extensive experimental results on the public RNA-seq data sets with 33 different tumor types show that our method outperforms the other state-of-the-art classification methods. Moreover, gene analysis further demonstrated that the genes selected by our method were related to the corresponding tumor types. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
49. Single Nucleotide Polymorphism (SNP) Discovery on Transcriptomes of American Holstein and Pakistanian Cholistani Cows.
- Author
-
Ghasemi-Siab, Mojgan, Varkoohi, Sheida, and Banabazi, Mohammad Hossein
- Abstract
Introduction (SNPs) are single nucleotide base variations, caused by transitions (C/T or G/A) or transversions (C/G, C/A, or T/A, T/G), in the same position between individual genomic DNA sequences. Single nucleotide polymorphisms have been applied as important molecular markers in genetics and breeding studies. About 40% of the Single nucleotide polymorphisms in the genes cause a change in an amino acid. The rapid advance of next generation sequencing provides a high-throughput means of SNP discovery. Transcriptome study can fill the gap between genotype and phenotype and help understanding the mechanisms from sequence to function. RNA sequencing (RNA-Seq) is a next generating sequencing based technology for studying of whole transcriptome and gene expression. It simultaneously enables study of transcriptomics sequences and very accurate quantitative gene expression (digital expression). Hence, these data are very suitable for high-throughput study of expression level of all transcribed genes and their SNPs (Single Nucleotide Polymorphism. Recently, RNA-Seq has also been used as an efficient and cost-effective method to systematically identify SNPs in transcribed regions in different species. A transcriptomics-based sequencing approach offers a cheaper alternative to identify a large number of polymorphisms and possibly to discover causative variants. Materials and Methods In this study, RNA-Seq data were used to SNP discovery in American Holstein (Bos taurus) and Pakistanian Cholistani (Bos indicus) cows. RNA-Seq data of 21,078,477 and 20940063 paired end reads with 75 bp length resulted from pooling of whole blood samples of 40 Holstein cows at the University of Wisconsin, Dairy Cattle Center, USA, and 45 Cholistani cows at Gujait Peer Farm, Bahawalpur, Punjab, Pakistan, respectively, obtained from SRA database in NCBI for Holstein cows (http://www.ncbi.nlm.nih.gov/sra/SRX317197) and Cholistani cows http://www.ncbi.nlm.nih.gov/sra/SRS454433). 
MRNA sequencing was run on Illumina Genome Analyzer IIx (Illumina Inc., San Diego, CA). Data were converted from Sra format to Fastq format by fastq-dump command from Ubuntu linux version of Sratoolkit 2.5.4-1. Data quality control was checked by FastQC (v0.11.3) likewise trimmed for linked adaptors and bad quality reads by Trimmomatic 0.33 Adaptors were considered according to sequencing instrument as default (TruSeq2-PE.fa) and the minimum read length was set at 50 bp. Trimmed reads were aligned on UMD3.1 reference genome (release 81) based on annotation data by Tophat2, which applies Bowtie2 as the aligner. The transcriptome was assembled by TopHat2 software in two cow's population by aligning and mapping the RNA-Seq reads on bovine reference genome. The SNPs were discovered by Samtools software. Results and Discussion After data editing, the removed and low quality reads in both breeds were almost equal and relatively low. The length of whole transcriptome assembled, for example 52798651 bases in Holstein, indicates around 2% of the whole genome (around 2.6 Mbp) expressed as mRNA. In Cholistani cows, read mapping rate for forward and reverse reads were 81.3 and 79.9%, respectively, and multiple alignments rate was about 9.4%. Overall read mapping was 80.6% and concordant pair alignment was 70.1%. In Holstein cows, read mapping rate for forward and reverse reads were 66.3 and 55.4%, respectively, and multiple alignments rate was about 7.2%,. Overall read mapping was 60.8% and concordant pair alignment was 51.3%. Results show that 50183 and 137954 SNPs were discovered on the assembled transcriptome of Holstein and Cholistani cow's samples, respectively, and 15308 SNPs were common in both breeds. No direct relation was found between the number of discovered SNPs and the chromosome length. Also 12 SNP types were identified including 4 transition and 8 transversion. 
The most commonly discovered SNPs were transitions, which accounted for 70.6% of SNPs in Cholistani and 69.6% in Holstein cows. The ratio of transitions to transversions (Ts/Tv) was 2.4 and 2.3 in Cholistani and Holstein cows, respectively. The number of SNPs discovered in Cholistani cows was approximately three times that in Holstein cows, because reads from both subspecies were aligned to the same reference genome, which is of Hereford origin. Conclusion: The expression difference between two alleles at a single-nucleotide position causes phenotypic diversity and probably explains a large part of the variance between these two bovine subspecies, especially in susceptibility to disease and parasites and in tolerance of biotic and abiotic environmental stresses under different environmental conditions. In contrast, differential gene expression analysis, or even allele-specific expression at the gene level, may not be able to explain phenotypic diversity. [ABSTRACT FROM AUTHOR]
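The transition/transversion bookkeeping reported in this abstract can be checked with a short sketch. This is only an illustration in Python, not the authors' pipeline (they report using SAMtools for SNP discovery); the function names `snp_class`, `count_substitution_types`, and `ts_tv_from_share` are hypothetical, and the Ts/Tv calculation assumes every SNP is either a transition or a transversion.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def snp_class(ref, alt):
    """Classify a single-base substitution (hypothetical helper).

    Purine<->purine or pyrimidine<->pyrimidine changes are transitions;
    every other substitution is a transversion.
    """
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    pair = {ref, alt}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def count_substitution_types():
    """Count the ordered ref>alt substitution types over A, C, G, T."""
    counts = {"transition": 0, "transversion": 0}
    for ref, alt in permutations("ACGT", 2):
        counts[snp_class(ref, alt)] += 1
    return counts


def ts_tv_from_share(transition_percent):
    """Ts/Tv ratio implied by the percentage of SNPs that are transitions.

    Assumes the remaining SNPs are all transversions.
    """
    return transition_percent / (100.0 - transition_percent)


if __name__ == "__main__":
    print(count_substitution_types())
    # Transition shares as reported in the abstract.
    for breed, share in (("Cholistani", 70.6), ("Holstein", 69.6)):
        print(breed, round(ts_tv_from_share(share), 1))
```

Under these assumptions the 12 ordered substitution types split into 4 transitions and 8 transversions, and the reported transition shares of 70.6% and 69.6% give Ts/Tv ratios that round to 2.4 and 2.3, in agreement with the abstract.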
- Published
- 2021
50. Selecting Classification Methods for Small Samples of Next-Generation Sequencing Data
- Author
-
Jiadi Zhu, Ziyang Yuan, Lianjie Shu, Wenhui Liao, Mingtao Zhao, and Yan Zhou
- Subjects
RNA-seq data, classification, PLDA, NBLDA, ZIPLDA, ZINBLDA, Genetics, QH426-470 - Abstract
Next-generation sequencing has emerged as an essential technology for the quantitative analysis of gene expression. In medical research, RNA sequencing (RNA-seq) data are commonly used to identify which type of disease a patient has. Because of the discrete nature of RNA-seq data, the existing statistical methods that have been developed for microarray data cannot be directly applied to RNA-seq data. Existing statistical methods usually model RNA-seq data by a discrete distribution, such as the Poisson, the negative binomial, or the mixture distribution with a point mass at zero and a Poisson distribution to further allow for data with an excess of zeros. Consequently, analytic tools corresponding to the above three discrete distributions have been developed: Poisson linear discriminant analysis (PLDA), negative binomial linear discriminant analysis (NBLDA), and zero-inflated Poisson logistic discriminant analysis (ZIPLDA). However, it is unclear what the real distributions would be for these classifications when applied to a new and real dataset. Considering that count datasets are frequently characterized by excess zeros and overdispersion, this paper extends the existing distribution to a mixture distribution with a point mass at zero and a negative binomial distribution and proposes a zero-inflated negative binomial logistic discriminant analysis (ZINBLDA) for classification. More importantly, we compare the above four classification methods from the perspective of model parameters, as an understanding of parameters is necessary for selecting the optimal method for RNA-seq data. Furthermore, we determine that the above four methods could transform into each other in some cases. Using simulation studies, we compare and evaluate the performance of these classification methods in a wide range of settings, and we also present a decision tree model created to help us select the optimal classifier for a new RNA-seq dataset. 
The results on the two real datasets agree with the theoretical and simulation results. The methods used in this work are implemented in open-source R scripts, with source code freely available at https://github.com/FocusPaka/ZINBLDA.
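The statement that the four models can transform into each other can be illustrated with a small sketch of the zero-inflated negative binomial (ZINB) probability mass function. This is not the authors' R implementation. The mean/dispersion parameterization (variance = mu + phi * mu^2) and the names `nb_pmf`, `zinb_pmf`, `mu`, `phi`, and `pi0` are assumptions made here for illustration and may differ from the notation used in the paper.

```python
import math


def nb_pmf(x, mu, phi):
    """Negative binomial pmf with mean mu > 0 and variance mu + phi * mu**2.

    phi == 0 is treated as the Poisson limit (assumed parameterization).
    """
    if phi == 0:
        return math.exp(x * math.log(mu) - mu - math.lgamma(x + 1))
    r = 1.0 / phi  # size parameter
    log_p = (
        math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
        + r * math.log(r / (r + mu))
        + x * math.log(mu / (r + mu))
    )
    return math.exp(log_p)


def zinb_pmf(x, mu, phi, pi0):
    """Mixture of a point mass at zero (weight pi0) and a negative binomial."""
    base = (1.0 - pi0) * nb_pmf(x, mu, phi)
    return pi0 + base if x == 0 else base


if __name__ == "__main__":
    # pi0 = 0 recovers the negative binomial; phi = 0 recovers the
    # zero-inflated Poisson; both together recover the Poisson.
    print(zinb_pmf(0, mu=5.0, phi=0.5, pi0=0.2))
    print(zinb_pmf(0, mu=5.0, phi=0.5, pi0=0.0) == nb_pmf(0, 5.0, 0.5))
```

In this sketch, setting `pi0 = 0` gives the negative binomial model, setting `phi = 0` gives the zero-inflated Poisson model, and setting both to zero gives the Poisson model, which is one way to read the nesting among the distributions behind PLDA, NBLDA, ZIPLDA, and ZINBLDA.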
- Published
- 2021