45 results for "Chen, James"
Search Results
2. MYH9 mutation and squamous cell cancer of the tongue in a young adult: a novel case report
- Author
- Yabe, Takako Eva, King, Kylie, Russell, Susan, Satgunaseelan, Laveniya, Gupta, Ruta, Chen, James, and Ashford, Bruce
- Published
- 2022
- Full Text
- View/download PDF
3. Activity of PD1 inhibitor therapy in advanced sarcoma: a single-center retrospective analysis
- Author
- Quiroga, Dionisia, Liebner, David A., Philippon, Jennifer S., Hoffman, Sarah, Tan, Yubo, Chen, James L., Lenobel, Scott, Wakely Jr., Paul E., Pollock, Raphael, and Tinoco, Gabriel
- Published
- 2020
- Full Text
- View/download PDF
4. Metastatic breast cancer patient perceptions of somatic tumor genomic testing
- Author
- Adams, Elizabeth J., Asad, Sarah, Reinbolt, Raquel, Collier, Katharine A., Abdel-Rasoul, Mahmoud, Gillespie, Susan, Chen, James L., Cherian, Mathew A., Noonan, Anne M., Sardesai, Sagar, VanDeusen, Jeffrey, Wesolowski, Robert, Williams, Nicole, Shapiro, Charles L., Macrae, Erin R., Pilarski, Robert, Toland, Amanda E., Senter, Leigha, Ramaswamy, Bhuvaneswari, Lee, Clara N., Lustberg, Maryam B., and Stover, Daniel G.
- Published
- 2020
- Full Text
- View/download PDF
5. Randomized Embolization Trial for NeuroEndocrine Tumor Metastases to the Liver (RETNET): study protocol for a randomized controlled trial
- Author
- Chen, James X., Wileyto, E. Paul, and Soulen, Michael C.
- Published
- 2018
- Full Text
- View/download PDF
6. miR-133a function in the pathogenesis of dedifferentiated liposarcoma
- Author
- Yu, Peter Y., Lopez, Gonzalo, Braggio, Danielle, Koller, David, Bill, Kate Lynn J., Prudner, Bethany C., Zewdu, Abbie, Chen, James L., Iwenofu, O. Hans, Lev, Dina, Strohecker, Anne M., Fenger, Joelle M., Pollock, Raphael E., and Guttridge, Denis C.
- Published
- 2018
- Full Text
- View/download PDF
7. A novel procedure on next generation sequencing data analysis using text mining algorithm.
- Author
- Zhao, Weizhong, Chen, James J., Perkins, Roger, Wang, Yuping, Liu, Zhichao, Hong, Huixiao, Tong, Weida, and Zou, Wen
- Subjects
- DATA mining, GENETIC algorithms, BIOMARKERS, PHENOTYPES, SALMONELLA genetics
- Abstract
Background: Next-generation sequencing (NGS) technologies have provided researchers with vast possibilities in various biological and biomedical research areas. Efficient data mining strategies are in high demand for large scale comparative and evolutionary studies to be performed on the large amounts of data derived from NGS projects. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Methods: We report a novel procedure to analyze NGS data using topic modeling. It consists of four major steps: NGS data retrieval, preprocessing, topic modeling, and data mining using Latent Dirichlet Allocation (LDA) topic outputs. An NGS dataset of Salmonella enterica strains was used as a case study to show the workflow of this procedure. The perplexity measurement of the topic numbers and the convergence efficiencies of Gibbs sampling were calculated and discussed for achieving the best result from the proposed procedure. Results: The output topics by LDA algorithms could be treated as features of Salmonella strains to accurately describe the genetic diversity of the fliC gene in various serotypes. The results of a two-way hierarchical clustering and data matrix analysis on LDA-derived matrices successfully classified Salmonella serotypes based on the NGS data. Conclusion: The implementation of topic modeling in NGS data analysis provides a new way to elucidate genetic information from NGS data and to identify gene-phenotype relationships and biomarkers, especially in the era of biological and medical big data. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
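The abstract above describes using per-strain LDA topic mixtures as features for downstream clustering. A minimal sketch of that idea with scikit-learn, assuming invented toy "documents" standing in for tokenized sequence features (the tokens, strains, and topic count are illustrative, not the paper's data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-ins for preprocessed NGS "documents": each string represents
# the tokenized sequence features of one hypothetical Salmonella strain.
docs = [
    "flic_motif_a flic_motif_a flic_motif_b",
    "flic_motif_a flic_motif_b flic_motif_b",
    "fljb_motif_c fljb_motif_d fljb_motif_d",
    "fljb_motif_c fljb_motif_c fljb_motif_d",
]

counts = CountVectorizer().fit_transform(docs)

# Fit LDA; the per-document topic mixture then serves as the feature
# vector of each strain for the data-mining step the abstract describes.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
features = lda.fit_transform(counts)  # shape: (n_strains, n_topics)
print(features.shape)                 # → (4, 2)
```

Each row is a probability distribution over topics, so it can feed directly into hierarchical clustering or a distance-matrix analysis.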
8. Text mining for identifying topics in the literatures about adolescent substance use and depression.
- Author
- Wang, Shi-Heng, Ding, Yijun, Zhao, Weizhong, Huang, Yung-Hsiang, Perkins, Roger, Zou, Wen, and Chen, James J.
- Subjects
- UNDERAGE drinking, MENTAL depression, PUBLIC health, TEXT mining, ALCOHOL drinking risk factors
- Abstract
Background: Both adolescent substance use and adolescent depression are major public health problems, and they have a tendency to co-occur. Thousands of articles on adolescent substance use or depression have been published. It is labor-intensive and time-consuming to extract huge amounts of information from the accumulated collections. Topic modeling offers a computational tool to find relevant topics by capturing meaningful structure among collections of documents. Methods: In this study, a total of 17,723 abstracts from PubMed published from 2000 to 2014 on adolescent substance use and depression were downloaded, and Latent Dirichlet Allocation (LDA) was applied to perform text mining on the dataset. Word clouds were used to visually display the content of topics and demonstrate the distribution of vocabularies over each topic. Results: The LDA topics recaptured the search keywords in PubMed and further uncovered relevant issues, such as intervention programs, links between adolescent substance use and depression (e.g., sexual experience and violence), and risk factors for adolescent substance use, such as family factors and peer networks. Using trend analysis to explore the dynamics of topic proportions, we found that brain research emerged as a hot issue according to the trend-test coefficient. Conclusions: Topic modeling has the ability to segregate a large collection of articles into distinct themes, and it could be used as a tool to understand the literature, not only by recapturing known facts but also by discovering other relevant topics. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
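The trend analysis described in the abstract above can be sketched as a least-squares slope on a topic's yearly proportion: a positive slope marks a rising ("hot") topic. The yearly proportions below are invented for illustration:

```python
# Least-squares slope of topic proportion over publication years.
def slope(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

years = [2000, 2005, 2010, 2014]
brain_topic_share = [0.02, 0.04, 0.07, 0.10]  # invented proportions
print(slope(years, brain_topic_share) > 0)    # rising trend → True
```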
9. Subgroup identification for treatment selection in biomarker adaptive design.
- Author
- Lu, Tzu-Pin and Chen, James J.
- Subjects
- BIOMARKERS, CANCER genetics, CANCER treatment, DRUG development, DISCRIMINANT analysis, ADENOCARCINOMA, LUNG tumors, ALGORITHMS, BIOLOGICAL assay, COMPUTER simulation, EXPERIMENTAL design, PROGNOSIS, PATIENT selection, STATISTICAL models, DIAGNOSIS
- Abstract
Background: Advances in molecular technology have shifted new drug development toward targeted therapy for treatments expected to benefit subpopulations of patients. Adaptive signature design (ASD) has been proposed to identify the most suitable target patient subgroup to enhance the efficacy of the treatment effect. There are two essential aspects in the development of biomarker adaptive designs: 1) an accurate classifier to identify the most appropriate treatment for patients, and 2) statistical tests to detect the treatment effect in the relevant population and subpopulations. We propose utilization of classification methods to identify patient subgroups and present a statistical testing strategy to detect treatment effects. Methods: Diagonal linear discriminant analysis (DLDA) is used to identify targeted and non-targeted subgroups. For binary endpoints, DLDA is directly applied to classify patients into two subgroups; for continuous endpoints, a two-step procedure involving model fitting and determination of a cutoff point is used for subgroup classification. The proposed strategy includes tests for treatment effect in all patients and in a marker-positive subgroup, with a possible follow-up estimation of treatment effect in the marker-negative subgroup. The proposed method is compared to the ASD classification method using simulated datasets and two publicly available cancer datasets. Results: The DLDA-based classifier performs well in terms of sensitivity, specificity, positive and negative predictive values, and accuracy in the simulation data and the two cancer datasets, with superior accuracy compared to the ASD method. The subgroup testing strategy is shown to be useful in detecting treatment effect in terms of power and control of study-wise error. Conclusion: Accuracy of a classifier is essential for adaptive designs. A poor classifier not only assigns patients to inappropriate treatments, but also reduces the power of the test, resulting in incorrect conclusions. The proposed procedure provides an effective approach for subgroup identification and subgroup analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
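The DLDA classifier named in the abstract above scores each class by a variance-weighted distance to its feature-wise mean, ignoring covariances. A minimal sketch, with toy data invented for illustration (not the paper's datasets):

```python
import numpy as np

def dlda_fit(X, y):
    """Class means plus pooled per-feature variance (the covariance diagonal)."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    pooled_var = np.mean([X[y == c].var(axis=0, ddof=1) for c in classes], axis=0)
    return classes, means, pooled_var

def dlda_predict(X, classes, means, pooled_var):
    # Score = sum_j (x_j - mu_kj)^2 / var_j; pick the class with the minimum.
    d = ((X[:, None, :] - means[None, :, :]) ** 2 / pooled_var).sum(axis=2)
    return classes[d.argmin(axis=1)]

X = np.array([[0.0, 0.1], [0.2, 0.0], [2.0, 2.1], [2.2, 1.9]])
y = np.array([0, 0, 1, 1])
model = dlda_fit(X, y)
print(dlda_predict(np.array([[0.1, 0.0], [2.1, 2.0]]), *model))  # → [0 1]
```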
10. Disc degeneration implies low back pain.
- Author
- Zheng, Chang-Jiang and Chen, James
- Subjects
- LUMBAR pain, INTERVERTEBRAL disk, BACK diseases, CARTILAGE, MAGNETIC resonance imaging, EPIDEMIOLOGY
- Abstract
Background: Low back pain exerts a tremendous burden on individual patients and society due to its prevalence and ability to cause long-term disability. Contemporary treatment and prevention efforts are stymied by the absence of a confirmed cause for the majority of low back pain patients. Methods: A system dynamics approach is used to build a physiologically-based model investigating the relationship between disc degeneration and low back pain. The model's predictions are evaluated under two different types of study designs and compared with established observations on low back pain. Results: A three-compartment model (no disc degeneration, disc degeneration with pain remission, disc degeneration with pain recurrence) accurately predicts the age-specific prevalence observed in one of the largest population-based surveys (R² = 0.998). The estimated transition age at which intervertebral discs lose their growth potential and begin degenerating is 13.3 years. The estimated disc degeneration rate is 0.0344/year. Without any additional change being made to the parameters' values, the model also fully accounts for the age-specific prevalence of disc degeneration detected with lumbar MRI among asymptomatic individuals (R² = 0.978). Conclusions: Dual testing of the proposed mechanistic model with two independent data sources (one with lumbar MRI and the other without) confirms that disc degeneration is the driving force behind, and cause of, the age dependence of low back pain. The observed complexity of low back pain epidemiology arises from the slow dynamics of disc degeneration coupled with the fast dynamics of disease recurrence. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
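A three-compartment structure like the one the abstract above describes can be forward-simulated with a simple Euler loop. The onset age (13.3 years) and degeneration rate (0.0344/year) come from the abstract; the remission and recurrence rates below are invented purely for illustration, so this is a sketch of the model's shape, not its fitted dynamics:

```python
def simulate(age, onset=13.3, degen=0.0344, recur=0.06, remit=0.30, dt=0.05):
    """Euler simulation: healthy -> degeneration with remission <-> recurrence.
    recur/remit rates are invented; only onset and degen come from the abstract."""
    healthy, remission, recurrence = 1.0, 0.0, 0.0
    t = 0.0
    while t < age:
        d = degen * healthy if t >= onset else 0.0   # degeneration starts at onset age
        flow = recur * remission - remit * recurrence
        healthy -= d * dt
        remission += (d - flow) * dt
        recurrence += flow * dt
        t += dt
    return healthy, remission, recurrence

# Prevalence of current pain = recurrence-compartment fraction; it grows with age.
print(round(simulate(30)[2], 3), round(simulate(60)[2], 3))
```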
11. A heuristic approach to determine an appropriate number of topics in topic modeling.
- Author
- Zhao, Weizhong, Chen, James J., Perkins, Roger, Liu, Zhichao, Ge, Weigong, Ding, Yijun, and Zou, Wen
- Subjects
- MACHINE learning, HEURISTIC programming, ARTIFICIAL intelligence, HEURISTIC algorithms, MATHEMATICAL programming
- Abstract
Background: Topic modeling is an active research field in machine learning. While mainly used to build models from unstructured textual data, it offers an effective means of data mining where samples represent documents, and different biological endpoints or omics data represent words. Latent Dirichlet Allocation (LDA) is the most commonly used topic modeling method across a wide number of technical fields. However, model development can be arduous and tedious, and requires burdensome and systematic sensitivity studies in order to find the best set of model parameters. Often, time-consuming subjective evaluations are needed to compare models. Currently, research has yielded no easy way to choose the proper number of topics in a model other than a laborious iterative approach. Methods and results: Based on analysis of the variation of statistical perplexity during topic modeling, a heuristic approach is proposed in this study to estimate the most appropriate number of topics. Specifically, the rate of perplexity change (RPC) as a function of the number of topics is proposed as a suitable selector. We test the stability and effectiveness of the proposed method on three markedly different types of ground-truth datasets: Salmonella next-generation sequencing, pharmacological side effects, and textual abstracts on computational biology and bioinformatics (TCBB) from PubMed. Conclusion: The proposed RPC-based method is demonstrated to choose the best number of topics in three numerical experiments of widely different data types, and for databases of very different sizes. The work required was markedly less arduous than if full systematic sensitivity studies had been carried out with the number of topics as a parameter. We understand that additional investigation is needed to substantiate the method's theoretical basis, and to establish its generalizability in terms of dataset characteristics. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
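The RPC selector the abstract above proposes can be sketched as the per-topic rate of perplexity change between consecutive candidate topic counts. The toy perplexity values and the exact stopping rule used here (pick the first candidate where RPC stops decreasing) are assumptions for illustration, not taken verbatim from the paper:

```python
def rpc(topics, perplexities):
    """RPC_i = |P_i - P_{i-1}| / (t_i - t_{i-1}) for consecutive candidates."""
    pairs = list(zip(topics, perplexities))
    return [abs(p1 - p0) / (t1 - t0)
            for (t0, p0), (t1, p1) in zip(pairs, pairs[1:])]

def pick_num_topics(topics, perplexities):
    rates = rpc(topics, perplexities)
    for i in range(len(rates) - 1):
        if rates[i] < rates[i + 1]:   # assumed rule: RPC stops decreasing here
            return topics[i + 1]
    return topics[-1]

topics = [10, 20, 30, 40, 50]
perp = [900.0, 700.0, 620.0, 610.0, 580.0]  # invented perplexity curve
print(pick_num_topics(topics, perp))         # → 40
```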
12. Chemoproteomics reveals Toll-like receptor fatty acylation.
- Author
- Chesarino, Nicholas M., Hach, Jocelyn C., Chen, James L., Zaro, Balyn W., Rajaram, Murugesan V. S., Turner, Joanne, Schlesinger, Larry S., Pratt, Matthew R., Hang, Howard C., and Yount, Jacob S.
- Subjects
- PALMITOYLATION, CARBON, LIPIDS, CELL cycle, IMMUNOMODULATORS
- Abstract
Background: Palmitoylation is a 16-carbon lipid post-translational modification that increases protein hydrophobicity. This form of protein fatty acylation is emerging as a critical regulatory modification for multiple aspects of cellular interactions and signaling. Despite recent advances in the development of chemical tools for the rapid identification and visualization of palmitoylated proteins, the palmitoyl proteome has not been fully defined. Here we sought to identify and compare the palmitoylated proteins in murine fibroblasts and dendritic cells. Results: 563 putative palmitoylation substrates were identified, more than 200 of which have not been previously suggested to be palmitoylated in past proteomic studies. Here we validate the palmitoylation of several new proteins including TLRs 2, 5, and 10, CD80, CD86, and NEDD4. Palmitoylation of TLR2, which was uniquely identified in dendritic cells, was mapped to a transmembrane domain-proximal cysteine. Inhibition of TLR2 S-palmitoylation pharmacologically or by cysteine mutagenesis led to decreased cell surface expression and a decreased inflammatory response to microbial ligands. Conclusions: This work identifies many fatty acylated proteins involved in fundamental cellular processes as well as cell type-specific functions, highlighting the value of examining the palmitoyl proteomes of multiple cell types. S-palmitoylation of TLR2 is a previously unknown immunoregulatory mechanism that represents an entirely novel avenue for modulation of TLR2 inflammatory activity. [ABSTRACT FROM AUTHOR]
- Published
- 2014
- Full Text
- View/download PDF
13. Topic modeling for cluster analysis of large biological and medical datasets.
- Author
- Zhao, Weizhong, Zou, Wen, and Chen, James J.
- Abstract
Background: The big-data moniker is nowhere better deserved than in describing the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracy and effectiveness of traditional clustering methods diminish for large and high-dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. Results: In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection, and feature extraction, are proposed and tested on the cluster analysis of three large datasets: a Salmonella pulsed-field gel electrophoresis (PFGE) dataset, a lung cancer dataset, and a breast cancer dataset, which represent various types of large biological or medical datasets. All three methods are shown to improve the effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Conclusion: Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection, and feature extraction, yield clustering improvements for the three different data types. The resulting clusters represent true groupings and subgroupings in the data more effectively than traditional methods, suggesting that topic model-based methods could provide an analytic advancement in the analysis of large biological or medical datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2014
- Full Text
- View/download PDF
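The first clustering method named in the abstract above, "highest probable topic assignment", reduces to an argmax over each sample's topic mixture. A minimal sketch, with topic mixtures invented for illustration:

```python
# Each row is one sample's distribution over 3 topics (invented values);
# the sample's cluster is the index of its most probable topic.
doc_topic = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
]
clusters = [max(range(len(row)), key=row.__getitem__) for row in doc_topic]
print(clusters)  # → [0, 0, 2]
```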
14. Deep brain stimulation of the basolateral amygdala for treatment-refractory combat post-traumatic stress disorder (PTSD): study protocol for a pilot randomized controlled trial with blinded, staggered onset of stimulation.
- Author
- Koek, Ralph J., Langevin, Jean-Philippe, Krahl, Scott E., Kosoyan, Hovsep J., Schwartz, Holly N., Chen, James W. Y., Melrose, Rebecca, Mandelkern, Mark J., and Sultzer, David
- Abstract
Background: Combat post-traumatic stress disorder (PTSD) involves significant suffering, impairments in social and occupational functioning, substance use and medical comorbidity, and increased mortality from suicide and other causes. Many veterans continue to suffer despite current treatments. Deep brain stimulation (DBS) has shown promise in refractory movement disorders, depression and obsessive-compulsive disorder, with deep brain targets chosen by integration of clinical and neuroimaging literature. The basolateral amygdala (BLn) is an optimal target for high-frequency DBS in PTSD based on neurocircuitry findings from a variety of perspectives. DBS of the BLn was validated in a rat model of PTSD by our group, and limited data from humans support the potential safety and effectiveness of BLn DBS. Methods/Design: We describe the protocol design for a first-ever Phase I pilot study of bilateral BLn high-frequency DBS for six severely ill, functionally impaired combat veterans with PTSD refractory to conventional treatments. After implantation, patients are monitored for a month with stimulators off. An electroencephalographic (EEG) telemetry session will test safety of stimulation before randomization to staggered-onset, double-blind sham versus active stimulation for two months. Thereafter, patients will undergo an open-label stimulation for a total of 24 months. Primary efficacy outcome is a 30% decrease in the Clinician Administered PTSD Scale (CAPS) total score. Safety outcomes include extensive assessments of psychiatric and neurologic symptoms, psychosocial function, amygdala-specific and general neuropsychological functions, and EEG changes. The protocol requires the veteran to have a cohabiting significant other who is willing to assist in monitoring safety and effect on social functioning. At baseline and after approximately one year of stimulation, trauma script-provoked
¹⁸FDG PET metabolic changes in limbic circuitry will also be evaluated. Discussion: While the rationale for studying DBS for PTSD is ethically and scientifically justified, the importance of the amygdaloid complex and its connections for a myriad of emotional, perceptual, behavioral, and vegetative functions requires a complex trial design in terms of outcome measures. Knowledge generated from this pilot trial can be used to design future studies to determine the potential of DBS to benefit both veterans and nonveterans suffering from treatment-refractory PTSD. [ABSTRACT FROM AUTHOR]
- Published
- 2014
- Full Text
- View/download PDF
15. Imaging genomic mapping of an invasive MRI phenotype predicts patient outcome and metabolic dysfunction: a TCGA glioma phenotype research group project.
- Author
- Colen, Rivka R., Vangel, Mark, Wang, Jixin, Gutman, David A., Hwang, Scott N., Wintermark, Max, Jain, Rajan, Jilwan-Nicolas, Manal, Chen, James Y., Raghavan, Prashant, Holder, Chad A., Rubin, Daniel, Huang, Eric, Kirby, Justin, Freymann, John, Jaffe, Carl C., Flanders, Adam, and Zinn, Pascal O.
- Subjects
- GENOTYPE-environment interaction, MEDICAL imaging systems, GENETIC regulation, CANCER cells, GLIOMAS, MAGNETIC resonance imaging, BIOMARKERS, GENE therapy
- Abstract
Background: Invasion of tumor cells into adjacent brain parenchyma is a major cause of treatment failure in glioblastoma. Furthermore, invasive tumors are shown to have a different genomic composition and metabolic abnormalities that allow for a more aggressive GBM phenotype and resistance to therapy. We thus seek to identify those genomic abnormalities associated with a highly aggressive and invasive GBM imaging-phenotype. Methods and materials: We retrospectively identified 104 treatment-naïve glioblastoma patients from The Cancer Genome Atlas (TCGA) who had gene expression profiles and corresponding MR imaging available in The Cancer Imaging Archive (TCIA). The standardized VASARI feature-set criteria were used for the qualitative visual assessments of invasion. Patients were assigned to classes based on the presence (Class A) or absence (Class B) of statistically significant invasion parameters to create an invasive imaging signature; imaging genomic analysis was subsequently performed using the GenePattern Comparative Marker Selection module (Broad Institute). Results: Our results show that patients with a combination of deep white matter tract and ependymal invasion (Class A) on imaging had a significant decrease in overall survival as compared to patients with absence of such invasive imaging features (Class B) (8.7 versus 18.6 months, p < 0.001). Mitochondrial dysfunction was the top canonical pathway associated with the Class A gene expression signature. The MYC oncogene was predicted to be the top activation regulator in Class A. Conclusion: We demonstrate that MRI biomarker signatures can identify distinct GBM phenotypes associated with highly significant survival differences and specific molecular pathways. This study identifies mitochondrial dysfunction as the top canonical pathway in a very aggressive GBM phenotype. Thus, imaging-genomic analyses may prove invaluable in detecting novel targetable genomic pathways. [ABSTRACT FROM AUTHOR]
- Published
- 2014
- Full Text
- View/download PDF
16. Applying genome-wide gene-based expression quantitative trait locus mapping to study population ancestry and pharmacogenetics.
- Author
- Yang, Hsin-Chou, Lin, Chien-Wei, Chen, Chia-Wei, and Chen, James J.
- Subjects
- GENE expression, GENOMICS, GENETIC transcription regulation, GENETIC transcription, GENETIC regulation, PHARMACOGENOMICS, PHYSIOLOGY
- Abstract
Background: Gene-based analysis has become popular in genomic research because of its appealing biological and statistical properties compared with those of a single-locus analysis. However, only a few, if any, studies have discussed a mapping of expression quantitative trait loci (eQTL) in a gene-based framework, and none has discussed ancestry-informative eQTL or investigated their roles in pharmacogenetics by integrating single nucleotide polymorphism (SNP)-based eQTL (s-eQTL) and gene-based eQTL (g-eQTL). Results: In this g-eQTL mapping study, the transcript expression levels of genes (transcript-level genes; T-genes) were correlated with the SNPs of genes (sequence-level genes; S-genes) by using a method of gene-based partial least squares (PLS). Ancestry-informative transcripts were identified using a rank-score-based multivariate association test, and ancestry-informative eQTL were identified using Fisher's exact test. Furthermore, key ancestry-predictive eQTL were selected in a flexible discriminant analysis. We analyzed SNPs and gene expression of 210 independent people of African, Asian, and European descent. We identified numerous cis- and trans-acting g-eQTL and s-eQTL for each population by using PLS. We observed ancestry information enriched in eQTL. Furthermore, we identified two ancestry-informative eQTL associated with adverse drug reactions and/or drug response. Rs1045642, located on MDR1, is an ancestry-informative eQTL (P = 2.13E-13, using Fisher's exact test) associated with adverse drug reactions to amitriptyline and nortriptyline and drug responses to morphine. Rs20455, located in KIF6, is an ancestry-informative eQTL (P = 2.76E-23, using Fisher's exact test) associated with the response to statin drugs (e.g., pravastatin and atorvastatin). Ancestry-informative eQTL of drug biotransformation genes were also observed; cross-population cis-acting expression regulators included SPG7, TAP2, SLC7A7, and CYP4F2. Finally, we also identified key ancestry-predictive eQTL and established classification models with promising training and testing accuracies in separating samples from close populations. Conclusions: In summary, we developed a gene-based PLS procedure and a SAS macro for identifying g-eQTL and s-eQTL. We established data archives of eQTL for global populations. The program and data archives are accessible at http://www.stat.sinica.edu.tw/hsinchou/genetics/eQTL/HapMapII.htm. Finally, the results from our investigations regarding the interrelationship between eQTL, ancestry information, and pharmacodynamics provide rich resources for future eQTL studies and practical applications in population genetics and medical genetics. [ABSTRACT FROM AUTHOR]
- Published
- 2014
- Full Text
- View/download PDF
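The abstract above reports Fisher's exact test p-values for ancestry-informative eQTL. A one-sided version of the test for a 2×2 table is a short hypergeometric tail sum; the table below (e.g., allele carriers vs. non-carriers in two ancestry groups) is invented for illustration:

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """P(X >= a) under the hypergeometric distribution fixing the table margins."""
    row1, col1, n = a + b, a + c, a + b + c + d
    return sum(comb(col1, x) * comb(n - col1, row1 - x)
               for x in range(a, min(row1, col1) + 1)) / comb(n, row1)

# Invented 2x2 table: rows = groups, columns = carrier / non-carrier.
p = fisher_one_sided(3, 1, 1, 3)
print(p)  # → 17/70 ≈ 0.2429
```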
17. Data mining tools for Salmonella characterization: application to gel-based fingerprinting analysis.
- Author
- Zou, Wen, Tang, Hailin, Zhao, Weizhong, Meehan, Joe, Foley, Steven L., Lin, Wei-Jiun, Chen, Hung-Chia, Fang, Hong, Nayak, Rajesh, and Chen, James J.
- Subjects
- DATA mining, SALMONELLA, HUMAN fingerprints, PULSED-field gel electrophoresis, BIOINFORMATICS
- Abstract
Background: Pulsed-field gel electrophoresis (PFGE) is currently the most widely and routinely used method by the Centers for Disease Control and Prevention (CDC) and state health labs in the United States for Salmonella surveillance and outbreak tracking. Major drawbacks of commercially available PFGE analysis programs have been their difficulty in dealing with large datasets and the limited availability of analysis tools. There exists a need to develop new analytical tools for PFGE data mining in order to make full use of valuable data in large surveillance databases. Results: In this study, a software package was developed consisting of five types of bioinformatics approaches explored and implemented for the analysis and visualization of PFGE fingerprints. The approaches include PFGE band standardization, Salmonella serotype prediction, hierarchical cluster analysis, distance-matrix analysis, and two-way hierarchical cluster analysis. PFGE band standardization makes cross-group large-dataset analysis possible. The Salmonella serotype prediction approach allows users to predict serotypes of Salmonella isolates based on their PFGE patterns. The hierarchical cluster analysis approach can be used to clarify subtypes and phylogenetic relationships among groups of PFGE patterns. The distance-matrix and two-way hierarchical cluster analysis tools allow users to directly visualize the similarities/dissimilarities of any two individual patterns and the inter- and intra-serotype relationships of two or more serotypes, and provide a summary of the overall relationships between user-selected serotypes as well as the distinguishable band markers of these serotypes. The functionalities of these tools were illustrated on PFGE fingerprinting data from PulseNet of the CDC. Conclusions: The bioinformatics approaches included in the software package developed in this study were integrated with the PFGE database to enhance the data mining of PFGE fingerprints. Fast and accurate prediction makes it possible to elucidate Salmonella serotype information before conventional serological methods are pursued. The development of bioinformatics tools to distinguish the PFGE markers and serotype-specific patterns will enhance PFGE data retrieval, interpretation, and serotype identification, and will likely accelerate source tracking to identify the Salmonella isolates implicated in foodborne diseases. [ABSTRACT FROM AUTHOR]
- Published
- 2013
- Full Text
- View/download PDF
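Serotype prediction from PFGE patterns, as described in the abstract above, can be sketched as nearest-reference matching on band-position sets using Dice similarity, a standard measure for gel fingerprints. The reference serotypes and band positions below are invented, and the paper's actual predictor may differ:

```python
def dice(a, b):
    """Dice similarity of two band-position sets."""
    return 2 * len(a & b) / (len(a) + len(b))

# Invented reference band patterns (standardized band positions per serotype).
reference = {
    "Typhimurium": {90, 150, 220, 310, 480},
    "Enteritidis": {90, 180, 260, 400, 520},
}

def predict(bands):
    # Assign the serotype whose reference pattern is most similar.
    return max(reference, key=lambda s: dice(bands, reference[s]))

print(predict({90, 150, 225, 310, 480}))  # → Typhimurium
```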
18. Curation-free biomodules mechanisms in prostate cancer predict recurrent disease.
- Author
- Chen, James L., Hsu, Alexander, Yang, Xinan, Li, Jianrong, Lee, Younghee, Parinandi, Gurunadh, Li, Haiquan, and Lussier, Yves A.
- Subjects
- PROSTATE cancer, ONCOGENES, CANCER genes, CANCER genetics, CANCER patients, CANCER relapse
- Abstract
Motivation: Gene expression-based prostate cancer gene signatures of poor prognosis are hampered by a lack of gene feature reproducibility and a lack of understandability of their function. Molecular pathway-level mechanisms are intrinsically more stable and more robust than an individual gene. The Functional Analysis of Individual Microarray Expression (FAIME) method we developed allows distinctive sample-level pathway measurements with utility for correlation with continuous phenotypes (e.g. survival). Further, we and others have previously demonstrated that pathway-level classifiers can be as accurate as gene-level classifiers using curated genesets that may implicitly comprise ascertainment biases (e.g. KEGG, GO). Here, we hypothesized that transformation of individual prostate cancer patient gene expression to pathway-level mechanisms derived from automated high-throughput analyses of genomic datasets may also permit personalized pathway analysis and improve prognosis of recurrent disease. Results: Via FAIME, three independent prostate gene expression arrays with both normal and tumor samples were transformed into two distinct types of molecular pathway mechanisms: (i) the curated Gene Ontology (GO) and (ii) dynamic expression activity networks of cancer (Cancer Modules). FAIME-derived mechanisms for tumorigenesis were then identified and compared. Curated GO and computationally generated "Cancer Module" mechanisms overlap significantly and are enriched for known oncogenic deregulations and highlight potential areas of investigation. We further show in two independent datasets that these pathway-level tumorigenesis mechanisms can identify men who are more likely to develop recurrent prostate cancer (log-rank p = 0.019). Conclusion: Curation-free biomodules classification derived from congruent gene expression activation breaks from the paradigm of recapitulating the known curated pathway mechanism universe. [ABSTRACT FROM AUTHOR]
- Published
- 2013
- Full Text
- View/download PDF
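A highly simplified sketch of a sample-level pathway score in the spirit of the FAIME transformation described above: rank genes within one sample's expression profile, then score a gene set by the mean rank of its members. FAIME itself uses exponentially weighted ranks and a contrast against non-member genes; the genes, values, and set below are invented:

```python
def pathway_score(expr, geneset):
    """Mean within-sample expression rank of the geneset's members (1 = lowest)."""
    ranked = sorted(expr, key=expr.__getitem__)      # genes ordered low -> high
    rank = {g: i + 1 for i, g in enumerate(ranked)}
    return sum(rank[g] for g in geneset) / len(geneset)

# One hypothetical sample's expression profile (invented values).
sample = {"AR": 9.1, "PTEN": 2.3, "MYC": 7.8, "GAPDH": 5.0, "TP53": 3.2}
print(pathway_score(sample, {"AR", "MYC"}))  # → 4.5 (highly expressed set)
```

Computed per sample, such scores give each patient a pathway-level profile that can be correlated with survival, as the abstract describes.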
19. Assessment of reproducibility of cancer survival risk predictions across medical centers.
- Author
- Chen, Hung-Chia and Chen, James J.
- Subjects
- MEDICAL centers, PREDICTION models, CANCER risk factors, GENE expression, COLON cancer patients, STATISTICAL correlation
- Abstract
Background: The two most important considerations in the evaluation of survival prediction models are 1) predictability, the ability to predict survival risks accurately, and 2) reproducibility, the ability to generalize to predict samples generated from different studies. We present approaches for assessment of the reproducibility of survival risk score predictions across medical centers. Methods: Reproducibility was evaluated in terms of consistency and transferability. Consistency is the agreement of risk scores predicted between two centers. Transferability from one center to another is the agreement of the risk scores of the second center predicted by each of the two centers. Transferability can concern: 1) model transferability, whether a predictive model developed from one center can be applied to predict the samples generated from other centers, and 2) signature transferability, whether signature markers of a predictive model developed from one center can be applied to predict the samples from other centers. We considered eight prediction models, including two clinical models, two gene expression models, and their combinations. Predictive performance of the eight models was evaluated by several common measures. Correlation coefficients between predicted risk scores of different centers were computed to assess reproducibility, both consistency and transferability. Results: Two public datasets were analyzed: lung cancer data generated from four medical centers and colon cancer data generated from two medical centers. The risk score estimates for lung cancer patients predicted by three of the four centers agree reasonably well. In general, a good prediction model showed better cross-center consistency and transferability. The risk scores for the colon cancer patients from one (Moffitt) medical center that were predicted by the clinical models developed from the other (Vanderbilt) medical center showed excellent model transferability and signature transferability. Conclusions: This study illustrates an analytical approach to assessing the reproducibility of predictive models and signatures. Based on the analyses of the two cancer datasets, we conclude that the models with clinical variables appear to perform reasonably well with a high degree of consistency and transferability. Further investigation of the reproducibility of prediction models that include gene expression data across studies is warranted. [ABSTRACT FROM AUTHOR]
- Published
- 2013
- Full Text
- View/download PDF
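The consistency measure defined in this abstract, agreement between two centers' predicted risk scores for the same patients, reduces to a correlation between paired score vectors. A minimal sketch; the score vectors and the use of plain Pearson correlation are illustrative assumptions, not the paper's exact procedure:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Risk scores for the same patients as predicted by models from two
# centers (hypothetical numbers, for illustration only).
scores_center_1 = [0.2, 1.4, 0.9, 2.1, 0.5, 1.8]
scores_center_2 = [0.3, 1.2, 1.1, 1.9, 0.4, 2.0]

# High correlation = good cross-center consistency in this sense.
consistency = pearson(scores_center_1, scores_center_2)
```

Transferability would be assessed the same way, but correlating two models' predictions for the second center's samples rather than two centers' predictions for shared samples.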
20. Comparison of the global gene expression of choroid plexus and meninges and associated vasculature under control conditions and after pronounced hyperthermia or amphetamine toxicity.
- Author
-
Bowyer, John F., Patterson, Tucker A., Saini, Upasana T., Hanig, Joseph P., Thomas, Monzy, Camacho, Luísa, George, Nysia I., and Chen, James J.
- Subjects
CHOROID plexus ,BLOOD-brain barrier ,GENE expression ,MENINGES ,CEREBROSPINAL fluid ,FEVER ,AMPHETAMINES ,COMPARATIVE studies ,BLOOD vessels ,DRUG toxicity - Abstract
Background: The meninges (arachnoid and pial membranes) and associated vasculature (MAV) and choroid plexus are important in maintaining cerebrospinal fluid (CSF) generation and flow. MAV vasculature was previously observed to be adversely affected by environmentally-induced hyperthermia (EIH) and more so by a neurotoxic amphetamine (AMPH) exposure. Herein, microarray and RT-PCR analysis was used to compare the gene expression profiles between choroid plexus and MAV under control conditions and at 3 hours and 1 day after EIH or AMPH exposure. Since AMPH and EIH are so disruptive to vasculature, genes related to vasculature integrity and function were of interest. Results: Our data shows that, under control conditions, many of the genes with relatively high expression in both the MAV and choroid plexus are also abundant in many epithelial tissues. These genes function in transport of water, ions, and solutes, and likely play a role in CSF regulation. Most genes that help form the blood-brain barrier (BBB) and tight junctions were also highly expressed in MAV but not in choroid plexus. In MAV, exposure to EIH and more so to AMPH decreased the expression of BBB-related genes such as Sox18, Ocln, and Cldn5, but they were much less affected in the choroid plexus. There was a correlation between the genes related to reactive oxidative stress and damage that were significantly altered in the MAV and choroid plexus after either EIH or AMPH. However, AMPH (at 3 hr) significantly affected about 5 times as many genes as EIH in the MAV, while in the choroid plexus EIH affected more genes than AMPH. Several unique genes that are not specifically related to vascular damage increased to a much greater extent after AMPH compared to EIH in the MAV (Lbp, Reg3a, Reg3b, Slc15a1, Sct and Fst) and choroid plexus (Bmp4, Dio2 and Lbp). 
Conclusions: Our study indicates that the disruption of choroid plexus function and damage produced by AMPH and EIH is significant, but the changes may not be as pronounced as they are in the MAV, particularly for AMPH. Expression profiles in the MAV and choroid plexus differed to some extent and differences were not restricted to vascular related genes. [ABSTRACT FROM AUTHOR]
- Published
- 2013
- Full Text
- View/download PDF
21. Identification of reproducible gene expression signatures in lung adenocarcinoma.
- Author
-
Tzu-Pin Lu, Chuang, Eric Y., and Chen, James J.
- Subjects
LUNG cancer treatment ,GENE expression ,CANCER patients ,BIOMARKERS ,ANTINEOPLASTIC agents ,EPIDERMAL growth factor receptors - Abstract
Background: Lung cancer is the leading cause of cancer-related death worldwide. Tremendous research efforts have been devoted to improving treatment procedures, but the average five-year overall survival rates are still less than 20%. Many biomarkers have been identified for predicting survival; challenges arise, however, in translating the findings into clinical practice due to their inconsistency and irreproducibility. In this study, we propose an approach that identifies predictive genes through pathways. Results: The microarrays from Shedden et al. were used as the training set, and the log-rank test was performed to select potential signature genes. We focused on 24 cancer-related pathways from 4 biological databases. A scoring scheme was developed using the Cox hazard regression model, and patients were divided into two groups based on the medians. Subsequently, predictability and generalizability were evaluated by 2-fold cross-validation and a resampling test in 4 independent datasets, respectively. A set of 16 genes related to apoptosis execution was demonstrated to have good predictability as well as generalizability in more than 700 lung adenocarcinoma patients and was reproducible in 4 independent datasets. This signature set was shown to have superior performance compared to 6 other published signatures. Furthermore, the corresponding risk scores derived from the set were found to be associated with the efficacy of the anti-cancer drug ZD-6474 targeting EGFR. Conclusion: In summary, we present a new approach to identify reproducible survival predictors for lung adenocarcinoma, and the identified genes may serve as both prognostic and predictive biomarkers in the future. [ABSTRACT FROM AUTHOR]
- Published
- 2013
- Full Text
- View/download PDF
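The scoring scheme described above, Cox-model coefficients applied to signature-gene expression followed by a median split into risk groups, can be sketched as follows. The coefficients and expression values are hypothetical placeholders, and fitting the Cox model itself is assumed to have been done beforehand:

```python
def risk_scores(expression, coefficients):
    """Risk score per patient: weighted sum of signature-gene expression,
    with weights taken from a pre-fitted Cox regression model."""
    return [sum(c * e for c, e in zip(coefficients, patient))
            for patient in expression]

def median_split(scores):
    """Divide patients into high-/low-risk groups at the median score."""
    med = sorted(scores)[len(scores) // 2]  # upper median for even n
    return ["high" if s >= med else "low" for s in scores]

# Hypothetical Cox coefficients for a 3-gene signature and 4 patients
# (rows = patients, columns = signature genes).
coefs = [0.8, -0.5, 1.2]
expr = [[1.0, 2.0, 0.5],
        [0.2, 1.5, 2.0],
        [2.5, 0.1, 1.0],
        [0.5, 2.5, 0.2]]

scores = risk_scores(expr, coefs)
groups = median_split(scores)
```

The two groups would then be compared, e.g. by a log-rank test, to assess the signature's prognostic value.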
22. Assessment of performance of survival prediction models for cancer prognosis.
- Author
-
Chen, Hung-Chia, Kodell, Ralph L., Chen, Kuang Fu, and Chen, James J.
- Subjects
CANCER patients ,GENETIC regulation ,GENE expression ,COMBINATORICS ,BIOSYNTHESIS - Abstract
Background: Cancer survival studies are commonly analyzed using survival-time prediction models for cancer prognosis. A number of different performance metrics are used to ascertain the concordance between the predicted risk score of each patient and the actual survival time, but these metrics can sometimes conflict. Alternatively, patients are sometimes divided into two classes according to a survival-time threshold, and binary classifiers are applied to predict each patient's class. Although this approach has several drawbacks, it does provide natural performance metrics such as positive and negative predictive values to enable unambiguous assessments. Methods: We compare the survival-time prediction and survival-time threshold approaches to analyzing cancer survival studies. We review and compare common performance metrics for the two approaches. We present new randomization tests and cross-validation methods to enable unambiguous statistical inferences for several performance metrics used with the survival-time prediction approach. We consider five survival prediction models consisting of one clinical model, two gene expression models, and two models combining clinical and gene expression variables. Results: A public breast cancer dataset was used to compare several performance metrics using the five prediction models. 1) For some prediction models, the hazard ratio from fitting a Cox proportional hazards model was significant, but the two-group comparison was not, and vice versa. 2) The randomization test and cross-validation were generally consistent with the p-values obtained from the standard performance metrics. 3) Binary classifiers depended strongly on how the risk groups were defined; a slight change of the survival threshold for assignment of classes led to very different prediction results. 
Conclusions: 1) Different performance metrics for evaluation of a survival prediction model may give different conclusions about its discriminatory ability. 2) Evaluation using a high-risk versus low-risk group comparison depends on the selected risk-score threshold; a plot of p-values from all possible thresholds can show the sensitivity of the threshold selection. 3) A randomization test of the significance of Somers' rank correlation can be used for further evaluation of the performance of a prediction model. 4) The cross-validated power of survival prediction models decreases as the training and test sets become less balanced. [ABSTRACT FROM AUTHOR]
- Published
- 2012
- Full Text
- View/download PDF
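Conclusion 3) mentions a randomization test on Somers' rank correlation. A sketch of the idea follows, with a deliberately simplified Somers' D that ignores censoring (real survival data would need a censoring-aware version); the data are made up for illustration:

```python
import random

def somers_d(scores, times):
    """Simplified Somers' D between predicted risk scores and survival
    times (censoring not handled): higher risk should mean shorter
    survival, so a pair is concordant when the higher-scored patient
    has the shorter survival time."""
    conc = disc = valid = 0
    n = len(scores)
    for i in range(n):
        for j in range(i + 1, n):
            if scores[i] == scores[j]:
                continue  # tied scores carry no ordering information
            valid += 1
            agree = (scores[i] - scores[j]) * (times[j] - times[i])
            if agree > 0:
                conc += 1
            elif agree < 0:
                disc += 1
    return (conc - disc) / valid

def randomization_p_value(scores, times, n_perm=2000, seed=0):
    """One-sided randomization p-value: how often a random pairing of
    scores to survival times yields a Somers' D at least as large as
    the observed one."""
    rng = random.Random(seed)
    observed = somers_d(scores, times)
    perm = list(times)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        if somers_d(scores, perm) >= observed:
            hits += 1
    return hits / n_perm

# Perfectly inverse relationship: high risk, short survival.
obs_d = somers_d([3.0, 2.0, 1.0, 0.5], [1.0, 2.0, 3.0, 4.0])
```

A small p-value indicates the observed concordance is unlikely under a random score-to-outcome assignment.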
23. Stromal microenvironment processes unveiled by biological component analysis of gene expression in xenograft tumor models.
- Author
-
Yang, Xinan, Lee, Younghee, Huang, Yong, Chen, James L., Xing, Rosie H., and Lussier, Yves A.
- Subjects
GENE expression ,GENES ,XENOGRAFTS ,TUMORS ,CANCER cells - Abstract
Background: Mouse xenograft models, in which human cancer cells are implanted in immune-suppressed mice, have been popular for studying the mechanisms of novel therapeutic targets, tumor progression and metastasis. We hypothesized that we could exploit the interspecies genetic differences in these experiments. Our purpose is to elucidate stromal microenvironment signals from probes on human arrays that unintentionally cross-hybridize with homologous mouse genes in xenograft tumor models. Results: By identifying cross-species hybridizing probes through sequence alignment and a cross-species hybridization experiment for the human whole-genome arrays, deregulated stromal genes can be identified and their biological significance predicted from enrichment studies. Comparing these results with those found by laser capture microdissection of stromal cells from tumor specimens resulted in the discovery of significantly enriched stromal biological processes. Conclusions: Using this method, in addition to their primary endpoints, researchers can leverage xenograft experiments to better characterize the tumor microenvironment without additional costs. The Xhyb probes and R script are available at http://www.lussierlab.org/publications/Stroma. [ABSTRACT FROM AUTHOR]
- Published
- 2010
24. An FDA bioinformatics tool for microbial genomics research on molecular characterization of bacterial foodborne pathogens using microarrays.
- Author
-
Fang, Hong, Xu, Joshua, Ding, Don, Jackson, Scott A., Patel, Isha R., Frye, Jonathan G., Zou, Wen, Nayak, Rajesh, Foley, Steven, Chen, James, Su, Zhenqiang, Ye, Yanbin, Turner, Steve, Harris, Steve, Zhou, Guangxu, Cerniglia, Carl, and Tong, Weida
- Subjects
BIOINFORMATICS ,GENOMICS ,FOOD pathogens - Abstract
Background: Advances in microbial genomics and bioinformatics are offering greater insights into the emergence and spread of foodborne pathogens in outbreak scenarios. The Food and Drug Administration (FDA) has developed a genomics tool, ArrayTrack™, which provides extensive functionalities to manage, analyze, and interpret genomic data for mammalian species. ArrayTrack™ has been widely adopted by the research community and used for pharmacogenomics data review in the FDA's Voluntary Genomics Data Submission program. Results: ArrayTrack™ has been extended to manage and analyze genomics data from bacterial pathogens of human, animal, and food origin. It was populated with bioinformatics data from public databases such as NCBI, Swiss-Prot, KEGG Pathway, and Gene Ontology to facilitate pathogen detection and characterization. ArrayTrack™'s data processing and visualization tools were enhanced with analysis capabilities designed specifically for microbial genomics, including flag-based hierarchical clustering analysis (HCA), flag concordance heat maps, and mixed scatter plots. These specific functionalities were evaluated on data generated from a custom Affymetrix array (FDA-ECSG) previously developed within the FDA. The FDA-ECSG array represents 32 complete genomes of Escherichia coli and Shigella. The new functions were also used to analyze microarray data focusing on antimicrobial resistance genes from Salmonella isolates in a poultry production environment, using a universal antimicrobial resistance microarray developed by the United States Department of Agriculture (USDA). Conclusion: The application of ArrayTrack™ to different microarray platforms demonstrates its utility in microbial genomics research, and thus will improve the capabilities of the FDA to rapidly identify foodborne bacteria and their genetic traits (e.g., antimicrobial resistance, virulence, etc.) during outbreak investigations. 
ArrayTrack™ is free to use and available to public, private, and academic researchers at http://www.fda.gov/ArrayTrack. [ABSTRACT FROM AUTHOR]
- Published
- 2010
25. Power and sample size estimation in microarray studies.
- Author
-
Wei-Jiun Lin, Huey-Miin Hsueh, and Chen, James J.
- Subjects
DNA microarrays ,GENES ,HEREDITY ,GENE expression ,GENOMIC imprinting - Abstract
Background: Before conducting a microarray experiment, one important issue that needs to be determined is the number of arrays required in order to have adequate power to identify differentially expressed genes. This paper discusses some crucial issues in the problem formulation, parameter specifications, and approaches that are commonly proposed for sample size estimation in microarray experiments. Common methods for sample size estimation are formulated as the minimum sample size necessary to achieve a specified sensitivity (proportion of detected truly differentially expressed genes) on average, at a specified false discovery rate (FDR) level and a specified expected proportion (p1) of the truly differentially expressed genes in the array. Unfortunately, the probability of detecting the specified sensitivity in such a formulation can be low. We formulate the sample size problem as the number of arrays needed to achieve a specified sensitivity with 95% probability at the specified significance level. A permutation method using a small pilot dataset to estimate sample size is proposed. This method accounts for correlation and effect size heterogeneity among genes. Results: A sample size estimate based on the common formulation, to achieve the desired sensitivity on average, can be calculated using a univariate method without taking the correlation among genes into consideration. This formulation of the sample size problem is inadequate because the probability of detecting the specified sensitivity can be lower than 50%. On the other hand, the sample size calculated by the proposed permutation method will ensure detecting at least the desired sensitivity with 95% probability. The method is shown to perform well for a real example dataset using a small pilot dataset with 4-6 samples per group. Conclusions: We recommend that the sample size problem be formulated to detect a specified proportion of differentially expressed genes with 95% probability. 
This formulation ensures finding the desired proportion of true positives with high probability. The proposed permutation method takes the correlation structure and effect size heterogeneity into consideration and works well using only a small pilot dataset. [ABSTRACT FROM AUTHOR]
- Published
- 2010
- Full Text
- View/download PDF
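The gap between the two formulations, sensitivity achieved on average versus with 95% probability, can be illustrated with a small simulation. The independence of detections across genes is a deliberate simplification here; the paper's permutation method exists precisely because real genes are correlated:

```python
import random

def sensitivity_attainment(n_true, detect_prob, target, n_sim=5000, seed=0):
    """Probability that the realized sensitivity (fraction of truly
    differentially expressed genes detected) reaches `target`, when
    each of `n_true` genes is detected independently with probability
    `detect_prob`. Independence across genes is a simplifying
    assumption for illustration only."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        detected = sum(rng.random() < detect_prob for _ in range(n_true))
        if detected / n_true >= target:
            hits += 1
    return hits / n_sim

# A design tuned to hit 80% sensitivity *on average* reaches 80% in
# only about half of the experiments; pushing the per-gene detection
# probability higher is needed to reach 80% with high probability.
attain_avg = sensitivity_attainment(n_true=100, detect_prob=0.80, target=0.80)
attain_95 = sensitivity_attainment(n_true=100, detect_prob=0.88, target=0.80)
```

This is the abstract's point: the "on average" formulation can attain the stated sensitivity in barely half of the experiments.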
26. Mechanism-anchored profiling derived from epigenetic networks predicts outcome in acute lymphoblastic leukemia.
- Author
-
Xinan Yang, Yong Huang, Chen, James L., Jianming Xie, Xiao Sun, and Lussier, Yves A.
- Subjects
EPIGENESIS ,LYMPHOBLASTIC leukemia ,GENE expression ,PHENOTYPES ,GENETIC regulation - Abstract
Background: Current outcome predictors based on "molecular profiling" rely on gene lists selected without consideration for their molecular mechanisms. This study was designed to demonstrate that we could learn about genes related to a specific mechanism and further use this knowledge to predict outcome in patients -- a paradigm shift towards accurate "mechanism-anchored profiling". We propose a novel algorithm, PGnet, which predicts a tripartite mechanism-anchored network associated with epigenetic regulation, consisting of phenotypes, genes and mechanisms. Genes termed GEMs in this network meet all of the following criteria: (i) they are co-expressed with genes known to be involved in the biological mechanism of interest, (ii) they are also differentially expressed between distinct phenotypes relevant to the study, and (iii) as a biomodule, they correlate with both the mechanism and the phenotype. Results: This proof-of-concept study, which focuses on epigenetic mechanisms, was conducted in a well-studied set of 132 acute lymphoblastic leukemia (ALL) microarrays annotated with nine distinct phenotypes and three measures of response to therapy. We used established parametric and non-parametric statistics to derive the PGnet tripartite network, which consisted of 10 phenotypes and 33 significant clusters of GEMs comprising 535 distinct genes. The significance of PGnet was estimated from empirical p-values, and a robust subnetwork derived from ALL outcome data was produced by repeated random sampling. The ability of the derived robust network to predict outcome (relapse of ALL) was significant (p = 3%), using one hundred three-fold cross-validations and the shrunken centroids classifier. Conclusion: To our knowledge, this is the first method to predict co-expression networks of genes associated with epigenetic mechanisms and to demonstrate its inherent capability to predict therapeutic outcome. 
This PGnet approach can be applied to any regulatory mechanisms including transcriptional or microRNA regulation in order to derive predictive molecular profiles that are mechanistically anchored. The implementation of PGnet in R is freely available at http://Lussierlab.org/publication/PGnet. [ABSTRACT FROM AUTHOR]
- Published
- 2009
- Full Text
- View/download PDF
27. Assessing batch effects of genotype calling algorithm BRLMM for the Affymetrix GeneChip Human Mapping 500 K array set using 270 HapMap samples.
- Author
-
Huixiao Hong, Zhenqiang Su, Weigong Ge, Leming Shi, Perkins, Roger, Hong Fang, Xu, Joshua, Chen, James J., Tao Han, Kaput, Jim, Fuscoe, James C., and Tong, Weida
- Subjects
GENOMES ,COMPUTER algorithms ,HUMAN gene mapping ,HUMAN genome ,GENETICS ,GENOTYPE-environment interaction ,PHENOTYPES ,GENETIC polymorphisms - Abstract
Background: Genome-wide association studies (GWAS) aim to identify genetic variants (usually single nucleotide polymorphisms [SNPs]) across the entire human genome that are associated with phenotypic traits such as disease status and drug response. Highly accurate and reproducible genotype calling is paramount, since errors introduced by calling algorithms can lead to inflation of false associations between genotype and phenotype. Most genotype calling algorithms currently used for GWAS are based on multiple arrays. Because hundreds of gigabytes (GB) of raw data are generated from a GWAS, the samples are typically partitioned into batches containing subsets of the entire dataset for genotype calling. High call rates and accuracies have been achieved. However, the effects of batch size (i.e., number of chips analyzed together) and of batch composition (i.e., the choice of chips in a batch) on call rate and accuracy, as well as the propagation of these effects into the significantly associated SNPs identified, have not been investigated. In this paper, we analyzed both batch size and batch composition for effects on the genotype calling algorithm BRLMM using raw data of 270 HapMap samples analyzed with the Affymetrix Human Mapping 500 K array set. Results: Using data from 270 HapMap samples interrogated with the Affymetrix Human Mapping 500 K array set, three different batch sizes and three different batch compositions were used for genotyping with the BRLMM algorithm. Comparative analysis of the calling results and the corresponding lists of significant SNPs identified through association analysis revealed that both batch size and composition affected genotype calling results and significantly associated SNPs. Batch size and batch composition effects were more severe on samples and SNPs with lower call rates than on those with higher call rates, and on heterozygous genotype calls compared to homozygous genotype calls. 
Conclusion: Batch size and composition affect the genotype calling results in GWAS using BRLMM. The larger the differences in batch sizes, the larger the effect. The more homogenous the samples in the batches, the more consistent the genotype calls. The inconsistency propagates to the lists of significantly associated SNPs identified in downstream association analysis. Thus, uniform and large batch sizes should be used to make genotype calls for GWAS. In addition, samples of high homogeneity should be placed into the same batch. [ABSTRACT FROM AUTHOR]
- Published
- 2008
- Full Text
- View/download PDF
28. Reproducibility of microarray data: a further analysis of microarray quality control (MAQC) data.
- Author
-
Chen, James J., Huey-Miin Hsueh, Delongchamp, Robert R., Chien-Ju Lin, and Chen-An Tsai
- Subjects
- *
DNA microarrays , *GENE expression , *GENES , *PROTEIN folding , *STATISTICAL sampling , *GENETIC regulation - Abstract
Background: Many researchers are concerned with the comparability and reliability of microarray gene expression data. Recent completion of the MicroArray Quality Control (MAQC) project provides a unique opportunity to assess reproducibility across multiple sites and comparability across multiple platforms. The MAQC analysis presented to support the conclusion of inter- and intra-platform comparability/reproducibility of microarray gene expression measurements is inadequate. We evaluate the reproducibility/comparability of the MAQC data for 12901 common genes in four titration samples generated from five high-density one-color microarray platforms and the TaqMan technology. We discuss some of the problems with the use of the correlation coefficient as a metric to evaluate inter- and intra-platform reproducibility, and of the percent of overlapping genes (POG) as a measure for evaluation of a gene selection procedure, by MAQC. Results: A total of 293 arrays were used in the intra- and inter-platform analysis. A hierarchical cluster analysis shows distinct differences in the measured intensities among the five platforms. A number of genes show a small fold-change in one platform and a large fold-change in another platform, even though the correlations between platforms are high. An analysis of variance shows that thirty percent of gene expression measurements in the samples show inconsistent patterns across the five platforms. We illustrate that POG does not reflect the accuracy of a selected gene list. A non-overlapping gene can be truly differentially expressed with a stringent cutoff, and an overlapping gene can be non-differentially expressed with a non-stringent cutoff. In addition, POG is an unusable selection criterion. POG can increase or decrease irregularly as the cutoff changes; there is no criterion to determine a cutoff so that POG is optimized. 
Conclusion: Using various statistical methods, we demonstrate that there are differences in the intensities measured by different platforms and by different sites within a platform. Within each platform, the patterns of expression are generally consistent, but there is site-by-site variability. Evaluation of data analysis methods for use in regulatory decisions should take the no-treatment-effect case into consideration: when there is no treatment effect, "a fold-change cutoff with a non-stringent p-value cutoff" could result in a 100% false-positive selection error. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
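The POG behavior criticized above is easy to reproduce: with hypothetical top-gene rankings from two platforms, POG moves irregularly as the cutoff changes, so no cutoff choice optimizes it. A minimal sketch (the rankings are invented for illustration):

```python
def pog(ranked_a, ranked_b, cutoff):
    """Percent of overlapping genes between the top-`cutoff` entries
    of two ranked gene lists."""
    top_a = set(ranked_a[:cutoff])
    top_b = set(ranked_b[:cutoff])
    return 100.0 * len(top_a & top_b) / cutoff

# Hypothetical rankings of the same six genes on two platforms.
plat_1 = ["g1", "g2", "g3", "g4", "g5", "g6"]
plat_2 = ["g2", "g1", "g6", "g5", "g4", "g3"]

# POG as the cutoff grows from 1 to 6: it jumps up and down rather
# than changing monotonically.
curve = [pog(plat_1, plat_2, k) for k in range(1, 7)]
```

Here the curve runs 0%, 100%, 66.7%, 50%, 80%, 100%, illustrating why a POG-optimizing cutoff is not well defined.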
29. Microarray scanner calibration curves: characteristics and implications.
- Author
-
Shi, Leming, Tong, Weida, Su, Zhenqiang, Han, Tao, Han, Jing, Puri, Raj K., Fang, Hong, Frueh, Felix W., Goodsaid, Federico M., Guo, Lei, Branham, William S., Chen, James J., Xu, Z. Alex, Harris, Stephen C., Hong, Huixiao, Xie, Qian, Perkins, Roger G., and Fuscoe, James C.
- Subjects
MESSENGER RNA ,CALIBRATION ,STANDARDIZATION ,PROTEIN microarrays ,SCANNING systems - Abstract
Background: Microarray-based measurement of mRNA abundance assumes a linear relationship between the fluorescence intensity and the dye concentration. In reality, however, the calibration curve can be nonlinear. Results: By scanning a microarray scanner calibration slide containing known concentrations of fluorescent dyes under 18 PMT gains, we were able to evaluate the differences in calibration characteristics of Cy5 and Cy3. First, the calibration curve for the same dye under the same PMT gain is nonlinear at both the high and low intensity ends. Second, the degree of nonlinearity of the calibration curve depends on the PMT gain. Third, the two PMTs (for Cy5 and Cy3) behave differently even under the same gain. Fourth, the background intensity for the Cy3 channel is higher than that for the Cy5 channel. The impact of such characteristics on the accuracy and reproducibility of measured mRNA abundance and the calculated ratios was demonstrated. Combined with simulation results, we provided explanations for the existence of ratio underestimation, intensity-dependence of ratio bias, and anti-correlation of ratios in dye-swap replicates. We further demonstrated that although Lowess normalization effectively eliminates the intensity-dependence of ratio bias, the systematic deviation from true ratios largely remained. A method of calculating ratios based on concentrations estimated from the calibration curves was proposed for correcting ratio bias. Conclusion: It is preferable to scan microarray slides at fixed, optimal gain settings under which the linearity between concentration and intensity is maximized. Although normalization methods improve reproducibility of microarray measurements, they appear less effective in improving accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
30. Cross-platform comparability of microarray technology: Intra-platform consistency and appropriate data analysis procedures are essential.
- Author
-
Shi, Leming, Tong, Weida, Fang, Hong, Scherf, Uwe, Han, Jing, Puri, Raj K., Frueh, Felix W., Goodsaid, Federico M., Guo, Lei, Su, Zhenqiang, Han, Tao, Fuscoe, James C., Xu, Z. Alex, Patterson, Tucker A., Hong, Huixiao, Xie, Qian, Perkins, Roger G., Chen, James J., and Casciano, Daniel A.
- Subjects
DNA microarrays ,DATA analysis ,RNA ,NUCLEIC acids ,GENES - Abstract
Background: The acceptance of microarray technology in regulatory decision-making is being challenged by the existence of various platforms and data analysis methods. A recent report (E. Marshall, Science, 306, 630-631, 2004), by extensively citing the study of Tan et al. (Nucleic Acids Res., 31, 5676-5684, 2003), portrays a disturbingly negative picture of the cross-platform comparability, and, hence, the reliability of microarray technology. Results: We reanalyzed Tan's dataset and found that the intra-platform consistency was low, indicating a problem in the experimental procedures from which the dataset was generated. Furthermore, by using three gene selection methods (i.e., p-value ranking, fold-change ranking, and Significance Analysis of Microarrays (SAM)) on the same dataset, we found that p-value ranking (the method emphasized by Tan et al.) results in much lower cross-platform concordance compared to fold-change ranking or SAM. Therefore, the low cross-platform concordance reported in Tan's study appears to be mainly due to a combination of low intra-platform consistency and a poor choice of data analysis procedures, instead of inherent technical differences among platforms, as suggested by Tan et al. and Marshall. Conclusion: Our results illustrate the importance of establishing calibrated RNA samples and reference datasets to objectively assess the performance of different microarray platforms and the proficiency of individual laboratories, as well as the merits of various data analysis procedures. Thus, we are progressively coordinating the MAQC project, a community-wide effort for microarray quality control. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
31. Erratum to: A novel procedure on next generation sequencing data analysis using text mining algorithm.
- Author
-
Zhao W, Chen JJ, Perkins R, Wang Y, Liu Z, Hong H, Tong W, and Zou W
- Published
- 2016
- Full Text
- View/download PDF
32. A novel procedure on next generation sequencing data analysis using text mining algorithm.
- Author
-
Zhao W, Chen JJ, Perkins R, Wang Y, Liu Z, Hong H, Tong W, and Zou W
- Subjects
- Biomarkers analysis, Cluster Analysis, Models, Theoretical, Polymorphism, Single Nucleotide genetics, Salmonella classification, Salmonella genetics, Serotyping, Algorithms, Data Mining methods, High-Throughput Nucleotide Sequencing methods
- Abstract
Background: Next-generation sequencing (NGS) technologies have provided researchers with vast possibilities in various biological and biomedical research areas. Efficient data mining strategies are in high demand for large-scale comparative and evolutionary studies to be performed on the large amounts of data derived from NGS projects. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining., Methods: We report a novel procedure to analyse NGS data using topic modeling. It consists of four major procedures: NGS data retrieval, preprocessing, topic modeling, and data mining using Latent Dirichlet Allocation (LDA) topic outputs. An NGS data set of Salmonella enterica strains was used as a case study to show the workflow of this procedure. The perplexity measurement of the topic numbers and the convergence efficiencies of Gibbs sampling were calculated and discussed for achieving the best result from the proposed procedure., Results: The output topics of the LDA algorithm could be treated as features of Salmonella strains to accurately describe the genetic diversity of the fliC gene in various serotypes. The results of a two-way hierarchical clustering and data matrix analysis on LDA-derived matrices successfully classified Salmonella serotypes based on the NGS data., Conclusion: The implementation of topic modeling in NGS data analysis provides a new way to elucidate genetic information from NGS data and identify gene-phenotype relationships and biomarkers, especially in the era of biological and medical big data.
- Published
- 2016
- Full Text
- View/download PDF
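The preprocessing step of such a procedure, turning sequence data into the bag-of-words input a topic model expects, can be sketched with k-mer tokenization. Treating overlapping k-mers as the "words" of each sequence "document" is a common convention and an assumption here, not necessarily the paper's exact scheme:

```python
from collections import Counter

def kmer_tokens(sequence, k=4):
    """Tokenize a DNA sequence into overlapping k-mers, the 'words'
    that a topic model such as LDA would consume."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def to_document_term_matrix(sequences, k=4):
    """Bag-of-words counts per sequence: the standard document-term
    input format for topic modeling."""
    docs = [Counter(kmer_tokens(s, k)) for s in sequences]
    vocab = sorted(set().union(*docs))
    return vocab, [[doc[w] for w in vocab] for doc in docs]

# Two toy reads standing in for NGS sequences of the fliC gene.
reads = ["ACGTACGT", "TTACGTAA"]
vocab, dtm = to_document_term_matrix(reads, k=4)
```

The resulting matrix would then be passed to an LDA implementation, with topic number chosen by perplexity as the abstract describes.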
33. Text mining for identifying topics in the literatures about adolescent substance use and depression.
- Author
-
Wang SH, Ding Y, Zhao W, Huang YH, Perkins R, Zou W, and Chen JJ
- Subjects
- Adolescent, Adolescent Behavior, Humans, Data Mining methods, Depression epidemiology, Substance-Related Disorders epidemiology
- Abstract
Background: Both adolescent substance use and adolescent depression are major public health problems, and they have a tendency to co-occur. Thousands of articles on adolescent substance use or depression have been published. It is labor-intensive and time-consuming to extract huge amounts of information from the accumulated collections. Topic modeling offers a computational tool to find relevant topics by capturing meaningful structure among collections of documents., Methods: In this study, a total of 17,723 abstracts from PubMed published from 2000 to 2014 on adolescent substance use and depression were downloaded as objects, and Latent Dirichlet allocation (LDA) was applied to perform text mining on the dataset. Word clouds were used to visually display the content of topics and demonstrate the distribution of vocabularies over each topic., Results: The LDA topics recaptured the search keywords in PubMed and further uncovered relevant issues, such as intervention programs, links between adolescent substance use and adolescent depression (such as sexual experience and violence), and risk factors for adolescent substance use, such as family factors and peer networks. Using trend analysis to explore the dynamics of topic proportions, we found that brain research emerged as a hot issue, as indicated by the trend test coefficient., Conclusions: Topic modeling has the ability to segregate a large collection of articles into distinct themes, and it could be used as a tool to understand the literature, not only by recapturing known facts but also by discovering other relevant topics.
- Published
- 2016
- Full Text
- View/download PDF
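The trend analysis of topic proportions can be sketched as an ordinary least-squares slope of a topic's yearly share: a clearly positive slope flags a growing ("hot") topic. The abstract does not give the paper's exact trend test statistic, so this is an illustrative stand-in.

```python
def trend_slope(years, proportions):
    """Ordinary least-squares slope of a topic's proportion over publication
    year; a positive slope indicates a growing topic."""
    n = len(years)
    my = sum(years) / n
    mp = sum(proportions) / n
    sxy = sum((y - my) * (p - mp) for y, p in zip(years, proportions))
    sxx = sum((y - my) ** 2 for y in years)
    return sxy / sxx

# Toy example: a topic whose share of abstracts grows from 2000 to 2014.
years = list(range(2000, 2015))
share = [0.01 + 0.002 * (y - 2000) for y in years]
slope = trend_slope(years, share)
```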
34. Subgroup identification for treatment selection in biomarker adaptive design.
- Author
-
Lu TP and Chen JJ
- Subjects
- Adenocarcinoma of Lung, Algorithms, Computer Simulation, Disease-Free Survival, Endpoint Determination, Humans, Models, Statistical, Patient Selection, Adenocarcinoma diagnosis, Biomarkers, Tumor analysis, Lung Neoplasms diagnosis, Precision Medicine methods, Research Design
- Abstract
Background: Advances in molecular technology have shifted new drug development toward targeted therapy for treatments expected to benefit subpopulations of patients. Adaptive signature design (ASD) has been proposed to identify the most suitable target patient subgroup to enhance efficacy of treatment effect. There are two essential aspects in the development of biomarker adaptive designs: 1) an accurate classifier to identify the most appropriate treatment for patients, and 2) statistical tests to detect treatment effect in the relevant population and subpopulations. We propose utilization of classification methods to identify patient subgroups and present a statistical testing strategy to detect treatment effects., Methods: Diagonal linear discriminant analysis (DLDA) is used to identify targeted and non-targeted subgroups. For binary endpoints, DLDA is directly applied to classify patients into two subgroups; for continuous endpoints, a two-step procedure involving model fitting and determination of a cutoff point is used for subgroup classification. The proposed strategy includes tests for treatment effect in all patients and in a marker-positive subgroup, with a possible follow-up estimation of treatment effect in the marker-negative subgroup. The proposed method is compared to the ASD classification method using simulated datasets and two publicly available cancer datasets., Results: The DLDA-based classifier performs well in terms of sensitivity, specificity, positive and negative predictive values, and accuracy in the simulation data and the two cancer datasets, with superior accuracy compared to the ASD method. The subgroup testing strategy is shown to be useful in detecting treatment effect in terms of power and control of study-wise error., Conclusion: Accuracy of a classifier is essential for adaptive designs.
A poor classifier not only assigns patients to inappropriate treatments, but also reduces the power of the test, resulting in incorrect conclusions. The proposed procedure provides an effective approach for subgroup identification and subgroup analysis.
- Published
- 2015
- Full Text
- View/download PDF
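A minimal sketch of the DLDA classifier described in the Methods: per-class feature means plus a pooled per-feature variance, ignoring off-diagonal covariances (which is what keeps DLDA stable when features outnumber samples). The toy data and subgroup labels are invented for illustration.

```python
def dlda_fit(X, y):
    """Fit diagonal linear discriminant analysis: per-class feature means
    plus one pooled within-class variance per feature."""
    classes = sorted(set(y))
    p = len(X[0])
    means = {}
    for c in classes:
        rows = [x for x, lab in zip(X, y) if lab == c]
        means[c] = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
    n = len(X)
    var = [sum((x[j] - means[lab][j]) ** 2 for x, lab in zip(X, y)) / (n - len(classes))
           for j in range(p)]
    return means, var

def dlda_predict(x, means, var):
    """Assign x to the class whose mean is nearest in variance-scaled distance."""
    return min(means, key=lambda c: sum((x[j] - means[c][j]) ** 2 / var[j]
                                        for j in range(len(x))))

# Toy 'binary endpoint' example: two features, two subgroups.
means, var = dlda_fit([[0, 0], [1, 1], [4, 0], [5, 1]], ["neg", "neg", "pos", "pos"])
```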
35. Disc degeneration implies low back pain.
- Author
-
Zheng CJ and Chen J
- Subjects
- Adolescent, Adult, Aged, Humans, Hungary epidemiology, Low Back Pain epidemiology, Magnetic Resonance Imaging, Middle Aged, Models, Biological, Prevalence, Young Adult, Intervertebral Disc Degeneration complications, Low Back Pain etiology
- Abstract
Background: Low back pain exerts a tremendous burden on individual patients and society due to its prevalence and ability to cause long-term disability. Contemporary treatment and prevention efforts are stymied by the absence of a confirmed cause for the majority of low back pain patients., Methods: A system dynamics approach is used to build a physiologically based model investigating the relationship between disc degeneration and low back pain. The model's predictions are evaluated under two different types of study designs and compared with established observations on low back pain., Results: A three-compartment model (no disc degeneration, disc degeneration with pain remission, disc degeneration with pain recurrence) accurately predicts the age-specific prevalence observed in one of the largest population-based surveys (R² = 0.998). The estimated transition age at which intervertebral discs lose their growth potential and begin degenerating is 13.3 years. The estimated disc degeneration rate is 0.0344/year. Without any additional change to parameter values, the model also fully accounts for the age-specific prevalence of disc degeneration detected with lumbar MRI among asymptomatic individuals (R² = 0.978)., Conclusions: Dual testing of the proposed mechanistic model with two independent data sources (one with lumbar MRI and the other without) confirms that disc degeneration is the driving force behind, and cause of, age dependence in low back pain. The observed complexity of low back pain epidemiology arises from the slow dynamics of disc degeneration coupled with the fast dynamics of disease recurrence.
- Published
- 2015
- Full Text
- View/download PDF
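The three-compartment model can be sketched with forward-Euler integration. Only the transition age (13.3 years) and the degeneration rate (0.0344/year) come from the abstract; the pain recurrence and remission rates below are illustrative placeholders for the "fast dynamics", not the paper's fitted values.

```python
def simulate(age_max=80.0, dt=0.1, t0=13.3, k_deg=0.0344, k_pain=0.5, k_remit=1.5):
    """Euler integration of the three-compartment model:
    N (no degeneration) -> R (degenerated, pain in remission) <-> P (pain).
    t0 and k_deg are the paper's estimates; k_pain and k_remit are
    illustrative placeholders."""
    N, R, P = 1.0, 0.0, 0.0
    prevalence = {}
    steps = int(round(age_max / dt))
    for i in range(steps + 1):
        t = i * dt
        prevalence[round(t, 1)] = P        # point prevalence of low back pain
        if t >= t0:
            flow = k_deg * N * dt          # slow: discs begin degenerating
            N -= flow
            R += flow
        recur = k_pain * R * dt            # fast: remission -> pain recurrence
        remit = k_remit * P * dt           # fast: pain -> remission
        R += remit - recur
        P += recur - remit
    return prevalence

prev = simulate()
```

Fitting the resulting age-prevalence curve to survey data is how the reported R² values would be obtained.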
36. A heuristic approach to determine an appropriate number of topics in topic modeling.
- Author
-
Zhao W, Chen JJ, Perkins R, Liu Z, Ge W, Ding Y, and Zou W
- Subjects
- Databases, Factual, High-Throughput Nucleotide Sequencing, Computational Biology methods, Data Mining methods, Heuristics physiology
- Abstract
Background: Topic modelling is an active research field in machine learning. While mainly used to build models from unstructured textual data, it offers an effective means of data mining where samples represent documents and different biological endpoints or omics data represent words. Latent Dirichlet Allocation (LDA) is the most commonly used topic modelling method across a wide number of technical fields. However, model development can be arduous and tedious, and requires burdensome, systematic sensitivity studies in order to find the best set of model parameters. Often, time-consuming subjective evaluations are needed to compare models. Research has so far yielded no easy way to choose the proper number of topics in a model short of a full iterative search., Methods and Results: Based on analysis of the variation of statistical perplexity during topic modelling, a heuristic approach is proposed in this study to estimate the most appropriate number of topics. Specifically, the rate of perplexity change (RPC) as a function of the number of topics is proposed as a suitable selector. We test the stability and effectiveness of the proposed method on three markedly different types of ground-truth datasets: Salmonella next-generation sequencing, pharmacological side effects, and textual abstracts on computational biology and bioinformatics (TCBB) from PubMed., Conclusion: The proposed RPC-based method is demonstrated to choose the best number of topics in three numerical experiments of widely different data types, and for databases of very different sizes. The work required was markedly less arduous than if full systematic sensitivity studies had been carried out with the number of topics as a parameter. We understand that additional investigation is needed to substantiate the method's theoretical basis and to establish its generalizability in terms of dataset characteristics.
- Published
- 2015
- Full Text
- View/download PDF
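The RPC measure itself follows directly from the abstract: the change in perplexity between successive candidate topic numbers, divided by the step size. The stopping rule in `pick_topic_number` is an illustrative assumption, since the abstract does not state the paper's exact decision criterion.

```python
def rpc(topic_counts, perplexities):
    """Rate of perplexity change between successive candidate topic numbers:
    |P_i - P_{i-1}| / (t_i - t_{i-1})."""
    return [abs(perplexities[i] - perplexities[i - 1]) / (topic_counts[i] - topic_counts[i - 1])
            for i in range(1, len(topic_counts))]

def pick_topic_number(topic_counts, perplexities):
    """Illustrative selection rule (an assumption, not the paper's exact rule):
    stop at the last candidate before the RPC curve stops decreasing."""
    rates = rpc(topic_counts, perplexities)
    for i in range(1, len(rates)):
        if rates[i] >= rates[i - 1]:
            return topic_counts[i]
    return topic_counts[-1]

counts = [5, 10, 15, 20, 25]
perp = [900.0, 700.0, 650.0, 640.0, 638.0]   # hypothetical perplexity curve
rates = rpc(counts, perp)
```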
37. Applying genome-wide gene-based expression quantitative trait locus mapping to study population ancestry and pharmacogenetics.
- Author
-
Yang HC, Lin CW, Chen CW, and Chen JJ
- Subjects
- Humans, Polymorphism, Single Nucleotide, Genome, Pharmacogenetics, Quantitative Trait Loci
- Abstract
Background: Gene-based analysis has become popular in genomic research because of its appealing biological and statistical properties compared with those of a single-locus analysis. However, few studies, if any, have discussed mapping of expression quantitative trait loci (eQTL) in a gene-based framework, and none have discussed ancestry-informative eQTL or investigated their roles in pharmacogenetics by integrating single nucleotide polymorphism (SNP)-based eQTL (s-eQTL) and gene-based eQTL (g-eQTL)., Results: In this g-eQTL mapping study, the transcript expression levels of genes (transcript-level genes; T-genes) were correlated with the SNPs of genes (sequence-level genes; S-genes) by using a method of gene-based partial least squares (PLS). Ancestry-informative transcripts were identified using a rank-score-based multivariate association test, and ancestry-informative eQTL were identified using Fisher's exact test. Furthermore, key ancestry-predictive eQTL were selected in a flexible discriminant analysis. We analyzed SNPs and gene expression of 210 independent people of African, Asian, and European descent. We identified numerous cis- and trans-acting g-eQTL and s-eQTL for each population by using PLS, and we observed that ancestry information was enriched in eQTL. Furthermore, we identified 2 ancestry-informative eQTL associated with adverse drug reactions and/or drug response. Rs1045642, located on MDR1, is an ancestry-informative eQTL (P = 2.13E-13, Fisher's exact test) associated with adverse drug reactions to amitriptyline and nortriptyline and with drug responses to morphine. Rs20455, located in KIF6, is an ancestry-informative eQTL (P = 2.76E-23, Fisher's exact test) associated with response to statin drugs (e.g., pravastatin and atorvastatin). Ancestry-informative eQTL of drug biotransformation genes were also observed; cross-population cis-acting expression regulators included SPG7, TAP2, SLC7A7, and CYP4F2.
Finally, we also identified key ancestry-predictive eQTL and established classification models with promising training and testing accuracies in separating samples from close populations., Conclusions: In summary, we developed a gene-based PLS procedure and a SAS macro for identifying g-eQTL and s-eQTL. We established data archives of eQTL for global populations. The program and data archives are accessible at http://www.stat.sinica.edu.tw/hsinchou/genetics/eQTL/HapMapII.htm. Finally, the results from our investigations regarding the interrelationship between eQTL, ancestry information, and pharmacodynamics provide rich resources for future eQTL studies and practical applications in population genetics and medical genetics.
- Published
- 2014
- Full Text
- View/download PDF
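A minimal sketch of the one-sided Fisher's exact test used to flag ancestry-informative eQTL, built from the hypergeometric distribution with `math.comb`. The 2x2 layout (e.g. allele carriers by population) is an assumed illustration; the paper's actual tables are not given in the abstract.

```python
from math import comb

def fisher_exact_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    hypergeometric probability of a count at least as large as `a`
    (the enrichment direction)."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)
    return sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, min(row1, col1) + 1)) / denom

# Hypothetical table: carriers vs. non-carriers split across two populations.
p = fisher_exact_one_sided(3, 1, 1, 3)
```

For the observed counts this gives the classic "lady tasting tea" value of 17/70.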
38. Topic modeling for cluster analysis of large biological and medical datasets.
- Author
-
Zhao W, Zou W, and Chen JJ
- Subjects
- Algorithms, Breast Neoplasms classification, Breast Neoplasms mortality, Cluster Analysis, Electrophoresis, Gel, Pulsed-Field, Female, Humans, Lung Neoplasms classification, Models, Statistical, Salmonella classification, Salmonella isolation & purification, Survival Analysis, Data Mining methods
- Abstract
Background: The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracy and effectiveness of traditional clustering methods diminish for large, high-dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or for overcoming clustering difficulties in large biological and medical datasets., Results: In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: a Salmonella pulsed-field gel electrophoresis (PFGE) dataset, a lung cancer dataset, and a breast cancer dataset, which represent various types of large biological or medical datasets. All three methods are shown to improve the effectiveness of clustering on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths., Conclusion: Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types.
The resulting clusters represent true groupings and subgroupings in the data more faithfully than those of traditional methods, suggesting that topic model-based methods could provide an analytic advancement in the analysis of large biological or medical datasets.
- Published
- 2014
- Full Text
- View/download PDF
39. Identification of reproducible gene expression signatures in lung adenocarcinoma.
- Author
-
Lu TP, Chuang EY, and Chen JJ
- Subjects
- Adenocarcinoma of Lung, Aged, Female, Gene Expression Regulation, Neoplastic, Gene Library, Humans, Lung Neoplasms drug therapy, Lung Neoplasms pathology, Male, Middle Aged, Proportional Hazards Models, Reproducibility of Results, Survival Rate, Transcriptome, Adenocarcinoma genetics, Lung Neoplasms genetics
- Abstract
Background: Lung cancer is the leading cause of cancer-related death worldwide. Tremendous research efforts have been devoted to improving treatment procedures, but the average five-year overall survival rates are still less than 20%. Many biomarkers have been identified for predicting survival; challenges arise, however, in translating the findings into clinical practice due to their inconsistency and irreproducibility. In this study, we proposed an approach by identifying predictive genes through pathways., Results: The microarrays from Shedden et al. were used as the training set, and the log-rank test was performed to select potential signature genes. We focused on 24 cancer-related pathways from 4 biological databases. A scoring scheme was developed by the Cox hazard regression model, and patients were divided into two groups based on the medians. Subsequently, their predictability and generalizability were evaluated by the 2-fold cross-validation and a resampling test in 4 independent datasets, respectively. A set of 16 genes related to apoptosis execution was demonstrated to have good predictability as well as generalizability in more than 700 lung adenocarcinoma patients and was reproducible in 4 independent datasets. This signature set was shown to have superior performances compared to 6 other published signatures. Furthermore, the corresponding risk scores derived from the set were found to associate with the efficacy of the anti-cancer drug ZD-6474 targeting EGFR., Conclusions: In summary, we presented a new approach to identify reproducible survival predictors for lung adenocarcinoma, and the identified genes may serve as both prognostic and predictive biomarkers in the future.
- Published
- 2013
- Full Text
- View/download PDF
40. Data mining tools for Salmonella characterization: application to gel-based fingerprinting analysis.
- Author
-
Zou W, Tang H, Zhao W, Meehan J, Foley SL, Lin WJ, Chen HC, Fang H, Nayak R, and Chen JJ
- Subjects
- Cluster Analysis, Data Mining, Databases, Genetic, Humans, Salmonella chemistry, Salmonella genetics, Serotyping, Computational Biology methods, Electrophoresis, Gel, Pulsed-Field methods, Salmonella classification
- Abstract
Background: Pulsed-field gel electrophoresis (PFGE) is currently the most widely and routinely used method by the Centers for Disease Control and Prevention (CDC) and state health labs in the United States for Salmonella surveillance and outbreak tracking. Major drawbacks of commercially available PFGE analysis programs have been their difficulty in dealing with large datasets and the limited availability of analysis tools. There exists a need to develop new analytical tools for PFGE data mining in order to make full use of valuable data in large surveillance databases., Results: In this study, a software package was developed that implements five bioinformatics approaches for the analysis and visualization of PFGE fingerprints: PFGE band standardization, Salmonella serotype prediction, hierarchical cluster analysis, distance matrix analysis, and two-way hierarchical cluster analysis. PFGE band standardization makes cross-group analysis of large datasets possible. The Salmonella serotype prediction approach allows users to predict serotypes of Salmonella isolates based on their PFGE patterns. The hierarchical cluster analysis approach can be used to clarify subtypes and phylogenetic relationships among groups of PFGE patterns. The distance matrix and two-way hierarchical cluster analysis tools allow users to directly visualize the similarities/dissimilarities of any two individual patterns and the inter- and intra-serotype relationships of two or more serotypes, and they provide a summary of the overall relationships between user-selected serotypes as well as the distinguishing band markers of these serotypes. The functionalities of these tools were illustrated on PFGE fingerprinting data from PulseNet of the CDC., Conclusions: The bioinformatics approaches included in the software package developed in this study were integrated with the PFGE database to enhance the data mining of PFGE fingerprints.
Fast and accurate prediction makes it possible to elucidate Salmonella serotype information before conventional serological methods are pursued. The development of bioinformatics tools to distinguish the PFGE markers and serotype specific patterns will enhance PFGE data retrieval, interpretation and serotype identification and will likely accelerate source tracking to identify the Salmonella isolates implicated in foodborne diseases.
- Published
- 2013
- Full Text
- View/download PDF
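The serotype-prediction and distance-matrix ideas can be sketched as nearest-reference matching on sets of standardized band positions. The Dice-style distance and the reference patterns below are illustrative assumptions, not the package's actual implementation.

```python
def dice_distance(p1, p2):
    """Distance between two PFGE band patterns (sets of standardized band
    positions): 1 minus the Dice similarity, a common choice for gel fingerprints."""
    shared = len(p1 & p2)
    return 1.0 - 2.0 * shared / (len(p1) + len(p2))

def predict_serotype(pattern, references):
    """Nearest-reference serotype call: assign the serotype whose reference
    pattern minimizes the band distance (a simplified stand-in for the
    package's prediction approach)."""
    return min(references, key=lambda s: dice_distance(pattern, references[s]))

# Hypothetical standardized band positions for two serotypes.
refs = {
    "Typhimurium": {40, 90, 150, 310, 500},
    "Enteritidis": {60, 120, 250, 400},
}
call = predict_serotype({40, 90, 150, 300, 500}, refs)
```

Computing `dice_distance` for every pair of patterns yields exactly the distance matrix that the two-way hierarchical clustering tools visualize.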
41. Power and sample size estimation in microarray studies.
- Author
-
Lin WJ, Hsueh HM, and Chen JJ
- Subjects
- False Positive Reactions, Gene Expression Profiling methods, Sample Size, Computational Biology methods, Oligonucleotide Array Sequence Analysis statistics & numerical data
- Abstract
Background: Before conducting a microarray experiment, one important issue that needs to be determined is the number of arrays required in order to have adequate power to identify differentially expressed genes. This paper discusses some crucial issues in the problem formulation, parameter specifications, and approaches that are commonly proposed for sample size estimation in microarray experiments. Common methods for sample size estimation are formulated as the minimum sample size necessary to achieve a specified sensitivity (proportion of detected truly differentially expressed genes) on average at a specified false discovery rate (FDR) level and a specified expected proportion (pi1) of truly differentially expressed genes in the array. Unfortunately, the probability of detecting the specified sensitivity in such a formulation can be low. We formulate the sample size problem as the number of arrays needed to achieve a specified sensitivity with 95% probability at the specified significance level. A permutation method using a small pilot dataset to estimate sample size is proposed. This method accounts for correlation and effect size heterogeneity among genes., Results: A sample size estimate based on the common formulation, to achieve the desired sensitivity on average, can be calculated using a univariate method without taking the correlation among genes into consideration. This formulation of the sample size problem is inadequate because the probability of detecting the specified sensitivity can be lower than 50%. On the other hand, the sample size calculated by the proposed permutation method will ensure detecting at least the desired sensitivity with 95% probability. The method is shown to perform well for a real example dataset using a small pilot dataset with 4-6 samples per group., Conclusions: We recommend that the sample size problem should be formulated to detect a specified proportion of differentially expressed genes with 95% probability.
This formulation ensures finding the desired proportion of true positives with high probability. The proposed permutation method takes the correlation structure and effect size heterogeneity into consideration and works well using only a small pilot dataset.
- Published
- 2010
- Full Text
- View/download PDF
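The target quantity in this formulation, sensitivity at a specified FDR level, can be illustrated with the Benjamini-Hochberg step-up procedure. This is a sketch of the quantity being controlled, not of the proposed permutation method itself; the p-values and truth labels are invented.

```python
def bh_reject(pvalues, q=0.05):
    """Benjamini-Hochberg step-up procedure: indices of hypotheses rejected
    at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= q * rank / m:
            k = rank
    return set(order[:k])

def sensitivity(pvalues, truly_de, q=0.05):
    """Realized sensitivity: the fraction of truly differentially expressed
    genes among the BH rejections -- what the sample-size formulation asks
    to achieve with 95% probability."""
    rejected = bh_reject(pvalues, q)
    return len(rejected & truly_de) / len(truly_de)
```

A permutation approach would repeat this computation over resampled pilot data to estimate the probability of reaching the desired sensitivity at a given sample size.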
42. Mechanism-anchored profiling derived from epigenetic networks predicts outcome in acute lymphoblastic leukemia.
- Author
-
Yang X, Huang Y, Chen JL, Xie J, Sun X, and Lussier YA
- Subjects
- Algorithms, Gene Expression Profiling, Oligonucleotide Array Sequence Analysis methods, Epigenesis, Genetic genetics, Gene Regulatory Networks genetics, Precursor Cell Lymphoblastic Leukemia-Lymphoma genetics
- Abstract
Background: Current outcome predictors based on "molecular profiling" rely on gene lists selected without consideration for their molecular mechanisms. This study was designed to demonstrate that we could learn about genes related to a specific mechanism and further use this knowledge to predict outcome in patients - a paradigm shift towards accurate "mechanism-anchored profiling". We propose a novel algorithm, PGnet, which predicts a tripartite mechanism-anchored network associated with epigenetic regulation, consisting of phenotypes, genes and mechanisms. Genes termed GEMs in this network meet all of the following criteria: (i) they are co-expressed with genes known to be involved in the biological mechanism of interest, (ii) they are also differentially expressed between distinct phenotypes relevant to the study, and (iii) as a biomodule, they correlate with both the mechanism and the phenotype., Results: This proof-of-concept study, which focuses on epigenetic mechanisms, was conducted in a well-studied set of 132 acute lymphoblastic leukemia (ALL) microarrays annotated with nine distinct phenotypes and three measures of response to therapy. We used established parametric and nonparametric statistics to derive the PGnet tripartite network, which consisted of 10 phenotypes and 33 significant clusters of GEMs comprising 535 distinct genes. The significance of PGnet was estimated from empirical p-values, and a robust subnetwork derived from ALL outcome data was produced by repeated random sampling. The evaluation of the derived robust network to predict outcome (relapse of ALL) was significant (p = 3%), using one hundred three-fold cross-validations and the shrunken centroids classifier., Conclusion: To our knowledge, this is the first method to predict co-expression networks of genes associated with epigenetic mechanisms and to demonstrate their inherent capability to predict therapeutic outcome.
This PGnet approach can be applied to any regulatory mechanisms including transcriptional or microRNA regulation in order to derive predictive molecular profiles that are mechanistically anchored. The implementation of PGnet in R is freely available at http://Lussierlab.org/publication/PGnet.
- Published
- 2009
- Full Text
- View/download PDF
43. Assessing batch effects of genotype calling algorithm BRLMM for the Affymetrix GeneChip Human Mapping 500 K array set using 270 HapMap samples.
- Author
-
Hong H, Su Z, Ge W, Shi L, Perkins R, Fang H, Xu J, Chen JJ, Han T, Kaput J, Fuscoe JC, and Tong W
- Subjects
- Base Sequence, DNA Mutational Analysis methods, Genotype, Humans, Molecular Sequence Data, Algorithms, Chromosome Mapping methods, Genome, Human genetics, Haplotypes, Oligonucleotide Array Sequence Analysis methods, Polymorphism, Single Nucleotide genetics, Software
- Abstract
Background: Genome-wide association studies (GWAS) aim to identify genetic variants (usually single nucleotide polymorphisms [SNPs]) across the entire human genome that are associated with phenotypic traits such as disease status and drug response. Highly accurate and reproducible genotype calling is paramount, since errors introduced by calling algorithms can lead to inflation of false associations between genotype and phenotype. Most genotype calling algorithms currently used for GWAS are based on multiple arrays. Because hundreds of gigabytes (GB) of raw data are generated from a GWAS, the samples are typically partitioned into batches containing subsets of the entire dataset for genotype calling. High call rates and accuracies have been achieved. However, the effects of batch size (i.e., the number of chips analyzed together) and of batch composition (i.e., the choice of chips in a batch) on call rate and accuracy, as well as the propagation of these effects into the significantly associated SNPs identified, have not been investigated. In this paper, we analyzed both batch size and batch composition for effects on the genotype calling algorithm BRLMM using raw data of 270 HapMap samples analyzed with the Affymetrix Human Mapping 500 K array set., Results: Using data from 270 HapMap samples interrogated with the Affymetrix Human Mapping 500 K array set, three different batch sizes and three different batch compositions were used for genotyping with the BRLMM algorithm. Comparative analysis of the calling results and the corresponding lists of significant SNPs identified through association analysis revealed that both batch size and composition affected genotype calling results and significantly associated SNPs.
Batch size and batch composition effects were more severe on samples and SNPs with lower call rates than ones with higher call rates, and on heterozygous genotype calls compared to homozygous genotype calls., Conclusion: Batch size and composition affect the genotype calling results in GWAS using BRLMM. The larger the differences in batch sizes, the larger the effect. The more homogenous the samples in the batches, the more consistent the genotype calls. The inconsistency propagates to the lists of significantly associated SNPs identified in downstream association analysis. Thus, uniform and large batch sizes should be used to make genotype calls for GWAS. In addition, samples of high homogeneity should be placed into the same batch.
- Published
- 2008
- Full Text
- View/download PDF
44. Evaluation of high-throughput functional categorization of human disease genes.
- Author
-
Chen JL, Liu Y, Sam LT, Li J, and Lussier YA
- Subjects
- Computer Simulation, Database Management Systems, Humans, Information Storage and Retrieval methods, Databases, Genetic, Genetic Diseases, Inborn genetics, Genetic Predisposition to Disease genetics, Models, Genetic, Natural Language Processing, Proteome classification, Proteome genetics
- Abstract
Background: Biological data that are well-organized by an ontology, such as the Gene Ontology, enable high-throughput availability of the semantic web and can also be used to facilitate high-throughput classification of biomedical information. However, to our knowledge, no evaluation has been published on automating classifications of human disease genes using the Gene Ontology. In this study, we evaluate automated classifications of well-defined human disease genes using their Gene Ontology annotations and compare them to a gold standard. This gold standard was independently conceived by Valle's research group and contains 923 human disease genes organized in 14 categories of protein function., Results: Two automated methods were applied to investigate the classification of human disease genes into independently pre-defined categories of protein function. One method used the structure of the Gene Ontology by pre-selecting 74 Gene Ontology terms assigned to 11 protein function categories. The second method was based on the similarity of human disease genes clustered according to the information-theoretic distance of their Gene Ontology annotations. Compared to the categorization of human disease genes found in the gold standard, our automated methods achieved an overall 56% and 47% precision with 62% and 71% recall, respectively. However, approximately 15% of the studied human disease genes remain without GO annotations., Conclusion: Automated methods can recapitulate a significant portion of the classification of human disease genes. The method using information-theoretic distance performs slightly better on precision with some loss in recall. For some protein function categories, such as 'hormone' and 'transcription factor', the automated methods perform particularly well, achieving precision and recall levels above 75%.
In summary, this study demonstrates that for semantic webs, methods to automatically classify or analyze a majority of human disease genes require significant progress in both the Gene Ontology annotations and particularly in the utilization of these annotations.
- Published
- 2007
- Full Text
- View/download PDF
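The information-theoretic distance idea can be sketched as a Resnik-style similarity on a toy is-a hierarchy: the information content of the most informative common ancestor of two terms. The term names, the frequencies behind the information contents, and the exact distance used in the paper are all assumptions for illustration.

```python
from math import log

def ancestors(term, parents):
    """All ancestors of a term (including itself) in a toy is-a hierarchy."""
    seen = {term}
    stack = [term]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def resnik_similarity(t1, t2, parents, ic):
    """Information content of the most informative common ancestor
    (a Resnik-style score; the paper's exact distance may differ)."""
    common = ancestors(t1, parents) & ancestors(t2, parents)
    return max(ic[t] for t in common)

# Hypothetical hierarchy: both terms are kinds of signaling function.
parents = {"hormone": ["signaling"], "receptor_binding": ["signaling"],
           "signaling": ["molecular_function"], "molecular_function": []}
ic = {"molecular_function": 0.0, "signaling": -log(0.25),
      "hormone": -log(0.05), "receptor_binding": -log(0.10)}
sim = resnik_similarity("hormone", "receptor_binding", parents, ic)
```

Clustering genes by such pairwise annotation similarities is the essence of the second automated method.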
45. Gene selection with multiple ordering criteria.
- Author
-
Chen JJ, Tsai CA, Tzeng S, and Chen CH
- Subjects
- Algorithms, Colonic Neoplasms genetics, Humans, Oligonucleotide Array Sequence Analysis methods, Databases, Genetic, Gene Expression Profiling methods, Models, Genetic
- Abstract
Background: A microarray study may select different differentially expressed gene sets because of different selection criteria. For example, the fold-change and p-value are two commonly known criteria to select differentially expressed genes under two experimental conditions. These two selection criteria often result in incompatible selected gene sets. Also, in a two-factor, say, treatment by time experiment, the investigator may be interested in one gene list that responds to both treatment and time effects., Results: We propose three layer ranking algorithms, point-admissible, line-admissible (convex), and Pareto, to provide a preference gene list from multiple gene lists generated by different ranking criteria. Using the public colon data as an example, the layer ranking algorithms are applied to the three univariate ranking criteria, fold-change, p-value, and frequency of selections by the SVM-RFE classifier. A simulation experiment shows that for experiments with small or moderate sample sizes (less than 20 per group) and detecting a 4-fold change or less, the two-dimensional (p-value and fold-change) convex layer ranking selects differentially expressed genes with generally lower FDR and higher power than the standard p-value ranking. Three applications are presented. The first application illustrates a use of the layer rankings to potentially improve predictive accuracy. The second application illustrates an application to a two-factor experiment involving two dose levels and two time points. The layer rankings are applied to selecting differentially expressed genes relating to the dose and time effects. 
In the third application, the layer rankings are applied to a benchmark data set consisting of three dilution concentrations to provide a ranking system from a long list of differentially expressed genes generated from the three dilution concentrations., Conclusion: The layer ranking algorithms are useful to help investigators in selecting the most promising genes from multiple gene lists generated by different filter, normalization, or analysis methods for various objectives.
- Published
- 2007
- Full Text
- View/download PDF
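Of the three layer ranking algorithms, the Pareto layering can be sketched as repeatedly peeling off the set of non-dominated genes. The scores below are invented, and larger-is-better on every criterion (e.g. |log fold-change| and -log10 p-value) is an assumed convention.

```python
def pareto_layers(genes):
    """Assign each gene to a Pareto layer given multiple ordering criteria.
    `genes` maps name -> tuple of scores, larger being better on every
    criterion. Layer 1 is the Pareto front; removing it yields layer 2,
    and so on."""
    def dominated(g, pool):
        # g is dominated if some other gene is at least as good everywhere
        # and strictly better somewhere (i.e. not score-identical).
        return any(all(o >= s for o, s in zip(genes[h], genes[g])) and genes[h] != genes[g]
                   for h in pool if h != g)
    layers = {}
    remaining = set(genes)
    layer = 1
    while remaining:
        front = {g for g in remaining if not dominated(g, remaining)}
        for g in front:
            layers[g] = layer
        remaining -= front
        layer += 1
    return layers

# Hypothetical (|log fold-change|, -log10 p-value) scores.
scores = {"geneA": (3.0, 8.0), "geneB": (2.0, 9.0), "geneC": (1.0, 1.0), "geneD": (2.0, 2.0)}
layers = pareto_layers(scores)
```

Genes on layer 1 are preferred under every criterion simultaneously, which is how the layer rankings merge incompatible fold-change and p-value orderings into one preference list.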