Author: "Liu, P S" / Publication Type: Dissertations - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Liu, P S"' showing total 3 results

1. Statistical Methods for Large-Scale Integrative Genomics

Author: Li, Yang, Liu, Jun S., and Mootha, Vamsi K.
Subjects: Statistics, Biology, Bioinformatics
Abstract: In the past 20 years, we have witnessed significant advances in high-throughput genetic and genomic technologies. With the massively generated genomics data, there is a pressing need for statistical methods that can utilize them to make quantitative inference on substantive scientific questions. My research focuses on statistical methods for large-scale integrative genomics. The human genome encodes more than 20,000 genes, while the functions of about 50% (>10,000) of these genes remain unknown to date. The determination of the functions of the poorly characterized genes is crucial for understanding biological processes and human diseases. In the era of Big Data, the availability of massive genomic data provides an unprecedented opportunity to identify associations between genes and predict their biological functions. Genome sequencing data and mRNA expression data are the two most important classes of genomic data. This thesis presents three research projects in self-contained chapters: (1) a statistical framework for inferring the evolutionary history of human genes and identifying gene modules with shared evolutionary history from genome sequencing data, (2) a statistical method to predict frequent and specific gene co-expression by integrating a large number of mRNA expression datasets, and (3) robust variable and interaction selection for high-dimensional classification problems under discriminant analysis and logistic regression models. Chapter 1. Humans have more than 20,000 genes, but to date most of their functions are uncharacterized. Determining the function of poorly characterized genes is crucial for understanding biological processes and studying human diseases. Functionally associated genes tend to be gained and lost together during evolution, so identifying co-evolution of genes predicts gene-gene associations.
In this chapter, we propose a mixture of tree-structured hidden Markov models for the gene evolution process, and a Bayesian model-based clustering algorithm to detect gene modules with shared evolutionary history (termed evolutionarily conserved modules, ECM). A Dirichlet process prior is adopted to estimate the number of gene clusters, and an efficient Gibbs sampler is developed for posterior computation. Through simulation studies and benchmarks on real data sets, we show that our algorithm outperforms traditional methods that use simple metrics (e.g., Hamming distance, Pearson correlation) to measure the similarity between genes' presence/absence patterns. We apply our methods to 1,025 canonical human pathway gene sets and find that a large portion of the detected gene associations are substantiated by other sources of evidence. The remaining genes have predicted functions that are high-priority candidates for verification by further biological experiments. Chapter 2. The availability of gene expression measurements across thousands of experimental conditions provides the opportunity to predict gene function based on shared mRNA expression. While many biological complexes and pathways are coordinately expressed, their genes may be organized into co-expression modules with distinct patterns in certain tissues or conditions, which can provide insight into pathway organization and function. We developed the algorithm CLIC (clustering by inferred co-expression, www.gene-clic.org) that clusters a set of functionally related genes into co-expressed modules, highlights the most relevant datasets, and predicts additional co-expressed genes. Using a statistical Bayesian partition model, CLIC simultaneously partitions the input gene set into disjoint co-expression modules and weights the most relevant datasets for each module. CLIC then expands each module with additional members that co-express with the module’s genes more than the background model in the weighted datasets.
We applied CLIC to (i) model the background correlation in each of 3,662 mouse and human microarray datasets from the Gene Expression Omnibus (GEO), (ii) partition each of 900 annotated complexes/pathways into co-expression modules, and (iii) expand each co-expression module with additional genes showing frequent and specific co-expression over multiple GEO datasets. CLIC provided very strong functional predictions for many completely uncharacterized genes, including a link between protein C7orf55 and the mitochondrial ATP synthase complex that we experimentally validated via CRISPR knock-out. CLIC software is freely available and should become increasingly powerful with the growing wealth of transcriptomic datasets. Chapter 3. Discriminant analysis and logistic regression are fundamental tools for classification problems. Quadratic discriminant analysis can exploit interaction effects of predictors, but the selection of interaction terms is non-trivial and the Gaussian assumption is often too restrictive for many real problems. Under the logistic regression framework, we propose a forward-backward method, SODA, for variable selection with both main and quadratic interaction terms. In the forward stage, a stepwise procedure screens for important predictors with both main and interaction effects; in the backward stage, SODA removes insignificant terms so as to optimize the extended BIC (EBIC) criterion. Compared with existing methods for quadratic discriminant analysis variable selection (e.g., Murphy et al., 2010; Zhang and Wang, 2011; Maugis et al., 2011), SODA can deal with high-dimensional data in which the number of predictors is much larger than the sample size, and it does not require the joint normality assumption on predictors, leading to much enhanced robustness. Theoretical analysis establishes the consistency of SODA under the high-dimensional setting.
The empirical performance of SODA is assessed on both simulated and real data and is found to be superior to all existing methods we have tested. For all three real datasets we have studied, SODA selected more parsimonious models that achieved higher classification accuracies than the other tested methods., Statistics
Published: 2016

2. Three Aspects of Biostatistical Learning Theory

Author: Neykov, Matey, Cai, Tianxi, and Liu, Jun S.
Subjects: Statistics
Abstract: In the present dissertation we consider three classical problems in biostatistics and statistical learning: classification, variable selection and statistical inference. Chapter 2 is dedicated to multi-class classification. We characterize a class of loss functions which we deem relaxed Fisher consistent, whose local minimizers not only recover the Bayes rule but also the exact conditional class probabilities. Our class encompasses previously studied classes of loss functions, and includes non-convex functions, which are known to be less susceptible to outliers. We propose a generic greedy functional gradient-descent minimization algorithm for boosting weak learners, which works with any loss function in our class. We show that the boosting algorithm achieves a geometric rate of convergence in the case of a convex loss. In addition, we provide numerical studies and a real data example which serve to illustrate that the algorithm performs well in practice. In Chapter 3, we provide insights on the behavior of sliced inverse regression in a high-dimensional setting under a single index model. We analyze two algorithms: a thresholding-based algorithm known as diagonal thresholding and an L1 penalization algorithm based on semidefinite programming, and show that they achieve optimal (up to a constant) sample size in terms of support recovery in the case of standard Gaussian predictors. In addition, we look into the performance of the linear regression LASSO in single index models with correlated Gaussian designs. We show that under certain restrictions on the covariance and signal, the linear regression LASSO can also enjoy optimal sample size in terms of support recovery. Our analysis extends existing results on LASSO's variable selection capabilities for linear models. Chapter 4 develops a general inferential framework for testing and constructing confidence intervals for high-dimensional estimating equations.
Such a framework has a variety of applications and allows us to provide tests and confidence regions for parameters estimated by algorithms such as the Dantzig Selector, CLIME and LDP among others, none of which has been previously equipped with inferential procedures., Biostatistics
Published: 2015

3. Cell States and Cell Fate: Statistical and Computational Models in (Epi)Genomics

Author: Fernandez, Daniel and Liu, Jun S.
Subjects: Statistics, Biology, Bioinformatics, Genetics
Abstract: This dissertation develops and applies several statistical and computational methods to the analysis of Next Generation Sequencing (NGS) data in order to gain a better understanding of our biology. In the rest of the chapter we introduce key concepts in molecular biology, and recent technological developments that help us better understand this complex science, which, in turn, provide the foundation and motivation for the subsequent chapters. In the second chapter we present the problem of estimating gene/isoform expression at the allelic level, and different models to solve this problem. First, we describe the observed data and the computational workflow to process the data. Next, we propose frequentist and Bayesian models motivated by the central dogma of molecular biology and the data generating process (DGP) for RNA-Seq. We develop EM and Gibbs sampling approaches to estimate gene and transcript-specific expression from our proposed models. Finally, we present the performance of our models in simulations and we end with the analysis of experimental RNA-Seq data at the allelic level. In the third chapter we present our paired factorial experimental design to study parentally biased gene/isoform expression in the mouse cerebellum, and dynamic changes of this pattern between young and adult stages of cerebellar development. We present a Bayesian variable selection model to estimate the difference in expression between the paternal and maternal genes, while incorporating relevant factors and their interactions into the model. Next, we apply our model to our experimental data, and then validate our predictions using pyrosequencing follow-up experiments. We subsequently applied our model to the pyrosequencing data across multiple brain regions. Our method, combined with the validation experiments, allowed us to find novel imprinted genes, and investigate, for the first time, imprinting dynamics across brain regions and across development.
In the fourth chapter we move from the controlled experiments in mouse isogenic lines to the highly variant world of human genetics in observational studies. In this chapter we introduce a Bayesian Regression Allelic Imbalance Model, BRAIM, that estimates the imbalance coming from two major sources: cis-regulation and imprinting. We model the cis-effect as an additive effect for the heterozygous group and we model the parent-of-origin effect with a latent variable that indicates to which parent a given allele belongs. Next, we show the performance of the model under simulation scenarios, and finally we apply the model to several experiments across multiple tissues and multiple individuals. In the fifth chapter we characterize the transcriptional regulation and gene expression of in-vitro Embryonic Stem Cells (ESCs) and two related in-vivo tissues: the Inner Cell Mass (ICM) tissue and the embryonic tissue at day 6.5. Our objective is twofold. First, we would like to understand the differences in gene expression between the ESCs and the in-vivo counterpart from which these cells were derived (ICM). Second, we want to characterize the active transcriptional regulatory regions using several histone modifications and to connect such regulatory activity with gene expression. In this chapter we used several statistical and computational methods to analyze and visualize the data, and it provides a good showcase of how, by combining several methods of analysis, we can delve into interesting developmental biology.
Published: 2015
