1. A sequential Monte Carlo approach to gene expression deconvolution
- Author
Xiaodong Wang and Oyetunji Enoch Ogundijo
- Subjects
0301 basic medicine, Computer science, Monte Carlo method, lcsh:Medicine, Datasets as Topic, 01 natural sciences, 010104 statistics & probability, Open Science, Open Data, Medicine and Health Sciences, Preprocessor, lcsh:Science, Likelihood Functions, Sequence, Multidisciplinary, Applied Mathematics, Simulation and Modeling, Software Engineering, Heart, Physical Sciences, Engineering and Technology, Probability distribution, Deconvolution, Anatomy, Particle filter, Algorithms, Research Article, Computer and Information Sciences, Science Policy, Materials by Structure, Materials Science, Research and Analysis Methods, Non-negative matrix factorization, Set (abstract data type), 03 medical and health sciences, Genetics, 0101 mathematics, Preprocessing, business.industry, lcsh:R, Biology and Life Sciences, Pattern recognition, Probability Theory, Probability Distribution, Probability Density, 030104 developmental biology, Mixtures, Electrical engineering, Cardiovascular Anatomy, lcsh:Q, Artificial intelligence, Gene expression, business, Mathematics
- Abstract
High-throughput gene expression data are often obtained from pure or complex (heterogeneous) biological samples. In the latter case, the data are a mixture of different cell types, and this heterogeneity complicates their analysis. To draw conclusions from gene expression data obtained from heterogeneous samples, methods such as microdissection and flow cytometry have been employed to physically separate the constituent cell types. However, these manual approaches are time-consuming when measuring the responses of multiple cell types simultaneously. In addition, exposed samples are often contaminated by external perturbations, which may alter the yield of molecular content. In this paper, we model the heterogeneous gene expression data using a Bayesian framework, treating the cell type proportions and the cell-type specific expressions as the parameters of the model. Specifically, we present a novel sequential Monte Carlo (SMC) sampler for estimating the model parameters by approximating their posterior distributions with a set of weighted samples. The SMC framework is a robust and efficient approach in which we construct a sequence of artificial target (posterior) distributions on spaces of increasing dimensions which admit the distributions of interest as marginals. The proposed algorithm is evaluated on simulated datasets and publicly available real datasets, including Affymetrix oligonucleotide arrays and data from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO), with varying numbers of cell types.
The results obtained on all datasets show superior performance, with improved accuracy in the estimation of cell type proportions and cell-type specific expressions and more accurate identification of differentially expressed genes, when compared to other widely known methods for blind decomposition of heterogeneous gene expression data, such as Dsection and the nonnegative matrix factorization (NMF) algorithms. A MATLAB implementation of the proposed SMC algorithm is available for download at https://github.com/moyanre/smcgenedeconv.git.
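To make the idea of approximating a posterior with a set of samples concrete, the following is a minimal Python sketch on a toy problem. It is not the authors' algorithm or their MATLAB code. It assumes a linear mixing model with two cell types whose expression signatures are known, a Gaussian noise model, and a uniform prior on a single unknown proportion. It also uses likelihood tempering, a common SMC sampler construction, in place of the paper's sequence of targets on spaces of increasing dimension. All names (`log_likelihood`, `smc_proportion`, `sig_a`, `sig_b`, `noise_sd`) and all numeric values are hypothetical.

```python
import math
import random


def log_likelihood(w, y, sig_a, sig_b, noise_sd):
    """Gaussian log-likelihood (up to a constant) of mixed expression y,
    assuming y_g = w * a_g + (1 - w) * b_g + noise. The linear model is an
    assumption of this sketch, not a detail taken from the abstract."""
    total = 0.0
    for y_g, a_g, b_g in zip(y, sig_a, sig_b):
        mean = w * a_g + (1.0 - w) * b_g
        total += -0.5 * ((y_g - mean) / noise_sd) ** 2
    return total


def smc_proportion(y, sig_a, sig_b, noise_sd=0.5,
                   n_particles=500, n_steps=20, step_sd=0.05, seed=0):
    """Tempered SMC sampler for the proportion w of cell type A.

    Targets prior(w) * likelihood(w)^t for t rising from 0 to 1.
    Particles are resampled at every step, so the returned samples
    are equally weighted."""
    rng = random.Random(seed)
    particles = [rng.random() for _ in range(n_particles)]  # Uniform(0, 1) prior
    temps = [t / n_steps for t in range(n_steps + 1)]

    for prev, curr in zip(temps, temps[1:]):
        # Reweight: incremental weight is likelihood^(curr - prev).
        log_w = [(curr - prev) * log_likelihood(w, y, sig_a, sig_b, noise_sd)
                 for w in particles]
        top = max(log_w)
        weights = [math.exp(lw - top) for lw in log_w]

        # Resample (multinomial) in proportion to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)

        # Move: one random-walk Metropolis step per particle, targeting
        # prior * likelihood^curr. Proposals outside [0, 1] are rejected.
        moved = []
        for w in particles:
            prop = w + rng.gauss(0.0, step_sd)
            if 0.0 <= prop <= 1.0:
                log_ratio = curr * (
                    log_likelihood(prop, y, sig_a, sig_b, noise_sd)
                    - log_likelihood(w, y, sig_a, sig_b, noise_sd))
                if rng.random() < math.exp(min(0.0, log_ratio)):
                    w = prop
            moved.append(w)
        particles = moved

    return particles


# Toy data: hypothetical signatures for six genes and a true proportion of 0.3.
sig_a = [10.0, 2.0, 7.0, 1.0, 5.0, 8.0]
sig_b = [1.0, 9.0, 3.0, 6.0, 5.0, 2.0]
true_w = 0.3
data_rng = random.Random(42)
y = [true_w * a + (1.0 - true_w) * b + data_rng.gauss(0.0, 0.5)
     for a, b in zip(sig_a, sig_b)]

samples = smc_proportion(y, sig_a, sig_b)
estimate = sum(samples) / len(samples)  # posterior mean of the proportion
print(f"estimated proportion of cell type A: {estimate:.3f}")
```

The posterior mean of the particles serves as a point estimate of the proportion, and their spread gives a rough measure of uncertainty. A blind deconvolution of the kind the abstract describes would also have to estimate the cell-type specific expressions, which this sketch takes as given.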
- Published
- 2017