Search Results: 479 results for '"zero-inflation"'
52. Zero‐inflated count distributions for capture–mark–reencounter data.
- Author: Riecke, Thomas V., Gibson, Daniel, Sedinger, James S., and Schaub, Michael
- Subjects: *DATA distribution, *BINOMIAL distribution, *POISSON distribution, *POISSON regression, *CONDITIONAL probability, *CONSERVATION biology
- Abstract
The estimation of demographic parameters is a key component of evolutionary demography and conservation biology. Capture–mark–recapture methods have served as a fundamental tool for estimating demographic parameters. The accurate estimation of demographic parameters in capture–mark–recapture studies depends on accurate modeling of the observation process. Classic capture–mark–recapture models typically model the observation process as a Bernoulli or categorical trial with detection probability conditional on a marked individual's availability for detection (e.g., alive, or alive and present in a study area). Alternatives to this approach are underused, but may have great utility in capture–recapture studies. In this paper, we explore a simple concept: in the same way that counts contain more information about abundance than simple detection/non‐detection data, the number of encounters of individuals during observation occasions contains more information about the observation process than detection/non‐detection data for individuals during the same occasion. Rather than using Bernoulli or categorical distributions to estimate detection probability, we demonstrate the application of zero‐inflated Poisson and gamma‐Poisson distributions. The use of count distributions allows for inference on availability for encounter, as well as a wide variety of parameterizations for heterogeneity in the observation process. We demonstrate that this approach can accurately recover demographic and observation parameters in the presence of individual heterogeneity in detection probability and discuss some potential future extensions of this method. [ABSTRACT FROM AUTHOR]
- Published: 2022
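The zero-inflated Poisson that this entry proposes in place of a Bernoulli detection model has a simple closed form. A minimal stdlib-Python sketch (illustrative only; the function name and parameterization are ours, not the authors'):

```python
import math

def zip_pmf(k, lam, pi):
    """P(X = k) for a zero-inflated Poisson: with probability pi the
    count is a structural zero, otherwise it is Poisson(lam)."""
    poisson = math.exp(-lam) * lam**k / math.factorial(k)
    return pi + (1 - pi) * poisson if k == 0 else (1 - pi) * poisson

# The zero mass is inflated relative to a plain Poisson with the same rate:
lam, pi = 2.0, 0.3
inflated_zero = zip_pmf(0, lam, pi)   # pi + (1 - pi) * exp(-lam)
plain_zero = math.exp(-lam)
```

The gamma-Poisson (negative binomial) alternative mentioned in the abstract would replace the Poisson component with a gamma mixture of Poissons, adding a dispersion parameter for heterogeneity in encounter rates.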
53. Causal variation modelling identifies large inter- and intra-regional disparities in physical therapy offered to hip fracture patients in Estonia.
- Author: Prommik, Pärt, Maiväli, Ülo, Kolk, Helgi, and Märtson, Aare
- Subjects: *KRUSKAL-Wallis Test, *PHYSICAL therapy, *HIP fractures, *POPULATION geography, *RETROSPECTIVE studies, *PEARSON correlation (Statistics), *CHI-squared test, *HEALTH equity, *REHABILITATION, *LOGISTIC regression analysis, *LONGITUDINAL method
- Abstract
An essential measure of hip fracture (HF) rehabilitation, the amount of physical therapy (PT) used per patient, has been severely understudied. This study (1) evaluates post-acute PT use after HF in Estonia, (2) presents causal variation modelling for examining inter- and intra-regional disparities, and (3) analyses its temporal trends. This retrospective cohort study used validated population-wide health data, including patients aged ≥50 years with an index HF diagnosed between January 2009 and September 2017. Patients' 6-month PT use was analysed and reported separately for the acute and post-acute phases. While most of the 11,461 included patients received acute rehabilitation, only 40% received post-acute PT, by a median of 6 hours. Analyses based on measures of central tendency revealed 2.5- to 2.6-fold inter-regional differences in HF post-acute rehabilitation. Variation modelling additionally detected intra-regional disparities, showing imbalances in the fairness of allocating local rehabilitation resources among a county's patients. This study demonstrates the advantages of causal variation modelling for identifying inter- and intra-regional disparities in rehabilitation, using an essential outcome measure: physical therapy hours used. The analyses revealed persisting large multi-level disparities and an accompanying overall inaccessibility of PT in HF rehabilitation in Estonia, showing an urgent need for system-wide improvements. The study also expands knowledge of two understudied topics, HF post-acute care and long-term PT use, and provides the first evidence-based regional-level basis for improving the rehabilitation system in Estonia. [ABSTRACT FROM AUTHOR]
- Published: 2022
54. Differential expression of single‐cell RNA‐seq data using Tweedie models.
- Author: Mallick, Himel, Chatterjee, Suvo, Chowdhury, Shrabanti, Chatterjee, Saptarshi, Rahnavard, Ali, and Hicks, Stephanie C.
- Subjects: *GENE expression profiling, *FALSE discovery rate, *RNA sequencing, *GENE expression, *STATISTICAL power analysis
- Abstract
The performance of computational methods and software to identify differentially expressed features in single-cell RNA-sequencing (scRNA-seq) has been shown to be influenced by several factors, including the choice of the normalization method used and the choice of the experimental platform (or library preparation protocol) to profile gene expression in individual cells. Currently, it is up to the practitioner to choose the most appropriate differential expression (DE) method out of over 100 DE tools available to date, each relying on its own assumptions to model scRNA-seq expression features. To model the technological variability in cross-platform scRNA-seq data, here we propose to use Tweedie generalized linear models that can flexibly capture a large dynamic range of observed scRNA-seq expression profiles across experimental platforms induced by platform- and gene-specific statistical properties such as heavy tails, sparsity, and gene expression distributions. We also propose a zero-inflated Tweedie model that allows the probability mass at zero to exceed that of a traditional Tweedie distribution, in order to model zero-inflated scRNA-seq data with excessive zero counts. Using both synthetic and published plate- and droplet-based scRNA-seq datasets, we perform a systematic benchmark evaluation of more than 10 representative DE methods and demonstrate that our method (Tweedieverse) outperforms the state-of-the-art DE approaches across experimental platforms in terms of statistical power and false discovery rate control. Our open-source software (R/Bioconductor package) is available at https://github.com/himelmallick/Tweedieverse. [ABSTRACT FROM AUTHOR]
- Published: 2022
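For power parameter 1 < p < 2, a Tweedie variable is a Poisson sum of gamma variables, which is what gives it a point mass at zero alongside a continuous positive part. A stdlib-only simulation sketch using the standard compound Poisson-gamma parameterization (an illustration under our own conventions, not code from Tweedieverse):

```python
import math
import random

def tweedie_sample(mu, phi, p, rng):
    """Draw from Tweedie(mu, phi, p) with 1 < p < 2 via its compound
    Poisson-gamma representation; the draw is exactly 0 iff the Poisson
    count of gamma summands is 0, i.e. with probability exp(-lam)."""
    lam = mu ** (2 - p) / (phi * (2 - p))         # Poisson rate
    alpha = (2 - p) / (p - 1)                     # gamma shape
    theta = phi * (p - 1) * mu ** (p - 1)         # gamma scale
    n, u, term = 0, rng.random(), math.exp(-lam)  # Poisson draw by inversion
    cdf = term
    while u > cdf:
        n += 1
        term *= lam / n
        cdf += term
    return sum(rng.gammavariate(alpha, theta) for _ in range(n))

rng = random.Random(1)
draws = [tweedie_sample(2.0, 1.0, 1.5, rng) for _ in range(5000)]
zero_fraction = sum(d == 0.0 for d in draws) / len(draws)
```

The zero-inflated variant described in the abstract would mix an extra point mass at zero on top of this distribution, raising `zero_fraction` above the exp(-lam) the plain Tweedie allows.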
55. Modelling Excess Zeros in Count Data: A New Perspective on Modelling Approaches.
- Author: Haslett, John, Parnell, Andrew C., Hinde, John, and de Andrade Moral, Rafael
- Subjects: *POISSON regression, *POISSON distribution, *DISPERSION (Atmospheric chemistry), *DATA analysis, *GENERALIZATION
- Abstract
Summary: We consider the analysis of count data in which the observed frequency of zero counts is unusually large, typically with respect to the Poisson distribution. We focus on two alternative modelling approaches: over‐dispersion (OD) models and zero‐inflation (ZI) models, both of which can be seen as generalisations of the Poisson distribution; we refer to these as implicit and explicit ZI models, respectively. Although sometimes seen as competing approaches, they can be complementary; OD is a consequence of ZI modelling, and ZI is a by‐product of OD modelling. The central objective in such analyses is often concerned with inference on the effect of covariates on the mean, in light of the apparent excess of zeros in the counts. Typically, the modelling of the excess zeros per se is a secondary objective, and there are choices to be made between, and within, the OD and ZI approaches. The contribution of this paper is primarily conceptual. We contrast, descriptively, the impact on zeros of the two approaches. We further offer a novel descriptive characterisation of alternative ZI models, including the classic hurdle and mixture models, by providing a unifying theoretical framework for their comparison. This in turn leads to a novel and technically simpler ZI model. We develop the underlying theory for univariate counts and touch on its implication for multivariate count data. [ABSTRACT FROM AUTHOR]
- Published: 2022
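The paper's observation that over-dispersion is a consequence of ZI modelling can be seen directly from the zero-inflated Poisson moments; a one-function sketch using the standard formulas (our notation, not the paper's):

```python
def zip_mean_var(lam, pi):
    """Mean and variance of a zero-inflated Poisson(lam) with
    zero-inflation probability pi; uses Var = mean * (1 + pi * lam)."""
    mean = (1 - pi) * lam
    var = mean * (1 + pi * lam)
    return mean, var

# Any pi > 0 makes the variance exceed the mean: explicit zero-inflation
# induces (implicit) over-dispersion, even though the non-inflated
# Poisson component is equi-dispersed.
mean, var = zip_mean_var(4.0, 0.25)   # mean = 3.0, var = 6.0
```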
56. DIF Detection With Zero-Inflation Under the Factor Mixture Modeling Framework.
- Author: Lee, Sooyong, Han, Suhwa, and Choi, Seung W.
- Subjects: *STRUCTURAL equation modeling, *STATISTICS, *REGRESSION analysis, *DIFFERENTIAL item functioning (Research bias), *DESCRIPTIVE statistics, *STATISTICAL models, *RESEARCH bias, *DATA analysis software, *DATA analysis, *MAXIMUM likelihood statistics, *LOGISTIC regression analysis, *GROUP process, *ALGORITHMS, *PROBABILITY theory
- Abstract
Response data containing an excessive number of zeros are referred to as zero-inflated data. When differential item functioning (DIF) detection is of interest, zero-inflation can attenuate DIF effects in the total sample and lead to underdetection of DIF items. The current study presents a DIF detection procedure for response data with excess zeros due to the existence of unobserved heterogeneous subgroups. The suggested procedure utilizes the factor mixture modeling (FMM) with MIMIC (multiple-indicator multiple-cause) to address the compromised DIF detection power via the estimation of latent classes. A Monte Carlo simulation was conducted to evaluate the suggested procedure in comparison to the well-known likelihood ratio (LR) DIF test. Our simulation study results indicated the superiority of FMM over the LR DIF test in terms of detection power and illustrated the importance of accounting for latent heterogeneity in zero-inflated data. The empirical data analysis results further supported the use of FMM by flagging additional DIF items over and above the LR test. [ABSTRACT FROM AUTHOR]
- Published: 2022
57. Modelling the spatial population structure and distribution of the queen conch, Aliger gigas, on the Pedro Bank, Jamaica
- Author: Ricardo A. Morris, Alvaro Hernández-Flores, and Alfonso Cuevas-Jimenez
- Subjects: spatial analysis, sedentary species, zero-inflation, species distribution models, Aquaculture. Fisheries. Angling, SH1-691
- Abstract
The estimation of reliable indices of abundance for sedentary stocks requires incorporating the underlying spatial population structure, including issues arising from the sampling design and zero-inflation. We applied seven spatial interpolation techniques [ordinary kriging (OK), kriging with external drift (KED), a negative binomial generalized additive model (NBGAM), NBGAM plus OK (NBGAM+OK), a generalized additive mixed model (GAMM), GAMM plus OK (GAMM+OK) and a zero-inflated negative binomial model (ZINB)] to three survey datasets to estimate biomass of the gastropod Aliger gigas on the Pedro Bank, Jamaica. The models were evaluated using 10-fold cross-validation diagnostic criteria to choose the best model. We also compared the best model's estimates against two common design-based methods to assess the consequences of ignoring the spatial structure of the species' distribution. GAMM and ZINB were overall the best models but were strongly affected by the sampling design, sample size, the coefficient of variation of the sample and the quality of the available covariates used to model the distribution (geographic location, depth and habitat). More reliable abundance indices can help improve stock assessments and the development of spatial management under an ecosystem approach.
- Published: 2022
58. Discrete Weibull regression model for count data
- Author: Kalktawi, Hadeel Saleh and Yu, K.
- Subjects: 519.2, Dispersion, Maximum likelihood, Censored, Zero-inflation, Generalized linear model
- Abstract
Data can be collected in the form of counts in many situations. In other words, the number of deaths from an accident, the number of days until a machine stops working or the number of annual visitors to a city may all be considered as interesting variables for study. This study is motivated by two facts; first, the vital role of the continuous Weibull distribution in survival analyses and failure time studies. Hence, the discrete Weibull (DW) is introduced analogously to the continuous Weibull distribution, (see, Nakagawa and Osaki (1975) and Kulasekera (1994)). Second, researchers usually focus on modeling count data, which take only non-negative integer values as a function of other variables. Therefore, the DW, introduced by Nakagawa and Osaki (1975), is considered to investigate the relationship between count data and a set of covariates. Particularly, this DW is generalised by allowing one of its parameters to be a function of covariates. Although the Poisson regression can be considered as the most common model for count data, it is constrained by its equi-dispersion (the assumption of equal mean and variance). Thus, the negative binomial (NB) regression has become the most widely used method for count data regression. However, even though the NB can be suitable for the over-dispersion cases, it cannot be considered as the best choice for modeling the under-dispersed data. Hence, it is required to have some models that deal with the problem of under-dispersion, such as the generalized Poisson regression model (Efron (1986) and Famoye (1993)) and COM-Poisson regression (Sellers and Shmueli (2010) and Sáez-Castillo and Conde-Sánchez (2013)). Generally, all of these models can be considered as modifications and developments of Poisson models. However, this thesis develops a model based on a simple distribution with no modification. 
Thus, if the data are not following the dispersion system of Poisson or NB, the true structure generating the data should be detected. Applying a model that has the ability to handle different dispersions would be of great interest. Thus, in this study, the DW regression model is introduced. Besides the flexibility of the DW to model under- and over-dispersion, it is a good model for inhomogeneous and highly skewed data, such as those with excessive zero counts, which are more dispersed than the Poisson. Although these data can be fitted well using some developed models, namely, the zero-inflated and hurdle models, the DW demonstrates a good fit and has less complexity than these modified models. However, there could be some cases when a special model that separates the probability of zeros from that of the other positive counts must be applied. Then, to cope with the problem of too many observed zeros, two modifications of the DW regression are developed, namely, the zero-inflated discrete Weibull (ZIDW) and hurdle discrete Weibull (HDW) models. Furthermore, this thesis considers another type of data, where the response count variable is censored from the right, which is observed in many experiments. Applying the standard models for these types of data without considering the censoring may yield misleading results. Thus, the censored discrete Weibull (CDW) model is employed for this case. On the other hand, this thesis introduces the median discrete Weibull (MDW) regression model for investigating the effect of covariates on the count response through the median, which is more appropriate for the skewed nature of count data. In other words, the likelihood of the DW model is re-parameterized to explain the effect of the predictors directly on the median. 
Thus, in comparison with the generalized linear models (GLMs), MDW and GLMs both investigate the relations to a set of covariates via certain location measurements; however, GLMs consider the means, which is not the best way to represent skewed data. These DW regression models are investigated through simulation studies to illustrate their performance. In addition, they are applied to some real data sets and compared with the related count models, mainly Poisson and NB models. Overall, the DW models provide a good fit to the count data as an alternative to the NB models in the over-dispersion case and are much better fitting than the Poisson models. Additionally, contrary to the NB model, the DW can be applied for the under-dispersion case.
- Published: 2017
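The type-I discrete Weibull underlying this thesis has a closed-form pmf obtained by differencing the Weibull survival function at the integers; a minimal sketch in the Nakagawa-Osaki parameterization (illustrative only, not the thesis code):

```python
def dw_pmf(x, q, beta):
    """Type-I discrete Weibull pmf:
    P(X = x) = q**(x**beta) - q**((x+1)**beta) for x = 0, 1, 2, ...;
    note P(X = 0) = 1 - q regardless of beta."""
    return q ** (x ** beta) - q ** ((x + 1) ** beta)

# beta tunes the dispersion (roughly, beta < 1 over-dispersed and
# beta > 1 under-dispersed relative to a Poisson of similar mean),
# while q directly controls the zero probability -- convenient for
# zero-heavy counts.
probs = [dw_pmf(x, q=0.8, beta=0.9) for x in range(200)]
```

In a DW regression of the kind the thesis describes, one parameter (q, or the median under the MDW re-parameterization) would be linked to covariates, e.g. through a logit-type link; that link choice is our assumption for illustration.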
59. Estimation of Mediation Effect on Zero-Inflated Microbiome Mediators
- Author: Dongyang Yang and Wei Xu
- Subjects: mediation model, microbiome data, zero-inflation, semiparametric, direct effects, indirect effects, Mathematics, QA1-939
- Abstract
Mediation analysis, which studies cause-and-effect relationships through mediators, has become increasingly popular over the past decades. The human microbiome can contribute to the pathogenesis of many complex diseases by mediating disease-leading causal pathways. However, standard mediation analysis is not adequate for microbiome data due to the excessive number of zero values and the over-dispersion in the sequencing reads, which arise for both biological and sampling reasons. To address these unique challenges brought by the zero-inflated mediator, we developed a novel mediation analysis algorithm under the potential-outcome framework to fill this gap. The proposed semiparametric model estimates the mediation effect of the microbiome by decomposing indirect effects into two components according to the zero-inflated distributions. A bootstrap algorithm is utilized to calculate empirical confidence intervals for the causal effects. We conducted extensive simulation studies to investigate the performance of the proposed weighting-based approach and some model-based alternatives, and the proposed model showed robust performance. The proposed algorithm was applied in a real human microbiome study to identify whether some taxa mediate the relationship between LACTIN-V treatment and immune response.
- Published: 2023
60. Determining Risk Factors of Antenatal Care Attendance and its Frequency in Bangladesh: An Application of Count Regression Analysis
- Author: Bhowmik, Kakoli Rani, Das, Sumonkanti, Islam, Md. Atiqul, and Rahman, Azizur, editor
- Published: 2020
61. Distribution‐free model selection for longitudinal zero‐inflated count data with missing responses and covariates.
- Author: Chen, Chun‐Shu and Shen, Chung‐Wei
- Subjects: *MISSING data (Statistics), *GENERALIZED estimating equations, *DATA distribution
- Abstract
In many medical and social science studies, count responses with excess zeros are very common and often the primary outcome of interest. Such count responses are usually generated under some clustered correlation structure due to longitudinal observations of subjects. To model such longitudinal count data with excess zeros, the zero-inflated binomial (ZIB) model for bounded outcomes, and the zero-inflated negative binomial (ZINB) and zero-inflated Poisson (ZIP) models for unbounded outcomes, are all popular methods. To alleviate the effects of deviations from model assumptions, a semiparametric (distribution-free) weighted generalized estimating equations approach has been proposed to estimate model parameters when data are subject to missingness. In this article, we further explore important covariates for the response variable. Without assumptions on the data distribution, a model selection criterion based on the expected weighted quadratic loss is proposed to select an appropriate subset of covariates, especially when count responses have excess zeros and data are subject to nonmonotone missingness in both responses and covariates. To understand how the percentages of excess zeros and missingness affect selection, we design various scenarios for covariate selection in the mean model via simulation studies; a real data example from a study of cardiovascular disease is also presented for illustration. [ABSTRACT FROM AUTHOR]
- Published: 2022
62. The Log-Normal zero-inflated cure regression model for labor time in an African obstetric population.
- Author: Cavenague de Souza, Hayala Cristina, Louzada, Francisco, de Oliveira, Mauro Ribeiro, Fawole, Bukola, Akintan, Adesina, Oyeneyin, Lawal, Sanni, Wilfred, and Silva Castro Perdoná, Gleici da
- Subjects: *REGRESSION analysis, *LABOR time, *WOMEN'S hospitals, *FETAL death, *PREGNANT women, *CHILDBIRTH
- Abstract
In obstetrics and gynecology, knowledge about how women's features are associated with childbirth is important. This leads to establishing guidelines and can help managers to describe the dynamics of pregnant women's hospital stays. Then, time is a variable of great importance and can be described by survival models. An issue that should be considered in the modeling is the inclusion of women for whom the duration of labor cannot be observed due to fetal death, generating a proportion of times equal to zero. Additionally, another proportion of women's time may be censored due to some intervention. The aim of this paper was to present the Log-Normal zero-inflated cure regression model and to evaluate likelihood-based parameter estimation by a simulation study. In general, the inference procedures showed a better performance for larger samples and low proportions of zero inflation and cure. To exemplify how this model can be an important tool for investigating the course of the childbirth process, we considered the Better Outcomes in Labor Difficulty project dataset and showed that parity and educational level are associated with the main outcomes. We acknowledge the World Health Organization for granting us permission to use the dataset. [ABSTRACT FROM AUTHOR]
- Published: 2022
63. Zero‐state coupled Markov switching count models for spatio‐temporal infectious disease spread.
- Author: Douwes‐Schultz, Dirk and Schmidt, Alexandra M.
- Subjects: INFECTIOUS disease transmission, COMMUNICABLE diseases, DENGUE, MARKOV processes, BAYESIAN field theory, ARBOVIRUS diseases
- Abstract
Spatio-temporal counts of infectious disease cases often contain an excess of zeros. With existing zero-inflated count models applied to such data, it is difficult to quantify space-time heterogeneity in the effects of disease spread between areas. Also, existing methods do not allow for separate dynamics to affect the reemergence and persistence of the disease. As an alternative, we develop a new zero-state coupled Markov switching negative binomial model, under which the disease switches between periods of presence and absence in each area through a series of partially hidden nonhomogeneous Markov chains coupled between neighbouring locations. When the disease is present, an autoregressive negative binomial model generates the cases, with a possible zero representing the disease being undetected. Bayesian inference and prediction are illustrated using spatio-temporal counts of dengue fever cases in Rio de Janeiro, Brazil. [ABSTRACT FROM AUTHOR]
- Published: 2022
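The switching mechanism in this entry is easiest to see in a stripped-down, single-area form: a hidden presence/absence Markov chain that emits structural zeros when the disease is absent and counts when present. A stdlib toy sketch (heavily simplified assumptions: one area, no coupling between neighbours, and a Poisson emission instead of the paper's autoregressive negative binomial):

```python
import math
import random

def simulate_zero_state(T, p01, p10, lam, rng):
    """Toy zero-state switching count process: hidden presence chain
    S_t in {0, 1} (0 = disease absent) with reemergence probability p01
    and die-out probability p10; counts are Poisson(lam) when present
    and structurally zero when absent."""
    s, ys, states = 0, [], []
    for _ in range(T):
        s = int((rng.random() < p01) if s == 0 else (rng.random() >= p10))
        if s == 0:
            ys.append(0)
        else:
            # Poisson draw by inversion
            n, u, term = 0, rng.random(), math.exp(-lam)
            cdf = term
            while u > cdf:
                n += 1
                term *= lam / n
                cdf += term
            ys.append(n)
        states.append(s)
    return ys, states

rng = random.Random(7)
ys, states = simulate_zero_state(2000, p01=0.1, p10=0.2, lam=5.0, rng=rng)
```

The stationary presence probability is p01 / (p01 + p10), so the simulated series spends about a third of its time "present" here, and zeros dominate, mimicking the excess-zero pattern the model targets.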
64. A zero-inflated non-negative matrix factorization for the deconvolution of mixed signals of biological data.
- Author: Kong, Yixin, Kozik, Ariangela, Nakatsu, Cindy H., Jones-Hall, Yava L., and Chun, Hyonho
- Subjects: MATRIX decomposition, NONNEGATIVE matrices, GENE expression, TRANSCRIPTOMES, DECONVOLUTION (Mathematics)
- Abstract
Latent factor models for count data are widely applied to deconvolute mixed signals in biological data, as exemplified by sequencing data in transcriptome or microbiome studies. Due to the availability of pure samples such as single-cell transcriptome data, the accuracy of the estimates could be much improved. However, this advantage quickly disappears in the presence of excessive zeros. To correctly account for this phenomenon in both mixed and pure samples, we propose a zero-inflated non-negative matrix factorization and derive an effective multiplicative parameter updating rule. In simulation studies, our method yielded the smallest bias. We applied our approach to brain gene expression as well as fecal microbiome datasets, illustrating the superior performance of the approach. Our method is implemented as a publicly available R package, iNMF. [ABSTRACT FROM AUTHOR]
- Published: 2022
65. An Omnibus Test for Differential Distribution Analysis of Continuous Microbiome Data
- Author: Xiang Lin, Jie Zhang, Zhi Wei, and Turki Turki
- Subjects: Microbiome, zero-inflation, omnibus test, statistical modeling, microarray, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
The human microbiome comprises thousands of microbial species. These species can substantially influence normal human physiology and cause numerous diseases. Microbiome data can be measured by sequencing, microarray, or other technologies. With the fast development of these technologies, downstream analysis methods should also be designed to effectively and accurately discover the valuable information hidden in the data. Many methods have been designed for microbiome count data; however, to our knowledge, only a few methods have been developed for continuous microbiome data. Many microbiome datasets have an over-dispersed and zero-inflated structure. Traditional methods rarely characterize this structure and focus only on differences in abundance between samples. In this study, we introduce a novel method, the zero-inflated gamma (ZIG) omnibus test, specifically for continuous, zero-inflated microbiome data. In this test, abundance is tested along with zero prevalence and dispersion. We compared this method with five other popular methods and found that the ZIG omnibus test has significantly higher power and a similar or lower false-positive rate than the competing methods on simulated data. It also identified more previously validated microbes in real data from tonsil cancer. We conclude that the ZIG omnibus test is a robust method across various biological conditions for differential expression testing of microbiome data.
- Published: 2021
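The zero-inflated gamma at the core of this test mixes a point mass at zero with a gamma density for the positive part, so the log-likelihood separates cleanly into a zero term and a positive term; a stdlib sketch (our parameterization only; the actual ZIG omnibus test additionally models covariate effects on all three components):

```python
import math

def zig_loglik(ys, pi, shape, scale):
    """Log-likelihood of a zero-inflated gamma: P(Y = 0) = pi, and
    Y | Y > 0 ~ Gamma(shape, scale).  Zeros and positives contribute
    separable terms, which is what lets abundance, zero prevalence
    and dispersion be examined jointly."""
    ll = 0.0
    for y in ys:
        if y == 0:
            ll += math.log(pi)
        else:
            ll += (math.log(1 - pi) + (shape - 1) * math.log(y)
                   - y / scale - math.lgamma(shape) - shape * math.log(scale))
    return ll
```

An omnibus test in this spirit compares such likelihoods between groups allowing pi, shape (dispersion), and the mean to differ, rather than testing abundance alone.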
66. Nonparametric Copula Estimation for Mixed Insurance Claim Data.
- Author: Yang, Lu
- Subjects: INSURANCE claims, NONPARAMETRIC estimation, GOVERNMENT insurance, PROPERTY insurance, INSURANCE funding, COPULA functions
- Abstract
Multivariate claim data are common in insurance applications, for example, claims of each policyholder from different types of insurance coverages. Understanding the dependencies among such multivariate risks is critical to the solvency and profitability of insurers. Effectively modeling insurance claim data is challenging due to their special complexities. At the policyholder level, claim outcomes usually follow a two-part mixed distribution: a probability mass at zero corresponding to no claim and an otherwise positive claim from a skewed and long-tailed distribution. To simultaneously accommodate the complex features of the marginal distributions while flexibly quantifying the dependencies among multivariate claims, copula models are commonly used. Although a substantial body of literature focusing on copulas with continuous outcomes has emerged, some key steps do not carry over to mixed data. In particular, existing nonparametric copula estimators are not consistent for mixed data, and thus copula specification and diagnostics for mixed outcomes have been a problem. However, insurance is a closely regulated industry in which model validation is particularly important, and it is essential to develop a baseline nonparametric copula estimator to identify the underlying dependence structure. In this article, we fill in this gap by developing a nonparametric copula estimator for mixed data. We show the uniform convergence of the proposed nonparametric copula estimator. Through simulation studies, we demonstrate that the proportion of zeros plays a key role in the finite sample performance of the proposed estimator. Using the claim data from the Wisconsin Local Government Property Insurance Fund, we illustrate that our nonparametric copula estimator can assist analysts in identifying important features of the underlying dependence structure, revealing how different claims or risks are related to one another. [ABSTRACT FROM AUTHOR]
- Published: 2022
67. Prioritizing Autism Risk Genes Using Personalized Graphical Models Estimated From Single-Cell RNA-seq Data.
- Author: Liu, Jianyu, Wang, Haodong, Sun, Wei, and Liu, Yufeng
- Subjects: *AUTISM, *AUTISTIC children, *RYANODINE receptors, *RNA sequencing, *GENES, *PROTEIN receptors, *CHILDREN with autism spectrum disorders
- Abstract
Hundreds of autism risk genes have been reported recently, mainly based on genetic studies where these risk genes have more de novo mutations in autism subjects than healthy controls. However, as a complex disease, autism is likely associated with more risk genes, and many of them may not be identifiable through de novo mutations. We hypothesize that more autism risk genes can be identified through their connections with known autism risk genes in personalized gene–gene interaction graphs. We estimate such personalized graphs using single-cell RNA sequencing (scRNA-seq) while appropriately modeling the cell dependence and possible zero-inflation in the scRNA-seq data. The sample size, which is the number of cells per individual, ranges from 891 to 1241 in our case study using scRNA-seq data in autism subjects and controls. We consider 1500 genes in our analysis. Since the number of genes is larger than or comparable to the sample size, we perform penalized estimation. We score each gene's relevance by applying a simple graph kernel smoothing method to each personalized graph. The molecular functions of the top-scored genes are related to autism. For example, a candidate gene RYR2, which encodes the protein ryanodine receptor 2, is involved in neurotransmission, a process that is impaired in ASD patients. While our method provides a systematic and unbiased approach to prioritize autism risk genes, the relevance of these genes needs to be further validated in functional studies. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. [ABSTRACT FROM AUTHOR]
- Published: 2022
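The "simple graph kernel smoothing" used above to propagate relevance from known risk genes can be illustrated as iterated neighbour averaging over a graph; a toy sketch (the specific update rule, alpha, and iteration count are our assumptions, not the authors' exact kernel):

```python
def smooth_scores(adj, scores, alpha=0.5, iters=10):
    """Iteratively pull each node's score toward the mean score of its
    neighbours: s_i <- (1 - alpha) * s_i + alpha * mean_{j in N(i)} s_j.
    Nodes connected to high-scoring (known risk) genes gain score."""
    s = list(scores)
    for _ in range(iters):
        s = [(1 - alpha) * s[i]
             + alpha * (sum(s[j] for j in adj[i]) / len(adj[i]) if adj[i] else s[i])
             for i in range(len(s))]
    return s

# Path graph 0 - 1 - 2 with a single known risk gene seeded at node 0:
adj = {0: [1], 1: [0, 2], 2: [1]}
smoothed = smooth_scores(adj, [1.0, 0.0, 0.0], alpha=0.5, iters=5)
```

After smoothing, nodes 1 and 2 acquire positive scores through their connection to node 0, which is the mechanism by which unflagged genes linked to known risk genes get prioritized.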
68. Compositional zero-inflated network estimation for microbiome data
- Author: Min Jin Ha, Junghi Kim, Jessica Galloway-Peña, Kim-Anh Do, and Christine B. Peterson
- Subjects: Microbiome, Network, Graphical model, Zero-inflation, Compositional data, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
- Abstract
Background: The estimation of microbial networks can provide important insight into the ecological relationships among the organisms that comprise the microbiome. However, there are a number of critical statistical challenges in the inference of such networks from high-throughput data. Since the abundances in each sample are constrained to have a fixed sum and there is incomplete overlap in microbial populations across subjects, the data are both compositional and zero-inflated.
Results: We propose the COmpositional Zero-Inflated Network Estimation (COZINE) method for inference of microbial networks which addresses these critical aspects of the data while maintaining computational scalability. COZINE relies on the multivariate Hurdle model to infer a sparse set of conditional dependencies which reflect not only relationships among the continuous values, but also among binary indicators of presence or absence and between the binary and continuous representations of the data. Our simulation results show that the proposed method is better able to capture various types of microbial relationships than existing approaches. We demonstrate the utility of the method with an application to understanding the oral microbiome network in a cohort of leukemic patients.
Conclusions: Our proposed method addresses important challenges in microbiome network estimation, and can be effectively applied to discover various types of dependence relationships in microbial communities. The procedure we have developed, which we refer to as COZINE, is available online at https://github.com/MinJinHa/COZINE.
- Published
- 2020
- Full Text
- View/download PDF
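The entry above hinges on microbiome counts being both compositional (fixed-sum) and zero-inflated, with COZINE's Hurdle model splitting each taxon into a presence indicator and a continuous part. As a minimal illustration of that data structure only (hypothetical counts, plain Python; this is not the COZINE estimator), the sketch below normalizes samples to relative abundances and forms the binary/continuous decomposition:

```python
import math

# Hypothetical OTU count table: rows = samples, columns = taxa.
counts = [
    [120, 0, 30, 0],
    [0, 45, 15, 90],
    [60, 0, 0, 40],
]

def to_composition(row):
    """Normalize a count vector to relative abundances (fixed sum of 1)."""
    total = sum(row)
    return [c / total for c in row]

compositions = [to_composition(r) for r in counts]

# Hurdle-style decomposition: a binary presence indicator plus the
# continuous (log-abundance) part, defined only where the taxon is present.
presence = [[int(c > 0) for c in r] for r in counts]
log_abundance = [[math.log(p) if p > 0 else None for p in r]
                 for r in compositions]

for comp in compositions:
    assert abs(sum(comp) - 1.0) < 1e-9  # compositional constraint
```

The fixed-sum constraint is what makes ordinary correlation analysis misleading here, and the `None` entries show why a separate binary layer is needed for absent taxa.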
69. BZINB Model-Based Pathway Analysis and Module Identification Facilitates Integration of Microbiome and Metabolome Data
- Author
-
Bridget M. Lin, Hunyong Cho, Chuwen Liu, Jeff Roach, Apoena Aguiar Ribeiro, Kimon Divaris, and Di Wu
- Subjects
correlation ,microbiome ,metabolomics ,multi-omics ,zero-inflation ,counts ,Biology (General) ,QH301-705.5 - Abstract
Integration of multi-omics data is a challenging but necessary step to advance our understanding of the biology underlying human health and disease processes. To date, investigations seeking to integrate multi-omics (e.g., microbiome and metabolome) employ simple correlation-based network analyses; however, these methods are not always well-suited for microbiome analyses because they do not accommodate the excess zeros typically present in these data. In this paper, we introduce a bivariate zero-inflated negative binomial (BZINB) model-based network and module analysis method that addresses this limitation and improves microbiome–metabolome correlation-based model fitting by accommodating excess zeros. We use real and simulated data based on a multi-omics study of childhood oral health (ZOE 2.0; investigating early childhood dental caries, ECC) and find that the accuracy of the BZINB model-based correlation method is superior compared to Spearman’s rank and Pearson correlations in terms of approximating the underlying relationships between microbial taxa and metabolites. The new method, BZINB-iMMPath, facilitates the construction of metabolite–species and species–species correlation networks using BZINB and identifies modules of (i.e., correlated) species by combining BZINB and similarity-based clustering. Perturbations in correlation networks and modules can be efficiently tested between groups (i.e., healthy and diseased study participants). Upon application of the new method in the ZOE 2.0 study microbiome–metabolome data, we identify that several biologically-relevant correlations of ECC-associated microbial taxa with carbohydrate metabolites differ between healthy and dental caries-affected participants. 
In sum, we find that the BZINB model is a useful alternative to Spearman or Pearson correlations for estimating the underlying correlation of zero-inflated bivariate count data and thus is suitable for integrative analyses of multi-omics data such as those encountered in microbiome and metabolome studies.
- Published
- 2023
- Full Text
- View/download PDF
70. EXPOSURE EFFECTS ON COUNT OUTCOMES WITH OBSERVATIONAL DATA, WITH APPLICATION TO INCARCERATED WOMEN.
- Author
-
Shook-Sa BE, Hudgens MG, Knittel AK, Edmonds A, Ramirez C, Cole SR, Cohen M, Adedimeji A, Taylor T, Michel KG, Kovacs A, Cohen J, Donohue J, Foster A, Fischl MA, Long D, and Adimora AA
- Abstract
Causal inference methods can be applied to estimate the effect of a point exposure or treatment on an outcome of interest using data from observational studies. For example, in the Women's Interagency HIV Study, it is of interest to understand the effects of incarceration on the number of sexual partners and the number of cigarettes smoked after incarceration. In settings like this where the outcome is a count, the estimand is often the causal mean ratio, i.e., the ratio of the counterfactual mean count under exposure to the counterfactual mean count under no exposure. This paper considers estimators of the causal mean ratio based on inverse probability of treatment weights, the parametric g-formula, and doubly robust estimation, each of which can account for overdispersion, zero-inflation, and heaping in the measured outcome. Methods are compared in simulations and are applied to data from the Women's Interagency HIV Study.
- Published
- 2024
- Full Text
- View/download PDF
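The entry above compares IPTW, g-formula, and doubly robust estimators of the causal mean ratio for counts. As a rough, self-contained sketch of the simplest of these (hypothetical data, propensity scores taken as known for brevity; not the authors' implementation, which also handles overdispersion, zero-inflation, and heaping), a Hajek-style IPTW estimate looks like:

```python
# Hypothetical observational data: A = exposure (0/1), Y = count outcome,
# e = propensity score P(A=1 | covariates). In practice e would be
# estimated, e.g. by logistic regression on the confounders.
A = [1, 1, 0, 0, 1, 0, 0, 1]
Y = [4, 7, 1, 0, 5, 2, 0, 6]
e = [0.6, 0.7, 0.3, 0.4, 0.5, 0.2, 0.3, 0.6]

def iptw_mean_ratio(A, Y, e):
    """Hajek (ratio-normalized) IPTW estimate of E[Y(1)] / E[Y(0)]."""
    w1 = [a / p for a, p in zip(A, e)]               # weights, exposed
    w0 = [(1 - a) / (1 - p) for a, p in zip(A, e)]   # weights, unexposed
    mean1 = sum(w * y for w, y in zip(w1, Y)) / sum(w1)
    mean0 = sum(w * y for w, y in zip(w0, Y)) / sum(w0)
    return mean1 / mean0

ratio = iptw_mean_ratio(A, Y, e)
```

Each counterfactual mean is a weighted average of observed outcomes, with weights that up-weight subjects whose exposure was unlikely given their covariates.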
71. P-Thinned Gamma Process and Corresponding Random Walk
- Author
-
Jordanova, Pavlina, Stehlík, Milan, Hutchison, David, Editorial Board Member, Kanade, Takeo, Editorial Board Member, Kittler, Josef, Editorial Board Member, Kleinberg, Jon M., Editorial Board Member, Mattern, Friedemann, Editorial Board Member, Mitchell, John C., Editorial Board Member, Naor, Moni, Editorial Board Member, Pandu Rangan, C., Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Terzopoulos, Demetri, Editorial Board Member, Tygar, Doug, Editorial Board Member, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Dimov, Ivan, editor, Faragó, István, editor, and Vulkov, Lubin, editor
- Published
- 2019
- Full Text
- View/download PDF
72. Analysis of zero inflated dichotomous variables from a Bayesian perspective: application to occupational health.
- Author
-
Moriña, David, Puig, Pedro, and Navarro, Albert
- Subjects
- *
INDUSTRIAL hygiene , *DEFAULT (Finance) , *PRESENTEEISM (Labor) - Abstract
Background: Zero-inflated models are generally aimed at addressing the problem that arises from having two different sources that generate the zero values observed in a distribution. In practice, this is due to the fact that the population studied actually consists of two subpopulations: one in which the value zero occurs by default (structural zero) and the other in which it is circumstantial (sample zero). Methods: This work proposes a new methodology to fit zero-inflated Bernoulli data from a Bayesian approach, able to distinguish between two potential sources of zeros (structural and non-structural). Results: The proposed methodology's performance has been evaluated through a comprehensive simulation study, and it has been compiled as an R package freely available to the community. Its usage is illustrated by means of a real example from the field of occupational health: the phenomenon of sickness presenteeism, in which it is reasonable to think that some individuals will never be at risk of suffering it because they have not been sick in the period of study (structural zeros). Without separating structural and non-structural zeros, one would be jointly studying the general health status and the presenteeism itself, and therefore obtaining potentially biased estimates, as the phenomenon is implicitly underestimated by diluting it into the general health status. Conclusions: The proposed methodology is able to distinguish two different sources of zeros (structural and non-structural) from dichotomous data with or without covariates in a Bayesian framework, and has been made available to any interested researcher in the form of the bayesZIB R package (https://cran.r-project.org/package=bayesZIB). [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
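The core of the zero-inflated Bernoulli model described above can be written down in a few lines. The sketch below (plain Python, with generic notation that may differ from the bayesZIB package) gives the marginal probabilities and a simulator that mixes structural zeros with ordinary Bernoulli "sample" zeros:

```python
import random

def zib_pmf(y, pi, p):
    """Zero-inflated Bernoulli: with probability pi the zero is structural;
    otherwise Y ~ Bernoulli(p), so sample zeros can still occur."""
    if y == 0:
        return pi + (1 - pi) * (1 - p)
    return (1 - pi) * p

def zib_sample(n, pi, p, seed=42):
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        if rng.random() < pi:
            out.append(0)                      # structural zero (never at risk)
        else:
            out.append(int(rng.random() < p))  # sample zero or a one
    return out

# The two probabilities sum to one for any (pi, p).
assert abs(zib_pmf(0, 0.3, 0.6) + zib_pmf(1, 0.3, 0.6) - 1.0) < 1e-12
```

Note that the data alone cannot label an individual zero as structural or circumstantial; the model separates the two sources only at the level of the mixture probabilities.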
73. Poisson–Tweedie mixed-effects model: A flexible approach for the analysis of longitudinal RNA-seq data.
- Author
-
Signorelli, Mirko, Spitali, Pietro, and Tsonaka, Roula
- Subjects
- *
MAXIMUM likelihood statistics , *RNA sequencing , *NUCLEOTIDE sequencing - Abstract
We present a new modelling approach for longitudinal overdispersed counts that is motivated by the increasing availability of longitudinal RNA-sequencing experiments. The distribution of RNA-seq counts typically exhibits overdispersion, zero-inflation and heavy tails; moreover, in longitudinal designs repeated measurements from the same subject are typically (positively) correlated. We propose a generalized linear mixed model based on the Poisson–Tweedie distribution that can flexibly handle each of the aforementioned features of longitudinal overdispersed counts. We develop a computational approach to accurately evaluate the likelihood of the proposed model and to perform maximum likelihood estimation. Our approach is implemented in the R package ptmixed, which can be freely downloaded from CRAN. We assess the performance of ptmixed on simulated data, and we present an application to a dataset with longitudinal RNA-sequencing measurements from healthy and dystrophic mice. The applicability of the Poisson–Tweedie mixed-effects model is not restricted to longitudinal RNA-seq data, but it extends to any scenario where non-independent measurements of a discrete overdispersed response variable are available. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
74. Improving invasive species management using predictive phenology models: an example from brown marmorated stink bug (Halyomorpha halys) in Japan.
- Author
-
Kamiyama, Matthew T, Matsuura, Kenji, Yoshimura, Tsuyoshi, and Yang, Chin‐Cheng Scotty
- Subjects
STINKBUGS ,BROWN marmorated stink bug ,INTRODUCED species ,PREDICTION models ,INTRODUCED insects ,PLANT phenology ,INSECT pests - Abstract
BACKGROUND: In order to better understand the population dynamics of invasive species in their native range, we developed two predictive phenological models using the ubiquitous invasive insect pest, Halyomorpha halys (Stål) (Hemiptera: Pentatomidae), as the model organism. Our work establishes a zero‐inflated negative binomial regression (ZINB) model, and a general additive mixed model (GAMM) based on 11 years of black light trap monitoring of H. halys at three locations in Japan. RESULTS: The ZINB model indicated that degree days (DD) have a significant effect on the trap catch of adult H. halys, and that precipitation has no effect. A dataset generated by 1000 simulations from the ZINB suggested that higher predicted trap catches equated to a lower probability of encountering a zero‐count. The GAMM produced a cubic regression smooth curve which forecasts the seasonal phenology of H. halys as following a bell‐shaped trend in Japan. Critical DD points during the field season in Japan included 261 DD for first H. halys adult detection and 1091 DD for peak activity. CONCLUSIONS: This study establishes the first models capable of forecasting native H. halys population dynamics based on DD. These robust models practically improve population forecasting of H. halys in the future and help fill gaps in knowledge pertaining to its native phenology, thus ultimately contributing to the progression of efficient management of this globally invasive species. © 2021 Society of Chemical Industry. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
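One finding above is that higher predicted trap catches imply a lower probability of encountering a zero count; this falls directly out of the ZINB zero probability. The sketch below (a generic ZINB parameterization with hypothetical dispersion and zero-inflation values, not the fitted model from the paper) computes P(Y = 0) = pi + (1 - pi) * (r / (r + mu))^r and checks that it declines as the mean catch mu grows:

```python
def zinb_zero_prob(mu, r, pi):
    """P(Y = 0) under a zero-inflated negative binomial with mean mu,
    dispersion r, and zero-inflation probability pi."""
    nb_zero = (r / (r + mu)) ** r       # NB(r, mu) mass at zero
    return pi + (1 - pi) * nb_zero

# Probability of an empty trap shrinks as the expected catch grows,
# but never below the structural-zero floor pi.
probs = [zinb_zero_prob(mu, r=1.5, pi=0.2) for mu in (0.5, 2.0, 8.0, 32.0)]
assert all(a > b for a, b in zip(probs, probs[1:]))
```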
75. Historical maps confirm the accuracy of zero-inflated model predictions of ancient tree abundance in English wood-pastures.
- Author
-
Nolan, Victoria, Reader, Tom, Gilbert, Francis, and Atkinson, Nick
- Subjects
- *
HISTORICAL maps , *PREDICTION models , *WOOD decay , *SOIL classification , *TREES , *KNOWLEDGE gap theory - Abstract
1. Ancient trees have important ecological, historical and social connections, and are a key source of dead and decaying wood, a globally declining resource. Wood-pastures, which combine livestock grazing, open spaces and scattered trees, are significant reservoirs of ancient trees, yet information about their true abundance within wood-pastures is limited. England has extensive databases of both ancient trees and wood-pasture habitat, providing a unique opportunity for the first large-scale, national case study to address this knowledge gap. 2. We investigated the relationship between the abundance of ancient trees in a large sample of English wood-pastures (5,571) and various unique environmental, historical and anthropogenic predictors, to identify wood-pastures with high numbers of undiscovered ancient trees. A major challenge in many modelling studies is obtaining independent data for model verification: here we introduce a novel model verification step using a series of historic maps with detailed records of trees to validate our model predictions. This desk-based method enables rapid verification of model predictions using completely independent data across a large geographical area, without the need for, or limitations associated with, extensive field surveys. 3. Historic map verification estimates correlated well with model predictions of tree abundance. Model predictions suggest there are ~101,400 undiscovered ancient trees in all wood-pastures in England, around 10 times the total current number of ancient tree records. Important predictors of ancient tree abundance included wood-pasture area, distance to several features including cities, commons, historic Royal forests and Tudor deer parks, and different types of soil and land classes. 4. Synthesis and applications. 
Historical maps and statistical models can be used in combination to produce accurate predictions of ancient tree abundance in wood-pastures, and inform future targeted surveys of wood-pasture habitat, with a focus on those deemed to have undiscovered ancient trees. This study provides support for improvements to conservation policy and protection measures for ancient trees and wood-pastures. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
76. Principal component analysis for zero-inflated compositional data.
- Author
-
Kim, Kipoong, Park, Jaesung, and Jung, Sungkyu
- Subjects
- *
PRINCIPAL components analysis , *CONSTRAINED optimization , *DNA sequencing , *PROBLEM solving - Abstract
Recent advances in DNA sequencing technology have led to a growing interest in microbiome data. Since the data are often high-dimensional, there is a clear need for dimensionality reduction. However, the compositional nature and zero-inflation of microbiome data present many challenges in developing new methodologies. New PCA methods for zero-inflated compositional data are presented, based on a novel framework called principal compositional subspace. These methods aim to identify both the principal compositional subspace and the corresponding principal scores that best approximate the given data, ensuring that their reconstruction remains within the compositional simplex. To this end, the constrained optimization problems are established and alternating minimization algorithms are provided to solve the problems. The theoretical properties of the principal compositional subspace, particularly focusing on its existence and consistency, are further investigated. Simulation studies have demonstrated that the methods achieve lower reconstruction errors than the existing log-ratio PCA in the presence of a linear pattern and have shown comparable performance in a curved pattern. The methods have been applied to four microbiome compositional datasets with excessive zeros, successfully recovering the underlying low-rank structure. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
77. Model-Based Microbiome Data Ordination: A Variational Approximation Approach.
- Author
-
Zeng, Yanyan, Zhao, Hongyu, and Wang, Tao
- Subjects
- *
ORDINATION , *PRINCIPAL components analysis , *GUT microbiome , *DATA reduction , *HUMAN body - Abstract
The coevolution between human and bacteria colonizing the human body has profound implications for health and development, with a growing body of evidence linking the altered microbiome composition with a wide array of disease states. Yet dimension reduction and visualization analysis of microbiome data are still in their infancy and many challenges exist. In this article, we introduce a general framework, zero-inflated probabilistic principal component analysis (ZIPPCA), for dimension reduction and data ordination of multivariate abundance data, and propose an efficient variational approximation method for estimation, inference, and prediction. Extensive simulations show that the proposed method outperforms algorithm-based methods and compares favorably with existing model-based methods. We further apply our method to a gut microbiome dataset for visualization analysis of community composition across age and geography. The method is implemented in R and available at . [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
78. Joint modeling of longitudinal data with informative cluster size adjusted for zero‐inflation and a dependent terminal event.
- Author
-
Shen, Biyi, Chen, Chixiang, Liu, Danping, Datta, Somnath, Ghahramani, Nasrollah, Chinchilli, Vernon M., and Wang, Ming
- Subjects
- *
ACUTE kidney failure , *DATA modeling , *PARAMETER estimation , *SIZE - Abstract
Repeated measures are often collected in longitudinal follow-up from clinical trials and observational studies. In many situations, these measures are adherent to some specific event and are only available when it occurs; an example is serum creatinine from laboratory tests for hospitalized acute kidney injuries. The frequency of event recurrences is potentially correlated with overall health condition and hence may influence the distribution of the outcome measure of interest, leading to informative cluster size. In particular, there may be a large portion of subjects without any events, and thus no longitudinal measures available, which may be due to insusceptibility to such events or censoring before any events; this zero-inflated nature of the data needs to be taken into account. On the other hand, there often exists a terminal event that may be correlated with the recurrent events. Previous work in this area suffered from the limitation that not all these issues were handled simultaneously. To address this deficiency, we propose a novel joint modeling approach for longitudinal data adjusting for zero-inflated and informative cluster size as well as a terminal event. A three-stage semiparametric likelihood-based approach is applied for parameter estimation and inference. Extensive simulations are conducted to evaluate the performance of our proposal. Finally, we utilize the Assessment, Serial Evaluation, and Subsequent Sequelae of Acute Kidney Injury (ASSESS-AKI) study for illustration. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
79. Semiparametric analysis of zero‐inflated recurrent events with a terminal event.
- Author
-
Ma, Chenchen, Hu, Tao, and Lin, Zhantao
- Subjects
- *
BERNSTEIN polynomials , *MYOCARDIAL infarction , *LONGITUDINAL method , *SIEVES , *CLINICAL trials - Abstract
Recurrent event data frequently arise in longitudinal studies, and observations on recurrent events can be terminated by a major failure event such as death. In many situations, there exists a large fraction of subjects without any recurrent events of interest. Among these subjects, some are unsusceptible to recurrent events, while others are susceptible but have no recurrent events observed due to censoring. In this article, we propose a zero-inflated generalized joint frailty model and a sieve maximum likelihood approach to analyze zero-inflated recurrent events with a terminal event. The model provides considerable flexibility in formulating the effects of covariates on both recurrent events and the terminal event by specifying various transformation functions. In addition, Bernstein polynomials are employed to approximate the unknown cumulative baseline hazard (intensity) function. The estimation procedure can be easily implemented and is computationally fast. Extensive simulation studies are conducted and demonstrate that our proposed method works well for practical situations. Finally, we apply the method to analyze myocardial infarction recurrences in the presence of death in a clinical trial with cardiovascular outcomes. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
80. A pathway for multivariate analysis of ecological communities using copulas
- Author
-
Marti J. Anderson, Perry de Valpine, Andrew Punnett, and Arden E. Miller
- Subjects
abundance data ,discrete counts ,over‐dispersion ,species’ associations ,statistical model ,zero‐inflation ,Ecology ,QH540-549.5 - Abstract
Abstract We describe a new pathway for multivariate analysis of data consisting of counts of species abundances that includes two key components: copulas, to provide a flexible joint model of individual species, and dissimilarity‐based methods, to integrate information across species and provide a holistic view of the community. Individual species are characterized using suitable (marginal) statistical distributions, with the mean, the degree of over‐dispersion, and/or zero‐inflation being allowed to vary among a priori groups of sampling units. Associations among species are then modeled using copulas, which allow any pair of disparate types of variables to be coupled through their cumulative distribution function, while maintaining entirely the separate individual marginal distributions appropriate for each species. A Gaussian copula smoothly captures changes in an index of association that excludes joint absences in the space of the original species variables. A permutation‐based filter with exact family‐wise error can optionally be used a priori to reduce the dimensionality of the copula estimation problem. We describe in detail a Monte Carlo expectation maximization algorithm for efficient estimation of the copula correlation matrix with discrete marginal distributions (counts). The resulting fully parameterized copula models can be used to simulate realistic ecological community data under fully specified null or alternative hypotheses. Distributions of community centroids derived from simulated data can then be visualized in ordinations of ecologically meaningful dissimilarity spaces. Multinomial mixtures of data drawn from copula models also yield smooth power curves in dissimilarity‐based settings. Our proposed analysis pathway provides new opportunities to combine model‐based approaches with dissimilarity‐based methods to enhance understanding of ecological systems. 
We demonstrate implementation of the pathway through an ecological example, where associations among fish species were found to increase after the establishment of a marine reserve.
- Published
- 2019
- Full Text
- View/download PDF
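The key construction in the entry above, coupling discrete marginal distributions through a Gaussian copula while leaving each margin intact, can be sketched with the standard library alone. The toy below (hypothetical means and correlation; not the authors' Monte Carlo EM estimator) draws a correlated bivariate normal, maps each coordinate through the normal CDF, and inverts the Poisson CDF to obtain correlated counts with the chosen margins:

```python
import math
import random

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def poisson_ppf(u, mu):
    """Smallest k with Poisson(mu) CDF >= u (quantile function)."""
    u = min(u, 1.0 - 1e-12)          # guard against u rounding to 1.0
    k, pmf = 0, math.exp(-mu)
    cdf = pmf
    while cdf < u:
        k += 1
        pmf *= mu / k
        cdf += pmf
    return k

def gaussian_copula_counts(n, mu1, mu2, rho, seed=1):
    """Correlated Poisson pair via a Gaussian copula: the margins stay
    exactly Poisson(mu1) and Poisson(mu2) while rho induces association."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0.0, 1.0)
        pairs.append((poisson_ppf(phi(z1), mu1), poisson_ppf(phi(z2), mu2)))
    return pairs

sample = gaussian_copula_counts(5000, mu1=3.0, mu2=1.0, rho=0.7)
```

Swapping the Poisson quantile for a zero-inflated or overdispersed one changes only the margins, which is exactly the flexibility the pathway exploits.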
81. Robust and Powerful Differential Composition Tests for Clustered Microbiome Data.
- Author
-
Tang, Zheng-Zheng and Chen, Guanhua
- Abstract
Thanks to advances in high-throughput sequencing technologies, the importance of microbiome to human health and disease has been increasingly recognized. Analyzing microbiome data from sequencing experiments is challenging due to their unique features such as compositional data, excessive zero observations, overdispersion, and complex relations among microbial taxa. Clustered microbiome data have become prevalent in recent years from designs such as longitudinal studies, family studies, and matched case–control studies. The within-cluster dependence compounds the challenge of the microbiome data analysis. Methods that properly accommodate intra-cluster correlation and features of the microbiome data are needed. We develop robust and powerful differential composition tests for clustered microbiome data. The methods do not rely on any distributional assumptions on the microbial compositions, which provides flexibility to model various correlation structures among taxa and among samples within a cluster. By leveraging the adjusted sandwich covariance estimate, the methods properly accommodate sample dependence within a cluster. The two-part version of the test can further improve power in the presence of excessive zero observations. Different types of confounding variables can be easily adjusted for in the methods. We perform extensive simulation studies under commonly adopted clustered data designs to evaluate the methods. We demonstrate that the methods properly control the type I error under all designs and are more powerful than existing methods in many scenarios. The usefulness of the proposed methods is further demonstrated with two real datasets from longitudinal microbiome studies on pregnant women and inflammatory bowel disease patients. The methods have been incorporated into the R package "miLineage" publicly available at https://tangzheng1.github.io/tanglab/software.html. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
82. A multilevel zero-inflated Conway–Maxwell type negative binomial model for analysing clustered count data.
- Author
-
Gholiabad, Somayeh Ghorbani, Moghimbeigi, Abbas, Faradmal, Javad, and Baghestani, Ahmad Reza
- Subjects
- *
NEGATIVE binomial distribution , *EXPECTATION-maximization algorithms , *INFERENTIAL statistics , *PARAMETER estimation - Abstract
Basic negative binomial models can only capture over-dispersed count responses, because the variance of the distribution is always greater than the mean. They are therefore not the best choice when the data are under-dispersed or show less dispersion than the negative binomial. Over recent years, a variety of new distributions that can accommodate a wide range of dispersion in count data have been introduced. One of these novel distributions is the Conway–Maxwell type negative binomial distribution. In biomedical studies, count data commonly exhibit excess zeros and a pattern of dispersion. Also, the observations may be correlated in clusters or longitudinally. Here, we propose a multilevel zero-inflated Conway–Maxwell type negative binomial model. Statistical inference is carried out via an expectation-maximization algorithm for parameter estimation. The model's performance is illustrated by simulation studies and with a real data set. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
83. Explaining the Strength of RO Responses to Coups
- Author
-
Hohlstein, Franziska, author
- Published
- 2022
- Full Text
- View/download PDF
84. Copula-based Markov zero-inflated count time series models with application.
- Author
-
Alqawba, Mohammed and Diawara, Norou
- Subjects
- *
TIME series analysis , *COPULA functions , *DISTRIBUTION (Probability theory) , *MARKOV processes , *GAUSSIAN function , *SANDSTORMS - Abstract
Count time series data with excess zeros are observed in several applied disciplines. When these zero-inflated counts are sequentially recorded, they might result in serial dependence. Ignoring the zero-inflation and the serial dependence might produce inaccurate results. In this paper, Markov zero-inflated count time series models based on a joint distribution on consecutive observations are proposed. The joint distribution function of the consecutive observations is constructed through copula functions. First- and second-order Markov chains are considered with the univariate margins of zero-inflated Poisson (ZIP), zero-inflated negative binomial (ZINB), or zero-inflated Conway–Maxwell–Poisson (ZICMP) distributions. Under the Markov models, bivariate copula functions such as the bivariate Gaussian, Frank, and Gumbel are chosen to construct a bivariate distribution of two consecutive observations. Moreover, the trivariate Gaussian and max-infinitely divisible copula functions are considered to build the joint distribution of three consecutive observations. Likelihood-based inference is performed and asymptotic properties are studied. To evaluate the estimation method and the asymptotic results, simulated examples are examined. The proposed class of models is applied to a sandstorm count example. The results suggest that the proposed models have some advantages over some of the models in the literature for modeling zero-inflated count time series data. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
85. The consequences of checking for zero‐inflation and overdispersion in the analysis of count data.
- Author
-
Campbell, Harlan and O'Hara, Robert B.
- Subjects
SELECTION bias (Statistics) ,DATA analysis ,FALSE positive error ,TEST scoring ,NULL hypothesis ,SAMPLE size (Statistics) - Abstract
Count data are ubiquitous in ecology and the Poisson generalized linear model (GLM) is commonly used to model the association between counts and explanatory variables of interest. When fitting this model to the data, one typically proceeds by first confirming that the model assumptions are satisfied. If the residuals appear to be overdispersed or if there is zero-inflation, key assumptions of the Poisson GLM may be violated and researchers will then typically consider alternatives to the Poisson GLM. An important question is whether the potential model selection bias introduced by this data-driven multi-stage procedure merits concern. Here we conduct a large-scale simulation study to investigate the potential consequences of model selection bias that can arise in the simple scenario of analysing a sample of potentially overdispersed, potentially zero-inflated, count data. Specifically, we investigate model selection procedures recently recommended by Blasco-Moreno et al. (2019) using either a series of score tests or information theoretic criteria to select the best model. We find that, when sample sizes are small, model selection based on preliminary score tests (or information theoretic criteria, e.g. AIC, BIC) can lead to potentially substantial inflation of false positive rates (i.e. type 1 error inflation). When sample sizes are sufficiently large, model selection based on preliminary score tests is not problematic. Ignoring the possibility of overdispersion and zero-inflation during data analyses can lead to invalid inference. However, if one does not have sufficient power to test for overdispersion and zero-inflation, post hoc model selection may also lead to substantial bias. This 'catch-22' suggests that, if sample sizes are small, a healthy skepticism is warranted whenever one rejects the null hypothesis of no association between a given outcome and covariate. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
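The preliminary checks discussed in the entry above are easy to state concretely. As a minimal sketch (simple moment-based diagnostics for an intercept-only fit, not the score tests of Blasco-Moreno et al.), the snippet below computes the Pearson dispersion statistic for a Poisson model with a common mean and compares the observed zero fraction with the exp(-mu) expected under that Poisson:

```python
import math

def poisson_diagnostics(y):
    """Crude checks for overdispersion and zero-inflation relative to a
    Poisson model with a single fitted mean (intercept-only GLM)."""
    n = len(y)
    mu = sum(y) / n
    # Pearson dispersion: ~1 under Poisson, substantially >1 suggests
    # overdispersion.
    dispersion = sum((yi - mu) ** 2 / mu for yi in y) / (n - 1)
    observed_zero = sum(yi == 0 for yi in y) / n
    expected_zero = math.exp(-mu)    # Poisson P(Y = 0)
    return dispersion, observed_zero, expected_zero

# Hypothetical overdispersed, zero-heavy counts.
y = [0, 0, 0, 0, 1, 0, 0, 7, 0, 12, 0, 0, 3, 0, 9, 0]
disp, obs0, exp0 = poisson_diagnostics(y)
```

The paper's caution applies precisely here: with a sample this small, letting such diagnostics choose the final model can inflate type 1 error downstream.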
86. A Bayesian quantile regression approach to multivariate semi-continuous longitudinal data.
- Author
-
Biswas, Jayabrata and Das, Kiranmoy
- Subjects
- *
LAPLACE distribution , *QUANTILE regression , *DISTRIBUTION (Probability theory) , *MARKOV chain Monte Carlo , *QUANTILES , *TOBITS - Abstract
Quantile regression is a powerful tool for modeling non-Gaussian data, and also for modeling different quantiles of the probability distributions of the responses. We propose a Bayesian approach to estimating the quantiles of multivariate longitudinal data where the responses contain excess zeros. We consider a Tobit regression approach, where the latent responses are estimated using a linear mixed model. The longitudinal dependence and the correlations among different (latent) responses are modeled by the subject-specific vector of random effects. We consider a mixture representation of the Asymmetric Laplace Distribution (ALD), and develop an efficient MCMC algorithm for estimating the model parameters. The proposed approach is used for analyzing data from the Health and Retirement Study (HRS) conducted by the University of Michigan, USA, where we jointly model (i) out-of-pocket medical expenditures, (ii) total financial assets, and (iii) total financial debt for the aged subjects, and estimate the effects of different covariates on these responses across different quantiles. Simulation studies are performed for assessing the operating characteristics of the proposed approach. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
87. ZIBGLMM: Zero-Inflated Bivariate Generalized Linear Mixed Model for Meta-Analysis with Double-Zero-Event Studies.
- Author
-
Li L, Lin L, Cappelleri JC, Chu H, and Chen Y
- Abstract
Double-zero-event studies (DZS) pose a challenge for accurately estimating the overall treatment effect in meta-analysis. Current approaches, such as continuity correction or omission of DZS, are commonly employed, yet these ad hoc methods can yield biased conclusions. Although the standard bivariate generalized linear mixed model can accommodate DZS, it fails to address the potential systemic differences between DZS and other studies. In this paper, we propose a zero-inflated bivariate generalized linear mixed model (ZIBGLMM) to tackle this issue. This two-component finite mixture model includes zero-inflation for a subpopulation with negligible or extremely low risk. We develop both frequentist and Bayesian versions of ZIBGLMM and examine its performance in estimating risk ratios (RRs) against the bivariate generalized linear mixed model and conventional two-stage meta-analysis that excludes DZS. Through extensive simulation studies and real-world meta-analysis case studies, we demonstrate that ZIBGLMM outperforms the bivariate generalized linear mixed model and conventional two-stage meta-analysis that excludes DZS in estimating the true effect size with substantially less bias and comparable coverage probability. Competing Interests: Cappelleri and Chu are employed by Pfizer. They own stock in the company.
- Published
- 2024
- Full Text
- View/download PDF
88. Bayesian modelling of recurrent pipe failures in urban water systems using non-homogeneous Poisson processes with latent structure
- Author
-
Economou, Theodoros, Bailey, Trevor C., and Kapelan, Zoran
- Subjects
519.5, NHPP, random effects, zero-inflation, MCMC, hidden semi-Markov, water pipe failures - Abstract
Recurrent events are very common in a wide range of scientific disciplines. The majority of statistical models developed to characterise recurrent events are derived from either reliability theory or survival analysis. This thesis concentrates on applications that arise from reliability, which in general involve the study of components or devices where the recurring event is failure. Specifically, interest lies in repairable components that experience a number of failures during their lifetime. The goal is to develop statistical models in order to gain a good understanding of the driving force behind the failures. A particular counting process is adopted, the non-homogeneous Poisson process (NHPP), where the rate of occurrence (failure rate) depends on time. The primary application considered in the thesis is the prediction of underground water pipe bursts, although the methods described have more general scope. First, a Bayesian mixed effects NHPP model is developed and applied to a network of water pipes using MCMC. The model is then extended to a mixture of NHPPs. Further, a special mixture case, the zero-inflated NHPP model, is developed to cope with data involving a large number of pipes that have never failed. The zero-inflated model is applied to the same pipe network. Quite often, data involving recurrent failures over time are aggregated where, for instance, the times of failures are unknown and only the total number of failures is available. Aggregated versions of the NHPP model and its zero-inflated version are developed to accommodate aggregated data, and these are applied to the aggregated version of the earlier data set. Complex devices in random environments often exhibit what may be termed state changes in their behaviour. These state changes may be caused by unobserved and possibly non-stationary processes such as severe weather changes.
A hidden semi-Markov NHPP model is formulated, which is a NHPP process modulated by an unobserved semi-Markov process. An algorithm is developed to evaluate the likelihood of this model and a Metropolis-Hastings sampler is constructed for parameter estimation. Simulation studies are performed to test implementation and finally an illustrative application of the model is presented. The thesis concludes with a general discussion and a list of possible generalisations and extensions as well as possible applications other than the ones considered.
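The zero-inflated NHPP count distribution described above can be illustrated with a minimal sketch, assuming a power-law cumulative intensity Lambda(t) = a * t**b; the parameter values are made up, not taken from the thesis.

```python
from math import exp, factorial

def cumulative_intensity(t, a, b):
    """Power-law NHPP cumulative intensity Lambda(t) = a * t**b."""
    return a * t**b

def zi_nhpp_count_pmf(k, t, a, b, pi):
    """P(k failures in (0, t]): a fraction pi of pipes effectively never
    fails; the rest follow a Poisson count with mean Lambda(t)."""
    lam = cumulative_intensity(t, a, b)
    pois = exp(-lam) * lam**k / factorial(k)
    return pi + (1 - pi) * pois if k == 0 else (1 - pi) * pois

# made-up parameters for a 10-year observation window
p0 = zi_nhpp_count_pmf(0, t=10.0, a=0.05, b=1.3, pi=0.4)
```

The zero-inflation term pi absorbs the large group of never-failing pipes, so the Poisson part of the model is not forced to explain an implausibly large share of zeros.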
- Published
- 2010
89. The Truth behind the Zeros: A New Approach to Principal Component Analysis of the Neuropsychiatric Inventory.
- Author
-
Hellton, Kristoffer H., Cummings, Jeffrey, Vik-Mo, Audun Osland, Nordrehaug, Jan Erik, Aarsland, Dag, Selbaek, Geir, and Giil, Lasse Melvaer
- Subjects
- *
PRINCIPAL components analysis , *DEMENTIA , *INVENTORIES , *MONTE Carlo method , *PSYCHOSES , *SKEWNESS (Probability theory) - Abstract
Psychiatric syndromes in dementia are often derived from the Neuropsychiatric Inventory (NPI) using principal component analysis (PCA). The validity of this statistical approach can be questioned, since the excessive proportion of zeros and skewness of NPI items may distort the estimated relations between the items. We propose a novel version of PCA, ZIBP-PCA, where a zero-inflated bivariate Poisson (ZIBP) distribution models the pairwise covariance between the NPI items. We compared the performance of the method to classical PCA under zero-inflation using simulations, and in two dementia cohorts (N = 830, N = 1349). Simulations showed that component loadings from PCA were biased due to zero-inflation, while the loadings of ZIBP-PCA remained unaffected. ZIBP-PCA obtained a simpler component structure of "psychosis," "mood" and "agitation" in both dementia cohorts, compared to PCA. The principal components from ZIBP-PCA had component loadings as follows: First, the component interpreted as "psychosis" was loaded by the items delusions and hallucinations. Second, the "mood" component was loaded by depression and anxiety. Finally, the "agitation" component was loaded by irritability and aggression. In conclusion, PCA is not equipped to handle zero-inflation. Using the NPI, PCA fails to identify components with a valid interpretation, while ZIBP-PCA estimates simple and interpretable components to characterize the psychopathology of dementia. [ABSTRACT FROM AUTHOR]
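The ZIBP building block can be sketched via the standard trivariate-reduction construction of the bivariate Poisson, with zero-inflation added at (0, 0); the rates here are hypothetical, and this is not the authors' estimation code.

```python
from math import exp, factorial

def bivariate_poisson_pmf(x, y, l1, l2, l3):
    """Trivariate reduction: X = U + W, Y = V + W with independent
    U~Pois(l1), V~Pois(l2), W~Pois(l3), so Cov(X, Y) = l3."""
    total = 0.0
    for w in range(min(x, y) + 1):
        total += (exp(-l1) * l1**(x - w) / factorial(x - w)
                  * exp(-l2) * l2**(y - w) / factorial(y - w)
                  * exp(-l3) * l3**w / factorial(w))
    return total

def zibp_pmf(x, y, l1, l2, l3, pi):
    """Zero-inflated bivariate Poisson: extra probability mass pi at (0, 0)."""
    point = pi if (x, y) == (0, 0) else 0.0
    return point + (1 - pi) * bivariate_poisson_pmf(x, y, l1, l2, l3)
```

The shared component W induces the pairwise covariance between two items, while the point mass at (0, 0) accounts for patients exhibiting neither symptom, which is the distortion classical PCA cannot absorb.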
- Published
- 2021
- Full Text
- View/download PDF
90. Compositional zero-inflated network estimation for microbiome data.
- Author
-
Ha, Min Jin, Kim, Junghi, Galloway-Peña, Jessica, Do, Kim-Anh, and Peterson, Christine B.
- Subjects
- *
INFERENTIAL statistics , *MICROBIAL communities , *SCALABILITY - Abstract
Background: The estimation of microbial networks can provide important insight into the ecological relationships among the organisms that comprise the microbiome. However, there are a number of critical statistical challenges in the inference of such networks from high-throughput data. Since the abundances in each sample are constrained to have a fixed sum and there is incomplete overlap in microbial populations across subjects, the data are both compositional and zero-inflated. Results: We propose the COmpositional Zero-Inflated Network Estimation (COZINE) method for inference of microbial networks which addresses these critical aspects of the data while maintaining computational scalability. COZINE relies on the multivariate Hurdle model to infer a sparse set of conditional dependencies which reflect not only relationships among the continuous values, but also among binary indicators of presence or absence and between the binary and continuous representations of the data. Our simulation results show that the proposed method is better able to capture various types of microbial relationships than existing approaches. We demonstrate the utility of the method with an application to understanding the oral microbiome network in a cohort of leukemic patients. Conclusions: Our proposed method addresses important challenges in microbiome network estimation, and can be effectively applied to discover various types of dependence relationships in microbial communities. The procedure we have developed, which we refer to as COZINE, is available online at https://github.com/MinJinHa/COZINE. [ABSTRACT FROM AUTHOR]
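A univariate hurdle pmf illustrates the building block that COZINE extends to the multivariate setting; this is a sketch with toy parameters, not the COZINE estimator itself.

```python
from math import exp, factorial

def hurdle_pmf(y, p_zero, lam):
    """Univariate hurdle model: a Bernoulli hurdle at zero (presence or
    absence of a taxon), then a zero-truncated Poisson for positive counts."""
    if y == 0:
        return p_zero
    return (1.0 - p_zero) * exp(-lam) * lam**y / (factorial(y) * (1.0 - exp(-lam)))
```

Separating the zero process from the positive-count process is what lets the multivariate version model dependencies among presence/absence indicators, among abundances, and between the two.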
- Published
- 2020
- Full Text
- View/download PDF
91. Model Checking for Hidden Markov Models.
- Author
-
Buckby, Jodie, Wang, Ting, Zhuang, Jiancang, and Obara, Kazushige
- Subjects
- *
STOCHASTIC analysis , *POINT processes , *TIME series analysis , *GOODNESS-of-fit tests , *HIDDEN Markov models - Abstract
Residual analysis is a useful tool for checking lack of fit and for providing insight into model improvement. However, literature on residual analysis and the goodness of fit for hidden Markov models (HMMs) is limited. As HMMs with complex structures are increasingly used to accommodate different types of data, there is a need for further tools to check the validity of models applied to real world data. We review model checking methods for HMMs and develop new methods motivated by a particular case study involving a two-dimensional HMM developed for time series with many null events. We propose new residual analysis and stochastic reconstruction methods, which are adapted from model checking techniques for point process models. We apply the new methods to the case study model and discuss their adequacy. We find that there is not one "best" test for diagnostics but that our new methods have some advantages over previously developed tools. The importance of multiple tests for complex HMMs is highlighted and we use the results of our model checking to provide suggestions for possible improvements to the case study model. Supplementary materials for this article are available online. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
92. Random effect exponentiated-exponential geometric model for clustered/longitudinal zero-inflated count data.
- Author
-
Tapak, Leili, Hamidi, Omid, Amini, Payam, and Verbeke, Geert
- Subjects
- *
GEOMETRIC modeling , *RANDOM effects model , *COUNTING , *REGRESSION analysis - Abstract
For count responses, there are situations in biomedical and sociological applications in which extra zeros occur. Modeling correlated (e.g. repeated-measures and clustered) zero-inflated count data poses special challenges because the correlation between measurements for a subject or a cluster needs to be taken into account. Moreover, zero-inflated count data often exhibit over- or underdispersion. In this paper, we propose a random effect model for repeated measurements or clustered data with over- or underdispersed responses, called the random effect zero-inflated exponentiated-exponential geometric regression model. The proposed method is illustrated through real examples. The performance of the model and the asymptotic properties of the estimators were investigated using simulation studies. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
93. ON COMPARISON OF MODELS FOR COUNT DATA WITH EXCESSIVE ZEROS IN NON-LIFE INSURANCE.
- Author
-
KARADAĞ ERDEMİR, Övgücan and KARADAĞ, Özge
- Subjects
- *
POISSON distribution , *INSURANCE claims , *INSURANCE , *DISTRIBUTION (Probability theory) , *PARAMETER estimation - Abstract
Modeling the claim frequency is crucial in many respects for non-life insurance issues such as ratemaking, credibility theory, claim reserving, risk theory, risk classification and bonus-malus systems. For analysing claims in non-life insurance, the most widely used models are generalized linear models, which depend on the distribution of claims. The claim frequency is generally assumed to follow a Poisson distribution; however, insurance claim data contain excess zero counts, which affects the statistical estimates. In the presence of excess zeros, there are more appropriate distributions for the claim frequency, such as zero-inflated and hurdle models, than a standard Poisson distribution. In this study, using real annual comprehensive insurance data, the zero-inflated claim frequency is modeled via several models with and without consideration of zero-inflation. The underlying models are compared using information criteria and the Vuong test. Parameter estimation is carried out using maximum likelihood. [ABSTRACT FROM AUTHOR]
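The model comparison described here can be sketched as follows, with invented claim counts and hypothetical (not maximum-likelihood-fitted) ZIP parameters; the Vuong statistic compares pointwise log-likelihoods of two non-nested models.

```python
from math import exp, factorial, log, sqrt

def zip_logpmf(y, lam, pi):
    """Zero-inflated Poisson log-pmf: P(0) = pi + (1 - pi) * exp(-lam);
    P(y) = (1 - pi) * exp(-lam) * lam**y / y! for y >= 1."""
    if y == 0:
        return log(pi + (1 - pi) * exp(-lam))
    return log(1 - pi) - lam + y * log(lam) - log(factorial(y))

def vuong(ll1, ll2):
    """Vuong statistic from pointwise log-likelihoods of two non-nested
    models; roughly, |V| > 1.96 favours model 1 (V > 0) or model 2 (V < 0)."""
    m = [a - b for a, b in zip(ll1, ll2)]
    n = len(m)
    mbar = sum(m) / n
    sd = sqrt(sum((x - mbar) ** 2 for x in m) / (n - 1))
    return sqrt(n) * mbar / sd

# invented annual claim counts and hypothetical parameter values
claims = [0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 0, 0, 3, 0, 0]
ll_zip = [zip_logpmf(y, lam=1.4, pi=0.55) for y in claims]
ll_poisson = [zip_logpmf(y, lam=sum(claims) / len(claims), pi=0.0) for y in claims]
v = vuong(ll_zip, ll_poisson)
```

Setting pi = 0 recovers the plain Poisson log-likelihood, so the same function serves both competitors; with this tiny toy sample the statistic is positive (leaning toward ZIP) but not significant.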
- Published
- 2020
94. A Large-Scale Constrained Joint Modeling Approach for Predicting User Activity, Engagement, and Churn With Application to Freemium Mobile Games.
- Author
-
Banerjee, Trambak, Mukherjee, Gourab, Dutta, Shantanu, and Ghosh, Pulak
- Subjects
- *
MOBILE games , *DISTRIBUTED computing , *POKEMON Go , *SCALABILITY , *SPORTS nutrition , *MOBILE apps - Abstract
We develop a constrained extremely zero-inflated joint (CEZIJ) modeling framework for simultaneously analyzing player activity, engagement, and dropouts (churns) in app-based mobile freemium games. Our proposed framework addresses the complex interdependencies between a player's decision to use a freemium product, the extent of her direct and indirect engagement with the product and her decision to permanently drop its usage. CEZIJ extends the existing class of joint models for longitudinal and survival data in several ways. It not only accommodates extremely zero-inflated responses in a joint model setting but also incorporates domain-specific, convex structural constraints on the model parameters. Longitudinal data from app-based mobile games usually exhibit a large set of potential predictors and choosing the relevant set of predictors is highly desirable for various purposes including improved predictability. To achieve this goal, CEZIJ conducts simultaneous, coordinated selection of fixed and random effects in high-dimensional penalized generalized linear mixed models. For analyzing such large-scale datasets, variable selection and estimation are conducted via a distributed computing based split-and-conquer approach that massively increases scalability and provides better predictive performance over competing predictive methods. Our results reveal codependencies between varied player characteristics that promote player activity and engagement. Furthermore, the predicted churn probabilities exhibit idiosyncratic clusters of player profiles over time based on which marketers and game managers can segment the playing population for improved monetization of app-based freemium games. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
95. The score test for the two‐sample occupancy model.
- Author
-
Karavarsamis, N., Guillera‐Arroita, G., Huggins, R.M., and Morgan, B.J.T.
- Subjects
- *
TEST scoring , *RANDOM variables , *NULL hypothesis , *ELECTRONIC journals , *OUTLIER detection , *LIKELIHOOD ratio tests - Abstract
Summary: The score test statistic from the observed information is easy to compute numerically. Its large sample distribution under the null hypothesis is well known and is equivalent to that of the score test based on the expected information, the likelihood‐ratio test and the Wald test. However, several authors have noted that under the alternative hypothesis this no longer holds and in particular the score statistic from the observed information can take negative values. We extend the anthology on the score test to a problem of interest in ecology when studying species occurrence. This is the comparison of two zero‐inflated binomial random variables from two independent samples under imperfect detection. An analysis of eigenvalues associated with the score test in this setting assists in understanding why using the observed information matrix in the score test can be problematic. We demonstrate through a combination of simulations and theoretical analysis that the power of the score test calculated under the observed information decreases as the populations being compared become more dissimilar. In particular, the score test based on the observed information is inconsistent. Finally, we propose a modified rule that rejects the null hypothesis when the score statistic computed using the observed information is negative or larger than the usual chi‐square cut‐off. In simulations in our setting this has power that is comparable to the Wald and likelihood ratio tests and consistency is largely restored. Our new test is easy to use and valid inference is possible. Supplementary material for this article is available online. A new modified score test is proposed for mixture zero‐inflated binomial data under imperfect detection for two independent samples that restores power and consistency. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
96. A study of alternative approaches to non-normal latent trait distributions in item response theory models used for health outcome measurement.
- Author
-
Smits, Niels, Öğreden, Oğuzhan, Garnier-Villarreal, Mauricio, Terwee, Caroline B, Chalmers, R Philip, and Fox, Jean-Paul
- Subjects
- *
STATISTICS , *STATISTICAL models , *DATA analysis - Abstract
It is often unrealistic to assume normally distributed latent traits in the measurement of health outcomes. If normality is violated, the item response theory (IRT) models that are used to calibrate questionnaires may yield parameter estimates that are biased. Recently, IRT models were developed for dealing with specific deviations from normality, such as zero-inflation ("excess zeros") and skewness. However, these models have not yet been evaluated under conditions representative of item bank development for health outcomes, characterized by a large number of polytomous items. A simulation study was performed to compare the bias in parameter estimates of the graded response model (GRM), polytomous extensions of the zero-inflated mixture IRT (ZIM-GRM), and Davidian Curve IRT (DC-GRM). In the case of zero-inflation, the GRM showed high bias overestimating discrimination parameters and yielding estimates of threshold parameters that were too high and too close to one another, while ZIM-GRM showed no bias. In the case of skewness, the GRM and DC-GRM showed little bias with the GRM showing slightly better results. Consequences for the development of health outcome measures are discussed. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
97. Zero-modified count time series with Markovian intensities.
- Author
-
Balakrishna, N., Muhammed Anvar, P., and Bovas Abraham
- Subjects
- *
GENERALIZED estimating equations , *KALMAN filtering , *GENERALIZED spaces , *PARAMETER estimation , *TIME series analysis - Abstract
This paper proposes a method for analyzing count time series with inflation or deflation of zeros. In particular, zero-modified Poisson and zero-modified negative binomial series with intensities generated by non-negative Markov sequences are studied in detail. Parameters of the model are estimated by the method of estimating equations, which is facilitated by expressing the model in a generalized state space form. The latent intensities required for estimation are extracted using a generalized Kalman filter. Applications of the proposed model and its estimation methods are illustrated using simulated and real data sets. • Parameter-driven models suggested for zero-inflated and zero-deflated count time series. • Count time series are generated by Markov-dependent intensities. • Estimating function methods proposed for parameter estimation in count time series. • Applications of generalized state space models and generalized Kalman filters are explored. [ABSTRACT FROM AUTHOR]
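The zero-modified Poisson mentioned above generalizes zero-inflation to allow zero-deflation as well, via a modification weight that may be negative within a valid range; a small sketch with arbitrary parameters:

```python
from math import exp, factorial

def zmp_pmf(y, lam, pi):
    """Zero-modified Poisson: pi > 0 inflates zeros, pi < 0 deflates them.
    Validity requires pi >= -exp(-lam) / (1 - exp(-lam))."""
    pois = exp(-lam) * lam**y / factorial(y)
    return pi + (1 - pi) * pois if y == 0 else (1 - pi) * pois

# zero-deflation: fewer zeros than a Poisson(2.0) would produce (toy values)
p0 = zmp_pmf(0, lam=2.0, pi=-0.1)
```

With pi = -0.1 the probability of a zero drops below the plain Poisson value exp(-2), while the positive counts are scaled up so the distribution still sums to one.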
- Published
- 2024
- Full Text
- View/download PDF
98. Dynamic Spatial Pattern Recognition in Count Data
- Author
-
Wang, Xia, Chen, Ming-Hui, Kuo, Rita C., Dey, Dipak K., Chen, Jiahua, Series editor, Chen, Ding-Geng (Din), Series editor, Jin, Zhezhen, editor, Liu, Mengling, editor, and Luo, Xiaolong, editor
- Published
- 2016
- Full Text
- View/download PDF
99. Count Regression And Machine Learning Techniques For Zero-Inflated Overdispersed Count Data: Application To Ecological Data
- Author
-
Sidumo, Bonelwa, Sonono, Energy, and Takaidza, Isaac
- Abstract
The aim of this study is to investigate the overdispersion problem that is rampant in ecological count data. In order to explore this problem, we consider the most commonly used count regression models: the Poisson, the negative binomial, the zero-inflated Poisson and the zero-inflated negative binomial models. The performance of these count regression models is compared with four machine learning (ML) regression techniques: random forests, support vector machines, k-nearest neighbors and artificial neural networks. The mean absolute error was used to compare the performance of the count regression models and the ML regression models. The results suggest that ML regression models perform better than count regression models. The performance shown by ML regression techniques is a motivation for further research in improving methods and applications in ecological studies.
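The comparison metric is plain mean absolute error; a tiny sketch with invented observations and predictions:

```python
def mean_absolute_error(y_true, y_pred):
    """MAE, the yardstick used to compare count regression and ML models."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

obs = [0, 0, 3, 0, 7, 1]                  # toy zero-inflated counts
pred = [0.2, 0.1, 2.5, 0.4, 5.0, 1.3]     # hypothetical model predictions
mae = mean_absolute_error(obs, pred)
```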
- Published
- 2023
100. Estimation of Mediation Effect on Zero-Inflated Microbiome Mediators
- Author
-
Yang, Dongyang and Xu, Wei
- Subjects
mediation model, microbiome data, zero-inflation, semiparametric, direct effects, indirect effects - Abstract
Mediation analysis of cause-and-effect relationships through mediators has become increasingly popular over the past decades. The human microbiome can contribute to the pathogenesis of many complex diseases by mediating disease-leading causal pathways. However, standard mediation analysis is not adequate for microbiome data due to the excessive number of zero values and the over-dispersion in the sequencing reads, which arise for both biological and sampling reasons. To address the unique challenges brought by a zero-inflated mediator, we developed a novel mediation analysis algorithm under the potential-outcome framework to fill this gap. The proposed semiparametric model estimates the mediation effect of the microbiome by decomposing indirect effects into two components according to the zero-inflated distributions. A bootstrap algorithm is utilized to calculate the empirical confidence intervals of the causal effects. We conducted extensive simulation studies to investigate the performance of the proposed weighting-based approach and some model-based alternatives, and the proposed model showed robust performance. The proposed algorithm was applied in a real human microbiome study to identify whether some taxa mediate the relationship between LACTIN-V treatment and immune response.
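The percentile bootstrap used for the empirical confidence intervals can be sketched as follows; toy data and a sample-mean statistic stand in for the paper's indirect-effect estimator.

```python
import random

def bootstrap_ci(data, stat, n_boot=2000, alpha=0.05, seed=7):
    """Percentile bootstrap confidence interval for a statistic, as used
    for the empirical CIs of the direct/indirect causal effects."""
    rng = random.Random(seed)
    reps = sorted(stat([rng.choice(data) for _ in data]) for _ in range(n_boot))
    lo = reps[int((alpha / 2) * n_boot)]
    hi = reps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# toy zero-inflated mediator values; the sample mean is a stand-in statistic
data = [0, 0, 0, 1, 2, 0, 4, 0, 0, 3, 0, 5]
lo, hi = bootstrap_ci(data, lambda xs: sum(xs) / len(xs))
```

Resampling the raw observations preserves the zero-inflation in each bootstrap replicate, which is why the resulting interval needs no parametric distributional assumption.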
- Published
- 2023
- Full Text
- View/download PDF