640 results for "Correlated data"
Search Results
2. New Quadratic Discriminant Analysis Algorithms for Correlated Audiometric Data.
- Author: Guo, Fuyu, Zucker, David M., Vaden, Kenneth I., Curhan, Sharon, Dubno, Judy R., and Wang, Molin
- Subjects: DISCRIMINANT analysis, HEARING impaired, HEARING disorders, PHENOTYPES, LUNGS
- Abstract
Paired organs like eyes, ears, and lungs in humans exhibit similarities, and data from these organs often display remarkable correlations. Accounting for these correlations could enhance classification models used in predicting disease phenotypes. To our knowledge, there is limited, if any, literature addressing this topic, and existing methods do not exploit such correlations. For example, the conventional approach treats each ear as an independent observation when predicting audiometric phenotypes and is agnostic about the correlation of data from the two ears of the same person. This approach may lead to information loss and reduce the model performance. In response to this gap, particularly in the context of audiometric phenotype prediction, this paper proposes new quadratic discriminant analysis (QDA) algorithms that appropriately deal with the dependence between ears. We propose two‐stage analysis strategies: (1) conducting data transformations to reduce data dimensionality before applying QDA; and (2) developing new QDA algorithms to partially utilize the dependence between phenotypes of two ears. We conducted simulation studies to compare different transformation methods and to assess the performance of different QDA algorithms. The empirical results suggested that the transformation may only be beneficial when the sample size is relatively small. Moreover, our proposed new QDA algorithms performed better than the conventional approach in both person‐level and ear‐level accuracy. As an illustration, we applied them to audiometric data from the Medical University of South Carolina Longitudinal Cohort Study of Age‐related Hearing Loss. In addition, we developed an R package, PairQDA, to implement the proposed algorithms. [ABSTRACT FROM AUTHOR]
- Published: 2024
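The abstract above contrasts the conventional ear-level approach (each ear treated as an independent observation) with the proposed joint algorithms, which the authors implement in the R package PairQDA. As a point of reference only, the sketch below implements the conventional ear-independent QDA baseline in plain Python/NumPy on hypothetical synthetic "audiogram-like" features; it is not the authors' method.

```python
import numpy as np

def fit_qda(X, y):
    """Estimate per-class means, covariances, and priors for QDA."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = (Xk.mean(axis=0), np.cov(Xk, rowvar=False), len(Xk) / len(X))
    return params

def predict_qda(X, params):
    """Assign each row to the class with the largest quadratic discriminant."""
    scores = []
    for k, (mu, cov, prior) in params.items():
        inv = np.linalg.inv(cov)
        _, logdet = np.linalg.slogdet(cov)
        d = X - mu
        maha = np.einsum("ij,jk,ik->i", d, inv, d)   # Mahalanobis distances
        scores.append(-0.5 * logdet - 0.5 * maha + np.log(prior))
    return np.array(list(params))[np.argmax(np.array(scores), axis=0)]

# Hypothetical two-class data standing in for ear-level audiometric features
rng = np.random.default_rng(0)
X0 = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], 200)
X1 = rng.multivariate_normal([2, 2], [[1, -0.2], [-0.2, 1]], 200)
X = np.vstack([X0, X1])
y = np.repeat([0, 1], 200)
acc = (predict_qda(X, fit_qda(X, y)) == y).mean()
```

The paper's point is that this baseline discards the between-ear correlation; the proposed algorithms model the two ears of one person jointly.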
3. Hidden Markov models for multivariate panel data.
- Author: Neal, Mackenzie R., Sochaniwsky, Alexa A., and McNicholas, Paul D.
- Abstract
While advances continue to be made in model-based clustering, challenges persist in modeling various data types such as panel data. Multivariate panel data present difficulties for clustering algorithms because they are often plagued by missing data and dropouts, presenting issues for estimation algorithms. This research presents a family of hidden Markov models that compensate for the issues that arise in panel data. A modified expectation–maximization algorithm capable of handling missing not at random data and dropout is presented and used to perform model estimation. [ABSTRACT FROM AUTHOR]
- Published: 2024
4. Robust multi-outcome regression with correlated covariate blocks using fused LAD-lasso.
- Author: Möttönen, Jyrki, Lähderanta, Tero, Salonen, Janne, and Sillanpää, Mikko J.
- Subjects: REGRESSION analysis, MULTISENSOR data fusion, MULTIVARIATE analysis, RETIREMENT
- Abstract
Lasso is a popular and efficient approach to simultaneous estimation and variable selection in high-dimensional regression models. In this paper, a robust fused LAD-lasso method for multiple outcomes is presented that addresses the challenges of non-normal outcome distributions and outlying observations. Covariate data measured over space or time, or across spectral bands or genomic positions, often have a natural correlation structure arising from the distances between covariates. The proposed multi-outcome approach handles such covariate blocks with a group fusion penalty, which encourages similarity between neighboring regression coefficient vectors by penalizing their differences, for example, in sequential data situations. Properties of the proposed approach are illustrated by extensive simulations using BIC-type criteria for model selection. The method is also applied to real-life skewed data on retirement behavior with longitudinal heteroscedastic explanatory variables. [ABSTRACT FROM AUTHOR]
- Published: 2024
5. Incorporating sources of correlation between outcomes: An introduction to mixed models.
- Author: Liu, Limeng and Petersen, Ashley
- Subjects: RANDOM effects model, RESEARCH questions, INFERENTIAL statistics, RESEARCH personnel, LABORATORY animals
- Abstract
Animal research often involves measuring the outcomes of interest multiple times on the same animal, whether over time or for different exposures. These repeated outcomes measured on the same animal are correlated due to animal-specific characteristics. While this repeated measures data can address more complex research questions than single-outcome data, the statistical analysis must take into account the study design resulting in correlated outcomes, which violate the independence assumption of standard statistical methods (e.g. a two-sample t-test, linear regression). When standard statistical methods are incorrectly used to analyze correlated outcome data, the statistical inference (i.e. confidence intervals and p-values) will be incorrect, with some settings leading to null findings too often and others producing statistically significant findings despite no support for this in the data. Instead, researchers can leverage approaches designed specifically for correlated outcomes. In this article, we discuss common study designs that lead to correlated outcome data, motivate the intuition about the impact of improperly analyzing correlated outcomes using methods for independent data, and introduce approaches that properly leverage correlated outcome data. [ABSTRACT FROM AUTHOR]
- Published: 2024
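The animal-specific correlation this introduction describes is exactly what a random-intercept mixed model captures: each animal gets its own intercept, and the induced within-animal correlation (the intraclass correlation, ICC) is the share of variance at the animal level. A small NumPy simulation with hypothetical variance components makes this concrete:

```python
import numpy as np

rng = np.random.default_rng(1)
n_animals, n_rep = 500, 2
sd_b, sd_e = 1.0, 1.0                         # animal-level and residual SDs (hypothetical)

b = rng.normal(0, sd_b, n_animals)            # animal-specific random intercepts
y = b[:, None] + rng.normal(0, sd_e, (n_animals, n_rep))

# Correlation between two measurements on the same animal
icc_empirical = np.corrcoef(y[:, 0], y[:, 1])[0, 1]
icc_theory = sd_b**2 / (sd_b**2 + sd_e**2)    # = 0.5 for these SDs
```

With equal animal-level and residual variance, half of the total variation is shared within an animal, and the empirical correlation between repeat measurements lands near 0.5, which is what standard independent-data methods wrongly ignore.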
6. Testing Informativeness of Covariate-Induced Group Sizes in Clustered Data.
- Author: Senevirathne, Hasika K. Wickrama and Dutta, Sandipan
- Subjects: DISTRIBUTION (Probability theory), TEST methods
- Abstract
Clustered data are a special type of correlated data where units within a cluster are correlated while units between different clusters are independent. The number of units in a cluster can be associated with that cluster's outcome. This is called the informative cluster size (ICS), which is known to impact clustered data inference. However, when comparing the outcomes from multiple groups of units in clustered data, investigating ICS may not be enough. This is because the number of units belonging to a particular group in a cluster can be associated with the outcome from that group in that cluster, leading to an informative intra-cluster group size or IICGS. This phenomenon of IICGS can exist even in the absence of ICS. Ignoring the existence of IICGS can result in a biased inference for group-based outcome comparisons in clustered data. In this article, we mathematically formulate the concept of IICGS while distinguishing it from ICS and propose a nonparametric bootstrap-based statistical hypothesis-testing mechanism for testing any claim of IICGS in a clustered data setting. Through simulations and real data applications, we demonstrate that our proposed statistical testing method can accurately identify IICGS, with substantial power, in clustered data. [ABSTRACT FROM AUTHOR]
- Published: 2024
7. Evaluating Differential Privacy on Correlated Datasets Using Pointwise Maximal Leakage
- Author: Saeidian, Sara, Oechtering, Tobias J., Skoglund, Mikael, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Jensen, Meiko, editor, Lauradoux, Cédric, editor, and Rannenberg, Kai, editor
- Published: 2024
9. Imputation-Based Variable Selection Method for Block-Wise Missing Data When Integrating Multiple Longitudinal Studies.
- Author: Ouyang, Zhongzhe and Wang, Lu
- Subjects: MULTIPLE imputation (Statistics), MISSING data (Statistics), LONGITUDINAL method, ALZHEIMER'S disease, GENERALIZED method of moments, EARLY diagnosis, CROSS-sectional method
- Abstract
When integrating data from multiple sources, a common challenge is block-wise missing. Most existing methods address this issue only in cross-sectional studies. In this paper, we propose a method for variable selection when combining datasets from multiple sources in longitudinal studies. To account for block-wise missing in covariates, we impute the missing values multiple times based on combinations of samples from different missing pattern and predictors from different data sources. We then use these imputed data to construct estimating equations, and aggregate the information across subjects and sources with the generalized method of moments. We employ the smoothly clipped absolute deviation penalty in variable selection and use the extended Bayesian Information Criterion criteria for tuning parameter selection. We establish the asymptotic properties of the proposed estimator, and demonstrate the superior performance of the proposed method through numerical experiments. Furthermore, we apply the proposed method in the Alzheimer's Disease Neuroimaging Initiative study to identify sensitive early-stage biomarkers of Alzheimer's Disease, which is crucial for early disease detection and personalized treatment. [ABSTRACT FROM AUTHOR]
- Published: 2024
10. Generalized Estimating Equations Boosting (GEEB) machine for correlated data
- Author: Yuan-Wey Wang, Hsin-Chou Yang, Yi-Hau Chen, and Chao-Yu Guo
- Subjects: Correlated data, Hierarchical data, Generalized Estimating Equations, Machine learning, Gradient boosting
- Abstract
Rapid development in data science has made machine learning and artificial intelligence the most popular research tools across various disciplines. While numerous articles have shown decent predictive ability, little research has examined the impact of complex correlated data. We aim to develop a more accurate model under repeated measures or hierarchical data structures. Therefore, this study proposes a novel algorithm, the Generalized Estimating Equations Boosting (GEEB) machine, to integrate the gradient boosting technique into the benchmark statistical approach for correlated data, Generalized Estimating Equations (GEE). Unlike previous gradient boosting, which utilizes all input features, we randomly select some input features when building the model to reduce predictive errors. The simulation study evaluates the predictive performance of the GEEB, GEE, eXtreme Gradient Boosting (XGBoost), and Support Vector Machine (SVM) across several hierarchical structures with different sample sizes. Results suggest that the new strategy GEEB outperforms the GEE and demonstrates higher predictive accuracy than the SVM and XGBoost in most situations. An application to a real-world dataset, the Forest Fire Data, also revealed that the GEEB reduced mean squared errors by 4.5% to 25% compared to GEE, XGBoost, and SVM. This research also provides a freely available R function that can implement the GEEB machine effortlessly for longitudinal or hierarchical data.
- Published: 2024
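The GEE benchmark into which GEEB folds gradient boosting solves a generalized least-squares problem under a working correlation. The authors provide an R function for GEEB itself; purely as an illustration of the plain-GEE baseline, here is a minimal NumPy sketch of Gaussian GEE with an exchangeable working correlation on hypothetical clustered data, alternating a GLS solve for the coefficients with a moment estimate of the within-cluster correlation:

```python
import numpy as np

def gee_exchangeable(X_groups, y_groups, n_iter=10):
    """Gaussian GEE with an exchangeable working correlation.
    X_groups / y_groups: lists of per-cluster design matrices and outcomes."""
    p = X_groups[0].shape[1]
    beta, rho = np.zeros(p), 0.0
    for _ in range(n_iter):
        # GLS solve given the current working correlation
        XtVX, XtVy = np.zeros((p, p)), np.zeros(p)
        for X, y in zip(X_groups, y_groups):
            m = len(y)
            R = (1 - rho) * np.eye(m) + rho * np.ones((m, m))
            Ri = np.linalg.inv(R)
            XtVX += X.T @ Ri @ X
            XtVy += X.T @ Ri @ y
        beta = np.linalg.solve(XtVX, XtVy)
        # Moment estimate of rho from within-cluster residual cross-products
        res = [y - X @ beta for X, y in zip(X_groups, y_groups)]
        s2 = np.var(np.concatenate(res))
        pairs = [r[i] * r[j] for r in res
                 for i in range(len(r)) for j in range(i + 1, len(r))]
        rho = np.mean(pairs) / s2
    return beta, rho

# Hypothetical data: 200 clusters of size 4, true beta = (1, 2), true rho = 0.5
rng = np.random.default_rng(2)
Xg, yg = [], []
for _ in range(200):
    X = np.column_stack([np.ones(4), rng.normal(size=4)])
    b = rng.normal()                       # shared cluster effect -> rho = 0.5
    yg.append(X @ np.array([1.0, 2.0]) + b + rng.normal(0, 1, 4))
    Xg.append(X)
beta_hat, rho_hat = gee_exchangeable(Xg, yg)
```

GEEB replaces the linear predictor here with boosted learners built on random feature subsets while keeping the estimating-equation treatment of the correlation.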
11. Analysis of correlated unit-Lindley data based on estimating equations.
- Author: Silva, Danilo V., Akdur, Hatice Tul Kubra, and Paula, Gilberto A.
- Subjects: MARGINAL distributions, EQUATIONS, WATER supply, STATISTICAL correlation, REGRESSION analysis, GENERALIZED estimating equations
- Abstract
In this paper we derive estimating equations for modeling unbalanced correlated data sets in which the marginal distributions follow the one-parameter unit-Lindley distribution with domain on the interval (0,1). A class of regression models is proposed for modeling the location parameter, and a reweighted iterative process is developed for the joint estimation of the regression coefficients and the correlation structure. Simulation studies are performed to assess the empirical properties of the derived estimators, and diagnostic procedures, such as residual analysis and sensitivity studies based on conformal local influence, are given. Finally, we analyze the proportion of people in households with inadequate water supply and sewage within federation units of Brazil by the procedures developed in the paper. [ABSTRACT FROM AUTHOR]
- Published: 2023
12. A latent functional approach for modeling the effects of multidimensional exposures on disease risk.
- Author: Kim, Sungduk, Beane Freeman, Laura E., and Albert, Paul S.
- Subjects: MARKOV chain Monte Carlo, RISK exposure, RHEUMATIC heart disease
- Abstract
Understanding the relationships between exposure and disease incidence is an important problem in environmental epidemiology. Typically, a large number of these exposures are measured, and it is found either that a few exposures transmit risk or that each exposure transmits a small amount of risk, but, taken together, these may pose a substantial disease risk. Further, these exposure effects can be nonlinear. We develop a latent functional approach, which assumes that the individual effect of each exposure can be characterized as one of a series of unobserved functions, where the number of latent functions is less than or equal to the number of exposures. We propose Bayesian methodology to fit models with a large number of exposures and show that existing Bayesian group LASSO approaches are a special case of the proposed model. An efficient Markov chain Monte Carlo sampling algorithm is developed for carrying out Bayesian inference. The deviance information criterion is used to choose an appropriate number of nonlinear latent functions. We demonstrate the good properties of the approach using simulation studies. Further, we show that complex exposure relationships can be represented with only a few latent functional curves. The proposed methodology is illustrated with an analysis of the effect of cumulative pesticide exposure on cancer risk in a large cohort of farmers. [ABSTRACT FROM AUTHOR]
- Published: 2023
13. A flexible partially linear regression with random effects for bimodal data with application in postharvest.
- Author: Vasconcelos, Julio C. S., Ortega, Edwin M. M., Kluge, Ricardo A., Cordeiro, Gauss M., and Vila, Roberto
- Abstract
In the postharvest area of agricultural products, it is common to have data sets in which the dependent variable may not be symmetric or unimodal; therefore, the use of usual regression models is not always appropriate. In addition, the data can also have repeated measurements over time. With this in mind, the aim of this article is to propose a partially linear regression model with random effects for postharvest bimodal data; this extended regression is based on the Birnbaum–Saunders distribution. Additionally, we present different mathematical properties of this new model. To verify the unbiasedness and accuracy of the estimators, a simulation study was carried out. The parameters of the partially linear regression with random effects are estimated using penalized maximum likelihood. The results of the developed model are presented empirically and indicate that the storage temperature can cause a significant change in the lychee's respiratory metabolism. Thus, according to the model selection and residual analysis measures, it can be concluded that the proposed model is an adequate statistical tool for postharvest data analysis. [ABSTRACT FROM AUTHOR]
- Published: 2023
14. SAGA Application for Generalized Estimating Equations Analysis
- Author: Moncaixa, Luís, Braga, Ana Cristina, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Gervasi, Osvaldo, editor, Murgante, Beniamino, editor, Rocha, Ana Maria A. C., editor, Garau, Chiara, editor, Scorza, Francesco, editor, Karaca, Yeliz, editor, and Torre, Carmelo M., editor
- Published: 2023
15. Testing Informativeness of Covariate-Induced Group Sizes in Clustered Data
- Author: Hasika K. Wickrama Senevirathne and Sandipan Dutta
- Subjects: binary group, bootstrap test, clustered data, correlated data, distribution function, informative cluster size
- Abstract
Clustered data are a special type of correlated data where units within a cluster are correlated while units between different clusters are independent. The number of units in a cluster can be associated with that cluster’s outcome. This is called the informative cluster size (ICS), which is known to impact clustered data inference. However, when comparing the outcomes from multiple groups of units in clustered data, investigating ICS may not be enough. This is because the number of units belonging to a particular group in a cluster can be associated with the outcome from that group in that cluster, leading to an informative intra-cluster group size or IICGS. This phenomenon of IICGS can exist even in the absence of ICS. Ignoring the existence of IICGS can result in a biased inference for group-based outcome comparisons in clustered data. In this article, we mathematically formulate the concept of IICGS while distinguishing it from ICS and propose a nonparametric bootstrap-based statistical hypothesis-testing mechanism for testing any claim of IICGS in a clustered data setting. Through simulations and real data applications, we demonstrate that our proposed statistical testing method can accurately identify IICGS, with substantial power, in clustered data.
- Published: 2024
16. Avoiding Blunders When Analyzing Correlated Data, Clustered Data, or Repeated Measures.
- Author: Yu-Hui H. Chang, Buras, Matthew R., Davis III, John M., and Crowson, Cynthia S.
- Abstract
Rheumatology research often involves correlated and clustered data. A common error in analyzing these data is to treat them as independent observations, which can lead to incorrect statistical inference. The data used are a subset of the 2017 study from Raheel et al, consisting of 633 patients with rheumatoid arthritis (RA) seen between 1988 and 2007. RA flare and the number of swollen joints served as our binary and continuous outcomes, respectively. Generalized linear models (GLM) were fitted for each, while adjusting for rheumatoid factor (RF) positivity and sex. Additionally, a generalized linear mixed model with a random intercept and a generalized estimating equation were used to model RA flare and the number of swollen joints, respectively, to take the additional correlation into account. The GLM's β coefficients and their 95% confidence intervals (CIs) are then compared to their mixed-effects equivalents. The β coefficients are very similar between methodologies. However, their standard errors increase when correlation is accounted for. As a result, if the additional correlation is not considered, the standard error can be underestimated. This results in an overestimated effect size, narrower CIs, increased type I error, and a smaller P value, thus potentially producing misleading results. It is important to model the additional correlation that occurs in correlated data. [ABSTRACT FROM AUTHOR]
- Published: 2023
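The underestimated standard error described in this abstract is easy to reproduce numerically. The sketch below uses hypothetical simulated repeated measures (patients with several visits each, not the article's RA data): the naive standard error of the overall mean pretends all observations are independent, while aggregating to patient means respects the clustering.

```python
import numpy as np

rng = np.random.default_rng(3)
n_pat, n_visits = 200, 5
patient_effect = rng.normal(0, 1.0, n_pat)             # induces within-patient correlation
y = patient_effect[:, None] + rng.normal(0, 1.0, (n_pat, n_visits))

flat = y.ravel()
se_naive = flat.std(ddof=1) / np.sqrt(flat.size)       # pretends N = 1000 independent obs

cluster_means = y.mean(axis=1)
se_cluster = cluster_means.std(ddof=1) / np.sqrt(n_pat)  # respects the clustering

# The naive SE is markedly too small: CIs too narrow, p-values too optimistic.
ratio = se_cluster / se_naive
```

With these variance components the cluster-respecting standard error is roughly 1.7 times the naive one, which is the mechanism behind the inflated type I error the article warns about.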
17. Model-based entropy estimation for data with covariates and dependence structures.
- Author: Altieri, Linda, Cocchi, Daniela, and Ventrucci, Massimo
- Subjects: ENTROPY, ENVIRONMENTAL sciences, RAIN forests, GROWTH curves (Statistics)
- Abstract
Entropy is widely used in ecological and environmental studies, where data often present complex interactions. Difficulties arise in linking entropy to available covariates or data dependence structures, thus, all existing entropy estimators assume independence. To overcome this limit, we take a Bayesian model-based approach which focuses on estimating the probabilities that compose the index, accounting for any data dependence and correlation. An estimate of entropy can be constructed from the model fitted values, returning an observation-specific measure of entropy rather than an overall index. This way, the latent heterogeneity of the system can be represented by a curve in time or a surface in space, according to the characteristics of the survey study at hand. An empirical study illustrates the flexibility and interpretability of our results over temporally and spatially correlated data. An application is presented about the biodiversity of spatially structured rainforest tree data. [ABSTRACT FROM AUTHOR]
- Published: 2023
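The key idea above is computing entropy from model-fitted, observation-specific probabilities rather than from one pooled index. Leaving aside the authors' Bayesian dependence modeling, the sketch below illustrates just that contrast in NumPy with a hypothetical fitted probability curve over time:

```python
import numpy as np

def bernoulli_entropy(p):
    """Shannon entropy (nats) of Bernoulli(p), elementwise; clips to avoid log(0)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

# Hypothetical fitted occurrence probabilities drifting over 50 time points
t = np.linspace(0, 1, 50)
p_fit = 1 / (1 + np.exp(-(3 * t - 1.5)))        # logistic trend in time
entropy_curve = bernoulli_entropy(p_fit)        # observation-specific entropy in time

# A single plug-in index instead collapses the whole series to one number
entropy_pooled = bernoulli_entropy(p_fit.mean())
```

The curve peaks where the fitted probability passes 0.5 (maximum uncertainty) and falls toward the ends, heterogeneity that the single pooled value hides entirely.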
18. Evaluation of Risk Prediction with Hierarchical Data: Dependency Adjusted Confidence Intervals for the AUC
- Author: Camden Bay, Robert J Glynn, Johanna M Seddon, Mei-Ling Ting Lee, and Bernard Rosner
- Subjects: AUC, clustered data, confidence interval, correlated data, hierarchical models
- Abstract
The area under the true ROC curve (AUC) is routinely used to determine how strongly a given model discriminates between the levels of a binary outcome. Standard inference with the AUC requires that outcomes be independent of each other. To overcome this limitation, a method was developed for the estimation of the variance of the AUC in the setting of two-level hierarchical data using probit-transformed prediction scores generated from generalized estimating equation models, thereby allowing for the application of inferential methods. This manuscript presents an extension of this approach so that inference for the AUC may be performed in a three-level hierarchical data setting (e.g., eyes nested within persons and persons nested within families). A method that accounts for the effect of tied prediction scores on inference is also described. The performance of 95% confidence intervals around the AUC was assessed through the simulation of three-level clustered data in multiple settings, including ones with tied data and variable cluster sizes. Across all settings, the actual 95% confidence interval coverage varied from 0.943 to 0.958, and the ratio of the theoretical variance to the empirical variance of the AUC varied from 0.920 to 1.013. The results are better than those from existing methods. Two examples of applying the proposed methodology are presented.
- Published: 2023
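The paper derives analytic, dependency-adjusted variance estimates for the AUC. As a simpler generic alternative (not the authors' method), one can resample whole clusters and take a percentile interval; the NumPy sketch below does this on hypothetical two-level data (e.g. two eyes per person):

```python
import numpy as np

def auc(scores, labels):
    """Mann-Whitney AUC: probability that a positive outranks a negative."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    return (pos[:, None] > neg[None, :]).mean()

rng = np.random.default_rng(4)
n_clusters, m = 100, 2                       # hypothetical: two eyes per person
cluster_eff = rng.normal(0, 1, n_clusters)   # shared person-level effect
labels = (rng.random((n_clusters, m)) < 0.5).astype(int)
scores = 1.2 * labels + cluster_eff[:, None] + rng.normal(0, 1, (n_clusters, m))

point = auc(scores.ravel(), labels.ravel())
boots = []
for _ in range(500):
    idx = rng.integers(0, n_clusters, n_clusters)   # resample whole clusters
    boots.append(auc(scores[idx].ravel(), labels[idx].ravel()))
lo, hi = np.percentile(boots, [2.5, 97.5])
```

Resampling clusters rather than individual observations is what keeps the interval honest under dependence; an observation-level bootstrap here would understate the variance, which is the failure mode the paper's analytic intervals correct.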
19. lmerSeq: an R package for analyzing transformed RNA-Seq data with linear mixed effects models
- Author: Brian E. Vestal, Elizabeth Wynn, and Camille M. Moore
- Subjects: RNA-Seq, Linear mixed models, Correlated data
- Abstract
Background: Studies that utilize RNA Sequencing (RNA-Seq) in conjunction with designs that introduce dependence between observations (e.g. longitudinal sampling) require specialized analysis tools to accommodate this additional complexity. This R package contains a set of utilities to fit linear mixed effects models to transformed RNA-Seq counts that properly account for this dependence when performing statistical analyses. Results: In a simulation study comparing lmerSeq and two existing methodologies that also work with transformed RNA-Seq counts, we found that lmerSeq was comprehensively better in terms of nominal error rate control and statistical power. Conclusions: Existing R packages for analyzing transformed RNA-Seq data with linear mixed models are limited in the variance structures they allow and/or the transformation methods they support. The lmerSeq package offers more flexibility in both of these areas and gave substantially better results in our simulations.
- Published: 2022
20. Evaluation of Risk Prediction with Hierarchical Data: Dependency Adjusted Confidence Intervals for the AUC.
- Author: Bay, Camden, Glynn, Robert J, Seddon, Johanna M, Lee, Mei-Ling Ting, and Rosner, Bernard
- Subjects: RECEIVER operating characteristic curves, CONFIDENCE intervals, GENERALIZED estimating equations, DATA analysis, PREDICTION models
- Abstract
The area under the true ROC curve (AUC) is routinely used to determine how strongly a given model discriminates between the levels of a binary outcome. Standard inference with the AUC requires that outcomes be independent of each other. To overcome this limitation, a method was developed for the estimation of the variance of the AUC in the setting of two-level hierarchical data using probit-transformed prediction scores generated from generalized estimating equation models, thereby allowing for the application of inferential methods. This manuscript presents an extension of this approach so that inference for the AUC may be performed in a three-level hierarchical data setting (e.g., eyes nested within persons and persons nested within families). A method that accounts for the effect of tied prediction scores on inference is also described. The performance of 95% confidence intervals around the AUC was assessed through the simulation of three-level clustered data in multiple settings, including ones with tied data and variable cluster sizes. Across all settings, the actual 95% confidence interval coverage varied from 0.943 to 0.958, and the ratio of the theoretical variance to the empirical variance of the AUC varied from 0.920 to 1.013. The results are better than those from existing methods. Two examples of applying the proposed methodology are presented. [ABSTRACT FROM AUTHOR]
- Published: 2023
21. Clinical trial design and analysis for comparing three treatments with intra-individual right- and left-hand data.
- Author: Machida, Ryunosuke, Sakamaki, Kentaro, and Kuchiba, Aya
- Subjects: TREATMENT of peripheral neuropathy, EXPERIMENTAL design, STATISTICS, PERIPHERAL neuropathy, CEREBRAL dominance, CLINICAL trials, CANCER chemotherapy, COLD therapy, REGRESSION analysis, SIMULATION methods in education, TREATMENT effectiveness, T-test (Statistics), COMPRESSION therapy, GLOVES, DATA analysis, DATA analysis software, PROBABILITY theory
- Abstract
Background: Chemotherapy-induced peripheral neuropathy can occur in the right and left hand. Studies on prevention treatments for chemotherapy-induced peripheral neuropathy have largely adopted either self-controlled designs or parallel designs to compare two preventive treatments. When three treatment options (two experimental treatments and a control treatment) are available, both designs can be extended. However, no clinical trials have adopted a self-controlled design to compare three prevention treatments for chemotherapy-induced peripheral neuropathy. The incomplete block crossover design for more than two treatments can be extended to compare three treatments in the self-controlled design. In a simple extension, some of the participants receive the two experimental treatments in both hands; however, it may be difficult to administer different experimental treatments to the two hands for practical reasons, such as concern about different types of unexpected adverse events. This study proposes a design and analysis method appropriate for the situation where only one experimental treatment is provided to each participant. Methods: We assume clinical trials that compare each of the two experimental treatments (E1 and E2) with the control treatment (C), and that compare the two experimental treatments with each other only when both are superior to the control treatment. We propose a self-controlled design that randomizes participants equally to four arms to adjust for the dominant-hand effect: Arm 1: E1 for the right hand, C for the left hand; Arm 2: C for the right hand, E1 for the left hand; Arm 3: E2 for the right hand, C for the left hand; and Arm 4: C for the right hand, E2 for the left hand. We compare the operating characteristics of the proposed design with those of a three-arm parallel design in which participants receive the same treatment in both hands. We also assess three proposed analysis methods for comparisons between experimental treatments in the self-controlled design under several conditions of correlation between the right and left hands using simulation studies. Results: The simulation studies showed that the proposed design was more powerful than the three-arm parallel design when the correlation was 0.3 or higher. For comparisons between experimental treatments, the methods based on the regression model, including the outcome of hands with C as a covariate, had the highest power under modest to high correlation among the analysis methods in the self-controlled design. Conclusion: The proposed design can improve the power for comparing the two experimental treatments with the control treatment. Our design is useful in situations where it is undesirable for participants to receive different experimental treatments in the two hands for practical reasons. [ABSTRACT FROM AUTHOR]
- Published: 2023
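The power gain from the self-controlled design comes from differencing out the shared within-person component. A simplified NumPy simulation (hypothetical effect size and correlation, large-sample z-tests rather than the paper's analysis methods) shows the same qualitative result as the abstract:

```python
import numpy as np

rng = np.random.default_rng(5)
n, effect, rho = 40, 0.5, 0.6        # hypothetical sample size, effect, R-L correlation
n_sim, z_crit = 2000, 1.96
cov = [[1, rho], [rho, 1]]

wins_paired = wins_parallel = 0
for _ in range(n_sim):
    # Self-controlled: treated and control hand on the same n participants
    both = rng.multivariate_normal([effect, 0.0], cov, n)
    d = both[:, 0] - both[:, 1]
    if abs(d.mean() / (d.std(ddof=1) / np.sqrt(n))) > z_crit:
        wins_paired += 1
    # Parallel: two independent groups of n participants
    t = rng.normal(effect, 1, n)
    c = rng.normal(0, 1, n)
    se = np.sqrt(t.var(ddof=1) / n + c.var(ddof=1) / n)
    if abs((t.mean() - c.mean()) / se) > z_crit:
        wins_parallel += 1

power_paired = wins_paired / n_sim
power_parallel = wins_parallel / n_sim
```

At correlation 0.6 the within-person difference has variance 2(1 - rho) = 0.8 versus 2 for the independent-groups contrast, so the self-controlled comparison rejects far more often at the same sample size.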
22. Imputation-Based Variable Selection Method for Block-Wise Missing Data When Integrating Multiple Longitudinal Studies
- Author: Zhongzhe Ouyang, Lu Wang, and Alzheimer's Disease Neuroimaging Initiative
- Subjects: multiple imputation, correlated data, data integration
- Abstract
When integrating data from multiple sources, a common challenge is block-wise missing. Most existing methods address this issue only in cross-sectional studies. In this paper, we propose a method for variable selection when combining datasets from multiple sources in longitudinal studies. To account for block-wise missing in covariates, we impute the missing values multiple times based on combinations of samples from different missing pattern and predictors from different data sources. We then use these imputed data to construct estimating equations, and aggregate the information across subjects and sources with the generalized method of moments. We employ the smoothly clipped absolute deviation penalty in variable selection and use the extended Bayesian Information Criterion criteria for tuning parameter selection. We establish the asymptotic properties of the proposed estimator, and demonstrate the superior performance of the proposed method through numerical experiments. Furthermore, we apply the proposed method in the Alzheimer’s Disease Neuroimaging Initiative study to identify sensitive early-stage biomarkers of Alzheimer’s Disease, which is crucial for early disease detection and personalized treatment.
- Published: 2024
23. Cluster Randomized Trials
- Author: Moulton, Lawrence H., Hayes, Richard J., Choodari-Oskooei, Babak, Section editor, Parmar, Mahesh, Section editor, Piantadosi, Steven, editor, and Meinert, Curtis L., editor
- Published: 2022
24. A random effect regression based on the odd log-logistic generalized inverse Gaussian distribution.
- Author: Vasconcelos, J. C. S., Cordeiro, G. M., Ortega, E. M. M., and Silva, G. O.
- Subjects: INVERSE Gaussian distribution, GAUSSIAN distribution, RANDOM effects model, REGRESSION analysis, PRICES
- Abstract
In recent decades, the use of regression models with random effects has made great progress. Among these models' attractions is the flexibility to analyze correlated data. In various situations, the distribution of the response variable presents asymmetry or bimodality. In these cases, it is possible to use normal regression with a random effect at the intercept. In light of these contexts, i.e. the desire to analyze correlated data in the presence of bimodality or asymmetry, in this paper we propose a regression model with a random effect at the intercept based on the odd log-logistic generalized inverse Gaussian distribution for correlated data. Maximum likelihood is adopted to estimate the parameters, and various simulations are performed for correlated data. A type of residual for the new regression is proposed whose empirical distribution is close to normal. The versatility of the new regression is demonstrated by estimating the average price per hectare of bare land in 10 municipalities in the state of São Paulo (Brazil). In this context, various databases are constantly emerging, requiring flexible modeling. Thus, the model is likely to be of interest to data analysts and can make a useful contribution to the statistical literature. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
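The appeal of a random intercept for correlated data, as in the abstract above, is that a shared per-unit effect induces within-unit correlation. A minimal numpy sketch with assumed variance values (the paper itself works with the odd log-logistic generalized inverse Gaussian model, not the Gaussian used here):

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, sigma_u, sigma_e = 500, 2.0, 1.0    # unit count and SDs (assumed values)

u = rng.normal(0.0, sigma_u, n_units)        # one random intercept per unit
y = u[:, None] + rng.normal(0.0, sigma_e, (n_units, 2))   # two measures per unit

# The shared intercept induces within-unit correlation:
# sigma_u^2 / (sigma_u^2 + sigma_e^2) = 4 / 5 = 0.8
r = np.corrcoef(y[:, 0], y[:, 1])[0, 1]
```

The empirical correlation `r` should land near the theoretical intraclass correlation of 0.8.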
25. Modeling noisy time-series data of crime with stochastic differential equations.
- Author
-
Calatayud, Julia, Jornet, Marc, and Mateu, Jorge
- Subjects
- *
CRIME statistics , *PREDICTIVE policing , *CRIME , *BROWNIAN motion , *GEOMETRIC modeling , *TIME series analysis , *HILBERT-Huang transform - Abstract
We develop and calibrate stochastic continuous models that capture crime dynamics in the city of Valencia, Spain. From emergency phone calls, data corresponding to three crime events (aggressions, stealing, and women's alarms) are available from the year 2010 until 2020. As the resulting monthly-count time series are highly noisy, we decompose them into trend and seasonality parts. The former is modeled by geometric Brownian motions, both uncorrelated and correlated, and the latter is accommodated by randomly perturbed sine-cosine waves. Albeit simple, the models exhibit high ability to simulate the real data and show promise for crime-interaction identification and short-term predictive policing. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
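The trend/seasonality decomposition described above can be imitated in a few lines of Python: a geometric Brownian motion for the trend plus a randomly perturbed sine wave for the seasonal part. Drift, volatility, and perturbation values here are invented, not calibrated to the Valencia data.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, s0 = 0.05, 0.2, 100.0      # drift, volatility, initial level (invented)
n, dt = 132, 1.0 / 12                 # monthly steps spanning 2010-2020

# Exact GBM step: S_{t+dt} = S_t * exp((mu - sigma^2 / 2) dt + sigma sqrt(dt) Z_t)
z = rng.standard_normal(n)
trend = s0 * np.exp(np.cumsum((mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * z))

# Seasonality as a randomly perturbed sine wave with a 12-month period
t = np.arange(n)
season = (1.0 + 0.1 * rng.standard_normal(n)) * np.sin(2.0 * np.pi * t / 12.0)
series = trend * (1.0 + 0.05 * season)   # recombine (multiplicative, illustrative)
```

Using the exact exponential update, rather than an Euler step, keeps the simulated trend strictly positive, which matters for count-like series.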
26. Ultra-High Dimensional Quantile Regression for Longitudinal Data: An Application to Blood Pressure Analysis.
- Author
-
Zu, Tianhai, Lian, Heng, Green, Brittany, and Yu, Yan
- Subjects
- *
QUANTILE regression , *BLOOD pressure , *BLOOD testing , *PANEL analysis , *HYPERTENSION , *SINGLE nucleotide polymorphisms - Abstract
Despite major advances in research and treatment, identifying important genotype risk factors for high blood pressure remains challenging. Traditional genome-wide association studies (GWAS) focus on one single nucleotide polymorphism (SNP) at a time. We aim to select among over half a million SNPs along with time-varying phenotype variables via simultaneous modeling and variable selection, focusing on the most dangerous blood pressure levels at high quantiles. Taking advantage of rich data from a large-scale public health study, we develop and apply a novel quantile penalized generalized estimating equations (GEE) approach, incorporating several key aspects including ultra-high dimensional genetic SNPs, the longitudinal nature of blood pressure measurements, time-varying covariates, and conditional high quantiles of blood pressure. Importantly, we identify interesting new SNPs for high blood pressure. In addition, we find blood pressure levels are likely heterogeneous: the important risk factors identified differ among quantiles. This comprehensive picture of conditional quantiles of blood pressure allows more insights and targeted treatments. We provide an efficient computational algorithm and prove consistency, asymptotic normality, and the oracle property for the quantile penalized GEE estimators with ultra-high dimensional predictors. Moreover, we establish model-selection consistency for high-dimensional BIC. Simulation studies show the promise of the proposed approach. Supplementary materials for this article are available online. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
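At the core of quantile (penalized) GEE estimation is the quantile "check" loss. A small Python sketch, assuming i.i.d. data for simplicity, shows that minimizing it over a constant recovers the empirical quantile:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(size=10001)
tau = 0.9

def pinball(u, tau):
    # Quantile "check" loss: tau * u if u >= 0, (tau - 1) * u otherwise
    return np.where(u >= 0, tau * u, (tau - 1) * u)

# Minimizing the average check loss over a constant recovers the tau-quantile
grid = np.linspace(-3.0, 3.0, 2001)
avg_loss = np.array([pinball(y - c, tau).mean() for c in grid])
c_star = grid[int(np.argmin(avg_loss))]
```

Replacing the constant with a linear predictor, and the grid search with estimating equations over correlated repeated measures, is where the paper's machinery takes over.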
27. Correct specification of design matrices in linear mixed effects models: tests with graphical representation.
- Author
-
Peterlin, Jakob, Kejžar, Nataša, and Blagus, Rok
- Abstract
Linear mixed effects models (LMMs) are a popular and powerful tool for analysing grouped or repeated observations for numeric outcomes. LMMs consist of a fixed and a random component, which are specified in the model through their respective design matrices. Verifying the correct specification of the two design matrices is important since mis-specifying them can affect the validity and efficiency of the analysis. We show how to use empirical stochastic processes constructed from appropriately ordered and standardized residuals from the model to test whether the design matrices of the fitted LMM are correctly specified. We define two different processes: one can be used to test whether both design matrices are correctly specified, and the other can be used only to test whether the fixed effects design matrix is correctly specified. The proposed empirical stochastic processes are smoothed versions of cumulative sum processes, which have a nice graphical representation in which model mis-specification can easily be observed. The amount of smoothing can be adjusted, which facilitates visual inspection and can potentially increase the power of the tests. We propose a computationally efficient procedure for estimating p-values in which refitting of the LMM is not necessary. Its validity is shown by using theoretical results and a large Monte Carlo simulation study. The proposed methodology could be used with LMMs with multilevel or crossed random effects. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
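The idea of cumulative-sum processes of ordered residuals can be illustrated with ordinary least squares in Python. This is a deliberate simplification of the paper's LMM setting, and the data-generating model is invented:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.uniform(-2.0, 2.0, n)
y = 1.0 + x + 0.8 * x ** 2 + rng.normal(0.0, 1.0, n)   # true mean is quadratic

def cusum_process(design, y, order_by):
    # Least-squares residuals, ordered by a covariate, then cumulatively summed
    beta = np.linalg.lstsq(design, y, rcond=None)[0]
    resid = y - design @ beta
    return np.cumsum(resid[np.argsort(order_by)]) / np.sqrt(len(y))

X_lin = np.column_stack([np.ones(n), x])            # mis-specified fixed effects
X_quad = np.column_stack([np.ones(n), x, x ** 2])   # correctly specified
sup_lin = np.abs(cusum_process(X_lin, y, x)).max()
sup_quad = np.abs(cusum_process(X_quad, y, x)).max()
# A large supremum for the linear fit flags the omitted quadratic term
```

Plotting the two processes makes the graphical representation obvious: the mis-specified fit drifts systematically, while the correct fit wanders around zero.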
28. Regression, Transformations, and Mixed-Effects with Marine Bryozoans
- Author
-
Ciaran Evans
- Subjects
Applications ,Correlated data ,Data processing ,Regression assumptions ,Probabilities. Mathematical statistics ,QA273-280 ,Special aspects of education ,LC8-6691 - Abstract
This article demonstrates how data from a biology paper, which analyzes the relationship between mass and metabolic rate for two species of marine bryozoan, can be used to teach a variety of regression topics to both introductory and advanced students. A thorough analysis requires intelligent data wrangling, variable transformations, and accounting for correlation between observations. The bryozoan data can be used as a valuable class example throughout the semester, or as a dataset for extended homework assignments and class projects. Supplementary materials for this article are available online.
- Published
- 2022
- Full Text
- View/download PDF
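One transformation the bryozoan analysis motivates is the log-log regression that linearizes an allometric power law. A Python sketch with simulated data (the exponent and scale are assumed, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
mass = rng.lognormal(mean=0.0, sigma=0.5, size=n)
# Allometric power law with multiplicative noise: rate = a * mass^b (a=2, b=0.75 assumed)
rate = 2.0 * mass ** 0.75 * np.exp(rng.normal(0.0, 0.1, n))

# Taking logs turns the power law into a line: log(rate) = log(a) + b * log(mass)
X = np.column_stack([np.ones(n), np.log(mass)])
intercept, slope = np.linalg.lstsq(X, np.log(rate), rcond=None)[0]
```

The fitted slope estimates the allometric exponent `b`, and exponentiating the intercept recovers the scale `a`; the same transformation is what makes mixed-effects extensions tractable for the grouped bryozoan colonies.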
29. A comparison of methods for multiple degree of freedom testing in repeated measures RNA-sequencing experiments
- Author
-
Elizabeth A. Wynn, Brian E. Vestal, Tasha E. Fingerlin, and Camille M. Moore
- Subjects
RNA-seq ,Correlated data ,Multiple DF testing ,Medicine (General) ,R5-920 - Abstract
Background: As the cost of RNA-sequencing decreases, complex study designs, including paired, longitudinal, and other correlated designs, become increasingly feasible. These studies often include multiple hypotheses and thus multiple degree of freedom tests, or tests that evaluate multiple hypotheses jointly, are often useful for filtering the gene list to a set of interesting features for further exploration while controlling the false discovery rate. Though there are several methods which have been proposed for analyzing correlated RNA-sequencing data, there has been little research evaluating and comparing the performance of multiple degree of freedom tests across methods. Methods: We evaluated 11 different methods for modelling correlated RNA-sequencing data by performing a simulation study to compare the false discovery rate, power, and model convergence rate across several hypothesis tests and sample size scenarios. We also applied each method to a real longitudinal RNA-sequencing dataset. Results: Linear mixed modelling using transformed data had the best false discovery rate control while maintaining relatively high power. However, this method had high model non-convergence, particularly at small sample sizes. No method had high power at the lowest sample size. We found a mix of conservative and anti-conservative behavior across the other methods, which was influenced by the sample size and the hypothesis being evaluated. The patterns observed in the simulation study were largely replicated in the analysis of a longitudinal study including data from intensive care unit patients experiencing cardiogenic or septic shock. Conclusions: Multiple degree of freedom testing is a valuable tool in longitudinal and other correlated RNA-sequencing experiments. Of the methods that we investigated, linear mixed modelling had the best overall combination of power and false discovery rate control. Other methods may also be appropriate in some scenarios.
- Published
- 2022
- Full Text
- View/download PDF
30. Integrating Random Effects in Deep Neural Networks.
- Author
-
Simchoni, Giora and Rosset, Saharon
- Subjects
- *
ARTIFICIAL neural networks , *SUPERVISED learning , *DEEP learning - Abstract
Modern approaches to supervised learning like deep neural networks (DNNs) typically implicitly assume that observed responses are statistically independent. In contrast, correlated data are prevalent in real-life large-scale applications, with typical sources of correlation including spatial, temporal and clustering structures. These correlations are either ignored by DNNs, or ad-hoc solutions are developed for specific use cases. We propose to use the mixed models framework to handle correlated data in DNNs. By treating the effects underlying the correlation structure as random effects, mixed models are able to avoid overfitted parameter estimates and ultimately yield better predictive performance. The key to combining mixed models and DNNs is using the Gaussian negative log-likelihood (NLL) as a natural loss function that is minimized with DNN machinery including stochastic gradient descent (SGD). Since the NLL does not decompose like standard DNN loss functions, the use of SGD with the NLL presents some theoretical and implementation challenges, which we address. Our approach, which we call LMMNN, is demonstrated to improve performance over natural competitors in various correlation scenarios on diverse simulated and real datasets. Our focus is on a regression setting and tabular datasets, but we also show some results for classification. Our code is available at https://github.com/gsimchoni/lmmnn. [ABSTRACT FROM AUTHOR]
- Published
- 2023
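The Gaussian NLL loss that LMMNN minimizes can be written directly. A numpy sketch with a random-intercept covariance (cluster sizes and variance values here are arbitrary, and the real method plugs a DNN output in for the mean and minimizes with SGD rather than evaluating once):

```python
import numpy as np

def gaussian_nll(y, mean, cov):
    # Multivariate-normal negative log-likelihood: does not decompose over
    # observations when cov is non-diagonal, which is the SGD challenge
    n = len(y)
    r = y - mean
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (n * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(cov, r))

# Random-intercept covariance for 6 observations in 3 clusters of 2:
# sigma_u^2 * Z Z^T + sigma_e^2 * I, so responses within a cluster are correlated
Z = np.repeat(np.eye(3), 2, axis=0)
cov = 1.5 * Z @ Z.T + 1.0 * np.eye(6)
y = np.array([1.0, 1.2, -0.5, -0.4, 0.3, 0.1])
loss = gaussian_nll(y, np.zeros(6), cov)
```

With an identity covariance the expression collapses to the usual sum of per-observation Gaussian losses, which is exactly the independence assumption the paper relaxes.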
31. A model selection criterion for clustered survival analysis with informative cluster size.
- Author
-
Chien, Li‐Chu, Chang, Li‐Ying, and Shen, Chung‐Wei
- Subjects
- *
SURVIVAL analysis (Biometry) , *PROPORTIONAL hazards models , *CLUSTER analysis (Statistics) , *DISEASE risk factors - Abstract
We propose a model selection criterion for correlated survival data when the cluster size is informative to the outcome. This approach, called the Resampling Cluster Survival Information Criterion (RCSIC), uses the Cox proportional hazards model weighted with the inverse of the cluster size. The RCSIC, based on the within‐cluster resampling idea, takes into account the possible variability of the within‐cluster subsampling and the possible informativeness of cluster sizes. The RCSIC allows for easy execution of the within‐cluster resampling idea without a large number of resamples of the data. In contrast with the traditional model selection method in survival analysis, the RCSIC has an additional penalization for the within‐cluster subsampling variability. Our simulations show satisfactory results: the RCSIC provides more robust power for variable selection in clustered survival analysis, regardless of whether informative cluster size exists or not. Applying the RCSIC method to a periodontal disease study, we identify tooth loss in patients associated with the risk factors Age, Filled Tooth, Molar, Crown, Decayed Tooth, and Smoking Status. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
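The effect of weighting by the inverse of the cluster size, as the RCSIC does inside the Cox model, can be seen with a plain mean in Python. The survival machinery is omitted; the simulated data simply make cluster size informative:

```python
import numpy as np

rng = np.random.default_rng(8)
# Informative cluster size: a cluster of size s has outcomes centered at s
sizes = rng.integers(1, 6, 500)
clusters = [rng.normal(loc=float(s), scale=1.0, size=s) for s in sizes]
y = np.concatenate(clusters)

# The naive mean over all observations is pulled toward the big clusters
naive = y.mean()
# Inverse-cluster-size weights make every cluster count equally,
# recovering the mean of the cluster-level means
w = np.concatenate([np.full(s, 1.0 / s) for s in sizes])
weighted = np.average(y, weights=w)
cluster_means = np.array([c.mean() for c in clusters])
```

The inverse-size weighted average is algebraically identical to the within-cluster-resampling expectation (sample one member per cluster and average), which is why the two ideas pair naturally.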
32. A new approach to modeling positive random variables with repeated measures.
- Author
-
de Freitas, João Victor B., Nobre, Juvêncio S., Bourguignon, Marcelo, and Santos-Neto, Manoel
- Subjects
- *
RANDOM variables , *MONTE Carlo method , *REGRESSION analysis , *GENERALIZED estimating equations - Abstract
In many situations, it is common to have more than one observation per experimental unit, thus generating experiments with repeated measures. In the modeling of such experiments, it is necessary to consider and model the intra-unit dependency structure. In the literature, there are several proposals to model positive continuous data with repeated measures. In this paper, we propose another, based on a generalization of the beta prime regression model. We consider the possibility of dependence between observations of the same unit. Residuals and diagnostic tools are also discussed. To evaluate the finite-sample performance of the estimators, using different correlation matrices and distributions, we conducted a Monte Carlo simulation study. The proposed methodology is illustrated with an analysis of a real data set. Finally, we create an R package to make the methodology described in this paper publicly available. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
33. Gaussian Orthogonal Latent Factor Processes for Large Incomplete Matrices of Correlated Data.
- Author
-
Mengyang Gu and Hanmo Li
- Subjects
MATRICES (Mathematics) ,GAUSSIAN function ,PARAMETER estimation ,ORTHOGONAL functions ,DATA analysis - Abstract
We introduce Gaussian orthogonal latent factor processes for modeling and predicting large correlated data. To handle the computational challenge, we first decompose the likelihood function of the Gaussian random field with a multi-dimensional input domain into a product of densities at the orthogonal components with lower-dimensional inputs. The continuous-time Kalman filter is implemented to compute the likelihood function efficiently without making approximations. We also show that the posterior distribution of the factor processes is independent, as a consequence of the prior independence of the factor processes and the orthogonality of the factor loading matrix. For studies with large sample sizes, we propose a flexible way to model the mean, and we derive the marginal posterior distribution to solve identifiability issues in sampling these parameters. Both simulated and real data applications confirm the outstanding performance of this method. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
34. lmerSeq: an R package for analyzing transformed RNA-Seq data with linear mixed effects models.
- Author
-
Vestal, Brian E., Wynn, Elizabeth, and Moore, Camille M.
- Subjects
- *
RNA sequencing , *STATISTICAL power analysis , *ERROR rates , *LINEAR statistical models - Abstract
Background: Studies that utilize RNA Sequencing (RNA-Seq) in conjunction with designs that introduce dependence between observations (e.g. longitudinal sampling) require specialized analysis tools to accommodate this additional complexity. This R package contains a set of utilities to fit linear mixed effects models to transformed RNA-Seq counts that properly account for this dependence when performing statistical analyses. Results: In a simulation study comparing lmerSeq and two existing methodologies that also work with transformed RNA-Seq counts, we found that lmerSeq was comprehensively better in terms of nominal error rate control and statistical power. Conclusions: Existing R packages for analyzing transformed RNA-Seq data with linear mixed models are limited in the variance structures they allow and/or the transformation methods they support. The lmerSeq package offers more flexibility in both of these areas and gave substantially better results in our simulations. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
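A common transformation applied to RNA-Seq counts before linear mixed modelling is log counts-per-million. The minimal version below is a hypothetical Python sketch; lmerSeq itself is an R package and supports its own set of transformations:

```python
import numpy as np

def log_cpm(counts, prior=0.5):
    # log2 counts-per-million: puts counts on a scale where linear
    # (mixed) models are a reasonable approximation
    lib = counts.sum(axis=0)                   # library size per sample (column)
    return np.log2((counts + prior) / (lib + 1.0) * 1e6)

# Two genes, two samples; the second sample is sequenced twice as deeply
counts = np.array([[10.0, 20.0], [90.0, 180.0]])
out = log_cpm(counts)
```

Because CPM normalizes by library size, the two columns come out nearly identical despite the twofold difference in sequencing depth (the small prior count causes a slight offset).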
35. A Pufferfish Privacy Mechanism for the Trajectory Clustering Task
- Author
-
Zeng, Yuying, Sang, Yingpeng, Luo, Shunchao, Song, Mingyang, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Prates, Raquel Oliveira, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Ning, Li, editor, Chau, Vincent, editor, and Lau, Francis, editor
- Published
- 2021
- Full Text
- View/download PDF
36. Web Content Authentication: A Machine Learning Approach to Identify Fake and Authentic Web Pages on Internet
- Author
-
Ashok, Jayakrishnan, Badoni, Pankaj, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Kaiser, M. Shamim, editor, Xie, Juanying, editor, and Rathore, Vijay Singh, editor
- Published
- 2021
- Full Text
- View/download PDF
37. Random errors are neither: On the interpretation of correlated data.
- Subjects
STATISTICAL models ,PHYLOGENETIC models ,STATISTICS ,GOODNESS-of-fit tests ,INDEPENDENT variables ,DEPENDENT variables ,TIME series analysis - Abstract
Many statistical models currently used in ecology and evolution account for covariances among random errors. Here, I address five points: (i) correlated random errors unite many types of statistical models, including spatial, phylogenetic and time‐series models; (ii) random errors are neither unpredictable nor mistakes; (iii) diagnostics for correlated random errors are not useful, but simulations are; (iv) model predictions can be made with random errors; and (v) can random errors be causal? These five points are illustrated by applying statistical models to analyse simulated spatial, phylogenetic and time‐series data. These three simulation studies are paired with three types of predictions that can be made using information from covariances among random errors: predictions for goodness‐of‐fit, interpolation, and forecasting. In the simulation studies, models incorporating covariances among random errors improve inference about the relationship between dependent and independent variables. They also imply the existence of unmeasured variables that generate the covariances among random errors. Understanding the covariances among random errors gives information about possible processes underlying the data. Random errors are caused by something. Therefore, to extract full information from data, covariances among random errors should not just be included in statistical models; they should also be studied in their own right. Data are hard won, and appropriate statistical analyses can make the most of them. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
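The claim that correlated random errors improve inference, and are partly predictable, can be checked analytically for an AR(1) time-series example in Python (illustrative sample size and autocorrelation; no data needed, since the sampling covariances follow from the design and error covariance alone):

```python
import numpy as np

n, rho = 200, 0.8
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])

# Stationary AR(1) error covariance: Sigma_ij = rho^|i-j| / (1 - rho^2)
idx = np.arange(n)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :]) / (1 - rho ** 2)

# Exact sampling covariance of (intercept, slope) under OLS and GLS
XtX_inv = np.linalg.inv(X.T @ X)
var_ols = XtX_inv @ X.T @ Sigma @ X @ XtX_inv
var_gls = np.linalg.inv(X.T @ np.linalg.solve(Sigma, X))
# Because E[e_t | e_{t-1}] = rho * e_{t-1}, the errors are partly predictable,
# and exploiting that (GLS) yields a strictly more efficient slope estimate
```

The same conditional-expectation structure is what powers the interpolation and forecasting predictions the abstract describes: the next "random" error is not unpredictable at all.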
38. Evaluation of GENESIS, SAIGE, REGENIE and fastGWA-GLMM for genome-wide association studies of binary traits in correlated data.
- Author
-
Gurinovich, Anastasia, Mengze Li, Leshchyk, Anastasia, Bae, Harold, Zeyuan Song, Arbeev, Konstantin G., Nygaard, Marianne, Feitosa, Mary F., Perls, Thomas T., and Sebastiani, Paola
- Subjects
GENOME-wide association studies ,LOGISTIC regression analysis ,MULTITRAIT multimethod techniques ,GENETIC variation ,HUMAN phenotype ,NUCLEOTIDE sequencing ,LONGEVITY ,PHENOTYPES - Abstract
Performing a genome-wide association study (GWAS) with a binary phenotype using family data is a challenging task. Using linear mixed effects models is typically unsuitable for binary traits, and numerical approximations of the likelihood function may not work well with rare genetic variants with small counts. Additionally, imbalance in the case-control ratios poses challenges as traditional statistical methods such as the Score test or Wald test perform poorly in this setting. In the last couple of years, several methods have been proposed to better approximate the likelihood function of a mixed effects logistic regression model that uses Saddle Point Approximation (SPA). SPA adjustment has recently been implemented in multiple software, including GENESIS, SAIGE, REGENIE and fastGWA-GLMM: four increasingly popular tools to perform GWAS of binary traits. We compare Score and SPA tests using real family data to evaluate computational efficiency and the agreement of the results. Additionally, we compare various ways to adjust for family relatedness, such as sparse and full genetic relationship matrices (GRM) and polygenic effect estimates. We use the New England Centenarian Study imputed genotype data and the Long Life Family Study whole-genome sequencing data and the binary phenotype of human extreme longevity to compare the agreement of the results and tools' computational performance. The evaluation suggests that REGENIE might not be a good choice when analyzing correlated data of a small size. fastGWA-GLMM is the most computationally efficient compared to the other three tools, but it appears to be overly conservative when applied to family-based data. GENESIS, SAIGE and fastGWA-GLMM produced similar, although not identical, results, with SPA adjustment performing better than Score tests. Our evaluation also demonstrates the importance of adjusting by full GRM in highly correlated datasets when using GENESIS or SAIGE. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
39. Sample size and performance estimation for biomarker combinations based on pilot studies with small sample sizes.
- Author
-
Al-Mekhlafi, Amani, Becker, Tobias, and Klawonn, Frank
- Subjects
- *
RECEIVER operating characteristic curves , *CHANNEL estimation , *SAMPLE size (Statistics) , *PILOT projects , *BIOMARKERS , *BIOLOGICAL systems - Abstract
High-throughput technologies like microarrays, next-generation sequencing and mass spectrometry enable the measurement of tens of thousands of biomarker candidates in pilot studies. Biological systems are often too complex to be described by simple single cause-effect associations, and from the medical practice point of view, a single biomarker may not possess the desired sensitivity and/or specificity for disease classification and outcome prediction. Therefore, researchers' efforts currently aim at combining biomarkers. The intention of biomarker pilot studies with small sample sizes is often to explore the possibility of finding good biomarker combinations, not to find and evaluate a final combination of biomarkers with high predictive value. The aim of the pilot study is to answer the question whether it is worthwhile to extend the study to a larger one and to obtain information about the required sample size. In this paper, we propose a method to judge the potential in a small biomarker pilot study without the need to explicitly identify and confirm a specific subset of biomarkers. In addition, we provide a method for sample size estimation for an extended study when the results of the pilot study look promising. Abbreviations: ROC: receiver operating characteristic curve; AUC: area under the ROC curve; HAUCA: high AUC abundance; ER: estrogen receptor; BCs: biomarker candidates; w: with; wt: without [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
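The AUC that underlies such pilot-study assessments equals the Mann-Whitney probability of correct ranking. A small Python sketch with toy scores (not the paper's estimator):

```python
import numpy as np

def auc(pos, neg):
    # AUC = P(positive outscores negative) + 0.5 * P(tie): the Mann-Whitney statistic
    diff = pos[:, None] - neg[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

# Toy biomarker scores for diseased (pos) and healthy (neg) subjects
pos = np.array([2.1, 3.4, 1.9, 4.0])
neg = np.array([0.5, 1.0, 2.0])
estimate = auc(pos, neg)   # 11 of 12 pairs are correctly ordered
```

The pairwise formulation makes the small-sample behavior explicit: with 4 cases and 3 controls, the AUC can only take values in twelfths, which is exactly why pilot-study AUC estimates are so noisy.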
40. Evaluation of GENESIS, SAIGE, REGENIE and fastGWA-GLMM for genome-wide association studies of binary traits in correlated data
- Author
-
Anastasia Gurinovich, Mengze Li, Anastasia Leshchyk, Harold Bae, Zeyuan Song, Konstantin G. Arbeev, Marianne Nygaard, Mary F Feitosa, Thomas T Perls, and Paola Sebastiani
- Subjects
GWAS ,correlated data ,binary phenotype ,SPA ,SAIGE ,GENESIS ,Genetics ,QH426-470 - Abstract
Performing a genome-wide association study (GWAS) with a binary phenotype using family data is a challenging task. Using linear mixed effects models is typically unsuitable for binary traits, and numerical approximations of the likelihood function may not work well with rare genetic variants with small counts. Additionally, imbalance in the case-control ratios poses challenges as traditional statistical methods such as the Score test or Wald test perform poorly in this setting. In the last couple of years, several methods have been proposed to better approximate the likelihood function of a mixed effects logistic regression model that uses Saddle Point Approximation (SPA). SPA adjustment has recently been implemented in multiple software, including GENESIS, SAIGE, REGENIE and fastGWA-GLMM: four increasingly popular tools to perform GWAS of binary traits. We compare Score and SPA tests using real family data to evaluate computational efficiency and the agreement of the results. Additionally, we compare various ways to adjust for family relatedness, such as sparse and full genetic relationship matrices (GRM) and polygenic effect estimates. We use the New England Centenarian Study imputed genotype data and the Long Life Family Study whole-genome sequencing data and the binary phenotype of human extreme longevity to compare the agreement of the results and tools’ computational performance. The evaluation suggests that REGENIE might not be a good choice when analyzing correlated data of a small size. fastGWA-GLMM is the most computationally efficient compared to the other three tools, but it appears to be overly conservative when applied to family-based data. GENESIS, SAIGE and fastGWA-GLMM produced similar, although not identical, results, with SPA adjustment performing better than Score tests. Our evaluation also demonstrates the importance of adjusting by full GRM in highly correlated datasets when using GENESIS or SAIGE.
- Published
- 2022
- Full Text
- View/download PDF
41. Spatial Modelling of Retinal Thickness in Images from Patients with Diabetic Macular Oedema
- Author
-
Zhu, Wenyue, Ku, Jae Yee, Zheng, Yalin, Knox, Paul, Harding, Simon P., Kolamunnage-Dona, Ruwanthi, Czanner, Gabriela, Barbosa, Simone Diniz Junqueira, Editorial Board Member, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Kotenko, Igor, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Zheng, Yalin, editor, Williams, Bryan M., editor, and Chen, Ke, editor
- Published
- 2020
- Full Text
- View/download PDF
42. Correlated data in differential privacy: Definition and analysis.
- Author
-
Zhang, Tao, Zhu, Tianqing, Liu, Renping, and Zhou, Wanlei
- Subjects
PRIVACY ,DEFINITIONS ,DATABASES - Abstract
Summary: Differential privacy is a rigorous mathematical framework for evaluating and protecting data privacy. Most existing studies rest on the vulnerable assumption that records in a dataset are independent when differential privacy is applied. However, in real‐world datasets, records are likely to be correlated, which may lead to unexpected data leakage. In this survey, we investigate the issue of privacy loss due to data correlation under differential privacy models. Roughly, we classify the existing literature into three lines: (1) using parameters to describe data correlation in differential privacy, (2) using models to describe data correlation in differential privacy, and (3) describing data correlation based on the Pufferfish framework. First, a detailed example illustrates the issue of privacy leakage on correlated data in real-world scenarios. Our main work is then to analyze and compare these methods and to evaluate the situations in which these diverse studies apply. Finally, we propose some future challenges for correlated differential privacy. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
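The core issue the survey raises can be seen with the Laplace mechanism in Python: if k records move together, noise calibrated to the nominal sensitivity of 1 under-protects, and the corrected noise must be k times larger (toy counting query; k and all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)

def laplace_release(true_value, sensitivity, epsilon, size):
    # Laplace mechanism: noise scale b = sensitivity / epsilon
    return true_value + rng.laplace(0.0, sensitivity / epsilon, size)

count, eps = 42.0, 1.0
# Independent records: one individual changes a counting query by at most 1
indep = laplace_release(count, 1.0, eps, size=20000)
# Records perfectly correlated in groups of k: one individual can move the count
# by k, so the effective sensitivity, and hence the noise, grows k-fold
k = 5
corr = laplace_release(count, float(k), eps, size=20000)
```

The k-fold utility loss is exactly the trade-off the surveyed parameter-based and model-based approaches try to soften by describing correlation more precisely than "worst-case groups of k".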
43. Statistical Evaluation of Correlated Measurement Data in Longitudinal Setting Based on Bilateral Corneal Cross-Linking.
- Author
-
Herber, Robert, Graehlert, Xina, Raiskup, Frederik, Veselá, Martina, Pillunat, Lutz E., and Spoerl, Eberhard
- Subjects
- *
CORNEA , *CORNEAL topography , *VISUAL acuity , *ANALYSIS of variance , *STATISTICAL software , *CORNEAL cross-linking - Abstract
In ophthalmology, data from both eyes of a person are frequently included in the statistical evaluation. This violates the requirement of data independence for classical statistical tests (e.g. t-test or analysis of variance (ANOVA)) because such data are correlated. Linear mixed models (LMM) make it possible to include the data of both eyes in the statistical evaluation, and are available in a variety of statistical software packages such as SPSS or R. The approach was applied to a retrospective longitudinal analysis of an accelerated corneal cross-linking (ACXL (9*10)) treatment in progressive keratoconus (KC) with a follow-up period of 36 months. Forty eyes of 20 patients were included, and sequential bilateral CXL treatment was performed within 12 months. LMM and ANOVA for repeated measurements were used for the statistical evaluation of topographical and tomographical data measured by Pentacam (Oculus, Wetzlar, Germany). For each patient, the eyes were classified into a worse and a better eye according to corneal topography. Visual acuity, keratometric values and minimal corneal thickness differed statistically significantly between them at baseline (p < 0.05). A significant correlation between the worse and better eye was shown (p < 0.05). Therefore, analyzing the data at each follow-up visit using ANOVA partially led to an overestimation of the statistical effect, which could be avoided by using LMM. After 36 months, ACXL had significantly improved BCVA and flattened the cornea. Evaluating data from both eyes with classical statistical tests, without considering their correlation, leads to an overestimation of the statistical effect, which can be avoided by using the LMM. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
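Why treating two eyes as independent overstates precision can be shown in a few lines of Python (simulated eyes with an assumed inter-eye correlation of 0.7, not the study's data):

```python
import numpy as np

rng = np.random.default_rng(7)
n_pat, rho = 2000, 0.7
shared = rng.normal(0.0, np.sqrt(rho), n_pat)             # per-patient effect
eyes = shared[:, None] + rng.normal(0.0, np.sqrt(1 - rho), (n_pat, 2))
y = eyes.ravel()                                          # 2 * n_pat eye-level values

# Naive SE pretends all eyes are independent observations
se_naive = y.std(ddof=1) / np.sqrt(len(y))
# Cluster-aware SE treats the patient as the unit of analysis
se_cluster = eyes.mean(axis=1).std(ddof=1) / np.sqrt(n_pat)
# With inter-eye correlation rho, the naive SE is too small by the factor sqrt(1 + rho)
```

An LMM with a patient-level random intercept reaches the same cluster-aware precision while still using every eye, which is why it is preferred over simply averaging the eyes.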
44. A comparison of methods for multiple degree of freedom testing in repeated measures RNA-sequencing experiments.
- Author
-
Wynn, Elizabeth A., Vestal, Brian E., Fingerlin, Tasha E., and Moore, Camille M.
- Subjects
- *
DEGREES of freedom , *RNA sequencing , *MULTIPLE comparisons (Statistics) , *FALSE discovery rate , *INTENSIVE care patients , *EXPERIMENTAL design , *SAMPLE size (Statistics) , *SEQUENCE analysis , *RNA , *RESEARCH funding , *LONGITUDINAL method - Abstract
Background: As the cost of RNA-sequencing decreases, complex study designs, including paired, longitudinal, and other correlated designs, become increasingly feasible. These studies often include multiple hypotheses and thus multiple degree of freedom tests, or tests that evaluate multiple hypotheses jointly, are often useful for filtering the gene list to a set of interesting features for further exploration while controlling the false discovery rate. Though there are several methods which have been proposed for analyzing correlated RNA-sequencing data, there has been little research evaluating and comparing the performance of multiple degree of freedom tests across methods.Methods: We evaluated 11 different methods for modelling correlated RNA-sequencing data by performing a simulation study to compare the false discovery rate, power, and model convergence rate across several hypothesis tests and sample size scenarios. We also applied each method to a real longitudinal RNA-sequencing dataset.Results: Linear mixed modelling using transformed data had the best false discovery rate control while maintaining relatively high power. However, this method had high model non-convergence, particularly at small sample sizes. No method had high power at the lowest sample size. We found a mix of conservative and anti-conservative behavior across the other methods, which was influenced by the sample size and the hypothesis being evaluated. The patterns observed in the simulation study were largely replicated in the analysis of a longitudinal study including data from intensive care unit patients experiencing cardiogenic or septic shock.Conclusions: Multiple degree of freedom testing is a valuable tool in longitudinal and other correlated RNA-sequencing experiments. Of the methods that we investigated, linear mixed modelling had the best overall combination of power and false discovery rate control. Other methods may also be appropriate in some scenarios. 
[ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
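For the paired special case of the correlated designs discussed above, a linear mixed model with a subject-level random intercept on transformed counts reduces to a one-sample test on within-pair differences. A minimal numpy sketch of that special case on simulated counts (illustrative only; not the paper's methods or data):

```python
import numpy as np

def paired_log_test(counts_a, counts_b):
    """Per-gene paired comparison on log2-transformed counts.

    For a paired design, a mixed model with a subject-level random
    intercept reduces to a one-sample t test on within-pair
    differences, computed here gene by gene.

    counts_a, counts_b: (genes, subjects) arrays of raw counts,
    columns matched by subject.
    """
    la = np.log2(counts_a + 1.0)          # variance-stabilising transform
    lb = np.log2(counts_b + 1.0)
    d = lb - la                           # within-subject differences
    n = d.shape[1]
    mean_d = d.mean(axis=1)               # per-gene log2 fold change
    se = d.std(axis=1, ddof=1) / np.sqrt(n)
    return mean_d, mean_d / se            # (effect, paired t statistic)

rng = np.random.default_rng(0)
base = rng.poisson(50, size=(100, 8))     # 100 genes, 8 paired subjects
shifted = rng.poisson(50, size=(100, 8))
shifted[:10] = rng.poisson(200, size=(10, 8))  # 10 truly shifted genes
lfc, tstat = paired_log_test(base, shifted)
```

Treating the two columns as independent samples instead would discard the pairing and, as the abstract notes, typically costs power or error-rate control.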
45. Regression, Transformations, and Mixed-Effects with Marine Bryozoans.
- Author
-
Evans, Ciaran
- Subjects
- *
BRYOZOA , *CLASSROOM activities , *HOMEWORK - Abstract
This article demonstrates how data from a biology paper, which analyzes the relationship between mass and metabolic rate for two species of marine bryozoan, can be used to teach a variety of regression topics to both introductory and advanced students. A thorough analysis requires intelligent data wrangling, variable transformations, and accounting for correlation between observations. The bryozoan data can be used as a valuable class example throughout the semester, or as a dataset for extended homework assignments and class projects. Supplementary materials for this article are available online. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
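The core analysis the bryozoan example teaches, a power-law relationship between mass and metabolic rate that a log transformation makes linear, can be sketched with simulated data. The coefficients and noise level below are illustrative assumptions, not the article's dataset:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated allometric power law: rate = a * mass^b, with a = 0.8, b = 0.75
mass = rng.uniform(0.5, 5.0, size=200)
log_rate = np.log(0.8) + 0.75 * np.log(mass) + rng.normal(0.0, 0.1, 200)

# The log transformation linearises the relationship, so ordinary
# least squares on the transformed variables recovers the exponent b
slope, intercept = np.polyfit(np.log(mass), log_rate, 1)
```

In the classroom setting the abstract describes, the next step would be to note that repeated measurements from the same colony are correlated, motivating the mixed-effects extension in the title.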
46. A state-space approach for longitudinal outcomes: An application to neuropsychological outcomes.
- Author
-
Chua, Alicia S and Tripodis, Yorghos
- Subjects
- *
SKEWNESS (Probability theory) , *KALMAN filtering , *NEURODEGENERATION , *DISEASE progression , *PANEL analysis , *MISSING data (Statistics) , *LONGITUDINAL method - Abstract
Longitudinal assessments are crucial in evaluating the disease state and trajectory in patients with neurodegenerative diseases. Neuropsychological outcomes measured over time often have a non-linear trajectory with autocorrelated residuals and a skewed distribution. We propose the adjusted local linear trend model, an extended state-space model in lieu of the commonly used linear mixed-effects model in modeling longitudinal neuropsychological outcomes. Our contributed model has the capability to utilize information from the stochasticity of the data while accounting for subject-specific trajectories with the inclusion of covariates and unequally spaced time intervals. The first step of model fitting involves a likelihood maximization step to estimate the unknown variances in the model before parsing these values into the Kalman filter and Kalman smoother recursive algorithms. Results from simulation studies showed that the adjusted local linear trend model is able to attain lower bias, lower standard errors, and higher power, particularly in short longitudinal studies with equally spaced time intervals, as compared to the linear mixed-effects model. The adjusted local linear trend model also outperforms the linear mixed-effects model when data are missing completely at random or missing at random and, in certain cases, even when data are missing not at random. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
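The local linear trend model underlying the approach above can be illustrated with a minimal Kalman filter. This sketch assumes the variances are known, whereas the paper estimates them by likelihood maximization before running the Kalman filter and smoother:

```python
import numpy as np

def kalman_llt(y, q_level, q_slope, r):
    """Kalman filter for a local linear trend model.

    State (level_t, slope_t):
        level_{t+1} = level_t + slope_t + eta_t,  Var(eta)  = q_level
        slope_{t+1} = slope_t + zeta_t,           Var(zeta) = q_slope
    Observation: y_t = level_t + eps_t,           Var(eps)  = r
    """
    T = np.array([[1.0, 1.0], [0.0, 1.0]])   # transition matrix
    Q = np.diag([q_level, q_slope])          # state noise covariance
    H = np.array([[1.0, 0.0]])               # observation matrix
    x = np.zeros(2)
    P = np.eye(2) * 1e4                      # vague initial covariance
    filtered = []
    for obs in y:
        x = T @ x                            # predict
        P = T @ P @ T.T + Q
        S = H @ P @ H.T + r                  # innovation variance (1x1)
        K = (P @ H.T) / S                    # Kalman gain (2x1)
        x = x + (K * (obs - H @ x)).ravel()  # measurement update
        P = P - K @ H @ P
        filtered.append(x[0])
    return np.array(filtered)

rng = np.random.default_rng(2)
t = np.arange(100)
y = 0.5 * t + rng.normal(0.0, 1.0, 100)      # noisy linear trajectory
level = kalman_llt(y, q_level=0.01, q_slope=0.001, r=1.0)
```

Because the slope is itself a state, the filter adapts to a drifting trajectory rather than assuming a fixed linear trend, which is the flexibility the adjusted local linear trend model exploits.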
47. Reflection on modern methods: visualizing the effects of collinearity in distributed lag models.
- Author
-
Basagaña, Xavier and Barrera-Gómez, Jose
- Subjects
- *
TIME series analysis , *AIR pollution , *REGRESSION analysis , *HOSPITAL admission & discharge , *COHORT analysis , *PARTICULATE matter , *TEMPERATURE , *LONGITUDINAL method , *ENVIRONMENTAL exposure - Abstract
Collinearity can be a problem in regression models. When examining the effects of an exposure at different time points, constrained distributed lag models can alleviate some of the problems caused by collinearity. Still, some consequences of collinearity may remain and they are often unexplored. We aimed to illustrate the effects of collinearity in the context of distributed lag models, and to provide a tool to assess whether the results of a study could be influenced by collinearity. We used simulations under different scenarios of hypothesized effects of an exposure to visualize the resulting curves of lagged effects. We analysed three real datasets: a cohort study looking for windows of vulnerability to air pollution, a time series study examining the linear association of air pollution with hospital admissions, and a time series study examining the non-linear association between temperature and mortality. We showed that collinearity could be the explanation for some unexpected results, e.g. for statistically significant associations in the opposite direction from that expected, or for wrongly suggesting that some periods are more important than others. We implemented the collin R package to explore the potential consequences of collinearity in the context of distributed lag models. Our visual tool can be a useful way to assess if the results of an analysis may be influenced by collinearity. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
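The collinearity problem described above is easy to reproduce: lags of a strongly autocorrelated exposure form nearly collinear regressors. A small simulated illustration (this is a hand-rolled sketch, not the collin package):

```python
import numpy as np

def lag_matrix(x, n_lags):
    """Row t has columns x_t, x_{t-1}, ..., x_{t-n_lags+1}."""
    n = len(x) - n_lags + 1
    return np.column_stack([x[n_lags - 1 - k : n_lags - 1 - k + n]
                            for k in range(n_lags)])

rng = np.random.default_rng(3)

# Strongly autocorrelated exposure series (e.g. daily air pollution)
x = np.zeros(1000)
for i in range(1, 1000):
    x[i] = 0.95 * x[i - 1] + rng.normal()

X = lag_matrix(x, 7)                       # unconstrained lags 0..6
y = 1.0 * X[:, 0] + rng.normal(0.0, 1.0, X.shape[0])  # true effect at lag 0 only

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
adjacent_corr = np.corrcoef(X[:, 0], X[:, 1])[0, 1]   # near-collinear columns
```

With adjacent-lag correlations this high, the unconstrained lag coefficients have inflated sampling variability, which is exactly the behavior (sign flips, spurious "important" periods) the authors propose to visualize.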
48. Bayesian frailty modeling of correlated survival data with application to under-five mortality
- Author
-
Refah M. Alotaibi, Hoda Ragab Rezk, and Chris Guure
- Subjects
Bayesian approach ,Frailty models ,Correlated data ,Community frailty ,Under-five mortality ,Parametric regression models ,Public aspects of medicine ,RA1-1270 - Abstract
Abstract Background There is a high rate of under-five mortality in West Africa, with little effort made to study the determinants that significantly increase or decrease its risk across the West African sub-region. This is important since it will help in the design of effective intervention programs for each country or the entire region. The overall objective of this research is to evaluate the determinants of under-five mortality prior to the end of the 2015 Millennium Development Goals, to guide West African countries in implementing strategies that will aid them in achieving Sustainable Development Goal 3 by 2030. Method This study used Demographic and Health Survey (DHS) data from twelve (12) of the eighteen West African countries: Ghana, Benin, Cote d'Ivoire, Guinea, Liberia, Mali, Niger, Nigeria, Sierra Leone, Burkina Faso, Gambia, and Togo. Data were extracted from the children and women of reproductive age files as provided in the DHS report. The response or outcome variable of interest is the under-five mortality rate. Bayesian exponential, Weibull, and Gompertz regression models via a gamma shared frailty model were used for the analysis. The deviance information criterion and Bayes factors were used to discriminate between models. These analyses were carried out using Stata version 15 software. Results The study recorded 101 (95% CI: 98.6–103.5) deaths per 1000 live births across the twelve countries. Burkina Faso (124.4), Cote d'Ivoire (110.1), Guinea (116.4), Nigeria (120.6), and Niger (118.3) recorded the highest under-5 mortality rates. Gambia (48.1), Ghana (60.1), and Benin (70.4) recorded the lowest under-5 mortality rates per 1000 live births. Children of multiple births were about twice as likely to die as singleton births, in all countries except Gambia, Nigeria, and Sierra Leone. We observed significantly higher hazard rates for male compared to female children in the combined data analysis (HR: 1.14, 95% CI: [1.10–1.18]).
The country-specific analyses in Benin, Cote d'Ivoire, Guinea, Liberia, Mali, and Nigeria showed higher under-5 mortality hazard rates among male children compared to female children, whilst Niger was the only country to report a significantly lower hazard rate for males compared to females. Conclusion There is still a substantial amount of work to be done in order to meet Sustainable Development Goal 3 by 2030 in West Africa. There exist marked differences among some of the countries with respect to mortality rates and determinants, which require different interventions and policy decisions.
- Published
- 2020
- Full Text
- View/download PDF
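The gamma shared frailty structure used above can be illustrated by simulating clustered Weibull survival times in which a community-level frailty multiplies the hazard. All parameter values below are illustrative assumptions, not estimates from the DHS data:

```python
import numpy as np

def weibull_frailty_times(n_clusters, cluster_size, shape, scale,
                          frailty_var, rng):
    """Event times with hazard h(t) = z * scale * shape * t^(shape-1),
    where z ~ Gamma(mean 1, variance frailty_var) is shared by all
    children in a cluster (community)."""
    z = rng.gamma(1.0 / frailty_var, frailty_var, size=n_clusters)
    z = np.repeat(z, cluster_size)            # same frailty within cluster
    u = rng.uniform(size=z.size)
    # Invert the survival function S(t) = exp(-z * scale * t**shape)
    t = (-np.log(u) / (z * scale)) ** (1.0 / shape)
    return t.reshape(n_clusters, cluster_size)

rng = np.random.default_rng(4)
times = weibull_frailty_times(500, 4, shape=1.3, scale=0.02,
                              frailty_var=0.5, rng=rng)

# The shared frailty induces positive within-cluster correlation,
# which is why the paper models community as a frailty term
log_t = np.log(times)
within_corr = np.corrcoef(log_t[:, 0], log_t[:, 1])[0, 1]
```

Ignoring the frailty would treat children from the same community as independent, understating standard errors for community-level covariates.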
49. Group penalized generalized estimating equation for correlated event-related potentials and biomarker selection
- Author
-
Ye Lin, Jianhui Zhou, Swapna Kumar, Wanze Xie, Sarah K. G. Jensen, Rashidul Haque, Charles A. Nelson, William A. Petri Jr, and Jennie Z. Ma
- Subjects
Event-related potentials ,Correlated data ,Penalized generalized estimating equations (GEE) ,Variable selection ,Structured correlation matrix ,Medicine (General) ,R5-920 - Abstract
Abstract Background Event-related potential (ERP) data are widely used in brain studies that measure brain responses to specific stimuli using electroencephalogram (EEG) with multiple electrodes. Previous ERP data analyses have not accounted for the structured correlation among observations from multiple electrodes, and have therefore ignored the electrode-specific information and variation among the electrodes on the scalp. Our objective was to evaluate the impact of early adversity on brain connectivity by identifying risk factors and early-stage biomarkers associated with the ERP responses while properly accounting for structured correlation. Methods In this study, we extend a penalized generalized estimating equation (PGEE) method to accommodate the structured correlation of ERPs that accounts for electrode-specific data and to enable group selection, such that grouped covariates can be evaluated together for their association with brain development in a birth cohort of urban-dwelling Bangladeshi children. The primary ERP responses of interest in our study are the N290 amplitude and the difference in N290 amplitude. Results The selected early-stage biomarkers associated with the N290 responses are representatives of enteric inflammation (days of diarrhea, MIP1b, retinol binding protein (RBP), zinc, myeloperoxidase (MPO), calprotectin, and neopterin), systemic inflammation (IL-5, IL-10, ferritin, C-reactive protein (CRP)), socioeconomic status (household expenditure), maternal health (mother's height), and sanitation (water treatment). Conclusions Our proposed group penalized GEE estimator with a structured correlation matrix can properly model the complex ERP data and simultaneously identify informative biomarkers associated with such brain connectivity.
The selected early-stage biomarkers offer a potential explanation for the adversity of neurocognitive development in low-income countries and facilitate early identification of infants at risk, as well as potential pathways for intervention. Trial registration The related clinical study was retrospectively registered with ClinicalTrials.gov, identifier NCT01375647, on June 3, 2011.
- Published
- 2020
- Full Text
- View/download PDF
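The group selection behavior described above rests on a group penalty whose proximal operator shrinks each covariate group toward zero jointly, so a group of covariates enters or leaves the model as a whole. A minimal sketch of that operator (illustrative; not the authors' PGEE implementation):

```python
import numpy as np

def group_soft_threshold(beta, groups, lam):
    """Proximal operator of the group (L2,1) penalty: each group of
    coefficients is shrunk toward zero jointly, so a covariate group
    is selected or dropped as a whole."""
    beta = np.asarray(beta, dtype=float)
    groups = np.asarray(groups)
    out = beta.copy()
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        norm = np.linalg.norm(beta[idx])      # joint magnitude of the group
        out[idx] = 0.0 if norm <= lam else beta[idx] * (1.0 - lam / norm)
    return out

# Two groups of two coefficients each; the weak group is zeroed jointly
shrunk = group_soft_threshold([3.0, 4.0, 0.1, 0.1], [0, 0, 1, 1], lam=1.0)
# shrunk -> [2.4, 3.2, 0.0, 0.0]
```

In the paper's setting, a "group" is the set of electrode-specific coefficients for one biomarker, so a biomarker is selected for all electrodes or for none.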
50. MCMSeq: Bayesian hierarchical modeling of clustered and repeated measures RNA sequencing experiments
- Author
-
Brian E. Vestal, Camille M. Moore, Elizabeth Wynn, Laura Saba, Tasha Fingerlin, and Katerina Kechris
- Subjects
RNA-Seq ,Markov chain Monte Carlo ,Longitudinal data ,Correlated data ,Computer applications to medicine. Medical informatics ,R858-859.7 ,Biology (General) ,QH301-705.5 - Abstract
Abstract Background As the barriers to incorporating RNA sequencing (RNA-Seq) into biomedical studies continue to decrease, the complexity and size of RNA-Seq experiments are rapidly growing. Paired, longitudinal, and other correlated designs are becoming commonplace, and these studies offer immense potential for understanding how transcriptional changes within an individual over time differ depending on treatment or environmental conditions. While several methods have been proposed for dealing with repeated measures within RNA-Seq analyses, they are restricted to handling only paired measurements, can only test for differences between two groups, and/or have issues with maintaining nominal false positive and false discovery rates. In this work, we propose a Bayesian hierarchical negative binomial generalized linear mixed model framework that can flexibly model RNA-Seq counts from studies with arbitrarily many repeated observations, can include covariates, and also maintains nominal false positive and false discovery rates in its posterior inference. Results In simulation studies, we showed that our proposed method (MCMSeq) best combines high statistical power (i.e. sensitivity or recall) with maintenance of nominal false positive and false discovery rates compared to the other available strategies, especially at the smaller sample sizes investigated. This behavior was then replicated in an application to real RNA-Seq data where MCMSeq was able to find previously reported genes associated with tuberculosis infection in a cohort with longitudinal measurements. Conclusions Failing to account for repeated measurements when analyzing RNA-Seq experiments can result in significantly inflated false positive and false discovery rates.
Of the methods we investigated, whether they modeled RNA-Seq counts directly or worked on transformed values, the Bayesian hierarchical model implemented in the mcmseq R package (available at https://github.com/stop-pre16/mcmseq) best combined sensitivity and nominal error rate control.
- Published
- 2020
- Full Text
- View/download PDF
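The data-generating structure MCMSeq models, negative binomial counts with subject-level random intercepts, can be sketched via the gamma-Poisson mixture representation. The parameters below are illustrative assumptions, and this is a simulation sketch, not the mcmseq package:

```python
import numpy as np

def nb_repeated_counts(n_subjects, n_times, base_log_mean,
                       re_sd, dispersion, rng):
    """Counts y_ij with mean exp(base_log_mean + b_i), where b_i is a
    subject-level random intercept with standard deviation re_sd, and
    negative binomial noise via the gamma-Poisson mixture."""
    b = rng.normal(0.0, re_sd, size=n_subjects)
    mu = np.exp(base_log_mean + b)[:, None].repeat(n_times, axis=1)
    size = 1.0 / dispersion                   # NB size parameter
    lam = rng.gamma(size, mu / size)          # gamma mixing step
    return rng.poisson(lam)                   # Poisson given the rate

rng = np.random.default_rng(5)
y = nb_repeated_counts(50, 3, base_log_mean=np.log(100.0),
                       re_sd=0.5, dispersion=0.1, rng=rng)

# Ignoring the shared intercepts would treat these correlated columns
# as independent samples, inflating false positives downstream
repeat_corr = np.corrcoef(y[:, 0], y[:, 1])[0, 1]
```

The positive correlation between repeated measurements from the same subject is exactly what the abstract says naive analyses fail to account for.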