527 results
Search Results
2. Robust Matrix Completion with Heavy-tailed Noise.
- Author
- Wang, Bingyan and Fan, Jianqing
- Subjects
- *LOW-rank matrices, *MATRIX decomposition, *STATISTICAL errors, *BAYES' estimation, *EUCLIDEAN algorithm
- Abstract
This paper studies noisy low-rank matrix completion in the presence of heavy-tailed and possibly asymmetric noise, where we aim to estimate an underlying low-rank matrix given a set of highly incomplete noisy entries. Though the matrix completion problem has attracted much attention in the past decade, there is still a lack of theoretical understanding when the observations are contaminated by heavy-tailed noise. Prior theory falls short of explaining the empirical results and is unable to capture the optimal dependence of the estimation error on the noise level. In this paper, we adopt an adaptive Huber loss to accommodate heavy-tailed noise, which is robust against large and possibly asymmetric errors when the parameter in the Huber loss function is carefully designed to balance the Huberization biases and robustness to outliers. Then, we propose an efficient nonconvex algorithm via a balanced low-rank Burer-Monteiro matrix factorization and gradient descent with robust spectral initialization. We prove that under merely a bounded second-moment condition on the error distributions, rather than the sub-Gaussian assumption, the Euclidean errors of the iterates generated by the proposed algorithm decrease geometrically fast until achieving a minimax-optimal statistical estimation error, which has the same order as that in the sub-Gaussian case. The key technique behind this significant advancement is a powerful leave-one-out analysis framework. The theoretical results are corroborated by our numerical studies. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
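A useful reference point for the entry above: the Huber loss underlying the method is quadratic for residuals up to a threshold τ and linear beyond it, which caps the influence of outliers. A minimal sketch (the paper's adaptive, data-driven choice of τ is not shown):

```python
import numpy as np

def huber_loss(r, tau):
    """Huber loss: 0.5*r^2 for |r| <= tau, tau*|r| - 0.5*tau^2 beyond.
    The threshold tau trades off Huberization bias against robustness."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= tau,
                    0.5 * r**2,
                    tau * np.abs(r) - 0.5 * tau**2)
```

Large τ recovers the squared loss; small τ behaves like a shifted absolute loss, which is what delivers robustness to heavy tails.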
3. Enhanced Response Envelope via Envelope Regularization.
- Author
- Kwon, Oh-Ran and Zou, Hui
- Abstract
The response envelope model provides substantial efficiency gains over the standard multivariate linear regression by identifying the material part of the response to the model and by excluding the immaterial part. In this paper, we propose the enhanced response envelope by incorporating a novel envelope regularization term based on a nonconvex manifold formulation. It is shown that the enhanced response envelope can yield better prediction risk than the original envelope estimator. The enhanced response envelope naturally handles high-dimensional data for which the original response envelope is not serviceable without necessary remedies. In an asymptotic high-dimensional regime where the ratio of the number of predictors over the number of samples converges to a non-zero constant, we characterize the risk function and reveal an interesting double descent phenomenon for the envelope model. A simulation study confirms our main theoretical findings. Simulations and real data applications demonstrate that the enhanced response envelope does have significantly improved prediction performance over the original envelope method, especially when the number of predictors is close to or moderately larger than the number of samples. Proofs and additional simulation results are shown in the supplementary file to this paper. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
4. Discussion of the Paper "Prediction, Estimation, and Attribution" by B. Efron.
- Author
- Candès, Emmanuel and Sabatti, Chiara
- Subjects
- *FORECASTING, *ATTRIBUTION (Social psychology), *HIDDEN Markov models, *INFORMATION storage & retrieval systems, *ALGORITHMIC randomness
- Abstract
We enjoyed reading Professor Efron's (Brad) paper just as much as we enjoyed listening to his June 2019 lecture in Leiden. Interestingly, Efron puts the pure predictive algorithms to the test in a scenario where the sample size is extremely moderate by today's standards (n = 102). In the knockoff framework, one can use any black-box (any predictive model) to identify any possible kind of association, linear or not, between a large set of explanatory variables X and an outcome Y. This framework rigorously tests whether the conditional distribution of Y given X actually depends on any of the variables Xj, without assuming any model for how X and Y are connected (Candès et al. [4]). [Extracted from the article]
- Published
- 2020
- Full Text
- View/download PDF
5. Natural Gradient Variational Bayes without Fisher Matrix Analytic Calculation and Its Inversion.
- Author
- Godichon-Baggioni, A., Nguyen, D., and Tran, M-N.
- Abstract
This paper introduces a method for efficiently approximating the inverse of the Fisher information matrix, a crucial step in achieving effective variational Bayes inference. A notable aspect of our approach is the avoidance of analytically computing the Fisher information matrix and its explicit inversion. Instead, we introduce an iterative procedure for generating a sequence of matrices that converge to the inverse of the Fisher information. The natural gradient variational Bayes algorithm without analytic expression of the Fisher matrix and its inversion is provably convergent and achieves a convergence rate of order O(log s/s), with s the number of iterations. We also obtain a central limit theorem for the iterates. Implementation of our method does not require storage of large matrices, and achieves a linear complexity in the number of variational parameters. Our algorithm exhibits versatility, making it applicable across a diverse array of variational Bayes domains, including Gaussian approximation and normalizing flow variational Bayes. We offer a range of numerical examples to demonstrate the efficiency and reliability of the proposed variational Bayes method. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
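The entry above replaces analytic inversion of the Fisher matrix with an iterative approximation. As an illustration of that general idea only (the authors' procedure is a stochastic recursion, not the deterministic one below), the classical Newton-Schulz iteration converges to a matrix inverse using nothing but matrix products:

```python
import numpy as np

def newton_schulz_inverse(A, iters=50):
    """Approximate A^{-1} via the Newton-Schulz iteration
    X_{k+1} = X_k (2I - A X_k), using only matrix multiplications.
    The start X0 = A^T / (||A||_1 ||A||_inf) makes the spectral
    radius of (I - A X0) less than 1, so the iteration converges."""
    n = A.shape[0]
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    I = np.eye(n)
    for _ in range(iters):
        X = X @ (2 * I - A @ X)
    return X
```

Convergence is quadratic once started, which is why a modest iteration budget suffices in this sketch.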
6. Efficient Multiple Change Point Detection and Localization For High-dimensional Quantile Regression with Heteroscedasticity.
- Author
- Wang, Xianru, Liu, Bin, Zhang, Xinsheng, and Liu, Yufeng
- Abstract
Data heterogeneity is a challenging issue for modern statistical data analysis. There are different types of data heterogeneity in practice. In this paper, we consider potential structural changes and complicated tail distributions. There are various existing methods proposed to handle either structural changes or heteroscedasticity. However, it is difficult to handle them simultaneously. To overcome this limitation, we consider statistically and computationally efficient change point detection and localization in high-dimensional quantile regression models. Our proposed framework is general and flexible since the change points and the underlying regression coefficients are allowed to vary across different quantile levels. The model parameters, including the data dimension, the number of change points, and the signal jump size, can be scaled with the sample size. Under this framework, we construct a novel two-step estimation of the number and locations of the change points as well as the underlying regression coefficients. Without any moment constraints on the error term, we present theoretical results, including consistency of the change point number, oracle estimation of change point locations, and estimation for the underlying regression coefficients with the optimal convergence rate. Finally, we present simulation results and an application to the S&P 100 dataset to demonstrate the advantage of the proposed method. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
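For readers unfamiliar with the quantile side of the entry above: estimation at quantile level τ minimizes the check loss ρ_τ(r) = r(τ − 1{r < 0}). A one-line sketch of that objective:

```python
import numpy as np

def check_loss(r, tau):
    """Quantile-regression check loss rho_tau(r) = r * (tau - 1{r < 0});
    tau = 0.5 recovers half the absolute loss, i.e. median regression."""
    r = np.asarray(r, dtype=float)
    return r * (tau - (r < 0))
```

Asymmetric weighting of positive versus negative residuals is what lets the fit target a chosen quantile rather than the mean, with no moment conditions on the errors.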
7. Matrix Completion When Missing Is Not at Random and Its Applications in Causal Panel Data Models.
- Author
- Choi, Jungjun and Yuan, Ming
- Subjects
- *PANEL analysis, *PILOT projects, *UNITS of time, *FINANCIAL markets, *CAUSAL inference
- Abstract
This paper develops an inferential framework for matrix completion when missing is not at random and without the requirement of strong signals. Our development is based on the observation that if the number of missing entries is small enough compared to the panel size, then they can be estimated well even when missing is not at random. Taking advantage of this fact, we divide the missing entries into smaller groups and estimate each group via nuclear norm regularization. In addition, we show that with appropriate debiasing, our proposed estimate is asymptotically normal even for fairly weak signals. Our work is motivated by recent research on the Tick Size Pilot Program, an experiment conducted by the Securities and Exchange Commission (SEC) to evaluate the impact of widening the tick size on the market quality of stocks from 2016 to 2018. While previous studies were based on traditional regression or difference-in-difference methods by assuming that the treatment effect is invariant with respect to time and unit, our analyses suggest significant heterogeneity across units and intriguing dynamics over time during the pilot program. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
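Nuclear norm regularization, applied group-by-group in the entry above, is typically computed via singular value soft-thresholding (the proximal operator of the nuclear norm). A generic sketch, separate from the paper's grouping and debiasing steps:

```python
import numpy as np

def svt(M, lam):
    """Singular value thresholding: the proximal operator of the
    nuclear norm shrinks every singular value of M by lam (floored
    at zero), which simultaneously denoises and reduces rank."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt
```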
8. Controlling the False Split Rate in Tree-Based Aggregation.
- Author
- Shao, Simeng, Bien, Jacob, and Javanmard, Adel
- Subjects
- *FALSE discovery rate, *URBAN trees
- Abstract
In many domains, data measurements can naturally be associated with the leaves of a tree, expressing the relationships among these measurements. For example, companies belong to industries, which in turn belong to ever coarser divisions such as sectors; microbes are commonly arranged in a taxonomic hierarchy from species to kingdoms; street blocks belong to neighborhoods, which in turn belong to larger-scale regions. The problem of tree-based aggregation that we consider in this paper asks which of these tree-defined subgroups of leaves should really be treated as a single entity and which of these entities should be distinguished from each other. We introduce the false split rate, an error measure that describes the degree to which subgroups have been split when they should not have been. While expressible as the false discovery rate in a special case, we show that these measures can be quite different for the general tree structures common in our setting. We then propose a multiple hypothesis testing algorithm for tree-based aggregation, which we prove controls this error measure. We focus on two main examples of tree-based aggregation, one which involves aggregating means and the other which involves aggregating regression coefficients. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
9. Contextual Dynamic Pricing with Strategic Buyers.
- Author
- Liu, Pangpang, Yang, Zhuoran, Wang, Zhaoran, and Sun, Will Wei
- Subjects
- *TIME-based pricing, *ONLINE shopping, *PRICES, *DIRECT costing, *ONLINE education
- Abstract
Personalized pricing, which involves tailoring prices based on individual characteristics, is commonly used by firms to implement a consumer-specific pricing policy. In this process, buyers can also strategically manipulate their feature data to obtain a lower price, incurring certain manipulation costs. Such strategic behavior can hinder firms from maximizing their profits. In this paper, we study the contextual dynamic pricing problem with strategic buyers. The seller does not observe the buyer's true feature, but a manipulated feature according to buyers' strategic behavior. In addition, the seller does not observe the buyers' valuation of the product, but only a binary response indicating whether a sale happens or not. Recognizing these challenges, we propose a strategic dynamic pricing policy that incorporates the buyers' strategic behavior into the online learning to maximize the seller's cumulative revenue. We first prove that existing non-strategic pricing policies that neglect the buyers' strategic behavior result in a linear Ω(T) regret, with T the total time horizon, indicating that these policies are not better than a random pricing policy. We then establish an O(√T) regret upper bound for our proposed policy and an Ω(√T) regret lower bound for any pricing policy within our problem setting. This underscores the rate optimality of our policy. Importantly, our policy is not a mere amalgamation of existing dynamic pricing policies and strategic behavior handling algorithms. Our policy can also accommodate the scenario when the marginal cost of manipulation is unknown in advance. To account for it, we simultaneously estimate the valuation parameter and the cost parameter in the online pricing policy, which is shown to also achieve an O(√T) regret bound. Extensive experiments support our theoretical developments and demonstrate the superior performance of our policy compared to other pricing policies that are unaware of the strategic behaviors. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
10. Supervised Dynamic PCA: Linear Dynamic Forecasting with Many Predictors.
- Author
- Gao, Zhaoxing and Tsay, Ruey S.
- Subjects
- *INDEPENDENT variables, *PRINCIPAL components analysis, *SUPERVISED learning, *FORECASTING
- Abstract
This paper proposes a novel dynamic forecasting method using a new supervised Principal Component Analysis (PCA) when a large number of predictors are available. The new supervised PCA provides an effective way to bridge the gap between predictors and the target variable of interest by scaling and combining the predictors and their lagged values, resulting in effective dynamic forecasting. Unlike the traditional diffusion-index approach, which does not learn the relationships between the predictors and the target variable before conducting PCA, we first re-scale each predictor according to its significance in forecasting the target variable in a dynamic fashion, and a PCA is then applied to the re-scaled and additive panel, which establishes a connection between the predictability of the PCA factors and the target variable. We also propose to use penalized methods such as the LASSO to select the significant factors that have superior predictive power over the others. Theoretically, we show that our estimators are consistent and outperform the traditional methods in prediction under some mild conditions. We conduct extensive simulations to verify that the proposed method produces satisfactory forecasting results and outperforms most of the existing methods using the traditional PCA. An example of predicting U.S. macroeconomic variables using a large number of predictors showcases that our method fares better than most of the existing ones in applications. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
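The scale-then-PCA idea in the entry above can be caricatured in a few lines. This static toy ignores the paper's lagged values and dynamic construction, and the weighting by absolute marginal slopes below is an illustrative assumption, not the authors' exact scaling:

```python
import numpy as np

def supervised_pca_factors(X, y, k=2):
    """Toy supervised PCA: re-scale each (centered) predictor by the
    absolute slope of its marginal regression on the target, then take
    the k leading principal components of the re-scaled panel."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # marginal least-squares slopes, one predictor at a time
    slopes = (Xc * yc[:, None]).sum(axis=0) / (Xc**2).sum(axis=0)
    Xs = Xc * np.abs(slopes)          # supervision step: weight by relevance
    U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    return U[:, :k] * s[:k]           # first k factor series
```

Because predictors irrelevant to y receive near-zero weight, the leading factors align with the predictive directions rather than the directions of largest raw variance.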
11. Tyranny-of-the-minority Regression Adjustment in Randomized Experiments.
- Author
- Lu, Xin and Liu, Hanzhong
- Abstract
Regression adjustment is widely used in the analysis of randomized experiments to improve the estimation efficiency of the treatment effect. This paper reexamines a weighted regression adjustment method termed tyranny-of-the-minority (ToM), wherein units in the minority group are given greater weights. We demonstrate that ToM regression adjustment is more robust than Lin (2013)'s regression adjustment with treatment-covariate interactions, even though these two regression adjustment methods are asymptotically equivalent in completely randomized experiments. Moreover, ToM regression adjustment can be easily extended to stratified randomized experiments and completely randomized survey experiments. We obtain the design-based properties of the ToM regression-adjusted average treatment effect estimator under such designs. In particular, we show that the ToM regression-adjusted estimator improves the asymptotic estimation efficiency compared to the unadjusted estimator, even when the regression model is misspecified, and is optimal in the class of linearly adjusted estimators. We also study the asymptotic properties of various heteroscedasticity-robust standard errors and provide recommendations for practitioners. Simulation studies and real data analysis demonstrate ToM regression adjustment's superiority over existing methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
12. Generalized Data Thinning Using Sufficient Statistics.
- Author
- Dharamshi, Ameer, Neufeld, Anna, Motwani, Keshav, Gao, Lucy L., Witten, Daniela, and Bien, Jacob
- Abstract
Our goal is to develop a general strategy to decompose a random variable X into multiple independent random variables, without sacrificing any information about unknown parameters. A recent paper showed that for some well-known natural exponential families, X can be thinned into independent random variables X(1), …, X(K) such that X = ∑_{k=1}^K X(k). These independent random variables can then be used for various model validation and inference tasks, including in contexts where traditional sample splitting fails. In this article, we generalize their procedure by relaxing this summation requirement and simply asking that some known function of the independent random variables exactly reconstruct X. This generalization of the procedure serves two purposes. First, it greatly expands the families of distributions for which thinning can be performed. Second, it unifies sample splitting and data thinning, which on the surface seem to be very different, as applications of the same principle. This shared principle is sufficiency. We use this insight to perform generalized thinning operations for a diverse set of families. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
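The summation-based thinning that the entry above generalizes has a well-known concrete instance: a Poisson draw splits, via a multinomial, into K independent Poisson pieces that add back to the original value:

```python
import numpy as np

def poisson_thin(x, K, seed=None):
    """Thin a single Poisson(lam) observation x into K pieces that are
    independent Poisson(lam/K) draws summing exactly to x, by routing
    each of the x unit counts to a uniformly chosen piece."""
    rng = np.random.default_rng(seed)
    return rng.multinomial(x, np.full(K, 1.0 / K))
```

Each piece can then serve a different task (e.g., one for model selection, the rest for inference) without the dependence issues of reusing x.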
13. Recommender Systems: A Review.
- Author
- LeBlanc, Patrick M., Banks, David, Fu, Linhui, Li, Mingyan, Tang, Zhengyu, and Wu, Qiuyi
- Subjects
- *INTERNET advertising, *MATRIX decomposition, *RECOMMENDER systems
- Abstract
Recommender systems are the engine of online advertising. Not only do they suggest movies, music, or romantic partners, but they also are used to select which advertisements to show to users. This paper reviews the basics of recommender system methodology and then looks at the emerging arena of active recommender systems. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
14. Using SVD for Topic Modeling.
- Author
- Ke, Zheng Tracy and Wang, Minzhe
- Subjects
- *WORD frequency, *VECTOR data, *MATRIX decomposition, *NONNEGATIVE matrices, *SINGULAR value decomposition
- Abstract
The probabilistic topic model imposes a low-rank structure on the expectation of the corpus matrix. Therefore, singular value decomposition (SVD) is a natural tool of dimension reduction. We propose an SVD-based method for estimating a topic model. Our method constructs an estimate of the topic matrix from only a few leading singular vectors of the data matrix, and has a great advantage in memory use and computational cost for large-scale corpora. The core ideas behind our method include a pre-SVD normalization to tackle severe word frequency heterogeneity, a post-SVD normalization to create a low-dimensional word embedding that manifests a simplex geometry, and a post-SVD procedure to construct an estimate of the topic matrix directly from the embedded word cloud. We provide the explicit rate of convergence of our method. We show that our method attains the optimal rate in the case of long and moderately long documents, and it improves the rates of existing methods in the case of short documents. The key of our analysis is a sharp row-wise large-deviation bound for empirical singular vectors, which is technically demanding to derive and potentially useful for other problems. We apply our method to a corpus of Associated Press news articles and a corpus of abstracts of statistical papers. Supplementary materials for this article are available online. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
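The dimension-reduction core of the entry above is a truncated SVD of a normalized corpus matrix. A stripped-down sketch; the pre- and post-SVD normalizations that handle word-frequency heterogeneity and the simplex geometry are the paper's actual contribution and are omitted here:

```python
import numpy as np

def leading_svd_embedding(counts, k):
    """Row-normalize a docs-x-words count matrix to word frequencies,
    then keep only the k leading singular triplets as a low-rank
    embedding of documents (U) and words (Vt)."""
    counts = np.asarray(counts, dtype=float)
    freq = counts / counts.sum(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(freq, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k]
```

Only k singular vectors need be stored, which is the source of the memory and compute savings for large corpora.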
15. A Projection Space-Filling Criterion and Related Optimality Results.
- Author
- Shi, Chenlu and Xu, Hongquan
- Abstract
Computer experiments call for space-filling designs. Recently, a minimum aberration type space-filling criterion was proposed to rank and assess a family of space-filling designs including Latin hypercubes and strong orthogonal arrays. It aims at capturing the space-filling properties of a design when projected onto subregions of various sizes. In this paper, we also consider the dimension aside from the sizes of subregions by proposing first an expanded space-filling hierarchy principle and then a projection space-filling criterion as per the new principle. When projected onto subregions of a specific size, the proposed criterion ranks designs via sequentially maximizing the space-filling properties on equally-sized subregions from lower dimensions to higher dimensions, while the minimum aberration type space-filling criterion compares designs by maximizing the aggregate space-filling properties on multidimensional subregions of the same size. We present illustrative examples to demonstrate the two criteria and conduct simulations as evidence of the utility of our criterion in terms of selecting efficient space-filling designs to build statistical surrogate models. We further consider the construction of optimal space-filling designs under the proposed criterion. Although many algorithms have been proposed for generating space-filling designs, it is well-known that they often deteriorate rapidly in performance for large designs. In the present paper, we develop some theoretical optimality results and characterize several classes of strong orthogonal arrays of strength three that are the most space-filling. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
16. A Note on Chernoff and Lieberman's Generalized Probability Paper.
- Author
- Cran, G. W.
- Subjects
- *PROBABILITY theory, *LEAST squares, *MATHEMATICAL combinations, *BAYESIAN analysis, *DISTRIBUTION (Probability theory), *STATISTICAL correlation, *GRAPHIC methods, *MATHEMATICAL statistics
- Abstract
The determination of plotting positions on probability graph paper so that the associated weighted least squares estimators of the scale parameter and the percentiles of a continuous distribution have certain properties is discussed. Necessary and sufficient conditions are given for an invariant optimal plot for percentile estimation. Also discussed is the derivation of ordered plotting positions. [ABSTRACT FROM AUTHOR]
- Published
- 1975
- Full Text
- View/download PDF
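As background to the entry above: the simplest ordered plotting positions are the mean ranks p(i) = i/(n+1); the note concerns principled alternatives with optimality properties for weighted least squares percentile estimation. A sketch of the baseline choice only:

```python
import numpy as np

def mean_rank_positions(n):
    """Classical plotting positions p_(i) = i/(n+1) for an ordered
    sample of size n; probability paper pairs the ordered observations
    with distribution quantiles evaluated at these positions."""
    return np.arange(1, n + 1) / (n + 1.0)
```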
17. DISCUSSION OF PAPER BY I. J. GOOD, APRIL 9, 1968.
- Author
- Anscombe, F. J.
- Subjects
- *BAYESIAN analysis, *GOODNESS-of-fit tests, *STATISTICAL hypothesis testing, *STATISTICIANS, *PLANETS, *SOLAR system
- Abstract
The article presents discussion of a paper by statistician I.J. Good. The author supposes that most statisticians approaching the problem of testing goodness of fit would have been very much less ambitious than Good has been. Most people, if they spend some time looking at the first planet, Mercury, and the last two, Neptune and Pluto, would not agree that these fit Bode's law, whereas all the intermediate planets from Venus to Uranus seem to fit the law remarkably well. Those statisticians who were sufficiently knowledgeable would presumably meditate on the matters that Good has meditated on; but, having done this, most people would make a merely informal judgment and completely shirk the challenge that Good has met, to evaluate the weight of evidence. Good remarks that if one is going to test goodness of fit of a theory to some data, using either a Neyman-Pearson approach or a Bayesian approach, one has to frame explicit hypotheses. Over a good many years the author has, from time to time, tried to think about the general problem of testing goodness of fit, and remains skeptical about whether goodness of fit can be adequately discussed either in Neyman-Pearson or in Bayesian terms, because of this necessity to have not only an explicit null hypothesis, but also a fully defined explicit alternative hypothesis or set of alternative hypotheses.
- Published
- 1969
- Full Text
- View/download PDF
18. Discussion of Paper by Brad Efron.
- Author
- Cox, D. R.
- Subjects
- *MARKOV processes, *MATHEMATICAL analysis, *LIFE sciences, *STOCHASTIC processes
- Abstract
It is a privilege to have the chance of congratulating Professor Efron first on this wise paper, then on the richly merited International Prize, the award of which the paper commemorates, but above all on the whole body of his deeply impressive, wide-ranging contributions to our subject. One of the great masterpieces of our field, largely unread nowadays, is M.S. Bartlett's (1958) Introduction to Stochastic Processes and their Application, and Bartlett's influence on statistical development, at least in the UK, was second only to Fisher's. Professor Efron in his theoretical papers manages with great skill and panache to combine careful mathematical discussion with judicious statistical emphasis. [Extracted from the article]
- Published
- 2020
- Full Text
- View/download PDF
19. Online Regularization towards Always-Valid High-Dimensional Dynamic Pricing.
- Author
- Wang, Chi-Hua, Wang, Zhanyu, Sun, Will Wei, and Cheng, Guang
- Abstract
Devising a dynamic pricing policy with always-valid online statistical learning procedures is an important and as yet unresolved problem. Most existing dynamic pricing policies, which focus on the faithfulness of adopted customer choice models, exhibit a limited capability for adapting to the online uncertainty of learned statistical models during the pricing process. In this paper, we propose a novel approach for designing a dynamic pricing policy based on regularized online statistical learning with theoretical guarantees. The new approach overcomes the challenge of continuous monitoring of the online Lasso procedure and possesses several appealing properties. In particular, we make the decisive observation that the always-validity of pricing decisions builds and thrives on the online regularization scheme. Our proposed online regularization scheme equips the proposed optimistic online regularized maximum likelihood pricing (OORMLP) policy with three major advantages: encode market noise knowledge into pricing process optimism; empower online statistical learning with always-validity over all decision points; envelop the prediction error process with time-uniform non-asymptotic oracle inequalities. This type of non-asymptotic inference result allows us to design more sample-efficient and robust dynamic pricing algorithms in practice. In theory, the proposed OORMLP algorithm exploits the sparsity structure of high-dimensional models and secures a logarithmic regret in a decision horizon. These theoretical advances are made possible by proposing an optimistic online Lasso procedure that resolves dynamic pricing problems at the process level, based on a novel use of non-asymptotic martingale concentration. In experiments, we evaluate OORMLP in different synthetic and real pricing problem settings and demonstrate that OORMLP advances the state-of-the-art methods. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
20. Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning.
- Author
- Shen, Ye, Cai, Hengrui, and Song, Rui
- Abstract
Evaluating the performance of an ongoing policy plays a vital role in many areas such as medicine and economics, to provide crucial guidance on early stopping of the online experiment and timely feedback from the environment. Policy evaluation in online learning thus attracts increasing attention by inferring the mean outcome of the optimal policy (i.e., the value) in real-time. Yet, such a problem is particularly challenging due to the dependent data generated in the online environment, the unknown optimal policy, and the complex exploration and exploitation trade-off in the adaptive experiment. In this paper, we aim to overcome these difficulties in policy evaluation for online learning. We explicitly derive the probability of exploration that quantifies the probability of exploring non-optimal actions under commonly used bandit algorithms. We use this probability to conduct valid inference on the online conditional mean estimator under each action and develop the doubly robust interval estimation (DREAM) method to infer the value under the estimated optimal policy in online learning. The proposed value estimator provides double protection for consistency and is asymptotically normal, with a Wald-type confidence interval provided. Extensive simulation studies and real data applications are conducted to demonstrate the empirical validity of the proposed DREAM method. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
21. Dimension Reduction for Fréchet Regression.
- Author
- Zhang, Qi, Xue, Lingzhou, and Li, Bing
- Abstract
With the rapid development of data collection techniques, complex data objects that are not in the Euclidean space are frequently encountered in new statistical applications. The Fréchet regression model (Petersen & Müller, 2019) provides a promising framework for regression analysis with metric space-valued responses. In this paper, we introduce a flexible sufficient dimension reduction (SDR) method for Fréchet regression to achieve two purposes: to mitigate the curse of dimensionality caused by high-dimensional predictors, and to provide a visual inspection tool for Fréchet regression. Our approach is flexible enough to turn any existing SDR method for Euclidean (X, Y) into one for Euclidean X and metric space-valued Y. The basic idea is to first map the metric space-valued random object Y to a real-valued random variable f(Y) using a class of functions, and then perform classical SDR on the transformed data. If the class of functions is sufficiently rich, then we are guaranteed to uncover the Fréchet SDR space. We show that such a class, which we call an ensemble, can be generated by a universal kernel (cc-universal kernel). We establish the consistency and asymptotic convergence rate of the proposed methods. The finite-sample performance of the proposed methods is illustrated through simulation studies for several commonly encountered metric spaces that include Wasserstein space, the space of symmetric positive definite matrices, and the sphere. We illustrate the data visualization aspect of our method with the human mortality distribution data from the United Nations Databases. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
22. Higher-order Expansions and Inference for Panel Data Models.
- Author
- Gao, Jiti, Peng, Bin, and Yan, Yayi
- Abstract
In this paper, we propose a simple inferential method for a wide class of panel data models, with a focus on cases that exhibit both serial correlation and cross-sectional dependence. In order to establish an asymptotic theory to support the inferential method, we develop some new and useful higher-order expansions, such as a Berry-Esseen bound and an Edgeworth expansion, under a set of simple and general conditions. We further demonstrate the usefulness of these theoretical results by explicitly investigating a panel data model with interactive effects which nests many traditional panel data models as special cases. Finally, we show the superiority of our approach over several natural competitors using extensive numerical studies. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
23. Latent Space Modelling of Hypergraph Data.
- Author
- Turnbull, Kathryn, Lunagómez, Simón, Nemeth, Christopher, and Airoldi, Edoardo
- Abstract
Abstract The increasing prevalence of relational data describing interactions among a target population has motivated a wide literature on statistical network analysis. In many applications, interactions may involve more than two members of the population, and such data are more appropriately represented by a hypergraph. In this paper, we present a model for hypergraph data that extends the well-established latent space approach for graphs and, by drawing a connection to constructs from computational topology, we develop a model whose likelihood is inexpensive to compute. A delayed acceptance MCMC scheme is proposed to obtain posterior samples, and we rely on Bookstein coordinates to remove the identifiability issues associated with the latent representation. We theoretically examine the degree distribution of hypergraphs generated under our framework and, through simulation, we investigate the flexibility of our model and consider estimation of predictive distributions. Finally, we explore the application of our model to two real-world datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
24. Data fission: splitting a single data point.
- Author
-
Leiner, James, Duan, Boyan, Wasserman, Larry, and Ramdas, Aaditya
- Abstract
Abstract Suppose we observe a random vector X from some distribution in a known family with unknown parameters. We ask the following question: when is it possible to split X into two pieces f(X) and g(X) such that neither part is sufficient to reconstruct X by itself, but both together can recover X fully, and their joint distribution is tractable? One common solution to this problem when multiple samples of X are observed is data splitting, but Rasines and Young (2022) offer an alternative approach that uses additive Gaussian noise; this enables post-selection inference in finite samples for Gaussian distributed data and asymptotically for non-Gaussian additive models. In this paper, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data fission, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on several prototypical applications, such as post-selection inference for trend filtering and other regression problems, and effect size estimation after interactive multiple testing. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
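The additive-Gaussian split that the data fission abstract contrasts with data splitting can be illustrated with a minimal sketch for one concrete case: a Gaussian observation with known variance. This is only the simplest instance of the construction (the paper's methodology is more general), and all names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def fission(x, sigma, rng):
    """Split x ~ N(mu, sigma^2) into two independent pieces.

    f = x + z and g = x - z with z ~ N(0, sigma^2) are jointly Gaussian
    with covariance sigma^2 - sigma^2 = 0, hence independent, and
    x = (f + g) / 2 recovers the original observation exactly.
    """
    z = rng.normal(0.0, sigma, size=np.shape(x))
    return x + z, x - z

mu, sigma, n = 2.0, 1.0, 200_000
x = rng.normal(mu, sigma, size=n)
f, g = fission(x, sigma, rng)

# neither piece alone pins down x, but together they reconstruct it
assert np.allclose((f + g) / 2, x)
# empirical check of independence: correlation near zero
assert abs(np.corrcoef(f, g)[0, 1]) < 0.01
```

Either piece can then be used for selection and the other for inference, mimicking a train/test split from a single data point.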
25. A Wavelet-Based Independence Test for Functional Data With an Application to MEG Functional Connectivity.
- Author
-
Miao, Rui, Zhang, Xiaoke, and Wong, Raymond K. W.
- Subjects
- *
FUNCTIONAL connectivity , *THRESHOLDING algorithms , *FUNCTIONAL analysis , *BESOV spaces , *MAGNETOENCEPHALOGRAPHY , *HILBERT space - Abstract
Measuring and testing the dependency between multiple random functions is often an important task in functional data analysis. In the literature, a model-based method relies on a model which is subject to the risk of model misspecification, while a model-free method only provides a correlation measure which is inadequate to test independence. In this paper, we adopt the Hilbert–Schmidt Independence Criterion (HSIC) to measure the dependency between two random functions. We develop a two-step procedure by first pre-smoothing each function based on its discrete and noisy measurements and then applying the HSIC to the recovered functions. To ensure the compatibility between the two steps, such that the effect of the pre-smoothing error on the subsequent HSIC is asymptotically negligible when the data are densely measured, we propose a new wavelet thresholding method for pre-smoothing and the use of Besov-norm-induced kernels for the HSIC. We also provide the corresponding asymptotic analysis. The superior numerical performance of the proposed method over existing ones is demonstrated in a simulation study. Moreover, in a magnetoencephalography (MEG) data application, the functional connectivity patterns identified by the proposed method are more anatomically interpretable than those by existing methods. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
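The empirical HSIC at the core of the abstract above can be sketched in a few lines. Note the assumptions: plain Gaussian kernels on finite-dimensional vectors stand in for the paper's Besov-norm-induced kernels on recovered functions, the bandwidths are arbitrary, and this is the simple biased estimator, not the paper's full two-step procedure.

```python
import numpy as np

def _gram(x, bw):
    # Gaussian-kernel Gram matrix for the rows of x
    sq = np.sum(x**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * x @ x.T
    return np.exp(-d2 / (2 * bw**2))

def hsic(x, y, bw_x=1.0, bw_y=1.0):
    """Biased empirical HSIC: trace(K H L H) / n^2, H the centering matrix."""
    n = x.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = _gram(x, bw_x), _gram(y, bw_y)
    return np.trace(K @ H @ L @ H) / n**2

rng = np.random.default_rng(1)
x = rng.normal(size=(300, 2))
y_dep = x + 0.1 * rng.normal(size=(300, 2))   # strongly dependent on x
y_ind = rng.normal(size=(300, 2))             # independent of x

assert hsic(x, y_dep) > hsic(x, y_ind)
```

A larger HSIC value indicates stronger dependence; calibration (e.g., by permutation) would turn it into a test.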
26. Time-to-Event Analysis with Unknown Time Origins via Longitudinal Biomarker Registration.
- Author
-
Wang, Tianhao, Ratcliffe, Sarah J., and Guo, Wensheng
- Subjects
- *
BIOMARKERS , *RECORDING & registration , *TIME management , *DISEASE progression , *PARAMETRIC modeling , *EXTRAPOLATION - Abstract
In observational studies, the time origin of interest for time-to-event analysis is often unknown, such as the time of disease onset. Existing approaches to estimating the time origins are commonly built on extrapolating a parametric longitudinal model, an approach that relies on rigid assumptions and can lead to biased inferences. In this paper, we introduce a flexible semiparametric curve registration model. It assumes the longitudinal trajectories follow a flexible common shape function with a person-specific disease progression pattern characterized by a random curve registration function, which is further used to model the unknown time origin as a random start time. This random time is used as a link to jointly model the longitudinal and survival data, where the unknown time origins are integrated out in the joint likelihood function, which facilitates unbiased and consistent estimation. Since the disease progression pattern naturally predicts time-to-event, we further propose a new functional survival model using the registration function as a predictor of the time-to-event. The asymptotic consistency and semiparametric efficiency of the proposed models are proved. Simulation studies and two real data applications demonstrate the effectiveness of this new approach. Supplementary materials for this article are available online. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
27. A Comprehensive Bayesian Framework for Envelope Models.
- Author
-
Chakraborty, Saptarshi and Su, Zhihua
- Abstract
Abstract The envelope model aims to increase efficiency in multivariate analysis by utilizing dimension reduction techniques. It has been used in many contexts including linear regression, generalized linear models, matrix/tensor variate regression, reduced rank regression, and quantile regression, and has shown the potential to provide substantial efficiency gains. Virtually all of these advances, however, have been made from a frequentist perspective, and the literature addressing envelope models from a Bayesian point of view is sparse. The objective of this paper is to propose a Bayesian framework that is applicable across various envelope model contexts. The proposed framework aids straightforward interpretation of model parameters and allows easy incorporation of prior information. We provide a simple block Metropolis-within-Gibbs MCMC sampler for practical implementations of our method. Simulations and data examples are included for illustration. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
28. Value Enhancement of Reinforcement Learning via Efficient and Robust Trust Region Optimization.
- Author
-
Shi, Chengchun, Qi, Zhengling, Wang, Jianing, and Zhou, Fan
- Abstract
Abstract Reinforcement learning (RL) is a powerful machine learning technique that enables an intelligent agent to learn an optimal policy that maximizes the cumulative rewards in sequential decision making. Most methods in the existing literature are developed in online settings, where the data are easy to collect or simulate. Motivated by high-stakes domains such as mobile health studies with limited and pre-collected data, in this paper we study offline reinforcement learning methods. To efficiently use these datasets for policy optimization, we propose a novel value enhancement method to improve the performance of a given initial policy computed by existing state-of-the-art RL algorithms. Specifically, when the initial policy is not consistent, our method will output a policy whose value is no worse and often better than that of the initial policy. When the initial policy is consistent, under some mild conditions, our method will yield a policy whose value converges to the optimal one at a faster rate than the initial policy, achieving the desired "value enhancement" property. The proposed method is generally applicable to any parametrized policy that belongs to a certain pre-specified function class (e.g., deep neural networks). Extensive numerical studies are conducted to demonstrate the superior performance of our method. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
29. Data-driven selection of the number of change-points via error rate control.
- Author
-
Chen, Hui, Ren, Haojie, Yao, Fang, and Zou, Changliang
- Subjects
- *
FALSE discovery rate , *ERROR rates , *SYMMETRY - Abstract
In multiple change-point analysis, one of the main difficulties is to determine the number of change-points. Various consistent selection methods, including the use of the Schwarz information criterion and cross-validation, have been proposed to balance model fit and complexity. However, there is a lack of systematic approaches that provide a theoretical guarantee of significance in determining the number of changes. In this paper, we introduce a data-adaptive selection procedure via error rate control based on order-preserving sample-splitting, which is applicable to most existing change-point methods. The key idea is to construct a series of statistics with a global symmetry property and then utilize the symmetry to derive a data-driven threshold. Under this general framework, we are able to rigorously investigate false discovery proportion control, and show that the proposed method controls the false discovery rate (FDR) asymptotically under mild conditions while retaining the true change-points. Numerical experiments indicate that our selection procedure works well for many change-detection methods and is able to yield accurate FDR control in finite samples. Keywords: Empirical distribution; False discovery rate; Multiple change-point model; Sample-splitting; Symmetry; Uniform convergence. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
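The general idea of deriving a data-driven threshold from sign-symmetric null statistics, as in the abstract above, can be sketched in the knockoff/mirror style. This is a generic illustration, not the paper's actual order-preserving sample-splitting construction: the statistics `w`, the signal sizes, and the level `q` are all made up for the example.

```python
import numpy as np

def symmetry_threshold(w, q=0.1):
    """Data-driven threshold exploiting sign symmetry of null statistics.

    For each candidate t, the count of statistics below -t estimates the
    number of false discoveries above t, giving an FDP estimate
    (1 + #{w <= -t}) / #{w >= t}; take the smallest t with estimate <= q.
    """
    for t in np.sort(np.abs(w[w != 0])):
        fdp_hat = (1 + np.sum(w <= -t)) / max(1, np.sum(w >= t))
        if fdp_hat <= q:
            return t
    return np.inf

rng = np.random.default_rng(2)
nulls = rng.normal(0, 1, size=500)      # symmetric about 0 (no change)
signals = rng.normal(6, 1, size=20)     # large positive statistics (true changes)
w = np.concatenate([nulls, signals])

t = symmetry_threshold(w, q=0.1)
selected = np.sum(w >= t)
assert 15 <= selected <= 40  # most true signals kept, few nulls pass
```

Everything above the threshold is declared a discovery; the symmetry of the nulls is what makes the FDP estimate honest.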
30. Hierarchical Community Detection by Recursive Partitioning.
- Author
-
Li, Tianxi, Lei, Lihua, Bhattacharyya, Sharmodeep, Van den Berge, Koen, Sarkar, Purnamrita, Bickel, Peter J., and Levina, Elizaveta
- Subjects
- *
RECURSIVE partitioning , *GENE regulatory networks , *PARALLEL algorithms , *COMMUNITIES , *GENE clusters , *STOCHASTIC models - Abstract
The problem of community detection in networks is usually formulated as finding a single partition of the network into some "correct" number of communities. We argue that it is more interpretable and in some regimes more accurate to construct a hierarchical tree of communities instead. This can be done with a simple top-down recursive partitioning algorithm, starting with a single community and separating the nodes into two communities by spectral clustering repeatedly, until a stopping rule suggests there are no further communities. This class of algorithms is model-free, computationally efficient, and requires no tuning other than selecting a stopping rule. We show that there are regimes where this approach outperforms K-way spectral clustering, and propose a natural framework for analyzing the algorithm's theoretical performance, the binary tree stochastic block model. Under this model, we prove that the algorithm correctly recovers the entire community tree under relatively mild assumptions. We apply the algorithm to a gene network based on gene co-occurrence in 1580 research papers on anemia, and identify six clusters of genes in a meaningful hierarchy. We also illustrate the algorithm on a dataset of statistics papers. Supplementary materials for this article are available online. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
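One step of the top-down recursion described above, splitting a network into two communities by spectral clustering, can be sketched as follows. This shows only a single bisection on a hypothetical planted two-block network; the paper's algorithm recurses on each side and adds a stopping rule, both omitted here.

```python
import numpy as np

def spectral_split(A):
    """Bisect a graph by the sign of the Fiedler vector, i.e. the
    eigenvector of the second-smallest eigenvalue of the unnormalized
    Laplacian L = D - A (eigh returns eigenvalues in ascending order)."""
    L = np.diag(A.sum(axis=1)) - A
    _, vecs = np.linalg.eigh(L)
    return vecs[:, 1] >= 0

# planted two-block network: dense within blocks, sparse between
rng = np.random.default_rng(3)
n = 40
labels = np.repeat([0, 1], n // 2)
P = np.where(labels[:, None] == labels[None, :], 0.6, 0.05)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T                      # symmetric adjacency, no self-loops

side = spectral_split(A)
agree = np.mean(side == labels.astype(bool))
# the split should align with the planted blocks up to label swap
assert max(agree, 1 - agree) > 0.9
```

Recursing this split on each resulting community, until a stopping rule fires, yields the hierarchical tree.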
31. Asymptotic Distribution-Free Independence Test for High Dimension Data.
- Author
-
Cai, Zhanrui, Lei, Jing, and Roeder, Kathryn
- Abstract
Abstract Test of independence is of fundamental importance in modern data analysis, with broad applications in variable selection, graphical models, and causal inference. When the data are high dimensional and the potential dependence signal is sparse, independence testing becomes very challenging without distributional or structural assumptions. In this paper, we propose a general framework for independence testing by first fitting a classifier that distinguishes the joint and product distributions, and then testing the significance of the fitted classifier. This framework allows us to borrow the strength of the most advanced classification algorithms developed by the modern machine learning community, making it applicable to high dimensional, complex data. By combining a sample split and a fixed permutation, our test statistic has a universal, fixed Gaussian null distribution that is independent of the underlying data distribution. Extensive simulations demonstrate the advantages of the newly proposed test compared with existing methods. We further apply the new test to a single-cell dataset to test the independence between two types of single-cell sequencing measurements, whose high dimensionality and sparsity make existing methods hard to apply. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
32. Individual Data Protected Integrative Regression Analysis of High-Dimensional Heterogeneous Data.
- Author
-
Cai, Tianxi, Liu, Molei, and Xia, Yin
- Subjects
- *
REGRESSION analysis , *ELECTRONIC health records , *CORONARY artery disease , *DATA recorders & recording , *DECISION making - Abstract
Evidence-based decision making often relies on meta-analyzing multiple studies, which enables more precise estimation and investigation of generalizability. Integrative analysis of multiple heterogeneous studies is, however, highly challenging in the ultra high-dimensional setting. The challenge is even more pronounced when the individual-level data cannot be shared across studies, known as the DataSHIELD constraint. Under sparse regression models that are assumed to be similar yet not identical across studies, we propose in this paper a novel integrative estimation procedure for data-Shielding High-dimensional Integrative Regression (SHIR). SHIR protects individual data through a summary-statistics-based integration procedure, accommodates between-study heterogeneity in both the covariate distribution and model parameters, and attains consistent variable selection. Theoretically, SHIR is statistically more efficient than the existing distributed approaches that integrate debiased LASSO estimators from the local sites. Furthermore, the estimation error incurred by aggregating derived data is negligible compared to the statistical minimax rate, and SHIR is shown to be asymptotically equivalent in estimation to the ideal estimator obtained by sharing all data. The finite-sample performance of our method is studied and compared with existing approaches via extensive simulation settings. We further illustrate the utility of SHIR to derive phenotyping algorithms for coronary artery disease using electronic health records data from multiple chronic disease cohorts. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
33. Robust Two-Step Wavelet-Based Inference for Time Series Models.
- Author
-
Guerrier, Stéphane, Molinari, Roberto, Victoria-Feser, Maria-Pia, and Xu, Haotian
- Subjects
- *
GENERALIZED method of moments , *ASYMPTOTIC normality , *STOCHASTIC processes , *STOCHASTIC models , *COMPUTATIONAL complexity - Abstract
Latent time series models such as (the independent sum of) ARMA(p, q) models with additional stochastic processes are increasingly used for data analysis in biology, ecology, engineering, and economics. Inference on and/or prediction from these models can be highly challenging: (i) the data may contain outliers that can adversely affect the estimation procedure; (ii) the computational complexity can become prohibitive when the time series are extremely large; (iii) model selection adds another layer of (computational) complexity; and (iv) solutions that address (i), (ii), and (iii) simultaneously do not exist in practice. This paper aims at jointly addressing these challenges by proposing a general framework for robust two-step estimation based on a bounded influence M-estimator of the wavelet variance. We first develop the conditions for the joint asymptotic normality of the latter estimator thereby providing the necessary tools to perform (direct) inference for scale-based analysis of signals. Taking advantage of the model-independent weights of this first-step estimator, we then develop the asymptotic properties of two-step robust estimators using the framework of the generalized method of wavelet moments (GMWM). Simulation studies illustrate the good finite sample performance of the robust GMWM estimator and applied examples highlight the practical relevance of the proposed approach. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
34. Confidence Intervals for Nonparametric Empirical Bayes Analysis.
- Author
-
Ignatiadis, Nikolaos and Wager, Stefan
- Subjects
- *
CONFIDENCE intervals , *MIXTURES - Abstract
In an empirical Bayes analysis, we use data from repeated sampling to imitate inferences made by an oracle Bayesian with extensive knowledge of the data-generating distribution. Existing results provide a comprehensive characterization of when and why empirical Bayes point estimates accurately recover oracle Bayes behavior. In this paper, we develop flexible and practical confidence intervals that provide asymptotic frequentist coverage of empirical Bayes estimands, such as the posterior mean or the local false sign rate. The coverage statements hold even when the estimands are only partially identified or when empirical Bayes point estimates converge very slowly. Supplementary materials for this article are available online. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
35. Nonlinear Spectral Analysis: A Local Gaussian Approach.
- Author
-
Jordanger, Lars Arne and Tjøstheim, Dag
- Subjects
- *
NONLINEAR analysis , *HYPERGEOMETRIC series , *SPECTRAL energy distribution , *TIME series analysis , *WHITE noise - Abstract
The spectral distribution f(ω) of a stationary time series {Y_t}_{t∈Z} can be used to investigate whether or not periodic structures are present in {Y_t}_{t∈Z}, but f(ω) has some limitations due to its dependence on the autocovariances γ(h). For example, f(ω) cannot distinguish white iid noise from GARCH-type models (whose terms are dependent, but uncorrelated), which implies that f(ω) can be an inadequate tool when {Y_t}_{t∈Z} contains asymmetries and nonlinear dependencies. Asymmetries between the upper and lower tails of a time series can be investigated by means of the local Gaussian autocorrelations, and these local measures of dependence can be used to construct the local Gaussian spectral density presented in this paper. A key feature of the new local spectral density is that it coincides with f(ω) for Gaussian time series, which implies that it can be used to detect non-Gaussian traits in the time series under investigation. In particular, if f(ω) is flat, then peaks and troughs of the new local spectral density can indicate nonlinear traits, which potentially might discover local periodic phenomena that remain undetected in an ordinary spectral analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
36. On Genetic Correlation Estimation With Summary Statistics From Genome-Wide Association Studies.
- Author
-
Zhao, Bingxin and Zhu, Hongtu
- Subjects
- *
GENOME-wide association studies , *ESTIMATION theory , *DISEASE risk factors , *MONOGENIC & polygenic inheritance (Genetics) , *GENETIC correlations , *VARIANCES , *BRAIN anatomy - Abstract
The cross-trait polygenic risk score (PRS) method has gained popularity for assessing genetic correlation of complex traits using summary statistics from biobank-scale genome-wide association studies (GWAS). However, empirical evidence has shown a common bias phenomenon: highly significant cross-trait PRS can account for only a very small amount of genetic variance (R² can be below 1%) in independent testing GWAS. The aim of this paper is to investigate and address this bias phenomenon of cross-trait PRS in numerous GWAS applications. We show that the estimated genetic correlation can be asymptotically biased toward zero. A consistent cross-trait PRS estimator is then proposed to correct such asymptotic bias. In addition, we investigate whether or not SNP screening by GWAS p-values can lead to improved estimation and show the effect of overlapping samples among GWAS. We analyze GWAS summary statistics of reaction time and brain structural magnetic resonance imaging-based features measured in the Pediatric Imaging, Neurocognition, and Genetics study. We find that the raw cross-trait PRS estimators heavily underestimate the genetic similarity between cognitive function and human brain structures (mean R² = 1.32%), whereas the bias-corrected estimators uncover the moderate degree of genetic overlap between these closely related heritable traits (mean R² = 22.42%). Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
37. An Empirical Bayes Method for Chi-Squared Data.
- Author
-
Du, Lilun and Hu, Inchi
- Subjects
- *
EMPIRICAL Bayes methods , *CHI-squared test , *CHI-square distribution , *GAUSSIAN distribution , *FALSE discovery rate - Abstract
In a thought-provoking paper, Efron investigated the merit and limitation of an empirical Bayes method to correct selection bias based on Tweedie's formula first reported in the study by Robbins. The exceptional virtue of Tweedie's formula for the normal distribution lies in its representation of selection bias as a simple function of the derivative of log marginal likelihood. Since the marginal likelihood and its derivative can be estimated from the data directly without invoking prior information, bias correction can be carried out conveniently. We propose a Bayesian hierarchical model for chi-squared data such that the resulting Tweedie's formula has the same virtue as that of the normal distribution. Because the family of noncentral chi-squared distributions, the common alternative distributions for chi-squared tests, does not constitute an exponential family, our results cannot be obtained by extending existing results. Furthermore, the corresponding Tweedie's formula manifests new phenomena quite different from those of the normal distribution and suggests new ways of analyzing chi-squared data. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
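The "exceptional virtue" of Tweedie's formula for the normal distribution, cited in the abstract above, is easy to verify numerically: the posterior mean is the observation plus the derivative of the log marginal likelihood. The sketch below checks this only in the classical normal-prior case (the abstract's chi-squared extension is not attempted here).

```python
import numpy as np

# Tweedie's formula for the normal means problem z ~ N(mu, 1), mu ~ G:
#   E[mu | z] = z + d/dz log m(z),   m = marginal density of z.
# With G = N(0, A), the marginal is N(0, A + 1), so the formula must
# reproduce the classical shrinkage rule E[mu | z] = z * A / (A + 1).

A = 4.0
z = np.linspace(-3, 3, 61)
score = -z / (A + 1.0)          # d/dz log m(z) for the N(0, A+1) marginal
tweedie = z + score
assert np.allclose(tweedie, z * A / (A + 1.0))
```

In practice the score is estimated from the data via the empirical marginal, which is what makes the bias correction possible without prior knowledge.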
38. Statistical Models for COVID-19 Incidence, Cumulative Prevalence, and Rt.
- Author
-
Jewell, Nicholas P.
- Subjects
- *
PANDEMICS , *STATISTICAL models , *COVID-19 - Abstract
At first glance, the initial stage of MERMAID modeling seems to involve an epidemiological model for infection growth, centered on R_t, the reproductive number associated with SARS-CoV-2 infections; however, this does not appear essential to the overall approach as noted below. 2 Characteristics of MERMAID There has been a plethora of quantitative models that try to both explain patterns of COVID-19 infections and provide insight into the scale of future transmissions and associated hospitalizations. To be fair, empirical models (and mechanistic models for that matter) have not, in general, successfully predicted COVID-19 infection counts more than a month out, and model ensemble estimates only partially address key uncertainties. We appreciate the opportunity to discuss the thought-provoking paper by Quick, Dey and Lin (2021), which we shall refer to as QDL in the following for convenience. [Extracted from the article]
- Published
- 2021
- Full Text
- View/download PDF
39. Graph-Based Equilibrium Metrics for Dynamic Supply–Demand Systems With Applications to Ride-sourcing Platforms.
- Author
-
Zhou, Fan, Luo, Shikai, Qie, Xiaohu, Ye, Jieping, and Zhu, Hongtu
- Subjects
- *
DYNAMICAL systems , *EQUILIBRIUM , *LINEAR programming , *SUPPLY & demand , *WEIGHTED graphs , *PREDICATE calculus - Abstract
How to dynamically measure the local-to-global spatio-temporal coherence between demand and supply networks is a fundamental task for ride-sourcing platforms, such as DiDi. Such coherence measurement is critically important for the quantification of market efficiency and the comparison of different platform policies, such as dispatching. The aim of this paper is to introduce a graph-based equilibrium metric (GEM) to quantify the distance between demand and supply networks based on a weighted graph structure. We formulate GEM as the optimal objective value of an unbalanced optimal transport problem, which can be cast as an equivalent linear program and efficiently solved. We examine how the GEM can help solve three operational tasks of ride-sourcing platforms. The first is that GEM achieves up to a 70.6% reduction in root-mean-square error over the second-best distance measurement for the prediction accuracy of order answer rate. The second is that using GEM to design the order dispatching policy increases drivers' revenue by more than 1%, a substantial improvement at this scale. The third is that GEM can serve as an endpoint for comparing different platform policies in A/B tests. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
40. Discussion of "Confidence Intervals for Nonparametric Empirical Bayes Analysis".
- Author
-
Pensky, Marianna
- Subjects
- *
CONFIDENCE intervals , *PROBABILITY density function , *BAYES' estimation , *MATHEMATICAL functions , *CHARTS, diagrams, etc. , *ESTIMATION theory - Abstract
One of the properties of such nonparametric classes is that they are broad enough, so misspecification of Graph HT ht stops being an issue. The present paper studies much narrower choices of Graph HT ht. The latter also highlights which choices of Graph HT ht are most appropriate. [Extracted from the article]
- Published
- 2022
- Full Text
- View/download PDF
41. Optimum Sample Size for a Problem in Choosing the Population with the Largest Mean: Some Comments on Somerville's Paper.
- Author
-
Ofosu, J. B.
- Subjects
- *
POPULATION , *ANALYSIS of variance , *CHEBYSHEV approximation , *APPROXIMATION theory , *STATISTICAL sampling - Abstract
The article presents the author's comments on the article "Optimum Sample Size for a Problem in Choosing the Population With the Largest Mean," by statistician Paul N. Somerville, published in the June 1970 issue of the "Journal of the American Statistical Association." In that article, Somerville described a procedure for selecting the population with the largest mean from normal populations with unknown means and a common known variance. Somerville claims that his procedure gives a unique minimax solution and that the minimax is a reasonable solution to the problem. According to the author, however, Somerville has not made it clear that his procedure gives only a local minimax solution. This makes practical applications of his procedure severely limited, for it is difficult to think of a situation in which an experimenter would like to determine a local minimax value.
- Published
- 1974
- Full Text
- View/download PDF
42. Network Dependence Can Lead to Spurious Associations and Invalid Inference.
- Author
-
Lee, Youjin and Ogburn, Elizabeth L.
- Subjects
- *
SOCIAL networks , *CORONARY disease , *PEER pressure , *INFERENTIAL statistics , *CARDIAC research - Abstract
Researchers across the health and social sciences generally assume that observations are independent, even while relying on convenience samples that draw subjects from one or a small number of communities, schools, hospitals, etc. A paradigmatic example of this is the Framingham Heart Study (FHS). Many of the limitations of such samples are well-known, but the issue of statistical dependence due to social network ties has not previously been addressed. We show that, along with anticonservative variance estimation, this can result in spurious associations due to network dependence. Using a statistical test that we adapted from one developed for spatial autocorrelation, we test for network dependence in several of the thousands of influential papers that have been published using FHS data. Results suggest that some of the many decades of research on coronary heart disease, other health outcomes, and peer influence using FHS data may suffer from spurious associations, error-prone point estimates, and anticonservative inference due to unacknowledged network dependence. These issues are not unique to the FHS; as researchers in psychology, medicine, and beyond grapple with replication failures, this unacknowledged source of invalid statistical inference should be part of the conversation. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
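The abstract above adapts a spatial autocorrelation test to network ties. As a simplified stand-in, the classical Moran's I statistic with a permutation null conveys the idea; the ring network, the neighbour-smoothed outcome, and the significance cutoff below are all fabricated for illustration, not taken from the paper.

```python
import numpy as np

def morans_i(y, W):
    """Moran's I autocorrelation of outcome y over network weights W."""
    z = y - y.mean()
    return (len(y) / W.sum()) * (z @ W @ z) / (z @ z)

rng = np.random.default_rng(4)
n = 200

# hypothetical ring network: each subject tied to its two neighbours
W = np.zeros((n, n))
idx = np.arange(n)
W[idx, (idx + 1) % n] = 1.0
W[(idx + 1) % n, idx] = 1.0

# outcome smoothed over network neighbours, i.e. network-dependent
y = np.convolve(rng.normal(size=n + 2), np.ones(3), mode="valid")

# permutation null: shuffling y over the nodes destroys the dependence
perm = np.array([morans_i(rng.permutation(y), W) for _ in range(999)])
assert morans_i(y, W) > np.quantile(perm, 0.99)
```

If observations were truly independent, Moran's I would fall inside the permutation null; the smoothed outcome lands far in its tail, mimicking the spurious-association mechanism the paper warns about.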
43. Introduction to the Theory and Methods Special Issue on Precision Medicine and Individualized Policy Discovery.
- Author
-
Kosorok, Michael R., Laber, Eric B., Small, Dylan S., and Zeng, Donglin
- Subjects
- *
INDIVIDUALIZED medicine , *DECISION making , *CAUSAL inference , *ARTIFICIAL intelligence , *MACHINE learning - Abstract
We introduce the Theory and Methods Special Issue on Precision Medicine and Individualized Policy Discovery. The issue consists of four discussion papers, grouped into two pairs, and sixteen regular research papers that cover many important lines of research on data-driven decision making. We hope that the many provocative and original ideas presented herein will inspire further work and development in precision medicine and personalization. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
44. Revealing Subgroup Structure in Ranked Data Using a Bayesian WAND.
- Author
-
Johnson, S. R., Henderson, D. A., and Boys, R. J.
- Subjects
- *
DATA structures , *GIBBS sampling , *RANKING (Statistics) , *CANCER genes , *SCHOLARLY periodicals - Abstract
Ranked data arise in many areas of application ranging from the ranking of up-regulated genes for cancer to the ranking of academic statistics journals. Complications can arise when rankers do not report a full ranking of all entities; for example, they might only report their top-M ranked entities after seeing some or all entities. It can also be useful to know whether rankers are equally informative, and whether some entities are effectively judged to be exchangeable. Revealing subgroup structure in the data may also be helpful in understanding the distribution of ranker views. In this paper, we propose a flexible Bayesian nonparametric model for identifying heterogeneous structure and ranker reliability in ranked data. The model is a weighted adapted nested Dirichlet (WAND) process mixture of Plackett–Luce models and inference proceeds through a simple and efficient Gibbs sampling scheme for posterior sampling. The richness of information in the posterior distribution allows us to infer many details of the structure both between ranker groups and between entity groups (within-ranker groups). Our modeling framework also facilitates a flexible representation of the posterior predictive distribution. This flexibility is important as we propose to use the posterior predictive distribution as the basis for addressing the rank aggregation problem, and also for identifying lack of model fit. The methodology is illustrated using several simulation studies and real data examples. Supplementary materials for this article are available online. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
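As a concrete illustration of the Plackett–Luce building block used in the WAND mixture above, the following minimal NumPy sketch evaluates a ranking's log-probability: at each position, the chosen entity's worth is divided by the total worth of the entities not yet ranked. This is not the authors' Gibbs sampler, and the worth parameters here are made up.

```python
import numpy as np

def plackett_luce_logprob(ranking, worths):
    """Log-probability of a (possibly partial, top-M) ranking of entities
    0..K-1 under the Plackett-Luce model with positive worth parameters."""
    worths = np.asarray(worths, dtype=float)
    remaining = list(range(len(worths)))
    logp = 0.0
    for entity in ranking:
        # probability of picking `entity` among those still unranked
        logp += np.log(worths[entity]) - np.log(worths[remaining].sum())
        remaining.remove(entity)
    return logp

# With equal worths, every full ranking of 3 entities has probability 1/3! = 1/6
print(np.isclose(np.exp(plackett_luce_logprob([2, 0, 1], np.ones(3))), 1 / 6))
```

Partial top-M rankings are handled by simply stopping the product after M terms, which is why the model composes naturally with the top-M reporting behavior described in the abstract.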
45. Optimal Sparse Linear Prediction for Block-missing Multi-modality Data Without Imputation.
- Author
-
Yu, Guan, Li, Quefeng, Shen, Dinggang, and Liu, Yufeng
- Subjects
- *
FORECASTING , *MISSING data (Statistics) , *MULTIPLE imputation (Statistics) , *COVARIANCE matrices , *ALZHEIMER'S disease - Abstract
In modern scientific research, data are often collected from multiple modalities. Since different modalities can provide complementary information, statistical prediction methods using multimodality data can deliver better prediction performance than those using single-modality data. One special challenge in using multimodality data, however, is block-missing data: in practice, due to dropouts or the high cost of measurements, the observations of a certain modality can be missing completely for some subjects. In this paper, we propose a new direct sparse regression procedure using covariance from multimodality data (DISCOM). Our proposed DISCOM method includes two steps to find the optimal linear prediction of a continuous response variable using block-missing multimodality predictors. In the first step, rather than deleting or imputing missing data, we make use of all available information to estimate the covariance matrix of the predictors and the cross-covariance vector between the predictors and the response variable. The proposed new estimate of the covariance matrix is a linear combination of the identity matrix, the estimates of the intra-modality covariance matrix, and the cross-modality covariance matrix. Flexible estimates for both the sub-Gaussian and heavy-tailed cases are considered. In the second step, based on the estimated covariance matrix and the estimated cross-covariance vector, an extended Lasso-type estimator is used to deliver a sparse estimate of the coefficients in the optimal linear prediction. The number of samples effectively used by DISCOM is the minimum number of samples with available observations from two modalities, which can be much larger than the number of samples with complete observations from all modalities. The effectiveness of the proposed method is demonstrated by theoretical studies, simulated examples, and a real application from the Alzheimer's Disease Neuroimaging Initiative. The comparison between DISCOM and some existing methods also indicates the advantages of our proposed method. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
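A key point of the DISCOM abstract is that the second step needs only an estimated covariance matrix and cross-covariance vector, not the raw (incomplete) data. A generic covariance-input Lasso solved by coordinate descent captures that idea; this is a hedged sketch, not the authors' exact extended estimator, and the inputs below are computed from fully observed data purely as a sanity check.

```python
import numpy as np

def soft(z, lam):
    """Soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_from_covariance(Sigma, rho, lam, n_sweeps=200):
    """Coordinate descent for min_b 0.5*b'Sigma*b - rho'b + lam*||b||_1.
    Only Sigma (p x p) and rho (p,) are required, so in a block-missing
    setting they could be assembled from whatever samples happen to be
    available for each modality pair."""
    p = len(rho)
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            # partial residual correlation, holding the other coordinates fixed
            r = rho[j] - Sigma[j] @ beta + Sigma[j, j] * beta[j]
            beta[j] = soft(r, lam) / Sigma[j, j]
    return beta

# Sanity check with complete data: recover a sparse coefficient vector
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 6))
y = X @ np.array([2.0, 0, 0, -1.5, 0, 0]) + 0.1 * rng.standard_normal(2000)
beta_hat = lasso_from_covariance(X.T @ X / 2000, X.T @ y / 2000, lam=0.1)
```

The estimated coefficients are close to (2, 0, 0, -1.5, 0, 0), up to the usual Lasso shrinkage of roughly `lam` on the nonzero entries.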
46. Selected Papers of Hirotugu Akaike.
- Author
-
MTW
- Subjects
- *
ESTIMATION theory , *NONPARAMETRIC statistics - Abstract
Develops an approach to obtain nonparametric estimators for the number of classes in a finite population of known size. Approaches for dealing with difficulties caused by variations in class sizes; correlation of different estimators; investigation of estimator performance through an asymptotic variance analysis and a Monte Carlo simulation study.
- Published
- 1998
- Full Text
- View/download PDF
47. Quantile Function on Scalar Regression Analysis for Distributional Data.
- Author
-
Yang, Hojin, Baladandayuthapani, Veerabhadran, Rao, Arvind U.K., and Morris, Jeffrey S.
- Subjects
- *
QUANTILE regression , *REGRESSION analysis , *MARGINAL distributions , *MARKOV chain Monte Carlo , *SMOOTHNESS of functions , *DATA analysis - Abstract
Radiomics involves the study of tumor images to identify quantitative markers explaining cancer heterogeneity. The predominant approach is to extract hundreds to thousands of image features, including histogram features comprising summaries of the marginal distribution of pixel intensities, which leads to multiple testing problems and can miss insights not contained in the selected features. In this paper, we present methods to model the entire marginal distribution of pixel intensities via the quantile function as functional data, regressed on a set of demographic, clinical, and genetic predictors to investigate their effects on imaging-based cancer heterogeneity. We call this approach quantile functional regression, regressing subject-specific marginal distributions across repeated measurements on a set of covariates, allowing us to assess which covariates are associated with the distribution in a global sense, as well as to identify distributional features characterizing these differences, including mean, variance, skewness, heavy-tailedness, and various upper and lower quantiles. To account for smoothness in the quantile functions, account for intrafunctional correlation, and gain statistical power, we introduce custom basis functions we call quantlets that are sparse, regularized, near-lossless, and empirically defined, adapting to the features of a given dataset and containing a Gaussian subspace so that non-Gaussianness can be assessed. We fit this model using a Bayesian framework that applies nonlinear shrinkage to the quantlet coefficients to regularize the functional regression coefficients and provides fully Bayesian inference after fitting via Markov chain Monte Carlo. We demonstrate the benefit of the basis space modeling through simulation studies, and apply the method to a magnetic resonance imaging (MRI)-based radiomic dataset from glioblastoma multiforme to relate imaging-based quantile functions to various demographic, clinical, and genetic predictors, finding specific differences in tumor pixel intensity distribution between males and females and between tumors with and without DDIT3 mutations. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
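The data object in quantile functional regression is each subject's quantile function evaluated on a common probability grid. The sketch below builds that matrix and runs a naive pointwise least-squares regression on a group indicator; the paper's quantlet basis and Bayesian shrinkage are not implemented here, and the two-group spread difference is simulated purely for illustration.

```python
import numpy as np

def quantile_function_matrix(samples, p_grid):
    """Rows: one empirical quantile function per subject on a common grid."""
    return np.vstack([np.quantile(s, p_grid) for s in samples])

rng = np.random.default_rng(1)
p_grid = np.linspace(0.05, 0.95, 19)
group = np.repeat([0, 1], 20)                    # e.g., two tumor subtypes
samples = [rng.normal(0.0, 1.0 + g, size=400) for g in group]
Q = quantile_function_matrix(samples, p_grid)    # 40 subjects x 19 quantiles

# Pointwise least squares of Q(p) on an intercept and the group indicator
X = np.column_stack([np.ones_like(group), group])
coef, *_ = np.linalg.lstsq(X, Q, rcond=None)
# coef[1] traces the group effect across p: negative at low quantiles and
# positive at high quantiles, i.e., a pure spread (variance) effect with
# essentially no shift at the median
```

Regressing whole quantile functions in this way is what lets a single fit reveal effects on mean, variance, skewness, and tail behavior simultaneously, as the abstract describes.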
48. Estimation of the Boundary of a Variable Observed With Symmetric Error.
- Author
-
Florens, Jean-Pierre, Simar, Léopold, and Van Keilegom, Ingrid
- Subjects
- *
LAGUERRE polynomials , *RANDOM variables , *STOCHASTIC analysis , *STOCHASTIC frontier analysis , *POSTAL service - Abstract
Consider the model Y = X + ε with X = τ + Z, where τ is an unknown constant (the boundary of X), Z is a random variable supported on R+, ε is a symmetric error, and ε and Z are independent. Based on an iid sample of Y, we aim to identify and estimate the boundary τ when the law of ε is unknown (apart from symmetry); in particular, its variance is unknown. We propose an estimation procedure based on a minimum distance approach that makes use of Laguerre polynomials. Asymptotic results as well as finite sample simulations are shown. The paper also proposes an extension to stochastic frontier analysis, where the model is conditional on observed variables. The model becomes Y = τ(w1, w2) + Z + ε, where Y is a cost, w1 are the observed outputs, and w2 represents the observed values of other conditioning variables, so that Z is the cost inefficiency. Simulations again illustrate how the approach works in finite samples, and the proposed procedure is illustrated with data from post offices in France. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
49. From Distance Correlation to Multiscale Graph Correlation.
- Author
-
Shen, Cencheng, Priebe, Carey E., and Vogelstein, Joshua T.
- Subjects
- *
CHARACTERISTIC functions , *DISTANCES , *NEAREST neighbor analysis (Statistics) , *BIG data , *MACHINE learning , *MULTISCALE modeling - Abstract
Understanding and developing a correlation measure that can detect general dependencies is not only imperative to statistics and machine learning, but also crucial to general scientific discovery in the big data age. In this paper, we establish a new framework that generalizes distance correlation (Dcorr)—a correlation measure that was recently proposed and shown to be universally consistent for dependence testing against all joint distributions of finite moments—to the multiscale graph correlation (MGC). By using characteristic functions and incorporating the nearest neighbor machinery, we formalize the population version of local distance correlations, define the optimal scale in a given dependency, and name the optimal local correlation as MGC. The new theoretical framework motivates a theoretically sound sample MGC and allows a number of desirable properties to be proved, including the universal consistency, convergence, and almost unbiasedness of the sample version. The advantages of MGC are illustrated via a comprehensive set of simulations with linear, nonlinear, univariate, multivariate, and noisy dependencies, where it loses almost no power in monotone dependencies while achieving better performance in general dependencies, compared to Dcorr and other popular methods. Supplementary materials for this article are available online. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
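Since MGC is built on top of distance correlation, the Dcorr sample statistic itself is a useful reference point. Below is a compact NumPy version of the V-statistic form with double-centered distance matrices; the local/multiscale nearest-neighbor corrections that define MGC are not implemented, so this is only the baseline the paper generalizes.

```python
import numpy as np

def dcorr(x, y):
    """Sample distance correlation between paired samples of shape (n,) or (n, d)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)

    def centered(z):
        # pairwise Euclidean distances, double-centered (row, column, grand mean)
        D = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
        return D - D.mean(axis=0) - D.mean(axis=1, keepdims=True) + D.mean()

    A, B = centered(x), centered(y)
    dcov2 = (A * B).mean()                                   # squared distance covariance
    dvar = np.sqrt((A * A).mean() * (B * B).mean())          # normalization
    return np.sqrt(dcov2 / dvar)

rng = np.random.default_rng(2)
x = rng.standard_normal(300)
# non-monotone dependence is detected; independent noise gives a value near 0
strong, weak = dcorr(x, x ** 2), dcorr(x, rng.standard_normal(300))
```

Unlike Pearson correlation, `dcorr(x, x**2)` is clearly positive even though the dependence is non-monotone, which is the property MGC then sharpens by searching over local scales.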
50. Design, Identification, and Sensitivity Analysis for Patient Preference Trials.
- Author
-
Knox, Dean, Yamamoto, Teppei, Baum, Matthew A., and Berinsky, Adam J.
- Subjects
- *
PATIENT preferences , *SENSITIVITY analysis , *TREATMENT effectiveness , *MEDICAL scientists , *RANDOMIZED controlled trials - Abstract
Social and medical scientists are often concerned that the external validity of experimental results may be compromised because of heterogeneous treatment effects. If a treatment has different effects on those who would choose to take it and those who would not, the average treatment effect estimated in a standard randomized controlled trial (RCT) may give a misleading picture of its impact outside of the study sample. Patient preference trials (PPTs), where participants' preferences over treatment options are incorporated in the study design, provide a possible solution. In this paper, we provide a systematic analysis of PPTs based on the potential outcomes framework of causal inference. We propose a general design for PPTs with multi-valued treatments, where participants state their preferred treatments and are then randomized into either a standard RCT or a self-selection condition. We derive nonparametric sharp bounds on the average causal effects among each choice-based subpopulation of participants under the proposed design. We also propose a sensitivity analysis for the violation of the key ignorability assumption sufficient for identifying the target causal quantity. The proposed design and methodology are illustrated with an original study of partisan news media and its behavioral impact. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF