173 results for "B., Owen"
Search Results
2. What makes you unique?
- Author
Benjamin B. Seiler, Masayoshi Mase, and Art B. Owen
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty
- Published
- 2023
- Full Text
- View/download PDF
3. Propensity score methods for merging observational and experimental datasets
- Author
Michael Baiocchi, Evan Rosenman, Hailey R. Banack, and Art B. Owen
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Epidemiology; Methodology (stat.ME); External validity; Randomized controlled trial; Propensity score matching; Observational study; Causal inference; Delta method; Research design; Bias; Estimator
- Abstract
We consider how to merge a limited amount of data from a randomized controlled trial (RCT) into a much larger set of data from an observational data base (ODB), to estimate an average causal treatment effect. Our methods are based on stratification. The strata are defined in terms of effect moderators as well as propensity scores estimated in the ODB. Data from the RCT are placed into the strata they would have occupied, had they been in the ODB instead. We assume that treatment differences are comparable in the two data sources. Our first "spiked-in" method simply inserts the RCT data into their corresponding ODB strata. We also consider a data-driven convex combination of the ODB and RCT treatment effect estimates within each stratum. Using the delta method and simulations, we identify a bias problem with the spiked-in estimator that is ameliorated by the convex combination estimator. We apply our methods to data from the Women's Health Initiative, a study of thousands of postmenopausal women which has both observational and experimental data on hormone therapy (HT). Using half of the RCT to define a gold standard, we find that a version of the spiked-in estimator yields lower-MSE estimates of the causal impact of HT on coronary heart disease than would be achieved using either a small RCT or the observational component on its own.
- Published
- 2021
- Full Text
- View/download PDF
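The stratified "spiked-in" idea in the abstract above can be illustrated in a few lines. The sketch below is not the paper's estimator: the stratum construction, the helper name `spiked_in_estimate`, and the simulated data are illustrative assumptions, and the paper's within-stratum weighting and convex-combination refinements are omitted.

```python
import numpy as np

def spiked_in_estimate(ps_odb, y_odb, t_odb, ps_rct, y_rct, t_rct, n_strata=5):
    """Rough 'spiked-in' stratified treatment-effect estimate (illustrative only).

    Strata are propensity-score quantile bins defined on the ODB.  RCT units are
    placed into the stratum they would occupy in the ODB, and the pooled
    within-stratum difference in means is averaged with ODB stratum proportions
    as weights.
    """
    edges = np.quantile(ps_odb, np.linspace(0, 1, n_strata + 1))
    s_odb = np.digitize(ps_odb, edges[1:-1])   # stratum labels 0..n_strata-1
    s_rct = np.digitize(ps_rct, edges[1:-1])
    y = np.concatenate([y_odb, y_rct])
    t = np.concatenate([t_odb, t_rct])
    s = np.concatenate([s_odb, s_rct])
    est = 0.0
    for k in range(n_strata):
        w_k = np.mean(s_odb == k)              # ODB stratum proportion
        in_k = s == k
        diff = y[in_k & (t == 1)].mean() - y[in_k & (t == 0)].mean()
        est += w_k * diff
    return est

# Toy illustration: a confounded ODB plus a small randomized trial, true effect 1.0.
rng = np.random.default_rng(0)
n_odb, n_rct, tau = 5000, 300, 1.0
x = rng.normal(size=n_odb)
ps = 1 / (1 + np.exp(-x))                      # ODB propensity (known here)
t_odb = (rng.random(n_odb) < ps).astype(int)
y_odb = x + tau * t_odb + rng.normal(size=n_odb)
x_r = rng.normal(size=n_rct)
t_rct = rng.integers(0, 2, n_rct)
y_rct = x_r + tau * t_rct + rng.normal(size=n_rct)
print(spiked_in_estimate(ps, y_odb, t_odb, 1 / (1 + np.exp(-x_r)), y_rct, t_rct))
```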
4. Designing experiments informed by observational studies
- Author
Evan Rosenman and Art B. Owen
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Applications (stat.AP); Methodology (stat.ME); Experimental design; Causal inference; Observational studies; Sensitivity analysis; Data fusion; Convex optimization; Binary outcomes
- Abstract
The increasing availability of passively observed data has yielded a growing methodological interest in "data fusion." These methods involve merging data from observational and experimental sources to draw causal conclusions -- and they typically require a precarious tradeoff between the unknown bias in the observational dataset and the often-large variance in the experimental dataset. We propose an alternative approach to leveraging observational data, which avoids this tradeoff: rather than using observational data for inference, we use it to design a more efficient experiment. We consider the case of a stratified experiment with a binary outcome, and suppose pilot estimates for the stratum potential outcome variances can be obtained from the observational study. We extend results from Zhao et al. (2019) in order to generate confidence sets for these variances, while accounting for the possibility of unmeasured confounding. Then, we pose the experimental design problem as one of regret minimization, subject to the constraints imposed by our confidence sets. We show that this problem can be converted into a convex minimization and solved using conventional methods. Lastly, we demonstrate the practical utility of our methods using data from the Women's Health Initiative.
- Published
- 2021
5. Backfitting for large scale crossed random effects regressions
- Author
Swarnadip Ghosh, Trevor Hastie, and Art B. Owen
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Statistics Theory (math.ST); Computation (stat.CO); Methodology (stat.ME)
- Abstract
Regression models with crossed random effect errors can be very expensive to compute. The cost of both generalized least squares and Gibbs sampling can easily grow as $N^{3/2}$ (or worse) for $N$ observations. Papaspiliopoulos et al. (2020) present a collapsed Gibbs sampler that costs $O(N)$, but under an extremely stringent sampling model. We propose a backfitting algorithm to compute a generalized least squares estimate and prove that it costs $O(N)$. A critical part of the proof is in ensuring that the number of iterations required is $O(1)$ which follows from keeping a certain matrix norm below $1-\delta$ for some $\delta>0$. Our conditions are greatly relaxed compared to those for the collapsed Gibbs sampler, though still strict. Empirically, the backfitting algorithm has a norm below $1-\delta$ under conditions that are less strict than those in our assumptions. We illustrate the new algorithm on a ratings data set from Stitch Fix.
- Published
- 2022
- Full Text
- View/download PDF
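A minimal sketch of the kind of O(N)-per-sweep backfitting described in the abstract above, for a two-factor crossed random effects model. This is a simplified alternating shrunken-means update with known variance components, not the paper's exact GLS algorithm; the name `backfit_crossed` and the simulated data are illustrative.

```python
import numpy as np

def backfit_crossed(y, row, col, s2_a, s2_b, s2_e, iters=50):
    """Simplified backfitting for y = mu + a[row] + b[col] + e.

    Each sweep updates the row and column random effects by shrunken means of
    the current residuals; the per-sweep cost is linear in the number of
    observations because it only uses bincount-style aggregations.
    """
    na, nb = row.max() + 1, col.max() + 1
    mu, a, b = y.mean(), np.zeros(na), np.zeros(nb)
    cnt_a = np.bincount(row, minlength=na)
    cnt_b = np.bincount(col, minlength=nb)
    for _ in range(iters):
        r = y - mu - b[col]
        a = np.bincount(row, weights=r, minlength=na) / (cnt_a + s2_e / s2_a)
        r = y - mu - a[row]
        b = np.bincount(col, weights=r, minlength=nb) / (cnt_b + s2_e / s2_b)
        mu = (y - a[row] - b[col]).mean()
    return mu, a, b

# Toy data: 2000 "clients" crossed with 300 "items", sparse and unbalanced.
rng = np.random.default_rng(1)
N, na, nb = 20000, 2000, 300
row, col = rng.integers(0, na, N), rng.integers(0, nb, N)
a0, b0 = rng.normal(0, 1, na), rng.normal(0, 0.5, nb)
y = 3.0 + a0[row] + b0[col] + rng.normal(0, 1, N)
mu, a, b = backfit_crossed(y, row, col, 1.0, 0.25, 1.0)
print(round(mu, 3))
```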
6. The nonzero gain coefficients of Sobol's sequences are always powers of two
- Author
Zexin Pan and Art B. Owen
- Subjects
Statistics and Probability; Numerical Analysis (math.NA); Computation (stat.CO); Applied Mathematics; General Mathematics; Algebra and Number Theory; Control and Optimization
- Abstract
When a plain Monte Carlo estimate on $n$ samples has variance $\sigma^2/n$, then scrambled digital nets attain a variance that is $o(1/n)$ as $n\to\infty$. For finite $n$ and an adversarially selected integrand, the variance of a scrambled $(t,m,s)$-net can be at most $\Gamma\sigma^2/n$ for a maximal gain coefficient $\Gamma$.
- Published
- 2021
7. Scalable logistic regression with crossed random effects
- Author
Swarnadip Ghosh, Trevor Hastie, and Art B. Owen
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Statistics Theory (math.ST); Computation (stat.CO); Methodology (stat.ME)
- Abstract
The cost of both generalized least squares (GLS) and Gibbs sampling in a crossed random effects model can easily grow faster than $N^{3/2}$ for $N$ observations. Ghosh et al. (2020) develop a backfitting algorithm that reduces the cost to $O(N)$. Here we extend that method to a generalized linear mixed model for logistic regression. We use backfitting within an iteratively reweighted penalized least squares algorithm. The specific approach is a version of penalized quasi-likelihood due to Schall (1991). A straightforward version of Schall's algorithm would also cost more than $N^{3/2}$ because it requires the trace of the inverse of a large matrix. We approximate that quantity at cost $O(N)$ and prove that this substitution makes an asymptotically negligible difference. Our backfitting algorithm also collapses the fixed effect with one random effect at a time in a way that is analogous to the collapsed Gibbs sampler of Papaspiliopoulos et al. (2020). We use a symmetric operator that facilitates efficient covariance computation. We illustrate our method on a real dataset from Stitch Fix. By properly accounting for crossed random effects we show that a naive logistic regression could underestimate sampling variances by several hundredfold.
- Published
- 2021
8. Efficient estimation of the ANOVA mean dimension, with an application to neural net classification
- Author
Christopher R. Hoyt and Art B. Owen
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Applied Mathematics; Numerical Analysis (math.NA); Machine Learning (cs.LG); Global sensitivity analysis; Analysis of variance; Mean dimension; Artificial neural networks
- Abstract
The mean dimension of a black box function of $d$ variables is a convenient way to summarize the extent to which it is dominated by high or low order interactions. It is expressed in terms of $2^d-1$ variance components but it can be written as the sum of $d$ Sobol' indices that can be estimated by leave one out methods. We compare the variance of these leave one out methods: a Gibbs sampler called winding stairs, a radial sampler that changes each variable one at a time from a baseline, and a naive sampler that never reuses function evaluations and so costs about double the other methods. For an additive function the radial and winding stairs are most efficient. For a multiplicative function the naive method can easily be most efficient if the factors have high kurtosis. As an illustration we consider the mean dimension of a neural network classifier of digits from the MNIST data set. The classifier is a function of $784$ pixels. For that problem, winding stairs is the best algorithm. We find that inputs to the final softmax layer have mean dimensions ranging from $1.35$ to $2.0$.
- Published
- 2020
- Full Text
- View/download PDF
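The identity behind the abstract above, that the mean dimension equals the sum of unnormalized total Sobol' indices divided by the variance, can be estimated with a radial-style sampler that changes one variable at a time. The sketch below is an illustrative assumption-laden implementation, not the paper's winding stairs or radial code; `mean_dimension` is a hypothetical name.

```python
import numpy as np

def mean_dimension(f, d, n=2**14, seed=0):
    """Monte Carlo estimate of the ANOVA mean dimension of f on [0,1]^d.

    Uses  nu(f) = sum_j tau_j^2 / sigma^2  with the Jansen-type estimator
    tau_j^2 ~= 0.5 * E[(f(x) - f(x with coordinate j resampled))^2].
    """
    rng = np.random.default_rng(seed)
    x = rng.random((n, d))
    z = rng.random((n, d))
    fx = f(x)
    sigma2 = fx.var()
    total = 0.0
    for j in range(d):
        xj = x.copy()
        xj[:, j] = z[:, j]                     # resample only coordinate j
        total += 0.5 * np.mean((fx - f(xj)) ** 2)
    return total / sigma2

# Additive functions have mean dimension 1; a product function is higher.
print(mean_dimension(lambda x: x.sum(axis=1), d=8))             # ~ 1
print(mean_dimension(lambda x: np.prod(1 + x - 0.5, axis=1), d=8))
```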
9. Estimation and inference for very large linear mixed effects models
- Author
Art B. Owen and Katelyn Gao
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Random effects models; Method of moments; Variance components; Generalized linear mixed models; Linear regression; Consistency; Asymptotic distribution
- Abstract
Linear mixed models with large imbalanced crossed random effects structures pose severe computational problems for maximum likelihood estimation and for Bayesian analysis. The costs can grow as fast as $N^{3/2}$ when there are $N$ observations. Such problems arise in any setting where the underlying factors satisfy a many-to-many relationship (instead of a nested one), and in electronic commerce applications $N$ can be quite large. Methods that do not account for the correlation structure can greatly underestimate uncertainty. We propose a method of moments approach that takes account of the correlation structure and that can be computed at O(N) cost. The method of moments is very amenable to parallel computation and it does not require parametric distributional assumptions, tuning parameters or convergence diagnostics. For the regression coefficients, we give conditions for consistency and asymptotic normality as well as a consistent variance estimate. For the variance components, we give conditions for consistency and we provide a consistent, mildly conservative variance estimate. All of these computations can be done in O(N) work. We illustrate the algorithm with some data from Stitch Fix where the crossed random effects correspond to clients and items.
- Published
- 2020
- Full Text
- View/download PDF
10. On Shapley Value for Measuring Importance of Dependent Inputs
- Author
Art B. Owen and Clémentine Prieur
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Applied Mathematics; Modeling and Simulation; Statistics Theory (math.ST); Numerical Analysis (math.NA); Shapley value; Importance measures; Dependent inputs; ANOVA decomposition
- Abstract
This paper makes the case for using Shapley value to quantify the importance of random input variables to a function. Alternatives based on the ANOVA decomposition can run into conceptual and computational problems when the input variables are dependent. Our main goal here is to show that Shapley value removes the conceptual problems. We do this with some simple examples where Shapley value leads to intuitively reasonable nearly closed form values.
- Published
- 2017
- Full Text
- View/download PDF
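For intuition, here is a tiny exhaustive Shapley computation applied to the kind of nearly closed-form example the abstract mentions: f(X) = X1 + X2 with correlated Gaussian inputs and value function val(u) = Var(E[f | X_u]). The code is an illustrative sketch, not from the paper; `shapley` and the toy value function are assumptions.

```python
from itertools import combinations
from math import factorial

def shapley(d, val):
    """Exact Shapley values for players 0..d-1 and a set function val(frozenset)."""
    phi = [0.0] * d
    for j in range(d):
        others = [k for k in range(d) if k != j]
        for m in range(d):
            for u in combinations(others, m):
                w = factorial(m) * factorial(d - m - 1) / factorial(d)
                u = frozenset(u)
                phi[j] += w * (val(u | {j}) - val(u))
    return phi

# f(X) = X1 + X2, (X1, X2) standard Gaussian with correlation rho.
# val(u) = Var(E[f | X_u]); each variable should get half the total variance.
rho = 0.7
val = {frozenset(): 0.0,
       frozenset({0}): (1 + rho) ** 2,
       frozenset({1}): (1 + rho) ** 2,
       frozenset({0, 1}): 2 + 2 * rho}
print(shapley(2, val.__getitem__))   # each share ~1.7, summing to Var(f) = 3.4
```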
11. Comment: Unreasonable Effectiveness of Monte Carlo
- Author
Art B. Owen
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; General Mathematics; Monte Carlo; quasi-Monte Carlo; Probabilistic numerics; Numerical integration; Central limit theorem; Gaussian processes; Pseudorandom number generators
- Abstract
There is a role for statistical computation in numerical integration. However, the competition from incumbent methods looks to be stiffer for this problem than for some of the newer problems being handled by probabilistic numerics. One of the challenges is the unreasonable effectiveness of the central limit theorem. Another is the unreasonable effectiveness of pseudorandom number generators. A third is the common $O(n^3)$ cost of methods based on Gaussian processes. Despite these advantages, the classical methods are weak in places where probabilistic methods could bring an improvement.
- Published
- 2019
- Full Text
- View/download PDF
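The "unreasonable effectiveness of the central limit theorem" referred to above is simply that plain Monte Carlo comes with a nearly free, asymptotically valid error bar. A minimal illustration (not from the paper):

```python
import numpy as np

# Plain Monte Carlo with a CLT-based error estimate for I = integral of exp(x) on [0,1].
rng = np.random.default_rng(0)
n = 10**5
fx = np.exp(rng.random(n))
est = fx.mean()
stderr = fx.std(ddof=1) / np.sqrt(n)           # CLT gives a ~95% interval for free
print(est, "+/-", 1.96 * stderr, "true:", np.e - 1)
```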
12. Permutation $p$-value approximation via generalized Stolarsky invariance
- Author
Qingyuan Zhao, Hera Yu He, Art B. Owen, and Kinjal Basu
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Statistics Theory (math.ST); Combinatorics (math.CO); Hypothesis testing; Permutation tests; p-values; Gene sets; quasi-Monte Carlo; Discrepancy; Invariance principle
- Abstract
It is common for genomic data analysis to use $p$-values from a large number of permutation tests. The multiplicity of tests may require very tiny $p$-values in order to reject any null hypotheses and the common practice of using randomly sampled permutations then becomes very expensive. We propose an inexpensive approximation to $p$-values for two sample linear test statistics, derived from Stolarsky’s invariance principle. The method creates a geometrically derived reference set of approximate $p$-values for each hypothesis. The average of that set is used as a point estimate $\hat{p}$ and our generalization of the invariance principle allows us to compute the variance of the $p$-values in that set. We find that in cases where the point estimate is small, the variance is a modest multiple of the square of that point estimate, yielding a relative error property similar to that of saddlepoint approximations. On a Parkinson’s disease data set, the new approximation is faster and more accurate than the saddlepoint approximation. We also obtain a simple probabilistic explanation of Stolarsky’s invariance principle.
- Published
- 2019
- Full Text
- View/download PDF
13. Density estimation by Randomized Quasi-Monte Carlo
- Author
Amal Ben Abdellah, Florian Puchhammer, Pierre L'Ecuyer, and Art B. Owen
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Applied Mathematics; Modeling and Simulation; Statistics Theory (math.ST); Density estimation; Kernel density estimation; Randomized quasi-Monte Carlo; Stratified sampling; Variance reduction; Monte Carlo
- Abstract
We consider the problem of estimating the density of a random variable $X$ that can be sampled exactly by Monte Carlo (MC). We investigate the effectiveness of replacing MC by randomized quasi Monte Carlo (RQMC) or by stratified sampling over the unit cube, to reduce the integrated variance (IV) and the mean integrated square error (MISE) for kernel density estimators. We show theoretically and empirically that the RQMC and stratified estimators can achieve substantial reductions of the IV and the MISE, and even faster convergence rates than MC in some situations, while leaving the bias unchanged. We also show that the variance bounds obtained via a traditional Koksma-Hlawka-type inequality for RQMC are much too loose to be useful when the dimension of the problem exceeds a few units. We describe an alternative way to estimate the IV, a good bandwidth, and the MISE, under RQMC or stratification, and we show empirically that in some situations, the MISE can be reduced significantly even in high-dimensional settings.
- Published
- 2018
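A rough sketch of the MC versus RQMC comparison described above for a kernel density estimator, using scrambled Sobol' points from scipy.stats.qmc and inversion of the CDF. The model, bandwidth, helper names (`kde`, `density_estimates`), and replication scheme are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np
from scipy.stats import norm, qmc

def kde(points, sample, h):
    # Gaussian kernel density estimate of the sample, evaluated at `points`.
    z = (points[:, None] - sample[None, :]) / h
    return norm.pdf(z).mean(axis=1) / h

def density_estimates(n=2**10, reps=50, h=0.15):
    # X = exp(Z), Z ~ N(0,1), sampled by inverting the CDF at MC or RQMC points.
    x0 = np.array([1.0])                         # evaluation point for the density
    mc, rqmc = [], []
    rng = np.random.default_rng(0)
    for r in range(reps):
        u_mc = rng.random(n)
        u_rq = qmc.Sobol(d=1, scramble=True, seed=r).random(n).ravel()
        u_rq = np.clip(u_rq, 1e-12, 1 - 1e-12)   # guard the inverse CDF
        mc.append(kde(x0, np.exp(norm.ppf(u_mc)), h)[0])
        rqmc.append(kde(x0, np.exp(norm.ppf(u_rq)), h)[0])
    return np.var(mc), np.var(rqmc)

v_mc, v_rqmc = density_estimates()
print("variance ratio MC/RQMC at x=1:", v_mc / v_rqmc)
```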
14. Deterministic parallel analysis: An improved method for selecting factors and principal components
- Author
Edgar Dobriban and Art B. Owen
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Methodology (stat.ME); Factor analysis; Principal component analysis; Parallel analysis
- Abstract
Factor analysis and principal component analysis (PCA) are used in many application areas. The first step, choosing the number of components, remains a serious challenge. Our work proposes improved methods for this important problem. One of the most popular state-of-the-art methods is Parallel Analysis (PA), which compares the observed factor strengths to simulated ones under a noise-only model. This paper proposes improvements to PA. We first de-randomize it, proposing Deterministic Parallel Analysis (DPA), which is faster and more reproducible than PA. Both PA and DPA are prone to a shadowing phenomenon in which a strong factor makes it hard to detect smaller but more interesting factors. We propose deflation to counter shadowing. We also propose to raise the decision threshold to improve estimation accuracy. We prove several consistency results for our methods, and test them in simulations. We also illustrate our methods on data from the Human Genome Diversity Project, where they significantly improve the accuracy.
- Published
- 2017
15. Importance sampling the union of rare events with an application to power systems analysis
- Author
Art B. Owen, Michael Chertkov, and Yury Maximov
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Numerical Analysis (math.NA); Computation (stat.CO); Importance sampling; Rare events; Coefficient of variation; Sampling; Power systems
- Abstract
We consider importance sampling to estimate the probability $\mu$ of a union of $J$ rare events $H_{j}$ defined by a random variable $\boldsymbol{x}$. The sampler we study has been used in spatial statistics, genomics and combinatorics going back at least to Karp and Luby (1983). It works by sampling one event at random, then sampling $\boldsymbol{x}$ conditionally on that event happening and it constructs an unbiased estimate of $\mu$ by multiplying an inverse moment of the number of occurring events by the union bound. We prove some variance bounds for this sampler. For a sample size of $n$, it has a variance no larger than $\mu(\bar{\mu}-\mu)/n$ where $\bar{\mu}$ is the union bound. It also has a coefficient of variation no larger than $\sqrt{(J+J^{-1}-2)/(4n)}$ regardless of the overlap pattern among the $J$ events. Our motivating problem comes from power system reliability, where the phase differences between connected nodes have a joint Gaussian distribution and the $J$ rare events arise from unacceptably large phase differences. In the grid reliability problems even some events defined by $5772$ constraints in $326$ dimensions, with probability below $10^{-22}$, are estimated with a coefficient of variation of about $0.0024$ with only $n=10{,}000$ sample values.
- Published
- 2017
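A small self-contained sketch of the union-of-rare-events sampler described above, in a toy setting where the events are coordinate exceedances of an independent Gaussian vector, so the answer is known exactly. Names such as `union_prob_is` are illustrative; the paper's power-systems events are correlated half-spaces rather than this toy case.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def union_prob_is(t, d, n, seed=0):
    """Union-of-rare-events importance sampler, Karp-Luby style as in the abstract.

    Toy setting: x ~ N(0, I_d) and H_j = {x_j > t}, so all p_j = P(H_j) are equal.
    Sample one event J, draw x conditionally on H_J, and average mu_bar / S(x),
    where S(x) counts how many events occur and mu_bar is the union bound.
    """
    rng = np.random.default_rng(seed)
    p = norm.sf(t)
    mu_bar = d * p                                   # union bound
    J = rng.integers(0, d, size=n)                   # equal p_j => uniform choice
    x = rng.standard_normal((n, d))
    x[np.arange(n), J] = truncnorm.rvs(t, np.inf, size=n, random_state=rng)
    S = (x > t).sum(axis=1)                          # at least 1 by construction
    return mu_bar * np.mean(1.0 / S)

t, d = 4.0, 20
est = union_prob_is(t, d, n=10_000)
exact = 1 - (1 - norm.sf(t)) ** d                    # independent coordinates
print(est, exact)
```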
16. Higher order Sobol' indices
- Author
Su Chen, Josef Dick, and Art B. Owen
- Subjects
Statistics and Probability; Numerical Analysis; Applied Mathematics; Computational Theory and Mathematics; Sobol' indices; ANOVA decomposition; Monte Carlo; quasi-Monte Carlo
- Abstract
Sobol' indices measure the dependence of a high dimensional function on groups of variables defined on the unit cube $[0,1]^d$. They are based on the ANOVA decomposition of functions, which is an $L^2$ decomposition. In this paper we discuss generalizations of Sobol' indices which yield $L^p$ measures of the dependence of $f$ on subsets of variables. Our interest is in values $p>2$ because then variable importance becomes more about reaching the extremes of $f$. We introduce two methods. One based on higher order moments of the ANOVA terms and another based on higher order norms of a spectral decomposition of $f$, including Fourier and Haar variants. Both of our generalizations have representations as integrals over $[0,1]^{kd}$ for $k\ge 1$, allowing direct Monte Carlo or quasi-Monte Carlo estimation. We find that they are sensitive to different aspects of $f$, and thus quantify different notions of variable importance.
- Published
- 2014
- Full Text
- View/download PDF
17. Correct Ordering in the Zipf–Poisson Ensemble
- Author
Justin S. Dyer and Art B. Owen
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Zipf's law; Poisson distribution; Skellam distribution; Power laws; Count data; Rankings
- Abstract
Rankings based on counts are often presented to identify popular items, such as baby names, English words, or Web sites. This article shows that, in some examples, the number of correctly identified items can be very small. We introduce a standard error versus rank plot to diagnose possible misrankings. Then to explain the slowly growing number of correct ranks, we model the entire set of count data via a Zipf–Poisson ensemble with independent $X_i \sim \mathrm{Poi}(N i^{-\alpha})$ for $\alpha > 1$, $N > 0$ and integers $i \ge 1$. We show that as $N \to \infty$, the first $n'(N)$ random variables have their proper order relative to each other, with probability tending to 1, for $n'$ up to $(AN/\log N)^{1/(\alpha+2)}$ with $A = \alpha^2(\alpha+2)/4$. We also show that the rate $N^{1/(\alpha+2)}$ cannot be achieved. The ordering of the first $n'(N)$ entities does not preclude $X_m > X_{n'}$ for some interloping $m > n'$. However, we show that the first $n''$ random variables are correctly ordered exclusive of any interlopers, with probability tending to 1 if $n'' \le (BN/\log N)^{1/(\alpha+2)}$ for any $B < A$....
- Published
- 2012
- Full Text
- View/download PDF
18. Bi-Cross-Validation for Factor Analysis
- Author
Art B. Owen and Jingshu Wang
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; General Mathematics; Factor analysis; Cross-validation; Random matrix theory; Scree plot; Parallel analysis; Heteroscedasticity; Early stopping; Unwanted variation; Sample size
- Abstract
Factor analysis is over a century old, but it is still problematic to choose the number of factors for a given data set. We provide a systematic review of current methods and then introduce a method based on bi-cross-validation, using randomly held-out submatrices of the data to choose the optimal number of factors. We find it performs better than many existing methods especially when both the number of variables and the sample size are large and some of the factors are relatively weak. Our performance criterion is based on recovery of an underlying signal, equal to the product of the usual factor and loading matrices. Like previous comparisons, our work is simulation based. Recent advances in random matrix theory provide principled choices for the number of factors when the noise is homoscedastic, but not for the heteroscedastic case. The simulations we chose are designed using guidance from random matrix theory. In particular, we include factors which are asymptotically too small to detect, factors large enough to detect but not large enough to improve the estimate, and two classes of factors (weak and strong) large enough to be useful. We also find that a form of early stopping regularization improves the recovery of the signal matrix.
- Published
- 2016
19. Efficient moment calculations for variance components in large unbalanced crossed random effects models
- Author
Katelyn Gao and Art B. Owen
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Methodology (stat.ME); Computation (stat.CO); Crossed random effects; Variance components; Method of moments; Generalized linear mixed models; Markov chain Monte Carlo; Big data
- Abstract
Large crossed data sets, often modeled by generalized linear mixed models, have become increasingly common and provide challenges for statistical analysis. At very large sizes it becomes desirable to have the computational costs of estimation, inference and prediction (both space and time) grow at most linearly with sample size. Both traditional maximum likelihood estimation and numerous Markov chain Monte Carlo Bayesian algorithms take superlinear time in order to obtain good parameter estimates in the simple two-factor crossed random effects model. We propose moment based algorithms that, with at most linear cost, estimate variance components, measure the uncertainties of those estimates, and generate shrinkage based predictions for missing observations. When run on simulated normally distributed data, our algorithm performs competitively with maximum likelihood methods.
- Published
- 2016
20. Detecting Multiple Replicating Signals using Adaptive Filtering Procedures
- Author
Jingshu Wang, Lin Gui, Weijie J. Su, Chiara Sabatti, and Art B. Owen
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Methodology (stat.ME)
- Abstract
Replicability is a fundamental quality of scientific discoveries: we are interested in those signals that are detectable in different laboratories, study populations, across time etc. Unlike meta-analysis which accounts for experimental variability but does not guarantee replicability, testing a partial conjunction (PC) null aims specifically to identify the signals that are discovered in multiple studies. In many contemporary applications, e.g., comparing multiple high-throughput genetic experiments, a large number $M$ of PC nulls need to be tested simultaneously, calling for a multiple comparisons correction. However, standard multiple testing adjustments on the $M$ PC $p$-values can be severely conservative, especially when $M$ is large and the signals are sparse. We introduce AdaFilter, a new multiple testing procedure that increases power by adaptively filtering out unlikely candidates of PC nulls. We prove that AdaFilter can control FWER and FDR as long as data across studies are independent, and has much higher power than other existing methods. We illustrate the application of AdaFilter with three examples: microarray studies of Duchenne muscular dystrophy, single-cell RNA sequencing of T cells in lung cancer tumors and GWAS for metabolomics.
- Published
- 2016
- Full Text
- View/download PDF
21. Statistically efficient thinning of a Markov chain sampler
- Author
Art B. Owen
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Computation (stat.CO); Machine Learning (stat.ML); Markov chain Monte Carlo; Thinning; Autocorrelation; Autoregressive models; Statistical efficiency
- Abstract
It is common to subsample Markov chain output to reduce the storage burden. Geyer (1992) shows that discarding $k-1$ out of every $k$ observations will not improve statistical efficiency, as quantified through variance in a given computational budget. That observation is often taken to mean that thinning MCMC output cannot improve statistical efficiency. Here we suppose that it costs one unit of time to advance a Markov chain and then $\theta>0$ units of time to compute a sampled quantity of interest. For a thinned process, that cost $\theta$ is incurred less often, so it can be advanced through more stages. Here we provide examples to show that thinning will improve statistical efficiency if $\theta$ is large and the sample autocorrelations decay slowly enough. If the lag $\ell\ge1$ autocorrelations of a scalar measurement satisfy $\rho_\ell\ge\rho_{\ell+1}\ge0$, then there is always a $\theta<\infty$ at which thinning becomes statistically efficient. For autocorrelations $\rho_\ell=\rho^\ell$ with $\rho>0$, taking every observation (no thinning) is optimal if and only if $\theta \le (1-\rho)^2/(2\rho)$. This efficiency gain never exceeds $1+\theta$. This paper also gives efficiency bounds for autocorrelations bounded between those of two AR(1) processes.
- Published
- 2015
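Under the abstract's timing model, an AR(1)-style autocorrelation ρ^ℓ gives (as an assumption of this sketch, not a formula quoted from the paper) asymptotic variance per unit budget proportional to (k + θ)(1 + ρ^k)/(1 − ρ^k) when every k-th draw is kept. The snippet below searches for the best k and checks the θ ≤ (1 − ρ)²/(2ρ) criterion numerically; the helper names are illustrative.

```python
import numpy as np

def var_per_budget(k, theta, rho):
    # Asymptotic variance per unit time budget when keeping every k-th draw of a
    # chain with autocorrelations rho**l and per-sample evaluation cost theta.
    return (k + theta) * (1 + rho**k) / (1 - rho**k)

def best_thinning(theta, rho, kmax=10**5):
    k = np.arange(1, kmax + 1)
    c = var_per_budget(k, theta, rho)
    j = int(np.argmin(c))
    return int(k[j]), float(c[0] / c[j])          # optimal k, efficiency gain vs k=1

for theta, rho in [(0.1, 0.5), (5.0, 0.9), (20.0, 0.99)]:
    k, gain = best_thinning(theta, rho)
    no_thin_opt = theta <= (1 - rho) ** 2 / (2 * rho)
    print(f"theta={theta}, rho={rho}: k*={k}, gain={gain:.3f}, k=1 optimal? {no_thin_opt}")
```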
22. Admissibility in Partial Conjunction Testing
- Author
Art B. Owen and Jingshu Wang
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Statistics Theory (math.ST); Methodology (stat.ME); Meta-analysis; Partial conjunction; Multiple testing; Null hypothesis; Subgroup analysis
- Abstract
Meta-analysis combines results from multiple studies aiming to increase power in finding their common effect. It would typically reject the null hypothesis of no effect if any one of the studies shows strong significance. The partial conjunction null hypothesis is rejected only when at least $r$ of $n$ component hypotheses are non-null, with $r = 1$ corresponding to a usual meta-analysis. Compared with meta-analysis, it can encourage replicable findings across studies. A by-product of applying it at different values of $r$ is a confidence interval for $r$ quantifying the proportion of non-null studies. Benjamini and Heller (2008) provided a valid test for the partial conjunction null by ignoring the $r - 1$ smallest p-values and applying a valid meta-analysis p-value to the remaining $n - r + 1$ p-values. We provide necessary and sufficient conditions for a combined p-value to be admissible for the partial conjunction hypothesis among monotone tests. Non-monotone tests always dominate monotone tests but are usually too unreasonable to be used in practice. Based on these findings, we propose a generalized form of Benjamini and Heller's test which allows the use of various types of meta-analysis p-values, and we apply our method to an example assessing the replicable benefit of new anticoagulants across subgroups of patients for stroke prevention.
- Published
- 2015
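The Benjamini and Heller construction described above is easy to sketch: drop the r − 1 smallest p-values and combine the rest with any valid meta-analysis combiner. The snippet below uses Fisher's method as one concrete, illustrative choice; `partial_conjunction_pvalue` is a hypothetical name, not code from the paper.

```python
import numpy as np
from scipy import stats

def partial_conjunction_pvalue(pvals, r):
    """Benjamini-Heller style partial conjunction p-value.

    Drop the r-1 smallest p-values and combine the remaining n-r+1 with a valid
    meta-analysis combiner; Fisher's method is used here as one concrete choice.
    """
    p = np.sort(np.asarray(pvals, dtype=float))[r - 1:]   # keep n - r + 1 largest
    stat = -2.0 * np.log(p).sum()
    return stats.chi2.sf(stat, df=2 * len(p))

p = [1e-6, 0.002, 0.03, 0.40]
print([round(partial_conjunction_pvalue(p, r), 4) for r in range(1, 5)])
```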
23. Single Nugget Kriging
- Author
Art B. Owen and Minyong R. Lee
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Methodology (stat.ME); Kriging; Conditional bias; Conditional likelihood; Mean squared error; Covariance; Extreme values
- Abstract
We propose a method with better predictions at extreme values than the standard method of Kriging. We construct our predictor in two ways: by penalizing the mean squared error through the conditional bias and by penalizing the conditional likelihood at the target function value. Our predictor is robust to model mismatch in the covariance parameters, a desirable feature for computer simulations with a restricted number of data points. Applications to several functions show that our predictor is robust to non-Gaussianity of the function.
- Published
- 2015
24. Estimating Mean Dimensionality of Analysis of Variance Decompositions
- Author
Ruixue Liu and Art B. Owen
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Analysis of variance; Effective dimension; Sobol' indices; quasi-Monte Carlo; Unit cube; Weight functions; Extreme value theory
- Abstract
Analysis of variance (ANOVA) is now often applied to functions defined on the unit cube, where it serves as a tool for the exploratory analysis of functions. The mean dimension of a function, defined as a natural weighted combination of its ANOVA mean squares, provides one measure of how hard or easy it is to integrate the function by quasi-Monte Carlo sampling. This article presents some new identities relating the mean dimension, and some analogously defined higher moments, to the variance importance measures of I.M. Sobol. As a result, we are able to measure the mean dimension of certain functions arising in computational finance. We produce an unbiased and nonnegative estimate of the variance contribution of the highest-order interaction that avoids the cancellation problems of previous estimates. In an application to extreme value theory, we find that, among other things, the minimum of d independent U[0, 1] random variables has a mean dimension of 2(d + 1)/(d + 3).
- Published
- 2006
- Full Text
- View/download PDF
25. Variance of the Number of False Discoveries
- Author
Art B. Owen
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; False discovery rate; Multiple comparisons; Hypothesis testing; Test statistics; Null hypothesis; Law of large numbers; Expected value
- Abstract
In high throughput genomic work, a very large number d of hypotheses are tested based on n≪d data samples. The large number of tests necessitates an adjustment for false discoveries in which a true null hypothesis was rejected. The expected number of false discoveries is easy to obtain. Dependences between the hypothesis tests greatly affect the variance of the number of false discoveries. Assuming that the tests are independent gives an inadequate variance formula. The paper presents a variance formula that takes account of the correlations between test statistics. That formula involves $O(d^2)$ correlations, and so a naïve implementation has cost $O(nd^2)$. A method based on sampling pairs of tests allows the variance to be approximated at a cost that is independent of d.
- Published
- 2005
- Full Text
- View/download PDF
26. Confounder Adjustment in Multiple Hypothesis Testing
- Author
Jingshu Wang, Qingyuan Zhao, Art B. Owen, and Trevor Hastie
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Statistics Theory (math.ST); Methodology (stat.ME); Multiple hypothesis testing; False discovery rate; Confounder adjustment; Batch effects; Surrogate variable analysis; Robust regression; Empirical null; Unwanted variation; Covariates; Type I and type II errors
- Abstract
We consider large-scale studies in which thousands of significance tests are performed simultaneously. In some of these studies, the multiple testing procedure can be severely biased by latent confounding factors such as batch effects and unmeasured covariates that correlate with both primary variable(s) of interest (e.g. treatment variable, phenotype) and the outcome. Over the past decade, many statistical methods have been proposed to adjust for the confounders in hypothesis testing. We unify these methods in the same framework, generalize them to include multiple primary variables and multiple nuisance variables, and analyze their statistical properties. In particular, we provide theoretical guarantees for RUV-4 and LEAPP, which correspond to two different identification conditions in the framework: the first requires a set of "negative controls" that are known a priori to follow the null distribution; the second requires the true non-nulls to be sparse. Two different estimators which are based on RUV-4 and LEAPP are then applied to these two scenarios. We show that if the confounding factors are strong, the resulting estimators can be asymptotically as powerful as the oracle estimator which observes the latent confounding factors. For hypothesis testing, we show the asymptotic z-tests based on the estimators can control the type I error. Numerical experiments show that the false discovery rate is also controlled by the Benjamini-Hochberg procedure when the sample size is reasonably large.
- Published
- 2015
- Full Text
- View/download PDF
27. Optimal Multiple Testing Under a Gaussian Prior on the Effect Sizes
- Author
Edgar Dobriban, Kristen Fortney, Art B. Owen, and Stuart K. Kim
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Applied Mathematics; General Mathematics; Methodology (stat.ME); Multiple testing; Weighted Bonferroni method; p-value weighting; Prior probability; Genome-wide association studies; Nonconvex optimization
- Abstract
We develop a new method for large-scale frequentist multiple testing with Bayesian prior information. We find optimal $p$-value weights that maximize the average power of the weighted Bonferroni method. Due to the nonconvexity of the optimization problem, previous methods that account for uncertain prior information are suitable for only a small number of tests. For a Gaussian prior on the effect sizes, we give an efficient algorithm that is guaranteed to find the optimal weights nearly exactly. Our method can discover new loci in genome-wide association studies and compares favourably to competitors. An open-source implementation is available.
- Published
- 2015
- Full Text
- View/download PDF
28. Safe and Effective Importance Sampling
- Author
Art B. Owen and Yi Zhou
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Importance sampling; Control variates; Variance reduction; Monte Carlo; Delta method; Upper and lower bounds
- Abstract
We present two improvements on the technique of importance sampling. First, we show that importance sampling from a mixture of densities, using those densities as control variates, results in a useful upper bound on the asymptotic variance. That bound is a small multiple of the asymptotic variance of importance sampling from the best single component density. This allows one to benefit from the great variance reductions obtainable by importance sampling, while protecting against the equally great variance increases that might take the practitioner by surprise. The second improvement is to show how importance sampling from two or more densities can be used to approach a zero sampling variance even for integrands that take both positive and negative values.
- Published
- 2000
- Full Text
- View/download PDF
29. Scrambling Sobol' and Niederreiter–Xing Points
- Author
Art B. Owen
- Subjects
Statistics and Probability; General Mathematics; Applied Mathematics; Numerical Analysis; Algebra and Number Theory; Control and Optimization; quasi-Monte Carlo; Monte Carlo integration; Latin hypercube; Orthogonal array sampling; Wavelets; Multiresolution
- Abstract
Hybrids of equidistribution and Monte Carlo methods of integration can achieve the superior accuracy of the former while allowing the simple error estimation methods of the latter. In particular, randomized $(0,m,s)$-nets in base $b$ produce unbiased estimates of the integral, have a variance that tends to zero faster than $1/n$ for any square integrable integrand and have a variance that for finite $n$ is never more than $e \doteq 2.718$ times as large as the Monte Carlo variance. Lower bounds than $e$ are known for special cases. Some very important $(t,m,s)$-nets have $t>0$. The widely used Sobol' sequences are of this form, as are some recent and very promising nets due to Niederreiter and Xing. Much less is known about randomized versions of these nets, especially in $s>1$ dimensions. This paper shows that scrambled $(t,m,s)$-nets enjoy the same properties as scrambled $(0,m,s)$-nets, except the sampling variance is guaranteed only to be below $b^t[(b+1)/(b-1)]^s$ times the Monte Carlo variance for a least-favorable integrand and finite $n$.
- Published
- 1998
- Full Text
- View/download PDF
30. The sign of the logistic regression coefficient
- Author
Paul A. Roediger and Art B. Owen
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; General Mathematics; Statistics Theory (math.ST); Logistic regression; Maximum likelihood; Probit model; Design matrix; Separable predictors
- Abstract
Let Y be a binary random variable and X a scalar. Let $\hat\beta$ be the maximum likelihood estimate of the slope in a logistic regression of Y on X with intercept. Further let $\bar x_0$ and $\bar x_1$ be the average of sample x values for cases with y=0 and y=1, respectively. Then under a condition that rules out separable predictors, we show that sign($\hat\beta$) = sign($\bar x_1-\bar x_0$). More generally, if $x_i$ are vector valued then we show that $\hat\beta=0$ if and only if $\bar x_1=\bar x_0$. This holds for logistic regression and also for more general binary regressions with inverse link functions satisfying a log-concavity condition. Finally, when $\bar x_1\ne \bar x_0$ then the angle between $\hat\beta$ and $\bar x_1-\bar x_0$ is less than ninety degrees in binary regressions satisfying the log-concavity condition and the separation condition, when the design matrix has full rank.
- Published
- 2014
- Full Text
- View/download PDF
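The sign result above is easy to check numerically with a tiny Newton/IRLS logistic fit. The sketch below assumes non-separable simulated data; `logistic_fit` is an illustrative helper, not code from the paper.

```python
import numpy as np

def logistic_fit(x, y, iters=50):
    """Newton/IRLS fit of logit P(Y=1|x) = b0 + b1*x (no separation assumed)."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        H = X.T @ (X * (p * (1 - p))[:, None])      # Fisher information
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    return beta

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-(0.3 - 0.8 * x)))).astype(float)
b0, b1 = logistic_fit(x, y)
print(np.sign(b1) == np.sign(x[y == 1].mean() - x[y == 0].mean()))   # True
```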
31. Data enriched linear regression
- Author
Art B. Owen, Minghui Shi, and Aiyou Chen
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Methodology (stat.ME); Linear regression; Small area estimation; Stein shrinkage; Data fusion; Transfer learning
- Abstract
We present a linear regression method for predictions on a small data set making use of a second possibly biased data set that may be much larger. Our method fits linear regressions to the two data sets while penalizing the difference between predictions made by those two models. The resulting algorithm is a shrinkage method similar to those used in small area estimation. We find a Stein-type result for Gaussian responses: when the model has $5$ or more coefficients and $10$ or more error degrees of freedom, it becomes inadmissible to use only the small data set, no matter how large the bias is. We also present both plug-in and AICc-based methods to tune our penalty parameter. Most of our results use an $L_{2}$ penalty, but we obtain formulas for $L_{1}$ penalized estimates when the model is specialized to the location setting. Ordinary Stein shrinkage provides an inadmissibility result for only $3$ or more coefficients, but we find that our shrinkage method typically produces much lower squared errors in as few as $5$ or $10$ dimensions when the bias is small and essentially equivalent squared errors when the bias is large.
- Published
- 2013
- Full Text
- View/download PDF
32. Multiple hypothesis testing adjusted for latent variables, with an application to the AGEMAP gene expression data
- Author
Art B. Owen, Yunting Sun, and Nan Zhang
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Modeling and Simulation; Applications (stat.AP); Multiple hypothesis testing; Latent variables; Surrogate variable analysis; Empirical null; EIGENSTRAT
- Abstract
In high throughput settings we inspect a great many candidate variables (e.g., genes) searching for associations with a primary variable (e.g., a phenotype). High throughput hypothesis testing can be made difficult by the presence of systemic effects and other latent variables. It is well known that those variables alter the level of tests and induce correlations between tests. They also change the relative ordering of significance levels among hypotheses. Poor rankings lead to wasteful and ineffective follow-up studies. The problem becomes acute for latent variables that are correlated with the primary variable. We propose a two-stage analysis to counter the effects of latent variables on the ranking of hypotheses. Our method, called LEAPP, statistically isolates the latent variables from the primary one. In simulations, it gives better ordering of hypotheses than competing methods such as SVA and EIGENSTRAT. For an illustration, we turn to data from the AGEMAP study relating gene expression to age for 16 tissues in the mouse. LEAPP generates rankings with greater consistency across tissues than the rankings attained by the other methods.
- Published
- 2012
33. Bootstrapping data arrays of arbitrary order
- Author
Art B. Owen and Dean Eckles
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Modeling and Simulation; Applications (stat.AP); Computation (stat.CO); Methodology (stat.ME); Bootstrap; Resampling; Random effects; Relational data; Tensor data; Online bagging; Heteroscedasticity
- Abstract
In this paper we study a bootstrap strategy for estimating the variance of a mean taken over large multifactor crossed random effects data sets. We apply bootstrap reweighting independently to the levels of each factor, giving each observation the product of independently sampled factor weights. No exact bootstrap exists for this problem [McCullagh (2000) Bernoulli 6 285-301]. We show that the proposed bootstrap is mildly conservative, meaning biased toward overestimating the variance, under sufficient conditions that allow very unbalanced and heteroscedastic inputs. Earlier results for a resampling bootstrap only apply to two factors and use multinomial weights that are poorly suited to online computation. The proposed reweighting approach can be implemented in parallel and online settings. The results for this method apply to any number of factors. The method is illustrated using a 3 factor data set of comment lengths from Facebook.
- Published
- 2012
34. Variance components and generalized Sobol' indices
- Author
Art B. Owen
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Applied Mathematics; Modeling and Simulation; Numerical Analysis (math.NA); Computation (stat.CO); Methodology (stat.ME); Sobol' indices; Variance components; Computer experiments; Estimators
- Abstract
This paper introduces generalized Sobol' indices, compares strategies for their estimation, and makes a systematic search for efficient estimators. Of particular interest are contrasts, sums of squares and indices of bilinear form which allow a reduced number of function evaluations compared to alternatives. The bilinear framework includes some efficient estimators from Saltelli (2002) and Mauntz (2002) as well as some new estimators for specific variance components and mean dimensions. This paper also provides a bias corrected version of the estimator of Janon et al. (2012) and extends the bias correction to generalized Sobol' indices. Some numerical comparisons are given.
- Published
- 2012
- Full Text
- View/download PDF
35. Controlling Correlations in Latin Hypercube Samples
- Author
Art B. Owen
- Subjects
Statistics and Probability; Statistics, Probability and Uncertainty; Latin hypercube sampling; Monte Carlo integration; Numerical integration; Computer experiments; Correlation; Stratification
- Abstract
Monte Carlo integration is competitive for high-dimensional integrands. Latin hypercube sampling is a stratification technique that reduces the variance of the integral. Previous work has shown that the additive part of the integrand is integrated with error $O_p(n^{-1/2})$ under Latin hypercube sampling with n integrand evaluations. A bilinear part of the integrand is more accurately estimated if the sample correlations among input variables are negligible. Other authors have proposed an algorithm for controlling these correlations. We show that their method reduces the correlations by roughly a factor of 3 for 10 ≤ n ≤ 500. We propose a method that, based on simulations, appears to produce correlations of order $O_p(n^{-3/2})$. An analysis of the algorithm indicates that it cannot be expected to do better than $n^{-3/2}$.
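For orientation, here is a minimal sketch of plain Latin hypercube sampling and of the off-diagonal sample correlations it produces, which are roughly of order $n^{-1/2}$. The correlation-reduction algorithm analyzed in the paper is not reproduced here:
```python
import numpy as np

rng = np.random.default_rng(2)

def latin_hypercube(n, d, rng=rng):
    """Standard Latin hypercube sample of n points in [0,1)^d."""
    u = rng.random((n, d))
    perms = np.argsort(rng.random((n, d)), axis=0)   # independent random permutation per column
    return (perms + u) / n

def max_offdiag_corr(x):
    """Largest absolute sample correlation between distinct input columns."""
    c = np.corrcoef(x, rowvar=False)
    np.fill_diagonal(c, 0.0)
    return np.abs(c).max()

for n in (10, 100, 500):
    x = latin_hypercube(n, d=5)
    print(n, max_offdiag_corr(x))   # roughly O(n**-0.5) for plain LHS
```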
- Published
- 1994
- Full Text
- View/download PDF
36. Visualizing bivariate long-tailed data
- Author
-
Art B. Owen and Justin S. Dyer
- Subjects
Statistics and Probability ,preferential attachment ,Zipf's law ,91D30 ,Bivariate analysis ,Preferential attachment ,Power law ,Zipf–Mandelbrot ,90B15 ,Visualization ,Copula (probability theory) ,00A66 ,Copula ,bipartite preferential attachment ,Econometrics ,Bipartite graph ,bivariate Zipf ,Statistics, Probability and Uncertainty ,Marginal distribution ,Mathematics - Abstract
Variables in large data sets in biology or e-commerce often have a head, made up of very frequent values and a long tail of ever rarer values. Models such as the Zipf or Zipf–Mandelbrot provide a good description. The problem we address here is the visualization of two such long-tailed variables, as one might see in a bivariate Zipf context. We introduce a copula plot to display the joint behavior of such variables. The plot uses an empirical ordering of the data; we prove that this ordering is asymptotically accurate in a Zipf–Mandelbrot–Poisson model. We often see an association between entities at the head of one variable with those from the tail of the other. We present two generative models (saturation and bipartite preferential attachment) that show such qualitative behavior and we characterize the power law behavior of the marginal distributions in these models.
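A toy version of a copula-style plot can be made by ranking each variable by empirical frequency and plotting the rank pairs. The Zipf-distributed toy data and the log-rank scatter below are illustrative assumptions, not the paper's exact construction:
```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)

# Toy bivariate long-tailed data with some dependence between the margins.
n = 20000
a = rng.zipf(1.8, n)
b = (a + rng.zipf(1.8, n)) % 1000

def frequency_rank(x):
    """Rank each observation by the empirical frequency of its value (1 = most frequent)."""
    vals, counts = np.unique(x, return_counts=True)
    order = np.argsort(-counts)
    rank_of_val = {v: r + 1 for r, v in enumerate(vals[order])}
    return np.array([rank_of_val[v] for v in x])

ra, rb = frequency_rank(a), frequency_rank(b)
plt.scatter(np.log(ra), np.log(rb), s=2, alpha=0.2)
plt.xlabel("log frequency rank of variable A")
plt.ylabel("log frequency rank of variable B")
plt.title("Empirical copula-style plot for two long-tailed variables")
plt.show()
```
Head-tail association of the kind described in the abstract shows up as mass in the off-diagonal corners of such a plot.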
- Published
- 2011
37. Consistency of Markov chain quasi-Monte Carlo on continuous state spaces
- Author
-
Josef Dick, S. Chen, and Art B. Owen
- Subjects
65C05 ,Statistics and Probability ,Stationary distribution ,Markov chain ,Monte Carlo method ,Mathematics - Statistics Theory ,Markov chain Monte Carlo ,Statistics Theory (math.ST) ,Completely uniformly distributed ,symbols.namesake ,Consistency (statistics) ,65C40 ,symbols ,FOS: Mathematics ,Probability distribution ,Applied mathematics ,iterated function mappings ,26A42 ,Quasi-Monte Carlo method ,coupling ,62F15 ,Statistics, Probability and Uncertainty ,Random variable ,Mathematics - Abstract
The random numbers driving Markov chain Monte Carlo (MCMC) simulation are usually modeled as independent U(0,1) random variables. Tribble [Markov chain Monte Carlo algorithms using completely uniformly distributed driving sequences (2007) Stanford Univ.] reports substantial improvements when those random numbers are replaced by carefully balanced inputs from completely uniformly distributed sequences. The previous theoretical justification for using anything other than i.i.d. U(0,1) points shows consistency for estimated means, but only applies for discrete stationary distributions. We extend those results to some MCMC algorithms for continuous stationary distributions. The main motivation is the search for quasi-Monte Carlo versions of MCMC. As a side benefit, the results also establish consistency for the usual method of using pseudo-random numbers in place of random ones., Comment: Published in at http://dx.doi.org/10.1214/10-AOS831 the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
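One way to see where such driving sequences enter is to write the sampler so that all of its randomness comes from a single uniform stream. The sketch below does this for a random-walk Metropolis sampler; it is run with iid uniforms, and constructing a valid completely uniformly distributed replacement stream is exactly the paper's subject and is not shown:
```python
import numpy as np

def metropolis(logpi, x0, n_steps, driver, step=0.5):
    """Random-walk Metropolis in which every random input comes from driver(k),
    a function returning k numbers in [0,1). With iid U(0,1) draws this is
    ordinary MCMC; the same code accepts a CUD-style stream instead."""
    x = np.array(x0, dtype=float)
    d = x.size
    chain = np.empty((n_steps, d))
    lp = logpi(x)
    for t in range(n_steps):
        u = driver(d + 1)                       # d uniforms for the proposal, one for accept/reject
        prop = x + step * (2.0 * u[:d] - 1.0)   # uniform random-walk proposal
        lp_prop = logpi(prop)
        if np.log(u[d] + 1e-300) < lp_prop - lp:
            x, lp = prop, lp_prop
        chain[t] = x
    return chain

rng = np.random.default_rng(4)
iid_driver = lambda k: rng.random(k)            # the usual iid choice
logpi = lambda x: -0.5 * np.sum(x**2)           # standard normal target
chain = metropolis(logpi, [0.0, 0.0], 20000, iid_driver)
print(chain.mean(axis=0), chain.var(axis=0))
```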
- Published
- 2011
- Full Text
- View/download PDF
38. Outlier Detection Using Nonconvex Penalized Regression
- Author
-
Art B. Owen and Yiyuan She
- Subjects
FOS: Computer and information sciences ,Statistics and Probability ,Robust statistics ,02 engineering and technology ,01 natural sciences ,Statistics - Computation ,Machine Learning (cs.LG) ,Robust regression ,Methodology (stat.ME) ,010104 statistics & probability ,Bayesian information criterion ,Linear regression ,Statistics ,0202 electrical engineering, electronic engineering, information engineering ,0101 mathematics ,Computation (stat.CO) ,Statistics - Methodology ,Mathematics ,Regression analysis ,Thresholding ,Computer Science - Learning ,Outlier ,62F35, 62J07 ,020201 artificial intelligence & image processing ,Anomaly detection ,Statistics, Probability and Uncertainty ,Algorithm - Abstract
This paper studies the outlier detection problem from the point of view of penalized regressions. Our regression model adds one mean shift parameter for each of the $n$ data points. We then apply a regularization favoring a sparse vector of mean shift parameters. The usual $L_1$ penalty yields a convex criterion, but we find that it fails to deliver a robust estimator. The $L_1$ penalty corresponds to soft thresholding. We introduce a thresholding (denoted by $\Theta$) based iterative procedure for outlier detection ($\Theta$-IPOD). A version based on hard thresholding correctly identifies outliers on some hard test problems. We find that $\Theta$-IPOD is much faster than iteratively reweighted least squares for large data because each iteration costs at most $O(np)$ (and sometimes much less) avoiding an $O(np^2)$ least squares estimate. We describe the connection between $\Theta$-IPOD and $M$-estimators. Our proposed method has one tuning parameter with which to both identify outliers and estimate regression coefficients. A data-dependent choice can be made based on BIC. The tuned $\Theta$-IPOD shows outstanding performance in identifying outliers in various situations in comparison to other existing approaches. This methodology extends to high-dimensional modeling with $p\gg n$, if both the coefficient vector and the outlier pattern are sparse.
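A stripped-down version of the alternating scheme described above (least squares for the coefficients, hard thresholding of residuals for the mean-shift parameters) might look as follows; the toy data and the fixed threshold lam are illustrative, not the authors' tuned procedure:
```python
import numpy as np

def theta_ipod(X, y, lam, n_iter=100):
    """Simplified mean-shift outlier detection by alternating least squares and
    hard thresholding (in the spirit of Theta-IPOD; not the authors' exact code).

    Model: y = X beta + gamma + noise, with gamma sparse; nonzero gamma_i flag outliers.
    """
    n, p = X.shape
    gamma = np.zeros(n)
    for _ in range(n_iter):
        beta, *_ = np.linalg.lstsq(X, y - gamma, rcond=None)  # beta given current gamma
        resid = y - X @ beta
        gamma = np.where(np.abs(resid) > lam, resid, 0.0)     # hard thresholding of residuals
    return beta, gamma

rng = np.random.default_rng(5)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.5 * rng.normal(size=n)
y[:5] += 8.0                                                  # five gross outliers
beta_hat, gamma_hat = theta_ipod(X, y, lam=2.0)
print("flagged outliers:", np.nonzero(gamma_hat)[0])
```
In the paper the threshold is tuned (for example by BIC) rather than fixed in advance.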
- Published
- 2010
39. Empirical stationary correlations for semi-supervised learning on graphs
- Author
-
Art B. Owen, Justin S. Dyer, and Ya Xu
- Subjects
Statistics and Probability ,FOS: Computer and information sciences ,Covariance matrix ,Two-graph ,Estimator ,Semi-supervised learning ,Missing data ,Random walk ,Statistics - Applications ,random walk ,Kriging ,Modeling and Simulation ,Graph Laplacian ,kriging ,Applications (stat.AP) ,Statistics, Probability and Uncertainty ,Laplacian matrix ,pagerank ,Algorithm ,Mathematics - Abstract
In semi-supervised learning on graphs, response variables observed at one node are used to estimate missing values at other nodes. The methods exploit correlations between nearby nodes in the graph. In this paper we prove that many such proposals are equivalent to kriging predictors based on a fixed covariance matrix driven by the link structure of the graph. We then propose a data-driven estimator of the correlation structure that exploits patterns among the observed response values. By incorporating even a small fraction of observed covariation into the predictions, we are able to obtain much improved prediction on two graph data sets., Comment: Published in at http://dx.doi.org/10.1214/09-AOAS293 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
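A generic kriging predictor of this kind, with a covariance built from the graph Laplacian rather than estimated from the responses as in the paper, can be sketched as follows; the ridge and nugget constants are arbitrary smoothing choices:
```python
import numpy as np

def graph_krige(A, y_obs, obs_idx, ridge=1e-2, nugget=1e-6):
    """Predict responses at unlabeled nodes by kriging with a covariance matrix
    derived from the graph Laplacian (one common fixed choice; the paper instead
    estimates the covariance from the observed responses)."""
    n = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A                       # combinatorial Laplacian
    K = np.linalg.inv(L + ridge * np.eye(n))             # a smooth covariance over the graph
    mask = np.zeros(n, dtype=bool)
    mask[obs_idx] = True
    unobs_idx = np.where(~mask)[0]
    Koo = K[np.ix_(obs_idx, obs_idx)] + nugget * np.eye(len(obs_idx))
    Kuo = K[np.ix_(unobs_idx, obs_idx)]
    mu = y_obs.mean()
    pred = mu + Kuo @ np.linalg.solve(Koo, y_obs - mu)   # simple kriging with a plug-in constant mean
    return unobs_idx, pred

# Tiny example: a path graph with a smooth signal, half the nodes observed.
n = 20
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
signal = np.sin(np.linspace(0, np.pi, n))
obs_idx = np.arange(0, n, 2)
unobs, pred = graph_krige(A, signal[obs_idx], obs_idx)
print(np.max(np.abs(pred - signal[unobs])))
```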
- Published
- 2010
- Full Text
- View/download PDF
40. A Central Limit Theorem for Latin Hypercube Sampling
- Author
-
Art B. Owen
- Subjects
Statistics and Probability ,010102 general mathematics ,Control variates ,01 natural sciences ,One-way analysis of variance ,010104 statistics & probability ,Delta method ,Latin hypercube sampling ,Statistics ,Monte Carlo integration ,Variance reduction ,0101 mathematics ,Mathematics ,Central limit theorem ,Antithetic variates - Abstract
Latin hypercube sampling (LHS) is a technique for Monte Carlo integration, due to McKay, Conover and Beckman. M. Stein proved that LHS integrals have smaller variance than independent and identically distributed Monte Carlo integration, the extent of the variance reduction depending on the extent to which the integrand is additive. We extend Stein's work to prove a central limit theorem. Variance estimation methods for nonparametric regression can be adapted to provide $N^{1/2}$-consistent estimates of the asymptotic variance in LHS. Moreover the skewness can be estimated at this rate. The variance reduction may be explained in terms of certain control variates that cannot be directly measured. We also show how to combine control variates with LHS. Finally we show how these results lead to a frequentist approach to computer experimentation.
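The variance reduction and the approximate normality are easy to see numerically. A small sketch comparing iid Monte Carlo with LHS on a mostly additive integrand (the test function and sample sizes are chosen arbitrarily):
```python
import numpy as np

rng = np.random.default_rng(6)

def lhs(n, d, rng):
    """Latin hypercube sample of n points in [0,1)^d."""
    u = rng.random((n, d))
    perms = np.argsort(rng.random((n, d)), axis=0)
    return (perms + u) / n

# Mostly additive integrand: LHS removes the additive part of the error.
f = lambda x: np.sum(x, axis=1) + 0.1 * np.prod(np.sin(2 * np.pi * x), axis=1)

n, d, reps = 256, 4, 500
iid_means = np.array([f(rng.random((n, d))).mean() for _ in range(reps)])
lhs_means = np.array([f(lhs(n, d, rng)).mean() for _ in range(reps)])
print("iid MC variance:", iid_means.var(ddof=1))
print("LHS variance   :", lhs_means.var(ddof=1))
# A normal-quantile plot of lhs_means illustrates the central limit theorem proved in the paper.
```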
- Published
- 1992
- Full Text
- View/download PDF
41. Karl Pearson’s meta-analysis revisited
- Author
-
Art B. Owen
- Subjects
Statistics and Probability ,Admissibility ,Mathematical statistics ,Mathematics - Statistics Theory ,Statistics Theory (math.ST) ,65T99 ,52A01 ,fast Fourier transform ,Upper and lower bounds ,Meta-analysis ,hypothesis testing ,Statistics ,FOS: Mathematics ,Standard test ,62F03 ,Statistics, Probability and Uncertainty ,62C15 ,microarrays ,Random variable ,62F03, 62C15, 65T99, 52A01 (Primary) ,Karl pearson ,Mathematics ,Sign (mathematics) ,Statistical hypothesis testing - Abstract
This paper revisits a meta-analysis method proposed by Pearson [Biometrika 26 (1934) 425--442] and first used by David [Biometrika 26 (1934) 1--11]. It was thought to be inadmissible for over fifty years, dating back to a paper of Birnbaum [J. Amer. Statist. Assoc. 49 (1954) 559--574]. It turns out that the method Birnbaum analyzed is not the one that Pearson proposed. We show that Pearson's proposal is admissible. Because it is admissible, it has better power than the standard test of Fisher [Statistical Methods for Research Workers (1932) Oliver and Boyd] at some alternatives, and worse power at others. Pearson's method has the advantage when all or most of the nonzero parameters share the same sign. Pearson's test has proved useful in a genomic setting, screening for age-related genes. This paper also presents an FFT-based method for getting hard upper and lower bounds on the CDF of a sum of nonnegative random variables., Comment: Published in at http://dx.doi.org/10.1214/09-AOS697 the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
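The bounding idea for the CDF of a sum of nonnegative variables can be illustrated with a generic discretize-and-convolve scheme: rounding each summand down (up) to a lattice gives a sum that is stochastically smaller (larger) than the true one. The sketch below uses np.convolve for clarity where the paper uses an FFT, and the chi-squared example merely stands in for Fisher-type summands:
```python
import numpy as np
from scipy import stats

def cdf_bounds_of_sum(cdfs, s, delta=0.01, grid_max=60.0):
    """Hard lower/upper bounds on P(X_1 + ... + X_k <= s) for independent
    nonnegative X_i with the given marginal CDFs, by rounding each X_i
    down/up to a lattice of spacing delta and convolving the resulting PMFs."""
    m = int(grid_max / delta) + 2
    grid = np.arange(m) * delta
    pmf_down = np.array([1.0])
    pmf_up = np.array([1.0])
    for F in cdfs:
        Fg = F(grid)
        p_floor = np.diff(np.append(Fg, 1.0))      # P(floor(X/delta) = k); tail mass lumped at the end
        p_ceil = np.diff(np.insert(Fg, 0, 0.0))    # P(ceil(X/delta) = k); mass beyond the grid is dropped
        pmf_down = np.convolve(pmf_down, p_floor)  # rounded-down summands: stochastically <= the true sum
        pmf_up = np.convolve(pmf_up, p_ceil)       # rounded-up summands: stochastically >= the true sum
    k = int(np.floor(s / delta))
    lower = pmf_up[:k + 1].sum()                   # P(rounded-up sum <= s)   <= P(sum <= s)
    upper = pmf_down[:k + 1].sum()                 # P(sum <= s) <= P(rounded-down sum <= s)
    return lower, upper

# Example: a sum of three chi-squared(2) variables, whose exact CDF is chi-squared(6).
lo, hi = cdf_bounds_of_sum([stats.chi2(2).cdf] * 3, s=12.0)
print(lo, stats.chi2(6).cdf(12.0), hi)
```
Shrinking delta tightens the bracket at the cost of longer convolutions, which is where the FFT becomes useful.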
- Published
- 2009
- Full Text
- View/download PDF
42. Bi-cross-validation of the SVD and the nonnegative matrix factorization
- Author
-
Patrick O. Perry and Art B. Owen
- Subjects
Statistics and Probability ,FOS: Computer and information sciences ,Rank (linear algebra) ,sample reuse ,Cross-validation ,Low-rank approximation ,random matrix theory ,Statistics - Applications ,weak factor model ,Data matrix (multivariate statistics) ,Non-negative matrix factorization ,Combinatorics ,Modeling and Simulation ,Principal component analysis ,Singular value decomposition ,Applications (stat.AP) ,Statistics, Probability and Uncertainty ,Random matrix ,Row ,principal components ,Mathematics - Abstract
This article presents a form of bi-cross-validation (BCV) for choosing the rank in outer product models, especially the singular value decomposition (SVD) and the nonnegative matrix factorization (NMF). Instead of leaving out a set of rows of the data matrix, we leave out a set of rows and a set of columns, and then predict the left out entries by low rank operations on the retained data. We prove a self-consistency result expressing the prediction error as a residual from a low rank approximation. Random matrix theory and some empirical results suggest that smaller hold-out sets lead to more over-fitting, while larger ones are more prone to under-fitting. In simulated examples we find that a method leaving out half the rows and half the columns performs well., Published in at http://dx.doi.org/10.1214/08-AOAS227 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
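One version of the held-out-block prediction can be sketched directly: leave out a block of rows and columns, take a rank-k SVD of the fully retained block, and reconstruct the held-out corner from the two overlapping blocks. The partition and toy data below are illustrative:
```python
import numpy as np

def bcv_error(X, k, row_holdout, col_holdout):
    """Bi-cross-validation residual for rank k: predict the held-out block from
    the retained data via a rank-k SVD of the fully retained block (one version
    of the scheme described in the abstract)."""
    r = np.zeros(X.shape[0], dtype=bool); r[row_holdout] = True
    c = np.zeros(X.shape[1], dtype=bool); c[col_holdout] = True
    A = X[np.ix_(r, c)]        # held-out block
    B = X[np.ix_(r, ~c)]       # held-out rows, retained columns
    C = X[np.ix_(~r, c)]       # retained rows, held-out columns
    D = X[np.ix_(~r, ~c)]      # fully retained block
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    Dk_pinv = (Vt[:k].T / s[:k]) @ U[:, :k].T      # pseudo-inverse of the rank-k part of D
    A_hat = B @ Dk_pinv @ C                        # predicted held-out block
    return np.mean((A - A_hat) ** 2)

# Example: a rank-3 signal plus noise; the BCV error is typically smallest near rank 3.
rng = np.random.default_rng(7)
n, p, true_k = 60, 50, 3
X = rng.normal(size=(n, true_k)) @ rng.normal(size=(true_k, p)) + 0.3 * rng.normal(size=(n, p))
rows, cols = np.arange(30), np.arange(25)          # leave out half the rows and half the columns
for k in range(1, 7):
    print(k, round(bcv_error(X, k, rows, cols), 4))
```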
- Published
- 2009
43. Calibration of the empirical likelihood method for a vector mean
- Author
-
Sarah C. Emerson and Art B. Owen
- Subjects
Statistics and Probability ,Score test ,Empirical likelihood ,Statistical power ,nonparametric hypothesis testing ,Statistics ,Econometrics ,Test statistic ,62H15 ,p-value ,Statistics, Probability and Uncertainty ,multivariate hypothesis testing ,Statistic ,Mathematics ,Type I and type II errors ,Statistical hypothesis testing ,62G10 - Abstract
The empirical likelihood method is a versatile approach for testing hypotheses and constructing confidence regions in a non-parametric setting. For testing the value of a vector mean, the empirical likelihood method offers the benefit of making no distributional assumptions beyond some mild moment conditions. However, in small samples or high dimensions the method is very poorly calibrated, producing tests that generally have a much higher type I error than the nominal level, and it suffers from a limiting convex hull constraint. Methods to address the performance of the empirical likelihood in the vector mean setting have been proposed in a number of papers, including a contribution that suggests supplementing the observed dataset with an artificial data point. We examine the consequences of this approach and describe a limitation of that method which arises when the sample size is relatively small compared with the dimension. We propose a new modification to the extra data approach that involves adding two points and changing the location of the extra points. We explore the benefits that this modification offers, and show that it results in better calibration, particularly in difficult cases. This new approach also results in a small-sample connection between the modified empirical likelihood method and Hotelling’s T-square test. We show that varying the location of the added data points creates a continuum of tests that range from the unmodified empirical likelihood statistic to Hotelling’s T-square statistic.
- Published
- 2009
44. Recycling physical random numbers
- Author
-
Art B. Owen
- Subjects
Statistics and Probability ,Pairwise independence ,Small number ,Monte Carlo method ,Asymptotic distribution ,Markov chain Monte Carlo ,Hybrid Monte Carlo ,Combinatorics ,symbols.namesake ,symbols ,Monte Carlo integration ,Statistics, Probability and Uncertainty ,Mathematics ,Central limit theorem - Abstract
Physical random numbers are not as widely used in Monte Carlo integration as pseudo-random numbers are. They are inconvenient for many reasons. If we want to generate them on the fly, then they may be slow. When we want reproducible results from them, we need a lot of storage. This paper shows that we may construct N = n(n − 1)/2 pairwise independent random vectors from n independent ones, by summing them modulo 1 in pairs. As a consequence, the storage and speed problems of physical random numbers can be greatly mitigated. The new vectors lead to Monte Carlo averages with the same mean and variance as if we had used N independent vectors. The asymptotic distribution of the sample mean has a surprising feature: it is always symmetric, but never Gaussian. This follows by writing the sample mean as a degenerate U-statistic whose kernel is a left-circulant matrix. Because of the symmetry, a small number B of replicates can be used to get confidence intervals based on the central limit theorem.
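The construction is simple enough to state in a few lines: form all pairwise sums modulo 1 and average the integrand over them. A minimal sketch with a toy integrand and arbitrary n and d:
```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(8)

def recycled_mean(f, n, d, rng=rng):
    """Average f over the N = n*(n-1)/2 pairwise sums (mod 1) of n random vectors.

    The recycled points are pairwise independent, so the average has the same mean
    and variance as an average over N independent points."""
    u = rng.random((n, d))                                 # n "physical" random vectors
    pairs = np.array([(u[i] + u[j]) % 1.0 for i, j in combinations(range(n), 2)])
    return f(pairs).mean()

f = lambda x: np.prod(1.0 + 0.5 * (x - 0.5), axis=1)       # true integral over [0,1]^d is 1
print(recycled_mean(f, n=64, d=5))                         # uses 64*63/2 = 2016 evaluation points
```
Here pseudo-random numbers stand in for the physical ones; only the n base vectors would need to be stored for reproducibility.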
- Published
- 2009
- Full Text
- View/download PDF
45. A likelihood ratio test for the equality of proportions of two normal populations
- Author
-
Youn Min Chou and Donald B. Owen
- Subjects
Statistics and Probability ,Normal distribution ,Noncentral t-distribution ,Likelihood-ratio test ,Statistics ,Chi-square test ,Econometrics ,Sampling (statistics) ,ComputerApplications_COMPUTERSINOTHERSYSTEMS ,Limit (mathematics) ,Chi-squared distribution ,Mathematics ,Statistical hypothesis testing - Abstract
In sampling inspection by variables, an item is considered defective if its quality characteristic Y falls below some specification limit L0. We consider switching to a new supplier if we can be sure that the proportion of defective items for the new supplier is smaller than the proportion defective for the present supplier. Assume that Y has a normal distribution. A test for comparing these proportions is developed. A simulation study of the performance of the test is presented.
- Published
- 1991
- Full Text
- View/download PDF
46. Local antithetic sampling with scrambled nets
- Author
-
Art B. Owen
- Subjects
65C05 ,Statistics and Probability ,FOS: Computer and information sciences ,Mean squared error ,Monte Carlo method ,Word error rate ,Mathematics - Statistics Theory ,Statistics Theory (math.ST) ,Control variates ,Statistics - Computation ,Combinatorics ,Digital nets ,Statistics ,FOS: Mathematics ,65D32 ,randomized quasi-Monte Carlo ,Computation (stat.CO) ,Mathematics ,quasi-Monte Carlo ,Random function ,Sampling (statistics) ,68U20 ,Variance reduction ,monomial rules ,65C05 (Primary) ,Statistics, Probability and Uncertainty ,68U20, 65D32 (Secondary) ,Importance sampling - Abstract
We consider the problem of computing an approximation to the integral $I=\int_{[0,1]^d}f(x) dx$. Monte Carlo (MC) sampling typically attains a root mean squared error (RMSE) of $O(n^{-1/2})$ from $n$ independent random function evaluations. By contrast, quasi-Monte Carlo (QMC) sampling using carefully equispaced evaluation points can attain the rate $O(n^{-1+\varepsilon})$ for any $\varepsilon>0$ and randomized QMC (RQMC) can attain the RMSE $O(n^{-3/2+\varepsilon})$, both under mild conditions on $f$. Classical variance reduction methods for MC can be adapted to QMC. Published results combining QMC with importance sampling and with control variates have found worthwhile improvements, but no change in the error rate. This paper extends the classical variance reduction method of antithetic sampling and combines it with RQMC. One such method is shown to bring a modest improvement in the RMSE rate, attaining $O(n^{-3/2-1/d+\varepsilon})$ for any $\varepsilon>0$, for smooth enough $f$., Published in at http://dx.doi.org/10.1214/07-AOS548 the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
- Published
- 2008
- Full Text
- View/download PDF
47. Construction of weakly CUD sequences for MCMC sampling
- Author
-
Art B. Owen and Seth D. Tribble
- Subjects
Statistics and Probability ,FOS: Computer and information sciences ,62F15 (Primary) 11K45, 11K41 (Secondary) ,11K41 ,Mathematics - Statistics Theory ,010103 numerical & computational mathematics ,Statistics Theory (math.ST) ,01 natural sciences ,Statistics - Computation ,010104 statistics & probability ,symbols.namesake ,equidistribution ,Gibbs sampler ,FOS: Mathematics ,Almost surely ,Finite state ,Quasi-Monte Carlo method ,0101 mathematics ,Computation (stat.CO) ,Mathematics ,quasi-Monte Carlo ,Sequence ,Sampling (statistics) ,Ranging ,Markov chain Monte Carlo ,symbols ,completely uniformly distributed ,probit ,Statistics, Probability and Uncertainty ,62F15 ,11K45 ,Algorithm ,Gibbs sampling - Abstract
In Markov chain Monte Carlo (MCMC) sampling considerable thought goes into constructing random transitions. But those transitions are almost always driven by a simulated IID sequence. Recently it has been shown that replacing an IID sequence by a weakly completely uniformly distributed (WCUD) sequence leads to consistent estimation in finite state spaces. Unfortunately, few WCUD sequences are known. This paper gives general methods for proving that a sequence is WCUD, shows that some specific sequences are WCUD, and shows that certain operations on WCUD sequences yield new WCUD sequences. A numerical example on a 42-dimensional continuous Gibbs sampler found that some WCUD input sequences produced variance reductions ranging from tens to hundreds for posterior means of the parameters, compared to IID inputs., Comment: Published in at http://dx.doi.org/10.1214/07-EJS162 the Electronic Journal of Statistics (http://www.i-journals.org/ejs/) by the Institute of Mathematical Statistics (http://www.imstat.org)
- Published
- 2008
- Full Text
- View/download PDF
48. A study of a new process capability index
- Author
-
Byoung-Chul Choi and Donald B. Owen
- Subjects
Statistics and Probability ,Process variation ,Index (economics) ,Sample size determination ,Process capability ,Statistics ,Process (computing) ,Process capability index ,Process performance index ,Confidence interval ,Mathematics - Abstract
A new process capability index is proposed that takes into account the location of the process mean between the two specification limits, the proximity to the target value, and the process variation when assessing process performance. The proposed index is compared to other indices with respect to several properties. The proposed index is estimated based on a random sample of observations from the production process when the process is assumed to be normally distributed. The 95% lower confidence limits for the proposed index are derived for given sample sizes and estimated index values.
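The proposed index itself is not reproduced in the abstract; for context, the standard indices it would typically be compared against (Cp, Cpk and the target-aware Cpm) are computed below under the usual normality assumption, with illustrative specification limits:
```python
import numpy as np

def capability_indices(x, lsl, usl, target):
    """Standard process capability indices: Cp ignores centering, Cpk penalizes an
    off-center mean, and Cpm also penalizes distance from the target value."""
    mu, sigma = x.mean(), x.std(ddof=1)
    cp = (usl - lsl) / (6 * sigma)
    cpk = min(usl - mu, mu - lsl) / (3 * sigma)
    cpm = (usl - lsl) / (6 * np.sqrt(sigma**2 + (mu - target) ** 2))
    return cp, cpk, cpm

rng = np.random.default_rng(9)
x = rng.normal(loc=10.2, scale=0.5, size=100)          # process slightly off a target of 10
print(capability_indices(x, lsl=8.0, usl=12.0, target=10.0))
```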
- Published
- 1990
- Full Text
- View/download PDF
49. Lower Confidence Limits on Process Capability Indices Based on the Range
- Author
-
A. Borrego, Li Huaixiang, Donald B. Owen, and A. Salvador
- Subjects
Statistics and Probability ,Normal distribution ,Observational error ,Sampling distribution ,Modeling and Simulation ,Process capability ,Statistics ,Range (statistics) ,Econometrics ,Sample mean and sample covariance ,Standard deviation ,Confidence interval ,Mathematics - Abstract
The common measures of process capability, usually denoted Cp, CPU, CPL, and Cpk, are considered. Tables of lower confidence limits are given for these measures, in which the sample mean and the sample range (properly adjusted by d2) are substituted for the population mean and population standard deviation in the defining formulas.
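The substitution described above is easy to sketch: estimate the mean by the grand average and the standard deviation by the average subgroup range divided by the usual constant d2. The code gives point estimates only; the paper's tables turn such estimates into lower confidence limits:
```python
import numpy as np

# Bias-correction constants d2 for the mean range of normal samples of size m.
D2 = {2: 1.128, 3: 1.693, 4: 2.059, 5: 2.326}

def cp_cpk_from_ranges(subgroups, lsl, usl):
    """Point estimates of Cp and Cpk using the grand mean for mu and R-bar/d2 for sigma."""
    subgroups = np.asarray(subgroups, dtype=float)
    m = subgroups.shape[1]
    xbar = subgroups.mean()
    rbar = (subgroups.max(axis=1) - subgroups.min(axis=1)).mean()
    sigma_hat = rbar / D2[m]
    cp = (usl - lsl) / (6 * sigma_hat)
    cpk = min(usl - xbar, xbar - lsl) / (3 * sigma_hat)
    return cp, cpk

rng = np.random.default_rng(10)
data = rng.normal(10.0, 0.4, size=(25, 5))             # 25 subgroups of size 5
print(cp_cpk_from_ranges(data, lsl=8.5, usl=11.5))
```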
- Published
- 1990
- Full Text
- View/download PDF
50. The pigeonhole bootstrap
- Author
-
Art B. Owen
- Subjects
Statistics and Probability ,FOS: Computer and information sciences ,Heteroscedasticity ,Computer science ,Pigeonhole principle ,Collaborative filtering ,Sampling (statistics) ,recommenders ,Random effects model ,Statistics - Applications ,Stability (probability) ,resampling ,Consistency (statistics) ,Modeling and Simulation ,Resampling ,Bipartite graph ,Statistics::Methodology ,Applications (stat.AP) ,Statistics, Probability and Uncertainty ,Algorithm - Abstract
Recently there has been much interest in data that, in statistical language, may be described as having a large crossed and severely unbalanced random effects structure. Such data sets arise for recommender engines and information retrieval problems. Many large bipartite weighted graphs have this structure too. We would like to assess the stability of algorithms fit to such data. Even for linear statistics, a naive form of bootstrap sampling can be seriously misleading and McCullagh [Bernoulli 6 (2000) 285--301] has shown that no bootstrap method is exact. We show that an alternative bootstrap separately resampling rows and columns of the data matrix satisfies a mean consistency property even in heteroscedastic crossed unbalanced random effects models. This alternative does not require the user to fit a crossed random effects model to the data., Comment: Published in at http://dx.doi.org/10.1214/07-AOAS122 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
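A minimal sketch of the row-and-column resampling idea on toy ratings-style data: resample row ids and column ids independently with replacement and weight each observation by the product of the two multiplicities. Names and data are illustrative, not the paper's code:
```python
import numpy as np

rng = np.random.default_rng(11)

def pigeonhole_bootstrap_var(rows, cols, y, n_boot=500, rng=rng):
    """Variance estimate for mean(y) on (row, column)-indexed data by independently
    resampling row ids and column ids with replacement; each observation is weighted
    by (row multiplicity) x (column multiplicity)."""
    R, C = rows.max() + 1, cols.max() + 1
    means = np.empty(n_boot)
    for b in range(n_boot):
        r_count = np.bincount(rng.integers(0, R, R), minlength=R)   # row resample multiplicities
        c_count = np.bincount(rng.integers(0, C, C), minlength=C)   # column resample multiplicities
        w = r_count[rows] * c_count[cols]
        means[b] = np.sum(w * y) / np.sum(w)
    return means.var(ddof=1)

# Toy ratings-style data: 3000 observations on a 300 x 200 crossed layout.
n = 3000
rows = rng.integers(0, 300, n)
cols = rng.integers(0, 200, n)
y = rng.normal(size=n) + rng.normal(size=300)[rows] + rng.normal(size=200)[cols]
print("pigeonhole bootstrap variance of the mean:", pigeonhole_bootstrap_var(rows, cols, y))
```
No crossed random effects model needs to be fit; only the row and column labels of each observation are used.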
- Published
- 2007
- Full Text
- View/download PDF