Lee, Sang Kyu, Hong, Hyokyoung G., Weng, Haolei, and Loftfield, Erikka
Subjects
Statistics - Methodology
Abstract
We study the high-dimensional partial linear model, where the linear part has a high-dimensional sparse regression coefficient and the nonparametric part includes a function whose derivatives are of bounded total variation. We expand upon univariate trend filtering to develop partial linear trend filtering, a doubly penalized least squares estimation approach based on an $\ell_1$ penalty and a total variation penalty. Analogous to the advantages of trend filtering in univariate nonparametric regression, partial linear trend filtering not only can be computed efficiently, but also achieves the optimal error rate for estimating the nonparametric function. This in turn leads to the oracle rate for the linear part, as if the underlying nonparametric function were known. We compare the proposed approach with a standard smoothing spline based method, and show both empirically and theoretically that the former outperforms the latter when the underlying function possesses heterogeneous smoothness. We apply our approach to the IDATA study to investigate the relationship between metabolomic profiles and ultra-processed food (UPF) intake, efficiently identifying key metabolites associated with UPF consumption and demonstrating strong predictive performance., Comment: 41 pages, 5 figures
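As a concrete illustration of the estimator described above, the following is a minimal sketch of the doubly penalized least squares objective, written with numpy and cvxpy; the data generation, the simple difference operator (which ignores uneven spacing of the design points), and all tuning values are illustrative assumptions, not the authors' implementation.

import cvxpy as cp
import numpy as np

# Doubly penalized objective of partial linear trend filtering (sketch):
#   minimize (1/2)||y - X beta - f||_2^2 + lam1 ||beta||_1 + lam2 ||D f||_1,
# where D is the (k+1)-th order discrete difference operator on the sorted
# design points of the nonparametric component.
rng = np.random.default_rng(0)
n, p = 200, 500
X = rng.standard_normal((n, p))
z = np.sort(rng.uniform(0.0, 1.0, n))          # design points, sorted
beta0 = np.zeros(p); beta0[:5] = 1.0           # sparse linear coefficient
f0 = np.where(z < 0.5, 2 * z, 2 - 2 * z)       # piecewise linear truth
y = X @ beta0 + f0 + 0.5 * rng.standard_normal(n)

k = 1                                          # trend filtering order
D = np.diff(np.eye(n), n=k + 1, axis=0)        # discrete difference operator
beta, f = cp.Variable(p), cp.Variable(n)
lam1, lam2 = 0.1, 1.0                          # illustrative tuning parameters
obj = 0.5 * cp.sum_squares(y - X @ beta - f) + lam1 * cp.norm1(beta) \
      + lam2 * cp.norm1(D @ f)
cp.Problem(cp.Minimize(obj)).solve()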
Guo, Yilin, Ghosh, Shubhangi, Weng, Haolei, and Maleki, Arian
Subjects
Mathematics - Statistics Theory
Abstract
Sparse linear regression is one of the classical and extensively studied problems in high-dimensional statistics and compressed sensing. Despite the substantial body of literature dedicated to this problem, the precise determination of its minimax risk remains elusive. This paper aims to fill this gap by deriving a characterization of the minimax risk of sparse linear regression that is asymptotically exact, including the leading constant. More specifically, the paper focuses on scenarios where the sparsity level, denoted by $k$, satisfies the condition $(k \log p)/n \to 0$, with $p$ and $n$ representing the number of features and observations, respectively. We establish that the minimax risk under isotropic Gaussian random design is asymptotically equal to $\frac{2\sigma^2 k}{n}\log(p/k)$, where $\sigma$ denotes the standard deviation of the noise. In addition to this result, we summarize the existing results in the literature and mention some of the fundamental problems that remain open.
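In display form, the main result reads as follows; the $\ell_2$ loss and the exactly $k$-sparse parameter class are our reading of the abstract, not a quotation of the paper's theorem statement.

\[
\inf_{\hat{\beta}} \; \sup_{\|\beta\|_0 \le k} \mathbb{E}\,\bigl\|\hat{\beta} - \beta\bigr\|_2^2
= (1 + o(1))\, \frac{2\sigma^2 k}{n} \log\!\left(\frac{p}{k}\right),
\qquad \text{as } \frac{k \log p}{n} \to 0.
\]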
Pak, Daewoo, Zhang, Jianrui, Wu, Di, Weng, Haolei, and Li, Chenxi
Subjects
Statistics - Methodology
Abstract
We develop a set of variable selection methods for the Cox model under interval censoring, in the ultra-high dimensional setting where the dimensionality can grow exponentially with the sample size. The methods select covariates via penalized nonparametric maximum likelihood estimation with several popular penalty functions, including lasso, adaptive lasso, SCAD, and MCP. We prove that the penalized variable selection methods with folded concave penalties or the adaptive lasso penalty enjoy the oracle property. Extensive numerical experiments show that the proposed methods have satisfactory empirical performance under various scenarios. The utility of the methods is illustrated through an application to a genome-wide association study of age to early childhood caries.
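For reference, the two folded concave penalties named above have simple closed forms; the sketch below (numpy, with the conventional default parameters a = 3.7 for SCAD and gamma = 3 for MCP) evaluates them coordinate-wise and is independent of the paper's estimation procedure.

import numpy as np

def scad(t, lam, a=3.7):
    # SCAD penalty (Fan-Li): linear near zero, quadratic transition, then flat
    t = np.abs(t)
    return np.where(t <= lam, lam * t,
           np.where(t <= a * lam,
                    (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1)),
                    lam ** 2 * (a + 1) / 2))

def mcp(t, lam, gamma=3.0):
    # MCP penalty (Zhang): shrinkage tapers off and vanishes beyond gamma*lam
    t = np.abs(t)
    return np.where(t <= gamma * lam, lam * t - t ** 2 / (2 * gamma),
                    gamma * lam ** 2 / 2)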
While supervised federated learning approaches have enjoyed significant success, the domain of unsupervised federated learning remains relatively underexplored. Several federated EM algorithms have gained popularity in practice; however, their theoretical foundations are often lacking. In this paper, we first introduce a federated gradient EM algorithm (FedGrEM) designed for the unsupervised learning of mixture models, which supplements the existing federated EM algorithms by considering task heterogeneity and potential adversarial attacks. We present a comprehensive finite-sample theory that holds for general mixture models, then apply this general theory to specific statistical models to characterize the explicit estimation error of model parameters and mixture proportions. Our theory elucidates when and how FedGrEM outperforms local single-task learning, with insights extending to existing federated EM algorithms. This bridges the gap between their practical success and theoretical understanding. Our numerical results validate our theory and demonstrate FedGrEM's superiority over existing unsupervised federated learning benchmarks., Comment: 50 pages, 3 figures
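To make the algorithmic idea concrete, here is a deliberately simplified sketch of one federated gradient EM round for a symmetric two-component Gaussian mixture 0.5 N(theta, I) + 0.5 N(-theta, I): clients compute local gradients of the EM objective and the server averages them. The plain averaging below ignores the task heterogeneity and adversarial robustness that FedGrEM is designed to handle, so it is a caricature of the setting, not the paper's algorithm.

import numpy as np

def local_grad_em(X, theta):
    # posterior weight of the +theta component for each local sample
    w = 1.0 / (1.0 + np.exp(-2.0 * X @ theta))
    # gradient of the local Q-function at the current iterate
    return ((2.0 * w - 1.0)[:, None] * X - theta).mean(axis=0)

def fed_grad_em_round(datasets, theta, lr=0.5):
    grads = [local_grad_em(X, theta) for X in datasets]  # local computation
    return theta + lr * np.mean(grads, axis=0)           # server aggregation

rng = np.random.default_rng(1)
theta_star = np.array([1.0, -1.0])
datasets = [np.sign(rng.uniform(-1, 1, (100, 1))) * theta_star
            + rng.standard_normal((100, 2)) for _ in range(5)]
theta = theta_star + 0.3 * rng.standard_normal(2)        # warm initialization
for _ in range(50):
    theta = fed_grad_em_round(datasets, theta)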
We develop a post-selection inference method for the Cox proportional hazards model with interval-censored data, which provides asymptotically valid p-values and confidence intervals conditional on the model selected by lasso. The method is based on a pivotal quantity that is shown to converge to a uniform distribution under local alternatives. The proof can be adapted to many other regression models, which is illustrated by the extension to generalized linear models and the Cox model with right-censored data. Our method involves estimation of the efficient information matrix, for which several approaches are proposed with proof of their consistency. Thorough simulation studies show that our method has satisfactory performance in samples of modest sizes. The utility of the method is illustrated via an application to an Alzheimer's disease study., Comment: 46 pages, 14 figures
Since its development, the minimax framework has been one of the cornerstones of theoretical statistics, and has contributed to the popularity of many well-known estimators, such as the regularized M-estimators for high-dimensional problems. In this paper, we first show, through the example of the sparse Gaussian sequence model, that the theoretical results under the classical minimax framework are insufficient for explaining empirical observations. In particular, both hard and soft thresholding estimators are (asymptotically) minimax, yet in practice they often exhibit suboptimal performance at various signal-to-noise ratio (SNR) levels. The first contribution of this paper is to demonstrate that this issue can be resolved if the signal-to-noise ratio is taken into account in the construction of the parameter space. We call the resulting minimax framework the signal-to-noise ratio aware minimaxity. The second contribution of this paper is to showcase how one can use higher-order asymptotics to obtain accurate approximations of the SNR-aware minimax risk and discover minimax estimators. The theoretical findings obtained from this refined minimax framework provide new insights and practical guidance for the estimation of sparse signals.
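The phenomenon motivating the paper is easy to reproduce numerically: with a common sparsity-adapted threshold, soft and hard thresholding trade places as the signal strength varies. The threshold choice below is a standard textbook one, not the SNR-aware minimax choice derived in the paper.

import numpy as np

rng = np.random.default_rng(2)
n, k, reps = 10_000, 100, 50
lam = np.sqrt(2 * np.log(n / k))            # sparsity-adapted threshold

def soft(y, t): return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)
def hard(y, t): return y * (np.abs(y) > t)

for mu in [1.0, 3.0, 7.0]:                  # low, moderate, high SNR
    theta = np.zeros(n); theta[:k] = mu
    risk_soft = risk_hard = 0.0
    for _ in range(reps):
        y = theta + rng.standard_normal(n)  # sparse Gaussian sequence model
        risk_soft += np.sum((soft(y, lam) - theta) ** 2)
        risk_hard += np.sum((hard(y, lam) - theta) ** 2)
    print(f"mu={mu}: soft={risk_soft / reps:.0f}, hard={risk_hard / reps:.0f}")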
Unsupervised learning has been widely used in many real-world applications. One of the simplest and most important unsupervised learning models is the Gaussian mixture model (GMM). In this work, we study the multi-task learning problem on GMMs, which aims to leverage potentially similar GMM parameter structures among tasks to obtain improved learning performance compared to single-task learning. We propose a multi-task GMM learning procedure based on the EM algorithm that effectively utilizes unknown similarities between related tasks and is robust against a fraction of outlier tasks from arbitrary distributions. The proposed procedure is shown to achieve the minimax optimal rate of convergence for both parameter estimation error and the excess mis-clustering error, in a wide range of regimes. Moreover, we generalize our approach to tackle the problem of transfer learning for GMMs, where similar theoretical results are derived. Additionally, iterative unsupervised multi-task and transfer learning methods may suffer from an initialization alignment problem, and two alignment algorithms are proposed to resolve the issue. Finally, we demonstrate the effectiveness of our methods through simulations and real data examples. To the best of our knowledge, this is the first work studying multi-task and transfer learning on GMMs with theoretical guarantees., Comment: 162 pages, 15 figures, 2 tables
International large-scale assessments (ILSAs) play an important role in educational research and policy making. They collect valuable data on education quality and performance development across many education systems, giving countries the opportunity to share techniques, organizational structures, and policies that have proven efficient and successful. To gain insights from ILSA data, we identify non-cognitive variables associated with students' academic performance. This problem has three analytical challenges: 1) academic performance is measured by cognitive items under a matrix sampling design; 2) there are many missing values in the non-cognitive variables; and 3) the large number of non-cognitive variables creates a multiple comparisons problem. We consider an application to the Programme for International Student Assessment (PISA), aiming to identify non-cognitive variables associated with students' performance in science. We formulate it as a variable selection problem under a general latent variable model framework and further propose a knockoff method that conducts variable selection with a controlled error rate for false selections.
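The selection step that any knockoff method of this kind ends with is generic and short; the sketch below implements the standard knockoff+ threshold for feature statistics W (sign-symmetric under the null, with large positive values indicating signal). The paper's actual contribution, constructing the knockoffs and statistics under the latent variable model with missing data, is not reproduced here, and the function name is ours.

import numpy as np

def knockoff_plus_select(W, q=0.1):
    # data-dependent threshold controlling FDR at level q (knockoff+ rule)
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return np.flatnonzero(W >= t)   # indices of selected variables
    return np.array([], dtype=int)          # no threshold works: select nothing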
Statistics - Machine Learning, Computer Science - Machine Learning, Computer Science - Social and Information Networks, Mathematics - Statistics Theory, Statistics - Computation
Abstract
One of the fundamental problems in network analysis is detecting community structure in multi-layer networks, in which each layer represents one type of edge information among the nodes. We propose integrative spectral clustering approaches based on effective convex layer aggregations. Our aggregation methods are strongly motivated by a delicate asymptotic analysis of the spectral embedding of weighted adjacency matrices and the downstream $k$-means clustering, in a challenging regime where community detection consistency is impossible. In fact, the methods are shown to estimate the optimal convex aggregation, which minimizes the mis-clustering error under some specialized multi-layer network models. Our analysis further suggests that clustering using Gaussian mixture models is generally superior to the commonly used $k$-means in spectral clustering. Extensive numerical studies demonstrate that our adaptive aggregation techniques, together with Gaussian mixture model clustering, make the new spectral clustering remarkably competitive compared with several popular methods., Comment: 74 pages
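A minimal sketch of the pipeline described above, assuming user-supplied aggregation weights on the simplex: form the weighted adjacency matrix, embed with the leading eigenvectors, and cluster with a Gaussian mixture model (via scikit-learn) rather than k-means. The paper's adaptive estimation of the optimal weights is the part this sketch omits.

import numpy as np
from sklearn.mixture import GaussianMixture

def aggregated_spectral_gmm(layers, weights, K):
    # layers: list of symmetric (n, n) adjacency matrices, one per layer
    A = sum(w * Al for w, Al in zip(weights, layers))   # convex aggregation
    vals, vecs = np.linalg.eigh(A)
    lead = np.argsort(np.abs(vals))[-K:]                # K leading eigenvectors
    emb = vecs[:, lead]                                 # spectral embedding
    return GaussianMixture(n_components=K, n_init=5).fit_predict(emb)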
Estimating a low rank matrix from its linear measurements is a problem of central importance in contemporary statistical analysis. The choice of tuning parameters for estimators remains an important challenge from both a theoretical and a practical perspective. Stein's Unbiased Risk Estimate (SURE) provides a well-grounded statistical framework for degrees of freedom estimation. In this paper, we use the SURE framework to obtain degrees of freedom estimates for a general class of spectral regularized matrix estimators, generalizing beyond the class of estimators that have been studied thus far. To this end, we use a result due to Shapiro (2002) pertaining to the differentiability of symmetric matrix valued functions, developed in the context of semidefinite optimization algorithms. We rigorously verify the applicability of Stein's lemma towards the derivation of degrees of freedom estimates, and also present new techniques based on Gaussian convolution to estimate the degrees of freedom of a class of spectral estimators to which Stein's lemma is not directly applicable.
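As a cheap numerical companion to the analytical degrees-of-freedom formulas the paper derives, one can estimate the divergence of a spectral estimator by the generic Monte Carlo finite-difference trick of Ramani et al. (2008); the sketch below does this for singular value soft-thresholding, which is just one member of the class of spectral estimators the paper treats.

import numpy as np

def svt(Y, lam):
    # singular value soft-thresholding, a basic spectral regularized estimator
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

def mc_degrees_of_freedom(Y, lam, eps=1e-4, reps=20, seed=3):
    # Monte Carlo estimate of div f(Y) = sum_i d f_i / d Y_i
    rng = np.random.default_rng(seed)
    base, df = svt(Y, lam), 0.0
    for _ in range(reps):
        B = rng.standard_normal(Y.shape)
        df += np.sum(B * (svt(Y + eps * B, lam) - base)) / eps
    return df / reps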
A recently proposed SLOPE estimator (arXiv:1407.3824) has been shown to adaptively achieve the minimax $\ell_2$ estimation rate under high-dimensional sparse linear regression models (arXiv:1503.08393). Such minimax optimality holds in the regime where the sparsity level $k$, sample size $n$, and dimension $p$ satisfy $k/p \rightarrow 0$ and $k\log p/n \rightarrow 0$. In this paper, we characterize the estimation error of SLOPE under the complementary regime where both $k$ and $n$ scale linearly with $p$, and provide new insights into the performance of SLOPE estimators. We first derive a concentration inequality for the finite sample mean square error (MSE) of SLOPE. The quantity that the MSE concentrates around takes a complicated and implicit form. Through a delicate analysis of this quantity, we prove that among all SLOPE estimators, LASSO is optimal for estimating $k$-sparse parameter vectors that do not have tied non-zero components in the low noise scenario. On the other hand, in the large noise scenario, the family of SLOPE estimators is sub-optimal compared with bridge regression such as the Ridge estimator., Comment: 50 pages, 18 figures
Motivated by portfolio allocation and linear discriminant analysis, we consider estimating a functional $\mathbf{\mu}^T \mathbf{\Sigma}^{-1} \mathbf{\mu}$ involving both the mean vector $\mathbf{\mu}$ and the covariance matrix $\mathbf{\Sigma}$. We study the minimax estimation of the functional in the high-dimensional setting where $\mathbf{\Sigma}^{-1} \mathbf{\mu}$ is sparse. Akin to past work on functional estimation, we show that the optimal rate for estimating the functional undergoes a phase transition between the regular parametric rate and some form of high-dimensional estimation rate. We further show that the optimal rate is attained by a carefully designed plug-in estimator based on de-biasing, while a family of naive plug-in estimators is proved to fall short. We further generalize the estimation problem and techniques to allow robust inputs of mean and covariance matrix estimators. Extensive numerical experiments lend further support to our theoretical results.
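To see why naive plug-in falls short, the classical fixed-dimension calculation is already instructive: for Gaussian data with n > p + 2, the plug-in estimate of mu' Sigma^{-1} mu is biased upward, and the noncentral Hotelling T-squared mean gives an exact correction. The sketch below implements only that classical correction; it is not the paper's high-dimensional de-biased estimator.

import numpy as np

def plugin_and_corrected(X):
    # X: (n, p) i.i.d. Gaussian sample; the correction assumes n > p + 2
    n, p = X.shape
    mu_hat = X.mean(axis=0)
    S = np.cov(X, rowvar=False)                        # unbiased sample covariance
    q_naive = mu_hat @ np.linalg.solve(S, mu_hat)      # naive plug-in, biased upward
    q_corr = (n - p - 2) / (n - 1) * q_naive - p / n   # unbiased for fixed p
    return q_naive, q_corr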
In this paper, we study the popularly dubbed matrix completion problem, where the task is to "fill in" the unobserved entries of a matrix from a small subset of observed entries, under the assumption that the underlying matrix is of low rank. Our contributions herein enhance our prior work on nuclear norm regularized problems for matrix completion (Mazumder et al., 2010) by incorporating a continuum of nonconvex penalty functions between the convex nuclear norm and nonconvex rank functions. Inspired by SOFT-IMPUTE (Mazumder et al., 2010; Hastie et al., 2016), we propose NC-IMPUTE, an EM-flavored algorithmic framework for computing a family of nonconvex penalized matrix completion problems with warm starts. We present a systematic study of the associated spectral thresholding operators, which play an important role in the overall algorithm. We study convergence properties of the algorithm. Using structured low-rank SVD computations, we demonstrate the computational scalability of our proposal for problems up to the Netflix size (approximately, a $500{,}000 \times 20{,}000$ matrix with $10^8$ observed entries). We demonstrate that on a wide range of synthetic and real data instances, our proposed nonconvex regularization framework leads to low-rank solutions with better predictive performance when compared to those obtained from nuclear norm problems. Implementations of the algorithms proposed herein, written in the R programming language, are made available on github.
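A stripped-down sketch of the iteration underlying this framework: impute the missing entries with the current estimate, compute an SVD, and apply a spectral thresholding operator, here the one induced by MCP so that large singular values are shrunk less than under the nuclear norm. Warm starts, convergence checks, and the structured low-rank SVDs that make the real algorithm scale are all omitted.

import numpy as np

def mcp_threshold(s, lam, gamma=3.0):
    # spectral thresholding induced by MCP (gamma > 1); soft-thresholding is
    # recovered in the nuclear norm limit gamma -> infinity
    return np.where(s <= gamma * lam,
                    np.maximum(s - lam, 0.0) * gamma / (gamma - 1.0), s)

def nc_impute_sketch(Y, mask, lam, iters=100):
    Z = np.where(mask, Y, 0.0)              # start with zeros in missing entries
    for _ in range(iters):
        filled = np.where(mask, Y, Z)       # impute unobserved entries
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        Z = U @ np.diag(mcp_threshold(s, lam)) @ Vt
    return Z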
We consider a binary sequence generated by thresholding a hidden continuous sequence. The hidden variables are assumed to have a compound symmetry covariance structure with a single parameter characterizing the common correlation. We study the parameter estimation problem under such one-parameter models. We demonstrate that maximizing the likelihood function does not yield consistent estimates for the correlation. We then formally prove the nonestimability of the parameter by deriving a non-vanishing minimax lower bound. This counter-intuitive phenomenon provides an interesting insight that one-bit information of each latent variable is not sufficient to consistently recover their common correlation. On the other hand, we further show that trinary data generated from the hidden variables can consistently estimate the correlation with parametric convergence rate. Thus we reveal a phase transition phenomenon regarding the discretization of latent continuous variables while preserving the estimability of the correlation. Numerical experiments are performed to validate the conclusions., Comment: 23 pages, 5 figures
Mathematics - Statistics Theory, Computer Science - Information Theory
Abstract
We study the problem of variable selection for linear models under the high-dimensional asymptotic setting, where the number of observations $n$ grows at the same rate as the number of predictors $p$. We consider two-stage variable selection techniques (TVS) in which the first stage uses bridge estimators to obtain an estimate of the regression coefficients, and the second stage simply thresholds this estimate to select the "important" predictors. The asymptotic false discovery proportion (AFDP) and asymptotic true positive proportion (ATPP) of these TVS are evaluated. We prove that for a fixed ATPP, in order to obtain a smaller AFDP, one should pick a bridge estimator with smaller asymptotic mean square error in the first stage of TVS. Guided by this principle, we present a sharp comparison of different TVS via an in-depth investigation of the estimation properties of bridge estimators. Rather than "order-wise" error bounds with loose constants, our analysis focuses on precise error characterization. Various interesting signal-to-noise ratio and sparsity settings are studied. Our results offer new and thorough insights into high-dimensional variable selection. For instance, we prove that a TVS with Ridge in its first stage outperforms TVS with other bridge estimators in large noise settings, and that two-stage LASSO becomes inferior when the signal is rare and weak. As a by-product, we show that two-stage methods outperform some standard variable selection techniques, such as LASSO and Sure Independence Screening, under certain conditions., Comment: 84 pages, 11 figures
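The two-stage scheme itself is only a few lines; the sketch below uses scikit-learn's Ridge or Lasso as the stage-one bridge estimator and a hard threshold s in stage two. Both tuning values are placeholders; the paper's analysis concerns how the choice of the stage-one estimator moves the AFDP/ATPP trade-off.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

def two_stage_select(X, y, stage1="ridge", alpha=1.0, s=0.1):
    model = (Ridge if stage1 == "ridge" else Lasso)(alpha=alpha)
    beta_hat = model.fit(X, y).coef_                 # stage 1: bridge estimate
    return np.flatnonzero(np.abs(beta_hat) > s)      # stage 2: thresholding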
Mathematics - Statistics Theory, Computer Science - Information Theory
Abstract
The class of $\ell_q$-regularized least squares (LQLS) estimators is considered for estimating a $p$-dimensional vector $\beta$ from its $n$ noisy linear observations $y = X\beta + w$. The performance of these schemes is studied under the high-dimensional asymptotic setting in which $p$ grows linearly with $n$. In this asymptotic setting, phase transition diagrams (PT) are often used for comparing the performance of different estimators. Although phase transition analysis is shown to provide useful information for compressed sensing, the fact that it ignores the measurement noise not only limits its applicability in many application areas, but may also lead to misunderstandings. For instance, consider a linear regression problem in which $n > p$ and the signal is not exactly sparse. If the measurement noise is ignored in such systems, regularization techniques, such as LQLS, seem to be irrelevant, since even ordinary least squares (OLS) returns the exact solution. However, it is well known that if $n$ is not much larger than $p$, then regularization techniques improve the performance of OLS. In response to this limitation of PT analysis, we consider the low-noise sensitivity analysis. We show that this analysis framework (i) reveals the advantage of LQLS over OLS, (ii) captures the difference between different LQLS estimators even when $n > p$, and (iii) provides a fair comparison among different estimators at high signal-to-noise ratios. As an application of this framework, we show that under mild conditions LASSO outperforms other LQLS estimators even when the signal is dense. Finally, by a simple transformation we connect our low-noise sensitivity framework to the classical asymptotic regime in which $n/p$ goes to infinity, and characterize how and when regularization techniques offer improvements over ordinary least squares, and which regularizer gives the most improvement when the sample size is large.
Statistics - Methodology, Mathematics - Statistics Theory
Abstract
Community detection is one of the fundamental problems in the study of network data. Most existing community detection approaches only consider edge information as inputs, and the output could be suboptimal when nodal information is available. In such cases, it is desirable to leverage nodal information for the improvement of community detection accuracy. Towards this goal, we propose a flexible network model incorporating nodal information, and develop likelihood-based inference methods. For the proposed methods, we establish favorable asymptotic properties as well as efficient algorithms for computation. Numerical experiments show the effectiveness of our methods in utilizing nodal information across a variety of simulated and real network data sets., Comment: 53 pages
Mathematics - Statistics Theory, Computer Science - Information Theory
Abstract
We study the problem of estimating $\beta \in \mathbb{R}^p$ from its noisy linear observations $y= X\beta+ w$, where $w \sim N(0, \sigma_w^2 I_{n\times n})$, under the following high-dimensional asymptotic regime: given a fixed number $\delta$, $p \rightarrow \infty$, while $n/p \rightarrow \delta$. We consider the popular class of $\ell_q$-regularized least squares (LQLS) estimators, a.k.a. bridge, given by the optimization problem: \begin{equation*} \hat{\beta} (\lambda, q ) \in \arg\min_\beta \frac{1}{2} \|y-X\beta\|_2^2+ \lambda \|\beta\|_q^q, \end{equation*} and characterize the almost sure limit of $\frac{1}{p} \|\hat{\beta} (\lambda, q )- \beta\|_2^2$. The expression we derive for this limit does not have an explicit form and hence is not useful for comparing different algorithms, or for evaluating the effect of $\delta$ or of the sparsity level of $\beta$. To simplify the expressions, researchers have considered the ideal "no-noise" regime and have characterized the values of $\delta$ for which the almost sure limit is zero. This is known as the phase transition analysis. In this paper, we first perform the phase transition analysis of LQLS. Our results reveal some of the limitations and misleading features of the phase transition analysis. To overcome these limitations, we propose the study of these algorithms under the low noise regime. Our new analysis framework not only sheds light on the results of the phase transition analysis, but also makes an accurate comparison of different regularizers possible.
Zheng, Le, Maleki, Arian, Weng, Haolei, Wang, Xiaodong, and Long, Teng
Subjects
Computer Science - Information Theory, Mathematics - Statistics Theory
Abstract
In many application areas we are faced with the following question: can we recover a sparse vector $x_o \in \mathbb{R}^N$ from its undersampled set of noisy observations $y \in \mathbb{R}^n$, $y=A x_o+w$? The last decade has witnessed a surge of algorithms and theoretical results addressing this question. One of the most popular algorithms is the $\ell_p$-regularized least squares (LPLS) given by the following formulation: \[ \hat{x}(\gamma,p )\in \arg\min_x \frac{1}{2}\|y - Ax\|_2^2+\gamma\|x\|_p^p, \] where $p \in [0,1]$. Despite the non-convexity of these problems for $p<1$, they are still appealing because of the following folklore results in compressed sensing: (i) $\hat{x}(\gamma,p )$ is closer to $x_o$ than $\hat{x}(\gamma,1)$; (ii) if we employ iterative methods that aim to converge to a local minimum of LPLS, then under good initialization these algorithms converge to a solution that is closer to $x_o$ than $\hat{x}(\gamma,1)$. Despite plenty of empirical results supporting these folklore theorems, theoretical progress toward establishing them has been very limited. This paper aims to study the above folklore theorems and establish their scope of validity. Starting with the approximate message passing (AMP) algorithm as a heuristic method for solving LPLS, we study the impact of initialization on the performance of AMP. Then, we employ the replica analysis to show the connection between the solution of AMP and $\hat{x}(\gamma, p)$ in the asymptotic setting. This enables us to compare the accuracy of $\hat{x}(\gamma,p)$ for $p \in [0,1]$. In particular, we will characterize the phase transition and noise sensitivity of LPLS accurately for every $0\leq p\leq 1$. Our results in the noiseless setting confirm that LPLS exhibits the same phase transition for every $0\leq p <1$ and this phase transition is much higher than that of LASSO.
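For the convex endpoint p = 1, the AMP iteration referred to above takes the familiar soft-thresholding form sketched below (assuming the columns of A are roughly unit-norm, e.g. i.i.d. N(0, 1/n) entries, and a fixed threshold theta). The p < 1 variants replace soft-thresholding by the corresponding l_p thresholding function and, as the abstract discusses, become sensitive to initialization.

import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def amp_lasso(y, A, theta=1.0, iters=30):
    n, N = A.shape
    x, z = np.zeros(N), y.copy()
    for _ in range(iters):
        pseudo = x + A.T @ z                 # effective (denoising) observation
        x_new = soft(pseudo, theta)
        # Onsager correction: (N/n) times the average derivative of the denoiser
        z = y - A @ x_new + z * (N / n) * np.mean(np.abs(pseudo) > theta)
        x = x_new
    return x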
Xie, Zilong, Chen, Yunxiao, Davier, Matthias von, and Weng, Haolei
Subjects
MISSING data (Statistics), ASSESSMENT of education, EDUCATION research, MULTIPLE comparisons (Statistics), ERROR rates, LATENT variables
Abstract
International large-scale assessments (ILSAs) play an important role in educational research and policy making. They collect valuable data on education quality and performance development across many education systems, giving countries the opportunity to share techniques, organisational structures, and policies that have proven efficient and successful. To gain insights from ILSA data, we identify non-cognitive variables associated with students' academic performance. This problem has three analytical challenges: (a) academic performance is measured by cognitive items under a matrix sampling design; (b) there are many missing values in the non-cognitive variables; and (c) the large number of non-cognitive variables creates a multiple comparisons problem. We consider an application to the Programme for International Student Assessment, aiming to identify non-cognitive variables associated with students' performance in science. We formulate it as a variable selection problem under a general latent variable model framework and further propose a knockoff method that conducts variable selection with a controlled error rate for false selections.
In the ultrahigh dimensional setting, independence screening has been proved, both theoretically and empirically, to be a useful variable selection framework with low computation cost. In this work, we propose a two-step framework that uses marginal information from a different perspective than independence screening. In particular, we retain significant variables rather than screening out irrelevant ones. The new method is shown to be model selection consistent in the ultrahigh dimensional linear regression model. To improve the finite sample performance, we then introduce a three-step version and characterize its asymptotic behavior. Simulations and real data analysis show the advantages of our method over independence screening and its iterative variants in certain regimes.
Chen, Yunxiao, von Davier, Matthias, Weng, Haolei, and Xie, Zilong
Subjects
Statistics - Methodology, Statistics - Applications
Abstract
International large-scale assessments (ILSAs) play an important role in educational research and policy making. They collect valuable data on education quality and performance development across many education systems, giving countries the opportunity to share techniques, organisational structures and policies that have proven efficient and successful. To gain insights from ILSA data, we identify non-cognitive variables associated with students' academic performance. This problem has three analytical challenges: 1) students' academic performance is measured by cognitive items under a matrix sampling design; 2) there are often many missing values in the non-cognitive variables; and 3) the large number of non-cognitive variables creates a multiple comparisons problem. We consider an application to data from the Programme for International Student Assessment (PISA), aiming to identify non-cognitive variables associated with students' performance in science. We formulate it as a variable selection problem under a latent variable model and further propose a knockoff method that conducts variable selection with a controlled error rate for false selections. Keywords: Model-X knockoffs, item response theory, missing data, variable selection, international large-scale assessment
Unsupervised learning has been widely used in many real-world applications. One of the simplest and most important unsupervised learning models is the Gaussian mixture model (GMM). In this work, we study the multi-task learning problem on GMMs, which aims to leverage potentially similar GMM parameter structures among tasks to obtain improved learning performance compared to single-task learning. We propose a multi-task GMM learning procedure based on the EM algorithm that can effectively utilize unknown similarities between related tasks and is robust against a fraction of outlier tasks from arbitrary sources. The proposed procedure is shown to achieve the minimax optimal rate of convergence for both the parameter estimation error and the excess mis-clustering error, in a wide range of regimes. Moreover, we generalize our approach to tackle the problem of transfer learning for GMMs, where similar theoretical results are derived. Finally, we demonstrate the effectiveness of our methods through simulations and a real data analysis. To the best of our knowledge, this is the first work studying multi-task and transfer learning on GMMs with theoretical guarantees., Comment: 149 pages, 7 figures, 2 tables
Mazumder, Rahul, and Weng, Haolei (Massachusetts Institute of Technology, Sloan School of Management, Operations Research Center)
Abstract
Estimating a low rank matrix from its linear measurements is a problem of central importance in contemporary statistical analysis. The choice of tuning parameters for estimators remains an important challenge from a theoretical and practical perspective. To this end, Stein’s Unbiased Risk Estimate (SURE) framework provides a well-grounded statistical framework for degrees of freedom estimation. In this paper, we use the SURE framework to obtain degrees of freedom estimates for a general class of spectral regularized matrix estimators; our results generalize beyond the class of estimators that have been studied thus far. To this end, we use a result due to Shapiro (2002) pertaining to the differentiability of symmetric matrix valued functions, developed in the context of semidefinite optimization algorithms. We rigorously verify the applicability of Stein’s Lemma towards the derivation of degrees of freedom estimates; and also present new techniques based on Gaussian convolution to estimate the degrees of freedom of a class of spectral estimators, for which Stein’s Lemma does not directly apply.
Mazumder, Rahul, Saldana, Diego, and Weng, Haolei (Massachusetts Institute of Technology, Sloan School of Management)
Abstract
In this paper, we study the popularly dubbed matrix completion problem, where the task is to “fill in” the unobserved entries of a matrix from a small subset of observed entries, under the assumption that the underlying matrix is of low rank. Our contributions herein enhance our prior work on nuclear norm regularized problems for matrix completion (Mazumder et al. in J Mach Learn Res 11:2287–2322, 2010) by incorporating a continuum of nonconvex penalty functions between the convex nuclear norm and nonconvex rank functions. Inspired by Soft-Impute (Mazumder et al. 2010; Hastie et al. in J Mach Learn Res, 2016), we propose NC-Impute, an EM-flavored algorithmic framework for computing a family of nonconvex penalized matrix completion problems with warm starts. We present a systematic study of the associated spectral thresholding operators, which play an important role in the overall algorithm. We study convergence properties of the algorithm. Using structured low-rank SVD computations, we demonstrate the computational scalability of our proposal for problems up to the Netflix size (approximately, a $500{,}000 \times 20{,}000$ matrix with $10^8$ observed entries). We demonstrate that on a wide range of synthetic and real data instances, our proposed nonconvex regularization framework leads to low-rank solutions with better predictive performance when compared to those obtained from nuclear norm problems. Implementations of the algorithms proposed herein, written in the R language, are made available on github.
A recently proposed SLOPE estimator [6] has been shown to adaptively achieve the minimax $\ell_2$ estimation rate under high-dimensional sparse linear regression models [25]. Such minimax optimality holds in the regime where the sparsity level $k$, sample size $n$ and dimension $p$ satisfy $k/p \rightarrow 0$, $k\log p/n \rightarrow 0$. In this paper, we characterize the estimation error of SLOPE under the complementary regime where both $k$ and $n$ scale linearly with $p$, and provide new insights into the performance of SLOPE estimators. We first derive a concentration inequality for the finite sample mean square error (MSE) of SLOPE. The quantity that the MSE concentrates around takes a complicated and implicit form. Through a delicate analysis of this quantity, we prove that among all SLOPE estimators, LASSO is optimal for estimating $k$-sparse parameter vectors that do not have tied nonzero components in the low noise scenario. On the other hand, in the large noise scenario, the family of SLOPE estimators is sub-optimal compared with bridge regression such as the Ridge estimator.
In many application areas ranging from bioinformatics to imaging, we are interested in recovering a sparse coefficient in the high-dimensional linear model, when the sample size n is comparable to or less than the dimension p. One of the most popular classes of estimators is the Lq-regularized least squares (LQLS), a.k.a. bridge regression. There have been extensive studies towards understanding the performance of the best subset selection (q=0), LASSO (q=1) and ridge (q=2), three widely known estimators from the LQLS family. This thesis aims to give a unified view of LQLS for all non-negative values of q. In contrast to most existing works, which obtain order-wise error bounds with loose constants, we derive asymptotically exact error formulas characterized through a series of fixed point equations. A delicate analysis of the fixed point equations enables us to gain fruitful insights into the statistical properties of LQLS across the entire spectrum of Lq-regularization. Our work not only validates the scope of folklore understanding of Lq-minimization, but also provides new insights into high-dimensional statistics as a whole. We elaborate on our theoretical findings mainly from the parameter estimation point of view. At the end of the thesis, we briefly mention bridge regression for variable selection and prediction. We start by considering the parameter estimation problem and evaluate the performance of LQLS by characterizing the asymptotic mean square error (AMSE). The expression we derive for AMSE does not have an explicit form and hence is not useful in comparing LQLS for different values of q, or in evaluating the effect of the relative sample size n/p or the sparsity level of the coefficient. To simplify the expression, we first perform the widely used phase transition (PT) analysis of LQLS. Our results reveal some of the limitations and misleading features of the PT framework. To overcome these limitations, we propose the small-error analysis of LQLS. Our new analysis framework not only sheds light on the results of the phase transition analysis, but also describes when phase transition analysis is reliable, and presents a more accurate comparison among different Lq-regularizations. We then extend our low noise sensitivity analysis to linear models without sparsity structure. Our analysis, as a generalization of phase transition analysis, reveals a clear picture of bridge regression for estimating generic coefficients. Moreover, by a simple transformation we connect our low-noise sensitivity framework to the classical asymptotic regime in which n/p goes to infinity, and give some insightful implications beyond what classical asymptotic analysis of bridge regression can offer. Furthermore, following the same idea of the new analysis framework, we obtain an explicit characterization of AMSE in the form of second-order expansions under the large noise regime. The expansions provide us with some intriguing messages. For example, ridge will outperform LASSO in terms of estimating sparse coefficients when the measurement noise is large. Finally, we present a short analysis of LQLS for the purpose of variable selection and prediction. We propose a two-stage variable selection technique based on the LQLS estimators, and describe its superiority and close connection to parameter estimation.
For prediction, we illustrate the intricate relation between tuning parameter selection for optimal in-sample prediction and that for optimal parameter estimation.
The class of $\ell_q$-regularized least squares (LQLS) estimators is considered for estimating $\beta \in \mathbb{R}^p$ from its $n$ noisy linear observations $y = X\beta + w$. The performance of these schemes is studied under the high-dimensional asymptotic setting in which the dimension of the signal grows linearly with the number of measurements. In this asymptotic setting, phase transition (PT) diagrams are often used for comparing the performance of different estimators. PT specifies the minimum number of observations required by a certain estimator to recover a structured signal, e.g. a sparse one, from its noiseless linear observations. Although PT analysis is shown to provide useful information for compressed sensing, the fact that it ignores the measurement noise not only limits its applicability in many application areas, but may also lead to misunderstandings. For instance, consider a linear regression problem in which $n > p$ and the signal is not exactly sparse. If the measurement noise is ignored in such systems, regularization techniques, such as LQLS, seem to be irrelevant, since even ordinary least squares (OLS) returns the exact solution. However, it is well known that if $n$ is not much larger than $p$, then regularization techniques improve the performance of OLS. In response to this limitation of PT analysis, we consider the low-noise sensitivity analysis. We show that this analysis framework (i) reveals the advantage of LQLS over OLS, (ii) captures the difference between different LQLS estimators even when $n > p$, and (iii) provides a fair comparison among different estimators at high signal-to-noise ratios. As an application of this framework, we show that under mild conditions LASSO outperforms other LQLS estimators even when the signal is dense. Finally, by a simple transformation, we connect our low-noise sensitivity framework to the classical asymptotic regime in which $n/p \rightarrow \infty$, and characterize how and when regularization techniques offer improvements over ordinary least squares, and which regularizer gives the most improvement when the sample size is large.
Zheng, Le, Maleki, Arian, Weng, Haolei, Wang, Xiaodong, and Long, Teng
Subjects
L1-minimization, COMPRESSED sensing, LEAST squares, LASSO, MESSAGE passing (Computer science)
Abstract
In many application areas ranging from bioinformatics to imaging, we are faced with the following question: can we recover a sparse vector $x_o \in \mathbb{R}^N$ from its undersampled set of noisy observations $y \in \mathbb{R}^n$, $y = Ax_o + w$? The last decade has witnessed a surge of algorithms and theoretical results to address this question. One of the most popular schemes is the $\ell_p$-regularized least squares given by the following formulation: $\hat{x}(\gamma, p) \in \arg\min_x \frac{1}{2}\|y - Ax\|_2^2 + \gamma \|x\|_p^p$, where $p \in [0, 1]$. Among these optimization problems, the case $p = 1$, also known as LASSO, is the best accepted in practice, for the following two reasons. First, thanks to the extensive studies performed in the fields of high-dimensional statistics and compressed sensing, we have a clear picture of LASSO's performance. Second, it is convex and efficient algorithms exist for finding its global minimum. Unfortunately, neither of the above two properties holds for $0 \leq p < 1$. However, these problems are still appealing because of the following folklore results in high-dimensional statistics. First, $\hat{x}(\gamma, p)$ is closer to $x_o$ than $\hat{x}(\gamma, 1)$. Second, if we employ iterative methods that aim to converge to a local minimum of $\arg\min_x \frac{1}{2}\|y - Ax\|_2^2 + \gamma \|x\|_p^p$, then under good initialization, these algorithms converge to a solution that is still closer to $x_o$ than $\hat{x}(\gamma, 1)$. In spite of the existence of plenty of empirical results that support these folklore theorems, the theoretical progress to establish them has been very limited. This paper aims to study the above-mentioned folklore theorems and establish their scope of validity. Starting with the approximate message passing (AMP) algorithm as a heuristic method for solving $\ell_p$-regularized least squares, we study the following questions. First, what is the impact of initialization on the performance of the algorithm? Second, when does the algorithm recover the sparse signal $x_o$ under a "good" initialization? Third, when does the algorithm converge to the sparse signal regardless of the initialization? Studying these questions will not only shed light on the second folklore theorem, but also lead us to the answer to the first one, i.e., the performance of the global optimum $\hat{x}(\gamma, p)$. For that purpose, we employ the replica analysis (a widely accepted heuristic in statistical physics for analyzing large disordered systems) to show the connection between the solution of AMP and $\hat{x}(\gamma, p)$ in the asymptotic settings. This enables us to compare the accuracy of $\hat{x}(\gamma, p)$ and $\hat{x}(\gamma, 1)$. In particular, we will present an accurate characterization of the phase transition and noise sensitivity of $\ell_p$-regularized least squares for every $0 \leq p \leq 1$. Our results in the noiseless setting confirm that $\ell_p$-regularized least squares (if $\gamma$ is tuned optimally) exhibits the same phase transition for every $0 \leq p < 1$ and this phase transition is much better than that of LASSO. Furthermore, we show that in the noisy setting, there is a major difference between the performance of $\ell_p$-regularized least squares with different values of $p$. For instance, we will show that for very small and very large measurement noises, $p = 0$ and $p = 1$ outperform the other values of $p$, respectively.
Xie, Zilong, Chen, Yunxiao, von Davier, Matthias, and Weng, Haolei
Abstract
International large-scale assessments (ILSAs) play an important role in educational research and policy making. They collect valuable data on education quality and performance development across many education systems, giving countries the opportunity to share techniques, organizational structures, and policies that have proven efficient and successful. To gain insights from ILSA data, we identify non-cognitive variables associated with students’ academic performance. This problem has three analytical challenges: 1) academic performance is measured by cognitive items under a matrix sampling design; 2) there are many missing values in the non-cognitive variables; and 3) the large number of non-cognitive variables creates a multiple comparisons problem. We consider an application to the Programme for International Student Assessment (PISA), aiming to identify non-cognitive variables associated with students’ performance in science. We formulate it as a variable selection problem under a general latent variable model framework and further propose a knockoff method that conducts variable selection with a controlled error rate for false selections.