Author: "Jordan, Michael" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Jordan, Michael"' showing total 6,080 results

251. Variance Reduction with Sparse Gradients

Author: Elibol, Melih, Lei, Lihua, and Jordan, Michael I.
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients to reduce the variance of stochastic gradients. Compared to SGD, these methods require at least double the number of operations per update to model parameters. To reduce the computational cost of these methods, we introduce a new sparsity operator: The random-top-k operator. Our operator reduces computational complexity by estimating gradient sparsity exhibited in a variety of applications by combining the top-k operator and the randomized coordinate descent operator. With this operator, large batch gradients offer an extra benefit beyond variance reduction: A reliable estimate of gradient sparsity. Theoretically, our algorithm is at least as good as the best algorithm (SpiderBoost), and further excels in performance whenever the random-top-k operator captures gradient sparsity. Empirically, our algorithm consistently outperforms SpiderBoost using various models on various tasks including image classification, natural language processing, and sparse matrix factorization. We also provide empirical evidence to support the intuition behind our algorithm via a simple gradient entropy computation, which serves to quantify gradient sparsity at every iteration., Comment: ICLR 2020
Published: 2020

252. A Control-Theoretic Perspective on Optimal High-Order Optimization

Author: Lin, Tianyi and Jordan, Michael I.
Subjects: Mathematics - Optimization and Control, Computer Science - Computational Complexity, Computer Science - Data Structures and Algorithms
Abstract: We provide a control-theoretic perspective on optimal tensor algorithms for minimizing a convex function in a finite-dimensional Euclidean space. Given a function $\Phi: \mathbb{R}^d \rightarrow \mathbb{R}$ that is convex and twice continuously differentiable, we study a closed-loop control system that is governed by the operators $\nabla \Phi$ and $\nabla^2 \Phi$ together with a feedback control law $\lambda(\cdot)$ satisfying the algebraic equation $(\lambda(t))^p\|\nabla\Phi(x(t))\|^{p-1} = \theta$ for some $\theta \in (0, 1)$. Our first contribution is to prove the existence and uniqueness of a local solution to this system via the Banach fixed-point theorem. We present a simple yet nontrivial Lyapunov function that allows us to establish the existence and uniqueness of a global solution under certain regularity conditions and analyze the convergence properties of trajectories. The rate of convergence is $O(1/t^{(3p+1)/2})$ in terms of objective function gap and $O(1/t^{3p})$ in terms of squared gradient norm. Our second contribution is to provide two algorithmic frameworks obtained from discretization of our continuous-time system, one of which generalizes the large-step A-HPE framework and the other of which leads to a new optimal $p$-th order tensor algorithm. While our discrete-time analysis can be seen as a simplification and generalization of~\citet{Monteiro-2013-Accelerated}, it is largely motivated by the aforementioned continuous-time analysis, demonstrating the fundamental role that the feedback control plays in optimal acceleration and the clear advantage that the continuous-time perspective brings to algorithmic design. A highlight of our analysis is that we show that all of the $p$-th order optimal tensor algorithms that we discuss minimize the squared gradient norm at a rate of $O(k^{-3p})$, which complements the recent analysis., Comment: Accepted by Mathematical Programming Series A; 45 pages
Published: 2019

253. Sampling for Bayesian Mixture Models: MCMC with Polynomial-Time Mixing

Author: Mou, Wenlong, Ho, Nhat, Wainwright, Martin J., Bartlett, Peter L., and Jordan, Michael I.
Subjects: Statistics - Machine Learning, Computer Science - Data Structures and Algorithms, Computer Science - Machine Learning, Mathematics - Probability, Statistics - Computation
Abstract: We study the problem of sampling from the power posterior distribution in Bayesian Gaussian mixture models, a robust version of the classical posterior. This power posterior is known to be non-log-concave and multi-modal, which leads to exponential mixing times for some standard MCMC algorithms. We introduce and study the Reflected Metropolis-Hastings Random Walk (RMRW) algorithm for sampling. For symmetric two-component Gaussian mixtures, we prove that its mixing time is bounded as $d^{1.5}(d + \Vert \theta_{0} \Vert^2)^{4.5}$ as long as the sample size $n$ is of the order $d (d + \Vert \theta_{0} \Vert^2)$. Notably, this result requires no conditions on the separation of the two means. En route to proving this bound, we establish some new results of possible independent interest that allow for combining Poincaré inequalities for conditional and marginal densities.
Published: 2019

254. Early response markers predict survival after etoposide-based therapy of hemophagocytic lymphohistiocytosis

Author: Verkamp, Bethany, Zoref-Lorenz, Adi, Francisco, Brenton, Kieser, Pearce, Mack, Joana, Blackledge, Tucker, Brik Simon, Dafna, Yacobovich, Joanne, and Jordan, Michael B.
Published: 2023
Full Text: View/download PDF

255. Interview (with) Janos Szabo, Hungarian Minister of Defence

Author: Jordan, Michael J.
Subjects: HUNGARY - Politics and Government, NORTH ATLANTIC TREATY ORGANIZATION
Published: 1999

256. MVSym: Efficient symbiotic exploitation of HLS-kernel multi-versioning for collaborative CPU-FPGA cloud systems

Author: Jordan, Michael Guilherme, Lignati, Bernardo Neuhaus, Korol, Guilherme, Rutzig, Mateus Beck, and Beck, Antonio Carlos Schneider
Published: 2023
Full Text: View/download PDF

257. Checklist for studies of HIV drug resistance prevalence or incidence: rationale and recommended use

Author: Mbuagbaw, Lawrence, Garcia, Cristian, Brenner, Bluma, Cecchini, Diego, Chakroun, Mohamed, Djiadeu, Pascal, Holguin, Africa, Mor, Orna, Parkin, Neil, Santoro, Maria M, Ávila-Ríos, Santiago, Fokam, Joseph, Phillips, Andrew, Shafer, Robert W, and Jordan, Michael R
Published: 2023
Full Text: View/download PDF

258. IF TRUMP AND HIS HEROES RULED THE WORLD.

Author: SMITH, JORDAN MICHAEL
Subjects: *DEMOCRACY, *INTERNATIONAL relations
Abstract: The article discusses the potential consequences if former American President Donald Trump were to return to the Oval Office. Trump's admiration for autocrats like Russian President Vladimir Putin and his disdain for traditional allies could reshape global politics. His affinity for authoritarian leaders raises concerns about the future of democracy and international relations.
Published: 2024

259. The Power of Batching in Multiple Hypothesis Testing

Author: Zrnic, Tijana, Jiang, Daniel L., Ramdas, Aaditya, and Jordan, Michael I.
Subjects: Statistics - Methodology, Statistics - Machine Learning
Abstract: One important partition of algorithms for controlling the false discovery rate (FDR) in multiple testing is into offline and online algorithms. The first generally achieve significantly higher power of discovery, while the latter allow making decisions sequentially as well as adaptively formulating hypotheses based on past observations. Using existing methodology, it is unclear how one could trade off the benefits of these two broad families of algorithms, all the while preserving their formal FDR guarantees. To this end, we introduce $\text{Batch}_{\text{BH}}$ and $\text{Batch}_{\text{St-BH}}$, algorithms for controlling the FDR when a possibly infinite sequence of batches of hypotheses is tested by repeated application of one of the most widely used offline algorithms, the Benjamini-Hochberg (BH) method or Storey's improvement of the BH method. We show that our algorithms interpolate between existing online and offline methodology, thus trading off the best of both worlds., Comment: 29 pages, 12 figures
Published: 2019

260. On the Complexity of Approximating Multimarginal Optimal Transport

Author: Lin, Tianyi, Ho, Nhat, Cuturi, Marco, and Jordan, Michael I.
Subjects: Statistics - Machine Learning, Computer Science - Data Structures and Algorithms, Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Computation
Abstract: We study the complexity of approximating the multimarginal optimal transport (MOT) distance, a generalization of the classical optimal transport distance, considered here between $m$ discrete probability distributions supported each on $n$ support points. First, we show that the standard linear programming (LP) representation of the MOT problem is not a minimum-cost flow problem when $m \geq 3$. This negative result implies that some combinatorial algorithms, e.g., network simplex method, are not suitable for approximating the MOT problem, while the worst-case complexity bound for the deterministic interior-point algorithm remains a quantity of $\tilde{O}(n^{3m})$. We then propose two simple and \textit{deterministic} algorithms for approximating the MOT problem. The first algorithm, which we refer to as \textit{multimarginal Sinkhorn} algorithm, is a provably efficient multimarginal generalization of the Sinkhorn algorithm. We show that it achieves a complexity bound of $\tilde{O}(m^3n^m\varepsilon^{-2})$ for a tolerance $\varepsilon \in (0, 1)$. This provides a first \textit{near-linear time} complexity bound guarantee for approximating the MOT problem and matches the best known complexity bound for the Sinkhorn algorithm in the classical OT setting when $m = 2$. The second algorithm, which we refer to as \textit{accelerated multimarginal Sinkhorn} algorithm, achieves the acceleration by incorporating an estimate sequence and the complexity bound is $\tilde{O}(m^3n^{m+1/3}\varepsilon^{-4/3})$. This bound is better than that of the first algorithm in terms of $1/\varepsilon$, and accelerated alternating minimization algorithm~\citep{Tupitsa-2020-Multimarginal} in terms of $n$. Finally, we compare our new algorithms with the commercial LP solver \textsc{Gurobi}. Preliminary results on synthetic data and real images demonstrate the effectiveness and efficiency of our algorithms., Comment: Accepted by Journal of Machine Learning Research; Add the references and the funding information; 40 pages, 14 figures
Published: 2019

261. Towards Understanding the Transferability of Deep Representations

Author: Liu, Hong, Long, Mingsheng, Wang, Jianmin, and Jordan, Michael I.
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Deep neural networks trained on a wide range of datasets demonstrate impressive transferability. Deep features appear general in that they are applicable to many datasets and tasks. This property is in prevalent use in real-world applications. A neural network pretrained on large datasets, such as ImageNet, can significantly boost generalization and accelerate training if fine-tuned to a smaller target dataset. Despite its pervasiveness, little effort has been devoted to uncovering the reasons for transferability in deep feature representations. This paper tries to understand transferability from the perspectives of improved generalization, optimization and the feasibility of transferability. We demonstrate that 1) Transferred models tend to find flatter minima, since their weight matrices stay close to the original flat region of pretrained parameters when transferred to a similar target dataset; 2) Transferred representations make the loss landscape more favorable with improved Lipschitzness, which accelerates and stabilizes training substantially. The improvement is largely attributable to the fact that the principal component of gradient is suppressed in the pretrained parameters, thus stabilizing the magnitude of gradient in back-propagation. 3) The feasibility of transferability is related to the similarity of both input and label. And a surprising discovery is that the feasibility is also impacted by the training stages in that the transferability first increases during training, and then declines. We further provide a theoretical analysis to verify our observations.
Published: 2019

262. A Diffusion Process Perspective on Posterior Contraction Rates for Parameters

Author: Mou, Wenlong, Ho, Nhat, Wainwright, Martin J., Bartlett, Peter, and Jordan, Michael I.
Subjects: Mathematics - Statistics Theory
Abstract: We analyze the posterior contraction rates of parameters in Bayesian models via the Langevin diffusion process, in particular by controlling moments of the stochastic process and taking limits. Analogous to the non-asymptotic analysis of statistical M-estimators and stochastic optimization algorithms, our contraction rates depend on the structure of the population log-likelihood function, and stochastic perturbation bounds between the population and sample log-likelihood functions. Convergence rates are determined by a non-linear equation that relates the population-level structure to stochastic perturbation terms, along with a term characterizing the diffusive behavior. Based on this technique, we also prove non-asymptotic versions of a Bernstein-von-Mises guarantee for the posterior. We illustrate this general theory by deriving posterior convergence rates for various concrete examples, as well as approximate posterior distributions computed using Langevin sampling procedures., Comment: 81 pages
Published: 2019

263. High-Order Langevin Diffusion Yields an Accelerated MCMC Algorithm

Author: Mou, Wenlong, Ma, Yi-An, Wainwright, Martin J., Bartlett, Peter L., and Jordan, Michael I.
Subjects: Statistics - Machine Learning, Computer Science - Data Structures and Algorithms, Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Computation
Abstract: We propose a Markov chain Monte Carlo (MCMC) algorithm based on third-order Langevin dynamics for sampling from distributions with log-concave and smooth densities. The higher-order dynamics allow for more flexible discretization schemes, and we develop a specific method that combines splitting with more accurate integration. For a broad class of $d$-dimensional distributions arising from generalized linear models, we prove that the resulting third-order algorithm produces samples from a distribution that is at most $\varepsilon > 0$ in Wasserstein distance from the target distribution in $O\left(\frac{d^{1/4}}{ \varepsilon^{1/2}} \right)$ steps. This result requires only Lipschitz conditions on the gradient. For general strongly convex potentials with $\alpha$-th order smoothness, we prove that the mixing time scales as $O \left(\frac{d^{1/4}}{\varepsilon^{1/2}} + \frac{d^{1/2}}{\varepsilon^{1/(\alpha - 1)}} \right)$., Comment: Changes from v1: improved algorithm with $O (d^{1/4} / \varepsilon^{1/2})$ mixing time
Published: 2019

264. How Does Learning Rate Decay Help Modern Neural Networks?

Author: You, Kaichao, Long, Mingsheng, Wang, Jianmin, and Jordan, Michael I.
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Learning rate decay (lrDecay) is a \emph{de facto} technique for training modern neural networks. It starts with a large learning rate and then decays it multiple times. It is empirically observed to help both optimization and generalization. Common beliefs in how lrDecay works come from the optimization analysis of (Stochastic) Gradient Descent: 1) an initially large learning rate accelerates training or helps the network escape spurious local minima; 2) decaying the learning rate helps the network converge to a local minimum and avoid oscillation. Despite the popularity of these common beliefs, experiments suggest that they are insufficient in explaining the general effectiveness of lrDecay in training modern neural networks that are deep, wide, and nonconvex. We provide another novel explanation: an initially large learning rate suppresses the network from memorizing noisy data while decaying the learning rate improves the learning of complex patterns. The proposed explanation is validated on a carefully-constructed dataset with tractable pattern complexity. And its implication, that additional patterns learned in later stages of lrDecay are more complex and thus less transferable, is justified in real-world datasets. We believe that this alternative explanation will shed light into the design of better training strategies for modern neural networks., Comment: title changed
Published: 2019

265. A Higher-Order Swiss Army Infinitesimal Jackknife

Author: Giordano, Ryan, Jordan, Michael I., and Broderick, Tamara
Subjects: Mathematics - Statistics Theory, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Cross validation (CV) and the bootstrap are ubiquitous model-agnostic tools for assessing the error or variability of machine learning and statistical estimators. However, these methods require repeatedly re-fitting the model with different weighted versions of the original dataset, which can be prohibitively time-consuming. For sufficiently regular optimization problems the optimum depends smoothly on the data weights, and so the process of repeatedly re-fitting can be approximated with a Taylor series that can be often evaluated relatively quickly. The first-order approximation is known as the "infinitesimal jackknife" in the statistics literature and has been the subject of recent interest in machine learning for approximate CV. In this work, we consider high-order approximations, which we call the "higher-order infinitesimal jackknife" (HOIJ). Under mild regularity conditions, we provide a simple recursive procedure to compute approximations of all orders with finite-sample accuracy bounds. Additionally, we show that the HOIJ can be efficiently computed even in high dimensions using forward-mode automatic differentiation. We show that a linear approximation with bootstrap weights approximation is equivalent to those provided by asymptotic normal approximations. Consequently, the HOIJ opens up the possibility of enjoying higher-order accuracy properties of the bootstrap using local approximations. Consistency of the HOIJ for leave-one-out CV under different asymptotic regimes follows as corollaries from our finite-sample bounds under additional regularity assumptions. The generality of the computation and bounds motivate the name "higher-order Swiss Army infinitesimal jackknife."
Published: 2019

266. Bayesian Robustness: A Nonasymptotic Viewpoint

Author: Bhatia, Kush, Ma, Yi-An, Dragan, Anca D., Bartlett, Peter L., and Jordan, Michael I.
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning, Statistics - Computation
Abstract: We study the problem of robustly estimating the posterior distribution for the setting where observed data can be contaminated with potentially adversarial outliers. We propose Rob-ULA, a robust variant of the Unadjusted Langevin Algorithm (ULA), and provide a finite-sample analysis of its sampling distribution. In particular, we show that after $T= \tilde{\mathcal{O}}(d/\varepsilon_{\textsf{acc}})$ iterations, we can sample from $p_T$ such that $\text{dist}(p_T, p^*) \leq \varepsilon_{\textsf{acc}} + \tilde{\mathcal{O}}(\epsilon)$, where $\epsilon$ is the fraction of corruptions. We corroborate our theoretical analysis with experiments on both synthetic and real-world data sets for mean estimation, regression and binary classification., Comment: 30 pages, 5 figures
Published: 2019

267. Provably Efficient Reinforcement Learning with Linear Function Approximation

Author: Jin, Chi, Yang, Zhuoran, Wang, Zhaoran, and Jordan, Michael I.
Subjects: Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: Modern Reinforcement Learning (RL) is commonly applied to practical problems with an enormous number of states, where function approximation must be deployed to approximate either the value function or the policy. The introduction of function approximation raises a fundamental set of challenges involving computational and statistical efficiency, especially given the need to manage the exploration/exploitation tradeoff. As a result, a core RL question remains open: how can we design provably efficient RL algorithms that incorporate function approximation? This question persists even in a basic setting with linear dynamics and linear rewards, for which only linear function approximation is needed. This paper presents the first provable RL algorithm with both polynomial runtime and polynomial sample complexity in this linear setting, without requiring a "simulator" or additional assumptions. Concretely, we prove that an optimistic modification of Least-Squares Value Iteration (LSVI)---a classical algorithm frequently studied in the linear setting---achieves $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret, where $d$ is the ambient dimension of feature space, $H$ is the length of each episode, and $T$ is the total number of steps. Importantly, such regret is independent of the number of states and actions.
Published: 2019

268. Convergence Rates for Gaussian Mixtures of Experts

Author: Ho, Nhat, Yang, Chiao-Yu, and Jordan, Michael I.
Subjects: Mathematics - Statistics Theory, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: We provide a theoretical treatment of over-specified Gaussian mixtures of experts with covariate-free gating networks. We establish the convergence rates of the maximum likelihood estimation (MLE) for these models. Our proof technique is based on a novel notion of \emph{algebraic independence} of the expert functions. Drawing on optimal transport theory, we establish a connection between the algebraic independence and a certain class of partial differential equations (PDEs). Exploiting this connection allows us to derive convergence rates and minimax lower bounds for parameter estimation., Comment: 81 pages
Published: 2019

269. Policy-Gradient Algorithms Have No Guarantees of Convergence in Linear Quadratic Games

Author: Mazumdar, Eric, Ratliff, Lillian J., Jordan, Michael I., and Sastry, S. Shankar
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: We show by counterexample that policy-gradient algorithms have no guarantees of even local convergence to Nash equilibria in continuous action and state space multi-agent settings. To do so, we analyze gradient-play in N-player general-sum linear quadratic games, a classic game setting which is recently emerging as a benchmark in the field of multi-agent learning. In such games the state and action spaces are continuous and global Nash equilibria can be found by solving coupled Riccati equations. Further, gradient-play in LQ games is equivalent to multi-agent policy-gradient. We first show that these games are surprisingly not convex games. Despite this, we are still able to show that the only critical points of the gradient dynamics are global Nash equilibria. We then give sufficient conditions under which policy-gradient will avoid the Nash equilibria, and generate a large number of general-sum linear quadratic games that satisfy these conditions. In such games we empirically observe the players converging to limit cycles for which the time average does not coincide with a Nash equilibrium. The existence of such games indicates that one of the most popular approaches to solving reinforcement learning problems in the classic reinforcement learning setting has no local guarantee of convergence in multi-agent settings. Further, the ease with which we can generate these counterexamples suggests that such situations are not mere edge cases and are in fact quite common.
Published: 2019

270. Stochastic Gradient and Langevin Processes

Author: Cheng, Xiang, Yin, Dong, Bartlett, Peter L., and Jordan, Michael I.
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: We prove quantitative convergence rates at which discrete Langevin-like processes converge to the invariant distribution of a related stochastic differential equation. We study the setup where the additive noise can be non-Gaussian and state-dependent and the potential function can be non-convex. We show that the key properties of these processes depend on the potential function and the second moment of the additive noise. We apply our theoretical findings to studying the convergence of Stochastic Gradient Descent (SGD) for non-convex problems and corroborate them with experiments using SGD to train deep neural networks on the CIFAR-10 dataset., Comment: ICML 2020, code available at https://github.com/dongyin92/noise_covariance
Published: 2019

271. Convergence Rates of Smooth Message Passing with Rounding in Entropy-Regularized MAP Inference

Author: Lee, Jonathan N., Pacchiano, Aldo, and Jordan, Michael I.
Subjects: Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: Maximum a posteriori (MAP) inference is a fundamental computational paradigm for statistical inference. In the setting of graphical models, MAP inference entails solving a combinatorial optimization problem to find the most likely configuration of the discrete-valued model. Linear programming (LP) relaxations in the Sherali-Adams hierarchy are widely used to attempt to solve this problem, and smooth message passing algorithms have been proposed to solve regularized versions of these LPs with great success. This paper leverages recent work in entropy-regularized LPs to analyze convergence rates of a class of edge-based smooth message passing algorithms to $\epsilon$-optimality in the relaxation. With an appropriately chosen regularization constant, we present a theoretical guarantee on the number of iterations sufficient to recover the true integral MAP solution when the LP is tight and the solution is unique.
Published: 2019

272. Local Exchangeability

Author: Campbell, Trevor, Syed, Saifuddin, Yang, Chiao-Yu, Jordan, Michael I., and Broderick, Tamara
Subjects: Mathematics - Statistics Theory, Mathematics - Probability
Abstract: Exchangeability -- in which the distribution of an infinite sequence is invariant to reorderings of its elements -- implies the existence of a simple conditional independence structure that may be leveraged in the design of statistical models and inference procedures. In this work, we study a relaxation of exchangeability in which this invariance need not hold precisely. We introduce the notion of local exchangeability -- where swapping data associated with nearby covariates causes a bounded change in the distribution. We prove that locally exchangeable processes correspond to independent observations from an underlying measure-valued stochastic process. Using this main probabilistic result, we show that the local empirical measure of a finite collection of observations provides an approximation of the underlying measure-valued process and Bayesian posterior predictive distributions. The paper concludes with applications of the main theoretical results to a model from Bayesian nonparametrics and covariate-dependent permutation tests.
Published: 2019

273. Competing Bandits in Matching Markets

Author: Liu, Lydia T., Mania, Horia, and Jordan, Michael I.
Subjects: Computer Science - Machine Learning, Computer Science - Computer Science and Game Theory, Computer Science - Multiagent Systems, Statistics - Machine Learning
Abstract: Stable matching, a classical model for two-sided markets, has long been studied with little consideration for how each side's preferences are learned. With the advent of massive online markets powered by data-driven matching platforms, it has become necessary to better understand the interplay between learning and market objectives. We propose a statistical learning model in which one side of the market does not have a priori knowledge about its preferences for the other side and is required to learn these from stochastic rewards. Our model extends the standard multi-armed bandits framework to multiple players, with the added feature that arms have preferences over players. We study both centralized and decentralized approaches to this problem and show surprising exploration-exploitation trade-offs compared to the single player multi-armed bandits setting., Comment: 15 pages, 3 figures. A version appears in the Proceedings of The 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), 2020
Published: 2019

274. Alemtuzumab and CXCL9 levels predict likelihood of sustained engraftment after reduced-intensity conditioning HCT

Author: Geerlinks, Ashley V., Scull, Brooks, Krupski, Christa, Fleischmann, Ryan, Pulsipher, Michael A., Eapen, Mary, Connelly, James A., Bollard, Catherine M., Pai, Sung-Yun, Duncan, Christine N., Kean, Leslie S., Baker, K. Scott, Burroughs, Lauri M., Andolina, Jeffrey R., Shenoy, Shalini, Roehrs, Philip, Hanna, Rabi, Talano, Julie-An, Schultz, Kirk R., Stenger, Elizabeth O., Lin, Howard, Zoref-Lorenz, Adi, McClain, Kenneth L., Jordan, Michael B., Man, Tsz-Kwong, Allen, Carl E., and Marsh, Rebecca A.
Published: 2023
Full Text: View/download PDF

275. Learning to Score Behaviors for Guided Policy Optimization

Author: Pacchiano, Aldo, Parker-Holder, Jack, Tang, Yunhao, Choromanska, Anna, Choromanski, Krzysztof, and Jordan, Michael I.
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: We introduce a new approach for comparing reinforcement learning policies, using Wasserstein distances (WDs) in a newly defined latent behavioral space. We show that by utilizing the dual formulation of the WD, we can learn score functions over policy behaviors that can in turn be used to lead policy optimization towards (or away from) (un)desired behaviors. Combined with smoothed WDs, the dual formulation allows us to devise efficient algorithms that take stochastic gradient descent steps through WD regularizers. We incorporate these regularizers into two novel on-policy algorithms, Behavior-Guided Policy Gradient and Behavior-Guided Evolution Strategies, which we demonstrate can outperform existing methods in a variety of challenging environments. We also provide an open source demo.
Published: 2019

276. ML-LOO: Detecting Adversarial Examples with Feature Attribution

Author: Yang, Puyudi, Chen, Jianbo, Hsieh, Cho-Jui, Wang, Jane-Ling, and Jordan, Michael I.
Subjects: Computer Science - Machine Learning, Computer Science - Cryptography and Security, Statistics - Machine Learning
Abstract: Deep neural networks obtain state-of-the-art performance on a series of tasks. However, they are easily fooled by adding a small adversarial perturbation to input. The perturbation is often human imperceptible on image data. We observe a significant difference in feature attributions of adversarially crafted examples from those of original ones. Based on this observation, we introduce a new framework to detect adversarial examples through thresholding a scale estimate of feature attribution scores. Furthermore, we extend our method to include multi-layer feature attributions in order to tackle the attacks with mixed confidence levels. Through vast experiments, our method achieves superior performances in distinguishing adversarial examples from popular attack methods on a variety of real data sets among state-of-the-art detection methods. In particular, our method is able to detect adversarial examples of mixed confidence levels, and transfer between different attacking methods.
Published: 2019

277. Generalized Momentum-Based Methods: A Hamiltonian Perspective

Author: Diakonikolas, Jelena and Jordan, Michael I.
Subjects: Mathematics - Optimization and Control, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: We take a Hamiltonian-based perspective to generalize Nesterov's accelerated gradient descent and Polyak's heavy ball method to a broad class of momentum methods in the setting of (possibly) constrained minimization in Euclidean and non-Euclidean normed vector spaces. Our perspective leads to a generic and unifying nonasymptotic analysis of convergence of these methods in both the function value (in the setting of convex optimization) and in norm of the gradient (in the setting of unconstrained, possibly nonconvex, optimization). Our approach relies upon a time-varying Hamiltonian that produces generalized momentum methods as its equations of motion. The convergence analysis for these methods is intuitive and is based on the conserved quantities of the time-dependent Hamiltonian., Comment: To appear in SIAM Journal on Optimization. v1 -> v2: minor edits + added funding acknowledgements, v2 -> v3: revised presentation, upon journal revision
Published: 2019

278. On Gradient Descent Ascent for Nonconvex-Concave Minimax Problems

Author: Lin, Tianyi, Jin, Chi, and Jordan, Michael I.
Subjects: Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: We consider nonconvex-concave minimax problems, $\min_{\mathbf{x}} \max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y})$, where $f$ is nonconvex in $\mathbf{x}$ but concave in $\mathbf{y}$ and $\mathcal{Y}$ is a convex and bounded set. One of the most popular algorithms for solving this problem is the celebrated gradient descent ascent (GDA) algorithm, which has been widely used in machine learning, control theory and economics. Despite the extensive convergence results for the convex-concave setting, GDA with equal stepsize can converge to limit cycles or even diverge in a general setting. In this paper, we present the complexity results on two-time-scale GDA for solving nonconvex-concave minimax problems, showing that the algorithm can find a stationary point of the function $\Phi(\cdot) := \max_{\mathbf{y} \in \mathcal{Y}} f(\cdot, \mathbf{y})$ efficiently. To the best of our knowledge, this is the first nonasymptotic analysis for two-time-scale GDA in this setting, shedding light on its superior practical performance in training generative adversarial networks (GANs) and other real applications., Comment: Accepted by ICML 2020; 39 pages, 6 figures
Published: 2019

279. On the Efficiency of Entropic Regularized Algorithms for Optimal Transport

Author: Lin, Tianyi, Ho, Nhat, and Jordan, Michael I.
Subjects: Computer Science - Data Structures and Algorithms, Computer Science - Computational Complexity, Computer Science - Machine Learning, Statistics - Computation, Statistics - Machine Learning
Abstract: We present several new complexity results for the entropic regularized algorithms that approximately solve the optimal transport (OT) problem between two discrete probability measures with at most $n$ atoms. First, we improve the complexity bound of a greedy variant of Sinkhorn, known as \textit{Greenkhorn}, from $\widetilde{O}(n^2\varepsilon^{-3})$ to $\widetilde{O}(n^2\varepsilon^{-2})$. Notably, our result can match the best known complexity bound of Sinkhorn and help clarify why Greenkhorn significantly outperforms Sinkhorn in practice in terms of row/column updates as observed by~\citet{Altschuler-2017-Near}. Second, we propose a new algorithm, which we refer to as \textit{APDAMD} and which generalizes an adaptive primal-dual accelerated gradient descent (APDAGD) algorithm~\citep{Dvurechensky-2018-Computational} with a prespecified mirror mapping $\phi$. We prove that APDAMD achieves the complexity bound of $\widetilde{O}(n^2\sqrt{\delta}\varepsilon^{-1})$ in which $\delta>0$ stands for the regularity of $\phi$. In addition, we show by a counterexample that the complexity bound of $\widetilde{O}(\min\{n^{9/4}\varepsilon^{-1}, n^2\varepsilon^{-2}\})$ proved for APDAGD before is invalid and give a refined complexity bound of $\widetilde{O}(n^{5/2}\varepsilon^{-1})$. Further, we develop a \textit{deterministic} accelerated variant of Sinkhorn via appeal to estimated sequence and prove the complexity bound of $\widetilde{O}(n^{7/3}\varepsilon^{-4/3})$. As such, we see that the accelerated variant of Sinkhorn outperforms Sinkhorn and Greenkhorn in terms of $1/\varepsilon$ and APDAGD and accelerated alternating minimization (AAM)~\citep{Guminov-2021-Combination} in terms of $n$. Finally, we conduct experiments on synthetic and real data and the numerical results show the efficiency of Greenkhorn, APDAMD and accelerated Sinkhorn in practice., Comment: Accepted by Journal of Machine Learning Research; A preliminary version [arXiv:1901.06482] of this paper, with a subset of the results that are presented here, was presented at ICML 2019; 39 pages, 21 figures
Published: 2019

280. Langevin Monte Carlo without smoothness

Author: Chatterji, Niladri S., Diakonikolas, Jelena, Jordan, Michael I., and Bartlett, Peter L.
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning, Statistics - Computation
Abstract: Langevin Monte Carlo (LMC) is an iterative algorithm used to generate samples from a distribution that is known only up to a normalizing constant. The nonasymptotic dependence of its mixing time on the dimension and target accuracy is understood mainly in the setting of smooth (gradient-Lipschitz) log-densities, a serious limitation for applications in machine learning. In this paper, we remove this limitation, providing polynomial-time convergence guarantees for a variant of LMC in the setting of nonsmooth log-concave distributions. At a high level, our results follow by leveraging the implicit smoothing of the log-density that comes from a small Gaussian perturbation that we add to the iterates of the algorithm and controlling the bias and variance that are induced by this perturbation., Comment: Updated to match the AISTATS 2020 camera ready version. Some example applications added and typos corrected
Published: 2019

281. Posterior Distribution for the Number of Clusters in Dirichlet Process Mixture Models

Author: Yang, Chiao-Yu, Xia, Eric, Ho, Nhat, and Jordan, Michael I.
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning, Mathematics - Statistics Theory, 62C10, 62G20, 62G99
Abstract: Dirichlet process mixture models (DPMM) play a central role in Bayesian nonparametrics, with applications throughout statistics and machine learning. DPMMs are generally used in clustering problems where the number of clusters is not known in advance, and the posterior distribution is treated as providing inference for this number. Recently, however, it has been shown that the DPMM is inconsistent in inferring the true number of components in certain cases. This is an asymptotic result, and it would be desirable to understand whether it holds with finite samples, and to more fully understand the full posterior. In this work, we provide a rigorous study for the posterior distribution of the number of clusters in DPMM under different prior distributions on the parameters and constraints on the distributions of the data. We provide novel lower bounds on the ratios of probabilities between $s+1$ clusters and $s$ clusters when the prior distributions on parameters are chosen to be Gaussian or uniform distributions.
Published: 2019

282. Fast Algorithms for Computational Optimal Transport and Wasserstein Barycenter

Author: Guo, Wenshuo, Ho, Nhat, and Jordan, Michael I.
Subjects: Computer Science - Data Structures and Algorithms, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: We provide theoretical complexity analysis for new algorithms to compute the optimal transport (OT) distance between two discrete probability distributions, and demonstrate their favorable practical performance over state-of-the-art primal-dual algorithms and their capability in solving other problems at large scale, such as the Wasserstein barycenter problem for multiple probability distributions. First, we introduce the \emph{accelerated primal-dual randomized coordinate descent} (APDRCD) algorithm for computing the OT distance. We provide its complexity upper bound $\widetilde{\mathcal{O}}(\frac{n^{5/2}}{\varepsilon})$ where $n$ stands for the number of atoms of these probability measures and $\varepsilon > 0$ is the desired accuracy. This complexity bound matches the best known complexities of primal-dual algorithms for the OT problems, including the adaptive primal-dual accelerated gradient descent (APDAGD) and the adaptive primal-dual accelerated mirror descent (APDAMD) algorithms. Then, we demonstrate the better performance of the APDRCD algorithm over the APDAGD and APDAMD algorithms through extensive experimental studies, and further improve its practical performance by proposing a greedy version of it, which we refer to as \emph{accelerated primal-dual greedy coordinate descent} (APDGCD). Finally, we generalize the APDRCD and APDGCD algorithms to distributed algorithms for computing the Wasserstein barycenter for multiple probability distributions., Comment: 18 pages, 35 figures
Published: 2019

283. A Dynamical Systems Perspective on Nesterov Acceleration

Author: Muehlebach, Michael and Jordan, Michael I.
Subjects: Mathematics - Optimization and Control, Computer Science - Machine Learning, Computer Science - Systems and Control, Statistics - Machine Learning
Abstract: We present a dynamical system framework for understanding Nesterov's accelerated gradient method. In contrast to earlier work, our derivation does not rely on a vanishing step size argument. We show that Nesterov acceleration arises from discretizing an ordinary differential equation with a semi-implicit Euler integration scheme. We analyze both the underlying differential equation as well as the discretization to obtain insights into the phenomenon of acceleration. The analysis suggests that a curvature-dependent damping term lies at the heart of the phenomenon. We further establish connections between the discretized and the continuous-time dynamics., Comment: 11 pages, 4 figures, to appear in the Proceedings of the 36th International Conference on Machine Learning
Published: 2019

284. A joint model of unpaired data from scRNA-seq and spatial transcriptomics for imputing missing gene expression measurements

Author: Lopez, Romain, Nazaret, Achille, Langevin, Maxime, Samaran, Jules, Regier, Jeffrey, Jordan, Michael I., and Yosef, Nir
Subjects: Computer Science - Machine Learning, Quantitative Biology - Genomics, Statistics - Machine Learning
Abstract: Spatial studies of transcriptome provide biologists with gene expression maps of heterogeneous and complex tissues. However, most experimental protocols for spatial transcriptomics suffer from the need to select beforehand a small fraction of genes to be quantified over the entire transcriptome. Standard single-cell RNA sequencing (scRNA-seq) is more prevalent, easier to implement and can in principle capture any gene but cannot recover the spatial location of the cells. In this manuscript, we focus on the problem of imputation of missing genes in spatial transcriptomic data based on (unpaired) standard scRNA-seq data from the same biological tissue. Building upon domain adaptation work, we propose gimVI, a deep generative model for the integration of spatial transcriptomic data and scRNA-seq data that can be used to impute missing genes. After describing our generative model and an inference procedure for it, we compare gimVI to alternative methods from computational biology or domain adaptation on real datasets and outperform Seurat Anchors, Liger and CORAL to impute held-out genes., Comment: submitted to the 2019 ICML Workshop on Computational Biology
Published: 2019

285. On Structured Filtering-Clustering: Global Error Bound and Optimal First-Order Algorithms

Author: Ho, Nhat, Lin, Tianyi, and Jordan, Michael I.
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning, Statistics - Computation
Abstract: The filtering-clustering models, including trend filtering and convex clustering, have become an important source of ideas and modeling tools in machine learning and related fields. The statistical guarantee of optimal solutions in these models has been extensively studied, yet investigations of the computational aspect have remained limited. In particular, practitioners often employ first-order algorithms in real-world applications and are impressed by their superior performance regardless of ill-conditioned structures of difference operator matrices, thus leaving open the problem of understanding the convergence property of first-order algorithms. This paper settles this open problem and contributes to the broad interplay between statistics and optimization by identifying a \textit{global error bound} condition, which is satisfied by a large class of dual filtering-clustering problems, and designing a class of \textit{generalized dual gradient ascent} algorithms, which are \textit{optimal} first-order algorithms in deterministic, finite-sum and online settings. Our results are new and help explain why the filtering-clustering models can be efficiently solved by first-order algorithms. We also provide the detailed convergence rate analysis for the proposed algorithms in different settings, shedding light on their potential to solve the filtering-clustering models efficiently. We also conduct experiments on real datasets and the numerical results demonstrate the effectiveness of our algorithms., Comment: Accepted by AISTATS 2022; The first two authors contributed equally to this work; 36 Pages, 18 figures
Published: 2019

286. Bridging Theory and Algorithm for Domain Adaptation

Author: Zhang, Yuchen, Liu, Tianle, Long, Mingsheng, and Jordan, Michael I.
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: This paper addresses the problem of unsupervised domain adaptation from theoretical and algorithmic perspectives. Existing domain adaptation theories naturally imply minimax optimization algorithms, which connect well with the domain adaptation methods based on adversarial learning. However, several disconnections still exist and form the gap between theory and algorithm. We extend previous theories (Mansour et al., 2009c; Ben-David et al., 2010) to multiclass classification in domain adaptation, where classifiers based on the scoring functions and margin loss are standard choices in algorithm design. We introduce Margin Disparity Discrepancy, a novel measurement with rigorous generalization bounds, tailored to the distribution comparison with the asymmetric margin loss, and to the minimax optimization for easier training. Our theory can be seamlessly transformed into an adversarial learning algorithm for domain adaptation, successfully bridging the gap between theory and algorithm. A series of empirical studies show that our algorithm achieves state-of-the-art accuracies on challenging domain adaptation tasks., Comment: Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019
Published: 2019

287. On the Adaptivity of Stochastic Gradient-Based Optimization

Author: Lei, Lihua and Jordan, Michael I.
Subjects: Mathematics - Optimization and Control, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Stochastic-gradient-based optimization has been a core enabling methodology in applications to large-scale problems in machine learning and related areas. Despite the progress, the gap between theory and practice remains significant, with theoreticians pursuing mathematical optimality at a cost of obtaining specialized procedures in different regimes (e.g., modulus of strong convexity, magnitude of target accuracy, signal-to-noise ratio), and with practitioners not readily able to know which regime is appropriate to their problem, and seeking broadly applicable algorithms that are reasonably close to optimality. To bridge these perspectives it is necessary to study algorithms that are adaptive to different regimes. We present the stochastically controlled stochastic gradient (SCSG) method for composite convex finite-sum optimization problems and show that SCSG is adaptive to both strong convexity and target accuracy. The adaptivity is achieved by batch variance reduction with adaptive batch sizes and a novel technique, which we refer to as geometrization, which sets the length of each epoch as a geometric random variable. The algorithm achieves strictly better theoretical complexity than other existing adaptive algorithms, while the tuning parameters of the algorithm only depend on the smoothness parameter of the objective., Comment: Accepted by SIAM Journal on Optimization; 54 pages
Published: 2019

288. HopSkipJumpAttack: A Query-Efficient Decision-Based Attack

Author: Chen, Jianbo, Jordan, Michael I., and Wainwright, Martin J.
Subjects: Computer Science - Machine Learning, Computer Science - Cryptography and Security, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: The goal of a decision-based adversarial attack on a trained model is to generate adversarial examples based solely on observing output labels returned by the targeted model. We develop HopSkipJumpAttack, a family of algorithms based on a novel estimate of the gradient direction using binary information at the decision boundary. The proposed family includes both untargeted and targeted attacks optimized for $\ell_2$ and $\ell_\infty$ similarity metrics respectively. Theoretical analysis is provided for the proposed algorithms and the gradient direction estimate. Experiments show HopSkipJumpAttack requires significantly fewer model queries than Boundary Attack. It also achieves competitive performance in attacking several widely-used defense mechanisms. (HopSkipJumpAttack was named Boundary Attack++ in a previous version of the preprint.)
Published: 2019

289. MLSys: The New Frontier of Machine Learning Systems

Author: Ratner, Alexander, Alistarh, Dan, Alonso, Gustavo, Andersen, David G., Bailis, Peter, Bird, Sarah, Carlini, Nicholas, Catanzaro, Bryan, Chayes, Jennifer, Chung, Eric, Dally, Bill, Dean, Jeff, Dhillon, Inderjit S., Dimakis, Alexandros, Dubey, Pradeep, Elkan, Charles, Fursin, Grigori, Ganger, Gregory R., Getoor, Lise, Gibbons, Phillip B., Gibson, Garth A., Gonzalez, Joseph E., Gottschlich, Justin, Han, Song, Hazelwood, Kim, Huang, Furong, Jaggi, Martin, Jamieson, Kevin, Jordan, Michael I., Joshi, Gauri, Khalaf, Rania, Knight, Jason, Konečný, Jakub, Kraska, Tim, Kumar, Arun, Kyrillidis, Anastasios, Lakshmiratan, Aparna, Li, Jing, Madden, Samuel, McMahan, H. Brendan, Meijer, Erik, Mitliagkas, Ioannis, Monga, Rajat, Murray, Derek, Olukotun, Kunle, Papailiopoulos, Dimitris, Pekhimenko, Gennady, Rekatsinas, Theodoros, Rostamizadeh, Afshin, Ré, Christopher, De Sa, Christopher, Sedghi, Hanie, Sen, Siddhartha, Smith, Virginia, Smola, Alex, Song, Dawn, Sparks, Evan, Stoica, Ion, Sze, Vivienne, Udell, Madeleine, Vanschoren, Joaquin, Venkataraman, Shivaram, Vinayak, Rashmi, Weimer, Markus, Wilson, Andrew Gordon, Xing, Eric, Zaharia, Matei, Zhang, Ce, and Talwalkar, Ameet
Subjects: Computer Science - Machine Learning, Computer Science - Databases, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Software Engineering, Statistics - Machine Learning
Abstract: Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a new systems machine learning research community at the intersection of the traditional systems and ML communities, focused on topics such as hardware systems for ML, software systems for ML, and ML optimized for metrics beyond predictive accuracy. To do this, we describe a new conference, MLSys, that explicitly targets research at the intersection of systems and machine learning with a program committee split evenly between experts in systems and ML, and an explicit focus on topics at the intersection of the two.
Published: 2019

290. On Nonconvex Optimization for Machine Learning: Gradients, Stochasticity, and Saddle Points

Author: Jin, Chi, Netrapalli, Praneeth, Ge, Rong, Kakade, Sham M., and Jordan, Michael I.
Subjects: Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: Gradient descent (GD) and stochastic gradient descent (SGD) are the workhorses of large-scale machine learning. While classical theory focused on analyzing the performance of these methods in convex optimization problems, the most notable successes in machine learning have involved nonconvex optimization, and a gap has arisen between theory and practice. Indeed, traditional analyses of GD and SGD show that both algorithms converge to stationary points efficiently. But these analyses do not take into account the possibility of converging to saddle points. More recent theory has shown that GD and SGD can avoid saddle points, but the dependence on dimension in these analyses is polynomial. For modern machine learning, where the dimension can be in the millions, such dependence would be catastrophic. We analyze perturbed versions of GD and SGD and show that they are truly efficient---their dimension dependence is only polylogarithmic. Indeed, these algorithms converge to second-order stationary points in essentially the same time as they take to converge to classical first-order stationary points., Comment: A preliminary version of this paper, with a subset of the results that are presented here, was presented at ICML 2017 (also as arXiv:1703.00887)
Published: 2019

291. LS-Tree: Model Interpretation When the Data Are Linguistic

Author: Chen, Jianbo and Jordan, Michael I.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Statistics - Machine Learning
Abstract: We study the problem of interpreting trained classification models in the setting of linguistic data sets. Leveraging a parse tree, we propose to assign least-squares based importance scores to each word of an instance by exploiting syntactic constituency structure. We establish an axiomatic characterization of these importance scores by relating them to the Banzhaf value in coalitional game theory. Based on these importance scores, we develop a principled method for detecting and quantifying interactions between words in a sentence. We demonstrate that the proposed method can aid in interpretability and diagnostics for several widely-used language models.
Published: 2019

292. A Short Note on Concentration Inequalities for Random Vectors with SubGaussian Norm

Author: Jin, Chi, Netrapalli, Praneeth, Ge, Rong, Kakade, Sham M., and Jordan, Michael I.
Subjects: Mathematics - Probability, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: In this note, we derive concentration inequalities for random vectors with subGaussian norm (a generalization of both subGaussian random vectors and norm bounded random vectors), which are tight up to logarithmic factors.
Published: 2019

293. Acceleration via Symplectic Discretization of High-Resolution Differential Equations

Author: Shi, Bin, Du, Simon S., Su, Weijie J., and Jordan, Michael I.
Subjects: Mathematics - Optimization and Control, Computer Science - Machine Learning, Mathematics - Numerical Analysis, Statistics - Machine Learning
Abstract: We study first-order optimization methods obtained by discretizing ordinary differential equations (ODEs) corresponding to Nesterov's accelerated gradient methods (NAGs) and Polyak's heavy-ball method. We consider three discretization schemes: an explicit Euler scheme, an implicit Euler scheme, and a symplectic scheme. We show that the optimization algorithm generated by applying the symplectic scheme to a high-resolution ODE proposed by Shi et al. [2018] achieves an accelerated rate for minimizing smooth strongly convex functions. On the other hand, the resulting algorithm either fails to achieve acceleration or is impractical when the scheme is implicit, the ODE is low-resolution, or the scheme is explicit., Comment: Published in NeurIPS 2019
Published: 2019

294. Cost-Effective Incentive Allocation via Structured Counterfactual Inference

Author: Lopez, Romain, Li, Chenchen, Yan, Xiang, Xiong, Junwu, Jordan, Michael I., Qi, Yuan, and Song, Le
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: We address a practical problem ubiquitous in modern marketing campaigns, in which a central agent tries to learn a policy for allocating strategic financial incentives to customers and observes only bandit feedback. In contrast to traditional policy optimization frameworks, we take into account the additional reward structure and budget constraints common in this setting, and develop a new two-step method for solving this constrained counterfactual policy optimization problem. Our method first casts the reward estimation problem as a domain adaptation problem with supplementary structure, and then subsequently uses the estimators for optimizing the policy with constraints. We also establish theoretical error bounds for our estimation procedure and we empirically show that the approach leads to significant improvement on both synthetic and real datasets.
Published: 2019

295. Is There an Analog of Nesterov Acceleration for MCMC?

Author: Ma, Yi-An, Chatterji, Niladri, Cheng, Xiang, Flammarion, Nicolas, Bartlett, Peter, and Jordan, Michael I.
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning, Mathematics - Numerical Analysis
Abstract: We formulate gradient-based Markov chain Monte Carlo (MCMC) sampling as optimization on the space of probability measures, with Kullback-Leibler (KL) divergence as the objective functional. We show that an underdamped form of the Langevin algorithm performs accelerated gradient descent in this metric. To characterize the convergence of the algorithm, we construct a Lyapunov functional and exploit hypocoercivity of the underdamped Langevin algorithm. As an application, we show that accelerated rates can be obtained for a class of nonconvex functions with the Langevin algorithm.
Published: 2019

296. Quantitative Weak Convergence for Discrete Stochastic Processes

Author: Cheng, Xiang, Bartlett, Peter L., and Jordan, Michael I.
Subjects: Mathematics - Statistics Theory, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: In this paper, we study quantitative convergence in $W_2$ for a family of Langevin-like stochastic processes that includes stochastic gradient descent and related gradient-based algorithms. Under certain regularity assumptions, we show that the iterates of these stochastic processes converge to an invariant distribution at a rate of $\tilde{O}\left(1/\sqrt{k}\right)$ where $k$ is the number of steps; this rate is provably tight up to log factors. Our result reduces to a quantitative form of the classical Central Limit Theorem in the special case when the potential is quadratic.
Published: 2019

297. What is Local Optimality in Nonconvex-Nonconcave Minimax Optimization?

Author: Jin, Chi, Netrapalli, Praneeth, and Jordan, Michael I.
Subjects: Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: Minimax optimization has found extensive applications in modern machine learning, in settings such as generative adversarial networks (GANs), adversarial training and multi-agent reinforcement learning. As most of these applications involve continuous nonconvex-nonconcave formulations, a very basic question arises---"what is a proper definition of local optima?" Most previous work answers this question using classical notions of equilibria from simultaneous games, where the min-player and the max-player act simultaneously. In contrast, most applications in machine learning, including GANs and adversarial training, correspond to sequential games, where the order of which player acts first is crucial (since minimax is in general not equal to maximin due to the nonconvex-nonconcave nature of the problems). The main contribution of this paper is to propose a proper mathematical definition of local optimality for this sequential setting---local minimax, as well as to present its properties and existence results. Finally, we establish a strong connection to a basic local search algorithm---gradient descent ascent (GDA): under mild conditions, all stable limit points of GDA are exactly local minimax points up to some degenerate points., Comment: This paper has been published at ICML 2020. This new version makes a correction to Proposition 19 and adds more related work
Published: 2019

298. Sharp Analysis of Expectation-Maximization for Weakly Identifiable Models

Author: Dwivedi, Raaz, Ho, Nhat, Khamaru, Koulik, Wainwright, Martin J., Jordan, Michael I., and Yu, Bin
Subjects: Mathematics - Statistics Theory, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: We study a class of weakly identifiable location-scale mixture models for which the maximum likelihood estimates based on $n$ i.i.d. samples are known to have lower accuracy than the classical $n^{- \frac{1}{2}}$ error. We investigate whether the Expectation-Maximization (EM) algorithm also converges slowly for these models. We provide a rigorous characterization of EM for fitting a weakly identifiable Gaussian mixture in a univariate setting where we prove that the EM algorithm converges in order $n^{\frac{3}{4}}$ steps and returns estimates that are at a Euclidean distance of order ${ n^{- \frac{1}{8}}}$ and ${ n^{-\frac{1} {4}}}$ from the true location and scale parameter respectively. Establishing the slow rates in the univariate setting requires a novel localization argument with two stages, with each stage involving an epoch-based argument applied to a different surrogate EM operator at the population level. We demonstrate several multivariate ($d \geq 2$) examples that exhibit the same slow rates as the univariate case. We also prove slow statistical rates in higher dimensions in a special case, when the fitted covariance is constrained to be a multiple of the identity., Comment: 30 pages, 4 figures. The first three authors contributed equally to this work. To appear in AISTATS 2020
Published: 2019

299. Theoretically Principled Trade-off between Robustness and Accuracy

Author: Zhang, Hongyang, Yu, Yaodong, Jiao, Jiantao, Xing, Eric P., Ghaoui, Laurent El, and Jordan, Michael I.
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: We identify a trade-off between robustness and accuracy that serves as a guiding principle in the design of defenses against adversarial examples. Although this problem has been widely studied empirically, much remains unknown concerning the theory underlying this trade-off. In this work, we decompose the prediction error for adversarial examples (robust error) as the sum of the natural (classification) error and boundary error, and provide a differentiable upper bound using the theory of classification-calibrated loss, which is shown to be the tightest possible upper bound uniform over all probability distributions and measurable predictors. Inspired by our theoretical analysis, we also design a new defense method, TRADES, to trade adversarial robustness off against accuracy. Our proposed algorithm performs well experimentally in real-world datasets. The methodology is the foundation of our entry to the NeurIPS 2018 Adversarial Vision Challenge in which we won the 1st place out of ~2,000 submissions, surpassing the runner-up approach by $11.41\%$ in terms of mean $\ell_2$ perturbation distance., Comment: Appeared in ICML 2019; the winning methodology of the NeurIPS 2018 Adversarial Vision Challenge
Published: 2019

300. On Efficient Optimal Transport: An Analysis of Greedy and Accelerated Mirror Descent Algorithms

Author: Lin, Tianyi, Ho, Nhat, and Jordan, Michael I.
Subjects: Computer Science - Data Structures and Algorithms
Abstract: We provide theoretical analyses for two algorithms that solve the regularized optimal transport (OT) problem between two discrete probability measures with at most $n$ atoms. We show that a greedy variant of the classical Sinkhorn algorithm, known as the \emph{Greenkhorn algorithm}, can be improved to $\widetilde{\mathcal{O}}(n^2\varepsilon^{-2})$, improving on the best known complexity bound of $\widetilde{\mathcal{O}}(n^2\varepsilon^{-3})$. Notably, this matches the best known complexity bound for the Sinkhorn algorithm and helps explain why the Greenkhorn algorithm can outperform the Sinkhorn algorithm in practice. Our proof technique, which is based on a primal-dual formulation and a novel upper bound for the dual solution, also leads to a new class of algorithms that we refer to as \emph{adaptive primal-dual accelerated mirror descent} (APDAMD) algorithms. We prove that the complexity of these algorithms is $\widetilde{\mathcal{O}}(n^2\sqrt{\delta}\varepsilon^{-1})$, where $\delta > 0$ refers to the inverse of the strong convexity module of Bregman divergence with respect to $\|\cdot\|_\infty$. This implies that the APDAMD algorithm is faster than the Sinkhorn and Greenkhorn algorithms in terms of $\varepsilon$. Experimental results on synthetic and real datasets demonstrate the favorable performance of the Greenkhorn and APDAMD algorithms in practice., Comment: Derive the explicit dual objective function for APDAMD (Remark 4.2) which satisfies Lemma~4.1; Accepted by ICML 2019; The longer version is available here: arXiv:1906.01437
Published: 2019
