Author: "Defazio, Aaron" / Database: OAIster - Searchworks@Jio Institute Digital Library Search Results

1. Directional Smoothness and Gradient Methods: Convergence and Adaptivity

Author: Mishkin, Aaron, Khaled, Ahmed, Wang, Yuanhao, Defazio, Aaron, and Gower, Robert M.
Abstract: We develop new sub-optimality bounds for gradient descent (GD) that depend on the conditioning of the objective along the path of optimization, rather than on global, worst-case constants. Key to our proofs is directional smoothness, a measure of gradient variation that we use to develop upper-bounds on the objective. Minimizing these upper-bounds requires solving implicit equations to obtain a sequence of strongly adapted step-sizes; we show that these equations are straightforward to solve for convex quadratics and lead to new guarantees for two classical step-sizes. For general functions, we prove that the Polyak step-size and normalized GD obtain fast, path-dependent rates despite using no knowledge of the directional smoothness. Experiments on logistic regression show our convergence guarantees are tighter than the classical theory based on L-smoothness. Comment: Twenty-four pages
Published: 2024
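
Note: the abstract above refers to the Polyak step-size without stating it. As a minimal sketch, assuming the classical form eta = (f(x) - f*) / ||grad f(x)||^2 with a known optimal value f*, one step could look like the following. The function names, the `f_star` parameter, and the quadratic test objective are illustrative assumptions and are not taken from the paper.

```python
from typing import Callable, List

Vector = List[float]


def polyak_step(
    x: Vector,
    f: Callable[[Vector], float],
    grad: Callable[[Vector], Vector],
    f_star: float = 0.0,
) -> Vector:
    """One gradient step using the classical Polyak step-size.

    eta = (f(x) - f_star) / ||grad f(x)||^2, where f_star is the optimal
    objective value (assumed known for this sketch).
    """
    g = grad(x)
    g_norm_sq = sum(gi * gi for gi in g)
    if g_norm_sq == 0.0:
        return list(x)  # zero gradient: already at a stationary point
    eta = (f(x) - f_star) / g_norm_sq
    return [xi - eta * gi for xi, gi in zip(x, g)]


def quadratic(x: Vector) -> float:
    """Hypothetical test objective 0.5 * ||x||^2, minimized at 0 with value 0."""
    return 0.5 * sum(xi * xi for xi in x)


def quadratic_grad(x: Vector) -> Vector:
    """Gradient of the test objective, which is x itself."""
    return list(x)


x = [4.0, -2.0]
for _ in range(3):
    x = polyak_step(x, quadratic, quadratic_grad)
# On this quadratic the step-size is always 0.5, so each step halves x.
```

The step-size needs no smoothness constant, only the current loss, gradient norm, and the optimal value, which is the property the abstract highlights.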

2. The Road Less Scheduled

Author: Defazio, Aaron, Xingyu, Yang, Mehta, Harsh, Mishchenko, Konstantin, Khaled, Ahmed, and Cutkosky, Ashok
Abstract: Existing learning rate schedules that do not require specification of the optimization stopping step T are greatly out-performed by learning rate schedules that depend on T. We propose an approach that avoids the need for this stopping time by eschewing the use of schedules entirely, while exhibiting state-of-the-art performance compared to schedules across a wide family of problems ranging from convex problems to large-scale deep learning problems. Our Schedule-Free approach introduces no additional hyper-parameters over standard optimizers with momentum. Our method is a direct consequence of a new theory we develop that unifies scheduling and iterate averaging. An open source implementation of our method is available (https://github.com/facebookresearch/schedule_free).
Published: 2024

3. Learning-Rate-Free Learning by D-Adaptation

Author: Defazio, Aaron and Mishchenko, Konstantin
Abstract: D-Adaptation is an approach to automatically setting the learning rate which asymptotically achieves the optimal rate of convergence for minimizing convex Lipschitz functions, with no back-tracking or line searches, and no additional function value or gradient evaluations per step. Our approach is the first hyper-parameter free method for this class without additional multiplicative log factors in the convergence rate. We present extensive experiments for SGD and Adam variants of our method, where the method automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems, including large-scale vision and language problems. An open-source implementation is available.
Published: 2023

4. Mechanic: A Learning Rate Tuner

Author: Cutkosky, Ashok, Defazio, Aaron, and Mehta, Harsh
Abstract: We introduce a technique for tuning the learning rate scale factor of any base optimization algorithm and schedule automatically, which we call Mechanic. Our method provides a practical realization of recent theoretical reductions for accomplishing a similar goal in online convex optimization. We rigorously evaluate Mechanic on a range of large scale deep learning tasks with varying batch sizes, schedules, and base optimization algorithms. These experiments demonstrate that depending on the problem, Mechanic either comes very close to, matches or even improves upon manual tuning of learning rates.
Published: 2023

5. MoMo: Momentum Models for Adaptive Learning Rates

Author: Schaipp, Fabian, Ohana, Ruben, Eickenberg, Michael, Defazio, Aaron, and Gower, Robert M.
Abstract: Training a modern machine learning architecture on a new task requires extensive learning-rate tuning, which comes at a high computational cost. Here we develop new Polyak-type adaptive learning rates that can be used on top of any momentum method, and require less tuning to perform well. We first develop MoMo, a Momentum Model based adaptive learning rate for SGD-M (stochastic gradient descent with momentum). MoMo uses momentum estimates of the losses and gradients sampled at each iteration to build a model of the loss function. Our model makes use of any known lower bound of the loss function by using truncation, e.g. most losses are lower-bounded by zero. The model is then approximately minimized at each iteration to compute the next step. We show how MoMo can be used in combination with any momentum-based method, and showcase this by developing MoMo-Adam, which is Adam with our new model-based adaptive learning rate. We show that MoMo attains a $\mathcal{O}(1/\sqrt{K})$ convergence rate for convex problems with interpolation, needing knowledge of no problem-specific quantities other than the optimal value. Additionally, for losses with unknown lower bounds, we develop on-the-fly estimates of a lower bound, that are incorporated in our model. We show that MoMo and MoMo-Adam improve over SGD-M and Adam in terms of robustness to hyperparameter tuning for training image classifiers on MNIST, CIFAR, and Imagenet, for recommender systems on Criteo, for a transformer model on the translation task IWSLT14, and for a diffusion model.
Published: 2023

6. When, Why and How Much? Adaptive Learning Rate Scheduling by Refinement

Author: Defazio, Aaron, Cutkosky, Ashok, Mehta, Harsh, and Mishchenko, Konstantin
Abstract: Learning rate schedules used in practice bear little resemblance to those recommended by theory. We close much of this theory/practice gap, and as a consequence are able to derive new problem-adaptive learning rate schedules. Our key technical contribution is a refined analysis of learning rate schedules for a wide class of optimization algorithms (including SGD). In contrast to most prior works that study the convergence of the average iterate, we study the last iterate, which is what most people use in practice. When considering only worst-case analysis, our theory predicts that the best choice is the linear decay schedule: a popular choice in practice that sets the stepsize proportionally to $1 - t/T$, where $t$ is the current iteration and $T$ is the total number of steps. To go beyond this worst-case analysis, we use the observed gradient norms to derive schedules refined for any particular task. These refined schedules exhibit learning rate warm-up and rapid learning rate annealing near the end of training. Ours is the first systematic approach to automatically yield both of these properties. We perform the most comprehensive evaluation of learning rate schedules to date, evaluating across 10 diverse deep learning problems, a series of LLMs, and a suite of logistic regression problems. We validate that overall, the linear-decay schedule matches or outperforms all commonly used default schedules including cosine annealing, and that our schedule refinement method gives further improvements.
Published: 2023
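
Note: the linear decay schedule named in the abstract above sets the step size proportionally to 1 - t/T. A minimal sketch of that formula follows; the function name and the `base_lr` scaling factor are illustrative assumptions, and the paper's refined, gradient-norm-based schedules are not reproduced here.

```python
def linear_decay(base_lr: float, t: int, total_steps: int) -> float:
    """Step size proportional to 1 - t/T.

    base_lr is the (assumed) peak step size at t = 0, t is the current
    iteration, and total_steps is the total number of steps T.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0 <= t <= total_steps:
        raise ValueError("t must satisfy 0 <= t <= total_steps")
    return base_lr * (1.0 - t / total_steps)


# Step sizes for a hypothetical 10-step run starting from 0.1.
schedule = [linear_decay(0.1, t, 10) for t in range(11)]
```

The schedule starts at `base_lr` and reaches zero exactly at step T, which is why it requires the stopping step to be known in advance.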

7. Prodigy: An Expeditiously Adaptive Parameter-Free Learner

Author: Mishchenko, Konstantin and Defazio, Aaron
Abstract: We consider the problem of estimating the learning rate in adaptive methods, such as AdaGrad and Adam. We propose Prodigy, an algorithm that provably estimates the distance to the solution $D$, which is needed to set the learning rate optimally. At its core, Prodigy is a modification of the D-Adaptation method for learning-rate-free learning. It improves upon the convergence rate of D-Adaptation by a factor of $O(\sqrt{\log(D/d_0)})$, where $d_0$ is the initial estimate of $D$. We test Prodigy on 12 common logistic-regression benchmark datasets, VGG11 and ResNet-50 training on CIFAR10, ViT training on Imagenet, LSTM training on IWSLT14, DLRM training on Criteo dataset, VarNet on Knee MRI dataset, as well as RoBERTa and GPT transformer training on BookWiki. Our experimental results show that our approach consistently outperforms D-Adaptation and reaches test accuracy values close to that of hand-tuned Adam.
Published: 2023

8. Grad-GradaGrad? A Non-Monotone Adaptive Stochastic Gradient Method

Author: Defazio, Aaron, Zhou, Baoyu, and Xiao, Lin
Abstract: The classical AdaGrad method adapts the learning rate by dividing by the square root of a sum of squared gradients. Because this sum on the denominator is increasing, the method can only decrease step sizes over time, and requires a learning rate scaling hyper-parameter to be carefully tuned. To overcome this restriction, we introduce GradaGrad, a method in the same family that naturally grows or shrinks the learning rate based on a different accumulation in the denominator, one that can both increase and decrease. We show that it obeys a similar convergence rate as AdaGrad and demonstrate its non-monotone adaptation capability with experiments.
Published: 2022
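
Note: the abstract above describes classical AdaGrad as dividing the learning rate by the square root of a running sum of squared gradients. The sketch below illustrates only that classical, per-coordinate update and why its effective step sizes can never grow; it is not GradaGrad. The function name, the `eps` stabilizer, and the quadratic test objective are illustrative assumptions.

```python
import math
from typing import List, Tuple


def adagrad_step(
    x: List[float],
    grad: List[float],
    sum_sq: List[float],
    lr: float = 0.5,
    eps: float = 1e-8,
) -> Tuple[List[float], List[float]]:
    """One per-coordinate AdaGrad update.

    sum_sq accumulates squared gradients; each coordinate's step is
    lr * g / (sqrt(sum_sq) + eps), so the effective rate only shrinks.
    """
    new_sum_sq = [s + g * g for s, g in zip(sum_sq, grad)]
    new_x = [
        xi - lr * g / (math.sqrt(s) + eps)
        for xi, g, s in zip(x, grad, new_sum_sq)
    ]
    return new_x, new_sum_sq


# Minimize the hypothetical objective 0.5 * ||x||^2, whose gradient is x.
x = [1.0, -2.0]
sum_sq = [0.0, 0.0]
effective_lrs = []
for _ in range(5):
    grad = list(x)
    x, sum_sq = adagrad_step(x, grad, sum_sq)
    # Record lr / (sqrt(sum_sq) + eps) for each coordinate after the update.
    effective_lrs.append([0.5 / (math.sqrt(s) + 1e-8) for s in sum_sq])
```

Because `sum_sq` is a sum of non-negative terms, every entry of `effective_lrs` is non-increasing over time, which is the restriction the abstract says GradaGrad is designed to lift.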

9. Stochastic Polyak Stepsize with a Moving Target

Author: Gower, Robert M., Defazio, Aaron, and Rabbat, Michael
Abstract: We propose a new stochastic gradient method called MOTAPS (Moving Targetted Polyak Stepsize) that uses recorded past loss values to compute adaptive stepsizes. MOTAPS can be seen as a variant of the Stochastic Polyak (SP) method, which also uses loss values to adjust the stepsize. The downside to the SP method is that it only converges when the interpolation condition holds. MOTAPS is an extension of SP that does not rely on the interpolation condition. The MOTAPS method uses $n$ auxiliary variables, one for each data point, that track the loss value for each data point. We provide a global convergence theory for SP, an intermediary method TAPS, and MOTAPS by showing that they all can be interpreted as a special variant of online SGD. We also perform several numerical experiments on convex learning problems, and deep learning models for image classification and language translation. In all of our tasks we show that MOTAPS is competitive with the relevant baseline method. Comment: 49 pages, 13 figures, 1 table
Published: 2021

10. Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization

Author: Defazio, Aaron and Jelassi, Samy
Abstract: We introduce MADGRAD, a novel optimization method in the family of AdaGrad adaptive gradient methods. MADGRAD shows excellent performance on deep learning optimization problems from multiple fields, including classification and image-to-image tasks in vision, and recurrent and bidirectionally-masked models in natural language processing. For each of these tasks, MADGRAD matches or outperforms both SGD and ADAM in test set performance, even on problems for which adaptive methods normally perform poorly.
Published: 2021

11. Almost sure convergence rates for Stochastic Gradient Descent and Stochastic Heavy Ball

Author: Sebbouh, Othmane, Gower, Robert M., and Defazio, Aaron
Abstract: We study stochastic gradient descent (SGD) and the stochastic heavy ball method (SHB, otherwise known as the momentum method) for the general stochastic approximation problem. For SGD, in the convex and smooth setting, we provide the first almost sure asymptotic convergence rates for a weighted average of the iterates. More precisely, we show that the convergence rate of the function values is arbitrarily close to $o(1/\sqrt{k})$, and is exactly $o(1/k)$ in the so-called overparametrized case. We show that these results still hold when using stochastic line search and stochastic Polyak stepsizes, thereby giving the first proof of convergence of these methods in the non-overparametrized regime. Using a substantially different analysis, we show that these rates hold for SHB as well, but at the last iterate. This distinction is important because it is the last iterate of SGD and SHB which is used in practice. We also show that the last iterate of SHB converges to a minimizer almost surely. Additionally, we prove that the function values of the deterministic HB converge at a $o(1/k)$ rate, which is faster than the previously known $O(1/k)$. Finally, in the nonconvex setting, we prove similar rates on the lowest gradient norm along the trajectory of SGD.
Published: 2020

12. The Power of Factorial Powers: New Parameter settings for (Stochastic) Optimization

Author: Defazio, Aaron and Gower, Robert M.
Abstract: The convergence rates for convex and non-convex optimization methods depend on the choice of a host of constants, including step sizes, Lyapunov function constants and momentum constants. In this work we propose the use of factorial powers as a flexible tool for defining constants that appear in convergence proofs. We list a number of remarkable properties that these sequences enjoy, and show how they can be applied to convergence proofs to simplify or improve the convergence rates of the momentum method, accelerated gradient and the stochastic variance reduced method (SVRG).
Published: 2020

13. End-to-End Variational Networks for Accelerated MRI Reconstruction

Author: Sriram, Anuroop, Zbontar, Jure, Murrell, Tullie, Defazio, Aaron, Zitnick, C. Lawrence, Yakubova, Nafissa, Knoll, Florian, and Johnson, Patricia
Abstract: The slow acquisition speed of magnetic resonance imaging (MRI) has led to the development of two complementary methods: acquiring multiple views of the anatomy simultaneously (parallel imaging) and acquiring fewer samples than necessary for traditional signal processing methods (compressed sensing). While the combination of these methods has the potential to allow much faster scan times, reconstruction from such undersampled multi-coil data has remained an open problem. In this paper, we present a new approach to this problem that extends previously proposed variational methods by learning fully end-to-end. Our method obtains new state-of-the-art results on the fastMRI dataset for both brain and knee MRIs.
Published: 2020

14. MRI Banding Removal via Adversarial Training

Author: Defazio, Aaron, Murrell, Tullie, and Recht, Michael P.
Abstract: MRI images reconstructed from sub-sampled Cartesian data using deep learning techniques often show a characteristic banding (sometimes described as streaking), which is particularly strong in low signal-to-noise regions of the reconstructed image. In this work, we propose the use of an adversarial loss that penalizes banding structures without requiring any human annotation. Our technique greatly reduces the appearance of banding, without requiring any additional computation or post-processing at reconstruction time. We report the results of a blind comparison against a strong baseline by a group of expert evaluators (board-certified radiologists), where our approach is ranked superior at banding removal with no statistically significant loss of detail.
Published: 2020

15. Advancing machine learning for MR image reconstruction with an open competition: Overview of the 2019 fastMRI challenge

Author: Knoll, Florian, Murrell, Tullie, Sriram, Anuroop, Yakubova, Nafissa, Zbontar, Jure, Rabbat, Michael, Defazio, Aaron, Muckley, Matthew J., Sodickson, Daniel K., Zitnick, C. Lawrence, and Recht, Michael P.
Abstract: Purpose: To advance research in the field of machine learning for MR image reconstruction with an open challenge. Methods: We provided participants with a dataset of raw k-space data from 1,594 consecutive clinical exams of the knee. The goal of the challenge was to reconstruct images from these data. In order to strike a balance between realistic data and a shallow learning curve for those not already familiar with MR image reconstruction, we ran multiple tracks for multi-coil and single-coil data. We performed a two-stage evaluation based on quantitative image metrics followed by evaluation by a panel of radiologists. The challenge ran from June to December of 2019. Results: We received a total of 33 challenge submissions. All participants chose to submit results from supervised machine learning approaches. Conclusion: The challenge led to new developments in machine learning for image reconstruction, provided insight into the current state of the art in the field, and highlighted remaining hurdles for clinical adoption.
Published: 2020

16. Dual Averaging is Surprisingly Effective for Deep Learning Optimization

Author: Jelassi, Samy and Defazio, Aaron
Abstract: First-order stochastic optimization methods are currently the most widely used class of methods for training deep neural networks. However, the choice of the optimizer has become an ad-hoc rule that can significantly affect the performance. For instance, SGD with momentum (SGD+M) is typically used in computer vision (CV) and Adam is used for training transformer models for Natural Language Processing (NLP). Using the wrong method can lead to significant performance degradation. Inspired by the dual averaging algorithm, we propose Modernized Dual Averaging (MDA), an optimizer that is able to perform as well as SGD+M in CV and as Adam in NLP. Our method is not adaptive and is significantly simpler than Adam. We show that MDA induces a decaying uncentered $L_2$-regularization compared to vanilla SGD+M and hypothesize that this may explain why it works on NLP problems where SGD+M fails.
Published: 2020

17. Momentum via Primal Averaging: Theoretical Insights and Learning Rate Schedules for Non-Convex Optimization

Author: Defazio, Aaron
Abstract: Momentum methods are now used pervasively within the machine learning community for training non-convex models such as deep neural networks. Empirically, they outperform traditional stochastic gradient descent (SGD) approaches. In this work we develop a Lyapunov analysis of SGD with momentum (SGD+M), by utilizing an equivalent rewriting of the method known as the stochastic primal averaging (SPA) form. This analysis is much tighter than previous theory in the non-convex case, and due to this we are able to give precise insights into when SGD+M may out-perform SGD, and what hyper-parameter schedules will work and why.
Published: 2020

18. Offset Sampling Improves Deep Learning based Accelerated MRI Reconstructions by Exploiting Symmetry

Author: Defazio, Aaron
Abstract: Deep learning approaches to accelerated MRI take a matrix of sampled Fourier-space lines as input and produce a spatial image as output. In this work we show that by careful choice of the offset used in the sampling procedure, the symmetries in k-space can be better exploited, producing higher quality reconstructions than given by standard equally-spaced samples or randomized samples motivated by compressed sensing.
Published: 2019

19. GrappaNet: Combining Parallel Imaging with Deep Learning for Multi-Coil MRI Reconstruction

Author: Sriram, Anuroop, Zbontar, Jure, Murrell, Tullie, Zitnick, C. Lawrence, Defazio, Aaron, and Sodickson, Daniel K.
Abstract: Magnetic Resonance Image (MRI) acquisition is an inherently slow process which has spurred the development of two different acceleration methods: acquiring multiple correlated samples simultaneously (parallel imaging) and acquiring fewer samples than necessary for traditional signal processing methods (compressed sensing). Both methods provide complementary approaches to accelerating the speed of MRI acquisition. In this paper, we present a novel method to integrate traditional parallel imaging methods into deep neural networks that is able to generate high quality reconstructions even for high acceleration factors. The proposed method, called GrappaNet, performs progressive reconstruction by first using a neural network to map the reconstruction problem to a simpler one that can be solved by a traditional parallel imaging method, followed by an application of a parallel imaging method, and finally fine-tuning the output with another neural network. The entire network can be trained end-to-end. We present experimental results on the recently released fastMRI dataset and show that GrappaNet can generate higher quality reconstructions than competing methods for both $4\times$ and $8\times$ acceleration.
Published: 2019

20. Beyond Folklore: A Scaling Calculus for the Design and Initialization of ReLU Networks

Author: Defazio, Aaron and Bottou, Léon
Abstract: We propose a system for calculating a "scaling constant" for layers and weights of neural networks. We relate this scaling constant to two important quantities that relate to the optimizability of neural networks, and argue that a network that is "preconditioned" via scaling, in the sense that all weights have the same scaling constant, will be easier to train. This scaling calculus results in a number of consequences, among them the fact that the geometric mean of the fan-in and fan-out, rather than the fan-in, fan-out, or arithmetic mean, should be used for the initialization of the variance of weights in a neural network. Our system allows for the off-line design & engineering of ReLU neural networks, potentially replacing blind experimentation.
Published: 2019
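
Note: the abstract above argues that weight variance at initialization should scale with the geometric mean of fan-in and fan-out, rather than with fan-in, fan-out, or their arithmetic mean. The sketch below only compares those four denominators side by side. The function name, the `mode` strings, and the numerator `gain` are illustrative assumptions; the abstract does not state the paper's exact constant.

```python
import math


def init_variance(
    fan_in: int,
    fan_out: int,
    mode: str = "geometric",
    gain: float = 2.0,
) -> float:
    """Weight variance gain / d, where d depends on the chosen mode.

    gain is a hypothetical numerator constant chosen for illustration;
    only the choice of denominator follows the abstract.
    """
    if fan_in <= 0 or fan_out <= 0:
        raise ValueError("fan_in and fan_out must be positive")
    if mode == "fan_in":
        denom = float(fan_in)
    elif mode == "fan_out":
        denom = float(fan_out)
    elif mode == "arithmetic":
        denom = (fan_in + fan_out) / 2.0
    elif mode == "geometric":
        denom = math.sqrt(fan_in * fan_out)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return gain / denom


# Compare the four choices for a hypothetical 100 -> 400 layer.
variances = {
    mode: init_variance(100, 400, mode)
    for mode in ("fan_in", "fan_out", "arithmetic", "geometric")
}
```

For a square layer (fan-in equal to fan-out) all four choices coincide; they differ only when a layer changes width.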

21. On the Ineffectiveness of Variance Reduced Optimization for Deep Learning

Author: Defazio, Aaron and Bottou, Léon
Abstract: The application of stochastic variance reduction to optimization has shown remarkable recent theoretical and practical success. The applicability of these techniques to the hard non-convex optimization problems encountered during training of modern deep neural networks is an open problem. We show that naive application of the SVRG technique and related approaches fails, and explore why.
Published: 2018

22. On the Curved Geometry of Accelerated Optimization

Author: Defazio, Aaron
Abstract: In this work we propose a differential geometric motivation for Nesterov's accelerated gradient method (AGM) for strongly-convex problems. By considering the optimization procedure as occurring on a Riemannian manifold with a natural structure, the AGM method can be seen as the proximal point method applied in this curved space. This viewpoint can also be extended to the continuous time case, where the accelerated gradient method arises from the natural block-implicit Euler discretization of an ODE on the manifold. We provide an analysis of the convergence rate of this ODE for quadratic objectives. Comment: NeurIPS 2019 Accepted paper
Published: 2018

23. Controlling Covariate Shift using Balanced Normalization of Weights

Author: Defazio, Aaron and Bottou, Léon
Abstract: We introduce a new normalization technique that exhibits the fast convergence properties of batch normalization using a transformation of layer weights instead of layer outputs. The proposed technique keeps the contribution of positive and negative weights to the layer output balanced. We validate our method on a set of standard benchmarks including CIFAR-10/100, SVHN and ILSVRC 2012 ImageNet.
Published: 2018

24. fastMRI: An Open Dataset and Benchmarks for Accelerated MRI

Author: Zbontar, Jure, Knoll, Florian, Sriram, Anuroop, Murrell, Tullie, Huang, Zhengnan, Muckley, Matthew J., Defazio, Aaron, Stern, Ruben, Johnson, Patricia, Bruno, Mary, Parente, Marc, Geras, Krzysztof J., Katsnelson, Joe, Chandarana, Hersh, Zhang, Zizhao, Drozdzal, Michal, Romero, Adriana, Rabbat, Michael, Vincent, Pascal, Yakubova, Nafissa, Pinkerton, James, Wang, Duo, Owens, Erich, Zitnick, C. Lawrence, Recht, Michael P., Sodickson, Daniel K., and Lui, Yvonne W.
Abstract: Accelerating Magnetic Resonance Imaging (MRI) by taking fewer measurements has the potential to reduce medical costs, minimize stress to patients and make MRI possible in applications where it is currently prohibitively slow or expensive. We introduce the fastMRI dataset, a large-scale collection of both raw MR measurements and clinical MR images, that can be used for training and evaluation of machine-learning approaches to MR image reconstruction. By introducing standardized evaluation criteria and a freely-accessible dataset, our goal is to help the community make rapid advances in the state of the art for MR image reconstruction. We also provide a self-contained introduction to MRI for machine learning researchers with no medical imaging background. Comment: 35 pages, 10 figures
Published: 2018

25. A Simple Practical Accelerated Method for Finite Sums

Author: Defazio, Aaron
Abstract: We describe a novel optimization method for finite sums (such as empirical risk minimization problems) building on the recently introduced SAGA method. Our method achieves an accelerated convergence rate on strongly convex smooth problems. Our method has only one parameter (a step size), and is radically simpler than other accelerated methods for finite sums. Additionally it can be applied when the terms are non-smooth, yielding a method applicable in many areas where operator splitting methods would traditionally be applied.
Published: 2016

26. Non-Uniform Stochastic Average Gradient Method for Training Conditional Random Fields

Author: Schmidt, Mark W., Babanezhad, Reza, Ahmed, Mohamed Osama, Defazio, Aaron, Clifton, Ann, and Sarkar, Anoop
Abstract: We apply stochastic average gradient (SAG) algorithms for training conditional random fields (CRFs). We describe a practical implementation that uses structure in the CRF gradient to reduce the memory requirement of this linearly-convergent stochastic gradient method, propose a non-uniform sampling scheme that substantially improves practical performance, and analyze the rate of convergence of the SAGA variant under non-uniform sampling. Our experimental results reveal that our method significantly outperforms existing methods in terms of the training objective, and performs as well or better than optimally-tuned stochastic gradient methods in terms of test error.
Published: 2015

27. New Optimisation Methods for Machine Learning

Author: Defazio, Aaron
Abstract: A thesis submitted for the degree of Doctor of Philosophy of The Australian National University. In this work we introduce several new optimisation methods for problems in machine learning. Our algorithms broadly fall into two categories: optimisation of finite sums and of graph structured objectives. The finite sum problem is simply the minimisation of objective functions that are naturally expressed as a summation over a large number of terms, where each term has a similar or identical weight. Such objectives most often appear in machine learning in the empirical risk minimisation framework in the non-online learning setting. The second category, that of graph structured objectives, consists of objectives that result from applying maximum likelihood to Markov random field models. Unlike the finite sum case, all the non-linearity is contained within a partition function term, which does not readily decompose into a summation. For the finite sum problem, we introduce the Finito and SAGA algorithms, as well as variants of each. For graph-structured problems, we take three complementary approaches. We look at learning the parameters for a fixed structure, learning the structure independently, and learning both simultaneously. Specifically, for the combined approach, we introduce a new method for encouraging graph structures with the "scale-free" property. For the structure learning problem, we establish SHORTCUT, an O(n^{2.5}) expected time approximate structure learning method for Gaussian graphical models. For problems where the structure is known but the parameters unknown, we introduce an approximate maximum likelihood learning algorithm that is capable of learning a useful subclass of Gaussian graphical models. Comment: PhD thesis, 205 pages
Published: 2015

28. Non-Uniform Stochastic Average Gradient Method for Training Conditional Random Fields

Author: Schmidt, Mark, Babanezhad, Reza, Ahmed, Mohamed Osama, Defazio, Aaron, Clifton, Ann, and Sarkar, Anoop
Abstract: We apply stochastic average gradient (SAG) algorithms for training conditional random fields (CRFs). We describe a practical implementation that uses structure in the CRF gradient to reduce the memory requirement of this linearly-convergent stochastic gradient method, propose a non-uniform sampling scheme that substantially improves practical performance, and analyze the rate of convergence of the SAGA variant under non-uniform sampling. Our experimental results reveal that our method often significantly outperforms existing methods in terms of the training objective, and performs as well or better than optimally-tuned stochastic gradient methods in terms of test error. Comment: AI/Stats 2015, 24 pages
Published: 2015

29. Finito: A faster, permutable incremental gradient method for big data problems

Author: Defazio, Aaron, Caetano, Tiberio, and Domke, Justin
Abstract: Recent advances in optimization theory have shown that smooth strongly convex finite sums can be minimized faster than by treating them as a black box "batch" problem. In this work we introduce a new method in this class with a theoretical convergence rate four times faster than existing methods, for sums with sufficiently many terms. This method is also amenable to a sampling without replacement scheme that in practice gives further speed-ups. We give empirical results showing state of the art performance.
Published: 2014

30. A Comparison of learning algorithms on the Arcade Learning Environment

Author: Defazio, Aaron and Graepel, Thore
Abstract: Reinforcement learning agents have traditionally been evaluated on small toy problems. With advances in computing power and the advent of the Arcade Learning Environment, it is now possible to evaluate algorithms on diverse and difficult problems within a consistent framework. We discuss some challenges posed by the arcade learning environment which do not manifest in simpler environments. We then provide a comparison of model-free, linear learning algorithms on this challenging problem set.
Published: 2014

31. Finito: A Faster, Permutable Incremental Gradient Method for Big Data Problems

Author: Defazio, Aaron J., Caetano, Tibério S., and Domke, Justin
Abstract: Recent advances in optimization theory have shown that smooth strongly convex finite sums can be minimized faster than by treating them as a black box "batch" problem. In this work we introduce a new method in this class with a theoretical convergence rate four times faster than existing methods, for sums with sufficiently many terms. This method is also amenable to a sampling without replacement scheme that in practice gives further speed-ups. We give empirical results showing state of the art performance.
Published: 2014

32. A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation

Author: Defazio, Aaron J., and Caetano, Tiberio S.
Abstract: A key problem in statistics and machine learning is the determination of network structure from data. We consider the case where the structure of the graph to be reconstructed is known to be scale-free. We show that in such cases it is natural to formulate structured sparsity inducing priors using submodular functions, and we use their Lovász extension to obtain a convex relaxation. For tractable classes such as Gaussian graphical models, this leads to a convex optimization problem that can be efficiently solved. We show that our method results in an improvement in the accuracy of reconstructed networks for synthetic data. We also show how our prior encourages scale-free reconstructions on a bioinformatics dataset.
Published: 2014

33. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives

Author: Defazio, Aaron, Bach, Francis, and Lacoste-Julien, Simon
Abstract: In this work we introduce a new optimisation method called SAGA in the spirit of SAG, SDCA, MISO and SVRG, a set of recently proposed incremental gradient algorithms with fast linear convergence rates. SAGA improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and has support for composite objectives where a proximal operator is used on the regulariser. Unlike SDCA, SAGA supports non-strongly convex problems directly, and is adaptive to any inherent strong convexity of the problem. We give experimental results showing the effectiveness of our method., Comment: Advances In Neural Information Processing Systems, Nov 2014, Montreal, Canada
Published: 2014

34. A convex formulation for learning scale-free networks via submodular relaxation

Author: Defazio, Aaron, and Caetano, Tiberio
Abstract: A key problem in statistics and machine learning is the determination of network structure from data. We consider the case where the structure of the graph to be reconstructed is known to be scale-free. We show that in such cases it is natural to formulat
Published: 2012

35. A graphical model formulation of collaborative filtering neighbourhood methods with fast maximum entropy training

Author: Defazio, Aaron, and Caetano, Tiberio
Abstract: Item neighbourhood methods for collaborative filtering learn a weighted graph over the set of items, where each item is connected to those it is most similar to. The prediction of a user's rating on an item is then given by the ratings of neighbouring items, weighted by their similarity. This paper presents a new neighbourhood approach which we call item fields, whereby an undirected graphical model is formed over the item graph. The resulting prediction rule is a simple generalization of the classical approaches, which takes into account non-local information in the graph, allowing its best results to be obtained when using drastically fewer edges than other neighbourhood approaches. A fast approximate maximum entropy training method based on the Bethe approximation is presented, which uses a simple gradient ascent procedure. When using precomputed sufficient statistics on the Movielens datasets, our method is faster than maximum likelihood approaches by two orders of magnitude.
Published: 2012

36. A Graphical Model Formulation of Collaborative Filtering Neighbourhood Methods with Fast Maximum Entropy Training

Author: Defazio, Aaron, and Caetano, Tiberio
Abstract: Item neighbourhood methods for collaborative filtering learn a weighted graph over the set of items, where each item is connected to those it is most similar to. The prediction of a user's rating on an item is then given by the ratings of neighbouring items, weighted by their similarity. This paper presents a new neighbourhood approach which we call item fields, whereby an undirected graphical model is formed over the item graph. The resulting prediction rule is a simple generalization of the classical approaches, which takes into account non-local information in the graph, allowing its best results to be obtained when using drastically fewer edges than other neighbourhood approaches. A fast approximate maximum entropy training method based on the Bethe approximation is presented, which uses a simple gradient ascent procedure. When using precomputed sufficient statistics on the Movielens datasets, our method is faster than maximum likelihood approaches by two orders of magnitude., Comment: ICML2012
Published: 2012

36 results on '"Defazio, Aaron"'
