Understanding the acceleration phenomenon via high-resolution differential equations
- Authors
- Simon S. Du, Weijie J. Su, Michael I. Jordan, and Bin Shi
- Subjects
- FOS: Computer and information sciences; FOS: Mathematics; Computer Science - Machine Learning (cs.LG); Statistics - Machine Learning (stat.ML); Mathematics - Optimization and Control (math.OC); Mathematics - Numerical Analysis (math.NA); Mathematics - Classical Analysis and ODEs (math.CA); General Mathematics; Software; Lyapunov function; differential equation; ordinary differential equation; numerical analysis; applied mathematics; norm (mathematics); convex function; gradient method
- Abstract
- Gradient-based optimization algorithms can be studied from the perspective of limiting ordinary differential equations (ODEs). Motivated by the fact that existing ODEs do not distinguish between two fundamentally different algorithms, Nesterov's accelerated gradient method for strongly convex functions (NAG-SC) and Polyak's heavy-ball method, we study an alternative limiting process that yields high-resolution ODEs. We show that these ODEs permit a general Lyapunov function framework for the analysis of convergence in both continuous and discrete time. We also show that these ODEs are more accurate surrogates for the underlying algorithms; in particular, they not only distinguish between NAG-SC and Polyak's heavy-ball method, but also allow the identification of a term that we refer to as the "gradient correction", which is present in NAG-SC but not in the heavy-ball method and is responsible for the qualitative difference in convergence between the two methods. We also use the high-resolution ODE framework to study Nesterov's accelerated gradient method for (non-strongly) convex functions (NAG-C), uncovering a hitherto unknown result: NAG-C minimizes the squared gradient norm at an inverse cubic rate. Finally, by modifying the high-resolution ODE of NAG-C, we obtain a family of new optimization methods that are shown to maintain the accelerated convergence rates of NAG-C for smooth convex functions.
- Comments
- 82 pages, 11 figures
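To make the "gradient correction" concrete: the paper's high-resolution ODEs for the two strongly convex methods differ by a single Hessian-driven damping term. A sketch in the paper's notation (step size s, strong-convexity parameter μ), quoted up to notational convention:

```latex
% High-resolution ODE of Polyak's heavy-ball method:
\[
  \ddot{X}(t) + 2\sqrt{\mu}\,\dot{X}(t)
    + \bigl(1 + \sqrt{\mu s}\bigr)\nabla f\bigl(X(t)\bigr) = 0.
\]
% High-resolution ODE of NAG-SC: identical except for the
% gradient-correction term \sqrt{s}\,\nabla^2 f(X)\dot{X}.
\[
  \ddot{X}(t) + 2\sqrt{\mu}\,\dot{X}(t)
    + \sqrt{s}\,\nabla^2 f\bigl(X(t)\bigr)\,\dot{X}(t)
    + \bigl(1 + \sqrt{\mu s}\bigr)\nabla f\bigl(X(t)\bigr) = 0.
\]
```

In the low-resolution limit s → 0 both equations collapse to the same ODE, which is why the classical limiting analysis cannot tell the two methods apart.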
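The discrete-time origin of that term can also be seen directly in code. Below is a minimal, hypothetical NumPy sketch (the quadratic objective, step size, and iteration count are illustrative choices, not from the paper) contrasting one standard formulation of the two updates: the only algorithmic difference is where the gradient is evaluated, and it is this look-ahead evaluation that the high-resolution analysis turns into the gradient correction.

```python
import numpy as np

# Illustrative strongly convex quadratic f(x) = 0.5 * x^T A x.
A = np.diag([1.0, 100.0])           # eigenvalues give mu = 1, L = 100
grad = lambda x: A @ x

mu, L = 1.0, 100.0
s = 1.0 / L                         # step size
alpha = (1 - np.sqrt(mu * s)) / (1 + np.sqrt(mu * s))  # momentum coefficient

x0 = np.array([1.0, 1.0])

def heavy_ball(iters=200):
    """Polyak's heavy-ball: gradient evaluated at the current iterate x_k."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        x, x_prev = x + alpha * (x - x_prev) - s * grad(x), x
    return x

def nag_sc(iters=200):
    """NAG-SC: gradient evaluated at the extrapolated point y_k."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        y = x + alpha * (x - x_prev)    # extrapolation (look-ahead) step
        x, x_prev = y - s * grad(y), x  # gradient step taken at y, not at x
    return x

print("heavy-ball:", heavy_ball(), " NAG-SC:", nag_sc())
```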
- Published
- 2021