Stochastic Runge-Kutta methods and adaptive SGD-G2 stochastic gradient descent
- Author
Gabriel Turinici, Imen Ayadi; CEREMADE (CEntre de REcherches en MAthématiques de la DEcision), CNRS, Université Paris Dauphine-PSL, Université Paris sciences et lettres (PSL)
- Subjects
stochastic gradient descent (SGD); adaptive learning rate; adaptive stochastic gradient; Runge–Kutta methods; deep learning; deep learning optimization; neural networks optimization; artificial intelligence; applied mathematics; Machine Learning (cs.LG, stat.ML); Numerical Analysis (math.NA)
- Abstract
The minimization of the loss function is of paramount importance in deep neural networks. On the other hand, many popular optimization algorithms have been shown to correspond to some evolution equation of gradient-flow type. Inspired by the numerical schemes used for general evolution equations, we introduce a second-order stochastic Runge–Kutta method and show that it yields a consistent procedure for minimizing the loss function. In addition, it can be coupled, in an adaptive framework, with Stochastic Gradient Descent (SGD) to adjust the learning rate of the SGD automatically, without requiring any additional information on the Hessian of the loss functional. The adaptive SGD, called SGD-G2, is successfully tested on standard datasets.
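To make the mechanism concrete, here is a minimal sketch of the underlying idea: run a plain SGD (explicit Euler) step alongside a second-order (Heun-type) Runge–Kutta step, and use their discrepancy as a local error estimate that drives the learning rate, with no Hessian computation. The toy least-squares problem, the controller constants, and all function names are illustrative assumptions, not the paper's exact SGD-G2 update rule or its benchmarks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem (hypothetical, not the paper's datasets):
# least-squares loss f(x) = mean_i (a_i . x - b_i)^2 / 2,
# with stochastic gradients drawn from random mini-batches.
A = rng.normal(size=(256, 10))
b = rng.normal(size=256)

def stoch_grad(x, batch=32):
    idx = rng.integers(0, A.shape[0], size=batch)
    Ab = A[idx]
    return Ab.T @ (Ab @ x - b[idx]) / batch

def adaptive_sgd_rk2(x0, h=0.1, tol=1e-2, steps=200):
    """Sketch of an adaptive SGD: compare a plain SGD (Euler) step with a
    Heun-type second-order Runge-Kutta step; their difference estimates
    the local error and is fed to a standard step-size controller.
    This is an illustration of the idea, not the exact SGD-G2 rule."""
    x, hs = x0.copy(), []
    for _ in range(steps):
        g1 = stoch_grad(x)
        x_euler = x - h * g1                 # plain SGD (explicit Euler) step
        g2 = stoch_grad(x_euler)
        x_rk2 = x - 0.5 * h * (g1 + g2)      # Heun / second-order RK step
        err = np.linalg.norm(x_rk2 - x_euler)  # local error estimate
        # Learning-rate update: no Hessian information is needed, only
        # two gradient evaluations per step. Clipping keeps h stable.
        h *= float(np.clip(np.sqrt(tol / (err + 1e-12)), 0.5, 2.0))
        x = x_rk2
        hs.append(h)
    return x, hs

x_final, lr_history = adaptive_sgd_rk2(np.zeros(10))
print("final learning rate:", lr_history[-1])
```

The design point the abstract emphasizes survives even in this sketch: the second gradient evaluation at the trial point plays the role that Hessian information would otherwise play, so the learning rate adapts using only first-order quantities.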
- Published
- 2021