Search

Your search for '"Orvieto, P."' returned 325 results

Search Constraints

Author: "Orvieto, P."
Search Results

1. Adaptive Methods through the Lens of SDEs: Theoretical Insights on the Role of Noise

2. NIMBA: Towards Robust and Principled Processing of Point Clouds With SSMs

3. Loss Landscape Characterization of Neural Networks without Over-Parametrization

4. Geometric Inductive Biases of Deep Networks: The Role of Data and Architecture

5. An Adaptive Stochastic Gradient Method with Non-negative Gauss-Newton Stepsizes

6. Gradient Descent on Logistic Regression with Non-Separable Data and Large Step Sizes

8. Recurrent neural networks: vanishing and exploding gradients are not the end of the story

9. Understanding the differences in Foundation Models: Attention, State Space Models, and Recurrent Neural Networks

10. On the low-shot transferability of [V]-Mamba

11. Theoretical Foundations of Deep Selective State-Space Models

12. Super Consistency of Neural Network Landscapes and Learning Rate Transfer

13. SDEs for Minimax Optimization

14. Recurrent Distance Filtering for Graph Representation Learning

21. Universality of Linear Recurrences Followed by Non-linear Projections: Finite-Width Guarantees and Benefits of Complex Eigenvalues

22. Achieving a Better Stability-Plasticity Trade-off via Auxiliary Networks in Continual Learning

23. Resurrecting Recurrent Neural Networks for Long Sequences

24. An SDE for Modeling SAM: Theory and Insights

25. An Accelerated Lyapunov Function for Polyak's Heavy-Ball on Convex Quadratics

26. On the Theoretical Properties of Noise Correlation in Stochastic Optimization

27. Mean first exit times of Ornstein-Uhlenbeck processes in high-dimensional spaces

28. Explicit Regularization in Overparametrized Models via Noise Injection

29. Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse

31. Dynamics of SGD with Stochastic Polyak Stepsizes: Truly Adaptive Variants and Convergence to Exact Solution

32. Anticorrelated Noise Injection for Improved Generalization

33. On the effectiveness of Randomized Signatures as Reservoir for Learning Rough Dynamics

34. Faster Single-loop Algorithms for Minimax Optimization without Strong Concavity

35. On the Second-order Convergence Properties of Random Search Methods

36. The HERA (Hyper-response Risk Assessment) Delphi consensus for the management of hyper-responders in in vitro fertilization

37. Rethinking the Variational Interpretation of Nesterov's Accelerated Method

38. Vanishing Curvature and the Power of Adaptive Methods in Randomly Initialized Deep Networks

39. Revisiting the Role of Euler Numerical Integration on Acceleration and Stability in Convex Optimization

41. Two-Level K-FAC Preconditioning for Deep Learning

42. Learning explanations that are hard to vary

43. An Accelerated DFO Algorithm for Finite-sum Convex Functions

49. Correction to: The HERA (Hyper-response Risk Assessment) Delphi consensus for the management of hyper-responders in in vitro fertilization

50. Momentum Improves Optimization on Riemannian Manifolds
