Author: "Montúfar, Guido" - Searchworks@Jio Institute Digital Library Search Results

Your search for "Montúfar, Guido" returned 265 results

1. Implicit Bias of Mirror Descent for Shallow Neural Networks in Univariate Regression

Author: Liang, Shuang and Montúfar, Guido
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: We examine the implicit bias of mirror flow in univariate least squares error regression with wide and shallow neural networks. For a broad class of potential functions, we show that mirror flow exhibits lazy training and has the same implicit bias as ordinary gradient flow when the network width tends to infinity. For ReLU networks, we characterize this bias through a variational problem in function space. Our analysis includes prior results for ordinary gradient flow as a special case and lifts limitations which required either an intractable adjustment of the training data or networks with skip connections. We further introduce scaled potentials and show that for these, mirror flow still exhibits lazy training but is not in the kernel regime. For networks with absolute value activations, we show that mirror flow with scaled potentials induces a rich class of biases, which generally cannot be captured by an RKHS norm. A takeaway is that whereas the parameter initialization determines how strongly the curvature of the learned function is penalized at different locations of the input space, the scaled potential determines how the different magnitudes of the curvature are penalized.
Published: 2024

2. Bounds for the smallest eigenvalue of the NTK for arbitrary spherical data of arbitrary dimension

Author: Karhadkar, Kedar, Murray, Michael, and Montúfar, Guido
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: Bounds on the smallest eigenvalue of the neural tangent kernel (NTK) are a key ingredient in the analysis of neural network optimization and memorization. However, existing results require distributional assumptions on the data and are limited to a high-dimensional setting, where the input dimension $d_0$ scales at least logarithmically in the number of samples $n$. In this work we remove both of these requirements and instead provide bounds in terms of a measure of the collinearity of the data: notably these bounds hold with high probability even when $d_0$ is held constant versus $n$. We prove our results through a novel application of the hemisphere transform.
Comment: 47 pages
Published: 2024

3. Fisher-Rao Gradient Flows of Linear Programs and State-Action Natural Policy Gradients

Author: Müller, Johannes, Çaycı, Semih, and Montúfar, Guido
Subjects: Mathematics - Optimization and Control, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Systems and Control, Mathematics - Numerical Analysis, Statistics - Machine Learning, 65K05, 90C05, 90C08, 90C40, 90C53
Abstract: Kakade's natural policy gradient method has been studied extensively in the last years showing linear convergence with and without regularization. We study another natural gradient method which is based on the Fisher information matrix of the state-action distributions and has received little attention from the theoretical side. Here, the state-action distributions follow the Fisher-Rao gradient flow inside the state-action polytope with respect to a linear potential. Therefore, we study Fisher-Rao gradient flows of linear programs more generally and show linear convergence with a rate that depends on the geometry of the linear program. Equivalently, this yields an estimate on the error induced by entropic regularization of the linear program which improves existing results. We extend these results and show sublinear convergence for perturbed Fisher-Rao gradient flows and natural gradient flows up to an approximation error. In particular, these general results cover the case of state-action natural policy gradients.
Comment: 27 pages, 4 figures, under review
Published: 2024

4. The Real Tropical Geometry of Neural Networks

Author: Brandenburg, Marie-Charlotte, Loho, Georg, and Montúfar, Guido
Subjects: Mathematics - Combinatorics, Computer Science - Machine Learning, 14T90, 52C45, 68T07 (Primary), 14P10, 52C35 (Secondary)
Abstract: We consider a binary classifier defined as the sign of a tropical rational function, that is, as the difference of two convex piecewise linear functions. The parameter space of ReLU neural networks is contained as a semialgebraic set inside the parameter space of tropical rational functions. We initiate the study of two different subdivisions of this parameter space: a subdivision into semialgebraic sets, on which the combinatorial type of the decision boundary is fixed, and a subdivision into a polyhedral fan, capturing the combinatorics of the partitions of the dataset. The sublevel sets of the 0/1-loss function arise as subfans of this classification fan, and we show that the level-sets are not necessarily connected. We describe the classification fan i) geometrically, as normal fan of the activation polytope, and ii) combinatorially through a list of properties of associated bipartite graphs, in analogy to covector axioms of oriented matroids and tropical oriented matroids. Our findings extend and refine the connection between neural networks and tropical geometry by observing structures established in real tropical geometry, such as positive tropicalizations of hypersurfaces and tropical semialgebraic sets.
Comment: 43 pages, 6 figures; comments welcome!
Published: 2024

5. Benign overfitting in leaky ReLU networks with moderate input dimension

Author: Karhadkar, Kedar, George, Erin, Murray, Michael, Montúfar, Guido, and Needell, Deanna
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: The problem of benign overfitting asks whether it is possible for a model to perfectly fit noisy training data and still generalize well. We study benign overfitting in two-layer leaky ReLU networks trained with the hinge loss on a binary classification task. We consider input data that can be decomposed into the sum of a common signal and a random noise component, that lie on subspaces orthogonal to one another. We characterize conditions on the signal to noise ratio (SNR) of the model parameters giving rise to benign versus non-benign (or harmful) overfitting: in particular, if the SNR is high then benign overfitting occurs, conversely if the SNR is low then harmful overfitting occurs. We attribute both benign and non-benign overfitting to an approximate margin maximization property and show that leaky ReLU networks trained on hinge loss with gradient descent (GD) satisfy this property. In contrast to prior work we do not require the training data to be nearly orthogonal. Notably, for input dimension $d$ and training sample size $n$, while results in prior work require $d = \Omega(n^2 \log n)$, here we require only $d = \Omega\left(n\right)$.
Comment: 39 pages
Published: 2024

6. Pull-back Geometry of Persistent Homology Encodings

Author: Liang, Shuang, Turkeš, Renata, Li, Jiayi, Otter, Nina, and Montúfar, Guido
Subjects: Mathematics - Algebraic Topology
Abstract: Persistent homology (PH) is a method for generating topology-inspired representations of data. Empirical studies that investigate the properties of PH, such as its sensitivity to perturbations or ability to detect a feature of interest, commonly rely on training and testing an additional model on the basis of the PH representation. To gain more intrinsic insights about PH, independently of the choice of such a model, we propose a novel methodology based on the pull-back geometry that a PH encoding induces on the data manifold. The spectrum and eigenvectors of the induced metric help to identify the most and least significant information captured by PH. Furthermore, the pull-back norm of tangent vectors provides insights about the sensitivity of PH to a given perturbation, or its potential to detect a given feature of interest, and in turn its ability to solve a given classification or regression problem. Experimentally, the insights gained through our methodology align well with the existing knowledge about PH. Moreover, we show that the pull-back norm correlates with the performance on downstream tasks, and can therefore guide the choice of a suitable PH encoding.
Published: 2023

7. Uncertainty and Stochasticity of Optimal Policies

Author: Montúfar, Guido, Rauh, Johannes, and Ay, Nihat
Published: 2023

8. Evaluating Morphological Computation in Muscle and DC-motor Driven Models of Human Hopping

Author: Ghazi-Zahedi, Keyan, Haeufle, Daniel FB, Montúfar, Guido, Schmitt, Syn, and Ay, Nihat
Published: 2023

9. Mildly Overparameterized ReLU Networks Have a Favorable Loss Landscape

Author: Karhadkar, Kedar, Murray, Michael, Tseran, Hanna, and Montúfar, Guido
Subjects: Computer Science - Machine Learning, Mathematics - Combinatorics, Statistics - Machine Learning
Abstract: We study the loss landscape of both shallow and deep, mildly overparameterized ReLU neural networks on a generic finite input dataset for the squared error loss. We show both by count and volume that most activation patterns correspond to parameter regions with no bad local minima. Furthermore, for one-dimensional input data, we show most activation regions realizable by the network contain a high dimensional set of global minima and no bad local minima. We experimentally confirm these results by finding a phase transition from most regions having full rank Jacobian to many regions having deficient rank depending on the amount of overparameterization.
Comment: 40 pages
Published: 2023

10. Supermodular Rank: Set Function Decomposition and Optimization

Author: Sonthalia, Rishi, Seigal, Anna, and Montúfar, Guido
Subjects: Mathematics - Combinatorics, Computer Science - Computational Complexity, Computer Science - Discrete Mathematics, Computer Science - Machine Learning, Mathematics - Optimization and Control
Abstract: We define the supermodular rank of a function on a lattice. This is the smallest number of terms needed to decompose it into a sum of supermodular functions. The supermodular summands are defined with respect to different partial orders. We characterize the maximum possible value of the supermodular rank and describe the functions with fixed supermodular rank. We analogously define the submodular rank. We use submodular decompositions to optimize set functions. Given a bound on the submodular rank of a set function, we formulate an algorithm that splits an optimization problem into submodular subproblems. We show that this method improves the approximation ratio guarantees of several algorithms for monotone set function maximization and ratio of set functions minimization, at a computation overhead that depends on the submodular rank.
Published: 2023

11. Function Space and Critical Points of Linear Convolutional Networks

Author: Kohn, Kathlén, Montúfar, Guido, Shahverdi, Vahid, and Trager, Matthew
Subjects: Computer Science - Machine Learning, Mathematics - Algebraic Geometry, 68T07, 14B05, 14E99, 14J99, 14N05, 14P10, 90C23
Abstract: We study the geometry of linear networks with one-dimensional convolutional layers. The function spaces of these networks can be identified with semi-algebraic families of polynomials admitting sparse factorizations. We analyze the impact of the network's architecture on the function space's dimension, boundary, and singular points. We also describe the critical points of the network's parameterization map. Furthermore, we study the optimization problem of training a network with the squared error loss. We prove that for architectures where all strides are larger than one and generic data, the non-zero critical points of that optimization problem are smooth interior points of the function space. This property is known to be false for dense linear networks and linear convolutional networks with stride one.
Comment: 35 pages, 1 figure, 2 tables
Published: 2023

12. Critical Points and Convergence Analysis of Generative Deep Linear Networks Trained with Bures-Wasserstein Loss

Author: Bréchet, Pierre, Papagiannouli, Katerina, An, Jing, and Montúfar, Guido
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: We consider a deep matrix factorization model of covariance matrices trained with the Bures-Wasserstein distance. While recent works have made advances in the study of the optimization problem for overparametrized low-rank matrix approximation, much emphasis has been placed on discriminative settings and the square loss. In contrast, our model considers another type of loss and connects with the generative setting. We characterize the critical points and minimizers of the Bures-Wasserstein distance over the space of rank-bounded matrices. The Hessian of this loss at low-rank matrices can theoretically blow up, which creates challenges to analyze convergence of gradient optimization methods. We establish convergence results for gradient flow using a smooth perturbative version of the loss as well as convergence results for finite step size gradient descent under certain assumptions on the initial weights.
Comment: 42 pages, 3 figures, accepted at ICML 2023
Published: 2023

13. Expected Gradients of Maxout Networks and Consequences to Parameter Initialization

Author: Tseran, Hanna and Montúfar, Guido
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: We study the gradients of a maxout network with respect to inputs and parameters and obtain bounds for the moments depending on the architecture and the parameter distribution. We observe that the distribution of the input-output Jacobian depends on the input, which complicates a stable parameter initialization. Based on the moments of the gradients, we formulate parameter initialization strategies that avoid vanishing and exploding gradients in wide networks. Experiments with deep fully-connected and convolutional networks show that this strategy improves SGD and Adam training of deep maxout networks. In addition, we obtain refined bounds on the expected number of linear regions, results on the expected curve length distortion, and results on the NTK.
Comment: Published at ICML 2023, 42 pages, 11 figures
Published: 2023

14. Algebraic optimization of sequential decision problems

Author: Dressler, Mareike, Garrote-López, Marina, Montúfar, Guido, Müller, Johannes, and Rose, Kemal
Subjects: Mathematics - Optimization and Control, Electrical Engineering and Systems Science - Systems and Control, Mathematics - Algebraic Geometry, 62R01, 90C23, 90C40
Abstract: We study the optimization of the expected long-term reward in finite partially observable Markov decision processes over the set of stationary stochastic policies. In the case of deterministic observations, also known as state aggregation, the problem is equivalent to optimizing a linear objective subject to quadratic constraints. We characterize the feasible set of this problem as the intersection of a product of affine varieties of rank one matrices and a polytope. Based on this description, we obtain bounds on the number of critical points of the optimization problem. Finally, we conduct experiments in which we solve the KKT equations or the Lagrange equations over different boundary components of the feasible set, and compare the result to the theoretical bounds and to other constrained optimization methods.
Comment: 19 pages, 3 figures
Published: 2022

15. Characterizing the Spectrum of the NTK via a Power Series Expansion

Author: Murray, Michael, Jin, Hui, Bowman, Benjamin, and Montúfar, Guido
Subjects: Computer Science - Machine Learning
Abstract: Under mild conditions on the network initialization we derive a power series expansion for the Neural Tangent Kernel (NTK) of arbitrarily deep feedforward networks in the infinite width limit. We provide expressions for the coefficients of this power series which depend on both the Hermite coefficients of the activation function as well as the depth of the network. We observe faster decay of the Hermite coefficients leads to faster decay in the NTK coefficients and explore the role of depth. Using this series, first we relate the effective rank of the NTK to the effective rank of the input-data Gram. Second, for data drawn uniformly on the sphere we study the eigenvalues of the NTK, analyzing the impact of the choice of activation function. Finally, for generic data and activation functions with sufficiently fast Hermite coefficient decay, we derive an asymptotic upper bound on the spectrum of the NTK.
Comment: 55 pages, 3 figures, 1 table
Published: 2022

16. Geometry and convergence of natural policy gradient methods

Author: Müller, Johannes and Montúfar, Guido
Subjects: Mathematics - Optimization and Control, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Systems and Control, 90C40, 53B12, 90C53
Abstract: We study the convergence of several natural policy gradient (NPG) methods in infinite-horizon discounted Markov decision processes with regular policy parametrizations. For a variety of NPGs and reward functions we show that the trajectories in state-action space are solutions of gradient flows with respect to Hessian geometries, based on which we obtain global convergence guarantees and convergence rates. In particular, we show linear convergence for unregularized and regularized NPG flows with the metrics proposed by Kakade and Morimura and co-authors by observing that these arise from the Hessian geometries of conditional entropy and entropy respectively. Further, we obtain sublinear convergence rates for Hessian geometries arising from other convex functions like log-barriers. Finally, we interpret the discrete-time NPG methods with regularized rewards as inexact Newton methods if the NPG is defined with respect to the Hessian geometry of the regularizer. This yields local quadratic convergence rates of these methods for step size equal to the penalization strength.
Comment: 33 pages, 5 figures, under review
Published: 2022

17. FoSR: First-order spectral rewiring for addressing oversquashing in GNNs

Author: Karhadkar, Kedar, Banerjee, Pradeep Kr., and Montúfar, Guido
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Graph neural networks (GNNs) are able to leverage the structure of graph data by passing messages along the edges of the graph. While this allows GNNs to learn features depending on the graph structure, for certain graph topologies it leads to inefficient information propagation and a problem known as oversquashing. This has recently been linked with the curvature and spectral gap of the graph. On the other hand, adding edges to the message-passing graph can lead to increasingly similar node representations and a problem known as oversmoothing. We propose a computationally efficient algorithm that prevents oversquashing by systematically adding edges to the graph based on spectral expansion. We combine this with a relational architecture, which lets the GNN preserve the original graph structure and provably prevents oversmoothing. We find experimentally that our algorithm outperforms existing graph rewiring methods in several graph classification tasks.
Comment: 21 pages, accepted to ICLR 2023
Published: 2022

18. Enumeration of max-pooling responses with generalized permutohedra

Author: Escobar, Laura, Gallardo, Patricio, González-Anaya, Javier, González, José L., Montúfar, Guido, and Morales, Alejandro H.
Subjects: Mathematics - Combinatorics, Computer Science - Discrete Mathematics, Computer Science - Machine Learning, 05A15, 52B05, 68T07 (Primary) 05A05, 05A16, 06A07 (Secondary)
Abstract: We investigate the combinatorics of max-pooling layers, which are functions that downsample input arrays by taking the maximum over shifted windows of input coordinates, and which are commonly used in convolutional neural networks. We obtain results on the number of linearity regions of these functions by equivalently counting the number of vertices of certain Minkowski sums of simplices. We characterize the faces of such polytopes and obtain generating functions and closed formulas for the number of vertices and facets in a 1D max-pooling layer depending on the size of the pooling windows and stride, and for the number of vertices in a special case of 2D max-pooling.
Comment: 35 pages, 11 figures, 4 tables. V2: Improved exposition, added computations in Section 4, and expanded analysis of data
Published: 2022

19. Geometry and convergence of natural policy gradient methods

Author: Müller, Johannes and Montúfar, Guido
Published: 2024

20. Oversquashing in GNNs through the lens of information contraction and graph expansion

Author: Banerjee, Pradeep Kr., Karhadkar, Kedar, Wang, Yu Guang, Alon, Uri, and Montúfar, Guido
Subjects: Computer Science - Machine Learning, Computer Science - Information Theory
Abstract: The quality of signal propagation in message-passing graph neural networks (GNNs) strongly influences their expressivity as has been observed in recent works. In particular, for prediction tasks relying on long-range interactions, recursive aggregation of node features can lead to an undesired phenomenon called "oversquashing". We present a framework for analyzing oversquashing based on information contraction. Our analysis is guided by a model of reliable computation due to von Neumann that lends a new insight into oversquashing as signal quenching in noisy computation graphs. Building on this, we propose a graph rewiring algorithm aimed at alleviating oversquashing. Our algorithm employs a random local edge flip primitive motivated by an expander graph construction. We compare the spectral expansion properties of our algorithm with that of an existing curvature-based non-local rewiring strategy. Synthetic experiments show that while our algorithm in general has a slower rate of expansion, it is overall computationally cheaper, preserves the node degrees exactly and never disconnects the graph.
Comment: 8 pages, 5 figures; Accepted at the 58th Annual Allerton Conference on Communication, Control, and Computing
Published: 2022

21. On the effectiveness of persistent homology

Author: Turkeš, Renata, Montúfar, Guido, and Otter, Nina
Subjects: Mathematics - Algebraic Topology, Computer Science - Machine Learning
Abstract: Persistent homology (PH) is one of the most popular methods in Topological Data Analysis. Even though PH has been used in many different types of applications, the reasons behind its success remain elusive; in particular, it is not known for which classes of problems it is most effective, or to what extent it can detect geometric or topological features. The goal of this work is to identify some types of problems where PH performs well or even better than other methods in data analysis. We consider three fundamental shape analysis tasks: the detection of the number of holes, curvature and convexity from 2D and 3D point clouds sampled from shapes. Experiments demonstrate that PH is successful in these tasks, outperforming several baselines, including PointNet, an architecture inspired precisely by the properties of point clouds. In addition, we observe that PH remains effective for limited computational resources and limited training data, as well as out-of-distribution test data, including various data transformations and noise. For convexity detection, we provide a theoretical guarantee that PH is effective for this task in $\mathbb{R}^d$, and demonstrate the detection of a convexity measure on the FLAVIA data set of plant leaf images. Due to the crucial role of shape classification in understanding mathematical and physical structures and objects, and in many applications, the findings of this work will provide some knowledge about the types of problems that are appropriate for PH, so that it can - to borrow the words from Wigner 1960 - "remain valid in future research, and extend, to our pleasure", but to our lesser bafflement, to a variety of applications.
Comment: Main text 10 pages; Appendices 23 pages; References 6 pages; 32 figures. To appear in Advances in Neural Information Processing Systems 35 (NeurIPS 2022). Theorem 1 guarantees that PH with respect to the tubular filtration (Definition 1, Figure 9) can detect convexity in any d-dimensional Euclidean space (Appendix A). A convexity measure is detected with PH on a real-world dataset (Appendix G)
Published: 2022

22. Spectral Bias Outside the Training Set for Deep Networks in the Kernel Regime

Author: Bowman, Benjamin and Montúfar, Guido
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: We provide quantitative bounds measuring the $L^2$ difference in function space between the trajectory of a finite-width network trained on finitely many samples from the idealized kernel dynamics of infinite width and infinite data. An implication of the bounds is that the network is biased to learn the top eigenfunctions of the Neural Tangent Kernel not just on the training set but over the entire input space. This bias depends on the model architecture and input distribution alone and thus does not depend on the target function which does not need to be in the RKHS of the kernel. The result is valid for deep architectures with fully connected, convolutional, and residual layers. Furthermore the width does not need to grow polynomially with the number of samples in order to obtain high probability bounds up to a stopping time. The proof exploits the low-effective-rank property of the Fisher Information Matrix at initialization, which implies a low effective dimension of the model (far smaller than the number of parameters). We conclude that local capacity control from the low effective rank of the Fisher Information Matrix is still underexplored theoretically.
Comment: 38 pages, 1 figure, to be published in NeurIPS 2022
Published: 2022

23. Geometry and convergence of natural policy gradient methods.

Author: Müller, Johannes and Montúfar, Guido
Subjects: Hessian geometry, Markov decision process, Natural policy gradient, State-action frequency, stochastic policy
Abstract: We study the convergence of several natural policy gradient (NPG) methods in infinite-horizon discounted Markov decision processes with regular policy parametrizations. For a variety of NPGs and reward functions we show that the trajectories in state-action space are solutions of gradient flows with respect to Hessian geometries, based on which we obtain global convergence guarantees and convergence rates. In particular, we show linear convergence for unregularized and regularized NPG flows with the metrics proposed by Kakade and Morimura and co-authors by observing that these arise from the Hessian geometries of conditional entropy and entropy respectively. Further, we obtain sublinear convergence rates for Hessian geometries arising from other convex functions like log-barriers. Finally, we interpret the discrete-time NPG methods with regularized rewards as inexact Newton methods if the NPG is defined with respect to the Hessian geometry of the regularizer. This yields local quadratic convergence rates of these methods for step size equal to the inverse penalization strength.
Published: 2023

24. Solving infinite-horizon POMDPs with memoryless stochastic policies in state-action space

Author: Müller, Johannes and Montúfar, Guido
Subjects: Computer Science - Machine Learning, Electrical Engineering and Systems Science - Systems and Control, Mathematics - Optimization and Control
Abstract: Reward optimization in fully observable Markov decision processes is equivalent to a linear program over the polytope of state-action frequencies. Taking a similar perspective in the case of partially observable Markov decision processes with memoryless stochastic policies, the problem was recently formulated as the optimization of a linear objective subject to polynomial constraints. Based on this we present an approach for Reward Optimization in State-Action space (ROSA). We test this approach experimentally in maze navigation tasks. We find that ROSA is computationally efficient and can yield stability improvements over other existing methods.
Comment: Accepted as an extended abstract at RLDM 2022, 5 pages, 2 figures
Published: 2022

25. Continuity and Additivity Properties of Information Decompositions

Author: Rauh, Johannes, Banerjee, Pradeep Kr., Olbrich, Eckehard, Montúfar, Guido, and Jost, Jürgen
Subjects: Computer Science - Information Theory, 94A15, 94A17
Abstract: Information decompositions quantify how the Shannon information about a given random variable is distributed among several other random variables. Various requirements have been proposed that such a decomposition should satisfy, leading to different candidate solutions. Curiously, however, only two of the original requirements that determined the Shannon information have been considered, namely monotonicity and normalization. Two other important properties, continuity and additivity, have not been considered. In this contribution, we focus on the mutual information of two finite variables $Y,Z$ about a third finite variable $S$ and check which of the decompositions satisfy these two properties. While most of them satisfy continuity, only one of them is both continuous and additive.
Comment: 17 pages
Published: 2022

26. Implicit Bias of MSE Gradient Optimization in Underparameterized Neural Networks

Author: Bowman, Benjamin and Montúfar, Guido
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: We study the dynamics of a neural network in function space when optimizing the mean squared error via gradient flow. We show that in the underparameterized regime the network learns eigenfunctions of an integral operator $T_{K^\infty}$ determined by the Neural Tangent Kernel (NTK) at rates corresponding to their eigenvalues. For example, for uniformly distributed data on the sphere $S^{d - 1}$ and rotation invariant weight distributions, the eigenfunctions of $T_{K^\infty}$ are the spherical harmonics. Our results can be understood as describing a spectral bias in the underparameterized regime. The proofs use the concept of "Damped Deviations", where deviations of the NTK matter less for eigendirections with large eigenvalues due to the occurrence of a damping factor. Aside from the underparameterized regime, the damped deviations point-of-view can be used to track the dynamics of the empirical risk in the overparameterized setting, allowing us to extend certain results in the literature. We conclude that damped deviations offers a simple and unifying perspective of the dynamics when optimizing the squared error.
Comment: 61 pages, submitted to ICLR 2022
Published: 2022

27. Training Wasserstein GANs without gradient penalties

Author: Kwon, Dohyun, Kim, Yeoneung, Montúfar, Guido, and Yang, Insoon
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Mathematics - Numerical Analysis
Abstract: We propose a stable method to train Wasserstein generative adversarial networks. In order to enhance stability, we consider two objective functions using the $c$-transform based on Kantorovich duality which arises in the theory of optimal transport. We experimentally show that this algorithm can effectively enforce the Lipschitz constraint on the discriminator while other standard methods fail to do so. As a consequence, our method yields an accurate estimation for the optimal discriminator and also for the Wasserstein distance between the true distribution and the generated one. Our method requires no gradient penalties nor corresponding hyperparameter tuning and is computationally more efficient than other methods. At the same time, it yields competitive generators of synthetic images based on the MNIST, F-MNIST, and CIFAR-10 datasets.
Published: 2021

28. Learning curves for Gaussian process regression with power-law priors and targets

Author: Jin, Hui, Banerjee, Pradeep Kr., and Montúfar, Guido
Subjects: Computer Science - Machine Learning
Abstract: We characterize the power-law asymptotics of learning curves for Gaussian process regression (GPR) under the assumption that the eigenspectrum of the prior and the eigenexpansion coefficients of the target function follow a power law. Under similar assumptions, we leverage the equivalence between GPR and kernel ridge regression (KRR) to show the generalization error of KRR. Infinitely wide neural networks can be related to GPR with respect to the neural network GP kernel and the neural tangent kernel, which in several cases is known to have a power-law spectrum. Hence our methods can be applied to study the generalization error of infinitely wide neural networks. We present toy experiments demonstrating the theory., Comment: 76 pages, 7 tables, 6 figures
Published: 2021

29. The Geometry of Memoryless Stochastic Policy Optimization in Infinite-Horizon POMDPs

Author: Müller, Johannes and Montúfar, Guido
Subjects: Mathematics - Optimization and Control, Computer Science - Machine Learning, Mathematics - Algebraic Geometry, 90C40, 93E20, 49M37, 90C23
Abstract: We consider the problem of finding the best memoryless stochastic policy for an infinite-horizon partially observable Markov decision process (POMDP) with finite state and action spaces with respect to either the discounted or mean reward criterion. We show that the (discounted) state-action frequencies and the expected cumulative reward are rational functions of the policy, whereby the degree is determined by the degree of partial observability. We then describe the optimization problem as a linear optimization problem in the space of feasible state-action frequencies subject to polynomial constraints that we characterize explicitly. This allows us to address the combinatorial and geometric complexity of the optimization problem using recent tools from polynomial optimization. In particular, we estimate the number of critical points and use the polynomial programming description of reward maximization to solve a navigation problem in a grid world., Comment: Camera ready version for ICLR 2022, 45 pages, 8 figures
Published: 2021

30. Geometry of Linear Convolutional Networks

Author: Kohn, Kathlén, Merkh, Thomas, Montúfar, Guido, and Trager, Matthew
Subjects: Computer Science - Machine Learning, Mathematics - Algebraic Geometry, 68T07, 14P10, 14J70, 90C23, 62R01
Abstract: We study the family of functions that are represented by a linear convolutional neural network (LCN). These functions form a semi-algebraic subset of the set of linear maps from input space to output space. In contrast, the families of functions represented by fully-connected linear networks form algebraic sets. We observe that the functions represented by LCNs can be identified with polynomials that admit certain factorizations, and we use this perspective to describe the impact of the network's architecture on the geometry of the resulting function space. We further study the optimization of an objective function over an LCN, analyzing critical points in function space and in parameter space, and describing dynamical invariants for gradient descent. Overall, our theory predicts that the optimized parameters of an LCN will often correspond to repeated filters across layers, or filters that can be decomposed as repeated filters. We also conduct numerical and symbolic experiments that illustrate our results and present an in-depth analysis of the landscape for small architectures., Comment: 38 pages, 3 figures, 2 tables; appearing in SIAM Journal on Applied Algebra and Geometry (SIAGA)
Published: 2021

31. On the Expected Complexity of Maxout Networks

Author: Tseran, Hanna and Montúfar, Guido
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: Learning with neural networks relies on the complexity of the representable functions, but more importantly, the particular assignment of typical parameters to functions of different complexity. Taking the number of activation regions as a complexity measure, recent works have shown that the practical complexity of deep ReLU networks is often far from the theoretical maximum. In this work, we show that this phenomenon also occurs in networks with maxout (multi-argument) activation functions and when considering the decision boundaries in classification tasks. We also show that the parameter space has a multitude of full-dimensional regions with widely different complexity, and obtain nontrivial lower bounds on the expected complexity. Finally, we investigate different parameter initialization procedures and show that they can increase the speed of convergence in training., Comment: Published at NeurIPS 2021, 47 pages, 18 figures
Published: 2021

32. Weisfeiler and Lehman Go Cellular: CW Networks

Author: Bodnar, Cristian, Frasca, Fabrizio, Otter, Nina, Wang, Yu Guang, Liò, Pietro, Montúfar, Guido, and Bronstein, Michael
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Graph Neural Networks (GNNs) are limited in their expressive power, struggle with long-range interactions and lack a principled way to model higher-order structures. These problems can be attributed to the strong coupling between the computational graph and the input graph structure. The recently proposed Message Passing Simplicial Networks naturally decouple these elements by performing message passing on the clique complex of the graph. Nevertheless, these models can be severely constrained by the rigid combinatorial structure of Simplicial Complexes (SCs). In this work, we extend recent theoretical results on SCs to regular Cell Complexes, topological objects that flexibly subsume SCs and graphs. We show that this generalisation provides a powerful set of graph "lifting" transformations, each leading to a unique hierarchical message passing procedure. The resulting methods, which we collectively call CW Networks (CWNs), are strictly more powerful than the WL test and not less powerful than the 3-WL test. In particular, we demonstrate the effectiveness of one such scheme, based on rings, when applied to molecular graph problems. The proposed architecture benefits from provably larger expressivity than commonly used GNNs, principled modelling of higher-order signals and from compressing the distances between nodes. We demonstrate that our model achieves state-of-the-art results on a variety of molecular datasets., Comment: NeurIPS 2021. Contains 28 pages, 9 figures
Published: 2021

33. Algebraic optimization of sequential decision problems

Author: Dressler, Mareike, Garrote-López, Marina, Montúfar, Guido, Müller, Johannes, and Rose, Kemal
Published: 2024

34. Information Complexity and Generalization Bounds

Author: Banerjee, Pradeep Kr. and Montúfar, Guido
Subjects: Computer Science - Machine Learning, Computer Science - Information Theory, 68Q32, 68T05, 94A15, I.2.6, G.3
Abstract: We present a unifying picture of PAC-Bayesian and mutual information-based upper bounds on the generalization error of randomized learning algorithms. As we show, Tong Zhang's information exponential inequality (IEI) gives a general recipe for constructing bounds of both flavors. We show that several important results in the literature can be obtained as simple corollaries of the IEI under different assumptions on the loss function. Moreover, we obtain new bounds for data-dependent priors and unbounded loss functions. Optimizing the bounds gives rise to variants of the Gibbs algorithm, for which we discuss two practical examples for learning with neural networks, namely, Entropy- and PAC-Bayes- SGD. Further, we use an Occam's factor argument to show a PAC-Bayesian bound that incorporates second-order curvature information of the training loss., Comment: To appear in 2021 IEEE International Symposium on Information Theory (ISIT); 23 pages
Published: 2021

35. Sharp bounds for the number of regions of maxout networks and vertices of Minkowski sums

Author: Montúfar, Guido, Ren, Yue, and Zhang, Leon
Subjects: Mathematics - Combinatorics, Computer Science - Discrete Mathematics, Computer Science - Machine Learning, 68T07, 52B05, 14T15, 06A07
Abstract: We present results on the number of linear regions of the functions that can be represented by artificial feedforward neural networks with maxout units. A rank-$k$ maxout unit is a function computing the maximum of $k$ linear functions. For networks with a single layer of maxout units, the linear regions correspond to the upper vertices of a Minkowski sum of polytopes. We obtain face counting formulas in terms of the intersection posets of tropical hypersurfaces or the number of upper faces of partial Minkowski sums, along with explicit sharp upper bounds for the number of regions for any input dimension, any number of units, and any ranks, in the cases with and without biases. Based on these results we also obtain asymptotically sharp upper bounds for networks with multiple layers., Comment: 25 pages, 5 figures
Published: 2021

36. Weisfeiler and Lehman Go Topological: Message Passing Simplicial Networks

Author: Bodnar, Cristian, Frasca, Fabrizio, Wang, Yu Guang, Otter, Nina, Montúfar, Guido, Liò, Pietro, and Bronstein, Michael
Subjects: Computer Science - Machine Learning, Computer Science - Social and Information Networks
Abstract: The pairwise interaction paradigm of graph machine learning has predominantly governed the modelling of relational systems. However, graphs alone cannot capture the multi-level interactions present in many complex systems and the expressive power of such schemes was proven to be limited. To overcome these limitations, we propose Message Passing Simplicial Networks (MPSNs), a class of models that perform message passing on simplicial complexes (SCs). To theoretically analyse the expressivity of our model we introduce a Simplicial Weisfeiler-Lehman (SWL) colouring procedure for distinguishing non-isomorphic SCs. We relate the power of SWL to the problem of distinguishing non-isomorphic graphs and show that SWL and MPSNs are strictly more powerful than the WL test and not less powerful than the 3-WL test. We deepen the analysis by comparing our model with traditional graph neural networks (GNNs) with ReLU activations in terms of the number of linear regions of the functions they can represent. We empirically support our theoretical claims by showing that MPSNs can distinguish challenging strongly regular graphs for which GNNs fail and, when equipped with orientation equivariant layers, they can improve classification accuracy in oriented SCs compared to a GNN baseline., Comment: ICML 2021. Contains 27 pages, 9 figures
Published: 2021

37. Wasserstein Proximal of GANs

Author: Lin, Alex Tong, Li, Wuchen, Osher, Stanley, and Montufar, Guido
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Mathematics - Numerical Analysis
Abstract: We introduce a new method for training generative adversarial networks by applying the Wasserstein-2 metric proximal on the generators. The approach is based on Wasserstein information geometry. It defines a parametrization invariant natural gradient by pulling back optimal transport structures from probability space to parameter space. We obtain easy-to-implement iterative regularizers for the parameter updates of implicit deep generative models. Our experiments demonstrate that this method improves the speed and stability of training in terms of wall-clock time and Fréchet Inception Distance.
Published: 2021

38. How Framelets Enhance Graph Neural Networks

Author: Zheng, Xuebin, Zhou, Bingxin, Gao, Junbin, Wang, Yu Guang, Lio, Pietro, Li, Ming, and Montufar, Guido
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Mathematics - Numerical Analysis, 68T07, 05C85, 42C40, I.2.4, I.2.6
Abstract: This paper presents a new approach for assembling graph neural networks based on framelet transforms. The latter provides a multi-scale representation for graph-structured data. We decompose an input graph into low-pass and high-pass frequencies coefficients for network training, which then defines a framelet-based graph convolution. The framelet decomposition naturally induces a graph pooling strategy by aggregating the graph feature into low-pass and high-pass spectra, which considers both the feature values and geometry of the graph data and conserves the total information. The graph neural networks with the proposed framelet convolution and pooling achieve state-of-the-art performance in many node and graph prediction tasks. Moreover, we propose shrinkage as a new activation for the framelet convolution, which thresholds high-frequency information at different scales. Compared to ReLU, shrinkage activation improves model performance on denoising and signal compression: noises in both node and structure can be significantly reduced by accurately cutting off the high-pass coefficients from framelet decomposition, and the signal can be compressed to less than half its original size with well-preserved prediction performance., Comment: 24 pages, 17 figures, 8 tables, ICML2021 (fix typos)
Published: 2021

39. Tight Bounds on the Smallest Eigenvalue of the Neural Tangent Kernel for Deep ReLU Networks

Author: Nguyen, Quynh, Mondelli, Marco, and Montufar, Guido
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: A recent line of work has analyzed the theoretical properties of deep neural networks via the Neural Tangent Kernel (NTK). In particular, the smallest eigenvalue of the NTK has been related to the memorization capacity, the global convergence of gradient descent algorithms and the generalization of deep nets. However, existing results either provide bounds in the two-layer setting or assume that the spectrum of the NTK matrices is bounded away from 0 for multi-layer networks. In this paper, we provide tight bounds on the smallest eigenvalue of NTK matrices for deep ReLU nets, both in the limiting case of infinite widths and for finite widths. In the finite-width setting, the network architectures we consider are fairly general: we require the existence of a wide layer with roughly order of $N$ neurons, $N$ being the number of data samples; and the scaling of the remaining layer widths is arbitrary (up to logarithmic factors). To obtain our results, we analyze various quantities of independent interest: we give lower bounds on the smallest singular value of hidden feature matrices, and upper bounds on the Lipschitz constant of input-output feature maps., Comment: appeared at ICML 2021, this version corrects a mistake in Lemma 5.4 which also affects Lemma 5.5. These two Lemmas have been edited and the corresponding proofs corrected. All the other results remain untouched
Published: 2020

40. Can neural networks learn persistent homology features?

Author: Montúfar, Guido, Otter, Nina, and Wang, Yuguang
Subjects: Computer Science - Machine Learning, Mathematics - Algebraic Topology
Abstract: Topological data analysis uses tools from topology -- the mathematical area that studies shapes -- to create representations of data. In particular, in persistent homology, one studies one-parameter families of spaces associated with data, and persistence diagrams describe the lifetime of topological invariants, such as connected components or holes, across the one-parameter family. In many applications, one is interested in working with features associated with persistence diagrams rather than the diagrams themselves. In our work, we explore the possibility of learning several types of features extracted from persistence diagrams using neural networks., Comment: Topological Data Analysis and Beyond Workshop at the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada
Published: 2020

41. Distributed Learning via Filtered Hyperinterpolation on Manifolds

Author: Montúfar, Guido and Wang, Yu Guang
Subjects: Mathematics - Numerical Analysis, Computer Science - Machine Learning, Statistics - Machine Learning, 65Y05, 41A05, 65D05, 65D30, 68Q32, 94A12, 58C05, 58C35, 65T60, 42C05, F.2.1, G.1.1, G.1.4, G.1.2, I.2.6
Abstract: Learning mappings of data on manifolds is an important topic in contemporary machine learning, with applications in astrophysics, geophysics, statistical physics, medical diagnosis, biochemistry, and 3D object analysis. This paper studies the problem of learning real-valued functions on manifolds through filtered hyperinterpolation of input-output data pairs where the inputs may be sampled deterministically or at random and the outputs may be clean or noisy. Motivated by the problem of handling large data sets, it presents a parallel data processing approach which distributes the data-fitting task among multiple servers and synthesizes the fitted sub-models into a global estimator. We prove quantitative relations between the approximation quality of the learned function over the entire manifold, the type of target function, the number of servers, and the number and type of available samples. We obtain the approximation rates of convergence for distributed and non-distributed approaches. For the non-distributed case, the approximation order is optimal., Comment: 50 pages, 4 figures, 2 tables
Published: 2020

42. Implicit Bias of Gradient Descent for Mean Squared Error Regression with Two-Layer Wide Neural Networks

Author: Jin, Hui and Montúfar, Guido
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning, 68Q32, 68T05, I.2.6, G.3
Abstract: We investigate gradient descent training of wide neural networks and the corresponding implicit bias in function space. For univariate regression, we show that the solution of training a width-$n$ shallow ReLU network is within $n^{- 1/2}$ of the function which fits the training data and whose difference from the initial function has the smallest 2-norm of the second derivative weighted by a curvature penalty that depends on the probability distribution that is used to initialize the network parameters. We compute the curvature penalty function explicitly for various common initialization procedures. For instance, asymmetric initialization with a uniform distribution yields a constant curvature penalty, and thence the solution function is the natural cubic spline interpolation of the training data. For stochastic gradient descent we obtain the same implicit bias result. We obtain a similar result for different activation functions. For multivariate regression we show an analogous result, whereby the second derivative is replaced by the Radon transform of a fractional Laplacian. For initialization schemes that yield a constant penalty function, the solutions are polyharmonic splines. Moreover, we show that the training trajectories are captured by trajectories of smoothing splines with decreasing regularization strength., Comment: 97 pages, 14 figures. Added the discussion of SGD and implications to generalization
Published: 2020

43. Optimization Theory for ReLU Neural Networks Trained with Normalization Layers

Author: Dukler, Yonatan, Gu, Quanquan, and Montúfar, Guido
Subjects: Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: The success of deep neural networks is in part due to the use of normalization layers. Normalization layers like Batch Normalization, Layer Normalization and Weight Normalization are ubiquitous in practice, as they improve generalization performance and speed up training significantly. Nonetheless, the vast majority of current deep learning theory and non-convex optimization literature focuses on the un-normalized setting, where the functions under consideration do not exhibit the properties of commonly normalized neural networks. In this paper, we bridge this gap by giving the first global convergence result for two-layer neural networks with ReLU activations trained with a normalization layer, namely Weight Normalization. Our analysis shows how the introduction of normalization layers changes the optimization landscape and can enable faster convergence as compared with un-normalized neural networks., Comment: To be presented at ICML 2020
Published: 2020

44. Wasserstein Distance to Independence Models

Author: Çelik, Türkü Özlüm, Jamneshan, Asgar, Montúfar, Guido, Sturmfels, Bernd, and Venturello, Lorenzo
Subjects: Mathematics - Optimization and Control, Mathematics - Statistics Theory, Polynomial Optimization, Algebraic Statistics, Computational Algebraic Geometry
Abstract: An independence model for discrete random variables is a Segre-Veronese variety in a probability simplex. Any metric on the set of joint states of the random variables induces a Wasserstein metric on the probability simplex. The unit ball of this polyhedral norm is dual to the Lipschitz polytope. Given any data distribution, we seek to minimize its Wasserstein distance to a fixed independence model. The solution to this optimization problem is a piecewise algebraic function of the data. We compute this function explicitly in small instances, we examine its combinatorial structure and algebraic degrees in the general case, and we present some experimental case studies.
Published: 2020

45. Evaluating Morphological Computation in Muscle and DC-motor Driven Models of Human Hopping

Author: Ghazi-Zahedi, Keyan, Haeufle, Daniel FB, Montúfar, Guido, Schmitt, Syn, and Ay, Nihat
Published: 2021

46. Uncertainty and Stochasticity of Optimal Policies

Author: Montúfar, Guido, Rauh, Johannes, and Ay, Nihat
Published: 2021

47. Continuity and additivity properties of information decompositions

Author: Rauh, Johannes, Banerjee, Pradeep Kr., Olbrich, Eckehard, Montúfar, Guido, and Jost, Jürgen
Published: 2023

48. Stochastic Feedforward Neural Networks: Universal Approximation

Author: Merkh, Thomas and Montúfar, Guido
Subjects: Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Statistics - Machine Learning
Abstract: In this chapter we take a look at the universal approximation question for stochastic feedforward neural networks. In contrast to deterministic networks, which represent mappings from a set of inputs to a set of outputs, stochastic networks represent mappings from a set of inputs to a set of probability distributions over the set of outputs. In particular, even if the sets of inputs and outputs are finite, the class of stochastic mappings in question is not finite. Moreover, while for a deterministic function the values of all output variables can be computed independently of each other given the values of the inputs, in the stochastic setting the values of the output variables may need to be correlated, which requires that their values are computed jointly. A prominent class of stochastic feedforward networks which has played a key role in the resurgence of deep learning are deep belief networks. The representational power of these networks has been studied mainly in the generative setting, as models of probability distributions without an input, or in the discriminative setting for the special case of deterministic mappings. We study the representational power of deep sigmoid belief networks in terms of compositions of linear transformations of probability distributions, Markov kernels, that can be expressed by the layers of the network. We investigate different types of shallow and deep architectures, and the minimal number of layers and units per layer that are sufficient and necessary in order for the network to be able to approximate any given stochastic mapping from the set of inputs to the set of outputs arbitrarily well.
Published: 2019

49. Kernelized Wasserstein Natural Gradient

Author: Arbel, Michael, Gretton, Arthur, Li, Wuchen, and Montufar, Guido
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: Many machine learning problems can be expressed as the optimization of some cost functional over a parametric family of probability distributions. It is often beneficial to solve such optimization problems using natural gradient methods. These methods are invariant to the parametrization of the family, and thus can yield more effective optimization. Unfortunately, computing the natural gradient is challenging as it requires inverting a high dimensional matrix at each iteration. We propose a general framework to approximate the natural gradient for the Wasserstein metric, by leveraging a dual formulation of the metric restricted to a Reproducing Kernel Hilbert Space. Our approach leads to an estimator for gradient direction that can trade off accuracy and computational cost, with theoretical guarantees. We verify its accuracy on simple examples, and show the advantage of using such an estimator in classification tasks on Cifar10 and Cifar100 empirically.
Published: 2019

50. How Well Do WGANs Estimate the Wasserstein Metric?

Author: Mallasto, Anton, Montúfar, Guido, and Gerolin, Augusto
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Generative modelling is often cast as minimizing a similarity measure between a data distribution and a model distribution. Recently, a popular choice for the similarity measure has been the Wasserstein metric, which can be expressed in the Kantorovich duality formulation as the optimum difference of the expected values of a potential function under the real data distribution and the model hypothesis. In practice, the potential is approximated with a neural network and is called the discriminator. Duality constraints on the function class of the discriminator are enforced approximately, and the expectations are estimated from samples. This gives at least three sources of errors: the approximated discriminator and constraints, the estimation of the expectation value, and the optimization required to find the optimal potential. In this work, we study how well the methods used in generative adversarial networks to approximate the Wasserstein metric perform. We consider, in particular, the $c$-transform formulation, which eliminates the need to enforce the constraints explicitly. We demonstrate that the $c$-transform allows for a more accurate estimation of the true Wasserstein metric from samples, but surprisingly, does not perform the best in the generative setting., Comment: 23 pages
Published: 2019
