Author: "Bartlett, P. L." - Searchworks@Jio Institute Digital Library Search Results

Your search for author "Bartlett, P. L." returned 1,181 results

1. A Statistical Analysis of Deep Federated Learning for Intrinsically Low-dimensional Data

Author: Chakraborty, Saptarshi and Bartlett, Peter L.
Subjects: Statistics - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Mathematics - Statistics Theory
Abstract: Federated Learning (FL) has emerged as a groundbreaking paradigm in collaborative machine learning, emphasizing decentralized model training to address data privacy concerns. While significant progress has been made in optimizing federated learning, the exploration of generalization error, particularly in heterogeneous settings, has been limited, focusing mainly on parametric cases. This paper investigates the generalization properties of deep federated regression within a two-stage sampling model. Our findings highlight that the intrinsic dimension, defined by the entropic dimension, is crucial for determining convergence rates when appropriate network sizes are used. Specifically, if the true relationship between response and explanatory variables is characterized by a $\beta$-H\"older function and there are $n$ independent and identically distributed (i.i.d.) samples from $m$ participating clients, the error rate for participating clients scales at most as $\tilde{O}\left((mn)^{-2\beta/(2\beta + \bar{d}_{2\beta}(\lambda))}\right)$, and for non-participating clients, it scales as $\tilde{O}\left(\Delta \cdot m^{-2\beta/(2\beta + \bar{d}_{2\beta}(\lambda))} + (mn)^{-2\beta/(2\beta + \bar{d}_{2\beta}(\lambda))}\right)$. Here, $\bar{d}_{2\beta}(\lambda)$ represents the $2\beta$-entropic dimension of $\lambda$, the marginal distribution of the explanatory variables, and $\Delta$ characterizes the dependence between the sampling stages. Our results explicitly account for the "closeness" of clients, demonstrating that the convergence rates of deep federated learners depend on intrinsic rather than nominal high-dimensionality.
Published: 2024

2. Large Stepsize Gradient Descent for Non-Homogeneous Two-Layer Networks: Margin Improvement and Fast Optimization

Author: Cai, Yuhang, Wu, Jingfeng, Mei, Song, Lindsey, Michael, and Bartlett, Peter L.
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning, Mathematics - Optimization and Control
Abstract: The typical training of neural networks using large stepsize gradient descent (GD) under the logistic loss often involves two distinct phases, where the empirical risk oscillates in the first phase but decreases monotonically in the second phase. We investigate this phenomenon in two-layer networks that satisfy a near-homogeneity condition. We show that the second phase begins once the empirical risk falls below a certain threshold, dependent on the stepsize. Additionally, we show that the normalized margin grows nearly monotonically in the second phase, demonstrating an implicit bias of GD in training non-homogeneous predictors. If the dataset is linearly separable and the derivative of the activation function is bounded away from zero, we show that the average empirical risk decreases, implying that the first phase must stop in finite steps. Finally, we demonstrate that by choosing a suitably large stepsize, GD that undergoes this phase transition is more efficient than GD that monotonically decreases the risk. Our analysis applies to networks of any width, beyond the well-known neural tangent kernel and mean-field regimes.
Comment: Clarify our results on sigmoid neural networks
Published: 2024

3. Scaling Laws in Linear Regression: Compute, Parameters, and Data

Author: Lin, Licong, Wu, Jingfeng, Kakade, Sham M., Bartlett, Peter L., and Lee, Jason D.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Mathematics - Statistics Theory, Statistics - Machine Learning
Abstract: Empirically, large-scale deep learning models often satisfy a neural scaling law: the test error of the trained model improves polynomially as the model size and data size grow. However, conventional wisdom suggests the test error consists of approximation, bias, and variance errors, where the variance error increases with model size. This disagrees with the general form of neural scaling laws, which predict that increasing model size monotonically improves performance. We study the theory of scaling laws in an infinite dimensional linear regression setup. Specifically, we consider a model with $M$ parameters as a linear function of sketched covariates. The model is trained by one-pass stochastic gradient descent (SGD) using $N$ data. Assuming the optimal parameter satisfies a Gaussian prior and the data covariance matrix has a power-law spectrum of degree $a>1$, we show that the reducible part of the test error is $\Theta(M^{-(a-1)} + N^{-(a-1)/a})$. The variance error, which increases with $M$, is dominated by the other errors due to the implicit regularization of SGD, thus disappearing from the bound. Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation.
Published: 2024

4. Insight into Predictors of Cytoreduction Score Following Cytoreductive Surgery-Hyperthermic Intraperitoneal Chemotherapy for Gastric Peritoneal Carcinomatosis Improves Patient Selection and Prognostic Outcomes

Author: Hamed, Ahmed B., El Asmar, Rudy, Tirukkovalur, Nikhil, Tcharni, Adam, Tatsuoka, Curtis, Jelinek, Mark, Derby, Joshua, Dubrovsky, Genia, Nunns, Geoffrey, Ongchin, Melanie, Pingpank, James F., Zureikat, Amer H., Bartlett, David L., Singhi, Aatur, Choudry, M. Haroon, and AlMasri, Samer S.
Published: 2024

5. Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency

Author: Wu, Jingfeng, Bartlett, Peter L., Telgarsky, Matus, and Yu, Bin
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize $\eta$ is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly -- in $\mathcal{O}(\eta)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1 / (\eta t) )$ convergence rate after $t$ additional steps. Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize $\eta:= \Theta( T)$, without any use of momentum or variable stepsize schedulers. Our proof technique is versatile and also handles general classification loss functions (where exponential tails are needed for the $\tilde{\mathcal{O}}(1/T^2)$ acceleration), nonlinear predictors in the neural tangent kernel regime, and online stochastic gradient descent (SGD) with a large stepsize, under suitable separability conditions.
Comment: COLT 2024 camera ready
Published: 2024

6. A Statistical Analysis of Wasserstein Autoencoders for Intrinsically Low-dimensional Data

Author: Chakraborty, Saptarshi and Bartlett, Peter L.
Subjects: Computer Science - Machine Learning, Mathematics - Statistics Theory, Statistics - Machine Learning
Abstract: Variational Autoencoders (VAEs) have gained significant popularity among researchers as a powerful tool for understanding unknown distributions based on limited samples. This popularity stems partly from their impressive performance and partly from their ability to provide meaningful feature representations in the latent space. Wasserstein Autoencoders (WAEs), a variant of VAEs, aim to not only improve model efficiency but also interpretability. However, there has been limited focus on analyzing their statistical guarantees. The matter is further complicated by the fact that the data distributions to which WAEs are applied - such as natural images - are often presumed to possess an underlying low-dimensional structure within a high-dimensional feature space, which current theory does not adequately account for, rendering known bounds inefficient. To bridge the gap between the theory and practice of WAEs, in this paper, we show that WAEs can learn the data distributions when the network architectures are properly chosen. We show that the convergence rates of the expected excess risk in the number of samples for WAEs are independent of the high feature dimension, instead relying only on the intrinsic dimension of the data distribution.
Comment: In the twelfth International Conference on Learning Representations (ICLR'24)
Published: 2024

7. In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization

Author: Zhang, Ruiqi, Wu, Jingfeng, and Bartlett, Peter L.
Subjects: Statistics - Machine Learning, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: We study the \emph{in-context learning} (ICL) ability of a \emph{Linear Transformer Block} (LTB) that combines a linear attention component and a linear multi-layer perceptron (MLP) component. For ICL of linear regression with a Gaussian prior and a \emph{non-zero mean}, we show that LTB can achieve nearly Bayes optimal ICL risk. In contrast, using only linear attention must incur an irreducible additive approximation error. Furthermore, we establish a correspondence between LTB and one-step gradient descent estimators with learnable initialization ($\mathsf{GD}\text{-}\mathbf{\beta}$), in the sense that every $\mathsf{GD}\text{-}\mathbf{\beta}$ estimator can be implemented by an LTB estimator and every optimal LTB estimator that minimizes the in-class ICL risk is effectively a $\mathsf{GD}\text{-}\mathbf{\beta}$ estimator. Finally, we show that $\mathsf{GD}\text{-}\mathbf{\beta}$ estimators can be efficiently optimized with gradient flow, despite a non-convex training objective. Our results reveal that LTB achieves ICL by implementing $\mathsf{GD}\text{-}\mathbf{\beta}$, and they highlight the role of MLP layers in reducing approximation error.
Comment: 39 pages
Published: 2024

8. On the Statistical Properties of Generative Adversarial Models for Low Intrinsic Data Dimension

Author: Chakraborty, Saptarshi and Bartlett, Peter L.
Subjects: Statistics - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Mathematics - Statistics Theory
Abstract: Despite the remarkable empirical successes of Generative Adversarial Networks (GANs), the theoretical guarantees for their statistical accuracy remain rather pessimistic. In particular, the data distributions on which GANs are applied, such as natural images, are often hypothesized to have an intrinsic low-dimensional structure in a typically high-dimensional feature space, but this is often not reflected in the derived rates in the state-of-the-art analyses. In this paper, we attempt to bridge the gap between the theory and practice of GANs and their bidirectional variant, Bi-directional GANs (BiGANs), by deriving statistical guarantees on the estimated densities in terms of the intrinsic dimension of the data and the latent space. We analytically show that if one has access to $n$ samples from the unknown target distribution and the network architectures are properly chosen, the expected Wasserstein-1 distance of the estimates from the target scales as $O\left( n^{-1/d_\mu } \right)$ for GANs and $O\left( n^{-1/(d_\mu+\ell)} \right)$ for BiGANs, where $d_\mu$ and $\ell$ are the upper Wasserstein-1 dimension of the data-distribution and latent-space dimension, respectively. The theoretical analyses not only suggest that these methods successfully avoid the curse of dimensionality, in the sense that the exponent of $n$ in the error rates does not depend on the data dimension but also serve to bridge the gap between the theoretical analyses of GANs and the known sharp rates from optimal transport literature. Additionally, we demonstrate that GANs can effectively achieve the minimax optimal rate even for non-smooth underlying distributions, with the use of larger generator networks.
Published: 2024

9. Detection of Residual Peritoneal Metastases Following Cytoreductive Surgery Using Pegsitacianine, a pH-Sensitive Imaging Agent: Final Results from a Phase II Study

Author: Wagner, Patrick, Levine, Edward A., Kim, Alex C., Shen, Perry, Fleming, Nicole D., Westin, Shannon N., Berry, Laurel K., Karakousis, Giorgos C., Tanyi, Janos L., Olson, Madeline T., Madajewski, Brian, Ostrander, Brian, Krishnan, Kartik, Balch, Charles M., and Bartlett, David L.
Published: 2024

10. Clinical Ethics Consultations and the Necessity of NOT Meeting Expectations: I Never Promised You a Rose Garden

Author: Finder, Stuart G. and Bartlett, Virginia L.
Published: 2024

11. How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression?

Author: Wu, Jingfeng, Zou, Difan, Chen, Zixiang, Braverman, Vladimir, Gu, Quanquan, and Bartlett, Peter L.
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities, enabling them to solve unseen tasks solely based on input contexts without adjusting model parameters. In this paper, we study ICL in one of its simplest setups: pretraining a linearly parameterized single-layer linear attention model for linear regression with a Gaussian prior. We establish a statistical task complexity bound for the attention model pretraining, showing that effective pretraining only requires a small number of independent tasks. Furthermore, we prove that the pretrained model closely matches the Bayes optimal algorithm, i.e., optimally tuned ridge regression, by achieving nearly Bayes optimal risk on unseen tasks under a fixed context length. These theoretical findings complement prior experimental research and shed light on the statistical foundations of ICL.
Comment: ICLR 2024 Camera Ready
Published: 2023

12. Sharpness-Aware Minimization and the Edge of Stability

Author: Long, Philip M. and Bartlett, Peter L.
Subjects: Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Statistics - Machine Learning
Abstract: Recent experiments have shown that, often, when training a neural network with gradient descent (GD) with a step size $\eta$, the operator norm of the Hessian of the loss grows until it approximately reaches $2/\eta$, after which it fluctuates around this value. The quantity $2/\eta$ has been called the "edge of stability" based on consideration of a local quadratic approximation of the loss. We perform a similar calculation to arrive at an "edge of stability" for Sharpness-Aware Minimization (SAM), a variant of GD which has been shown to improve its generalization. Unlike the case for GD, the resulting SAM-edge depends on the norm of the gradient. Using three deep learning training tasks, we see empirically that SAM operates on the edge of stability identified by this analysis.
Published: 2023

13. Advanced Care Planning Prior to Oncologic Surgery: An Assessment of Utilization and Implications

Author: Joseph, Edward A., Anees, Muhammad, Barrett, Tyson S., Aliu, Oluseyi, Wagner, Patrick L., Bartlett, David L., and Allen, Casey J.
Published: 2024

14. Trained Transformers Learn Linear Models In-Context

Author: Zhang, Ruiqi, Frei, Spencer, and Bartlett, Peter L.
Subjects: Statistics - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Attention-based neural networks such as transformers have demonstrated a remarkable ability to exhibit in-context learning (ICL): Given a short prompt sequence of tokens from an unseen task, they can formulate relevant per-token and next-token predictions without any parameter updates. By embedding a sequence of labeled training data and unlabeled test data as a prompt, this allows for transformers to behave like supervised learning algorithms. Indeed, recent work has shown that when training transformer architectures over random instances of linear regression problems, these models' predictions mimic those of ordinary least squares. Towards understanding the mechanisms underlying this phenomenon, we investigate the dynamics of ICL in transformers with a single linear self-attention layer trained by gradient flow on linear regression tasks. We show that despite non-convexity, gradient flow with a suitable random initialization finds a global minimum of the objective function. At this global minimum, when given a test prompt of labeled examples from a new prediction task, the transformer achieves prediction error competitive with the best linear predictor over the test prompt distribution. We additionally characterize the robustness of the trained transformer to a variety of distribution shifts and show that although a number of shifts are tolerated, shifts in the covariate distribution of the prompts are not. Motivated by this, we consider a generalized ICL setting where the covariate distributions can vary across prompts. We show that although gradient flow succeeds at finding a global minimum in this setting, the trained transformer is still brittle under mild covariate shifts. We complement this finding with experiments on large, nonlinear transformer architectures which we show are more robust under covariate shifts.
Comment: 50 pages, revised definition 3.2 and corollary 4.3
Published: 2023

15. ASO Visual Abstract: Insight into Predictors of Cytoreduction Score Following Cytoreductive Surgery-Hyperthermic Intraperitoneal Chemotherapy for Gastric Peritoneal Carcinomatosis Improves Patient Selection and Prognostic Outcomes

Author: Hamed, Ahmed B., Asmar, Rudy El, Tirukkovalur, Nikhil, Tcharni, Adam, Tatsuoka, Curtis, Jelinek, Mark, Derby, Joshua, Dubrovsky, Genia, Nunns, Geoffrey, Ongchin, Melanie, Pingpank, James F., Zureikat, Amer H., Bartlett, David L., Singhi, Aatur, Choudry, M. Haroon, and AlMasri, Samer S.
Published: 2024

16. Exploring the Perception of Value in Cancer Care: A Comparison of Patients, Providers, and Payers

Author: Allen, Casey J., Greene, Alicia C., Joseph, Edward A., Dunung, Ruchita, Knotts, Chelsea M., Chalikonda, Sricharan, and Bartlett, David L.
Published: 2024

17. Leveraging one health as a sentinel approach for pandemic resilience

Author: Bartlett, Maggie L. and Uhart, Marcela
Published: 2024

18. Evaluation of Ki-67 expression and large cell content as prognostic markers in MZL: a multicenter cohort study

Author: Grover, Natalie S., Annunzio, Kaitlin, Watkins, Marcus, Torka, Pallawi, Karmali, Reem, Anampa-Guzmán, Andrea, Oh, Timothy S., Reves, Heather, Tavakkoli, Montreh, Hansinger, Emily, Christian, Beth, Thomas, Colin, Barta, Stefan K., Geethakumari, Praveen Ramakrishnan, Bartlett, Nancy L., Shouse, Geoffrey, Olszewski, Adam J., and Epperla, Narendranath
Published: 2024

19. Normal CEA Levels After Neoadjuvant Chemotherapy and Cytoreduction with Hyperthermic Intraperitoneal Chemoperfusion Predict Improved Survival from Colorectal Peritoneal Metastases

Author: Wach, Michael M., Nunns, Geoffrey, Hamed, Ahmed, Derby, Joshua, Jelinek, Mark, Tatsuoka, Curtis, Holtzman, Matthew P., Zureikat, Amer H., Bartlett, David L., Ahrendt, Steven A., Pingpank, James F., Choudry, M. Haroon A., and Ongchin, Melanie
Published: 2024

20. Long-term outcomes of patients with large B-cell lymphoma treated with axicabtagene ciloleucel and prophylactic corticosteroids

Author: Oluwole, Olalekan O., Forcade, Edouard, Muñoz, Javier, de Guibert, Sophie, Vose, Julie M., Bartlett, Nancy L., Lin, Yi, Deol, Abhinav, McSweeney, Peter, Goy, Andre H., Kersten, Marie José, Jacobson, Caron A., Farooq, Umar, Minnema, Monique C., Thieblemont, Catherine, Timmerman, John M., Stiff, Patrick, Avivi, Irit, Tzachanis, Dimitrios, Zheng, Yan, Vardhanabhuti, Saran, Nater, Jenny, Shen, Rhine R., Miao, Harry, Kim, Jenny J., and van Meerten, Tom
Published: 2024

21. Characterizing the Immune Environment in Peritoneal Carcinomatosis: Insights for Novel Immunotherapy Strategies

Author: Wagner, Patrick L., Knotts, Chelsea M., Donnenberg, Vera S., Dadgar, Neda, Pico, Christian Cruz, Xiao, Kunhong, Zaidi, Ali, Schiffman, Suzanne C., Allen, Casey J., Donnenberg, Albert D., and Bartlett, David L.
Published: 2024

22. Prediction, Learning, Uniform Convergence, and Scale-sensitive Dimensions

Author: Bartlett, Peter L. and Long, Philip M.
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: We present a new general-purpose algorithm for learning classes of $[0,1]$-valued functions in a generalization of the prediction model, and prove a general upper bound on the expected absolute error of this algorithm in terms of a scale-sensitive generalization of the Vapnik dimension proposed by Alon, Ben-David, Cesa-Bianchi and Haussler. We give lower bounds implying that our upper bounds cannot be improved by more than a constant factor in general. We apply this result, together with techniques due to Haussler and to Benedek and Itai, to obtain new upper bounds on packing numbers in terms of this scale-sensitive notion of dimension. Using a different technique, we obtain new bounds on packing numbers in terms of Kearns and Schapire's fat-shattering function. We show how to apply both packing bounds to obtain improved general bounds on the sample complexity of agnostic learning. For each $\epsilon > 0$, we establish weaker sufficient and stronger necessary conditions for a class of $[0,1]$-valued functions to be agnostically learnable to within $\epsilon$, and to be an $\epsilon$-uniform Glivenko-Cantelli class. This is a manuscript that was accepted by JCSS, together with a correction.
Comment: One header page, a three page correction, then a 28 page original paper
Published: 2023

23. Benign Overfitting in Linear Classifiers and Leaky ReLU Networks from KKT Conditions for Margin Maximization

Author: Frei, Spencer, Vardi, Gal, Bartlett, Peter L., and Srebro, Nathan
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Linear classifiers and leaky ReLU networks trained by gradient flow on the logistic loss have an implicit bias towards solutions which satisfy the Karush--Kuhn--Tucker (KKT) conditions for margin maximization. In this work we establish a number of settings where the satisfaction of these KKT conditions implies benign overfitting in linear classifiers and in two-layer leaky ReLU networks: the estimators interpolate noisy training data and simultaneously generalize well to test data. The settings include variants of the noisy class-conditional Gaussians considered in previous work as well as new distributional settings where benign overfitting has not been previously observed. The key ingredient to our proof is the observation that when the training data is nearly-orthogonal, both linear classifiers and leaky ReLU networks satisfying the KKT conditions for their respective margin maximization problems behave like a nearly uniform average of the training examples.
Comment: 53 pages
Published: 2023

24. The Double-Edged Sword of Implicit Bias: Generalization vs. Robustness in ReLU Networks

Author: Frei, Spencer, Vardi, Gal, Bartlett, Peter L., and Srebro, Nathan
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: In this work, we study the implications of the implicit bias of gradient flow on generalization and adversarial robustness in ReLU networks. We focus on a setting where the data consists of clusters and the correlations between cluster means are small, and show that in two-layer ReLU networks gradient flow is biased towards solutions that generalize well, but are highly vulnerable to adversarial examples. Our results hold even in cases where the network has many more parameters than training examples. Despite the potential for harmful overfitting in such overparameterized settings, we prove that the implicit bias of gradient flow prevents it. However, the implicit bias also leads to non-robust solutions (susceptible to small adversarial $\ell_2$-perturbations), even though robust networks that fit the data exist.
Comment: 42 pages; NeurIPS 2023 camera ready
Published: 2023

25. Correction: Exploring the Perception of Value in Cancer Care: A Comparison of Patients, Providers, and Payers

Author: Allen, Casey J., Greene, Alicia C., Joseph, Edward A., Dunung, Ruchita, Knotts, Chelsea M., Chalikonda, Sricharan, and Bartlett, David L.
Published: 2024

26. Kernel-based off-policy estimation without overlap: Instance optimality beyond semiparametric efficiency

Author: Mou, Wenlong, Ding, Peng, Wainwright, Martin J., and Bartlett, Peter L.
Subjects: Mathematics - Statistics Theory, Statistics - Methodology, Statistics - Machine Learning
Abstract: We study optimal procedures for estimating a linear functional based on observational data. In many problems of this kind, a widely used assumption is strict overlap, i.e., uniform boundedness of the importance ratio, which measures how well the observational data covers the directions of interest. When it is violated, the classical semi-parametric efficiency bound can easily become infinite, so that the instance-optimal risk depends on the function class used to model the regression function. For any convex and symmetric function class $\mathcal{F}$, we derive a non-asymptotic local minimax bound on the mean-squared error in estimating a broad class of linear functionals. This lower bound refines the classical semi-parametric one, and makes connections to moduli of continuity in functional estimation. When $\mathcal{F}$ is a reproducing kernel Hilbert space, we prove that this lower bound can be achieved up to a constant factor by analyzing a computationally simple regression estimator. We apply our general results to various families of examples, thereby uncovering a spectrum of rates that interpolate between the classical theories of semi-parametric efficiency (with $\sqrt{n}$-consistency) and the slower minimax rates associated with non-parametric function estimation.
Published: 2023

27. Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data

Author: Frei, Spencer, Vardi, Gal, Bartlett, Peter L., Srebro, Nathan, and Hu, Wei
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: The implicit biases of gradient-based optimization algorithms are conjectured to be a major factor in the success of modern deep learning. In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations when the training data are nearly-orthogonal, a common property of high-dimensional data. For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that asymptotically, gradient flow produces a neural network with rank at most two. Moreover, this network is an $\ell_2$-max-margin solution (in parameter space), and has a linear decision boundary that corresponds to an approximate-max-margin linear predictor. For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training. We provide experiments which suggest that a small initialization scale is important for finding low-rank neural networks with gradient descent.
Comment: 54 pages
Published: 2022

28. The Dynamics of Sharpness-Aware Minimization: Bouncing Across Ravines and Drifting Towards Wide Minima

Author: Bartlett, Peter L., Long, Philip M., and Bousquet, Olivier
Subjects: Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: We consider Sharpness-Aware Minimization (SAM), a gradient-based optimization method for deep networks that has exhibited performance improvements on image and language prediction problems. We show that when SAM is applied with a convex quadratic objective, for most random initializations it converges to a cycle that oscillates between either side of the minimum in the direction with the largest curvature, and we provide bounds on the rate of convergence. In the non-quadratic case, we show that such oscillations effectively perform gradient descent, with a smaller step-size, on the spectral norm of the Hessian. In such cases, SAM's update may be regarded as a third derivative -- the derivative of the Hessian in the leading eigenvector direction -- that encourages drift toward wider minima.
Published: 2022

29. Off-policy estimation of linear functionals: Non-asymptotic theory for semi-parametric efficiency

Author: Mou, Wenlong, Wainwright, Martin J., and Bartlett, Peter L.
Subjects: Mathematics - Statistics Theory, Computer Science - Information Theory, Statistics - Machine Learning
Abstract: The problem of estimating a linear functional based on observational data is canonical in both the causal inference and bandit literatures. We analyze a broad class of two-stage procedures that first estimate the treatment effect function, and then use this quantity to estimate the linear functional. We prove non-asymptotic upper bounds on the mean-squared error of such procedures: these bounds reveal that in order to obtain non-asymptotically optimal procedures, the error in estimating the treatment effect should be minimized in a certain weighted $L^2$-norm. We analyze a two-stage procedure based on constrained regression in this weighted norm, and establish its instance-dependent optimality in finite samples via matching non-asymptotic local minimax lower bounds. These results show that the optimal non-asymptotic risk, in addition to depending on the asymptotically efficient variance, depends on the weighted norm distance between the true outcome function and its approximation by the richest function class supported by the sample size.
Comment: 56 pages, 6 figures
Published: 2022

30. Suppressing fear in the presence of a safety cue requires infralimbic cortical signaling to central amygdala

Author: Ng, Ka, Pollock, Michael, Escobedo, Abraham, Bachman, Brent, Miyazaki, Nanami, Bartlett, Edward L., and Sangha, Susan
Published: 2024

31. A multi-cohort phase 1b trial of rituximab in combination with immunotherapy doublets in relapsed/refractory follicular lymphoma

Author: Merryman, Reid W., Redd, Robert A., Freedman, Arnold S., Ahn, Inhye E., Brown, Jennifer R., Crombie, Jennifer L., Davids, Matthew S., Fisher, David C., Jacobsen, Eric D., Kim, Austin I., LaCasce, Ann S., Ng, Samuel, Odejide, Oreofe O., Parry, Erin M., Isufi, Iris, Kline, Justin, Cohen, Jonathon B., Mehta-Shah, Neha, Bartlett, Nancy L., Mei, Matthew, Kuntz, Thomas M., Wolff, Jacquelyn, Rodig, Scott J., Armand, Philippe, and Jacobson, Caron A.
Published: 2024

32. Power-conscious ecosystems: understanding how power dynamics in US doctoral advising shape students’ experiences

Author: Friedensen, Rachel E., Bettencourt, Genia M., and Bartlett, Megan L.
Published: 2024

33. ASO Visual Abstract: Detection of Residual Peritoneal Metastases Following Cytoreductive Surgery Using Pegsitacianine, a pH-Sensitive Imaging Agent—Final Results from a Phase 2 Study

Author: Wagner, Patrick, Levine, Edward A., Kim, Alex C., Shen, Perry, Fleming, Nicole D., Westin, Shannon N., Berry, Laurel K., Karakousis, Giorgos C., Tanyi, Janos L., Olson, Madeline T., Madajewski, Brian, Ostrander, Brian, Krishnan, Kartik, Balch, Charles M., and Bartlett, David L.
Published: 2024
Full Text: View/download PDF

34. ASO Visual Abstract: Normal CEA Levels After Neoadjuvant Chemotherapy and Cytoreduction with Hyperthermic Intraperitoneal Chemoperfusion Predict Improved Survival from Colorectal Peritoneal Metastases

Author: Wach, Michael M., Nunns, Geoffrey, Hamed, Ahmed, Derby, Joshua, Jelinek, Mark, Tatsuoka, Curtis, Holtzman, Matthew P., Zureikat, Amer H., Bartlett, David L., Ahrendt, Steven A., Pingpank, James F., Choudry, M. Haroon A., and Ongchin, Melanie
Published: 2024
Full Text: View/download PDF

35. Advancing Inclusive Research with People with Profound and Multiple Learning Disabilities through a Sensory-Dialogical Approach

Author: Gjermestad, Anita, Skarsaune, Synne N., and Bartlett, Ruth L.
Abstract: People with profound and multiple learning disabilities are often excluded from the processes of knowledge production and face barriers to inclusion in research due to cognitive and communicative challenges. Inclusive research--even when intending to be inclusive--tends to operate within criteria that exclude people with profound and multiple learning disabilities. The aim of this article is to provide a state-of-the-art review of the topic of inclusive research involving people with profound disabilities and thereby challenge traditional assumptions of inclusive research. The review presents themes that will inform a discussion on how to challenge the criteria in ways that make it possible to understand inclusive research for people who communicate in unconventional ways. We argue that a fruitful way of rethinking inclusive research is by applying a sensory-dialogical approach that privileges the dialogical and sensory foundations of the research. We suggest this might be a way to understand inclusive research that regards the person's communicative and cognitive distinctiveness.
Published: 2023
Full Text: View/download PDF

36. Random Feature Amplification: Feature Learning and Generalization in Neural Networks

Author: Frei, Spencer, Chatterji, Niladri S., and Bartlett, Peter L.
Subjects: Computer Science - Machine Learning, Mathematics - Statistics Theory, Statistics - Machine Learning
Abstract: In this work, we provide a characterization of the feature-learning process in two-layer ReLU networks trained by gradient descent on the logistic loss following random initialization. We consider data with binary labels that are generated by an XOR-like function of the input features. We permit a constant fraction of the training labels to be corrupted by an adversary. We show that, although linear classifiers are no better than random guessing for the distribution we consider, two-layer ReLU networks trained by gradient descent achieve generalization error close to the label noise rate. We develop a novel proof technique that shows that at initialization, the vast majority of neurons function as random features that are only weakly correlated with useful features, and the gradient descent dynamics 'amplify' these weak, random features to strong, useful features.
Comment: 46 pages; JMLR camera ready revision
Published: 2022

37. Benign Overfitting without Linearity: Neural Network Classifiers Trained by Gradient Descent for Noisy Linear Data

Author: Frei, Spencer, Chatterji, Niladri S., and Bartlett, Peter L.
Subjects: Computer Science - Machine Learning, Mathematics - Statistics Theory, Statistics - Machine Learning
Abstract: Benign overfitting, the phenomenon where interpolating models generalize well in the presence of noisy data, was first observed in neural network models trained with gradient descent. To better understand this empirical observation, we consider the generalization error of two-layer neural networks trained to interpolation by gradient descent on the logistic loss following random initialization. We assume the data comes from well-separated class-conditional log-concave distributions and allow for a constant fraction of the training labels to be corrupted by an adversary. We show that in this setting, neural networks exhibit benign overfitting: they can be driven to zero training error, perfectly fitting any noisy training labels, and simultaneously achieve minimax optimal test error. In contrast to previous work on benign overfitting that requires linear or kernel-based predictors, our analysis holds in a setting where both the model and learning dynamics are fundamentally nonlinear.
Comment: 39 pages; minor corrections
Published: 2022

38. Optimal variance-reduced stochastic approximation in Banach spaces

Author: Mou, Wenlong, Khamaru, Koulik, Wainwright, Martin J., Bartlett, Peter L., and Jordan, Michael I.
Subjects: Mathematics - Statistics Theory, Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: We study the problem of estimating the fixed point of a contractive operator defined on a separable Banach space. Focusing on a stochastic query model that provides noisy evaluations of the operator, we analyze a variance-reduced stochastic approximation scheme, and establish non-asymptotic bounds for both the operator defect and the estimation error, measured in an arbitrary semi-norm. In contrast to worst-case guarantees, our bounds are instance-dependent, and achieve the local asymptotic minimax risk non-asymptotically. For linear operators, contractivity can be relaxed to multi-step contractivity, so that the theory can be applied to problems such as average-reward policy evaluation in reinforcement learning. We illustrate the theory via applications to stochastic shortest path problems, two-player zero-sum Markov games, as well as policy evaluation and $Q$-learning for tabular Markov decision processes.
Published: 2022

39. Defining the Values and Quality of Life of Cancer Survivors Following Cytoreductive Surgery and Hyperthermic Intraperitoneal Chemotherapy: An International Survey Study

Author: Knotts, Chelsea M., Osman, Mayar A., Aderonmu, Aderinsola A., Bahary, Nathan, Wagner, Patrick L., Bartlett, David L., and Allen, Casey J.
Published: 2023
Full Text: View/download PDF

40. Targeted Next-Generation Sequencing Improves the Prognostication of Patients with Disseminated Appendiceal Mucinous Neoplasms (Pseudomyxoma Peritonei)

Author: Wald, Abigail I., Pingpank, James F., Ongchin, Melanie, Hall, Lauren B., Jones, Heather, Altpeter, Shannon, Liebdzinski, Michelle, Hamed, Ahmed B., Derby, Joshua, Nikiforova, Marina N., Bell, Phoenix D., Paniccia, Alessandro, Zureikat, Amer H., Gorantla, Vikram C., Rhee, John C., Thomas, Roby, Bartlett, David L., Smith, Katelyn, Henn, Patrick, Theisen, Brian K., Shyu, Susan, Shalaby, Akram, Choudry, M. Haroon A., and Singhi, Aatur D.
Published: 2023
Full Text: View/download PDF

41. Optimal and instance-dependent guarantees for Markovian linear stochastic approximation

Author: Mou, Wenlong, Pananjady, Ashwin, Wainwright, Martin J., and Bartlett, Peter L.
Subjects: Mathematics - Optimization and Control, Computer Science - Machine Learning, Mathematics - Probability, Mathematics - Statistics Theory, Statistics - Machine Learning
Abstract: We study stochastic approximation procedures for approximately solving a $d$-dimensional linear fixed point equation based on observing a trajectory of length $n$ from an ergodic Markov chain. We first exhibit a non-asymptotic bound of the order $t_{\mathrm{mix}} \tfrac{d}{n}$ on the squared error of the last iterate of a standard scheme, where $t_{\mathrm{mix}}$ is a mixing time. We then prove a non-asymptotic instance-dependent bound on a suitably averaged sequence of iterates, with a leading term that matches the local asymptotic minimax limit, including sharp dependence on the parameters $(d, t_{\mathrm{mix}})$ in the higher order terms. We complement these upper bounds with a non-asymptotic minimax lower bound that establishes the instance-optimality of the averaged SA estimator. We derive corollaries of these results for policy evaluation with Markov noise -- covering the TD($\lambda$) family of algorithms for all $\lambda \in [0, 1)$ -- and linear autoregressive models. Our instance-dependent characterizations open the door to the design of fine-grained model selection procedures for hyperparameter tuning (e.g., choosing the value of $\lambda$ when running the TD($\lambda$) algorithm).
Comment: Published at Mathematical Statistics and Learning
Published: 2021

42. ASO Visual Abstract: Characterizing the Immune Environment in Peritoneal Carcinomatosis—Insights for Novel Immunotherapy Strategies

Author: Wagner, Patrick L., Knotts, Chelsea M., Donnenberg, Vera S., Dadgar, Neda, Pico, Christian Cruz, Xiao, Kunhong, Zaidi, Ali, Schiffman, Suzanne C., Allen, Casey J., Donnenberg, Albert D., and Bartlett, David L.
Published: 2024
Full Text: View/download PDF

43. Impact of early relapse within 24 months after first-line systemic therapy (POD24) on outcomes in patients with marginal zone lymphoma: A US multisite study

Author: Epperla, Narendranath, Welkie, Rina Li, Torka, Pallawi, Shouse, Geoffrey, Karmali, Reem, Shea, Lauren, Anampa-Guzmán, Andrea, Oh, Timothy S., Reaves, Heather, Tavakkoli, Montreh, Lindsey, Kathryn, Greenwell, Irl Brian, Hansinger, Emily, Thomas, Colin, Chowdhury, Sayan Mullick, Annunzio, Kaitlin, Christian, Beth, Barta, Stefan K., Geethakumari, Praveen Ramakrishnan, Bartlett, Nancy L., Herrera, Alex F., Grover, Natalie S., and Olszewski, Adam J.
Published: 2023
Full Text: View/download PDF

44. The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks

Author: Chatterji, Niladri S., Long, Philip M., and Bartlett, Peter L.
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning, Mathematics - Statistics Theory
Abstract: The recent success of neural network models has shone light on a rather surprising statistical phenomenon: statistical models that perfectly fit noisy data can generalize well to unseen test data. Understanding this phenomenon of $\textit{benign overfitting}$ has attracted intense theoretical and empirical study. In this paper, we consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk when the covariates satisfy sub-Gaussianity and anti-concentration properties, and the noise is independent and sub-Gaussian. By leveraging recent results that characterize the implicit bias of this estimator, our bounds emphasize the role of both the quality of the initialization as well as the properties of the data covariance matrix in achieving low excess risk.
Comment: Accepted for publication at JMLR
Published: 2021

45. Adversarial Examples in Multi-Layer Random ReLU Networks

Author: Bartlett, Peter L., Bubeck, Sébastien, and Cherapanamjeri, Yeshwanth
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Statistics - Machine Learning
Abstract: We consider the phenomenon of adversarial examples in ReLU networks with independent Gaussian parameters. For networks of constant depth and with a large range of widths (for instance, it suffices if the width of each layer is polynomial in that of any other layer), small perturbations of input vectors lead to large changes of outputs. This generalizes results of Daniely and Schacham (2020) for networks of rapidly decreasing width and of Bubeck et al. (2021) for two-layer networks. The proof shows that adversarial examples arise in these networks because the functions that they compute are very close to linear. Bottleneck layers in the network play a key role: the minimal width up to some point in the network determines scales and sensitivities of mappings computed up to that point. The main result is for networks with constant depth, but we also show that some constraint on depth is necessary for a result of this kind, because there are suitably deep networks that, with constant probability, compute a function that is close to constant.
Published: 2021

46. Predicting Severe Complications from Cytoreductive Surgery with Hyperthermic Intraperitoneal Chemotherapy: A Data-Driven, Machine Learning Approach to Augment Clinical Judgment

Author: Adam, Mohamed A., Zhou, Helen, Byrd, Jonathan, Greenberg, Anya L., Kelly, Yvonne M., Hall, Lauren, Jones, Heather L., Pingpank, James F., Lipton, Zachary C., Bartlett, David L., and Choudry, Haroon M.
Published: 2023
Full Text: View/download PDF

47. On the Theory of Reinforcement Learning with Once-per-Episode Feedback

Author: Chatterji, Niladri S., Pacchiano, Aldo, Bartlett, Peter L., and Jordan, Michael I.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Statistics - Machine Learning
Abstract: We study a theory of reinforcement learning (RL) in which the learner receives binary feedback only once at the end of an episode. While this is an extreme test case for theory, it is also arguably more representative of real-world applications than the traditional requirement in RL practice that the learner receive feedback at every time step. Indeed, in many real-world applications of reinforcement learning, such as self-driving cars and robotics, it is easier to evaluate whether a learner's complete trajectory was either "good" or "bad," but harder to provide a reward signal at each step. To show that learning is possible in this more challenging setting, we study the case where trajectory labels are generated by an unknown parametric model, and provide a statistically and computationally efficient algorithm that achieves sublinear regret.
Comment: Published at NeurIPS 2022
Published: 2021

48. Preference learning along multiple criteria: A game-theoretic perspective

Author: Bhatia, Kush, Pananjady, Ashwin, Bartlett, Peter L., Dragan, Anca D., and Wainwright, Martin J.
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: The literature on ranking from ordinal data is vast, and there are several ways to aggregate overall preferences from pairwise comparisons between objects. In particular, it is well known that any Nash equilibrium of the zero sum game induced by the preference matrix defines a natural solution concept (winning distribution over objects) known as a von Neumann winner. Many real-world problems, however, are inevitably multi-criteria, with different pairwise preferences governing the different criteria. In this work, we generalize the notion of a von Neumann winner to the multi-criteria setting by taking inspiration from Blackwell's approachability. Our framework allows for non-linear aggregation of preferences across criteria, and generalizes the linearization-based approach from multi-objective optimization. From a theoretical standpoint, we show that the Blackwell winner of a multi-criteria problem instance can be computed as the solution to a convex optimization problem. Furthermore, given random samples of pairwise comparisons, we show that a simple plug-in estimator achieves near-optimal minimax sample complexity. Finally, we showcase the practical utility of our framework in a user study on autonomous driving, where we find that the Blackwell winner outperforms the von Neumann winner for the overall preferences.
Comment: 47 pages; published as a conference paper at NeurIPS 2020
Published: 2021

49. Agnostic learning with unknown utilities

Author: Bhatia, Kush, Bartlett, Peter L., Dragan, Anca D., and Steinhardt, Jacob
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Traditional learning approaches for classification implicitly assume that each mistake has the same cost. In many real-world problems though, the utility of a decision depends on the underlying context $x$ and decision $y$. However, directly incorporating these utilities into the learning objective is often infeasible since these can be quite complex and difficult for humans to specify. We formally study this as agnostic learning with unknown utilities: given a dataset $S = \{x_1, \ldots, x_n\}$ where each data point $x_i \sim \mathcal{D}$, the objective of the learner is to output a function $f$ in some class of decision functions $\mathcal{F}$ with small excess risk. This risk measures the performance of the output predictor $f$ with respect to the best predictor in the class $\mathcal{F}$ on the unknown underlying utility $u^*$. This utility $u^*$ is not assumed to have any specific structure. This raises an interesting question whether learning is even possible in our setup, given that obtaining a generalizable estimate of utility $u^*$ might not be possible from finitely many samples. Surprisingly, we show that estimating the utilities of only the sampled points $S$ suffices to learn a decision function which generalizes well. We study mechanisms for eliciting information which allow a learner to estimate the utilities $u^*$ on the set $S$. We introduce a family of elicitation mechanisms by generalizing comparisons, called the $k$-comparison oracle, which enables the learner to ask for comparisons across $k$ different inputs $x$ at once. We show that the excess risk in our agnostic learning framework decreases at a rate of $O\left(\frac{1}{k}\right)$. This result brings out an interesting accuracy-elicitation trade-off -- as the order $k$ of the oracle increases, the comparative queries become harder to elicit from humans but allow for more accurate learning.
Comment: 30 pages; published as a conference paper at ITCS 2021
Published: 2021

50. Infinite-Horizon Offline Reinforcement Learning with Linear Function Approximation: Curse of Dimensionality and Algorithm

Author: Chen, Lin, Scherrer, Bruno, and Bartlett, Peter L.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: In this paper, we investigate the sample complexity of policy evaluation in infinite-horizon offline reinforcement learning (also known as the off-policy evaluation problem) with linear function approximation. We identify a hard regime $d\gamma^{2}>1$, where $d$ is the dimension of the feature vector and $\gamma$ is the discount rate. In this regime, for any $q\in[\gamma^{2},1]$, we can construct a hard instance such that the smallest eigenvalue of its feature covariance matrix is $q/d$ and it requires $\Omega\left(\frac{d}{\gamma^{2}\left(q-\gamma^{2}\right)\varepsilon^{2}}\exp\left(\Theta\left(d\gamma^{2}\right)\right)\right)$ samples to approximate the value function up to an additive error $\varepsilon$. Note that the lower bound of the sample complexity is exponential in $d$. If $q=\gamma^{2}$, even infinite data cannot suffice. Under the low distribution shift assumption, we show that there is an algorithm that needs at most $O\left(\max\left\{ \frac{\left\Vert \theta^{\pi}\right\Vert _{2}^{4}}{\varepsilon^{4}}\log\frac{d}{\delta},\frac{1}{\varepsilon^{2}}\left(d+\log\frac{1}{\delta}\right)\right\} \right)$ samples ($\theta^{\pi}$ is the parameter of the policy in linear function approximation) and guarantees approximation to the value function up to an additive error of $\varepsilon$ with probability at least $1-\delta$.
Published: 2021
