Author: "Ankit Singh" / Database: arXiv - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Ankit Singh"' showing total 62 results

Start Over Author "Ankit Singh" Database arXiv

62 results on '"Ankit Singh"'

1. A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs

Author: Rawat, Ankit Singh, Sadhanala, Veeranjaneyulu, Rostamizadeh, Afshin, Chakrabarti, Ayan, Jitkrittum, Wittawat, Feinberg, Vladimir, Kim, Seungyeon, Harutyunyan, Hrayr, Saunshi, Nikunj, Nado, Zachary, Shivanna, Rakesh, Reddi, Sashank J., Menon, Aditya Krishna, Anil, Rohan, and Kumar, Sanjiv
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language
Abstract: A primary challenge in large language model (LLM) development is their onerous pre-training cost. Typically, such pre-training involves optimizing a self-supervised objective (such as next-token prediction) over a large corpus. This paper explores a promising paradigm to improve LLM pre-training efficiency and quality by suitably leveraging a small language model (SLM). In particular, this paradigm relies on an SLM to both (1) provide soft labels as additional training supervision, and (2) select a small subset of valuable ("informative" and "hard") training examples. Put together, this enables an effective transfer of the SLM's predictive distribution to the LLM, while prioritizing specific regions of the training data distribution. Empirically, this leads to reduced LLM training time compared to standard training, while improving the overall quality. Theoretically, we develop a statistical framework to systematically study the utility of SLMs in enabling efficient training of high-quality LLMs. In particular, our framework characterizes how the SLM's seemingly low-quality supervision can enhance the training of a much more capable LLM. Furthermore, it also highlights the need for an adaptive utilization of such supervision, by striking a balance between the bias and variance introduced by the SLM-provided soft labels. We corroborate our theoretical framework by improving the pre-training of an LLM with 2.8B parameters by utilizing a smaller LM with 1.5B parameters on the Pile dataset.
Published: 2024

2. A Statistical Framework for Data-dependent Retrieval-Augmented Models

Author: Basu, Soumya, Rawat, Ankit Singh, and Zaheer, Manzil
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Modern ML systems increasingly augment input instances with additional relevant information to enhance final prediction. Despite growing interest in such retrieval-augmented models, their fundamental properties and training are not well understood. We propose a statistical framework to study such models with two components: 1) a {\em retriever} to identify the relevant information out of a large corpus via a data-dependent metric; and 2) a {\em predictor} that consumes the input instances along with the retrieved information to make the final predictions. We present a principled method for end-to-end training of both components and draw connections with various training approaches in the literature. Furthermore, we establish excess risk bounds for retrieval-augmented models while delineating the contributions of both retriever and predictor towards the model performance. We validate the utility of our proposed training methods along with the key takeaways from our statistical analysis on open domain question answering task where retrieval augmentation is important.
Published: 2024

3. Analysis of Plan-based Retrieval for Grounded Text Generation

Author: Godbole, Ameya, Monath, Nicholas, Kim, Seungyeon, Rawat, Ankit Singh, McCallum, Andrew, and Zaheer, Manzil
Subjects: Computer Science - Computation and Language, Computer Science - Information Retrieval
Abstract: In text generation, hallucinations refer to the generation of seemingly coherent text that contradicts established knowledge. One compelling hypothesis is that hallucinations occur when a language model is given a generation task outside its parametric knowledge (due to rarity, recency, domain, etc.). A common strategy to address this limitation is to infuse the language models with retrieval mechanisms, providing the model with relevant knowledge for the task. In this paper, we leverage the planning capabilities of instruction-tuned LLMs and analyze how planning can be used to guide retrieval to further reduce the frequency of hallucinations. We empirically evaluate several variations of our proposed approach on long-form text generation tasks. By improving the coverage of relevant facts, plan-guided retrieval and generation can produce more informative responses while providing a higher rate of attribution to source documents.
Published: 2024

4. Fine-grained Analysis of In-context Linear Estimation: Data, Architecture, and Beyond

Author: Li, Yingcong, Rawat, Ankit Singh, and Oymak, Samet
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Mathematics - Optimization and Control
Abstract: Recent research has shown that Transformers with linear attention are capable of in-context learning (ICL) by implementing a linear estimator through gradient descent steps. However, the existing results on the optimization landscape apply under stylized settings where task and feature vectors are assumed to be IID and the attention weights are fully parameterized. In this work, we develop a stronger characterization of the optimization and generalization landscape of ICL through contributions on architectures, low-rank parameterization, and correlated designs: (1) We study the landscape of 1-layer linear attention and 1-layer H3, a state-space model. Under a suitable correlated design assumption, we prove that both implement 1-step preconditioned gradient descent. We show that thanks to its native convolution filters, H3 also has the advantage of implementing sample weighting and outperforming linear attention in suitable settings. (2) By studying correlated designs, we provide new risk bounds for retrieval augmented generation (RAG) and task-feature alignment which reveal how ICL sample complexity benefits from distributional alignment. (3) We derive the optimal risk for low-rank parameterized attention weights in terms of covariance spectrum. Through this, we also shed light on how LoRA can adapt to a new distribution by capturing the shift between task covariances. Experimental results corroborate our theoretical findings. Overall, this work explores the optimization and risk landscape of ICL in practically meaningful settings and contributes to a more thorough understanding of its mechanics.
Published: 2024

5. Efficient Document Ranking with Learnable Late Interactions

Author: Ji, Ziwei, Jain, Himanshu, Veit, Andreas, Reddi, Sashank J., Jayasumana, Sadeep, Rawat, Ankit Singh, Menon, Aditya Krishna, Yu, Felix, and Kumar, Sanjiv
Subjects: Computer Science - Information Retrieval, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Cross-Encoder (CE) and Dual-Encoder (DE) models are two fundamental approaches for query-document relevance in information retrieval. To predict relevance, CE models use joint query-document embeddings, while DE models maintain factorized query and document embeddings; usually, the former has higher quality while the latter benefits from lower latency. Recently, late-interaction models have been proposed to realize more favorable latency-quality tradeoffs, by using a DE structure followed by a lightweight scorer based on query and document token embeddings. However, these lightweight scorers are often hand-crafted, and there is no understanding of their approximation power; further, such scorers require access to individual document token embeddings, which imposes an increased latency and storage burden. In this paper, we propose novel learnable late-interaction models (LITE) that resolve these issues. Theoretically, we prove that LITE is a universal approximator of continuous scoring functions, even for relatively small embedding dimension. Empirically, LITE outperforms previous late-interaction models such as ColBERT on both in-domain and zero-shot re-ranking tasks. For instance, experiments on MS MARCO passage re-ranking show that LITE not only yields a model with better generalization, but also lowers latency and requires 0.25x storage compared to ColBERT.
Published: 2024

6. Cascade-Aware Training of Language Models

Author: Wang, Congchao, Augenstein, Sean, Rush, Keith, Jitkrittum, Wittawat, Narasimhan, Harikrishna, Rawat, Ankit Singh, Menon, Aditya Krishna, and Go, Alec
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Reducing serving cost and latency is a fundamental concern for the deployment of language models (LMs) in business applications. To address this, cascades of LMs offer an effective solution that conditionally employ smaller models for simpler queries. Cascaded systems are typically built with independently trained models, neglecting the advantages of considering inference-time interactions of the cascaded LMs during training. In this paper, we present cascade-aware training(CAT), an approach to optimizing the overall quality-cost performance tradeoff of a cascade of LMs. We achieve inference-time benefits by training the small LM with awareness of its place in a cascade and downstream capabilities. We demonstrate the value of the proposed method with over 60 LM tasks of the SuperGLUE, WMT22, and FLAN2021 datasets., Comment: 22 pages, 13 figures
Published: 2024

7. Faster Cascades via Speculative Decoding

Author: Narasimhan, Harikrishna, Jitkrittum, Wittawat, Rawat, Ankit Singh, Kim, Seungyeon, Gupta, Neha, Menon, Aditya Krishna, and Kumar, Sanjiv
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Cascades and speculative decoding are two common approaches to improving language models' inference efficiency. Both approaches involve interleaving models of different sizes, but via fundamentally distinct mechanisms: cascades employ a deferral rule that invokes the larger model only for "hard" inputs, while speculative decoding uses speculative execution to primarily invoke the larger model in parallel verification mode. These mechanisms offer different benefits: empirically, cascades offer better cost-quality trade-offs, often even outperforming the large model, while theoretically, speculative decoding offers a guarantee of quality-neutrality. In this paper, we leverage the best of both these approaches by designing new speculative cascading techniques that implement their deferral rule through speculative execution. We characterize the optimal deferral rule for our speculative cascades, and employ a plug-in approximation to the optimal rule. Experiments with Gemma and T5 models on a range of language benchmarks show that our approach yields better cost quality trade-offs than cascading and speculative decoding baselines.
Published: 2024

8. Language Model Cascades: Token-level uncertainty and beyond

Author: Gupta, Neha, Narasimhan, Harikrishna, Jitkrittum, Wittawat, Rawat, Ankit Singh, Menon, Aditya Krishna, and Kumar, Sanjiv
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Recent advances in language models (LMs) have led to significant improvements in quality on complex NLP tasks, but at the expense of increased inference costs. Cascading offers a simple strategy to achieve more favorable cost-quality tradeoffs: here, a small model is invoked for most "easy" instances, while a few "hard" instances are deferred to the large model. While the principles underpinning cascading are well-studied for classification tasks - with deferral based on predicted class uncertainty favored theoretically and practically - a similar understanding is lacking for generative LM tasks. In this work, we initiate a systematic study of deferral rules for LM cascades. We begin by examining the natural extension of predicted class uncertainty to generative LM tasks, namely, the predicted sequence uncertainty. We show that this measure suffers from the length bias problem, either over- or under-emphasizing outputs based on their lengths. This is because LMs produce a sequence of uncertainty values, one for each output token; and moreover, the number of output tokens is variable across examples. To mitigate this issue, we propose to exploit the richer token-level uncertainty information implicit in generative LMs. We argue that naive predicted sequence uncertainty corresponds to a simple aggregation of these uncertainties. By contrast, we show that incorporating token-level uncertainty through learned post-hoc deferral rules can significantly outperform such simple aggregation strategies, via experiments on a range of natural language benchmarks with FLAN-T5 models. We further show that incorporating embeddings from the smaller model and intermediate layers of the larger model can give an additional boost in the overall cost-quality tradeoff.
Published: 2024

9. Mechanics of Next Token Prediction with Self-Attention

Author: Li, Yingcong, Huang, Yixiao, Ildiz, M. Emrullah, Rawat, Ankit Singh, and Oymak, Samet
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Mathematics - Optimization and Control
Abstract: Transformer-based language models are trained on large datasets to predict the next token given an input sequence. Despite this simple training objective, they have led to revolutionary advances in natural language processing. Underlying this success is the self-attention mechanism. In this work, we ask: $\textit{What}$ $\textit{does}$ $\textit{a}$ $\textit{single}$ $\textit{self-attention}$ $\textit{layer}$ $\textit{learn}$ $\textit{from}$ $\textit{next-token}$ $\textit{prediction?}$ We show that training self-attention with gradient descent learns an automaton which generates the next token in two distinct steps: $\textbf{(1)}$ $\textbf{Hard}$ $\textbf{retrieval:}$ Given input sequence, self-attention precisely selects the $\textit{high-priority}$ $\textit{input}$ $\textit{tokens}$ associated with the last input token. $\textbf{(2)}$ $\textbf{Soft}$ $\textbf{composition:}$ It then creates a convex combination of the high-priority tokens from which the next token can be sampled. Under suitable conditions, we rigorously characterize these mechanics through a directed graph over tokens extracted from the training data. We prove that gradient descent implicitly discovers the strongly-connected components (SCC) of this graph and self-attention learns to retrieve the tokens that belong to the highest-priority SCC available in the context window. Our theory relies on decomposing the model weights into a directional component and a finite component that correspond to hard retrieval and soft composition steps respectively. This also formalizes a related implicit bias formula conjectured in [Tarzanagh et al. 2023]. We hope that these findings shed light on how self-attention processes sequential data and pave the path toward demystifying more complex architectures., Comment: Accepted to AISTATS 2024
Published: 2024

10. From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers

Author: Ildiz, M. Emrullah, Huang, Yixiao, Li, Yingcong, Rawat, Ankit Singh, and Oymak, Samet
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Modern language models rely on the transformer architecture and attention mechanism to perform language understanding and text generation. In this work, we study learning a 1-layer self-attention model from a set of prompts and associated output data sampled from the model. We first establish a precise mapping between the self-attention mechanism and Markov models: Inputting a prompt to the model samples the output token according to a context-conditioned Markov chain (CCMC) which weights the transition matrix of a base Markov chain. Additionally, incorporating positional encoding results in position-dependent scaling of the transition probabilities. Building on this formalism, we develop identifiability/coverage conditions for the prompt distribution that guarantee consistent estimation and establish sample complexity guarantees under IID samples. Finally, we study the problem of learning from a single output trajectory generated from an initial prompt. We characterize an intriguing winner-takes-all phenomenon where the generative process implemented by self-attention collapses into sampling a limited subset of tokens due to its non-mixing nature. This provides a mathematical explanation to the tendency of modern LLMs to generate repetitive text. In summary, the equivalence to CCMC provides a simple but powerful framework to study self-attention and its properties., Comment: 30 pages
Published: 2024

11. DistillSpec: Improving Speculative Decoding via Knowledge Distillation

Author: Zhou, Yongchao, Lyu, Kaifeng, Rawat, Ankit Singh, Menon, Aditya Krishna, Rostamizadeh, Afshin, Kumar, Sanjiv, Kagy, Jean-François, and Agarwal, Rishabh
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens, which are then verified in parallel by the larger target model, resulting in the text generated according to the target model distribution. However, identifying a compact draft model that is well-aligned with the target model is challenging. To tackle this issue, we propose DistillSpec that uses knowledge distillation to better align the draft model with the target model, before applying SD. DistillSpec makes two key design choices, which we demonstrate via systematic study to be crucial to improving the draft and target alignment: utilizing on-policy data generation from the draft model, and tailoring the divergence function to the task and decoding strategy. Notably, DistillSpec yields impressive 10 - 45% speedups over standard SD on a range of standard benchmarks, using both greedy and non-greedy sampling. Furthermore, we combine DistillSpec with lossy SD to achieve fine-grained control over the latency vs. task performance trade-off. Finally, in practical scenarios with models of varying sizes, first using distillation to boost the performance of the target model and then applying DistillSpec to train a well-aligned draft model can reduce decoding latency by 6-10x with minimal performance drop, compared to standard decoding without distillation.
Published: 2023

12. What do larger image classifiers memorise?

Author: Lukasik, Michal, Nagarajan, Vaishnavh, Rawat, Ankit Singh, Menon, Aditya Krishna, and Kumar, Sanjiv
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Machine Learning (cs.LG), Artificial Intelligence (cs.AI) Machine Learning (stat.ML)
Abstract: The success of modern neural networks has prompted study of the connection between memorisation and generalisation: overparameterised models generalise well, despite being able to perfectly fit (memorise) completely random labels. To carefully study this issue, Feldman proposed a metric to quantify the degree of memorisation of individual training examples, and empirically computed the corresponding memorisation profile of a ResNet on image classification bench-marks. While an exciting first glimpse into what real-world models memorise, this leaves open a fundamental question: do larger neural models memorise more? We present a comprehensive empirical analysis of this question on image classification benchmarks. We find that training examples exhibit an unexpectedly diverse set of memorisation trajectories across model sizes: most samples experience decreased memorisation under larger models, while the rest exhibit cap-shaped or increasing memorisation. We show that various proxies for the Feldman memorization score fail to capture these fundamental trends. Lastly, we find that knowledge distillation, an effective and popular model compression technique, tends to inhibit memorisation, while also improving generalisation. Specifically, memorisation is mostly inhibited on examples with increasing memorisation trajectories, thus pointing at how distillation improves generalisation.
Published: 2023

13. Think before you speak: Training Language Models With Pause Tokens

Author: Goyal, Sachin, Ji, Ziwei, Rawat, Ankit Singh, Menon, Aditya Krishna, Kumar, Sanjiv, and Nagarajan, Vaishnavh
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Language models generate responses by producing a series of tokens in immediate succession: the $(K+1)^{th}$ token is an outcome of manipulating $K$ hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate say, $K+10$ hidden vectors, before it outputs the $(K+1)^{th}$ token? We operationalize this idea by performing training and inference on language models with a (learnable) $\textit{pause}$ token, a sequence of which is appended to the input prefix. We then delay extracting the model's outputs until the last pause token is seen, thereby allowing the model to process extra computation before committing to an answer. We empirically evaluate $\textit{pause-training}$ on decoder-only models of 1B and 130M parameters with causal pretraining on C4, and on downstream tasks covering reasoning, question-answering, general understanding and fact recall. Our main finding is that inference-time delays show gains when the model is both pre-trained and finetuned with delays. For the 1B model, we witness gains on 8 of 9 tasks, most prominently, a gain of $18\%$ EM score on the QA task of SQuAD, $8\%$ on CommonSenseQA and $1\%$ accuracy on the reasoning task of GSM8k. Our work raises a range of conceptual and practical future research questions on making delayed next-token prediction a widely applicable new paradigm., Comment: Published at ICLR 2024
Published: 2023

14. $\mu$2mech: a Software Package Combining Microstructure Modeling and Mechanical Property Prediction

Author: Linda, Albert, Negi, Ankit Singh, Panwar, Vishal, Chafle, Rupesh, Bhowmick, Somnath, Das, Kaushik, and Mukherjee, Rajdip
Subjects: Condensed Matter - Materials Science
Abstract: We have developed a graphical user interface (GUI) based package $\mu$2mech to perform phase-field simulation for predicting microstructure evolution. The package can take inputs from ab initio calculations and CALPHAD (Calculation of Phase Diagrams) tools for quantitative microstructure prediction. The package also provides a seamless connection to transfer output from the mesoscale phase field method to the microscale finite element analysis for mechanical property prediction. Such a multiscale simulation package can facilitate microstructure-property correlation, one of the cornerstones in accelerated materials development within the integrated computational materials engineering (ICME) framework.
Published: 2023
Full Text: View/download PDF

15. When Does Confidence-Based Cascade Deferral Suffice?

Author: Jitkrittum, Wittawat, Gupta, Neha, Menon, Aditya Krishna, Narasimhan, Harikrishna, Rawat, Ankit Singh, and Kumar, Sanjiv
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Cascades are a classical strategy to enable inference cost to vary adaptively across samples, wherein a sequence of classifiers are invoked in turn. A deferral rule determines whether to invoke the next classifier in the sequence, or to terminate prediction. One simple deferral rule employs the confidence of the current classifier, e.g., based on the maximum predicted softmax probability. Despite being oblivious to the structure of the cascade -- e.g., not modelling the errors of downstream models -- such confidence-based deferral often works remarkably well in practice. In this paper, we seek to better understand the conditions under which confidence-based deferral may fail, and when alternate deferral strategies can perform better. We first present a theoretical characterisation of the optimal deferral rule, which precisely characterises settings under which confidence-based deferral may suffer. We then study post-hoc deferral mechanisms, and demonstrate they can significantly improve upon confidence-based deferral in settings where (i) downstream models are specialists that only work well on a subset of inputs, (ii) samples are subject to label noise, and (iii) there is distribution shift between the train and test set., Comment: NeurIPS 2023
Published: 2023

16. On the Role of Attention in Prompt-tuning

Author: Oymak, Samet, Rawat, Ankit Singh, Soltanolkotabi, Mahdi, and Thrampoulidis, Christos
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Statistics - Machine Learning
Abstract: Prompt-tuning is an emerging strategy to adapt large language models (LLM) to downstream tasks by learning a (soft-)prompt parameter from data. Despite its success in LLMs, there is limited theoretical understanding of the power of prompt-tuning and the role of the attention mechanism in prompting. In this work, we explore prompt-tuning for one-layer attention architectures and study contextual mixture-models where each input token belongs to a context-relevant or -irrelevant set. We isolate the role of prompt-tuning through a self-contained prompt-attention model. Our contributions are as follows: (1) We show that softmax-prompt-attention is provably more expressive than softmax-self-attention and linear-prompt-attention under our contextual data model. (2) We analyze the initial trajectory of gradient descent and show that it learns the prompt and prediction head with near-optimal sample complexity and demonstrate how prompt can provably attend to sparse context-relevant tokens. (3) Assuming a known prompt but an unknown prediction head, we characterize the exact finite sample performance of prompt-attention which reveals the fundamental performance limits and the precise benefit of the context information. We also provide experiments that verify our theoretical insights on real datasets and demonstrate how prompt-tuning enables the model to attend to context-relevant information., Comment: Published at ICML 2023
Published: 2023

17. ResMem: Learn what you can and memorize the rest

Author: Yang, Zitong, Lukasik, Michal, Nagarajan, Vaishnavh, Li, Zonglin, Rawat, Ankit Singh, Zaheer, Manzil, Menon, Aditya Krishna, and Kumar, Sanjiv
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Statistics - Methodology, Statistics - Machine Learning
Abstract: The impressive generalization performance of modern neural networks is attributed in part to their ability to implicitly memorize complex training patterns. Inspired by this, we explore a novel mechanism to improve model generalization via explicit memorization. Specifically, we propose the residual-memorization (ResMem) algorithm, a new method that augments an existing prediction model (e.g. a neural network) by fitting the model's residuals with a $k$-nearest neighbor based regressor. The final prediction is then the sum of the original model and the fitted residual regressor. By construction, ResMem can explicitly memorize the training labels. Empirically, we show that ResMem consistently improves the test set generalization of the original prediction model across various standard vision and natural language processing benchmarks. Theoretically, we formulate a stylized linear regression problem and rigorously show that ResMem results in a more favorable test risk over the base predictor.
Published: 2023

18. Supervision Complexity and its Role in Knowledge Distillation

Author: Harutyunyan, Hrayr, Rawat, Ankit Singh, Menon, Aditya Krishna, Kim, Seungyeon, and Kumar, Sanjiv
Subjects: Computer Science - Machine Learning
Abstract: Despite the popularity and efficacy of knowledge distillation, there is limited understanding of why it helps. In order to study the generalization behavior of a distilled student, we propose a new theoretical framework that leverages supervision complexity: a measure of alignment between teacher-provided supervision and the student's neural tangent kernel. The framework highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher predictions, and the complexity of the teacher predictions. Specifically, it provides a rigorous justification for the utility of various techniques that are prevalent in the context of distillation, such as early stopping and temperature scaling. Our analysis further suggests the use of online distillation, where a student receives increasingly more complex supervision from teachers in different stages of their training. We demonstrate efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures., Comment: Published at ICLR 2023
Published: 2023

19. EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval

Author: Kim, Seungyeon, Rawat, Ankit Singh, Zaheer, Manzil, Jayasumana, Sadeep, Sadhanala, Veeranjaneyulu, Jitkrittum, Wittawat, Menon, Aditya Krishna, Fergus, Rob, and Kumar, Sanjiv
Subjects: Computer Science - Machine Learning
Abstract: Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR). In this paper, we aim to improve distillation methods that pave the way for the resource-efficient deployment of such models in practice. Inspired by our theoretical analysis of the teacher-student generalization gap for IR models, we propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model. Unlike existing teacher score-based distillation methods, our proposed approach employs embedding matching tasks to provide a stronger signal to align the representations of the teacher and student models. In addition, it utilizes query generation to explore the data manifold to reduce the discrepancies between the student and the teacher where training data is sparse. Furthermore, our analysis also motivates novel asymmetric architectures for student models which realizes better embedding alignment without increasing online inference cost. On standard benchmarks like MSMARCO, we show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
Published: 2023

20. Large Language Models with Controllable Working Memory

Author: Li, Daliang, Rawat, Ankit Singh, Zaheer, Manzil, Wang, Xin, Lukasik, Michal, Veit, Andreas, Yu, Felix, and Kumar, Sanjiv
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP), owing to their excellent understanding and generation abilities. Remarkably, what further sets these models apart is the massive amounts of world knowledge they internalize during pretraining. While many downstream applications provide the model with an informational context to aid its performance on the underlying task, how the model's world knowledge interacts with the factual information presented in the context remains under explored. As a desirable behavior, an LLM should give precedence to the context whenever it contains task-relevant information that conflicts with the model's memorized knowledge. This enables model predictions to be grounded in the context, which can then be used to update or correct specific model predictions without frequent retraining. By contrast, when the context is irrelevant to the task, the model should ignore it and fall back on its internal knowledge. In this paper, we undertake a first joint study of the aforementioned two properties, namely controllability and robustness, in the context of LLMs. We demonstrate that state-of-the-art T5 and PaLM (both pretrained and finetuned) could exhibit poor controllability and robustness, which do not scale with increasing model size. As a solution, we propose a novel method - Knowledge Aware FineTuning (KAFT) - to strengthen both controllability and robustness by incorporating counterfactual and irrelevant contexts to standard supervised datasets. Our comprehensive evaluation showcases the utility of KAFT across model architectures and sizes.
Published: 2022

21. The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers

Author: Li, Zonglin, You, Chong, Bhojanapalli, Srinadh, Li, Daliang, Rawat, Ankit Singh, Reddi, Sashank J., Ye, Ke, Chern, Felix, Yu, Felix, Guo, Ruiqi, and Kumar, Sanjiv
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning
Abstract: This paper studies the curious phenomenon for machine learning models with Transformer architectures that their activation maps are sparse. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by sparse we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to MLP. Moreover, larger Transformers with more layers and wider MLP hidden dimensions are sparser as measured by the percentage of nonzero entries. Through extensive experiments we demonstrate that the emergence of sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks, on both training and evaluation data, for Transformers of various configurations, at layers of all depth levels, as well as for other architectures including MLP-mixers and 2-layer MLPs. We show that sparsity also emerges using training datasets with random labels, or with random inputs, or with infinite amount of data, demonstrating that sparsity is not a result of a specific family of datasets. We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers. Moreover, we demonstrate perhaps surprisingly that enforcing an even sparser activation via Top-k thresholding with a small value of k brings a collection of desired but missing properties for Transformers, namely less sensitivity to noisy training data, more robustness to input corruptions, and better calibration for their prediction confidence., Comment: A short version was presented at ICLR 2023. Previous title: Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers
Published: 2022

22. Generalization Properties of Retrieval-based Models

Author: Basu, Soumya, Rawat, Ankit Singh, and Zaheer, Manzil
Subjects: Computer Science - Machine Learning
Abstract: Many modern high-performing machine learning models such as GPT-3 primarily rely on scaling up models, e.g., transformer networks. Simultaneously, a parallel line of work aims to improve the model performance by augmenting an input instance with other (labeled) instances during inference. Examples of such augmentations include task-specific prompts and similar examples retrieved from the training data by a nonparametric component. Remarkably, retrieval-based methods have enjoyed success on a wide range of problems, ranging from standard natural language processing and vision tasks to protein folding, as demonstrated by many recent efforts, including WebGPT and AlphaFold. Despite growing literature showcasing the promise of these models, the theoretical underpinning for such models remains underexplored. In this paper, we present a formal treatment of retrieval-based models to characterize their generalization ability. In particular, we focus on two classes of retrieval-based classification approaches: First, we analyze a local learning framework that employs an explicit local empirical risk minimization based on retrieved examples for each input instance. Interestingly, we show that breaking down the underlying learning task into local sub-tasks enables the model to employ a low complexity parametric component to ensure good overall accuracy. The second class of retrieval-based approaches we explore learns a global model using kernel methods to directly map an input instance and retrieved examples to a prediction, without explicitly solving a local learning task.
Published: 2022

23. A Fourier Approach to Mixture Learning

Author: Qiao, Mingda, Guruganesh, Guru, Rawat, Ankit Singh, Dubey, Avinava, and Zaheer, Manzil
Subjects: Computer Science - Machine Learning, Computer Science - Data Structures and Algorithms, Statistics - Machine Learning
Abstract: We revisit the problem of learning mixtures of spherical Gaussians. Given samples from mixture $\frac{1}{k}\sum_{j=1}^{k}\mathcal{N}(\mu_j, I_d)$, the goal is to estimate the means $\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^d$ up to a small error. The hardness of this learning problem can be measured by the separation $\Delta$ defined as the minimum distance between all pairs of means. Regev and Vijayaraghavan (2017) showed that with $\Delta = \Omega(\sqrt{\log k})$ separation, the means can be learned using $\mathrm{poly}(k, d)$ samples, whereas super-polynomially many samples are required if $\Delta = o(\sqrt{\log k})$ and $d = \Omega(\log k)$. This leaves open the low-dimensional regime where $d = o(\log k)$. In this work, we give an algorithm that efficiently learns the means in $d = O(\log k/\log\log k)$ dimensions under separation $d/\sqrt{\log k}$ (modulo doubly logarithmic factors). This separation is strictly smaller than $\sqrt{\log k}$, and is also shown to be necessary. Along with the results of Regev and Vijayaraghavan (2017), our work almost pins down the critical separation threshold at which efficient parameter learning becomes possible for spherical Gaussian mixtures. More generally, our algorithm runs in time $\mathrm{poly}(k)\cdot f(d, \Delta, \epsilon)$, and is thus fixed-parameter tractable in parameters $d$, $\Delta$ and $\epsilon$. Our approach is based on estimating the Fourier transform of the mixture at carefully chosen frequencies, and both the algorithm and its analysis are simple and elementary. Our positive results can be easily extended to learning mixtures of non-Gaussian distributions, under a mild condition on the Fourier spectrum of the distribution., Comment: To appear at NeurIPS 2022; v2 corrected author information
Published: 2022

24. Teacher Guided Training: An Efficient Framework for Knowledge Transfer

Author: Zaheer, Manzil, Rawat, Ankit Singh, Kim, Seungyeon, You, Chong, Jain, Himanshu, Veit, Andreas, Fergus, Rob, and Kumar, Sanjiv
Subjects: Computer Science - Machine Learning
Abstract: The remarkable performance gains realized by large pretrained models, e.g., GPT-3, hinge on the massive amounts of data they are exposed to during training. Analogously, distilling such large models to compact models for efficient deployment also necessitates a large amount of (labeled or unlabeled) training data. In this paper, we propose the teacher-guided training (TGT) framework for training a high-quality compact model that leverages the knowledge acquired by pretrained generative models, while obviating the need to go through a large volume of data. TGT exploits the fact that the teacher has acquired a good representation of the underlying data domain, which typically corresponds to a much lower dimensional manifold than the input space. Furthermore, we can use the teacher to explore input space more efficiently through sampling or gradient-based methods; thus, making TGT especially attractive for limited data or long-tail settings. We formally capture this benefit of proposed data-domain exploration in our generalization bounds. We find that TGT can improve accuracy on several image classification benchmarks as well as a range of text classification and retrieval tasks.
Published: 2022

25. ELM: Embedding and Logit Margins for Long-Tail Learning

Author: Jitkrittum, Wittawat, Menon, Aditya Krishna, Rawat, Ankit Singh, and Kumar, Sanjiv
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Long-tail learning is the problem of learning under skewed label distributions, which pose a challenge for standard learners. Several recent approaches for the problem have proposed enforcing a suitable margin in logit space. Such techniques are intuitive analogues of the guiding principle behind SVMs, and are equally applicable to linear models and neural models. However, when applied to neural models, such techniques do not explicitly control the geometry of the learned embeddings. This can be potentially sub-optimal, since embeddings for tail classes may be diffuse, resulting in poor generalization for these classes. We present Embedding and Logit Margins (ELM), a unified approach to enforce margins in logit space, and regularize the distribution of embeddings. This connects losses for long-tail learning to proposals in the literature on metric embedding, and contrastive learning. We theoretically show that minimising the proposed ELM objective helps reduce the generalisation gap. The ELM method is shown to perform well empirically, and results in tighter tail class embeddings., Comment: 24 pages
Published: 2022

26. FedLite: A Scalable Approach for Federated Learning on Resource-constrained Clients

Author: Wang, Jianyu, Qi, Hang, Rawat, Ankit Singh, Reddi, Sashank, Waghmare, Sagar, Yu, Felix X., and Joshi, Gauri
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: In classical federated learning, the clients contribute to the overall training by communicating local updates for the underlying model on their private data to a coordinating server. However, updating and communicating the entire model becomes prohibitively expensive when resource-constrained clients collectively aim to train a large machine learning model. Split learning provides a natural solution in such a setting, where only a small part of the model is stored and trained on clients while the remaining large part of the model only stays at the servers. However, the model partitioning employed in split learning introduces a significant amount of communication cost. This paper addresses this issue by compressing the additional communication using a novel clustering scheme accompanied by a gradient correction method. Extensive empirical evaluations on image and text benchmarks show that the proposed method can achieve up to $490\times$ communication cost reduction with minimal drop in accuracy, and enables a desirable performance vs. communication trade-off.
Published: 2022

27. When in Doubt, Summon the Titans: Efficient Inference with Large Models

Author: Rawat, Ankit Singh, Zaheer, Manzil, Menon, Aditya Krishna, Ahmed, Amr, and Kumar, Sanjiv
Subjects: Computer Science - Machine Learning
Abstract: Scaling neural networks to "large" sizes, with billions of parameters, has been shown to yield impressive results on many challenging problems. However, the inference cost incurred by such large models often prevents their application in most real-world settings. In this paper, we propose a two-stage framework based on distillation that realizes the modelling benefits of the large models, while largely preserving the computational benefits of inference with more lightweight models. In a nutshell, we use the large teacher models to guide the lightweight student models to only make correct predictions on a subset of "easy" examples; for the "hard" examples, we fall-back to the teacher. Such an approach allows us to efficiently employ large models in practical scenarios where easy examples are much more frequent than rare hard examples. Our proposed use of distillation to only handle easy instances allows for a more aggressive trade-off in the student size, thereby reducing the amortized cost of inference and achieving better accuracy than standard distillation. Empirically, we demonstrate the benefits of our approach on both image classification and natural language processing benchmarks.
Published: 2021

28. Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces

Author: Rawat, Ankit Singh, Menon, Aditya Krishna, Jitkrittum, Wittawat, Jayasumana, Sadeep, Yu, Felix X., Reddi, Sashank, and Kumar, Sanjiv
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Negative sampling schemes enable efficient training given a large number of classes, by offering a means to approximate a computationally expensive loss function that takes all labels into account. In this paper, we present a new connection between these schemes and loss modification techniques for countering label imbalance. We show that different negative sampling schemes implicitly trade-off performance on dominant versus rare labels. Further, we provide a unified means to explicitly tackle both sampling bias, arising from working with a subset of all labels, and labeling bias, which is inherent to the data due to label imbalance. We empirically verify our findings on long-tail classification and retrieval benchmarks., Comment: To appear in ICML 2021
Published: 2021

29. Distilling Double Descent

Author: Cotter, Andrew, Menon, Aditya Krishna, Narasimhan, Harikrishna, Rawat, Ankit Singh, Reddi, Sashank J., and Zhou, Yichen
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Statistics - Machine Learning
Abstract: Distillation is the technique of training a "student" model based on examples that are labeled by a separate "teacher" model, which itself is trained on a labeled dataset. The most common explanations for why distillation "works" are predicated on the assumption that student is provided with \emph{soft} labels, \eg probabilities or confidences, from the teacher model. In this work, we show, that, even when the teacher model is highly overparameterized, and provides \emph{hard} labels, using a very large held-out unlabeled dataset to train the student model can result in a model that outperforms more "traditional" approaches. Our explanation for this phenomenon is based on recent work on "double descent". It has been observed that, once a model's complexity roughly exceeds the amount required to memorize the training data, increasing the complexity \emph{further} can, counterintuitively, result in \emph{better} generalization. Researchers have identified several settings in which it takes place, while others have made various attempts to explain it (thus far, with only partial success). In contrast, we avoid these questions, and instead seek to \emph{exploit} this phenomenon by demonstrating that a highly-overparameterized teacher can avoid overfitting via double descent, while a student trained on a larger independent dataset labeled by this teacher will avoid overfitting due to the size of its training set.
Published: 2021

30. On the Reproducibility of Neural Network Predictions

Author: Bhojanapalli, Srinadh, Wilber, Kimberly, Veit, Andreas, Rawat, Ankit Singh, Kim, Seungyeon, Menon, Aditya, and Kumar, Sanjiv
Subjects: Computer Science - Machine Learning
Abstract: Standard training techniques for neural networks involve multiple sources of randomness, e.g., initialization, mini-batch ordering and in some cases data augmentation. Given that neural networks are heavily over-parameterized in practice, such randomness can cause {\em churn} -- for the same input, disagreements between predictions of the two models independently trained by the same algorithm, contributing to the `reproducibility challenges' in modern machine learning. In this paper, we study this problem of churn, identify factors that cause it, and propose two simple means of mitigating it. We first demonstrate that churn is indeed an issue, even for standard image classification tasks (CIFAR and ImageNet), and study the role of the different sources of training randomness that cause churn. By analyzing the relationship between churn and prediction confidences, we pursue an approach with two components for churn reduction. First, we propose using \emph{minimum entropy regularizers} to increase prediction confidences. Second, \changes{we present a novel variant of co-distillation approach~\citep{anil2018large} to increase model agreement and reduce churn}. We present empirical results showing the effectiveness of both techniques in reducing churn while improving the accuracy of the underlying model., Comment: 19 pages, 7 figures
Published: 2021

31. Modifying Memories in Transformer Models

Author: Zhu, Chen, Rawat, Ankit Singh, Zaheer, Manzil, Bhojanapalli, Srinadh, Li, Daliang, Yu, Felix, and Kumar, Sanjiv
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Large Transformer models have achieved impressive performance in many natural language tasks. In particular, Transformer based language models have been shown to have great capabilities in encoding factual knowledge in their vast amount of parameters. While the tasks of improving the memorization and generalization of Transformers have been widely studied, it is not well known how to make transformers forget specific old facts and memorize new ones. In this paper, we propose a new task of \emph{explicitly modifying specific factual knowledge in Transformer models while ensuring the model performance does not degrade on the unmodified facts}. This task is useful in many scenarios, such as updating stale knowledge, protecting privacy, and eliminating unintended biases stored in the models. We benchmarked several approaches that provide natural baseline performances on this task. This leads to the discovery of key components of a Transformer model that are especially effective for knowledge modifications. The work also provides insights into the role that different training phases (such as pretraining and fine-tuning) play towards memorization and knowledge modification.
Published: 2020

32. Long-tail learning via logit adjustment

Author: Menon, Aditya Krishna, Jayasumana, Sadeep, Rawat, Ankit Singh, Jain, Himanshu, Veit, Andreas, and Kumar, Sanjiv
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Real-world classification problems typically exhibit an imbalanced or long-tailed label distribution, wherein many labels are associated with only a few samples. This poses a challenge for generalisation on such labels, and also makes na\"ive learning biased towards dominant labels. In this paper, we present two simple modifications of standard softmax cross-entropy training to cope with these challenges. Our techniques revisit the classic idea of logit adjustment based on the label frequencies, either applied post-hoc to a trained model, or enforced in the loss during training. Such adjustment encourages a large relative margin between logits of rare versus dominant labels. These techniques unify and generalise several recent proposals in the literature, while possessing firmer statistical grounding and empirical performance., Comment: Published as a conference paper in ICLR 2021
Published: 2020

33. Adversarial robustness via robust low rank representations

Author: Awasthi, Pranjal, Jain, Himanshu, Rawat, Ankit Singh, and Vijayaraghavan, Aravindan
Subjects: Computer Science - Machine Learning, Computer Science - Data Structures and Algorithms, Statistics - Machine Learning
Abstract: Adversarial robustness measures the susceptibility of a classifier to imperceptible perturbations made to the inputs at test time. In this work we highlight the benefits of natural low rank representations that often exist for real data such as images, for training neural networks with certified robustness guarantees. Our first contribution is for certified robustness to perturbations measured in $\ell_2$ norm. We exploit low rank data representations to provide improved guarantees over state-of-the-art randomized smoothing-based approaches on standard benchmark datasets such as CIFAR-10 and CIFAR-100. Our second contribution is for the more challenging setting of certified robustness to perturbations measured in $\ell_\infty$ norm. We demonstrate empirically that natural low rank representations have inherent robustness properties, that can be leveraged to provide significantly better guarantees for certified robustness to $\ell_\infty$ perturbations in those representations. Our certificate of $\ell_\infty$ robustness relies on a natural quantity involving the $\infty \to 2$ matrix operator norm associated with the representation, to translate robustness guarantees from $\ell_2$ to $\ell_\infty$ perturbations. A key technical ingredient for our certification guarantees is a fast algorithm with provable guarantees based on the multiplicative weights update method to provide upper bounds on the above matrix norm. Our algorithmic guarantees improve upon the state of the art for this problem, and may be of independent interest., Comment: fixed a bug in the proof of Proposition B.2
Published: 2020

34. $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers

Author: Yun, Chulhee, Chang, Yin-Wen, Bhojanapalli, Srinadh, Rawat, Ankit Singh, Reddi, Sashank J., and Kumar, Sanjiv
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Recently, Transformer networks have redefined the state of the art in many NLP tasks. However, these models suffer from quadratic computational cost in the input sequence length $n$ to compute pairwise attention in each layer. This has prompted recent research into sparse Transformers that sparsify the connections in the attention layers. While empirically promising for long sequences, fundamental questions remain unanswered: Can sparse Transformers approximate any arbitrary sequence-to-sequence function, similar to their dense counterparts? How does the sparsity pattern and the sparsity level affect their performance? In this paper, we address these questions and provide a unifying framework that captures existing sparse attention models. We propose sufficient conditions under which we prove that a sparse attention model can universally approximate any sequence-to-sequence function. Surprisingly, our results show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections. Lastly, we present experiments comparing different patterns/levels of sparsity on standard NLP tasks., Comment: 31 pages, NeurIPS 2020 Camera-ready
Published: 2020

35. Why distillation helps: a statistical perspective

Author: Menon, Aditya Krishna, Rawat, Ankit Singh, Reddi, Sashank J., Kim, Seungyeon, and Kumar, Sanjiv
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Knowledge distillation is a technique for improving the performance of a simple "student" model by replacing its one-hot training labels with a distribution over labels obtained from a complex "teacher" model. While this simple approach has proven widely effective, a basic question remains unresolved: why does distillation help? In this paper, we present a statistical perspective on distillation which addresses this question, and provides a novel connection to extreme multiclass retrieval techniques. Our core observation is that the teacher seeks to estimate the underlying (Bayes) class-probability function. Building on this, we establish a fundamental bias-variance tradeoff in the student's objective: this quantifies how approximate knowledge of these class-probabilities can significantly aid learning. Finally, we show how distillation complements existing negative mining techniques for extreme multiclass retrieval, and propose a unified objective which combines these ideas.
Published: 2020

36. Doubly-stochastic mining for heterogeneous retrieval

Author: Rawat, Ankit Singh, Menon, Aditya Krishna, Veit, Andreas, Yu, Felix, Reddi, Sashank J., and Kumar, Sanjiv
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Modern retrieval problems are characterised by training sets with potentially billions of labels, and heterogeneous data distributions across subpopulations (e.g., users of a retrieval system may be from different countries), each of which poses a challenge. The first challenge concerns scalability: with a large number of labels, standard losses are difficult to optimise even on a single example. The second challenge concerns uniformity: one ideally wants good performance on each subpopulation. While several solutions have been proposed to address the first challenge, the second challenge has received relatively less attention. In this paper, we propose doubly-stochastic mining (S2M ), a stochastic optimization technique that addresses both challenges. In each iteration of S2M, we compute a per-example loss based on a subset of hardest labels, and then compute the minibatch loss based on the hardest examples. We show theoretically and empirically that by focusing on the hardest examples, S2M ensures that all data subpopulations are modelled well.
Published: 2020

37. Federated Learning with Only Positive Labels

Author: Yu, Felix X., Rawat, Ankit Singh, Menon, Aditya Krishna, and Kumar, Sanjiv
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: We consider learning a multi-class classification model in the federated setting, where each user has access to the positive data associated with only a single class. As a result, during each federated learning round, the users need to locally update the classifier without having access to the features and the model parameters for the negative classes. Thus, naively employing conventional decentralized learning such as the distributed SGD or Federated Averaging may lead to trivial or extremely poor classifiers. In particular, for the embedding based classifiers, all the class embeddings might collapse to a single point. To address this problem, we propose a generic framework for training with only positive labels, namely Federated Averaging with Spreadout (FedAwS), where the server imposes a geometric regularizer after each round to encourage classes to be spreadout in the embedding space. We show, both theoretically and empirically, that FedAwS can almost match the performance of conventional learning where users have access to negative labels. We further extend the proposed method to the settings with large output spaces.
Published: 2020

38. Robust Large-Margin Learning in Hyperbolic Space

Author: Weber, Melanie, Zaheer, Manzil, Rawat, Ankit Singh, Menon, Aditya, and Kumar, Sanjiv
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Recently, there has been a surge of interest in representation learning in hyperbolic spaces, driven by their ability to represent hierarchical data with significantly fewer dimensions than standard Euclidean spaces. However, the viability and benefits of hyperbolic spaces for downstream machine learning tasks have received less attention. In this paper, we present, to our knowledge, the first theoretical guarantees for learning a classifier in hyperbolic rather than Euclidean space. Specifically, we consider the problem of learning a large-margin classifier for data possessing a hierarchical structure. We provide an algorithm to efficiently learn a large-margin hyperplane, relying on the careful injection of adversarial examples. Finally, we prove that for hierarchical data that embeds well into hyperbolic space, the low embedding dimension ensures superior guarantees when learning the classifier directly in hyperbolic space., Comment: Revision corrects error in section 3.1
Published: 2020

39. Reliable Distributed Clustering with Redundant Data Assignment

Author: Gandikota, Venkata, Mazumdar, Arya, and Rawat, Ankit Singh
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Data Structures and Algorithms, Computer Science - Information Theory, Computer Science - Machine Learning
Abstract: In this paper, we present distributed generalized clustering algorithms that can handle large scale data across multiple machines in spite of straggling or unreliable machines. We propose a novel data assignment scheme that enables us to obtain global information about the entire data even when some machines fail to respond with the results of the assigned local computations. The assignment scheme leads to distributed algorithms with good approximation guarantees for a variety of clustering and dimensionality reduction problems.
Published: 2020

40. Low-Rank Bottleneck in Multi-head Attention Models

Author: Bhojanapalli, Srinadh, Yun, Chulhee, Rawat, Ankit Singh, Reddi, Sashank J., and Kumar, Sanjiv
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Attention based Transformer architecture has enabled significant advances in the field of natural language processing. In addition to new pre-training techniques, recent improvements crucially rely on working with a relatively larger embedding dimension for tokens. Unfortunately, this leads to models that are prohibitively large to be employed in the downstream tasks. In this paper we identify one of the important factors contributing to the large embedding size requirement. In particular, our analysis highlights that the scaling between the number of heads and the size of each head in the current architecture gives rise to a low-rank bottleneck in attention heads, causing this limitation. We further validate this in our experiments. As a solution we propose to set the head size of an attention unit to input sequence length, and independent of the number of heads, resulting in multi-head attention layers with provably more expressive power. We empirically show that this allows us to train models with a relatively smaller embedding dimension and with better performance scaling., Comment: 17 pages, 4 figures
Published: 2020

41. Achieving Multi-Port Memory Performance on Single-Port Memory with Coding Techniques

Author: Jain, Hardik, Edwards, Matthew, Elenberg, Ethan, Rawat, Ankit Singh, and Vishwanath, Sriram
Subjects: Computer Science - Hardware Architecture
Abstract: Many performance critical systems today must rely on performance enhancements, such as multi-port memories, to keep up with the increasing demand of memory-access capacity. However, the large area footprints and complexity of existing multi-port memory designs limit their applicability. This paper explores a coding theoretic framework to address this problem. In particular, this paper introduces a framework to encode data across multiple single-port memory banks in order to {\em algorithmically} realize the functionality of multi-port memory. This paper proposes three code designs with significantly less storage overhead compared to the existing replication based emulations of multi-port memories. To further improve performance, we also demonstrate a memory controller design that utilizes redundancy across coded memory banks to more efficiently schedule read and write requests sent across multiple cores. Furthermore, guided by DRAM traces, the paper explores {\em dynamic coding} techniques to improve the efficiency of the coding based memory design. We then show significant performance improvements in critical word read and write latency in the proposed coded-memory design when compared to a traditional uncoded-memory design., Comment: 10 pages, 20 figures, ICICT 2020 conference
Published: 2020

42. Are Transformers universal approximators of sequence-to-sequence functions?

Author: Yun, Chulhee, Bhojanapalli, Srinadh, Rawat, Ankit Singh, Reddi, Sashank J., and Kumar, Sanjiv
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Despite the widespread adoption of Transformer models for NLP tasks, the expressive power of these models is not well-understood. In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of shared parameters in these models. Furthermore, using positional encodings, we circumvent the restriction of permutation equivariance, and show that Transformer models can universally approximate arbitrary continuous sequence-to-sequence functions on a compact domain. Interestingly, our proof techniques clearly highlight the different roles of the self-attention and the feed-forward layers in Transformers. In particular, we prove that fixed width self-attention layers can compute contextual mappings of the input sequences, playing a key role in the universal approximation property of Transformers. Based on this insight from our analysis, we consider other simpler alternatives to self-attention layers and empirically evaluate them., Comment: 23 pages, ICLR 2020 camera-ready version
Published: 2019

43. Sampled Softmax with Random Fourier Features

Author: Rawat, Ankit Singh, Chen, Jiecao, Yu, Felix, Suresh, Ananda Theertha, and Kumar, Sanjiv
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: The computational cost of training with softmax cross entropy loss grows linearly with the number of classes. For the settings where a large number of classes are involved, a common method to speed up training is to sample a subset of classes and utilize an estimate of the loss gradient based on these classes, known as the sampled softmax method. However, the sampled softmax provides a biased estimate of the gradient unless the samples are drawn from the exact softmax distribution, which is again expensive to compute. Therefore, a widely employed practical approach involves sampling from a simpler distribution in the hope of approximating the exact softmax distribution. In this paper, we develop the first theoretical understanding of the role that different sampling distributions play in determining the quality of sampled softmax. Motivated by our analysis and the work on kernel-based sampling, we propose the Random Fourier Softmax (RF-softmax) method that utilizes the powerful Random Fourier Features to enable more efficient and accurate sampling from an approximate softmax distribution. We show that RF-softmax leads to low bias in estimation in terms of both the full softmax distribution and the full softmax gradient. Furthermore, the cost of RF-softmax scales only logarithmically with the number of classes., Comment: In NeurIPS 2019
Published: 2019

44. The Generalized Lasso for Sub-gaussian Measurements with Dithered Quantization

Author: Thrampoulidis, Christos and Rawat, Ankit Singh
Subjects: Computer Science - Information Theory, Electrical Engineering and Systems Science - Signal Processing, Mathematics - Statistics Theory
Abstract: In the problem of structured signal recovery from high-dimensional linear observations, it is commonly assumed that full-precision measurements are available. Under this assumption, the recovery performance of the popular Generalized Lasso (G-Lasso) is by now well-established. In this paper, we extend these types of results to the practically relevant settings with quantized measurements. We study two extremes of the quantization schemes, namely, uniform and one-bit quantization; the former imposes no limit on the number of quantization bits, while the second only allows for one bit. In the presence of a uniform dithering signal and when measurement vectors are sub-gaussian, we show that the same algorithm (i.e., the G-Lasso) has favorable recovery guarantees for both uniform and one-bit quantization schemes. Our theoretical results, shed light on the appropriate choice of the range of values of the dithering signal and accurately capture the error dependence on the problem parameters. For example, our error analysis shows that the G-Lasso with one-bit uniformly dithered measurements leads to only a logarithmic rate loss compared to the full-precision measurements.
Published: 2018

45. Robust Gradient Descent via Moment Encoding with LDPC Codes

Author: Maity, Raj Kumar, Rawat, Ankit Singh, and Mazumdar, Arya
Subjects: Statistics - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Information Theory, Computer Science - Machine Learning
Abstract: This paper considers the problem of implementing large-scale gradient descent algorithms in a distributed computing setting in the presence of {\em straggling} processors. To mitigate the effect of the stragglers, it has been previously proposed to encode the data with an erasure-correcting code and decode at the master server at the end of the computation. We, instead, propose to encode the second-moment of the data with a low density parity-check (LDPC) code. The iterative decoding algorithms for LDPC codes have very low computational overhead and the number of decoding iterations can be made to automatically adjust with the number of stragglers in the system. We show that for a random model for stragglers, the proposed moment encoding based gradient descent method can be viewed as the stochastic gradient descent method. This allows us to obtain convergence guarantees for the proposed solution. Furthermore, the proposed moment encoding based method is shown to outperform the existing schemes in a real distributed computing setup.
Published: 2018

46. Representation Learning and Recovery in the ReLU Model

Author: Mazumdar, Arya and Rawat, Ankit Singh
Subjects: Statistics - Machine Learning, Computer Science - Information Theory, Computer Science - Learning
Abstract: Rectified linear units, or ReLUs, have become the preferred activation function for artificial neural networks. In this paper we consider two basic learning problems assuming that the underlying data follow a generative model based on a ReLU-network -- a neural network with ReLU activations. As a primarily theoretical study, we limit ourselves to a single-layer network. The first problem we study corresponds to dictionary-learning in the presence of nonlinearity (modeled by the ReLU functions). Given a set of observation vectors $\mathbf{y}^i \in \mathbb{R}^d, i =1, 2, \dots , n$, we aim to recover $d\times k$ matrix $A$ and the latent vectors $\{\mathbf{c}^i\} \subset \mathbb{R}^k$ under the model $\mathbf{y}^i = \mathrm{ReLU}(A\mathbf{c}^i +\mathbf{b})$, where $\mathbf{b}\in \mathbb{R}^d$ is a random bias. We show that it is possible to recover the column space of $A$ within an error of $O(d)$ (in Frobenius norm) under certain conditions on the probability distribution of $\mathbf{b}$. The second problem we consider is that of robust recovery of the signal in the presence of outliers, i.e., large but sparse noise. In this setting we are interested in recovering the latent vector $\mathbf{c}$ from its noisy nonlinear sketches of the form $\mathbf{v} = \mathrm{ReLU}(A\mathbf{c}) + \mathbf{e}+\mathbf{w}$, where $\mathbf{e} \in \mathbb{R}^d$ denotes the outliers with sparsity $s$ and $\mathbf{w} \in \mathbb{R}^d$ denote the dense but small noise. This line of work has recently been studied (Soltanolkotabi, 2017) without the presence of outliers. For this problem, we show that a generalized LASSO algorithm is able to recover the signal $\mathbf{c} \in \mathbb{R}^k$ within an $\ell_2$ error of $O(\sqrt{\frac{(k+s)\log d}{d}})$ when $A$ is a random Gaussian matrix.
Published: 2018

47. Lifting high-dimensional nonlinear models with Gaussian regressors

Author: Thrampoulidis, Christos and Rawat, Ankit Singh
Subjects: Statistics - Machine Learning
Abstract: We study the problem of recovering a structured signal $\mathbf{x}_0$ from high-dimensional data $\mathbf{y}_i=f(\mathbf{a}_i^T\mathbf{x}_0)$ for some nonlinear (and potentially unknown) link function $f$, when the regressors $\mathbf{a}_i$ are iid Gaussian. Brillinger (1982) showed that ordinary least-squares estimates $\mathbf{x}_0$ up to a constant of proportionality $\mu_\ell$, which depends on $f$. Recently, Plan & Vershynin (2015) extended this result to the high-dimensional setting deriving sharp error bounds for the generalized Lasso. Unfortunately, both least-squares and the Lasso fail to recover $\mathbf{x}_0$ when $\mu_\ell=0$. For example, this includes all even link functions. We resolve this issue by proposing and analyzing an alternative convex recovery method. In a nutshell, our method treats such link functions as if they were linear in a lifted space of higher-dimension. Interestingly, our error analysis captures the effect of both the nonlinearity and the problem's geometry in a few simple summary parameters., Comment: Improved the algorithm and expanded on its motivation; added simulation results
Published: 2017

48. MDS Code Constructions with Small Sub-packetization and Near-optimal Repair Bandwidth

Author: Rawat, Ankit Singh, Tamo, Itzhak, Guruswami, Venkatesan, and Efremenko, Klim
Subjects: Computer Science - Information Theory, Computer Science - Computational Complexity
Abstract: This paper addresses the problem of constructing MDS codes that enable exact repair of each code block with small repair bandwidth, which refers to the total amount of information flow from the remaining code blocks during the repair process. This problem naturally arises in the context of distributed storage systems as the node repair problem [7]. The constructions of exact-repairable MDS codes with optimal repair-bandwidth require working with large sub-packetization levels, which restricts their employment in practice. This paper presents constructions for MDS codes that simultaneously provide both small repair bandwidth and small sub-packetization level. In particular, this paper presents two general approaches to construct exact-repairable MDS codes that aim at significantly reducing the required sub-packetization level at the cost of slightly sub-optimal repair bandwidth. The first approach gives MDS codes that have repair bandwidth at most twice the optimal repair-bandwidth. Additionally, these codes also have the smallest possible sub-packetization level $\ell = O(r)$, where $r$ denotes the number of parity blocks. This approach is then generalized to design codes that have their repair bandwidth approaching the optimal repair-bandwidth at the cost of graceful increment in the required sub-packetization level. The second approach transforms an MDS code with optimal repair-bandwidth and large sub-packetization level into a longer MDS code with small sub-packetization level and near-optimal repair bandwidth. For a given $r$, the obtained codes have their sub-packetization level scaling logarithmically with the code length. In addition, the obtained codes require field size only linear in the code length and ensure load balancing among the intact code blocks in terms of the information downloaded from these blocks during the exact reconstruction of a code block., Comment: Significant overlap with arXiv:1608.00191
Published: 2017

49. Associative Memory using Dictionary Learning and Expander Decoding

Author: Mazumdar, Arya and Rawat, Ankit Singh
Subjects: Statistics - Machine Learning, Computer Science - Information Theory, Computer Science - Learning
Abstract: An associative memory is a framework of content-addressable memory that stores a collection of message vectors (or a dataset) over a neural network while enabling a neurally feasible mechanism to recover any message in the dataset from its noisy version. Designing an associative memory requires addressing two main tasks: 1) learning phase: given a dataset, learn a concise representation of the dataset in the form of a graphical model (or a neural network), 2) recall phase: given a noisy version of a message vector from the dataset, output the correct message vector via a neurally feasible algorithm over the network learnt during the learning phase. This paper studies the problem of designing a class of neural associative memories which learns a network representation for a large dataset that ensures correction against a large number of adversarial errors during the recall phase. Specifically, the associative memories designed in this paper can store dataset containing $\exp(n)$ $n$-length message vectors over a network with $O(n)$ nodes and can tolerate $\Omega(\frac{n}{{\rm polylog} n})$ adversarial errors. This paper carries out this memory design by mapping the learning phase and recall phase to the tasks of dictionary learning with a square dictionary and iterative error correction in an expander code, respectively., Comment: To appear in AAAI 2017
Published: 2016

50. A Note on Secure Minimum Storage Regenerating Codes

Author: Rawat, Ankit Singh
Subjects: Computer Science - Information Theory
Abstract: This short note revisits the problem of designing secure minimum storage regenerating (MSR) codes for distributed storage systems. A secure MSR code ensures that a distributed storage system does not reveal the stored information to a passive eavesdropper. The eavesdropper is assumed to have access to the content stored on $\ell_1$ number of storage nodes in the system and the data downloaded during the bandwidth efficient repair of an additional $\ell_2$ number of storage nodes. This note combines the Gabidulin codes based precoding [18] and a new construction of MSR codes (without security requirements) by Ye and Barg [27] in order to obtain secure MSR codes. Such optimal secure MSR codes were previously known in the setting where the eavesdropper was only allowed to observe the repair of $\ell_2$ nodes among a specific subset of $k$ nodes [7, 18]. The secure coding scheme presented in this note allows the eavesdropper to observe repair of any $\ell_2$ ouf of $n$ nodes in the system and characterizes the secrecy capacity of linear repairable MSR codes.
Published: 2016

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Publication Type

Database

62 results on '"Ankit Singh"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources