Author: "Chandar, Sarath" / Database: OAIster - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Chandar, Sarath"' showing total 75 results

Start Over Author "Chandar, Sarath" Database OAIster

75 results on '"Chandar, Sarath"'

1. Mastering Memory Tasks with World Models

Author: Samsami, Mohammad Reza, Zholus, Artem, Rajendran, Janarthanan, Chandar, Sarath, Samsami, Mohammad Reza, Zholus, Artem, Rajendran, Janarthanan, and Chandar, Sarath
Abstract: Current model-based reinforcement learning (MBRL) agents struggle with long-term dependencies. This limits their ability to effectively solve tasks involving extended time gaps between actions and outcomes, or tasks demanding the recalling of distant observations to inform current actions. To improve temporal coherence, we integrate a new family of state space models (SSMs) in world models of MBRL agents to present a new method, Recall to Imagine (R2I). This integration aims to enhance both long-term memory and long-horizon credit assignment. Through a diverse set of illustrative tasks, we systematically demonstrate that R2I not only establishes a new state-of-the-art for challenging memory and credit assignment RL tasks, such as BSuite and POPGym, but also showcases superhuman performance in the complex memory domain of Memory Maze. At the same time, it upholds comparable performance in classic RL tasks, such as Atari and DMC, suggesting the generality of our method. We also show that R2I is faster than the state-of-the-art MBRL method, DreamerV3, resulting in faster wall-time convergence., Comment: Published as a conference paper at The International Conference on Learning Representations 2024
Published: 2024

2. Are self-explanations from Large Language Models faithful?

Author: Madsen, Andreas, Chandar, Sarath, Reddy, Siva, Madsen, Andreas, Chandar, Sarath, and Reddy, Siva
Abstract: Instruction-tuned Large Language Models (LLMs) excel at many tasks and will even explain their reasoning, so-called self-explanations. However, convincing and wrong self-explanations can lead to unsupported confidence in LLMs, thus increasing risk. Therefore, it's important to measure if self-explanations truly reflect the model's behavior. Such a measure is called interpretability-faithfulness and is challenging to perform since the ground truth is inaccessible, and many LLMs only have an inference API. To address this, we propose employing self-consistency checks to measure faithfulness. For example, if an LLM says a set of words is important for making a prediction, then it should not be able to make its prediction without these words. While self-consistency checks are a common approach to faithfulness, they have not previously been successfully applied to LLM self-explanations for counterfactual, feature attribution, and redaction explanations. Our results demonstrate that faithfulness is explanation, model, and task-dependent, showing self-explanations should not be trusted in general. For example, with sentiment classification, counterfactuals are more faithful for Llama2, feature attribution for Mistral, and redaction for Falcon 40B., Comment: The 62nd Annual Meeting of the Association for Computational Linguistics
Published: 2024

3. BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning

Author: Zholus, Artem, Kuznetsov, Maksim, Schutski, Roman, Shayakhmetov, Rim, Polykovskiy, Daniil, Chandar, Sarath, Zhavoronkov, Alex, Zholus, Artem, Kuznetsov, Maksim, Schutski, Roman, Shayakhmetov, Rim, Polykovskiy, Daniil, Chandar, Sarath, and Zhavoronkov, Alex
Abstract: Generating novel active molecules for a given protein is an extremely challenging task for generative models that requires an understanding of the complex physical interactions between the molecule and its environment. In this paper, we present a novel generative model, BindGPT which uses a conceptually simple but powerful approach to create 3D molecules within the protein's binding site. Our model produces molecular graphs and conformations jointly, eliminating the need for an extra graph reconstruction step. We pretrain BindGPT on a large-scale dataset and fine-tune it with reinforcement learning using scores from external simulation software. We demonstrate how a single pretrained language model can serve at the same time as a 3D molecular generative model, conformer generator conditioned on the molecular graph, and a pocket-conditioned 3D molecule generator. Notably, the model does not make any representational equivariance assumptions about the domain of generation. We show how such simple conceptual approach combined with pretraining and scaling can perform on par or better than the current best specialized diffusion models, language models, and graph neural networks while being two orders of magnitude cheaper to sample.
Published: 2024

4. Predicting the Impact of Model Expansion through the Minima Manifold: A Loss Landscape Perspective

Author: Malviya, Pranshu, Huang, Jerry, Fournier, Quentin, Chandar, Sarath, Malviya, Pranshu, Huang, Jerry, Fournier, Quentin, and Chandar, Sarath
Abstract: The optimal model for a given task is often challenging to determine, requiring training multiple models from scratch which becomes prohibitive as dataset and model sizes grow. A more efficient alternative is to reuse smaller pre-trained models by expanding them, however, this is not widely adopted as how this impacts training dynamics remains poorly understood. While prior works have introduced statistics to measure these effects, they remain flawed. To rectify this, we offer a new approach for understanding and quantifying the impact of expansion through the lens of the loss landscape, which has been shown to contain a manifold of linearly connected minima. Building on this new perspective, we propose a metric to study the impact of expansion by estimating the size of the manifold. Experimental results show a clear relationship between gains in performance and manifold size, enabling the comparison of candidate models and presenting a first step towards expanding models more reliably based on geometric properties of the loss landscape.
Published: 2024

5. Towards Practical Tool Usage for Continually Learning LLMs

Author: Huang, Jerry, Parthasarathi, Prasanna, Rezagholizadeh, Mehdi, Chandar, Sarath, Huang, Jerry, Parthasarathi, Prasanna, Rezagholizadeh, Mehdi, and Chandar, Sarath
Abstract: Large language models (LLMs) show an innate skill for solving language based tasks. But insights have suggested an inability to adjust for information or task-solving skills becoming outdated, as their knowledge, stored directly within their parameters, remains static in time. Tool use helps by offloading work to systems that the LLM can access through an interface, but LLMs that use them still must adapt to nonstationary environments for prolonged use, as new tools can emerge and existing tools can change. Nevertheless, tools require less specialized knowledge, therefore we hypothesize they are better suited for continual learning (CL) as they rely less on parametric memory for solving tasks and instead focus on learning when to apply pre-defined tools. To verify this, we develop a synthetic benchmark and follow this by aggregating existing NLP tasks to form a more realistic testing scenario. While we demonstrate scaling model size is not a solution, regardless of tool usage, continual learning techniques can enable tool LLMs to both adapt faster while forgetting less, highlighting their potential as continual learners., Comment: 20 pages, 11 tables, 7 figures
Published: 2024

6. Interpretability Needs a New Paradigm

Author: Madsen, Andreas, Lakkaraju, Himabindu, Reddy, Siva, Chandar, Sarath, Madsen, Andreas, Lakkaraju, Himabindu, Reddy, Siva, and Chandar, Sarath
Abstract: Interpretability is the study of explaining models in understandable terms to humans. At present, interpretability is divided into two paradigms: the intrinsic paradigm, which believes that only models designed to be explained can be explained, and the post-hoc paradigm, which believes that black-box models can be explained. At the core of this debate is how each paradigm ensures its explanations are faithful, i.e., true to the model's behavior. This is important, as false but convincing explanations lead to unsupported confidence in artificial intelligence (AI), which can be dangerous. This paper's position is that we should think about new paradigms while staying vigilant regarding faithfulness. First, by examining the history of paradigms in science, we see that paradigms are constantly evolving. Then, by examining the current paradigms, we can understand their underlying beliefs, the value they bring, and their limitations. Finally, this paper presents 3 emerging paradigms for interpretability. The first paradigm designs models such that faithfulness can be easily measured. Another optimizes models such that explanations become faithful. The last paradigm proposes to develop models that produce both a prediction and an explanation.
Published: 2024

7. Sub-goal Distillation: A Method to Improve Small Language Agents

Author: Hashemzadeh, Maryam, Stengel-Eskin, Elias, Chandar, Sarath, Cote, Marc-Alexandre, Hashemzadeh, Maryam, Stengel-Eskin, Elias, Chandar, Sarath, and Cote, Marc-Alexandre
Abstract: While Large Language Models (LLMs) have demonstrated significant promise as agents in interactive tasks, their substantial computational requirements and restricted number of calls constrain their practical utility, especially in long-horizon interactive tasks such as decision-making or in scenarios involving continuous ongoing tasks. To address these constraints, we propose a method for transferring the performance of an LLM with billions of parameters to a much smaller language model (770M parameters). Our approach involves constructing a hierarchical agent comprising a planning module, which learns through Knowledge Distillation from an LLM to generate sub-goals, and an execution module, which learns to accomplish these sub-goals using elementary actions. In detail, we leverage an LLM to annotate an oracle path with a sequence of sub-goals towards completing a goal. Subsequently, we utilize this annotated data to fine-tune both the planning and execution modules. Importantly, neither module relies on real-time access to an LLM during inference, significantly reducing the overall cost associated with LLM interactions to a fixed cost. In ScienceWorld, a challenging and multi-task interactive text environment, our method surpasses standard imitation learning based solely on elementary actions by 16.7% (absolute). Our analysis highlights the efficiency of our approach compared to other LLM-based methods. Our code and annotated data for distillation can be found on GitHub.
Published: 2024

8. Intelligent Switching for Reset-Free RL

Author: Patil, Darshan, Rajendran, Janarthanan, Berseth, Glen, Chandar, Sarath, Patil, Darshan, Rajendran, Janarthanan, Berseth, Glen, and Chandar, Sarath
Abstract: In the real world, the strong episode resetting mechanisms that are needed to train agents in simulation are unavailable. The \textit{resetting} assumption limits the potential of reinforcement learning in the real world, as providing resets to an agent usually requires the creation of additional handcrafted mechanisms or human interventions. Recent work aims to train agents (\textit{forward}) with learned resets by constructing a second (\textit{backward}) agent that returns the forward agent to the initial state. We find that the termination and timing of the transitions between these two agents are crucial for algorithm success. With this in mind, we create a new algorithm, Reset Free RL with Intelligently Switching Controller (RISC) which intelligently switches between the two agents based on the agent's confidence in achieving its current goal. Our new method achieves state-of-the-art performance on several challenging environments for reset-free RL., Comment: Published at ICLR 2024
Published: 2024

9. Conditionally Optimistic Exploration for Cooperative Deep Multi-Agent Reinforcement Learning

Author: Zhao, Xutong, Pan, Yangchen, Xiao, Chenjun, Chandar, Sarath, Rajendran, Janarthanan, Zhao, Xutong, Pan, Yangchen, Xiao, Chenjun, Chandar, Sarath, and Rajendran, Janarthanan
Abstract: Efficient exploration is critical in cooperative deep Multi-Agent Reinforcement Learning (MARL). In this work, we propose an exploration method that effectively encourages cooperative exploration based on the idea of sequential action-computation scheme. The high-level intuition is that to perform optimism-based exploration, agents would explore cooperative strategies if each agent's optimism estimate captures a structured dependency relationship with other agents. Assuming agents compute actions following a sequential order at \textit{each environment timestep}, we provide a perspective to view MARL as tree search iterations by considering agents as nodes at different depths of the search tree. Inspired by the theoretically justified tree search algorithm UCT (Upper Confidence bounds applied to Trees), we develop a method called Conditionally Optimistic Exploration (COE). COE augments each agent's state-action value estimate with an action-conditioned optimistic bonus derived from the visitation count of the global state and joint actions of preceding agents. COE is performed during training and disabled at deployment, making it compatible with any value decomposition method for centralized training with decentralized execution. Experiments across various cooperative MARL benchmarks show that COE outperforms current state-of-the-art exploration methods on hard-exploration tasks., Comment: Accepted at UAI 2023
Published: 2023

10. Replay Buffer with Local Forgetting for Adapting to Local Environment Changes in Deep Model-Based Reinforcement Learning

Author: Rahimi-Kalahroudi, Ali, Rajendran, Janarthanan, Momennejad, Ida, van Seijen, Harm, Chandar, Sarath, Rahimi-Kalahroudi, Ali, Rajendran, Janarthanan, Momennejad, Ida, van Seijen, Harm, and Chandar, Sarath
Abstract: One of the key behavioral characteristics used in neuroscience to determine whether the subject of study -- be it a rodent or a human -- exhibits model-based learning is effective adaptation to local changes in the environment, a particular form of adaptivity that is the focus of this work. In reinforcement learning, however, recent work has shown that modern deep model-based reinforcement-learning (MBRL) methods adapt poorly to local environment changes. An explanation for this mismatch is that MBRL methods are typically designed with sample-efficiency on a single task in mind and the requirements for effective adaptation are substantially higher, both in terms of the learned world model and the planning routine. One particularly challenging requirement is that the learned world model has to be sufficiently accurate throughout relevant parts of the state-space. This is challenging for deep-learning-based world models due to catastrophic forgetting. And while a replay buffer can mitigate the effects of catastrophic forgetting, the traditional first-in-first-out replay buffer precludes effective adaptation due to maintaining stale data. In this work, we show that a conceptually simple variation of this traditional replay buffer is able to overcome this limitation. By removing only samples from the buffer from the local neighbourhood of the newly observed samples, deep world models can be built that maintain their accuracy across the state-space, while also being able to effectively adapt to local changes in the reward function. We demonstrate this by applying our replay-buffer variation to a deep version of the classical Dyna method, as well as to recent methods such as PlaNet and DreamerV2, demonstrating that deep model-based methods can adapt effectively as well to local changes in the environment.
Published: 2023

11. Dealing With Non-stationarity in Decentralized Cooperative Multi-Agent Deep Reinforcement Learning via Multi-Timescale Learning

Author: Nekoei, Hadi, Badrinaaraayanan, Akilesh, Sinha, Amit, Amini, Mohammad, Rajendran, Janarthanan, Mahajan, Aditya, Chandar, Sarath, Nekoei, Hadi, Badrinaaraayanan, Akilesh, Sinha, Amit, Amini, Mohammad, Rajendran, Janarthanan, Mahajan, Aditya, and Chandar, Sarath
Abstract: Decentralized cooperative multi-agent deep reinforcement learning (MARL) can be a versatile learning framework, particularly in scenarios where centralized training is either not possible or not practical. One of the critical challenges in decentralized deep MARL is the non-stationarity of the learning environment when multiple agents are learning concurrently. A commonly used and efficient scheme for decentralized MARL is independent learning in which agents concurrently update their policies independently of each other. We first show that independent learning does not always converge, while sequential learning where agents update their policies one after another in a sequence is guaranteed to converge to an agent-by-agent optimal solution. In sequential learning, when one agent updates its policy, all other agent's policies are kept fixed, alleviating the challenge of non-stationarity due to simultaneous updates in other agents' policies. However, it can be slow because only one agent is learning at any time. Therefore it might also not always be practical. In this work, we propose a decentralized cooperative MARL algorithm based on multi-timescale learning. In multi-timescale learning, all agents learn simultaneously, but at different learning rates. In our proposed method, when one agent updates its policy, other agents are allowed to update their policies as well, but at a slower rate. This speeds up sequential learning, while also minimizing non-stationarity caused by other agents updating concurrently. Multi-timescale learning outperforms state-of-the-art decentralized learning methods on a set of challenging multi-agent cooperative tasks in the epymarl(Papoudakis et al., 2020) benchmark. This can be seen as a first step towards more general decentralized cooperative deep MARL methods based on multi-timescale learning.
Published: 2023

12. Measuring the Knowledge Acquisition-Utilization Gap in Pretrained Language Models

Author: Kazemnejad, Amirhossein, Rezagholizadeh, Mehdi, Parthasarathi, Prasanna, Chandar, Sarath, Kazemnejad, Amirhossein, Rezagholizadeh, Mehdi, Parthasarathi, Prasanna, and Chandar, Sarath
Abstract: While pre-trained language models (PLMs) have shown evidence of acquiring vast amounts of knowledge, it remains unclear how much of this parametric knowledge is actually usable in performing downstream tasks. We propose a systematic framework to measure parametric knowledge utilization in PLMs. Our framework first extracts knowledge from a PLM's parameters and subsequently constructs a downstream task around this extracted knowledge. Performance on this task thus depends exclusively on utilizing the model's possessed knowledge, avoiding confounding factors like insufficient signal. As an instantiation, we study factual knowledge of PLMs and measure utilization across 125M to 13B parameter PLMs. We observe that: (1) PLMs exhibit two gaps - in acquired vs. utilized knowledge, (2) they show limited robustness in utilizing knowledge under distribution shifts, and (3) larger models close the acquired knowledge gap but the utilized knowledge gap remains. Overall, our study provides insights into PLMs' capabilities beyond their acquired knowledge.
Published: 2023

13. Should We Attend More or Less? Modulating Attention for Fairness

Author: Zayed, Abdelrahman, Mordido, Goncalo, Shabanian, Samira, Chandar, Sarath, Zayed, Abdelrahman, Mordido, Goncalo, Shabanian, Samira, and Chandar, Sarath
Abstract: The abundance of annotated data in natural language processing (NLP) poses both opportunities and challenges. While it enables the development of high-performing models for a variety of tasks, it also poses the risk of models learning harmful biases from the data, such as gender stereotypes. In this work, we investigate the role of attention, a widely-used technique in current state-of-the-art NLP models, in the propagation of social biases. Specifically, we study the relationship between the entropy of the attention distribution and the model's performance and fairness. We then propose a novel method for modulating attention weights to improve model fairness after training. Since our method is only applied post-training and pre-inference, it is an intra-processing method and is, therefore, less computationally expensive than existing in-processing and pre-processing approaches. Our results show an increase in fairness and minimal performance loss on different text classification and generation tasks using language models of varying sizes. WARNING: This work uses language that is offensive.
Published: 2023

14. On the Costs and Benefits of Adopting Lifelong Learning for Software Analytics -- Empirical Study on Brown Build and Risk Prediction

Author: Olewicki, Doriane, Habchi, Sarra, Nayrolles, Mathieu, Faramarzi, Mojtaba, Chandar, Sarath, Adams, Bram, Olewicki, Doriane, Habchi, Sarra, Nayrolles, Mathieu, Faramarzi, Mojtaba, Chandar, Sarath, and Adams, Bram
Abstract: Nowadays, software analytics tools using machine learning (ML) models to, for example, predict the risk of a code change are well established. However, as the goals of a project shift over time, and developers and their habits change, the performance of said models tends to degrade (drift) over time. Current retraining practices typically require retraining a new model from scratch on a large updated dataset when performance decay is observed, thus incurring a computational cost; also there is no continuity between the models as the past model is discarded and ignored during the new model training. Even though the literature has taken interest in online learning approaches, those have rarely been integrated and evaluated in industrial environments. This paper evaluates the use of lifelong learning (LL) for industrial use cases at Ubisoft, evaluating both the performance and the required computational effort in comparison to the retraining-from-scratch approaches commonly used by the industry. LL is used to continuously build and maintain ML-based software analytics tools using an incremental learner that progressively updates the old model using new data. To avoid so-called "catastrophic forgetting" of important older data points, we adopt a replay buffer of older data, which still allows us to drastically reduce the size of the overall training dataset, and hence model training time.
Published: 2023
Full Text: View/download PDF

15. Fairness-Aware Structured Pruning in Transformers

Author: Zayed, Abdelrahman, Mordido, Goncalo, Shabanian, Samira, Baldini, Ioana, Chandar, Sarath, Zayed, Abdelrahman, Mordido, Goncalo, Shabanian, Samira, Baldini, Ioana, and Chandar, Sarath
Abstract: The increasing size of large language models (LLMs) has introduced challenges in their training and inference. Removing model components is perceived as a solution to tackle the large model sizes, however, existing pruning methods solely focus on performance, without considering an essential aspect for the responsible use of LLMs: model fairness. It is crucial to address the fairness of LLMs towards diverse groups, such as women, Black people, LGBTQ+, Jewish communities, among others, as they are being deployed and available to a wide audience. In this work, first, we investigate how attention heads impact fairness and performance in pre-trained transformer-based language models. We then propose a novel method to prune the attention heads that negatively impact fairness while retaining the heads critical for performance, i.e. language modeling capabilities. Our approach is practical in terms of time and resources, as it does not require fine-tuning the final pruned, and fairer, model. Our findings demonstrate a reduction in gender bias by 19%, 19.5%, 39.5%, 34.7%, 23%, and 8% for DistilGPT-2, GPT-2, GPT-Neo of two different sizes, GPT-J, and Llama 2 models, respectively, in comparison to the biased model, with only a slight decrease in performance., Comment: In Proceedings of AAAI 2024
Published: 2023

16. Language Model-In-The-Loop: Data Optimal Approach to Learn-To-Recommend Actions in Text Games

Author: Sudhakar, Arjun Vaithilingam, Parthasarathi, Prasanna, Rajendran, Janarthanan, Chandar, Sarath, Sudhakar, Arjun Vaithilingam, Parthasarathi, Prasanna, Rajendran, Janarthanan, and Chandar, Sarath
Abstract: Large Language Models (LLMs) have demonstrated superior performance in language understanding benchmarks. CALM, a popular approach, leverages linguistic priors of LLMs -- GPT-2 -- for action candidate recommendations to improve the performance in text games in Jericho without environment-provided actions. However, CALM adapts GPT-2 with annotated human gameplays and keeps the LLM fixed during the learning of the text based games. In this work, we explore and evaluate updating LLM used for candidate recommendation during the learning of the text based game as well to mitigate the reliance on the human annotated gameplays, which are costly to acquire. We observe that by updating the LLM during learning using carefully selected in-game transitions, we can reduce the dependency on using human annotated game plays for fine-tuning the LLMs. We conducted further analysis to study the transferability of the updated LLMs and observed that transferring in-game trained models to other games did not result in a consistent transfer.
Published: 2023

17. Self-Influence Guided Data Reweighting for Language Model Pre-training

Author: Thakkar, Megh, Bolukbasi, Tolga, Ganapathy, Sriram, Vashishth, Shikhar, Chandar, Sarath, Talukdar, Partha, Thakkar, Megh, Bolukbasi, Tolga, Ganapathy, Sriram, Vashishth, Shikhar, Chandar, Sarath, and Talukdar, Partha
Abstract: Language Models (LMs) pre-trained with self-supervision on large text corpora have become the default starting point for developing models for various NLP tasks. Once the pre-training corpus has been assembled, all data samples in the corpus are treated with equal importance during LM pre-training. However, due to varying levels of relevance and quality of data, equal importance to all the data samples may not be the optimal choice. While data reweighting has been explored in the context of task-specific supervised learning and LM fine-tuning, model-driven reweighting for pre-training data has not been explored. We fill this important gap and propose PRESENCE, a method for jointly reweighting samples by leveraging self-influence (SI) scores as an indicator of sample importance and pre-training. PRESENCE promotes novelty and stability for model pre-training. Through extensive analysis spanning multiple model sizes, datasets, and tasks, we present PRESENCE as an important first step in the research direction of sample reweighting for pre-training language models., Comment: Accepted to EMNLP 2023
Published: 2023

18. EpiK-Eval: Evaluation for Language Models as Epistemic Models

Author: Prato, Gabriele, Huang, Jerry, Parthasarathi, Prasannna, Sodhani, Shagun, Chandar, Sarath, Prato, Gabriele, Huang, Jerry, Parthasarathi, Prasannna, Sodhani, Shagun, and Chandar, Sarath
Abstract: In the age of artificial intelligence, the role of large language models (LLMs) is becoming increasingly central. Despite their growing prevalence, their capacity to consolidate knowledge from different training documents - a crucial ability in numerous applications - remains unexplored. This paper presents the first study examining the capability of LLMs to effectively combine such information within their parameter space. We introduce EpiK-Eval, a novel question-answering benchmark tailored to evaluate LLMs' proficiency in formulating a coherent and consistent knowledge representation from segmented narratives. Evaluations across various LLMs reveal significant weaknesses in this domain. We contend that these shortcomings stem from the intrinsic nature of prevailing training objectives. Consequently, we advocate for refining the approach towards knowledge consolidation, as it harbors the potential to dramatically improve their overall effectiveness and performance. The findings from this study offer insights for developing more robust and reliable LLMs. Our code and benchmark are available at https://github.com/chandar-lab/EpiK-Eval
Published: 2023

19. Faithfulness Measurable Masked Language Models

Author: Madsen, Andreas, Reddy, Siva, Chandar, Sarath, Madsen, Andreas, Reddy, Siva, and Chandar, Sarath
Abstract: A common approach to explaining NLP models is to use importance measures that express which tokens are important for a prediction. Unfortunately, such explanations are often wrong despite being persuasive. Therefore, it is essential to measure their faithfulness. One such metric is if tokens are truly important, then masking them should result in worse model performance. However, token masking introduces out-of-distribution issues, and existing solutions that address this are computationally expensive and employ proxy models. Furthermore, other metrics are very limited in scope. This work proposes an inherently faithfulness measurable model that addresses these challenges. This is achieved using a novel fine-tuning method that incorporates masking, such that masking tokens become in-distribution by design. This differs from existing approaches, which are completely model-agnostic but are inapplicable in practice. We demonstrate the generality of our approach by applying it to 16 different datasets and validate it using statistical in-distribution tests. The faithfulness is then measured with 9 different importance measures. Because masking is in-distribution, importance measures that themselves use masking become consistently more faithful. Additionally, because the model makes faithfulness cheap to measure, we can optimize explanations towards maximal faithfulness; thus, our model becomes indirectly inherently explainable.
Published: 2023

20. Towards Few-shot Coordination: Revisiting Ad-hoc Teamplay Challenge In the Game of Hanabi

Author: Nekoei, Hadi, Zhao, Xutong, Rajendran, Janarthanan, Liu, Miao, Chandar, Sarath, Nekoei, Hadi, Zhao, Xutong, Rajendran, Janarthanan, Liu, Miao, and Chandar, Sarath
Abstract: Cooperative Multi-agent Reinforcement Learning (MARL) algorithms with Zero-Shot Coordination (ZSC) have gained significant attention in recent years. ZSC refers to the ability of agents to coordinate zero-shot (without additional interaction experience) with independently trained agents. While ZSC is crucial for cooperative MARL agents, it might not be possible for complex tasks and changing environments. Agents also need to adapt and improve their performance with minimal interaction with other agents. In this work, we show empirically that state-of-the-art ZSC algorithms have poor performance when paired with agents trained with different learning methods, and they require millions of interaction samples to adapt to these new partners. To investigate this issue, we formally defined a framework based on a popular cooperative multi-agent game called Hanabi to evaluate the adaptability of MARL methods. In particular, we created a diverse set of pre-trained agents and defined a new metric called adaptation regret that measures the agent's ability to efficiently adapt and improve its coordination performance when paired with some held-out pool of partners on top of its ZSC performance. After evaluating several SOTA algorithms using our framework, our experiments reveal that naive Independent Q-Learning (IQL) agents in most cases adapt as quickly as the SOTA ZSC algorithm Off-Belief Learning (OBL). This finding raises an interesting research question: How to design MARL algorithms with high ZSC performance and capability of fast adaptation to unseen partners. As a first step, we studied the role of different hyper-parameters and design choices on the adaptability of current MARL algorithms. Our experiments show that two categories of hyper-parameters controlling the training data diversity and optimization process have a significant impact on the adaptability of Hanabi agents.
Published: 2023

21. Lookbehind-SAM: k steps back, 1 step forward

Author: Mordido, Gonçalo, Malviya, Pranshu, Baratin, Aristide, Chandar, Sarath, Mordido, Gonçalo, Malviya, Pranshu, Baratin, Aristide, and Chandar, Sarath
Abstract: Sharpness-aware minimization (SAM) methods have gained increasing popularity by formulating the problem of minimizing both loss value and loss sharpness as a minimax objective. In this work, we increase the efficiency of the maximization and minimization parts of SAM's objective to achieve a better loss-sharpness trade-off. By taking inspiration from the Lookahead optimizer, which uses multiple descent steps ahead, we propose Lookbehind, which performs multiple ascent steps behind to enhance the maximization step of SAM and find a worst-case perturbation with higher loss. Then, to mitigate the variance in the descent step arising from the gathered gradients across the multiple ascent steps, we employ linear interpolation to refine the minimization step. Lookbehind leads to a myriad of benefits across a variety of tasks. Particularly, we show increased generalization performance, greater robustness against noisy weights, as well as improved learning and less catastrophic forgetting in lifelong learning settings. Our code is available at https://github.com/chandar-lab/Lookbehind-SAM., Comment: ICML 2024
Published: 2023

22. Promoting Exploration in Memory-Augmented Adam using Critical Momenta

Author: Malviya, Pranshu, Mordido, Gonçalo, Baratin, Aristide, Harikandeh, Reza Babanezhad, Huang, Jerry, Lacoste-Julien, Simon, Pascanu, Razvan, Chandar, Sarath, Malviya, Pranshu, Mordido, Gonçalo, Baratin, Aristide, Harikandeh, Reza Babanezhad, Huang, Jerry, Lacoste-Julien, Simon, Pascanu, Razvan, and Chandar, Sarath
Abstract: Adaptive gradient-based optimizers, particularly Adam, have left their mark in training large-scale deep learning models. The strength of such optimizers is that they exhibit fast convergence while being more robust to hyperparameter choice. However, they often generalize worse than non-adaptive methods. Recent studies have tied this performance gap to flat minima selection: adaptive methods tend to find solutions in sharper basins of the loss landscape, which in turn hurts generalization. To overcome this issue, we propose a new memory-augmented version of Adam that promotes exploration towards flatter minima by using a buffer of critical momentum terms during training. Intuitively, the use of the buffer makes the optimizer overshoot outside the basin of attraction if it is not wide enough. We empirically show that our method improves the performance of several variants of Adam on standard supervised language modelling and image classification tasks.
Published: 2023

23. Thompson sampling for improved exploration in GFlowNets

Author: Rector-Brooks, Jarrid, Madan, Kanika, Jain, Moksh, Korablyov, Maksym, Liu, Cheng-Hao, Chandar, Sarath, Malkin, Nikolay, Bengio, Yoshua, Rector-Brooks, Jarrid, Madan, Kanika, Jain, Moksh, Korablyov, Maksym, Liu, Cheng-Hao, Chandar, Sarath, Malkin, Nikolay, and Bengio, Yoshua
Abstract: Generative flow networks (GFlowNets) are amortized variational inference algorithms that treat sampling from a distribution over compositional objects as a sequential decision-making problem with a learnable action policy. Unlike other algorithms for hierarchical sampling that optimize a variational bound, GFlowNet algorithms can stably run off-policy, which can be advantageous for discovering modes of the target distribution. Despite this flexibility in the choice of behaviour policy, the optimal way of efficiently selecting trajectories for training has not yet been systematically explored. In this paper, we view the choice of trajectories for training as an active learning problem and approach it using Bayesian techniques inspired by methods for multi-armed bandits. The proposed algorithm, Thompson sampling GFlowNets (TS-GFN), maintains an approximate posterior distribution over policies and samples trajectories from this posterior for training. We show in two domains that TS-GFN yields improved exploration and thus faster convergence to the target distribution than the off-policy exploration strategies used in past work., Comment: Structured Probabilistic Inference and Generative Modeling (SPIGM) workshop @ ICML 2023
Published: 2023

24. PatchBlender: A Motion Prior for Video Transformers

Author: Prato, Gabriele, Song, Yale, Rajendran, Janarthanan, Hjelm, R Devon, Joshi, Neel, Chandar, Sarath, Prato, Gabriele, Song, Yale, Rajendran, Janarthanan, Hjelm, R Devon, Joshi, Neel, and Chandar, Sarath
Abstract: Transformers have become one of the dominant architectures in the field of computer vision. However, there are yet several challenges when applying such architectures to video data. Most notably, these models struggle to model the temporal patterns of video data effectively. Directly targeting this issue, we introduce PatchBlender, a learnable blending function that operates over patch embeddings across the temporal dimension of the latent space. We show that our method is successful at enabling vision transformers to encode the temporal component of video data. On Something-Something v2 and MOVi-A, we show that our method improves the baseline performance of video Transformers. PatchBlender has the advantage of being compatible with almost any Transformer architecture and since it is learnable, the model can adaptively turn on or off the prior. It is also extremely lightweight compute-wise, 0.005% the GFLOPs of a ViT-B.
Published: 2022

25. Deep Learning on a Healthy Data Diet: Finding Important Examples for Fairness

Author: Zayed, Abdelrahman, Parthasarathi, Prasanna, Mordido, Goncalo, Palangi, Hamid, Shabanian, Samira, Chandar, Sarath, Zayed, Abdelrahman, Parthasarathi, Prasanna, Mordido, Goncalo, Palangi, Hamid, Shabanian, Samira, and Chandar, Sarath
Abstract: Data-driven predictive solutions predominant in commercial applications tend to suffer from biases and stereotypes, which raises equity concerns. Prediction models may discover, use, or amplify spurious correlations based on gender or other protected personal characteristics, thus discriminating against marginalized groups. Mitigating gender bias has become an important research focus in natural language processing (NLP) and is an area where annotated corpora are available. Data augmentation reduces gender bias by adding counterfactual examples to the training dataset. In this work, we show that some of the examples in the augmented dataset can be not important or even harmful for fairness. We hence propose a general method for pruning both the factual and counterfactual examples to maximize the model's fairness as measured by the demographic parity, equality of opportunity, and equality of odds. The fairness achieved by our method surpasses that of data augmentation on three text classification datasets, using no more than half of the examples in the augmented dataset. Our experiments are conducted using models of varying sizes and pre-training settings., Comment: In Proceedings of AAAI 2023
Published: 2022

26. SAMSON: Sharpness-Aware Minimization Scaled by Outlier Normalization for Improving DNN Generalization and Robustness

Author: Mordido, Gonçalo, Henwood, Sébastien, Chandar, Sarath, Leduc-Primeau, François, Mordido, Gonçalo, Henwood, Sébastien, Chandar, Sarath, and Leduc-Primeau, François
Abstract: Energy-efficient deep neural network (DNN) accelerators are prone to non-idealities that degrade DNN performance at inference time. To mitigate such degradation, existing methods typically add perturbations to the DNN weights during training to simulate inference on noisy hardware. However, this often requires knowledge about the target hardware and leads to a trade-off between DNN performance and robustness, decreasing the former to increase the latter. In this work, we show that applying sharpness-aware training, by optimizing for both the loss value and loss sharpness, significantly improves robustness to noisy hardware at inference time without relying on any assumptions about the target hardware. In particular, we propose a new adaptive sharpness-aware method that conditions the worst-case perturbation of a given weight not only on its magnitude but also on the range of the weight distribution. This is achieved by performing sharpness-aware minimization scaled by outlier minimization (SAMSON). Our approach outperforms existing sharpness-aware training methods both in terms of model generalization performance in noiseless regimes and robustness in noisy settings, as measured on several architectures and datasets., Comment: Preprint
Published: 2022

27. Detecting Languages Unintelligible to Multilingual Models through Local Structure Probes

Author: Clouâtre, Louis, Parthasarathi, Prasanna, Zouaq, Amal, Chandar, Sarath, Clouâtre, Louis, Parthasarathi, Prasanna, Zouaq, Amal, and Chandar, Sarath
Abstract: Providing better language tools for low-resource and endangered languages is imperative for equitable growth. Recent progress with massively multilingual pretrained models has proven surprisingly effective at performing zero-shot transfer to a wide variety of languages. However, this transfer is not universal, with many languages not currently understood by multilingual approaches. It is estimated that only 72 languages possess a "small set of labeled datasets" on which we could test a model's performance, the vast majority of languages not having the resources available to simply evaluate performances on. In this work, we attempt to clarify which languages do and do not currently benefit from such transfer. To that end, we develop a general approach that requires only unlabelled text to detect which languages are not well understood by a cross-lingual model. Our approach is derived from the hypothesis that if a model's understanding is insensitive to perturbations to text in a language, it is likely to have a limited understanding of that language. We construct a cross-lingual sentence similarity task to evaluate our approach empirically on 350, primarily low-resource, languages.
Published: 2022

28. Local Structure Matters Most in Most Languages

Author: Clouâtre, Louis, Parthasarathi, Prasanna, Zouaq, Amal, Chandar, Sarath, Clouâtre, Louis, Parthasarathi, Prasanna, Zouaq, Amal, and Chandar, Sarath
Abstract: Many recent perturbation studies have found unintuitive results on what does and does not matter when performing Natural Language Understanding (NLU) tasks in English. Coding properties, such as the order of words, can often be removed through shuffling without impacting downstream performances. Such insight may be used to direct future research into English NLP models. As many improvements in multilingual settings consist of wholesale adaptation of English approaches, it is important to verify whether those studies replicate or not in multilingual settings. In this work, we replicate a study on the importance of local structure, and the relative unimportance of global structure, in a multilingual setting. We find that the phenomenon observed on the English language broadly translates to over 120 languages, with a few caveats.
Published: 2022

29. Segmentation of Multiple Sclerosis Lesions across Hospitals: Learn Continually or Train from Scratch?

Author: Karthik, Enamundram Naga, Kerbrat, Anne, Labauge, Pierre, Granberg, Tobias, Talbott, Jason, Reich, Daniel S., Filippi, Massimo, Bakshi, Rohit, Callot, Virginie, Chandar, Sarath, Cohen-Adad, Julien, Karthik, Enamundram Naga, Kerbrat, Anne, Labauge, Pierre, Granberg, Tobias, Talbott, Jason, Reich, Daniel S., Filippi, Massimo, Bakshi, Rohit, Callot, Virginie, Chandar, Sarath, and Cohen-Adad, Julien
Abstract: Segmentation of Multiple Sclerosis (MS) lesions is a challenging problem. Several deep-learning-based methods have been proposed in recent years. However, most methods tend to be static, that is, a single model trained on a large, specialized dataset, which does not generalize well. Instead, the model should learn across datasets arriving sequentially from different hospitals by building upon the characteristics of lesions in a continual manner. In this regard, we explore experience replay, a well-known continual learning method, in the context of MS lesion segmentation across multi-contrast data from 8 different hospitals. Our experiments show that replay is able to achieve positive backward transfer and reduce catastrophic forgetting compared to sequential fine-tuning. Furthermore, replay outperforms the multi-domain training, thereby emerging as a promising solution for the segmentation of MS lesions. The code is available at this link: https://github.com/naga-karthik/continual-learning-ms, Comment: Accepted at the Medical Imaging Meets NeurIPS (MedNeurIPS) Workshop 2022
Published: 2022

30. Improving Meta-Learning Generalization with Activation-Based Early-Stopping

Author: Guiroy, Simon, Pal, Christopher, Mordido, Gonçalo, Chandar, Sarath, Guiroy, Simon, Pal, Christopher, Mordido, Gonçalo, and Chandar, Sarath
Abstract: Meta-Learning algorithms for few-shot learning aim to train neural networks capable of generalizing to novel tasks using only a few examples. Early-stopping is critical for performance, halting model training when it reaches optimal generalization to the new task distribution. Early-stopping mechanisms in Meta-Learning typically rely on measuring the model performance on labeled examples from a meta-validation set drawn from the training (source) dataset. This is problematic in few-shot transfer learning settings, where the meta-test set comes from a different target dataset (OOD) and can potentially have a large distributional shift with the meta-validation set. In this work, we propose Activation Based Early-stopping (ABE), an alternative to using validation-based early-stopping for meta-learning. Specifically, we analyze the evolution, during meta-training, of the neural activations at each hidden layer, on a small set of unlabelled support examples from a single task of the target tasks distribution, as this constitutes a minimal and justifiably accessible information from the target problem. Our experiments show that simple, label agnostic statistics on the activations offer an effective way to estimate how the target generalization evolves over time. At each hidden layer, we characterize the activation distributions, from their first and second order moments, then further summarized along the feature dimensions, resulting in a compact yet intuitive characterization in a four-dimensional space. Detecting when, throughout training time, and at which layer, the target activation trajectory diverges from the activation trajectory of the source data, allows us to perform early-stopping and improve generalization in a large array of few-shot transfer learning settings, across different algorithms, source and target datasets., Comment: Accepted at CoLLAs 2022. To be published in Proceedings of Machine Learning Research (PMLR)
Published: 2022

31. An Introduction to Lifelong Supervised Learning

Author: Sodhani, Shagun, Faramarzi, Mojtaba, Mehta, Sanket Vaibhav, Malviya, Pranshu, Abdelsalam, Mohamed, Janarthanan, Janarthanan, Chandar, Sarath, Sodhani, Shagun, Faramarzi, Mojtaba, Mehta, Sanket Vaibhav, Malviya, Pranshu, Abdelsalam, Mohamed, Janarthanan, Janarthanan, and Chandar, Sarath
Abstract: This primer is an attempt to provide a detailed summary of the different facets of lifelong learning. We start with Chapter 2 which provides a high-level overview of lifelong learning systems. In this chapter, we discuss prominent scenarios in lifelong learning (Section 2.4), provide 8 Introduction a high-level organization of different lifelong learning approaches (Section 2.5), enumerate the desiderata for an ideal lifelong learning system (Section 2.6), discuss how lifelong learning is related to other learning paradigms (Section 2.7), describe common metrics used to evaluate lifelong learning systems (Section 2.8). This chapter is more useful for readers who are new to lifelong learning and want to get introduced to the field without focusing on specific approaches or benchmarks. The remaining chapters focus on specific aspects (either learning algorithms or benchmarks) and are more useful for readers who are looking for specific approaches or benchmarks. Chapter 3 focuses on regularization-based approaches that do not assume access to any data from previous tasks. Chapter 4 discusses memory-based approaches that typically use a replay buffer or an episodic memory to save subset of data across different tasks. Chapter 5 focuses on different architecture families (and their instantiations) that have been proposed for training lifelong learning systems. Following these different classes of learning algorithms, we discuss the commonly used evaluation benchmarks and metrics for lifelong learning (Chapter 6) and wrap up with a discussion of future challenges and important research directions in Chapter 7., Comment: Lifelong Learning Primer
Published: 2022

32. Towards Evaluating Adaptivity of Model-Based Reinforcement Learning Methods

Author: Wan, Yi, Rahimi-Kalahroudi, Ali, Rajendran, Janarthanan, Momennejad, Ida, Chandar, Sarath, van Seijen, Harm, Wan, Yi, Rahimi-Kalahroudi, Ali, Rajendran, Janarthanan, Momennejad, Ida, Chandar, Sarath, and van Seijen, Harm
Abstract: In recent years, a growing number of deep model-based reinforcement learning (RL) methods have been introduced. The interest in deep model-based RL is not surprising, given its many potential benefits, such as higher sample efficiency and the potential for fast adaption to changes in the environment. However, we demonstrate, using an improved version of the recently introduced Local Change Adaptation (LoCA) setup, that well-known model-based methods such as PlaNet and DreamerV2 perform poorly in their ability to adapt to local environmental changes. Combined with prior work that made a similar observation about the other popular model-based method, MuZero, a trend appears to emerge, suggesting that current deep model-based methods have serious limitations. We dive deeper into the causes of this poor performance, by identifying elements that hurt adaptive behavior and linking these to underlying techniques frequently used in deep model-based RL. We empirically validate these insights in the case of linear function approximation by demonstrating that a modified version of linear Dyna achieves effective adaptation to local changes. Furthermore, we provide detailed insights into the challenges of building an adaptive nonlinear model-based method, by experimenting with a nonlinear version of Dyna.
Published: 2022

33. Improving Sample Efficiency of Value Based Models Using Attention and Vision Transformers

Author: Kalantari, Amir Ardalan, Amini, Mohammad, Chandar, Sarath, Precup, Doina, Kalantari, Amir Ardalan, Amini, Mohammad, Chandar, Sarath, and Precup, Doina
Abstract: Much of recent Deep Reinforcement Learning success is owed to the neural architecture's potential to learn and use effective internal representations of the world. While many current algorithms access a simulator to train with a large amount of data, in realistic settings, including while playing games that may be played against people, collecting experience can be quite costly. In this paper, we introduce a deep reinforcement learning architecture whose purpose is to increase sample efficiency without sacrificing performance. We design this architecture by incorporating advances achieved in recent years in the field of Natural Language Processing and Computer Vision. Specifically, we propose a visually attentive model that uses transformers to learn a self-attention mechanism on the feature maps of the state representation, while simultaneously optimizing return. We demonstrate empirically that this architecture improves sample complexity for several Atari environments, while also achieving better performance in some of the games.
Published: 2022

34. Combining Reinforcement Learning and Constraint Programming for Sequence-Generation Tasks with Hard Constraints

Author: Daphné Lafleur and Sarath Chandar and Gilles Pesant, Lafleur, Daphné, Chandar, Sarath, Pesant, Gilles, Daphné Lafleur and Sarath Chandar and Gilles Pesant, Lafleur, Daphné, Chandar, Sarath, and Pesant, Gilles
Abstract: While Machine Learning (ML) techniques are good at generating data similar to a dataset, they lack the capacity to enforce constraints. On the other hand, any solution to a Constraint Programming (CP) model satisfies its constraints but has no obligation to imitate a dataset. Yet, we sometimes need both. In this paper we borrow RL-Tuner, a Reinforcement Learning (RL) algorithm introduced to tune neural networks, as our enabling architecture to exploit the respective strengths of ML and CP. RL-Tuner maximizes the sum of a pretrained network’s learned probabilities and of manually-tuned penalties for each violated constraint. We replace the latter with outputs of a CP model representing the marginal probabilities of each value and the number of constraint violations. As was the case for the original RL-Tuner, we apply our algorithm to music generation since it is a highly-constrained domain for which CP is especially suited. We show that combining ML and CP, as opposed to using them individually, allows the agent to reflect the pretrained network while taking into account constraints, leading to melodic lines that respect both the corpus' style and the music theory constraints.
Published: 2022
Full Text: View/download PDF

35. An Empirical Investigation of the Role of Pre-training in Lifelong Learning

Author: Mehta, Sanket Vaibhav, Patil, Darshan, Chandar, Sarath, Strubell, Emma, Mehta, Sanket Vaibhav, Patil, Darshan, Chandar, Sarath, and Strubell, Emma
Abstract: The lifelong learning paradigm in machine learning is an attractive alternative to the more prominent isolated learning scheme not only due to its resemblance to biological learning but also its potential to reduce energy waste by obviating excessive model re-training. A key challenge to this paradigm is the phenomenon of catastrophic forgetting. With the increasing popularity and success of pre-trained models in machine learning, we pose the question: What role does pre-training play in lifelong learning, specifically with respect to catastrophic forgetting? We investigate existing methods in the context of large, pre-trained models and evaluate their performance on a variety of text and image classification tasks, including a large-scale study using a novel data set of 15 diverse NLP tasks. Across all settings, we observe that generic pre-training implicitly alleviates the effects of catastrophic forgetting when learning multiple tasks sequentially compared to randomly initialized models. We then further investigate why pre-training alleviates forgetting in this setting. We study this phenomenon by analyzing the loss landscape, finding that pre-trained weights appear to ease forgetting by leading to wider minima. Based on this insight, we propose jointly optimizing for current task loss and loss basin sharpness to explicitly encourage wider basins during sequential fine-tuning. We show that this optimization approach outperforms several state-of-the-art task-sequential continual learning algorithms across multiple settings, occasionally even without retaining a memory that scales in size with the number of tasks.
Published: 2021

36. Scaling Laws for the Few-Shot Adaptation of Pre-trained Image Classifiers

Author: Prato, Gabriele, Guiroy, Simon, Caballero, Ethan, Rish, Irina, Chandar, Sarath, Prato, Gabriele, Guiroy, Simon, Caballero, Ethan, Rish, Irina, and Chandar, Sarath
Abstract: Empirical science of neural scaling laws is a rapidly growing area of significant importance to the future of machine learning, particularly in the light of recent breakthroughs achieved by large-scale pre-trained models such as GPT-3, CLIP and DALL-e. Accurately predicting the neural network performance with increasing resources such as data, compute and model size provides a more comprehensive evaluation of different approaches across multiple scales, as opposed to traditional point-wise comparisons of fixed-size models on fixed-size benchmarks, and, most importantly, allows for focus on the best-scaling, and thus most promising in the future, approaches. In this work, we consider a challenging problem of few-shot learning in image classification, especially when the target data distribution in the few-shot phase is different from the source, training, data distribution, in a sense that it includes new image classes not encountered during training. Our current main goal is to investigate how the amount of pre-training data affects the few-shot generalization performance of standard image classifiers. Our key observations are that (1) such performance improvements are well-approximated by power laws (linear log-log plots) as the training set size increases, (2) this applies to both cases of target data coming from either the same or from a different domain (i.e., new classes) as the training data, and (3) few-shot performance on new classes converges at a faster rate than the standard classification performance on previously seen classes. Our findings shed new light on the relationship between scale and generalization.
Published: 2021

37. Post-hoc Interpretability for Neural NLP: A Survey

Author: Madsen, Andreas, Reddy, Siva, Chandar, Sarath, Madsen, Andreas, Reddy, Siva, and Chandar, Sarath
Abstract: Neural networks for NLP are becoming increasingly complex and widespread, and there is a growing concern if these models are responsible to use. Explaining models helps to address the safety and ethical concerns and is essential for accountability. Interpretability serves to provide these explanations in terms that are understandable to humans. Additionally, post-hoc methods provide explanations after a model is learned and are generally model-agnostic. This survey provides a categorization of how recent post-hoc interpretability methods communicate explanations to humans, it discusses each method in-depth, and how they are validated, as the latter is often a common concern.
Published: 2021
Full Text: View/download PDF

38. Local Structure Matters Most: Perturbation Study in NLU

Author: Clouatre, Louis, Parthasarathi, Prasanna, Zouaq, Amal, Chandar, Sarath, Clouatre, Louis, Parthasarathi, Prasanna, Zouaq, Amal, and Chandar, Sarath
Abstract: Recent research analyzing the sensitivity of natural language understanding models to word-order perturbations has shown that neural models are surprisingly insensitive to the order of words. In this paper, we investigate this phenomenon by developing order-altering perturbations on the order of words, subwords, and characters to analyze their effect on neural models' performance on language understanding tasks. We experiment with measuring the impact of perturbations to the local neighborhood of characters and global position of characters in the perturbed texts and observe that perturbation functions found in prior literature only affect the global ordering while the local ordering remains relatively unperturbed. We empirically show that neural models, invariant of their inductive biases, pretraining scheme, or the choice of tokenization, mostly rely on the local structure of text to build understanding and make limited use of the global structure., Comment: 11 pages, 13 figure + appendix
Published: 2021

39. A Brief Study on the Effects of Training Generative Dialogue Models with a Semantic loss

Author: Parthasarathi, Prasanna, Abdelsalam, Mohamed, Pineau, Joelle, Chandar, Sarath, Parthasarathi, Prasanna, Abdelsalam, Mohamed, Pineau, Joelle, and Chandar, Sarath
Abstract: Neural models trained for next utterance generation in dialogue task learn to mimic the n-gram sequences in the training set with training objectives like negative log-likelihood (NLL) or cross-entropy. Such commonly used training objectives do not foster generating alternate responses to a context. But, the effects of minimizing an alternate training objective that fosters a model to generate alternate response and score it on semantic similarity has not been well studied. We hypothesize that a language generation model can improve on its diversity by learning to generate alternate text during training and minimizing a semantic loss as an auxiliary objective. We explore this idea on two different sized data sets on the task of next utterance generation in goal oriented dialogues. We make two observations (1) minimizing a semantic objective improved diversity in responses in the smaller data set (Frames) but only as-good-as minimizing the NLL in the larger data set (MultiWoZ) (2) large language model embeddings can be more useful as a semantic loss objective than as initialization for token embeddings., Comment: Accepted at SIGDial 2021
Published: 2021

40. Memory Augmented Optimizers for Deep Learning

Author: McRae, Paul-Aymeric, Parthasarathi, Prasanna, Assran, Mahmoud, Chandar, Sarath, McRae, Paul-Aymeric, Parthasarathi, Prasanna, Assran, Mahmoud, and Chandar, Sarath
Abstract: Popular approaches for minimizing loss in data-driven learning often involve an abstraction or an explicit retention of the history of gradients for efficient parameter updates. The aggregated history of gradients nudges the parameter updates in the right direction even when the gradients at any given step are not informative. Although the history of gradients summarized in meta-parameters or explicitly stored in memory has been shown effective in theory and practice, the question of whether $all$ or only a subset of the gradients in the history are sufficient in deciding the parameter updates remains unanswered. In this paper, we propose a framework of memory-augmented gradient descent optimizers that retain a limited view of their gradient history in their internal memory. Such optimizers scale well to large real-life datasets, and our experiments show that the memory augmented extensions of standard optimizers enjoy accelerated convergence and improved performance on a majority of computer vision and language tasks that we considered. Additionally, we prove that the proposed class of optimizers with fixed-size memory converge under assumptions of strong convexity, regardless of which gradients are selected or how they are linearly combined to form the update step., Comment: 24 Pages. Currently under review
Published: 2021

41. Do Encoder Representations of Generative Dialogue Models Encode Sufficient Information about the Task ?

Author: Parthasarathi, Prasanna, Pineau, Joelle, Chandar, Sarath, Parthasarathi, Prasanna, Pineau, Joelle, and Chandar, Sarath
Abstract: Predicting the next utterance in dialogue is contingent on encoding of users' input text to generate appropriate and relevant response in data-driven approaches. Although the semantic and syntactic quality of the language generated is evaluated, more often than not, the encoded representation of input is not evaluated. As the representation of the encoder is essential for predicting the appropriate response, evaluation of encoder representation is a challenging yet important problem. In this work, we showcase evaluating the text generated through human or automatic metrics is not sufficient to appropriately evaluate soundness of the language understanding of dialogue models and, to that end, propose a set of probe tasks to evaluate encoder representation of different language encoders commonly used in dialogue models. From experiments, we observe that some of the probe tasks are easier and some are harder for even sophisticated model architectures to learn. And, through experiments we observe that RNN based architectures have lower performance on automatic metrics on text generation than transformer model but perform better than the transformer model on the probe tasks indicating that RNNs might preserve task information better than the Transformers., Comment: Accepted at SIGDial 2021. arXiv admin note: substantial text overlap with arXiv:2008.10427
Published: 2021

42. TAG: Task-based Accumulated Gradients for Lifelong learning

Author: Malviya, Pranshu, Ravindran, Balaraman, Chandar, Sarath, Malviya, Pranshu, Ravindran, Balaraman, and Chandar, Sarath
Abstract: When an agent encounters a continual stream of new tasks in the lifelong learning setting, it leverages the knowledge it gained from the earlier tasks to help learn the new tasks better. In such a scenario, identifying an efficient knowledge representation becomes a challenging problem. Most research works propose to either store a subset of examples from the past tasks in a replay buffer, dedicate a separate set of parameters to each task or penalize excessive updates over parameters by introducing a regularization term. While existing methods employ the general task-agnostic stochastic gradient descent update rule, we propose a task-aware optimizer that adapts the learning rate based on the relatedness among tasks. We utilize the directions taken by the parameters during the updates by accumulating the gradients specific to each task. These task-based accumulated gradients act as a knowledge base that is maintained and updated throughout the stream. We empirically show that our proposed adaptive learning rate not only accounts for catastrophic forgetting but also allows positive backward transfer. We also show that our method performs better than several state-of-the-art methods in lifelong learning on complex datasets with a large number of tasks., Comment: Published at 1st Conference on Lifelong Learning Agents, 2022
Published: 2021

43. A Survey of Data Augmentation Approaches for NLP

Author: Feng, Steven Y., Gangal, Varun, Wei, Jason, Chandar, Sarath, Vosoughi, Soroush, Mitamura, Teruko, Hovy, Eduard, Feng, Steven Y., Gangal, Varun, Wei, Jason, Chandar, Sarath, Vosoughi, Soroush, Mitamura, Teruko, and Hovy, Eduard
Abstract: Data augmentation has recently seen increased interest in NLP due to more work in low-resource domains, new tasks, and the popularity of large-scale neural networks that require large amounts of training data. Despite this recent upsurge, this area is still relatively underexplored, perhaps due to the challenges posed by the discrete nature of language data. In this paper, we present a comprehensive and unifying survey of data augmentation for NLP by summarizing the literature in a structured manner. We first introduce and motivate data augmentation for NLP, and then discuss major methodologically representative approaches. Next, we highlight techniques that are used for popular NLP applications and tasks. We conclude by outlining current challenges and directions for future research. Overall, our paper aims to clarify the landscape of existing literature in data augmentation for NLP and motivate additional work in this area. We also present a GitHub repository with a paper list that will be continuously updated at https://github.com/styfeng/DataAug4NLP, Comment: Accepted to ACL 2021 Findings. GitHub repo with paper list at https://github.com/styfeng/DataAug4NLP ; Talk at https://www.youtube.com/watch?v=kNBVesKUZCk&ab_channel=StevenFeng ; Podcast at https://www.youtube.com/watch?v=qmqyT_97Poc&ab_channel=GradientFlow and https://thedataexchange.media/data-augmentation-in-natural-language-processing
Published: 2021

44. Continuous Coordination As a Realistic Scenario for Lifelong Learning

Author: Nekoei, Hadi, Badrinaaraayanan, Akilesh, Courville, Aaron, Chandar, Sarath, Nekoei, Hadi, Badrinaaraayanan, Akilesh, Courville, Aaron, and Chandar, Sarath
Abstract: Current deep reinforcement learning (RL) algorithms are still highly task-specific and lack the ability to generalize to new environments. Lifelong learning (LLL), however, aims at solving multiple tasks sequentially by efficiently transferring and using knowledge between tasks. Despite a surge of interest in lifelong RL in recent years, the lack of a realistic testbed makes robust evaluation of LLL algorithms difficult. Multi-agent RL (MARL), on the other hand, can be seen as a natural scenario for lifelong RL due to its inherent non-stationarity, since the agents' policies change over time. In this work, we introduce a multi-agent lifelong learning testbed that supports both zero-shot and few-shot settings. Our setup is based on Hanabi -- a partially-observable, fully cooperative multi-agent game that has been shown to be challenging for zero-shot coordination. Its large strategy space makes it a desirable environment for lifelong RL tasks. We evaluate several recent MARL methods, and benchmark state-of-the-art LLL algorithms in limited memory and computation regimes to shed light on their strengths and weaknesses. This continual learning paradigm also provides us with a pragmatic way of going beyond centralized training which is the most commonly used training protocol in MARL. We empirically show that the agents trained in our setup are able to coordinate well with unseen agents, without any additional assumptions made by previous works. The code and all pre-trained models are available at https://github.com/chandar-lab/Lifelong-Hanabi., Comment: 19 pages with supplementary materials. Added results for Lifelong RL methods and some future work. Accepted to ICML 2021
Published: 2021

45. Slot Contrastive Networks: A Contrastive Approach for Representing Objects

Author: Racah, Evan, Chandar, Sarath, Racah, Evan, and Chandar, Sarath
Abstract: Unsupervised extraction of objects from low-level visual data is an important goal for further progress in machine learning. Existing approaches for representing objects without labels use structured generative models with static images. These methods focus a large amount of their capacity on reconstructing unimportant background pixels, missing low contrast or small objects. Conversely, we present a new method that avoids losses in pixel space and over-reliance on the limited signal a static image provides. Our approach takes advantage of objects' motion by learning a discriminative, time-contrastive loss in the space of slot representations, attempting to force each slot to not only capture entities that move, but capture distinct objects from the other slots. Moreover, we introduce a new quantitative evaluation metric to measure how "diverse" a set of slot vectors are, and use it to evaluate our model on 20 Atari games., Comment: Presented at ICML 2020 Workshop: Object-Oriented Learning (OOL): Perception, Representation, and Reasoning
Published: 2020

46. The LoCA Regret: A Consistent Metric to Evaluate Model-Based Behavior in Reinforcement Learning

Author: van Seijen, Harm, Nekoei, Hadi, Racah, Evan, Chandar, Sarath, van Seijen, Harm, Nekoei, Hadi, Racah, Evan, and Chandar, Sarath
Abstract: Deep model-based Reinforcement Learning (RL) has the potential to substantially improve the sample-efficiency of deep RL. While various challenges have long held it back, a number of papers have recently come out reporting success with deep model-based methods. This is a great development, but the lack of a consistent metric to evaluate such methods makes it difficult to compare various approaches. For example, the common single-task sample-efficiency metric conflates improvements due to model-based learning with various other aspects, such as representation learning, making it difficult to assess true progress on model-based RL. To address this, we introduce an experimental setup to evaluate model-based behavior of RL methods, inspired by work from neuroscience on detecting model-based behavior in humans and animals. Our metric based on this setup, the Local Change Adaptation (LoCA) regret, measures how quickly an RL method adapts to a local change in the environment. Our metric can identify model-based behavior, even if the method uses a poor representation and provides insight in how close a method's behavior is from optimal model-based behavior. We use our setup to evaluate the model-based behavior of MuZero on a variation of the classic Mountain Car task., Comment: NeurIPS 2020, code: https://github.com/chandar-lab/LoCA
Published: 2020

47. PatchUp: A Feature-Space Block-Level Regularization Technique for Convolutional Neural Networks

Author: Faramarzi, Mojtaba, Amini, Mohammad, Badrinaaraayanan, Akilesh, Verma, Vikas, Chandar, Sarath, Faramarzi, Mojtaba, Amini, Mohammad, Badrinaaraayanan, Akilesh, Verma, Vikas, and Chandar, Sarath
Abstract: Large capacity deep learning models are often prone to a high generalization gap when trained with a limited amount of labeled training data. A recent class of methods to address this problem uses various ways to construct a new training sample by mixing a pair (or more) of training samples. We propose PatchUp, a hidden state block-level regularization technique for Convolutional Neural Networks (CNNs), that is applied on selected contiguous blocks of feature maps from a random pair of samples. Our approach improves the robustness of CNN models against the manifold intrusion problem that may occur in other state-of-the-art mixing approaches. Moreover, since we are mixing the contiguous block of features in the hidden space, which has more dimensions than the input space, we obtain more diverse samples for training towards different dimensions. Our experiments on CIFAR10/100, SVHN, Tiny-ImageNet, and ImageNet using ResNet architectures including PreActResnet18/34, WRN-28-10, ResNet101/152 models show that PatchUp improves upon, or equals, the performance of current state-of-the-art regularizers for CNNs. We also show that PatchUp can provide a better generalization to deformed samples and is more robust against adversarial attacks., Comment: AAAI - 2022
Published: 2020
Full Text: View/download PDF

48. Learning To Navigate The Synthetically Accessible Chemical Space Using Reinforcement Learning

Author: Gottipati, Sai Krishna, Sattarov, Boris, Niu, Sufeng, Pathak, Yashaswi, Wei, Haoran, Liu, Shengchao, Thomas, Karam M. J., Blackburn, Simon, Coley, Connor W., Tang, Jian, Chandar, Sarath, Bengio, Yoshua, Gottipati, Sai Krishna, Sattarov, Boris, Niu, Sufeng, Pathak, Yashaswi, Wei, Haoran, Liu, Shengchao, Thomas, Karam M. J., Blackburn, Simon, Coley, Connor W., Tang, Jian, Chandar, Sarath, and Bengio, Yoshua
Abstract: Over the last decade, there has been significant progress in the field of machine learning for de novo drug design, particularly in deep generative models. However, current generative approaches exhibit a significant challenge as they do not ensure that the proposed molecular structures can be feasibly synthesized nor do they provide the synthesis routes of the proposed small molecules, thereby seriously limiting their practical applicability. In this work, we propose a novel forward synthesis framework powered by reinforcement learning (RL) for de novo drug design, Policy Gradient for Forward Synthesis (PGFS), that addresses this challenge by embedding the concept of synthetic accessibility directly into the de novo drug design system. In this setup, the agent learns to navigate through the immense synthetically accessible chemical space by subjecting commercially available small molecule building blocks to valid chemical reactions at every time step of the iterative virtual multi-step synthesis process. The proposed environment for drug discovery provides a highly challenging test-bed for RL algorithms owing to the large state space and high-dimensional continuous action space with hierarchical actions. PGFS achieves state-of-the-art performance in generating structures with high QED and penalized clogP. Moreover, we validate PGFS in an in-silico proof-of-concept associated with three HIV targets. Finally, we describe how the end-to-end training conceptualized in this study represents an important paradigm in radically expanding the synthesizable chemical space and automating the drug discovery process., Comment: added the statistics of top-100 compounds used logP metric with scaled components added values of the initial reactants to the box plots some values in tables are recalculated due to the inconsistent environments on different machines. corresponding benchmarks were rerun with the requirements on github. no significant changes in the results. corrected figures in the Appendix
Published: 2020

49. IIRC: Incremental Implicitly-Refined Classification

Author: Abdelsalam, Mohamed, Faramarzi, Mojtaba, Sodhani, Shagun, Chandar, Sarath, Abdelsalam, Mohamed, Faramarzi, Mojtaba, Sodhani, Shagun, and Chandar, Sarath
Abstract: We introduce the "Incremental Implicitly-Refined Classi-fication (IIRC)" setup, an extension to the class incremental learning setup where the incoming batches of classes have two granularity levels. i.e., each sample could have a high-level (coarse) label like "bear" and a low-level (fine) label like "polar bear". Only one label is provided at a time, and the model has to figure out the other label if it has already learnfed it. This setup is more aligned with real-life scenarios, where a learner usually interacts with the same family of entities multiple times, discovers more granularity about them, while still trying not to forget previous knowledge. Moreover, this setup enables evaluating models for some important lifelong learning challenges that cannot be easily addressed under the existing setups. These challenges can be motivated by the example "if a model was trained on the class bear in one task and on polar bear in another task, will it forget the concept of bear, will it rightfully infer that a polar bear is still a bear? and will it wrongfully associate the label of polar bear to other breeds of bear?". We develop a standardized benchmark that enables evaluating models on the IIRC setup. We evaluate several state-of-the-art lifelong learning algorithms and highlight their strengths and limitations. For example, distillation-based methods perform relatively well but are prone to incorrectly predicting too many labels per image. We hope that the proposed setup, along with the benchmark, would provide a meaningful problem setting to the practitioners
Published: 2020

50. Maximum Reward Formulation In Reinforcement Learning

Author: Gottipati, Sai Krishna, Pathak, Yashaswi, Nuttall, Rohan, Sahir, Chunduru, Raviteja, Touati, Ahmed, Subramanian, Sriram Ganapathi, Taylor, Matthew E., Chandar, Sarath, Gottipati, Sai Krishna, Pathak, Yashaswi, Nuttall, Rohan, Sahir, Chunduru, Raviteja, Touati, Ahmed, Subramanian, Sriram Ganapathi, Taylor, Matthew E., and Chandar, Sarath
Abstract: Reinforcement learning (RL) algorithms typically deal with maximizing the expected cumulative return (discounted or undiscounted, finite or infinite horizon). However, several crucial applications in the real world, such as drug discovery, do not fit within this framework because an RL agent only needs to identify states (molecules) that achieve the highest reward within a trajectory and does not need to optimize for the expected cumulative return. In this work, we formulate an objective function to maximize the expected maximum reward along a trajectory, derive a novel functional form of the Bellman equation, introduce the corresponding Bellman operators, and provide a proof of convergence. Using this formulation, we achieve state-of-the-art results on the task of molecule generation that mimics a real-world drug discovery pipeline., Comment: 14 pages, 5 figures Update based on reviewer feedback
Published: 2020

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Publication Year Range

Publication Type

Database

75 results on '"Chandar, Sarath"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources