Author: "Rumshisky, A." - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Rumshisky, A."' showing total 414 results

1. Deconstructing In-Context Learning: Understanding Prompts via Corruption

Author: Shivagunde, Namrata, Lialin, Vladislav, Muckatira, Sherin, and Rumshisky, Anna
Subjects: Computer Science - Computation and Language
Abstract: The ability of large language models (LLMs) to "learn in context" based on the provided prompt has led to an explosive growth in their use, culminating in the proliferation of AI assistants such as ChatGPT, Claude, and Bard. These AI assistants are known to be robust to minor prompt modifications, mostly due to alignment techniques that use human feedback. In contrast, the underlying pre-trained LLMs they use as a backbone are known to be brittle in this respect. Building high-quality backbone models remains a core challenge, and a common approach to assessing their quality is to conduct few-shot evaluation. Such evaluation is notorious for being highly sensitive to minor prompt modifications, as well as the choice of specific in-context examples. Prior work has examined how modifying different elements of the prompt can affect model performance. However, these earlier studies tended to concentrate on a limited number of specific prompt attributes and often produced contradictory results. Additionally, previous research either focused on models with fewer than 15 billion parameters or exclusively examined black-box models like GPT-3 or PaLM, making replication challenging. In the present study, we decompose the entire prompt into four components: task description, demonstration inputs, labels, and inline instructions provided for each demonstration. We investigate the effects of structural and semantic corruptions of these elements on model performance. We study models ranging from 1.5B to 70B in size, using ten datasets covering classification and generation tasks. We find that repeating text within the prompt boosts model performance, and bigger models (≥30B) are more sensitive to the semantics of the prompt. Finally, we observe that adding task and inline instructions to the demonstrations enhances model performance even when the instructions are semantically corrupted.
Comment: Accepted to LREC-COLING 2024 main conference.
The code is available at https://github.com/text-machine-lab/Understanding_prompts_via_corruption
Published: 2024

2. Emergent Abilities in Reduced-Scale Generative Language Models

Author: Muckatira, Sherin, Deshpande, Vijeta, Lialin, Vladislav, and Rumshisky, Anna
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Large language models can solve new tasks without task-specific fine-tuning. This ability, also known as in-context learning (ICL), is considered an emergent ability and is primarily seen in large language models with billions of parameters. This study investigates if such emergent properties are strictly tied to model size or can be demonstrated by smaller models trained on reduced-scale data. To explore this, we simplify pre-training data and pre-train 36 causal language models with parameters varying from 1 million to 165 million parameters. We show that models trained on this simplified pre-training data demonstrate enhanced zero-shot capabilities across various tasks in simplified language, achieving performance comparable to that of pre-trained models six times larger on unrestricted language. This suggests that downscaling the language allows zero-shot learning capabilities to emerge in models with limited size. Additionally, we find that these smaller models pre-trained on simplified data demonstrate a power law relationship between the evaluation loss and the three scaling factors: compute, dataset size, and model size.
Comment: 16 pages, 4 figures. Accepted to NAACL 2024 Findings
Published: 2024

3. Prompt Perturbation Consistency Learning for Robust Language Models

Author: Qiang, Yao, Nandi, Subhrangshu, Mehrabi, Ninareh, Steeg, Greg Ver, Kumar, Anoop, Rumshisky, Anna, and Galstyan, Aram
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Large language models (LLMs) have demonstrated impressive performance on a number of natural language processing tasks, such as question answering and text summarization. However, their performance on sequence labeling tasks such as intent classification and slot filling (IC-SF), which is a central component in personal assistant systems, lags significantly behind discriminative models. Furthermore, there is a lack of substantive research on the robustness of LLMs to various perturbations in the input prompts. The contributions of this paper are three-fold. First, we show that fine-tuning sufficiently large LLMs can produce IC-SF performance comparable to discriminative models. Next, we systematically analyze the performance deterioration of those fine-tuned models due to three distinct yet relevant types of input perturbations - oronyms, synonyms, and paraphrasing. Finally, we propose an efficient mitigation approach, Prompt Perturbation Consistency Learning (PPCL), which works by regularizing the divergence between losses from clean and perturbed samples. Our experiments demonstrate that PPCL can recover on average 59% and 69% of the performance drop for IC and SF tasks, respectively. Furthermore, PPCL beats the data augmentation approach while using ten times fewer augmented data samples.
Published: 2024

4. Let's Reinforce Step by Step

Author: Pan, Sarah, Lialin, Vladislav, Muckatira, Sherin, and Rumshisky, Anna
Subjects: Computer Science - Computation and Language
Abstract: While recent advances have boosted LM proficiency in linguistic benchmarks, LMs consistently struggle to reason correctly on complex tasks like mathematics. We turn to Reinforcement Learning from Human Feedback (RLHF) as a method with which to shape model reasoning processes. In particular, we explore two reward schemes, outcome-supervised reward models (ORMs) and process-supervised reward models (PRMs), to optimize for logical reasoning. Our results show that the fine-grained reward provided by PRM-based methods enhances accuracy on simple mathematical reasoning (GSM8K) while, unexpectedly, reducing performance in complex tasks (MATH). Furthermore, we show the critical role reward aggregation functions play in model performance. Providing promising avenues for future research, our study underscores the need for further exploration into fine-grained reward modeling for more reliable language models.
Comment: NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following
Published: 2023

5. ReLoRA: High-Rank Training Through Low-Rank Updates

Author: Lialin, Vladislav, Shivagunde, Namrata, Muckatira, Sherin, and Rumshisky, Anna
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Despite the dominance and effectiveness of scaling, resulting in large networks with hundreds of billions of parameters, the necessity to train overparameterized models remains poorly understood, while training costs grow exponentially. In this paper, we explore parameter-efficient training techniques as an approach to training large neural networks. We introduce a novel method called ReLoRA, which utilizes low-rank updates to train high-rank networks. We apply ReLoRA to training transformer language models with up to 1.3B parameters and demonstrate comparable performance to regular neural network training. ReLoRA saves up to 5.5Gb of RAM per GPU and improves training speed by 9-40% depending on the model size and hardware setup. Our findings show the potential of parameter-efficient techniques for large-scale pre-training.
Published: 2023

6. Recipes for Sequential Pre-training of Multilingual Encoder and Seq2Seq Models

Author: Soltan, Saleh, Rosenbaum, Andy, Falke, Tobias, Lu, Qin, Rumshisky, Anna, and Hamza, Wael
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Pre-trained encoder-only and sequence-to-sequence (seq2seq) models each have advantages; however, training both model types from scratch is computationally expensive. We explore recipes to improve pre-training efficiency by initializing one model from the other. (1) Extracting the encoder from a seq2seq model, we show it under-performs a Masked Language Modeling (MLM) encoder, particularly on sequence labeling tasks. Variations of masking during seq2seq training, reducing the decoder size, and continuing with a small amount of MLM training do not close the gap. (2) Conversely, using an encoder to warm-start seq2seq training, we show that by unfreezing the encoder partway through training, we can match task performance of a from-scratch seq2seq model. Overall, this two-stage approach is an efficient recipe to obtain both a multilingual encoder and a seq2seq model, matching the performance of training each model from scratch while reducing the total compute cost by 27%.
Comment: ACL Findings 2023 and SustaiNLP Workshop 2023
Published: 2023

7. Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale

Author: Deshpande, Vijeta, Pechi, Dan, Thatte, Shree, Lialin, Vladislav, and Rumshisky, Anna
Subjects: Computer Science - Computation and Language
Abstract: In recent years, language models have drastically grown in size, and the abilities of these models have been shown to improve with scale. The majority of recent scaling laws studies focused on high-compute high-parameter count settings, leaving the question of when these abilities begin to emerge largely unanswered. In this paper, we investigate whether the effects of pre-training can be observed when the problem size is reduced, modeling a smaller, reduced-vocabulary language. We show the benefits of pre-training with masked language modeling (MLM) objective in models as small as 1.25M parameters, and establish a strong correlation between pre-training perplexity and downstream performance (GLUE benchmark). We examine downscaling effects, extending scaling laws to models as small as ~1M parameters. At this scale, we observe a break of the power law for compute-optimal models and show that the MLM loss does not scale smoothly with compute-cost (FLOPs) below 2.2 × 10^15 FLOPs. We also find that adding layers does not always benefit downstream performance.
Comment: Accepted to ACL 2023 Findings
Published: 2023

8. Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data

Author: Lialin, Vladislav, Rawls, Stephen, Chan, David, Ghosh, Shalini, Rumshisky, Anna, and Hamza, Wael
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Scaling up weakly-supervised datasets has been shown to be highly effective in the image-text domain and has contributed to most of the recent state-of-the-art computer vision and multimodal neural networks. However, existing large-scale video-text datasets and mining techniques suffer from several limitations, such as the scarcity of aligned data, the lack of diversity in the data, and the difficulty of collecting aligned data. The currently popular video-text data mining approach via automatic speech recognition (ASR), used in HowTo100M, provides low-quality captions that often do not refer to the video content. Other mining approaches do not provide proper language descriptions (video tags) and are biased toward short clips (alt text). In this work, we show how recent advances in image captioning allow us to pre-train high-quality video models without any parallel video-text data. We pre-train several video captioning models that are based on an OPT language model and a TimeSformer visual backbone. We fine-tune these networks on several video captioning datasets. First, we demonstrate that image captioning pseudolabels work better for pre-training than the existing HowTo100M ASR captions. Second, we show that pre-training on both images and videos produces a significantly better network (+4 CIDEr on MSR-VTT) than pre-training on a single modality. Our methods are complementary to the existing pre-training or data mining approaches and can be used in a variety of settings. Given the efficacy of the pseudolabeling method, we are planning to publicly release the generated captions.
Published: 2023

9. Larger Probes Tell a Different Story: Extending Psycholinguistic Datasets Via In-Context Learning

Author: Shivagunde, Namrata, Lialin, Vladislav, and Rumshisky, Anna
Subjects: Computer Science - Computation and Language
Abstract: Language model probing is often used to test specific capabilities of models. However, conclusions from such studies may be limited when the probing benchmarks are small and lack statistical power. In this work, we introduce new, larger datasets for negation (NEG-1500-SIMP) and role reversal (ROLE-1500) inspired by psycholinguistic studies. We dramatically extend existing NEG-136 and ROLE-88 benchmarks using GPT3, increasing their size from 18 and 44 sentence pairs to 750 each. We also create another version of the extended negation dataset (NEG-1500-SIMP-TEMP), created using template-based generation. It consists of 770 sentence pairs. We evaluate 22 models on the extended datasets, seeing model performance dip 20-57% compared to the original smaller benchmarks. We observe high levels of negation sensitivity in models like BERT and ALBERT, demonstrating that previous findings might have been skewed due to smaller test sets. Finally, we observe that while GPT3 has generated all the examples in ROLE-1500, it is only able to solve 24.6% of them during probing. The datasets and code are available on GitHub: https://github.com/text-machine-lab/extending_psycholinguistic_dataset
Comment: 14 pages, 6 figures. Published as a conference paper at EMNLP 2023 (short). The datasets and code are available at https://github.com/text-machine-lab/extending_psycholinguistic_dataset
Published: 2023

10. Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning

Author: Lialin, Vladislav, Deshpande, Vijeta, Yao, Xiaowei, and Rumshisky, Anna
Subjects: Computer Science - Computation and Language
Abstract: This paper presents a systematic overview of parameter-efficient fine-tuning methods, covering over 50 papers published between early 2019 and mid-2024. These methods aim to address the challenges of fine-tuning large language models by training only a small subset of parameters. We provide a taxonomy that covers a broad range of methods and present a detailed method comparison with a specific focus on real-life efficiency in fine-tuning multibillion-scale language models. We also conduct an extensive head-to-head experimental comparison of 15 diverse PEFT methods, evaluating their performance and efficiency on models up to 11B parameters. Our findings reveal that methods previously shown to surpass a strong LoRA baseline face difficulties in resource-constrained settings, where hyperparameter optimization is limited and the network is fine-tuned only for a few epochs. Finally, we provide a set of practical recommendations for using PEFT methods and outline potential future research directions.
Published: 2023

11. Emergent Abilities in Reduced-Scale Generative Language Models.

Author: Sherin Muckatira, Vijeta Deshpande, Vladislav Lialin, and Anna Rumshisky
Published: 2024

12. Deconstructing In-Context Learning: Understanding Prompts via Corruption.

Author: Namrata Shivagunde, Vladislav Lialin, Sherin Muckatira, and Anna Rumshisky
Published: 2024

13. NarrativeTime: Dense Temporal Annotation on a Timeline.

Author: Anna Rogers, Marzena Karpinska, Ankita Gupta, Vladislav Lialin, Gregory Smelkov, and Anna Rumshisky
Published: 2024

14. Prompt Perturbation Consistency Learning for Robust Language Models.

Author: Yao Qiang, Subhrangshu Nandi, Ninareh Mehrabi, Greg Ver Steeg, Anoop Kumar, Anna Rumshisky, and Aram Galstyan
Published: 2024

15. Reasoning Circuits: Few-shot Multihop Question Generation with Structured Rationales

Author: Kulshreshtha, Saurabh and Rumshisky, Anna
Subjects: Computer Science - Computation and Language
Abstract: Multi-hop Question Generation is the task of generating questions which require the reader to reason over and combine information spread across multiple passages using several reasoning steps. Chain-of-thought rationale generation has been shown to improve performance on multi-step reasoning tasks and make model predictions more interpretable. However, few-shot performance gains from including rationales have been largely observed only in +100B language models, and otherwise require large scale manual rationale annotation. In this work, we introduce a new framework for applying chain-of-thought inspired structured rationale generation to multi-hop question generation under a very low supervision regime (8- to 128-shot). We propose to annotate a small number of examples following our proposed multi-step rationale schema, treating each reasoning step as a separate task to be performed by a generative language model. We show that our framework leads to improved control over the difficulty of the generated questions and better performance compared to baselines trained without rationales, both on automatic evaluation metrics and in human evaluation. Importantly, we show that this is achievable with a modest model size.
Published: 2022

16. On Task-Adaptive Pretraining for Dialogue Response Selection

Author: Lin, Tzu-Hsiang, Chi, Ta-Chung, and Rumshisky, Anna
Subjects: Computer Science - Computation and Language
Abstract: Recent advancements in dialogue response selection (DRS) are based on the task-adaptive pre-training (TAP) approach: models are first initialized with BERT and then adapted to dialogue data with dialogue-specific or fine-grained pre-training tasks. However, it is uncertain whether BERT is the best initialization choice, or whether the proposed dialogue-specific fine-grained learning tasks are actually better than MLM+NSP. This paper aims to verify assumptions made in previous works and understand the source of improvements for DRS. We show that initializing with RoBERTa achieves similar performance to BERT, and that MLM+NSP can outperform all previously proposed TAP tasks; in the process, we also contribute a new state-of-the-art on the Ubuntu corpus. Additional analyses show that the main source of improvements comes from the TAP step, and that the NSP task is crucial to DRS, unlike in common NLU tasks.
Comment: 6 pages, 4 figures
Published: 2022

17. AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model

Author: Soltan, Saleh, Ananthakrishnan, Shankar, FitzGerald, Jack, Gupta, Rahul, Hamza, Wael, Khan, Haidar, Peris, Charith, Rawls, Stephen, Rosenbaum, Andy, Rumshisky, Anna, Prakash, Chandana Satya, Sridhar, Mukund, Triefenbach, Fabian, Verma, Apurv, Tur, Gokhan, and Natarajan, Prem
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: In this work, we demonstrate that multilingual large-scale sequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoising and Causal Language Modeling (CLM) tasks, are more efficient few-shot learners than decoder-only models on various tasks. In particular, we train a 20 billion parameter multilingual seq2seq model called Alexa Teacher Model (AlexaTM 20B) and show that it achieves state-of-the-art (SOTA) performance on 1-shot summarization tasks, outperforming a much larger 540B PaLM decoder model. AlexaTM 20B also achieves SOTA in 1-shot machine translation, especially for low-resource languages, across almost all language pairs supported by the model (Arabic, English, French, German, Hindi, Italian, Japanese, Marathi, Portuguese, Spanish, Tamil, and Telugu) on Flores-101 dataset. We also show in zero-shot setting, AlexaTM 20B outperforms GPT3 (175B) on SuperGLUE and SQuADv2 datasets and provides SOTA performance on multilingual tasks such as XNLI, XCOPA, Paws-X, and XWinograd. Overall, our results present a compelling case for seq2seq models as a powerful alternative to decoder-only models for Large-scale Language Model (LLM) training.
Published: 2022

18. Learning to Ask Like a Physician

Author: Lehman, Eric, Lialin, Vladislav, Legaspi, Katelyn Y., Sy, Anne Janelle R., Pile, Patricia Therese S., Alberto, Nicole Rose I., Ragasa, Richard Raymund R., Puyat, Corinna Victoria M., Alberto, Isabelle Rose I., Alfonso, Pia Gabrielle I., Taliño, Marianne, Moukheiber, Dana, Wallace, Byron C., Rumshisky, Anna, Liang, Jenifer J., Raghavan, Preethi, Celi, Leo Anthony, and Szolovits, Peter
Subjects: Computer Science - Computation and Language
Abstract: Existing question answering (QA) datasets derived from electronic health records (EHR) are artificially generated and consequently fail to capture realistic physician information needs. We present Discharge Summary Clinical Questions (DiSCQ), a newly curated question dataset composed of 2,000+ questions paired with the snippets of text (triggers) that prompted each question. The questions are generated by medical experts from 100+ MIMIC-III discharge summaries. We analyze this dataset to characterize the types of information sought by medical experts. We also train baseline models for trigger detection and question generation (QG), paired with unsupervised answer retrieval over EHRs. Our baseline model is able to generate high quality questions in over 62% of cases when prompted with human selected triggers. We release this dataset (and all code to reproduce baseline model results) to facilitate further research into realistic clinical QA and QG: https://github.com/elehman16/discq.
Published: 2022

19. Life after BERT: What do Other Muppets Understand about Language?

Author: Lialin, Vladislav, Zhao, Kevin, Shivagunde, Namrata, and Rumshisky, Anna
Subjects: Computer Science - Computation and Language
Abstract: Existing pre-trained transformer analysis works usually focus only on one or two model families at a time, overlooking the variability of the architecture and pre-training objectives. In our work, we utilize the oLMpics benchmark and psycholinguistic probing datasets for a diverse set of 29 models including T5, BART, and ALBERT. Additionally, we adapt the oLMpics zero-shot setup for autoregressive models and evaluate GPT networks of different sizes. Our findings show that none of these models can resolve compositional questions in a zero-shot fashion, suggesting that this skill is not learnable using existing pre-training objectives. Furthermore, we find that global model decisions such as architecture, directionality, size of the dataset, and pre-training objective are not predictive of a model's linguistic capabilities.
Published: 2022

20. Down and Across: Introducing Crossword-Solving as a New NLP Benchmark

Author: Kulshreshtha, Saurabh, Kovaleva, Olga, Shivagunde, Namrata, and Rumshisky, Anna
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Solving crossword puzzles requires diverse reasoning capabilities, access to a vast amount of knowledge about language and the world, and the ability to satisfy the constraints imposed by the structure of the puzzle. In this work, we introduce solving crossword puzzles as a new natural language understanding task. We release the specification of a corpus of crossword puzzles collected from the New York Times daily crossword spanning 25 years and comprised of a total of around nine thousand puzzles. These puzzles include a diverse set of clues: historic, factual, word meaning, synonyms/antonyms, fill-in-the-blank, abbreviations, prefixes/suffixes, wordplay, and cross-lingual, as well as clues that depend on the answers to other clues. We separately release the clue-answer pairs from these puzzles as an open-domain question answering dataset containing over half a million unique clue-answer pairs. For the question answering task, our baselines include several sequence-to-sequence and retrieval-based generative models. We also introduce a non-parametric constraint satisfaction baseline for solving the entire crossword puzzle. Finally, we propose an evaluation framework which consists of several complementary performance metrics.
Comment: Accepted as long paper at ACL 2022
Published: 2022

21. Federated Learning with Noisy User Feedback

Author: Sharma, Rahul, Ramakrishna, Anil, MacLaughlin, Ansel, Rumshisky, Anna, Majmudar, Jimit, Chung, Clement, Avestimehr, Salman, and Gupta, Rahul
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Human-Computer Interaction
Abstract: Machine Learning (ML) systems are getting increasingly popular, and drive more and more applications and services in our daily life. This has led to growing concerns over user privacy, since human interaction data typically needs to be transmitted to the cloud in order to train and improve such systems. Federated learning (FL) has recently emerged as a method for training ML models on edge devices using sensitive user data and is seen as a way to mitigate concerns over data privacy. However, since ML models are most commonly trained with label supervision, we need a way to extract labels on edge to make FL viable. In this work, we propose a strategy for training FL models using positive and negative user feedback. We also design a novel framework to study different noise patterns in user feedback, and explore how well standard noise-robust objectives can help mitigate this noise when training models in a federated setting. We evaluate our proposed training setup through detailed experiments on two text classification datasets and analyze the effects of varying levels of user reliability and feedback noise on model performance. We show that our method improves substantially over a self-training baseline, achieving performance closer to models trained with full supervision.
Comment: Accepted to appear in NAACL 2022
Published: 2022

22. ReLoRA: High-Rank Training Through Low-Rank Updates.

Author: Vladislav Lialin, Sherin Muckatira, Namrata Shivagunde, and Anna Rumshisky
Published: 2024

23. Multi-Stream Transformers

Author: Burtsev, Mikhail and Rumshisky, Anna
Subjects: Computer Science - Computation and Language, Computer Science - Neural and Evolutionary Computing
Abstract: Transformer-based encoder-decoder models produce a fused token-wise representation after every encoder layer. We investigate the effects of allowing the encoder to preserve and explore alternative hypotheses, combined at the end of the encoding process. To that end, we design and examine a Multi-stream Transformer architecture and find that splitting the Transformer encoder into multiple encoder streams and allowing the model to merge multiple representational hypotheses improves performance, with further improvement obtained by adding a skip connection between the first and the final encoder layer.
Published: 2021

24. An Efficient DP-SGD Mechanism for Large Scale NLP Models

Author: Dupuy, Christophe, Arava, Radhika, Gupta, Rahul, and Rumshisky, Anna
Subjects: Computer Science - Computation and Language, Computer Science - Cryptography and Security, Computer Science - Machine Learning
Abstract: Recent advances in deep learning have drastically improved performance on many Natural Language Understanding (NLU) tasks. However, the data used to train NLU models may contain private information such as addresses or phone numbers, particularly when drawn from human subjects. It is desirable that underlying models do not expose private information contained in the training data. Differentially Private Stochastic Gradient Descent (DP-SGD) has been proposed as a mechanism to build privacy-preserving models. However, DP-SGD can be prohibitively slow to train. In this work, we propose a more efficient DP-SGD for training using a GPU infrastructure and apply it to fine-tuning models based on LSTM and transformer architectures. We report faster training times, alongside accuracy, theoretical privacy guarantees, and the success of membership inference attacks for our models, and observe that fine-tuning with the proposed variant of DP-SGD can yield competitive models without significant degradation in training time and with improvement in privacy protection. We also observe that looser theoretical ε, δ values can translate into significant practical privacy gains.
Published: 2021

25. BERT Busters: Outlier Dimensions that Disrupt Transformers

Author: Kovaleva, Olga, Kulshreshtha, Saurabh, Rogers, Anna, and Rumshisky, Anna
Subjects: Computer Science - Computation and Language
Abstract: Multiple studies have shown that Transformers are remarkably robust to pruning. Contrary to this received wisdom, we demonstrate that pre-trained Transformer encoders are surprisingly fragile to the removal of a very small number of features in the layer outputs (<0.0001% of model weights). In case of BERT and other pre-trained encoder Transformers, the affected component is the scaling factors and biases in the LayerNorm. The outliers are high-magnitude normalization parameters that emerge early in pre-training and show up consistently in the same dimensional position throughout the model. We show that disabling them significantly degrades both the MLM loss and the downstream task performance. This effect is observed across several BERT-family models and other popular pre-trained Transformer architectures, including BART, XLNet and ELECTRA; we also show a similar effect in GPT-2.
Comment: Accepted as long paper at Findings of ACL 2021
Published: 2021

26. Larger Probes Tell a Different Story: Extending Psycholinguistic Datasets Via In-Context Learning.

Author: Namrata Shivagunde, Vladislav Lialin, and Anna Rumshisky
Published: 2023

27. Sampling bias in NLU models: Impact and Mitigation.

Author: Zefei Li, Anil Ramakrishna, Anna Rumshisky, Andy Rosenbaum, Saleh Soltan, and Rahul Gupta
Published: 2023

28. Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale.

Author: Vijeta Deshpande, Dan Pechi, Shree Thatte, Vladislav Lialin, and Anna Rumshisky
Published: 2023

29. Recipes for Sequential Pre-training of Multilingual Encoder and Seq2Seq Models.

Author: Saleh Soltan, Andy Rosenbaum, Tobias Falke, Qin Lu, Anna Rumshisky, and Wael Hamza
Published: 2023

30. Self-Healing Through Error Detection, Attribution, and Retraining.

Author: Ansel MacLaughlin, Anna Rumshisky, Rinat Khaziev, Anil Ramakrishna, Yuval Merhav, and Rahul Gupta
Published: 2023

31. Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data.

Author: Vladislav Lialin, Stephen Rawls, David Chan, Shalini Ghosh, Anna Rumshisky, and Wael Hamza
Published: 2023

32. Update Frequently, Update Fast: Retraining Semantic Parsing Systems in a Fraction of Time

Author: Lialin, Vladislav, Goel, Rahul, Simanovsky, Andrey, Rumshisky, Anna, and Shah, Rushin
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Semantic parsing systems currently deployed in voice assistants can require weeks to train. Datasets for these models often receive small and frequent updates, called data patches. Each patch requires training a new model. To reduce training time, one can fine-tune the previously trained model on each patch, but naive fine-tuning exhibits catastrophic forgetting - degradation of the model performance on the data not represented in the data patch. In this work, we propose a simple method that alleviates catastrophic forgetting and show that it is possible to match the performance of a model trained from scratch in less than 10% of the time via fine-tuning. The key to achieving this is supersampling and EWC regularization. We demonstrate the effectiveness of our method on multiple splits of the Facebook TOP and SNIPS datasets.
Published: 2020

33. When BERT Plays the Lottery, All Tickets Are Winning

Author: Prasanna, Sai, Rogers, Anna, and Rumshisky, Anna
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Large Transformer-based models were shown to be reducible to a smaller number of self-attention heads and layers. We consider this phenomenon from the perspective of the lottery ticket hypothesis, using both structured and magnitude pruning. For fine-tuned BERT, we show that (a) it is possible to find subnetworks achieving performance that is comparable with that of the full model, and (b) similarly-sized subnetworks sampled from the rest of the model perform worse. Strikingly, with structured pruning even the worst possible subnetworks remain highly trainable, indicating that most pre-trained BERT weights are potentially useful. We also study the "good" subnetworks to see if their success can be attributed to superior linguistic knowledge, but find them unstable, and not explained by meaningful self-attention patterns.
Comment: EMNLP 2020 camera-ready
Published: 2020

34. A Primer in BERTology: What we know about how BERT works

Author: Rogers, Anna, Kovaleva, Olga, and Rumshisky, Anna
Subjects: Computer Science - Computation and Language
Abstract: Transformer-based models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue and approaches to compression. We then outline directions for future research.
Comment: Accepted to TACL. Please note that the multilingual BERT section is only available in version 1
Published: 2020

35. Memory-Augmented Recurrent Networks for Dialogue Coherence

Author: Donahue, David, Meng, Yuanliang, and Rumshisky, Anna
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Recent dialogue approaches operate by reading each word in a conversation history, and aggregating accrued dialogue information into a single state. This fixed-size vector is not expandable and must maintain a consistent format over time. Other recent approaches exploit an attention mechanism to extract useful information from past conversational utterances, but this introduces an increased computational complexity. In this work, we explore the use of the Neural Turing Machine (NTM) to provide a more permanent and flexible storage mechanism for maintaining dialogue coherence. Specifically, we introduce two separate dialogue architectures based on this NTM design. The first design features a sequence-to-sequence architecture with two separate NTM modules, one for each participant in the conversation. The second memory architecture incorporates a single NTM module, which stores parallel context information for both speakers. This second design also replaces the sequence-to-sequence architecture with a neural language model, to allow for longer context of the NTM and greater understanding of the dialogue history. We report perplexity performance for both models, and compare them to existing baselines.
Comment: Honors project, 12 pages
Published: 2019

36. Injecting Hierarchy with U-Net Transformers

Author: Donahue, David, Lialin, Vladislav, and Rumshisky, Anna
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Statistics - Machine Learning
Abstract: The Transformer architecture has become increasingly popular over the past two years, owing to its impressive performance on a number of natural language processing (NLP) tasks. However, all Transformer computations occur at the level of word representations and therefore, it may be argued that Transformer models do not explicitly attempt to learn hierarchical structure which is widely assumed to be integral to language. In the present work, we introduce hierarchical processing into the Transformer model, taking inspiration from the U-Net architecture, popular in computer vision for its hierarchical view of natural images. We empirically demonstrate that the proposed architecture outperforms both the vanilla Transformer and some strong baselines in the domain of chit-chat dialogue.
Comment: 10 pages
Published: 2019

37. NarrativeTime: Dense Temporal Annotation on a Timeline

Author: Rogers, Anna, Karpinska, Marzena, Gupta, Ankita, Lialin, Vladislav, Smelkov, Gregory, and Rumshisky, Anna
Subjects: Computer Science - Computation and Language
Abstract: For the past decade, temporal annotation has been sparse: only a small portion of event pairs in a text was annotated. We present NarrativeTime, the first timeline-based annotation framework that achieves full coverage of all possible TLinks. To compare with the previous SOTA in dense temporal annotation, we perform a full re-annotation of the TimeBankDense corpus, which shows comparable agreement with a significant increase in density. We contribute the TimeBankNT corpus (with each text fully annotated by two expert annotators), extensive annotation guidelines, open-source tools for annotation and conversion to TimeML format, baseline results, as well as quantitative and qualitative analysis of inter-annotator agreement.
Published: 2019

38. Solving Math Word Problems with Double-Decoder Transformer

Author: Meng, Yuanliang and Rumshisky, Anna
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language
Abstract: This paper proposes a Transformer-based model to generate equations for math word problems. It achieves much better results than RNN models when copy and align mechanisms are not used, and can outperform complex copy and align RNN models. We also show that training a Transformer jointly in a generation task with two decoders, left-to-right and right-to-left, is beneficial. Such a Transformer performs better than the one with just one decoder not only because of the ensemble effect, but also because it improves the encoder training procedure. We also experiment with adding reinforcement learning to our model, showing improved performance compared to MLE training.
Published: 2019

39. Revealing the Dark Secrets of BERT

Author: Kovaleva, Olga, Romanov, Alexey, Rogers, Anna, and Rumshisky, Anna
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: BERT-based architectures currently give state-of-the-art performance on many NLP tasks, but little is known about the exact mechanisms that contribute to their success. In the current work, we focus on the interpretation of self-attention, which is one of the fundamental underlying components of BERT. Using a subset of GLUE tasks and a set of handcrafted features-of-interest, we propose the methodology and carry out a qualitative and quantitative analysis of the information encoded by BERT's individual heads. Our findings suggest that there is a limited set of attention patterns that are repeated across different heads, indicating the overall model overparametrization. While different heads consistently use the same attention patterns, they have varying impact on performance across different tasks. We show that manually disabling attention in certain heads leads to a performance improvement over the regular fine-tuned BERT models.
Comment: Accepted to EMNLP 2019
Published: 2019

40. What's in a Name? Reducing Bias in Bios without Access to Protected Attributes

Author: Romanov, Alexey, De-Arteaga, Maria, Wallach, Hanna, Chayes, Jennifer, Borgs, Christian, Chouldechova, Alexandra, Geyik, Sahin, Kenthapadi, Krishnaram, Rumshisky, Anna, and Kalai, Adam Tauman
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Statistics - Machine Learning
Abstract: There is a growing body of work that proposes methods for mitigating bias in machine learning systems. These methods typically rely on access to protected attributes such as race, gender, or age. However, this raises two significant challenges: (1) protected attributes may not be available or it may not be legal to use them, and (2) it is often desirable to simultaneously consider multiple protected attributes, as well as their intersections. In the context of mitigating bias in occupation classification, we propose a method for discouraging correlation between the predicted probability of an individual's true occupation and a word embedding of their name. This method leverages the societal biases that are encoded in word embeddings, eliminating the need for access to protected attributes. Crucially, it only requires access to individuals' names at training time and not at deployment time. We evaluate two variations of our proposed method using a large-scale dataset of online biographies. We find that both variations simultaneously reduce race and gender biases, with almost no reduction in the classifier's overall true positive rate.
Comment: Accepted at NAACL 2019; Best Thematic Paper
Published: 2019

41. Chasing the Tail with Domain Generalization: A Case Study on Frequency-Enriched Datasets.

Author: Manoj Kumar, Anna Rumshisky, and Rahul Gupta
Published: 2022

42. Controlled Data Generation via Insertion Operations for NLU.

Author: Manoj Kumar, Yuval Merhav, Haidar Khan, Rahul Gupta, Anna Rumshisky, and Wael Hamza
Published: 2022

43. Federated Learning with Noisy User Feedback.

Author: Rahul Sharma, Anil Ramakrishna, Ansel MacLaughlin, Anna Rumshisky, Jimit Majmudar, Clement Chung, Salman Avestimehr, and Rahul Gupta
Published: 2022

44. Life after BERT: What do Other Muppets Understand about Language?

Author: Vladislav Lialin, Kevin Zhao, Namrata Shivagunde, and Anna Rumshisky
Published: 2022

45. Down and Across: Introducing Crossword-Solving as a New NLP Benchmark.

Author: Saurabh Kulshreshtha, Olga Kovaleva, Namrata Shivagunde, and Anna Rumshisky
Published: 2022

46. An Efficient DP-SGD Mechanism for Large Scale NLU Models.

Author: Christophe Dupuy, Radhika Arava, Rahul Gupta, and Anna Rumshisky
Published: 2022

47. Adversarial Text Generation Without Reinforcement Learning

Author: Donahue, David and Rumshisky, Anna
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Generative Adversarial Networks (GANs) have experienced a recent surge in popularity, performing competitively in a variety of tasks, especially in computer vision. However, GAN training has shown limited success in natural language processing. This is largely because sequences of text are discrete, and thus gradients cannot propagate from the discriminator to the generator. Recent solutions use reinforcement learning to propagate approximate gradients to the generator, but this is inefficient to train. We propose to utilize an autoencoder to learn a low-dimensional representation of sentences. A GAN is then trained to generate its own vectors in this space, which decode to realistic utterances. We report both random and interpolated samples from the generator. Visualization of sentence vectors indicates our model correctly learns the latent space of the autoencoder. Both human ratings and BLEU scores show that our model generates realistic text against competitive baselines.
Comment: Four pages without references. ACL latex style. Four figures
Published: 2018

48. Triad-based Neural Network for Coreference Resolution

Author: Meng, Yuanliang and Rumshisky, Anna
Subjects: Computer Science - Information Retrieval, Computer Science - Computation and Language
Abstract: We propose a triad-based neural network system that generates affinity scores between entity mentions for coreference resolution. The system simultaneously accepts three mentions as input, taking mutual dependency and logical constraints of all three mentions into account, and thus makes more accurate predictions than the traditional pairwise approach. Depending on system choices, the affinity scores can be further used in clustering or mention ranking. Our experiments show that a standard hierarchical clustering using the scores produces state-of-the-art results with gold mentions on the English portion of the CoNLL 2012 Shared Task. The model does not rely on many handcrafted features and is easy to train and use. The triads can also be easily extended to polyads of higher orders. To our knowledge, this is the first neural network system to model mutual dependency of more than two members at mention level.
Published: 2018

49. Adversarial Decomposition of Text Representation

Author: Romanov, Alexey, Rumshisky, Anna, Rogers, Anna, and Donahue, David
Subjects: Computer Science - Computation and Language
Abstract: In this paper, we present a method for adversarial decomposition of text representation. This method can be used to decompose a representation of an input sentence into several independent vectors, each of them responsible for a specific aspect of the input sentence. We evaluate the proposed method on two case studies: the conversion between different social registers and diachronic language change. We show that the proposed method is capable of fine-grained controlled change of these aspects of the input sentence. It is also learning a continuous (rather than categorical) representation of the style of the sentence, which is more linguistically realistic. The model uses adversarial-motivational training and includes a special motivational loss, which acts opposite to the discriminator and encourages a better decomposition. Furthermore, we evaluate the obtained meaning embeddings on a downstream task of paraphrase detection and show that they significantly outperform the embeddings of a regular autoencoder.
Comment: Accepted at NAACL 2019
Published: 2018

50. CliNER 2.0: Accessible and Accurate Clinical Concept Extraction

Author: Boag, Willie, Sergeeva, Elena, Kulshreshtha, Saurabh, Szolovits, Peter, Rumshisky, Anna, and Naumann, Tristan
Subjects: Computer Science - Computation and Language
Abstract: Clinical notes often describe important aspects of a patient's stay and are therefore critical to medical research. Clinical concept extraction (CCE) of named entities - such as problems, tests, and treatments - aids in forming an understanding of notes and provides a foundation for many downstream clinical decision-making tasks. Historically, this task has been posed as a standard named entity recognition (NER) sequence tagging problem, and solved with feature-based methods using hand-engineered domain knowledge. Recent advances, however, have demonstrated the efficacy of LSTM-based models for NER tasks, including CCE. This work presents CliNER 2.0, a simple-to-install, open-source tool for extracting concepts from clinical text. CliNER 2.0 uses a word- and character-level LSTM model, and achieves state-of-the-art performance. For ease of use, the tool also includes pre-trained models available for public use.
Published: 2018
