Author: "Choudhury, Monojit" / Publication Type: Electronic Resources - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Choudhury, Monojit"' showing total 51 results

1. Towards Measuring and Modeling 'Culture' in LLMs: A Survey

Author: Adilazuarda, Muhammad Farid, Mukherjee, Sagnik, Lavania, Pradhyumna, Singh, Siddhant, Dwivedi, Ashutosh, Aji, Alham Fikri, O'Neill, Jacki, Modi, Ashutosh, and Choudhury, Monojit
Abstract: We present a survey of 39 recent papers that aim to study cultural representation and inclusion in large language models. We observe that none of the studies define "culture," which is a complex, multifaceted concept; instead, they probe the models on some specially designed datasets which represent certain aspects of "culture." We call these aspects the proxies of cultures, and organize them across three dimensions of demographic, semantic and linguistic-cultural interaction proxies. We also categorize the probing methods employed. Our analysis indicates that only certain aspects of "culture," such as values and objectives, have been studied, leaving several other interesting and important facets, especially the multitude of semantic domains (Thompson et al., 2020) and aboutness (Hershcovich et al., 2022), unexplored. Two other crucial gaps are the lack of robustness and situatedness of the current methods. Based on these observations, we provide several recommendations for a holistic and practically useful research agenda for furthering cultural inclusion in LLMs and LLM-based applications.
Published: 2024

2. Do Moral Judgment and Reasoning Capability of LLMs Change with Language? A Study using the Multilingual Defining Issues Test

Author: Khandelwal, Aditi, Agarwal, Utkarsh, Tanmay, Kumar, and Choudhury, Monojit
Abstract: This paper explores the moral judgment and moral reasoning abilities exhibited by Large Language Models (LLMs) across languages through the Defining Issues Test. It is a well-known fact that moral judgment depends on the language in which the question is asked. We extend earlier work beyond English to 5 new languages (Chinese, Hindi, Russian, Spanish and Swahili), and probe three LLMs -- ChatGPT, GPT-4 and Llama2Chat-70B -- that show substantial multilingual text processing and generation abilities. Our study shows that the moral reasoning ability for all models, as indicated by the post-conventional score, is substantially inferior for Hindi and Swahili, compared to Spanish, Russian, Chinese and English, while there is no clear trend for the performance of the latter four languages. The moral judgments, too, vary considerably by language.
Comment: Accepted to EACL 2024 (main)
Published: 2024

3. Benchmark Underestimates the Readiness of Multi-lingual Dialogue Agents

Author: Lee, Andrew H., Semnani, Sina J., Castillo-López, Galo, de Chalendar, Gaël, Choudhury, Monojit, Dua, Ashna, Kavitha, Kapil Rajesh, Kim, Sungkyun, Kodali, Prashant, Kumaraguru, Ponnurangam, Lombard, Alexis, Moradshahi, Mehrad, Park, Gihyun, Semmar, Nasredine, Seo, Jiwon, Shen, Tianhao, Shrivastava, Manish, Xiong, Deyi, and Lam, Monica S.
Abstract: Creating multilingual task-oriented dialogue (TOD) agents is challenging due to the high cost of training data acquisition. Following the research trend of improving training data efficiency, we show, for the first time, that in-context learning is sufficient to tackle multilingual TOD. To handle the challenging dialogue state tracking (DST) subtask, we break it down into simpler steps that are more compatible with in-context learning, where only a handful of few-shot examples are used. We test our approach on the multilingual TOD dataset X-RiSAWOZ, which has 12 domains in Chinese, English, French, Korean, Hindi, and code-mixed Hindi-English. Our turn-by-turn DST accuracy on the 6 languages ranges from 55.6% to 80.3%, seemingly worse than the SOTA results from fine-tuned models, which achieve from 60.7% to 82.8%; our BLEU scores in the response generation (RG) subtask are also significantly lower than SOTA. However, after manual evaluation of the validation set, we find that by correcting gold label errors and improving the dataset annotation schema, GPT-4 with our prompts can achieve (1) 89.6%-96.8% accuracy in DST, and (2) more than 99% correct response generation across different languages. This leads us to conclude that current automatic metrics heavily underestimate the effectiveness of in-context learning.
Published: 2024

4. From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences

Author: Kodali, Prashant, Goel, Anmol, Asapu, Likhith, Bonagiri, Vamshi Krishna, Govil, Anirudh, Choudhury, Monojit, Shrivastava, Manish, and Kumaraguru, Ponnurangam
Abstract: Current computational approaches for analysing or generating code-mixed sentences do not explicitly model "naturalness" or "acceptability" of code-mixed sentences, but rely on training corpora to reflect the distribution of acceptable code-mixed sentences. Modelling human judgement for the acceptability of code-mixed text can help in distinguishing natural code-mixed text and enable quality-controlled generation of code-mixed text. To this end, we construct Cline - a dataset containing human acceptability judgements for English-Hindi (en-hi) code-mixed text. Cline is the largest of its kind with 16,642 sentences, with samples drawn from two sources: synthetically generated code-mixed text and samples collected from online social media. Our analysis establishes that popular code-mixing metrics such as CMI, Number of Switch Points, and Burstiness, which are used to filter/curate/compare code-mixed corpora, have low correlation with human acceptability judgements, underlining the necessity of our dataset. Experiments using Cline demonstrate that simple Multilayer Perceptron (MLP) models trained solely on code-mixing metrics are outperformed by fine-tuned pre-trained Multilingual Large Language Models (MLLMs). Specifically, XLM-Roberta and Bernice outperform IndicBERT across different configurations in challenging data settings. Comparison with ChatGPT's zero- and few-shot capabilities shows that MLLMs fine-tuned on larger data outperform ChatGPT, providing scope for improvement in code-mixed tasks. Zero-shot transfer from English-Hindi to English-Telugu acceptability judgments using our model checkpoints proves superior to random baselines, enabling application to other code-mixed language pairs and providing further avenues of research. We publicly release our human-annotated dataset, trained checkpoints, code-mix corpus, and code for data generation and model training.
Published: 2024

5. 'They are uncultured': Unveiling Covert Harms and Social Threats in LLM Generated Conversations

Author: Dammu, Preetam Prabhu Srikar, Jung, Hayoung, Singh, Anjali, Choudhury, Monojit, and Mitra, Tanushree
Abstract: Large language models (LLMs) have emerged as an integral part of modern societies, powering user-facing applications such as personal assistants and enterprise applications like recruitment tools. Despite their utility, research indicates that LLMs perpetuate systemic biases. Yet, prior works on LLM harms predominantly focus on Western concepts like race and gender, often overlooking cultural concepts from other parts of the world. Additionally, these studies typically investigate "harm" as a singular dimension, ignoring the various and subtle forms in which harms manifest. To address this gap, we introduce the Covert Harms and Social Threats (CHAST), a set of seven metrics grounded in social science literature. We utilize evaluation models aligned with human assessments to examine the presence of covert harms in LLM-generated conversations, particularly in the context of recruitment. Our experiments reveal that seven out of the eight LLMs included in this study generated conversations riddled with CHAST, characterized by malign views expressed in seemingly neutral language unlikely to be detected by existing methods. Notably, these LLMs manifested more extreme views and opinions when dealing with non-Western concepts like caste, compared to Western ones such as race.
Published: 2024

6. Ethical Reasoning and Moral Value Alignment of LLMs Depend on the Language we Prompt them in

Author: Agarwal, Utkarsh, Tanmay, Kumar, Khandelwal, Aditi, and Choudhury, Monojit
Abstract: Ethical reasoning is a crucial skill for Large Language Models (LLMs). However, moral values are not universal, but rather influenced by language and culture. This paper explores how three prominent LLMs -- GPT-4, ChatGPT, and Llama2-70B-Chat -- perform ethical reasoning in different languages and whether their moral judgement depends on the language in which they are prompted. We extend the study of ethical reasoning of LLMs by Rao et al. (2023) to a multilingual setup, following their framework of probing LLMs with ethical dilemmas and policies from three branches of normative ethics: deontology, virtue, and consequentialism. We experiment with six languages: English, Spanish, Russian, Chinese, Hindi, and Swahili. We find that GPT-4 is the most consistent and unbiased ethical reasoner across languages, while ChatGPT and Llama2-70B-Chat show significant moral value bias when we move to languages other than English. Interestingly, the nature of this bias varies significantly across languages for all LLMs, including GPT-4.
Published: 2024

7. DiTTO: A Feature Representation Imitation Approach for Improving Cross-Lingual Transfer

Author: Kumar, Shanu, Soujanya, Abbaraju, Dandapat, Sandipan, Sitaram, Sunayana, and Choudhury, Monojit
Abstract: Zero-shot cross-lingual transfer is promising but has been shown to be sub-optimal, with inferior transfer performance across low-resource languages. In this work, we envision languages as domains for improving zero-shot transfer by jointly reducing the feature incongruity between the source and the target language and increasing the generalization capabilities of pre-trained multilingual transformers. We show that our approach, DiTTO, significantly outperforms the standard zero-shot fine-tuning method on multiple datasets across all languages using solely unlabeled instances in the target language. Empirical results show that jointly reducing feature incongruity for multiple target languages is vital for successful cross-lingual transfer. Moreover, our model enables better cross-lingual transfer than standard fine-tuning methods, even in the few-shot setting.
Comment: Accepted at EACL 2023
Published: 2023

8. Fairness in Language Models Beyond English: Gaps and Challenges

Author: Ramesh, Krithika, Sitaram, Sunayana, and Choudhury, Monojit
Abstract: With language models becoming increasingly ubiquitous, it has become essential to address their inequitable treatment of diverse demographic groups and factors. Most research on evaluating and mitigating fairness harms has been concentrated on English, while multilingual models and non-English languages have received comparatively little attention. This paper presents a survey of fairness in multilingual and non-English contexts, highlighting the shortcomings of current research and the difficulties faced by methods designed for English. We contend that the multitude of diverse cultures and languages across the world makes it infeasible to achieve comprehensive coverage in terms of constructing fairness datasets. Thus, the measurement and mitigation of biases must evolve beyond the current dataset-driven practices that are narrowly focused on specific dimensions and types of biases and, therefore, impossible to scale across languages and cultures.
Comment: Accepted to EACL 2023 (Findings)
Published: 2023

9. Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks

Author: Rao, Abhinav, Vashistha, Sachin, Naik, Atharva, Aditya, Somak, and Choudhury, Monojit
Abstract: Recent explorations with commercial Large Language Models (LLMs) have shown that non-expert users can jailbreak LLMs by simply manipulating their prompts, resulting in degenerate output behavior, privacy and security breaches, offensive outputs, and violations of content regulator policies. Limited studies have been conducted to formalize and analyze these attacks and their mitigations. We bridge this gap by proposing a formalism and a taxonomy of known (and possible) jailbreaks. We survey existing jailbreak methods and their effectiveness on open-source and commercial LLMs (such as GPT-based models, OPT, BLOOM, and FLAN-T5-XXL). We further discuss the challenges of jailbreak detection in terms of their effectiveness against known attacks. For further analysis, we release a dataset of model outputs across 3700 jailbreak prompts over 4 tasks.
Comment: Accepted at LREC-COLING 2024 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Published: 2023

10. LLM-powered Data Augmentation for Enhanced Cross-lingual Performance

Author: Whitehouse, Chenxi, Choudhury, Monojit, and Aji, Alham Fikri
Abstract: This paper explores the potential of leveraging Large Language Models (LLMs) for data augmentation in multilingual commonsense reasoning datasets where the available training data is extremely limited. To achieve this, we utilise several LLMs, namely Dolly-v2, StableVicuna, ChatGPT, and GPT-4, to augment three datasets: XCOPA, XWinograd, and XStoryCloze. Subsequently, we evaluate the effectiveness of fine-tuning smaller multilingual models, mBERT and XLMR, using the synthesised data. We compare the performance of training with data generated in English and target languages, as well as translated English-generated data, revealing the overall advantages of incorporating data generated by LLMs, e.g. a notable 13.4 accuracy score improvement for the best case. Furthermore, we conduct a human evaluation by asking native speakers to assess the naturalness and logical coherence of the generated examples across different languages. The results of the evaluation indicate that LLMs such as ChatGPT and GPT-4 excel at producing natural and coherent text in most languages; however, they struggle to generate meaningful text in certain languages like Tamil. We also observe that ChatGPT falls short in generating plausible alternatives compared to the original dataset, whereas examples from GPT-4 exhibit competitive logical consistency.
Comment: EMNLP 2023 Main Conference
Published: 2023

11. DUBLIN -- Document Understanding By Language-Image Network

Author: Aggarwal, Kriti, Khandelwal, Aditi, Tanmay, Kumar, Khan, Owais Mohammed, Liu, Qiang, Choudhury, Monojit, Chauhan, Hardik Hansrajbhai, Som, Subhojit, Chaudhary, Vishrav, and Tiwary, Saurabh
Abstract: Visual document understanding is a complex task that involves analyzing both the text and the visual elements in document images. Existing models often rely on manual feature engineering or domain-specific pipelines, which limit their generalization ability across different document types and languages. In this paper, we propose DUBLIN, which is pretrained on web pages using three novel objectives: Masked Document Text Generation Task, Bounding Box Task, and Rendered Question Answering Task, that leverage both the spatial and semantic information in the document images. Our model achieves competitive or state-of-the-art results on several benchmarks, such as Web-Based Structural Reading Comprehension, Document Visual Question Answering, Key Information Extraction, Diagram Understanding, and Table Question Answering. In particular, we show that DUBLIN is the first pixel-based model to achieve an EM of 77.75 and F1 of 84.25 on the WebSRC dataset. We also show that our model outperforms the current pixel-based SOTA models on DocVQA, InfographicsVQA, OCR-VQA and AI2D datasets by 4.6%, 6.5%, 2.6% and 21%, respectively. We also achieve competitive performance on RVL-CDIP document classification. Moreover, we create new baselines for text-based datasets by rendering them as document images to promote research in this direction.
Published: 2023

12. Evaluating Large Language Models for Health-related Queries with Presuppositions

Author: Kaur, Navreet, Choudhury, Monojit, and Pruthi, Danish
Abstract: As corporations rush to integrate large language models (LLMs) into their search offerings, it is critical that they provide factually accurate information that is robust to any presuppositions that a user may express. In this work, we introduce UPHILL, a dataset consisting of health-related queries with varying degrees of presuppositions. Using UPHILL, we evaluate the factual accuracy and consistency of InstructGPT, ChatGPT, and BingChat models. We find that while model responses rarely disagree with true health claims (posed as questions), they often fail to challenge false claims: responses from InstructGPT agree with 32% of the false claims, ChatGPT 26% and BingChat 23%. As we increase the extent of presupposition in input queries, the responses from InstructGPT and ChatGPT agree with the claim considerably more often, regardless of its veracity. Responses from BingChat, which rely on retrieved webpages, are not as susceptible. Given the moderate factual accuracy, and the inability of models to consistently correct false assumptions, our work calls for a careful assessment of current LLMs for use in high-stakes scenarios.
Comment: Findings of ACL 2024
Published: 2023

13. Ethical Reasoning over Moral Alignment: A Case and Framework for In-Context Ethical Policies in LLMs

Author: Rao, Abhinav, Khandelwal, Aditi, Tanmay, Kumar, Agarwal, Utkarsh, and Choudhury, Monojit
Abstract: In this position paper, we argue that instead of morally aligning LLMs to a specific set of ethical principles, we should infuse generic ethical reasoning capabilities into them so that they can handle value pluralism at a global scale. When provided with an ethical policy, an LLM should be capable of making decisions that are ethically consistent with the policy. We develop a framework that integrates moral dilemmas with moral principles pertaining to different formalisms of normative ethics, and at different levels of abstraction. Initial experiments with GPT-x models show that while GPT-4 is a nearly perfect ethical reasoner, the models still have bias towards the moral values of Western and English-speaking societies.
Published: 2023

14. Probing the Moral Development of Large Language Models through Defining Issues Test

Author: Tanmay, Kumar, Khandelwal, Aditi, Agarwal, Utkarsh, and Choudhury, Monojit
Abstract: In this study, we measure the moral reasoning ability of LLMs using the Defining Issues Test - a psychometric instrument developed for measuring the moral development stage of a person according to Kohlberg's Cognitive Moral Development Model. DIT uses moral dilemmas followed by a set of ethical considerations that the respondent has to judge for importance in resolving the dilemma, and then rank-order them by importance. A moral development stage score of the respondent is then computed based on the relevance rating and ranking. Our study shows that early LLMs such as GPT-3 exhibit a moral reasoning ability no better than that of a random baseline, while ChatGPT, Llama2-Chat, PaLM-2 and GPT-4 show significantly better performance on this task, comparable to adult humans. GPT-4, in fact, has the highest post-conventional moral reasoning score, equivalent to that of typical graduate school students. However, we also observe that the models do not perform consistently across all dilemmas, pointing to important gaps in their understanding and reasoning abilities.
Comment: First three authors contributed equally
Published: 2023

15. Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?

Author: Hada, Rishav, Gumma, Varun, de Wynter, Adrian, Diddee, Harshita, Ahmed, Mohamed, Choudhury, Monojit, Bali, Kalika, and Sitaram, Sunayana
Abstract: Large Language Models (LLMs) excel in various Natural Language Processing (NLP) tasks, yet their evaluation, particularly in languages beyond the top 20, remains inadequate due to the limitations of existing benchmarks and metrics. Employing LLMs as evaluators to rank or score other models' outputs emerges as a viable solution, addressing the constraints tied to human annotators and established benchmarks. In this study, we explore the potential of LLM-based evaluators, specifically GPT-4, in enhancing multilingual evaluation by calibrating them against 20K human judgments across three text-generation tasks, five metrics, and eight languages. Our analysis reveals a bias in GPT4-based evaluators towards higher scores, underscoring the necessity of calibration with native speaker judgments, especially in low-resource and non-Latin script languages, to ensure accurate evaluation of LLM performance across diverse languages.
Comment: Accepted to EACL 2024 findings
Published: 2023

16. X-RiSAWOZ: High-Quality End-to-End Multilingual Dialogue Datasets and Few-shot Agents

Author: Moradshahi, Mehrad, Shen, Tianhao, Bali, Kalika, Choudhury, Monojit, de Chalendar, Gaël, Goel, Anmol, Kim, Sungkyun, Kodali, Prashant, Kumaraguru, Ponnurangam, Semmar, Nasredine, Semnani, Sina J., Seo, Jiwon, Seshadri, Vivek, Shrivastava, Manish, Sun, Michael, Yadavalli, Aditya, You, Chaobin, Xiong, Deyi, and Lam, Monica S.
Abstract: Task-oriented dialogue research has mainly focused on a few popular languages like English and Chinese, due to the high dataset creation cost for a new language. To reduce the cost, we apply manual editing to automatically translated data. We create a new multilingual benchmark, X-RiSAWOZ, by translating the Chinese RiSAWOZ to 4 languages: English, French, Hindi, Korean; and a code-mixed English-Hindi language. X-RiSAWOZ has more than 18,000 human-verified dialogue utterances for each language, and unlike most multilingual prior work, is an end-to-end dataset for building fully-functioning agents. The many difficulties we encountered in creating X-RiSAWOZ led us to develop a toolset to accelerate the post-editing of a new language dataset after translation. This toolset improves machine translation with a hybrid entity alignment technique that combines neural with dictionary-based methods, along with many automated and semi-automated validation checks. We establish strong baselines for X-RiSAWOZ by training dialogue agents in the zero- and few-shot settings where limited gold data is available in the target language. Our results suggest that our translation and post-editing methodology and toolset can be used to create new high-quality multilingual dialogue agents cost-effectively. Our dataset, code, and toolkit are released open-source.
Comment: Accepted by ACL 2023 Findings
Published: 2023

17. Too Brittle To Touch: Comparing the Stability of Quantization and Distillation Towards Developing Lightweight Low-Resource MT Models

Author: Diddee, Harshita, Dandapat, Sandipan, Choudhury, Monojit, Ganu, Tanuja, and Bali, Kalika
Abstract: Leveraging shared learning through Massively Multilingual Models, state-of-the-art machine translation models are often able to adapt to the paucity of data for low-resource languages. However, this performance comes at the cost of significantly bloated models which are not practically deployable. Knowledge Distillation is one popular technique to develop competitive, lightweight models. In this work, we first evaluate its use to compress MT models focusing on languages with extremely limited training data. Through our analysis across 8 languages, we find that the variance in the performance of the distilled models, due to their dependence on priors including the amount of synthetic data used for distillation, the student architecture, training hyperparameters and confidence of the teacher models, makes distillation a brittle compression mechanism. To mitigate this, we explore the use of post-training quantization for the compression of these models. Here, we find that while distillation provides gains across some low-resource languages, quantization provides more consistent performance trends for the entire range of languages, especially the lowest-resource languages in our target set.
Comment: 16 Pages, 7 Figures, Accepted to WMT 2022 (Research Track)
Published: 2022

18. On the Calibration of Massively Multilingual Language Models

Author: Ahuja, Kabir, Sitaram, Sunayana, Dandapat, Sandipan, and Choudhury, Monojit
Abstract: Massively Multilingual Language Models (MMLMs) have recently gained popularity due to their surprising effectiveness in cross-lingual transfer. While there has been much work in evaluating these models for their performance on a variety of tasks and languages, little attention has been paid to how well calibrated these models are with respect to the confidence in their predictions. We first investigate the calibration of MMLMs in the zero-shot setting and observe a clear case of miscalibration in low-resource languages or those which are typologically diverse from English. Next, we empirically show that calibration methods like temperature scaling and label smoothing do reasonably well towards improving calibration in the zero-shot scenario. We also find that few-shot examples in the language can further help reduce the calibration errors, often substantially. Overall, our work contributes towards building more reliable multilingual models by highlighting the issue of their miscalibration, understanding what language and model specific factors influence it, and pointing out the strategies to improve the same.
Comment: EMNLP 2022
Published: 2022

19. Generating Intermediate Steps for NLI with Next-Step Supervision

Author: Ghosal, Deepanway, Aditya, Somak, and Choudhury, Monojit
Abstract: The Natural Language Inference (NLI) task often requires reasoning over multiple steps to reach the conclusion. While the necessity of generating such intermediate steps (instead of a summary explanation) has gained popular support, it is unclear how to generate such steps without complete end-to-end supervision and how such generated steps can be further utilized. In this work, we train a sequence-to-sequence model to generate only the next step given an NLI premise and hypothesis pair (and previous steps); then enhance it with external knowledge and symbolic search to generate intermediate steps with only next-step supervision. We show the correctness of such generated steps through automated and human verification. Furthermore, we show that such generated steps can help improve end-to-end NLI task performance using simple data augmentation strategies, across multiple public NLI datasets.
Published: 2022

20. 'Diversity and Uncertainty in Moderation' are the Key to Data Selection for Multilingual Few-shot Transfer

Author: Kumar, Shanu, Dandapat, Sandipan, and Choudhury, Monojit
Abstract: Few-shot transfer often shows substantial gain over zero-shot transfer (Lauscher et al., 2020), which is a practically useful trade-off between fully supervised and unsupervised learning approaches for multilingual pretrained model-based systems. This paper explores various strategies for selecting data for annotation that can result in a better few-shot transfer. The proposed approaches rely on multiple measures such as data entropy using an n-gram language model, predictive entropy, and gradient embedding. We propose a loss embedding method for sequence labeling tasks, which induces diversity and uncertainty sampling similar to gradient embedding. The proposed data selection strategies are evaluated and compared for POS tagging, NER, and NLI tasks for up to 20 languages. Our experiments show that the gradient and loss embedding-based strategies consistently outperform random data selection baselines, with gains varying with the initial performance of the zero-shot transfer. Furthermore, the proposed method shows similar trends in improvement even when the model is fine-tuned using a lower proportion of the original task-specific labeled training data for zero-shot transfer.
Comment: NAACL 2022
Published: 2022

21. On the Economics of Multilingual Few-shot Learning: Modeling the Cost-Performance Trade-offs of Machine Translated and Manual Data

Author: Ahuja, Kabir, Choudhury, Monojit, and Dandapat, Sandipan
Abstract: Borrowing ideas from production functions in micro-economics, in this paper we introduce a framework to systematically evaluate the performance and cost trade-offs between machine-translated and manually-created labelled data for task-specific fine-tuning of massively multilingual language models. We illustrate the effectiveness of our framework through a case-study on the TyDIQA-GoldP dataset. One of the interesting conclusions of the study is that if the cost of machine translation is greater than zero, the optimal performance at least cost is always achieved with at least some or only manually-created data. To our knowledge, this is the first attempt towards extending the concept of production functions to study data collection strategies for training multilingual models, and can serve as a valuable tool for other similar cost vs data trade-offs in NLP.
Comment: NAACL 2022
Published: 2022

22. Beyond Static Models and Test Sets: Benchmarking the Potential of Pre-trained Models Across Tasks and Languages

Author: Ahuja, Kabir, Dandapat, Sandipan, Sitaram, Sunayana, and Choudhury, Monojit
Abstract: Although recent Massively Multilingual Language Models (MMLMs) like mBERT and XLMR support around 100 languages, most existing multilingual NLP benchmarks provide evaluation data in only a handful of these languages with little linguistic diversity. We argue that this makes the existing practices in multilingual evaluation unreliable and does not provide a full picture of the performance of MMLMs across the linguistic landscape. We propose that the recent work done in Performance Prediction for NLP tasks can serve as a potential solution in fixing benchmarking in Multilingual NLP by utilizing features related to data and language typology to estimate the performance of an MMLM on different languages. We compare performance prediction with translating test data in a case study on four different multilingual datasets, and observe that these methods can provide reliable estimates of the performance that are often on par with the translation-based approaches, without incurring any additional translation or evaluation costs.
Comment: NLP Power! Workshop, ACL 2022
Published: 2022

23. Multi Task Learning For Zero Shot Performance Prediction of Multilingual Models

Author: Ahuja, Kabir, Kumar, Shanu, Dandapat, Sandipan, and Choudhury, Monojit
Abstract: Massively Multilingual Transformer based Language Models have been observed to be surprisingly effective on zero-shot transfer across languages, though the performance varies from language to language depending on the pivot language(s) used for fine-tuning. In this work, we build upon some of the existing techniques for predicting the zero-shot performance on a task, by modeling it as a multi-task learning problem. We jointly train predictive models for different tasks which helps us build more accurate predictors for tasks where we have test data in very few languages to measure the actual performance of the model. Our approach also lends us the ability to perform a much more robust feature selection and identify a common set of features that influence zero-shot performance across a variety of tasks., Comment: ACL 2022
Published: 2022

24. Global Readiness of Language Technology for Healthcare: What would it Take to Combat the Next Pandemic?

Author: Mondal, Ishani, Ahuja, Kabir, Jain, Mohit, Neil, Jacki O, Bali, Kalika, and Choudhury, Monojit
Abstract: The COVID-19 pandemic has brought out both the best and worst of language technology (LT). On one hand, conversational agents for information dissemination and basic diagnosis have seen widespread use, and arguably, had an important role in combating the pandemic. On the other hand, it has also become clear that such technologies are readily available for a handful of languages, and the vast majority of the global south is completely bereft of these benefits. What is the state of LT, especially conversational agents, for healthcare across the world's languages? And, what would it take to ensure global readiness of LT before the next pandemic? In this paper, we try to answer these questions through a survey of existing literature and resources, as well as through a rapid chatbot building exercise for 15 Asian and African languages with varying amounts of resource availability. The study confirms the pitiful state of LT even for languages with large speaker bases, such as Sinhala and Hausa, and identifies the gaps that could help us prioritize research and investment strategies in LT for healthcare., Comment: Under Revision
Published: 2022

25. Multilingual CheckList: Generation and Evaluation

Author: K, Karthikeyan, Bhatt, Shaily, Singh, Pankaj, Aditya, Somak, Dandapat, Sandipan, Sitaram, Sunayana, and Choudhury, Monojit
Abstract: Multilingual evaluation benchmarks usually contain limited high-resource languages and do not test models for specific linguistic capabilities. CheckList is a template-based evaluation approach that tests models for specific capabilities. The CheckList template creation process requires native speakers, posing a challenge in scaling to hundreds of languages. In this work, we explore multiple approaches to generate Multilingual CheckLists. We devise an algorithm - Template Extraction Algorithm (TEA) for automatically extracting target language CheckList templates from machine translated instances of source language templates. We compare the TEA CheckLists with CheckLists created with different levels of human intervention. We further introduce metrics along the dimensions of cost, diversity, utility, and correctness to compare the CheckLists. We thoroughly analyze different approaches to creating CheckLists in Hindi. Furthermore, we experiment with 9 more different languages. We find that TEA followed by human verification is ideal for scaling CheckList-based evaluation to multiple languages while TEA gives good estimates of model performance., Comment: Accepted to Findings of AACL-IJCNLP 2022
Published: 2022

26. NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis

Author: Muhammad, Shamsuddeen Hassan, Adelani, David Ifeoluwa, Ruder, Sebastian, Ahmad, Ibrahim Said, Abdulmumin, Idris, Bello, Bello Shehu, Choudhury, Monojit, Emezue, Chris Chinenye, Abdullahi, Saheed Salahudeen, Aremu, Anuoluwapo, Jeorge, Alipio, and Brazdil, Pavel
Abstract: Sentiment analysis is one of the most widely studied applications in NLP, but most work focuses on languages with large amounts of data. We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria (Hausa, Igbo, Nigerian-Pidgin, and Yorùbá) consisting of around 30,000 annotated tweets per language (and 14,000 for Nigerian-Pidgin), including a significant fraction of code-mixed tweets. We propose text collection, filtering, processing and labeling methods that enable us to create datasets for these low-resource languages. We evaluate a range of pre-trained models and transfer strategies on the dataset. We find that language-specific models and language-adaptive fine-tuning generally perform best. We release the datasets, trained models, sentiment lexicons, and code to incentivize research on sentiment analysis in under-represented languages., Comment: Submitted to LREC 2022, 13 pages, 2 figures
Published: 2022

27. LoNLI: An Extensible Framework for Testing Diverse Logical Reasoning Capabilities for NLI

Author: Tarunesh, Ishan, Aditya, Somak, and Choudhury, Monojit
Abstract: Natural Language Inference (NLI) is considered a representative task to test natural language understanding (NLU). In this work, we propose an extensible framework to collectively yet categorically test diverse Logical reasoning capabilities required for NLI (and, by extension, NLU). Motivated by behavioral testing, we create a semi-synthetic large test bench (363 templates, 363k examples) and an associated framework that offers the following utilities: 1) individually test and analyze reasoning capabilities along 17 reasoning dimensions (including pragmatic reasoning); 2) design experiments to study cross-capability information content (leave one out or bring one in); and 3) the synthetic nature enables us to control for artifacts and biases. We extend a publicly available framework of automated test case instantiation from free-form natural language templates (CheckList) and a well-defined taxonomy of capabilities to cover a wide range of increasingly harder test cases while varying the complexity of natural language. Through our analysis of state-of-the-art NLI systems, we observe that our benchmark is indeed hard (and non-trivial even with training on additional resources). Some capabilities stand out as harder. Further, fine-grained analysis and fine-tuning experiments reveal more insights about these capabilities and the models -- supporting and extending previous observations; thus showing the utility of the proposed testbench., Comment: arXiv admin note: substantial text overlap with arXiv:2107.07229
Published: 2021

28. Predicting the Performance of Multilingual NLP Models

Author: Srinivasan, Anirudh, Sitaram, Sunayana, Ganu, Tanuja, Dandapat, Sandipan, Bali, Kalika, and Choudhury, Monojit
Abstract: Recent advancements in NLP have given us models like mBERT and XLMR that can serve over 100 languages. The languages that these models are evaluated on, however, are very few in number, and it is unlikely that evaluation datasets will cover all the languages that these models support. Potential solutions to the costly problem of dataset creation are to translate datasets to new languages or use template-filling based techniques for creation. This paper proposes an alternate solution for evaluating a model across languages which makes use of the existing performance scores of the model on languages that a particular task has test sets for. We train a predictor on these performance scores and use this predictor to predict the model's performance in different evaluation settings. Our results show that our method is effective in filling the gaps in the evaluation for an existing set of languages, but might require additional improvements if we want it to generalize to unseen languages.
Published: 2021

29. Designing Language Technologies for Social Good: The Road not Taken

Author: Mukhija, Namrata, Choudhury, Monojit, and Bali, Kalika
Abstract: Development of speech and language technology for social good (LT4SG), especially those targeted at the welfare of marginalized communities and speakers of low-resource and under-served languages, has been a prominent theme of research within NLP, Speech, and the AI communities. Researchers have mostly relied on their individual expertise, experiences or ad hoc surveys for prioritization of language technologies that provide social good to the end-users. This has been criticized by several scholars who argue that work on LT4SG must include the target linguistic communities during the design and development process. However, none of the LT4SG work and their critiques suggest principled techniques for prioritization of the technologies and methods for inclusion of the end-user during the development cycle. Drawing inspiration from the fields of Economics, Ethics, Psychology, and Participatory Design, here we chart out a set of methodologies for prioritizing LT4SG that are aligned with the end-user preferences. We then analyze several LT4SG efforts in light of the proposed methodologies and bring out their hidden assumptions and potential pitfalls. While the current study is limited to language technologies, we believe that the principles and prioritization techniques highlighted here are applicable more broadly to AI for Social Good.
Published: 2021

30. Analyzing the Effects of Reasoning Types on Cross-Lingual Transfer Performance

Author: K, Karthikeyan, Sathe, Aalok, Aditya, Somak, and Choudhury, Monojit
Abstract: Multilingual language models achieve impressive zero-shot accuracies in many languages in complex tasks such as Natural Language Inference (NLI). Examples in NLI (and equivalent complex tasks) often pertain to various types of sub-tasks, requiring different kinds of reasoning. Certain types of reasoning have proven to be more difficult to learn in a monolingual context, and in the crosslingual context, similar observations may shed light on zero-shot transfer efficiency and few-shot sample selection. Hence, to investigate the effects of types of reasoning on transfer performance, we propose a category-annotated multilingual NLI dataset and discuss the challenges to scale monolingual annotations to multiple languages. We statistically observe interesting effects that the confluence of reasoning types and language similarities have on transfer performance., Comment: Workshop on Multilingual Representation Learning (MRL 2021), at Empirical Methods in Natural Language Processing (EMNLP 2021)
Published: 2021

31. On the Universality of Deep Contextual Language Models

Author: Bhatt, Shaily, Goyal, Poonam, Dandapat, Sandipan, Choudhury, Monojit, and Sitaram, Sunayana
Abstract: Deep Contextual Language Models (LMs) like ELMO, BERT, and their successors dominate the landscape of Natural Language Processing due to their ability to scale across multiple tasks rapidly by pre-training a single model, followed by task-specific fine-tuning. Furthermore, multilingual versions of such models like XLM-R and mBERT have given promising results in zero-shot cross-lingual transfer, potentially enabling NLP applications in many under-served and under-resourced languages. Due to this initial success, pre-trained models are being used as 'Universal Language Models' as the starting point across diverse tasks, domains, and languages. This work explores the notion of 'Universality' by identifying seven dimensions across which a universal model should be able to scale, that is, perform equally well or reasonably well, to be useful across diverse settings. We outline the current theoretical and empirical results that support model performance across these dimensions, along with extensions that may help address some of their current limitations. Through this survey, we lay the foundation for understanding the capabilities and limitations of massive contextual language models and help discern research gaps and directions for future work to make these LMs inclusive and fair to diverse applications, users, and linguistic phenomena., Comment: 9 pages
Published: 2021

32. Trusting RoBERTa over BERT: Insights from CheckListing the Natural Language Inference Task

Author: Tarunesh, Ishan, Aditya, Somak, and Choudhury, Monojit
Abstract: The recent state-of-the-art natural language understanding (NLU) systems often behave unpredictably, failing on simpler reasoning examples. Despite this, there has been limited focus on quantifying progress towards systems with more predictable behavior. We think that reasoning capability-wise behavioral summary is a step towards bridging this gap. We create a CheckList test-suite (184K examples) for the Natural Language Inference (NLI) task, a representative NLU task. We benchmark state-of-the-art NLI systems on this test-suite, which reveals fine-grained insights into the reasoning abilities of BERT and RoBERTa. Our analysis further reveals inconsistencies of the models on examples derived from the same template or distinct templates but pertaining to the same reasoning capability, indicating that generalizing the models' behavior through observations made on a CheckList is non-trivial. Through a user study, we find that users were able to utilize behavioral information to generalize much better for examples predicted from RoBERTa, compared to that of BERT., Comment: 15 pages, 5 figures and 9 tables
Published: 2021

33. Sample-efficient Linguistic Generalizations through Program Synthesis: Experiments with Phonology Problems

Author: Vaduguru, Saujas, Sathe, Aalok, Choudhury, Monojit, and Sharma, Dipti Misra
Abstract: Neural models excel at extracting statistical patterns from large amounts of data, but struggle to learn patterns or reason about language from only a few examples. In this paper, we ask: Can we learn explicit rules that generalize well from only a few examples? We explore this question using program synthesis. We develop a synthesis model to learn phonology rules as programs in a domain-specific language. We test the ability of our models to generalize from few training examples using our new dataset of problems from the Linguistics Olympiad, a challenging set of tasks that require strong linguistic reasoning ability. In addition to being highly sample-efficient, our approach generates human-readable programs, and allows control over the generalizability of the learnt programs., Comment: SIGMORPHON 2021
Published: 2021

34. Use of Formal Ethical Reviews in NLP Literature: Historical Trends and Current Practices

Author: Santy, Sebastin, Rani, Anku, and Choudhury, Monojit
Abstract: Ethical aspects of research in language technologies have received much attention recently. It is a standard practice to get a study involving human subjects reviewed and approved by a professional ethics committee/board of the institution. How commonly do we see mention of ethical approvals in NLP research? What types of research or aspects of studies are usually subject to such reviews? With the rising concerns and discourse around the ethics of NLP, do we also observe a rise in formal ethical reviews of NLP studies? And, if so, would this imply that there is a heightened awareness of ethical issues that was previously lacking? We aim to address these questions by conducting a detailed quantitative and qualitative analysis of the ACL Anthology, as well as comparing the trends in our field to those of other related disciplines, such as cognitive science, machine learning, data mining, and systems., Comment: Accepted at ACL 2021 Findings (7 pages)
Published: 2021

35. MSIR@FIRE: A Comprehensive Report from 2013 to 2016

Author: Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació, AGENCIA ESTATAL DE INVESTIGACION, Banerjee, Somnath, Choudhury, Monojit, Chakma, Kunal, Kumar Naskar, Sudip, Das, Amitava, Bandyopadhyay, Sivaji, and Rosso, Paolo
Abstract: [EN] India is a nation of geographical and cultural diversity where over 1600 dialects are spoken by the people. With the technological advancement, penetration of the internet and cheaper access to mobile data, India has recently seen a sudden growth of internet users. These Indian internet users generate contents either in English or in other vernacular Indian languages. To develop technological solutions for the contents generated by the Indian users using the Indian languages, the Forum for Information Retrieval Evaluation (FIRE) was established and held for the first time in 2008. Although Indian languages are written using indigenous scripts, often websites and user-generated content (such as tweets and blogs) in these Indian languages are written using Roman script due to various socio-cultural and technological reasons. A challenge that search engines face while processing transliterated queries and documents is that of extensive spelling variation. MSIR track was first introduced in 2013 at FIRE and the aim of MSIR was to systematically formalize several research problems that one must solve to tackle the code mixing in Web search for users of many languages around the world, develop related data sets, test benches and most importantly, build a research community focusing on this important problem that has received very little attention. This document is a comprehensive report on the 4 years of MSIR track evaluated at FIRE between 2013 and 2016.
Published: 2020

36. MSIR@FIRE: A Comprehensive Report from 2013 to 2016

Author: Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació, AGENCIA ESTATAL DE INVESTIGACION, Banerjee, Somnath, Choudhury, Monojit, Chakma, Kunal, Kumar Naskar, Sudip, Das, Amitava, Bandyopadhyay, Sivaji, and Rosso, Paolo
Abstract: [EN] India is a nation of geographical and cultural diversity where over 1600 dialects are spoken by the people. With the technological advancement, penetration of the internet and cheaper access to mobile data, India has recently seen a sudden growth of internet users. These Indian internet users generate contents either in English or in other vernacular Indian languages. To develop technological solutions for the contents generated by the Indian users using the Indian languages, the Forum for Information Retrieval Evaluation (FIRE) was established and held for the first time in 2008. Although Indian languages are written using indigenous scripts, often websites and user-generated content (such as tweets and blogs) in these Indian languages are written using Roman script due to various socio-cultural and technological reasons. A challenge that search engines face while processing transliterated queries and documents is that of extensive spelling variation. MSIR track was first introduced in 2013 at FIRE and the aim of MSIR was to systematically formalize several research problems that one must solve to tackle the code mixing in Web search for users of many languages around the world, develop related data sets, test benches and most importantly, build a research community focusing on this important problem that has received very little attention. This document is a comprehensive report on the 4 years of MSIR track evaluated at FIRE between 2013 and 2016.
Published: 2020

37. MSIR@FIRE: A Comprehensive Report from 2013 to 2016

Author: Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació, AGENCIA ESTATAL DE INVESTIGACION, Banerjee, Somnath, Choudhury, Monojit, Chakma, Kunal, Kumar Naskar, Sudip, Das, Amitava, Bandyopadhyay, Sivaji, and Rosso, Paolo
Abstract: [EN] India is a nation of geographical and cultural diversity where over 1600 dialects are spoken by the people. With the technological advancement, penetration of the internet and cheaper access to mobile data, India has recently seen a sudden growth of internet users. These Indian internet users generate contents either in English or in other vernacular Indian languages. To develop technological solutions for the contents generated by the Indian users using the Indian languages, the Forum for Information Retrieval Evaluation (FIRE) was established and held for the first time in 2008. Although Indian languages are written using indigenous scripts, often websites and user-generated content (such as tweets and blogs) in these Indian languages are written using Roman script due to various socio-cultural and technological reasons. A challenge that search engines face while processing transliterated queries and documents is that of extensive spelling variation. MSIR track was first introduced in 2013 at FIRE and the aim of MSIR was to systematically formalize several research problems that one must solve to tackle the code mixing in Web search for users of many languages around the world, develop related data sets, test benches and most importantly, build a research community focusing on this important problem that has received very little attention. This document is a comprehensive report on the 4 years of MSIR track evaluated at FIRE between 2013 and 2016.
Published: 2020

38. GLUECoS: An Evaluation Benchmark for Code-Switched NLP

Author: Khanuja, Simran, Dandapat, Sandipan, Srinivasan, Anirudh, Sitaram, Sunayana, and Choudhury, Monojit
Abstract: Code-switching is the use of more than one language in the same conversation or utterance. Recently, multilingual contextual embedding models, trained on multiple monolingual corpora, have shown promising results on cross-lingual and multilingual tasks. We present an evaluation benchmark, GLUECoS, for code-switched languages, that spans several NLP tasks in English-Hindi and English-Spanish. Specifically, our evaluation benchmark includes Language Identification from text, POS tagging, Named Entity Recognition, Sentiment Analysis, Question Answering and a new task for code-switching, Natural Language Inference. We present results on all these tasks using cross-lingual word embedding models and multilingual models. In addition, we fine-tune multilingual models on artificially generated code-switched data. Although multilingual models perform significantly better than cross-lingual models, our results show that in most tasks, across both language pairs, multilingual models fine-tuned on code-switched data perform best, showing that multilingual models can be further optimized for code-switching tasks., Comment: To appear at ACL 2020
Published: 2020

39. The State and Fate of Linguistic Diversity and Inclusion in the NLP World

Author: Joshi, Pratik, Santy, Sebastin, Budhiraja, Amar, Bali, Kalika, and Choudhury, Monojit
Abstract: Language technologies contribute to promoting multilingualism and linguistic diversity around the world. However, only a very small number of the over 7000 languages of the world are represented in the rapidly evolving language technologies and applications. In this paper we look at the relation between the types of languages, resources, and their representation in NLP conferences to understand the trajectory that different languages have followed over time. Our quantitative investigation underlines the disparity between languages, especially in terms of their resources, and calls into question the "language agnostic" status of current models and systems. Through this paper, we attempt to convince the ACL community to prioritise the resolution of the predicaments highlighted here, so that no language is left behind., Comment: Accepted at ACL 2020 (10 pages + 2 pages Appendix). P.J., S.S. and A.B. contributed equally
Published: 2020

40. A New Dataset for Natural Language Inference from Code-mixed Conversations

Author: Khanuja, Simran, Dandapat, Sandipan, Sitaram, Sunayana, and Choudhury, Monojit
Abstract: Natural Language Inference (NLI) is the task of inferring the logical relationship, typically entailment or contradiction, between a premise and hypothesis. Code-mixing is the use of more than one language in the same conversation or utterance, and is prevalent in multilingual communities all over the world. In this paper, we present the first dataset for code-mixed NLI, in which both the premises and hypotheses are in code-mixed Hindi-English. We use data from Hindi movies (Bollywood) as premises, and crowd-source hypotheses from Hindi-English bilinguals. We conduct a pilot annotation study and describe the final annotation protocol based on observations from the pilot. Currently, the data collected consists of 400 premises in the form of code-mixed conversation snippets and 2240 code-mixed hypotheses. We conduct an extensive analysis to infer the linguistic phenomena commonly observed in the dataset obtained. We evaluate the dataset using a standard mBERT-based pipeline for NLI and report results., Comment: To appear in CALCS, LREC 2020
Published: 2020

41. Engagement Patterns of Peer-to-Peer Interactions on Mental Health Platforms

Author: Sharma, Ashish, Choudhury, Monojit, Althoff, Tim, and Sharma, Amit
Abstract: Mental illness is a global health problem, but access to mental healthcare resources remains poor worldwide. Online peer-to-peer support platforms attempt to alleviate this fundamental gap by enabling those who struggle with mental illness to provide and receive social support from their peers. However, successful social support requires users to engage with each other and failures may have serious consequences for users in need. Our understanding of engagement patterns on mental health platforms is limited but critical to inform the role, limitations, and design of these platforms. Here, we present a large-scale analysis of engagement patterns of 35 million posts on two popular online mental health platforms, TalkLife and Reddit. Leveraging communication models in human-computer interaction and communication theory, we operationalize a set of four engagement indicators based on attention and interaction. We then propose a generative model to jointly model these indicators of engagement, the output of which is synthesized into a novel set of eleven distinct, interpretable patterns. We demonstrate that this framework of engagement patterns enables informative evaluations and analysis of online support platforms. Specifically, we find that mutual back-and-forth interactions are associated with significantly higher user retention rates on TalkLife. Such back-and-forth interactions, in turn, are associated with early response times and the sentiment of posts., Comment: Accepted to ICWSM 2020
Published: 2020

42. TaxiNLI: Taking a Ride up the NLU Hill

Author: Joshi, Pratik, Aditya, Somak, Sathe, Aalok, and Choudhury, Monojit
Abstract: Pre-trained Transformer-based neural architectures have consistently achieved state-of-the-art performance in the Natural Language Inference (NLI) task. Since NLI examples encompass a variety of linguistic, logical, and reasoning phenomena, it remains unclear as to which specific concepts are learnt by the trained systems and where they can achieve strong generalization. To investigate this question, we propose a taxonomic hierarchy of categories that are relevant for the NLI task. We introduce TAXINLI, a new dataset, that has 10k examples from the MNLI dataset (Williams et al., 2018) with these taxonomic labels. Through various experiments on TAXINLI, we observe that whereas for certain taxonomic categories SOTA neural models have achieved near perfect accuracies - a large jump over the previous models - some categories still remain difficult. Our work adds to the growing body of literature that shows the gaps in the current NLI systems and datasets through a systematic presentation and analysis of reasoning categories., Comment: 15 pages, 9 figures, 4 tables. Accepted at CoNLL 2020
Published: 2020

43. Unsung Challenges of Building and Deploying Language Technologies for Low Resource Language Communities

Author: Joshi, Pratik, Barnes, Christain, Santy, Sebastin, Khanuja, Simran, Shah, Sanket, Srinivasan, Anirudh, Bhattamishra, Satwik, Sitaram, Sunayana, Choudhury, Monojit, and Bali, Kalika
Abstract: In this paper, we examine and analyze the challenges associated with developing and introducing language technologies to low-resource language communities. While doing so, we bring to light the successes and failures of past work in this area, challenges being faced in doing so, and what they have achieved. Throughout this paper, we take a problem-facing approach and describe essential factors which the success of such technologies hinges upon. We present the various aspects in a manner which clarify and lay out the different tasks involved, which can aid organizations looking to make an impact in this area. We take the example of Gondi, an extremely-low resource Indian language, to reinforce and complement our discussion., Comment: Accepted at ICON 2019; 9 pages
Published: 2019

44. Characterizing the spread of exaggerated news content over social media

Author: Patro, Jasabanta, Baruah, Sabyasachee, Gupta, Vivek, Choudhury, Monojit, Goyal, Pawan, and Mukherjee, Animesh
Abstract: In this paper, we consider a dataset comprising press releases about health research from different universities in the UK along with a corresponding set of news articles. First, we do an exploratory analysis to understand how the basic information published in the scientific journals gets exaggerated as it is reported in these press releases or news articles. This initial analysis shows that some news agencies exaggerate almost 60% of the articles they publish in the health domain; more than 50% of the press releases from certain universities are exaggerated; articles in topics like lifestyle and childhood are heavily exaggerated. Motivated by the above observation, we set the central objective of this paper to investigate how exaggerated news spreads over an online social network like Twitter. The LIWC analysis points to a remarkable observation: these late tweets are essentially laden with words from opinion and realize categories, which indicates that, given sufficient time, the wisdom of the crowd is actually able to tell apart the exaggerated news. As a second step, we study the characteristics of the users who never or rarely post exaggerated news content and compare them with those who post exaggerated news content more frequently. We observe that the latter class of users have fewer retweets or mentions per tweet, have significantly more followers, and use more slang words, fewer hyperbolic words and fewer word contractions. We also observe that LIWC categories like bio, health, body and negative emotion are more pronounced in the tweets posted by the users in the latter class. As a final step, we use these observations as features and automatically classify the two groups, achieving an F1 score of 0.83. Comment: 10 pages
Published: 2018

45. All that is English may be Hindi: Enhancing language identification through automatic ranking of likeliness of word borrowing in social media

Author: Patro, Jasabanta, Samanta, Bidisha, Singh, Saurabh, Basu, Abhipsa, Mukherjee, Prithwish, Choudhury, Monojit, and Mukherjee, Animesh
Abstract: In this paper, we present a set of computational methods to identify the likeliness of a word being borrowed, based on the signals from social media. In terms of Spearman correlation coefficient values, our methods perform more than two times better (nearly 0.62) in predicting the borrowing likeliness compared to the best-performing baseline (nearly 0.26) reported in the literature. Based on this likeliness estimate, we asked annotators to re-annotate the language tags of foreign words in predominantly native contexts. In 88 percent of cases the annotators felt that the foreign language tag should be replaced by the native language tag, thus indicating a huge scope for improvement of automatic language identification systems. Comment: 11 pages, accepted at the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017). arXiv admin note: substantial text overlap with arXiv:1703.05122
Published: 2017

46. Is this word borrowed? An automatic approach to quantify the likeliness of borrowing in social media

Author: Patro, Jasabanta, Samanta, Bidisha, Singh, Saurabh, Mukherjee, Prithwish, Choudhury, Monojit, and Mukherjee, Animesh
Abstract: Code-mixing or code-switching are the effortless phenomena of natural switching between two or more languages in a single conversation. Use of a foreign word in a language, however, does not necessarily mean that the speaker is code-switching, because languages often borrow lexical items from other languages. If a word is borrowed, it becomes a part of the lexicon of a language; whereas, during code-switching, the speaker is aware that the conversation involves foreign words or phrases. Identifying whether a foreign word used by a bilingual speaker is due to borrowing or code-switching is of fundamental importance to theories of multilingualism, and an essential prerequisite towards the development of language and speech technologies for multilingual communities. In this paper, we present a series of novel computational methods to identify the borrowed likeliness of a word, based on the social media signals. We first propose a context-based clustering method to sample a set of candidate words from the social media data. Next, we propose three novel and similar metrics based on the usage of these words by the users in different tweets; these metrics were used to score and rank the candidate words indicating their borrowed likeliness. We compare these rankings with a ground truth ranking constructed through a human judgment experiment. The Spearman's rank correlation between the two rankings (nearly 0.62 for all the three metric variants) is more than double the value (0.26) of the most competitive existing baseline reported in the literature. Some other striking observations are: (i) the correlation is higher for the ground truth data elicited from the younger participants (age less than 30) than that from the older participants, and (ii) those participants who use mixed-language for tweeting the least provide the best signals of borrowing. Comment: 11 pages, 3 Figures
Published: 2017

47. Grammatical Constraints on Intra-sentential Code-Switching: From Theories to Working Models

Author: Bhat, Gayatri, Choudhury, Monojit, and Bali, Kalika
Abstract: We make one of the first attempts to build working models for intra-sentential code-switching based on the Equivalence-Constraint (Poplack 1980) and Matrix-Language (Myers-Scotton 1993) theories. We conduct a detailed theoretical analysis, and a small-scale empirical study of the two models for Hindi-English CS. Our analyses show that the models are neither sound nor complete. Taking insights from the errors made by the models, we propose a new model that combines features of both the theories. Comment: 13 pages
Published: 2016

48. An IR-based Evaluation Framework for Web Search Query Segmentation

Author: Roy, Rishiraj Saha, Ganguly, Niloy, Choudhury, Monojit, and Laxman, Srivatsan
Abstract: This paper presents the first evaluation framework for Web search query segmentation based directly on IR performance. In the past, segmentation strategies were mainly validated against manual annotations. Our work shows that the goodness of a segmentation algorithm as judged through evaluation against a handful of human-annotated segmentations hardly reflects its effectiveness in an IR-based setup. In fact, state-of-the-art algorithms are shown to perform as well as, and sometimes even better than, human annotations, a fact masked by previous validations. The proposed framework also gives us an objective understanding of the gap between the present best and the best possible segmentation algorithm. We draw these conclusions based on an extensive evaluation of six segmentation strategies, including three most recent algorithms, vis-a-vis segmentations from three human annotators. The evaluation framework also gives insights about which segments must necessarily be detected by an algorithm for achieving the best retrieval results. The meticulously constructed dataset used in our experiments has been made public for use by the research community.
Published: 2011

49. Indian Language Part-of-Speech Tagset: Bengali

Author: Bali, Kalika, Choudhury, Monojit, and Biswas, Priyanka
Abstract: *Introduction* Indian Language Part-of-Speech Tagset: Bengali, Linguistic Data Consortium (LDC) catalog number LDC2010T16 and ISBN 1-58563-561-8, is a corpus developed by Microsoft Research (MSR) India to support the task of Part-of-Speech Tagging (POS) and other data-driven linguistic research on Indian Languages in general. It is created as a part of the Indian Language Part-of-Speech Tagset (IL-POST) project, a collaborative effort among linguists and computer scientists from MSR India, AU-KBC (Anna University, Chennai), Delhi University, IIT Bombay, Jawaharlal Nehru University (Delhi) and Tamil University (Tamilnadu). The goal of the IL-POST project is to provide a common tagset framework for Indian Languages that offers flexibility, cross-linguistic compatibility and reusability across those languages. It supports a three-level hierarchy of Categories, Types and Attributes. The corpus therefore mainly consists of two different levels of information for each lexical token: (a) lexical Category and Types, and (b) a set of morphological attributes and their associated values in the context. Bengali (also referred to as Bangla) is a member of the Eastern Indo-Aryan language group. It is native to the region of Bengal, which consists of Bangladesh, the Indian state of West Bengal, and parts of the Indian states of Tripura and Assam. It is spoken by more than 210 million people as a first or a second language, with around 100 million speakers in Bangladesh, about 85 million speakers in India, and others in immigrant communities in the United Kingdom, USA and the Middle East. *Data* This corpus contains 7168 sentences (102933 words) of manually annotated text from modern standard Bengali sources including blogs, Wikipedia, Multikulti and a portion of the EMILLE/CIIL corpus.
The annotated data is structured into two folders, Bangla1 (3684 sentences, 51091 words) and Bangla2 (3484 sentences, 51842 words), which represent the two stages in which the data was annotated. All a
Published: 2010

50. Discovering Global Patterns in Linguistic Networks through Spectral Analysis: A Case Study of the Consonant Inventories

Author: Mukherjee, Animesh, Choudhury, Monojit, and Kannan, Ravi
Abstract: Recent research has shown that language and the socio-cognitive phenomena associated with it can be aptly modeled and visualized through networks of linguistic entities. However, most of the existing works on linguistic networks focus only on the local properties of the networks. This study is an attempt to analyze the structure of languages via a purely structural technique, namely spectral analysis, which is ideally suited for discovering the global correlations in a network. Application of this technique to PhoNet, the co-occurrence network of consonants, not only reveals several natural linguistic principles governing the structure of the consonant inventories, but is also able to quantify their relative importance. We believe that this powerful technique can be successfully applied, in general, to study the structure of natural languages. Comment: In the proceedings of EACL 2009
Published: 2009
