Author: "Diddee, Harshita" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Diddee, Harshita"' showing total 20 results


1. Chasing Random: Instruction Selection Strategies Fail to Generalize

Author: Diddee, Harshita and Ippolito, Daphne
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Prior work has shown that language models can be tuned to follow user instructions using only a small set of high-quality instructions. This has accelerated the development of methods that filter a large, noisy instruction-tuning dataset down to a high-quality subset which works just as well. However, typically, the performance of these methods is not demonstrated across a uniform experimental setup and thus their generalization capabilities are not well established. In this work, we analyze popular selection strategies across different source datasets, selection budgets and evaluation benchmarks: Our results indicate that selection strategies generalize poorly, often failing to consistently outperform even random baselines. We also analyze the cost-performance trade-offs of using data selection. Our findings reveal that data selection can often exceed the cost of fine-tuning on the full dataset, yielding only marginal and sometimes no gains compared to tuning on the full dataset or a random subset.
Published: 2024

2. Customizing Large Language Model Generation Style using Parameter-Efficient Finetuning

Author: Liu, Xinyue, Diddee, Harshita, and Ippolito, Daphne
Subjects: Computer Science - Computation and Language
Abstract: One-size-fits-all large language models (LLMs) are increasingly being used to help people with their writing. However, the style these models are trained to write in may not suit all users or use cases. LLMs would be more useful as writing assistants if their idiolect could be customized to match each user. In this paper, we explore whether parameter-efficient finetuning (PEFT) with Low-Rank Adaptation can effectively guide the style of LLM generations. We use this method to customize LLaMA-2 to ten different authors and show that the generated text has lexical, syntactic, and surface alignment with the target author but struggles with content memorization. Our findings highlight the potential of PEFT to support efficient, user-level customization of LLMs.
Published: 2024

3. Akal Badi ya Bias: An Exploratory Study of Gender Bias in Hindi Language Technology

Author: Hada, Rishav, Husain, Safiya, Gumma, Varun, Diddee, Harshita, Yadavalli, Aditya, Seth, Agrima, Kulkarni, Nidhi, Gadiraju, Ujwal, Vashistha, Aditya, Seshadri, Vivek, and Bali, Kalika
Subjects: Computer Science - Computation and Language
Abstract: Existing research in measuring and mitigating gender bias predominantly centers on English, overlooking the intricate challenges posed by non-English languages and the Global South. This paper presents the first comprehensive study delving into the nuanced landscape of gender bias in Hindi, the third most spoken language globally. Our study employs diverse mining techniques, computational models, and field studies, and sheds light on the limitations of current methodologies. Given the challenges faced with mining gender biased statements in Hindi using existing methods, we conducted field studies to bootstrap the collection of such sentences. Through field studies involving rural and low-income community women, we uncover diverse perceptions of gender bias, underscoring the necessity for context-specific approaches. This paper advocates for a community-centric research design, amplifying voices often marginalized in previous studies. Our findings not only contribute to the understanding of gender bias in Hindi but also establish a foundation for further exploration of Indic languages. By exploring the intricacies of this understudied context, we call for thoughtful engagement with gender bias, promoting inclusivity and equity in linguistic and cultural contexts beyond the Global North.
Comment: Accepted to FAccT 2024
Published: 2024

4. 'Fifty Shades of Bias': Normative Ratings of Gender Bias in GPT Generated English Text

Author: Hada, Rishav, Seth, Agrima, Diddee, Harshita, and Bali, Kalika
Subjects: Computer Science - Computation and Language
Abstract: Language serves as a powerful tool for the manifestation of societal belief systems. In doing so, it also perpetuates the prevalent biases in our society. Gender bias is one of the most pervasive biases in our society and is seen in online and offline discourses. With LLMs increasingly gaining human-like fluency in text generation, gaining a nuanced understanding of the biases these systems can generate is imperative. Prior work often treats gender bias as a binary classification task. However, acknowledging that bias must be perceived at a relative scale, we investigate the generation and consequent receptivity of manual annotators to bias of varying degrees. Specifically, we create the first dataset of GPT-generated English text with normative ratings of gender bias. Ratings were obtained using Best-Worst Scaling, an efficient comparative annotation framework. Next, we systematically analyze the variation of themes of gender biases in the observed ranking and show that identity-attack is most closely related to gender bias. Finally, we show the performance of existing automated models trained on related concepts on our dataset.
Comment: Camera-ready version in EMNLP 2023
Published: 2023

5. Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?

Author: Hada, Rishav, Gumma, Varun, de Wynter, Adrian, Diddee, Harshita, Ahmed, Mohamed, Choudhury, Monojit, Bali, Kalika, and Sitaram, Sunayana
Subjects: Computer Science - Computation and Language
Abstract: Large Language Models (LLMs) excel in various Natural Language Processing (NLP) tasks, yet their evaluation, particularly in languages beyond the top 20, remains inadequate due to the limitations of existing benchmarks and metrics. Employing LLMs as evaluators to rank or score other models' outputs emerges as a viable solution, addressing the constraints tied to human annotators and established benchmarks. In this study, we explore the potential of LLM-based evaluators, specifically GPT-4, in enhancing multilingual evaluation by calibrating them against 20K human judgments across three text-generation tasks, five metrics, and eight languages. Our analysis reveals a bias in GPT-4-based evaluators towards higher scores, underscoring the necessity of calibration with native speaker judgments, especially in low-resource and non-Latin script languages, to ensure accurate evaluation of LLM performance across diverse languages.
Comment: Accepted to EACL 2024 findings
Published: 2023

6. MEGA: Multilingual Evaluation of Generative AI

Author: Ahuja, Kabir, Diddee, Harshita, Hada, Rishav, Ochieng, Millicent, Ramesh, Krithika, Jain, Prachi, Nambi, Akshay, Ganu, Tanuja, Segal, Sameer, Axmed, Maxamed, Bali, Kalika, and Sitaram, Sunayana
Subjects: Computer Science - Computation and Language
Abstract: Generative AI models have shown impressive performance on many Natural Language Processing tasks such as language understanding, reasoning, and language generation. An important question being asked by the AI community today is about the capabilities and limits of these models, and it is clear that evaluating generative AI is very challenging. Most studies on generative LLMs have been restricted to English and it is unclear how capable these models are at understanding and generating text in other languages. We present the first comprehensive benchmarking of generative LLMs - MEGA, which evaluates models on standard NLP benchmarks, covering 16 NLP datasets across 70 typologically diverse languages. We compare the performance of generative LLMs including Chat-GPT and GPT-4 to State of the Art (SOTA) non-autoregressive models on these tasks to determine how well generative models perform compared to the previous generation of LLMs. We present a thorough analysis of the performance of models across languages and tasks and discuss challenges in improving the performance of generative LLMs on low-resource languages. We create a framework for evaluating generative LLMs in the multilingual setting and provide directions for future progress in the field.
Comment: EMNLP 2023
Published: 2023

7. Learnings from Technological Interventions in a Low Resource Language: Enhancing Information Access in Gondi

Author: Mehta, Devansh, Diddee, Harshita, Saxena, Ananya, Shukla, Anurag, Santy, Sebastin, Mothilal, Ramaravind Kommiya, Srivastava, Brij Mohan Lal, Sharma, Alok, Prasad, Vishnu, U, Venkanna, and Bali, Kalika
Subjects: Computer Science - Computation and Language, Computer Science - Computers and Society
Abstract: The primary obstacle to developing technologies for low-resource languages is the lack of representative, usable data. In this paper, we report the deployment of technology-driven data collection methods for creating a corpus of more than 60,000 translations from Hindi to Gondi, a low-resource vulnerable language spoken by around 2.3 million tribal people in south and central India. During this process, we help expand information access in Gondi across two dimensions: (a) the creation of linguistic resources that can be used by the community, such as a dictionary, children's stories, Gondi translations from multiple sources and an Interactive Voice Response (IVR) based mass awareness platform; (b) enabling its use in the digital domain by developing a Hindi-Gondi machine translation model, which is compressed by nearly 4 times to enable its deployment on low-resource edge devices and in areas of little to no internet connectivity. We also present preliminary evaluations of utilizing the developed machine translation model to provide assistance to volunteers who are involved in collecting more data for the target language. Through these interventions, we not only created a refined and evaluated corpus of 26,240 Hindi-Gondi translations that was used for building the translation model but also engaged nearly 850 community members who can help take Gondi onto the internet.
Comment: In Submission (Revised) to Language Resources and Evaluation Journal. arXiv admin note: text overlap with arXiv:2004.10270
Published: 2022

8. Too Brittle To Touch: Comparing the Stability of Quantization and Distillation Towards Developing Lightweight Low-Resource MT Models

Author: Diddee, Harshita, Dandapat, Sandipan, Choudhury, Monojit, Ganu, Tanuja, and Bali, Kalika
Subjects: Computer Science - Computation and Language
Abstract: Leveraging shared learning through Massively Multilingual Models, state-of-the-art machine translation models are often able to adapt to the paucity of data for low-resource languages. However, this performance comes at the cost of significantly bloated models which are not practically deployable. Knowledge Distillation is one popular technique to develop competitive, lightweight models: In this work, we first evaluate its use to compress MT models focusing on languages with extremely limited training data. Through our analysis across 8 languages, we find that the variance in the performance of the distilled models due to their dependence on priors including the amount of synthetic data used for distillation, the student architecture, training hyperparameters and confidence of the teacher models, makes distillation a brittle compression mechanism. To mitigate this, we explore the use of post-training quantization for the compression of these models. Here, we find that while distillation provides gains across some low-resource languages, quantization provides more consistent performance trends for the entire range of languages, especially the lowest-resource languages in our target set.
Comment: 16 Pages, 7 Figures, Accepted to WMT 2022 (Research Track)
Published: 2022

9. Towards Quantifying the Carbon Emissions of Differentially Private Machine Learning

Author: Naidu, Rakshit, Diddee, Harshita, Mulay, Ajinkya, Vardhan, Aleti, Ramesh, Krithika, and Zamzam, Ahmed
Subjects: Computer Science - Cryptography and Security, Computer Science - Machine Learning
Abstract: In recent years, machine learning techniques utilizing large-scale datasets have achieved remarkable performance. Differential privacy, by means of adding noise, provides strong privacy guarantees for such learning algorithms. The cost of differential privacy is often a reduced model accuracy and a lowered convergence speed. This paper investigates the impact of differential privacy on learning algorithms in terms of their carbon footprint due to either longer run-times or failed experiments. Through extensive experiments, further guidance is provided on choosing the noise levels which can strike a balance between desired privacy levels and reduced carbon emissions.
Comment: 4+3 pages; 6 figures; 8 tables. Accepted at SRML workshop at ICML'21
Published: 2021

10. Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

Author: Ramesh, Gowtham, Doddapaneni, Sumanth, Bheemaraj, Aravinth, Jobanputra, Mayank, AK, Raghavan, Sharma, Ajitesh, Sahoo, Sujit, Diddee, Harshita, J, Mahalakshmi, Kakwani, Divyanshu, Kumar, Navneet, Pradeep, Aswin, Nagaraj, Srihari, Deepak, Kumar, Raghavan, Vivek, Kunchukuttan, Anoop, Kumar, Pratyush, and Khapra, Mitesh Shantadevi
Subjects: Computer Science - Computation and Language
Abstract: We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages (from two language families). Specifically, we compile 12.4 million sentence pairs from existing, publicly-available parallel corpora, and additionally mine 37.4 million sentence pairs from the web, resulting in a 4x increase. We mine the parallel sentences from the web by combining many corpora, tools, and methods: (a) web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents, (c) multilingual representation models for aligning sentences, and (d) approximate nearest neighbor search for searching in a large collection of sentences. Human evaluation of samples from the newly mined corpora validates the high quality of the parallel sentences across 11 languages. Further, we extract 83.4 million sentence pairs between all 55 Indic language pairs from the English-centric parallel corpus using English as the pivot language. We trained multilingual NMT models spanning all these languages on Samanantar, which outperform existing models and baselines on publicly available benchmarks, such as FLORES, establishing the utility of Samanantar. Our data and models are available publicly at https://ai4bharat.iitm.ac.in/samanantar and we hope they will help advance research in NMT and multilingual NLP for Indic languages.
Comment: Accepted to the Transactions of the Association for Computational Linguistics (TACL)
Published: 2021

11. BlockFITS: A Federated Data Augmentation Modelling for Blockchain-Based IoVT Systems

Author: Kansra, Bhrigu, Diddee, Harshita, Sheikh, Tariq Hussain, Khanna, Ashish, Gupta, Deepak, and Rodrigues, Joel J. P. C.
Series Editor: Kacprzyk, Janusz
Advisory Editors: Pal, Nikhil R., Bello Perez, Rafael, Corchado, Emilio S., Hagras, Hani, Kóczy, László T., Kreinovich, Vladik, Lin, Chin-Teng, Lu, Jie, Melin, Patricia, Nedjah, Nadia, Nguyen, Ngoc Thanh, and Wang, Jun
Editors: Khanna, Ashish, Gupta, Deepak, Bhattacharyya, Siddhartha, Hassanien, Aboul Ella, Anand, Sameer, and Jaiswal, Ajay
Published: 2022
Full Text: View/download PDF

12. Akal Badi ya Bias: An Exploratory Study of Gender Bias in Hindi Language Technology

Author: Hada, Rishav, primary, Husain, Safiya, additional, Gumma, Varun, additional, Diddee, Harshita, additional, Yadavalli, Aditya, additional, Seth, Agrima, additional, Kulkarni, Nidhi, additional, Gadiraju, Ujwal, additional, Vashistha, Aditya, additional, Seshadri, Vivek, additional, and Bali, Kalika, additional
Published: 2024
Full Text: View/download PDF

13. BlockFITS: A Federated Data Augmentation Modelling for Blockchain-Based IoVT Systems

Author: Kansra, Bhrigu, primary, Diddee, Harshita, additional, Sheikh, Tariq Hussain, additional, Khanna, Ashish, additional, Gupta, Deepak, additional, and Rodrigues, Joel J. P. C., additional
Published: 2021
Full Text: View/download PDF

14. MEGA: Multilingual Evaluation of Generative AI

Author: Ahuja, Kabir, primary, Diddee, Harshita, additional, Hada, Rishav, additional, Ochieng, Millicent, additional, Ramesh, Krithika, additional, Jain, Prachi, additional, Nambi, Akshay, additional, Ganu, Tanuja, additional, Segal, Sameer, additional, Ahmed, Mohamed, additional, Bali, Kalika, additional, and Sitaram, Sunayana, additional
Published: 2023
Full Text: View/download PDF

15. “Fifty Shades of Bias”: Normative Ratings of Gender Bias in GPT Generated English Text

Author: Hada, Rishav, primary, Seth, Agrima, additional, Diddee, Harshita, additional, and Bali, Kalika, additional
Published: 2023
Full Text: View/download PDF

16. CodeFed: Federated Speech Recognition for Low-Resource Code-Switching Detection

Author: Madan, Chetan, primary, Diddee, Harshita, additional, Kumar, Deepika, additional, and Mittal, Mamta, additional
Published: 2022
Full Text: View/download PDF

17. The Six Conundrums of Building and Deploying Language Technologies for Social Good

Author: Diddee, Harshita, primary, Bali, Kalika, additional, Choudhury, Monojit, additional, and Mukhija, Namrata, additional
Published: 2022
Full Text: View/download PDF

18. Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

Author: Ramesh, Gowtham, primary, Doddapaneni, Sumanth, additional, Bheemaraj, Aravinth, additional, Jobanputra, Mayank, additional, AK, Raghavan, additional, Sharma, Ajitesh, additional, Sahoo, Sujit, additional, Diddee, Harshita, additional, J, Mahalakshmi, additional, Kakwani, Divyanshu, additional, Kumar, Navneet, additional, Pradeep, Aswin, additional, Nagaraj, Srihari, additional, Deepak, Kumar, additional, Raghavan, Vivek, additional, Kunchukuttan, Anoop, additional, Kumar, Pratyush, additional, and Khapra, Mitesh Shantadevi, additional
Published: 2022
Full Text: View/download PDF

19. CrossPriv

Author: Diddee, Harshita, primary and Kansra, Bhrigu, additional
Published: 2020
Full Text: View/download PDF

20. PsuedoProp at SemEval-2020 Task 11: Propaganda Span Detection Using BERT-CRF and Ensemble Sentence Level Classifier

Author: Chauhan, Aniruddha, primary and Diddee, Harshita, additional
Published: 2020
Full Text: View/download PDF
