Author: "Ogundepo, Odunayo" - Searchworks@Jio Institute Digital Library Search Results

Your search for "Ogundepo, Odunayo" returned 24 results

1. NoMIRACL: Knowing When You Don't Know for Robust Multilingual Retrieval-Augmented Generation

Author: Thakur, Nandan, Bonifacio, Luiz, Zhang, Xinyu, Ogundepo, Odunayo, Kamalloo, Ehsan, Alfonso-Hermelo, David, Li, Xiaoguang, Liu, Qun, Chen, Boxing, Rezagholizadeh, Mehdi, and Lin, Jimmy
Subjects: Computer Science - Computation and Language, Computer Science - Information Retrieval
Abstract: Retrieval-augmented generation (RAG) grounds large language model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations. However, prior works lack a comprehensive evaluation of different language families, making it challenging to evaluate LLM robustness against errors in external retrieved knowledge. To overcome this, we establish NoMIRACL, a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages. NoMIRACL includes both a non-relevant and a relevant subset. Queries in the non-relevant subset contain passages judged as non-relevant, whereas queries in the relevant subset include at least a single judged relevant passage. We measure LLM robustness using two metrics: (i) hallucination rate, measuring model tendency to hallucinate an answer, when the answer is not present in passages in the non-relevant subset, and (ii) error rate, measuring model inaccuracy to recognize relevant passages in the relevant subset. In our work, we measure robustness for a wide variety of multilingual-focused LLMs and observe that most of the models struggle to balance the two capacities. Models such as LLAMA-2, Orca-2, and FLAN-T5 observe more than an 88% hallucination rate on the non-relevant subset, whereas, Mistral overall hallucinates less, but can achieve up to a 74.9% error rate on the relevant subset. Overall, GPT-4 is observed to provide the best tradeoff on both subsets, highlighting future work necessary to improve LLM robustness.
Published: 2023

2. GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration

Author: Piktus, Aleksandra, Ogundepo, Odunayo, Akiki, Christopher, Oladipo, Akintunde, Zhang, Xinyu, Schoelkopf, Hailey, Biderman, Stella, Potthast, Martin, and Lin, Jimmy
Subjects: Computer Science - Computation and Language
Abstract: Noticing the urgent need to provide tools for fast and user-friendly qualitative analysis of large-scale textual corpora of the modern NLP, we propose to turn to the mature and well-tested methods from the domain of Information Retrieval (IR) - a research field with a long history of tackling TB-scale document collections. We discuss how Pyserini - a widely used toolkit for reproducible IR research can be integrated with the Hugging Face ecosystem of open-source AI libraries and artifacts. We leverage the existing functionalities of both platforms while proposing novel features further facilitating their integration. Our goal is to give NLP researchers tools that will allow them to develop retrieval-based instrumentation for their data analytics needs with ease and agility. We include a Jupyter Notebook-based walk through the core interoperability features, available on GitHub at https://github.com/huggingface/gaia. We then demonstrate how the ideas we present can be operationalized to create a powerful tool for qualitative data analysis in NLP. We present GAIA Search - a search engine built following previously laid out principles, giving access to four popular large-scale text collections. GAIA serves a dual purpose of illustrating the potential of methodologies we discuss but also as a standalone qualitative analysis tool that can be leveraged by NLP researchers aiming to understand datasets prior to using them in training. GAIA is hosted live on Hugging Face Spaces - https://huggingface.co/spaces/spacerini/gaia.
Published: 2023

3. AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages

Author: Ogundepo, Odunayo, Gwadabe, Tajuddeen R., Rivera, Clara E., Clark, Jonathan H., Ruder, Sebastian, Adelani, David Ifeoluwa, Dossou, Bonaventure F. P., DIOP, Abdou Aziz, Sikasote, Claytone, Hacheme, Gilles, Buzaaba, Happy, Ezeani, Ignatius, Mabuya, Rooweither, Osei, Salomey, Emezue, Chris, Kahira, Albert Njoroge, Muhammad, Shamsuddeen H., Oladipo, Akintunde, Owodunni, Abraham Toluwase, Tonja, Atnafu Lambebo, Shode, Iyanuoluwa, Asai, Akari, Ajayi, Tunde Oluwaseyi, Siro, Clemencia, Arthur, Steven, Adeyemi, Mofetoluwa, Ahia, Orevaoghene, Aremu, Anuoluwapo, Awosan, Oyinkansola, Chukwuneke, Chiamaka, Opoku, Bernard, Ayodele, Awokoya, Otiende, Verrah, Mwase, Christine, Sinkala, Boyd, Rubungo, Andre Niyongabo, Ajisafe, Daniel A., Onwuegbuzia, Emeka Felix, Mbow, Habib, Niyomutabazi, Emile, Mukonde, Eunice, Lawan, Falalu Ibrahim, Ahmad, Ibrahim Said, Alabi, Jesujoba O., Namukombo, Martin, Chinedu, Mbonu, Phiri, Mofya, Putini, Neo, Mngoma, Ndumiso, Amuok, Priscilla A., Iro, Ruqayya Nasir, and Adhiambo, Sonia
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Information Retrieval
Abstract: African languages have far less in-language content available digitally, making it challenging for question answering systems to satisfy the information needs of users. Cross-lingual open-retrieval question answering (XOR QA) systems -- those that retrieve answer content from other languages while serving people in their native language -- offer a means of filling this gap. To this end, we create AfriQA, the first cross-lingual QA dataset with a focus on African languages. AfriQA includes 12,000+ XOR QA examples across 10 African languages. While previous datasets have focused primarily on languages where cross-lingual QA augments coverage from the target language, AfriQA focuses on languages where cross-lingual answer content is the only high-coverage source of answer content. Because of this, we argue that African languages are one of the most important and realistic use cases for XOR QA. Our experiments demonstrate the poor performance of automatic translation and multilingual retrieval methods. Overall, AfriQA proves challenging for state-of-the-art QA models. We hope that the dataset enables the development of more equitable QA technology.
Published: 2023

4. Evaluating Embedding APIs for Information Retrieval

Author: Kamalloo, Ehsan, Zhang, Xinyu, Ogundepo, Odunayo, Thakur, Nandan, Alfonso-Hermelo, David, Rezagholizadeh, Mehdi, and Lin, Jimmy
Subjects: Computer Science - Information Retrieval, Computer Science - Computation and Language
Abstract: The ever-increasing size of language models curtails their widespread availability to the community, thereby galvanizing many companies into offering access to large language models through APIs. One particular type, suitable for dense retrieval, is a semantic embedding service that builds vector representations of input text. With a growing number of publicly available APIs, our goal in this paper is to analyze existing offerings in realistic retrieval scenarios, to assist practitioners and researchers in finding suitable services according to their needs. Specifically, we investigate the capabilities of existing semantic embedding APIs on domain generalization and multilingual retrieval. For this purpose, we evaluate these services on two standard benchmarks, BEIR and MIRACL. We find that re-ranking BM25 results using the APIs is a budget-friendly approach and is most effective in English, in contrast to the standard practice of employing them as first-stage retrievers. For non-English retrieval, re-ranking still improves the results, but a hybrid model with BM25 works best, albeit at a higher cost. We hope our work lays the groundwork for evaluating semantic embedding APIs that are critical in search and more broadly, for information access.
Comment: ACL 2023 Industry Track
Published: 2023

5. MasakhaNEWS: News Topic Classification for African languages

Author: Adelani, David Ifeoluwa, Masiak, Marek, Azime, Israel Abebe, Alabi, Jesujoba, Tonja, Atnafu Lambebo, Mwase, Christine, Ogundepo, Odunayo, Dossou, Bonaventure F. P., Oladipo, Akintunde, Nixdorf, Doreen, Emezue, Chris Chinenye, al-azzawi, sana, Sibanda, Blessing, David, Davis, Ndolela, Lolwethu, Mukiibi, Jonathan, Ajayi, Tunde, Moteu, Tatiana, Odhiambo, Brian, Owodunni, Abraham, Obiefuna, Nnaemeka, Mohamed, Muhidin, Muhammad, Shamsuddeen Hassan, Ababu, Teshome Mulugeta, Salahudeen, Saheed Abdullahi, Yigezu, Mesay Gemeda, Gwadabe, Tajuddeen, Abdulmumin, Idris, Taye, Mahlet, Awoyomi, Oluwabusayo, Shode, Iyanuoluwa, Adelani, Tolulope, Abdulganiyu, Habiba, Omotayo, Abdul-Hakeem, Adeeko, Adetola, Afolabi, Abeeb, Aremu, Anuoluwapo, Samuel, Olanrewaju, Siro, Clemencia, Kimotho, Wangari, Ogbu, Onyekachi, Mbonu, Chinedu, Chukwuneke, Chiamaka, Fanijo, Samuel, Ojo, Jessica, Awosan, Oyinkansola, Kebede, Tadesse, Sakayo, Toadoum Sari, Nyatsine, Pamela, Sidume, Freedmore, Yousuf, Oreen, Oduwole, Mardiyyah, Tshinu, Tshinu, Kimanuka, Ussen, Diko, Thina, Nxakama, Siyanda, Nigusse, Sinodos, Johar, Abdulmejid, Mohamed, Shafie, Hassan, Fuad Mire, Mehamed, Moges Ahmed, Ngabire, Evrard, Jules, Jules, Ssenkungu, Ivan, and Stenetorp, Pontus
Subjects: Computer Science - Computation and Language
Abstract: African languages are severely under-represented in NLP research due to lack of datasets covering several NLP tasks. While there are individual language specific datasets that are being expanded to different tasks, only a handful of NLP tasks (e.g. named entity recognition and machine translation) have standardized benchmark datasets covering several geographical and typologically-diverse African languages. In this paper, we develop MasakhaNEWS -- a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa. We provide an evaluation of baseline models by training classical machine learning models and fine-tuning several language models. Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning such as cross-lingual parameter-efficient fine-tuning (like MAD-X), pattern exploiting training (PET), prompting language models (like ChatGPT), and prompt-free sentence transformer fine-tuning (SetFit and Cohere Embedding API). Our evaluation in zero-shot setting shows the potential of prompting ChatGPT for news topic classification in low-resource African languages, achieving an average performance of 70 F1 points without leveraging additional supervision like MAD-X. In few-shot setting, we show that with as little as 10 examples per label, we achieved more than 90% (i.e. 86.0 F1 points) of the performance of full supervised training (92.6 F1 points) leveraging the PET approach.
Comment: Accepted to IJCNLP-AACL 2023 (main conference)
Published: 2023

6. Simple Yet Effective Neural Ranking and Reranking Baselines for Cross-Lingual Information Retrieval

Author: Lin, Jimmy, Alfonso-Hermelo, David, Jeronymo, Vitor, Kamalloo, Ehsan, Lassance, Carlos, Nogueira, Rodrigo, Ogundepo, Odunayo, Rezagholizadeh, Mehdi, Thakur, Nandan, Yang, Jheng-Hong, and Zhang, Xinyu
Subjects: Computer Science - Information Retrieval, Computer Science - Computation and Language
Abstract: The advent of multilingual language models has generated a resurgence of interest in cross-lingual information retrieval (CLIR), which is the task of searching documents in one language with queries from another. However, the rapid pace of progress has led to a confusing panoply of methods and reproducibility has lagged behind the state of the art. In this context, our work makes two important contributions: First, we provide a conceptual framework for organizing different approaches to cross-lingual retrieval using multi-stage architectures for mono-lingual retrieval as a scaffold. Second, we implement simple yet effective reproducible baselines in the Anserini and Pyserini IR toolkits for test collections from the TREC 2022 NeuCLIR Track, in Persian, Russian, and Chinese. Our efforts are built on a collaboration of the two teams that submitted the most effective runs to the TREC evaluation. These contributions provide a firm foundation for future advances.
Published: 2023

7. Spacerini: Plug-and-play Search Engines with Pyserini and Hugging Face

Author: Akiki, Christopher, Ogundepo, Odunayo, Piktus, Aleksandra, Zhang, Xinyu, Oladipo, Akintunde, Lin, Jimmy, and Potthast, Martin
Subjects: Computer Science - Information Retrieval, Computer Science - Computation and Language
Abstract: We present Spacerini, a tool that integrates the Pyserini toolkit for reproducible information retrieval research with Hugging Face to enable the seamless construction and deployment of interactive search engines. Spacerini makes state-of-the-art sparse and dense retrieval models more accessible to non-IR practitioners while minimizing deployment effort. This is useful for NLP researchers who want to better understand and validate their research by performing qualitative analyses of training corpora, for IR researchers who want to demonstrate new retrieval models integrated into the growing Pyserini ecosystem, and for third parties reproducing the work of other researchers. Spacerini is open source and includes utilities for loading, preprocessing, indexing, and deploying search engines locally and remotely. We demonstrate a portfolio of 13 search engines created with Spacerini for different use cases.
Published: 2023

8. MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition

Author: Adelani, David Ifeoluwa, Neubig, Graham, Ruder, Sebastian, Rijhwani, Shruti, Beukman, Michael, Palen-Michel, Chester, Lignos, Constantine, Alabi, Jesujoba O., Muhammad, Shamsuddeen H., Nabende, Peter, Dione, Cheikh M. Bamba, Bukula, Andiswa, Mabuya, Rooweither, Dossou, Bonaventure F. P., Sibanda, Blessing, Buzaaba, Happy, Mukiibi, Jonathan, Kalipe, Godson, Mbaye, Derguene, Taylor, Amelia, Kabore, Fatoumata, Emezue, Chris Chinenye, Aremu, Anuoluwapo, Ogayo, Perez, Gitau, Catherine, Munkoh-Buabeng, Edwin, Koagne, Victoire M., Tapo, Allahsera Auguste, Macucwa, Tebogo, Marivate, Vukosi, Mboning, Elvis, Gwadabe, Tajuddeen, Adewumi, Tosin, Ahia, Orevaoghene, Nakatumba-Nabende, Joyce, Mokono, Neo L., Ezeani, Ignatius, Chukwuneke, Chiamaka, Adeyemi, Mofetoluwa, Hacheme, Gilles Q., Abdulmumin, Idris, Ogundepo, Odunayo, Yousuf, Oreen, Ngoli, Tatiana Moteu, and Klakow, Dietrich
Subjects: Computer Science - Computation and Language
Abstract: African languages are spoken by over a billion people, but are underrepresented in NLP research and development. The challenges impeding progress include the limited availability of annotated datasets, as well as a lack of understanding of the settings where current methods are effective. In this paper, we make progress towards solutions for these challenges, focusing on the task of named entity recognition (NER). We create the largest human-annotated NER dataset for 20 African languages, and we study the behavior of state-of-the-art cross-lingual transfer methods in an Africa-centric setting, demonstrating that the choice of source language significantly affects performance. We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points across 20 languages compared to using English. Our results highlight the need for benchmark datasets and models that cover typologically-diverse African languages.
Comment: Accepted to EMNLP 2022 (updated Github link)
Published: 2022

9. Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages

Author: Zhang, Xinyu, Thakur, Nandan, Ogundepo, Odunayo, Kamalloo, Ehsan, Alfonso-Hermelo, David, Li, Xiaoguang, Liu, Qun, Rezagholizadeh, Mehdi, and Lin, Jimmy
Subjects: Computer Science - Information Retrieval, Computer Science - Computation and Language
Abstract: MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual dataset we have built for the WSDM 2023 Cup challenge that focuses on ad hoc retrieval across 18 different languages, which collectively encompass over three billion native speakers around the world. These languages have diverse typologies, originate from many different language families, and are associated with varying amounts of available resources -- including what researchers typically characterize as high-resource as well as low-resource languages. Our dataset is designed to support the creation and evaluation of models for monolingual retrieval, where the queries and the corpora are in the same language. In total, we have gathered over 700k high-quality relevance judgments for around 77k queries over Wikipedia in these 18 languages, where all assessments have been performed by native speakers hired by our team. Our goal is to spur research that will improve retrieval across a continuum of languages, thus enhancing information access capabilities for diverse populations around the world, particularly those that have been traditionally underserved. This overview paper describes the dataset and baselines that we share with the community. The MIRACL website is live at http://miracl.ai/.
Published: 2022

10. Better Than Whitespace: Information Retrieval for Languages without Custom Tokenizers

Author: Ogundepo, Odunayo, Zhang, Xinyu, and Lin, Jimmy
Subjects: Computer Science - Computation and Language, Computer Science - Information Retrieval
Abstract: Tokenization is a crucial step in information retrieval, especially for lexical matching algorithms, where the quality of indexable tokens directly impacts the effectiveness of a retrieval system. Since different languages have unique properties, the design of the tokenization algorithm is usually language-specific and requires at least some linguistic knowledge. However, only a handful of the 7000+ languages on the planet benefit from specialized, custom-built tokenization algorithms, while the other languages are stuck with a "default" whitespace tokenizer, which cannot capture the intricacies of different languages. To address this challenge, we propose a different approach to tokenization for lexical matching retrieval algorithms (e.g., BM25): using the WordPiece tokenizer, which can be built automatically from unsupervised data. We test the approach on 11 typologically diverse languages in the MrTyDi collection: results show that the mBERT tokenizer provides strong relevance signals for retrieval "out of the box", outperforming whitespace tokenization on most languages. In many cases, our approach also improves retrieval effectiveness when combined with existing custom-built tokenizers.
Published: 2022

11. Rescuing historical climate observations to support hydrological research: a case study of solar radiation data.

Author: Ogundepo Odunayo, Naveela N. Sookoo, Gautam Bathla, Anthony Cavallin, Bhaleka D. Persaud, Kathy Szigeti, Philippe Van Cappellen, and Jimmy Lin
Published: 2021

12. AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages

Author: Ogundepo, Odunayo, Gwadabe, Tajuddeen R., Rivera, Clara E., Clark, Jonathan H., Ruder, Sebastian, Adelani, David Ifeoluwa, Ezeani, Ignatius, Chukwuneke, Chiamaka, Dossou, Bonaventure F. P., Abdou, Aziz DIOP, Sikasote, Claytone, Hacheme, Gilles, Buzaaba, Happy, Mabuya, Rooweither, Osei, Salomey, Emezue, Chris, Kahira, Albert Njoroge, Muhammad, Shamsuddeen H., Oladipo, Akintunde, Owodunni, Abraham Toluwase, Tonja, Atnafu Lambebo, Shode, Iyanuoluwa, Asai, Akari, Ajayi, Tunde Oluwaseyi, Siro, Clemencia, Arthur, Steven, Adeyemi, Mofetoluwa, Ahia, Orevaoghene, Anuoluwapo, Aremu, Awosan, Oyinkansola, Opoku, Bernard, Ayodele, Awokoya, Otiende, Verrah, Mwase, Christine, Sinkala, Boyd, Rubungo, Andre Niyongabo, Ajisafe, Daniel A., Onwuegbuzia, Emeka Felix, Mbow, Habib, Niyomutabazi, Emile, Mukonde, Eunice, Lawan, Falalu Ibrahim, Ahmad, Ibrahim Said, Alabi, Jesujoba O., Namukombo, Martin, Chinedu, Mbonu, Phiri, Mofya, Putini, Neo, Mngoma, Ndumiso, Amuok, Priscilla A., Iro, Ruqayya Nasir, and Adhiambo, Sonia
Abstract: African languages have far less in-language content available digitally, making it challenging for question answering systems to satisfy the information needs of users. Cross-lingual open-retrieval question answering (XOR QA) systems -- those that retrieve answer content from other languages while serving people in their native language -- offer a means of filling this gap. To this end, we create AfriQA, the first cross-lingual QA dataset with a focus on African languages. AfriQA includes 12,000+ XOR QA examples across 10 African languages. While previous datasets have focused primarily on languages where cross-lingual QA augments coverage from the target language, AfriQA focuses on languages where cross-lingual answer content is the only high-coverage source of answer content. Because of this, we argue that African languages are one of the most important and realistic use cases for XOR QA. Our experiments demonstrate the poor performance of automatic translation and multilingual retrieval methods. Overall, AfriQA proves challenging for state-of-the-art QA models. We hope that the dataset enables the development of more equitable QA technology.
Published: 2023

13. MasakhaNEWS: News Topic Classification for African languages

Author: Adelani, David Ifeoluwa, Chukwuneke, Chiamaka I., Masiak, Marek, Azime, Israel Abebe, Alabi, Jesujoba Oluwadara, Tonja, Atnafu Lambebo, Mwase, Christine, Ogundepo, Odunayo, Dossou, Bonaventure F. P., Oladipo, Akintunde, Nixdorf, Doreen, Emezue, Chris Chinenye, al-azzawi, Sana Sabah, Sibanda, Blessing K., David, Davis, Ndolela, Lolwethu, Mukiibi, Jonathan, Ajayi, Tunde Oluwaseyi, Ngoli, Tatiana Moteu, Odhiambo, Brian, Mbonu, Chinedu E., Owodunni, Abraham Toluwase, Obiefuna, Nnaemeka C., Muhammad, Shamsuddeen Hassan, Abdullahi, Saheed Salahudeen, Yigezu, Mesay Gemeda, Gwadabe, Tajuddeen, Abdulmumin, Idris, Bame, Mahlet Taye, Awoyomi, Oluwabusayo Olufunke, Shode, Iyanuoluwa, Adelani, Tolulope Anu, Kailani, Habiba Abdulganiy, Omotayo, Abdul-Hakeem, Adeeko, Adetola, Abeeb, Afolabi, Aremu, Anuoluwapo, Samuel, Olanrewaju, Siro, Clemencia, Kimotho, Wangari, Ogbu, Onyekachi Raphael, Fanijo, Samuel, Ojo, Jessica, Awosan, Oyinkansola F., Guge, Tadesse Kebede, Sari, Sakayo Toadoum, Nyatsine, Pamela, Sidume, Freedmore, Yousuf, Oreen, Oduwole, Mardiyyah, Kimanuka, Ussen, Tshinu, Kanda Patrick, Diko, Thina, Nxakama, Siyanda, Johar, Abdulmejid Tuni, Gebre, Sinodos, Mohamed, Muhidin, Mohamed, Shafie Abdi, Hassan, Fuad Mire, Mehamed, Moges Ahmed, Ngabire, Evrard, and Stenetorp, Pontus
Abstract: African languages are severely under-represented in NLP research due to lack of datasets covering several NLP tasks. While there are individual language specific datasets that are being expanded to different tasks, only a handful of NLP tasks (e.g. named entity recognition and machine translation) have standardized benchmark datasets covering several geographical and typologically-diverse African languages. In this paper, we develop MasakhaNEWS -- a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa. We provide an evaluation of baseline models by training classical machine learning models and fine-tuning several language models. Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning such as cross-lingual parameter-efficient fine-tuning (like MAD-X), pattern exploiting training (PET), prompting language models (like ChatGPT), and prompt-free sentence transformer fine-tuning (SetFit and Cohere Embedding API). Our evaluation in zero-shot setting shows the potential of prompting ChatGPT for news topic classification in low-resource African languages, achieving an average performance of 70 F1 points without leveraging additional supervision like MAD-X. In few-shot setting, we show that with as little as 10 examples per label, we achieved more than 90% (i.e. 86.0 F1 points) of the performance of full supervised training (92.6 F1 points) leveraging the PET approach.
Published: 2023

14. MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages

Author: Zhang, Xinyu, primary, Thakur, Nandan, additional, Ogundepo, Odunayo, additional, Kamalloo, Ehsan, additional, Alfonso-Hermelo, David, additional, Li, Xiaoguang, additional, Liu, Qun, additional, Rezagholizadeh, Mehdi, additional, and Lin, Jimmy, additional
Published: 2023

15. Evaluating Embedding APIs for Information Retrieval

Author: Kamalloo, Ehsan, primary, Zhang, Xinyu, additional, Ogundepo, Odunayo, additional, Thakur, Nandan, additional, Alfonso-hermelo, David, additional, Rezagholizadeh, Mehdi, additional, and Lin, Jimmy, additional
Published: 2023

16. Cross-lingual Open-Retrieval Question Answering for African Languages

Author: Ogundepo, Odunayo, primary, Gwadabe, Tajuddeen, additional, Rivera, Clara, additional, Clark, Jonathan, additional, Ruder, Sebastian, additional, Adelani, David, additional, Dossou, Bonaventure, additional, Diop, Abdou, additional, Sikasote, Claytone, additional, Hacheme, Gilles, additional, Buzaaba, Happy, additional, Ezeani, Ignatius, additional, Mabuya, Rooweither, additional, Osei, Salomey, additional, Emezue, Chris, additional, Kahira, Albert, additional, Muhammad, Shamsuddeen, additional, Oladipo, Akintunde, additional, Owodunni, Abraham, additional, Tonja, Atnafu, additional, Shode, Iyanuoluwa, additional, Asai, Akari, additional, Aremu, Anuoluwapo, additional, Awokoya, Ayodele, additional, Opoku, Bernard, additional, Chukwuneke, Chiamaka, additional, Mwase, Christine, additional, Siro, Clemencia, additional, Arthur, Stephen, additional, Ajayi, Tunde, additional, Otiende, Verrah, additional, Rubungo, Andre, additional, Sinkala, Boyd, additional, Ajisafe, Daniel, additional, Onwuegbuzia, Emeka, additional, Lawan, Falalu, additional, Ahmad, Ibrahim, additional, Alabi, Jesujoba, additional, Mbonu, Chinedu, additional, Adeyemi, Mofetoluwa, additional, Phiri, Mofya, additional, Ahia, Orevaoghene, additional, Iro, Ruqayya, additional, and Adhiambo, Sonia, additional
Published: 2023

17. Spacerini: Plug-and-play Search Engines with Pyserini and Hugging Face

Author: Akiki, Christopher, primary, Ogundepo, Odunayo, additional, Piktus, Aleksandra, additional, Zhang, Xinyu, additional, Oladipo, Akintunde, additional, Lin, Jimmy, additional, and Potthast, Martin, additional
Published: 2023

18. Better Quality Pre-training Data and T5 Models for African Languages

Author: Oladipo, Akintunde, primary, Adeyemi, Mofetoluwa, additional, Ahia, Orevaoghene, additional, Owodunni, Abraham, additional, Ogundepo, Odunayo, additional, Adelani, David, additional, and Lin, Jimmy, additional
Published: 2023

19. GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration

Author: Piktus, Aleksandra, primary, Ogundepo, Odunayo, additional, Akiki, Christopher, additional, Oladipo, Akintunde, additional, Zhang, Xinyu, additional, Schoelkopf, Hailey, additional, Biderman, Stella, additional, Potthast, Martin, additional, and Lin, Jimmy, additional
Published: 2023

20. MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition

Author: Adelani, David Ifeoluwa, Neubig, Graham, Ruder, Sebastian, Rijhwani, Shruti, Beukman, Michael, Palen-Michel, Chester, Lignos, Constantine, Alabi, Jesujoba O., Muhammad, Shamsuddeen Hassan, Nabende, Peter, Dione, Cheikh M. Bamba, Bukula, Andiswa, Mabuya, Rooweither, Dossou, Bonaventure F. P., Sibanda, Blessing, Buzaaba, Happy, Mukiibi, Jonathan, Kalipe, Godson, Mbaye, Derguene, Taylor, Amelia, Kabore, Fatoumata Ouoba, Emezue, Chris Chinenye, Anuoluwapo, Aremu, Ogayo, Perez, Gitau, Catherine, Munkoh-Buabeng, Edwin, Koagne, Victoire Memdjokam, Tapo, Allahsera Auguste, Macucwa, Tebogo, Marivate, Vukosi, Mboning, Elvis, Gwadabe, Tajuddeen, Adewumi, Tosin P., Ahia, Orevaoghene, Nakatumba-Nabende, Joyce, Mokono, Neo L., Ezeani, Ignatius, Chukwuneke, Chiamaka, Adeyemi, Mofetoluwa, Hacheme, Gilles, Abdulmumin, Idris, Ogundepo, Odunayo, Yousuf, Oreen, Ngoli, Tatiana Moteu, and Klakow, Dietrich
Abstract: African languages are spoken by over a billion people, but are underrepresented in NLP research and development. The challenges impeding progress include the limited availability of annotated datasets, as well as a lack of understanding of the settings where current methods are effective. In this paper, we make progress towards solutions for these challenges, focusing on the task of named entity recognition (NER). We create the largest human-annotated NER dataset for 20 African languages, and we study the behavior of state-of-the-art cross-lingual transfer methods in an Africa-centric setting, demonstrating that the choice of source language significantly affects performance. We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points across 20 languages compared to using English. Our results highlight the need for benchmark datasets and models that cover typologically-diverse African languages.
Published: 2022

21. AfriCLIRMatrix: Enabling Cross-Lingual Information Retrieval for African Languages

Author: Ogundepo, Odunayo, primary, Zhang, Xinyu, additional, Sun, Shuo, additional, Duh, Kevin, additional, and Lin, Jimmy, additional
Published: 2022
Full Text: View/download PDF

22. AfriTeVA: Extending "Small Data" Pretraining Approaches to Sequence-to-Sequence Models

Author: Ogundepo, Odunayo Jude, primary, Oladipo, Akintunde, additional, Adeyemi, Mofetoluwa, additional, Ogueji, Kelechi, additional, and Lin, Jimmy, additional
Published: 2022
Full Text: View/download PDF

23. MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition

Author: Adelani, David, primary, Neubig, Graham, additional, Ruder, Sebastian, additional, Rijhwani, Shruti, additional, Beukman, Michael, additional, Palen-Michel, Chester, additional, Lignos, Constantine, additional, Alabi, Jesujoba, additional, Muhammad, Shamsuddeen, additional, Nabende, Peter, additional, Dione, Cheikh M. Bamba, additional, Bukula, Andiswa, additional, Mabuya, Rooweither, additional, Dossou, Bonaventure F. P., additional, Sibanda, Blessing, additional, Buzaaba, Happy, additional, Mukiibi, Jonathan, additional, Kalipe, Godson, additional, Mbaye, Derguene, additional, Taylor, Amelia, additional, Kabore, Fatoumata, additional, Emezue, Chris Chinenye, additional, Aremu, Anuoluwapo, additional, Ogayo, Perez, additional, Gitau, Catherine, additional, Munkoh-Buabeng, Edwin, additional, Memdjokam Koagne, Victoire, additional, Tapo, Allahsera Auguste, additional, Macucwa, Tebogo, additional, Marivate, Vukosi, additional, Elvis, Mboning Tchiaze, additional, Gwadabe, Tajuddeen, additional, Adewumi, Tosin, additional, Ahia, Orevaoghene, additional, Nakatumba-Nabende, Joyce, additional, Mokono, Neo Lerato, additional, Ezeani, Ignatius, additional, Chukwuneke, Chiamaka, additional, Oluwaseun Adeyemi, Mofetoluwa, additional, Hacheme, Gilles Quentin, additional, Abdulmumin, Idris, additional, Ogundepo, Odunayo, additional, Yousuf, Oreen, additional, Moteu, Tatiana, additional, and Klakow, Dietrich, additional
Published: 2022
Full Text: View/download PDF

24. Rescuing historical climate observations to support hydrological research

Author: Philippe Van Cappellen, Jimmy Lin, Naveela N. Sookoo, Gautam Bathla, Bhaleka Persaud, Kathy Szigeti, Ogundepo Odunayo, and Anthony Cavallin
Subjects: Metadata, Information retrieval, Computer science, Preprocessor, Tesseract, Noise (video), Optical character recognition, computer.software_genre, Pipeline (software), computer, Digitization, Personalization
Abstract: The acceleration of climate change and its impact highlight the need for long-term reliable climate data at high spatiotemporal resolution to answer key science questions in cold regions hydrology. Prior to the digital age, climate records were archived on paper. For example, from the 1950s to the 1990s, solar radiation data from recording stations worldwide were published in booklets by the former Union of Soviet Socialist Republics (USSR) Hydrometeorological Service. As a result, the data are not easily accessible by most researchers. The overarching aim of this research is to develop techniques to convert paper-based climate records into a machine-readable format to support environmental research in cold regions. This study compares the performance of a proprietary optical character recognition (OCR) service with an open-source OCR tool for digitizing hydrometeorological data. We built a digitization pipeline combining different image preprocessing techniques, semantic segmentation, and an open-source OCR engine for extracting data and metadata recorded in the scanned documents. Each page contains blocks of text with station names and tables containing the climate data. The process begins with image preprocessing to reduce noise and to improve quality before the page content is segmented to detect tables and finally run through an OCR engine for text extraction. We outline the digitization process and report on initial results, including different segmentation approaches, preprocessing image algorithms, and OCR techniques to ensure accurate extraction and organization of relevant metadata from thousands of scanned climate records. We evaluated the performance of Tesseract OCR and ABBYY FineReader on text extraction. We find that although ABBYY FineReader has better accuracy on the sample data, our custom extraction pipeline using Tesseract is efficient and scalable because it is flexible and allows for more customization.
Published: 2021
