Author: "Shatnawi, Sara" / Database: OAIster - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Shatnawi, Sara"' showing total 3 results

1. ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic

Author: Koto, Fajri, Li, Haonan, Shatnawi, Sara, Doughman, Jad, Sadallah, Abdelrahman Boda, Alraeesi, Aisha, Almubarak, Khalid, Alyafeai, Zaid, Sengupta, Neha, Shehata, Shady, Habash, Nizar, Nakov, Preslav, and Baldwin, Timothy
Abstract: The focus of language model evaluation has transitioned towards reasoning and knowledge-intensive tasks, driven by advancements in pretraining large models. While state-of-the-art models are partially trained on large Arabic texts, evaluating their performance in Arabic remains challenging due to the limited availability of relevant datasets. To bridge this gap, we present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language, sourced from school exams across diverse educational levels in different countries spanning North Africa, the Levant, and the Gulf regions. Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA), and is carefully constructed by collaborating with native speakers in the region. Our comprehensive evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models. Notably, BLOOMZ, mT0, LLaMA2, and Falcon struggle to achieve a score of 50%, while even the top-performing Arabic-centric model only achieves a score of 62.3%.
Published: 2024

2. ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus

Author: Kulkarni, Ajinkya, Kulkarni, Atharva, Shatnawi, Sara Abedalmonem Mohammad, and Aldarmaki, Hanan
Abstract: At present, text-to-speech (TTS) systems that are trained with high-quality transcribed speech data using end-to-end neural models can generate speech that is intelligible, natural, and closely resembles human speech. These models are trained with relatively large single-speaker professionally recorded audio, typically extracted from audiobooks. Meanwhile, due to the scarcity of freely available speech corpora of this kind, a larger gap exists in Arabic TTS research and development. Most of the existing freely available Arabic speech corpora are not suitable for TTS training as they contain multi-speaker casual speech with variations in recording conditions and quality, whereas the corpora curated for speech synthesis are generally small in size and not suitable for training state-of-the-art end-to-end models. In a move towards filling this gap in resources, we present a speech corpus for Classical Arabic Text-to-Speech (ClArTTS) to support the development of end-to-end TTS systems for Arabic. The speech is extracted from a LibriVox audiobook, which is then processed, segmented, and manually transcribed and annotated. The final ClArTTS corpus contains about 12 hours of speech from a single male speaker sampled at 40100 Hz. In this paper, we describe the process of corpus creation and provide details of corpus statistics and a comparison with existing resources. Furthermore, we develop two TTS systems based on Grad-TTS and Glow-TTS and illustrate the performance of the resulting systems via subjective and objective evaluations. The corpus will be made publicly available at www.clartts.com for research purposes, along with the baseline TTS systems demo.
Published: 2023

3. Automatic Restoration of Diacritics for Speech Data Sets

Author: Shatnawi, Sara, Alqahtani, Sawsan, and Aldarmaki, Hanan
Abstract: Automatic text-based diacritic restoration models generally have high diacritic error rates when applied to speech transcripts as a result of domain and style shifts in spoken language. In this work, we explore the possibility of improving the performance of automatic diacritic restoration when applied to speech data by utilizing parallel spoken utterances. In particular, we use the pre-trained Whisper ASR model fine-tuned on relatively small amounts of diacritized Arabic speech data to produce rough diacritized transcripts for the speech utterances, which we then use as an additional input for diacritic restoration models. The proposed framework consistently improves diacritic restoration performance compared to text-only baselines. Our results highlight the inadequacy of current text-based diacritic restoration models for speech data sets and provide a new baseline for speech-based diacritic restoration.
Published: 2023
