14 results for "Moslem, Yasmin"
Search Results
2. SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
- Author
Lovenia, Holy, Mahendra, Rahmad, Akbar, Salsabil Maulana, Miranda, Lester James V., Santoso, Jennifer, Aco, Elyanah, Fadhilah, Akhdan, Mansurov, Jonibek, Imperial, Joseph Marvin, Kampman, Onno P., Moniz, Joel Ruben Antony, Habibi, Muhammad Ravi Shulthan, Hudi, Frederikus, Montalan, Railey, Ignatius, Ryan, Lopo, Joanito Agili, Nixon, William, Karlsson, Börje F., Jaya, James, Diandaru, Ryandito, Gao, Yuze, Amadeus, Patrick, Wang, Bin, Cruz, Jan Christian Blaise, Whitehouse, Chenxi, Parmonangan, Ivan Halim, Khelli, Maria, Zhang, Wenyu, Susanto, Lucky, Ryanda, Reynard Adha, Hermawan, Sonny Lazuardi, Velasco, Dan John, Kautsar, Muhammad Dehan Al, Hendria, Willy Fitra, Moslem, Yasmin, Flynn, Noah, Adilazuarda, Muhammad Farid, Li, Haochen, Lee, Johanes, Damanhuri, R., Sun, Shuo, Qorib, Muhammad Reza, Djanibekov, Amirbek, Leong, Wei Qi, Do, Quyet V., Muennighoff, Niklas, Pansuwan, Tanrada, Putra, Ilham Firdausi, Xu, Yan, Tai, Ngee Chia, Purwarianti, Ayu, Ruder, Sebastian, Tjhi, William, Limkonchotiwat, Peerat, Aji, Alham Fikri, Keh, Sedrick, Winata, Genta Indra, Zhang, Ruochen, Koto, Fajri, Yong, Zheng-Xin, and Cahyawijaya, Samuel
- Subjects
Computer Science - Computation and Language
- Abstract
Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that consolidates a comprehensive resource hub that fills the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA., Comment: https://seacrowd.github.io/ Accepted in EMNLP 2024
- Published
- 2024
3. Language Modelling Approaches to Adaptive Machine Translation
- Author
Moslem, Yasmin
- Subjects
Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Human-Computer Interaction, Computer Science - Information Retrieval
- Abstract
Consistency is a key requirement of high-quality translation. It is especially important to adhere to pre-approved terminology and adapt to corrected translations in domain-specific projects. Machine translation (MT) has achieved significant progress in the area of domain adaptation. However, in-domain data scarcity is common in translation settings, due to the lack of specialised datasets and terminology, or inconsistency and inaccuracy of available in-domain translations. In such scenarios where there is insufficient in-domain data to fine-tune MT models, producing translations that are consistent with the relevant context is challenging. While real-time adaptation can make use of smaller amounts of in-domain data to improve the translation on the fly, it remains challenging due to supported context limitations and efficiency constraints. Large language models (LLMs) have recently shown interesting capabilities of in-context learning, where they learn to replicate certain input-output text generation patterns, without further fine-tuning. Such capabilities have opened new horizons for domain-specific data augmentation and real-time adaptive MT. This work attempts to address two main relevant questions: 1) in scenarios involving human interaction and continuous feedback, can we employ language models to improve the quality of adaptive MT at inference time? and 2) in the absence of sufficient in-domain data, can we use pre-trained large-scale language models to improve the process of MT domain adaptation?, Comment: PhD thesis
- Published
- 2024
4. Fine-tuning Large Language Models for Adaptive Machine Translation
- Author
Moslem, Yasmin, Haque, Rejwanul, and Way, Andy
- Subjects
Computer Science - Computation and Language, Computer Science - Information Retrieval
- Abstract
This paper presents the outcomes of fine-tuning Mistral 7B, a general-purpose large language model (LLM), for adaptive machine translation (MT). The fine-tuning process involves utilising a combination of zero-shot and one-shot translation prompts within the medical domain. The primary objective is to enhance real-time adaptive MT capabilities of Mistral 7B, enabling it to adapt translations to the required domain at inference time. The results, particularly for Spanish-to-English MT, showcase the efficacy of the fine-tuned model, demonstrating quality improvements in both zero-shot and one-shot translation scenarios, surpassing Mistral 7B's baseline performance. Notably, the fine-tuned Mistral outperforms ChatGPT "gpt-3.5-turbo" in zero-shot translation while achieving comparable one-shot translation quality. Moreover, the zero-shot translation of the fine-tuned Mistral matches NLLB 3.3B's performance, and its one-shot translation quality surpasses that of NLLB 3.3B. These findings emphasise the significance of fine-tuning efficient LLMs like Mistral 7B to yield high-quality zero-shot translations comparable to task-oriented models like NLLB 3.3B. Additionally, the adaptive gains achieved in one-shot translation are comparable to those of commercial LLMs such as ChatGPT. Our experiments demonstrate that, with a relatively small dataset of 20,000 segments that incorporate a mix of zero-shot and one-shot prompts, fine-tuning significantly enhances Mistral's in-context learning ability, especially for real-time adaptive MT.
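To make the mixed zero-shot/one-shot prompting idea concrete, here is a minimal sketch of how such fine-tuning examples could be assembled. The prompt template, the `build_example` helper, and the sample medical segments are illustrative assumptions, not the exact data format released with the paper.

```python
# Illustrative sketch: building mixed zero-shot / one-shot fine-tuning
# examples for adaptive MT. Template and helper names are hypothetical.

def build_example(src, tgt, fuzzy=None, src_lang="Spanish", tgt_lang="English"):
    """Return a single training example as a prompt/completion pair."""
    if fuzzy:
        # One-shot: prepend a similar, pre-approved translation pair
        # (e.g. a fuzzy match from a translation memory).
        prompt = (
            f"{src_lang}: {fuzzy[0]}\n{tgt_lang}: {fuzzy[1]}\n"
            f"{src_lang}: {src}\n{tgt_lang}:"
        )
    else:
        # Zero-shot: translate the new sentence without any in-context example.
        prompt = f"{src_lang}: {src}\n{tgt_lang}:"
    return {"prompt": prompt, "completion": f" {tgt}"}

# Example usage with dummy medical-domain segments:
examples = [
    build_example("La dosis máxima es de 10 mg.", "The maximum dose is 10 mg."),
    build_example(
        "Agitar bien antes de usar.",
        "Shake well before use.",
        fuzzy=("Agitar antes de usar.", "Shake before use."),
    ),
]
```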
- Published
- 2023
5. Domain Terminology Integration into Machine Translation: Leveraging Large Language Models
- Author
Moslem, Yasmin, Romani, Gianfranco, Molaei, Mahdi, Haque, Rejwanul, Kelleher, John D., and Way, Andy
- Subjects
Computer Science - Computation and Language
- Abstract
This paper discusses the methods that we used for our submissions to the WMT 2023 Terminology Shared Task for German-to-English (DE-EN), English-to-Czech (EN-CS), and Chinese-to-English (ZH-EN) language pairs. The task aims to advance machine translation (MT) by challenging participants to develop systems that accurately translate technical terms, ultimately enhancing communication and understanding in specialised domains. To this end, we conduct experiments that utilise large language models (LLMs) for two purposes: generating synthetic bilingual terminology-based data, and post-editing translations generated by an MT model through incorporating pre-approved terms. Our system employs a four-step process: (i) using an LLM to generate bilingual synthetic data based on the provided terminology, (ii) fine-tuning a generic encoder-decoder MT model, with a mix of the terminology-based synthetic data generated in the first step and a randomly sampled portion of the original generic training data, (iii) generating translations with the fine-tuned MT model, and (iv) finally, leveraging an LLM for terminology-constrained automatic post-editing of the translations that do not include the required terms. The results demonstrate the effectiveness of our proposed approach in improving the integration of pre-approved terms into translations. The number of terms incorporated into the translations of the blind dataset increases from an average of 36.67% with the generic model to an average of 72.88% by the end of the process. In other words, successful utilisation of terms nearly doubles across the three language pairs., Comment: WMT 2023
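As an illustration of the final step (terminology-constrained automatic post-editing), the sketch below checks which pre-approved target terms are missing from an MT output and builds a prompt asking an LLM to re-insert them. The prompt wording, helper names, and the example term pair are assumptions for illustration only, not the submission's actual pipeline code.

```python
# Sketch of terminology-constrained post-editing: detect missing
# pre-approved terms, then ask an LLM to fix only those.

def missing_terms(translation: str, term_pairs: dict[str, str]) -> dict[str, str]:
    """Return the source->target term pairs whose target term is absent."""
    return {s: t for s, t in term_pairs.items() if t.lower() not in translation.lower()}

def build_postedit_prompt(source: str, translation: str, terms: dict[str, str]) -> str:
    term_list = "\n".join(f'- "{s}" must be translated as "{t}"' for s, t in terms.items())
    return (
        "Post-edit the translation so that it uses the required terminology, "
        "changing as little as possible.\n"
        f"Source: {source}\nTranslation: {translation}\nRequired terms:\n{term_list}\n"
        "Post-edited translation:"
    )

# Hypothetical DE-EN example:
terms = {"Schraubendreher": "screwdriver"}
mt_output = "Tighten the bolt with the screw tool."
todo = missing_terms(mt_output, terms)
if todo:
    prompt = build_postedit_prompt(
        "Ziehen Sie die Schraube mit dem Schraubendreher an.", mt_output, todo
    )
    # `prompt` would then be sent to the LLM of choice.
```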
- Published
- 2023
6. Adaptive Machine Translation with Large Language Models
- Author
Moslem, Yasmin, Haque, Rejwanul, Kelleher, John D., and Way, Andy
- Subjects
Computer Science - Computation and Language
- Abstract
Consistency is a key requirement of high-quality translation. It is especially important to adhere to pre-approved terminology and adapt to corrected translations in domain-specific projects. Machine translation (MT) has achieved significant progress in the area of domain adaptation. However, real-time adaptation remains challenging. Large-scale language models (LLMs) have recently shown interesting capabilities of in-context learning, where they learn to replicate certain input-output text generation patterns, without further fine-tuning. By feeding an LLM at inference time with a prompt that consists of a list of translation pairs, it can then simulate the domain and style characteristics. This work aims to investigate how we can utilize in-context learning to improve real-time adaptive MT. Our extensive experiments show promising results at translation time. For example, LLMs can adapt to a set of in-domain sentence pairs and/or terminology while translating a new sentence. We observe that the translation quality with few-shot in-context learning can surpass that of strong encoder-decoder MT systems, especially for high-resource languages. Moreover, we investigate whether we can combine MT from strong encoder-decoder models with fuzzy matches, which can further improve translation quality, especially for less supported languages. We conduct our experiments across five diverse language pairs, namely English-to-Arabic (EN-AR), English-to-Chinese (EN-ZH), English-to-French (EN-FR), English-to-Kinyarwanda (EN-RW), and English-to-Spanish (EN-ES)., Comment: EAMT 2023 - Research: technical
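A minimal sketch of the prompting idea described above: retrieve similar translation pairs (fuzzy matches) from a translation memory and place them in the prompt before the new source sentence. The difflib-based similarity measure, the tiny translation memory, and the prompt layout are stand-in assumptions rather than the paper's actual retrieval method or template.

```python
# Few-shot prompt building from fuzzy matches (illustrative only).
from difflib import SequenceMatcher

translation_memory = [
    ("The patient should fast overnight.", "Le patient doit être à jeun pendant la nuit."),
    ("Store below 25 °C.", "Conserver à une température inférieure à 25 °C."),
]

def fuzzy_matches(src: str, memory, k: int = 2):
    """Rank memory entries by surface similarity to the new source sentence."""
    scored = [(SequenceMatcher(None, src, s).ratio(), s, t) for s, t in memory]
    return [(s, t) for _, s, t in sorted(scored, reverse=True)[:k]]

def few_shot_prompt(src: str, memory) -> str:
    lines = [f"English: {s}\nFrench: {t}" for s, t in fuzzy_matches(src, memory)]
    lines.append(f"English: {src}\nFrench:")
    return "\n".join(lines)

print(few_shot_prompt("The patient should fast for 8 hours.", translation_memory))
```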
- Published
- 2023
7. Translation Word-Level Auto-Completion: What can we achieve out of the box?
- Author
Moslem, Yasmin, Haque, Rejwanul, and Way, Andy
- Subjects
Computer Science - Computation and Language, Computer Science - Human-Computer Interaction
- Abstract
Research on Machine Translation (MT) has achieved important breakthroughs in several areas. While there is much more to be done in order to build on this success, we believe that the language industry needs better ways to take full advantage of current achievements. Due to a combination of factors, including time, resources, and skills, businesses tend to apply pragmatism to their AI workflows. Hence, they concentrate more on outcomes, e.g. delivery, shipping, releases, and features, and adopt high-level working production solutions, where possible. Among the features thought to be helpful for translators are sentence-level and word-level translation auto-suggestion and auto-completion. Suggesting alternatives can inspire translators and limit their need to refer to external resources, which hopefully boosts their productivity. This work describes our submissions to WMT's shared task on word-level auto-completion, for the Chinese-to-English, English-to-Chinese, German-to-English, and English-to-German language directions. We investigate the possibility of using pre-trained models and out-of-the-box features from available libraries. We employ random sampling to generate diverse alternatives, which yields good results. Furthermore, we introduce our open-source API, based on CTranslate2, to serve translations, auto-suggestions, and auto-completions., Comment: WMT 2022
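The sketch below shows one way CTranslate2 can be used out of the box for word-level completions: constrain decoding with the already-translated prefix via `target_prefix` and draw several random samples to collect diverse continuations. The model directory, whitespace tokens, and the filtering by typed characters are illustrative assumptions (a real setup would apply the model's subword tokenization), and this is not the exact WMT submission pipeline.

```python
# Word-level auto-completion with CTranslate2 random sampling (sketch).
import ctranslate2

translator = ctranslate2.Translator("ende_ctranslate2_model", device="cpu")  # hypothetical model dir

def complete_word(src_tokens, prefix_tokens, typed, samples=8):
    """Suggest next target words that start with the characters typed so far."""
    candidates = set()
    for _ in range(samples):
        results = translator.translate_batch(
            [src_tokens],
            target_prefix=[prefix_tokens],  # force the already-translated prefix
            beam_size=1,                    # required for random sampling
            sampling_topk=10,               # sample from the top-10 tokens
            sampling_temperature=1.0,
        )
        hypothesis = results[0].hypotheses[0]
        continuation = hypothesis[len(prefix_tokens):]
        if continuation and continuation[0].startswith(typed):
            candidates.add(continuation[0])
    return sorted(candidates)

# Example call (commented out because the model directory is hypothetical):
# complete_word(["Das", "ist", "ein", "Test", "."], ["This", "is"], "a")
```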
- Published
- 2022
8. Domain-Specific Text Generation for Machine Translation
- Author
Moslem, Yasmin, Haque, Rejwanul, Kelleher, John D., and Way, Andy
- Subjects
Computer Science - Computation and Language
- Abstract
Preservation of domain knowledge from the source to target is crucial in any translation workflow. It is common in the translation industry to receive highly specialized projects, where there is hardly any parallel in-domain data. In such scenarios where there is insufficient in-domain data to fine-tune Machine Translation (MT) models, producing translations that are consistent with the relevant context is challenging. In this work, we propose a novel approach to domain adaptation leveraging state-of-the-art pretrained language models (LMs) for domain-specific data augmentation for MT, simulating the domain characteristics of either (a) a small bilingual dataset, or (b) the monolingual source text to be translated. Combining this idea with back-translation, we can generate huge amounts of synthetic bilingual in-domain data for both use cases. For our investigation, we use the state-of-the-art Transformer architecture. We employ mixed fine-tuning to train models that significantly improve translation of in-domain texts. More specifically, in both scenarios, our proposed methods achieve improvements of approximately 5-6 BLEU and 2-3 BLEU, respectively, on the Arabic-to-English and English-to-Arabic language pairs. Furthermore, the outcome of human evaluation corroborates the automatic evaluation results., Comment: AMTA 2022 - MT Research Track
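A minimal sketch of the "mixed fine-tuning" data preparation mentioned above: oversample the (synthetic) in-domain bilingual data and mix it with a sample of the generic training data before continuing training. The oversampling factor, sample size, and helper name are illustrative assumptions.

```python
# Mixed fine-tuning data preparation (illustrative sketch).
import random

def mix_for_finetuning(in_domain, generic, in_domain_factor=3, generic_sample=None, seed=1):
    """Return a shuffled list of (src, tgt) pairs for mixed fine-tuning."""
    rng = random.Random(seed)
    generic_part = generic if generic_sample is None else rng.sample(generic, generic_sample)
    mixed = in_domain * in_domain_factor + generic_part  # oversample in-domain data
    rng.shuffle(mixed)
    return mixed

synthetic_in_domain = [("جملة في المجال الطبي", "A sentence in the medical domain")]
generic_corpus = [("مرحبا بالعالم", "Hello world")] * 1000
train_set = mix_for_finetuning(synthetic_in_domain, generic_corpus, generic_sample=500)
```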
- Published
- 2022
9. Preparing an Endangered Language for the Digital Age: The Case of Judeo-Spanish
- Author
Öktem, Alp, Zevallos, Rodolfo, Moslem, Yasmin, Öztürk, Güneş, and Şarhon, Karen
- Subjects
Computer Science - Computation and Language
- Abstract
We develop machine translation and speech synthesis systems to complement the efforts of revitalizing Judeo-Spanish, the exiled language of Sephardic Jews, which survived for centuries, but now faces the threat of extinction in the digital age. Building on resources created by the Sephardic community of Turkey and elsewhere, we create corpora and tools that would help preserve this language for future generations. For machine translation, we first develop a Spanish to Judeo-Spanish rule-based machine translation system, in order to generate large volumes of synthetic parallel data in the relevant language pairs: Turkish, English and Spanish. Then, we train baseline neural machine translation engines using this synthetic data and authentic parallel data created from translations by the Sephardic community. For text-to-speech synthesis, we present a 3.5 hour single speaker speech corpus for building a neural speech synthesis engine. Resources, model weights and online inference engines are shared publicly.
- Published
- 2022
10. Domain Terminology Integration into Machine Translation: Leveraging Large Language Models
- Author
Moslem, Yasmin, Romani, Gianfranco, Molaei, Mahdi, Kelleher, John D., Haque, Rejwanul, and Way, Andy
- Published
- 2023
11. Terminology-aware sentence mining for NMT domain adaptation: ADAPT’s submission to the Adap-MT 2020 English-to-Hindi AI translation shared task
- Author
Haque, Rejwanul, Moslem, Yasmin, and Way, Andy
- Subjects
Artificial intelligence; Machine learning; Domain Adaptation, Low-resource Languages, Terminology-aware NMT; Computational linguistics; Machine translating
- Abstract
This paper describes the ADAPT Centre’s submission to the Adap-MT 2020 AI Translation Shared Task for English-to-Hindi. The neural machine translation (NMT) systems that we built to translate AI domain texts are state-of-the-art Transformer models. In order to improve the translation quality of our NMT systems, we made use of both in-domain and out-of-domain data for training and employed different fine-tuning techniques for adapting our NMT systems to this task, e.g. mixed fine-tuning and on-the-fly self-training. For this, we mined parallel sentence pairs and monolingual sentences from large out-of-domain data, and the mining process was facilitated through automatic extraction of terminology from the in-domain data. This paper outlines the experiments we carried out for this task and reports the performance of our NMT systems on the evaluation test set.
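The following sketch illustrates the terminology-facilitated mining idea: keep out-of-domain sentence pairs whose source side contains in-domain terms. Term extraction itself is out of scope here; the term list, threshold, and dummy data are illustrative assumptions rather than the system's actual mining setup.

```python
# Terminology-aware selection of out-of-domain sentence pairs (sketch).

def mine_by_terms(bitext, terms, min_hits=1):
    """Select (src, tgt) pairs whose source contains at least min_hits in-domain terms."""
    term_set = {t.lower() for t in terms}
    selected = []
    for src, tgt in bitext:
        hits = sum(1 for t in term_set if t in src.lower())
        if hits >= min_hits:
            selected.append((src, tgt))
    return selected

ai_terms = ["neural network", "machine learning", "training data"]
out_of_domain = [
    ("The neural network was trained on noisy data.", "tgt-1"),
    ("The weather was pleasant in Dublin.", "tgt-2"),
]
mined = mine_by_terms(out_of_domain, ai_terms)  # keeps only the first pair
```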
- Published
- 2020
12. Arabisc: context-sensitive neural spelling checker
- Author
Moslem, Yasmin, Haque, Rejwanul, and Way, Andy
- Subjects
Spelling Checking, Spelling Correction, Computer engineering, Machine learning, Computational linguistics
- Abstract
Traditional statistical approaches to spelling correction usually consist of two consecutive processes – error detection and correction – and they are generally computationally intensive. Current state-of-the-art neural spelling correction models usually attempt to correct spelling errors directly over an entire sentence, which, as a consequence, lacks control of the process, e.g. they are prone to overcorrection. In recent years, recurrent neural networks (RNNs), in particular long short-term memory (LSTM) hidden units, have proven increasingly popular and powerful models for many natural language processing (NLP) problems. Accordingly, we made use of a bidirectional LSTM language model (LM) for our context-sensitive spelling detection and correction model which is shown to have much control over the correction process. While the use of LMs for spelling checking and correction is not new to this line of NLP research, our proposed approach makes better use of the rich neighbouring context, not only from before the word to be corrected, but also after it, via a dual-input deep LSTM network. Although in theory our proposed approach can be applied to any language, we carried out our experiments on Arabic, which we believe adds additional value given the fact that there are limited linguistic resources readily available in Arabic in comparison to many languages. Our experimental results demonstrate that the proposed methods are effective in both improving the quality of correction suggestions and minimising overcorrection.
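A rough Keras sketch in the spirit of the dual-input idea above: one LSTM reads the context before the target position, another reads the (reversed) context after it, and the merged states score candidate words for that position. Vocabulary size, dimensions, and layer choices are illustrative assumptions, not the paper's exact architecture.

```python
# Dual-input LSTM language model for context-sensitive word prediction (sketch).
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, embed_dim, hidden_dim, ctx_len = 30000, 128, 256, 10

left_in = keras.Input(shape=(ctx_len,), dtype="int32", name="left_context")    # words before the target
right_in = keras.Input(shape=(ctx_len,), dtype="int32", name="right_context")  # words after, reversed

embed = layers.Embedding(vocab_size, embed_dim, mask_zero=True)  # shared embedding
left_state = layers.LSTM(hidden_dim)(embed(left_in))
right_state = layers.LSTM(hidden_dim)(embed(right_in))

merged = layers.concatenate([left_state, right_state])
probs = layers.Dense(vocab_size, activation="softmax", name="target_word")(merged)

model = keras.Model([left_in, right_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# At correction time, candidate words for a flagged position can be ranked
# by the probability the model assigns given both contexts.
```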
- Published
- 2020
13. The ADAPT system description for the STAPLE 2020 English-to-Portuguese translation task
- Author
Haque, Rejwanul, Moslem, Yasmin, and Way, Andy
- Abstract
This paper describes the ADAPT Centre’s submission to STAPLE (Simultaneous Translation and Paraphrase for Language Education) 2020, a shared task of the 4th Workshop on Neural Generation and Translation (WNGT), for the English-to-Portuguese translation task. In this shared task, the participants were asked to produce high-coverage sets of plausible translations given English prompts (input source sentences). We present our English-to-Portuguese machine translation (MT) models that were built by applying various strategies, e.g. data and sentence selection, monolingual MT for generating alternative translations, and combining multiple n-best translations. Our experiments show that adding the aforementioned techniques to the baseline yields excellent performance in the English-to-Portuguese translation task.
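A simple sketch of one of the steps mentioned above: merging n-best outputs from several systems into a deduplicated, high-coverage set of candidate translations per prompt. The interleaving strategy and example hypotheses are illustrative assumptions; scoring and filtering details are omitted.

```python
# Combining n-best lists into a high-coverage candidate set (sketch).

def combine_nbest(nbest_lists, limit=None):
    """Merge several ranked hypothesis lists, keeping order of first appearance."""
    seen, combined = set(), []
    # Interleave lists rank by rank so each system contributes its best hypotheses first.
    for rank in range(max(len(lst) for lst in nbest_lists)):
        for lst in nbest_lists:
            if rank < len(lst) and lst[rank] not in seen:
                seen.add(lst[rank])
                combined.append(lst[rank])
    return combined[:limit] if limit else combined

system_a = ["eu gosto de maçãs", "gosto de maçãs"]
system_b = ["eu gosto de maçãs", "eu adoro maçãs"]
print(combine_nbest([system_a, system_b]))
```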
- Published
- 2020
14. The ADAPT System Description for the STAPLE 2020 English-to-Portuguese Translation Task
- Author
Haque, Rejwanul, Moslem, Yasmin, and Way, Andy
- Published
- 2020