Author: "Shterionov, Dimitar" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Shterionov, Dimitar"' showing total 168 results

1. Guiding In-Context Learning of LLMs through Quality Estimation for Machine Translation

Author: Sharami, Javad Pourmostafa Roshan, Shterionov, Dimitar, and Spronck, Pieter
Subjects: Computer Science - Computation and Language
Abstract: The quality of output from large language models (LLMs), particularly in machine translation (MT), is closely tied to the quality of in-context examples (ICEs) provided along with the query, i.e., the text to translate. The effectiveness of these ICEs is influenced by various factors, such as the domain of the source text, the order in which the ICEs are presented, the number of these examples, and the prompt templates used. Naturally, selecting the most impactful ICEs depends on understanding how these affect the resulting translation quality, which ultimately relies on translation references or human judgment. This paper presents a novel methodology for in-context learning (ICL) that relies on a search algorithm guided by domain-specific quality estimation (QE). Leveraging the XGLM model, our methodology estimates the resulting translation quality without the need for translation references, selecting effective ICEs for MT to maximize translation quality. Our results demonstrate significant improvements over existing ICL methods and higher translation performance compared to fine-tuning a pre-trained language model (PLM), specifically mBART-50.
Comment: Camera-ready version of the paper for the Association for Machine Translation in the Americas (AMTA), including the link to the paper's repository
Published: 2024

2. Machine translation from signed to spoken languages: state of the art and challenges

Author: De Coster, Mathieu, Shterionov, Dimitar, Van Herreweghe, Mieke, and Dambre, Joni
Published: 2024
Full Text: View/download PDF

3. How It Started and How It’s Going: Sign Language Machine Translation and Engagement with Deaf Communities Over the Past 25 Years

Author: Leeson, Lorraine, Morrissey, Sara, Shterionov, Dimitar, Stein, Daniel, van den Heuvel, Henk, Way, Andy, Way, Andy, Editor-in-Chief, Bandyopadhyay, Sivaji, Editorial Board Member, Leeson, Lorraine, editor, and Shterionov, Dimitar, editor
Published: 2024
Full Text: View/download PDF

4. The Pipeline of Sign Language Machine Translation

Author: Shterionov, Dimitar, Leeson, Lorraine, Way, Andy, Way, Andy, Editor-in-Chief, Bandyopadhyay, Sivaji, Editorial Board Member, Leeson, Lorraine, editor, and Shterionov, Dimitar, editor
Published: 2024
Full Text: View/download PDF

5. Tailoring Domain Adaptation for Machine Translation Quality Estimation

Author: Sharami, Javad Pourmostafa Roshan, Shterionov, Dimitar, Blain, Frédéric, Vanmassenhove, Eva, De Sisto, Mirella, Emmery, Chris, and Spronck, Pieter
Subjects: Computer Science - Computation and Language
Abstract: While quality estimation (QE) can play an important role in the translation process, its effectiveness relies on the availability and quality of training data. For QE in particular, high-quality labeled data is often lacking due to the high cost and effort associated with labeling such data. Aside from the data scarcity challenge, QE models should also be generalizable, i.e., they should be able to handle data from different domains, both generic and specific. To alleviate these two main issues -- data scarcity and domain mismatch -- this paper combines domain adaptation and data augmentation within a robust QE system. Our method first trains a generic QE model and then fine-tunes it on a specific domain while retaining generic knowledge. Our results show a significant improvement for all the language pairs investigated, better cross-lingual inference, and a superior performance in zero-shot learning scenarios as compared to state-of-the-art baselines.
Comment: Accepted to EAMT 2023 (main)
Published: 2023

6. Evaluating the Effectiveness of Pre-trained Language Models in Predicting the Helpfulness of Online Product Reviews

Author: Boluki, Ali, Sharami, Javad Pourmostafa Roshan, and Shterionov, Dimitar
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Businesses and customers can gain valuable information from product reviews. The sheer number of reviews often necessitates ranking them based on their potential helpfulness. However, only a few reviews ever receive any helpfulness votes on online marketplaces. Sorting all reviews based on the few existing votes can cause helpful reviews to go unnoticed because of the limited attention span of readers. The problem of review helpfulness prediction is even more important for higher review volumes, and newly written reviews or launched products. In this work we compare the use of RoBERTa and XLM-R language models to predict the helpfulness of online product reviews. The contributions of our work in relation to literature include extensively investigating the efficacy of state-of-the-art language models -- both monolingual and multilingual -- against a robust baseline, taking ranking metrics into account when assessing these approaches, and assessing multilingual models for the first time. We employ the Amazon review dataset for our experiments. According to our study on several product categories, multilingual and monolingual pre-trained language models outperform the baseline that utilizes random forest with handcrafted features as much as 23% in RMSE. Pre-trained language models reduce the need for complex text feature engineering. However, our results suggest that pre-trained multilingual models may not be used for fine-tuning only one language. We assess the performance of language models with and without additional features. Our results show that including additional features like product rating by the reviewer can further help the predictive methods.
Published: 2023
Full Text: View/download PDF

7. Evaluating the Effectiveness of Pre-trained Language Models in Predicting the Helpfulness of Online Product Reviews

Author: Boluki, Ali, Pourmostafa Roshan Sharami, Javad, Shterionov, Dimitar, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, and Arai, Kohei, editor
Published: 2024
Full Text: View/download PDF

8. Correction to: Machine translation from signed to spoken languages: state of the art and challenges

Author: De Coster, Mathieu, Shterionov, Dimitar, Van Herreweghe, Mieke, and Dambre, Joni
Published: 2024
Full Text: View/download PDF

9. Machine Translation from Signed to Spoken Languages: State of the Art and Challenges

Author: De Coster, Mathieu, Shterionov, Dimitar, Van Herreweghe, Mieke, and Dambre, Joni
Subjects: Computer Science - Computation and Language
Abstract: Automatic translation from signed to spoken languages is an interdisciplinary research domain, lying on the intersection of computer vision, machine translation and linguistics. Nevertheless, research in this domain is performed mostly by computer scientists in isolation. As the domain is becoming increasingly popular - the majority of scientific papers on the topic of sign language translation have been published in the past three years - we provide an overview of the state of the art as well as some required background in the different related disciplines. We give a high-level introduction to sign language linguistics and machine translation to illustrate the requirements of automatic sign language translation. We present a systematic literature review to illustrate the state of the art in the domain and then, harking back to the requirements, lay out several challenges for future research. We find that significant advances have been made on the shoulders of spoken language machine translation research. However, current approaches are often not linguistically motivated or are not adapted to the different input modality of sign languages. We explore challenges related to the representation of sign language data, the collection of datasets, the need for interdisciplinary research and requirements for moving beyond research, towards applications. Based on our findings, we advocate for interdisciplinary research and to base future research on linguistic analysis of sign languages. Furthermore, the inclusion of deaf and hearing end users of sign language translation applications in use case identification, data collection and evaluation is of the utmost importance in the creation of useful sign language translation models. We recommend iterative, human-in-the-loop, design and development of sign language translation models.
Comment: This is the version of the article submitted to peer review to Universal Access in the Information Society. Please refer to "De Coster, M., Shterionov, D., Van Herreweghe, M. et al. Machine translation from signed to spoken languages: state of the art and challenges. Univ Access Inf Soc (2023)." for the published and updated version
Published: 2022
Full Text: View/download PDF

10. The Ecological Footprint of Neural Machine Translation Systems

Author: Shterionov, Dimitar and Vanmassenhove, Eva
Subjects: Computer Science - Computation and Language
Abstract: Over the past decade, deep learning (DL) has led to significant advancements in various fields of artificial intelligence, including machine translation (MT). These advancements would not be possible without the ever-growing volumes of data and the hardware that allows large DL models to be trained efficiently. Due to the large number of computing cores as well as dedicated memory, graphics processing units (GPUs) are a more effective hardware solution for training and inference with DL models than central processing units (CPUs). However, the former is very power demanding. The electrical power consumption has economic as well as ecological implications. This chapter focuses on the ecological footprint of neural MT systems. It starts from the power drain during the training of and the inference with neural MT models and moves towards the environmental impact, in terms of carbon dioxide emissions. Different architectures (RNN and Transformer) and different GPUs (consumer-grade NVidia 1080Ti and workstation-grade NVidia P100) are compared. Then, the overall CO2 offload is calculated for Ireland and the Netherlands. The NMT models and their ecological impact are compared to common household appliances to draw a clearer picture. The last part of this chapter analyses quantization, a technique for reducing the size and complexity of models, as a way to reduce power consumption. As quantized models can run on CPUs, they present a power-efficient inference solution without depending on a GPU.
Comment: 25 pages, 3 figures, 10 tables
Published: 2022

11. Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts

Author: Sharami, Javad Pourmostafa Roshan, Shterionov, Dimitar, and Spronck, Pieter
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Continuously-growing data volumes lead to larger generic models. Specific use-cases are usually left out, since generic models tend to perform poorly in domain-specific cases. Our work addresses this gap with a method for selecting in-domain data from generic-domain (parallel text) corpora, for the task of machine translation. The proposed method ranks sentences in parallel general-domain data according to their cosine similarity with a monolingual domain-specific data set. We then select the top K sentences with the highest similarity score to train a new machine translation system tuned to the specific in-domain data. Our experimental results show that models trained on this in-domain data outperform models trained on generic or a mixture of generic and domain data. That is, our method selects high-quality domain-specific training instances at low computational cost and data size.
Comment: Accepted to the CLIN Journal on Dec 6, 2021 (Camera-ready Version)
Published: 2021
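
The ranking step described in the abstract above (score each general-domain sentence by cosine similarity to in-domain text, then keep the top K) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bag-of-words vectors stand in for whatever sentence representation the paper actually uses, taking the maximum similarity over in-domain sentences is an assumed scoring rule, and the names `bow`, `cosine`, `select_top_k` and the sample sentences are all hypothetical.

```python
import math
from collections import Counter


def bow(sentence):
    """Bag-of-words vector (token -> count); an assumed stand-in for real sentence embeddings."""
    return Counter(sentence.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[token] for token, count in u.items() if token in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_top_k(parallel_pairs, in_domain_sentences, k):
    """Rank (source, target) pairs by the highest cosine similarity between
    the source side and any in-domain sentence, then keep the top k pairs."""
    domain_vectors = [bow(s) for s in in_domain_sentences]
    scored = []
    for source, target in parallel_pairs:
        vector = bow(source)
        score = max((cosine(vector, d) for d in domain_vectors), default=0.0)
        scored.append((score, source, target))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(source, target) for _, source, target in scored[:k]]


# Hypothetical toy data: a generic parallel corpus and one in-domain sentence.
generic_pairs = [
    ("the patient received a dose of insulin", "target sentence 1"),
    ("the football match ended in a draw", "target sentence 2"),
    ("the doctor adjusted the insulin dose", "target sentence 3"),
]
in_domain = ["insulin dose for the patient"]

selected = select_top_k(generic_pairs, in_domain, k=2)
for source, target in selected:
    print(source, "|", target)
```

Only the source side is scored here, so the selection needs monolingual in-domain text alone, which matches the setting the abstract describes.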

12. NeuTral Rewriter: A Rule-Based and Neural Approach to Automatic Rewriting into Gender-Neutral Alternatives

Author: Vanmassenhove, Eva, Emmery, Chris, and Shterionov, Dimitar
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Recent years have seen an increasing need for gender-neutral and inclusive language. Within the field of NLP, there are various mono- and bilingual use cases where gender inclusive language is appropriate, if not preferred due to ambiguity or uncertainty in terms of the gender of referents. In this work, we present a rule-based and a neural approach to gender-neutral rewriting for English along with manually curated synthetic data (WinoBias+) and natural data (OpenSubtitles and Reddit) benchmarks. A detailed manual and automatic evaluation highlights how our NeuTral Rewriter, trained on data generated by the rule-based approach, obtains word error rates (WER) below 0.18% on synthetic, in-domain and out-domain test sets.
Published: 2021

13. Machine Translationese: Effects of Algorithmic Bias on Linguistic Complexity in Machine Translation

Author: Vanmassenhove, Eva, Shterionov, Dimitar, and Gwilliam, Matthew
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computers and Society
Abstract: Recent studies in the field of Machine Translation (MT) and Natural Language Processing (NLP) have shown that existing models amplify biases observed in the training data. The amplification of biases in language technology has mainly been examined with respect to specific phenomena, such as gender bias. In this work, we go beyond the study of gender in MT and investigate how bias amplification might affect language in a broader sense. We hypothesize that the 'algorithmic bias', i.e. an exacerbation of frequently observed patterns in combination with a loss of less frequent ones, not only exacerbates societal biases present in current datasets but could also lead to an artificially impoverished language: 'machine translationese'. We assess the linguistic richness (on a lexical and morphological level) of translations created by different data-driven MT paradigms - phrase-based statistical (PB-SMT) and neural MT (NMT). Our experiments show that there is a loss of lexical and morphological richness in the translations produced by all investigated MT paradigms for two language pairs (EN<=>FR and EN<=>ES).
Published: 2021

14. The Ecological Footprint of Neural Machine Translation Systems

Author: Shterionov, Dimitar, Vanmassenhove, Eva, Way, Andy, Editor-in-Chief, Bandyopadhyay, Sivaji, Editorial Board Member, Moniz, Helena, editor, and Parra Escartín, Carla, editor
Published: 2023
Full Text: View/download PDF

15. Selecting Backtranslated Data from Multiple Sources for Improved Neural Machine Translation

Author: Soto, Xabier, Shterionov, Dimitar, Poncelas, Alberto, and Way, Andy
Subjects: Computer Science - Computation and Language
Abstract: Machine translation (MT) has benefited from using synthetic training data originating from translating monolingual corpora, a technique known as backtranslation. Combining backtranslated data from different sources has led to better results than when using such data in isolation. In this work we analyse the impact that data translated with rule-based, phrase-based statistical and neural MT systems has on new MT systems. We use a real-world low-resource use-case (Basque-to-Spanish in the clinical domain) as well as a high-resource language pair (German-to-English) to test different scenarios with backtranslation and employ data selection to optimise the synthetic corpora. We exploit different data selection strategies in order to reduce the amount of data used, while at the same time maintaining high-quality MT systems. We further tune the data selection method by taking into account the quality of the MT systems used for backtranslation and lexical diversity of the resulting corpora. Our experiments show that incorporating backtranslated data from different sources can be beneficial, and that availing of data selection can yield improved performance.
Published: 2020

16. Special issue on sign language translation and avatar technology

Author: Wolfe, Rosalee, Braffort, Annelies, Efthimiou, Eleni, Fotinea, Evita, Hanke, Thomas, and Shterionov, Dimitar
Published: 2023
Full Text: View/download PDF

17. Combining SMT and NMT Back-Translated Data for Efficient NMT

Author: Poncelas, Alberto, Popovic, Maja, Shterionov, Dimitar, Wenniger, Gideon Maillette de Buy, and Way, Andy
Subjects: Computer Science - Computation and Language
Abstract: Neural Machine Translation (NMT) models achieve their best performance when large sets of parallel data are used for training. Consequently, techniques for augmenting the training set have become popular recently. One of these methods is back-translation (Sennrich et al., 2016), which consists of generating synthetic sentences by translating a set of monolingual, target-language sentences using a Machine Translation (MT) model. Generally, NMT models are used for back-translation. In this work, we analyze the performance of models when the training data is extended with synthetic data using different MT approaches. In particular we investigate back-translated data generated not only by NMT but also by Statistical Machine Translation (SMT) models and combinations of both. The results reveal that the models achieve the best performances when the training set is augmented with back-translated data created by merging different MT approaches.
Published: 2019

18. Lost in Translation: Loss and Decay of Linguistic Richness in Machine Translation

Author: Vanmassenhove, Eva, Shterionov, Dimitar, and Way, Andy
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: This work presents an empirical approach to quantifying the loss of lexical richness in Machine Translation (MT) systems compared to Human Translation (HT). Our experiments show how current MT systems indeed fail to render the lexical diversity of human generated or translated text. The inability of MT systems to generate diverse outputs and its tendency to exacerbate already frequent patterns while ignoring less frequent ones, might be the underlying cause for, among others, the currently heavily debated issues related to gender biased output. Can we indeed, aside from biased data, talk about an algorithm that exacerbates seen biases?
Comment: Accepted for publication at the 17th Machine Translation Summit (MTSummit2019), Dublin, Ireland, August 2019
Published: 2019

19. ABI Neural Ensemble Model for Gender Prediction: Adapt Bar-Ilan Submission for the CLIN29 Shared Task on Gender Prediction

Author: Vanmassenhove, Eva, Moryossef, Amit, Poncelas, Alberto, Way, Andy, and Shterionov, Dimitar
Subjects: Computer Science - Computation and Language
Abstract: We present our system for the CLIN29 shared task on cross-genre gender detection for Dutch. We experimented with a multitude of neural models (CNN, RNN, LSTM, etc.), more "traditional" models (SVM, RF, LogReg, etc.), different feature sets as well as data pre-processing. The final results suggested that using tokenized, non-lowercased data works best for most of the neural models, while a combination of word clusters, character trigrams and word lists showed to be most beneficial for the majority of the more "traditional" (that is, non-neural) models, beating features used in previous tasks such as n-grams, character n-grams, part-of-speech tags and combinations thereof. In contradiction with the results described in previous comparable shared tasks, our neural models performed better than our best traditional approaches with our best feature set-up. Our final model consisted of a weighted ensemble model combining the top 25 models. Our final model won both the in-domain gender prediction task and the cross-genre challenge, achieving an average accuracy of 64.93% on the in-domain gender prediction task, and 56.26% on cross-genre gender prediction.
Comment: Conference: Computational Linguistics of the Netherlands CLIN29
Published: 2019

20. Investigating Backtranslation in Neural Machine Translation

Author: Poncelas, Alberto, Shterionov, Dimitar, Way, Andy, Wenniger, Gideon Maillette de Buy, and Passban, Peyman
Subjects: Computer Science - Computation and Language
Abstract: A prerequisite for training corpus-based machine translation (MT) systems -- either Statistical MT (SMT) or Neural MT (NMT) -- is the availability of high-quality parallel data. This is arguably more important today than ever before, as NMT has been shown in many studies to outperform SMT, but mostly when large parallel corpora are available; in cases where data is limited, SMT can still outperform NMT. Recently researchers have shown that back-translating monolingual data can be used to create synthetic parallel corpora, which in turn can be used in combination with authentic parallel data to train a high-quality NMT system. Given that large collections of new parallel text become available only quite rarely, backtranslation has become the norm when building state-of-the-art NMT systems, especially in resource-poor scenarios. However, we assert that there are many unknown factors regarding the actual effects of back-translated data on the translation capabilities of an NMT model. Accordingly, in this work we investigate how using back-translated data as a training corpus -- both as a separate standalone dataset as well as combined with human-generated parallel data -- affects the performance of an NMT model. We use incrementally larger amounts of back-translated data to train a range of NMT systems for German-to-English, and analyse the resulting translation performance.
Published: 2018

21. A roadmap to neural automatic post-editing: an empirical approach

Author: Shterionov, Dimitar, do Carmo, Félix, Moorkens, Joss, Hossari, Murhaf, Wagner, Joachim, Paquin, Eric, Schmidtke, Dag, Groves, Declan, and Way, Andy
Published: 2020

22. A review of the state-of-the-art in automatic post-editing

Author: do Carmo, Félix, Shterionov, Dimitar, Moorkens, Joss, Wagner, Joachim, Hossari, Murhaf, Paquin, Eric, Schmidtke, Dag, Groves, Declan, and Way, Andy
Published: 2021
Full Text: View/download PDF

23. Inference and learning in probabilistic logic programs using weighted Boolean formulas

Author: Fierens, Daan, Van den Broeck, Guy, Renkens, Joris, Shterionov, Dimitar, Gutmann, Bernd, Thon, Ingo, Janssens, Gerda, and De Raedt, Luc
Subjects: probabilistic logic programming, probabilistic inference, parameter learning, cs.AI, cs.LG, cs.LO, Artificial Intelligence and Image Processing, Computation Theory and Mathematics, Computer Software, Computation Theory & Mathematics
Abstract: Probabilistic logic programs are logic programs in which some of the facts are annotated with probabilities. This paper investigates how classical inference and learning tasks known from the graphical model community can be tackled for probabilistic logic programs. Several such tasks, such as computing the marginals, given evidence and learning from (partial) interpretations, have not really been addressed for probabilistic logic programs before. The first contribution of this paper is a suite of efficient algorithms for various inference tasks. It is based on the conversion of the program and the queries and evidence to a weighted Boolean formula. This allows us to reduce inference tasks to well-studied tasks, such as weighted model counting, which can be solved using state-of-the-art methods known from the graphical model and knowledge compilation literature. The second contribution is an algorithm for parameter estimation in the learning from interpretations setting. The algorithm employs expectation-maximization, and is built on top of the developed inference algorithms. The proposed approach is experimentally evaluated. The results show that the inference algorithms improve upon the state of the art in probabilistic logic programming, and that it is indeed possible to learn the parameters of a probabilistic logic program from interpretations.
Published: 2015
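
The reduction mentioned in the abstract above, from inference over probabilistic facts to weighted model counting on a Boolean formula, can be illustrated with a small sketch. This is not the paper's algorithm: it enumerates every truth assignment by brute force, whereas the paper relies on knowledge compilation and other state-of-the-art methods. The function names and the burglary/earthquake example program are hypothetical.

```python
from itertools import product


def weighted_model_count(probs, formula):
    """Sum the weights of all truth assignments to the probabilistic facts
    that satisfy `formula`. Brute force: exponential in the number of facts."""
    names = sorted(probs)
    total = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        if formula(world):
            weight = 1.0
            for name in names:
                # A true fact contributes p, a false fact contributes 1 - p.
                weight *= probs[name] if world[name] else 1.0 - probs[name]
            total += weight
    return total


def conditional_probability(probs, query, evidence):
    """P(query | evidence) as a ratio of two weighted model counts."""
    joint = weighted_model_count(probs, lambda w: query(w) and evidence(w))
    return joint / weighted_model_count(probs, evidence)


# Hypothetical program: two probabilistic facts and the rules
#   alarm :- burglary.    alarm :- earthquake.
facts = {"burglary": 0.1, "earthquake": 0.2}


def alarm(world):
    return world["burglary"] or world["earthquake"]


p_alarm = weighted_model_count(facts, alarm)
p_burglary_given_alarm = conditional_probability(
    facts, lambda w: w["burglary"], alarm
)
print(round(p_alarm, 4), round(p_burglary_given_alarm, 4))
```

Computing a marginal given evidence then takes two counts, one for the query together with the evidence and one for the evidence alone.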

24. Inference and learning in probabilistic logic programs using weighted Boolean formulas

Author: Fierens, Daan, Broeck, Guy Van den, Renkens, Joris, Shterionov, Dimitar, Gutmann, Bernd, Thon, Ingo, Janssens, Gerda, and De Raedt, Luc
Subjects: Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Logic in Computer Science
Abstract: Probabilistic logic programs are logic programs in which some of the facts are annotated with probabilities. This paper investigates how classical inference and learning tasks known from the graphical model community can be tackled for probabilistic logic programs. Several such tasks such as computing the marginals given evidence and learning from (partial) interpretations have not really been addressed for probabilistic logic programs before. The first contribution of this paper is a suite of efficient algorithms for various inference tasks. It is based on a conversion of the program and the queries and evidence to a weighted Boolean formula. This allows us to reduce the inference tasks to well-studied tasks such as weighted model counting, which can be solved using state-of-the-art methods known from the graphical model and knowledge compilation literature. The second contribution is an algorithm for parameter estimation in the learning from interpretations setting. The algorithm employs Expectation Maximization, and is built on top of the developed inference algorithms. The proposed approach is experimentally evaluated. The results show that the inference algorithms improve upon the state-of-the-art in probabilistic logic programming and that it is indeed possible to learn the parameters of a probabilistic logic program from interpretations.
Comment: To appear in Theory and Practice of Logic Programming (TPLP)
Published: 2013
Full Text: View/download PDF

25. DNF Sampling for ProbLog Inference

Author: Shterionov, Dimitar Sht., Kimmig, Angelika, Mantadelis, Theofrastos, and Janssens, Gerda
Subjects: Computer Science - Logic in Computer Science
Abstract: Inference in probabilistic logic languages such as ProbLog, an extension of Prolog with probabilistic facts, is often based on a reduction to a propositional formula in DNF. Calculating the probability of such a formula involves the disjoint-sum-problem, which is computationally hard. In this work we introduce a new approximation method for ProbLog inference which exploits the DNF to focus sampling. While this DNF sampling technique has been applied to a variety of tasks before, to the best of our knowledge it has not been used for inference in probabilistic logic systems. The paper also presents an experimental comparison with another sampling based inference method previously introduced for ProbLog.
Comment: Online proceedings of the Joint Workshop on Implementation of Constraint Logic Programming Systems and Logic-based Methods in Programming Environments (CICLOPS-WLPE 2010), Edinburgh, Scotland, U.K., July 15, 2010
Published: 2010

26. Human versus automatic quality evaluation of NMT and PBSMT

Author: Shterionov, Dimitar, Superbo, Riccardo, Nagle, Pat, Casanellas, Laura, O'Dowd, Tony, and Way, Andy
Published: 2018

27. First WMT Shared Task on Sign Language Translation (WMT-SLT22)

Author: Müller, Mathias, Ebling, Sarah, Avramidis, Eleftherios, Battisti, Alessia, Berger, Michèle, Bowden, Richard, Braffort, Annelies, Camgöz, Necati Cihan, Espana-Bonet, Cristina, Grundkiewicz, Roman, Jiang, Zifan, Koller, Oscar, Moryossef, Amit, Perrollaz, Regula, Reinhard, Sabine, Rios, Annette, Shterionov, Dimitar, Sidler-Miserez, Sandra, Tissi, Katja, Van Landuyt, Davy, Sciences et Technologies des Langues (STL), Laboratoire Interdisciplinaire des Sciences du Numérique (LISN), and Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)
Subjects: [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Abstract: This paper is a brief summary of the First WMT Shared Task on Sign Language Translation (WMT-SLT22), a project partly funded by EAMT. The focus of this shared task is automatic translation between signed and spoken languages. Details can be found on our website or in the findings paper (Müller et al., 2022).
Published: 2023

28. The Most Probable Explanation for Probabilistic Logic Programs with Annotated Disjunctions

Author: Shterionov, Dimitar, Renkens, Joris, Vlasselaer, Jonas, Kimmig, Angelika, Meert, Wannes, Janssens, Gerda, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Davis, Jesse, editor, and Ramon, Jan, editor
Published: 2015
Full Text: View/download PDF

29. Compacting Boolean Formulae for Inference in Probabilistic Logic Programming

Author: Mantadelis, Theofrastos, Shterionov, Dimitar, Janssens, Gerda, Goebel, Randy, Series editor, Tanaka, Yuzuru, Series editor, Wahlster, Wolfgang, Series editor, Calimeri, Francesco, editor, Ianni, Giovambattista, editor, and Truszczynski, Miroslaw, editor
Published: 2015
Full Text: View/download PDF

30. Implementation and Performance of Probabilistic Inference Pipelines

Author: Shterionov, Dimitar, Janssens, Gerda, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Pontelli, Enrico, editor, and Son, Tran Cao, editor
Published: 2015
Full Text: View/download PDF

31. Machine translation from signed to spoken languages: state of the art and challenges

Author: De Coster, Mathieu, primary, Shterionov, Dimitar, additional, Van Herreweghe, Mieke, additional, and Dambre, Joni, additional
Published: 2023
Full Text: View/download PDF

32. Minds: Big questions for linguistics in the age of AI

Author: Backus, Ad, Cohen, Michael, Cohn, Neil, Faber, Myrthe, Krahmer, Emiel, Laparle, Schuyler, Maier, Emar, Miltenburg, Emiel van, Roelofsen, Floris, Sciubba, Eleonora, Scholman, Merel, Shterionov, Dimitar, Sie, Maureen, Tomas, Frédéric, Vanmassenhove, Eva, Venhuizen, Noortje, and Vos, Connie de
Published: 2023

33. A Python Tool for Selecting Domain-Specific Data in Machine Translation

Author: Pourmostafa Roshan Sharami, Javad, Shterionov, Dimitar, and Spronck, Pieter
Abstract: As the volume of data for Machine Translation (MT) grows, the need for models that can perform well in specific use cases, like patent and medical translations, becomes increasingly important. Unfortunately, generic models do not work well in such cases, as they often fail to handle domain-specific style and terminology. Only using datasets that cover domains similar to the target domain to train MT systems can effectively lead to high translation quality (for a domain-specific use-case) (Wang et al., 2017; Pourmostafa Roshan Sharami et al., 2021; Pourmostafa Roshan Sharami et al., 2022). This highlights the limitation of data-driven MT when trained on general domain data, regardless of dataset size. To address this challenge, researchers have implemented various strategies to improve domain-specific translation using Domain Adaptation (DA) methods (Saunders, 2022; Sharami et al., 2023). The DA process involves initially training a generic model, which is then fine-tuned using a domain-specific dataset (Chu and Wang, 2018). One approach to generating a domain-specific dataset is to select similar data from generic corpora for a specific language pair and then utilize both general (to train) and domain-specific (to fine-tune) parallel corpora for MT. In line with this approach, we developed a language-agnostic Python tool implementing the methodology proposed by Sharami et al. (2022). This tool uses monolingual domain-specific corpora to generate a parallel in-domain corpus, facilitating data selection for DA.
Published: 2023

34. GoSt-ParC-Sign: Gold Standard Parallel Corpus of Sign and spoken language

Author: De Sisto, Mirella, Vandeghinste, Vincent, Soetemans, Lien, Brosens, Caro, and Shterionov, Dimitar
Published: 2023

35. Findings of the Second WMT Shared Task on Sign Language Translation (WMT-SLT23)

Author: Müller, Mathias, Alikhani, Malihe, Avramidis, Eleftherios, Bowden, Richard, Braffort, Annelies, Cihan Camgöz, Necati, Ebling, Sarah, España-Bonet, Cristina, Göhring, Anne, Grundkiewicz, Roman, Inan, Mert, Jiang, Zifan, Koller, Oscar, Moryossef, Amit, Rios, Annette, Shterionov, Dimitar, Sidler-Miserez, Sandra, Tissi, Katja, and Van Landuyt, Davy
Abstract: This paper presents the results of the Second WMT Shared Task on Sign Language Translation (WMT-SLT23; https://www.wmt-slt.com/). This shared task is concerned with automatic translation between signed and spoken languages. The task is unusual in the sense that it requires processing visual information (such as video frames or human pose estimation) beyond the well-known paradigm of text-to-text machine translation (MT). The task offers four tracks involving the following languages: Swiss German Sign Language (DSGS), French Sign Language of Switzerland (LSF-CH), Italian Sign Language of Switzerland (LIS-CH), German, French and Italian. Four teams (including one working on a baseline submission) participated in this second edition of the task, all submitting to the DSGS-to-German track. Besides a system ranking and system papers describing state-of-the-art techniques, this shared task makes the following scientific contributions: novel corpora and reproducible baseline systems. Finally, the task also resulted in publicly available sets of system outputs and more human evaluation scores for sign language translation.
Published: 2023

36. Findings of the Second WMT Shared Task on Sign Language Translation (WMT-SLT23)

Author: Müller, Mathias, primary, Alikhani, Malihe, additional, Avramidis, Eleftherios, additional, Bowden, Richard, additional, Braffort, Annelies, additional, Cihan Camgöz, Necati, additional, Ebling, Sarah, additional, España-Bonet, Cristina, additional, Göhring, Anne, additional, Grundkiewicz, Roman, additional, Inan, Mert, additional, Jiang, Zifan, additional, Koller, Oscar, additional, Moryossef, Amit, additional, Rios, Annette, additional, Shterionov, Dimitar, additional, Sidler-Miserez, Sandra, additional, Tissi, Katja, additional, and Van Landuyt, Davy, additional
Published: 2023
Full Text: View/download PDF

37. Findings of the First WMT Shared Task on Sign Language Translation (WMT-SLT22)

Author: Müller, Mathias, Ebling, Sarah, Avramidis, Eleftherios, Battisti, Alessia, Berger, Michèle, Bowden, Richard, Braffort, Annelies, Cihan Camgöz, Necati, España-Bonet, Cristina, Grundkiewicz, Roman, Jiang, Zifan, Koller, Oscar, Moryossef, Amit, Perrollaz, Regula, Reinhard, Sabine, Rios, Annette, Shterionov, Dimitar, Sidler-Miserez, Sandra, Tissi, Katja, Van Landuyt, Davy, Sciences et Technologies des Langues (STL), Laboratoire Interdisciplinaire des Sciences du Numérique (LISN), Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS), and University of Zurich
Subjects: 10105 Institute of Computational Linguistics, 410 Linguistics, 000 Computer science, knowledge & systems, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Abstract: This paper presents the results of the First WMT Shared Task on Sign Language Translation (WMT-SLT22). This shared task is concerned with automatic translation between signed and spoken languages. The task is novel in the sense that it requires processing visual information (such as video frames or human pose estimation) beyond the well-known paradigm of text-to-text machine translation (MT). The task featured two tracks, translating from Swiss German Sign Language (DSGS) to German and vice versa. Seven teams participated in this first edition of the task, all submitting to the DSGS-to-German track. Besides a system ranking and system papers describing state-of-the-art techniques, this shared task makes the following scientific contributions: novel corpora, reproducible baseline systems and new protocols and software for human evaluation. Finally, the task also resulted in the first publicly available set of system outputs and human evaluation scores for sign language translation.
Published: 2022

38. Quality Estimation for the Translation Industry – Data Challenges

Author: Sharami, Javad Pourmostafa Roshan, Murgolo, Elena, Shterionov, Dimitar Sht., Department Cognitive Science and Artificial Intelligence, and Cognitive Science & AI
Subjects: Machine Translation, Quality evaluation, Quality estimation
Abstract: Machine Translation (MT) has become an irreplaceable part of translation industry workflows. With a direct impact on productivity, it is very important for human post-editors and project managers to be informed about the translation quality of MT. MT Quality estimation (QE) is the task of predicting the quality of a translation without human references. For the translation industry, QE can be used as an indicator of the amount of post-editing needed as well as of productivity. QE can be applied at word-, sentence- or document-level. In the cases of sentence- and document-level QE, given a source text and its MT counterpart, the task is to predict a score (typically TER) that indicates the translation quality. As with most NLP tasks nowadays, state-of-the-art QE has been achieved using DL methods. Optimal QE performance is not only a question of model architecture and hyperparameters but also strongly depends on the quantity and quality of the data. In a business environment, such as that of the translation industry, models or systems that are used in production should adhere to economic and usability criteria. QE for the translation industry should be optimised for domain- and use-case-specific data, and should be efficient and adaptable. In a collaborative project between Orbital14 and Tilburg University, funded by Aglatech14, we develop a framework for MT quality assessment (MTQA) which strongly relies on QE. In this project we focus on a specific domain (patents, IP) and a language pair (English-Italian). In this work we present our approach to data collection, analysis and preprocessing prior to building QE models. We started with proprietary data provided by Aglatech14 to identify specific patterns that need to be covered by the QE tool. As the volume of initial data was not sufficient to build robust DL models (42K source-translation-post-edit triplets, from which we compute the TER scores), we added a corpus of 127K source-human translation sentences. We used Aglatech14’s translation model to translate the source and generate a pseudo corpus of triplets (source-MT-post-edit), which were then used to compute TER scores. To further extend our data set we used publicly available data – a generic English-Italian corpus containing ~105M sentence pairs. Aiming at a domain-specific QE, we select only sentence pairs similar to the industry data. We used a ranking data selection method based on cosine similarity to select an additional 42K sentences – those with the highest similarity score to the Aglatech14 data. After data selection, we translated the selected source data to create new synthetic data comprising source, machine translation and target sentences. We consider the target sentences as post-edited sentences. As in the previous case, we computed the TER score between the MT and the target to use as labels for training our QE models. We use these data to build state-of-the-art QE models and evaluate their performance against a gold standard reference set, provided by Aglatech14.
Published: 2022

39. Findings of the First WMT Shared Task on Sign Language Translation (WMT-SLT22)

Author: Müller, Mathias; https://orcid.org/0000-0002-8248-199X, Ebling, Sarah, Avramidis, Eleftherios, Battisti, Alessia, Berger, Michèle, Bowden, Richard, Braffort, Annelies, Cihan Camgöz, Necati, España-Bonet, Cristina, Grundkiewicz, Roman, Jiang, Zifan; https://orcid.org/0000-0002-4403-4953, Koller, Oscar, Moryossef, Amit, Perrollaz, Regula, Reinhard, Sabine, Rios, Annette, Shterionov, Dimitar, Sidler-Miserez, Sandra, Tissi, Katja, and Van Landuyt, Davy
Abstract: This paper presents the results of the First WMT Shared Task on Sign Language Translation (WMT-SLT22). This shared task is concerned with automatic translation between signed and spoken languages. The task is novel in the sense that it requires processing visual information (such as video frames or human pose estimation) beyond the well-known paradigm of text-to-text machine translation (MT). The task featured two tracks, translating from Swiss German Sign Language (DSGS) to German and vice versa. Seven teams participated in this first edition of the task, all submitting to the DSGS-to-German track. Besides a system ranking and system papers describing state-of-the-art techniques, this shared task makes the following scientific contributions: novel corpora, reproducible baseline systems and new protocols and software for human evaluation. Finally, the task also resulted in the first publicly available set of system outputs and human evaluation scores for sign language translation.
Published: 2022

40. 'Vaderland', 'Volk' and 'Natie': Semantic Change Related to Nationalism in Dutch Literature Between 1700 and 1880 Captured with Dynamic Bernoulli Word Embeddings

Author: Timmermans, Marije, Vanmassenhove, Eva, Shterionov, Dimitar, Tahmasebi, Nina, Montariol, Syrielle, Kutuzov, Andrey, Hengchen, Simon, Dubossarsky, Haim, Borin, Lars, and Cognitive Science & AI
Abstract: Languages can respond to external events in various ways - the creation of new words or named entities, additional senses might develop for already existing words or the valence of words can change. In this work, we explore the semantic shift of the Dutch words “natie” (“nation”), “volk” (“people”) and “vaderland” (“fatherland”) over a period that is known for the rise of nationalism in Europe: 1700-1880 (Jensen, 2016). The semantic change is measured by means of Dynamic Bernoulli Word Embeddings (Rudolph and Blei, 2018) which allow for comparison between word embeddings over different time slices. The word embeddings were generated based on Dutch fiction literature divided over different decades. From the analysis of the absolute drifts, it appears that the word “natie” underwent a relatively small drift. However, the drifts of “vaderland” and “volk” show multiple peaks, culminating around the turn of the nineteenth century. To verify whether this semantic change can indeed be attributed to nationalistic movements, a detailed analysis of the nearest neighbours of the target words is provided. From the analysis, it appears that “natie”, “volk” and “vaderland” became more nationalistically-loaded over time.
Published: 2022

41. “Vaderland”, “Volk” and “Natie”: Semantic Change Related to Nationalism in Dutch Literature Between 1700 and 1880 Captured with Dynamic Bernoulli Word Embeddings

Author: Timmermans, Marije, primary, Vanmassenhove, Eva, additional, and Shterionov, Dimitar, additional
Published: 2022
Full Text: View/download PDF

42. The Most Probable Explanation for Probabilistic Logic Programs with Annotated Disjunctions

Author: Shterionov, Dimitar, primary, Renkens, Joris, additional, Vlasselaer, Jonas, additional, Kimmig, Angelika, additional, Meert, Wannes, additional, and Janssens, Gerda, additional
Published: 2015
Full Text: View/download PDF

43. Compacting Boolean Formulae for Inference in Probabilistic Logic Programming

Author: Mantadelis, Theofrastos, primary, Shterionov, Dimitar, additional, and Janssens, Gerda, additional
Published: 2015
Full Text: View/download PDF

44. Implementation and Performance of Probabilistic Inference Pipelines

Author: Shterionov, Dimitar, primary and Janssens, Gerda, additional
Published: 2015
Full Text: View/download PDF

45. Defining meaningful units. Challenges in sign segmentation and segment-meaning mapping: (short paper)

Author: De Sisto, Mirella, Shterionov, Dimitar, Murtagh, Irene, Vermeerbergen, Myriam, Leeson, Lorraine, and Cognitive Science & AI
Published: 2021

46. Online Evaluation of Text-to-sign Translation by Deaf End Users: Some Methodological Recommendations

Author: Roelofsen, Floris, Esselink, Lyke, Mende-Gillings, Shani, de Meulder, Maartje, Sijm, Nienke, Smeijers, Anika, Shterionov, Dimitar, General Paediatrics, and Paediatric Pulmonology
Subjects: InformationSystems_MISCELLANEOUS, humanities
Abstract: We present a number of methodological recommendations concerning the online evaluation of avatars for text-to-sign translation, focusing on the structure, format and length of the questionnaire, as well as methods for eliciting and faithfully transcribing responses.
Published: 2021

47. SignON: Bridging the gap between sign and spoken languages

Author: Saggion, Horacio, Shterionov, Dimitar, Labaka, Gorka, Van de Cruys, Tim, Vandeghinste, Vincent, and Blat, Josep
Subjects: Neural Machine Translation, Sign Language, Text Simplification, Automatic Sign Language Recognition, Avatar
Abstract: Paper presented at the Conference of the Spanish Society for Natural Language Processing (SEPLN 2021), held virtually in September 2021. This article presents an overview of the SignON European project, which aims to develop technology for automatic translation between sign and oral languages (and vice versa). In order to achieve this objective, the project takes a multi-disciplinary approach by involving the deaf community, sign language linguistics, research in sign language recognition, speech recognition, natural language processing and machine translation (MT), 3D animation and avatar technology, and application development. The project follows a user-centered, community-driven approach to the development of technology. This work is supported by the European Commission under the Horizon 2020 program ICT-57-2020 - "An empowering, inclusive Next Generation Internet" with Grant Agreement number 101017255. We thank all members of the SignON consortium.
Published: 2021

48. A Novel Pipeline for Domain Detection and Selecting In-domain Sentences in Machine Translation Systems

Author: Pourmostafa Roshan Sharami, Javad, Shterionov, Dimitar, Spronck, Pieter, and Cognitive Science & AI
Subjects: Machine Translation, Domain Detection Pipeline, In-domain Generation, Domain Adaptation
Abstract: General-domain corpora are becoming increasingly available for Machine Translation (MT) systems. However, using those that cover the same or comparable domains allow achieving high translation quality of domain-specific MT. It is often the case that domain-specific corpora are scarce and cannot be used in isolation to effectively train (domain-specific) MT systems. This work aims to improve in-domain MT by (i) a novel unsupervised pipeline for identifying distributions of different domains within a corpus and (ii) a data selection technique that leverages in-domain monolingual or parallel data to select domain-specific sentences from general corpora according to the distribution defined in (i). To do so, either a list with domain-specific keywords or an external lexical resource is fed into the pipeline to identify similar input data within the general domain. Furthermore, the suggested pipeline can determine the target domain of any corpus. That is, MT practitioners can prepare their training data, based on the target domain demanded by customers or industry. This idea is not only effective in terms of specifying frequent words in the corpus for the DA tasks but also in being able to inform the MT practitioners of insight into data (an informative feature). The main idea of this work is related to Topic Modeling (TM) in the sense that a sentence is a distribution over hidden topics, and a topic is a distribution over words. Therefore, there is a high probability that similar sentences contain similar single words. In this way, we can select in-domain sentences if their top n-words match with general corpora’s top-words. Our pipeline encapsulates several modules such as TM, sentence embedding, dimensionality reduction, clustering, domain detection, post-processing, and a matching function. 
To test the effectiveness of our approach, the proposed method is applied on an English/French corpus, fitted and evaluated in the context of DA aiming to address the lack of in-domain data. Our empirical evaluation shows that more training data is not always better, and the best results are attainable via a proper domain-relevant data selection.
Published: 2021
Full Text: View/download PDF

49. A review of the state‑of‑the‑art in automatic post‑editing

Author: do Carmo, Félix, Shterionov, Dimitar, Moorkens, Joss, Wagner, Joachim, Hossari, Murhaf, Paquin, Eric, Schmidtke, Dag, Groves, Declan, and Way, Andy
Subjects: Translating and interpreting, Automatic Post-editing, Neural Post-editing, Neural machine translation, State-of-the-art in Automatic Post-editing, Machine translating
Abstract: This article presents a review of the evolution of automatic post-editing, a term that describes methods to improve the output of machine translation systems, based on knowledge extracted from datasets that include post-edited content. The article describes the specificity of automatic post-editing in comparison with other tasks in machine translation, and it discusses how it may function as a complement to them. Particular detail is given in the article to the five-year period that covers the shared tasks presented in WMT conferences (2015–2019). In this period, discussion of automatic post-editing evolved from the definition of its main parameters to an announced demise, associated with the difficulties in improving output obtained by neural methods, which was then followed by renewed interest. The article debates the role and relevance of automatic post-editing, both as an academic endeavour and as a useful application in commercial workflows.
Published: 2020

50. Machine Translationese: Effects of Algorithmic Bias on Linguistic Complexity in Machine Translation

Author: Vanmassenhove, Eva, primary, Shterionov, Dimitar, additional, and Gwilliam, Matthew, additional
Published: 2021
Full Text: View/download PDF
