Descriptor: "Language Modeling" / Publisher: hal ccsd - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Language Modeling"' showing total 37 results

1. Cross-lingual Strategies for Low-resource Language Modeling: A Study on Five Indic Dialects

Author: Bafna, Niyati, España-Bonet, Cristina, van Genabith, Josef, Sagot, Benoît, Bawden, Rachel, Automatic Language Modelling and ANAlysis & Computational Humanities (ALMAnaCH), Inria de Paris, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), Deutsches Forschungszentrum für Künstliche Intelligenz GmbH = German Research Center for Artificial Intelligence (DFKI), This work was partly funded by R. Bawden’s and B. Sagot’s chairs in the PRAIRIE institute funded by the French national agency ANR as part of the 'Investissements d’avenir' programme under the reference ANR-19-P3IA-0001 and by the Emergence project, DadaNMT, funded by Sorbonne Université. The work was also supported by the German Research Foundation (Deutsche Forschungsgemeinschaft) under grant SFB 1102: Information Density and Linguistic Encoding., Servan, Christophe, Vilnat, Anne, and ANR-19-P3IA-0001,PRAIRIE,PaRis Artificial Intelligence Research InstitutE(2019)
Subjects: language modeling, POS tagging, Indic languages, Low resource, Crosslingual transfer, Dialect continuum, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
Abstract: International audience; Neural language models play an increasingly central role for language processing, given their success for a range of NLP tasks. In this study, we compare some canonical strategies in language modeling for low-resource scenarios, evaluating all models by their (finetuned) performance on a POS-tagging downstream task. We work with five (extremely) low-resource dialects from the Indic dialect continuum (Braj, Awadhi, Bhojpuri, Magahi, Maithili), which are closely related to each other and the standard mid-resource dialect, Hindi. The strategies we evaluate broadly include from-scratch pretraining, and cross-lingual transfer between the dialects as well as from different kinds of off-the-shelf multilingual models; we find that a model pretrained on other mid-resource Indic dialects and languages, with extended pretraining on target dialect data, consistently outperforms other models. We interpret our results in terms of dataset sizes, phylogenetic relationships, and corpus statistics, as well as particularities of this linguistic system.
Published: 2023

2. Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition

Author: Thibault Bañeras Roux, Mickael Rouvier, Jane Wottawa, Richard Dufour, Traitement Automatique du Langage Naturel (LS2N - équipe TALN ), Laboratoire des Sciences du Numérique de Nantes (LS2N), Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-IMT Atlantique (IMT Atlantique), Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT)-École Centrale de Nantes (Nantes Univ - ECN), Nantes Université (Nantes Univ)-Nantes Université (Nantes Univ)-Nantes université - UFR des Sciences et des Techniques (Nantes univ - UFR ST), Nantes Université - pôle Sciences et technologie, Nantes Université (Nantes Univ)-Nantes Université (Nantes Univ)-Nantes Université - pôle Sciences et technologie, Nantes Université (Nantes Univ)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-IMT Atlantique (IMT Atlantique), Nantes Université (Nantes Univ), Laboratoire Informatique d'Avignon (LIA), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI, Laboratoire d'Informatique de l'Université du Mans (LIUM), Le Mans Université (UM), and ANR-20-CE23-0005,DIETS,Diagnostic automatique des erreurs des systèmes de transcription de parole end-to-end à partir de leur réception par les utilisateurs(2020)
Subjects: Evaluation metrics, Language modeling, Automatic speech recognition, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], Semantic analysis
Abstract: International audience; Evaluating automatic speech recognition (ASR) systems is a classical but difficult and still open problem, which often boils down to focusing only on the word error rate (WER). However, this metric suffers from many limitations and does not allow an in-depth analysis of automatic transcription errors. In this paper, we propose to study and understand the impact of rescoring using language models in ASR systems by means of several metrics often used in other natural language processing (NLP) tasks in addition to the WER. In particular, we introduce two measures related to morpho-syntactic and semantic aspects of transcribed words: 1) the POSER (Part-of-speech Error Rate), which should highlight the grammatical aspects, and 2) the EmbER (Embedding Error Rate), a measurement that modifies the WER by providing a weighting according to the semantic distance of the wrongly transcribed words. These metrics illustrate the linguistic contributions of the language models that are applied during a posterior rescoring step on transcription hypotheses.
Published: 2022
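
As a point of reference for the metrics named in the abstract above, here is a minimal sketch of an error rate computed by Levenshtein alignment over token lists. The function name `error_rate` and the toy sentences and tags are assumptions made for illustration, not taken from the paper. Applying the same function to part-of-speech tag sequences gives a POS-level error rate in the spirit of POSER, though the paper's exact definition may differ.

```python
def error_rate(reference, hypothesis):
    """Edit distance between two token lists, divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    # dist[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # i deletions
    for j in range(cols):
        dist[0][j] = j  # j insertions
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[-1][-1] / len(reference)


# Hypothetical word-level example: one substitution and one deletion.
ref_words = "the cat sat on the mat".split()
hyp_words = "the cat sit on mat".split()
wer = error_rate(ref_words, hyp_words)

# Hypothetical tag-level example: "sat" -> "sit" keeps the VERB tag,
# so only the deleted determiner counts as an error.
ref_tags = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]
hyp_tags = ["DET", "NOUN", "VERB", "ADP", "NOUN"]
pos_error_rate = error_rate(ref_tags, hyp_tags)
```

In this toy case the tag-level rate is lower than the word-level rate, which is the kind of contrast a grammar-oriented metric is meant to expose.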

3. Détection du discours de haine et du langage offensant utilisant des approches de Transfer Learning

Author: Mozafari, Marzieh, Institut Polytechnique de Paris (IP Paris), Département Réseaux et Services Multimédia Mobiles (RS2M), Institut Mines-Télécom [Paris] (IMT)-Télécom SudParis (TSP), Réseaux, Systèmes, Services, Sécurité (R3S-SAMOVAR), Services répartis, Architectures, MOdélisation, Validation, Administration des Réseaux (SAMOVAR), Institut Mines-Télécom [Paris] (IMT)-Télécom SudParis (TSP)-Institut Mines-Télécom [Paris] (IMT)-Télécom SudParis (TSP), Institut Polytechnique de Paris, Noel Crespi, and Reza Farahbakhsh
Subjects: Détection de discours de haine, Meta learning, Language modeling, Few-shot learning, Réseaux sociaux, Apprentissage en profondeur, Deep learning, [INFO.INFO-SI]Computer Science [cs]/Social and Information Networks [cs.SI], Transfer learning, Hate speech detection, Social media, [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing, XLM-RoBERTa, Classification interlinguistique des textes, Modélisation du langage, Transfert d’apprentissage, Cross lingual text classification, BERT
Abstract: The great promise of social media platforms (e.g., Twitter and Facebook) is to provide a safe place for users to communicate their opinions and share information. However, concerns are growing that they enable abusive behaviors, e.g., threatening or harassing other users, cyberbullying, hate speech, racial and sexual discrimination, as well. In this thesis, we focus on hate speech as one of the most concerning phenomena in online social media. Given the high progression of online hate speech and its severe negative effects, institutions, social media platforms, and researchers have been trying to react as quickly as possible. The recent advancements in Natural Language Processing (NLP) and Machine Learning (ML) algorithms can be adapted to develop automatic methods for hate speech detection in this area. The aim of this thesis is to investigate the problem of hate speech and offensive language detection in social media, where we define hate speech as any communication criticizing a person or a group based on some characteristics, e.g., gender, sexual orientation, nationality, religion, race. We propose different approaches in which we adapt advanced Transfer Learning (TL) models and NLP techniques to detect hate speech and offensive content automatically, in a monolingual and multilingual fashion. In the first contribution, we only focus on English language. Firstly, we analyze user-generated textual content to gain a brief insight into the type of content by introducing a new framework being able to categorize contents in terms of topical similarity based on different features. Furthermore, using the Perspective API from Google, we measure and analyze the toxicity of the content. Secondly, we propose a TL approach for identification of hate speech by employing a combination of the unsupervised pre-trained model BERT (Bidirectional Encoder Representations from Transformers) and new supervised fine-tuning strategies.
Finally, we investigate the effect of unintended bias in our pre-trained BERT based model and propose a new generalization mechanism in training data by reweighting samples and then changing the fine-tuning strategies in terms of the loss function to mitigate the racial bias propagated through the model. To evaluate the proposed models, we use two publicly available datasets from Twitter. In the second contribution, we consider a multilingual setting where we focus on low-resource languages in which there is little or no labeled data available. First, we present the first corpus of Persian offensive language consisting of 6k micro blog posts from Twitter to deal with offensive language detection in Persian as a low-resource language in this domain. After annotating the corpus, we perform extensive experiments to investigate the performance of transformer-based monolingual and multilingual pre-trained language models (e.g., ParsBERT, mBERT, XLM-R) in the downstream task. Furthermore, we propose an ensemble model to boost the performance of our model. Then, we expand our study into a cross-lingual few-shot learning problem, where we have a few labeled data in target language, and adapt a meta-learning based approach to address identification of hate speech and offensive language in low-resource languages.; Une des promesses des plateformes de réseaux sociaux (comme Twitter et Facebook) est de fournir un endroit sûr pour que les utilisateurs puissent partager leurs opinions et des informations. Cependant, l’augmentation des comportements abusifs, comme le harcèlement en ligne ou la présence de discours de haine, est bien réelle. Dans cette thèse, nous nous concentrons sur le discours de haine, l'un des phénomènes les plus préoccupants concernant les réseaux sociaux. Compte tenu de sa forte progression et de ses graves effets négatifs, les institutions, les plateformes de réseaux sociaux et les chercheurs ont tenté de réagir le plus rapidement possible.
Les progrès récents des algorithmes de traitement automatique du langage naturel (NLP) et d'apprentissage automatique (ML) peuvent être adaptés pour développer des méthodes automatiques de détection des discours de haine dans ce domaine. Le but de cette thèse est d'étudier le problème du discours de haine et de la détection des propos injurieux dans les réseaux sociaux. Nous proposons différentes approches dans lesquelles nous adaptons des modèles avancés d'apprentissage par transfert (TL) et des techniques de NLP pour détecter automatiquement les discours de haine et les contenus injurieux, de manière monolingue et multilingue. La première contribution concerne uniquement la langue anglaise. Tout d'abord, nous analysons le contenu textuel généré par les utilisateurs en introduisant un nouveau cadre capable de catégoriser le contenu en termes de similarité basée sur différentes caractéristiques. En outre, en utilisant l'API Perspective de Google, nous mesurons et analysons la « toxicité » du contenu. Ensuite, nous proposons une approche TL pour l'identification des discours de haine en utilisant une combinaison du modèle non supervisé pré-entraîné BERT (Bidirectional Encoder Representations from Transformers) et de nouvelles stratégies supervisées de réglage fin. Enfin, nous étudions l'effet du biais involontaire dans notre modèle pré-entraîné BERT et proposons un nouveau mécanisme de généralisation dans les données d'entraînement en repondérant les échantillons puis en changeant les stratégies de réglage fin en termes de fonction de perte pour atténuer le biais racial propagé par le modèle. Pour évaluer les modèles proposés, nous utilisons deux datasets publics provenant de Twitter. Dans la deuxième contribution, nous considérons un cadre multilingue où nous nous concentrons sur les langues à faibles ressources dans lesquelles il n'y a pas ou peu de données annotées disponibles.
Tout d'abord, nous présentons le premier corpus de langage injurieux en persan, composé de 6 000 messages de micro-blogs provenant de Twitter, afin d'étudier la détection du langage injurieux. Après avoir annoté le corpus, nous étudions les performances des modèles de langages pré-entraînés monolingues et multilingues basés sur des transformeurs (par exemple, ParsBERT, mBERT, XLM-R) dans la tâche en aval. De plus, nous proposons un modèle d'ensemble pour améliorer la performance de notre modèle. Enfin, nous étendons notre étude à un problème d'apprentissage multilingue de type « few-shot », où nous disposons de quelques données annotées dans la langue cible, et nous adaptons une approche basée sur le méta-apprentissage pour traiter l'identification des discours de haine et du langage injurieux dans les langues à faibles ressources.
Published: 2021

4. A BERT-based transfer learning approach for hate speech detection in online social media

Author: Reza Farahbakhsh, Noel Crespi, Marzieh Mozafari, Département Réseaux et Services Multimédia Mobiles (RS2M), Institut Mines-Télécom [Paris] (IMT)-Télécom SudParis (TSP), Institut Polytechnique de Paris (IP Paris), Réseaux, Systèmes, Services, Sécurité (R3S-SAMOVAR), Services répartis, Architectures, MOdélisation, Validation, Administration des Réseaux (SAMOVAR), Institut Mines-Télécom [Paris] (IMT)-Télécom SudParis (TSP)-Institut Mines-Télécom [Paris] (IMT)-Télécom SudParis (TSP), and Centre National de la Recherche Scientifique (CNRS)
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer science, 02 engineering and technology, NLP, [INFO.INFO-SI]Computer Science [cs]/Social and Information Networks [cs.SI], Machine Learning (cs.LG), Computer Science - Information Retrieval, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], Social media, Hate speech detection, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, 10. No inequality, Transformer (machine learning model), Social and Information Networks (cs.SI), Information retrieval, Voice activity detection, Computer Science - Computation and Language, Language modeling, Offensive, Computer Science - Social and Information Networks, Transfer learning, Fine-tuning, [INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR], 020201 artificial intelligence & image processing, Language model, Transfer of learning, Precision and recall, Encoder, Computation and Language (cs.CL), Information Retrieval (cs.IR), BERT
Abstract: Generated hateful and toxic content by a portion of users in social media is a rising phenomenon that motivated researchers to dedicate substantial efforts to the challenging direction of hateful content identification. We not only need an efficient automatic hate speech detection model based on advanced machine learning and natural language processing, but also a sufficiently large amount of annotated data to train a model. The lack of a sufficient amount of labelled hate speech data, along with the existing biases, has been the main issue in this domain of research. To address these needs, in this study we introduce a novel transfer learning approach based on an existing pre-trained language model called BERT (Bidirectional Encoder Representations from Transformers). More specifically, we investigate the ability of BERT at capturing hateful context within social media content by using new fine-tuning methods based on transfer learning. To evaluate our proposed approach, we use two publicly available datasets that have been annotated for racism, sexism, hate, or offensive content on Twitter. The results show that our solution obtains considerable performance on these datasets in terms of precision and recall in comparison to existing approaches. Consequently, our model can capture some biases in data annotation and collection process and can potentially lead us to a more accurate model., This paper has been accepted in The 8th International Conference on Complex Networks and their Applications
Published: 2019

5. Languages(s) of the SHUN-PAO, a Computational Linguistics account

Author: Magistry, Pierre, Institut de recherches Asiatiques (IrAsia), Aix Marseille Université (AMU)-Centre National de la Recherche Scientifique (CNRS), This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 788476)., European Project: 788476,ENPMUC, Magistry, Pierre, and Elites, networks, and power in modern urban China (1830-1949) - ENPMUC - 788476 - INCOMING
Subjects: [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing, Lexical statistics, Language change, Modern China, Contextual Embeddings, [SCCO.LING]Cognitive science/Linguistics, [SHS.LANGUE]Humanities and Social Sciences/Linguistics, Language Modeling
Abstract: International audience; This work is part of a broader project which requires adapting information extraction (IE) methods to written materials (mostly press articles) published in China between the mid 19th and the mid 20th centuries. This calls for a better understanding and description of the language(s) we can observe in our sources. More importantly, it is an unprecedented opportunity to provide a usage-based description of written languages as used in the press in Modern China. There is an abundant literature describing this pivotal era from different perspectives and disciplines related to language, including the history of language policies (Kaske, 2008), the socio-linguistic aspects (Weng, 2018) or historical linguistics (Coblin, 2000, Simmons, 2017). However, what is presented in this article is, as far as I know, the first usage-based study to leverage a complete corpus of almost 80 years of a daily newspaper, the Shen-Pao (申報), containing about 750 million sinograms, to account for the actual practices and their evolution through time. In order to do so, I propose new Computational Linguistics methods and tools inspired by recent works in the field, especially Language Modeling and Contextual String Embeddings.
Published: 2019

6. Out-of-Vocabulary Word Probability Estimation using RNN Language Model

Author: Illina, Irina, Fohr, Dominique, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL), ANR-12-BS02-0009,ContNomina,Exploitation du contexte pour la reconnaissance de noms propres dans les documents diachroniques audio(2012), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), and Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)
Subjects: language modeling, OOV, [INFO.INFO-HC]Computer Science [cs]/Human-Computer Interaction [cs.HC], Speech recognition
Abstract: International audience; One important issue of speech recognition systems is Out-of-Vocabulary (OOV) words. These words, often proper nouns or new words, are essential for documents to be transcribed correctly. Thus, they must be integrated in the language model (LM) and the lexicon of the speech recognition system. This article proposes new approaches to OOV proper noun estimation using a Recurrent Neural Network Language Model (RNNLM). The proposed approaches are based on the notion of closest in-vocabulary (IV) words (list of brothers) to a given OOV proper noun. The probabilities of these words are used to estimate the probabilities of OOV proper nouns thanks to the RNNLM. Three methods for retrieving the relevant list of brothers are studied. The main advantages of the proposed approaches are that the RNNLM is not retrained and the architecture of the RNNLM is kept intact. Experiments on real text data from the website of the Euronews channel show perplexity reductions of about 14% relative compared to the baseline RNNLM.
Published: 2017
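
To make the phrase "perplexity reductions of about 14% relative" in the abstract above concrete, here is a minimal sketch of how perplexity and a relative reduction are commonly computed. The function names and the token probabilities are hypothetical values chosen for illustration; they are not the paper's data, and the sketch does not reproduce the paper's OOV estimation method.

```python
import math


def perplexity(log_probs):
    """Perplexity from a list of natural-log token probabilities."""
    if not log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(log_probs) / len(log_probs))


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Hypothetical models: every token gets probability 0.1 (baseline)
# or 0.125 (improved) on a four-token test text.
baseline_ppl = perplexity([math.log(0.1)] * 4)
improved_ppl = perplexity([math.log(0.125)] * 4)
reduction = relative_reduction(baseline_ppl, improved_ppl)
```

With these made-up numbers the baseline perplexity is 10, the improved one is 8, and the relative reduction is 20%.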

7. Developing an Embosi (Bantu C25) Speech Variant Dictionary to Model Vowel Elision and Morpheme Deletion

Author: Annie Rialland, Jamison Cooper-Leavitt, Gilles Adda, Martine Adda-Decker, Lori Lamel, Publications, Limsi, ISCA, Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI), Université Paris-Sud - Paris 11 (UP11)-Sorbonne Université - UFR d'Ingénierie (UFR 919), Sorbonne Université (SU)-Sorbonne Université (SU)-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Université Paris Saclay (COmUE), Université Paris Saclay (COmUE)-Centre National de la Recherche Scientifique (CNRS)-Sorbonne Université - UFR d'Ingénierie (UFR 919), and Sorbonne Université (SU)-Sorbonne Université (SU)-Université Paris-Saclay-Université Paris-Sud - Paris 11 (UP11)
Subjects: language modeling, Head (linguistics), Computer science, Speech recognition, phonetics, Bantu languages, 02 engineering and technology, [INFO]Computer Science [cs], Lexicon, under-resourced languages, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], Morpheme, Vowel, 0202 electrical engineering, electronic engineering, information engineering, 060201 languages & linguistics, Subject pronoun, Phonology, 06 humanities and the arts, Noun phrase, Linguistics, Prefix, 0602 languages and literature, 020201 artificial intelligence & image processing, Language model
Abstract: International audience; This paper investigates vowel elision and morpheme deletion in Embosi (Bantu C25), an under-resourced language spoken in the Republic of Congo. We propose that the observed morpheme deletion is morphological, and that vowel elision is phonological. The study focuses on vowel elision that occurs across word boundaries between the contact of long/short vowels (i.e. CV[long] # V[short].CV), and between the contact of short/short vowels (CV[short] # V[short].CV). Several different categories of morphemes are explored: (i) prepositions (ya, mo), (ii) class-noun nominal prefixes (ba, etc.), (iii) singular subject pronouns (ngá, nO, wa). For example, the preposition, ya, regularly deletes allowing for vowel elision if vowel contact occurs between the head of the noun phrase and the previous word. Phonetically motivated speech variants are proposed in the lexicon used for forced alignment (segmentation) enabling these phenomena to be quantified in the corpus so as to develop a dictionary containing relevant phonetic variants.
Published: 2017

8. Contributions à l'étude et à la reconnaissance automatique de la parole en Fongbe

Author: Laleye, Frejus Adissa Akintola, Laboratoire d'Informatique Signal et Image de la Côte d'Opale (LISIC), Université du Littoral Côte d'Opale (ULCO), Université du Littoral Côte d'Opale, Université d'Abomey-Calavi (Bénin), Cina Motamed, Eugène C. Ezin, and STAR, ABES
Subjects: [SPI.OTHER]Engineering Sciences [physics]/Other, Logique floue, DBN, Automatic speech segmentation, Modélisation acoustique graphémique, Language modeling, Segmentation automatique de la parole, Automatic speech recognition, Fusion de décisions, Graphem-based acoustical modeling, [INFO.INFO-MO]Computer Science [cs]/Modeling and Simulation, Fuzzy logic, Multiclass classification, Rényi entropy, Fongbe, Reconnaissance automatique de la parole, Modélisation du langage, Fusion of decisions, Entropie de Rényi, Multi-classification
Abstract: One of the difficulties of an under-resourced language is the lack of technology services for speech and text processing. In this thesis, we addressed the acoustic study of isolated and continuous speech in Fongbe as part of speech recognition. The tonal complexity of spoken Fongbe and the recent agreement on its writing convention led us to study Fongbe throughout the automatic speech recognition chain. In addition to the linguistic resources collected (vocabularies, large text and speech corpora, pronunciation dictionaries) for building the algorithms, we proposed a complete recipe of algorithms (including algorithms for the classification and recognition of isolated phonemes and for the segmentation of continuous speech into syllables), based on an acoustic study of the different sounds, for the automatic processing of Fongbe. In this manuscript, we also presented a methodology for developing acoustic models and language models to facilitate speech recognition in Fongbe. In this study, we proposed and evaluated grapheme-based acoustic modeling (since Fongbe does not yet have a phonetic dictionary) and also the impact of tonal pronunciation on the performance of a Fongbe ASR system. Finally, the written and oral resources collected for Fongbe and the experimental results obtained for each aspect of an ASR chain in Fongbe validate the potential of the methods and algorithms that we proposed., L'une des difficultés d'une langue peu dotée est l'inexistence des services liés aux technologies du traitement de l'écrit et de l'oral. Dans cette thèse, nous avons affronté la problématique de l'étude acoustique de la parole isolée et de la parole continue en Fongbe dans le cadre de la reconnaissance automatique de la parole. La complexité tonale de l'oral et la récente convention de l'écriture du Fongbe nous ont conduit à étudier le Fongbe sur toute la chaîne de la reconnaissance automatique de la parole.
En plus des ressources linguistiques collectées (vocabulaires, grands corpus de texte, grands corpus de parole, dictionnaires de prononciation) pour permettre la construction des algorithmes, nous avons proposé une recette complète d'algorithmes (incluant des algorithmes de classification et de reconnaissance de phonèmes isolés et de segmentation de la parole continue en syllabe), basés sur une étude acoustique des différents sons, pour le traitement automatique du Fongbe. Dans ce manuscrit, nous avons aussi présenté une méthodologie de développement de modèles acoustiques et de modèles du langage pour faciliter la reconnaissance automatique de la parole en Fongbe. Dans cette étude, il a été proposé et évalué une modélisation acoustique à base de graphèmes (vu que le Fongbe ne dispose pas encore de dictionnaire phonétique) et aussi l'impact de la prononciation tonale sur la performance d'un système RAP en Fongbe. Enfin, les ressources écrites et orales collectées pour le Fongbe ainsi que les résultats expérimentaux obtenus pour chaque aspect de la chaîne de RAP en Fongbe valident le potentiel des méthodes et algorithmes que nous avons proposés.
Published: 2016

9. Language Model Data Augmentation for Keyword Spotting

Author: Gorin, Arseniy, Lileikyté, Rasa, Huang, Guangpu, Lamel, Lori, Gauvain, Jean-Luc, Laurent, Antoine, Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI), Université Paris Saclay (COmUE)-Centre National de la Recherche Scientifique (CNRS)-Sorbonne Université - UFR d'Ingénierie (UFR 919), and Sorbonne Université (SU)-Sorbonne Université (SU)-Université Paris-Saclay-Université Paris-Sud - Paris 11 (UP11)
Subjects: language modeling, speech recognition, low-resourced languages, text augmentation, [INFO]Computer Science [cs], [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], machine translation
Abstract: International audience; This research extends our earlier work on using machine translation (MT) and word-based recurrent neural networks to augment language model training data for keyword search in conversational Cantonese speech. MT-based data augmentation is applied to two language pairs: English-Lithuanian and English-Amharic. Using filtered N-best MT hypotheses for language modeling is found to perform better than just using the 1-best translation. Target language texts collected from the Web and filtered to select conversational-like data are used in several manners. In addition to using Web data for training the language model of the speech recognizer, we further investigate using this data to improve the language model and phrase table of the MT system to get better translations of the English data. Finally, generating text data with a character-based recurrent neural network is investigated. This approach allows new word forms to be produced, providing a way to reduce the out-of-vocabulary rate and thereby improve keyword spotting performance. We study how these different methods of language model data augmentation impact speech-to-text and keyword spotting performance for the Lithuanian and Amharic languages. The best results are obtained by combining all of the explored methods.
Published: 2016

10. Named entity recognition in multimodal documents

Author: Hatmi, Mohamed, Laboratoire d'Informatique de Nantes Atlantique (LINA), Centre National de la Recherche Scientifique (CNRS)-Mines Nantes (Mines Nantes)-Université de Nantes (UN), Région des Pays de la Loire, UNIVERSITÉ DE NANTES, Emmanuel MORIN, DEPART (Documents Ecrits et Paroles - Reconnaissance et Traduction), and Hatmi, Mohamed
Subjects: Champs Conditionnels Aléatoires, Reconnaissance des Entités Nommées, [SCCO.COMP]Cognitive science/Computer science, Reconnaissance Automatique de la Parole, Automatic Speech Recognition, Modèle de Langage, Named Entity Recognition, Conditional Random Fields, Language Modeling
Abstract: Named entity recognition is a subtask of information extraction. It consists of identifying some textual objects such as person, location and organization names. The work of this thesis focuses on the named entity recognition task for the oral modality. Some difficulties may arise for this task due to the intrinsic characteristics of speech processing (lack of capitalisation marks, lack of punctuation marks, presence of disfluencies and of recognition errors...). In the first part, we study the characteristics of the named entity recognition downstream of the automatic speech recognition system. We present a methodology which allows named entity recognition following a hierarchical and compositional taxonomy. We measure the impact of the different phenomena specific to speech on the quality of named entity recognition. In the second part, we propose to study the tight pairing between the speech recognition task and the named entity recognition task. For that purpose, we repurpose the basic functionalities of a speech recognition system to turn it into a named entity recognition system. Therefore, by mobilising the knowledge specific to speech processing for the named entity recognition task, we ensure a better synergy between the two tasks. We carry out different types of experiments to optimize and evaluate our approach., La Reconnaissance des entités nommées est une sous-tâche de l’activité d’extraction d’information. Elle consiste à identifier certains objets textuels tels que les noms de personne, d’organisation et de lieu. Le travail de cette thèse se concentre sur la tâche de reconnaissance des entités nommées pour la modalité orale. Cette tâche pose un certain nombre de difficultés qui sont inhérentes aux caractéristiques intrinsèques du traitement de l’oral (absence de capitalisation, manque de ponctuation, présence de disfluences et d’erreurs de reconnaissance...).
Dans un premiertemps, nous étudions les spécificités de la reconnaissance des entités nommées en aval du système de reconnaissance automatique de la parole.Nous présentons une méthode pour la reconnaissance des entités nommées dans les transcription de la parole en adoptant une taxonomie hiérarchique et compositionnelle. Nous mesurons l’impact des différents phénomènes spécifiques à la parole sur la qualité de reconnaissance des entités nommées. Dans un second temps, nous proposons d’étudier le couplage étroit entre la tâche de transcription de la parole et la tâche de reconnaissance des entités nommées. Dans ce but, nous détournons les fonctionnalités de base d’un système de transcription de la parole pour le transformer en un système de reconnaissance des entités nommées. Ainsi, en mobilisant les connaissances propres au traitement de la parole dans le cadre de la tâche liée à la reconnaissance des entités nommées, nous assurons une plus grande synergie entre ces deux tâches. Nous menons différents types d’expérimentations afin d’optimiser et d’évaluer notre approche.
Published: 2014

11. Reconnaissance des entités nommées dans des documents multimodaux

Author: Hatmi, Mohamed, Laboratoire d'Informatique de Nantes Atlantique (LINA), Centre National de la Recherche Scientifique (CNRS)-Mines Nantes (Mines Nantes)-Université de Nantes (UN), Région des Pays de la Loire, UNIVERSITÉ DE NANTES, Emmanuel MORIN, and DEPART (Documents Ecrits et Paroles - Reconnaissance et Traduction)
Subjects: Champs Conditionnels Aléatoires, Reconnaissance des Entités Nommées, [SCCO.COMP]Cognitive science/Computer science, Reconnaissance Automatique de la Parole, Automatic Speech Recognition, Modèle de Langage, Named Entity Recognition, Conditional Random Fields, Language Modeling
Abstract: Named entity recognition is a subtask of information extraction. It consists of identifying textual objects such as person, location and organization names. The work of this thesis focuses on the named entity recognition task for the oral modality. Some difficulties arise for this task due to the intrinsic characteristics of speech processing (lack of capitalisation marks, lack of punctuation marks, presence of disfluencies and of recognition errors...). In the first part, we study the characteristics of named entity recognition downstream of the automatic speech recognition system. We present a methodology which allows named entity recognition following a hierarchical and compositional taxonomy. We measure the impact of the different phenomena specific to speech on the quality of named entity recognition. In the second part, we propose to study the tight coupling between the speech recognition task and the named entity recognition task. For that purpose, we repurpose the basic functionalities of a speech recognition system to turn it into a named entity recognition system. By bringing the knowledge inherent in speech processing to bear on the named entity recognition task, we ensure a better synergy between the two tasks. We carry out different types of experiments to optimize and evaluate our approach.; La Reconnaissance des entités nommées est une sous-tâche de l’activité d’extraction d’information. Elle consiste à identifier certains objets textuels tels que les noms de personne, d’organisation et de lieu. Le travail de cette thèse se concentre sur la tâche de reconnaissance des entités nommées pour la modalité orale. Cette tâche pose un certain nombre de difficultés qui sont inhérentes aux caractéristiques intrinsèques du traitement de l’oral (absence de capitalisation, manque de ponctuation, présence de disfluences et d’erreurs de reconnaissance...). 
Dans un premier temps, nous étudions les spécificités de la reconnaissance des entités nommées en aval du système de reconnaissance automatique de la parole. Nous présentons une méthode pour la reconnaissance des entités nommées dans les transcriptions de la parole en adoptant une taxonomie hiérarchique et compositionnelle. Nous mesurons l’impact des différents phénomènes spécifiques à la parole sur la qualité de reconnaissance des entités nommées. Dans un second temps, nous proposons d’étudier le couplage étroit entre la tâche de transcription de la parole et la tâche de reconnaissance des entités nommées. Dans ce but, nous détournons les fonctionnalités de base d’un système de transcription de la parole pour le transformer en un système de reconnaissance des entités nommées. Ainsi, en mobilisant les connaissances propres au traitement de la parole dans le cadre de la tâche liée à la reconnaissance des entités nommées, nous assurons une plus grande synergie entre ces deux tâches. Nous menons différents types d’expérimentations afin d’optimiser et d’évaluer notre approche.
Published: 2014

12. Open domain question-answering : relevant document selection geared to the question

Author: Foucault, Nicolas, Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI), Université Paris Saclay (COmUE)-Centre National de la Recherche Scientifique (CNRS)-Sorbonne Université - UFR d'Ingénierie (UFR 919), Sorbonne Université (SU)-Sorbonne Université (SU)-Université Paris-Saclay-Université Paris-Sud - Paris 11 (UP11), Université Paris Sud - Paris XI, Sophie Rosset, and Gilles Adda
Subjects: Quaero, Natural language processing, Language modeling, Modèle de langue, [INFO.INFO-OH]Computer Science [cs]/Other [cs.OH], Question & Answering, Apprentissage automatique, Séléction de documents, Traitement automatique des langues, Recherche d’information, RITEL, Web page segmentation, Machine learning, Segmentation de pages web, Information retrieval, Classification de pages web, Questions-Réponses, Web page classification, Document selection
Abstract: This thesis aims at defining a unified adaptation of the document selection and answer extraction strategies, based on the document and question types, in a Question-Answering (QA) context. The solution is integrated in RITEL (a LIMSI QA system) to assess the contribution. We develop and investigate a method based on an Information Retrieval approach for the selection of relevant documents in QA. The method is based on a language model and a binary text classification model that assigns documents to a relevant or an irrelevant category. It is used to filter out documents that are unusable for answer extraction by automatically matching lists of a priori relevant documents to the question type. First, we present the method along with its underlying models and we evaluate it on the QA task with RITEL in French. The evaluation is done on a corpus of 500,000 unsegmented web pages with factoid questions provided by the Quaero program (i.e. evaluation at the document level or D-level). Then, we evaluate the method on segmented web pages (i.e. evaluation at the segment level or S-level). The idea is that information content is more consistent within segments, which facilitates answer extraction. D-filtering brings a small improvement over the baseline (no filtering). S-filtering outperforms both the baseline and D-filtering but not significantly. Finally, we study at the S-level the links between RITEL’s performance and the key parameters of the method. In order to apply the method on segments, we created a system of web page segmentation. We present and evaluate it on the QA task with the same corpora used to evaluate our document selection method. This evaluation follows the former hypothesis and measures the impact of natural web page variability (in terms of size and content) on RITEL in its task. 
In general, the experimental results we obtained suggest that our IR-based method helps a QA system in its task; however, further investigations should be conducted – especially with larger corpora of questions – to make these results significant.; Les problématiques abordées dans ma thèse sont de définir une adaptation unifiée entre la sélection des documents et les stratégies de recherche de la réponse à partir du type des documents et de celui des questions, intégrer la solution au système de Questions-Réponses (QR) RITEL du LIMSI et évaluer son apport. Nous développons et étudions une méthode basée sur une approche de Recherche d’Information pour la sélection de documents en QR. Celle-ci s’appuie sur un modèle de langue et un modèle de classification binaire de texte en catégorie pertinent ou non pertinent d’un point de vue QR. Cette méthode permet de filtrer les documents sélectionnés pour l’extraction de réponses par un système QR. Nous présentons la méthode et ses modèles, et la testons dans le cadre QR à l’aide de RITEL. L’évaluation est faite en français en contexte web sur un corpus de 500 000 pages web et de questions factuelles fournis par le programme Quaero. Celle-ci est menée soit sur des documents complets, soit sur des segments de documents. L’hypothèse suivie est que le contenu informationnel des segments est plus cohérent et facilite l’extraction de réponses. Dans le premier cas, les gains obtenus sont faibles comparés aux résultats de référence (sans filtrage). Dans le second cas, les gains sont plus élevés et confortent l’hypothèse, sans pour autant être significatifs. Une étude approfondie des liens existant entre les performances de RITEL et les paramètres de filtrage complète ces évaluations. Le système de segmentation créé pour travailler sur des segments est détaillé et évalué. Son évaluation nous sert à mesurer l’impact de la variabilité naturelle des pages web (en taille et en contenu) sur la tâche QR, en lien avec l’hypothèse précédente. 
En général, les résultats expérimentaux obtenus suggèrent que notre méthode aide un système QR dans sa tâche. Cependant, de nouvelles évaluations sont à mener pour rendre ces résultats significatifs, et notamment en utilisant des corpus de questions plus importants.
Published: 2013

13. Questions-Réponses en domaine ouvert : sélection pertinente de documents en fonction du contexte de la question

Author: Foucault, Nicolas, Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI), Université Paris Saclay (COmUE)-Centre National de la Recherche Scientifique (CNRS)-Sorbonne Université - UFR d'Ingénierie (UFR 919), Sorbonne Université (SU)-Sorbonne Université (SU)-Université Paris-Saclay-Université Paris-Sud - Paris 11 (UP11), Université Paris Sud - Paris XI, Sophie Rosset, and Gilles Adda
Subjects: Quaero, Natural language processing, Language modeling, Modèle de langue, [INFO.INFO-OH]Computer Science [cs]/Other [cs.OH], Question & Answering, Apprentissage automatique, Séléction de documents, Traitement automatique des langues, Recherche d’information, RITEL, Web page segmentation, Machine learning, Segmentation de pages web, Information retrieval, Classification de pages web, Questions-Réponses, Web page classification, Document selection
Abstract: This thesis aims at defining a unified adaptation of the document selection and answer extraction strategies, based on the document and question types, in a Question-Answering (QA) context. The solution is integrated in RITEL (a LIMSI QA system) to assess the contribution. We develop and investigate a method based on an Information Retrieval approach for the selection of relevant documents in QA. The method is based on a language model and a binary text classification model that assigns documents to a relevant or an irrelevant category. It is used to filter out documents that are unusable for answer extraction by automatically matching lists of a priori relevant documents to the question type. First, we present the method along with its underlying models and we evaluate it on the QA task with RITEL in French. The evaluation is done on a corpus of 500,000 unsegmented web pages with factoid questions provided by the Quaero program (i.e. evaluation at the document level or D-level). Then, we evaluate the method on segmented web pages (i.e. evaluation at the segment level or S-level). The idea is that information content is more consistent within segments, which facilitates answer extraction. D-filtering brings a small improvement over the baseline (no filtering). S-filtering outperforms both the baseline and D-filtering but not significantly. Finally, we study at the S-level the links between RITEL’s performance and the key parameters of the method. In order to apply the method on segments, we created a system of web page segmentation. We present and evaluate it on the QA task with the same corpora used to evaluate our document selection method. This evaluation follows the former hypothesis and measures the impact of natural web page variability (in terms of size and content) on RITEL in its task. 
In general, the experimental results we obtained suggest that our IR-based method helps a QA system in its task; however, further investigations should be conducted – especially with larger corpora of questions – to make these results significant.; Les problématiques abordées dans ma thèse sont de définir une adaptation unifiée entre la sélection des documents et les stratégies de recherche de la réponse à partir du type des documents et de celui des questions, intégrer la solution au système de Questions-Réponses (QR) RITEL du LIMSI et évaluer son apport. Nous développons et étudions une méthode basée sur une approche de Recherche d’Information pour la sélection de documents en QR. Celle-ci s’appuie sur un modèle de langue et un modèle de classification binaire de texte en catégorie pertinent ou non pertinent d’un point de vue QR. Cette méthode permet de filtrer les documents sélectionnés pour l’extraction de réponses par un système QR. Nous présentons la méthode et ses modèles, et la testons dans le cadre QR à l’aide de RITEL. L’évaluation est faite en français en contexte web sur un corpus de 500 000 pages web et de questions factuelles fournis par le programme Quaero. Celle-ci est menée soit sur des documents complets, soit sur des segments de documents. L’hypothèse suivie est que le contenu informationnel des segments est plus cohérent et facilite l’extraction de réponses. Dans le premier cas, les gains obtenus sont faibles comparés aux résultats de référence (sans filtrage). Dans le second cas, les gains sont plus élevés et confortent l’hypothèse, sans pour autant être significatifs. Une étude approfondie des liens existant entre les performances de RITEL et les paramètres de filtrage complète ces évaluations. Le système de segmentation créé pour travailler sur des segments est détaillé et évalué. Son évaluation nous sert à mesurer l’impact de la variabilité naturelle des pages web (en taille et en contenu) sur la tâche QR, en lien avec l’hypothèse précédente. 
En général, les résultats expérimentaux obtenus suggèrent que notre méthode aide un système QR dans sa tâche. Cependant, de nouvelles évaluations sont à mener pour rendre ces résultats significatifs, et notamment en utilisant des corpus de questions plus importants.
Published: 2013

14. A Machine Learning Based Approach for Vocabulary Selection for Speech Transcription

Author: Jouvet, Denis, Langlois, David, Analysis, perception and recognition of speech (PAROLE), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL), Ivan Habernal and Václav Matoušek, Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), and Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)
Subjects: language modeling, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, neural network, speech transcription, vocabulary selection, speech recognition, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing
Abstract: International audience; This paper introduces a new approach based on neural networks for selecting the vocabulary to be used in a speech transcription system. Nowadays, large sets of text data can be collected from web sources and used in addition to more traditional text sources for building language models for speech transcription systems. However, web data sources lead to large amounts of heterogeneous data, and, as a consequence, standard vocabulary selection procedures based on unigram approaches tend to select unwanted items as new words. As an alternative to unigram-based and empirical manual-based selection approaches, this paper proposes a new selection procedure that relies on a machine learning technique, namely neural networks. The paper presents and discusses the results obtained with the various selection procedures. The neural-network-based selection experiments are promising, and the approach can automatically take various kinds of detailed information into account in the selection process.
Published: 2013

15. Combination of Random Indexing based Language Model and N-gram Language Model for Speech Recognition

Author: Dominique Fohr, Odile Mella, Analysis, perception and recognition of speech (PAROLE), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), and Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)
Subjects: language modeling, Perplexity, business.industry, Latent semantic analysis, Computer science, Speech recognition, speech recognition, 020206 networking & telecommunications, 02 engineering and technology, Function (mathematics), computer.software_genre, 030507 speech-language pathology & audiology, 03 medical and health sciences, Random indexing, n-gram, Cache language model, random indexing, 0202 electrical engineering, electronic engineering, information engineering, Artificial intelligence, Language model, [INFO.INFO-HC]Computer Science [cs]/Human-Computer Interaction [cs.HC], 0305 other medical science, business, computer, Natural language processing
Abstract: International audience; This paper presents the results and conclusions of a study on the introduction of semantic information through the Random Indexing paradigm in statistical language models used in speech recognition. Random Indexing is an alternative to Latent Semantic Analysis (LSA) that addresses the scalability problem of LSA. After a brief presentation of Random Indexing (RI), this paper describes different methods to estimate the RI matrix, then how to derive probabilities from the RI matrix, and finally how to combine them with n-gram language model probabilities. Then, it analyzes the performance of these different RI methods and their combinations with a 4-gram language model by computing the perplexity of a test corpus of 290,000 words from the French evaluation campaign ETAPE. The main conclusions are: (1) regardless of the method, function words should not be taken into account in the estimation of the RI matrix; (2) the two methods RI_basic and TTRI_w achieved the best perplexity, i.e. a relative gain of 3% compared to the perplexity of the 4-gram language model alone (136.2 vs. 140.4).
Published: 2013
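
Note on record 15: the 3% relative gain quoted in the abstract can be checked from the two perplexities it reports (136.2 for the combined model, 140.4 for the 4-gram model alone). The snippet below is only a minimal arithmetic sketch; the function name `relative_gain` is an illustrative assumption and does not come from the paper.

```python
def relative_gain(baseline_ppl: float, new_ppl: float) -> float:
    """Relative perplexity reduction of new_ppl with respect to baseline_ppl."""
    return (baseline_ppl - new_ppl) / baseline_ppl


# Perplexities quoted in the abstract: 4-gram model alone vs. best RI combination.
gain = relative_gain(140.4, 136.2)
print(f"{gain:.1%}")  # prints 3.0%
```

Lower perplexity is better, so the gain is expressed as a reduction relative to the baseline 4-gram model.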

16. Toward Robust Information Extraction Models for Multimedia Documents

Author: Ebadat, Ali-Reza, Multimedia content-based indexing (TEXMEX), Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Inria Rennes – Bretagne Atlantique, Institut National de Recherche en Informatique et en Automatique (Inria), INSA de Rennes, Vincent Claveau, Pascale Sébillot, Quaero, Université de Rennes 1 (UR1), Université de Rennes (UNIV-RENNES)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Rennes 1 (UR1), and Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Inria Rennes – Bretagne Atlantique
Subjects: Relation Discovery, N-gram, Similarity Function, [INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR], Bag-of-Vectors, Information Extraction, Relation Extraction, Proper Noun Clustering, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], Semantic Clustering, Language Modeling
Abstract: During the last decade, huge amounts of multimedia documents have been generated. It is therefore important to find a way to manage this data. Every approach that facilitates this process requires a deep understanding of the content of the documents. Of the two approaches to gaining such insights, either extracting information from the document itself (e.g. audio, image) or using related data from external sources (such as the Web), we chose the latter. The extracted information can then be used in a global framework as annotations for multimedia documents in order to facilitate their management. One of the main objectives of this thesis was to be robust against noisy and small data. Our approach to reach this objective was to use simple and knowledge-light techniques (i.e. shallow linguistic analysis) as a guarantee of robustness, which we assume to be mandatory for processing multimedia documents. To that end, we used statistical analysis of text and some techniques inspired by Information Retrieval. In addition, we introduced a new data representation scheme for text processing which has been used successfully in the image Information Retrieval domain. In this thesis, we focused on three tasks: Relation Extraction, Relation Discovery and Proper noun clustering. In the first task, Relation Extraction, we proposed a supervised model based on Language Modeling and an instance-based learning algorithm, kNN. Experimental results showed the effectiveness of our models, which use shallow linguistic information, compared to state-of-the-art systems that use deep linguistic analysis. In the second task, we moved to an unsupervised model to discover relations instead of extracting predefined ones. We modeled this problem as a clustering task and defined a similarity function based on Language Modeling and average probability. 
The performance of this model was evaluated on textual football reports and showed improvements compared to a classical model with a cosine similarity function. Moreover, we studied the importance of some domain-independent filters in this task. Since each relation holds between two entities, we defined the last task as clustering entities (more precisely, proper nouns) in order to discover and bring out, without a priori knowledge, semantic classes. In this task, we proposed to use a new data representation to keep each instance of the proper nouns separately. Then, we introduced a discriminative similarity function in order to take into account the importance of each occurrence of the proper nouns in the corpus. In conclusion, we experimentally showed that simple techniques, requiring little a priori knowledge and using shallow linguistic information, can be useful to effectively extract information from text. In our case, such results have been achieved by choosing a suitable representation for the data, based on statistical analysis or Information Retrieval models. There is still a long road ahead before raw multimedia documents can be processed, but we hope that these good results may now serve as a springboard for future research in this field.; Au cours de la dernière décennie, d'énormes quantités de documents multimédias ont été générées. Il est donc important de trouver un moyen de gérer ces données, notamment d'un point de vue sémantique, ce qui nécessite une connaissance fine de leur contenu. Il existe deux familles d'approches pour ce faire, soit par l'extraction d'informations à partir du document (par ex., audio, image), soit en utilisant des données textuelles extraites du document ou de sources externes (par ex., Web). Notre travail se place dans cette seconde famille d'approches ; les informations extraites des textes peuvent ensuite être utilisées pour annoter les documents multimédias et faciliter leur gestion. 
L'objectif de cette thèse est donc de développer de tels modèles d'extraction d'informations. Mais les textes extraits des documents multimédias étant en général petits et bruités, ce travail veille aussi à leur nécessaire robustesse. Nous avons donc privilégié des techniques simples nécessitant peu de connaissances externes comme garantie de robustesse, en nous inspirant des travaux en recherche d'information et en analyse statistique des textes. Nous nous sommes notamment concentrés sur trois tâches : l'extraction supervisée de relations entre entités, la découverte de relations, et la découverte de classes d'entités. Pour l'extraction de relations, nous proposons une approche supervisée basée sur les modèles de langues et l'algorithme d'apprentissage des k-plus-proches voisins. Les résultats expérimentaux montrent l'efficacité et la robustesse de nos modèles, dépassant les systèmes état-de-l'art tout en utilisant des informations linguistiques plus simples à obtenir. Dans la seconde tâche, nous passons à un modèle non supervisé pour découvrir les relations au lieu d'en extraire des prédéfinies. Nous modélisons ce problème comme une tâche de clustering avec une fonction de similarité là encore basée sur les modèles de langues. Les performances, évaluées sur un corpus de vidéos de matchs de football, montrent l'intérêt de notre approche par rapport aux modèles classiques. Enfin, dans la dernière tâche, nous nous intéressons non plus aux relations mais aux entités, source d'informations essentielles dans les documents. Nous proposons une technique de clustering d'entités afin de faire émerger, sans a priori, des classes sémantiques parmi celles-ci, en adoptant une représentation nouvelle des données permettant de mieux tenir compte de chaque occurrence des entités. 
En guise de conclusion, nous avons montré expérimentalement que des techniques simples, exigeant peu de connaissances a priori, et utilisant des informations linguistiques facilement accessibles peuvent être suffisantes pour extraire efficacement des informations précises à partir du texte. Dans notre cas, ces bons résultats sont obtenus en choisissant une représentation adaptée pour les données, basée sur une analyse statistique ou des modèles de recherche d'information. Le chemin est encore long avant d'être en mesure de traiter directement des documents multimédias, mais nous espérons que nos propositions pourront servir de tremplin pour les recherches futures dans ce domaine.
Published: 2012

17. Incorporating MLP features in the unsupervised training process

Author: Fraga da Silva, Thiago, Le, Viet Bac, Lamel, Lori, Gauvain, Jean-Luc, Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI), Université Paris Saclay (COmUE)-Centre National de la Recherche Scientifique (CNRS)-Sorbonne Université - UFR d'Ingénierie (UFR 919), Sorbonne Université (SU)-Sorbonne Université (SU)-Université Paris-Saclay-Université Paris-Sud - Paris 11 (UP11), and Publications, Limsi
Subjects: MLP features, Acoustic Modeling, [INFO]Computer Science [cs], [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], Unsupervised Training, Language Modeling
Abstract: International audience; The combined use of multi-layer perceptron (MLP) and perceptual linear prediction (PLP) features has been reported to improve the performance of automatic speech recognition systems for many different languages and domains. However, MLP features have not yet been used in unsupervised acoustic model training. This approach is introduced in this paper with encouraging results. In addition, unsupervised language model training was also investigated for a Portuguese broadcast speech recognition task, leading to a slight improvement in performance. The joint use of the unsupervised techniques presented here leads to an absolute WER reduction of up to 3.2% over a baseline unsupervised system.
Published: 2012

18. Approches empiriques et modélisation statistique de la parole

Author: Gilles, Adda, Traitement du Langage parlé, Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI), Université Paris Saclay (COmUE)-Centre National de la Recherche Scientifique (CNRS)-Sorbonne Université - UFR d'Ingénierie (UFR 919), Sorbonne Université (SU)-Sorbonne Université (SU)-Université Paris-Saclay-Université Paris-Sud - Paris 11 (UP11)-Université Paris Saclay (COmUE)-Centre National de la Recherche Scientifique (CNRS)-Sorbonne Université - UFR d'Ingénierie (UFR 919), Sorbonne Université (SU)-Sorbonne Université (SU)-Université Paris-Saclay-Université Paris-Sud - Paris 11 (UP11), Université Paris Sud - Paris XI, and Pierre Zweigenbaum(Pierre.Zweigenbaum@limsi.fr)
Subjects: language modeling, analyse d'erreurs, reconnaissance de la parole, modélisation du langage, speech recognition, epistemologic study, [INFO.INFO-HC]Computer Science [cs]/Human-Computer Interaction [cs.HC], étude épistémologique, error analysis, structuration of speech sciences, structuration de la recherche en parole
Abstract: This document describes both a career in statistical language modeling and its application to multilingual language processing systems, in which I recount 28 years of research in a diachronic presentation organized under a few broad headings, and a position statement arguing for a theoretical and practical framework from which an empirical science of speech can emerge. This science should draw on the contributions of all the sciences, from automatic processing to linguistics, whose object of study is speech. Central to this convergence is the idea that automatic systems can be used as instruments to explore the very large amounts of data at our disposal and to derive new linguistic knowledge which, in turn, will make it possible to improve the models used in automatic systems. After a historical perspective, which recalls in particular the establishment of the evaluation paradigm and the development of statistical modeling of speech derived from information theory, as well as the criticisms that these two major developments have generated, we discuss some theoretical and practical points. Some epistemological questions concerning this empirical science of speech are addressed: what is the status of the knowledge we produce, and how should it be characterized in relation to other sciences? Is it possible to establish the language sciences as an autonomous, genuine science by identifying both its observable and the means of improving how it is observed, and by drawing generalizable knowledge from it? We detail in particular the definition of the observable, and the study of the residual as a diagnostic of the gap between modeling and reality. Practical proposals are then presented on the structuring of scientific production and on the development of instrumental centers for sharing the development and maintenance of the complex instruments that automatic speech processing systems are.
Published: 2011

19. Empirical methods and statistical modeling of speech

Author: Gilles, Adda, Traitement du Langage parlé, Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI), Université Paris Saclay (COmUE)-Centre National de la Recherche Scientifique (CNRS)-Sorbonne Université - UFR d'Ingénierie (UFR 919), Sorbonne Université (SU)-Sorbonne Université (SU)-Université Paris-Saclay-Université Paris-Sud - Paris 11 (UP11)-Université Paris Saclay (COmUE)-Centre National de la Recherche Scientifique (CNRS)-Sorbonne Université - UFR d'Ingénierie (UFR 919), Sorbonne Université (SU)-Sorbonne Université (SU)-Université Paris-Saclay-Université Paris-Sud - Paris 11 (UP11), Université Paris Sud - Paris XI, and Pierre Zweigenbaum(Pierre.Zweigenbaum@limsi.fr)
Subjects: language modeling, analyse d'erreurs, reconnaissance de la parole, modélisation du langage, speech recognition, epistemologic study, [INFO.INFO-HC]Computer Science [cs]/Human-Computer Interaction [cs.HC], étude épistémologique, error analysis, structuration of speech sciences, structuration de la recherche en parole
Abstract: This document describes both a career in statistical language modeling and its application to multilingual language processing systems, in which I recount 28 years of research in a diachronic presentation organized under a few broad headings, and a position statement arguing for a theoretical and practical framework from which an empirical science of speech can emerge. This science should draw on the contributions of all the sciences, from automatic processing to linguistics, whose object of study is speech. Central to this convergence is the idea that automatic systems can be used as instruments to explore the very large amounts of data at our disposal and to derive new linguistic knowledge which, in turn, will make it possible to improve the models used in automatic systems. After a historical perspective, which recalls in particular the establishment of the evaluation paradigm and the development of statistical modeling of speech derived from information theory, as well as the criticisms that these two major developments have generated, we discuss some theoretical and practical points. Some epistemological questions concerning this empirical science of speech are addressed: what is the status of the knowledge we produce, and how should it be characterized in relation to other sciences? Is it possible to establish the language sciences as an autonomous, genuine science by identifying both its observable and the means of improving how it is observed, and by drawing generalizable knowledge from it? We detail in particular the definition of the observable, and the study of the residual as a diagnostic of the gap between modeling and reality. Practical proposals are then presented on the structuring of scientific production and on the development of instrumental centers for sharing the development and maintenance of the complex instruments that automatic speech processing systems are.
Published: 2011

20. Automatically Finding Semantically Consistent N-grams to Add New Words in LVCSR Systems

Author: Pascale Sébillot, Guillaume Gravier, Gwénolé Lecorvé, Multimedia content-based indexing (TEXMEX), Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Université de Rennes 1 (UR1), Université de Rennes (UNIV-RENNES)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Rennes 1 (UR1), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Inria Rennes – Bretagne Atlantique, Institut National de Recherche en Informatique et en Automatique (Inria), Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), and Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Inria Rennes – Bretagne Atlantique
Subjects: Vocabulary, Computer science, Speech recognition, Natural language processing, Language modeling, Automatic speech recognition, Conditional probability, Networking & telecommunications, Context (language use), Engineering and technology, Semantics, Vocabulary adaptation, [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing, Electrical engineering, electronic engineering, information engineering, Artificial intelligence & image processing, Language model, Artificial intelligence, Word (computer architecture)
Abstract: International audience; This paper presents a new method to automatically add n-grams containing out-of-vocabulary (OOV) words to a baseline language model (LM), where these n-grams are sought to be grammatically correct and to make sense according to the meaning of OOV words. First, this method consists in determining the word sequences, i.e., n-grams, in which the usage of a given OOV word is the most semantically consistent. Then, conditional probabilities of these n-grams have to be computed. To do this, semantic relations between words are used to assimilate each OOV word to several equivalent in-vocabulary words. Based on these last words, n-grams from the baseline LM are re-used to find the word sequences to be added and to compute their probabilities. After augmenting the vocabulary and launching a recognition process, experiments show that our method results in WER improvements which are comparable to those obtained using a state-of-the-art open vocabulary LM.
Published: 2011

21. Modèles de langage ad hoc pour la reconnaissance automatique de la parole

Author: Oger, Stanislas and STAR, ABES
Subjects: [INFO.INFO-OH] Computer Science [cs]/Other [cs.OH], Web Language Model, Modèle de Langage Web, Mots Hors-Vocabulaires, Out-Of-Vocabulary Words, Reconnaissance Automatique de la Parole, Automatic Speech Recognition, Théorie des Possibilités, Theory of Possibilities, Language Modeling, Modélisation du Langage
Abstract: The three pillars of an automatic speech recognition system are the lexicon, the language model and the acoustic model. The lexicon provides all the words that can be transcribed, associated with their pronunciation. The acoustic model provides an indication of how the phone units are pronounced, and the language model brings the knowledge of how words are linked. In modern automatic speech recognition systems, the acoustic and language models are statistical. Their estimation requires large volumes of selected, standardized and annotated data. At present, the Web is by far the largest textual corpus available for the English and French languages. The data it holds can potentially be used to build the vocabulary and to estimate and adapt the language model. The work presented here proposes new approaches to take advantage of this resource in the context of language modeling. The document is organized into two parts. The first deals with the use of Web data to dynamically update the lexicon of the automatic speech recognition system. The proposed approach consists of dynamically and locally extending the lexicon only when unknown words appear in the speech. New words are extracted from the Web through the formulation of queries submitted to Web search engines. The phonetization of the words is obtained by an automatic grapheme-to-phoneme transcriber. The second part of the document presents a new way of handling the information contained on the Web by relying on possibility theory concepts. A Web-based possibilistic language model is proposed. It provides an estimate of the possibility of a word sequence from knowledge of the existence of its sub-sequences on the Web. A probabilistic Web-based language model is also proposed. It relies on Web document counts to estimate n-gram probabilities. Several approaches for combining these models with classical models are proposed. The results show that combining probabilistic and possibilistic models gives better results than classical probabilistic models alone. In addition, the models estimated from Web data perform better than those estimated on corpus.
Published: 2011

22. Ad-hoc language models for automatic speech recognition

Author: Oger, Stanislas and STAR, ABES
Subjects: [INFO.INFO-OH] Computer Science [cs]/Other [cs.OH], Web Language Model, Modèle de Langage Web, Mots Hors-Vocabulaires, Out-Of-Vocabulary Words, Reconnaissance Automatique de la Parole, Automatic Speech Recognition, Théorie des Possibilités, Theory of Possibilities, Language Modeling, Modélisation du Langage
Abstract: The three pillars of an automatic speech recognition system are the lexicon, the language model and the acoustic model. The lexicon provides all the words that can be transcribed, associated with their pronunciation. The acoustic model provides an indication of how the phone units are pronounced, and the language model brings the knowledge of how words are linked. In modern automatic speech recognition systems, the acoustic and language models are statistical. Their estimation requires large volumes of selected, standardized and annotated data. At present, the Web is by far the largest textual corpus available for the English and French languages. The data it holds can potentially be used to build the vocabulary and to estimate and adapt the language model. The work presented here proposes new approaches to take advantage of this resource in the context of language modeling. The document is organized into two parts. The first deals with the use of Web data to dynamically update the lexicon of the automatic speech recognition system. The proposed approach consists of dynamically and locally extending the lexicon only when unknown words appear in the speech. New words are extracted from the Web through the formulation of queries submitted to Web search engines. The phonetization of the words is obtained by an automatic grapheme-to-phoneme transcriber. The second part of the document presents a new way of handling the information contained on the Web by relying on possibility theory concepts. A Web-based possibilistic language model is proposed. It provides an estimate of the possibility of a word sequence from knowledge of the existence of its sub-sequences on the Web. A probabilistic Web-based language model is also proposed. It relies on Web document counts to estimate n-gram probabilities. Several approaches for combining these models with classical models are proposed. The results show that combining probabilistic and possibilistic models gives better results than classical probabilistic models alone. In addition, the models estimated from Web data perform better than those estimated on corpus.
Published: 2011

23. Modélisation et recherche de graphes visuels : une approche par modèles de langue pour la reconnaissance de scènes

Author: Pham, Trong-Ton, Modélisation et Recherche d’Information Multimédia [Grenoble] (MRIM), Laboratoire d'Informatique de Grenoble (LIG), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF), Université de Grenoble, Philippe Mulhem(Philippe.Mulhem@imag.fr), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS), and Philippe Mulhem
Subjects: Scene Recognition, Robot Localization, reconnaissance de scène, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, recherche d'images, image retrieval, indexation d'images, Language Modeling, localisation, [INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR], Graph Theory, Image Representation, Information Retrieval, Représentation de graphes, [INFO]Computer Science [cs], modèle de langue, [INFO.INFO-HC]Computer Science [cs]/Human-Computer Interaction [cs.HC], image indexing, recherche d'information
Abstract: A content-based image indexing and retrieval (CBIR) system needs to consider several types of visual features and the spatial information among them (i.e., different points of view) for better image representation. This thesis presents a novel approach that exploits an extension of the language modeling approach from information retrieval to the problem of graph-based image retrieval. Such a versatile graph model is needed to represent the multiple points of view of images. This graph-based framework is composed of three main stages. The image processing stage extracts image regions from the image and computes the numerical feature vectors associated with those regions. The graph modeling stage consists of two main steps. First, extracted image regions that are visually similar are grouped into clusters using an unsupervised learning algorithm, and each cluster is then associated with a visual concept. The second step generates the spatial relations between the visual concepts. Each image is represented by a visual graph built from a set of visual concepts and a set of spatial relations among them. The graph retrieval stage retrieves images relevant to a new image query. Query graphs are generated following the graph modeling stage. Inspired by the language model for text retrieval, we extend this framework to match the query graph with the document graphs from the database. Images are then ranked based on the relevance values of the corresponding image graphs. Two instances of the visual graph model have been applied to the problems of scene recognition and robot localization. We performed the experiments on two image collections: one containing 3,849 touristic images and another composed of 3,633 images captured by a mobile robot. The results show that the visual graph model outperforms the standard language model and the Support Vector Machine method by more than 10% in accuracy.
Published: 2010

24. VISUAL GRAPH MODELING AND RETRIEVAL: A LANGUAGE MODEL APPROACH FOR SCENE RECOGNITION

Author: Pham, Trong-Ton, Modélisation et Recherche d’Information Multimédia [Grenoble] (MRIM), Laboratoire d'Informatique de Grenoble (LIG), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF), Université de Grenoble, and Philippe Mulhem(Philippe.Mulhem@imag.fr)
Subjects: Scene Recognition, Robot Localization, localisation, Graph Theory, Image Representation, Information Retrieval, reconnaissance de scène, Représentation de graphes, [INFO]Computer Science [cs], modèle de langue, [INFO.INFO-HC]Computer Science [cs]/Human-Computer Interaction [cs.HC], recherche d'information, Language Modeling
Abstract: Image retrieval and categorization may need to consider several types of visual features and the spatial information between them (e.g., different points of view of an image). This thesis presents a novel approach that exploits an extension of the language modeling approach from information retrieval to the problem of graph-based image retrieval and categorization. Such a versatile graph model is needed to represent the multiple points of view of images. A language model is defined on such graphs to allow fast graph matching. We present experiments carried out with several instances of the proposed model on two collections of images: one composed of 3,849 touristic images and another composed of 3,633 images captured by a mobile robot. Experimental results show that the visual graph model (VGM) improves on the accuracy of the standard language model (LM) and outperforms the Support Vector Machine (SVM) method.; We present a new method that exploits the relation between different levels of image representation in order to complete the visual graph model. The visual graph model is an extension of the classical language model in information retrieval. We use image regions and interest points (automatically associated with visual concepts), as well as relations between these concepts, when building the graph representation. Results obtained on categorization of the RobotVision collection from the ImageCLEF 2009 competition and of the STOIC-101 collection show that (a) the procedure for automatically inducing the concepts of an image is effective, and (b) using spatial relations between two levels of representation, in addition to concepts, improves the recognition rate.
Published: 2010

25. Vectorisation des processus d'appariement document-requête

Author: Claveau, Vincent, Tavenard, Romain, Amsaleg, Laurent, Multimedia content-based indexing (TEXMEX), Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Inria Rennes – Bretagne Atlantique, Institut National de Recherche en Informatique et en Automatique (Inria), Université de Rennes 1 (UR1), Université de Rennes (UNIV-RENNES)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Rennes 1 (UR1), and Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Inria Rennes – Bretagne Atlantique
Subjects: language modeling, [INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR], Vectorization, vector space, pairing complexity, ACM: H.: Information Systems/H.3: INFORMATION STORAGE AND RETRIEVAL/H.3.3: Information Search and Retrieval/H.3.3.5: Search process
Abstract: National audience; In most IR systems, rapidly computing the proximity between a query and a document is crucial. In the Vector Space Model, this is generally computed very efficiently. With very long queries or with more advanced IR models, however, the cost of this computation can be quite high. In this paper, we propose a simple approach that transforms any score-producing document-query pairing technique into a vector representation, so that matching becomes a distance computation between vectors. It then becomes possible to use existing approximate indexing techniques that allow the fast computation of distances between high-dimensional vectors. We experimentally show that our approach does not significantly degrade the results and, at equal result-list size, can even yield better recall than the original IR system when considering high document cut-off values.
Published: 2010

26. Modèles de langage probabilistes et possibilistes basés sur le Web

Author: Oger, Stanislas, Popescu, Vladimir, Linarès, Georges, and Déposants HAL-Avignon, bibliothèque Universitaire
Subjects: world wide web, language modeling, possibility measure, automatic speech recognition, [INFO] Computer Science [cs]
Abstract: Language models are usually built either from a closed corpus, or by using documents retrieved from the World Wide Web, which are themselves treated as a closed corpus. In this paper we propose several other ways of using this resource for language modeling. We start by improving an approach that estimates n-gram probabilities from Web search engine statistics. Then, we propose a new way of considering the information extracted from the Web in a probabilistic framework. We also propose to rely on Possibility Theory for effectively using this kind of information. We compare these two approaches on two automatic speech recognition tasks: (i) transcribing broadcast news data, and (ii) transcribing domain-specific data, concerning surgical operation film comments. We show that the two approaches are effective in different situations.
Published: 2010

27. Multiple Text Segmentation for Statistical Language Modeling

Author: Sopheap Seng, Laurent Besacier, Brigitte Bigi, Eric Castelli, Communication Langagière et Interaction Personne-Système (CLIPS - IMAG), Centre National de la Recherche Scientifique (CNRS)-Institut National Polytechnique de Grenoble (INPG)-Université Joseph Fourier - Grenoble 1 (UJF), Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole (GETALP), Laboratoire d'Informatique de Grenoble (LIG), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS), International Research Institute MICA (MICA), Institut National Polytechnique de Grenoble (INPG)-Hanoi University of Science and Technology (HUST)-Centre National de la Recherche Scientifique (CNRS), and Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)
Subjects: 030507 speech-language pathology & audiology, 03 medical and health sciences, language modeling, unsegmented language, [SHS.INFO]Humanities and Social Sciences/Library and information sciences, 0202 electrical engineering, electronic engineering, information engineering, 020206 networking & telecommunications, 02 engineering and technology, 0305 other medical science, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
Abstract: International audience; In this article we deal with the text segmentation problem in statistical language modeling for under-resourced languages with a writing system without word boundary delimiters. While the lack of text resources has a negative impact on the performance of language models, the errors introduced by automatic word segmentation make those data even less usable. To better exploit the text resources, we propose a method based on weighted finite state transducers to estimate the N-gram language model from a training corpus in which each sentence is segmented in multiple ways instead of a unique segmentation. The multiple segmentation generates more N-grams from the training corpus and makes it possible to obtain N-grams not found in the unique segmentation. We use this approach to train the language models for automatic speech recognition systems for Khmer and Vietnamese, and the multiple segmentations lead to better performance than the unique segmentation approach.
Published: 2009

28. Recent advances in Automatic Speech Recognition for Vietnamese

Author: Viet-Bac Le, Laurent Besacier, Sopheap Seng, Brigitte Bigi, Thi-Ngoc-Diep Do, Vocapia Research [Orsay], Vocapia, Communication Langagière et Interaction Personne-Système (CLIPS - IMAG), Université Joseph Fourier - Grenoble 1 (UJF)-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS), and Bigi, Brigitte
Subjects: ASR, acoustic modeling, language modeling, sub-word unit, [INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL], [SHS.INFO]Humanities and Social Sciences/Library and information sciences, Vietnamese, word, [SHS.INFO] Humanities and Social Sciences/Library and information sciences, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
Abstract: International audience; This paper presents our recent activities in automatic speech recognition for Vietnamese. First, our text data collection and processing methods and tools are described. For language modeling, we investigate word, sub-word and also hybrid word/sub-word models. For acoustic modeling, when only limited speech data are available for Vietnamese, we propose some crosslingual acoustic modeling techniques. Furthermore, since the use of sub-word units can reduce the high out-of-vocabulary rate and mitigate the lack of text resources in statistical language modeling, we propose several methods to decompose, normalize and combine word and sub-word lattices generated from different ASR systems. Experimental results evaluated on the VnSpeechCorpus demonstrate the feasibility of our methods.
Published: 2008

29. Language Independent Statistical Models for on-Line Handwriting Recognition

Author: Perraud, Freddy, Viard-Gaudin, Christian, Morin, Emmanuel, Vision Objects, Institut de Recherche en Communications et en Cybernétique de Nantes (IRCCyN), Mines Nantes (Mines Nantes)-École Centrale de Nantes (ECN)-Ecole Polytechnique de l'Université de Nantes (EPUN), Université de Nantes (UN)-Université de Nantes (UN)-PRES Université Nantes Angers Le Mans (UNAM)-Centre National de la Recherche Scientifique (CNRS), Laboratoire d'Informatique de Nantes Atlantique (LINA), Mines Nantes (Mines Nantes)-Université de Nantes (UN)-Centre National de la Recherche Scientifique (CNRS), Université de Rennes 1, and Guy Lorette
Subjects: Handwriting recognition, ACM: I.: Computing Methodologies/I.5: PATTERN RECOGNITION, [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing, language modeling, perplexity, n-gram, ACM: I.: Computing Methodologies/I.7: DOCUMENT AND TEXT PROCESSING, [INFO.INFO-CV]Computer Science [cs]/Computer Vision and Pattern Recognition [cs.CV], n-class
Abstract: http://www.suvisoft.com; This paper deals with a language modeling approach that is dedicated to an on-line handwriting recognition system. Three main goals are set: i) performance, ii) versatility, and iii) resources. To achieve these goals we propose a statistical word n-class approach, which uses a learning stage to cluster words into classes and defines an estimate of the probability distribution of sequences of classes. Very large corpora from three different languages (English, French and Italian) have been used to train and test the language models. The efficiency of these models is evaluated not only from a linguistic point of view, using perplexity measurements, but also when combined inside the recognition system on real ink signals corresponding to written sentences. Using a tri-class model allows a word error rate reduction ranging from 50 to 60% according to the language.
Published: 2006

30. Rethinking Language Models within the Framework of Dynamic Bayesian Networks

Author: Khalid Daoudi, Kamel Smaïli, Murat Deviren, Smaïli, Kamel, Balázs Kégl, Guy Lapalme, Analysis, perception and recognition of speech (PAROLE), INRIA Lorraine, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), and Institut National de Recherche en Informatique et en Automatique (Inria)-Université Henri Poincaré - Nancy 1 (UHP)-Université Nancy 2-Institut National Polytechnique de Lorraine (INPL)-Centre National de la Recherche Scientifique (CNRS)-Université Henri Poincaré - Nancy 1 (UHP)-Université Nancy 2-Institut National Polytechnique de Lorraine (INPL)-Centre National de la Recherche Scientifique (CNRS)
Subjects: Dependency (UML), Computational complexity theory, Process (engineering), Modeling language, Computer science, business.industry, Language modeling, Bayesian network, Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing), 02 engineering and technology, Data modeling, 030507 speech-language pathology & audiology, 03 medical and health sciences, Dynamic Bayesian Network, [INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR], 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Language model, Artificial intelligence, [INFO.INFO-IR] Computer Science [cs]/Information Retrieval [cs.IR], 0305 other medical science, business, Dynamic Bayesian network
Abstract: International audience; We present a new approach for language modeling based on dynamic Bayesian networks. The philosophy behind this architecture is to learn from data the appropriate dependency relations between the linguistic variables used in the language modeling process. It is an original and coherent framework that processes words and classes in the same model. This approach leads to new data-driven language models capable of outperforming classical ones, sometimes with lower computational complexity. We present experiments on small and medium corpora. The results show that this new technique is very promising and deserves further investigation.
Published: 2005

31. Statistical Language Models for On-line Handwritten Sentence Recognition

Author: Eric Anquetil, Solen Quiniou, S. Carbonnel, Quiniou, Solen, Interprétation et Reconnaissance d’Images et de Documents (IMADOC), Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), and Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Centre National de la Recherche Scientifique (CNRS)
Subjects: language modeling, Language identification, business.industry, Computer science, Intelligent character recognition, Speech recognition, Word processing, Logogen model, Word error rate, [INFO.INFO-TT] Computer Science [cs]/Document and Text Processing, computer.software_genre, Intelligent word recognition, [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing, ComputingMethodologies_PATTERNRECOGNITION, Handwriting recognition, Cache language model, handwriting recognition, Word recognition, Feature (machine learning), Language model, Artificial intelligence, business, computer, Natural language processing, Natural language
Abstract: International audience; This paper investigates the integration of a statistical language model into an on-line recognition system in order to improve word recognition in the context of handwritten sentences. Two kinds of models have been considered: n-gram and n-class models (with a statistical approach to create word classes). All these models are trained on the Susanne corpus and experiments are carried out on sentences from this corpus which were written by several writers. The use of a statistical language model is shown to improve the word recognition rate and the relative impact of the different language models is compared. Furthermore, we illustrate the benefit of defining an optimal cooperation between the language model and the recognition system to reinforce the accuracy of the system.
Published: 2005

32. Une nouvelle approche de modélisation du langage par des réseaux Bayésiens dynamiques

Author: Deviren, Murat, Daoudi, Khalid, Smaïli, Kamel, Loria, Publications, Analysis, perception and recognition of speech (PAROLE), INRIA Lorraine, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), and Institut National de Recherche en Informatique et en Automatique (Inria)-Université Henri Poincaré - Nancy 1 (UHP)-Université Nancy 2-Institut National Polytechnique de Lorraine (INPL)-Centre National de la Recherche Scientifique (CNRS)-Université Henri Poincaré - Nancy 1 (UHP)-Université Nancy 2-Institut National Polytechnique de Lorraine (INPL)-Centre National de la Recherche Scientifique (CNRS)
Subjects: [INFO.INFO-OH] Computer Science [cs]/Other [cs.OH], language modeling, modélisation du langage, [INFO.INFO-OH]Computer Science [cs]/Other [cs.OH], dynamic bayesian networks, réseaux bayesiens dynamiques
Abstract: Conference with proceedings and peer review, international.; International audience; In this paper we propose a new approach to language modeling based on dynamic Bayesian networks. The principal idea of our approach is to find the dependence relations between variables that represent the different linguistic units (word, class, concept, ...) that constitute a language model. In the context of this paper the linguistic units that we consider are syntactic classes and words. Our approach should not be considered as a model combination technique. Rather, it is an original and coherent methodology that processes words and classes in the same model. We attempt to identify and model the dependence of words and classes on their linguistic context. Our ultimate goal is to devise an automatic mechanism that extracts the best dependence relations between a word and its context, i.e., lexical and syntactic. Preliminary results are very encouraging, in particular for the model in which a word depends not only on the previous word but also on the syntactic classes of the two previous words. This model outperforms the bi-gram model.
Published: 2004

33. Language modeling using dynamic Bayesian networks

Author: Deviren, Murat, Daoudi, Khalid, Smaïli, Kamel, Analysis, perception and recognition of speech (PAROLE), INRIA Lorraine, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université Henri Poincaré - Nancy 1 (UHP)-Université Nancy 2-Institut National Polytechnique de Lorraine (INPL)-Centre National de la Recherche Scientifique (CNRS)-Université Henri Poincaré - Nancy 1 (UHP)-Université Nancy 2-Institut National Polytechnique de Lorraine (INPL)-Centre National de la Recherche Scientifique (CNRS), and Loria, Publications
Subjects: [INFO.INFO-OH] Computer Science [cs]/Other [cs.OH], language modeling, modélisation du langage, [INFO.INFO-OH]Computer Science [cs]/Other [cs.OH], réseaux bayesiens dynamiques, dynamic bayesian networks
Abstract: Conference with proceedings and peer review, international.; International audience; In this paper we propose a new approach to language modeling based on dynamic Bayesian networks. The principal idea of our approach is to find the dependence relations between variables that represent the different linguistic units (word, class, concept, ...) that constitute a language model. In the context of this paper the linguistic units that we consider are syntactic classes and words. Our approach should not be considered as a model combination technique. Rather, it is an original and coherent methodology that processes words and classes in the same model. We attempt to identify and model the dependence of words and classes on their linguistic context. Our ultimate goal is to devise an automatic mechanism that extracts the best dependence relations between a word and its context, i.e., lexical and syntactic. Preliminary results are very encouraging, in particular for the model in which a word depends not only on the previous word but also on the syntactic classes of the two previous words. This model outperforms the bi-gram model.
Published: 2004

34. Statistical Language Modeling Based on Variable-Length Sequences

Author: Kamel Smaïli, Jean-Paul Haton, Imed Zitouni, Loria, Publications, Analysis, perception and recognition of speech (PAROLE), INRIA Lorraine, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), and Institut National de Recherche en Informatique et en Automatique (Inria)-Université Henri Poincaré - Nancy 1 (UHP)-Université Nancy 2-Institut National Polytechnique de Lorraine (INPL)-Centre National de la Recherche Scientifique (CNRS)-Université Henri Poincaré - Nancy 1 (UHP)-Université Nancy 2-Institut National Polytechnique de Lorraine (INPL)-Centre National de la Recherche Scientifique (CNRS)
Subjects: Vocabulary, language modeling, Perplexity, Phrase, normalized perplexity, Computer science, Speech recognition, media_common.quotation_subject, [INFO.INFO-OH]Computer Science [cs]/Other [cs.OH], séquences, phrases, 02 engineering and technology, Pronunciation, computer.software_genre, modèle de langage, Theoretical Computer Science, 030507 speech-language pathology & audiology, 03 medical and health sciences, 0202 electrical engineering, electronic engineering, information engineering, cache, media_common, Dictation, business.industry, Acoustic model, perplexité normalisée, Human-Computer Interaction, [INFO.INFO-OH] Computer Science [cs]/Other [cs.OH], triggers, 020201 artificial intelligence & image processing, Language model, Artificial intelligence, 0305 other medical science, business, computer, Software, Natural language processing, Natural language
Abstract: Article in a peer-reviewed scientific journal.; In natural language and especially in spontaneous speech, people often group words in order to constitute phrases which become usual expressions. This is due to phonological reasons (to make the pronunciation easier) or to semantic reasons (to remember a phrase more easily by assigning a meaning to a block of words). Classical language models do not adequately take into account such phrases. A better approach consists in modeling some word sequences as if they were individual dictionary elements. Sequences are considered as additional entries of the vocabulary, on which language models are computed. In this paper, we present a method for automatically retrieving the most relevant phrases from a corpus of written sentences. The originality of our approach resides in the fact that the extracted phrases are obtained from a linguistically tagged corpus. Therefore, the obtained phrases are linguistically viable. To measure the contribution of classes in retrieving phrases, we have implemented the same algorithm without using classes. The class-based method outperformed the other method by 11%. Our approach uses information theoretic criteria which ensure a high statistical consistency and make the decision of selecting a potential sequence optimal in accordance with the language perplexity. We propose several variants of language models with and without word sequences. Among them, we present a model in which the trigger pairs are linguistically more significant. We show that the use of sequences decreases the word error rate and improves the normalized perplexity. For instance, the best sequence model improves the perplexity by 16%, and the accuracy of our dictation system (MAUD) by approximately 14%. Experiments, in terms of perplexity and recognition rate, have been carried out on a vocabulary of 20000 words extracted from a corpus of 43 million words made up of two years of the French newspaper Le Monde. The acoustic model (HMM) is trained with the Bref80 corpus.
Published: 2003

35. Contribution to Topic Identification by Using Word Similarity

Author: Brun, Armelle, Smaïli, Kamel, Haton, Jean-Paul, Analysis, perception and recognition of speech (PAROLE), INRIA Lorraine, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université Henri Poincaré - Nancy 1 (UHP)-Université Nancy 2-Institut National Polytechnique de Lorraine (INPL)-Centre National de la Recherche Scientifique (CNRS)-Université Henri Poincaré - Nancy 1 (UHP)-Université Nancy 2-Institut National Polytechnique de Lorraine (INPL)-Centre National de la Recherche Scientifique (CNRS), and Loria, Publications
Subjects: [INFO.INFO-OH] Computer Science [cs]/Other [cs.OH], language modeling, word similarity, modélisation du langage, topic detection, [INFO.INFO-OH]Computer Science [cs]/Other [cs.OH], détection de thème, similarité entre mots
Abstract: Conference with proceedings and peer review, international.; International audience; In this paper, a new topic identification method, WSIM, is investigated. It exploits the similarity between words and topics. This measure is a function of the similarity between words, based on the mutual information. The performance of WSIM is compared to the cache model and to the well-known SVM classifier. Their behavior is also studied in terms of recall and precision, according to the training size. WSIM reaches 82.4% correct topic identification. It outperforms SVM (76.2%) and performs comparably to the cache model (82.0%).
Published: 2002

36. Vers une meilleure modélisation du langage : la prise en compte des séquences dans les modèles statistiques

Author: Zitouni, Imed, Smaïli, Kamel, Analysis, perception and recognition of speech (PAROLE), INRIA Lorraine, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Centre National de la Recherche Scientifique (CNRS)-Institut National Polytechnique de Lorraine (INPL)-Université Nancy 2-Université Henri Poincaré - Nancy 1 (UHP)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Institut National Polytechnique de Lorraine (INPL)-Université Nancy 2-Université Henri Poincaré - Nancy 1 (UHP), and Institut National de Recherche en Informatique et en Automatique (Inria)-Université Henri Poincaré - Nancy 1 (UHP)-Université Nancy 2-Institut National Polytechnique de Lorraine (INPL)-Centre National de la Recherche Scientifique (CNRS)-Université Henri Poincaré - Nancy 1 (UHP)-Université Nancy 2-Institut National Polytechnique de Lorraine (INPL)-Centre National de la Recherche Scientifique (CNRS)
Subjects: language modeling, reconnaissance de la parole, n-gramme, séquence, [INFO.INFO-OH]Computer Science [cs]/Other [cs.OH], speech recognition, n-gram, modèle de langage
Abstract: Conference with proceedings and peer review, national.; National audience; In natural language we find many sequences of key words that convey the structure of a sentence. These sequences are of variable length and make natural elocution possible. To take these sequences into account during speech recognition, we treated them as units and added them to the base vocabulary. Consequently, the language models using this new vocabulary rely on a history of units, each of which can be either a word or a sequence. In this paper we present an original method for extracting linguistically viable word sequences; this method is based on the principles of information theory. We also present several language models based on these sequences. The evaluation was carried out with a dictionary of 20000 words and a corpus of 43 million words. The use of sequences improved the perplexity by about 23% and the error rate of our speech recognition system MAUD by about 20%. || In natural language, several sequences of words are very frequent. Conventional language models do not adequately take into account such sequences, because they underestimate their probabilities. A better approach consists in modeling word sequences as if
Published: 2000

37. The use of continuity in modeling semantic phenomena

Author: Victorri, Bernard, Langues, textes, traitement informatique, cognition (LaTTice), École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Université Paris Diderot - Paris 7 (UPD7)-Centre National de la Recherche Scientifique (CNRS), C. Fuchs & B. Victorri, and Victorri, Bernard
Subjects: [INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL], Language modeling, Modélisation du langage, Modèles continus, Sémantique, Continuous models, [SCCO.LING] Cognitive science/Linguistics, [SHS.LANGUE]Humanities and Social Sciences/Linguistics, [SCCO.LING]Cognitive science/Linguistics, [SHS.LANGUE] Humanities and Social Sciences/Linguistics, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], Semantics
Abstract: Why should we use continuous models in semantics? At first glance, this question seems simple: we have to use continuous models if and only if semantic phenomena are continuous. But this last statement is wrong for at least two reasons. First, continuity and discreteness are not properties of phenomena; they are characterizations of theories of phenomena. Second, one can use discrete models to represent continuous concepts, and the other way round. In this paper, we address the two following questions: (1) What kinds of linguistic theories concerning semantic phenomena need concepts related to continuity? (2) What kinds of mathematical and computer tools can deal with these concepts?
Published: 1994
