126 results on '"Constant, Mathieu"'
Search Results
2. Retrieve, Generate, Evaluate: A Case Study for Medical Paraphrases Generation with Small Language Models
- Author
-
Buhnila, Ioana, Sinha, Aman, and Constant, Mathieu
- Subjects
Computer Science - Computation and Language
- Abstract
The recent surge in the accessibility of large language models (LLMs) to the general population can lead to untrackable use of such models for medical-related recommendations. Language generation via LLMs has two key problems: firstly, they are prone to hallucination and therefore, for any medical purpose, they require scientific and factual grounding; secondly, LLMs pose a tremendous challenge to computational resources due to their gigantic model size. In this work, we introduce pRAGe, a pipeline for Retrieval Augmented Generation and evaluation of medical paraphrase generation using Small Language Models (SLMs). We study the effectiveness of SLMs and the impact of an external knowledge base for medical paraphrase generation in French. Comment: KnowledgeableLM 2024
- Published
- 2024
3. Domain-specific or Uncertainty-aware models: Does it really make a difference for biomedical text classification?
- Author
-
Sinha, Aman, Mickus, Timothee, Clausel, Marianne, Constant, Mathieu, and Coubez, Xavier
- Subjects
Computer Science - Computation and Language
- Abstract
The success of pretrained language models (PLMs) across a spate of use-cases has led to significant investment from the NLP community towards building domain-specific foundational models. On the other hand, in mission-critical settings such as biomedical applications, other aspects also factor in, chief of which is a model's ability to produce reasonable estimates of its own uncertainty. In the present study, we discuss these two desiderata through the lens of how they shape the entropy of a model's output probability distribution. We find that domain specificity and uncertainty awareness can often be successfully combined, but the exact task at hand weighs in much more strongly. Comment: BioNLP 2024
- Published
- 2024
4. No Imputation Needed: A Switch Approach to Irregularly Sampled Time Series
- Author
-
Agarwal, Rohit, Sinha, Aman, Vishwakarma, Ayan, Coubez, Xavier, Clausel, Marianne, Constant, Mathieu, Horsch, Alexander, and Prasad, Dilip K.
- Subjects
Computer Science - Artificial Intelligence, Computer Science - Machine Learning
- Abstract
Modeling irregularly-sampled time series (ISTS) is challenging because of missing values. Most existing methods focus on handling ISTS by converting irregularly sampled data into regularly sampled data via imputation. These models assume an underlying missing mechanism, which may lead to unwanted bias and sub-optimal performance. We present SLAN (Switch LSTM Aggregate Network), which utilizes a group of LSTMs to model ISTS without imputation, eliminating the assumption of any underlying process. It dynamically adapts its architecture on the fly based on the measured sensors using switches. SLAN exploits the irregularity information to explicitly capture each sensor's local summary and maintains a global summary state throughout the observational period. We demonstrate the efficacy of SLAN on two public datasets, namely, MIMIC-III, and Physionet 2012.
- Published
- 2023
5. How to Dissect a Muppet: The Structure of Transformer Embedding Spaces
- Author
-
Mickus, Timothee, Paperno, Denis, and Constant, Mathieu
- Subjects
Computer Science - Computation and Language
- Abstract
Pretrained embeddings based on the Transformer architecture have taken the NLP community by storm. We show that they can mathematically be reframed as a sum of vector factors and showcase how to use this reframing to study the impact of each component. We provide evidence that multi-head attentions and feed-forwards are not equally useful in all downstream applications, as well as a quantitative overview of the effects of finetuning on the overall embedding space. This approach allows us to draw connections to a wide range of previous studies, from vector space anisotropy to attention weights. Comment: Accepted at TACL (pre-MIT Press publication version)
- Published
- 2022
6. Semeval-2022 Task 1: CODWOE -- Comparing Dictionaries and Word Embeddings
- Author
-
Mickus, Timothee, van Deemter, Kees, Constant, Mathieu, and Paperno, Denis
- Subjects
Computer Science - Computation and Language
- Abstract
Word embeddings have advanced the state of the art in NLP across numerous tasks. Understanding the contents of dense neural representations is of utmost interest to the computational semantics community. We propose to focus on relating these opaque word vectors with human-readable definitions, as found in dictionaries. This problem naturally divides into two subtasks: converting definitions into embeddings, and converting embeddings into definitions. This task was conducted in a multilingual setting, using comparable sets of embeddings trained homogeneously.
- Published
- 2022
7. A Game Interface to Study Semantic Grounding in Text-Based Models
- Author
-
Mickus, Timothee, Constant, Mathieu, and Paperno, Denis
- Subjects
Computer Science - Computation and Language
- Abstract
Can language models learn grounded representations from text distribution alone? This question is both central and recurrent in natural language processing; authors generally agree that grounding requires more than textual distribution. We propose to experimentally test this claim: if any two words have different meanings and yet cannot be distinguished from distribution alone, then grounding is out of the reach of text-based models. To that end, we present early work on an online game for the collection of human judgments on the distributional similarity of word pairs in five languages. We further report early results of our data collection campaign.
- Published
- 2021
8. Mark my Word: A Sequence-to-Sequence Approach to Definition Modeling
- Author
-
Mickus, Timothee, Paperno, Denis, and Constant, Mathieu
- Subjects
Computer Science - Computation and Language
- Abstract
Defining words in a textual context is a useful task both for practical purposes and for gaining insight into distributed word representations. Building on the distributional hypothesis, we argue here that the most natural formalization of definition modeling is to treat it as a sequence-to-sequence task, rather than a word-to-sequence task: given an input sequence with a highlighted word, generate a contextually appropriate definition for it. We implement this approach in a Transformer-based sequence-to-sequence model. Our proposal allows contextualization and definition generation to be trained in an end-to-end fashion, which is a conceptual improvement over earlier works. We achieve state-of-the-art results both in contextual and non-contextual definition modeling.
- Published
- 2019
9. What do you mean, BERT? Assessing BERT as a Distributional Semantics Model
- Author
-
Mickus, Timothee, Paperno, Denis, Constant, Mathieu, and van Deemter, Kees
- Subjects
Computer Science - Computation and Language
- Abstract
Contextualized word embeddings, i.e. vector representations for words in context, are naturally seen as an extension of previous noncontextual distributional semantic models. In this work, we focus on BERT, a deep neural network that produces contextualized embeddings and has set the state-of-the-art in several semantic tasks, and study the semantic coherence of its embedding space. While showing a tendency towards coherence, BERT does not fully live up to the natural expectations for a semantic vector space. In particular, we find that the position of the sentence in which a word occurs, while having no meaning correlates, leaves a noticeable trace on the word embeddings and disturbs similarity relationships.
- Published
- 2019
10. Ressources linguistiques et identification automatique d’expressions polylexicales
- Author
-
Constant, Mathieu, primary
- Published
- 2022
11. Construction d'un jeu de données de publications scientifiques pour le TAL et la fouille de textes à partir d'ISTEX.
- Author
-
Constant Mathieu
- Published
- 2023
12. Lexical Analysis of Serbian with Conditional Random Fields and Large-Coverage Finite-State Resources
- Author
-
Constant, Mathieu, Krstev, Cvetana, and Vitas, Duško. Editors: Vetulani, Zygmunt, Mariani, Joseph, and Kubis, Marek. Series Editors: Hutchison, David, Kanade, Takeo, Kittler, Josef, Kleinberg, Jon M., Mattern, Friedemann, Mitchell, John C., Naor, Moni, Pandu Rangan, C., Steffen, Bernhard, Terzopoulos, Demetri, Tygar, Doug, and Weikum, Gerhard
- Published
- 2018
13. Des ressources lexicales du français et de leur utilisation en TAL : étude des actes de TALN
- Author
-
Choi, Hee-Soo, Fort, Karën, Guillaume, Bruno, Constant, Mathieu, Analyse et Traitement Informatique de la Langue Française (ATILF), Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Servan, Christophe, and Vilnat, Anne
- Subjects
lexical resources, lexicons, French, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
- Abstract
At the beginning of the 21st century, French was still among the under-resourced languages. Thanks to the efforts of the French natural language processing (NLP) community, many freely available resources have been produced, including French lexicons. In this article, we examine what has become of them in the community, through the lens of the proceedings of the TALN conference over a 20-year period.
- Published
- 2023
14. Modelling Irregularly Sampled Time Series Without Imputation
- Author
-
Agarwal, Rohit, Sinha, Aman, Prasad, Dilip K., Clausel, Marianne, Horsch, Alexander, Constant, Mathieu, and Coubez, Xavier
- Abstract
Modelling irregularly-sampled time series (ISTS) is challenging because of missing values. Most existing methods focus on handling ISTS by converting irregularly sampled data into regularly sampled data via imputation. These models assume an underlying missing mechanism leading to unwanted bias and sub-optimal performance. We present SLAN (Switch LSTM Aggregate Network), which utilizes a pack of LSTMs to model ISTS without imputation, eliminating the assumption of any underlying process. It dynamically adapts its architecture on the fly based on the measured sensors. SLAN exploits the irregularity information to capture each sensor's local summary explicitly and maintains a global summary state throughout the observational period. We demonstrate the efficacy of SLAN on publicly available datasets, namely, MIMIC-III, Physionet 2012 and Physionet 2019. The code is available at https://github.com/Rohit102497/SLAN.
- Published
- 2023
15. Lexical Analysis of Serbian with Conditional Random Fields and Large-Coverage Finite-State Resources
- Author
-
Constant, Mathieu, primary, Krstev, Cvetana, additional, and Vitas, Duško, additional
- Published
- 2018
16. „Mann“ is to “Donna” as「国王」is to « Reine » Adapting the Analogy Task for Multilingual and Contextual Embeddings
- Author
-
Mickus, Timothee, primary, Calò, Eduardo, additional, Jacqmin, Léo, additional, Paperno, Denis, additional, and Constant, Mathieu, additional
- Published
- 2023
17. Determinants of Employability of Young People in Congo
- Author
-
Constant Mathieu Makouezi and Ronel Guelor Ngobila
- Published
- 2022
18. Semeval-2022 Task 1: CODWOE--Comparing Dictionaries and Word Embeddings
- Author
-
Sub Natural Language Processing, LS OZ Lexion en syntaxis, ILS LLI, Natural Language Processing, Mickus, Timothee, Van Deemter, Kees, Constant, Mathieu, and Paperno, Denis
- Published
- 2022
19. Convertir le Trésor de la Langue Française en Ontolex-Lemon : un zeste de données liées
- Author
-
Ahmadi, Sina, Constant, Mathieu, Fort, Karën, Guillaume, Bruno, and McCrae, John P.
- Abstract
In this paper, we report our efforts to convert one of the most comprehensive lexicographic resources of French, the Trésor de la Langue Française, into the Ontolex-Lemon model. Despite the widespread usage of this resource, the original XML format seems to impede its integration in language technology tools. In order to breathe new life into this resource, we examine the usage and the conversion to more interoperable formats, primarily those based on the linguistic linked data, to provide this resource to a broader range of applications and users.
- Published
- 2021
20. Determinants of Employability of Young People in Congo
- Author
-
Makouezi, Constant Mathieu, primary and Ngobila, Ronel Guelor, additional
- Published
- 2022
21. How to Dissect a Muppet: The Structure of Transformer Embedding Spaces
- Author
-
Mickus, Timothee, primary, Paperno, Denis, additional, and Constant, Mathieu, additional
- Published
- 2022
22. Semeval-2022 Task 1: CODWOE – Comparing Dictionaries and Word Embeddings
- Author
-
Mickus, Timothee, primary, Van Deemter, Kees, additional, Constant, Mathieu, additional, and Paperno, Denis, additional
- Published
- 2022
23. A Game Interface to Study Semantic Grounding in Text-Based Models
- Author
-
Mickus, Timothee, primary, Constant, Mathieu, additional, and Paperno, Denis, additional
- Published
- 2021
24. A Game Interface to Study Semantic Grounding in Text-Based Models
- Author
-
LS OZ Lexion en syntaxis, UiL OTS LLI, Mickus, Timothee, Constant, Mathieu, and Paperno, Denis
- Published
- 2021
25. Génération automatique de définitions pour le français (Definition Modeling in French)
- Author
-
Mickus, Timothee, Constant, Mathieu, Paperno, D., LS OZ Lexion en syntaxis, and ILS LLI
- Abstract
Definition modeling is a recent task that aims at producing dictionary definitions based on word embeddings. We observe two gaps: (i) the current state of the art has yet to tackle languages other than English or Chinese, and (ii) the purported usability as an evaluation method for word embeddings has yet to be verified. Hence we propose a dataset for French definition modeling and evaluate how using different input embeddings impacts the performance of a simple definition modeling system.
- Published
- 2020
26. Génération automatique de définitions pour le français
- Author
-
Mickus, Timothee, Constant, Mathieu, Paperno, Denis, Analyse et Traitement Informatique de la Langue Française (ATILF), Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Natural Language Processing : representations, inference and semantics (SYNALP), Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Benzitoun, Christophe, Braud, Chloé, Huber, Laurine, Langlois, David, Ouni, Slim, Pogodalla, Sylvain, Schneider, Stéphane, IMPACT Open Language and Knowledge for Citizens, ANR-15-IDEX-04-LUE,LUE,Lorraine Université d'Excellence(2016), and ANR-15-IDEX-0004,LUE,Isite LUE(2015)
- Subjects
Distributional semantics, Word embeddings, Definition generation, Definition mode, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
- Abstract
Definition modeling is a recent task that aims at producing dictionary definitions based on word embeddings. We observe two gaps: (i) the current state of the art has yet to tackle languages other than English or Chinese, and (ii) the purported usability as an evaluation method for word embeddings has yet to be verified. Hence we propose a dataset for French definition modeling and evaluate how using different input embeddings impacts the performance of a simple definition modeling system.
- Published
- 2020
27. What do you mean, BERT?: Assessing BERT as a Distributional Semantic Model
- Author
-
Mickus, Timothee, Paperno, D., Constant, Mathieu, van Deemter, C.J., LS OZ Lexion en syntaxis, ILS LLI, and Sub Natural Language Processing
- Abstract
Contextualized word embeddings, i.e. vector representations for words in context, are naturally seen as an extension of previous noncontextual distributional semantic models. In this work, we focus on BERT, a deep neural network that produces contextualized embeddings and has set the state-of-the-art in several semantic tasks, and study the semantic coherence of its embedding space. While showing a tendency towards coherence, BERT does not fully live up to the natural expectations for a semantic vector space. In particular, we find that the position of the sentence in which a word occurs, while having no meaning correlates, leaves a noticeable trace on the word embeddings and disturbs similarity relationships.
- Published
- 2020
28. Génération automatique de définitions pour le français (Definition Modeling in French)
- Author
-
LS OZ Lexion en syntaxis, UiL OTS LLI, Mickus, Timothee, Constant, Mathieu, and Paperno, D.
- Published
- 2020
29. What do you mean, BERT?: Assessing BERT as a Distributional Semantic Model
- Author
-
LS OZ Lexion en syntaxis, UiL OTS LLI, Sub Natural Language Processing, Mickus, Timothee, Paperno, D., Constant, Mathieu, and van Deemter, C.J.
- Published
- 2020
30. A French corpus annotated for multiword expressions and named entities
- Author
-
Candito, Marie, primary, Constant, Mathieu, additional, Ramisch, Carlos, additional, Savary, Agata, additional, Guillaume, Bruno, additional, Parmentier, Yannick, additional, and Cordeiro, Silvio, additional
- Published
- 2021
31. About Neural Networks and Writing Definitions
- Author
-
Mickus, Timothee, primary, Constant, Mathieu, additional, and Paperno, Denis, additional
- Published
- 2021
32. Démonstrateur en-ligne du projet ANR PARSEME-FR sur les expressions polylexicales
- Author
-
Schmitt, Marine, Moreau, Elise, Constant, Mathieu, Savary, Agata, Analyse et Traitement Informatique de la Langue Française (ATILF), Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Vivoka, Bases de données et traitement des langues naturelles (BDTLN), Laboratoire d'Informatique Fondamentale et Appliquée de Tours (LIFAT), Université de Tours (UT)-Institut National des Sciences Appliquées - Centre Val de Loire (INSA CVL), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Centre National de la Recherche Scientifique (CNRS)-Université de Tours (UT)-Institut National des Sciences Appliquées - Centre Val de Loire (INSA CVL), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Centre National de la Recherche Scientifique (CNRS), ANR-14-CERA-0001,PARSEME-FR,Analyse syntaxique et expressions polylexicales pour le français(2014), Savary, Agata, Analyse syntaxique et expressions polylexicales pour le français - - PARSEME-FR2014 - ANR-14-CERA-0001 - Appel à projets générique - VALID, Centre National de la Recherche Scientifique (CNRS)-Université de Tours-Institut National des Sciences Appliquées - Centre Val de Loire (INSA CVL), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Centre National de la Recherche Scientifique (CNRS)-Université de Tours-Institut National des Sciences Appliquées - Centre Val de Loire (INSA CVL), and Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)
- Subjects
annotated corpus, lexicon, multiword expressions, identification, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], [SHS.LANGUE]Humanities and Social Sciences/Linguistics
- Abstract
We present the online demonstrator of the ANR PARSEME-FR project, dedicated to multiword expressions. It includes several tools for identifying such expressions and a tool for exploring the linguistic resources of the project.
- Published
- 2019
33. Statistical MWE-aware parsing
- Author
-
Constant, Mathieu, Eryiğit, Gülşen, Ramisch, Carlos, Rosner, Mike, Schneider, Gerold, Analyse et Traitement Informatique de la Langue Française (ATILF), Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Traitement Automatique du Langage Ecrit et Parlé (TALEP), Laboratoire d'Informatique et Systèmes (LIS), Aix Marseille Université (AMU)-Université de Toulon (UTLN)-Centre National de la Recherche Scientifique (CNRS)-Aix Marseille Université (AMU)-Université de Toulon (UTLN)-Centre National de la Recherche Scientifique (CNRS), Yannick Parmentier, Jakub Waszczuk, ANR-14-CERA-0001,PARSEME-FR,Analyse syntaxique et expressions polylexicales pour le français(2014), European Project: COST IC1207,PARSEME, Université de Toulon (UTLN)-Centre National de la Recherche Scientifique (CNRS)-Aix Marseille Université (AMU)-Université de Toulon (UTLN)-Centre National de la Recherche Scientifique (CNRS)-Aix Marseille Université (AMU), University of Zurich, Parmentier, Yannick, and Waszczuk, Jakub
- Subjects
10105 Institute of Computational Linguistics, MWE, multiword, 10097 English Department, parsing, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], 820 English & Old English literatures, 11551 Zurich Center for Linguistics
- Abstract
This chapter aims at presenting different strategies that have been designed to incorporate multiword expression (MWE) identification in the process of syntactic parsing using statistical approaches. We discuss MWE representation in treebanks, pipeline and joint orchestrations, the integration of external lexicons and the evaluation of MWE-aware parsers, concluding with our suggestions for future research.
- Published
- 2019
34. Comparing linear and neural models for competitive MWE identification
- Author
-
Al Saied, Hazem, Candito, Marie, Constant, Mathieu, Candito, Marie, Analyse syntaxique et expressions polylexicales pour le français - - PARSEME-FR2014 - ANR-14-CERA-0001 - Appel à projets générique - VALID, Analyse et Traitement Informatique de la Langue Française (ATILF), Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Laboratoire de Linguistique Formelle (LLF UMR7110), Centre National de la Recherche Scientifique (CNRS)-Université Paris Diderot - Paris 7 (UPD7), ANR-14-CERA-0001,PARSEME-FR,Analyse syntaxique et expressions polylexicales pour le français(2014), and Université Paris Diderot - Paris 7 (UPD7)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
[INFO.INFO-TT]Computer Science [cs]/Document and Text Processing, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
- Abstract
In this paper, we compare the use of linear versus neural classifiers in a greedy transition system for MWE identification. Both our linear and neural models achieve a new state-of-the-art on the PARSEME 1.1 shared task data sets, comprising 20 languages. Surprisingly, our best model is a simple feed-forward network with one hidden layer, although more sophisticated (recurrent) architectures were tested. The feedback from this study is that tuning an SVM is rather straightforward, whereas tuning our neural system proved more challenging. Given the number of languages and the variety of linguistic phenomena to handle for the MWE identification task, we have designed an accurate tuning procedure, and we show that hyper-parameters are better selected by using a majority-vote within random search configurations rather than a simple best configuration selection. Although the performance is rather good (better than both the best shared task system and the average of the best per-language results), further work is needed to improve the generalization power, especially on unseen MWEs.
- Published
- 2019
35. Mark my Word: A Sequence-to-Sequence Approach to Definition Modeling
- Author
-
LS OZ Lexion en syntaxis, UiL OTS LLI, Mickus, Timothee, Paperno, D., and Constant, Mathieu
- Published
- 2019
36. Statistical MWE-aware parsing
- Author
-
Parmentier, Yannick, Waszczuk, Jakub, Constant, Mathieu, Eryiğit, Gülşen, Ramisch, Carlos, Rosner, Mike, and Schneider, Gerold; https://orcid.org/0000-0002-1905-6237
- Abstract
This chapter aims at presenting different strategies that have been designed to incorporate multiword expression (MWE) identification in the process of syntactic parsing using statistical approaches. We discuss MWE representation in treebanks, pipeline and joint orchestrations, the integration of external lexicons and the evaluation of MWE-aware parsers, concluding with our suggestions for future research.
- Published
- 2019
37. Neural Lemmatization of Multiword Expressions
- Author
-
Schmitt, Marine, primary and Constant, Mathieu, additional
- Published
- 2019
38. A French corpus annotated for multiword expressions and named entities.
- Author
-
Candito, Marie, Constant, Mathieu, Ramisch, Carlos, Savary, Agata, Guillaume, Bruno, Parmentier, Yannick, and Cordeiro, Silvio Ricardo
- Subjects
ANNOTATIONS, FLOW charts, CORPORA, LEXICAL grammar, SEMANTICS (Philosophy)
- Abstract
We present the enrichment of a French treebank of various genres with a new annotation layer for multiword expressions (MWEs) and named entities (NEs). Our contribution with respect to previous work on NE and MWE annotation is the particular care taken to use formal criteria, organized into decision flowcharts, shedding some light on the interactions between NEs and MWEs. Moreover, in order to cope with the well-known difficulty of drawing a clear-cut frontier between compositional expressions and MWEs, we chose to use sufficient criteria only. As a result, annotated MWEs satisfy a varying number of sufficient criteria, accounting for the scalar nature of the MWE status. In addition to the span of the elements, annotation includes the subcategory of NEs (e.g., person, location) and one matching sufficient criterion for non-verbal MWEs (e.g., lexical substitution). The 3,099 sentences of the treebank were double-annotated and adjudicated, and we paid attention to cross-type consistency and compatibility with the syntactic layer. Overall inter-annotator agreement on non-verbal MWEs and NEs reached 71.1%. The released corpus contains 3,112 annotated NEs and 3,440 MWEs, and is distributed under an open license. [ABSTRACT FROM AUTHOR]
- Published
- 2020
39. Annotation d’expressions polylexicales verbales en français
- Author
-
Candito, Marie, Constant, Mathieu, Ramisch, Carlos, Savary, Agata, Parmentier, Yannick, Pasquer, Caroline, Antoine, Jean-Yves, Laboratoire de Linguistique Formelle (LLF UMR7110), Université Paris Diderot - Paris 7 (UPD7)-Centre National de la Recherche Scientifique (CNRS), Analyse et Traitement Informatique de la Langue Française (ATILF), Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Laboratoire d'informatique Fondamentale de Marseille (LIF), Centre National de la Recherche Scientifique (CNRS)-École Centrale de Marseille (ECM)-Aix Marseille Université (AMU), Bases de données et traitement des langues naturelles (BDTLN), Laboratoire d'Informatique Fondamentale et Appliquée de Tours (LIFAT), Centre National de la Recherche Scientifique (CNRS)-Université de Tours-Institut National des Sciences Appliquées - Centre Val de Loire (INSA CVL), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Centre National de la Recherche Scientifique (CNRS)-Université de Tours-Institut National des Sciences Appliquées - Centre Val de Loire (INSA CVL), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA), Laboratoire d'Informatique Fondamentale d'Orléans (LIFO), Université d'Orléans (UO)-Institut National des Sciences Appliquées - Centre Val de Loire (INSA CVL), Iris Eshkol, Jean-Yves Antoine, ANR-14-CERA-0001,PARSEME-FR,Analyse syntaxique et expressions polylexicales pour le français(2014), Aix Marseille Université (AMU)-École Centrale de Marseille (ECM)-Centre National de la Recherche Scientifique (CNRS), Université de Tours (UT)-Institut National des Sciences Appliquées - Centre Val de Loire (INSA CVL), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Centre National de la Recherche Scientifique (CNRS)-Université de Tours (UT)-Institut National des Sciences Appliquées - Centre Val de Loire (INSA CVL), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Centre National de la Recherche Scientifique (CNRS), Centre National de la Recherche Scientifique (CNRS)-Université Paris Diderot - Paris 7 (UPD7), Institut National des Sciences Appliquées - Centre Val de Loire (INSA CVL), and Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université d'Orléans (UO)
- Subjects
annotation ,verbal multiword expressions ,corpus ,[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL] - Abstract
National audience; We describe the French part of the data produced within the PARSEME multilingual shared task on the identification of verbal multiword expressions (Savary et al., 2017). The expressions covered for French are verbal idioms, inherently pronominal verbs, and a generalization of light-verb constructions. These phenomena were annotated on the French-UD corpus (Nivre et al., 2016) and the Sequoia corpus (Candito & Seddah, 2012), a combined corpus of 22,645 sentences with a total of 4,962 annotated expressions. This gives a ratio of roughly one annotated expression per 100 tokens, with a high rate of discontinuous expressions (40%).
- Published
- 2017
40. Benchmarking Joint Lexical and Syntactic Analysis on Multiword-Rich Data
- Author
-
Constant, Mathieu, Martinez Alonso, Hector, Analyse et Traitement Informatique de la Langue Française (ATILF), Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Automatic Language Modelling and ANAlysis & Computational Humanities (ALMAnaCH), Inria de Paris, and Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)
- Subjects
TheoryofComputation_MATHEMATICALLOGICANDFORMALLANGUAGES ,[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL] - Abstract
International audience; This article evaluates the extension of a dependency parser that performs joint syntactic analysis and multiword expression identification. We show that, given sufficient training data, the parser benefits from explicit multiword information and improves overall labeled accuracy score in eight of the ten evaluation cases.
- Published
- 2017
41. Localising memory retrieval and syntactic composition: an fMRI study of naturalistic language comprehension
- Author
-
Bhattasali, Shohini, primary, Fabre, Murielle, additional, Luh, Wen-Ming, additional, Al Saied, Hazem, additional, Constant, Mathieu, additional, Pallier, Christophe, additional, Brennan, Jonathan R., additional, Spreng, R. Nathan, additional, and Hale, John, additional
- Published
- 2018
42. Multiword Expression Processing: A Survey
- Author
-
Constant, Mathieu, primary, Eryiğit, Gülşen, additional, Monti, Johanna, additional, van der Plas, Lonneke, additional, Ramisch, Carlos, additional, Rosner, Michael, additional, and Todirascu, Amalia, additional
- Published
- 2017
43. Un Verbenet du français
- Author
-
Danlos, Laurence, Pradet, Quentin, Barque, Lucie, Nakamura, Takuya, Constant, Mathieu, Institut Universitaire de France (IUF), Ministère de l'Education nationale, de l’Enseignement supérieur et de la Recherche (M.E.N.E.S.R.), Université Paris Diderot - Paris 7 (UPD7), Analyse Linguistique Profonde à Grande Echelle, Large-scale deep linguistic processing (ALPAGE), Inria Paris-Rocquencourt, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université Paris Diderot - Paris 7 (UPD7), Université Paris-Est (UPE), Laboratoire d'Informatique Gaspard-Monge (LIGM), Centre National de la Recherche Scientifique (CNRS)-Fédération de Recherche Bézout-ESIEE Paris-École des Ponts ParisTech (ENPC)-Université Paris-Est Marne-la-Vallée (UPEM), Université Paris-Est Marne-la-Vallée (UPEM)-École des Ponts ParisTech (ENPC)-ESIEE Paris-Fédération de Recherche Bézout-Centre National de la Recherche Scientifique (CNRS), Danlos, Laurence, Université Paris Diderot - Paris 7 (UPD7)-Inria Paris-Rocquencourt, and Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)
- Subjects
thematic role ,VerbNet ,[SHS] Humanities and Social Sciences ,sub-categorization frame ,syntactic alternation ,[SHS]Humanities and Social Sciences - Abstract
VerbNet is a lexical resource for English verbs that has proven useful for NLP thanks to its high lexical and syntactic coverage and its systematic coding of thematic roles. Such a resource doesn’t exist for French. This has motivated us to develop a Verbenet for French. We present how we have developed Verbenet from VerbNet while using as far as possible the available lexical resources for French, and how the various French alternations are coded, focusing on differences with English (existence of pronominal forms, for example). This paper should allow an NLP researcher to use Verbenet in a simple and efficient way for a task such as semantic role labeling.
- Published
- 2016
44. Searching for Discriminative Metadata of Heterogenous Corpora
- Author
-
Guibon, Gaël, Tellier, Isabelle, Prévost, Sophie, Constant, Mathieu, Gerdes, Kim, Lattice - Langues, Textes, Traitements informatiques, Cognition - UMR 8094 (Lattice), Département Littératures et langage - ENS Paris (LILA), École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Centre National de la Recherche Scientifique (CNRS)-Université Sorbonne Paris Cité (USPC)-Université Sorbonne Nouvelle - Paris 3, Université Paris-Est (UPE), LPP - Laboratoire de Phonétique et Phonologie - UMR 7018 (LPP), Université Sorbonne Nouvelle - Paris 3-Centre National de la Recherche Scientifique (CNRS), Markus Dickinson, Erhard Hinrichs, Agnieszka Patejuk, Adam Przepiórkowski, Université Sorbonne Nouvelle - Paris 3-Université Sorbonne Paris Cité (USPC)-Centre National de la Recherche Scientifique (CNRS)-Université Paris sciences et lettres (PSL)-Département Littératures et langage (LILA), Université Paris sciences et lettres (PSL), PREVOST, Sophie, and Markus Dickinson, Erhard Hinrichs, Agnieszka Patejuk, Adam Przepiórkowski
- Subjects
[INFO.INFO-TT] Computer Science [cs]/Document and Text Processing ,[SHS.LANGUE] Humanities and Social Sciences/Linguistics ,[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL] ,[INFO.INFO-TT]Computer Science [cs]/Document and Text Processing ,dependency parsing ,machine learning ,[INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL] ,Old French ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,heterogeneous corpus exploration ,[INFO.INFO-HC]Computer Science [cs]/Human-Computer Interaction [cs.HC] ,[INFO.INFO-HC] Computer Science [cs]/Human-Computer Interaction [cs.HC] ,[SHS.LANGUE]Humanities and Social Sciences/Linguistics ,POS labelling - Abstract
International audience; In this paper, we use machine learning techniques for part-of-speech tagging and parsing to explore the specificities of a highly heterogeneous corpus. The corpus used is a treebank of Old French made of texts which differ with respect to several types of metadata: production date, form (verse/prose), domain, and dialect. We conduct experiments in order to determine which of these metadata are the most discriminative and to induce a general methodology.
- Published
- 2015
45. PARSEME – PARSing and Multiword Expressions within a European multilingual network
- Author
-
Savary, Agata, Sailer, Manfred, Parmentier, Yannick, Rosner, Michael, Rosén, Victoria, Przepiórkowski, Adam, Krstev, Cvetana, Vincze, Veronika, Wójtowicz, Beata, Losnegaard, Gyri Smørdal, Parra Escartín, Carla, Waszczuk, Jakub, Constant, Mathieu, Osenova, Petya, Sangati, Federico, Bases de données et traitement des langues naturelles (BDTLN), Laboratoire d'Informatique Fondamentale et Appliquée de Tours (LIFAT), Centre National de la Recherche Scientifique (CNRS)-Université de Tours-Institut National des Sciences Appliquées - Centre Val de Loire (INSA CVL), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Centre National de la Recherche Scientifique (CNRS)-Université de Tours-Institut National des Sciences Appliquées - Centre Val de Loire (INSA CVL), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA), Goethe-University Frankfurt am Main, Laboratoire d'Informatique Fondamentale d'Orléans (LIFO), Ecole Nationale Supérieure d'Ingénieurs de Bourges-Université d'Orléans (UO), University of Malta [Malta], University of Bergen (UiB), Instytut Podstaw Informatyki (IPI PAN), Polska Akademia Nauk = Polish Academy of Sciences (PAN), Faculty of Philology, University of Belgrade [Belgrade], Department of Computer Algorithms and Artificial Intelligence., University of Szeged [Szeged], Hermes Traducciones y Servicios Lingüísticos, Laboratoire d'Informatique Gaspard-Monge (LIGM), Centre National de la Recherche Scientifique (CNRS)-Fédération de Recherche Bézout-ESIEE Paris-École des Ponts ParisTech (ENPC)-Université Paris-Est Marne-la-Vallée (UPEM), Bulgarian Academy of Sciences (BAS), Fondazione Bruno Kessler [Trento, Italy] (FBK), Université de Tours (UT)-Institut National des Sciences Appliquées - Centre Val de Loire (INSA CVL), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Centre National de la Recherche Scientifique 
(CNRS)-Université de Tours (UT)-Institut National des Sciences Appliquées - Centre Val de Loire (INSA CVL), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Centre National de la Recherche Scientifique (CNRS), Parmentier, Yannick, Université d'Orléans (UO)-Ecole Nationale Supérieure d'Ingénieurs de Bourges, and Université Paris-Est Marne-la-Vallée (UPEM)-École des Ponts ParisTech (ENPC)-ESIEE Paris-Fédération de Recherche Bézout-Centre National de la Recherche Scientifique (CNRS)
- Subjects
[INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL] ,[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL] - Abstract
International audience; The aim of this paper is to present PARSEME, a COST Action devoted to the issue of Multiword Expressions in parsing and in linguistic resources (corpora, lexicons). This is a “meta-paper” intended to be the main citation point for any future work referring to PARSEME: it does not describe in detail any single result of the Action, but rather summarises its multifarious activities and provides links to such results (both completed and in progress).
- Published
- 2015
46. Analyse syntaxique de l'ancien français : quelles propriétés de la langue influent le plus sur la qualité de l'apprentissage ?
- Author
-
Guibon, Gaël, Tellier, Isabelle, Prévost, Sophie, Constant, Mathieu, Gerdes, Kim, Lattice - Langues, Textes, Traitements informatiques, Cognition - UMR 8094 (Lattice), Université Sorbonne Nouvelle - Paris 3-Université Sorbonne Paris Cité (USPC)-Centre National de la Recherche Scientifique (CNRS)-Université Paris sciences et lettres (PSL)-Département Littératures et langage (LILA), École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL), Université Paris-Est (UPE), LPP - Laboratoire de Phonétique et Phonologie - UMR 7018 (LPP), Université Sorbonne Nouvelle - Paris 3-Centre National de la Recherche Scientifique (CNRS), Département Littératures et langage - ENS Paris (LILA), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Centre National de la Recherche Scientifique (CNRS)-Université Sorbonne Paris Cité (USPC)-Université Sorbonne Nouvelle - Paris 3, and PREVOST, Sophie
- Subjects
ancien français ,exploration de corpus ,[INFO.INFO-TT] Computer Science [cs]/Document and Text Processing ,[SHS.LANGUE] Humanities and Social Sciences/Linguistics ,analyse en dépendance ,[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL] ,étiquetage morpho-syntaxique ,[INFO.INFO-TT]Computer Science [cs]/Document and Text Processing ,machine learning ,[INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL] ,apprentissage automatique ,Dependency Parsing ,Old French ,corpus exploration ,[INFO.INFO-HC]Computer Science [cs]/Human-Computer Interaction [cs.HC] ,[INFO.INFO-HC] Computer Science [cs]/Human-Computer Interaction [cs.HC] ,[SHS.LANGUE]Humanities and Social Sciences/Linguistics ,POS labelling - Abstract
Old French parsing: Which language properties have the greatest influence on learning quality? This paper presents machine learning experiments for part-of-speech labelling and dependency parsing of Old French. Machine learning methods are used for the purpose of corpus exploration. The SRCMF Treebank is our reference data. The poorly standardised nature of the language used in this corpus implies that training data is heterogeneous and quantitatively limited. We explore various strategies, based on different criteria (variability of the lexicon, verse/prose form, date of writing), to build training corpora leading to the best possible results.
Keywords: part-of-speech tagging, dependency parsing, Old French, machine learning, corpus exploration.
- Published
- 2015
47. Parsing Poorly Standardized Language Dependency on Old French
- Author
-
Guibon, Gaël, Tellier, Isabelle, Constant, Mathieu, Prévost, Sophie, Gerdes, Kim, Lattice - Langues, Textes, Traitements informatiques, Cognition - UMR 8094 (Lattice), Département Littératures et langage - ENS Paris (LILA), École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Centre National de la Recherche Scientifique (CNRS)-Université Sorbonne Paris Cité (USPC)-Université Sorbonne Nouvelle - Paris 3, Université Paris-Est (UPE), LPP - Laboratoire de Phonétique et Phonologie - UMR 7018 (LPP), Université Sorbonne Nouvelle - Paris 3-Centre National de la Recherche Scientifique (CNRS), V. Henrich, E. Hinrichs, D.de Kok, P. Osenova & A. Przepiórkowski, Université Sorbonne Nouvelle - Paris 3-Université Sorbonne Paris Cité (USPC)-Centre National de la Recherche Scientifique (CNRS)-Université Paris sciences et lettres (PSL)-Département Littératures et langage (LILA), Université Paris sciences et lettres (PSL), PREVOST, Sophie, and V. Henrich, E. Hinrichs, D.de Kok, P. Osenova & A. Przepiórkowski
- Subjects
[INFO.INFO-TT]Computer Science [cs]/Document and Text Processing ,dependency parsing ,machine learning ,[INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL] ,Old French ,[INFO.INFO-TT] Computer Science [cs]/Document and Text Processing ,corpus exploration ,[INFO.INFO-HC]Computer Science [cs]/Human-Computer Interaction [cs.HC] ,[INFO.INFO-HC] Computer Science [cs]/Human-Computer Interaction [cs.HC] ,[SHS.LANGUE]Humanities and Social Sciences/Linguistics ,[SHS.LANGUE] Humanities and Social Sciences/Linguistics ,POS labelling ,[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL] - Abstract
International audience; This paper presents results of dependency parsing of Old French, a language which is poorly standardized at the lexical level, and which displays a relatively free word order. The work is carried out on five distinct sample texts extracted from the dependency treebank Syntactic Reference Corpus of Medieval French (SRCMF). Following Achim Stein's previous work, we have trained the Mate parser on each sub-corpus and cross-validated the results. We show that the parsing efficiency is diminished by the greater lexical variation of Old French compared to parse results on modern French. In order to improve the result of the POS tagging step in the parsing process, we applied a pre-treatment to the data, comparing two distinct strategies: one using a slightly post-treated version of the TreeTagger trained on Old French by Stein, and a CRF trained on the texts, enriched with external resources. The CRF version outperforms every other approach.
- Published
- 2014
48. Localising memory retrieval and syntactic composition: an fMRI study of naturalistic language comprehension.
- Author
-
Bhattasali, Shohini, Fabre, Murielle, Luh, Wen-Ming, Al Saied, Hazem, Constant, Mathieu, Pallier, Christophe, Brennan, Jonathan R., Spreng, R. Nathan, and Hale, John
- Subjects
BRAIN physiology ,TEMPORAL lobe ,COMPARATIVE grammar ,LINGUISTICS ,LISTENING ,MAGNETIC resonance imaging ,MEMORY ,REGRESSION analysis ,SPEECH evaluation ,PHONOLOGICAL awareness ,EXECUTIVE function ,PHYSIOLOGY - Abstract
This study examines memory retrieval and syntactic composition using fMRI while participants listen to a book, The Little Prince. These two processes are quantified drawing on methods from computational linguistics. Memory retrieval is quantified via multi-word expressions that are likely to be stored as a unit, rather than built-up compositionally. Syntactic composition is quantified via bottom-up parsing that tracks tree-building work needed in composed syntactic phrases. Regression analyses localise these to spatially-distinct brain regions. Composition mainly correlates with bilateral activity in anterior temporal lobe and inferior frontal gyrus. Retrieval of stored expressions drives right-lateralised activation in the precuneus. Less cohesive expressions activate well-known nodes of the language network implicated in composition. These results help to detail the neuroanatomical bases of two widely-assumed cognitive operations in language processing. [ABSTRACT FROM AUTHOR]
- Published
- 2019
49. Named Entity Recognition for German Using Conditional Random Fields and Linguistic Resources
- Author
-
Watrin, Patrick, De Viron, Louis, Lebailly, Denis, Constant, Mathieu, Weiser, Stéphanie, EarlyTracks SA, Laboratoire d'Informatique Gaspard-Monge (LIGM), Centre National de la Recherche Scientifique (CNRS)-Fédération de Recherche Bézout-ESIEE Paris-École des Ponts ParisTech (ENPC)-Université Paris-Est Marne-la-Vallée (UPEM), Constant, Matthieu, and Université Paris-Est Marne-la-Vallée (UPEM)-École des Ponts ParisTech (ENPC)-ESIEE Paris-Fédération de Recherche Bézout-Centre National de la Recherche Scientifique (CNRS)
- Subjects
[INFO.INFO-TT]Computer Science [cs]/Document and Text Processing ,[INFO.INFO-TT] Computer Science [cs]/Document and Text Processing - Abstract
International audience; This paper presents a Named Entity Recognition system for German based on Conditional Random Fields. The model also includes language-independent features and features computed from large-coverage lexical resources. Alongside the results themselves, we show that by adding linguistic resources to a probabilistic model, the results improve significantly.
- Published
- 2014
50. Syntactic Parsing and Compound Recognition via Dual Decomposition: Application to French
- Author
-
Le Roux, Joseph, Constant, Mathieu, Rozenknop, Antoine, Laboratoire d'Informatique de Paris-Nord (LIPN), Université Sorbonne Paris Cité (USPC)-Institut Galilée-Université Paris 13 (UP13)-Centre National de la Recherche Scientifique (CNRS), Laboratoire d'Informatique Gaspard-Monge (LIGM), Centre National de la Recherche Scientifique (CNRS)-Fédération de Recherche Bézout-ESIEE Paris-École des Ponts ParisTech (ENPC)-Université Paris-Est Marne-la-Vallée (UPEM), ANR-11-IDEX-0005,USPC,Université Sorbonne Paris Cité(2011), Université Paris 13 (UP13)-Institut Galilée-Université Sorbonne Paris Cité (USPC)-Centre National de la Recherche Scientifique (CNRS), Université Paris-Est Marne-la-Vallée (UPEM)-École des Ponts ParisTech (ENPC)-ESIEE Paris-Fédération de Recherche Bézout-Centre National de la Recherche Scientifique (CNRS), Rozenknop, Antoine, and Université Sorbonne Paris Cité - - USPC2011 - ANR-11-IDEX-0005 - IDEX - VALID
- Subjects
dual decomposition ,compound word recognition ,[INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL] ,text segmentation ,[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL] ,syntactic parsing - Abstract
International audience; In this paper we show how the task of syntactic parsing of non-segmented texts, including compound recognition, can be represented as constraints between phrase-structure parsers and CRF sequence labellers. In order to build a joint system we use dual decomposition, a way to combine several elementary systems which has proven successful in various NLP tasks. We evaluate this proposition on the French SPMRL corpus. This method compares favorably with pipeline architectures and improves state-of-the-art results.
- Published
- 2014