Searching for Discriminative Metadata of Heterogenous Corpora

Authors :
Guibon, Gaël
Tellier, Isabelle
Prévost, Sophie
Constant, Mathieu
Gerdes, Kim
Lattice - Langues, Textes, Traitements informatiques, Cognition - UMR 8094 (Lattice)
Département Littératures et langage - ENS Paris (LILA)
École normale supérieure - Paris (ENS Paris)
Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-École normale supérieure - Paris (ENS Paris)
Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Centre National de la Recherche Scientifique (CNRS)-Université Sorbonne Paris Cité (USPC)-Université Sorbonne Nouvelle - Paris 3
Université Paris-Est (UPE)
LPP - Laboratoire de Phonétique et Phonologie - UMR 7018 (LPP)
Université Sorbonne Nouvelle - Paris 3-Centre National de la Recherche Scientifique (CNRS)
Markus Dickinson
Erhard Hinrichs
Agnieszka Patejuk
Adam Przepiórkowski
Université Sorbonne Nouvelle - Paris 3-Université Sorbonne Paris Cité (USPC)-Centre National de la Recherche Scientifique (CNRS)-Université Paris sciences et lettres (PSL)-Département Littératures et langage (LILA)
Université Paris sciences et lettres (PSL)
Source :
Fourteenth International Workshop on Treebanks and Linguistic Theories (TLT14), Dec 2015, Warsaw, Poland, pp. 72-82
Publication Year :
2015
Publisher :
HAL CCSD, 2015.

Abstract

In this paper, we use machine learning techniques for part-of-speech tagging and parsing to explore the specificities of a highly heterogeneous corpus. The corpus used is a treebank of Old French made of texts which differ with respect to several types of metadata: production date, form (verse/prose), domain, and dialect. We conduct experiments in order to determine which of these metadata are the most discriminative and to induce a general methodology.

Details

Language :
English
Database :
OpenAIRE
Journal :
Fourteenth International Workshop on Treebanks and Linguistic Theories (TLT14), Dec 2015, Warsaw, Poland, pp. 72-82
Accession number :
edsair.dedup.wf.001..60f5d330c45ea51657e5b7b4e580bfc3