Ursi, Biagio, Etienne, Carole, Eshkol-Taravella, Iris, Rossi-Gensane, Nathalie, Acosta Córdoba, Luisa, Lambert, Margot, Interactions, Corpus, Apprentissages, Représentations (ICAR), École normale supérieure - Lyon (ENS Lyon)-Université Lumière - Lyon 2 (UL2)-INRP-Ecole Normale Supérieure Lettres et Sciences Humaines (ENS LSH)-Centre National de la Recherche Scientifique (CNRS), École normale supérieure - Lyon (ENS Lyon), Laboratoire Ligérien de Linguistique (LLL), Bibliothèque nationale de France (BnF)-Université d'Orléans (UO)-Université de Tours (UT)-Centre National de la Recherche Scientifique (CNRS), Modèles, Dynamiques, Corpus (MoDyCo), Université Paris Nanterre (UPN)-Centre National de la Recherche Scientifique (CNRS), Université Lumière - Lyon 2 (UL2), The authors thank the LABEX ASLAN (ANR-10-LABX-0081) of the Université de Lyon for its financial support within the 'Investissements d'Avenir' program (ANR-11-IDEX-0007) of the French State, managed by the Agence Nationale de la Recherche (ANR)., ANR-15-FRAL-0004, SegCor, Segmentation de corpus oraux (2015)
International audience; This communication is part of the French-German project SegCor (Segmentation of Oral Corpora, ANR-15-FRAL-0004). The general aim of the project is to develop a segmentation method for oral corpora that is adequate for analysing interactional data at different levels and for various research communities.

The French and German datasets each consist of ten excerpts of ten minutes[3], which represent the overall diversity of the data in terms of situation types. The following recorded interactions have been studied: radio talks, meal preparations, reading activities with a child, service encounters, telephone calls, table talks, social meetings, school lessons and panel discussions. In our paper, we will address the relationship between these interaction types and segmentation into maximal units. More particularly, the focus will be on the composition of such units in the French corpus.

Several models have been proposed in previous research and discussed within the SegCor project: part-of-speech tagging and chunking via automatic annotation (Eshkol-Taravella et al. 2014); a syntactic annotation relying on a dependency parser (Kahane et al. 2017); a macrosyntactic segmentation into illocutionary units (Benzitoun et al. 2010; Lacheret et al. 2014); the annotation of prosodic prominences and disfluencies leading to the segmentation of intonational periods (Lacheret et al. 2014); and the annotation of Turn-Constructional Units (TCUs), i.e. the minimal, emergent and negotiable units through which participants build turns of talk in interaction (Sacks et al. 1974; Ochs et al. 1996; Traverso 2016).

In this paper, we will focus on the segmentation of broad units, which is grounded in the macrosyntactic model (Blanche-Benveniste et al. 1990; Blanche-Benveniste 2010a, 2010b; Lacheret et al. 2014).
We rely on the following maximal macrosyntactic units:
(i) simple units, composed of one nucleus, which is defined as a minimal macrosyntactic component corresponding to an autonomous utterance, according to Blanche-Benveniste et al. (1990: 114);
(ii) complex units, composed of more than one nucleus (including pre-nuclei, post-nuclei and in-nuclei, i.e. sequences beyond government);
(iii) abandoned units, i.e. syntactically unfinished units.

The segmentation was carried out on tokenized transcripts with the EXMARaLDA Partitur Editor[4]. Our main aim is to assess the relevance of the number of tokens per maximal unit in our representative corpora. We therefore propose a quantitative study focused on token count per maximal unit in each situation type.

For example, preliminary investigation has shown a higher rate of abandoned units when interactions are conflictual (e.g. panel discussion and radio talk), due to turn-taking specificities. Conversely, in expert talk, i.e. a lecture given by a single speaker, abandoned units are very rare because of the planned character of the talk.

Relying on the composition of maximal segmentation units, our contribution discusses evidence from corpus segmentation and aims to investigate variation across different interaction types. Our approach does not conflict with previous research in corpus linguistics, for example Biber's multi-dimensional analyses of written and oral genres (Biber 1988) and of conversational text types (Biber 2004) in English, which are based on a variety of linguistic features. Rather, this contribution offers complementary dimensions for a classification of interaction types, from a quantitative perspective. We will then explore the other segmentation levels annotated in the SegCor project (syntax, prosody and interaction) to study whether unit characterization depends on the type of interaction and whether similar trends can be observed. Statistical analyses and graphing are performed using the R software platform.
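
The quantitative measures described above (token count per maximal unit and the rate of abandoned units in each situation type) can be illustrated with a minimal sketch. It is written in Python for illustration only; the analyses reported here are performed in R, and this is not the project's actual pipeline. The record layout `(situation, unit_type, tokens)`, the function name `summarize` and the sample values are all hypothetical and are not SegCor data.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical records: (situation type, unit type, tokens of the unit).
# Invented for illustration only; these are NOT SegCor data.
UNITS = [
    ("panel discussion", "abandoned", ["mais", "je"]),
    ("panel discussion", "simple", ["non"]),
    ("panel discussion", "complex",
     ["si", "vous", "voulez", "on", "peut", "en", "parler"]),
    ("panel discussion", "abandoned", ["parce", "que"]),
    ("expert talk", "simple", ["bonjour"]),
    ("expert talk", "complex",
     ["dans", "cette", "analyse", "nous", "comparons", "deux", "corpus"]),
    ("expert talk", "simple", ["merci", "beaucoup"]),
    ("expert talk", "complex",
     ["ensuite", "nous", "verrons", "les", "premiers", "chiffres"]),
]


def summarize(units):
    """Return per-situation token statistics and abandoned-unit rate."""
    by_situation = defaultdict(list)
    for situation, unit_type, tokens in units:
        # Keep only what the measures need: unit type and token count.
        by_situation[situation].append((unit_type, len(tokens)))

    summary = {}
    for situation, rows in by_situation.items():
        counts = [n_tokens for _, n_tokens in rows]
        abandoned = sum(1 for unit_type, _ in rows if unit_type == "abandoned")
        summary[situation] = {
            "units": len(rows),
            "mean_tokens": mean(counts),
            "median_tokens": median(counts),
            "abandoned_rate": abandoned / len(rows),
        }
    return summary


if __name__ == "__main__":
    for situation, stats in summarize(UNITS).items():
        print(situation, stats)
```

Grouping by situation type before computing the statistics keeps the comparison across interaction types explicit; further descriptors (for instance, the share of complex units) could be added to the same per-situation dictionary.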