Author: "ANR-19-CE38-0015,CLD2025,La documentation computationnelle des langues à l'horizon 2025(2019)" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"ANR-19-CE38-0015,CLD2025,La documentation computationnelle des langues à l'horizon 2025(2019)"' showing total 22 results



1. Modèles bayésiens non-paramétriques pour la segmentation conjointe en mots et morphèmes

Author: Okabe, Shu, Yvon, François, Traitement du Langage Parlé (TLP ), Laboratoire Interdisciplinaire des Sciences du Numérique (LISN), Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Sciences et Technologies des Langues (STL), Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS), ANR, Association for Computational Linguistics, and ANR-19-CE38-0015,CLD2025,La documentation computationnelle des langues à l'horizon 2025(2019)
Subjects: [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
Abstract: International audience; Language documentation often requires segmenting transcriptions of utterances collected in the field into words and morphemes. While these two tasks are typically performed in succession, we study here Bayesian models for simultaneously segmenting utterances at these two levels. Our aim is twofold: (a) to study the effect of explicitly introducing a hierarchy of units in joint segmentation models; (b) to further assess whether these two levels can be better identified through weak supervision. For this, we first consider a deterministic coupling between independent models; then design and evaluate hierarchical Bayesian models. Experiments with two under-resourced languages (Japhug and Tsez) allow us to better understand the value of various types of weak supervision. In our analysis, we use these results to revisit the distributional hypotheses behind Bayesian segmentation models and evaluate their validity for language documentation data.
Published: 2023

2. Extraction et analyse de concepts médicaux dans un corpus de spécialité en orthophonie

Author: Le Clercq de Lannoy, Tiphaine, Besancon, Romaric, Ferret, Olivier, Tourille, Julien, Brin-Henry, Frédérique, Vieru, Bianca, Département Intelligence Ambiante et Systèmes Interactifs (DIASI), Laboratoire d'Intégration des Systèmes et des Technologies (LIST (CEA)), Direction de Recherche Technologique (CEA) (DRT (CEA)), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Direction de Recherche Technologique (CEA) (DRT (CEA)), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université Paris-Saclay, Analyse et Traitement Informatique de la Langue Française (ATILF), Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Becerra, Leonor, Favre, Benoît, Gardent, Claire, Parmentier, Yannick, ANR-19-CE38-0015,CLD2025,La documentation computationnelle des langues à l'horizon 2025(2019), Leonor Becerra, Benoît Favre, Claire Gardent, and Yannick Parmentier
Subjects: Tâches peu dotées, Extraction de concepts médicaux, Annotation sémantique, Extraction de termes, Domaine de spécialité, Modèles de langue neuronaux, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
Abstract: International audience; The emergence of large pre-trained language models such as BERT (Devlin et al., 2019) has fostered the definition and application of transfer learning strategies, in particular through the notion of fine-tuning. Although this development makes it easier to train models for specialized domains from more general models, such training still suffers from the lack of annotated data in sufficient quantities. In this article, we focus more specifically on the health domain and on the task of named entity recognition in French. We explore two ways of facilitating adaptation to specialized domains. The first takes up the idea, initially explored by Gururangan et al. (2020), that using an unannotated corpus from the target domain to continue training a pre-trained model on its language modeling task specializes the model for that domain and improves the results of fine-tuning on the final target task. This approach has been applied in particular by Copara et al. (2020) to medical named entity recognition in French. The second way exploits existing knowledge about the target domain, which is particularly rich in the case of the medical and health domain.
More precisely, among the many studies that jointly use neural language models and a priori knowledge (Yin et al., 2022; Wei et al., 2021; Yang et al., 2022), one can distinguish approaches that may be called early, which aim to inject knowledge directly into the models, either during their construction or a posteriori, from so-called late approaches, in which language models and knowledge are merged at the level of the task results. We adopt this second perspective, while setting ourselves apart from self-training approaches (Gao et al., 2021), in which knowledge is used to perform a form of data augmentation. In addition, we apply the techniques studied to a speech and language therapy corpus, OrthoCorpus (2019), in order to analyze the named entity extractions on concrete cases, from the point of view of the clinical value of the approach and its feasibility for domain experts. From a disciplinary point of view, this makes it possible to question conceptual classification in health within a specific subdomain at the crossroads of the biomedical sciences and the humanities and social sciences. Examining the forms and status of the candidate terms tells us about the specialized language (L'Homme, 2011). More precisely, through the contributions of this article, we show, for named entity recognition in the health domain, that: (i) using specialized corpora to adapt pre-trained language models can be worthwhile, even for corpora that can be considered small compared with the experiments of Gururangan et al. (2020); (ii) different neural models and a knowledge-based approach have complementary profiles that a late combination can exploit.
Published: 2022

3. Vers la génération automatique de gloses pour la documentation automatique des langues

Author: Okabe, Shu, Yvon, François, Laboratoire Interdisciplinaire des Sciences du Numérique (LISN), Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS), Becerra, Leonor, Favre, Benoît, Gardent, Claire, Parmentier, Yannick, and ANR-19-CE38-0015,CLD2025,La documentation computationnelle des langues à l'horizon 2025(2019)
Subjects: génération de gloses interlinéaires, documentation automatique des langues, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
Abstract: National audience; One step in the process of documenting a language consists in annotating utterances collected in the field, after recording and phonetic transcription, at the morpheme level. Concretely, for each minimal unit segmented in the input sequence, the task is to attach either one or (more rarely) several morphosyntactic labels, or a concept label, most often represented by the corresponding English word. With a view to automating this annotation phase, we present the results of a preliminary study in which we treat it as a sequence labeling task, whose difficulty we seek to estimate by comparing it with a standard part-of-speech tagging task. Our main question is to assess the feasibility of this annotation when training data are very limited.
Published: 2022

4. Segmentation en mot faiblement supervisée: un outil pour la linguistique de terrain

Author: Shu Okabe, Laurent Besacier, François Yvon, Traitement du Langage Parlé (TLP ), Laboratoire Interdisciplinaire des Sciences du Numérique (LISN), Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Sciences et Technologies des Langues (STL), Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS), Naver Labs Europe [Meylan], ANR, Association for Computational Linguistics, and ANR-19-CE38-0015,CLD2025,La documentation computationnelle des langues à l'horizon 2025(2019)
Subjects: [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing, Unsupervised word segmentation, Computational Language Documentation
Abstract: International audience; Word and morpheme segmentation are fundamental steps of language documentation, as they make it possible to discover lexical units in a language for which the lexicon is unknown. However, in most language documentation scenarios, linguists do not start from a blank page: they may already have a pre-existing dictionary or have initiated manual segmentation of a small part of their data. This paper studies how such weak supervision can be exploited in Bayesian non-parametric models of segmentation. Our experiments on two very low-resource languages (Mboshi and Japhug), whose documentation is still in progress, show that weak supervision can be beneficial to the segmentation quality. In addition, we investigate an incremental learning scenario where manual segmentations are provided in a sequential manner. This work opens the way for interactive annotation tools for documentary linguists.
Published: 2022

5. Modèle-s bayés-ien-s pour la segment-ation à deux niveau-x faible-ment super-vis-é-e

Author: Okabe, Shu, Yvon, François, Laboratoire Interdisciplinaire des Sciences du Numérique (LISN), Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS), Traitement du Langage Parlé (TLP ), Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Sciences et Technologies des Langues (STL), Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS), Estève, Yannick, Jiménez, Tania, Parcollet, Titouan, Zanon Boito, Marcely, and ANR-19-CE38-0015,CLD2025,La documentation computationnelle des langues à l'horizon 2025(2019)
Subjects: modèle bayésien non-paramétrique, segmentation en morphèmes, documentation automatique des langues, segmentation en mots, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
Abstract: National audience; Automatic segmentation into words and morphemes is a crucial step in the language documentation process. In this work, we study several Bayesian models for jointly segmenting sentences at these two levels: on the one hand, by introducing a deterministic coupling between two models specialized in identifying each type of boundary; on the other hand, by proposing an intrinsically hierarchical model. An important goal of this study is to compare these models in a scenario where weak supervision is available. Our experiments cover two languages and make it possible to compare the merits of these models under realistic conditions.
Published: 2022

6. Bảng từ EFEO-CNRS-SOAS dùng cho nghiên cứu điền dã ngôn ngữ học ở Đông Nam Á

Author: Frederic Pain, Michel Ferlus, Alexis Michaud, Thị Thu Hà Phạm, Ryan Gehrmann, NGUYEN Minh-Chau, Langues et civilisations à tradition orale (LACITO), Université Sorbonne Nouvelle - Paris 3-Institut National des Langues et Civilisations Orientales (Inalco)-Centre National de la Recherche Scientifique (CNRS), Centre de Recherches Linguistiques sur l'Asie Orientale (CRLAO), École des hautes études en sciences sociales (EHESS)-Institut National des Langues et Civilisations Orientales (Inalco)-Centre National de la Recherche Scientifique (CNRS), Vietnam National University - Department of Linguistics (VNU-USSH), Vietnam National University [Hanoï] (VNU), Payap University, LPP - Laboratoire de Phonétique et Phonologie - UMR 7018 (LPP), Université Sorbonne Nouvelle - Paris 3-Centre National de la Recherche Scientifique (CNRS), ANR-10-LABX-0083,EFL,Empirical Foundations of Linguistics : data, methods, models(2010), ANR-19-CE38-0015,CLD2025,La documentation computationnelle des langues à l'horizon 2025(2019), Speech Communication, International Research Institute MICA (MICA), Institut National Polytechnique de Grenoble (INPG)-Hanoi University of Science and Technology (HUST)-Centre National de la Recherche Scientifique (CNRS)-Institut National Polytechnique de Grenoble (INPG)-Hanoi University of Science and Technology (HUST)-Centre National de la Recherche Scientifique (CNRS), ANR-11-IDEX-0005,USPC,Université Sorbonne Paris Cité(2011), Michaud, Alexis, Empirical Foundations of Linguistics : data, methods, models - - EFL2010 - ANR-10-LABX-0083 - LABX - VALID, and La documentation computationnelle des langues à l'horizon 2025 - - CLD20252019 - ANR-19-CE38-0015 - AAPG2019 - VALID
Subjects: [SHS.ANTHRO-SE] Humanities and Social Sciences/Social Anthropology and ethnology, CNRS, Asie du Sud-Est, EFEO, dialectologie, [SHS.ANTHRO-SE]Humanities and Social Sciences/Social Anthropology and ethnology, linguistic documentation, [SHS.LANGUE] Humanities and Social Sciences/Linguistics, word list, Southeast Asia, SOAS, dialectology, linguistic fieldwork, multilingual resources, ressources multilingues, documentation linguistique, [SHS.LANGUE]Humanities and Social Sciences/Linguistics, liste de vocabulaire
Abstract: This word list aims to allow researchers (i) to conduct in-depth lexical investigation when doing fieldwork on languages of Southeast Asia, and (ii) to navigate between languages and dialects, through the use of a unique identifier for each lexical entry. The first version of this word list was created by Ecole Française d'Extrême-Orient (EFEO) for a broad investigation launched in 1938 and interrupted by the war in 1940. A second version of the word list was elaborated at the CNRS laboratory CeDRASEMI (Centre de documentation et de recherche sur l'Asie du Sud-Est et le monde insulindien). This overhaul was supervised by Lucien Bernot, probably between 1960 and 1970; the list was jointly prepared by the Centre de documentation et de recherche sur l'Asie du Sud-Est et le monde insulindien (EPHE-CNRS, Paris) and the Department of South Asia and Oceania of the School of Oriental Studies (University of London) with a view to creating an Ethnolinguistic Atlas of Southeast Asia. Michel Ferlus re-typed the 22-page list to adopt a format suitable for use in the field. As the list remained insufficiently comprehensive for in-depth linguistic fieldwork, Michel Ferlus added further items in the course of his field trips to Vietnam in the 1990s. This list was circulated among Michel Ferlus's colleagues and collaborators. Khmer glosses were added, based on a version of the CeDRASEMI-SOAS list to which Marie Martin had added Khmer glosses. Version 1 of the present document was updated at the International Research Institute MICA in 2013-2014. Chinese glosses were added; English glosses were supplemented; and Vietnamese glosses were revised. The word list is offered online in Open Office format (.ods) and MS-Excel format (.xlsx). In Version 2 (2016), the full set of English translations was checked. 
In Version 3 (2019), Central Thai, Northern Thai, Lao and Burmese translations were added. In Version 4 (2022), Tibetan translations were added., This numbered list aims to allow researchers to navigate between the languages and dialects collected over the years and across all fieldwork sites in Southeast Asia. The first version of this lexicon was created by the Ecole Française d'Extrême-Orient (EFEO) for a broad survey launched in 1938 and interrupted by the war in 1940. The EFEO printed a number of copies as small booklets, distributed to officials sent on assignment by the colonial administration (Questionnaire linguistique, Hanoi: Imprimerie d'Extrême-Orient, 1938). A second version was developed at the CNRS laboratory CeDRASEMI (Centre de documentation et de recherche sur l'Asie du Sud-Est et le monde insulindien). Lucien Bernot was the linchpin of this improvement, which must have taken place between 1960 and 1970. This version was described as a "linguistic questionnaire prepared jointly by the Centre de documentation et de recherche sur l'Asie du Sud-Est et le monde insulindien (EPHE-CNRS, Paris) and the Department of South-East Asia and Oceania of the School of Oriental Studies (University of London) with a view to establishing an Ethnolinguistic Atlas of Southeast Asia". Michel Ferlus re-typed this list to produce survey notebooks that are convenient to fill in during fieldwork. As the list remained insufficient for proper linguistic use, Michel Ferlus expanded it during his fieldwork in Vietnam in the 1990s. The list was enriched with Khmer glosses, partly based on the version of the CeDRASEMI-SOAS list annotated in Khmer by Marie Martin. The first version of the present document was produced at the International Research Institute MICA from 2013 onwards.
The main adjustments were as follows: a new numbering was established to make the list easier to use; Chinese glosses were added; the English glosses were supplemented; and the Vietnamese glosses were fully revised. The word list is freely available online in Open Office format (.ods). In Version 2 (2016), the English glosses were fully reviewed by a team coordinated by Ryan Gehrmann. In Version 3 (2019), Central Thai, Northern Thai, Lao and Burmese were added. In Version 4 (2022), Tibetan was added.
Published: 2022

7. L'intonation dans les langues tonales : des réflexions générales et deux études de cas

Author: Alexis Michaud, NGUYEN Minh-Chau, Vera Scholvin, Langues et civilisations à tradition orale (LACITO), Université Sorbonne Nouvelle - Paris 3-Institut National des Langues et Civilisations Orientales (Inalco)-Centre National de la Recherche Scientifique (CNRS), Vietnam National University - Department of Linguistics (VNU-USSH), Vietnam National University [Hanoï] (VNU), Freie Universität Berlin, ANR-10-LABX-0083,EFL,Empirical Foundations of Linguistics : data, methods, models(2010), ANR-19-CE38-0015,CLD2025,La documentation computationnelle des langues à l'horizon 2025(2019), Michaud, Alexis, Empirical Foundations of Linguistics : data, methods, models - - EFL2010 - ANR-10-LABX-0083 - LABX - VALID, La documentation computationnelle des langues à l'horizon 2025 - - CLD20252019 - ANR-19-CE38-0015 - AAPG2019 - VALID, and ANR-10-LABX-0083,EFL,Empirical Foundations of Linguistics : data, methods, models(2011)
Subjects: intonation, Tons lexicaux, muong, Lexical tones, Vietnamese, Prosody, [SHS.LANGUE]Humanities and Social Sciences/Linguistics, vietnamien, [SHS.LANGUE] Humanities and Social Sciences/Linguistics, prosodie, Vietnamien parlé
Abstract: The present work illustrates what the study of tonal languages can contribute to the study of intonation. Studying the intonation of a tonal language leads to maintaining a consistent distinction between tones and intonation as separate levels. This perspective is instrumental in bringing out the specificity, richness and complexity of intonation. After recapitulating the essentials of a conceptual framework, we present a modest pilot experiment on Vietnamese and French which confirms the existence of crosswalks between tones and intonation. We then address the question of the nature of the prosodic phenomena that occur on final particles in Muong (a close relative of Vietnamese). Do these particles carry a lexical tone? What contribution does sentence intonation make to their realization? Various hypotheses are formulated, notably concerning the possible origins of the glottalized “intonational tone” carried by two final particles marking assent., Le présent travail illustre ce que l’étude des langues tonales peut apporter à celle de l’intonation. Étudier l’intonation d’une langue tonale amène à distinguer de façon conséquente le niveau des tons de celui de l’intonation, ce qui aide à bien reconnaître la spécificité, la richesse et la complexité de l’intonation. Après avoir rappelé les bases d’un cadre conceptuel, nous exposons une modeste expérience-pilote au sujet du vietnamien et du français qui confirme l’existence de passerelles entre tons et intonation. Nous abordons ensuite la question de la nature des phénomènes prosodiques qui se réalisent sur les particules finales en langue muong (proche parente du vietnamien). Ces particules portent-elles un ton lexical ? Quelle contribution l’intonation de phrase apporte-t-elle à leur réalisation ? Diverses hypothèses sont formulées, notamment concernant les origines possibles du « ton intonatif » glottalisé porté par deux particules finales marquant l'assentiment.
Published: 2021

8. Recognizing lexical units in low-resource language contexts with supervised and unsupervised neural networks

Author: MACAIRE, Cécile, Langues et civilisations à tradition orale (LACITO), Université Sorbonne Nouvelle - Paris 3-Institut National des Langues et Civilisations Orientales (Inalco)-Centre National de la Recherche Scientifique (CNRS), Laboratoire d'Informatique de Grenoble (LIG), Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP ), Université Grenoble Alpes (UGA), Institut des Sciences du Digital, Management et Cognition (IDMC), Université de Lorraine (UL), Institut des Langues Rares de l'École Pratique des Hautes Études (ILARA-EPHE), LACITO (UMR 7107), ANR-19-CE38-0015,CLD2025,La documentation computationnelle des langues à l'horizon 2025(2019), and ANR-10-LABX-0083,EFL,Empirical Foundations of Linguistics : data, methods, models(2010)
Subjects: [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Machine learning, deep learning, Automatic Speech Recognition ASR, [SHS.LANGUE]Humanities and Social Sciences/Linguistics, [INFO.INFO-NE]Computer Science [cs]/Neural and Evolutionary Computing [cs.NE], Neural networks, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Abstract: Automatic Speech Recognition (ASR) has made significant progress thanks to the advent of deep neural networks (DNNs). In the context of under-resourced languages, for which few resources are available, spectacular achievements have been reported. ASR systems are a step forward for language documentation, as the annotation cost is considerably reduced for field linguists (manually annotating an audio file can take a tremendous amount of time), and the language is preserved and perpetuated through documentation. Previous 'standard' deep neural networks reached very good performance for phonemic transcription (such as with Kaldi and ESPnet approaches). However, these methods only operate at the phoneme level. In this thesis, we explore recently published ASR approaches which have been shown to be effective on low-resource languages to produce word-level audio-aligned transcriptions. The first approach, based on self-supervised learning, is a speech model that uses Connectionist Temporal Classification (CTC). The second, entitled wav2vec-U, proposes a framework intended to build an ASR system in a fully unsupervised fashion. With few resources at our disposal, we try to assess how usable dictionaries are. We conducted experiments on two low-resource corpora, Yongning Na and Japhug, from the Pangloss Collection. The experimental results from the first approach demonstrate powerful word-level transcriptions with competitive error rates. Preliminary results are reported on the second approach. By measuring the coverage of dictionaries on the available transcriptions, we show that these resources are not yet usable in the approaches tested.
Published: 2021

9. User-friendly automatic transcription of low-resource languages: Plugging ESPnet into Elpis

Author: Alexis Michaud, Benjamin Galliot, Rahasya Sanders-Dwyer, Guillaume Wisniewski, Ben Foley, Laurent Besacier, Christopher Cox, Nicholas Lambourne, Guillaume Jacques, Nathan W. Hill, Séverine Guillaume, Oliver Adams, Janet Wiles, Katya Aplonova, Atos zData, Langues et civilisations à tradition orale (LACITO), Université Sorbonne Nouvelle - Paris 3-Institut National des Langues et Civilisations Orientales (Inalco)-Centre National de la Recherche Scientifique (CNRS), Laboratoire de Linguistique Formelle (LLF UMR7110), Centre National de la Recherche Scientifique (CNRS)-Université de Paris (UP), University of Queensland [Brisbane], ARC Centre of Excellence for the Dynamics of Language (CoEDL), Laboratoire d'Informatique de Grenoble (LIG), Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP ), Université Grenoble Alpes (UGA), University of Alberta, Langage, LAngues et Cultures d'Afrique (LLACAN), Institut National des Langues et Civilisations Orientales (Inalco)-Centre National de la Recherche Scientifique (CNRS), Centre de Recherches Linguistiques sur l'Asie Orientale (CRLAO), École des hautes études en sciences sociales (EHESS)-Institut National des Langues et Civilisations Orientales (Inalco)-Centre National de la Recherche Scientifique (CNRS), School of Oriental and African Studies (SOAS), University of London, Institut des Langues Rares (ILARA) de l'École Pratique des Hautes Études (EHESS), ANR-19-CE38-0015,CLD2025,La documentation computationnelle des langues à l'horizon 2025(2019), ANR-10-LABX-0083,EFL,Empirical Foundations of Linguistics : data, methods, models(2010), ANR-12-CORP-0006,HimalCo,Corpus parallèles en langues himalayennes(2012), ANR-19-P3IA-0003,MIAI,MIAI @ Grenoble Alpes(2019), European Project: 609823,EC:FP7:ERC,ERC-2013-SyG,ASIA(2014), European Project: 758232, Langage, LAngues et Cultures d'Afrique Noire (LLACAN), 
ANR-19-CE38-0015-04,CLD2025,Computational Language Documentation 2025, ANR-10-LABX-0083,EFL,Empirical Foundations of Linguistics : data, methods, models(2011), Laboratoire de Linguistique Formelle (LLF - UMR7110), and Centre National de la Recherche Scientifique (CNRS)-Université Paris Cité (UPCité)
Subjects: Signal Processing (eess.SP), FOS: Computer and information sciences, endangered languages, MESH: Traitement Automatique des Langues Naturelles, Low resource, Computer science, Computer Science - Artificial Intelligence, Computational Language Documentation, Language documentation, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], MESH: reconnaissance automatique de la parole, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], MESH: documentation linguistique, Human–computer interaction, language documentation, FOS: Electrical engineering, electronic engineering, information engineering, [SHS.LANGUE]Humanities and Social Sciences/Linguistics, Electrical Engineering and Systems Science - Signal Processing, Graphical user interface, User Friendly, Computer Science - Computation and Language, business.industry, automatic speech recognition, automatic transcription, MESH: langues en danger, ACM: D.: Software, Artificial Intelligence (cs.AI), Transcription (software), business, MESH: transcription automatique, Computation and Language (cs.CL), [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing
Abstract: International audience; This paper reports on progress integrating the speech recognition toolkit ESPnet into Elpis, a web front-end originally designed to provide access to the Kaldi automatic speech recognition toolkit. The goal of this work is to make end-to-end speech recognition models available to language workers via a user-friendly graphical interface. Encouraging results are reported on (i) development of an ESPnet recipe for use in Elpis, with preliminary results on data sets previously used for training acoustic models with the Persephone toolkit along with a new data set that had not previously been used in speech recognition, and (ii) incorporating ESPnet into Elpis along with UI enhancements and a CUDA-supported Dockerfile.
Published: 2020

10. Digital Object Identifiers as an absolute must? Why DOIs were assigned in the Pangloss Collection, an open archive of endangered languages

Author: Vasile, Aurelia, Guillaume, Séverine, Aouini, Mourad, Michaud, Alexis, Maison des Sciences de l'Homme de Clermont-Ferrand (MSH Clermont), Université Clermont Auvergne [2017-2020] (UCA [2017-2020])-Centre National de la Recherche Scientifique (CNRS), Langues et civilisations à tradition orale (LACITO), Université Sorbonne Nouvelle - Paris 3-Institut National des Langues et Civilisations Orientales (Inalco)-Centre National de la Recherche Scientifique (CNRS), Cultures, Langues, Textes (CLT), Centre National de la Recherche Scientifique (CNRS), ANR-10-LABX-0083,EFL,Empirical Foundations of Linguistics : data, methods, models(2010), ANR-19-CE38-0015,CLD2025,La documentation computationnelle des langues à l'horizon 2025(2019), Michaud, Alexis, Empirical Foundations of Linguistics : data, methods, models - - EFL2010 - ANR-10-LABX-0083 - LABX - VALID, and La documentation computationnelle des langues à l'horizon 2025 - - CLD20252019 - ANR-19-CE38-0015 - AAPG2019 - VALID
Subjects: Metadata, DataCite, [SHS.LANGUE] Humanities and Social Sciences/Linguistics, Science ouverte, [SHS]Humanities and Social Sciences, DOI, Open Science, Métadonnées, Identifiants numériques, Collection Pangloss, Digital identifiers, [SHS] Humanities and Social Sciences, [SHS.LANGUE]Humanities and Social Sciences/Linguistics, Pangloss Collection
Abstract: The Pangloss Collection is an open archive of endangered languages, developed over more than twenty years in what would currently be called an Open Science perspective: to arrive at a close association between documentation and research, for their mutual benefit. The article looks back at the general topic of resource identification, and at the current “socio-scientific” context, in order to put into perspective the somewhat paradoxical choice of assigning a Digital Object Identifier (DOI) to each document in the Pangloss Collection. The stages of implementation are also outlined, broaching methodological as well as technical topics., La Collection Pangloss est une archive ouverte de langues à tradition orale, née il y a plus de vingt ans d’une vision qu’on dirait aujourd’hui de Science ouverte : elle consiste à associer étroitement documentation et recherche, pour leur bénéfice mutuel. L’article revient sur les problématiques d’identification de la ressource, et sur le contexte « socio-scientifique » actuel, pour mettre en perspective le choix – qui peut paraître paradoxal – d’attribuer un Digital Object Identifier (DOI) à chaque document de la Collection Pangloss. Les étapes de la mise en œuvre sont également abordées, dans leurs dimensions méthodologiques et techniques.
Published: 2020

11. Le Digital Object Identifier, une impérieuse nécessité ? L'exemple de l'attribution de DOI à la Collection Pangloss, archive ouverte de langues en danger

Author: Aurelia Vasile, Séverine Guillaume, Mourad Aouini, Alexis Michaud, Maison des Sciences de l'Homme de Clermont-Ferrand (MSH Clermont), Université Clermont Auvergne [2017-2020] (UCA [2017-2020])-Centre National de la Recherche Scientifique (CNRS), Langues et civilisations à tradition orale (LACITO), Université Sorbonne Nouvelle - Paris 3-Institut National des Langues et Civilisations Orientales (Inalco)-Centre National de la Recherche Scientifique (CNRS), Cultures, Langues, Textes (CLT), Centre National de la Recherche Scientifique (CNRS), ANR-10-LABX-0083,EFL,Empirical Foundations of Linguistics : data, methods, models(2010), and ANR-19-CE38-0015,CLD2025,La documentation computationnelle des langues à l'horizon 2025(2019)
Subjects: Metadata, DOI, Open Science, Métadonnées, Identifiants numériques, Collection Pangloss, Digital identifiers, [SHS.LANGUE]Humanities and Social Sciences/Linguistics, DataCite, Pangloss Collection, Science ouverte, [SHS]Humanities and Social Sciences
Abstract: International audience; The Pangloss Collection is an open archive of endangered languages, developed over more than twenty years in what would currently be called an Open Science perspective: to arrive at a close association between documentation and research, for their mutual benefit. The article looks back at the general topic of resource identification, and at the current “socio-scientific” context, in order to put into perspective the somewhat paradoxical choice of assigning a Digital Object Identifier (DOI) to each document in the Pangloss Collection. The stages of implementation are also outlined, broaching methodological as well as technical topics.; La Collection Pangloss est une archive ouverte de langues à tradition orale, née il y a plus de vingt ans d’une vision qu’on dirait aujourd’hui de Science ouverte : elle consiste à associer étroitement documentation et recherche, pour leur bénéfice mutuel. L’article revient sur les problématiques d’identification de la ressource, et sur le contexte « socio-scientifique » actuel, pour mettre en perspective le choix – qui peut paraître paradoxal – d’attribuer un Digital Object Identifier (DOI) à chaque document de la Collection Pangloss. Les étapes de la mise en œuvre sont également abordées, dans leurs dimensions méthodologiques et techniques.
Published: 2020

12. Prosodic systems: Mainland Southeast Asia

Author: Brunelle, Marc, Kirby, James, Michaud, Alexis, Watkins, Justin, University of Ottawa [Ottawa], University of Edinburgh, International Research Institute MICA (MICA), Institut National Polytechnique de Grenoble (INPG)-Hanoi University of Science and Technology (HUST)-Centre National de la Recherche Scientifique (CNRS), Langues et civilisations à tradition orale (LACITO), Université Sorbonne Nouvelle - Paris 3-Institut National des Langues et Civilisations Orientales (Inalco)-Centre National de la Recherche Scientifique (CNRS), LPP - Laboratoire de Phonétique et Phonologie - UMR 7018 (LPP), Université Sorbonne Nouvelle - Paris 3-Centre National de la Recherche Scientifique (CNRS), School of Oriental and African Studies (SOAS), University of London, Carlos Gussenhoven, Aoju Chen, ANR-10-LABX-0083,EFL,Empirical Foundations of Linguistics : data, methods, models(2010), ANR-19-CE38-0015,CLD2025,La documentation computationnelle des langues à l'horizon 2025(2019), Michaud, Alexis, Empirical Foundations of Linguistics : data, methods, models - - EFL2010 - ANR-10-LABX-0083 - LABX - VALID, and La documentation computationnelle des langues à l'horizon 2025 - - CLD20252019 - ANR-19-CE38-0015 - AAPG2019 - VALID
Subjects: Mainland Southeast Asia, Tone systems, Speech prosody, Linguistic typology, Phonation types, intonation systems, [SHS.LANGUE]Humanities and Social Sciences/Linguistics, sesquisyllables, [SHS.LANGUE] Humanities and Social Sciences/Linguistics, Southeast Asia, prosodic systems
Abstract: International audience; Mainland Southeast Asia is often viewed as a linguistic area where five different language phyla – Austroasiatic, Austronesian, Hmong-Mien, Sino-Tibetan and Kra-Dai – have converged typologically. This chapter illustrates areal features found in their prosodic systems, but also emphasizes their oft-understated diversity. The first part of the chapter describes word level prosodic properties. A typology of word shapes and stress is first established: we revisit the concept of monosyllabicity, go over the notion of sesquisyllabicity (as typified by languages like Mon or Burmese) and discuss the realization of alternating stress in languages with polysyllabic words (such as Thai and Khmer). Special attention is then paid to tonation. Although many well-known languages of the area have sizeable inventories of complex tone contours, languages with few or no tones are common (20% being atonal). Importantly, the phonetic realization of tone frequently involves more than simply pitch: properties like phonation and duration often play a role in signaling tonal contrasts, along with less expected properties like onset voicing and vowel quality. We also show that complex tone alternations (spreading, neutralization and sandhi processes), although not typical, are well-attested. The second part of the chapter addresses the less well-understood topic of phrasal prosody: prosodic phrasing and intonation. We reconsider the question of the amount of conventionalized intonation in languages with complex tone paradigms and pervasive final particles. We also show that information structure is often conveyed by means of overt markers and syntactic restructuring, but that it can also be marked by means of intonational strategies.
Published: 2020

13. Analyzing errors in automatic phonemic transcriptions of the Na (Mosuo) language (Sino-Tibetan family)

Author: Michaud, Alexis, Adams, Oliver, Guillaume, Séverine, Wisniewski, Guillaume, Langues et civilisations à tradition orale (LACITO), Université Sorbonne Nouvelle - Paris 3-Institut National des Langues et Civilisations Orientales (Inalco)-Centre National de la Recherche Scientifique (CNRS), Miner & Kasch, Laboratoire de Linguistique Formelle (LLF UMR7110), Centre National de la Recherche Scientifique (CNRS)-Université de Paris (UP), Benzitoun, Christophe and Braud, Chloé and Huber, Laurine and Langlois, David and Ouni, Slim and Pogodalla, Sylvain and Schneider, Stéphane, ANR-19-CE38-0015,CLD2025,La documentation computationnelle des langues à l'horizon 2025(2019), and ANR-10-LABX-0083,EFL,Empirical Foundations of Linguistics : data, methods, models(2010)
Subjects: interdisciplinarité, Traitement Automatique des Langues Naturelles, reconnaissance de la parole, Computational Language Documentation, speech recognition, [SCCO.LING]Cognitive science/Linguistics, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], phonological transcription, interdisciplinarity, machine learning, analyse d'erreurs, transcription phonologique, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], apprentissage machine, [SHS.LANGUE]Humanities and Social Sciences/Linguistics, documentation linguistique assistée par ordinateur, error analysis
Abstract: International audience; Automatic phonemic transcription tools now reach high levels of accuracy on a single speaker with relatively small amounts of training data: on the order of two to three hours of transcribed speech. Beyond its practical usefulness for language documentation, use of automatic transcription also yields some insights for phoneticians/phonologists. Acoustic models are built on the basis of the linguist’s transcriptions, and thus encapsulate linguistic hypotheses and assumptions. To what extent can the acoustic model be examined by the linguist? The present report explores this topic by going into qualitative error analysis on Yongning Na (Sino-Tibetan). Among other benefits, error analysis allows for a renewed exploration of phonetic detail: examining the output of phonemic transcription software compared with spectrographic and aural evidence.; Les systèmes de reconnaissance automatique de la parole atteignent désormais des degrés de précision élevés sur la base d'un corpus d'entraînement limité à deux ou trois heures d'enregistrements transcrits (pour un système mono-locuteur). Au-delà de l'intérêt pratique que présentent ces avancées technologiques pour les tâches de documentation de langues rares et en danger, se pose la question de leur apport pour la réflexion du phonéticien/phonologue. En effet, le modèle acoustique prend en entrée des transcriptions qui reposent sur un ensemble d'hypothèses plus ou moins explicites. Le modèle acoustique, décalqué (par des méthodes statistiques) de l'écrit du linguiste, peut-il être interrogé par ce dernier, en un jeu de miroir ? Notre étude s'appuie sur des exemples d'une langue « rare » de la famille sino-tibétaine, le na (mosuo), pour illustrer la façon dont l'analyse d'erreurs permet une confrontation renouvelée avec le signal acoustique.
Published: 2020

14. Analyse d'erreurs de transcriptions phonémiques automatiques d'une langue « rare » : le na (mosuo)

Author: Michaud, Alexis, Adams, Oliver, Guillaume, Séverine, Wisniewski, Guillaume, Langues et civilisations à tradition orale (LACITO), Université Sorbonne Nouvelle - Paris 3-Institut National des Langues et Civilisations Orientales (Inalco)-Centre National de la Recherche Scientifique (CNRS), Miner & Kasch, Laboratoire de Linguistique Formelle (LLF UMR7110), Centre National de la Recherche Scientifique (CNRS)-Université de Paris (UP), Benzitoun, Christophe and Braud, Chloé and Huber, Laurine and Langlois, David and Ouni, Slim and Pogodalla, Sylvain and Schneider, Stéphane, ANR-19-CE38-0015,CLD2025,La documentation computationnelle des langues à l'horizon 2025(2019), and ANR-10-LABX-0083,EFL,Empirical Foundations of Linguistics : data, methods, models(2010)
Subjects: interdisciplinarité, Traitement Automatique des Langues Naturelles, reconnaissance de la parole, Computational Language Documentation, speech recognition, [SCCO.LING]Cognitive science/Linguistics, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], phonological transcription, interdisciplinarity, machine learning, analyse d'erreurs, transcription phonologique, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], apprentissage machine, [SHS.LANGUE]Humanities and Social Sciences/Linguistics, documentation linguistique assistée par ordinateur, error analysis
Abstract: International audience; Automatic phonemic transcription tools now reach high levels of accuracy on a single speaker with relatively small amounts of training data: on the order of two to three hours of transcribed speech. Beyond its practical usefulness for language documentation, use of automatic transcription also yields some insights for phoneticians/phonologists. Acoustic models are built on the basis of the linguist’s transcriptions, and thus encapsulate linguistic hypotheses and assumptions. To what extent can the acoustic model be examined by the linguist? The present report explores this topic by going into qualitative error analysis on Yongning Na (Sino-Tibetan). Among other benefits, error analysis allows for a renewed exploration of phonetic detail: examining the output of phonemic transcription software compared with spectrographic and aural evidence.; Les systèmes de reconnaissance automatique de la parole atteignent désormais des degrés de précision élevés sur la base d'un corpus d'entraînement limité à deux ou trois heures d'enregistrements transcrits (pour un système mono-locuteur). Au-delà de l'intérêt pratique que présentent ces avancées technologiques pour les tâches de documentation de langues rares et en danger, se pose la question de leur apport pour la réflexion du phonéticien/phonologue. En effet, le modèle acoustique prend en entrée des transcriptions qui reposent sur un ensemble d'hypothèses plus ou moins explicites. Le modèle acoustique, décalqué (par des méthodes statistiques) de l'écrit du linguiste, peut-il être interrogé par ce dernier, en un jeu de miroir ? Notre étude s'appuie sur des exemples d'une langue « rare » de la famille sino-tibétaine, le na (mosuo), pour illustrer la façon dont l'analyse d'erreurs permet une confrontation renouvelée avec le signal acoustique.
Published: 2020

15. Toward open data policies in phonetics: What we can gain and how we can avoid pitfalls

Author: Garellek, Marc, Gordon, Matthew, Kirby, James, Lee, Wai-Sum, Michaud, Alexis, Mooshammer, Christine, Niebuhr, Oliver, Recasens, Daniel, Roettger, Timo, Simpson, Adrian, Yu, Kristine, University of California [San Diego] (UC San Diego), University of California, University of California [Santa Barbara] (UCSB), University of Edinburgh, City University of Hong Kong [Hong Kong] (CUHK), Langues et civilisations à tradition orale (LACITO), Université Sorbonne Nouvelle - Paris 3-Institut National des Langues et Civilisations Orientales (Inalco)-Centre National de la Recherche Scientifique (CNRS), Haskins Laboratories, Humboldt-Universität zu Berlin, University of Southern Denmark (SDU), Universitat Autònoma de Barcelona (UAB), Osnabrück University, Friedrich-Schiller-Universität = Friedrich Schiller University Jena [Jena, Germany], University of Massachusetts [Amherst] (UMass Amherst), University of Massachusetts System (UMASS), ANR-10-LABX-0083,EFL,Empirical Foundations of Linguistics : data, methods, models(2010), and ANR-19-CE38-0015,CLD2025,La documentation computationnelle des langues à l'horizon 2025(2019)
Subjects: data curation, open access, phonetic sciences, open science, experimental phonetics, [SHS.LANGUE]Humanities and Social Sciences/Linguistics, research data, data conservation, open archives
Abstract: International audience; It is not yet standard practice in phonetics to provide access to audio files along with submissions to journals. This is paradoxical in view of the importance of data for phonetic research: from audio signals to the whole range of data acquired in phonetic experiments. The phonetic sciences stand to gain greatly from data availability: what is at stake is no less than reproducibility and cumulative progress. We will argue that a collective turn to Open Science holds great promise for phonetics. First, simple reflections on why access to primary data matters are recapitulated and proposed as a basis for consensus. Next, possible drawbacks of data availability are addressed. Finally, we argue that data curation and archiving are to be recognized as part of the same activity that results in the publication of research papers, rather than attempting to build a parallel system to incentivize data archiving by itself.; Il n'est pas encore d'usage courant en phonétique, lors de la soumission d'un manuscrit pour publication, de donner accès aux fichiers audio et autres données sur lesquelles repose l'étude. Cette situation est paradoxale au vu de l'importance des données pour la recherche phonétique : signaux audio, mais aussi toute la gamme des données acquises lors d'expériences phonétiques. Les sciences phonétiques ont beaucoup à gagner de la disponibilité des données. L'enjeu n'est rien moins que la reproductibilité des travaux, et l'inscription des recherches dans une dynamique de progrès cumulatif. Dans cet article, nous détaillons en quoi un tournant collectif vers la science ouverte nous paraît prometteur pour la phonétique. Tout d'abord, nous récapitulons un ensemble de réflexions simples sur les raisons pour lesquelles l'accès aux données primaires est tout à fait fondamental. C'est sur la base de ce constat (tout à fait consensuel) que nous abordons les inconvénients possibles d'un partage des données, et les obstacles perçus par les chercheurs. Enfin, nous défendons l'idée selon laquelle la conservation et l'archivage des données doivent être reconnus comme faisant partie de la même activité qui aboutit à la publication de travaux de recherche, plutôt que de tenter de construire un système parallèle qui récompense l'activité d'archivage des données par elle-même.
Published: 2020

16. Vers des politiques de données ouvertes dans les sciences phonétiques : ce que le domaine y gagne, et comment éviter les écueils

Author: Garellek, Marc, Gordon, Matthew, Kirby, James, Lee, Wai-Sum, Michaud, Alexis, Mooshammer, Christine, Niebuhr, Oliver, Recasens, Daniel, Roettger, Timo, Simpson, Adrian, Yu, Kristine, Department of Linguistics, University of California, San Diego, Department of Linguistics, University of California, Santa Barbara, University of California [Santa Barbara] (UCSB), University of California-University of California, Department of Linguistics and English Language, University of Edinburgh, Department of Linguistics and Translation, City University of Hong Kong, Langues et civilisations à tradition orale (LACITO), Université Sorbonne Nouvelle - Paris 3-Institut National des Langues et Civilisations Orientales (Inalco)-Centre National de la Recherche Scientifique (CNRS), Sprach- und literaturwissenschaftliche Fakultät, Humboldt-Universität zu Berlin, Haskins Laboratories, Mads Clausen Institute Sønderborg, University of Southern Denmark, Departament de Filologia Catalana, Universitat Autònoma de Barcelona, Institute of Cognitive Science, Universität Osnabrück, Institut für Germanistische Sprachwissenschaft, Friedrich-Schiller-Universität Jena, Department of Linguistics, University of Massachusetts Amherst, ANR: 10-LABX-0083,EFL,Empirical Foundations of Linguistics : data, methods, models(2010), ANR-19-CE38-0015-04,CLD2025,Computational Language Documentation 2025, University of California [San Diego] (UC San Diego), University of California, University of Edinburgh, City University of Hong Kong [Hong Kong] (CUHK), Humboldt-Universität zu Berlin, University of Southern Denmark (SDU), Universitat Autònoma de Barcelona (UAB), Osnabrück University, Friedrich-Schiller-Universität = Friedrich Schiller University Jena [Jena, Germany], University of Massachusetts [Amherst] (UMass Amherst), University of Massachusetts System (UMASS), ANR-10-LABX-0083,EFL,Empirical Foundations of Linguistics : data, methods, models(2010), and ANR-19-CE38-0015,CLD2025,La documentation computationnelle des langues à l'horizon 2025(2019)
Subjects: data curation, open access, phonetic sciences, open science, experimental phonetics, [SHS.LANGUE]Humanities and Social Sciences/Linguistics, research data, data conservation, open archives
Abstract: International audience; It is not yet standard practice in phonetics to provide access to audio files along with submissions to journals. This is paradoxical in view of the importance of data for phonetic research: from audio signals to the whole range of data acquired in phonetic experiments. The phonetic sciences stand to gain greatly from data availability: what is at stake is no less than reproducibility and cumulative progress. We will argue that a collective turn to Open Science holds great promise for phonetics. First, simple reflections on why access to primary data matters are recapitulated and proposed as a basis for consensus. Next, possible drawbacks of data availability are addressed. Finally, we argue that data curation and archiving are to be recognized as part of the same activity that results in the publication of research papers, rather than attempting to build a parallel system to incentivize data archiving by itself.; Il n'est pas encore d'usage courant en phonétique, lors de la soumission d'un manuscrit pour publication, de donner accès aux fichiers audio et autres données sur lesquelles repose l'étude. Cette situation est paradoxale au vu de l'importance des données pour la recherche phonétique : signaux audio, mais aussi toute la gamme des données acquises lors d'expériences phonétiques. Les sciences phonétiques ont beaucoup à gagner de la disponibilité des données. L'enjeu n'est rien moins que la reproductibilité des travaux, et l'inscription des recherches dans une dynamique de progrès cumulatif. Dans cet article, nous détaillons en quoi un tournant collectif vers la science ouverte nous paraît prometteur pour la phonétique. Tout d'abord, nous récapitulons un ensemble de réflexions simples sur les raisons pour lesquelles l'accès aux données primaires est tout à fait fondamental. C'est sur la base de ce constat (tout à fait consensuel) que nous abordons les inconvénients possibles d'un partage des données, et les obstacles perçus par les chercheurs. Enfin, nous défendons l'idée selon laquelle la conservation et l'archivage des données doivent être reconnus comme faisant partie de la même activité qui aboutit à la publication de travaux de recherche, plutôt que de tenter de construire un système parallèle qui récompense l'activité d'archivage des données par elle-même.
Published: 2020

17. 声调的来历与发展路线

Author: Michaud, Alexis, Sands, Bonny, Langues et civilisations à tradition orale (LACITO), Université Sorbonne Nouvelle - Paris 3-Institut National des Langues et Civilisations Orientales (Inalco)-Centre National de la Recherche Scientifique (CNRS), Northern Arizona University [Flagstaff], Aronoff, Mark, ANR-19-CE38-0015,CLD2025,La documentation computationnelle des langues à l'horizon 2025(2019), and ANR-10-LABX-0083,EFL,Empirical Foundations of Linguistics : data, methods, models(2010)
Subjects: Tone, prosody, transphonologization, diachronic phonology, [SHS.LANGUE]Humanities and Social Sciences/Linguistics, intrinsic f0
Abstract: International audience; Tonogenesis is the development of distinctive tone from earlier non-tonal contrasts. A well-understood case is that of Vietnamese (similar in its essentials to that of Chinese and many languages of the Tai-Kadai and Hmong-Mien language families), where the loss of final laryngeal consonants led to the creation of three tones, and the tones later multiplied as voicing oppositions on initial consonants waned. This is by no means the only attested diachronic scenario, however. There is tonogenetic potential in various series of phonemes: glottalized vs. plain consonants, unvoiced vs. voiced, aspirated vs. unaspirated, geminates vs. simple (and, more generally, tense vs. lax), and even among vowels, whose intrinsic fundamental frequency can transphonologize to tone. But the way in which these common phonetic precursors to tone play out in a given language depends on phonological factors, as well as on other dimensions of a language’s structure and on patterns of language contact, resulting in a great diversity of evolutionary paths in tone systems. In some language families (such as Niger-Congo and Khoe), recent tonal developments are increasingly well-understood, but working out the origin of the earliest tonal contrasts (which are likely to date back thousands of years earlier than tonogenesis among Sino-Tibetan languages, for instance) remains a mid- to long-term research goal for comparative-historical research.
Published: 2020

18. Voix de « ceux qui ne sont rien » en Asie du sud-est

Author: Minh-Châu Nguyễn, Likun He, Alexis Michaud, Langues et civilisations à tradition orale (LACITO), Université Sorbonne Nouvelle - Paris 3-Institut National des Langues et Civilisations Orientales (Inalco)-Centre National de la Recherche Scientifique (CNRS), Vietnam National University - Department of Linguistics (VNU-USSH), Vietnam National University [Hanoï] (VNU), Yunnan Minzu University, ANR-19-CE38-0015-04,CLD2025,Computational Language Documentation 2025, ANR: 10-LABX-0083,EFL,Empirical Foundations of Linguistics : data, methods, models(2010), ANR-19-CE38-0015,CLD2025,La documentation computationnelle des langues à l'horizon 2025(2019), ANR-10-LABX-0083,EFL,Empirical Foundations of Linguistics : data, methods, models(2010), Michaud, Alexis, La documentation computationnelle des langues à l'horizon 2025 - - CLD20252019 - ANR-19-CE38-0015 - AAPG2019 - VALID, and Empirical Foundations of Linguistics : data, methods, models - - EFL2010 - ANR-10-LABX-0083 - LABX - VALID
Subjects: 0106 biological sciences, tinh thần phản kháng, South East Asia, 010603 evolutionary biology, 01 natural sciences, 03 medical and health sciences, wealth, hopes, [SHS.LANGUE]Humanities and Social Sciences/Linguistics, Asie du Sud‑Est, 030304 developmental biology, richesses, 0303 health sciences, [SHS.ANTHRO-SE] Humanities and Social Sciences/Social Anthropology and ethnology, voice, [SHS.ANTHRO-SE]Humanities and Social Sciences/Social Anthropology and ethnology, Oralités contestataires, [SHS.LANGUE] Humanities and Social Sciences/Linguistics, Oral expressions of protest, Robin Hood, Robin des bois, espoirs, voix, oralités contestataires, 民间幽默讽刺传说, General Economics, Econometrics and Finance
Abstract: International audience; In response to the invitation by the editorial team of Cahiers de littérature orale for contributions on "oral expressions of protest" (« oralités contestataires »), we offer an overview of three documents of oral literature from civilizations of East and Southeast Asia: a Naxi anecdote, a Khmu story, and a Vietnamese song, each in dialogue with a text (article or book) drawn from current social and political affairs. The title of the article refers to the phrase "people who are nothing" (« les gens qui ne sont rien »), used by the sitting President of the French Republic in a public speech on 29 June 2017 ("A train station is a place where you come across people who succeed and people who are nothing. Because it is a place you pass through. Because it is a place you share.").
Published: 2020

19. AlloVera: a multilingual allophone database

Author: Mortensen, David R., Li, Xinjian, Littell, Patrick, Michaud, Alexis, Rijhwani, Shruti, Anastasopoulos, Antonios, Black, Alan W., Metze, Florian, Neubig, Graham, Michaud, Alexis, La documentation computationnelle des langues à l'horizon 2025 - - CLD20252019 - ANR-19-CE38-0015 - AAPG2019 - VALID, Empirical Foundations of Linguistics : data, methods, models - - EFL2010 - ANR-10-LABX-0083 - LABX - VALID, Carnegie Mellon University [Pittsburgh] (CMU), National Research Council of Canada (NRC), Langues et civilisations à tradition orale (LACITO), Université Sorbonne Nouvelle - Paris 3-Institut National des Langues et Civilisations Orientales (Inalco)-Centre National de la Recherche Scientifique (CNRS), European Language Resources Association, ANR-19-CE38-0015,CLD2025,La documentation computationnelle des langues à l'horizon 2025(2019), and ANR-10-LABX-0083,EFL,Empirical Foundations of Linguistics : data, methods, models(2010)
Subjects: FOS: Computer and information sciences, phonology, Phoneme, Computer Science - Computation and Language, Allophones, automatic speech recognition, [SHS.LANGUE]Humanities and Social Sciences/Linguistics, Computation and Language (cs.CL), [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, [SHS.LANGUE] Humanities and Social Sciences/Linguistics, database, [SPI.SIGNAL] Engineering Sciences [physics]/Signal and Image processing
Abstract: We introduce a new resource, AlloVera, which provides mappings from 218 allophones to phonemes for 14 languages. Phonemes are contrastive phonological units, and allophones are their various concrete realizations, which are predictable from phonological context. While phonemic representations are language specific, phonetic representations (stated in terms of (allo)phones) are much closer to a universal (language-independent) transcription. AlloVera allows the training of speech recognition models that output phonetic transcriptions in the International Phonetic Alphabet (IPA), regardless of the input language. We show that a “universal” allophone model, Allosaurus, built with AlloVera, outperforms “universal” phonemic models and language-specific models on a speech-transcription task. We explore the implications of this technology (and related technologies) for the documentation of endangered and minority languages. We further explore other applications for which AlloVera will be suitable as it grows, including phonological typology., 12th Language Resources and Evaluation Conference (LREC 2020), May 11-16, 2020, Marseille, France [Held Virtually]
Published: 2020

20. Phonemic transcription of low-resource languages: To what extent can preprocessing be automated?

Author: Wisniewski, Guillaume, Michaud, Alexis, Guillaume, Séverine, Michaud, Alexis, Empirical Foundations of Linguistics : data, methods, models - - EFL2010 - ANR-10-LABX-0083 - LABX - VALID, La documentation computationnelle des langues à l'horizon 2025 - - CLD20252019 - ANR-19-CE38-0015 - AAPG2019 - VALID, Beermann, Dorothee, Besacier, Laurent, Sakti, Sakriani, Soria, Claudia, Laboratoire de Linguistique Formelle (LLF UMR7110), Centre National de la Recherche Scientifique (CNRS)-Université Paris Diderot - Paris 7 (UPD7), Langues et civilisations à tradition orale (LACITO), Université Sorbonne Nouvelle - Paris 3-Institut National des Langues et Civilisations Orientales (Inalco)-Centre National de la Recherche Scientifique (CNRS), Université Paris Diderot - Paris 7 (UPD7)-Centre National de la Recherche Scientifique (CNRS), ANR-10-LABX-0083,EFL,Empirical Foundations of Linguistics : data, methods, models(2010), and ANR-19-CE38-0015,CLD2025,La documentation computationnelle des langues à l'horizon 2025(2019)
Subjects: [INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI], [SHS.STAT]Humanities and Social Sciences/Methods and statistics, Speech Recognition/Understanding, [SHS.STAT] Humanities and Social Sciences/Methods and statistics, [SHS.LANGUE]Humanities and Social Sciences/Linguistics, Endangered Languages, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, Speech Resource/Database, [SHS.LANGUE] Humanities and Social Sciences/Linguistics, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], [SPI.SIGNAL] Engineering Sciences [physics]/Signal and Image processing
Abstract: International audience; Automatic Speech Recognition for low-resource languages has been an active field of research for more than a decade. It holds promise for facilitating the urgent task of documenting the world's dwindling linguistic diversity. Various methodological hurdles are encountered in the course of this exciting development, however. A well-identified difficulty is that data preprocessing is not at all trivial. The tests reported here (on Yongning Na and other languages from the Pangloss Collection, an open archive of endangered languages) explore some possibilities for automating the process of data preprocessing: assessing to what extent it is possible to bypass the involvement of language experts for menial tasks of data preparation for Natural Language Processing (NLP) purposes. What is at stake is the accessibility of language archive data for a range of NLP tasks and beyond.
Published: 2020

21. La transcription du linguiste au miroir de l’intelligence artificielle : réflexions à partir de la transcription phonémique automatique

Author: Michaud, Alexis, Adams, Oliver, Cox, Christopher, Guillaume, Séverine, Wisniewski, Guillaume, Galliot, Benjamin, Langues et civilisations à tradition orale (LACITO), Université Sorbonne Nouvelle - Paris 3-Institut National des Langues et Civilisations Orientales (Inalco)-Centre National de la Recherche Scientifique (CNRS), Miner & Kasch, University of Alberta, Laboratoire de Linguistique Formelle (LLF UMR7110), Centre National de la Recherche Scientifique (CNRS)-Université de Paris (UP), Institut des langues rares, ANR-10-LABX-0083,EFL,Empirical Foundations of Linguistics : data, methods, models(2010), and ANR-19-CE38-0015,CLD2025,La documentation computationnelle des langues à l'horizon 2025(2019)
Subjects: Reconnaissance de la parole, Transcription Automatique de la parole, Documentation linguistique assistée par ordinateur, [SHS.LANGUE]Humanities and Social Sciences/Linguistics, Documentation linguistique, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Abstract: Accepted for publication in the Bulletin de la Société de Linguistique de Paris (to appear around January-February 2021); International audience; Automatic speech recognition systems now achieve high levels of accuracy with relatively small amounts of training data: on the order of two to three hours of transcribed speech (for a single-speaker system), instead of tens of hours for previous tools. Beyond the practical usefulness of these technological advances for linguistic documentation tasks, use of automatic transcription also yields some linguistic insights. Acoustic models are built on the basis of the linguist’s transcriptions, and thus encapsulate linguistic hypotheses and assumptions. To what extent can acoustic models be examined in turn by the linguist? What can we learn from this renewed confrontation with the acoustic signal? The present study is based on examples from the Na language (Sino-Tibetan family) to illustrate how error analysis allows a renewed confrontation with the data. Among other benefits, error analysis allows for a renewed exploration of phonetic detail: examining the output of phonemic transcription software compared with spectrographic and aural evidence. Some reflections on experiments of automatic transcription of the Tsuut'ina language (Dene family) are also presented.
Published: 2020

22. Tonogenèse

Author: Michaud, Alexis, Sands, Bonny, Langues et civilisations à tradition orale (LACITO), Université Sorbonne Nouvelle - Paris 3-Institut National des Langues et Civilisations Orientales (Inalco)-Centre National de la Recherche Scientifique (CNRS), Northern Arizona University [Flagstaff], Aronoff, Mark, ANR-19-CE38-0015,CLD2025,La documentation computationnelle des langues à l'horizon 2025(2019), and ANR-10-LABX-0083,EFL,Empirical Foundations of Linguistics : data, methods, models(2010)
Subjects: 060201 languages & linguistics, Tone, prosody, transphonologization, 0602 languages and literature, 05 social sciences, 0501 psychology and cognitive sciences, 06 humanities and the arts, diachronic phonology, [SHS.LANGUE]Humanities and Social Sciences/Linguistics, 050105 experimental psychology, intrinsic f0
Abstract: Tonogenesis is the development of distinctive tone from earlier non-tonal contrasts. A well-understood case is Vietnamese (similar in its essentials to that of Chinese and many languages of the Tai-Kadai and Hmong-Mien language families), where the loss of final laryngeal consonants led to the creation of three tones, and the tones later multiplied as voicing oppositions on initial consonants waned. This is by no means the only attested diachronic scenario, however. Besides well-known cases of tonogenesis in East Asia, this survey includes discussions of less well-known cases of tonogenesis from language families including Athabaskan, Chadic, Khoe and Niger-Congo. There is tonogenetic potential in various series of phonemes: glottalized versus plain consonants, unvoiced versus voiced, aspirated versus unaspirated, geminates versus simple (and, more generally, tense versus lax), and even among vowels, whose intrinsic fundamental frequency can transphonologize to tone. We draw attention to tonogenetic triggers that are not so well-known, such as [+ATR] vowels, aspirates and morphotonological alternations. The ways in which these common phonetic precursors to tone play out in a given language depend on phonological factors, as well as on other dimensions of a language’s structure and on patterns of language contact, resulting in a great diversity of evolutionary paths in tone systems. In some language families (such as Niger-Congo and Khoe), recent tonal developments are increasingly well understood, but working out the origin of the earliest tonal contrasts (which are likely to date back thousands of years earlier than tonogenesis among Sino-Tibetan languages, for instance) remains a mid- to long-term research goal for comparative-historical research.
Published: 2020
