1. Non-standard texts: from theoretical positions to Natural Language Processing normalisation
- Author
-
Lopez, Cédric, Roche, Mathieu, Panckhurst, Rachel, VISEO - Objet Direct, VISEO, Territoires, Environnement, Télédétection et Information Spatiale (UMR TETIS), Centre de Coopération Internationale en Recherche Agronomique pour le Développement (Cirad)-AgroParisTech-Institut national de recherche en sciences et technologies pour l'environnement et l'agriculture (IRSTEA)-Centre National de la Recherche Scientifique (CNRS), ADVanced Analytics for data SciencE (ADVANSE), Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier (LIRMM), Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS), Praxiling UMR 5267 (Praxiling), Université Paul-Valéry - Montpellier 3 (UM3)-Centre National de la Recherche Scientifique (CNRS), CENTAL, UCL, Louvain-la-Neuve, Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM), Département Environnements et Sociétés (Cirad-ES), Centre de Coopération Internationale en Recherche Agronomique pour le Développement (Cirad), Praxiling (Praxiling), Université Paul-Valéry - Montpellier 3 (UPVM)-Centre National de la Recherche Scientifique (CNRS), Centre National de la Recherche Scientifique (CNRS)-Université Paul-Valéry - Montpellier 3 (UPVM)
- Subjects
C30 - Documentation and information, [INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL], SMS, Normalisation, Natural Language Processing
- Abstract
A finalised digital resource comprising the 88milSMS corpus of 88,000 anonymised French text messages, two extracts (1,000 SMS transcoded into standardised French and 100 linguistically annotated SMS) and sociolinguistic questionnaire data was released in June 2014. It can be downloaded by anyone under a free-of-charge user licence agreement from the Huma-Num web service (http://88milsms.huma-num.fr, Panckhurst et al., 2014). The sud4science project (http://sud4science.org, Panckhurst et al., 2013), in which a group of academics collected authentic text messages from the general public, is part of a vast international initiative (http://www.sms4science.org/, Fairon et al., 2006, Cougnon and Fairon, 2014, Cougnon, 2015) to build a worldwide database and analyse authentic text messages in different languages. We decided not to include full transcoding and annotation tagging in the final corpus. This is a theoretical position, since annotation is far from neutral and is invariably linked to an interpretative framework. Owing to varying theoretical, disciplinary and scientific stances, there appears to be no true consensus on how to standardise transcoding and linguistic annotation tagging (Panckhurst, 2015). Other researchers may disagree and prefer to provide both 'raw' and fully tagged corpora (Chanier et al., 2014). This theoretical position does not exclude exploring Natural Language Processing (NLP) investigation techniques, which could then be implemented in real-life applications. Examples of such techniques follow: 1) Our corpus can be used to analyse current mediated electronic discourse, and to help build knowledge of different SMS writing forms (Roche et al., 2015). 2) Algorithms may be used to learn from the corpus: alignment methods for facilitating automatic transcoding have been explored (Aw et al., 2006, Beaufort et al., 2008, Guimier de Neef and Fessard, 2007, Kobus et al., 2008, Lopez et al., 2014).
3) We have devised a method for classifying 'unknown' items within text messages, which may help to automatically identify lexical 'creativity' within 88milSMS and improve electronic dictionary approaches (Lopez et al., 2015). In order to refine automatic normalisation techniques for initially non-standard texts in French, the next logical step is to compare our resource with different types of instant media (i.e. SMS, forums, tweets). Firstly, a new typology of the detected 'mistakes', based on existing typologies, will be developed. Secondly, automatic normalisation techniques that focus on the most frequent errors will be proposed. These will then be compared with traditional automatic translation (Vilariño et al., 2012), speech recognition (Kobus et al., 2008) and spelling/grammar checker principles (Beaufort et al., 2010). Finally, the approach should enable comparison between different types of instant media.
- Published
- 2016