Panckhurst, Rachel. Praxiling UMR 5267, Centre National de la Recherche Scientifique (CNRS) and Université Paul-Valéry - Montpellier 3 (UPVM); Cirad; Lirmm; Viseo; Lidilem; DGLFLF; MSH-M; sud4science / 88milSMS; s.e.
In 2011, six academics gathered over 90,000 authentic text messages in French from the general public, in compliance with French law. The SMS ‘donors’ were also invited to fill out a sociolinguistic questionnaire (http://sud4science.org, Panckhurst et al., 2013). The project is part of a vast international initiative, entitled sms4science (http://www.sms4science.org/, Fairon et al. 2006, Cougnon & Fairon, 2014, Cougnon 2015), which aims to build a worldwide database and analyse authentic text messages. After the sud4science SMS data collection, a pre-processing phase checked for and eliminated spurious information, and a three-step semi-automatic anonymisation phase followed (Accorsi et al. 2014, Patel et al., 2013). Two extracts were processed further: one was transcoded into standardised French (1,000 SMS) and one was annotated (100 SMS). The finalised digital resource of 88,000 anonymised French text messages, the ‘88milSMS’ corpus, together with the extracts and the sociolinguistic questionnaire data, is currently available for all to download, under a free-of-charge user licence agreement, from the Huma-Num web service (http://88milsms.huma-num.fr, Panckhurst et al., 2014). Why exclude full transcoding and annotation phases? Transcoding ‘raw’ text messages into ‘standardised’ French means that morpho-syntactic parsers and other natural language processing tools can ultimately analyse them. Checking spelling and grammar facilitates comprehension, but no supplementary information should be ‘injected’. What if a texter tries to simulate a certain form of oral French, for instance, by using an apostrophe, or through agglutination (‘j’sais’=‘je sais’, ‘chuis’=‘je suis’)? Should these items be transcoded or not? What about punctuation, which is often absent in text messages? Should it be systematically re-introduced? Researchers may have differing theoretical viewpoints. Another issue is tagging the corpus.
After much scientific debate about previous experiences with other sms4science members, eight tags were chosen for ‘88milSMS’: TYP(ography), MOD(ification), GRA(mmar), BIN(ettes, smileys/emojis), ABS(ence), LAN(guage), ORT(hography, spelling), DIV(erse). Like the previous transcoding phase, annotation is a source of theoretical disagreement. For instance, it may be difficult to decide which tag to use, and double tagging may be necessary, as in ‘Bone journé’. The ‘scriptor’ may have voluntarily modified the two words (‘Bonne journée’, have a nice day) or may have lacked spelling knowledge. So should ‘MOD’ and/or ‘ORT’ be used? In another example, ‘Il es rentrer a 22h30 et jai eu ldroii au : jsui fatiguer, jai mal a la tete jvai me coucher.’ (He came home at 10.30pm and I got to hear: I’m tired, I have a headache, I’m going to bed), ‘rentrer’ (‘Il est rentré’) could be a grammatical mistake (GRA), or the scriptor may have preferred typing an ‘r’ (MOD) instead of pressing the ‘e’ to access the acute accent (on a smartphone). It is extremely difficult to provide satisfactory standardised tagging, so we decided to limit the processing to two extracts. Our (rare) choice to exclude full transcoding and tagging is a theoretical position: annotation is far from neutral. It is directly linked to an interpretative framework. A true consensus on how to standardise the transcoding and annotation does not exist, owing to differing theoretical, (pluri)disciplinary and scientific stances. We believe that no additional mark-up initiatives should be imposed upon researchers; it seems more relevant to let them conduct their own annotation with their specific scientific questions in mind, without being trapped within a single theoretical framework. We hope that the ‘88milSMS’ resource will provide inspiration for many years to come.
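The double-tagging dilemma described above can be made concrete with a minimal sketch. The eight tag names come from the text; the `Annotation` record, its field names and the `is_ambiguous` helper are hypothetical illustrations and do not reflect the actual encoding used in the annotated extract.

```python
from dataclasses import dataclass
from enum import Enum


class Tag(Enum):
    """The eight 88milSMS annotation tags listed in the text."""
    TYP = "typography"
    MOD = "modification"
    GRA = "grammar"
    BIN = "binettes (smileys/emojis)"
    ABS = "absence"
    LAN = "language"
    ORT = "orthography (spelling)"
    DIV = "diverse"


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record: a raw form, its standardised form, candidate tags."""
    raw: str
    standardised: str
    tags: frozenset  # more than one tag when the analysis cannot be decided

    def is_ambiguous(self) -> bool:
        # Double tagging signals that several interpretations remain open.
        return len(self.tags) > 1


# The two undecidable cases discussed in the text.
bone = Annotation("Bone", "Bonne", frozenset({Tag.MOD, Tag.ORT}))
rentrer = Annotation("rentrer", "rentré", frozenset({Tag.GRA, Tag.MOD}))
```

Storing a set of tags rather than a single tag keeps both readings available to later researchers instead of committing to one interpretative framework.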
Our corpus can be used to analyse contemporary mediated electronic discourse, build knowledge on SMS writing forms (Panckhurst 2009), and train algorithms: alignment methods for facilitating automatic transcoding are currently being explored (Lopez et al. 2014), as are methods for classifying ‘unknown’ items, both to identify lexical ‘creativity’ automatically within ‘88milSMS’ and to improve electronic dictionary approaches. The resource also sheds light on ‘corpus-driven’ and ‘corpus-based’ approaches (Panckhurst 2013, Panckhurst et al. 2015). XML encoding means that the resource will be eligible for long-term archiving with CINES (https://www.cines.fr/). Perhaps, in the future, people will look back and explore these ‘snapshot’ resources and understand more about the evolution of scriptural practices and usages in the 21st century.
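To give a rough idea of what aligning a raw SMS with its standardised transcoding involves, here is a minimal character-level sketch. It uses Python's standard `difflib.SequenceMatcher` as a generic stand-in and is not the alignment method explored by Lopez et al. (2014); the function name `align` is a hypothetical illustration.

```python
from difflib import SequenceMatcher


def align(raw: str, standardised: str) -> list:
    """Return the edits (operation, raw span, standardised span) between two strings.

    Generic diff-based sketch; 'equal' stretches are left out.
    """
    matcher = SequenceMatcher(None, raw, standardised, autojunk=False)
    edits = []
    for op, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if op != "equal":
            edits.append((op, raw[a_start:a_end], standardised[b_start:b_end]))
    return edits


# Example taken from the text: raw SMS form vs standardised French.
for edit in align("Bone journé", "Bonne journée"):
    print(edit)
```

Each reported edit marks a place where the standardised form departs from what the texter wrote, which is where a researcher would have to decide whether the difference is a voluntary modification or a spelling issue.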