Schaeffer, Camille, Interdonato, Roberto, Lancelot, Renaud, Roche, Mathieu, and Teisseire, Maguelonne
Purpose: Event-Based Surveillance (EBS) systems detect and monitor diseases by analysing articles from online newspapers and reports from health organisations (e.g. FAO, OIE). However, they only partially integrate data from social networks, even though such data are abundant on the web. The purpose of this study is to exploit social network data, such as content from Twitter and YouTube, to provide epidemiological and additional information for avian influenza surveillance. Methods & Materials: We propose new text-mining approaches that combine lexical rules and statistical methods to normalise textual data from social networks (e.g. 'h5 n8'→'H5N8') and to correct errors in YouTube transcriptions (e.g. 'birth flu'→'bird flu'). Another challenge is to extract epidemiological events automatically by identifying spatial entities (Where?), thematic entities (What?), and temporal information (When?). To this end, we extended Named Entity Recognition (NER) tools such as spaCy. Results: Using dedicated APIs, we collected 100 automatic transcripts of YouTube videos and 268 tweets, all in English, dealing with avian influenza. Automatic recognition of epidemiological information (e.g. hosts, symptoms) in the textual content gives encouraging results (accuracy around 0.6). Extraction of spatial information performs better (accuracy around 0.8). Conclusion: The final objective of the study is to link social media data with official information from health organisations on the basis of these entities, in order to improve epidemiological monitoring.
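To make the normalisation step concrete, the following is a minimal sketch of the lexical-rule side only, written with the Python standard library. It is not the authors' implementation: the regular expression, the correction lexicon `TRANSCRIPT_FIXES`, and all function names are illustrative assumptions, and the statistical component and the spaCy extension mentioned in the abstract are not shown.

```python
import re

# Hypothetical correction lexicon (an assumption, not the authors' resource):
# maps speech-to-text errors to the intended term.
TRANSCRIPT_FIXES = {
    "birth flu": "bird flu",
}

# Matches influenza subtypes written loosely, e.g. "h5 n8", "H5-N8", "h5n8".
SUBTYPE_RE = re.compile(r"\b[hH](\d{1,2})[\s-]?[nN](\d{1,2})\b")


def normalise_subtypes(text: str) -> str:
    """Rewrite loosely written subtypes into the canonical 'HxNy' form."""
    return SUBTYPE_RE.sub(lambda m: f"H{m.group(1)}N{m.group(2)}", text)


def fix_transcript_errors(text: str) -> str:
    """Replace known transcription errors using the lexicon above."""
    for wrong, right in TRANSCRIPT_FIXES.items():
        pattern = rf"\b{re.escape(wrong)}\b"
        text = re.sub(pattern, right, text, flags=re.IGNORECASE)
    return text


def normalise(text: str) -> str:
    """Apply transcript correction first, then subtype normalisation."""
    return normalise_subtypes(fix_transcript_errors(text))


if __name__ == "__main__":
    print(normalise("New outbreak of h5 n8 birth flu reported"))
```

A fixed lexicon only covers errors that have already been observed; combining such rules with statistical evidence, as the abstract describes, would be one way to handle unseen variants.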