Location Analysis for Arabic COVID-19 Twitter Data Using Enhanced Dialect Identification Models
- Source :
- Applied Sciences, Vol 11, Iss 23, p 11328 (2021)
- Publication Year :
- 2021
- Publisher :
- MDPI AG, 2021.
- Abstract :
- The recent surge of social media networks has provided a channel for gathering and publishing vital medical and health information. The role of these networks becomes more prominent in periods of crisis, such as the COVID-19 pandemic. They have been the leading platform for broadcasting health news updates, precautionary instructions, and governmental procedures. They also provide an effective means of gathering public opinion and tracking breaking events and stories. Location-based analysis of social media input requires the users’ location information, which is most often missing or hidden. For some languages, such as Arabic, a user’s location can be predicted from their dialect, since most Arab countries have their own local dialects. Natural Language Processing (NLP) techniques offer several approaches to dialect identification, and recent language models that use contextual word representations in a continuous space, such as BERT models, have significantly improved many NLP applications. In this work, we present our efforts to use BERT-based models to improve the dialect identification of Arabic text. We report results for models that recognize the source Arab country, or Arab region, from Twitter data. Our results show a 3.4% absolute improvement in dialect identification accuracy at the regional level over the state-of-the-art result. When we excluded the Modern Standard Arabic (MSA) set, which is the formal Arabic language, we achieved a 3% absolute gain in accuracy among the three major Arabic dialects over the state-of-the-art level. Finally, we applied the developed models to a recently collected resource of Arabic COVID-19 tweets to recognize the source country from the users’ tweets. We achieved a weighted average accuracy of 97.36%, which suggests the models can serve as a tool for policymakers to support country-level disaster-related activities.
- Subjects :
- social networks
Technology
Language identification
QH301-705.5
Computer science
QC1-999
computer.software_genre
Resource (project management)
location analysis
General Materials Science
Social media
Biology (General)
Set (psychology)
QD1-999
Instrumentation
Publication
BERT models
dialect identification
Fluid Flow and Transfer Processes
business.industry
Physics
Process Chemistry and Technology
General Engineering
Engineering (General). Civil engineering (General)
language.human_language
Computer Science Applications
Chemistry
Identification (information)
Modern Standard Arabic
language
Artificial intelligence
Language model
TA1-2040
business
computer
Natural language processing
Details
- ISSN :
- 2076-3417
- Volume :
- 11
- Database :
- OpenAIRE
- Journal :
- Applied Sciences
- Accession number :
- edsair.doi.dedup.....ef1a689a1a0b7d04e408b642b29f6b43
- Full Text :
- https://doi.org/10.3390/app112311328