Sensitive Keyword Extraction Based on Cyber Keywords and LDA in Twitter to Avoid Regrets

Authors :
S. Karthika
R. Geetha
Sri Sivasubramaniya Nadar College of Engineering (SSN College of Engineering)
Aravindan Chandrabose
Ulrich Furbach
Ashish Ghosh
Anand Kumar M.
TC 12
Source :
IFIP Advances in Information and Communication Technology, 3rd International Conference on Computational Intelligence in Data Science (ICCIDS), Feb 2020, Chennai, India. pp. 59-70, ⟨10.1007/978-3-030-63467-4_5⟩, ISBN: 9783030634667
Publication Year :
2020
Publisher :
HAL CCSD, 2020.

Abstract

Part 1: Computational Intelligence for Text Analysis. Twitter is one of the most popular social platforms, where people share personal, political and business views that indirectly build an active online repository. The data users post on social networking sites often include sensitive or private information that is highly exposed to cyber threats. This work analyzes the most frequently disclosed sensitive private data by collecting real-time tweets that match benchmarked cyber-keywords in the personal, professional and health categories. It builds a Topic Keyword Extractor by adapting the Automatic Acronym-Abbreviation Replacer, which was developed specifically for short social media texts. The feature space is modeled with Latent Dirichlet Allocation (LDA) to discover topics for each cyber-keyword. Replacing internet jargon and abbreviations preserves the user's context and intentions. The originality of the work lies in identifying sensitive keywords that reveal a Twitter user's Personally Identifiable Information through the novel Topic Keyword Extractor. Using the proposed qualitative topic-wise keyword distribution approach, the study discovers, for each benchmarked cyber-keyword, the sensitive topics in which social media users frequently reveal personal information and make unintended disclosures. The experiment analyzes each cyber-keyword together with its identified sensitive topic keywords as bi-grams to predict the most common sensitive information leaks on Twitter. The results show that the most frequently discussed sensitive topic was ‘weight loss’, associated with the cyber-keyword ‘weight’ in the health tweet category.
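
The abbreviation replacement and bi-gram steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `JARGON` table, the function names `expand_jargon` and `keyword_bigrams`, and the three sample tweets are hypothetical and chosen only for illustration. The LDA topic-modeling step is omitted here; this sketch simply counts the word that directly follows a cyber-keyword.

```python
import re
from collections import Counter

# Hypothetical jargon table (assumption); the paper's actual
# acronym/abbreviation dictionary is not given in this record.
JARGON = {"idk": "i do not know", "bday": "birthday", "dr": "doctor", "u": "you"}

def expand_jargon(tweet, table=JARGON):
    """Lowercase the tweet, replace known abbreviations, return a token list."""
    tokens = re.findall(r"[a-z0-9']+", tweet.lower())
    # Expansions may contain several words, so join and split again.
    return " ".join(table.get(tok, tok) for tok in tokens).split()

def keyword_bigrams(tweets, keyword):
    """Count (keyword, next word) bi-grams across the expanded tweets."""
    counts = Counter()
    for tweet in tweets:
        tokens = expand_jargon(tweet)
        for left, right in zip(tokens, tokens[1:]):
            if left == keyword:
                counts[(left, right)] += 1
    return counts

# Toy tweets invented for this example, not data from the study.
tweets = [
    "My weight loss journey starts today",
    "idk why weight loss is so hard",
    "Dr said my weight gain is normal",
]
counts = keyword_bigrams(tweets, "weight")
top_bigram = counts.most_common(1)[0]
print(top_bigram)
```

In a fuller pipeline, the expanded token lists would be passed to an LDA implementation to obtain topic keywords per cyber-keyword before the bi-grams are formed.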

Details

Language :
English
ISBN :
978-3-030-63466-7
Database :
OpenAIRE
Accession number :
edsair.doi.dedup.....534336421a7118a69e90baecfcf936fa