Sensitive Keyword Extraction Based on Cyber Keywords and LDA in Twitter to Avoid Regrets
- Source :
- IFIP Advances in Information and Communication Technology, 3rd International Conference on Computational Intelligence in Data Science (ICCIDS), Feb 2020, Chennai, India. pp.59-70, ⟨10.1007/978-3-030-63467-4_5⟩, ISBN: 9783030634667
- Publication Year :
- 2020
- Publisher :
- HAL CCSD, 2020.
Abstract
- Part 1: Computational Intelligence for Text Analysis; International audience; Twitter is one of the most popular social platforms, where ordinary people share their personal, political and business views and thereby indirectly build an active online repository. The data that users post on social networking sites often include sensitive or private information that is highly exposed to cyber threats. The most frequently posted sensitive private data are analyzed by collecting real-time tweets based on benchmarked cyber-keywords under personal, professional and health categories. This research work aims to build a Topic Keyword Extractor by adapting the Automatic Acronym-Abbreviation Replacer, which was developed specifically for short social media texts. The feature space is modeled using Latent Dirichlet Allocation (LDA) to discover topics for each cyber-keyword. The user’s context and intentions are preserved by replacing internet jargon and abbreviations with their expanded forms. The originality of this research work lies in identifying sensitive keywords that reveal a tweeter’s Personally Identifiable Information through the novel Topic Keyword Extractor. The potentially sensitive topics in which social media users frequently disclose personal information, including unintended disclosures, are discovered for the benchmarked cyber-keywords using the proposed qualitative topic-wise keyword distribution approach. The experiment analyzed cyber-keywords and the identified sensitive topic keywords as bi-grams to predict the most common sensitive information leaks on Twitter. The results showed that the most frequently discussed sensitive topic was ‘weight loss’, with the cyber-keyword ‘weight’ from the health tweet category.
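The pipeline outlined in the abstract (abbreviation replacement, LDA topic discovery per cyber-keyword, then pairing the keyword with topic words as bi-grams) can be illustrated with a minimal sketch. It is not the authors' implementation: the abbreviation table, stopword list, sample tweets, hyperparameters and all function names below are hypothetical, and the LDA step is a plain collapsed Gibbs sampler written with the Python standard library only.

```python
import random
import re
from collections import Counter

# Hypothetical lookup tables; the paper's actual replacer is not reproduced here.
ABBREVIATIONS = {"u": "you", "gr8": "great", "wt": "weight", "dr": "doctor"}
STOPWORDS = {"a", "an", "the", "i", "my", "is", "to", "of", "and", "at", "after"}


def expand(tweet):
    """Lowercase, tokenize, replace known abbreviations, drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", tweet.lower())
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns one word Counter per topic."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    doc_topic = [[0] * num_topics for _ in docs]
    assignments = []
    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, z in zip(doc, zs):
            topic_word[z][w] += 1
            topic_total[z] += 1
            doc_topic[d][z] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                doc_topic[d][z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                topic_word[z][w] += 1
                topic_total[z] += 1
                doc_topic[d][z] += 1
    return topic_word


def keyword_bigrams(tweets, cyber_keyword, num_topics=2, top_n=3):
    """Pair a cyber-keyword with the top words of each topic found in its tweets."""
    docs = [expand(t) for t in tweets]
    # Keep only tweets mentioning the keyword, then drop the keyword itself.
    docs = [[w for w in d if w != cyber_keyword] for d in docs if cyber_keyword in d]
    topics = lda_gibbs(docs, num_topics)
    # Unary plus discards words whose count fell to zero during sampling.
    return [
        [(cyber_keyword, w) for w, _ in (+counts).most_common(top_n)]
        for counts in topics
    ]


# Hypothetical sample tweets for the health keyword "weight".
TWEETS = [
    "My wt loss journey is gr8",
    "weight loss diet plan",
    "dr checked my weight at the clinic",
    "weight gain after clinic visit",
]

if __name__ == "__main__":
    for topic_id, pairs in enumerate(keyword_bigrams(TWEETS, "weight")):
        print(topic_id, pairs)
```

On a corpus this small the topics are not meaningful; the sketch only shows how expanded tokens feed the topic model and how each topic's top words are combined with the cyber-keyword into candidate bi-grams such as ('weight', 'loss').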
- Subjects :
- Computer science
InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL
Twitter
Keyword extraction
Context (language use)
02 engineering and technology
Privacy leaks
Latent Dirichlet allocation
World Wide Web
Social media
Information sensitivity
020204 information systems
Cyber-keywords
Regrets
0202 electrical engineering, electronic engineering, information engineering
020201 artificial intelligence & image processing
The Internet
[INFO]Computer Science [cs]
Acronym
Personally identifiable information
Details
- Language :
- English
- ISBN :
- 978-3-030-63466-7
- Database :
- OpenAIRE
- Accession number :
- edsair.doi.dedup.....534336421a7118a69e90baecfcf936fa