Learning Word Representation for the Cyber Security Vulnerability Domain
- Source :
- IJCNN
- Publication Year :
- 2020
- Publisher :
- IEEE, 2020.
- Abstract :
- An ever-increasing number of security vulnerabilities has been discovered and reported in recent years. Much of the information related to these vulnerabilities is currently available to the public in the form of rich textual data (e.g. vulnerability reports). Many of the state-of-the-art techniques used today to process such textual data rely on so-called word embeddings. Several pre-trained embeddings have been created to date, many of which rely on general-purpose training datasets such as Google News and Wikipedia. More recently, domain-specific word embeddings have been created (e.g. in the context of software development) to cope with the terminology and ambiguity limitations of existing general-purpose embeddings. The availability of word embeddings for specialised domains is critical for the effectiveness of domain-specific tasks that rely on this technique. In this paper, we propose a word embedding for the cyber security vulnerability domain. We train our embedding model on multiple rich and heterogeneous security vulnerability information sources publicly available on the web. The benefits of such a specialised word embedding are demonstrated through a qualitative comparison of word similarity and the exemplary task of matching security professionals to vulnerability discovery tasks posted to bug bounty programs. We also introduce a new dataset of word-pair similarities with human judgements that can be used as a benchmark. Our experimental results show that, in the context of cyber security, our domain-specific word embedding outperforms existing pre-trained embeddings built on general-purpose and software engineering datasets.
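The word-similarity comparison mentioned in the abstract is typically based on cosine similarity between word vectors. The following is a minimal sketch of that idea, not the authors' implementation: the function names, the three-dimensional vectors, and the vocabulary in `toy_embedding` are all hypothetical and chosen only for illustration. A real embedding would be trained on a corpus and have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def most_similar(word, embedding, top_n=3):
    """Rank all other words in the embedding by similarity to `word`."""
    query = embedding[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in embedding.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical toy vectors, invented for illustration only.
toy_embedding = {
    "overflow": [0.9, 0.1, 0.0],
    "buffer": [0.8, 0.2, 0.1],
    "injection": [0.1, 0.9, 0.2],
    "river": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    for neighbour, score in most_similar("overflow", toy_embedding):
        print(f"{neighbour}: {score:.3f}")
```

Comparing the neighbour lists that two embeddings return for the same security term is one simple way to carry out a qualitative comparison of this kind.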
- Subjects :
- Word embedding
Computer science
Software development
Context (language use)
02 engineering and technology
Ambiguity
Computer security
Semantics
Terminology
020204 information systems
0202 electrical engineering, electronic engineering, information engineering
Embedding
020201 artificial intelligence & image processing
Word (computer architecture)
Vulnerability (computing)
Details
- Database :
- OpenAIRE
- Journal :
- 2020 International Joint Conference on Neural Networks (IJCNN)
- Accession number :
- edsair.doi...........8a62f51f76ec1d513233e31497e4465f
- Full Text :
- https://doi.org/10.1109/ijcnn48605.2020.9207140