Multi-χ: Identifying Multiple Authors from Source Code Files
- Source :
- Proceedings on Privacy Enhancing Technologies, Vol 2020, Iss 3, Pp 25-41 (2020)
- Publication Year :
- 2020
- Publisher :
- Privacy Enhancing Technologies Symposium Advisory Board, 2020.
Abstract
- Most authorship identification schemes assume that code samples are written by a single author. However, real software projects are typically the result of a team effort, making it essential to consider fine-grained multi-author identification within a single code sample, which we address with Multi-χ. Multi-χ leverages a deep learning-based approach for multi-author identification in source code, is lightweight, uses a compact representation for efficiency, and does not require any code parsing, syntax tree extraction, or feature selection. In Multi-χ, code samples are divided into small segments, which are then represented as a sequence of n-dimensional term representations. The sequence is fed into an RNN-based verification model to assist a segment integration process, which merges positively verified segments, i.e., segments that have a high probability of being written by one author. Finally, the segments resulting from the integration process are represented using word2vec or TF-IDF and fed into the identification model. We evaluate Multi-χ with several GitHub projects (Caffe, Facebook's Folly, TensorFlow, etc.) and show high accuracy. For example, Multi-χ achieves an authorship example-based accuracy (A-EBA) of 86.41% and a per-segment authorship identification accuracy of 93.18% for identifying 562 programmers. We examine the performance across multiple dimensions and design choices, and demonstrate its effectiveness.
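The segment-then-integrate flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the fixed lines-per-segment split, and the threshold are all assumptions made for illustration, and the paper's RNN-based verification model is replaced here by a simple token-overlap (Jaccard) stand-in so that the example stays self-contained.

```python
import re
from typing import Callable, List, Set


def split_into_segments(code: str, lines_per_segment: int = 2) -> List[str]:
    """Divide a code sample into small segments of non-blank lines.

    The fixed-size split is an assumption; the abstract only says that
    samples are divided into small segments.
    """
    lines = [ln for ln in code.splitlines() if ln.strip()]
    return [
        "\n".join(lines[i:i + lines_per_segment])
        for i in range(0, len(lines), lines_per_segment)
    ]


def tokenize(segment: str) -> Set[str]:
    """Crude lexical tokens: identifiers/keywords or single symbols."""
    return set(re.findall(r"[A-Za-z_]\w*|\S", segment))


def jaccard_same_author(a: str, b: str, threshold: float = 0.5) -> bool:
    """Hypothetical stand-in for the verification model.

    Multi-χ uses an RNN-based verifier; token-set overlap is used here
    only to make the sketch runnable without a trained model.
    """
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return False
    return len(ta & tb) / len(ta | tb) >= threshold


def integrate_segments(
    segments: List[str],
    same_author: Callable[[str, str], bool],
) -> List[str]:
    """Merge each segment into the previous merged segment when the
    verifier accepts the pair; otherwise start a new merged segment."""
    merged: List[str] = []
    for seg in segments:
        if merged and same_author(merged[-1], seg):
            merged[-1] = merged[-1] + "\n" + seg
        else:
            merged.append(seg)
    return merged
```

In a fuller pipeline, each merged segment would then be vectorized (the abstract names word2vec or TF-IDF) and passed to a separate identification model; that stage is omitted from this sketch.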
- Subjects :
- Ethics
021110 strategic, defence & security studies
Source code
program features
Computer science
Programming language
media_common.quotation_subject
0211 other engineering and technologies
020207 software engineering
software forensics
QA75.5-76.95
02 engineering and technology
BJ1-1725
computer.software_genre
Electronic computers. Computer science
0202 electrical engineering, electronic engineering, information engineering
General Earth and Planetary Sciences
code authorship identification
deep learning identification
computer
General Environmental Science
media_common
Details
- ISSN :
- 2299-0984
- Volume :
- 2020
- Database :
- OpenAIRE
- Journal :
- Proceedings on Privacy Enhancing Technologies
- Accession number :
- edsair.doi.dedup.....b693b1d67b9919da7b1f3c3a912be271
- Full Text :
- https://doi.org/10.2478/popets-2020-0044