Multi-χ: Identifying Multiple Authors from Source Code Files

Authors :
DaeHun Nyang
Mohammed Abuhamad
David Mohaisen
Tamer AbuHmed
Source :
Proceedings on Privacy Enhancing Technologies, Vol 2020, Iss 3, Pp 25-41 (2020)
Publication Year :
2020
Publisher :
Privacy Enhancing Technologies Symposium Advisory Board, 2020.

Abstract

Most authorship identification schemes assume that code samples are written by a single author. However, real software projects are typically the result of a team effort, making it essential to consider fine-grained multi-author identification within a single code sample, which we address with Multi-χ. Multi-χ takes a deep learning-based approach to multi-author identification in source code. It is lightweight, uses a compact representation for efficiency, and does not require any code parsing, syntax tree extraction, or feature selection. In Multi-χ, code samples are divided into small segments, which are then represented as a sequence of n-dimensional term representations. The sequence is fed into an RNN-based verification model to assist a segment integration process, which merges positively verified segments, i.e., segments that have a high probability of being written by one author. Finally, the segments resulting from the integration process are represented using word2vec or TF-IDF and fed into the identification model. We evaluate Multi-χ on several GitHub projects (Caffe, Facebook’s Folly, TensorFlow, etc.) and show remarkable accuracy. For example, Multi-χ achieves an authorship example-based accuracy (A-EBA) of 86.41% and a per-segment authorship identification accuracy of 93.18% for identifying 562 programmers. We examine the performance against multiple dimensions and design choices, and demonstrate its effectiveness.
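
The segment integration step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the fixed segment size, the 0.5 threshold, and the greedy left-to-right merge are all assumptions made for illustration, and `toy_verifier` is a simple token-overlap stand-in for the RNN-based verification model.

```python
from typing import Callable, List

Segment = List[str]  # a segment is a list of source lines


def split_into_segments(lines: List[str], size: int) -> List[Segment]:
    """Cut a file into consecutive segments of at most `size` lines."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def integrate_segments(
    segments: List[Segment],
    same_author_prob: Callable[[Segment, Segment], float],
    threshold: float = 0.5,  # assumed value, not taken from the paper
) -> List[Segment]:
    """Greedily merge a segment into the preceding group when the
    verifier scores the pair at or above `threshold`."""
    groups: List[Segment] = []
    for seg in segments:
        if groups and same_author_prob(groups[-1], seg) >= threshold:
            groups[-1] = groups[-1] + seg  # positively verified: integrate
        else:
            groups.append(list(seg))  # start a new candidate-author group
    return groups


def toy_verifier(a: Segment, b: Segment) -> float:
    """Hypothetical stand-in for the RNN verifier: Jaccard overlap of
    whitespace-separated tokens. Only for demonstrating the control flow."""
    ta, tb = set(" ".join(a).split()), set(" ".join(b).split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    code = ["int a = 0;", "int b = 0;", "def f(): pass", "def g(): pass"]
    for group in integrate_segments(split_into_segments(code, 1), toy_verifier):
        print(group)
```

In a full pipeline, each integrated group would then be vectorized (the abstract names word2vec and TF-IDF) and passed to the identification model. That stage is omitted here.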

Details

ISSN :
22990984
Volume :
2020
Database :
OpenAIRE
Journal :
Proceedings on Privacy Enhancing Technologies
Accession number :
edsair.doi.dedup.....b693b1d67b9919da7b1f3c3a912be271
Full Text :
https://doi.org/10.2478/popets-2020-0044