Code Authorship Attribution using content-based and non-content-based features

Authors :
J.E. Rice
Parinaz Bayrami
Source :
CCECE
Publication Year :
2021
Publisher :
IEEE, 2021.

Abstract

Authorship attribution (author identification) is the task of identifying the true author of a sample of work from among many candidates. It is an important research area in natural language analysis, where machine learning approaches are widely used, and previous research has shown that similar techniques can be applied to computer programming (artificial) languages. This paper focuses on the use of machine learning techniques to identify the authors of computer programs. We first examine which features capture an author's writing style when a computer program is classified by author identity. We then propose a novel approach to program author identification in which features extracted from the source code are combined with the authors' sociological features (gender and region) to build the classification model. Several experiments were conducted on two datasets of computer programs written in C++. Our models predict an author's identity with 75% accuracy.
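
The idea of joining code-derived features with author metadata in one feature vector can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the three layout features, the category values in GENDERS and REGIONS, the nearest-centroid classifier, and the toy author names. The abstract does not say which features or which learning algorithm the authors used.

```python
import math
from collections import defaultdict

# Hypothetical category values; the record does not list the real ones.
GENDERS = ["female", "male"]
REGIONS = ["asia", "europe", "north_america"]


def code_features(source):
    """Three simple layout features of C++ source text (illustrative only)."""
    lines = source.splitlines() or [""]
    n = len(lines)
    avg_len = sum(len(line) for line in lines) / n
    comment_ratio = sum(1 for line in lines if line.lstrip().startswith("//")) / n
    tab_ratio = sum(1 for line in lines if line.startswith("\t")) / n
    # Scale the average line length so all features are of similar magnitude.
    return [avg_len / 80.0, comment_ratio, tab_ratio]


def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over the known categories."""
    return [1.0 if value == c else 0.0 for c in categories]


def combined_vector(source, gender, region):
    """Concatenate code features with one-hot encoded sociological features."""
    return code_features(source) + one_hot(gender, GENDERS) + one_hot(region, REGIONS)


def fit_centroids(samples):
    """samples: list of (vector, author). Returns author -> mean vector."""
    grouped = defaultdict(list)
    for vec, author in samples:
        grouped[author].append(vec)
    return {
        author: [sum(col) / len(vecs) for col in zip(*vecs)]
        for author, vecs in grouped.items()
    }


def predict(centroids, vec):
    """Return the author whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda author: math.dist(centroids[author], vec))


# Toy training data with made-up authors and metadata.
train = [
    (combined_vector("// add\nint add(int a, int b) {\n\treturn a + b;\n}\n",
                     "female", "europe"), "alice"),
    (combined_vector("int sub(int a,int b){return a-b;}\n",
                     "male", "asia"), "bob"),
]
centroids = fit_centroids(train)

query = combined_vector("// mul\nint mul(int a, int b) {\n\treturn a * b;\n}\n",
                        "female", "europe")
print(predict(centroids, query))
```

The sketch assumes the gender and region of a sample's author are available as metadata when the sample is classified. A nearest-centroid rule is used here only because it needs nothing beyond the standard library; any standard classifier could take its place.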

Details

Database :
OpenAIRE
Journal :
2021 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE)
Accession number :
edsair.doi...........003e2f59d2beaff1beff457d0490709d