1. Code Authorship Attribution using content-based and non-content-based features
- Author
- J.E. Rice and Parinaz Bayrami
- Subjects
Source code, Computer program, Computer programming, Code (semiotics), Field (computer science), Identification (information), Identity (object-oriented programming), Artificial intelligence, Natural language, Natural language processing
- Abstract
To attribute authorship (author identification) means to identify the true author of a sample of work among many candidates. Author identification is an important research field in natural language processing. Machine learning approaches are widely used in natural language analysis, and previous research has shown that similar techniques can be applied to the analysis of computer programming (artificial) languages. This paper focuses on the use of machine learning techniques to identify the authors of computer programs. We focus on identifying which features capture an author's writing style when classifying a computer program by its author's identity. We then propose a novel approach to computer program author identification, in which features from the programs' source code are combined with the authors' sociological features (gender and region) to develop the classification model. Several experiments were conducted on two datasets composed of computer programs written in C++. Our models are able to predict an author's identity with 75% accuracy.
- Published
- 2021