INVESTIGATING THE EFFECT OF CLUSTER-BASED PREPROCESSING ON SOURCE-TO-SOURCE CODE TRANSLATION

Authors :
Loganathan, Akila
Paige, Richard
Computing and Software
Publication Year :
2021

Abstract

Numerous programming languages have been proposed over the last 60 years. Programming languages, like other software systems, can become obsolete: their compilers, virtual machines, interpreters and libraries are no longer fit for purpose. As such, programs written in obsolete programming languages may need to be modernized so that they rely instead on modern languages, libraries and tools. Modernization is both a technical and a social process; in this thesis, we focus on the technical aspects of modernization, particularly software migration, wherein a program written in one programming language is transformed into an equivalent or similar program written in a different language. Migration happens because many software systems developed decades ago can no longer be maintained and need to be overhauled so that new processes can take advantage of recently developed technologies. Migrating an existing codebase to a more efficient and modern programming language is often expensive, and it involves several types of risk: functionality may not be implemented properly after migration (i.e., the migration is inaccurate); code quality concerns may not be considered until the end of the migration; and, for large codebases, the migration process may be slow and may demand substantial resources. Recent advances in artificial intelligence for natural language translation have been widely adopted, but their application to programming language translation has been limited by the scarcity of parallel data (i.e., collections of equivalent phrases in a source language and their translations in a target language). This thesis presents a preliminary investigation into the use of unsupervised learning methods (specifically, a newly proposed K-Means clustering approach for preprocessing and analyzing the source code) prior to rule-based code translation.
The thesis investigates such a process both generally and abstractly, and specifically in the context of a concrete migration from C++ to Java. The thesis also presents a test set for evaluating such an approach, based on open-source code, which can be used as a general resource both for validating migration approaches and for assessing their performance. Our experiments show that the proposed translation approach, which uses unsupervised machine learning for preprocessing, achieves translation accuracy scores of 77.89% and 81.34%, compared with 33.24% and 59.96% for an alternative approach, and with 37.39% and 41.26% for rule-based translation that excludes the preprocessing step.

Thesis
Master of Science (MSc)

Details

Language :
English
Database :
OpenAIRE
Accession number :
edsair.od......1154..9fb3899e7b6322499df62edc492639b5