Multiclass Classification for Self-Admitted Technical Debt Based on XGBoost
- Authors
- Chen, Xin; Yu, Dongjin; Fan, Xulin; Wang, Lin; Chen, Jie
- Subjects
- Software maintenance; Data augmentation; Computer software development; Computer software quality control; Debt
- Abstract
In software development, due to demands from users or limitations of time and resources, developers tend to adopt suboptimal solutions to achieve quick software development. As a result, the released software usually contains not-quite-right code, called technical debt, which significantly decreases software quality and increases maintenance cost. Recently, the concept of self-admitted technical debt (SATD) was proposed; it refers to technical debt that developers admit themselves in code comments. Existing studies mainly focus on detecting technical debt by classifying code comments as either “SATD” or “non-SATD.” However, different types of SATD have different impacts on software maintenance and need to be handled by different developers. Therefore, the detected SATD should be further classified so that developers can better understand and remove technical debt. In this article, we propose a new method based on eXtreme Gradient Boosting (XGBoost) to classify SATD into multiple classes. In our approach, we first preprocess the original code comments and adopt the easy data augmentation strategy to overcome the class imbalance problem. Then, chi-square is leveraged to select representative features from the textual feature set. Finally, we apply XGBoost to train a classifier and use the trained classifier to assign each comment to the corresponding class. We experimentally investigate the effectiveness of our approach on a public dataset containing 62,566 code comments from 10 open-source projects. Experimental results show that our approach achieves, on average, 56.66% macro-averaged precision, 59.07% macro-averaged recall, and 55.77% macro-averaged F-measure, outperforming the natural language processing based method by 4.98%, 5.32%, and 3.17%, respectively. In addition, the results demonstrate that the data augmentation strategy further improves the performance of our approach. [ABSTRACT FROM AUTHOR] (A hedged code sketch of the described pipeline follows this record.)
- Published
- 2022
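As a rough illustration of the pipeline the abstract describes (textual features, chi-square feature selection, and an XGBoost multiclass classifier), a minimal sketch in Python might look like the following. The TF-IDF vectorizer, the number of selected features, and the training parameters are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch, not the paper's code: TF-IDF features, chi-square selection,
# and an XGBoost multiclass classifier, as outlined in the abstract.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

satd_classifier = Pipeline([
    # Turn preprocessed code comments into a sparse term-weight matrix.
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    # Keep the terms most correlated with the SATD classes (chi-square test).
    ("chi2", SelectKBest(chi2, k=1000)),  # k is an illustrative choice
    # Gradient-boosted trees; XGBoost infers the multiclass objective from the labels.
    ("xgb", XGBClassifier(n_estimators=300, learning_rate=0.1,
                          eval_metric="mlogloss")),
])

# comments: preprocessed (and EDA-augmented) code comments, as strings
# labels:   integer ids for the SATD types in the dataset
# satd_classifier.fit(comments, labels)
# predictions = satd_classifier.predict(new_comments)
```

In this sketch the easy data augmentation step is assumed to have been applied to the training comments beforehand; only the feature selection and classification stages are shown.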