1. 'Paper, Meet Code': A Deep Learning Approach to Linking Scholarly Articles With GitHub Repositories
- Author
- Prahyat Puangjaktha, Morakot Choetkiertikul, and Suppawong Tuarob
- Subjects
- Academic code repository mining, paper-repository relationship, text representation, machine learning, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
- Computer scientists in many domains often publish source code alongside their publications, typically in public code repositories. Although scholarly articles and their official code repositories frequently coexist, explicit references linking the two are often missing. Traditionally, determining whether an article and a repository belong to the same research project requires manual inspection, which is time-consuming. This paper proposes a deep learning-based algorithm that automatically matches scholarly articles with their corresponding official code repositories. Our findings indicate that the most common linking information is the paper title and BibTeX entries, typically found in the repository's README document. In this study, we used SPECTER to embed paper and repository metadata as vectors. Using these embeddings with the Light Gradient Boosting Machine (LGBM), our method achieved an F1 score of 0.94. Moreover, combining our best model with a rule-based approach improved performance by 5.31%. The study thus establishes a connection between academic papers and their official code repositories while minimizing reliance on explicit bibliographic information in repositories.
- Published
- 2024
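The abstract notes that the paper title and BibTeX entries in a repository's README are the most common linking signals, and that a rule-based approach was combined with the best learned model. The record does not say what the rules are or how the two were combined, so the following is only a minimal sketch under stated assumptions: the rule is a normalized title lookup in the README text (which also catches a title inside a BibTeX entry), and the combination is a simple logical OR with a classifier probability. The names `normalize`, `readme_mentions_title`, `combined_match`, `model_score`, and `threshold` are hypothetical and do not come from the paper.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse each run of non-alphanumeric characters into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def readme_mentions_title(paper_title: str, readme_text: str) -> bool:
    """Assumed rule: the normalized paper title appears verbatim in the normalized README.

    Because braces, quotes, and punctuation are stripped, a title inside a
    BibTeX `title={...}` field matches the same way as a plain-text title.
    """
    title = normalize(paper_title)
    if not title:
        # An empty title would match anything, so treat it as no evidence.
        return False
    # Pad with spaces so the match starts and ends on word boundaries.
    return f" {title} " in f" {normalize(readme_text)} "


def combined_match(paper_title: str, readme_text: str,
                   model_score: float, threshold: float = 0.5) -> bool:
    """Assumed combination: accept a pair if the rule fires OR the model is confident.

    `model_score` stands in for the probability produced by a learned
    classifier (for example, one trained on embedding features); it is
    supplied by the caller and is not computed here.
    """
    return readme_mentions_title(paper_title, readme_text) or model_score >= threshold
```

For instance, `combined_match("Paper, Meet Code", readme, model_score=0.2)` would still accept the pair when the README quotes the title, even though the hypothetical classifier score is low. An OR combination favors recall; a stricter variant could require both signals.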