Deep Pre-Training Transformers for Scientific Paper Representation.
- Authors
Wang, Jihong; Yang, Zhiguang; Cheng, Zhanglin
- Subjects
NATURAL language processing; LANGUAGE models; SCIENTIFIC literature; BIG data; VECTOR spaces
- Abstract
In the age of scholarly big data, efficiently navigating and analyzing the vast corpus of scientific literature is a significant challenge. This paper introduces a specialized pre-trained BERT-based language model, termed SPBERT, which enhances natural language processing tasks specifically tailored to the domain of scientific paper analysis. Our method employs a novel neural network embedding technique that leverages textual components, such as keywords, titles, abstracts, and full texts, to represent papers in a vector space. By integrating recent advancements in text representation and unsupervised feature aggregation, SPBERT offers a sophisticated approach to encode essential information implicitly, thereby enhancing paper classification and literature retrieval tasks. We applied our method to several real-world academic datasets, demonstrating notable improvements over existing methods. The findings suggest that SPBERT not only provides a more effective representation of scientific papers but also facilitates a deeper understanding of large-scale academic data, paving the way for more informed and accurate scholarly analysis.
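The abstract describes encoding a paper's keywords, title, abstract, and full text with a BERT-style encoder and aggregating the component representations into a single vector used for classification and retrieval. The sketch below is only a rough illustration of that idea, not the authors' SPBERT: the model name (`bert-base-uncased`), mean pooling over tokens, and simple averaging of component vectors are all assumptions standing in for details the record does not provide.

```python
# Hypothetical sketch: embed a paper's textual components with a generic
# pre-trained BERT encoder and average them into one paper vector.
# Model choice, pooling, and aggregation are assumptions, not the SPBERT setup.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # placeholder; the paper's SPBERT weights are not referenced here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed_text(text: str) -> torch.Tensor:
    """Mean-pool the encoder's last hidden states over non-padding tokens."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)        # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, 768)

def embed_paper(title: str, abstract: str, keywords: list[str]) -> torch.Tensor:
    """Aggregate component embeddings (title, abstract, keywords) by averaging."""
    parts = [title, abstract, ", ".join(keywords)]
    vectors = torch.cat([embed_text(p) for p in parts], dim=0)  # (3, 768)
    return vectors.mean(dim=0)                                   # (768,)

# Usage: rank papers by cosine similarity, as in a retrieval task.
paper_a = embed_paper(
    "Deep Pre-Training Transformers for Scientific Paper Representation",
    "A BERT-based model for representing scientific papers in a vector space.",
    ["language models", "scientific literature"],
)
paper_b = embed_paper(
    "A Survey of Text Embeddings",
    "An overview of neural text representation methods.",
    ["embeddings", "natural language processing"],
)
similarity = torch.nn.functional.cosine_similarity(paper_a, paper_b, dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
```

In a pipeline of this kind, the resulting paper vectors could feed a standard classifier for paper categorization or a nearest-neighbour index for literature retrieval; the paper's own aggregation scheme may differ from the plain averaging shown here.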
- Published
- 2024