Implementation and application of a paper data crawler technology based on multiple data sources.

Authors :: 侯晋升
 张仰森
 黄改娟
 段瑞雪
Source :: Application Research of Computers / Jisuanji Yingyong Yanjiu. Feb2021, Vol. 38 Issue 2, p517-521. 5p.
Publication Year :: 2021
Abstract: Collecting paper data from a single data source has several problems, such as incomplete data coverage and a collection speed capped by the website's access-frequency limits. To address these problems, this paper proposes a paper data crawling technology for multiple data sources. First, it takes four Chinese literature service websites (CNKI, Wanfang Data, Weipu, and Chaoxing) as data sources and crawls and parses the list-page data for the search keywords. Then it applies a task scheduling strategy to remove duplicate records and balance the tasks. Finally, it uses multiple threads for each data source to crawl, parse, and store the detailed information of the papers, and builds a website for search and display. Experiments show that, at the same crawling and parsing speed, this technology completes the paper information collection task more comprehensively and efficiently, which demonstrates its effectiveness. [ABSTRACT FROM AUTHOR]
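
The three stages named in the abstract (list-page collection, deduplication with task balancing, and per-source multi-threaded detail crawling) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not describe the scheduling strategy in detail, so the title-based deduplication key, the least-loaded assignment rule, and the names `normalize`, `schedule`, `fetch_detail`, and `crawl` are all assumptions made for illustration. The detail fetch is a placeholder that performs no network access.

```python
from concurrent.futures import ThreadPoolExecutor


def normalize(title):
    # Hypothetical dedup key: case-folded title with all whitespace removed.
    return "".join(title.split()).casefold()


def schedule(listings):
    """Deduplicate list-page records and balance them across sources.

    listings maps a source name to the titles found on its list pages.
    Returns a dict mapping each source to the titles it should fetch.
    """
    # Record which sources list each unique paper (insertion order is kept).
    available = {}
    for source, titles in listings.items():
        for title in titles:
            entry = available.setdefault(normalize(title), (title, []))
            if source not in entry[1]:
                entry[1].append(source)

    tasks = {source: [] for source in listings}
    # Place papers with the fewest candidate sources first, then give each
    # paper to the least-loaded source that lists it (assumed balancing rule).
    for title, sources in sorted(available.values(), key=lambda e: len(e[1])):
        target = min(sources, key=lambda s: len(tasks[s]))
        tasks[target].append(title)
    return tasks


def fetch_detail(source, title):
    # Placeholder for the real request-and-parse step; no network access.
    return {"source": source, "title": title}


def crawl(tasks, workers_per_source=2):
    # One thread pool per data source, so each site is throttled independently.
    pools = {s: ThreadPoolExecutor(max_workers=workers_per_source) for s in tasks}
    try:
        futures = [
            pools[source].submit(fetch_detail, source, title)
            for source, titles in tasks.items()
            for title in titles
        ]
        return [future.result() for future in futures]
    finally:
        for pool in pools.values():
            pool.shutdown()
```

Calling `schedule` on the per-source list-page results and passing its output to `crawl` fetches each unique paper exactly once, from one of the sources that lists it. Keeping a separate pool per source is one way to respect per-site access-frequency limits while still crawling all sources in parallel.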