Implementation and application of a paper data crawler technology based on multiple data sources.
- Source :
- Application Research of Computers / Jisuanji Yingyong Yanjiu. Feb 2021, Vol. 38 Issue 2, p517-521. 5p.
- Publication Year :
- 2021
Abstract
- Collecting paper data from a single data source has several problems, such as incomplete data coverage and a collection speed capped by each website's access-frequency limits. To address these problems, this paper proposes a paper data crawling technique for multiple data sources. First, it uses four Chinese document service websites, CNKI (How Net), Wanfang Data, Weipu, and Chaoxing, as data sources, and crawls and parses the list-page data for the search keywords. Then it applies a task scheduling strategy to remove duplicate records and balance the tasks. Finally, it runs multiple threads for each data source to crawl, parse, and store the detailed information of the papers, and builds a website for search and display. Experiments show that, at the same crawling and parsing speed, this technique completes the paper information collection task more comprehensively and efficiently, which demonstrates its effectiveness. [ABSTRACT FROM AUTHOR]
- Subjects :
- *KEYWORD searching
- *ACQUISITION of data
- *TASKS
- *SPEED
- *INFORMATION processing
Details
- Language :
- Chinese
- ISSN :
- 1001-3695
- Volume :
- 38
- Issue :
- 2
- Database :
- Academic Search Index
- Journal :
- Application Research of Computers / Jisuanji Yingyong Yanjiu
- Publication Type :
- Academic Journal
- Accession number :
- 148598204
- Full Text :
- https://doi.org/10.19734/j.issn.1001-3695.2019.11.0671
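
The pipeline summarized in the abstract (collect list-page results per source, deduplicate and balance the tasks, then crawl details with several threads per source) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the scheduling strategy, so the title-based deduplication key, the least-loaded assignment rule, the source names `A` and `B`, and the `fetch_detail` placeholder are all assumptions made for illustration. No network access is performed.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def normalize(title: str) -> str:
    """Collapse case and whitespace so the same paper matches across sources."""
    return "".join(title.lower().split())


def schedule(listings: dict[str, list[str]]) -> dict[str, list[str]]:
    """Deduplicate titles across sources and balance the detail-page tasks.

    Assumed strategy (hypothetical): each unique paper is assigned to exactly
    one of the sources that list it, picking the source with the fewest tasks.
    """
    candidates: dict[str, list[str]] = defaultdict(list)  # key -> sources listing it
    first_title: dict[str, str] = {}
    for source, titles in listings.items():
        for title in titles:
            key = normalize(title)
            first_title.setdefault(key, title)
            if source not in candidates[key]:
                candidates[key].append(source)

    tasks: dict[str, list[str]] = {source: [] for source in listings}
    # Papers available from the fewest sources are placed first,
    # because they leave the least room for balancing later.
    for key in sorted(candidates, key=lambda k: len(candidates[k])):
        target = min(candidates[key], key=lambda s: len(tasks[s]))
        tasks[target].append(first_title[key])
    return tasks


def fetch_detail(source: str, title: str) -> dict[str, str]:
    """Hypothetical placeholder for the download-and-parse step."""
    return {"source": source, "title": title}


def crawl(tasks: dict[str, list[str]], workers_per_source: int = 4) -> list[dict[str, str]]:
    """Run one thread pool per source so each site is crawled by its own threads."""
    pools = {s: ThreadPoolExecutor(max_workers=workers_per_source) for s in tasks}
    try:
        futures = [
            pools[source].submit(fetch_detail, source, title)
            for source, titles in tasks.items()
            for title in titles
        ]
        return [future.result() for future in futures]
    finally:
        for pool in pools.values():
            pool.shutdown()


if __name__ == "__main__":
    listings = {
        "A": ["Paper One", "Paper Two", "Paper Three"],
        "B": ["paper one", "Paper  Two", "Paper Four"],
    }
    for record in crawl(schedule(listings)):
        print(record)
```

Keeping a separate pool per source means a slow or rate-limited site only delays its own tasks, which is one plausible reading of "multi-threads for each data source"; a real crawler would also need per-site request throttling and error handling.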