
MCMTCrawler: a Multi-Computer and Multi-Thread Vertical Crawler.

Authors :
Ziyun Deng
Lei Chen
Tingqin He
Tao Meng
Source :
Engineering Letters. Sep2018, Vol. 26 Issue 3, p313-319. 7p.
Publication Year :
2018

Abstract

To optimize the structure of open-source crawlers and improve on the performance of standalone crawlers, we design a new Multi-Computer and Multi-Thread vertical crawler called MCMTCrawler. MCMTCrawler can complete a specific crawling task on a large business website within a few hours. It uses Berkeley DB to persist the queue of waiting Uniform Resource Locators (URLs) and the queue of downloaded URLs. The MD5 algorithm maps each URL to a 32-character string. MCMTCrawler employs the Producer-Consumer model to assign and process URLs. Following the design ideas of Aspect-Oriented Programming (AOP) and Dependency Injection (DI) in Spring, the scheduler and the downloader of MCMTCrawler are designed as separate components to speed up crawling. According to the experimental results, with three download servers MCMTCrawler is five times as fast as a single-computer, single-process crawler and three times as fast as Crawler4j, a single-computer, multi-thread crawler. Furthermore, MCMTCrawler takes only 6.83 hours to crawl 600,000 web pages. [ABSTRACT FROM AUTHOR]
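
The abstract names two mechanisms that a short example can make concrete: hashing each URL with MD5 to obtain a 32-character key, and a Producer-Consumer queue that hands URLs to worker threads. The following is a minimal Python sketch under stated assumptions, not the authors' implementation. The names `url_key`, `crawl`, and the `fetch` callable are hypothetical, and an in-memory queue and set stand in for the Berkeley DB queues described above.

```python
import hashlib
import queue
import threading


def url_key(url: str) -> str:
    """Map a URL to the 32-character hexadecimal form of its MD5 digest."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def crawl(seed_urls, fetch, num_workers=3):
    """Producer-consumer crawl sketch.

    `fetch(url)` is a hypothetical callable that returns an iterable of
    URLs discovered on that page. Returns the set of MD5 keys of all
    URLs that were enqueued and processed.
    """
    waiting = queue.Queue()   # in-memory stand-in for the persisted waiting queue
    seen = set()              # MD5 keys of every URL enqueued so far
    lock = threading.Lock()   # protects `seen` across worker threads

    def enqueue(url):
        key = url_key(url)
        with lock:
            if key in seen:   # already scheduled, skip the duplicate
                return
            seen.add(key)
        waiting.put(url)

    def worker():
        while True:
            url = waiting.get()
            if url is None:   # sentinel: no more work
                waiting.task_done()
                break
            try:
                # The consumer also acts as a producer for new links.
                for link in fetch(url):
                    enqueue(link)
            finally:
                waiting.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for url in seed_urls:
        enqueue(url)

    waiting.join()            # returns once every enqueued URL is processed
    for _ in threads:
        waiting.put(None)     # one sentinel per worker
    for t in threads:
        t.join()
    return seen
```

Because links are enqueued before the parent URL is marked done, `waiting.join()` cannot return while work is still being discovered. The sketch keeps everything in one process; the multi-computer split between scheduler and downloader that the abstract describes is outside its scope.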

Details

Language :
English
ISSN :
1816-093X
Volume :
26
Issue :
3
Database :
Academic Search Index
Journal :
Engineering Letters
Publication Type :
Academic Journal
Accession number :
131924105