Scalable Anti-TrustRank with Qualified Site-level Seeds for Link-based Web Spam Detection

Authors :
Dongho Yoo
Joyce Jiyoung Whang
Seonggoo Kang
Yeonsung Jung
Inderjit S. Dhillon
Source :
WWW (Companion Volume)
Publication Year :
2020
Publisher :
ACM, 2020.

Abstract

Web spam detection is one of the most important and challenging tasks in web search. Since web spam pages tend to have many spurious links, many web spam detection algorithms exploit the hyperlink structure between web pages to detect spam pages. In this paper, we conduct a comprehensive analysis of the link structure of web spam using real-world web graphs to systematically investigate the characteristics of link-based web spam. By exploring the structure of the page-level graph as well as the site-level graph, we propose a scalable site-level seeding methodology for the Anti-TrustRank (ATR) algorithm. The key idea is to map a website into a feature space where we learn a classifier to prioritize the websites so that we can effectively select a set of good seeds for the ATR algorithm. This seeding method enables the ATR algorithm to detect the largest number of spam pages among the competitive baseline methods. Furthermore, we design work-efficient asynchronous ATR algorithms that significantly reduce the computational cost of the traditional ATR algorithm without degrading its performance in detecting spam pages, while guaranteeing convergence.
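
For readers unfamiliar with Anti-TrustRank, the following is a minimal sketch of the commonly described formulation: a seed-biased PageRank computed on the reversed link graph, so that distrust flows from known spam pages back to the pages that link to them. It is not the scalable or asynchronous variant proposed in the paper, and it does not implement the site-level seed classifier. The function name `anti_trust_rank`, the damping factor `alpha=0.85`, the stopping tolerance, and the toy graph are all assumptions made for illustration.

```python
def anti_trust_rank(out_links, spam_seeds, alpha=0.85, tol=1e-10, max_iter=1000):
    """Illustrative synchronous Anti-TrustRank (hypothetical helper).

    out_links:  dict mapping each page to the list of pages it links to.
    spam_seeds: iterable of pages assumed to be known spam.
    Returns a dict mapping each page to its distrust score.
    """
    # Collect every page that appears as a source or a target.
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    nodes = sorted(nodes)

    # Reversed graph: in_links[v] lists the pages that link to v.
    in_links = {n: [] for n in nodes}
    for u, targets in out_links.items():
        for v in targets:
            in_links[v].append(u)

    seeds = {s for s in spam_seeds if s in in_links}
    if not seeds:
        raise ValueError("at least one seed must be a page in the graph")

    # The random jump is restricted to the spam seeds.
    jump = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(jump)

    for _ in range(max_iter):
        new = {n: (1.0 - alpha) * jump[n] for n in nodes}
        for v in nodes:
            sources = in_links[v]
            if sources:
                # v passes its distrust, split evenly, to pages linking to it.
                share = alpha * score[v] / len(sources)
                for u in sources:
                    new[u] += share
            # Pages with no in-links simply drop their propagated mass here.
        delta = sum(abs(new[n] - score[n]) for n in nodes)
        score = new
        if delta < tol:
            break
    return score


# Toy graph (assumed): b -> a -> spam, and an unrelated link c -> good.
toy_graph = {"a": ["spam"], "b": ["a"], "c": ["good"], "spam": [], "good": []}
toy_scores = anti_trust_rank(toy_graph, spam_seeds=["spam"])
```

In this toy graph the distrust score decays with link distance from the seed, and pages with no path to the seed receive no distrust at all. The choice of seeds drives the whole ranking, which is why the seed-selection step described in the abstract matters.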

Details

Database :
OpenAIRE
Journal :
Companion Proceedings of the Web Conference 2020
Accession number :
edsair.doi...........57b21c4c785150fef79c9847e5fea464
Full Text :
https://doi.org/10.1145/3366424.3385773