Scalable Anti-TrustRank with Qualified Site-level Seeds for Link-based Web Spam Detection
- Source :
- WWW (Companion Volume)
- Publication Year :
- 2020
- Publisher :
- ACM, 2020.
Abstract
- Web spam detection is one of the most important and challenging tasks in web search. Since web spam pages tend to have many spurious links, many web spam detection algorithms exploit the hyperlink structure between web pages to detect spam pages. In this paper, we conduct a comprehensive analysis of the link structure of web spam using real-world web graphs to systematically investigate the characteristics of link-based web spam. By exploring the structure of the page-level graph as well as the site-level graph, we propose a scalable site-level seeding methodology for the Anti-TrustRank (ATR) algorithm. The key idea is to map a website into a feature space where we learn a classifier to prioritize the websites so that we can effectively select a set of good seeds for the ATR algorithm. This seeding method enables the ATR algorithm to detect the largest number of spam pages among the competitive baseline methods. Furthermore, we design work-efficient asynchronous ATR algorithms that significantly reduce the computational cost of the traditional ATR algorithm without degrading its performance in detecting spam pages, while guaranteeing convergence.
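
For readers unfamiliar with Anti-TrustRank, the following is a minimal sketch of the commonly described synchronous form of the idea: distrust starts at a set of known spam seeds and is propagated backward along hyperlinks, so pages that link to spam accumulate distrust. It is not the paper's method. The classifier-based site-level seed selection and the asynchronous variants mentioned in the abstract are not reproduced here, and the function name `anti_trustrank`, its parameters, and the damping value 0.85 are assumptions chosen for illustration.

```python
def anti_trustrank(out_links, spam_seeds, alpha=0.85, iterations=50):
    """Illustrative synchronous Anti-TrustRank (hypothetical helper).

    out_links:  dict mapping a page to the list of pages it links to.
    spam_seeds: iterable of pages assumed to be known spam.
    Returns a dict of distrust scores; higher means more spam-like.
    """
    # Collect every page that appears as a source or a target.
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)

    # in_links[v] lists the pages that link to v (the reversed graph).
    in_links = {n: [] for n in nodes}
    for u, targets in out_links.items():
        for v in targets:
            in_links[v].append(u)

    seeds = set(spam_seeds) & nodes
    if not seeds:
        raise ValueError("at least one spam seed must be in the graph")

    # Teleport distribution: uniform over the spam seeds.
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(teleport)

    for _ in range(iterations):
        nxt = {n: (1.0 - alpha) * teleport[n] for n in nodes}
        for v in nodes:
            sources = in_links[v]
            if not sources:
                continue  # nobody links to v, so its distrust is not passed on
            share = alpha * score[v] / len(sources)
            for u in sources:
                nxt[u] += share  # pages linking to v inherit part of its distrust
        score = nxt
    return score


if __name__ == "__main__":
    # Toy graph (assumed data): b -> a -> spam, and an unrelated pair c -> good.
    graph = {"b": ["a"], "a": ["spam"], "c": ["good"]}
    for page, value in sorted(anti_trustrank(graph, {"spam"}).items()):
        print(page, round(value, 4))
```

In this sketch, pages with no path to a seed keep a score of zero, which is why the choice of seeds matters so much in practice: a seed set that reaches more of the spam region lets the propagation flag more pages.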
Details
- Database :
- OpenAIRE
- Journal :
- Companion Proceedings of the Web Conference 2020
- Accession number :
- edsair.doi...........57b21c4c785150fef79c9847e5fea464
- Full Text :
- https://doi.org/10.1145/3366424.3385773