1. An intrinsic evaluation of the Waterloo spam rankings of the ClueWeb09 and ClueWeb12 datasets.
- Author
- Yılmazel, İbrahim Barış and Arslan, Ahmet
- Subjects
- *SPAM email, *WEBSITES, *DISTRIBUTION (Probability theory), *APPRAISERS
- Abstract
The ClueWeb09 dataset and its successor, the ClueWeb12 dataset, are two of the largest collections of Web pages released by the Text REtrieval Conference (TREC). The ClueWeb datasets were used in various TREC tracks that ran from 2009 to 2017. Each year, approximately 50 new queries are released, and a pool of Web pages is judged against these queries by human assessors as relevant, non-relevant or spam. In this article, a ground truth for binary classification (spam vs non-spam) is constructed from Web pages that are judged as spam or relevant, under the assumption that a Web page judged as relevant for any query cannot be spam. Based on this ground truth, we evaluate the classification performance of the Waterloo spam rankings (Fusion, Britney, GroupX and UK2006), which have traditionally been used to identify and filter spam pages in retrieval systems. The experimental results, in terms of the universal binary classification evaluation measures, suggest that Fusion (with threshold = 11%) is the best for the ClueWeb09 dataset. Analysis of the frequency distributions of relevant/spam documents over spam scores reveals that GroupX is the most powerful at identifying relevant documents, whereas Fusion is the most powerful at identifying spam documents. It is also confirmed that the Fusion spam ranking is less effective on the ClueWeb12 dataset than on the ClueWeb09 dataset. [ABSTRACT FROM AUTHOR]
- Published
- 2021
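The evaluation procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that judgments arrive as hypothetical `(query_id, doc_id, label)` triples with string labels, that each spam ranking maps a document to a percentile score where a lower value means "more likely spam", and that a document is classified as spam when its score falls below the threshold. The function names, label strings and sample data are all made up for illustration.

```python
from collections import defaultdict


def build_ground_truth(judgments):
    """Map doc_id -> True (spam) / False (non-spam).

    Assumed input: (query_id, doc_id, label) triples with label in
    {"relevant", "non-relevant", "spam"} (hypothetical format).
    """
    labels = defaultdict(set)
    for _query_id, doc_id, label in judgments:
        labels[doc_id].add(label)

    truth = {}
    for doc_id, seen in labels.items():
        if "relevant" in seen:
            # Relevant for any query => treated as non-spam.
            truth[doc_id] = False
        elif "spam" in seen:
            truth[doc_id] = True
        # Documents judged only non-relevant are left out of the ground truth.
    return truth


def evaluate(truth, percentile, threshold):
    """Score a spam ranking at one threshold (assumed rule: score < threshold => spam)."""
    tp = fp = tn = fn = 0
    for doc_id, is_spam in truth.items():
        if doc_id not in percentile:
            continue  # no score available for this document
        predicted_spam = percentile[doc_id] < threshold
        if predicted_spam and is_spam:
            tp += 1
        elif predicted_spam and not is_spam:
            fp += 1
        elif not predicted_spam and is_spam:
            fn += 1
        else:
            tn += 1

    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn,
            "precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Made-up judgments and percentile scores, for illustration only.
judgments = [
    (1, "d1", "relevant"), (1, "d2", "spam"), (2, "d2", "spam"),
    (2, "d3", "non-relevant"), (1, "d4", "spam"), (2, "d4", "relevant"),
    (2, "d5", "spam"), (1, "d6", "relevant"),
]
scores = {"d1": 80, "d2": 5, "d3": 50, "d4": 30, "d5": 40, "d6": 8}

truth = build_ground_truth(judgments)
result = evaluate(truth, scores, threshold=11)
```

Sweeping `threshold` over the score range and comparing the resulting measures per ranking is one way to locate an operating point such as the 11% reported for Fusion. The specific measures used in the article may differ from the ones computed here.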