351. Unsupervised Spam Detection Based on String Alienness Measures.
- Author
- Carbonell, Jaime G., Siekmann, Jörg, Corruble, Vincent, Suzuki, Einoshin, Narisawa, Kazuyuki, Bannai, Hideo, Hatano, Kohei, and Takeda, Masayuki
- Abstract
We propose an unsupervised method for detecting spam documents from a given set of documents, based on equivalence relations on strings. We give three measures for quantifying the alienness (i.e. how different they are from others) of substrings within the documents. A document is then classified as spam if it contains a substring that is in an equivalence class with a high degree of alienness. The proposed method is unsupervised, language independent, and scalable. Computational experiments conducted on data collected from Japanese web forums show that the method successfully discovers spams.
- Published
- 2007