A cosine based validation measure for Document Clustering

Authors :
Balbi, Simona
Spano, Maria
Misuraca, Michelangelo
Mayaffre, D., Poudat C., Vanni L., Magri V., Follette P.
Publication Year :
2016
Publisher :
Presses de Fac Imprimeur, France, 2016.

Abstract

Document Clustering is the application of cluster analysis methods to very large document collections. It aims to organize a large number of unlabelled documents into a smaller number of meaningful, coherent clusters whose members are similar in content. One of the main unsolved problems in the clustering literature is the lack of a reliable methodology for evaluating results, although a wide variety of validation measures have been proposed. These measures are often unsatisfactory even on numerical data, and they perform worse still in Document Clustering. This paper proposes a new validation measure. After introducing the most common approaches to Document Clustering, we focus on Spherical K-means because of its close connection with the Vector Space Model, which is typical of Information Retrieval. Since Spherical K-means adopts a cosine-based similarity measure, we propose a validation measure based on the same criterion. The effectiveness of the new measure is shown in a comparative study involving 13 different corpora (commonly used in the literature to compare proposals) and 15 validation measures.
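
The abstract does not give the formula of the proposed measure, so the following is only an illustrative sketch of the general idea it describes: Spherical K-means assigns unit-length document vectors to the centroid with the highest cosine similarity, and a clustering can then be scored with that same cosine criterion. The function names (`normalize_rows`, `spherical_kmeans`, `mean_cosine_cohesion`) and the cohesion score are assumptions made for illustration, not the authors' measure.

```python
import numpy as np

def normalize_rows(X):
    """Scale each row to unit Euclidean length (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return X / norms

def spherical_kmeans(X, k, n_iter=50, seed=0):
    """Basic Spherical K-means: cosine-based assignment, unit-length centroids."""
    rng = np.random.default_rng(seed)
    X = normalize_rows(np.asarray(X, dtype=float))
    # Initialise centroids with k distinct documents chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # On unit vectors the dot product equals the cosine similarity.
        sims = X @ centroids.T
        new_labels = sims.argmax(axis=1)
        for j in range(k):
            members = X[new_labels == j]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[j] = members.sum(axis=0)
        centroids = normalize_rows(centroids)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

def mean_cosine_cohesion(X, labels, centroids):
    """Hypothetical score: mean cosine similarity of each document to its own centroid."""
    X = normalize_rows(np.asarray(X, dtype=float))
    return float(np.mean(np.sum(X * centroids[labels], axis=1)))
```

For example, with a small term-frequency matrix `X` (rows are documents), `labels, centroids = spherical_kmeans(X, k=2)` followed by `mean_cosine_cohesion(X, labels, centroids)` returns a value that approaches 1 as documents align more closely with their cluster centroids. A complete validation measure would normally also account for separation between clusters, which this sketch leaves out.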

Details

Language :
English
Database :
OpenAIRE
Accession number :
edsair.od......3730..c25ad0c1dd3f3cd2599b48a577eeafb6