1. TEXTUAL-BASED CLUSTERING OF WEB DOCUMENTS.
- Author
- Brzeminski, Pawel and Pedrycz, Witold
- Subjects
- *WEBSITES, *ELECTRONIC records, *ELECTRONIC information resources, *RECORDS management, *HTML (Document markup language), *ALGORITHMS
- Abstract
- In our study, we presented an effective method for clustering Web pages. From flat HTML files we extracted keywords, formed feature vectors to represent the Web pages, and supplied them to a clustering method. We used the Fuzzy C-Means (FCM) clustering algorithm. We demonstrated an organized and schematic manner of data collection. Various categories of Web pages were retrieved from ODP (Open Directory Project) in order to create our datasets. The results of clustering showed that the method performs well for all datasets. Finally, we presented a comprehensive experimental study examining the behavior of the algorithm for different input parameters, the internal structure of the datasets, and classification experiments. [ABSTRACT FROM AUTHOR]
- Published
- 2004
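
The abstract names the pipeline (keyword feature vectors fed to Fuzzy C-Means) but gives no detail, so the following is only a minimal sketch of the textbook FCM update rules, not the authors' implementation. The function name `fuzzy_c_means`, the fuzzifier default `m=2.0`, the Euclidean distance, and the toy keyword-count vectors are all assumptions made for illustration; none of them come from the record.

```python
import math
import random


def fuzzy_c_means(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Textbook FCM (illustrative sketch; hypothetical helper).

    X: list of equal-length feature vectors, c: number of clusters,
    m > 1: fuzzifier. Returns (U, V): memberships and prototypes.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Random initial partition matrix; each row is normalized to sum to 1.
    U = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        s = sum(row)
        U.append([u / s for u in row])
    V = []
    for _ in range(max_iter):
        # Prototype update: mean of the data weighted by u_ik ** m.
        V = []
        for k in range(c):
            w = [U[i][k] ** m for i in range(n)]
            total = sum(w)
            V.append([sum(w[i] * X[i][j] for i in range(n)) / total
                      for j in range(d)])
        # Membership update: u_ik proportional to d_ik ** (-2 / (m - 1)).
        U_new = []
        for i in range(n):
            inv = [max(math.dist(X[i], V[k]), 1e-12) ** (-2.0 / (m - 1.0))
                   for k in range(c)]
            s = sum(inv)
            U_new.append([v / s for v in inv])
        change = max(abs(U_new[i][k] - U[i][k])
                     for i in range(n) for k in range(c))
        U = U_new
        if change < tol:
            break
    return U, V


# Hypothetical keyword counts for four pages (columns are made-up keywords,
# e.g. "football", "league", "recipe", "oven"); not data from the study.
pages = [
    [3, 2, 0, 0],
    [2, 3, 0, 1],
    [0, 0, 3, 2],
    [0, 1, 2, 3],
]
U, V = fuzzy_c_means(pages, c=2)
# Harden the fuzzy partition by taking the highest membership per page.
labels = [row.index(max(row)) for row in U]
```

Each row of `U` holds one page's degrees of membership in the clusters and sums to 1. Taking the largest entry per row gives a hard assignment, which is one simple way to compare a fuzzy partition against known directory categories.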