A Comparative Study on Representing Units in Chinese Text Clustering.

Authors :
Lang, Jérôme
Fangzhen Lin
Ju Wang
Hongjun, Wang
Yu Shiwen
Lv Xueqiang
Shi Shuicai
Xiao Shibin
Source :
Knowledge Science, Engineering & Management; 2006, p466-476, 11p
Publication Year :
2006

Abstract

Words and n-grams are commonly used representing units for Chinese text and have proved to be good features for Chinese text categorization and information retrieval. Their effectiveness for Chinese text clustering, however, has not yet been established. This paper presents a comparative study of representing units in Chinese text clustering. Using the K-means algorithm, we evaluated several representing units, including Chinese character n-gram features, word features, and their combinations. In our experiments, Chinese word features, Chinese character unigram features, and bigram features were the most effective, and combining features did not improve the results. Detailed experimental results on several public Chinese text categorization datasets are provided in the paper. Keywords: Chinese text clustering; n-gram feature; bigram feature; word feature. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISBNs :
9783540370338
Database :
Complementary Index
Journal :
Knowledge Science, Engineering & Management
Publication Type :
Book
Accession number :
32904746
Full Text :
https://doi.org/10.1007/11811220_39