N-grams based feature selection and text representation for Chinese Text Classification

Authors :
Zhihua Wei
Duoqian Miao
Jean-Hugues Chauchat
Rui Zhao
Wen Li
Source :
International Journal of Computational Intelligence Systems, Vol 2, Iss 4 (2009)
Publication Year :
2009
Publisher :
Springer, 2009.

Abstract

This paper discusses text representation and feature selection strategies for Chinese text classification based on n-grams. A two-step feature selection strategy is proposed that combines preprocessing within classes with feature selection among classes. Four feature selection methods and three text representation weights are compared in exhaustive experiments. Both a C-SVC classifier and a Naive Bayes classifier are used to assess the results. All experiments are performed on the Chinese corpus TanCorpV1.0, which includes more than 14,000 texts divided into 12 classes. The experiments concern: (1) a performance comparison among feature selection strategies: absolute text frequency, relative text frequency, absolute n-gram frequency, and relative n-gram frequency; (2) a comparison of the sparseness and feature correlation of the “text by feature” matrices produced by the four feature selection methods; (3) a performance comparison among three term weights: 0/1 logical value, n-gram frequency numeric value (TF), and TF*IDF value.
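
The abstract does not give the paper's exact definitions, so the following is only a minimal sketch of the general idea: character n-grams are extracted, features are filtered by absolute text frequency, and one text is represented with the three term weights. The bigram size, the threshold of 2 texts, the `tf * log(N / df)` form of TF*IDF, the toy strings, and all function names are assumptions made for illustration, not the authors' settings.

```python
import math
from collections import Counter

def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def text_frequency(docs, n=2):
    """Count, for each n-gram, the number of texts that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(char_ngrams(doc, n)))
    return df

def weight_vectors(doc, features, df, n_docs, n=2):
    """Return the 0/1, TF, and TF*IDF vectors of one text over `features`."""
    tf = Counter(char_ngrams(doc, n))
    binary = [1 if tf[f] > 0 else 0 for f in features]
    counts = [tf[f] for f in features]
    # Assumed weighting: tf * log(N / df); the paper may use another variant.
    tfidf = [tf[f] * math.log(n_docs / df[f]) for f in features]
    return binary, counts, tfidf

# Toy corpus (hypothetical strings, not taken from TanCorpV1.0).
docs = ["中文文本分类", "文本表示", "特征选择"]
df = text_frequency(docs)

# Absolute text frequency selection: keep n-grams found in at least 2 texts.
features = sorted(g for g, c in df.items() if c >= 2)

binary, counts, tfidf = weight_vectors(docs[0], features, df, len(docs))
```

Stacking such vectors for every text gives a "text by feature" matrix, whose sparseness depends on how strict the selection threshold is.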

Details

Language :
English
ISSN :
18756883
Volume :
2
Issue :
4
Database :
Directory of Open Access Journals
Journal :
International Journal of Computational Intelligence Systems
Publication Type :
Academic Journal
Accession number :
edsdoj.14f9d9f3fd9b4aee9ca27ba71ca5a373
Document Type :
article
Full Text :
https://doi.org/10.2991/ijcis.2009.2.4.5