Evaluation of Feature Selection Approaches for Urdu Text Categorization

Authors :
Tehseen Zia
Qaiser Abbas
Muhammad Pervez Akhtar
Source :
International Journal of Intelligent Systems and Applications. 7:33-40
Publication Year :
2015
Publisher :
MECS Publisher, 2015.

Abstract

Efficient feature selection is an important phase in designing an effective text categorization system. Various feature selection methods have been proposed, and they select dissimilar feature sets. It is therefore often essential to evaluate which method is more effective for a given task and what feature set size is an effective model selection choice. The aim of this paper is to answer these questions for the design of an Urdu text categorization system. Five widely used feature selection methods were examined using six well-known classification algorithms: naive Bayes (NB), k-nearest neighbor (KNN), support vector machines (SVM) with linear, polynomial, and radial basis kernels, and a decision tree (J48). The study was conducted over two test collections: the EMILLE collection and a naive collection. We observed that three feature selection methods, namely information gain, Chi statistics, and symmetrical uncertainty, performed uniformly in most cases, if not all. Moreover, we found that no single feature selection method is best for all classifiers. While gain ratio outperformed the others for naive Bayes and J48, information gain showed top performance for KNN and for SVM with polynomial and radial basis kernels. Overall, linear SVM with any of the feature selection methods, including information gain, Chi statistics, or symmetrical uncertainty, turned out to be the first choice among all combinations of classifiers and feature selection methods on the moderate-sized naive collection. On the other hand, naive Bayes with any of the feature selection methods showed its advantage on the small-sized EMILLE corpus.
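
As an illustration of one of the feature selection methods named in the abstract, the following is a minimal sketch of information gain scoring for binary term-presence features. It is not the authors' implementation: the function names (`entropy`, `information_gain`, `top_k_terms`), the toy documents, and the class labels are all hypothetical, and the paper's actual tooling and preprocessing are not described in this record.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """Information gain of the binary feature 'term occurs in document'."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    # Class entropy remaining after splitting on presence/absence of the term.
    conditional = ((len(with_term) / n) * entropy(with_term)
                   + (len(without_term) / n) * entropy(without_term))
    return entropy(labels) - conditional


def top_k_terms(docs, labels, k):
    """Rank the vocabulary by information gain and keep the k best terms."""
    vocab = sorted(set().union(*docs))
    scored = [(information_gain(docs, labels, t), t) for t in vocab]
    # Highest score first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for _, t in scored[:k]]


# Hypothetical toy corpus: each document is a set of tokens.
docs = [
    {"match", "team", "win"},
    {"team", "score"},
    {"vote", "party"},
    {"party", "election", "win"},
]
labels = ["sports", "sports", "politics", "politics"]

selected = top_k_terms(docs, labels, k=2)
print(selected)
```

In this toy corpus, "team" and "party" each separate the two classes perfectly, while "win" appears once in each class and therefore carries no information. The feature set size `k` is the quantity the abstract refers to as a model selection choice.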

Details

ISSN :
2074-9058 and 2074-904X
Volume :
7
Database :
OpenAIRE
Journal :
International Journal of Intelligent Systems and Applications
Accession number :
edsair.doi...........402fec722035ef31a91faa23a33750ac