Evaluation of Feature Selection Approaches for Urdu Text Categorization
- Source :
- International Journal of Intelligent Systems and Applications. 7:33-40
- Publication Year :
- 2015
- Publisher :
- MECS Publisher, 2015.
Abstract
- Efficient feature selection is an important phase in designing an effective text categorization system. Various feature selection methods have been proposed, each selecting a different feature set. It is often essential to evaluate which method is more effective for a given task and what feature set size is an effective model selection choice. The aim of this paper is to answer these questions for the design of an Urdu text categorization system. Five widely used feature selection methods were examined using six well-known classification algorithms: naive Bayes (NB), k-nearest neighbor (KNN), support vector machines (SVM) with linear, polynomial, and radial basis kernels, and a decision tree (i.e., J48). The study was conducted over two test collections: the EMILLE collection and a naive collection. We observed that three feature selection methods, i.e., information gain, Chi statistics, and symmetrical uncertainty, performed uniformly in most, if not all, cases. Moreover, we found that no single feature selection method is best for all classifiers. While gain ratio outperformed the others for naive Bayes and J48, information gain showed top performance for KNN and for SVM with polynomial and radial basis kernels. Overall, linear SVM combined with information gain, Chi statistics, or symmetrical uncertainty turned out to be the first choice among all combinations of classifiers and feature selection methods on the moderate-sized naive collection. On the other hand, naive Bayes with any of the feature selection methods showed an advantage on the small EMILLE corpus.
- Subjects :
- Control and Optimization
Computer Networks and Communications
Computer science
business.industry
Model selection
Decision tree
Pattern recognition
Feature selection
Computer Science Applications
Human-Computer Interaction
Support vector machine
Statistical classification
ComputingMethodologies_PATTERNRECOGNITION
C4.5 algorithm
Artificial Intelligence
Feature (computer vision)
Modeling and Simulation
Signal Processing
Information gain ratio
Artificial intelligence
business
Details
- ISSN :
- 2074-9058 and 2074-904X
- Volume :
- 7
- Database :
- OpenAIRE
- Journal :
- International Journal of Intelligent Systems and Applications
- Accession number :
- edsair.doi...........402fec722035ef31a91faa23a33750ac