1. TopicStriKer: A topic kernels-powered approach for text classification
- Author
- Nikhil V. Chandran, V.S. Anoop, and S. Asharaf
- Subjects
- Topic modeling, String kernels, Text classification, String embedding, Topic sequence, Technology
- Abstract
- Topic models are unsupervised machine learning techniques that output clusters of “topics”, each represented as co-occurring words with their associated probability distributions. Topic modeling algorithms find latent themes in large document collections by understanding their context. String kernels, by contrast, are supervised machine-learning techniques that quantify string similarities without explicit string encoding. We propose TopicStriKer, a model that combines the advantages of unsupervised topic modeling with supervised string kernels for text classification tasks. The co-occurring words per topic and the topic proportions per document obtained from the topic model are used to reduce the document corpus to a topic-word sequence. This reduced representation is then used for text classification with the aid of string kernels, significantly improving accuracy and reducing training time. In our experiments, bag-of-words kernel-based string embeddings produced by the proposed algorithm outperform traditional text classification approaches. This work extensively compares string kernels with topic modeling on various performance metrics to establish our findings.
- Published
- 2023
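
The pipeline summarized in the abstract (topic model output, reduction to a topic-word sequence, then a bag-of-words kernel for classification) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the exact reduction rule or classifier, so the sketch assumes that a document keeps only the tokens that are top words of its dominant topic, and it swaps in a 1-nearest-neighbour rule over a cosine-normalized bag-of-words kernel in place of a kernel classifier such as an SVM. The names `TOPIC_WORDS`, `reduce_to_topic_sequence`, `bow_kernel`, `normalized_kernel`, and `predict`, as well as all toy data, are hypothetical.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy output of a topic model (values invented for illustration):
# the top co-occurring words for each topic.
TOPIC_WORDS = {
    0: ["kernel", "string", "svm", "classifier"],
    1: ["goal", "match", "team", "league"],
}


def reduce_to_topic_sequence(tokens, topic_proportions, top_topics=1):
    """Keep only tokens that are top words of the document's dominant topics.

    Assumption: this is one plausible reading of "reduce the corpus to a
    topic-word sequence"; the paper's actual rule may differ.
    """
    dominant = sorted(topic_proportions, key=topic_proportions.get, reverse=True)
    keep = {w for t in dominant[:top_topics] for w in TOPIC_WORDS[t]}
    return [tok for tok in tokens if tok in keep]


def bow_kernel(seq_a, seq_b):
    """Bag-of-words kernel: dot product of the two word-count vectors."""
    ca, cb = Counter(seq_a), Counter(seq_b)
    return sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())


def normalized_kernel(seq_a, seq_b):
    """Cosine-normalized kernel value; 0.0 if either sequence is empty."""
    denom = sqrt(bow_kernel(seq_a, seq_a) * bow_kernel(seq_b, seq_b))
    return bow_kernel(seq_a, seq_b) / denom if denom else 0.0


def predict(train, query_seq):
    """1-nearest-neighbour under kernel similarity (stand-in for a kernel SVM)."""
    return max(train, key=lambda item: normalized_kernel(item[0], query_seq))[1]


# Toy labelled documents: (tokens, topic proportions per document, label).
docs = [
    ("the string kernel feeds the svm classifier".split(), {0: 0.9, 1: 0.1}, "ml"),
    ("the team scored a late goal in the match".split(), {0: 0.2, 1: 0.8}, "sport"),
]
train = [(reduce_to_topic_sequence(toks, props), label) for toks, props, label in docs]

query = reduce_to_topic_sequence(
    "a kernel classifier compares each string".split(), {0: 0.7, 1: 0.3}
)
predicted = predict(train, query)
print(query, predicted)
```

Because the kernel is computed on the reduced sequences rather than on full documents, the count vectors are much shorter, which is consistent with the training-time reduction the abstract reports. In practice the kernel matrix would more likely be passed to a kernel classifier than to the nearest-neighbour rule used here.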