Text Categorization in Non-linear Semantic Space.

Authors :: Carbonell, Jaime G.
Siekmann, Jörg
Basili, Roberto
Pazienza, Maria Teresa
Biancalana, Claudio
Micarelli, Alessandro
Source :: AI*IA 2007: Artificial Intelligence & Human-Oriented Computing; 2007, p749-756, 8p
Publication Year :: 2007
Abstract :: Automatic Text Categorization (TC) is a complex and useful task for many natural language applications, and is usually performed by using a set of manually classified documents, i.e. a training collection. Term-based representation of documents has found widespread use in TC. However, one of the main shortcomings of such methods is that they largely disregard lexical semantics and, as a consequence, are not sufficiently robust with respect to variations in word usage. In this paper we design, implement, and evaluate a new text classification technique. Our main idea consists of three steps: finding a series of projections of the training data by using a new, modified LSI algorithm; projecting all training instances to the low-dimensional subspace found in the previous step; and finally inducing a binary search on the projected low-dimensional data. Our conclusion is that our approach, while simple and efficient, achieves classification accuracy comparable to that of SVM. [ABSTRACT FROM AUTHOR]