1. An Efficient and Scalable MetaFeature-based Document Classification Approach based on Massively Parallel Computing
- Author
Wisllay M. V. dos Santos, Wellington Santos Martins, Sergio Canuto, Marcos André Gonçalves, and Thierson Couto Rosa
- Subjects
Speedup, Language identification, Computer science, Document classification, Sentiment analysis, Recommender system, Machine learning, k-nearest neighbors algorithm, Scalability, Artificial intelligence, Data mining, Massively parallel
- Abstract
The unprecedented growth of available data has stimulated the development of new methods for organizing this data and extracting useful knowledge from it. Automatic Document Classification (ADC) is one such method: it uses machine learning techniques to build models capable of automatically associating documents with well-defined semantic classes. ADC is the basis of many important applications, such as language identification, sentiment analysis, recommender systems, and spam filtering, among others. Recently, the use of meta-features has been shown to substantially improve the effectiveness of ADC algorithms. In particular, meta-features that combine local information (through kNN-based features) with global information (through category centroids) have produced promising results. However, generating these meta-features is very costly in terms of both memory consumption and runtime, since the kNN algorithm must be called constantly. We take advantage of the current manycore GPU architecture and present a massively parallel version of the kNN algorithm for high-dimensional, sparse datasets (which is the case for ADC). Our experimental results show that we can obtain speedups of up to 15x while reducing memory consumption by more than 5000x when compared to a state-of-the-art parallel baseline. This opens up the possibility of applying meta-feature-based classification to large document collections that would otherwise take too much time or require an expensive computational platform.
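To make the idea of combining local and global information concrete, the following is a minimal sequential sketch in Python. It is not the paper's GPU implementation, and the exact meta-feature definition is an assumption: for each class, the sketch emits the cosine similarities of the k most similar training documents of that class (local, kNN-based) followed by the cosine similarity to the class centroid (global). The function names `cosine`, `centroids`, and `meta_features`, and the representation of sparse documents as `{term: weight}` dictionaries, are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector only
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroids(train):
    """Average the sparse vectors of each class; train is a list of (vector, label)."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for vec, label in train:
        counts[label] += 1
        for term, weight in vec.items():
            sums[label][term] += weight
    return {c: {t: w / counts[c] for t, w in s.items()} for c, s in sums.items()}


def meta_features(doc, train, cents, k=2):
    """Assumed layout: per class (sorted by label), k neighbor similarities
    followed by one centroid similarity."""
    feats = []
    for c in sorted(cents):
        # Local information: the k highest similarities within class c.
        sims = sorted((cosine(doc, v) for v, y in train if y == c), reverse=True)[:k]
        sims += [0.0] * (k - len(sims))  # pad when the class has fewer than k documents
        feats.extend(sims)
        # Global information: similarity to the class centroid.
        feats.append(cosine(doc, cents[c]))
    return feats


train = [
    ({"gpu": 1.0}, "hw"),
    ({"gpu": 1.0, "cuda": 1.0}, "hw"),
    ({"poem": 1.0}, "lit"),
]
feats = meta_features({"gpu": 1.0}, train, centroids(train), k=2)
```

Every call to `meta_features` scans the whole training set, which illustrates why repeated kNN calls dominate the cost and why a massively parallel kNN is attractive for this task.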
- Published
- 2015