LDA filter: A Latent Dirichlet Allocation preprocess method for Weka
- Author
-
A. Seara Vieira, Lourdes Borrajo, P. Celard, and Eva Lorenzo Iglesias
- Subjects
Latent Dirichlet allocation, Text mining, Machine learning, Statistical classification, Support vector machine, Naive Bayes classifier, Bag-of-words model, Preprocessing, Cluster analysis, Pattern recognition, Bayes theorem, Information storage and retrieval, Artificial intelligence, Semantics, Vocabulary, Algorithms, Computer science - Abstract
This work presents an alternative method of representing documents based on LDA (Latent Dirichlet Allocation) and examines how it affects classification algorithms compared to a common text representation. LDA assumes that each document deals with a set of predefined topics, which are distributions over an entire vocabulary. Our main objective is to use the probability of a document belonging to each topic to implement a new text representation model. The proposed technique is deployed as a new filter extending the Weka software. To demonstrate its performance, the filter is tested with different classifiers, such as a Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), and Naive Bayes, on several document corpora (OHSUMED, Reuters-21578, 20Newsgroup, Yahoo! Answers, YELP Polarity, and TREC Genomics 2015), and compared with the Bag of Words (BoW) representation technique. Results suggest that applying the proposed filter achieves accuracy similar to BoW while greatly improving classification processing times. Funding: Xunta de Galicia | Ref. ED431C2018/55
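The core idea in the abstract — replacing the wide Bag-of-Words vector with each document's probability distribution over LDA topics — can be sketched outside Weka. The paper's actual implementation is a Java filter for Weka; the snippet below is only an illustrative analogue using scikit-learn, with a toy corpus and topic count chosen for the example:

```python
# Illustrative sketch (not the paper's Weka filter): represent documents
# by their LDA topic probabilities instead of raw Bag-of-Words counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus standing in for a real collection such as OHSUMED.
docs = [
    "the heart muscle pumps blood through the body",
    "machine learning algorithms classify text documents",
    "support vector machines separate classes with a margin",
    "cardiac muscle physiology regulates blood flow",
]

# Step 1: the baseline Bag-of-Words representation (one column per term).
bow = CountVectorizer()
X_bow = bow.fit_transform(docs)

# Step 2: LDA compresses each document into a distribution over k topics.
k = 2  # number of topics, an assumption for this toy example
lda = LatentDirichletAllocation(n_components=k, random_state=0)
X_topics = lda.fit_transform(X_bow)  # shape: (n_docs, k)

# Each row is a probability vector over topics (it sums to 1). This short,
# dense vector replaces the much wider BoW vector as the classifier input,
# which is why classification is faster at similar accuracy.
print(X_topics.shape)
```

The dimensionality drop is the key point: a BoW vector has one component per vocabulary term (often tens of thousands), while the topic vector has only `k` components, so downstream classifiers such as SVM, k-NN, or Naive Bayes train and predict on far fewer features.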
- Published
- 2020