
More influence means less work

Authors:
Anja Pilz
Mirwaes Wahabzada
Kristian Kersting
Christian Bauckhage
Source:
CIKM
Publication Year:
2011
Publisher:
ACM, 2011.

Abstract

There have recently been considerable advances in fast inference for (online) latent Dirichlet allocation (LDA). While it is widely recognized that the scheduling of documents in stochastic optimization, and in turn in LDA, may have significant consequences, this issue remains largely unexplored. Instead, practitioners schedule documents essentially uniformly at random, due perhaps to ease of implementation and to the lack of clear guidelines on scheduling the documents. In this work, we address this issue and propose to schedule documents that exert a disproportionately large influence on the topics of the corpus for an update before less influential ones. More precisely, we justify sampling documents randomly, biased towards those with higher norms, to form mini-batches. On several real-world datasets, including 3M articles from Wikipedia and 8M from PubMed, we demonstrate that the resulting influence-scheduled LDA can handily analyze massive document collections and find topic models as good as or better than those found with online LDA, often at a fraction of the time.
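The norm-biased mini-batch sampling described in the abstract can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: documents are toy bag-of-words count vectors, and each document's probability of entering a mini-batch is made proportional to its L2 norm, so high-norm (more influential) documents are scheduled earlier on average.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: 6 documents x 5 vocabulary terms (word counts).
docs = np.array([
    [3, 0, 1, 0, 2],
    [0, 1, 0, 0, 0],
    [5, 2, 4, 1, 3],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 0],
    [2, 4, 2, 3, 1],
], dtype=float)

# L2 norm of each document's count vector.
norms = np.linalg.norm(docs, axis=1)

# Sampling probabilities proportional to the norms: documents with
# larger norms are more likely to be drawn into the next mini-batch.
probs = norms / norms.sum()

# Draw one mini-batch of document indices without replacement.
batch_size = 3
mini_batch = rng.choice(len(docs), size=batch_size, replace=False, p=probs)
print(sorted(mini_batch.tolist()))
```

In an online LDA loop, each such mini-batch would then be fed to the usual variational update; only the sampling distribution over documents changes relative to the uniform schedule.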

Details

Database:
OpenAIRE
Journal:
Proceedings of the 20th ACM international conference on Information and knowledge management
Accession number:
edsair.doi...........46bb4bd3a1c4d1f7399f4d1ee59e1256