Graph Attention Topic Modeling Network
- Authors
- Yuanfang Guo, Di Jin, Chuan Wang, Xiaochun Cao, Fan Wu, Junhua Gu, and Liang Yang
- Subjects
- Independent and identically distributed random variables, Topic model, Theoretical computer science, Word embedding, Computer science, Inference, Latent variable, Overfitting, Latent Dirichlet allocation, Dirichlet distribution, Stochastic block model, Probabilistic latent semantic analysis, Document classification, Graph (abstract data type), Topological graph theory, Latent semantic indexing
- Abstract
Existing topic modeling approaches suffer from several issues, including the overfitting of Probabilistic Latent Semantic Indexing (pLSI), the failure of Latent Dirichlet Allocation (LDA) to capture rich correlations among topics, and high inference complexity. In this paper, we provide a new method to overcome the overfitting issue of pLSI by using amortized inference with word embeddings as input, instead of the Dirichlet prior in LDA. For generative topic models, the large number of free latent variables is the root cause of overfitting. To reduce the number of parameters, amortized inference replaces the inference of latent variables with a function that has shared (amortized) learnable parameters. The number of shared parameters is fixed and independent of the scale of the corpus. To overcome the limitation that amortized inference applies only to independent and identically distributed (i.i.d.) data, a novel graph neural network, the Graph Attention TOpic Network (GATON), is proposed to model the topic structure of non-i.i.d. documents, based on two observations. First, pLSI can be interpreted as a stochastic block model (SBM) on a specific bipartite graph. Second, the graph attention network (GAT) can be explained as semi-amortized inference of the SBM, which relaxes the i.i.d. data assumption of vanilla amortized inference. GATON provides a novel, graph-convolution-based scheme to integrate word similarity and word co-occurrence structure. Specifically, the bag-of-words document representation is modeled as a bipartite graph topology. Meanwhile, word embeddings, which capture word similarity, are modeled as the attributes of the word nodes, and term frequency vectors are adopted as the attributes of the document nodes. Based on the weighted (attention) graph convolution operation, word co-occurrence structure and word similarity patterns are seamlessly integrated for topic identification. Extensive experiments demonstrate that the effectiveness of GATON on topic identification not only benefits document classification but also significantly refines the input word embeddings.
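To make the scheme concrete, the following is a minimal, hypothetical PyTorch sketch of one attention-weighted convolution over the bipartite document-word graph described in the abstract: word embeddings attach to word nodes, term-frequency rows attach to document nodes, and nonzero bag-of-words entries define the edges. The class name `BipartiteGATLayer` and all dimensions are illustrative assumptions, not the authors' implementation; multi-head attention and the word-side update are omitted.

```python
# Hypothetical sketch (not the paper's code) of an attention-weighted
# graph convolution on a bipartite document-word graph.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BipartiteGATLayer(nn.Module):
    """Propagates word-node features to document nodes over
    co-occurrence edges, weighted by learned attention."""
    def __init__(self, word_dim, doc_dim, out_dim):
        super().__init__()
        self.w_word = nn.Linear(word_dim, out_dim, bias=False)
        self.w_doc = nn.Linear(doc_dim, out_dim, bias=False)
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, doc_x, word_x, edges):
        # edges: (E, 2) long tensor of (doc_index, word_index) pairs,
        # one per nonzero bag-of-words entry (the bipartite topology).
        d, w = edges[:, 0], edges[:, 1]
        hw = self.w_word(word_x)   # projected word embeddings
        hd = self.w_doc(doc_x)     # projected term-frequency vectors
        # Attention score for each document-word edge.
        e = F.leaky_relu(self.attn(torch.cat([hd[d], hw[w]], dim=-1))).squeeze(-1)
        # Softmax-normalize scores over each document's incident edges.
        alpha = torch.zeros_like(e)
        for doc in d.unique():
            mask = d == doc
            alpha[mask] = F.softmax(e[mask], dim=0)
        # Aggregate attended word features into document representations.
        out = torch.zeros_like(hd)
        out.index_add_(0, d, alpha.unsqueeze(-1) * hw[w])
        return out

# Toy usage: 2 documents over a 4-word vocabulary.
tf = torch.tensor([[2., 1., 0., 0.],
                   [0., 1., 1., 3.]])   # term frequencies (doc attributes)
emb = torch.randn(4, 8)                 # pretrained word embeddings
edges = tf.nonzero()                    # bipartite doc-word edges
layer = BipartiteGATLayer(word_dim=8, doc_dim=4, out_dim=16)
doc_repr = layer(tf, emb, edges)        # (2, 16) document representations
```

In a topic-modeling setting, `out_dim` would correspond to the number of topics, so that the normalized attention weights on a document's edges play the role of per-document word-topic responsibilities; that mapping is an interpretation of the abstract, not a detail it states.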
- Published
- 2020